Iris
Oh Iris
For some reason I never got a chance to really dive deep into compilers. Outside of poking around open-source LLVM and a few other compilers, I never had a formal class on them. I think compilers are really interesting; I'm no systems engineer or compiler engineer, but I find all of it fascinating.

One day I was sitting at my desk wondering why I never use my M3's GPU. It actually would have been helpful at points, instead of SSHing into a cluster or hunting for a machine with an NVIDIA GPU. Plus I think the name is cool; "Metal" is just a really good name choice. So I figured I'd do something similar to Triton and write my own for Metal. As of writing this I've got a solid framework implemented, but there are a few more optimizations and performance wins I'd like to get out of it.

I'm only a few months away from graduation, and this isn't a capstone or anything, just something I'd like to do for fun. This is also my first time properly writing a post about a project, outside of the streams I do and the occasional LinkedIn post. I do stream most of my projects on YouTube, just to have an archive and keep myself motivated; it's pretty fun, but I'm getting sidetracked. The main motivator for this was just to prototype research models quicker on my laptop when I don't have a cluster to use.
The runtime
Starting with what's actually there: an end-to-end Metal compute runtime with dynamic kernel codegen. You can write a kernel much like you'd write Triton code.
@kernel(
    param_types={
        "A": "device const float*",
        "B": "device const float*",
        "C": "device float*",
        "D": "uint",  # depth (z-dimension)
        "H": "uint",  # height (y)
        "W": "uint",  # width (x)
    }
)
def add_3d(A, B, C, D, H, W):
    x = metal.thread_id_x()
    y = metal.thread_id_y()
    z = metal.thread_id_z()
    if x < W and y < H and z < D:
        idx = (z * H * W) + (y * W) + x
        C[idx] = A[idx] + B[idx]
The runtime supports elementwise operators plus 2D and 3D kernels.
There are persistent buffers, JIT compilation, and AST -> MSL codegen. I wanted this post to pinpoint what I think is a decent milestone in the runtime. Next steps are to write a full IR for kernel fusion and tiling.
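To give a flavor of what AST -> MSL codegen involves, here's a heavily simplified, hypothetical sketch (not the runtime's actual implementation): walk the Python AST of a kernel body and emit the corresponding MSL-ish statements, mapping `metal.thread_id_x()` onto a thread-position variable.

```python
import ast
import textwrap

# Hypothetical, simplified AST -> MSL translation sketch.
# The real codegen handles far more node types, declarations, etc.

THREAD_IDS = {
    "thread_id_x": "tid.x",
    "thread_id_y": "tid.y",
    "thread_id_z": "tid.z",
}

def expr_to_msl(node: ast.expr) -> str:
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        # metal.thread_id_x() -> tid.x
        return THREAD_IDS[node.func.attr]
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.BinOp):
        op = {ast.Add: "+", ast.Mult: "*"}[type(node.op)]
        return f"({expr_to_msl(node.left)} {op} {expr_to_msl(node.right)})"
    if isinstance(node, ast.Subscript):
        return f"{expr_to_msl(node.value)}[{expr_to_msl(node.slice)}]"
    raise NotImplementedError(ast.dump(node))

def stmt_to_msl(node: ast.stmt) -> str:
    if isinstance(node, ast.Assign):
        # Naive: emit a plain assignment (real codegen also handles
        # declarations, control flow, and type inference).
        target = expr_to_msl(node.targets[0])
        return f"{target} = {expr_to_msl(node.value)};"
    raise NotImplementedError(ast.dump(node))

src = textwrap.dedent("""
    def add(A, B, C):
        idx = metal.thread_id_x()
        C[idx] = A[idx] + B[idx]
""")
body = ast.parse(src).body[0].body
print("\n".join(stmt_to_msl(s) for s in body))
# idx = tid.x;
# C[idx] = (A[idx] + B[idx]);
```

The real pipeline then wraps the emitted statements in a `kernel void` signature built from `param_types` and hands the source to Metal's JIT.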
Utilities
I enjoy writing TUIs since I mostly use the terminal for everything. I added a really quick TUI in Rust that, right now, has to be loaded in another terminal instance/pane. I plan on making it much better, but I'm just focusing on the runtime right now; Rust is cool, but I don't want to write that much Rust while I'm writing C++ headers daily.
NumPy
NumPy is such a crazy marvel of engineering. NumPy is goated. One of the issues I had was with my first implementation of caching. It took a few hours, but there was a nice workaround.
def download(self, metal_buffer: MetalBuffer) -> np.ndarray:
    contents = metal_buffer.buffer.contents()
    total_bytes = metal_buffer.size
    # Slicing returns a tuple of single-byte bytes objects,
    # so join them before handing the data to NumPy.
    tuple_of_bytes = contents[0:total_bytes]
    all_bytes = b"".join(tuple_of_bytes)
    array_flat = np.frombuffer(all_bytes, dtype=metal_buffer.dtype.to_numpy())
    return array_flat.reshape(metal_buffer.shape)

def peek(self, metal_buffer: MetalBuffer, dtype: DType, index: int = 0):
    offset_bytes = index * dtype.size
    contents = metal_buffer.buffer.contents()
    # Slicing returns a tuple of single-byte BYTES objects,
    # e.g., (b'\x00', b'\xe4', b'\xc0', b'\x46')
    tuple_of_bytes = contents[offset_bytes : offset_bytes + dtype.size]
    # Join the tuple of bytes into a single bytes object.
    value_bytes = b"".join(tuple_of_bytes)
    return np.frombuffer(value_bytes, dtype=dtype.to_numpy())[0]
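The core of the workaround is the `b"".join` step: slicing the buffer's contents here yields a tuple of single-byte bytes objects rather than one contiguous bytes object, so they have to be joined before `np.frombuffer` can read them. In isolation, with a hand-made tuple standing in for the real buffer slice:

```python
import numpy as np

# Stand-in for what slicing the buffer pointer returns:
# a tuple of single-byte bytes objects, not one contiguous bytes.
tuple_of_bytes = (b"\x00", b"\x00", b"\x80", b"\x3f")  # float32 1.0, little-endian

value_bytes = b"".join(tuple_of_bytes)
value = np.frombuffer(value_bytes, dtype=np.float32)[0]
print(value)  # 1.0
```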
Cool implementation for later
Basically, to help with that prototyping, I want to be able to drop in .cu files and use them as a replacement, so a translation layer at some point.
For now the implementation will go on the backburner, but down the line I think it would be cool. (Not a replacement for CUDA or anything.)
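As a toy illustration of what the very first pass of such a translation layer might look like, here's a token-level rename of a few CUDA builtins to rough MSL equivalents. Everything here is hypothetical (the `tid`/`gid`/`tg_size` names are assumptions, and real translation needs a real parser, not regexes):

```python
import re

# Toy, purely illustrative mapping of a few CUDA builtins to rough
# MSL equivalents; a real .cu translation layer needs a real parser.
CUDA_TO_MSL = {
    r"\b__global__\b": "kernel",
    r"\bthreadIdx\.x\b": "tid.x",    # assuming tid : thread_position_in_threadgroup
    r"\bblockIdx\.x\b": "gid.x",     # assuming gid : threadgroup_position_in_grid
    r"\bblockDim\.x\b": "tg_size.x", # assuming tg_size : threads_per_threadgroup
}

def cu_to_msl(src: str) -> str:
    for pattern, replacement in CUDA_TO_MSL.items():
        src = re.sub(pattern, replacement, src)
    return src

cuda_line = "int i = blockIdx.x * blockDim.x + threadIdx.x;"
print(cu_to_msl(cuda_line))
# int i = gid.x * tg_size.x + tid.x;
```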
What’s left
I think this is at a decent state. I'm still motivated to write a lot more for it and get some more features out. Right now the main priority is writing the IR graph for kernel fusion and async compilation, instead of having everything synchronize. There are still a few operations left to do, like
- Reductions
- Broadcasting primitives
- Conv?
- Transpose with tiling
- Gather/Scatter
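For the gather/scatter items, the NumPy semantics are a useful reference point for what the kernels need to reproduce (nothing Metal-specific here, just the behavior to match):

```python
import numpy as np

src = np.array([10.0, 20.0, 30.0, 40.0], dtype=np.float32)
idx = np.array([3, 0, 2])

# Gather: out[i] = src[idx[i]]
gathered = src[idx]  # [40., 10., 30.]

# Scatter: dst[idx[i]] = values[i]
dst = np.zeros(4, dtype=np.float32)
dst[idx] = np.array([1.0, 2.0, 3.0], dtype=np.float32)
# dst -> [2., 0., 3., 1.]
```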
All that to say, there just needs to be more pipeline and optimization implementation, which all starts from the IR graph. Truthfully I'm not too worried, since Julia has an IR graph (Julia IR) to learn from.
Writing Julia taught me so much; frameworks have so many aspects of software and design that I think fly under the radar.
Conclusion
This is my first real write-up; I normally just stream myself coding, so this was nice. Thank you for reading. I'll write a follow-up either after the IR lands or if I stop working on it before then.
Repo
- GitHub: lovechants/Iris