Project
The idea for the project was to dive into the basics of GPU programming by building a MLP in CUDA!
Experience
I was initially motivated to do this by Simon Oz youtube playlist on GPU programming. I wanted to apply what I was learning and thought to follow his steps and build an MLP.
At the start I was just getting used to the way CUDA works and how kernels are used. This was the first time I had touched any GPU programming projects so this part took a while. I specially struggled with the way multidimensional indices are flattened.
__global__ void forward(int bs, int n, int out_w, float* input, float* weights, float* biases, float* output) { // y for rows (height of the mat) int row = blockIdx.y * blockDim.y + threadIdx.y; // x for column (width of the mat) int column = blockIdx.x * blockDim.x + threadIdx.x;
// do the dot product between the row and col if (row < bs && column < out_w) { output[row*out_w + column] = biases[column]; for (int i = 0; i < n; i++) { output[row * out_w + column] += weights[i * out_w + column] * input[row * n + i]; } }}The kernel above indexes into the data with the row and column variables turning them into flat indices. This was the hardest part to get used to as I had to deconstruct my mental model of multidimensional arrays, and replace it with flat ones and flat indices.
To solidify this I have actually built a Tensor library that forced me to practice this called Rusty Flow although it is not in cuda all cpu.
I am very happy to peel of the layers in machine learning and look into how things work at a deeper level.