I am using Halide to generate a simplified version of the inner kernel of a GEMM operation, similar to this. It multiplies a 12x1 column vector by a 1x4 row vector and adds the result into a 12x4 accumulator tile. I am targeting 32-bit ARM NEON.
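For reference, the arithmetic the kernel should perform can be sketched in plain C++ as below. This is only a scalar illustration, not the Halide pipeline itself: the name `rank1_update`, the `float` element type, and the row-major layout of the accumulator are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>

// Tile shape taken from the description above: 12x1 column times 1x4 row.
constexpr std::size_t kRows = 12;
constexpr std::size_t kCols = 4;

// Hypothetical reference kernel (assumed name, float elements, row-major acc).
// a:   kRows values (the 12x1 column vector)
// b:   kCols values (the 1x4 row vector)
// acc: kRows * kCols values, updated in place as acc += a * b (outer product)
void rank1_update(const float* a, const float* b, float* acc) {
    assert(a != nullptr && b != nullptr && acc != nullptr);
    for (std::size_t i = 0; i < kRows; ++i) {
        for (std::size_t j = 0; j < kCols; ++j) {
            acc[i * kCols + j] += a[i] * b[j];
        }
    }
}
```

In a full GEMM inner loop this update would be repeated once per step along the reduction dimension, with the same accumulator tile kept live across iterations.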
Ideally, all the accumulators and operands should fit in the q registers, without spilling to the stack. However, the generated ARM assembly uses the registers in a sub-optimal way, and keeps spilling registers onto the stack and reloading them.
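To make the register budget explicit, here is a small compile-time calculation. It assumes 32-bit elements (the element type is my assumption for this sketch); 32-bit ARM NEON provides sixteen 128-bit q registers (q0 to q15). All constant names are made up for the example.

```cpp
// Assumption: 32-bit elements, so each 128-bit q register holds 4 lanes.
constexpr int kLanesPerQ = 128 / 32;

// Number of q registers needed to hold `elements` values (rounded up).
constexpr int q_regs(int elements) {
    return (elements + kLanesPerQ - 1) / kLanesPerQ;
}

constexpr int kAccQ = q_regs(12 * 4);  // 12x4 accumulator tile
constexpr int kColQ = q_regs(12);      // 12x1 column operand
constexpr int kRowQ = q_regs(4);       // 1x4 row operand
constexpr int kTotalQ = kAccQ + kColQ + kRowQ;

// 32-bit ARM NEON exposes q0..q15.
constexpr int kAvailableQ = 16;

static_assert(kTotalQ <= kAvailableQ, "working set exceeds the q register file");
```

Under that assumption the working set is 12 + 3 + 1 = 16 q registers, so it fits, but with no spare register for temporaries.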
The relevant part of the LLVM IR is here, and the corresponding arm32 assembly is here.
Any help on how to solve this, or on what might be causing it, would be greatly appreciated.