I guess it was just that I was trying to use traditional QoS. I got similar results with the stock firmware, so I enabled adaptive QoS in Merlin, and it's going at full speed. For traditional mode, you would need two full cores at 100% to max out a gigabit connection.
Traditional = zero HW acceleration
Adaptive QoS = first level of HW acceleration
No QoS = both levels of HW acceleration
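As a quick reference, the mapping above could be written as a small lookup table. This is only a sketch: the mode keys and the `accel_levels` helper are made-up names for illustration, not identifiers from the firmware.

```python
# Hardware acceleration levels left active for each QoS mode, per the list above.
# The keys and function name are illustrative assumptions, not firmware settings.
HW_ACCEL_LEVELS = {
    "traditional": 0,  # zero HW acceleration
    "adaptive": 1,     # first level of HW acceleration
    "none": 2,         # both levels of HW acceleration
}


def accel_levels(qos_mode: str) -> int:
    """Return how many HW acceleration levels remain active for a QoS mode."""
    return HW_ACCEL_LEVELS[qos_mode.strip().lower()]


for mode in ("traditional", "adaptive", "none"):
    print(f"{mode}: {accel_levels(mode)} level(s) of HW acceleration")
```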
There are less CPU-intensive queuing algorithms available for QoS, intended more for high-speed connections like gigabit, which simply ensure every device gets a fair share of bandwidth.
There's no reason for deep packet analysis, different classes, etc. at those speeds.
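To make "fair share" concrete, here is a rough sketch of a max-min fair split of link capacity between devices. It only illustrates the outcome such algorithms aim for. Real queuing disciplines schedule individual packets rather than computing allocations up front, and the device names and demand figures below are invented for the example.

```python
def fair_share(capacity_mbps, demands_mbps):
    """Max-min fair allocation: split capacity evenly between devices,
    handing back whatever a device does not need to the others."""
    alloc = {dev: 0.0 for dev in demands_mbps}
    remaining = float(capacity_mbps)
    unsatisfied = {dev for dev, want in demands_mbps.items() if want > 0}

    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        # Devices whose outstanding demand fits within the equal share
        done = {d for d in unsatisfied if demands_mbps[d] - alloc[d] <= share}
        if not done:
            # Everyone wants more than the equal share: give each exactly that
            for d in unsatisfied:
                alloc[d] += share
            remaining = 0.0
            break
        for d in done:
            need = demands_mbps[d] - alloc[d]
            alloc[d] += need
            remaining -= need
        unsatisfied -= done

    return alloc


# Hypothetical demands on a 1000 Mbps link
print(fair_share(1000, {"tv": 100, "pc": 800, "nas": 900}))
# prints {'tv': 100.0, 'pc': 450.0, 'nas': 450.0}
```

The TV gets everything it asks for, and the two heavy users split what is left evenly, with no classes or packet inspection involved.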
The problem is that any iptables-based classification (like what Traditional QoS does) requires disabling hardware acceleration on the Broadcom platform (no idea if that's also the case for other manufacturers' implementations).
The reason why Adaptive QoS (with its DPI engine) works with hardware acceleration is that the DPI engine runs as a kernel module, and sits outside of Netfilter. So, hardware acceleration is still able to bypass the FORWARD chain, as packet marking is done by another kernel module.
So in a sense, DPI-based QoS gives you better performance in Asuswrt because it can work alongside HW acceleration.