However, like any hardware acceleration, there is a drawback: small hardware FPUs have limited accuracy compared to software floating point.
Sorry - I do have to make a quick comment here - an FPU is, for the most part, IEEE 754 compliant if it wants to be considered an FPU. Whether it's HW or SW driven, if I run a calculation that requires floating-point math, it had better give a consistent and correct result.
Remember the Pentium FDIV bug?
It doesn't matter whether it's HW or SW - HW does it faster, but SW gives the same answer - and it had better be the same answer whether it's MIPS/ARM/PPC/SPARC/x86 or SW...
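A quick way to see the "same answer everywhere" point is to look at the raw bits of a result. This is only a minimal sketch, assuming the Python build uses IEEE 754 binary64 for `float` (true on the common platforms); the helper name `double_bits` is made up for the illustration.

```python
import struct

def double_bits(x: float) -> str:
    """Return the 64-bit IEEE 754 encoding of x as a big-endian hex string."""
    return struct.pack(">d", x).hex()

# Basic operations are correctly rounded under IEEE 754, so a compliant
# hardware FPU and a compliant soft-float library must produce the same bits.
print(double_bits(0.1 + 0.2))   # 3fd3333333333334
print(double_bits(1.0 / 3.0))   # 3fd5555555555555
```

If a box prints different bits for these basic operations, something is not compliant, regardless of whether the math ran in HW or SW.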
Hardware AES's only limit is that throughput scales with the data size: if it was made to handle 128-bit data it will do so every cycle, but give it 256-bit AES and it will take multiple cycles for each block.
AES-NI, at least on Intel, actually runs in the same path as SSE and AVX - it's just optimized instructions... don't confuse this with Quick Sync Video or QuickAssist Technology - those are functional blocks, AES-NI is not.
There are dedicated crypto blocks out there - VIA has one for their x86 chips (yes, they still make them) called PadLock, and various ARM and MIPS processors can include crypto as well (and some are very, very, very fast... I have a friend who is working on one for an ARM scale-out processor that is silly fast)
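On the 128 vs 256 point, it may help to remember that the AES block is always 128 bits; only the key size changes, and with it the number of rounds. A rough sketch of that relationship follows (the function name `aes_rounds` is just for illustration, and no cycle counts are modeled):

```python
AES_BLOCK_BITS = 128  # fixed by the AES standard for every key size

def aes_rounds(key_bits: int) -> int:
    """Number of AES rounds for a given key size (10, 12 or 14)."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES keys are 128, 192 or 256 bits")
    # Rounds = key length in 32-bit words + 6.
    return key_bits // 32 + 6

for key_bits in (128, 192, 256):
    print(f"AES-{key_bits}: {AES_BLOCK_BITS}-bit block, {aes_rounds(key_bits)} rounds")
```

With AES-NI each round is one instruction per block, so AES-256 costs more round instructions per block than AES-128, but the data path handles the same 128-bit block either way.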
Hardware NAT cannot do complicated QoS or firewalling.
This goes back to my comment about multi-layer switching, which is what BRCM is doing with ctf.ko: it is programming the Ethernet fabric inside the SoC with very specific rules... it can do NAT at L3, which is a firewall unto itself in some contexts, but it's not a stateful implementation like Netfilter or pf, and I agree, QoS can be a challenge with an MLS, whereas a router that is focused on L3 can do a lot more....
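To make the "very specific rules" idea concrete, here is a toy model of an exact-match rule table doing source NAT. It is only a sketch under my own assumptions - it is not how ctf.ko is implemented, and `Flow`, `FastPath` and `WAN_IP` are hypothetical names. The point is that the table only rewrites headers on an exact match and knows nothing about connection state.

```python
from typing import Dict, NamedTuple, Optional, Tuple

class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int

WAN_IP = "203.0.113.1"  # hypothetical WAN address (documentation range)

class FastPath:
    """Exact-match rewrite table, standing in for rules programmed into a switch."""

    def __init__(self) -> None:
        self.rules: Dict[Flow, Tuple[str, int]] = {}

    def install(self, flow: Flow, wan_port: int) -> None:
        # The software slow path decides the mapping and pushes it down.
        self.rules[flow] = (WAN_IP, wan_port)

    def forward(self, flow: Flow) -> Optional[Flow]:
        rewrite = self.rules.get(flow)
        if rewrite is None:
            return None  # miss: packet must be punted to the CPU
        # Hit: rewrite the source only. No TCP state tracking, no QoS marking.
        return flow._replace(src_ip=rewrite[0], src_port=rewrite[1])

fp = FastPath()
lan_flow = Flow("192.168.1.10", 40000, "198.51.100.7", 443)
print(fp.forward(lan_flow))   # None, first packet goes to software
fp.install(lan_flow, 50001)
print(fp.forward(lan_flow))   # rewritten in the fast path
```

Anything beyond that exact match (stateful filtering, shaping, classification) still has to happen in software, which is why it tends to get bypassed once acceleration is on.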
Which goes back to my original statement - HW accel is a patch/crutch - if the router is properly sized for its traffic, then it doesn't need HW accel at all...