NAT, PPPoE (we use it in my country), VLAN routing..
Qualcomm, Broadcom, MediaTek - they all have HW offloads, and they're needed because of CPU and memory constraints... It's a good place to handle things like this, and it's recognition that the CPU cores don't have to be directly involved in the data path; they can stay in the control path instead...
Most of these offloads actually happen in the switch element of the SoC - the cores tell the switch to configure these flows in whatever manner they need.
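To make the control path vs data path split concrete, here's a rough Python sketch. It is not any vendor's actual API - OffloadTable, ControlPath, forward and the addresses are all names I made up for illustration. The idea is just that the first packet of a flow misses the table and goes to the CPU, the CPU decides what to do (here, a simple source NAT) and programs the entry, and everything after that never touches the cores.

```python
from dataclasses import dataclass
from typing import Dict, NamedTuple, Optional


class FlowKey(NamedTuple):
    """5-tuple identifying a flow (hypothetical layout, for illustration)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str


@dataclass
class FlowAction:
    out_port: int
    nat_src_ip: Optional[str] = None  # rewritten source address, if NAT applies


class OffloadTable:
    """Stands in for the flow table inside the SoC's switch element."""

    def __init__(self) -> None:
        self.entries: Dict[FlowKey, FlowAction] = {}
        self.hits = 0  # packets forwarded without involving the CPU

    def lookup(self, key: FlowKey) -> Optional[FlowAction]:
        action = self.entries.get(key)
        if action is not None:
            self.hits += 1
        return action

    def install(self, key: FlowKey, action: FlowAction) -> None:
        self.entries[key] = action


class ControlPath:
    """Stands in for the CPU cores: decides policy, then programs the table."""

    def __init__(self, table: OffloadTable, wan_ip: str, wan_port: int) -> None:
        self.table = table
        self.wan_ip = wan_ip
        self.wan_port = wan_port
        self.slow_path_packets = 0  # packets the CPU had to look at

    def handle_miss(self, key: FlowKey) -> FlowAction:
        self.slow_path_packets += 1
        # Assumed policy: source-NAT everything out of the WAN port.
        action = FlowAction(out_port=self.wan_port, nat_src_ip=self.wan_ip)
        self.table.install(key, action)
        return action


def forward(key: FlowKey, table: OffloadTable, cpu: ControlPath) -> FlowAction:
    """Data path: try the offload table first, punt to the CPU on a miss."""
    action = table.lookup(key)
    if action is None:
        action = cpu.handle_miss(key)
    return action


if __name__ == "__main__":
    table = OffloadTable()
    cpu = ControlPath(table, wan_ip="203.0.113.1", wan_port=0)
    key = FlowKey("192.168.1.10", "198.51.100.7", 40000, 443, "tcp")
    for _ in range(5):
        forward(key, table, cpu)
    print(cpu.slow_path_packets, table.hits)  # prints "1 4"
```

Five packets of the same flow cost the CPU exactly one lookup. Real hardware also has to age out entries and deal with table size limits, which this sketch skips entirely.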
When you look at these x86/AMD64 appliances - there's enough CPU resources available that one can run everything through the OS's native networking stack...
BSD and Linux have fast paths for SW - as someone mentioned, netfilter has flowtable acceleration that can be defined inside nftables - same applies for pf in BSD-land.
NICs also have some level of offload capability - Intel and Broadcom have a fair amount of capability here, both at the MAC layer, and also up into the network layer of the stack.
Intel has QAT, which is pretty impressive if you have a Xeon D that supports it, and there's all the work with DPDK that Intel kicked over to the Linux Foundation...
Back in 2018, post-cafeole (my science project) - working at a startup over in Santa Clara, we did a lot of work on 40Gb networking - and getting there was all about the offloads available in QuickAssist and DPDK, along with some clever work to get L2TP tunnels running at wire speed with AES-128-GCM...
Should also note that the switch SoCs have HW acceleration as well - one of the better-documented implementations is Broadcom with their FastPath.