NotTheHerbie
Occasional Visitor
My home network began experiencing crippling lag spikes (greater than 1500 msec, sometimes greater than 2000 msec) after I added three BE30000 AiMesh nodes 7 months ago. They replaced three of four RT-AX92Us running in media bridge mode. Initially, the spikes occurred like clockwork every 15 minutes. Shortly after I first noticed them, the period mysteriously changed to 19 minutes 50 seconds (+/- 2 seconds).
The spikes appeared to only affect the AiMesh nodes and devices connected through them. Clients connected directly to the GT-BE98 Pro, either wired or wirelessly, did not experience the same spikes and disconnects. These spikes occurred with the latest stock firmware installed on all devices as well as the latest Merlin firmware. I currently have stock 3.0.0.6.102_36789 installed on the three BE30000s and Merlin 3006.102.3_beta1 installed on the GT-BE98 Pro. MLO is disabled for both fronthaul and backhaul. Each band is set up with a unique SSID, Channel Control is set to Auto and DFS channels are disabled for the 5 GHz band. Roaming Assistant is also disabled for all bands and AiProtection is off. The mesh nodes all use the 6 GHz wireless backhaul. Everything else (wireless related) was left at the default settings.
The lag spikes were captured using the free version of EMCO Ping Monitor. The free version allows you to simultaneously ping up to 5 hosts as often as every second and has a measurement resolution of 1 msec. It records, analyzes and graphs the results making it easy to identify patterns. The PC running the software is connected to the main router through a 10G switch. I have the software set up to ping the three AiMesh nodes every second. As a control, I’m also pinging a Raspberry Pi 4 that is hardwired through the same switch, eliminating the router’s influence from this measurement. The ping measurements for the Raspberry Pi are reported as ~0 msec with no spikes.
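For anyone who wants to check a ping log for the same kind of clockwork pattern, here is a minimal Python sketch. It is not what EMCO Ping Monitor does internally. It assumes the log has already been exported as (time in seconds, RTT in msec) pairs, with None for a timed-out ping. The function names, the 1500 msec threshold and the 10 second merge gap are illustrative choices.

```python
from statistics import median


def spike_intervals(samples, threshold_ms=1500.0, merge_gap_s=10.0):
    """Return the time between the starts of consecutive spike events.

    samples: iterable of (t_seconds, rtt_ms) pairs in time order, where
    rtt_ms is None for a timed-out ping (assumed log format).
    Spike samples closer together than merge_gap_s count as one event.
    """
    events = []  # each event is [start_time, time_of_last_spike_sample]
    for t, rtt in samples:
        if rtt is not None and rtt <= threshold_ms:
            continue  # normal reply, not a spike
        if events and t - events[-1][1] <= merge_gap_s:
            events[-1][1] = t  # still part of the current event
        else:
            events.append([t, t])  # a new spike event begins
    starts = [start for start, _ in events]
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


def typical_period(samples, **kwargs):
    """Median interval between spike events, or None if fewer than two events."""
    intervals = spike_intervals(samples, **kwargs)
    return median(intervals) if intervals else None
```

A median close to 1190 seconds would correspond to the 19 minute 50 second period described above.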
In an attempt to identify the cause of the lag spikes, I used SSH to run “top” on the GT-BE98 Pro and redirected the output to a file on a thumb drive. I tagged each line of output with a date/time stamp so I could correlate what was happening on the router with the ping measurements that were captured.
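The time-stamping step can be sketched as follows. This is not the exact method used on the router; it is a Python sketch that would run on a PC, assuming the output of "top" is piped to it over SSH in batch mode (for example `top -b` on the router, if the firmware's top supports that flag). The helper names `tag_line` and `tag_stream` are hypothetical.

```python
import sys
from datetime import datetime


def tag_line(line, when):
    """Prefix one line of output with a date/time stamp."""
    return f"{when:%Y-%m-%d %H:%M:%S} {line.rstrip()}"


def tag_stream(src, dst, clock=datetime.now):
    """Copy src to dst line by line, stamping each non-blank line on arrival."""
    for line in src:
        if line.strip():
            dst.write(tag_line(line, clock()) + "\n")
            dst.flush()  # keep the file current if the session drops


# Example use (assumed command line, adjust to your setup):
#   ssh admin@router "top -b" | python tag_top.py > top_log.txt
# with the script ending in:
#   tag_stream(sys.stdin, sys.stdout)
```

With one-second resolution on both the ping log and the process log, the two can be lined up by timestamp.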
What I discovered was that around the time of each lag spike, a process with the name “avahi-daemon: running [GT-BE98Pro.local]” went from <= 0.2% CPU utilization to mid-single-digit utilization, occasionally peaking over 10%. I didn’t know if this was the cause of the lag spikes or just another symptom of whatever was causing them. What I did learn was that avahi is an implementation of DNS Service Discovery (DNS-SD) over Multicast DNS (mDNS), the same protocols Apple calls Bonjour, and that it is compatible with Apple’s implementation.
Unrelated to the above investigation, I was exploring the new interface recently released for my home automation hub, the Hubitat C-8, when I discovered a network setting I was unfamiliar with. Under Network Settings, Bonjour Options, is a setting to enable or disable “Periodically restart Bonjour service”. It was enabled. Researching this setting on the Hubitat website, I learned that when enabled, the Bonjour service is halted and restarted every 20 minutes!!! (More likely, every 19 minutes 50 seconds.) I also found forum discussions describing how restarting this service floods the local network with mDNS traffic.
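One way to check whether mDNS bursts line up with the spike period is to count mDNS packets per time bucket on a wired PC. The Python sketch below makes several assumptions: it joins the standard mDNS multicast group (224.0.0.251, UDP port 5353), it only sees traffic that reaches the machine it runs on, and the function names and 10 second bucket size are illustrative. On some systems another mDNS service may already hold port 5353.

```python
import socket
import time
from collections import Counter

MDNS_GROUP = "224.0.0.251"  # standard IPv4 mDNS multicast address
MDNS_PORT = 5353


def open_mdns_listener():
    """Open a UDP socket that receives IPv4 mDNS multicast traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MDNS_PORT))
    # ip_mreq: multicast group address followed by local interface (any)
    mreq = socket.inet_aton(MDNS_GROUP) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def capture_arrival_times(duration_s):
    """Record the wall-clock arrival time of every mDNS packet seen."""
    arrivals = []
    deadline = time.monotonic() + duration_s
    with open_mdns_listener() as sock:
        sock.settimeout(1.0)
        while time.monotonic() < deadline:
            try:
                sock.recvfrom(9000)
            except socket.timeout:
                continue  # quiet second, keep waiting
            arrivals.append(time.time())
    return arrivals


def packets_per_bucket(arrival_times, bucket_s=10):
    """Count packets in fixed-width time buckets, keyed by bucket start."""
    counts = Counter(int(t // bucket_s) * bucket_s for t in arrival_times)
    return dict(sorted(counts.items()))
```

Running `packets_per_bucket(capture_arrival_times(3600))` for an hour and looking for buckets with unusually high counts roughly every 20 minutes would support the Hubitat theory; an even distribution would argue against it.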
Combining this new information with what I found in the GT-BE98 Pro’s process history, I decided to disable this feature on my Hubitat C-8 and see what happened. Here is what I found:
- I am no longer seeing the very predictable and crippling lag spikes every 19 minutes 50 seconds!!!
- I do still see what look to be fairly common random spikes of up to 60 – 70 msec.
- Unfortunately, I have also seen spikes exceeding 1500 msec, but instead of 3 times per hour, they appear to occur randomly 2-4 times per 24 hours.
The average latency was 3.52 msec, with a minimum of 1 msec and a maximum of 74 msec. These results are for a mesh node that is approximately 18 feet from the router with no obstructions. The results for the remaining mesh nodes were very similar to these. The lag spikes for all three nodes appeared to occur at the same time.
Open Questions:
- What is considered an acceptable/normal amount of wireless latency? At what point should the above spikes be considered unacceptable? I have an idea of what those values should be, but I’d like to know if there is an accepted standard.
- I need to further investigate the remaining random, large (>1500 msec) lag spikes. These 2-4 large spikes per day do not look like the spikes that were eliminated. The predictable spikes would typically have a couple of elevated measurements on either side of the large spike. These remaining spikes do not. They look like a single packet was lost in the middle of otherwise nominal performance.
- Why did the mDNS storm caused by my Hubitat C-8 only affect clients connected (wired & wireless) to a mesh node and not clients connected (wired & wireless) directly to the main router? Did the mDNS storm identify a weakness in the AiMesh Wireless Backhaul implementation?
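To separate the two kinds of large spikes described in the second question, the difference could be tested mechanically. The Python sketch below labels each large spike as "clustered" (elevated measurements nearby, like the predictable spikes) or "isolated" (a single bad sample in otherwise nominal data). The 100 msec "elevated" threshold and the two-sample window are assumptions chosen for illustration, not measured values.

```python
def _above(rtt_ms, limit_ms):
    """True if a sample exceeds the limit; a timeout (None) always counts."""
    return rtt_ms is None or rtt_ms > limit_ms


def classify_spikes(rtts_ms, spike_ms=1500.0, elevated_ms=100.0, window=2):
    """Label each large spike by what its neighbouring samples look like.

    rtts_ms: list of one-per-second RTTs in msec, None for a timeout.
    Returns a list of (index, label) pairs, where label is "clustered" if
    any sample within `window` positions is elevated, else "isolated".
    """
    results = []
    for i, rtt in enumerate(rtts_ms):
        if not _above(rtt, spike_ms):
            continue
        lo = max(0, i - window)
        hi = min(len(rtts_ms), i + window + 1)
        neighbours = rtts_ms[lo:i] + rtts_ms[i + 1:hi]
        elevated = any(_above(n, elevated_ms) for n in neighbours)
        results.append((i, "clustered" if elevated else "isolated"))
    return results
```

If the remaining 2-4 daily spikes all come back "isolated", that would fit the single-lost-packet reading rather than a leftover mDNS burst.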