RT-AC87U 384.13_2
Diversion 4.1.8
amtm 3.0
I've been having this problem across several Merlin and Diversion versions, so I don't think it's specifically related to their latest releases.
Basically, when the cable modem (WAN) loses its connection, both CPUs on the router are pegged at 100%. This prevents clients on the LAN from even obtaining a DHCP address. Most of the time, the router is so resource constrained that I'm unable to run top after SSHing into it. Even when the WAN connection is restored, the router stays locked up and cannot route for the LAN clients until it's been power cycled.
Today, I was finally able to capture this behavior. top always showed the tdts_rule_agent process, with dnsmasq and mtdblock3 alternating as the two other top offenders. Here is a snapshot before and after disabling Diversion, while the WAN is disconnected.
Code:
DIVERSION ENABLED
Mem: 194012K used, 61664K free, 3552K shrd, 1888K buff, 26700K cached
CPU: 47.1% usr 52.6% sys 0.0% nic 0.0% idle 0.0% io 0.0% irq 0.1% sirq
Load average: 2.75 2.45 1.77 4/114 5437
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
778 1 nobody R 51068 19.9 0 46.1 dnsmasq --log-async
5430 5334 admin R 1732 0.6 1 44.6 tdts_rule_agent -g -r /jffs/signature/rule.trf
192 1 admin R 5132 2.0 1 5.0 nt_center
169 1 admin S 652 0.2 0 1.4 tftpd
<SNIP>
Mem: 196496K used, 59180K free, 1708K shrd, 1200K buff, 9152K cached
CPU: 49.8% usr 46.2% sys 0.0% nic 3.5% idle 0.0% io 0.0% irq 0.3% sirq
Load average: 2.84 2.39 1.34 3/114 6274
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
794 1 nobody R 81292 31.7 0 43.0 dnsmasq --log-async
27 2 admin RW 0 0.0 1 22.1 [mtdblock3]
196 1 admin S 5144 2.0 0 2.5 nt_center
179 1 admin S 2604 1.0 0 2.4 protect_srv
<SNIP>
admin@RT-AC87U:/tmp/home/root# iostat
Linux 2.6.36.4brcmarm (RT-AC87U) 01/14/20 _armv7l_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.09 0.00 2.12 0.06 0.00 96.72
Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
mtdblock3 0.12 4.16 0.00 249594 0
sda 0.03 4.64 0.11 277944 6344
admin@RT-AC87U:/tmp/home/root# dstat
Traceback (most recent call last):
File "/opt/bin/dstat", line 32, in <module>
import six
ImportError: No module named six
DIVERSION DISABLED
Mem: 129604K used, 126072K free, 3512K shrd, 1048K buff, 9620K cached
CPU: 1.5% usr 52.2% sys 0.0% nic 46.0% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 2.63 2.48 1.81 2/115 6174
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
6136 5699 admin R 1732 0.6 1 49.3 tdts_rule_agent -g -r /jffs/signature/rule.trf
192 1 admin R 5132 2.0 1 1.4 nt_center
169 1 admin S 652 0.2 1 0.9 tftpd
179 1 admin R 2604 1.0 1 0.6 protect_srv
<SNIP>
About 2 weeks ago, I migrated the Diversion use case to a dedicated Pi-hole device (partly on a hunch that it was related to this problem). When the symptom occurred today, I disabled Diversion (and pixelserv-tls), after which CPU usage only ranged between 0% and 50%. tdts_rule_agent was still always at the top of the list, but at least clients were able to get DHCP leases.
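In case it helps anyone compare captures, here is a rough Python sketch that ranks the process rows of a BusyBox top snapshot by %CPU. The rows are copied from the first "DIVERSION ENABLED" capture above. The function name parse_top and the regular expression are my own illustration, and the sketch assumes the STAT column contains no spaces, which holds for these rows but may not hold for every BusyBox build.

```python
import re

# Rows copied from the first DIVERSION ENABLED snapshot in this post.
SNAPSHOT = """\
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
778 1 nobody R 51068 19.9 0 46.1 dnsmasq --log-async
5430 5334 admin R 1732 0.6 1 44.6 tdts_rule_agent -g -r /jffs/signature/rule.trf
192 1 admin R 5132 2.0 1 5.0 nt_center
169 1 admin S 652 0.2 0 1.4 tftpd
<SNIP>
"""

# Column order as shown in the capture:
# PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
# Assumption: STAT is a single token without embedded spaces.
ROW = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<ppid>\d+)\s+(?P<user>\S+)\s+(?P<stat>\S+)\s+"
    r"(?P<vsz>\S+)\s+(?P<pvsz>[\d.]+)\s+(?P<cpu>\d+)\s+(?P<pcpu>[\d.]+)\s+"
    r"(?P<cmd>.+)$"
)


def parse_top(text):
    """Return (command name, %CPU) pairs sorted by %CPU, highest first."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            # Header lines, <SNIP> markers, and blank lines do not match.
            continue
        name = match.group("cmd").split()[0]
        rows.append((name, float(match.group("pcpu"))))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, pcpu in parse_top(SNAPSHOT):
        print(name, pcpu)
```

Running the same function over the "DIVERSION DISABLED" rows makes it easy to see which processes dropped out of the list and which (tdts_rule_agent) stayed at the top.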
I also capture the RT-AC87U syslogs with Splunk. Looking back over the time of the incident, the only thing logged is a constant repetition of dnsmasq listing the nameservers and host addresses.
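To put a number on that repetition, a small Python sketch like the one below could count those dnsmasq messages per minute from an exported syslog. The sample lines are hypothetical, written in the usual syslog layout with dnsmasq's "using nameserver" and "read ... addresses" messages. They are not copied from my logs, and the function name and pattern are only illustrative.

```python
import re
from collections import Counter

# Hypothetical sample lines (not from the actual incident), using
# documentation-range IP addresses.
SAMPLE = """\
Jan 14 10:15:01 dnsmasq[778]: read /etc/hosts - 5 addresses
Jan 14 10:15:01 dnsmasq[778]: using nameserver 192.0.2.1#53
Jan 14 10:15:04 dnsmasq[778]: read /etc/hosts - 5 addresses
Jan 14 10:15:04 dnsmasq[778]: using nameserver 192.0.2.1#53
Jan 14 10:16:02 dnsmasq[778]: using nameserver 192.0.2.1#53
Jan 14 10:16:30 kernel: some unrelated message
"""

# Capture "Mon DD HH:MM" as the per-minute bucket. The lazy ".*?" tolerates
# an optional hostname between the timestamp and the process tag.
LINE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+.*?dnsmasq\[\d+\]:\s+"
    r"(?:using nameserver|read )"
)


def reloads_per_minute(text):
    """Count dnsmasq nameserver/hosts messages per minute."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for minute, count in sorted(reloads_per_minute(SAMPLE).items()):
        print(minute, count)
```

A sustained high count per minute while the WAN is down would suggest dnsmasq is being reloaded or restarted in a loop, which would fit the CPU usage seen in top, though that is only a guess on my part.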
Does anyone have ideas about the root cause of this problem?