Hey folks,
[edit1: all three nodes are running 3.0.0.4.388_21617]
[edit2: as of 2023/10/12, 23285 has finally been stable for me! Currently 73 days uptime]
I have three XT8's, currently using wireless backhaul. Previously I had them on wired backhaul, but was having serious stability issues (blinking red lights on the satellites), so I reset everything to use wireless backhaul, and everything was stable for a month or two. Now I've been having these symptoms:
1. dnsmasq stops working every 1-4 days. I can still `ping 8.8.8.8` and `dig google.ca @8.8.8.8`, so I know it's not a connectivity issue, just dnsmasq being dead or slow (see the quick check sketched after this list).
2. the admin web interface gets incredibly slow: many requests take 5-60 seconds. The mobile app is affected identically.
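For anyone seeing the same thing, this is the kind of quick check I mean -- a minimal sketch run from a LAN client; 192.168.50.1 is just a stand-in for your router's LAN IP:
Code:
# Raw connectivity, no DNS involved at all
ping -c 3 8.8.8.8

# Upstream DNS directly, bypassing dnsmasq on the router
dig google.ca @8.8.8.8 +time=2 +tries=1

# DNS via the router, i.e. via dnsmasq (replace 192.168.50.1 with your router's LAN IP)
dig google.ca @192.168.50.1 +time=2 +tries=1
If the first two come back fine and the last one times out, it's dnsmasq (or something sitting on it), not connectivity.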
Recently I enabled SSH on the main router, and after poking around I can see `cfg_server` being sketchy:
1. it uses a lot of RAM -- at the moment it's up to 159MB, maybe half an hour after I last killed it
2. it attempts a lot of TCP connections on port 7788, both from my WAN IP and from the router's LAN IP, to the satellites. They all end up in CLOSE_WAIT.
current breakdown:
source LAN:7788, dest .152 (one of the satellites): 216 sockets
source WAN:7788, dest .152: 229 sockets
source LAN:7788, dest .237 (the other satellite): 470 sockets
source WAN:7788, dest .237: 474 sockets
3. it has a lot of threads, currently 689 -- and the socket counts above work out to roughly two CLOSE_WAIT sockets per thread, as if each thread were stuck holding connections to the satellites (a rough way to gather these numbers is sketched after this list).
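If anyone wants to pull the same numbers on their own unit, something along these lines should work -- a rough sketch assuming busybox netstat/ps on stock firmware, so column positions and supported options may differ on your build:
Code:
# CLOSE_WAIT sockets with local port 7788, counted per local-address/peer pair
netstat -nt | awk '$6 == "CLOSE_WAIT" && $4 ~ /:7788$/ {split($5, a, ":"); print $4, "->", a[1]}' | sort | uniq -c

# cfg_server thread count ([c]fg_server keeps grep from matching itself)
ps T | grep '[c]fg_server' | wc -l

# resident memory of the cfg_server process, in kB
grep VmRSS /proc/$(pidof cfg_server)/status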
When I `killall cfg_server`, the admin web UI gets fast for a minute or two, then slows right back down as cfg_server respawns and spirals out of control again.
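Since cfg_server respawns on its own after the killall, I've been tempted to just automate the kick. A crude watchdog sketch -- a band-aid, not a fix, and the 300-thread threshold is an arbitrary number I picked:
Code:
#!/bin/sh
# Kill cfg_server whenever its thread count gets out of hand; it respawns by itself.
# THRESHOLD is arbitrary -- pick whatever "clearly broken" means on your unit.
THRESHOLD=300

while true; do
    THREADS=$(ps T | grep '[c]fg_server' | wc -l)
    if [ "$THREADS" -gt "$THRESHOLD" ]; then
        logger -t cfg_watchdog "cfg_server has $THREADS threads, killing it"
        killall cfg_server
    fi
    sleep 300
done
Run it in the background from an SSH session; nothing about it persists across reboots.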
I dug a little deeper:
Code:
# cd /proc/$(pidof cfg_server)/fd
# ls -lart |grep socket |wc -l
90
# ls -lart |grep -v socket
dr-xr-xr-x 7 admin root 0 Dec 20 11:11 ..
dr-x------ 2 admin root 0 Dec 20 11:11 .
lr-x------ 1 admin root 64 Dec 20 11:11 9 -> pipe:[118871]
lr-x------ 1 admin root 64 Dec 20 11:11 8 -> pipe:[118870]
lr-x------ 1 admin root 64 Dec 20 11:11 7 -> pipe:[118869]
lr-x------ 1 admin root 64 Dec 20 11:11 5 -> /proc/1/mounts
lrwx------ 1 admin root 64 Dec 20 11:11 2 -> /dev/null
lrwx------ 1 admin root 64 Dec 20 11:11 15 -> /var/lock/allwevent.lock
lrwx------ 1 admin root 64 Dec 20 11:11 1 -> /dev/null
lrwx------ 1 admin root 64 Dec 20 11:11 0 -> /dev/null
# ps T |grep cfg_server |grep -v grep |wc -l
164
Hypothesis: every one of those 164 threads is contending on /var/lock/allwevent.lock, and the admin web interface backend is also trying to get it, which is why it slows to a crawl. Why are there hundreds of threads? ¯\_(ツ)_/¯
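I haven't properly verified the contention part yet. A rough way to poke at it over SSH, using only /proc (so no assumptions about extra tools being on the firmware):
Code:
# Which processes currently have the lock file open?
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    if ls -l /proc/$pid/fd 2>/dev/null | grep -q allwevent.lock; then
        echo "$pid $(cat /proc/$pid/comm 2>/dev/null)"
    fi
done

# And who actually holds a lock on it right now: match the inode below against
# the device:inode column in /proc/locks (the PID is the field just before it).
ls -i /var/lock/allwevent.lock
cat /proc/locks
This only shows who has the file open or locked, not who's blocked waiting on it, but it's a start.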
I have a case open and am waiting on L2 support, so I'm reluctant to try another firmware at the moment, in hopes that my suffering can help future generations.
Googling for `allwevent.lock` yields https://www.snbforums.com/threads/a...ausing-router-not-accessible-via-webui.80469/ which seems fairly on-point.