
Stuck commands

I didn't run or use your script for "production." I simply tried to run it on my router to see how it works as a "proof of concept" (which I'm very familiar with as a professional s/w dev. myself). The point was that the current version of the script goes into an infinite loop because some NVRAM vars are set to empty strings or not set at all, which are not uncommon scenarios on ASUS routers. A "proof of concept" demonstration should take care of common scenarios like empty values; it doesn't have to be completely foolproof, but it shouldn't go into an infinite loop either.

Look, I get it. Nobody likes criticism, and some people are more averse to it than others even when it's constructive, as my feedback was meant to be. In one way or another, we're here to learn and if you are, I can offer some advice. If not, I can certainly move on - I got no skin in this game.
I actually don't mind criticism at all, and this may just be how you write, but you do come across as offering a little more than criticism. You are welcome to edit the script to check for whatever non-null values you prefer to test with. Again, it's just a conceptual script.
 
I seem to be experiencing the symptoms mentioned here. The AC86U in question is abroad and I use a VPN connection to log in to it. Wifi unstable, client list not updating, connections dropping, etc. Could these all be symptoms of the issues mentioned here? Any help is appreciated.
 
Probably not. See post #1. SSH into the router and run top or ps to see if there are any processes that look like they shouldn't be there.
 
Ran the script for a day, below is the grep from trace file for the script. Lots of 'nvram get' commands get stuck!

Code:
admin@RT-AC86U-9988:/tmp/mnt/ac86u/entware/var/log/Trace# grep nvram *.txt
StuckProcCmds_00001_06276.TRC.txt:2024-01-25 12:05:21  1119  1111 admin    S     3104  0.7   0  0.0 nvram get productid [KILLED]
StuckProcCmds_00001_06276.TRC.txt:2024-01-25 12:05:04  1142     1 admin    S     3104  0.7   0  0.0 nvram get odmpid [KILLED]
StuckProcCmds_00001_06276.TRC.txt:2024-01-25 12:04:50  1142     1 admin    S     3104  0.7   0  0.0 nvram get odmpid
StuckProcCmds_00001_06276.TRC.txt:2024-01-25 12:04:50  1119  1111 admin    S     3104  0.7   0  0.0 nvram get productid
StuckProcCmds_00002_04205.TRC.txt:2024-01-25 19:51:18  1741  1728 admin    S     3104  0.7   1  0.0 nvram get ntp_ready [KILLED]
StuckProcCmds_00002_04205.TRC.txt:2024-01-25 19:51:04  1741  1728 admin    S     3104  0.7   1  0.0 nvram get ntp_ready
StuckProcCmds_00003_05066.TRC.txt:2024-01-25 21:12:19  1142  1136 admin    S     3104  0.7   0  0.0 nvram get http_username [KILLED]
StuckProcCmds_00003_05066.TRC.txt:2024-01-25 21:12:04  1142  1136 admin    S     3104  0.7   0  0.0 nvram get http_username
StuckProcCmds_00004_06143.TRC.txt:2024-01-25 23:18:18  1138  1107 admin    S     3104  0.7   0  0.0 nvram get productid [KILLED]
StuckProcCmds_00004_06143.TRC.txt:2024-01-25 23:18:04  1138  1107 admin    S     3104  0.7   0  0.0 nvram get productid
StuckProcCmds_00005_03983.TRC.txt:2024-01-26 00:42:19  2364  2363 admin    S     3104  0.7   1  0.0 nvram get http_username [KILLED]
StuckProcCmds_00005_03983.TRC.txt:2024-01-26 00:42:04  2364  2363 admin    S     3104  0.7   1  0.0 nvram get http_username
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:07:06 99925 99924 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist [KILLED]
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:06:51 99925 99924 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:06:51 87093 86301 admin    S     2972  0.7   1  0.0 nvram get vpn_server_custom
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:06:34 98559 98558 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist [KILLED]
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:06:19 98559 98558 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:06:19 87093 86301 admin    S     2972  0.7   1  0.0 nvram get vpn_server_custom
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:06:01 97088 97087 admin    S N   2972  0.7   1  0.0 nvram get custom_clientlist [KILLED]
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:05:46 97088 97087 admin    S N   2972  0.7   1  0.0 nvram get custom_clientlist
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:05:46 87093 86301 admin    S     2972  0.7   1  0.0 nvram get vpn_server_custom
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:05:29 95141 95140 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist [KILLED]
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:05:14 95141 95140 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:05:14 87093 86301 admin    S     2972  0.7   1  0.0 nvram get vpn_server_custom
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:04:56 93762 93761 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist [KILLED]
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:04:41 93762 93761 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:04:41 87093 86301 admin    S     2972  0.7   1  0.0 nvram get vpn_server_custom
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:04:24 92286 92285 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist [KILLED]
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:04:09 92286 92285 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:04:09 87093 86301 admin    S     2972  0.7   1  0.0 nvram get vpn_server_custom
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:03:51 90917 90916 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist [KILLED]
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:03:36 90917 90916 admin    S N   2972  0.7   0  0.0 nvram get custom_clientlist
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:03:36 87093 86301 admin    S     2972  0.7   1  0.0 nvram get vpn_server_custom
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:03:19 87729 87728 admin    S N   2972  0.7   1  0.0 nvram get custom_clientlist [KILLED]
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:03:04 87729 87728 admin    S N   2972  0.7   1  0.0 nvram get custom_clientlist
StuckProcCmds_00006_05344.TRC.txt:2024-01-26 03:03:04 87093 86301 admin    S     2972  0.7   1  0.0 nvram get vpn_server_custom
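To get a quick overview of a trace like the one above, the killed commands can be tallied per NVRAM variable. This is only a sketch that assumes the line layout shown in the grep output (the command is the last thing on the line, followed by `[KILLED]`); the function name `count_killed` is made up for illustration.

```shell
#!/bin/sh
# count_killed (hypothetical helper): reads trace lines on stdin and prints
# "<count> <nvram variable>" for every "nvram get" command marked [KILLED].
count_killed() {
    grep -F '[KILLED]' | awk '
    {
        # The number of columns varies (e.g. "S" vs "S N"), so search for
        # the words "nvram get" instead of relying on a fixed position.
        for (i = 1; i < NF; i++)
            if ($i == "nvram" && $(i + 1) == "get")
                killed[$(i + 2)]++
    }
    END { for (v in killed) print killed[v], v }
    ' | sort -rn
}

# Usage on the router, from inside the Trace directory:
#   cat *.txt | count_killed
```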
 
Looking at the source code, AFAICT the only routers that have their own prebuilt module are RT-AX88U, XT12, GT-AX11000, RT-AX56U, RT-AX58U, RT-AX68U, RT-AX86U, GT-AX6000, GT-AXE11000 and RT-AC68U_V4. If I'm reading it right all the other routers share a different common module. Given how popular the RT-AC68U is I wonder why we're not getting reports of stuck processes on that model.


Yes I believe that's the case.
Hi,
since this post is from 2022, I wonder if the issue is really one from the past / old routers...
Or in other words - if I get myself a new BE-version, can I assume that the device won't suffer from the same issues?
 
@dave14305 If you have the time, would you mind running the following script on your RT-AC86U? It should print out a list of active netlink socket numbers that don't have a matching pid. My router is very minimal and doesn't run things like AiProtection, so I'm curious to see if you have a lot more mismatched netlink sockets than I do (6 x 2 = 12).

Code:
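#!/bin/sh
# List netlink sockets (protocol 31) whose number has no matching live process.

sort -nk3 /proc/net/netlink | \
awk '
BEGIN {
   print "\nPrint netlink sockets for which there is no process with the same number\n"
   # pid_max is the upper bound for real process ids on this system
   getline pid_max < "/proc/sys/kernel/pid_max"
}
{
   # Column 2 is the netlink protocol, column 3 the socket number (normally a pid)
   if ( $2 == "31" ) {
      if ( $3 < pid_max )
         # kill -0 sends no signal; it only tests whether the process can be signalled
         system("kill -0 " $3 " 2>/dev/null || echo \"Process " $3 " not found\"")
      else {
         # Numbers above pid_max cannot be pids, so guess the pid they were derived from
         orig_pid = $3 - pid_max - 2
         system("kill -0 " orig_pid " 2>/dev/null || echo \"Process associated with " $3 " not found (" orig_pid ")\"")
      }
   }
}
END { print "\nA number in brackets is a *guess* at an associated process\n" }
'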
EDIT: Removed some unnecessary experimental code from script (just in case it confuses people).
No comments on the quality of my coding please. :)

P.S. Not that this achieves anything other than satisfying my curiosity. :D
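For anyone trying to follow the `orig_pid` line in that script: socket numbers below `pid_max` are treated as pids directly, and anything larger is mapped back to a candidate pid by subtracting `pid_max` and 2. A small sketch of just that arithmetic, using a made-up helper name (`guess_pid`), a `pid_max` of 4194304 and a socket number reported later in this thread:

```shell
#!/bin/sh
# guess_pid NETLINK_ID PID_MAX  (hypothetical helper, mirrors the awk logic)
guess_pid() {
    nl_id=$1
    pid_max=$2
    if [ "$nl_id" -lt "$pid_max" ]; then
        # Small enough to be a real pid: use it as-is
        echo "$nl_id"
    else
        # Same guess as the awk script: subtract pid_max and 2
        echo $((nl_id - pid_max - 2))
    fi
}

guess_pid 4195629 4194304   # prints 1323
```

As the script itself says, the result is only a guess at the associated process.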

I wonder if it would be somehow possible to write a script or base it on the quoted one which identifies orphans and cleans them up so that nvram requests won't get stuck.

I read a lot about the issue but am struggling to really understand how things hang together and how to work around them, if possible at all.

I'm helping 100%-no-nerd-friends on their farm with an AiMesh setup and unfortunately, we have some AC86U routers there.
Those regularly run into that nvram condition.
My friends have 4 AC86U and it would be quite an investment to completely replace them...

So I'm trying to find a workaround.

I did the dumbass approach to check every 5 minutes via 'nvram get cfg_device_list' whether it hangs and if so, reboot the router... not considering that executing the nvram call every 5 minutes could actually worsen the issue 😜...
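The check described above can be sketched roughly as follows. It is not the script discussed elsewhere in this thread, just a minimal illustration of running a command with a time limit in plain `sh`; the function name `run_with_timeout` and the exit code 124 are my own choices, and it does nothing about the concern that polling NVRAM might itself make hangs more likely.

```shell
#!/bin/sh
# run_with_timeout SECONDS CMD [ARGS...]  (hypothetical helper)
# Runs CMD in the background and polls once per second. Returns CMD's exit
# status, or 124 if CMD was still running after SECONDS and had to be killed.
run_with_timeout() {
    rwt_secs=$1
    shift
    "$@" >/dev/null 2>&1 &
    rwt_pid=$!
    rwt_elapsed=0
    while kill -0 "$rwt_pid" 2>/dev/null; do
        if [ "$rwt_elapsed" -ge "$rwt_secs" ]; then
            kill -9 "$rwt_pid" 2>/dev/null
            wait "$rwt_pid" 2>/dev/null || true
            return 124
        fi
        sleep 1
        rwt_elapsed=$((rwt_elapsed + 1))
    done
    wait "$rwt_pid"
}

# Usage on the router (the reaction to a hang is up to you):
#   run_with_timeout 10 nvram get cfg_device_list || echo "nvram looks stuck"
```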

So I'm looking for a different approach to either regularly clean up so that it does not lock up in the first place or identify lockups and reboot without increasing their probability.

I'm a bit stuck here and any help is very much appreciated...
Also, I have to admit that I don't understand the details of what the quoted script does and why and if I could somehow use it...

Thx
Markus

PS: echo 4194304 > /proc/sys/kernel/pid_max is already in place.
 
The RT-AX86U doesn't suffer from this problem.
Sorry, a typo... it's 4 AC86U...

Is it only the AC86U which suffers from the issue?
I ask because you wrote in one of your posts that others (incl. AX86U) also share the same prebuilt modules ('Looking at the source code, AFAICT the only routers that have their own prebuilt module are RT-AX88U, XT12, GT-AX11000, RT-AX56U, RT-AX58U, RT-AX68U, RT-AX86U, GT-AX6000, GT-AXE11000 and RT-AC68U_V4. ')...
 
with the result that the router at some point stops working (at least partly) / needs a reboot, like the AC86U?
No, pretty much carries on regardless.

The hangs I see are in my backup script that runs overnight: the
"nvram save /tmp/mnt/NAS/public/BACKUPS/xxxxx" command hangs.

It just sits there forever if I don't do anything.
The backup still runs fine the next day.

The only time I've had issues with the router failing is when dnsmasq was restarted by the watchdog. It hung on each attempt and as a result no new clients could connect to the router. That needed a reboot to sort it.
 
@Viktor Jaep ...
The script does not seem to work (for me)...

I did the following:
1. I ran 'nvram get cfg_device_list' successfully
2. I provoked a hang (seems to be reproducible when restarting AiMesh nodes while frantically reloading the AiMesh page in the main node admin)
3. Confirmed that 'nvram get cfg_device_list' now hangs when executed from shell

When this happens, after logging in to the admin ui, the ui loads forever until getting a timeout.

The script from this post reports

Code:
Print netlink sockets for which there is no process with the same number

Process 1323 not found
Process 1343 not found
Process 1346 not found
Process 1349 not found
Process 1520 not found
Process 1816 not found
Process 32771 not found
Process associated with 4195629 not found (1323)
Process associated with 4195649 not found (1343)
Process associated with 4195652 not found (1346)
Process associated with 4195655 not found (1349)
Process associated with 4195826 not found (1520)
Process associated with 4196122 not found (1816)

A number in brackets is a *guess* at an associated process

while the Kill Stuck Proc Cmds script tells me:
Code:
FOUND: [0]

What am I missing?

Thanks :)
 
You need to run it twice in a row. It's meant to run every 5 minutes from cron. See if that does the trick.
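For reference, a sketch of what a 5-minute cron entry could look like, assuming Asuswrt-Merlin's `cru` helper is available. The job name and the script path are placeholders only, not the script's real install location.

```shell
#!/bin/sh
# Placeholders: adjust to wherever the script actually lives.
JOB_ID="KillStuckCmds"                      # hypothetical job name
SCRIPT="/jffs/scripts/KillStuckCmds.sh"     # hypothetical path

if command -v cru >/dev/null 2>&1; then
    # cru a <job id> "<min hour day month weekday command>"
    cru a "$JOB_ID" "*/5 * * * * $SCRIPT"
    cru l    # list the current jobs to confirm the entry was added
else
    echo "cru not found; add this line to the crontab by hand:"
    echo "*/5 * * * * $SCRIPT"
fi
```

Jobs added this way typically do not survive a reboot, so the `cru a` line is usually repeated from a startup script such as /jffs/scripts/services-start.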
 
Yep, you're right.
When I see my command hang and run the script twice, the hanging nvram command gets killed.
I have now installed the script as a cron job as recommended and will observe over the next few days whether that makes a difference.
Thanks for that :)

The only thing which still bothers me now is that the admin UI keeps hanging in these cases, no matter what.
Also when I do a 'service restart_httpd', it does not make a difference.
The only way to help seems to be a reboot...
I have to admit, though, that I tested that before installing the cron.

In other words: when my nvram command hangs, the admin UI gets timeouts, too... the script does kill my nvram command but does not fix the ui.
Also, the script reports zero hanging commands.

Can you make any sense of this?
Thanks, Markus
 
Not quite sure why that would be happening... but perhaps increasing that pid_max value might help it occur less frequently? See instructions below:

 
