Stuck commands

We can't run custom scripts on the stock firmware, can we?
For example the test loop that executes thousands of nvram get commands.
Well you could manually run a script but it's not part of the router's normal operation so I wouldn't bet on getting much help from Asus.
 
I noticed when conn_diag has a stuck wl command, the AiMesh page shows zero stats on the Network tab. After killing the hung wl, the stats start refreshing every minute or so. I don’t usually look at that page, but it is a potential angle to report to Asus.
 
Ok, I could use that to report the bug.
How do you see the stuck commands on the stock firmware?
I only got to see them after installing htop.
 
Statistically, the output of pidof wl nvram should be empty if everything is normal.
Thanks.
So this can also be used with the original firmware?

Btw, this is always empty on my router because the override scripts don't allow any wl or nvram command to get stuck.

I now also have the pid_max override. Placing it in init-start made a difference for the wl errors - I see a significant rate decrease.
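
For anyone wanting to try the same thing, here is a minimal sketch of what a pid_max override could look like. The value 4194304 is the one mentioned later in this thread. The function name `raise_pid_max` and the overridable `PID_MAX_FILE` variable are assumptions added so the logic can be tried against a scratch file first. It is not the exact override used above.

```shell
#!/bin/sh
# Path of the kernel setting; overridable so the logic can be tested
# against an ordinary file instead of the live kernel value.
PID_MAX_FILE="${PID_MAX_FILE:-/proc/sys/kernel/pid_max}"

# Raise pid_max to "target" only if the current value is lower,
# then print the value now in effect.
raise_pid_max() {
    target=$1
    current=$(cat "$PID_MAX_FILE") || return 1
    if [ "$current" -lt "$target" ]; then
        echo "$target" > "$PID_MAX_FILE" || return 1
    fi
    cat "$PID_MAX_FILE"
}

# On the router, called from /jffs/scripts/init-start:
# raise_pid_max 4194304
```

The setting does not survive a reboot, which is why it needs to live in a startup script such as init-start.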

I don't know what the orphan processes from Colin's script mean. Are these also stuck commands, or is it normal to have them?
 
It seems that with the pid_max setting and the 2 overrides, there is nothing interesting to see.
The list of process numbers from Colin's script is unchanged for 2 and a half days uptime.
My log indicates no nvram errors and only 2 wl errors which were handled.

All of this also confirms our workaround solution is effective.
 
Question directed toward @RMerlin:

Why do you run these strip commands only for the AC86U/GT-AC2900 prebuilts when merging a new GPL? Why not other platforms?


Since the problems in this thread involve prebuilt commands on the AC86U, this stuck out as a unique aspect to this model.
 
Because these weren't properly stripped in the GPL drops Asus provided me for these two models in the past (hnd_extra is handled differently on this older SDK), that's why I added strip commands specific to these. I never checked if they have since fixed the issue on their end.
 
Is there a way to determine whether a router is experiencing this issue by using command line commands in an OEM FW or as-installed Merlin FW?
 
One of the signs I use is to look at the files in /jffs/.sys/diag_db/. Normally the files are updated about every minute. When this issue happens, the file dates stop updating.

I am using the script by @Martinski in this post to kill the stuck process.

Post in thread 'My experience with the RT-AC86U'
https://www.snbforums.com/threads/my-experience-with-the-rt-ac86u.79290/post-768723
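
The file-date check described above could be automated roughly as follows. This is a sketch only: the function name `stale_files` and the 5-minute threshold are assumptions, and it relies on `find -mmin`, which the BusyBox build on a given firmware may or may not include.

```shell
#!/bin/sh
# List regular files under "dir" whose last modification is more than
# "max_age_min" minutes old. Empty output means every file is fresh.
stale_files() {
    dir=$1
    max_age_min=$2
    find "$dir" -type f -mmin +"$max_age_min" 2>/dev/null
}

# On the router (illustrative threshold of 5 minutes):
# stale_files /jffs/.sys/diag_db 5
```

If files are listed, conn_diag has probably stopped writing, which is the symptom to follow up with the stuck-process script linked above.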
 
Thanks. I have two AC86U networks, and each network uses two AC86Us in a main/AP configuration. The networks are configured almost exactly the same. The only difference is the OVPN configuration with one network set up as a server and the other as a client to link the two networks over WAN.

For my home network, the .db files in /jffs/.sys/diag_db/ had modification dates that were nearly a month old, and running Martinski's script gave these results:

home main router:
Code:
FOUND_03449: [2]
START_03449: [check-stuck-proc-cmds.sh]
FOUND_03449: [2]
2510  2509 xxxx   S     3504  0.8   1  0.0 /usr/sbin/wl -i eth5 nrate
2509  2027 xxxx   S     3324  0.7   1  0.0 sh -c /usr/sbin/wl -i eth5 nrate
2027     1 xxxx   S    16832  3.9   1  0.0 conn_diag
FOUND_03449: [0][ 2510  2509 xxxx   S     3504  0.8   1  0.0 /usr/sbin/wl -i eth5 nrate]
FOUND_03449: [2]
EXIT_03449: OK.

home AP router:
Code:
FOUND_04803: [2]
START_04803: [check-stuck-proc-cmds.sh]
FOUND_04803: [2]
1104  1099 xxxx   S     3504  0.8   1  0.0 /usr/sbin/wl -i eth5 noise
1099  2196 xxxx   S     3324  0.7   1  0.0 sh -c /usr/sbin/wl -i eth5 noise
2196     1 xxxx   S    16836  3.9   1  0.0 conn_diag
FOUND_04803: [0][ 1104  1099 xxxx   S     3504  0.8   1  0.0 /usr/sbin/wl -i eth5 noise]
FOUND_04803: [2]
EXIT_04803: OK.

So both of those routers are suffering from this issue.

Now for the routers at my cabin, the story is a little different. The .db files in my cabin main router were up-to-date, and that router has been running for a month. The .db files in my cabin AP router were a month old like my home routers. Running Martinski's script gave:

cabin main router:
Code:
FOUND: [0]

cabin AP router:
Code:
FOUND_10678: [2]
START_10678: [check-stuck-proc-cmds.sh]
FOUND_10678: [2]
1332  1331 xxxx   S     3504  0.8   1  0.0 /usr/sbin/wl -i eth5 noise
1331  2192 xxxx   S     3324  0.7   1  0.0 sh -c /usr/sbin/wl -i eth5 noise
2192     1 xxxx   S    16836  3.9   1  0.0 conn_diag
FOUND_10678: [0][ 1332  1331 xxxx   S     3504  0.8   1  0.0 /usr/sbin/wl -i eth5 noise]
FOUND_10678: [2]
EXIT_10678: OK.

So one out of the four AC86U routers did not exhibit this issue after 30 days of continuous operation. Not sure if that means anything.

I've had both these networks in operation for 3+ years with no real issues. Everything I need works well except for the Client List freezing, but I have a fix for that now. So the question is what issues does this bug cause? Tech9 suggested that the frozen Client List issue could be a manifestation of this bug, which seems reasonable. I guess there are other things, but apparently they are things that I haven't needed?

Now if I want to try to remedy this issue, which of the SomeWhereOverTheRainbow/Oracle scripts should I try? And I guess I should also increase pid_max? Assuming that ASUS never fixes this, then these scripts would be configured to run in services-start?
 
Tech9 suggested that the frozen Client List issue could be a manifestation of this bug, which seems reasonable.

Not sure anymore, because I've seen it stuck on the AX86U along with Web History. The Client List is a mystery box: what it detects on the AC86U is completely missing on the AX86U. Devices attached to a wireless bridge are one example. The AC86U on stock 386_48260 sees them all and lists them correctly (the bridge under wireless, with its attached devices under wired), while the AX86U on any firmware sees only the bridge (listed under wireless, with the devices appearing only in the DHCP leases list). Go figure.
 
What issues does this bug cause? I used to have horrible issues with my scripts getting locked up, sometimes a few times a day. I mean the router would continue running, but my tools were in a locked state... so it was just a real PITA. It would especially suck if my vpn went down, and my vpnmon-r2 tool wasn't able to recognize it went down because it got hung up. That was all until @eibgrad suggested using the "timeout" command before calling commands within the script, even simple low-level "nvram get" statements. That seemed to do wonders, and prevented script lockups. I still had to use the check-stuck-proc-cmds.sh script on a regular interval to unlock other scripts/programs that were locking up. There was a time when we were trying to get bug reports back over to Asus, but I don't think it ever went anywhere, since it only seems to be impacting the AC86U.
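
To illustrate the "timeout" idea, here is a minimal sketch of a wrapper a script could use around commands that may hang. The function name `run_or_default` and the fallback-value approach are assumptions, not the actual code in vpnmon-r2. It also assumes the `timeout SECONDS CMD` form; some older BusyBox builds expect `timeout -t SECONDS CMD` instead.

```shell
#!/bin/sh
# Run a command with a time limit. Print its output if it succeeds,
# or print "fallback" if it fails or is killed by the timeout.
run_or_default() {
    limit=$1
    fallback=$2
    shift 2
    # The assignment's exit status is that of the command substitution,
    # so a timeout or a failing command takes the else branch.
    if out=$(timeout "$limit" "$@" 2>/dev/null); then
        echo "$out"
    else
        echo "$fallback"
    fi
}

# On the router (hypothetical usage; the nvram key is illustrative):
# state=$(run_or_default 2 "unknown" nvram get wan0_state_t)
```

The calling script keeps running with a known fallback value instead of blocking forever on a hung nvram or wl call.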
 
After waiting approximately 28 hours, I ran check-stuck-proc-cmds again on my home network main and AP routers, and found zero stuck commands/processes. I have no doubt that this is a real bug with consequences, but the severity of the bug for typical router usage, e.g., no scripts or Entware, may be fairly low. I'll update again after a few more days of checking for stuck processes. It seems that if I do see wl stuck again, raising pid_max may be a simple solution for my routers.
 
Update:
  • Still no hung nvram.
  • Seems like nvram is not going to hang, and if it ever does, my impression is that raising pid_max to 4194304 will stop nvram from hanging, so I'll use that approach if needed.
  • wl finally hung again.
  • I decided to implement the wl-override script written by Oracle and SomewhereOverTheRainbow in this thread: https://www.snbforums.com/threads/my-experience-with-the-rt-ac86u.79290/post-769491. I commented out the command that has the script write a log entry every time wl is called. With the script running, there is no significant impact on CPU usage. It's perhaps ticked up maybe 2% for both CPUs. Seems like a great script for what it's meant to accomplish.
It's not clear to me that wl-override is 100% effective at stopping wl from hanging, but my overall impression is that implementing these two approaches will effectively resolve the nvram and wl hang issues on AC86Us with zero impact on router performance.
 
Did you try my proof-of-concept script that is guaranteed to hang, or your money back?
 
I don't recall seeing your script, but my impression is that the Oracle/Somewhere script was not advertised to prevent all hangs. And as I mentioned a few posts earlier, I'm not sure what I've done is even necessary for my needs. I'm just curious to see what works, what doesn't, and if any of these changes make a difference to my AC86U functionality. Nevertheless, I'm curious to try your script. Can you point me to it?
 
I will dig it up from the bowels of hell where it came from ... But give me until the morning please. I would definitely be curious with the changes you made. ;)
 
