My experience with the RT-AC86U

Here's some log output from running the test loop on tighter timings.
It reported 5 retries, while the average is 2-3.

Running time appears unaffected: 5:07 min per 3000 loop iterations.

I switched to these versions because the other one, which uses timeout, falls apart if I unplug the USB drive.
I experienced it during the shutdown process: I unmounted the USB device and, a few seconds later, before I could execute sync;halt, I was flooded with errors.
If we could somehow compile a static binary for timeout, then maybe it could be stored in JFFS. That is a lot of work, though.
 
If you guys want to try an alternative approach that doesn't involve scripts, you could try the following tweak. This won't fix the issue, but in theory (based on the balance of probability) it ought to reduce the frequency of the problem occurring by a factor of ~64 (my initial estimate was 128).
Code:
echo 4194304 > /proc/sys/kernel/pid_max
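If you want to be a bit more careful than a bare echo, here is a minimal sketch that only raises the value and never lowers it. The helper name raise_pid_max is made up for illustration; it takes the file path as an argument so it can be tried on a scratch file first. Note that a value written under /proc/sys does not survive a reboot.

```shell
#!/bin/sh
# Hypothetical helper (not part of any firmware): raise the value stored
# in FILE to TARGET, but only if the current value is lower.
raise_pid_max() {
    file="$1"
    target="$2"
    current=$(cat "$file") || return 1
    if [ "$current" -lt "$target" ]; then
        echo "$target" > "$file" || return 1
    fi
    # Print the value that is now in effect.
    cat "$file"
}

# On the router the call would presumably look like this (run as root):
#   raise_pid_max /proc/sys/kernel/pid_max 4194304
```

Because the setting is lost on reboot, the call would have to be repeated from a startup script to stay in effect.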
 
So let me see if I understand you correctly: by increasing the pool of PIDs, you hope that it will somehow add more randomness to this occurrence? And more randomness means it is less likely to occur?

For anyone who, like me, has not been keeping score properly, here is a link that will provide you with all the information on @ColinTaylor's recommendation.


The theory being that increasing this size will mean we are less likely to see these occurrences.

However, one caveat that must be considered is:

Please note that this hack is only useful for a large and busy server; don’t try this on an old kernel or on desktop systems.

Which may actually be perfect for our case!

Here is another interesting read.....

 
Correct.

In my case (if I had an RT-AC86U) there are 12 "problem pids". 6 of these are in the 1000-3000 range and the other 6 are in the 33000-35000 range (which is beyond pid_max). When the system's current pid number is past the first 6 problem pids, everything should be fine until it hits pid_max and loops back to the beginning. This time round the loop, when the current process's pid matches one of the 6 problem pids, there's a chance that the process is making an nvram call. This will cause it to hang. If it doesn't make an nvram call, it's not a problem.

So by increasing pid_max from 32768 to 4194304 it takes 128 times longer to restart the loop. However, while the loop is much bigger it now also encompasses pids 33000-35000 which it didn't before. As I don't have any intensive add-on scripts that would churn through pids I estimate it would take my router about 24 days to restart each loop.
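The arithmetic behind that estimate can be sketched as below. This is only one reading of the numbers in this post: the variable names are made up, and it assumes every problem pid is equally likely to cause a hang.

```shell
#!/bin/sh
# Values taken from the post above.
old_max=32768
new_max=4194304

# How much longer one full pass through the pid pool takes.
loop_factor=$((new_max / old_max))

# Problem pids reachable per pass: 6 with the old limit (1000-3000 range),
# 12 with the new limit (the 33000-35000 range is now inside the pool).
pids_before=6
pids_after=12

# Each pass is loop_factor times longer but hits twice as many problem pids.
reduction=$((loop_factor * pids_before / pids_after))

echo "loop is ${loop_factor}x longer"
echo "estimated reduction: ~${reduction}x"
```

With these inputs the loop factor works out to 128 and the estimated reduction to 64.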
 
My observation with conn_diag though, is that the spawned wl command is very close to the pid of conn_diag, and not necessarily the next one past the high-water mark. Even restarting conn_diag generates a new process in the “middle” of existing pids for me. YMMV.
 
Yes, this is expected. It won't be past the high-water mark. It doesn't really matter where in the pid pool it is, just that the size of the pool is much bigger.
 
I ran the script from the first post on a GT-AC2900 and it stops before 10,000. I ran it multiple times; it stops anywhere between 500 and 5500.
 
I think one thing people fail to realize with loops is that one iteration could still be running before the other has fully completed. So the real question is: is nvram simply locking up because it is overwhelmed and has not finished with the previous iteration?
 
I'm still catching up with latest comments, but meantime:
@SomeWhereOverTheRainBow, could you explain what the time intervals from your version are actually doing? I.e., where is this time interval used and what happens when it expires?
Looks like there's hardly any difference if it's 10 or 50 - at least I can't see it.
 
Doing this, it gets all the way to 10001. Don't really understand what all this means, though, but I guess the GT-AC2900 doesn't have an issue?
This could very well mean that:
1) the GT-AC2900 does exhibit the same defect;
2) the wrapper script that solves the problem for the AC86U also solves it for the GT-AC2900.

Since you are the first and only person so far to report the nvram bug for this model, I'd say more testing and cases are needed.

I should probably reorganize my first-page posts, with fewer words and clearer info, then put the details in separate posts for whoever wants to read them. Or something of this nature.
 
To be honest, I wrote the script with the intention that the interval would serve as a fail-safe for a fail-safe. You would need multiple lockups recurring all at once for it to ever trigger the higher intervals. I did it this way because I have no way to optimize the interval, so it covers both the best case and the worst case. Feel free to use the script as is or modify it however you like. I made it with that in mind, and also to help users suffering from this deadlock condition.
 
That's fine, I just don't know this syntax. I.e., what is this interval, in what unit of measure, when does the count start and what if it runs out? Is that when the process is killed?
I could try to adjust it but I don't understand what to look for.
 
If you feel it will serve fine to remove all the other intervals then do that. It is more for your adjustability than mine.
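For what it's worth, going by the run_cmd function in the script posted in this thread, the interval looks like a count of polling passes rather than a time in seconds: the counter starts when the command is launched in the background, goes up by one each time the loop checks whether the child is still alive, and the child is killed once the counter exceeds the limit. Below is a stripped-down sketch of that idea. The name run_with_poll_limit is made up for illustration and this is not the wrapper itself.

```shell
#!/bin/sh
# Hypothetical helper: run a command in the background and poll it.
# LIMIT is a number of polling passes, not a number of seconds.
run_with_poll_limit() {
    limit="$1"; shift
    "$@" &
    child="$!"
    polls=0
    # kill -0 only tests whether the child still exists. The command
    # substitution mirrors the construct used in the posted wrapper.
    while [ "$(kill -0 "$child" 2>/dev/null; printf '%s' "$?")" = "0" ]; do
        polls=$((polls + 1))
        if [ "$polls" -gt "$limit" ]; then
            # Limit exceeded: kill the child and report failure.
            kill -s KILL "$child" 2>/dev/null
            wait "$child" 2>/dev/null
            return 1
        fi
    done
    wait "$child" 2>/dev/null
    return 0
}
```

Calling it as run_with_poll_limit 5 sleep 30 should return 1 almost immediately, because five passes take far less than 30 seconds. How long a given limit lasts in wall-clock time depends on how fast the router runs the loop, which may be why 10 and 50 look about the same in practice.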
 
Here it is at just 10:

Bash:
#!/bin/sh

# copy original nvram executable to /tmp
cp /bin/nvram /tmp/_nvram

# create nvram wrapper that calls original nvram executable in /tmp
cat << 'EOF' > /tmp/nvram
#!/bin/sh
#set -x # comment/uncomment to disable/enable debug mode
# required for serialization when reentry is possible
LOCK="/tmp/$(basename "$0").lock"
acquire_lock() { until mkdir "$LOCK" >/dev/null 2>&1; do touch /tmp/nvram; done; }
release_lock() { rmdir "$LOCK" >/dev/null 2>&1; }

# one instance at a time
acquire_lock

# catch premature exit and cleanup
trap 'release_lock; exit 1' HUP INT TERM

# make the new function accessible
#export PATH=/opt/bin:/opt/sbin:$PATH

# clear rc variable
rc=""

# keep count of total session usage
if [ ! -f "/tmp/nvramuse" ]; then
   echo 0 > /tmp/nvramuse
fi
usecount=$(cat /tmp/nvramuse)
usecount=$((usecount + 1 ))
echo $usecount > /tmp/nvramuse

INTERVAL="10"
MAXCOUNT="3"
run_cmd () {
    local to
    local start
    local child
    # here as the interval number increases, the longer we wait.
    to="$1"
    to="$((to*INTERVAL))"; shift
    "$@" & child="$!"; start=0
    touch /tmp/nvram
    while { [ "$(kill -0 $child >/dev/null 2>&1; printf "%s" "$?")" = "0" ] && [ "$start" -le "$to" ]; }; do
        # to account for killing too soon, as the number of tries required increases our count requirement increases before we attempt to kill the process.
        touch /tmp/nvram
        start="$((start+1))"
        if [ $start -gt $to ]; then
            kill -s 9 $child 2>/dev/null
            wait $child
            return 1
        fi
    done
    return 0
}

# make the new function accessible, on the first run we want to exit right away if successful.
i="1"
if { run_cmd "$i" /tmp/_nvram "$@"; }; then rc="0"; else rc="1";fi

logger -t "nvram-override" "Executed nvram $@, use count: $usecount, exit status: $rc"

# here we add an interval check and allow up to 3 retries.
while [ "$i" -le "$MAXCOUNT" ] && [ "$rc" != "0" ]; do
  touch /tmp/nvram
  if { run_cmd "$i" /tmp/_nvram "$@"; }; then
    rc="0";
  else
    rc="1";
    errcount="0";
    if [ -f "/tmp/nvramerr" ]; then errcount=$(cat /tmp/nvramerr); fi
    errcount=$((errcount + 1 ));
    echo $errcount > /tmp/nvramerr;
    logger -t "nvram-override" "Error detected at use count: $usecount, error count: $errcount";
    logger -t "nvram-override" "Couldn't execute nvram $@, exit status: $rc (124=timeout)";
  fi
  logger -t "nvram-override" "Retried executing nvram $@, attempt ${i}/${MAXCOUNT}, exit status: $rc";
  i="$((i+1))";
done
[ "$rc" -eq "1" ] && logger -t "nvram-override" "NVRAM remained locked too long; continuing anyway."
# any concurrent instance(s) may now run
release_lock
exit $rc
EOF
chmod +x /tmp/nvram

# replace nvram in /bin w/ nvram wrapper in /tmp
mount -o bind /tmp/nvram /bin/nvram

@Oracle, while I respect your attempts to capture the logs and error statistics, is there a way to do it that does not slow down the actual processing of the script? For example, how fast does the script go without all the logging and error-tracking statistics?

Bash:
#!/bin/sh

# copy original nvram executable to /tmp
cp /bin/nvram /tmp/_nvram

# create nvram wrapper that calls original nvram executable in /tmp
cat << 'EOF' > /tmp/nvram
#!/bin/sh
#set -x # comment/uncomment to disable/enable debug mode
# required for serialization when reentry is possible
LOCK="/tmp/$(basename "$0").lock"
acquire_lock() { until mkdir "$LOCK" >/dev/null 2>&1; do touch /tmp/nvram; done; }
release_lock() { rmdir "$LOCK" >/dev/null 2>&1; }

# one instance at a time
acquire_lock

# catch premature exit and cleanup
trap 'release_lock; exit 1' HUP INT TERM

# make the new function accessible
#export PATH=/opt/bin:/opt/sbin:$PATH

# clear rc variable
rc=""

# keep count of total session usage
#if [ ! -f "/tmp/nvramuse" ]; then
#  echo 0 > /tmp/nvramuse
#fi
#usecount=$(cat /tmp/nvramuse)
#usecount=$((usecount + 1 ))
#echo $usecount > /tmp/nvramuse

INTERVAL="10"
MAXCOUNT="3"
run_cmd () {
    local to
    local start
    local child
    # here as the interval number increases, the longer we wait.
    to="$1"
    to="$((to*INTERVAL))"; shift
    "$@" & child="$!"; start=0
    touch /tmp/nvram
    while { [ "$(kill -0 $child >/dev/null 2>&1; printf "%s" "$?")" = "0" ] && [ "$start" -le "$to" ]; }; do
        # to account for killing too soon, as the number of tries required increases our count requirement increases before we attempt to kill the process.
        touch /tmp/nvram
        start="$((start+1))"
        if [ $start -gt $to ]; then
            kill -s 9 $child 2>/dev/null
            wait $child
            return 1
        fi
    done
    return 0
}

# make the new function accessible, on the first run we want to exit right away if successful.
i="1"
if { run_cmd "$i" /tmp/_nvram "$@"; }; then rc="0"; else rc="1";fi

#logger -t "nvram-override" "Executed nvram $@, use count: $usecount, exit status: $rc"

# here we add an interval check and allow up to 3 retries.
while [ "$i" -le "$MAXCOUNT" ] && [ "$rc" != "0" ]; do
  touch /tmp/nvram
  if { run_cmd "$i" /tmp/_nvram "$@"; }; then
    rc="0";
  else
    rc="1";
    #errcount="$rc";
    #if [ ! -f "/tmp/nvramerr" ]; then echo 0 > /tmp/nvramerr; else errcount=$(cat /tmp/nvramerr); fi
    #errcount=$((errcount + 1 ));
    #echo $errcount > /tmp/nvramerr;
    #logger -t "nvram-override" "Error detected at use count: $usecount, error count: $errcount";
    #logger -t "nvram-override" "Couldn't execute nvram $@, exit status: $rc (124=timeout)";
  fi
  #logger -t "nvram-override" "Retried executing nvram $@, attempt ${i}/${MAXCOUNT}, exit status: $rc";
  i="$((i+1))";
done
#[ "$rc" -eq "1" ] && logger -t "nvram-override" "NVRAM remained locked too long; continuing anyway."
# any concurrent instance(s) may now run
release_lock
exit $rc
EOF
chmod +x /tmp/nvram

# replace nvram in /bin w/ nvram wrapper in /tmp
mount -o bind /tmp/nvram /bin/nvram
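In case the serialization part is unclear: both versions rely on mkdir as the lock, since creating a directory either succeeds or fails in one step. Here is a small stand-alone sketch of just that part. The lock path is a made-up demo path, not the one the wrapper uses.

```shell
#!/bin/sh
# Hypothetical demo path; the wrapper derives its lock name from $0 instead.
LOCK="${TMPDIR:-/tmp}/demo-nvram.$$.lock"

# mkdir fails if the directory already exists, so only one caller can win.
acquire_lock() { until mkdir "$LOCK" 2>/dev/null; do :; done; }
release_lock() { rmdir "$LOCK" 2>/dev/null; }

acquire_lock

# A second attempt while the lock is held must fail.
if mkdir "$LOCK" 2>/dev/null; then second="acquired"; else second="blocked"; fi

release_lock

# After release, the lock can be taken again.
if mkdir "$LOCK" 2>/dev/null; then
    third="acquired"
    rmdir "$LOCK"
else
    third="blocked"
fi

echo "while held: $second, after release: $third"
```

One consequence of this design is that a wrapper killed with SIGKILL leaves the directory behind, and every later caller would spin in acquire_lock until it is removed by hand.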
 
I have decided to end my experience with RT-AC86U. Goes for recycling.

On my way back home picked a new toy for Asuswrt-Merlin experiments though:

[Attached image: 1655768596873.png]
 

Probably the whole reason they went with the AX line... too many issues they weren't able to fix with software. I'm not far behind you... ;)
 
RT-AX86U is up and running, but it has weaker signal to my test AC client behind 2 walls.

RT-AC86U - 585/585
RT-AX86U - 390/390

I got this one just to play with it. It has a surgery procedure scheduled in the coming weeks. :D
 

That's alright... I just need a hardline into it for our general purposes. Its wifi signal is just for me to play with...
 