I have 3 separate networks that must be available 24/7 without anyone attending them. For years now, they have been checking each other and their Internet connectivity in several ways and fixing all problems on their own.
lan.1 is the master; at a set interval it uses ssh to touch a file at lan.2 and lan.3. This test is new and has been in use for 3 months. Apart from this test, ssh is used into and out of each location hundreds of times every day. lan.1 and lan.2 are permanently connected by VPN (both ways), while lan.3 is only accessible by ssh on a nonstandard port using a key.
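A minimal sketch of such a heartbeat check, not the actual ddns-start code. The function name `run_check`, the log path, and the hosts, port and key path in the usage comments are placeholders. It records the exit code and the last line of stderr for every attempt, which may help tell a timeout from a refused connection or a key problem.

```shell
#!/bin/sh
# Hypothetical heartbeat wrapper; names and paths are placeholders.
HB_LOG="${HB_LOG:-/tmp/heartbeat.log}"

# run_check NAME CMD [ARGS...]
# Runs CMD, appends one line to $HB_LOG with a timestamp, NAME, the exit
# code and the last line of CMD's stderr, then returns CMD's exit code.
run_check() {
    hb_name=$1
    shift
    # Capture stderr only; stdout of the command is discarded.
    hb_err=$("$@" 2>&1 >/dev/null)
    hb_rc=$?
    hb_last=$(printf '%s\n' "$hb_err" | tail -n 1)
    printf '%s %s rc=%s %s\n' "$(date '+%b %d %H:%M:%S')" \
        "$hb_name" "$hb_rc" "$hb_last" >>"$HB_LOG"
    return "$hb_rc"
}

# Example calls (hosts, port and key path are made up):
# run_check lan.2 ssh admin@10.8.0.2 touch /tmp/heartbeat
# run_check lan.3 ssh -p 2222 -i /jffs/.ssh/id_key admin@lan3.example.net touch /tmp/heartbeat
```

With a log like this, a failed run leaves behind the ssh client's own error message instead of just a missing file on the remote side.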
What can cause outgoing ssh from the router at lan.1 towards the other two to stop working after 3 months of normal operation? I repeated the tests 10+ times over half an hour.
The only difference in the log at the time, which I cannot explain, is this (repeated 200 times at 30-second intervals):
Code:
Sep 19 23:14:25 watchdog: start ddns.
Sep 19 23:14:25 rc_service: watchdog 445:notify_rc start_ddns
Sep 19 23:14:25 custom script: Running /jffs/scripts/ddns-start (args: 192.168.10.10)
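To pin down when such a run of entries started and ended, the log can be summarized with a few lines of shell. This is only a sketch: the function name `summarize_ddns_restarts` is made up, and the syslog path in the usage comment is an assumption about where the router keeps its log.

```shell
#!/bin/sh
# summarize_ddns_restarts: reads syslog text on stdin and prints how many
# "watchdog: start ddns." lines it saw, with the first and last timestamp.
summarize_ddns_restarts() {
    awk '
        /watchdog: start ddns\./ {
            n++
            ts = $1 " " $2 " " $3      # month, day, time
            if (n == 1) first = ts
            last = ts
        }
        END { printf "%d restarts, first=%s, last=%s\n", n, first, last }
    '
}

# Example (log path is an assumption):
# summarize_ddns_restarts </tmp/syslog.log
```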
ddns-start is my main management program. I ran it manually many times during the incident. It is the program that performs these ssh calls to the remote machines. It was working perfectly the whole time. Custom DDNS is the only router function related to it, and it is what runs it. But my complex ddns-start takes over and refuses to do anything unnecessary: it will not run two instances, and it clears the remains should any instance hang.
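For illustration, a single-instance guard of this kind is often built around an atomic `mkdir`. The sketch below is an assumption about how such a guard could look, not the real ddns-start logic; the lock path and the function names `acquire_lock` and `release_lock` are hypothetical.

```shell
#!/bin/sh
# Hypothetical single-instance guard; the lock path is a placeholder.
LOCK_DIR="${LOCK_DIR:-/tmp/ddns-start.lock}"

# acquire_lock: returns 0 if this process now owns the lock, 1 if another
# live instance holds it. A lock whose recorded PID is no longer running
# is treated as stale and removed.
acquire_lock() {
    if mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "$$" >"$LOCK_DIR/pid"
        return 0
    fi
    lk_pid=$(cat "$LOCK_DIR/pid" 2>/dev/null)
    if [ -n "$lk_pid" ] && kill -0 "$lk_pid" 2>/dev/null; then
        return 1                # owner is still alive
    fi
    rm -rf "$LOCK_DIR"          # stale lock left by a dead run
    mkdir "$LOCK_DIR" 2>/dev/null || return 1
    echo "$$" >"$LOCK_DIR/pid"
    return 0
}

# release_lock: removes the lock directory and its PID file.
release_lock() {
    rm -rf "$LOCK_DIR"
}

# Typical use at the top of a script:
# acquire_lock || exit 0
# trap release_lock EXIT
```

This sketch only clears locks whose owner has exited. An instance that is hung but still alive would need an extra rule, for example a maximum lock age, which is not shown here.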
What was not an issue? The Internet was OK at all 3 locations. The VPNs were working. I was able to connect from lan.2 to both lan.1 (ssh via VPN) and lan.3 (ssh). Other local computers were ssh-ing into the lan.1 router with no problems. So only the outgoing ssh from lan.1 had a problem. Free memory on the lan.1 router was at its usual level. All my other tests noticed nothing! Speedtest was normal.
Later I found out that less than 1 hour before the alarms were sent out to me, and before the long run of log entries above, another user connected to the VPN there several times but was unable to reach a .7 computer on that LAN. That has never happened before, so it could be related. The funny thing is that the same .7 computer had no problems with ssh to the router during the incident!
As lan.1 must work, and half an hour of investigating turned up nothing, I rebooted the modem (not the router!) and everything went back to normal. Rebooting the modem obviously resets the DDNS (double NAT). Now, 6 days after the ssh incident, all is well, and the router has been up for 91 days with 164 MB of free RAM and 18044 left in NVRAM.
The router at lan.1 is an Asus RT-AC68U with old firmware (380.59). Please do not suggest the firmware as the problem unless something directly related is precisely documented. I have an RT-AC66U_B1 on 380.70 and its ssh is VERY problematic. My other routers on 380.59 have not had any problems in the last 3 years.