
DNS failing with heavy UDP traffic

storkinsj

Occasional Visitor
This is going to be a bit tricky to describe, so I apologize in advance for any mistakes I make in describing it.

I am using a GT-AX11000 with Merlin 386.5_2. It had been working very well until I began running a blockchain validator on my intranet. I have a very high-bandwidth connection (500 Mbps up and down, minimum), and it is double NAT'd.

My DNS setup includes DoT to both Quad9 servers and appears to work correctly when the validator is not running. I have 9.9.9.9 and 149.112.112.112 configured in the DoT locations as well as in the DNS Server1 and Server2 locations. Again, this works perfectly when the validator is not running.

When I run the validator, I lose all DNS capabilities.

The validator is maintaining in the neighborhood of 1000 peers which it connects to via "udp" on its "gossip" channel; here is a slice of the outgoing tracked connections, which are capped at "500" in the UI:

udp solana 8007 141.95.125.35 8009 Untracked
udp solana 8007 184.105.146.34 8008 Untracked
udp solana 8000 146.59.68.225 8000 Untracked
udp solana 8000 141.95.35.126 8000 Untracked
...

Given a few seconds of connecting to its peers, the validator breaks DNS for all machines on the router's intranet. I did a "service dnsmasq_off" first and ran "dnsmasq --no-daemon --log-queries". I believe that basically all queries fail with that SERVFAIL message:

dnsmasq: query[A] lp-push-server-300.lastpass.com from 192.168.0.71
dnsmasq: forwarded lp-push-server-300.lastpass.com to 127.0.1.1
dnsmasq: query[A] metrics.solana.com from 192.168.4.2
dnsmasq: forwarded metrics.solana.com to 127.0.1.1
dnsmasq: query[AAAA] metrics.solana.com from 192.168.4.2
dnsmasq: forwarded metrics.solana.com to 127.0.1.1
dnsmasq: query[A] bolt.dropbox.com from 192.168.0.200
dnsmasq: forwarded bolt.dropbox.com to 127.0.1.1
dnsmasq: query[AAAA] bolt.dropbox.com from 192.168.0.200
dnsmasq: forwarded bolt.dropbox.com to 127.0.1.1
dnsmasq: query[A] ipecho.net from 192.168.0.71
dnsmasq: forwarded ipecho.net to 127.0.1.1
dnsmasq: forwarded captive.apple.com to 127.0.1.1
dnsmasq: forwarded captive.apple.com to 127.0.1.1
dnsmasq: reply error is SERVFAIL
dnsmasq: reply error is SERVFAIL
dnsmasq: reply error is SERVFAIL
dnsmasq: reply error is SERVFAIL
dnsmasq: query[A] www.1k3j1blg.com from 192.168.0.200
dnsmasq: forwarded www.1k3j1blg.com to 127.0.1.1
dnsmasq: query[AAAA] www.1k3j1blg.com from 192.168.0.200
dnsmasq: forwarded www.1k3j1blg.com to 127.0.1.1
dnsmasq: query[AAAA] www.expressapisv2.net from 192.168.0.200
dnsmasq: forwarded www.expressapisv2.net to 127.0.1.1
dnsmasq: query[A] www.expressapisv2.net from 192.168.0.200
dnsmasq: forwarded www.expressapisv2.net to 127.0.1.1
dnsmasq: reply error is SERVFAIL
dnsmasq: reply error is SERVFAIL
dnsmasq: reply error is SERVFAIL
dnsmasq: reply error is SERVFAIL
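
To put numbers on "basically all queries fail", a saved copy of this log can be tallied. The following is a minimal sketch, assuming the output of "dnsmasq --no-daemon --log-queries" was captured to a file; the function name and the file path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Tally a dnsmasq --log-queries capture read from stdin:
# client queries, forwards to the upstream, and SERVFAIL replies.
summarize_dnsmasq_log() {
    awk '
        / query\[/                 { queries++ }
        / forwarded /              { forwarded++ }
        / reply error is SERVFAIL/ { servfail++ }
        END {
            printf "queries=%d forwarded=%d servfail=%d\n", queries, forwarded, servfail
        }
    '
}

# Usage (hypothetical path): summarize_dnsmasq_log < /tmp/dnsmasq.log
```

A servfail count close to the forwarded count would support the impression that nearly every upstream lookup is failing, rather than only some of them.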

I would not have tried DNS over TLS except for the fact that I thought that using a TCP based DNS service might actually help DNS queries wade through the sea of UDP traffic.

I also have tried with DNSSEC turned on and off, although currently it is set as off.

I also tried "Use local caching DNS server as system resolver (default: No)" set to On to hopefully force the router itself to use DOT but that didn't seem to work either.

I tried Adaptive QOS, Classic QOS, and bandwidth limiter (specifically the machine with the validator) set to a much smaller value but all that did was get the validator behind on its work. DNS still failed.
---
Edit: An important detail I left out: When the validator is taken off the network, it takes about 10 minutes for the DNS to begin working again. This is true even if I reboot the router.

I thought that this may be related to the building equipment (a 10.* space NAT with NTT equipment) so I tried plugging my laptop into the WAN cable. It works immediately, even when the router is not able to do DNS.

In fact, the validator is pushing massive amounts of data through the router. That all works correctly and at high speed. But DNS fails. Again, if I reboot the router, the DNS will still not work until about 10 minutes have passed. It seems that some network state may be retained between reboots but I truly don't know that for a fact.

---

I think this is going to be difficult to get debugged (I've been trying off and on for about a week) and I appreciate that this could be a problem at the kernel level.

Any help GREATLY appreciated of course, but I will accept defeat if this doesn't work.

I am considering doing multiwan and somehow routing the DNS traffic (only) on the second connection if I can set that up.
 
If I remember correctly, 127.0.1.1 is the loopback address used between dnsmasq and Stubby. Have you tried without DoT enabled? You might also try increasing the dnsmasq cache from its default size of 1500.

Edit: It is possible that the TLS connection used by Stubby is failing. The DoT connections, by default, are not logged. From a command prompt run "stubby -l" to view what Stubby is doing. By default Stubby will use each upstream resolver in turn as a means of distributing the DNS query load between servers. Merlin is using the current release of Stubby. There are some Stubby settings that could be tweaked if you feel Stubby is the problem. Increasing the dnsmasq cache may help to reduce the load on Stubby.
 
Edit: An important detail I left out: When the validator is taken off the network, it takes about 10 minutes for the DNS to begin working again. This is true even if I reboot the router.

Given the above, that sounds like it has more to do w/ those specific DNS servers than the router itself. Perhaps it's triggering DOS/DDOS protection on the server(s). Or at least being rate-limited in some fashion, much the same way you'd typically rate-limit access to SSH and other protocols to slow down hackers. I would try increasing the number of available DoT servers to further distribute the load.
 
A sure fire way to fix it is to use DNS on port 53 and make sure QoS has port 53 higher than your other udp streaming ports.
 
Thanks for the great information so far; eibgrad I will try DE-DOSSing (lol) the name servers with your tip.
------
-Note that I did try plain vanilla DNS first and it was failing- which is why I added the DOT configuration.
-Note that I did try QOS but there are two issues with that:
1) The ways of classifying packets in the ASUS interface are limited compared to something like Tomato. The screenshot shows my settings, with priority already set to "highest" (less granularity). There is no "port range" for destination, so I would have to add an entry for every potential target port in the 8000-10000 range. I can force it down to 13 ports and then add 13 entries, but it's messy.
2) If I turn on QoS, that typically means I disable something we loosely label "NAT acceleration". On my newer AX11000 router, I don't see a setting for NAT acceleration, and I read elsewhere that CTF is replaced with "flow accelerator", which may in turn be replaced with "archer". However, with the setting completely missing and QoS and NAT acceleration out of Merlin's control, I am afraid that turning on QoS will STILL turn OFF whatever NAT acceleration is being used.

Screen Shot 2022-04-12 at 8.14.54 AM.png
I may try to force it down to the 13-port range and give that a shot nonetheless; however, in my first attempt with Adaptive QoS and in the later attempt with the settings in the image, I think DNS was already prioritized.
-------
A slightly new twist. The virtual machine running the validator uses an automated script to tweak the kernel's UDP network buffers via sysctl:

sudo bash -c "cat >/etc/sysctl.d/20-solana-udp-buffers.conf <<EOF
# Increase UDP buffer size
net.core.rmem_default = 134217728
net.core.rmem_max = 134217728
net.core.wmem_default = 134217728
net.core.wmem_max = 134217728
EOF"
sudo sysctl -p /etc/sysctl.d/20-solana-udp-buffers.conf


If they are "gaming the system" on the virtual machine, my sense is I may have to do the same on the router.

I know from all the reading this week that our kernel for Merlin doesn't have sysctl, so I would be looking at performing operations such as

echo 134217728 > /proc/sys/net/core/wmem_max

but with a LOT less memory. So I'm curious what values the community thinks I could use here.

Potentially these are the values available, listed below. I think that tweaking rmem and wmem (what do these even stand for?) without tweaking udp_mem could be a bad idea:

admin@GT-AX11000-E910:/tmp/home/root# find /proc/sys/net | grep mem
/proc/sys/net/core/optmem_max
/proc/sys/net/core/rmem_default
/proc/sys/net/core/rmem_max
/proc/sys/net/core/wmem_default
/proc/sys/net/core/wmem_max
/proc/sys/net/ipv4/igmp_max_memberships
/proc/sys/net/ipv4/tcp_mem
/proc/sys/net/ipv4/tcp_rmem
/proc/sys/net/ipv4/tcp_wmem
/proc/sys/net/ipv4/udp_mem
/proc/sys/net/ipv4/udp_rmem_min
/proc/sys/net/ipv4/udp_wmem_min
admin@GT-AX11000-E910:/tmp/home/root# find /proc/sys/net | grep udp
/proc/sys/net/ipv4/netfilter/ip_conntrack_udp_timeout
/proc/sys/net/ipv4/netfilter/ip_conntrack_udp_timeout_stream
/proc/sys/net/ipv4/udp_mem
/proc/sys/net/ipv4/udp_rmem_min
/proc/sys/net/ipv4/udp_wmem_min
/proc/sys/net/netfilter/nf_conntrack_udp_timeout
/proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream
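
Before changing anything, it may help to record the current values. On Linux, rmem_* and wmem_* are the receive and send socket buffer sizes in bytes. The sketch below only reads the files listed above; the function name is hypothetical, and the optional root argument exists only so it can be pointed at a test directory.

```shell
#!/bin/sh
# Print "name = value" for the socket-buffer and UDP conntrack tunables.
# $1 is the sysctl root and defaults to /proc/sys.
show_net_tunables() {
    root="${1:-/proc/sys}"
    for rel in \
        net/core/rmem_default net/core/rmem_max \
        net/core/wmem_default net/core/wmem_max \
        net/ipv4/udp_mem net/ipv4/udp_rmem_min net/ipv4/udp_wmem_min \
        net/netfilter/nf_conntrack_udp_timeout \
        net/netfilter/nf_conntrack_udp_timeout_stream
    do
        f="$root/$rel"
        if [ -r "$f" ]; then
            printf '%s = %s\n' "$rel" "$(cat "$f")"
        else
            printf '%s = (not present)\n' "$rel"
        fi
    done
}

# Usage: show_net_tunables > /tmp/tunables-before.txt   (hypothetical path)
```

Saving the output first gives a baseline to restore if a later change makes things worse.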


I am going to go ahead and try reconfiguring to more DoT servers using the pulldowns; I'm anxious for feedback on the NAT acceleration trigger for the AX11000 and on kernel tuning.
--------
I found two research papers that could keep me busy for at least a couple years no kidding:

NAT Traversal techniques and UDP Keep-alive interval optimization

Monitoring and Tuning the Linux Networking Stack

Thanks all!
 
Ugh!

Well I may have found at least one problem.

I made a simple name server lookup test to see if my new configuration with 8 DOT servers was correct.

What I discovered is that:
-9.9.9.9 was configured both for DOT and in DNSServer1 fields, but the router is really only paying attention to the dnsserver1 field.
-9.9.9.9 does start throttling me after a while.

Against advice I saw elsewhere in the forum, I changed the DNSserver1 and DNSServer2 fields to be blank. Now it's doing the lookup correctly using the DOT servers; the earlier suggestion to diagnose stubby seemed to be correct :


admin@GT-AX11000-E910:/tmp/home/root# source ./dnstest.sh
Server: 9.9.9.9
Address 1: 9.9.9.9 dns9.quad9.net

Name: www.dogs.com
Address 1: 204.74.99.100 crs.ultradns.net
Server: 9.9.9.9
Address 1: 9.9.9.9 dns9.quad9.net

Name: www.cats.com
Address 1: 72.167.191.69 ip-72-167-191-69.ip.secureserver.net
Server: 9.9.9.9
Address 1: 9.9.9.9 dns9.quad9.net


Removing DNSServer1 and DNSServer2 fields:

admin@GT-AX11000-E910:/tmp/home/root# source ./dnstest.sh
Server: 127.0.1.1
Address 1: 127.0.1.1

Name: www.dogs.com
Address 1: 204.74.99.100 crs.ultradns.net
Server: 127.0.1.1
Address 1: 127.0.1.1

Name: www.cats.com
Address 1: 182.50.132.242 ip-182-50-132-242.ip.secureserver.net
Server: 127.0.1.1
Address 1: 127.0.1.1
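
The contents of dnstest.sh are not shown in the thread. A hypothetical stand-in that would produce output of this general shape might look like the sketch below; the function name and the LOOKUP_CMD override are assumptions for illustration, not the poster's script.

```shell
#!/bin/sh
# Hypothetical stand-in for dnstest.sh: look up each name in turn.
# LOOKUP_CMD defaults to nslookup; overriding it is only useful for dry runs.
dns_test() {
    for name in "$@"; do
        ${LOOKUP_CMD:-nslookup} "$name" || echo "lookup failed: $name"
    done
}

# Usage: dns_test www.dogs.com www.cats.com
```

The "Server:" line in each result shows which resolver answered, which is what revealed the switch from 9.9.9.9 to 127.0.1.1 here.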


Since I believe DNSServer1 and DNSServer2 are supposed to be used at bootup, I changed one other setting: "Wan: Use local caching DNS server as system resolver (default: No)" to YES.

Here is the output of stubby -l now:

admin@GT-AX11000-E910:/tmp/home/root# stubby -l
[00:17:39.366487] STUBBY: Stubby version: Stubby 0.4.0
[00:17:39.369568] STUBBY: Read config from file /etc/stubby/stubby.yml
[00:17:39.369966] STUBBY: DNSSEC Validation is OFF
[00:17:39.370015] STUBBY: Transport list is:
[00:17:39.370058] STUBBY: - TLS
[00:17:39.370101] STUBBY: Privacy Usage Profile is Strict (Authentication required)
[00:17:39.370151] STUBBY: (NOTE a Strict Profile only applies when TLS is the ONLY transport!!)
[00:17:39.370192] STUBBY: Starting DAEMON....
[00:17:41.147539] STUBBY: 9.9.9.9 : Upstream : Could not setup TLS capable TFO connect
[00:17:41.148222] STUBBY: 9.9.9.9 : Conn opened: TLS - Strict Profile
[00:17:41.354422] STUBBY: 9.9.9.9 : Verify passed : TLS
[00:17:46.012720] STUBBY: 149.112.112.112 : Conn opened: TLS - Strict Profile
[00:17:46.217944] STUBBY: 149.112.112.112 : Verify passed : TLS
[00:17:47.929660] STUBBY: 149.112.121.10 : Conn opened: TLS - Strict Profile
[00:17:48.165107] STUBBY: 149.112.121.10 : Verify passed : TLS
[00:17:49.337969] STUBBY: 149.112.122.10 : Conn opened: TLS - Strict Profile
[00:17:49.573590] STUBBY: 149.112.122.10 : Verify passed : TLS
[00:17:50.546349] STUBBY: 9.9.9.9 : Conn closed: TLS - Resps= 1, Timeouts = 0, Curr_auth =Success, Keepalive(ms)= 9000
[00:17:50.546666] STUBBY: 9.9.9.9 : Upstream : TLS - Resps= 1, Timeouts = 0, Best_auth =Success
[00:17:50.546841] STUBBY: 9.9.9.9 : Upstream : TLS - Conns= 1, Conn_fails= 0, Conn_shuts= 0, Backoffs = 0
[00:17:55.414320] STUBBY: 149.112.112.112 : Conn closed: TLS - Resps= 1, Timeouts = 0, Curr_auth =Success, Keepalive(ms)= 9000
[00:17:55.414371] STUBBY: 149.112.112.112 : Upstream : TLS - Resps= 1, Timeouts = 0, Best_auth =Success
[00:17:55.414395] STUBBY: 149.112.112.112 : Upstream : TLS - Conns= 1, Conn_fails= 0, Conn_shuts= 0, Backoffs = 0
[00:17:56.618812] STUBBY: 185.228.168.9 : Conn opened: TLS - Strict Profile
[00:17:56.632014] STUBBY: 185.228.168.9 : Verify passed : TLS


Later under load I see the quad9 servers... who have already had quite enough of me.... backing off on the round robin:

[00:22:21.674832] STUBBY: 149.112.121.10 : Upstream : !Backing off TLS on this upstream - Will retry again in 2s at Tue Apr 12 00:22:23 2022
[00:21:10.573791] STUBBY: 9.9.9.9 : Upstream : !Backing off TLS on this upstream - Will retry again in 2s at Tue Apr 12 00:21:12 2022


At the moment, removing dnsserver1 and dnsserver2 and adding 8 DOT servers has FIXED this. Thank you very much @eibgrad and @bbunge ! I will leave this running and see how it does over 24 hours.
 
Definitely spoke too soon. The validator had stopped chatting on the peer network. Once it began again, stubby tells a tale of woe:

[00:50:43.162185] STUBBY: (NOTE a Strict Profile only applies when TLS is the ONLY transport!!)
[00:50:43.162234] STUBBY: Starting DAEMON....
[00:50:45.762561] STUBBY: 9.9.9.9 : Upstream : Could not setup TLS capable TFO connect
[00:50:45.762979] STUBBY: 9.9.9.9 : Conn opened: TLS - Strict Profile
[00:50:45.763337] STUBBY: 149.112.112.112 : Conn opened: TLS - Strict Profile
[00:50:48.126334] STUBBY: 9.9.9.9 : Conn closed: TLS - *Failure*
[00:50:48.126592] STUBBY: 149.112.121.10 : Conn opened: TLS - Strict Profile
[00:50:48.126627] STUBBY: 9.9.9.9 : Conn closed: TLS - Resps= 0, Timeouts = 0, Curr_auth = None, Keepalive(ms)= 0
[00:50:48.126652] STUBBY: 9.9.9.9 : Upstream : TLS - Resps= 0, Timeouts = 0, Best_auth = None
[00:50:48.126675] STUBBY: 9.9.9.9 : Upstream : TLS - Conns= 0, Conn_fails= 1, Conn_shuts= 0, Backoffs = 0
[00:50:48.164279] STUBBY: 149.112.112.112 : Conn closed: TLS - *Failure*
[00:50:48.164474] STUBBY: 149.112.122.10 : Conn opened: TLS - Strict Profile
[00:50:48.164508] STUBBY: 149.112.112.112 : Conn closed: TLS - Resps= 0, Timeouts = 0, Curr_auth = None, Keepalive(ms)= 0
[00:50:48.164532] STUBBY: 149.112.112.112 : Upstream : TLS - Resps= 0, Timeouts = 0, Best_auth = None
[00:50:48.164555] STUBBY: 149.112.112.112 : Upstream : TLS - Conns= 0, Conn_fails= 1, Conn_shuts= 0, Backoffs = 0
[00:50:48.596316] STUBBY: 149.112.121.10 : Conn closed: TLS - *Failure*
[00:50:48.596558] STUBBY: 185.228.168.9 : Conn opened: TLS - Strict Profile
[00:50:48.596593] STUBBY: 149.112.121.10 : Conn closed: TLS - Resps= 0, Timeouts = 0, Curr_auth = None, Keepalive(ms)= 0
[00:50:48.596618] STUBBY: 149.112.121.10 : Upstream : TLS - Resps= 0, Timeouts = 0, Best_auth = None
[00:50:48.596641] STUBBY: 149.112.121.10 : Upstream : TLS - Conns= 0, Conn_fails= 1, Conn_shuts= 0, Backoffs = 0
[00:50:48.642287] STUBBY: 149.112.122.10 : Conn closed: TLS - *Failure*
[00:50:48.642486] STUBBY: 185.228.169.9 : Conn opened: TLS - Strict Profile
[00:50:48.642518] STUBBY: 149.112.122.10 : Conn closed: TLS - Resps= 0, Timeouts = 0, Curr_auth = None, Keepalive(ms)= 0
[00:50:48.642555] STUBBY: 149.112.122.10 : Upstream : TLS - Resps= 0, Timeouts = 0, Best_auth = None
[00:50:48.642578] STUBBY: 149.112.122.10 : Upstream : TLS - Conns= 0, Conn_fails= 1, Conn_shuts= 0, Backoffs = 0
[00:50:48.690249] STUBBY: 185.228.168.9 : Conn closed: TLS - *Failure*


==== Then: ====

[00:50:48.712540] STUBBY: *FAILURE* no valid transports or upstreams available!

But we know that's not true- it's just that the router can't make those connections.

So I think I need to explore options to keep the router kernel stable and able to make those connections.
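
To compare upstreams across a longer capture, the per-upstream counter lines can be condensed. This is a rough sketch, assuming the "stubby -l" output was saved to a file and that the "Upstream : TLS - Conns= ..." lines keep the layout shown above; the function name is hypothetical. It reports the most recent counters seen for each upstream.

```shell
#!/bin/sh
# Read saved "stubby -l" output on stdin and print, per upstream address,
# the latest Conns= and Conn_fails= counters that appear in the log.
summarize_stubby_upstreams() {
    awk '
        /Upstream/ && /Conn_fails=/ {
            for (i = 1; i < NF; i++) {
                # The value follows its label as a separate field, e.g. "1,"
                if ($i == "Conns=")      conns[$3] = $(i + 1) + 0
                if ($i == "Conn_fails=") fails[$3] = $(i + 1) + 0
            }
            seen[$3] = 1
        }
        END {
            for (ip in seen)
                printf "%s conns=%d conn_fails=%d\n", ip, conns[ip], fails[ip]
        }
    ' | sort
}

# Usage (hypothetical path): summarize_stubby_upstreams < /tmp/stubby.log
```

If every upstream shows failures at the same moment, that points at the router's ability to open connections rather than at any one DNS provider.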
 
Try traditional DNS and allow the router to cache DNS requests. Secure DNS has many more requirements to function. Also, allowing the router to cache will take the load of your wan link for cached queries.

Good luck,

Morris
 
I know from all the reading this week that our kernel for Merlin doesn't have sysctl,
maybe not, but have you looked into entware? There's neato stuff in there that may be of immense assistance.
And maybe it fits with what @Morris suggested about caching DNS - look at unbound. (it fits nicely with something else you may be interested in: WireGuard - there are several threads in the AddOns Subforum here, but I'm not sure they've something yet for your model router)

Just trying to help a fellow crypto person...even if you're on Solana ;-p
 
Not 100% clued up on this but presume here you are running a solana node? If so there was a tweet yesterday about updating to *...Was a bit over my head.
 
1st post> I would not have tried DNS over TLS except for the fact that I thought that using a TCP based DNS service might actually help DNS queries wade through the sea of UDP traffic.

2nd post> Note that I did try plain vanilla DNS first and it was failing- which is why I added the DOT configuration.

If I am misunderstanding what you're recommending, please help me understand. Otherwise, yes that's what I was using until it failed.

I have no specific privacy or security reasons for using DOT. Perhaps it uses cryptography etc, but the thing about https vs dns is that TCP/IP is a guaranteed transport with retries etc; UDP is not. That is the only reason I tried something else.

======
Yes folks, this is a Solana validator. It's actually working. It's just that when it's working, nothing else works :) Initiating any outgoing TCP or UDP connection fails in the router. If this were a $5000 Cisco router, it would probably work. However, I'm running just about the most powerful Merlin router I can, and I think the only one that supports IPv6. I just don't know much about the kernel and debugging there. I think our tools (stubby, dnsmasq, etc.) are probably working "just fine". It's that the kernel is unable to handle all the networking as configured.
 
I noticed something interesting when running stubby -l:


[14:00:51.864278] STUBBY: Stubby version: Stubby 0.4.0
[14:00:51.867848] STUBBY: Read config from file /etc/stubby/stubby.yml
[14:00:51.868272] STUBBY: DNSSEC Validation is OFF
[14:00:51.868323] STUBBY: Transport list is:
[14:00:51.868366] STUBBY: - UDP
[14:00:51.868408] STUBBY: - TCP
[14:00:51.868463] STUBBY: Privacy Usage Profile is Opportunistic
[14:00:51.868508] STUBBY: (NOTE a Strict Profile only applies when TLS is the ONLY transport!!)

Did I do something wrong? I want TCP to be the only transport. Since it's DNS over TLS (not DTLS) I was assuming everything is TCP.
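
One way to see what Stubby was actually told to use is to look at the generated config rather than the log. The sketch below assumes the file at /etc/stubby/stubby.yml (the path reported by "stubby -l") uses the stock Stubby key names dns_transport_list and tls_authentication; the function name is hypothetical, and the firmware-generated file may be laid out differently.

```shell
#!/bin/sh
# Print the transport list and authentication mode from a stubby.yml,
# ignoring commented-out lines. $1 defaults to the path Stubby reported.
show_stubby_transports() {
    conf="${1:-/etc/stubby/stubby.yml}"
    grep -v '^[[:space:]]*#' "$conf" |
        grep -E 'dns_transport_list|GETDNS_TRANSPORT_|tls_authentication'
}

# Usage: show_stubby_transports
```

A list containing only GETDNS_TRANSPORT_TLS together with GETDNS_AUTHENTICATION_REQUIRED would correspond to the Strict profile; anything else suggests the Opportunistic profile is in effect.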

Below is my conf:

Screen Shot 2022-04-12 at 6.55.03 PM.png


Thanks for all the help.
Greg
 
If it wasn't clear earlier-
All DNS queries on my intranet are working. That means any static hosts on my network get resolved quickly. It's only the DNS queries that depend on an upstream server that are failing. I apologize for not explicitly mentioning that.

This makes me think at least a little differently about the issue. The kernel is handling all of the network traffic, not just the WAN-facing traffic. It is succeeding even though the local NIC is getting hit heavily with the UDP packets as well.

The big difference between WAN and LAN is that WAN uses NAT. So it _does_ seem that the problem is related to NAT.

Greg
 
Interesting... I've been running stubby -l for months to debug another issue... mine shows only TCP... and I use 6 resolvers... One difference I see in your posted settings is that I use the default Symmetric and your profile shows Fullcone... I still use the Quad9 settings in DNS1/DNS2 for the early resolution, and I have DNSSEC both = YES and DNS Rebind = NO. Still a bit odd. Maybe I need to look at stubby.yml directly again. My Privacy Usage Profile shows Strict and yours is Opportunistic?

:/tmp/home/root# stubby -l
[11:39:55.508633] STUBBY: Stubby version: Stubby 0.4.0
[11:39:55.511556] STUBBY: Read config from file /etc/stubby/stubby.yml
[11:39:55.512183] STUBBY: DNSSEC Validation is OFF
[11:39:55.512221] STUBBY: Transport list is:
[11:39:55.512252] STUBBY: - TLS
[11:39:55.512282] STUBBY: Privacy Usage Profile is Strict (Authentication required)
[11:39:55.512313] STUBBY: (NOTE a Strict Profile only applies when TLS is the ONLY transport!!)
[11:39:55.512343] STUBBY: Starting DAEMON....
[11:40:01.437938] STUBBY: 9.9.9.11 : Upstream : Could not setup TLS capable TFO connect
[11:40:01.438699] STUBBY: 9.9.9.11 : Conn opened: TLS - Strict Profile
[11:40:01.446999] STUBBY: 149.112.112.11 : Conn opened: TLS - Strict Profile
[11:40:01.447464] STUBBY: 1.1.1.2 : Conn opened: TLS - Strict Profile
[11:40:01.494325] STUBBY: 9.9.9.11 : Verify passed : TLS
[11:40:01.510935] STUBBY: 1.1.1.2 : Verify passed : TLS
[11:40:01.525386] STUBBY: 149.112.112.11 : Verify passed : TLS
....
 
I do not see STUBBY: -UDP when I run stubby -l. Did you ever enable DNS-over-TLS Profile Opportunistic ? Maybe it has not gone back to the default setting and a reset is needed.
Did you have DNS Server 1 and 2 filled in on the LAN - DHCP Server page? If so those should be blank.
You do need DNS Server 1 and 2 filled in on the WAN page.
I would use just four resolvers for DoT - Quad9 and Cloudflare Security. Cloudflare Security is 1.1.1.2 and 1.0.0.1 with a TLS Hostname of security.cloudflare-dns.com and I would alternate them (9.9.9.9 then 1.1.1.2 and so on)
In an earlier post I recommended increasing the DNS cache size in dnsmasq. An easy way to do this is to add Diversion. There is an option to increase the cache size to 10,000. It does increase the RAM usage a bit but the AX86U can handle it!
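
For anyone who prefers not to install Diversion, the cache size can also be set by hand. This is a sketch under two assumptions to verify on your own router: that custom configs are enabled so Merlin appends /jffs/configs/dnsmasq.conf.add to the generated dnsmasq config, and that the appended cache-size line takes precedence over the firmware's own value. The function name is hypothetical.

```shell
#!/bin/sh
# Write a single cache-size line into a dnsmasq add-on config file,
# replacing any earlier cache-size line in that same file.
set_dnsmasq_cache_size() {
    size="$1"
    file="${2:-/jffs/configs/dnsmasq.conf.add}"
    touch "$file"
    # Keep every line except an old cache-size setting.
    grep -v '^cache-size=' "$file" > "$file.tmp" || true
    echo "cache-size=$size" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Usage: set_dnsmasq_cache_size 10000 && service restart_dnsmasq
```

A larger cache only helps for names that have already been resolved once, so it reduces upstream load without fixing lookups that cannot reach the upstream at all.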
 
Keep in mind in the Merlin setup DNSSEC is done in DNSMASQ. It is possible to do DNSSEC in stubby by adding a line in the stubby.yml with a stubby.postconf file
 
This should be of interest to you: https://www.wireguard.com/netns/
if you spin up a Wireguard server peer (on your router), you can (well, I'd assume you should be able to) compartmentalize/segment your validator apart from your LAN traffic on the WAN connection with whatever Solana uses as "FQDN" or locator addy/DNS A-and QuadA records...AND run an instance of unbound for DNS caching for the other clients that takes care of DNSSEC...

Is this a fibre or cable connection? if cable, can you bridge your gateway/modem and do the ISP auth on the router?
it seems you might be behind CGNAT on the WAN connection...do you get a Native IPv6 connection from your ISP? (have you enabled IPv6 on your LAN to see if it helps with the issue?) if no Native v6, do/can you get a static IP from the ISP? OR - are you amenable to setting up DDNS on the router?

-also a Greg
 
Great stuff.

First to be clear, the problem really is not DNS itself here, but the ability to make ANY outbound connections. I think that's more or less a given at this point.

This generates a side question: If I only specify a SINGLE DNS over TLS server... do you think Stubby will keep that connection alive? If so, I may be able to keep DNS working at least.

The only connections that work seem to be connections that pre-exist the Validator coming up. As an example, I run ngrok here and ngrok continues to allow incoming connections.

I have an unused fiber connection coming in (we don't know if it's live yet and I am having trouble figuring out how to buy an ONT) but we are hooked to ethernet.

The building is probably using CGNAT. I have great connectivity and I can't ping other routers in the building from my WAN interface. If I look at how CGNAT is structured, that would be a good indicator.

Greg's suggestions :) :

Using wireguard to contain Validator traffic: We thought about doing a VPN to handle some of the traffic. However, it pushes the problem "out" to another system (DNS receiver). Getting something high-bandwidth in the cloud or colocated would cost more, and that bandwidth may not actually be as good as mine.

Tricking some Solana traffic into using VPN: I think Solana is probably making thousands of DNS requests. So it's not going to be possible to contain its outbound traffic, via DNS A-and QuadA records (IIUC).

DNS Caching: I think DNS issues may be a red herring. Stubby and other tools are just unable to connect outbound after a short while. I think NAT is failing.

So I have to figure out what kernel parameters to tweak to increase NAT capacity, or figure out how to decrease the UDP / TCP timeouts so that I don't have too many NAT connections open.
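
For the timeout side of that idea, the two UDP conntrack files found earlier under /proc/sys/net/netfilter are the obvious candidates. The sketch below is a dry run by default: it prints the current and proposed values and writes only when APPLY=1 is set. The function name and the example values 10 and 60 are illustrative assumptions, not recommendations, and anything written to /proc this way is lost on reboot.

```shell
#!/bin/sh
# Show (and optionally apply) new UDP conntrack timeouts, in seconds.
# $1 = timeout for unreplied flows, $2 = timeout for established streams,
# $3 = directory holding the tunables (defaults to the netfilter sysctl dir).
set_udp_timeouts() {
    timeout="$1"
    stream="$2"
    dir="${3:-/proc/sys/net/netfilter}"
    for pair in "nf_conntrack_udp_timeout:$timeout" \
                "nf_conntrack_udp_timeout_stream:$stream"; do
        f="$dir/${pair%%:*}"
        new="${pair##*:}"
        echo "$f: $(cat "$f") -> $new"
        # Write only when explicitly requested.
        [ "${APPLY:-0}" = 1 ] && echo "$new" > "$f"
    done
    return 0
}

# Dry run:  set_udp_timeouts 10 60
# Apply:    APPLY=1 set_udp_timeouts 10 60
```

Shorter timeouts let idle gossip flows age out of the table sooner, at the cost of dropping mappings for peers that send only occasionally.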

Think about this: TCP/IP only has 65,535 ports available to it. Each outbound NAT'd TCP/IP connection takes up one of those as an ephemeral source port. I could easily see my router running out of those in this situation. The connection tracker was seeing "15,000". I'm waiting for the validator to come up again. Thanks @ColinTaylor for the tip on the Tools/Network status page! I have been using Tomato for years but am not 100% up to speed on Merlin yet. The validator is coming up again; I'll post the total number of connections.
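
Separately from the port count, the Linux connection tracker has its own table limit, and new flows are dropped once the table is full. A quick check of how full it is could look like the sketch below, assuming the router's kernel exposes nf_conntrack_count and nf_conntrack_max alongside the timeout files found earlier; the function name is hypothetical.

```shell
#!/bin/sh
# Report how full the connection-tracking table is.
# $1 = directory holding the counters (defaults to the netfilter sysctl dir).
conntrack_usage() {
    dir="${1:-/proc/sys/net/netfilter}"
    count=$(cat "$dir/nf_conntrack_count")
    max=$(cat "$dir/nf_conntrack_max")
    echo "conntrack entries: $count of $max ($((count * 100 / max))%)"
}

# Usage: conntrack_usage
```

Running this while the validator is connecting to its peers would show whether the table is approaching its limit at the moment DNS stops working.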
 
Hi Greg,

You must rate limit that application or upgrade your WAN link to support what you are trying to do. I suggested DNS caching because at least cached requests would work. The reality is you cannot fit an elephant through a straw.

Morris
 
