I decided to start checking the running stats from dnsmasq hourly. Anyone who wishes to do the same can just run this command to dump the stats to the log file.
Code:
/usr/bin/killall -s USR1 dnsmasq
Here are my most recent stats, taken before dnsmasq was restarted to update an addn-hosts file.
Code:
Apr 9 22:59:01 dnsmasq[15330]: time 2009381
Apr 9 22:59:01 dnsmasq[15330]: cache size 2000, 0/3827 cache insertions re-used unexpired cache entries.
Apr 9 22:59:01 dnsmasq[15330]: queries forwarded 2243, queries answered locally 2369
Apr 9 22:59:01 dnsmasq[15330]: DNSSEC memory in use 24640, max 26532, allocated 199980
Apr 9 22:59:01 dnsmasq[15330]: server 8.8.4.4#53: queries sent 15, retried or failed 3
Apr 9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65057: queries sent 1722, retried or failed 865
Apr 9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65058: queries sent 891, retried or failed 2
Apr 9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65055: queries sent 2, retried or failed 0
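For anyone wanting to turn those counters into percentages, here is a minimal sketch in Python. The regex is only based on the log lines shown above, and `fail_rates` and `SAMPLE` are names made up for this example, so adjust it if your dnsmasq logs look different.

```python
import re

# Matches the per-server stats lines shown above, e.g.
# "server 127.0.0.1#65057: queries sent 1722, retried or failed 865"
STAT_RE = re.compile(
    r"server (?P<server>\S+): queries sent (?P<sent>\d+), "
    r"retried or failed (?P<failed>\d+)"
)


def fail_rates(log_text):
    """Return {server: (sent, failed, percent_failed)} for each stats line."""
    rates = {}
    for match in STAT_RE.finditer(log_text):
        sent = int(match["sent"])
        failed = int(match["failed"])
        # Guard against a server that has not been sent anything yet.
        percent = 100.0 * failed / sent if sent else 0.0
        rates[match["server"]] = (sent, failed, round(percent, 1))
    return rates


# The four per-server lines from the dump above.
SAMPLE = """\
Apr 9 22:59:01 dnsmasq[15330]: server 8.8.4.4#53: queries sent 15, retried or failed 3
Apr 9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65057: queries sent 1722, retried or failed 865
Apr 9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65058: queries sent 891, retried or failed 2
Apr 9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65055: queries sent 2, retried or failed 0
"""

if __name__ == "__main__":
    for server, (sent, failed, percent) in fail_rates(SAMPLE).items():
        print(f"{server}: {failed}/{sent} = {percent}%")
```

On the dump above this works out to 50.2% for 65057 (865 of 1722) and 0.2% for 65058 (2 of 891), even though both ports lead to the same server.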
Some notes.
Google DNS is only used for uk.ntp.org, so just the time server.
65057 and 65058 are both the same DNS server (my private DNS server), just reached through 2 dnscrypt proxies.
65055 is the true backup DNS, the .fr dnscrypto server.
I discovered a couple of days ago that no matter what server was set as primary, dnsmasq reported a high fail/retry rate for it, usually between about 20% and 50% of queries.
I tried numerous things, such as:
Trying different servers as primary (some, like opendns, had noticeably lower failure rates, but still well into double figures)
Trying direct route to my dns server with no dnscrypt tunnel.
Disabling EDP
Disabling DNSSEC
Having only one server as primary (this was to see if the lookups would become hard fails, and I would see failures in browser etc.).
The only consistent pattern, really, was that whichever server was the backup had almost 100% of its queries succeed. As the stats above show, this includes setting the same server twice, once as primary and once as secondary.
Right now I am leaning towards the problem being dnsmasq, because even with the same server first and second, the failure rates don't match, which they would if the problem was server side. My working theory at the moment is that dnsmasq is impatient in waiting for replies, and if one doesn't come quickly enough, it counts the query as failed, aborts and asks the next DNS server. With my server listed twice, by the time it is asked the second time the record is in the server's cache and it can answer quickly enough.
The problem with this theory is that when I have different servers in the first and second slots, the second server won't have the benefit of precaching as it hasn't been asked yet, which pretty much puts the above theory to bed.
To be clear, if I put something like dnscrypto .fr first and my server second, the first will still fail 20-50% of queries and my server will pass 99%. It's whatever is first that takes the hit. But my server does seem to have a higher average, e.g. in this case it's 50% after a few hours of running, whereas a lot of other servers are nearer 30-40%.
I am not noticing this when browsing etc. A retry on a DNS lookup would surely cause a longer wait, but none is visible to me, which may also suggest it's retrying really quickly, maybe not even waiting one second for a query response.
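One way to check the "not answering quickly enough" idea from outside dnsmasq is to time each upstream directly. The following is only a rough sketch: it hand-builds a plain A-record query using the Python standard library, and the function names, the 5 second timeout and the `example.com` test name are all assumptions made for this example. It says nothing about how long dnsmasq itself waits. It only shows how long each proxy takes to answer.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.strip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(server, port, name, timeout=5.0):
    """Return seconds until any reply arrives, or None on timeout/error.

    The reply is not parsed or checked against the query ID.
    """
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((server, port))
        start = time.monotonic()
        try:
            sock.send(packet)
            sock.recv(4096)
        except OSError:  # covers timeouts and refused connections
            return None
        return time.monotonic() - start


def report(ports=(65057, 65058, 65055), name="example.com"):
    """Print one timing per local proxy port (ports taken from the stats)."""
    for port in ports:
        elapsed = time_query("127.0.0.1", port, name)
        text = "no reply" if elapsed is None else f"{elapsed * 1000:.0f} ms"
        print(f"port {port}: {text}")
```

Calling `report()` a few times, with names that are unlikely to be cached, gives a feel for how slow the first slot's uncached answers are. If uncached answers regularly take a long time on the first slot, the impatience theory gains some support. If they are consistently fast, it looks less likely.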
A lot of work was done on my DNS server to see if the problem was server side. Aside from adjusting many DNS server configuration options, I also checked the OS-side tuning parameters to ensure there were no bottlenecks, and even went as far as replacing bind9 with unbound (which is what is currently running).
Thoughts? (With your own data, hopefully.)