dnsmasq stability

Chrysalis

Senior Member
I decided to start checking dnsmasq's running stats hourly. Anyone who wishes to do the same can run this command to dump the stats to the log file.

Code:
/usr/bin/killall -s USR1 dnsmasq
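
For anyone who would rather script the dump than call killall, here is a minimal Python sketch of the same thing. dnsmasq writes its statistics to the log when it receives SIGUSR1; the helper names `read_pid` and `dump_stats` are made up for illustration, and where the pid file lives depends on your firmware, so treat any path you feed it as an assumption.

```python
import os
import signal


def read_pid(pid_file_text: str) -> int:
    """Parse the PID out of the contents of a pid file (hypothetical helper)."""
    return int(pid_file_text.strip())


def dump_stats(pid: int, kill=os.kill) -> None:
    """Send SIGUSR1 to the given dnsmasq PID so it logs its statistics.

    The `kill` parameter exists only so the function can be tried out
    without signalling a real process.
    """
    kill(pid, signal.SIGUSR1)
```

Calling `dump_stats(read_pid(open(path).read()))` from an hourly job, with `path` pointing at wherever your build keeps the dnsmasq pid file, would do the same as the command above.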

Here are my most recent stats, taken before dnsmasq was restarted to update an addn-hosts file.

Code:
Apr  9 22:59:01 dnsmasq[15330]: time 2009381
Apr  9 22:59:01 dnsmasq[15330]: cache size 2000, 0/3827 cache insertions re-used unexpired cache entries.
Apr  9 22:59:01 dnsmasq[15330]: queries forwarded 2243, queries answered locally 2369
Apr  9 22:59:01 dnsmasq[15330]: DNSSEC memory in use 24640, max 26532, allocated 199980
Apr  9 22:59:01 dnsmasq[15330]: server 8.8.4.4#53: queries sent 15, retried or failed 3
Apr  9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65057: queries sent 1722, retried or failed 865
Apr  9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65058: queries sent 891, retried or failed 2
Apr  9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65055: queries sent 2, retried or failed 0

Some notes:
Google DNS is only used for uk.ntp.org, so just the time server.
65057 and 65058 are two dnscrypt proxies pointing at the same DNS server (my private DNS server).
65055 is the true backup DNS, the .fr dnscrypto server.
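
To compare dumps without doing the division by hand, something like the following could turn the "server" lines into per-server percentages. It is only a sketch: the regular expression is written against the log format shown above, and `failure_rates` is a made-up helper name, not anything dnsmasq ships.

```python
import re

# Matches lines such as:
#   "... server 127.0.0.1#65057: queries sent 1722, retried or failed 865"
SERVER_LINE = re.compile(
    r"server (?P<server>\S+): queries sent (?P<sent>\d+), "
    r"retried or failed (?P<failed>\d+)"
)


def failure_rates(log_text: str) -> dict:
    """Map each upstream server to (sent, failed, failed/sent as a percentage)."""
    rates = {}
    for match in SERVER_LINE.finditer(log_text):
        sent = int(match["sent"])
        failed = int(match["failed"])
        pct = 100.0 * failed / sent if sent else 0.0
        rates[match["server"]] = (sent, failed, round(pct, 1))
    return rates


sample = """\
Apr  9 22:59:01 dnsmasq[15330]: server 8.8.4.4#53: queries sent 15, retried or failed 3
Apr  9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65057: queries sent 1722, retried or failed 865
Apr  9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65058: queries sent 891, retried or failed 2
Apr  9 22:59:01 dnsmasq[15330]: server 127.0.0.1#65055: queries sent 2, retried or failed 0
"""

for server, (sent, failed, pct) in failure_rates(sample).items():
    print(f"{server}: {failed}/{sent} = {pct}%")
```

On the dump above this puts 65057 at roughly 50% and 65058 at well under 1%, even though both proxies point at the same server.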

A couple of days ago I discovered that, no matter which server was set as primary, dnsmasq reported a high fail/retry rate, usually between about 20% and 50% of queries.
I tried numerous things, such as:
Trying different servers as primary (some, like OpenDNS, had noticeably lower failure rates, but still well into double figures).
Trying a direct route to my DNS server with no dnscrypt tunnel.
Disabling EDP.
Disabling DNSSEC.
Having only one server as primary (this was to see if the lookups would become hard fails, so that I would see failures in the browser etc.).

The only consistent pattern really was whatever server was backup almost 100% of its queries were successful. As noted here this includes setting the same server twice so once as primary and once as secondary.

Right now I am leaning toward the problem being dnsmasq, because even with the same server first and second, the failure rates don't match, which they would if the problem were server side. My working theory at the moment is that dnsmasq is impatient in waiting for replies: if one doesn't come quickly enough, it fails, aborts and asks the next DNS server. With my server listed twice, by the time it is asked the second time the record is in its cache and it can answer quickly enough.
The problem with this theory is that when I have a different server in the first and second slots, the second server won't have the benefit of precaching, as it hasn't been asked yet, which puts the above theory to bed really.
To be clear, if I put something like dnscrypto .fr first and my server second, the first will still fail 20-50% of queries and my server will pass 99%. It's whatever is first that takes the hit. But my server does seem to have a higher average, e.g. in this case it's 50% after a few hours running, while a lot of other servers are nearer 30-40%.

I am not noticing this when browsing etc. A retry on a DNS lookup would surely cause a longer wait, but none is visible to me, which may also suggest it's retrying really quickly, maybe not even waiting one second for a query response.

A lot of work was done on my DNS server to see if the problem was server side. Aside from adjusting many DNS server configuration options, I also checked the OS-side tuning parameters to ensure there were no bottlenecks, and even went as far as replacing bind9 with unbound (which is what is currently running).

Thoughts? (With your own data, hopefully.)
 
I took a quick look at the code, and it looks like the only timeout in dnsmasq is if there are no available cache slots on a heavily loaded system. For the initial query, it seems to use the standard stack UDP error handling. So the failures would be negative response, unable to open the UDP socket, or a UDP timeout. Did you tweak the UDP unreplied timeout under Tools>Other Settings at all?

BTW....I just checked my stats. I currently have my VPN up in strict DNS mode and I'm getting 1-2% retries from the first server to the second on the list.
 
Here are my stats:

time 655005
cache size 1000, 0/8928 cache insertions re-used unexpired cache entries.
queries forwarded 46851, queries answered locally 36408
server 8.8.4.4#53: queries sent 46851, retried or failed 0
server x.x.x.x#53: queries sent 46851, retried or failed 0
server y.y.y.y#53: queries sent 46851, retried or failed 21

I set up three name servers in parallel, the so-called "all-servers" mode. Regardless of how I order the three servers, the last one as seen in the log (or the first one as you put it in the config file) will always have a few errors, but not the other two. That, I believe, is indeed a dnsmasq bug. With an error rate of less than 0.05%, I don't have a complaint.

I played with dnscrypt for a short time more than a year ago. Now I don't run it, and I would avoid it unless it's absolutely necessary in a given setup.

For your high error rates, I would think it's more to do with dnscrypt than dnsmasq.
 
It's not dnscrypt, as I tested without it. I will try what john said, thanks.
 
Some more data from a few hours using Google DNS, with no dnscrypt or DNSSEC. The failure rate is better but it still seems way too high. The UDP values are back at default in settings. I will investigate more later in the week as I'm too tired now. Thanks guys.

Code:
Apr 10 23:59:01 dnsmasq[19794]: time 2099380
Apr 10 23:59:01 dnsmasq[19794]: cache size 2000, 0/23 cache insertions re-used unexpired cache entries.
Apr 10 23:59:01 dnsmasq[19794]: queries forwarded 295, queries answered locally 632
Apr 10 23:59:01 dnsmasq[19794]: server 8.8.4.4#53: queries sent 3, retried or failed 1
Apr 10 23:59:01 dnsmasq[19794]: server 8.8.8.8#53: queries sent 292, retried or failed 127
 
Not seeing a problem here:
Code:
Apr 11 00:29:05 dnsmasq[1091]: time 166742
Apr 11 00:29:05 dnsmasq[1091]: cache size 1500, 0/12416 cache insertions re-used unexpired cache entries.
Apr 11 00:29:05 dnsmasq[1091]: queries forwarded 4060, queries answered locally 5525
Apr 11 00:29:05 dnsmasq[1091]: server 194.168.4.100#53: queries sent 3899, retried or failed 0
Apr 11 00:29:05 dnsmasq[1091]: server 194.168.8.100#53: queries sent 1078, retried or failed 0

Are you overloading your router's CPU? Has the router run out of RAM, and/or is it paging?
 
Two other thoughts with no data to support them :)
- Your ISP is traffic shaping UDP traffic in a heavy handed way
- Did you alter the DNS or UDP timeouts on the clients? I don't know what dnsmasq will do if there was no longer a connection to the client to deliver the response.
 
Nope. In addition, on the router the timeout was only reduced to 15 secs (now back to 30s), which is way above what seems to be going on, where it only waits for about 600ms. No clients have had UDP adjustments; my main client is a Windows machine, where that cannot be adjusted at all.

However, this has led to an idea: since I have a samknows box connected (I have no idea what it's doing) and a vix tv box, I will temporarily disconnect both to see if they are causing this behaviour.

Regarding the router CPU, it's not overloaded, but I am curious whether the overclocked CPU might be generating corrupted packets, so as a test I may temporarily disable the overclock as well.

Regarding UDP shaping, I would hope not; besides, this also affects dnscrypt, which goes over TCP.
 
Yeah, I discovered Google DNS has a circa 50% failure rate also if I add it twice.

When it's only added once, the stats get skewed more favourably because all the retries are successful.

So e.g. if there are 50 queries and 25 fail:

Just 8.8.8.8 would show

75 queries 25 fail - 33%

Whilst 8.8.8.8 and 8.8.4.4 would show

8.8.8.8 50 queries 25 fail - 50%
8.8.4.4 25 queries 0 fail - 0%
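
The counting in that example can be worked through with a toy model. It assumes, purely for illustration, that every first attempt goes to the first server and that every retry succeeds, either on the second server or on the same one again if it is the only one. With those assumptions the first server's own ratio in the two-server case is 25 out of 50. `reported_rates` is a made-up name.

```python
def reported_rates(client_queries: int, first_try_failures: int,
                   has_second_server: bool):
    """Return [(sent, failed, pct)] per server under the toy assumptions.

    Assumed: all first attempts hit the first server and all retries succeed.
    """
    def pct(failed, sent):
        return round(100.0 * failed / sent, 1) if sent else 0.0

    if not has_second_server:
        # Retries land on the same server, which inflates its "sent" counter.
        sent = client_queries + first_try_failures
        return [(sent, first_try_failures, pct(first_try_failures, sent))]

    first = (client_queries, first_try_failures,
             pct(first_try_failures, client_queries))
    second = (first_try_failures, 0, 0.0)  # the second server only sees retries
    return [first, second]


print(reported_rates(50, 25, has_second_server=False))
print(reported_rates(50, 25, has_second_server=True))
```

The same 25 failed first attempts look like about a third of traffic with one server configured and like half of it once a second server absorbs the retries.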
 
In my region, 8.8.4.4 has a ping time of 5ms where 8.8.8.8 is ten times longer. Worth a check if similar in your case in relative sense..

Check each server's ping. Order them in descending order in dnsmasq config. Or ascending order...I have no idea which end dnsmasq picks first. But one of the ordering shall work out better than the other.
 
They are both the same in network ping time. It's whichever is first in the dnsmasq config that suffers, so if 8.8.4.4 is first then that one has the failed queries.
 
After my previous reply, I did some digging.

If the config has "strict-order", then servers will be used in the order specified in the config. Without "strict-order", dnsmasq decides its own order in a smart way, regardless of how people specify it in the config:

The algorithm for determining which server to use goes like this.

In the start state, dnsmasq sends the query to all the servers. When the
first server replies, it becomes the preferred server and dnsmasq moves
into a state where only the preferred server is used. It remains in that
state until one of three conditions occur, when dnsmasq moves back to
the initial state and a query is again sent to all the servers. The
conditions are.

1) A SERVFAIL or REFUSED return code is received.
2) More than 50 queries or 10 seconds have elapsed (version 2.51 only)
3) No reply is received and a client times-out and retries a query.


http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2009q3/003295.html

The link is quite old. Info could be outdated.
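
To make the quoted description easier to reason about, here is a toy state machine that follows it step by step. It is a sketch of the mailing-list text only, not dnsmasq's actual code: the class name, the method names and the exact way the 50-query and 10-second limits are counted are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_QUERIES = 50      # condition 2 in the quoted post
MAX_SECONDS = 10.0    # condition 2 in the quoted post


@dataclass
class Selector:
    servers: List[str]
    preferred: Optional[str] = None
    queries_since_probe: int = 0
    last_probe_time: float = 0.0

    def targets(self, now: float) -> List[str]:
        """Return the servers the next query would be sent to."""
        if self.preferred is not None and (
            self.queries_since_probe > MAX_QUERIES
            or now - self.last_probe_time > MAX_SECONDS
        ):
            self.reset()  # condition 2: too many queries or too much time
        if self.preferred is None:
            # Start state: ask everyone and wait to see who answers first.
            self.last_probe_time = now
            self.queries_since_probe = 0
            return list(self.servers)
        self.queries_since_probe += 1
        return [self.preferred]

    def on_reply(self, server: str, rcode: str) -> None:
        if rcode in ("SERVFAIL", "REFUSED"):
            self.reset()              # condition 1
        elif self.preferred is None:
            self.preferred = server   # first server to reply becomes preferred

    def on_client_retry(self) -> None:
        self.reset()                  # condition 3: client timed out and retried

    def reset(self) -> None:
        self.preferred = None
```

In this model a client retry is enough to push the selector back to asking every server, which may be relevant if a browser retries lookups quickly.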
 
OK, I have an update.

If "all-servers" is set, which queries all servers and uses the fastest result, then there are 0 errors, or at least very close to 0.
If "strict-order" is set, then the first server always has a high failure rate.
However, if I run the grsec DNS diagnostic tool, which does 151 lookups, there are 0 errors regardless of the dnsmasq configuration. I then discovered it's only Chrome browsing that causes this issue. So it's a combination of Chrome browsing and strict-order. DNS prefetching is disabled in Chrome.
 
kvic yes, my normal preferred mode is strict-order as I want to control which dns server is used.

here is data in all-servers mode, the first server is my server.

Code:
Apr 11 14:54:15 dnsmasq[22345]: time 2153094
Apr 11 14:54:15 dnsmasq[22345]: cache size 1500, 0/496 cache insertions re-used unexpired cache entries.
Apr 11 14:54:15 dnsmasq[22345]: queries forwarded 317, queries answered locally 164
Apr 11 14:54:15 dnsmasq[22345]: server 109.74.x.x#53: queries sent 317, retried or failed 0
Apr 11 14:54:15 dnsmasq[22345]: server 8.8.4.4#53: queries sent 317, retried or failed 1

and after switching to dnscrypt

Code:
Apr 11 15:00:05 dnsmasq[22392]: cache size 1500, 0/308 cache insertions re-used unexpired cache entries.
Apr 11 15:00:05 dnsmasq[22392]: queries forwarded 181, queries answered locally 65
Apr 11 15:00:05 dnsmasq[22392]: server 127.0.0.1#65058: queries sent 181, retried or failed 0
Apr 11 15:00:05 dnsmasq[22392]: server 8.8.4.4#53: queries sent 181, retried or failed 3

My server as the only resolver, with both all-servers and strict-order off:

Code:
Apr 11 15:03:25 dnsmasq[22420]: cache size 1500, 0/247 cache insertions re-used unexpired cache entries.
Apr 11 15:03:25 dnsmasq[22420]: queries forwarded 284, queries answered locally 141
Apr 11 15:03:25 dnsmasq[22420]: server 127.0.0.1#65058: queries sent 284, retried or failed 2

As above but with strict-order enabled. Note that since there is only one server this should make no operational difference, but check the failure rate (skewed by the retries working).

Code:
Apr 11 15:06:01 dnsmasq[22436]: cache size 1500, 0/64 cache insertions re-used unexpired cache entries.
Apr 11 15:06:01 dnsmasq[22436]: queries forwarded 59, queries answered locally 4
Apr 11 15:06:01 dnsmasq[22436]: server 127.0.0.1#65058: queries sent 59, retried or failed 18

same data as above but after I did a grsec benchmark run, note no increase to failed queries

Code:
Apr 11 15:07:18 dnsmasq[22436]: cache size 1500, 0/227 cache insertions re-used unexpired cache entries.
Apr 11 15:07:18 dnsmasq[22436]: queries forwarded 212, queries answered locally 65
Apr 11 15:07:18 dnsmasq[22436]: server 127.0.0.1#65058: queries sent 212, retried or failed 18
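
Since the counters are cumulative for the life of the process, the interval that included the benchmark run can be isolated by subtracting one dump from the other. A small sketch, with `delta` as a made-up helper name:

```python
def delta(before, after):
    """Difference between two (sent, failed) snapshots of the same process."""
    sent = after[0] - before[0]
    failed = after[1] - before[1]
    if sent < 0 or failed < 0:
        raise ValueError("counters went backwards; dnsmasq was probably restarted")
    return sent, failed


# Counters for 127.0.0.1#65058 from the 15:06:01 and 15:07:18 dumps (same PID 22436).
before_benchmark = (59, 18)
after_benchmark = (212, 18)

print(delta(before_benchmark, after_benchmark))  # (153, 0)
```

That is 153 additional forwarded queries with no new retries or failures.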

All these test results had chrome browsing included.

finally with strict-order off again and no grsec runs so just chrome browsing data

Code:
Apr 11 15:12:11 dnsmasq[22463]: cache size 1500, 0/104 cache insertions re-used unexpired cache entries.
Apr 11 15:12:11 dnsmasq[22463]: queries forwarded 65, queries answered locally 7
Apr 11 15:12:11 dnsmasq[22463]: server 127.0.0.1#65058: queries sent 65, retried or failed 0
 
I have raised a couple of bugs on dnsmasq.

I discovered that with all-servers enabled it's not actually all clean.

The last server (not the first, as is the case with strict-order) always gets some failures.
This is the bug kvic mentioned, but mine is much higher percentage-wise.

So check these 2 bits of data.

This was with all-servers and just my two proxies to the same server; note the last one had errors. Also note I have added a 2nd Google DNS for the NTP queries to stop those errors.

Code:
Apr 11 17:59:01 dnsmasq[22704]: server 8.8.4.4#53: queries sent 3, retried or failed 0
Apr 11 17:59:01 dnsmasq[22704]: server 8.8.8.8#53: queries sent 5, retried or failed 1
Apr 11 17:59:01 dnsmasq[22704]: server 127.0.0.1#65057: queries sent 1176, retried or failed 0
Apr 11 17:59:01 dnsmasq[22704]: server 127.0.0.1#65058: queries sent 1131, retried or failed 198

So I thought I would add the OpenDNS dnscrypt proxy to the end, which would put 65058 in the middle of the sandwich, where in theory it should get 0 errors. The theory was right. 65053 is the cisco opencrypt server (OpenDNS).

Code:
Apr 11 18:50:04 dnsmasq[23143]: server 8.8.4.4#53: queries sent 0, retried or failed 0
Apr 11 18:50:04 dnsmasq[23143]: server 8.8.8.8#53: queries sent 0, retried or failed 0
Apr 11 18:50:04 dnsmasq[23143]: server 127.0.0.1#65057: queries sent 154, retried or failed 0
Apr 11 18:50:04 dnsmasq[23143]: server 127.0.0.1#65058: queries sent 154, retried or failed 0
Apr 11 18:50:04 dnsmasq[23143]: server 127.0.0.1#65053: queries sent 154, retried or failed 62

Why did I have 0 errors on earlier tests? It's possible my earlier all-servers test was only grsec runs with no Chrome browsing, and I remembered wrong when I made that post.

It seems I have a choice. I can pick a redundancy configuration, which is what I had been running for the past few days: strict-order, my server as primary and backup, with the true backup as the 3rd server. The downside is possible slowdowns on the queries that need to wait for a retry (the slowdown is not much though, as dnsmasq retries very aggressively).
Or I can go for what should have no performance impact: my server as both servers in the config, with all-servers enabled. The one that has errors doesn't matter, as the result from the one with no errors will get used and sent to the client. The downside is I get no redundancy if my DNS server goes down.

The only configuration that seems to genuinely get no errors (unless they are real errors) is when both all-servers and strict-order are disabled, so it just uses the default algorithm.

hopefully it gets fixed.

To really put the idea that it's dnscrypt's fault to bed, here is a run with my ISP DNS server:

Code:
Apr 11 19:27:19 dnsmasq[23358]: DNSSEC memory in use 4048, max 5456, allocated 199980
Apr 11 19:27:19 dnsmasq[23358]: server 8.8.4.4#53: queries sent 0, retried or failed 0
Apr 11 19:27:19 dnsmasq[23358]: server 8.8.8.8#53: queries sent 0, retried or failed 0
Apr 11 19:27:19 dnsmasq[23358]: server 127.0.0.1#65057: queries sent 179, retried or failed 0
Apr 11 19:27:19 dnsmasq[23358]: server 127.0.0.1#65058: queries sent 183, retried or failed 0
Apr 11 19:27:19 dnsmasq[23358]: server 90.207.238.97#53: queries sent 194, retried or failed 125

All of these servers answer 100% of queries when queried directly from Windows, bypassing dnsmasq.
 
Thanks for filing the bug.

When I was on 378.55, I did see a much higher % of failure rate. Now I'm on 380.58 alpha 4 + a heavily patched custom kernel.

I was a bit amazed to see the much lower %. First time I re-visited that statistics after upgrade from 378.55. Same dnsmasq config.

Not that I'm aware of a kernel patch that improves dnsmasq performance. Nor imply the custom kernel helps in some way. Just saying..

If dnsmasq author could fix in his program, then that's all good news for everyone.
 
I guess I'm in the minority.....looks like it's working the way it should for me with strict mode, VPN active, DNSSEC active....

Code:
Apr 11 08:15:01 dnsmasq[2545]: time 22847
Apr 11 08:15:01 dnsmasq[2545]: cache size 1500, 0/4617 cache insertions re-used unexpired cache entries.
Apr 11 08:15:01 dnsmasq[2545]: queries forwarded 1820, queries answered locally 331
Apr 11 08:15:01 dnsmasq[2545]: DNSSEC memory in use 36344, max 64944, allocated 149996
Apr 11 08:15:01 dnsmasq[2545]: server 209.222.18.222#53: queries sent 2153, retried or failed 23
Apr 11 08:15:01 dnsmasq[2545]: server 209.222.18.218#53: queries sent 23, retried or failed 8
Apr 11 08:15:01 dnsmasq[2545]: server 68.105.28.11#53: queries sent 8, retried or failed 0
Apr 11 08:15:01 dnsmasq[2545]: server 68.105.29.11#53: queries sent 0, retried or failed 0
Apr 11 08:15:01 dnsmasq[2545]: server 68.105.28.12#53: queries sent 0, retried or failed 0
Apr 11 08:15:01 dnsmasq[2545]: server 2001:578:3f::30#53: queries sent 0, retried or failed 0
Apr 11 08:15:01 dnsmasq[2545]: server 2001:578:3f:1::30#53: queries sent 0, retried or failed 0

The first two servers are the VPN servers, the next 3 IPv4 from ISP, last 2 IPv6 from ISP....I'm guessing the initial fails may be the VPN server 'busy'...
(Although it did show me I occasionally get a DNS leak from the VPN in strict mode)
 
Hmm, is it possible to share info on that kernel patch with john? I am curious whether it would help.

I settled on using all-servers with 2 servers in config that point to my server.
 
I guess I'm in the minority.....looks like it's working the way it should for me with strict mode
Same here. Working as expected in both strict-order and all-servers modes. With all-servers I got 2 retries out of 330 to 8.8.8.8. All other servers and combinations give me 0 retries.

What dnsmasq versions are we using? I have 2.75.
 
Just to throw the info out there: I use a custom config with Google, OpenDNS and unreliable ISP DNS servers, and let dnsmasq choose which to use (and none of them gets all my history!). John's fork on an N66.

Code:
Apr 11 20:36:02 dnsmasq[5186]: time 1757774
Apr 11 20:36:02 dnsmasq[5186]: cache size 8192, 0/142754 cache insertions re-used unexpired cache entries.
Apr 11 20:36:02 dnsmasq[5186]: queries forwarded 196868, queries answered locally 39860
Apr 11 20:36:02 dnsmasq[5186]: server 8.8.4.4#53: queries sent 72339, retried or failed 2398
Apr 11 20:36:02 dnsmasq[5186]: server 8.8.8.8#53: queries sent 66336, retried or failed 1075
Apr 11 20:36:02 dnsmasq[5186]: server 208.67.220.222#53: queries sent 65179, retried or failed 558
Apr 11 20:36:02 dnsmasq[5186]: server 208.67.222.220#53: queries sent 65003, retried or failed 419
Apr 11 20:36:02 dnsmasq[5186]: server 208.67.220.220#53: queries sent 65480, retried or failed 594
Apr 11 20:36:02 dnsmasq[5186]: server 208.67.222.222#53: queries sent 64918, retried or failed 513
Apr 11 20:36:02 dnsmasq[5186]: server 194.168.4.100#53: queries sent 45766, retried or failed 6667
Apr 11 20:36:02 dnsmasq[5186]: server 194.168.8.100#53: queries sent 22561, retried or failed 3696

Some errors could be due to intermittent ISP connectivity, or from not using QOS...
 
