Forked operation
Unbound has a unique mode in which it operates without threading, using forked processes instead. This can be useful if libevent fails on the platform, for extra performance, or for walling off the cores from each other so that a poisoned cache on one cannot affect another.
To build for forked operation, run ./configure --without-pthreads --without-solaris-threads before compiling. This disables threads and enables forked operation. Because no locking has to be done, the code runs about 10 to 20% faster.
In the config file, num-threads still specifies the number of cores you want to use, even though Unbound uses processes and not threads. Note that outgoing-range and the cache memory values all apply per process (per "thread" in the option names). This means that much more memory is used, because every core has its own cache. It also means that if one cache gets poisoned, the others are not affected.
# with forked operation
server:
	# use all CPUs
	num-threads: <number of cores>
	msg-cache-slabs: 1
	rrset-cache-slabs: 1
	infra-cache-slabs: 1
	key-cache-slabs: 1
	# more cache memory, rrset=msg*2
	# total usage is 150m*cores
	rrset-cache-size: 100m
	msg-cache-size: 50m
	# does not depend on number of cores
	outgoing-range: 950
	num-queries-per-thread: 512
	# Larger socket buffer. OS may need config.
	so-rcvbuf: 4m
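The cache comment in the config above ("total usage is 150m*cores") can be worked through as a minimal Python sketch. The helper names parse_size and total_cache_bytes are hypothetical and not part of Unbound, the core count of 4 is only an example, and the sketch assumes the k/m/g size suffixes are powers of 1024. It counts the configured cache sizes only, not any other per-process overhead.

```python
# Hypothetical helpers for illustration; only the option values
# (100m, 50m) come from the config above.
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert a size such as '100m' to bytes (suffixes assumed to be powers of 1024)."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def total_cache_bytes(cores, rrset_cache_size, msg_cache_size):
    """Configured cache memory summed over all forked processes."""
    per_process = parse_size(rrset_cache_size) + parse_size(msg_cache_size)
    return cores * per_process


if __name__ == "__main__":
    cores = 4  # example value, not taken from the config
    total = total_cache_bytes(cores, "100m", "50m")
    print(f"{cores} processes: {total // 1024**2} MiB of configured cache")
```

Changing the cores value shows how quickly the total grows, since every forked process keeps its own full-sized caches.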
Because every process can now use at most 1024 file descriptors, the effective maximum is the number of cores times 1024. The config above uses 950 per process, which for 4 processes gives a respectable 3800 sockets. The number of queries per thread is about half the number of sockets, which guarantees that every query can get a socket and leaves some to spare for queries-for-nameservers.
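The socket arithmetic above can be checked with a short Python sketch. The function name socket_budget and the keys of the returned dictionary are hypothetical, chosen only for illustration. The 1024 limit, the 950 and 512 values, and the 4 processes come from the text.

```python
FD_LIMIT = 1024  # per-process file descriptor ceiling stated in the text


def socket_budget(cores, outgoing_range, queries_per_thread):
    """Summarise socket capacity for forked operation (illustrative only)."""
    if outgoing_range >= FD_LIMIT:
        raise ValueError("outgoing-range must stay below the per-process limit")
    return {
        # Each process has its own descriptors, so capacity scales with cores.
        "total_sockets": cores * outgoing_range,
        # Descriptors left in each process for everything except outgoing sockets.
        "spare_fds_per_process": FD_LIMIT - outgoing_range,
        # A ratio above 1 means every query can get at least one socket.
        "sockets_per_query": outgoing_range / queries_per_thread,
    }


if __name__ == "__main__":
    budget = socket_budget(cores=4, outgoing_range=950, queries_per_thread=512)
    for name, value in budget.items():
        print(f"{name}: {value}")
```

With the values from the config, the total is 3800 sockets, and each process keeps 74 descriptors free for purposes other than outgoing sockets.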
Using forked operation together with libevent is also possible. It may be useful to force the OS to service the file descriptors of different processes instead of threads. This may give (radically) different performance if the underlying network stack uses (slow) per-process lookup structures.