Supervising BIND 9
BIND 9 is a mature piece of software. Compared with its predecessors BIND 4 and BIND 8, it is more stable and has fewer security problems. One reason for this is the "design by contract" programming style used by the BIND 9 team. BIND 9 is very paranoid about the data it consumes and about its internal data structures. Once BIND 9 finds an unexpected state in its internal data structures, it terminates the DNS server process instead of continuing to run with wrong data (and risking a security vulnerability).
While this behavior is good for security, it is very bad for service uptime. The DNS server process terminates, and with it the DNS service. Users (customers) do not care much about security if they cannot reach Facebook. In the past years there were several incidents where BIND 9 terminated because of issues inside its code or data structures, such as "BIND 9 Resolver crashes after logging an error in query.c".
BIND 10 is aiming to solve this, as project lead Shane Kerr writes in "Software Robustness and BIND 10". But until BIND 10 arrives, a work-around is needed for BIND 9.
The real issue for the DNS service is not that BIND 9 terminates on bad data, but that BIND 9 cannot restart itself afterwards. There is no "supervisor" process in BIND 9.
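What a supervisor does is conceptually simple: start the service as a child process, wait for it to exit, and start it again if the exit was not a clean one. The following Python 3 sketch only illustrates that idea. It is not how supervisord or any other real supervisor is implemented, and the function name supervise and its parameters are hypothetical names chosen for this example.

```python
import subprocess
import time


def supervise(command, expected_exitcodes=(0,), max_restarts=3, delay=1.0):
    """Run command in the foreground and restart it after unexpected exits.

    Hypothetical helper for illustration only.
    Returns the number of restarts that were performed.
    """
    restarts = 0
    while True:
        # Blocks until the child exits. A child killed by a signal yields a
        # negative return code, which is never in expected_exitcodes.
        returncode = subprocess.call(command)
        if returncode in expected_exitcodes:
            return restarts          # clean exit, nothing to do
        if restarts >= max_restarts:
            return restarts          # give up instead of looping forever
        restarts += 1
        time.sleep(delay)            # short pause before the next attempt
```

A call such as supervise(["/usr/sbin/named", "-f"]) would keep a foreground named running under these assumptions. A real supervisor adds logging, start-up checks, and a control interface on top of this loop.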
Some operating systems have a built-in solution: Mac OS X has launchd, and the BIND 9 version Apple delivers with the OS is automatically restarted should it terminate unexpectedly. Solaris has SMF (Service Management Facility), and BIND 9 can be integrated into SMF. Ubuntu Linux now has upstart, and Fedora has systemd, both of which can also monitor processes and restart them if needed.
For Unix and Linux operating systems that do not ship with a process supervisor, supervisord is a nice solution that is easy to set up. Supervisord comes as a package with many Linux distributions, and it also works on the BSD Unixes. The configuration below is for OpenBSD, but should require only minor tweaks to run on other Unix systems as well.
Installation
Supervisord is written in Python (2.4 - 2.7) and can be installed from source (where we have to download and install all dependencies) or with the help of setuptools, which takes care of downloading and installing all dependencies. Below I use setuptools:
bash# sh setuptools-0.6c11-py2.7.egg
bash# easy_install supervisor
A basic configuration file for BIND 9 "named"
Below is my basic /etc/supervisord.conf configuration file for one service, the BIND 9 DNS Server:
[unix_http_server]
file = /tmp/supervisor.sock
chmod = 0777
chown= nobody:nogroup

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///tmp/supervisor.sock

[supervisord]
logfile = /var/log/supervisord.log
logfile_maxbytes = 10MB
logfile_backups=10
loglevel = info
pidfile = /var/run/supervisord.pid
identifier = supervisor
directory = /tmp

[program:named]
command=/usr/sbin/named -f
process_name=%(program_name)s
numprocs=1
directory=/var/named
priority=100
autostart=true
autorestart=unexpected
startsecs=5
startretries=3
exitcodes=0,2
stopsignal=TERM
stopwaitsecs=10
redirect_stderr=false
stdout_logfile=/var/log/named_supervisord.log
stdout_logfile_maxbytes=1MB
stdout_logfile_backups=10
stdout_capture_maxbytes=1MB
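To make the effect of autorestart=unexpected together with exitcodes=0,2 more concrete, here is a minimal Python 3 sketch of the restart decision as I understand it from the Supervisor documentation. It is a model for illustration, not Supervisor's own code: the function should_restart and its parameters are made up for this example, and it ignores the separate startsecs/startretries handling for processes that die during start-up.

```python
def should_restart(autorestart, expected_exitcodes, exit_code=None, signal=None):
    """Model of the restart decision (hypothetical helper, not Supervisor code).

    autorestart        -- "true", "false" or "unexpected", as in the config file
    expected_exitcodes -- the codes listed under exitcodes=
    exit_code          -- exit status, or None if the process died from a signal
    signal             -- number of the signal that killed the process, or None
    """
    if autorestart == "true":
        return True                  # always restart
    if autorestart == "false":
        return False                 # never restart
    if autorestart != "unexpected":
        raise ValueError("unknown autorestart value: %r" % (autorestart,))
    if signal is not None:
        return True                  # killed by a signal: never an expected exit
    return exit_code not in expected_exitcodes
```

With the values from the configuration above, should_restart("unexpected", [0, 2], exit_code=0) is False, so a named that exits cleanly stays down, while should_restart("unexpected", [0, 2], signal=9) is True, so a named killed with SIGKILL is started again.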
Starting "supervisord"
Once the configuration file is in place, we can start supervisord. Make sure that BIND 9 is not already running, or you will end up with two instances of the BIND 9 server, which is not a good idea. Also make sure that supervisord is started when the server reboots, either through a start script or by other means. The supervisord packages that come with Linux distributions install a start script.
bash# supervisord
bash# tail /var/log/supervisord.log
2012-06-16 16:59:48,812 INFO Increased RLIMIT_NOFILE limit to 1024
2012-06-16 16:59:48,949 INFO RPC interface 'supervisor' initialized
2012-06-16 16:59:48,953 INFO RPC interface 'supervisor' initialized
2012-06-16 16:59:48,963 INFO daemonizing the supervisord process
2012-06-16 16:59:48,964 INFO set current directory: '/tmp'
2012-06-16 16:59:48,967 INFO supervisord started with pid 14724
2012-06-16 16:59:49,976 INFO spawned: 'named' with pid 16701
2012-06-16 16:59:55,020 INFO success: named entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
Great, supervisord has started, and it also started the BIND 9 process "named". DNS is working now.
Simulating a BIND 9 crash
To simulate a BIND 9 crash, we "kill" the BIND 9 named process:
bash# ps aux | grep named
_syslogd 32633 0.0 0.1 512 648 ?? I 17Apr12 2:28.76 syslogd -a /var/named/dev/log -a /var/empty/dev/log
root 16701 0.0 0.8 5684 6500 ?? I 4:59PM 0:00.50 /usr/sbin/named -f
bash#
bash-3.2# kill -9 16701
bash# tail supervisord.log
2012-06-16 17:03:29,192 INFO exited: named (terminated by SIGKILL; not expected)
2012-06-16 17:03:30,201 INFO spawned: 'named' with pid 9832
bash#
Works as expected. Supervisord has detected that the BIND 9 process terminated and has started a new one. DNS is still up and running.
Controlling supervisord
Supervisord can be controlled from the command line using the supervisorctl command. A list of all control commands can be found with "help", and a description of each command with "help command":
bash# supervisorctl help
default commands (type help <topic>):
=====================================
add    clear  fg        open  quit    remove  restart   start   stop  update
avail  exit   maintail  pid   reload  reread  shutdown  status  tail  version
bash# supervisorctl help status
status                  Get all process status info.
status <name>           Get status on a single process by name.
status <name> <name>    Get status on multiple named processes.
bash# supervisorctl status
named RUNNING pid 25770, uptime 0:00:12
bash# supervisorctl stop named
named: stopped
bash# supervisorctl start named
named: started
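The output of "supervisorctl status" is easy to check from a script, for example in a cron job or a monitoring probe. The Python 3 sketch below assumes the line layout shown above (process name, upper-case state, free-form detail). The helper names parse_status, not_running and current_status are hypothetical, and the layout may differ between Supervisor versions.

```python
import re
import subprocess

# One status line: process name, upper-case state, optional detail text.
STATUS_LINE = re.compile(r"^(?P<name>\S+)\s+(?P<state>[A-Z]+)(?:\s+(?P<detail>.*))?$")


def parse_status(output):
    """Return {process name: (state, detail)} for every status line found."""
    result = {}
    for line in output.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            result[match.group("name")] = (match.group("state"),
                                           match.group("detail") or "")
    return result


def not_running(output):
    """Return the sorted names of all processes whose state is not RUNNING."""
    return sorted(name for name, (state, _detail) in parse_status(output).items()
                  if state != "RUNNING")


def current_status():
    """Run 'supervisorctl status' (must be on PATH) and return its output."""
    completed = subprocess.run(["supervisorctl", "status"],
                               stdout=subprocess.PIPE,
                               universal_newlines=True)
    return completed.stdout
```

Calling not_running(current_status()) on the supervised host returns an empty list while named is in the RUNNING state, and ["named"] after "supervisorctl stop named".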
Now, whenever an assertion failure is triggered in the code, BIND 9 will terminate, but supervisord will bring it back from the dead. Your DNS service stays up, and the users and customers stay happy.
Read the supervisord documentation on how to set up event notifications, so that you get an e-mail whenever BIND 9 has been restarted. There might be a security vulnerability nonetheless, which you would like to report to bind9-bugs@isc.org.
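Event notifications are delivered to "event listener" programs that talk to supervisord over stdin and stdout. As a starting point, here is a minimal Python 3 sketch of such a listener for PROCESS_STATE_EXITED events, based on my reading of the event listener protocol in the Supervisor documentation. The names run_listener, parse_tokens and log_to_stderr are hypothetical, and the notification only writes to stderr; sending mail is left out on purpose. The script would be registered in an [eventlistener:...] section of supervisord.conf with events=PROCESS_STATE_EXITED.

```python
import sys


def parse_tokens(text):
    """Turn 'key:value key:value ...' into a dictionary."""
    return dict(token.split(":", 1) for token in text.split())


def run_listener(stdin, stdout, notify):
    """Speak the event listener protocol until stdin is closed."""
    while True:
        stdout.write("READY\n")              # ready for the next event
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:                  # EOF: supervisord went away
            break
        headers = parse_tokens(header_line)
        payload = stdin.read(int(headers["len"]))
        if headers.get("eventname") == "PROCESS_STATE_EXITED":
            event = parse_tokens(payload)
            if event.get("expected") == "0": # exit code was not in exitcodes
                notify(event)
        stdout.write("RESULT 2\nOK")         # acknowledge the event
        stdout.flush()


def log_to_stderr(event):
    """Placeholder notification; replace with the mail delivery of your choice."""
    sys.stderr.write("unexpected exit of %s (pid %s)\n"
                     % (event.get("processname"), event.get("pid")))
    sys.stderr.flush()

# To use this as a listener, end the script with:
#   run_listener(sys.stdin, sys.stdout, log_to_stderr)
```

Only protocol messages may go to stdout in a listener, which is why the placeholder writes to stderr. Passing the streams and the notify callback as parameters keeps the protocol loop testable without a running supervisord.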
Of course supervisord can be used to restart other processes as well, including other types of DNS Servers (NSD, Unbound, dnsmasq …).