author     bwarsaw   2001-10-18 22:18:14 +0000
committer  bwarsaw   2001-10-18 22:18:14 +0000
commit     85fe225037ff91327785891f8e4cfa51cae307ad (patch)
tree       190d3840955649e5414d797dc21f8d88917416d7 /bin
parent     8bef90a9e62e3191e426b9429e20e136a7a37a3d (diff)
A fairly significant rewrite, but now the `restart' command actually does the
right thing!

The sub-qrunners are exec'd now from bin/qrunner using some new command line
options, so killing and (auto-)restarting them will cleanly reload any changed
modules.  Also, the lock is acquired in the foreground so you don't get ugly
error messages if another master qrunner is already running.

Finally, the separate lock-refresher process is gone.  It was too hard to
implement correct lock ownership transfer semantics with it (it was more
complicated than the simple pass-thru to the child).  Now, lock refresh is
implemented by a once-a-day alarm signal in the master qrunner process
watcher.

Also implemented the `reopen' command, which causes all the log files to be
re-opened (very useful if you're rotating log files!).
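For reference, the new locking scheme boils down to the pattern below. This is
only a condensed sketch of what the patch does; every name in it
(LockFile.LockFile, lock.lock(), lock.refresh(), mm_cfg.days()/hours(),
mm_cfg.LOCK_DIR) is taken from the Mailman code in the diff that follows, with
logging and error handling stripped out.

    import os
    import signal

    from Mailman import mm_cfg
    from Mailman import LockFile

    # The master lock's lifetime only needs to outlive the daily refresh,
    # plus a little padding.
    LOCKFILE = os.path.join(mm_cfg.LOCK_DIR, 'master-qrunner')
    LOCK_LIFETIME = mm_cfg.days(1) + mm_cfg.hours(6)

    # Acquire the lock in the foreground with a short timeout, so a second
    # mailmanctl fails right away instead of spewing errors later.
    lock = LockFile.LockFile(LOCKFILE, LOCK_LIFETIME)
    lock.lock(0.1)

    def sigalrm_handler(signum, frame, lock=lock):
        # Once a day: push the lock's expiration out and re-arm the alarm.
        lock.refresh()
        signal.alarm(mm_cfg.days(1))

    signal.signal(signal.SIGALRM, sigalrm_handler)
    signal.alarm(mm_cfg.days(1))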
Diffstat (limited to 'bin')
-rw-r--r--  bin/mailmanctl  395
1 file changed, 301 insertions, 94 deletions
diff --git a/bin/mailmanctl b/bin/mailmanctl
index e5c91c5e0..30ae34d42 100644
--- a/bin/mailmanctl
+++ b/bin/mailmanctl
@@ -18,26 +18,33 @@
"""Primary start-up and shutdown script for Mailman's qrunner daemon.
-This script is intended to be run as an init script. It simply makes sure
-that the various long-running qrunners are still alive and kicking. It does
-this by forking the individual qrunners and waiting on their pids. When it
-detects a subprocess has exited, it will restart it (unless the -n option is
-given).
+This script starts, stops, and restarts the main Mailman queue runners, making
+sure that the various long-running sub-qrunners are still alive and kicking.
+It does this by forking and exec'ing the sub-qrunners and waiting on their
+pids. When it detects a subprocess has exited, it may restart it.
-The master qrunner will leave its process id in the file data/qrunner.pid
-which can be used to shutdown or HUP the qrunner daemon. Sending a SIGINT to
-the master qrunner causes it and all sub-qrunners to exit. This is equivalent
-to the `stop' command. Sending a SIGHUP causes the master to re-open all of
-its log files, and to SIGINT kill and restart all sub-qrunners processes.
-This is equivalent to the `restart' command.
+The sub-qrunners respond to SIGINT, SIGTERM, and SIGHUP. SIGINT and SIGTERM
+both cause the sub-qrunners to exit cleanly, but the master will only restart
+sub-qrunners that have exited due to a SIGINT. SIGHUP causes the master
+qrunner and sub-qrunners to close their log files, and reopen them upon the
+next printed message.
-Usage: %(PROGRAM)s [options] [ start | stop | restart ]
+The master qrunner also responds to SIGINT, SIGTERM, and SIGHUP, which it
+simply passes on to the sub-qrunners (note that the master will close and
+reopen its own log files on receipt of a SIGHUP). The master qrunner also
+leaves its own process id in the file data/master-qrunner.pid but you normally
+don't need to use this pid directly. The `start', `stop', `restart', and
+`reopen' commands handle everything for you.
+
+Usage: %(PROGRAM)s [options] [ start | stop | restart | reopen ]
Options:
-n/--no-restart
- Don't restart queue runners when they exit because of an error. Use
- this only for debugging. Only useful if the `start' command is given.
+ Don't restart the sub-qrunners when they exit because of an error or a
+ SIGINT (they are never restarted if they exit in response to a
+ SIGTERM). Use this only for debugging. Only useful if the `start'
+ command is given.
-u/--run-as-user
Normally, this script will refuse to run if the user id and group id
@@ -52,11 +59,11 @@ Options:
-s/--stale-lock-cleanup
If mailmanctl finds an existing master qrunner lock, it will normally
- exit with an error message. With this switch, mailmanctl will perform
+ exit with an error message. With this option, mailmanctl will perform
an extra level of checking. If a process matching the host/pid
described in the lock file is running, mailmanctl will still exit, but
if no matching process is found, mailmanctl will remove the apparently
- stale lock and continue running.
+ stale lock and make another attempt to claim the master lock.
-q/--quiet
Don't print status messages. Error messages are still printed to
@@ -67,20 +74,23 @@ Options:
Commands:
- start - Start the master qrunner daemon. Prints a message and returns
- if the master qrunner daemon is already running.
+ start - Start the master qrunner daemon and all sub-qrunners. Prints a
+ message and exits if the master qrunner daemon is already
+ running.
- stop - Stops the master qrunner daemon and all worker qrunners. After
+ stop - Stops the master qrunner daemon and all sub-qrunners. After
stopping, no more messages will be processed.
- restart - Restarts the master qrunner daemon by sending it a SIGHUP. This
- will cause all worker qrunners to be stopped and restarted, and
- will cause any log files open by the master qrunner to be
- re-opened.
+ restart - Restarts the sub-qrunners, but not the master qrunner. This is
+ really handy for development, because without restarting, the
+ sub-qrunners won't reload any changed modules.
+
+ reopen - This will simply cause all log files to be re-opened.
"""
import sys
import os
+import time
import getopt
import signal
import errno
@@ -89,13 +99,21 @@ import socket
import paths
from Mailman import mm_cfg
+from Mailman import Utils
from Mailman import LockFile
-from Mailman.Queue import Control
from Mailman.i18n import _
from Mailman.Logging.Syslog import syslog
PROGRAM = sys.argv[0]
COMMASPACE = ', '
+DOT = '.'
+
+# Locking constants
+LOCKFILE = os.path.join(mm_cfg.LOCK_DIR, 'master-qrunner')
+# Since we wake up once per day and refresh the lock, the LOCK_LIFETIME
+# needn't be (much) longer than SNOOZE. We pad it 6 hours just to be safe.
+LOCK_LIFETIME = mm_cfg.days(1) + mm_cfg.hours(6)
+SNOOZE = mm_cfg.days(1)
@@ -107,7 +125,7 @@ def usage(code, msg=''):
-def kill_subrunners(sig):
+def kill_watcher(sig):
try:
fp = open(mm_cfg.PIDFILE)
pidstr = fp.read()
@@ -129,6 +147,112 @@ def kill_subrunners(sig):
print >> sys.stderr, 'Stale pid file removed.'
os.unlink(mm_cfg.PIDFILE)
+
+
+def get_lock_data():
+ """Return the hostname, pid, and tempfile"""
+ fp = open(LOCKFILE)
+ filename = fp.read().strip()
+ fp.close()
+ parts = filename.split('.')
+ hostname = DOT.join(parts[1:-1])
+ pid = int(parts[-1])
+ return hostname, int(pid), filename
+
+
+def qrunner_state():
+ # 1 if proc exists on host (but is it qrunner? ;)
+ # 0 if host matches but no proc
+ # hostname if hostname doesn't match
+ hostname, pid, tempfile = get_lock_data()
+ if hostname <> socket.gethostname():
+ return hostname
+ # Find out if the process exists by calling kill with a signal 0.
+ try:
+ os.kill(pid, 0)
+ except OSError, e:
+ if e.errno <> errno.ESRCH: raise
+ return 0
+ return 1
+
+
+def acquire_lock_1(force):
+ # Be sure we can acquire the master qrunner lock. If not, it means some
+ # other master qrunner daemon is already going.
+ lock = LockFile.LockFile(LOCKFILE, LOCK_LIFETIME)
+ try:
+ lock.lock(0.1)
+ return lock
+ except LockFile.TimeOutError:
+ if not force:
+ raise
+ # Force removal of lock first
+ hostname, pid, tempfile = get_lock_data()
+ os.unlink(LOCKFILE)
+ os.unlink(tempfile)
+ return acquire_lock_1(force=0)
+
+
+def acquire_lock(force):
+ try:
+ lock = acquire_lock_1(force)
+ except LockFile.TimeOutError:
+ status = qrunner_state()
+ if status == 1:
+ # host matches and proc exists
+ print >> sys.stderr, _("""\
+The master qrunner lock could not be acquired because it appears as if another
+master qrunner is already running.
+""")
+ elif status == 0:
+ # host matches but no proc
+ print >> sys.stderr, _("""\
+The master qrunner lock could not be acquired. It appears as though there is
+a stale master qrunner lock. Try re-running mailmanctl with the -s flag.
+""")
+ else:
+ # host doesn't even match
+ print >> sys.stderr, _("""\
+The master qrunner lock could not be acquired, because it appears as if some
+process on some other host may have acquired it. We can't test for stale
+locks across host boundaries, so you'll have to do this manually. Or, if you
+know the lock is stale, re-run mailmanctl with the -s flag.
+
+Lock file: %(LOCKFILE)s
+Lock host: %(status)s
+
+Exiting.""")
+ return None
+ return lock
+
+
+
+def start_runner(qrname, slice, count):
+ pid = os.fork()
+ if pid:
+ # parent
+ return pid
+ # child
+ #
+ # Craft the command line arguments for the exec() call.
+ rswitch = '--runner=%s:%d:%d' % (qrname, slice, count)
+ # BAW: should argv[0] be `python'?
+ exe = os.path.join(mm_cfg.BIN_DIR, 'qrunner')
+ os.execl(mm_cfg.PYTHON, 'qrunner', exe, rswitch)
+ # Should never get here
+ raise RuntimeError, 'os.execl() failed'
+
+
+def start_all_runners():
+ kids = {}
+ for qrname, count in mm_cfg.QRUNNERS:
+ for slice in range(count):
+ info = (qrname, slice, count)
+ pid = start_runner(*info)
+ kids[pid] = info
+ return kids
+
+
def check_privs():
# If we're running as root (uid == 0), coerce the uid and gid to that
@@ -144,19 +268,6 @@ def check_privs():
-# We need to install a SIGTERM handler because that's what init will kill this
-# process with when changing run levels.
-def sigterm_handler(signum, frame):
- # See the 'stop' command below
- if not quiet:
- print _("Shutting down Mailman's master qrunner.")
- #syslog('debug', 'sigterm_handler from pid: %s', os.getpid())
- kill_subrunners(signal.SIGINT)
-
-signal.signal(signal.SIGTERM, sigterm_handler)
-
-
-
def main():
global quiet
try:
@@ -168,7 +279,7 @@ def main():
restart = 1
checkprivs = 1
- cleanup = 0
+ force = 0
quiet = 0
for opt, arg in opts:
if opt in ('-h', '--help'):
@@ -178,7 +289,7 @@ def main():
elif opt in ('-u', '--run-as-user'):
checkprivs = 0
elif opt in ('-s', '--stale-lock-cleanup'):
- cleanup = 1
+ force = 1
elif opt in ('-q', '--quiet'):
quiet = 1
@@ -198,28 +309,64 @@ def main():
# giving cron/qrunner a ctrl-c or KeyboardInterrupt. This will
# effectively shut everything down.
if not quiet:
- print _("Shutting down Mailman's master qrunner.")
- #syslog('debug', 'stop command from pid: %s', os.getpid())
- kill_subrunners(signal.SIGINT)
+ print _("Shutting down Mailman's master qrunner")
+ kill_watcher(signal.SIGTERM)
elif command == 'restart':
# Send the master qrunner process a SIGINT. This will cause the
# master qrunner to kill and restart all the sub-qrunners; log files
# are re-opened with the separate `reopen' command.
if not quiet:
- print _("Restarting Mailman's master qrunner.")
- kill_subrunners(signal.SIGHUP)
+ print _("Restarting Mailman's master qrunner")
+ kill_watcher(signal.SIGINT)
+ elif command == 'reopen':
+ if not quiet:
+ print _('Re-opening all log files')
+ kill_watcher(signal.SIGHUP)
elif command == 'start':
- # Must be `start'
+ # Here's the scoop on the processes we're about to create. We'll need
+ # one for each qrunner, and one for a master child process watcher /
+ # lock refresher process.
+ #
+ # The child watcher process simply waits on the pids of the children
+ # qrunners. Unless explicitly disabled by a mailmanctl switch (or the
+ # children are killed with SIGTERM instead of SIGINT), the watcher
+ # will automatically restart any child process that exits. This
+ # allows us to be more robust, and also to implement restart by simply
+ # SIGINT'ing the qrunner children, and letting the watcher restart
+ # them.
+ #
+ # Under normal operation, we have a child per queue. This lets us get
+ # the most out of the available resources, since a qrunner with no
+ # files in its queue directory is pretty cheap, but having a separate
+ # runner process per queue allows for a very responsive system. Some
+ # people want a more traditional (i.e. MM2.0.x) cron-invoked qrunner.
+ # No problem, but using mailmanctl isn't the answer. So while
+ # mailmanctl hard codes some things, others, such as the number of
+ # qrunners per queue, are configurable in mm_cfg.py.
#
+ # First, acquire the master mailmanctl lock
+ lock = acquire_lock(force)
+ if not lock:
+ return
# Daemon process startup according to Stevens, Advanced Programming in
# the UNIX Environment, Chapter 13.
- if os.fork():
+ pid = os.fork()
+ if pid:
+ # parent
if not quiet:
print _("Starting Mailman's master qrunner.")
- # parent
- sys.exit(0)
+ # Give up the lock "ownership". This just means the foreground
+ # process won't close/unlock the lock when it finalizes this lock
+ # instance. We'll let the master watcher subproc own the lock.
+ lock._transfer_to(pid)
+ return
# child
#
+ lock._take_possession()
+ # First, save our pid in a file for "mailmanctl stop" rendezvous
+ fp = open(mm_cfg.PIDFILE, 'w')
+ print >> fp, os.getpid()
+ fp.close()
# Create a new session and become the session leader, but since we
# won't be opening any terminal devices, don't do the ultra-paranoid
# suggestion of doing a second fork after the setsid() call.
@@ -229,52 +376,112 @@ def main():
# Clear our file mode creation umask
os.umask(0)
# I don't think we have any unneeded file descriptors.
+ #
+ # Now start all the qrunners. This returns a dictionary where the
+ # keys are qrunner pids and the values are tuples of the following
+ # form: (qrname, slice, count). This does its own fork and exec, and
+ # sets up its own signal handlers.
+ kids = start_all_runners()
+ # Set up a SIGALRM handler to refresh the lock once per day. The lock
+ # lifetime is 1 day + 6 hours, so this should be plenty.
+ def sigalrm_handler(signum, frame, lock=lock):
+ lock.refresh()
+ signal.alarm(mm_cfg.days(1))
+ signal.signal(signal.SIGALRM, sigalrm_handler)
+ signal.alarm(mm_cfg.days(1))
+ # Set up a SIGHUP handler so that if we get one, we'll pass it along
+ # to all the qrunner children. This will tell them to close and
+ # reopen their log files
+ def sighup_handler(signum, frame, kids=kids):
+ # Closing our syslog will cause it to be re-opened at the next log
+ # print output.
+ syslog.close()
+ for pid in kids.keys():
+ os.kill(pid, signal.SIGHUP)
+ # And just to tweak things...
+ syslog('qrunner',
+ 'Master watcher caught SIGHUP. Re-opening log files.')
+ signal.signal(signal.SIGHUP, sighup_handler)
+ # We also need to install a SIGTERM handler because that's what init
+ # will kill this process with when changing run levels.
+ def sigterm_handler(signum, frame, kids=kids):
+ for pid in kids.keys():
+ try:
+ os.kill(pid, signal.SIGTERM)
+ except OSError, e:
+ if e.errno <> errno.ESRCH: raise
+ syslog('qrunner', 'Master watcher caught SIGTERM. Exiting.')
+ signal.signal(signal.SIGTERM, sigterm_handler)
+ # Finally, we need a SIGINT handler which will cause the sub-qrunners
+ # to exit, but the master will restart SIGINT'd sub-processes unless
+ # the -n flag was given.
+ def sigint_handler(signum, frame, kids=kids):
+ for pid in kids.keys():
+ os.kill(pid, signal.SIGINT)
+ syslog('qrunner', 'Master watcher caught SIGINT. Restarting.')
+ signal.signal(signal.SIGINT, sigint_handler)
+ # Now we're ready to simply do our wait/restart loop. This is the
+ # master qrunner watcher.
try:
- Control.start(restart)
- except LockFile.TimeOutError:
- lockfile = Control.LOCKFILE
- if cleanup:
- fp = open(lockfile)
- data = fp.read().strip()
- fp.close()
- hostname, pid = os.path.basename(data).split('.')[1:]
- if hostname == socket.gethostname():
- try:
- os.kill(int(pid), 0)
- except OSError, e:
- if e.errno <> errno.ESRCH: raise
- else:
- print >> sys.stderr, _("""
-The master qrunner lock appears to be stale. Cleaning up and trying again.""")
- # Process did not exist, so wax the lock files and
- # try again.
- os.unlink(lockfile)
- os.unlink(data)
- try:
- Control.start(restart)
- except LockFile.TimeOutError:
- pass
- #syslog('debug', 'start 1 from pid: %s', os.getpid())
- print >> sys.stderr, _("""
-The master qrunner lock could not be acquired. The lock file is probably not
-stale, so it's likely that another qrunner process is already running.
-
-Lock file: %(lockfile)s
-
-Exiting.""")
- else:
- # Okay, process exists on our host, so we can't kill it
- #syslog('debug', 'start 2 from pid: %s', os.getpid())
- print >> sys.stderr, _("""
-The master qrunner lock could not be acquired. Either another qrunner is
-already running, or a stale lock exists. Please check to see if the lock
-file is stale, or use the -s option.
-
-Lock file: %(lockfile)s
-
-Exiting.""")
- else:
- usage(1, _('Bad command: %(command)s'))
+ while 1:
+ try:
+ pid, status = os.wait()
+ except OSError, e:
+ # No children? We're done
+ if e.errno == errno.ECHILD:
+ break
+ # If the system call got interrupted, just restart it.
+ elif e.errno <> errno.EINTR:
+ raise
+ continue
+ killsig = status & 0xff
+ exitstatus = (status >> 8) & 0xff
+ # We'll restart the process unless we were given the
+ # "no-restart" switch, or if the process was SIGTERM'd. This
+ # lets us better handle runaway restarts (say, if the subproc
+ # had a syntax error!)
+ if restart and not (killsig == signal.SIGTERM or
+ exitstatus == signal.SIGTERM):
+ restarting = '[restarting]'
+ else:
+ restarting = ''
+ qrname, slice, count = kids[pid]
+ del kids[pid]
+ syslog('qrunner', """\
+Master qrunner detected subprocess exit
+(pid: %d, sig: %d, sts: %d, class: %s, slice: %d/%d) %s""",
+ pid, killsig, exitstatus, qrname,
+ slice+1, count, restarting)
+ # Now perhaps restart the process unless it exited with a
+ # SIGTERM or we aren't restarting.
+ if restarting:
+ newpid = start_runner(qrname, slice, count)
+ kids[newpid] = (qrname, slice, count)
+ finally:
+ # Should we leave the main loop for any reason, we want to be sure
+ # all of our children exit cleanly. Send SIGTERMs to all
+ # the child processes and wait for them all to exit.
+ for pid in kids.keys():
+ try:
+ os.kill(pid, signal.SIGTERM)
+ except OSError, e:
+ if e.errno == errno.ESRCH:
+ # The child has already exited
+ syslog('qrunner', 'ESRCH on pid: %d', pid)
+ del kids[pid]
+ # Wait for all the children to go away
+ while 1:
+ try:
+ pid, status = os.wait()
+ except OSError, e:
+ if e.errno == errno.ECHILD:
+ break
+ elif e.errno <> errno.EINTR:
+ raise
+ continue
+ # Finally, give up the lock
+ lock.unlock(unconditionally=1)
+ os._exit(0)
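
In outline, the wait-and-restart loop that the patch adds to the master
watcher reduces to the condensed sketch below; logging, the
SIGHUP/SIGTERM/SIGINT handlers, and the shutdown cleanup in the finally clause
are omitted, and start_all_runners(), start_runner(), and the restart flag are
exactly as defined in the diff above.

    import os
    import signal
    import errno

    kids = start_all_runners()            # {pid: (qrname, slice, count)}
    while 1:
        try:
            # Block until any sub-qrunner exits.
            pid, status = os.wait()
        except OSError, e:
            if e.errno == errno.ECHILD:
                # No children left, so we're done.
                break
            elif e.errno <> errno.EINTR:
                raise
            # The wait() was merely interrupted by a signal; retry it.
            continue
        killsig = status & 0xff
        exitstatus = (status >> 8) & 0xff
        qrname, slice, count = kids[pid]
        del kids[pid]
        # Restart the child unless -n was given or it was SIGTERM'd.
        if restart and not (killsig == signal.SIGTERM or
                            exitstatus == signal.SIGTERM):
            newpid = start_runner(qrname, slice, count)
            kids[newpid] = (qrname, slice, count)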