zmq and gevent debugging nightmares

Note

Powerhose turns your CPU-bound tasks into I/O-bound tasks so your Python applications are easier to scale.

I've released Powerhose 0.4 on PyPI - http://pypi.python.org/pypi/powerhose/0.4 - and this new version has a few changes that are worth talking about.

pyzmq + gevent = ?

The biggest issue I faced with Powerhose was related to gevent. We have a Powerhose setup here at Mozilla with 175 workers, each of them doing crypto work.

A Powerhose worker is just a call to powerhose-worker pointing to a callable.
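
For instance, a worker callable can be as simple as this -- a minimal sketch, where the echo name and the job.data attribute are my recollection of the Powerhose job API, so double-check the docs:

# minimal sketch of a worker callable -- a Job instance comes in,
# a string goes back; job.data holds the payload sent by the client
def echo(job):
    return job.data

You then point powerhose-worker at the dotted name of such a callable.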

What I did not realize was that the imported module was also used by our main application, and it was applying the gevent and gevent_zmq monkey patches.

gevent_zmq is a library that patches pyzmq so it plays well with gevent. It will eventually go away, since pyzmq now ships a green module that provides gevent compatibility. But that module does not provide a Poller yet.

So, in other words, any project that mixes pyzmq and gevent will block unless you use gevent_zmq. And if you use a Poller, you need my fork: https://github.com/tarekziade/gevent-zeromq
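
For the record, using it looks roughly like this -- a sketch from memory of gevent-zeromq's two usage styles:

# option 1: import the gevent-aware wrapper instead of pyzmq directly
from gevent_zeromq import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
# send()/recv() on this socket now yield to the gevent hub instead of
# blocking the whole process

# option 2: patch the zmq module in place, for code you don't control
import gevent_zeromq
gevent_zeromq.monkey_patch()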

Back to my workers. Having them patched by gevent/gevent_zmq is not an issue per se. It can even speed things up very slightly, since the workers sometimes fetch certificates over the web.

But at some point -- or more precisely, around every 24 hours -- the workers would simply lock themselves up and hang there.

After a lot of digging, I found out that gevent has its own DNS resolver, which is used when socket.getaddrinfo is called, and for some reason -- a bad interaction between zmq and gevent, I suspect -- a greenlet would get locked.
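
You can see the takeover with a few lines -- a small sketch showing how gevent's monkey patching swaps socket.getaddrinfo for its own resolver:

import socket
original_getaddrinfo = socket.getaddrinfo

from gevent import monkey
monkey.patch_all()

# name resolution now goes through gevent's own resolver
print(socket.getaddrinfo is original_getaddrinfo)   # False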

Maybe it's a bug in gevent_zmq, maybe it's an issue in gevent itself...

I failed to find the real culprit because the lock happened in various places in the gevent socket code. I did not spend more time on it, since the bug went away once I removed gevent altogether from the workers -- we don't use gevent there, and the workers are synchronous beasts.

The one thing I was able to do, though, was write a little piece of code that dumps all running threads and greenlets to understand what's going on:

import traceback, sys, gc, threading

def dump_stacks():
    dump = []

    # threads
    threads = dict([(th.ident, th.name)
                        for th in threading.enumerate()])

    for thread, frame in sys._current_frames().items():
        # use .get(): some threads in _current_frames() may not be
        # registered with the threading module
        dump.append('Thread 0x%x (%s)\n'
                    % (thread, threads.get(thread, '<unknown>')))
        dump.append(''.join(traceback.format_stack(frame)))
        dump.append('\n')

    # greenlets
    try:
        from greenlet import greenlet
    except ImportError:
        return dump

    # if greenlet is present, let's dump each greenlet stack
    for ob in gc.get_objects():
        if not isinstance(ob, greenlet):
            continue
        if not ob:
            continue   # not running anymore or not started
        dump.append('Greenlet\n')
        dump.append(''.join(traceback.format_stack(ob.gr_frame)))
        dump.append('\n')

    return dump

That should be useful in the future.
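
One convenient way to use it on a live process is to hook it to a signal -- a sketch, the choice of SIGUSR1 being arbitrary:

import signal

def dump_on_signal(signum, frame):
    # print every thread and greenlet stack when the process gets SIGUSR1
    print(''.join(dump_stacks()))

signal.signal(signal.SIGUSR1, dump_on_signal)

A kill -USR1 <pid> then gives you a snapshot of the process without stopping it.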

Bottom line:

  • gevent does not have good debugging tools - I guess the function I wrote is useful, and it can even be injected live into a running process using Pyrasite.

    But gevent should provide this kind of utility imho - I'll propose something.

  • I am looking forward to pyzmq.green getting a Poller class. We've opened a ticket on this, and I guess it will eventually land - see the sketch below for what the green module already offers.
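
For the record, the green module is already a drop-in replacement for the rest of the API -- a sketch, assuming pyzmq 2.2 or later:

import zmq.green as zmq   # gevent-friendly drop-in for the zmq module

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
# blocking calls like sock.recv() now cooperate with the gevent hub;
# what's still missing is a green Poller, hence my fork mentioned above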

zmq_bind() bug?

The other issue I had was with the ZMQ bind() API. Powerhose's Broker binds a socket, but it turns out you can bind as many brokers as you want on the same IPC or TCP address!

You end up with one active broker and a lot of zombie brokers...
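
The shape of my repro scripts is roughly this -- a sketch from memory, with an arbitrary endpoint name:

# bind_twice.py -- run this in two separate shells: both processes
# bind the same address without an error, which is the surprising part
import os
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REP)
sock.bind('ipc:///tmp/bind-twice-demo')
print('pid %d bound without an error' % os.getpid())
sock.recv()   # block forever, keeping the socket bound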

See this pyzmq bug for the full repro scripts: https://github.com/zeromq/pyzmq/issues/209 (the pasted scripts will stay online for a month)

And here's the thread I started on the zmq mailing list: http://lists.zeromq.org/pipermail/zeromq-dev/2012-May/017249.html

So until this is solved, I've added a health-check feature to Powerhose. You can now call the broker, but instead of passing a job you can pass a PING, and the broker will directly return its PID instead of handing your call to a worker.
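
From a client, the check looks roughly like this -- a sketch where the endpoint and the exact wire format are assumptions; the real logic is behind the powerhose-broker command shown below:

import sys
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.setsockopt(zmq.LINGER, 0)    # don't hang on exit if nobody answers
sock.connect('ipc:///tmp/powerhose-broker')   # hypothetical endpoint

sock.send(b'PING')
poller = zmq.Poller()
poller.register(sock, zmq.POLLIN)

if poller.poll(1000):   # wait up to one second for an answer
    print('broker is alive, pid: %s' % sock.recv())
else:
    print('no broker answered')
    sys.exit(1)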

That's good enough to make sure the broker is up and running, and the powerhose-broker command line has gained two options:

$ powerhose-broker --check
.. checks if the broker is alive and running, and prints its PID ...

$ powerhose-broker --purge-ghosts
.. kills any "ghost" broker ...

The broker itself does a --check when it starts and exits if it finds a running broker on the same endpoint.

This will be useful for a Nagios checker. But... zmq should just error out when you try to bind twice.

What's next

At this point I am wondering if, besides those small fixes, Powerhose should gain more features... Circus itself provides all the stats and maintenance features we need to manage Powerhose workers...

Links:

  • Powerhose 0.4 - http://pypi.python.org/pypi/powerhose/0.4
  • My gevent-zeromq fork - https://github.com/tarekziade/gevent-zeromq
  • pyzmq issue #209 - https://github.com/zeromq/pyzmq/issues/209
  • The zeromq-dev thread - http://lists.zeromq.org/pipermail/zeromq-dev/2012-May/017249.html

Please let us know what you think!