A bug of life and death

My software development/consultancy company, Stochastic Technologies, has developed and owns various products. One of these products, Dead Man’s Switch, is a service that emails you every few days to check up on you, and sends some e-mails you have pre-configured to your contacts if you don’t respond within your selected timeframe.

We recently rewrote the entire thing (it wasn’t too big, so the rewrite was quick), to make it more extensible, maintainable, current, &c. The previous version was rock-solid for years, but, somewhere along the line, it came to my attention that the new version had a weird bug that sometimes wouldn’t send the notification emails on the day they were supposed to.

The previous version only accepted emailing intervals that were on the area of two weeks to a month, but the new version has a “Test email” feature, that notifies you one day after you last show up, and sends all your emails the following day if you don’t check in after that email. A few users mentioned problems with that email never arriving, and my tests confirmed this on production, but for a long time I couldn’t pinpoint the bug.

Since the service is of such importance, it is critical that false positives never occur. However, if a user doesn’t get the notification email and forgets to check in, that’s a pretty serious bug. Due to the requirement that everything be as error-free as possible, the codebase is extensively tested, with the function that decides whether a user should be contacted having undergone multiple reviews and having many tens of test cases (which is a lot for a simple, twenty-line function).

The bug in question was never happening in tests, or on staging, or anywhere but production, and it also didn’t happen on production when I tried to debug it. It was a true Heisenbug, and baffled me to no end. I added debug statements everywhere, tested the entire thing multiple times, printed the inputs and outputs of the function, and nothing. The inputs were correct, the outputs were correct in staging, everything was right, except production was wrong. To reiterate: Running a simple, side-effect free function with the same inputs on production and staging seemingly gave different results.

The relevant function is below, see if you can spot the bug. last_login is the user’s last login date, last_contact is the date we last contacted them (so the function is idempotent), intervals is their contact preferences (in days) and contact_date is today’s date, mainly used in tests to simulate other dates:

def should_contact(last_login, last_contact, intervals, contact_date=date.today()):
    """
    Decide whether we should contact a user or not.
    """
    # If we last contacted the user on contact_date, we should also not contact.
    if last_contact == contact_date:
        return False

    # The number of days since the user's last login.
    days_since_login = (contact_date - last_login).days

    # The number of days we contacted the user after their last login.
    days_since_contact = (last_contact - last_login).days

    # If we have contacted before or on the last login, we don't need to contact again.
    if days_since_login <= 0:
        return False

    ... more code ...

    return <whether we should contact the user>

See it? I saw it too, I knew about this Python quirk, but I never realized, until it dawned on me: Python evaluates default arguments at definition time. Python evaluates default arguments at definition time. Which means that date.today() is evaluated once, until the process is restarted. No wonder DMS didn’t send some emails: it thought it was a day earlier than it actually was, until the process got restarted and re-evaluated the code. This meant that this bug would never be exhibited in tests, or in staging (which didn’t live for days, as it’s only started when needed for testing), or while production was being debugged (because uploading new code restarted the process and got the correct date).

A quick change to:

def should_contact(last_login, last_contact, intervals, contact_date=None):
    """
    Decide whether we should contact a user or not.
    """
    if contact_date is None:
        contact_date = date.today()
    ...

fixed everything.

I’m not sure what the moral to this story is, except that even things you know can still come back to bite you, if you aren’t careful/don’t realize they can. The lesson I’ve learnt was that I shouldn’t avoid just putting mutable datastructures (lists, dictionaries, etc) as default arguments, but really anything that’s not specifically immutable (numbers, strings, True/False/None, that’s pretty much it). I’ll definitely be more aware of what goes on in default argument land now, but at least the bug is gone and the service is reliable once more.

Stavros' Stuff

On programming and other things.

Conceived on Jul 30, 2012

Stavros

Guy who likes computers

Connect with me

This site is part of the webring:

Recent Posts

Made with ♥ in Greece