Varnish Expiration Thread Lock Contention

I recently helped introduce a Varnish cluster at work to offload a significant number of requests from a heavily-trafficked internal web service. This Varnish cluster serves approximately 10,000 requests/second (with peaks in the 20-30,000 requests/second range) 24x7x365.

While working on this project I ran into a few interesting problems with Varnish. The most interesting one I saw was that Varnish would consume more and more RAM until eventually it would consume all the memory on the box and crash. Using varnishstat, we diagnosed that this was due to unbounded growth of transient storage within the Varnish process.

In Varnish, transient storage is where all extremely short-lived objects are placed. It turns out that our web service returns large numbers of HTTP 404 status codes which are not cached by default. The responses for these 404 errors would end up in transient storage.

By default, transient storage is allocated in RAM and allowed to grow without bound. You can force transient storage to be of bounded size, but I determined that this would likely manifest itself as the Varnish process returning endless numbers of HTTP 503s instead of 404s, which is no better than crashing for my purposes.

With closed-source software, we probably would have called up product support and waited for their diagnosis. I was curious, however, so I decided to look at the source code myself.

The Varnish source code is well-written and easy to understand. I was eventually able to determine that freeing up space in transient storage is the responsibility of the expiration thread through the function exp_timer, which I have reproduced below (source is from Varnish 3.0.2):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88


/*----------------------------------------------
 * This thread monitors the root of the binary heap and whenever an
 * object expires, accounting also for graceability, it is killed.
 */

static void * __match_proto\__(void *start_routine(void *))
exp_timer(struct sess *sp, void *priv)
{
    struct objcore *oc;
    struct lru *lru;
    double t;
    struct object *o;

    (void)priv;
    t = TIM_real();
    oc = NULL;
    while (1) {
        if (oc == NULL) {
            WSL_Flush(sp->wrk, 0);
            WRK_SumStat(sp->wrk);
            TIM_sleep(params->expiry_sleep);
            t = TIM_real();
        }

        Lck_Lock(&exp_mtx);
        oc = binheap_root(exp_heap);
        if (oc == NULL) {
            Lck_Unlock(&exp_mtx);
            continue;
        }
        CHECK_OBJ_NOTNULL(oc, OBJCORE_MAGIC);

        /*
         * We may have expired so many objects that our timestamp
         * got out of date, refresh it and check again.
         */
        if (oc->timer_when > t)
            t = TIM_real();
        if (oc->timer_when > t) {
            Lck_Unlock(&exp_mtx);
            oc = NULL;
            continue;
        }

        /* If the object is busy, we have to wait for it */
        if (oc->flags & OC_F_BUSY) {
            Lck_Unlock(&exp_mtx);
            oc = NULL;
            continue;
        }

        /*
         * It's time...
         * Technically we should drop the exp_mtx, get the lru->mtx
         * get the exp_mtx again and then check that the oc is still
         * on the binheap. We take the shorter route and try to
         * get the lru->mtx and punt if we fail.
         */
        lru = oc_getlru(oc);
        CHECK_OBJ_NOTNULL(lru, LRU_MAGIC);
        if (Lck_Trylock(&lru->mtx)) {
            Lck_Unlock(&exp_mtx);
            oc = NULL;
            continue;
        }

        /* Remove from binheap */
        assert(oc->timer_idx != BINHEAP_NOIDX);
        binheap_delete(exp_heap, oc->timer_idx);
        assert(oc->timer_idx == BINHEAP_NOIDX);

        /* And from LRU */
        lru = oc_getlru(oc);
        VTAILQ_REMOVE(&lru->lru_head, oc, lru_list);

        Lck_Unlock(&exp_mtx);
        Lck_Unlock(&lru->mtx);

        VSC_C_main->n_expired++;

        CHECK_OBJ_NOTNULL(oc->objhead, OBJHEAD_MAGIC);
        o = oc_getobj(sp->wrk, oc);
        WSL(sp->wrk, SLT_ExpKill, 0, "%u %.0f",
            o->xid, EXP_Ttl(NULL, o) -- t);
        (void)HSH_Deref(sp->wrk, oc, NULL);
    }
    NEEDLESS_RETURN(NULL);
}

What I believe, although I have not verified, is that we were being affected by the following block of code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


/*
 * It's time...
 * Technically we should drop the exp_mtx, get the lru->mtx
 * get the exp_mtx again and then check that the oc is still
 * on the binheap. We take the shorter route and try to
 * get the lru->mtx and punt if we fail.
 */

lru = oc_getlru(oc);
CHECK_OBJ_NOTNULL(lru, LRU_MAGIC);
if (Lck_Trylock(&lru->mtx)) {
    Lck_Unlock(&exp_mtx);
    oc = NULL;
    continue;
}

My theory is that there is so much contention on the LRU mutex that the expiration thread was unable to ever get the LRU mutex and thus it was never able to free anything in transient storage. I believe that if this code was changed to follow the recommendation in the comment, it would fix our issues.

Before I built a custom copy of Varnish, I noticed that the aggressiveness of the expiration thread can be controlled by the expiry_sleep parameter. We decided to try increasing the aggressiveness of the expiration thread by reducing expiry_sleep from 1 second to 0.1 second, and then later 0.01 second. It worked – the transient storage space started to free up!

Unfortunately, we then started to see the following Varnish panic occur with some frequency:

200
Last panic at: Sun, 03 Jun 2012 08:43:09 GMT Assert error in oc_getobj(), cache.h line 452:
 Condition((oc-&gt;flags & (1&lt;&lt;1)) == 0) not true.
thread = (cache-timeout)
ident = Linux,2.6.18-308.4.1.el5,x86_64,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
 0x42c7a6: /usr/sbin/varnishd [0x42c7a6]
 0x420a35: /usr/sbin/varnishd [0x420a35]
 0x42ebac: /usr/sbin/varnishd [0x42ebac]
 0x3a9de0677d: /lib64/libpthread.so.0 [0x3a9de0677d]
 0x3a9cad325d: /lib64/libc.so.6(clone+0x6d) [0x3a9cad325d]</pre>

We eventually decided to resolve the issue by caching 404s for 2 minutes, which got the 404s out of transient storage and also reduced the load on the origin server significantly.