I recently helped introduce a Varnish cluster at work to offload a significant number of requests from a heavily-trafficked internal web service. This Varnish cluster serves approximately 10,000 requests/second (with peaks in the 20-30,000 requests/second range) 24x7x365.
While working on this project I ran into a few interesting problems with Varnish. The most interesting one I saw was that Varnish would consume more and more RAM until eventually it would consume all the memory on the box and crash. Using varnishstat, we diagnosed that this was due to unbounded growth of transient storage within the Varnish process.
In Varnish, transient storage is where all extremely short-lived objects are placed. It turns out that our web service returns large numbers of HTTP 404 status codes which are not cached by default. The responses for these 404 errors would end up in transient storage.
By default, transient storage is allocated in RAM and allowed to grow without bound. You can force transient storage to be of bounded size, but I determined that this would likely manifest itself as the Varnish process returning endless numbers of HTTP 503s instead of 404s, which is no better than crashing for my purposes.
With closed-source software, we probably would have called up product support and waited for their diagnosis. I was curious, however, so I decided to look at the source code myself.
The Varnish source code is well-written and easy to understand. I was eventually able to determine that freeing up space in transient storage is the responsibility of the expiration thread through the function
exp_timer, which I have reproduced below (source is from Varnish 3.0.2):
What I believe, although I have not verified, is that we were being affected by the following block of code:
My theory is that there is so much contention on the LRU mutex that the expiration thread was unable to ever get the LRU mutex and thus it was never able to free anything in transient storage. I believe that if this code was changed to follow the recommendation in the comment, it would fix our issues.
Before I built a custom copy of Varnish, I noticed that the aggressiveness of the expiration thread can be controlled by the
expiry_sleep parameter. We decided to try increasing the aggressiveness of the expiration thread by reducing
expiry_sleep from 1 second to 0.1 second, and then later 0.01 second. It worked – the transient storage space started to free up!
Unfortunately, we then started to see the following Varnish panic occur with some frequency:
200 Last panic at: Sun, 03 Jun 2012 08:43:09 GMT Assert error in oc_getobj(), cache.h line 452: Condition((oc->flags & (1<<1)) == 0) not true. thread = (cache-timeout) ident = Linux,2.6.18-308.4.1.el5,x86_64,-smalloc,-smalloc,-hcritbit,epoll Backtrace: 0x42c7a6: /usr/sbin/varnishd [0x42c7a6] 0x420a35: /usr/sbin/varnishd [0x420a35] 0x42ebac: /usr/sbin/varnishd [0x42ebac] 0x3a9de0677d: /lib64/libpthread.so.0 [0x3a9de0677d] 0x3a9cad325d: /lib64/libc.so.6(clone+0x6d) [0x3a9cad325d]</pre>
We eventually decided to resolve the issue by caching 404s for 2 minutes, which got the 404s out of transient storage and also reduced the load on the origin server significantly.