So I had a little weekend project this... weekend... and there was a concurrency glitch that kept gnawing at me. Now, I know this sounds braggy, but diagnosing it didn't take very long, because I have a very good feel for concurrency. It's almost tragically simple - multiple worker processes sharing a single upstream connection to a Redis server, which led to responses coming back out of order and therefore to the wrong callbacks.
No, it didn't take long to determine the problem. But fixing it was another matter. I'm baffled at how many people praise the Perl ecosystem for its documentation, because every time I have an interesting problem the answer is usually buried in PerlMonks, if it exists at all. In this case, the problem was the data sharing model of Apache/mod_perl worker processes. How do I make sure that a variable - in this case, a Redis client - is instantiated freshly in each worker, but not per-request? Is there some sort of library for thread-local storage? Is the answer buried in the source of DBI::Pool? ... where the fuck are the docs for DBI::Pool. Or the source. This was implemented in C, wasn't it, cycle-greedy bastards. Interpreted-language modules written in compiled languages are basically the Russian Front of research territory.
But what makes me feel really dumb is how simple the solution is. Let's see how fast the audience can put it together as I explain it hint by hint:
* A worker's lifetime has effectively two phases of memory isolation (the first being the fork parent). First the parent builds up the common environment, including file descriptors. Then, at some point, this is forked into pseudo-isolated worker processes.
* Once you're in your per-worker process, any new objects and connections are your own.
* Create a Redis connection at the right time, and it will be reused from request to request within a single worker, but never shared between concurrent workers.
* There is no obvious callback for "I just forked into memory isolation" (mod_perl's PerlChildInitHandler comes close, but stay with me), BUT: you won't see any HTTP requests being processed until after the fork. So once requests are flowing, you know you're in the clear.
* It's easy to write an initialization function that returns immediately if it's already been run.
* If you only run this module init function from the HTTP request processing path, you can guarantee that you've deferred initialization until you have memory isolation.
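Assembled into code, the pattern looks something like this. A sketch, not my actual code: `MyApp::Redis` is a made-up package name, and `_connect` is a stand-in for your real client constructor (e.g. `Redis->new(server => ...)` from CPAN's Redis module) so the sketch runs without a server:

```perl
package MyApp::Redis;
use strict;
use warnings;

# One slot per process: the parent never fills it, so every
# forked worker starts with undef and lazily builds its own.
my $redis;
my $connect_count = 0;    # just for the demo

# Stand-in for the real client constructor, e.g.
# Redis->new(server => 'localhost:6379'), swapped out so this
# sketch runs without a Redis server.
sub _connect {
    $connect_count++;
    return { server => 'localhost:6379' };    # pretend connection
}

# The init function from the hints: connects on first use within
# this process, returns immediately (same object) ever after.
sub conn {
    $redis //= _connect();
    return $redis;
}

sub connect_count { return $connect_count }

package main;

# Simulate three requests hitting the same worker:
MyApp::Redis::conn() for 1 .. 3;
print MyApp::Redis::connect_count(), "\n";    # connects exactly once
```

The only rule is that nothing outside the request path ever calls `conn()`, so the parent never populates `$redis` and every worker builds its own on its first request.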
There you have it: create your module-level Redis connection object in an initialization function that runs once, and defer it until after the fork by calling it from the request handler. It's that easy.
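And since the whole trick hinges on the fork, here's a plain-Perl demonstration of the isolation - no Apache, no Redis, the "connection" is just an empty hashref and `$inits` counts constructions. Each forked child serves five fake "requests" but runs the init exactly once, and the parent never runs it at all:

```perl
use strict;
use warnings;

my $inits = 0;
my $thing;    # stand-in for the per-worker connection object

# Init-once function: constructs on first call, no-op after.
sub init_once {
    $thing //= do { $inits++; {} };
    return $thing;
}

# The parent deliberately never calls init_once() before forking.
my @pids;
for my $worker (1 .. 2) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        init_once() for 1 .. 5;    # five "requests" in this worker
        exit $inits;               # exit code = init count in child
    }
    push @pids, $pid;
}
for my $pid (@pids) {
    waitpid $pid, 0;
    printf "child ran init %d time(s)\n", $? >> 8;
}
```

Same shape as the mod_perl case: the fork hands each child its own copy of the still-empty slot, and the lazy init fills it once per child.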
The thing is, I've seen this pattern a lot, but I've never correctly guessed the motivation. Now it feels pretty obvious, and in the future, knowing this will give me a new critical understanding of how existing infrastructure works, and why it does what it does where it does.