As part of rolling out Mesos/Aurora at TellApart we turned on cgroup CPU throttling. We're using the default cpu.cfs_period_us=100000 (100ms) right now and have noticed that it really bumps up the variability of our 50th and 90th percentile latencies (interestingly, the 99th and 99.9th don't move, probably because other factors dominate that far out). Clearly we're burning through our CPU quota early in each 100ms period, and then the whole application pauses for 60-80ms.
Note that because of online ad-bidding rules our applications have to respond within 30-40ms, otherwise the entire request is invalid. So I'm very willing to trade throughput for latency. 10-20% CPU overhead would probably pay for itself :)
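A rough back-of-the-envelope model of CFS bandwidth control helps explain the size of those pauses. cpu.cfs_quota_us is the CPU time a cgroup may use per cpu.cfs_period_us, summed across all cores, so 3 CPUs at the default 100ms period is a 300ms quota. The sketch below assumes that a given number of threads become runnable at the start of a period and stay busy, and that the host has enough idle cores to run them all in parallel. Real traffic is burstier, so treat the numbers as illustrative. The function name throttle_pause_ms is hypothetical.

```python
def throttle_pause_ms(cpus: float, busy_threads: int, period_ms: float = 100.0) -> float:
    """Stall per CFS period under a simplified, worst-case model.

    Assumes `busy_threads` threads are runnable for the whole period and the
    host has at least that many idle cores (hypothetical simplification).
    """
    quota_ms = cpus * period_ms          # cpu.cfs_quota_us, expressed in ms
    if busy_threads <= cpus:
        return 0.0                       # demand never exceeds quota: no throttling
    # The quota drains at `busy_threads` CPU-ms per wall-clock ms.
    exhausted_at_ms = quota_ms / busy_threads
    return period_ms - exhausted_at_ms   # rest of the period is spent throttled


if __name__ == "__main__":
    for period in (100.0, 20.0):
        pause = throttle_pause_ms(cpus=3, busy_threads=8, period_ms=period)
        print(f"period={period:.0f}ms  pause={pause:.1f}ms")
```

With 3 CPUs and a hypothetical 8 busy threads, this predicts a 62.5ms stall at the default 100ms period, consistent with the observed 60-80ms and well past the 30-40ms budget, and a 12.5ms stall at a 20ms period. It also shows why lever 2 works in the ideal case: with no more busy threads than allocated CPUs, the quota is never exhausted.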
I figured we have 4 levers to pull (assuming CPU per request cannot be reduced):
1. CPU/task (inversely proportional to # tasks, of course). Adding more CPU basically "smears" the high-CPU events (a big request, GC, etc.) and lowers the probability that throttling will occur.
2. Lower the # of threads. If we have 3 CPUs allocated, then "ideally" we should never have more than 3 threads running, so we never get throttled. Work should queue up waiting for a thread (instead of pausing the entire app). In reality, controlling threads like this is difficult...
3. Lower the cpu.cfs_period_us setting (maybe 20ms?). That would give a much lower "max pause time". Combining this with "throwing away" requests that have already expired (e.g. because throttling kicked in) would probably eliminate a lot of the variability.
4. Turn off CPU enforcement altogether.
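Lever 2 and the request-dropping half of lever 3 can be combined in application code: cap the worker threads at the CPU allocation so excess work queues instead of burning quota in parallel, and discard any request whose deadline has passed by the time a worker picks it up. This is a minimal sketch using Python's standard concurrent.futures; CPython's GIL already limits CPU parallelism, so it illustrates only the control flow. The names Request, DeadlinePool, and slow_handler, and the 35ms budget, are hypothetical.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    payload: str
    deadline: float  # absolute time.monotonic() value; answering after this is useless


class DeadlinePool:
    """Hypothetical worker pool sized to the CPU allocation that sheds expired work."""

    def __init__(self, cpus: int, handler: Callable[[str], str]) -> None:
        # Lever 2: run at most `cpus` workers; extra requests wait in the
        # executor's queue instead of all burning CPU quota at once.
        self._executor = ThreadPoolExecutor(max_workers=cpus)
        self._handler = handler
        self._lock = threading.Lock()
        self.dropped = 0

    def submit(self, req: Request) -> "Future[Optional[str]]":
        return self._executor.submit(self._run, req)

    def _run(self, req: Request) -> Optional[str]:
        # If the request expired while queued (or while the cgroup was
        # throttled), skip the work and return None instead of a late answer.
        if time.monotonic() >= req.deadline:
            with self._lock:
                self.dropped += 1
            return None
        return self._handler(req.payload)

    def shutdown(self) -> None:
        self._executor.shutdown(wait=True)


def slow_handler(payload: str) -> str:
    time.sleep(0.05)  # stand-in for ~50ms of CPU work
    return payload.upper()


if __name__ == "__main__":
    pool = DeadlinePool(cpus=1, handler=slow_handler)
    start = time.monotonic()
    # Three requests arrive together, each with a hypothetical 35ms budget.
    futures = [pool.submit(Request(f"bid-{i}", start + 0.035)) for i in range(3)]
    results = [f.result() for f in futures]
    pool.shutdown()
    print(results, pool.dropped)
```

With one worker, the first request is normally served, while the two queued behind it cannot start until at least 50ms have passed, so they are dropped rather than answered after the 35ms deadline. Dropping them frees the worker for fresh requests instead of spending CPU on responses the exchange would reject anyway.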
What have people tried? I'm especially curious what Borg did; I never looked into the kernel settings of LOW_LATENCY-class applications. From the Borg paper (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf), it looks like (3) was changed dynamically, and perhaps (4) was true for most machines/tasks but was also changed dynamically?
since you've run very low-latency stuff in cgroups...
Here's a key quote that explains what Google thinks is the problem:
"You almost need a second vacation to go through the pictures of the safari on your first vacation. That’s the problem we’re trying to fix — to automate the process so that users can be in the moment."
Later, they justify focusing on this (and not on social) with: "Only a small fraction of your photos are actually shared."
But that's because sharing is too difficult!! That's the problem! The problem is not organizing - the only reason organizing exists is to make sharing possible!
People care about loved ones being part of their lives. That's why photos exist - to show others what we're doing. Reminiscing - Google's "core use case" - is rare.
I really think they misread the data & problem space in this case and did a phenomenal job of solving an unimportant problem...
"After a long struggle trying to figure out how to upload my pix to Google Photos or Google Plus and a final rejection when they said they didn't recognize the pix as images I gave up and went with Flickr, instead. The process was easy even for a IT phobe like me. What the heck is wrong with Google that they can't figure out photos???"
1. Use an enum ("CONSISTENT", "INCONSISTENT", "CACHED", "ANY")
2. Use a freshness-in-ms long (0ms == consistent, 10ms, 10h...)
I'm leaning toward #2 since it's more of "what the client actually cares about" and keeps the door open to changing infra under the hood without API/client changes. E.g. we could put a cache in front of the storage that's ~minutes old and serve from it for some clients.
Also, Spanner does this and I'm sure they have a good reason :)
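Option #2 can be sketched as a read call that takes a staleness bound: 0 forces a consistent read from the backing store, and any larger bound lets the server answer from a cache entry that is at most that old. This is roughly the shape of Spanner's bounded-staleness reads, but the class and parameter names here (FreshnessStore, read, max_staleness_ms) are hypothetical, and the injectable clock exists only to make the example deterministic.

```python
import time
from typing import Callable, Dict, Tuple


class FreshnessStore:
    """Hypothetical store whose reads are parameterized by acceptable staleness."""

    def __init__(self, backend: Dict[str, str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._backend = backend                          # stand-in for the consistent store
        self._cache: Dict[str, Tuple[str, float]] = {}   # key -> (value, fetched_at seconds)
        self._clock = clock
        self.backend_reads = 0

    def read(self, key: str, max_staleness_ms: int = 0) -> str:
        now = self._clock()
        cached = self._cache.get(key)
        if max_staleness_ms > 0 and cached is not None:
            value, fetched_at = cached
            if (now - fetched_at) * 1000 <= max_staleness_ms:
                return value                             # fresh enough for this caller
        # A 0ms bound (or a too-old cache entry) goes to the source of truth.
        self.backend_reads += 1
        value = self._backend[key]
        self._cache[key] = (value, now)
        return value


if __name__ == "__main__":
    fake_now = [0.0]
    store = FreshnessStore({"campaign:42": "active"}, clock=lambda: fake_now[0])
    store.read("campaign:42")                                 # consistent: backend
    fake_now[0] = 5.0                                         # 5 seconds later
    store.read("campaign:42", max_staleness_ms=10 * 60 * 1000)  # 10 min bound: cache
    store.read("campaign:42", max_staleness_ms=1000)          # 5s-old entry > 1s: backend
    print(store.backend_reads)
```

Because clients only state how stale an answer they can tolerate, the server is free to add, remove, or retune a cache tier without any API or client change, which is the flexibility argued for above; an enum would need a new value for each new tier.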
I think the bigger challenge is to prove that a mobile ad actually drove or influenced a purchase, even if the purchase happened on a tablet, on a desktop, or in a store. That's exactly what we're doing :)
- UC Berkeley, B.A. Mathematics, 2000 - 2002
- Twitter Inc., Staff Software Engineer, 2015 - present
- Software Engineer, 2005 - 2014
- TellApart, Software Engineer, 2014 - 2015
Box + Subspace: Extending Content Securely Across All Devices | Box Blog
Today, I'm excited to announce that Box has acquired Subspace, a startup focused on enabling secure access to data and applications on the w...
The Baddest Of George Thorogood And The Destroyers
The aptly-titled The Baddest of George Thorogood and the Destroyers offers a dozen tracks that cleanse the church of rock'n'roll of all but...
safe kosher-style dill pickles: fermented and non-fermented
This post contains two recipes: (1) a tested recipe for vinegar pickles, which are canned immediately and therefore called "Quick Kosher Dil...
How I Make Lacto-Fermented Salsa | Nourishing Days
Now that tomatoes are rolling in I thought I'd share this recipe from last year. This is our absolute favorite ferment and some of the tasti...
Making Sour Pickles | Wild Fermentation :: Wild Fermentation
Resources for fermenting a vast range of nutritious and delicious live-culture foods and drinks.
HGIC 3101 Common Pickle Problems : Extension : Clemson University : Sout...
Pickling problems including pickle spoilage, safety questions, using slicing cucumber instead of picklers, cloudy pickles, pink discoloratio...
Making Traditionally Fermented Pickles | Chiot's Run
Traditionally fermented food are super healthy. It's always nice when you can make something using these methods. Not only is it quick and e...
Lightly salted pickles (Ogórki małosolne) :: Palce Lizać! - only tried-and-tested recipes - kuchnia po...
Preparation: Wash the cucumbers thoroughly and arrange them in a scalded jar. Peel the garlic and cut the cloves in half. Each layer of cucumbers...