One of my favorite tools to browse through is the Google-wide Profiler: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36575.pdf
It basically tells you where all the CPU cycles are going, fleet-wide across all Google datacenters, in a way that you can easily slice and dice to find performance optimization opportunities. It also takes the guesswork out of deciding what type of acceleration makes sense.
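To make the "slice and dice" idea concrete, here is a toy sketch of that kind of aggregation. It is not how the real system works: the record fields (`datacenter`, `binary`, `function`, `cycles`), the sample values, and the `cpu_share` helper are all made up for illustration.

```python
from collections import Counter

# Hypothetical sampled-profile records; field names and numbers are
# invented for this example and do not come from the real tool.
SAMPLES = [
    {"datacenter": "dc-a", "binary": "websearch", "function": "memcpy", "cycles": 400},
    {"datacenter": "dc-a", "binary": "websearch", "function": "rank", "cycles": 100},
    {"datacenter": "dc-b", "binary": "ads", "function": "memcpy", "cycles": 300},
    {"datacenter": "dc-b", "binary": "ads", "function": "compress", "cycles": 200},
]


def cpu_share(samples, key):
    """Return the fraction of all sampled cycles per value of `key`, largest first."""
    totals = Counter()
    for sample in samples:
        totals[sample[key]] += sample["cycles"]
    grand_total = sum(totals.values())
    return {name: cycles / grand_total for name, cycles in totals.most_common()}


if __name__ == "__main__":
    # Slice the same samples along different dimensions.
    for dimension in ("function", "binary", "datacenter"):
        print(dimension, cpu_share(SAMPLES, dimension))
```

Slicing by `function` here shows a single routine dominating across otherwise unrelated binaries, which is the sort of signal that suggests where a shared optimization or accelerator would pay off.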
One side benefit of such shared infrastructure is AutoFDO, which optimizes your binaries automatically based on profiles collected while they run live in production. It's really interesting what crazy things begin to make a lot of sense once you think at scale.