So, You Want To Trace Your Distributed System?
This technical report, which I co-wrote with Raja Sambasivan, Greg Ganger, and Ilari Shafer, goes through the choices you have depending on the nature of your system, what you want to learn from your instrumentation, and the things you should pay attention to.
End-to-end tracing captures the workflow of causally-related activity (e.g., work done to process a request) within and among the components of a distributed system. As distributed systems grow in scale and complexity, such tracing is becoming a critical tool for management tasks like diagnosis and resource accounting. Drawing upon our experiences building and using end-to-end tracing infrastructures, this paper distills the key design axes that dictate trace utility for important use cases. Developing tracing infrastructures without explicitly understanding these axes and choices for them will likely result in infrastructures that are not useful for their intended purposes. In addition to identifying the design axes, this paper identifies good design choices for various tracing use cases, contrasts them to choices made by previous tracing implementations, and shows where prior implementations fall short. It also identifies remaining challenges on the path to making tracing an integral part of distributed system design.