Twitter is open-sourcing a tool that lets it detect performance anomalies and usage spikes on its network that are too brief to register in its normal observability and metrics systems.
Staff site reliability engineer Brian Martin said in a blog post that Rezolus—which he described as a “high-resolution systems performance telemetry agent”—is now available via its public GitHub repository.
He wrote that Twitter has been running Rezolus in production for more than a year, using it to help quantify workload characteristics, provide data to drive optimization efforts and diagnose runtime performance issues.
Martin explained the origins of Rezolus, writing, “Rezolus was born out of a need to understand systems performance on fine-grained timescales. We found that while running very high throughput synthetic benchmarks, there were very brief but sometimes significant performance anomalies. Our existing telemetry, which samples minutely, was failing to reflect these anomalies. This was because the anomalies, which were about 10 seconds in duration, were being masked by a low sample rate relative to the length of the anomalies. This made it difficult to understand what was happening and tune the system for higher performance.”
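The masking effect Martin describes can be illustrated with a small simulation. The Python sketch below is not taken from Rezolus; the baseline of 100 requests per second, the 10-second spike to 500 and the helper names are all assumptions chosen for illustration. It compares a once-per-minute reading and a per-minute average against the per-second data.

```python
# Illustrative sketch only: the numbers and function names are assumptions,
# not values or code from Rezolus or Twitter's telemetry.

def build_series(baseline=100.0, spike=500.0, spike_start=70,
                 spike_len=10, total=180):
    """Per-second request rate with one short burst."""
    series = [baseline] * total
    for t in range(spike_start, spike_start + spike_len):
        series[t] = spike
    return series


def point_samples(per_second, interval=60):
    """Read the metric once per interval, as a minutely sampler would."""
    return per_second[::interval]


def window_averages(per_second, window=60):
    """Average each consecutive window of per-second readings."""
    chunks = [per_second[i:i + window]
              for i in range(0, len(per_second), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


series = build_series()
minutely = point_samples(series)      # one reading at t = 0, 60, 120
averages = window_averages(series)    # one average per minute
peak = max(series)                    # what per-second sampling can see

print("minutely samples:", minutely)
print("minutely averages:", [round(a, 1) for a in averages])
print("per-second peak:", peak)
```

Under these assumptions, the once-per-minute readings never land inside the burst and report only the baseline, while the per-minute average for the affected minute rises to roughly 167, well short of the five-fold peak that the per-second data shows.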
He also shared an example of how the tool was used.
Martin wrote, “One time, several services were experiencing repeatedly degraded success rates for a few minutes at a time. These services each found that they were being throttled by a backend service. The team responsible for that service didn’t see anything on the existing telemetry that they could use to figure out what was happening during the minutes where throttling was occurring. But knowing that throttling decisions are made on a finer timescale than the default telemetry collection, they began to suspect sub-minutely bursts.”
He concluded, “By deploying Rezolus to the backend service, we were able to see that even though average request rates weren’t significantly elevated, there were bursts of over five times the baseline traffic during which CPU (central processing unit) utilization was bursting to 100%. We were also able to identify exactly when they happened. With the additional telemetry from Rezolus, we were able to correlate with the backend service logs and determine the source of the spikes.”