NetApp specializes in helping customers get the most out of their data with industry-leading cloud data services, storage systems and software. The company brings enterprise-grade data services you rely on into the cloud, and the simple flexibility of cloud into the data center. NetApp’s solutions work across diverse environments and the world’s biggest clouds.
NetApp uses InfluxDB for real-time resource trending, SLO/SLI calculations, and alerting. The SRE team relies heavily on identifying trends in resource consumption across critical Linux servers in its infrastructure, database monitoring, and custom resource monitoring. The company has been using TICKscript for downsampling and alerting, and is now starting to evaluate Flux.
The company has found that InfluxDB offers a high ingest rate, integrates well with other tools, and performs extremely well. The team can monitor multiple systems efficiently and integrate with Grafana, its preferred tool for displaying dashboards. The Slack integration has also proven very useful, since Slack is what the team uses to communicate across the globe. When data stored in InfluxDB triggers an alert, team members in India see it at the same time as those in the US, allowing the company to coordinate quick responses.
Lead Site Reliability Engineer Dustin Sorge likes that InfluxDB is highly effective for storing and processing time series data. For the SRE team, time series data makes it possible to efficiently detect trends that can lead to failure conditions within their environment. The system data collected via Telegraf is also useful when investigating failure conditions (trends in memory usage, CPU usage, etc.), which is key to the SRE postmortem process.
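As a rough illustration of the kind of trend detection described above, the following Python sketch fits a least-squares line to a series of resource samples and projects when a threshold would be crossed. It is not NetApp's implementation: the function names, the sample values, and the 90% threshold are all hypothetical, and a production setup would more likely run this logic inside the InfluxDB stack itself.

```python
from statistics import mean


def linear_trend(samples):
    """Return (slope, intercept) of a least-squares line through
    (timestamp_seconds, value) pairs."""
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_bar, v_bar = mean(ts), mean(vs)
    denom = sum((t - t_bar) ** 2 for t in ts)
    if denom == 0:
        raise ValueError("need at least two distinct timestamps")
    slope = sum((t - t_bar) * (v - v_bar) for t, v in samples) / denom
    return slope, v_bar - slope * t_bar


def seconds_until(samples, threshold):
    """Project how many seconds after the last sample the fitted line
    reaches the threshold. Returns None if the trend is flat or falling."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - last_t)


# Hypothetical memory usage (percent), sampled once a minute.
memory = [(0, 50.0), (60, 52.0), (120, 54.0), (180, 56.0)]
remaining = seconds_until(memory, threshold=90.0)
print(f"projected seconds until 90% memory: {remaining:.0f}")
```

A projection like this is only as good as the assumption that the trend stays linear, so it is better suited to raising an early warning than to predicting an exact failure time.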
Sorge recommends checking out the Slack integration. NetApp currently uses Kapacitor to send alerts to Slack via webhooks. This lets their globally distributed team work together seamlessly on the foundation of time series data stored in InfluxDB. Sorge is also looking forward to trying out Starlark within Telegraf.
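NetApp sends these alerts through Kapacitor, so the sketch below is not their configuration. It only illustrates, in plain Python, what a webhook-based Slack alert amounts to: a small JSON document with a `text` field posted to an incoming webhook URL. The host name, metric name, values, and helper names are hypothetical, and the webhook URL is left for the reader to supply.

```python
import json
import urllib.request


def build_slack_payload(host, metric, value, threshold):
    """Build the JSON body for a Slack incoming webhook message."""
    text = (
        f":warning: {metric} on {host} is {value:.1f} "
        f"(threshold {threshold:.1f})"
    )
    return {"text": text}


def post_to_slack(webhook_url, payload, timeout=5.0):
    """POST the payload to a Slack incoming webhook and return the HTTP status.

    webhook_url is whatever URL Slack issued for the target channel.
    """
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical alert for a database host whose memory usage crossed 90%.
payload = build_slack_payload("db-01", "mem_used_percent", 93.4, 90.0)
print(json.dumps(payload))
```

Keeping payload construction separate from the network call makes the message format easy to check without contacting Slack.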
Linux server monitoring: SRE team relies on trend analysis
Improved operations: through better alerting used by their worldwide team
Reduced downtime: via better detection and trend analysis
“The biggest benefit to using InfluxDB for our team is how easy it is for us to write customized alerting for our homegrown software. I would recommend InfluxDB to anybody who’s trying to trend any kind of data over time, and they’re looking for a tightly integrated software stack that enables them to do that.”