Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Telegraf
Session date: Jul 19, 2022 08:00am (Pacific Time)
NetApp is a global, cloud-led, data-centric software company and an industry leader in hybrid cloud data services and data management solutions. Their platform enables their customers to store and share large quantities of digital data across physical and hybrid cloud environments. NetApp Engineering’s Site Reliability Engineering team is tasked with supporting their internal build, test, and automation infrastructure. After collecting their time-stamped data in InfluxDB, they use Kapacitor to push alerts directly to Slack via webhooks, so their globally distributed SRE team is able to collaborate and troubleshoot seamlessly. Discover how NetApp uses a time series platform to detect, in real time, trends that can result in failures within their environments, and to provide key metrics used in SRE postmortems.
Join this webinar as Dustin Sorge dives into:
- NetApp's approach to monitoring their SRE team's metrics - including SLOs and SLIs
- Their best practices and techniques for monitoring memory usage and CPU usage
- How they use InfluxDB and Telegraf to detect trends and coordinate fixes faster.
Here is an unedited transcript of the webinar “Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Telegraf”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Caitlin Croft: Senior Manager, Customer and Community Marketing, InfluxData
- Dustin Sorge: Lead Site Reliability Engineer, NetApp
Caitlin Croft: 00:00:00.162 Hello, everyone, and welcome to today’s webinar. My name is Caitlin and I’m joined today by Dustin from NetApp. We will be talking about how his team uses Grafana, InfluxDB, and Telegraf. Once again, this session is being recorded and will be made available. Please post any questions you may have in the Q&A and we will have them answered at the end. And I just want to remind everyone to please be courteous to all attendees and speakers as we want to make sure that this is a fun and safe place for all of our community. Without further ado, I’m going to hand things off to Dustin.
Dustin Sorge: 00:00:41.082 Great. Thank you, Caitlin. I appreciate it. So yeah, I appreciate this opportunity to talk about something I’m very passionate about, and that’s site reliability engineering. So here’s an agenda of the topics I’m going to be covering: I’ll talk about NetApp a little bit, what the company is; BAERO, which is the team that I’m on within NetApp’s ONTAP engineering organization; a brief bit about me; we’ll talk about SRE, kind of formalize what SRE is just to make sure we’re kind of speaking the same language; I’m going to touch on what we were doing before we started using InfluxDB; I’m going to give a couple of example services that we use within our team, within our BAERO team; I’m going to talk about some key metrics for SRE. These are kind of industry-standard metrics for reliability engineering. Then, I’ll move into how we use InfluxDB to actually measure them. I’m going to talk about some coordinated incident response, discuss a little bit how we collaborate to solve issues as SREs across multiple sites and across the globe, talk a little bit about measuring system resources. That’s something that all SREs are concerned with measuring out. What’s CPU look like? What’s memory look like? What’s all these other different system statistics look like? Some tips for the TICK stack in InfluxDB, where our team is possibly looking to go with Influx in the future, and then we’ll open it up for questions.
Dustin Sorge: 00:02:14.325 So let’s start out with NetApp, the company, who we are. So NetApp is a global, cloud-led, data-centric software company. So I’ve personally been at NetApp for about 11 years, and it’s been very cool to watch this transition where NetApp is starting to transition from a traditional, on-prem, data storage company to a cloud-based services company, or a software company. So we went from being a traditional, on-prem company to now having a very rich portfolio of cloud-based services. And it’s been really awesome to watch that transition happen. And it’s a transition that’s still occurring. So that traditional, on-prem, data storage business still exists in conjunction with all the new cloud-based applications and services that NetApp provides. So any storage needs you could possibly have, NetApp has you covered 100%.
Dustin Sorge: 00:03:10.314 We’re a fairly large company, about 11,000 employees. We’re spread across many sites all over the globe, and we’ve been around since 1992. So it’s really a fantastic company to work for, and I’m very proud to be working here. So the team that I exist on within NetApp is called BAERO. This is a kind of initialism or acronym that stands for Build Automation and Engineering Operations. Our team is responsible for the build, test, and automation infrastructure for NetApp’s ONTAP engineering organization. Within our team, we support a number of services, and I just give an example of four of them here. And I’m going to dive a little deeper into a couple of these. One of these is the ONTAP build farm that we have, where ONTAP is the operating system that drives our storage products; our Common Test Lab; and then our continuous integration testing. And then, we also have some really cool AI-driven code pre-submission automation that exists within our organization to really help our developers who are submitting a code change. And we have some AI in the background that will go out and say, “Okay, you’re submitting this code. Here is a subset of tests you should run to make sure your code is okay to check in.” Yeah. Some really, really neat stuff. The team tends to be very bleeding edge in terms of technologies, and innovation is highly, highly encouraged, which makes it a very fast-paced and fun team to be a part of.
Dustin Sorge: 00:04:47.892 A little bit about me. I’m in Pittsburgh, Pennsylvania. I’ve been here for about 20 years. And like I said, I’ve been at NetApp since 2011. I did my undergrad at the University of Pittsburgh and then did my graduate degree at Carnegie Mellon. And before I came to NetApp, I was working for Carnegie Mellon’s High-Performance Computing, or Supercomputing, Center, doing HPC operational-type work and some software development. So let’s go ahead and kind of delve into what SRE is. I’m sure we all know that SRE stands for Site Reliability Engineering. This is a role where you primarily focus on production engineering-type issues. One of the things that makes SRE unique is that it blends the skills of a very good systems administrator or systems engineer and a talented software engineer; it takes those two skill sets and kind of marries them together.
Dustin Sorge: 00:05:48.139 So oftentimes in people’s careers, those tend to be very independent tracks, and people will spend their career focusing on one or the other. But SRE tends to blend the two of those together into one role, which I find incredibly exciting and very fulfilling. Sometimes people will ask, “Is SRE DevOps?” The answer is, well, yes, it’s an implementation of DevOps. So in my eyes, DevOps is a very broad philosophy and SRE is very narrow on purpose. So if you were to boil SRE down to one thing, SREs care about service uptime. That’s the one thing we care about above all else. So SRE broad — or DevOps broad, SRE very narrow. And as SREs, we’re constantly trying to — the famous quote is, “Automate yourself out of the job.” It’s something you’ll never do, but it’s something you should be aspiring to try for. So when I think of what SRE is kind of at a more broad level, to me, it’s the balloon game. So I don’t know if growing up you ever played this game where you blow up a balloon and you hit it, and your goal is to keep the balloon in the air without having it touch the ground.
Dustin Sorge: 00:07:04.269 I was watching my kids play this 1 day and I had this epiphany that like, “Oh, that’s what I do.” And in my example here, the balloon is the service you’re supporting. So as long as the balloon is in the air, you have uptime, and if the balloon touches the ground, that’s downtime. And the game that SREs are playing 24 hours a day, 365 days a year, is, how do you know when that balloon is starting to fall when you’re not looking at it, right? Do you have the right monitoring in place? Do you have the right SLIs in place? Do you have the right alerting in place to know that your service is possibly starting to degrade and go south so you can go hit that balloon back up in the air and make sure it stays up? So that’s my kind of simple analogy for explaining at a high level to someone who may not be as familiar with SRE that that’s what it is to me. So before our team started using InfluxDB, we, of course, had metrics. We, of course, looked at system resource monitoring. We had a little bit of alerting in place as most teams do.
Dustin Sorge: 00:08:11.745 So our team, we had a number of metrics databases that were just, basically, MySQL instances, MariaDB instances that store information, and they were spread around a little bit. So these metrics DBs existed kind of over here, and then we would have a need to look at system resource utilization on hosts. So we relied on [inaudible] observability tools that just kind of exist out there. These are common tools that all sysadmins tend to be familiar with, and we’ve all run top and looked at what processes are eating up memory or using a certain amount of CPU. You want to dig into the CPU a little bit more, so you might run mpstat and get a per-CPU view. Maybe you have a hot CPU running somewhere. For disk stats, you’d use iostat, and then sar, of course, for kind of a historical view of system resource utilization. These are all fine tools and they’re great, and they still get used all the time. I know sar at times might not be as granular as you would like it to be, so this is where Telegraf steps in, which I’ll definitely get into later in this talk, along with other pieces of the TICK stack, to really homogenize these things.
Dustin Sorge: 00:09:29.839 And then, some rudimentary alerting with Nagios. Yeah. We could get alerts that we have a disk filling up or there’s a file system that’s above some level. But it didn’t really allow for a lot of custom-type alerting, which is one of the major strengths that the Influx and the TICK stack provides. So the question is, well, why did we land on InfluxDB? Because there’s other solutions out there, for sure. InfluxDB isn’t the only time series database. So InfluxDB is incredibly scalable. So at the time, we were looking at Prometheus a little bit. And when we were doing this evaluation, it was common practice for people to write to Prometheus and then take the data from Prometheus and then put it in InfluxDB to get the scalability. So we decided, well, let’s cut the middleman out altogether, and let’s just write to InfluxDB. And that’s what we did.
Dustin Sorge: 00:10:30.992 Here is a view of the architecture where Influx fits into our environment. So in the middle here, you’ll see the InfluxDB Enterprise data nodes. We have four of those. We have a four-node enterprise cluster, with three meta nodes off to the side. We use Kapacitor quite a bit for our monitoring and alerting. You can see how that writes out to Slack and email, and we’re doing some experimentation with PagerDuty at the moment. We have scripts and automation. So we use the Python InfluxDB client library. We actually have a Perl implementation of an InfluxDB client that an engineer within our team wrote. If you’re using Perl, NetApp has a lot of legacy Perl automation around. So we still use that to an extent even though we tend to be more modern nowadays. There are times where we want to modify some legacy code to pull some custom metrics.
Dustin Sorge: 00:11:33.768 You see Jenkins on the upper left there. There’s an InfluxDB plug-in for Jenkins that we use to let us know when certain jobs are failing in Jenkins. I know Jenkins will email you if there’s a problem, but being able to send Slack alerts because Jenkins jobs are failing is very powerful to us because Slack is very crucial in our workflow. Our SRE team has a custom REST API service that we’ve built, so we use this when we’re inserting records into our outage database, which I’ll definitely get into a little bit later. And that connects directly to InfluxDB. It also connects to Grafana. We actually use this REST API service whenever we have an outage: we’ll not only insert a record into our outage database, we’ll also use the Grafana API to annotate the time range in which there was an outage. So you can actually start to correlate spikes and dips in the graph to outages, and it’ll annotate over them. And you can hover over that annotation, and it’ll actually include a link to our postmortem documentation for the outage. So it provides a very complete way of interpreting the data in your dashboard.
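For illustration, a minimal sketch of that annotation step might look like the following. This is not NetApp's actual REST service; the Grafana URL, API token, tags, and postmortem link are all placeholders, and the annotation is created as a tag-based (organization-wide) annotation that a dashboard can surface through an annotation query.

```python
import requests
from datetime import datetime, timezone

GRAFANA_URL = "https://grafana.example.com"   # placeholder Grafana instance
API_TOKEN = "REPLACE_WITH_API_TOKEN"          # placeholder API token


def annotate_outage(start: datetime, end: datetime, postmortem_url: str) -> None:
    """Overlay an outage window as a Grafana annotation and link the postmortem doc."""
    payload = {
        "time": int(start.timestamp() * 1000),     # Grafana expects epoch milliseconds
        "timeEnd": int(end.timestamp() * 1000),    # timeEnd makes this a region annotation
        "tags": ["outage", "ctl"],                 # dashboards can query annotations by tag
        "text": f"CTL outage. Postmortem: {postmortem_url}",
    }
    resp = requests.post(
        f"{GRAFANA_URL}/api/annotations",
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    annotate_outage(
        start=datetime(2022, 7, 1, 14, 0, tzinfo=timezone.utc),
        end=datetime(2022, 7, 1, 14, 45, tzinfo=timezone.utc),
        postmortem_url="https://wiki.example.com/postmortems/2022-07-01-ctl",
    )
```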
Dustin Sorge: 00:12:58.045 We rely on Telegraf quite a bit. We have Telegraf all over our infrastructure, pulling out custom metrics, reporting system metrics. We’ll write custom shell scripts that dump out info in line protocol, and that just gets ingested and put into InfluxDB right away. And then, on the bottom left here, we have an open-source instance of InfluxDB. And when I get to the tip section, I’ll talk about something called the Watcher of Watchers. But the idea here is that we write Telegraf data from the data nodes themselves to a separate instance of InfluxDB. So if you find yourself in a situation where the cluster is having problems or isn’t performing, which doesn’t happen very often at all, you want to have that data going to a separate place so you’re not trying to interpret data that may not be what you expect.
Dustin Sorge: 00:13:58.401 Okay. So now I’m going to get into a couple of services that I’d alluded to previously. One is our CTL environment. So the CTL is our Common Test Lab. So this is approximately 30,000 compute nodes, both physical and virtual, spread across multiple data centers around the globe. And this is an environment that allows NetApp developers and QA engineers to go in and say, “I need to run a test on this specific hardware configuration. Run the test and just give me the results.” Or they can make a reservation and say, “I need a hardware configuration configured exactly this way. I need 50 Linux RHEL 7 clients, I need 30 Windows clients, I need 10 SUSE clients.” And we’ll hand all those over and lease them to the engineer for a given period of time. So they might need that for two weeks, and we say, “Okay. Here’s your cluster. Test as much as you would like in that time.”
Dustin Sorge: 00:15:07.648 And when that time expires, we reclaim the gear, clean it up, and put it back in the pool. So within this environment, we’re actually doing about 100,000 hours of ONTAP testing per month, which is huge. And there’s a large number of unique hardware configs that are available to our engineers. So for a little historical context around the CTL and SRE, the CTL is where SRE was born within BAERO. At the time, a number of years ago, this was split up into an operations team in Raleigh, North Carolina, and a development team in Pittsburgh, PA, which I was a part of. And the CTL environment is driven by a microservices-architecture software stack we call CIDR internally. So the development team would build CIDR, release it, and throw it over the wall to the operations team. The operations team would get it, and they would field tickets from users and reach out to us and say, “Hey, we think there’s a regression here.” They’d throw it back over the wall to us. We would then decide, “Do we deploy a hotfix? Do we roll back the release?” That kind of thing. And it was fine, but it was a little clunky. And it’s a very common workflow where you’re just taking this ball of code and kind of throwing it back and forth. So SRE was created to be a conduit between the two. So we basically ripped that wall down, put SRE in the middle, and now everything flows through the SRE team. And it’s much more fluid nowadays.
Dustin Sorge: 00:16:52.432 So by doing this, we took our development team and removed any operational-type tasks they may have been spending time on, and now they’re able to focus solely on development. So one of the really cool things that falls out of SRE when implemented correctly is that it accelerates development velocity. So now that we have an SRE team that handles all kinds of automation, development, and operational issues, along with some of our data center folks in the operations team in Raleigh, the development team could just focus on features, and this really accelerated feature development and made things better quickly. Another service that BAERO supports is our continuous integration testing environment. So our CIT environment is huge for protecting the stability of our code line. So developers will submit changes into the code line that have already passed their pre-submission testing, which I had alluded to before.
Dustin Sorge: 00:17:58.308 And we have over 1,100 individual CITs that are being run on cadence throughout the day, every day. So this gives us 440,000 hours of testing per month. So between the CTL and the CIT environments, you’re getting over half a million hours per month of testing to protect the integrity of the code line. So if you’ve ever wondered why NetApp’s quality is so good, these are two of the reasons why. There’s other reasons, but these are two very big reasons. There’s an insane amount of testing that goes on to protect our IP. Within this environment, we have a bisect functionality which allows us to narrow down the exact change number that caused a failure, and then, we can take that change and revert it out of the code and return it to the way things were when tests were passing. So this happens sometimes. And it’s not necessarily a bad thing to get reverted, but it happens. And it’s great that we have the safety net in place, and this really helps protect the integrity of our code.
Dustin Sorge: 00:19:21.010 So I’d like to delve into some key metrics for reliability engineering. There are two major metrics that are used for SRE. One is an SLI. This is a Service Level Indicator. SLIs are great because they give you a way to quantify user experience, to get that narrowed down to an actual number. I like to think of an SLI as a probe you’re putting into your service so you can check the temp. It’s like you’re roasting a turkey and you’re taking a probe, and you’re putting one in the breast and one in the leg and one in another part, and you’re checking the temperatures of different pieces of the bird at the same time. So you can do this with a service. And I’ll talk a little bit about what makes a good SLI coming up. And these are moving targets. As your service improves and you’re meeting your SLIs, you can adjust them to your liking. I mentioned that an SLI is a number, a percentage. So it’s basically good events over expected events times 100. And when I give an example a little later on, I’ll show exactly how we extract those values and calculate that percentage.
Dustin Sorge: 00:20:33.281 And then you have your Service Level Objective, your SLO. This is the uptime that you’re committing to for your service. It may seem like you would want that to be 100% because who wouldn’t want a service up all the time? But that’s not realistic. Software systems are complex, there are a lot of moving pieces, and there are external dependencies. So 100% should never be the goal. It should be something high: a percentage that you talk about and agree upon and say, “Any amount of uptime lower than X, and our users aren’t going to be happy.” And you strive to meet that goal. And that’s something that you measure consistently. Which brings us to error budgets. So an error budget is what falls out of your SLO. It’s 100% minus your SLO target. So error budgets are important because they’re a metric that says, “This is how unreliable we are going to allow our service to be,” because there’s an expectation that it can’t be up all the time. In my example of the balloon game, it’s unrealistic to think the balloon will never touch the ground, because it will at some point, but it’s going to be up in the air most of the time.
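To make that arithmetic concrete, here is a minimal sketch. The event counts and the 99.5% SLO target are made-up example numbers, not NetApp's.

```python
def sli(good_events: int, expected_events: int) -> float:
    """Service Level Indicator as a percentage: good events over expected events, times 100."""
    return 100.0 * good_events / expected_events


def error_budget(slo_target_pct: float) -> float:
    """Error budget is the unreliability the SLO allows: 100% minus the SLO target."""
    return 100.0 - slo_target_pct


# Example: 9,870 of 9,900 reservation add calls completed within the latency threshold.
print(f"SLI: {sli(9_870, 9_900):.2f}%")            # -> SLI: 99.70%

# A 99.5% SLO leaves a 0.5% error budget to spend on risky releases and maintenance.
print(f"Error budget: {error_budget(99.5):.2f}%")  # -> Error budget: 0.50%
```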
Dustin Sorge: 00:21:52.574 So error budgets are used to make decisions about feature content for the service that you’re supporting. So if you have a broader error budget, you can be a little more risky in your release content. And if you find yourself starting to chew up your error budget, you might want to become a little more risk averse. You might want to be a little more conservative with your release content and value stability over all else, because we’re looking out for the users here. We want the uptime to be high. And just like an SLI, an SLO should be measured consistently and then displayed via dashboard. These are some other metrics that are common within SRE. You have mean time to alert. So I have an outage timeline here at the top. So the time to alert is the delta between when the bad thing happens and when the alert ends up being triggered. It’s the amount of time it takes to notice that there’s a problem. So you take that delta per outage and then average them across all of your outages for a given period of time.
Dustin Sorge: 00:23:05.619 You have mean time to recovery. So MTTR can stand for a number of things. These are two common MTTR metrics. So once you have your bad thing happen, how long does it take on average to restore your service? As SREs, our goal is not necessarily to understand what happened right away. It’s to get the service back up as quickly as humanly possible. We’re trying to keep that uptime up as much as we can. We’ll figure out what happened later. Just get the service back up, and then we’ll do a postmortem analysis afterwards. So MTTR can also stand for mean time to respond. Certain SRE teams may have SLAs around the amount of time it takes to respond to an alert. Our SRE team’s a little more lax on that, given the nature of our services, but depending on the service you’re supporting, you may have a very narrow SLA from when that pager goes off, or when you get an alert on your phone, to when you have to engage.
Dustin Sorge: 00:24:16.960 And then MTBF, mean time between failure. How often is your service encountering failures? So you want to measure the average amount of time between instances of your service going down. So this brings us to using InfluxDB to measure those metrics, and that’s what we do. So I had alluded to the fact that we have an outage database. We have one per service within our team. So here’s an example of our CTL outage measurement. And down here on the bottom left, you can see I have some strategic tags and fields outlined. So we have a separate measurement that will track all of our outages. We store the service that was affected; whether the outage was planned or unplanned; and the category of the outage, so was this a software change where we shot ourselves in the foot and the service went down? Was this an external dependency that we couldn’t control? Was this an infrastructure issue? These tags are really important because they allow you to provide context around the outages you’ve had.
Dustin Sorge: 00:25:33.478 Because at the end of the year, or whatever period of time you would like, you’d like to go back and look at the history of your outages and get some context around them, like, are we hurting ourselves? Are there things that we just couldn’t help? And context is very important. And then we’ll store a description of the outage, the time that the outage happened, when it was recovered, and then the time that we were notified. So as an SRE, as a lead for the team, I’m very interested in knowing when we knew that there was a problem. As an SRE, I don’t want other people telling me there’s a problem. I want to know there’s a problem before someone else does. And this is a way that we can track that. So on the right here, this is an example of a REST API call. This is kind of a Swagger UI instance of it. So whenever we have an outage, we do a postmortem analysis and determine when did this happen, at what time was it recovered, and when did we know about it, along with some other data about the outage. And we create automation Jiras around how we make sure this outage doesn’t happen again, or how we know that there’s a problem sooner than we did this time. So we’re always looking to get better.
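As a rough illustration of what one of those outage records could look like when it lands in InfluxDB, here is a sketch using the InfluxDB 1.x Python client. The measurement name, tag keys, and field names below are assumptions that approximate the schema described above; NetApp inserts these records through their own REST service rather than writing directly like this.

```python
from datetime import datetime, timezone

from influxdb import InfluxDBClient  # 1.x client library: pip install influxdb

# Placeholder connection details.
client = InfluxDBClient(host="influx.example.com", port=8086, database="sre")


def record_outage(service: str, planned: bool, category: str, description: str,
                  started: datetime, recovered: datetime, notified: datetime) -> None:
    """Insert one outage record: low-cardinality context as tags, details as fields."""
    point = {
        "measurement": "ctl_outages",        # assumed measurement name
        "tags": {
            "service": service,
            "planned": str(planned).lower(),
            "category": category,            # e.g. software, infrastructure, external
        },
        "time": started.isoformat(),         # outage start is the point's timestamp
        "fields": {
            "description": description,
            "recovered_epoch": int(recovered.timestamp()),
            "notified_epoch": int(notified.timestamp()),
        },
    }
    client.write_points([point])


record_outage(
    service="ctl",
    planned=False,
    category="software",
    description="Reservation service degraded after deploy",
    started=datetime(2022, 7, 1, 14, 0, tzinfo=timezone.utc),
    recovered=datetime(2022, 7, 1, 14, 45, tzinfo=timezone.utc),
    notified=datetime(2022, 7, 1, 14, 5, tzinfo=timezone.utc),
)
```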
Dustin Sorge: 00:26:52.597 But whenever we run this REST API call, it inserts the record into our outage database, and then from there, we have automation running via Jenkins that constantly scans that outage database and generates our SLO targets, because the postmortem is kind of the final stamp on the outage. The outage window and all of this data are finalized, so it allows us to ensure an accurate SLO. And given that this is a REST API call, if you want to work in some more intelligent automation to detect outages and automate the insertion of these records, you absolutely could, because this is just a REST API call. So this leads me to an example SLI. In this example, I’m measuring the latency of an API called reservation add. And before I talk about how we extract this data and feed it to the TICK stack, I do want to talk about why API latency is a good metric. So API latency is great because there’s a very definitive correlation between latency and user experience. So if this API call is taking a long time, you can assume that users are starting to feel pain here.
Dustin Sorge: 00:28:20.637 So one of the cool things that we’ve been able to take advantage of with measuring API latency as an SLI is that we’ve been able to identify software regressions based on this that were missed along the way. So we have guard rails in place when we release our software. We have staging environments, we have unit testing, we have all that good stuff, but there’s no real substitute for the scale of production. So as we try to do this, sometimes the unit tests will pass, but they take longer than they normally do. Everything is green and everything looks good, but we now have this extra layer of analysis with this SLI. And we’ve had instances before where I look at the data for a reservation add and there’s a huge spike, and we go, “Oh, there must be a software regression here.” And then I look at the graph and I can see exactly when the spike occurred. It correlates with when we did our software release, so that’s when the new code went out. And then you can go back into Jira, look at the actual release, view the release content, identify the Jira that caused the breakage, click on the code review that was submitted, and you can actually identify the lines of code that caused the regression in minutes. It’s a very cool process.
Dustin Sorge: 00:29:42.954 So back to how we’re extracting this data. So here’s an example: we log all of our API calls to a common log file that’s on a log rotation schedule, and Telegraf is basically tailing this file at all times. And whenever it detects a line where latency milliseconds are displayed and the method is reservation add, Telegraf extracts that data and inserts it into a measurement called API latency raw, and you can see the logparser config from Telegraf here on the bottom. So now we have a measurement in Influx that contains all of our API calls and how long they took. So that used to live in a text file, and now it lives in InfluxDB. So I had mentioned earlier that SLIs are a percentage, so it’s good events over expected events times 100. And here’s an example of that. This is the exact TICKscript that we use to calculate our reservation add SLI. So for this one, I’m using a batch query to look at InfluxDB and say, “Okay, for reservation add, how many reservation add calls have there been in the past 7 days that took less than 10 seconds?” And then I take the total number of reservation add calls, divide the two, multiply by 100, and take that percentage and stick it into a measurement called reservation add SLI. And that’s what we’re doing here.
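NetApp does this calculation in a Kapacitor TICKscript; purely as a rough, non-authoritative equivalent, the sketch below does the same good-over-total arithmetic from Python against a 1.x instance. The measurement, field, and tag names (api_latency_raw, latency_ms, method, reservation_add_sli) are assumptions chosen to mirror the description above.

```python
from influxdb import InfluxDBClient  # 1.x client library: pip install influxdb

client = InfluxDBClient(host="influx.example.com", port=8086, database="sre")  # placeholder


def count(query: str) -> int:
    """Return the single count value from an InfluxQL aggregate query (0 if no rows)."""
    points = list(client.query(query).get_points())
    return int(points[0]["count"]) if points else 0


# Good events: reservation add calls in the past 7 days that finished in under 10 seconds.
good = count("SELECT count(latency_ms) FROM api_latency_raw "
             "WHERE method = 'reservation_add' AND latency_ms < 10000 AND time > now() - 7d")

# Expected events: all reservation add calls in the same window.
total = count("SELECT count(latency_ms) FROM api_latency_raw "
              "WHERE method = 'reservation_add' AND time > now() - 7d")

# SLI = good events / expected events * 100, written back as its own measurement.
sli_pct = 100.0 * good / total if total else 100.0
client.write_points([{"measurement": "reservation_add_sli", "fields": {"pct": sli_pct}}])
```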
Dustin Sorge: 00:31:21.642 So we do this every hour. You can do this at whatever cadence you would like. So basically, the SLI is defined as: 99% of reservation add calls within the past 7 days should complete within 10 seconds. I know 10 seconds seems like a long time for an API, but trust me, it does a lot. We’ve identified that reservation add is a great API to check because it’s used so frequently amongst our users. Once you have this stuff in InfluxDB, you can then display it in Grafana. So on the left, we have our percentage, and we’re meeting our SLI. That’s great. And we also have that API latency raw table, so let’s display that data just because we can. So each dot in that graph is a call to that API. And then we also store a moving average just to see how we’re doing. So dashboards are great if you’re looking at them. And you’re not always looking at them. And that’s just a matter of fact with dashboards.
Dustin Sorge: 00:32:28.393 And before COVID, we were in our offices. We could have these dashboards up on TVs, on the walls, and people can always look at them, which is awesome. But now that we’re home, sometimes it’s not as easy to do that. So I had mentioned before my balloon example. You want to know that the balloon’s starting to fall when you’re not looking at it. And this is how we do that. Basically, we have a TICKscript that’s watching the SLI percentage, and if it drops below our target, it generates a Slack alert. And we’re watching Slack all the time between our SRE teams and where we’re distributed. So here’s an example of a TICKscript that watches that, looks at the percentage. Are we missing our SLI target? If so, alert to Slack. Let us know. And once we see this alert, we know that SRE should engage and figure out what’s going on. Is this a software regression? Is there some sort of network issue? What’s going on? Is there a problem with our infrastructure? This leads us to take action immediately.
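Their alert is a Kapacitor TICKscript watching the SLI measurement; purely as an illustration of the same check, here is a sketch that reads the latest SLI value and posts to a Slack incoming webhook when it falls below the target. The webhook URL, the 99% target, and the measurement and field names carried over from the previous sketch are placeholders.

```python
import requests
from influxdb import InfluxDBClient  # 1.x client library: pip install influxdb

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder webhook URL
SLI_TARGET = 99.0                                                  # example target, in percent

client = InfluxDBClient(host="influx.example.com", port=8086, database="sre")  # placeholder

# Latest SLI value from the (assumed) reservation_add_sli measurement written earlier.
points = list(client.query("SELECT last(pct) FROM reservation_add_sli").get_points())
current = points[0]["last"] if points else None

if current is not None and current < SLI_TARGET:
    # Missing the SLI target: notify the team in the shared alerts channel.
    requests.post(
        SLACK_WEBHOOK,
        json={"text": f"reservation add SLI is {current:.2f}%, below the {SLI_TARGET}% target."},
        timeout=10,
    )
```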
Dustin Sorge: 00:33:41.349 Okay. Now I’d like to get into what I call coordinated incident response. Our team at NetApp, we’re extremely, extremely lucky in that we have a site of extremely talented engineers in Bangalore, India that are our friends and co-workers, and we collaborate all the time. And our SRE team is split between the two sites. Majority of our SREs are in Pittsburgh. We have a few in Raleigh. But we also have a team in Bangalore. So within SRE speak, this is called a follow the sun model. So the idea being that the more time zones SRE can be in, the more impactful it can be. And as someone who has gotten pages at 3 o’clock in the morning and 2 o’clock in the morning, it’s really, really great to know — and I have the confidence that our team in Bangalore, India can handle any problem that the team in the US can handle. And they can take care of those things while I’m sleeping and our SREs are sleeping and vice versa. When they’re in bed, we can handle things. And then there’s a little time for crossover. Our team in India is absolutely amazing. They spend a lot of — they get online at times in their evening to interact with us, and that’s never overlooked or not appreciated. We always appreciate that. And this, like I said, reduces off-hour pages. If you are lucky enough to have engineers geographically distributed like this, implementing a [inaudible] methodology is very much recommended.
Dustin Sorge: 00:35:28.841 So I mentioned before that Slack alerting is very important to our workflow. And it is. So Slack provides a single point of interaction for problems. So with InfluxDB, we can write custom alerts for service-specific problems and have them alert to one single place where SREs from the US and India can collaborate at one time. And this is very huge. This allows us to be very effective. So here’s an example of us doing that. So in this example, I’m using a deadman alert. So I’m watching a measurement that I’m expecting data to be written to on some sort of cadence. And if we don’t see any data points for 2 hours in this one, an alert gets triggered that says, “Hey, there might be something wrong.” Now you want to figure out why the automation that normally writes data here isn’t writing. So an alert gets triggered to Slack, to our alerts channel, so we can triage the issue in thread. I don’t have the thread visible here, but you can see the four replies at the bottom there. So we’re working together. We’re figuring out the problem. We’ve solved the problem, and then the OK alert comes in. Everything’s good now. So that’s always a great feeling when that OK alert comes across.
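The alert described above is a Kapacitor deadman alert; as a loose sketch of the same idea in plain Python, the snippet below checks whether anything has been written to a measurement in the last two hours and raises a Slack alert if not. The nightly_ingest measurement, its value field, and the webhook URL are hypothetical.

```python
import requests
from influxdb import InfluxDBClient  # 1.x client library: pip install influxdb

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder webhook URL
client = InfluxDBClient(host="influx.example.com", port=8086, database="sre")  # placeholder

# How many points landed in the (hypothetical) nightly_ingest measurement in the last 2 hours?
points = list(client.query(
    "SELECT count(value) FROM nightly_ingest WHERE time > now() - 2h").get_points())
seen = int(points[0]["count"]) if points else 0

if seen == 0:
    # Nothing written for 2 hours: the upstream automation probably stopped, so alert the channel.
    requests.post(
        SLACK_WEBHOOK,
        json={"text": "Deadman alert: no points written to nightly_ingest in the last 2 hours."},
        timeout=10,
    )
```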
Dustin Sorge: 00:36:58.639 And this is all enabled by the fact that the TICK stack is so homogeneous that all the pieces fit together so incredibly well. Before we had InfluxDB supporting our SRE team, all the pieces were very disjointed, so we did the best we could. But this makes it so simple to enable these workflows, and we rely on them very heavily. So measuring system resources, this is something that every SRE is going to be familiar with. This often helps us answer the question that SREs get asked constantly, and that is, “This machine’s slow, this service is slow, what’s going on?” And part of that investigation is you want to start looking at system resources. There’s a performance architect named Brendan Gregg; he was at Netflix, and I’m not sure if he still is. He pioneered something called the USE method, where you take different subcomponents of a system and then look for bottlenecks. So you basically take your memory, CPU, disk, and network, and you look for utilization, saturation, and errors. And these dashboards make this investigation very easy. And the idea being that most of the time, you can figure out performance issues based on that. So maybe your CPU usage is spiking and things are slowing down, or [inaudible] is spiking. There may be some bottleneck writing to your back-end storage. So Telegraf is crucial to doing this, and it is my favorite part of the TICK stack. If someone were to ask me what my favorite part was, it would be Telegraf, for sure.
Dustin Sorge: 00:38:49.208 Basically, once you have a template defined in Grafana, like we have here on the right, adding a host to this takes, I mean, a minute, maybe. I mean, you literally just have to install an RPM, copy over a Telegraf config file that’s pointing to your Influx instance, and then just start the service, and boom, you get so much out of the box without having to do much. And then, once you start collecting this data, if you already have alerts based on some other things like resource utilization here, like we want to know that this disk is filling up or the CPU is spiking on this host, you just start getting that for free because you already have the alerts defined and now you’re just collecting data from another host, and it’s going into these measurements, and the alert can parse them out. So I’d like to talk about a couple of tips that I came up with for running InfluxDB within your environment. One I alluded to earlier is called the Watcher of Watchers. So the idea here is that while you pay for Influx Enterprise, they also have an open-source, single-node instance that you can have for free. So what we do in our environment is we install Telegraf on our data nodes and then write all of our Telegraf data to this open-source instance and then connect Grafana to it. So if we’re trying to triage some sort of problem with Influx support, I can bring up a dashboard immediately and say, “Okay, well, influxd is in a crash loop, and here’s what the memory looks like while this is happening.” So this is an actual screenshot of an issue we were having at some point in time where Influx was in a crash loop, and we were able to view those statistics.
Dustin Sorge: 00:40:40.333 That being said, that doesn’t happen that often. And I can honestly say that the support cases we’ve opened in the past where InfluxDB wasn’t performing like we had expected were issues where we basically shot ourselves in the foot: we had data that wasn’t properly structured, tags that maybe had more unique values than you would care to have. And actually, I’ll touch on that next. And I don’t think we’ve actually had an outage that was due to Influx having a defect. They were all things environmental that we just had to tweak and get under control. And I had mentioned having a tag with too many values. That’s what we call runaway series cardinality. So when you’re choosing the data that you want to insert into InfluxDB, you want to be very aware of what should be a tag and what should be a field. Tags are what you’re grouping by, so you don’t want to have too many values there. I can’t necessarily give you an exact number for what too many unique values is. It kind of correlates to what resources you have available on your cluster. But we’ve had instances where I looked at a measurement and there were close to a million unique tag values on it. Okay, we had to change that up.
Dustin Sorge: 00:42:07.495 So generally speaking, a field is what you’re measuring and a tag is what you’re grouping by, but that doesn’t mean that a field has to be a number. You can take things that you want to select and display and insert them as fields, and they won’t get indexed in memory. The idea being that tags are indexed in memory, so if you have cardinality issues, your memory usage can start to spike. To identify those suspect measurements, you could use Chronograf. That UI is very useful for quickly exploring your databases and retention policies to see where you may have a large number of tag values, because Chronograf will tell you, in parentheses next to your tag, how many unique values there are. So that’s very helpful. And like I said, tags are indexed in memory, so if you’re seeing a spike in memory on your data node, this may be correlated to a series cardinality issue, and that’s something that can be addressed. So within our installation of InfluxDB, these are some of the things that we may look to be doing next.
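Before moving on, here is a small sketch of that tag-versus-field choice, reusing the hypothetical api_latency_raw measurement from earlier: bounded, group-by context goes into tags, while unbounded values such as a per-request ID stay in fields so they are not indexed.

```python
from influxdb import InfluxDBClient  # 1.x client library: pip install influxdb

client = InfluxDBClient(host="influx.example.com", port=8086, database="sre")  # placeholder

client.write_points([{
    "measurement": "api_latency_raw",   # assumed measurement name
    "tags": {
        "host": "ctl-api-01",           # a handful of hosts: safe as an indexed tag
        "method": "reservation_add",    # small, fixed set of API methods: safe as a tag
    },
    "fields": {
        "latency_ms": 8423,             # the value being measured
        "request_id": "9f1c2d7a",       # millions of unique values: keep as a field, not a tag
    },
}])
```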
Dustin Sorge: 00:43:17.697 So TICKscripts have been incredibly effective for us. We use TICKscripts to downsample our data. We’ll write a large amount of data to a retention policy with a very short duration, and then aggregate that data and write it to another measurement that we keep for longer. But we should start looking at Flux. I know there’s a ton of development and work that’s going into Flux, and we should start to look to see how we can leverage that properly and enhance our workflows. So that’s something that we’ll certainly be looking to do. And another thing we may look at is some hardware modernization within our cluster. So we’ve been using InfluxDB for close to 5 years now. And when we set this up — our data nodes are virtual machines. They’re not bare metal. And the data and WAL volumes are backed by all-flash storage over [inaudible], so they’re not directly on the machine. So as our workloads increase, Influx has been hanging in there great for us. It’s been working fantastic. It’s been performing just like we would expect it to. But as we advertise more and more of these things within the company, more and more people say, “Oh, I want to utilize that for my project, and I want to start storing X, Y, and Z,” which is great, but we may hit a point where we need to modernize our hardware configuration. So right now, we have a four-by-eight license. So that’s four data nodes. Each node has eight CPUs, or eight cores. And as you increase the CPU count, you actually increase the number of concurrent compactions you can run in the background for your data. So we may want to look at having a more modern hardware configuration in the near future.
Dustin Sorge: 00:45:09.759 So to summarize, InfluxDB has been incredibly important for our implementation of SRE within the BAERO team at NetApp. We essentially built a house of SRE from the ground up, and the foundation for that home is the TICK stack. So it’s been a rock-solid foundation for us, and all of our SRE workflows are essentially driven through it. And it’s been really great, and I can’t wait to see how we can leverage it even further, because I know there are more things that we haven’t done yet that we can take advantage of within the TICK stack to really help us out. And from there, that’s all I have. I think I’ll open it up for questions.
Caitlin Croft: 00:45:59.875 Awesome. Thank you, Dustin. That was great. So there were a few questions in the chat. I think people actually already answered them, but what scripting language and libraries are these? Is it just TICKscripts right now?
Dustin Sorge: 00:46:17.617 Well, so we definitely do TICKscripting, but also, we use the InfluxDB client Python library for putting hooks into our automation to write directly to InfluxDB from our automation. And then, as I’d mentioned, we also have a Perl implementation of it internally that we have hooks for. And there are also shell scripts where we might make a [inaudible] call to InfluxDB directly and put some data in that way. But mainly the Python InfluxDB client. That’s the library we’re using.
Caitlin Croft: 00:46:52.294 Awesome. And I was glad to hear that you’re looking into Flux. It’s kind of amazing what it can do. All right. So are you just using Kapacitor for alerting?
Dustin Sorge: 00:47:06.760 So no. So we use it for downsampling as well. So we use it to downsample and for alerting. And for the alerting, we’re using both query and stream functionality. So we’re reaching into Influx and generating some values and alerting based on that. And we’re also watching data as it comes across the wire using the stream processing. So we’re using both pieces of Kapacitor for that.
Caitlin Croft: 00:47:37.372 Is that where you’re defining your alerts?
Dustin Sorge: 00:47:40.998 Yes. So I mean, the UI that you’re seeing is Chronograf, but those are running on Kapacitor.
Caitlin Croft: 00:47:46.730 Yeah. All right. Let’s see. Were you using a Go script?
Dustin Sorge: 00:47:54.544 A Go script?
Caitlin Croft: 00:47:55.575 Yeah. Someone’s asking if you were using — if there was a Go script. I apologize I —
Dustin Sorge: 00:48:00.614 A Go script?
Caitlin Croft: 00:48:01.574 Yeah.
Dustin Sorge: 00:48:02.341 So we don’t use — So Go exists within our environment, but our SRE team is largely Python, so we’re not using any hooks from Go for InfluxDB currently.
Caitlin Croft: 00:48:18.524 Awesome. Can you explain a bit more about the Jenkins job monitoring?
Dustin Sorge: 00:48:26.584 Sure. So there’s a plug-in for Jenkins that whenever a job completes, it will run as a post-build action that will write to a measurement in InfluxDB the status of the job, whether it completed, whether it succeeded, whether it failed, and then some other data about the job. So once you have that data in Influx, you can then start writing custom TICKscripts to alert to when jobs are failing. And then we take that alert, and then we publish that to Slack. So we know right away if there’s a job that we’re depending on completing and succeeding. We know that it failed right away.
Caitlin Croft: 00:49:09.627 Cool. Someone else is saying, “We are also using Telegraf, InfluxDB, and Grafana. Could you give me some tips and tricks for effective monitoring in Grafana if we have a multi-region infrastructure?”
Dustin Sorge: 00:49:28.841 So that’s a really good question. Yeah. Because our sites are in Bangalore, India, and in the US, across multiple time zones. You could standardize on GMT for your timestamps. Right now, we’re kind of relying on local browser settings to do sort of the time adjustment through Grafana. Or you can maybe depend on epoch seconds. We use epoch seconds in our measurements when we’re trying to standardize on some of those times.
Caitlin Croft: 00:50:06.570 Perfect. I think we’ve answered everyone’s questions. Please feel free to post any more in the Q&A or the chat if you have them for Dustin. We’ll keep the lines open here just for another minute or two. Dustin, I thought it was really great that you showed those dashboards that, back when we were all in offices, you guys would kind of crowd around and troubleshoot, and how things have changed in the last couple of years. So do you guys just hop on Zooms now and stare at those Grafana dashboards together, or how does that work?
Dustin Sorge: 00:50:38.431 No. No. No. So I mean, with our team, we know the dashboards that SRE should be looking at, and we don’t keep them secret. We advertise them. If users are asking about the status of our services, we say, “Oh, hey, why don’t you go check out this dashboard in Grafana?” We keep that link handy, and it’s being updated consistently. But yeah, it’s not quite as nice as it was when you’re in the office and you just kind of — you go get a cup of coffee and you look up and you see the numbers, and you can go, “Oh, things are looking good,” and you kind of walk past. But you do the best you can in a remote environment.
Caitlin Croft: 00:51:15.432 Well, and I’m sure with the follow the sun model, which has been around forever, I mean, long before COVID, you guys having to coordinate with teams in India or everywhere else, you guys were already probably somewhat adept at figuring that out remotely.
Dustin Sorge: 00:51:29.827 Yeah. Absolutely.
Caitlin Croft: 00:51:33.614 Awesome. Well, thank you, everyone. Oh, okay. Adding the incident and issue overlay in Grafana is a really cool feature. Are these stored in InfluxDB?
Dustin Sorge: 00:51:47.284 So we store the outages separately. So they’re actually two separate API calls. One’s to Influx and one is using the Grafana API. So the annotations are actually stored in Grafana locally.
Caitlin Croft: 00:52:02.663 Perfect. Thank you, everyone, for joining today’s webinar. Thank you, Dustin, for a fantastic presentation. Once again, everyone, this has been recorded and will be made available later today or tomorrow morning, and the slides will be made available with it as well. So thank you everyone for joining. And great job, Dustin.
Dustin Sorge: 00:52:26.082 Thank you so much. I appreciate it. Appreciate the opportunity.
Caitlin Croft: 00:52:29.080 Thank you. Bye.
Dustin Sorge: 00:52:31.030 Bye.
Dustin Sorge
Lead Site Reliability Engineer, NetApp
Dustin currently resides in Pittsburgh, Pennsylvania and is the Site Reliability Engineering Technical Lead for NetApp's ONTAP Engineering organization. His team has been using InfluxDB for 4+ years and continues to leverage it for the support of critical services. He is a proud alumnus of both the University of Pittsburgh and Carnegie Mellon University. Prior to joining NetApp, he was a High Performance Computing Operations Engineer and Software Engineer for the Pittsburgh Supercomputing Center.