How to Streamline Incident Response with InfluxDB, PagerDuty and Rundeck
Session date: Apr 20, 2021 08:00am (Pacific Time)
Mean Time to Resolution (MTTR) is a foundational KPI for most organizations. DevOps and SRE teams are under intense pressure to reduce MTTR when resolving incidents. Often parts of incident response processes are manual, bringing together alerts, runbooks, ad-hoc scripts, and people to form a response.
In this webinar, we will show you how to improve resolution time by configuring InfluxDB notification endpoints to PagerDuty and triggering auto-remediations with Rundeck. Using Rundeck’s automated runbooks, customers have experienced up to 50% reduction in incident response time, greatly improving team productivity and reducing unnecessary outage time.
Join Craig Hobbs - Sr. Solutions Consultant at PagerDuty - as he dives into:
- DevOps best practices for using: a time series database, an incident management platform, and a runbook automation solution
- PagerDuty's approach to reducing time to resolution for critical incidents
- How to go from alert to remediation in 10 seconds!
This is your opportunity to streamline incident response for your organization. Register to watch.
Watch the Webinar
Watch the webinar “How to Streamline Incident Response with InfluxDB, PagerDuty and Rundeck” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]
Here is an unedited transcript of the webinar “How to Streamline Incident Response with InfluxDB, PagerDuty and Rundeck”. This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Caitlin Croft: Customer Marketing Manager, InfluxData
- Craig Hobbs: Sr. Solutions Consultant, PagerDuty
Caitlin Croft: 00:00:00.536 Welcome everyone, to today’s webinar. I’m very excited to have Craig Hobbs here from PagerDuty. My name is Caitlin Croft. I work here at InfluxData. And I’m really excited to have Craig here to talk about how to streamline your incident response using InfluxDB, PagerDuty, and Rundeck. Once again, this session is being recorded and will be made available later today. And please feel free to post any questions you may have for Craig in the Q&A or the chat. I will be monitoring both throughout the webinar. Without further ado, I’m going to hand things off to Craig.
Craig Hobbs: 00:00:44.447 Thank you, Caitlin. I really appreciate the intro. And welcome everyone. As Caitlin mentioned, my name is Craig Hobbs. And we’re going to be talking today in this webinar about the incident response pipeline and options for shortening the overall incident response. As Caitlin mentioned, I’m a Solutions Consultant here at PagerDuty. I’m also an SME of the Rundeck platform. And as my profile says here, I’ve certainly spent a number of years in infrastructure monitoring, performance, enterprise security, and other automation at companies at various stages before landing at PagerDuty, but really all with the same focus. And that focus is to help create solutions that help users and engineers get stuff done. So always, always looking at how to simplify, ease, and use best-of-breed tools to get things accomplished. I’m a huge open source fan. So you’ll see a lot of my thoughts and ideas and solutions leverage community projects, but really any combination of software and platform that makes it easier to get things done is always the ideal state.
Craig Hobbs: 00:01:57.003 So looking forward to diving in. And as we dive in here, let’s talk a little bit about what’s on our agenda today. So we’ll be starting off by touching on MTTR and what that impact really means for DevOps teams and SRE teams alike. So having a unique position here at PagerDuty of engaging with lots of teams, it’d be great to just get us grounded on what teams tend to see that as. But really, where I want to spend the time is on how users can go about shortening the incident response process and their incident response times. So really looking at where those savings come in and what can be done with the solutions that are available. And in that, we’ll take a look at a solution overview and workflow and approach to that. And then we’ll dive into a demonstration of the combination of Influx, PagerDuty, and Rundeck. And then we’ll leave time for questions at the end. So I’m really hoping you’ll get a lot out of seeing the solution in the demo and just kind of engage in some tips and tricks or some other commonalities that you’ve seen in your MTTR work.
Craig Hobbs: 00:03:07.001 So let’s go ahead and get started now. So starting off with great information that you get from the various pundits related to automation and these types of workforce solutions. So Gartner predictions and guidance for 2021, when I first saw this prediction of automation and its impact on cost of operations, it comes as no surprise to DevOps engineers and SREs. We understand that automation is what’s driving the scale and the cost savings that we’re getting from our overall solution pipelines. But what I really found interesting, and it’s kind of buried at the end, is the idea that this will come with redesign of the operational process. So while we all know automation is going to be key, it’s how are those processes going to be redesigned to take advantage of that and ultimately leverage that automation in a way that produces those types of cost savings and operational savings. So looking at that, I kind of turn to MTTR and when I get to engage with DevOps teams and SRE teams, we kind of talk about what they’re looking for at MTTR. There’s lots of definitions for MTTR - lots of standard definitions that the market has applied, but I’ve come to find that each team, each organization, they’ll leverage it a little differently in how they see MTTR and what it means to them. But what I find is that at a minimum, we all kind of take a look at it as that average time to resolve a production failure - right, to fully resolve a production failure. There certainly are a couple of different slices and looks at that. Some will look at the mean time to repair, some will look at the full time it takes to resolve it so that it’ll never occur again, but no matter which definition I’ve seen come from all the different teams I’m engaging with, it always kind of culminates with that idea that the one constant is that you’ve got to keep it down, right? You don’t want your MTTR growing. 
That’s a key indicator in the world of DevOps and SREs that you’re spending your time in a way that’s completely inefficient. And certainly, as your services grow, as your environment expands, you know that number is only going to go up. And it starts to create work that is toil, that is effort, that takes those valuable resources away from the real-value add work that you want them doing related to your service.
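As a rough illustration of the "average time to resolve a production failure" reading of MTTR described above, here is a minimal Python sketch. It is not from the webinar: the function name and the two sample incidents are hypothetical, and a team using a different definition (for example, mean time to repair) would measure different start and end points.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average of (resolved - opened) over a list of (opened, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (time opened, time fully resolved).
incidents = [
    (datetime(2021, 4, 20, 8, 0), datetime(2021, 4, 20, 8, 30)),    # 30 minutes
    (datetime(2021, 4, 20, 13, 0), datetime(2021, 4, 20, 14, 30)),  # 90 minutes
]

mttr = mean_time_to_resolve(incidents)
print(f"MTTR: {mttr}")
```

Tracking this number per service over time is what shows whether it is growing as the environment expands.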
Craig Hobbs: 00:05:47.719 So the impact then of MTTR, and that’s really where I wanted to kind of focus as opposed to kind of looking at the different ways people are identifying it. Really, what’s the impact? And the impact is that it changes the practices of DevOps and SRE teams. So what they now focus on, what they emphasize as part of addressing MTTR. So everything from lowering the impact - and that’s really kind of focused on how their customers are impacted by these types of incidents or issues or outages, how can they lower that impact. Saving time, and that really comes into escalation. You have L1 engineers who are generally focused on engaging with these types of requests or issues, but if you have to now escalate that to different L2s, L3s, that starts to increase that time for acknowledging, for responding, and ultimately getting back to that value-add work. Proactive monitoring, so there’s always that worry of noise, right? There’s so much coming in. So the teams now are out and are proactively monitoring to get ahead of it to see if they can identify issues before they happen instead of getting behind them, but there’s so much information, there’s so much data pouring in that now you have to kind of sift through it intelligently and know what to react to and what not to. And then kind of finally, the overall improvement of the service quality through quick acknowledgment and resolution of incidents. How can I keep my service quality up and high knowing that I have this large amount of incidents coming in, knowing that I’ve got to save time, and knowing that we’ve got to lower all of this down.
Craig Hobbs: 00:07:20.281 So quite a bit that changes on the practices of these teams. So what can you do, right? So, really, what’s the idea? What are we driving at? And there’s a lot of different tools that offer options to address kind of that end-to-end incident management workflow. But what I’m really always looking at is the idea of how we can do it easily for the teams and users alike while offering best of breed in both the solution and best-fit tools for each stage of the incident workflow. So I’m always looking out for those ideas when I start asking myself, “What can I do?” And with that, I started looking at it at the incident workflow as stages. So you have that initial stage, again, we talked about that active monitoring. So you have that initial stage where you’re identifying the issues. And if I can quickly identify those issues in a real-time advantage, I can get to that - I can get to that information - get to that information across my infrastructure and now act on what I’m seeing. If I can also limit the noise, because I don’t want to just be acting on every little signal that jumps up, that would also be critical as well as this is part of my identification process. Then I need to mobilize. Now that I’m acting, I need to also mobilize. And that is that idea of incident management, right? So I see incident management as that area where now I need to have that nerve center of bringing together all the different aggregates of the alerts and the requests that are coming in, all the escalations, all the stakeholders that I need to notify, all the data analytics that now goes into examining your MTTR, your postmortem analytics, how have you reacted to these incidents. There’s always an incident management layer that mobilizes and brings all of that together. But then finally, there’s that layer that handles the resolve. 
So this is that idea now of, “How can I, in some automated fashion, reliably, quickly, and securely resolve the issues that are occurring and get back to a production state?” So now I want to get that resolved, get back to production state, but also continuing back up the stack. I need to notify incident management so that they can notify stakeholders, notify escalation procedures, have those postmortem conversations, and roll it back into my monitoring. So now that my monitoring can see this and know that the alert is cleared, everything is fine and I can continue on. No need to pull in any additional resources to follow up with this. So a lot of the stack, it moves down and it moves back up. So knowing where you are on the stage is really critical.
Craig Hobbs: 00:10:05.540 So taking a look at all of those things, I came to the idea that I wanted to combine the best of breed and best of fit at each one of those stages. So for my first stage, InfluxDB. So really looking at InfluxDB at that stage of the monitoring and alerting. InfluxDB Cloud time series platform, the real-time monitoring dashboards, and analytics for processing time series metrics is that natural best of breed, best of tool fit for gathering all of the - gathering all of my monitoring metrics, identifying those issues, being able to identify and roll through the various noise that’s in your metrics with the real time processing engine that’s available in the InfluxDB Cloud. Truly that tool that’s purpose built for just the massive amounts of data and sources that flow through your infrastructure with timestamps. So InfluxDB really brings that critical element to the idea of these stages. Also, it includes lots of developer components for a solution like this. And that was another thing that was very critical in putting together a solution overview. So InfluxDB also offers InfluxDB Templates which are preconfigured solution packages that I can now roll out to address all of my monitoring different areas of the solution. And we’ll see that in the demonstration and workflow how powerful the InfluxDB Templates can be in getting up running in a single click very quickly. I’m also leveraging Telegraf. So Telegraf is the InfluxDB collection agent. It’s a pluggable architecture that comes with several different plugins covering many different infrastructure pieces from monitoring your Linux to your Windows to your network gear to your cloud. So all the different areas where your metrics and events will live, Telegraf has a plugin that makes it easy to pull that information into InfluxDB. And then finally, we’ll take a look at the alerting pipeline that’s available within InfluxDB Cloud. 
So I found that to be a very, very key element for querying the metrics and ultimately dispatching the alerts.
Craig Hobbs: 00:12:25.850 So critical to being able to pull through that noise, crunch through that time series data, and now dispatch the alerts to the necessary incident management stage, which then brings me to PagerDuty. So for PagerDuty incident management, again, one of those best-of-breed tools for [inaudible] business processing and addressing those types of outages, disruptions, and incidents from concept to full resolution. So PagerDuty allows you to really aggregate and correlate all the different signals and triggers that would come from your monitoring, that would come from emails, that would come from any different source. And it’s now acting as that nerve center, if you will, for organizing the right users, dispatching the alerts, giving you other key reports and information related to your incident so that you can better fine-tune those processes and grow with your service. So we’ll look at some of the key pieces of PagerDuty, but now PagerDuty can address dispatching all that information to either the right engineers or the right automation.
Craig Hobbs: 00:13:34.195 And that now brings us to the third piece of our stack and that is Rundeck. So Rundeck is Runbook Automation that provides secure and auditable and reliable automated workflows. So that’s a mouthful. So secure, auditable, reliable, automated workflows. So Rundeck provides the orchestration layer for users and administrators alike to deploy automation across their entire toolchain. So when I think about DevOps teams, there’s the toolchain that you’re using of different scripts, different commands, different programs. Again, the automation that we talked about at the very beginning with Gartner is already being used by these teams, but now, there’s so many different options that teams are using. How can I orchestrate them all together? Rundeck allows you to orchestrate any piece of your toolchain together for this type of Runbook Automation and doesn’t require you to change any of your scripting, your language, or your workflow approach to how you do this. It just brings it together into this centralized platform so now it can be secure and reliable. Rundeck also provides a self-service layer. So in addition to automating end to end, you can now give self-service access to different users and then different organizations, different silos to run these types of jobs and tasks that you traditionally only had your SREs running or your DevOps team running due to requirements around access to sensitive environments, sensitive networks. Rundeck can do all this in a very secure manner.
Craig Hobbs: 00:15:18.032 So those are my three - those are my three best-of-breed, best-fit tools for approaching this. And I wanted to really now kind of take a look at what that solution would look like end to end. So starting left to right with the InfluxDB as, again, my real-time monitoring across my infrastructure. So they’re at the bottom of my graphic here. I have my infrastructure leveraging InfluxDB Telegraf. Again, that component that allowed me to collect metrics and events across my infrastructure, whether I’m in Kubernetes, whether I’m collecting from targeted databases, microservices, my network gear, again, Influx, has a plugin for that in Telegraf. And because it is, again, designed around developers, it’s a tool for developers, if there’s something there that you want to tweak, update unique to your environment, very easy to do with toolsets like InfluxDB. So that data acquisition, that time series data, being able to get it, so critical to have a platform that’s developer friendly. Then in the middle, once Influx has now gathered all of that information, taken all the huge volume of metrics, it can now intelligently dispatch the various triggers and alerts to be acted upon. And that’s now where PagerDuty comes in in the middle there so my incident management. Again, my nerve center, it’s going to be getting all of those different dispatches from Influx, from any other sources you may have. And again, the centralized area where everybody can now go to get all of the background metrics, the various identifiers, again, you won’t miss a beat by having that type of centralized incident management platform there in the middle. And then finally, on the back end, PagerDuty will dispatch various requests and services to Rundeck. So Rundeck will be now my Runbook Automation. 
So in this particular use case, if InfluxDB has identified an issue, sent a trigger over to PagerDuty, PagerDuty will pick the right Runbook Automation, dispatch it to Rundeck to securely execute. Rundeck will execute that in a reliable manner, return back all the information it’s collected to PagerDuty, but also, in this case, I can also return the information back to InfluxDB as well. So along the bottom there, I can actually redirect information back to InfluxDB as well. And what makes that possible again is the idea that InfluxDB is so flexible for the developers to ingest information. And other times, they have information from so many different sources, it was really easy to now just bring information back into InfluxDB. And now this really sets up the idea, as a DevOps engineer, I can function in a single pane of glass which is InfluxDB to see my alerts get dispatched and see them ultimately get resolved without actually leaving the InfluxDB interface. But I still have the benefit of the incident management nerve center where, again, other stakeholders will be able to get a view of what’s going on while Rundeck acts as that virtual DevOps engineer for me. So handling the automation directly.
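To make the idea of sending job results back into InfluxDB more concrete, here is a minimal Python sketch that formats one result as InfluxDB line protocol (`measurement,tags fields timestamp`). It is not the integration used in the webinar: the measurement name `rundeck_job`, the tag and field names, and the helper function are all hypothetical, and the sketch only builds the text line rather than writing it anywhere.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as an InfluxDB line protocol string."""

    def esc(text, chars):
        # Special characters in names and tag values are backslash-escaped.
        text = str(text)
        for ch in chars:
            text = text.replace(ch, "\\" + ch)
        return text

    def fmt(value):
        # bool must be tested before int because bool is a subclass of int.
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"          # integers carry an "i" suffix
        if isinstance(value, float):
            return repr(value)
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'           # strings are double-quoted

    tag_part = "".join(
        f",{esc(k, ', =')}={esc(v, ', =')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{esc(k, ', =')}={fmt(v)}" for k, v in fields.items())
    return f"{esc(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical result of one auto-remediation job.
line = to_line_protocol(
    "rundeck_job",
    {"job": "clear disk space", "status": "succeeded"},
    {"duration_s": 4.2, "exit_code": 0},
    1618930800000000000,
)
print(line)
```

A line like this, written to a bucket, is the kind of data a Rundeck dashboard inside InfluxDB could be built on.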
Craig Hobbs: 00:18:29.827 So with that, I want to get ready and kind of dive into a demonstration and showcase what this would look like if it’s moving and connected together. And I’m going to just kind of start with that idea of that virtual DevOps engineer. So that’s what I’m always looking to help create. And it’s that idea now that on one side, I have customer care - I have my customer care team, I have my service request team, I have any issues or requests that are coming that are then moved to that other side of the wall there to my DevOps team. While my team is really busy working on other value-add pieces, they’re already on call for so many different teams. They’re on call for engineering, they’re on call for product, they’re on call for development, now, they’re on call for the customer care team. So something comes in, there’s a problem with the service, the disk space is being filled up, now, whatever my team is doing, they have to drop that, drop what they’re doing, quickly acknowledge the fact that they’ve received the ticket because this is impacting customers. And now, go into a process loop of figuring out what’s going in, logging into these servers with the proper credentials that are secure, doing their review, finding that issue, resolving it, confirming that it’s resolved, closing out the incident and all the tickets, and then renotifying the monitoring engineer that, “Yep, everything is fine. You should see that resolved.” So while this comes in many different forms, this is always the common use case I hear from those DevOps teams or the SRE teams that something’s come over and now we have to acknowledge - we have to acknowledge, resolve, securely confirm, close all the necessary tickets, and then get back to work. So if you get two or three of those a day, you really start to rob that team of all of its productivity.
Craig Hobbs: 00:20:17.522 So let’s look at how the solution can really help with that type of workflow and use case. So what I want to do now is go to a demo of what we have set up. So I’m going to go ahead and bring us into our demo setup here. And what I’d like us to then take a look at is where we - so, hopefully, everybody should be able to see my screen right now and you should be able to see my InfluxDB Cloud screen as I’ve done my switch over. If you’re not seeing the InfluxDB Cloud screen moderator, please let me know, but hopefully, everybody is seeing my InfluxDB Cloud screen. So I’ve actually leveraged the InfluxDB Cloud. It comes with a free tier. So it’s easy to get up and started with a free tier. Then it also offers pay as you go if you want to continue to go into a higher volume. So we’re really, truly an elastic service that’s cost effective from the start of free and then if you’re actually growing and if you’re talking about large enterprise. So here I am in my InfluxDB Cloud. It comes with lots of options, again, for developers to get started building and going through your monitoring. I will actually go through a lot of those options, but as a starting point, I just want to look at what’s already - I’ve already loaded up some pieces to demonstrate the idea of this incident workflow. So along the side here, I’m going to take a look at a few of the options that I’ve already placed inside of my InfluxDB Cloud. InfluxDB Cloud allows me to load data, several, from client libraries that I can take advantage of. Again, I talked a little bit Telegraf. So many different plugins where I can bring data in from different sources. I can see Azure here, I can see different databases, I can see different Cisco gear, so many different places I can bring sources of data in. That data then comes into what’s known as a bucket. And now from that bucket, I can operate on that data. So right now I have a couple of buckets you can see here. 
There are two system buckets that come with the generic installation, but I’ve actually added two other buckets here. I’ve added a bucket here for Rundeck. So this bucket for Rundeck is actually where I’m collecting all of my Rundeck data related to anything that I’m executing as far as the auto-remediation. And then I have a bucket for Telegraf where I’m actually collecting all of my performance and metrics data for the system I’m actively monitoring in real time. It also provides Telegraf configurations for collecting that data. So I have multiple Telegraf configurations, one for collecting my Rundeck data. And for this example, I’m collecting Linux system data. So if you’re not familiar with some of the pieces of Telegraf, it comes with complete setup instructions for you to be able to get up and running with Telegraf. InfluxDB provides lots of background and information on how to push Telegraf at scale to your entire infrastructure. So if you’re unfamiliar, I completely advise you to go check that out, but really easy to get up and running in minutes.
Craig Hobbs: 00:23:31.389 I also have a few dashboards that I’ve loaded up. So I have a Linux system dashboard for monitoring. So if I click on that and come in, I’m actually monitoring my Linux system right now. I could have multiple Linux systems. Again, InfluxDB works at scale with all of the different volumes across your infrastructure. I just have a simple system here and I’m monitoring everything from my disk-space usage to my CPU usage to my load. These are completely configurable. So what you see here, if you wanted to add more based on what you as a DevOps team or SRE team look to monitor, this is completely configurable. So this starts off as your base. And if there’s anything more that you would add that are specific to your environment, you can quickly just kind of jump in and make updates. Additional dashboards I have, I’ve added one for Rundeck. And if I actually go into my Rundeck dashboard, what I’m actually seeing here - and this is one I’ve created based on my Rundeck data. What I’m actually seeing as part of this dashboard is my job auto-remediation task. So I’m seeing every job that I’m executing. So if I’m executing jobs related to databases, I can see I’m shrinking database log files, I’m querying databases, I have jobs related to Kubernetes, I have jobs related to Windows. Wherever my jobs are, I can kind of see what’s going on with them. I can see active running jobs. So imagine, if I triggered something that is actively running and it’s my job to make sure that it completes successfully, I can see the jobs that are actively running. So I’ve got jobs that are actively running, how many succeeds I’ve had. If I’ve had a failure, maybe that might now trigger another event for escalation. But at least from a DevOps standpoint, I know I’ve done everything I can do in an immediate standpoint to get it done. 
But for Rundeck now, with this data now, as I’m pulling it from Rundeck, it gives me a single pane that I can look at and see what’s going on with all my job executions.
Craig Hobbs: 00:25:24.012 Then finally, I have a section here for alerting, the alerting pipeline. So this was something I mentioned earlier in the setup, that there is an alert pipeline available in InfluxDB Cloud that allows me to configure a set of checks and notification rules. Checks in the alerting pipeline are queries that I run, at predefined periods, against the time series data Influx collects. That check can then set a particular service level. So that might be critical, that might be a warning, that might be info. My notification rule then checks against those conditions. So once it sees those levels, critical or warning, whatever I’ve configured, it will then dispatch based on the conditions to one of the notification endpoints. Influx comes with several notification endpoints. If I want to create a notification endpoint, I can see I have a notification endpoint here for PagerDuty, and there are others for Slack and HTTP. Again, a developer platform, so anything that you don’t see is easily plugged in. There are also many other destinations that Influx has available that you can load as part of your setup.
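The split between checks and notification rules described above can be sketched in a few lines of Python. This is a simplified model, not InfluxDB's implementation: the level names, the 20% threshold, and the sample readings are illustrative assumptions. The check maps each reading to a level, and the rule fires only when the level moves from one state to another.

```python
def check_level(value, thresholds):
    """Return the most severe level whose threshold the value exceeds, else 'ok'."""
    for level in ("crit", "warn", "info"):
        limit = thresholds.get(level)
        if limit is not None and value > limit:
            return level
    return "ok"


def transitions(levels, from_level="ok", to_level="crit"):
    """Indexes where the status changes from from_level to to_level."""
    return [
        i for i in range(1, len(levels))
        if levels[i - 1] == from_level and levels[i] == to_level
    ]


# Hypothetical disk-usage readings (percent), one per check interval.
readings = [19.0, 19.4, 23.1, 24.0, 18.5]
levels = [check_level(v, {"crit": 20.0}) for v in readings]
fire_at = transitions(levels)

print(levels)   # status assigned by the check at each interval
print(fire_at)  # intervals at which the notification rule would dispatch
```

Firing on the transition rather than on every critical reading is what keeps a sustained problem from producing a stream of duplicate alerts.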
Craig Hobbs: 00:26:40.201 What I want to do is now kind of get started with that workflow. So, again, playing that role as the DevOps engineer and I’m doing my monitoring. If I come back to my Linux system really quick, I can see here that I’m monitoring for, again, disk space, CPU usage, all the things that I’m interested in, and I can see that my disk space right now is about at 19%, just a little over, and I feel comfortable with that. I really don’t want it to get any higher. So if it starts to get any higher, I want to take action so that I don’t find any service interruptions, but I don’t want to be paged in the middle of the night. I don’t want you to wake me up just because the space is growing - I want to find a way to put that into that incident pipeline where I can automate the maintenance of that but still keep me as the DevOps engineer in the know so that I know what’s going on and what’s transpiring at every step. So what I’m going to do is go into my alerts here. And I’ve already got an alert preconfigured, but if I actually click on my inbox alerts here, I can see that this, again, is my check. So this is the query that runs and determines what are the right levels and values and thresholds for my setup. And right now I have it set at 20. So any time this check goes above 20, it will set my status for this check at critical. So right now I’m not above 20, but the second it jumps above 20, I will get that critical. Also, those statuses I can configure. I can configure one for warning, I could configure one for info. So several of those statuses you can determine, but right now I’m just focused on my critical status. And 20 is pretty low. In the DevOps world, I’d probably never set it for 20 but yeah, I want to keep it clean and I don’t want it to get above 20%. So I’m going to leave it at 20. 
I can also change the schedule interval of how often this check happens. So 15 seconds is probably pretty quick. You might want to set it for longer periods of time, but for the webinar, I’m thinking 15 seconds is more than enough.
Craig Hobbs: 00:28:48.028 And then finally, I have my message here so I can actually update my message, I can customize this in any different way that I want to see as it gets passed along to the different areas of the solution workflow. So I’m going to call this my Influx trigger so that it just gets passed along. You see I can also pass along other variable data that’s available as part of the check message and send that along. So that is absolutely fine. I’m going to go ahead and save that. We save that. Actually, I’m going to give it a different name as well. So “I’m checking disk space.” Let’s really check disk space. So we’re going to check our disk space, and I’m going to go ahead and save that guy. So now I have my check. It’s checking disk space. I know it’s going to pass to this. If I go into my notification rule, I’ve already got a notification rule preconfigured for this setup. My notification rule will check every minute for the change in that status that we just talked about, that critical status. And the condition it’s checking for is any time my status goes from okay to critical, it is then going to dispatch this trigger to whatever notification endpoint I’ve decided. I’ve only got one notification endpoint right now configured. Again, you see these columns. You could actually have multiple different checks, notifications, and endpoints configured depending on where you want to dispatch your alerts. For this setup, I’ve just got one so I’m going to dispatch it to PagerDuty. And then on the back end, PagerDuty is going to pick that up and send that along. Just before I dispatch this alert, two other pieces of our solutions stack. Up here at the top, I’ve also got PagerDuty. So I’ve switched to my PagerDuty window. So, hopefully, everyone is seeing my PagerDuty window. And if you’re familiar with PagerDuty at all, PagerDuty groups all of the different alerts, aggregates them into services. 
So now for each alert, I can send it to a different service, and that service understands what to do with that alert from the idea of where to escalate it, so what engineer to page or wake up, what other organization or mobilization pieces that need to be done. And if I need to create a Slack channel, if I need to open up a webinar, if I need to notify other users, I just need to keep a status there so that other affected parties see what’s going on. All of that is now organized here in my PagerDuty service. So I have a service for my disk space.
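For readers curious what a trigger sent to a PagerDuty service looks like on the wire, here is a minimal Python sketch that builds a trigger event in the shape of PagerDuty's Events API v2. It is an illustration only: the InfluxDB notification endpoint shown in the demo handles this for you, and the routing key, summary text, host name, and dedup key below are placeholders, not values from the webinar. The sketch builds the JSON body and does not send it.

```python
import json

# Events API v2 ingestion URL; the request itself is intentionally not made here.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical", dedup_key=None):
    """Build the JSON-serializable body of a 'trigger' event."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,   # integration key of the target service
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key:
        # Events sharing a dedup key are grouped into the same alert.
        event["dedup_key"] = dedup_key
    return event


event = build_trigger_event(
    "YOUR_INTEGRATION_KEY",            # placeholder
    "Check Disk Space is critical",    # placeholder summary
    "linux-host-01",                   # placeholder source
    dedup_key="check-disk-space/linux-host-01",
)
body = json.dumps(event)
print(body)
```

The routing key is what ties the event to one service, which is how each alert ends up with the escalation and mobilization steps configured for that service.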
Craig Hobbs: 00:31:23.145 And then finally, I have Rundeck. So this is my Rundeck UI. Again, from the Rundeck perspective, I can see here I have different groups. I have a group for databases, I have Kubernetes, Linux. These are all my different runbooks that I’ve configured for addressing issues throughout my mini infrastructure here. So if something happens, Influx can trigger it, pass it to PagerDuty. PagerDuty knows who to organize, who to mobilize, and then if it needs to, call any one of these runbooks. The runbook is complete with handling everything from if there’s an error or if there’s something it didn’t notice, what to do next, what actions to take, whether that’s other jobs or other workflow or it may be actually a need to notify or wake up an engineer. All of that is configured here in my Rundeck setup. And in addition to that, below that, I can see I have activity. So Rundeck maintains a complete activity trace history of every automation that’s been done. So I know who’s run it, how long it took, what time it started, when it ended as well as all of the logging information that’s associated with that job.
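The transcript does not say how PagerDuty invokes the runbook. One way an external system can start a Rundeck job is the Rundeck HTTP API, which accepts a POST to the job's `run` path with an API token header. The Python sketch below only builds such a request and does not send it. The base URL, API version, job ID, and option names are hypothetical placeholders, and the integration used in the demo may work differently.

```python
import json
import urllib.request


def build_job_run_request(base_url: str, api_version: int, job_id: str,
                          token: str, options: dict) -> urllib.request.Request:
    """Build (but do not send) a request that asks Rundeck to run a job."""
    url = f"{base_url.rstrip('/')}/api/{api_version}/job/{job_id}/run"
    body = json.dumps({"options": options}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Rundeck-Auth-Token": token,  # Rundeck API token header
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_job_run_request(
        base_url="https://rundeck.example.com",  # hypothetical server
        api_version=38,                          # assumed; use your server's version
        job_id="disk-space-remediation-job-id",  # hypothetical job ID
        token="YOUR_API_TOKEN",                  # placeholder
        options={"host": "linux-demo-01"},       # hypothetical job option
    )
    print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would submit it. That step is omitted so the example runs without a Rundeck server.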
Craig Hobbs: 00:32:33.299 Switching back to InfluxDB because as a DevOps engineer, I don’t want to have to worry about what’s taking place in PagerDuty and Rundeck. They’re doing their jobs as part of the solution workflow. I just want to be focused on what I have to do here and Influx gives me that single pane of glass. So if I come to my alert history now, I can see that my checks have been running. So I can look down this list and see over and over that I’ve got checks running, it’s checking my disk space. So what I’m going to do is go ahead and create a circumstance where I’m able to actually go ahead and go beyond that 20% disk space usage. So I’m actually going to create the case where I’m exceeding my disk space and triggering that alarm or alert. So I’ve gone ahead and actually created that incident. So, again, playing the role of that DevOps engineer, I’m still sitting here. If I were watching my dashboards and some form of large center or NOC center, I can see here there’s my disk usage. All of a sudden, I see this huge spike in my disk usage. I went from 19, I’m already, according to this, over 20. I’m at 23 and rising. I can see my CPU usage is growing here and I’m starting to lose idle CPUs. So definitely something is taking place. If I go to my alert history, I can see really quick the check disk space has reached critical warning and again, that interval that’s happening. So every 15 minutes it’s firing off that interval and setting it to critical. If I check my notifications tab here and come to notifications, I can see that the notification of check disk space has already sent off the rule to resolve. So it’s literally just sent that rule off to resolve only a minute ago. So while I’m the DevOps engineer, I can still see, just from the Influx setup, all the things that are taking place. If I come back to my dashboards now and go over to my Rundeck activity, I can see from my Rundeck activity I have current running jobs, I have two running jobs. 
If I come over to active running jobs, I can see I have one Windows. I know what that one’s doing. I have a new one now that’s actually doing disk space remediation. So without leaving, without going into any other interface, any other setup, I can see that there’s action being taken. There’s automated action being taken to resolve that.
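The check in the demo goes critical once disk usage passes 20%. A rough Python equivalent of that threshold logic is sketched below. In the demo the measurement comes from Telegraf and the threshold lives in an InfluxDB check, so the function names, the default path, and the way the percentage is computed here are illustrative assumptions.

```python
import shutil


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Disk usage as a percentage of total capacity."""
    return 100.0 * used_bytes / total_bytes


def classify(pct: float, crit_threshold: float = 20.0) -> str:
    """Map a usage percentage to a check level: 'crit' above the threshold, else 'ok'."""
    return "crit" if pct > crit_threshold else "ok"


def check_disk(path: str = "/", crit_threshold: float = 20.0) -> tuple:
    """Measure usage for the filesystem containing `path` and classify it."""
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    return pct, classify(pct, crit_threshold)


if __name__ == "__main__":
    pct, level = check_disk("/")
    print(f"disk used: {pct:.1f}% -> {level}")
```

With the demo's numbers, 19% classifies as `ok` and 23% as `crit`.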
Craig Hobbs: 00:35:02.228 And again, this now is that moment of shortening that incident. So instead of having to throw this over the fence - throw this over the wall to one of my DevOps or SREs, on the support engineer - on that L1 engineer monitoring, I can see, “Okay, this is taking place.” I can just keep my eye on it. I’m pretty confident that it will all go to resolution because I have a complete history of all the executions here and they’ve all been successful. I have a small execution status here. But again, I can continue to watch and monitor all from the Influx platform. If I were to - if I were actually curious about what’s going on, there’s a couple of different areas I could look at. One area is with PagerDuty. So as I click back into my PagerDuty interface, if I actually wanted to see what’s actually being executed here, I can see if I can click on a specific incident. So once this was triggered in Influx, the incident got passed over to PagerDuty and it created my InfluxDB trigger and it let me know what’s happened and that it reached critical - this is the PagerDuty timeline. So this lets me know everything that’s occurred since this incident was raised. So from the moment the incident was triggered by InfluxDB, what’s actually occurred is that it was acknowledged instantly. So, again, reducing that time to acknowledge. It’s been acknowledged by Rundeck. So that’s a pseudo user. So I haven’t actually alerted any engineer, I haven’t gotten anybody out of bed, it hasn’t been escalated. We’re going to let Rundeck Automation take care of it first. So instantly, Rundeck was able to acknowledge it within the minute. And then now, that runbook is starting to add in the various checks and confirmations that it’s doing. So while it’s running through its confirmations, it’s adding that back to the PagerDuty timeline. So I was worried about CPU so I had it add in CPU.
Craig Hobbs: 00:36:53.008 So again, as a stakeholder or another engineer and I want to - after I’ve now gotten up this morning, I want to see what happened, “Ah, at this very time, this is where the CPU was.” I can see what the disk space was. So I was worried about disk space. Obviously, I saw the disk space growing. “Let me know what the disk space on every file system was.” So Rundeck has added that back to the PagerDuty timeline. It’s also added in the large files. So, actually, as part of its runbook, it’s gone out, searched, found the largest files on the system itself and brought back the top five, top six largest files. In its next step of Runbook Automation, it’s actually cleaned those files off of the system successfully and verified that it’s done that. And after verifying that it’s done that, it’s resolved the incident. So all of that was done in an automated fashion without alerting any other engineer, without escalating it to any other engineer, and I have a complete record of everything that was done. So as other stakeholders come into play, other engineers come into play, they can see what’s done. PagerDuty also provides nice analytics and reports on how you’re tracking against all of your incidents and services. So if I actually went over to just a really quick PagerDuty report, I can see some of my incidents and services. Again, the incident management piece being that nerve center for this whole process, while Influx is identifying and acting for me, PagerDuty is maintaining all of this information. So as we started with MTTR, I can see for my Linux disk space, my MTTR is pretty low. I’ve got a pretty good number there because all of it is done in an automated fashion, and I continue to monitor that while other services I can see have different MTTRs based on how fast that I’m acknowledging it and how fast they’re getting resolved. So now these start to be ideas for other candidates that I might use with my automation pipeline.
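The runbook step that finds the largest files before cleaning them up can be approximated in Python. This is a sketch of the search step only, under the assumption that a plain directory walk is acceptable. The actual Rundeck job in the demo is not shown in the transcript, and this example deliberately deletes nothing.

```python
import heapq
import os


def largest_files(root: str, top_n: int = 5) -> list:
    """Return up to top_n (size_in_bytes, path) tuples under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks so targets are not counted twice
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    return heapq.nlargest(top_n, sizes)


if __name__ == "__main__":
    for size, path in largest_files(".", top_n=5):
        print(f"{size:>12} {path}")
```

A remediation job would typically report this list (as the runbook does on the PagerDuty timeline) before any cleanup step, so there is a record of what was removed.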
Craig Hobbs: 00:38:47.889 If I come back, again, into InfluxDB, because this is where I want to live, if I refresh the screen, I can see that that job’s no longer running. I’m back down to one. I just have my one long-running job. I can see that it’s been completed. It’s been completed, passed into that column. If I actually wanted to jump out into Rundeck, I could from this dashboard. And if I actually click on the link, I could actually jump out. So this is my disk space remediation. If I wanted to see what Rundeck did, what nodes it ran on, if I wanted to see the raw execution logs, I have all that information here. Again, and all of that was done in a secure manner by Rundeck. There was no need to get any engineer involved who had secure access to those different nodes, who had knowledge of what to look for. All of that security - all of that knowledge, if you will - was in the Rundeck Automation. So it was able to go from end to end with all that information. And in the event there were an error, it would also have the knowledge of how to address errors that might come up as it’s trying to delete these files or take other actions.
Craig Hobbs: 00:39:59.913 And then finally, just kind of coming back to my Linux system, if I take a look now, I can see that I’ve completely flattened out. If I give that a little bit more time here, I can see there was my spike, right there was my drop in CPU. So if I were coming in, I’d want to know, “Hey, what was that CPU? What was that spike?” I can go to my PagerDuty analytics and reports and see that, but it’s completely resolved the issue. It took less than a minute. It would actually take only seconds. I have to put - I have to put simulated sleeps inside of my automation so that we’d actually have time during a webinar to even see to observe, but in real time, these issues in automation are resolved within seconds. But it gives me a complete view of what happened, and I can correlate that with what happened with my Rundeck execution jobs on the same tab. I can also correlate that with my alert history. So if I come back to my alert history, now, I can see here was the minute, second where the critical status was set. But now, my alert history is showing all clear. So going back again through those stages, I’ve had the real-time monitoring identify it. I could set different intervals for when that monitor would dispatch it. So this gives it a chance, if it is indeed noise, for noise to clear and dispatches it to my centralized incident management system, PagerDuty, where I keep all of the history and setup. That then knows to send it over to Rundeck for a complete auto-remediation. And I bring all of that job execution information back into InfluxDB so I, as the engineer, don’t have to [inaudible].
Craig Hobbs: 00:41:37.132 So really nice pipeline and a really nice solution. And again, using just those best tools, best fit for resolving this while still giving me as a developer full control of how I want to see that information and bring it in. So that’s a look at the pipeline. One thing I wanted to quickly touch on is how easy this was to really kind of bring together, so. And again, leveraging the components that are InfluxDB in doing that and bringing that information together. So to bring all of this information together, what I have leveraged as part of InfluxDB is the InfluxDB templates. So if I click on my tab here, InfluxDB templates, again, I spoke about this as one of those components in addition to Telegraf, in addition to alert pipeline, InfluxDB templates were a really key component in making this all easy to get up and running in minutes. So Influx provides several of these prepackaged solutions that come complete with everything from your dashboards to your alerts to any queries that need to be run, as well as any Telegraf collection configurations that need to be provided as part of the setup. It’s an open community of templates. So these templates can be added to. They can be used free of charge. There’s no cost to these templates. And solution templates cover a wide variety of different monitoring scenarios. So everything from Linux to Windows to Kubernetes. But important to note that they don’t just have to be monitoring scenarios; they can cover a wide variety of different analytics and time series data staging. So there’s things in here such as - while there’s monitoring, there’s things such as Fortnite for working with your gaming system. There’s things such as JMeter for bringing in JMeter information and seeing that on the dashboard similar to what I did with Rundeck. So if I actually were to click on my Linux system community template, I can see there’s my template. 
It lets me know what - it gives me a quick preview of what the dashboard is, lets me know how to install the template as well as any other setup and resources I need.
Craig Hobbs: 00:43:46.392 So these are really good ways for the DevOps teams, SRE teams to get up and started with monitoring your infrastructure in these preconfigured solutions. They’re scalable and they’re completely configurable by you based on what your specifics are. So in doing that, I actually created a template for Rundeck. So I actually had a Rundeck template that I’ve created. I will be adding it to the community templates in the coming weeks, but again, having a developer tool like this that allowed me to prepackage all of the information that’s coming from Rundeck and how that information is gathered related to jobs, execution, what’s running, what’s failed, links out - there’s so much more information that I can actually gather from Rundeck that would be useful to that DevOps view, but this is certainly just the starting point. But it allows me to bring all that information in and get up and running.
Craig Hobbs: 00:44:41.626 As a last point, if I actually come back to my setup here, what I’ve actually done is - actually, I wanted to go ahead and remove some of these pieces here if I can. I don’t know if I got them removed yet, but I want to remove some of my pieces and see if I can give a quick setup of how you can get up and running quickly with some of these templates. Again, these templates are designed to let you simply scale them, add them in and deploy them. So what I’m going to do here is - let’s see if I can go ahead and remove my templates just to give everyone a quick view of how these can be deployed. So what I’ve done really quickly here in the background is I’ve removed my InfluxDB templates. So if I come back now to my InfluxDB incidents, I basically set it back to a point where there’s nothing in it. So all that great work I’ve just created is gone. My dashboards are gone. If I come to my data load, my buckets are gone. So all that data that I’ve collected, all that’s gone, my Telegraf configs are all gone. I’ll load them. I’m sure they’ll disappear. Yeah, they all disappear. And that alert pipeline is also gone, right? The check disk space is still there because I created that as we started it, but I’ve actually removed my alert pipeline. All of that great information is gone.
Craig Hobbs: 00:46:17.175 So how did I get up and running that quickly with my setup? So as part of this, using InfluxDB templates - and if you can see my screen - still see my screen now, I’m actually at my command line, and InfluxDB templates are something you can leverage either from the UI or using the Influx CLI. I happen to have the Influx CLI installed on my local. If I type in Influx, I can see some of the CLI commands. I like using the CLI because it allows for easy local development. And again, I created my own template. I’ve gone ahead and downloaded all the community templates to my local machine here. So if I actually want to upload one of the templates and I believe I have one of the Linux templates here, so I have all the community templates downloaded. There’s my Linux template. The Influx CLI allows me to actually view a summary of the template. So if I actually select to view a summary of the template, here’s all the information that’s contained in that template.
Caitlin Croft: 00:47:15.283 Craig, we’re not seeing your terminal window. I don’t know if you want us to see that or not.
Craig Hobbs: 00:47:20.163 Oh, absolutely.
Caitlin Croft: 00:47:29.967 There we go. Awesome.
Craig Hobbs: 00:47:32.573 Absolutely. Sorry, everyone. But yes, what I was just showcasing here is that from my terminal window, I have access to all of my downloaded InfluxDB templates. And using the CLI, I can actually view the contents of those templates. So while we were just looking at them on GitHub, I can actually view the contents of the templates from the downloaded YAML files. All of the code - all of the template configurations are maintained in YAML. So it makes it really easy for storing these off in your own source control repositories or just pulling them and making quick updates to deploy across your different employee setups or share with other community members or other individuals in your organization if they’re doing similar monitoring. But once I have this template now, I can actually deploy it. And this is exactly how I got started with my template. I believe I have in my command history just the ability to deploy a template. So using the CLI again, instead of doing it, I can just say deploy. And what I want now to happen is in that same view that I was just looking at before I get this green coloring - and this green coloring lets me know that these are all the different attributes I’m about to deploy as part of this template, everything from dashboard to a bucket to a Telegraf configuration to various labels. This can include alerts, other queries. Again, this is that developer component that really allows the setup and development of this type of pipeline. And if I just say yes, instantly, it’ll go yellow and lets all of that information go in. So now if I switch back over to my setup, I can see - and I’ll do that here quickly. So hopefully, you’re seeing my Influx screen. I think you might not be -
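The two CLI steps described above (summarize a downloaded template, then deploy it) can be sketched as follows. The speaker says "deploy"; in the InfluxDB 2.x `influx` CLI the corresponding subcommands are `influx template` (summarize and validate) and `influx apply` (install), each taking a `--file` path or URL. The CLI version used in the demo is not stated, and the template path below is a hypothetical placeholder. The Python wrapper only builds the command lines and does not run them.

```python
import shlex


def template_summary_cmd(template_path: str) -> list:
    """Command that prints a summary of what a template contains."""
    return ["influx", "template", "--file", template_path]


def template_apply_cmd(template_path: str) -> list:
    """Command that installs the template's resources (dashboards, buckets, and so on).

    When run interactively, the CLI shows the planned changes and asks for
    confirmation before applying them.
    """
    return ["influx", "apply", "--file", template_path]


if __name__ == "__main__":
    path = "linux_system.yml"  # hypothetical local copy of a community template
    print(shlex.join(template_summary_cmd(path)))
    print(shlex.join(template_apply_cmd(path)))
```

To execute a command rather than print it, the list can be passed to `subprocess.run`, with the CLI already configured with a host, organization, and token.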
Caitlin Croft: 00:49:27.939 We are still seeing your terminal.
Craig Hobbs: 00:49:32.264 No worries. I think you’re still in my Influx screen. Yeah, so if you’re seeing my Influx screen, really just kind of culminating with the idea that in that one command, I was able to now reload my screen, load up my Linux system dashboard, and instantly start monitoring data again. So all of that that happens, collecting data from the various Linux systems, having a dashboard that set up for this, having a bucket for working with the data, having a Telegraf configuration, working the data, all of that I just wiped out, I brought back in one command and I’m instantly monitoring. So, again, this is the level of developer access and options that make creating this type of workflow, best of breed, best of tool fit and using tools such as PagerDuty and Rundeck to now kind of feed into this workflow. So I really wanted to kind of showcase that idea that leveraging InfluxDB templates, leveraging the alert pipeline, leveraging Telegraf are all key components to making workflow like this possible. So with that, I’m going to pause and kind of bring us back to where we started. And that goes with some of our goals, with some of our setup and demo pieces here. So, hopefully, now we are back to our slide deck here and really what I -
Caitlin Croft: 00:51:13.287 We’re seeing the InfluxDB Cloud.
Craig Hobbs: 00:51:16.606 Still seeing InfluxDB Cloud. Oh God, the one thing you never expect to happen to you. Hopefully, you’re seeing InfluxDB slide deck. And thank you, everyone, for all of those movements, but really, I just want to bring everybody back to the key points that I wanted to double click on. And that’s, again, that idea of InfluxDB templates. It was easy for me to create one for gathering Rundeck data because I like a single pane of glass to work from. So now I can do all of my work, all of my monitoring, all of my status from InfluxDB, again, using these developer tools. So I’ve created one for Rundeck. I will be uploading that as part of the setup going forward. But again, making it easy to gather that data, monitor that data, and understand what’s going on using these prebuilt configurations. And finally, bringing all of these solutions together for shorter incidents. And the idea is bringing together these best-of-breed solutions, both from the world of open source as well as best-of-breed incident management platforms like PagerDuty really create a solution that’s flexible for both developers as well as being able to meet the demands of your growing volume, the various stakeholders, the various escalations, and the complexities that can be involved in infrastructure monitoring and auto-remediation but also, bring down those incidents and get that MTTR down. So, again, I’ll pause there. And I just want to thank everyone for spending some time with me. And hopefully, we have some time for questions from the audience. But again, thank you everyone for taking a look at some of the solutions and work that we’re doing with some of the DevOps teams and SRE teams that we’re working with.
Caitlin Croft: 00:53:04.514 Perfect. Thank you so much, Craig. That was amazing. Before we jump into questions, I just want to remind everyone once again that InfluxDays is coming up in May. So we have the Flux training, which there is a fee attached to. There is the pre-Telegraf training. And the conference itself is free. And also, this year, it’s brand new, we’re really excited to announce that we’re having the built on InfluxDB awards. We know that our amazing community has created solutions where we’ve actually built it on top of InfluxDB and we want to highlight that. So we’re going to have these awards presented at InfluxDays. I just threw in the submission form into the chat. Submissions close May 30th. So please feel free to submit either your project or maybe someone else’s that you know of in the community. So you can either self-nominate or nominate someone in the community. All right, so the first question is how will Telegraf get the fault? Is it via Syslog, SNMP pull, or do you have to have a listener to get that SNMP trap?
Craig Hobbs: 00:54:20.748 Yeah, so Telegraf actually comes with an SNMP plugin. So if you take a look at the Telegraf product site, you’ll see a plugin that’s configured to gather that trap. I don’t remember exactly what it’s using to capture that trap, but again, I think one of the great things about Influx is that, again, it’s all open-source code. So you’ll be able to look and see exactly how Telegraf is gathering that trap. I forgot what it’s using directly, but it’s all in the Telegraf config of what it’s using. So to the questioner, please take a look at Telegraf again, all open-source code.
Caitlin Croft: 00:55:08.347 Awesome. And the cool thing about Telegraf is there’s over 200 plugins and a majority of them have been created by the community. So if there isn’t something there that you need, I’m sure there could be a community member even working on it or someone who can help you. Does Rundeck require any agents for executing auto-remediation jobs?
Craig Hobbs: 00:55:31.426 Yeah, great question. And it doesn’t. So Rundeck itself is agentless by design. So Rundeck leverages the way that you as a DevOps engineer, SRE, work with your infrastructure. So what we’ll typically see is if you’re working with systems that use SSH, if you’re using Linux or network gear that uses SSH, Rundeck will leverage the same SSH credentials that you use. If you’re using Windows, we’ll typically see PowerShell or WinRM as the mechanism that’s used. But Rundeck really becomes flexible and secure by not requiring any agents for its access to enact the auto-remediation. If you’re curious, please check out Rundeck.com. There’s lots of great information and videos on exactly how that’s done but agentless by design.
Caitlin Croft: 00:56:27.170 Fantastic. So we’ll keep the lines open just here for another minute or so. So if you have any more questions for Craig, please feel free to post them in either the Q&A or the chat. So feel free to post anything else that you have. This was a really great presentation, Craig. I think it’s amazing how you showed how the three different products work together and also how easy it is to, one, create an InfluxDB template and to get it up and running. So for those of you who are interested in the InfluxDB template that Craig created, please keep an eye out on the GitHub repo as well as our website because we will definitely be creating a page for it. And of course, it will be in GitHub. And I did post in the chat we have a library of the existing templates. So if you want to get started, you can, and it’s completely free. So it’s a really great way to get started with InfluxDB quickly. All right, well, thank you everyone for joining today’s webinar. Once again, it has been recorded and will be available for replay later today. Thank you so much for attending. And thank you, Craig, for presenting.
Craig Hobbs: 00:57:46.741 Thank you, Caitlin.
[/et_pb_toggle]
Craig Hobbs
Sr. Solutions Consultant, PagerDuty
Craig Hobbs is a Partner Engineer at InfluxData. His experience with keeping systems performant over the years has helped to shape the kind of Sales Engineer he is today – one who enjoys solving complex problems and keeping sales people honest. His specialties include: Solution Architect, Project Management, POC Design and Development, Software Customization, Technical Training, ETL Automation, Application Integration and Deployment. Craig has a BS from the University of Illinois.