How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Telegraf
Session date: Nov 17, 2020 08:00am (Pacific Time)
Network to Code, LLC is a network automation solution provider that helps companies transform the way their networks are deployed, managed, and consumed on a day-to-day basis by leveraging network automation, software development, and DevOps technologies and principles. They provide highly sought-after training and consulting services that integrate and deploy network automation technology solutions to improve reliability, security, efficiency, time to market, and customer satisfaction while reducing operational costs.
In this session Josh VanDeraa and David Flores from Network to Code will present how to monitor your network devices with Telegraf using both the SNMP and the gNMI input plugins. They will also present what the challenges are with ingesting the same type of data from different sources and how to remediate that by normalizing the data in Telegraf using processors.
Additional resources:
- Monitor Your Network With gNMI, SNMP, and Grafana
- Network Telemetry for SNMP Devices
- Monitoring Websites with Telegraf and Prometheus
Watch the Webinar
Watch the webinar “How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Telegraf” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]
Here is an unedited transcript of the webinar “How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Telegraf”. This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Caitlin Croft: Customer Marketing Manager, InfluxData
- David Flores: Senior Network Automation Consultant, Network to Code
- Josh VanDeraa: Network Automation Engineer, Network to Code
Caitlin Croft: 00:00:04.322 Welcome to today’s webinar. Super-excited to have Network to Code who will be discussing how to introduce telemetry streaming into your network using Telegraf. Once again, please feel free to post any questions in the Q&A box. And without further ado, I’m going to hand it over to Josh and David.
Josh VanDeraa: 00:00:28.948 Good day everyone. So as mentioned, the title is how to introduce telemetry streaming using gNMI in your network with SNMP involved as well using Telegraf. So first off, a little bit of, who is Network to Code? Just a slide, to introduce ourselves. It was founded in 2014, and we’re a network automation solution provider, which means we provide consulting and really try to bring the DevOps to the network organization to help companies through that. We’re a vendor-independent organization. So infrastructure-wise, we work with pretty much all vendors: Cisco, Juniper, Arista, HP, Cumulus, F5, the whole name soup there. But also, we leverage heavily open source tools, so there we’ve got our full list of tools, again, from Ansible, Python, Puppet, Terraform to Telegraf, InfluxDB, Prometheus, Grafana, etc. We look at really all the toolsets available to us. And then we also have experience with many IT operation software such as ServiceNOW, Remedy, IBM, etc. So we’re really focused towards the enterprise side of bringing that DevOps to the network team. Then introducing ourselves, first, David.
David Flores: 00:01:51.981 Hello, everyone. I’m David aka net panda for those of you who know me on the Twitterverse. I reside in Dublin. I’m an automation consultant at Network to Code, and my experience as a network engineer has been in service provider networks, big data [inaudible] fabrics. I’ve been doing software development and automation since 2013. I’ve been using Telegraf since 2019 so pretty new in this space on the Telegraf side of things, but excited to be here.
Josh VanDeraa: 00:02:25.735 And I’m Josh VanDeraa. I’m in Minnesota, United States. I’m a network automation engineer with Network to Code and been working on networks since really it’s 2000, but really get into doing network automation since 2015. And as soon as I got some exposure to Telegraf in 2018, it’s been a super interest for me to be able to see what we’re able to do with Telegraf and how it can play in the network space. So then now that we’ve introduced ourselves, the quick agenda. First off, this is one of the best times to be gathering telemetry data from network devices. I remember when I first got started, we could only get 15-minute samples and that was quite just an average, over 15 minutes. Things changed quite a bit down to 5 minutes and such. Now we’re going to be talking about sub-minute resolutions coming up here. But we’ll start with what we see as a network streaming telemetry stack. We’ll talk about gNMI. Then we’ll look at gathering of data from network devices with Telegraf. Once we’ve got that data with Telegraf, we’re going to look at what Telegraf can do to enrich and modify those metrics. And then all this stuff I’ll kind of go through in slides originally, and then we’ll get into the seeing it live with David doing a demo. And then we’ll have some wrap-up and cover some tips and tricks that we’ve come along the way with Telegraf.
Josh VanDeraa: 00:03:54.951 So as we see the telemetry stack, at the very top we’ve got our network devices. They’re passing traffic making the Internet work. We then need to collect the data, enrich the data from there continuing on down, then store that data, offer analysis, and then dashboard that data. And so when we look at Telegraf, it really fits into three of these items in the whole Telegraf stack, so we’ll take a look at that in a little bit. So streaming telemetry in gNMI, I think its first good to go ahead and level set, what is gNMI? It’s a model-driven configuration and retrieval of operational data over gRPC. That’s the term from a vendor’s page. And really, gRPC is remote procedure calls developed by Google, which has a pretty strong standard to it, and it helps from efficiency and speed. And you also hear these called protocol buffers. So gNMI is a subscription model in that the device is going to keep track of who is subscribing. You have Telegraf that’ll reach up to the device, and then from there, the device will send data at a sample rate to the Telegraf instance. One quick note on gNMI, previously in Telegraf, the Telegraf plugin it was called Cisco gNMI Telemetry up until Telegraf 1.15. At 1.15, they changed the name over to just gNMI because it did play well with all of the vendors, and looking at the GitHub issue they said, “Let’s change the name.” Every vendor really jumped on board and said, “Yes, this is a good thing.” It’s just gNMI. So as you look a little bit later and you search for this particular plugin, you may be taken to the old documentation. Just take a look at this link here, erase anything about Cisco or telemetry and just look up gNMI on GitHub.
Josh VanDeraa: 00:06:03.980 So Telegraf for network telemetry, first off, what is Telegraf? Taken from the site, Telegraf is a plugin-driven server agent for collecting and reporting metrics. I put on here as well that it’s written in Go which makes this super-fast, and it’s a compiled binary. And so it’s got a low footprint and just works really fast. It’s going to pull metrics from the system it’s running on, third-party APIs, or as we’ll see, we can also have it gather telemetry from network devices. And then with the plugin nature, it also supports output plugins. So you’ll be able to send this to various data stores with InfluxDB being a strong candidate there. So the configuration for Telegraf, it’s all written in TOML. And when we take a look here, this is all that’s needed to get gNMI started from a Telegraf perspective. At the top, we’ve got our inputs for gNMI, and then we say what addresses, so it takes a list of devices, and then the username and password. So that’s all that you need from a credential, and then underneath that, you start to define all the subscriptions. So here, we’ve got our interface counters that we’d like to gather. You can also gather information about your routing protocol, and really, the telemetry streaming options are quite vast. And so you just need to take a look at the documentation a little bit further for what else that you may want to subscribe. But just this alone gives us a lot of good data about the interfaces.
Josh VanDeraa: 00:07:43.275 So in the network world, we’d love to monitor all this. We have switches. We have multiple vendors. We’ve got routers, everything going around that we’d love to have this all collected with streaming telemetry. But lo and behold, we’ve got several devices that don’t support streaming telemetry. Does that mean we’re out of luck for being able to collect things? That’s one of the awesome things about Telegraf is that we can do hybrid collection. So hybrid collection is being able to go ahead and gather data from inputs for SNMP, which is what we’re going to show here, and gNMI. And we’re going to look at how we make that all look the same a little bit later on here. So the configuration for SNMP is a little bit longer. SNMPv2 if you aren’t aware, I double-checked, it’s been around since 1991 from an RFC perspective. So it’s got quite the number of years on it, but it’s still the solid base of many network monitoring solutions, and so here, we’ve got our inputs SNMP. We define the agents of where we’re going to connect to, what SNMP version. So there is also an SNMP version 3 which has some more authentication on there, but from ease of demo, we’re going to show you version 2. We set the interval at which we will poll all the devices, and 60 seconds is about the lowest I’ve seen ever to be able to poll with SNMP, and we’ll talk about that a little bit later yet. But as we go through, then we gather a particular OID to get the hostname of the device to be able to pass through Telegraf, and then we’re going to also generate an SNMP table. So we’re going out and grabbing the ifXTable from the particular device which is going to give us all the information about the interfaces that are on that particular network device.
Josh VanDeraa: 00:09:46.427 And from there we want to set a tag, and we can see this is tag highlighted down here, that will go out and from that table that we have will look for the field of interface description. And we’re going to set that as a tag to be passed into the Influx line protocol. So on this side, I’m going to start you with the Influx line protocol, actually, on the bottom here of just making sure we level set, what does the Influx line protocol look like? At the beginning, we’ve got our measurements. So in this particular case for gNMI, the measurement is interface, and then all the tags that get defined. So tags are a key value pair of the tags that we’re going to apply to this particular measurement that happens. So we got the source and name for tags, and then fields, from a network perspective, are the things that we’re looking to measure. So we’re going to get the OutOctets and InOctets as separate items coming back. If we take a look at the same line protocol up here, we can have multiple fields where we have OutOctets and InOctets all coming across on the same measurement. And then lastly, we have the timestamp of when this was actually gathered.
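As a rough illustration of the line protocol structure described above (measurement, then tags, then fields, then timestamp), here is a minimal Python sketch. The function name `to_line_protocol`, the tag values, and the counter values are hypothetical and not taken from the slides. It also assumes names and values contain no spaces, commas, or equals signs, so no escaping is done.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one metric as an Influx line protocol string.

    Simplified sketch: no escaping of special characters is performed.
    """
    # Tags are key=value pairs appended to the measurement, comma separated.
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))

    field_items = []
    for key, value in fields.items():
        if isinstance(value, bool):
            field_items.append(f"{key}={'true' if value else 'false'}")
        elif isinstance(value, int):
            # Integer fields carry an "i" suffix in line protocol.
            field_items.append(f"{key}={value}i")
        elif isinstance(value, float):
            field_items.append(f"{key}={value}")
        else:
            field_items.append(f'{key}="{value}"')

    # Layout: measurement,tags <space> fields <space> timestamp
    return f"{measurement}{tag_part} {','.join(field_items)} {timestamp_ns}"


# One metric carrying several fields, as in the SNMP table example.
line = to_line_protocol(
    "interface",
    {"agent_host": "router1", "name": "Ethernet7"},
    {"ifHCInOctets": 1000, "ifHCOutOctets": 2000},
    1605600000000000000,
)
print(line)
```

A gNMI-style metric would use the same function with a single field per call, which mirrors the "one counter per metric" shape described in the talk.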
Josh VanDeraa: 00:11:11.302 And so what we’re taking a look at here then, the big picture of this slide, is SNMP. We get one thing that says interface counters. We have an agent host equals router one, and we have different things with gNMI down on the bottom here. We have interface for the measurement. We’ve got the source is now an IP address, and up here we have interface name of Ethernet 7, which kind of lines up with the name of Ethernet 7 down here, and so we have different data points. And this is one of the things that makes Telegraf awesome is that we can then take it and send this data through the Telegraf pipeline. So Telegraf pipeline as we have it at the top, we have SNMP, gNMI. We also have, not covered in this particular webinar, the Execd, but those are some of the input plugins that we’ve used the most to gather data from network devices. Take a look, there’s over 190 plugins available there. That list just keeps getting larger and larger with every new release. And then after we collect the data, we send the data through data enrichment. And here we’ll show the regex and rename processors to change those fields that we saw so that SNMP will line up with the gNMI data, and then output plugins. There’s over 40 available, again, continuing to ever increase, and we’ll take a look at what that looks like to send out to InfluxDB and also send it to your screen.
Josh VanDeraa: 00:12:50.028 So data enrichment, this at the top, we’ve got SNMP, gNMI. The first thing we’re going to do is normalize the data, send that through the rename processor. And then after we’ve renamed and set up the tags and fields to match, we’ll send that through the regex processor to enrich the data. So we’re going to try to get some additional pieces. That way, in the future, when we’re comparing devices, we can get more information and present the business logic of, what does this interface actually mean to us? And then lastly, we output all that to InfluxDB. So the rename processor, as we take a look here, we have the processors.rename. Again, this is in the Telegraf config. And so the first thing we’re going to do is change the field of ifHCInOctets and just change it to in_octets. And we’re going to do the same with the OutOctets to match what’s on gNMI, and so those are fields that were changed. We can also replace tags where we’ll take agent host and change that over to source to, once again, match what is coming in from the gNMI plugin.
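The renaming step can be pictured with a small Python sketch. This is not Telegraf's code; it only mimics what the rename processor is described as doing here, on a metric held in a plain dictionary. The function `rename_metric`, the dictionary layout, and the sample values are assumptions made for illustration.

```python
def rename_metric(metric, field_map, tag_map):
    """Return a copy of the metric with selected fields and tags renamed.

    Keys that are not listed in the maps are passed through unchanged.
    """
    return {
        "name": metric["name"],
        "tags": {tag_map.get(k, k): v for k, v in metric["tags"].items()},
        "fields": {field_map.get(k, k): v for k, v in metric["fields"].items()},
        "timestamp": metric["timestamp"],
    }


# SNMP names on the left, the gNMI-style names they should line up with on the right.
FIELD_MAP = {"ifHCInOctets": "in_octets", "ifHCOutOctets": "out_octets"}
TAG_MAP = {"agent_host": "source"}

snmp_metric = {
    "name": "interface",
    "tags": {"agent_host": "router1", "name": "Ethernet7"},
    "fields": {"ifHCInOctets": 1000, "ifHCOutOctets": 2000, "ifHCInUcastPkts": 10},
    "timestamp": 1605600000000000000,
}

normalized = rename_metric(snmp_metric, FIELD_MAP, TAG_MAP)
print(normalized["tags"])    # tag "agent_host" is now "source"
print(normalized["fields"])  # octet counters renamed, other counters untouched
```

Only the two octet counters are mapped in this sketch, so any counter without an entry in `FIELD_MAP` keeps its SNMP name.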
Josh VanDeraa: 00:14:11.698 After we’ve renamed and we normalize the data, we can then send this through and add metadata. So we’ve got another regex processor at the top. When we do tagpass, we’re really filtering is what tagpass is doing. It’s filtering that based on this key-value pair, whenever the host is device one it will then pass this information down to the next couple stanzas in the configuration. So then these are both going to be with tags where we’re going to take a look at the - we’re going to key in on the name. And when the name matches this regex pattern, we’re going to go ahead and create this new key of interface role with the value of management is what that’s really stating there. So as we go through, we’re saying device one, gig 000 is a management interface. And then we set up for this particular device that gig 001 is a data interface. So it does the exact same thing there. And with that, I’m going to stop sharing for a moment and pass this over to David.
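The filter-then-match logic described above can be sketched in Python as follows. This is only a model of the behavior, not the regex processor itself. The tag key `interface_role`, the device name `device1`, and the interface names are hypothetical. The sketch stops at the first matching rule, which is a simplification chosen for this example.

```python
import re


def add_interface_role(metric, host, rules):
    """Add an interface_role tag when the interface name matches a rule.

    host:  only metrics whose "host" tag equals this value are processed,
           standing in for the tagpass filter.
    rules: list of (regex pattern, role) pairs, checked in order.
    """
    tags = dict(metric["tags"])
    if tags.get("host") == host:
        for pattern, role in rules:
            if re.search(pattern, tags.get("name", "")):
                tags["interface_role"] = role
                break  # first matching rule wins in this sketch
    return {**metric, "tags": tags}


# Hypothetical per-device rules: one management and one data interface.
DEVICE1_RULES = [
    (r"^GigabitEthernet0/0/0$", "management"),
    (r"^GigabitEthernet0/0/1$", "data"),
]

mgmt = add_interface_role(
    {"name": "interface", "tags": {"host": "device1", "name": "GigabitEthernet0/0/0"}},
    "device1",
    DEVICE1_RULES,
)
other = add_interface_role(
    {"name": "interface", "tags": {"host": "device2", "name": "GigabitEthernet0/0/0"}},
    "device1",
    DEVICE1_RULES,
)
print(mgmt["tags"])   # gains interface_role=management
print(other["tags"])  # unchanged, filtered out by the host check
```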
David Flores: 00:15:27.121 Thank you, Josh. Let’s set up over here. Okay. So for the demo environment, we have a couple of routers. We have router one which is configured with SNMP, and we have router two which is configured with gNMI. The objective here is to gather interface traffic statistics and go through the data transformation process that Josh already explained. And finally, we’ll do an example visualization with Grafana. Okay? On the monitoring side, I’m using the TIG stack. So I’m using Telegraf for data collection, normalization, and enrichment, InfluxDB for storage, and Grafana for visualization. Okay? Right now, my monitoring stack is actually running on Docker containers. On the Telegraf side, I have a couple more Telegraf instances that have the full configuration and at the moment are gathering statistics from both routers and sending them over to InfluxDB. But I’m going to go through the transformation process of the data you see in telegraf-r1 and telegraf-r2, outputting that data to standard output so we can actually see the Influx line protocol in action. Okay?
David Flores: 00:17:05.275 So first, we are going to go through the telegraf-r1 configuration. So telegraf-r1 is the agent that is configured for router R1 and is going to perform the SNMP data collection. Okay? So Telegraf, as Josh said, is based on TOML which is an acronym for Tom’s Obvious, Minimal Language, and it has three major sections. The first section is where you specify the global tags. Now, tags are key-value fields that we are going to use to push into the metric that we’re going to process later on. Then we have an agent section where we define, mainly, the Telegraf parameter settings, well, of the agent itself. And you see here what I’m actually doing is setting the hostname of this agent to be telegraf-r1. Since these are containers it will send the container ID instead of the hostname. So I’m actually setting it up over here in the agent section. Now, the third section of a Telegraf configuration file is plugins. The plugins are input plugins. We have processor plugins and [inaudible] plugins, and also output plugins. The input plugin is the one that we’re going to use to collect data, and we’re using the SNMP one. And this is very similar to what Josh already specified on the slide, but the agent that I’m going to connect to is router R1. I’m using SNMP version 2. The SNMP community, I’m getting it from an environment variable, and the polling frequency is every minute against this device. And we have some parameters in case of network interruptions or - yeah, network connection disruptions.
David Flores: 00:19:07.030 Now, the metric that we’re going to collect is the ifXTable, okay, OID. So we’re going to collect all the counters from an interface of this device, and we’re naming that metric interface. Okay? It’s important to know that this name you can actually change it in order to match whatever naming conventions you’re going to use on your stack. Okay? Now, also, I want to be able to collect the interface name. Right? And so I specify here a field which is collecting this OID which is the IF description, and I’m saving that as a tag into the name - as a name tag. Okay? And we’re going to see that in the output later on. And finally, just for demo purposes, I’m just printing this out to standard output in the data format of Influx line protocol. Okay? So let’s take a look.
David Flores: 00:20:18.928 Okay. So here on the logs, let’s take a look at this metric. The first thing that we notice is the measurement name, which is interface, that we already configured. This section over here are the tags of this metric, and we have agent host set to R1. Okay? This is the same information that I already specified on the configuration file, okay, on this device. So SNMP takes it as agent host, that field. The device role is set to spine. This is a tag that I actually manually inserted via the global tags in the Telegraf configuration. Okay? The host is telegraf-r1. That’s on the agent section where I changed the hostname. This is what is reflected at the moment. If I left it by default, it would just say the container ID. And lastly, I have the interface description field that is set, yeah, the OID, and is gathered here as a tag. So we have here the interface name. After this, we have all the counters of the interface. Okay? So these are the fields in an Influx line protocol. These are called fields, and this is the key-value pairs. And you can see that we have multicast packets, some broadcast packets. We have ifHCInUcast, and so on. And lastly, we have the timestamp that is going to be stored, well, in this case, in standard output. But the idea is that this is going to be stored in the InfluxDB database. Okay?
David Flores: 00:22:01.687 Now let’s take a look at the configuration on R2. So on R2, the general operation is pretty similar. So we have global tags. We have device role set to spine. The agent is telegraf-r2. Okay? This is R2, so we’re collecting the data through the gNMI plugin, and the settings are pretty much standard. The configuration that we have here is based on an Arista EOS router, and it has a minimal default configuration. Okay? So we have some protections in order to connect to the router, and we have also some redials in case of network interruptions. Next, we define the subscription. Okay? So we’re using subscription. We’re using the stream subscription mechanism in gNMI, and for that, we need to define the origin. Okay? We’re using OpenConfig interfaces as the origin, and then we specify the path. On the path what we want to collect is the counters of the interfaces. So this is the path that you will normally get, /interfaces/interface/state/counters. Okay?
David Flores: 00:23:16.813 And the subscription mode that we set is to sample. Now, Telegraf supports the gNMI plugin supports the normal gNMI standard on the subscription model, so you can have on change which basically is it’s only going to stream out the data if there was a change on the value on the device side. So that’s when it’s going to stream out the data. Or you can have a method called target defined which is more the device - well, the vendor implementation and device based on the lead from the query that is going - the metric that you are trying to collect is going to verify which subscription method is better. So it can either sample or can be on change, and it’s going to set it up for you. Okay? Right now, we’re using the subscription mode set to sample, okay, and the interval is going to be five seconds. Okay? So another thing that I didn’t mention, this is the minimal configuration on Telegraf. And with this, you should be able to start collecting data on the side of your agent, and [inaudible] SNMP outputting this information with Influx line protocol to standard output.
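The difference between the sample and on-change subscription modes can be shown with a toy Python simulation. It models only the behavior described above and does not speak gNMI. The function names and the readings are made up for illustration, and the readings are assumed to arrive on a fixed 5-second grid.

```python
def sample(readings, interval):
    """Emit the current value every `interval` seconds, changed or not."""
    return [(t, value) for t, value in readings if t % interval == 0]


def on_change(readings):
    """Emit a value only when it differs from the previously seen one."""
    emitted = []
    last = object()  # sentinel that never equals a real reading
    for t, value in readings:
        if value != last:
            emitted.append((t, value))
            last = value
    return emitted


# (seconds, counter value) pairs as the device would see them internally.
readings = [(0, 100), (5, 100), (10, 100), (15, 250), (20, 250)]

print(sample(readings, 5))   # all five readings are streamed
print(sample(readings, 10))  # only the readings at 0, 10 and 20 seconds
print(on_change(readings))   # only the first value and the change at 15 seconds
```

In this model a value that rarely changes produces far fewer updates with on-change, while a busy counter would change on nearly every reading, which is one reason sampling is a natural fit for interface counters.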
David Flores: 00:24:42.933 Okay. So these are the metrics on the gNMI side of things. You can clearly see that it is different, one metric on the gNMI output if you look at it with the SNMP side. First, you can see that on the SNMP side, what we’re collecting is the table of the interface. So basically, we’re collecting all the counters of that interface. gNMI is sending a metric on a per counter basis. Right? But the structure of the message is still the same. So we have the measurement name set in interface, then we have tags. Okay? So it is a lot similar with the example of SNMP, device set also to spine which is a global tag, host set to telegraf-r2 which is in the agent section, and then name of the Ethernet 2. This is actually taken from the gNMI plugin. Now, source R2 is how Telegraf identifies the device that is sending this data. It’s setting that as a tag main source. So there’s some discrepancy there if you want to monitor the data later on. And here you can see also that the fields are like an [inaudible] case while in SNMP you just see them as a camel-case scenario. But yeah, still the same format, so you have the timestamp at the end.
David Flores: 00:26:18.609 Now, this is the raw data. This is the minimal configuration that you will have with your Telegraf agents. The problem if you start sending this information to a database is that if you want to go through a visualization tool or you want to perform some kind of queries to your data, you need to have a query specified for an SNMP and a query specified for gNMI. Why? Because fields are different, tags are different, and you need some kind of data normalization process so the same kind of queries can work for both. It doesn’t matter what implementation you have. So that’s one of the reasons that you need a common data model or scheme when you’re handling the telemetry data. Okay? Let’s enable data normalization. So I’m going to enable it to the gNMI side. Over here, I’m just using the rename processors, and I’m replacing the source to a tag named “device”. So I’m going to use in my systems and the field that’s going to be exposed - well, the tag that is going to be exposed for query is going to be “device”. Okay? So I’m enabling device here, and on telegraf-r1 I’m going to do some transformation. So the first thing is that these fields, the counters, I’m going to change only the in_octets and the out_octets to their [inaudible] case - well, to the counterpart in gNMI basically. Okay? So we’re going to use the same kind of fields that gNMI is getting us. And also, I’m replacing the agent host tag, which is where the name of the device, the hostname is actually there, with the tag device. So both telegraf-r1, telegraf-r2 have that tag set the same way. Okay?
David Flores: 00:28:17.596 So let me restart over here to apply the configuration. [inaudible] here really quick. Oh, I don’t think maybe - yeah. So it’s already collecting data. gNMI is really fast in that part. On SNMP, you have to wait the polling time, which is in this case 60 seconds, and so here you have the metric. The first thing that you can notice comparing the gNMI output with the one before is that before it was source and now we have the tag set as device. Okay? So that’s really good that we managed to do the transformation on the gNMI side of things in order to match the tags. Let’s take a look on R1. So on R1, we can see a couple of things here. So the first thing is we can see that the name changed, it was agent host before and now we have set that tag as device which is good. Yeah. And also, you can see that the InOctets counter has been changed. Okay? So that field has been changed and the same is with OutOctets over here. Okay? So to see this a little bit better, you can see them there. Okay?
David Flores: 00:29:57.292 So that means that, of course, in production environment, the idea will be to have all these counters to have a counterpart on the gNMI basis and to have a one to one mapping of all these fields. So well, you don’t want to only monitor bandwidth. Right? You want to also monitor [inaudible] broadcast packets, and it’s important for you to do that mapping in a production environment. Well, for the purposes of this demo, we’re only going to show the RX and TX. That means that my data if I look at bandwidth perspective is already transforming. It’s already normalized. Right? The next bit that I want to do is enrich it. And why enrich it? The idea behind enrichment is to add extra metadata that is going to be useful when you’re doing queries, when you’re creating visualizations that can provide real value as an operator or on a business device level. Okay?
David Flores: 00:31:03.971 So let’s go to the enrichment section. Okay. So for the enrichment, basically you have the regex processor getting the interface measurement. And I’m going to explain this with my own words because the first time that I actually saw all these fields, it was kind of misleading what I was actually doing. So the way that this works is that you have the interface measurement. I’m going to collect the tag. Based on the tag, I’m going to collect the tag name which is the interface name, and I’m going to apply this regex pattern. So if this regex pattern matches, then I’m going to create a tag which is interface role, and I’m going to set management as its value. Okay? So in general terms, this device is a Cisco IOS device, and at the moment, it’s only Gigabit Ethernet. It’s a lab device, so it doesn’t have too many interfaces. So the regex pattern is actually like a one to one mapping. And you can see that here I have one of these interfaces set as management, another interface set as spine, and another interface as backbone. Okay? Now the spine one is one that is important here. That’s the one that connects between R1 and R2, and I’m going to try to generate some traffic on those. Okay? So on telegraf-r2, we have the same processors regex, and you can see that the process of regex is still the same. The only thing that changes is the interface patterns because in this case, this is an Arista EOS. So I’m using a different regex pattern in order to match the interfaces and apply the interface role. Okay? Now let’s apply this.
[silence]
David Flores: 00:33:35.255 So actually, this [inaudible] interface role because not all the interfaces are matching at the moment. Again, so in a production scenario, your new interfaces could have a role. So you can have another regex expression there that is going to have a default role for the rest of the interfaces. I’m filtering here based on the tag of interface role. So here you can see that it managed to collect information of Ethernet 2, and based on this Ethernet 2, added a tag interface role set to backbone. Okay? And the same can be said of R1. Yes. So on R1, we can see that interface Gigabit Ethernet 02 is set to backbone as well. So this is a great sample of how you can enrich your data and add metadata to your metrics in order to provide better operational and business value. Okay? So this we already covered here in the demo environment. As I said, I have a couple of Telegraf instances sending information to the InfluxDB storage, and we’re going to visualize that data in Grafana.
David Flores: 00:35:02.964 Okay. So I have a couple of panels here in Grafana. I’m not going to explain how they were configured, but I just want to go in a really high-level overview of what this means and how the fields that we just collected how they are reflective in the visualization. So the first thing is that this panel is based on our interface traffic on a per device basis, so essentially based on this variable here which is set to the device tag that I modified during the data normalization process. I can change between devices and see the traffic. A quick note here, the traffic that is like a plus traffic on the upper side is the RX side of things. And the one below is the TX side of things. I should have actually put a label there, but you can quickly see the differences between the polling mechanisms of - well, the data collection mechanisms between R1 and R2. There’s a lot more precision on gNMI while there’s a lot less on the SNMP just because of the polling interval. Right? Here, let me actually try to refresh it and let me poll it in the last 15 minutes across the window range, and you can see more precision with the gNMI output. Okay? So that’s a thing that I want to highlight in this session is the benefits of gNMI in this case. So this is the interface traffic.
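To see why the shorter gNMI interval shows more detail than the 60-second SNMP poll, consider the arithmetic that turns octet counters into a traffic rate: the counter difference times 8, divided by the elapsed seconds. The Python sketch below uses made-up numbers (a single 6,000,000-octet burst inside one minute) and a hypothetical helper, `bits_per_second`. It ignores counter wrap and resets, and it is not the query used in the demo dashboard.

```python
def bits_per_second(samples):
    """Convert (timestamp_seconds, octet_counter) samples into bit rates.

    Each rate covers the window between two consecutive samples.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((t1, (c1 - c0) * 8 / (t1 - t0)))
    return rates


BURST_OCTETS = 6_000_000

# Counter sampled every 5 seconds: idle, then a burst between t=20 and t=25.
fine = [(t, 0 if t < 25 else BURST_OCTETS) for t in range(0, 61, 5)]
# The same counter polled once per 60 seconds.
coarse = [(0, 0), (60, BURST_OCTETS)]

peak_fine = max(rate for _, rate in bits_per_second(fine))
print(peak_fine)                # peak seen with 5-second samples
print(bits_per_second(coarse))  # the burst averaged over the whole minute
```

With these numbers the 5-second series shows a short peak of 9.6 Mbit/s, while the 60-second poll reports a flat 0.8 Mbit/s for the same minute.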
David Flores: 00:36:50.085 Now, the other field that we managed to add was the interface role. Right? So the interface role is a tag that we added in the enrichment process, and we have this table down. Actually, let me put it in a higher resolution. So it’s actually interface traffic on a per role basis, and this variable over here, interface role, affect this panel. And right now, what we’re seeing is it’s actually collecting the management traffic of the interface. I’m going to generate some traffic really quick.
[silence]
David Flores: 00:37:45.381 Okay. And we should be seeing some traffic changes over here. But of course, I’m generating traffic between R1 and R2 so should be able to start seeing some changes here. Let’s put it in a bigger resolution. We can actually start - since this is data on a per device basis, we’re seeing some traffic change over here. But also on the spine role, we are looking at traffic over here. And this is interface role, so I’m actually looking at traffic from both R1 and R2, okay, on the spine interfaces. So a quick note that you can see here is that the SNMP traffic still is not gaining the traffic yet. It’s still waiting while the gNMI already has information data about the traffic that is traversing the interface. Okay? Well, with this I’m gonna stop the demo. I’m going to give it back to you, Josh.
Josh VanDeraa: 00:39:00.423 Sounds good. So just to recap then, what we’ve shown here is we’ve collected data from devices, and then we’ve normalized it, enriched it, and sent the output to InfluxDB at which point then we could take the data and create a dashboard from it. So as we’ve gone through that, I know everything that David’s done here is going to be available at the end of the slides. We have an appendix section that’s going to be some of the useful information. So what’s next for us in the tips and tricks? So we’re taking a look at the Flux language that looks very, very appealing. It’s got some great capabilities from what we’ve seen thus far. InfluxDB 2.0, I think, had GA last week, and so we’re very excited to get an opportunity to take a look at what that has to offer. InfluxDB IOx as well is going to be of interest for us.
Josh VanDeraa: 00:40:05.647 And then on the plugin perspective, we really want to take a look further at what inputs are changing. We’ve definitely seen some changes in 2020 alone to what inputs are available. We’ve seen some growth there and some capabilities, and then also same on the processor side. So enriching the data helps to give us greater visibility, greater - or better dashboards that are going to make more business sense than just ask the question, “Okay. We’ve had this much traffic to a device. But what kind of traffic is it? Or is it going to be impacting to us?” And then we call out specifically here one item of the Starlark processor. From what we’ve looked at, Starlark is very similar to Python. Being that Python is one of our core competencies at Network to Code, we would love to see what the capabilities are between Starlark as a processor for enriching data or changing data as needed versus just trying to run that with Python.
Josh VanDeraa: 00:41:13.652 Some of the tips and tricks: we asked this among ourselves, more than just David and myself who are here, what have we come across? First off, what we’ve seen with the configuration, I’d say that was a very basic one for David. It was two devices. And how does that scale out to thousands of devices? First off, automate the configuration with templating, and in our experience, Ansible is a great choice to just be your template renderer. We’ve seen this with David, and this is a strong recommendation: run this in Docker as well so that you can kill off the container. You can make updates to the container and not have to make any changes to the system, because that goes along with our third bullet, installing the SNMP MIBs locally. So put the SNMP MIBs - if you run it on bare metal or on a VM, put the MIBs there so that you can reference things by name and not by a long OID. We definitely showed, from a demo perspective, the line output. So you use the file output to send things out to the command line. So when you’re just trying to get started, is this able to connect? Are we actually getting data? Use that plugin to help. And this is where the Execd processor on our fifth bullet comes in: if you can’t get the data via SNMP or gNMI but you know the data’s there from a vendor, and you can get it from the command line, look at writing your own executable. We actually did that where we could not get the data from the vendor from any kind of polling, but we could get it from SSH. So we’ve written an Execd plugin that logs in, and it actually stays logged in the entire time and then provides that data back.
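Illustrative sketch of the templating tip above (not shown in the webinar): the speakers recommend Ansible as the template renderer, but the same idea can be shown with Python's standard-library `string.Template`. The device names, addresses, community string, and output file names are hypothetical, and the rendered file is assumed to be one Telegraf configuration per device that polls `IF-MIB::ifXTable` (which relies on the MIBs being installed locally) and writes to stdout through the file output.

```python
from string import Template

# Assumed minimal per-device Telegraf configuration: one SNMP input that
# walks the interface table by MIB name, and a file output that prints
# metrics to stdout for quick troubleshooting.
TELEGRAF_TEMPLATE = Template("""\
[[inputs.snmp]]
  agents = ["udp://$host:161"]
  version = 2
  community = "$community"

  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifXTable"

[[outputs.file]]
  files = ["stdout"]
""")


def render_config(host, community="public"):
    """Render the configuration text for a single device."""
    return TELEGRAF_TEMPLATE.substitute(host=host, community=community)


def render_all(devices):
    """Return a mapping of hypothetical file name to rendered configuration."""
    return {
        f"telegraf-{name}.conf": render_config(host)
        for name, host in devices.items()
    }


# Hypothetical inventory; in practice this would come from a source of truth.
configs = render_all({"r1": "192.0.2.1", "r2": "192.0.2.2"})
```

Each rendered file could then be mounted into its own container, which matches the one-config-per-device approach discussed later in the Q&A.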
Josh VanDeraa: 00:43:05.188 Take a look at the internal input plugin to double-check what’s going on within Telegraf. If you have the capability on network devices - we’ve talked about, where do we run this? If you are able to, if it’s a Linux OS on your network device, consider running Telegraf actually on the device. That way, you reduce any latency of the network to poll the device or have the data being ingested. If you just have it right there on the device, you’ve got some good capabilities there. Then the last two bullets are about taking a look at the plugins altogether. There’s a Network Response plugin that’s able to more or less measure your latency from one endpoint to another. You can double-check DNS. So we’ve all seen it: an outage happens, and does DNS resolve? That’s an easy way to put that into a check and help your operations systems. And one that I love double-checking, as we rely on SaaS services, is the HTTP Response plugin. I didn’t put that on here in the slides, but it lets you go out and poll: is GitHub actually up and running, or are you running some other services? You can have this poll periodically. Be respectful to your SaaS vendor. Don’t overload them with too much traffic, but you do want to know if the service is available before you start getting phone calls. And then double-check the plugin documentation. And really what I’m saying here is that with every new version of Telegraf, they usually put what the new plugins are at the very top. And so from there, understand what’s changing in the landscape to be able to help monitor your environment and monitor your network devices.
Josh VanDeraa: 00:44:54.949 And lastly, some other follow-ups. I know there are a few questions coming in around, how can we see some of this? The very first blog post at the top there at blog.networktocode.com is about working in a very similar way with gNMI, SNMP, and Grafana. And then the second one is all about SNMP because SNMP can be tricky. And lastly, I just want to say thank you for the opportunity. We do have a Slack as well that is network automation focused at slack.networktocode.com. You can sign up there, and within that workspace, there is a telemetry channel that is telemetry focused. Otherwise, you can find us on the web at networktocode.com or on our couple of social channels listed there as well. And I think with that, we should have some Q&As in there.
Caitlin Croft: 00:45:53.190 We have plenty of questions for you guys. So thank you so much. That was fantastic, and for everyone who is on the webinar, I will make sure that those links that Josh just showed are available when you go check out the replay. So the first question is we are interested in getting events from network devices like Cisco or Juniper. We want a similar feature for devices that support streaming telemetry. We would like to implement support in the existing system instead of building a new one. In the future, we plan to collect status from hundreds of thousands of devices and several million interfaces. How many device streams can be supported by one Telegraf? And does Telegraf have the ability to correlate, suppress, and filter events?
Josh VanDeraa: 00:46:46.874 That’s quite the question already there. So I’ll try to cover some of this. So again with monitoring, you’re going to be getting counters, the data, typically. While there is a status involved with those counters and with those polls, when you’re talking about events there is also a Syslog plugin. That was something as I saw that question, double-checked to make sure that’s there. So we’re able to keep all of the stuff in the same system. And as you look through that, does it have the ability to correlate, suppress, and filter? I think that’s a little bit more of the alerting mechanism side of things to be able to do correlation of what’s up and downstream of each other because of its nature. Telegraf as a plugin system is about collecting, changing the data, and sending it off. So from a Telegraf perspective, I am not aware of anything that does what they’re looking for specifically. But Telegraf and the whole ecosystem here does have that capability. That’s one of the complexities that comes along with it.
Caitlin Croft: 00:48:03.962 Let’s see. Are those two separate Telegraf instances or can it be the same one? I’m wondering if this might have been during David’s demo.
David Flores: 00:48:14.146 Yeah. These are two different Telegraf instances. Basically, different Docker containers that have a Telegraf configuration, one with SNMP and the other one with gNMI. Well, you can have one Telegraf instance connecting to multiple devices. That’s why on the Telegraf configuration, the way that you set the names it was like an array format so you can actually put multiple targets there. So that’s a possibility.
Caitlin Croft: 00:48:46.096 Are you using physical boxes or VMs or VRLs?
David Flores: 00:48:52.504 For the network devices, I’m using virtual images for this demo. Yeah.
Caitlin Croft: 00:49:00.719 So David, this question is specifically for you. How can we add more devices and their different resources like CPU and memory in interface? We have to configure some identification in the configuration.
David Flores: 00:49:16.978 I don’t quite understand. Can you repeat the question [crosstalk]?
Josh VanDeraa: 00:49:22.172 I can jump on in this, David, and I think this was in relation to the SNMP configuration and specifically the Telegraf configuration because this was during the demo. And so if it’s SNMP, it’s adding additional SNMP configuration. I think we have some examples of that on the blog sites. So this was specifically for interfaces that we are sharing because otherwise, we could be here for quite some time, and also the same with the subscriptions on the subscription front. You can have multiple stanzas of the subscription. You don’t need to have just one subscription per Telegraf file. It can get quite lengthy as well in the subscription.
Caitlin Croft: 00:50:11.324 input.snmp.table or input.snmp.table.field, which option is good if you have more than 20,000 VLAN interfaces in one device, like if you had approximately 1,000 devices?
David Flores: 00:50:31.833 Yeah. So I was actually on the verge of answering that one. So it depends. In the end, between the table and the field, the table is the one that mimics the SNMP GETNEXT or GETBULK method while the other one leverages the GET method. Right? So from a table perspective, you’re using bulk. So if you’re trying to collect multiple VLANs, so a lot of data, definitely a bulk method is better instead of doing 30,000 GET methods because the overhead of establishing the connection and sending that data is going to be a lot more.
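Illustrative arithmetic for the GET versus GETBULK point above (not from the webinar): a rough request count, assuming one value per GET request and up to `max_repetitions` values per GETBULK response. The figure of four counters per interface and the max-repetitions value of 10 are assumptions chosen for the example, and real agents and collectors can also pack several OIDs into one GET, so treat the numbers as an order-of-magnitude sketch only.

```python
import math


def get_request_count(rows: int, columns: int) -> int:
    """One GET per value: every row/column cell is its own request."""
    return rows * columns


def getbulk_request_count(rows: int, columns: int, max_repetitions: int = 10) -> int:
    """Each GETBULK response is assumed to carry up to max_repetitions values."""
    return math.ceil(rows * columns / max_repetitions)


# 20,000 VLAN interfaces (from the question) and 4 counters each (assumed).
rows, columns = 20_000, 4
print(get_request_count(rows, columns))      # 80000
print(getbulk_request_count(rows, columns))  # 8000
```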
Josh VanDeraa: 00:51:16.175 And I’m going to jump in here to add some color commentary to that. When you’re talking about many interfaces, you are going to want to double-check the polling as well. If you’re doing this with SNMP, that may take more than a minute. And so if you do have your resolution at a minute, you’re making another SNMP request before you even finish the previous one. And so, yes, we’ve got a router here with five, seven interfaces, something like that, and so easily [inaudible] one-minute polling intervals can take. But I do just want to put that small question out there: when you’re talking about that many VLAN interfaces, double-check the polling. And that’s, again, where gNMI wins because it doesn’t have to gather all this data. It just knows how to handle that and is a little bit more - it’s more friendly to the networking devices.
Caitlin Croft: 00:52:10.466 Perfect. So there’s a couple of questions around the git URL for the demo script. So if it’s possible to share, I’m happy to link those on the page so that if you go and rewatch - if you go and watch the replay, the link will be available if that works for you.
Josh VanDeraa: 00:52:29.334 And we’ll have that in the slides, actually. We did not want to put this in git because then if we put it on a git URL, we’d need to maintain it over time. And so this being a point in time reference, we felt it would be better that we just put this into the actual slides rather than have it available on git. And again, it’s only because of the long-term maintenance of what happens a year from now and things have changed, so.
Caitlin Croft: 00:53:00.282 Totally understandable. Will you show an example of gNMI configuration on the device side?
David Flores: 00:53:09.879 Sure. I think I can show it. It’s pretty simple actually. It’s just two lines. Let me [crosstalk] -
Caitlin Croft: 00:53:17.508 You guys are very popular. So many questions.
Josh VanDeraa: 00:53:21.246 And while David’s pulling this up to be able to share, we’ve got Arista. We’ve got experience of each of these. The configuration will be different per vendor depending on how the vendor has chosen to put this into their configuration environment.
Caitlin Croft: 00:53:43.505 And I know we’re coming up to the top of the hour. We will be continuing to answer all of your questions, so I understand if you need to drop off. But we will be answering all the questions, and they will be available on the recording.
David Flores: 00:53:59.751 Yeah. So this is a [inaudible] device and you can see it here, pretty straightforward.
Caitlin Croft: 00:54:08.462 Perfect. Thank you. Does Telegraf have alerting - or sorry, does Telegraf have error handling capabilities, for example, if the remote network device is not available for metric capture?
Josh VanDeraa: 00:54:25.711 So from what I’ve seen of Telegraf, it will continue to poll if it is unable to get anything. It will not continue down the Telegraf pipeline as we have it. So if it’s polling via SNMP and it does not get anything or if it’s got gNMI and it’s not receiving anything, it just will not output to the output. So if you’re trying to send that to Influx, you’ll start to have no data points. If it’s others that may have a - yeah, so it will not crash the Telegraf instance, but it will not report back that there is an error. That comes back into the other logic handling that you need to implement as well on top of that.
David Flores: 00:55:12.978 And also the configuration on the Telegraf configuration side, there’s an agent, and you can find some log mechanisms like a log file. And all the events that’s happening on Telegraf can be actually sent there, and you can specify the log rotation, the rotation interval. So for example, you can actually send that information over to log file system and then send it to whatever log processing platform you have on the backend. Okay?
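Illustrative sketch of the "other logic handling" mentioned in this answer (not from the webinar): since a failed poll simply produces no data points, one option is to flag devices whose newest data point is older than a few polling intervals. The device names, the 60-second interval, and the three-missed-polls threshold are assumptions; how the last-seen timestamps are obtained (for example, from a query against the time series database) is left out.

```python
from datetime import datetime, timedelta, timezone


def stale_devices(last_seen, now, interval=timedelta(seconds=60), missed_polls=3):
    """Return devices whose newest data point is older than missed_polls intervals."""
    cutoff = now - interval * missed_polls
    return sorted(device for device, ts in last_seen.items() if ts < cutoff)


# Hypothetical last-seen timestamps per device.
now = datetime(2020, 11, 17, 16, 0, 0, tzinfo=timezone.utc)
last_seen = {
    "r1": now - timedelta(seconds=30),   # reporting normally
    "r2": now - timedelta(minutes=10),   # silent for ten minutes
}
print(stale_devices(last_seen, now))  # ['r2']
```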
Caitlin Croft: 00:55:46.754 Do we need to add OpenConfig RPMs to the device, compile, or install OpenConfig models on the device?
David Flores: 00:55:57.010 Okay. So that’s really on a vendor implementation basis. The example that I used is using OpenConfig interface, so it’s basically depending on the data model that you have and the vendor. I have not seen other implementations around it, but I would be really interested to see, well, other options. But I don’t think that’s a limitation as far as [inaudible].
Caitlin Croft: 00:56:29.443 Okay. There could be cases where we need to pick partial interface data from SNMP and the rest of - or and from gNMI. Can we do this in Telegraf?
David Flores: 00:56:42.376 Yeah. The simple answer, yes. Basically, you can. You do need to look on the configuration side of things of SNMP. And depending on the OID, you can do queries on the specific OIDs, and you can query on a specific part of your gNMI data model, the entire path, to get the data that you want to collect from and then process that.
Caitlin Croft: 00:57:12.239 If I have 1,000 devices, do you recommend one container for the Telegraf agent per device?
Josh VanDeraa: 00:57:21.701 One of our colleagues at Network to Code, his handle is it depends, and I think that’s the answer here. It depends. My personal flavor is I prefer to have one instance of Telegraf running per network device. This gives you the ability to modify the configuration. Let’s say something’s happened, and you need to make [inaudible] changes to it, to only take down polling for the one device and not for all the devices. I also do prefer the Telegraf instance to be in a container as well. But if that becomes unruly for you, there’s nothing to say that we can’t - we’ve been down both sides of the aisle, so to say, is we’ll configure it as what makes sense in your environment and from an operations perspective. We don’t want to create operation burden, especially while you’re early getting on - or getting started with Telegraf. You may want to do a single Telegraf instance or Telegraf config and instance per device until you’re satisfied with the data you’re collecting.
David Flores: 00:58:32.109 And I would like to add on that, benchmarking is really important here, and baselining the infrastructure on both sides. On the collector side of things and also depending on the model of the device that you have, you may have some constraints on either side. So start, as Josh said, with just one configuration, roll with it, and try to do some benchmarks on it. So based on that, you can see how you are going to scale in the future. Right? And another thing that is important is to think of how you can automate this. Right? So Josh said at the beginning that we need to start looking at templating because if you have a large-scale network, you need to do that. Right? So it’s a lot easier to just replace a configuration file on an agent device than trying to modify the contents of a file on multiple devices. So there are some tradeoffs on that side as well.
Caitlin Croft: 00:59:30.010 Great. Is there a way to enrich data during ingestion using a lookup file of some sort?
David Flores: 00:59:40.720 There’s the Execd. I actually like the Execd processor which is basically you can do almost whatever you want. Lookup file, I haven’t used it but there could be [crosstalk].
Josh VanDeraa: 00:59:54.124 Yeah. And on top of that, what David’s saying with the Execd, is we have done a Python’s - or a Python executable as an Execd process, and so that would be the best part. Although at the same rate ingestion, the processors are quite powerful, and I would recommend sticking towards that. It’s written in Go, and all the processors are also in Go, and so you get some inherent speed from that as well.
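Illustrative sketch of a Python executable used as an Execd processor for lookup-based enrichment (not the speakers' code): it assumes Telegraf hands metrics to the process as Influx line protocol on stdin and reads the modified lines back from stdout. The tag names `device` and `name`, the new `role` tag, and the in-memory lookup table are hypothetical; a real lookup would likely be loaded from a file, and the parsing here is deliberately simplified (it does not handle escaped `=` in tag keys or escape spaces in the added value).

```python
import sys

# Hypothetical lookup table: (device, interface name) -> interface role.
INTERFACE_ROLES = {
    ("r1", "Ethernet1"): "spine",
    ("r1", "Management1"): "mgmt",
}


def split_unescaped(text, sep):
    """Split text on sep, keeping backslash-escaped characters intact."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape sequence as-is
            i += 2
            continue
        if ch == sep:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def enrich(line):
    """Append a role tag to one line-protocol metric when the lookup matches."""
    chunks = split_unescaped(line, " ")
    head, rest = chunks[0], " ".join(chunks[1:])  # head = measurement + tags
    tags = dict(t.partition("=")[::2] for t in split_unescaped(head, ",")[1:])
    role = INTERFACE_ROLES.get((tags.get("device"), tags.get("name")))
    if role is None or not rest:
        return line  # no match (or nothing after the tags): pass through unchanged
    return f"{head},role={role} {rest}"


def main(stream_in=sys.stdin, stream_out=sys.stdout):
    """Process metrics line by line, flushing so results are returned promptly."""
    for line in stream_in:
        stream_out.write(enrich(line.rstrip("\n")) + "\n")
        stream_out.flush()
```

When saved as a standalone script, `main()` would be called from an `if __name__ == "__main__":` guard. As Josh notes, the built-in processors are written in Go and are likely to be faster, so a script like this is mostly useful when no built-in processor fits.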
Caitlin Croft: 01:00:31.876 What kind of processor load does this put onto a network device? SNMP polling can use a lot of processor time in some cases. Is gNMI comparable or less intensive?
David Flores: 01:00:46.014 So these are different implementations and depends a lot on how the vendor has implemented the gNMI service and how it behaves on the box. So as Josh said, it depends. Right? But one thing that you need to be aware of is that on Telegraf SNMP side of things is the normal SNMP operation. So it’s still GETNEXT, GETBULK, GET operations that you are performing against a device. So on the wires, it’s still going to behave the same. Okay? On the gNMI, you open a connection, and you have that connection alive and the device is sending the data out. Okay? So normally, that process is really efficient because you don’t have to establish new connections every time you want to collect the data. There is just streaming out the data. Okay? But again, it depends on the platform. It depends on the hardware, the model, and, well, the end-to-end implementation of those services.
Caitlin Croft: 01:01:48.072 Great. Have you simulated streaming telemetry data traffic - or how do you simulate streaming telemetry data traffic?
David Flores: 01:01:58.268 Okay. So the streaming telemetry data is actually, well, the gNMI, so the device is actually sending the data. Right? And the simulation was basically when you started - I’m guessing that the question is around the traffic. So the gNMI is connected to one device. The SNMP is connected to another device. I just connected to one router, perform a ping between two of them. So I started gathering data, and the gNMI is collecting data from the interface counters of where the traffic is actually happening and the same SNMP. So this streaming telemetry side of things is how that data is actually sent to the agent.
Caitlin Croft: 01:02:41.768 Have you deployed the solution in large scale?
Josh VanDeraa: 01:02:48.975 Yeah. We’ve got insight, deployed it ourselves at Network to Code. We’ve been partners with others that have actually done this and some over 1,000 devices easy. And before my time at Network to Code, yes, I’ve done with Telegraf collecting ads. I think it was a small position in 2000, and I didn’t see any reason why I would’ve been hung up if I needed to continue to evolve and make this. In a further past life, I’ve been part of large enterprise, not just enterprise. I do see this being able to scale at that level.
David Flores: 01:03:37.530 Yeah. And also, a quick look of how gNMI is actually made to be, how gNMI is actually - well, one of the configurations [inaudible] is out there, you can see who actually implemented it and pull the [inaudible] out. And their data centers, I can guarantee they’re pretty big.
Caitlin Croft: 01:04:02.017 What is the major difference between when you compare the TIG solution versus the vendor solution? So comparing Telegraf, InfluxDB, plus Grafana versus vendor solutions that come from [inaudible] or Cisco that provides their own monitoring tools? And why should a business move to this architecture?
David Flores: 01:04:27.273 Okay. Go ahead. Do you want to [crosstalk]?
Josh VanDeraa: 01:04:29.862 Yeah. I mean, I’ll gladly go into the single vendor tools. Well, they tend to start to branch out because I do know some of the vendors’ monitoring solution. They say they’ll take care of others because it’s SNMP. The best part about it right now is that we’ve got choices. And the choices here just outside of vendor solutions in the open source community, we have various solutions to be able to look at. And really, it’s about getting into being able to make the data work for your environments, I think, is the biggest thing. When we take a look at what we’ve done with our clients around Grafana or the visualization periods of how we enrich the data, this plugin system is there. It’s open source. If you need something to get fixed, you can open an issue on GitHub and track it there. You don’t send it in to the vendor support and what’s going there. But this is a great time, a lot of great opportunities to enhance our monitoring capabilities. That’s the way I look at it. Now, David, if you want to add some more color.
David Flores: 01:05:44.557 Well, yeah, the one thing is that I’m a strong believer in open source tools. But also, there’s a big advantage here when you are actually comparing a TIG solution with a vendor or proprietary one, which is the openness, right, that you already mentioned. Also, think about it: if you have a small footprint, let’s say that it’s a small enterprise, you can bring data from your servers, from your storage, from your infrastructure, from your cloud instances, from the network devices and have all of them correlated in the same monitoring dashboard, and have queries against them. So I think that’s a really big plus, especially on the operational side of things. I can guarantee that the operations engineers are going to love looking at just one tool rather than going through a plethora of systems.
Caitlin Croft: 01:06:37.913 And like you guys said, the great thing is Telegraf is open source, so it works with a bunch of different solutions beyond just InfluxDB and Grafana. And a lot of companies like Cisco and Juniper, they use Telegraf in their devices, and you can put this data into a number of different databases besides InfluxDB. And there’s a ton of resources online showing how people have done it, and so you can learn tricks from them as well. Let’s see. There’s a few more questions here. Is there any plan to build NETCONF-based telemetry plugins? So whoever asked this, I did ask our product team, and there is an sFlow plugin and there is an issue in GitHub. So I will post the GitHub link as an answer. And then for you guys, does Telegraf support TLS encryption [inaudible]?
David Flores: 01:07:45.829 So yeah, gNMI can handle TLS and client-side TLS certificates. So the streaming part of it, it can support perfectly.
Caitlin Croft: 01:08:00.037 Do we need to add OpenConfig - oh, I think we already answered this.
Josh VanDeraa: 01:08:05.071 Yeah, [crosstalk].
Caitlin Croft: 01:08:05.061 Do we need to add - yeah. Yeah. There also is a JTI OpenConfig telemetry plugin, also OpenConfig. What is the difference?
Josh VanDeraa: 01:08:18.627 All right. I will take that because I have done a project where I have actually done gNMI and JTI going through this exact same process of renaming. JTI, that’s Juniper Telemetry Interface if I’m not mistaken. Really, they both do very similar things. The configuration looks almost identical. And so really, they’re both good plugins. The one was written specifically. The one thing I will say about the data returned from JTI is that it is much more expansive on its tags and field names, whereas gNMI is very short. JTI will actually say, “What is the entire OpenConfig path?” whereas gNMI says, “Okay, we’ve already got the path. We’re only going to put the measurements in.” So really, they’re both getting the same data. gNMI is a little shorter.
Caitlin Croft: 01:09:23.060 So someone asked, is only pull supported? Can we get a push from a proxy for real-time streaming? So the cool thing is InfluxDB can handle both push and pull data, which of course, gives the user more options. Is there a way to reload Telegraf config without reloading the entire Telegraf service?
Josh VanDeraa: 01:09:47.988 So this one I had to double-check because I had always just gone ahead and killed the service and restarted it. But there is an issue on GitHub that’s closed that says if you send a SIGHUP signal to the Telegraf instance, it will go ahead and refresh the config. But again, in my experience, I’ve been pretty much able to kill a single instance and just restart it at that point, so.
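Illustrative sketch of the SIGHUP approach mentioned above (not shown in the webinar): on a POSIX system, a small helper can read a process ID from a PID file and send it SIGHUP. The PID file path is hypothetical and assumes the Telegraf process was started in a way that writes one; in a container setup, signalling the container or recreating it, as discussed in the next answer, may be the simpler route.

```python
import os
import signal


def read_pid(pidfile_path: str) -> int:
    """Read a process ID from a PID file (path is an assumption of this sketch)."""
    with open(pidfile_path, encoding="utf-8") as handle:
        return int(handle.read().strip())


def reload_telegraf(pid: int) -> None:
    """Ask the process to reload its configuration by sending SIGHUP (POSIX only)."""
    os.kill(pid, signal.SIGHUP)
```

A typical call would be `reload_telegraf(read_pid("/var/run/telegraf/telegraf.pid"))`, where the path shown is only an example.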
David Flores: 01:10:16.453 Yeah. And also one thing that we want to take into account here is that the Telegraf process, the idea is to be as lightweight as possible. If you have a large scale network with thousands of devices and [inaudible] in the way that they work. Right? So restarting it, if it’s a container recreate the container in order to make it work again. That’s what you’re trying to achieve in order to have some declarative way to manage that service.
Caitlin Croft: 01:10:53.951 So we’re just going to take two more questions that have come in. I know that you guys have got a lot of questions. So we’ll take the last two and then wrap things up. Someone did ask if our session is being recorded. Yes, it is being recorded, and so the recording and the slides will be available later today. Is there any support for SNMPv3?
David Flores: 01:11:18.731 Yeah. So I’m just looking really quick.
Josh VanDeraa: 01:11:22.790 Yes, there is. I have done it. It adds a few more configuration items such as username, password, and then you have to make sure you get your encryptions right and all that. And again, that’s a little bit more involved than what we want to cover during a demo, but it does support SNMPv3.
Caitlin Croft: 01:11:44.675 At large scale, how do you monitor the container state?
David Flores: 01:11:49.932 Well, there are multiple ways to monitor container state. So I think a normal tool that you will hear a lot about is Kubernetes or OpenShift, and there are multiple solutions out there that can help you, well, maintain, control, create the containers. So Docker Swarm is another option.
Josh VanDeraa: 01:12:16.019 There is a Telegraf plugin for Docker as well so you can monitor the Docker with Telegraf, so.
Caitlin Croft: 01:12:24.893 There are so many plugins. We’re well over 250 Telegraf plugins, and there’s even more coming along the way. All right. Well, I know we’ve gone completely over. Thank you so much. Clearly, there’s a lot of questions around Telegraf. We actually have a webinar coming up in December. I think it will be December 17th, and it will be with our Telegraf team from the product side as well as engineering. So for all of you who are still on the broadcast, be sure to check back. That will be another really great webinar and you can continue asking your Telegraf related questions. Thank you, Josh and David. This was a fantastic session. I think it’s very clear that people liked what you had to say and had lots of questions about your expertise. Thank you very much, everyone, for joining today’s webinar. Once again, it has been recorded. The recording as well as the links will be made available later today. Thank you, everyone, and I hope to see you on our next webinar.
David Flores: 01:13:35.403 Thank you.
Josh VanDeraa: 01:13:35.692 Thank you for having us.
Caitlin Croft: 01:13:37.864 Thank you. Bye.
David Flores: 01:13:39.166 Bye.
[/et_pb_toggle]
David Flores
Senior Network Automation Consultant, Network to Code
David Flores is a Senior Network Automation Consultant at Network to Code focused on Software Development and Automation practices for the NetDevOps world. David is an experienced networker in Service Provider and Datacenter Fabric infrastructures.
Josh VanDeraa
Network Automation Engineer, Network to Code
Josh is a Network Automation Engineer with Network to Code, with over 20 years of experience within various networking practices. Josh has focused on Network Automation over the past 4+ years with Python and Ansible to deliver on daily and project-based tasks. Josh is always looking to improve the IT environment and networks and to get the right data in the right system at the right time.