Cisco NX-OS and InfluxData for Pervasive Network Visibility
Session date: Aug 18, 2020 08:00am (Pacific Time)
Legacy networking has long suffered from limitations in network visibility, largely due to slow, inefficient polling mechanisms such as SNMP. Leveraging gNMI with OpenConfig, Cisco NX-OS, Telegraf and InfluxDB brings a modern approach to network telemetry, providing real-time data via open industry standard transport and format.
In this webinar, Gerard Sheehan, Product Manager for Cisco NX-OS and Russ Savage, Director of Product Management at InfluxData, will share how the Cisco Nexus data center provides telemetry monitoring built on Cisco NX-OS, the industry’s most extensible, open, and programmable network operating system for data centers, Telegraf, and InfluxDB. This powerful combination provides unique, event-driven alerting to solve complex data center network monitoring and alerting challenges.
Related Resources:
- Nexus 9000v docs
- N9Kv download link
- DME model reference
- gNMI docs in the NX-OS programmability guide
- Demo Repo
Watch the Webinar
Watch the webinar “Cisco NX-OS and InfluxData for Pervasive Network Visibility” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]
Here is an unedited transcript of the webinar “Cisco NX-OS and InfluxData for Pervasive Network Visibility”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Chris Churilo: Director Product Marketing, InfluxData
- Russ Savage: Director of Product Management, InfluxData
- Gerard Sheehan: Product Manager, IBNG, Cisco Systems
- Shangxin Du: Technical Marketing Engineer, Cisco Systems
Chris Churilo: 00:00:04.571 Thanks for joining us today for our webinar with Cisco Systems and InfluxData. I’m happy to have three really great speakers who are going to review the new Telegraf plugin that the Cisco NX-OS team has built and how they’re able to use it to collect all kinds of great metrics from Cisco Systems products. So with that, I will let the three gentlemen introduce themselves and get started.
Gerard Sheehan: 00:00:35.487 Hi. Can I go first? Hi, everyone. My name is Gerard Sheehan. I’m a product manager at the Intent Based Networking Group within Cisco, which is essentially the Cisco NX-OS portfolio. I work primarily on programmability, automation and visibility within that family of products.
Shangxin Du: 00:00:58.335 Hi, everyone. This is Shangxin Du. I’m also a technical marketing engineer in IBNG, and I’m focusing on the programmability of Nexus OS and also the telemetry of the Nexus platform. Thank you.
Russ Savage: 00:01:13.093 And hi, my name is Russ Savage. I’m a product manager here at InfluxData and I work across all the products, but mainly focusing on data ingestion, anything you have to do with data analysis, and monitoring, alerting, so I’m glad to be here.
Gerard Sheehan: 00:01:33.012 So, just on that note, today, I’m just going to give a brief overview of kind of why we’re doing this and how we’re doing it and Shangxin will go into the more technical details of how you can integrate Telegraf into monitoring your NX-OS Fabrics. But just to give a quick level set of kind of why we’re doing this. I mean, ultimately what every operator wants is to be able to break operational silos between network, compute and storage. They want that kind of cross-domain visibility and to perform rapid troubleshooting with qualitative and quantifiable data. So they want to be able to know the application geography, the layout, and the mapping of that application to the underlying network. So today, here, we’re just going to focus on the network aspect of this. Because Telegraf and the overall TICK Stack is used across both storage and compute quite broadly. Just focusing today on the network portion of that.
Gerard Sheehan: 00:02:47.226 So control over the network ultimately begins with visibility. It requires a clear understanding of the network, and such knowledge is the first step towards evaluating potential network capacity, performance, or even security issues. Further event correlation can provide timely information about network health. And this is where you want to be able to take that application-level information and tie it to the network. So how is that monitoring being done today? Today it’s being done via SNMP, syslog, or CLI, and it really does leave a lot to be desired. The challenges with SNMP are well understood across the industry. But they’re often not a huge area of focus for most vendors in terms of how they can be optimized or the amount of data that they can provide.
Gerard Sheehan: 00:03:49.672 Pervasive network visibility continues to be a top priority for our customers. So like I said, legacy approaches tend to be architected on SNMP, syslog and CLI screen scraping, providing a basic level of insight and visibility into the infrastructure. They just do not provide visibility into the state of the fabric or how it would impact application performance. SNMP polling can be on the order of minutes, which is an eternity in a modern deployment, often missing vital events and alerts that have occurred and leading to visibility blind spots for customers. CLIs are unstructured, and a change to the structure of that output will ultimately break any scripts that have been used to gather that information. Similarly with syslog: it is sent over UDP, an unreliable transport, and is again unstructured. Additionally, all of these methods are sending data to different points. So there’s a heavy burden on the back end to pull data from all of these different points and gather and correlate it across a fabric, and quite a lot of normalization then needs to be done before any analytics or alerting can be done upon that data.
Gerard Sheehan: 00:05:11.561 So a new approach is ultimately needed to address these limitations. What customers need and want is an efficient, modern, scalable infrastructure to meet their monitoring requirements. They want to be able to gather data as events occur, as well as interval-based updates, via a model-based interface that can be easily consumed and integrated into the broader monitoring infrastructure. They want to be able to take this data and, through further analysis, predict and mitigate network outages or network incidents. So just some of the rich network data that can be gathered includes microbursts, buffer utilization, [inaudible] utilization, and QoS data, as well as the traditional data that has always been gathered, including environmental information, protocol state and events, and link utilization, except now it is being gathered with much finer granularity.
Gerard Sheehan: 00:06:17.212 So these are just two very high-level use cases for telemetry within the data center, broadly broken down into availability and network operations. Availability is interfaces: is our link up or down? Is it online? Data plane statistics; environmental monitoring, so power supplies and fans; then CPU and memory, these types of traditional things that have always been monitored; and protocols: TCP, IP, BGP, OSPF, or an EVPN fabric. You gather this information with much, much finer granularity. Network operations is then broadly defined as hotspot detection and congestion monitoring. So how much of a link’s bandwidth is being used? And how is it trending over time? Can I alert on that before it actually becomes overutilized or oversubscribed within the fabric?
Gerard Sheehan: 00:07:28.386 Similarly, buffer utilization and microburst detection: if a microburst goes over a certain threshold, you want to be able to trigger an alert based upon this, and then know the duration of that microburst and when it came back below the low-water mark, back to what would be defined as a normal level of utilization. So just to briefly touch on how this is being done. Our DME is our structured data model. Shangxin will go into this in much, much more detail. The DME is the foundation of our automation and visibility platforms. It’s a store of both the configuration and the operational status of the protocols and processes.
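Editor’s illustration: the microburst alerting described in this segment (fire when utilization crosses a high threshold, clear when it drops below a low-water mark) can be sketched as simple hysteresis. This is a minimal sketch with made-up sample data and hypothetical threshold values; it is not how NX-OS or Telegraf implements burst detection.

```python
def detect_bursts(samples, high, low):
    """Return (start_index, end_index) pairs for each burst.

    A burst starts when a sample rises above `high` and ends when a later
    sample falls back below `low`. Using two thresholds avoids flapping
    alerts when the value hovers around a single threshold.
    """
    bursts = []
    start = None
    for i, value in enumerate(samples):
        if start is None and value > high:
            start = i  # an alert would fire here
        elif start is not None and value < low:
            bursts.append((start, i))  # burst over; duration is i - start
            start = None
    return bursts


# Buffer utilization in percent, one sample per interval (made-up data).
utilization = [10, 20, 85, 90, 60, 30, 15, 95, 40, 10]
bursts = detect_bursts(utilization, high=80, low=35)
print(bursts)
```

The duration of each burst is the difference between the two indices multiplied by the sampling interval, which is why fine-grained telemetry matters here.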
Gerard Sheehan: 00:08:37.995 The DME is organized as a hierarchical tree structure of what we call “MOs” or managed objects, which exist at each level of the tree and represent a discrete element of configuration or operational state. So when you configure telemetry to stream data out of DME, you’re identifying the path in the model where the object or objects exist, and you specify the individual MOs. You can also specify a certain depth below that specific level, so it would include that one object and the children objects of that object in the tree. We also support filtering, amongst other capabilities, such as cadence-based or alert-based streaming. We often get asked how you figure out which objects contain which data. So we published the entire DME tree on developer.cisco.com so that you can take this and understand the objects, the tree, and where the relevant data points you want would be within the structure.
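Editor’s illustration: the idea of addressing a managed object by path and then collecting it together with its children down to a chosen depth can be sketched with a toy tree. The object names below are hypothetical placeholders, not real DME distinguished names, and the classes are illustrative only.

```python
class ManagedObject:
    """A toy managed object: a name, some attributes, and child objects."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child


def find(root, path):
    """Walk a '/'-separated path of object names, starting at root."""
    node = root
    for part in path.split("/")[1:]:
        node = next(c for c in node.children if c.name == part)
    return node


def collect(node, depth):
    """Return object names from `node` down to `depth` levels of children."""
    names = [node.name]
    if depth > 0:
        for child in node.children:
            names.extend(collect(child, depth - 1))
    return names


# Hypothetical tree: sys -> intf -> phys -> counters, plus sys -> bgp.
sys_mo = ManagedObject("sys")
intf = sys_mo.add(ManagedObject("intf"))
phys = intf.add(ManagedObject("phys", {"operSt": "up"}))
phys.add(ManagedObject("counters", {"inOctets": 0}))
sys_mo.add(ManagedObject("bgp"))

shallow = collect(find(sys_mo, "sys/intf"), depth=1)
deep = collect(find(sys_mo, "sys/intf"), depth=2)
print(shallow, deep)
```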
Gerard Sheehan: 00:09:54.582 So additionally, on top of our DME, the data management engine, we also have both native and OpenConfig YANG models. If you’re not familiar with YANG, it’s an [inaudible] standard data modeling language that we use to define [inaudible] of the operational space for transmitting over various transports. [inaudible], gRPC; gNMI is what we will focus on today. So we have implemented this [inaudible] our native YANG models, which are essentially a one-to-one mapping between our backend DME and our native YANG models. We also have our OpenConfig YANG models. These are vendor-neutral, common data models defined by operators. This has primarily been defined in the MSDC, web-scale space, together with SPs, who’ve organized with the goal of defining a set of common managed objects that should be available for configuration and streaming across the board, regardless of the vendor.
Gerard Sheehan: 00:11:06.839 And this is really gaining a lot of industry traction right now. We are incrementally expanding the number of OpenConfig models that we support and fleshing out the implementation to include both configuration and operational status. This is just a very quick flash of the OpenConfig models that we do support. Initially, we targeted configuration only. In our most recent release, 9.3(5), we expanded this to support quite a lot of operational data so that it can be gathered via gNMI, integrated into Telegraf, and taken into the overall ecosystem. One of the primary questions that we’ve often heard from customers is how to integrate this across various OSs: NX-OS, IOS XE, or IOS XR.
Gerard Sheehan: 00:12:16.370 And gNMI and OpenConfig are the common integration; together with Telegraf, that provides the solution to our customers. So with the Telegraf plugin and the broader ecosystem integrations, you can alert, you can build your visualizations, and you can gather data from other data sources, so you can gather it from the compute and you can gather it from the storage aspects. And it’s that single pane of glass, again, across multiple OSs. Additionally, the transport and connectivity are secure and reliable: authentication and encryption are both supported across that transport. And one of the major drivers is also being able to easily scale out a deployment of Telegraf, so that if you need additional instances, or need to consume additional information or other data points, this can easily be done with Telegraf in conjunction with NX-OS.
Gerard Sheehan: 00:13:27.909 The operational simplicity of this is that everything is defined on the client. So in this situation, Telegraf sends the switch the information about what it wants and how often it wants it, and in turn the switch will send that information back, so there is minimal configuration on the switch. And there’s great operational troubleshooting flexibility in this type of deployment. If, during a troubleshooting scenario, you need to dial up the fidelity of the data being gathered or you want to gather additional data points, it’s as simple as restarting Telegraf with a new configuration file. Think about that in comparison to going into the configuration of potentially hundreds of thousands of switches and changing it to gather that additional information. I really do want to highlight that operational simplicity and how this can be built on. And Shangxin, I’ll pass this over to you now, and he will go into how you can do that and how that integration can be achieved. Thank you.
Shangxin Du: 00:14:43.033 Hi everyone. I already introduced myself; I’m Shangxin Du. I’m a technical marketing engineer in IBNG, focusing on programmability and also telemetry and visibility in Nexus OS. So in the next session, I’m going to briefly cover gNMI dial-in: what gNMI is, how gNMI is implemented in Nexus OS, and also some design considerations for the way you design your visibility system or telemetry system and choose between the dial-out and dial-in approaches. And finally, I’m going to spend probably five to seven minutes going through a quick demo of how to use streaming telemetry from Nexus OS to Telegraf. Okay, so today’s topic is focusing on gNMI. So let’s take a look at what gNMI is. It is a network management interface, just like an API, right? So it’s provided by the device, just like an API is provided by any software product.
Shangxin Du: 00:16:02.299 However, we all know that the problem with APIs is that the data structure and operations are defined by each vendor. So in order to use an API, you have to read the API reference of each vendor. There’s no real consistency; the API is designed by each vendor, with different URLs and different supported operations. So gNMI is aiming to solve that. Using gNMI, theoretically, as long as a common data model like YANG is supported by multiple vendors, with the same set of gNMI operations you could manage any type of device and stream data from it. The letter “g” here stands for gRPC, which is a remote procedure call framework built on top of HTTP/2. As it uses HTTP/2 as the application layer protocol, streaming using gNMI can benefit from the features provided by HTTP/2, like flow control. gNMI only defines the management interface; it only defines the operations, and it doesn’t really specify any data.
Shangxin Du: 00:17:08.792 So the data that it carries can be anything. It can be JSON, and it can be XML. It can even be the [inaudible]. But most of the use cases are using GPB, Google Protocol Buffers. So it sounds familiar, right? Yes, it has a lot in common with NETCONF and RESTCONF; it is another implementation of network device management. However, it does provide some unique capabilities for streaming telemetry. gNMI is based on gRPC, so it has to define some operations. The operations it supports are Capabilities, Get, and Set, which are quite like the NETCONF capabilities, get, and edit operations, if you’re familiar with NETCONF or RESTCONF. But it also defines a unique operation called Subscribe. This is the RPC that we are using now for gNMI dial-in telemetry. In the Subscribe message, the gNMI client will tell the network device what data it is interested in: interface counters, the state of BGP neighbors, CPU, and any environmental data. Then the device will stream the data to the gNMI client based on the [inaudible] mode, like Gerard mentioned, and how often you want the data.
Shangxin Du: 00:18:30.570 The device can stream the data every 10 seconds or every minute, even if there’s no change. This is usually called sample-based telemetry or cadence-based telemetry. Or the device can stream the data only when there’s a change, which we usually call event-based telemetry, so it’s triggered by change or triggered by event. For example, we only want to stream the interface state whenever there is any [inaudible] or flapping on the interface, which is a common use case. So, needless to say, for data that is constantly changing, like interface counters or CPU and memory, we would like sample-based telemetry, and for data that tends to be stable, like the routing table or the BGP peer state, event-based telemetry should be enough. So this is the architecture of Nexus OS telemetry streaming. Nexus OS is a fully modular operating system, and the streaming engine takes data from different sources, but DME is one of the major sources. Gerard already mentioned that DME is short for Data Management Engine, and it stores all the configuration data and the operational data of the switch.
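Editor’s illustration: the difference between sample-based and event-based (on-change) streaming described above can be sketched with two small functions over made-up observations. This is a toy model of the two modes, not the gNMI Subscribe implementation.

```python
def sample_updates(observations, interval):
    """Sample mode: emit the value at every `interval` seconds, changed or not."""
    return [(t, v) for t, v in observations if t % interval == 0]


def on_change_updates(observations):
    """On-change mode: emit a value only when it differs from the previous one."""
    updates = []
    previous = object()  # sentinel that never equals a real value
    for t, v in observations:
        if v != previous:
            updates.append((t, v))
            previous = v
    return updates


# (time in seconds, interface oper-state), made-up data with a short flap at t=15.
states = [(0, "up"), (5, "up"), (10, "up"), (15, "down"),
          (20, "up"), (25, "up"), (30, "up")]

sampled = sample_updates(states, interval=10)
changed = on_change_updates(states)
print(sampled)
print(changed)
```

With a 10-second cadence the flap at t=15 never appears in the sampled output, while the on-change stream reports it with fewer messages. For values that change on every read, such as counters, on-change would emit constantly, which is why sampling suits them better.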
Shangxin Du: 00:19:50.049 So this is the native data model of Nexus OS, and it can be directly queried from the API of Nexus OS. In order to support the open standard model, the YANG model, the native data is also converted to a YANG model. Of course, the conversion will introduce a little bit of latency between the collection and the streaming. But compared to the time that the collection and streaming consume, in most cases it can be ignored. The [inaudible] engine can also take the flow information from the ASIC directly and then stream it out. This is actually exactly the same technology that we are using now in Tetration and Network Insights from Cisco. But hardware telemetry is not really the focus of today; we will only focus on software telemetry and, more specifically, gNMI dial-in telemetry. So with the gRPC and gNMI support, you could build your own collector, right? It’s an open standard and it’s not proprietary, so you can build your own. Or you can use a vendor-provided solution like Cisco’s [inaudible], or you can build your own telemetry system and use an open-source tool like Telegraf.
Shangxin Du: 00:21:03.987 So how do we implement it in Nexus OS? gNMI has been supported since Nexus OS 9.3(1). It initially supported the Subscribe operation only, for telemetry, but the full set of operations is supported on the latest version, which is 9.3(5), and it is implemented based on gNMI version 0.5.0. We support both ON_CHANGE, which is the event-based subscription, and the SAMPLE-based subscription. Coming to data encryption, TLS is always on and it cannot be disabled. So you need to configure a certificate for data encryption, which is also the best practice for a production environment, but for testing purposes Nexus OS will also generate a self-signed certificate if one is not provided. Regarding the encoding, or data serialization, we chose to use Google Protocol Buffers for encoding, and more specifically we are actually using a more self-describing version of it called key-value GPB.
Shangxin Du: 00:22:12.298 I think most people are quite familiar with other encodings like [inaudible]. These are all tree-like data structures, and they are all self-describing, as all the keys and values are strings. This is to make the data, or the message in the payload, human readable. But the problem is that it’s also not that efficient for data encoding, because it makes the packet really big. So GPB goes in the other direction. Everything can be binary. But because everything is binary, both the client and the server side have to use a common .proto file to decode the data, and that means you have to use a separate .proto file for each individual model. So that will bring a lot of complexity to operations and also to developers. This is the original version of GPB, which we call compact GPB, as the data are all compacted together as binary.
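Editor’s illustration: the trade-off between a self-describing encoding and a compact binary one can be shown with the Python standard library. This is an analogy only: it uses JSON and `struct`, not actual Protocol Buffers, and the field names and counter values are made up. The shared `FIELD_ORDER` tuple plays the role that a common schema file plays for a compact binary format.

```python
import json
import struct

counters = {"in_octets": 123456789, "out_octets": 987654321, "in_errors": 0}

# Self-describing: the keys travel with every message as strings.
as_json = json.dumps(counters).encode("utf-8")

# Compact: only the values travel; both sides must agree on the field order.
FIELD_ORDER = ("in_octets", "out_octets", "in_errors")
as_binary = struct.pack("!3Q", *(counters[name] for name in FIELD_ORDER))

# Decoding the compact form is impossible without the shared field order.
decoded = dict(zip(FIELD_ORDER, struct.unpack("!3Q", as_binary)))

print(len(as_json), len(as_binary))
```

A key-value style sits between the two: keys stay readable, so one generic schema can decode every model, at the cost of a larger message than the fully compact form.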
Shangxin Du: 00:23:13.711 So we chose to stand in the middle ground. I don’t have enough time to explain it all, but in a nutshell KV-GPB is a good balance between being easy to use, or easy to develop with, and [inaudible] efficiency. Its encoding efficiency is better than JSON or XML, but not as good as compact GPB. But it is quite friendly to users and developers, as the keys are still in string format. So in this way, with all the YANG models, you only need a unified, single .proto file, which we call [inaudible], to decode the content. Coming to the data model, we support YANG, of course. We support both the OpenConfig model and the native YANG model, which is a Cisco-specific model, and also DME. So this is a subset of the models that we support on the latest version. In order to stream the OpenConfig YANG model, an RPM package needs to be installed on Nexus OS. But if you’re streaming the native YANG model or DME, this is not required. Usually, when a model is supported, it doesn’t really mean all the paths in that model are supported. So be aware of the deviations: if some specific path is not supported, usually the vendor will create a deviation to mark it.
Shangxin Du: 00:24:41.773 However, you can always fall back to the native YANG model. The native YANG model is basically a one-to-one mapping to our DME objects. The OpenConfig models don’t really cover all the use cases. Sometimes they only define the raw data, for example, interface counters. They only stream the current interface counters since the counters were last cleared, but what the user actually needs is link utilization, right? I want to see what the packets per second or bytes per second of my interface are, so I know what the utilization of the interface is. So you will still need to calculate that yourself if you are using the OpenConfig YANG model, but with the native YANG model we have a lot of useful metrics that are not defined in the open standard. Okay. So coming to design considerations: we use gNMI for dial-in telemetry, so what does that really mean?
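Editor’s illustration: the calculation mentioned above, turning raw cumulative interface counters into bytes per second, packets per second, and link utilization, can be sketched as follows. The field names and numbers are made up, and the sketch deliberately ignores counter wrap and counter clears, which a real collector would need to handle.

```python
def link_rates(prev, curr, interval_s, speed_bps):
    """Derive per-second rates from two cumulative counter samples.

    `prev` and `curr` are dicts with 'in_octets' and 'in_pkts' counters taken
    `interval_s` seconds apart; `speed_bps` is the link speed in bits/second.
    """
    d_octets = curr["in_octets"] - prev["in_octets"]
    d_pkts = curr["in_pkts"] - prev["in_pkts"]
    bits_per_s = d_octets * 8 / interval_s
    return {
        "bytes_per_s": d_octets / interval_s,
        "packets_per_s": d_pkts / interval_s,
        "utilization_pct": 100.0 * bits_per_s / speed_bps,
    }


# Two samples taken 10 seconds apart on a 10 Gbit/s link (made-up values).
prev = {"in_octets": 1_000_000, "in_pkts": 1_000}
curr = {"in_octets": 13_500_000, "in_pkts": 11_000}
rates = link_rates(prev, curr, interval_s=10, speed_bps=10_000_000_000)
print(rates)
```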
Shangxin Du: 00:25:44.600 We speak from the switch’s perspective. Dial-out means the switch sends the SYN packet to start the TCP handshake and sends out all the data for whatever sensor paths are configured on the switch. Dial-in is the reverse. The telemetry collector defines what data it is interested in and subscribes to the sensor paths on all the devices. The device just listens on the gNMI interface and sends all the data that it is told to. The TCP connection is initiated from the collector side. That’s why it is called dial-in. So when you design the telemetry system, there is some consideration about which method you would like to use. I will not say that one is superior to the other, because there are always pros and cons, and you really have to choose based on your actual requirements and environment. With dial-out, load balancing is simple for a big fabric: put a load balancer in the middle, and the collector can be scaled out easily.
Shangxin Du: 00:26:50.394 However, you need to configure the sensor paths on each individual device. But to be honest, in most cases it’s not really a big problem, as we have so many ways to automate that, and Ansible could be one of them. With dial-in, you just need to configure the sensor paths you are interested in on the collector. And the configuration on the device is super simple; it’s probably two to four lines of commands. However, as the connection is from outside to the device, a specific port for gRPC or gNMI needs to be open on the management network, which not all customers will like. And coming to load balancing: because of the direction of the TCP connection, and because it is using the same TCP connection for subscription and streaming, load balancing will not be as straightforward as with dial-out. You may have to evenly distribute the switches between different collector instances, and then you have to think about how to synchronize the sensor path configuration between the different collectors, right?
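Editor’s illustration: one simple way to spread dial-in targets evenly across collector instances is a round-robin assignment, with every collector sharing the same list of sensor paths. The switch and collector names are hypothetical, and this is only one possible scheme; the talk does not prescribe a method.

```python
def assign_switches(switches, collectors):
    """Round-robin switches across collector instances.

    Sorting first makes the plan deterministic, so regenerating it with the
    same inputs does not move switches between collectors.
    """
    plan = {collector: [] for collector in collectors}
    for i, switch in enumerate(sorted(switches)):
        plan[collectors[i % len(collectors)]].append(switch)
    return plan


switches = ["leaf1", "leaf2", "leaf3", "leaf4", "spine1", "spine2"]
plan = assign_switches(switches, ["telegraf-a", "telegraf-b"])
print(plan)
```

Keeping the sensor paths in a single shared source and generating each collector’s configuration from it is one way to address the synchronization concern raised above.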
Shangxin Du: 00:27:54.098 So those are some design considerations. There are pros and cons; choose based on your environment and preference. Finally, coming to the configuration: to configure the gNMI interface on Nexus OS, you just need to enable the feature and then use a Nexus OS trustpoint to configure the certificate for gNMI. So in total it is probably two to four lines of commands on the device. I will demonstrate that quickly later. Next, coming to the collector. Most of the configuration is actually on the collector side. You will also need to configure the certificate for verification, and then the username and password for authentication. And since we are only streaming or querying data from the switch and don’t really change anything on the switch, it doesn’t have to be the network-admin role; the network-operator role is enough. Then the rest of the configuration is the sensor paths: you just define what you want and how often you want the data.
Shangxin Du: 00:28:57.163 So that’s the configuration workflow. Next, let’s come to a quick demo. Before we come to the demo, this is the topology that we’re going to use today. This is a typical [inaudible] fabric. We have two spines and four leaves, and I have one instance for Telegraf and InfluxDB. So in this demo, Telegraf is only going to dial in to the leaf switches to stream some environmental data and the [inaudible]. Then the switch will stream the data back to Telegraf, and Telegraf takes it and writes it into InfluxDB. At the end, we can use Chronograf to visualize the data, and I have some pre-built dashboards to show you. Okay. All right. So this is all going to be a live demo. I hope -
[silence]
Shangxin Du: 00:30:00.120 Okay. Okay, it’s connecting. Okay, first let’s come to one switch, and I’m going to show you how to enable gRPC on the switch. So first we need to configure the certificate for gRPC; as I mentioned, TLS is always on and it cannot be disabled. So in a production environment, you will definitely need to configure a customized certificate for gRPC. We already copied the certificate to the switch. This is gnmi.pfx, which includes the certificate and also the private key. We’re going to import it to the switch using the crypto command. So first, we need to create the crypto trustpoint and name it gnmi, and then import the file into the gnmi trustpoint, with the file name and also the password. Sorry. So now the certificate is imported to the switch, and you can verify the certificate you just imported.
Shangxin Du: 00:31:11.678 As you can see, it’s a [inaudible] self-signed certificate, but in the real world you can use the [inaudible] signed certificate for sure. Next is how to enable gRPC. To enable the gRPC agent, you just need to type feature grpc. This command is going to take a couple of seconds to enable, because on the first run it has to generate the self-signed certificate and also register itself to different sources to make sure it can receive events from the different sources, from interfaces and different processes. So now gRPC is enabled, and once gRPC is enabled, you can use the gRPC gNMI service [inaudible] to verify its current state. As you can see, the gRPC agent by default is going to run in the “management” VRF, listening on port 50051. You can also see that by default it will generate a self-signed certificate, but it’s only valid for one day. So this is only for testing purposes, right?
Shangxin Du: 00:32:12.984 And now we want to specify the customized certificate for the gRPC agent. In order to do that, we use grpc certificate with the trustpoint name, which here is gnmi. Once you configure the customized certificate, the gRPC agent is going to restart, and we can use the same command to verify the current state. As you can see, now it is initializing, and it will probably also take a couple of seconds to reinitialize the gRPC endpoint. We wait for a moment and – okay, now it’s up, and you can see the certificate has changed to the one that we just imported, which is valid for 10 years. So that’s it. We have already finished all the gRPC configuration for one switch. Then let’s come to the [inaudible]. As I mentioned, in this demo we’re going to use Telegraf, InfluxDB and Chronograf to demonstrate how we’re streaming. And because we haven’t started the services yet, I’m going to use docker-compose to help with this job.
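Editor’s illustration: the switch-side steps walked through in the demo (create a trustpoint, import the PKCS#12 file, enable the feature, bind the certificate to the gRPC agent) can be collected into an ordered command list, for example to feed an automation tool. The command strings reflect the editor’s understanding of NX-OS CLI syntax and were not shown verbatim in the transcript; the file location and passphrase are placeholders. Check them against the gNMI section of the NX-OS programmability guide before use.

```python
def grpc_setup_commands(trustpoint, pkcs12_file, passphrase):
    """Return the configuration commands for the demo's steps, in order.

    Assumed syntax (editor's recollection, not taken from the transcript):
    a trustpoint is created, the PKCS#12 bundle is imported into it, the
    gRPC feature is enabled, and the agent is pointed at the trustpoint.
    """
    return [
        "crypto ca trustpoint {}".format(trustpoint),
        "crypto ca import {} pkcs12 {} {}".format(trustpoint, pkcs12_file, passphrase),
        "feature grpc",
        "grpc certificate {}".format(trustpoint),
    ]


# 'bootflash:gnmi.pfx' and '<passphrase>' are placeholders.
commands = grpc_setup_commands("gnmi", "bootflash:gnmi.pfx", "<passphrase>")
for line in commands:
    print(line)
```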
Shangxin Du: 00:33:14.893 The docker-compose file is easy; it just starts the services necessary for today’s demo, like InfluxDB, Telegraf and also Chronograf. Because we have to do some configuration work, for example to create the Docker volumes and also generate the certificate if needed, I created a script to start everything. This script will do all the preparation work and will also import some pre-built dashboards into Chronograf, just for [inaudible] purposes. In this script, we have two pre-built dashboards: one is for MPP dial-out and one is for gNMI dial-in. This is only for demo purposes. In real-world production, you can always combine the data that you are collecting from any source into your telemetry system. This is just for the demo. And now everything is ready, and we can verify that every container is up and running. There’s no problem.
Shangxin Du: 00:34:19.221 Okay, let’s go to the GUI and refresh. As you can see, we have two dashboards, and we will only focus on the dashboard that’s the [inaudible] for gNMI. So now you can see that we are receiving data from one switch, and it’s combined - let me make it a little bit bigger. So it combines the environmental data, like CPU, memory and temperature, and also the protocol data, like the EVPN paths you learn on this switch, the routes, the BGP peers, and also some interface [inaudible], the ingress and egress octet rate. Like I mentioned, this one is not really available in the OpenConfig models, so it’s streaming from the native model. So it’s combined from different sources: OpenConfig and native YANG [inaudible]. So yeah, it’s pretty straightforward. As you might notice, I’m only streaming from one switch. Even though you can see that the configuration of gRPC on the switch is relatively simple, just a couple of lines of commands, typing the commands on each individual switch is still a little bit tedious.
Shangxin Du: 00:35:32.809 So here we’re going to use the help of Ansible. With Ansible you can quickly automate this configuration, and here I’m going to use it to configure the rest of the switches. The playbook is really simple. It just copies the gNMI certificate from the collector, the server, to the other switches. On the first one, I pre-copied it before the demo. Then it creates a trustpoint, enables feature grpc, and also configures the customized certificate for the gRPC agent. So it repeats all that we have done on the first switch. But with Ansible, we can quickly automate this process for hundreds of switches. So now everything’s ready and let’s go to the second switch. Sorry, let’s go to the second switch to make sure that the gRPC agent is running without any problem, so we can use the same command: show grpc gnmi service statistics.
Shangxin Du: 00:36:44.549 So as you can see, after we create a customized certificate, it is actually still reinitializing, so we can wait a little bit. Okay, now it’s ready. And it also has the customized certificate installed, which is the same one that we use on the first switch. So let’s go back to the collector, and when we refresh the page, you will notice that we’re still - oh, okay, perfect. It’s actually quicker than what I thought. So as you can see, we are already receiving data from all four switches, and you can switch between them to see the different data from different switches. So yeah, that’s basically pretty much all the demo I have. Thank you for watching.
Gerard Sheehan: 00:37:30.749 Yeah. Thank you. Just one quick question, can you show the Telegraf configuration?
Shangxin Du: 00:37:36.397 Oh, thank you. Thank you for reminding me. Yes. Yes. So I’m going to show you the gNMI-specific Telegraf configuration, and this is the generated configuration for this one. For the gNMI configuration, the first part is the address, or the address and port, for the switch that you want to subscribe to. And the second part is the configuration for username and password for authentication, and then the certificate for the verification and encryption. For the rest of the configuration, these are all the subscriptions. As you can see, you can define a different model. So if this is an OpenConfig YANG model, the origin is going to be OpenConfig, along with the subscription path. And if it is a native YANG model, the origin is going to be “device”. So most of the configuration is going to be subscription sensors. And then the last part is just the “outputs”. You write the data to InfluxDB. So that’s it. All right, cool. Any more questions?
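The layout described above (addresses, credentials, TLS, subscriptions, then outputs) can be sketched roughly as follows. This is a minimal illustration and not the configuration shown in the demo: the address, credentials, file path, subscription names, native YANG path, and database name are all placeholders, and the option names follow the Telegraf `gnmi` input plugin around version 1.15, so they should be checked against the plugin documentation for the version in use.

```toml
# Illustrative sketch only: every address, credential, path, and name is a placeholder.
[[inputs.gnmi]]
  # Address and gRPC port of each switch to subscribe to.
  addresses = ["192.0.2.11:50051"]

  # Username and password used to authenticate to the switch.
  username = "telemetry-user"
  password = "change-me"

  # Certificate used for verification and encryption (hypothetical file path).
  enable_tls = true
  tls_ca = "/etc/telegraf/gnmi-ca.pem"

  encoding = "proto"
  redial = "10s"

  # Subscription against an OpenConfig YANG model: origin is "openconfig".
  [[inputs.gnmi.subscription]]
    name = "ifcounters"
    origin = "openconfig"
    path = "/interfaces/interface/state/counters"
    subscription_mode = "sample"
    sample_interval = "10s"

  # Subscription against the native YANG model: origin is "device".
  # The path below is an assumed example of a native path, not taken from the demo.
  [[inputs.gnmi.subscription]]
    name = "cpu"
    origin = "device"
    path = "/System/procsys-items/syscpusummary-items"
    subscription_mode = "sample"
    sample_interval = "10s"

# Last part: write the collected data to InfluxDB.
[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "telemetry"
```

Adding more switches is then a matter of extending `addresses`, and adding more sensors is a matter of appending further `[[inputs.gnmi.subscription]]` tables.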
Chris Churilo: 00:38:52.964 That was pretty impressive. And I want to - there are some questions that have come in through the chat and Q&A. So we’ll start with the first one, which you actually answered, Gerard, in your slide deck, but let’s just ask it so everyone can make sure that they get a good understanding. So Kareem asked: is DME a YANG model? And Gerard or Shangxin, I’ll let you answer that one so we can get it on the recording.
Shangxin Du: 00:39:22.269 Yeah, I can answer that question. So I already answered that. So DME is not a YANG model. It’s still a tree-based architecture, but it is not a YANG model. However, we do have a native device YANG model that is basically a one-to-one mapping to the DME objects.
Chris Churilo: 00:39:43.776 All right, hopefully that’s cleared up. I thought it was pretty clear but if there’s a little bit more clarity that you’d like, Kareem or anybody else on the call, just let us know in the Q&A. All right. Another question is from Don, and Don asks, is the source code for the Ansible playbook, the docker-compose file etc. available online?
Shangxin Du: 00:40:03.089 Yes, actually - the scripts I’m using are already published on GitHub. I can share - let me see if I can share right now. Yeah, I can share it with you later in the chat, and the open source distribution with the bash is relatively easy.
Chris Churilo: 00:40:22.454 What we’ll do is on the page we have the recording I’ll put the links to those files as well.
Shangxin Du: 00:40:29.680 Okay, I will share it here and I will also share it with Chris later.
Chris Churilo: 00:40:34.419 Cool, cool, cool. So what was your experience in building the Telegraf plugin? Was it really straightforward? Just want to share with the community members that are on this call today.
Shangxin Du: 00:40:46.694 Yeah. The telemetry plugin is really straightforward, and I will say Telegraf is a really nice product because its design is modular and you can design whatever plugin you want and accept any data from any source. So yeah. The setup of the Telegraf plugin is also pretty straightforward. As you can see, the configuration in it is basically self-describing.
Chris Churilo: 00:41:12.695 So I think that one of the cool things that I actually alluded to in the beginning is that application information can often indicate the health of your network. And so, the cool thing about what we saw today is, you don’t have to just collect the telemetry from the switches that you have configured. There’s over 250 Telegraf plugins that Russ and team have built, which means there’s a really wide range of other metrics that you can easily collect. As easily as you saw Shangxin demonstrate in his demo today, it’s just a matter of configuring that plugin to collect from the right source, and then boom, it pulls into InfluxDB and then your favorite dashboard; in this case he was just using the InfluxDB dashboard. But there’s a wealth of other information that you can tie together to really get a much more comprehensive understanding of the state of your switches or your entire application stack for that matter.
Russ Savage: 00:42:11.751 Yeah. And I’d love to jump in there real quick. So we just released a new version of Telegraf, 1.15, and there are some really awesome capabilities in this release of Telegraf that allow you to actually build completely external plugins. And so, there’s a - having all of the plugins in a centralized Telegraf is very convenient, but at some point it grows to a size that’s unmanageable, and so we’ve introduced techniques and technologies that let you design and build plugins that are completely external to Telegraf, and you could host them in your own repository. And so it’s really, really useful for maybe smaller projects or projects that require some proprietary technology to connect to. So it really opens and expands the universe that you can build Telegraf plugins for. And of course the standard mechanism of building a plugin and including it in the main repository is still there for large scale technologies.
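One way an externally hosted plugin can be attached to Telegraf is through the `execd` input, which runs a separate program as a long-lived process and reads metrics from its output. The fragment below is only a sketch under that assumption: the binary path and its config file are hypothetical, and the speakers did not show this configuration.

```toml
# Illustrative sketch only: the plugin binary and its config file are hypothetical.
[[inputs.execd]]
  # Program to run as a long-lived external plugin, with its arguments.
  command = ["/usr/local/bin/my-external-plugin", "--config", "/etc/telegraf/my-external-plugin.conf"]

  # "none": Telegraf does not signal the process; the plugin emits metrics on its own schedule.
  signal = "none"

  # How long to wait before restarting the process if it exits.
  restart_delay = "10s"

  # Format of the metrics the external process writes to standard output.
  data_format = "influx"
```

Because the plugin lives in its own binary, it can be built and versioned in a separate repository, which matches the use case described for smaller or proprietary integrations.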
Chris Churilo: 00:43:16.965 And Kareem, just to reiterate what Russ is saying, so yes, there are over 250 Telegraf plugins that the community’s already built and they cover a wide range of data sources, so those are open source. In addition, Telegraf itself is open source, so you could actually support your favorite data format and share that as a Telegraf plugin just like the Cisco team did with the gNMI plugin. And then, alternatively, as Russ mentioned, we do have a new mechanism where you can very quickly come up with even more plugins to suit the bespoke nature of your environment. All right. Tom [inaudible] asked, is there a plan to support EIGRP? Gerard?
Gerard Sheehan: 00:44:06.198 Yeah. So within OpenConfig, there are no plans to support EIGRP, just given the Cisco-centric nature of the protocol. Yes, within our native YANG models, we do plan to fully support EIGRP. So it’s just a different data source. It would be our native YANG as opposed to OpenConfig YANG.
Chris Churilo: 00:44:31.356 So Tom [inaudible], hopefully that answered your question. If you do have any - oh, and then looks like the telemetry collector was also posted by a couple people in the chat so, but we’ll also add that to the same place that we have the recordings, so if you lose that link, no worries, we’ll make sure we get that to you guys. So, Russ, when you and I first saw the work that this team did, what were some of your first impressions and maybe some things that you might want to point out to the rest of the community about what they’ve built.
Russ Savage: 00:45:12.976 No. I think what I really enjoyed and I think was really powerful they touched on a little bit in the webinar was the fact that you could actually communicate both ways with the Telegraf plugin and so you don’t have to configure 4000 network devices at the same time. So I think that’s a really awesome pattern for situations where you don’t necessarily have direct access to the devices that you’re configuring. So that was really cool and then in general, I thought it’s a great use case for data collection in general and network data. Telegraf has always had the capability of gathering information as we spoke about SNMP and other network protocols. And so adding this just made a ton of sense. So we were really excited to - for Cisco to reach out and build something like this together.
Chris Churilo: 00:46:16.119 So Gerard, what was kind of the inspiration for this project? And how did you guys hear about Telegraf?
Gerard Sheehan: 00:46:25.692 Yeah. So we heard it from our customers essentially. By far and away, this is the integration that our customers were looking for. They were already leveraging the TICK Stack for their existing monitoring infrastructure and they wanted to incorporate NX-OS into this. So we took the initiative. We implemented gNMI and gRPC on the switch side, and then built the integration with Telegraf.
Chris Churilo: 00:47:02.903 The nice thing is not everybody can upgrade or update all their switches all at once, having support for both the legacy SNMP, and these more modern streaming techniques, just makes it easier for people right as they’re trying to make these changes to their infrastructure.
Gerard Sheehan: 00:47:23.670 Absolutely.
Chris Churilo: 00:47:26.191 Let’s see. So once again looks like people are pretty excited to get their hands on the playbook, and like I mentioned, we’ll make sure that all the links to the various repos are - in fact, what we’ll do is we’ll also put it in the email as well as on the page with the recording so everyone can really get that and start playing with this. We’ll just leave the lines open for just another minute or so. But before we close off and wait for some other questions to come in, Gerard and Shangxin, are there any other words of advice or things that you want people to try out?
Gerard Sheehan: 00:48:06.942 One thing I do kind of want to highlight is, we have a virtual form factor of our Nexus 9000, so customers can take this, they can pull what Shangxin has built, and they can virtualize this entire environment. There’s no kind of requirement for physical hardware to be able to begin this journey essentially, so I do just want to call that out as one additional benefit and one additional integration point where customers can start.
Chris Churilo: 00:48:39.855 Oh, that’s pretty cool. If you can also send me the link, then I can also make sure that everyone gets that as well, because that would definitely be really useful to everybody. All right, it looks like there are no other questions, but what often happens in these webinars is we close the webinar and then you come up with a whole bunch of questions, so if that happens, just shoot me an email and I will be more than happy to forward your questions to the team from today’s webinar. And then I’ll make sure that we get all these links to you guys as well as the recording at the end of today. All right. Awesome. So that was very inspirational, I’m super excited, and we can tell by all the chat in our Q&A and chat panel that all our attendees were equally as excited, and they are pretty keen to get started with playing with this. So thank you very much everybody. And we hope you have a pleasant day. Bye bye.
Gerard Sheehan: 00:49:35.787 Thank you.
Shangxin Du: 00:49:36.735 Thank you.
[/et_pb_toggle]
Gerard Sheehan
Product Manager, IBNG, Cisco Systems
Gerard Sheehan is a Product Manager with a focus on Network Automation and Programmability at Cisco Systems. He is passionate about bringing automation strategies together so that modern networks become more agile and adjust to business needs faster. He started his career as a System Engineer at Cisco and holds several certifications in installing, configuring, and managing technologies like VMware, AWS, and various Cisco routers and switches. Gerard holds a BS in Computer Application from the Cork Institute of Technology.
Russ Savage
Director of Product Management, InfluxData
Russ Savage is the Director, Product Management at InfluxData where he focuses on enabling DevOps for teams using InfluxDB and the TICK Stack. He has a background in computer engineering and has been focused on various aspects of enterprise data for the past 10 years. Russ has previously worked at Cask Data, Elastic, Box, and Amazon. When Russ is not working at InfluxData, he can be seen speeding down the slopes on a pair of skis.
Shangxin Du
Technical Marketing Engineer, Cisco Systems
Shangxin Du is currently a Technical Marketing Engineer at Cisco Systems helping customers build network automation solutions using streaming technologies and Networking as Code best practices. He also builds and maintains a hands-on lab for testing and developing competitive content used for training the Cisco field teams. Prior to this, Shangxin has held various solutions support positions at Cisco. He holds a Master in Electrical Engineering from Shanghai Jiao Tong University and a BS in Computer Science from the Huazhong University of Science and Technology.