Unlocking Telemetry and Instrumentation: Cisco's Journey with InfluxDB
Session date: Apr 09, 2024 08:00am (Pacific Time)
Join us as we delve into Cisco’s innovative use of InfluxDB’s purpose-built time series database with its latest DevNet tool, Cisco Metrics Search Engine (CMSE). Discover how CMSE revolutionizes the way developers and network engineers find telemetry and instrumentation data from various sources, including APIs, YANG Models, SNMP MIBs, and CLI command references.
In this session, Jason Davis, a Distinguished Engineer at Cisco, will share updates on how they harnessed the power of InfluxDB and Grafana in the Network Operations Center (NOC) during their recent Cisco Live 2024 event. Learn how they successfully monitored conference WiFi using these tools, ensuring a seamless and uninterrupted experience for attendees.
Don’t miss this opportunity to explore the intersection of cutting-edge technology and network monitoring and discover how Cisco and InfluxDB are transforming how we leverage telemetry and instrumentation data.
What you’ll learn in this session:
- Introduction to CMSE and its role in finding telemetry and instrumentation data
- Integration of InfluxDB with CMSE for efficient storage and retrieval of data
- Real-world deployment of InfluxDB and Grafana for network monitoring
- Challenges faced and solutions implemented during the deployment
Watch the Webinar
Watch the webinar “Unlocking Telemetry and Instrumentation: Cisco’s Journey with InfluxDB” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Unlocking Telemetry and Instrumentation: Cisco’s Journey with InfluxDB.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors. Speakers:
- Caitlin Croft: Director of Marketing, InfluxData
- Jason Davis: Distinguished Engineer, Cisco
CAITLIN CROFT: 00:00
Just to let everybody know, this webinar will be recorded, and the recording and slides will be made available after the webinar. And be sure to check out our events page at influxdata.com to learn about our upcoming webinars and other events. We have product webinars and webinars where fellow community members share their InfluxDB expertise. Don’t forget to check out our community forums and community Slack workspace. These are both fantastic resources for people brand new to InfluxDB and those who are looking for tips and tricks. There are amazing community members and Influxers who are in there ready to answer your questions. And don’t be shy. If you have a question, somebody else probably has a question also. And there’s probably somebody who has asked it. Or if not, be the first. And there’s probably somebody more nervous than you. I’m always nervous to ask questions. And whenever I do, they’re usually not that bad.
CAITLIN CROFT: 01:01
Please post any questions you have for this webinar in the Q&A. If you look down, you see the chat. And then there’s also a Q&A. I’m going to be moderating the Q&A. I will be scanning the chat, but if you put them in the Q&A, I’ll probably get to it faster. And we’re going to do the questions after the webinar. So, pop them in when you think you have a question just so we have them locked. But after the webinar, we’ll be sure to get to them. Okay. So, we also have Greece and Virginia, okay, and Northern Ireland. Okay. Everybody, welcome. So, I am going to turn the recording on, and I’m going to kick it over to Jason now. The recording’s already on, so I’m going to kick it over to Jason now.
JASON DAVIS: 01:52
All right. Good morning, everyone. My name’s Jason Davis. I’m a distinguished engineer at Cisco. If you’re familiar with our DevNet organization, which is about developer advocacy and network programmability, I’m the senior technical lead for that organization. I do strategy and special projects. Part of my side job at Cisco is to help with the network management and the network operations center for our Cisco Live events. And I do the US and Europe events, and we’re knee-deep in setting up for Cisco Live Las Vegas here in a couple of months. To that end, that’s what we want to talk about a little bit today is how we do Cisco Live and how we use InfluxData technology to enable our management and monitoring of the event. So, we’ve been doing Cisco Live since 1989. It used to be called Networkers way back then. And it’s really a cool event. It’s about fun, education. We have a customer appreciation party, give out hats. And usually, there’s some headliner that performs for us. And there’s product launches. You can go and see what’s going on in the NOC with all the stats and everything about the show.
JASON DAVIS: 03:26
And we do it in multiple theaters every year. US and Europe are the largest ones, but we also have them in Australia, in Melbourne, and we’ve had it in Cancún. The US and the European ones are large enough, though, that we need to bring our own IT staff to set up and run the event. Typically, when you have an event at a large-scale venue, there is an IT staff already there. And typically, that venue will have their own IT infrastructure. More often than not, it’s a Cisco infrastructure. Sometimes it’s an older Cisco infrastructure. Occasionally, it could be something other than Cisco’s products. So then when we come in there, we’re going to bring a lot of our own equipment. And we’re talking about over 2,000 wireless access points. So, if you go to one of these events and you see a lot of aluminum poles with the access points mounted on the poles, that means that the venue has either older Cisco equipment or non-Cisco equipment, and we had to bring our own equipment. But if we’re using the wireless from the venue, then we’ll swing all of their devices to our own wireless LAN controllers and just take over management of the event venue equipment.
JASON DAVIS: 04:53
We’ve had a maximum of 28,000 attendees and 74,000 mobile devices when you think about everybody bringing a phone, a tablet, and a laptop. And we have to distribute over 600 switches into the venue, into the various classrooms, and up into the aluminum strut where the lighting is. And also, the wireless access points are sometimes mounted up into the ceilings on aluminum strut. We have to bring in big internet pipes from our internet service provider. Sometimes the venue provides internet connectivity. Sometimes we go directly to a service provider and have them drop off really fast links. Last year, we had triple 100 gigabit per second links and dual 10 gig backup. So having 320 gig of internet pipe is quite a bit of connectivity. We also will simplify our deployment by dropping in a mobile shipping container filled with our routers and switches, the core of the network, firewalls, also the storage from NetApp, cooling, and all that. So that’s a shipping container. It’s not a Docker container. I like to joke around. But that allows us to drop it in, drop in the internet pipes and the power, and then fan out to the rest of the venue through their wiring closets and into the different rooms and such.
JASON DAVIS: 06:29
We don’t have a lot of time. Typically, the venue gives us four or five days before the event opens to get everything going. And then when we go in, it’s pretty much an open pallet. It’s concrete floors, nothing. We have to build up all the booths in the world of solutions and the vendor space. We have to roll out the equipment, get into their wiring closets. And you see a picture here of a guy wearing a harness up in the ceiling space 30 feet off the floor. Sometimes we have to get up there and reorient antennas and such. Sometimes we’re on the roof of a building, and we’re extending our show network from that venue to an adjacent hotel or something like that where we may overflow and use their space for whisper suites or NDA conversation type activities. So that’s kind of fun that we get to really use a lot of high-tech equipment, laser network extensions across– sometimes they don’t have fiber running between the buildings, and we just try to figure out a way to do point-to-point connectivity at high rates.
JASON DAVIS: 07:48
Not having a lot of time means we have to work really quickly, and brutal automation is very necessary. Having a very solid and modular network build really goes a long way to making sure that we can build out really quickly. So, things are very templatized, and because of that, we can grow really quick and tear down just as fast. Sometimes parts of the venue are not used. And when we’re done with that part of the venue, we can reclaim our equipment and such. We have a three-pronged strategy for how we manage and monitor this rapidly deployed environment. And one part of that is to definitely use the commercial management products that we have, our Catalyst Center, formerly known as DNA Center, using our own commercial tools like Prime Network Registrar for DNS and DHCP, and Cisco Telemetry Broker to multiplex out the telemetry and instrumentation. So, if you’re familiar with taking syslog messages in from your devices, sometimes you need to multiplex that information from the device to multiple consuming management applications.
JASON DAVIS: 09:11
So, our Telemetry Broker allows us to take in that information, forward and filter it, so we don’t have to set up as many trap receivers and syslog receivers and gRPC telemetry receivers. It handles a lot of that message and alert forwarding for us. We’re doing a lot with Umbrella and Meraki, and ThousandEyes for availability and latency monitoring. But since we’re here talking with InfluxData, we also use open-source solutions and their commercial analogs to supplement our needs specific to a large-scale conference venue. And Influx is a strong part of that. Grafana allows us also to do the visualizations of the data that we’re collecting from our routers, switches, wireless access points, even those little Raspberry Pis that we deploy across the venue that act like end users, making sure that their wireless experience is sufficient. Ansible is also very common for us to use for provisioning the devices. Kubernetes, as we build up our microservice-based architecture for high scaling and polling across all these thousands of devices that we’re deploying. And Fping is an open-source project that I’ve contributed to that allows us to do fast pings.
JASON DAVIS: 10:44
We can say, “Here’s 3,000 devices we need you to ping,” and it’ll do it en masse and return back the information to us really quickly in a JSON format, which is my favorite data encoding method, JSON. Another dad joke there. But we get that information back, and then we can throw it up into dashboards and show the availability, latency, and packet loss of anything going on in the network. Finally, we end up using purposeful software development, the DevOps, SRE kind of principles to fit the unique requirements that we have. And commercial products are great. The open-source solutions are great too. But sometimes we have a little niche thing that we want to monitor for, and we have to go and write our own little Python script or something to collect that information.
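The fast-ping workflow Jason describes — hand Fping a large target list, get machine-readable results back, and feed availability stats to a dashboard — can be sketched in Python. The JSON shape below is purely illustrative; the actual format of the JSON output feature he mentions contributing may differ.

```python
import json

def summarize_fping(json_text):
    """Parse fping-style JSON results into per-host availability stats.

    The JSON schema here (host/loss/avg keys) is an assumption for
    illustration, not the tool's documented output format.
    """
    results = json.loads(json_text)
    return {
        r["host"]: {"loss_pct": r["loss"], "avg_ms": r.get("avg")}
        for r in results
    }

# Sample results for two targets: one reachable, one fully lost.
sample = (
    '[{"host": "10.0.0.1", "loss": 0, "avg": 1.2},'
    ' {"host": "10.0.0.2", "loss": 100, "avg": null}]'
)
stats = summarize_fping(sample)
print(stats["10.0.0.1"]["avg_ms"])  # → 1.2
```

From here, each host’s loss and latency values could be timestamped and written to InfluxDB for the dashboards described above.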
JASON DAVIS: 11:41
Now, our journey with databases has been pretty long and storied. Probably like you, we’ve gone through and used OLAP and OLTP type databases and transactional-style databases like Oracle, MySQL, Postgres, MariaDB. We tried to fit that square peg into a round hole many times with performance data. And using some of those traditional transactional databases, we were saying, “It’s taking more work for us to try to build in a sense of time and doing the math and reporting that we were looking to do.” So, we’ve been using Influx since probably 2017, so early days of the time series database mentality. And so now we still use open-source databases like MySQL to handle more of the inventory and asset management. And it underpins our NetBox asset management tool. But we’ll use InfluxDB primarily to take in the performance data that has that time sense. And we’re timestamping as we get information about how many wireless clients we have or how many terabytes of traffic we’ve moved through the network.
JASON DAVIS: 13:05
So, all this, when it has a sense of performance and time, makes great sense for us to leverage InfluxDB. And then it may go into being pulled into Grafana, or we may have our own graphing architecture to pull in the information and show it. Now, I’ll share a few of those dashboards here right now. This one is a custom dashboard that we created for the wireless team. And if you’ve done network management very long, you probably remember that a lot of the ideas around network management are managing a device atomically or one by one. The idea of controllers has been new since software-defined networking (SDN) in the last 10 years. And having controllers together operating as a cluster is an important part of a high availability type of situation and environment. So instead of having a device managed atomically, we want to look at them as a clustered pair. And so, this dashboard is more of a custom one where we would gather the data from each individual wireless LAN controller—and these are managing the thousands of wireless access points in the venue—and we could pull out of these WLCs how many APs are out there, how many wireless clients, what wireless network they’re on, how much traffic they’re generating.
JASON DAVIS: 14:46
All the interesting rich telemetry that we’re interested in would be pulled out of these wireless LAN controllers as they are received from the wireless access points. And what we were learning over time was that some of the telemetry and instrumentation, it wasn’t exposed in some of the newer management protocols that we were really using. So, in some cases, we had to go all the way old-school with scraping CLI show command outputs. We were doing that enough with some of these systems that we wrote a tool called SSH to Influx. And what that does is it allows you to define a YAML file that says, “Here’s the inventory that I’m after. Here are the global variables and commands that I want executed. Here are the parsing specifications and regular expression pattern matches that I’m looking to extract out of that command output. And here are the Influx tags and keys that I want to associate to each of those captured pieces of information.” And so, if you’re interested in it, if you still have a challenge with potentially using older equipment that still needs more of an SSH kind of management style, it can be used by anything, even Raspberry Pis. Log into your Raspberry Pi and run some– maybe you’ve got a small Kubernetes cluster running on it or something. You could run any command and then execute, get the output, define what your regular expression match is, and then take that information, tag it, and it would be pushed into Influx for you.
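As a rough illustration of the SSH to Influx idea Jason describes — a declarative spec naming a command, a parsing regex, and the Influx measurement and tags to attach to whatever the regex extracts — here is a minimal Python sketch. The command, regex, and controller name are hypothetical stand-ins, not taken from the actual tool.

```python
import re

# Hypothetical parsing spec in the spirit of SSH to Influx's YAML:
# a CLI command, a regex with named groups, and the Influx measurement
# and tags to associate with the captured values.
SPEC = {
    "command": "show ap summary",            # illustrative show command
    "regex": r"Number of APs:\s+(?P<ap_count>\d+)",
    "measurement": "wlc_stats",
    "tags": {"controller": "wlc-01"},         # hypothetical device name
}

def parse_output(spec, output):
    """Extract fields from CLI output and shape them as an Influx point."""
    m = re.search(spec["regex"], output)
    if not m:
        return None
    # Named capture groups become Influx fields.
    fields = {k: int(v) for k, v in m.groupdict().items()}
    return {
        "measurement": spec["measurement"],
        "tags": spec["tags"],
        "fields": fields,
    }

cli_output = "Number of APs: 2300\n"
point = parse_output(SPEC, cli_output)
```

In the real tool, the SSH login, command execution, and the write into InfluxDB would wrap around a parsing step like this one.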
JASON DAVIS: 16:33
And from there, then you can build rather cool dashboards like you just saw. We also do have quite a few electronic devices that are supporting the newer management, gRPC streaming telemetry, NETCONF, RPCs, and such. And so, we also did a NETCONF to Influx project, which allows us to gather the YANG model output and decide how we want to show that by dumping it into Influx also. And what you’re seeing here is a wireless dashboard that’s showing the different wireless radio standards in the IEEE. You may have heard of it as 802.11ac or the newer marketing names like Wi-Fi 6, which would be 802.11ax. But what you’re seeing here is a breakdown of all the attendee client wireless devices as they’re connecting to our access points. And whether they’re on Wi-Fi 4, which is 802.11n on the 5 gig band, Wi-Fi 5, 802.11ac, also 5 gig band, or the new Wi-Fi 6, which is 802.11ax, and Wi-Fi 6E, which is the 6 gigahertz band. So that’s a newer band that hadn’t been available except in the last three years as Wi-Fi 6 and 6E were ratified and promoted.
JASON DAVIS: 18:06
In blue, in the middle of that dashboard, is something that, just digging around through telemetry and instrumentation, I found was something pretty interesting. And this was Wi-Fi 6E capable clients. So, these are users or devices that connected to the network. They were capable of Wi-Fi 6E, but for whatever reason—probably that area didn’t have a Wi-Fi 6E access point—they had to step back down to a Wi-Fi 5 or older technology. Finding this metric was interesting to us because this tells us, as the number of users increases here, that is a strong indicator that the venue or the location needs to upgrade their infrastructure because they have more users that could have a better experience. They have the newer equipment, but they’re not connecting at the maximum capability of their device because the infrastructure is slower.
JASON DAVIS: 19:17
So anyway, interesting things that you can find when you kind of dig into the telemetry and instrumentation that’s embedded inside these products. And so, we also get the question of what’s the maximum number of wireless clients throughout our event? So, grabbing that information, putting it into Influx, building up our dashboards in Grafana allows us, again, to build these kinds of cool reports. And then if you ever had a question about Influx’s ability to handle a lot of information at scale, this is one of our newest dashboards. And this dashboard really kind of put that question at ease for me because this is showing the 2,300 wireless access points. And each of those is almost a pixel in this heat map. And what we’re gathering is the wireless client information, the radio stats, the traffic rates, signal strength, all these metrics coming in every two minutes to create the heat maps. And if you hover over the heat map, every pixel will pop up a little menu that shows this access point, how many clients are connected to it, transmit/receive utilization, signal strength, and channel utilization.
JASON DAVIS: 20:37
So, this is kind of fun information from a geeky wireless perspective because we can see things like, “Oh, it’s getting brighter here at this location.” This is when people are coming in about 7:30, 8 o’clock in the morning to have breakfast at the venue. And okay, now about 8:30, people are going to their first sessions of the day and people moving through the hallways. We see the hallway access points getting red because there’s so many people just crammed into the hallway, getting to their various rooms and to the keynotes and such. And then we’d even see like, well, what is this? We got one access point that’s just red across the whole time series, and this is showing a four-hour block of time. And when we dug into it, we said, “Oh, well, this room is actually where the people go to recharge the handheld systems that they used to scan people.” So, when an attendee comes in, they scan their badge. All those scanners would go into this room. And so, imagine there’s a room with 150 of these devices all sitting in chargers, but they’re also talking to the wireless network. So, 150 of those all in one really constrained space caused that graph to show a red line. And we’re like, “Okay. We have a good explanation for that. No cause for alarm. Let’s move on.”
JASON DAVIS: 22:03
But then we also want to take information and slice it and dice it in different ways. So, what are the different wireless networks that people are attached to? And then what wireless standard are they connected on? And a funny story about this. Many years ago, we actually had a wireless network besides Cisco Live as the broadcast SSID. We had a Cisco Live IPv6 SSID. And if you connected to that wireless network, you only got an IPv6 address. You did not get an IPv4 address. So, any surfing that you were doing had to go to IPv6 capable websites and services. And what I saw was one device that was on this SSID, but they’re over on the 802.11g 54 megabit per second wireless standard. And so, we looked at that and said, “Here is someone who is so forward-thinking about using IPv6 and next-generation protocols, but they’re so frugal that they’re using 15-year-old radio technology in their wireless that we had to go get them a wireless dongle and bring them up to the 2020s.” So that was kind of a funny story. But as we take that information, we have the flexibility. Once it’s into Influx, we can write queries and then pivot the information any way we want to and show, without regard to the wireless SSID, we can now look at the adoption of the various wireless protocols.
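The pivot described here — the same client records, re-aggregated by wireless standard instead of by SSID — is easy to picture with a small sketch. The record fields are illustrative; the real data would come back from an InfluxDB query.

```python
from collections import Counter

# Sample client association records as a query might return them,
# one row per connected client (field names are illustrative).
rows = [
    {"ssid": "CiscoLive", "protocol": "802.11ax"},
    {"ssid": "CiscoLive", "protocol": "802.11ac"},
    {"ssid": "CiscoLive-IPv6", "protocol": "802.11ax"},
    {"ssid": "CiscoLive", "protocol": "802.11n"},
]

# Pivot without regard to SSID: count clients per wireless standard.
by_protocol = Counter(r["protocol"] for r in rows)
print(by_protocol["802.11ax"])  # → 2
```

The same rows could just as easily be grouped by SSID, band, or any other tag — which is the flexibility Jason is pointing at once the data is in Influx.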
JASON DAVIS: 23:51
And this allows us to understand how people are upgrading their equipment. And Wi-Fi 6 was ratified in 2019, but then the world went crazy shortly after that with COVID. And so, for those three years, there was not a lot of– there were no in-person events, right? So, the best practices and what we could develop out of Wi-Fi 6 monitoring and management really wasn’t being developed for a few years because we weren’t having large-scale events. But when we got back out of it recently and started collecting more of this information, we’re seeing that people did actually upgrade their phones and tablets and laptops through COVID. They just weren’t in these places where the venues would have that connectivity. And so recently, we had Cisco Live in Amsterdam, and we saw close to 65% of the equipment was capable of Wi-Fi 6. So, it’s just like the progression of Wi-Fi 4 and 5 over the years. It’s really cool to see how this technology is adopted and allows us to make better decisions like, “Hey, nobody’s on the 2.4 gig band anymore. Maybe we can save some spectrum and RF radiation by turning those radios off and just focusing our efforts on 5 gigahertz and 6 gigahertz band.”
JASON DAVIS: 25:21
And so, we get this information out there and we show it off in our network operations center and the video display wall. And it’s a lot of the same information you would see behind the scenes if you were in the work area where we’re just monitoring the network. So, we share the information warts and all. But as we go through this, it’s kind of a chicken and an egg situation. Influx and Grafana and some of these other tools that we use are great, but they only do their job when you get information into them. So where do we start? How do we get that telemetry and instrumentation into this equipment so that we can have this cool dashboard and everything? So, it’s really an integral problem that we have to solve of getting the information into Influx. And so, when you think about the various management protocols, all the way back old school, command line interface, executing SSH to get into something and running a show command to read the output. I call that tongue-in-cheek finger-defined networking or FDN. If you’re familiar with software-defined networking or SDN, FDN is where you manually gather information.
JASON DAVIS: 26:38
Then you may see old-school SNMP. And I kind of grew up in this industry doing that. SNMPguy is my Twitter handle. It’s also my license plate on my car in Raleigh, North Carolina. So, I’m very familiar with SNMP, but SNMP has had its day. And so, we’re moving into newer management protocols, REST, RESTCONF, NETCONF. Streaming telemetry with gRPC and gNMI, I believe, is a great protocol to take over from SNMP. So, it’s all about the APIs. Okay. Dad jokes. I’m looking here. Looks like mostly guys are on this conference, so you’re going to have to take it. Ladies, if dad jokes get to you, sorry about that. But I got four kids, so I know a lot about dad jokes. But it’s about the APIs and how we get the information out of this equipment, out of the management tools, and into the systems that are building our graphs for us, right? How do people typically do this? Well, they’ll go to Google or something like that, and they’ll just search. And Google’s great, but it’s got a lot more in there than just network management tools and telemetry and instrumentation.
JASON DAVIS: 28:05
It’s got other things that people may be searching for. So, trying to narrow down what is the Wi-Fi 6E client count and what is that metric, so I know what to collect and what to put into Influx? That’s going to be a lot more difficult to find in Google when it’s going to be polluted with a lot of other information. So, we’ve come up with this tool at Cisco and specifically in the DevNet organization we’re calling the Cisco Metrics Search Engine. And it really enables our developers to find that telemetry and instrumentation that’s built into our products and be able to extract it so they can use that in their own management tools. And for Cisco, we have a lot of products in our portfolio. So having a tool that brings together a lot of this telemetry and instrumentation in a way that people can find things helps it look better as a one Cisco strategy rather than a lot of different product teams that you may have run into before.
JASON DAVIS: 29:10
So, the analogy of this would be like a car. My friend from Athens, Greece, there probably appreciates car analogies. But if you sit in a car, you have this dashboard, right, and it’s pretty intuitive to use. And it doesn’t matter what manufacturer car you get into, you can kind of understand, “Here’s the RPM meter. Here’s how much performance information in RPMs. Here’s how fast I’m going.” You may even have some fault indicators like check engine lights and airbag deployment. You have volume information and capacity information, like how much gasoline is in the vehicle, or if it’s an EV, how much voltage is still in the batteries. But it’s a pretty easy thing to understand, intuitive. And this car dashboard is kind of like a commercial network management tool, right? It’s been developed to look a certain way, and we think everybody wants it to look this way, right? Intuitive, simple, easy to use across different vehicles.
JASON DAVIS: 30:21
But if you’re trying to use information about your network to a competitive advantage, you need more than that dashboard. You need that onboard diagnostics port that’s the plug underneath your steering wheel. And what you’re seeing scrolling here now is a Wikipedia article that shows a lot of parameter IDs that come out of those onboard diagnostics ports. And if you’ve taken your car to a dealership or repairman, maybe an inspection station, if your country or state requires you to get annual inspections for smog and pollution and things of that nature, they’re going to plug into that onboard diagnostics port. And you’re still seeing this Wikipedia article scroll by because there are hundreds of parameters that are captured in the telemetry of your vehicle. And you don’t want that information to be shared with your insurance company if you’re a lead foot because they’ll see how fast you’ve ever driven this car, the maximum speed, and all of that. But now you’re kind of getting a sense of, okay, that telemetry that’s built into the car, that’s kind of like the YANG models and the SNMP MIBs and all the things that are deep into the routers and switches and access points. If I can get to that information, then I can build some really cool dashboards.
JASON DAVIS: 31:49
And so again, Cisco Metric Search Engine is helping us because when we think about how a developer ideates– they can build. But if they have to build from scratch, that’s pretty difficult. They may buy. If they’ve got a pretty decent budget, they may go buy a tool that allows them to collect information from that equipment. But what they really like to do is reuse, right? Somebody else has written something and, “Hey, that’s great. I like it. It’s helpful. I’m going to reuse that.” So, what we find is that customers and developers are largely unaware of the telemetry and instrumentation that’s in our products. And because it’s all over the place in web pages, and YANG models are over here in GitHub and MIBs are over here. APIs are on different product websites. And because it’s spread across all these different locations, they just don’t take the time to dig in and learn. So, Cisco Metric Search Engine is taking all four of those categories, API documentation, YANG models, MIBs, and command line references, and pulling it into a big data lake so that you can search for that information.
JASON DAVIS: 33:07
And then as you search for that metric you’re interested in, we show you, here are the different hits in that data lake when you’re looking for Wi-Fi 6E client count or BGP neighbor uptime. Whatever it is you’re looking for, we show you here are the hits in the telemetry that we have in our equipment. You can start to narrow it down by operating system, by device type, or software version so that you can have more of a semblance of, “Is it supported in the products and the software versions that I’m running?” And then once you find something that interests you, just like with Postman, you can say, “Generate me a code snippet.” And regardless of if it’s an API call, a YANG model, a MIB, or a CLI, we will build the authentication headers necessary to connect to that device, create the payload. And then the output of that is however you want to use it. And one of those things that we’re looking to do, and we’re still working on, is automatically generating some of the code that would push it into Influx for you.
JASON DAVIS: 34:19
But to do that now, as you use CMSE (Cisco Metric Search Engine) and you take the software recommendation or code recommendation, you can drop that Python code it suggests into your VS Code IDE, and then change your device IP address and authentication information, and then validate that, yeah, this is the information I’m looking for. This one showed me that I have one user connected into this device. One of the things that we were interested in was how many of the NOC users are actually logged into the various devices out there? And do we have too many people managing the equipment? So, if we have three or four people logged in at the same time, are they running and bumping into each other? So, we created a little mechanism to monitor how many active sessions that we have on each of these devices, especially the core of our network devices. And we run this and have a dashboard that’s showing which ones are connected and most heavily connected at that time. We can augment that suggested Python script to extract the specific metric we’re interested in and then say, “Okay, this one is showing I have zero sessions running on this device,” and augmenting it one more time with six, seven lines of basic Python to write that information out to Influx. In this case, we’re writing it out to Influx in the Cloud. And we can use the new InfluxDB 3 style of connectivity. Then we now have that information available to us anywhere.
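A minimal sketch of that last step — shaping a session-count metric as InfluxDB line protocol before writing it out. The measurement and tag names are hypothetical; in practice the rendered line would be handed to an InfluxDB client library or POSTed to the InfluxDB Cloud write API rather than just printed.

```python
import time

def to_line_protocol(measurement, tags, fields, ts_ns=None):
    """Render one point in InfluxDB line protocol.

    Assumes simple tag values (no spaces or commas needing escapes)
    and integer fields, which get the 'i' suffix.
    """
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}i" for k, v in fields.items())
    ts = ts_ns if ts_ns is not None else time.time_ns()
    return f"{measurement},{tag_str} {field_str} {ts}"

# Hypothetical metric: zero active SSH sessions on a core switch,
# with a fixed nanosecond timestamp so the output is reproducible.
line = to_line_protocol(
    "ssh_sessions",
    {"device": "core-sw-01"},
    {"active": 0},
    ts_ns=1712649600000000000,
)
# line == "ssh_sessions,device=core-sw-01 active=0i 1712649600000000000"
```

Those few extra lines are roughly what “six, seven lines of basic Python to write that information out to Influx” amounts to.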
JASON DAVIS: 36:12
So, running that, we can see we still have zero users logged into that device, and we’ve created the line protocol write that was pushed out to InfluxDB Cloud. And pulling it up into our Data Explorer shows us, at one time we had a user logged in, and then over the last several minutes, we’ve had zero. From there, we can take that query and then put it into Grafana or build our own dashboard however we’d like to and add it to our video display wall and our monitoring solution. So again, you can take just about any metric that you’re looking for and then extract it and then build a template for how you want to use it, look at it in Data Explorer, and then take it to production. If you’re interested, if this is a problem that you’re trying to solve, building your own custom dashboards, you’re trying to do something to find information about how well your business is running, then I’m going to encourage you to try out CMSE. It’s at developer.cisco.com/cmse. And we have a link here if you want to send us some feedback about how the tool is working for you. We’re enhancing it on a regular basis, adding more products and more telemetry as we find it. Most of you have heard about Cisco acquiring Splunk recently. And so, we’re adding Splunk into our API ingestion engine.
JASON DAVIS: 37:51
And I would be remiss if I didn’t mention the rest of my team, the DevNet team. At developer.cisco.com, we have a lot of information. There are learning labs. There are reservable sandbox environments. If you want to try something out and not test in production in your environment, you can check out a router or switch from our sandbox environment. There are forums to talk to those of us in the developer mindset, and we encourage you to reach out and give it a try if you think network programmability is important. With that, let’s see. Are there any questions? I kind of ran through that pretty quickly. What happened to my screen? No longer there.
CAITLIN CROFT: 38:41
Thank you so much for the great presentation. And we do have questions.
JASON DAVIS: 38:47
All right.
CAITLIN CROFT: 38:48
Not sure where you went, but I’ll lead the Q&A. All right. So, the first question we have is from Alexander: are the wireless access points all connected with Ethernet cable, or something else? Did you try Wi-Fi mesh for such large installations and numbers of client devices?
JASON DAVIS: 39:10
Not sure where I went either. Let me stop my video and start again because I can give that a try. Huh. I don’t know. Anyway, you’re still hearing me, right? I don’t know why the camera stopped going. But anyway, Wi-Fi mesh is an interesting technology, but it’s going to use some of that wireless bandwidth to backhaul the connectivity. So, in a high-connectivity environment like this, it just makes a lot more sense to bring everything back over cable. Most of the time, you would see gigabit connectivity. But now we have wireless access points in Wi-Fi 6 and 6E that have multi-gig connectivity for the end user. So, we end up having to have access points that support multi-gig connectivity back to the environment. A 5-gig connected access point is not unheard of, because you’re trying to bring hundreds of users’ traffic back, and some of them can have over a gig of wireless connectivity. Doing that through a wireless mesh, where you’re backhauling through wireless, is going to eat up some of that precious bandwidth, so.
CAITLIN CROFT: 40:28
Yeah. And what are you installing on the roof? Is it satellite antennas? And do you use satellite communication, or only ground fiber?
JASON DAVIS: 40:39
So, what I was showing on the roof there was actually a laser Canobeam from the San Diego Convention Center to the adjacent hotel. I can’t remember if that was the Marriott or the Hilton. But we needed to run the Cisco Live network over to that hotel and extend it so that the users would have a seamless experience, whether they were in the San Diego Convention Center or in the hotel. They could be on the same Wi-Fi network. So, we extended that through essentially a laser 1-gig point-to-point wireless shot across the field to the hotel. Yep. Sometimes we put people– if you’ve followed me, I’ve got this little Raspberry Pi thing I’ve created that has a GPS antenna, and we use the Raspberry Pi to be our network time protocol (NTP) server. So that’s kind of a fun thing, too: less than $100 of parts, and you’ve got a stratum-one time server for your network. And you build a bunch of them, and you can have really solid clocking that’s reliable and fault-tolerant, so.
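A Raspberry Pi stratum-one setup like the one he describes is commonly wired up with gpsd feeding chrony over shared memory. A minimal chrony.conf sketch, where the PPS device path, offsets, and allowed subnet are assumptions that vary by GPS HAT and network:

```
# /etc/chrony/chrony.conf (sketch)
# gpsd publishes NMEA time into shared-memory segment 0
refclock SHM 0 refid GPS precision 1e-1 offset 0.2 delay 0.2
# pulse-per-second signal from the GPS module, if wired to a GPIO pin
refclock PPS /dev/pps0 refid PPS lock GPS
# serve time to clients on the event network
allow 10.0.0.0/8
```

The coarse NMEA time from the SHM refclock gets you in the right second; the PPS refclock locked to it provides the microsecond-level edge that makes the box a credible stratum-one source.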
CAITLIN CROFT: 41:53
Thank you. Does your department pay for all Cisco equipment and monitoring software, or is everything for free as you’re part of Cisco?
JASON DAVIS: 42:02
We don’t have to pay ourselves for the use of our products. Some of the hardware we can get from ourselves at a discount. Some of the hardware we get from our own demo depots. So, if you’re a customer, you might have asked one of your salespeople, “Hey, can you get me one of those newer 9500 switches? I want to try it out.” We have a demo pool for equipment, and sometimes for events, we will use some of that equipment from the demo pool. But if you’ve ever been to the Cisco campus in San Jose, Building 17 looks a lot like a Home Depot for Cisco internally. There are just hundreds and hundreds of routers and switches on these racks, and we pull the equipment down as we need it, configure it, and deploy it. Then when we’re done, we pull it back and put it back up on the shelf. So, yeah, a lot of equipment, so.
CAITLIN CROFT: 43:00
Okay. How is Elasticsearch used for your installations?
JASON DAVIS: 43:06
Well, seeing as we’ve acquired Splunk, I’m not sure that we’re going to be going too much further with Elasticsearch, but generally it’s taking in syslog fault information, logging, and alerting information. Then we can search through all the messaging to say, “Okay, did we see a problem with a device having trouble with Power over Ethernet or anything?” Just how you’d use Splunk or Elasticsearch to query your syslog messages, your traps if you’re taking them in, or messages that you may have taken in on a message bus and turned into a regular message. That’s typically how we use it. Yep.
CAITLIN CROFT: 43:48
Did you find Grafana not so useful? Why did you have to implement your own alternative solution?
JASON DAVIS: 43:57
We did find Grafana pretty useful. I mentioned how we take a three-phase approach. We use our own commercial tools. Obviously, for this discussion, we weren’t going to talk about Catalyst Center or the nuances of Crosswork or anything like that. There are any number of salespeople who would be glad to talk with you about our commercial management products. Because I’m in the DevNet team, we evangelize network programmability and we’re not afraid of open source, and we want to show the rich capabilities of our products with the embedded telemetry and instrumentation. So, use the commercial products as you can, but when you need to do more than what those commercial tools do, that’s where we’re in there saying, “What can we get out of the products directly? How do we use it? We’ve got to store the data in Influx. We’ve got to show it.” So, then we go over to Grafana and build up our nice dashboards there.
JASON DAVIS: 44:55
And some of the dashboards you may have seen were not Grafana. Some of those were basic HTML and CSS custom dashboards that we’ve done ourselves. So, it really depends on what the metric is and how sophisticated it is to show, whether we’re going to build it ourselves or try to do something with Grafana.
CAITLIN CROFT: 45:16
Thank you. What steps are involved when it’s necessary to expand already-installed, configured infrastructure, like adding another 50 wireless access points and onboarding them? Or do you have a static configuration that’s not easily expandable once completed?
JASON DAVIS: 45:37
Adding more capacity isn’t very difficult at all if we need to expand by another 100 or 200 access points. There are configuration profiles and templates that are pretty well defined, so we can just say, “Okay, when this device comes online, assume this template,” and boom, it goes. Where we have more work is generally when we’re deploying a new type of wireless access point or a new type of switch. So, this summer in Las Vegas, we’re actually moving to an all-new style of access-layer switch. We had been using the Catalyst 3560, the little 10-port PoE-capable fanless switch, and we’re moving up to the new Catalyst 9000 compact switches, which have similar port density but are much newer and support the newer management protocols. So, I’m going to be able to do a lot more with NETCONF on those products. As the products evolve and churn, there’s more work. But when we just talk about expanding device counts, we have the scalability and the processes to handle that pretty easily.
CAITLIN CROFT: 46:50
Do you see NETCONF as an industry-wide, well-established standard, or is something else coming as a replacement nowadays?
JASON DAVIS: 46:59
I see NETCONF being very mature. SNMP is mature too, but there’s not a lot more development going on in SNMP. And if you were talking with some of the web scalers, the Googles and Netflixes of the world, they don’t use SNMP. They use streaming telemetry, oftentimes gRPC in a push model: the device pushes the telemetry, and it’s received. In this case, Telegraf might receive it, right, to put it into Influx. When you go that direction with streaming telemetry and you say, “I want this branch of information,” it’s going to send you everything in that branch. So, I tend to move to NETCONF, which is similar to SNMP in the sense that you issue your poll and you get something back, but it’s using a newer protocol than SNMP. And when you use that NETCONF remote procedure call, you can be very specific about the payload, and you don’t get a lot of extra information just to ignore. Hopefully, that makes sense.
JASON DAVIS: 48:08
It’s like what I’m doing right now. Somebody asks a question and I talk for five minutes. You’re probably hearing more than what you asked for, but in that amount of information, you got some nuggets that are probably interesting to you. So, with NETCONF, I can be very specific: just give me this byte counter, just give me this gauge or whatever, instead of streaming telemetry pushing me 60 things when I only want 6 or 7 out of that group of metrics.
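That kind of targeted NETCONF pull can be sketched with the ncclient library, building a subtree filter that names exactly one counter. The YANG path shown (a standard ietf-interfaces byte counter) and the device credentials are placeholders, not CMSE output:

```python
def subtree_filter(xmlns: str, *path: str) -> str:
    """Build a NETCONF subtree filter selecting a single nested leaf,
    e.g. interfaces-state/interface/statistics/in-octets."""
    inner = ""
    for tag in reversed(path):
        inner = f"<{tag}>{inner}</{tag}>"
    # Attach the namespace to the outermost element only.
    return inner.replace(f"<{path[0]}>", f'<{path[0]} xmlns="{xmlns}">', 1)

FILTER = subtree_filter(
    "urn:ietf:params:xml:ns:yang:ietf-interfaces",
    "interfaces-state", "interface", "statistics", "in-octets",
)

# Running it against a device (hypothetical host and credentials):
# from ncclient import manager
# with manager.connect(host="10.0.0.1", port=830, username="admin",
#                      password="secret", hostkey_verify=False) as m:
#     reply = m.get(filter=("subtree", FILTER))
```

The empty `<in-octets>` leaf in the filter tells the device to return only that counter, which is the payload specificity he contrasts with a full streamed branch.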
CAITLIN CROFT: 48:41
I’m curious to know why you chose InfluxDB as your time series database. Could you share the reasons behind this decision?
JASON DAVIS: 48:49
Well, initially, when we were looking at trying to use regular transactional databases, it was a square peg in a round hole. We were trying to make information with a lot of time concepts fit a database type where time was a data type, but not one that was the focus. So, we realized, “Hey, there are some cool things going on,” back in 2017, 2018, when the notion of time series databases was really starting to take off. We were looking at it like, “Okay, these guys kind of get it. There’s an idea here where you have a metric, you just dump it in, and it timestamps it. You can do the math pretty easily: I need to go back seven days, or I need to go back two hours and aggregate information.” So, all the reporting things that had a time understanding made a lot of sense with time series databases. Then the next question was, who has the best time series database offering? And just through our analysis and use, we landed with Influx. And the partnership’s been great for many years, obviously, having this opportunity to work with them. Influx is in some of our commercial products too. So, yeah.
CAITLIN CROFT: 50:15
Have you ever combined Influx and a transactional DB to satisfy a particular visualization?
JASON DAVIS: 50:24
We do that even still. I mentioned how we’re using Influx and still MySQL. The MySQL database is really where we store inventory-type information or current-state information that doesn’t need a sense of change over time, because we don’t want to have to build the notion of timestamping and doing forward- and backward-looking comparisons. So, we’ll use MySQL to say, this is the model of a device. Our NetBox system will store information for us about device models, what software a device is running, and how many interfaces something has. That makes sense for a regular OLTP-type database.
CAITLIN CROFT: 51:17
So, I find that when you have many data points over a short period, or fewer data points over a long period, or even worse, many data points over a longer period of time, query performance suffers. Do you perform continuous queries (InfluxDB v1) or tasks (InfluxDB v2) to downsample or pre-calculate metrics? Or do you do other metric manipulation before submitting to Influx?
JASON DAVIS: 51:50
Because our events tend to run only a week and a half, we may be storing information for two, two and a half weeks maximum. And we’re over-provisioned for compute, CPU, memory, and storage. This is really a technology show, so we have a lot of resources at our disposal. I mean, you’re still going to want to do relevant things like setting time periods, so that when you’re doing a query, it’s pulling just enough information. But we don’t have a lot of downsampling to deal with because our events are a week or shorter, so.
CAITLIN CROFT: 52:32
Thank you. What is the TTL of data in your InfluxDB? Is it the same for all collected data or different metrics have a different TTL? Do you keep this data in archive after all events are completed?
JASON DAVIS: 52:46
Different metrics have different collection periods and retention periods. Something like optical transceiver power levels, where we’re monitoring the laser light levels for our routers connected to our service providers, that’s an interesting metric. When you’re running 100-gig links, you want to monitor those lasers. We’ll monitor every five minutes, because those light levels don’t tend to shift all that much. They shift because of temperature and things of that nature, but they don’t change that quickly. Other environmental things, like temperature, don’t change all that often either. It may warm up, but unless there’s a fire or something, you’re not going to see it change really quickly. There are other metrics we do care about where we may grab something every 10 seconds, like wireless client count information, because people are moving between facilities and rooms. So yeah, we’ll gather that kind of information, and we’ll hold on to it unless there are data privacy rules. In Europe, there are rules that say you can’t hold onto people’s MAC addresses and things of that nature. So, we make sure that we’re not collecting that information, and if we find out that we accidentally did collect it, then it gets whacked out of the stores. But we will hold on to information like raw client counts, because it’s good for us to understand large-scale conference venue technologies that way.
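The privacy scrubbing he mentions, dropping identifiers like client MAC addresses before data ever lands in a store, can be sketched as a small filter over a point’s tags. The tag names here are illustrative, not the NOC’s actual schema:

```python
import re

# Matches colon- or hyphen-delimited MAC addresses, e.g. aa:bb:cc:dd:ee:ff
MAC_RE = re.compile(r"^([0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}$")

def scrub_tags(tags: dict) -> dict:
    """Drop any tag whose value looks like a client MAC address,
    so personally identifying hardware addresses never get written."""
    return {k: v for k, v in tags.items() if not MAC_RE.match(str(v))}

clean = scrub_tags({"ap": "ap-lobby-3", "client": "aa:bb:cc:dd:ee:ff"})
# clean retains only the access-point tag; the client MAC is dropped
```

Aggregate counts (how many clients per AP) survive this filter, which matches the distinction he draws between raw client counts and per-person identifiers.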
CAITLIN CROFT: 54:31
Thank you. Do you use the data collected in your InfluxDB for display purposes only, like an overview user interface for admins to monitor, or are machine-to-machine InfluxDB select queries also used during live events?
JASON DAVIS: 54:46
We do both of those things. We gather statistics from the network devices. We gather statistics from NetApp, from VMware, from Kubernetes. We’re doing the whole IT service management thing. You may not see a lot of dashboards about how many containers are running in the Kubernetes cluster, because on the NOC video display wall, a lot of the people are more interested in the network stats. But behind the scenes, there’s some kind of monitoring process going on in that Kubernetes cluster, and we want to know, “Does that system still run, and how many replicas are there?” and things of that nature. So yeah, we’re gathering things that probably appeal more to application developers than network DevOps-type folks.
CAITLIN CROFT: 55:43
Thank you. To what extent is end-to-end data collection, storage, and visualization real-time? What protocols and tools are involved, and how much of it is scraping SSH-type outputs?
JASON DAVIS: 55:55
It differs a bit between the US and Europe because they’re different teams. But in the US, we do zero SNMP. We do, I would say, 80% NETCONF RPC collections, probably 15% streaming telemetry gRPC push, and then the remaining 5% SSH CLI scraping, because there’s always some metric you can only get that way. Some of the people that support us in the back are tech engineers, and they know some of those seven- or eight-argument-long commands that you have to execute to get some interesting ASIC register or whatever. And so, they’ll say, “Hey, can you run this command every hour or every 20 minutes?” And we’ll do that too, if necessary.
CAITLIN CROFT: 56:48
Okay. Can you repeat how you get from the CMSE to the code?
JASON DAVIS: 56:54
So in CMSE, there’s a button on every search result that says generate code, and you can click that button. It will create the authentication header and the payload. If it’s a NETCONF/YANG result, then it builds the sensor path. If it’s an SNMP MIB, it’ll get the MIB object and build the payload for you. You can copy that Python script we suggest into your VS Code, and then you just have to put in your IP address and your credentials, and you’re able to use it. One of the things we’re working on shortly here is enhancing it so that it will default to using our own DevNet sandbox environment. So, you could automatically try those metrics out against the always-on equipment that we offer.
CAITLIN CROFT: 57:50
Okay. And the last question we’re going to get to today, because we’re running up against time, is: do you use Telegraf on some devices to collect data for InfluxDB? Did you consider implementing the Cisco router plugin for Telegraf?
JASON DAVIS: 58:05
We use Telegraf primarily to take in the gRPC streaming telemetry from our edge routers that connect to the internet service provider. We’re not using Telegraf to do active polling, even though it’s capable of doing that. It acts mostly as a gRPC receiver for us. And then we have our own tool called Cisco Telemetry Broker, or CTB, that helps us multiplex collections, receiving traps and syslog and NetFlow data and forwarding it to the various applications that need to consume it.
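A Telegraf instance acting purely as a gRPC receiver for Cisco model-driven telemetry, the way he describes, looks roughly like this. The listener port, InfluxDB URL, organization, and bucket names are assumptions:

```toml
# telegraf.conf (sketch): receive gRPC dial-out telemetry, forward to InfluxDB
[[inputs.cisco_telemetry_mdt]]
  transport = "grpc"            # edge routers dial out to this listener
  service_address = ":57000"    # port the routers are configured to send to

[[outputs.influxdb_v2]]
  urls = ["http://influxdb.noc.example:8086"]
  token = "$INFLUX_TOKEN"
  organization = "noc"
  bucket = "telemetry"
```

In this dial-out model, the routers initiate the connection, so Telegraf needs no per-device credentials; it just listens and writes whatever sensor paths the devices were configured to stream.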
CAITLIN CROFT: 58:42
Well, thank you so much. Thank you, everybody, for joining. And don’t forget, the slides and the recording will be made available shortly after this webinar. So, thank you so much for joining us. Thank you, Jason. Thanks, everybody, for your time today. And I hope that you enjoy the rest of your day.
JASON DAVIS: 59:03
Take care.
[/et_pb_toggle]
Jason Davis
Distinguished Services Engineer, DevNet, Cisco Systems
Jason is a Distinguished Engineer in Cisco's DevNet organization. His role is to foster Developer Relations, develop Automation Strategies, and evangelize network programmability. His career has involved providing strategic and tactical consulting for hundreds of customers. Jason's primary expertise areas are in Network Management Systems, Automation & Orchestration, Virtualization, Data Center Operations, Software Defined Networking, and Network Programmability.