How Cisco Provides World-Class Technology Conference Experiences Using Automation, Programmability, Python, InfluxDB and Grafana
Session date: Feb 15, 2022 08:00am (Pacific Time)
Cisco Systems’ Cisco Live conference is held annually in the winter in Europe and in the summer in the United States. A typical US event hosts 600+ breakout sessions, dozens of keynotes, Certification Testing and walk-in labs. The conference serves over 26,000 attendees and all their mobile devices. The internal network team is responsible for ensuring that 2,200 wireless access points and 800 switches are providing sufficient network connectivity, availability and bandwidth for all attendees, speakers and organizers across 2 million square feet of conference space. Over 9 days (4 days of setup and 5 conference days), event staff and attendees have pushed over 84 terabytes of data from the conference to the internet! Dual 100 Gigabit/second primary links and backup 10 Gigabit/second links handle anything the users can throw at them.
The infrastructure required for this event also includes servers, VMs, and containerized workloads. With the growing need for hybrid events, Cisco’s team also ensures 100% video streaming uptime. Discover how Cisco uses InfluxDB to store key performance metrics across many IT domains alongside their commercial management solutions. The team continues to iterate and improve year-over-year to gain visibility into their network and devices to streamline troubleshooting and quickly respond to events before they become service impacting.
In this webinar, Jason Davis dives into:
- Cisco's approach to using automation, orchestration, Python scripts, SNMP, and streaming telemetry to collect network data
- Their methodology to troubleshooting, prioritizing, and scheduling fixes to ensure the best client experience
- How a time series platform is crucial to their real-time data analysis
Watch the Webinar
Watch the webinar “How Cisco Provides World-Class Technology Conference Experiences Using Automation, Programmability, Python, InfluxDB and Grafana” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]
Here is an unedited transcript of the webinar “How Cisco Provides World-Class Technology Conference Experiences Using Automation, Programmability, Python, InfluxDB and Grafana”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Caitlin Croft: Customer Marketing Manager, InfluxData
- Jason Davis: Distinguished Services Engineer, DevNet, Cisco Systems
Caitlin Croft 00:00:06.386 Hi, everyone. Welcome to today’s webinar. I appreciate your patience as we figured out just a few technical glitches. So we’ll just get started here. My name is Caitlin Croft. I work here at InfluxData, and I’m really excited to be joined by Jason Davis from Cisco, who will be talking about how they use Python, InfluxDB and Grafana to make the experience for their customers and partners at their conferences the best it can be. So this session is being recorded and will be made available for replay later today or tomorrow. And you can find in the bottom right of the Webex window, you can find the chat box. So if you have any questions for us, please post them there and we will answer them at the end. I just want to remind everyone to please be courteous to all attendees and speakers. We just want to make sure this is a fun and happy place for everyone. All right. Without further ado, I’m going to hand things off to Jason.
Jason Davis 00:01:08.782 Hello, everyone. My name is Jason Davis. I’m a distinguished engineer at Cisco and part of our developer relations team and the DevNet program. I’m based out of our Raleigh, North Carolina campus, and I’ve been doing the Cisco Live Network Operations Center for about 11 years now. And in that, I get to go a lot of places in the US and Europe in preparation for our events. And if you’re not familiar with it, Cisco Live, formerly known as Networkers, has been running since 1989, and it’s really the industry’s premier event for education, inspiration and fun for our network-focused attendees. And we hold it in several theaters each year. The US and Europe theaters tend to be our largest, with the US being about 24,000 attendees and Europe being about 14 to 16,000 attendees. And you can see here some pictures of the Fira de Barcelona, San Diego Convention Center, Las Vegas and the Messe in Berlin. And we also have events in Cancun and also in Australia, in Melbourne. And those events tend to be a bit smaller and don’t require the IT support staff and set up that we tend to do. And we have some functional requirements for doing an event this large. We have to be ready by registration open, and usually that’s around noon on a Saturday before the event. And the venue typically doesn’t give us a lot of access before. We’ll get four or five days before the event to get in, bring all of our equipment and set it up.
Jason Davis 00:03:03.259 And there can be thousands of pieces of wireless access points and hundreds of switches that we have to deploy across the venue so we can use our own equipment. And we have a few days of travel to get there. And depending on where somebody is coming from, they might be a little bit jet lagged. And we’ll work on it for about four days and try to be ready by Friday night. So Saturday at noon, we can start to open registration and allow people to come in. And our requirements are also that we need to be able to rapidly configure and provision this equipment. We try to pre-stage it as much as possible, but we really don’t have a maintenance window. It’s all a maintenance window. So if somebody’s going to give a keynote, we need to update some devices, we do it right then. We can’t wait till that evening to configure something because people are speaking all the time, from 8:00 in the morning until 6:00 or 7:00 in the afternoon, evening. Our attendees like to see the high levels of visibility into our availability and performance and other metrics for the event. How many devices are on the wireless network, how many terabytes of traffic have we moved with the Internet. And we try to secure and protect the customer’s privacy as much as possible, while still getting some data points about how many attendees we have and what kind of devices. We strive to make it intuitive for our Cisco Network Academy students, who help us set up and run the Show Network, and we want it to be non-stop and high performance for all hours of the show. So here we are by the numbers, four days of set up, about five days of the event.
Jason Davis 00:04:56.736 For the US event, about 25,000 attendees, 600 speakers, maybe a couple dozen keynotes, and one big customer appreciation event, which usually draws in some big name performer. And at those customer appreciation events, we also give away hats. So here are some hats from my own personal collection above my desk. Depending on the venue, we may take over their wireless environment, or we may set up our own temporary one. And we have had situations where we’ve had to set up 2,300 wireless access points. We typically do have to bring in over 600 switches and closet distribution aggregation switches. We’ll bring in our own service provider grade routers to connect on the venue and also into the colo facilities to provide us that Internet connectivity. And we bring our own UCS and Hyperflex compute equipment to run virtual machines for management tools, applications for registration, and for the video session capture and things of that nature. And we partner with other companies like NetApp to bring in some of their best equipment, like their all-flash storage arrays, which allow us to really quickly capture the video sessions and also do our network management applications. For the last five years, we’ve had in the US the ability to bring down dual 100-Gig links to the venue, whether it was in Orlando or Las Vegas. And again, we’re covering millions of square feet of conference space. This is what the network topology looked like.
Jason Davis 00:06:55.964 And this is from San Diego 2019, which was in June of 2019, the last time we had an in-person event. And we tend to bring in the core of our network in racks that are already pre-populated and pre-cabled and then we build out from there. And we can see the connectivity to the colo facilities and any on-premises IT Department, like Smart City sometimes provides a venue’s connectivity. But we’ll partner with CenturyLink to get those 10-Gig and 100-Gig links to connect the show. And then we do a layered approach in security using our Firepower firewalls to make sure that we’re protecting our own network operations center block of equipment, we’re protecting the whole conference center coming in, and then we also have little distribution blocks or nodes where we concentrate things like our wireless LAN controllers. Sometimes we also have to connect to adjacent hotels or buildings to provide overflow space and whisper rooms and things of that nature. So it’s not uncommon to see, besides the conference center’s own connectivity, that we have connectivity into other hotels. And we’ve done some fun things, like in San Diego Convention Center, the Hyatt did not have any existing fiber between its hotel and the adjacent conference center. So we ended up having to do microwave connections on the roof over to that hotel. And if you haven’t been inside one of these conference halls when they’re setting up, it’s just huge blank space of concrete floor.
Jason Davis 00:08:54.273 And the IT department for the conference center will sometimes lay down electrical and Ethernet cabling from floor pockets to the various booth locations for any of the vendors that are showcasing with us. And then we bring in our own data center equipment to run the whole show network inside the venue, and we call it the world of solutions. If you’re a little bit scared of heights, you’d need not apply because sometimes we have to get up into the ceiling, whether it’s on a lift or, as you see here, somebody on a rig, to be able to install or adjust the antennas for the access points. And there’s a lot that goes on. Again, we’re bringing in our own equipment and shipping containers. We’re plugging things together and connecting in things, like the adjacent Hyatt Hotel, which did not have any fiber. So we ended up using microwave and wireless bridges so we could extend the show network into that building. And so people would have a seamless experience as they’re walking through the entire venue, even as they were walking outside to go over to an adjacent hotel, they could stay on the Cisco Live Network. In a few situations, we’ve also put in 4G cellular data connections so attendees getting on a bus from their hotel could be on the Cisco Live Network and connected into the show, even when they’re on a bus that we provide to shuttle people back and forth. And then eventually, what ends up happening is we have all of our racks of the data center available and people can see it. And this is the show. And you’ll note a couple of big fibers running up into the ceiling. And that’s the whole show.
Jason Davis 00:10:51.412 It’s the Internet connectivity, the DNS DHCP services, it’s the network monitoring, it’s the session capture and recording, digital signage, registration desk, all the IT gets put back into this equipment. And so as we’re thinking about what we’re doing, we have to evaluate the hardware and software that we’re going to be using and then start to do functional mapping about the instrumentation and telemetry that we’re interested in, whether it’s something coming from SNMP, streaming telemetry, gRPC, using NETCONF to do mass configurations, REST APIs to collect information and to make changes, or if we have to fall all the way back to automating some CLI and screen scraping, which, as those of you who have heard me present before know, I call Finger Defined Networks or FDN, instead of Software Defined Networks or SDN. We have to think also about how frequently do we want to pull this information or how frequently should it be pushed to us if we’re doing streaming telemetry? And then who gets the information and in what format? Is it something they need as a dashboard, something they need emailed, stored as a log file, sent as a text message or a chat message? There’s a lot of different options and different teams have different requirements. So we have to track to those, and then we have to think about how long does the information need to be maintained? Can we take information from multiple sources and match it up together and get new insights out of that information, even when it comes from different sources? And then we consider the security and privacy of the information.
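The functional mapping Jason describes can be sketched as a small catalog that records, for each metric source, how it is collected, how often, and which consumers receive it. This is an illustrative sketch only; the source names, intervals, and consumer labels are assumptions, not the Cisco Live team’s actual schema.

```python
# Sketch of a collection catalog: method, cadence, and consumers per source.
# All names and intervals here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class MetricSource:
    name: str
    method: str          # "snmp", "streaming-telemetry", "rest-api", or "cli"
    interval_s: int      # poll interval, or expected push cadence, in seconds
    consumers: list = field(default_factory=list)  # e.g. "dashboard", "chat", "log"

CATALOG = [
    MetricSource("switch-cpu", "snmp", 300, ["dashboard"]),
    MetricSource("router-optics", "streaming-telemetry", 10, ["dashboard", "chat"]),
    MetricSource("wlc-client-count", "rest-api", 60, ["dashboard"]),
    MetricSource("legacy-inventory", "cli", 3600, ["log"]),
]

def sources_for(consumer: str):
    """Return the metric sources a given consumer (team or sink) cares about."""
    return [s.name for s in CATALOG if consumer in s.consumers]
```

A catalog like this also answers the retention and format questions in one place, since each entry can later grow fields for retention period and delivery format.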
Jason Davis 00:12:46.533 In Europe, they have some privacy laws that say we can’t publish or show on dashboards, essentially, IP addresses and MAC addresses of individuals’ devices. So we have to be careful about the information we do collect, how that appears to other attendees in a public fashion. This is the topology or framework that we use when we’re managing a show. It starts with a service catalog, and people use a lot of different types of service catalog. We had our own at Cisco. ServiceNow is a very popular one, and there are open source solutions for taking in requests to trigger a job or to request other types of services. We tend to lean on having orchestration engines and Python scripts and microservices that do actions and talk to other devices, other application services, collecting information, changing information, transforming that information and pushing it somewhere. And you’ll see here that we definitely have InfluxDB and Grafana as part of our solution. And this isn’t all the tools, because it wouldn’t fit into one slide, but these are kind of the high-level nuggets as far as element management solutions, some of our own commercial management tools like DNA Center and components like Webex, Meraki equipment and the routers and switches and access points in storage that we use to run the show. You’ll see also that we have Smartsheet listed there, and you might be scratching your head. So Smartsheet is kind of like an Excel spreadsheet on steroids as a service in a web browser.
Jason Davis 00:14:43.183 And we had situations where some of our registration desk folks wanted to monitor the printers and other devices in their work area, and I was happy to add those HP printers into the monitoring, but there wasn’t a centralized tool or controller that I could use to manage all those printers. So we had them use Smartsheet as a way to put in and maintain their own list of devices they wanted monitored. And then every hour, our orchestration tools would go and read the Smartsheet REST API to extract that sheet information and then add them into our monitoring. And we have several metrics that we’re concerned about. Some of the basic ones are CPU memory, interface and errors across the routers, switches, access points, the servers that we bring in, and even the NetApp storage. We’ve learned over time that things such as routing table size and MAC address or CAM table size is important. The peer adjacencies with other routing protocols and our upstream service providers, the number of wireless clients that we have per access point, the signal strength of each of those access points, and their channel assignments, they’re all very interesting and important metrics that we need to watch. When we got up to doing 100-Gig interfaces with the Internet, the optical transceiver power levels were something that we are concerned about because we want to make sure we have a solid laser signal with that service provider. And it’s interesting sometimes to see it dip with weather fluctuations. Heat and cold will cause fiber to change how transmissive it is.
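The hourly Smartsheet sync described above boils down to fetching sheet rows over the REST API and merging them into the monitoring inventory. In this sketch, `fetch_sheet_rows()` is a stand-in stub for the real Smartsheet API call, and every hostname and IP is a made-up example; only the merge logic is the point.

```python
# Hedged sketch of an hourly "read the sheet, update monitoring" job.
# fetch_sheet_rows() stubs out the Smartsheet REST API call; in reality this
# would be an authenticated GET returning the sheet's rows as JSON.

def fetch_sheet_rows():
    """Stand-in for fetching sheet rows from the Smartsheet REST API."""
    return [
        {"hostname": "reg-printer-1", "ip": "10.1.20.11"},
        {"hostname": "reg-printer-2", "ip": "10.1.20.12"},
    ]

def sync_monitored_devices(monitored: dict, rows: list) -> dict:
    """Merge sheet rows into the monitoring inventory, keyed by hostname.

    Re-running the job is idempotent: existing entries are simply refreshed.
    """
    for row in rows:
        monitored[row["hostname"]] = row["ip"]
    return monitored

inventory = {"core-sw-1": "10.1.0.1"}            # devices already monitored
inventory = sync_monitored_devices(inventory, fetch_sheet_rows())
```

Letting the registration team own the sheet while automation owns the merge is the key design choice: no monitoring-tool access is needed to add a printer.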
Jason Davis 00:16:41.390 And being able to capture that information through different telemetry, putting it into Influx, and then using Grafana to provide some of the visualization was an important capability for us. And being able to have that flexibility was important. We wanted to monitor our IP address consumption, make sure that we had large enough blocks and weren’t running out of IP addresses. And then obviously, availability, latency and packet loss. Depending on the venue, temperature is either very critical or it’s just nice to know. And in Las Vegas, where it can be 116 degrees outside and it can be 85 degrees in the conference hall before they close all the garage doors and start to condition the space, temperature is very important. In Orlando, it can be very humid and warm, and that’s not good for the equipment. And so imagine you’re setting up this equipment in a huge conference space and there’s these garage doors. They’ve got trucks and forklifts coming in and out, and it’s just hot, humid air coming in for four days. And then eventually, they say, “Okay, we’re getting ready to run the show. So let’s close the doors and turn the big air conditioners on.” Your equipment has to be able to run for potentially four days at very high temperatures. So sometimes we have to monitor that and then bring in our own portable cooling units where we see hotspots. Power consumption, we want to make sure we’re good stewards of the power that we have and not overly abusing the power grid. Now, we’re going to build some of our own dashboards. We’re going to use some dashboards from our commercial tools.
Jason Davis 00:18:34.031 And then what I’m showing here are a lot of Influx and Grafana based dashboards from when we create our own custom visualizations. And there are a lot of challenges to deal with in a show when you have wireless covering millions of square feet. We have our own wireless network that we can cover pretty well and understand the RF planning, but then there are partners that bring in their own wireless for their booth and sometimes that can conflict with what we’re doing. So we have to be able to see what’s going on there. And the availability monitoring is very important, not only what is down, but what is dropping packets and what is slow to respond. And so what I wanted to offer you guys is the same dashboard or a very similar one as what you just saw for Cisco Live. We’ve done some tweaks with it in our DevNet organization. We call it now the DevNet Dashboard Availability Monitor. And what it uses is inputs from authoritative sources like DNA Center, ACI, APIC controllers, if you have that. And it will allow us to– and also Prime Infrastructure, if you have that legacy tool. So we’re pulling in the information from those solutions to make the source of truth for what we need to essentially ping. And this QR code gives you access to a Git repo where you can deploy this. And I believe there’s also a Docker containerized version if you just wanted to suck that down and run it. You can set up the authentication to your own systems and it will start to build a dashboard, like what you’re seeing here.
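The availability distinction above (down vs. dropping packets vs. slow to respond) can be sketched as a classifier over per-device ping results. The loss and latency thresholds here are illustrative assumptions, not the values the Cisco Live or DevNet dashboards actually use.

```python
# Sketch: classify a device from its ping statistics, distinguishing hard-down
# devices from lossy and slow ones. Threshold defaults are assumptions.

def classify(loss_pct: float, avg_rtt_ms: float,
             loss_limit: float = 5.0, rtt_limit: float = 100.0) -> str:
    """Return one of "down", "lossy", "slow", "up" for a device.

    loss_pct   - packet loss over the last probe cycle, 0-100
    avg_rtt_ms - average round-trip time of the replies that did arrive
    """
    if loss_pct >= 100.0:
        return "down"          # nothing came back at all
    if loss_pct > loss_limit:
        return "lossy"         # reachable, but dropping packets
    if avg_rtt_ms > rtt_limit:
        return "slow"          # reachable and clean, but sluggish
    return "up"
```

Feeding each device’s class into a dashboard as a tagged metric is what turns a simple ping sweep into the three-way visibility the talk calls out.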
Jason Davis 00:20:29.591 And we hope over time, that Cisco and the open source community will continue to add other sources of truth, like other element management systems, other controllers that can feed into that central inventory list to monitor. All right, moving on. I mentioned monitoring wireless access points and how many clients are in each one. This is important because if you have access points that are too heavily loaded, that’s a bad experience. So this gives us the opportunity to see which ones we need to tweak. We might need to adjust the RF power or add another access point somewhere nearby so we can kind of shed the load around. And then it’s kind of interesting to know how are people using the newer wireless standards? There are some really old ones, like 802.11b from back in the early 2000s, but now we have some of the newer standards like 802.11ac, sometimes called WiFi 5 and now 802.11ax, which is WiFi 6. And this dashboard predated the WiFi 6 capabilities, but we could see where most of our devices are falling on the wireless spectrum. And just kind of an interesting geek joke here was, a few years ago, I saw that there was someone– there was a device that was on the Cisco Live 2.4 older band wireless network, specifically on 802.11a, and they were also on an IPv6-only wireless network.
Jason Davis 00:22:22.371 So I looked at that and I said, “Well, this is interesting. Here is somebody who is so forward thinking about networking that they only want to be on an IPv6 network, but they’re so frugal that they’re using 15-year-old radio technology.” And so we found this individual and we gave them a new USB wireless dongle and said, “Welcome to the 2020s. You now have the ability to be on WiFi 6 and also use 802.11ac.” So if you don’t capture the information, you don’t know and it’s hard to make business decisions. So that’s why we tend to just have a very broad sense of data collection so we can make better informed decisions about is it time to shut down the older bands? Are there still devices out there that are using them? And again, having dashboards like this helps us understand when it’s time to shut down some of those older ones. To make this all happen, it’s really about the APIs. All right, you guys tracking with me here on the APIs? Okay. And whether it’s a device API or it’s an application API, sometimes we extract information from a router or a switch, transform it a little bit, and then push it into something like Influx and use their line protocol write API so we can put it into the right sensor group, tag it appropriately, and boom, now it’s available for us to pull into Grafana. And we’re able to build some fun dashboards like this one. I lovingly refer to this one as the Jerry Lewis Telethon Dashboard, just like Jerry Lewis, who years ago had an annual day to raise money for muscular dystrophy. And he would show this tote board about how many millions of dollars they raised for muscular dystrophy.
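The extract-transform-push step above can be sketched as building an InfluxDB line protocol record by hand: measurement name, comma-separated tags, fields, and a nanosecond timestamp. The measurement and tag names below are illustrative, not the show’s actual schema.

```python
# Minimal sketch of composing InfluxDB line protocol for a transformed metric.
# Format: measurement,tag=value[,tag=value] field=value[,field=value] timestamp
# Note: numeric fields default to float in line protocol; append "i" to the
# value if you want an integer field.

def to_line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    """Build one line protocol record (tags/fields sorted for stable output)."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

# Hypothetical sample: wireless client count for one access point.
line = to_line_protocol(
    "wifi_clients",
    {"ap": "ap-hall-b-17", "band": "5ghz"},
    {"count": 42},
    1560000000000000000,
)
```

The resulting line can be POSTed to InfluxDB’s write endpoint (`/write` in 1.x, `/api/v2/write` in 2.x) or handed to a client library; once tagged this way, Grafana can group and filter by `ap` or `band` directly.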
Jason Davis 00:24:22.770 This is kind of our dashboard that shows, every five minutes, how many terabytes of traffic we’ve moved with the Internet. And for the setup team at Cisco Live, it’s a pretty small amount when you consider you have maybe 200 gigabits per second of Internet connectivity. But then as the 25,000 attendees get there, then you start to crank up the downloads. And usually by Thursday, people are downloading a lot of movies to their devices because they know they’ve got a trip home and they need to have something for the flight. So it’s kind of fun to see how people are using the bandwidth as necessary and knowing, again, when that high point is and seeing what the triggers were that caused a lot of bandwidth to be used. And invariably, either Apple or Microsoft releases some major patch during our event. And all of a sudden, people are downloading patches and doing software updates because they have a lot more bandwidth available to them potentially than they have at home or their own work environment. The spike that you’re seeing here on this graph was kind of funny because our NetApp partner decided with over 200 gigabits per second of bandwidth that, “Hey, let’s back up the storage array across the Internet to another storage array in the colo facility.” So they decided, “Let’s take as much bandwidth as we can to do those snap backups.” Because we are a technology show, there are people that have special interest in different technologies, like IPv6. And people want to know, what is that adoption rate look like, how much traffic is being used that is IPv6?
Jason Davis 00:26:20.116 And so we collect that information. And invariably at the show, we have people who are not network technologists, but they’re there to learn something new or to share their experience in another discipline. And so folks sometimes ask the question, “Well, is a terabyte something big?” So we created a dashboard that was pretty elaborate to translate a terabyte into other fairly known quantities and volumes of data. So if you go really old school, we’re showing you that 67 terabytes is over 561 billion punch cards of data, right? It’s also equal to 44.6 million three and a half inch floppy disks, if you remember those little floppy disks from the day. When I was building this dashboard, one of my children came and said, “Dad, how much would this be if we were talking about Marvel movies?” And they’re thinking about the DVDs that they watch these Marvel movies. So at that time, in 2019, we were in the middle of all these Marvel movies. And so I did the calculations to find out at that time, 22 different Marvel movies on a DVD, how much space does it take to encode them? And it would have taken– 67 terabytes would have been 2,200 copies of all the Marvel movies at that point in time. So numbers of CD-ROMs. And if you’re where I am in North Carolina, 62,000 pickup trucks full of books, right? So a lot of pickup trucks around here. And we’re moving on.
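The conversions on that dashboard are straightforward arithmetic once you fix the capacity assumptions. The sketch below uses assumed capacities (a ~120-byte punch card, i.e. 80 columns by 12 rows of punch positions; a 1.44 MB floppy; a 4.7 GB single-layer DVD), so its outputs are order-of-magnitude estimates that will differ slightly from the dashboard’s exact figures depending on the units chosen.

```python
# Rough "how big is a terabyte?" converter. Capacities are assumptions:
#   punch card ~120 bytes, floppy 1.44 MB, single-layer DVD 4.7 GB.
# Results are estimates, not the dashboard's exact numbers.

TB = 10**12  # decimal terabyte, in bytes

def equivalents(total_bytes: int) -> dict:
    return {
        "punch_cards_120B": total_bytes // 120,
        "floppies_1_44MB": total_bytes // int(1.44e6),
        "dvds_4_7GB": total_bytes // int(4.7e9),
    }

est = equivalents(67 * TB)  # the ~67 TB moved at the 2019 show
```

With these assumptions, 67 TB works out to roughly 558 billion punch cards and about 46 million floppies, in the same ballpark as the figures quoted on the dashboard.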
Jason Davis 00:28:16.020 Now, one of the things we like to do is take the data and we try to overlay as many devices as we can, tag them and overlay them on a panel. So then we can start to compare the relationships of devices. And if they’re all performing a very similar function, they should have a very similar CPU and memory utilization. And when we see this, we can see it’s pretty standardized and things will fall in line pretty well. But sometimes when we’re overlaying all these devices, we’ll see outlier data points, like at the top, we had this one device that really shot its CPU up. When we expect all these devices to be around 16% CPU, and we’re talking hundreds of access layer switches across the venue to go into the breakout rooms, to support digital signage, the testing center, the registration desk. We expect all of these to have a fairly consistent CPU and memory utilization. And the one on the top was a device that just went from 17% up to 80%. So that got us thinking, “We need to check this device out.” And on the bottom, you might notice this one device is showing a ramp that’s increasing its memory utilization. Folks, that is a classic memory leak situation. And as we’re looking at the data, we started asking ourselves, “Will this device run out of memory during the day while our attendees were here? Or is it something that we can take care of in the evening when nobody’s there and reboot the device? Or maybe we just let it go because it won’t exhaust all of its memory until well after the show.” So again, collecting the information, making it available is something that’s very important and it allows you to make better informed business decisions.
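The memory-leak triage described above (fix now, reboot tonight, or let it ride past the show) comes down to fitting a trend line to recent utilization samples and estimating when it crosses 100%. This is a minimal sketch with plain least squares, not the team’s actual tooling.

```python
# Sketch: estimate hours until a leaking device exhausts memory, by fitting a
# straight line to (hour, utilization %) samples. Plain least squares, no
# external libraries. Thresholds and sample cadence are assumptions.

def hours_until_exhaustion(samples, limit=100.0):
    """samples: list of (hour, utilization_pct), in time order.

    Returns hours from the last sample until the fitted line reaches `limit`,
    or None if utilization is flat or declining (no leak indicated).
    """
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * y for t, y in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    if slope <= 0:
        return None
    intercept = (sy - slope * sx) / n
    t_cross = (limit - intercept) / slope
    return t_cross - samples[-1][0]

# Hypothetical device climbing ~2% per hour from 40%.
leak = [(0, 40.0), (1, 42.0), (2, 44.0), (3, 46.0)]
```

For the device above, the fit says roughly 27 hours of headroom remain, which is exactly the kind of number that decides between a daytime fix and an overnight reboot.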
Jason Davis 00:30:14.657 We like to see things like IP address management, MAC address consumptions. When we’re doing automated and orchestrated workflows, we want to make sure that they’re hitting every time that we expect them to run. So this graph is showing all the executions of these workflows. And so we expect to see them very periodic. And if they’re missing or if they’re overlapping, that means there’s a timing issue. Either too much work is happening and it ends up running too long into the next interval, or if it’s missing, then maybe the workflow just didn’t run. So a good visualization here to help us see that everything is consistent as it’s running. We also got to thinking about answering some of the questions, what is the return on investment by doing all this automation? And knowing how some of these activities, how long it would take in a manual sense, along with how much it costs per an engineer, and then when we automate it, how quickly it can run, allowed us to do an ROI calculator so we could report back to some of our management about how much time savings we were achieving by automating some of these processes. And then as we’re setting up a venue and then shutting down parts of the venue, it’s nice to have a turn up histogram that shows us the number of devices that are online versus the number of devices that are offline. And you want to know as you get towards the end of the show, is there still equipment that’s deployed out somewhere that you need to go retrieve so you don’t leave an expensive switch or some access points somewhere in that 2 million square feet of conference space?
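The workflow-periodicity check above (spotting runs that overlap into the next interval, or runs that never fired) can be sketched as a pass over sorted (start, end) times. The 1.5x gap tolerance below is an assumption, not the team’s actual setting.

```python
# Sketch: audit workflow executions for timing problems.
# runs: sorted list of (start_s, end_s) epoch-second pairs for one workflow.
# Flags "overlap" when a run bleeds into the next one's start, and "gap"
# when consecutive starts are much further apart than the expected period.

def check_schedule(runs, period_s, tolerance=1.5):
    """Return a list of (issue, start_a, start_b) tuples for adjacent runs."""
    issues = []
    for (s1, e1), (s2, e2) in zip(runs, runs[1:]):
        if e1 > s2:
            issues.append(("overlap", s1, s2))        # ran long into next slot
        elif s2 - s1 > period_s * tolerance:
            issues.append(("gap", s1, s2))            # an expected run is missing
    return issues
```

Rendering these issues on a timeline panel is what makes the “everything periodic, nothing overlapping” pattern visible at a glance.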
Jason Davis 00:32:11.511 And another kind of fun thing we do is we pick up the host name. So as your devices come online, you register your devices host name, laptop, phone, tablet with the DHCP server. And it’s kind of funny, people pick their own kind of host names for their devices. And I don’t know what it is about the FBI and the NSA, but people really like to name their device after something related to the FBI and NSA. I love Arnold Schwarzenegger. I do a lot of Arnold impersonations, especially when I’m in California. So anytime somebody puts an Arnold host name out there, I’m going to get noticed on that. So Arnold’s iPhone. And then of course, because my name is Jason and I think about data encoding, JSON, JSON’s iPhone, that one really resonates with me. So things that we’re thinking about for future Cisco Live, we’re going to have dual 400-Gig links. We went from dual 100-Gig links to dual 400-Gig links. That is 800 gigabits per second of bandwidth, that’s going to be insane. I’m looking forward to it because I think the bandwidth ratios for the attendees are going to be insane. We don’t expect to have as many attendees for June, as we did in 2019, but we’re going to go up to 800-Gig. So that’s a lot. We’re going to have a potential to upgrade the Mandalay Bay to all WiFi 6 access points, and we’ll see how that negotiation works. And then we’re going to focus a lot more on streaming telemetry.
Jason Davis 00:34:03.273 We’ve used it in the past, but three years ago, streaming telemetry for Cisco equipment really meant service provider grade equipment. Now in the last three years, we’ve pushed that gRPC capability to a lot of enterprise grade equipment, and we have a lot more capability for that. And I look forward to using more streaming telemetry rather than using a lot of SNMP or API polling. And then we want to make sure that we’re moving up to the latest InfluxDB versions and Grafana. I’m looking forward to also using the newer cloud solutions that did not exist years ago when we had an in-person event. And so we want to do dual data injections, keep data local, but then also push it into cloud. And then we want to augment our mass ping utility that I showed you earlier that was the availability dashboard and augment that with ThousandEyes. So beyond using Ping and using IP SLA and some other availability techniques, we want to take advantage of the ThousandEyes agent technology we’re putting into our equipment. So with that, are there questions? I know I kind of ripped through that pretty quickly because we got started a bit late.
Caitlin Croft 00:35:27.390 Thank you, Jason. That was awesome. So one question we have here is, do you have any recommendations or best practices that you can share with us about how to scale telemetry with InfluxDB?
Jason Davis 00:35:42.385 So what I tend to do is start by deploying as many collector instances as I think I’m going to need. That could be two or four, and then monitoring the Influx database performance itself. And then if I start to see that I’m hitting thresholds, then I deploy additional ones. And I don’t know how people are embracing, how well they’re embracing load balancers and Kubernetes and containers and scaling that way, but it’s also a very good technology that we use to quickly spin up another poller instance or another collector instance, depending on the system that we’re working on. So it starts by measuring the collector to see how it’s performing. And when you start to notice that there are performance concerns, memory, CPU interface, then you start to deploy additional instances, because not everybody has the freedom to have multi-gig, multi-hundred-gig or NetApp all flash storage arrays available to them. So you have to make sure you’re understanding the constraints that you have within your compute environment and storage.
Caitlin Croft 00:37:05.000 And when did you start using InfluxDB at Cisco Live events?
Jason Davis 00:37:11.568 I believe it was about seven years ago. We were pretty early with using time series databases. We had been using databases for many things before and tried to use some of them in a time series sense. But when we saw what you guys were doing at InfluxData and how you were implementing a time series database, it just resonated: “Hey, there’s this information that we need to store and graph as a time series.” It fits that time series database model really well.
Caitlin Croft 00:37:54.956 Prior to using InfluxDB, were you using other time series solutions or a different type of database?
Jason Davis 00:38:03.198 We were using other types of databases and doing that square-peg-in-a-round-hole thing, trying to make a relational database do time series work. And you can kind of make it work, but it’s just not as elegant or as performant as what you guys have provided. So we are happy to look at the data that we’re collecting and say, “This makes sense for a time series effort, so push it over to InfluxDB.” And we still have cases where some pieces of data have no real concept of time, where you’re just keeping track of IP addresses or MAC addresses. Those things, if they don’t have a concept of time, you can put into a relational database with no problem.
Caitlin Croft 00:38:59.664 And it’s amazing just the scale of these events that Cisco puts on. How do you see network monitoring changing for you guys with the popularity of hybrid events?
Jason Davis 00:39:12.463 Well, I see that we are going to move to more of the streaming telemetry model, whereas we’ve had a rich past, many years, decades, of SNMP, where you’re polling a device and it’s responding to you. With the advent of streaming telemetry and gRPC, what you’re seeing is devices pushing that information to you and you’re just playing a catcher model. And that cuts your network traffic practically in half: you’re not requesting and getting a response, you’re just receiving it. And the interesting thing about using a receiver model is that if you don’t receive some metrics in a time period that you expect, it takes on an availability monitoring construct. I didn’t receive that CPU information from that device, and it’s supposed to be pushing it to me every 10 seconds. So after 30 seconds, maybe you kick off a process where you proactively interrogate that device to find out, “Why didn’t you send me the information you were supposed to?” So availability monitoring gets wrapped up into streaming telemetry in a sense, even though you’re using it for performance’s sake.
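The "receiver model doubles as availability monitoring" idea above can be sketched in a few lines of Python. The 10-second push interval and 30-second threshold come from Jason's example; the function and variable names are made up for illustration:

```python
import time

def stale_devices(last_seen, stale_after_s=30.0, now=None):
    """Given a map of device -> timestamp of its last telemetry push,
    return the devices that have gone quiet past the threshold and
    should be proactively interrogated (e.g. with an SNMP or API poll)."""
    now = time.time() if now is None else now
    return [dev for dev, ts in last_seen.items() if now - ts > stale_after_s]
```

A scheduler would run this check periodically, and any device it returns becomes an availability event even though the underlying stream carries performance data.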
Caitlin Croft 00:40:37.543 Absolutely. Are you using the Telegraf Cisco MDT input plugin for any of this, or writing line protocol to InfluxDB directly?
Jason Davis 00:40:51.512 We’ve used both. Three years ago, when we had our last large event that required 100-gigabit interfaces, that’s when we used some of our carrier-grade equipment, and that would have been where we took in the streaming telemetry information, because those service provider carrier-grade devices were capable of it years ago. Now, we have a broader enterprise-grade set of equipment that supports it, so I’m looking forward to doing even more streaming telemetry. And as far as the line protocol write capabilities go, that was early functionality that you folks had, Caitlin. In some cases, when we’re dealing with data that isn’t gRPC streaming telemetry, we need to go and get that data somehow: do a REST API call, take some data and convert it from XML or JSON into something else, or transform it in another way, and then push it into InfluxDB. The line protocol write method has been pretty effective: create a nice little payload and push multiple metrics into the database en masse and tag them. Very convenient.
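The write capability Jason mentions is InfluxDB's line protocol. A minimal hand-rolled formatter might look like the sketch below; the measurement, tag, and field names in the usage are invented, and real code should also escape spaces and commas per the line protocol spec:

```python
def to_line_protocol(measurement, tags, fields, ts_ns=None):
    """Render one InfluxDB line-protocol point:
    measurement,tag1=v1,... field1=v1,field2=v2 [timestamp_ns]
    Note: does not escape special characters; illustration only."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(fields.items()))
    head = f"{measurement},{tag_str}" if tag_str else measurement
    point = f"{head} {field_str}"
    return f"{point} {ts_ns}" if ts_ns is not None else point
```

A batch of these lines, newline-joined, can then be POSTed to the InfluxDB write endpoint, which is the "push multiple metrics en masse and tag them" pattern described above.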
Caitlin Croft 00:42:17.118 Awesome. I have a question about data retention. So you have these events; you’re there a week before, then roughly a week of the event. I’m assuming you don’t need to keep the data too long after the event, maybe for some post-mortems. Do you guys keep it at all? Do you maybe downsample it to review it and compare it to other events down the road, like a year from now, or–?
Jason Davis 00:42:41.631 Yeah. Usually, the first couple months after the event, we’ll take a look at the data and downsample it, as you mentioned, or summarize it, to make sure we’re able to compare events from year to year as far as device counts and different types of device models: how many Apple devices, how many Samsung devices, devices that were using 802.11ac. So we’ll take that information and summarize it, and then pretty much the next event that we need that equipment for, it gets wiped. This equipment gets used for Cisco Live events; it gets used for our own internal Cisco events, like Impact, which is our sales convention, our partner showcase, and other types of events. So that information regularly gets deleted and the storage arrays get reformatted. So hopefully people can feel confident that we’re not taking advantage of their private data; we’re only looking at information in a summarized sense to evaluate year-over-year changes.
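The post-event downsampling Jason describes, averaging raw samples into coarser buckets before long-term comparison, can be sketched like this. In practice InfluxDB's own tasks or continuous queries would do the same job server-side; the bucket width here is a policy choice, not a value from the talk:

```python
from statistics import mean

def downsample(points, bucket_s):
    """Average (timestamp_s, value) samples into fixed-width time buckets,
    returning bucket_start -> mean value for long-term storage."""
    buckets = {}
    for ts, val in points:
        buckets.setdefault(ts - ts % bucket_s, []).append(val)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}
```

Only the downsampled summaries need to survive for year-over-year comparison; the raw per-second data can then be wiped, consistent with the privacy point above.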
Caitlin Croft 00:43:52.610 So I have to ask, how are you using InfluxDB in other ways? Because I definitely have noticed the dashboard behind you. I definitely recognize the InfluxDB UI.
Jason Davis 00:44:04.909 Yeah. So this is some personal monitoring of my home network. I’ve got a few Raspberry Pis in my home network that have been deployed with temperature, humidity, light, and propane sensors, in case we have a propane or natural gas pipe leak. So those Raspberry Pis have these sensors, they’re collecting whatever that sensor data is, and then it’s getting pushed into InfluxDB. And what I found is you guys have a really cool capability now with not just the database side, but an introductory form of visualization with the panels and graphs that you can do. So I feel that for home users and small-medium business, where you just want a really simple single installer, you can get by with that. And what you saw with our Cisco Live event, where we need to take it to the next level and use some stronger visualization capabilities, that’s where we’re using you for the time series database side and then jumping up to Grafana for some of the visualizations that they have. But I’m really liking what you guys have for visualization for the personal home user, since the install is super simple and gets the job done, so. Cool. Did we lose Caitlin on her audio?
Caitlin Croft 00:45:49.407 Awesome. So I know you’re involved with network management. Oh, I’m here. Can you hear me?
Jason Davis 00:45:58.098 I can. Yeah.
Caitlin Croft 00:46:04.055 I’m here. Can you hear me, Jason?
Jason Davis 00:46:06.563 I can. Yeah, I was just looking to see if– yeah. That’s why we do all this monitoring.
Caitlin Croft 00:46:13.024 Exactly. So you’re involved with the network management and operations at Cisco. Where do you think the industry is going in the future, given the popularity of SNMP and telemetry?
Jason Davis 00:46:29.771 Yeah. Well, it’s definitely going the telemetry route. Even configuration management and provisioning, we’re starting to normalize that, as you’ve seen with RESTCONF and NETCONF capabilities, and now with newer modeling capabilities using YANG models to try to normalize how a feature is deployed. That’s very important. We’ve seen that in the service provider space, with providers like AT&T, Verizon, and Reliance, for many years. It’s coming into the enterprise space, where most of you folks should be seeing the benefits of that too. So keep an eye on gRPC for streaming telemetry and gNMI for network monitoring and provisioning of services. It’s going to help things along.
Caitlin Croft 00:47:20.551 One final question. How do you deal with alerting at these events? Because I can only imagine if you had every single thing sending alerts, it could be a little unyielding.
Jason Davis 00:47:33.499 Yeah. Well, we’re taking that MELT approach: metrics, events, logs, and traces. We are running filters and aggregations on them to identify patterns. And then once those patterns have been identified, the actual alerting can happen as a dashboard, or as a Webex message in a chat room. Or if somebody says, “I really want this as a text message,” then it depends where they want it. But that’s the benefit of having orchestration as a key part of what we do, because an orchestration engine will take information from different sources in different formats, change it up somehow, and then push it somewhere else. So it’s kind of like multiplexing the information.
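A toy version of the filter-and-aggregate step Jason describes: collapse repeated events into recurring patterns, then fan a confirmed alert out to whichever channels the orchestration layer targets. The channel names, preference keys, and thresholds here are invented for illustration:

```python
from collections import Counter

def patterns_worth_alerting(events, min_count=3):
    """Collapse repeated events/logs into the patterns that recur often
    enough to be actionable (the min_count threshold is illustrative)."""
    counts = Counter(e["pattern"] for e in events)
    return sorted(p for p, n in counts.items() if n >= min_count)

def route_alert(severity, prefs):
    """Fan one confirmed alert out: everything hits the dashboard;
    higher severities also go to Webex or SMS, per user preference."""
    channels = ["dashboard"]
    if severity >= prefs.get("webex_min", 2):
        channels.append("webex")
    if severity >= prefs.get("sms_min", 4):
        channels.append("sms")
    return channels
```

This mirrors the "multiplexing" idea: one normalized event stream in, several per-recipient delivery formats out.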
Caitlin Croft 00:48:32.118 Very cool. Well, we’ll stay on the line here for another minute or so to see if anyone has any other questions. But Jason, I think this was great. I really appreciate everyone sticking around; I know we had a few technical difficulties there at the beginning, so I really appreciate everyone staying on, and it’s been great. There have been lots of good questions, and I’m really happy to see that. It’s interesting how network monitoring has evolved in the last 20 years, and I’m sure it’ll evolve even more.
Jason Davis 00:49:06.144 Sure. It will, too, so.
Caitlin Croft 00:49:08.613 All right. Jason, is there any other last minute comments or words of wisdom you’d like to share?
Jason Davis 00:49:15.214 If you want to follow me, I had it on the lead-in slide: SNMPguy is my Twitter handle, and you’ll see me blogging and chatting and presenting at various opportunities about Cisco network technology, open source, developer relations, coding, and how we bring automation and orchestration to IT. So I appreciate, Caitlin, the opportunity to partner with you guys and to share some of what we do.
Caitlin Croft 00:49:48.934 I appreciate it as well. It’s always great to hear more InfluxDB use cases. Yes, this webinar is being recorded, so it will be made available later today or tomorrow morning; we just want to clean it up and we’ll get it on the site. And the great thing is it will be available on the page where you registered for the webinar, so it’s super easy to find. The recording as well as the slides will be made available by tomorrow. And with that, thank you, everyone. Thank you, Jason. And I hope everyone has a good day.
Jason Davis 00:50:24.705 Right. Take care, everyone. Bye-bye.
Caitlin Croft 00:50:26.557 Thank you.
Jason Davis
Distinguished Services Engineer, DevNet, Cisco Systems
Jason is a Distinguished Engineer in Cisco's DevNet organization. His role is to foster Developer Relations, develop Automation Strategies, and evangelize network programmability. His career has involved providing strategic and tactical consulting for hundreds of customers. Jason's primary expertise areas are in Network Management Systems, Automation & Orchestration, Virtualization, Data Center Operations, Software Defined Networking, and Network Programmability.