How to Monitor DOCSIS Devices Using SNMP, InfluxDB, and Telegraf
Session date: Sep 20, 2022 08:00am (Pacific Time)
WideOpenWest is one of the top broadband providers in the US, with over 3,000 employees. They aim to connect residential homes and businesses to the world with fast, reliable internet, TV, and phone services. WOW! uses SNMP and Telegraf to collect network data from cable modems and metrics from VMs/containers; they use Kafka to stream all time-stamped data to InfluxDB. Kapacitor is used to send alerts to Slack, ServiceNow, and email. Discover how WOW! is using a time series platform to collect, monitor, and alert on their entire service delivery network.
Join this webinar as Peter Jones and Dylan Shorter dive into:
- WOW!’s approach to reducing infrastructure downtime and improving service uptime
- Observability and alerting best practices
- How they use the InfluxDB platform to monitor 600K+ devices
Watch the Webinar
Watch the webinar “How to Monitor DOCSIS Devices Using SNMP, InfluxDB, and Telegraf” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Here is an unedited transcript of the webinar “How to Monitor DOCSIS Devices Using SNMP, InfluxDB, and Telegraf”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Caitlin Croft: Sr. Manager, Customer and Community Marketing, InfluxData
- Peter Jones: Sr. Manager, Software Product Integration, WideOpenWest
- Dylan Shorter: Software Integration Engineer, WideOpenWest
Caitlin Croft: 00:00:01.255 Hello, everyone. And once again, welcome to today's webinar. My name is Caitlin, and I'm joined today by Bria, who's here at InfluxData, as well as Peter and Dylan, who work at WideOpenWest. Please feel free to post any questions you may have for Dylan and Peter in the Q&A or the chat as they show how they're using InfluxDB and Telegraf. I'll be monitoring both. This session's being recorded. We'll answer all the questions at the end. And don't be shy. If there's something specific you want to know about their implementation, I'm sure they would be happy to fill you in. And without further ado, I'm going to hand things off to Peter and Dylan.
Peter Jones: 00:00:51.760 Thanks, Caitlin. So hi, thanks for joining us this morning. I appreciate everybody coming out. Yeah, so monitoring DOCSIS devices with InfluxDB. So yeah, just some quick brief introductions. Again, my name is Peter Jones. I’m the senior manager for software and product integration engineering for WOW!. I’ve got about 22 years or so in IT telecom software development, and then 20 years with WOW! in a whole variety of roles. So with me today, I’ve got one of my lead engineers on my team. Dylan, I’ll turn it over to you.
Dylan Shorter: 00:01:31.284 Yeah. I’m Dylan Shorter, engineer, working for Peter. I’ve been working in IT for almost 18 years, worked in the financial industry, worked with multiple startups, and now working for WideOpenWest. Been there for almost three years, and I’m the lead engineer on our effort to roll out a new monitoring system leveraging InfluxDB.
Peter Jones: 00:01:55.678 Thanks, Dylan. So kicking things off, what is WOW!? So WideOpenWest (d/b/a WOW! Internet, TV, and Phone) offers internet, video, and voice services in a number of markets in Michigan, Florida, Georgia, Alabama, South Carolina, and Tennessee. Just some quick points: we were founded in 1996 in Denver, Colorado. In 2001, we acquired the Americast properties in Chicago, Cleveland, Columbus, and Detroit. And then, in 2006 was the acquisition of Sigecom LLC in Columbus — excuse me, Evansville, Indiana. In 2012, we acquired Knology, which operated in 13 markets in the Southeast and Midwest. In 2017, we had our initial public offering. And then, last year, we announced the sale of our Illinois, Indiana, Ohio, and Maryland properties. Along with that, we are also going to be building out additional fiber-to-the-premises builds in Seminole and Orange County, Florida, and then Greenville County, South Carolina. So again, just another quick little blurb about WOW!, our service areas. Our corporate headquarters are based out of Denver, and then we've got some markets in mid-Michigan and Detroit, Michigan. We've got some builds in Knoxville, Tennessee; Huntsville, Montgomery, Auburn, and Dothan, Alabama; Charleston, South Carolina; Augusta, Newnan, Valley, and Columbus, Georgia; and then Pinellas County, Florida.
Peter Jones: 00:03:26.625 And then, again, we've announced greenfield builds in Greenville County, South Carolina, and Seminole and Orange County, Florida, so just to kind of give you a little idea of where we operate within the United States. And then, if we're discussing monitoring DOCSIS devices, what is DOCSIS? So DOCSIS, pretty simply, is just an acronym for Data Over Cable Service Interface Specification. This is something that was put together by CableLabs. And CableLabs is basically kind of a standards governing body for cable. It was initially formed, I think, in the early-to-mid '90s by a lot of large vendors in the cable space at the time — I think Broadcom, Scientific Atlanta, a bunch of different companies. So anyway, there won't be a quiz at the end on all of the specification versions, but this is basically just here to be indicative of how much RF capacity has grown over the course of the past 20 years or so — going from 40 megabits of downstream capacity in 1997 all the way to, excuse me, 10 gigabits in 2017, so massive growth there, so.
Peter Jones: 00:05:00.244 And then, so we've explained WOW!. We've explained DOCSIS. So what does WOW!'s network look like? What are we actually trying to monitor with InfluxDB? So WOW! — at least in our brownfield properties — has a fiber-to-the-node network architecture. So we've got our headends and circuits coming into those, satellite receivers, etc. And then we've got fiber going out to our hub sites, and then fiber going from the hub sites out to the node. And then at the node is where we have coaxial or RF legs running out to each of the premises that are serviced by that node. So getting into monitoring of nodes: circa 2015, much of the integration work for the Knology acquisition was completed. And at that time, we kind of asked ourselves, “Okay, how can we monitor individual customer cable modems within our network, as well as determine the health of a node as a whole?”
Peter Jones: 00:06:15.603 So keep in mind, as I mentioned on that previous slide, we had a lot of acquisitions over the years. So various markets had different monitoring platforms that they were already using; others did not. Additionally, purchasing hardware so we could have a uniform platform across all our markets would’ve been cost-prohibitive because there would be equipment that we’d have to get at each individual node. And I think, at least prior to the divestitures last year, we had, I think, about 14,000 nodes out in the field, so quite a lot of hardware that we’d have to support to do monitoring at the node. So we did have, though, some rudimentary processes that were already in place for gathering telemetry from individual modems. And this was primarily being done at the time for capacity planning purposes. So at that point, we thought, “Hey, let’s add some additional logic and resources to those monitoring processes, and then we can use those polling processes so that basically, we can monitor the nodes based on just looking at what’s going on with each individual modem out in the field.”
Peter Jones: 00:07:29.727 So that's a homegrown solution for monitoring. We used that process for a number of years, and it worked pretty well. For about five years after we had put that process together, we were using the same time series database on the back end. And when it worked, it went pretty well. However, we often had issues with the database. We'd have to bounce the database to get the read and the write nodes back in sync. And then once a year or so, that wouldn't work, and we'd have a 30- or 40-hour outage, which often seemed to coincide with the weekend. So anyway, the monitoring system worked well, kind of — just some issues with the back end. So at this point entered InfluxDB. In 2020, we kicked the tires on a couple of potential replacements for the previous time series database, and this load testing was performed with the Time Series Benchmark Suite (TSBS).
Peter Jones: 00:08:32.947 So obviously, what I've got here is very, very, very condensed in terms of the testing we did. But overall, looking at TimescaleDB — which we were initially trying to use because we've got a number of instances where we're using Postgres, and Timescale is just, I believe, a Postgres extension — we thought, “Hey, this will work great. We won't have to do as much additional coding, so we'll give this a shot.” On the read side, we could get close to 27 queries per second, and on writes, we could write about 5,200 rows per second. And with InfluxDB, the read speed was a little less — about the same, though — and the write speed was an order of magnitude more. So overall, InfluxDB was a lot more attuned to what we were wanting to do with it. It's grown over the years, but we had up to about 800,000 modems. I think, with the divestitures last year, we're at about 650,000 right now. So if we're polling — I think we're polling six or seven — no, eight bits of data from each modem every — let's see. We're doing 10-minute polling cycles now, so do the math. That adds up to be a lot of data really quickly. So let's see. Dylan, did you want to take over here, or did you want me to —
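As a rough back-of-the-envelope figure, assuming the 10-minute cycle Peter mentions here (the five-minute cycle Dylan recalls later would double it):

$$
\frac{650{,}000 \ \text{modems} \times 8 \ \text{metrics}}{600 \ \text{s}} \approx 8{,}700 \ \text{points/s} \;\approx\; 750 \ \text{million points per day}
$$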
Dylan Shorter: 00:10:12.242 Sure.
Peter Jones: 00:10:12.728 —go through? Okay.
Dylan Shorter: 00:10:13.953 Yeah. So [crosstalk] —
Peter Jones: 00:10:14.807 [crosstalk] conversation.
Dylan Shorter: 00:10:17.156 — basically, this lays out what our production implementation looks like. You see the cable modems we're polling. Almost all of those are actually being polled via SNMP. I mean, SNMP is old, but it's something that will be ingrained in our industry for a long time. But like he said, I think we have 650,000-some nodes or modems, and we're polling them. I thought it was actually five-minute polling cycles, but either way, it's a lot of data. Anyway, we opted to install a Kafka cluster in front of Influx so that we can control the input and output, and also so that we could easily consume or move that data into different regions if we need to, or to different systems. And we are running a four-node InfluxDB Enterprise cluster in production. And then we also have a two-data-node cluster in our test region. We have integrated, of course, with Slack, which is out of the box. And then, also, our ticketing system is ServiceNow, which we are also using for automated ticketing. Next slide, please.
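For a concrete picture of that probe-to-Kafka leg, a minimal Telegraf sketch might look something like the following. The OIDs, community string, broker addresses, and topic name here are placeholders, not WOW!'s actual configuration (WOW! also uses custom polling scripts for parts of the modem fleet):

```toml
# Hypothetical probe-server config: poll a modem OID over SNMP and publish
# the results to Kafka as line protocol. All names and addresses are placeholders.
[[inputs.snmp]]
  agents = ["udp://10.0.0.1:161", "udp://10.0.0.2:161"]
  version = 2
  community = "public"
  interval = "10m"                      # matches the 10-minute cycle Peter mentions

  [[inputs.snmp.field]]
    name = "sysUpTime"
    oid  = "RFC1213-MIB::sysUpTime.0"

[[outputs.kafka]]
  brokers = ["kafka-1:9092", "kafka-2:9092"]
  topic   = "modem-telemetry"
  data_format = "influx"                # InfluxDB line protocol on the topic
```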
Dylan Shorter: 00:11:35.767 Yeah. So we started originally with our proof of concept using InfluxDB 1.8 open source. And then eventually, once 2.0 was released, we moved to that. At that point, we were convinced, like what Peter showed, that we wanted to go with InfluxDB. So we opted to go with an all-in solution and got InfluxDB Enterprise. Like I said, we have our production cluster, and we have a test cluster in the lower regions. I was blown away with how easy it was to install and configure InfluxDB Enterprise. I mean, the clustering was easy, the documentation was great, and the support has been second to none. Next slide, please. So again, the primary purpose for Influx for us is the time series bread and butter: monitoring, alerting, and telemetry. That's the reason we went with it, because you've seen all the acquisitions that WOW! has. We need to collect information from all sorts of old or different systems. So we're using Telegraf when we can, as much as possible, to collect data. But of course, we have Filebeat, plenty of custom scripts, we're having to hit vendor APIs, and then there's plenty of SNMP polling and trap handling.
Dylan Shorter: 00:13:01.077 But the reason we did go with Influx is because it provides us the flexibility that other monitoring systems wouldn't have, so that we can actually integrate all those different systems, especially when we're dealing with plenty of vendor-managed systems that have their own restrictions — InfluxDB has enabled us to work around that. Next slide, please. So yeah, like we said, one of our biggest data sets currently going into Influx is our modem data. Again, we're doing five-minute polling cycles on 650,000-plus modems, which is a lot of data. Influx has handled it with aplomb, and we're using that data for analytics so that we can look back and see historical data, see trends. Also, we're exposing that data for our operations team so that they can identify outages and do troubleshooting as quickly as possible. You'll see there an example alarm that we send to Slack, but the same data is also going into ServiceNow with our integration so that there's automated ticketing. And we opted to do our visualization in Grafana, and that's because it was a tool we were already using and we needed dashboards that pull from other sources than just Influx, as opposed to using Chronograf. And then every alarm we send out does come with a link to the appropriate dashboard, which you'll see an example of in this next slide.
Dylan Shorter: 00:14:46.002 So this would be a node health dashboard that we have. You can see the online modem percentage history, the signal levels, the port levels, the power levels, as well as the health of each individual modem that is connected to that node. And Influx just 100% enabled us to have quick and easy dashboards like this, which has become essential to our operations team. Next slide. So another example of some of the dashboards we’ve created which is pulling from Influx is that we’ve also started collecting information from various points throughout our content delivery network and leveraging Influx so that we could determine the health of the services that we’re actually providing to the customer. So here would be a channel status dashboard that we’re using. We can see when the channel stopped working, which ones are up, which ones are not. But again, I just want to show this is an example of just a general overview that we’ve been able to really help build the quality of the services we’re providing and know, hopefully before the customers do, whether or not something is working. Next slide.
Dylan Shorter: 00:16:07.575 So the challenges that we did find through this whole process: for one, especially for people who aren't familiar with time series databases at all and only know relational databases, there's a significant learning curve for Influx especially. Part of that is the two query languages. Now, Flux is obviously the preferred one, but there's still plenty of TICKscript out there. And it is definitely not something someone coming from SQL can just easily pick up. The other challenge we had is that we had to roll back a little bit because we did do our POC on Influx 2.0. So we had to change some things once we did actually go with Enterprise, because I think, now, it's 1.9-something. So there were some changes moving back to that. And another big challenge, and this might be more specific to our industry, is that it's often difficult to convince vendors to let us put Telegraf on their systems — ideally, we would be using Telegraf everywhere we could to collect data, because that's what it's made for, and it works so well with InfluxDB. So we've had to work around that.
Dylan Shorter: 00:17:17.641 And then I also find one of my biggest pain points is actually debugging and testing stuff in Kapacitor. It can be especially hard to write automated tests or to actually figure out what is not working and why. The other thing is — this might be specific to our ServiceNow deployment — the plugin for Kapacitor did not work for us out of the box. So we had to write our own custom integration with ServiceNow. But at least InfluxDB provided us the tools so that we could do those custom integrations. Next slide. So the strengths — and there's plenty, and I made this as succinct as possible. Like I said, the ease of setup and installation was amazing. I mean, we followed the docs just once and we had our cluster up. And we were also able to easily automate that installation and setup in Ansible, which we are currently using for deployment. The performance, as Peter showed, is second to none. The support has been fantastic, and it's actually allowed us to manage our infrastructure as code, so we can follow proper dev practices where we can review code changes and deploy in a CI/CD pipeline.
Dylan Shorter: 00:18:45.012 I think I said one of the challenges with Influx is the learning curve, but part of that is because of the power, right? Flexibility and power is also one of its biggest strengths. Again, Telegraf is my new favorite hammer. I mean, I probably use it for things that I shouldn't anymore because I love it so much. As far as being able to collect data and transform data, especially with all the included plugins, it's absolutely amazing. We use Telegraf to serve up webhooks because some of our vendor tools need to hit a webhook. So we have Telegraf doing that. We have Telegraf set up as an SNMP poller. We have it set up as a trap handler and all sorts of various things. And I guess one of the biggest strengths, and one of the reasons we went with Influx over something like Prometheus, is that the push model works so well for us — I mean, anyone can spin up a new server or VM, and as long as they have Telegraf on it, it just instantly starts coming into our database. And we don't have to manage central configuration to add new nodes or new resources. So that's definitely one of the biggest pluses. Next slide.
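As a rough illustration of those last two use cases, a single Telegraf instance can act as both an HTTP webhook receiver and an SNMP trap listener using only its bundled input plugins. The ports, path, and data format below are illustrative placeholders, not WOW!'s actual setup:

```toml
# Hypothetical sketch: one Telegraf instance receiving vendor webhooks over HTTP
# and SNMP traps over UDP. Ports, paths, and formats are placeholders.
[[inputs.http_listener_v2]]
  service_address = ":8080"
  paths = ["/vendor-webhook"]
  data_format = "json"

[[inputs.snmp_trap]]
  service_address = "udp://:162"
  version = "2c"
```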
Dylan Shorter: 00:20:13.067 For the next steps, we want to do full CI/CD implementation. Like I said, we do have the cluster set up and installation already pipelined using Ansible. But next steps would be to actually automate and put into CI/CD the promotion and testing of the Kapacitor scripts and alerts along with the dashboards. Before we can do that, we obviously need to improve our automated testing. We can’t have proper CI/CD without proper tests. Other than that, we will continue to transition away from our existing monitoring solutions. Ideally, we’ll be 100% off the other paid products we’re using and move everything into our new InfluxDB solution. And along the way, we’ll continue to add more of our infrastructure and resources into our monitoring. But yeah, in the end, we’ve been absolutely stoked with InfluxDB, so yeah.
Peter Jones: 00:21:11.110 Absolutely.
Dylan Shorter: 00:21:14.780 I think that’s it. So any questions?
Caitlin Croft: 00:21:18.747 That was amazing, guys. That’s fantastic. If we can go back to the slide with the Grafana dashboard screenshot, I had a couple of questions on that. Yes, this one.
Peter Jones: 00:21:32.210 This one, or? Okay.
Caitlin Croft: 00:21:32.820 Yes. So just kind of curious — you guys have been collecting time series data for a long time at WOW!. What has been the overall impact on service? I'm just looking here at right after the 10:30 mark; there was a little dip. Are you guys able to correlate that with something else that was going on? And then the second part is the long-term impact of having this time series data readily available and these dashboards available. How has that impacted WOW!?
Peter Jones: 00:22:08.042 So as far as correlating events, we’re still working on being able, in an automated fashion using InfluxDB, to pull in as much data as we can from different points within our network. So I don’t know that we’re at a point yet where we can say, “Aha, something happened at 10:30. Let’s go draw other conclusions based on other data points that we’ve got in the network.” That’s still kind of a manual process in terms of, okay, so we just lost — let’s say we lose half the modems in a node. So it’s like, “Hey, is there a weather event? Is commercial power lost in the area?” that type of thing. So those are things that we probably look at first before we go say, “Okay, is there an issue on the node? Is there an issue out in the field?” etc., so.
Dylan Shorter: 00:23:07.676 Yeah. And to add to that, as I was saying, we're still pretty young in this rollout. So before you can actually even automate doing those kinds of correlations, it's all about collecting the data. And we're still in that phase, right? First collect, then visualize. Then you can actually start to make those correlations. That said, we have been leaning — at least the operations team has been leaning a lot on the logic internal to ServiceNow to correlate outages and alarms, because it has its own machine learning stuff. It's beyond me. But yeah, we definitely want to improve in how we correlate all those things and automate some of those trends.
Peter Jones: 00:23:51.050 Just kind of tangential to that question, Caitlin, I saw there was another one that came in, I think, in the chat asking whether we collect SNMP only, and how about NetFlow and sFlow. So to Dylan's point that we're still working on getting as much data in from other sources as we can — and I think I made a similar point — I mentioned different acquisitions earlier. We've got a number of different products that we're selling to customers: internet; we've still got legacy video in some markets, which we're working on transitioning to IP video; and we've got a number of voice products. So it's a lot of disparate pieces of gear that we're monitoring. Some of the older video gear doesn't even support SNMP. A lot of things do support SNMP, but we're looking to get away from that — obviously NetFlow, gNMI, gRPC, that type of stuff. The thing is, there's only a really small subset of gear that actually supports those protocols at this point, so.
Dylan Shorter: 00:25:08.116 Yeah. And to add to that, for some of our biggest data points, we're actually using Telegraf to hit API endpoints from the different vendors, so it's doing health checks that way. I think we even have a system we need to add whose only method of sending out its data is email. So we'll probably have [inaudible] something to handle that. Just because we have so many different vendor systems and so many legacy systems from different acquisitions, we're at the whim of basically the vendors and the technology. So we're having to collect data in any way we can get it. And a lot of times, that's SNMP. But a lot of times, it's not.
Peter Jones: 00:25:53.952 A lot of technical debt.
Caitlin Croft: 00:25:56.772 Well, hey, at least with Telegraf, there’s so many plugins. You can pull it all into InfluxDB.
Dylan Shorter: 00:26:03.330 Right. Totally. As long as you can read it in, it’ll do the transformation, and it’s easy to get it into DB at that point.
Caitlin Croft: 00:26:11.191 Yeah. So if you don’t mind going to the slide with the architecture diagram, someone was asking, why do you need Kafka?
Dylan Shorter: 00:26:23.584 So okay. And this is something that we questioned whether or not we wanted to do, because it does add an extra level of complexity. But I think there's a lot of benefits to using Kafka. In fact, I would highly suggest anyone who's rolling out Influx put Kafka in front of it. One of the biggest benefits is that it's an extra layer of redundancy, because I think our Kafka cluster is keeping all messages for a week. So if, for whatever reason, something does go wrong on the Influx side, or we need to do maintenance, everything that's sending in data still collects in Kafka. And as soon as we get the cluster back up, it picks up right where it left off. And all those cable modems and everything that's sending data is unaffected and has no idea we did anything behind the scenes.
Dylan Shorter: 00:27:14.430 The other thing is that, with Influx being a push model — unlike Prometheus or something, where it's actually going out and pulling, so you're controlling centrally what data is coming in and how much of it — with the push model, you never know if there could be a bad actor sending more data than we expect. That could cripple the whole system. Kafka, as a message broker, makes it so that we can control how we're ingesting that data into Influx in a central location. Maybe we'll have multiple Telegraf Kafka consumers. Maybe we turn one off because something's not working. But it just gives you that extra safety. The other big benefit is that a lot of this data we're collecting is not just going into Influx. Some of it might need to go into our ELK servers so we can analyze the logs. Or maybe some of this data we want to use in a lower region for testing. Kafka makes it super easy for multiple consumers to read that same data stream, and it just adds flexibility, yeah.
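The consumer side of that arrangement is also typically just Telegraf. A minimal sketch, with placeholder broker, topic, and database names rather than WOW!'s real ones, might be:

```toml
# Hypothetical consumer config: read line protocol back off the Kafka topic
# and write it into the InfluxDB Enterprise cluster. Names are placeholders.
[[inputs.kafka_consumer]]
  brokers = ["kafka-1:9092", "kafka-2:9092"]
  topics  = ["modem-telemetry"]
  consumer_group = "influx-writers"     # scale out or pause by managing consumers
  data_format = "influx"

[[outputs.influxdb]]
  urls = ["http://influx-lb:8086"]      # load balancer in front of the data nodes
  database = "modems"
```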
Peter Jones: 00:28:34.517 And I'll add, this is a very simplified implementation — or, I guess, depiction — of our topology. The probe servers, the SNMP monitoring, as well as the Kafka clusters in a number of instances, depending on what data we're pulling back — most of that's actually out at the edge. We've got some compute clusters in the headends in each market. So we're a little closer to the customer network than we would be if we were doing all this in our data center, where we've got the Influx cluster, so.
Caitlin Croft: 00:29:18.755 Troubleshooting InfluxDB queries is often easy with a real-time data explorer in tools like Chronograf and Grafana. Do you have any general strategies for testing Kapacitor, especially with streaming data?
Dylan Shorter: 00:29:35.713 No. And that’s also something that I pointed out as one of the challenges we’ve had, is that I find it’s very challenging to test Kapacitor scripts. And I have yet to really dig in or figure out how we could even ever do any efficient automated testing. So I’m sorry. I don’t have any tips there. I’d love to get some myself.
Caitlin Croft: 00:30:00.209 Well, maybe someone else in the community has some tips on this, and we can figure something out to share with them.
Dylan Shorter: 00:30:05.614 And I’m sure there have been people that have figured this out, but yeah.
Caitlin Croft: 00:30:11.080 How large is your data store, and how long are you keeping the detail? And what’s your summarization archival strategy?
Dylan Shorter: 00:30:20.897 I don’t know if you can speak. I don’t remember what the hardware sizes are for the actual storage we have.
Peter Jones: 00:30:29.112 Yeah. I’m trying to remember how much we’ve got allocated. So we just turned up a new virtual cluster, a new OpenStack instance that replaced an older one that we had that’s not in the best of health. There’s quite a lot of storage on there, so we’re sort of incrementally expanding as we have need. I think, overall, at this point, we’re keeping about between 60 and 90 days of data of these RF polls that we’re doing. As far as roll-up, I think we — roll-up policies [crosstalk].
Dylan Shorter: 00:31:17.373 So no — actually, the modem data is our largest data set right now going into Influx. We are keeping a week's worth of the full sample, and then we are downsampling and saving that for — I believe it's six months now. And then, in the future, we'll probably do further downsampling to keep it for a year, I guess. But so far, our strategy is that we're not looking at Influx as a system of record. It's more for our real-time analysis. I mean, I think that's more the strength of a time series database. The way I see it, it's not the best for that long-term auditable type of data, but more just for identifying trends. And that's more how we look at it. We do have some offsite long-term cloud storage, but we're not using that right now from our Influx implementation.
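For a sense of what that kind of downsampling looks like in practice, a roll-up along these lines can be expressed as a Flux task (the kind of thing Dylan mentions later moving into Kapacitor). The bucket names, measurement, and one-hour aggregation interval below are assumptions for illustration, not WOW!'s actual policy:

```flux
// Hypothetical downsampling task: write hourly means of the raw modem RF data
// into a longer-retention bucket. Names and intervals are placeholders.
option task = {name: "downsample_modem_rf", every: 1h}

from(bucket: "modems/one_week")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "cable_modem_rf")
    |> aggregateWindow(every: 1h, fn: mean)
    |> to(bucket: "modems/six_months")
```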
Caitlin Croft: 00:32:24.875 And which cloud provider are you using for that longer-term storage?
Dylan Shorter: 00:32:30.152 I actually don’t know.
Peter Jones: 00:32:32.913 I think we’ve got through AWS. I’m not sure of the exact product.
Caitlin Croft: 00:32:43.176 [crosstalk] —
Dylan Shorter: 00:32:43.176 But again, none of that is connected to what we’re doing here with the Influx stuff.
Caitlin Croft: 00:32:46.797 Yeah. And the AWS, there’s so many products under that umbrella.
Peter Jones: 00:32:52.139 Right.
Caitlin Croft: 00:32:54.518 Let’s see. Are you doing any analytics as it relates to the network hierarchy? Are you able to perform aggregation on a deep network graph?
Peter Jones: 00:33:07.630 We’re not doing a lot of that yet. I think that’s probably a use case that’s somewhere down the road. Most of our network monitoring is still being done by a legacy monitoring application, so.
Caitlin Croft: 00:33:26.566 And —
Dylan Shorter: 00:33:26.566 So that’s certainly the goal though, right?
Peter Jones: 00:33:29.576 Absolutely. Yeah.
Dylan Shorter: 00:33:30.627 [crosstalk]. Yeah.
Caitlin Croft: 00:33:32.259 So this is another question but also asking, in addition to aggregation, are you doing any roll-up stats? And where does the roll-up get calculated?
Dylan Shorter: 00:33:43.401 So yes, we are doing some downsampling/roll-ups. And this comes from when we were first doing the proof of concept. We have a couple that are actually just continuous queries, but I do want to move those; that was in the early parts of my development. Ideally, we'll move all of those into Kapacitor to do the roll-ups there, especially since you're not taxing the database then, and you can have Kapacitor living on its own server and just separate those resources. So yeah, it's definitely on the roadmap to move all those roll-ups to the Kapacitor server, probably using Flux, so.
Caitlin Croft: 00:34:31.167 Sounds like you guys have a lot of work ahead of you. [laughter]
Peter Jones: 00:34:33.822 We do.
Dylan Shorter: 00:34:34.643 Who doesn’t?
Caitlin Croft: 00:34:37.126 No, I think it’s great. I think it’s really great to hear what you guys are hoping to do next. I’m sure the entire community, you start using InfluxDB, and you get going on it. And you start collecting all that data. And you realize how much more you want to do with it, and also what you can do with it.
Dylan Shorter: 00:34:55.350 Right.
Caitlin Croft: 00:34:57.858 Okay. So I think you’ve kind of covered this a little bit. But are you considering moving from InfluxDB Enterprise to InfluxDB Cloud?
Peter Jones: 00:35:10.291 Probably not in the near future. So we like having stuff on prem, basically. So we’re our own masters of our own destiny, so to speak.
Caitlin Croft: 00:35:24.503 It’s fascinating to me. There’s some people who are all things cloud, and some people still like having on prem. It definitely is a personal choice there.
Dylan Shorter: 00:35:35.942 Yeah. We —
Peter Jones: 00:35:35.942 And then it’s probably going to be a Q1 Q2 thing, but we are looking to get kind of a cloud failover for our on-prem cluster, so.
Dylan Shorter: 00:35:51.864 We are using Influx Insights, though, so. Yeah, I guess that we’re touching the cloud there, but.
Caitlin Croft: 00:36:00.158 And what are you using InfluxDB Insights for?
Dylan Shorter: 00:36:03.753 Just to monitor the monitor, basically.
Caitlin Croft: 00:36:06.292 Okay.
Dylan Shorter: 00:36:07.514 Nothing more than that.
Caitlin Croft: 00:36:08.979 Do you mind kind of going into a little bit more of that? Do you have dead-man checks, or? Because there’s so many different things that you can find when you monitor the monitor.
Dylan Shorter: 00:36:20.196 Well, so yeah. We do have some deadman checks. I guess, especially with a push model like that, you don't know something's broken until something's quiet, right? So we have plenty of deadman checks; we definitely need more of those. We do have some other monitoring tools that we've pointed at our cluster just to get the general health of it. But for monitoring the monitor, we're mostly just relying on — is it what? Influx Insights and Influx Aware?
Peter Jones: 00:36:57.247 Influx Aware.
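For reference, a deadman check in a push-based setup boils down to “flag any series that has gone quiet.” A rough Flux sketch of that idea — the bucket, measurement, and 15-minute threshold here are illustrative assumptions, and WOW! runs their checks through Kapacitor rather than exactly this — might look like:

```flux
import "experimental"

// Hypothetical deadman sketch: flag hosts whose newest point is older than 15 minutes.
// Hosts with no data at all in the 24h window won't appear here, so a real deadman
// also needs a reference list of expected hosts.
from(bucket: "telegraf/autogen")
    |> range(start: -24h)
    |> filter(fn: (r) => r._measurement == "system" and r._field == "uptime")
    |> group(columns: ["host"])
    |> last()
    |> filter(fn: (r) => r._time < experimental.subDuration(d: 15m, from: now()))
```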
Caitlin Croft: 00:36:58.802 Yep. And do you guys have —?
Dylan Shorter: 00:37:01.251 [inaudible].
Caitlin Croft: 00:37:05.415 And do you guys give —?
Dylan Shorter: 00:37:05.593 I’m sorry.
Caitlin Croft: 00:37:07.023 Oh, sorry.
Dylan Shorter: 00:37:08.178 Go for it.
Caitlin Croft: 00:37:09.289 Oh. I was just curious. Given what you guys do, is there any crazy SLA that you guys have to meet? If the network is down, obviously, all of your customers will be frustrated. And we all get it. Outages happen. But do you guys have any SLAs around how fast the network has to be back up and running or anything like that that you guys are trying to adhere to?
Peter Jones: 00:37:35.759 We do. Most of those are — most of that’s kind of tracked by our NOC, so kind of a level or two removed from us. That said, again, we’re a service provider, data, video, voice. If there are certain outages that we have with voice, if those are big enough or affect so many customers, we have to report those to the FCC. So yeah, I mean, with some of our — maybe not SMB but more medium enterprise-grade customers, we’ve certainly got SLAs with them that, if you’re down so long, credits or that sort of thing, so.
Caitlin Croft: 00:38:32.803 Yeah. No one likes their internet being down. [laughter]
Peter Jones: 00:38:36.308 Absolutely.
Caitlin Croft: 00:38:39.629 So you mentioned on — and I can see here on the diagram that you use Telegraf. Which plugins are you using specifically?
Dylan Shorter: 00:38:49.863 Man, there's a lot of them. Obviously, we're using the Kafka consumer for our Telegraf Kafka consumer instance, and obviously the InfluxDB input. We've got HTTP Response, SNMP Trap, the SNMP poller, webhooks, file. I mean, yeah, we're using a bunch of them.
Caitlin Croft: 00:39:18.483 Cool. Yeah. I mean, Telegraf can do so much. How many network devices are you monitoring and how many probe servers?
Peter Jones: 00:39:30.271 So for network devices, devices within the network, we’re looking at about 650,000. Again, most of those at this time are cable modems. We’re not necessarily using InfluxDB to monitor our core or access networks yet. But the probe servers, we’ve got some custom polling scripts that we’ve written that do all the modem polling. And I think at this time we’ve got, I think, six of those that are active in our Edge clusters in the individual markets, so.
Caitlin Croft: 00:40:15.869 Yeah. I know you guys have tons of devices. Well, if anyone has any more questions for Peter and Dylan, please feel free to post them in the Q&A. Just want to remind everyone, once again, InfluxDays is coming up here in a couple of months. It's in November, and the conference itself is completely free. We also have a Telegraf training — our Taming the Tiger: Telegraf and InfluxDB training. That will be virtual, and it's completely free. And if you are in the Greater London area, we do have an advanced Flux training coming up in November as part of InfluxDays, and it's going to be in person. So if you've attended any of our other trainings, or you've used Influx — or sorry, Flux — a lot and you want to get more tips and tricks and become more advanced at it, be sure to check that out as well.
Caitlin Croft: 00:41:17.371 There is a fee attached to the Flux training, but everything else related to all the other events as part of InfluxDays 2022 is completely free. So we're really excited to see our community out in person again. I think we're all looking forward to that. Dylan, Peter, do you have anything else? I know there were a ton of questions that were thrown at you guys. Is there anything else that you've thought of that you want to mention, or anything like that?
Peter Jones: 00:41:49.881 Nothing I can think of offhand.
Dylan Shorter: 00:41:51.922 No, same. This has been a good time, and I appreciate everyone’s questions.
Caitlin Croft: 00:41:56.915 Cool. Thank you, everyone, once again for joining today’s webinar. It has been recorded and will be made available for replay tonight or tomorrow morning, so be sure to check that out. And once again, thank you for joining today’s session. Bye.
Dylan Shorter: 00:42:15.997 Yeah. Bye.
Peter Jones: 00:42:16.771 Thanks, all.