Infrastructure Monitoring Basics with Telegraf, Grafana, and InfluxDB
Session date: Feb 13, 2024 08:00am (Pacific Time)
Infrastructure monitoring is a critical aspect of ensuring the reliability, performance, and availability of IT systems. This talk provides a comprehensive overview of the basics of infrastructure monitoring and introduces you to the popular open-source tools Telegraf, Grafana, and InfluxDB, the purpose-built time series database. You will learn how to install, configure, and use these tools to monitor various aspects of your IT infrastructure, including system performance, resource utilization, and network traffic.
The talk also covers best practices for creating effective and meaningful dashboards, setting up alerts, and integrating with other monitoring tools such as Prometheus. By the end of the session, you will have a solid understanding of how to get started with infrastructure monitoring and how to use Telegraf, Grafana, and InfluxDB to achieve your goals.
Watch the Webinar
Watch the webinar “Infrastructure Monitoring Basics with Telegraf, Grafana, and InfluxDB” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Infrastructure Monitoring Basics with Telegraf, Grafana, and InfluxDB.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Caitlin Croft: Director of Marketing, InfluxData
- Anais Dotis-Georgiou: Developer Advocate, InfluxData
CAITLIN CROFT: 00:01
Hello, everyone, and welcome to today’s webinar. My name is Caitlin. I’m joined today by Anais. And today, we will be talking about infrastructure monitoring basics with Telegraf, Grafana, and InfluxDB. Please use the Q&A at the bottom of the Zoom screen to ask questions, and we’ll answer them at the end. And without further ado, I’m going to hand things off to Anais.
ANAIS DOTIS-GEORGIOU: 00:24
Hi, everyone. And thank you so much, Caitlin, for that introduction. So yeah, as she mentioned, we’re going to be talking about infrastructure monitoring basics. So, my name is Anais Dotis-Georgiou, and I’m a developer advocate here at InfluxData. For those of you who aren’t familiar with what developer advocacy is, it’s basically someone who represents the community to the company and the company to the community. So, I do that through giving webinars like this, through creating InfluxDB University courses, technical tutorials, demos, POCs, and example code so that you can have a bunch of solutions and examples for how to use InfluxDB with a variety of other tools to get started on your projects. I also spend a lot of time hanging out in the community Slack and community forums. So, if you have any questions there or just want to chat about a project that you’re working on, I’d love to. I’d love to learn about what’s interesting you, why you’re interested in InfluxDB, and what you plan to do with it. And I also encourage you to connect with me on LinkedIn if you’d like.
ANAIS DOTIS-GEORGIOU: 01:32
So, before we get started talking about infrastructure monitoring, I just wanted to take a quick step back and take a quick glance at InfluxData itself. So, we were founded in 2013 by Paul Dix, who is our CTO and founder. And our core mission is to enable developers to change the world with impactful time series applications. And so, time series applications are really big in the IoT space, in analytics, and in cloud-native services. We also have one platform, one API, and we serve across multiple clouds and have both on-prem and cloud environments. And so the goal here is to use InfluxDB to build and scale applications with time series data as the foundational or fundamental component. We have a ton of different customers with over 750,000 daily active OSS deployments. So, our open-source community is really big, and we’re really proud of it. And many of our paying customers first started with the OSS version and then moved to a paid version. So, we have customers like Google, who use it for IoT monitoring application solutions, and Cisco. Tesla monitors their Powerwall batteries, and PTC, Honeywell, and a lot of other big names use it as well. And we also have usage-based pricing with multiple methods of payment. So, you can pay as you go. But for today, we’re going to be talking about monitoring and observability with InfluxDB, Telegraf, and Grafana.
ANAIS DOTIS-GEORGIOU: 03:20
So, the agenda for today is basically we’re going to start off laying some groundwork and talking about monitoring versus observability, breaking down what each area is, how they really differ, and how they’re alike. Then we’re going to create a hypothetical problem, because problems really drive learning. So, we’ll create a scenario where we want to observe a particular application. In fact, we’ll ask ChatGPT to create this imaginary observability problem for us, and then we’ll actually try and solve that problem. So, we’ll deploy an open source stack of tools like Telegraf, Grafana, InfluxDB, and OpenTelemetry to solve this observability problem. And then all the source code that solves this imaginary problem is online for you to use and try out at your own pace. And then last but not least, before I leave, we’ll take a second to get our hands on the source code, get involved with our community, and learn about all the resources that are available to you so that you can continue learning about this area and the options that are out there. But let’s just dive into it. So, let’s talk about the difference between monitoring and observability. When we think about monitoring, we think about collecting metrics, logs, and events. And when we consider metrics, we usually think of things like system stats that are taken at regular intervals. So, if you’re familiar with Prometheus, this is equivalent to a gauge, for example. So, in the time series world, metrics are taken at a regular interval. There’s another type of time series, and those are called events. And those are typically irregular metrics. So, for example, in the healthcare space, your heart rate would be a metric, and an event would be a cardiovascular event like AFib or something like that.
ANAIS DOTIS-GEORGIOU: 05:14
The point about events is that we don’t really know when they will appear. So, this could be something like an error status or maybe a user-driven process. But the cool thing is that we can typically convert irregular metrics or events into regular metrics by using regular aggregations. So, you could count, for example, every hour how many events you have. And then all of a sudden, you’ve taken this irregular metric and turned it into a regular metric. And logs, for example, can fit into both worlds. But in monitoring, it’s more along the lines of parsing those logs to extract certain values. And that can differ from how logs are used in observability. And then we look at observability. And the goal here is to expose the underlying behavior of the systems in a distributed system and figure out how they interact with one another. So essentially, we’re thinking about trying to understand what is actually causing an error message, right? We’re trying to proactively drill into our code to understand what user-driven event is causing poor health in our system, and we’re monitoring that poor health in our system with logs and traces and events and metrics.
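As a concrete sketch of that event-to-metric conversion, here is roughly what an hourly count could look like in SQL. The `error_events` table and column names are illustrative assumptions, not something from the talk:

```sql
-- Hypothetical example: turn irregular "error_events" rows into a
-- regular hourly metric by counting events per one-hour window.
SELECT
  date_bin(INTERVAL '1 hour', time) AS hour,  -- bucket timestamps into hourly windows
  COUNT(*) AS event_count                     -- regular metric: events per hour
FROM error_events
GROUP BY hour
ORDER BY hour;
```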
ANAIS DOTIS-GEORGIOU: 06:32
So, monitoring and observability fields exist across every aspect of an application. So, for network monitoring, for example, what we’d want to do is observe the performance of network components, such as our routers, our switches, our firewalls, because we want to ensure that we have efficient data transmission. We want to watch out for any data bottlenecks, and we want to identify any security threats. With server monitoring, we’re doing things like tracking the performance and availability of physical or virtual servers, including CPU usage, memory consumption, disk space, response times. We do all this to ensure that we have optimal performance and that we don’t encounter any downtime. And then for application monitoring, we’re monitoring any issues related to bottlenecks, inefficiencies in the code, databases, or other infrastructure components. And I’ve highlighted this application performance monitoring section in pink because this is where monitoring and observability really meet, in my opinion. It’s where we can monitor our application infrastructure but also observe our application infrastructure by looking at traces produced within it to perform root cause analysis. And then last but not least, we also have cloud infrastructure monitoring. So here we’re tracking the performance and availability of cloud-based services such as virtual machines, storage, databases, all to optimize resource allocation and hopefully minimize cost.
ANAIS DOTIS-GEORGIOU: 08:04
So, let’s look at the problem for today. So, the problem that we’ll have today was actually ChatGPT-driven. We gave it a prompt: we want an observability and infrastructure monitoring use case; can you give us one? And ChatGPT actually modeled this after itself. So, in this idea here, we have this product called Whisper GPT, and its purpose is to use natural language processing and machine learning techniques to provide users with highly accurate text responses. The problem is that this application has unprecedented growth, and it presents a few challenges, including various bottlenecks and latency issues. And it needs to be able to scale seamlessly in order to handle the really high influx of new users. So, the question is, how can the Whisper GPT team monitor and optimize this application and their cloud infrastructure to maintain optimal performance and give the users the experience that they need? So, we basically are looking to build a scalable monitoring solution here in a hybrid architecture for this problem. And what I mean by that is that we have both on-prem stuff that we need to monitor and also a cloud-based application that we need to monitor as well. So, in this example, we’ll imagine that we have a series of servers running our own Whisper GPT model on our own GPUs. And it’s talking to, let’s say, an AWS instance where we’re actually running our API and our UI. And we need to monitor all of this in this hybrid architecture.
ANAIS DOTIS-GEORGIOU: 09:41
So how are we going to do that? Well, to break it down— sorry. Hold on. Let’s go ahead and solve that problem. There we go. Okay. So basically, in order to solve the problem, we’re going to have to understand our monitoring data sources. So, when we build a monitoring platform, we’re going to think about doing three things. The first is data collection, then data storage, and then actually acting on that data. And the way that we’ll perform our data collection is with Telegraf. So, Telegraf is our open source, plugin-driven agent for collecting metrics and events. It has over 12,000 stars on GitHub. It’s widely adopted and has an excellent community. And really, we are just the caretakers of Telegraf, because the majority of Telegraf plugins have been contributed by the community. So, for example, there are over 300 plugins for ingesting and outputting data. In that way, Telegraf is one of the most versatile ingest agents for time series data. It’s also really lightweight. You can also reduce the binary size to just the plugins that you need. Yeah. And it’s configurable through a single TOML configuration file and downloadable as a single binary. So here are some of the input plugins that are available to you. And I’ve highlighted a few in blue that might be useful to this particular problem, like CloudWatch, CPU, Disk, Disk I/O, etc. But there are even more input plugins, and even more that might have relevance to our problem today. For example, maybe we use the NVIDIA plugin for monitoring our GPU capabilities. We want to monitor our Kubernetes, look at our memory. I also highlighted Minecraft, not because we would use it here, but just because this is an example of how we’ll accept any plugin that the community contributes so long as the code is sound and it solves a problem, which the Minecraft plugin does. It monitors players that are playing Minecraft. So yeah. It’s just a cool example of how community-driven Telegraf is.
ANAIS DOTIS-GEORGIOU: 12:07
We also have other input plugins like OpenTelemetry, Prometheus, SNMP, Processes, System, etc. So, a ton of input plugins that are very relevant to the problem we’re trying to solve today. And this is what the general Telegraf architecture looks like. So, like I mentioned, it can be configured in a single file that’s written in TOML, or a series of them. So, you can also chain TOML configurations together. And essentially, what you want to do is define each plugin and its parameters in that configuration file. And the process works like a data pipeline with a memory buffer as well. So, input collection occurs either push or pull depending on the plugin. And then we have processor plugins that allow you to pre-process some of your data: transform it, decorate it, filter it. Then you have aggregator plugins, which allow you to pre-aggregate your metrics before sending them to an output. And then we have output plugins, which allow you to send all of that data, after it’s passed through this pipeline, to the output of your choice. And typically, when a developer advocate like me from InfluxDB presents on Telegraf, we’re almost always using the InfluxDB output plugin. But Telegraf is database agnostic. You can send your data to a lot of different outputs: File, Kafka, CloudWatch, MongoDB, OpenTelemetry. So really, it’s up to you.
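To make that pipeline concrete, here is a minimal, hypothetical TOML sketch chaining an input, a processor, an aggregator, and an output. The specific plugins chosen here are just for illustration, not the ones used later in the talk:

```toml
# Input: collect CPU stats (pulled on the agent's interval).
[[inputs.cpu]]
  percpu = true
  totalcpu = true

# Processor: rename a field before it moves down the pipeline.
[[processors.rename]]
  [[processors.rename.replace]]
    field = "usage_idle"
    dest = "idle_percent"

# Aggregator: pre-aggregate metrics over 30-second periods.
[[aggregators.basicstats]]
  period = "30s"
  drop_original = false
  stats = ["mean", "max"]

# Output: write the result to stdout for demonstration.
[[outputs.file]]
  files = ["stdout"]
```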
ANAIS DOTIS-GEORGIOU: 13:35
But we’ll be sending our data to InfluxDB. So essentially, also, Telegraf is meant to be extremely versatile in how you set it up. We have binaries for most Linux flavors, but we also have Windows, Docker, Helm charts, macOS. And so essentially, what we want to do is configure that TOML and then define each plugin and its parameters. And then after that, you can use a variety of flags to test your Telegraf configuration before committing to it. So, the first flag that I want to talk about is debug. It provides all the logs for your Telegraf agent so you can really see what’s going on, you know, whether you’ve successfully loaded all of the plugins that you intended to load and what’s happening with them. And we have a test flag. And that’s really great because it allows you to collect data from your input without actually sending it to your output. So, you can really make sure that the data that you’re collecting and processing and aggregating is all being handled correctly before you commit to writing it to your data store. And then we also have once. So, this is a really great tool. What it allows you to do is test your output, because it only sends one sample instead of maybe the thousands or hundreds of thousands of data points that you’re collecting. So yeah. And then you just run telegraf with --config and the path to the actual config file that you want to run. And then you can also deploy it with a variety of different methods, so.
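On the command line, those flags look something like this. The config file name is a placeholder:

```sh
# Run with verbose logging to confirm plugins load and see what the agent is doing.
telegraf --config telegraf.conf --debug

# Collect from inputs and print metrics to stdout without writing to any output.
telegraf --config telegraf.conf --test

# Run a single collect/flush cycle, writing one round of samples to the real outputs.
telegraf --config telegraf.conf --once

# Normal operation: run continuously with the given configuration.
telegraf --config telegraf.conf
```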
ANAIS DOTIS-GEORGIOU: 15:11
Now, I’m not going to go into detail about using Telegraf as a sidecar in this presentation, but just know that that is a very common thing that people do with Telegraf. So, you can sidecar Telegraf into Kubernetes. And if you follow this QR code or URL, you can go to the simple demo in that repository. So, you can learn more about how to do that and how to sidecar Telegraf into your Kubernetes infrastructure. We also use the Prometheus input plugin in this demo. So, we monitor our application and also monitor the Kubernetes infrastructure using Prometheus endpoints. And what we do is we scrape all of them into InfluxDB, both to fully acknowledge that Prometheus is the master of monitoring when it comes to Kubernetes and also to highlight the versatility of Telegraf. So, this is what a Telegraf config looks like, the TOML configuration. This particular portion right here is the agent configuration, or the global configuration portion, for Telegraf. So, I want to talk a little bit about some of what we’re looking at here. So, the interval is basically the default data collection interval for all inputs as we’re polling. Also, any data that is collected through any input plugin is put into a memory buffer. So, if your network goes down for whatever reason, those samples are stored in a message queue, and we wait for the network to come back online. And when it is, they get written out of the queue. And so, you can also configure that buffer by specifying the buffer limit size. But just keep in mind that the bigger your queue is, the more memory you’ll need. So sometimes people put an astronomical number for the buffer limit, and then they’re surprised when they run into memory issues if the network goes down. So that’s something that you just have to balance.
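Here is a minimal sketch of what that agent (global) section might look like in TOML; the specific values are placeholders:

```toml
[agent]
  interval = "10s"              # default collection interval for all inputs
  metric_batch_size = 1000      # metrics are sent to outputs in batches of this size
  metric_buffer_limit = 10000   # max metrics held in the memory buffer if an output
                                # is unreachable; a bigger buffer means more memory
                                # is needed when the network goes down
  collection_jitter = "0s"      # random jitter added to each collection
  flush_interval = "10s"        # how often buffered metrics are flushed to outputs
  flush_jitter = "0s"           # random jitter added to each flush
```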
ANAIS DOTIS-GEORGIOU: 17:10
We also have things like the flush interval and the jitter interval. Basically, the collection jitter is used to jitter the collection by a random amount. And the flush interval is just when you flush the data out to the output. So yeah. And this is what an input plugin config would look like. So, we have chosen four plugins to meet our requirements. So, we use SNMP here to monitor our routers and our firewalls, and we can poll these endpoints with Telegraf. Or we can also use it to listen and wait for traps. So, we’re going to collect the system uptime for this router and the system name. We could also monitor things like temperature, usage, throughput, etc. But we just wanted to give a really simple example here. We’ll also use OpenTelemetry. We can also scrape Prometheus endpoints if we wish. And we’ll also use CloudWatch, and we will pull events and metrics from services being monitored by CloudWatch. And so, for our output plugins, we’ll be using the InfluxDB v2 output plugin. And we’ll talk about how to configure that as well. So basically, all you need to do is incorporate the host URL of your InfluxDB instance, provide a token for authentication, an organization, and then also a bucket, which is basically the database that we’re going to be writing our data into. And yeah. That’s pretty much the minimum that you’d actually need.
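Sketched in TOML, those input plugins and the InfluxDB v2 output might look roughly like this. All addresses, OIDs, regions, and credentials are placeholders, and the exact option names can vary by Telegraf version, so check the plugin docs rather than treating this as definitive:

```toml
# SNMP: poll a router for system uptime and system name.
[[inputs.snmp]]
  agents = ["udp://192.168.1.1:161"]  # placeholder router address
  [[inputs.snmp.field]]
    oid = "RFC1213-MIB::sysUpTime.0"
    name = "uptime"
  [[inputs.snmp.field]]
    oid = "RFC1213-MIB::sysName.0"
    name = "source"
    is_tag = true

# OpenTelemetry: listen for OTLP metrics, logs, and traces over gRPC.
[[inputs.opentelemetry]]
  service_address = "0.0.0.0:4317"

# CloudWatch: pull metrics from AWS services.
[[inputs.cloudwatch]]
  region = "us-east-1"
  namespaces = ["AWS/EC2", "AWS/ELB"]
  period = "5m"
  interval = "5m"

# Output: write everything to InfluxDB.
[[outputs.influxdb_v2]]
  urls = ["https://us-east-1-1.aws.cloud2.influxdata.com"]  # placeholder host
  token = "$INFLUX_TOKEN"
  organization = "my-org"
  bucket = "monitoring"
```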
ANAIS DOTIS-GEORGIOU: 19:06
So, we’ve hit our first milestone, and we’ve talked about how we perform data collection with Telegraf to monitor all the parts of our application. So now let’s go on to data storage. So, in order to talk about data storage, we’re really talking about storing data in InfluxDB. So InfluxDB is a time series database and platform. And InfluxDB 3.0 is built on DataFusion, Parquet, and Apache Arrow. So, Apache Arrow is an in-memory columnar format, a method for defining memory in a columnar format. Parquet is the durable file format, and DataFusion is a framework for transporting really large data sets, primarily Arrow, over a network interface. Oh, no. Sorry. I apologize. I’m a little concussed, [laughter] so I’m just messing that up. DataFusion is actually the query execution framework, so that we can query our data in either InfluxQL or SQL. Arrow Flight is used to transport Arrow over a network interface, which we’re also built on. So, the reason why I wanted to break down these technologies that InfluxDB 3.0 is built on is because essentially, this columnar format and this rewrite of the storage engine make it so that InfluxDB can handle really, really high volumes and high cardinality or high dimensionality data. And it allows developers, therefore, to ingest not only time series metrics, but logs, traces, and events.
ANAIS DOTIS-GEORGIOU: 20:54
So, some other valuable points about InfluxDB 3.0 are that it’s schema on write. This is the same as previous versions as well, which means that you don’t have to go and define a schema, you’re not locked into a predefined schema beforehand, and you can modify the schema as you go. We also enable developers to write and query millions of rows per second, and that’s because we’re on the bleeding edge of throughput through adopting a columnar store. InfluxDB is also a single data store for all time series data: metrics, logs, and traces, like I mentioned. And finally, last but not least, we offer support through DataFusion for both SQL and InfluxQL. So, you can query in the language that’s most comfortable to you. InfluxQL is just a SQL-like query language that is specific to InfluxDB. And so, this is kind of a bird’s eye view of the InfluxDB 3.0 platform. Essentially, the idea here is that regardless of where you’re trying to collect data from, a multitude of different data sources, you can do so with a variety of different data collection methods, Telegraf being one of them, but we also have client libraries. And then you can store that data in InfluxDB, query it with SQL or InfluxQL, and then integrate with a variety of different data analytics and visualization tools, as well as business intelligence and analytics platforms, to process your data further and do whatever work you need to with it.
ANAIS DOTIS-GEORGIOU: 22:39
I also think it’s important to take a step back and understand the data model of InfluxDB so that we can understand how we would store data in it. So essentially, a bucket is the same thing as a database in SQL. The difference between a bucket and a database, though, is that a bucket also has a corresponding retention policy or retention period. And this is the amount of time that data can persist in that bucket. So yeah. In InfluxDB 3.0, a bucket and a database, those are synonymous terms. But all buckets come with this retention policy. Then after that, we have a measurement, and a measurement can be thought of as a table in SQL. So, it’s just a way to group data at a high level. Underneath that, we have tag sets. So, tag sets are a way to group data at a lower level. The values are strings. And so, we think about applying metadata there. Which CPU are we looking at? If we’re monitoring temperature in a certain area, which location is that? Those would all be tags. And then we have our field set. And field sets are key value pairs that represent our actual numerical data or logs. And then we have a timestamp. So, you can have data with up to nanosecond precision in InfluxDB. And the unique combination of measurements and tags makes a series, and that contributes to the cardinality. But luckily, in 3.0, you really, really don’t have to worry about runaway cardinality like we did in the past, where you’d have to think strategically about what you want to make a tag, because you didn’t want your tag values to explode, like if you were to use, I don’t know, a user ID as a tag and then have an explosion of user IDs. Now, we don’t have to worry about that. So that’s really nice.
ANAIS DOTIS-GEORGIOU: 24:29
And the way that we write data to InfluxDB is through line protocol. So, line protocol is just our ingest format, and it takes the following shape. So, we have a measurement, then we have a comma to separate our tag set, and our tags are separated by commas as well. Then we use a space to indicate that we have our field set. And those fields are separated by commas. And then we have another space to indicate our timestamp. And so yeah. It’s just that simple. I also wanted to take a moment to talk about some schema best practices. So, when we’re talking about designing for performance, what we want to do is avoid wide schemas and avoid sparse schemas. And the way that we can, excuse me, achieve both of those is by making sure that our tables are homogeneous. And what we mean by that is that we want to try and keep our tables consistent and try and prevent a lot of null values. And that just means that we’re going to put like data in one table and not try and put a bunch of unlike data in the same table or measurement. Another thing is that we want to design for query simplicity. So, when we are, for example, naming tables, we want to keep the table names relatively simple. Avoid special characters or anything that we’d have to escape. Because remember, you’re going to have to use that name when you’re querying in SQL. So, you just want to keep that in mind. So, what do I mean by making our data homogeneous? It means just keeping your data consistent between tables. So, if we are monitoring our network, that should all go into one measurement. And if we are monitoring our servers, that should all go into one measurement; application, one measurement; cloud, one measurement. So, for example, what would not be homogeneous would be to take some financial trading data and put that in the same measurement as the temperature of your living room measured once a day, because you’d have data points at nanosecond precision and then data points at a daily precision. And that means that one column would have a ton of null values. And that’s what you want to avoid.
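To make that line protocol shape concrete, here is a hypothetical point: a cpu measurement, two tags, two fields, and a nanosecond timestamp. All names and values are made up for illustration:

```
cpu,host=server01,region=us-west usage_user=42.5,usage_idle=55.0 1708000000000000000
```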
ANAIS DOTIS-GEORGIOU: 26:50
I also want to highlight a hot new trend for InfluxDB, which is storing OpenTelemetry data. So, we are focused on storing traces, metrics, and logs, which is what we’ll do to solve this problem today. And so, when we store OpenTelemetry data, this is what the schema looks like. And these are screenshots from the Data Explorer in our InfluxDB UI. And you can see that we have different tables to store our metrics, our traces, and our logs or our spans. And so, the schema isn’t something that you will need to worry about. We’ll discuss this a little bit later. But essentially, the ability to do this, and to store our logs and our spans without having to be worried about runaway cardinality, is all because we are built on a columnar store with Apache Arrow, DataFusion, and Parquet as the underlying technology. So now let’s talk a little bit about hybrid InfluxDB solutions. So, we have both cloud and edge-based offerings for InfluxDB. And there might be some use cases where a user wants to keep their data closer to the source. So, what they might do is downsample and aggregate their data before writing it to a more globally visible store. So, you can do that with InfluxDB within a hybrid solution. What you can do is install InfluxDB open source locally on a server or at the edge. And then you can downsample that data locally and then write it to a more global store or a cloud instance. And you can do this using edge data replication. What that allows you to do is, when data is written to a bucket at the edge in the OSS version, we automatically put that data into a durable queue as well, which then writes that data to a remote instance of InfluxDB.
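As a rough sketch of how that edge data replication can be set up with the InfluxDB OSS CLI — the names, URL, and IDs below are placeholders, and the exact flags may vary by version, so treat this as an outline rather than a definitive recipe:

```sh
# On the edge/OSS instance: register the remote (cloud) InfluxDB instance.
influx remote create \
  --name edge-to-cloud \
  --remote-url https://us-east-1-1.aws.cloud2.influxdata.com \
  --remote-api-token $CLOUD_TOKEN \
  --remote-org-id $CLOUD_ORG_ID

# Then create a replication stream: writes to the local bucket are queued
# durably and mirrored to the remote bucket.
influx replication create \
  --name edge-replication \
  --remote-id $REMOTE_ID \
  --local-bucket-id $LOCAL_BUCKET_ID \
  --remote-bucket-id $REMOTE_BUCKET_ID
```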
ANAIS DOTIS-GEORGIOU: 28:49
So now that we have two milestones down, we have one milestone left to go. So, let’s talk about data in action. So, in order to visualize our data and act on our data, we’ll be using Grafana. Grafana and InfluxDB have a really great and longstanding relationship. It’s the primary visualization tool that we expect our users to use with InfluxDB. We do have a UI, but we really consider that UI to be reserved only for exploring your data, to make sure that everything is arriving as you expect, and for some account management. But really, you want to be using a tool like Grafana to actually act on your data. And it will be our primary dashboard method for this use case. So, there are a few different ways that you can interact with Grafana. You can use Grafana Open Source or Grafana Cloud. It’s up to you. And you can use the Flight SQL plugin, or you can use the new official InfluxDB v3 plugin as well. But to quickly understand how we might do that, what we can do is, for example, use the Flight SQL plugin as our data source. And to do that, we just connect to our InfluxDB instance. And we actually contributed this Flight SQL plugin for Grafana. And the cool thing there is that any other data store that is leveraging Flight SQL means that you can also use the same plugin. So, it just kind of highlights our commitment to open source. So, for example, if you are using Dremio, Druid, or any other columnar store that takes advantage of Flight SQL, you can use that plugin. And then from there, you can go ahead and query in SQL and get your time series data from InfluxDB in Grafana.
ANAIS DOTIS-GEORGIOU: 30:48
But there’s also an official InfluxDB v3 data source that you can use to really easily specify the query language that you want to query with and connect that way. So, you can also follow this QR code to a blog post that the wonderful Jay, another developer advocate at InfluxData, wrote all about this plugin, how to get started, and how to take advantage of it. And so, once we’ve queried data with SQL into Grafana, we can do things like create a wonderful visualization like this and maybe look at our payload or our CPU or whatever it is that we’re looking at— here, we’re looking at our CPU. And this is an example of how tags come into play. So basically, our tags are differentiating each one of these series. And yeah. So that’s kind of a translation from line protocol to visualization. I also wanted to take a second to talk about some useful SQL queries. So, the very first is to use date_bin. What date_bin does is it creates buckets or windows of time that then allow you to apply aggregation over those windows of time. So, here’s an example of using date_bin to basically find the average usage_user, usage_system, and usage_idle values from a CPU table with some tags or with some conditions applied. And then we also have selector functions. So, selector functions are designed to work with time series data specifically. They behave similarly to aggregator functions in that they take a collection of data and return a single value. However, selector functions are a little bit unique in that they actually return a struct that contains a time value in addition to the computed value. And then you can get the time value or the actual value of that data and return it. So that’s really good for things like gauges. It’s also valuable if you want to put that in a subquery and use the time or the value specifically in some shape or form to do some sort of more complicated query or maybe some simple math.
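Here is a hedged sketch of what those two query patterns might look like against a Telegraf-style cpu table; the table, columns, tag condition, and intervals are illustrative assumptions:

```sql
-- date_bin: average CPU fields over 10-minute windows for the last hour.
SELECT
  date_bin(INTERVAL '10 minutes', time) AS window_start,
  AVG(usage_user)   AS avg_usage_user,
  AVG(usage_system) AS avg_usage_system,
  AVG(usage_idle)   AS avg_usage_idle
FROM cpu
WHERE time >= now() - INTERVAL '1 hour'
  AND cpu = 'cpu-total'              -- tag condition
GROUP BY window_start
ORDER BY window_start;

-- Selector function: the latest value plus its timestamp, returned as a struct.
SELECT
  selector_last(usage_user, time)['value'] AS last_value,
  selector_last(usage_user, time)['time']  AS last_time
FROM cpu;
```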
ANAIS DOTIS-GEORGIOU: 33:12
So, if you look at the bottom— sorry, the right-hand side of the screen, you can see that we have a QR code for a really simple example dashboard. So, this is your most basic quick start when it comes to using Grafana with InfluxDB that we’ve created. It has system stats for your system. So, you can use Telegraf to monitor your disk and your memory, write it to InfluxDB, and build this dashboard all using SQL. So yeah. Follow that QR code if you want to just look at this simple example and try it out for yourself. Another really valuable reason for using Grafana is that it has very sophisticated alerting capabilities. So, they’re extremely powerful. And for example, if we were monitoring our CPU usage, we could do something like define a threshold and say, if our CPU usage exceeds this threshold for a specific amount of time, like, let’s say, two minutes, then trigger an alert. And we can trigger alerts through Prometheus, Slack, back into Telegraf, PagerDuty, etc. There’s a wide variety of endpoints available to us with Grafana. So that’s another reason why it’s so powerful and why we push our users to leverage something like Grafana for your visualization and alerting capabilities. So basically, we could build any of these solutions. We’ve already gone through— once again, let me backtrack. We’ve already done our data collection and talked about data storage and data action. But you could also— if you didn’t want to use Grafana, you could build out any of your solutions with a client library. So, we have client libraries in Python, R, Java, C#, Node, Go, etc. You could also use a data analytics engine like Apache Spark or RapidMiner. And then you’re also not limited to Grafana. So, you could use something like Apache Superset, Tableau, Mage, Quix. These all offer solutions for alerting and visualization, as well as some data processing and ETL.
ANAIS DOTIS-GEORGIOU: 35:16
So, you have so many options available to you, and that was one of the goals of InfluxDB 3.0: to really offer more interoperability. And the way that we achieved that is not only through things like leveraging Arrow Flight, where we can provide things like the Flight SQL plugin with Grafana. That also opens up things like Flight SQL JDBC drivers, which is how we connect to things like Tableau and, in the future, Power BI. But simply the fact that we are leveraging Arrow and a columnar data format means that we can transport a ton more data really efficiently, even through things like client libraries, which just makes leveraging a variety of Python libraries way, way simpler and way more feasible. Also, our Python client library supports pandas and polars as well, which makes it a really good tool for any sort of ETL that you want to do with Python. Yeah. So now let’s talk about the actual observability proof of concept or demo. So, I won’t actually demo it myself here. And the reason why is because you can actually— I encourage you to do it yourself through Killercoda. Killercoda is this educational tool that allows you to run demos without having to install anything or pull any repo. And so, I encourage you to do that, and also just because we’re kind of running out of time here. But essentially, this OpenTelemetry demo, basically, what it does is it uses the OpenTelemetry Collector. And what we do is we collect data from HotROD, which is our sample application, and write that data directly into InfluxDB, including all of the spans, logs, and metrics. And we also use Jaeger to query that data back out and then use that as our bridge interface with Grafana. And we convert the Jaeger query into SQL, and we ask InfluxDB for all of those results. So again, I highly recommend that you check out this QR code and look at this repo and give it all a go for yourself.
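As a quick, hedged illustration of that Python client path, here is roughly what querying into pandas could look like, assuming the influxdb3-python client; the host, token, and database are placeholders:

```python
# pip install influxdb3-python pandas
from influxdb_client_3 import InfluxDBClient3

# Connect to an InfluxDB 3.0 instance (placeholder credentials).
client = InfluxDBClient3(
    host="us-east-1-1.aws.cloud2.influxdata.com",
    token="MY_TOKEN",
    database="monitoring",
)

# Query over Arrow Flight; the result comes back as a PyArrow table.
table = client.query("SELECT * FROM cpu WHERE time >= now() - INTERVAL '1 hour'")

# Convert to pandas for downstream ETL or analysis.
df = table.to_pandas()
print(df.head())
```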
ANAIS DOTIS-GEORGIOU: 37:23
It’s also all Dockerized too. So, if you do decide to pull it directly, it makes it really easy. But as an example, if I were to demo this, basically, what I would do is I’d pull up the HotROD application, and I would click one of these buttons to generate a trace. And you can generate the traces in real time. And then I’d go to my Grafana dashboard. So, this is us monitoring our traces over the last 90 days. But we could use the dropdown to change that from the last 90 days to maybe the last five minutes instead. And now we can actually go ahead and maybe click on a particular trace, for example, and see the relationships of all the spans within that trace, and drill into the trace as well to really perform this root cause analysis and figure out what’s going on with the health of our application. So back to the Whisper GPT solution in this example problem that we started with at the beginning of this presentation. Basically, what we did here is we addressed the challenges of monitoring and scaling an application using a combination of Telegraf, InfluxDB, and Grafana. And so, we used Telegraf as our collection backbone. We deployed it on all of our servers and our cloud infrastructure to collect OpenTelemetry, Prometheus, and CloudWatch data, as well as raw server-based metrics. And then we used InfluxDB 3.0 and set up various buckets representing each of our data sources, from our servers to our cloud, etc. And then we used Grafana as the observability hug— hub, excuse me [laughter], hub.
ANAIS DOTIS-GEORGIOU: 38:59
And we can use both the Flight SQL plugin or the official v3 plugin and the Jaeger data source to query data from InfluxDB 3.0, where we have consolidated all of our logs, traces, events, and metrics, and actually build a dashboard and create any alerting that we want as well, so that we can actually be proactive and address the health of our application. So, what are the next steps? The next steps are to also take advantage of the Quick Starts repo. So, this repo contains a series of Grafana dashboards and Telegraf configurations. I mentioned some of them already throughout this presentation, but there are a bunch more. And if there’s anything else that you would like to see an example of that isn’t included in this repo with all of these quick start guides, I highly encourage you to reach out to me on Slack or in the forums and ask for something that you’d like to see. We’d love to build it for you. That’s kind of our job here. So yeah. Please don’t be shy. And then last but not least, the OpenTelemetry demo. You can try this for yourself. You can pull the actual repo like I mentioned before, or you can take advantage of Killercoda so you don’t actually have to pull that repo yourself. And basically, you just follow the steps on Killercoda and configure it through there, and you also get a bunch of text explanation of what’s going on with each step of configuration, which is just a really nice way to walk through it. And mad props again to Jay for creating this Killercoda example. And I would not be doing my job if I didn’t encourage you to sign up for InfluxData. Give the free cloud tier a try or download the open source version. And last but not least, I want to encourage you to take a look at our InfluxData documentation. Our team is really fantastic, and the documentation is really great. We also have InfluxDB University, where we offer free courses on Telegraf, InfluxDB, and so many other things related to InfluxData.
ANAIS DOTIS-GEORGIOU: 41:10
And so, you can also earn free virtual badges for your LinkedIn with that. Please keep in mind, too, that for 3.0, we are currently in the process of developing courses that are 3.0-specific. So, if you’re looking for those, there might be a little bit of lag time, but we are pushing new courses out right now, so there should be more. And last but not least, I encourage you to come talk to us on the community Slack, influxcommunity.slack.com, or community.influxdata.com, which is our forums, and ask any questions that you have. Or if you’d like to see any specific blogs or tutorials or POCs or example repos, please ask for them. We’d love to build them for you. And yeah. That’s the presentation for today. So, now I’ll go back to the chat and to the questions and see what all you have to say. So, Marcus asks, “Will we get access to the code and/or docs from here?” So, you will get a copy of these slides. So that’s one way to access the code and docs from there. Yes, we will share the slide deck. And so, we have another question from Alexander that says, “When using Telegraf to collect server metrics, it generates plenty of different parameters.” Oops, sorry. The questions are moving. “It is not easy to consume all these parameters as the list is long and not always intuitively understandable. Do you have any instructions or docs or tools to manage this long list of parameters?” Yeah. There are plenty of examples for each Telegraf plugin in the Telegraf repo of how you might configure any individual Telegraf agent. It also just depends on what you’re interested in doing. There are also Telegraf processor plugins for filtering and decorating any of your metrics as well. So, you can use those. And then we also have several Telegraf engineers specifically that hang out in the forums, for example, or Slack. So, you can ask them directly as well for specific questions.
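For instance, here is a hedged TOML sketch of trimming that long parameter list down with Telegraf’s built-in metric filtering; the plugin, field, and tag names are illustrative:

```toml
# Collect CPU metrics, but keep only the fields we actually care about.
[[inputs.cpu]]
  percpu = false
  totalcpu = true
  fieldpass = ["usage_user", "usage_system", "usage_idle"]  # drop all other fields
  tagexclude = ["host"]                                     # drop this tag entirely
```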
ANAIS DOTIS-GEORGIOU: 43:36
“Is it possible to include configuration from an external file, like the C programming language’s include directive? The Telegraf configuration is a big file and not easy to manage.” So yeah. When you are building the config or creating a new config through the command line, you can specify which plugins you want to include as a part of that file, and limit it to only that. You can also reduce the binary size to just include the plugins that you’re interested in using. So, if you look at our documentation, there are ways to do both. And if I have time, I’ll find that and share it with you all. “Does Telegraf have an option to keep the resulting data rate within some limits, like no more than one megabyte per second, so that Telegraf throttles its sending when more than one megabyte is about to be sent?” Sorry. One second. I have to sneeze. Or I thought I did. [laughter] I don’t think that there is. I could be wrong, but I don’t know that there is. So actually, that’s a good question. That’s something that I will look more into. I can’t believe I’ve never— if I’ve been asked that, I can’t remember. But yeah. “Can InfluxDB 2 OSS replicate to an InfluxDB 3 server in the cloud?” So, the write endpoint for InfluxDB 3.0 is the same as the write endpoint for 2. Let me really quickly double-check on that, because I would hate to get it wrong— I don’t actually think you can, but let me double-check.
[silence]
CAITLIN CROFT: 46:09
I think Anais is looking for something, so that’s why she’s on mute. Sorry, everyone. Give me a second.
[silence]
ANAIS DOTIS-GEORGIOU: 46:31
So, I don’t think you can, but that’s actually a really good question that I should know the answer to. So, thanks for asking that. I’m really torn. Yeah. Let me get back to you on that. Oh, and an org in InfluxDB 3.0 is just the highest level of organization for your data. It’s per user or multi-tenant, basically, your account ID. How does Grafana log filtering compare to Kibana? So, I haven’t actually used Kibana that much, so I’m not a great person to ask for that. In InfluxDB 1.0, setting retention of data was rather complex. How’s it done in 3? Basically, when you create a bucket in the UI, for example, you just select from a drop-down menu how long you want to retain data for. And then with the API or the CLI— I can’t remember exactly. Let me look. So, in the CLI, you just use a --retention flag to set the retention period. So, it would be as simple as influx bucket create with a name and a retention of 72 hours, for example. Specifically, is it possible to have a single graph in Grafana that goes through time, say from now to three years back, going across various data resolutions seamlessly? Yes, I want to say. I mean, I think it depends on how much data you have, and— if you had an obscene amount of data, maybe Grafana would have trouble rendering or processing all of that. But yes. Unless I’m not understanding the question correctly.
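Spelled out, that bucket-creation command would look roughly like this with the influx CLI; the bucket name is a placeholder:

```sh
# Create a bucket whose data is retained for 72 hours.
influx bucket create --name monitoring --retention 72h
```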
ANAIS DOTIS-GEORGIOU: 48:57
Okay. And then we have some more questions in the chat. So let me see. So, InfluxDB 2.x has some alerting features; 3.x does not. The goal with 3.x was really to make sure that users aren’t limited to using just what we build, and instead to focus on enabling them to use and leverage what they’re already using and other tools that are purpose-built for alerting. So, I would say use another tool with InfluxDB; leverage a tool that’s made specifically for alerting. I have a lot of historical logs I want to ingest with Telegraf. Do I have to ingest them in order, or can I do it out of order? I think that really only depends on your retention policy. Obviously, if some of those logs are outside of your retention policy, then it wouldn’t serve you to try and write them. But no, it shouldn’t matter.
[silence]
ANAIS DOTIS-GEORGIOU: 50:23
Okay. I think I answered all of the questions. So, thank you so much, everybody.
CAITLIN CROFT: 50:30
Thank you, everyone, for joining today’s webinar. I know there were tons of questions. Let us know if there are any other last-minute questions that you want us to answer. Once again, this session is being recorded and will be made available by tomorrow morning. So, the recording as well as the actual slides will be made available. So, you’ll be able to check it out and share it with your friends. And don’t be shy. All of you should have my email address. If you have any further questions that you’ve forgotten to ask, feel free to email me. I’m happy to put you in contact with Anais. You can also find her in the forums as well as the Slack workspace. So don’t be shy to reach out. We’re always happy to get as much help for you guys as possible. Thank you, everyone.
ANAIS DOTIS-GEORGIOU: 51:19
Thank you. Bye.
Anais Dotis-Georgiou
Developer Advocate, InfluxData
Anais Dotis-Georgiou is a Developer Advocate for InfluxData with a passion for making data beautiful with the use of Data Analytics, AI, and Machine Learning. She takes the data that she collects, does a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.