InfluxDB + Telegraf Operator: Easy Kubernetes Monitoring
Session date: Nov 09, 2021 08:00am (Pacific Time)
Telegraf is an open-source server agent designed to collect metrics from stacks, sensors, and systems - with nearly 300 inputs and outputs. Telegraf Operator makes it easy to use Telegraf for monitoring your Kubernetes workloads. It enables developers to define a common output destination for all metrics, and configure Sidecar monitoring on your application pods using annotations. With the Telegraf sidecar container added, it will collect data and start pushing the metrics to a time series database, like InfluxDB. Discover how to use the Telegraf Operator as a control center for managing individual Telegraf instances which are deployed throughout Kubernetes clusters. Find out how to use the InfluxDB and Telegraf Operator to monitor and get metrics from your Kubernetes workloads.
Join this webinar as InfluxData’s Pat Gaughen and Wojciech Kocjan provide:
- InfluxDB & Telegraf overview
- Telegraf Operator deep-dive
- Live demos of sample deployments!
[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]
Here is an unedited transcript of the webinar “InfluxDB + Telegraf Operator: Easy Kubernetes Monitoring”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Caitlin Croft: Customer Marketing Manager, InfluxData
- Wojciech Kocjan: Software Engineer, InfluxData
- Pat Gaughen: Senior Engineering Manager, InfluxData
Caitlin Croft: 00:00:00.857 Once again, hello, everyone, and welcome to today’s webinar. My name is Caitlin Croft. I’m very excited to have Wojciech and Pat, who are part of the engineering team here at InfluxData. And they are going to be talking about InfluxDB and the Telegraf Operator and how it relates to Kubernetes monitoring. Once again, the session is being recorded. Please feel free to post any questions you may have for the speakers in the chat or in the Q&A feature, which you can find at the bottom of your Zoom screen. Without further ado, I’m going to hand things off to Wojciech and Pat.
Wojciech Kocjan: 00:00:41.363 Hello, everyone. As Caitlin mentioned, today we’re going to be explaining InfluxDB together with Telegraf Operator and how to use them to monitor Kubernetes workloads. We’re going to be showing some examples. A lot of these are based on what’s in the Telegraf Operator repository. Maybe I’ll start by introducing myself. So my name is Wojciech Kocjan. I’m a software engineer at InfluxData. I am one of the people that contributes to Telegraf Operator from our company, and I’m going to be the one showing Telegraf Operator today.
Pat Gaughen: 00:01:23.070 Cool. So I’ll go ahead and introduce myself. So I’m Pat Gaughen. I’m an engineering manager here at InfluxData. I manage the deployments team. And we are responsible for all the plumbing that is in place, this whole CI/CD pipeline, for our Cloud 2 SaaS offering. So we’ll kind of focus our talk in the space that we know, which is Kubernetes and InfluxData. So first, I just really want to say, so InfluxData is the remote-first company behind InfluxDB. So I think most of you all know probably more than I do about how to use our product. But I’ll give a little bit of an overview. And don’t hesitate to ask questions along the way, and we’ll get to them at the end. So InfluxDB is the platform for building time series applications, and - oh, I wrote all these really good words, and now, I’m having to read them. And really, at the heart of it, it’s an open-source time series database. So it’s purpose-optimized for time series data, whether that is sensors or like - I have one of those doorbells where you can see the person, so there’s time data there. So wherever there’s time-based data, InfluxDB is a perfect platform for you to develop applications around that data. So you can start from the UI, or you can skip right to it, and you can use the raw code and the APIs. And we’ve got APIs and client libraries in several of the most popular programming languages.
Pat Gaughen: 00:03:09.940 So Telegraf, if you’re not already familiar with Telegraf, if you have, say, that Ring device over there and you want to get your data somewhere, Telegraf has the input and output plugins to allow you to get your data from your device into a database. Of course, my preference is that you put that into InfluxDB, but we’ve got plugins for other types of things. It’s an open-source agent, and it, I think, has a really healthy open-source community, and it’s maintained by InfluxData. And they have a lot of different plugins. There are over 300 different plugins that allow you to basically, like I said, get your data in, manipulate your data on the way in, and help you manipulate your data on the way out. It’s a really powerful tool.
Pat Gaughen: 00:04:05.342 So today, we’re going to focus on talking about it in the space of Kubernetes. But I know that there were several talks, I think, at InfluxDays North America, where they actually talked about Telegraf. I think there was a beginner session, and I think some other things. So check it out because it’s a really powerful tool. So now, we’re going to tell you a little bit about the Telegraf Operator. And I wrote some notes ahead of time to prepare for this. So the Telegraf Operator packages the operational aspects of deploying a Telegraf agent on Kubernetes. So this is about having a Telegraf sidecar in Kubernetes, with the Telegraf Operator putting it there. It injects a sidecar container based on annotations, and it provides the Telegraf configuration to scrape the exposed metrics, all defined declaratively. It allows you to define common output destinations for all your metrics. So you can send it to InfluxDB, or you can also send it elsewhere. And I’m going to pause there because I want to let Wojciech finish setting the stage for his demo, and I don’t want to take it all. So actually, Wojciech is going to take it from here and do a demo, so. But I’ll let you also finish.
Wojciech Kocjan: 00:05:28.346 Right. So thank you, Pat. So as you’ve mentioned, Telegraf Operator is meant to be running alongside Kubernetes workloads, and this is what we’re going to focus on today. Could you stop sharing so I could start sharing on my end?
Pat Gaughen: 00:05:46.681 Yes. But I have to - Wojciech, first, I have to show this slide. Okay. Good. Done. Now, I can stop sharing.
Wojciech Kocjan: 00:05:53.294 Okay. Yes. So I noticed that there is a question about APIs for InfluxDB, so I’ll just share this real brief and keep it open. So we have documentation about all of the APIs. There are also clients, and I’ll show it in a bit. So it’s well documented. In the REST API, there is a querying language called Flux, and InfluxQL, that could be used to get the data, and writing the data is relatively simple. Going back to Telegraf Operator, what I’m going to do is I have checked out a copy of the source code of Telegraf Operator, and it includes a lot of ways to run it using kind. kind is Kubernetes in Docker, which is a way to run a whole cluster locally just using Docker. And that’s what we use for a lot of testing of Telegraf Operator, because it is a simple way to run things and also be able to do fancy things, like building a custom build of the container image with Telegraf Operator and, say, loading it, which is not something we will be doing today, but it’s also really useful in development. And we’ll just use the exact same setup that we use when we develop it. So what I did in advance, because it takes around one, two minutes, is I ran a make kind-start command, which basically just creates a kind cluster on my computer, and it deploys a few things. But we’re going to deploy InfluxDB version 2 because that’s what we want to demo.
Wojciech Kocjan: 00:07:34.302 So I just deployed it. This is the open-source version of it. And as soon as it - let me just port forward to it. Okay. So let me see what’s happening in my kind cluster. Okay. So my InfluxDB version 2 is running right now, so I can do this. So right now, I created a fresh Kubernetes cluster in my machine. I deployed InfluxDB version 2 to it, the open source version. And I’m just going to bootstrap our cluster, meaning that this is the same if I would deploy locally, but I just want to have everything in my cluster. So when I first deploy it, I’m going to set up an organization and everything for the InfluxDB itself, because this is where we want to get the metrics in. And also, touching on the question of how to get the data in. We also have a UI that shows a way how to write data from a lot of places. So say if you’re a Golang developer, it’ll give you ready-to-use snippets. Obviously, you would want to replace the token and some other things with [inaudible] this. But this is a really good way to get started with just putting data in Influx. But anyway, right now, what I really want to do is in order to be able to write my organization, I need this token. So right now, we’re just going to grab this, and then we can get back to explaining and configuring Telegraf Operator. So I haven’t deployed the Operator yet because there’s one additional thing I want to do.
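As a rough illustration of the ready-to-use "write data" snippets mentioned here, the following Python sketch builds (without sending) a request to the InfluxDB v2 HTTP write endpoint. The URL, organization, bucket, and token values are placeholders assumed for illustration; they are not the values used in the demo.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_write_request(base_url: str, org: str, bucket: str,
                        token: str, lines: list[str]) -> Request:
    """Build (but do not send) a POST to the InfluxDB v2 write endpoint."""
    query = urlencode({"org": org, "bucket": bucket, "precision": "ns"})
    body = "\n".join(lines).encode("utf-8")  # one line-protocol record per line
    return Request(
        url=f"{base_url}/api/v2/write?{query}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
    )


# Placeholder values: a port-forwarded local instance and a "demo" org/bucket.
request = build_write_request(
    "http://localhost:8086", "demo", "demo", "REPLACE_WITH_YOUR_TOKEN",
    ["redis,type=app used_memory=1024i"],
)
print(request.full_url)
# To actually send it, pass the request to urllib.request.urlopen(request).
```

In practice one of the official client libraries would usually be preferable; the sketch only shows which pieces (organization, bucket, token) the write call needs.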
Wojciech Kocjan: 00:09:23.619 So Telegraf Operator has a concept of classes, which are classes of applications or classes of metrics that we are gathering. And this basically maps to specific sets of Telegraf configurations. And one of the things that we should be setting in here is, how would my application write the data to wherever I want this? Because Telegraf Operator is meant to be generic, so we should be setting up specific outputs, which is part of standard Telegraf configuration. So what I’m going to do right now is I’m going to tell it, “Okay, let’s just also write it to my InfluxDB 2.” And now, what I’m telling it is, in my cluster, there is an InfluxDB 2 service, which is what we were just talking to in the browser, in the InfluxDB 2 namespace, and the port it listens on is 8086, which is the default port. I’m just going to tell it my organization is demo, as I just entered it. My bucket is demo. And now, I’m just going to copy my token and [inaudible] to my [inaudible] I’m sharing it, because I’ll just keep the Flux [inaudible] on. I will also copy it to [inaudible] using the default one as well. So I’ll just do this. So right now, I’m configuring the classes, meaning that when we want to monitor some workloads, we’ll need to specify what the class of that workload is, or it will be using the default class if it’s not specified.
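A minimal sketch of what such a class definition might contain, written as Python that renders the Telegraf TOML for an InfluxDB v2 output. The `urls`, `organization`, `bucket`, and `token` keys are standard settings of Telegraf's `outputs.influxdb_v2` plugin; the service hostname, the token value, and the exact class names are assumptions made for illustration, since the transcript does not spell them out.

```python
def render_influxdb_v2_output(url: str, org: str, bucket: str, token: str) -> str:
    """Return a Telegraf [[outputs.influxdb_v2]] section as TOML text."""
    return "\n".join([
        "[[outputs.influxdb_v2]]",
        f'  urls = ["{url}"]',
        f'  organization = "{org}"',
        f'  bucket = "{bucket}"',
        f'  token = "{token}"',  # note the equals sign on every key
    ])


# Assumed in-cluster address: service "influxdb2" in namespace "influxdb2".
output = render_influxdb_v2_output(
    "http://influxdb2.influxdb2:8086", "demo", "demo", "my-token"
)

# Each class name maps to the Telegraf configuration shared by that class.
classes = {name: output for name in ("app", "default")}
print(classes["default"])
```

A mapping like `classes` is what would be stored in the Kubernetes secret that the operator reads, one entry per class.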
Wojciech Kocjan: 00:11:03.182 So I have created app, default, and, I believe, infra. I think we will not be using all of them. But all of them also specify that the data should be going to the new InfluxDB 2 that I have just created. This is a standard -
Pat Gaughen: 00:11:17.005 Hey, Wojciech?
Wojciech Kocjan: 00:11:17.820 Yes?
Pat Gaughen: 00:11:18.193 You’re missing your equals sign -
Wojciech Kocjan: 00:11:22.022 Oh. Thank you very much.
Pat Gaughen: 00:11:22.069 -on token.
Wojciech Kocjan: 00:11:23.303 Yes. Okay. Yes. That would be -
Pat Gaughen: 00:11:25.869 And it bothered the -
Wojciech Kocjan: 00:11:26.257 That would be a painful demo.
Pat Gaughen: 00:11:27.252 It didn’t just bother me. It bothered someone else, too, who’s watching it. Thank you.
Wojciech Kocjan: 00:11:33.181 Okay. Thanks so much for noticing it. So right now, I’m going to go back to my terminal, and I’m just going to deploy it. So the example is already committed. And the example shows how to use it with InfluxDB v1, because from a development perspective, we keep on using version 1 for that, which is something we should improve. But it’s just that Telegraf Operator was created when v1 was the [inaudible]. So right now, I deployed my configuration. I can update it in the future and reload, so I can change it. But right now, we deployed that. So what I’m going to do next is I’m going to deploy Telegraf Operator. And it can be deployed in multiple ways. We have the dev.yml file, which basically is meant to be used for local development. But because I’m doing this in kind, I’m just going to reuse it, because it also has hardcoded certificates, so it’s not really production-ready, but it’s enough to work in kind. We also have a Helm chart. Okay. So right. Now it’s on GitHub. Create Telegraf Operator. Helm chart. Yes, this is what I was looking for. So, yeah. So we have a Telegraf Operator Helm chart, as well, that’s available if you just install our InfluxData Helm charts source. And then you can install it or just use upgrade --install, which will either install it or upgrade it depending on whether it’s already installed or not.
Wojciech Kocjan: 00:13:18.570 And this is also a - this is a preferred way of getting production environments running. But because we’re using kind and because all of the examples are based on this, I’m just going to follow this and not do the Helm chart-based installation. But right now, I can just go and see what’s running in my cluster. So you can see that Telegraf Operator is running. It’s ready to handle the new deployments coming in and adding the Telegraf sidecars. So now, the way Telegraf Operator works is - and maybe I’ll just open one of the deployments to explain it. This is just a very simple definition of how to run Redis. It is a StatefulSet, but it doesn’t really even include volumes. In real life, this would be a more complex StatefulSet. But this is an example of how to use Telegraf Operator to monitor things. The way Telegraf Operator works is for each pod that gets created, it checks the annotations. And if there is a Telegraf Operator annotation in it, it will inject the sidecar. So right now, we can see there is just one container called Redis that’s just using the default Redis image. But we can also see that we have the annotation telling Telegraf Operator that it should be contacting localhost and the standard Redis port and using the Redis plugin. This is one of the plugins that Pat mentioned. And maybe I’ll explain just a bit more. So-
Pat Gaughen: 00:14:54.483 Actually, so Wojciech, before you go into that, I was hoping - because I don’t think we actually showed people the repo.
Wojciech Kocjan: 00:15:01.487 Oh, that’s a very good point, too.
Pat Gaughen: 00:15:02.889 I realized -
Wojciech Kocjan: 00:15:03.445 So Telegraf Operator -
Pat Gaughen: 00:15:04.380 This is all code you guys can get to. We kind of got -
Wojciech Kocjan: 00:15:08.164 Yes. Telegraf Operator -
Pat Gaughen: 00:15:08.576 -into the details. Yeah.
Wojciech Kocjan: 00:15:10.771 -is open-source, and it also includes an extensive README on how to get started with development and with deploying, and it points to the Helm chart. So if you want to rerun what I’m showing today, I think the easiest way is to clone it. And I’m basically using a lot of the make targets and just applying some of the things that we also mention in the documentation, because you can see that we’re just deploying this through GitHub URLs rather than locally. But yes, the repository is on GitHub. It is open-source. You can clone it. You can run the same [crosstalk] today.
Pat Gaughen: 00:15:47.166 Yeah. And you’re working within it right now. I just realized we didn’t -
Wojciech Kocjan: 00:15:51.946 No, no, Pat, thank you so much for this. That is a very good point. Because I am so into the repository, I sometimes forget to explain things that may seem like my day-to-day things, but for other people, they may be new, so it’s good to mention it. So going back to the configurations, I may have skipped explaining some of these things. So the way Telegraf Operator works, it combines the Telegraf configuration that Telegraf would be reading from multiple sources. One of the sources is the classes that I mentioned, which is just a vanilla Kubernetes secret with the definitions of all the classes - and usually, this would be including outputs or some of the [parts?] or some of the general things that would be applied to all the metrics related to this [class of?] applications [that we want to monitor?]. So in this case, we added the output to it, which means that everything with the app class would be writing to our InfluxDB v2. We also make it output the standard output [to?] GitHub. And we have it use global [inaudible] and showing the [inaudible] in the UI. Basically, type is set to app, and then host name and node name would be the name of the host and the node that Telegraf is running on.
Wojciech Kocjan: 00:17:12.662 And now, if we take a look at the Redis deployment, we are writing some other pieces of Telegraf configuration. So one of these is we’re adding an input for Redis, which means use the Redis input plugin. And previously, we were using the InfluxDB v2 output plugin. So we’re telling Telegraf, “Talk to Redis on this port. Get some of its standard metrics, and send them out to InfluxDB v2 on this specific URL.” We could also tell it like, “Send this to my cloud [and?] send it to some on-prem instance of InfluxDB, or maybe send it to one of the very, very large set of output plugins that we support.” We could be sending it directly to Kafka or some other output plugin that we support. [We’ll?] write it to a file. But basically, we tell it, “This is the input. These are the outputs that are in the secret in the classes,” and then they get concatenated. So my Redis definition tells it, “This is how you should gather metrics for my Redis.” My classes tell it, “This is where you should be writing this,” and it also tells it, “By the way, this is the app class,” meaning that whatever I put in my app class in the classes definition is where the data goes. We can also specify the settings for memory requests and limits for the Telegraf sidecar. This one is invalid and will be ignored. This is more of a development test case. But the CPU limits will be set on the [inaudible] Telegraf sidecar. So anyway, that’s that. And let’s just go ahead and deploy this.
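The concatenation described here can be sketched in a few lines of Python. This is an illustration of the idea under stated assumptions, not the operator's actual code (the operator itself is written in Go). The annotation names `telegraf.influxdata.com/class` and `telegraf.influxdata.com/inputs` follow the Telegraf Operator README as I understand it; the Redis address, the class contents, and the function name are assumptions.

```python
CLASS_ANNOTATION = "telegraf.influxdata.com/class"
INPUTS_ANNOTATION = "telegraf.influxdata.com/inputs"


def build_sidecar_config(annotations: dict[str, str],
                         classes: dict[str, str],
                         default_class: str = "default") -> str:
    """Concatenate the pod's own inputs with the outputs of its class."""
    class_name = annotations.get(CLASS_ANNOTATION, default_class)
    inputs = annotations.get(INPUTS_ANNOTATION, "").strip()
    outputs = classes[class_name].strip()
    return f"{inputs}\n\n{outputs}\n"


# Hypothetical class definitions, as they might appear in the classes secret.
classes = {
    "app": '[[outputs.influxdb_v2]]\n  urls = ["http://influxdb2.influxdb2:8086"]',
    "default": "[[outputs.file]]\n  files = [\"stdout\"]",
}

# Annotations a Redis pod might carry: raw Telegraf input config plus a class.
redis_annotations = {
    CLASS_ANNOTATION: "app",
    INPUTS_ANNOTATION: '[[inputs.redis]]\n  servers = ["tcp://localhost:6379"]',
}

print(build_sidecar_config(redis_annotations, classes))
```

The pod only says how to gather its metrics and which class it belongs to; where the data goes is decided once, in the class.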
Wojciech Kocjan: 00:18:50.349 So this was examples Redis, I believe. Yeah, examples Redis. Okay. So now, if we go back to watch, we can see that we only specified one container within the pod spec. We can see it’s actually running two containers. So if we do this [get?] pod, if we go ahead and describe it - let me just do it this way - we’ll see that there is the Redis container we defined. There’s also the Telegraf container that was injected by the Telegraf Operator. And we can see that the CPU limit is set to 750m, so 0.75 of a single CPU core. We can see it’s mounting /etc/telegraf using - [inaudible] we can see below, using a secret that was generated by Telegraf Operator. So basically, when the pod was about to be created, Telegraf Operator combined the whole Telegraf configuration, put it in that secret, and started running Telegraf. And it also told Telegraf that it should be monitoring that configuration to allow hot reloading, which I’ll explain in a bit because that is an interesting feature of Telegraf Operator. But anyway, at that point, I believe the pods are already running. So what we could also do is - right now, I’m asking - or actually, let’s use something more visual. We’re going to run a tool called K9s, which is a nice console-based UI for a lot of things Kubernetes-related. And it’s much better than what I was doing before that, so I think that’s going to be more visible.
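The CPU limit mentioned above, 0.75 of a core, is normally written in Kubernetes as the quantity `750m` (millicores). A small Python sketch of that arithmetic follows; it handles only plain and millicore CPU values, which is an assumption made to keep it short, and the function name is hypothetical.

```python
def cpu_quantity_to_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "750m" or "2" to cores."""
    if quantity.endswith("m"):
        # The "m" suffix means millicores: 1000m equals one full core.
        return int(quantity[:-1]) / 1000
    return float(quantity)


print(cpu_quantity_to_cores("750m"))  # 0.75
```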
Wojciech Kocjan: 00:20:45.136 So this is my pod with the sidecar included. I can take a look at the logs of this Telegraf sidecar. And I can see that because we told it to log all the metrics to standard out, we can see that we already have the metrics in here, and the metrics are in line protocol, which is what InfluxDB builds on top of. But this is basically just because we told Telegraf to write to the standard output, and we didn’t use any other protocol. So it’s just writing it in line protocol. But going back to - so this we can see it running. So now, I can go back to my InfluxDB and I can see that I have a lot of my Redis data. So I could see - I actually don’t even know what to look at. But let’s say maxclients. Now you see, okay, that my maxclients is configured to 10k. I could probably also see a lot of other metrics, but because there’s really nothing happening, I can also just have it show all the metrics. So we see some metrics changed over time, but not a lot of them did. So we can see that there are a lot of metrics that are - I can also show the raw data and we can see that there’s a lot of data that we have. And we can see that the Telegraf Operator is reporting this. Okay. Let’s try to do something more practical. I would want to monitor how much memory is being used. So I already have it. And for any other workload that I would be deploying in my cluster, Telegraf Operator will automatically be injecting that. But we can also see that the tag “type” is set to “app”. And right now, InfluxDB UI works in this way, that I’m - right now, I can build out a query using just the UI.
Wojciech Kocjan: 00:22:42.011 And I can filter on all of the tags that we are setting, and the type equals app was set when we were creating the Redis deployment. So with this, I would only be - so let’s go back. Let’s remove just a bit. I could say that I will start by just filtering data coming from my applications. And then I can go back and say, “Okay. And now, let’s take a look at all the fields they have, right?” So for example, we have another thing that we could deploy, which is an example of deploying Nginx. In this case, it’s also an interesting example because previously with Redis, we were specifying the raw Telegraf configuration. But if your application is already exposing metrics using the Prometheus standard, you can say, “Scrape Prometheus metrics on these ports or on one port.” I could just say, “Just scrape port 8080.” And then this is the path to go to. Scrape it every five seconds. And the protocol is HTTP. And the last annotation we have here is also gathering the internal Telegraf metrics. So once I deploy that - apply examples, DaemonSet. This will deploy the Nginx [DaemonSet?]. You can see it’s being deployed. We can see it slowly running.
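The Prometheus-style annotations described here (port, path, interval, protocol) can be thought of as shorthand that gets expanded into a Telegraf `inputs.prometheus` section. The Python sketch below illustrates that expansion. The annotation names follow the Telegraf Operator README as I understand it, while the default values and the function itself are assumptions, not the operator's actual implementation.

```python
PREFIX = "telegraf.influxdata.com/"


def prometheus_input_from_annotations(annotations: dict[str, str]) -> str:
    """Expand scrape annotations into a Telegraf [[inputs.prometheus]] section."""
    port = annotations.get(PREFIX + "port")
    if port is None:
        return ""  # nothing to scrape
    scheme = annotations.get(PREFIX + "scheme", "http")   # assumed default
    path = annotations.get(PREFIX + "path", "/metrics")   # assumed default
    interval = annotations.get(PREFIX + "interval", "10s")  # assumed default
    return "\n".join([
        "[[inputs.prometheus]]",
        f'  urls = ["{scheme}://localhost:{port}{path}"]',
        f'  interval = "{interval}"',
    ])


# Values matching the description: port 8080, every five seconds, over HTTP.
nginx_annotations = {
    PREFIX + "port": "8080",
    PREFIX + "path": "/metrics",
    PREFIX + "interval": "5s",
    PREFIX + "scheme": "http",
}
print(prometheus_input_from_annotations(nginx_annotations))
```

The sidecar shares the pod's network namespace, which is why the sketch scrapes `localhost`.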
Wojciech Kocjan: 00:24:25.873 Now if I go to the logs - right, it’s mentioning that it can’t really scrape, because the Nginx is not listening on those ports. And also, our Nginx is not running an application that would expose the metrics. But we can see - because we also enabled the internal metrics, we can see some basic metrics that Telegraf is reporting. So right now, if you go back. So we can see the internal data in here. We can see [inaudible]. We can see a lot of other data in here that’s slowly being gathered. And based on that, we should be able to build a lot of dashboards out. So let me just show a quick example of that. This is not exactly Telegraf Operator specific, but let’s just show how I could basically just go and say, “Okay, I just want to see the used memory for Redis,” right? And then I can just save it and I have my dashboard. And that would be an easy way to just move from having my workload in the cluster to basically being able to visualize the complexity, so. And we can see that if I go back to the Telegraf plugin, [inaudible] to the logs of Telegraf, we can see the data keeps on coming in. So one other thing that I wanted to mention or show is just really interesting. As I mentioned, we also support reloading of configuration. So I could just start plotting a new type. Let’s say, “New type equals application.” And for the other one, we could say, “Default.” Your default. Okay.
Wojciech Kocjan: 00:26:26.077 Okay. So the only thing I’m deploying right now is I am changing a secret that Telegraf Operator is using. But if we take a look at the logs - and this should take around one minute for Telegraf Operator to notice this. And we want logs from all the time. In around one minute, because this is how long it takes to reload the secret mounted inside the container - in around one minute, Telegraf Operator will pick it up and will say, “Okay, I see that the configuration has changed.” So it’s going to reload it, but it’s also going to check what are the Telegraf sidecars that are created - what are the secrets it has created - and go in and update them, as well. And we should start seeing the new data in a few minutes. This is really useful and this is something we use a lot at InfluxData, and we started using that [port read out?], as well, recently. And that was one of the things we really, really wanted, because whenever any configuration changes, we don’t really want to restart the whole workload or try to manually restart the Telegraf sidecars. What we would really want - and we have it right now - is the ability that once we change the settings, Telegraf Operator would be smart enough to detect that and then decide which are the things that really need to be updated. So we can see that it decided that we don’t really need to update the secret for the Nginx. We can see that it decided, “Let’s not update the secret for nginx-deamon-mnhx2 because nothing changed in there because we didn’t change the basic class.”
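The decision described here, updating only the secrets whose class actually changed, can be sketched as follows. This Python illustration is an assumption about the general idea, not the operator's real logic; the pod names, class contents, and function name are hypothetical.

```python
def secrets_to_update(pod_classes: dict[str, str],
                      old_classes: dict[str, str],
                      new_classes: dict[str, str]) -> list[str]:
    """Return the pods whose class definition differs between old and new."""
    changed = {
        name for name, config in new_classes.items()
        if old_classes.get(name) != config
    }
    return sorted(pod for pod, cls in pod_classes.items() if cls in changed)


# Hypothetical state: Redis uses the "app" class, Nginx uses "basic".
pod_classes = {"redis-0": "app", "nginx-abc12": "basic"}
old = {"app": 'type = "app"', "basic": "[[outputs.file]]"}
new = {"app": 'type = "app"\nnewtype = "application"', "basic": "[[outputs.file]]"}

print(secrets_to_update(pod_classes, old, new))  # ['redis-0']
```

Only the Redis pod's secret is rewritten, because only the class it uses was edited; the Nginx pod is left alone.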
Wojciech Kocjan: 00:28:16.055 But let’s update the secret for Redis, because the class in there was app. So if I go back - and this is a mistake I’ve made. If I go back and also add this class and then tags. Basic. Basic. App. If I do that, then in around one minute, we should see - oh, K9s. We should see it reload again. And then in a while, we should [inaudible] it with all the logs. We should see another little message saying that’s updated. But right now, I can go back. And if I try to filter on - we called it type. Okay. Hold it like this. So maybe the change wasn’t reloaded yet on the Redis level. We should take a look at that in a bit.
Pat Gaughen: 00:29:16.703 But so, Wojciech -
Wojciech Kocjan: 00:29:17.240 [inaudible].
Pat Gaughen: 00:29:17.799 -just to summarize what you’re doing, you’re now kind of - you’ve got the Telegraf Operator in your local kind cluster. And now you’re adding more and more things for it to monitor [crosstalk].
Wojciech Kocjan: 00:29:30.840 Right.
Pat Gaughen: 00:29:31.901 Correct.
Wojciech Kocjan: 00:29:32.531 So I think what I -
Pat Gaughen: 00:29:32.995 Just thought I’d summarize.
Wojciech Kocjan: 00:29:34.406 Right. So maybe that’s a good point. So I’ll try to summarize what I’ve been doing and what’s happening, what I’m trying to show right now. So we had a Kubernetes cluster that did not have any workloads in it, which would often be the case, or maybe we would have some workloads. And then what we did is we deployed Telegraf Operator, which would start injecting the Telegraf sidecars to any new pods that were created. So for any new workloads or any workloads that would have the new annotations added, the Telegraf sidecars would be injected to those. And because changing the annotations on the pod would mean that the pod gets recreated, whenever we would be adding the annotations, the new pods would get created and they would start getting the Telegraf sidecars included. One other thing that I tried to show - and maybe I should have done a better job explaining these things - so that would be day one of operations. You would deploy Telegraf Operator. You would add annotations to all your workloads. And you would start seeing the data inside your InfluxDB or any other place where you would be loading the data to. But as you move into day two of operations, sometimes you need to change some of the settings. And this is an important aspect of this. Or sometimes you need to, let’s say, rotate your tokens, which I assume would not be manual. It would be some automated process. But that would be something that should be happening. Say you generate a new token. You have an automated process go and update it in the classes definitions. And then after a while, let’s say after 24 hours, you would delete your old token and expect everybody to be using the new token.
Wojciech Kocjan: 00:31:21.429 If the hot reload were not in place, this would mean that all of the workloads would have to be restarted, or at least the Telegraf sidecars would have to be restarted. With the hot reload functionality in place, Telegraf Operator and then the Telegraf sidecar take care of this automatically. And the day two operations are much easier with this [inaudible] functionality being available.
Pat Gaughen: 00:31:44.446 Now, Wojciech, you added the hot reload functionality. Was that like two or three months ago? Or maybe it’s a little bit longer now.
Wojciech Kocjan: 00:31:51.984 It was definitely this year. I don’t really remember when exactly. But that -
Pat Gaughen: 00:31:55.489 It’s all blur.
Wojciech Kocjan: 00:31:57.149 But that was exactly what happened as part of our internal use cases. So this was a pain point for a lot of things we were doing internally, which is that in some cases, we just want to change some settings or we just want to - so, for example, we want to change the frequency at which we get some of the data, because we want to increase or decrease the amount of data we’re storing, or we want to move some of the data to other places. We may be monitoring some data in our internal systems, but we also want to be moving some of the data to production systems, because we want these to be in the same place that our customers use it, so we can also use that port, so. For [crosstalk].
Pat Gaughen: 00:32:43.994 Yeah. It was -
Wojciech Kocjan: 00:32:44.759 [crosstalk].
Pat Gaughen: 00:32:45.141 -a huge gain, because you’d go and you’d - an engineer would change, like you said, the frequency. And then the next question would be like - they’d go and look, and they’re like, “It hasn’t changed. What’s going on?” So having that hot reload, adding that functionality, which was added earlier this year, fantastic. And also I wanted to say, as you mentioned, we’re using this in-house. So yeah, it was definitely kind of a frustration point when people would make a change and then they’d look for the change and it would take a little bit to - basically, it would have to wait, I’m going to say, Wojciech, until it naturally got restarted [laughter], which is kind of a funny use of the word naturally. But let’s just ignore that, but.
Wojciech Kocjan: 00:33:30.373 Right, because whenever the actual application code changes, then we would still restart it and see the changes. But the thing is, then it could be between a day and a week, depending on how often the code changes. With this, this is a matter of minutes. But like you said, this was a big thing for us, and this is a huge improvement for us. So going back to the dashboard and the data we have in here. So right now, if I reload this - right now, I can see the new type, so the field I added. And I did not go and restart anything. So this is the thing we talked about. It’s difficult to show it because it takes a few minutes for all the Kubernetes mechanics to kick in and change the underlying secrets, and then for this to trigger the underlying watch mechanism to notice this. Well, in the Kubernetes reality, waiting a few minutes for this change to get deployed to hundreds of thousands of Telegraf [sidecars?], this is very acceptable as opposed to the thing we mentioned, where I think it would be a matter of days or weeks before this data is visible. So right now, I can go in here and see my internal metrics as well. So this is a huge improvement, and this is, I think, a really nice feature of Telegraf Operator. And like I said, we could, for example - oh, one other thing that we wanted to show - because if we were to see the logs of, say, Redis and the Operator here, we would not be able to find the message that the logs [where?] we started because we kept seeing this data [flow?] again.
Wojciech Kocjan: 00:35:21.387 But if I were to, say, remove the [inaudible] file and deploy that, and then wait a few minutes while we perhaps do something else, I’ll also see that my data is no longer being written to standard output, which is also a pretty interesting feature. So -
Pat Gaughen: 00:35:44.611 Why would someone use that feature, Wojciech? What would they use that one for?
Wojciech Kocjan: 00:35:48.877 I mean, so I think the standard output is more like a debugging tool. So the reason why we include it is for when people develop, because then you don’t have to go to -
Pat Gaughen: 00:35:58.684 Okay. So the data -
Wojciech Kocjan: 00:35:58.762 So -
Pat Gaughen: 00:35:59.312 -is still going where it’s - it’s still going -
Wojciech Kocjan: 00:36:02.202 Right, right. Yes. So, I mean, one of the nice things about Telegraf is that we can send data to multiple outputs, right? So Telegraf Operator has extensive documentation. Telegraf itself has pretty good documentation of - I wanted the documentation. Yes. So we have a lot of different types of plugins, and they have good documentation. So basically for outputs, I could be writing to a ton of things. We were just using the file output and InfluxDB. So we were just using this plugin, and we can see its README file. And we were also using InfluxDB v2. We were using v1, but that’s kind of less interesting. But basically, we could contribute a lot of things, like I said, custom outputs. We could be filtering things at the output level, as you mentioned. It’s pretty powerful to be able to do that. We could be configuring a lot of things. I think the nice thing about Telegraf itself is that if it can’t write to one of the - let’s see. We wanted outputs, not inputs. If for some reason it tries to write to an output that isn’t working at the time, it’s going to retry and it’s going to buffer the data. And it’s also going to be smart about how much data it can buffer before it starts throwing data away. And all of that is configurable, which is a really nice thing as well, because we were toggling inputs and outputs, and Telegraf would just be automatically disabling the ones it has. But technically, I would be able to disable one of the outputs. And let’s say if it wasn’t able to write to another output, it would be smart enough to realize this is the same output. I’m just going to keep on using the same buffer.
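To illustrate the retry-and-buffer behavior described above, here is a minimal Python sketch. It is not Telegraf’s implementation: the class `BoundedMetricBuffer` and its methods are hypothetical, and the choice to drop the oldest metrics once the limit is reached is an assumption made for this example. In Telegraf itself the buffer size is set through the agent’s `metric_buffer_limit` option.

```python
from collections import deque


class BoundedMetricBuffer:
    """Hypothetical per-output buffer that keeps at most `limit` metrics."""

    def __init__(self, limit):
        # A deque with maxlen silently discards the oldest item when full.
        self.pending = deque(maxlen=limit)
        self.dropped = 0

    def add(self, metric):
        # Count the metric that is about to be pushed out, if any.
        if len(self.pending) == self.pending.maxlen:
            self.dropped += 1
        self.pending.append(metric)

    def flush(self, write):
        """Try to write everything; keep the data if the write fails."""
        batch = list(self.pending)
        if not batch:
            return True
        try:
            write(batch)
        except OSError:
            return False  # output unavailable, retry on the next flush
        self.pending.clear()
        return True
```

With a limit of 3 and five metrics added while the output is down, the buffer holds the last three; once the output recovers, the next `flush` writes them and empties the buffer.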
Wojciech Kocjan: 00:37:57.704 And I mean, we just did change the configuration. You can see that right now it just reloaded and it stopped writing outputs. But the nice thing is we can do all of this, like I said, in the Kubernetes world, where sometimes we don’t want to restart - I don’t know. We have Deployments or we have StatefulSets, and then other types of workloads. But we have workloads where a single type of microservice would have hundreds of pods. And then restarting all of them just because we want to tweak a single setting isn’t that great, whereas here, we could just apply a small change, reload it, and the whole system will just pick it up. And none of the things were restarted - like Telegraf itself, the sidecar, was not even restarted. It was just entirely within Telegraf. So we spent a lot of time inside the company across teams to get all of this working. And I think in general, Telegraf is a really nice tool to monitor Kubernetes, because, as I’ve shown, it’s really simple to support both just using Prometheus metrics and scraping them, and they end up going into InfluxDB, which is what we use a lot ourselves internally, because a lot of languages just make it natural to expose the metrics in this format. So it’s really neat that we could just specify the port or ports, the path, and Telegraf will be - and Telegraf Operator will just generate [inaudible] off of that.
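As a rough illustration of turning a port (or ports) and a path into a Prometheus scrape configuration, the Python sketch below builds a Telegraf `[[inputs.prometheus]]` snippet from pod annotations. The function name is hypothetical, the annotation keys follow the `telegraf.influxdata.com/` prefix used by Telegraf Operator, and the exact output the operator generates may differ; treat this as a sketch under those assumptions.

```python
PREFIX = "telegraf.influxdata.com/"


def prometheus_input_from_annotations(annotations):
    """Return a Telegraf TOML snippet that scrapes the annotated ports.

    Hypothetical helper: the defaults for path and scheme are assumptions.
    """
    raw_ports = annotations.get(PREFIX + "ports") or annotations.get(PREFIX + "port", "")
    ports = [p.strip() for p in raw_ports.split(",") if p.strip()]
    if not ports:
        return ""
    path = annotations.get(PREFIX + "path", "/metrics")
    scheme = annotations.get(PREFIX + "scheme", "http")
    # The sidecar shares the pod's network namespace, so it scrapes localhost.
    urls = ", ".join(f'"{scheme}://127.0.0.1:{port}{path}"' for port in ports)
    return f"[[inputs.prometheus]]\n  urls = [{urls}]\n"


if __name__ == "__main__":
    print(prometheus_input_from_annotations({
        PREFIX + "ports": "8080,9090",
        PREFIX + "path": "/metrics",
    }))
```

Because the snippet is derived only from annotations, changing a port or path on the workload is enough to change what the sidecar scrapes.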
Wojciech Kocjan: 00:39:26.714 But also, if you know that you’re running something that Telegraf knows how to scrape, then you can just use one of the many, many, many plugins. And you just inject this small snippet and Telegraf Operator will do it together with [inaudible] the output tool. And you can also have some additional settings in the classes. So it is really easy to manage. And from our experience - and we have large clusters; we have dozens and dozens of those clusters we have to manage - it is really useful to be able to do that. One of the community-contributed features that I think we’ll be shipping in the next release, which is happening really soon and which I’m very excited about, is the ability to also reference other [inaudible] secrets and to be able to reference some of the metadata. So if I would want to get some of the Kubernetes metadata, I can expose it as an environment variable in the annotation. I believe the annotation is something like env-fieldref. And I can say that my variable name is like [inaudible] space name, and it’ll just be metadata [inaudible]. And like I said, it’s [one of?] -
Pat Gaughen: 00:40:42.915 So what is a new feature that’s coming, Wojciech?
Wojciech Kocjan: 00:40:46.572 Ability to reference various things. So in this case, I’m referencing a Kubernetes field, meaning that this would tell my code what the namespace is, or the name of the pod is. So this would be like [inaudible] name. I could also expose that. This is useful in some cases, but we really want to tie this back to some of the fields. But I could also get the IP address of the pod, which I could then use to filter things. But I could also do something like secretkeyref and token, and I could say for my secret, which will be my token secret - dot. This would be the key name. Dot. Let’s say token, right? So with that, I could - for example, I’ll just put the [wrong?] example. [Let’s say?] token equals token. With that, I could have my token managed by a secret. I could reference that, and then Telegraf would - and then Kubernetes would load this as an environment variable for the Telegraf sidecar. And I could use it in the configuration, so I wouldn’t have to inject it in other places. So this is useful if, for example, we’re using other tools to manage the secrets, or the secrets are just managed by the application, because then the Telegraf sidecar would get it. This is not how [inaudible] would work because of the way it’s working, because of the Kubernetes internals. Maybe just something we could extend in the future. But this is still a pretty nice piece of functionality, because if for multiple reasons, we have some data in some other secrets and we just want to reference it, it’s much easier than having to hard code it in the annotation.
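The field and secret references described above can be sketched in Python as a mapping from annotations to Kubernetes container environment entries. This is an illustration only, not the operator’s code: the function name is hypothetical, the annotation forms `env-fieldref-<NAME>` and `env-secretkeyref-<NAME>` are assumed from the description, and splitting the secret value at the first dot into secret name and key is also an assumption.

```python
PREFIX = "telegraf.influxdata.com/"
FIELD = "env-fieldref-"
SECRET = "env-secretkeyref-"


def env_from_annotations(annotations):
    """Translate assumed env-* annotations into container env entries."""
    env = []
    for key, value in sorted(annotations.items()):
        if not key.startswith(PREFIX):
            continue  # not a Telegraf Operator annotation
        name = key[len(PREFIX):]
        if name.startswith(FIELD):
            # e.g. value "metadata.namespace" or "status.podIP"
            env.append({
                "name": name[len(FIELD):],
                "valueFrom": {"fieldRef": {"fieldPath": value}},
            })
        elif name.startswith(SECRET):
            # Assumed value format: "<secret-name>.<key>"
            secret_name, _, secret_key = value.partition(".")
            env.append({
                "name": name[len(SECRET):],
                "valueFrom": {"secretKeyRef": {"name": secret_name, "key": secret_key}},
            })
    return env
```

Kubernetes resolves `fieldRef` and `secretKeyRef` when it starts the container, so the token itself never has to appear in the annotation; the Telegraf configuration can then refer to the resulting environment variable.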
Pat Gaughen: 00:42:30.117 And you said this is a community contribution that’s kind of in review and will be part of the next release of the Telegraf Operator?
Wojciech Kocjan: 00:42:38.930 Yes, I’m really hoping it will be. And I’m really excited about that every time we get a contribution to Telegraf or Telegraf Operator, and I think that’s a really nice sign that people are using the tool and that people are willing to spend their time extending it. So we’re trying our best to help whenever anybody contributes in any way, even if someone just opens an issue. Like, we’ve had people open an issue that when they ran it, we had forgotten to create the namespace. And we were fixing those kinds of things. And that’s also great because this means someone took the time to give it a try. And if something was broken, they also let us know so we could fix it for other people.
Pat Gaughen: 00:43:19.485 That’s really cool. Wojciech, did you have anything more you wanted to share today? Or I think we’re kind of -
Wojciech Kocjan: 00:43:26.452 No. I think that’s -
Pat Gaughen: 00:43:27.340 Yeah. Caitlin, I think we’re finishing up with our part of the show. I mean, the -
Caitlin Croft: 00:43:33.268 Awesome.
Pat Gaughen: 00:43:35.064 -webinar.
Caitlin Croft: 00:43:34.966 You guys aren’t done yet.
Pat Gaughen: 00:43:36.744 We’re not done yet?
Caitlin Croft: 00:43:38.994 Well, thank you for that, Wojciech. I know live demos are always fun. So I know, Wojciech, you already sort of answered this, but how does a newbie get his or her arms around APIs? I know you showed the docs link. Is there anything else that a community member can do to get some help, or?
Wojciech Kocjan: 00:44:03.241 I think going to InfluxDB - so first thing is just getting on board with InfluxDB. I think the easiest option is to go to cloud.influxdata.com and play with the [inaudible] because there’s a free tier that provides most [inaudible]. Okay. [Because my typing?] [inaudible]. So basically -
Pat Gaughen: 00:44:26.909 Now they’re going to see where I got my picture from. Quickly.
Wojciech Kocjan: 00:44:30.340 So basically, just sign up for InfluxDB Cloud, which is the easiest way to do this, or just run the open-source version, whatever you want. Like, I just run it in my Kubernetes cluster. There is a container you could just run. There are binaries that we can just grab and run - there are multiple ways to run InfluxDB. And then when you go to the UI, there’s a way to get started with most languages. We also provide ways to get Telegraf configurations. But that is a slightly longer process. But basically, there are multiple ways to get the data in. You can also directly use the API, but I think we try to do our best to just get people started with whatever it is that they need to do, right? So we could just say some [inaudible] system data, and it’s going to basically - okay. Going to basically generate a whole config for me. And this is just a Telegraf configuration I can save. I can run Telegraf on my machine and [inaudible] start writing data to InfluxDB.
Pat Gaughen: 00:45:35.369 Well, and I would like to - so let me tell you what I do to go figure out anything on InfluxDB. I go and find blog posts from the fabulous Anais. So she has one. It’s like “TL;DR InfluxDB Tech Tips: Creating Buckets with InfluxDB API”. I am completely biased, but I think her blog posts are fantastic for a newbie, and then I think they’re also really good for someone who is not a newbie. So I would go look for some of those InfluxDB Tech Tips, where I think she talks through how to use the API to do some different things.
Wojciech Kocjan: 00:46:17.837 And I think one other thing worth mentioning is the influx CLI. It’s a great tool to do a lot of things, so anything from creating buckets to writing data, reading data. And it’s possible [inaudible]. There’s also a way to export and import both data and objects, so things like dashboards. It’s a really powerful tool and it’s also easy to get it set up. So that’s the other way.
Caitlin Croft: 00:46:48.031 I think also you already answered this, but can you share this data locally - or can you share this code for us to test it locally? I’m assuming it’s all in the repo?
Pat Gaughen: 00:46:58.111 Yep, I think that’s where you -
Wojciech Kocjan: 00:46:58.907 Yes.
Pat Gaughen: 00:46:59.198 -view that read-me for that. For that [crosstalk].
Wojciech Kocjan: 00:47:01.847 [crosstalk].
Caitlin Croft: 00:47:02.038 And actually, that question inspired me to say, “Oh yeah, we forgot to point you [inaudible] the repo.” [laughter].
Wojciech Kocjan: 00:47:08.021 And the Makefile is also a good starting point, because it provides easy-to-use make targets, like make kind-start. It deploys InfluxDB 1. It deploys a lot of things. Make kind-test basically deploys most things. And it’s even deploying Redis and showing you at the end that Redis has the sidecar container. That [inaudible] is going away. I just became manager of PR. I just didn’t have the time to do it today. [inaudible] wait, so we will wait for the operator and not just assume 20 seconds is enough. But there are a lot of make targets that just make it super easy to start.
Caitlin Croft: 00:47:45.399 And you touched on this briefly, but how is InfluxData using the Telegraf Operator internally? It sounds like it sort of was developed from an internal pain point as well.
Wojciech Kocjan: 00:48:00.057 So it was developed, I think, for both internal and external uses. But when we started deploying workloads and we were thinking about being able to handle that sort of Kubernetes clusters and large workloads, we were just discussing how to do this, how to get all the data. And given that we already had Telegraf as a very successful project with a long history, we wanted to use Telegraf. And we were just wondering how to do that, and Telegraf Operator was just a natural way of doing this. So we use it a lot for most of our workloads, meaning that one of the first things we deploy in our cluster is Telegraf Operator, which obviously [I wouldn’t make it?]. But that’s one of the first things we deploy. And then for all the workloads we monitor, we just add the same annotations, like [inaudible]. They may be slightly more complex than the examples we were showing. But it’s still the annotations we use. And for a lot of the code we write internally, we just expose them as [inaudible] metrics or expose them in other ways. So it all depends on what we’re monitoring. But we’re trying to use the native input plugins [inaudible] Telegraf [inaudible] plugins that [inaudible] - so for things like Redis, we would just be having those plugins get the data from Redis internally. For things that expose metrics, we get them as Prometheus metrics. It really depends on the use case. But most of the things we deploy just have the annotations and Telegraf gets deployed automatically.
Caitlin Croft: 00:49:37.166 And I’m just kind of curious. What, for both of you - this is a question that I have for both of you. What are you guys working on in the next six months that you’re really excited about that the community will like or get excited about as well?
Wojciech Kocjan: 00:49:55.783 That is a good question. So I know that we - I mean, I think we should also go back to why we’re using sidecar containers as opposed to DaemonSets, because we do get this question a lot. And I’m actually surprised this question hasn’t come up. So we deploy Telegraf as a sidecar. And this means that if we have lots of workloads, then there are a lot of Telegraf sidecar containers and there are a lot of processes that could be stopped by a DaemonSet. And so we chose to use the sidecars because we noticed that a Telegraf per pod is more successful at getting the data and being able to buffer if things ever go wrong temporarily. So it’s much more reliable if we monitor a single pod. But we’re also trying to figure out ways to do something between running a Telegraf sidecar for each pod and running it as a DaemonSet while monitoring all the nodes, sort of all the pods on a specific node. So just to explain briefly. A DaemonSet is something where there is one pod for each Kubernetes node, so for each dedicated [inaudible] or [inaudible] [machine?], depending on where Kubernetes is running. And we’re trying to figure out if there is a way to also handle workloads that don’t really get a lot of metrics without injecting Telegraf as a sidecar into each individual pod. And I think that is an exciting challenge, because maybe there could be some compromise, like some things being monitored as DaemonSets and some things being monitored as sidecars. But we don’t really have a good answer to that yet.
Wojciech Kocjan: 00:51:49.873 So we’re trying to tackle this, because that’s one of the things that could be helpful for us internally as well. And I’m sure a lot of people have this issue that with a DaemonSet [inaudible] pods, it’s too much data to gather. And then a sidecar for every single pod is too many resources being used for something that’s really small, microservices that often don’t get involved a lot.
Caitlin Croft: 00:52:18.135 And what about you, Pat? I mean, you’re nodding along, so you clearly agree. But anything else?
Pat Gaughen: 00:52:25.805 In terms of from the perspective of the Telegraf Operator, I think Wojciech covered it. But just generally -
Caitlin Croft: 00:52:34.104 In general, what -
Pat Gaughen: 00:52:34.995 -we’re just going to continue to make InfluxDB, our Cloud 2 SaaS product, screaming fast, and my team is working to continue to make it so that our developers can deliver sweet, sweet software to the users more quickly. So that’s what I’m always excited about.
Caitlin Croft: 00:52:55.020 [laughter] Awesome. Well, thank you both. I feel like there’s going to be lots of people checking out this webinar, and they might come bug you in the Community Slack with follow-up questions. So thank you, everyone, for joining today’s webinar. Once again, it has been recorded and will be made available for replay probably by tomorrow morning. So just go and check out the registration page. The slides as well as the recording will be made available. Thank you, Wojciech and Pat.
Pat Gaughen: 00:53:28.970 Thank you, Caitlin.
Wojciech Kocjan: 00:53:31.125 Thank you Caitlin. Take care, everyone.
[/et_pb_toggle]
Pat Gaughen
Senior Engineering Manager, InfluxData
Pat Gaughen leads the engineering team that deploys our Cloud 2 SAAS offering into the clouds. She's passionate about software at scale, unicorns and working with amazing people from all over the world. Before she joined InfluxData, she worked at Canonical and the IBM Linux Technology Center. She works from her basement office in Portland, Oregon.
Wojciech Kocjan
Software Engineer, InfluxData
Wojciech is a Software Engineer at InfluxData, focusing on automation of InfluxDB Cloud deployments across multiple clouds and regions. He has around 10 years of experience with multiple public clouds. He has worked in software and with open source for over 10 years as a developer, team leader, and architect. Most of his career has been in startups, helping with automation of application packaging and deployment.