Best Practices: How to Analyze IoT Sensor Data with InfluxDB
Session date: Feb 11, 2025 08:00am (Pacific Time)
InfluxDB is the purpose-built time series platform. Its high ingest capability makes it perfect for collecting, storing, and analyzing time-stamped data from sensors—down to the nanosecond.
Join this webinar as Anais Dotis-Georgiou provides a product overview for InfluxDB 3. She will lead a deep dive into some helpful tips and tricks to help you get more out of InfluxDB. Be sure to stick around for a live demo and Q&A.
Join this webinar to learn:
- The basics of time series data and applications
- A platform overview—learn about InfluxDB, data collection, scripting languages, and APIs
- InfluxDB use case examples—start collecting data at the edge and use your preferred IoT protocol (e.g., MQTT)
Watch the Webinar
Watch the webinar “Best Practices: How to Analyze IoT Sensor Data with InfluxDB” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Best Practices: How to Analyze IoT Sensor Data with InfluxDB.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors. Speakers:
- Anais Dotis-Georgiou: Developer Advocate, InfluxData
ANAIS DOTIS-GEORGIOU: 00:01
Hello, everybody, and welcome. My name is Anais Dotis, and today, we’re going to be talking about IoT best practices. This webinar will start in just a couple of minutes, but I want to give everybody a chance to get settled in and to join, so I will give it a couple minutes just to let everyone join in. And I also want to go over some brief housekeeping before we start the webinar. That’s mainly around questions: if you have any questions during this webinar, please ask them in the chat or the Q&A, and I’d be happy to answer them there. And if I can’t answer them because I don’t know the answer, I’ll write down your question, and then I will post the question and answer in the community forums. We have Discourse community forums as well as Slack, and I invite you to ask any questions that you have in those areas as well, in case you don’t have a question immediately here or you have a question later. And also, I’m based out of Austin, Texas, and I’d love to hear where you all are based out of; I’m curious about that and love to see where all the participants are coming from.
ANAIS DOTIS-GEORGIOU: 01:11
And let’s see. What else? Yeah. A recording of this webinar will be made available to you after the webinar is over, so you should see that in your email or your inbox shortly. And that’s so cool, we have people from all over the world today. So welcome, everyone. Germany, Madrid, Spain, Finland, Michigan. “The audio mic is too low.” Okay. Thank you so much for letting me know. I will make sure to project more too. And if anyone else is having any problems with the audio, please let me know. But is that better for everyone? So cool to see everyone from all over. All right. Well, with that, I will just go ahead and get started. So today, we are talking about IoT best practices. And this is kind of a high-level talk because, really, we’re going to be talking primarily about the basics of time series data and some of the applications. And then we’re going to go into a platform overview of InfluxDB and Telegraf and the ecosystem compatibility there. And within that, we’ll talk a lot about some IoT best practices that InfluxDB can help you address, including things like optimizing your data ingestion and leveraging line protocol, which is the ingest format for InfluxDB, and also the last value and distinct value caches that are part of InfluxDB Core and Enterprise, which are two versions of InfluxDB that were just released and are currently in alpha.
ANAIS DOTIS-GEORGIOU: 02:56
We’ll also talk about efficient data modeling and how you want to structure that line protocol and differentiate between your tags and your fields within it, or your metadata and the meat of your time series data, to optimize the last value cache and distinct value cache to return query results very effectively and efficiently. We’ll also talk about ways that you can leverage retention policies in the cloud versions and downsampling to help manage your data and automatically expire data. And we’ll also talk about edge computing and considering deploying InfluxDB at the edge for real-time processing and to reduce the latency in the processing of IoT data before sending it to a centralized hub. Then we’ll also talk about query optimization and leveraging both InfluxQL and SQL to write efficient queries. And then we’ll also just talk briefly about security and access control, and how you can enable authentication and TLS encryption with InfluxDB, which is another best practice for any IoT use case.
ANAIS DOTIS-GEORGIOU: 04:09
Then we’ll talk about integration with IoT middleware, and we’ll utilize tools like Telegraf to seamlessly collect data and integrate with all your IoT devices. And then last but not least, we’ll talk about some example projects that you can try yourself. I’m part of the DevRel team. We have a collection of demos and proofs of concept that exist in the Influx Community repo on GitHub. And there are a ton of projects there that leverage a bunch of different IoT protocols, as well as projects for all sorts of different scenarios. And so, I highly recommend using that as a resource. It’s a great place to get started with InfluxDB. So, my name is Anais Dotis-Georgiou, and I’m a developer advocate here at InfluxData. And I encourage you to connect with me on LinkedIn if you want to do so. And for those of you who aren’t familiar with what a developer advocate is or what they do, basically, I like to summarize my job as representing the company to the community and the community to the company. And I do so with webinars like this, where I try to help educate the community about what InfluxDB is all about, but also by bringing feedback from the community back to product, answering questions on forums and Slack, writing blog posts, and attending conferences. So that’s pretty much what I do here. And with that, let’s get started into kind of the meat.
ANAIS DOTIS-GEORGIOU: 05:47
So, let’s go into time series basics and understanding what time series data is and the rise of time series data applications and the specifics about this data, how it’s powering these apps, and some of these solutions. So, what is time series data? Well, it’s probably what you imagine it is. It’s just any data that is a sequence of data points that has a timestamp associated with it. And you’re usually looking at the data in a time-ordered fashion. And time series data comes from a bunch of different sources, but primarily, two main sources, and that’s the physical world and the virtual world. So, if you’ve ever looked at a temperature graph or a year’s worth of stock prices or a heart rate monitor, then you’ve seen time series data. And all this data represents a recorded value and the timestamp of that value and some additional information that describes the values and ways that people and applications could maybe use that data to organize and understand it. So, for example, you might have a sensor ID or a patient ID or a sample ID as part of the metadata that’s associated with your time series data, or, yeah, you name it. And so, there are those two main sources for time series data.
ANAIS DOTIS-GEORGIOU: 07:08
Through the physical world, we mainly get time series data through sensors. And that can include sensors like pressure, temperature, humidity, color, light, flow rate, you name it. And then through the virtual world, we’re seeing time series data through applications and infrastructure. So, we’re looking at logs, metrics, and traces. InfluxDB focuses on metrics and events. And basically, the entire corpus of this physical and virtual instrumentation provides historical records of operations and behavior, both good and bad, and they help you understand the environment that you are monitoring. And they also allow you to do things like predictive maintenance and forecasting and anomaly detection so that you can have more control over your environments. And we like to separate time series data into kind of three main categories: metrics, events, and traces. And InfluxDB focuses on storing metrics and events. So, metrics are, basically, any time series data that has a regular timestamp associated with it, and they’re created by sampling. And events are an irregular type of time series data (they could be a cardiovascular event, for example), and they’re usually created by exception. So those are kind of the two main categories that we like to talk about. So, here’s an example where metric data for time series is collected at a one-second interval. And then an event is more sporadic because maybe you’re pulling the data from a particular source.
ANAIS DOTIS-GEORGIOU: 08:59
We also need to think about timestamp precision when we’re talking about time series data because we’re talking about understanding when an event happens versus when a metric happens. And to do that, you need to know exactly when these timestamps, and when this timestamped data, is happening. So, if you are, for example, monitoring a rocket launch, then you’re probably going to need to see data coming in at least every second, probably on a millisecond timeline or maybe faster. And so, there’s kind of a simple rule for determining what timestamp precision to use: if the source system uses seconds, then you keep seconds. And if the source system uses nanoseconds, then you keep nanoseconds. But you do want to be able to leverage a database that will afford you the timestamp precision that you need for your use case. So InfluxDB does support nanosecond precision. And yeah, you’re going to want to be able to leverage that if you have use cases like that.
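As a quick, hedged illustration of keeping the source system’s precision: with the v3 Python client (influxdb3-python), you can build line protocol with an explicit nanosecond timestamp. The host, token, database, and measurement names here are placeholders, not values from the webinar:

```python
import time

from influxdb_client_3 import InfluxDBClient3  # assumes the influxdb3-python client

# Keep the precision the source system gives you: time.time_ns() yields
# nanoseconds, so we write a nanosecond timestamp rather than truncating.
ts_ns = time.time_ns()
line = f"launch_telemetry,rocket_id=r1 thrust=7600000.0 {ts_ns}"

# Placeholder connection details; substitute your own host, token, and database.
client = InfluxDBClient3(host="http://localhost:8181",
                         token="my-token",
                         database="telemetry")
client.write(record=line)  # trailing line protocol timestamps default to nanoseconds
```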
ANAIS DOTIS-GEORGIOU: 10:03
You also want to think about data granularity. So, granularity is a concept that’s closely related to timestamp precision and represents the number of time series data points per some measure of time. An easy example: 1,000 data points per second is more granular than 100 data points per second. And so, for certain types of time series data analytics, you need to validate the granularity of your time series fully and be able to represent the shape of your data by having the right type of granularity. It’s easy to miss the overall trends of your data if you’re collecting data at too low or too high of a granularity, and sometimes you might want to perform downsampling with your data to be able to capture the behavior of your data. So, an example here: you have your data coming in as an actual waveform. If you sample it at 10 points, then your data granularity is good because you’re able to capture the original waveform. Sampled at 6 points, also decent, but sampled at 2 points, you probably start to lose some important information. Similarly, sometimes, if you want just the overall trend of your data and you sample at too high a granularity, then you might obscure the overall shape and the overall behavior of your system by having too much information. So that’s just something to keep in mind.
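To make the granularity idea concrete, here is a minimal pandas sketch (with simulated data, not data from the webinar) that downsamples a 1,000-points-per-second signal to 10 points per second by averaging:

```python
import pandas as pd

# Simulate one sensor sampled 1,000 times per second for five seconds.
idx = pd.date_range("2025-02-11 08:00:00", periods=5000, freq="1ms")
raw = pd.DataFrame({"temperature": range(5000)}, index=idx)

# Downsample to 10 points per second by averaging each 100 ms window:
# coarser granularity, but the overall trend of the signal is preserved.
downsampled = raw.resample("100ms").mean()
print(len(raw), "->", len(downsampled))  # 5000 -> 50
```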
ANAIS DOTIS-GEORGIOU: 11:38
And some key drivers for time series applications just include accessing your data, being able to collect your data, being able to analyze your data, and being able to act on your data. You can’t manage what you don’t monitor. And this is especially true for any technology investments or IoT applications. So that’s kind of the main thought there. Actually, I think I’m going to skip that slide for today. But I do want to take some time to talk about a platform overview for InfluxDB to give us a basic understanding of how we can collect data with InfluxDB and how this all comes together. So here is the reference architecture for InfluxDB, which is just a clear way to represent all the components and everything that developers will care about when they’re looking for functionality and understanding the functionality of InfluxDB and this technology. So basically, InfluxDB has all sorts of open-source data load tools. We have agentless pulls and scrapers, which can pull data on a regular cadence from a given endpoint.
ANAIS DOTIS-GEORGIOU: 12:55
And then we also have, external to InfluxDB, additional ways in which you can collect data for storage and visualization within InfluxDB. And then at the top, we also have Telegraf. And Telegraf is an open-source, low-code, standalone agent for metrics and events. It features hundreds of different plug-ins that you can configure with a single TOML configuration file to pull and push data from a variety of different sources, process that data, aggregate that data, and then send that data to InfluxDB as well as other data stores, because it is database agnostic. It’s a lightweight agent that has buffering and caching capabilities and can help you consolidate all your data from MQTT devices or MQTT brokers, for example, and write it to a consolidated hub, which could be InfluxDB.
ANAIS DOTIS-GEORGIOU: 13:56
And then we also have client libraries. And you can use those client libraries to build applications, to extend existing ones, and to enable access to InfluxDB, and you have access to a bunch of different libraries that allow you to do things like analyze your data, visualize your data, and apply machine learning algorithms for forecasting and anomaly detection. And then there are also a few native tools and applications built using the client libraries that have specific collection needs. But there are also integrations with a bunch of BI and visualization tools, things like Grafana, Superset, and Tableau, etc., that you can leverage for visualization as well. And so InfluxDB 3.0, let’s talk about it and the open data architecture. So essentially, it’s built on open standards for seamless interoperability. And what that means is that it’s built on the Apache technologies: Apache Arrow, Arrow Flight, Apache DataFusion, and Apache Parquet. Flight is the transport protocol for columnar data, and it’s based on the Arrow format. DataFusion is the query execution engine that’s written in Rust, and it is what performs all the query optimizations, including pruning and pushdowns. It allows you to query InfluxDB in both SQL and InfluxQL, and InfluxQL is a SQL-like query language that’s proprietary to InfluxDB. Then we also have Apache Arrow, and that’s the columnar in-memory format. And Parquet is the column-oriented file format, so that’s kind of Arrow’s on-disk counterpart.
ANAIS DOTIS-GEORGIOU: 15:58
And basically, all these open standards allow you to easily integrate with existing infrastructure and new technologies that leverage them as well. It helps enable seamless data flow and helps you avoid vendor lock-in, and it also allows us to eventually have Iceberg integration and, therefore, Snowflake integration. That means that when you’re thinking about what technologies you want to use for your IoT infrastructure and architecture, having something that gives you snapshots into your data and lets you read that data into a data warehouse for larger processing with other data is just the right move. You want to be able to consolidate data across all your different IoT devices and all your different protocols, and mix it with data that’s not necessarily just time series, but with relational data, so that you can perform full analytics and not be locked in that way.
ANAIS DOTIS-GEORGIOU: 17:02
And this is the ingest format for InfluxDB. It’s called line protocol. And basically, you need to know about it for efficient data modeling within InfluxDB, especially when it comes to creating last value and distinct value caches for InfluxDB Core and InfluxDB Enterprise. If you’re using Telegraf or the client libraries, all of them are, at some point, taking your data from whatever source you have, deserializing it, and then serializing it into line protocol before writing it to InfluxDB. And the way that you write it is basically this: you have a measurement, or table; those are the same thing in InfluxDB. Then you have a tag set, and that’s where you include the metadata about your time series data. So, for example, if I was collecting data, let’s say I’m monitoring the air and I’m getting carbon monoxide data, temperature data, and humidity data, my tag set might include a sensor ID or maybe even the location where that sensor exists. And then my field set would have my carbon monoxide, my humidity, and my temperature values. And then finally, I would have a Unix timestamp that can be up to nanosecond precision. So that’s pretty much what line protocol looks like.
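As a concrete, illustrative example of that air-quality scenario (the sensor ID, location, values, and timestamp below are made up), a single line of line protocol has the measurement, then the tag set, then the field set, then a Unix timestamp in nanoseconds:

```
airSensors,sensor_id=TLM0100,location=austin co=0.48,humidity=35.2,temperature=73.9 1739260800000000000
```

The measurement and tag set are separated from the field set by a space, and the field set from the timestamp by another space; tags hold the metadata, while fields hold the measured values.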
ANAIS DOTIS-GEORGIOU: 18:31
And you just want to be cognizant of what you make tag and field sets. And like I mentioned, tags are like labels, and they’re designed to further specify or disambiguate similar signals or similar time series. And then fields are the primary numerical values that need to be monitored. So, tags are best for metadata and fields are best for samples. And then you can use InfluxQL and SQL to query that data, specifying which columns you want to select, whether that’s a temperature column or a sensor ID column, or in other words, a temperature field or a sensor ID tag. And then similarly, you might want to create and set up a last value cache or a distinct value cache. Last value caches let you cache the most recent values for specific fields in a table. And distinct value caches similarly let you cache the distinct values for specific fields in a table. This vastly improves the performance of queries that return the most recent value for a field of a specific time series or the last n values of a field. And this is typical for many monitoring workloads: you need to be able to do this without having to return all the other points that might exist as a part of a time series or a collection of series, and you want to be able to just collect that last value without having to specify a particular range for your data to query and look back from.
ANAIS DOTIS-GEORGIOU: 20:14
And with the last value cache and distinct value cache, these types of queries usually return in under 10 milliseconds and 30 milliseconds, respectively, and that holds even when you have millions and millions of points. So, this is another instance where you want to think about efficient data modeling for your IoT use case and how you want to structure some of those tags. Then, when you’re setting up the cache, you want to specify the hierarchy: what is the first node in the tree for this cache? Let’s imagine you have maybe two tags, T1 and T2; you decide which one gets priority versus second priority for the key column hierarchy, or the tag column hierarchy. And then you have a value column that gets populated at the base of your cache, for the distinct value cache or last value cache, with a buffer size of a certain amount. So then, when you’re querying a particular series, let’s say in this example maybe T1 is the location and T2 is the sensor ID, and then we have our temperature value for that particular sensor ID and that location, we can easily access the last n values for that particular time series.
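For context, the shape of query these caches accelerate is the classic “latest value for one series” lookup, which otherwise has to sort and scan the series. A hedged sketch with the v3 Python client, reusing the illustrative table and tag names from above (location first, then sensor ID, matching the key column hierarchy she describes):

```python
from influxdb_client_3 import InfluxDBClient3  # assumes the influxdb3-python client

# Placeholder connection details.
client = InfluxDBClient3(host="http://localhost:8181",
                         token="my-token",
                         database="telemetry")

# The classic "most recent value for one series" query: filter down to a
# single series via its tags, order by time, and take one row.
sql = """
    SELECT time, temperature
    FROM airSensors
    WHERE location = 'austin' AND sensor_id = 'TLM0100'
    ORDER BY time DESC
    LIMIT 1
"""
table = client.query(query=sql, language="sql")  # returns a PyArrow Table
print(table.to_pandas())
```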
ANAIS DOTIS-GEORGIOU: 21:44
And then a little bit about deployment and deployment flexibility. So basically, InfluxDB has a lot of different deployment options, including Cloud Serverless and Cloud Dedicated. Serverless is for small and medium workloads, and Cloud Dedicated is for large enterprise workloads. These are both managed. Then we also have Clustered, and that’s for large enterprise workloads as well, but this is a self-managed environment. And then we have Core and Enterprise, and those were just released, and they are currently in alpha. Core is the open-source version, and it allows you to query recent data, up to a certain limit of Parquet files. And the idea there is that Core is used at the edge for IoT, and then you consolidate that within either Enterprise, which is also self-managed, or any of the cloud-managed options as well.
ANAIS DOTIS-GEORGIOU: 22:50
And why are we looking at InfluxDB in general for time series? Well, one big reason is just the massive growth of IoT data. By 2025, so by the end of this year, we’re looking at data creation growing to more than 180 zettabytes. And when you’re looking at things like InfluxDB, deciding how you want to store this data and with what tools, one big consideration you need to have is: can you accommodate this sort of massive growth in IoT data, and can you handle the ingest rates? And I will go ahead and share an InfluxDB benchmark, but basically, InfluxDB is able to ingest over 330 million rows per hour, so really high ingest rates, or 4 million points per second, as an example. And being able to handle that type of ingest, if you need it, is a consideration. And what I would consider a best practice is looking for a tool that can handle the ingest that you need.
ANAIS DOTIS-GEORGIOU: 24:26
Another one is considering the Edge and Hub architecture for InfluxDB, specifically around removing data silos and simplifying access to your data by collecting data at the edge, performing maybe some of your real-time analytics there, and reducing latency in your IoT processing by keeping some of your data at the edge before sending some of it to a centralized hub. And this is especially true with InfluxDB Core and Enterprise because there is a Python data processing engine. Currently, there are only triggers available on WAL flush or WAL sync. But soon, there will also be a bunch of other triggers available, on a schedule or via HTTP. And so, it’ll allow you to basically add any sort of processing that you want, allow you to call other endpoints as well as a part of that processing and then write that data to a centralized location, and allow you to perform things like edge data replication as well.
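As a rough sketch of what a WAL-flush plugin for that Python processing engine can look like: the trigger signature and the influxdb3_local helper below are based on the early alpha documentation, so treat them as assumptions to verify against the current docs.

```python
# Assumed plugin shape for the alpha Python processing engine: InfluxDB
# calls process_writes() on every WAL flush and hands it the newly
# written rows, grouped by table.

def process_writes(influxdb3_local, table_batches, args=None):
    for batch in table_batches:
        if batch["table_name"] != "airSensors":   # illustrative table name
            continue
        for row in batch["rows"]:
            # Example processing step: flag hot readings at the edge so they
            # could be forwarded to a centralized hub or alerted on downstream.
            if row.get("temperature", 0) > 100:
                influxdb3_local.info(f"hot reading at the edge: {row}")
```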
ANAIS DOTIS-GEORGIOU: 25:50
So, someone asked right now, Dimitri asked, “Are there any improvements planned to the replication when the edge-to-cloud connection is not 100% reliable? It takes 2.2 times X– it takes hours to recover.” So, I do believe that the processing engine will help with that, for sure. And I will share some getting started documentation with that as well. And also, the DevRel team and I are looking to create a collection of Python plug-ins for the processing engine for Core and Enterprise to help get the community started with that because, like I said, it is in alpha. But also, whether or not it takes a long time to recover also depends on how much data you’ve collected at the edge, how much data you do have, what your buffer looks like, and how long it just takes to replicate that back.
ANAIS DOTIS-GEORGIOU: 26:51
And then I also want to talk a little bit about interoperability with InfluxDB 3.0. Parquet offers a lot of interoperability with pretty much all modern ML and analytics tools. And then, because DataFusion supports both SQL and a DataFrame API for logical query plans, you can basically execute that engine against Parquet files. And that gives you interoperability with a lot of other popular languages, including C++ and Python and Java. And then you can easily convert those Parquet files into Pandas DataFrames and vice versa. And then that also gives you interoperability, like I mentioned, with Iceberg, which then lends itself to things like Snowflake. So, let’s go specifically into that. So here are a few of the integrations that we offer within the industrial IoT space. As an overview of the stack, basically, we’re looking at how data moves from industrial processes and assets at the bottom, through middleware solutions, into InfluxDB for the data persistence itself, and finally up to applications and analytics tools. The process and asset layer, the bottom layer, includes the industrial control systems and SCADA systems, PLCs for robotics, and sensors. These devices are usually what generate the time-stamped data that must be collected, stored, and then analyzed. And then we have our middleware layer. And this includes Telegraf, which is the collection agent created by InfluxData, and then also Kepware and HiveMQ and NiFi and HighByte. And these handle the data collection, transformation, and routing. And they standardize data from the various sources before sending it to InfluxDB.
ANAIS DOTIS-GEORGIOU: 28:47
So Kepware, which I mentioned already, is the industrial connectivity platform. HiveMQ is the message broker. NiFi is the dataflow management system, etc. And then we have the data persistence layer with InfluxDB, and it forms the core of the stack. And like I mentioned, it can run at the edge, as a central data center, or in the cloud, depending on what you need. And then at the top layer, we have our applications, and that’s tools like Grafana, Tableau, Superset, etc. And that just provides us a simple way to dashboard our data and perform some advanced analytics. Specifically looking at the middleware and integration there for IoT, why that matters is because it gives you the ability to standardize and also to have protocol translation. And being able to pull data from all that middleware into InfluxDB is also what contributes to things like efficient data collection and buffering, having, like I said, that data transformation or protocol translation, and oftentimes the ability to enrich the data as well, while providing all the security and reliability that you need by managing the authentication and encryption. Because of the routing, that also gives you scalability and flexibility options. And then also, there are client libraries for InfluxDB in Go, Java, C#, JavaScript, and Python for all the v3 versions.
ANAIS DOTIS-GEORGIOU: 30:37
So, a lot of ETL and DAG tools leverage Parquet, and then you can work with those Parquet files, query them, and do things like use Python and Pandas to convert them and do additional analytics or any sort of data processing with Pandas, for example, which is kind of my favorite way to do data analytics and data transformations. And then here’s another example of interoperability. This is interoperability with Tableau specifically. And Tableau does have some statistical forecasting methods that you can use out of the box and simply apply to your time series data. Here’s an example where we’re looking at carbon monoxide over time, and then we apply a forecast. And it determines which forecasting method produces the best forecast out of the box. So, you don’t even have to do any of that yourself, which is pretty nice.
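That Parquet-to-Pandas round trip she mentions is a one-liner in each direction. A minimal sketch (the file names are made up, and pandas needs pyarrow or fastparquet installed for Parquet support):

```python
import pandas as pd

# Parquet -> DataFrame: load a Parquet file for ad hoc analytics.
df = pd.read_parquet("sensor_data.parquet")

# Any transformation you like, e.g. a per-sensor mean of the numeric columns.
summary = df.groupby("sensor_id").mean(numeric_only=True)

# DataFrame -> Parquet: write the results back out in the same open format.
summary.to_parquet("sensor_summary.parquet")
```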
ANAIS DOTIS-GEORGIOU: 31:39
And I also wanted to talk about some example projects that you can try yourself. So, the first one is called Saving the Holidays. And one second, let me share a link. I just realized the last link that I shared, I accidentally shared only to the host and panelists and not to everyone. So, the first link I shared is the Influx Community org. It’s where we, as community members and as DevRels, share a bunch of different projects as well as all the v3 client libraries. So, you can find so many examples for using InfluxDB with a variety of other tech stacks for a lot of different use cases, so that you can get a good idea of how to use InfluxDB and how to get started. But one project that I wanted to highlight is Saving the Holidays, and you can scan that QR code or just search for it within Influx Community. Basically, this project involves three robot arms that generate random dummy data and send that data to HiveMQ, which we’re using as our MQTT broker. And then we use an MQTT client to send that data to InfluxDB.
ANAIS DOTIS-GEORGIOU: 32:58
And from there, all that data is called machine data, and it goes into a machine data table. We query that data out and apply a Hugging Face algorithm to it to perform some anomaly detection. It’s an auto-encoder that we use here. And we write those ML results to a destination table called ML results within InfluxDB Cloud. And then we visualize the raw data as well as the ML results within Grafana. And if there is a difference between our expected results, or our forecast, and our machine data, and that error is too great, then we know that we have an anomaly from our robots and we can detect those anomalies. And the cool thing about this Saving the Holidays demo is that it is scalable, precisely because we use an MQTT broker and because all this workflow here is done within Quix. And Quix uses Kafka under the hood, and it abstracts away all the pain of using Kafka, but it does make all these data streams extremely scalable, especially in combination with HiveMQ. So, in this example demo project, we’re only looking at three dummy robots, but we could be looking at hundreds of thousands, with data written at much higher precision, and this would all still work.
ANAIS DOTIS-GEORGIOU: 34:34
We also have a plain MQTT simulators repo as well. And this is an example of the Telegraf TOML configuration. You can see how, basically, you can connect to an MQTT broker with the MQTT consumer input plug-in here. You basically provide the server that you want to connect to, in this instance a Mosquitto broker, any topics that you want to subscribe to, connection timeout details, and the format of the message that you’re getting from your MQTT broker. For example, if it is JSON, then you might want to use the JSON v2 format. There’s a JSON format and a JSON v2 format; JSON v2 just has more options for handling more nested and complex JSON. You specify that you want to use that JSON v2 format, and you specify what part of the JSON is the measurement name, or you provide a measurement name or table name. Then you specify the paths within the JSON, which keys are tags, and what information, maybe, you want to exclude. And then anything else, it assumes, is a field. And that’s how you use Telegraf. These MQTT simulators simulate IoT data. They’re all Dockerized, and they come with an MQTT broker like Mosquitto and Telegraf configurations. So, it’s a great place to get started with that.
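Pulling those pieces together, a Telegraf configuration along the lines she describes might look like the TOML sketch below. The broker address, topic, JSON paths, and output settings are illustrative placeholders, not the exact config from the repo:

```toml
[[inputs.mqtt_consumer]]
  servers = ["tcp://localhost:1883"]      # e.g. a Mosquitto broker
  topics = ["sensors/#"]                  # topics to subscribe to
  connection_timeout = "30s"
  data_format = "json_v2"                 # the parser with richer JSON support

  [[inputs.mqtt_consumer.json_v2]]
    measurement_name = "airSensors"       # the measurement/table name
    [[inputs.mqtt_consumer.json_v2.tag]]
      path = "sensor_id"                  # this JSON key becomes a tag
    [[inputs.mqtt_consumer.json_v2.field]]
      path = "temperature"                # listed keys become fields

[[outputs.influxdb_v2]]                   # v2-compatible write endpoint
  urls = ["http://localhost:8086"]        # placeholder; point at your instance
  token = "$INFLUX_TOKEN"
  organization = "my-org"
  bucket = "sensors"
```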
ANAIS DOTIS-GEORGIOU: 36:12
And then there’s also the CSTR InfluxDB project as well. So basically, the goal of this project was to create a scalable digital twin of a CSTR. A CSTR is just a continuous stirred-tank reactor. It’s probably the most common type of reactor in the chemical, petroleum, and manufacturing industries, and there’s a PID, or proportional-integral-derivative, controller that controls the temperature of the CSTR itself. And it leverages Faust, which is a Python library for building stream processing applications. It uses a Kafka-like model, and it’s all open source. And we leverage Kafka as well as the streaming platform to build this real-time pipeline. Then Telegraf pulls data from Kafka, puts it into InfluxDB, and then you can visualize that data within Grafana. And basically, what it’s doing under the hood is this: there’s a CSTR simulator, and it’s sending the concentration of a particular chemical in the reactor and the temperature of the reactor itself to a particular topic. And then it’s also consuming the cooling jacket temperature as well. And then we have another topic called PID control. So, the CSTR topic is for all the data created by the CSTR simulation, and the PID control topic has all the data that the controller needs to operate.
ANAIS DOTIS-GEORGIOU: 37:52
And then, basically, the PID controller, in the set point simulation, consumes the concentration of that particular chemical and the temperature of the reactor itself, and it outputs a set point. It says, “Okay. Based on what’s going on in the reactor, now I know what the temperature of the cooling jacket should be for optimal performance.” And so, basically, behind the scenes, it’s solving a bunch of differential equations to do this type of work. It’s also a good reminder that in the IoT space, you oftentimes don’t need to throw really fancy machine learning algorithms at a problem to do meaningful work, to simulate a particular environment, and to be able to model a complex environment. And also, we use Kafka here just so that you could scale this architecture as well, from just one CSTR, just one reactor, to multiple reactors and multiple PID controls by adding various partitions. And then last but not least, I wanted to just talk about some IoT use cases and why you would use InfluxDB for industrial IoT. And I think I talked about a lot of these cases already, but primarily, some of them are to get real-time insights on your data and enable things like predictive maintenance that can then improve the profitability of your factory floor and, therefore, optimize OEE and production.
ANAIS DOTIS-GEORGIOU: 39:21
And we have a bunch of customers in the industrial IoT space, including Honeywell, BTC, Bboxx, and Heineken. And when we think about industrial IoT and IoT in general, we like to separate our customers into two main categories: industrial and consumer. And I did want to talk about a couple of companies that I actually like a lot. So Bboxx, they develop and manufacture products to provide affordable, clean solar energy to a variety of off-grid communities, especially communities in the developing world. The name Bboxx is short for battery box. And they had an interesting problem, which was: how do they become a data-driven company and continuously monitor all these geographically dispersed solar rooftop units, close to 100,000 of them? And they use InfluxDB just because it was easy to spin up, use, and operate. And the result is that they’re able to provide close to 400,000 people across 35 countries with electricity. Another cool use case that I like a lot is FarmPulse. And I believe that they’re Australian. They’re an agribusiness that provides solutions for reporting remote sensor data. So, they basically store their data in InfluxDB OSS, and they use LoRa and satellite connectivity to report farm data.
ANAIS DOTIS-GEORGIOU: 41:17
And then Spiio is another cool one. They have a bunch of green wall installations in all sorts of offices, and they use all sorts of sensors to monitor those green walls so that they can make sure the health of them is good. And they use InfluxDB as well to query that. We also have use cases in the mining space. The Pilbara is a region in Western Australia with a world-class integrated network of 17 mines, and basically, they need to monitor iron ore and some of those assets across all those mines, so they use InfluxDB for some of that. We also have beverage manufacturing. So InfluxDB helps remove some data silos and uncover various cost issues with beverage manufacturing. Bevi also uses InfluxDB, to talk about a consumer use case. I don’t know if you’ve ever seen those in an office where you work, but basically, those machines help you create custom sodas and custom drinks. And that’s another instance where it’s not quite beverage manufacturing; I mean, it’s on a really small consumer scale. But yeah, there are all sorts of use cases where InfluxDB is used to monitor brewing and beverage manufacturing. We also have hobbyists that use InfluxDB to monitor their brewing at home.
ANAIS DOTIS-GEORGIOU: 43:17
That being said, there’s another project in the Influx Community repo that leverages Telegraf. I think it’s called BG Brewing, where Telegraf is used to create mini forecasts on micro-batches of temperature data, before the data even reaches InfluxDB, for someone’s home-brew process. So, lots of examples there. And you know that any brewing use case is definitely using multiple CSTRs. And then these companies all use InfluxDB as well, whether it’s monitoring battery walls or your home thermostat. And I want to encourage you: if you are someone that enjoys learning by taking online courses, then InfluxDB U is probably a really good resource for you. We have a bunch of online courses there, and they’re all free. You can also find both self-paced and live trainings there as well. And some of the courses, if you complete them, even earn you digital badges. So that’s an option. And I will say that we are currently working on creating content for Core and Enterprise, so that should be out shortly.
ANAIS DOTIS-GEORGIOU: 44:36
And then, yeah, that’s pretty much all. I want to thank you so much for joining. And I also want to check the Q&A to see if I can answer any of your questions there. So, Brenton asks, “Why wouldn’t I just transmit from the edge through a client lib or Telegraf instead of replication?” So, you could. But if something happens and your client library is down, or your destination is down and offline for whatever reason, then you still want to be able to store that data and have a buffer of it so that you can go ahead and write that data when the destination becomes available again. So that’s one reason, and that’s one thing that storing data at the edge would allow you to do. But again, yeah, it’s up to you. You might not need that. And if that’s the case, then you can use a client library or Telegraf.
ANAIS DOTIS-GEORGIOU: 45:40
And then someone else asks, “Regarding InfluxDB administration, in my limited experience, it has not been simple and easy to learn and manage storage buckets, prune time series, etc. What plans are there to make it more friendly or straightforward to manage?” So, if you’re coming from InfluxDB v2, I absolutely agree with that. There were a lot of necessary steps that people had to do to even do things like optimize query performance, such as using tasks to write some data from a particular bucket to another bucket so that it’s more easily accessible. And that was all pretty challenging to manage, which is a big reason why InfluxDB 3 was a complete rewrite of the storage engine: to address some of those issues, specifically around cardinality concerns, so that you don’t have any of those with InfluxDB v3. But also having things like the distinct value cache and last value cache to make sure that your queries are always efficient. So, hopefully, that addresses a lot of the concerns that you just mentioned.
ANAIS DOTIS-GEORGIOU: 46:50
I don’t have any projects or references for financial services and high-frequency trading use cases, but that is something that is on our mind as the DevRel team, especially with Core, which is optimized for querying shorter time ranges. You can query any range at any point in history, but the queries are optimized for a shorter window. And so that’s a perfect use case for that, and so we are looking to do more of that. And yes, to a small question: you absolutely will get access to this presentation. A recording of it should be made available and sent to you at the end; once this is over, you should see it soon. And then someone asks, “That would be perhaps shared in another webinar which focuses on that domain industry?” I’m not sure what that’s relating to, so if you want to follow up on that question, that would be helpful. Yeah, the MING stack is very popular. There’s a bunch of blogs and examples for leveraging the MING stack. Yeah, it’s a go-to.
ANAIS DOTIS-GEORGIOU: 47:59
So, would I recommend other stacks? I think it just depends on your use case. Grafana is probably the most popular tool for visualizing any data from InfluxDB, but it’s entirely up to you. I mean, I can recommend Quix and HiveMQ. Those are also really great options. Does this database require annotated CSV? No, it doesn’t, just line protocol. And then someone else asks, “If I want to upload five gigabytes of data, it just responds to nothing.” I’m not sure what that means, but that might be a question for the community forums or Slack. And I will share the link for the community forums.
ANAIS DOTIS-GEORGIOU: 49:04
And someone else asked, “How do you deal with long-term archival of data and still load it for analytical use cases?” So, I think especially Enterprise, Serverless, Dedicated, etc., are well suited for long-term archival of data. And in all the Cloud products, Serverless and Dedicated, that means that you probably don’t want to set a retention policy, to make sure that that data doesn’t expire if it is truly very long-term, or you could set a really long retention policy, like a year. And then, yeah, I don’t see why you can’t use it with any recent data as well in combination. It just kind of depends on your SQL queries, or the Python that you want to write when you’re querying the database with the client library, and how you want to leverage that. I don’t know if that answers your question; it felt a little bit big picture, so I don’t know if there’s a specific question there.
ANAIS DOTIS-GEORGIOU: 50:09
“What is the best approach for integrating machine learning with IoT data? What are the best practices for managing and archiving historical IoT data efficiently?” So, I think some of the best practices within InfluxDB would just be, yeah, making sure that you’re setting the retention policy that you need, and then also considering downsampling your data or creating materialized views of your data, both of which you should also be able to do with the processing engine here shortly. That would just involve making sure that you’re retaining your historical data at the resolution that is appropriate because, oftentimes, you only want to keep high-precision data for a small amount of time, and then you want to reduce that to a lower-precision aggregate for your historical data. And then, what is the best approach for integrating machine learning with IoT data? It really depends on your use case. So, in that Saving the Holidays repo, you can find an example of training a Hugging Face auto-encoder, then calling that trained model from Hugging Face and including those forecasts as a part of your Quix pipeline. So that’s one way.
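Picking up on the downsampling point above, here is a hedged sketch of the kind of query that builds a lower-precision historical view. date_bin is the DataFusion function for time bucketing; the table, tag, and connection details are illustrative:

```python
from influxdb_client_3 import InfluxDBClient3  # assumes the influxdb3-python client

# Placeholder connection details.
client = InfluxDBClient3(host="http://localhost:8181",
                         token="my-token",
                         database="sensors")

# Roll raw readings up into hourly averages: the lower-precision aggregate
# you would keep for long-term history once the raw data expires.
sql = """
    SELECT
      date_bin(INTERVAL '1 hour', time) AS hour,
      sensor_id,
      avg(temperature) AS avg_temperature
    FROM airSensors
    GROUP BY date_bin(INTERVAL '1 hour', time), sensor_id
    ORDER BY hour
"""
hourly = client.query(query=sql, language="sql").to_pandas()
```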
ANAIS DOTIS-GEORGIOU: 51:27
And I also mentioned the brewing example. Let me find that repo. Then here’s an example of doing the micro-batch forecasting before data even hits InfluxDB. And that was appropriate here in this use case because a simple statistical forecasting method like triple exponential smoothing was yielding good enough temperature forecasts. And that method only requires, I think, a minimum of 10 data points to generate a forecast. So, it just depends, I think, on what your use case is and what you’re trying to accomplish with the forecasting or anomaly detection that you’re doing. And then someone else asks– or the same person asks, “How can we ensure data integrity and prevent data loss in case of severe downtime and connectivity issues?” So again, I think that’s a benefit to having InfluxDB at the edge so that you can make sure that you are storing that data at the edge in case there are connectivity issues. And then when services become available and they’re back online, then you have the option to take some of that data and write it to InfluxDB, your centralized hub.
ANAIS DOTIS-GEORGIOU: 52:55
“I’m new to Influx. Is it advisable for me to go straight to Influx 3?” Yes, unless you have a particular need. So InfluxDB 3 Core, the open-source version, like I said, allows you (I can’t remember the exact number) to query across something like 400 Parquet files at a time. And if you query across more Parquet files than that, then query performance starts to go down a little bit. Then there is InfluxDB 3 Enterprise, Cloud Serverless, and Cloud Dedicated, and both Cloud Serverless and Cloud Dedicated have a free trial as well. And honestly, I recommend just starting with that because it’s the easiest to set up and because you don’t have to install anything on your machine. And it just depends on what your workload looks like and what your requirements are, because the free trial does offer you a fair number of resources in cloud. If you need to store data for years, though, and you want an open-source version, then Core is not going to meet those needs. And then for Enterprise, we are going to release a community license as well, but again, that’s a middle ground in between. And I would say, yes, go straight to 3 if you have cardinality concerns and if you want to leverage interoperability with other tools, because with 3, you don’t have any cardinality concerns and the performance is much greater.
ANAIS DOTIS-GEORGIOU: 54:46
And then also, for me, I’m really excited about the Python data processing engine; just the ability to convert and transform your data with Python within InfluxDB is exciting to me. But conversely, if you’re someone that’s looking to just collect 10 series of data in one bucket for some home monitoring solution, and you want to collect that data for years, and you just want a free OSS version, then you’d probably be fine with v1. And someone else asked, “Is it better to handle outliers directly in InfluxDB using Flux or InfluxQL, or should data be cleaned externally in Python or a separate database for real-time IoT sensor data? Which approach is more efficient in terms of performance and accuracy?” That depends on what the data cleaning means, what you’re trying to do, and to what extent. Flux is the query language for v2. If you are new to InfluxDB, the one thing I would say is don’t go to v2. v2 is vastly different from v1 and v3, and it uses Flux, which is this very specific query language that requires kind of a high activation energy to learn. So, I wouldn’t recommend going there for someone who asks if they’re new. So yeah, I would say in terms of being able to handle outliers directly, my preference would be to use Python because I think it’s easier, and there are just so many more tools and so much more flexibility when it comes to handling an outlier. But if your outlier is a simple threshold, then Flux would work just fine. And so, it just kind of depends on what you’re trying to do.
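As a trivial illustration of the “simple threshold” case she mentions, done client-side in Python with made-up values:

```python
import pandas as pd

# Illustrative readings queried out of InfluxDB; 250.0 is an obvious outlier.
df = pd.DataFrame({"temperature": [72.1, 73.0, 250.0, 72.8]})

outliers = df[df["temperature"] > 120]   # flag readings above the threshold
clean = df[df["temperature"] <= 120]     # or drop them before further analysis
print(outliers)
```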
ANAIS DOTIS-GEORGIOU: 56:44
In terms of which is more efficient: with v3, Python should be just as efficient, because you can query data and leverage a Python client library in conjunction with it, or use the processing engine to identify some outliers as data is coming into the WAL, for example, which would be much more efficient. So yeah, I would recommend using Python. “What serialization specs did you explore beyond line protocol, and was MessagePack one of them?” I think that’s a question that predates me, going back to the creation of InfluxDB. I believe line protocol has existed since the very first version of InfluxDB as the ingest format. And so, I’m not quite sure what that answer is, but I will write it down. I’ll copy it real quick so that I can answer it and create a question on our forum. So, give me one second to just copy that.
ANAIS DOTIS-GEORGIOU: 58:05
“The v2 version of the database asks for annotated data, like table ID, and some terms which are compulsory. Without it, it doesn’t upload to the DB.” You’re correct. Annotated CSV is a complete headache. If you are writing CSV data to InfluxDB v2, I strongly recommend that you don’t try to create your own annotated CSV, because it’s such a pain. Instead, use the CLI to write CSV data, and that’s just plain CSV data, so you don’t have to worry about annotated CSV. Or use Telegraf to write CSV data with the file input plug-in, and then it’ll convert the data as you need. “For new projects, is it better to focus on SQL or InfluxQL?” So, SQL has more functionality than InfluxQL. InfluxQL has some great functions for exploring your schema, but I’d also say there’s more information out there on how to write SQL queries, and a lot of LLMs are very good at answering and writing SQL for you. I will say it is the DataFusion implementation of SQL, which has a couple little syntax caveats; you can always look at the DataFusion SQL docs or our docs for some of those gotchas. So, I would probably point you to SQL, just because there’s more functionality and you can get more support, generally speaking.
ANAIS DOTIS-GEORGIOU: 59:33
So someone else asked, “Does the InfluxDB 3 Core open-source version run in a single monolithic process model like the v1.11 open-source version?” Yes, it does. So yeah, if it’s recent IoT data that you’re storing and want to analyze, then I would recommend v3. And if it’s long-term and you’re wanting to query a year’s worth of data at a time, not in terms of volume, but in terms of just literal duration (because v3 can handle more volume than InfluxDB v1), then Core specifically has a limitation, and I’ll just share that with you so you know what I’m talking about. So, this is the blog post that addresses the limitation of Core: being able to query efficiently across a certain number of Parquet files. So, when you’re trying to figure out if you want to use v3 Core or v1 open source, that should help with that. Oh, and it looks like, I’m so sorry I didn’t notice, we’ve just gone over time, so I will let you all go. Thank you so much for joining, and I really appreciate all your questions. Yeah. Thank you so much. Bye.
And this webinar is being recorded and will be made available later today. So, thank you so much.

Anais Dotis-Georgiou
Developer Advocate, InfluxData
Anais Dotis-Georgiou is a Developer Advocate for InfluxData with a passion for making data beautiful with the use of Data Analytics, AI, and Machine Learning. She takes the data that she collects, does a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.