Power Your Predictive Analytics with InfluxDB
Session date: Jul 25, 2023 08:00am (Pacific Time)
If you’re using InfluxDB to store and manage your time series data, you’re already off to a great start. But why stop there? In our upcoming webinar, we’ll show you how to take your data analysis to the next level by building predictive analytics using a variety of tools and techniques.
We will demonstrate how to use Quix to create custom dashboards and visualizations that allow you to monitor your data in real time. We’ll also introduce you to Hugging Face, a powerful tool for building models that can predict future trends and identify anomalies. With these tools at your disposal, you’ll be able to extract valuable insights from your data and make more informed decisions about the future. Don’t miss out on this opportunity to improve your data analysis skills and take your business to the next level!
What you will learn:
- Use InfluxDB to store and manage time series data
- Utilize Quix and Hugging Face to build models, visualize trends, and identify anomalies
- Extract valuable insights from your data
- Improve your data analysis skills to make informed decisions
Watch the Webinar
Watch the webinar “Power Your Predictive Analytics with InfluxDB” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Transcript
Here is an unedited transcript of the webinar “Power Your Predictive Analytics with InfluxDB”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw; we apologize for any transcription errors.
Speakers:
- Caitlin Croft: Director of Marketing, InfluxData
- Anais Dotis-Georgiou: Developer Advocate, InfluxData
Caitlin Croft: 00:00:00.000 Hello everyone, and welcome to today’s webinar. My name is Caitlin, and I’m joined by Anais. I’m really excited to have you guys here for this webinar, which is about predictive analytics using a time series database. Please post any questions in the Q&A which you can find at the bottom of your Zoom screen. This is being recorded and will be made available later today or tomorrow. And without further ado, I’m going to hand things off to Anais.
Anais Dotis-Georgiou: 00:00:29.378 Hello, all. Thank you, Caitlin, for the introduction and for getting everyone here and ready to learn some stuff. My name is Anais, and I just want to apologize in advance if you suddenly feel like I have disappeared, it’s just because I’m battling a cold. And I’ve just put myself on mute, so that’s all that’s happening. But I promise I’ll be back shortly. Yeah. So today we’re going to be talking about predictive analytics using a time series database. Spoiler, that database is going to be InfluxDB, specifically InfluxDB v3, because that’s the latest and greatest thing that we’re working on here at InfluxData. A little bit about me, I’m a developer advocate. If you want to connect with me on LinkedIn, I encourage you to do so. But also, like Caitlin said, please join the Slack and community.influxdata.com and reach out to me there as well. And we can talk about anything related to this webinar or really anything that you’re working on recently. I’d love to hear about the projects that you’re getting into and any problems that you’re facing.
Anais Dotis-Georgiou: 00:01:37.507 For today’s webinar, we’re going to kind of just paint a backdrop by introducing predictive analytics, and we’re going to talk about time series databases and the emergence of the time series database category and how InfluxDB fits into that. Then we’re going to talk about some customer use cases, users that use InfluxDB for predictive analytics or anomaly detection. Excuse me. And then we’re going to talk about the actual meat of this presentation, which is using Quix, Hugging Face, and InfluxDB v3 for forecasting and anomaly detection. All three of them are really cool tools, especially used together. And I was really excited to especially use Quix because it was just so easy to create streams of data that I could easily add all sorts of transformations or forecasting to. And it all uses Kafka behind the scenes. And it was just so easy to subscribe to different topics essentially and manage these streams. So, yeah, very fun. And then I will share, last but not least, some resources so that if you wanted to get started with a project like this, or one similar to it, you can do so.
Anais Dotis-Georgiou: 00:03:07.577 So first of all, what is predictive analytics? Well, it’s really just a branch of advanced analytics that typically uses historical data, time series data, some machine learning techniques or algorithms, and maybe some statistical algorithms to make predictions about future events. And the primary objective is to identify patterns and relationships in what happened in the past so you can predict what’s going to happen in the future. And it usually involves, first, data collection, then processing your data, and then building a model, training the model, and evaluating whether or not the model that you trained is performing correctly. Usually, you’ll do a train/validation split, and then you’ll deploy it and make a prediction. And it’s used in a lot of industries and applications, things like marketing, finance, healthcare, manufacturing, retail, supply chain, so much. It helps organizations make these data-driven decisions and anticipate customer behavior and identify potential risks. So, yeah, used all over the place. Used very commonly. Predictive maintenance, too, is another subcategory of predictive analytics, and one that a lot of InfluxDB users run into specifically because we have a lot of IoT customers that are monitoring maybe their manufacturing floor or some sort of manufacturing process. And they want to make sure that any components of that process that need to be replaced are replaced in a timely manner to avoid any shutdowns.
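To make the train/validation step concrete, here is a minimal sketch of a chronological split on a Pandas series; the data, index, and 80/20 ratio are hypothetical choices, not something from the webinar.

```python
import pandas as pd

# Hypothetical series: hourly sensor readings indexed by timestamp.
series = pd.Series(
    range(100),
    index=pd.date_range("2023-07-01", periods=100, freq="h"),
)

# For time series, split chronologically rather than randomly, so the model
# is always validated on data that comes after everything it trained on.
split_point = int(len(series) * 0.8)
train, validation = series.iloc[:split_point], series.iloc[split_point:]

print(f"train: {len(train)} points, validation: {len(validation)} points")
```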
Anais Dotis-Georgiou: 00:04:56.656 So now that we kind of have that basic understanding, we can move on to learning about time series databases. So I’m going to start by really just talking about kind of the age of instrumentation. So we’re currently living in it, and we now have the ability to measure the change of both our systems and our physical world or environment over time. And we are increasingly doing so. In the IoT space, or sensors in the physical world, we measure things like pressure, temperature, humidity, concentration, light, flow rate, etc. So all of those sensors are collecting time series data. And then in the DevOps world, where we’re performing DevOps monitoring, we’re also gathering time series data, whether those are metrics like CPU, disk, and memory when we’re monitoring Docker, Kubernetes, or third-party endpoints. We could also be performing CI/CD monitoring, including things like deployment frequency, change volume, and bug fixes. And basically, all of these spaces, whether it’s IoT or DevOps, are generating a huge amount of time series data. And so time series data, just to reiterate, really its one defining attribute is that it has a timestamp. But we like to categorize time series data as two types of data: a metric and an event. And so if we looked in the healthcare space, for example, a regularly gathered heart rate would be considered a metric, and something like an AFib episode or some cardiovascular incident would be considered an event. It also usually comes in huge volumes.
Anais Dotis-Georgiou: 00:06:43.716 Time series databases should be able to — not all of them do, but InfluxDB does — write data at a nanosecond precision. And we see people writing huge volumes of time series data so that they can really understand their environment better. A great example of this is the particle accelerator in Switzerland, the CERN particle accelerator. Excuse me one second. They use InfluxDB to monitor all of their experimentation. And as you might imagine, if you are studying subatomic particles, that requires you to monitor data at an extremely fine precision, and as a result, you get a lot of data. So it also kind of goes to follow that the data is real time and time sensitive. And I mentioned this a little bit already, but essentially, we really do see time series in every application. So kind of in the first category, consumer and industrial IoT, time series data exists in things like manufacturing and industrial platforms. It exists in renewable and alternative energy systems, where people are monitoring their solar panels or wind farms, and in fleet management and telematics. But another source is also software infrastructure. Like we mentioned already, you’re monitoring developer tools and APIs, performing DevOps, monitoring Kubernetes, etc. Then we also see time series in real-time applications. Gaming applications are a huge source of time series data, where people are tracking player activity. And we also see time series data in fintech applications and network monitoring. So, just to summarize, kind of the main point here is that time series data really exists kind of everywhere.
Anais Dotis-Georgiou: 00:08:45.859 And as a result, it goes to follow that since all this data was being generated, a new category of database soon emerged to accommodate this new type of data that everyone started collecting so much of. And so I just wanted to kind of go through the history of databases and their categories and the emergence of them. So we’re all familiar with relational databases, which are optimized for things like orders and customers and records. But then we saw the emergence of the document store, and you’re probably familiar with something like MongoDB, for example, and they’re really high throughput and optimized for documents. And then we see search engines. And a search engine database is a type of non-relational database, and it’s dedicated to the search of data. And search engine databases are optimized for dealing with data that’s usually long, semi-structured, or unstructured. So things like logs, geodata, etc. And finally, we see the emergence of the time series database. And a time series database is optimized for data that is timestamped, or time series data. And time series databases are able to handle such high volumes of time series data because — for instance, for InfluxDB, especially v3 — certain design assumptions and considerations were made that allow it to accommodate really high-volume use cases, especially for time series data. One of those is that the data representation, both on disk and in memory, is columnar. And the advantage there is that, imagine for a second that we’re monitoring the temperature of this room. If we were monitoring it every 10 minutes, most likely, if you’re in an air-conditioned room or a temperature-controlled room, the temperature is not going to change much. It’s probably going to be 72 degrees, or whatever temperature you’ve set your room at, for a long time.
Anais Dotis-Georgiou: 00:10:55.820 So that means that when you’re storing that value in a time series database like InfluxDB, specifically InfluxDB v3, which is columnar based, you’re going to get a lot of the same values in that column. And so if you are storing something as a column as opposed to in a row, that means that you have an advantage for really cheap compression and an ability to store a lot of that data as metadata. So you could store the count of the number of 72 values you have, rather than each individual 72 value. It also means that you have less work when you want to serialize and deserialize those values if you are transferring the data across a network interface because, for similar reasons, you can summarize a lot of that data through cheap compression and metadata. And if you have stuff in a row as opposed to a column, and you want to also include all the metadata as a part of those rows, like which thermometer the data is coming from, which room you’re monitoring, etc., then every time you want to transport that data, you have to serialize it and deserialize it. And that’s pretty expensive. So that’s another reason why InfluxDB is so efficient and able to accommodate really high-volume use cases. I should mention, it’s built on top of the Apache ecosystem, and I’ll get into that a little bit. But just to summarize here — I think this is kind of a duplicate slide, so I apologize — the point is that the data is timestamped. It’s generated either as regular data, which is metrics, or irregular data, which is events. It comes in huge volumes, and it’s real time and time sensitive.
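As a toy illustration of why columnar storage compresses so cheaply, here is a minimal sketch of run-length encoding a column of repeated temperature readings; real columnar formats like Parquet use more sophisticated encodings, and the data here is made up.

```python
from itertools import groupby

# A columnar slice of readings from a temperature-controlled room:
# mostly the same value, which is typical of slowly changing time series.
temperatures = [72, 72, 72, 72, 72, 73, 73, 72, 72, 72]

# Run-length encode the column: store (value, count) pairs instead of
# every individual reading.
encoded = [(value, sum(1 for _ in run)) for value, run in groupby(temperatures)]

print(encoded)  # [(72, 5), (73, 2), (72, 3)]
```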
Anais Dotis-Georgiou: 00:12:46.455 And InfluxDB is really three things wrapped into one. So we have a powerful API and toolset, especially with InfluxDB v3. Because it’s built on the Apache ecosystem, it offers interoperability with a lot of other tools as well, like Power BI, Grafana, Superset, etc. And also, it allows for the transport of Pandas data frames pretty easily, which means that you can integrate it with a lot of other machine learning libraries and tools in Python. Python is kind of the leading language for time series machine learning libraries. So that really helps when you are working on projects like that. And we do use Pandas for this predictive analytics example as well. One second, excuse me. But then at its core, InfluxDB is also primarily a time series engine. And so it’s where we can store our real-time data and also query it in either InfluxQL or SQL. And then last but not least, we also like to think of InfluxDB as being more than just the software, also including the massive community and ecosystem that is a part of it. So yeah, we have a huge community that is very helpful, and we’re very appreciative of them, as they not only ask great questions and make really cool things with InfluxDB but also help each other succeed. And that’s really invaluable.
Anais Dotis-Georgiou: 00:14:30.844 So here’s our reference architecture, just so that we’re familiar with it. So essentially, InfluxDB also includes Telegraf, which I haven’t really mentioned, but Telegraf is a collection agent. It’s also open source, it’s plugin driven, and it’s downloadable as a single binary, as is InfluxDB open source. And you can use it to write data to InfluxDB 3.0 from a wide variety of different sources. But you can also use client libraries to write data to InfluxDB 3.0. You can also use Arrow Flight directly or any of the Flight SQL clients as well. So the Flight SQL clients kind of wrap a lot of the Arrow Flight functionality, and you can use the Flight SQL clients, but I actually find it easier to use Arrow Flight directly. And it also gives you more fine-grained control in some ways. But yeah, because InfluxDB 3.0 is built on the Apache ecosystem, you can either use our client libraries, which wrap Arrow Flight under the hood, or you can use it yourself. And then you can easily pull data out of InfluxDB and do all sorts of data transformation with SQL or InfluxQL within InfluxDB, or pull it out and do any sort of data analytics and visualization that you want using machine learning analytics tools and BI tools, which we’ll be doing today.
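For a sense of what the client-library path looks like, here is a minimal sketch using the influxdb3-python client to query into a Pandas data frame; the host, token, database, and query are placeholders.

```python
from influxdb_client_3 import InfluxDBClient3

# Placeholder credentials: substitute your own InfluxDB Cloud host,
# token, and database.
client = InfluxDBClient3(
    host="us-east-1-1.aws.cloud2.influxdata.com",
    token="my-token",
    database="my-database",
)

# The client speaks Arrow Flight under the hood and returns a PyArrow
# table, which converts cheaply to a Pandas DataFrame.
table = client.query(
    query="SELECT * FROM machine_data WHERE time > now() - INTERVAL '5 minutes'",
    language="sql",
)
df = table.to_pandas()
print(df.head())
```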
Anais Dotis-Georgiou: 00:16:10.585 So I’ve already talked about this a little bit, but I just want to make sure that we understand what InfluxDB v3 is all about. So it’s the new InfluxDB engine. And it’s built on top of Rust, Apache Arrow, Arrow Flight, DataFusion, and Parquet. And I just wanted to introduce these technologies in case we’re not already familiar with them, so that we can understand how they contribute to the performance of InfluxDB and why we made this move to completely re-gut and rebuild the storage engine. And the first is Rust. So Rust is a programming language, if you’re not already familiar with it, that is very performant, and it offers really fine-grained memory management. And it was selected because, in the past, a criticism of InfluxDB was the inability to manage memory. And so we will be providing versions of InfluxDB with operator memory control. And that’s been a huge long-standing request. And Rust really helps contribute to that. It also helps contribute to handling much higher volumes of data than previous versions of InfluxDB because of this fine-grained memory management. And then Apache Arrow is a framework that’s used for defining in-memory columnar data. So you get all of the advantages of having your time series data represented in a columnar fashion that I already kind of spoke about before. Apache Arrow was also developed in part by Wes McKinney because he wanted to be able to transport Pandas data frames between various sources and various tools. And Arrow was written in part to solve that challenge. The idea being that as more tools use Arrow as their standard way of defining in-memory columnar data, you will be able to transport huge Pandas data frames between them. And so that just opens the door for so much more analytics and machine learning and just general control over your data pipelines.
Anais Dotis-Georgiou: 00:18:21.989 And then it’s also built on Parquet. Parquet is the column-oriented durable file format. It’s something like 17 times more efficient and compresses better than CSV, as an example. And we can’t right now, but eventually, we will be able to pull Parquet files directly from InfluxDB. So if you really do need to do that and then store them in a variety of other data lakes or data warehousing tools, you could do that, because a lot of them take Parquet. And then Arrow Flight is a framework that simplifies high-performance transport of these large data sets, of Arrow tables for example, over network interfaces. And last but not least, DataFusion is a query execution framework that’s also written in Rust and that uses Apache Arrow as its in-memory format. And it enables us to query InfluxDB v3 with SQL. But there’s also a Python DataFusion client API. So eventually, we hope that people will be able to query InfluxDB directly with Python as well, which I’m personally really excited about, because imagine how cool it would be if you could just perform all of your data analytics and data transformation within InfluxDB directly with Pandas, for example. But that’s just me. Other people would be like, no, I just want SQL. So I get that too.
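To see the file-format difference for yourself, here is a minimal sketch that writes the same repetitive table as CSV and as Parquet with PyArrow and compares file sizes; the data is made up and the exact ratio will vary.

```python
import os

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

# A hypothetical table of highly repetitive sensor data.
table = pa.table({
    "machine_id": ["machine1"] * 100_000,
    "temperature": [72.0] * 100_000,
})

# Write the same data in both formats; Parquet's columnar encodings
# compress repetitive columns far better than row-oriented CSV text.
csv.write_csv(table, "machine_data.csv")
pq.write_table(table, "machine_data.parquet")

print(os.path.getsize("machine_data.csv"), os.path.getsize("machine_data.parquet"))
```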
Anais Dotis-Georgiou: 00:20:05.337 So yeah, as of today, InfluxDB 3.0 now serves as a foundation for all InfluxDB products. And InfluxDB 3.0 is currently available as InfluxDB Cloud Serverless, which is our fully managed multi-tenant database, and InfluxDB Cloud Dedicated, which is a single-tenant version of InfluxDB. And then coming soon is Clustered, which will be the evolution of InfluxDB Enterprise, and 3.0 Edge, which is a single-node instance for local and edge deployments, and that will be available later this year. And so yeah, just to highlight, this is what the new Data Explorer looks like, and you can now query in SQL and also InfluxQL. But here’s an example of querying in SQL. And if you’re an old InfluxDB user, one thing you’ll notice is that there’s no longer dashboarding. And that’s because we’re really trying to push people to use dashboarding-specific tools and take advantage of the interoperability that InfluxDB 3.0 offers, rather than trying to reinvent the wheel and offer dashboarding ourselves.
Anais Dotis-Georgiou: 00:21:10.363 So now I wanted to talk about some use cases that take advantage of InfluxDB for things like predictive analytics or anomaly detection. So first is Bboxx. Bboxx is a really cool company that develops and manufactures products to provide affordable clean solar energy to off-grid communities. And specifically, they focus on communities in the developing world. And they’re able to provide over 350,000 people across 35 different countries with electricity. And they monitor all of their 85,000 solar rooftop units with InfluxDB. And they’re able to provide insights into customer usage patterns and perform anomaly detection with those insights. I don’t actually know how you say this company’s name for sure. I want to say Algist Bruggeman. They produce yeast for large-scale bakeries and home bakers. And they lacked insight into their fermentation process. And so then they implemented a variety of sensors. And originally, that data collection was manual. But eventually, they built a data historian on top of InfluxDB that has helped them to collect data about their yeast production, enabling them to gain more insight into the process and provide predictive maintenance for that yeast collection process.
Anais Dotis-Georgiou: 00:22:40.206 And then BAI Communications is a world leader in shared communications infrastructure. And what they do is — they provide the infrastructure for T-Connect, which is the wireless network used by the Toronto Transit Commission. And it’s used primarily by rail operators and for platform overcrowding and dealing with any safety issues associated with that. So they wanted to design a way to use Wi-Fi data to kind of determine overcrowding and look at safety issues. And they use InfluxDB in conjunction to provide real-time observability into passenger volume for the entire Toronto subway system and sort of predict whether or not there’s going to be any events that would lead to unsafe crowding. And then the last company that I really wanted to touch upon, although there’s plenty, is Bevi. So if you’re not familiar with it already, if you don’t have one in your office — well, you might not even go to an office anymore. They make smart office water coolers that provide really tasty sparkling water that you can flavor yourself. It’s actually pretty cool. And they wanted to be able to connect their smart water coolers to the internet to be able to reinvent their supply chain. And any part of good supply chain is being able to predict demand. So they use InfluxData to adopt a distributed supply chain and thereby help eliminate the use of plastic water bottles so that people can just go into the office and use Bevi instead. So those are some of the use cases that InfluxDB users are using for predictive analytics, mostly predictive maintenance, but also some anomaly detection.
Anais Dotis-Georgiou: 00:24:43.405 And so now, without further ado, let’s actually get into kind of the example today of using Quix, Hugging Face, and InfluxDB v3 for forecasting and anomaly detection. So first I want to talk about Quix. What is Quix? Well, in one sentence, Quix is a platform that allows you to deploy streaming pipelines for analytics and machine learning. But to dig a little deeper, Quix also provides an online IDE and an open-source stream processing library called Quix Streams. Quix Streams is just the client library that you use in your Python or C# code. Quix is also built on top of a message broker, specifically Kafka, rather than built on top of a database, so that it can handle real-time applications and real-time streaming and also scale easily. And it also treats Python developers as first-class citizens. And you can really easily push Pandas data frames from one process to another. So that makes working with real-time data really easy because, yeah, you’re probably going to be working in Pandas already if you are using Python.
Anais Dotis-Georgiou: 00:26:09.698 And then Hugging Face. Hugging Face is a machine learning platform that enables users to train, build, host, and deploy open-source machine learning models, as well as data sets. So if you’re looking to do a machine learning project and you want a data set, that’s another great place to look. It’s well known for its contributions specifically in NLP, but there are tons of contributions in AI and machine learning in general as well. They offer what they call Transformers, or a Transformers library. And this provides access to a variety of pre-trained NLP models as well as other models. So that’s one of the big advantages, that you can just have access to these trained models and essentially import them from their Transformers library and then use them directly. So one of the key strengths really is that it kind of democratizes access to trained models. And this model repository they like to call the Model Hub. And you can also train your own model and push it to their Model Hub. So yeah, that’s pretty fun. And that’s especially fun when we use it in conjunction with Quix because you can actually turn a model into kind of an environment variable and then select from a variety of models that you have pushed to Quix from Hugging Face. So you can easily swap out models and try different ones in the stream pipeline that you’ve created, to kind of compare and contrast for fast experimentation.
Anais Dotis-Georgiou: 00:28:02.044 So the data set that we used for this example is from this repo. Hold on, I’ll share that in the chat. Oh, someone asked, what is the name of the service before Quix? Hugging Face, the emoji. That’s so funny that you can’t see me on camera right now because I look miserable, and I’m sick, but I just did the little — I just did the little huggy face emoji. I put my hands up, which was funny. But anyways, so I’m sharing the script that generates this dummy data set — hold on, actually, let me see what I sent; hold on, give me one second. The first link I shared is also an example tutorial that shows you how to build a simple task engine on the same data set using Arrow, Docker, and the Anomaly Detection Toolkit, which is another Python library. So that’s just another example. Oh, thank you, Caitlin. She just shared those. And yeah, and then the second link I shared is the actual Python script that generates this dummy data. And this generated dummy data contains machine data with values like temperature, load, and vibration for a variety of Machine IDs. It’s fabricated so that we could actually induce anomalies when needed to test the anomaly detection algorithm. And this is kind of what the data looks like coming out of the InfluxDB query service. And this is a screenshot from the Quix UI. So you can see this is part of the stream processing. And that’s another thing, too, is that you can easily see logs. And another fun thing about Quix is that they actually store, I believe, their logs and a lot of their performance metrics for your stream processing in InfluxDB as well. So if you look under the hood, that’s what they’re using. So that’s kind of fun. Excuse me. One second.
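For a sense of what such a generator might look like, here is a minimal sketch that fabricates machine data with induced vibration anomalies; the column names, distributions, and anomaly logic are assumptions for illustration, not the actual script from the repo.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

def generate_machine_data(machine_id: str, periods: int = 300) -> pd.DataFrame:
    """Fabricate readings for one machine, with a few induced anomalies."""
    df = pd.DataFrame({
        "time": pd.date_range("2023-07-25", periods=periods, freq="10s"),
        "machineID": machine_id,
        "temperature": rng.normal(72.0, 0.5, periods),
        "load": rng.normal(50.0, 5.0, periods),
        "vibration": rng.normal(0.5, 0.05, periods),
    })
    # Induce a handful of vibration spikes to exercise the anomaly detector.
    spikes = rng.choice(periods, size=5, replace=False)
    df.loc[spikes, "vibration"] *= 5
    return df

data = pd.concat(generate_machine_data(m) for m in ["machine1", "machine2"])
```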
Anais Dotis-Georgiou: 00:30:25.288 Yeah. So basically our data has a measurement called machine data, a timestamp, vibration, a Machine ID, and also some other fields. But I think we mostly focus on vibration for today. So this is what the full Quix pipeline looks like. As previously mentioned, it enables us to deploy streaming pipelines for analytics and machine learning. And the image here depicts our complete pipeline. One second. So essentially, our workspace pipeline contains the following services. It contains a source project, and that’s called InfluxDB-Query. And this service project is responsible for querying InfluxDB v3 with InfluxQL and converting the output to a Pandas data frame, so that the transformation blocks, the event detection and forecast transformation projects, can actually find anomalies and generate forecasts, respectively. And then last but not least, we have two write services. So these service projects are responsible for writing data back to InfluxDB Cloud. We could just use one and write all of our data back to one InfluxDB instance. Here, we decided to create two of them to write to two separate instances. Why? Just because we could. Just to show we can, and so that we can do these writes in parallel, but we could also consolidate and do them together. So it’s kind of up to you.
Anais Dotis-Georgiou: 00:32:05.972 And the other really cool thing about Quix is that when you say, hey, I want to create a new service that’s going to run on a schedule and continue to execute a job, if you click the button in the top right that says Add New, because you want to add a new service project, for example, you’ll come here and it’ll say select the sample. And you can select from a variety of pre-built samples to either create a service, which runs on a user-defined schedule, or a new job, which will only run once, by selecting from these common samples. And they all contain all of the boilerplate required to stream the data from a previous service or job and pass the inputs into the correct outputs. And additionally, you can easily stream Pandas data frames between the different projects, which removes any data conversion effort, any serialization or deserialization back into Pandas data frames, so that you can further work with things. So that makes it super easy. And not only do the code examples contain all the boilerplate, but they literally have comments that are like, put your transformation work here, so you don’t even have to think about anything. Yeah, super helpful.
Anais Dotis-Georgiou: 00:33:17.073 So let’s talk about the Source project a little bit. Specifically, the very first one on the left, which is Influx-Query. So this is kind of the only — besides the boilerplate, this is kind of the only important stuff that I contributed to it. The first part is to import the InfluxDB 3 client. The second is to include all of our authentication and credentials for the client so that we can instantiate it. And we include those as environment variables, which you can also configure through the Quix UI, so that if you wanted to easily switch from one InfluxDB instance source to another one, you could do so without changing any code. So that’s fun. And then the other thing that I added was essentially this function here to actually get data from InfluxDB Cloud 3.0. This again is using the InfluxDB client library. So once we’ve instantiated the client, we can then create a query. And basically, we create an InfluxQL query to show the tag values for Machine ID so that we can iterate through all of the Machine IDs and convert all the data from each Machine ID into a Pandas data frame. So that’s the first InfluxQL query. Then we have this for loop, for machine in machines, after we’ve converted the output of that InfluxQL query to a list. And then we have our second query, where we select vibration, because that’s what we’ll be focusing on today for anomalies, and Machine ID from machine data for the past five minutes. I think this service is running every five minutes. And we iterate through all those Machine IDs from that list and then convert each one’s result to a Pandas data frame. And so, yeah, that’s essentially all of the code that’s related to the client library that’s Influx specific. The rest of the code for that service is really just created from the boilerplate that Quix provided. So that’s why I don’t really go into it.
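Based on that description, here is a minimal sketch of what the Influx-specific part of the source service might look like with the influxdb3-python client; the measurement name, tag key, and environment variable names are placeholders reconstructed from the walkthrough, not the exact webinar code.

```python
import os

from influxdb_client_3 import InfluxDBClient3

# Credentials come from environment variables configured in the Quix UI
# (variable names here are placeholders).
client = InfluxDBClient3(
    host=os.environ["INFLUXDB_HOST"],
    token=os.environ["INFLUXDB_TOKEN"],
    database=os.environ["INFLUXDB_DATABASE"],
)

# First InfluxQL query: list all Machine IDs so we can iterate over them.
machines = client.query(
    query='SHOW TAG VALUES FROM "machine_data" WITH KEY = "machineID"',
    language="influxql",
).to_pandas()["value"].tolist()

# Second query, once per machine: the last five minutes of vibration data,
# converted to a Pandas DataFrame for the downstream transformations.
for machine in machines:
    df = client.query(
        query=(
            "SELECT vibration, machineID FROM machine_data "
            f"WHERE machineID = '{machine}' AND time > now() - 5m"
        ),
        language="influxql",
    ).to_pandas()
```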
Anais Dotis-Georgiou: 00:35:21.871 And then let’s talk about the event detection project. So for this kind of tutorial, we used Keras autoencoders. I’ll share a link. Hold on. Oh, sorry. So if you want to learn more about that — so we use them to create and train an anomaly detection model. So for those of you who aren’t familiar, autoencoders are a type of artificial neural network that are used for learning efficient codings of input data. In anomaly detection, the autoencoders train on normal data and learn to reproduce it as closely as possible. And when presented with new data, the autoencoder then attempts to reconstruct it using the patterns learned from the normal data. And if the reconstruction error, i.e., the difference between the original input and the autoencoder’s output, is significantly high, then the model classifies the new data point as an anomaly because it significantly deviates from the normal data. So you could, in theory, also just use autoencoders for forecasting as well. Although I will mention that this assumes that the data follows pretty regular patterns, so it would likely overfit your data and not generate great forecasts in general. So it really just depends on what type of data you’re using. So you can follow this QR code to see where this autoencoder was actually trained. Let me actually grab you that link real quick, too. Give me one second. [silence]
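To make the reconstruction-error idea concrete, here is a minimal Keras sketch, assuming windows of scaled vibration readings; the architecture, training data, and 3x threshold are illustrative choices, not the model trained for the webinar.

```python
import numpy as np
from tensorflow import keras

# Hypothetical training data: windows of normal vibration readings,
# shaped (samples, window_size) and scaled to [0, 1].
normal_windows = np.random.rand(1000, 30).astype("float32")

# A small dense autoencoder: compress each window, then reconstruct it.
autoencoder = keras.Sequential([
    keras.layers.Input(shape=(30,)),
    keras.layers.Dense(8, activation="relu"),      # encoder
    keras.layers.Dense(30, activation="sigmoid"),  # decoder
])
autoencoder.compile(optimizer="adam", loss="mae")
autoencoder.fit(normal_windows, normal_windows, epochs=10, verbose=0)

# Baseline reconstruction error on normal data; new windows whose error
# is far above this baseline get flagged as anomalies.
baseline = np.mean(np.abs(normal_windows - autoencoder.predict(normal_windows)))

def is_anomaly(window: np.ndarray) -> bool:
    error = np.mean(np.abs(window - autoencoder.predict(window[None, :])[0]))
    return error > 3 * baseline
```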
Anais Dotis-Georgiou: 00:37:25.412 So here’s the training for it. Once we trained that autoencoder on the machine data, it was also pushed to the Hugging Face Model Hub so that we could pull it down when we use it in the event detection project. And I want to just take a second to thank Jay, Jay Clifford, who’s another developer advocate at Influx. He’s amazing. He did this portion of this demo and did this training himself. So, yeah, thanks to him. And also, it’s kind of worth mentioning too. I don’t recommend this as a best practice, but if you are not specifically a data science engineer and you’re trying to figure out how to use autoencoders and how to tune the parameters, he actually relied heavily on ChatGPT for it and found that it yielded pretty decent results for our use case. So if you’re not trying to save lives necessarily, and this is just for a home project, I think that’s a great thing to take advantage of. Yeah. So I just wanted to mention too, the model itself is a variable. I think I said this earlier. So you can easily swap models in Hugging Face. So if you pushed a bunch of trained models to Hugging Face, you include those in your code as an environment variable, and then you can easily swap out which one you’re using. So for this instance, he called this model that he trained Vibration Autoencoder, but he could have trained it on a different data set or used slightly different parameters during tuning. And maybe he’d have a second model called Vibration Autoencoder Two, for example. And so you could easily swap them out as needed. And so this allows you to separate your model tuning and training workflow from your pipeline deployment, which I find really helpful. So this is kind of what you do. So from Hugging Face, you import your pre-trained Keras model, and the model name you pass in is just an environment variable called model that you can swap in and out.
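Here is a minimal sketch of that pattern, pulling a trained Keras model down from the Hugging Face Model Hub with the repo name supplied by an environment variable; the variable name and repo path are placeholder assumptions.

```python
import os

from huggingface_hub import from_pretrained_keras

# The model repo is configured as an environment variable (e.g. in the
# Quix UI), so swapping models requires no code change. A placeholder
# value might be "my-org/vibration-autoencoder".
model = from_pretrained_keras(os.environ["model"])

# The downloaded model is a regular Keras model, ready for inference on
# new windows of vibration data (see the autoencoder sketch above).
```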
Anais Dotis-Georgiou: 00:39:49.416 And so now let’s talk about the forecast transformation project. So for this, we basically just used Holt-Winters, which is a statistical method for forecasting time series data. It’s really lightweight and very efficient and works pretty well. So that’s why we used it. And essentially, this was kind of the only part of the code that I really added to the boilerplate after I selected the starter transformation, which was the basic sample that I used. So I previewed that code, and it had a section that said, print data frame, and right here you can transform your data frame. And that’s what I selected. And then I basically added this code. And that was pretty much everything that I had to do. So, yeah, again, really excited about this tool and the integration of both together. And then last but not least, we have the write project, where we actually write data, both the anomaly data and the forecast, to InfluxDB. And we wrote them to two separate instances just because. But we could have used one write project to write all of the data to InfluxDB. And we used the InfluxDB v3 Python client library. Again, you can write Pandas data frames directly, so you don’t have to transform your data to any other ingest format in order to write that data back to InfluxDB, which again, makes this whole process so much easier. So, again, we instantiate our client with our environment variables, and again, you can configure those environment variables for each project in the pipeline. So that’s another benefit to using these: we can use the same code for both of the write projects but write to different instances by configuring the environment variables in the Quix UI.
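As a sketch of the forecasting step, here is a minimal Holt-Winters example using statsmodels on a vibration series; the trend/seasonality settings and 30-point horizon are illustrative assumptions, not the webinar's exact code.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Assume df is a DataFrame from the query service with "time" and
# "vibration" columns; index it by time to get a forecastable series.
series = df.set_index("time")["vibration"]

# Fit Holt-Winters (exponential smoothing). The right trend/seasonal
# settings depend on your data; an additive trend is a safe start.
fit = ExponentialSmoothing(series, trend="add", seasonal=None).fit()

# Forecast the next 30 points and package them for the write service.
forecast = (
    fit.forecast(30)
    .rename("forecast")
    .to_frame()
    .reset_index()
    .rename(columns={"index": "time"})
)
```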
Anais Dotis-Georgiou: 00:41:58.461 And to write a data frame, essentially, all you have to do is set the time column as the index with the .set_index method. And then you just use the write method on the client to pass the data frame, specify the data frame measurement name, and specify any data frame tag columns that you might have as a part of that data frame. Alternatively, if you don’t set the timestamp as an index, you can use the data frame timestamp column parameter to specify directly which column is the timestamp column. And the timestamp column should be in Unix precision, I believe. But I think actually now the client handles more timestamp — yeah, more time objects than just Unix precision or Unix timestamps. So I would double-check that. Someone asks, what is the equivalent of OSS in v3? There isn’t one. That’s coming. So unfortunately, you have to use InfluxDB Cloud Serverless or Dedicated right now, but that is coming, hopefully sooner than later. Yes, of course. And what else did I want to mention about this? I don’t think I really wanted to mention anything else about this. If I did, I’ve forgotten, so maybe I’ll come back to it.
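Here is a minimal sketch of that write step with the influxdb3-python client; the measurement and tag names are placeholders, and the exact keyword arguments are worth double-checking against the client docs.

```python
import pandas as pd
from influxdb_client_3 import InfluxDBClient3

client = InfluxDBClient3(host="...", token="...", database="...")  # placeholders

# A hypothetical forecast DataFrame produced by the transformation step.
forecast_df = pd.DataFrame({
    "time": pd.date_range("2023-07-25", periods=3, freq="10s"),
    "machineID": "machine1",
    "forecast": [0.51, 0.52, 0.50],
})

# Set the timestamp column as the index, then write the DataFrame
# directly; no conversion to another ingest format is needed.
client.write(
    record=forecast_df.set_index("time"),
    data_frame_measurement_name="machine_data_forecast",  # placeholder name
    data_frame_tag_columns=["machineID"],
)
```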
Anais Dotis-Georgiou: 00:43:39.311 Without further ado, let’s talk about some resources so that you can get started yourself. So the first, go to this URL, or influxdb-engine/beta. I guess this is a little bit old because it’s already released, so I apologize. This is an old slide, but essentially, you can go to influxdata.com and learn more about v3. Please join us on our community Slack and ask any questions you have about anything we learned today, or about InfluxDB v3 in general. If you want to join the influxdb_iox channel, that’s where we talk about v3 specifically. So the engineers internally call v3 IOx because IOx is iron oxide, which is also rust, and since everything is written in Rust, hopefully, that explains that. So, yeah, if you want to ask v3-specific questions, go there. Also, I encourage you to go to github.com/InfluxCommunity. That’s where the DevRel team and various community members have contributed a bunch of projects related to using InfluxDB with a bunch of other tools. So if you want to get started using InfluxDB in conjunction with something else, or you just want to see examples of InfluxDB being used for a variety of projects, that’s a great place to head. It’s also where all of the client libraries are maintained. And so if you have any issues with the client libraries, or any features that you’d like to see, or you’d want to contribute a feature yourself, please, please do. Oh, look at that, another duplicate slide, just to keep you on your toes. And some more resources. I already talked about the forums as well. Please ask questions there.
Anais Dotis-Georgiou: 00:45:34.536 But also, please keep in mind, there are only three developer advocates, or people on the developer relations team, and we have forums, Slack, Reddit, and Twitter that we all answer questions on. And there are really only three of us whose main job it is to answer questions. There are also engineers in the Slack channel, so if one of us is out, sometimes there will be delays in getting to you on some of those, but we try our best. And I also encourage you to ping other active members in the community channels as well, because they will help. They’re super sweet, but if you ever see a delay, I ask that you please have some patience with us. We’re trying our best to handle all of the questions that we get. And then also, our docs are fantastic. I can’t thank the docs team enough. They are such a small team, but so mighty. And then we also have a bunch of blogs. You can find a blog that basically covers everything that I talked about in this presentation, and it will be coming out soon. Last but not least, we have influxdata.com/university, where you can get — yes, we do need more developer advocates. I agree. Do we need more developers? If I had to just guess, I’d say probably, but I don’t actually know what positions engineering is hiring for right now. But that being said, let me pull it up and share it with you since someone asked. Always good to ask. Caitlin, if you wouldn’t mind actually finding that and sharing that with them, that would be great.
Caitlin Croft: 00:47:20.297 Sure. Our careers page?
Anais Dotis-Georgiou: 00:47:22.265 Yeah. Yeah. And then I just wanted to mention about InfluxDB University, you can go and take classes on all things InfluxDB, but it’s mostly focused on v2. And so if you’re using open source, I’d say still go there. But if you’re looking to get classes on InfluxDB v3, I patiently and humbly request that you wait a little bit. I’m currently developing courses for InfluxDB v3 and redesigning InfluxDB University for InfluxDB v3 specifically. So those courses will be coming soon, and I appreciate your patience. Thank you, Caitlin. So thank you. With that being said, I’m going to go look at the questions and see if I can answer any of your questions live right now.
Anais Dotis-Georgiou: 00:48:09.911 So Bruce asks, do you have a way to contextualize data to organize the data collection and associated assets? So I would say the best way to organize a data collection is, when you’re writing data to InfluxDB, to include tags. So in InfluxDB v3 — so let me back up. So when you ingest data to InfluxDB, whether that’s v2 or v3, you are going to use the write API endpoint. And the ingest format is known as line protocol. And in v2, it really mattered whether you used tags or fields. And tags and fields are components of line protocol. And this was important because tags were indexed and fields were not, and it would cause problems with high-cardinality use cases. And people had to be really strategic about what they wanted to make a tag versus a field. In InfluxDB v3, all tags and fields are just converted into columns in an Arrow table or Parquet table. So functionally, they’re completely the same. That being said, we still recommend that you think of tags as the place where you want to store metadata about your InfluxDB data, your time series data, and a field as where you want to store the meat of your time series data. So that’s where you actually store the value of your time series data: the vibration, the temperature, the pressure, the price, the number of customers, etc. You can have up to 200 columns in a single table, aka measurement; those are the same thing in InfluxDB v3. So I would say the best way to organize your data is by taking advantage of smart measurement names, or whatever you want to name your table, and also including a lot of tags so that you can contextualize that data.
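For reference, here is a minimal illustration of line protocol using the machine-data example; the measurement, tag, and field names are placeholders.

```python
# Line protocol shape: measurement,tag_set field_set timestamp
# "machineID" is a tag (metadata); "vibration" and "temperature" are
# fields (the actual values); the trailing number is a nanosecond
# Unix timestamp.
point = "machine_data,machineID=machine1 vibration=0.56,temperature=72.1 1690272000000000000"

# Assuming an InfluxDBClient3 instance as in the earlier sketches, the
# write method accepts line protocol directly.
client.write(record=point)
```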
Anais Dotis-Georgiou: 00:50:01.408 Will InfluxDB v3 be made available for self-hosting anytime soon? Yes, it will be, but I don’t have the exact timeline on that quite yet. And then, any update on InfluxDB 3.0 open source availability? The best I have for you is sometime later this year. I’m so sorry I don’t have a better answer for that, but there is a free trial for InfluxDB Cloud, so you can get started with that and just play around with it while you patiently wait. But I’m really excited that you’re asking for it and you’re interested in it, so that’s cool. And then Nicos asks, is there a mechanism to use streams instead of queries for data written into Influx, like subscriptions in V1? We are using Influx quite extensively. They found that constant queries are having a bad impact on resources, while streams are much lighter. So that shouldn’t be a problem with v3, precisely because it’s built on Arrow. And in fact, one of the methods that you can use with the client libraries or Arrow Flight is the chunk mode, which actually leaves the gRPC stream open. So you are kind of essentially streaming the data with Flight. So, yeah, hopefully, you shouldn’t run into that problem. Let me share with you what I mean.
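Here is a minimal sketch of that idea with PyArrow Flight directly, reading record batches off the open gRPC stream as they arrive; the host, token, and ticket layout follow the public InfluxDB v3 Flight documentation, but treat those details as assumptions to verify there.

```python
import json

from pyarrow import flight

# Connect to the InfluxDB v3 Flight endpoint (placeholder host and token).
client = flight.FlightClient("grpc+tls://us-east-1-1.aws.cloud2.influxdata.com:443")
options = flight.FlightCallOptions(headers=[(b"authorization", b"Bearer my-token")])

# The ticket is a JSON payload naming the database, the query, and its type.
ticket = flight.Ticket(json.dumps({
    "database": "my-database",
    "sql_query": "SELECT * FROM machine_data WHERE time > now() - INTERVAL '5 minutes'",
    "query_type": "sql",
}).encode())

# do_get returns a stream reader; iterate record batches as they arrive
# instead of materializing the whole result set at once.
reader = client.do_get(ticket, options)
for chunk in reader:
    print(chunk.data.num_rows)  # chunk.data is a pyarrow.RecordBatch
```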
[silence]
Anais Dotis-Georgiou: 00:51:39.918 So, Nicos, I’m including some of the documentation so that you understand, for example, what is being used underneath the hood, and I’ll share that with everybody as well. Everyone. Sorry, Caitlin, you’re always copying and pasting everything because I haven’t been sharing it with everyone. But for those curious, that’s what’s being used under the hood, so that should help with that. I think those are all our questions. Let me just look in the chat to see if there are any specific questions that I didn’t answer. Thank you, Josiah, for sharing the Flux learning with Python. I appreciate that. For those of you who are v2 users, Flux is a part of v2, but it is not the emphasis of v3. SQL and InfluxQL are. So Flux support is being phased out of v3. And so I don’t recommend using Flux for your Quix pipelines. I recommend using Pandas, because there’s just so much more availability, there are so many more resources, and I’m really thankful that I can just use Pandas. You just use SQL or InfluxQL to query all of your data and then use Pandas to go crazy and do all sorts of really complicated analysis and transformation work that you simply can’t do with Flux. And I think that answers all the questions that everyone had. So I’m going to stop sharing. Thank you so much everybody.
Caitlin Croft: 00:53:22.598 Awesome. Thank you, Anais. And thank you everyone, for joining today’s webinar. It has been recorded and will be made available later today or tomorrow morning. So really excited to see what you guys build with InfluxDB. And thank you for joining today.
Anais Dotis-Georgiou
Developer Advocate, InfluxData
Anais Dotis-Georgiou is a Developer Advocate for InfluxData with a passion for making data beautiful with the use of Data Analytics, AI, and Machine Learning. She takes the data that she collects, does a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.