Basic Two-Step Pipeline to Sync Data From InfluxDB 2.x to 3.x With Quix
Session date: May 07, 2024 08:00am (Pacific Time)
Quix is a complete solution for building, deploying, and monitoring real-time applications and streaming data pipelines using Python abstracted over Kafka with DataFrames. Quix integrates directly with InfluxDB as either a source or sink—serving as an ETL engine for InfluxDB 3.x or 2.x users who want to leverage stream processing for their use cases.
In this webinar, you’ll learn how to leverage a Quix project template to sync data from InfluxDB 2.x to InfluxDB 3.x. We’ll also look at some common use cases for the template from users looking to migrate to InfluxDB v3 and users running v2 at the edge.
Basic Two-Step Pipeline to Sync Data From InfluxDB 2.x to 3.x With Quix
Watch the webinar “Basic Two-Step Pipeline to Sync Data From InfluxDB 2.x to 3.x With Quix” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Basic Two-Step Pipeline to Sync Data From InfluxDB 2.x to 3.x With Quix.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors. Speakers:
- Anais Dotis-Georgiou: Developer Advocate, InfluxData
- Tun Shwe: VP of Data, Quix
ANAIS DOTIS-GEORGIOU: 02:19
Hello, everybody. Welcome. I’m just going to give a couple minutes to let everyone come and join. So today we’re going to be presenting a webinar on Quix and InfluxDB and migrating data from v2 to v3. A couple of housekeeping rules and just general information. I encourage you to ask any questions that you have in either the chat or the Q&A, and we’ll make sure to answer those questions at the end of the webinar. Also, a recording of this webinar will be available to you after the webinar so you can follow up with anything there that you might need. And also, I want to encourage you to ask any questions that you might have as well on our community Slack or forums. And just as a general reminder, any questions that you might have about this webinar, and if you’re ever feeling shy about having a question, just remember that someone else probably is also having that same question. So just a friendly reminder to speak up because you’re probably helping someone else along the way. What else can I think of? I think that’s generally pretty much all of the information that I want to give you before we start. But also, I would love to know where you’re from. So, if you don’t mind, just kind of as an icebreaker, sharing where you’re from in the chat and also maybe your favorite flavor of ice cream, just gives us an opportunity to see where everyone’s coming from. And it’s always exciting to see participants from all over the world joining. So yeah, would love to hear where you’re from and what your favorite flavor of ice cream is. And I’ll give just a few more seconds to let anybody else join who might be a little bit late. And then we’ll get started here in just a little bit.
ANAIS DOTIS-GEORGIOU : 04:15
Yeah, I’m from Austin, and my favorite flavor of ice cream is mint chocolate chip. Chocolate, classic choice. All right. Well, I think we can go ahead and get started. I’ve given a couple of minutes to let people get in. So welcome, everybody. Today, we’re going to be talking about Quix and InfluxDB and migrating data from v2 to v3 with a Quix template. And today, my co-speaker is Tun, so I’ll let Tun introduce himself.
TUN SHWE: 04:58
Thanks, Anais. So, my name’s Tun Shwe. I’m really glad to be back here hosted by InfluxData for this webinar. So, thank you for joining. So, I work at Quix. So Quix is a developer tools company for stream processing. I’m usually based in and out around England, usually out of London. And I pass through Cornwall a lot at the moment. And my background is in data engineering. So, I’ve been writing a lot of Python, Scala, and Java in my days for tools such as Spark and Kafka, and spent some time in the last few years in analytics engineering land, so working with data warehouses, data modeling using tools like dbt, Snowflake, Redshift, etc. And in the last few years, I’ve focused specifically on data strategies. So, helping companies determine their data strategy and help implement it using streaming first principles and taking that mindset of working with real-time data. So as soon as they generate data, they start processing it, kind of helping companies get into that mindset. My areas of interest are that. It’s basically the mantra less is more, get started sooner because I think the whole landscape around this tooling is changing a lot, especially when it comes to real-time data and AI, the ecosystem. There’s a venture capitalist called Matt Turck who created the MAD landscape. It stands for ML, AI, and data landscape. And it truly is mad. You need a telescope to actually view it all. My teammates just posted the link in there. So, get your telescopes ready. You can see that every other week or so. There’s a brand new tool. So, I love to stay on top of that and figure out how they fit into this whole ecosystem, so. And lastly, when I’m doing that research and figuring things out, I think you should always do that whilst eating an ice cream. And my favorite is pistachio. Or whenever I’m in Italy, it’s pistachio gelato. So yeah, that’s it from me. Over to you, Anais.
ANAIS DOTIS-GEORGIOU : 06:49
Thank you, Tun. Thank you so much. So, I– oops, sorry. So, my name is Anais, and I’m also a developer advocate. I’ve been at Influx for about six years. And my background is really in biotech. I started analyzing data, lab data for diagnostics for various diseases. And from there, I decided I don’t want to spend my time actually in the lab. I’d prefer doing the data analysis portion. So, I have a strong passion for things like Python, Pandas, Polars, math, algorithm selection, statistics. And those are kind of my main areas of interest. And I agree. I think if you’re doing any of that with your favorite flavor of ice cream, it makes it even better. And mine is mint chocolate chip. And I’m also based in Austin. So, for today, we’re going to just give an intro into what InfluxDB and Quix are just to give you that context for what we’re doing here. And then we’ll dive into understanding Quix templates in more detail, not only talk about how we can use the v2 and v3 template, but also some of the other templates that are available to you through Quix to use with InfluxDB. Then we’ll talk about some use cases and kind of also dive into the difference between Edge Data Replication and this v2 to v3 template so that you can better understand when you’d want to use either one. And last but not least, we’ll give you some resources so that you can get additional help for getting started with Quix, getting started with InfluxDB, using these templates, and creating some meaningful streaming data pipelines.
ANAIS DOTIS-GEORGIOU : 08:30
So, with that being said, let’s talk about InfluxDB and Quix a little bit. And in order to understand InfluxDB, it’s important to take a step back and understand what time series data is because InfluxDB is a time series database. So, time series data is any data that has a timestamp associated with it. So probably one of the simplest examples or easiest examples to recall is stock value data. And an interesting thing about time series data is that singular points are usually not of much interest. With stock data, you really want to be looking at the price of something over time. So, you want to know what that series of values means and what the trend is, because that helps you evaluate whether or not you want to buy or sell. So really, this concept also applies to any other time series data. Time series data can be something like IoT data where you’re monitoring your environment over time. And that could be something like pressure, temperature, concentration, light, humidity, flow rate, you name it. Anything that has to do with monitoring your physical environment. And similarly, you’re going to be looking at the trend of that over time so you can make predictions about what’s going to happen to your environment, make sure that it is actually exhibiting the state that you expect it to. And similarly, you might be monitoring your virtual environment. You might be looking at DevOps monitoring and looking at your CI/CD pipeline. You might be looking at the availability of endpoints. You might be monitoring your network or also monitoring Kubernetes or Docker instances. So, in this world as well, whether or not you’re actually monitoring your physical environment or your virtual environment, you’re gathering time series data. It has a timestamp associated with it, and you’re looking at the data over time, and you’re looking at multiple points.
ANAIS DOTIS-GEORGIOU : 10:26
And in this way, time series data is usually characterized as being presented in kind of two types, both metrics and events. So, metrics are defined as any time series data that is coming in at a regular time interval. So, if we think about polling any sort of device at a regular interval, then that could be a metric. An event is any irregular time series data. So, in the healthcare space, we think of our heart rate as being a regular metric. And whenever we experience a cardiovascular event, hence the name, something irregular like AFib or a heart attack, then that’s an event. But the cool thing about time series data is that you can easily convert events into metrics. You can take your event data and maybe look at an aggregation of how many events we’re getting per day and perform a count. And now we’re converting that event data into a regular metric. And so, it’s a common task for anybody collecting time series data just because you want to be able to have a way to standardize any sort of event data and provide some context to it through that aggregation. Similarly, another thing that you might want to be doing with your time series data is downsampling that data, taking your raw, high-precision time series and getting an aggregate. Maybe not to just convert your event data into a metric, but also so that you can kind of eliminate some of the noise of your high-precision data and just be able to focus on the trends over time. And for both of these really common use cases or common data transformation tasks, you would probably want to use something like Quix to achieve that, as well as for more sophisticated problems.
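To make those two transformations concrete, here is a minimal sketch in Python with Pandas; the column names, event types, and the daily and 10-minute windows are illustrative assumptions rather than anything from the webinar itself.

```python
import pandas as pd

# Hypothetical event data: irregular timestamps, one row per event.
events = pd.DataFrame(
    {
        "time": pd.to_datetime(["2024-05-01 03:12", "2024-05-01 17:40", "2024-05-02 09:05"]),
        "event_type": ["afib", "afib", "afib"],
    }
).set_index("time")

# Convert events into a regular metric: count of events per day.
events_per_day = events.resample("1D").size()

# Downsample a raw, high-precision metric: mean value per 10 minutes.
raw = pd.DataFrame(
    {
        "time": pd.date_range("2024-05-01", periods=600, freq="s"),
        "heart_rate": 60.0,
    }
).set_index("time")
downsampled = raw.resample("10min").mean()

print(events_per_day)
print(downsampled.head())
```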
ANAIS DOTIS-GEORGIOU : 12:09
But now let’s talk about time series databases and some of the requirements for a time series database. So, the very first one, and this might be kind of obvious, is that you are going to be writing timestamped data, time series data. So, every point in your time series database is going to be associated with a timestamp. The second is that time series databases will be able to accommodate really, really high write throughput. And this is because a lot of time series use cases, especially in things like IoT or FinTech, just produce a lot of time series data. So, if we think of, for example, an industrial vibration sensor on a manufacturing floor, a single industrial vibration sensor will sample at around 10 kilohertz, which is about 10,000 points of time series data every second. So, if you had multiple vibration sensors, and maybe you have other sensors as well that are monitoring other aspects of your factory floor, you can see how quickly you are writing hundreds of thousands or millions of points per second. And so, you need a database that can accommodate that throughput. A third component of time series databases is that they should be able to perform really efficient queries over those time ranges. There’s not a lot of value in being able to write that data if you can’t query it back efficiently, right? You also want to be able to perform things like mins and maxes, scan over large portions of data, and actually get your local or global minimums and maximums and be able to perform those fast scans. And last but not least, you want a database that, like every other database, is also scalable and performant, right? You want it to be designed to scale horizontally so that you can accommodate these types of write and query requirements and also have that scalability and reliability.
ANAIS DOTIS-GEORGIOU : 14:11
So, this is what InfluxDB looks like. InfluxDB 3.0, that is. So, we had a total rewrite of our engine in 3.0, and we based it all on the Apache ecosystem. So, we use Apache DataFusion as our query execution framework. We use Apache Arrow as our in-memory columnar format. Parquet is our durable file format, which is also columnar. And essentially, now with InfluxDB 3.0, you can query directly in SQL with DataFusion as well as in InfluxQL. InfluxQL is a SQL-like query language that existed in v1 of InfluxDB. And so, if you’re migrating from v1 to v3, then all of your queries can remain the same. And a big reason why we made this change to rewriting the storage engine with these upstream projects is to, first, be a part of those upstream projects and contribute back to them and be a part of that community. And as a result, also offer more interoperability with other tools that leverage things like Parquet and Arrow. So being able to get your Parquet files directly from InfluxDB, which is a feature we hope to provide in the future, will open up so much more interoperability with other tools. We’re also looking to leverage Iceberg in that way to be able to extract the exact Parquet files that we want. Additionally, being able to store everything in this columnar format is a huge advantage for time series use cases and allows us to reach a lot of the goals that are requirements for time series databases. Being able to offer things like unlimited cardinality or dimensionality in InfluxDB is largely because we moved to this columnar format.
ANAIS DOTIS-GEORGIOU : 16:05
And that’s just because of the nature of time series itself, where time series is– generally when you’re, let’s say, monitoring the temperature of a room, that value might not change very often. And so, as a result, if you’re storing multiple values that are the same over time, and you are storing it in that columnar format, then that offers opportunities for really cheap compression because you can just summarize a lot of those values with dictionary encoding. And then additionally, it also provides the ability to perform really fast scans over all of your data. When you think about storing data in a relational, row-based format, if I just want to find the max temperature for one single temperature sensor, I have to go across every row just to find that one max value. With a columnar format, I can just focus on that one column. So that’s also how we help make queries a lot more efficient. And that being said, I will now move to [inaudible] drive for a second so that he can share some information about Quix.
TUN SHWE: 17:20
Thanks so much. Yeah. So, I’m going to share my screen real quick and just talk you through Quix. And kind of to get started, I think it always makes sense to explain the differences between batch and stream processing because most of the customers we’ve spoken to at InfluxData who are developing on InfluxDB at least understand the concept of processing using batch. So, when you process data at rest, what that means is you’re usually reading and querying data from something like a database or a data lake or a lakehouse. So, it’s where the data kind of sits for generally long-term storage and you kind of amass historical data. So, for this use case, I’m going to take a cyclist. So, you see the cyclist up there at the top. And that cyclist has an app with some sensors on it. So, it’s monitoring things like heart rate, temperature, gravitational forces, elevation, velocity, things like that. In retrospect, I should have probably made this about ice cream. So, I think for future reference, I’m going to do one with maybe customers at an ice cream shop. But anyway, we’re stuck with the cyclist. So, the first thing that happens is that the cyclist, as they’re moving along and cycling along, will create a data point, and that gets sent over the network and saved in a database. And this happens over time. So as time passes, the cyclist will create more data points, and they will get stored into the database as such. We call this bounded data because this data has a start and an end generally. And so, at some point in the future, a scheduler will kick in and it will load in that data. So here we’ve got four columns. We’re interested in the timestamp, which is t, and we’re also interested in the x, y, and the z axes for the gravitational forces.
TUN SHWE: 19:06
So, what we generally do in practice is you will process data for maybe a calendar day. So, between yesterday and today, you would load that data into this processing system, and you would perform a calculation. So, in this case, we’re just summing up the total of the X, Y, and the Z gravitational forces, and optionally, the deltas, how they change over time. So, the takeaway here is that that computation is done on historical data and it’s stateless because you’re loading all the data that you care about. In this case, let’s say a calendar day into memory and you’re processing it in memory. And the results are not in real time because you’re introducing a delay. You are scheduling this to kick in and load and process that data at some point in the future. Now let’s contrast that with stream processing. So, this is different because you’re processing data in motion. We don’t use a database in this case. We use something that will hold messages. So, in our example here, it’s a broker transport called a Kafka topic. And here, when the cyclist creates that data point, there is no delay. That message is immediately consumed by the processing system as soon as that data point is generated and saved into the topic. And here we’ve selected out the same values, the x, y and the z, and we’ve computed the total. Now what we need to do is, because we’re observing each data point one at a time, if you choose to remember certain aspects of this data, you need to maintain an internal state. So, in this case, we’re just going to store for that cyclist a timestamp and the totals. So, if we move on and the cyclist generates a new data point, that gets immediately processed again. And so, there’s no delay. It happens in real time. So as soon as that data is generated, that data is processed and calculated as such. Okay. I’m just going to let it animate all the way through. There we go. So, it will process the data in flight.
TUN SHWE: 21:06
So, the use case for this, for example, could be if you are trying to keep track of the gravitational forces during a race. So, let’s say you are monitoring all the cyclists in the Tour de France, and you have an ID per cyclist, you would keep that state there and then over time you’d be able to determine for each cyclist what their total forces are or maximum velocity. There are various aggregations you could do. But the take-home here is that that computation is done on each event as it comes in, and it requires state to keep track of any sort of historical data. So, we consider that stateful processing. And most importantly, you get real-time results. And this is really why Quix exists: its mission is to make real-time data available to Python developers everywhere. So, the whole landscape for stream processing is built up using Java and tooling that was built in Java. And so Quix’s mission is to change that and to have Python developers be able to work with data in Kafka and work with streaming data. So, the first project that I’m going to talk about here is Quix Streams. It’s the open source Python library that enables you to work with data in Kafka. Now, when I’ve spoken to a lot of InfluxDB customers and developers, they kind of look at Kafka and they say, “Oh, it feels like overkill.” But in this day and age, we’ve got a lot of managed services and the mission of Quix Streams really is to simplify all that complexity when the data and its infrastructure, as complicated as it can be, is abstracted away from you and provided to you through a simple interface, which I’ll show you in a moment. So, we really believe that Kafka is the best solution for scaling out these sorts of data pipelines and for scaling out mass-scale data and stream processing. And so therefore, we’re betting on Kafka, and we want to make that extremely simple and bring that to the masses.
TUN SHWE: 23:01
As I mentioned, it’s open source. The original library is over a year old; we rewrote it at the end of last year to make it fully Pythonic. So, it’s 100% Python now, which means great things as a developer because you get the best developer experience. So, because you don’t have other languages, like it’s not Python wrapped around Java, for example, so whenever you have errors that happen, and you encounter a lot of errors as you’re building up a system and taking it to production, you get all these errors in Python, so you’re able to debug line by line. So, we thought this was the most important aspect of learning about stream processing and making it available to Python developers. And the principle by which we make this all really easy is by taking messages that come into that message broker and turning them into a tabular representation. So, I think most developers are accustomed to languages like SQL and being able to query rows and tables in a database. So, we wanted to bring that same kind of interface. And we do that because we have released a feature called Streaming DataFrames. Now, a lot of developers, certainly in the Python world, are familiar with data frames if they’ve worked with libraries like Dask and Pandas. So, we were inspired by such approaches. So, we created this concept of a data frame where you would create a data frame. You can see on– what’s it? Like line five where you create a streaming data frame. You essentially subscribe to a topic. And whenever a new record comes in, so like in the previous example with the cyclists, as each new event data point comes in, it invokes these functions. So, we have an apply here where we’re pulling out field A and field B from that JSON that’s coming through. So, we make it really easy to manipulate that message data using a tabular format.
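For readers following along without the slide, here is a minimal Quix Streams sketch of that pattern; the broker address, topic name, and field names are placeholder assumptions rather than the exact code shown in the webinar.

```python
from quixstreams import Application

# Connect to Kafka (the address and consumer group name are illustrative).
app = Application(broker_address="localhost:9092", consumer_group="example")

# Subscribe to a topic of JSON messages.
input_topic = app.topic("sensor-data", value_deserializer="json")

# A Streaming DataFrame: the functions below run on each new record.
sdf = app.dataframe(input_topic)

# Pull two fields out of the incoming JSON, similar to the apply step on the slide.
sdf = sdf.apply(lambda row: {"field_a": row["field_a"], "field_b": row["field_b"]})

# Run the pipeline.
app.run(sdf)
```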
TUN SHWE: 24:53
We also support stateful operations. So again, if you were monitoring the Tour de France, you would need to keep a key for each of the riders and tabulate things like distance and total times, things like that. You’re able to do that by maintaining state. So, we give you access to a high-level state object, which in effect works like a key-value store, like a dictionary, and enables you to set values. And this is persisted through a topic, and I’ll go into how that happens later. And we also have window operations. So, window operations in the context of time series data are extremely important when you want to perform, like Anais mentioned earlier, a downsampling task. And you can do that using a tumbling window. So, there we’ve defined a tumbling window that is a window that is 10 minutes in size. And we have also introduced a grace period of 10 seconds so that any late-arriving events will be incorporated into that window as well. And we apply a reduce step. And for that, we’ve provided a function for an initializer, which kind of gives you the initial data structure, and the reducer, which in this case is the downsample function. And what that does is as each new record comes into the window, it will invoke this function on each and every new record that comes in. And in this case, it’s a downsample. And then we emit the results at the end. So, we support several window operations. And so, when you’re working with InfluxDB, being able to downsample becomes a piece of cake. It just becomes a few lines. So, we have gone all in on Kafka because we believe in its powers. So, Quix Streams is a lightweight library, but because it depends on Kafka, you get the guarantees that Kafka brings. So, powerful guarantees such as high availability, resilience, the ability to recover when you encounter errors and things terminate. But also, we do that through this thing called the changelog topic.
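As a rough sketch of the windowing just described (again, the broker address, topic names, field name, and emitted shape are assumptions, not the exact slide code):

```python
from datetime import timedelta
from quixstreams import Application

app = Application(broker_address="localhost:9092", consumer_group="downsample")
input_topic = app.topic("sensor-data", value_deserializer="json")
output_topic = app.topic("sensor-data-10min", value_serializer="json")

sdf = app.dataframe(input_topic)


def initializer(value: dict) -> dict:
    # Initial window state, built from the first record that falls into the window.
    return {"count": 1, "sum": value["temperature"]}


def reducer(aggregated: dict, value: dict) -> dict:
    # Invoked for every subsequent record that falls into the same window.
    return {
        "count": aggregated["count"] + 1,
        "sum": aggregated["sum"] + value["temperature"],
    }


# 10-minute tumbling window with a 10-second grace period for late-arriving events;
# .final() emits one result per window once the window closes.
sdf = (
    sdf.tumbling_window(duration_ms=timedelta(minutes=10), grace_ms=timedelta(seconds=10))
    .reduce(reducer=reducer, initializer=initializer)
    .final()
)

sdf = sdf.to_topic(output_topic)
app.run(sdf)
```

Stateful operations follow a similar pattern; in recent Quix Streams versions, something like sdf.apply(fn, stateful=True) passes a State object into fn so you can get and set per-key values such as running totals.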
TUN SHWE: 26:50
So, once data is written into a topic, it’s durable, meaning that you don’t have any data loss. And as you change the state over time, so that state object, as you make changes to it over time, we also publish it to an internal, kind of hidden, changelog topic. And in the event of a disaster, or in the case where you need to recover, the application is able to re-read everything out of that internal changelog topic and resume its state. So, all these consuming applications in your data pipeline, they have failover and they’re able to recover gracefully. So, that’s really why we chose Kafka and how we bring the power of Kafka to this lightweight library. And we noticed that a lot of Python developers probably don’t want to concern themselves with infrastructure and putting their applications in their pipeline somewhere. So, we built Quix Cloud. So Quix Cloud enables you to deploy those Quix Streams applications in a data pipeline, and we will take care of the management of Kafka, of Kubernetes, and all the CI/CD, all the image building for Docker, all the deployment, all the registration, all that stuff. We take care of all of that for you with a few clicks, as you’ll see in the demo later. We also have a series of open-source connectors. So, we have sources and destinations. So of course, InfluxDB is both a source and a destination, and we support a lot of others like HiveMQ as well. So, we take care of all of that in our cloud. We have several options. I think we have a deployment kind of parity with InfluxDB because we have a serverless solution, which is multi-tenant. We have a dedicated solution, which is single tenant, meaning no shared infrastructure with other customers. And we have BYOC, which is bring your own cloud or bring your own cluster in our case.
TUN SHWE: 28:40
So, you can spin up a Kubernetes cluster in your own private VPC, and we’re able to install Quix there. So, I think pretty much like for like with InfluxDB’s offering. And this is a screenshot of Quix Cloud. It’s a little bit redundant in this case because I’m just going to switch to the demo now so you can see it in action, but it enables you to build streaming ETL pipelines. And it’s especially good for time series data. So hopefully you’re now seeing Quix Cloud. So normally when you log in and land in Quix Cloud, you come to a workspace screen, and I’ve clicked through to the pipeline view. So, starting maybe at the top there, we’ve got environments in the top here and you can create new ones. So here, I’ve just got one called production. You tie the different environments to branches in Git. So, during your onboarding, you’ll either use our own Git or, as we recommend, bring your own Git repository. So, you will authenticate your Git with Quix Cloud, and that will enable you to use different branches. So, you could have a dev branch, you could have a demo branch, you could have a production branch, and you can create them as separate environments up here. And each environment can have its own broker. So again, you can use Quix’s own managed broker, which is what a lot of customers use. They use Quix’s managed Kafka, or you can bring a few others. So, we have partnerships with Redpanda, Confluent Cloud, Aiven, and Upstash at the moment. So, you can bring your own Kafka if you have that.
TUN SHWE: 30:10
So yeah, let’s talk about this pipeline view. So here, you’ll see various things going on. So, each of these boxes is a Quix Streams application. And Quix Streams works on the premise of everything being dockerized. So, each application has a Dockerfile, and it’s built and deployed in this fashion. We’ve got a few running here. Oh, let me start this one. So as simple as hitting the start button, it will start the service and deploy it. And you can see as it enters into a starting state, and it enters into a running state really quickly. So, you’ve got all that CI/CD in the background so that you can deploy applications once you’ve defined them. We’ve got the blue outline here for sources. We’ve got this orange outline here for sinks or destinations. And we have all the transformations in this purple outline. You’ll see things flow through the arrows in between each of the boxes, each of the containers; those arrows are topics in Kafka. And here you can see how Kafka is really the data backbone inside Quix. You can see when they’re lit up in green, that data’s flowing through them. So, in this case, we have JSON 3D printer data flowing through at about 30 kilobytes per second, and you have that observability. I’ll quickly go down the menu on the left side. So, if you click on deployments, you can see the various deployments that you have in this environment, various metrics around CPU utilization, replicas, and memory. We also have topics there which tell you their retention period, what they’re called, and you get a real-time view of data flowing through. So, you can click through and actually see that data as well. And you’ve got the application. So, when you connect to your Git– when you connect Quix Cloud to your GitHub repo, each project is in a single monorepo, and this shows you the paths to it.
TUN SHWE: 31:58
And I’ll lastly just go into code samples here. So here we’ve got our open source connectors and code samples in here. So, you can obviously search for the connectors of your choice. And here you’ll see we’ve got two sources for Influx, one for the version two source, one for the version three source, which we’ll go into a bit in a moment, as well as the 3.0 sink. And last thing I’ll show you is just how we have laid out all the code in this. So, if I click through on the 3D Printer Down Sampling, you get to this view. So, I’ll go through the tabs here. So here on the left, we’ve got build logs. So, for the CI/CD, you can see all the logging as we build your container and start deploying your application. We’ve got the logs, which is essentially like your logger or your print line statements. They all come out in here. And we’ve got the messages, which are the actual messages that are flowing through this topic. In this case here, it’s the JSON 3D printer data topic. And you can see that flowing through in real time. And you can kind of pause and click to inspect the data as you wish. And the last thing I’ll show you is there is an edit code button.
TUN SHWE: 33:02
So here we’ve got a downsampling application. And if you click on edit code, this should present you with the Cloud IDE based on Visual Studio Code. And here you can see all the code, and you can actually edit in here and make commits as well. Obviously, everything’s in Git, so we have a concept of a history. And from here, you can deploy as well. So, as you’re developing and saving and committing back to Git, you can hit redeploy and you can select specific commit hashes or just always use the latest version. So, you’ve got different configuration properties here. So many things to explore. But yeah, that was everything I wanted to show for now for Quix Cloud. So, I’ll hand it back to you, Anais. Oh, I think you’re muted.
ANAIS DOTIS-GEORGIOU : 33:57
Thank you. Yeah, let me get the slides back up really quickly. Here we go. All right. Yeah. So yeah, Quix is really easy to use. And so I get the pleasure of talking about the Quix templates, which make it even easier to use because you literally just click one button and it populates all of the sources and sinks and components of your pipeline for you and then even prompts you to fill in any environment variables that you might need to actually configure them to work as you expect. So, the one that we’ll be talking about today is the sync from v2 to v3. And this is probably the simplest project template that Quix has. It just uses an InfluxDB v2 source connector and an InfluxDB v3 sink and allows you to write data from v2 to v3 and also select any tags or fields that you want as a part of that so that you could even narrow down your sync to specific series. And you can get started by following the QR code there, or you can also just visit the Quix project templates page and browse through the entire catalog of the templates that they offer and pick this one. So, let’s go into how we actually configure it and some of the requirements. So obviously, you’ll need an InfluxDB v2 instance, hopefully with some real-time data that you’re pushing to it and writing to it. And then also a v3 cloud account. And you can sign up for a free tier trial. And you also are going to want a Quix Cloud account. And then for InfluxDB, you’ll need your bucket that you are gathering your data from, your bucket that you want to write to in InfluxDB v3, your tokens for v2 and v3 as well, and your organization IDs. And then similarly, any fields and tags that you want to specify that you want to sync from your v2 instance to your v3 instance.
ANAIS DOTIS-GEORGIOU : 35:58
And I’ll actually just show you the same thing that’s going to be in that slide, which is essentially just how to gather those credentials. So, this is my v2 instance. And if you need to create a bucket, you can go to the load data page and navigate to the buckets tab, create a bucket, give it a name and a retention policy, which just specifies whether and after what point you want to automatically expire your data. So that could either be never or after a certain amount of time. And then you’ll also want to create any sort of token. And you can obviously scope those tokens to the particular bucket that you want to be querying your data from. So here I have a bucket named cpu2, just to indicate that it’s from my v2 instance. And I’m writing just some CPU data to it with Telegraf. Telegraf is InfluxData’s collection agent for metrics and events. It’s a plug-in-driven, super lightweight agent that you can use. And you can also configure a Telegraf agent within the UI if you want to by searching for a bucket that you want to write to, a plugin that you want to use, in this case, CPU. You can go ahead and continue configuring it, even go as far as save and test it, and it’ll export a token for you and create a command to point to the config that you want. So that’s how we just have this set up here. And you can do the same in InfluxDB v3: go to the load data page and create a token there similarly.
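As a side note, if you would rather sanity-check the bucket and token from Python instead of through Telegraf, a minimal write against the v2 API might look like this; the URL, org, token, and host values are placeholders, and only the cpu2 bucket name comes from the example above.

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details for an InfluxDB v2 instance.
client = InfluxDBClient(url="http://localhost:8086", token="MY_V2_TOKEN", org="my-org")

# Write a single test point to the cpu2 bucket to confirm the token and bucket work.
write_api = client.write_api(write_options=SYNCHRONOUS)
point = Point("cpu").tag("host", "test-host").field("usage_system", 1.23)
write_api.write(bucket="cpu2", record=point)
client.close()
```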
ANAIS DOTIS-GEORGIOU : 37:35
So, I’ll go ahead and go past this because it’s just a video describing exactly what I just shared. But to get started with this template, what you’re going to do is essentially go to quix.com/templates, and then you can find the Sync Data From InfluxDB v2 to v3 template. And then from there, you can simply hit clone this project. And when you do, you’ll be navigated directly to Quix, and it’ll ask you if you want to import your project, and you can. You can import it with a Quix default configuration or an advanced configuration. I just did the default. And then from there, once you’ve done that, it’ll prompt you to add your missing secrets. And this is where you can add your InfluxDB token, which is for your v2 instance, and your InfluxDB v3 token, which is for your v3 instance. From there, once you actually have the pipeline, you can go ahead and edit your deployment and add any variables that you need. This includes things like your org ID, your host, your bucket, and the time interval at which you want to be executing the sync. So, the default is five minutes, but you could change that to be less if you wanted. One thing I should mention too, probably the easiest way to find your org ID is from the URL itself. You get /orgs, and then there’s your org ID. And with that being said, after you do that, you can go ahead and just sync all your data.
ANAIS DOTIS-GEORGIOU : 39:19
I did want to mention, though, kind of the difference between this Quix template and Edge Data Replication. So, if you are a v2 user, you might be familiar with Edge Data Replication. Edge Data Replication allows you to transfer data from an InfluxDB edge node to InfluxDB Cloud. And this is true for InfluxDB v2 open source. And so, you could, for example, in InfluxDB open source v2, take your data, store it raw in an instance of InfluxDB, maybe use a Flux task (Flux is the query and scripting language for v2) to downsample that data, and then take that downsampled data and replicate it to cloud with Edge Data Replication. And the idea here is to provide you certain benefits like durability, because Edge Data Replication does offer buffering of data. And also, you’re keeping this ability to replicate your data at the edge. So having that local access to your real-time data, and then also the global access by sharing or saving all of your edge data in cloud. And so, we have that direct replication where you’re just sending all of your raw data directly to cloud or that pre-processed replication where you’re maybe just performing a simple downsample and then writing that data to the cloud after that. But there are different advantages and disadvantages to using Edge Data Replication over Quix. An advantage to using Edge Data Replication is maybe some latency reduction and bandwidth optimization just from it running at the edge. However, you’re limited in terms of the amount of processing you can do. Flux does offer some data transformation capabilities, but it is much more limited compared to something like the power of Python.
ANAIS DOTIS-GEORGIOU : 41:18
So, with Quix, you could not only have your v2 to v3 sync, but you could add more components into your pipeline to actually include machine learning integration, any type of real-time processing that you want. And then you also get all of the scalability advantages that Kafka has to offer. Another thing about Quix is that you can have sources and sinks for v2 and v3, and Edge Data Replication is only available for v2. So that’s just something to kind of keep in mind there. And I want to take a moment, too, to highlight some of the other templates and projects that are available through Quix. So, one is the AI Customer Support template. And this template basically has two LLMs that are holding a conversation between an AI customer support agent and a customer. And all of the chat conversations are stored in InfluxDB. So that’s kind of a really cool use case, a hybrid use case, of using LLMs with InfluxDB. And then we also have the Quix– real quick, Tun, is there anything that you wanted to share about that template?
TUN SHWE: 42:37
Oh, yeah. Thank you. So yeah, going forward, Quix is really going to invest and focus on building out more of these templates based around Gen AI, based on the demand that we’ve been receiving. So, in this case, we’re using LangChain with the Llama 2 large language model. And going forward, we’re going to start incorporating more of these tools in the ecosystem. So again, we’ve been keeping our eye on the MAD landscape and trying to incorporate some of these tools in. So, we’ve got a few already that you can visit on quix.io/templates where you can see we’re using vector databases like Qdrant. We’re about to release some more vector database examples as well. And yeah, we’re going to basically do all the hard work of keeping an eye out on what’s trending at the moment and picking out what could be useful and building that into a nice reference architecture. So yeah, stay posted and you’ll see some more.
ANAIS DOTIS-GEORGIOU : 43:33
And a second template is the predictive maintenance template. So, for this template, what’s happening here is that it generates data for a fleet of 3D printers and then helps you predict which ones are going to fail. We’ve also done a webinar on this particular template. So, I encourage you to follow that link as well and watch the webinar on this particular Quix template. And all of the raw data from these 3D printers is stored in InfluxDB as well. And then we also have another project, which also has an accompanying webinar that you can watch. And it’s called Quix Saving the Holidays. And here we use HiveMQ as our MQTT broker to collect our generated or simulated generator data. And we store that within InfluxDB and then also connect to Hugging Face to employ some autoencoders to actually perform some anomaly detection for those generators. So that’s another cool example. And this one’s also fun because it’s completely scalable with the use of that HiveMQ MQTT integration as well. So, we’re only simulating generator data for three generators for this example, but it could easily be scaled for thousands of generators.
TUN SHWE: 44:54
Yeah. That use case is extremely popular. We get a lot of people in the community saying, “I want to deploy a machine learning model for prediction.” And you have the benefit of working in Python. So, a lot of your models and the tooling you use can easily be ported into Quix. It’s basically a library dependency at that point. So yeah, we’ve had a lot of customers and inquiring developers deploy their first projects with us really quickly, within a day. So yeah, those are really good use cases. Also, my favorites. So, thanks for pointing those out.
ANAIS DOTIS-GEORGIOU : 45:26
Yeah. Absolutely. And then similarly, I wanted to share a use case of a customer using Edge Data Replication just because they would also be a great example or a great use case for using something like this template. They came to Influx before we had a partnership with Quix and before these templates were available, but I could easily see ju:niz using it. So ju:niz collects thousands of data points about battery health and climate and temperature. And their tech stack includes things like Telegraf, Modbus, MQTT, Grafana, Docker, AWS, and InfluxDB. And they use InfluxDB Cloud Dedicated specifically to collect sensor data from their batteries. And they use Edge Data Replication to aid in those collection efforts. They also are a big Python company as well. So, I could see them easily wanting to leverage something like Quix instead of Edge Data Replication. But they are currently using edge-to-cloud replication to downsample at the edge before sending their data to InfluxDB Cloud. And so now we have time for an actual demo. So here I have my– oh, no, I have to log back in. Give me one second. I’m going to stop sharing just to log back into everything.
TUN SHWE: 46:54
So, we’ve got a couple of questions in. So, we’ll answer those at the end, I think. I think there’s one question for you and one question for me. Actually, whilst you’re logging in, maybe I’ll answer the first one. So, we’ve got a question that’s come in that says, “How does the Quix Streams Python library compare with the Kafka Streaming API or Java API in terms of performance?” I hate saying it depends, but it really does in this case. So, I’ll speak generally. So, compared with Java, Python is generally about 20 times slower. So, with some of our customers, we found the performance difference, in terms of latency, to be the difference between 10 milliseconds and 200 milliseconds. Now, that’s not really discernible to the human eye. So, in that case, it’s negligible, but obviously for larger workloads and payloads that flow through a streaming system like Kafka, it adds on time. So, the answer is, it depends. You can get around it with some clever design decisions. So, Kafka has the concepts of partitions and consumer groups and replicas. We would use those all in combination to help speed things up. And it should be said that we have highly demanding customers who work in highly critical, low-latency environments. So, the founders of Quix, their background is from F1, sort of Formula 1 racing. So, they worked at McLaren. So, they’ve employed a lot of their best practices into designing this system. And we also have Formula 1 race teams as our customers. And so far, we’ve been able to cleverly architect their solutions, and they haven’t had any problems winning races. So, I think that’s probably the best way to put it. But thank you for your question.
ANAIS DOTIS-GEORGIOU : 48:34
Yeah. So, I’m logged in now, and this is the v2 instance. This is my v3 instance here. So, like I mentioned before, I’m just collecting CPU data with Telegraf. And yeah, once you actually click on clone this project, it’ll take you directly to Quix, where you have the opportunity to add any environment variables that you need, but we can actually look at the individual components of our pipeline. For example, our data source. And this is where, too, you could edit any variables that you might have. And we can see that we’re, in fact, collecting data here successfully, and we can look at all of our logs and our messages. We can also look at the source code directly to understand the Python that’s actually running this. And for this, we’re using the InfluxDB v2 client library to get our data, using a Flux query to query our data from our v2 instance because Flux is the query language for v2, and then essentially converting that data or serializing that data and encoding it as bytes so that we can then use the other portion of our pipeline to actually sync that data to InfluxDB v3. And similarly, in this one as well, we’re using the InfluxDB v3 client, whereas before, we were using the v2 client. And we can actually go ahead and send data directly to InfluxDB v3. And we are constructing points here and writing them with the write method of our InfluxDB v3 client, specifying the write precision as well.
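The template’s actual source and sink code lives in the Quix samples, but as a rough, simplified sketch of the two halves just described (credentials, hosts, and bucket/database names are placeholders, and the Kafka hand-off in the middle of the pipeline is omitted):

```python
from influxdb_client import InfluxDBClient                 # v2 client: query with Flux
from influxdb_client_3 import InfluxDBClient3, Point       # v3 client: write points

# --- Source side: query the last sync interval of data from InfluxDB v2 with Flux ---
v2 = InfluxDBClient(url="http://localhost:8086", token="MY_V2_TOKEN", org="my-org")
flux = '''
from(bucket: "cpu2")
  |> range(start: -5m)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_system")
'''
tables = v2.query_api().query(flux)

# --- Sink side: write the same records to InfluxDB v3 ---
v3 = InfluxDBClient3(
    host="https://MY_V3_CLOUD_HOST",  # placeholder host
    token="MY_V3_TOKEN",
    org="my-org",
    database="cpu3",
)

for table in tables:
    for record in table.records:
        # Rebuild each Flux record as a v3 Point, carrying the host tag if present.
        point = (
            Point(record.get_measurement())
            .field(record.get_field(), record.get_value())
            .time(record.get_time())
        )
        host = record.values.get("host")
        if host is not None:
            point = point.tag("host", host)
        v3.write(record=point)
```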
ANAIS DOTIS-GEORGIOU : 50:31
And similarly, within that pipeline, we can also look at any logs. This is also where I would go ahead and edit any of the environment variables that I need to actually write data. And then we can verify within our InfluxDB v2 instance that we are successfully writing data to v2 with Telegraf from the past five minutes. And then we can also navigate to our v3 instance and query our data as well from, let’s say, the last hour and see that we are successfully writing data. And I also made sure to specify that I only wanted to sync just one field, usage_system, instead of all of my fields, so we can see that that variable is also working as we expect. We’re selecting for the specific tags and fields that we want to write to our InfluxDB v3 instance. So yeah, this is probably the simplest template for getting started with Quix and InfluxDB. But that being said, it’s a great way to just make sure that your instances are communicating with each other before adding any other components that you might want to add, especially if you’re doing any sophisticated data processing or transformations. And before we leave, I want to take a moment to share some resources with you. So, the very first one is the InfluxDB v3 Python client. I recommend using it for any sort of data science that you might want to do with InfluxDB v3 or just for querying in general. It’s super easy to use. It has Polars and Pandas integrations as well. So, you can query Pandas and Polars DataFrames directly back from InfluxDB v3 and write them directly to it as well. We also have great documentation on the client library. And if you want more details about all of the options that are available to you there, you can use both of these as reference. I also want to–
TUN SHWE: 52:32
Yeah. Anais. Oh, sorry to interject. I was just going to say, as you demonstrated as well, the open-source connectors in Quix Cloud, we’re using the InfluxDB Python library, and we’ve not had any problems whatsoever. So, I think whoever’s maintaining that is doing a great job. We need to send them ice cream, I think.
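For reference, a minimal query with that v3 Python client, pulling the last hour of data back as a Pandas DataFrame, might look like the following; the host, database, and measurement names are placeholders rather than values from the demo.

```python
from influxdb_client_3 import InfluxDBClient3

client = InfluxDBClient3(
    host="https://MY_V3_CLOUD_HOST",  # placeholder host
    token="MY_V3_TOKEN",
    org="my-org",
    database="cpu3",
)

# SQL query against InfluxDB v3; mode="pandas" returns a DataFrame directly.
df = client.query(
    "SELECT time, host, usage_system FROM cpu WHERE time >= now() - INTERVAL '1 hour'",
    language="sql",
    mode="pandas",
)
print(df.head())
```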
ANAIS DOTIS-GEORGIOU : 52:51
I agree. Absolutely. And then also, I want to invite you to join our community Slack and ask any questions that you might have about this webinar, about time series, about Quix. Happy to help get you started and encourage all questions that you have. So please join us there. Also, our forums at community.influxdata.com. So here are also some resources for getting started with Quix.
TUN SHWE: 53:20
Yeah. I think if you have any questions or you’re trying to implement this template to do any sort of syncing from v2 to v3, or even if you’re learning about real-time data and streaming data, working with tools like Kafka and you’re a Python developer, please visit us in our Slack community. So, there’s a link there at the bottom. When you join, you’ll see we have various questions from people of different abilities. We welcome them all. So please join us there and show some support and follow our growing little humble open-source streaming library. So, head on over to Quix Streams on GitHub. And that little star is a reminder for you to star us on there to follow us and support us.
ANAIS DOTIS-GEORGIOU : 53:58
Yeah. And I’ll say the community in Quix has been so supportive. Every question that I’ve had, super responsive. And that always feels really good when you can join a community and know that there’s actually active people there available to help you. And yeah, the entire team has been really helpful, so thank you.
TUN SHWE: 54:15
Yeah. We span a few time zones. That’s the secret, actually, so it feels like. So yeah, it’s quite intentional. We really care about the developer experience and helping any developer on their education journey. So, thank you for saying that.
ANAIS DOTIS-GEORGIOU : 54:31
Yeah. Absolutely. And yeah. Similarly, I’ve shared some of these resources already. Get started with InfluxDB at cloud2.influxdata.com. That’s where you can sign up for a free account. Our documentation is great. You can also use InfluxDB University to take classes about all things InfluxDB and earn some digital badges. They’re all free. So that’s another great resource. And then I already mentioned the community forums and Slack, and I hope to see you there. And I hope you enjoyed this webinar. And now I just want to go back to the questions to make sure that we answered all of them. So, we have another question where someone is asking, “Is there something similar to Edge Data Replication on v3 to sync data from edge to cloud?” So, if you are writing data to an OSS v2 instance, you can still use Edge Data Replication to write data from a v2 instance to InfluxDB v3 cloud because our write endpoints are the same. But there is not a v3 open source option available yet (that’s coming later this summer), and there aren’t plans right now to update Edge Data Replication for v3 exclusively. So, I would encourage you to just use something like Quix instead if you are looking to write data from a v3 instance of OSS to a v3 cloud instance. And then we have another question where it says, “Any tips on moving big Influx v2 project to v3? Updating all the Flux queries to SQL queries doesn’t seem ideal.” So, I would say, again, this is another opportunity to use something potentially like Quix. I guess if you’re just trying to update all your Flux queries, yeah.
ANAIS DOTIS-GEORGIOU : 56:24
Unfortunately, I wouldn’t even recommend necessarily changing them to SQL, honestly, because you can’t really substitute some of the data transformation work that you’re probably doing with Flux in SQL. Flux had more capabilities than SQL did, although there was a real barrier or hindrance to adoption and learning; you kind of had to do some upfront legwork. That being said, SQL is quite limited compared to something like Python. And there’s a high probability that if you’re doing any sort of transformation or analytics work with Influx, you’d want to substitute that with something like Python. And I agree. It is a lot of work. It is frustrating. It is a huge pain point for our v2 users. And so that’s when I would encourage you to come to the community, give me your Flux scripts, let me translate them for you into something like Python, and then you can leverage a tool like Quix to replace your v2 task engine. So that’s what I would recommend doing.
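To give a feel for what that kind of translation can look like, here is a hedged sketch of a common Flux downsampling task rewritten as Python against the v3 client; the bucket, measurement, host, and window are illustrative, and in practice a Quix Streams deployment would wrap similar logic in a streaming application rather than a scheduled script.

```python
# Flux (a typical v2 downsampling task), roughly:
#   from(bucket: "cpu2")
#     |> range(start: -1h)
#     |> aggregateWindow(every: 10m, fn: mean)
#     |> to(bucket: "cpu_downsampled")
#
# A rough Python equivalent against InfluxDB v3 (illustrative names throughout):
from influxdb_client_3 import InfluxDBClient3

client = InfluxDBClient3(
    host="https://MY_V3_CLOUD_HOST",
    token="MY_V3_TOKEN",
    org="my-org",
    database="cpu3",
)

# Query the last hour back as a Pandas DataFrame.
df = client.query(
    "SELECT * FROM cpu WHERE time >= now() - INTERVAL '1 hour'",
    mode="pandas",
)

# Downsample to 10-minute means with Pandas, using the time column as the index.
downsampled = df.set_index("time").resample("10min").mean(numeric_only=True)

# Write the aggregate back under a new measurement name.
client.write(record=downsampled, data_frame_measurement_name="cpu_downsampled")
```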
TUN SHWE: 57:27
Yeah, please go to the community. I think that’s always the right answer there. So, for Quix Cloud, understandably, we have an overlap in customers with InfluxData as well. So, we’ve helped a couple of customers migrate some of their quite complex queries over to just using Python logic. So again, yeah. So, join us in Slack, basically, is the short answer. I think we have to take it on a case-by-case basis. There could potentially be some improvements we can make in the InfluxDB client library as well, which can suck in all the data in a very unstructured way and be able to deal with that intelligently. So yeah, we’re always looking for new use cases so that we can really figure this out and create something that works for everyone. So yeah, that’s a great tip.
ANAIS DOTIS-GEORGIOU : 58:15
Yeah. Thank you. So, I think that’s all I have. I don’t know if there’s any other final comments that you’d like to make.
TUN SHWE: 58:23
No, that’s it. I think those two action points, join us on Slack, star us on GitHub. That’s the take-home.
ANAIS DOTIS-GEORGIOU : 58:30
All right. Sounds good. Well, thank you so much, everyone, for joining us. Again, this recording will be available and sent out to you probably later today. And with that, I look forward to maybe seeing you in future webinars. Thank you so much, everyone.
TUN SHWE: 58:44
Thanks, everyone.
ANAIS DOTIS-GEORGIOU : 58:45
Bye.
Tun Shwe
VP of Data, Quix
Tun Shwe is the VP of Data at Quix, where he leads data strategy and developer relations. He is focused on helping companies imagine and execute their strategic data vision with stream processing at the forefront. He was previously a Head of Data and Data Engineer at high growth startups and has spent his career leading T-shaped teams in developing analytics platforms and data-intensive AI applications.
In his spare time, Tun goes surfing, plays guitar and tends to his analogue cameras.
Anais Dotis-Georgiou
Developer Advocate, InfluxData
Anais Dotis-Georgiou is a Developer Advocate for InfluxData with a passion for making data beautiful with the use of Data Analytics, AI, and Machine Learning. She takes the data that she collects, does a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.