Introducing InfluxDB’s New Time Series Database Engine
Session date: Feb 07, 2023 08:00am (Pacific Time)
InfluxData is excited to announce the general availability of InfluxDB Cloud’s new database engine! It is a cloud-native, real-time, columnar database optimized for time series data. InfluxDB’s rebuilt core was coded in Rust and sits on top of Apache Arrow and DataFusion. InfluxData’s team picked Apache Parquet as the persistent format. In this webinar, Paul Dix and Balaji Palani will demonstrate key product features including the removal of cardinality limits!
They will dive into:
- The next phase of the InfluxDB platform
- How using Apache Arrow’s ecosystem has improved InfluxDB’s performance and scalability
- Key features of InfluxDB Cloud’s new core — including SQL native support
Watch the Webinar
Watch the webinar “Introducing InfluxDB’s New Time Series Database Engine” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]
Here is an unedited transcript of the webinar “Introducing InfluxDB’s New Time Series Database Engine”. This is provided for those who prefer reading to watching the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
Speakers:
- Paul Dix: Founder and Chief Technology Officer, InfluxData
- Balaji Palani: VP, Product Marketing, InfluxData
- Caitlin Croft: Sr. Manager, Customer and Community Marketing, InfluxData
Caitlin Croft: 00:00:01.499 Welcome to today’s webinar. My name is Caitlin, and I’m really excited to have Paul Dix and Balaji here to talk about InfluxDB Cloud powered by IOx. This webinar is being recorded and will be made available by tomorrow morning. If you have any questions for the two of them, don’t be shy. Please post them in the Q&A, which you can find at the bottom of your Zoom screen. And we will answer all questions at the end. So without further ado, I’m going to hand things off to Paul and Balaji.
Paul Dix: 00:00:36.130 All right. Thanks, Caitlin. Thanks for coming to the webinar, everybody. So today, I’m going to give a brief overview of InfluxDB IOx. Obviously, last week, we launched a new version of InfluxDB Cloud, which is powered by IOx. So if you’re signing up for our cloud product any time after last Tuesday and you’re signing up in the AWS cloud provider, you get two options, either in Virginia or in Frankfurt. And both of those regions are generally available IOx regions. So we made ourselves these launch t-shirts for this thing. And I got mine yesterday and I was so excited. I was like, “Oh, I’m going to wear it here for the webinar.” But of course, my thing doesn’t capture it. My camera doesn’t capture it. So I figured I’d show everybody that, at the very least. [laughter] So yes, we have special DeLorean-like launch T-shirts for the event. All right, so let’s get into it. What is InfluxDB IOx? So first and foremost, it’s what I call a cloud columnar database that is optimized for time series-like workloads and queries. So what I mean when I say cloud columnar: a columnar database is a thing. It means something. They’ve been around for a few decades now. When I say cloud columnar, it means it’s designed specifically for cloud environments, where you have object storage and you have a scalable compute layer separate from that. One of the important properties of IOx, which is important to InfluxDB as a whole, is that, as a system, it has schema on write, right?
Paul Dix: 00:02:21.937 So you don’t need to define ahead of time what tables and stuff you’re creating. You basically just throw data at it, and it creates the schema as you go along, which makes it kind of really easy to get up and running and start using things. As I just mentioned, cloud columnar database, so we use object storage for persistence. We persist all the files in object store. And in the interim, while we’re ingesting it in real time, there’s a buffer layer essentially that makes it so that we can query the data in real time before it lands in object storage. But the goal is object store is the source of truth for the data that exists in the system long-term. IOx is a SQL-native database. So it’s built using a SQL engine. Under the hood, there’s a SQL query, a SQL parser, planner, execution engine, and we use that extensively. The other thing is InfluxDB IOx will support InfluxQL, our query language, natively. So this is coming soon. Because of the fact that InfluxQL is very similar to SQL, we were able to essentially create an InfluxQL parser that will parse a query but actually create a SQL query plan that can be executed by the engine.
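To make the schema-on-write idea above concrete, here is a minimal Python sketch. It is not IOx code: the in-memory `catalog`, the type names, and the choice to reject a value whose type conflicts with an existing column are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical in-memory catalog: table name -> {column name -> type name}.
Catalog = Dict[str, Dict[str, str]]


def type_name(value: object) -> str:
    """Map a Python value to an illustrative column type name."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def write_row(catalog: Catalog, table: str, row: Dict[str, object]) -> None:
    """Create the table and any new columns the first time they are seen."""
    # Validate against a copy so a rejected row leaves the catalog unchanged.
    columns = dict(catalog.get(table, {}))
    for name, value in row.items():
        seen = type_name(value)
        known = columns.setdefault(name, seen)
        if known != seen:
            raise ValueError(f"column {name!r} is {known}, got {seen}")
    catalog[table] = columns


catalog: Catalog = {}
write_row(catalog, "cpu", {"host": "a", "usage": 0.5})   # creates table "cpu"
write_row(catalog, "cpu", {"host": "b", "temp": 71})     # adds column "temp"
```

No table or column is declared ahead of time here; the schema is simply whatever the writes so far imply.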
Paul Dix: 00:03:40.420 So that will be launching soon, within the next two to three months within the IOx service. So why did we build IOx? We built it to enable a bunch of different things that we’ve seen our users and customers need over the last 10 years. So the first and foremost is unlimited cardinality, right? Within InfluxDB, obviously, we have measurements, tags, and fields. And people put values in tags, which are basically the dimensions on which they want to slice and dice their time series data. And the problem with that in previous versions of Influx is that, as the number of unique values that you wrote in for tags increased, the system performance slowed down. It became more expensive to ingest the data. Certain kinds of queries became much, much slower, queries where you wanted to run a computation across, say, a million individual unique time series. So IOx lifts this limitation because of its underlying design and architecture and makes it so that you can have completely unlimited and unbounded cardinality. You can write a unique value every single time you write a row in for a tag.
Paul Dix: 00:04:58.442 This separating of storage from compute is a big thing for us because it enables a bunch of different downstream features that we have planned in the coming years. It also means that we are able to separate out the different kinds of workloads, right? We can separate the ingest workload from the query workload. And we can actually create separate tiers of query processing workloads, right? So you can have one set of query processors for real-time queries, one set of query processors for larger-scale analytical queries or historical queries. And having that compute separated from the storage layer means that we can scale the compute layer up and down dynamically without having to manually move a bunch of data around, reshuffle, rebalance a cluster, or any of those kinds of things. We basically just skip over all of that. And then one of the other big pieces is bulk data import and export, right? So we get the feature request for bulk import for a few different use cases. One is basically just historical backfill of all of your data. But another one we hear frequently is people have systems at the edge or systems where they’re collecting high-precision data, and they don’t need it in the centralized store for real-time query capability. But they want it for historical analysis.
Paul Dix: 00:06:28.499 So basically, what they want to do is they want to have their system at the edge that executes stuff in real time and then, on some schedule, once an hour or once a day, upload the data in bulk, where it’s a much more compressed format. It’s cheaper to ingest. And then, on the export side, we see that for two big use cases. One is basically just creating backup systems and reading the data in other kinds of third-party systems. But the other big one is in data science and machine learning use cases, right? If you want to train a machine-learning model, you have to access the data in bulk and do a bunch of analysis on it. And the thing is, for a database that’s optimized to return an individual time series for some range of time, getting the data in bulk, traditionally, has been quite expensive to do, especially with previous versions of Influx. So with IOx, what we’ll enable is the ability to basically just get large files from object storage directly to do that kind of processing.
Paul Dix: 00:07:34.301 So when we think of time series data, I mean, ultimately, we think of time series as a way to analyze data. But there are different kinds of time series data. So traditionally, what most people think about is metrics, right? These are values at fixed intervals of time where you’re summarizing something, right? They’re collected regularly. The other type of time series data is just individual events that occur, right, trades in a market, request to an API, a machine turning on or off. These are basically irregular in nature. They can occur at any time. Sometimes, they’re very, very high frequency. Sometimes, they’re spread out and very low frequency. The interesting thing is that you can actually create metrics on the fly from raw underlying event data, right? If you have individual requests to an API, you can say, “Give me the 90th percentile response time in five-minute intervals for the last four hours.” And you can compute that on the raw event stream. And what you’ve just created essentially is a metric that you can look at to determine the health of your API server. And then traces is another use case that we hear about again and again in terms of people wanting to have a backend data store where they can keep all this data in one place and do this kind of real-time query for automation but also historical analysis.
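The “90th percentile response time in five-minute intervals” example above can be sketched in a few lines of Python. This is only an illustration of turning raw events into a metric: the `(unix_seconds, response_ms)` event shape and the nearest-rank percentile definition are assumptions, and a database engine may define percentiles differently.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 5 * 60  # five-minute intervals


def nearest_rank_percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p90_by_window(events: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Group (unix_seconds, response_ms) events into windows and take p90 of each."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, response_ms in events:
        window_start = ts - ts % WINDOW_SECONDS  # truncate to the window boundary
        buckets[window_start].append(response_ms)
    return {
        start: nearest_rank_percentile(vals, 90)
        for start, vals in sorted(buckets.items())
    }
```

The input is irregular event data, while the output has one value per fixed interval, which is exactly the shape of a metric.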
Paul Dix: 00:08:57.593 And our goal with IOx is that it can be the one place for all these different kinds of data: metrics, events, traces, also even log data. All of this data we view as essentially data that you want to do time series analysis on. And IOx should be an ideal place to ingest that data in real time and build automation systems on top of it and be able to do historical analysis on it. So I wanted to talk about some of the technologies that we used to build IOx because it is very, very different than previous versions of InfluxDB. Previous versions of InfluxDB were written in Go. We were very vocal about that early on. But IOx is actually written in the programming language Rust. I’ve talked about my excitement for this language a few times over the last four years and written about it. Ultimately, I really think that Rust is essentially the future of systems software. It gives you fine-grained control over memory, and it gives you the safety of a higher-level language. And it has a great model for concurrent applications, which is obviously like most server-side software. The guarantees that the compiler gives you make it so you can eliminate data races, right? There are whole classes of bugs that Rust basically just eliminates, and that’s super advantageous for creating a system in it.
Paul Dix: 00:10:30.611 It’s also nice because it’s embeddable into other systems and other languages, right? You can now also compile it down into Wasm, which is great. These are things we want to take advantage of over time, which is this ability to pull different parts of the database, different parts of the database code, into different third-party applications and languages and systems. So ultimately, I think there’s a ton of stuff to love about the Rust language, but this talk isn’t that. This talk is about the pieces of IOx. So the next thing I wanted to talk a little bit about was Apache Arrow. So Arrow was started in 2016 by Wes McKinney. And it originally started as an in-memory columnar data specification to help data scientists exchange data using fast, zero-serialization, zero-copy data interchange between different platforms and languages. It then later expanded into persistence by bringing in Apache Parquet as the persistence format. And it also expanded into essentially the RPC layer, originally with Apache Arrow Flight and now, most recently, with Apache Arrow Flight SQL, which is essentially a new standard for database systems or SQL database systems to do fast data transfer between clients and the servers.
Paul Dix: 00:11:58.481 So we are very serious about supporting Flight SQL. IOx supports Flight SQL today in our cloud environment. And there are a few interesting integrations that I’ll talk about later that that’s given us. So despite the fact that Arrow is relatively young, it’s getting significant adoption in the data science world. And it’s getting more and more adoption in what I call the data warehousing landscape or big-data landscape, as a bunch of these different systems will accept Parquet data or will allow you to export data in the Parquet format. And many of them are starting to adopt Flight SQL as a way to communicate with these databases and send millions of records back to the client very, very quickly. So our design within IOx is we basically have this idea of hot versus cold data, right? In most of our time series use cases, people are querying data at what I call the leading edge, right, this data that is recently ingested. And they’re doing things like real-time dashboarding or they’re building automation, where the queries are all against the last five minutes of data or the last hour or the last few days, right?
Paul Dix: 00:13:13.389 And the goal for IOx as a system is to enable those kinds of queries and make them fast but also still have the historical data available for query on the fly. So hot is essentially what we keep in memory. This is the buffer data that we’re ingesting, but it’s also what’s cached for the query workloads as different queries are coming in. And then cold data is essentially the data in object store. And the idea there is you can have multiple petabytes of data total that you’re keeping. But realistically, many times, there’s only hundreds of gigabytes that’s the data that you’re actually actively querying. The thing that makes this all complicated is that which 100 gigabytes you care about is constantly changing as you’re ingesting data. So IOx as a system is designed to manage that flow of data from memory into cold object storage and make it fast and efficient.
Paul Dix: 00:14:18.595 All right. So we’re going to get a little bit deeper into some of the underlying pieces of IOx. So first, just as a little bit of review, we created something called line protocol for InfluxDB, which is basically the data format for sending data in real time, right? Now, it looks like this. Basically, you have a measurement, then tag key-value pairs separated by commas, and then a space, and then field key-value pairs separated by commas, and then a nanosecond timestamp. Now, tag values are always strings. And field values can be either an int64, a float64, a uint64, a Boolean, or a string, right? So unlike some other specialized time series databases that only support floats, InfluxDB supports a variety of different data types, which means you can store event data or even log data in it, right? So as an example below, we have a measurement called CPU load. We have a tag set, so we have a host. We have a region. And we have a set of fields, and we have a timestamp. So the way data is organized within IOx is very, very different than how it’s organized in InfluxDB, right?
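The slide with the line protocol example is not part of the transcript, so here is a small Python sketch of the layout described above: measurement, comma-separated tags, a space, comma-separated fields, a space, and a nanosecond timestamp. The helper names `to_line` and `format_field` are hypothetical, and the `cpu_load`, `host`, `region`, and field names are stand-ins for the slide’s values. Unsigned integers (written with a `u` suffix in line protocol) are left out because Python has no separate unsigned type.

```python
from typing import Dict, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value: FieldValue) -> str:
    """Render one field value with its line protocol type marker."""
    if isinstance(value, bool):      # checked before int: bool subclasses int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"           # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)           # floats are written bare
    # Strings are double-quoted; escape backslashes first, then quotes.
    return '"' + _escape(value, '\\"') + '"'


def to_line(measurement: str, tags: Dict[str, str],
            fields: Dict[str, FieldValue], timestamp_ns: int) -> str:
    """Build one line: measurement,tags fields timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):         # sorting tag keys keeps output stable
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


line = to_line(
    "cpu_load",
    {"host": "server01", "region": "us-west"},
    {"load": 0.64, "cores": 8},
    1675756800000000000,
)
# line == "cpu_load,host=server01,region=us-west load=0.64,cores=8i 1675756800000000000"
```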
Paul Dix: 00:15:41.547 The underlying storage engine in InfluxDB 1 and InfluxDB 2 had a specific structure that kind of created this metadata index, this inverted index for looking up tag values and stuff like that, and then the actual raw underlying time series data store. Now, with IOx, the organization looks more like a relational database, right? So you have a table. You have columns in the table. And then you have the records themselves. So within IOx at the very top level, you have a database or a bucket. Beneath that, you have the different measurements. So each measurement is a table. And then beneath that, we partition the data. Now, by default, we partition the data by day. So basically, all the data for a given day in one measurement is going to fall into one partition. And then beneath the partition itself, you have a series of chunks, which are basically just Parquet files, right? And those Parquet files are organized in object store, right? So what that means is, when a query comes in, it looks at the query. It says, “Okay, this query is for the mem table. It’s hitting 2022-12-15. And it’s hitting this time range.” And we know from the metadata that we keep around which Parquet files have data for each individual time range. And then we execute the query on those Parquet files that basically match.
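The lookup described above, from a table and time range to the Parquet files that need scanning, can be sketched as follows. This is an illustrative model only: `ParquetFileMeta`, its fields, and the example paths are hypothetical, and the real IOx catalog is not described in the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List

NS_PER_SECOND = 1_000_000_000


@dataclass(frozen=True)
class ParquetFileMeta:
    """Hypothetical per-file metadata kept alongside the object store."""
    table: str
    partition_key: str  # one partition per day, e.g. "2022-12-15"
    min_time_ns: int
    max_time_ns: int
    path: str


def day_partition_key(timestamp_ns: int) -> str:
    """Default partitioning in this sketch: the UTC calendar day of the row."""
    seconds = timestamp_ns // NS_PER_SECOND
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def files_for_query(files: Iterable[ParquetFileMeta], table: str,
                    start_ns: int, end_ns: int) -> List[str]:
    """Return paths of files for `table` whose time range overlaps the query."""
    return [
        f.path
        for f in files
        if f.table == table
        and f.min_time_ns <= end_ns      # file starts before the query ends
        and f.max_time_ns >= start_ns    # file ends after the query starts
    ]
```

Only the metadata is consulted to pick files; the Parquet data itself is read after the candidate set has been narrowed.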
Paul Dix: 00:17:07.844 Now, in the future, we will have user-configurable options to further refine how data is partitioned. So instead of maybe partitioning by day, you could say, “You know what? I want to partition by week, and I want to partition by region because I know region is something I’m always going to have in this table.” And the advantage there of changing how you partition the data is, when you come to execute a query, it can use the partitioning rules to kind of narrow the scope of which Parquet files have to be run against to answer the query. So basically, it means you’ll be able to have larger and larger individual tables and still have fast query performance if you’re able to align the kinds of queries you’re doing with the partitioning scheme that you have. So as I mentioned before, the way to think about the schema in InfluxDB combined with IOx is a measurement is a table. Tags are basically just columns. Those are part of what I call the primary key for that table. The fields are also columns. And then there is a column called time, which is also part of the primary key.
Paul Dix: 00:18:22.036 So one thing that means is that you can’t have tags and fields with the same name. They have to be unique. And the other thing, which is not totally obvious, which is basically, when you’re designing your schema, you only want to make something a tag that’s actually part of what I call the primary key. If the thing you’re thinking about putting in there is basically just a descriptor or something that you may query on later but it’s not part of the primary key, it’s best to have that actually be a field, right? So in many cases, if you’re collecting data for individual hosts, like CPU data for a host, right, so the host would be a tag. And the CPU ID would be a tag. But everything else would be a field, right, what region it’s in, what service it applies to, all this other stuff. All that would be fields rather than tags, which again, is a little bit different from how you think about schema design in InfluxDB 1 or 2.
Paul Dix: 00:19:30.592 So the bit I wanted to close with was just a highlight of some of the integrations that we have with IOx, right? Because of the new SQL capabilities and because of Flight SQL, it’s enabled a bunch of interesting integrations. So the first is we have a Flight SQL Python library. And what that means is, in a couple of lines of code, you can import the Python library and actually execute queries against the IOx and get a ton of data back very, very quickly, and then easily convert it into a Pandas DataFrame or something like that. It’s very, very fast. Flight SQL is available in other languages, and the Arrow project has Flight SQL client libraries to varying degrees of maturity. Flight SQL itself is just over a year old in terms of when they announced it. So it’s been moving a lot over the last 14, 15 months, I think, since they first announced it. But there are libraries for Java. There’s a C++ library that’s first class. And then most of the higher-level languages wrap that, right, Python, Ruby, Node.js. They all wrap that.
Paul Dix: 00:20:42.702 We built a Flight SQL plugin for Grafana. And specifically, for this plugin, we built it as a Flight SQL plugin, not as an InfluxDB plugin. Our goal here is to push Flight SQL as the standard that a larger ecosystem of database vendors and third-party tool developers can adopt. So we submitted that. I’m told that it might be going into the Grafana 9.4 release, which is great. So hat tip to Grafana for picking that up really quickly because we didn’t submit it that long ago. We also built a plugin for Apache Superset. So Superset might be new to some of you, but it’s basically like a business intelligence tool. It’s like a front-end dashboarding tool. It was originally built at Airbnb, and then they open-sourced it and made it part of the Apache foundation. And it’s really a quite — it has a lot of momentum behind it, so it’s quite mature for how long it’s been around. So we built a Flight SQL plugin for that. So again, you can do things like build dashboards. You can build the reports and other kinds of things that you’d find in a traditional BI tool.
Paul Dix: 00:22:00.637 We’re also working on support for JDBC. So JDBC is the database connection standard. And what that will enable, once we have a JDBC driver, is a bunch of third-party tools to use that. So Tableau, for instance, would be able to connect, and ODBC also coming soon. So at that point, you’d get Power BI. So basically, all these different third-party tools for doing visualization or analysis, those are the kinds of things we want to enable with Flight SQL and with IOx generally. So that is all I have for the presentation. I will, at this point, hand it over to Balaji.
Balaji Palani: 00:22:44.356 Thank you, Paul. All right. Good morning. Good afternoon, everyone. What I’m going to do is show you a quick demonstration of our cloud product that we launched last week. Caitlin, are you able to see my other screen, the bigger screen which has — okay, good. Thank you.
Caitlin Croft: 00:23:04.445 Yes, yep, seeing the get started page.
Balaji Palani: 00:23:08.583 All right. Awesome. Thank you. So if you want to get a cloud account or a cloud tool, go to influxdata.com, sign up. If you go directly, you will see AWS and the two regions, Virginia and Frankfurt. You can select that. I already created it. I have some data pumping in, so I wanted to quickly show what I’ve got here. So this is the home screen. It’s powered by IOx. You can see that. I would also point to the docs. So if you click on that link, it takes you to the documentation page, which we have a separate page for docs. It talks about writing data. You can use Telegraf, import a CSV, if you want to do migration from the front end. And we’re still working on the migration tools. But if you want to get it from the front end, you can do that, and then querying the data with SQL. And I’ll talk about Grafana and Superset in a second. So if I come back to this screen, if I go into — I have some buckets already created. I have some there data pumping in, but I just want to quickly go through.
Balaji Palani: 00:24:11.755 So this is a data explorer, a new data explorer that you will see. There are some changes in my thing, but you should see basically on the right. This is the SQL panel where you can type in your SQL. You can also browse your schema. So for example, you see here buckets, so I can see all my buckets. If I select something, the measurements, all the measurements underneath that should load up. So you can select specifically, for example, spam. It should give you all the fields and tags and so on. The other thing which you can also do, and I’m going to use my SQL cheat sheet here, I’m not typing in the whole thing. So if you do a show tables, this should also give you kind of the quick schema information. So show tables will show you all of the measurements in that particular bucket. By the way, you should be selecting a bucket here. If you change the bucket, for example, to system usage, you should see a different result. So all of these measurements under this bucket should be under show tables.
Balaji Palani: 00:25:16.405 And if I were to show another one, show columns, for example, I’m going to take the HTTP request. So if you do that, it should give you all of the columns. So like what Paul mentioned earlier, you can see all of the columns, dictionary type. And then there’s also timestamp, which is the time. Time is a column. If you’re familiar with the older system, like 2, underscore time (_time) would be a column. But in this case, you could just say time, and that should be a timestamp column. And then you can see all of these fields, as Int64, Utf8, and so on. So let’s do some quick queries here. So I’m going to say, “Okay, how many total records are there in this particular table?” So I’m just going to try and do that. So this would be select count star from HTTP request. There are about 3.5 million rows of records. Basically, what I’m showing here is that we collect all of these API requests that come in for only monitoring [inaudible] of that, put it into my account. And it’s showing you for the past 30 days.
Balaji Palani: 00:26:29.743 So I can do some additional things here. Again, this is completely SQL, right? So you can also see how fast it gets me that data, especially the aggregation queries. So this is basically — I want it to show me all of the count by endpoint. If I’m going to say, “Hey, show me everything, the count of that by particular endpoint,” endpoint is a field. So if I were to select this, you can see endpoint is a tag that you see here. So I’m showing all of the count by endpoint that are between for the past 30 days and sorting them, right? So I’m sorting them, reverse sorting them, by descending orders, as you can see. A lot of these requests are coming in for api/v2/write. That totally makes sense, and then task query and then api/v2/query and so on. Second, I just want to show you quickly how easy is it. And you can see the time here in milliseconds. I can do a little bit of complex queries. For example, “Hey, show me all of the statuses, which are 5XX or 500, 501s. I want to see all of the count by endpoint.” I can do that too.
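The two aggregations in this part of the demo, count by endpoint and count of 5XX responses by endpoint, are plain SQL. The exact queries are not in the transcript, so the sketch below reconstructs their shape and runs them against an in-memory SQLite database as a stand-in for IOx. The table name `http_request`, the columns `endpoint` and `status`, and the sample rows are assumptions; the 30-day time filter is omitted because time functions differ between SQLite and IOx.

```python
import sqlite3
from typing import List, Tuple

COUNT_BY_ENDPOINT = """
SELECT endpoint, COUNT(*) AS requests
FROM http_request
GROUP BY endpoint
ORDER BY requests DESC
"""

SERVER_ERRORS_BY_ENDPOINT = """
SELECT endpoint, COUNT(*) AS errors
FROM http_request
WHERE status >= 500 AND status < 600
GROUP BY endpoint
ORDER BY errors DESC
"""


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory table with made-up API request rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE http_request (time TEXT, endpoint TEXT, status INTEGER)")
    rows = [
        ("2023-02-01T00:00:00Z", "/api/v2/write", 204),
        ("2023-02-01T00:00:01Z", "/api/v2/write", 204),
        ("2023-02-01T00:00:02Z", "/api/v2/write", 500),
        ("2023-02-01T00:00:03Z", "/api/v2/query", 500),
        ("2023-02-01T00:00:04Z", "/api/v2/query", 503),
        ("2023-02-01T00:00:05Z", "/api/v2/tasks", 200),
    ]
    conn.executemany("INSERT INTO http_request VALUES (?, ?, ?)", rows)
    return conn


def count_by_endpoint(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Requests per endpoint, busiest endpoint first."""
    return conn.execute(COUNT_BY_ENDPOINT).fetchall()


def server_errors_by_endpoint(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """5XX responses per endpoint, worst endpoint first."""
    return conn.execute(SERVER_ERRORS_BY_ENDPOINT).fetchall()
```

Against IOx the same statements would be sent through the data explorer or a Flight SQL client instead of `sqlite3`.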
Balaji Palani: 00:27:49.722 Again, it’s by descending order, so most of the 500s are v2/query or v2/write. Again, you’re talking about three point something million requests, and it’s a sliver. You’ll get some Python [hooks?]. So if I keep going, let’s say I want to measure the amount and size and capacity of my cluster. There is an environment tag here. So I’m going to just have some fun with that and show you the max request bytes. I’m calculating how much payload is in every API request [divided by?] a thousandth thousand. So this is in gigabytes and by environment in the past 30 days for write. So this will give you, “Hey, what’s the maximum somebody has tried to write in?” And this is for every hour or so. Within the hour, there was someone whose maximum was 490 gigabytes, and 444 in the USC2. So there are some additional things that you could do here.
Balaji Palani: 00:28:54.896 We also support percentiles. So this is another sort of query. And keep in mind, these are things that happen really quickly, again, to 45 milliseconds. You don’t see any of them going over a second at all, right, so really fast. And that was the whole design part of it. We chose the hot side of things, which comes really fast from memory. And this is a percentile. I want to show the 90th percentile of request bytes for write by environment, just to show, “Hey, if I just go by the 90th percentile, then what would it be?” So it shows me that USC2 and [central?] are a little bit above 200, 283. But the rest are below a gigabyte. If you want to do it in megabytes, you can do that too. So this would be in megabytes, and you can see all the megabytes. So anyway, I just want to quickly show that, hey, your data explorer supports SQL. You can run some really, really fast SQL queries here.
Balaji Palani: 00:30:00.863 But next, I want to go to Superset. I mean, Grafana, I think many people use Grafana. So we do have a Flight SQL plugin. Again, one of the reasons we chose Flight SQL is it really optimizes the responses to [inaudible]. Apache Arrow, Parquet, all of them deal with columnar data. And Flight SQL really optimizes that: instead of trying to make the data into rows and sending it across, it just sends over the entire columns. It’s very, very fast in terms of receiving the data and so on. So Grafana is there, but I want us to actually go into Superset. So we do have instructions for Superset. For example, I used the Docker container for Superset to host on my own local machine. And it has pretty good instructions. You can set it up. So this is my Superset, for example. One of the things that Superset requires is database connections. So I’ve set up the IOx database connection here. So if you go into it, you would see that we use the DataFusion Flight SQL. And this is kind of the endpoint, 443, and then the bucket name and then the token.
Balaji Palani: 00:31:15.835 These are the things that you need to put it in. And once you do that, your database connection should be set up. So you would have to set up a database connection per bucket. So every bucket you create a token. By the way, if you’re not familiar with tokens, you can go through API tokens. And you can either generate an all-access API token, meaning you gave it everything, or you can use custom API permissions, which is what I use. So you can select the bucket system usage and provide it read and write privileges, and it should give you a token that you can copy to Superset. So the other thing I wanted to show is Superset also has a SQL lab, super useful. So you can select the database, like I did, schema. Select the IOx schema. And then it should also show you the table schema, but I’d just quickly run a SQL query. You can use this to write your SQL queries. You do have to set up your datasets. So datasets are the way Superset works — I’m sure — is you create a database connection, then the dataset. Then you create the charts, which you can use it in your dashboard.
Balaji Palani: 00:32:31.561 So I just created an API request by endpoint chart, which is very drag and drop, right, so super useful here. So I drag and drop the dimension, endpoint. For metrics, I said, “Hey, I need a count of request bytes.” And then the aggregation IN (‘count’) will show you everything. I mean, I have aggregation IN (‘count’) and [inaudible], so I combine the two. It’s also querying last week. That’ll give me all the endpoints and show me what are the number of API requests that’s available there. So I’m actually using this in one of my dashboards. So let me just quickly go there to show an example of what you can build. So this is what I built using my data. Again, this is a sliver of the data that we have, all of the API requests. So you can write number of orgs querying data, number of unique organizations in cloud, number of organizations querying data. Again, this is in the past week, right, so. And then there was about 11.4 terabytes of data written. Again, this is total bytes ingested in terabytes. And this is number of queries run. And you could just look at different things, like api/v2/write. This is [inaudible] byte, so how many terabytes have been written for that?
Balaji Palani: 00:33:54.774 So again, a quick way to show your data, and this is the chart that we created that I showed you earlier. So some quick ways to create your dashboard, and you can also hit refresh. Set up your auto-refresh interval. So there are some useful things that Superset provides [inaudible], and I’m super pleased with that. One final demo before I move on, Paul talked about traces. Again, this is super early. But if you are interested in traces, if you’re interested in the value prop that InfluxDB provides in terms of, hey, storing metrics, events, and traces in a single data store, if that appeals to you, check this out under GitHub InfluxDB observability. I’m going to put this in the chat for you guys. This is a repo maintained by one of our engineering managers, super passionate about tracing. In this, what I have for you is using the HotRod, which is an OpenTelemetry use case or sample, which Jacob has built, rebuilt using InfluxDB. So you can provide an InfluxDB a bucket, and you can also provide — there’s a button called “archive a trace”. So if you want to store those archived traces in a different bucket, you can provide that, provide the token and your region-specific URL. And same thing here, once you build it, you should see once you launch the [inaudible] example.
Balaji Palani: 00:35:39.014 These are all things that would fire off a series, a bunch of requests, and that actually is captured in a trace. For example, I just clicked on all of these buttons. And if you look at past five minutes, these are all the traces. So this is actually using InfluxDB as a backend. But what is interesting again, these are things, if you want, you can use a plugin. But again, Jacob built a Jaeger query plugin which actually knows how to query from the InfluxDB backend. There’s an “otel2influx” plugin that you can use for your OpenTelemetry collector. And also, this is for your Telegraf collector, if you are going to use that. But the thing about this, how useful it is to store it in IOx, for example, is not this. But I’m more excited about what I have here because, for example, by choosing OpenTelemetry and looking at the spans, I can now do some analytical queries on all of the tracing that I have stored in my back end.
Balaji Palani: 00:36:57.250 So if you’re building some additional things that, hey, your OpenTelemetry is not — I mean, the [inaudible], looking at the traces individually if you want to combine them together. For example, this query shows me all of the services where that is contained within the spans in the past, whatever trace, and you can choose seven days, for example, or five minutes, and it will still work. So you can do that, and that will show you all of these services. You can also do, “Hey, show me the average duration that is spent in all of the services and the methods that are run.” You can also do that, right? So for example, these are the worst durations — so I know that [inaudible] HTTP get dispatched in the front end service takes about 737 or so. And again, this gives you more flexibility, more access to the data, and you can do more with the tracing. For example, you can actually combine metrics and traces together to create your own dashboards, which again, I used Superset to create my own thing. So total number of traces, again, this is going past seven days. And then you can see worst services by duration and so on. It just opens up so many possibilities that you can use with the tracing and metrics and events data on Influx. So that’s my quick demo. That’s the end of my presentation.
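The "average duration per service and method" query described above can be sketched as plain SQL. The example below is only an illustration: it runs against an in-memory SQLite database as a stand-in for InfluxDB, and the table name `spans`, the column names, and the sample rows are all hypothetical, not the schema used in the demo.

```python
import sqlite3

# Hypothetical span table. The table and column names are assumptions for
# this sketch; the schema written by the real integration may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spans (service_name TEXT, method TEXT, duration REAL)"
)
# Made-up sample rows, not data from the webinar.
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?)",
    [
        ("frontend", "HTTP GET /dispatch", 100.0),
        ("frontend", "HTTP GET /dispatch", 300.0),
        ("driver", "FindNearest", 50.0),
        ("route", "HTTP GET /route", 20.0),
    ],
)

# Average duration per (service, method), slowest first.
AVG_DURATION_SQL = """
SELECT service_name, method, AVG(duration) AS avg_duration
FROM spans
GROUP BY service_name, method
ORDER BY avg_duration DESC
"""


def slowest_operations(connection):
    """Return (service, method, average duration) rows, slowest first."""
    return connection.execute(AVG_DURATION_SQL).fetchall()


rows = slowest_operations(conn)
for service, method, avg_duration in rows:
    print(f"{service:10s} {method:20s} {avg_duration:8.1f}")
```

The same `GROUP BY` shape would extend to a time filter (for example, the past seven days) by adding a `WHERE` clause on the span timestamp column.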
Caitlin Croft: 00:38:40.082 Awesome. Thank you both. That was fantastic. So I know there’s a ton of questions that came into the chat, and I know a lot of them have already been answered through the Zoom chat. But we’re just going to rattle through them just so that people watching the recording can also have this information because I think there are some really good questions that have already been asked. So the first one is, what is the benefit of using InfluxQL over SQL? Will there be things that are possible in one versus the other, or would there be performance differences between the two?
Paul Dix: 00:39:17.416 Yeah. So I can speak to that. So whatever is available in SQL will be a superset of what’s available in InfluxQL, right, because the InfluxQL just uses the same underlying engine. So basically, what you’re talking about is two different query front ends on top of a planner and executor. So the difference really is just aesthetic in terms of how you like to query your data. We’ve heard from a bunch of people over the years that, for some basic time series queries, they find InfluxQL just easier to work with and easier to use. And the performance that you can expect from the two is basically identical, right, because the InfluxQL query gets parsed, and an actual SQL execution plan is what gets created. So ultimately, a plan gets created, and that’s what happens. The real benefit or the primary benefit for all of our existing users is we will have — essentially, IOx will support InfluxQL natively but then separately will have a translation layer that will expose the InfluxDB version 1 query API layer, right? So you’ll be able to submit a query to it as though it were in InfluxDB version 1 with InfluxQL. And it will execute that and return your results in the same format that InfluxDB 1 did, which means, if you’re using third-party tools like Grafana or whatever, you’ll just be able to interact with it as though it’s an InfluxDB V1 database without having to rewrite your dashboards and all that other stuff.
Paul Dix: 00:40:59.137 Now, I will say that 100% bug for bug compatible version of InfluxQL is probably out of reach. But our goal there is to make it as close to the original InfluxQL as possible while also taking advantage of the performance benefits that we can get out of IOx, right? There are a large number of queries that will just be significantly faster, like orders of magnitude faster on IOx, than they will be on the traditional InfluxQL engine. And obviously, requests for better performance have been ongoing and outstanding for the entirety of the project. So that’s always good.
Caitlin Croft: 00:41:44.556 Awesome. What if you want organizations by tags instead?
Paul Dix: 00:41:51.747 What if you want organization by — I’m not quite sure what that means. I see the question, but I’m not quite sure what it’s asking, to be honest.
Caitlin Croft: 00:42:01.920 Okay. Sean, if you’re still on, we can always unmute you if you want to ask your question. But we’ll move on to the next one for the time being. What’s the order of tags in the primary key, example, sorted by name, order in which the tags are added, etc.?
Paul Dix: 00:42:26.668 Yeah. So for each individual table or measurement, the order is set the first time data is persisted from that measurement. And it’s set based on cardinality, lowest to highest, because that gives us the best compression in Parquet. But obviously, the schemas within InfluxDB can evolve, so you can add new tags later on. And we basically tack those on to the end, right? So we don’t change the tag ordering over time, even if the cardinality changes. It’s basically set on the first persist operation, and then we just go with that going forward. Now, reordering that data would be quite an expensive operation. But my guess is that, over time, we’ll give people both the ability to set the ordering that they want ahead of time. Basically, instead of doing schema on write, specify, “This is the schema I’m going to create, and I actually want this ordering for the primary key.” So that is something that we have planned at some point in the future roadmap. And then later, reorganizing the data, secondary indexing, all of those kinds of features, we essentially want to get to. And the idea is that separation of compute from storage is really what’s going to enable us to be able to deliver those features, which are basically very expensive to do. They’re expensive in terms of having a database perform those tasks. But because we’ve separated compute from storage, ideally, you’ll be able to spin up new compute infrastructure to run that task in the background separate from your production infrastructure that is actually servicing queries and all this other stuff, so.
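The ordering rule described above (tags sorted by cardinality, lowest to highest, at first persist, with later tags appended) can be sketched as follows. This is an illustration of the rule as stated in the answer, not IOx code; the function names are hypothetical, and the tie-breaking by tag name is an assumption added only to make the result deterministic.

```python
def initial_tag_order(rows):
    """Order tag keys by ascending cardinality (number of distinct values).

    `rows` is a list of dicts mapping tag key -> tag value, standing in for
    the data present at the first persist of a table.
    """
    distinct = {}
    for tags in rows:
        for key, value in tags.items():
            distinct.setdefault(key, set()).add(value)
    # Tie-break on the tag name: an assumption of this sketch.
    return sorted(distinct, key=lambda k: (len(distinct[k]), k))


def extend_tag_order(order, rows):
    """Append tag keys that appear later; existing positions never change."""
    result = list(order)
    for tags in rows:
        for key in sorted(tags):
            if key not in result:
                result.append(key)
    return result


first_persist = [
    {"region": "us-east", "host": "a"},
    {"region": "us-east", "host": "b"},
    {"region": "eu-central", "host": "c"},
]
order = initial_tag_order(first_persist)  # region has 2 values, host has 3

# A new tag shows up later: it is tacked on to the end, even though its
# cardinality (1) is lower than that of the existing tags.
later_rows = [{"region": "us-east", "host": "a", "env": "prod"}]
order_after = extend_tag_order(order, later_rows)
print(order, order_after)
```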
Caitlin Croft: 00:44:12.858 Does the order change if the cardinality changes?
Paul Dix: 00:44:16.279 Yeah. That was part of the —
Caitlin Croft: 00:44:17.549 Oh, is that?
Paul Dix: 00:44:18.353 Yeah, yeah.
Caitlin Croft: 00:44:18.734 Okay, okay. Cool. I think you kind of covered this already a little bit. But is Flux supported in InfluxDB Cloud powered by IOx?
Paul Dix: 00:44:29.477 So Flux is enabled in the API in InfluxDB Cloud, but it is not in the user interface. So again, for people who want to use Flux, they can use it in the API. But in the UI, what we’re pushing right now is SQL. And then later, when the native InfluxQL support is added, we will also likely expose that in the UI. So the reason for this is that the way Flux works with IOx is Flux is — it’s not just a query language. It’s also an entire scripting language, and the entire thing is written in Go. So we don’t have the bandwidth to be able to translate that over into a native Rust implementation. So what we have is IOx has a lower-level storage API that it exposes and that the Flux processor uses. Unfortunately, what that means is a lot of the query optimizations and stuff like that that happen within IOx are not made available to Flux. So Flux is essentially acting as kind of like a scripting client that’s pulling back a bunch of data and doing some things. So essentially, for the best performance, we’re pushing people to use either SQL or, when it comes out, InfluxQL.
Caitlin Croft: 00:45:46.286 Would it be possible in the future to write InfluxDB tasks in SQL?
Paul Dix: 00:45:53.455 That’s part of the plan, yeah, is to make it so that people can write tasks in SQL and also tasks in — the two programming languages we’re looking at in terms of being able to script stuff are either Python or JavaScript. But that’s kind of TBD, so. But yes, being able to write tasks in SQL and InfluxQL is part of the future roadmap.
Caitlin Croft: 00:46:22.840 Is the nullable field feature a welcome novelty in InfluxDB IOx?
Paul Dix: 00:46:29.948 So nulls in IOx are basically the same as they were in InfluxDB, which is we don’t actually store nulls, right? We store data in a columnar fashion. And then, essentially, if a value for a specific field at a given time is null, then there’s a null bit specified. But we don’t actually literally store nulls like you do in a traditional row-oriented database. So you can have null fields, but you could have null fields in InfluxDB 1 as well. It’s not a value that you can set explicitly. Yeah, I’m not quite sure what they’re trying to get at with the question, but yeah.
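The "null bit" idea in this answer can be illustrated with a small column that keeps values and a parallel validity flag per row. This is a simplified sketch in the spirit of a columnar validity bitmap, not the actual Arrow or IOx memory layout; the class name and the placeholder value are hypothetical.

```python
class NullableColumn:
    """A single field column with one validity flag per row."""

    def __init__(self):
        self.values = []  # a placeholder is stored in the slot of a null row
        self.valid = []   # True where the row actually has a value

    def append(self, value):
        self.valid.append(value is not None)
        self.values.append(0.0 if value is None else value)

    def get(self, index):
        # The validity flag, not the stored placeholder, decides nullness.
        return self.values[index] if self.valid[index] else None

    def null_count(self):
        return self.valid.count(False)


temperature = NullableColumn()
for reading in (21.5, None, 22.0):
    temperature.append(reading)

print([temperature.get(i) for i in range(3)], temperature.null_count())
```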
Caitlin Croft: 00:47:21.818 Really nice UI and query experience and data explorer. Is any of it using Ibis?
Paul Dix: 00:47:30.419 I don’t think so. I’m not familiar with that project. So I would say no, but.
Caitlin Croft: 00:47:38.958 Let’s see. A bunch of people have been asking. Are there any plans to release InfluxDB IOx as part of InfluxDB opensource during 2023?
Paul Dix: 00:47:49.702 Yeah. So we’re still trying to figure that out in terms of the timing. Our focus right now is obviously our recent cloud release. We have cloud-dedicated clusters that we’ll be releasing in late April. So basically, you’ll be able to say, “I want an IOx cluster of a certain size with a certain number of resources,” and it’ll spin it up on the fly for you. And it’s basically yours and yours alone, so you can do things like private networking and all that kind of stuff. And then, we have basically our enterprise product based on top of IOx. So essentially, a new major version of our enterprise product will be releasing in early August. So for our open-source efforts, right now, we’re not producing builds and documentation because we’re really just focused on, obviously, these commercial offerings. We are obviously contributing a lot of opensource in DataFusion and Arrow. And a lot of the IOx effort is actually happening in the open. We’re tentatively targeting doing something in open source in August, along with that enterprise release. But we don’t yet know what is going to be in that and what the target is going to be. Essentially, the way we’re thinking about it is that the target audience for open-source use is going to be kind of different than the target audience for a cloud product, for example, right, because essentially, we want to have different projects having different purposes in terms of where they fit in the data stack.
Caitlin Croft: 00:49:22.570 Paul, I like the hot versus cold tiers you are moving to. We do something similar with Kudu and Hive. Wondering if you are looking at Iceberg as part of your future roadmap.
Paul Dix: 00:49:33.899 Yeah. So we looked at Iceberg earlier on when we were first — when we were saying, “Okay, we have to create essentially what’s called a database catalog. And we have to do this on top of object storage.” For those that don’t know, Apache Iceberg is basically a standard for writing files in a table into object storage and basically keeping track of the files that are part of that table. At that time, Iceberg wasn’t far enough along for us to really commit to doing it because we would’ve had to do almost all of the work ourselves in terms of creating an iceberg implementation with Rust and all this other stuff. And the other thing, the other problem we had at that time, which could still be a problem, is for our use cases, we don’t have just a couple of tables that are super high throughput that have all of the data, right? Most of our customers have hundreds or potentially even thousands of individual tables that they’re tracking data on. And we didn’t have a strict requirement to have the catalog be only based in object storage. We actually separate the catalog into a relational database that keeps that transactional database. So having some sort of Iceberg integration is definitely something I’d like to have because it’s great to support those kinds of other standards. But we don’t have anything specifically planned at this point.
Caitlin Croft: 00:50:58.832 Stay tuned. You never know. [laughter]
Paul Dix: 00:51:00.300 Yeah. If enough people ask for something, then it becomes a priority.
Caitlin Croft: 00:51:07.420 What happens if the query does not have a time predicate? Does it scan the entire table?
Paul Dix: 00:51:14.665 Yes. Yes, it does. Yeah. I mean, again, there’s kind of no way around that. We organized the data by time. And if you’re not putting time in your query, then it’s going to do that. I will say, at some point, getting to secondary indexes is something we’d like to do, which would allow you, if you have some sort of query that you’re going to do or you know it’s not going to be organized by time, you’d potentially be able to support that. But that’s definitely not something we’re getting to this year. If we’re lucky, we’ll get to it next year, so yeah.
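Why a missing time predicate forces a full scan can be shown with a toy pruning function. It assumes data is split into one partition per day (Paul mentions partitioning by table and then by day elsewhere in this session); the function name and the half-open `[start, end)` range are assumptions of this sketch, not the IOx planner.

```python
from datetime import date, datetime, timedelta


def partitions_to_scan(partition_days, start=None, end=None):
    """Return the day partitions that overlap the half-open range [start, end).

    `start` and `end` are datetimes; None means unbounded on that side, so a
    query with no time predicate selects every partition.
    """
    selected = []
    for day in sorted(partition_days):
        day_start = datetime.combine(day, datetime.min.time())
        day_end = day_start + timedelta(days=1)
        if start is not None and day_end <= start:
            continue  # partition ends before the query range begins
        if end is not None and day_start >= end:
            continue  # partition begins after the query range ends
        selected.append(day)
    return selected


days = [date(2023, 2, d) for d in range(1, 8)]  # one week of day partitions

with_predicate = partitions_to_scan(
    days, start=datetime(2023, 2, 6, 12), end=datetime(2023, 2, 7, 6)
)
without_predicate = partitions_to_scan(days)
print(len(with_predicate), len(without_predicate))
```

With the time range, only two of the seven partitions are read; without it, nothing can be skipped.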
Caitlin Croft: 00:51:53.858 Is there any change to REST bucket creation, data ingestion, etc.? And then the follow-up is, what is the standard engine for new buckets if we have old buckets?
Paul Dix: 00:52:08.196 So for the API for creating buckets and stuff like that, that’s still the same. It’s all there. But if you have old buckets, if you signed up as a customer before the release of IOx, you are still on TSM, so the old storage system. So if you create buckets, those buckets are also going to be TSM buckets. Now, within your account, you are able to create a new organization, which can be an IOx-based organization. And that’s limited, again, to AWS Virginia or Frankfurt. We will be doing — probably starting in late May, migrating all of our existing cloud customers from TSM over to IOx. And the target is, by the end of the year, everybody will be operating on IOx-backed buckets. And when you create a bucket, it will just be an IOx bucket, and you kind of don’t have to worry about it. So because we have a very complex, large running production system, we have to do this in phases over time. So we’re in this weird in-the-middle phase where some customers are using IOx actively in production. Some customers are on TSM, and then there’s going to be a mix over time. But hopefully, that will all be all resolved over to IOx by the end of the year.
Caitlin Croft: 00:53:32.587 Let’s see. Is Chronograf going to be expanded? Or maybe the new UI for InfluxDB is on the way, especially when it comes to managing tasks and running on-prem installations.
Paul Dix: 00:53:46.732 So we don’t have plans to expand Chronograf. The UI within our cloud product is — for the IOx release, we actually removed a bunch of stuff from the UI to focus only on the basic query experience. Where we see the UI going over time is essentially basic data exploration and also administrative tasks for setting up new buckets, creating permissions. When we enable SQL-enabled tasks and stuff like that, there will be a way to manage that stuff. But essentially, the dev flow will likely be in your IDE, which will have hooks into the API. Or if you want some more complex dashboarding or visualization, what we’re really trying to do is focus on first-class third-party support, so Grafana, Superset — I think there’s one called Metabase or something like that — Power BI, Tableau. Our goal there is we want those third-party tools that people are already familiar with and they like to use to work really well with our cloud product, with our enterprise product, and with our open source, so.
Caitlin Croft: 00:54:56.777 So this next question, I realize you’ve already answered the first part. But I think the second part is an interesting one. Will we be able to choose which engine per bucket? And then the follow-up is, when do you expect to provide a migration tool or mechanism? I think you covered most of that, but.
Paul Dix: 00:55:13.928 Yeah. Basically, right now, if you already have — if you sign up into cloud right now on AWS, you get IOx buckets, and that is it. You do not get to choose to have a TSM bucket instead. For the people who are in the middle, they can create a new org and get IOx buckets. But again, by the end of the year, there won’t be any TSM buckets in our cloud environment, at least the cloud 2 environment. And if you create a bucket, it will be an IOx-backed one.
Caitlin Croft: 00:55:45.067 Does InfluxDB IOx support data updates?
Paul Dix: 00:55:50.749 So it does, but the behavior is undefined. I’d say it’s undefined behavior. So we don’t guarantee ordering of the data coming in. So if you have a value for field A and one writer sends it in and you update the value for field A with a different value, there’s no guarantee on which order those get applied. Now, what you will likely see from a practical standpoint in the production system is, if you write one value for field A and then a few minutes later you write a different value for field A, you’d probably see the last value that you wrote. But that is not guaranteed. Now, it is different if you’re saying I write a value for field A and then later, I write a value for field B. Well, there’s no conflict to resolve, right? So it doesn’t matter what ordering those are applied in. But essentially, in order to create a system that was horizontally scalable on the ingest layer and had all these different properties, we relaxed the constraints in terms of updating.
Paul Dix: 00:56:54.603 For most of our use cases, updating doesn’t make sense. If you’re updating values, it’s likely that what you need to do is kind of change your schema design, right? So a perfect example, I think — I’ve seen a lot of people ask about this, which is like I have a prediction model. And I’m writing values in for that prediction model. And when I change the model, I want to update those values. Well, actually you don’t, right? So what you should have is the version identifier of the model as a tag. And when you update the model, you bump that version identifier, and then you write all the data in. And you have the exact same timestamps. And then, at that point, what you have is you have both sets of time series from all of your different models over time, which is what ideally you would be doing, right? And again, the goal with the way the system is organized is the data is highly compressed. It’s kept on cheap object storage, backed by a spinning disk. So keeping more data around is something that you can actually do.
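The schema suggestion above (carry the model version as a tag instead of updating values) can be sketched by generating line protocol for two model versions at the same timestamp. The measurement, tag, and field names are hypothetical, and the helper is deliberately simplified: it does no escaping and only handles float fields.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line of InfluxDB line protocol (simplified, no escaping)."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


TIMESTAMP_NS = 1675756800000000000  # arbitrary example timestamp

# The same sensor and timestamp, written once per model version. Because
# model_version is a tag, these are two separate series, not an update.
line_v1 = to_line_protocol(
    "forecast", {"sensor": "s1", "model_version": "v1"},
    {"predicted": 20.5}, TIMESTAMP_NS,
)
line_v2 = to_line_protocol(
    "forecast", {"sensor": "s1", "model_version": "v2"},
    {"predicted": 21.0}, TIMESTAMP_NS,
)
print(line_v1)
print(line_v2)
```

Both versions stay queryable side by side, which is the outcome described in the answer: every model's series is kept over time rather than overwritten.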
Caitlin Croft: 00:57:58.478 All right. Let’s see. Considering the added compression of Parquet, how would that impact data rates and latency going from InfluxDB Cloud to another cloud, for example, AWS?
Paul Dix: 00:58:12.321 So when you query data, it’s not actually returned as Parquet data. It’s returned as Arrow buffers, which is not a — it’s not a compressed format, so it’s actually — it can be pretty big compared to the equivalent Parquet representation. We don’t have this in the API yet. But we will have an API to actually get the raw underlying Parquet files. And again, the data is partitioned by table and then by day. So you could theoretically get all the data for a given day. And it would be quite cheap, cheap in terms of there’s no query cost to it, and also in terms of data transfer because it’s more compressed. That API, I would expect to land in the not too distant future, right, within the next anywhere from two to four months, so. But yeah, the transfer from cloud provider to cloud provider is still going to be problematic at scale, so.
Caitlin Croft: 00:59:13.789 If InfluxDB IOx will be available in InfluxDB Enterprise, the current version of Chronograf might not support the new query language. I’m not quite sure what the question is.
Paul Dix: 00:59:26.939 Well, the current version of Chronograf works with the InfluxDB V1 API and InfluxQL. The enterprise release that we have scheduled for later this year based on IOx has SQL, but it also has InfluxQL natively exposed through a Flight SQL interface where you just specify the query type is InfluxQL. But it will also have, like I said, that wrapper layer that provides V1 API compatibility. So for anybody who’s coming from our enterprise V1 product to the IOx-based product, they’ll be able to just point Chronograf at it, and it should just work, yeah.
Caitlin Croft: 01:00:12.466 Awesome. So we are over, but we have a few more questions. Paul, do you have a little bit more time?
Paul Dix: 01:00:17.249 Sure.
Caitlin Croft: 01:00:17.714 Okay. Cool. How large is a partition, and how large is a Parquet file allowed to be? How does the immutability of Parquet affect you if you are, say, dealing with streaming data?
Paul Dix: 01:00:35.277 Yeah. So we try to keep the individual Parquet files no larger than about 100 megabytes. We may change that over time. There’s no limit on the size of an individual partition. But if you have, say, a single table that is very, very high throughput, what I expect is that we will have to add that functionality to essentially partition the data within that table by more than just time in order to make that efficient, particularly when you’re trying to query it out. So on the back end, what happens is data gets ingested. Smaller Parquet files get created in object storage, and then we have a compaction process that picks up those files, combines them together into larger files, runs deduplication and sorting on them so that you get bigger files in object storage. And then the querier has to do less work at query time. So that’s the compactor. It’s basically its job to be able to handle the fact that you’re streaming in data all the time, but you want it in these bigger Parquet files. It’s kind of a mismatch there.
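The compaction step described above (pick up small files, combine them, deduplicate, sort) can be sketched on in-memory rows. This is an illustration only, not the IOx compactor: small files are modeled as lists of dicts, the key columns are hypothetical, and "the later file wins on a duplicate key" is an assumption made for the sketch.

```python
def compact(small_files, key_columns=("host", "time")):
    """Merge small batches into one sorted, deduplicated batch.

    Rows sharing the same key columns are treated as duplicates. Keeping the
    row from the later file is an assumption of this sketch.
    """
    latest = {}
    for rows in small_files:
        for row in rows:
            key = tuple(row[column] for column in key_columns)
            latest[key] = row
    # Sorting by the key tuple yields rows ordered by host, then time.
    return [latest[key] for key in sorted(latest)]


file_a = [
    {"host": "b", "time": 2, "cpu": 0.4},
    {"host": "a", "time": 1, "cpu": 0.1},
]
file_b = [
    {"host": "a", "time": 1, "cpu": 0.2},  # duplicate key of a row in file_a
    {"host": "a", "time": 3, "cpu": 0.3},
]

compacted = compact([file_a, file_b])
for row in compacted:
    print(row)
```

Four input rows across two small files become three sorted rows in one larger batch, so a reader of the compacted output has less merging and deduplication to do at query time.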
Caitlin Croft: 01:01:46.676 Will Kapacitor real-time streaming of data work with InfluxDB IOx, or will it end with InfluxDB version one?
Paul Dix: 01:01:54.938 So if you’re asking about the subscribe API, that’s InfluxDB V1. That won’t be coming over in IOx. And actually, for all of our users that are using Kapacitor at scale, we actually tell them not to use the subscribe functionality because it puts a bunch of extra load on the database, which you don’t want. So for the people who are using Kapacitor at scale, what they’re doing is they’re writing data. They’re double-writing data, right? They write it into Kapacitor because you can write it as though it’s an InfluxDB V1 server. And then you also write it into InfluxDB. And then Kapacitor itself does its computations and whatever, and then it writes it into the database server. You’ll be able to use Kapacitor with IOx in that same way, but that subscribe API functionality is not going to be there. We will potentially have — I want to have some sort of PubSub feature within IOx, but it’s not a priority for this year, so yeah.
Caitlin Croft: 01:02:53.988 What if all the fields are null? Is the data point stored or not?
Paul Dix: 01:03:00.286 No. Yeah. No. I mean, right now, because IOx only supports the InfluxDB — well, it supports V2 write API right now. V1’s basically the exact same thing but with a different thing in the path. There’s no way to write data in that’s valid without it having a field in those APIs, so IOx rejects it. There’s no reason IOx itself can’t store that data. There’s no underlying requirement that there has to be a field. But again, with the schema design with IOx, there’s no reason to make all those things tags anyway. So it’s kind of like, yeah.
Caitlin Croft: 01:03:47.724 Cool. Thank you, everyone, for all of your fantastic questions. I’m glad to see you guys aren’t shy about asking Paul and Balaji your questions, so really appreciate it. If we didn’t get to your questions, I apologize. Everyone should have my email address. Feel free to reach out if you have any further follow-up questions. I’m more than happy to pull in Balaji and Paul to answer your questions. Once again, I am going to plug our community Slack. You can find it by going to influxdata.com/slack. I know there’s an IOx channel in there, and I know that Paul is always in there answering people’s questions. So I’m sure he’d be happy to answer even more of your questions. It was fantastic. I think we might need to turn this into a blog because there were so many valid questions that I think a lot of people are going to have. So thank you, everyone, for joining today’s webinar. Once again, it has been recorded and will be made available for replay by tomorrow morning, as well as the slides. So thank you, Paul, for your presentation and answering all those questions. Thank you, Balaji, for the demo, really appreciate it. And I hope to see all of you on a future webinar.
Paul Dix: 01:05:07.538 All right. Thanks so much.
Balaji Palani: 01:05:07.538 Thanks, everyone.
Caitlin Croft: 01:05:08.335 Thank you. Bye.
[/et_pb_toggle]
Paul Dix
Founder and Chief Technology Officer, InfluxData
Paul is the creator of InfluxDB. He has helped build software for startups, large companies, and organizations like Microsoft, Google, McAfee, Thomson Reuters, and Air Force Space Command. He is the series editor for Addison Wesley’s Data & Analytics book and video series. In 2010, Paul wrote the book Service-Oriented Design with Ruby and Rails for Addison-Wesley. In 2009, he started the NYC Machine Learning Meetup, which now has over 13,000 members. Paul holds a degree in computer science from Columbia University.
Balaji Palani
VP, Product Marketing, InfluxData
Balaji Palani is InfluxData’s Vice President of Product Marketing, overseeing the company’s product messaging and technical marketing. Balaji has an extensive background in monitoring, observability and developer technologies and bringing them to market. He previously served as InfluxData’s Senior Director of Product Management, and before that, held Product Management and Engineering positions at BMC, HP, and Mercury.