Understanding InfluxDB’s New Storage Engine
Session date: Jan 10, 2023 08:00am (Pacific Time)
Learn more about InfluxDB’s new storage engine! The team developed a cloud-native, real-time, columnar database optimized for time series data. We built it all in Rust and it sits on top of Apache Arrow and DataFusion. We chose Apache Parquet as the persistent format, which is an open source columnar data file format. This new storage engine provides InfluxDB Cloud users with new functionality, including the removal of cardinality limits, so developers can bring in massive amounts of time series data at scale.
In this webinar, Anais Dotis-Georgiou will dive into:
- Requirements for rebuilding InfluxDB’s core
- Key product features and timeline
- How Apache Arrow’s ecosystem is used to meet those requirements
Stick around for a demo and live Q&A!
Watch the Webinar
Watch the webinar “Understanding InfluxDB’s New Storage Engine” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Transcript
Here is an unedited transcript of the webinar “Understanding InfluxDB’s New Storage Engine”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Caitlin Croft: Sr. Manager, Customer and Community Marketing, InfluxData
- Anais Dotis-Georgiou: Developer Advocate, InfluxData
Caitlin Croft: 00:00:00.000 Hello everyone, and welcome to today’s webinar. My name is Caitlin Croft. And I’m joined by Anais, who’s one of our amazing developer advocates. And she’s going to be talking about the new InfluxDB storage engine. Once again, this session is being recorded and will be made available for replay by tomorrow morning. Please post any questions you may have in the Q&A. And I also just want to remind everyone to please be courteous to all attendees and presenters. We want to make sure this is a fun and safe, happy place for all of us. And without further ado, I’m going to hand things off to Anais.
Anais Dotis-Georgiou: 00:00:43.958 Thank you, Caitlin. And welcome everybody. It’s really cool to see where everyone was from and where they’re based. And today we’re going to be talking about understanding the new InfluxDB storage engine. So thank you, Caitlin, for the introduction already. I’m a developer advocate at InfluxData. And for those of you who don’t know what developer advocacy is, my role is to help represent the company to the community, and the community to the company. So I try and create blogs and tutorials and webinars to help educate you all and also answer questions in the Slack and forums to help you overcome any hurdles that you run into, but also bring product feedback back here so that we can make sure that we are in alignment with your needs. And I encourage you to connect with me on LinkedIn, if you so want. And yeah. At the end of this presentation, I’m also going to take five minutes and ask you all to fill out a survey. And so if you do have the time and capacity for that, I know I’d really appreciate your feedback. So thanks in advance.
Anais Dotis-Georgiou: 00:01:55.488 Okay. So the agenda for today. We’re going to start off by talking about what is the new InfluxDB engine. Specifically, what requirements does the new InfluxDB engine meet, and what goals did the engineering team have when they were creating it, so that we can better understand why they created it? And then we’re going to understand, through that process, the entire — well, not the entire, but a large part of the Apache ecosystem. Then we’ll talk about offerings and release timelines, so you know when it’s available to you, and when you can start taking advantage of all these new features. Then we’ll talk about new InfluxDB Cloud features that already exist today. We’ll talk about the SQL support that will be available as a part of the new InfluxDB engine coming soon. Then we’ll talk about interoperability plans and other exciting news in that area. Then I’ll ask that you take a survey, if you feel inclined. It’s anonymous. So no worries there. And last but not least, we’ll finish up with an introduction to some resources that you can take advantage of to learn more about what I talk about in today’s talk and, in general, where you can go and get help and ask any other questions that you might have.
Anais Dotis-Georgiou: 00:03:10.105 So what is the new InfluxDB storage engine? Well, it’s the new storage engine that will, first, power InfluxDB Cloud. And the new InfluxDB storage engine is built with Rust, Apache Arrow, and Arrow Flight — I won’t talk too much about Arrow Flight — DataFusion, and Parquet. And we need to understand how these technologies helped build the new storage engine, and we need to understand what they are in order to understand the benefits that the engine will provide us. So Rust is a programming language. And it’s very performant. And it offers fine-grained memory management. Just as a clarification point, I’m going to talk really generally about what these tools are now and go into deeper detail on them in a second. Arrow is a framework for defining in-memory columnar data. Parquet is a column-oriented durable file format. Arrow Flight is a general-purpose, client-server framework that simplifies high-performance transport of large datasets over network interfaces. And DataFusion is a query execution framework that’s also written in Rust and uses Apache Arrow as its in-memory format.
Anais Dotis-Georgiou: 00:04:29.484 So the little icons that I have next to the bullets are there to help you associate what each tool or technology piece is responsible for contributing to InfluxDB’s storage engine. So Rust, we want to think about memory management. Apache Arrow, we want to think about the in-memory columnar data format. Parquet, we want to think about the durable file format. Arrow Flight is how we transport data. And DataFusion is the query execution framework. So release details. When will this new storage engine be available? It’ll be available January 31st on InfluxDB Cloud on AWS in those two regions, US East and EU Central. So what were the requirements for this new storage engine, and why did the developers decide that they needed to build it? All these requirements that I’m going to list — they all come from Paul Dix, who is the CTO and founder of InfluxData, from his blog post entitled “Announcing InfluxDB IOx - The Future Core of InfluxDB Built with Rust and Arrow”. Internally, we call the new storage engine IOx, just because IOx stands for iron oxide, which is what rust is chemically. So if you see any blogs with that, don’t be confused. It’s just the new storage engine. It’s just what we call it internally. But all of these requirements come from that post.
Anais Dotis-Georgiou: 00:05:56.030 And so my goal today is to help you understand how each technology piece helps meet each requirement. So let’s go over all the requirements. The first requirement is that there are no limits on cardinality. Okay. Let’s take a step back. We can’t actually say that we expect to have unlimited cardinality because, if you’re an engineer, you’re not going to promise something truly unlimited, but near-unlimited cardinality. Essentially, unlimited cardinality. And we want you to be able to write any type of data. So if you want to have logs and traces, InfluxDB should be able to handle that. We also don’t want you to have to worry about what a tag or a field is anymore, which is such a relief. And so all of the technologies, Rust, Arrow, DataFusion, and Parquet, will contribute to this requirement and feature. Then we also want to serve best-in-class performance on analytics queries in addition to our already well-served metrics queries. Again, all the pieces will contribute to that. We also want to separate compute from storage and have tiered data storage. And the database should use cheaper object storage as its long-term durable store. DataFusion and Parquet are what are responsible for meeting this requirement or this feature.
Anais Dotis-Georgiou: 00:07:10.819 We also want to provide operator-controlled memory usage. The operator should be able to define how much memory is being used for buffering, caching, and query processing. If you’ve ever been in the forums or on Slack, you will undoubtedly run into community members asking questions about InfluxDB consuming memory. And how can I limit it? And how can I have more control over it? So this is a long-time ask. And it’s really exciting to see this being addressed and this feature being provided. Another one is bulk data export and import. This is not only helpful for migrations, but this is also helpful for continuing to view InfluxDB as your time series lake and being able to pull data out of it, and query and analyze it, and work on it, and transform it in the environment and with the tools of your choice. Because we found that while Flux offers a lot of query analytic capabilities and data analytic capabilities, the [inaudible] of the matter is that a lot of data scientists just want to use the tools that they’re familiar with. So the best thing that we can do is to meet them in the middle by facilitating bulk data export and import.
Anais Dotis-Georgiou: 00:08:22.905 And that also ties into the next requirement, which is broader ecosystem compatibility. Where possible, we should aim to embrace emerging standards in the data analytics ecosystem and allow there to be ecosystem compatibility so that people can take advantage of all the open-source tools that are available. And last, but not least, we want to be able to run both at the edge and in the data center and have it be federated by design. So now that we understand all the requirements at a high level, let’s go in and talk about how each technology helps us achieve those goals. And just so that we’re clear, when I reference the numbering in the rest of this presentation, I’m referencing these requirements. I’m not skipping numbers. But for Rust, I’ll talk about how it meets requirement one to six and seven, for example. Okay. So, yeah. Let’s talk about Rust and InfluxDB. So the first requirement was no limits on cardinality. Write any kind of event data that you want, and you don’t have to worry about what a tag or field is. So how does Rust contribute to this goal?
Anais Dotis-Georgiou: 00:09:31.295 Well, first, let’s take a step back and talk a little bit more about Rust and why it was chosen. And so Rust was chosen because of its exceptional performance and reliability. For those of you who aren’t familiar with it, it’s syntactically similar to C++ and has similar performance as it also compiles to native code. But unlike C++, it has much better memory safety. Memory safety is protection against any bugs or security vulnerabilities that lead to excessive memory usage or memory leaks. And Rust achieves this memory safety due to its innovative type system. It also is a systems-level language which does not allow any dangling pointers or buffer overflows by default. And a dangling pointer is a pointer that points to invalid memory, and they’re one of the main classes of errors that lead to exploitable security vulnerabilities in languages like C++. So Rust helps meet this requirement of no limits on cardinality because the new engine is built on the Rust implementation of Apache Arrow. More on this in the next section. And additionally, the approach to handling unlimited cardinality requires non-trivial CPU during query processing, and so therefore, you’re very reliant on squeezing the most performance that you can. And that is something that Rust is very well-suited for.
Anais Dotis-Georgiou: 00:11:03.692 So the second requirement, that we provide best-in-class performance on analytics queries in addition to our already well-served metrics queries. So how does Rust help meet this requirement? Well, it helps meet it because the benefits of memory optimization that Rust has to offer directly impact the storage performance and the query performance of the new engine. Rust is also the foundation on which the implementation of Arrow, Parquet, and DataFusion rests. So each technology that the new storage engine incorporates collectively reaches this goal. And it’s additionally worth mentioning — you can’t have an excellent analytics query without a query execution framework, which is DataFusion. And you can’t take advantage of excellent analytic queries without excellent durable storage, Parquet, and in-memory storage and data exchange, which is Arrow and Flight. So they’re all kind of related and the foundation is Rust. The fourth requirement is that you should have operator control over memory usage. This one, at this point, hopefully, is starting to feel a little intuitive. We have some idea that Rust is used for memory control. I just want to quote Paul Dix from the post I mentioned above. So Rust gives us more fine-grained control over runtime behavior and memory management. As an added bonus, it makes concurrent programming easier and eliminates data races. Its packaging system, Crates.io, is fantastic and includes everything you need out of the box with the addition of async/await to fix race conditions.
Anais Dotis-Georgiou: 00:12:37.386 So for example, buffering crates guard against buffer overflows with bounds checking. There are cache crates, which provide thread-safe async caching structures, which, again, help prevent data races. And just the fine-grained memory control that you would expect with C or C++, Rust also offers. But it has also improved upon many of the shortcomings, including security and multithreading. So if Rust gives us this memory control, we can then give that to you. Okay. And requirement six, broader ecosystem compatibility. Honestly, this is one of the requirements that I’m the most excited about with IOx — or sorry, with the new storage engine. And so how does Rust help? Well, essentially, because it’s the foundation of Arrow, Parquet, and DataFusion, it will contribute to the broader ecosystem compatibility. So this interoperability story will really become more clear when we talk about DataFusion and Parquet. And then requirement seven, it should be able to run at the edge and at the data center and be federated by design. So optimizing memory usage with Rust means that InfluxDB in the cloud will contain these memory optimizations, and so will InfluxDB at the edge and in the data center.
Anais Dotis-Georgiou: 00:14:09.459 Okay. Now, let’s talk about Arrow and InfluxDB. And let’s kind of also paint a picture for why Arrow came to be and what it is. So over the last few decades, a lot of companies have been leveraging really, really large datasets to perform increasingly complex analytics. And advancements in query performance, analytics, and data storage are largely the result of greater access to memory. Demand, manufacturing process improvements, and technological advancements have all contributed to cheaper memory. And lower memory costs have spurred the creation of technologies that support in-memory query processing for OLAP tasks. And they also support data warehousing systems. And so for those of you who aren’t familiar with OLAP, or online analytical processing, that’s just an acronym that describes any software that performs multidimensional analysis on large volumes of data. And that has all been coming around in the last decade or so largely due to how cheap memory has become. So that sets the stage for the creation of Apache Arrow. What is Apache Arrow? Well, it’s a framework for defining in-memory columnar data. And it aims to be the language-agnostic standard for columnar memory representation to facilitate interoperability.
Anais Dotis-Georgiou: 00:15:40.532 So there’s a bunch of open-source leaders that came together to create Apache Arrow, including leaders from Impala, Spark, and Calcite. The co-creators also include Wes McKinney, who, hopefully, we know as the creator of pandas. And the reason why he was interested in this project specifically is because he wanted to make pandas interoperable with other data processing systems. And this is one of the problems that Apache Arrow solves. So why is Apache Arrow becoming so popular? Well, it has achieved widespread adoption because it provides efficient columnar memory exchange. And it provides zero-copy reads. In other words, the CPU does not have to copy data from one memory area to a second memory area, which in turn reduces the requirements for CPU cycles. And finally, because Arrow is a columnar-based format, processing and manipulating data is a lot faster. And we’ll talk about the advantages of columnar data in a second. It also enables things like single instruction, multiple data (SIMD) vectorized processing and vectorized querying. It’s also used for a lot of different projects, including Spark, Parquet, InfluxDB, pandas. So Arrow uses Parquet to vectorize reads, for example.
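To make that exchange idea concrete, here is a minimal sketch using only the pyarrow and pandas libraries (this is not InfluxDB code, just an illustration of the kind of columnar, low-copy handoff Arrow enables between tools; the column names and values are made up):

```python
import pyarrow as pa

# Build an in-memory Arrow table; each column is stored as a contiguous,
# typed array rather than as rows.
table = pa.table({
    "sensor_id": ["TLM0100", "TLM0100", "TLM0101"],   # hypothetical values
    "temperature": [71.2, 71.3, 70.9],
})

# Hand the same columnar data to pandas for analysis. For many types this
# conversion can avoid copying the underlying buffers.
df = table.to_pandas()
print(df.describe())
```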
Anais Dotis-Georgiou: 00:17:18.249 And then I’ll talk about how we’re going to use — how we use Arrow for the InfluxDB new storage engine here in a second. And pandas uses Arrow to offer read and write support for Parquet as well. So requirement number one, no limits on cardinality. We should be familiar a little bit with this requirement by now. So how does Arrow contribute to this requirement? Well, it overcomes memory challenges associated with large cardinality use cases by providing efficient columnar data exchange. And so we’ll understand that even more when we talk about the benefits of columnar data storage. So let’s have a little sidebar. And let’s imagine that we are writing the following lines of line protocol to the new storage engine. So we have one measurement. We have two tags, tag one and tag two, with three different tag values. Tag value one. Tag value two. Tag value three. And we have one field and — actually, we have two fields, sorry. Field one and field two, where field one is integer and field two is a Boolean. And we have our timestamp. So if we were to write this data, the new storage engine will actually return the following table where tag sets and timestamps identify new rows on write when you query in either SQL or Flux with the iox.from() function.
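The line protocol on the slide isn’t reproduced in the transcript, but based on the description (one measurement, two tags with three tag values, an integer field, and a boolean field), it would look roughly like the hypothetical lines below, written here with the Python influxdb-client library. The measurement, tag, field, timestamp, and connection values are all placeholders, not the exact slide contents:

```python
from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS

# Hypothetical line protocol matching the description in the talk:
# one measurement, two tags, an integer field (field1), and a boolean field (field2).
lines = [
    "measurement1,tag1=tagvalue1,tag2=tagvalue2 field1=1i 1672531200000000000",
    "measurement1,tag1=tagvalue2,tag2=tagvalue3 field1=2i,field2=true 1672531260000000000",
    "measurement1,tag1=tagvalue3,tag2=tagvalue1 field1=3i 1672531320000000000",
]

# Connection details are placeholders.
client = InfluxDBClient(url="https://your-cloud-url", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)
write_api.write(bucket="my-bucket", record=lines)
```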
Anais Dotis-Georgiou: 00:18:59.201 And so this is what you’ll actually get back. Which, if you are using 2.x at all or InfluxDB Cloud now and you query with Flux, is quite different, because there we have an underscore field and an underscore value column. This is now kind of like as if the data were pivoted. And more on that in a second. However, underneath the hood, the data will be stored in a columnar format like this. So let’s take a second to look at this again. So if we look at field one, we can see that our values are one, two, three, four, and one. And for field two, it’s null, null, null, null, null, true, null. So this is how they’re actually stored in a columnar format. In other words, they’re stored like this formatted block. And one thing I want you to notice is that, within each of these blocks, neighboring values are the same data type. And they’re oftentimes similar in value, especially in time series data. More on that in a second. And this provides a perfect opportunity for cheap compression, which also enables high cardinality use cases. This enables faster scans by using the SIMD instructions found in all modern CPUs.
Anais Dotis-Georgiou: 00:20:14.798 And depending on how the data is stored, you may also only have to look at a single column of data to find the max value of a particular field, for example. So if you were trying to find that max, you’d only have to scan that one column. But let’s contrast this to row-oriented storage where, instead, you’d have to look at every field, every tag set, every timestamp in order to compare all the values for one field. In other words, you’d have to read the first row, parse the values into columns, and then include the field values in your result, and repeat for each row just to get that max value. Additionally, columnar storage makes even more sense for time series data just because of the nature of time series data itself. It is not uncommon when you are collecting time series data — let’s take the temperature of a room — that that value doesn’t change. And there is a bunch of metadata associated with Parquet and a bunch of encoding that can essentially tell you how many repetitions of the same value you have. So you don’t have — again, it just makes the scans that much cheaper, especially for time series data where you frequently have repeated values over time.
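Here is a small sketch of that column-only scan, again using pyarrow rather than the engine’s internal code, purely as an illustration of why columnar layouts make this cheap (column names and values are hypothetical):

```python
import pyarrow as pa
import pyarrow.compute as pc

# The columnar layout described above: each tag, field, and value series
# lives in its own contiguous, typed column.
table = pa.table({
    "tag1": ["tagvalue1", "tagvalue2", "tagvalue3", "tagvalue1", "tagvalue2"],
    "field1": [1, 2, 3, 4, 1],
    "field2": [None, None, None, True, None],
})

# Columnar max: only the "field1" column is scanned; the other columns are
# never touched. A row-oriented store would have to parse every row instead.
max_field1 = pc.max(table["field1"])
print(max_field1.as_py())  # 4
```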
Anais Dotis-Georgiou: 00:21:28.963 So sort of to conclude this sidebar, Arrow meets the first requirement of providing unlimited cardinality because Arrow overcomes memory challenges associated with large cardinality use cases by providing efficient columnar data exchange and compression. Okay. So requirement two, how does Arrow help provide best-in-class performance on analytics queries in addition to our already well-served metrics queries? Well, you can’t have highly performant analytics without the efficient in-memory columnar storage that Arrow provides. Requirement number three, compute should be separate from storage, with tiered data storage. And the database should use cheaper object storage as its long-term durable storage. So Arrow provides the in-memory columnar storage, and Parquet will provide the column-oriented data file format on disk. Parquet and Arrow are both column-oriented formats, and they have fantastic interoperability with read and write APIs between the two of them. And this interoperability enables that separation. And Parquet will act as the cheaper object storage for the database’s long-term durable store. And we’ll talk more about this in the Parquet section.
Anais Dotis-Georgiou: 00:22:56.563 Requirement four, we should have operator control over memory usage. The operator should be able to define how much memory is being used, and Arrow contributes to this because, again, we’re using the Rust implementation of Arrow. And this is how Arrow is able to additionally gain fine-grained control over memory usage. And requirement six, we want to have broader ecosystem compatibility. Using Arrow as the underlying format allows for fast and easy data exchange with the large and growing ecosystem of Arrow-based tools. So there are libraries written in C, C++, Java, JavaScript, Python, and Ruby. And there are more. I think there’s at least 12 in total. And the broad ecosystem compatibility of Apache Arrow is a part of the reason why technologies like Google’s BigLake, Apache Spark, Snowflake, Amazon Athena, etc., have chosen to integrate it with their stack so that they can achieve this efficient data exchange and achieve interoperability with other tools. Okay. Now, let’s talk about DataFusion and InfluxDB.
Anais Dotis-Georgiou: 00:24:14.657 So DataFusion, just to recap, is an extensible query execution framework. It, again, is written in Rust. And it uses Arrow as its in-memory format. So how does it help us achieve the requirement of no limits on cardinality? Well, without a query engine that can handle high-cardinality data, you can’t really even take advantage of your high-cardinality data. So while DataFusion doesn’t directly help us write high-cardinality data, it does help us query, process, and take advantage of that data, and transform it. Requirement two, how does it serve the best performance on analytics queries? Well, DataFusion is used to execute logical query plans, perform query optimizations, and serve an execution engine that is capable of parallelization using threads. So essentially, many different projects and products make use of DataFusion and its broad SQL support and sophisticated query optimizations to make these queries. And what else do I want to talk about with this? DataFusion is also built on top of Rust. And it also uses Arrow as its in-memory columnar format. And Parquet is its durable file format. So it gets all the performance benefits additionally from those tools as well. So everything’s related.
Anais Dotis-Georgiou: 00:25:51.688 And then how does it help achieve separating compute from storage and tiered data storage? Well, DataFusion enables fast queries against data stored on the cheaper object store, and it’s separate from compute. So it has the native ability to read Parquet files from object store without downloading them locally, and DataFusion can selectively read the parts of the Parquet files that are needed. Parquet is chock-full of metadata. So it can take advantage of that. And this selectivity is accomplished through projection pushdown and object storage range scans. Additionally, DataFusion utilizes other advanced techniques that further reduce the bytes required, such as predicate pruning, page index pushdown, row filtering, late materialization, and a bunch of capabilities that, in general, contribute to fast queries against data stored on a cheaper object store, separate from compute. So how does DataFusion help reach the goal of having broader ecosystem compatibility? So DataFusion supports both a Postgres-compatible SQL and a data frame API. And this means that the engine will support a large community of users from broader ecosystems that use SQL and, eventually, pandas DataFrames.
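DataFusion does this selective reading natively inside the engine, against object storage. As a rough outside-the-engine illustration of the same projection-and-predicate idea, pyarrow’s Parquet reader can also decode only the requested columns and skip row groups whose metadata rules out a filter; this is a sketch only, and the file and column names are hypothetical:

```python
import pyarrow.parquet as pq

# Projection pushdown: only the listed columns are decoded from the file.
# Predicate filtering: row groups whose statistics rule out the filter can be
# skipped entirely, thanks to Parquet's rich metadata.
table = pq.read_table(
    "air_sensors.parquet",                       # hypothetical file
    columns=["time", "temperature"],             # read only these columns
    filters=[("sensor_id", "=", "TLM0100")],     # prune row groups by metadata
)
print(table.num_rows)
```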
Anais Dotis-Georgiou: 00:27:20.551 All right. And now we are ready to talk about Parquet and InfluxDB. So what is Parquet? Parquet files are files in a compressed, columnar data format. And just to highlight kind of how successful it’s been and how valuable it is, I want to quote some numbers from Databricks that describe some of the results when they converted a one-terabyte CSV file to Parquet, so that we have some context for understanding how much more efficient it is. So the file size was reduced by 87%. Query runtime was reduced from 236 seconds to 6.7 seconds. So it’s 34 times faster. The amount of data scanned for a query, for them, dropped from 1.15 terabytes to 2.51 gigabytes, which was a 99% reduction. And cost was reduced by 99.7%. So how? Right? How is Parquet able to perform so much better than CSV or other file formats, for example? Well, there are some key concepts and features behind Parquet that make it so efficient. The first is run-length and dictionary encoding. So rather than storing the same value on disk many times, which effectively wastes space, Parquet is able to simply record how many times that value appears within a column, which I touched upon a little bit when I was talking about how this feature is especially useful for time series data because we typically get a lot of the same values. And this results in a massive space savings on datasets where these repeated values are frequent.
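The Databricks numbers above come from their own dataset, but the conversion itself is trivial to try on your own data. Here is a minimal pandas sketch (file names are hypothetical, and it assumes pyarrow is installed as the Parquet engine); this is also essentially the “one line of Python” CSV-to-Parquet conversion mentioned later in the Q&A:

```python
import pandas as pd

# Read a CSV file and rewrite it as Parquet. Dictionary and run-length
# encoding plus columnar compression are what produce the large size and
# scan-time reductions described above.
df = pd.read_csv("measurements.csv")                       # hypothetical file
df.to_parquet("measurements.parquet", compression="snappy")
```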
Anais Dotis-Georgiou: 00:29:18.482 There’s also record shredding and assembly. So Parquet is able to map nested data structures to a column-based layout. And it’s chock-full of rich metadata. So it keeps track of a large amount of metadata, which enables the strategies mentioned above. But developers don’t even have to really be aware of that. And I’ll talk a little bit more about that in a second. But let’s talk about how Parquet meets all the requirements of the new storage engine. The first is that it will offer best-in-class performance on analytics. So since Arrow and Parquet are both columnar data formats, they have increased performance for all of the reasons that I’ve already mentioned and all the associated benefits with that. And Parquet achieves fantastic compression. And it also supports interoperability with several machine-learning tools and analytics tools. The next requirement, so how does Parquet help with separating compute from storage? Well, Parquet files take up very little disk space and are very fast to scan. Advancements in RAM, and the bottlenecks that result from disk I/O, have pushed the need for the separation of compute and storage. And so, in turn, this is how we have separate compute from storage, because we are using Arrow and Parquet and DataFusion.
Anais Dotis-Georgiou: 00:31:07.947 All right. Requirement five, bulk data export and import. This is really where Parquet shines. So Parquet files are specifically used for OLAP tasks and OLAP use cases because they do enable this bulk data export and import. And they work with so many other tools that can read and write Parquet files — Spark, pandas, and a lot of the other tools associated with that. So let’s talk about that broader ecosystem compatibility a little bit more. So Parquet has interoperability with almost all modern machine-learning and analytics tools. Like we said, DataFusion supports both SQL and a data frame API for logical query plans. And additionally, the execution engine will execute against these Parquet files. Eventually, you will be able to query for these Parquet files and retrieve them directly from InfluxDB Cloud as well. But first, we’ll just utilize the SQL support for queries. And interoperability is supported by Parquet because many of the most popular languages such as C++, Python, and Java have first-class support in the Arrow project for reading and writing Parquet files.
Anais Dotis-Georgiou: 00:32:38.223 And I’m personally really excited about this and the potential Python support because that means that you’ll be able to query these files directly, and then convert them into pandas DataFrames, work on them directly, and have interoperability with tools like Tableau, Power BI, Athena, Snowflake, Databricks, Spark, and so many more. All right. And so how do we meet requirement seven of running at the edge and in the data center, federated by design? Well, because Parquet files are so efficient, they will facilitate and increase the capacity for data storage at the edge or in the data center. So now I wanted to take some time to talk about some of the new InfluxDB Cloud features that already exist and prepare you for being able to query in SQL. So if you go to InfluxDB Cloud right now, you have the option to try the new script editor, which I really like because it’s like the Data Explorer where you have the Query Builder and the Data Explorer, and you can toggle between the two if you’re familiar with that UI already. But this is all of that in one single page. And you get to both build your queries and edit them in the same view.
Anais Dotis-Georgiou: 00:34:01.632 So essentially, there’s a left-hand panel here where you can select where you want to get your data from, which is similar to the Query Builder in the old Data Explorer. And another feature — oh, so nice — you can now create new scripts. You’ll be able to select your language, whether it’s Flux or SQL, and save your scripts, which is so useful. I’m so happy that we have that. It’s a long-time requested feature. And you can toggle that in and out. And so if you use the left-hand panel to select your bucket, measurement, and tags, as you would in the Query Builder, then you can select the Flux sync toggle to be on or off. And if you have it on, it’ll automatically populate that Flux script in this script editor box, and then you can view your results directly below. Additionally, you can view the table view and graph options right below the run button. And I love that because I spend the majority of my time just in those two visualizations. I don’t really mess around with the other ones when I’m querying and exploring my data, and so I love this simplicity. And I love that I get my data back as a table, usually, first so that I can understand what the shape of my data looks like and how my query is affecting that output. So highly recommend that you give it a try. Let me know your thoughts. I love it. I’m thrilled by it. Thank you to the UI team.
Anais Dotis-Georgiou: 00:35:35.218 So, yeah. I guess I kind of talked about this already. But we can actually now maybe see what I was talking about because I’ve highlighted it here. So the top-left pink box shows where you can actually go and create new scripts, open new scripts, and save new scripts. There’s that Flux sync toggle I was talking about. So in this example, I have browsed through my schema. Another awesome part about this is the way it’s formatted, where you have your bucket and then your measurement and your tags and fields like this, which really helps give you a mental model of what your schema looks like. That is so important when you are trying to query your data, wrangle your data, and understand the shape of it. That’s the very first thing that you have to do if you’re going to perform any analytics on it. But, anyway. So this filtered-down approach, to me, helps give me a more intuitive understanding of that shape. And you can also search for your tags as you would previously as well. And then you can select them. And if you have the Flux sync on, for example, it’ll auto-populate. But you can turn that off and use the panel on the right side to search for any Flux functions, for example, and inject them in there.
Anais Dotis-Georgiou: 00:36:52.022 And then you can select your time, like you normally would, but also select your time zone through that, which I like. And then you can hit run and view that data as a graph. You can change the visualization type if you want. But by default, it will be a graph. And you can switch over to the table view as needed as well. Okay. So when January 31st comes, if you are a Flux user and you want to continue using Flux, you should be aware of the iox.from() function and how it differs from the from() function. So for the iox.from() function, you will import the experimental/iox package. Remember, IOx is just what we call, internally, the new storage engine. That’s all it is. And so you will query that, and you will specify the bucket that you want to query from, as well as the measurement that you want to query from. And that is the equivalent of selecting from the bucket air with a separate filter for the measurement airSensors. And then you would select your range and any additional filters for fields and sensors that you might have. So essentially, the iox.from() function on line two is the same thing as writing the commented-out lines three through five.
Anais Dotis-Georgiou: 00:38:13.506 But what’s so exciting about it, for me, is that when you use the from() function, we usually expect our table output to look like this, where we have tables that have these group keys, and we have the fields in one column, and the value in another column. And this has been such a stumbling block for the community and for anyone learning Flux. It’s not intuitive. Everyone expects that their field be the column name and that their value be presented under it and that they all exist in one table. Well, we heard you loud and clear. And that’s what you get with iox.from(). So when you query with iox.from(), you can see now that if we are querying for two fields, concentration and humidity — or is that? I think that’s concentration. We can see that now we have both of our fields represented in one table. And this is just generally what people want. This is what feels more intuitive. And it’s as though we’ve applied the schema.fieldsAsCols() function in Flux or, in other words, pivoted our data. And the other thing you’ll notice is that the underscore start and stop times are not included. I think, in general, too, developers found that that was just confusing for the community. People rarely use that underscore start and stop. And that’s more just important to how Flux works under the hood but detracts from the experience for users.
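Based on the description above, a Flux query against the new engine would look roughly like the sketch below, shown here as a Flux string submitted through the Python influxdb-client library. The bucket and measurement names (air, airSensors) come from the talk; the tag filter, connection details, and exact pipeline steps are assumptions for illustration only:

```python
from influxdb_client import InfluxDBClient

# A Flux query using iox.from(): specify the bucket and measurement up front,
# then add a range and any additional filters, as described in the talk.
flux = '''
import "experimental/iox"

iox.from(bucket: "air", measurement: "airSensors")
    |> range(start: -1h)
    |> filter(fn: (r) => r.sensor_id == "TLM0100")
'''

# Connection details are placeholders.
client = InfluxDBClient(url="https://your-cloud-url", token="my-token", org="my-org")
df = client.query_api().query_data_frame(flux)
print(df.head())  # fields come back pivoted into their own columns, per the talk
```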
Anais Dotis-Georgiou: 00:39:46.104 The only thing that you really care about is the time, your fields, and your measurement, and such. Oh, temperature being another field there. So in the new Data Explorer, or the new script editor, in the future, on January 31st, you will also be able to query with SQL. So here’s an example of a basic SQL query that we are using to get back some of our data. Specifically, we’re using the SQL query that says select sensor ID, temperature, time from air sensors where time is between these two timestamps. And we’re selecting for a particular sensor ID. And we can return our data that way. Yeah. So what is going to be supported as far as SQL goes by January 31st? Well, we’ll support the following statements: select, from, where, group by, order by, join, both left and inner, with clauses, having, union, limit. And over? I feel like maybe that’s wrong. I forget. I don’t think I’ve used an OVER statement in a while, so excuse me if that’s a typo. And then for subqueries, we’ll have exists, not exists, in, and not in. For functions, we’ll have aggregate functions: count, average/mean, sum, min, max. Yay! Because one of the things that so many people have always wanted to be able to do is calculate the max of your data across all time, and now you’ll be able to do that. This is such a long-requested feature. So it’s really exciting to see that being delivered.
Anais Dotis-Georgiou: 00:41:36.193 There are also time series functions: time bucket gap fill, date bin (which is kind of like aggregateWindow in Flux), and now. And then there are some other additional functions like explain and explain analyze. And so interoperability plans. So there will be various Flight SQL plugins that will be contributed to upstream projects to enable interoperability. So you should expect to see interoperability for Apache Superset, Tableau, Power BI, and Grafana. I think the order of attack is actually Superset, Grafana, Tableau, and Power BI. But eventually, you should be able to use InfluxDB as a backend for those tools and take advantage of them. Which, again, is another long-requested feature, and I don’t know the exact timeline for this. But maybe an engineer — or Chris, if you want to maybe ask if that’s something that we can provide — because that would be pretty cool, because I know a lot of people are really excited about having that sort of support. And now, without further ado, I am going to drop a survey into the chat, and request that maybe we can take a little breather if you don’t want to take the survey. I would love for you to fill this out so we can get some more information on how you use Flux, whether you’re excited about SQL support — eventually, too, the new storage engine will also support InfluxQL — and how you feel about these advancements so far.
Anais Dotis-Georgiou: 00:43:27.397 So let me stop sharing my screen. And let me add that into the chat. And if you don’t want to take it, no problem. Just know it is anonymous. And if you don’t want to take it, well, then we can enjoy a little bit — a little breather. So you can get some coffee or something.
Caitlin Croft: 00:43:47.068 I just shared it in the chat, Anais.
Anais Dotis-Georgiou: 00:43:48.908 Oh, thank you. Okay, perfect. So, yeah. I guess I’ll give you all a few minutes, maybe five minutes. Thank you so much, too. [silence]
Caitlin Croft: 00:48:25.969 — we get started again, Anais?
Anais Dotis-Georgiou: 00:48:28.177 Sure thing. Thank you. Okay. So I was just answering some questions that you have in the chat. So I don’t know the answer to all of them. But maybe someone, maybe Chris, maybe you do. So, “Is C# supported in the platform? Just wondering.” The client library support should still work, although, I think you’ll only be able to query with Flux for right now. “Are these new features available in InfluxDB open-source community version, or is there any plan to add them?” Yes, I just don’t know the timeline for that. “Are there any benchmarks available?” Not yet, but they’re coming soon, I’m sure. And when they do, we will share them. “Is there a timeline for deprecation and replacement on that?” I don’t know exactly what the timeline for deprecation looks like but, eventually, the new storage engine will replace the old one. “Do you have a date for the release of this new engine to the OSS version?” I do not. I don’t have that yet, I’m sorry. Okay. Someone else asked, “I think I missed this, so pardon me if you mentioned it. But on 1/31st, if we are using InfluxDB OSS 2.6, what do we need to do? Will the new engine become active with 2.7 as well as the new Data Explorer?” The new Data Explorer, when is that becoming — is that available in OSS?
Anais Dotis-Georgiou: 00:50:02.560 I actually don’t even remember right now in 2.6. I don’t think it is available. I would be surprised if it isn’t available in the next version, but I honestly don’t know. You don’t need to do anything if you’re using OSS 2.6. You are good. If you want to take advantage and try out querying with SQL, then I would say sign up for a free cloud account and give it a shot there. No, you can’t try InfluxDB, the new storage engine, with Docker right now. “How easy or difficult is it to migrate data? Or is it automatic?” Our wonderful documentation lead, Scott, is currently writing that migration guide. I haven’t tried it, so I don’t know. But I think one of the huge efforts of the new storage engine is to make migration easier. So, yeah. I don’t have a good answer for that either, I’m sorry. “How much faster is the new storage engine than the current release?” Again, I haven’t seen any benchmarks, so I don’t know. Honestly, y’all, I am learning this almost at the same pace as you all are. I just started learning about the new storage engine a few weeks ago. So this is all just so new and we’re rolling it out in these — in the two regions just to see how well it performs and learn more about how customers are feeling about it and how the community is feeling about it. But I wouldn’t say it’s a shocking, sudden change. You can still continue using Influx as you have been.
Anais Dotis-Georgiou: 00:51:57.136 Okay. “Will EDR work with open source 2.x in the new storage engine?” I think so, but I’m not sure. I actually don’t know. But please fill out that survey because that’s one of the questions I ask. And your feedback will really help the team in that way. Okay. Thank you again, Caitlin, for sharing that. “Do insertion methods change?” InfluxDB will still accept line protocol, so you shouldn’t worry. Yeah. You’re still able to use Telegraf. Honestly, if you have to summarize all of this, it’s just basically, under the hood, InfluxDB is changing to enable SQL support, unlimited cardinality, and all those requirements. And the goal is that, hopefully, you don’t have to worry about anything else. [silence]
Anais Dotis-Georgiou: 00:53:14.785 Oh, there’s actually more to this presentation. Let’s see. We can keep going. Okay. So if you want to see how the new engine works, you can sign up for the cloud beta program and stay up to date on the newest features. So scan that QR code or go to influxdata.com/influxDB-engine-beta. Again, please join us on the InfluxDB community Slack workspace to participate in conversation specifically about the new storage engine and ask questions about it. You can join the InfluxDB_IOx channel. Also, please go to the forums as well, community.influxdata.com, and Reddit. We have developer advocates on both of those as well. And again, if you want to get started, look at influxdata.com in general. That’s where you can find all these resources, product downloads, etc. And then Influx Community is the developer-advocate-maintained organization where all of the developer advocates at InfluxData have a bunch of different repos with examples of how to use InfluxDB for different use cases and in different contexts. So that’s a really valuable tool for those of you who aren’t familiar with it. If there’s something you’re trying to do with Influx, there might be an example of how to do that within that organization.
Anais Dotis-Georgiou: 00:54:52.225 Sorry, duplicate slide. So here’s some related blogs. And I will actually post those in the chat as well. Caitlin, do you know if a link with the recording will be emailed to them?
Caitlin Croft: 00:55:07.889 We will, yep.
Anais Dotis-Georgiou: 00:55:10.299 Cool. So you can also get the blogs that way. And then additional resources, again, like I mentioned, the forums, Slack. I forgot to put Reddit in there, but there’s also a subreddit for InfluxDB, that Influx community org, the Time to Awesome book, the docs, blogs, and InfluxDB University. InfluxDB University, you can get free training on all things Influx and earn LinkedIn badges. So you can have bragging rights and impress employers. So I think we’ve gone over the questions. And let me stop sharing again so that I can also copy some of those blog links for you guys. [silence]
Caitlin Croft: 00:56:09.879 So a few people — while Anais is doing that, I know a few people are asking for the recording. So by tomorrow morning, you can basically just go to the link that you used to register for the event and find the recording and the slides. But I will make sure we also send out an email with all of the helpful links that Anais has mentioned.
Anais Dotis-Georgiou: 00:56:36.747 Yeah. Actually, we’ll just do that because I don’t have time to copy all the links separately because they’re all — yeah, they’re all linked to the titles. Yeah.
Caitlin Croft: 00:56:48.802 Awesome. Well, thank you, Anais. Thank you everyone for joining. Anais, I know you answered a bunch of these questions, are there any that need — that we haven’t addressed yet? I just want to make sure we’ve gotten them all.
Anais Dotis-Georgiou: 00:57:02.238 Okay. “Is there a timeline for bringing the new engine to other AWS regions?” Yes. I don’t know what it is, sorry. All this is so new that this is largely what we’re trying to figure out. But please join the Slack channel. And I have all these questions listed so, hopefully, I can also make a post in Slack and in that channel, as well as maybe the forums, and help bring those answers to you. So if you stay tuned there, hopefully, you can find some answers to that. “A backup made with 2.6 can be restored in 2.7?” I’m assuming so. Yeah. I don’t think our plan is to leave you hanging, so. [silence]
Anais Dotis-Georgiou: 00:58:08.887 “An option to dump in CSV in the following format supported by InfluxQL would be very much appreciated.” Noted. I don’t think that you’ll necessarily be able to immediately dump CSV directly, but there are so many tools to convert CSV to Parquet, and you’ll immediately be able to bulk import that. So I don’t think that type of work would be very challenging, because that’s like one line of Python to do that type of conversion. But maybe, I don’t know. So much is up in the air, and that’s why your feedback is really important. All right. Thank you.
Caitlin Croft: 00:59:00.628 [crosstalk]. Thank you, everyone. Thank you for joining today’s webinar. There will be lots of links sent out to you, probably tomorrow, so be sure to check out your email for that. And once again, everyone should have my email address. So if there’s another burning question, feel free to email me. I’m happy to loop in Anais and the engineering team, who I know are working around the clock for this release. And we have another webinar coming up in just under a month with Paul Dix and Balaji, who is our product marketing person. So we’re going to go into even more detail about the new storage engine. So be sure to check out our events page. And thank you, everyone, once again, for joining today’s webinar.
Anais Dotis-Georgiou: 00:59:47.620 Thank you. Bye.
Caitlin Croft: 00:59:49.040 Bye.
Anais Dotis-Georgiou
Developer Advocate, InfluxData
Anais Dotis-Georgiou is a Developer Advocate for InfluxData with a passion for making data beautiful with the use of Data Analytics, AI, and Machine Learning. She takes the data that she collects, does a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.