Best Practices for Leveraging the Apache Arrow Ecosystem
Session date: Aug 15, 2023 08:00am (Pacific Time)
Apache Arrow is an open source project intended to provide a standardized columnar memory format for flat and hierarchical data. It enables more efficient analytics workloads for modern CPU and GPU hardware, which makes working with large data sets easier and cheaper.
InfluxData and Dremio are both members of the Apache Software Foundation (ASF). Dremio is a data lakehouse management service known for its scalability and capacity for direct querying across diverse data sources. InfluxDB is a purpose-built time series database, and InfluxDB 3.0 has a new columnar storage engine that uses the Arrow format for representing data and moving data to and from Parquet. Discover how InfluxDB and Dremio have advanced their solutions by relying on the Apache Arrow framework.
Join this live panel as Alex Merced and Anais Dotis-Georgiou dive into:
- Advantages of using the Apache Arrow ecosystem
- Tips and tricks for implementing the columnar data structure
- How developers can best utilize the ASF to innovate and contribute to new industry standards
Watch the Webinar
Watch the webinar “Best Practices for Leveraging the Apache Arrow Ecosystem” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Transcript
Here is an unedited transcript of the webinar “Best Practices for Leveraging the Apache Arrow Ecosystem”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
Speakers:
- Caitlin Croft: Director of Marketing, InfluxData
- Anais Dotis-Georgiou: Developer Advocate, InfluxData
- Alex Merced: Developer Advocate, Dremio
Caitlin Croft: 00:00:02.002 Welcome to today’s webinar. I’m very excited to have Anais and Alex here to talk about how InfluxDB and Dremio leverage the Apache Arrow ecosystem. Please post any questions you may have in the Q&A, which you can find at the bottom of your Zoom screen. And this session is being recorded and will be made available tomorrow and the slides will be made available by tomorrow morning as well. And without further ado, I’m going to hand things off to Anais and Alex.
Anais Dotis-Georgiou: 00:00:33.033 Thank you so much, Caitlin. And welcome everybody. It’s nice to see you all here. I know that you are here. And yeah, let’s just begin. So we’re going to be talking about how InfluxDB and Dremio leverage the Apache ecosystem. And I’ll just take a moment to introduce myself. I’m a developer advocate at InfluxData. And my job is to sort of represent the community to the company and vice versa. And a lot of what I do is creating various example repos and tutorials and demos on how to use InfluxDB with a variety of other tools. And so I encourage you to connect with me on LinkedIn if you want. And I’ll also give Alex a quick moment to introduce himself as well.
Alex Merced: 00:01:18.018 Hey, everybody. My name is Alex Merced. I’m a developer advocate here at Dremio. I’m mostly spending my time talking about open source technologies like Apache Iceberg and Apache Parquet and Apache Arrow, the very topics we’ll be talking about today. I’m also one of the co-authors of the upcoming O’Reilly book, Apache Iceberg: The Definitive Guide. I’ll mention that again a little bit later on. But basically, I’m excited to be here today, and thank you InfluxDB for having me as part of this to talk about really cool Apache technologies. And back to you.
Anais Dotis-Georgiou: 00:01:47.047 Thank you, Alex. Yeah, so let’s get started. Let’s talk about InfluxDB and Apache and learn about why InfluxData decided to rewrite our storage engine using the Apache ecosystem. But first, I just wanted to give a brief overview for those of you who aren’t familiar with what InfluxDB and InfluxData are. So InfluxData is the creator of InfluxDB. We’ve had three versions: 1, 2, and 3, and 3 was a complete rewrite of the storage engine. And InfluxDB is a time series database and platform. And you can use InfluxDB to collect data from a variety of different sources, and then you can transform that data with SQL or InfluxQL, and then also use and integrate with other tools, so you can take advantage of various machine learning or business analytics or business intelligence tools to visualize your data, perform additional analytics, and such. Another tool in our stack is called Telegraf, and that is a collection agent. It’s also open source. And it’s used for writing metrics and events. It’s plugin-driven and there are over 200 input plugins alone, and it’s also database agnostic. So if you just have the task of needing to get a lot of data from a particular source and you want to do that with buffering and caching capabilities on a lightweight agent, I highly recommend looking into Telegraf.
Anais Dotis-Georgiou: 00:03:15.015 There’s also a variety of plugins for Telegraf that make it extensible in any language, and those are called the exec plugins. And there’s a collection of externally contributed plugins using that exec plugin architecture, so you can take a look at all of those plugins that people have contributed as well, in the language of your choice. And so if you’re also looking for a really easy contribution project, I recommend looking into that. But with InfluxDB 3.0, we stripped away some of the features of previous versions and completely rewrote the storage engine, like I said, on top of the Apache ecosystem. And there were a lot of different motivators for doing this, but one of them was to enable broader interoperability with a lot of other tools. And I’ll talk more about that in detail, but that really has spurred more possibilities for things like data visualization, analytics, machine learning, and even the ability to get data from a lot of different sources, and then send data to a lot of different sources. So InfluxDB’s new storage engine is built on a lot of Apache products, but also Rust. And we chose Rust because it’s a language that offers really fine-grained memory management, and one of the concerns or requests from previous versions of InfluxDB was that we wanted people to be able to have operator control over memory specifically.
Anais Dotis-Georgiou: 00:04:47.047 And so we basically take a lot of those fine-grained memory management capabilities that Rust offers and extend those to the user. It’s also built on Apache Arrow. So Apache Arrow is a framework for defining in-memory columnar data. And Apache Parquet is a column-oriented, durable file format. And then Arrow Flight is a client-server framework that simplifies the transport of these really large datasets over network interfaces. And DataFusion is a query execution framework that’s also written in Rust. A lot of these are actually written in Rust. It uses Apache Arrow as its in-memory format. So that was kind of another reason why we chose Rust as well, just because it integrates so seamlessly with a lot of the Apache products or technologies. And over the last few decades, basically, a lot of businesses have had to perform increasingly complex analytics, and this has required them to leverage really, really big datasets. And there’s been a variety of advancements in things like query performance, analytics, and data storage, and that’s largely a result of greater access to memory. And greater access to memory has a lot to do with improvements in manufacturing processes and technological advances as well, but basically, lower memory costs have spurred the creation of technologies that support a lot of in-memory query processing or OLAP processes and data warehousing systems.
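For illustration, here is a minimal Python sketch (using pyarrow, not the Rust internals InfluxDB 3.0 is built on) of the Arrow/Parquet relationship described above: a columnar table is built in memory with Arrow and then persisted as a Parquet file. The column names and file path are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Columnar in-memory representation: each column is a contiguous Arrow array.
table = pa.table({
    "time": pa.array([1, 2, 3], type=pa.int64()),      # timestamps (simplified)
    "sensor": pa.array(["a", "a", "b"]),                # a tag-like column
    "temperature": pa.array([72.0, 72.0, 71.5]),        # a field-like column
})

# Parquet is the durable, column-oriented counterpart on disk.
pq.write_table(table, "sensor_data.parquet")
print(pq.read_table("sensor_data.parquet").schema)
```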
Anais Dotis-Georgiou: 00:06:23.023 And so Arrow is kind of coming out of that context. And Apache Arrow is that framework that’s used to define in-memory columnar data. And the goal is that every processing engine can use it, and it aims to be basically the language-agnostic standard for columnar memory representation. And if it becomes that, and as it is becoming that, it’ll help facilitate a lot more interoperability. And for example, one of the co-creators of Arrow was Wes McKinney, who is also the creator of pandas. And he specifically wanted to make pandas more interoperable with other data processing systems. And this is the kind of problem that Apache Arrow solves. And so InfluxDB wanted to use Arrow to take advantage of all the performance benefits that a columnar memory data format provides, as well as leverage that interoperability with other tools. And Apache Arrow has achieved really widespread adoption because it provides efficient columnar memory exchange and also provides zero-copy reads. So those are kind of the two main benefits or things that kind of set it apart. And it’s also used in a variety of other projects aside from InfluxDB and Dremio. It’s also used in Apache Spark and pandas as well.
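As a rough sketch of that interoperability, the snippet below moves the same data between pandas and Arrow with pyarrow; the column names are made up, and whether a given conversion is truly zero-copy depends on the column types involved.

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"temperature": [72.0, 72.0, 71.5], "sensor": ["a", "a", "b"]})

# pandas -> Arrow: produces a columnar Arrow table backed by Arrow arrays.
table = pa.Table.from_pandas(df)

# Arrow -> pandas: numeric columns can often be handed back without copying.
df_again = table.to_pandas()
print(table.schema)
print(df_again.head())
```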
Anais Dotis-Georgiou: 00:07:56.056 And Parquet also, which we’ll talk about in a little bit, uses Arrow for vectorized reads as well. So yeah, you can see it being used in a lot of other places. But I wanted to touch specifically on why we chose to use Arrow as our in-memory columnar format for InfluxDB. And the first reason is that essentially it overcomes a lot of the memory challenges and also helps give fine-grained memory control. A second reason is that it provides really efficient data exchange for data analytics and gives you all those performance benefits as well. And then likewise, because it’s an in-memory columnar data format, it works really well with Parquet as well, which is what we use for the column-oriented data file format that’s on disk. And last but not least, it provides really broad ecosystem compatibility. We now have the ability to — we’ve created various JDBC drivers and you can now visualize your data with things like Apache Spark, Grafana, which we had before, but now we can use the Arrow Flight plugin to directly pull a lot of these really large datasets into Grafana as well. Things like Tableau too, and Power BI is on the way.
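The following is a hedged sketch of what pulling a large result set over Arrow Flight can look like with pyarrow. The endpoint URL and the JSON ticket payload are placeholders, not the exact format any particular server (InfluxDB 3.0, Dremio, or the Grafana plugin) expects.

```python
import json
import pyarrow.flight as flight

# Hypothetical endpoint; real deployments also need authentication headers.
client = flight.connect("grpc+tls://example.com:443")

# Many Flight servers accept an opaque ticket; this JSON body is illustrative only.
ticket = flight.Ticket(json.dumps({"sql_query": "SELECT * FROM sensor_data"}).encode())

reader = client.do_get(ticket)   # stream the result set
table = reader.read_all()        # materialize it as an Arrow table
print(table.num_rows)
```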
Anais Dotis-Georgiou: 00:09:20.020 So one of the things that we realized from earlier versions of InfluxDB was that there’s this need to expand our interoperability with other tools that already exist to support the type of data analytics and machine learning that people are interested in doing. And then Arrow has a lot of libraries written in a lot of different languages: C, C++, Java, JavaScript, Python, Ruby. There are 12 in total. So that also contributes to broader ecosystem compatibility. And it’s the reason why Arrow’s also part of other technologies, including BigLake, Snowflake, and Athena. Those are some that I forgot to mention before. So as a sidebar, I also wanted to talk about the advantages of columnar data in general because it’s relevant not only to Arrow, but also to Parquet. And it’s especially true that there are additional advantages for time series data, specifically. And that’s because a lot of times when we are monitoring something and collecting time series data, the value of our time series data doesn’t really change necessarily every minute or every second that we are gathering that data. Oftentimes, the value remains consistent, especially if we think about monitoring the physical world where we’re looking at things like temperature or pressure or maybe humidity.
Anais Dotis-Georgiou: 00:10:42.042 And so to see the advantage of that, imagine that we’re writing the following line protocol to InfluxDB. Line protocol is the ingest format for InfluxDB. You don’t really have to focus on this, just imagine that we were writing some measurements and some tags and some fields, some temperature values, and some various timestamps. Well, InfluxDB v3 will return the following table, where essentially you have a field column for your fields, and then tag columns. And anywhere where you didn’t have a specific field associated with that timestamp, you’ll receive a null value. And so the first thing that we’ll notice is that there’s going to be a lot of repeated values, especially if field one was temperature values. It might be like 72 down the entire column. And so when we represent things in a columnar format, we’re actually going to group all the columns together. And so if there are the same values in those columns, then we have the opportunity to have really cheap compression and also use a lot of metadata to represent those columns in a more efficient way. If we contrast this to storing everything in a row format, where we’d have to scan across every field, every tag, every timestamp, that’s really an inefficient way to view our data and analyze things like what’s the greatest field value. Ideally, we only have to look at that one field value column in order to identify what the max value is. We don’t have to scan our entire table.
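A small pyarrow sketch of those two points, using a made-up, mostly constant temperature column: repeated values dictionary-encode very cheaply, and an aggregate such as the max only touches the one column it needs.

```python
import pyarrow as pa
import pyarrow.compute as pc

# A mostly constant field column, as in the temperature example above.
temperature = pa.array([72.0, 72.0, 72.0, 72.0, 71.5])

# Dictionary encoding stores each distinct value once plus small integer indices.
encoded = pc.dictionary_encode(temperature)
print(encoded.dictionary)   # the distinct values (72.0 and 71.5)

# An aggregate like max only has to scan this single column, not the whole table.
print(pc.max(temperature))
```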
Anais Dotis-Georgiou: 00:12:23.023 So that’s basically some of the advantages of columnar data storage, in general, and then also for InfluxDB v3 specifically where we have time series data. So essentially under the hood, like I said, those columns are all stored together. So all the field values are together, all the null values for that other field are together, all the tag values are together for that one tag, etc. So essentially, it enables faster scan rates as well. And it also allows for cheaper compression as well. So now let’s talk a little bit about DataFusion and InfluxDB. So DataFusion is the query execution framework for InfluxDB, and it’s used to execute logical query plans, optimize queries, and serve as an execution agent that is capable of parallelization. It also supports not only SQL, but also a DataFrame API. So eventually, the hope is that InfluxDB will allow you to query with pandas natively, which I’m really excited for because I absolutely love pandas. And it also enables really fast queries against the data store and cheaper object storage. DataFusion has a native ability to read Parquet files from an object store without downloading them locally. And it can also selectively read the parts of Parquet files that are needed. And this selectivity is accomplished through various pushdowns and object storage range scans.
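As a sketch of what querying with DataFusion looks like, the snippet below uses recent versions of the DataFusion Python bindings (the same engine, exposed through Python rather than the Rust API InfluxDB uses internally); the Parquet file name and the query are hypothetical.

```python
from datafusion import SessionContext

ctx = SessionContext()
# Register a Parquet file (or an object store path) as a queryable table.
ctx.register_parquet("sensor_data", "sensor_data.parquet")

# DataFusion parses, plans, and optimizes the SQL, reading only the columns it needs.
df = ctx.sql("SELECT sensor, max(temperature) AS max_temp FROM sensor_data GROUP BY sensor")
print(df.to_pandas())
```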
Anais Dotis-Georgiou: 00:14:21.021 Yeah. And so another reason why we used it was to basically give InfluxDB users that ability to query in SQL, recognizing that developers that are interested in storing their time series data somewhere aren’t really interested in necessarily having to learn a new query language. They want to use what they’re familiar with and use the tools that they’re already using. So that was a big, big impetus to moving towards that. And then Parquet. So like I mentioned, Parquet is the compressed columnar data format that’s on disk. And there are a variety of advantages to using Parquet. The first is that it’s wildly more efficient than something like CSV, for example. If we compare CSV to Parquet just to give some context, excuse me, the file size is reduced by almost 130 gigabytes, which is an 87% reduction. And query time on Parquet files is about 34 times faster. And the amount of data scanned for a query drops by about 99%. And so as a result, cost drops by about 99% as well. So what is the secret sauce that makes Parquet so much more performant than CSV? Well, the first is that there’s run-length and dictionary encoding. So rather than store the same value on disk many times, effectively wasting space by storing the same value over and over again, what Parquet does is simply list how many times that value appears within a column.
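As a toy illustration of that CSV-versus-Parquet comparison, the snippet below writes the same highly repetitive data both ways and compares file sizes; the exact savings depend entirely on the data, and the 87%/99% figures quoted above come from a specific benchmark, not from this example.

```python
import os
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

# One heavily repeated column, similar to a slowly changing temperature field.
table = pa.table({"temperature": [72.0] * 1_000_000})

csv.write_csv(table, "data.csv")
pq.write_table(table, "data.parquet")   # dictionary/run-length style encoding applies here

print(os.path.getsize("data.csv"), "bytes as CSV")
print(os.path.getsize("data.parquet"), "bytes as Parquet")
```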
Anais Dotis-Georgiou: 00:16:05.005 This is, again, especially useful for things like time series data where that column value might be repeated multiple times. And this saves massive amounts of space on datasets where there is that repeated value. There’s also record shredding and assembly. So basically, it allows Parquet to map nested data structures to a column-based layout. And then there’s rich metadata. Under the hood, Parquet keeps track of large amounts of metadata, and that makes all of the aforementioned strategies possible. So why did we use Parquet? Well, we use Parquet to reap the benefits of everything that I just mentioned, but also because we wanted to provide more interoperability with other machine learning and analytics tools. A lot of them allow you to write or use Parquet files directly, and so we wanted to be a part of that. Another reason is that it takes up little disk space and offers fast scans, and we wanted to also give those advantages to our users. It also enables bulk data export and import. Right now you can’t export Parquet files directly, but that is the dream: to be able to really easily export Parquet files directly from InfluxDB so that you can dump them wherever else you need to put them. So now I’ll hand over the presentation to Alex, and he can talk about Dremio and Apache.
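To make the “rich metadata” point concrete, here is a short pyarrow sketch that reads the footer of a Parquet file (the hypothetical one written in the earlier sketch) and prints the per-row-group, per-column statistics that engines use to decide what they can skip.

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("sensor_data.parquet").metadata
print(meta.num_rows, "rows in", meta.num_row_groups, "row group(s)")

# Per-column statistics within the first row group (min/max, null counts, etc.).
col = meta.row_group(0).column(0)
print(col.path_in_schema, col.statistics.min, col.statistics.max)
```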
Alex Merced: 00:17:50.050 Hey, everybody. This is Alex Merced. Again, developer advocate at Dremio. And Dremio was built top to bottom using Apache technologies because Dremio really believes in being sort of a very open platform that allows you to kind of plug and play the tools that you like. So first off, let’s just kind of talk about what Dremio is so you can see how Apache Arrow gets integrated into Dremio. So on the next slide, we have sort of a map of what Dremio is. So if we go to the next slide. Cool. So basically what Dremio is, it’s a data lakehouse platform. Okay? So in the world of databases, data warehouses, and data lakes, a data lakehouse basically means you’re using your data lake as your storage repository for storing structured and unstructured data. But the idea is that you’re able to use that data lake more like a data warehouse, thus the name data lakehouse. And what really enables that is essentially a table format like Apache Iceberg, which I’ll talk about a little bit later, that allows data lake tools and query engines like Dremio to sort of do a more robust set of operations on that data. Not just read only, but be able to do updates, inserts, basically treat your data lake more like a full-on database. So what Dremio does is basically make having a data lake or data lakehouse easier and faster. So basically, if you want to speed up the performance of queries in your data lake, Dremio has an answer. If you’re trying to organize your data and make it more self-service and easier to use, Dremio has an answer.
Alex Merced: 00:19:22.022 So basically, what would happen is you would connect your data sources (databases, data lakes, data warehouses) to the platform. You have a semantic layer in which you can organize, document, and govern your data, a query engine with which you can query that data, and then an access layer that makes it easy for anyone to access that data, either through Arrow Flight, as mentioned earlier, which makes for very quick data transfer, especially for Arrow-based data, or through a REST API and other layers, into whatever analytics use cases you have, whether they be BI dashboards, analytics notebooks, or analytics applications. So that’s what Dremio does. If anyone’s ever wanted to give it a test drive, there’s a QR code right there. But now let’s explore how Apache projects make Dremio possible. So on the next slide, we’re going to talk about a project called Apache Calcite. So that’s on the next slide. Okay. And what Apache Calcite is — it’s an open source framework particularly for building database and data tools. Okay? Basically, what it provides is the ability to parse SQL. So with a lot of data tools, one of the first things you want to do is be able to take SQL queries and execute them. But like any kind of language, you have to be able to parse and create a lexer for that language. So Calcite kind of fills that role. It parses the SQL query.
Alex Merced: 00:20:40.040 But also built into Calcite are things like a query optimizer. So that way if someone puts in an SQL query, it can identify inefficient patterns and sort of swap out the query pattern for one that’s going to be executed more efficiently, or be able to evaluate multiple SQL patterns to, again, get the best, most efficient query possible. You have an adaptive framework, so that way you can build sort of a custom SQL language on top of it with all sorts of plugins, but it also supports standard SQL. So essentially, this sort of enables the SQL-first interface that Dremio has. So that way Dremio can parse SQL, again, add custom functions to SQL, and so on and so forth. So that’s what Apache Calcite’s role is in the big Dremio picture. Okay? On top of that, a technology that we’ve mentioned earlier, Apache Arrow on the next slide, basically is what provides a really fast way to process data in memory. So we go to the next slide. Cool. Again, as mentioned earlier, Apache Arrow is a columnar in-memory format. And Dremio uses this in a lot of different ways. It uses it for how it processes data, allowing Dremio to process data really fast. On top of that, as mentioned earlier, interop with other tools. So other tools working with Arrow data can easily send that data over to Dremio, and Dremio can send that data out in Arrow format, which makes that Arrow Flight endpoint extra fast, so you’re processing the data fast in Arrow, and able to send that data really fast via the Arrow Flight endpoint.
Alex Merced: 00:22:11.011 It’s also used as part of our caching layer. Okay? So basically there are two layers to how Dremio sort of caches data. There are other layers, but basically you have something called the columnar cloud cache, which is a Dremio technology that’s sort of caching things when you’re working with object storage. Because again, if you’re working with object storage and trying to go back and access those files repeatedly, that can result in sort of more access costs. So what Dremio will do is we’ll try to cache data, metadata, actual data from those calls for queries that are repeated very often. And a lot of data will then be cached in the memory of your Dremio nodes and in solid-state storage using sort of Arrow buffers. So that way on very frequent queries, you’re not necessarily having to make all those round trips to your favorite object store and not having to worry about incurring more costs associated with it. But Arrow gives that sort of really nice, compact format that’s really quick to get from memory, really easy to process in memory, and really lightweight to cache. Another part of the Arrow project that Dremio has contributed is a project called Gandiva. And what this project does is take SQL or query processing logic and pre-compile it into binary. So that way you’re able to do those operations faster.
Alex Merced: 00:23:35.035 So Dremio is able to provide even more performance because a lot of the operations that tend to repeat a lot will get pre-compiled into binary, so that way when you run that query and that operation happens multiple times, it’ll be running sort of the binary version of that instead of running the Java bytecode for that particular operation. Okay? And that also allows you to kind of pre-create special functions, SQL functions that specifically work with Arrow buffers. And then again, you can use Calcite to add the SQL language to trigger those operations in Gandiva. Okay? So Apache Arrow is very much a big part of the story of how Dremio processes data. And because of that, it does open up the doors for easy interoperability with other Arrow-based tools. Okay? Now, one of my favorite parts, on the next slide, is going to be Apache Iceberg. So again, I’m a co-author of the upcoming Apache Iceberg: The Definitive Guide. So if you want to get an early copy, check out dremio.com. There will be a link where you can get a free early copy of Apache Iceberg: The Definitive Guide from O’Reilly. But essentially what Iceberg is, it’s a metadata layer on top of your Parquet files, or ORC files, or Avro files. But the idea is you have data in a data lake. Let’s say one dataset’s a thousand Parquet files. The query engine wants to know: do I need to scan all of those Parquet files? If I can avoid it, I’d rather only scan the ones that I need.
Alex Merced: 00:25:01.001 Now, traditionally, with something like Hive, what you’d have to do is, at best, have the files split up in different directories for partitioning, but then the engine would still have to scan all the files in that directory for that particular partition. And those file listing operations would really kind of be slow. Okay? So what Iceberg does, it says, “You know what? Let’s bypass all this having to do file listings and iterate through files and all those opening and closing file operations. Let’s just create a metadata structure that sits on top of your files that lists what files are part of the table and has metadata about those files.” So that way a query engine can be like, “Okay, well, this is the query. These chunks of files have data that’s relevant to my query,” when it does the partition pruning. So it’ll basically say, “Okay.” Let’s say it’s partitioned by state. “So here’s the information for Oregon, Ohio, and Florida, but my query says, hey, I only want people — I’m only searching for anyone who’s of a height above 6 foot.” Okay? Well, in those groups of files, there’s metadata about each individual file. Let’s say, okay, well, the min and max of the height column is X. So now we can go through the metadata and say, “Hey, I don’t need to scan this file. I don’t need to scan this file. I don’t need to scan this file.” So basically, once it’s done planning the query, the number of files you have to scan is maybe only a handful depending on how your data’s set up in those Parquet files, allowing for a much more efficient query plan.
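Here is a simplified, purely hypothetical Python sketch of the min/max pruning Alex describes; it is not Iceberg’s actual metadata format or API, just the idea that per-file column statistics let a planner skip files before reading them.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    height_min: float   # per-file minimum of the "height" column
    height_max: float   # per-file maximum of the "height" column

# A made-up manifest: one stats entry per data file, as a planner might see it.
manifest = [
    FileStats("oregon-0001.parquet", 4.9, 5.8),
    FileStats("ohio-0001.parquet", 5.1, 6.4),
    FileStats("florida-0001.parquet", 5.0, 5.9),
]

# Query: height > 6.0. Any file whose max is at or below 6.0 cannot match,
# so it is pruned without ever being opened or scanned.
files_to_scan = [f.path for f in manifest if f.height_max > 6.0]
print(files_to_scan)   # only ohio-0001.parquet survives pruning
```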
Alex Merced: 00:26:37.037 And this not only allows for those more efficient query plans, but it’s going to enable time travel, partition evolution, schema evolution. All these things that normally would only be possible in a database or a data warehouse, you can now do on the data you have in your data lake. And where Dremio uses this is in a couple of different ways. One, you can operate directly with Apache Iceberg datasets with Dremio in many different ways. But even before that, Dremio used Iceberg for a feature called data reflections. So data reflections, think of them as materialized views on steroids. Okay? So basically, normally with materialized views, you would have to have a separate namespace. You’re creating all the storage for each individual view you want to materialize. There are all these imperfections. But with data reflections, what Dremio does is, when you have a view that you would like to materialize, it creates an Apache Iceberg representation. So one, you get all the benefits of Apache Iceberg in planning any further queries on that dataset, but two, you don’t have to use a different namespace. So anyone who comes and queries that same table, they don’t need to be aware that a materialization exists. They’re just going to see that performance benefit right away. And that’s all powered by Apache Iceberg, which then allows you to partition and create different sorting rules on those same data files, so the metadata allows for performance on different types of query patterns as you create these data reflections. Which is one of my favorite features.
Alex Merced: 00:28:07.007 Now let’s talk about Apache Parquet. Because Apache Parquet - on the next slide - is basically, again, a very performant binary columnar analytics file format that has all those compression benefits, those interoperability benefits that Anais talked about earlier. Now, generally when you write data in Dremio, you’re going to write to Parquet. But Dremio can also read Parquet very efficiently, especially off object storage. Okay? Because it also has that columnar cloud cache, which is really good at taking those bits of Parquet and saving those as little chunks in Arrow format in memory on those Dremio nodes to, again, save those round trips to S3 to lower your access costs, improve your performance, which will lower your compute costs, and all those kinds of things. But with Parquet as well, when you create those data reflections that I mentioned, basically you pair that Iceberg metadata with Apache Parquet data files. So where this becomes really useful is that — let’s say you are trying to load a very large CSV file that you have on your data lake storage and you want to accelerate it, you can just turn on data reflections on that CSV data file and you don’t even need to be aware that Dremio’s going to create that Parquet representation with that Iceberg metadata. Okay?
Alex Merced: 00:29:25.025 And now basically, when you go query the CSV file, it’s just naturally going to feel faster because you’re taking advantage of these Apache Iceberg technologies without necessarily having to know sort of all the nuts and bolts. That’s one of the big things about the Dremio platform; it allows you to get a benefit from all these technologies, especially in the data lake, where you oftentimes would have to really know all these things and know to convert between X and Y, unlike a database or a data warehouse where all the stuff is very integrated. Dremio gives you that sort of integrated feeling of a database or a data warehouse, but on your data lake. And that’s sort of where Dremio excels: not just being fast, but making the fast part of a data lake easy. So yeah, that’s basically how Dremio uses these different Apache technologies. And with that, I’m going to pass it back to Anais.
Anais Dotis-Georgiou: 00:30:12.012 Thank you so much. That was super interesting. I didn’t know a lot about Iceberg as well, so that’s really cool to hear about.
Alex Merced: 00:30:22.022 Thank you.
Anais Dotis-Georgiou: 00:30:23.023 I’m sorry for being trigger-happy a little bit on the slides there.
Alex Merced: 00:30:26.026 Oh, yeah.
Anais Dotis-Georgiou: 00:30:28.028 But before we leave, I wanted to take a moment to talk about some resources for getting started with InfluxDB. So we have our forums. As Caitlin mentioned, please join us there. Ask any questions that you have about InfluxDB or Dremio. Then we have our Slack space as well, so that’s another great resource. We also have Influx community on GitHub, which is an org where the developer advocates, including myself, as well as other community members, create projects around using InfluxDB with other tools. So if you have any questions about how to use InfluxDB with some other tools or want examples or ideas of things that you can do with InfluxDB, go there. Then we also have a book on InfluxDB, but we actually need to do some work rewriting it for v3. So if you’re interested in 3.0, then I would actually recommend skipping that resource. But our docs are fantastic, so I highly recommend our docs as well as our blogs to learn about all things InfluxDB. And then last but not least, InfluxDB University, which Caitlin already talked about, you can go there and get courses on things related to InfluxDB. And we’re currently also in the process of adding more courses for InfluxDB v3. So it’s a little bit sparse on that content right now, but we’re just trying to play catch up. Caitlin, I know you wanted to talk about this slide, so I’ll give you a minute.
Caitlin Croft: 00:31:54.054 Awesome. Thank you. So I just wanted to provide everyone with a few additional resources. If you’re interested in learning more about InfluxDB and maybe you keep hearing us talk about InfluxDB 3.0 and you’re curious to learn more about it, be sure to check out this webinar. You can check out the recording. Learn about how you can save 98.6% on storage costs with InfluxDB. And if we’ve completely convinced you that you need to try InfluxDB, which obviously I hope we have, we would love to talk to you about running a proof of concept. So really diving into what your data needs are and showing what InfluxDB can do for you. So just want to make sure that everyone has all the resources that they need to get up to speed with InfluxDB 3.0. Looks like Hammad, and I apologize if I’m mispronouncing your name, you’ve raised your hand. So I just wanted to let you have the chance to talk. So Hammad, I just allowed you to unmute yourself. So if you want to talk directly with us, you can unmute yourself.
Hammad: 00:33:10.010 Hi. Hi, thank you very much. By the way, sorry about that. It wasn’t my intention to raise my hand, sorry about that.
Caitlin Croft: 00:33:20.020 Oh, okay. No worries. Sorry to put you on the spot. Do you have —
Hammad: 00:33:24.024 No, no, no, it’s fine. It’s fine. It’s just, you know, sometimes when you’re handling a phone, you tend to press a few things, but thank you.
Caitlin Croft: 00:33:31.031 No worries.
Hammad: 00:33:32.032 I definitely understood you and the presentations are quite interesting. I think a comparison to maybe MongoDB or others would be interesting, basically.
Caitlin Croft: 00:33:42.042 Yeah. Anais can talk a little bit more about the differences between MongoDB and us. I know we always like to say that we’re purpose-built for time series data. There’s a lot of different tools that you can throw timestamp data at. But because we were built from the ground up for it, we’re able to handle really high ingestion really well. A lot of times when people start off with time series data, they don’t know how much data they need to collect, so they might start collecting it at every millisecond or nanosecond. And when you’re dealing with that much data, you need to make sure the tool can handle that really high ingestion. And of course, a lot of people downsample it later down the road once they know what they need.
Hammad: 00:34:28.028 Okay. You can tell me more. Of course, of course, of course. Okay. Thanks for that. Thank you very much for that.
Caitlin Croft: 00:34:33.033 Yeah, of course. I’m just going to put you on mute for now. Okay. All right. So, Anais, I was just curious. You’ve been using InfluxDB for a long time. What excited you the most when you found out that we were rewriting it using Apache Arrow? What did you think were going to be the biggest benefits?
Anais Dotis-Georgiou: 00:35:01.001 I mean, for me, Arrow is part of the equation, but I think DataFusion. I think the big thing that users have always asked for since I’ve joined is more control over memory. So being able to have operator control with certain versions of InfluxDB that are coming out is really exciting. Or certain offerings. And then for me, personally, I love SQL, but I also don’t love SQL. So I understand it’s just easy to use. And there’s just so much information on it. You can just use ChatGPT or any similar tool to describe your query in native English and then you get your SQL query returned back to you. So it’s just super easy to use. But I’m really excited for the future of v3, especially around being able to pull Parquet files directly and pop them into a variety of other tools. That just makes working with so many tools so much simpler. And then also eventually being able to support pandas natively with InfluxDB would make me really happy because I really love pandas. And so I think I’m really excited about the future it’s headed in, but I also wanted to give Siji an opportunity to ask his question, which he put in the chat. So he says - and this is for you, Alex - does Dremio convert the underlying data store or data warehouse storage data format to the Arrow data format?
Alex Merced: 00:36:39.039 Yeah. So basically what happens is that whenever you’re querying data, what it’s going to do is basically first hit the data source. So basically, if it’s a database, it will do a pushdown query to the database first to get that dataset, which will then be converted into Arrow format for processing. Now, in order to avoid pushdowns, that’s another benefit of using data reflections in Dremio, because then, let’s say I have a dataset in MySQL. If I turn on data reflections on it, it’ll create that Iceberg plus Parquet representation on my data lake behind the scenes that can then accelerate the whole process of reading it and then loading it into Arrow in memory. But also, if you’re querying that same dataset multiple times, some of that data will get cached in Arrow format in the nodes’ memory in order to speed up future queries on that same dataset. So on the first query, it’s going to read the data source in full. And then in the future, it’ll use data reflections and the C3 cache to help accelerate those queries or any queries going forward.
Anais Dotis-Georgiou: 00:37:44.044 Thank you, Alex.
Alex Merced: 00:37:45.045 No problem.
Caitlin Croft: 00:37:46.046 Awesome. Thank you. We’ll stay on here for just a couple minutes. If anyone has any more questions for Anais and Alex, feel free to post them in the chat or the Q&A. Alex, I’m just curious. I asked Anais this question as well. What excites you about — being in technology for so long, what excites you about the fact that Dremio’s built with Apache Arrow versus something else? Why do you think that is really cool?
Alex Merced: 00:38:18.018 I’m definitely a big fan of interop. I like the idea of finding different ways to connect different pieces for new use cases. So the more that we use sort of these open formats, the better that story gets. And that’s why I am very excited to see more and more sort of data lakehouse pieces standardizing on Apache Iceberg as sort of that data lakehouse format. So now not only are you seeing data lake tools like Dremio connecting to Apache Iceberg, but now you’re starting to see a lot of data warehouses and databases being able to use Iceberg as sort of an external table format for certain data use cases, or just being a good export target when you’re exporting to a data lake, which has been pretty exciting. Because again, especially for really large datasets, it really kind of can make a difference on how a query can get planned, and make it where you’re not having to move the data around as much, and then you can just focus on using what is the best compute tool for the particular use case you want. So I do like the open ease of interop that this environment is creating, and just the whole Apache ecosystem in general, how it does that.
Caitlin Croft: 00:39:26.026 Fantastic. Okay. Cool. Thank you. Really appreciate that. It’s always exciting to hear you guys talk about why the underpinning technology with Apache is so fantastic. It looks like someone’s asking if there are going to be additional reading materials. I believe there will be some content sent out to you guys tomorrow, including links to the recording and the slides. So be sure to check your email for that. And all of you should have my email address. So if, an hour from now, you think of another question that you wish you asked Anais and Alex about Apache, feel free to email me. I’m happy to put you in contact with them, and all that sort of good stuff. So really appreciate everyone joining today’s webinar. Anais, Alex, is there anything else you would like to add? Any last thoughts, tips, tricks?
Alex Merced: 00:40:24.024 Just recommend everyone to give me a follow on Twitter. That’s @AMdatalakehouse. And also, I also do a podcast periodically called Data Nation, where I talk about all these data topics and just love talking tech. So if you want to hear my random thoughts as time goes on, those are a couple of spots to check it out.
Caitlin Croft: 00:40:44.044 All right.
Anais Dotis-Georgiou: 00:40:45.045 Yeah, I just want to emphasize, please come and find us on the forums in Slack if you want to talk about anything. We also have office hours every Wednesday at 11:00 CST. So if you have any questions, all the developer advocates will be available on Slack. We’ll post a video as a prompt for things to learn about InfluxDB so you can get exposed to new things that you might not know about. But then we’ll be there for the whole hour, available to jump on a Zoom call with you or answer any questions that you might have. So yeah, come say hi.
Caitlin Croft: 00:41:23.023 They’re a really friendly bunch. I will definitely say that about our dev rels. Someone asked when the recording will be made available. It will be available tonight or tomorrow morning. So it’ll be pretty quick. And everyone who’s registered for this webinar will get a link to the recording tomorrow, so don’t worry about it. It’s actually the exact same link that you used to register. So if you check back where you registered, it should be available by tomorrow morning. All right. Well, thank you, everyone, for joining today’s session. Really appreciate it. And thank you so much to Anais and Alex for providing your insights and expertise on your products.
Alex Merced: 00:42:01.001 Thank you.
Anais Dotis-Georgiou: 00:42:03.003 Thank you, Caitlin. Thanks, Alex.
Caitlin Croft: 00:42:05.005 Bye.
Alex Merced: 00:42:06.006 Bye.
Alex Merced
Developer Advocate, Dremio
Alex Merced is a Developer Advocate for Dremio, a developer, and a seasoned instructor with a rich professional background. He has worked with companies like GenEd Systems, Crossfield Digital, CampusGuard, and General Assembly. Alex is a co-author of the O'Reilly Book "Apache Iceberg: The Definitive Guide." With a deep understanding of the subject matter, Alex has shared his insights as a speaker at events including Data Day Texas, OSA Con, P99Conf and Data Council.
Driven by a profound passion for technology, Alex has been instrumental in disseminating his knowledge through various platforms. His tech content can be found in blogs, videos, and his podcasts, Datanation and Web Dev 101. Moreover, Alex Merced has made contributions to the JavaScript and Python communities by developing a range of libraries. Notable examples include SencilloDB, CoquitoJS, and dremio-simple-query, among others.
Anais Dotis-Georgiou
Developer Advocate, InfluxData
Anais Dotis-Georgiou is a Developer Advocate for InfluxData with a passion for making data beautiful with the use of Data Analytics, AI, and Machine Learning. She takes the data that she collects, does a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.