ETL Made Easy: Best Practices for Using InfluxDB and Mage.ai
Session date: Nov 14, 2023 03:00pm (Pacific Time)
InfluxDB is the purpose-built time series database, and the new InfluxDB 3.0 offers major performance gains due in large part to its columnar database design, built on the Apache ecosystem, including Apache Arrow and Parquet. Mage is an open-source, hybrid framework for transforming and integrating data. Positioned as the modern replacement for Airflow, Mage was built from the ground up to bring engineering best practices to data pipelines. It can run all your Python, SQL, and R data transformations — whether they’re in Polars, Pandas, or DuckDB. Developers use Mage to create real-time and batch pipelines that transform data using Python and SQL. Join this webinar to learn how to use Mage to create materialized views of time series data in InfluxDB Cloud.
Join the live discussion and Q&A as Matt Palmer (Mage) and Anais Dotis-Georgiou (InfluxDB) dive into:
- Overview of InfluxDB and Mage and why both solutions leverage the Apache Arrow ecosystem
- ETL best practices – Learn from SMEs how to simplify workflows and analyze time-stamped data
- Anomaly detection demo – See InfluxDB and Mage in action together
Watch the Webinar
Watch the webinar “ETL Made Easy: Best Practices for Using InfluxDB and Mage.ai” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Here is an unedited transcript of the webinar “ETL Made Easy: Best Practices for Using InfluxDB and Mage.ai”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors. Speakers:
- Anais Dotis-Georgiou: Developer Advocate, InfluxData
- Matt Palmer: Developer Relations, Mage.ai
Caitlin Croft: 00:19
All right. I think we’ll get started here. Once again, hello everyone and welcome to today’s webinar. My name is Caitlin Croft and I’m joined today by Anais from InfluxData, as well as Matt Palmer from Mage. Please post any questions you may have for them in the Q&A. This session is being recorded and the recording and the slides will be made available by tomorrow morning. And without further ado, I’m going to hand things off to Matt and Anais.
Anais Dotis-Georgiou: 00:33
Thank you, Caitlin. So, welcome everybody. My name is Anais and I’m a developer advocate at Influx, and I encourage you to come and connect with me on LinkedIn if you want to. Feel free to ask any questions that you have about this webinar or developer advocacy or Influx. I’d love to connect with you there. And for those of you who don’t know what developer advocacy is, basically it’s a way that I can help represent the community to the company and vice versa by creating blogs, tutorials, webinars like this, and demo projects, and also by answering community questions and bringing product feedback back to the product team. So, yeah, that’s pretty much what I do, and I’ll let Matt introduce himself as well.
Matt Palmer: 01:15
Awesome. Thanks, Anais. Sorry, do you have anything to add?
Anais Dotis-Georgiou: 01:18
No, no, no.
Matt Palmer: 01:19
Okay, cool. I’m Matt. I work in developer relations at Mage. So, that was an excellent explanation of developer advocacy; I do something very similar. If you’d like to connect with me, I also have a newsletter you can follow along with. Yeah, super excited to talk about Mage and InfluxDB today. And a little bit about me: I have a background in data engineering. I’ve worked in product analytics, mostly on the data side. And that’s kind of led me to work with Mage, which is a data transformation and orchestration tool, which we’ll talk more about today. So, excited to jump in. Please feel free to leave questions in the chat. I think we’ll have time for questions at the end as well, but would love to hear them. So, I can jump into the agenda a little bit. I’ll be talking about Mage to kick it off. Then we’re going to pass it over to Anais and talk about what InfluxDB is. And then we’re going to give you a demo. So, we’re going to talk about using Mage and InfluxDB in an anomaly detection example, and I think we’ll have some time for questions at the end there.
Matt Palmer: 02:18
But to jump right in, what is Mage? So, this is a webinar hosted by Influx. You might not be familiar with Mage, but it’s an open-source tool for transforming and integrating data. And we’ll talk a little bit more about what that means, about exactly how Mage works. But just to tide you over, on the right there is a snapshot of our graphical editor for composing data pipelines. And we’re going to give a demo, we’re going to walk through exactly how that works and exactly what it looks like to build awesome data pipelines, magical data pipelines even, using Mage. So, Mage is really built around the idea of projects. Each project is more of an environment, right? Within it, you have pipelines. And so if you’re familiar with data engineering workflows, a pipeline, often referred to as a DAG, is a data engineering workflow that extracts, transforms, and loads data into a data source. So, it performs some sort of job that transforms data. It’s used to process data. And in Mage those pipelines are built with blocks. And so blocks are reusable components. If you’re familiar with Airflow, you might know what a task is in a DAG; blocks are basically atomic pieces of code that perform some operation.
Matt Palmer: 03:35
And so in Mage those blocks can load data, transform data, or export data, and there’s a bunch of other functionality. So, on the surface, this seems pretty basic, I think, but it can be really powerful, especially since Mage supports blocks that have a number of different functions. So, we have sensor blocks that can detect if some event occurs and perform an action based on that event. We have conditional blocks that allow you to create branching logic or dynamic conditions. And dynamic blocks that can fan out logic to execute tasks in parallel. Similarly, there are also webhooks and other things that can help you build the best pipelines possible. And so when you stack this with some other functionality that Mage has, like data integration (if you’re familiar with Meltano or Fivetran, it’s a similar concept where we’re able to synchronize data sources in the tool), unified pipelines (that is, passing information between pipelines), and other collaborative features like multi-user environments and templating.
Matt Palmer: 04:36
It really makes for a powerful data engineering experience. Yeah, mind-blown emoji. I think one of the things we pride ourselves most on is that developer experience and the interactions that you’ll have with the tool. So, this is an example of adding a block to transform data from S3. But if I take a step back a bit and give you some context on why this is important: MDS here stands for modern data stack, and I think one thing that we’ve noticed is that there are a lot of shortcomings among the modern data stack that lead to suboptimal developer experiences. So, if we have any data engineers here, or data practitioners, think about your day-to-day and the tools you’re using in your job; maybe this will ring true for you, right? What does your flow state look like? Are you able to develop seamlessly and switch between the different services that you use? What do your feedback loops look like? Is it easy to test and iterate in the tools that you’re developing in? And is there a lot of cognitive load associated with your job?
Matt Palmer: 05:50
How much do you need to know to get your job done? Is it really hard to think through the problems that you’re solving every day? And I have this screenshot here from LinkedIn, this guy’s saying, “Hey, the easiest way to stop an Airflow instance is to just reboot your computer. It’s too difficult to develop.” And so that’s what Mage is here for. We’re here to fix that data developer experience and make building data pipelines fun. And so that kind of ties into the two biggest features, right, that I think are apparent out of the box. Now, there’s a lot of functionality that becomes more apparent once you start using the tool, but I think the two things that stand out the most are the hybrid environment and the blocks. So, we have a GUI which allows you to develop interactively, and we’ll show you that during the demo today. Or don’t use it, right? Because part of the concept of Mage is that every element that you can create in the user interface, you can also edit as a plain file in VS Code or in your favorite code editor. And we have our blocks, which are, as I mentioned, testable, reusable pieces of code.
Matt Palmer: 06:52
So, that leads to an improved developer experience where you can code and test data pipelines in parallel, reduce your dependencies, switch tools less, and be more efficient. So, really, the concepts that we’re talking about here are engineering best practices that are built into the tool. Those are inline testing and debugging in a notebook-style format that’s familiar to most. Fully featured observability, meaning that you can do all your transformations and all of your orchestration in one place: you can pull in your dbt models, you can build streaming pipelines, you can run your data integration pipelines, which traditionally takes four tools to do, right? And those DRY principles, as I’ve mentioned blocks a few times, that allow us to move away from patterns in tools like Airflow that might result in what some refer to as spaghetti DAGs, or code with duplicate functions and weird imports that can get confusing very quickly. And more importantly, code that’s difficult to collaborate on and build with your team, or that requires a ton of time to ramp up and contribute to.
Matt Palmer: 07:58
And lastly, that might lead to something that I’m coining “data engineering as a service,” which is basically being able to build patterns that others can go implement. So, in a system that’s easy to build and easy to develop and uses these reusable pieces of code, you can imagine creating blocks or different components that transform data and then having your teammates, maybe analysts, maybe analytics engineers go and implement those pieces of code. And I’ll talk a little bit more about what that looks like in our demo today, but I think for now that’s probably a good place to leave it. And so that’s a very high-level overview of Mage. Hopefully, that makes sense if you’re familiar with data engineering. I think it’ll make a lot more sense once we get into our demo.
Matt Palmer: 08:43
And if you’d like to check out more about Mage, you can scan the QR code on the left to star us on GitHub. We’d really appreciate that. And if you’d like to read up on Mage, you can check out the QR code on the right to read our docs. Between those two things, there’s a ton of resources out there. And as I mentioned, we’re open source. So, all of the code is on GitHub. All of our documentation is on GitHub. If you want to dig into how to deploy Mage, we have Helm charts, Docker Compose templates, Terraform templates, any sort of template you can imagine to get started. So, I highly recommend you check that out. We’ll distribute this deck and you can scan these QR codes later if you didn’t have a chance to now. But I think I’m going to kick it over and we’re going to talk a little bit about what InfluxDB is now.
Anais Dotis-Georgiou: 09:29
Thanks, Matt. Yeah. So, I’m going to introduce InfluxDB. And I just want to say, too, I really enjoyed using Mage. I thought it was super easy to get started and to do meaningful work with it. And I love that it’s open source as well. Highly recommend it. But InfluxDB. So, InfluxDB is a time series database and platform. And time series data is any data that has a timestamp associated with it. So, we can think of things like temperature data, pressure data, concentration, basically any sensor data or any IoT data. But we also have time series data that comes from agri-tech, biotech, fintech. Basically, time series data is kind of everywhere when you really start to think about it. Next slide. But this is our InfluxData reference architecture. So, InfluxData is the company that creates InfluxDB, and essentially what InfluxDB allows you to do is collect data from a variety of different sources where all that data is timestamped. So, that includes metrics for DevOps monitoring, for example, system monitoring, and sensor data. And you can grab event data as well.
Anais Dotis-Georgiou: 10:40
So, in general, time series data is thought of as being both metrics and events, where metrics are time series data that come at a regular interval, things like maybe your heart rate, and an event would be a cardiovascular event like AFib or something. And so you can send all that data into InfluxDB through a variety of different methods. We have client libraries for quite a few languages. For InfluxDB 3.0, we have C#, Python, Java, C++, JavaScript, and I think I’m missing maybe two, but we have a bunch of client libraries. We also have Telegraf. And Telegraf is our collection agent for metrics and events. It’s plugin driven. And both InfluxDB and Telegraf have open-source offerings that are downloadable as a single binary. But with InfluxDB, our open-source versions are coming out shortly, where right now we have an InfluxDB Cloud 3.0 free tier that you can use. And all the improvements that were made to the 3.0 engine make that free tier very appealing because you can collect really insane amounts of data.
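For readers who want a feel for the client libraries mentioned here, below is a minimal sketch of writing a point to InfluxDB 3.0 with the influxdb3-python client. The host, token, database, tag, and field names are placeholders for illustration, not values from the webinar.

```python
from influxdb_client_3 import InfluxDBClient3, Point

# Placeholder credentials: substitute your own Cloud region host, token, and database
client = InfluxDBClient3(
    host="us-east-1-1.aws.cloud2.influxdata.com",
    token="YOUR_TOKEN",
    database="machine_data",
)

# Build a timestamped point with a tag and two fields, then write it
point = (
    Point("machine_metrics")
    .tag("machine_id", "machine_1")
    .field("power", 212.4)
    .field("temperature", 68.1)
)
client.write(point)
client.close()
```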
Anais Dotis-Georgiou: 11:48
But Telegraf is also a very popular tool. Today for our demo, for example, we’ll be using the Mosquitto plugin to collect data and also the InfluxDB output plugin to write data to InfluxDB. And then we use the Python client library to query that data and load it into Mage and then actually perform our anomaly detection. And we’ll also be generating machine data in this demo, dummy machine data for three different machines and a variety of different sensor data from that. And then within InfluxDB itself, you can do a whole bunch of things: not only collect your time series data, but also perform some analytics and queries on it. We support both SQL queries and InfluxQL queries. SQL you’re probably already familiar with. InfluxQL is a SQL-like query language that we have developed that has some specific functions that make working with time series data easy.
Anais Dotis-Georgiou: 12:47
And then a big emphasis of InfluxDB 3.0 when we rewrote the storage engine was to really prioritize interoperability with other tools. So, you can use things like Grafana, use the Grafana plugin to visualize your data there. Apache Superset, Tableau, and we’re working on a Power BI integration as well. And additionally, eventually we also hope that you can pull Parquet files directly from InfluxDB as well. And that would make putting those Parquet files into a variety of ETL tools very easy. So, next slide, please. So, InfluxDB’s new storage engine. I just want to take a minute to sing its praises. We completely rewrote it. We wrote it in Rust and used the Apache ecosystem. So, the reason why we used Rust is because Rust offers really fine-grained memory management, and we wanted to be able to provide that fine-grained memory management to users. So, we will be offering a clustered version of InfluxDB, where you can completely manage and have great operator control over InfluxDB.
Anais Dotis-Georgiou: 13:56
And then it’s also built on Apache Arrow, Apache Parquet, Arrow Flight, and DataFusion. So, Apache Arrow, for those of you who don’t know what that is, is a framework for defining in-memory columnar data. And Parquet is the column-oriented durable file format. And Flight is used to transport these really dense and large data sets over a network interface. And DataFusion is the query framework, and it also uses Apache Arrow as its in-memory columnar format. So, basically the summary here is that when we moved to this columnar format, we were able to gain a lot of advantages in terms of compression and efficiency of the data sets, because we’re able to basically summarize repeated numbers or repeated values, create indexes for those, and compress them really efficiently, so that we can write really, really large amounts of data to InfluxDB.
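To make that compression point concrete, here is a small illustration, not InfluxDB’s internal code, of how columnar formats handle repeated values. It uses pyarrow, which underlies Arrow and Parquet in Python; the column contents and file name are made up for the example. Parquet dictionary-encodes the repeated strings by default, storing each distinct value once plus compact integer references.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A column with heavily repeated values, typical of time series tags
machines = pa.array(["machine_1", "machine_2", "machine_3"] * 100_000)
table = pa.table({"machine_id": machines})

# Dictionary encoding (on by default) plus zstd compression keeps the file tiny
pq.write_table(table, "machines.parquet", compression="zstd")
print(pq.read_metadata("machines.parquet"))
```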
Anais Dotis-Georgiou: 14:59
And specifically, for those of you who might be curious, I really recommend looking at this benchmark, especially if you are a 2.x user. Hold on, let me just copy it. There we go. Oh, that was a strange URL. But in any event, essentially one of the biggest ingest numbers that I enjoy looking at or considering is that you can now write around 4.3 million values per second to InfluxDB 3.0. So, pretty exciting. And then the query and write performance has been improved greatly as well. And then next slide, please. Great. So, this is just an example of our data explorer in our UI. So, here you can choose to query with either SQL or InfluxQL, and you would basically just select the table or the measurement that you want to query your data from, and then you can use the script editor and the query builder to generate your SQL or InfluxQL queries and hit run.
Anais Dotis-Georgiou: 16:13
But the data explorer is really only meant to explore your data, to kind of confirm that it’s there and get a general idea of what it looks like. If you actually want to perform any sort of deeper data analysis or create dashboards, again, I’d go to another tool that’s specifically built for that. Next slide, please. So, now we’re ready to talk about our demo. I’ll let Matt introduce it. But essentially, we’ll be creating anomalies and sending alerts to Slack with Mage on our machine data. And then again, we’ll send a URL for where you can find this demo and try it out for yourself. But in general, that demo lives in the Influx community org on GitHub, and there are demos for using InfluxDB with a variety of other tools. So, a whole bunch of IoT examples, Raspberry Pi examples, Mosquitto examples, MQTT examples, other anomaly detection and forecasting examples. So, yeah, I highly recommend giving that resource a look if you are looking at doing anything with time series data in InfluxDB.
Matt Palmer: 17:18
Awesome, thanks. Yeah, and I totally agree. Just to reiterate, we’ll have the link for the demo here, but InfluxDB is an awesome tool. Even though there were some caveats around the visualization functionality, in my experience working with it, it was really great. It was really handy for being able to see live data as it’s coming in, and it was really fun to do this project with InfluxDB. So, the demo that we’re going to talk about today is built on top of InfluxDB, and then we’re going to use Mage to perform some operations on top of that: pull data in, do some light transformations, and then run an analysis. But essentially the demo simulates some machine data: load, vibration, power, and temperature. And yeah, sure, we can get you the demo link if I don’t go—
Anais Dotis-Georgiou: 18:07
I’ll send it.
Matt Palmer: 18:07
—the wrong direction. Okay, sweet. Thanks, I appreciate it. So, basically the demo generates this machine data. It’s all containerized, so it’s super cool. And then it allows you to click a button and generate some anomaly data. And so we’re pulling this data in, and then we’re going to use Mage to build a pipeline that, number one, loads the data. Two, uses a Python library called River to import an online machine learning model that’s going to detect those anomalies. And then three, sends us an alert in Slack if it detects an anomaly. So, this is a pipeline that you could then theoretically schedule to run every day and check for anomalies in an IoT data source, perhaps for some of your machines. And so this is really interesting. I think it’s a pretty on-point use case. This could be in production somewhere, and we’ll talk about how it works. So, without further ado, I’ll jump into Mage. And I think we’re going to start simple. So, I’ll walk through a very simple example to show you what the tool is and how it works, and then we can get into our anomaly detection scenario. So, this is Mage. Hopefully, everyone can see this. I’ll make it a little bit bigger for us here. This is the pipeline overview page, so it’s a list of our pipelines. So, the Influx-Mage demo is what we’re primarily going to talk about today. But I think the example pipeline is illustrative of a very simple demo and really relays the core idea of what Mage is. So, perhaps familiar to many, we have our Jupyter-style notebook here, right? So, there are three—sorry?
Caitlin Croft: 19:43
Matt, we can see it, but can you zoom in a little bit just so the text is a little bit bigger?
Matt Palmer: 19:48
Absolutely. Thank you. How’s that?
Caitlin Croft: 19:51
Perfect. Yeah, that’s easier to read.
Matt Palmer: 19:53
Perfect, perfect. So, again, we have sort of a Jupyter-style notebook appearance here, and each cell represents a transformation. So, in this pipeline, we’re loading data from an API, and if I run this cell here, it’ll just pull in a data frame. And simultaneously we’re testing the output. So, I think that’s one big benefit of Mage: when you perform a transformation, you can also perform tests directly on the output, and they’re all returned as data frames. In some other tools you might not be dealing with data frames, and things might be a little bit more messy. So, that’s a data loader. But what do you do after you load data? Typically, you transform it. So, a transformer here performs some light transformations, fills in some missing values, and then actually performs another test on the output. So, if we run this cell again, we get a data frame with that transformed data. And then once we’re done transforming data, we probably want to write it somewhere. So, if I jump down to the last cell and I execute this with all upstream blocks, we’ll perform those transformations and then run the tests and write the data to our destination, which in this case is just a file.
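For readers following along without the recording, a Mage block is just a decorated Python function in a plain file. Below is a minimal sketch of a data loader block with an attached test, based on Mage’s standard block scaffolding; the CSV URL is a placeholder, not the one used in this demo.

```python
import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_data_from_api(*args, **kwargs):
    """Load a CSV from an HTTP endpoint into a pandas DataFrame."""
    url = 'https://example.com/data.csv'  # placeholder URL
    response = requests.get(url)
    return pd.read_csv(io.StringIO(response.text), sep=',')


@test
def test_output(output, *args) -> None:
    """Mage runs this test against the block's output automatically."""
    assert output is not None, 'The output is undefined'
```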
Matt Palmer: 21:08
Now, the cool thing about Mage is that you can also obviously pull data in from any source you can imagine, right? We have a ton of different templates, whether that’s S3, BigQuery, Redshift, your data warehouse, your data lake, your data lakehouse. You can write any transformation you’d like, whether that be in SQL or Python; we support both SQL and Python blocks. And then you can also export to a suite of destinations. But we also support streaming pipelines. We support data integration, so it’s not just limited to this micro-batch ETL architecture. Those are important to call out. You can read much more about that on our docs page or in our GitHub repo, or play around yourself. But the point is to just illustrate that you can also do things like run dbt models. This is not just limited to what I’m showing you here today, even though this is a simple example. But for the demo itself, we’re going to open the pipeline up and we’ll talk about exactly what it does. So, you can disregard this block here.
Matt Palmer: 22:10
So, we have our actual extraction from Influx, and then a block that just loads the data locally to make sure that this demo works, because I’ve given live demos that didn’t work, and that’s not a fun experience. But in this first cell, this is the Influx data loader. We’re just importing the Influx client. And the great thing about writing Python in Mage is that you can literally write anything that you want to. So, we just install the InfluxDB Python client for version three, given all of the improvements, and then are able to just pull in environment variables, since Mage has pretty great environment variable support, with our host, token, and database name. And then we can just write a query in SQL against that database, close the connection, convert the response to pandas, make some minor adjustments to the time, and we have our data frame here. So, you can see we’re getting in those machine values. So, this is simulating IoT data. We’re getting in the load, power, machine ID, provider, temperature, time, and some other ancillary data.
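A data loader block like the one described here could look roughly like the sketch below, using the influxdb3-python client. The environment variable names, table name, and time window are assumptions for illustration, not the demo’s verbatim code.

```python
import os

from influxdb_client_3 import InfluxDBClient3

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_from_influxdb(*args, **kwargs):
    """Query recent machine data from InfluxDB 3.0 and return a DataFrame."""
    client = InfluxDBClient3(
        host=os.environ['INFLUXDB_HOST'],          # assumed env var names
        token=os.environ['INFLUXDB_TOKEN'],
        database=os.environ['INFLUXDB_DATABASE'],
    )
    # InfluxDB 3.0 accepts plain SQL; 'machine_data' is an assumed table name
    table = client.query(
        "SELECT * FROM machine_data WHERE time >= now() - INTERVAL '1 hour'"
    )
    client.close()
    return table.to_pandas()
```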
Matt Palmer: 23:22
So, to get started with this data, we’re first performing a few cleaning tasks. That’s data engineering, right? You get data and you need to clean it, you need to do all the hard stuff. But we’re going to generate a unique ID, and that’s just done through joining, rather, slugifying, the machine name and a few other details, converting the datetime to microseconds, renaming the power column, very basic stuff. And yeah, just prepping the data basically for the fun stuff, which is detecting anomalies. So, as I mentioned, this demo uses the River library, and River is an online machine learning library. It actually lets you run anomaly detection on streaming data. We’re simulating that streaming data in our model. So, you can also run it on static data sets, but I think this is a really promising tool, a really interesting tool if you need real-time streaming data. So, I highly recommend you check out this library as well.
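A transformer block doing this kind of prep could look roughly like the following sketch. The column names and the python-slugify dependency are assumptions for illustration; the demo’s actual schema may differ.

```python
import pandas as pd
from slugify import slugify  # python-slugify, an assumed dependency

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def clean_machine_data(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Unique ID: slugified machine name joined with the machine ID (assumed columns)
    df['unique_id'] = df['machine_name'].apply(slugify) + '_' + df['machine_id'].astype(str)
    # Convert the timestamp to microseconds since the epoch
    df['time_us'] = pd.to_datetime(df['time']).astype('int64') // 1_000
    # Rename the power column (assumed target name)
    return df.rename(columns={'power': 'power_watts'})
```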
Matt Palmer: 24:21
But it’s pretty straightforward to use. So, we’re constructing a data frame, and in that data frame, we’re going to record our anomalies. We’re going to record each data point and then score it using this model. So, one nice thing about River is that it has a really great Pythonic syntax for building what it describes as a pipeline. And we’re using this Half-Space Trees model here to detect anomalies. And so you can find details on this in the repository that is linked there and read more about exactly what we’re doing here. But the important thing to note is that we’re basically just normalizing the power component. So, for this model, we’re just looking at power. We receive multiple measurements, but we’re just going to focus on power here. And for each power measurement, we’re going to train the model and then score the result. And so it’s relatively arbitrary, but this is one of the things about models, right? You just have to play around with the values. We’re saying that a score greater than 0.8 is going to be an anomaly, and then we’re writing that to our data frame.
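The pattern being described, scoring each incoming power reading with River’s Half-Space Trees and flagging scores above a threshold, could be sketched like this. The feature name and the exact threshold handling follow the description above; this is not the demo’s verbatim code.

```python
import pandas as pd
from river import anomaly, compose, preprocessing

# Normalize power into [0, 1], then score with Half-Space Trees
model = compose.Pipeline(
    preprocessing.MinMaxScaler(),
    anomaly.HalfSpaceTrees(seed=42),
)

THRESHOLD = 0.8  # scores above this are flagged as anomalies


def score_power_readings(power_values):
    rows = []
    for value in power_values:
        x = {'power': value}
        score = model.score_one(x)  # score the point first...
        model.learn_one(x)          # ...then train on it, one observation at a time
        rows.append({'power': value, 'score': score, 'anomaly': score > THRESHOLD})
    return pd.DataFrame(rows)
```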
Matt Palmer: 25:26
So, the result is that we get these values for power and a date timestamp with a score. And then if those values are an anomaly, we’ll score them as such and return the data frame. So, with that, we basically have everything we need to know whether we have anomalies. And so if we run our check anomalies block, what this is going to do is just read that data frame and say, “Hey, are there any values where the score is greater than 0.8?” Because that would be an anomaly. And then we can visualize the data; Mage actually has an inbuilt charting feature, but it also supports Matplotlib, and in this case, Matplotlib was the easier way to visualize this data. But basically, we’re reading in our sensor data, plotting it, and then our machine learning model is detecting, like, “Hey, our score is greater than our threshold of 0.8, so that’s an anomaly.” And in this chart, the blue line is the actual value, and then we normalize the value between a range of zero and one just for River. That’s a requirement of the model.
Matt Palmer: 26:36
So, end to end, right? This is kind of showing us: okay, we can extract data from a data source, transform it using Mage, and then, if we want, install a machine learning library or some other anomaly detection service, run that on our transformed data, and actually visualize that in the pipeline, so you can get some verbose logging and see exactly what happened. But then the final piece, right? If there’s anomalous data, somebody probably should know about it, right? Who are we going to tell that this power value is completely going nuts on, it looks like, all of our machines? So, Mage also has alerting functionality. And that’s configured on the triggers page; there are some settings that you can use to configure this, but we’ve done that already. You can find more details on how to do that in the GitHub link to the demo. But in my little playground Slack channel here, you can see that we got a failure today on the pipeline run.
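The demo uses Mage’s built-in alerting, but for readers who want the gist of this final step, a hand-rolled Slack notification could look like the minimal sketch below. The webhook environment variable and message text are assumptions for illustration.

```python
import os

import requests


def notify_slack(message: str) -> None:
    """Post an alert to a Slack channel via an incoming webhook URL."""
    webhook_url = os.environ['SLACK_WEBHOOK_URL']  # assumed env var
    response = requests.post(webhook_url, json={'text': message}, timeout=10)
    response.raise_for_status()


# Example usage after the anomaly check (column names assumed):
# if (scores['score'] > 0.8).any():
#     notify_slack(':rotating_light: Power anomaly detected in machine data')
```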
Matt Palmer: 27:41
So, end to end: hey, we’re pulling data in from these sensors, we’re running it through a machine learning library, we’re running this detection service, and we’re seeing this anomaly. Hey, we’re going to send a Slack message to this channel to let everybody know that we probably should take a look at this, probably should understand what’s going on here. And that’s when people could go to their Mage instance, check the results, check the logs, and then start digging into what’s going on with this anomaly. So, end to end, that’s an anomaly detection service in Mage using Influx. Mage is primarily an ETL, data transformation, and data orchestration tool, but anomaly detection is one of those things that fits pretty nicely into that pattern for our tool. So, I would love to see you guys check this out. I’d love to hear, if you try this demo, how it goes. So, yeah, I think that’s all I had for the demo. What else do we have here? Oh, just another slide showing the graph in InfluxDB for our machine data, super clean. I really like their charting functionality as well.
Anais Dotis-Georgiou: 28:52
So, yeah, if you want to get started specifically with InfluxDB, I encourage you to scan that QR code, and you can also just visit influxdata.com, get started there, sign up for a free trial, and get started using InfluxDB and run this demo yourself. It’s fun because you can also go on to, I think it’s 50/50, where we’re generating the data, and actually click on the machines to make the anomalies and then see those Slack alerts being sent to you. And I also encourage you to join the Influx Slack as well. We have a channel devoted to notification testing for InfluxDB, and you can use that Slack webhook URL if you want, as well as ask any questions about InfluxDB or Mage or using the two together. Happy to answer any questions there. And yeah, next slide. Let’s see what else is going on.
Matt Palmer: 29:46
I have some links of my own in here, so I wasn’t going to slide through without mentioning this. Yeah, no, we’d love for you to star us on GitHub. That’s the first link there. We also have some pretty great documentation. I say that because I wrote most of it. Feel free to disagree, but if you do disagree, you should disagree in our Slack and tell me just what I can fix on the documentation or create a pull request because it’s also open source. So, definitely check out our Slack. I think we just crossed the 3,700-member threshold today. So, lots of folks in there talking about Mage. You can talk to me anytime you want. I’ll be on there. Or say hi on LinkedIn. And then there’s a bunch of stuff. If you want to follow me or check out my blog or my newsletter or whatever, feel free to do that too. But definitely check out Mage. Lots of exciting stuff going on there. What else we got? Oh, more about Influx.
Anais Dotis-Georgiou: 30:38
Yeah. So, feel free to join the Influx community Slack. You can join the, actually, I think it’s now the InfluxDB V3 channel if you want to learn about that. Originally, the storage engine was called IOx, because IOx is iron oxide, which is rust, essentially. So, since it’s built in Rust, that’s why we were calling it that, but now it’s just announced as V3. Also, we have forums as well, community.influxdata.com. I’m mostly answering questions on there. And then other DevRels spend the majority of their time in Slack, but we’re all in both places. Yeah. So, feel free to reach out in either place and ask questions. I’d love to hear about what you’re planning to do with Influx, or what you enjoyed about this webinar, or what you’d want to learn more about. Yeah, I always get inspiration from the community, so please always share. Yeah, next slide.
Matt Palmer: 31:32
I like the IOx thing too. That’s pretty cool.
Anais Dotis-Georgiou: 31:35
It’s cute, right? I like it. Yeah. And then here are the QR links to those resources as well. I think I shared them in the chat as well, but whatever your preference is.
Matt Palmer: 31:47
More resources.
Anais Dotis-Georgiou: 31:48
More resources. We also have docs. That’s right. Our docs are fantastic as well. I can’t say enough good things about our docs team. And then blogs. I linked to two blogs about Mage and InfluxDB specifically, but we have tutorials, if that’s your preferred way of learning how to do things with Influx, for a variety of different topics. And then we also have InfluxDB University. There are courses there for learning about how to use Telegraf and V2. We’re in the process of updating it for V3, so I wouldn’t go there if you’re specifically looking for InfluxDB V3 content quite yet. But if you are interested in learning how to use Telegraf, for example, that is a good resource. Yeah, I think that’s everything. Oh wait, no, more.
Caitlin Croft: 32:32
I can talk about this one. We just want to make sure everyone has so many links that they can find so much information. So, I just wanted to provide a few more links, because there haven’t been enough. So, just check out another webinar. If maybe you’re brand new to InfluxDB and you’re trying to learn more, here’s another webinar that might be of interest to you. And then just more additional resources: the team here at InfluxData has been really busy this last year getting InfluxDB 3.0 out the door, and as a result, you can save up to 96% on data storage costs. So, just kind of learn more about updates to the product. And lastly, maybe if you’re just like, “Oh my God, I need InfluxDB right now. My team needs it.” We would love to talk to you about running a proof of concept.
Matt Palmer: 33:28
Awesome.
Caitlin Croft: 33:30
Awesome. Thank you both, Matt and Anais. That was fantastic. There’s a bunch of questions here, so we’ll jump right into them. So, how does Mage AI compare to, and I apologize, I’ve actually never heard of this, so I’m not sure how you pronounce it, but it’s N-8-N. And does Mage have ready-made integrations or connections like n8n?
Matt Palmer: 33:52
Yeah, sweet. Actually, this is interesting. I think I’m familiar with n8n. So, my understanding is that it’s more like workflow automation, but we can pull it up. This is my favorite part about live demos: “Hey, look, I wasn’t completely wrong about something.” So, yeah, this is more like, “Hey, I want to automate this workflow and it’s going to do things for me.” So, this is a good example: maybe for things in Google Sheets, I’m going to split them into batches or run a different workflow. Mage is much more catered towards data engineering, so here’s our great homepage. So, it’s data plumbing, without some of the other things. So, if you’re familiar with data analytics, data engineering, it’s all about, “Hey, I have data that lives somewhere. I need to pull it in, perform a transformation, and then write it to a different output.” So, in a sense, if you’re thinking about what space this lives in, other tools like Airflow are kind of our competitors. We’re trying to build on that experience and create something better, create something a little bit more magical. So, we do have a ton of connections, to your point, but the connections are all big data sources. So, Amazon S3, Google Cloud, right? All these sources where you can fetch data or write data: ClickHouse, BigQuery, Redshift, DuckDB, things like that, if that makes sense.
Caitlin Croft: 35:18
All right, so the next question. In Airflow, when I need to install a new Python library with pip, it requires installing it on all worker nodes on separate computers. Does Mage support installing a new library in the Web UI terminal and syncing it to all nodes?
Matt Palmer: 35:34
This is awesome. Again, super technical questions. Yes, I feel like we could market this better. So, yeah, in certain executors with Airflow, things can be pretty confusing. With the way that Mage is built, it’s built through Docker Compose, it’s built through Kubernetes. There is a terminal in Mage. So, installing through the terminal, I believe, will install across all of the nodes. You might have to check this against your deployment method, because I’m not entirely sure how you’re deploying Airflow. I guess it kind of depends on that configuration, but I’m fairly sure that you can install new packages here and that’ll carry through. Or, I know for certain, installing them through Docker or Helm or Terraform will carry them through.
Caitlin Croft: 36:26
How can you manage hundreds or thousands of pipelines and how can you scale horizontally easily?
Matt Palmer: 36:34
Yeah. Awesome technical questions. Yeah. So, we have a ton of deployment options. I think the easiest thing would be for me to actually navigate to our docs page and talk through them. So, you can deploy Mage on any number of providers, whether that’s AWS, Azure, DigitalOcean, or Google Cloud Platform, using Terraform or Helm. And so the piece there, right, like Helm, that’s like Terraform for Kubernetes. So, you can deploy Mage with Kubernetes. And if you’re familiar with the concept of Kubernetes, which is like a bunch of Docker containers, that’s sort of horizontal scaling, where, “Oh, hey, maybe all of my pipelines execute in a different container.” And so then in that sense, that’s horizontal scaling, where you can deploy Mage in Kubernetes. Yeah, there’s a lot more written in our docs about how to accomplish that, so I’d recommend checking that out. But we do have some configurations for horizontally scaling workflows.
Caitlin Croft: 37:33
Perfect. All right, let’s see. There are a few easier questions, less technical, but interesting all the same. Matt, what is exciting to you on the current Mage roadmap? What are you guys working on over there?
Matt Palmer: 37:49
Oh man, there’s so much exciting stuff. Tommy is just cranking out features left and right. Yes. First, I appreciate easy questions. These are great. I can relax now. I’m not going to be asked about deployment consulting. Great.
Caitlin Croft: 38:05
Not yet. Who knows?
Matt Palmer: 38:07
Not yet. So, this is not exactly answering your question, but very recently Tommy updated this tree, which now features some very cool left- and right-click options, and it’s super easy to interact with. That’s a new piece of functionality that I really like. Second, he’s also working on some really cool Spark functionality. So, if you’re using Amazon EKS specifically and running Spark workflows, there’s some really exciting functionality that I think is going to improve how you do things using Mage. So, I’d say that’s probably the thing I’m most excited about.
Caitlin Croft: 38:46
Awesome. Does Mage host any events or meetups?
Matt Palmer: 38:50
Yeah, great question. We do. The last one we hosted was the Magic Meetup in San Francisco. It was an awesome turnout. If you’re familiar with the data ecosystem, data influencers, we had Zach Wilson there. We had Xinran from Data Engineering Things. We had a bunch of other really great personalities. It was a lot of fun, super big turnout. I don’t know if—it seems like Tommy asked this, so. I also host meetups in the Bay Area for data practitioners. Go to Luma, look up Bay Data Club, or follow me and you’ll see those as well. There’s lots going on within the Mage ecosystem, within the Mage community. Sure, I’ll drop a link for you, Tommy; I have it off the top of my head. Yeah. So, you can copy and paste that and that should take you to some other events that we host, but Mage is working on more as well.
Caitlin Croft: 39:46
Perfect. So, this is kind of a question for the two of you. So, Matt, what are some tips and tricks with Mage that you wish your community knew? And then Anais, what are some tips and tricks that you wish the InfluxDB community knew? Is there anything that you guys get asked constantly or you’re like, “Oh, I wish I’d known this trick beforehand”? Anything like that?
Matt Palmer: 40:15
Sure, I can go first. I’d love to dig into this because this is something I’ve been working on. So, if you go to our documentation, I’ll need to take a look at that, but we do have a tips and tricks section, and I’ve been posting these on our YouTube as well. That is just the mage-ai GitHub page. So, there are a ton of tips and tricks on our YouTube, and you can access that through this GitHub page. There will be a demo link which will take you to our YouTube. Oh, or show you an ad. This is the danger with live demos as well. But we have a bunch of tips and tricks here as well that show you how to use functionality. So, I would say some of the more nuanced things are creating replica blocks or using split-pane view in our editor, which you can do very simply. Well, I haven’t enabled it in this demo instance, but I show it in the tips and tricks video. Or using conditional blocks. But they’re all on our YouTube, they’re all in our documentation, so I would recommend checking that out.
Caitlin Croft: 41:23
Awesome. Anais, I mean, you’ve been using InfluxDB for a long time. Especially with everything going on this last year with InfluxDB 3.0 coming out, what are some tips and tricks that you wish you had known, or that you wish the community knew?
Anais Dotis-Georgiou: 41:39
I actually love this question, because previously with Influx, we had so much going on. In earlier versions, we had a proprietary query language, and there were a lot of tips and tricks that were required for managing your data lifecycle and analyzing data with that query language. And actually, now I don’t have to give very many tips or tricks, because you can have unlimited cardinality with InfluxDB V3, so you don’t have to worry about tips or tricks to manage it. And then additionally, as far as visualizing data and creating good dashboards, you can really just use the tools that you’re probably already using, because there’s greater interoperability with V3. So, whether that is Power BI (well, that integration is coming), or Tableau or Superset or Grafana, you can just continue doing what you’ve already been doing. And then I think the last bit would also just be that using the client libraries is my favorite way to analyze data with InfluxDB, or data that’s in InfluxDB. I love pandas and I love working with Parquet files, and V3 supports both of those things. And so that’s really where I would direct people.
Anais Dotis-Georgiou: 42:55
I think anyone coming from V2 maybe misses the task engine that we had, but hopefully now you can see that you can use other open source alternatives like Mage to replace the task engine. And then this way you don’t have to learn Flux. You’re not limited to that language. You can use Python, and there are so many more examples and so much more functionality for transforming and analyzing your data with it. So, I guess the only real tip and trick I would have is to just check out the Influx community org, because if you are considering using a certain technology stack, there’s a high probability that there’s an example demo there that will contain all the information for doing what you want to do with Influx. And then I guess the last tip is, if you’re new to Telegraf, entirely new to Telegraf, the easiest way to get started is to actually set up a Telegraf configuration within the InfluxDB UI. And so I recommend going there. And then I also just wanted to give a plug for all of the exec plugins. So, if for whatever reason there isn’t a Telegraf plugin that has the functionality that you need, although there are over 200 input plugins alone, but if there isn’t one, you can use the exec plugins to extend the functionality of Telegraf with any language of your choice.
Anais Dotis-Georgiou: 44:12
And so you can do any sort of data processing, loading, or writing with any of the exec input, processor, or output plugins that exist. And then if you do do that, you can also contribute the script that you used as an external plugin. So, I’ll just share a link for that as well. But yeah, Telegraf is a fantastic tool. It gives you all the functionality of buffering and caching, so if your destination is offline or unavailable for a while, you won’t lose your data. And it’s a very lightweight agent. There are ways to install it so that you’re only actually downloading the part of the binary that corresponds to the plugins that you actually want to use, so you can reduce the size of it as well. Yeah, so that’s a great tool. And I’ll just say I’m happy that I have fewer tips and tricks that I have to give people; it’s simplified.
Caitlin Croft: 45:20
Awesome. And if anyone’s interested, actually this Thursday, in two days, Anais will be presenting our Data Collection Basics webinar, which will go over Telegraf and the other ways that you can pull data into InfluxDB. So, if you’re brand new, or need a refresher, or just need some help, it’s another fantastic resource and it’s completely free. So, go hang out there with Anais and Jessica and you’ll learn even more about the amazingness of Telegraf. Well, I think we covered everyone’s questions. We’ll just stay on here for another minute or so, to see if anyone has any last-minute questions. Thank you, Matt and Anais. I think this was a fantastic webinar. Great job putting the content together. I always love seeing how InfluxDB works well with different products, and I get to learn about other cool new products. So, I got to learn more about Mage, which was awesome.
Matt Palmer: 46:21
Awesome. Yeah. Thanks Anais and thanks Caitlin. This was really fun to collaborate and definitely love InfluxDB as a product as well.
Caitlin Croft: 46:30
Yeah. And this is just a shameless plug, but if anyone on the webinar is doing cool stuff with InfluxDB and you’d love to share it with the wider community, please feel free to reach out to me by email. I’m always keen to meet more community members and learn how you guys are using it, and obviously try to convince you to share your knowledge in a webinar or in a blog or something like that. And if any of you have any follow-up questions for Matt and Anais that we didn’t cover today, or that you think of right after we finish this webinar, feel free to email me and I’m happy to put you in contact with the two of them. Cool. Well, thank you everyone for joining today’s webinar. I hope you enjoyed it. And the session, once again, has been recorded and will be made available for replay, as well as the slides, by tomorrow morning. Thank you everyone.
Anais Dotis-Georgiou: 47:26
Thank you.
Matt Palmer
Developer Relations, Mage.ai
Matt leads DevRel and is a general superstar at Mage. Before joining Mage, he worked across the data stack as a product analyst, analytics engineer, and data engineer at companies like Storyblocks and AllTrails. In his free time, he loves writing about data on his blog, hosting meetups, staying fit, climbing, and hiking. Matt recently relocated to San Jose, CA, but has lived across the country, from Asheville, NC to Salt Lake City, UT.
Anais Dotis-Georgiou
Developer Advocate, InfluxData
Anais Dotis-Georgiou is a Developer Advocate for InfluxData with a passion for making data beautiful with the use of Data Analytics, AI, and Machine Learning. She takes the data that she collects, does a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.