Apache DataFusion Meetup: Chicago December 2024 Recap
By Andrew Lamb / Developer / Jan 06, 2025
This past week, I attended and spoke at the Apache DataFusion Meetup in Chicago, Illinois. Inspired by Sami Tandogdu’s (Synnada) great recap of the DataFusion Belgrade meetup, I figured I would try it myself.
First of all, huge thanks to 1871, Pydantic, and (of course) InfluxData for sponsoring the event; to Adrian who did much of the work organizing; and to Xiangpeng and Adrian for some of these pictures. Around 25 DataFusion enthusiasts attended, learned from talks hosted by project contributors, and discussed ideas for the future. The meetup felt somewhat unique as almost all attendees were using DataFusion in their products or projects. This led to some great discussions and a visceral feeling that the adoption of DataFusion is increasing. Below is a summary of the four featured talks:
“Building a Real-Time Data Lake with DataFusion”
Adrian Garcia Badaracco - Founding Engineer, Pydantic
First up was Adrian, a founding engineer at Pydantic. His team is building the database for Pydantic Logfire, an observability platform. Adrian gave an overview of how Pydantic uses DataFusion to build a near real-time data lake for observability data, along with some details of their indexing and metadata store.
VIDEO / SLIDES
“Practical Data Science in Robotics Using DataFusion”
Tim Saucer - Director of Simulation & Infrastructure, May Mobility
Next up was Tim Saucer, a contributor and committer on DataFusion, who focused on the Python bindings. Tim spoke about data science in robotics and how DataFusion can be used to address some of the challenges particular to that field.
VIDEO / SLIDES
“Practical Disaggregated Cache for DataFusion”
Xiangpeng Hao (@XiangpengHao) - PhD Student, UW Madison
The next speaker was Xiangpeng Hao, a fourth-year PhD student at the University of Wisconsin-Madison, studying and building database and storage systems. He spoke about his work building SplitSQL, a disaggregated cache for modern data analytics that is also built on DataFusion. He is a former intern at InfluxData and, in that role, contributed heavily to the StringView integration in Apache DataFusion and Parquet Metadata.
VIDEO / SLIDES
“Building InfluxDB 3.0 with the FDAP Stack”
Andrew Lamb (@alamb) - Staff Engineer, InfluxData, and DataFusion PMC chair
Finally, it was my turn to speak about why and how we built InfluxDB 3.0 using the FDAP stack, with a focus on the DataFusion aspects. Sorry for the somewhat goofy picture and for forgetting to turn on the microphone for the recording.
VIDEO (no sound 🤦) / SLIDES
In addition to the speakers, it was great to meet Alex Wilcoxson, Michael Maletich, and others from Relativity Software, who are building a document discovery platform using DataFusion, and Michael Ward of DataFusion-Python fame. Also present were Camuel Gilyadov and Sergei Turukin from Embucket, who are working on a new DataFusion-powered project, and Devan Benz, a fellow Influxer working on database internals. After lunch, we had some informal conversations about topics such as the future of the project, building secondary indexes, performance, and the DataFusion-Python roadmap.
While running around meeting other users is somewhat exhausting, I think it is important during this stage of the project’s growth. As its adoption takes off, building a community that can sustain the project over the long term is more important than ever, and I am very excited, as always, to be a part of that.