How to Choose the Right Database for Your Workload in 2024
Session date: Jan 09, 2024 08:00am (Pacific Time)
Learn how to make the right choice for your workload with this walkthrough of a set of distinct database types (graph, in-memory, search, columnar, document, relational, key-value, vector, and time series databases). In this webinar, we will review the strengths and qualities of each database type from their particular use-case perspectives.
Join this webinar as Charles Mahler dives into:
- An overview of the current database industry and trends
- Database performance analysis
- Tips and tricks for picking the correct database for specific use cases
Learn how to pick the best database for your workload! This one-hour webinar will include a live Q&A.
Watch the Webinar
Watch the webinar “How to Choose the Right Database for Your Workload in 2024” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Here is an unedited transcript of the webinar "How to Choose the Right Database for Your Workload in 2024." This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors. Speakers:
- Caitlin Croft: Director of Marketing, InfluxData
- Charles Mahler: Technical Marketing Writer, InfluxData
CAITLIN CROFT: 00:00
Hello, everyone, and welcome to today’s webinar. My name is Caitlin. I’m joined by Charles, and he will be discussing how to choose the right database for your application. This session is being recorded and will be made available by tomorrow morning. Please post any questions you may have in the Q&A at the bottom of your Zoom screen. And without further ado, I’m going to hand things off to Charles.
CHARLES MAHLER: 00:24
All right. Thanks. So, let's just jump into it and look at our agenda. Basically, this is a broad webinar in the sense that we're going to go over some trends for the year ahead that might be useful to track, and the current snapshot of what's going on in the database ecosystem. Then, before diving into these various types of databases and what you can use them for, we'll look at, at a high level, what the difference between these different types of databases essentially is and why they perform differently for different types of data and different use cases. And then at the end, like Caitlin said, we'll have Q&A, and that can be about InfluxDB-related stuff or just anything that pops up during the webinar. If you have questions, I'll do my best. I'm not an expert in every single type of database, but obviously I've learned about some of them, so I can probably help in whatever way, or at least point you in the right direction.
CHARLES MAHLER: 01:27
So, I'll spread these in-depth trends throughout the presentation just to break it up, because I've done this webinar in the past, and the first time I did it, it was just a string of nine different databases in a row, and I think it felt kind of monotonous. But this is just the overview. Some big things to look out for are these multi-model databases. Before, you had NoSQL and SQL, and they were clearly differentiated. Now, different databases are trying to gain more market share by adding features, so there's some overlap that you see across a lot of different products and tools. On the AI front, the semantic layer has been around for a while as a concept, but with large language models and the way they can dynamically create SQL and query databases, a lot of people are talking about how these semantic layers can help make those more useful. There's going to be a big focus on interoperability and integration between different systems - basically common standards in terms of file formats and communication protocols. And then we see AI features directly helping make databases easier to use and also more efficient, more optimized. So those are the main trends I've seen people discussing. I compiled all this through my own research, talking to people on our team, talking to database specialists, and seeing what venture capitalists and business investor types are saying.
CHARLES MAHLER: 03:16
So, things to keep in mind: I think it's always important not to fall for the hype. It can be tempting as an engineer to chase the shiny thing, but at the end of the day, you should probably keep things as simple as possible. The least complexity for your situation is almost always the best option. There's no real magic. It's computer science. There are always trade-offs. No matter how much some of these tools try to make it sound like they're magical, they're either abstracting away complexity, which could be a problem down the road, or there's a performance trade-off somewhere. And as I go through these categories, there is overlap, as I talked about earlier. Just because I say something's a column database doesn't mean you can't use it for other use cases; you can definitely use a more general-purpose database for other things. Since we have under an hour, I have to make some generalizations and simplify things, so keep that in mind.
CHARLES MAHLER: 04:22
So, these are the types of databases that we're going to go over. I'll go over the pros and cons, a general overview, and then specific use cases for them. I originally did this in 2022, and as for why you should listen to me on this: you'll see in the next slide a snapshot of the trending databases from DB-Engines. It basically tracks things like references across the web, a bunch of different signals, and essentially creates a trend. As you saw in the earlier slide, vector databases were something I mentioned in 2022, and at that time, it wasn't even a category on DB-Engines. They didn't really track it; it wasn't significant enough. And as you'll see, vector databases, with all the AI stuff, became huge pretty much in the last year. So this kind of shows that you might learn something useful today that could put you a little bit ahead of the curve, and it also lets me pat myself on the back. I love it, to be honest.
CHARLES MAHLER: 05:29
So, this is the current snapshot. You can see that pinkish line basically came out of nowhere. They added the category, and these vector databases are now the fastest-growing category, and that's almost entirely driven by some of these AI advancements. They're very useful for those, and I'll cover specific use cases later. You can see time series databases like InfluxDB were number one last year; right now they're around two or three, depending on the timeframe you use. So, time series is also our thing, and lots of developers and companies are seeing the value of it. And this is the snapshot of basically the market share of each type of database. Despite the past hype that NoSQL was taking over and relational was completely dead, you can see that by instances and references, relational is still a pretty big slice of the market. Time series grew year over year - I don't have the image, but again, it's gaining over time, and more and more companies are using it. And document databases and a variety of other NoSQL types also have a decent market share.
CHARLES MAHLER: 06:47
So, into the performance stuff. You could do a PhD on this, so I can't cover every aspect, but I will try to give you a high-level view of why these different models of database exist. Fundamentally, the simplest way to think about it is read versus write performance, and everything that comes after that is a trade-off between the two. If you think about it in terms of writing, the fastest way to store some data is just appending to a file continuously. That would be about as fast as you could possibly write data. The issue is that when you go to query that data, you have no index, so it would be a linear scan of all your data every time. So that is the basic trade-off: query performance versus how you write and ingest that data, and everything else flows downhill from that. With some of these models and databases, you can then make assumptions for certain workloads. Take time series data as an example: being able to query that data very soon after ingest is important, but you can drop certain characteristics you're not going to need as part of that use case - you don't need a lot of the things that a relational database has. So, you can trade off in certain areas for performance in what matters to you. That's the whole idea behind some of these more specialized databases: if you know upfront what you need and what you don't need, you can sacrifice some things, and overall it doesn't hurt you because you didn't need them in the first place.
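To make that trade-off concrete, here is a minimal Python sketch of the extreme write-optimized case described above: appending to a file is about the fastest possible write path, but without an index every read becomes a full linear scan. The file path and record format are placeholders for illustration only.

```python
# Fastest possible write: append a record to the end of a file, no index maintenance.
def append_point(path: str, timestamp: int, value: float) -> None:
    with open(path, "a") as f:
        f.write(f"{timestamp},{value}\n")

# The cost shows up at read time: with no index, every lookup scans every line (O(n)).
def find_point(path: str, timestamp: int) -> float | None:
    with open(path) as f:
        for line in f:
            ts, value = line.strip().split(",")
            if int(ts) == timestamp:
                return float(value)
    return None
```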
CHARLES MAHLER: 08:30
So, these are the main design considerations that come out of that read versus write trade-off. Without going into too much craziness - because, like I said, you could do a PhD on this; databases are, in a lot of ways, the final boss of programming. There's so much stuff from those LeetCode interviews that you don't think is useful if you're doing web development, but almost all of it is relevant when you're doing database stuff. So, the first thing is on-disk format. The main question is: do you store your data as rows or in a column format, and then how do you map those? If you're using a graph database, do you map those data points to maintain that graph structure? And there's other stuff, like do you keep the data sorted on disk? Do you keep it sorted by the time it was ingested? That sort of thing. Next up is your indexing data structure. There's your on-disk representation, and there are in-memory indexes that are effectively stacked on top of that. The big thing for disk is obviously that it's cheaper. Relational databases use a B-tree because it maps very well to disk, so even if the data you need isn't in memory, they can grab it off disk more efficiently than with some other types of indexes. B-trees tend to have better read performance, and that's why they work well for relational databases. LSM trees are another alternative: they have better compression and very good write performance, so a lot of NoSQL databases use LSM trees. Then you have secondary indexes - if you're using MySQL or whatever database, you can index certain columns for certain types of queries to make them more performant.
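As a small illustration of the secondary index idea mentioned above, here is a sketch using Python's built-in sqlite3 module; the table and column names are made up, and the same pattern applies to indexing a column in MySQL or Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
# Secondary index on a column you filter by often.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.execute("INSERT INTO orders VALUES (1, 42, 19.99)")

# The query planner can now satisfy this filter via the index instead of a full table scan.
rows = conn.execute("SELECT total FROM orders WHERE customer_id = ?", (42,)).fetchall()
print(rows)
```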
CHARLES MAHLER: 10:29
You have disk-based versus memory-based designs. If you've used Memcached or Redis, those are in-memory databases, and RAM is much faster than doing anything on disk. It also gives you the bonus that you can create useful or unique data structures that are only really viable if you know ahead of time that everything's going to be in memory. How you allocate that also comes down to cost: if you have a huge data set, it's going to be very expensive to store it all in memory. So the idea is, think about the size of your data. How valuable are fast queries to you? How frequently are you querying it? Do you need results to be sub-second, or is it okay if it's a long-running job and you get the answer by tomorrow, like some analytics report? Those are the kinds of things you need to take into consideration. Then you have compression and the trade-off there. If you compress your data a lot, you can save on storage space, and it also reduces I/O bandwidth when you're moving between disk and memory. The other issue, though, is that encoding and decoding your data can be a performance loss if you compress heavily with an inefficient algorithm. A more recent strategy is hot versus cold storage. This is something that InfluxDB 3.0 does, and a lot of other databases are starting to do as well: basically swapping between memory, an SSD, and cheaper object storage for data that's not queried frequently. And if that data does get queried, it gets pulled back into the hotter tiers of storage depending on your query pattern.
CHARLES MAHLER: 12:24
So that allows you to, in theory - in the ideal world - get the same performance while also storing much more data, and not have to worry about going broke due to storage costs. And finally, you have durability and recovery. If your data is very valuable, you obviously need to make sure you don't lose it. In other use cases, if it's no big deal to lose a few seconds of data or a certain amount of data, you can make trade-offs around how much you replicate your data, how many replicas you have, backups, and the management between multiple instances. So how to choose the right database really comes down to knowing your workload and knowing what matters for your end user - whether that's an actual consumer or an internal user like your business analytics team - knowing what their expectation is, and then making trade-offs and decisions based on that. The big thing to think about as far as data access patterns is: is it an OLAP workload, where it's analytics-type stuff? That would lean towards a more column-oriented data format, because you're doing queries across maybe one or two columns, but you want to take an average or something. The other alternative is an OLTP workload, which would be seen as a standard relational database workload. That's a row-oriented type of situation.
CHARLES MAHLER: 14:10
You also have to think about whether it's a read-heavy or write-heavy workload. If it's read-heavy, you can probably get by with a simpler setup: you just cache a lot of the responses, and it's a little easier to manage. If you get into a write-heavy workload, you then have to think about how you're going to scale up to support all those writes, so that gets a little more complicated. You then have to think about isolation requirements. Finance is kind of the go-to use case for a relational database, where you want to make sure that midway through an account transaction, you don't accidentally double-charge somebody, or they withdraw money but it doesn't actually come out of their account, so they just get free money. The crypto ecosystem has had a few issues with stuff like that in the past. There's a famous one - I think one of those exchanges that failed years ago had something like that, where they used a NoSQL database that didn't have strong transaction support. Somebody found the bug and was able to essentially withdraw money for free.
CHARLES MAHLER: 15:20
The other thing to think about is that when you hear the term ACID transactions, there are actually different levels to that. It's kind of a rabbit hole where not all transactions are created equal. Some database providers will hype up that they support ACID transactions, but it might not be what you expect. Then you have scalability requirements. This comes down to not over-engineering. Everybody wants to think, "Hey, I'm going to plan ahead for massive amounts of traffic, massive amounts of users." But if it's just an internal app and you know it's going to have like 20 users, just go with the simplest solution. Don't go crazy with it. Then there's business stage. This comes down to some of the pros of a NoSQL, schemaless structure, which is that you can iterate a little bit faster because you don't have to pre-plan your tables and what your columns are. You can move a little bit faster. So what stage your business is in is something to account for.
CHARLES MAHLER: 16:29
And then finally, I think probably the most important thing is to take into consideration your team and your in-house knowledge. What are they familiar with? Is it worth potentially having to train everybody to use a new tool or a new system, or can you just roll with what you've got, or use something they're already familiar and comfortable with? And this is an overview of the characteristics, at a high level, of relational versus NoSQL. We talked earlier about the B-tree versus the LSM tree. Relational obviously has a table-based structure, versus the many different types of NoSQL databases, which we'll go into. You've got the schema and consistency trade-offs. And then finally, a relational database - at least the old-school ones - was primarily scaled up by just getting a bigger instance, a bigger machine, and maybe having some replicas for failover. But NoSQL databases, most of them, are designed from the ground up around the concept of being able to scale horizontally.
CHARLES MAHLER: 17:42
So, our first trend, to break things up, is these multi-model databases. The way I see things moving is that if you've looked at business cycles, there's a history: a lot of industries start off vertically integrated, then they go towards a horizontal specialization stage, and then they sometimes move back towards vertical integration. A classic example would be Ford, where they started off doing everything themselves - they even had their own rubber farms to make their tires; they did everything in-house - and then over time they split that off and started using different suppliers to build their vehicles. There are benefits to that, but there's also the idea of having a core competency and being specialized at a single thing. As far as databases go, I'm seeing a lot of tools adding features. You can see it with Postgres: they added a JSON column type - essentially, they saw people liked the NoSQL, document style, so they added that column type to support JSON data. And you've seen NoSQL databases add transaction support and things like that. Basically, these tools that started off specialized are merging in the middle, where they're copying or taking the best features from each other.
CHARLES MAHLER: 19:17
And some examples of that: you can see MySQL, for example, has a columnar storage engine extension so it can support more analytics stuff, which in the past a relational database would have struggled with. And there are combo things like relational graph databases - pretty much every combination you can think of. If you look at Postgres, there are all sorts of extensions for specialized data types to support that. Then you've got a big thing, which I'll touch on a little bit later - it's a major buzzword that I have to laugh at - the data lakehouse. This is the modern progression: you start off with databases, then for analytics, they created data warehouses, and then because of some of the weaknesses of those, they moved to data lakes. But those also had issues, so now the data lakehouse is this concept of trying to get the best of all of that, where you get the ability to work with structured and semi-structured data all in the same tool. Those are getting a lot of attention, and there are quite a few places adopting that kind of architecture as the underlying storage layer for all their data as a company. And finally, specifically calling out vector embedding support, which touches on the earlier point: there are specialized vector databases, and because of all that hype, a ton of different databases have added support for that. So that's a trend to look out for: essentially, these companies have their current user base, and they obviously want to expand and get more revenue and more users by adding support for different use cases.
CHARLES MAHLER: 21:00
So, relational databases. Jumping into this is where we're going to start looking at each of these specific types and what they're good for. Relational data is basically stored as rows on disk, in table format, and you've got SQL to query your data. The prime examples would be Postgres and MySQL. The pros are that you're going to get solid performance across a broad range of applications, and you have a really strong ecosystem - literally decades of it. It's battle-tested. You have data consistency and transaction support, so you can rely on that. Some of the cons: there are tools now to make this a little bit easier, but traditionally, scaling horizontally has been a struggle, and high write volume workloads also tend to be a bit of a struggle. And you have that defined schema, which in some cases can be valuable, but in others, doing table migrations can be kind of a pain. So classic use cases are pretty much any general-purpose web application, stuff like that. You'll get solid performance, and especially anything financial, where you can't afford to have corrupted data - that's where it's really going to provide the biggest value.
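Here is a minimal sketch of the transaction guarantee relational databases are prized for, again using Python's sqlite3 for brevity; the account table is a placeholder. Either both sides of the transfer commit, or neither does, which is the "no free money" property described earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 0.0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any error
        conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")
except sqlite3.Error:
    pass  # on failure, neither update is visible

print(conn.execute("SELECT * FROM accounts").fetchall())
```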
CHARLES MAHLER: 22:25
So, key-value databases. This is kind of the simplest form of a NoSQL database. It's essentially just a hash table, like the one you would see in any programming language: a key, created using a hash function, that maps to essentially any type of value. In its earliest form, a key-value database had no structured data - it was a single key pointing to a blob, and that's just what it was. It was made popular by Amazon; they wrote a paper about one they created internally to help scale up their e-commerce. That, along with a few other papers showing that big companies were using these in production, kind of got the ball rolling, because they realized that for their use case, they didn't need transactions or all the other things a relational database provided. So, they stripped it down and created a tool that gave them the performance they needed. The pros are really good read and write performance. The schema is flexible in the sense that it's basically just a key, and you can map it to anything you want. It's very scalable and easy to scale horizontally. The downside is that most of these designs have weak consistency guarantees, so when you write a new data point, there's a chance that if you query at the same time, you could get the old value rather than the new one. That's a potential risk. And in its pure form, a key-value database has no real standard querying capability: you can only grab keys. You can't filter efficiently. You can't do a lot of things you would expect.
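As a small illustration of that key-to-blob model, here is a sketch using the redis-py client; it assumes a Redis server on localhost and a made-up key naming scheme.

```python
import json
import redis  # assumes a Redis server running on localhost:6379

r = redis.Redis(host="localhost", port=6379)

# One key maps to an opaque blob (here, a JSON-encoded shopping cart).
r.set("cart:user:42", json.dumps({"items": [{"sku": "abc", "qty": 2}]}))

cart = json.loads(r.get("cart:user:42"))
print(cart["items"])
# Lookups are by key only - no joins, no secondary filters - which is the
# trade-off that buys the throughput and easy horizontal scaling.
```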
CHARLES MAHLER: 24:22
So, some standard use cases. The most famous, from the Amazon paper, is that they used it for their shopping cart. They were having issues scaling on Black Friday and Christmas and other peak periods, and essentially the takeaway was that they realized they were using so many hacks to scale a relational database that they'd already lost most of the guarantees you would expect anyway. So they might as well build a database from scratch where they know upfront they're losing the data consistency guarantees, and then they can operate knowing what they're actually working with. In addition to e-commerce, you have real-time features and personalized recommendations. Pretty much anything where you want low latency and high scalability would be a good fit.
CHARLES MAHLER: 25:11
So, on to document databases, which can be seen as the next step up from a key-value database. It's basically an extension of that concept that adds support for metadata and semi-structured data, which allows better querying. Generally, you can treat values kind of like JSON, where you can have nested data and different stuff like that. And it's a different mindset from a relational database, because instead of having a bunch of different tables that you join together when you need different values, you're steered towards putting all the data you need in a single document, and then you potentially nest it and treat it that way. As for examples, MongoDB is probably the most popular, and then you also have a lot of different cloud solutions and a few open-source ones as well. The pros and cons: you get the flexible schema, pretty good performance, and data locality - essentially, if you have all your data in one document and you know you're going to need it for all your queries, there are some performance benefits to having it together on disk and not having to do joins, stuff like that. Again related to schema, you can map your data storage to fit your application, which, when done right, can result in a more developer-friendly experience. And again, scale is a big reason people have historically gone with these. Then you have consistency as a potential issue, and there can be situations where, if you do have a lot of relationships between documents, you might have been better off going with a relational database.
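A quick sketch of that nested-document model using the pymongo client; it assumes a local MongoDB instance, and the database, collection, and field names are illustrative only.

```python
from pymongo import MongoClient  # assumes MongoDB running on localhost:27017

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders  # database/collection names are placeholders

# Everything the query needs lives in one nested document instead of joined tables.
orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [{"sku": "abc", "qty": 2, "price": 9.99}],
})

# Query into the nested structure directly, no join required.
print(orders.find_one({"customer.email": "ada@example.com"}))
```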
CHARLES MAHLER: 27:01
Document databases, in a lot of ways, are similar to relational databases in that they support a pretty broad range of applications. The big thing is that anything where you would benefit from that flexible schema for your data model would particularly be a good use case. All right. So, our next trend for 2024 is these semantic layers. Without going into too much detail, as a concept it's just a way to abstract away the data sources and the data model and present a more user-friendly representation to end users, whether that's somebody using it through a BI tool or just an engineer who wants to access all the different data within your organization - just an abstraction layer. The reason these are becoming more important, or a lot of people are hyping them, is that with the rise of AI and these large language models like GPT, people have already shown you can generate SQL on the fly. You do a natural language query, like "How many users signed up in the last month?", the language model creates a SQL query, and then you get your data back, and it can format it in different ways. The big reason the semantic layer is valuable here is that it can give the language model additional context on what your schema is, what fields are available - basically all sorts of additional information - because it's not fine-tuned or trained on your data set in most cases. It doesn't know your organization's data. So if you give it the semantic layer, that can make it more accurate and easier to work with.
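A rough, hypothetical sketch of the idea: the schema and metric definitions stand in for a semantic layer, and `call_llm` is a placeholder for whatever model API you use; none of these names come from a specific product.

```python
# Hypothetical semantic layer: schema plus metric definitions given to the model as context.
SEMANTIC_LAYER = """
Table signups(user_id INT, signed_up_at TIMESTAMP, plan TEXT)
Metric "new users" = COUNT(DISTINCT user_id)
"""

def question_to_sql(question: str, call_llm) -> str:
    # call_llm is any function that takes a prompt string and returns the model's text.
    prompt = (
        "You translate questions into SQL.\n"
        f"Schema and metric definitions:\n{SEMANTIC_LAYER}\n"
        f"Question: {question}\nSQL:"
    )
    return call_llm(prompt)

# e.g. question_to_sql("How many users signed up in the last month?", call_llm)
```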
CHARLES MAHLER: 28:57
And I think, essentially, the ideal end goal - we're still a ways off - is that you could have a non-technical subject matter expert who can essentially talk to your business data as if they could write queries. The backend has access to all of your organization's data, and they could chain together queries and get the data and insights they want. All right. Next up is time series databases. The overview: it's obviously designed for time series data. The big thing is that it needs to be able to handle queries at both ends of the spectrum, which is what makes it such a challenge. You might want a query for a specific device or a specific server - "Okay, I want all metrics for this specific server by this ID" - but you could also want a query like, "I want all the metrics for my thousands of different servers." So essentially you have narrow and wide types of queries, and providing good performance for both of those is tough. That's why these specialized databases have been created. Write throughput is probably the biggest thing - in a lot of cases, potentially millions of data points per minute, sometimes more. And you not only have to be able to support that write throughput, but people want to be able to query that data within seconds, or sub-minute at the very least, in some cases. Because if you're tracking your servers, or maybe monitoring a fuel pipeline, your data is useless if you find out five minutes after the fact that, "Oh, this pipeline burst," or, "My servers have been down for 5 or 10 minutes, and I didn't realize it because my database couldn't even query it in time."
CHARLES MAHLER: 30:49
So those are the big things from a performance perspective. And then, from a developer productivity perspective, they support a lot of stuff out of the box, so you don't really have to reinvent the wheel when it comes to features that pretty much all time series use cases are going to need. The big one is InfluxDB, and then there are a few other open-source ones like Timescale, QuestDB, and a few others. The pros and cons: as I said earlier, you've got very fast data ingest and query performance. You've got the developer productivity stuff, like built-in retention policies - you don't have to write custom code to delete data; if you don't want it after a month, that's built in. There's support for stuff like downsampling and aggregations, and in a lot of cases, they'll have query languages that make life easier, with built-in common time series functions, so you don't have to write some monster SQL query. Some of the downsides - which, again, if you know your use case upfront, you won't have issues with, because you'll work around them accordingly - are that they really aren't designed to update specific data points, and they also aren't good at deleting specific data points. Normally, you'd want to delete a chunk of data after a week, automatically, and it's very good at that. But for specific data points, it's not great. And for updates, it really shouldn't matter, because you don't really want to rewrite history. That's the concept: if you're tracking something through time, there's no real need to update it. So that's why that trade-off was made in performance. Ideally, you'll never need to update it, so this really isn't a real-world downside.
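For a concrete feel of a time series workload, here is a small sketch using the influxdb-client Python library for InfluxDB 2.x/Cloud; the URL, token, org, bucket, and measurement names are placeholders.

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details; swap in your own instance and credentials.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# Write one point: a measurement, a tag identifying the series, a field, and a timestamp.
point = Point("cpu").tag("host", "server01").field("usage_percent", 64.2)
write_api.write(bucket="metrics", record=point)

# Query the last hour of data back out.
tables = client.query_api().query('from(bucket:"metrics") |> range(start: -1h)')
for table in tables:
    for record in table.records:
        print(record.get_time(), record.get_value())
```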
CHARLES MAHLER: 32:46
So, common use cases that we see are monitoring and IoT applications - monitoring pretty much anything, but specifically observability and application performance monitoring. We also see a lot of financial data; that's a common use case. And then stuff like analytics and event tracking is common as well. For InfluxDB specifically, one of the big reasons people go with us is that we support nanosecond precision. For example, we have a space startup that needs to be able to shut things down or act quickly if something goes wrong. For them, even millisecond precision isn't enough; they need very fine granularity on their data. So that's something to look out for. Again, knowing your data access patterns and the requirements of your end users is something to take into consideration: what granularity do you need for your data?
CHARLES MAHLER: 33:54
So, columnar databases. These focus on analytics, and as you can tell by the name, rather than storing data as rows, they store it in columns. This gives the benefit of much better compression, because for each column you can use the best possible compression algorithm, and it also allows for improved query speed and query processing. Some examples are ClickHouse and Vertica, and you have AWS Redshift. In a way, you could consider InfluxDB 3.0 in this category, because under the hood it does use a column-oriented architecture. It's obviously fine-tuned for time series, but it provides some of the benefits you'd see here as well. The main thing is that it's more efficient for analytics: you're not pulling in redundant data. If you try to do an analytics query on a row-based database, you have to go over essentially every row and then throw out the data you don't need just to do your calculation. With a columnar database, let's say I need the average sales price, or the number of unique customers in the last week - you just access that one column.
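A toy Python illustration of that point: the aggregate on the column layout touches only the one list it needs, while the row layout forces a scan over every full record. The field names and values are made up.

```python
# Row-oriented: each record stored together; an average must walk whole records.
rows = [
    {"order_id": 1, "region": "EU", "price": 10.0},
    {"order_id": 2, "region": "US", "price": 25.0},
    {"order_id": 3, "region": "EU", "price": 40.0},
]
avg_from_rows = sum(r["price"] for r in rows) / len(rows)

# Column-oriented: each field stored (and compressed) together;
# the same average reads just one column.
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "price": [10.0, 25.0, 40.0],
}
avg_from_columns = sum(columns["price"]) / len(columns["price"])

print(avg_from_rows, avg_from_columns)
```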
CHARLES MAHLER: 35:12
Another angle is that under the hood they do a lot of really cool performance tricks. They'll actually store redundant, duplicated columns sorted in different orders - so if you want the most recent whatever, there's a copy sorted that way - and they can pre-calculate different statistics about a column. They do all sorts of optimizations you don't see that help improve performance. Again, I mentioned the data compression earlier: because all the values in a column are alike, you get basically the best possible, most efficient compression, and that saves storage costs. You get vectorized processing - that's the SIMD stuff - which allows better usage of your CPU: you can use multiple cores and process in parallel. The cons are that writing data is generally less efficient, just because of the way the data model works. Most of them recommend batch uploads, and the recommended frequency can depend on which tool you're using. And if you're accessing many column values at the same time, you lose some of the performance gains you would expect, just because of the way the data is laid out.
CHARLES MAHLER: 36:42
So, this is an example of a benchmark from ClickHouse, where you can see how much faster it is for these analytics queries versus some more general relational databases and then MongoDB, a document database. You can see it's 200 times faster than Postgres and up to 700 times faster than Mongo. So it's not like a 10% improvement - for a lot of these queries, using a column database gets you orders of magnitude better performance. The use cases: analytics is the common one, and you also have observability and data warehousing as well. And this is another real-world case study, from Uber: they switched from a different database to a column database and got 3 times better compression and 10 times faster queries. And because it's so much more efficient on the same hardware, they were able to cut their hardware roughly in half and still get better performance. We're seeing a lot of this with InfluxDB too - since we've used this approach for InfluxDB 3.0, there are a lot of similar performance gains. You essentially get more for less, which is always nice.
CHARLES MAHLER: 38:04
So, graph databases. These are kind of interesting because, from that chart earlier - it was a while back now, but if you look closely - they only have about 5% market share. So they can't really be said to be widely used by the general population of developers or businesses. But 75% of Fortune 500 companies are using graph databases, so that kind of shows the value, and I think over time you're going to see more and more adoption. It takes time for some of these newer technologies to reach smaller businesses, and that comes down to simplifying things and people realizing the value over time. The way it works is that it stores data so as to maintain the relationships between connected data points, if that's what you want to use it for. The classic example would be a social network, where you want to track how different people within your data set are connected. They typically have specialized query languages that make it easier to write these types of queries than standard SQL. As for examples, the most popular one is probably Neo4j, and there are a few other open-source ones as well.
CHARLES MAHLER: 39:21
The big thing is that they provide good performance on these types of queries, which a normal database might not. Developer productivity is big: SQL wasn't designed for doing graph traversals out of the box, so these databases provide query languages that make it feel more natural to write these queries. Most of them support a flexible schema, and they make it easy to add a new data point and connect it to existing data points. The downside is pretty obvious: if you don't work with heavily connected data, it wouldn't make a lot of sense to use them. On the slide, the top is what a query would look like in SQL, and the bottom is the new standardized graph query language they're actually working on, so that every database doesn't have its own thing. You can see it's nine lines versus three lines, so it's a little simpler. The use cases: you have stuff like fraud detection - finding the connections between different behaviors and different accounts, and detecting fraud before or as it happens. You have social networks, as I said, and they're also heavily used for recommendation features: they can tie together that this person bought this, somebody else bought the same thing, so you might like this product as well.
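To show the kind of connected-data query a graph database is built for, here is a plain-Python "friends of friends" traversal over an adjacency map; a real graph database would run this over indexed edges with its own query language, and the names here are invented.

```python
# Toy social graph stored as an adjacency map.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}

def friends_of_friends(person: str) -> set[str]:
    direct = friends.get(person, set())
    # Union of all my friends' friends, minus my direct friends and myself.
    second_degree = set().union(*(friends.get(f, set()) for f in direct))
    return second_degree - direct - {person}

print(friends_of_friends("alice"))  # -> {'dave'}
```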
CHARLES MAHLER: 40:52
All right. So, another trend is increasing interoperability between different systems, different databases, different tools. A big one is Apache Arrow. InfluxDB has invested heavily in this and built on it quite a bit. The term some of our engineers use is the FDAP stack - we have a blog post on it you can go check out - but it's essentially Flight SQL, which is a communication protocol; DataFusion, which is part of the Apache Arrow project and is a standardized query engine you can plug into your system; Arrow itself, the in-memory columnar format; and finally Parquet, which is the storage format. The big reason this is happening is that it's a win-win for both developers and the companies themselves. Companies can use these open-source tools to build their products without reinventing the wheel, and it also makes acquiring new customers easier because you can say, "Hey, we use this protocol. You can just test this out real quick. You don't have to rewrite your code. It's pretty simple." The win for end users is obviously that you can reuse your knowledge of these tools if you want to try something new, and it reduces the vendor lock-in risk because you can always move your data out to a new solution.
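A small example of that interoperability in practice, using the pyarrow library: build an Arrow table in memory and write it out as a Parquet file that many engines can read. The column names and file path are placeholders.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Arrow table in memory (columnar), written to Parquet on disk (columnar, compressed).
table = pa.table({"time": [1, 2, 3], "temp_c": [21.5, 21.7, 21.4]})
pq.write_table(table, "readings.parquet")

# Any Parquet-aware engine (DataFusion, Spark, DuckDB, etc.) can now read this file;
# here we just read it back with pyarrow.
print(pq.read_table("readings.parquet").to_pydict())
```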
CHARLES MAHLER: 42:24
The other thing that ties into this, and can probably be seen as the center of what everything will interoperate or integrate with, is these data lakes and data lakehouses, where they use either plain Parquet or table formats like Iceberg or Delta Lake - basically just different ways to provide metadata for your underlying storage - and then these higher-level tools plug into that and are able to access the data. Snowflake and Databricks are both either already Iceberg-compatible or investing in becoming so, so that you have a basically common format for these different data tools to all integrate with.
CHARLES MAHLER: 43:19
Let's see. And then finally, you have query engines, which are basically an agnostic way to access the same data. You're not locked into a specific way to do it: you can use a query engine that plugs into the database or the data lake, and you're not locked in completely. So, in-memory databases. The big thing is that they store all data directly in RAM, so they're very fast, and they can use unique, optimized data structures without the usual trade-offs, because they don't have to worry about making a call out to disk - everything's in memory ahead of time, so everything's going to be fast, and they can plan accordingly based on that. So you have data structures that can be optimized for the CPU, and you don't have to think as much about how the data is actually encoded on disk, because you don't really worry about writing back to disk.
CHARLES MAHLER: 44:40
So, pros and cons. You've got high performance and low latency, because there's never a call to disk for the most part, and you have a ton of different useful data types you can store your data in. The downside is that RAM is very expensive compared to the alternatives, so you're going to have to limit the size of your dataset or pay a lot of money. There is horizontal scaling, obviously, but you only get so much RAM on one machine, so that's something you need to take into consideration. And in pretty much all cases, you're probably going to need some type of secondary database, so your architecture is going to be a little bit more complicated. The most common use case is going to be some form of caching - that could be caching sessions or a webpage, all sorts of different stuff. Instead of going to your database, you just serve it directly: your in-memory database sits in front of your backend, and it can just answer those requests. You have real-time applications, and they can also act as a pub/sub broker in that type of architecture.
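Here is a minimal cache-aside sketch of that "in-memory database in front of the backend" pattern, using redis-py; it assumes a local Redis server, and `load_from_primary_db` is a placeholder for whatever fetches from your primary database.

```python
import redis  # assumes a Redis server on localhost:6379

r = redis.Redis()

def get_profile(user_id: int, load_from_primary_db) -> bytes:
    key = f"profile:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return cached                       # served from RAM, no disk hit
    value = load_from_primary_db(user_id)   # slower path: primary database
    r.set(key, value, ex=300)               # keep it hot for five minutes
    return value
```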
CHARLES MAHLER: 45:48
So, you have search databases. These are used primarily for storing, searching, and querying text. They can work with structured or unstructured data. In a lot of cases, you can see them as basically a specialized type of document database that just has more indexing on top, optimized for text specifically. Pros and cons: developer productivity - you have a lot of built-in algorithms for different types of text search, and the query language is also optimized to make writing those queries easy. Good horizontal scaling, good performance. Because of that heavy indexing, though, the downside is they tend to struggle with really high write throughput. So you either have to sacrifice a little bit of your indexing and query performance, or you have to look into how you're going to replicate so you can scale out your database. Common use cases are going to be stuff like log analysis, anything that's full-text search, autocomplete, and then analytics-type stuff as well.
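The core structure behind that "heavy indexing" is an inverted index: each term maps to the documents containing it, so a search never scans every document. A toy Python version, with made-up log lines:

```python
from collections import defaultdict

docs = {
    1: "error connecting to database",
    2: "connection restored",
    3: "disk error on node 7",
}

# Build the inverted index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Lookup touches only the postings list for the query term, not every document.
print(sorted(index["error"]))  # -> [1, 3]
```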
CHARLES MAHLER: 46:56
Finally, we have vector databases. These are pretty much the fastest-growing segment, though still with a relatively low rate of adoption just because they're fairly new. What they're designed for is storing and searching vector embeddings of unstructured data. Basically, you have a model that can take text, video, images, whatever, and convert that into an array of numbers - a vector embedding. An open-source example would be Milvus; that's probably the most popular one. We also have some closed-source ones like Pinecone. The big thing is they've actually been used for quite a while by bigger companies internally, but they're finally reaching adoption in the outside world, and now with the LLM craze, people are seeing a lot of really good use cases for them. And within the last year, because of that - and going with the theme of databases moving into each other's territory - in addition to these specialized ones, databases like Postgres, MongoDB, [inaudible] have all added support for vector embeddings. But obviously, they aren't fully optimized, just because they weren't designed from the ground up for that.
CHARLES MAHLER: 48:17
The biggest thing is, obviously, you want efficient vector search. If you use a relational database, for example, the complexity of that search - the Big O - is essentially the number of vectors, times the number of dimensions, times the number of closest matches you want. So that is very inefficient with a relational database. For example, OpenAI's API embedding model has 1,536 dimensions. So if you have millions of vectors that each have roughly 1,500 dimensions, that's going to be a slow query if you're not using a specialized tool. So they provide scalability - most of them are horizontally scalable. The other big thing is hybrid storage: they don't have to keep everything in RAM; they can put those vectors onto disk and essentially do hybrid queries. A lot of them also provide metadata and hybrid query support, so you can query based on a similarity search between vectors but also add a filter - for example, only do a vector similarity search for this user ID to find similar images. The con, really, is that it's very specialized. If you're not working at a huge scale, you could probably get away with using some of those more general-purpose databases, so the overhead might not be worth it unless your app is fully based on working with vector embeddings.
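To make the cost concrete, here is the brute-force baseline in NumPy: cosine similarity of one query against every stored vector, which is the vectors-times-dimensions scan described above. Specialized vector databases replace this with approximate-nearest-neighbor indexes; the sizes here are arbitrary and the vectors are random.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 stored embeddings with 1,536 dimensions each, normalized to unit length.
vectors = rng.normal(size=(10_000, 1536))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query = rng.normal(size=1536)
query /= np.linalg.norm(query)

scores = vectors @ query                    # cosine similarity against every vector
top_5 = np.argsort(scores)[-5:][::-1]       # indexes of the 5 closest matches
print(top_5, scores[top_5])
```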
CHARLES MAHLER: 49:50
So, these are some common use cases. You've got duplicate removal - you can detect how similar an uploaded image is and say, "Okay, this is identical to this one." If somebody keeps uploading the same spam image, you get rid of it. You can do anomaly detection, a lot of ranking and recommendation use cases, and semantic search - which, if you've ever used Google and the results don't match your keyword at all, can also be a downside of this type of tool: it finds something that's similar enough but ignores the exact keyword, and in certain searches you really want that keyword. But the big one recently is retrieval augmented generation, called RAG, and that's with GPT or any of these other open-source LLM models. You can use this vector search: if you want to examine a PDF, for example, you take that PDF, create chunks of text, and turn them into vectors. Later, you put in a query, convert the query into a vector, find similar passages within the PDF or whatever, pull those in for context, and then the LLM can work with that relevant data. So if you have private data, internal company data, OpenAI or whatever open-source model obviously hasn't been trained on it and isn't aware it exists, and retrieval augmented generation is how you would give that AI model access to this custom data so it has some awareness of this new set of facts, this new information. That's basically the main reason these databases have really grown in popularity in the last year.
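A hypothetical end-to-end sketch of that RAG flow; `embed`, `vector_store`, and `call_llm` are placeholders for whatever embedding model, vector database client, and LLM API you actually use, not a specific product's interface.

```python
def build_index(pdf_text: str, embed, vector_store, chunk_size: int = 500) -> None:
    # Split the document into chunks, embed each one, and store vector + original text.
    chunks = [pdf_text[i:i + chunk_size] for i in range(0, len(pdf_text), chunk_size)]
    for chunk in chunks:
        vector_store.add(vector=embed(chunk), payload=chunk)

def answer(question: str, embed, vector_store, call_llm, k: int = 3) -> str:
    # Embed the question, retrieve the k most similar chunks, and pass them as context.
    context_chunks = vector_store.search(embed(question), top_k=k)
    context = "\n".join(context_chunks)
    return call_llm(
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    )
```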
CHARLES MAHLER: 51:42
I think our final database is this concept of NewSQL - that's the terminology for them. They're really an attempt to get the best of both worlds, in that you can do relational and analytics-type stuff with a single database. Most of them are based off a paper Google released about their database, Spanner; they put this research paper out about how they built it and some of the details, and then other people implemented their own open-source solutions based on that. The idea was basically going against the grain of NoSQL. There's essentially a quote from the paper's authors saying you're better off not having your engineers try to work around the limitations of not having transactions; you're better off figuring out the performance issues rather than pushing that down onto your application developers. So that's the concept behind it. The big things are SQL support, horizontal scalability, and being effectively cloud-native, in that they're designed from day one to operate within a cloud environment. Downsides are complexity - they try to hide it, they try to abstract it away, but it is there under the hood, so there's always a potential issue that you're working with a tool you don't really understand because there's so much going on. Because of the horizontal nature, there are potential latency issues: if you make a query and one node doesn't have all the data you need, it has to fetch from another node, and that adds latency to your application. There are some limitations with the SQL and what it can do, just because of the way it's architected. And as far as databases go, they're still really new compared to things that have been around for decades.
CHARLES MAHLER: 53:42
So, the concept is called HTAP: hybrid transactional analytical processing. The idea is that you can use it for what you'd use a regular relational database for, or you can also do analytics workloads as well. And this is our final trend to look ahead for: directly using machine learning to optimize databases. One research paper came out of Google a few years back, and it's kind of starting to be used in production, where they were able to create a machine learning model that would index data on disk rather than using a standard B-tree. They created this model that could fine-tune the index for the workload. There are also tools like one in Google Cloud now that will read your queries and look at their performance, and if something seems like a frequent query, it can recommend, "Hey, create this type of index on this field and you'll improve performance by this much." And there are a few startups as well that offer similar tools that look at what your database is doing and can make recommendations and automatically fine-tune your indexes and some of the other stuff around your database to give you better performance. So I think you're going to see more and more of that - basically just streamlining how these databases work as these models get more reliable and effectively trustworthy.
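As a very loose illustration of the learned-index idea mentioned above (not the actual research implementation), a simple model can predict roughly where a key sits in sorted data, with a local search to finish the lookup and a full binary search as a fallback. The data here is random and the window size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
keys = np.sort(rng.integers(0, 1_000_000, size=100_000))
positions = np.arange(len(keys))

# The "model": a linear fit from key value to position in the sorted array.
slope, intercept = np.polyfit(keys, positions, deg=1)

def lookup(key: int, window: int = 1024) -> int | None:
    guess = int(slope * key + intercept)              # predicted position
    lo, hi = max(0, guess - window), min(len(keys), guess + window)
    idx = lo + np.searchsorted(keys[lo:hi], key)      # small local search
    if idx < len(keys) and keys[idx] == key:
        return int(idx)
    idx = np.searchsorted(keys, key)                  # fallback: full binary search
    return int(idx) if idx < len(keys) and keys[idx] == key else None

print(lookup(int(keys[1234])))  # -> 1234 (or the first index holding that key)
```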
CHARLES MAHLER: 55:15
So that's it on my end. Hopefully, this was useful, and you learned a lot about databases and maybe found some interesting stuff that you can check out in the year ahead, if nothing else out of curiosity. And these are some resources, like Caitlin talked about earlier. For InfluxDB specifically, we have our community forums and Slack. If you have questions about using the product or just want to talk to other developers about databases, you can go there. We have good stuff on our blog about both our product and general useful tutorials. And for in-depth courses on InfluxDB, we have our university program, so you can check those out too. I think we have time, and I see a few questions. If anybody else has questions, you can put those into the Q&A section; for the next minute or two, I'll try to get to them.
CHARLES MAHLER: 56:12
All right. Let's see. The first question: "How does discovery work in a semantic-model system? I guess it needs a lot of upfront configuration." Yeah, I've seen quite a few tools that can be used for that. I don't know, maybe it's just good marketing, but I know there's Cube (cube.dev) or something like that; they're one tool I know of that has a lot of material about semantic models and semantic layers. I think they're open source, so you can go check that out. Your best bet is probably just doing a quick Google search and looking at the tools available. As far as discovery, yeah, I think that's probably the biggest thing: if you have a lot of data silos, you have to figure out a way to make all those different databases discoverable, and I don't think there's any one-size-fits-all solution there. I think every organization probably has problems like that with data silos. Other question: "What type of database is KQL/Kusto (Azure Data Explorer) from Microsoft?" Yeah, I think they do kind of push towards time series, but like I said, there's overlap - they try to do a lot of different things. There's always some overlap, like with InfluxDB 3.0: even with that new storage backend, there have been people playing around with using it for more general-purpose analytics, stuff you'd traditionally see from a data warehouse. So a lot of these tools can give you good performance even beyond the specific thing they're best at; they can also do a few other things.
CHARLES MAHLER: 58:08
We got a question from an anonymous attendee: "Could Elasticsearch be a good choice for time series?" I've seen people try to do it, and I think we actually do have benchmarks. It's obviously serviceable, but from what we've seen in benchmarks, that write throughput issue I mentioned tends to be the choking point. You can get good performance, but you're going to have to fine-tune it quite a bit and probably mess around with the indexing structure, that sort of thing. From what I recall from our own benchmarks, query time is decent, but the big weakness was that once you get over a certain threshold of write throughput, it kind of struggles. Let's see. We got a question: "What would you consider SurrealDB in this landscape?" I recognize that name. If I recall, I think it's a graph kind of hybrid - let me look it up real quick. Yeah, it's multi-model; I think they hype up the graph stuff. I don't know what they're built on. I know they're open source - I'd assume they're maybe built on Postgres and extend that, but I'd have to look deeper into it. I'm far from an expert on it. Looks like we have one more question: "What do you consider the nearest competitors for InfluxDB 3?" Caitlin, do you have an opinion? I can say a few, but.
CAITLIN CROFT: 59:42
I mean, the one that I want to answer with is our open source version of InfluxDB. We often talk about that as being one of our competitors. With InfluxDB 3.0, we kind of became more of a columnar database, so in addition to time series, we're kind of up against some columnar databases as well. Charles, anything else you want to add there?
CHARLES MAHLER: 01:00:12
Yeah, I’d say that’s pretty much the go-to answer. Yeah. One of our biggest things that comes up in sales is, yeah, it’s like open source is working great for us. So why would we go to paid? So, I mean, it’s a good problem to have, I guess, having happy open source users.
CAITLIN CROFT: 01:00:29
While we're kind of on that topic, Charles, from your experience using databases, what would you say excites you the most about InfluxDB 3.0?
CHARLES MAHLER: 01:00:43
I'd say probably, again, the flexibility. Some of it's not exactly production-ready yet, but I alluded to it with that stuff about data lakes and integrating with them. I think what we have in the works with Parquet and some of the other formats that InfluxDB will eventually support will be really cool - how you can integrate it, pick and choose how you use it, and test it out with different stuff. And in general, the broader ecosystem of how we'll integrate with visualization tools and all the different ways you can access the data - that, I think, is probably the biggest long-term potential for it. And again, the expanded use cases. I never know how much we can talk about, but eventually you'll be able to fine-tune it: it's always going to be good for time series, but you'll be able to customize how you use it so it can support other workloads as well. We got a question. Yeah, the certifications.
CAITLIN CROFT: 01:01:47
Yeah. Do you want me to answer that?
CHARLES MAHLER: 01:01:49
Yeah, you got it.
CAITLIN CROFT: 01:01:50
Yeah. So, we don't currently offer any official certifications. I know that's sort of on the roadmap eventually with InfluxDB University. There are certificates of completion for courses, I believe, that are provided through Credly. So if you do attend any of the trainings, you can get one of those, and they are free. Let's see. Craig, you've raised your hand. I'm happy to allow you to unmute yourself. So, feel free to unmute yourself if you feel comfortable speaking up.
CHARLES MAHLER: 01:02:32
I suppose it could have been an accidental hand raise.
CAITLIN CROFT: 01:02:34
I think it might have been an accidental hand raise. All right. Thank you, everyone, for joining today’s webinar. Lots of great questions. All of you should have my email address. So, if you have any further questions, please feel free to reach out to me. I’m happy to put you in contact with Charles. And I’m totally going to put Charles on the spot. I know he’s in the Community Slack workspace, so feel free to ping him there. Someone’s asking how to get the recording. Just go back tomorrow morning and check out where you registered for the webinar, and it will have converted over to the recording. Once again, thank you, everyone, for joining today’s webinar. Thank you, Charles. You did a fantastic job. And I hope to see you guys on a future webinar or training. Thank you.
Charles Mahler
Technical Marketing Writer, InfluxData
Charles Mahler is a Technical Marketing Writer at InfluxData where he creates content to help educate users on the InfluxData and time series data ecosystem. Charles' background includes working in digital marketing and full-stack software development.