Best Practices for Collecting and Querying Data from Multiple Sources
By Zoe Steinkamp, Developer / Aug 14, 2023
This article was originally published in The New Stack and is reposted here with permission.
In today’s data-driven world, the ability to collect and query data from multiple sources has become essential. With the rise of IoT, cloud computing and distributed systems, organizations face the challenge of handling diverse data streams effectively. It’s common to have multiple databases or data storage options for that data, and for many large companies, the days of storing everything in a single database are in the past.
It is crucial to implement best practices for efficient data collection and querying to maximize your datastores’ potential. This includes optimizing data ingestion pipelines, designing appropriate schema structures and utilizing advanced querying techniques. On top of this, you need data stores that are flexible when querying data back out and are compatible with other data stores.
By adhering to these best practices, organizations can unlock the true value of their data and gain actionable insights to drive business growth and innovation. This is where InfluxDB, a powerful time series database, comes into play. InfluxDB provides a robust solution for managing and analyzing time-stamped data, allowing organizations to make informed decisions based on real-time insights.
Understanding different data sources
When it comes to data collection, it is crucial to explore different data sources and understand their unique characteristics. This involves identifying the types of data available, their formats and the potential challenges associated with each source. After identifying the data sources, selecting the appropriate data ingestion methods becomes essential. This involves leveraging APIs, utilizing Telegraf plugins or implementing batch writes, depending on the specific requirements and constraints of the data sources.
It’s also important to keep data volume and ingestion speed in mind; with IoT data, for example, these are often the top concerns. Ensuring data integrity and consistency throughout the collection process is of utmost importance. So, too, is having backup plans for data loss, stream corruption and storage at the edge. This involves implementing robust mechanisms to handle errors, deal with duplicate or missing data and validate the accuracy of the collected data. Additionally, proper data tagging and organization strategies result in efficient data management and retrieval. Tagging data with relevant metadata and organizing it in a structured manner makes it easier to search, filter and analyze effectively.
It’s helpful to note here that most data storage solutions come with their own recommendations for how to begin collecting data into the system. For InfluxDB, we always suggest Telegraf, our open source data ingestion agent, or, for language-specific needs, our client libraries written in Go, Java, Python, C# and JavaScript. The important takeaway is to go with recommended and well-documented tools. While it might be tempting to use a tool you are already familiar with, if it isn’t recommended for your data store you may miss out on those built-in mechanisms for handling problems.
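As a hedged illustration (not code from the original article), here is a minimal sketch of writing tagged generator data with the InfluxDB Python client library (influxdb-client). The URL, token, org, bucket and the measurement, tag and field names are placeholders:

```python
# Minimal write sketch using the influxdb-client Python library (v2 API).
# Connection details and names below are illustrative placeholders.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# Tag every point with the generator's serial number so the data can later
# be filtered on and joined against other data sets.
point = (
    Point("battery")
    .tag("serial_number", "GEN-0042")
    .field("battery_level", 76.5)
)

write_api.write(bucket="generators", record=point)
client.close()
```

Tagging each point with a stable identifier at write time is what makes the filtering and joining shown later in this article straightforward.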
Effective data modeling
Effective data modeling is a crucial aspect of building robust and scalable data systems. It involves understanding the structure and relationships of data entities and designing schemas that facilitate efficient data storage, retrieval and analysis. A well-designed data model provides clarity, consistency and integrity to the data, ensuring its accuracy and reliability. The most important piece when dealing with multiple data sources is determining your “connector,” the data element that links your data sets together.
For example, let’s look at a generator that has two separate datasets: one in a SQL database storing the unit’s stats and one in InfluxDB holding real-time data about the battery capacity. You might need to identify a faulty generator and its owner based on these two data sets. It might seem like common sense that you would have some kind of shared ID between them, but when you are first modeling your data, the concern is usually less about being able to combine data sets and more about the main data use case and removing unnecessary data. Another question is how unique your connector is and how easy it will be to store. In this example, the real-time battery data might not have easy access to a serial number; that might need to be a hardcoded value added to all data collected from the generator.
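To make the idea of a connector concrete, here is a small, purely illustrative sketch of how the two data sets might be shaped, with the serial number acting as the shared key (all names and values are made up):

```python
# Illustrative shapes of the two data sets, linked by the shared
# "connector" column serial_number. Names and values are hypothetical.

sql_generator_record = {            # relational database: one row per generator
    "serial_number": "GEN-0042",    # connector (primary key here)
    "owner_name": "Jane Doe",
    "model": "GX-200",
    "registered_at": "2023-01-15",
}

influxdb_battery_point = {          # time series database: many points per generator
    "measurement": "battery",
    "tags": {"serial_number": "GEN-0042"},   # connector (a tag on every point)
    "fields": {"battery_level": 76.5},
    "time": "2023-08-01T12:00:00Z",
}
```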
Furthermore, as data evolves over time and variations occur, it becomes essential to employ strategies to handle these changes effectively. This may involve techniques such as versioning, migration scripts or implementing dynamic schema designs to accommodate new data attributes or modify existing ones.
For example, if our generator adds new data sets, it’s important that we add our original connector to that new data. But what if you are dealing with an existing data set? Then it gets trickier: you might have to go back and retroactively implement your connector. In this example, the app where people register their generator and view their battery information might require them to manually enter their serial number. This allows you to tag them as the owner, and you can run analysis on their device remotely to determine whether it’s within normal range.
Obviously this is a very simple example, but many companies and industries use this concept. The idea of data living in a vacuum is starting to disappear as many stakeholders expect to access multiple data sources and have an easy way to combine the data sets. So let’s dive into how to combine data sets once you have them, continuing from our previous example with InfluxDB and a SQL database, a common use case for combining data.
When it comes to querying your data, and especially when it comes to combining data sets, there are a couple of recommended tools to accomplish this task. First is SQL, which is broadly used to query many data sources, including InfluxDB. And when it comes to data manipulation and analysis, a second tool, Pandas, is useful for flexible and efficient data processing. Pandas is a Python library that is agnostic to the data it accepts, as long as that data fits in a Pandas DataFrame. Many data sources document how to convert their data streams into a DataFrame because Pandas is such a popular tool.
The following code is an example of a SQL query in InfluxDB, which returns the average battery level over the past week for this specific device (via serial number):
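Here is a sketch of what that could look like with the influxdb3-python client; the host, token, database and the measurement and column names are assumptions, not values from a real system:

```python
# Hedged sketch: querying InfluxDB with SQL via the influxdb3-python client.
# Host, token, org, database and the schema names are illustrative placeholders.
from influxdb_client_3 import InfluxDBClient3

client = InfluxDBClient3(
    host="us-east-1-1.aws.cloud2.influxdata.com",
    token="my-token",
    org="my-org",
    database="generators",
)

sql = """
SELECT AVG(battery_level) AS avg_battery_level
FROM battery
WHERE serial_number = 'GEN-0042'
  AND time >= now() - INTERVAL '7 days'
"""

# The result comes back as a PyArrow table, which converts
# directly into a Pandas DataFrame for further processing.
weekly_avg = client.query(query=sql, language="sql").to_pandas()
print(weekly_avg)
```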
This query would happen on the app side. When a user logs in and registers their generator’s serial number, that enables you to store the data with a serial number tag to use for filtering. For the readability of this query, it’s easier to imagine all generator data going into one large database. In reality, it’s more likely that each serial number would have its own data store, especially if you wanted to offer customers the chance to store their data longer for a fee, which is a common offer for some businesses and use cases, like residential solar panels.
Now, this is just one query, but an app developer would likely write several such queries to cover averages for the day and week, and to account for battery usage, battery levels, most recent values and so on. Ultimately, they hope to end up with between 10 and 20 values that they can show to the end user. You can find a list of these functions in the InfluxDB documentation.
Once they have these values, they can combine those data points with the SQL database that houses customer data, such as name and address. They can use the InfluxDB Python client library to pull the time series values into Pandas and combine the two datasets there.
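Here is a minimal sketch of what that join could look like, assuming the aggregated InfluxDB values have already been pulled into a Pandas DataFrame and the customer records live in a SQL database (an in-memory SQLite table with made-up columns stands in for it here):

```python
# Hedged sketch: joining aggregated InfluxDB values with customer records
# from a SQL database using Pandas. All schema names and values are made up.
import sqlite3
import pandas as pd

# Aggregated per-generator values previously queried from InfluxDB.
battery_stats = pd.DataFrame({
    "serial_number": ["GEN-0042", "GEN-0043"],
    "avg_battery_level": [76.5, 52.1],
})

# Customer records from the relational database (SQLite stands in here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (serial_number TEXT, owner_name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("GEN-0042", "Jane Doe", "123 Main St"),
        ("GEN-0043", "Sam Lee", "456 Oak Ave"),
    ],
)
customers = pd.read_sql("SELECT * FROM customers", conn)

# Inner join on the shared connector column: only generators present in
# both data sets end up in the combined DataFrame.
combined = battery_stats.merge(customers, on="serial_number", how="inner")
print(combined)
```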
When it comes to joining, Pandas has a few options. This example uses an inner join on the serial number, so the result only includes generators that appear in both data sets. You would probably need to rename some columns, but the join produces a combined DataFrame that you can then convert as needed for use.
You can imagine how data scientists might use these tools to run anomaly detection on the datasets to identify faulty devices and alert customers to the degradation and needed repairs. If there is a charge for storing data, users can also combine this data with a financial data set to confirm which customers pay for extended storage time and possibly receive extra information. Even in this simple example, there are many stakeholders, and at scale the number of people who need to access and use multiple data sets only expands.
Key takeaways
With so much data in the world, the notion of storing everything in a single database or data store may seem tempting. (To be clear, you may want to store all of the same type of data in a single database, e.g., time series data.) While that can be a viable solution at a small scale, the reality is that both small- and large-scale companies can benefit from the cost savings, efficiency improvements and enhanced user experiences that come from utilizing multiple data sources. As industries evolve, engineers must adapt and become proficient in working with multiple data stores, and the ability to seamlessly collect and query data from diverse sources becomes increasingly important. Embracing this approach enables organizations to leverage the full potential of their data and empowers engineers to navigate the ever-expanding landscape of data management with ease.