Getting Started with R and InfluxDB

This article was written by Gourav Singh Bais and was originally published in The New Stack.

As a data professional, you may come across datasets with a few independent variables (input variables). One of those variables is often time, and another can be any sort of time-dependent column, such as the number of bookings in a hotel or the number of passengers on a flight.

This type of data is referred to as “time series data”: it captures points in time and usually exhibits some kind of trend. There are various ways of storing this type of data, such as relational databases or files, like CSV or Excel. However, these options are not designed to store time series data efficiently. Enter time series databases, which are purpose-built to store and query time series data quickly.

There are various use cases where time series databases (TSDB) perform significantly better than other storage mechanisms. Consider a few:

  1. Storing IoT data: Time series databases can easily store IoT data that arrives continuously and at regular intervals. Because every data point is time-stamped, seasonal patterns, average consumption and inefficiencies can all be identified via time series analysis.
  2. Monitoring applications and infrastructure: Companies can store data about their application and infrastructure usage, and they can later use it for tasks such as anomaly detection or prediction of infrastructure needs. Some web and mobile apps keep track of their own events in a TSDB, such as clicking a button, playing a video or sharing some content. They can chart a user's journeys, highlight difficulties or performance bottlenecks, and streamline more complex operations using these events.
  3. Retail stores sales forecasting: Retail stores collect data to predict sales, which helps them manage their supply chains.
  4. Real-time analytics: Time series databases can be used to store the data behind real-time analytics, such as self-driving car data. Data generated by self-driving cars is so large and time-dependent that it is impractical to store in a relational database. Time series databases provide faster writing and querying mechanisms that help self-driving cars perform operations in real time.

Furthermore, there are several advantages to using a time series database over other storage mechanisms for that data type. Here are a few reasons:

  1. Scalability: A time series database specializes in a higher number of writes with eventual consistency, even over distributed storage, which means less anxiety for those who care about the data.
  2. Usability: Storing data in a TSDB is not enough. One has to be able to quickly access it to make data-driven decisions. Here, data can be aggregated over time to make transactions faster and more efficient.
  3. Increased productivity: Time series databases are easily accessible in the form of simple APIs and can be accessed using different sets of programming languages.

One widely used time series database is InfluxDB, an open source time series database created by the company InfluxData. It’s written in Go and stores and retrieves time series data for any use case, including operations monitoring, application metrics, Internet of Things (IoT) sensor data and real-time analytics.

To learn more about the benefits of InfluxDB, you can refer to the InfluxData website.

In this article, you will learn what’s needed to get started with InfluxDB and the R language: installing, setting up, querying, writing and, finally, building a simple time series application in R.

InfluxDB client library

Clients interacting with InfluxDB using any programming language must be able to connect to the database so that different database operations can be carried out. The influxdb-client-r library can be used to connect to InfluxDB using R. It’s a package that supports operations, like writing data, reading data and getting the database status. This client library works with InfluxDB version 2.0.

Let’s start by setting up InfluxDB version 2.0. InfluxDB is available on different platforms, like Windows, Linux and macOS. The examples in this article were tested on macOS Big Sur, although installation on any platform is straightforward.

Alternatively, you can use InfluxDB Cloud to quickly get a free instance of InfluxDB running in minutes without having to install anything locally on your machine.

InfluxDB can be installed on macOS using Homebrew:

$ brew update
$ brew install influxdb influxdb-cli

Alternatively, InfluxDB can be manually downloaded here.

Once InfluxDB is installed, you can start it by running the following command:

$ influxd

The first time you start InfluxDB, it will ask you to set up an account, which can be done using either the UI at localhost:8086 or the command line interface (CLI). For a UI setup, open the localhost URL and provide the information required for the initial setup. If you’re using the CLI, you’ll need to do it with the InfluxDB client, which can be started in the terminal using the following command:

$ influx setup

For the initial setup, note the following details:

Username: You can choose any username for the initial user.

Password: You need to create and confirm a password for database access.

Organization name: You need to choose the initial organization name.

Bucket name: An initial bucket name is required, and you can create as many buckets as you want to work with.

Retention period: The time period your bucket will store the data before deleting it. You can choose never or leave it empty for an infinite retention period.
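If you prefer to script the setup rather than answer these prompts interactively, the influx CLI accepts the same details as flags. A minimal sketch, assuming the influx CLI 2.x setup flags; the username, password, organization and bucket values here are placeholders:

$ influx setup --username admin --password my-secret-password \
    --org my-org --bucket RInfluxClient --retention 0 --force

Here, --retention 0 requests an infinite retention period, and --force skips the confirmation prompt.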

To install InfluxDB on other platforms, refer to the following link.

Once you have installed InfluxDB and completed the setup, you can log in to localhost:8086. You should see a screen like this:

InfluxDB Home

You can take a look through the various modules included in the dashboard, though this article will primarily focus on those through which you can connect to the InfluxDB client. Start with the data module:

InfluxDB Load Data

Here, you can see several sections: Sources, Buckets, Telegraf, Scrapers and Tokens. To interact with InfluxDB using R, you’ll need the Buckets and Tokens sections. To connect to the database, you’ll need to generate a private token (key) that is accessible only to you, allowing you to connect to different buckets.

To generate this token, navigate to the Tokens tab. On the right side, you will see a Generate Tokens button, which offers two options:

– Read/Write Token: This token provides read and write access to buckets; its scope can be limited to specific buckets or extended to all available buckets. With this token, you can only read and write data within an organization.

Generate Read/Write Tokens InfluxDB

– All-Access Token: This token provides full access to every bucket for actions like reading, writing, updating and deleting. This is the recommended token for this tutorial: with it, you can connect to any available bucket without any explicit configuration and perform all the needed actions.

All-Access Token

For the purposes of this article, you’ll want to generate an All-Access Token. Once the token is generated, you can access it anytime by simply logging into the localhost console.

Now that you have InfluxDB all set up, you can download R and RStudio for writing and testing the code. Installing R is pretty simple. You can download the package here, then open and install it. After the R installation, you can download RStudio, which will be the IDE that you use to write the R code. You can download RStudio here.

At this stage, you have almost all the tools and technologies needed to connect to InfluxDB. As the last step, you need to install the InfluxDB client library for R, which can be done with the following line of code:

install.packages("influxdbclient")

If you install it from RStudio, the other dependencies will be installed along with the base library. However, if the dependencies are not installed automatically, you can install them separately using the following line of code:

install.packages(c("httr", "bit64", "nanotime", "plyr"))

Making a connection

The next step is to import the InfluxDB client library in R and create an instance of InfluxDBClient, which can be used to interact with the database and perform all the operations. Parameters required to make a database connection include the following:

  • Token: This refers to the access token that you generated using the console. You can log into the InfluxDB dashboard and copy the token from there.
  • Bucket: This requires the name of the bucket on which you will be working. You can choose the initial bucket or create a new one using the dashboard.
  • Organization: This is the organization name you chose during the initial setup of InfluxDB.

Since this connection will be made locally, the connection script should look like this:

## import the client library
library(influxdbclient)
# parameters needed to make connection to Database
token = "Paste Your Token Here"
org = "my-org"
bucket = "RInfluxClient"
## make connection to the influxDB bucket
client <- InfluxDBClient$new(url = "http://localhost:8086",
                             token = token,
                             org = org)

If you are using a cloud account, make sure the url parameter matches the region where your cloud account is located, rather than using localhost. You can find the URL endpoints in the docs.
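Before moving on, it can be useful to confirm that the client can actually reach the server. A minimal sketch, using the health() method the influxdb-client-r package provides for checking database status:

## verify that the client can reach InfluxDB and report server health
check <- client$health()
print(check)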

Inserting data

Now that you have established a connection to InfluxDB, it’s time to use the data to perform different database operations. To understand these operations, let’s take a look at some sample data of worldwide COVID-19 cases from January 2020 to April 2020:

Sample data

This sample data contains the following fields:

  • Date: The date on which the observations are taken.
  • Cases: The number of COVID cases active on that date.
  • Region: An identifier of the place where cases are reported.
  • Country: The place where the observations are taken.

To read the data into an R data frame, use the following line of code:

data<-read.csv("/Users/Uname/Downloads/covid_data.csv")
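Before writing anything, it’s worth a quick sanity check that the CSV loaded with the expected columns (Date, Cases, Region and Country):

## inspect the first rows and the column types of the imported data
head(data)
str(data)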

Let’s start by first inserting this data into InfluxDB. To do so, use the write() method, which accepts parameters like this:

client$write(data, bucket, precision, measurementCol,
             tagCols, fieldCols, timeCol, object)

Note: The above method is simply a function definition, not part of the code.

This method takes the following parameters:

  • data: The dataframe or a list of points to store in a database.
  • bucket: The bucket name where you will be storing the data.
  • precision: The precision of the time stamp.
  • measurementCol: The column to use as the measurement name. A measurement in InfluxDB is similar to a table in a relational database.
  • tagCols: The columns representing tags, i.e., metadata related to the data.
  • fieldCols: The column names that you want to store in the database.
  • timeCol: The time-stamp column in your data.
  • object: An optional name used to capture the output of the write operation for debugging.

To store the COVID-19 data in InfluxDB using the write() method, you will need to make sure that your time-stamp column (Date) is in POSIXct format.

## convert date column to POSIXct
data[['Date']] <- as.POSIXct(strptime(data[['Date']], format='%Y-%m-%d'))
## write data in influxDB
response <- client$write(data, bucket = bucket, precision = "us",
                          measurementCol = "Cases",
                          tagCols = c("Region", "Country"),
                          fieldCols = c("Cases"),
                          timeCol = "Date")

The response from the write() function is either NULL (on success), TRUE, or an error. To debug the write() function and inspect how the data is serialized before it reaches the database, you can pass a variable name, such as lp, via the object parameter, as shown below.
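A minimal sketch of this debugging call, based on the parameter description above (lp here is simply a variable name of your choosing):

## capture the serialized line protocol for inspection via the object parameter
lp <- client$write(data, bucket = bucket, precision = "us",
                   measurementCol = "Cases",
                   tagCols = c("Region", "Country"),
                   fieldCols = c("Cases"),
                   timeCol = "Date",
                   object = "lp")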

Querying data

Now that you have your time-stamped data stored in the database, let’s try reading it back. To query data with the R client, use the query() method, which expects a Flux query. You can reuse the same client that you created for writing the data, or create a new InfluxDB client and do the same.

result <- client$query('from(bucket: "RInfluxClient") |> range(start: -2y) |> drop(columns: ["_start", "_stop"])')

Let’s break down the above query. Starting with the from keyword, you first specify the bucket name, followed by the time range from which you want to select the data and, finally, a set of conditions. In the above query, the condition specifies that the _start and _stop columns should be dropped from the results.
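The same pattern extends to filtering on tag values. For example, here is a sketch that narrows the results to a single country (Country is one of the tag columns written earlier; "Italy" is just an example value):

result_country <- client$query('from(bucket: "RInfluxClient")
  |> range(start: -2y)
  |> filter(fn: (r) => r["Country"] == "Italy")
  |> drop(columns: ["_start", "_stop"])')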

The result contains a list of data frames for each entry made in the database for the specified period. To check an instance of it, you can use the following code:

result[[1]][c("time", "_value")]

Now that you have queried the data, let’s use it for forecasting. Here, you will train a time series model on the retrieved data and try to predict the number of cases for the next five days. Let’s create a dataframe from the query results:

## create an empty dataframe to store all the results
df1 = data.frame()
## iterate over each entry and append it in created dataframe
for (r in result){
    sub_df = r[c("time", "_value")]
    print(sub_df)
    df1 = rbind(df1, sub_df)
}
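Equivalently, the loop can be replaced with a single vectorized call, which is more idiomatic R for large result sets:

## same result as the loop above, without growing the dataframe row by row
df1 <- do.call(rbind, lapply(result, function(r) r[c("time", "_value")]))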

Once the dataframe is created, some changes are required before a time series model can be applied to it. This stage is typically called data preprocessing.

## sort the dataframe in ascending order by time
df1 = df1[order(df1$time),]
## parse the time column into POSIXct dates (YYYY-MM-DD)
df1[['time']] <- as.POSIXct(strptime(df1[['time']], format='%Y-%m-%d'))
## rename column _value to Cases
colnames(df1)[2] <- "Cases"
## convert double values to integer
df1$Cases <- as.integer(df1$Cases)

After preprocessing, it’s time to create a time series representation of the data, using the following code:

## import libraries: forecast for modeling, lubridate for ymd() and decimal_date()
library(forecast)
library(lubridate)
## create a time series representation of the data
mts <- ts(df1[c("Cases")], start = decimal_date(ymd(df1[1, "time"])),
          frequency = 365.25 / 7)
## plotting the input data
plot(mts, xlab ="Weekly Data",
      ylab ="Total Positive Cases",
      main ="COVID-19 Pandemic",
      col.main ="darkgreen")

Input data

Finally, let’s fit the data into the forecasting model and make the predictions for the next five days:

## fit model on the data
fit <- auto.arima(mts)
## make predictions for next 5 days
forecast(fit, 5)
## plot predictions
plot(forecast(fit, 5), xlab ="Weekly Data",
      ylab ="Total Positive Cases",
      main ="COVID-19 Pandemic", col.main ="darkgreen")
# close the graphics device
dev.off()
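Note that dev.off() only closes the active graphics device. To actually write the forecast plot to disk, open a file-based device such as png() before plotting; the file name below is just an example:

## save the forecast plot to a PNG file
png("covid_forecast.png", width = 800, height = 600)
plot(forecast(fit, 5), xlab ="Weekly Data",
      ylab ="Total Positive Cases",
      main ="COVID-19 Pandemic", col.main ="darkgreen")
dev.off()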

Model prediction

That is how data can be accessed and used for time series forecasting, which is just one practical use case for time-stamped data. The whole implementation can be found here.

For more information and best practices for optimizing the performance of InfluxDB, refer to the docs.

Conclusion

After reading this article, you now know how to set up InfluxDB on your system, as well as how to create a client and read and write data for your time series use case using the R language. One major advantage of InfluxDB is that it provides client libraries for almost all major programming languages.

There are several options for storing time series data, but time series databases, like InfluxDB, can do so more quickly and at a larger scale. Several use cases, such as IoT applications, automated cars or real-time application analysis, need to insert anywhere from tens of thousands to hundreds of thousands of entries at a time. Time series databases perform this task at very high speed and in real time, allowing them to be easily adopted by any developer working on a real-time time series application. Be sure to consider deploying InfluxDB to take advantage of these features in your own applications.

About the author:

Gourav Singh Bais is an applied machine learning engineer at ValueMomentum Inc. He is skilled in developing machine learning/deep learning pipelines, retraining systems and transforming data science prototypes into production-grade solutions. He has been working in this field for the last three years and has served many clients, including Fortune 500 companies, giving him the exposure to build experience and skills that contribute to the machine learning community.