Enabling Edge-Cloud Duality of Time Series Data
Session date: Jun 28, 2022 08:00am (Pacific Time)
In this session, learn about a new feature in InfluxDB: Edge Data Replication! Discover how to automatically replicate data from an InfluxDB instance to InfluxDB Cloud. This gives developers insight into all assets at the edge - including sensors, servers, networks, and apps. InfluxDB is the centralized hub for collecting, storing, and analyzing time-stamped data from the edge, cloud, and on-premises. InfluxDB automatically copies data from the source instance and sends it to InfluxDB Cloud for all engineers, data scientists, and business analysts to utilize.
Sam Dillard will discuss the growing needs and challenges of edge computing. Applications have become more distributed and data volumes keep increasing. Sam will discuss InfluxDB’s new edge data replication feature that leverages existing capabilities of the time series platform in order to enable edge-cloud data pipelines that fit any business needs and constraints. This feature automatically streams data on-write from an edge dataset to a cloud one of the user’s choosing. Adding to this automatic replication of writes is a durability designed to withstand network outages. This feature lays the groundwork for a much larger story about how the edge and cloud can work together to produce global time series data architectures! Sam will cover:
- Methodology for improving IIoT monitoring at the edge using a time series platform with nanosecond precision
- The importance of centralized visibility into all assets to meet business requirements
- How to use InfluxDB and Flux to reduce latency and cloud operational costs
Watch the Webinar
Watch the webinar “Enabling Edge-Cloud Duality of Time Series Data” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
[et_pb_toggle _builder_version="3.17.6" title="Transcript" title_font_size="26" border_width_all="0px" border_width_bottom="1px" module_class="transcript-toggle" closed_toggle_background_color="rgba(255,255,255,0)"]
Here is an unedited transcript of the webinar “Enabling Edge-Cloud Duality of Time Series Data”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
Speakers:
- Caitlin Croft: Sr. Manager, Customer and Community Marketing, InfluxData
- Sam Dillard: Senior Product Manager, Edge, InfluxData
Caitlin Croft 00:00:00.777 Hello everyone, and welcome to today’s webinar. My name is Caitlin Croft and I’m joined today by Sam Dillard, who will be talking about edge cloud replication. So if you didn’t see the news recently, we recently launched a new feature in the InfluxDB platform and Sam’s going to be diving into it. This session is being recorded and will be made available later today, as well as the slides. Please post any questions you may have for Sam in the Q&A. We will answer them at the end. And just want to remind everyone to please be friendly to all speakers and attendees. We want to make sure that this is a safe, fun, happy place for our community. Without further ado, I’m going to hand things off to Sam.
Sam Dillard 00:00:49.025 Thank you, Caitlin. Hi, everyone. Nice to have you. So I am Sam Dillard. I’m a product manager within InfluxData. As you can see by the slide here, I work with basically everything having to do with the edge. But if you’re not familiar with the edge, consider that to be the on-prem stuff. So I work with everything kind of outside of our cloud product. I collaborate with them, but the stuff I own is outside of that. I’m realizing now that the title of this slide actually is different than the title of the webinar. Don’t worry, the topic has not changed. They’re both relevant. So I’m going to talk about this replication feature that Caitlin alluded to, but only briefly really. I’m actually going to be talking about the context, the market context and why we built it and sort of talk about the value of it in the context of InfluxDB, time series data pipelines and this thing called the edge cloud duality, a duality that we believe can’t really exist without this replication feature or something like it. So it’s more of a background on kind of the why, rather than a deep dive on the what. I figured that would be more interesting for you, too. So it’s not much of a pitch really here. So what we’ll talk about specifically is I’m going to talk about the edge, the cloud and then the edge cloud duality and kind of what all that means and the reasons people want to do that, but then also why people kind of aren’t doing it. What are the current problems with this edge cloud thing that prevent people from actually building these pipelines or building these topologies to improve on their business intelligence? Then I’ll go over the replication feature itself briefly so that you know what it is and how it’s helping, and then use cases, conceptual use cases, that I’ll kind of do a little bit of visualization for.
Sam Dillard 00:02:48.399 So moving right in. What and why edge? Edge is a nebulous term, I’m assuming based on selection bias, that probably most people here are familiar with the term, but let’s level set on it. The edge, rather than defining it, I’ll talk about its properties, sort of build to the definition with its properties. The edge is where data sources are, especially in the world of industrial IoT or in data center monitoring. The data sources are at the edge. Therefore, data is born there and is in its raw form. At some point, it’s in its beginning stages and it hasn’t been computed at all. Nothing has happened to the data. It’s completely raw, which is something, a property that we can design for if we want to put a data layer, a data storage layer at the edge. We can design for that by having fast ingestion, highly precise ingestion and storage, meaning like having precision down to the nanosecond, and then having low latency query so that the data as it’s raw and being ingested quickly, it can also be computed quickly and acted upon quickly or immediately, in some cases. Data is also scoped to the network of the edge itself. There may be many edge presences, if that’s a word, but within that one edge presence, all the data is scoped to that one network. So we can design for that again, and that’s ingestion speed and low latency query once again. A negative of the edge is that it’s constrained resources. If you’re familiar with working with the edge, you’re probably familiar with not being able to update your hardware there or add new hardware and scale. And it’s probably fairly old, right? Most people are dealing with potentially 20-year-old hardware sometimes, and it’s about time to upgrade. But what this means is you need to have something - your data layer needs to be resource efficient.
Sam Dillard 00:04:46.503 And ideally, it would have some kind of built-in data management of its size, of the data size. So if you’re familiar with InfluxDB, you probably know what I’m alluding to, but I will talk about that stuff. This next part is fairly obvious. Why cloud? We’re not going to go into depth, but of course, the pitch for cloud for the last 15 years in this world is that it’s generally centrally located, it’s scalable and it’s managed for you. So it lets you worry about the applications you’re building rather than the infrastructure that supports them. So why both? We understand maybe why edge. It can be fast, low latency, it can be potentially cheap because of the old hardware, but it can be private because it’s in a private network. And the cloud has its advantages, but there is a reason there’s a value in having a duality of the edge and the cloud so that your business can see both at the same time effectively. So the reason why you do both is you can combine the speed, precision and privacy of the edge with basically the flexible horsepower of the cloud. When you have critical and immediate workflows, or workloads, if you can and want to put them at the edge, you should. If you have private data, it shouldn’t leave the edge if it’s not necessary to leave the edge. So you should have a data storage layer there for that reason. Any compute intensive workloads, if you’re training machine learning models or doing large OLAP-style analytical queries against your time series, you’re going to want to have scale, you’re going to want to have a ton of horsepower to back that up. And potentially, you might need more quickly. So you have to be able to scale it up quickly and you might want to do that in cloud.
Sam Dillard 00:06:29.123 Also global context, the fact that cloud is central means that you can have your edge presences, but you can combine them all into one place and do analysis across them all if you want to understand how all your factories or all your wind farms or all your satellites or even all your data centers behave with certain changes to whatever parameters. So there’s a lot of value in the edge cloud duality, but there’s a reason why it’s kind of rare. Today might be the first time you’ve heard of the term edge cloud duality. I didn’t make it up, but it isn’t commonly used. And I think that’s mostly because it’s fairly rare as something that you design for. And the reasons are there’s huge challenges today. They may be fairly obvious, but these are actually showstoppers for people. So we talked about the properties of the edge. The data is raw, precise, high fidelity and potentially high velocity. That data, when transferring over the wire to the cloud may be too large to do that in its raw form. When storing that data in the cloud, it may not make sense. The queries you’re doing in cloud may not require that the data be so precise and so granular. And so why store something that’s going to cost you a ton of money when you don’t have to? And then the queries themselves against that data, if the data is too precise, if you’re querying long time ranges of data, huge data sets, the query may be infeasible to run or just incredibly expensive. So for those reasons, the precision of the edge makes transferring that data to the cloud sometimes impossible, but often just too expensive to do it.
Sam Dillard 00:08:29.358 So the second thing is the connection between the edge and the cloud. In a lot of industrial cases, mining sites, wind farms, oil rigs, satellites, planes, cruise ships, whatever, there’s going to be some kind of intermittent connectivity issues, expected ones. It’s not a huge problem, but it does make transferring data from one place to the other in an automated programmatic way very hard. So you have to have skilled developers actually maintain that. Otherwise, you’re doing manual ad hoc exports and imports, which is just not fun for anyone. And it’s enough to say, “I don’t want to do an edge cloud duality.” So those are the challenges. If you’re familiar with InfluxDB today, you do know that it offers benefits either in the cloud or at the edge. InfluxDB allows for high velocity ingest, real-time queryability, so data is immediately available for query so that there’s that low latency thing again. It offers you the Flux engine, which is an analytics engine that is both scriptable in an automated way and also functional so you can create your own functions. So it’s almost Turing complete in that way. So you can do kind of whatever you want with your data. There’s your data management. It’s a small single binary that includes the storage, visualization, and any utilities you need. But it also uses few resources for how much data you’re actually working with. All that exists in the cloud as well. But of course, with the cloud, you get the benefit of being central, scalable, and managed. The issue here is not that Influx couldn’t handle the edge or couldn’t handle cloud data. It’s built for both. The issue is that historically, we hadn’t really addressed this edge cloud duality problem yet of getting the data from one to the other. You kind of had to choose or do some extra work to make those both work together. With the introduction of the replication feature, that’s not true anymore.
Sam Dillard 00:10:29.498 So this replication feature is in the OSS, open source version of InfluxDB. And what it does is it allows you, as a business, to reliably stream your data from the edge to the cloud in real time. And it’s backed by a durable on-disk queue that will buffer data in cases of intermittent connectivity issues so that when that connection is restored, the data in that buffer will get flushed to the cloud so that you have this eventual consistency between the data you want replicated or mirrored to the cloud and in the cloud. So this allows you to selectively choose what data gets safely transferred or mirrored to the cloud, which is an incredibly important way to - or it’s an incredibly important property to solve for that huge second problem. It’s Flux that actually solves for that first problem, the ability to intelligently shrink your data, essentially, to make it viable over the wire. So what we’ve got here, basically, is that at the data layer, whether you’re using InfluxDB on its face just for operational stuff, like with the Grafana dashboard or something like that, you can do it in the cloud, you can do it at the edge, and now you can do it in both places safely and less expensive. But of course, Influx is also a platform. It has a fully featured API that you can build applications on top of. So if you have an application, a time series application that is, that you want to deploy for your customers at the edge and the cloud, InfluxDB will offer you the backbone to do that safely and cheaply and in a way that makes sense to your customers. And it’s designed to be many to one. So this replication, you can replicate from many presence of edge presences, again, if that’s a word, to a single presence in the cloud.
Sam Dillard 00:12:28.852 And effectively, you’re going to probably do something like this. You’re going to be doing temporal aggregation or dimensionality reduction. What I’m trying to visualize here is that using Flux at the edge, you’ve got your local kind of raw data storage here, and then you have your programmatically downsampling that data to another bucket that’s configured to replicate. So what you have is this one bucket selectively mirrored to the cloud. It’s less data and hopefully, because you’re using Flux, it’s fairly faithful to the original data set. And that’s what we’re going to talk about in the last kind of third of this. Other options, not just temporarily aggregation, but you can actually pivot data and change the shape of data. You can do feature enrichment. Flux is almost touring complete. So you can do - if you have entirely different needs for the data in the cloud, you can make that happen at the edge and send only that data in the shape and form that you wanted in the cloud and send only that to the cloud. That’s all you need. The edge can handle that for you. But for this, we’re going to talk about downsampling because, in my opinion, and from what I’ve heard from the field and what I’ve heard from all the research out there, downsampling, it’s the data size that matters the most. So downsampling is going to be the emphasized bit here. There’s are other things you can do, but downsampling is the most important thing. Downsampling is driven by functions that either aggregate data multiple points into one or few, or selecting data from a bucket, selecting one or a few points from a bucket. And there are many, many ways to do that, both built-in and custom in Flux. So first I just want to go through basic aggregations. Your mean, min, max, last, derivative - sorry, not derivative, moving average, that kind of thing, right?
Sam Dillard 00:14:24.941 You are reducing your data and trying your best to maintain the shape. And for certain data sets, these kind of aggregations make a ton of sense. They’re also very cheap to run. Derivative, and the column name here, I do not mean rates, I mean derivative as in that you’re deriving new data from the original data. So your count function is actually a derivative function in this way. The count returns data that isn’t faithful to the original data set. It’s changing the meaning a little bit. It’s giving you a different perspective on your data, and it’s changing it to numerical values. Maybe you have text values there, but you’re counting the number of occurrences of it. So it is changing the data, but it’s giving you a different type of information, and we’re going to actually talk about that in a second. The derivative function isn’t rate. Holt-Winters is a predicting function. It’s a forecasting function. Triple exponential moving average. You have your momentum oscillators. If you’re in finance, you can take your original data set and then turn it into an entirely different data set that basically says should you buy or should you sell. Basically, that kind of thing. On the right side is the custom functions. You can create your own functions in Flux. We’re going to talk about count over threshold here, which is just one that we use internally. It’s one I recommend highly. “Candles”, if you’re back to the finance people, Candlesticks is a derivative data set from an original price value, basically. It derives five or six values from one. So that’s definitely not downsampling, that’s actually expanding your data set, but it is changing it completely. And swingingDoorTrending - we’re going to talk about at the end, too. But these are algorithms that you can use to remain faithful to the original shape of the data and also reduce it significantly. So we’ll start by jumping into the mean.
Sam Dillard 00:16:23.400 So this is your most common aggregation function by far. You do an aggregate window. If you’re familiar with Flux, you throw a mean in there and you get less data back on purpose, which is actually fine for any data set that is normally distributed. The mean will in fact provide you a fairly faithful interpretation of what happened in the original data set, and it will reduce the data a lot, potentially. But if your data is not normally distributed, you might actually lose information. And what we want to do is we want to make sure that when we’re transferring data from the edge to the cloud, that the data in the cloud is still useful to us. It’s not just smaller. And then all of a sudden, the data scientists in the cloud and the analysts are like, “Okay, great, you shrank the data so we could get the data, but I don’t have any use for this data because it isn’t the data. It’s just not accurate.” Well, in this case right here, this data may be normally distributed, I don’t know, but it might not be. And so if you apply a mean here, you do lose some information here. It’s a cheap function to run. And if you’re not losing information, if you decide that this graph is showing you the trend you want to see, absolutely do it. You should do it because it’s a cheap function. But you will notice that there are peaks and troughs that we’re losing here. So just be aware of that. Make sure that your data set is conducive to running a basic aggregation function like mean, if you’re going to do that. Another option is that Count Over Threshold. So this is a totally different way of thinking. The Count Over Threshold actually just evicts all of the original data. It’s totally unfaithful to the original data set. None of the actual data points or shape of the data really stick around. But it’s possible that your data set at the edge, you only care, or from the cloud’s perspective, you only care about the high values.
Sam Dillard 00:18:23.826 You don’t even care that low values happened. So what you could do, for instance, is count the number of high values in windows, and I’ll show you this in the next slide. This can be really useful because if you care about high values, but you maybe don’t care about intermittent high values, then counting them per window is going to help you a lot. So in this example here, we’re running this count over threshold and in our first window, we find one. In our second window, we find zero. In our second, we find two. The graph is not super faithful to that, but you get the point. In our fourth window we find zero, and then we find six. So we generate this new data set and we send that to the cloud. And we have this new graph that looks - I guess it’s kind of similar in shape to the original, but it’s not really the point. The point is that we might want to alert on threshold breaks, but we don’t want to alert on each threshold break, because that’s just noisy. If we go back, we break the threshold once, but then we come back down. We break the threshold again, we come back down. It’s not a big deal to us, right? But when we break the threshold a bunch, there might be something interesting happening there. So we want to notify ourselves that this happened six times in a single window. So the six times broke this threshold we gave ourselves, user defined. The threshold was five. We broke that five. So we’re doing this kind of meta query against our derived data set. So this can be a very useful function. You can also do the mean along with this. You can derive two, three or more data sets from your original data set. And then lastly, is your Swinging Door Trending.
Sam Dillard 00:20:04.886 This is just an example, but this is a very common industrial IoT algorithm that is very intelligently downsampling data significantly, but also not only retaining the shape very faithfully of the original data set, but it’s also selecting the original data points themselves. So with the mean, for instance, you are actually, in a sense, making up a new data point, a point that probably actually doesn’t actually exist in the original data set. And its timestamp isn’t going to be the same as the original timestamp either. So if you care about all those things, you care about the shape of the data as well as the interesting moments and exactly when they happened, maybe down to the nanosecond precision, the Swinging Door Trending will maintain all of that for you while reducing the data sometimes 80, 90, 95% percent, or more. So you can send that data to the cloud, and then you have basically, for all intents and purposes, the same data set for all of your analysts in the cloud. But it’s much cheaper to store, it’s faster to query, and it makes it possible to even store in the cloud because you reduced it enough to send it over the wire. And that is it. That’s all I had for you. Thank you so much for your time. We’ll do Q&A now. Hopefully, there’s some questions, because I feel like there’s some thought-provoking things in there. But if you want to hit me up, you can do so on our community slack, too. But you can do it on Twitter @sdillard12. And then I have the OG Sam Dillard tag here for LinkedIn. I’ve been a LinkedIn guy for a long time. So I am at Sam Dillard in LinkedIn, so.
Caitlin Croft 00:21:50.164 Awesome job. Thank you, Sam. So there are a bunch of questions here. So the first one, someone’s asking about the recording. Yes, I believe there will be a link that is sent out about it, but everyone who’s registered for this will have my email. So feel free to email me if you can’t find it, and I will send it to you. The good news is it is just the same link that you registered for the webinar, that’s where the recording will be made available. All right, so the first question, and I apologize, Sam, this might have been a good one to ask during it, based on when you were asking it, but for what you were talking about, it says here, “Does that mean I can, for example, replace an IoT hub with edge data replication?”
Sam Dillard 00:22:37.300 Yeah. Okay. I think I know the slide, probably, that is relevant to this. We can even go back here. We’ll be dynamic. Look at this. Probably something like this or like this would trigger that question. Yeah, I think the answer depends. It depends on what you’re actually doing with that hub. Any data layer part can be replaced with InfluxDB. InfluxDB is not going to hook into your devices themselves and do remediation, but it is going to be the data layer that enables that kind of work. So if let’s say you’re an industrial IoT vendor, where you have an application that Hershey’s uses or Siemens might use. You’re going to need a data layer underneath your application, and you don’t want to worry about building the database part, that’s kind of the hard part. So you invest in InfluxDB to be the backbone of that application, and you use the custom and fast computation of InfluxDB to actually do that stuff at the edge or in the cloud. So I would say that Influx is more of a complement to your kind of gateway or your hub because it really is just the data layer. We would love to build up the stack and do automated remediation features and things like that, but we haven’t gotten there yet. So I guess the answer is yes and no. It just depends on what that hub is doing. If your hub is a data layer, it’s just a database, then absolutely, yeah.
Caitlin Croft 00:24:22.265 Perfect. Do you have any kind of filters to apply to raw data, like a median filter to remove noise?
Sam Dillard 00:24:32.217 Built-in, that one doesn’t sound familiar, but yeah, absolutely. There’s definitely variations of that that kind of effectively do what you’re asking. If it doesn’t exist built in, that’s where one, you can make a feature request and we may build it either in Go or in Flux. But Flux is functional. So if you understand the algorithm and can implement it, you can implement it in Flux and make it available for you and all of your other peers.
Caitlin Croft 00:25:07.144 There’s kind of an add-on: digital signal processing filters, like finite impulse response or infinite impulse response filters.
Sam Dillard 00:25:19.002 I’m not familiar with those. But if those data sets are time series in nature, like you’re doing a computation over time, it doesn’t have to be that, but your InfluxDB is designed for that. So you’ll have a good time if you’re doing time series computations and potentially a bad time if you’re not doing time series computations. If those are time series computations, then yes, you can do it. Since they’re not familiar to me, they’re probably not implemented in Flux today, but you can implement them yourself in Flux, if you want to.
Caitlin Croft 00:25:55.302 Cool. Does replication cover security or do I need to set up some sort of tunneling to get data from the edge to the cloud?
Sam Dillard 00:26:05.780 It’ll be encrypted with TLS. So hit me up if you mean something more than that and we can talk about it, but at that basic level, yes, it’s secured.
Caitlin Croft 00:26:20.540 Can cloud updates be batched in a time window? So store and forward, but forward only once an hour or between 1:00 and 2:00 AM?
Sam Dillard 00:26:30.888 Yeah, Flux supports all of that.
Caitlin Croft 00:26:35.908 So what kind of Flux functions are there to help with that? Do you know what they’re called?
Sam Dillard 00:26:41.949 Well, so the first one, I think, was just basically windowing, it sounded like. Maybe I misinterpreted. The second one is, I don’t remember the exact function names, but there’s a time and then there’s date packages where you can set - and offsets, where you can set any work you want to do on your task, basically, to only happen at certain times or on data within certain windows in a repeatable way. I don’t think Flux understands 2:00 AM, but it understands offsets. So you can sort of derive your own concept of 2:00 AM, essentially, from Flux. But it’s an interesting question. So again, I would want more time to think on a better answer. So if you want it again, I’m available over Slack. So hit me there.
Caitlin Croft 00:27:39.867 Yeah, and Noelle just added, “No, the aggregation windowing might be continuously applied, but I want only for the data on a schedule.”
Sam Dillard 00:27:52.794 Oh, okay. Only for the data on a schedule so the - well, no, because the - well, in the replication feature, the idea is to make it automatic and as real time as possible. So it’s just going to be flushing data as fast as it can. The interesting thing about that question is, I think I would assume that the reason for the question is that there are points in time where it doesn’t make sense to flush because maybe the internet, the connection might be down. In that case, it would just continue to try until it could. So you might just accidentally get that result. So it might be the best of both worlds, but there may be a use case out there that I’m not thinking about, which, again, I would expect hopefully somebody would ping me about it and we can talk about it.
Caitlin Croft 00:28:42.002 Yeah. And they’ve expanded kind of the use case of like a satellite, where it might only be available at certain times. So it’s more about doing it when there’s a connection versus when the data is there. It sounds like if it’s continuously collecting data, but if there’s not a good connection all the time.
Sam Dillard 00:29:02.601 Yeah. So that’s actually kind of the core reason that the feature was built. So what it’s going to do under the hood is it’s going to continue to try to write. I think there might be an exponential backoff on that, but if it fails, it’s going to store that data in the queue and save it there for a configurable amount of time - it’s going to keep it there until that connection is restored. So in a way, I think it’s going to do exactly what you want it to do. It’s not going to write when the connection is lost, but it’s not going to drop the data either. And when the connection is restored, it’s going to write it out as fast as it can.
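The store-and-forward behavior described here can be sketched roughly like this - a toy illustration of the pattern, not InfluxDB's actual implementation (the real feature uses a durable on-disk queue):

```python
import time

def flush_queue(queue, send, max_backoff=60.0):
    """Store-and-forward flush loop (illustrative pattern only).

    queue: in-order payloads awaiting replication (the real feature
    persists these on disk, so an outage or restart loses nothing).
    send: callable that attempts delivery and returns True on success.
    Failed sends back off exponentially; once the connection returns,
    the backlog drains as fast as sends succeed.
    """
    backoff = 1.0
    while queue:
        if send(queue[0]):
            queue.pop(0)      # delivered: advance past this payload
            backoff = 1.0     # connection is healthy again
        else:
            time.sleep(backoff)                   # wait out the outage
            backoff = min(backoff * 2, max_backoff)
```

Data written while the link is down simply accumulates in the queue; nothing is dropped, and nothing is sent until a write succeeds again.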
Caitlin Croft 00:29:49.295 Cool. Awesome, Sam, thank you so much. We’ll stay on the line here just for another minute in case you guys have any more questions. Session has been recorded. So you’ll be able to check it out. We did have another webinar last week that talked about edge data replication. So if you missed that, you can find it on our website as well. And Sam is very active in the community, so if you want to go bug him, I definitely encourage you to. He’s known internally for asking lots of questions. So I’m sure he loves it when the community asks him a lot of questions as well. But Sam, given the amount - you’ve talked to a lot of our customers over the years and you understand pretty well what all of our different community members are doing. What excites you the most about this new feature?
Sam Dillard 00:30:45.416 I think it’s actually said by this slide, coincidentally. I love working for a company that is the backbone for other companies. So the reason why I like developing on a platform is that people can build a business on top of the software. So the thing that excites me the most, and this might sound cliche, but it truly is, because it does benefit us too, that’s why it excites me, but when our customers can make more money, that’s what excites me because it means we’re passing through a ton of value. And so I think if this feature lands with people, and there’s people building IoT, or not even just IoT, but time series applications that have a distributed nature to them, we want InfluxDB to make that super easy. And when those customers build stuff like that, their customers will value their product even more. So that’s the cool stuff.
Caitlin Croft 00:31:48.688 Awesome. Yeah, I’m excited to see what people are going to build with it.
Sam Dillard 00:31:55.568 Me too.
Caitlin Croft 00:31:57.933 Awesome. Well, thank you everyone for joining today’s session. It has been recorded and will be made available later today. Thank you everyone.
Sam Dillard 00:32:07.031 Thanks everyone.
[/et_pb_toggle]
Sam Dillard
Senior Product Manager, Edge at InfluxData
Sam Dillard is a Senior Product Manager, Edge at InfluxData. He is passionate about making customers successful with their solutions as well as continuously updating his technical skills. Sam has a BS in Economics from Santa Clara University.