How an Analytics Platform Detects Reliability Threats and Eliminates Obstacles Impeding Results Using InfluxDB
Session date: Jan 21, 2019, 8:00 am (Pacific Time)
Tignis built a physics-driven analytics platform that helps improve the reliability and efficiency of connected mechanical systems. Tignis’ solution analyzes large quantities of time series data from IoT sensors to help identify issues affecting system performance in real time, as well as provide accurate data for predictive maintenance. They chose InfluxDB for its high ingest rate and efficient storage of time series data, as well as the ease of feeding this data into their systems for predictive analytics.
Hear from Jon Herlocker, CEO at Tignis, to learn how using a purpose-built time series database helps to continuously optimize reliability of their customers’ connected mechanical systems.
Watch the Webinar
Watch the webinar “How an Analytics Platform Detects Reliability Threats and Eliminates Obstacles Impeding Results Using InfluxDB” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
[et_pb_toggle _builder_version="3.17.6" title="Transcript" title_font_size="26" border_width_all="0px" border_width_bottom="1px" module_class="transcript-toggle" closed_toggle_background_color="rgba(255,255,255,0)"]
Here is an unedited transcript of the webinar “How an Analytics Platform Detects Reliability Threats and Eliminates Obstacles Impeding Results Using InfluxDB”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
Speakers:
- Caitlin Croft: Customer Marketing Manager, InfluxData
- Jon Herlocker: President and CEO, Tignis
- Alan J. Castonguay: Staff Engineer, Tignis
Caitlin Croft: 00:00:00.000 Hello, everyone. Once again, my name is Caitlin Croft from here at InfluxData. We’re excited to have Jon Herlocker and Alan Castonguay from Tignis. So without further ado, I will hand it off to Jon.
Jon Herlocker: 00:00:17.112 Great. Thank you, Caitlin. Appreciate the chance to talk a bit about Tignis and what we do and how we’ve been using InfluxDB. Just to kind of set the stage in terms of expectations of what you’re going to hear today, I figured most of you want to hear about kind of what we did, why we chose InfluxDB, and what we do with InfluxDB. And so I promise you that will be the majority of the talk. However, I will take about 15 minutes to talk about what is Tignis and what we do. And so let’s hop right into this. So I’ll start by just kind of introducing ourselves. So I’m Jon Herlocker, cofounder and CEO of Tignis. Tignis is a startup company - and I’ll tell you a little more about what we do here in a second - been around a little over two years. Prior to this, I was the chief technology officer at VMware’s Cloud Management BU. And a little further in the past, I started out as a professor of computer science. Joining me is one of our awesome engineers, Alan J. Castonguay. Alan is here because he actually is the expert in what we did. He’s responsible for operating our infrastructure and DevOps, and he is responsible for a lot of the early implementation that we did with InfluxDB. And so you can bring your most technical questions today, and Alan is going to be able to answer anything that I can’t touch. And so thanks, Alan, for joining me.
Jon Herlocker: 00:01:51.060 So let me start by setting the context and tell you a bit about what Tignis is and what we do and then the problem that we solve. And so the fundamental problem that Tignis was created to solve was helping industrial manufacturing operators implement what we call condition monitoring and optimization. And so the problem is that when you’re looking at large industrial systems - you’re looking at manufacturing plants - downtime and inefficiencies can be incredibly expensive. In the pharmaceutical world, a 1% yield difference can be $40,000 a batch. Right? And if you have downtime in a power plant, it can be millions of dollars in minutes sometimes. So equipment is expensive. It’s expensive to replace. It’s expensive to operate. If it operates inefficiently, it can be very expensive. And if it’s down, it can be catastrophic [inaudible].
Jon Herlocker: 00:02:50.340 At the same point in time, we have this whole revolution around the industrial Internet of Things, where you can put sensors on things. You can collect data. You can then plug that into software. You can use that software to make data-driven decisions about how to optimize your process. And you can also detect when things are not right or things are headed in the wrong direction. It’s called condition monitoring. You’re monitoring for conditions but also optimization. The challenge is that for these large industrial operators, actually getting to the point of implementing that has all of these obstacles. And these obstacles can include installing [all the?] necessary sensors, [working out?] all the IT in terms of getting the data off of those sensors and through the secure networks into the place centrally where it can be handled. There are lots of challenges around data quality. And even if you get all the data in one place, now, how do you make sense of it? Right? How do you create some sort of an alert that says, “Now [something to do?]”? And even if you put all that work in up front, it’s actually something that has to be continually maintained. And so there are just all these obstacles to applying these advanced technologies that enable you to achieve these reduced downtimes and increased efficiencies.
Jon Herlocker: 00:04:08.962 And so Tignis brings a very unique solution to this space, which we call physics-driven analytics. And I guess there are several parts to our solution, and I’ll talk more about them in future slides. But part of this is that we provide Software as a Service that is somewhat all-inclusive. And so a lot of the obstacles to implementing these data-driven condition monitoring optimizations for industrial systems come down to the time and expertise it takes to get to the point where you can implement this. Right? As a Software as a Service, we provide the expertise, we provide all the IT support, etc. Customers can get up and running in a very short amount of time. And the other aspects you’re going to hear more from me about are that we’re combining an engine that understands the physics of these systems with the most advanced machine learning, as well as top-notch data science and subject domain expertise. So we bring all that together in one place in a unique way that nobody else does, leading to these value propositions that you see on this slide.
Jon Herlocker: 00:05:22.251 A little bit more specifically, how are we able to actually help sort of a large industrial plant just sort of bypass all these classic obstacles and get to sort of data-driven condition monitoring and optimization and delivery [use?], therefore reduced downtime and increased efficiency? We start with the two things that the plant already has. Right? And one is something you see on the left, which is a schematic or a piping and instrumentation diagram. It kind of shows that any plant - any industrial plant is going to have something like this that shows kind of how are all the pieces connected, where are the sensors located, where are the control points located, where are the valves, all that kind of stuff. They will have something like this that they use to understand and diagnose their platform. The other thing that they will probably have is that any sort of relatively complicated industrial system has to have a control system. And that control system sort of manipulates kind of flows and valves and temperatures based on sort of looking at the sensor readings. And so there are always some amount of sensors in these industrial plants. And so those sensors will be collected at the control platform. So they will have some amount of data that they’ll be able to export. So we just sort of say, “Well, you’ve got these two things,” which most people do. Then we can get started.
Jon Herlocker: 00:06:42.992 And what we do is we build what we call a digital replica of your schematic, of your piping and instrumentation diagram. You kind of see a picture of it here. We’ve got a product that was kind of - one of our customers called it Visio for schematics or whatever. But you can sort of drag and drop to create a digital replica of your physical schematic. Right? And then our system from there basically not only gives you a rich way to understand and explore what’s going on in your system, but also creates the data format on which we can run our physics-based analysis and our machine learning in a way that automatically can handle changing conditions. So if you retrofit your plant or you change a little bit, you tweak the digital model, and suddenly all the analytics will automatically update for you to support the new structure. And having all of this, then, we can continuously monitor the systems in the context of those models.
Jon Herlocker: 00:07:45.984 And so the problem here that we’re dealing with is that these complicated mechanical systems have lots of data problems. And I’m not going to read over that list, but there are certain problems that are so obvious everybody knows about them - your plant shuts down completely over them. But then there are tons of smaller problems that are affecting the efficiency of your system or are going to lead to a major failure in the future that are much harder to detect. And we really focus our condition monitoring on detecting those kinds of things. And that’s how we significantly decrease the downtime and increase the efficiency. And as a result, we can provide value like predictive maintenance - detecting these indicators of future failures. We detect these hidden failures that you can’t see. And then our rich user experience, together with our analytics, helps people identify the region - where in the plant the problem is. We accelerate this root cause analysis. And a couple of key terms there - mean time between failures and mean time to repair - those are a few key metrics that our customers are always looking to optimize. And if you can significantly increase your mean time between failures and lower your mean time to repair and lower your operations costs, you’re talking about potentially millions of dollars a year per plant. And customers love us, I guess. For example, we currently monitor hundreds of the world’s largest chiller plants. One of our partners in that area is Optimum Energy, and as you can see, they like us. So we’re kind of a couple years in, but we have a pretty good footprint already.
Jon Herlocker: 00:09:27.725 All right, so that’s us. I figured let’s get into the design and architecture and how it relates to InfluxDB, which I think most of you folks, [if you’re a client?], wish to hear about. Let’s start by - this is more of a high-level functional view of what does our product actually do. So we’re a Software as a Service. We currently run inside the Azure cloud. And inside of Azure cloud, we have a certain set of services, a process of services that have to run. So sensor data comes from outside into the Azure cloud. We have to clean that data. We have to normalize that data. We have to generate our machine learning models. We have to then detect the anomalies and trends. We have to generate evidence for cause. This is a sort of an idea of the process that happens. But all of this is powered - and, I guess, all of this is powered by analysis across those metrics. And in that last box there, which is sort of interactive analysis of root cause, is once we’ve identified a set of conditions that we think a customer might be concerned about, there’s sort of a final step, which is we call what we do augmented intelligence. We’re not like a black box and just like, “You need to do this, and we’re not going to explain it,” which a lot of AI systems are. Instead, what we like to do is say, “Hey, we detected a condition that you should be concerned about. And here’s as much information as possible to help convince you that you should be concerned about it, but also to help you kind of make the final assessment about what’s the root cause of this incident or what should be done to mitigate the problems being introduced by this incident.”
Jon Herlocker: 00:11:04.619 And toward that end, we have a very rich user experience, which maybe I’ll show a super quick demo of, that involves interactive analytics across the metrics that we’re collecting, which can be quite large. And so at the core here, we needed a time series database. Right? We needed to store all of these metrics from all of these sensors from all of these customers and their sites in a way that would support the use cases. And I’ll go into some more detail as the time goes on. But our design considerations when we started were basically the following. We’re a startup company. Right? When we started looking at InfluxDB, we did not have a product. Right? We had to build a product. And because we’re a startup company, it’s not like the product is well defined. Right? When you initially start, you kind of have to get something working fast and iterate and test it with your customers. And then some things work, and some things don’t. You have to make changes and adjust, etc. And because you’re a startup, you also have limited cash to pay for all this stuff. And so your biggest fear in an early-stage startup is running out of money. Right? And so you want to really be efficient with that.
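To make the “metrics from sensors from customers and their sites” idea concrete, here is a minimal sketch of formatting one IoT sensor reading as InfluxDB 1.x line protocol. The measurement, tag, and field names are illustrative assumptions, not Tignis’ actual schema:

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one sensor reading as InfluxDB 1.x line protocol:
    measurement,tag=...,tag=... field=...,field=... timestamp_ns
    Escaping of spaces/commas and integer-field 'i' suffixes are
    omitted for brevity."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

# Hypothetical example: a chilled-water supply temperature at one plant.
line = to_line_protocol(
    "sensor_reading",
    {"site": "plant-01", "sensor": "chw_supply_temp"},
    {"value": 44.2},
    1548057600000000000,
)
```

Tagging by site and sensor like this keeps each series cheap to filter when every tenant has its own database.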
Jon Herlocker: 00:12:23.150 So that led to kind of bullet number one, which is we wanted to get up and going fast. Right? So we needed to get up and going so we could start testing with early customers. And every minute, every day, or every month that goes by that we don’t have it up and running for sort of testing with customers, it’s just a chance that we’re going to run out of money before we figure out sort of what the magic product was. So this was us a few years ago. The second bullet for us was we didn’t want it to distract from our core business. So our core business, we decided, was not storing time series data. This is something that other people have mastered and that can be acquired through vendors like InfluxData. And as a result, as a startup company that’s trying to disrupt the space, we wanted to focus on what was our core business, which we felt was solving these industrial problems of AI and machine learning. And so we wanted to focus more on that side and not have to spend a lot of brain cycles on time series stuff. And so that’s why we were looking to outside vendors. As I described, there’s sort of two parts of what we do. There’s sort of doing data analytics to identify potential problems, but there was a separate sort of augmenting the human experience. [It was?] very critical to us to provide an awesome interactive user experience for our customers who are exploring the data, and I’ll talk some more about that. And I think the last one, which was particularly important to us, was we wanted to not make a design decision that would constrain our growth, that would constrain where we could deploy, that would constrain us scale-wise, etc. And so this was kind of like we had lots of sort of smaller things we were looking for, but they all kind of reduced down to sort of one of these four.
Jon Herlocker: 00:14:28.477 So as we looked around, there were several points in favor of InfluxDB. And the number one bullet was, when we were looking around, it was considered very mature technology. And we knew multiple people that we trusted who had done a lot of work in this space who had had successful experiences with InfluxDB. And that was, I would say, probably the strongest reason for us. But we wanted to verify. We wanted to go and make sure that this was going to be a safe choice if we chose to go down this path. And we found there was a very strong community. Right? Lots of people using it, lots of people talking about it, lots of forums where people were answering questions about it. And to us, this was particularly important because you could hire people who knew how to use it. Right? You could find consultants or contractors who could help you with our problems. Obviously, InfluxData themselves were very helpful. So a very strong community of support around that.
Jon Herlocker: 00:15:35.645 Third, a rich and easy-to-use query API. It seemed simple, but we needed it to do the interactive analytics. We needed to have a good API to do this. Next, strong, good documentation. Trying to get up to speed, there are always subtleties about API calls and responses and all that kind of stuff. We found the InfluxDB material to be well documented. And I guess the last thing was the strong supporting ecosystem. Right? So the whole TICK stack, particularly Telegraf [in general], all the [inaudible] they had was very appealing to us. And so, I guess, these are a bunch of points where we said, “Hmm, it’s a strategic decision. It’s something we might be living with for a long time. And for a startup company, these are all points in favor of InfluxDB.” And so, based on all this, we said, “Well, let’s try it. Let’s give InfluxDB a try, and - classic startup - let’s just not spend too much time in design paralysis. Lots of points in favor. Let’s test it and see if it breaks or if there’s some use case we didn’t think about, etc.”
Jon Herlocker: 00:16:49.757 And so, I guess, Alan and I sat down to think about all of our trials in getting InfluxDB out. But then we realized there weren’t that many trials. We were like, “Well, what really did happen?” We kind of just installed it. And we had to figure out how to deploy it and all that kind of stuff and how to manage the configuration and deployment of it. I’ll talk some more about that in a later slide, but I really don’t have a lot to say. Largely, it just kind of worked. Right? And this was what we were hoping would be the case. Right? And so back to that original comment that we wanted something that would get up and going fast, that wouldn’t distract us from our core business - it seemed to nail all of that. And so Alan and I were sitting down and talking about like, “Okay. Once we figured out how to configure it and the schema” - and I’ll come back to those later in another slide - “how much maintenance does it really require?” And I think what we came up with is that occasionally we go and tweak how much RAM is associated with it. Right? And that’s pretty much it. Right? Yeah. We’ll tidy things here and there, but for the most part, [it’s working?]. And I have a note here that this is InfluxDB version 1.6 - so I know version 2 is out, and there are lots of new things in 2 with Flux and all that kind of stuff. But at the time - that was very early - 1.6 was kind of what we were based on. So we did not use Flux, etc. I’ll talk more about [those slides?]. But our experiences are largely based on Influx 1.6.
Jon Herlocker: 00:18:27.092 So let me talk a bit about our product architecture, technical architecture, how InfluxDB fits into that, how we manage deployment of InfluxDB, and all that kind of stuff. So we are a Kubernetes shop. I mentioned that we’re on Azure. We run on top of the Azure Kubernetes Service. However, our software, [as you’ll learn later?], is architected to run in other places as well. But for now, we’re running on Azure as our primary platform. And this picture shows a high-level view of our architecture. And I think the first design decision that we made was to, as much as possible, isolate customers from each other. So in our space, we’re dealing with semiconductor manufacturers, pharma manufacturers, chiller plants, that kind of stuff. And just in this classic, we call it, OT space, there’s a lot more sensitivity to data. Data can reveal things about your manufacturing yield, about quality problems you might be having. There are also security concerns, etc. And so as someone who spent a lot of time selling to IT organizations, where there were definitely security concerns, there are even more security and privacy concerns in all this OT space, industrial space. And so toward that end, we chose an architecture that isolated tenants from each other as much as possible.
Jon Herlocker: 00:20:06.047 And so in this picture here, you’ll see these kind of white boxes that say Customer1 - my mouse here. You see Customer1.Tignis.IO - tenant name, space, customer. So each of these white boxes that I’m pointing to is a collection of containers and persistent disks that represent one of our tenants, one of our customers. And so every customer has their own InfluxDB container and their own InfluxDB persistent disk. And so I think that’s the other kind of - so we have a target Kubernetes cluster that can be shared or dedicated. We have a set of containers running these tenant services, and we have a set of persistent volumes. There are some shared services like the load balancer, the Azure Load Balancer, and [all zero?]. But for the most part, talking about InfluxDB, each customer has their own.
Jon Herlocker: 00:21:00.203 And some interesting - I guess this architecture has worked out fantastically well. There are some key things about it. One is that InfluxDB runs inside of a container that is stateless. Right? The beauty of that is if we need to restart it, we can just kill a container and start a different one. Right? And all of the configuration for that InfluxDB container comes from Kubernetes Secrets and Kubernetes ConfigMaps. Right? So we separate out the config from the application from the data. And then state is provided by a block persistent disk volume. You can kind of see that over here, a managed disk as provided by Azure. And the beauty of that is that then we can use all the classic Azure services on top of that volume. You can [stack on top of that?]. You can copy it. You can do all that kind of great stuff. And so this architecture has worked out super, super well for us. There are other containers - the MongoDB container for metadata, the containers running our AI algorithms, the containers running the UI. There can be multiple containers per customer per service. All those things are true.
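The stateless pattern described here - all config injected from Secrets/ConfigMaps, all state on a mounted volume - can be sketched in a few lines of Python reading the environment the way a Kubernetes pod would. The variable names are hypothetical, not Tignis’ actual configuration keys:

```python
import os

def influx_settings(env=os.environ):
    """Assemble InfluxDB connection settings the way a stateless
    container would: everything comes from the environment (mounted
    ConfigMaps and Secrets), nothing from local files or local state,
    so the container can be killed and replaced at any time."""
    return {
        "host": env.get("INFLUXDB_HOST", "localhost"),
        "port": int(env.get("INFLUXDB_PORT", "8086")),
        "database": env["INFLUXDB_DB"],        # required, from a ConfigMap
        "username": env["INFLUXDB_USER"],      # required, from a Secret
        "password": env["INFLUXDB_PASSWORD"],  # required, from a Secret
    }

# Hypothetical injected environment for one tenant:
settings = influx_settings({
    "INFLUXDB_DB": "tenant_metrics",
    "INFLUXDB_USER": "svc",
    "INFLUXDB_PASSWORD": "s3cret",
})
```

Because nothing here depends on the container’s own disk, the data survives on the persistent volume while the container itself stays disposable.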
Jon Herlocker: 00:22:14.260 You will notice that there is a metric API cluster of containers which sits in front of the Influx cluster of containers. And so the metric API is the application-level API that we expose to our user experience and also to our customers. And so for that, it was important to do very application-specific authentication. In some cases, we needed to query our application metadata interface to look - [were people coming?] in the metric? We may need to query our [graph?] API that supports our digital twin. What are the entities? How are they connected to each other? What are the properties of those things? And so we felt putting a service in front of InfluxDB was important for a variety of reasons. I guess we manage all of this using Helm. And if you have any questions about the details of Helm and Kubernetes and all that stuff, we’ve got Alan on the phone. So please throw them in there, and Alan can answer those questions.
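A rough sketch of the application-level checks such a metric API might run before forwarding a query to a tenant’s InfluxDB instance. The membership model and every name here are hypothetical, purely to illustrate why a service sits in front of the database:

```python
class MetricAPIError(Exception):
    """Raised when a metric query fails application-level checks."""

def authorize_query(user, tenant, sensors, memberships, twin_sensors):
    """Gatekeep a metric query: the user must belong to the tenant,
    and every requested sensor must exist in that tenant's digital-twin
    metadata, before anything reaches the tenant's InfluxDB."""
    if tenant not in memberships.get(user, set()):
        raise MetricAPIError(f"{user} has no access to tenant {tenant}")
    unknown = [s for s in sensors if s not in twin_sensors.get(tenant, set())]
    if unknown:
        raise MetricAPIError(f"unknown sensors for {tenant}: {unknown}")
    return True

# Hypothetical metadata:
memberships = {"alice": {"customer1"}}        # user -> tenants
twin_sensors = {"customer1": {"pump_rpm"}}    # tenant -> known sensor ids

authorize_query("alice", "customer1", ["pump_rpm"], memberships, twin_sensors)
```

Centralizing these checks in one service keeps the per-tenant databases simple and the authorization logic in one auditable place.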
Jon Herlocker: 00:23:28.557 Part of the beauty of our configuration that we’ve designed is we can deploy a new tenant in like five minutes. Right? And so even though they’re an isolated, separate piece of infrastructure, everything is so automated that basically, if we need to spin up a new customer, we create a new Helm release from a template. Right? So fundamentally, you copy this file that you see kind of in front of you. You make a new copy of it. You do a little search and replace to change the tenant name. And then you basically commit that new file to the git [repro?], or at least create a pull request for that git file. When that pull request is approved, it runs all this automated verification - actually, all the automated verification runs on it first. If it all passes, then the git pull request can be approved. And then FluxCD, Flux Continuous Deployment, will automatically create all the right containers, [pre-purchase?] some disks. Everything will get mounted, booted and all that kind of stuff, and so almost no human intervention besides sort of creating that initial specification of what the tenant looks like and then which kind of services it [does?].
Jon Herlocker: 00:24:46.995 There are other nice pieces of this too, where once we can actually - when the engineering team makes a new change and a pull request is approved, that change, by default, is automatically rolled out to all of our tenants. Right? So anyway, this is the way we deploy our [current?] application, but it includes InfluxDB. Right? So if we want to update InfluxDB, we just change [all these?] files, and then it updates InfluxDB. It replaces the app container of the old version with the app container of the new version. And for the most part, things just keep working. So we’re super proud, I guess, of this infrastructure configuration, and it works quite well. Alan designed it all, I should say - credit to the man. So that hopefully gives you an idea of how InfluxDB and our architecture design helped us meet some of these goals of getting up to speed really fast and not distracting us - I guess, enabling agility going forward.
Jon Herlocker: 00:25:51.983 So the other thing I want to talk about is - we talked about one of the key things was to have an awesome kind of interactive user experience. And this is fundamentally important because if there’s an incident in one of these plants, the more interactive, the more responsive the diagnosis user interface, the less downtime people are going to have. They’re going to find the problems faster. They’re going to get the insights faster. They’re going to solve the problem faster. They’re going to have less downtime. They’re going to have more efficient - save money. And so, literally, faster UI queries lead to sort of money savings in the end. And so we need to [inaudible] interactive analytics over this historical query. Now, the interesting challenge here is that the queries are ad hoc. A user could request any time range. They could request any building, any sort of sensors. We don’t necessarily know until the query happens which - so it’s sort of very ad hoc. And so that was kind of one challenge here.
Jon Herlocker: 00:26:57.003 Another challenge is that we have massive amounts of data, but you kind of need to downsample that data for transmission and display to the user experience. So even if I’ve got a million data points, if you’re asking for, “Show me the energy utilization over a one-year period,” or let’s say a three-year period, that could be 100,000 points. It doesn’t make sense to download 100,000 points to the browser and display 100,000 points. First of all, it takes a long time to download the data. Second, the resolution at some point gets lost. And so regardless of what time range you ask for, we only really need 1,000 data points to summarize that period of time. And so there’s this need to downsample or filter across the points. So if you ask for three years, you’ve got to figure out, “Okay. What 1,000 values do we need to return that best represent that period of time that you asked for?”
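The bucketing idea - any requested range collapses to roughly 1,000 representative values - can be sketched client-side in a few lines. This is only an illustration of the reduction; Tignis pushes the actual computation down into InfluxDB rather than doing it in application code:

```python
def downsample(points, max_points=1000):
    """Reduce a series to at most max_points values by averaging
    fixed-size buckets. A naive mean is used here; other reducers
    (min/max, first/last) are equally valid choices."""
    if len(points) <= max_points:
        return list(points)
    bucket = -(-len(points) // max_points)  # ceiling division
    return [
        sum(points[i:i + bucket]) / len(points[i:i + bucket])
        for i in range(0, len(points), bucket)
    ]

# 100,000 raw readings collapse to 1,000 bucket means:
small = downsample(list(range(100_000)))
len(small)  # -> 1000
```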
Jon Herlocker: 00:27:50.110 And the beauty is that InfluxDB with InfluxQL handled all this, [and the direct?] design really handled this part. You really can just say, “Here’s the time range. Here’s the interval size,” and then you can configure some of the properties about how InfluxDB does the computation to come up with those 1,000 points. But it basically just does it for you. Right? And it does it in a very [inaudible] interactive way that allows us to render these queries in subsecond time. And yeah, we’re going to try a super-fast download here. So we’ll give this a try [inaudible] [I’m going to end this show?] just to kind of show you what this looks like - [inaudible]. So this is basically our user experience. Over here, you can render charts and data about the system. And over here, you can navigate the digital twin of the system. And so you can see all the colorful charts that get rendered, etc. And then I won’t spend a lot of time doing it, but here I can render all sorts of additional data. You can see this is literally rendering as I scroll. Right? As I scroll down, it will pull a whole [mass of?] different sensors and thousands of data points. And so this is not even preloaded. Literally, it’s rendering as I scroll. Right? And so you can browse it very efficiently, etc.
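The “here’s the time range, here’s the interval size” query maps onto InfluxQL’s `GROUP BY time()`. A minimal sketch of building such a query, choosing the interval so roughly 1,000 buckets cover the requested range - the field and measurement names are assumptions, not Tignis’ schema:

```python
def downsample_query(field, measurement, start_ns, end_ns, target_points=1000):
    """Build an InfluxQL (1.x) query that lets InfluxDB do the
    downsampling server-side: MEAN over GROUP BY time() buckets sized
    so that about target_points buckets span the requested range."""
    span_s = (end_ns - start_ns) // 1_000_000_000
    interval_s = max(1, span_s // target_points)  # never below 1s
    return (
        f"SELECT MEAN({field}) FROM {measurement} "
        f"WHERE time >= {start_ns} AND time < {end_ns} "
        f"GROUP BY time({interval_s}s) fill(none)"
    )

# One hour of data (timestamps in nanoseconds) -> ~1000 buckets of 3s:
q = downsample_query("value", "sensor_reading", 0, 3_600_000_000_000)
```

Because InfluxDB returns only the bucket aggregates, the browser never sees more points than it can usefully render, which is what makes the subsecond interactivity possible.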
Jon Herlocker: 00:29:25.736 Just kind of another fun little demo, this sort of shows the digital twin. And you can kind of see the - you can see the sensor values overlaid on the different parts of this schematic. And in fact, we have this cool feature that allows you to actually basically travel in time. Right? So as I slide this back and forth, you can sort of see all the times update across all the different sensors across [all of it?]. So actually, behind the scenes, we’re making a whole mass of InfluxDB queries to pull up the data to fill in the things for that particular point in time. Right? And you can kind of see valves opening and closing. So I guess this was trivial, honestly, for us, given - so the performance characteristics that we were experiencing from InfluxDB. So let’s go back. That was a very, very simple and quick demo of sort of showing that. Back to here. Cool.
Jon Herlocker: 00:30:32.822 So let me talk about - some of the things I talked about were our requirements. We said, “Well, we wanted to create an interactive experience.” I kind of demonstrated that to you. The other one was we didn’t want anything to constrain our growth. Right? We wanted a platform that didn’t fall over when it reached a certain size, in particular. It’s one of the [ways we think about constraint?]. So the architecture we chose and the characteristics of our customers really led to InfluxDB being a good solution. And so, for example, our scale mostly comes in the number of customers that we have. Right? You have to think about plants. And because our architecture was every customer gets their own containers and storage, it’s an easy scale-up story. Right? It’s not like you’re running into contention. If you sign a new customer, you deploy a new container and a new persistent disk, so basically you’re introducing more IOPS and disk storage, and you’re using more compute time. And there’s almost no contention between customers as we scale. And so that was a great scaling story. And we were able to do that because we had this amazingly great configuration, deployment, [and automated?] [inaudible] - super trivial to deploy a whole new set of containers in support of a customer. So our architecture supported this scaling step.
Jon Herlocker: 00:32:00.379 Scaling within customers was actually less important to us because for the most part, at least at cloud scale, the amount of data coming from a single plant is not the prime concern. Those tend to be bottlenecked on the network between the plant and the cloud. And so the amount of data that they send to us is not like a million points a second. It’s sometimes one a minute or one every five minutes or [inaudible] is made to [inaudible] something more frequent than that. So the point is that for the most part, it wasn’t really a challenge [to scale?] within a customer. But if we needed to do it, it’s largely just a matter of increasing the amount of RAM. And because we’re in the cloud, we can dynamically change RAM to be whatever we need it to be. And in today’s modern cloud environments, since you can build a machine with [mongo?] amounts of RAM, [we’re?] not even vaguely close to the limits here. So scaling within a customer is straightforward. Scaling across [inaudible]. And so it hasn’t really - so there are no constraints to growth from the scale perspective, and scale largely hasn’t been an issue.
Jon Herlocker: 00:33:14.408 One exception, which I’m going to talk about at the end of this call - in terms of not preventing our growth, one of the things we were worried about was cloud platform lock-in. When we started, we had to make a decision. Were we going to build this on Azure? On AWS? On Google Cloud? On our own cloud? And it’s a hard decision. Right? Because you may end up having to go where the customers are, and maybe it turns out your customers are more on one platform than another. In our case, we chose Microsoft because, within the industrial space, Microsoft is, at the end of the day, the more visible partner, I guess. So we initially went with Microsoft, but we didn’t want to be locked into them. If we found at some point that their platform wasn’t sufficiently stable for us, that their pricing wasn’t sufficiently friendly for us, or that we had a customer that said, “We’re going to pay you obscene amounts of money, but only if you move it to a different cloud,” we wanted to have that flexibility.
Jon Herlocker: 00:34:27.976 A lot of startup companies will make bets on the native services of AWS or Azure - right? - their proprietary databases. We didn’t want to do that. Just as an example, both AWS and Azure have their own proprietary time series databases, and we explicitly chose not to use those. By choosing InfluxDB instead, we got something that works on Azure, works on AWS, works on Google, and that we can run on-premises if we need to. For us, that gave us a feeling of control. It’s currently still primarily Azure, but we have the ability to move if we need to - right? - if any of those situations becomes an issue. And like I said, there are recipes for deploying this everywhere else, whether it’s Google, Amazon, or a customer who wants to run it in their own Kubernetes cluster [inaudible]. Right?
Jon Herlocker: 00:35:36.689 One of the things we didn’t quite appreciate until we actually did it was the second bullet - right? - which is that we did get to fiddle a little bit with the proprietary time series databases on Azure, for example. And one of the things we realized is that you can’t spin up your own instance of them, like in a CI/CD cluster or on your laptop. So you literally cannot do an end-to-end test on a laptop or a Jenkins machine or something when you’re using these cloud-only services. Because we run InfluxDB on top of Kubernetes, you can literally just launch a Kubernetes cluster on your laptop, run our entire stack, including our time series database, and then run tests end to end. Right? And this was critical in terms of agility and the ability to move quickly - we’re more willing to do extensive testing, I guess. I also think it increases the speed at which you can resolve issues, particularly system-wide issues. So it’s a side effect we didn’t think about until it came, and then we were like, “Wow, that was helpful.”
Jon Herlocker: 00:36:58.899 I do want to talk a bit about our experiences with the rest of the TICK stack because, as I mentioned, one of the reasons we chose Influx in the first place was that it had very strong supporting software, and in particular, Telegraf. Right? I had used Telegraf previously, and I was amazed at the breadth of connections it had. So the interesting thing is, it turns out we basically haven’t used Telegraf, and it’s for the following reason. All of the plugins are for IT things. Right? And the things that we’re monitoring are on the industrial side. It’s almost like there’s this entire parallel world of technology on the industrial side - they have their own computers, their own operating systems, their own network protocols, their own databases. There’s some amount of convergence - what people call the convergence of IT and OT is starting to happen - but in a lot of the preexisting installed plants, the stuff is just different. Right? So, for the most part, we haven’t found those off-the-shelf integrations to be available for the things that we’re talking to. So Telegraf hasn’t been as useful as we would have thought.
Jon Herlocker: 00:38:13.766 Chronograf? We used Chronograf a lot in the early days, just to make sure our system was working, to do some of the early analysis, etc. But as you saw from our product, our product is, in essence, a superset of what Chronograf does now. Right? We don’t allow customers to put in arbitrary InfluxQL queries, but they can explore the data, summarize it, do analytics on it, etc. So at some point, our product got good enough that we stopped using Chronograf, but in the early days, it was super useful. And likewise, we didn’t use Kapacitor because it largely overlapped with the capabilities of our core product, which is applying analytics to streaming data and then taking action based on that data. We had our own core code to do that, so there wasn’t really a use for it. But anyway, that’s our experience with the TICK stack. It wouldn’t surprise me if we start to use Telegraf for certain use cases moving forward, but its acceleration capabilities ended up not being so relevant to us. However, we are still super happy with InfluxDB itself.
Jon Herlocker: 00:39:24.483 So let me talk a bit about some of the challenges we ran into. I mentioned that getting InfluxDB started kind of just happened - we didn’t worry about it. Getting it up and running, configured, accepting data, and querying data - that stuff was the easy part. The hardest part of getting started was figuring out how to design the schema. If you’re an InfluxDB person, you know about measurements, fields, tags, and so on. And the problem we ran into was that with our early customers, we didn’t really know what data they were sending us. Right? This is actually a fundamental problem with a lot of industrial systems: you plug into a system that collects data, and the time series that are there have very obscure names, such that you need to find the person who created them and ask exactly which sensor this is measuring on which asset. There isn’t necessarily good metadata about the sensors, where they’re attached, and all that.
Jon Herlocker: 00:40:36.296 So when we looked at designing schemas, there’s a recommended method for how to design a schema, how to choose what a measurement is, and how to choose what a field is. And we found it almost impossible to apply because we didn’t really know what this data was. Right? We knew it was a time series and it came from that building, and that was about it. So we chose a structure that was optimized for just getting the data in so we could query it and build a product - and then later, we’d figure out exactly what all those time series were - as opposed to one optimized for high-performance queries or aggregations. I don’t think we’re sad that we chose this approach, although we’re now saying, “Boy, now that we understand the data a bit more, it would be nice if we could do a migration across it.” So our approach was to map every building to a measurement, because that was the one thing we knew: all the data is coming from the same building or the same plant. That’s for sure. Right?
Jon Herlocker: 00:41:43.920 And then for the fields - well, we knew these were different sensors. So we could say, “Well, that’s a sensor that has this weird name, and that’s a different [inaudible]. I don’t know what they are yet, but I know they’re different.” Right? So we literally created a separate field for each sensor initially. And to date, this approach has held up, even if sometimes we’re a little embarrassed by it. It got us through those early days, and it removed the friction of having to fully understand the data before getting it into the system. The downside is that we can’t use InfluxDB to do aggregations across sensors easily; we sometimes need to do that outside of InfluxDB. And we do now need a separate database, a MongoDB structure, to track the sensor metadata - to track that this sensor is measuring this thing, it’s attached to that device, that device is part of that bigger device, and so on. But we have so much richness in our metadata about sensors and devices that we would have had to do that anyway. Right? And that’s part of the reason why we’re not too sad about it.
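The schema Jon describes - one measurement per building, one field per opaque sensor name - can be sketched in InfluxDB line protocol. This is an illustrative sketch only: the building and sensor names below are hypothetical, not Tignis’ actual data, and the escaping covers just the common special characters.

```python
def to_line_protocol(building, readings, ts_ns):
    """Encode one batch of sensor readings as InfluxDB line protocol,
    using the measurement-per-building, field-per-sensor schema
    described above. `readings` maps opaque sensor names to values."""
    def esc(s):
        # Escape characters that are special in line-protocol identifiers.
        return s.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")
    fields = ",".join(f"{esc(name)}={value}"
                      for name, value in sorted(readings.items()))
    return f"{esc(building)} {fields} {ts_ns}"

# One write for a hypothetical building with two opaque sensor names:
line = to_line_protocol("plant_07", {"AHU1.SAT": 18.5, "CHWR-02": 11.2},
                        1548057600000000000)
```

The point of the sketch is that nothing about the data needs to be understood up front: the measurement is just the building, and each field key is whatever name the site’s historian used.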
Jon Herlocker: 00:43:05.549 The other main challenge I’ll mention is the one performance challenge we ran into, which is that we basically have two usage patterns for our time series data. One is the interactive analytics I showed you, which touches small amounts of usually recent data, where we want a very predictable, fast response time. Then we have the model training. And the model training is like, “Give me all data on this asset since the beginning of time, every single point, regardless of the resolution.” What we found is that the learning models would kick in, query InfluxDB, and basically flush out the in-memory caches, which would then make the interactive analytics not so predictably fast. And furthermore, because they were basically just saying, “Give me all data,” we weren’t really benefiting from a lot of the value that InfluxDB provided in that case. Right?
Jon Herlocker: 00:44:03.348 So looking forward - InfluxDB has been great for us, but there are two things we’re focusing on right now. One is providing a separate data access path for the batch analytics that’s really optimized for big continuous reads. That may be as simple as reading files off a disk; you don’t necessarily need anything like InfluxDB for it. The other thing we would like to do is automate the migration of old data to cold storage, i.e., lower-cost storage. We’re a startup, so we haven’t had much time to accumulate that much data. But as we accumulate more, people tend not to query the old data very much, so it perhaps makes sense to move it out. So that’s our time for today. Hopefully we gave you a bit about Tignis: what we were looking for, why InfluxDB was selected, some of the benefits and wins with InfluxDB, a couple of the challenges we ran into, some of the architecture and the cool things we did to automate InfluxDB, and some of what we’re looking forward to. I’m happy to take some questions now.
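The split between the two access patterns can be sketched as a simple query router. The one-week cutoff and the backend names here are illustrative assumptions, not Tignis’ actual policy.

```python
# Hypothetical cutoff: reads spanning more than this go to the bulk path.
BULK_CUTOFF_SECONDS = 7 * 24 * 3600

def choose_backend(span_seconds, full_history=False):
    """Route a read: small, recent interactive queries stay on InfluxDB,
    while 'all data since the beginning of time' model-training reads go
    to a separate bulk path (e.g. files on disk), so they don't evict
    InfluxDB's in-memory caches and hurt interactive latency."""
    if full_history or span_seconds > BULK_CUTOFF_SECONDS:
        return "bulk-files"
    return "influxdb"
```

With a router like this, a training job asking for everything never touches the interactive store, which is the isolation the talk describes wanting.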
Caitlin Croft: 00:45:19.004 Thank you, Jon. That was great. I’ve heard you give this before, and I always learn more every time I hear these presentations. So we’ve gotten a few questions already. I think you’ve covered some of them, but I’ll read through them anyway in case you want to add any more insight. Do you use Kapacitor or any other tool to perform analysis on the stream of data points, or is it mostly batch analysis?
Jon Herlocker: 00:45:48.699 Yeah. So one of the core pieces of Tignis from the early days was a stream processing system. Right? So we never had any reason to use Kapacitor for that kind of thing at any point in time. We just never even went there, because we already had a solution that did that - it’s part of our product, part of our process.
Caitlin Croft: 00:46:14.291 Perfect. Do you have multiple processing pipelines, reading from one measurement and writing back to another measurement?
Jon Herlocker: 00:46:24.004 I believe the answer is yes - actually, no, definitely the answer is yes. So for example, one of the things we do is build models to predict the values of certain sensors. Right? So if we look at a chiller, let’s say - a chiller has a power sensor - one of the models we’ll build is a combination of physics and machine learning to predict what we think the power utilization should be, and we emit that. So we have a processing pipeline that emits those predicted points - here’s what we think it should be - and we store that back in Influx. Customers can then visualize here’s what the actual power utilization of the chiller was, and here’s what we predict it should be. Right? And in fact, a lot of our analytics trigger when the variation between what we think it should be and what it actually is exceeds certain thresholds - then we’ll trigger alerts. But we put all of that back into InfluxDB so that we can do later-stage analytics. It allows us to modularize the compute. Right? We can separate the module that computes the predicted values from the module that decides whether or not the deviation is big enough to care about. So you can think of InfluxDB as the blackboard: everyone writes their shared knowledge back to the blackboard, as they used to say, and then all these different modules can use what the previous steps have computed.
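The blackboard-style modularization Jon describes - one module emits predicted values, a separate module decides which deviations matter - can be sketched as a pure function. Here both series are plain dicts of timestamp to value, and the threshold is illustrative; in the architecture described, both series would be read back out of InfluxDB.

```python
def deviation_alerts(actual, predicted, threshold):
    """Given actual and model-predicted values for one sensor (each a
    dict of {timestamp: value}), return the timestamps where the
    deviation exceeds the alert threshold. This is the 'is it big
    enough to care?' module, decoupled from the prediction module."""
    shared = actual.keys() & predicted.keys()  # only compare aligned points
    return sorted(ts for ts in shared
                  if abs(actual[ts] - predicted[ts]) > threshold)
```

Because each module only reads and writes series, swapping in a better prediction model never touches the alerting logic.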
Caitlin Croft: 00:48:16.445 Perfect. And what kind of metadata are you using for the data points?
Jon Herlocker: 00:48:24.035 Boy, that’s a lot. Let me answer that in two ways. There’s a lot of metadata associated with the sensor. Right? A sensor generates a stream of points, and there’s a ton of metadata about that sensor: which asset it’s attached to, what it’s measuring, what units it has. In many cases, you might have multiple temperature sensors on a machine measuring the same thing, and which one of those is the primary trusted one versus not. So there’s a whole bunch of metadata on the sensors themselves. On the individual points, I would say not so much. The points are largely just measurements. Alan, you can correct me if I missed something there, but I don’t think we have any metadata on the individual points themselves.
Alan J Castonguay: 00:49:40.709 So the metadata inside of InfluxDB is fairly light. We have the measurement per building, like you identified earlier, and then we use a field per sensor, keyed on whatever the original sensor name was. But we didn’t make good use of tags to provide any additional metadata. What we found was that the original sensor names we were coming across were opaque. So we just pulled them in as is, without trying to infer structure, so that we could graph them and then figure it out. That way we could hand the data in InfluxDB - like a decade’s worth of it - over to data scientists to dig into. But restructuring it later by adding tags proved to be a bit prohibitive, so we did not. We put all the metadata into Mongo.
Caitlin Croft: 00:50:30.192 Perfect. Do you use Kafka in addition to InfluxDB or any other critical technologies?
Alan J Castonguay: 00:50:38.396 I’ll take that one. So, not yet. We are looking at using Kafka for taking in data from the customer sites. What we found with the first couple of customers is that we had to do a lot of bespoke integration development. People had old historian systems. People had data in odd formats. There was some amount of sanitization that had to happen, or relabeling on the fly. Little pieces of metadata here and there were just wrong. Say the time zones on the timestamps are off - or the time zones are correct, but all the timestamps are slewed by 30 minutes. Some sensors have flipped names. Things like this made it rather annoying. So we built some custom integrations to sanitize that stuff, restructure it, deal with formatting - parsing the original data and putting it into the form we wanted for the inserts into InfluxDB and MongoDB. As a result, when we started, we weren’t in a position to have a standard API for getting that data in. What we’re looking at doing is taking those integrations and changing their write path to write to Kafka instead of writing directly into InfluxDB - both so that we have some separation of concerns, letting us swap out the back-end database or write to different back-end databases, like two instances of InfluxDB, and so that we could buffer some data along the way. But that transition has not occurred yet. It’s still in flight.
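A minimal sketch of that sanitize-then-publish step, assuming one known site defect (timestamps slewed by a fixed number of minutes, as in Alan’s example). The message field names are hypothetical; in the planned design the encoded bytes would be handed to a Kafka producer, and a downstream consumer would replay them into InfluxDB and MongoDB.

```python
import json
from datetime import datetime, timedelta, timezone

def sanitize_and_encode(building, sensor, value, ts, slew_minutes=0):
    """Correct a known per-site timestamp slew, then serialize the
    cleaned reading as a message body suitable for a Kafka topic.
    Keeping this step producer-side decouples the bespoke site fixes
    from whichever back-end databases consume the topic."""
    ts = ts - timedelta(minutes=slew_minutes)  # undo the site's clock offset
    payload = {
        "building": building,
        "sensor": sensor,
        "value": value,
        "time": ts.astimezone(timezone.utc).isoformat(),
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")
```

The separation-of-concerns benefit Alan mentions falls out naturally: two InfluxDB instances, or a different store entirely, can consume the same topic without touching the integrations.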
Caitlin Croft: 00:52:15.628 Perfect. [Espen?], let me know if you have any more questions. And if you do after the webinar, please feel free to email me, and I’m happy to connect you with Jon and Alan. A couple more questions here. How easy is it to migrate from a data historian like OSI PI?
Jon Herlocker: 00:52:37.207 I guess the answer is it really depends. There are so many dimensions to that answer, I don’t even know where to start. The biggest problem in migrating the data is actually the human part of it - right? - which is that in many cases, an OSIsoft historian doesn’t necessarily have all the metadata we need in order to do what we need to do. In particular, we need to understand what these sensors are, what they’re measuring, and which assets they’re associated with. Sometimes some or all of that can be extracted from OSIsoft. But what is almost always missing is the connectivity between assets. Right? We need to know that this chiller’s chilled water return pipe is connected to that pump over there, for example, and that it’s connected to that port on that pump, and that these sensors on the chiller are measuring the flow coming in through that port on that pipe. That level of detail is just not there for the most part.
Jon Herlocker: 00:54:14.172 So our approach involves human intervention at onboarding time. Inevitably, what we end up doing is exporting some sample of data from OSIsoft, something like a CSV file. Then we have a combination of human and automated tools that manipulate that CSV to add the additional information that we were not able to export from OSIsoft. Also, people’s schemas in OSIsoft often differ. It’s actually one of the biggest challenges of getting us up and running: adding that semantic information to the sensor data as it comes out. I mentioned that we designed the schema early on saying, “Well, I don’t know anything about this data. I’ll just get it in.” Right? And we’ve slowly shifted to, “No, now we really need to get a deeper understanding of this data before we put it in the system, in order to do all the other things we need to do.” But as I said, right now we haven’t found a better solution than taking samples of data, exporting them to Excel, and having transformations that produce multiple spreadsheets - one that specifies the sensors, one that specifies the assets. In those spreadsheets, there’s often a human step where someone comes in and says, “This is what this asset is, and this is how it’s connected to that.” At that point, we can import it and go. So let me stop there. There’s a lot one could say about what we call point mapping, but it’s a lot of work and challenging.
Caitlin Croft: 00:55:57.052 Perfect. What kind of retention policy do you provide the customer, and is it customizable?
Jon Herlocker: 00:56:07.391 It’s highly customizable at this stage. Right? Alan, do you want to maybe talk briefly about that? [crosstalk]
Alan J Castonguay: 00:56:17.324 So our default assumption is that the retention policy for the customer’s data will be forever. We’re not going to, say, keep the data for a year and then throw it away. They’re paying us to keep the data and make it available for query, so we’re going to keep doing so. Data at rest is cheap - disk is cheap - so there’s no problem with doing that. Data retention is done at the building level. So if we had a customer that said, “Hey, this building is now a secret and you need to get rid of the data for it,” then we can just drop the measurement and get that data off the disk. If the whole customer were going away, or they decided that they themselves were a secret and we needed to forget they existed, then we could drop a managed disk in its entirety. But in terms of automatically pruning old data, we don’t do any of that at this point. The data is retained indefinitely. We may at some point need to change that, but I don’t think it will be within the next decade, because disks are not very expensive.
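Because each building maps to its own measurement, per-building deletion reduces to a single InfluxQL statement. A helper to build that statement might look like this - the quoting is a sketch, and the building name is hypothetical.

```python
def drop_building_statement(building):
    """Build the InfluxQL that removes all data for one building.
    With the measurement-per-building schema, deleting a building's
    data is just dropping its measurement."""
    # Double-quote the identifier and escape any embedded double quotes.
    return 'DROP MEASUREMENT "{}"'.format(building.replace('"', '\\"'))
```

Whole-customer deletion, as Alan notes, doesn’t even need a query: since each customer has their own containers and disk, it’s handled by dropping the managed disk.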
Caitlin Croft: 00:57:28.043 Great. Okay. So we’ll take one more question. Do you ever use a caching layer in front of InfluxDB?
Alan J Castonguay: 00:57:37.553 No - that’s the simplest way of saying it. The front-end APIs for dealing with UI queries just query right through; they don’t hit Memcached or Redis or anything like that along the way. We are exploring doing that for some pieces of the data - like data that arrived within the last hour - in particular for dashboard use cases that continually refresh very recent data. But at present, we found most of the search patterns are really searching for things like the last day, and not doing it super frequently. If at some point higher-resolution data becomes the more common case - less than one minute between points, and a whole lot of them - and we really want to support dashboarding use cases where someone keeps the UI open continually, and we have lots of customers doing that, then caching is going to become a lot more important. But at present, we didn’t bother overoptimizing for that path, as most of our query load didn’t match that pattern.
Caitlin Croft: 00:58:46.629 Perfect. Well, thank you very much. If anyone has any more questions, please feel free to email me, and I will forward them on to Jon and Alan. Thank you so much for joining today’s webinar. Just another friendly reminder: we have InfluxDays London coming up in June, and there’s a promotional code down at the bottom. We will have a Flux training at InfluxDays, so it will be a really awesome event. The slides will not be made available; however, the webinar has been recorded. Once I clean it up a little, it’ll be available on our website - just go to the registration page tomorrow, and you’ll be able to find the recording. Thank you very much, everyone, for joining today. I hope you have a good rest of your day.
Jon Herlocker: 00:59:44.503 Thank you very much. Thanks, Caitlin.
Alan J Castonguay: 00:59:46.096 Thanks for having us, Caitlin.
[/et_pb_toggle]
Jon Herlocker
President and CEO, Tignis
Jon is a deep technologist and experienced executive in both on-premises enterprise software and consumer SaaS businesses. In his prior leadership roles, he was Vice President and CTO of VMware's Cloud Management Business Unit, which generated $1.2B/year for VMware. Other positions include CTO of Mozy, and CTO of EMC's Cloud Services division. As a co-founder of Tignis, Jon is an experienced entrepreneur, having founded two other startup companies. He sold his last startup, Smart Desktop, to Pi Corporation in 2006. Jon is a former tenured professor of Computer Science at Oregon State University, and his highly-cited academic research work was awarded the prestigious 2010 ACM Software System Award for contributions to the field of recommendation systems. Jon holds a Ph.D. in Computer Science from the University of Minnesota, and a B.S. in Mathematics and Computer Science from Lewis and Clark College.