How to Manage TensorFlow with InfluxDB
In this webinar, Chris Goller & Michael DeSa will show you how to manage your data in TensorFlow with InfluxDB.
Watch the Webinar
Watch the webinar “How to manage TensorFlow with InfluxDB” by clicking on the download button on the right. This will open the recording.
[et_pb_toggle _builder_version="3.17.6" title="Transcript" title_font_size="26" border_width_all="0px" border_width_bottom="1px" module_class="transcript-toggle" closed_toggle_background_color="rgba(255,255,255,0)"]
Here is an unedited transcript of the webinar “How to manage TensorFlow with InfluxDB.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
Chris Churilo: Director of Product Marketing, InfluxData
Michael Desa: Software Engineer, InfluxData
Chris Goller: Architect, InfluxData
Chris Churilo 00:03.336 Once again, good morning, everybody, and thank you for joining us for our webinar today on how to manage TensorFlow. We’ve got two great speakers joining us today, and they will be going over how to manage TensorFlow with the use of InfluxData. We have our architect, Chris Goller, joining us, as well as one of our key engineers, Michael Desa. Some of you might have met him in one of our training programs. Both are very well-versed in InfluxData and TensorFlow, so it should be a really great session. I am recording this session, so you can take a look at this webinar again later on. And if you do have any questions at any time during this webinar, please feel free to put them into the Q&A panel. And with that, I’m going to hand the ball over to Michael and we will get started.
Michael Desa 00:57.625 Let me unmute myself. So today’s lecture is going to be about working with InfluxDB data in TensorFlow. There’s a GitHub repo with the slides and all the code examples that we’ll have today. And just to start off, I’d like to have Chris Goller introduce himself. Chris?
Chris Goller 01:24.444 Yeah. Sure. Thanks, Michael. I’ve worked with machine learning and modeling for sensor data for about 10 years in robotics. And now I’m working with InfluxData on some of the data engineering principles that will make some of these models far better and far easier to deploy. My background is actually in chemistry, and my research was originally in distributed modeling of catalytic systems. Michael?
Michael Desa 01:56.304 Awesome. Thank you. So my name’s Michael Desa. Many of you may have met me before; I do a lot of our training here. In the past, I hated running, but I signed up for a marathon anyway. I ended up not completing it: I got about halfway through my training and then kind of epically injured myself. In college, I studied math. In other words, I still hate running, and I very much enjoy biking. As Chris mentioned, I fill a number of roles here at InfluxData: I’ve worked on the platform team, I’m now working on the application team, and I’ve done training and support, so I’ve been all over the place at InfluxData. So what is the point of this presentation? To give you a little bit of how we view the world, and how we see the world changing: we’re increasingly able to monitor everything in our lives, our fitness level, the shower head, containers, electricity consumption, all of these things. And nearly all of this data ends up being time series. In the current state of the world, you can realistically monitor maybe hundreds of millions of individual series, but scaling beyond that is very difficult. And we see a world where there will be billions of things that we will be monitoring. And once you can monitor things, the natural next step is predicting where those values will be in the future, and we’d like to bridge that monitoring-to-prediction gap. We think that building close integrations between InfluxData tools and things like TensorFlow will get us started on that work.
Michael Desa 03:35.417 So the topic for today is specifically pulling data out of InfluxDB and using that data in TensorFlow, but we also have visions for some future topics. Those would be: monitoring TensorFlow itself with InfluxDB; TensorFlow integrations with our tool Kapacitor, which is a processing and alerting tool; built-in, canned TensorFlow dashboards in Chronograf; and specific TensorFlow-plus-InfluxDB use case examples. Just to give you an idea of where we’re at, we’re very much in the infancy of exploring the relationship between InfluxData and TensorFlow, so if you have any ideas, or if there’s anything you come across that you think might be interesting, we’d love to hear from you, so please do reach out. By the end of this presentation, I’m imagining that there are people with various backgrounds: some of you may be familiar with TensorFlow but not with InfluxDB or InfluxData, and vice versa. So hopefully, by the end of this you’ll be able to describe what each of these things is at a very high level, and how one would use them. Then we’re going to make a distinction about what data engineering is, or what we consider to be data engineering, and how it’s distinct from data science. Hopefully, you’ll be able to explain that as well. Then we’ll go into how you can use InfluxDB with TensorFlow to solve this data engineering problem. And specifically, we’ll get into a little bit about how InfluxDB models time series, and how you can query data from InfluxDB. The main reason for that is just background for the example at the end of this presentation.
And then finally, we’ll go over the scope of what we’ve built so far: we’ve got a library that we’ve built, and we’ll have a presented example where we can show it to you in practice. Those are the things we’re trying to achieve with this presentation.
Michael Desa 05:45.141 So to start this all off, just to give you an idea of what InfluxDB is, we’re going to start with what time series data is. Hopefully, most of you know, but time series is a sequence of data points, specifically consisting of successive measurements made from the same source over a time interval.
And what this means is if you were to take some data and plot it on a graph somewhere, one of your axes is always time. So just to give you a little visual representation of what something like that looks like, we can see here monitoring of things like the temperature, dew point, and barometer values for something over time. On the X-axis we have time; on the Y-axis we have some value: it’s time series data. So InfluxDB is a time series database. All of the data that goes into InfluxDB is time series. We provide a SQL-like query language. It has no external dependencies, and it’s horizontally scalable, so if you need to scale up in production, you can do so. It’s maintained by InfluxData. Who is InfluxData? We’re the company behind Telegraf, InfluxDB, Chronograf, and Kapacitor, all of which together we call the TICK Stack. We’re on a mission to be the platform for time series data. We really want to own the space, so to speak. And we’re hiring if you’re interested. So what is TensorFlow? TensorFlow is a numerical library for doing computation on data flow graphs. It gives the user a platform to describe very high-level modeling without having to worry about the underlying implementation, whether that’s the underlying implementation of a model or the hardware that it runs on. The product itself is built and maintained by Google, and it’s quickly becoming the standard for machine learning and data science. And that’s why we picked it over some of the other options out there like Theano or Torch.
Michael Desa 07:46.203 So there’s one place where we see TensorFlow hitting a barrier, a gap that exists in TensorFlow. TensorFlow provides a standard data format, but often the data that you want to use isn’t in that format, and you end up spending a lot of engineering hours just translating format X into format Y. We see this as a real problem: getting real-world data into that format is, again, time-consuming. TensorFlow also has primitives for queuing up data, but integrating those with a real-world data pipeline is not always the most straightforward thing to do. These problems fall into a class of things that we call data engineering. Data engineering predominantly concerns itself with managing models at scale and integrating models with existing data sources. It abstracts certain things appropriately so that you can reuse one model for another use case. And it handles the deployment of models from testing and training into acceptance and production systems.
Michael Desa 08:53.803 So we think that InfluxDB can help bridge this data engineering gap. Our hypothesis is that if we can lower the barrier to entry for getting data out of InfluxDB and into TensorFlow, we can really leverage all of the tools that TensorFlow gives us. We want to give people the ability to build things and apply models to different data as easily as possible. We think that InfluxDB can help manage model data at scale, because it’s a database: it’s designed to manage data at scale. And it can act as a buffer for ingestion of real-time data, so you don’t have to worry about your data pipeline getting clogged, or about any buffering issues that may come along with that. Specifically, we think it’ll be helpful for doing things with time series because InfluxDB is optimized for time series: your on-disk representation will be substantially smaller than in a traditional data store, whether that be a SQL database or something else, and it will most definitely be better than text files. As for the reusability of the system, InfluxDB has a simple SQL-like query language, which gives us low cognitive overhead for putting new time series data into the database, and data streams are just InfluxQL queries, so you can very easily reuse the same model by simply adjusting a query and working from there. Applying a model to new data streams is as simple as modifying a query. And then there’s the transition from models into production. InfluxDB allows for both historical and real-time data, so the transition from training to testing to production will all have the exact same query semantics. That’s an important point we want to drive home here: the process you go through for training, testing, and production will all be self-contained, and there’s very little cognitive overhead for achieving all of that.
So handling both historical and real-time data sets will work seamlessly. And as a general fact, pulling data from the same store always lowers friction when moving into a production environment.
Michael Desa 11:25.415 Before we get into what we built, I want to give a little background for those of you who aren’t familiar with the InfluxDB data model, just so that when we go through our example we have a common vocabulary. To do that, I’m going to start with a typical time series graph and go over its various components. Up at the top, we have the label; in InfluxDB we call this label the measurement. It’s a high-level grouping for all the data beneath it. Off to the side, we have legend data. This is metadata about the things that are on the graph; these are called tags. Tags are indexed values in InfluxDB. The collection of all the metadata for a single legend item, say the blue circle with the A, we call the tag set: so ticker=A, market=NASDAQ would be that blue circle with the A there. The Y-axis values are the data that we operate on; we call these fields. Fields can store ints, strings, floats, or bools. The collection of all the tags is called the tag set, and the collection of all the fields is called the field set. Note that in this case there’s only one field, but it’s possible to have as many as we would like. And then, finally, we have the thing that makes this time series: the timestamp down there at the bottom. We represent points in InfluxDB textually via what is called the line protocol, which goes: measurement, comma, tag set, space, field set, space, timestamp. You can think of that front part as the identifying part, the indexed information, whereas the field set is the actual data itself. A series in InfluxDB is all the points that fall on that blue line, or that green line, or that yellow line.
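To make that concrete, here is a hypothetical line protocol point matching the stock example above (the measurement name and values are illustrative, not taken from the webinar’s dataset):

```
stock_prices,ticker=A,market=NASDAQ price=177.03 1490000000000000000
```

Here `stock_prices` is the measurement, `ticker=A,market=NASDAQ` is the tag set, `price=177.03` is the field set, and the trailing integer is the timestamp (by default, nanosecond-precision Unix time).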
We would say that all of those points belong to the same series. To codify that a little more explicitly: a measurement plus a tag set gives you an individual series, meaning all the points in that series. A measurement plus a tag set plus a timestamp gives you a single point, so you can think of a timestamp as an ID for a point within a series. As I mentioned, we have a SQL-like query language, so a basic select statement looks like this: select some field from some measurement, optionally with various other conditions. So I could say select star from cpu, or select star from cpu where busy is greater than 50, or select free from mem where host equals server1, various things like that.
Michael Desa 14:25.333 We also allow you to have select statements with relative time. A good example here would be select star from cpu where time is greater than now minus one hour, or minus ten seconds, or minus four days. So you can have queries that are relative to the current timestamp. The reason I’m talking about this is that it comes up a bit in the example we’re going to show, and I want everybody to have a common baseline for understanding what it is that we’re doing. Then, finally, you can have a select statement with absolute time, where you specify time ranges in various date formats: you can use RFC3339, epoch time in seconds, or anything that looks roughly like RFC3339. So the example down here is, “Select star from nums where time is greater than the 19th and less than the 30th of September.” And finally, we allow people to do what is called a group by time, where you apply some function to buckets of time. The example we have here is, “Select max busy from cpu where time is greater than now minus one hour, group by time 10 minutes.” What this says is: go back in time an hour, break things into 10-minute buckets, and for each of those 10-minute buckets, return the max busy value. That gives you an idea of how the query language works.
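Written out in InfluxQL, the queries just described look roughly like this (measurement and field names such as `cpu`, `busy`, and `nums` follow the spoken examples above):

```sql
-- basic selects
SELECT * FROM cpu
SELECT * FROM cpu WHERE busy > 50

-- relative time: everything from the last hour
SELECT * FROM cpu WHERE time > now() - 1h

-- absolute time, RFC3339-style timestamps
SELECT * FROM nums WHERE time > '2017-09-19' AND time < '2017-09-30'

-- group by time: max "busy" per 10-minute bucket over the last hour
SELECT max(busy) FROM cpu WHERE time > now() - 1h GROUP BY time(10m)
```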
Michael Desa 16:03.614 So now we’re back to what we’ve actually built. As an overview: we’ve built a Python library for working with InfluxDB data in TensorFlow. It provides a way for you to query InfluxDB to produce TensorFlow sequence examples, and it implements the chunking of these records so they can be streamed into TensorFlow queues. The main thing we think is useful or interesting about this, and why we thought it was a good start, is that it gives us a baseline for working with InfluxDB data in TensorFlow. In particular, it lets us have historical and real-time data all in the same database, on the same platform, which we really think is the leverage point, the important thing about this integration. And on that note, I’m going to pass things off to Chris, to go through an example where we’ve worked through some things with that library. So I’m going to hand the ball off to Chris now.
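Conceptually, the library turns InfluxQL query results into TensorFlow `tf.train.SequenceExample` records: some context metadata plus the ordered feature values. The sketch below is not the library’s actual code; it is a plain-Python illustration of the shape of that transformation, using a hypothetical `to_sequence_example` helper and dicts in place of real protobufs:

```python
def to_sequence_example(rows, source):
    """Package ordered InfluxDB query rows as a SequenceExample-like dict:
    context metadata (sequence length, where the data came from) plus the
    ordered feature values themselves."""
    values = [row["value"] for row in rows]
    return {
        "context": {"length": len(values), "source": source},
        "features": values,
    }

# Rows as they might come back from a query like
#   SELECT wet_bulb_temp FROM qclcd WHERE wban = '14920'
rows = [
    {"time": "2017-06-01T00:00:00Z", "value": 61.0},
    {"time": "2017-06-01T01:00:00Z", "value": 59.5},
]
example = to_sequence_example(rows, source="wban=14920")
```

The real library emits protobuf sequence examples rather than dicts, but the split into “context” (length, source) and “features” mirrors the description given later in the demo.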
Chris Goller 17:14.217 Great. Yeah, take a minute.
Michael Desa 17:14.364 All right, there we go.
Chris Goller 17:16.648 Let’s see, sharing my screen. All right. Are you able to see the screen okay?
Chris Churilo 17:24.785 Yes.
Chris Goller 17:25.572 Yes. Awesome. So I wanted to describe this demonstration first of all, and how I want to use it to point out some of the data engineering features around using InfluxDB with TensorFlow. This is just a really straightforward weather prediction using an LSTM pulled right off the Internet. What we’ve done is use this library we’ve built to query Influx data, use TensorFlow’s data formats, and then make a prediction. We took weather data, a field called wet_bulb_temp, from something called the Quality Controlled Local Climatological Data. And basically what we’re going to do is take 48 hours of the previous weather and predict the next hour’s weather. What I want to emphasize is how we can think about training data coming from a database, and then, using the exact same model, bringing it into production, and how we can ease that transition, which is always so painful when just using straight CSV files and so on. Okay. So the first thing to talk about is that we’re setting up this very simple model here, just this basic LSTM; it’s got a couple of cells there. At this point, the model is fairly self-contained. It’s essentially just a regression, not particularly interesting but fairly straightforward. Most importantly, what I want to point out here is that we’re using an Influx query, and I can do a deeper dive into how this query batching works to chunk up datasets across a specific query. As a result, we’re able to train over a fairly large set of time series.
Chris Goller 19:28.544 This is the Influx query that we are using. Now, this wet_bulb_temp [temperature] is in Fahrenheit, and it’s coming from this particular measurement, the Quality Controlled Local Climatological Data, where wban equals a particular unit. That unit turns out to be 14920, which happens to be the La Crosse airport right near me. I live in a small town overlooking the Mississippi River in Minnesota, and this happens to be the nearest airport, about 45 minutes away. So we’re going to use that data from the last year, now minus 365 days, and grab it as our training, validation, and testing set. This particular query yields results in our shape; it actually generates what are called TensorFlow sequence examples. These sequence examples are in a standardized format, and that turns out to be incredibly important when you’re bringing a model from what you were doing on your laptop into acceptance and into production. Specifically, building your model around that TensorFlow data format allows you to add in or use other data sets in the exact same format. So today, perhaps I’m modeling wet_bulb_temp [temperatures] to try to predict weather. But tomorrow, perhaps I decide this model with its LSTM cells is very, very good, and I want to reuse it in a completely different context. That context may be something like disk usage, or any other kind of sensor data you wish to use. So the important concept here, from a data engineering perspective, is being able to fix the data format such that the model itself can be reused. And that’s what we provide in our library, which we can do a deep dive into in a moment.
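The 48-hours-in, one-hour-out setup Chris describes can be sketched as a simple windowing function. This is an illustrative stand-in for the library’s batching, not its actual code:

```python
def make_windows(series, window=48, horizon=1):
    """Split a sequence of hourly readings into (inputs, target) training
    pairs: each example is `window` consecutive hours of weather, and the
    target is the reading `horizon` hours after that window ends."""
    examples = []
    for i in range(len(series) - window - horizon + 1):
        inputs = series[i:i + window]
        target = series[i + window + horizon - 1]
        examples.append((inputs, target))
    return examples

# With hourly wet-bulb temperatures, each pair is
# (previous 48 hours, the 49th hour to predict).
hourly_temps = list(range(50))  # stand-in data
pairs = make_windows(hourly_temps)
```

Sliding the window by one hour at a time, a year of hourly data yields thousands of such training pairs from a single InfluxQL query.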
Chris Goller 21:36.873 Okay, so for this particular model, we break the data up into training, testing, and validation sets. And after we’ve made a prediction, what I’m showing here is 214 hourly samples. At this point, we’re just plotting what the model predicts: it takes the previous 48 hours and then predicts the 49th hour of temperature, and it does a pretty good job for a very simple model that I’m just running on my local laptop CPU; it’s not even GPU-accelerated. Okay, so the thing I want to mention is that, if I wished to, it’d be very simple for me to change where I grab data from. In a query language as simple as InfluxQL, I can, for example, just change my weather station. Maybe I don’t want to use the La Crosse weather station; perhaps I want to use San Francisco, or Denver, or Austin, or something like that. It would be trivial for me to change that one parameter. I don’t have to understand a lot of the underlying CSV or weather data format, because it’s been normalized into this TensorFlow sequence example. And I can retrain the model, or I can take a trained model and run additional data through it in various ways. This actually matters when you want to bring a model to production. When you productize any kind of machine learning model, if the data format is intertwined with the modeling behavior, it’s very difficult to elevate that model into more of a production environment, because these files exist, and so on, and so forth. So there are many things a data engineer has to build in order to use this in production. However, if you use InfluxData for time series modeling, to take real-time data you can simply, rather than querying something like the last 365 days, do more real-time queries.
You produce the real-time data, perhaps from the last day, send it into the model, and continue to produce results without needing to change the model’s underlying file format or how the model is built. So it gives a nice progression from training the model, to testing the model, all the way into production, without changing much more than the query range, from now minus a year down to some real-time window.
Chris Goller 24:21.757 Okay, that’s what we have for our demonstration. For a deeper dive into the actual batching and chunking of data, let me quickly point out some of the highlights here; of course, it’s all in the GitHub repository. What we’re doing is querying the Influx data, limiting and creating chunks of data so that we don’t overwhelm in-memory usage, and producing TensorFlow sequence examples with context around how big the sequence is, where the sequence comes from, and the features themselves, which can be integers, floats, or even byte streams. These chunks can then be streamed into a TensorFlow queue and used either in distributed TensorFlow training or locally on one’s laptop. Okay, that’s all I have for the demonstration. Back to you, Chris.
Chris Churilo 25:28.456 So Chris and Michael, if you don’t mind reading the questions out loud and reading and then answering them. So our first question is from David Lin, “How is this different from Lambda architecture where Spark is used?”
Chris Goller 25:45.175 Yeah, sure, I can handle this question. I would say that a lot of this has similar concepts to the Lambda architecture, basically around how we want to separate the notion of how data comes in. The difference here is that we are focusing on time series data and using sequence examples themselves, rather than the other kinds of data formats the Lambda architecture tends to involve. Yeah, I can go ahead and answer the next question, from Revington Campbell: “Can we write a series of data in protobuf format directly into the Influx database?” Yes, we could do that; I’ve run experiments around doing so. As long as you use what are called tags so that you can query the data back, you will be able to pull back this nice compressed string format and present it to TensorFlow. What I’ve done is a little different: I’m taking the data as it’s formatted natively for this weather data and producing the protobuf sequence example format, but you could, in theory.
Chris Goller 27:13.223 And from Jay Dar: “I attended a bit late, so I didn’t quite catch: is Influx used to manage TensorFlow, or in this example is TensorFlow used for predictive analytics on streaming data?” It is used for predictive analytics on streaming data in Influx. However, we have goals around hooking into TensorFlow’s logging and event formats so that we can capture the time series data that TensorFlow itself produces, that which can be instrumented, and over time we can generate even more kinds of distributed event recording. My larger goal in the long run is to allow TensorBoard to read events coming out of Influx itself, such that we can have a distributed TensorBoard where many experiments run at the same time. There’s another good one from Sebastian Borza: “Your example is a singular query from a database for wet_bulb info. How would this be amended for a streaming query, for example a refresh?” Right, so the idea is that when you take this into production, one thing you can do is simply have a while loop where you make predictions every so often. For example, this data set only refreshes every few hours, if I remember correctly. You would just run the query for the last chunk of time, perhaps the last hour, and then send that data into the model and make predictions. For higher-frequency things, you could add in different time segments so that you bring in the appropriate data live. Yeah, that’s another good question, Jay Dar: “If the case is the latter, could you use TensorFlow for pattern recognition on real-time streaming data?” That is something I think would be a really cool demonstration, honestly, where we could do some sort of pattern recognition or anomaly detection.
Some of my background is in anomaly detection in time series data, and I believe that’s something we should probably create a demonstration of. But essentially, that’s what you can do: you can use it on real-time streaming data based upon training on historical data sets. I think it’s problem-specific, and potentially difficult to do, but the framework, the data engineering, the how-you-do-it, could leverage the software we’ve created.
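The while-loop approach Chris describes for production can be sketched like this. `query_fn` and `predict_fn` are hypothetical callables standing in for an InfluxDB client query and a trained model, and the query text follows the webinar’s wet_bulb example:

```python
import time

# Relative-time query: only the most recent hour of data.
RECENT_QUERY = (
    "SELECT wet_bulb_temp FROM qclcd "
    "WHERE wban = '14920' AND time > now() - 1h"
)

def streaming_predictions(query_fn, predict_fn, iterations,
                          interval_s=3600, sleep_fn=time.sleep):
    """Periodically pull the most recent window of data from InfluxDB and
    feed it to an already-trained model. `iterations` bounds the loop here
    for illustration; a production poller would run indefinitely."""
    results = []
    for _ in range(iterations):
        rows = query_fn(RECENT_QUERY)
        if rows:  # skip a cycle if no new data has arrived yet
            results.append(predict_fn(rows))
        sleep_fn(interval_s)
    return results

# Example with stub functions in place of a real client and model:
preds = streaming_predictions(
    query_fn=lambda q: [60.5, 61.0],
    predict_fn=lambda rows: sum(rows) / len(rows),
    iterations=2,
    sleep_fn=lambda s: None,  # don't actually sleep in the demo
)
```

The point of the sketch is that moving from batch training to streaming inference changes only the query’s time range, not the model or the data format.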
Chris Goller 30:00.524 Okay, yeah maybe this one is a good one for Michael from Damian Stanton. “What kinds of bottlenecks have you seen occur in high throughput IoT space in processing time series camera and sensor data for autonomous cars?”
Michael Desa 30:20.468 Specifically, if you don’t have a batching layer, direct write access to the database is problematic. So having something like Kafka in front, that you feed your data into and that can be pulled from various other places, has been what I’ve seen work. The bottleneck I’ve seen is that it’s very easy to overload any of these systems with write traffic, and managing that when you have 100,000 individual clients all connecting to your database can be problematic, so that’s the biggest throughput issue I’ve seen. And then there are the query-side issues as well, but those are a lot harder to diagnose, and you frequently have to consider a lot more variables than we could talk about here.
Chris Goller 31:21.664 Thanks. All right, Revington Campbell: “Can we get TensorFlow summary statements into InfluxDB directly, instead of a logfile?” That is my goal. I’ve been playing around with this concept; I feel like there are pull requests I want to eventually submit to TensorFlow so that we can off-load some of these and, instead of a logfile, put them into the database. And then you could imagine many distributed training jobs whose summary statements would all go into Influx, and TensorBoard itself could do queries to actually render that, or whatever else we wanted to render from there. So we’d be using Influx for what it is good at, which is this kind of time series data, which is what summary statements are, to actually monitor TensorFlow itself. But to be clear, that’s follow-on work that we want to accomplish. And [crosstalk]-
Michael Desa 32:25.299 So to answer the-[crosstalk]. Yeah, so: is InfluxDB integrated with Kafka? InfluxDB itself is not, but another one of our tools, Telegraf, can read off of Kafka queues. So yes, it is, but through our other tool, Telegraf. You set up a Telegraf instance that will pull from a particular topic.
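As a rough illustration, a Telegraf Kafka consumer feeding InfluxDB can be configured along these lines; exact option names vary by Telegraf version and the topic and addresses here are placeholders, so treat this as a sketch rather than a drop-in config:

```toml
# telegraf.conf (excerpt): read metrics from a Kafka topic
# and write them to InfluxDB.
[[inputs.kafka_consumer]]
  brokers = ["localhost:9092"]   # Kafka broker addresses
  topics = ["sensor-metrics"]    # topic(s) to consume from
  data_format = "influx"         # messages are in line protocol

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
```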
Chris Churilo 32:57.481 All right, so we’ll let you guys put in a few more questions. I have a question for you, Chris and Michael. In doing this work, were there any surprises, good or bad?
Chris Goller 33:14.930 I would say one of the things that was a little surprising is that there really isn’t much of an interface yet in TensorFlow for shipping these summary files, which Revington Campbell asked about, off into a database. Currently, it’s basically just logfiles and so on. You can tell from the mentality of the people who wrote it that they do a lot of work locally on their laptops. But as people have seen over time, what you need to do is elevate a lot of these modeling tasks using good data engineering principles, using things like databases, to make distributed training easier, to make reproducible training easier, and to make productizing a little easier. I do imagine there will be quite a few people trying to address this space, but it doesn’t look like much has been produced yet.
Chris Churilo 34:27.899 Thank you. Looks like we have another question from Kapil Sharma.
Chris Goller 34:35.004 Yeah. “How about running TensorFlow from Keras, or any issues using Influx with it?” Yeah, I mean, my first introduction to TensorFlow was honestly through Keras, so I have a good feeling about it, especially now that it’s been integrated and will be an official API. I have not done any integration with it, but presumably that would be straightforward to do, especially now that we’ve created a system where you can query Influx and produce sequence examples. I’ll write that down and try to produce an example of how to use Keras with Influx directly, which, from an API perspective, is really easy to use.
Chris Goller 35:23.477 Okay. David Lin said, “Can I use Influx to store TensorFlow output?” Yes, I see what you’re saying. I believe what you’re asking is whether the summaries written as the model is being trained can be stored; you can correct me if I’m wrong, but I believe that could be done. Because these are timestamped, time-oriented series, yes, those things could be stored. We would need some way, and I haven’t done this work, of either translating the protobuf into one that can be serialized to a string and sent to Influx, or, if you wanted much more specific graphs, which I think we should do, laying the fields from those summaries down into fields and tags that Influx itself would understand, so that all the Influx tooling, for example Kapacitor and Chronograf, could use them. My outside goal would be something like simple metrics using Kapacitor to tell how the model is running, whether the model is running without crashing, those sorts of things. When will that be available? We’re working on this on and off; it depends on how much interest there is in the community, but apparently, there is some.
Michael Desa 36:58.224 Yeah. One thing that surprised me a little bit is just how much interest we got. This largely started out as a small thing that Chris and I had started toying with, and it appears there are a lot of people out there interested in things like this, which struck me as very interesting. As I should mention, for the most part, Chris and I are gauging interest and trying to figure out some other projects that we can start doing presentations on. So if anybody out there has anything they’d like to see, or anything they think would be cool as a topic, please do reach out to us and we can discuss it in more detail and hopefully get another presentation out to all of you.
Chris Goller 37:59.594 Sebastian Borza. Sebastian Borza asks, “You mentioned something about integrating with Kapacitor-do you have any examples of that?” Not yet. I’m imagining an example where we would make a Kapacitor UDF-UDF stands for user-defined function-such that it would consume data from, for example, some training summary in [laughter]-TensorFlux, right. So we would have a user-defined function in Kapacitor that could monitor some of these summary metrics, especially in distributed training. Perhaps we could do early stopping-that might be interesting. But maybe we should think about a more functional UDF, where we’re watching to make sure that training is still occurring. In a distributed environment, or a Kubernetes environment where you’re training TensorFlow, you want to make sure that your containers are not crashing, and so on and so forth. Those are the kinds of things we could capture that I think would be valuable, from a data engineering perspective, for training.
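[Editor's note: no UDF example exists yet, as Chris says above. As a sketch of the simpler "is training still occurring" case, Kapacitor's built-in `deadman` node can already alert when points stop arriving, without any UDF at all. This TICKscript fragment is hypothetical: the `tf_training` measurement name assumes training summaries are being written to Influx as in the preceding discussion.]

```
// Hypothetical TICKscript: fire an alert if no training-summary
// points arrive for 5 minutes, a sign the TensorFlow job has
// crashed or stalled.
stream
    |from()
        .measurement('tf_training')
    |deadman(0.0, 5m)
        .message('No TensorFlow training metrics in 5m -- is the job alive?')
```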
Chris Churilo 39:04.726 And if anybody on the call wants to get in touch with you guys to chat a little further, what do you two recommend as the best way of getting in touch with you? Should they just ask their questions-?
Michael Desa 39:19.504 Email for me.
Chris Churilo 39:20.922 In the GitHub repository, or-?
Michael Desa 39:21.737 Yeah, actually, that would be really good.
Chris Goller 39:25.784 That’s smart, yeah-
Michael Desa 39:26.388 A GitHub repository can serve as a starting point. Yeah, that sounds perfect.
Chris Goller 39:31.882 I think there are a lot of things we can do, and it would be really wonderful to hash it out and discuss it in the GitHub repository, perhaps as an issue. And then we, as a community, can come together and figure out how we want to handle some of this stuff.
Chris Churilo 39:54.905 Okay, let’s see, I think we’re probably coming to the end of the questions. What I will do is send everybody an email with a link to this video, and also a link to the GitHub repository, as a reminder to start the conversation there with both Chris and Michael. And let’s see their great work turn into something really awesome for everybody. Looks like we have one more question from Nick?
Chris Goller 40:32.154 Boy, I don’t see it yet. Did I miss it somewhere?
Chris Churilo 40:35.777 I think it’s more of a comment? Nick says that, “It would be great to see an example of the streaming anomaly detection.”
Chris Goller 40:42.508 Yeah, I think that would be really cool. That is near and dear to my heart-it’s something I worked on for years and years in robotics, so that sort of example would be really good. Perhaps I can find some really nice dataset that we can use as an example. If people in the community know some really cool time series datasets around anomaly detection, please let us know and we can try to play around with some of those.
Chris Churilo 41:15.743 Fantastic. Like I said, I will have the recording up later on for everyone to view. But I also recommend that you pass this webinar and the GitHub repository along to any of your friends who are also working in this space. The larger the collection of people we can get on this problem set, the more exciting the result can be. So let’s try to build some momentum behind this. I do want to thank our speakers today-it was a really fantastic talk, and we will definitely follow up with more exciting topics like this in our webinar series. So thank you so much, guys, and I hope everyone has a wonderful day.
Chris Goller 41:59.379 Yeah, thanks for the opportunity.
Michael Desa 42:00.249 Thank you.
[/et_pb_toggle]