How a Time Series Database Contributes to a Decentralized Cloud Object Storage Platform
Session date: Feb 25, 2020 08:00am (Pacific Time)
Storj Labs provides affordable object storage that is private by design, secure by default, and lower cost compared to traditional cloud providers. Everything stored is encrypted, split into pieces, and then distributed across Storj’s network of nodes. Storj’s Tardigrade cloud storage service is easy-to-use with fast, predictable performance, so users no longer need to manage their own infrastructure.
Discover how InfluxDB is a component of Storj's Tardigrade service and workflows.
In this webinar, John Gleeson and Ben Sirb will dive into:
- Storj's redefinition of a cloud object storage network
- How InfluxData fits into Storj's Open Source Partner Program
- Collecting and managing high-volume, real-time telemetry data from a distributed network
Watch the Webinar
Watch the webinar “How a Time Series Database Contributes to a Decentralized Cloud Object Storage Platform” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Here is an unedited transcript of the webinar "How a Time Series Database Contributes to a Decentralized Cloud Object Storage Platform". This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Caitlin Croft: Customer Marketing Manager, InfluxData
- John Gleeson: Vice President of Operations, Storj
- Dr. Ben Sirb PhD: Senior Data Scientist, Storj
Caitlin Croft: 00:00:00.455 Welcome to today's webinar. I'm really excited to have Storj Labs presenting on how InfluxDB fits into their cloud object storage platform. So without further ado, I will pass it off to John and Ben.
John Gleeson: 00:00:17.657 Hey. Thank you, Caitlin. We're really excited to present to your audience here. Just by way of introduction, my name's John Gleeson. I'm VP of Operations at Storj Labs.
Ben Sirb: 00:00:29.924 And I’m Ben Sirb, Senior Data Scientist at Storj.
John Gleeson: 00:00:34.659 So just to give you a sense of what we're going to talk about today, we're going to provide a brief primer on Storj and our business and how sort of the network functions. We'll talk a little bit about how that network function drives a business need for time series data. We're going to talk a little bit about just how important that data is in terms of our success or failure. We'll go over what our analytics approach is generally and then also what success looks like for us, not only right now but also into the future. And then we'll open it up for Q&A. So first, let me give you a little bit of an overview on what Storj Labs does. So our goal is to create a large, secure, private, and resilient cloud storage service comparable to many of the other cloud storage providers out there today from Amazon to Google to Wasabi or Backblaze and to do that without owning a data center. And the way that we do that is we have three pieces of software that we operate. All of our software is open-source, of course. But the first piece is the storage node. And the storage node software allows people who have excess hard drive capacity and excess bandwidth to share that with the network. The next piece is something called the satellite and that's a piece of software that we run, several of them under the Tardigrade brand, but anyone in our community can run a satellite. And what the satellite does is it helps mediate between the people who have storage and bandwidth to share and the developers who want to store data on that excess storage capacity. And so the satellite handles developer accounts and also is responsible for data repair. It's responsible for billing and payments and a variety of other tasks. And finally, there's the Uplink, and the Uplink is a set of developer tools. There's an S3-compatible gateway, a command-line interface, and also a Go library with a lot of language bindings for popular development languages that allows developers to embed our software in their product and then store data on the platform.
John Gleeson: 00:02:39.251 So let me give you sort of a little practical example of how that actually functions and also what some of the drivers are in terms of our needs for time series data. So when an Uplink client wants to upload a file to the network, there are a number of different things that happen. First, that object is encrypted client-side because so many of the storage nodes are run by people in many different countries and they're effectively untrusted and we don't know who they are. Everything has to be client-side encrypted for security and privacy purposes. The next thing that happens is that the files are erasure coded. And so it's broken up into 80 tiny, little pieces and those 80 pieces are stored then on 80 statistically uncorrelated storage nodes. And so the benefit here is that if any one or two or three storage nodes become unavailable, the Uplink only needs 29 of those pieces to actually recover their file and then decrypt it. The satellite provides a list of storage nodes to the Uplink as a selection for storing the pieces. And then the Uplink uploads directly to those storage nodes. Now in the background while that file is being stored, the satellite is keeping track of all of those different storage nodes and it's monitoring them to make sure that they are available, they have high uptime, and also that they haven't failed or left the network by auditing the data stored on them. And so over time, it is possible that storage nodes leave the network or the hard drives do fail. And so the satellite's responsible for counting those storage nodes as they leave the network. And if it reaches a certain threshold where the availability of the file could be put at risk, the satellite is able to download 29 pieces to itself, recreate the missing pieces, and upload them to healthy nodes, making sure that the file remains available at all times.
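For readers who want to see the 29-of-80 idea in code, here is a minimal, self-contained sketch using the open-source klauspost/reedsolomon library. This is an illustration of the erasure-coding concept only, not Storj's actual implementation (Storj uses its own erasure-coding packages), and the shard counts simply mirror the numbers John quotes above.

```go
// Illustrative sketch of 29-of-80 erasure coding; not Storj's actual code.
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const (
		dataShards   = 29 // pieces required to reconstruct
		parityShards = 51 // extra redundancy pieces (29 + 51 = 80 total)
	)

	enc, err := reedsolomon.New(dataShards, parityShards)
	if err != nil {
		log.Fatal(err)
	}

	original := bytes.Repeat([]byte("example segment data "), 1024)

	// Split the segment into 29 data shards, then compute 51 parity shards.
	shards, err := enc.Split(original)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing 51 of the 80 pieces (here: the first 51).
	for i := 0; i < parityShards; i++ {
		shards[i] = nil
	}

	// Any 29 surviving pieces are enough to rebuild the rest.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	fmt.Println("reconstructed and verified:", ok, err)
}
```

Even with 51 of the 80 shards discarded, the 29 survivors are enough to rebuild the segment, which is the property the satellite's repair process relies on.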
John Gleeson: 00:04:32.712 Now, that's an expensive process for us because it uses bandwidth and, of course, in a decentralized system like ours, it's a highly bandwidth constrained environment and bandwidth is really at a premium. Now when the Uplink goes to recover that file, again, it downloads 29 pieces, reassembles those pieces locally, and decrypts the file. And so there's a lot of complexity here. Right now there are over 50 million files being stored on our network. Each of those files is broken into a minimum of 80 pieces. So you can imagine that there's quite a bit of complexity in keeping track, in real time, of those pieces, the storage nodes that they're on, and the files that they belong to, to ensure that the system performs well and that when an Uplink wants those files, it can get them immediately, because it has to be comparable to any other cloud provider out there. In terms of the number of storage nodes that come and go from the network, the satellite has to be aware and make sure that the churn of storage nodes does not impact the durability and availability of files. And then on top of all of this, we have a system of incentives that go along with that usage and tracking where we actually compensate all of those storage nodes for storing data and making their bandwidth available. So throughout all of this, there is a set of configuration settings that we use to create a balance in terms of the trade-offs between the usability and the performance on the client side and the cost and complexity on the backend of the system side. One of the things I did want to touch on here is that we are an open-source project and one key aspect of what we do is that we do want to find a way to create synergies with other open-source projects since open-source drives so much of today's cloud workloads. We offer an alternative to large cloud infrastructure providers and those providers tend to use open-source as a loss leader for infrastructure sales. And so what we've produced is an open-source partner program. And I'll talk a little bit more about this at the end of the webinar today, but the idea is for us to create a sustainable path to revenue for open-source projects. But for now, I'm going to pass the mic over to Ben and he's going to talk a little bit about the business need for our time series data. Ben?
Ben Sirb: 00:06:54.232 Thanks, John. So we're operating on a two-sided market. So we're interfacing with both supply side and demand side. On the supply side, we are interested in bringing on reliable storage node operators because they're the ones bringing disk capacity and bandwidth capacity to the network. And on the other hand, we are analyzing the rate at which clients are adding data to the network. And we want to make sure that the capacity that they need is there and available. Now as a function of files being stored on the network, we're considering several key factors and John alluded to a few of them already. And we have under consideration durability, retrievability, and repair. Now, durability is what we are using to describe the likelihood or probability that a client that adds data to the network can still access their data in the future. And we wanted to make sure that we would launch a product that can guarantee the same levels of durability as anyone can expect using a leading cloud storage provider. Retrievability is a function of storage node uptime. So if your data is stored on a storage node that is not currently online and you are not able to access your data, we call that an unretrievable file. Now, the data may still be there and if the node comes back online, you might still be able to access it, but we also wanted to make sure that data that was stored on the network is retrievable, again, according to expected levels guaranteed by other leading cloud storage providers. And finally, we have the idea of repair. And repair is the process that the satellite undertakes to make sure that segments and files that are losing pieces due to churn and other factors are not actually lost. So repair is what we use to make sure that the file remains durable and available for customers.
Ben Sirb: 00:09:04.259 Now, there's a lot that goes into this calculation. And when we first set out to launch our product, we didn't have any data to base our conclusions off of so we had to start with some a priori assumptions. And we did a little bit of research to see what expressions we could use, what prior work had been done in this area, and we arrived at some calculations that gave us confidence in the parameters that we were using to encode files on the network. Now, you can look at the details in our whitepaper which can easily be downloaded from our website, but we've highlighted a snapshot here just for a brief description. Now in the table on the left, you'll notice some of the parameters going into the expression on the right. And the expression on the right is what we have used to initially estimate the amount of repair that we can expect to occur on the network. Now the B is the amount of data stored on the network. The alpha is the expected level of churn. And then M, N, and K are the Reed-Solomon encoding parameters, with N being the number of pieces that are created, K being the number of pieces that one needs to rebuild their file, and M being the repair threshold. We used some calculations and simulations to generate a list of values that could give us the guaranteed durability that we want to promise as well as help produce the expected levels of repair. As I mentioned, you can view the details in the whitepaper, but our goal at the outset was to balance the trade-offs between usability, durability, and performance along with the obvious economic constraints that we have to consider. Now once we had the theory, we wanted to make sure it measured up according to actual real-world usage.
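Since the slide itself isn't visible in the transcript, here is one back-of-the-envelope version of the kind of expression Ben is describing. It is a simplification for intuition only, not the whitepaper's exact formula, and it assumes a constant monthly churn fraction $\alpha$ applied to the roughly $n$ pieces each segment holds:

$$
\mathbb{E}[\text{repair bandwidth per month}] \;\approx\; \frac{\alpha\, n}{\,n-m\,}\left(1 + \frac{n-m}{k}\right) B
$$

The first factor says a segment decays from $n$ pieces to the repair threshold $m$ in roughly $(n-m)/(\alpha n)$ months; the second says each repair downloads $k$ pieces (one segment's worth of data) and re-uploads the $n-m$ missing pieces, each $1/k$ the size of the segment, summed over all $B$ bytes stored.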
Ben Sirb: 00:11:09.027 And if you can't measure it, you can't improve it. So we have a list of key metrics that directly impacted our roadmap progression. Whether it was moving from alpha to beta or beta to production, we wanted to make sure that the actual performance of the network, not based on theory but based on data, was living up to the expectations that we set for ourselves. And furthermore, if we measure this data and expose the network behavior internally, it helps teams across our organization move more quickly. The more insight they have into the network, the faster they can iterate, and the faster they can iterate, the more and the better we can improve. Now we are going to succeed or fail based on the data that we are collecting. And in order to ingest this data - you can imagine all the moving pieces John described generate a massive amount of data - we had to boil it down to a simple ETL. And where it all starts is with a package called monkit. And monkit is a package that we've sprinkled throughout our codebase in various aspects of the network. You'll find it on Uplinks, storage nodes, and satellites. And this does basic monitoring and metric reporting as well as offering the capacity for custom measurements, and I can talk a little bit more about that in a moment. But what monkit does is output to what we're calling internally our stat receiver binary. And stat receiver is our messaging service which is going to take the monkit data and then distribute it to various destinations internally within the organization. We have debugging endpoints as well as log outputs as well as various databases, and one of these databases is the Influx database.
Ben Sirb: 00:13:12.647 And the reason we chose Influx was that the majority of our key metrics are very sensitive to change over time and we needed to track these changes. And it offers the scalability and features we needed, giving us the greatest operational experience and confidence in the software with the least amount of work. And finally, we connected our Influx to our Redash BI investigative tool. So Redash is our endpoint for the various databases that teams across our organization use to analyze network performance or answer business-related questions. So this seems really simple on paper and it seems like we have all the data we need in bulk. Bulk data done, right? Well, it's not quite so simple because when you're generating massive amounts of data, you really need to focus on optimizing the design. So the faster we start generating key data, the faster we improve. However, we don't have a lot of resources for a lot of maintenance overhead. We couldn't bring on people dedicated to managing the database and we wanted to make sure we had flexibility as the network grew and the data needs changed. Furthermore, as we do work with open-source partners, we want to make sure that it's possible for non-Storj folks to use and run at some point. And finally, it needs to scale with the network. So how do we do it? Well, as I've mentioned, we started with monkit. And monkit is a Golang package that captures event data right in the code. We've included a screenshot here as an example and you can see that with a few lines, you can add extra monitoring and telemetry data right within the function call.
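Since the screenshot isn't reproduced in the transcript, the sketch below shows roughly what that kind of instrumentation looks like with monkit. The import path is the public spacemonkeygo/monkit v2 path, and the function and measurement names are invented for illustration; this is not Storj's actual code.

```go
// Illustrative monkit instrumentation, not Storj's actual code.
package example

import (
	"context"

	monkit "gopkg.in/spacemonkeygo/monkit.v2" // assumed v2 import path
)

var mon = monkit.Package()

// StorePiece is a hypothetical function. The deferred mon.Task call provides
// the default telemetry (timings, success/failure counts) with one line, and
// the custom measurement below it flows through the same pipeline.
func StorePiece(ctx context.Context, piece []byte) (err error) {
	defer mon.Task()(&ctx)(&err)

	// ... write the piece to disk ...

	// Custom measurement: shows up downstream (e.g. in InfluxDB) once the
	// stat receiver routes it there.
	mon.IntVal("piece_size_bytes").Observe(int64(len(piece)))
	return nil
}
```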
Ben Sirb: 00:15:11.152 So in addition to some basic default monitoring, which you might notice towards the top of this code is a mon.Task call - this reports percentiles, averages, success times, and counts by default. And then below at the bottom, you can see that monkit also allows you the capacity to provide custom measurements. So if a particular function in the code is reporting values that are particularly interesting to the operation of the network, you can actually call monkit and have it report through our data pipeline, resulting in a measurement that shows up in our time series database. Now as I mentioned, this is collected directly on the network and various pieces of software and hardware are sending out this data. And in order to ingest this data, we use this binary called stat receiver. Now, this is a custom binary that we use internally to ingest this data and, using a simple configuration file, it can apply some basic templating and filtering to qualify it for use in our various data sinks. Now once we pass it through stat receiver, one of our endpoints is InfluxDB. And something that we found incredibly helpful, especially early on, was the fact that InfluxDB supports the Graphite protocol. Now the way monkit works by default is it outputs completely unstructured data. And the structure that stat receiver assigns is limited because stat receiver is sending the measurements out to various destinations. So in order to assign more structure to the data, we used the Graphite filtering and templating schema. So what we were able to do is take certain measurements that we were interested in, apply a specific filter, and give it a specific measurement in the Influx database.
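As a rough illustration of this filtering and templating layer, InfluxDB 1.x exposes it through the Graphite listener in its configuration file. The metric paths, database name, and template layouts below are hypothetical stand-ins, not Storj's actual configuration:

```toml
# influxdb.conf excerpt (illustrative only)
[[graphite]]
  enabled = true
  bind-address = ":2003"
  protocol = "tcp"
  database = "telemetry"
  # Each entry is "<filter> <template>": the filter matches incoming
  # Graphite-style metric names, and the template maps the dot-separated
  # path components onto tags and a measurement name.
  templates = [
    "satellite.repair.* app.component.measurement*",
    "satellite.audit.*  app.component.measurement*",
    # fallback rule for everything else
    "measurement*",
  ]
```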
Ben Sirb: 00:17:23.915 Now, for those of you familiar with SQL, you'll know that it's easy to query against, and the result was that various teams internally were able to investigate the performance of the network. And using this monkit metrics monitoring output, they could actually investigate the performance of various function calls as well as look at key metrics that business decisions need to be based on. Here, we have an example that we can also touch upon a little later, but this is a very important measurement that has to do directly with the amount of repair we can expect. This ends up plotting the different percentiles of pieces remaining for the segments on the network. And this enabled a forecast that I can talk about in a moment. Now, the challenge to all of this is managing series cardinality. So we started with the goal of maximizing visibility into the network and this meant saving all the data. But, of course, you can imagine this leads to massive database bloat and resulted in a series explosion. So one of the lessons that we learned early on - well, we learned it the hard way - is that we have to make sure to not encode highly variable information as a tag. We were initially setting out to make it more efficient to filter function calls by various actual pieces of hardware. And so we encoded InstanceID as a tag because we wanted to separate out by satellites, by Uplinks, by storage nodes, but we quickly saw that this led to a massive amount of series.
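To make the cardinality lesson concrete, here is a hypothetical pair of InfluxDB line protocol points (the names and values are invented). The first encodes the node's instance ID as a tag, so every node creates a new series; the second keeps the ID out of the tag set by writing it as a field, so the series count stays bounded by the low-cardinality tags:

```text
# one series per instance ID -> series explosion
function_times,app=storagenode,instance_id=1a2b3c value=0.42 1582617600000000000

# bounded series count; the ID is still stored, just as a field
function_times,app=storagenode value=0.42,instance_id="1a2b3c" 1582617600000000000
```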
Ben Sirb: 00:19:20.669 And the number of series was growing at an unmanageably fast rate. And in order to prevent the number of series from continuing to grow, we actually had to start filtering out certain measurements before they reached our Influx database. Now, the good news is we were still able to use other internal endpoints to debug and monitor performance, but we were no longer able to analyze the data coming in from storage nodes through the Influx database. And this was somewhat of a hard lesson and we definitely kept this in mind as we've made improvements to our ETL process. Now we succeed based on the data, and what Influx lets us do is enable a very powerful segment decay forecast which is essential to knowing how much repair we can expect. Sometime in the middle to end of last year, we took the segment health data whose query we highlighted a few slides ago and we did a predictive model, taking the historical data and bootstrapping a linear regression against the data to forecast when we can expect the first large chunk of repair to start. And this forecast was very impactful in the organization, as it led to us modifying our repair threshold as well as giving us a good idea of what kind of costs to expect, as well as giving us the capability of testing our repair mechanism by tweaking our repair threshold to start repair at an expected time. Now, the time series nature of the data also enables us to monitor growth rate.
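The sketch below shows the general shape of that kind of forecast: fit a trend line to historical "pieces remaining" values and project when it crosses the repair threshold. It is not Storj's actual forecasting code; the bootstrap resampling step is omitted, the data points and threshold are invented, and it assumes the gonum statistics package.

```go
// Illustrative segment-decay forecast sketch, not Storj's actual code.
package main

import (
	"fmt"

	"gonum.org/v1/gonum/stat"
)

func main() {
	// Days since the start of the observation window.
	days := []float64{0, 7, 14, 21, 28, 35}
	// Low-percentile "pieces remaining" per segment (made-up values).
	pieces := []float64{80, 78.5, 77.2, 75.9, 74.4, 73.1}

	// Ordinary least squares: pieces ~= alpha + beta*days
	alpha, beta := stat.LinearRegression(days, pieces, nil, false)

	const repairThreshold = 52.0 // hypothetical m; repair starts below this
	daysUntilRepair := (repairThreshold - alpha) / beta

	fmt.Printf("fitted slope: %.2f pieces/day\n", beta)
	fmt.Printf("projected threshold crossing in ~%.0f days\n", daysUntilRepair)
}
```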
Ben Sirb: 00:21:15.756 As I mentioned towards the start of the call, if there's not enough capacity for clients to upload data to the network, then it's hard to bring on new clients. But on the other hand, if there's too much capacity and storage node operators aren't hosting data and they're not being paid for data they're not storing, that's no good either. So in order to balance the two, we have to keep a very close eye on growth rate. Now, it also allows us to track historical bandwidth usage for the various bandwidth operations - gets, puts, and repairs - and repair is especially critical to bandwidth usage and cost. And finally, it allows us to track vetting success rates over time. Now, vetting is what we've implemented to help reduce the rate at which storage nodes leave the network. So before we actually trust a storage node operator with data, we require the storage node to undergo a certain number of data audits to verify the integrity of data stored on their disk. And once they've successfully passed the number of data audits, then we consider those nodes vetted and then we more fully trust them with more data on the network. And we've implemented this as a way of reducing churn. Now, this allows us to make data-driven decisions and we can track historical rates of churn and segment decay. Now, we've created a number of internal dashboards for the benefit of the entire organization and this allows individuals and teams to dive in and do root cause analysis of any observed anomalies. You'll notice as an example, here we notice an uptick in the percent of nodes failing and we can take this information and compare it with the amount of repair that we expected or the amount of pieces remaining for eroded segments.
Ben Sirb: 00:23:23.511 All of this is made possible by the time series nature of this data. Now on the bottom left, you'll notice a plot of the query that we highlighted earlier in the talk. And what this shows you is the percentiles of the number of pieces remaining for the segments on the network. And this is something that we rely on very heavily to decide whether files are in danger of being irrecoverable. And as we're using a nice expansion factor and allowing for rebuild to occur with only 29 pieces, you can see in the [inaudible] in this plot that none of the segments on the network ever get below 50. And we have a very generous repair threshold to make sure that segments are never in danger of becoming irrecoverable. Now, we can also measure model parameters in real time. And we can compare the key aspects of the network against one another. As I've mentioned, and as you can imagine, storage node churn directly correlates to the amount of repair one can expect on the network because the more storage nodes leave the network, the faster segments lose pieces, and this in turn results in a higher rate of repair for those segments. And so, of course, we've created a dashboard that highlights the relationship between churn and repair. We can see how many bytes are on the network and how much is used for repair - the amount of data used for repair - as well as directly comparing the rate of repair to the churn. Now, as I've mentioned, the initial implementation of our ETL resulted in measurements that we had to filter using Graphite.
Ben Sirb: 00:25:28.645 And that consideration led us to push internally for a change to the way monkit works. And we've recently released, internally, monkit v3 which eliminates the need for the Graphite layer. This has been a push for a couple of months internally to make monkit speak Influx more natively. Now, the way monkit works is it outputs the default metrics that are now tagged more appropriately and has better default naming for Influx measurement output and browsing. And it also allows for very convenient measurement naming when adding custom measurements. Now, one thing we actually had to implement fairly early on was what we called locking measurements. With our initial setup, when a custom measurement was added to the code, we had to lock it to make sure that a code refactor did not impact the filter and schema we were using due to the Graphite stopgap. Now because of the way monkit works nicely with Influx, we no longer have to lock measurements and you can imagine that this helps eliminate a large expected amount of technical debt that we would have if engineering teams had to keep track of which measurements needed to be locked before and after a code refactor. Now, we also have our eyes on the future and we have implemented an additional portion of the TICK Stack, that portion being Chronograf. So Chronograf plays very nicely with Influx, of course, and it makes it very easy to natively browse the measurements that we are outputting.
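As a sketch of the tagged custom measurements monkit v3 allows (the v3 import path, the series-tag helper usage, and all names here are assumptions rather than Storj's actual code), the tag travels with the measurement instead of being reconstructed later by a Graphite template:

```go
// Illustrative monkit v3-style tagged measurement, not Storj's actual code.
package example

import (
	"github.com/spacemonkeygo/monkit/v3" // assumed v3 import path
)

var mon = monkit.Package()

// RecordSegmentHealth is a hypothetical helper: the series tag lands in
// InfluxDB as a real tag, with no Graphite filtering or locked names needed.
func RecordSegmentHealth(remaining int64) {
	mon.IntVal("pieces_remaining",
		monkit.NewSeriesTag("rs_scheme", "29/80")).Observe(remaining)
}
```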
Ben Sirb: 00:27:21.921 And furthermore allows for various teams to investigate the data, even if they don’t know SQL or you can’t sit down and spend the time to write an in-depth query. Well, the Chronograf connection now allows them to go in, browse the database, and even copy and paste with very little modification into the Redash query box and start generating dashboards for internal dissemination. We’ve also recently rolled out this statistical modeling directly into our architecture. So we’ve implemented Apache Airflow, which is an automated batch processing paradigm as well as a custom version of that which we’re calling Data Flow. And Data Flow is a Golang version that allows for batch processing in near real time. So it’s on a shorter period and you can do anything on an automated basis that you can do batch offline such as download your data, run some kind of optimization routine on it, upload to a database, or output some nice graphics for wider use. And the reason we did this is because we wanted to help empower our teams internally to make better use of the data. We wanted to make sure that we’re not siloing the knowledge and the power of our data in any one area because we believe that a decentralized network eventually needs decentralized analytics. And especially since we want to make sure that we work very closely with our open-source partners, we want to make sure that our tools are readily available for their use as well.
John Gleeson: 00:29:14.099 So with this presentation, we focused a lot on some of the metrics that we have for the key business drivers around repair thresholds and churn rates and things like that because those are aspects of the network that are really unique to a decentralized storage platform like Storj. And we have a really complex set of dashboards and a large number of aspects that we monitor continuously in terms of making sure that we're not only offering a commercially successful and economically viable product, but also we're offering a great end user experience for both sides of our network. And one of the things that I did want to sort of touch on briefly at the end here was our open-source partner program. This is something I mentioned earlier and basically, what it breaks down to is the network is made possible by two things, our storage node operators who contribute storage and bandwidth to the network and then the developers who are building applications that store data on the network. Our primary incentive structure is to reward storage node operators by fairly and transparently compensating them for the resources they provide so that they remain on the network for the long term. And we have a lot of metrics that we were just talking about - churn and vetting and those sorts of things - that make that possible. But we also have a program for partners who drive demand for storage to the network, and open-source projects that generate a lot of data - database backups of InfluxDB time series data, for example - are great candidates. And so the way the program works is for open-source projects that generate data, their end users get great, secure, reliable cloud backup storage for half the price of the big storage providers out there. But at the same time, when end users of an open-source project choose to store data on the storage platform, for example, database backups, the project gets a meaningful share of the subscription revenue when their end users store that data on our platform.
John Gleeson: 00:31:27.638 And so the way we look at it is it kind of generates this virtuous cycle. The open-source project creates an innovative product. Their end users store data on a decentralized infrastructure provider like Storj. We then in turn share a portion of our revenue back with the open-source project directly related to the use of their software. And then that revenue allows them to continue the cycle of innovation, which then creates more users and drives more usage of our platform. So we have a lot of amazing projects out there who have already signed up for our program. So if you are part of an open-source project that generates demand for data storage or you know somebody who is, sign up for our partner program at storj.io/partners and we'll tell you all about how you can build a connector that gives your users an option to store data on the network, which ultimately results in sustainable revenue for the project. So with that, Ben and I would like to open the floor to questions about how we're using time series data in conjunction with our product.
Caitlin Croft: 00:32:39.885 Thank you, John and Ben. So there’s a couple of questions that have already come in. And, obviously, if you guys have more questions, feel free to post them in the Q&A box. So you mentioned early on that the data is split into 80 pieces. Is there a reason why it’s 80?
Ben Sirb: 00:33:00.336 Yeah.
John Gleeson: 00:33:01.388 Sure. You want to take this one, Ben?
Ben Sirb: 00:33:03.887 Sure. Yeah. I can go ahead. So if you remember the slide showing the little box with all the parameters and the mathematical formula there, that’s just a small glimpse into the calculation that went into our assumption. And, again, the details are in the whitepaper, but generally speaking, you took the expected amount of churn - and I believe that we were very generous in our estimate of churn. And, actually, according to measured rates, we’re one or two orders of magnitude lower than the assumption that we used. Thanks, John. And so alpha here, that term - or that factor is our expected level of churn according to the model. And the N, M, and K parameters are the encoding parameters as mentioned. In this case, we’ve landed on a value of N equals 80. Now, what we aren’t showing here is the simulation model that we scripted on the back end. So we took our assumption and using our economic constraints, we created a script and ran the model to output a variety of acceptable values. The constraints were that we want to make sure that durability is where it needs to be, as well as we want to make sure that repair isn’t going to actually cost us more than we’re taking in economically. So using some constraints, we ran the model and generated a list of acceptable criteria. And among that acceptable criteria, guaranteeing the durability that we are wanting to promise to compete with leading cloud storage providers, we landed on a value that was going to help minimize the cost of expected repair. And N equals 80 was one of the choices for that. There’s a few other acceptable parameters. Actually, there’s quite a few but this was one of the more desirable ones given the balance between expansion factor and expected repair.
Caitlin Croft: 00:35:09.891 Thank you. I hope that -
John Gleeson: 00:35:10.620 Yeah. To just -
Caitlin Croft: 00:35:12.478 Go ahead.
John Gleeson: 00:35:13.647 Yeah. I was just going to add a little bit more color on that. So in context of how that actually works on the network, our target is to sort of meet the industry standard of 11 9s of durability. And in order to do that, we could have 1,000 pieces of which you only need 29 to get your file back or 50 or whatever. And so we evaluated a large number of different combinations. And based on the math that we did and the churn that we expected, it was the 29 of 80, which creates an expansion factor of about 2.75. So for every terabyte we store, we actually are storing 2.7 terabytes on the back end. So there’s a balance there that sort of provides the durability we want, the performance and availability for end users that they expect, and the cost structure that we have in place. So there’s a lot of math to it, but ultimately, it results in balancing economics with the developer experience.
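For reference, the expansion factor John quotes follows directly from the erasure-coding parameters, where $k = 29$ pieces are needed out of $n = 80$ stored:

$$
\text{expansion factor} = \frac{n}{k} = \frac{80}{29} \approx 2.76
$$

so each terabyte of customer data occupies roughly 2.76 TB of raw capacity spread across the network, which is the "about 2.75" figure mentioned above.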
Caitlin Croft: 00:36:16.209 That’s really interesting and very helpful. Now, you’ve mentioned a queue - sorry. You mentioned a few key metrics that you used during your development process. There’s so many different metrics that people can use when developing a new tool. How did you guys sort of finalize on those primary metrics that you were going to consider?
John Gleeson: 00:36:39.857 Well, the first thing we sort of mapped it up against was the overall sort of economic model for the platform, right? And so there are a few things that are known constraints. And the ones we focused on are probably by now, obviously, the most important ones, repair being a potential huge cost driver and also durability and availability and performance on the client side. And so for us, it was really looking at our business model. And then sort of as we evolved and as we went through the early alphas and the betas and approaching production, seeing how well the actual performance of the network tracked to the assumptions we made upfront was really what drove a lot of our sort of decisions in terms of what metrics to select, right? So we would look at something and go, "I wonder how we would measure this?" And then we would look at what data we had and then build dashboards and build monkit integrations to generate the metrics and then run them through Graphite and then ultimately get the data visualized. And so we could see what direction a lot of this stuff was trending in. Because among the other interesting things, things change a lot over time. So it's extremely unusual on our network for a piece to be - or a file to be put in danger of unavailability in a short period of time. If you have 80 different storage nodes storing a segment of a file - and large files are broken up into multiple segments, so there are actually more than 80 pieces comprising a file; a gigabyte file would be on 1,280 different storage nodes, for example - the idea there is that if 51 out of 80 storage nodes went offline all at the same time, the statistical probability of that is extremely low, right? Because they're all on different power supplies, all on different networks, all in different geographic locations, all operated by different people in different places.
John Gleeson: 00:38:46.505 And so the way it ends up happening is things slowly decay over time. And the complexity in tracking that slow decay, while it produces really high availability, it’s one of those things that we couldn’t let it sneak up on us because you only get one chance as a data storage provider to make your durability almost perfect. Ben, do you want to add anything to sort of how we chose some of the metrics?
Ben Sirb: 00:39:13.683 No. I mean, that's mainly it. Durability is matching what the leading cloud storage providers are offering and we had to ramp up to this in terms of building our roadmap to production. And so yeah, I mean, I think, John, you gave a great overview of that.
Caitlin Croft: 00:39:34.205 Great. And kind of adding onto that, when you guys were considering your roadmap, what was the catalyst for creating your own ETL tool versus using one that already exists?
Ben Sirb: 00:39:48.615 So we were running lean and mean and wanted to work with something that could be as easy and quick to configure as possible. Furthermore, we wanted to work with something that could easily take the monkit package that was already something many people on our team were familiar with and convert it into something that was usable for our analytics. Also, since we didn’t have a lot of resources for managing the way these metrics were coming in, we really had to kind of hack together the measurements that we needed and apply measurements almost on the fly, apply a filtering and templating on the fly, to give us the insight that we needed into our network. And actually, I think we did a pretty dang good job in hindsight given where our repair levels are and our durability. Yeah. I think we’re doing pretty good with where we started. And now that we’re kind of maturing data-wise, I think we are starting to consider other options and more end-to-end type solutions.
John Gleeson: 00:41:02.185 Yeah. I think a big factor here is that because we don't own or control a significant portion of our infrastructure, it has led us to do a lot more building in terms of that sort of buy, build, or partner analysis. And so it's not just the metrics piece of this, but also the messaging components. And so we were originally using gRPC for messaging and we ended up writing our own lightweight messaging protocol, DRPC, because the distributed, decentralized system requirements are in some ways radically different given the untrusted nature of some of the components of our infrastructure. And so a lot of that is also what drove some of these decisions in terms of building components that are really lightweight and really matched to distributed systems.
Caitlin Croft: 00:41:57.717 Great. And towards the end of your presentation, you mentioned there’s other areas of the TICK Stack that you’re looking into. Have you looked at InfluxDB Cloud 2.0 or Flux or any of the new functionality?
Ben Sirb: 00:42:13.157 So yeah. I’m glad you mentioned that, actually, because Flux is one of the things that we were kind of hoping to implement actually because with our initial filtering and templating scheme, a lot of the measurements that we were interested in comparing wound up - or a lot of the metrics we were interested in comparing wound up as separate measurements in our Influx database. And the Influx version we were using hadn’t yet incorporated or allowed for the use of Flux for cross measurement analysis. And so this was kind of earlier on and we had then started the discussions about whether we should modify monkit and realized we had other priorities at the time. And so as a stopgap, we did make use of the Graphite filtering and that’s what allowed us to combine metrics that we wanted to compare into a single measurement. But yeah, we have since then kind of improved in a variety of areas. So we’ve now gone to a version of Influx that does support Flux and we are happy to have that capability. I think given how widely these measurements are going to be used internally, I think it’s going to be more for investigative purposes. So it’s going to be more focused on standard SQL-like syntax, but we are excited about the capability of using Flux to compare cross measurement. And also given the shift from monkit v2 to v3, that makes the cross measurement analysis not as pressing. But we have thought about implementing a Kapacitor side of things for the alerting and the analysis, and we are going to continue investigating that and see whether it’s going to suit our purposes.
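As an illustration of the kind of cross-measurement comparison Ben mentions wanting Flux for, the query below joins a repair-traffic measurement against a churn measurement on time. The bucket, measurement, and field names are hypothetical, not Storj's actual schema:

```flux
repair = from(bucket: "telemetry")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "repair_bytes" and r._field == "value")
  |> aggregateWindow(every: 1d, fn: sum)

churn = from(bucket: "telemetry")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "node_churn" and r._field == "value")
  |> aggregateWindow(every: 1d, fn: sum)

// one row per day with both values side by side
join(tables: {repair: repair, churn: churn}, on: ["_time"])
```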
Caitlin Croft: 00:44:09.595 Great. That’s really interesting. I always find these webinars fascinating to see what people are doing with their technologies and what they’re hoping to accomplish as they move on in their development process, and also as we continue to develop our products as well. Well, just thank you everyone for joining today’s webinar. I hope this was very helpful. If you guys have any more questions, we’ll keep the line open for another minute or so. And just another reminder, we do have InfluxDays coming up in London and we’d love to see you there. There is a promotional code. So if you’re interested in going, be sure to check out our website and use this promotional code. John, Ben, do you guys have anything else that you would like to add?
John Gleeson: 00:45:03.346 Sure. Actually, one thing that I would like to add here is we do have our discussion forum at forum.storj.io. And so if you want to learn more about what we’re doing or ask questions, you can also interact with our community of engineers through our community forum.
John Gleeson
Vice President of Operations | Storj Labs International, SEZC
John Gleeson is Vice President of Operations at Storj Labs International, SEZC. John is responsible for direct sales, partnerships, customer success, community, product management, and data science as well as growing the international presence for the Storj decentralized and distributed cloud storage network. He is responsible for operationalizing core processes to ensure the business scales as the company and network grow. He established and operates from the company's international Cayman Islands office. John has a BA from the University of Michigan and a JD from Wayne State University.
Dr. Ben Sirb PhD
Senior Data Scientist | Storj Labs
Ben joined Storj in April of 2018, bringing 6 years of experience in academia, research, and consulting. Drawing from his knowledge of probability theory, Ben has been instrumental in researching the theoretical foundations and practical applications of decentralized networks at Storj. He holds a PhD in Mathematics from Georgia State University, where his research focused on decentralized algorithms and optimization. Ben, his wife Julia, and baby girl Olivia call Atlanta home. Outside of work, Ben enjoys playing chess poorly, exploring new food options in the city, and discovering new downtempo and electronic pop on Spotify.