Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Session date: Nov 15, 2022 08:00am (Pacific Time)
RudderStack — the creators of the leading open source Customer Data Platform (CDP) — needed a scalable way to collect and store metrics related to customer events and processing times (down to the nanosecond). They provide their clients with data pipelines that simplify data collection from applications, websites, and SaaS platforms. RudderStack’s solution enables clients to stream customer data in real time — they quickly deploy flexible data pipelines that send the data to the customer’s entire stack without engineering headaches. Customers are able to stream data from any tool using their 16+ SDKs, and they are able to transform the data in transit using JavaScript or Python. How does RudderStack use a time series platform to provide their customers with real-time analytics?
Join this webinar as Ryan McCrary dives into:
- RudderStack’s approach to streamlining data pipelines with their 180+ out-of-the-box integrations
- Their data architecture including Kapacitor for alerting and Grafana for customized dashboards
- Why InfluxDB was crucial for them for fast data collection and for providing a single source of truth for their customers
Watch the Webinar
Watch the webinar “Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
Speakers:
- Caitlin Croft: Sr. Manager, Customer and Community Marketing, InfluxData
- Ryan McCrary: Senior Sales Engineer, RudderStack
Caitlin Croft: 00:00:00.119 Hello everyone and welcome to today’s webinar. My name is Caitlin Croft, and I’m joined today by Ryan from RudderStack, who will be talking about how to streamline and scale out your data pipelines with Kubernetes, Telegraf, and InfluxDB. So what’s really cool about RudderStack is they use InfluxDB and we use them. So it’s kind of a fun relationship. And without further ado, I’m going to hand things off to Ryan.
Ryan McCrary: 00:00:32.289 Awesome. Thanks, Caitlin. So, yeah, as Caitlin mentioned, I’m Ryan. I’m a solutions engineer here at RudderStack. I’ve been at RudderStack for about two and a half years, which in the RudderStack world is a long time. These days I’m mostly focused on helping some of our strategic customers build out their data pipelines, building custom data solutions, and helping them solve problems largely around maintaining traditional data pipelines and replacing those with RudderStack. So I guess to jump in, we’ll start off with what RudderStack is for those of you that aren’t familiar. We’ve got a high-level, more marketing kind of architecture diagram here. But to slice it up, RudderStack initially started as this top blue arrow here that we’ll see, which is event streaming. And so think about this as a similar tool to, if you’re familiar with a Segment or an mParticle, this is going to be basically collecting behavioral user data, whether it’s via API or any of our 15-plus SDKs, whether that’s client-side, server-side, or mobile, and collecting that first-party data and basically providing it to all of your various integrations. That could be marketing integrations, that could be product analytics, that could be any kind of CRM, or really anything. And so you’ll see there we have about 180 different destinations that we support, as well as this kind of bottom smaller area here where this event stream is also being sent to data warehouses and data lakes.
Ryan McCrary: 00:01:59.634 And so the idea here is, at a high level, thinking about just easing the burden on engineering to get the data that various internal stakeholders need into their cloud tools. And so through a single instrumentation, a single SDK, RudderStack essentially will load the native SDKs where needed. So let’s say in this situation, if we’re looking at this example here, we may be using Braze for in-app notifications or push notifications. RudderStack would actually be instrumented as the only SDK, and then would load and wrap the Braze SDK for that kind of rich user experience, but then also would be sending to Mixpanel, FullStory, or anything like that, server-side as well. And then also loading that same data that’s being sent to those real-time destinations, batch loading that into a data warehouse or data lake. As part of that, RudderStack is going to handle all of the schema management. So we’re going to build tables for each event. We’re going to add columns for each new property that’s sent. And so that, regardless of what’s happening on the instrumentation side, that data is flowing uninterrupted to our data warehouse, which we can then use to drive some of our other internal data apps. So that’s kind of the initial pipeline. That’s what RudderStack initially was built for. It was really just a way to kind of simplify that event collection and send to our cloud tools as well as to our data warehouses and data lakes.
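As a rough illustration of the schema management described here (one table per event, a new column for each new property), below is a minimal sketch in Python; the identifier rules and column types are assumptions made for illustration only, not RudderStack’s actual implementation.

```python
# Hypothetical sketch: derive warehouse DDL from an incoming event.
# Each event name becomes a table; any property not yet seen becomes a new column.
# Naming rules and the use of TEXT columns are assumptions for illustration only.
import re

def to_identifier(name: str) -> str:
    # "Order Completed" -> "order_completed"
    return re.sub(r"\W+", "_", name.strip().lower())

def plan_ddl(event: dict, known_columns: dict) -> list:
    table = to_identifier(event["event"])
    statements = []
    if table not in known_columns:
        statements.append(f"CREATE TABLE {table} (id TEXT, received_at TIMESTAMP)")
        known_columns[table] = {"id", "received_at"}
    for prop in event.get("properties", {}):
        column = to_identifier(prop)
        if column not in known_columns[table]:
            statements.append(f"ALTER TABLE {table} ADD COLUMN {column} TEXT")
            known_columns[table].add(column)
    return statements

known = {}
print(plan_ddl({"event": "Order Completed", "properties": {"revenue": 42}}, known))
```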
Ryan McCrary: 00:03:17.630 As kind of a next phase, we moved into these other lines that we’ll see over here. Essentially our idea - and we’ll touch on this more in a second - is that your data warehouse should be the source of truth. And so one of the big differentiators that we’ll discuss is that RudderStack’s not keeping any of this data. We’re sending it to the warehouse and the cloud tools, and then we’re dropping that data completely. We have no access to it in perpetuity. And so, it only stands to reason that that data in the warehouse is not going to be complete. Just clickstream data doesn’t always paint the full user picture that we need. And so we provide a number of ETL connectors to get this cloud data that we have in cloud tools back into the data warehouse. This could be, again, any of the cohorts from Mixpanel or Amplitude. It could be payment data from Stripe or NetSuite. It could be customer success data from something like a Zendesk or Intercom. And then we’re going to combine that with the clickstream data and any other disparate datasets that we have in the warehouse and use our third and final pipeline, reverse ETL, to basically pull those derived traits, or user scores, or any kind of aggregated traits that we’re building in the warehouse back into those cloud tools to activate the data. So really kind of a one-stop shop. All three kind of traditional pipelines, in a sense, in a single place, really built around that data warehouse or data lake. So those are kind of our three pipelines.
Ryan McCrary: 00:04:39.569 In addition to that, you’ll see down here we do have some other features. We have transformations. This isn’t the transformation that you would typically think of as aggregate post-load transformations in the warehouse. This is going to be in flight. We allow the ability to inject a JavaScript or Python function in real time to operate on your data on a per-event basis. So we can modify that data in flight. We can do anything we would do in any of those native languages. Some things to think about would be allowlisting or denylisting certain events, or modifying or correcting schema in flight. We can mask, hash, or detect PII and modify that before it’s part of the downstream tools. And then we also have outside network access. So we can do event enrichment, whether from an internal dataset or from one of these external tools, add those data points in flight, and send that enriched data into our downstream tools. We also do a bit of identity resolution. Again, that happens at the warehouse level, but we provide some mappings around how to stitch different user identifiers or devices together in the warehouse to then unify those again into the downstream tools. So we have kind of that golden user record in the warehouse, and that can prevent the drift across different users in the various downstream tools that we’re using.
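As a hedged example of the kind of in-flight transformation described here, the Python sketch below drops a deny-listed event and hashes a PII field before it reaches downstream tools. The transformEvent(event, metadata) signature and the field names are assumptions for illustration and may not match RudderStack’s exact transformation API.

```python
# Illustrative in-flight transformation: drop deny-listed events and hash PII.
# The function name, signature, and event fields are assumptions, not necessarily
# RudderStack's exact transformation API.
import hashlib

DENY_LIST = {"debug_ping"}  # hypothetical event names we never want downstream

def transformEvent(event, metadata):
    if event.get("event") in DENY_LIST:
        return None  # returning None drops the event from every destination
    props = event.get("properties", {})
    if "email" in props:
        # mask PII with a one-way hash before it is sent downstream
        props["email"] = hashlib.sha256(props["email"].encode()).hexdigest()
    return event
```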
Ryan McCrary: 00:05:53.762 And then lastly, we touch on data governance. And so we do provide a couple of kind of proactive and reactive data governance tools where we can audit the data that’s being sent through RudderStack. We can also define via tracking plans what we anticipate sending through RudderStack. We can do some enforcement of that. We can treat violated events in different manners. Whether we block them, whether we add errors, whether we [inaudible] or queue them so they can be replayed or corrected down the road. And then we have some plug-ins as well where that can basically be added to kind of your CI workflow and linted against that tracking plan API by the developers so that we can kind of eliminate some of those mistakes before they actually happen. So again, this is kind of a high-level kind of marketing view of RudderStack. We’ll jump in a second to the actual architecture of how RudderStack is built, what happens under the hood, and then how we use Influx.
Ryan McCrary: 00:06:45.249 So moving along. Again, this is a marketing slide that we typically show. These are some of our key differences in the CDP space. We often get lumped into that kind of CDP terminology. We go back and forth as to whether we consider ourselves a CDP. At the core, we aren’t a CDP. We’re really more pipelines maintaining the data in and out of the warehouse and cloud tools. But because of the traditional CDP terminology, we kind of get lumped into that a little bit, but we don’t persist any of that data. So we kind of, for lack of a better term, are building that CDP on your warehouse. Kind of like a headless CDP, so to speak. And so the main differentiators that we’ll kind of key into, and what we use InfluxDB specifically for, are the fact that we’re warehouse-first and that we’re not persisting any of that data. So as I mentioned before, we’re just processing the data downstream. Our belief is again that the data warehouse and data lake should be that source of truth, should be that CDP, and then RudderStack is just facilitating moving data in and out of the data warehouse.
Ryan McCrary: 00:07:41.812 So how do we use InfluxDB? So here’s a little bit deeper dive into the RudderStack architecture. In a second, I have another slide that will show the actual Kubernetes pods that are deployed as part of RudderStack. But before I jump into this, I will mention we are a Kubernetes-deployed app, and we have a couple of different deployment models. So we actually started as open source, and so we’re still open core. So the Rudder server, when we talk about sending data into the Rudder server and then distributing it to downstream tools, that server is still open source. And so we have a number of customers that will deploy this on their own. The server itself is a Go binary. They’ll deploy this via Kubernetes and deploy and manage it on their own infrastructure however they want and manage that fully on their own. So that’s kind of one flavor of deployment we have: open source, kind of self-managed. We’ll get into, in a little bit, how that can bring up some challenges when we use tools like InfluxDB, when we bring in something like that. And then we have a couple of commercial offerings. So on our commercial offering, we have a large free tier. So you can go to rudderstack.com and sign up and just begin using it as part of a free tier.
Ryan McCrary: 00:08:49.372 As folks move into different commercial offerings, we also have a professional tier which would be on a more dedicated multi-tenant environment. So they would have their own basically separate environment from the free tier, but it would have some shared resources. The data’s never shared across different instances, but there are some different pods that we’ll see that can be shared across that. Our enterprise offering is going to be a single-tenant deployment. So each customer’s going to have their own tenant. There are no shared resources across those. And then lastly, we have what we call a VPC. So kind of like an on-prem deployment where the data processing layer actually happens within the customer’s VPC itself. The control plane where you put in your settings is still hosted, but all the data that’s processed from the front end or the server side or mobile apps is processed through our customer’s VPC and then on to their cloud tools. And so where that becomes really valuable is primarily if you think about the fintech space and the healthcare space, where we have customers whose data warehouse maybe is not exposed to the internet, and so the only way to get that data in is to use RudderStack in kind of a VPC-deployed model, where our team actually deploys and manages that in their VPC so that all the data’s flowing through and then can access a Redshift or something that’s in their own infrastructure that’s not going to be exposed to the internet, for any various infosec kind of requirements. So a number of different ways to deploy RudderStack.
Ryan McCrary: 00:10:09.906 And so we’ll jump in here and look at the architecture. These are a couple of different diagrams that we use. So looking on the left, this is going to show kind of how data is going to be ingested into RudderStack. We’re not really going to show how it’s sent out of RudderStack since we’re kind of discussing more of the InfluxDB side of things today, but essentially via a load balancer, this would be used to manage multiple nodes so we can basically scale RudderStack pretty much infinitely horizontally behind a load balancer. The only requirement of the load balancer is going to be to basically preserve the event ordering per user. So let’s say I come on the website and do something. Caitlin goes on the website and does something. We want to make sure that the events for Ryan as a user are processed in order, and so they need to be processed on the same node. That’s all the load balancer really does. And so on the right, we’ll see this is kind of the RudderStack processing side of things. This is the sources processing. So this is going to be the reverse ETL and ETL. And those are going to be processed in the same way that the events coming through the clickstream are going to be processed through the server here. So this is kind of where all of the RudderStack lift is happening. And then all of the stats are sent across to kind of a modified TICK Stack. So we’re not using Chronograf. We’re using Grafana as a visualization instead. But basically, the reason this becomes important is because we aren’t persisting any of that data. We still need to have some type of access to the metrics and the data around what we’re processing, how we’re processing it, the latencies to downstream tools, all of that, which we’ll get into in a second, and so that’s why this becomes important. And so we’ll talk in a second a little bit more about these specific pods. But if we look over here, we’ll kind of see the separation between the monitoring stack and the processing stack for RudderStack.
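To illustrate the one job the load balancer is given here (keep each user’s events on the same node so per-user ordering is preserved), below is a minimal sketch; hashing the user ID to pick a node is an assumption made for illustration, not necessarily how RudderStack’s load balancer is actually configured.

```python
# Illustrative sticky routing: the same user ID always maps to the same node,
# which preserves per-user event ordering. Hash-based routing is an assumption
# for illustration, not necessarily RudderStack's load balancer configuration.
import hashlib

def pick_node(user_id: str, nodes: list) -> str:
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

nodes = ["rudderstack-0", "rudderstack-1", "rudderstack-2"]
assert pick_node("ryan", nodes) == pick_node("ryan", nodes)  # same user, same node
```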
Ryan McCrary: 00:11:47.166 So the way that we actually queue events, surprisingly to a lot of people, is through a Postgres instance. And so the way that RudderStack is deployed, we deploy a native Postgres container. Postgres is used to queue all the incoming events. And then as they’re processed into the downstream tools, they’re processed into another Postgres table for processed events. And then, once processed, the ingested events are dropped. And then once the processed events are sent to the downstream tools, the processed tables are dropped as well. So we have, again, like I said, zero access to those events once they’ve been processed. Ideally, they’re existing in the data warehouse. But then to have some notion of what was processed and how, and the results of that, that’s what we’re going to use InfluxDB for. So essentially the Rudder server is going to basically kind of aggregate these metrics and then periodically sync them via Telegraf into storage in Influx. Traditionally, as a lot of folks do, we’re going to use Kapacitor for alerting around that, and then we’re going to use Grafana as our visualization layer. So this is kind of a high-level look at how we manage the deployment in RudderStack and then kind of where InfluxDB and the modified TICK Stack fits into that deployment.
Ryan McCrary: 00:12:59.638 There’s a lot of other pods in here. We’ll kind of jump over to this. And this is just a kubectl listing, basically, of the pods in a typical RudderStack deployment. This is actually a little bit of inception here. A little bit meta. We actually use RudderStack internally at RudderStack. And so there’s a bit of redundancy around RudderStack, RudderStack. Typically, this would be the namespace of the customer. This is RudderStack internally, so just forgive me for that. But you can see we have a number of different pods deployed here. This is going to be an enterprise single-tenant deployment that I mentioned before. So if you look down here, we’ll see this is the RudderStack server itself, this RudderStack zero. The zero is going to be basically what we would iterate on for multiple pods or multiple nodes. So let’s say we were scaling up for a high-volume event. You might see RudderStack zero and then RudderStack one, RudderStack two, three, four, whatever, so on and so forth. So anything outside of this monitoring namespace is going to be incremented based on those pods or based on those nodes. Excuse me.
Ryan McCrary: 00:13:59.766 So this is going to be the processing layer. This is going to be the gateway pod. So as you can kind of imagine, if we’re processing customer data, we always err on the side of data integrity over latency, right? So when we’re processing through the server, we want to process everything in real time. But when there’s an issue with the server itself, we want to err on the side of data integrity. So the gateway, the ingestion, is actually deployed as a separate service so that if any of these other pods or even all of them become unhealthy or crash, we’re still ingesting events, queuing them in Postgres, and then once everything comes back to health, we can process everything that was queued and even spin up more nodes to process that as well. This is a Postgres container that, as I mentioned, is what we use to queue the events. And then this is the Telegraf pod that’s being used to send that data over to Influx. The warehouse services are managed a little bit separately. Again, since we’re not storing that data, we actually batch-load it into the warehouse. So we have some warehouse workers that will stage that data into object storage. That could be S3, GCS, any service where we would queue that data. And then these workers are going to do that, and then the warehouse service is going to come in on a predefined schedule and point the warehouse to where those exist, and they’ll be uploaded and build the tables in the warehouse itself.
Ryan McCrary: 00:15:19.496 Then we’ll see there’s a couple other pods here. There’s sources, which is going to be how we handle the ETL. That’s done as a little bit different service. Some different proxies around how we handle the event ingestion. The warehouse Postgres is handled a little bit differently because we’re batch-loading those files being staged before they’re sent to the warehouse. And then Blendo integrations. This is part of our services. Blendo’s a company that we acquired for some of the ETL jobs. And then up here in this kind of sub-namespace of monitoring, we’ll see where InfluxDB fits in. And you’ll see in this particular one, and I’ll hit on this later, we’re always evaluating the best, most efficient technologies for our customers. And so Prometheus is probably the biggest thing that we compare InfluxDB to. And so, in this one, we’re running Prometheus in parallel just for evaluation purposes to see what works best for our customers. And again, this is a bit of an experimental one because this is RudderStack’s own instance. And then we’ll see in here kind of what you would expect. Kapacitor for alerting, InfluxDB for storage, and then Grafana for that visualization. And then the alert manager is going to be part of the Prometheus package. So that’s a typical RudderStack deployment. Again, this is going to be single tenant. In the multi-tenant or free tier, some of these resources might be shared across multiple customers. Whereas with this, we’re going to have all of those dedicated specifically to the single customer.
Ryan McCrary: 00:16:34.081 So why do we need something like Influx? So again, our core business, our core kind of value prop, is really based around that compliance perspective. So RudderStack was actually founded because our CEO was at 8x8, a large telecom company, and needed something like RudderStack that was fully compliant from an infosec point of view. And there was nothing on the market available. That’s the reason he built it. And so because we don’t persist the data, that presents an inherent problem of how do we know what happened? How do we know what we process? How do we know how long it took to process? How do we know where the data was sent? How do we know what kind of errors happened in the downstream tools? And so we need something that’s fast and reliable and kind of gets out of the way. So what we’re doing with the metrics isn’t our actual business, right? Our business is actually processing our customers’ data in a timely manner to exactly where they tell us it needs to go and making sure that we have full integrity across all of those tools. So our actual business is not in these metrics, but it’s really important to what we do because our customers do need to validate that that’s what’s happening. We need to be alerted when it’s not happening correctly, and it just gives us a visualization as to what’s actually happened.
Ryan McCrary: 00:17:46.398 So we evaluated a number of tools. Why did we choose InfluxDB? Again, we’re constantly evaluating. As you saw in that one instance, we’re actually running Prometheus in parallel just to continue to see what makes the most sense for our customers. And we knew we needed a fast, efficient time series database. And basically, we’ve been using InfluxDB since we started storing metrics. So as soon as we built RudderStack, one of the first things we realized was that, since we’re not storing those events, we do need to have some metrics around them. And so it only kind of stood to reason that we needed a time series database. And so InfluxDB is kind of what we’ve used since day one. So let’s take a look at how we actually use it. I’m not going to risk pulling up anything live as part of this webinar because that means it’s not going to work. So I’ve got some screenshots. But as I mentioned, modified TICK Stack. So we’re using Grafana for our visualization. I didn’t point it out, but there is a Grafana pod in each of the instances. And we’re actually using the open source version of Telegraf. So Telegraf is deployed as a service embedded within each one of those instances. Same as Postgres and then same as Grafana. So Grafana’s deployed inside of that and can be used to visualize those metrics. And so we provide out of the box some just kind of bare metrics around what you would anticipate seeing. So this is kind of the big picture. This is going to be over any frequency that you would select, whether it’s the last 5 minutes, 10 minutes, 5 days, 10 days, etc. We’re going to show just kind of at a high level received requests, received events. So there’s going to be, obviously, some batching here, and then delivered to downstream destinations. Gateway, again, being the ingestion, you’ll see the events ingested over time. And so a number of different things. That’s kind of high level. If we think about lower level, these are going to be the processor stage times. So this is going to tell us down to how fast or how slow we’re reading the events from the database to process them into the downstream tool. So it ranges from as high-level as just how many events we’re processing, to how long is being spent reading from the database or in pre-transformation time, and so on and so forth.
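For readers who want a feel for what a panel like this might run under the hood, here is a hedged sketch against an InfluxDB 1.x instance using the Python influxdb client; the host, database, measurement, and field names are invented for illustration and differ from RudderStack’s actual metric names.

```python
# Hypothetical example of the kind of query a Grafana panel might run against
# InfluxDB 1.x: events ingested by the gateway, bucketed into 5-minute windows.
# Host, database, measurement, and field names are assumptions for illustration.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="telegraf")
result = client.query(
    'SELECT sum("count") FROM "gateway_events" '
    "WHERE time > now() - 1h GROUP BY time(5m) fill(0)"
)
for point in result.get_points():
    print(point["time"], point["sum"])
```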
Ryan McCrary: 00:19:47.393 And then we also use it from a monitoring perspective. Thinking back to how we use Kapacitor, this is kind of an internal metric, but our customers often become familiar with it pretty quickly, where the jobs DB pending event count is what we’re going to use to monitor the health of an instance. So one of the reasons we use Postgres is because it’s reliable and it’s durable. And we’re really, for the most part, underutilizing it for what it’s capable of. So every time we’re queuing events, we’re building a maximum of 100,000 rows in a table, which for Postgres is very, very small. Everything’s append-only. When we hit 100,000 records in a table, we build a new table and start queuing into that. And then in parallel, table one is being processed while table two is being written to. And ideally, before table two can be fully written, table one is processed and dropped. And so this job — oh, sorry. This jobs DB metric that we see down here, you’ll see it hovering right around two, and then one for these others — or this one jobs DB, or batch rather, was at two for this as well. But the idea here being that two is healthy, right? So two means we have a full table and we’re writing to another table while we’re processing that other table. The subsequent table. And so if it ever climbs above two, let’s say three, that means we’ve written two full tables that we haven’t been able to process before we bring in a third table. And so from an alerting perspective, three tables doesn’t automatically trigger an alert. But over some time, if that stays at three or if it climbs above three, that would become concerning, because if that’s on the ingestion side, that would lead us to know, “Hey, the data’s not being processed as quickly as it’s being ingested. We need to add additional nodes for that.” Or if it’s on the processing downstream side, so on the router side here, if that’s climbing, that means there maybe is an issue with the health of the downstream tool and we need to spin up additional processing resources to make sure that those jobs can be handled and sent to the downstream tools. So again, just a number of different metrics. And this is all stored in Influx. This is how we’re able to monitor everything that we do within RudderStack.
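The table-rotation and alerting logic walked through above can be summarized in a short sketch. The 100,000-row limit and the “two pending tables is healthy” threshold come from the talk; the code itself is a simplified illustration, not RudderStack’s implementation.

```python
# Simplified illustration of the jobs DB health logic: rotate to a new table at
# 100,000 rows, and treat a sustained pending-table count above 2 as a signal to
# scale up. Thresholds come from the talk; the code is illustrative only.
MAX_ROWS_PER_TABLE = 100_000
HEALTHY_PENDING_TABLES = 2

class JobQueue:
    def __init__(self):
        self.pending_tables = [[]]  # oldest table is processed first

    def ingest(self, event):
        if len(self.pending_tables[-1]) >= MAX_ROWS_PER_TABLE:
            self.pending_tables.append([])  # open a fresh table, append-only
        self.pending_tables[-1].append(event)

    def drop_processed_table(self):
        if len(self.pending_tables) > 1:
            self.pending_tables.pop(0)  # table fully delivered downstream

    def pending_table_count(self) -> int:
        # analogous to the gauge sent to InfluxDB for monitoring
        return len(self.pending_tables)

    def needs_scale_up(self) -> bool:
        return self.pending_table_count() > HEALTHY_PENDING_TABLES
```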
Ryan McCrary: 00:21:49.846 So metrics, metrics, metrics everywhere. This is kind of some of our — this is just a small snapshot of some of our internal metrics. Our internal documentation around metrics, rather. So as you can imagine, as new teams are building new features with RudderStack, we’re all constantly adding new metrics. And so one of the reasons that we like InfluxDB and that it’s very usable for us is that it’s very easy to add and store a new metric. And so the team can come in here and add it, they can document it, and then determine what type of metric it is. And so this gives us just a central place for our team to understand what those metrics are as we’re building those queries for customers in InfluxDB or in Grafana. So I kind of showed some basic charts of what we use in Grafana, but really Grafana’s pretty extensible in that it’s just a query language on top of Influx. And so this is important for our customers to know, or for us to know what’s important for our customers, for each one of them individually, so that we can help build custom queries based on the metrics that are most pertinent to them.
Ryan McCrary: 00:22:47.024 So we have hundreds of metrics that we store. Like we showed before, some high level, as high as just how many events we’re collecting, to low level: processing loop times, disk read times, things like that. And the three categories that we use are counters, timings, and gauges. So timing being anything that would be multiple fields. So a 50th percentile, 90th percentile of a response time or execution or loop. Counters would be as simple as that: just a single value of a count. That could be requests. It could be cycles. And then gauges are just going to be the current state of something. And so again, it’s very easy and quick for our team to add new metrics, to track new metrics. And then this becomes the primary monitoring and alerting. So because none of that data’s in RudderStack, the way that we do all of our alerting is through the InfluxDB and Telegraf instance. These are all stored within the instance, but we also have a lot of customers that maybe use a different system, and we’re able to take those metrics from InfluxDB directly into their other system as well. So maybe they’re using New Relic or Datadog or something like that. We can relay those from the internal store, which we will still use for our alerting, but they can also have that in their native screen where they’re viewing those otherwise.
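As a hedged sketch of those three categories, the snippet below emits a counter, a timing, and a gauge StatsD-style with the Python statsd client; Telegraf’s statsd input plugin can collect metrics in this shape and write them to InfluxDB. The metric names, and the use of this particular client, are assumptions for illustration rather than RudderStack’s actual instrumentation.

```python
# Illustrative emission of the three metric categories (counter, timing, gauge)
# in StatsD form, which a Telegraf statsd input could collect and forward to
# InfluxDB. Names and client choice are assumptions, not RudderStack's code.
import statsd

stats = statsd.StatsClient("localhost", 8125, prefix="rudder")

stats.incr("gateway.requests")                # counter: one more request received
stats.timing("processor.db_read_ms", 12.5)    # timing: time spent reading the queue
stats.gauge("jobsdb.pending_tables", 2)       # gauge: current state of the queue
```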
Ryan McCrary: 00:23:59.119 So kind of wrapping up, thinking about what’s next in how we’re using InfluxDB and how we’re storing our data. We’re having to think through now the long-term storage of our metrics. So when we think about a RudderStack deployment, as I mentioned, everything’s queued in Postgres in those 100,000-row tables. And so, as you can imagine, our RudderStack deployment doesn’t grow in size very much over time, right? So hundreds of millions and billions of events can be processed through RudderStack and the deployment largely stays about the same size. And so what does grow is the metric storage. And so as InfluxDB is the tool that we use for that, as a time series database, it scales really well, but it does eventually become a bottleneck around the storage and just the accessing of those various metrics that we have. And so we’re having to think about long-term storage: do we want to consider sampling older metrics? Do we want to compress those? How do we handle that? And so we’re doing some different things across different customers, of what’s important to those customers, and then how we can kind of compress and modify that older data versus the data that they’re viewing on a daily basis around those metrics.
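One common way to approach that long-term storage question on an InfluxDB 1.x setup is a shorter retention policy for raw data plus a continuous query that rolls older points up to coarser intervals. The sketch below, using the Python influxdb client, is only an assumption-laden illustration; the database name, durations, and intervals would all need to be chosen per deployment.

```python
# Illustrative downsampling setup for InfluxDB 1.x: keep raw points for 30 days,
# and roll 1-minute means into a year-long retention policy via a continuous
# query. Database, policy, and interval names are assumptions for illustration.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="telegraf")

client.query('CREATE RETENTION POLICY "raw_30d" ON "telegraf" DURATION 30d REPLICATION 1 DEFAULT')
client.query('CREATE RETENTION POLICY "rollup_1y" ON "telegraf" DURATION 52w REPLICATION 1')

client.query(
    'CREATE CONTINUOUS QUERY "cq_rollup_1m" ON "telegraf" BEGIN '
    'SELECT mean(*) INTO "telegraf"."rollup_1y".:MEASUREMENT '
    'FROM "telegraf"."raw_30d"./.*/ GROUP BY time(1m), * END'
)
```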
Ryan McCrary: 00:25:07.105 We also have to deal with migration of metrics as the deployment model changes. So kind of as I mentioned before, we have different deployment models. Whether it’s open source, fully managed by the customer, or it’s something where they’re moving from a free tier to a multi-tenant to an enterprise, maybe to a VPC eventually. And so we do have to consider the portability of those metrics where they may live in a specific instance. Because when we think about changing the RudderStack deployment, the main concern for our customer is where and how the data is being processed, right? So where that data plane lives. Whether it’s on a multi-tenant, single tenant, or in their VPC. That’s kind of the primary concern. But again, as we upgrade those, it’s a very simple upgrade process. For customers to start sending metrics or data to a different processing location is very seamless. We’ve architected the application as such. But the problem then is we have to figure out how to migrate these metrics over as well. So we have to have some considerations around duplicating some of the data that would be in InfluxDB on a local deployment, moving that to an enterprise, or moving that from enterprise to a VPC as well.
Ryan McCrary: 00:26:13.939 And then, as I kind of alluded to earlier, any change that we make in this overall architecture kind of has to be supported forever, right? So in our commercial offerings, we can always change what’s happening under the hood. We can migrate those metrics if we wanted to, say, move to Prometheus or something. But we can never stop supporting it because we have countless customers, some of whom we know, some of whom we don’t know, who are using RudderStack in a fully managed kind of open source deployment where they’re doing all the management of the instance. They’re using one of the containers that they’ve downloaded from us. And for the most part, RudderStack is kind of a set-it-and-forget-it tool. Honestly, the more you’re thinking about RudderStack, probably the worse, right? We don’t want customers having to think about us a lot. We really want to fade into the background and really be an efficient processing layer. And so we have customers that have been using RudderStack in a self-hosted model that haven’t touched it in months or maybe even years. And so we have to make sure that we can still support what they’re using. If they’re not going to upgrade to the new pods, we need to make sure that we’re supporting that over time for them.
Ryan McCrary: 00:27:12.085 We are running up against some limitations around InfluxDB currently. The version that we’re using supports only 200 tags. So that’s something that we’re having to think about: kind of how we manage tags versus individual metrics. And then, as I said, kind of finally, always thinking about what’s most efficient, what’s most effective for our customers. So we’re using Influx, we’re happy customers of Influx, but we’re also just always making sure that we’re not just doing what we’re used to and what we know. So that’s kind of it from my side. I don’t know, Caitlin, if there’s anything you wanted to add or if we want to open up for questions.
Caitlin Croft: 00:27:46.419 That was great, Ryan. So if anyone has any questions, please feel free to post them in the Q&A or chat. I actually have some questions, Ryan. One is a comment. I think it’s really great that you talked about compression on this page, and what to do with the older data. Because that’s pretty common with time series projects where, when you start off, you need a ton of data because you don’t know where the interesting points are. And then as you go, you’re like, “I don’t actually need second granularity on this data that’s six months old, but I still want to keep it for trend analysis. What do I do with it?” So I think it’s a challenge that probably a lot of customers are trying to figure out. So I’m sure you guys aren’t alone there. Have you guys looked at Flux for any of this?
Ryan McCrary: 00:28:42.827 We have. We’ve used Flux before. Again, it’s not uncommon for us to run different tooling in parallel. I think we used Flux much more heavily earlier on. But yeah, you’re absolutely right. I mean, this is the consideration for time series, right? It’s really good at what it’s good at. It’s really good at the second or microsecond granularity. Yeah. So for us, we have to figure out over time - and it varies per customer, right? - that the second or microsecond matters today. It probably doesn’t matter what happened last year, but we need to know at what level do we want to compress that to get that trend analysis and to have some level of validity around that. So that’s where the sliding scale is for us: it varies per customer, along with their needs and the different performance that they’re looking for and what metrics they’re tracking and what alerting they’re doing around that.
Caitlin Croft: 00:29:37.868 Yeah. That’s an interesting point because I was just going to kind of ask you, given that you guys are already looking to downsample your data, what was sort of that sweet spot for the granularity that you actually needed? But it sounds like it differs based on the customer needs.
Ryan McCrary: 00:29:55.039 Yeah, I think out of the box, and don’t quote me on this, there’s some default timeline where we start to compress it. I think it’s either 10-second or 1-minute intervals by default. For most customers, that really suffices across a lot of their metrics. But one of the benefits of our deployment model is that, in the open source, we don’t necessarily get all of this, and in the free tier as well. But when we think about an enterprise deployment, it’s all operated from a Helm chart with a lot of configurability. And so since those are single-tenant deployments, we’re able to really control how each customer’s instance handles a lot of different things, right? So in relation to Influx, we can handle that compression, or what metrics are or aren’t compressed, or even what metrics are or aren’t tracked. As you can imagine — I’m kind of digressing here, but as you can imagine, a lot of customers use us for the compliance side of things, right? That we don’t persist that data. That we don’t have any access to it. The data’s just passed through; we’re not generating another silo. That does create problems though, right? It really makes visibility hard. And so there’s always this kind of tug, this give and take, of they want us to not be storing the data, but they want access to more and more things in the metrics layer. And so we really have to fight to make sure that there’s no PII, there’s no event-level data being sent across that. But when you think about something like errors, right, we track errors as part of our metrics as well when there’s an error in downstream tools or if there’s an ingestion error. It becomes really tricky as to what we can store. And it might not be in — it might not be something in Influx, but in a larger sense of the instance, they want to see access to those error codes. As part of that message, it may have some PII or even the payload that’s causing the message. And so for some customers, they’re actually okay with that, and so we can do some additional storage around those error payloads. Whereas some customers with full compliance needs can’t do that. So we can adjust the level of granularity. And that’s even beginning to happen on the metric side. So even if customers want to say — there’s a lot of asks around showing what users or what user IDs are having issues and storing those as metrics. And so there is a pull towards that. But that becomes an issue where, for some customers, that’s just an absolute no-no. And so we have to really think through all of that. But kind of going back to where I started, the configuration gets very granular as to what data points we are putting into metrics, what metrics we’re storing altogether, and then even as low level as how we’re compressing that. And so that’s kind of something that we thought through early on in how we deploy it. And that really gives us some of that granularity and really extensibility on how we’re going to compress or how we’re going to sample it for different customers and their different needs.
Caitlin Croft: 00:32:44.540 So did anyone on your team have experience with time series tools before you guys had implemented InfluxDB?
Ryan McCrary: 00:32:52.083 I’m sure they did. I’m not fully on the engineering side, but our engineering team has breadth and depth of experience across the board. And so, yeah, I mean, again, I think the two kind of competing technologies from our sense, and competing in a friendly way, would be Prometheus and Influx. And so I think a lot of it probably boiled down to familiarity with InfluxDB and with kind of that modified TICK Stack. And then one of the points that is a differentiator for our stack, that I kind of skimmed over, is that we are very much developer-focused as a tool. That’s kind of one of our differentiators. And so InfluxDB is very much more — InfluxDB and Grafana are very much more geared towards developers. Folks that want to do some tinkering, right? They want to get in and build different queries and they want to build different things and then alerting around those specific things. And so I think just the overall extensibility of kind of that whole TICK environment, which I keep saying, but we’re not fully using because of the Grafana piece, is what drove a lot of that early on, just towards tools that developers are already using.
Caitlin Croft: 00:33:57.092 Yeah. And I would say — not to toot our own horn, but we definitely are able to scale out a little bit more than the other tool that you mentioned. Another question I had, when you guys were early on implementing InfluxDB and still kind of figuring things out, what were some of the aha moments? Was there something where you started pulling the data into InfluxDB, and you’re like, “Oh my gosh. That’s when that happened”? Like, “We knew something was awry, but we didn’t know when it happened or weren’t able to correlate it.” Were there any moments like that?
Ryan McCrary: 00:34:36.062 Yeah. I think the biggest moment — and our engineering team would probably have different answers. But for me, it was kind of what I mentioned on here. The ease of adding a new metric, right? Excuse me. When I initially started, our Grafana dashboards looked very different. When I first started, we had very few commercial customers. It was very much more geared towards the open source community. And our kind of out-of-the-box Grafana was just enough, right? It was just enough almost for our engineering team to prove validity themselves. To make sure, “Okay. Events in is the same as events out. Router processing is this.” Whatever. And so, kind of to some of those more in-depth charts that I showed, our team realized, “Okay. Once we’re instrumenting something like Influx, InfluxDB doesn’t care what we’re sending to it.” And so adding another metric is really not only valuable to the customer to visualize what they’re seeing, but also for our team for debugging. So we find ourselves where we traditionally would have gone directly to kubectl and to pods, which now that we’re having to go through compliance, we’re not actually exec’ing into pods anymore, but where that would typically have been where an engineer or developer would go to say, “Hey, let’s go look at the Kubernetes pods themselves,” we found ourselves going much more to looking at Grafana because we were sending more and more metrics and were able to say, “Hey. When we’re in Grafana, we’re looking for the health of this. We can’t see X. Well, just go add a metric for it and start tracking that right now.” And so that would allow us to start to really build out how we could monitor the health of customer pods around Grafana and Influx, versus having to go all the way to the metal. And so that was kind of an aha moment of, “Hey, this isn’t just valuable for customers seeing data in and data out, but it’s also valuable for us as engineers,” because we can determine if there’s an issue with the database read time of processing events from Grafana, because it’s in InfluxDB now. And because, again, we’re really focused towards those developers and towards kind of the data engineering persona, those are folks that think the same way, right? So when they’re thinking about the health of a pipeline, they’re not always just thinking about the marketing persona, where it’s like, “Okay. How many events are we processing and then what’s in the downstream tools?” They’re really thinking, “Okay. What are we doing that’s inefficient and how can we scale the deployment more efficiently?” So I would say that was it for me: when we realized this isn’t just a tool for the customers to validate the data in and out, but really getting much closer to the health of the overall deployment itself.
Caitlin Croft: 00:37:02.718 Yeah. Absolutely. And I think it’s great that you guys are open source as well. It’s fun working with other open source organizations. I know that you’re currently using InfluxDB open source, but have you looked at any of our commercial offerings —
Ryan McCrary: 00:37:22.034 We have.
Caitlin Croft: 00:37:22.116 —[crosstalk] —?
Ryan McCrary: 00:37:23.572 Yeah. We have. I mean, they’re fairly new. I mean, new insomuch as how we’re building everything out. But in general, I know that there is kind of that bent towards open source because a lot of times we’re deploying these things as deployed services. So I think for us, it makes the most sense, for now, to continue to use the kind of embedded version. But again, like I said, we’re always evaluating what makes the most sense for our customers. And as we grow and as we work with different types of customers, we are seeing folks that are forwarding their metrics into other services. And so there’s also the possibility for that, where if they’re hosting a cloud-hosted InfluxDB somewhere, we could begin to forward those as well. So that then they have their own kind of granular access to maybe what they’re processing through RudderStack and Influx, but also maybe other services that they have in InfluxDB, so they can do that all in a single place, over having to go to bigger [inaudible] for RudderStack and then whatever else they’re actually using to monitor their other business systems.
Caitlin Croft: 00:38:21.554 That’s awesome. Thank you, Ryan. This was amazing. We’ll just keep the lines open here for another minute or so if anyone else has any questions for Ryan. Just want to remind everyone this session has been recorded and so it’ll be made available later today or tomorrow. So I’m sure there’ll be some people who want to check it out. All of you have my email address. So if you have any questions for Ryan that you forgot to ask or whatever, please feel free to email me. Don’t be shy. I’m more than happy to connect you with Ryan. And once again, I want to remind everyone check out InfluxDB University. There are so many amazing courses. And the team is continuously adding to the course catalog and you can get certifications for courses completed. So I don’t know, it’s pretty cool going on to Twitter and LinkedIn and all that and seeing people complete those courses and get those badges. So be sure to check it out. All right. Well, thank you everyone for joining today’s webinar. I hope you enjoyed it. Ryan, any last comments or thoughts you’d like to share with the group?
Ryan McCrary: 00:39:35.422 I think that’s it. I appreciate the questions at the end. If anything comes up, I’m just [email protected]. So any questions, happy to clarify, or any questions about how we use InfluxDB or RudderStack in general happy to help if we can.
Caitlin Croft: 00:39:48.701 Awesome. Well, thank you everyone, and thank you, Ryan.
Ryan McCrary: 00:39:52.514 Thanks, Caitlin. Thanks, guys.
Ryan McCrary
Senior Sales Engineer, RudderStack
Ryan is a software engineer with a knack for data. Currently, he is helping customers solve the hairiest of customer data problems, using the data warehouse as the source of truth and helping replace brittle legacy data pipelines with easy-to-manage infrastructure so customers can focus on machine learning models and real-time personalization. He is an engineer but spends a lot of his time working directly with customers face-to-face (really Zoom-to-Zoom).