Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Node.js, AWS and InfluxDB
Session date: Jun 07, 2022 08:00am (Pacific Time)
Pinnacle 21 is a leader in clinical trial data software and services. By streamlining the drug approval process, they aim to bring life-saving medicines and treatments to patients faster. Their platform helps biopharmaceutical organizations collect and prepare all clinical trial data so that it is ready for regulatory review. Their goal is to create clean data pipelines for their clients that result in successful regulatory submissions. Organizations like the FDA and Japan’s PMDA, as well as 22 of the top 25 pharma companies globally, use the solution to validate clinical trial data. To ensure they are providing the best product to their clients, Pinnacle 21 realized they needed observability into their apps, servers and application availability over HTTP. Discover how Pinnacle 21 reduced their monthly infrastructure monitoring spend by using Telegraf and InfluxDB.
Join this webinar as Josh Gitlin dives into:
- Pinnacle 21's approach to improve clinical data pipelines
- Their automated DevOps monitoring methodology including Chef
- How a time series platform provided them with better analysis - customized based on data source
Watch the Webinar
Watch the webinar “Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Node.js, AWS and InfluxDB” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Transcript
Here is an unedited transcript of the webinar “Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Node.js, AWS and InfluxDB”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Caitlin Croft: Sr. Manager, Customer and Community Marketing, InfluxData
- Josh Gitlin: Director of DevOps, Certara
Caitlin Croft 00:00:03.950 Hello, everyone, and welcome to today’s Webinar. My name is Caitlin Croft. I’m really excited to be joined by Josh, who is here to talk about how Pinnacle 21 uses InfluxDB. Once again, this session is recorded and will be made available. Please put any questions you may have for Josh in the Q&A. And I just want to remind everyone that we just want to make sure that all these webinars are a safe, fun place for all attendees and speakers. So please be respectful of that. And without further ado, I’m going to hand things off to Josh.
Josh Gitlin 00:00:42.193 Thank you, Caitlin. Welcome, everybody. So my name is Josh Gitlin. I am the director of DevOps at Certara, and I’m going to be discussing how we’re using InfluxDB to monitor our AWS infrastructure and optimize the product that we provide to our customers. So first, a brief overview of Pinnacle 21 and who we are, what we do. Pinnacle 21 is a software company that specializes in life sciences solutions. So what does that mean for anyone not familiar with that? The primary product Pinnacle 21 makes is called Pinnacle 21 Enterprise. This is a piece of software as a service, so it’s a web application, and it’s used by major life sciences companies or pharmaceutical companies to validate their clinical trial data. So Pinnacle 21’s customers are developing treatments, medicines, devices, and they go through clinical trials where they give people either the treatment or a placebo, and then they validate this data through Pinnacle 21 Enterprise to make sure that it meets the data standards established by CDISC. CDISC is the Clinical Data Interchange Standards Consortium, and they have a variety of technical standards for how clinical trial data should look. This includes things like SDTM, ADaM, etc., and these define relationships between different data tables and how the data should look.
Josh Gitlin 00:02:22.874 The FDA and the PMDA and other regulatory agencies are going to want to make sure that when a clinical trial for a new medicine is submitted to them, that the data matches the standards. And Pinnacle 21 Enterprise is the software that these regulatory agencies use to validate. So our customers use the software to make sure that their data is going to look correct when it goes for approval for a new drug. The best layman’s explanation I’ve heard of what it is - it’s like spell-check for your clinical trial data. Pinnacle 21 was originally incorporated as a privately-owned company, and last year it was acquired by Certara, which makes a number of different pieces of software for the life sciences and pharmaceutical industry. So a brief background about myself. I joined Pinnacle 21 in the beginning of 2020, right before the pandemic, as their principal DevOps engineer. Previously, I had no background in the life sciences space. Almost exclusively, I was in e-commerce and retail. I’d been with Amazon.com as a senior systems engineer, and I developed a software-as-a-service e-commerce and content management system called Site Palette at my company Digital Fruition, which is still around but smaller than it was many years ago. I joined Pinnacle 21 with the goal of improving automation, improving monitoring, and taking my knowledge of operating at scale to optimize our infrastructure and improve the delivery of our software to our customers.
Josh Gitlin 00:04:05.832 So the need for a solution. What did I find? When I first joined Pinnacle 21, we were using a product called Datadog, which some of you may be familiar with. Datadog was selected by the CTO as the easiest, lowest-resistance path for monitoring all of our servers. It was a single install command. He could blast it out to all of the servers. It automatically collected the data we needed, got logs and monitoring dashboards in the application. I was a little frustrated with the tool when I first joined. I had come from using Grafana with an InfluxDB back end at previous roles, and loved Grafana. And I had some friction with Datadog. I found some critical functions were just missing. I couldn’t label a Y-axis. Some metrics didn’t have units. I couldn’t compare two metrics the way I could in Grafana. I just wasn’t quite thrilled with the tool. And I wasn’t alone. The existing monitoring solution that had been implemented was not well adopted by the team at Pinnacle 21. Very few of the engineering people were using it, customer success used it but found it cumbersome, and it was expensive. It was costing the company over $65,000 a year. It was priced per server, so there was a fixed price in our contract. And Pinnacle 21’s infrastructure scales out horizontally, with each customer getting their own instance, which made this a fairly costly solution for us.
Josh Gitlin 00:05:58.173 So I proposed to the CTO to replace it and look for some alternatives. Things we needed to consider in replacing Datadog: we had to have something that was easy to implement. That was the primary appeal of Datadog - it was very simple to put on all the servers. One of the things that I had been doing since I joined Pinnacle 21 was writing automation software in Chef. And because we were using Chef - specifically, we’re using CINC, which is the open-source version of Chef, C-I-N-C, CINC Is Not Chef - it was going to be fairly easy to replace Datadog with any other monitoring solution of our choice because we could just write cookbooks and recipes and deploy that to all of our servers. A replacement solution needed to capture both metrics and logs. And in the InfluxDB Slack, I did ask - some people said you could use it for logs. I wasn’t super happy with Grafana’s log panel. I felt like it didn’t have feature parity with Datadog. And so for the log portion, we ended up going with an external vendor hosting it on the ELK stack. But for metrics, Telegraf was absolutely more than capable.
Josh Gitlin 00:07:28.652 We definitely needed an externally-hosted solution. This is both for the heavy audit and compliance requirements that we have. We need to make sure that our metrics are hosted in a location that is free from tampering, that we can say, look, this data is hosted by somebody else, we can’t get in there and adjust figures or anything, nobody has access to the actual servers. But it was also very important for us that we didn’t have to manage the infrastructure of another tier-one monitoring solution; that was offloaded to InfluxDB or whoever we chose. Datadog had 10-second granularity, which was actually excellent for what they offered us. Telegraf has the ability to do that as well, and I’ll get into granularity later on. Definitely needed to have APM metrics. So this is application performance monitoring, seeing what’s happening in the application, what users are doing, what portions of the application are taking the most CPU or memory time, and that was going to be vital for us to improve the software. And the final consideration, Datadog was doing either active or synthetic HTTP monitoring, meaning it was connecting to each of our customers’ instances, making sure that they were responsive, and using reports of uptime for SLAs, etc. And that was something else we would have to replicate, as well, in any replacement solution.
Josh Gitlin 00:09:07.914 So evaluating InfluxDB and Grafana as a replacement. InfluxDB Cloud has an awesome pay-as-you-go plan. And so in my initial calculations, I figured it could be significantly cheaper even when paired with another vendor to do the log hosting, the ELK stack hosting. It worked very well for us because we pay exactly for what we use. So we had control over what our monthly bill would be, and we could evaluate each metric we choose to send from Telegraf and decide if it was worth paying for. Telegraf, if anyone has not used it, is a fantastic data collection agent. It even can send to Datadog if desired. It has plugins, a huge library of plugins. Out of the box, it was able to collect more than the agent that we previously had, and then you can extend it easily with exec plugins or file or tail plugins to gather any metric you can possibly imagine from your machines. So an easy solution there. We wrote Chef recipes to deploy it everywhere and manage it. So that is actually more powerful than we had before with the single command to just install the agent. And as I mentioned, we went with a hosted ELK stack for the logs. The active HTTP monitoring part is more interesting. We needed to build something custom in-house, but we did. It’s based on InfluxDB as the back end and uses Grafana alarms to replace the alerts we were getting for any instances that are down.
Josh Gitlin 00:10:54.678 So technically, how do we get into all of this? What was the actual design? I’ll first show a high-level architecture diagram and then I’ll get into the nitty-gritty. So each of our - oops. If I cannot scroll with my mouse, that would help. Okay. Thank you. Okay. So each of our customers is located in AWS. We have each customer within their own EC2 security group. Each customer has two instances, a web server and an application server. We have Telegraf on both of these instances publishing to both an internal InfluxDB instance as well as to a GCP-hosted InfluxDB Cloud instance. We located the HTTP monitoring inside our Ops VPC. It checks all of our instances and again publishes the metrics to InfluxDB Cloud as well as our internal InfluxDB instance. We went with Grafana Cloud, again in GCP, thinking that if there’s an issue with AWS, we have redundancy through a different cloud provider. And then the users, the engineers at Pinnacle 21, access Grafana Cloud through their browser. Grafana is pulling data from either InfluxDB Cloud or our internal InfluxDB, and it works really well; we’re very happy with it.
Josh Gitlin 00:12:24.057 So how do we approach this task? To start with, it was very easy to just get an InfluxDB Cloud account and start prototyping. So I signed up, used the free trial, hopped on one of our development servers, and just installed Telegraf and turned on all the plugins because why not, right? Why not? Because it’s very expensive if you do that. So I went through, showed that we could collect everything we need and more, and then took a look at the data usage dashboard. For anyone who’s not familiar with this, you absolutely should be. It is a critical tool in your pay-as-you-go InfluxDB Cloud plan. It shows you the data you have coming in, the queries you’re executing, and your data storage. So this is very useful as you’re scaling things out. And I looked at this a lot. With the initial instance, I would turn metrics on, let them run for an hour or a day, review what they were doing to the actual cost, and then use that to figure out, okay, if we multiply this by all of the servers in our fleet, what is our actual cost going to be, and use that to fine-tune which intervals we wanted for each metric. So this is what the InfluxDB data usage dashboard looks like. There’s a link on there, but you can find it right on Influx’s website or you can google for it.
Josh Gitlin 00:14:04.131 It’s similar to what you see if you look at the billing tab of InfluxDB Cloud, but it is a lot more granular. You can select a particular time range, you can zoom in, and all of these metrics are available for you to play with in Influx if you want to dive deeper and find out where your usage is coming from. So [inaudible] must if you’re an InfluxDB Cloud pay-as-you-go user. So I installed it on a single server, found out which metrics we cared about, created some proof-of-concept graphs in Grafana, and the CTO said, “Great, looks good, let’s scale this thing out.” So this is where I get into the Chef portion of it. And we already had a base cookbook included in all of the cookbooks we have. So for anyone who is not familiar with Chef, you write recipes in Chef, you put them in cookbooks, and then in our particular case we have a policy-file-based workflow, which means for each type of server - we have the web servers and the app servers - you have a policy. That policy defines which recipes, which pieces of automation you want to run. And so this policy for both the web and the app servers would include the monitoring cookbook, which comes with a base set of things we want to monitor through Telegraf. Then we can extend that and monitor, on the web server, the actual web application and the NGINX reverse proxy, and on the app server, the application that does the bulk of the heavy lifting.
Josh Gitlin 00:15:51.026 So installing the package, we use JFrog Artifactory. There’s a variety of ways you could do this instead. You could just download it through cURL using a remote file resource in Chef. In our case, it was as simple as a package resource because we just download the Telegraf package, load it into Artifactory, and it’s available as an RPM artifact. Telegraf, out of the box, has a telegraf.d directory inside the /etc/telegraf directory. This is great because you can write individual plugin configurations in there, and Telegraf will pick them up. So it’s really useful when you’re automating a Telegraf installation via Chef because if you have different policies or different roles for your servers and you want to monitor different things on them, it becomes really difficult to manage that in a single long telegraf.conf file. So we chose to have each plugin write a configuration file in that telegraf.d directory, and then what we have is a node attribute listing which plugins, which things, we want Telegraf to monitor. We loop over that - we iterate over that data structure in the Chef recipe. And for each thing in the node attribute, we write out the appropriate configuration for the monitoring of that component. That’s great because it also allows us to customize the configuration for each Telegraf input. So if we need higher granularity for CPU on the app server versus the web server or something like that, we have the ability to do that per server.
Josh Gitlin 00:17:39.955 So what does this actually look like in code? You can see that the main Telegraf recipe here will actually iterate over each of the plugins in the node attribute, and for each one, you see it’s writing a file inside the telegraf.d directory with the name of that input. And so this list of Telegraf plugins, by default, comes with a certain set of plugins that we selected as the right ones, but then those can be added to or subtracted from on a per-policy-file basis. For anyone who is into Chef and isn’t familiar with this, it is much better to structure your node attributes as hashes and not as arrays. So you’ll notice that what we’re doing here is we actually have the key of this hash as the name of the plugin, and the values in the hash are the configuration of those plugins. The reason for that is that when you’re stacking node attributes, arrays will be deep merged and you probably don’t want that. It’s not easy to turn things off. For example, if we wanted to turn off the systemd input, you can’t do that as easily with an array; if it’s a hash, or what Chef calls a mash, you can just set that key to nil. And so what we’re doing is we’re filtering out anything that’s nil or false so that we have control of what we’re monitoring at a very granular level.
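The cookbook code shown on the slide isn’t reproduced in this transcript, but the pattern described above can be sketched roughly as follows. The attribute names, plugin list, and template paths here are illustrative assumptions, not Pinnacle 21’s actual cookbook.

```ruby
# attributes/default.rb - plugins as a hash (mash) so policy files can override
# or disable individual entries by setting them to nil/false
default['monitoring']['telegraf_plugins'] = {
  'diskio'        => {},
  'net'           => {},
  'interrupts'    => { 'interval' => '60s' },
  'systemd_units' => nil # disabled here; a policy file can re-enable it
}

# recipes/telegraf.rb
package 'telegraf'

service 'telegraf' do
  action [:enable, :start]
end

# Iterate over the node attribute, skipping anything set to nil or false,
# and render one drop-in configuration file per plugin into telegraf.d
node['monitoring']['telegraf_plugins']
  .reject { |_name, config| config.nil? || config == false }
  .each do |name, config|
    template "/etc/telegraf/telegraf.d/#{name}.conf" do
      source "telegraf_inputs/#{name}.conf.erb"
      variables(config: config)
      notifies :restart, 'service[telegraf]', :delayed
    end
  end
```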
Josh Gitlin 00:19:13.114 Then you can see this is an example of the interrupts Telegraf plugin file. And so this one is looking at particular configuration options from that hash. So we’re passing them in as the variables line here, and you can customize the interval if you wanted to for this particular host, and then it’s adding any particular tag overrides that we might want for this plugin. And that’s something I’ll get into later because this allowed us to customize with high granularity where our metrics are going to, and enabled us to prototype new metrics by sending them to our internal InfluxDB server instead of the cloud server. So what is that base monitoring set that we selected in Chef? Monitoring disk IO, the ethernet interfaces, interrupts, network traffic, Telegraf internals, systemd units, netstat, and an InfluxDB listener. This is on top of a base set that can’t be customized per server. That includes things like CPU, memory, some kernel metrics, things like that. We’ll never want to turn those off, so those are always on. So this is the set that I reviewed and found that some of these were really great data, but not worth the price. And some of these were things that we probably wanted for certain servers but maybe not for others, so we might want to turn them on or off, like interrupts is something we might not care about.
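For reference, a drop-in file like the interrupts one described above might render to something like the following. The interval and tag value are assumptions for illustration; the routing tag is explained in the next section.

```toml
# Hypothetical /etc/telegraf/telegraf.d/interrupts.conf rendered by the Chef template
[[inputs.interrupts]]
  # Per-plugin interval override; other inputs inherit the agent-wide default
  interval = "60s"
  # Extra tags attached to every metric this input produces - here, a routing tag
  # that the output plugins use to decide which InfluxDB instance receives the metric
  [inputs.interrupts.tags]
    influxdb_destination = "internal"
```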
Josh Gitlin 00:20:52.970 I’ll pick one of these victims, not at random: systemd units, an excellent input from Telegraf. It gathers all the data about every systemd service that’s running. It generates a lot of data. Very expensive if you’re sending this to InfluxDB Cloud. This is one we’re really not going to need very often, but it’s one that we might want in specific circumstances. So how did we deal with that? There’s information online. I couldn’t find the actual link when I was preparing these slides, but if people are interested, I can find that. It was somewhere on the InfluxDB forums, I believe. But basically, this is the process. Telegraf has a section in each output plugin called tagpass. And you can configure particular output plugins - in this particular case, this is our InfluxDB Cloud output - and we tell it to only pass metrics that have a tag, an InfluxDB destinations tag, with a value matching this glob string. So this is an InfluxDB Cloud value I invented, and we exclude that tag from actually being published to the InfluxDB server. So this is a tag we don’t care about when we’re looking at metrics in Grafana. We’re tagging metrics, as they’re collected by an input plugin, to decide which instance they get published to. You’ll see this particular case - this is an exec plugin. This is a critical metric for us. This is the activity of Pinnacle 21 Enterprise. We need this one on the cloud because we actually care about audits and showing that we have this information.
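A rough sketch of that routing scheme in Telegraf configuration follows. The tag name, tag values, URLs, script path, and bucket/database names are invented for illustration (the actual values aren’t shown in the talk), but tagpass and tagexclude are standard Telegraf metric-filtering options.

```toml
# An input tags its metrics with the destination(s) they should reach...
[[inputs.exec]]
  commands = ["/opt/pinnacle21/bin/report_tasks.rb"] # hypothetical KPI script
  data_format = "influx"
  [inputs.exec.tags]
    influxdb_destination = "all"

# ...and each output only passes metrics whose routing tag matches its tagpass
# list, stripping that tag (tagexclude) before the metric is written.
[[outputs.influxdb_v2]]
  urls = ["https://us-central1-1.gcp.cloud2.influxdata.com"] # InfluxDB Cloud (GCP)
  token = "${INFLUX_TOKEN}"
  organization = "example-org"
  bucket = "telegraf"
  tagexclude = ["influxdb_destination"]
  [outputs.influxdb_v2.tagpass]
    influxdb_destination = ["cloud*", "all"]

[[outputs.influxdb]] # internal InfluxDB instance
  urls = ["http://influxdb.internal.example:8086"]
  database = "telegraf"
  tagexclude = ["influxdb_destination"]
  [outputs.influxdb.tagpass]
    influxdb_destination = ["internal*", "all"]
```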
Josh Gitlin 00:22:45.951 So we’re passing this to both our internal server as well as our InfluxDB Cloud server. So by putting tags on each Telegraf input, you can select which of your InfluxDB servers you want to send those metrics to, which is very helpful. This also enables us to collect higher granularity metrics for a particular input. We could, for example, say we want 10-second memory metrics sent to the internal server, but we only care about 30-second metrics or one-minute metrics for memory when we’re sending them to the cloud, where we pay for storage and data input. So it’s much more powerful than what we had with Datadog. We can really finely tune what we’re collecting, how often we’re collecting it, and where we’re sending it. So we rolled that out everywhere, we’re getting great system metrics, and now we needed to look at some of the actual key performance indicators. How are the servers doing their job? So my recommendation there: parse the NGINX logs. In our case, because of Datadog, we were already writing NGINX logs in JSON format. So we’re using the Telegraf tail plugin with the JSON formatter to just pick those up, select individual fields we wanted, and send those to InfluxDB Cloud. You can write a grok format if you have Apache or NGINX logs in common log format or extended log format, one of those, and grok out the individual fields from that log and send them to InfluxDB. In our particular case, we mostly care about response time and HTTP status code.
Josh Gitlin 00:24:42.583 We’re tagging part of the path so that we can break down graphs by path name, etc. Monitoring the actual application KPIs, we use a data transfer system called IBM Aspera for customers to upload these very large pseudo-anonymized data sets for validation. Aspera, we configured using Chef to write its logs to a particular path where Telegraf can again pick these up using a tail plugin. That enabled us to generate graphs showing average data set size - how big the data our customers are uploading is - the speed at which this data is uploading, how many uploads are in progress, etc. Really, really valuable information for the business because the data sets can take a long time to upload and validate. And this really helped the business side understand what the impacts of various data set sizes are, how our customers are utilizing the application, etc. And then that exec plugin I showed earlier actually calls the application APIs to gather information about what tasks are currently running in the application and then puts those on a dashboard. Very, very helpful for customer success. So if somebody sends in a ticket and says, hey, things are slow or this isn’t working, they can pull up a dashboard and see all the tasks that have been running. Is every task failing or just some of them? It gives them information that can be correlated with system metrics, better than having one tab open for the application, another tab for the logs, another for graphs, and trying to correlate back and forth.
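As a minimal sketch, the tail-plugin setup described above for JSON-formatted NGINX logs might look roughly like this (the Aspera logs are picked up the same way with a second tail input); the file path, measurement name, and tag keys are assumptions:

```toml
# Hypothetical telegraf.d/nginx_access.conf
[[inputs.tail]]
  files = ["/var/log/nginx/access_json.log"]
  from_beginning = false   # follow the log and only read newly appended lines
  name_override = "nginx_access"
  data_format = "json"
  # Numeric JSON keys (e.g. request_time, body_bytes_sent) become fields automatically;
  # keys listed in tag_keys become tags you can group and filter by in Grafana.
  tag_keys = ["status", "uri_group"]
```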
Josh Gitlin 00:26:23.743 In this case, we use the Ruby InfluxDB gem, which fit perfectly because Chef is a Ruby product, so we could just use the embedded Ruby in there, install the Ruby gem within our Chef Ruby gems library, and then this plugin picks it up from there and publishes the metrics right to Telegraf using InfluxDB line protocol. And that’s the Telegraf InfluxDB listener that I mentioned earlier that we configure on every host. So what does that actually look like? Actually, I have that in the next slide. We also gathered application performance monitoring metrics. We’re a Java application, Pinnacle 21 Enterprise. So our engineering team located a piece of software called inspectIT Ocelot. This is a Java agent, so it’s a jar file that could be loaded in with our package. You load it in with a Java command-line switch. Out of the box, it collected JVM metrics, memory usage metrics, thread counts, things like that. It can publish natively to InfluxDB. So again, we’re publishing that right to the Telegraf listener on the machines. And then it enabled our engineering team to write code for capturing individual events like function runtimes or calls of particular pieces of the application. Really helpful for the engineering team to be able to go in and optimize the software, figure out which function calls are taking a long time, or how to improve the actual performance of the application.
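The Telegraf listener those local publishers write to is a small input along these lines; :8186 is the plugin’s documented default port, and the rest is an assumption about how it might be set up.

```toml
# Hypothetical telegraf.d/influxdb_listener.conf
# Accepts InfluxDB line-protocol writes over HTTP, so local publishers - the Ruby
# KPI script and the inspectIT Ocelot agent's InfluxDB output - can write to
# http://localhost:8186/write and have Telegraf forward the metrics onward.
[[inputs.influxdb_listener]]
  service_address = ":8186"
```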
Josh Gitlin 00:28:00.409 Some of these trials can take hours and hours to validate large data sets. Like the COVID-19 data sets could potentially take days because they’re just so massive. And so saving any bit of time on these validations can really make a big business impact. It can really help turn these validations around faster, and ultimately, that helps get the data to be more accurate and gets the information to customers - or the medicines to the customers faster. So here’s what the actual dashboards that we set up with those look like. So we have application usage metrics. These are pulling some of those. These are the tasks being pulled from the KPI exec plugin. These are the tasks that are being pulled from inspectIT Ocelot, from the actual Java code itself. And then you can see JVM metrics there with heap size and thread counts and things like that. So we have system monitoring, we have our servers pretty well covered, and the next thing to do was to replace our active HTTP monitoring system. So this was a service that Datadog was providing, where they were connecting to each of our servers every few minutes, making sure that they were responding with a 200 OK, looking for particular content on the page. You can do this with Telegraf, and that was my first approach. I spun up an EC2 instance outside of our actual customer instances, because obviously, you can’t monitor the instance from within itself. You want something that’s outside so if there’s a network connectivity issue or the VPC is down or somebody pushes a bad firewall rule or something like that, you’re catching that; also, you want to test end to end.
Josh Gitlin 00:29:57.134 So it’s possible for Telegraf to do it. I wasn’t personally thrilled with the solution. I didn’t get quite all the data I was looking for. It wasn’t as configurable as I wanted. The CTO said he would actually like to see if we could do this in AWS Lambda, and that was the approach that we took. So I created a small Node.js application. Node is the perfect solution for this because of its single-threaded but event-driven architecture. Each HTTP request was handled by a separate part of the Node process, and the JavaScript code simply spun up all these requests as promises. And as the results came in, we checked the timing information, checked the status code, and looked through the content, making sure that we see the actual text we’re expecting. We used the InfluxDB client from InfluxData, the Node.js client, to publish these metrics, in this case directly to InfluxDB Cloud, because we don’t have a Telegraf instance there. This is Node.js code running serverless. And then we run this using AWS CloudWatch. We trigger it to run every minute. It’s able to connect to our CINC server and pull a list of all of the URLs that need to be checked. And then it’ll iterate over those, make HTTP requests to each URL, look at the result, publish that to InfluxDB, and then we use Grafana alarms on that, checking to make sure that the response time is acceptable, that the response was a 200 and not a 503 or anything like that, and then we get this nifty little dashboard.
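The Lambda code itself isn’t shown in the webinar, but a minimal sketch of the approach described - concurrent checks spun up as promises, results written with the official Node.js client - might look like the following. The environment variable names, the URL source, and the expected content string are assumptions; as noted above, the real service pulls its URL list from the CINC server.

```typescript
// Minimal sketch of a serverless HTTP checker (not the actual Pinnacle 21 code).
// Assumes Node.js 18+ so the global fetch API is available.
import { InfluxDB, Point } from '@influxdata/influxdb-client';

const influx = new InfluxDB({
  url: process.env.INFLUX_URL!,   // e.g. the InfluxDB Cloud (GCP) endpoint
  token: process.env.INFLUX_TOKEN!,
});

// Check one URL and return a point with timing, status code, and a content-match flag
async function checkUrl(url: string): Promise<Point> {
  const start = Date.now();
  const res = await fetch(url);
  const body = await res.text();
  return new Point('http_check')
    .tag('url', url)
    .tag('region', process.env.AWS_REGION ?? 'unknown')
    .intField('status_code', res.status)
    .floatField('response_time_ms', Date.now() - start)
    .booleanField('content_match', body.includes('Pinnacle 21')); // hypothetical expected text
}

export async function handler(): Promise<void> {
  // Assumption: URLs come from an environment variable here; the real service
  // pulls the list of customer instances from the CINC (Chef) server.
  const urls: string[] = JSON.parse(process.env.CHECK_URLS ?? '[]');
  const writeApi = influx.getWriteApi(process.env.INFLUX_ORG!, process.env.INFLUX_BUCKET!);

  // Fire all checks concurrently as promises; one failed check doesn't abort the rest
  const results = await Promise.allSettled(urls.map(checkUrl));
  for (const result of results) {
    if (result.status === 'fulfilled') writeApi.writePoint(result.value);
  }
  await writeApi.close(); // flush buffered points before the function exits
}
```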
Josh Gitlin 00:31:48.278 So we’re able to have a map of which region the AWS Lambda was running in. We run it in multiple regions so that we can see how response time is from Europe to sites in the US or vice versa. It helps detect internet issues if there are outages from one region of the world. There’s a list of all of the checks that failed recently with the reason why they failed, the actual result code, etc. And it gives us heat maps and graphs of actual response time, showing us how long the DNS lookup took, how long the initial HTTP connection took, the waiting time to first byte, etc. Really great data, much more granular data than we were getting before, and it’s custom-built in-house, so we have the ability to modify it as needed. So some tips and tricks from my experience building an entire monitoring system in InfluxDB for our infrastructure. I’d say start by evaluating your needs, figure out what you’re actually looking to do. Because Telegraf is so powerful, you can easily get lost in what you’re looking to do. So come up with [inaudible]. If you’re using the InfluxDB Cloud account, absolutely use the usage dashboard. If you’re using InfluxDB Cloud or even if you’re just self-hosting it, make sure you use the right Telegraf plugin for the job.
Josh Gitlin 00:33:34.000 At one point, I accidentally selected the file input instead of the tail input in Telegraf. The difference there being that the file input reads from the beginning of the file every interval. So I was sending the entire file contents each time, and I could not figure out why our InfluxDB Cloud usage had skyrocketed - we were using as much as we normally use in a month in about a day. So make sure you’re using the right plugin for the job. Again, that usage dashboard is very useful. You could see the usage going straight up, so I knew something was wrong right away. Add your metrics slowly, select one or two, add them, wait an hour or two, see what it does to your usage, and really evaluate. Especially if you have a large fleet of servers, it helps you evaluate the cost of each metric, and you can fine-tune them one at a time. This is especially useful when you’re using something like Chef to configure all of your servers and push this configuration out everywhere. You want to make sure that you’re doing it right before you deploy something to 200, 300, 500, 1,000 servers and suddenly realize that that usage is not scalable. Sending data to multiple InfluxDB instances is a really powerful option. You can do that if you just want redundancy. If there’s any issues with the InfluxDB Cloud instance in GCP, we have a backup with our own internal instance. We could also run a cloud account with InfluxDB in AWS as well.
Josh Gitlin 00:35:20.264 So it’s good for redundancy, and good for management of which metrics you really care about SLAs on and which ones you don’t. So thinking about, do I really need this metric? If this goes away, is it going to impact us or not? If you’re under any audit requirements, things like that. Customizing your Telegraf interval can really help because for certain metrics you just don’t care about 10-second granularity. Other ones, like CPU, you probably want to know about. And the only other thing I found that was really helpful, integrating the status page for InfluxDB Cloud into our Slack. This was really useful because we saw some cases where HTTP monitoring alarms would fire, but the issue wasn’t actually the sites being down. The issue was that InfluxDB Cloud was having ingestion problems or Grafana was having issues querying it. So InfluxDB has a status page that shows which services are up and which are having issues. You can subscribe to this through RSS, but you can also integrate this right into Slack or Teams. So we’ve done that with - actually, we’ve done this with all the cloud providers or software as a service we use. So we will get a message in a Teams chat anytime there’s an InfluxDB outage. And this enables us to know that we should go silence our HTTP alarms or that if we see an alarm, it’s potentially a false positive, etc. So where have we gotten to - what have we achieved with our migration here? In the end, our solution is about $40,000 per year cheaper, and we have much better control over our spend.
Josh Gitlin 00:37:13.410 We’re collecting the metrics that we [inaudible]. We’ve seen much better adoption of the tool. The developers are really, really loving Grafana. The ELK stack is more powerful for searching for logs. We’ve linked the two together so you can jump right from a time period in the ELK stack to the equivalent dashboards in Grafana and vice versa. This has really been well adopted by our customer success team. So they are popping right onto the dashboards. They are looking at metrics. They are looking at the logs and finding issues when customers report things. And that means that for my team, when we are getting asked to support a particular customer, we’re getting much better data from the customer support team. They’re pointing us right at the issue. They’re giving us graphs and metrics, showing us what the problem is, or we’re getting better alarms, so it’s ultimately faster resolution of any issues because we have better visibility into our data. And on the engineering side, the APM metrics are really helping developers look at how the application is performing in development before they roll something out to production, and seeing metrics of how a particular release is performing helps optimize the software.
Josh Gitlin 00:38:45.468 And when we’re talking about software that runs for these periods of time or crunches this volume of data, that can really make a difference. Having great visibility into how the application is performing can really help the engineering team find areas for optimization and improve the application. And ultimately, the goal is it makes the software more efficient. And Pinnacle 21’s real customers are patients in need of medicine. So the faster and the more accurate that these clinical trials get validated, the faster the medicines get approved and the people who need the treatments get them. So we’re optimizing the data pipeline by ensuring that the infrastructure is operating optimally and that the software is tuned as well as it can be. Where would I like to go from here? The HTTP monitoring service, I think, is something that could be valuable to the community. I would love to release that as an open-source project and allow other people to deploy it as a Lambda and monitor their infrastructure as well. It’s created in such a way that there is an individual Node package just for the monitoring, with no dependency on Pinnacle 21 infrastructure, and then we load that package in with the component that pulls our URLs from the Chef server. So it was designed that way from the beginning.
Josh Gitlin 00:40:20.839 As I mentioned, Pinnacle 21 was acquired last year. There are now a whole bunch of new products or products that are new to me that I am responsible for. So I am seeing lots of opportunity across the company to expand our usage of InfluxDB and Telegraf. We have a number of Windows servers now that my team is responsible for. So would love to put Telegraf on those because I think it would be great to build out some Grafana dashboards showing how all those instances are operating. All these other products, I would love to get some application performance monitoring from them. A goal I would love to see, but I don’t know if it’s going to happen or not, would be actually integrating InfluxDB into some of these products. So Pinnacle 21 Enterprise, the actual application does not utilize InfluxDB. All of the metrics and data and graphs that it shows our customers come from the actual Java code. Largely, this is because the application was developed for many, many years before I joined and I didn’t have any input there. But it’s also due to security and data compliance requirements that each customer’s data must be segregated. So that’s not really a good solution for Pinnacle 21 Enterprise within the application, but I think there’s a possibility to integrate this into some of the other products that we have. I would love to actually use the time series database that InfluxDB provides in an application in addition to just monitoring of our infrastructure.
Josh Gitlin 00:42:02.314 And finally, Flux. For anyone who has not played with Flux, it is incredibly powerful. And we are barely scratching the surface of what we can do with Flux at Pinnacle 21 right now. I would love to actually build out some of these dashboards a little better, have increased visibility over trends over time, year-over-year graphs, month-over-month graphs, things like that. You could build an entire monitoring system in Flux if you wanted to. I don’t know why you would, but it is a fully functional language. It is much, much more powerful than the InfluxQL that I am used to from previous years. So I will be checking out the InfluxDB University courses on Flux and improving my Flux skills. That’s about it for what I’ve done so far. Questions?
Caitlin Croft 00:43:06.552 That was awesome, Josh. That was great. Yeah. Definitely check out InfluxDB U for Flux training. What’s really cool is that the trainers who developed the in-person Flux training we do at InfluxDays helped develop the Flux training that’s in InfluxDB U. And also, I believe you can do Flux queries directly in Grafana now, so -
Josh Gitlin 00:43:33.548 Yes, we are doing that. I actually have to ask the Slack for some help with them because some of them are giving me errors. I think they work fine in InfluxDB Cloud, but within Grafana, I’m getting some weirdness. So it’s on my list.
Caitlin Croft 00:43:52.733 Well, that’s the best thing about our community is people are always in there helping each other out. It’s pretty great.
Josh Gitlin 00:43:59.214 Yeah. The Slack community is excellent. Caitlin mentioned at the top, I highly recommend joining. If you’re interested in InfluxDB, definitely join the Slack community. Really good community of people in there.
Caitlin Croft 00:44:11.413 And there’s so many different channels. Sometimes someone will tag me in a channel that I didn’t even know about. There’s so many different - which is great. I think it’s great that the community has gone out and created their own channels for their certain use cases. Obviously, there’s a lot on IoT. There’s some for different people, different regions in the world. So it’s really fun. It’s fun. And especially during virtual InfluxDays, it’s really fun getting to see everyone in there, getting to know each other. And it’s pretty cool, especially when traveling wasn’t happening and you could see where everyone was joining from. All right. So let’s see. There’s a bunch of questions here. How does this system work when interfacing with a control system like DeltaV?
Josh Gitlin 00:45:01.244 I’m not sure exactly what that question is asking. We’re not using DeltaV. And honestly, it’s not a technology I’m familiar with. I’m not sure.
Caitlin Croft 00:45:16.442 Totally fine. Yeah. The person who asked that, if you want to expand upon that, please feel free to put it in the chat. A couple of people have asked if the recording is going to be made available. Yes. The recording of this webinar as well as the slides will be made available later today or tomorrow morning. And someone also asked if there’s going to be a certificate of attendance. We don’t provide that. If you need something, just email me. We can try to figure something out. But currently, there’s nothing like that being provided. I’ve heard your story now a couple of times, and I just think it’s so cool. The way that you had experience with it and you had experience with InfluxDB and your manager finally gave you the sign-off, like, okay, let’s see what it can do, let’s see how it compares to another tool, which I think is always kind of interesting.
Josh Gitlin 00:46:16.571 Thank you. I have had a lot of fun because I think InfluxData makes great products. And I originally came to it from Grafana, which was the tool that I was familiar with. And when I was looking for a time series database to back Grafana with, InfluxDB was the one that I selected. So I’ve been very happy with that choice. And InfluxDB Cloud has just made this so much better, so much easier to get up and running now than it was five, six years ago when you had to install the InfluxDB server and configure it in the open source solution, so.
Caitlin Croft 00:46:54.500 Yeah. We’ve done a lot of development work on InfluxDB Cloud in the last couple of years, so it’s exciting to see people understanding the value of it, understanding the value of having everything running in the cloud.
Josh Gitlin 00:47:07.239 Yeah. A primary consideration in our implementation was that we didn’t add more operational burden of supporting an InfluxDB open-source server. The one we have, if it goes down, it goes down. It doesn’t impact business operations. And that’s the benefit we were able to see by sending the metrics to both the internal one and the cloud one.
Caitlin Croft 00:47:31.186 Is there sort of an initiative at your organization to be cloud-first? I know a lot of companies - it’s interesting when I talk to developers, there are certain developers who love writing everything in the cloud, but then at the corporate level, they prefer to have things on-prem. Just kind of curious how that plays out for you guys.
Josh Gitlin 00:47:50.237 Within the Pinnacle 21 department, it was very much cloud-first, and that was largely the startup mentality, I think, where the founders needed to host this software and the last thing that the CTO needed to do was manage more infrastructure just to keep the business running. I do really like cloud-first, and so I’m sticking with that approach for a significant number of the things like InfluxDB that we rolled out, like our log monitoring solution. But I’m also not afraid of running things internally. Sometimes I think there’s benefits one way or the other. So it depends on team size and resources. We are hiring at the moment, so as we get more DevOps engineers, we’ll have a greater ability to support internally hosted things as well.
Caitlin Croft 00:48:44.396 Yeah. If anyone’s interested in learning more about open jobs with Josh - everyone should have my email. So if you want to email me, I can definitely connect you with Josh. I know the job market right now is kind of nuts, so happy to put you in contact with Josh. You’ve mentioned a bunch of different things that you’re hoping to do next. What is at the top of your list? If you had a few hours, let’s say this week even, what’s at the top of that punch list regarding metrics that you’re collecting in using InfluxDB?
Josh Gitlin 00:49:21.014 I would love to start replacing some of the InfluxQL queries we have with Flux queries and really improving some of the dashboards that we have. I think we could benefit from some comparative information where we look at comparing one customer week-over-week or performance week-over-week. I think there’s definitely some power to be had there. I’d really like to improve some of the things on the HTTP monitoring side. It’s good right now, but I think it could be better, and that’s less on the InfluxDB side and more on the Grafana side. I’ve been playing with Grafana OnCall, really happy with that. We’re not using it in production yet, but we are using it in development, and I would like to switch everything over to that. It allows you to acknowledge alerts right from within Slack. It allows you to configure downtime better. That’s one of the challenges we’ve had with the existing solution: if we know we’re doing maintenance on a particular instance, we get a whole bunch of alerts for that instance because it’s difficult to silence those alerts or schedule downtime. So Grafana OnCall makes that easier and still works with InfluxQL or Flux.
Caitlin Croft 00:50:38.866 Yep. No, I know you talk a lot about using Grafana, which is totally fine. We know our community loves Grafana. Have you looked at the visualization tooling in InfluxDB?
Josh Gitlin 00:50:52.188 Yes. When I’m building new queries, I will use the InfluxDB Cloud UI and I will write my queries in there, visualize them, make sure that I have them the way I want them. I use the InfluxDB Cloud UI when I’m looking at the usage dashboard. I think some of the visualizations in there are not quite as powerful yet just because Grafana has such a head start on InfluxDB Cloud. But it’s a great system. It’s already, I think, more powerful than some of the things I’ve seen with Datadog. For example, you have different kinds of visualizations, not just line charts. So yes, we do have some dashboards within the InfluxDB Cloud UI itself.
Caitlin Croft 00:51:40.603 Yeah. Totally fine. The functionality in Grafana is amazing. I was just kind of curious if you were using the other visualization.
Josh Gitlin 00:51:50.084 Yes, I am. I also am using it at home, so.
Caitlin Croft 00:51:53.397 Hey, that’s the fun thing about InfluxDB being open source. I think pretty much everyone I talk to about how they’re using InfluxDB at work, they’re like, “Oh yeah, we’re also using it at home.”
Josh Gitlin 00:52:03.794 Yes, it’s great that it’s open source. It makes it really easy to use the new InfluxDB version with the UI. It makes it much easier to set up. Originally, I think it was intimidating for some people because you would just install this headless server and then sort of, now what? How do I see my metrics? And you needed Grafana. And now you don’t. I mean, when I first started this project, I signed up for InfluxDB Cloud first, visualized things in there, and then added Grafana on top of it. So you can do that. It makes for a lower barrier to entry.
Caitlin Croft 00:52:38.743 Yeah. Absolutely. Well, thank you, Josh. This was amazing. Thank you, everyone, for attending today’s webinar. If you have any questions for Josh that you think of after the fact, don’t hesitate to reach out. Holla. I’m happy to put you in contact with Josh. He’s in our Slack. I’ve definitely bugged him over the years, so I’m sure he won’t mind if you guys reach out to him if you are a -
Josh Gitlin 00:53:01.993 Please do.
Caitlin Croft 00:53:05.284 Awesome. Well, people are already saying thank you in the chat. So thank you, Josh, so much, and I hope everyone has a good day.
Josh Gitlin 00:53:13.285 Thanks, everybody.
Caitlin Croft 00:53:14.747 Bye.
Josh Gitlin
Director of DevOps, Certara
Josh Gitlin is currently Director of DevOps at Certara, where he and his team provide the business with Infrastructure-as-Code, Compliance-as-Code, automation solutions, and systems monitoring and data visualization. Josh's experience ranges from being CTO of Digital Fruition, where he developed a Software-as-a-Service CMS and eCommerce platform, to being on one of the operations teams for amazon.com. Josh has a passion for monitoring and visualizing the performance characteristics of distributed software applications in order to improve performance and reliability and reduce mean time to resolution of outages.