Open Source APM for an Event Management Web Application
Session date: Oct 08, 2019 08:00am (Pacific Time)
Supporting a hassle-free registration and web browsing experience for the largest gathering of CIOs and IT leadership teams from across the world is no easy feat. The challenge becomes even more intimidating when the user base consists of some of the most advanced IT professionals in their field! The objective of the integration is to provide analytics and visualizations of the ingested AppInternals data:
- Objects under monitoring and their topology, such as which servers an instance runs on and the server tags used to group those servers.
- Metrics and delays for objects under monitoring; for example counts and aggregations of typical, slow, very slow transactions.
- Alerts - generated by AppInternals and sent via SNMP
Watch the Webinar
Watch the webinar “Open Source APM for an Event Management Web Application” by filling out the form and clicking on the download button on the right. This will open the recording.
[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]
Here is an unedited transcript of the webinar “Open Source APM for an Event Management Web Application”. This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Chris Churilo: Director Product Marketing, InfluxData
- Richard Juknavorian: Managing Partner, IT Squared
- Alex Kozlov: Technical Evangelist, Riverbed
Chris Churilo: 00:00:00.495 Good morning, everybody, and thank you for joining us today at our webinar. We’re just getting started here, and we are super-excited to have two of our friends from IT Squared and Riverbed. They’ll be presenting how to use open-source projects like InfluxDB and Grafana as a part of your application performance-monitoring efforts. And just want to remind everybody that this session is being recorded, and like always, please get familiar with your Zoom application. At the bottom of the application, you will see a couple of icons for Q&A and chat, and those are the places that you will go to, to type in your questions. Oftentimes, questions are not asked during the presentation. And if you feel that you do have an important question that you want to pass on to our presenters today - no worries -just send me an email. You guys all have my email from this webinar registration, and I’m happy to connect you to our speakers today. So we can do that quite easily. And as I mentioned, it’s being recorded, so you will get an automated email first thing in the morning, so you can take another listen to this wonderful webinar today. And if you feel so inclined to listen to it today, just give me a few hours to just do a quick edit, and the URL that you used to register will also be the same URL that you’d go to, to listen to this on-demand recording. So we’ll have that posted in the afternoon. All right. With that, I think we’re going to just go ahead and get started, and we’ll ask Richard and Alex to take it away.
Richard Juknavorian: 00:01:36.680 Great. Thanks so much, Chris. Good morning, everyone, or good afternoon. I think everyone is still in the AM time zone. I know here in Boston, it’s 11:00 AM. So thank you for this opportunity to present. My name is Richard Juknavorian, and I’m the managing partner for IT Squared, and with me is one of my close colleagues and partners in the cloud-native development that we are doing, Alex Kozlov from Riverbed. Alex is a technical evangelist for Riverbed and a leading cloud-native architect. So what I’m going to present today is a little bit of a conversation around how we came to work with a large industry event organizer specifically in the IT space, and the needs that they had around making sure that their event management was very performant, had a high degree of stability, and ultimately led to a high amount of customer satisfaction. This was the registration component of the large event that needed to be seamless as well as, sort of, the agenda management and the overall experience during the event. We needed to make sure that the application was up and performing through the entire course of that event. We utilized a bunch of different open-source tools in order to not just be able to speak the language that SREs and other APM specialists need to be able to speak and understand, but also something that drove value up the chain to more senior leadership and executive leadership who wanted to have sort of snapshot views into how the applications were performing throughout the duration of the event, but didn’t have necessarily the technical skills or the know-how around more commercial or industry-specific application performance management or application performance monitoring tools and applications.
Richard Juknavorian: 00:03:46.712 They needed something that was a little bit more at a higher level but still gave them the kind of information that they needed to see. So essentially we had a use case here that was pretty straightforward. I kind of talked a little bit about this in the upfront preamble, but essentially when you’re a major IT event company, and you produce several major IT events with thousands and thousands of registrants and attendees, you need to have a very specific and hassle-free registration process and web browsing experience for your attendees. And in this particular instance, this event is the largest gathering of CIOs and IT leadership teams from across the world. So this is really not an easy feat, and the challenge becomes even more intimidating when the user base is some of the most IT professionals in their fields - the most advanced IT professionals in their field. But we were able to kind of focus specifically on our use case of a hassle-free registration and web-browsing experience and in the end were able to put together monitoring for some key performance business metrics that we will look at aggregated across multiple IT applications. In this instance today, we’ll look at two specific IT applications but bringing those performance metrics and aggregating them in near real time which is what makes some of the time-series analysis things that we’re going to show so important and so pertinent. And then, of course, as I spoke about needing to be able to bring this to a higher level, being able to allow for executive leadership and senior leadership to look at this, we also included what we think are some fairly powerful visualizations with drill downs into focused dashboards that were predefined and preset. And those really were there to provide root-cause analysis of the different anomalies that were detected throughout the process and throughout the conference itself.
Richard Juknavorian: 00:05:56.376 So again, just to sort of restate what the objectives were for our use case and for our project. It was to provide an intuitive and easily consumable visibility into application performance, implement advanced time-series analysis via InfluxDB technology. And that was really with the SRE team as the primary persona and the primary user that we were trying to derive value for and enable those application SRE team members for success, give them the tools and the data that they need to have a successful experience with their APM data but then as well provide sort of near real-time, high-level views for executive leadership as well. So really sort of a bifurcation there of how do we take the advanced time-series analysis that InfluxDB enables and use that in one particular persona use case around the SRE team but then also have something that was accessible for the executive-leadership persona as well.
Richard Juknavorian: 00:07:04.687 So with that, I’d love to stop showing slides and actually give a little bit of a dive into the application that we built. So I’m going to start here at Grafana, at a monitoring dashboard that is monitoring two different applications. And we did several for the purposes of this major IT event, but I’m just going to focus on two. One was the content delivery app, so this was after you registered and after you kind of went to the agenda page, and you picked your sessions, and you determined which of the pieces of content that you wanted to download. So whether it be slide decks, or whether it be Abstracts, or what have you, so there was the application that allowed you to do your content delivery to yourself and then, of course, the event-registration app. So lots and lots of people were signing up for the event. You needed to have a registration experience that was performant and was easy for the registrants to get done. So you can see that we’ve got the two apps here, content delivery app and the different summation of the monitoring going on there as well as the event-registration app. So different resolutions are available. So here I’m looking at a five-minute resolution. Different time shifts are available, one, six, twelve, in terms of my time shift. And then, of course, I can also pick custom durations of overall time that I’m looking at. So in this instance, I’m looking back over the last six hours. So I can kind of start here by looking at a little bit of an exception trend on my event-registration app with a small spike here, so if I want to kind of zero in on that, which I will do, it kind of takes me into where the exceptions were happening both for the content-delivery app as well as the event-registration app. And I’ve got some predefined thresholds that I’ve set here for when my metrics are moving from sort of the green stage to the orange stage or from the orange stage to the red stage. Right?
Richard Juknavorian: 00:09:15.945 So when I focus in on that spike that I saw on the exception trend for the event-registration app and kind of zoom in at this moment in time, I can see that really what I’m looking at here is - I’ve got an EUE time that is over the threshold that I had set for it to move from my green status to my orange status, in terms of performance. So I’ve got a performance of about two-and-a-half seconds which is higher than I want to see for this particular transaction. So from here, I can do a bunch of different drill downs, right from Grafana. I can drill into some application details, transaction information. I can drill directly down into SteelCentral AppInternals, which is Riverbed’s AppInternals product for APM, and kind of start the process. And this is really the work that IT Squared developed. This is where we kind of resonate as a systems integrator in creating sort of that one-click resolution or one-click deep dive between where you are in Grafana and where you need to be within SCAI or within any other commercial APM tool. But before we do that - before we go all the way down into the APM tool, let’s kind of spin through the information that’s still available here at the Grafana level, at the dashboard level, that’s been powered by the monitoring of data that we’re doing within Influx. So let’s click on application details, and let’s kind of just move down a little bit here into this particular app. Here I’ve got a summary panel where I can look at things like my current load, server time, again my EUE time, my Apdex, and my exceptions trend here. Also, I have an opportunity to do some time-series analysis as well as some adaptive baselining and some time-shift analytics.
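The summary panel described here includes an Apdex score. As a minimal sketch of how such a score is computed under the standard Apdex definition (samples at or below a target threshold T count as satisfied, samples up to 4T count as tolerating, the rest as frustrated), the snippet below uses a hypothetical 1-second threshold and a made-up sample list. Only the 2.5 s, 8.66 s, and 12.42 s values echo figures quoted in the demo. It is not the dashboard's actual implementation.

```python
def apdex(response_times_s, threshold_s):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times_s:
        raise ValueError("need at least one sample")
    # Satisfied: at or below the target threshold T.
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    # Tolerating: above T but no more than 4T.
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    return (satisfied + tolerating / 2) / len(response_times_s)


# Hypothetical end-user-experience times in seconds (assumed data).
samples = [0.4, 0.9, 2.5, 8.66, 12.42]
score = apdex(samples, threshold_s=1.0)  # threshold is an assumption
print(f"Apdex: {score:.2f}")
```

A score near 1.0 means most users are satisfied; the slow outliers in this sample pull the score down sharply.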
Richard Juknavorian: 00:11:22.725 So let’s kind of look at these, kind of one at a time for a second. So the time-series analytics is obviously very powerful and important from Influx’s perspective, being able to see things across a particular time series. So here again, I can kind of look at exceptions that are happening. You’ll notice that as I move across each of these simultaneously, the same sort of float over is happening within the panel below. Right? So as I roll through here, I can kind of see what’s really causing some of my delays here, what my culprit is in this instance, right away just by kind of rolling over these delays by category and delays by package. Here I can kind of see that the good things are at the top and the more egregious things are at the bottom, across different apps and different analytics that I can do there, as well as take a look at the baselining and the time-shift analytics. There’s some opportunities here to view directly into these applications at a higher level, so if I’m particularly interested in server time or particularly interested in exceptions, I can drill down, not drill down, but I can just zero in on that one particular view here. So you can see I can kind of move down into this view. I can see what’s happening now as opposed to a shift of one hour and see sort of the gaps between the time shift and what we were looking at in terms of now. So that’s just one opportunity to kind of go a little larger.
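The time-shift view described above compares each current data point with the value recorded one shift interval earlier. A minimal sketch of that comparison follows, assuming the series is available as a plain timestamp-to-value mapping. The function name, the 30-minute resolution, and the sample values are all hypothetical, and the dashboards themselves compute this through InfluxDB queries rather than in Python.

```python
from datetime import datetime, timedelta


def time_shift_deltas(series, shift):
    """Map each timestamp to (current value - value one `shift` earlier).

    Points with no earlier counterpart are skipped.
    """
    deltas = {}
    for ts, value in series.items():
        earlier = series.get(ts - shift)
        if earlier is not None:
            deltas[ts] = value - earlier
    return deltas


start = datetime(2019, 10, 8, 8, 0)
# Hypothetical server-time samples (seconds) at 30-minute resolution.
series = {
    start + timedelta(minutes=30 * i): v
    for i, v in enumerate([1.0, 1.2, 2.5, 1.1])
}
deltas = time_shift_deltas(series, timedelta(hours=1))
for ts, delta in sorted(deltas.items()):
    print(ts.isoformat(), f"{delta:+.2f}s vs one hour earlier")
```

A large positive delta marks a point where the application is slower than it was at the shifted time, which is the gap the overlaid lines in the panel make visible.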
Richard Juknavorian: 00:13:19.278 From here, I think it’s pretty interesting now as I continue through my sort of root-cause analysis. Kind of come back up here. So we’ve gone down into the application detail. We’re specifically looking at further information around the app, but I can also now kind of go a step further into the transactions themselves. And then here from the transactions themselves, there are several transactions that are associated here, booking, create an account, event information. I can take a look at all of my transactions, or I can take a look at selected transactions. So maybe I want to look at booking, create an account, and session sign-up. When I do that filtering, I can see that, “Oh. Actually, it’s my session sign-up that is causing quite a bit of delay.” Right? I’m in the red here. I’m at 8.66 seconds on my session sign-up. I can see the exception trend. And now this is something that again I haven’t even moved into the application performance monitoring tool yet, but I’m starting to see some pretty specific issues that are going on without having to engage yet with an APM tool. I can see that I’ve got a high EUE time of 8.66 seconds on my session sign-up, and from here, I can move even further into the details that are associated with that transaction. And again here, the same presentation that we were showing before around there’s a summary of information here. There’s our time-series analytics where I can look across delays by category, delays by packages as well as, again, my adaptive baselining, and my time-shift analytics. Being able to look specifically into my views here and seeing the deltas between now and in this instance I’m looking again. I’m looking still at a time shift of one hour. So this is we think pretty intuitive stuff that allows you to do a lot of analysis and do a lot of troubleshooting before you’ve even actually moved into SteelCentral or any other APM tool. 
Kind of keeping it here, the dashboard level, and allowing you to do some significant work.
Richard Juknavorian: 00:15:45.603 One of the kind of next questions that maybe comes up here as we’re starting to think again about, “Okay. I’ve moved into application details. I’ve moved into transactions and transaction details, but what about server monitoring? What about an opportunity to look into servers or to look specifically into transaction searches?” Actually, yeah, before I move into servers, I’m going to look - let’s do some transaction searching. So let’s take our first foray down into SteelCentral at this point in time. And again, we’ll focus on this EUE max time of 12.42 seconds. And let’s go ahead and look at the transaction search. When I do this now for the first time, I’m coming down into Riverbed, and I’m able to kind of see what are the categories of delays that are happening. And maybe at this point in time what I’ll do is - I’m going to ask Alex Kozlov from Riverbed to maybe talk a little bit about what we’re seeing here on the transaction search page for SteelCentral.
Alex Kozlov: 00:16:55.218 Surely, Richard, gladly. Let me start by saying that this is more of a specialist tool. This is a tool where, once we know that the root-cause issue probably has to do with code execution or the application server that we have on the screen, a specialist will go in and start troubleshooting the issue in more detail. This is a big data-based system that supports traditional architectures, monolith applications, but is also equally flexible and powerful for cloud-native apps. In the last few years, we’ve been making sure that we support the latest cloud-native technologies, Kubernetes, various service mesh technologies, etc. to be able to give our users visibility across the entire technology stack. One of the key things is that we are capturing every request, every transaction hitting our applications, with the full level of details. So here we definitely - let’s take a look at, for instance, transaction checkout that was - or a search. Search in our event registration application, and we automatically sort from our slowest transaction to the fastest. We see that everything’s okay as far as search transactions are concerned. We don’t have any outliers slower than .3 seconds. Each dot represents an individual transaction, but once we have - let me remove this filter. Once we have a slow transaction, for example, you should be able to see - based on what we saw in Grafana, I think I should be able to find slow transactions in the checkout process. If I have a tour in the - or a search. And as a matter of fact, we have a still pretty good performance, applications clearly performing, but let me give a quick illustration of what level of details.
Alex Kozlov: 00:19:12.399 A developer wants to know that the issue is identified to code, what level of issue the developer has. So we see that this is a multi-technology application. We have a Node.js front end. We have Java on the back end. You see this small designating Java. And most important thing that is being brought out is analytics of what the slowest methods are automatically. So let me look at the top calls. I see some having to do with our network communications with my specific logic in the application. I can drill from here into specific code execution on a specific tier or JVM or a microservice, whichever I might have as part of my application architecture. So this level of detail. Now if we pivot back to the overall view that AppInternals provides for the app, we have a lot of aggregated statistics, color-coded, the same thing that we’re bringing out in the Grafana dashboards as well. We have some interesting analytics based on code that give us relationship between most important transactions and most important pieces of code impacting the application the most. For instance, here we know that we have a slow database transaction being executed. And we can drill down into that, but the key to understand is that to effectively use this tool you need to have familiarity with code. You need to have intimate familiarity with the technologies that applications are built upon, and that’s why we call it the more specialist tool. And that’s why the upper layer for both SRE team and application owners, in the form of InfluxDB time series and advanced time-series analysis that is provided by InfluxDB technology and visualizations implemented in Grafana, gives an additional level of abstraction so that people can do isolation of problems, and decide whether they need to dive into this tool and pass along troubleshooting to the appropriate person who does have the knowledge about the application architecture to use this tool. Back to you, Richard.
Richard Juknavorian: 00:21:38.613 Great. Thanks, Alex. And thank you for making that point specifically about the notion of why the time series was so important here. Right? Because again, and we talked a little about this in the preamble, but certainly we definitely had two very specific personas here. The specialist as well as the generalist, and having the ability to use InfluxDB, the ability to bring that time-series information in, made it so much easier for us to develop applications at this level that are much more consumable by a nonspecialist. Right? And being able to come in - and actually, you can kind of see the same information when you look here and you can see exceptions happening here. We can see delays across what are causing our major delays here. We see the exact same information in Riverbed, but it’s a more easily consumable spot here for the user. So from here, again, seeing that transaction search, I’ll just drill down one more time into this transaction search. The summary of the delays here, 94% of it being application code as the major source of the delays. When I come over here and look at where are my delays happening, I can see the correlation between both, as well, so.
Richard Juknavorian: 00:23:11.990 Okay. So last thing I’d like to potentially show here is - we didn’t really get into the server monitoring. So let me just kind of drill down into the server monitoring really quickly. This is all Azure cloud being developed, so please forgive the server names here, but this is a cloud application that is running. But here I can take a look at different things that are going on from a server perspective. Here we’ve got some things that are in the red around memory usage. Actually, in all cases with my servers, the area that seems to be causing some concern for me is anything that’s associated with memory usage. And again, from here I have an opportunity to kind of look directly down into the transactions that are causing that instance data or server details associated with that. So from here, let me just kind of take a quick look at server details, and from here I can take a look again. I have the same presentation. Right? We kind of continued this paradigm through the whole experience of a summary at the top, time-series analytics that you can drill into and look specifically at potentially where you’ve got issues associated across time, as well as the adaptive baselining, and the time-shift analytics as well. These panels exist as well. So I can see spikes and drop-offs in memory usage, network usage, so on, and so forth. From here, again, I can look at the transactions that are associated with this spike in memory, or the instances associated. So continue to just drill down, look at my root-cause analysis, understand where my challenges are.
Richard Juknavorian: 00:25:05.369 In this instance again - I think Alex pointed this out when he was in SteelCentral AppInternals. We know that we had some issues with sign-up, and we know we certainly had some issues with checkout. So again, I can look at the transaction details associated with checkout and kind of come back to where this all started for me and kind of come full circle on my monitoring here of the app, the application details, my instances, as well as server data and transaction searching both here within Grafana as well as in SteelCentral’s AppInternals. So that is kind of a full gamut of what we were hoping to demonstrate for you regarding the monitoring tools that we built in Grafana and of course ingesting all of the AppInternals’ data into Influx.
Richard Juknavorian: 00:26:02.921 So with that, I’m going to just have a quick recap on the values that we feel we’ve been able to offer you through this. And kind of this - we already kind of talked about already but just to kind of restate it here at the conclusion. Really this was how do we enable our SRE teams to switch from monitoring to root-cause analysis and troubleshooting with one click. And you can certainly see that in the Grafana dashboards that were presented, having the opportunity to kind of just click in the upper left-hand corner of the visualization and drilling directly down into AppInternals really tried to make complex APM information available to everyone including nontechnical resources. And that was really the goal of the way that Grafana was laid out, and how the panels were built with the sort of similar paradigm of a summary’s always at the top, having the time-shift information, having the different baselines and whatnot as well. And then really more than anything help to kind of prove viability of open-source tools. Coming into this particular opportunity that we engage with for this vendor in the event-management space, there was a little bit of hesitancy to kind of think about, “Well, should we be using open-source tools here? Or should we be relying more on commercial visualization tools?” Commercial visualization tools are great but with commercial visualization tools come things like licensing considerations and cost considerations. But with open source, some of that gives you a lot more versatility. So really with what we were able to produce with Grafana and Influx - really kind of showed there’s a lot of viability in open-source tools, and they are just as beneficial to the enterprise as in some of the commercial tools. So with that, Chris, I think that’s what we were hoping to demonstrate today. And we’d love to take some questions.
Chris Churilo: 00:28:16.354 Awesome. All right. There’s lots of questions here, so I think the first question is, “Can you tell us what Riverbed was built for, and what its strengths are, so we can better understand what the need was to augment this Riverbed solution with Grafana InfluxDB?”
Richard Juknavorian: 00:28:35.354 Sure. Alex, do you want to start on that one, and I’ll chime in?
Alex Kozlov: 00:28:37.779 Yeah. I would love to take that. And I would also maybe like to illustrate my answer with AppInternals.
Richard Juknavorian: 00:28:46.502 Do you want me to stop sharing or just give you control? Or what would you like?
Alex Kozlov: 00:28:49.397 Yep. Just give me control. That’s fine. If you can give it to the browser to AppInternals dashboard and - yeah. Great. Thank you very much.
Richard Juknavorian: 00:28:59.165 You’re welcome.
Alex Kozlov: 00:29:00.638 So AppInternals is an enterprise-grade application performance management tool. Right? So there is a curated UI. There is code-level analytics that gives a high-level view, still specific to the application technology stack. So it has a concept of being able to deliver to specific applications with lots of analytics being collected about code that’s executed. You see some of the business transactions displayed in red, some of them in green. This is the result of thresholding and [B signs?] that are being calculated and defined in AppInternals. And based on that, classifying a transaction as normal, slow, very slow. Kind of giving that level of severity of deviation from a service-level objective that we have for a particular performance. Now the key strength of AppInternals is the ability to process and store a huge amount of data about the application execution. Let me show you the application map. Hold on. Let me pick a specific application. For instance, this is a retail app running in managed Kubernetes in Azure. So I’m pivoting to transaction tags. These are the transactions that are defined. We see what level of violation. 90% of transactions are violating the threshold that we have. Here are more dynamic numbers. So it ended up, the application’s better performing as we can see.
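The classification Alex describes (normal, slow, very slow, plus the share of transactions violating a threshold) can be illustrated with a small sketch. The threshold values, function names, and sample durations below are hypothetical. AppInternals calculates and defines its own thresholds, and this is not its implementation.

```python
from collections import Counter


def classify(duration_s, slow_s, very_slow_s):
    """Label one transaction by comparing its duration to two thresholds."""
    if duration_s >= very_slow_s:
        return "very slow"
    if duration_s >= slow_s:
        return "slow"
    return "normal"


def violation_rate(durations_s, slow_s):
    """Fraction of transactions at or above the 'slow' threshold."""
    if not durations_s:
        raise ValueError("need at least one transaction")
    return sum(1 for d in durations_s if d >= slow_s) / len(durations_s)


# Hypothetical transaction durations in seconds (assumed data).
durations = [0.2, 0.3, 2.5, 8.66, 12.42]
counts = Counter(
    classify(d, slow_s=2.0, very_slow_s=8.0) for d in durations
)
rate = violation_rate(durations, slow_s=2.0)
print(dict(counts), f"{rate:.0%} violating")
```

Counts per class are the kind of aggregate the landing page lists under metrics for monitored objects, and the violation rate corresponds to the percentage figure shown on the transaction view.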
Alex Kozlov: 00:30:50.064 Now we can see a lot of detailed analytics about JVMs. This is a Java-based application. So if I select an API microservice, I see a lot of statistics related to performance of the JVM. It’s CPU consumption, memory, various statistics about garbage collection, state of memory, time spent in GC. All this is helpful to understand application health. Now, what I previously have shown already, the power of the transaction storage. We can go back where we’re able to bring out 100% of our requests hitting our application. Right? Now the key component, bridging the gap, was to make this information in the more simplified fashion delivered to people who are not specialists in application technologies, essentially democratize the data. So we chose, together with Richard and his team - we chose InfluxDB as the most robust technology for time-series database with a very rich analytical engine and a number of operators that allowed us to do things like visualize dashboards and look at the performance of the visualized thresholds in our dashboards and the level of performance as such. I’ll give you one small example that we can do a time-series baseline analysis based on number of sigmas that we’ll want to see. So let me go to - I probably want to go to last hour first though.
Richard Juknavorian: 00:32:53.880 Yeah. I would go way out. Last hour, sigma five. Yep.
Alex Kozlov: 00:32:59.073 Yep. So as we can see - so I want to see the final view which means that I’m going to have - I want to have sigmas kind of give me lower bound and the upper bound of my baseline. So look how quickly I’m able to go to analyze performance on my transaction based on what I consider a normal range. For instance, two sigmas is a much narrower band. And I see that from two sigma standpoint, some of my metrics such as server time fall out of the normal range. Right? If I want to assume a wider band, it’s one [inaudible]. So things like that are provided by InfluxDB technology. And another thing that’s very useful from troubleshooting standpoint is the ability to do time shift as Richard showed. So for instance, I have an issue in the last hour, but what I want to do is - I want to compare my performance. Oh. And by the way, I want to go with a smaller resolution, one datapoint a minute. I can go even lower, but this is a high-level dashboard, so we’re limited to one - we limited ourselves to one minute. So I want to compare - so my current performance for the last hour is in blue. I want to compare performance of my application with what was happening one day ago, for instance. Okay? I see that deviation exists, but it’s not that bad. So really as far as server time is concerned, my application is not doing anything terribly different than it was doing the same time a day ago. So these types of analytics are enabled and written essentially in real time, [can?] be calculated on the fly by rich analytical language built into InfluxDB.
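The sigma-based baseline described here amounts to a band of mean plus or minus k standard deviations, with values outside the band treated as abnormal. The sketch below shows the arithmetic on a small hypothetical sample and compares a two-sigma band with a five-sigma band. In the demo this calculation is done by InfluxDB queries; the function names and data here are assumptions for illustration only.

```python
from statistics import mean, pstdev


def sigma_band(baseline, k):
    """Return (lower, upper) bounds: mean -/+ k population std deviations."""
    mu = mean(baseline)
    sd = pstdev(baseline)
    return mu - k * sd, mu + k * sd


def outside_band(values, band):
    """Values that fall below the lower bound or above the upper bound."""
    low, high = band
    return [v for v in values if v < low or v > high]


# Hypothetical historical server times (seconds) used as the baseline.
baseline = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]
# Hypothetical current observations to check against the band.
current = [1.1, 1.5, 2.5]

narrow = outside_band(current, sigma_band(baseline, k=2))
wide = outside_band(current, sigma_band(baseline, k=5))
print("outside 2 sigma:", narrow)
print("outside 5 sigma:", wide)
```

With this sample the narrower two-sigma band flags more points than the five-sigma band, which matches the trade-off shown in the dashboard: a tighter band is more sensitive, a wider one tolerates more variation.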
Chris Churilo: 00:34:55.809 So maybe you can talk a little bit about the data architecture for all these metrics that are being collected. So is everything being collected into Riverbed, and then you’re just moving the data from Riverbed into InfluxDB, or are you also adding other types of metric and event data into InfluxDB that can augment Riverbed?
Alex Kozlov: 00:35:19.818 So in this particular case, all this data about servers, about JVMs, about performance of our network requests, etc. is retrieved from our AppInternals solution. So we didn’t put out an architecture diagram because it’s actually a very classic case of how integrations are implemented. So AppInternals has an open REST API that allows us to get these metrics with various resolutions in near real time, so we’re just forwarding them to InfluxDB enriched. So InfluxDB is essentially about metric data enriched with metadata called tags. Right? So all the information about transaction naming, servers, JVMs, container names, what have you, we’re encoding as tags, and storing them along with metrics. So that’s the data model. Now we leverage capabilities such as continuous query, etc. to calculate our baseline dynamically and the threshold. Okay? In our case, we assume that the baseline is seasonal, which is why we calculate the baseline one hour at a time, understanding that our application performance is going to vary by time of day because the registrants are primarily business users. So continuous query - right? - tagging and rich analytical operators in InfluxDB query language are the key tools that we’re using, and it’s just a standard data source, and Grafana dashboards are drawing data from InfluxDB as a data source.
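The data model described here, metric values stored alongside descriptive tags, maps directly onto InfluxDB's line protocol, the text format used to write points. The sketch below builds one such line by hand. The measurement name, tag keys, field name, and timestamp are hypothetical, since the webinar does not show the actual schema, and the sketch handles float fields only.

```python
def escape(text, chars):
    """Backslash-escape each character in `chars` where it appears in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line(measurement, tags, fields, ts_ns):
    """Format one point as line protocol (float fields only).

    Tag and field keys and tag values escape commas, spaces, and equals
    signs; the measurement name escapes commas and spaces.
    """
    tag_part = "".join(
        f",{escape(k, ', =')}={escape(v, ', =')}"
        for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape(k, ', =')}={float(v)}" for k, v in fields.items()
    )
    return f"{escape(measurement, ', ')}{tag_part} {field_part} {ts_ns}"


# Hypothetical point: names and timestamp are illustrative assumptions.
line = to_line(
    "transaction",
    {"app": "event-registration", "name": "session sign-up"},
    {"eue_time": 8.66},
    1570546800000000000,  # nanosecond-precision timestamp
)
print(line)
```

Because the application and transaction names are tags rather than fields, dashboard queries can filter and group by them, which is what lets one stored series feed both the per-application and per-transaction panels.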
Chris Churilo: 00:37:14.207 Cool. So how long did it take you to do this with InfluxDB and Grafana?
Alex Kozlov: 00:37:22.301 So Richard’s team put together the first, so obviously there was a learning curve on the API. Right?
Richard Juknavorian: 00:37:28.265 Right.
Alex Kozlov: 00:37:29.176 Then we needed to adapt to a secure deployment scenario. Right?
Richard Juknavorian: 00:37:33.641 [inaudible].
Alex Kozlov: 00:37:34.330 As you can see our URLs [inaudible] is here, so.
Richard Juknavorian: 00:37:38.886 [inaudible].
Alex Kozlov: 00:37:39.877 But the first prototype was shown to development teams and SREs in under two weeks. Then we ran through a couple of iterations of user design with our input and input from the customer to understand what the flow - what would be the most efficient flow. And we think the key thing - the key value add here is the ability to do analytics without diving into a specialist tool and [crosstalk] analytics.
Richard Juknavorian: 00:38:06.183 Keeping it at this level.
Alex Kozlov: 00:38:07.783 Yep. So even though essentially we’re representing data here that’s for the purposes of this webinar, we’re focused actually on AppInternals data only, but we’re also able to correlate it with and visualize it with events coming from logs that are stored in account and enriched with similar tagging. It actually creates powerful visualizations and an additional dimension of troubleshooting and isolation of various issues.
Chris Churilo: 00:38:43.225 That’s excellent. So I’m not trying to make AppInternals look bad, but how long would it have taken to be able to do the same kind of functionality in that application without the use of open-source tools?
Alex Kozlov: 00:39:00.406 So two things. We need to understand that AppInternals is focused on the “analytics.” Right?
Richard Juknavorian: 00:39:09.788 Right. Not on time series.
Alex Kozlov: 00:39:11.022 So what we need to say, essentially the approach here, I think the best approach is the best of breed. Right? So if you are feeling comfortable with open-source technologies, know their capabilities and their power and limitations, essentially you don’t even ask the question why - so it’s just more natural to work with, in my mind - I never had a question in my mind that for advanced time-series analysis, we need to go with a specialized tool, which is the InfluxDB time-series database. So probably we would have been able to get 80 or 90 percent there with AppInternals, but why? The long-term benefit, even though there is an upfront effort in adopting a platform such as InfluxDB and Grafana - but the long-term benefit, because you’re dealing with the right tool for the right problem, is going to pay for itself in a very short time and continue paying for itself the more we use it. That was the [provision?].
Chris Churilo: 00:40:08.936 Yeah. That’s great. I mean, in fact, I would kind of try to counter the comment I heard earlier that, “Hey. Open source is free. You don’t have to pay for it, no license.” But I think what actually brought to this situation more than anything was you got this incredible extensibility that can be completely tailored to the needs of your customer. And you said in just a matter of two weeks. Right?
Richard Juknavorian: 00:40:32.496 A couple of weeks. Absolutely.
Chris Churilo: 00:40:34.223 So it’s all of a sudden you just made the existing switch in your head just that much more powerful with a set of open-source tools, too. And of course, we are open-source project maintainers, so we’re very passionate about open source. But just for our audience, I just want to remind everybody, yes. Free is nice, but really it’s because we have such a huge community supporting these projects, extensibility just becomes natural. Right? You have a lot of different people that are adding to this. So being able to just grab data from this REST API and being able to throw it into the framework that these guys have developed is proof that you can do these things really quickly. I would love to get your guys’s feedback on what the difference between code analytics is versus time series, so our audience can understand the nuance between those terms.
Alex Kozlov: 00:41:26.229 Okay. Sure. So code analytics has more to do with the data collection. Let me pivot to AppInternals, our tool of choice, for a minute here. And I want to take a look at all the transaction types. I want to take a look. Actually, let me go to the retail example that I started with. Bear with me one second. So first the key here is this. Our team, the AppInternals team, has put a lot of effort into being able to, with minimal impact on the performance of the application - to be able to collect unique beginning statistics about how each transaction is being executed. Let me go into search for the transaction checkout. So this level of visibility is very hard to achieve efficiently at scale in real time. To be able to capture all the distributed traces with all the data, all the tiers, every tier with every JVM, with every microservice, is actually not the most complex problem. It’s solved in the industry, comparably even by open-source solutions. Now, the challenge is to look behind every segment essentially of our transaction and accurately calculate the contribution of each method to the latency of the [Oracle?] transaction with very high resolution. So here I have a database query which, excluding network communications in the top calls, AppInternals looked at every transaction happening in the last hour and precalculated all kinds of analytics to arrive at the conclusion that it’s an executed query of the JDBC tier calling the database. A [petty?] code update inventory is the culprit in my performance issues. Number one.
Alex Kozlov: 00:43:52.062 Number two. It shows me exactly where in my code this database statement is being called. So this is actionable information that is sufficient to give to a developer to troubleshoot and understand, either a database developer or a [PL/SQL?] procedure or an application developer of Java in this case. To be able to understand the root cause of it and patch it up to prevent it in the future. So the key here is a ton of data that’s not magic beans. Right? These are what’s called traces, a very deep recursive stack of execution in which, at this point, the database call occurs at level nine of the database query. So the magic, the secret sauce here, is to be able to collect. Let me show you how much data we’re talking about. I go to my system starters, and it’s going to tell me the number of transactions that I’ve accumulated in this instance. So in the 123 days that I have had this instance, I have a total of 237 million end-to-end transactions consisting of almost 300 million individual calls on individual microservices. All accurately stitched together. This is a mind-boggling amount of data clearly. And the way to leverage it is, number one, to give the full level of detail, down to the stack trace level, to our developers. And secondly, giving analytics - so let’s look at the checkout again in an aggregated view. So across the last hour from 10:44 to 11:44 AppInternals churns these millions of transactions and their call traces into very specific information about what parts of my code are taking what amount of time in different slices, by category, by package, by method, or by a scale.
Alex Kozlov: 00:46:01.462 Okay, confirming. So in Grafana what we see is - we see more aggregated analytics. Let me go to - so this is a content-delivery app, checkout transaction, [crosstalk] with time series, and I want to see the specific metric, which category of code is contributing to my performance most. So here I see that this is the summary in JDBC. You see the column matches. Right? So that’s it. Now AppInternals gives me an additional level. Number one, stack trace data, and number two, analytics aggregated over time. That’s it. I think I answered your question.
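The aggregation Alex describes, turning many individual call records into time spent per category or per method, can be sketched as a simple group-and-sum. The record layout, field names, and numbers below are hypothetical and are not the AppInternals data format; the sketch only shows the kind of roll-up that ends up as a "time by category" metric on a dashboard.

```python
from collections import defaultdict


def time_by(calls, key):
    """Sum self time per value of `key`, largest contributor first."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[key]] += call["self_ms"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical call records extracted from traces of a checkout transaction.
calls = [
    {"category": "JDBC", "method": "updateInventory", "self_ms": 420.0},
    {"category": "Servlet", "method": "doPost", "self_ms": 35.0},
    {"category": "JDBC", "method": "selectCart", "self_ms": 80.0},
]

print(time_by(calls, "category"))
print(time_by(calls, "method"))
```

The same function slices by category, package, or method simply by changing the key, which mirrors the different slices mentioned in the talk.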
Chris Churilo: 00:46:39.778 Yeah. So I think just to reiterate, and it’s something that we talk a lot about at InfluxData - it’s really - I think you would need to collect all those details, but in order to find that specific trigger that’s going to help you to understand that there is a problem for you to then troubleshoot, it’s like looking for that needle in the haystack. And so I think another great reason why you want to use these open-source tools with AppInternals is that it’s going to help you become more efficient in your use of AppInternals. Right? Taking a metrics-first approach allows you to - if you go back to that Grafana dashboard real quickly - will allow you to just see how things are going. Everything looks okay. You can change. You could do the quick analysis, change the sigma, make it really narrow, and then the moment that you see things are kind of going out of bounds, then the teams that are responsible can get notified. Then they can start to really dig in deep and actually given quite a bit of information at this time series level. Right? Knowing that - oh. Yeah. It is in that JDBC server, or it is in that piece of JavaScript, or wherever these errors are kind of trying to hide allows them to be able to uncover them. And then once you uncover - the other thing that I want to point out is the other thing that’s really nice is we’re also given a very specific time period when you’re looking at time-series data. So your SRE doesn’t have to be digging around, digging around, digging around in AppInternals. Now he has a very specified time, a very specified set of things that are executing correctly or not, then you can really start to go in and troubleshoot and be able to uncover those things and fix those things, I think much more efficiently than you might be doing today.
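The metrics-first workflow described above, where the time series narrows down a specific time period before anyone opens the trace tool, can be sketched as follows. This is a hypothetical illustration: the function name, the one-minute resolution, and the threshold value are assumptions, and a real deployment would more likely rely on alerting in Grafana or InfluxDB than on hand-written code.

```python
def breach_windows(series, upper):
    """Return (start, end) pairs for contiguous runs above `upper`.

    `series` is a list of (minute, value) pairs ordered by time.
    """
    windows = []
    start = prev = None
    for minute, value in series:
        if value > upper:
            if start is None:
                start = minute  # a new out-of-bounds run begins
            prev = minute
        elif start is not None:
            windows.append((start, prev))  # the run just ended
            start = None
    if start is not None:
        windows.append((start, prev))  # run still open at end of data
    return windows


# Hypothetical one-minute server-time samples and an upper bound of 120 ms.
series = [(0, 100), (1, 130), (2, 140), (3, 90), (4, 150)]
print(breach_windows(series, 120))
```

The resulting windows are the narrow time ranges an SRE would then inspect in the trace-level tool.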
Richard Juknavorian: 00:48:23.482 That was certainly the goal. Yes, Chris. Thank you.
Chris Churilo: 00:48:25.461 That’s awesome. And so I think now I understand when you were talking about complex APM info and - yeah. Yes indeed. It is actually very complex, and it’s even complex for those poor SREs because trying to find where those issues would be time-consuming. But definitely, the solution that you guys pulled together really helps them to become a lot more efficient. And it helps - I think you guys called it kind of your business users or your nontechnical users can easily see with this Grafana dashboard that, “Hey. Things are executing okay,” or, “Oops. There’s a little bit of red on these dashboards. Maybe we can ask the guys to dig into it a little bit more.”
Richard Juknavorian: 00:49:05.596 Take a look. Yep.
Chris Churilo: 00:49:07.824 So this is really great. What are some plans for maybe the next version or some other things that now - that typically - right - customers see something. They get used to it. They love it. They inevitably ask for more. What are some things on the horizon for you guys?
Richard Juknavorian: 00:49:22.845 Well -
Alex Kozlov: 00:49:23.318 Maybe I can jump in.
Richard Juknavorian: 00:49:26.951 Sure. Just remember the perspective, Alex. Go ahead.
Alex Kozlov: 00:49:29.329 Yes. So as we’re exposing through the API more and more data about cloud-native environments, specifics of components of a Kubernetes cluster or managed services in public clouds, we would like to have this data available and visualized in a similarly intuitive way in the open-source-based dashboards. So that’s number one. Number two, we would like to see cross-correlation packaged up and opened up for any type of application. The cross-correlation of this log data and information from technologies such as service mesh, etc. that are cloud-native technologies - So as you mentioned, Chris, actually combining on a single dashboard metrics and metric-type data of a different nature and even representing in some metric way the events that are coming, such as rate, classified by type, whether these are warning or error events or coming from particular [applications?]. And last, but not least, is to be able to employ dashboards like these on a large scale. Organizations that have thousands of microservices, hundreds of applications - to be able to consolidate data, metric-backed information, in a single high-performing tool such as InfluxDB. And what I understand is that for that, we need [inaudible] capabilities. We’ll need more enterprise-type features, which are required to step up in large-scale environments to your enterprise version. That would be my take. Richard, what’s yours?
Richard Juknavorian: 00:51:16.008 Yeah. I think for us, Chris, it’s that notion of APM as a platform. Right? And having the opportunity from a cloud-native visibility standpoint - how do you bring everything - all of that management together into one place. Right? I think as Alex and I are working together, and the things that we hear out in the field, it’s that notion of, “I really need to have APM as a platform.” So we’re continuing to kind of understand. Okay. What are the other either event-based analytics or service management or other monitoring or cloud monitoring things that we need to bring together in order to have that true sort of platform experience for all of your APM tools and all of your APM needs. And then also being able to go high and low. Right? To be able to have a version that works for SREs and application developers as easily as it works for someone who is at the leadership level or at the management level.
Chris Churilo: 00:52:15.703 Excellent. Yeah. I think that’s really great. And it really is about being able to see across the entire “stack.” Right? Whether it’s a physical stack or a virtual stack or whatever. Different kinds of cloud-native things that you want to look at. You can’t look at all these things in isolation because they work together to present your users with a registration page or with an agenda page. And so why collect and look at that data in silos, when we already know they work together?
Richard Juknavorian: 00:52:45.554 Keep [crosstalk].
Chris Churilo: 00:52:46.235 I’ve taken up most of the time on questions. I just want to see if the audience has any questions. We just have a couple of minutes. So make sure if you do, just post it in the Q&A and the chat panel. We’ll keep the lines open just for a couple more minutes and as we start to wrap this up. I’m sorry, Alex. Were you about to say something?
Richard Juknavorian: 00:53:08.886 I think so.
Alex Kozlov: 00:53:09.317 No problem.
Chris Churilo: 00:53:09.411 Okay. Okay. Good. All right. So I think with that, we’ll just start to wrap it up. I just want to remind everybody. Just want to give great thanks to both Richard and Alex. This was a great view at looking at some really complicated data that’s essential for your organizations if you’re trying to make sure that you can make things as important as a registration component of an event manager’s solution to work for these really busy, important CIOs. But I think it also is important to stress to everybody on our call that, don’t assume open source is junk and cheap. These are really great projects. At InfluxData we actually have four different open-source projects. We also have a collector agent called Telegraf that you can also utilize to even bring in more of these Kubernetes or cloud-native types of metrics and events into your solution to make it so that you can look at things very quickly, so you can really get to the heart of the various issues. Richard and Alex, any last thoughts before we close it off today?
Richard Juknavorian: 00:54:16.766 Just want to say thank you to InfluxData and your graciousness to host this conversation for us. We’re really excited. We’re, as you know, seeing a lot of value from the time-series analytics and the integration of Influx and Riverbed and tying it all together with Grafana. And we’re super-excited. So thank you so much for giving us this platform for today.
Chris Churilo: 00:54:40.430 Absolutely. Absolutely. And as mentioned, this session is recorded, so we will be posting this. And then, if you have any questions, please feel free to forward them to me, and I’ll pass them on to the guys that were presenting today. All right. Thank you so much, everyone, and have a pleasant day. Bye. Bye.
Richard Juknavorian: 00:54:56.763 Thank you. Take care now.
Alex Kozlov: 00:54:58.409 Thank you. Bye.
[/et_pb_toggle]
Richard Juknavorian
Managing Partner at IT Squared
Richard Juknavorian is a Managing Partner at IT Squared, a software and professional services firm focused on post-implementation consulting and value realization for leading APM software and solutions. Prior to that, he was the SVP for Performance Management at PointRight, the leading provider of cloud-based analytics for long-term and post-acute care (LTPAC). Richard is an active mentor and strategic advisor to several early-stage companies and start-up accelerator programs.
Alex Kozlov
Technical Evangelist at Riverbed
Alex Kozlov is a hands-on Technical Evangelist at Riverbed with 20+ years of experience as an enterprise architect focusing on IT Analytics, APM and Dynamic Software Architectures. Prior to Riverbed, Alex worked as a Senior IT Architect at Gartner where he was responsible for IT Analytics and APM for the portfolio of internal applications. Alex's recent experience includes discovery, visibility and RCA for cloud- and container-based applications using an APM, NPM and IT Analytics toolset.