LLM Monitoring and Observability

The demand for LLMs is rapidly increasing; it's estimated that there will be 750 million apps using LLMs by 2025. As a result, the need for LLM observability and monitoring tools is also rising. In this blog, we'll dive into what LLM monitoring and observability are, why they're both crucial, and how we can track various metrics to ensure our model isn't just working but thriving.

LLM monitoring vs observability: key metrics

LLM monitoring involves tracking the performance of LLM applications using various evaluation metrics and methods. In contrast, LLM observability enables monitoring by offering full visibility and tracing across the entire LLM application system.

Some important LLM observability and monitoring metrics include:

  1. Resource utilization metrics, including CPU/GPU utilization, memory usage, disk I/O, etc.
  2. Performance metrics, including latency, throughput, etc. (a minimal collection sketch for these first two categories follows this list)
  3. LLM evaluation metrics, including prompts and responses, token usage, response completeness, relevance, hallucinations, fairness, perplexity, semantic similarity, and model accuracy and degradation.
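
As a minimal sketch of what collecting the first two categories might look like, the snippet below times a single LLM call and samples CPU and memory usage around it. It assumes the psutil package, and call_llm() is a hypothetical placeholder for whatever client you actually use:

```python
# A minimal sketch of collecting resource and latency metrics around an LLM call.
# Assumes the psutil package; call_llm() is a hypothetical stand-in for a real client.
import time
import psutil

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real LLM client call.
    return "..."

def timed_llm_call(prompt: str) -> dict:
    start = time.perf_counter()
    response = call_llm(prompt)
    latency_s = time.perf_counter() - start
    return {
        "latency_s": latency_s,                            # performance metric
        "cpu_percent": psutil.cpu_percent(interval=None),  # resource metrics
        "memory_percent": psutil.virtual_memory().percent,
        "response": response,
    }

print(timed_llm_call("What is observability?"))
```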

The process of evaluating and monitoring an LLM is complex from a practical and ethical perspective. To better understand some of the challenges specific to LLM monitoring, let’s look at some common LLM metrics.

Prompt and response, vector databases, and embedding visualizations

The first is monitoring prompts and responses. How organizations leverage prompts and responses varies widely depending on the organization and its privacy agreements with users. However, it is common for LLM applications to store chat histories for only a short period, typically in document stores. Monitoring prompt-response pairs helps identify recurring user concerns or common inquiries, providing insight into user needs and potentially uncovering areas where the LLM's responses could be further optimized for accuracy or ethical compliance. For example, human reviewers with diverse backgrounds may assess the model's responses and provide feedback to refine the model's behavior, especially in identifying and steering away from harmful stereotypes or biased statements. This feedback can be incorporated into reinforcement learning training data, which is then used to train models to avoid these pitfalls.
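
As a minimal sketch of that logging step (not tied to any particular document store), each prompt-response pair can be captured as a timestamped record; here the records are appended to a local JSON Lines file, which a document database would replace in production:

```python
# A minimal sketch of logging prompt-response pairs as timestamped records.
# Appends to a local JSON Lines file; a document store would play this role in production.
import json
import uuid
from datetime import datetime, timezone

def log_interaction(prompt: str, response: str, user_id: str, path: str = "chat_log.jsonl") -> None:
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_interaction("What's the weather in Paris?", "Mild and partly cloudy.", user_id="u-123")
```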

However, the chat history text is also frequently encoded into embeddings and stored in a vector database. By encoding chat histories as embeddings, applications can retrieve contextually similar responses or suggestions, improving the relevance of outputs in future interactions. This approach allows organizations to provide users with more personalized and context-aware responses without storing raw text data for extended periods, thereby enhancing privacy. In this way, leveraging vector databases alongside an LLM offers “long-term memory” for LLMs.
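
Here is a rough sketch of that "long-term memory" pattern: past chat turns are embedded and indexed so that semantically similar history can be retrieved for a new prompt. It assumes the sentence-transformers and faiss-cpu packages; the model name and sample texts are illustrative:

```python
# A rough sketch of the "long-term memory" pattern: embed past chat turns,
# index them, and retrieve the most similar ones for a new prompt.
# Assumes sentence-transformers and faiss-cpu; texts and model name are illustrative.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

history = [
    "User asked how to reset their password.",
    "User reported slow dashboard loading times.",
    "User asked about exporting data to CSV.",
]
embeddings = model.encode(history).astype("float32")

dim = int(embeddings.shape[1])
index = faiss.IndexFlatL2(dim)  # simple in-memory index standing in for a vector DB
index.add(embeddings)

query = model.encode(["How do I change my password?"]).astype("float32")
distances, ids = index.search(query, 2)
print([history[i] for i in ids[0]])  # contextually similar prior turns
```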

Additionally, vector databases can be leveraged alongside LLMs to evaluate and measure response accuracy. For example, a vector database can be populated with human-evaluated or expert-evaluated prompts and responses. The distance from an LLM-generated response to an approved response in the vector database then serves as a proxy for error. If the distance exceeds a certain threshold, the LLM could be experiencing model degradation or drift, and retraining might be necessary.
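
A minimal sketch of that check might look like the following, where randomly generated placeholder vectors stand in for real embeddings of approved and generated responses, and the 0.25 distance threshold is purely illustrative:

```python
# A minimal sketch of flagging possible drift by comparing a generated response's
# embedding to the nearest approved reference. Placeholder random vectors stand in
# for real embeddings; the 0.25 threshold is an illustrative assumption.
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_drift(response_emb: np.ndarray, approved_embs: np.ndarray, threshold: float = 0.25) -> bool:
    # Distance to the closest approved response; above threshold -> flag for review.
    distances = [cosine_distance(response_emb, ref) for ref in approved_embs]
    return min(distances) > threshold

rng = np.random.default_rng(0)
approved = rng.normal(size=(5, 384))   # placeholder approved-response embeddings
generated = rng.normal(size=384)       # placeholder generated-response embedding
print("possible drift:", check_drift(generated, approved))
```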

One common way that LLM developers uncover these insights is through embedding visualizations. Remember, in the context of LLM monitoring, vector databases store high-dimensional embeddings. Usually, data analysts use graphs to visualize their data, understand trends, and gain insights about the environment they are monitoring. But how do you visualize the hundreds or thousands of dimensions in an embedding space? The first step to creating an embedding visualization is dimensionality reduction: taking high-dimensional embeddings and reducing them to two or three dimensions so they can be plotted. This blog post describes common methodologies for dimensionality reduction.

Figure 1: An example of a 3D embedding visualization from Music Galaxy that helps visualize the similarity of music artists, a project mentioned in What I've Learned Building Interactive Embedding Visualizations.
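
As a toy counterpart to visualizations like the one above, the sketch below reduces a batch of embeddings to two dimensions with PCA (t-SNE and UMAP are common alternatives) and plots them. It assumes scikit-learn and matplotlib, and the random embeddings stand in for vectors pulled from a real vector database:

```python
# A minimal sketch of an embedding visualization: reduce high-dimensional
# embeddings to 2D with PCA and plot them. Random embeddings stand in for
# vectors pulled from a vector database.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

embeddings = np.random.default_rng(42).normal(size=(200, 384))  # placeholder embeddings

coords = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("Prompt embeddings projected to 2D")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```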

Token usage

Tokens are essentially the units of text (words or pieces of words) that models process. Each token processed by an LLM incurs a cost, especially in commercial applications. Monitoring token usage helps budget and control expenses by identifying heavy usage and potential optimization areas, such as reducing response lengths or limiting unnecessary requests. Additionally, token usage directly influences response time and latency. By tracking tokens over time, you can detect if requests or responses are unnecessarily lengthy, impacting the user experience. Monitoring can also reveal inefficient prompts that result in excessive or redundant output, allowing adjustments for leaner interactions. Token usage also reflects how users engage with the model. Tracking the tokens used per session or user can highlight patterns in behavior, such as users who consistently ask complex questions or require in-depth responses. Unusual spikes or dips in token usage can signal issues, such as unintended prompt behaviors, errors in usage patterns, or even security concerns (e.g., bot abuse).
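
A minimal sketch of per-request token tracking might use tiktoken to count tokens and estimate cost; the encoding choice and the price per million tokens below are illustrative assumptions, not real rates:

```python
# A minimal sketch of per-request token tracking with tiktoken.
# The encoding and the price per million tokens are illustrative placeholders.
import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")
PRICE_PER_MILLION_TOKENS = 2.50  # placeholder rate, not a real price

def track_tokens(prompt: str, response: str) -> dict:
    prompt_tokens = len(ENCODING.encode(prompt))
    response_tokens = len(ENCODING.encode(response))
    total = prompt_tokens + response_tokens
    return {
        "prompt_tokens": prompt_tokens,
        "response_tokens": response_tokens,
        "total_tokens": total,
        "estimated_cost_usd": total / 1_000_000 * PRICE_PER_MILLION_TOKENS,
    }

print(track_tokens("Summarize our Q3 revenue report.", "Q3 revenue grew 12% year over year..."))
```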

Evaluation metrics: relevance, perplexity, hallucinations, response completeness, fairness, etc.

Evaluating LLMs is a complex, multifaceted process that covers both the application’s performance and the model’s effectiveness. Ensuring an LLM produces accurate, relevant, and ethically sound responses is crucial, as these factors determine the model’s success and real-world impact.

So, how do LLM engineers quantify a model’s performance? By applying statistical and neural-based methods, including metrics like relevance, fairness, hallucinations, and perplexity, engineers assess various aspects of reliability and quality in LLM outputs. Here’s a brief look at what each of these metrics entails:

  • Relevance: Relevance measures how well the model’s response aligns with the input prompt and context. For example, if a user asks for the weather in a specific city, a relevant response would directly address the weather there, not unrelated information. High relevance indicates the model understands and appropriately responds to user requests.
  • Fairness: Fairness in LLMs ensures that the model’s responses are free from biases and represent diverse perspectives without marginalizing or favoring specific groups. Fairness includes examining and adjusting the training data, model architecture, and response generation to mitigate unintended bias.
  • Hallucinations: Hallucinations occur when an LLM generates information that sounds plausible but is factually incorrect or fictional. This is a common challenge in LLMs because they can sometimes “fill in” information based on patterns in the data they trained on, even if that information doesn’t correspond to reality.
  • Perplexity: Perplexity is a metric used to measure how well an LLM predicts a sequence of words or tokens. Lower perplexity generally indicates that the model is better at predicting the next word in a sequence, translating to more fluent and contextually accurate language generation. However, it's worth noting that perplexity alone doesn't capture accuracy or truthfulness. A small computation sketch follows this list.
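
As a small sketch of how perplexity is computed, the snippet below takes per-token log-probabilities (such as an API or a model forward pass might return) and applies the standard definition: the exponential of the average negative log-likelihood.

```python
# A small sketch of computing perplexity from per-token log-probabilities:
# exp of the average negative log-likelihood. The log-probs are illustrative.
import math

def perplexity(token_logprobs: list[float]) -> float:
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

logprobs = [-0.12, -0.85, -0.33, -2.10, -0.05]  # placeholder per-token log-probs
print(f"perplexity: {perplexity(logprobs):.2f}")  # lower is better
```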

Historically, statistical metrics such as BLEU and ROUGE were widely used to evaluate some of these aspects, but they often showed a low correlation with human judgments, especially for complex or open-ended tasks. In response, modern LLM-based evaluators, such as LLM-evals and embedding models, offer a reference-free approach, scoring text based on generation probability rather than comparison to predefined references. However, the effectiveness and reliability of these LLM-based methods are still being evaluated. LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide explores statistical and machine learning approaches for assessing these metrics in detail, as well as how G-Eval, a popular open-source LLM evaluation framework, works.

Figure 2: A framework diagram for G-Eval from G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment.

G-Eval is a framework that uses LLMs enhanced with chain-of-thought (CoT) prompting to evaluate model-generated responses. It has three main components:

  • Prompt for Evaluation: This natural language prompt defines the evaluation task and criteria, such as coherence or grammar, customized for the specific task. For example, text summarization might instruct the evaluator to rate coherence on a 1-5 scale, aiming for well-structured summaries.
  • Chain-of-Thought (CoT): G-Eval uses a series of intermediate steps, automatically generated by the LLM, to guide the evaluation. This CoT provides a structured process, such as reading the original article, comparing it to the summary, and scoring based on specific criteria, improving the LLM's accuracy and explainability in evaluation.
  • Scoring Function: The scoring function uses the LLM to produce a score based on probabilities assigned to each rating level. To improve granularity and avoid low variance or overly simple scoring, G-Eval calculates a weighted average of the probability of each score level, creating a continuous, more nuanced score that better captures subtle differences in text quality. A small numeric sketch of this weighting follows this list.
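
A small numeric sketch of that weighting: instead of taking only the most likely rating, each rating level is weighted by the probability the evaluator LLM assigns to it. The probabilities below are illustrative placeholders.

```python
# A small sketch of G-Eval-style weighted scoring: weight each rating level
# by the probability the evaluator LLM assigns to it, rather than taking the
# single most likely rating. The probabilities are illustrative placeholders.
def weighted_score(score_probs: dict[int, float]) -> float:
    return sum(score * prob for score, prob in score_probs.items())

# e.g. the evaluator assigns these probabilities to coherence ratings 1-5
probs = {1: 0.02, 2: 0.08, 3: 0.25, 4: 0.45, 5: 0.20}
print(f"coherence score: {weighted_score(probs):.2f}")  # 3.73, finer-grained than the argmax rating of 4
```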

The complexity of assessing LLM quality and reliability highlights the need for frameworks like G-Eval, but challenges remain. The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches provides an insightful analysis of the intricacies of LLM evaluation, detailing issues that arise with automated and LLM-based approaches and underscoring the significant role that evaluation practices will play in shaping future AI applications. The paper compares various evaluation methods and discusses the advantages and disadvantages of each:

  • Automated evaluations are fast and repeatable but may lack accuracy.
  • Human evaluations are thorough but costly and suffer from variability among evaluators. For example, a response that closely matches an expert-approved reference might still not be the most helpful answer for a novice in the field.
  • While promising for their efficiency, LLM-based evaluations lack extensive research and are often overly confident, leading to potential biases.

Building an LLM observability solution

There are several popular LLM observability tools available, but what goes into building one yourself? First, you'll want a document store to log prompts and responses, then a vector database to manage embeddings, provide context to responses, and assess response accuracy. Additionally, a time series database is essential for tracking performance, resource usage, and token metrics over time. Ideally, you'll use databases that support seamless integration into a data warehouse, enabling efficient root-cause analysis. For instance, you could use InfluxDB v3 to track token usage, resource utilization, and latency metrics. Anomaly monitoring can be set up alongside Apache Iceberg to access Parquet files and combine this data with other metrics, helping you detect model drift, bot activity, bias regressions, harmful outputs, and more.
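
As a rough sketch of the time series piece (assuming the influxdb3-python client; the host, token, database, and tag values are all placeholders), per-request metrics could be written to InfluxDB v3 like this:

```python
# A rough sketch of writing per-request LLM metrics to InfluxDB v3 as time series.
# Assumes the influxdb3-python client; host, token, database, and tag values are placeholders.
from influxdb_client_3 import InfluxDBClient3, Point

client = InfluxDBClient3(
    host="https://us-east-1-1.aws.cloud2.influxdata.com",  # placeholder host
    token="MY_TOKEN",                                      # placeholder token
    database="llm_metrics",                                # placeholder database
)

point = (
    Point("llm_request")
    .tag("model", "gpt-4o")        # illustrative tag values
    .tag("endpoint", "/chat")
    .field("latency_ms", 412.7)
    .field("prompt_tokens", 152)
    .field("response_tokens", 310)
    .field("cpu_percent", 38.2)
)

client.write(record=point)
```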