Multi-Class Classification Evaluation for AI Agents

Published on June 12, 2025 by Claudio Teixeira

A comprehensive guide to evaluating multi-class classification performance for AI agents with hierarchical categorization systems.

Multi-Class Classification Evaluation

Why a Single Accuracy Score Is Not Sufficient

For a hierarchical system (category → subcategory), a simple, "flat" accuracy score can be misleading. For instance, if the correct classification is Category: Invoices / Subcategory: Overdue, a prediction of Category: Invoices / Subcategory: Unpaid is significantly better than a prediction of Category: Purchase Orders / Subcategory: New. A flat accuracy metric would treat both errors as equally wrong.

  • Strict Accuracy: This is the most straightforward metric, representing the proportion of correctly classified items across all levels. It's calculated as (Correct Predictions) / (Total Predictions).

  • Confusion Matrix: This is an essential tool for visualizing performance. It shows not just how many predictions were wrong, but also which categories are being confused with one another.

  • Precision, Recall, and F1-Score: These metrics provide deeper insights, especially for test datasets with significant category imbalance (where some categories are much more common than others in your evaluation data).

These scores are calculated for each category; you can then compute a micro, macro, or weighted average to get a single score for the category level, as in the sketch below.
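For concreteness, here is a minimal sketch of these category-level scores using scikit-learn. The label lists are illustrative placeholders, not data from this article:

```python
# Minimal sketch of category-level metrics with scikit-learn.
# The label lists below are illustrative placeholders.
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_fscore_support)

y_true = ["Invoices", "Invoices", "Purchase Orders", "Invoices", "Purchase Orders"]
y_pred = ["Invoices", "Purchase Orders", "Purchase Orders", "Invoices", "Invoices"]

# Which categories are being confused with which:
print(confusion_matrix(y_true, y_pred, labels=["Invoices", "Purchase Orders"]))

# Per-category precision/recall/F1 plus macro and weighted averages:
print(classification_report(y_true, y_pred, zero_division=0))

# Or pull a single averaged score directly:
for avg in ("micro", "macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```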

Another useful metric is Conditional Accuracy, which measures the accuracy of the subcategory prediction only for those items where the parent category was correctly identified. This isolates the performance of the subcategory classification logic itself.
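A minimal sketch of this calculation (the tuple record format is an assumption for illustration):

```python
# Conditional Accuracy: subcategory accuracy measured only on items whose
# parent category was predicted correctly. The record format is assumed
# for illustration: (true_cat, true_sub, pred_cat, pred_sub).
def conditional_accuracy(records):
    eligible = [r for r in records if r[0] == r[2]]  # category correct
    if not eligible:
        return 0.0
    correct = sum(1 for r in eligible if r[1] == r[3])  # subcategory also correct
    return correct / len(eligible)

records = [
    ("Invoices", "Overdue", "Invoices", "Overdue"),     # counted, correct
    ("Invoices", "Overdue", "Invoices", "Unpaid"),      # counted, wrong
    ("Invoices", "Overdue", "Purchase Orders", "New"),  # excluded (category wrong)
]
print(conditional_accuracy(records))  # 0.5
```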

Overall Accuracy Score (Strict Accuracy)

Strict Accuracy (Exact Match Ratio)

This is the simplest and most stringent method. A prediction counts as correct only if all three levels (category, subcategory, and flag) are correct.

Overall Strict Accuracy = (Number of items where Category, Subcategory, AND Flag are all correct) / (Total Number of Items)

This metric is easy to understand and reflects the agent's ability to perform the entire task perfectly. It is often used as a primary key performance indicator (KPI).
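Here is a minimal sketch of this calculation, assuming each test item carries a (category, subcategory, flag) tuple for both ground truth and prediction:

```python
# Strict Accuracy (exact match ratio): a prediction counts only if category,
# subcategory, AND flag all match the ground truth. The dict shape of each
# item is an assumption for illustration.
def strict_accuracy(items):
    """items: list of dicts with 'truth' and 'pred', each a (cat, sub, flag) tuple."""
    if not items:
        return 0.0
    exact = sum(1 for it in items if it["truth"] == it["pred"])
    return exact / len(items)

items = [
    {"truth": ("Customer Complaint", "Late Delivery", "OK"),
     "pred":  ("Customer Complaint", "Late Delivery", "OK")},            # exact match
    {"truth": ("Customer Complaint", "Product Defect", "OK"),
     "pred":  ("Customer Complaint", "Product Defect", "Info Missing")}, # flag wrong
]
print(strict_accuracy(items))  # 0.5
```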

Advantages of Using Strict Accuracy

  • Ultimate Measure of Quality: It represents the "gold standard" for your agent's performance. It answers the simple, critical question: "What percentage of the time does the agent get everything 100% correct?"

  • Easy to Understand and Communicate: This metric is incredibly intuitive. You can easily explain to stakeholders, managers, or clients that this score represents the proportion of tasks the agent handled flawlessly from start to finish. There is no ambiguity.

  • Excellent Primary KPI: Because of its clarity and high standard, it serves as a powerful Key Performance Indicator (KPI) for the overall project. It's the single number that best reflects the system's end-to-end reliability.

💡 The industry-standard approach is to use Strict Accuracy as your main, headline metric.

Practical Examples

For Reporting to Leadership/Clients:

"Our agent's overall performance is at 85% Strict Accuracy, meaning it handles 
85 out of every 100 items perfectly from start to finish."

For Your Development Team's Analysis:

"Our Strict Accuracy is at 85%. Let's look at the diagnostic dashboard to see 
where the remaining 15% of errors are coming from:

- Level 1 (Category): 98% Accuracy (Very strong)
- Level 2 (Subcategory): 92% Conditional Accuracy (Good, but some errors here)
- Level 3 (Flag): 94% Accuracy (Also strong)

This tells us that most of our errors are happening at the subcategory level. 
Let's focus our efforts on improving that specific part of the model."

Test Data Requirements

Using unique data items is critical for accurate evaluation

For standard accuracy testing, you should use unique data items for each run rather than running the same data multiple times. The goal is to measure how well the agent generalizes to new, unseen information, which reflects its performance in a real-world scenario.

Using 100 different items tests the agent's capability across 100 different challenges. Re-running the same item only tests the agent's consistency on a single task, not its overall accuracy. This is the industry-standard approach to ensure the final score is a meaningful measure of the agent's effectiveness.

Sample Size Guidelines

The minimum amount of unique data needed for testing depends on your system's complexity and the confidence you require in the results:

  • Early-stage development: 100-500 items (useful but not statistically robust)
  • Business applications: 1,000-5,000 unique items (reliable standard)
  • Rule of thumb: At least 15% of your total dataset

A larger test set reduces the margin of error and ensures that even your rarest and most critical subcategories are represented multiple times, providing a true and trustworthy measure of performance across all scenarios.

For example, if your database contains 9,000 classifications from the past several years, a minimum test set of about 1,350 items (15%) would provide statistically significant results while remaining practical to implement.
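As a rough sanity check on sample size, you can estimate the 95% margin of error of an accuracy score under a normal approximation. This sketch assumes independent test items; the 85% figure is this article's running example:

```python
# Rough 95% margin of error for an accuracy estimate (normal approximation).
# Assumes independent test items; 0.85 is this article's running example.
import math

def margin_of_error(accuracy: float, n: int, z: float = 1.96) -> float:
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

for n in (100, 500, 1350, 5000):
    moe = margin_of_error(0.85, n)
    print(f"n={n:>5}: 85% ± {moe * 100:.1f} percentage points")
# n=100 gives roughly ±7 points; n=1350 tightens this to about ±1.9 points.
```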

Testing Stochastic Models: The Two-Pronged Approach

Modern AI agents built on state-of-the-art models are typically stochastic by nature, meaning they may produce different outputs for the same input due to non-zero temperature settings. This requires a comprehensive evaluation approach that tests both accuracy and consistency.

1. Primary Test: Generalization Accuracy

This test applies the Strict Accuracy metric (defined earlier) to measure how well the agent performs across diverse problems.

  • Method: Use a large, diverse test set of unique items (e.g., 1,000+). Run each unique item through the agent only once.
  • What It Measures: The agent's ability to generalize and handle the breadth of real-world scenarios.
  • Primary Metric: Strict Accuracy as defined above (exact match ratio where all three levels must be correct). This tells you, "On its first try with new data, how often does our agent get everything 100% correct across all classification levels?"

2. Supplemental Test: Consistency & Reliability

This test ensures the agent is stable and trustworthy, not giving different answers to the same question.

  • Method:
    • Select a small but representative sample of 10-20 items from your test set
    • Include different categories and edge cases
    • Run each item through the agent multiple times (e.g., 10-20 times)
  • What It Measures: The agent's reliability when given the same input repeatedly
  • Primary Metric: Consistency Score = (Number of Correct Classifications) / (Total Runs); a sketch of this loop follows this list
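Here, `classify` is a hypothetical stand-in for your agent's inference call:

```python
# Sketch of the consistency test: re-run a small representative sample many
# times and score how often the full (category, subcategory, flag) output
# matches the ground truth. `classify` is a hypothetical stand-in.
def classify(item):
    raise NotImplementedError("call your agent here; return (category, subcategory, flag)")

def consistency_score(sample, runs_per_item=10):
    """sample: list of (item, ground_truth_tuple) pairs."""
    total = correct = 0
    for item, truth in sample:
        for _ in range(runs_per_item):
            total += 1
            if classify(item) == truth:
                correct += 1
    return correct / total if total else 0.0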

Reporting Combined Results

With this two-part evaluation, you can report a more complete picture of your agent's performance:

"Our agent has a Strict Accuracy of 85% on new, unseen data, demonstrating strong generalization. Furthermore, it has a Consistency Score of 95%, indicating that it is highly reliable and produces the correct output consistently when presented with the same input."

This framework, which tests for both broad accuracy and specific reliability, is the robust, industry-standard approach for validating AI systems built on modern stochastic models.

Tangible Example

Classification Structure

{
  "Customer Complaint": {
    "subcategories": {
      "Product Defect": {
        "required_data": ["customer_id", "product_sku", "description_of_defect", "photo_url"]
      },
      "Late Delivery": {
        "required_data": ["customer_id", "order_id", "expected_delivery_date"]
      },
      "Poor Service": {
        "required_data": ["customer_id", "agent_name", "date_of_interaction", "description_of_incident"]
      }
    }
  },
  "Customer Praise": {
    "subcategories": {
      "Excellent Service": {
        "required_data": ["customer_id", "agent_name"]
      },
      "Great Product": {
        "required_data": ["customer_id", "product_sku"]
      }
    }
  },
  "Customer Refund Request": {
    "subcategories": {
      "Duplicate Charge": {
        "required_data": ["customer_id", "order_id", "transaction_amount", "transaction_date"]
      },
      "Product Not As Described": {
        "required_data": ["customer_id", "order_id", "reason_for_refund"]
      },
      "Item Not Received": {
        "required_data": ["customer_id", "order_id", "tracking_number"]
      }
    }
  },
  "Customer Legal Request": {
    "subcategories": {
      "Data Privacy Inquiry": {
        "required_data": ["customer_id", "requestor_name", "details_of_request"]
      },
      "Subpoena": {
        "required_data": ["case_number", "issuing_court", "documents_requested"]
      }
    }
  }
}
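To make the third level concrete, here is a sketch of how the flag could be derived from this structure: an item is "OK" only when every required_data field for its subcategory is present. The helper name and sample values are assumptions for illustration:

```python
# Derive the flag from the classification structure: "OK" only if every
# required_data field is present and non-empty. SCHEMA inlines a subset of
# the JSON above so the example is self-contained.
SCHEMA = {
    "Customer Praise": {
        "subcategories": {
            "Excellent Service": {"required_data": ["customer_id", "agent_name"]},
        }
    }
}

def derive_flag(category, subcategory, extracted):
    """Return 'OK' if every required field is present and non-empty."""
    required = SCHEMA[category]["subcategories"][subcategory]["required_data"]
    missing = [f for f in required if not extracted.get(f)]
    return "OK" if not missing else "Info Missing"

print(derive_flag("Customer Praise", "Excellent Service",
                  {"customer_id": "C-1029", "agent_name": ""}))  # Info Missing
```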

Strict Accuracy Measurement Table & Diagnostic Dashboard

Performance Evaluation Sample (Items 82–100 of 100 Test Items)

| Item ID | Ground Truth (Category) | Ground Truth (Subcategory) | Ground Truth (Flag) | Agent Prediction (Category) | Agent Prediction (Subcategory) | Agent Prediction (Flag) | Strict Match? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 82 | Customer Complaint | Late Delivery | OK | Customer Complaint | Late Delivery | OK | ✅ True |
| 83 | Customer Refund Request | Duplicate Charge | OK | Customer Refund Request | Duplicate Charge | OK | ✅ True |
| 84 | Customer Praise | Excellent Service | Info Missing | Customer Praise | Excellent Service | Info Missing | ✅ True |
| 85 | Customer Complaint | Product Defect | OK | Customer Complaint | Product Defect | OK | ✅ True |
| 86 | Customer Legal Request | Subpoena | OK | Customer Legal Request | Subpoena | OK | ✅ True |
| 87 | Customer Complaint | Late Delivery | OK | Customer Complaint | Late Delivery | OK | ✅ True |
| 88 | Customer Refund Request | Item Not Received | OK | Customer Refund Request | Item Not Received | OK | ✅ True |
| 89 | Customer Praise | Great Product | OK | Customer Praise | Great Product | OK | ✅ True |
| 90 | Customer Complaint | Poor Service | OK | Customer Complaint | Poor Service | OK | ✅ True |
| 91 | Customer Complaint | Product Defect | OK | Customer Refund Request | Product Not As Described | OK | ❌ False |
| 92 | Customer Refund Request | Duplicate Charge | OK | Customer Refund Request | Duplicate Charge | OK | ✅ True |
| 93 | Customer Complaint | Late Delivery | OK | Customer Complaint | Product Defect | OK | ❌ False |
| 94 | Customer Legal Request | Data Privacy Inquiry | OK | Customer Legal Request | Data Privacy Inquiry | OK | ✅ True |
| 95 | Customer Complaint | Product Defect | OK | Customer Complaint | Product Defect | Info Missing | ❌ False |
| 96 | Customer Refund Request | Item Not Received | OK | Customer Refund Request | Item Not Received | OK | ✅ True |
| 97 | Customer Praise | Excellent Service | OK | Customer Praise | Excellent Service | OK | ✅ True |
| 98 | Customer Complaint | Late Delivery | OK | Customer Complaint | Late Delivery | OK | ✅ True |
| 99 | Customer Refund Request | Product Not As Described | OK | Customer Refund Request | Product Not As Described | OK | ✅ True |
| 100 | Customer Legal Request | Subpoena | OK | Customer Legal Request | Subpoena | OK | ✅ True |
| ... | ... | ... | ... | ... | ... | ... | ... |

Performance Summary (Calculated over 100 Test Items)

  • Total Items Tested: 100
  • Total Correct Strict Matches: 85
  • Total Items with Errors: 15

Strict Accuracy

  • Strict Accuracy (Exact Match Ratio): 85%
  • (Total Correct Strict Matches) / (Total Items) = 85 / 100 = 85%

Diagnostic Dashboard

  • Level 1 (Category): 98% Accuracy: (100 - 2) / 100 = 98%
  • Level 2 (Subcategory): ≈92% Conditional Accuracy: (98 - 8) / 98 = 90 / 98 ≈ 91.8%
  • Level 3 (Flag): 94% Accuracy: (100 - 6) / 100 = 94%

Consistency Test Results

  • Sample Size: 15 representative items
  • Runs Per Item: 10
  • Total Runs: 150
  • Correct Classifications: 143
  • Consistency Score: 143 / 150 ≈ 95.3%

Complete Performance Summary: "Our agent achieves 85% Strict Accuracy on new data with 95.3% Consistency when processing the same inputs multiple times."

Understanding and Controlling Stochastic Behavior

The Role of Temperature

The stochastic (random) behavior in AI models comes from how they choose the next word from a list of probabilities. This is controlled by a setting called temperature.

Stochastic Behavior (temperature > 0)

  • This is the default setting for most LLMs (like GPT-4, Claude, Llama)
  • The model samples from the probability distribution, not always picking the word with the highest probability
  • Higher temperature increases randomness, making the model more "creative"
  • This is why asking the same question multiple times produces different phrasings, reasoning, and sometimes different final answers
  • This behavior necessitates consistency testing

Deterministic Behavior (temperature = 0)

  • You can force an LLM to be deterministic by setting its temperature to 0 (see the sketch after this list)
  • At temperature = 0, the model always picks the word with the absolute highest probability
  • There is no sampling or randomness involved
  • The same input will produce the same output every time (in practice, some serving stacks retain minor nondeterminism, but behavior is effectively stable)
  • The model becomes predictable and consistent
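As a concrete illustration, here is a minimal classification call with the OpenAI Python client; the model name and prompt are placeholders, and other providers expose an equivalent temperature parameter:

```python
# Minimal sketch of a deterministic classification call using the OpenAI
# Python client. Model name and prompt are placeholders; adapt for your
# provider's equivalent temperature setting.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    temperature=0,   # always pick the highest-probability token
    messages=[
        {"role": "system",
         "content": "Classify the message into category, subcategory, and flag."},
        {"role": "user", "content": "My order #4512 still hasn't arrived."},
    ],
)
print(response.choices[0].message.content)
```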

How Temperature Affects Model Behavior

Important: Setting temperature = 0 will not affect the model's underlying capabilities to interpret and classify text. In fact, for a classification task, it will make your agent more reliable.

To understand why, consider the two-stage process of how LLMs work:

  1. Interpretation & Reasoning (The "Thinking"):

    • The model reads your input ("open text")
    • It uses its neural network to understand context, grammar, entities, and intent
    • It reasons about which classification is most appropriate
    • It calculates probabilities for all possible outcomes
    • This is where the model's "capability" resides
  2. Word Selection (The "Speaking"):

    • The model chooses output words based on the probabilities it calculated
    • This is the only stage that temperature affects

Best Practices for Classification Systems

Why temperature = 0 is BETTER for Classification

For analytical tasks like classification, extraction, or summarization, you want the model's most confident, highest-probability answer. You are not asking for creativity; you are asking for correctness and consistency. By setting temperature = 0, you are essentially telling the model:

"Do all your complex reasoning and interpretation, and then give me the single best answer you came up with. Don't get creative, don't sample from less likely options. Just give me your most confident conclusion."

This ensures that the same input will always result in the same classification, which is critical for building a trustworthy and auditable agentic system.

When to use temperature > 0

You should only use a higher temperature for creative and generative tasks where you want variety and novelty, such as:

  • Brainstorming ideas
  • Writing poetry or marketing copy
  • Creating a chatbot with a more "human-like," less repetitive personality

Implementation Guidelines

If you set temperature = 0:

  • Your agent becomes deterministic
  • You no longer need the separate "Consistency Test" because the agent will give the same answer for the same input every time
  • You can focus solely on the primary "Generalization Accuracy" test with unique items

If you cannot set temperature = 0 (or choose not to):

  • Your agent remains stochastic
  • The two-pronged testing approach (Accuracy + Consistency) is absolutely essential to ensure your agent is both smart and reliable

Final Recommendation: For your classification agent, setting temperature = 0 is the best practice. It leverages the model's full interpretive power while ensuring the output is predictable, reliable, and directly reflects its most confident analysis.