Authors: Sicong Huang, Jincheng He, Shiyuan Huang, Karthik Raja Anandan, Arkajyoti Chakraborty, Ian Lane
[NEWS] Our paper won the best system paper award at SemEval-2025!
Overview
Hallucinations remain a critical challenge for large language models (LLMs), especially when answering knowledge-intensive queries. While many systems can tell you if an output is flawed, they can’t tell you where. Our work, developed for the SemEval-2025 Task 3 (Mu-SHROOM), targets this exact problem by pinpointing hallucinated spans at a granular level across 14 different languages.
Our approach was highly successful, achieving the #1 average rank across all languages and placing in the top two for 11 of those languages.
The Challenge: Mu-SHROOM 🍄
The SemEval-2025 Task 3 (Mu-SHROOM) required systems to perform span-level hallucination detection. Given a question and an LLM-generated answer, the goal was to identify the exact character spans containing false or unverifiable information. This moves beyond simple binary classification to a much more useful, fine-grained analysis.
An example question and its LLM-generated answer; the highlighted spans are hallucinated.
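To make "exact character spans" concrete, the toy example below shows the kind of output the task expects. It is purely illustrative: the dictionary fields and the example itself are ours, not the shared task's official data schema.

```python
# Illustrative only: field names are hypothetical, not the official Mu-SHROOM schema.
example = {
    "question": "What year was the Eiffel Tower completed?",
    "answer": "The Eiffel Tower was completed in 1899 and stands 330 meters tall.",
    # (start, end) character offsets into `answer` judged to be hallucinated.
    # Here "1899" is wrong: the tower was completed in 1889.
    "hallucinated_spans": [(34, 38)],
}

for start, end in example["hallucinated_spans"]:
    print(example["answer"][start:end])  # -> "1899"
```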
Our Three-Stage Approach 🛠️
Our three-stage hallucination detection pipeline: context retrieval, hallucination detection, and span mapping.
- Context Retrieval: First, retrieve relevant documents using either the question alone or the question plus claims extracted from the answer as queries. This external context is crucial for grounding the verification process.
- Hallucination Detection: With the context in hand, identify the incorrect content in the answer. We experimented with three methods:
  - Text Extraction: Prompt an LLM to directly extract any text from the answer that contradicts the retrieved context.
  - Knowledge-Graph Verification: Parse the context into a structured knowledge graph and break the answer down into atomic facts. An LLM then verifies each fact against the graph to identify inaccuracies.
  - Minimal Cost Revision: Use a reasoning model to correct the original answer while making the fewest possible changes. The differences between the original and corrected versions are then treated as hallucinations.
- Span Mapping: Finally, map the identified incorrect content back to precise character spans in the original answer, using a mapping method paired with each detection method (see the sketch after this list):
  - Substring Matching: For Text Extraction, find the exact location of the extracted incorrect string in the original answer.
  - Fact-to-Span Mapping: Following Knowledge-Graph Verification, prompt an LLM to identify which specific text spans in the answer correspond to each false fact.
  - Mapping via Edit Distance: After Minimal Cost Revision, compute the edit distance between the original and corrected answers. Words that were substituted or deleted during revision are marked as hallucinated spans.
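To make the span-mapping stage concrete, here is a minimal Python sketch of the two simplest mappings: substring matching for Text Extraction, and a word-level diff (using Python's difflib as a stand-in for the edit-distance computation) for Minimal Cost Revision. The function names, tokenization, and toy example are our own simplifications, not the exact implementation from the paper.

```python
import difflib


def map_extracted_spans(answer: str, extracted: list[str]) -> list[tuple[int, int]]:
    """Substring Matching: locate each extracted incorrect string in the answer."""
    spans = []
    for text in extracted:
        start = answer.find(text)
        if start != -1:
            spans.append((start, start + len(text)))
    return spans


def map_revision_spans(answer: str, revised: str) -> list[tuple[int, int]]:
    """Mapping via Edit Distance (approximated here with a word-level diff):
    words of the original answer that were replaced or deleted in the revision
    are marked as hallucinated character spans."""
    # Tokenize into words while remembering each word's character offsets.
    words, offsets, pos = [], [], 0
    for word in answer.split():
        start = answer.find(word, pos)
        words.append(word)
        offsets.append((start, start + len(word)))
        pos = start + len(word)

    spans = []
    matcher = difflib.SequenceMatcher(a=words, b=revised.split())
    for op, i1, i2, _, _ in matcher.get_opcodes():
        if op in ("replace", "delete"):  # changed or removed words -> hallucinated
            spans.append((offsets[i1][0], offsets[i2 - 1][1]))
    return spans


answer = "The Eiffel Tower was completed in 1899 and is located in Lyon."
revised = "The Eiffel Tower was completed in 1889 and is located in Paris."
print(map_revision_spans(answer, revised))  # [(34, 38), (57, 62)]
```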
To boost performance, we also used Automatic Prompt Optimization (MiPROv2) to systematically find the best prompts for the detection stage.
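As a rough illustration of how this prompt-optimization step can be wired up, the sketch below uses the DSPy library's MIPROv2 optimizer. The signature fields, metric, model name, and training example are our assumptions for illustration, and the exact optimizer arguments vary across DSPy versions; this is not the configuration from the paper.

```python
import dspy
from dspy.teleprompt import MIPROv2

# Assumed model identifier; any LM supported by DSPy could be used here.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))


class ExtractHallucinations(dspy.Signature):
    """Extract verbatim substrings of the answer that contradict the context."""
    context = dspy.InputField(desc="retrieved documents")
    question = dspy.InputField()
    answer = dspy.InputField(desc="LLM-generated answer to check")
    hallucinated_text = dspy.OutputField(desc="incorrect substrings, one per line")


detector = dspy.Predict(ExtractHallucinations)


def extraction_f1(example, pred, trace=None):
    """Hypothetical metric: word-level F1 between predicted and gold hallucinated text."""
    pred_words = set(pred.hallucinated_text.lower().split())
    gold_words = set(example.gold_hallucinated_text.lower().split())
    if not pred_words or not gold_words:
        return float(pred_words == gold_words)
    overlap = len(pred_words & gold_words)
    precision, recall = overlap / len(pred_words), overlap / len(gold_words)
    return 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)


# Tiny stand-in training set; the real one would hold annotated task examples.
trainset = [
    dspy.Example(
        context="The Eiffel Tower was completed in 1889.",
        question="When was the Eiffel Tower completed?",
        answer="The Eiffel Tower was completed in 1899.",
        gold_hallucinated_text="1899",
    ).with_inputs("context", "question", "answer")
]

optimizer = MIPROv2(metric=extraction_f1, auto="light")
optimized_detector = optimizer.compile(detector, trainset=trainset)
```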
Key Results 🏆
Our system delivered excellent performance with several key findings:
- Dominated the Competition: We achieved the #1 average rank and placed in the top 2 for 11 of the 14 languages on the Intersection-over-Union (IoU) evaluation metric.
- Context is King: Simply adding retrieved context boosted our IoU score by 27% relative, from 0.44 to 0.56 on the English test set.
- Simplicity Wins: Our simplest method—prompting an LLM to extract incorrect text based on context—consistently outperformed more complex approaches like knowledge graph verification.
- Prompt Optimization Pays Off: Automatically refining our prompts with MiPROv2 provided a clear performance boost, increasing our best system’s IoU on the English test set from 0.55 to 0.61.
- Outperforming Humans: Our best system (IoU of 0.57) was significantly more accurate than our own human annotators, whose best performance was an IoU of 0.48, highlighting the difficulty and subjectivity of the task.
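For reference, the IoU scores above are character-level: the predicted and gold hallucinated spans are compared as sets of character indices. Below is a minimal sketch of that computation; the official scorer additionally aggregates multiple annotators and scores probabilistic (soft) labels, which this omits.

```python
def char_iou(pred_spans, gold_spans):
    """Character-level IoU between predicted and gold hallucinated spans.
    Spans are (start, end) character offsets, end-exclusive."""
    pred_chars = {i for start, end in pred_spans for i in range(start, end)}
    gold_chars = {i for start, end in gold_spans for i in range(start, end)}
    if not pred_chars and not gold_chars:
        return 1.0  # both empty: perfect agreement
    return len(pred_chars & gold_chars) / len(pred_chars | gold_chars)


# Example: the prediction covers only half of the gold span.
print(char_iou(pred_spans=[(34, 38)], gold_spans=[(34, 42)]))  # 4 / 8 = 0.5
```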
Impact and Takeaways 🔬
This work offers several key insights for the field:
- Retrieval is crucial for building reliable fact-checking systems.
- Simple, well-prompted systems can be surprisingly powerful. With good context, a straightforward text extraction approach can outperform more brittle, complex pipelines.
- The task is inherently ambiguous. The low agreement among human annotators suggests that defining a single “correct” hallucinated span is challenging, which has implications for how we build and evaluate these systems.
- Span-level detection is a critical step toward creating tools that can provide real-time, trustworthy feedback to LLM users.
Resources
- Benchmark & Data: Mu-SHROOM Shared Task Site
BibTeX
@inproceedings{huang-etal-2025-ucsc,
title = "{UCSC} at {S}em{E}val-2025 Task 3: Context, Models and Prompt Optimization for Automated Hallucination Detection in {LLM} Output",
author = "Huang, Sicong and
He, Jincheng and
Huang, Shiyuan and
Anandan, Karthik Raja and
Chakraborty, Arkajyoti and
Lane, Ian",
editor = "Rosenthal, Sara and
Ros{\'a}, Aiala and
Ghosh, Debanjan and
Zampieri, Marcos",
booktitle = "Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.semeval-1.257/",
pages = "1981--1992",
ISBN = "979-8-89176-273-2",
abstract = "Hallucinations pose a significant challenge for large language models when answering knowledge-intensive queries. As LLMs become more widely adopted, it is crucial not only to detect if hallucinations occur but also to pinpoint where they arise. SemEval 2025 Task 3, Mu-SHROOM: Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, is a recent effort in this direction. This paper describes our solution to the shared task. We propose a framework that first retrieves relevant context, next identifies false content from the answer, and finally maps them back to spans. The process is further enhanced by automatically optimizing prompts. Our system achieves the highest overall performance, ranking {\#}1 in average position across all languages."
}