Authors: Sicong Huang, Jincheng He, Shiyuan Huang, Karthik Raja Anandan, Arkajyoti Chakraborty, Ian Lane
[NEWS] Our paper won the best system paper award at SemEval-2025!
Overview
Hallucinations remain a critical challenge for large language models (LLMs), especially when answering knowledge-intensive queries. While many systems can tell you if an output is flawed, they can’t tell you where. Our work, developed for the SemEval-2025 Task 3 (Mu-SHROOM), targets this exact problem by pinpointing hallucinated spans at a granular level across 14 different languages.
Our approach was highly successful, achieving the #1 average rank across all languages and placing in the top two for 11 of those languages.
The Challenge: Mu-SHROOM 🍄
The SemEval-2025 Task 3 (Mu-SHROOM) required systems to perform span-level hallucination detection. Given a question and an LLM-generated answer, the goal was to identify the exact character spans containing false or unverifiable information. This moves beyond simple binary classification to a much more useful, fine-grained analysis.
An example question and its LLM-generated answer; the highlighted spans are hallucinated.
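To make "exact character spans" concrete, the toy example below shows the kind of output the task expects. It is purely illustrative: the dictionary fields and the example itself are ours, not the shared task's official data schema.

```python
# Illustrative only: field names are hypothetical, not the official Mu-SHROOM schema.
example = {
    "question": "What year was the Eiffel Tower completed?",
    "answer": "The Eiffel Tower was completed in 1899 and stands 330 meters tall.",
    # (start, end) character offsets into `answer` judged to be hallucinated.
    # Here "1899" is wrong: the tower was completed in 1889.
    "hallucinated_spans": [(34, 38)],
}

for start, end in example["hallucinated_spans"]:
    print(example["answer"][start:end])  # -> "1899"
```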
Our Three-Stage Approach 🛠️
Our three-stage hallucination detection pipeline: context retrieval, hallucination detection, and span mapping.
- Context Retrieval: First, retrieve relevant documents using either the question alone or the question plus claims extracted from the answer as queries. This external context is crucial for grounding the verification process.
- Hallucination Detection: With the context in hand, identify the incorrect content in the answer. We experimented with three methods:
  - Text Extraction: Prompt an LLM to directly extract any text from the answer that contradicts the retrieved context.
  - Knowledge-Graph Verification: Parse the context into a structured knowledge graph and break the answer down into atomic facts. An LLM then verifies each fact against the graph to identify inaccuracies.
  - Minimal Cost Revision: Use a reasoning model to correct the original answer while making the fewest possible changes. The differences between the original and corrected versions are then treated as hallucinations.
- Span Mapping: Finally, map the identified incorrect content back to precise character spans in the original answer, using a mapping method paired with each detection method (see the sketch after this list):
  - Substring Matching: For Text Extraction, find the exact location of the extracted incorrect string in the original answer.
  - Fact-to-Span Mapping: Following Knowledge-Graph Verification, prompt an LLM to identify which specific text spans in the answer correspond to each false fact.
  - Mapping via Edit Distance: After Minimal Cost Revision, compute the edit distance between the original and corrected answers. Words that were substituted or deleted during revision are marked as hallucinated spans.
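To make the span-mapping stage concrete, here is a minimal Python sketch of the two simplest mappings: substring matching for Text Extraction, and a word-level diff (using Python's difflib as a stand-in for the edit-distance computation) for Minimal Cost Revision. The function names, tokenization, and toy example are our own simplifications, not the exact implementation from the paper.

```python
import difflib


def map_extracted_spans(answer: str, extracted: list[str]) -> list[tuple[int, int]]:
    """Substring Matching: locate each extracted incorrect string in the answer."""
    spans = []
    for text in extracted:
        start = answer.find(text)
        if start != -1:
            spans.append((start, start + len(text)))
    return spans


def map_revision_spans(answer: str, revised: str) -> list[tuple[int, int]]:
    """Mapping via Edit Distance (approximated here with a word-level diff):
    words of the original answer that were replaced or deleted in the revision
    are marked as hallucinated character spans."""
    # Tokenize into words while remembering each word's character offsets.
    words, offsets, pos = [], [], 0
    for word in answer.split():
        start = answer.find(word, pos)
        words.append(word)
        offsets.append((start, start + len(word)))
        pos = start + len(word)

    spans = []
    matcher = difflib.SequenceMatcher(a=words, b=revised.split())
    for op, i1, i2, _, _ in matcher.get_opcodes():
        if op in ("replace", "delete"):  # changed or removed words -> hallucinated
            spans.append((offsets[i1][0], offsets[i2 - 1][1]))
    return spans


answer = "The Eiffel Tower was completed in 1899 and is located in Lyon."
revised = "The Eiffel Tower was completed in 1889 and is located in Paris."
print(map_revision_spans(answer, revised))  # [(34, 38), (57, 62)]
```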
To boost performance, we also used Automatic Prompt Optimization (MiPROv2) to systematically find the best prompts for the detection stage.
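As a rough illustration of how this prompt-optimization step can be wired up, the sketch below uses the DSPy library's MIPROv2 optimizer. The signature fields, metric, model name, and training example are our assumptions for illustration, and the exact optimizer arguments vary across DSPy versions; this is not the configuration from the paper.

```python
import dspy
from dspy.teleprompt import MIPROv2

# Assumed model identifier; any LM supported by DSPy could be used here.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))


class ExtractHallucinations(dspy.Signature):
    """Extract verbatim substrings of the answer that contradict the context."""
    context = dspy.InputField(desc="retrieved documents")
    question = dspy.InputField()
    answer = dspy.InputField(desc="LLM-generated answer to check")
    hallucinated_text = dspy.OutputField(desc="incorrect substrings, one per line")


detector = dspy.Predict(ExtractHallucinations)


def extraction_f1(example, pred, trace=None):
    """Hypothetical metric: word-level F1 between predicted and gold hallucinated text."""
    pred_words = set(pred.hallucinated_text.lower().split())
    gold_words = set(example.gold_hallucinated_text.lower().split())
    if not pred_words or not gold_words:
        return float(pred_words == gold_words)
    overlap = len(pred_words & gold_words)
    precision, recall = overlap / len(pred_words), overlap / len(gold_words)
    return 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)


# Tiny stand-in training set; the real one would hold annotated task examples.
trainset = [
    dspy.Example(
        context="The Eiffel Tower was completed in 1889.",
        question="When was the Eiffel Tower completed?",
        answer="The Eiffel Tower was completed in 1899.",
        gold_hallucinated_text="1899",
    ).with_inputs("context", "question", "answer")
]

optimizer = MIPROv2(metric=extraction_f1, auto="light")
optimized_detector = optimizer.compile(detector, trainset=trainset)
```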
Key Results 🏆
Our system delivered excellent performance with several key findings:
- Dominated the Competition: We achieved the #1 average rank and placed in the top 2 for 11 of the 14 languages on the Intersection-over-Union (IoU) evaluation metric.
- Context is King: Simply adding retrieved context boosted our IoU score by 27% relative, from 0.44 to 0.56 on the English test set.
- Simplicity Wins: Our simplest method—prompting an LLM to extract incorrect text based on context—consistently outperformed more complex approaches like knowledge graph verification.
- Prompt Optimization Pays Off: Automatically refining our prompts with MiPROv2 provided a clear performance boost, increasing our best system’s IoU on the English test set from 0.55 to 0.61.
- Outperforming Humans: Our best system (IoU of 0.57) was significantly more accurate than our own human annotators, whose best performance was an IoU of 0.48, highlighting the difficulty and subjectivity of the task.
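For reference, the IoU scores above are character-level: the predicted and gold hallucinated spans are compared as sets of character indices. Below is a minimal sketch of that computation; the official scorer additionally aggregates multiple annotators and scores probabilistic (soft) labels, which this omits.

```python
def char_iou(pred_spans, gold_spans):
    """Character-level IoU between predicted and gold hallucinated spans.
    Spans are (start, end) character offsets, end-exclusive."""
    pred_chars = {i for start, end in pred_spans for i in range(start, end)}
    gold_chars = {i for start, end in gold_spans for i in range(start, end)}
    if not pred_chars and not gold_chars:
        return 1.0  # both empty: perfect agreement
    return len(pred_chars & gold_chars) / len(pred_chars | gold_chars)


# Example: the prediction covers only half of the gold span.
print(char_iou(pred_spans=[(34, 38)], gold_spans=[(34, 42)]))  # 4 / 8 = 0.5
```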
Impact and Takeaways 🔬
This work offers several key insights for the field:
- Retrieval is crucial for building reliable fact-checking systems.
- Simple, well-prompted systems can be surprisingly powerful. With good context, a straightforward text extraction approach can outperform more brittle, complex pipelines.
- The task is inherently ambiguous. The low agreement among human annotators suggests that defining a single “correct” hallucinated span is challenging, which has implications for how we build and evaluate these systems.
- Span-level detection is a critical step toward creating tools that can provide real-time, trustworthy feedback to LLM users.
Resources
- Benchmark & Data: Mu-SHROOM Shared Task Site
BibTeX
@inproceedings{huang-etal-2025-ucsc,
title = "{UCSC} at {S}em{E}val-2025 Task 3: Context, Models and Prompt Optimization for Automated Hallucination Detection in {LLM} Output",
author = "Huang, Sicong and
He, Jincheng and
Huang, Shiyuan and
Anandan, Karthik Raja and
Chakraborty, Arkajyoti and
Lane, Ian",
editor = "Rosenthal, Sara and
Ros{\'a}, Aiala and
Ghosh, Debanjan and
Zampieri, Marcos",
booktitle = "Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.semeval-1.257/",
pages = "1981--1992",
ISBN = "979-8-89176-273-2",
abstract = "Hallucinations pose a significant challenge for large language models when answering knowledge-intensive queries. As LLMs become more widely adopted, it is crucial not only to detect if hallucinations occur but also to pinpoint where they arise. SemEval 2025 Task 3, Mu-SHROOM: Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, is a recent effort in this direction. This paper describes our solution to the shared task. We propose a framework that first retrieves relevant context, next identifies false content from the answer, and finally maps them back to spans. The process is further enhanced by automatically optimizing prompts. Our system achieves the highest overall performance, ranking {\#}1 in average position across all languages."
}