Hypothesis Graph Refinement: Hypothesis-Driven Exploration with Cascade Error Correction for Embodied Navigation

Peixin Chen1,2,*, Guoxi Zhang2,*, Jianwei Ma1,3,†, Qing Li2,†
1Harbin Institute of Technology    2Beijing Institute for General Artificial Intelligence (BIGAI)    3Peking University
*Equal Contribution    †Corresponding Author

HGR explores 3D environments by generating semantic hypotheses at frontiers, verifying predictions upon visitation, and cascading corrections through the dependency graph when errors are detected.

72.4% Success Rate (GOAT-Bench)
56.2% SPL (GOAT-Bench)
4.5× Revisit Reduction (vs. 3D-Mem)
~20% Nodes Pruned (Cascade Correction)

Abstract

Embodied agents must explore partially observed environments while maintaining reliable long-horizon memory. Existing graph-based navigation systems improve scalability, but they often treat unexplored regions as semantically unknown, leading to inefficient frontier search. Although vision-language models (VLMs) can predict frontier semantics, erroneous predictions may be embedded into memory and propagate through downstream inferences, causing structural error accumulation that confidence attenuation alone cannot resolve. These observations call for a framework that can leverage semantic predictions for directed exploration while systematically retracting errors once new evidence contradicts them. We propose Hypothesis Graph Refinement (HGR), a framework that represents frontier predictions as revisable hypothesis nodes in a dependency-aware graph memory. HGR introduces (1) a semantic hypothesis module, which estimates context-conditioned semantic distributions over frontiers and ranks exploration targets by goal relevance, travel cost, and uncertainty, and (2) a verification-driven cascade correction mechanism, which compares on-site observations against predicted semantics and, upon mismatch, retracts the refuted node together with all its downstream dependents. Unlike additive map-building, this allows the graph to contract by pruning erroneous subgraphs, keeping memory reliable throughout long episodes. We evaluate HGR on multimodal lifelong navigation (GOAT-Bench) and embodied question answering (A-EQA, EM-EQA). HGR achieves a 72.41% success rate and 56.22% SPL on GOAT-Bench, and shows consistent improvements on both QA benchmarks. Diagnostic analysis reveals that cascade correction eliminates approximately 20% of structurally redundant hypothesis nodes and reduces revisits to erroneous regions by 4.5×, with specular and transparent surfaces accounting for 67% of corrected prediction errors.

Overview

Overview of Hypothesis Graph Refinement (HGR)
Overview of Hypothesis Graph Refinement (HGR). The hypothesis graph separates observed nodes (purple, verified regions) from hypothesis nodes (green, probabilistic frontier predictions), enabling a hypothesis-verification-correction cycle. (Left) Given the query "What is inside the basket?", observed nodes provide confirmed scene context, while hypothesis nodes project semantic distributions onto unexplored frontiers, guiding the agent toward the most likely location of the target object. (Right) Upon arrival, the agent verifies each hypothesis against actual observations. If confirmed, the hypothesis node transitions to an observed node; if refuted, cascade correction retracts the erroneous node and all its downstream dependents.

Method

HGR addresses two intertwined challenges: frontiers lack semantic cues for efficient exploration, yet VLM-based predictions risk embedding errors that propagate over long horizons.

Architecture of HGR
Architecture of HGR. (Left) Frontiers detected from the occupancy map are fed to a VLM reasoner for semantic hypothesis generation, producing hypothesis nodes linked to observed nodes via spatial and dependency edges. Upon visitation, cascade correction compares predicted and actual semantics; if the residual exceeds the threshold, the refuted node and all its dependents are removed. (Right) Running example on a floor plan showing graph refinement (confirmation) and graph shrinking (cascade correction).
1. Hypothesis Graph Representation

The graph G = (V, E, D) separates observed nodes (verified regions) from hypothesis nodes (frontier predictions). A dependency DAG D records derivation relationships, enabling systematic retraction when nodes are invalidated.
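The separation of node types and the dependency DAG can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation; all class and field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    kind: str  # "observed" (verified region) or "hypothesis" (frontier prediction)
    semantics: dict = field(default_factory=dict)  # predicted category distribution

class HypothesisGraph:
    """Sketch of G = (V, E, D): nodes, spatial edges, and a dependency DAG."""
    def __init__(self):
        self.nodes = {}             # V: node_id -> Node
        self.spatial_edges = set()  # E: spatial adjacency between nodes
        self.deps = {}              # D: DAG, parent_id -> set of derived child ids

    def add_node(self, node, derived_from=None):
        self.nodes[node.node_id] = node
        if derived_from is not None:
            # record that this hypothesis was inferred from an existing node,
            # so it can be retracted if its parent is later refuted
            self.deps.setdefault(derived_from, set()).add(node.node_id)

    def confirm(self, node_id):
        # on-site verification succeeded: hypothesis becomes an observed node
        self.nodes[node_id].kind = "observed"
```

Recording `derived_from` at creation time is what makes systematic retraction possible later: the DAG answers "which nodes exist only because this prediction was believed?"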

2. Semantic Hypothesis Module

Projects probabilistic semantic distributions onto frontiers using VLM world knowledge. Exploration scoring balances goal alignment, travel cost, and uncertainty bonus (Shannon entropy) for directed frontier selection.
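A hedged sketch of the frontier-scoring rule: the paper names the three ingredients (goal alignment, travel cost, entropy bonus), but the weights and the exact functional form below are assumptions.

```python
import math

def frontier_score(goal_alignment, travel_cost, semantic_dist,
                   w_goal=1.0, w_cost=0.5, w_unc=0.2):
    """Score a frontier for exploration; weights are illustrative placeholders.

    goal_alignment: similarity between the frontier's predicted semantics and the goal
    travel_cost:    path length from the agent to the frontier
    semantic_dist:  predicted category distribution at the frontier
    """
    # Shannon entropy of the semantic distribution acts as an uncertainty bonus,
    # encouraging visits to frontiers whose contents are least certain
    entropy = -sum(p * math.log(p) for p in semantic_dist.values() if p > 0)
    return w_goal * goal_alignment - w_cost * travel_cost + w_unc * entropy
```

Under this form, two frontiers with equal goal alignment and cost are broken in favor of the more uncertain one, which is what makes the bonus useful for directed but still exploratory frontier selection.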

3. Verification-Driven Cascade Correction

Upon visiting a hypothesis node, computes a prediction residual combining category mismatch, CLIP feature divergence, and object Jaccard dissimilarity. Refuted nodes trigger BFS traversal of the dependency DAG, removing the entire erroneous subgraph.
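The residual-plus-retraction step above can be sketched as follows. The residual weights, the helper names, and the equal weighting are assumptions; only the three residual terms and the BFS over the dependency DAG come from the description.

```python
from collections import deque

def prediction_residual(category_mismatch, clip_divergence,
                        objects_pred, objects_obs, w=(1.0, 1.0, 1.0)):
    """Combine the three mismatch signals into a single residual (illustrative weights)."""
    inter = len(objects_pred & objects_obs)
    union = len(objects_pred | objects_obs) or 1
    jaccard_dissim = 1.0 - inter / union  # object-set disagreement
    return w[0] * category_mismatch + w[1] * clip_divergence + w[2] * jaccard_dissim

def cascade_remove(deps, refuted_id):
    """BFS over the dependency DAG, collecting the refuted node and every descendant."""
    removed, queue = set(), deque([refuted_id])
    while queue:
        nid = queue.popleft()
        if nid in removed:
            continue
        removed.add(nid)
        queue.extend(deps.get(nid, ()))  # enqueue nodes derived from this one
    return removed
```

When the residual exceeds the correction threshold, `cascade_remove` returns the whole erroneous subgraph, so a single refuted prediction (e.g., a mirror mistaken for a doorway) takes all of its inferred descendants with it.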

Semantic Hypothesis Module
Semantic Hypothesis Module. Left: Traditional frontier representation treats unexplored regions as undifferentiated boundaries. Right: HGR projects probabilistic semantic distributions onto frontiers as hypothesis nodes, enabling goal-directed exploration.
Cascade Correction Example
Cascade Correction Example. A VLM misidentifies a mirror reflection as a bedroom entrance, generating hypothesis nodes for inferred furniture. Upon reaching the mirror and detecting a prediction violation, the system traces the dependency DAG and removes the entire erroneous subgraph, including all descendant hypothesis nodes.

Experimental Results

GOAT-Bench: Multimodal Lifelong Navigation

HGR achieves the highest success rate and path efficiency on both the full validation set (2,780 subtasks) and the evaluation subset (278 subtasks).

Method             | Full Validation Set | Subset
                   | SR ↑    SPL ↑       | SR ↑    SPL ↑
ConceptGraph       | 61.2    44.3        | 67.8    48.1
3D-Mem w/o memory  | 58.6    38.5        | 66.2    44.1
3D-Mem             | 62.9    44.7        | 69.1    48.9
HGR (Ours)         | 64.14   50.1        | 72.41   56.22
Cumulative Success Rate vs Episode Steps
Cumulative Success Rate vs. Episode Steps. HGR reaches navigation targets earlier than baselines due to hypothesis-driven frontier selection, while 3D-Mem and ConceptGraph require more steps for exhaustive geometric search. The gap widens in later steps as cascade correction prevents error accumulation.

Performance by Target Modality

HGR shows the largest gains on language-specified targets (+7.9% SR over 3D-Mem), as hypothesis nodes encode relational structure for disambiguating spatial references.

Method        | Category      | Language      | Image
              | SR      SPL   | SR      SPL   | SR      SPL
ConceptGraph  | 65.3    44.7  | 55.0    38.9  | 64.0    52.8
3D-Mem        | 79.2    55.8  | 61.9    46.0  | 65.2    44.2
HGR (Ours)    | 80.9    60.3  | 69.8    54.5  | 68.2    61.7

Ablation Study

Systematic isolation of each component across all three benchmarks.

Configuration                | GOAT (Subset)   | A-EQA  | EM-EQA
                             | SR ↑    SPL ↑   | LLM ↑  | LLM ↑
HGR (full system)            | 72.41   56.22   | 55.9   | 58.3
  w/o Semantic hypothesis    | 67.71   50.02   | 48.4   | 50.5
  w/o Cascade correction     | 68.61   51.12   | 49.6   | 52.9
  Local delete only          | 70.85   53.67   | 52.3   | 55.9
  w/o both (geometry only)   | 63.42   45.33   | 43.3   | 45.2

Embodied Question Answering

A-EQA (Active Embodied QA)

Method                    | LLM-Match ↑ | LLM-Match SPL ↑
Blind LLMs (no visual input)
GPT-4o                    | 35.9        | -
Question Agnostic Exploration
LLaVA-1.5 Frame Captions  | 38.1        | 7.0
Multi-Frame               | 41.8        | 7.5
VLM-Guided Exploration
Explore-EQA               | 46.9        | 23.4
ConceptGraph w/ Frontier  | 47.2        | 33.3
HGR (Ours)                | 55.9        | 45.0

EM-EQA (Episodic Memory QA)

Method                   | Avg. Frames | LLM-Match ↑
Blind LLM (GPT-4)        | 0           | 35.5
ConceptGraph Captions    | 0           | 34.4
Frame Captions           | 0           | 38.1
Multi-Frame              | 3.0         | 48.1
HGR (Ours)               | 3.1         | 58.3
Human (Full trajectory)  | Full        | 86.8

Diagnostic Analysis

Statistics on hypothesis lifecycle across the GOAT-Bench full validation set (2,780 subtasks).

4,712 Hypothesis Nodes Created: 73.5% confirmed upon visitation, 26.5% refuted and removed.

342 Cascade Corrections Triggered: average of 2.8 dependent nodes removed per trigger, with a maximum cascade depth of 4 hops.

4.5× Revisit Reduction: revisit rate of 4.2% for HGR vs. 18.7% for 3D-Mem, directly contributing to the SPL improvement.

67% Mirror & Glass Errors: specular (38%) and transparent (29%) surfaces are the dominant error source for VLM predictions.

Qualitative Results

Qualitative Comparison
Qualitative Comparison: Mirror-Induced Prediction Error. Left: HGR detects the prediction violation and removes the erroneous subgraph via cascade correction, redirecting exploration to the actual living room. Right: 3D-Mem retains erroneous nodes with lowered confidence, leading to repeated misnavigation and step-limit failure.

BibTeX

@misc{chen2026hypothesisgraphrefinementhypothesisdriven,
      title={Hypothesis Graph Refinement: Hypothesis-Driven Exploration with Cascade Error Correction for Embodied Navigation},
      author={Peixin Chen and Guoxi Zhang and Jianwei Ma and Qing Li},
      year={2026},
      eprint={2604.04108},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.04108},
}