Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN's unique challenges, underscores the need for further research to enhance HA-VLN agents' real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.
Heng Li was responsible for agent development and experimental evaluations, drafted the initial agent and experiment sections, and made the final revisions based on review feedback. Minghan Li was responsible for the simulator, prepared the initial draft of the simulator section, conducted real-world and partial evaluations, and created the project website. Zhi-Qi Cheng supervised the design and development of both the agent and simulator, managed project execution, designed the evaluation plan, drafted the initial manuscript, and revised the final version. Yifei Dong designed the initial simulation prototype, drafted the related work section, and provided revision suggestions. Yuxuan Zhou offered collaborative feedback and contributed revision suggestions. Jun-Yan He participated in project discussions. Qi Dai provided invaluable strategic guidance and contributed to manuscript revision. Teruko Mitamura offered constructive feedback, and Alexander G. Hauptmann provided critical insights and contributed to manuscript refinement. We also thank the anonymous reviewers for their valuable suggestions.
@article{li2024human,
title={Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions},
author={Minghan Li and Heng Li and Zhi-Qi Cheng and Yifei Dong and Yuxuan Zhou and Jun-Yan He and Qi Dai and Teruko Mitamura and Alexander G Hauptmann},
journal={arXiv preprint arXiv:2406.19236},
year={2024}
}