Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

Minghan Li^1,*, Heng Li^1,*, Zhi-Qi Cheng^1,*,†, Yifei Dong², Yuxuan Zhou³, Jun-Yan He⁴, Qi Dai⁵, Teruko Mitamura¹, Alexander G Hauptmann¹

¹Carnegie Mellon University ²Columbia University ³University of Mannheim ⁴Alibaba Group ⁵Microsoft Research
^*Indicates Equal Contribution, authors listed in random order.
^†Indicates Corresponding author. See Author Contributions section for detailed roles

Paper Code arXiv

Accepted to NeurIPS 2024 Track Datasets and Benchmarks Spotlight!

Panoramas Observation Demos

"hallway:Someone talking on the phone while pacing."

"rec/game:Someone setting up a table game."

Abstract

Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN's unique challenges, underscores the need for further research to enhance HA-VLN agents' real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.

HA-VLN scenario: The agent navigates while interacting with dynamic human activities, optimizing its route and maintaining safe distances to address the Sim2Real gap and improve real-world applicability.

HA3D Simulator annotation process, illustrating the integration of the HAPS dataset, human activity annotation, realistic rendering, and agent-environment interaction. The simulator generates dynamic environments by combining human activities with photorealistic 3D scenes, enabling the HA-VLN task.

Single-frame in the HA3D simulator showcase viewpoints with human presence in each scene(120-degree FOV), demonstrating the diversity of human activities and environments. Common indoor regions such as bedrooms, hallways, kitchens, balconies, and bathrooms are displayed. Multiple humans can appear in the same region, as seen in the third row, sixth column, and the fifth row, fifth and sixth columns.

Model architectures of VLN-CM (left) and VLN-DT (right) agents. Both utilize a cross-modality fusion module to integrate visual and linguistic information for predicting navigation actions.

Overview of the VLN framework assumptions in the HA3D simulator. The simulator introduces an Ergonomic Action Space, Dynamic Environments, and a Sub-Optimal Expert to bridge the gap between simulated and real-world navigation scenarios. The Ergonomic Action Space limits the agent's field of view to 60 degrees, requiring a more realistic navigation strategy compared to the panoramic view used in traditional VLN tasks. Dynamic Environments incorporate time-varying elements, such as human activities, challenging the agent to adapt its navigation strategy to handle video streams that include people. The Sub-Optimal Expert provides navigation guidance that accounts for human factors and dynamic elements, resulting in a more realistic and human-like navigation strategy compared to the optimal expert model that always finds the shortest path without considering these factors.

The 145 human activity descriptions in the HAPS Dataset, categorized by their respective indoor regions (highlighted in bold red font). Each region includes 5 carefully selected human activity descriptions that best represent the diversity and relevance of activities within that space.

Real-world robot used in our experiments. The robot is Unitree GO1-EDU, a quadruped robot equipped with an NVIDIA Jetson TX2 high-performance computing module for handling computational tasks. The robot features an Inertial Measurement Unit (IMU) for measuring acceleration and rotational speed, a Stereo Fisheye Camera for wide-angle perception of its surroundings, and an Ultrasonic Distance Sensor for measuring the distance between the robot and obstacles.

Dataset

Download Link

Human Motion Skeletons in HAPS Dataset

Human Activity Annotation GUI

Explore the Simulator

Video of Real-World Robots

Author Contributions

Heng Li was responsible for agent development and experimental evaluations, drafted the initial agent and experiment sections, and revised final revisions based on review feedback. Minghan Li was responsible for the simulator, prepared the initial draft of the simulator section, conducted real-world and partial evaluations, and created the project website. Zhi-Qi Cheng supervised the design and development of both the agent and simulator, managed project execution, designed the evaluation plan, drafted the initial manuscript, and revised the final version. Yifei Dong designed the initial simulation prototyping, drafted the related work section, and provided revision suggestions. Yuxuan Zhou offered collaborative feedback and contributed revision suggestions. Jun-Yan He participated in project discussions. Qi Dai provided invaluable strategic guidance and contributed to manuscript revision. Teruko Mitamura offered constructive feedback, and Alexander G. Hauptmann provided critical insights and contributed to manuscript refinement. We also thank the anonymous reviewers for their valuable suggestions.

Paper Preview

BibTeX


      @article{li2024human,
        title={Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions},
        author={Minghan Li, Heng Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, Qi Dai, Teruko Mitamura, Alexander G Hauptmann},
        journal={arXiv preprint arXiv:2406.19236},
        year={2024}
      }