VLM-Nav: Mapless UAV Navigation Using Monocular Vision Driven by a Vision-Language Model

Department of Robotics and Mechatronics Engineering,
University of Dhaka

Abstract

In this study, we propose a novel vision-based UAV navigation method that integrates depth map estimation with a vision-language model (VLM) for efficient obstacle avoidance and path planning. The system processes RGB images captured by the UAV, transforming them into depth maps using a powerful zero-shot depth estimation algorithm. These depth maps are then analyzed by the VLM, which detects nearby obstacles and plans avoidance maneuvers. A fully connected network (FCN) integrates the VLM output with the UAV's relative heading angle to predict the optimal course of action, enabling the UAV to dynamically navigate complex environments toward its target. The system's effectiveness is validated through simulations in AirSim using the Blocks and Downtown West environments. The UAV consistently reaches its destination while avoiding obstacles, achieving a near-perfect task completion rate of 0.98.

Proposed Method

The proposed system operates in three stages: (a) First, RGB scene images are captured and converted into a depth map. (b) This depth map is then analyzed by the VLM, which provides a corresponding action response. (c) Lastly, the VLM's feedback, along with the relative heading angle, the left and right distance sensor measurements, and the proximal obstacle detection result, is fed to the navigation model, which outputs one of five discrete actions, as sketched below.
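To make the third stage concrete, the following is a minimal sketch of such a navigation model in PyTorch. The layer sizes, the dimensionality of the VLM feedback encoding, and the input ordering are illustrative assumptions, not the paper's actual architecture; only the input/output structure (VLM feedback, heading angle, two distance readings, obstacle flag in; five action logits out) follows the description above.

```python
# Hypothetical sketch of the navigator FCN; sizes are assumptions.
import torch
import torch.nn as nn

class NavigatorFCN(nn.Module):
    def __init__(self, vlm_dim: int = 5, hidden_dim: int = 64, num_actions: int = 5):
        super().__init__()
        # Inputs: VLM action feedback (vlm_dim), relative heading angle (1),
        # left/right distance sensor readings (2), proximal-obstacle flag (1).
        in_dim = vlm_dim + 1 + 2 + 1
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),  # logits over 5 discrete actions
        )

    def forward(self, vlm_feedback, heading, left_dist, right_dist, obstacle_flag):
        x = torch.cat([vlm_feedback,
                       heading.unsqueeze(-1),
                       left_dist.unsqueeze(-1),
                       right_dist.unsqueeze(-1),
                       obstacle_flag.unsqueeze(-1)], dim=-1)
        return self.net(x)

# Example: one state vector in, index of the chosen discrete action out.
model = NavigatorFCN()
logits = model(torch.zeros(1, 5), torch.tensor([0.3]),
               torch.tensor([4.2]), torch.tensor([1.1]), torch.tensor([0.0]))
action = logits.argmax(dim=-1)
```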

Environment

This work uses three environments in the AirSim simulator: (a) a single-obstacle environment with two walls to the sides; (b) the Blocks environment shipped with AirSim, in which the navigator model is evaluated; and (c) a Downtown West environment created using the Downtown West pack from the Unreal Engine Marketplace.
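In any of these environments, the pipeline's first step is capturing the UAV's monocular RGB frame through AirSim's Python API. The sketch below uses real AirSim calls (camera name "0" is the default front camera); the 3-channel reshape assumes an uncompressed scene image.

```python
# Minimal sketch: grab the front camera's RGB frame from AirSim.
import airsim
import numpy as np

client = airsim.MultirotorClient()
client.confirmConnection()

def capture_rgb() -> np.ndarray:
    """Return the front camera's scene image as an HxWx3 uint8 array."""
    response = client.simGetImages([
        airsim.ImageRequest("0", airsim.ImageType.Scene,
                            pixels_as_float=False, compress=False)
    ])[0]
    frame = np.frombuffer(response.image_data_uint8, dtype=np.uint8)
    return frame.reshape(response.height, response.width, 3)
```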

Depth Estimation

The proposed system utilizes a powerful zero-shot depth estimation algorithm to generate depth maps from monocular images. Some examples are shown here.
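The specific estimator is not named here, so the sketch below uses a DPT-style model served through Hugging Face's "depth-estimation" pipeline as a stand-in; the model choice is an assumption, not necessarily the one used in this work.

```python
# Hedged sketch: monocular RGB frame -> relative depth map.
# "Intel/dpt-large" is a stand-in zero-shot estimator, not the paper's.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

def rgb_to_depthmap(rgb: Image.Image) -> Image.Image:
    """Convert a monocular RGB frame into a relative depth map image."""
    result = depth_estimator(rgb)
    return result["depth"]  # PIL image; "predicted_depth" holds the raw tensor

depth_png = rgb_to_depthmap(Image.open("frame.png").convert("RGB"))
depth_png.save("depth.png")
```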

VLM Output
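As an illustration of this stage, the sketch below shows one way the depth map could be submitted to a VLM for obstacle analysis. The paper does not specify the model, endpoint, or prompt; "gpt-4o", the OpenAI-style API, and the prompt wording here are all assumptions standing in for the actual setup.

```python
# Hypothetical sketch: send the depth map to a VLM for obstacle analysis.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_vlm(depth_png_path: str) -> str:
    with open(depth_png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model, not the paper's stated VLM
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "This is a depth map from a UAV's front camera "
                         "(brighter = closer). Identify nearby obstacles "
                         "and suggest an avoidance maneuver."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```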

Proposed Method Flight Path

The results of five flights using the proposed system are displayed here for the three environments, with each episode represented by a different color. As demonstrated, the method effectively avoids obstacles and successfully reaches its destination in each case. Zoomed-in views of selected spots are also shown to illustrate the decision making of our system.