In this study, we propose a novel vision-based UAV navigation method that integrates depth map estimation with a vision-language model (VLM) for efficient obstacle avoidance and path planning. The system processes RGB images captured by the UAV, transforming them into depth maps using a powerful zero-shot depth estimation algorithm. These depth maps are then analyzed by the VLM, which detects nearby obstacles and plans avoidance maneuvers. A fully connected network (FCN) integrates the VLM output with the UAV's relative heading angle to predict the optimal course of action, enabling the UAV to dynamically navigate complex environments toward its target. The system's effectiveness is validated through simulations in AirSim using the Blocks and Downtown West environments. The UAV consistently reaches its destination while avoiding obstacles, achieving a near-perfect task completion rate of 0.98.
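To make the fusion step concrete, the following is a minimal sketch of how an FCN could combine a VLM-derived feature vector with the UAV's relative heading angle to select a discrete action. All dimensions, the action set, and the function name are illustrative assumptions, not details taken from the paper; in the actual system the VLM output encoding and network architecture may differ.

```python
import numpy as np

def fcn_policy(vlm_features: np.ndarray, heading_angle: float,
               w1: np.ndarray, b1: np.ndarray,
               w2: np.ndarray, b2: np.ndarray) -> int:
    """Fuse VLM obstacle features with the UAV's relative heading
    angle and return the index of the chosen action (hypothetical
    sketch, not the paper's exact architecture)."""
    # Concatenate the VLM feature vector with the scalar heading angle.
    x = np.concatenate([vlm_features, [heading_angle]])
    h = np.maximum(0.0, x @ w1 + b1)   # ReLU hidden layer
    logits = h @ w2 + b2               # one logit per discrete action
    return int(np.argmax(logits))

# Illustrative dimensions: 8 VLM features, 16 hidden units, and
# 5 discrete actions (e.g. forward, left, right, up, down).
rng = np.random.default_rng(0)
n_feat, n_hidden, n_actions = 8, 16, 5
w1 = rng.normal(size=(n_feat + 1, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
w2 = rng.normal(size=(n_hidden, n_actions)) * 0.1
b2 = np.zeros(n_actions)

vlm_features = rng.normal(size=n_feat)   # stand-in for real VLM output
action = fcn_policy(vlm_features, heading_angle=0.3,
                    w1=w1, b1=b1, w2=w2, b2=b2)
print(action)
```

In practice such a network would be trained (e.g. by supervised or reinforcement learning) rather than randomly initialized; the sketch only shows the forward pass that maps the fused inputs to an action index.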
The proposed system utilizes a powerful zero-shot depth estimation algorithm to generate depth maps from monocular images. Some examples are shown here.