Lidar is a powerful tool, and it is very simple to implement on the ROS2 and Neato platform. However, the data it provides is ultimately rather sparse, and there are many objects it simply cannot see. Our solution to this problem was to include a camera in the object detection pipeline. Instead of using a traditional stereo camera, our group decided to explore the possibility of using monocular vision for obstacle detection.
For monocular vision, we used Depth Anything V2, a monocular depth estimation deep learning model released in 2024. It allows us to generate a relative depth map from a photo taken with any camera. From there, we process the image to find obstacles using the following steps:
The pipeline is summarized in the following flow chart.
To illustrate the pipeline, we will use this image, which was taken with a Raspberry Pi Camera Module 3 Wide mounted on a Neato.
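As a rough sketch (not our exact loading code), a frame like this can be run through Depth Anything V2 using the Hugging Face transformers depth-estimation pipeline; the checkpoint name and file path below are placeholders:

```python
# Rough sketch of the depth step, assuming the Hugging Face `transformers`
# depth-estimation pipeline and the small Depth Anything V2 checkpoint.
import numpy as np
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",
)

frame = Image.open("neato_frame.jpg")   # frame from the Pi camera (placeholder path)
result = depth_estimator(frame)

# The pipeline returns a relative (unitless) depth map; convert it to a
# 2D float array for the later processing steps.
depth_map = np.array(result["depth"], dtype=np.float32)
```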
When Depth Anything processes a photo from the Neato, this is the result. We noticed that the floor can be represented as a gradient, and that once the floor gradient is removed, everything that remains can be considered an obstacle. The maximum value of the gradient is the mean of the lowest row of pixels in the depth map, and the gradient extends up to the “horizon line”, which was at y = 150 on the Neato’s camera.
When we subtract the gradient from the depth map, we obtain a depth map with the floor removed.
Using boolean indexing, we can reduce the remaining objects to a binary mask.
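A condensed sketch of the gradient construction, subtraction, and masking, assuming the depth map is a 2D numpy array where larger values mean closer; the horizon line comes from our setup, while the threshold is an illustrative tuning value:

```python
import numpy as np

# `depth_map` is the 2D relative depth map from the previous step.
HORIZON_Y = 150                      # horizon line on the Neato's camera
h, w = depth_map.shape

# Floor gradient: roughly zero at the horizon line, rising to the mean of the
# bottom row of the depth map at the bottom of the image; zero above the horizon.
floor_max = depth_map[-1, :].mean()
column = np.zeros(h, dtype=np.float32)
column[HORIZON_Y:] = np.linspace(0.0, floor_max, h - HORIZON_Y)
floor_gradient = np.tile(column[:, None], (1, w))

# Subtract the gradient; anything meaningfully above the floor is an obstacle.
residual = depth_map - floor_gradient
THRESHOLD = 0.1 * floor_max          # assumed tuning value
obstacle_mask = (residual > THRESHOLD).astype(np.uint8) * 255
```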
OpenCV’s contour detection can be used on the binary mask to create reliable masks and to separate each object into its own contour (a short sketch is shown below). This allows us to perform the following tasks on each contour.
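Assuming the binary mask from the previous step, the contour extraction might look roughly like this (the minimum-area cutoff is an assumed tuning value):

```python
import cv2

# `obstacle_mask` is the binary (0/255) mask from the previous step.
contours, _ = cv2.findContours(
    obstacle_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
)

# Keep only reasonably large blobs; the area cutoff is an assumed tuning value.
MIN_AREA = 100
obstacle_contours = [c for c in contours if cv2.contourArea(c) > MIN_AREA]
```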
Since Depth Anything and other monocular depth models generally produce relative depth maps, it is often difficult to identify the scale at which a camera is working. Below, we show two images that look essentially identical to a monocular depth model. However, one is an aerial view of the Chambord Castle, while the other is a 1/30 scale replica in the French Miniature.
There are three common approaches to tackle this issue:
Sensor fusion requires precise calibration and embedded integration. Since the camera is mounted on the front and the lidar is mounted on the back, we were not sure the necessary calibration was feasible. The Neato also lacks any real motion sensor; it has an odometry function, but that is prone to error.
However, since the Neato is locked in the y-axis and the camera is mounted at a fixed orientation, we can assume that the depth scale is consistent across all images. This is further reinforced by the ground gradient: since the camera always sees the floor, it always picks up objects at distances relative to the floor.
Thus, we can find the predicted distance by taking the mean value of the depth map within each contour. We calibrated this pixel strength against actual distance and were able to get distance measurements with a precision of 10 cm.
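A sketch of the per-contour distance estimate; the linear calibration here is illustrative, standing in for the empirical pixel-strength-to-distance fit described above:

```python
import cv2
import numpy as np

def contour_distance(depth_map, contour, calib_slope, calib_intercept):
    """Mean relative depth under a contour, mapped to a real-world distance
    by a linear calibration (the calibration constants are fit empirically)."""
    mask = np.zeros(depth_map.shape, dtype=np.uint8)
    cv2.drawContours(mask, [contour], -1, color=255, thickness=cv2.FILLED)
    mean_depth = float(depth_map[mask == 255].mean())
    return calib_slope * mean_depth + calib_intercept
```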
For the ray calculation, we selected the row of bottom-most points in each contour (since we assume these objects rest on the floor) and projected them into 3D space.
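Picking those bottom-most pixels out of a contour might look like this (a small helper shown only for illustration):

```python
def bottom_points(contour):
    """Return the contour pixels on its lowest image row (closest to the floor)."""
    pts = contour.reshape(-1, 2)     # OpenCV contours are (N, 1, 2) arrays of (x, y)
    y_max = pts[:, 1].max()          # image y grows downward, so max y = bottom row
    return pts[pts[:, 1] == y_max]
```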
Ray projection requires the following parameters:
This allows us to construct the camera matrix K, which projects a point from 3D space onto the xy picture plane.
K = | f   0   cx |
    | 0   f   cy |
    | 0   0   1  |

where f = focal length / pixel size, and (cx, cy) is the center pixel of the image.
Taking the inverse of K gives us a matrix that projects a point from the xy picture plane back into 3D space. We can take any point (x, y) from the image, form its homogeneous coordinates (x, y, 1), and multiply by the inverse of K to obtain a ray. We normalize this ray and scale it by the estimated distance to obtain a Cartesian point with respect to the Neato’s base link.
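A sketch of that back-projection, where the focal length and image center are placeholders for our calibrated values:

```python
import numpy as np

FOCAL_PX = 600.0                 # focal length / pixel size (placeholder value)
CX, CY = 320.0, 240.0            # image center, assuming a 640x480 frame

K = np.array([[FOCAL_PX, 0.0,      CX],
              [0.0,      FOCAL_PX, CY],
              [0.0,      0.0,      1.0]])
K_inv = np.linalg.inv(K)

def pixel_to_point(x, y, distance):
    """Back-project pixel (x, y) into a ray, normalize it, and scale it by the
    estimated distance to get a point relative to the camera / base_link."""
    ray = K_inv @ np.array([x, y, 1.0])
    ray /= np.linalg.norm(ray)
    return ray * distance
```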
Ground truth: the water bottle is 50 cm away and has a 9 cm diameter.
Bottom left: raw image
Top left: depth map
Top right: obstacle mask overlaid over depth map
Bottom right: cartesian points of obstacles
The Occupancy field expects data in the following format:
After converting to centimeters, we apply a frame transformation into the Odom frame. This is done by first rotating the points to align with the Neato’s Odom orientation, then translating them by the Neato’s Odom position. From there, we can publish a PointCloud: using a list comprehension, we create a list of Point32 objects, which we send to a topic.
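A sketch of the transform and publishing step, assuming we already have the Neato’s odom pose as (x, y, heading) and the obstacle points in the base_link frame; the helper and its variable names are ours, not a standard API:

```python
import math

from geometry_msgs.msg import Point32
from sensor_msgs.msg import PointCloud


def points_to_odom_cloud(node, points_base, odom_x, odom_y, odom_theta):
    """Rotate/translate base_link points into the odom frame and pack them
    into a sensor_msgs/PointCloud message (which carries a list of Point32)."""
    cos_t, sin_t = math.cos(odom_theta), math.sin(odom_theta)
    msg = PointCloud()
    msg.header.frame_id = "odom"
    msg.header.stamp = node.get_clock().now().to_msg()
    # Rotate into the odom orientation, then translate by the odom position.
    msg.points = [
        Point32(
            x=odom_x + cos_t * px - sin_t * py,
            y=odom_y + sin_t * px + cos_t * py,
            z=0.0,
        )
        for (px, py) in points_base
    ]
    return msg
```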
Our camera estimation is able to identify obstacles and locate them with a high degree of accuracy, independent of movement or additional sensors. However, it is ultimately limited by the camera’s 102° horizontal field of view. We also found that the functional range of the depth map using the floor gradient was 1.5 m. This meant that if there were no obstacles within 1.5 m and inside the Neato’s 102° field of view, the vision pipeline would return a blank map. We decided that including Lidar would increase the map’s robustness by detecting objects outside of this range. Lidar is able to identify “long term” obstacles such as walls, while the camera is able to find “imminent” obstacles, such as chairs, table legs, or ground clutter that the Lidar is more prone to missing.
The Lidar scan is given as polar coordinates in the robot’s base_link frame. We convert the points to Cartesian form and transform them into the Odom frame.
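A sketch of that conversion, assuming one lidar reading per degree (as on the Neato) and ignoring the small mounting offset between the lidar and base_link:

```python
import math

def scan_to_odom(ranges, odom_x, odom_y, odom_theta):
    """Convert a 360-beam lidar scan (one range per degree) from polar
    coordinates into Cartesian points in the odom frame."""
    points = []
    for deg, r in enumerate(ranges):
        if not math.isfinite(r) or r == 0.0:
            continue                 # skip invalid / missing returns
        angle = math.radians(deg) + odom_theta
        points.append((odom_x + r * math.cos(angle),
                       odom_y + r * math.sin(angle)))
    return points
```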