Projects > Master’s thesis

For my Master’s thesis I extended a 3d object detection and localization framework to leverage positive context clues by integrating scene appearance information. My experimental results achieved state-of-the-art performance with a significant improvement over prior results (Song et al, Ren et al).

Average precision scores for five object categories. The last two rows are from my experimental results and show significant improvement from prior work

Prior to writing my thesis I had been doing research with Dr. Zhile Ren and Prof. Erik Sudderth at Brown University on the aforementioned 3d object detection and localization framework. Given input RGB-D images of indoor scenes we were able to achieve state-of-the-art performance, as shown in the third to last row (Geom+COG+Context-10+Layout) in the figure above.

3d object bounding box results from the SUN RGB-D dataset. Ground truth and our results appear on the left and right respectively with the inner two columns representing results of prior work

Besides a novel image feature called “Clouds of Oriented Gradients”, much of the performance improvement stemmed from spatial context features. These features modeled the spatial relationship between object detections themselves and the estimated room layout, greatly reducing the number of false positives (compare the last column of the figure above with the inner two columns).

Knowing the importance of contextual clues and given my experience with image understanding, I decided to try integrating scene appearance information into the 3d object detection and localization framework. My hypothesis was that incorporating contextual clues about scene appearance information (i.e does the room look like a bedroom or a bathroom) would improve performance for object categories that mostly appear in certain types of rooms (i.e beds in bedrooms, toilets in bathrooms).

Confusion matrix for scene categorization. I used deep features pre-computed using MIT’s PLACES-CNN with a linear SVM to achieve state-of-the-art scene categorization performance of 45.72% accuracy versus 39.00% of prior work (Song et al)

To leverage positive context clues and integrate scene appearance information into the 3d object detection and localization framework I first trained a linear SVM to classify scene category using 2d image features pre-computed using MIT’s Places-CNN. This scene category classifier classified scenes into 20 categories (bathroom, bedroom, etc) and achieved a state-of-the-art scene categorization accuracy of 45.72% versus 39.00% of prior work (Song et al).

I then used this scene category classifier to create a new 2d scene appearance feature over the 20 scene categories that I fed into the 3d object detection and localization framework. In addition to this new softmax scene category feature I also created a one-hot scene category feature using the ground truth to test the potential improvement possible.

Precision-recall curves for 3D cuboid detection of the 5 object categories considered by Song et al. Per the legend, baseline performance is indicated by the yellow line, the performance with the softmax scene appearance feature by the red line, and the one-hot ground truth scene category feature by the blue line

As seen in the figure above and the results table at the top, both the softmax scene category feature and ground truth one-hot scene category feature outperformed prior work by a significant margin. Object categories that have strong contextual relationships with scene category (beds, toilets, and bathtubs) showed the most improvement while all other categories showed some improvement.

You can view all the results and the whole paper here!

Carl Olsson

Engineer, musician, and lifelong learner. Loves building things