Building A Mask R-CNN To Detect and Segment Breast Tumors

Chloe Wang
Dec 21, 2021

I’d recommend reading my article on neural networks to understand what convolutional neural networks are before reading this one.

Breast cancer is one of the most common and deadly cancers in women; about 1 in every 39 women will die from it. However, early detection drastically increases a patient’s chance of survival, bringing the 5-year survival rate up to 93%.

Thousands of people work to analyze breast sonograms to try to detect breast cancer early — but what if we could automate it? To demonstrate this concept, I built a mask R-CNN to detect and segment breast tumors from ultrasound images.

Object Detection and Image Segmentation

Before going over R-CNNs, it’s a good idea to run down two different computer vision tasks — object detection and image segmentation.

Object detection combines image classification (labeling images) and object localization (drawing a bounding box around objects in an image).

However, object detection doesn’t tell us anything about the shape of an object. Oftentimes, knowing only the class and bounding box coordinates is too coarse. Image segmentation addresses this by creating a pixel-wise mask for each object in the image.
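The difference can be made concrete with a small numpy sketch: a pixel mask fully determines a bounding box, but not the other way around.

```python
import numpy as np

# A 5x5 binary segmentation mask for one object (1 = object pixel).
mask = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

# The bounding box is derivable from the mask, but not vice versa:
# a box only records the extremes of the object's rows and columns,
# while the mask records the object's exact shape.
rows, cols = np.where(mask == 1)
y1, y2 = int(rows.min()), int(rows.max())
x1, x2 = int(cols.min()), int(cols.max())
print((y1, x1, y2, x2))  # (1, 1, 3, 3)
```

Many different masks collapse to that same box, which is why segmentation carries strictly more information than detection.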


What Are R-CNNs?

R-CNNs, or region-based convolutional neural networks, are one of the first successful applications of CNNs in object localization, detection, and segmentation. Standard R-CNNs have three parts:

From Rich feature hierarchies for accurate object detection and semantic segmentation
  1. Region proposals (possible bounding boxes for objects) are generated from the input using a classical computer vision technique such as selective search
  2. Features are extracted from each candidate region using a CNN
  3. Features are classified based on known classes
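The three steps above can be sketched in illustrative Python; the callables stand in for selective search, the CNN, and the per-class classifier (this is not the paper's actual code).

```python
def rcnn_pipeline(image, propose_regions, extract_features, classify):
    """Illustrative sketch of the original R-CNN flow.

    propose_regions: image -> list of (y1, x1, y2, x2) candidate boxes
    extract_features: cropped region -> feature vector (the CNN)
    classify: feature vector -> class label
    """
    detections = []
    for y1, x1, y2, x2 in propose_regions(image):    # e.g. ~2,000 boxes per image
        crop = [row[x1:x2] for row in image[y1:y2]]  # crop the candidate region
        features = extract_features(crop)            # costly: one CNN pass per region
        detections.append(((y1, x1, y2, x2), classify(features)))
    return detections
```

Calling `extract_features` once per region inside the loop is exactly what makes this pipeline expensive.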

However, this original architecture is slow and expensive. A model may operate on about 2,000 proposed regions per image, and since the CNN must run a full forward pass on every candidate region generated by the region proposal algorithm, inference becomes very slow.

Consequently, other forms of R-CNNs were developed to speed up the process.

Fast R-CNN

A fast R-CNN improves on the original architecture by acting as a single model instead of multiple. In a fast R-CNN:

From Fast R-CNN
  1. The whole image, along with a set of precomputed region proposals, is passed through a CNN that is pre-trained for feature extraction
  2. At the end of the CNN, a custom layer called the region-of-interest (ROI) pooling layer extracts a fixed-size feature vector for each candidate region
  3. The pooled features are interpreted by fully-connected layers
  4. The model then splits into two outputs: a softmax layer for class prediction and a linear layer for the bounding box coordinates

This model is much faster than the standard R-CNN architecture because it is only one model instead of three, but it still requires many candidate regions to be proposed with each input image.
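The ROI pooling idea can be sketched with numpy. This is a simplified version that assumes the ROI divides evenly into the output grid; real implementations also handle fractional bins.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=2):
    """Max-pool one region of interest to a fixed output_size x output_size grid."""
    y1, x1, y2, x2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    pooled = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            # Split the region into a coarse grid and keep the max of each cell.
            ys = slice(i * h // output_size, (i + 1) * h // output_size)
            xs = slice(j * w // output_size, (j + 1) * w // output_size)
            pooled[i, j] = region[ys, xs].max()
    return pooled

fm = np.arange(16).reshape(4, 4)   # a toy 4x4 feature map
print(roi_pool(fm, (0, 0, 4, 4)))  # 2x2 grid of per-cell maxima
```

Because every ROI comes out at the same fixed size, the fully-connected layers downstream can process proposals of any shape.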

Faster R-CNN

The faster R-CNN was designed to speed up both training and detection compared to previous R-CNN architectures. It learns to propose and refine regions as part of the training process, and those proposals feed into a fast R-CNN detector within a single unified model. Compared to previous architectures, it reduces the number of proposed regions and accelerates operation to near real-time. It works by:

From Faster R-CNN: Towards Real-Time Object Detection With Region Proposal Networks
  1. Using a region proposal CNN
  2. Using a fast R-CNN to extract features from the proposed regions and create bounding boxes and class labels
From You Only Look Once: Unified, Real-Time Object Detection

Both of these components use the same output of a CNN and are trained at the same time. The region proposal CNN essentially acts as an attention mechanism for the fast R-CNN, which tells the second network where to pay attention.
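As an illustrative sketch (not matterport's implementation), the region proposal network's candidates can be enumerated as fixed "anchor" boxes over the shared feature map, which the network then scores and refines.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Enumerate candidate boxes (anchors) centered on every feature-map cell.

    The RPN scores and refines these anchors instead of relying on an
    external region proposal step, which is why it is so much faster.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Map the feature-map cell back to image coordinates.
            cy, cx = (y + 0.5) * stride, (x + 0.5) * stride
            for s in scales:
                for r in ratios:
                    h, w = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2])
    return np.array(anchors)

# A 4x4 feature map with 2 scales and 3 ratios yields 4*4*2*3 = 96 anchors.
print(generate_anchors(4, 4, 16).shape)  # (96, 4)
```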

Mask R-CNN

Mask R-CNN builds on top of the faster R-CNN object detection architecture by predicting not only the class and bounding box coordinates of an object, but its mask as well. An object mask can be thought of as the object’s silhouette: a much finer spatial layout than a box. Mask R-CNNs work by:

From Mask R-CNN
  1. An image is passed through a CNN that returns a feature map for the image
  2. A region proposal network is applied on the feature maps, returning object proposals with their confidence score
  3. A ROI pooling layer is applied on the proposals to bring them all to the same size. Mask R-CNN assigns ROIs by computing the intersection over union (IoU) between the predicted boxes and the ground truth boxes; if the IoU is greater than or equal to 0.5, the proposal is considered a region of interest.
  4. The proposals are passed to a fully-connected layer to classify and output bounding boxes for objects, as well as returning a mask for each proposal
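The IoU criterion in step 3 is straightforward to compute; here is a minimal sketch with boxes given as (y1, x1, y2, x2).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    # Coordinates of the overlap rectangle (empty if the boxes are disjoint).
    y1 = max(box_a[0], box_b[0])
    x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2])
    x2 = min(box_a[3], box_b[3])
    inter = max(0, y2 - y1) * max(0, x2 - x1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A proposal must overlap a ground-truth box with IoU >= 0.5 to count as positive.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)) >= 0.5)  # False: IoU is 50/150, about 0.33
```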

Mask R-CNN is so far one of the most successful R-CNN architectures: it is simple to train, outperforms previous architectures, adds only a small overhead to faster R-CNN, and generalizes easily to other tasks.

The key element of a mask R-CNN’s mask accuracy is pixel-to-pixel alignment (the ROIAlign layer), which isn’t found in fast or faster R-CNNs. Mask R-CNN keeps the same two-stage procedure as faster R-CNN, but in the second stage it outputs a binary mask for each ROI in addition to the class and box offset. This contrasts with other systems, where classification usually depends on the mask predictions.

Building A Mask R-CNN

At a high level:

  • I created a mask R-CNN, which combines computer vision and deep learning, to detect and segment breast cancer tumors from ultrasound images.
  • I used a model built on a feature pyramid network (FPN) and a ResNet-101 backbone from matterport. I also started with pre-trained weights from MS COCO, though they were not my final weights.
  • I set the confidence level so that the model skipped region proposals with < 90% confidence. Then I trained the model for 30 epochs, with 51 steps per epoch. Training took about 1.5 days in total.
  • I sampled my data from here, but I didn’t use all of the images (I used about 200 benign and malignant tumor images). I annotated my images and exported the annotations as JSON files using the VGG Image Annotator.
  • You can access the code in this repository.
  • This project was based on a 2019 paper that performed a similar task, using a mask R-CNN on sonograms to classify benign, malignant, and normal tissue.

Project Walkthrough

Since my project is built on a model from matterport, most of my code is actually overwriting some of what they defined in their source code so that I could implement a custom dataset. For context, I used Jupyter Notebook with tensorflow 1.5.0 and keras 2.8.0 in an Anaconda environment.

These are the modules you’ll need to import. Here I also assigned the root directory of my project and the weights path.
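Since the original code isn’t shown inline here, this is a sketch of what that setup might look like, assuming a local clone of matterport’s Mask_RCNN repository; the path names are illustrative.

```python
import os
import sys

# Illustrative paths -- point ROOT_DIR at your clone of matterport's Mask_RCNN repo.
ROOT_DIR = os.path.abspath("./")
sys.path.append(ROOT_DIR)

from mrcnn.config import Config
from mrcnn import model as modellib, utils, visualize

# Pre-trained MS COCO weights used to initialize training.
COCO_WEIGHTS_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5")
# Directory where training logs and model checkpoints are written.
DEFAULT_LOGS_DIR = os.path.join(ROOT_DIR, "logs")
```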

Then I defined a class that allowed me to override some of the values defined in the source code. Notice that there are 3 classes: benign tumors, malignant tumors, and the background. Also notice that the model skips region proposals with a confidence below 90%.
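A sketch of such an override, using matterport’s documented `Config` attribute names (the class name `TumorConfig` is illustrative, not necessarily the original):

```python
from mrcnn.config import Config

class TumorConfig(Config):
    """Overrides of matterport's training defaults (class name is illustrative)."""
    NAME = "tumor"
    NUM_CLASSES = 1 + 2             # background + benign + malignant
    STEPS_PER_EPOCH = 51            # matches the training setup described above
    DETECTION_MIN_CONFIDENCE = 0.9  # skip region proposals below 90% confidence
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
```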

Here I defined another class to implement my custom dataset. In the load_custom function I defined my classes and loaded the ground-truth annotations. I also added the images and set variables based on the annotations I made.
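This follows the pattern of matterport’s balloon sample; the file layout, class name, and the simplified class-ID handling for the two tumor types are assumptions.

```python
import os
import json
import numpy as np
import skimage.draw
import skimage.io
from mrcnn import utils

class TumorDataset(utils.Dataset):
    def load_custom(self, dataset_dir, subset):
        """Load a subset ("train" or "val") annotated with VGG Image Annotator."""
        # Background is class 0 and is added by mrcnn automatically.
        self.add_class("tumor", 1, "benign")
        self.add_class("tumor", 2, "malignant")
        dataset_dir = os.path.join(dataset_dir, subset)
        # Assumed VIA export filename; adjust to your annotation file.
        annotations = json.load(open(os.path.join(dataset_dir, "via_region_data.json")))
        for a in annotations.values():
            polygons = [r["shape_attributes"] for r in a["regions"]]
            image_path = os.path.join(dataset_dir, a["filename"])
            height, width = skimage.io.imread(image_path).shape[:2]
            self.add_image("tumor", image_id=a["filename"], path=image_path,
                           width=width, height=height, polygons=polygons)

    def load_mask(self, image_id):
        """Rasterize each annotated polygon into one binary mask channel."""
        info = self.image_info[image_id]
        mask = np.zeros([info["height"], info["width"], len(info["polygons"])],
                        dtype=np.uint8)
        for i, p in enumerate(info["polygons"]):
            rr, cc = skimage.draw.polygon(p["all_points_y"], p["all_points_x"])
            mask[rr, cc, i] = 1
        # Simplification: labels every instance as class 1; a real two-class
        # dataset would read the benign/malignant label from the VIA attributes.
        return mask.astype(bool), np.ones([mask.shape[-1]], dtype=np.int32)
```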

To train my model, I loaded the custom training and validation datasets. I set the number of epochs to 10, but I repeated this process 3 times so that the model ultimately trained for 30 epochs (~ 33 hours). I loaded pre-trained weights from MS COCO to start the training process, although they weren’t my final weights. In the end, the model performed decently, but for optimal performance I would recommend training for at least 50 epochs.
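Assuming config and dataset classes like those described above, the training calls in matterport’s API would look roughly like this; `TumorConfig`, `TumorDataset`, and `DATASET_DIR` are illustrative names.

```python
from mrcnn import model as modellib

config = TumorConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir=DEFAULT_LOGS_DIR)

# Start from MS COCO weights; skip the head layers whose shapes differ
# because the number of classes changed.
model.load_weights(COCO_WEIGHTS_PATH, by_name=True, exclude=[
    "mrcnn_class_logits", "mrcnn_bbox_fc", "mrcnn_bbox", "mrcnn_mask"])

dataset_train = TumorDataset()
dataset_train.load_custom(DATASET_DIR, "train")
dataset_train.prepare()

dataset_val = TumorDataset()
dataset_val.load_custom(DATASET_DIR, "val")
dataset_val.prepare()

# Train only the network heads; rerunning this 3 times gives 30 epochs total.
model.train(dataset_train, dataset_val, learning_rate=config.LEARNING_RATE,
            epochs=10, layers="heads")
```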

After training, I tested the model. I redefined some values like the root directory and the custom dataset directory, and loaded the configuration, model, and newly trained weights. I then selected a random image from my validation dataset, ran the model on it, and displayed the results as the image passed through the different layers of the network. The final result is a bounding box and mask for the tumor in the image.
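In matterport’s API, that inference step looks roughly like this; `config`, `dataset_val`, and `DEFAULT_LOGS_DIR` are assumed from the training setup described above.

```python
import random
from mrcnn import model as modellib, visualize

# Rebuild the model in inference mode and load the newest trained checkpoint.
model = modellib.MaskRCNN(mode="inference", config=config, model_dir=DEFAULT_LOGS_DIR)
model.load_weights(model.find_last(), by_name=True)

# Pick a random validation image and run detection on it.
image_id = random.choice(dataset_val.image_ids)
image = dataset_val.load_image(image_id)
r = model.detect([image], verbose=1)[0]

# Overlay the predicted bounding boxes, masks, class labels, and scores.
visualize.display_instances(image, r["rois"], r["masks"], r["class_ids"],
                            dataset_val.class_names, r["scores"])
```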

Prediction of a benign tumor
Ground truth mask

There are a few drawbacks to this project, including:

  • A fairly small sample size (only about 100 images each for the training and validation sets, or about 200 images in total)
  • Only classifying benign and malignant tumors and not normal tissue

But overall, it’s a solid demonstration of concept and would have beneficial applications if further developed. You can access the GitHub repository of this project here.

Have feedback or questions? Send me an email and I’ll be happy to respond!

You can check out more of what I’m up to in my quarterly newsletters. Sign up here.