The presence of lymph node metastases is one of the most important factors in breast cancer prognosis. The most common strategy to assess the regional lymph node status is the sentinel lymph node procedure. The sentinel lymph node is the most likely lymph node to contain metastasized cancer cells and is excised, histopathologically processed and examined by the pathologist. This tedious examination process is time-consuming and can lead to small metastases being missed. However, recent advances in whole-slide imaging and machine learning have opened an avenue for analysis of digitized lymph node sections with computer algorithms. For example, convolutional neural networks, a type of machine learning algorithm, are able to automatically detect cancer metastases in lymph nodes with high accuracy.
To train machine learning models, large, well-curated datasets are needed. In the context of the CAMELYON16 and CAMELYON17 Grand Challenges, we released a dataset of 1399 annotated whole-slide images of lymph nodes, both with and without metastases, totaling three terabytes of data. Slides were collected from five different medical centers to cover a broad range of image appearance and staining variations. Each whole-slide image has a slide-level label indicating whether it contains no metastases, macro-metastases, micro-metastases or isolated tumor cells. Furthermore, for 209 whole-slide images, detailed hand-drawn contours for all metastases are provided. Last, open-source software tools to visualize and interact with the data have been made available. A unique dataset of annotated, whole-slide digital histopathology images has been provided with high potential for re-use.
Accessing the Data
The CAMELYON16 and CAMELYON17 data sets are open access and shared publicly on GigaScience, Google Drive and Baidu Pan.
GigaScience Database:
Google Drive:
Baidu Pan:
Meta files: These files are available in the shared folders. They are shared here too for convenience.
- CAMELYON16: checksums.md5, README.md
- CAMELYON17: checksums.md5, README.md
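Given the size of the downloads, it can be worth checking each file against checksums.md5 before use. The following is a minimal sketch, assuming the file uses the common md5sum layout (one `<hex digest>  <relative path>` entry per line); that layout and the function names `md5_of_file` and `verify_checksums` are assumptions for illustration, not something the data set documentation specifies.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file, root):
    """Compare files under `root` with the entries of `checksum_file`.

    Assumes md5sum-style lines: "<hex digest>  <relative path>".
    Returns a dict mapping each listed path to "ok", "mismatch" or "missing".
    """
    results = {}
    for line in Path(checksum_file).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        target = Path(root) / name
        if not target.is_file():
            results[name] = "missing"
        elif md5_of_file(target) == expected.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify_checksums("checksums.md5", "CAMELYON16")` would then report which listed files are intact, altered or absent, provided the paths in the checksum file are relative to that folder.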
Data Description
CAMELYON16
Dataset Structure
The data in this challenge contains a total of 399 whole-slide images (WSIs) of sentinel lymph nodes from two independent data sets collected at Radboud University Medical Center (Nijmegen, The Netherlands) and the University Medical Center Utrecht (Utrecht, The Netherlands). The training set contains 270 WSIs (159 normal + 111 tumor) and the testing set contains 129 WSIs.
CAMELYON16
├── testing
│   ├── evaluation
│   ├── images
│   │   └── 129 files
│   ├── lesion_annotations.zip
│   └── reference.csv
└── training
    ├── lesion_annotations.zip
    ├── normal
    │   └── 159 files
    └── tumor
        └── 111 files
---------------------------------
Total size: 701G
---------------------------------
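After downloading, the folder layout above can be checked with a short script. This is only a sketch: the expected counts come from the tree above, while the `.tif` suffix and the helper names `count_slides` and `find_incomplete_folders` are assumptions made for illustration.

```python
from pathlib import Path

# Expected number of slide files per folder, taken from the tree above.
EXPECTED_COUNTS = {
    "training/normal": 159,
    "training/tumor": 111,
    "testing/images": 129,
}


def count_slides(root, suffix=".tif"):
    """Count files ending in `suffix` in each expected folder under `root`."""
    counts = {}
    for folder in EXPECTED_COUNTS:
        directory = Path(root) / folder
        if directory.is_dir():
            counts[folder] = sum(
                1 for p in directory.iterdir() if p.is_file() and p.suffix == suffix
            )
        else:
            counts[folder] = 0  # a missing folder counts as zero slides
    return counts


def find_incomplete_folders(root, suffix=".tif"):
    """Return {folder: (found, expected)} for folders with an unexpected count."""
    counts = count_slides(root, suffix)
    return {
        folder: (counts[folder], expected)
        for folder, expected in EXPECTED_COUNTS.items()
        if counts[folder] != expected
    }
```

An empty result from `find_incomplete_folders("CAMELYON16")` would suggest that all slide files are present; it does not check file contents.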
Annotations
The shared XML files are compatible with the ASAP software. You may download this software and visualize the annotations overlaid on the whole-slide image.
The provided XML files may have three groups of annotations (“_0”, “_1”, or “_2”), which can be accessed from the “PartOfGroup” attribute of the Annotation node in the XML file. Annotations belonging to groups “_0” and “_1” represent tumor areas, and annotations within group “_2” are non-tumor areas which have been cut out from the original annotations in the first two groups.
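As an illustration of the grouping described above, the sketch below reads an annotation file with Python's standard XML parser and separates tumor polygons from cut-outs. The “PartOfGroup” attribute on the Annotation node is taken from the description above; the nested `Coordinate` elements with `X` and `Y` attributes are an assumption about the ASAP XML layout, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def read_annotation_groups(xml_source):
    """Return {group name: [polygon, ...]}; each polygon is a list of (x, y).

    `xml_source` may be a file path or an open file object.
    """
    root = ET.parse(xml_source).getroot()
    groups = defaultdict(list)
    for annotation in root.iter("Annotation"):
        group = annotation.get("PartOfGroup")
        # Assumed layout: <Coordinate X=".." Y=".."/> children, in vertex order.
        polygon = [
            (float(c.get("X")), float(c.get("Y")))
            for c in annotation.iter("Coordinate")
        ]
        groups[group].append(polygon)
    return dict(groups)


def split_tumor_and_cutouts(groups):
    """Groups "_0" and "_1" are tumor areas; group "_2" holds the cut-outs."""
    tumor = groups.get("_0", []) + groups.get("_1", [])
    cutouts = groups.get("_2", [])
    return tumor, cutouts
```

The two returned lists can then be used together: a region counts as tumor only where it lies inside a tumor polygon and outside every cut-out polygon.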
All images except the ones listed below are fully annotated (all tumor areas have been exhaustively annotated). The annotations for the images listed below are not exhaustive; in other words, these slides might contain tumor areas that have not been annotated. Most of these slides contain two consecutive sections of the same tissue, and in those cases one section is typically exhaustively annotated:
- tumor_010
- tumor_015
- tumor_018
- tumor_020
- tumor_025
- tumor_029
- tumor_033
- tumor_034
- tumor_044
- tumor_046
- tumor_051
- tumor_054
- tumor_055
- tumor_056
- tumor_067
- tumor_079: Blurred tumor region is not annotated.
- tumor_085
- tumor_092: Blurred tumor region on the adjacent tissue is not annotated.
- tumor_095
- tumor_110
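Because unannotated regions of the slides listed above may still contain tumor, one cautious option is to skip these slides when sampling non-tumor patches from outside the annotations. The sketch below encodes the list as a set and filters slide paths against it; the helper names and the matching on file name without extension are assumptions for illustration.

```python
from pathlib import Path

# Slide names copied from the list above (annotations not exhaustive).
NON_EXHAUSTIVE_SLIDES = frozenset(
    f"tumor_{number:03d}"
    for number in (10, 15, 18, 20, 25, 29, 33, 34, 44, 46,
                   51, 54, 55, 56, 67, 79, 85, 92, 95, 110)
)


def is_exhaustively_annotated(slide_path):
    """True unless the file name (without extension) is in the list above."""
    return Path(slide_path).stem not in NON_EXHAUSTIVE_SLIDES


def slides_safe_for_negative_sampling(slide_paths):
    """Keep only slides whose unannotated tissue can be treated as non-tumor."""
    return [p for p in slide_paths if is_exhaustively_annotated(p)]
```

The annotated tumor regions of the listed slides remain usable as positive examples; only the assumption that everything outside the annotations is normal tissue does not hold for them.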
The following files have been intentionally removed from the original data set:
- normal_86: Originally misclassified, renamed to tumor_111.
- test_049: Duplicate slide.
Test set notes:
- test_114: Does not have exhaustive annotations.
CAMELYON17
Dataset Structure
The data in this challenge contains a total of 1000 whole-slide images (WSIs) of sentinel lymph nodes from 5 different medical centers in the Netherlands. The data set is divided into training and testing sets, with 20 patients from each center in both sets. For each patient, the 5 shared whole-slide images are zipped together into a single ZIP file. The patient pN-stages and the slide-level labels in the training set are shared in the stage_labels.csv file.
CAMELYON17
├── testing
│   ├── evaluation
│   │   ├── evaluate.py
│   │   └── submission_example.csv
│   └── patients
│       └── 100 patients
└── training
    ├── center_0
    │   └── 20 patients
    ├── center_1
    │   └── 20 patients
    ├── center_2
    │   └── 20 patients
    ├── center_3
    │   └── 20 patients
    ├── center_4
    │   └── 20 patients
    ├── lesion_annotations.zip
    └── stage_labels.csv
------------------------------------------
Total size: 2.4T (1.2T training + 1.2T testing)
------------------------------------------
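The stage_labels.csv file in the training folder holds both the patient pN-stages and the slide-level labels. The sketch below separates the two kinds of rows. It assumes a header with the columns `patient` and `stage`, and that patient-level rows are named after the patient ZIP file; these details are assumptions, so check them against the header and first rows of the actual file.

```python
import csv


def read_stage_labels(csv_path, name_column="patient", label_column="stage"):
    """Split stage_labels.csv into patient-level and slide-level labels.

    Assumed layout: one row per patient ZIP file (pN-stage) and one row per
    slide (slide-level label), told apart by the ".zip" file extension.
    Returns (patient_stages, slide_labels) as two dicts keyed by file name.
    """
    patient_stages, slide_labels = {}, {}
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            name, label = row[name_column], row[label_column]
            if name.endswith(".zip"):
                patient_stages[name] = label
            else:
                slide_labels[name] = label
    return patient_stages, slide_labels
```

If the real column names differ, they can be passed through the `name_column` and `label_column` parameters without changing the function.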
Annotations
From each center, 10 slides are exhaustively annotated, and the annotations are shared in XML format. The XML files are compatible with the ASAP software. You may download this software and visualize the annotations overlaid on the whole-slide image.
The provided XML files may have two groups of annotations (“metastases” or “normal”), which can be accessed from the “PartOfGroup” attribute of the Annotation node in the XML file. Annotations belonging to group “metastases” represent tumor areas, and annotations within group “normal” are non-tumor areas which have been cut out from the original annotations in the “metastases” group.
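The cut-out relationship between the two groups can be made concrete with a small point test: a location counts as tumor if it lies inside a “metastases” polygon and outside every “normal” polygon. The following sketch uses plain even-odd ray casting on polygons given as lists of (x, y) vertices; it is an illustrative approach with hypothetical function names, not the method used by ASAP or the challenge organizers.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test; `polygon` is a list of (x, y) vertices."""
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Only edges that straddle the horizontal line through (x, y) matter.
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def is_tumor(x, y, metastases, normal):
    """True if (x, y) is inside a tumor polygon and outside every cut-out."""
    in_tumor = any(point_in_polygon(x, y, p) for p in metastases)
    in_cutout = any(point_in_polygon(x, y, p) for p in normal)
    return in_tumor and not in_cutout
```

For full-resolution masks a rasterization library is likely faster than testing points one by one; this version is meant only to show the logic.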
Visualizing
Citation
CAMELYON16
Ehteshami Bejnordi B, Veta M, Johannes van Diest P, van Ginneken B, Karssemeijer N, Litjens G, van der Laak JAWM, and the CAMELYON16 Consortium. *Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer*. JAMA. 2017;318(22):2199–2210. doi:10.1001/jama.2017.14585
CAMELYON17
Geert Litjens, Peter Bandi, Babak Ehteshami Bejnordi, Oscar Geessink, Maschenka Balkenhol, Peter Bult, Altuna Halilovic, Meyke Hermsen, Rob van de Loo, Rob Vogels, Quirine F Manson, Nikolas Stathonikos, Alexi Baidoshvili, Paul van Diest, Carla Wauters, Marcory van Dijk, Jeroen van der Laak. 1399 H&E-stained sentinel lymph node sections of breast cancer patients: the CAMELYON dataset. GigaScience, giy065, DOI: 10.1093/gigascience/giy065
Babak Ehteshami Bejnordi; Mitko Veta; Paul Johannes van Diest; Bram van Ginneken; Nico Karssemeijer; Geert Litjens; Jeroen A. W. M. van der Laak; and the CAMELYON16 Consortium. Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA. 2017;318(22):2199–2210. DOI: 10.1001/jama.2017.14585
Peter Bandi, Oscar Geessink, Quirine Manson, Marcory van Dijk, Maschenka Balkenhol, Meyke Hermsen, Babak Ehteshami Bejnordi, Byungjae Lee, Kyunghyun Paeng, Aoxiao Zhong, Quanzheng Li, Farhad Ghazvinian Zanjani, Svitlana Zinger, Keisuke Fukuta, Daisuke Komura, Vlado Ovtcharov, Shenghua Cheng, Shaoqun Zeng, Jeppe Thagaard, Anders B. Dahl, Huangjing Lin, Hao Chen, Ludwig Jacobsson, Martin Hedlund, Melih Cetin, Eren Halici, Hunter Jackson, Richard Chen, Fabian Both, Jorg Franke, Heidi Kusters-Vandevelde, Willem Vreuls, Peter Bult, Bram van Ginneken, Jeroen van der Laak, and Geert Litjens. From detection of individual metastases to classification of lymph node status at the patient level: the CAMELYON17 challenge. IEEE-TMI (Early Access) DOI: 10.1109/TMI.2018.2867350