Important note to the reader
This blog can be seen both as a description of a particular R package, oro.dicom, which is widely used in medical research involving imaging and patient information, and as an introduction to a future blog on pathology detection using convolutional neural networks.
As many of my readers know, I have a tendency to write lengthy posts, and to avoid writing an even lengthier one than usual I have decided to break this one up into two distinct parts. The reason is that I need to introduce two subjects before discussing the use of convolutional networks in pathology detection, namely the DICOM format and how this file format is handled in R.
I also want to bring to the reader's attention that I am not a health-care professional and that any model I share is not to be taken as a final and fool-proof model for detecting any pathology. My aim is solely to present a technique that could be useful if developed together with qualified professionals (e.g. researchers and specialized physicians). It is nevertheless the case that convolutional neural networks have shown strength in detecting pathologies without medical training, and the development of these techniques deserves to be taken into consideration.
Purpose
The end goal of this two-part article is to develop a particular convolutional neural network model for rapid screening of patients affected by breast cancer.
Breast cancer is one of the most common types of cancer, together with lung, bowel and prostate cancer. It has been estimated that there will be 23.6 million new cancer cases each year by 2030. Breast cancer occurs when abnormal cells in the breast multiply uncontrollably and eventually form a tumour. A single survival rate for breast cancer cannot be given, as it varies depending on a wide range of factors. Studies are being done on the importance of an unfavorable distribution of stage at diagnosis due to low screening rates, limited access to care and treatment, tumor type, comorbidities, socioeconomic status, obesity, and physical activity. However, it is already clear that the strongest factor for survival is early detection. It is therefore of paramount importance to speed up the process of early discovery of cancer cells for large groups of women. The aim is thus to use mammogram imaging to classify mammograms into normal images as well as benign and malignant cancer types.
As mentioned at the top of this blog, I wish (for once!) to avoid a lengthy article and therefore concentrate this first blog on the introduction of a particular R package and the format used in medical imaging. If you are already familiar with both, you will maybe not learn anything new from what follows.
The DICOM-format
There are mainly two formats for handling imaging and information in medicine, the DICOM and NIfTI formats. We will only consider the DICOM format in this blog, as there are ways to convert one to the other. DICOM, or Digital Imaging and Communications in Medicine, is a widely used standard for handling, storing, printing, and transmitting information in medical imaging. It hence ensures that images and the information related to them (such as patient information, notes, which equipment has been used and how the images were taken) are transmitted together and that no patient mismatch may occur. DICOM has its origins in the joint efforts of two associations, the ACR (American College of Radiology) and NEMA (National Electrical Manufacturers Association), to standardize the deciphering of the digital images produced by computed tomography. The result was the ACR/NEMA 300 standard, which in its third version became DICOM. Several industries have derived formats from it, among them the airport security industry, which developed DICOS (Digital Imaging and Communications in Security) to analyse security-check images.
DICOM uses three different data element encoding schemes. With explicit Value Representation (VR), each data element carries a two-letter code describing its contents: AE (Application Entity), AS (Age String), AT (Attribute Tag), CS (Code String), DA (Date), DS (Decimal String), DT (Date Time), FL (Floating Point Single), FD (Floating Point Double), IS (Integer String), LO (Long String), LT (Long Text), OB (Other Byte), OD (Other Double), OF (Other Float), OL (Other Long), OW (Other Word), PN (Person Name), SH (Short String), SL (Signed Long), SQ (Sequence of Items), SS (Signed Short), ST (Short Text), TM (Time), UC (Unlimited Characters), UI (Unique Identifier, UID), UL (Unsigned Long), UN (Unknown), UR (Universal Resource Identifier or Universal Resource Locator, URI/URL), US (Unsigned Short) and UT (Unlimited Text). It may seem a little unclear what all these VRs may or may not include, but we will go through an example later in this article. Some of these value representations are of course self-explanatory, such as UI, PN and LT; others require knowledge of what is needed to acquire sufficient information about a study. By study we mean the result of an examination or tests performed on the individual patient. A single patient can be subjected to many studies, and several devices can be used throughout this process. Different positions or angles may also be needed in each individual study, and it is therefore important to keep track of the chronological order in which these studies come. The consequences of not paying attention to time could of course be disastrous.
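As a small illustration of how the string-typed VRs behave in practice, the oro.dicom package (introduced below) ships helper functions such as str2date and str2time for converting DA and TM strings into R date/time values. A minimal sketch; the TM string here is made up for illustration:

```r
library(oro.dicom)

# A DA (Date) value is stored as "YYYYMMDD"
str2date("20170830")       # converts the string into an R Date object

# A TM (Time) value is stored as "HHMMSS.FFFFFF"
str2time("085915.685000")  # parses hours, minutes and fractional seconds
```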
The content of a DICOM file
What better way is there to describe how a tool works than by actually using it? The oro.dicom package in R enables one to extract the entire content of a DICOM-formatted information package, including all images attached to it. This is exactly what we aim to do for the rest of this article.
Resource
As convolutional neural networks need images to be trained on, the first step is to find enough data to obtain a high enough accuracy. This is particularly true regarding cancer or other pathologies, since misclassification could have lethal consequences. Fortunately, data on breast cancer exists and is made public by the Cancer Imaging Archive. It provides a subset of the DDSM (Digital Database for Screening Mammography), a collection of mammograms from the following sources: Massachusetts General Hospital, Wake Forest University School of Medicine, Sacred Heart Hospital, and Washington University of St Louis School of Medicine. The subset, CBIS-DDSM (the Curated Breast Imaging Subset of DDSM), contains selected cases curated by a trained mammographer. The images have been decompressed and converted to DICOM format.
Accessing and understanding CBIS-DDSM data
Downloading the entire CBIS-DDSM database takes a while. It contains 10 237 images for 6 671 patients (and thus 6 671 studies), occupying 163 GB in total. To be able to work with the data one needs a good understanding of how it is structured. It is divided into two groups, MASS and CALC, which are themselves sub-categorized into TRAINING and TEST, with 1, 2 or 3 subfolders depending on the amount of information to be stored.
The MASS training and test sets contain mammograms displaying either malignant or benign cancer masses, or abnormalities that are not deemed to be cancer. The CALC training and test sets contain mammograms displaying identified calcifications or micro-calcifications. Microcalcifications can be the early and only presenting sign of breast cancer, and mammography is used worldwide to detect microcalcifications in order to diagnose cancer at a nonpalpable stage and to determine the extent of the disease. 30-50% of mammographic cancers appear as pleomorphic microcalcifications (pleomorphic meaning having many different shapes), with or without a mass or lump. As early detection of cancer is the strongest predictor of patient survival, it is essential to perform a proper evaluation of the various calcifications to decide whether they are benign or malignant. Furthermore, biopsy can be avoided if the calcifications appear benign on mammography; an annual screening mammography is then the only requirement to follow the patient.
Whether the image shows a mass or a calcification is indicated in the file name, together with whether it is of the right or left breast. Furthermore, the orientation of the picture is given in the folder name. There are two main views in mammography, the craniocaudal (CC) view, taken from above, and the mediolateral-oblique (MLO) view, which is, as the name indicates, an oblique or angled view. This information is also retrievable from the metadata in the DICOM file. The picture below shows examples of the data in the MASS and CALC metadata.
Now that we have a full description of the database, let's turn to how to handle its content. Only files with the extension *.dcm are present, so life becomes a little simpler. The first step is of course to install the relevant R packages that enable one to read and extract information from DICOM files. There are two packages of interest here: the oro.dicom package, which is the only one actually needed to work with DICOM files, and the oro.nifti package, which enables one to produce 3D images from series of frames. Once this is done, it is wise to set a working directory and create a list of all files with the extension *.dcm.
library(oro.dicom)
library(oro.nifti)
setdir = setwd("D:/......")
All_files = list.files(pattern = "\\.dcm$", recursive = TRUE)
Let’s take a look at how this list is structured
As you can see, it gives the path to where the DICOM files are situated, whether it is a calcification image in the test or training set, the patient's unique id, when the picture was added from the DDSM database to the CBIS-DDSM database, and some method-relevant data. Note also that the same patient can appear multiple times. This is because of the number of studies made on the same subject and the different orientations of the equipment used to produce the mammogram. Just for the sake of precision, the ROI label in the path name indicates that the image is restricted to the Region Of Interest, as opposed to the "full mammogram images".
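Since the labels we will eventually train on are encoded in these path components, it is handy to parse them programmatically rather than by eye. A minimal base-R sketch, assuming identifiers of the form Calc-Training_P_01840_RIGHT_MLO (the exact layout should be verified against your local copy of the archive):

```r
# Hypothetical example identifier, following the series naming scheme
id <- "Calc-Training_P_01840_RIGHT_MLO"

parts <- strsplit(id, "_")[[1]]
info <- list(
  lesion  = sub("-.*$", "", parts[1]),             # "Calc" or "Mass"
  set     = sub("^.*-", "", parts[1]),             # "Training" or "Test"
  patient = paste(parts[2], parts[3], sep = "_"),  # "P_01840"
  side    = parts[4],                              # "RIGHT" or "LEFT"
  view    = parts[5]                               # "CC" or "MLO"
)
info$lesion  # "Calc"
```

Applying such a function over All_files gives a label table that can later be matched against the DICOM metadata as a sanity check.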
For each one of these items we have one or several images, depending on the number of frames for different orientations. To obtain this information, let's choose one of these files. The DICOM files are divided, as mentioned above, into the image and the metadata associated with it. Let's choose file 4900, which is a calcification detection in the right breast from a mediolateral-oblique examination for which the region of interest has been isolated.
TestFile = All_files[[4900]]
Testfile_info = readDICOMFile(TestFile)
As you can see, the information is divided into the image and the metadata, which itself is divided into types. One particularly useful part is hdr$name, which gives most of the information needed to understand the image.
> Testfile_info$hdr$name
 [1] "GroupLength"                                  "FileMetaInformationVersion"
 [3] "MediaStorageSOPClassUID"                      "MediaStorageSOPInstanceUID"
 [5] "TransferSyntaxUID"                            "ImplementationClassUID"
 [7] "ImplementationVersionName"                    "SpecificCharacterSet"
 [9] "SOPClassUID"                                  "SOPInstanceUID"
[11] "StudyDate"                                    "ContentDate"
[13] "StudyTime"                                    "ContentTime"
[15] "AccessionNumber"                              "Modality"
[17] "ConversionType"                               "ReferringPhysiciansName"
[19] "SeriesDescription"                            "PatientsName"
[21] "PatientID"                                    "PatientsBirthDate"
[23] "PatientsSex"                                  "ModificationDate"
[25] "Unknown"                                      "Unknown"
[27] "BodyPartExamined"                             "SecondaryCaptureDeviceManufacturer"
[29] "SecondaryCaptureDeviceManufacturersModelName" "StudyInstanceUID"
[31] "SeriesInstanceUID"                            "StudyID"
[33] "SeriesNumber"                                 "InstanceNumber"
[35] "PatientOrientation"                           "Laterality"
[37] "SamplesperPixel"                              "PhotometricInterpretation"
[39] "Rows"                                         "Columns"
[41] "BitsAllocated"                                "BitsStored"
[43] "HighBit"                                      "PixelRepresentation"
[45] "SmallestImagePixelValue"                      "LargestImagePixelValue"
[47] "PixelData"
It gives information about the patient (although most of it is masked), the study, the practitioner that performed the study, and the modality (i.e. what kind of study it is; in our case a mammography).
Even more informative are the values of the metadata (which I have truncated here because the omitted information is of no interest to us right now):
> Testfile_info$hdr$value
----------
 [7] "dcm4che-1.4.35"
 [8] "ISO_IR 100"
 [9] "1.2.840.10008.5.1.4.1.1.7"
[10] "1.3.6.1.4.1.9590.100.1.2.400308168411170217716835373273028057543"
[11] "20170830"     (Study date)
[12] "20160503"
--------
[15] ""
[16] "MG"           (Modality)
[17] "WSD"          (Conversion type)
-----
[19] "cropped images"                     (Series description)
[20] "Calc-Training_P_01840_RIGHT_MLO_2"  (Patient name)
[21] "Calc-Training_P_01840_RIGHT_MLO_2"  (Patient ID)
[22] ""             (Patient birth date)
[23] ""             (Patient sex)
[24] "CTP"
[25] "CBIS-DDSM"
[26] "43352602"
[27] "BREAST"       (Body part examined)
-------
[30] "1.3.6.1.4.1.9590.100.1.2.384564524811109665500955449833183199023"
[31] "1.3.6.1.4.1.9590.100.1.2.94656632512895465727964522580817303119"
[32] "DDSM"
[33] "1"
[34] "1"
[35] "MLO"          (Patient orientation)
[36] "R"            (Laterality)
[37] "1"
[38] "MONOCHROME2"
------
[45] "42972"
[46] "65535"
[47] "PixelData"
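If only a handful of fields are needed, there is no reason to scan these printed vectors by eye: readDICOMFile returns the header as a data frame with name and value columns, so base-R subsetting is enough. A minimal sketch building on the Testfile_info object from above (get_field is a little helper of my own, not part of the package):

```r
hdr <- Testfile_info$hdr

# Helper: return the first value stored under a given header name
get_field <- function(field) hdr$value[hdr$name == field][1]

get_field("Modality")          # "MG" for this file
get_field("Laterality")        # "R"  (right breast)
get_field("BodyPartExamined")  # "BREAST"
```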
A table with all the information is easily constructed by using the following code:
dcmImages = readDICOM(TestFile, verbose = TRUE, recursive = FALSE, exclude = "sql")
dcm.info = t(dicomTable(dcmImages$hdr))
Since my aim with this blog is to introduce the required parts of the oro.dicom package in order to facilitate a future blog on non-medical pathology detection, I will not dwell on the details and refinements of the package. It will be sufficient for me to be able to extract the necessary information from the database, classify the images given in the (already curated) CBIS-DDSM database and attach the metadata to them. Hence, we simply end this article with instructions for extracting the images from the DICOM file.
image(t(dcmImages$img[[1]]), col = grey(0:64/64), axes = FALSE, xlab = "", ylab = "")
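For completeness, the series read above can also be wrapped into a NIfTI object, which is where the oro.nifti package mentioned earlier comes in; this is convenient if you later want to treat a multi-frame series as a 3D volume. A sketch (the output file name is my own choice):

```r
library(oro.dicom)
library(oro.nifti)

# Convert the DICOM series read above into a NIfTI object
nifti_img <- dicom2nifti(dcmImages)

# Write it to disk as mammogram.nii.gz for later use
writeNIfTI(nifti_img, "mammogram")
```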
End notes
This article is to be seen purely as an introduction to the essential tools and concepts needed to go forward with non-medical-learning pathology detection using deep convolutional neural networks. It may seem technical and not very fashionable, but many times one needs to go through a phase of learning from different areas in order to reach a goal. My goal is to put my knowledge of analytical methods and artificial intelligence to use for society, and especially in health care. As I am not a trained health-care professional, I wish to put my particular knowledge at the service of this profession and contribute to advances that in the end will maybe reduce unnecessary human suffering and deaths. Helping, through very cheap methods, to eradicate disparities in disease outcomes throughout the globe is one way to contribute to this goal.
Deep neural networks can be used in a variety of ways and, as I described in a previous post (A Gentle Introduction to Image Recognition by Convolutional Neural Network), they are after all not THAT complicated to understand once you've been taken through the ideas and the math. Understanding this opens the doors to many applications with different purposes, from text recognition to what this article and the upcoming one will show.
Citations
CBIS-DDSM Citation
Rebecca Sawyer Lee, Francisco Gimenez, Assaf Hoogi, Daniel Rubin (2016). Curated Breast Imaging Subset of DDSM. The Cancer Imaging Archive. http://dx.doi.org/10.7937/K9/TCIA.2016.7O02S9CY
Publication Citation
Coming Soon
TCIA Citation
Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository, Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045-1057. (paper)