Deep Learning in Digital Pathology
Research in the course of a PhD thesis, by Adrien Foucart.
Last year, I did a video tutorial for the INFO-H-501 "Pattern Recognition and Image Analysis" course at the Université Libre de Bruxelles. I apparently forgot to also put it here, so here we are...
The video contains a bit of theory on how to approach a segmentation problem with "Deep Learning", and a guide to a working implementation of the pipeline using Tensorflow v2.
We (...) have an unusual situation that we wish to search for objects present in low concentration, the quantity of which is irrelevant, and the final classification of which is not agreed by different experts. At this point the engineer may well begin to feel despair. (A.I. Spriggs, 1969)
This quote comes from a 1969 publication entitled "Automatic scanning for cervical smears". This paper is interesting for several reasons:
Specifically, Spriggs worries that the "Carcinoma in situ" which they are trying to detect "is not a clear-cut entity", with "a whole spectrum of changes" where "nobody knows where to draw the line and when to call the lesion definitely precancerous". To sum up: "We therefore do not really know which cases we wish to find". Moreover, "the opinions of different observers also vary".
On that latter point, Spriggs also notes that the classifications or grading used by pathologists are often more "degrees of confidence felt by the observer" than measurable properties of the cells. "It is therefore nonsensical to specify for a machine that it should identify these classes." The only reason to do it is that we have to measure the performance of the machine against the opinion of the expert. This is slightly less of a problem today, as the grading systems have evolved and become more focused on quantifiable measurements, but they still often allow for large margins of interpretation. As we match our systems against these grading, we constrain them to "mimic" the reasoning of the pathologists instead of focusing on the underlying problem of finding relationships between the images and the evolution of the disease itself.
"Digital pathology" really started to appear in the scientific literature around the year 2000. At that time, the focus was on telepathology or virtual microscopy: allowing pathologists to move away from the microscope, to more easily share images for second opinions, and to better integrate the image information with the rest of the patient's record [2, 3, 4].
Digital pathology also relates to image analysis. The 2014 Springer book "Digital Pathology", for instance, includes in its definition not only the acquisition of the specimens "in digital form", but also "their real-time evaluation", "data mining", and the "development of artificial intelligence tools".
The terminology may be recent, but the core idea behind it (linking an acquisition device to a computer to automate the analysis of a pathology sample) is about as old as computers. One of the earliest documented attempts may be the Cytoanalyzer... in the 1950s.
Two paragraphs in a 1955 issue of LIFE magazine, sandwiched between ads for mattresses and cars, present an "electronic gadget" which "will spot abnormal cells (...) in about a minute, saving time-consuming specialized analysis for every case". A more comprehensive description of the prototype was given by Walter E. Tolles in the Transactions of the New York Academy of Sciences.
The Cytoanalyzer (see Fig. 2) had three units: the power supply and computer (left), the scanner (middle) and the oscilloscopes for monitoring and presentation (right). The scanner converts the "density field" of the slide into an electric current, which is used to analyse the properties of the cells and discriminate between normal and abnormal ones.
Two clinical trials of the Cytoanalyzer were conducted, in 1958-59 and in 1959-60. An analysis of the results by Spencer & Bostrom in 1962 considered them "inadequate for practical application of the instrument".
Decades passed. Hardware and software vastly improved, yet even as new methods in image analysis and artificial intelligence got better at solving tasks related to pathology, they still fell short of the strict requirements of an automated system for diagnosis. In 2014, Hamilton et al. wrote that "even the most state of the art AI systems failed to significantly change practice in pathology". The problem is that medical diagnosis is generally made by integrating information from multiple sources: images from different modalities, symptoms expressed by the patient, records of their medical history... AI systems can be very successful at relatively simpler sub-tasks (finding nuclei, delineating glands, grading morphological patterns in a region...), but they are, at this point, unable to get the "big picture". Not to mention, of course, all the thorny obstacles to widespread adoption: trust in the system, regulatory issues, insurance and liability issues, and so on.
More than 65 years after the Cytoanalyzer, routine use of AI in clinical practice for pathology appears to be very close... but we're still not there yet. The performance of deep learning algorithms, combined with the widespread use of whole-slide scanners producing high-resolution digital slides, makes the field of computer-assisted histopathology a very active and optimistic one at the moment. Still, even with the excellent results of Google Health's breast cancer screening system in clinical studies, it's not clear that automated systems are ready for real practice.
The difficulties of our algorithms are in large part the same that were identified by Spriggs in 1969: it is difficult or impossible to get an "objective" assessment of pathological slides and, even with modern grading systems, inter-expert disagreement is high. This makes training and evaluating algorithms more difficult, and when dealing with a subject as sensitive as healthcare, any result short of near perfection will have a hard time getting adopted by the medical community... and by the patients.
AI system is better than human doctors at predicting breast cancer (J. Hamzelou, NewScientist)
AI Now Diagnoses Disease Better Than Your Doctor, Study Finds (D. Leibowitz, Towards Data Science)
AI Can Outperform Doctors. So Why Don’t Patients Trust It? (C. Longoni and C. K. Morewedge, Harvard Business Review)
Machines have beaten us at chess and Go, they can drive our cars... and now they are better doctors? That last article from Harvard Business Review asks an interesting question. If machines can perform "with expert-level accuracy", why don't we trust them? In their study, they find that resistance to medical AI is driven by "a concern that AI providers are less able than human providers to account for consumers' unique characteristics". The problem, then, would be mostly one of perception: AI can be better than doctors in general, but we fear that it might not be better for us personally.
It's really hard to have a purely "rational" (whatever that means) discussion about the merits of AI versus human doctors, and part of the problem is in the terminology: "Artificial Intelligence". AI is a relatively vague categorization of a certain domain of computer science, but it is perhaps more importantly a term heavily associated with its historical use in science-fiction and scientific speculation. In fiction, AI is often used as a device to explore what it means to be human. It almost always comes along with an "artificial consciousness". An AI is a person - or at least it tries to be. It's a notion that we find both in stories where the AI is the "good guy" who just wants to be accepted into society, and in dystopian stories where an AI becomes "self-aware" and starts a war on humans.
So when we say "AI is better than humans at...", we think about those AIs, the self-aware machines. But modern AI, and in particular the kind of algorithms behind the "better than humans" headlines, has absolutely nothing to do with any of that. Deep Learning algorithms are tools which are entirely in the hands of the engineers and doctors who use them. They have no more "personality", "wants", or "consciousness" than any other tool.
AI is "better than humans" at predicting breast cancer in the same sense as a bread slicer machine is "better than humans" at slicing bread. In both cases, humans are in control of what goes in, what comes out, and how the machine is set up.
It doesn't really make sense to talk about "better than human" performance for machine learning systems, because everything that works or doesn't work in such algorithms can be traced back to humans. The engineers, computer scientists or mathematicians create a mathematical model which fully determines the range of data that the system can manage, and the kind of output that it can produce. Doctors and medical experts provide the data and annotations that will determine the parameters of the model, which only learns how to best reproduce what it's been shown.
If there is a bias in the results of an algorithm, it's not because "the AI" is biased, it's because the people who designed the dataset and the learning process were.

The term "AI" was reportedly coined in 1956 by John McCarthy, at the foundational "Dartmouth workshop". At the time, computer science was in its infancy, and the dream was that every aspect of human intelligence could eventually be replicated by a computer. That idea is now mostly reserved for the field of "Artificial General Intelligence" (AGI), and it's not certain that such a thing is even possible. AI, as it's mostly used today, is not "intelligence". It's a toolbox, a set of techniques that we can use to perform various tasks, but it doesn't have an "identity". It starts and stops at the press of a button, and does nothing more and nothing less than what it's programmed to do.
AI will never be "better than doctors" because it's a meaningless proposition, but that doesn't mean AI has no place in medicine. AI techniques provide very useful tools which can vastly improve patient care.
The best way (in my opinion) to describe existing AI is as a very well indexed library of knowledge. The very complex models that compose modern Deep Learning methods can "store" in their parameters huge amounts of medical knowledge, in the form of links between observations and desired outputs. Trained algorithms can afterwards process huge amounts of data in a short amount of time.
In clinical practice, this may be very useful to "flag" potentially difficult cases (for instance, if a doctor's diagnosis is different from what the algorithm says, it may be useful to review the results), or to provide a quick first diagnosis in cases where there are no doctors available. It can also serve as a form of quality control. Perhaps more importantly, it can help a lot with ongoing research. From large amounts of patient information, together with the evolution of their diseases in retrospective studies, machine learning algorithms can detect patterns associated with clinical outcomes. Sometimes, those patterns aren't associated with things that we already know about the disease, so this may give new avenues to explore. Avenues which may lead to dead ends, or to important discoveries. To find out which, human experts will remain firmly in the loop, at least for the foreseeable future.
A digital pathology task that illustrates quite well the "imperfect nature of digital pathology annotations" that I mentioned in the previous post is artefact detection. In the first post of this blog, I presented the main steps of digital pathology, from the extraction of the tissue sample to the digitisation of the slide. Throughout this process, the tissue is manipulated, cut, moved, stained, cleaned, sometimes frozen, and more generally exposed to the elements. This can cause serious damage, as illustrated in the figure below. Sometimes, problems can also occur during the acquisition process, leading to blurriness or contrast issues.
Depending on the severity of those issues, we may have to discard parts of the image from further analysis, redo the acquisition process or, in extreme cases, require a new slide to be produced. This can have a big negative impact on the pathology workflow, and it is also a potential source of uncertainty in the results. Typically, this "quality control" is done manually. This takes a lot of time, and it is a very subjective process which can lead to many mistakes. Some algorithms have been proposed in the past to detect specific types of artefacts, such as blurry regions or tissue folds, but our goal was to develop a general-purpose method to catch most artefacts in one pass.
Whole-slide images are very large, and artefacts can be of varying sizes and shapes, making them extremely tedious to annotate. This is a situation that will always lead to imperfect annotations. We had one slide annotated in detail by a histology technologist, and even in that slide many annotations were missing or imprecise. To get enough data to train a neural network, I quickly annotated the rest of the dataset based on the technologist's example, and I'm not an expert at all. As a result, we had a dataset with annotations that were rough, didn't really match the contours of the artefacts, and where a lot of artefacts were not marked at all. We had 26 slides in total. Slides are very large images, so while twenty-six seems like a very low number for training a neural network, we can actually extract a very large number of smaller patches from those slides to satisfy our needs.
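To give an idea of the numbers involved, here is a small sketch of how tiling a single slide into patches already produces tens of thousands of training examples. The slide and patch dimensions are hypothetical, chosen only to illustrate the orders of magnitude, not the actual dimensions of our dataset:

```python
# Sketch: tiling a single whole-slide image into fixed-size patches.

def patch_coordinates(width, height, patch_size, stride):
    """Yield the top-left (x, y) coordinates of patches tiled over a slide."""
    for y in range(0, height - patch_size + 1, stride):
        for x in range(0, width - patch_size + 1, stride):
            yield (x, y)

# A modest 50,000 x 30,000 pixel slide, 256-pixel patches, no overlap:
n_patches = sum(1 for _ in patch_coordinates(50_000, 30_000, 256, 256))
print(n_patches)  # tens of thousands of patches from one slide
```

With overlapping patches (a stride smaller than the patch size), or patches sampled at random positions, the pool of distinct training examples grows even larger.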
An interesting thing about the data is that it has different types of staining, as shown in the figure above. Some of the slides have been stained with a standard "Haematoxylin & Eosin" treatment, which is typically used to show the general morphology of the tissue, while others have been stained with immunohistochemistry biomarkers, highlighting tumoral tissue in brown.
One of the first questions we always have to answer once we have a dataset is: what do we use to train our models, and what do we use to evaluate them?
In this case, as we had one slide with "better" annotations than the rest, we decided to keep it for testing. There were two good reasons for that: first, this would make the evaluation more likely to be meaningful; second, that slide came from a different source and a different staining protocol than the others, which would help us test the "generalization" capability of our methods.
The main thing we wanted to see was: what parts of the machine learning pipeline were really important, and had an impact on the results? What we initially did was to take a relatively small, straightforward neural network architecture, and to test many different small changes in its depth ("how many layers") and width ("how many neurons per layer"). We also looked at whether we should use different levels of magnification on the image, or if we should try to "balance" the dataset by sampling more artefacts, as there were more examples of normal tissue than of artefact tissue in the data. Finally, we tried to compare using a "segmentation" output (predicting for each pixel if there is an artefact or not) or a "detection" output (predicting for the whole image patch if there is an artefact or not).
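This kind of systematic comparison amounts to sweeping a grid of configurations and training one model per configuration. As a sketch (the values below are illustrative, not the actual settings from our experiments):

```python
from itertools import product

# Hypothetical experimental grid: small variations around a baseline network.
depths = [3, 4, 5]                   # number of convolutional blocks
widths = [16, 32, 64]                # filters in the first layer
magnifications = [1, 2, 4]           # downsampling factor of the input
artefact_ratios = [0.1, 0.25, 0.5]   # fraction of artefact patches per batch
outputs = ["segmentation", "detection"]

experiments = [
    dict(depth=d, width=w, magnification=m, artefact_ratio=r, output=o)
    for d, w, m, r, o in product(depths, widths, magnifications,
                                 artefact_ratios, outputs)
]
print(len(experiments))  # 162 configurations to train and compare
```

Even a modest grid like this one quickly produces a lot of training runs, which is why the networks themselves were kept small.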
We published our first results at the "CloudTech 2018" conference. What we found most interesting was that, quite clearly, the details of the network architecture didn't really matter. The balance of the dataset and how we decided to produce our "per pixel" output (directly from the network, or by combining the detection results from overlapping patches) were the main factors that influenced the final results.
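One simple way to build a per-pixel output from overlapping patch-level detections is to average the scores of every patch covering each pixel. This is a sketch of the general idea, not our exact implementation:

```python
# Sketch: averaging overlapping patch detection scores into a pixel map.

def patches_to_pixel_map(width, height, patches):
    """patches: list of (x, y, size, score) patch detection results."""
    score_sum = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for x, y, size, score in patches:
        for row in range(y, min(y + size, height)):
            for col in range(x, min(x + size, width)):
                score_sum[row][col] += score
                count[row][col] += 1
    # Pixels covered by no patch get a score of 0 (assumed "normal").
    return [[score_sum[r][c] / count[r][c] if count[r][c] else 0.0
             for c in range(width)] for r in range(height)]

# Two overlapping 4x4 patches on a tiny 8x4 "slide":
pixel_map = patches_to_pixel_map(8, 4, [(0, 0, 4, 1.0), (2, 0, 4, 0.5)])
print(pixel_map[0])  # [1.0, 1.0, 0.75, 0.75, 0.5, 0.5, 0.0, 0.0]
```

The resulting map can then be thresholded to decide, pixel by pixel, what counts as an artefact.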
Let's have a look at this result, for instance:
Here, we have the same network, which uses patch-based detection, and has been trained using different data sampling. If we randomly sample the patches from the original dataset, we don't have enough examples of artefacts, and the network misses most of them. If we increase the percentage of artefact examples in the training batches, the network becomes a lot more sensitive. Obviously, this comes at the risk of falsely detecting normal tissue.
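The oversampling itself is straightforward: instead of drawing patches uniformly, we fix the fraction of artefact examples in each training batch. A minimal sketch (with a hypothetical helper, not our actual training code):

```python
import random

# Sketch: build a training batch with a fixed fraction of artefact
# examples, oversampling the rare artefact class with replacement.

def balanced_batch(normal_patches, artefact_patches, batch_size,
                   artefact_fraction):
    n_artefact = int(round(batch_size * artefact_fraction))
    batch = (random.choices(artefact_patches, k=n_artefact)
             + random.choices(normal_patches, k=batch_size - n_artefact))
    random.shuffle(batch)  # avoid ordering effects during training
    return batch

normal = [("normal", i) for i in range(1000)]
artefact = [("artefact", i) for i in range(30)]  # rare class
batch = balanced_batch(normal, artefact, 64, 0.5)
print(sum(1 for label, _ in batch if label == "artefact"))  # 32
```

Setting `artefact_fraction` well above the natural frequency of artefacts is exactly the trade-off described above: higher sensitivity to artefacts, at the cost of more false detections on normal tissue.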
Evaluation metrics have to be related to the usage that we want to make of the results. In this case, the goals can be to assess the "quality" of the slide (or: to know how much of the slide is corrupted), and to remove damaged areas from further processing. For the latter task, we can reformulate it as: finding the "good" areas that can be used later.
There is no single metric that can really give us all that information. We computed the accuracy (which proportion of the pixels were correctly classified), the "True Positive Rate" (which proportion of the annotated artefacts did we correctly identify), the "True Negative Rate" (which proportion of the normal tissue did we correctly identify), and the "Negative Predictive Value" (which proportion of the tissue we identified as normal is actually normal). A low TPR would mean that we missed many artefacts; a low TNR that we removed a large portion of normal tissue; a low NPV that our "normal tissue" prediction is unreliable. To take the subjective nature of the evaluation into account, we also added a simple qualitative assessment, which we formulated as a binary choice: could this result be reasonably used in a digital pathology pipeline?
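All four metrics derive from the same per-pixel confusion counts. As a sketch (the label convention of 1 = artefact, 0 = normal tissue is an assumption made for this illustration):

```python
# Sketch: accuracy, TPR, TNR and NPV from per-pixel binary predictions.

def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tpr": tp / (tp + fn),  # proportion of artefact pixels found
        "tnr": tn / (tn + fp),  # proportion of normal tissue kept
        "npv": tn / (tn + fn),  # reliability of the "normal" prediction
    }

# Toy example: 10 pixels, 4 annotated as artefact.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
m = binary_metrics(y_true, y_pred)
print(m["accuracy"], m["tpr"])  # 0.7 0.5
```

The toy example shows why accuracy alone is misleading here: with artefacts in the minority, a model can score 0.7 in accuracy while missing half of the artefact pixels.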
None of those metrics were really satisfactory. The trend that "what goes around the network" had a lot more impact than minute changes within, however, was clear enough that we could draw useful conclusions from those first experiments.
These first experiments with artefact detection showed that the way we prepared our dataset, and the way we defined our problem, had more impact on our results than the network itself. This is particularly visible when working on a dataset with very imperfect annotations, where just fitting the raw data into a deep learning model will lead to extremely poor results.
A lot of published research in computer vision for digital pathology is done on datasets published in challenges. These datasets tend to be very "clean", with precise annotations and pre-selected regions of interest. This makes it easier to train machine learning models, and it also makes it a lot easier to test and compare algorithms with one another. But the question we wanted to explore now was: as real-world datasets from research or clinical practice are not generally as clean as challenge datasets, shouldn't we be cautious about the conclusions we draw from these challenge results?