To thoroughly assess individual studies, information was extracted from each study. Specifically, we extracted data concerning the cohorts used (including staining, number of patients, cancer stage distribution, and target distribution), the training setup (dataset splits, cross-validation, and involvement of pathologists), the DL pipeline (input size, pre-processing, model, post-processing, training task, and target), and the reported results (e.g., AUC or F1 score, if available). Additionally, we assessed for each study whether slide-level or patient-level results were reported, whether the published model is clinically approved, whether data, code, or model weights were available, and whether any information on inference time was given. Data were extracted by a single reviewer and verified by a second reviewer. Many studies report multiple results across several cohorts and in multiple configurations. To properly represent those studies in our analysis, the highest reported performance metrics were extracted from each paper unless specified otherwise in the respective section. For multimodal models that use data which would normally not be available for a pathology report, we report the results obtained with histopathology data alone. Papers covering more than one ICCR element are included multiple times in the analysis but evaluated specifically for the respective topic.
The search across databases yielded 4863 papers in total, of which 1636 were identified as duplicates. Additionally, 20 papers from citation search or additional manual search were included, yielding 3247 studies to be screened. In accordance with the exclusion criteria, 3066 studies were subsequently excluded, leaving 181 studies for full text screening. After full text screening, 66 studies were included in the final review (Fig. 2).
Although half of the ICCR CRC report elements are covered by the 66 selected studies, most publications concern only three topics. Specifically, 39/66 (59%) of studies reported prediction results on MMR status, 16/66 (24%) on BRAF V600E mutation, and 9/66 (14%) on pN. Coexistent pathologies (5/66), Grade (4/66), and TNM Stage (3/66) are each covered by five or fewer publications. Two studies each were included for tumor budding and perineural invasion, and one study each for pT and response to neoadjuvant therapy. Publications can cover multiple topics at once; for example, 14 publications cover both MMR status and BRAF V600E mutation. For the remaining elements of the ICCR CRC guidelines, the employed search and exclusion criteria did not yield any publications (Fig. 3A).
Following the general trend of increased attention and impactful contributions in CPath, we also observe an overall increase in publications over time, even when applying strict exclusion criteria (Fig. 3B). MSI and BRAF mutation prediction studies remain the most common over the last 4 years. Lymph node metastasis prediction studies were increasingly published starting in the second half of 2022. Overall, the topic variety among published studies also increased in the past 5 years, but without a substantial growth in publication volume on frequently studied ICCR elements since 2022.
Considering data, code, and model weight availability (Fig. 3C), more than half of the published papers (37/66, 56%) use some publicly available dataset or make their data available, and 31/66 (47%) use The Cancer Genome Atlas (TCGA) in some form. Code was available for fewer publications (25/66, 38%), and publications that made model weights available were scarce (8/66, 12%). It should be noted that resources available "upon reasonable request" were considered unavailable, although we did not verify whether access could have been granted in these cases.
This initial analysis has shown substantial variation in coverage across all reporting elements. In the next step, we will analyze published work related to the core elements of the ICCR CRC guideline.
Elements classified as "Core" within the ICCR CRC guidelines are considered essential for clinical management and staging of CRC patients. Excluding clinical and macroscopically retrieved information, the microscopic core elements comprise histological tumor type, tumor grade, extent of invasion (and pT), lymphatic and venous invasion, perineural invasion, lymph node status (and pN), tumor deposits, response to neoadjuvant therapy, margin status, histologically confirmed distant metastasis (and pM), the aggregation into pTNM, and confirmatory evaluation for neuroendocrine neoplasms. Of these twelve core elements, six have been addressed in 20 published articles, with pN (9/20) and tumor grade (4/20) receiving the most attention. 5/20 publications have used publicly available data or released their datasets, 6/20 made their code available, but none published model weights (Fig. 4). Next, we present the state of research on each individual core element in order of the number of included publications.
Regional lymph node (LN) status is crucial for determining adjuvant chemotherapy and cancer staging. The TNM classification of regional LNs (pN) includes: NX (nodes cannot be assessed), N0 (no metastasis), N1a (1 positive node), N1b (2-3 positive nodes), N1c (tumor deposits in soft tissue without nodal metastasis), N2a (4-6 positive nodes), and N2b (≥7 positive nodes). A LN is considered positive if it contains at least one metastatic lesion larger than 0.2 mm. Pathologists aim to evaluate at least twelve nodes per case, though this number can be influenced by factors such as specimen length, patient age, or neoadjuvant therapy. Smaller metastatic foci are classified as isolated tumor cells, and while their presence should be reported according to ICCR guidelines, they do not affect the lymph node status.
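To make the mapping from node counts to the categories above explicit, the following minimal sketch (our own illustration in Python; the function name and arguments are hypothetical, and N1c is handled only via an explicit tumor-deposit flag) shows how a per-case positive node count could be translated into a pN category:

```python
def pn_category(num_positive_nodes: int, has_tumor_deposits: bool = False,
                nodes_assessable: bool = True) -> str:
    """Map a positive lymph node count to a pN category.

    Hypothetical helper following the category boundaries quoted above;
    it assumes node counting and tumor-deposit assessment are already done.
    """
    if not nodes_assessable:
        return "NX"
    if num_positive_nodes == 0:
        # Tumor deposits without nodal metastasis are classified as N1c.
        return "N1c" if has_tumor_deposits else "N0"
    if num_positive_nodes == 1:
        return "N1a"
    if num_positive_nodes <= 3:
        return "N1b"
    if num_positive_nodes <= 6:
        return "N2a"
    return "N2b"
```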
Several approaches have been used to automatically screen LN slides for metastases, listed in Table 1. Most methods focus on metastasis detection or segmentation, or build on metastasis detection to further predict a positive or negative slide-level label. Model evaluation was performed on external cohorts in 30% of the studies and on internal cohorts in 70% of the studies. Clinical validation or pathologist involvement was reported in ~30% of studies.
Huang et al. used a human-in-the-loop AI system (nuclei.io) to personalize ML models for colorectal LN metastasis detection. The approach improved accuracy, F1 score, sensitivity, and especially detection of isolated tumor cells, while significantly reducing evaluation time for fellows and negative cases. Similarly, Kindler et al. trained a Deep Neural Network tool achieving high pixel-level accuracy. Clinical testing showed high sensitivity (0.990), strong interobserver agreement (κ = 0.94), and significantly reduced review times for pathologists. Furthermore, Khan et al. demonstrated 100% agreement between their ensemble model and expert pathologists in a study of 217 cases.
The reviewed studies show high sensitivity and specificity across the board for metastasis detection. However, the automation of pN staging requires the additional step of counting the number of positive lymph nodes across all slides of a case. This can be difficult due to tissue preparation: large lymph nodes may be cut in half and thus appear twice on the same slide or across multiple slides. Correctly re-identifying the same lymph node and ensuring that positive lymph nodes are not counted twice therefore remains an unsolved problem. One approach to addressing this issue could be a better link between macroscopic tissue preparation and AI-based image analysis. For example, lymph node placement information could additionally be stored, but consistent tissue inking may also be an avenue to solve the problem automatically.
Tumor grade describes the level of differentiation of tumor cells where a higher grade indicates a loss of glandular organization. In accordance with the WHO classification, the ICCR CRC guidelines classify the grade by the least differentiated component of the neoplastic lesion which then results in either low (previously low to moderately differentiated) or high (previously poorly-differentiated) tumor grade.
The retrieved studies use both private and public datasets, but only one study validates its results externally (Table 2). Given the subjectivity and high interobserver variability of tumor grade assessment, a consensus label based on multiple pathologists might be more reliable as ground truth. Some studies used two raters, with a third pathologist to resolve any conflicts. Yet, none of the retrieved studies reported agreement between pathologists and deep learning models. Most studies rely on tile classification to directly predict tumor grade. Rathore et al. use morphological and texture-based image features aggregated with a support vector machine classifier and majority voting, achieving performance comparable to tile classifiers on their dataset. For tile classification methods, the overall strategies include multiple instance learning for end-to-end training and supervised tile-based training. No study aggregates at the patient level, although clinical practice typically considers the highest grade across multiple slides. This is particularly relevant for TCGA-based studies, as WSIs were not selected for grading and may not be representative of the overall tumor grade.
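As a minimal sketch of the tile-classification-with-majority-voting strategy described above (our own illustration, not code from any of the cited studies; the function name and the two-class setup are assumptions), slide-level grading could look as follows:

```python
from collections import Counter
import numpy as np

def slide_grade_majority(tile_logits: np.ndarray,
                         class_names=("low grade", "high grade")) -> str:
    """Aggregate per-tile grade predictions into a slide-level grade.

    tile_logits: array of shape (n_tiles, n_classes) with classifier scores.
    The most frequently predicted class across tiles wins (majority voting).
    """
    tile_preds = tile_logits.argmax(axis=1)                 # per-tile class index
    most_common_class, _ = Counter(tile_preds).most_common(1)[0]
    return class_names[most_common_class]
```

A patient-level aggregation, as used in clinical practice, would then simply take the highest grade across all slides of a case.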
Data accessibility varies significantly across studies: Rathore et al. use the publicly available GLaS dataset, Chen et al. use TCGA-COAD and TCGA-READ, whereas Soldatov et al. only make their dataset available upon request. Beyond TCGA and challenge datasets, independent validation is limited due to restricted access to additional clinical data.
Reproducibility is another concern, as most studies do not share code or model weights; only Schrammen et al. have a publicly available repository. Despite these limitations, the availability of public datasets (e.g., GLaS) enables further research and tool development. It should also be noted that the publications included in this review follow the three-tiered grading system (well-, moderately-, and poorly-differentiated). This system can be directly mapped to the two-tier system, since according to the ICCR guidelines, low grade corresponds to well- and moderately-differentiated tumors, while high grade includes poorly differentiated cases. One study is the exception, as no such reclassification is possible there.
Pathological TNM staging is the combined assessment of pT, pN, and pM. The TNM stage directly informs treatment decisions, including whether a resection is necessary or whether additional adjuvant therapy needs to be administered. It remains the most important prognostic factor in CRC diagnosis, with 5-year survival exceeding 90% for Stage I colon and rectal cancer but only 11-15% for Stage IV.
While the TNM stage could be algorithmically constructed from the evaluated T, N, and M, all three retrieved approaches directly predict the TNM stage from the primary tumor, with only one publication also considering lymph node slides (Table 3). Two publications rely on TCGA COAD (and READ) for the evaluation, and Kumar et al. additionally include an internal cohort for training and validation. Two methods used graph convolution to spatially integrate tissue tile embeddings into slide-level predictions, whereas the third followed the more traditional route of tumor area detection, followed by tile classification and averaging based on morphological features of nuclei.
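To illustrate the idea of spatially integrating tile embeddings with graph convolutions, the following schematic sketch (a plain PyTorch mean-aggregation layer, not the architecture of the cited works; all names are hypothetical) propagates information between neighboring tiles before pooling to a slide-level prediction:

```python
import torch
import torch.nn as nn

class TileGraphLayer(nn.Module):
    """One graph-convolution step over spatially adjacent tile embeddings."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (n_tiles, dim) tile embeddings from a feature extractor
        # adj: (n_tiles, n_tiles) binary adjacency of neighboring tiles
        adj = adj.float() + torch.eye(adj.size(0))   # add self-loops
        deg = adj.sum(dim=1, keepdim=True)           # node degrees
        x = (adj @ x) / deg                          # average over neighbors
        return torch.relu(self.linear(x))

def slide_logits(x: torch.Tensor, adj: torch.Tensor,
                 layer: TileGraphLayer, head: nn.Linear) -> torch.Tensor:
    """Slide-level prediction: one graph convolution, then global mean pooling."""
    h = layer(x, adj)
    return head(h.mean(dim=0))                       # (n_classes,)
```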
Only Kumar et al. generate clinically actionable results, but their model addresses the specific task of differentiating metastatic from local colon cancer. While Pei et al. also show slide-level results, they neither evaluate them nor report a tile-aggregation strategy. Levy et al. only predict low (I, II) vs. high (III, IV) stage, but as their approach evaluates the slide for invasion depth, further development could lead to a comprehensive TNM analysis.
Taken together, none of the retrieved papers demonstrate an end-to-end pipeline that can reliably assess the TNM stage (I-IV) from histological slides. While there are cases where directly predicting the risk of metastasis from the primary might be justified (e.g., what is the risk of a missed metastasis), simpler approaches following the UICC TNM guidelines by aggregating information about local, regional and distant spread may be more clinically relevant for routine reporting.
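For context, the rule-based aggregation mentioned above can be very simple once pT, pN, and pM are known. The sketch below is a deliberately simplified illustration of UICC-style stage grouping (sub-stages such as IIIA-C and special cases are omitted; the mapping would need to be checked against the current UICC/AJCC edition before any use):

```python
def tnm_stage(t: int, n: int, m: int) -> str:
    """Simplified stage grouping from pT, pN, and pM (illustration only)."""
    if m >= 1:
        return "IV"            # any distant metastasis
    if n >= 1:
        return "III"           # regional nodal disease without distant spread
    if t in (1, 2):
        return "I"             # invasion limited to submucosa or muscularis propria
    if t in (3, 4):
        return "II"            # deeper local invasion, node negative
    return "unstageable"
```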
Perineural Invasion (PNI) is defined as tumor growth along a nerve, where the tumor surrounds at least one-third of the nerve's perimeter and invades any of its layers. PNI is reported as either present or absent, and its presence has been linked to poor prognosis, particularly in Stage II.
Both included studies employ a similar approach for detecting PNI (Table 4). First, nerve and tumor regions are segmented separately, and subsequently a boundary between the two regions is identified. While Jung et al. use a private dataset for training, Han et al. trained on the PAIP2021 dataset comprising 240 WSIs from multiple cancer types; the dataset was used for the PAIP2021 challenge on PNI detection.
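The shared "segment both tissue classes, then find their contact zone" step could be sketched as follows (our own simplified illustration using binary masks and morphological dilation; the function name, the pixel margin, and the use of SciPy are assumptions, and a real system would still need to quantify the fraction of the nerve perimeter involved):

```python
import numpy as np
from scipy import ndimage

def pni_candidate_region(nerve_mask: np.ndarray, tumor_mask: np.ndarray,
                         margin_px: int = 5) -> np.ndarray:
    """Return pixels where tumor lies within `margin_px` pixels of nerve tissue.

    Dilating the nerve mask and intersecting it with the tumor mask flags
    candidate perineural invasion regions for downstream analysis.
    """
    dilated_nerve = ndimage.binary_dilation(nerve_mask, iterations=margin_px)
    return np.logical_and(dilated_nerve, tumor_mask)
```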
Neither of the included studies provides the code or the weights of the DL algorithms that produced the reported results. Moreover, the validation cohorts are small compared to those of studies on other topics, e.g., the previously discussed lymph node metastasis detection. Despite this reproducibility gap, both methods report a binary output for perineural invasion in accordance with the ICCR CRC guidelines.
The extent of invasion of the primary tumor is the first part of the TNM staging and is reflected in pT. It is assessed by considering the different components of the colon wall and the presence of tumor within them. pT ranges from pT0 (no evidence of primary tumor) to pT4 (tumor perforates the visceral peritoneum or invades adjacent organs) and has a direct impact on the treatment of a patient.
Only one study that automates the prediction of pT in CRC passed the exclusion criteria (Table 5). In this study, the authors use two validation cohorts, one internal with 38 patients and one external with 42 patients. For classification into pT1-4, they first segment the tumor area, then use a patch-based classifier to predict pT for each tile, and finally choose the highest pT value as the slide-level prediction. The model reaches AUCs of up to 0.93 on the internal test set and 0.90 on the external validation cohort.
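The "highest tile-level pT wins" aggregation can be expressed in a few lines; the sketch below is our own illustration of that rule, not the authors' code, and the function name and class ordering are assumptions:

```python
import numpy as np

def slide_pt(tile_probs: np.ndarray) -> int:
    """Aggregate tile-wise pT predictions into a slide-level pT.

    tile_probs: (n_tiles, 4) softmax scores for the classes pT1..pT4.
    Each tile votes for its most probable class; the slide is assigned the
    highest pT predicted on any tile, mirroring the rule that the deepest
    invasion determines pT.
    """
    tile_classes = tile_probs.argmax(axis=1) + 1   # column index 0..3 -> pT1..pT4
    return int(tile_classes.max())
```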
Yet, the validation sets include only a few patients, and data, code, and weights are not publicly available. The authors do show results with explicit cutoffs, but only in 1-vs-all settings without a multiclass evaluation. Therefore, more work is needed to develop robust approaches for automatic pT assessment with larger validation cohorts and to investigate the challenges of real-world datasets.
Tumor regression gradings (TRGs) are used to assess the response to neoadjuvant chemoradiotherapy (nCRT) by measuring the extent of tissue changes caused by the treatment. This is a common assessment step in locally advanced rectal cancer treated with neoadjuvant therapies. Multiple TRGs exist but all rely on the assessment of tumor to fibrosis ratio within the tumor bed.
We retrieved only one study that predicted regression after nCRT in a fully automated way (Table 6). The authors used an MIL approach with gated attention weight normalization and a final bilinear attention for multi-scale feature fusion to classify responders versus non-responders based on slide-level labels. However, they validated the model on a breast cancer metastasis dataset rather than rectal cancer, which reduces the value of the validation.
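Gated attention pooling, the core of such MIL approaches, can be sketched compactly (a schematic stand-in in the spirit of Ilse et al., 2018, not the reviewed model; dimensions and names are assumed, and the bilinear multi-scale fusion step is omitted):

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Gated attention pooling over the tile embeddings of one slide."""
    def __init__(self, dim: int = 512, hidden: int = 128, n_classes: int = 2):
        super().__init__()
        self.V = nn.Linear(dim, hidden)      # "content" branch
        self.U = nn.Linear(dim, hidden)      # multiplicative gate
        self.w = nn.Linear(hidden, 1)        # attention score per tile
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (n_tiles, dim) embeddings of one slide
        scores = self.w(torch.tanh(self.V(tiles)) * torch.sigmoid(self.U(tiles)))
        attn = torch.softmax(scores, dim=0)              # (n_tiles, 1), sums to 1
        slide_embedding = (attn * tiles).sum(dim=0)      # attention-weighted mean
        return self.classifier(slide_embedding)          # slide-level logits
```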
There exist many different systems for TRG, and in the included study, the AJCC system was used. However, the four AJCC TRG categories were merged into a binary classification task which does not correspond to the ICCR CRC guidelines. The reviewed study trained for slide-level predictions but does not discuss aggregation strategies for patient-level predictions. However, this post-processing aggregation step is needed for successful clinical deployment, as pathologists report a patient-level TRG.
Given the challenges posed by the differing TRG definitions, we did not find any study that presents a clinically ready algorithm to assess tumor regression in rectal cancer. However, a tissue-type segmentation model fine-tuned to predict remaining tumor and fibrosis could, in principle, replicate every TRG scheme.
The non-core elements of the ICCR CRC dataset assess additional features that provide valuable prognostic insights and are clinically relevant but are not frequently used in patient management or not yet validated. They include measurement beyond the muscularis propria, MLH1 promoter methylation, tumor budding, coexistent pathology, MMR status, and BRAF V600E mutation. Of these six elements, only four have been addressed in the published studies included in this review. MMR status and BRAF mutation, in particular, have received significant attention, with most MMR-related (28/39) and BRAF-related (11/16) studies using TCGA as their primary data source. For these elements, code to reproduce results is often available (21/48, 44%), and some studies provide access to trained model weights (8/48, 17%). Interestingly, since similar methods are applied for MMR- and BRAF-status prediction, 14 studies investigate both biomarkers simultaneously. In contrast, tumor budding and coexistent pathology remain comparatively unexplored and overall lack publicly available data, code, or access to weights (Fig. 5). In the following section, we examine the current research landscape for each of the non-core elements.
The DNA mismatch repair (MMR) system is responsible for correcting DNA replication mismatches. Mutation of its major genes and the resulting instability of microsatellite repeat sequences is referred to as microsatellite instability (MSI). MSI is a biomarker in CRC, guiding clinical decisions for intermediate-risk Stage II and Stage IV CRCs, and serves as an indicator for screening for Lynch syndrome, a hereditary condition characterized by mismatch repair gene mutations. Routine diagnostic testing is typically performed using immunohistochemistry (IHC), and both ASCO and ESMO guidelines mandate dMMR testing for all new CRC diagnoses. Recent DL models have demonstrated that they can predict MSI status directly from H&E slides without the need for IHC.
Common datasets for evaluation are the publicly available TCGA COAD/READ cohorts (80.4%), the PAIP2020 challenge dataset (26.3%), and DACHS (21.1%) (Table 7). Of note, different publications use different subsets of TCGA COAD/READ; thus, although results were generated on some of the same slides, the metrics cannot be directly compared. At the same time, many papers used only one (33.3%) or two (35.3%) cohorts in their analysis, and the remainder used more (29.4%).
There are mixed approaches to ground truth, as several different methods are clinically appropriate for MSI/dMMR testing: 33.3% of studies relied on both IHC and PCR for the same or different datasets, 49.0% relied on PCR alone, 11.8% on IHC alone, and 3.9% on other methods. None of the papers using IHC as the gold standard report whether these were single-observer results, even though there is some disagreement in manual IHC dMMR protein analysis.
Most MSI prediction models rely on tile-level embeddings that are aggregated into a slide-level score. Before that step, many approaches pre-detect the tumor area (43.6%) and then usually add an average (41.1%), majority (23.5%), or top-k (17.6%) aggregation on top. Models that did not pre-segment the tumor tissue (56.4%) commonly relied on an attention mechanism (50%) to weight tile results for the final slide-level MSI score.
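The common aggregation schemes can be summarized in a single helper; the sketch below is illustrative only (function and strategy names are our own), with `tile_scores` standing for per-tile MSI probabilities:

```python
import numpy as np

def aggregate_tile_scores(tile_scores: np.ndarray, strategy: str = "mean",
                          k: int = 10, threshold: float = 0.5) -> float:
    """Turn per-tile MSI probabilities into one slide-level score."""
    if strategy == "mean":                      # average of all tile scores
        return float(tile_scores.mean())
    if strategy == "majority":                  # fraction of tiles called MSI
        return float((tile_scores > threshold).mean())
    if strategy == "top_k":                     # mean of the k highest-scoring tiles
        return float(np.sort(tile_scores)[-k:].mean())
    raise ValueError(f"unknown strategy: {strategy}")
```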
Foundation models have included MSI prediction in CRC on datasets such as TCGA-CRC-DX as a benchmark task, which will lead to more results on this topic, though the limited scope of these evaluations may not yield additional insights into clinical translatability.
For MSI/dMMR prediction, the DL model either needs to predict MSI at least as well as a diagnostic test (e.g., IHC) or needs to be used as a screening tool. In both cases, a cutoff needs to be defined. In this review, 47.1% of papers defined some cutoff on at least a subset of the validation data, but only 13.7% demonstrated a cutoff viable for screening (e.g., at least 90% sensitivity) in a clinical setting. One method is clinically approved as a pre-screening tool with CE-IVD certification, yet so far, no tool reaches the same specificity as IHC or PCR testing while maintaining 95% sensitivity (Saillard et al.: 46% specificity at 98% sensitivity; Wagner et al.: 61% specificity at 95% sensitivity).
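Choosing such a screening cutoff amounts to fixing an operating point that guarantees the target sensitivity on a tuning cohort; a minimal sketch (our own illustration, assuming slide-level scores and binary MSI labels; the threshold must of course be fixed before any test-set evaluation) could look like this:

```python
import numpy as np

def screening_cutoff(scores: np.ndarray, labels: np.ndarray,
                     min_sensitivity: float = 0.95) -> float:
    """Highest score threshold that still reaches the target sensitivity.

    scores: slide-level MSI probabilities; labels: 1 = MSI, 0 = MSS.
    Slides with score >= cutoff would be flagged for confirmatory testing.
    """
    positives = np.sort(scores[labels == 1])                   # MSI-case scores
    allowed_misses = int(np.floor((1.0 - min_sensitivity) * len(positives)))
    return float(positives[allowed_misses])                    # keeps enough positives above cutoff
```

The specificity reached at this cutoff then determines how many gold-standard tests the screening step actually saves.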
The current literature therefore indicates that MSI prediction is the most advanced category, yet the tools can only be employed for screening, with gold-standard testing necessarily following, as their specificity is comparatively low. Consequently, either DL models must be further improved to reach sensitivity and specificity similar to gold-standard testing, or this step will remain deferred to a non-H&E evaluation.
BRAF mutation testing plays an important role in Lynch syndrome identification after MSI testing. Moreover, microsatellite stable BRAF mutant colorectal cancers are particularly aggressive, and the mutation also reduces responsiveness to EGFR inhibitor therapy. The non-core element in the CRC ICCR guidelines specifically concerns the oncogenic V600E mutation, which represents ~90% of BRAF mutant cases in a large trial. Notably, BRAF-mutant and MSI cases share similarities in both morphological features and epidemiological patterns.
As there is considerable overlap in setup, task, and patients between MSI and BRAF mutation prediction, publications frequently predict both (Table 8). Consequently, different subsets of the TCGA cohort are also the most prevalent datasets (75%), followed by the DACHS cohort (31.3%).
Most approaches apply stain normalization (68.8%) and tile extraction (100%) as preprocessing steps. Many workflows (31.3%) subsequently identify tumor tissue using a tissue classification algorithm. The majority of publications implement MIL-based approaches (81.3%), differing in their tile aggregation approaches (averaging: 23.1%, top k tiles: 30.7%, other: 30.9%). These MIL-based approaches are commonly built on either CNN-based (31.3%) or attention-based (31.3%) architectures. Recent publications leverage foundation models (18.8%) with either averaging-based or attention-based aggregation strategies.
Only 43.8% of publications conducted an external validation of their algorithm, while 56.3% utilized more than one cohort. Openly available datasets were included in 68.8% of all studies. Code is available for 62% of all publications, and model weights are only available for 31.25% of studies. Furthermore, most studies (87.5%) did not differentiate between the V600E mutation and all other BRAF mutations. In contrast, Fuji et al. utilized a V600E-specific cohort, whereas Tsai et al. reported both V600E-specific and unspecific results.
Only four studies reported actionable results, meaning that a label (BRAF mutant vs. wild type) rather than a mutation probability was assigned to each WSI. Only two studies report prediction sensitivities higher than 90%, and of those, only one reached an accuracy higher than 0.5.
Of all 16 included publications, only Guo et al. and Schrammen et al. report fully integrated pipelines with results based on a clinically appropriate cutoff (sensitivity >90%). However, Guo et al. neither shared code nor weights, and Schrammen et al. did not use any openly available datasets on which results could be reproduced.
BRAF prediction therefore remains unsolved and currently mostly serves as a validation task for multi-purpose models. Almost all (88%) studies reporting BRAF status report MSI status as well. As the two features are highly correlated, models are expected to perform similarly on both tasks. However, there exist non-correlating cases, which are especially interesting to pathologists. None of the studies investigated performance in such cases, although previous work has demonstrated a confounding effect between the biomarkers with an impact on predictive performance. Furthermore, most studies do not utilize a clinically appropriate cutoff, making it difficult to evaluate their performance in a clinical setting.
Under coexistent pathology, all other identified lesions should be listed. These can be additional synchronous carcinomas, polyps, or findings subsumed under other lesions, e.g., IBD or dysplasia in otherwise normal tissue. Published work usually focuses on precursor lesions or the correct classification of polyps, but such models could be applied, for example, to correctly classify additional polyps in the same case. Additionally, because of the diversity of possible findings within this section, the models developed so far only focus on one of the items that should be reported.
The reviewed studies utilized datasets derived from biopsies or polyp resections rather than oncologic resection specimens (Table 9). Across studies, there is a lack of access to the development datasets.
It is important to highlight that the same task has different ground truth definitions across studies. While only two studies address dysplasia grading, they incorporate different classes: Perlo et al. add Hyperplastic Polyp (HP) to their dysplasia categories, which is not used by Neto et al. Similar observations apply to polyp classification.
Slide-level classification is achieved via multiple instance learning by Neto et al., while Perlo et al. aggregate tile classification results from a CNN and Wei et al. define class-specific thresholds to categorize whole slide images (WSIs).
Variability in ground truth definitions limits model comparability and clinical integration. Differences in annotation standards hinder consistent evaluation, while a lack of publicly available datasets and independent validation cohorts restricts reproducibility. Only Neto et al. provide accessible validation data, and broader external validation remains scarce. Existing studies focus on polypectomy specimens rather than oncologic resection samples, since they are designed for polyp classification in the colorectal cancer screening workflow. While existing models can assess isolated polyps, their performance on primary resection slides, which encompass all layers of the colon wall and thereby tissue types unseen during training, is uncertain.
Furthermore, several coexistent pathologies remain unexplored. For instance, no studies were identified that automatically evaluate inflammatory bowel disease (IBD) using H&E-stained slides. While Rymarczyk et al. present a model to score Crohn's disease and ulcerative colitis, it relies on a prior classification into one of these two categories, highlighting a lack of methods for fully automatic reporting of this section of the guidelines.
Tumor budding (TB) is defined as the presence of small clusters of up to four cancer cells at the invasive front of the tumor in a tissue section. TB assessment in cancer follows the standardized guidelines of the International Tumor Budding Consensus Conference (ITBCC) and is performed in hotspot regions (0.785 mm²) of H&E-stained tissue sections. This scoring has clinical implications, particularly for pT1 and Stage II CRC patients, as high TB is associated with poor prognosis. Identifying and counting TBs on H&E slides is time-consuming and prone to inter-observer variability, making it an ideal task to be automated by DL models.
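For orientation, the 0.785 mm² hotspot corresponds to a circular field of roughly 1 mm diameter (π · 0.5² mm² ≈ 0.785 mm²), which can be converted into pixel coordinates once the scan resolution is known; the snippet below is a small illustrative helper (the 0.5 microns-per-pixel value is an assumed example):

```python
import math

def hotspot_radius_px(hotspot_area_mm2: float = 0.785, mpp: float = 0.5) -> int:
    """Convert the ITBCC hotspot area into a circle radius in pixels.

    mpp is the scan resolution in microns per pixel.
    """
    radius_mm = math.sqrt(hotspot_area_mm2 / math.pi)    # ~0.5 mm
    radius_um = radius_mm * 1000.0
    return int(round(radius_um / mpp))                   # ~1000 px at 0.5 mpp
```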
Most published studies on AI-assisted TB assessment rely on IHC staining and on patch-level, single-center training and validation. After applying the exclusion criteria, two studies following the standardized ITBCC assessment pipeline remained (Table 10).
The study by Lu et al. uses a Faster R-CNN for bud detection. Ground truth was established manually on H&E tissue slides, and the model's evaluation time was significantly faster than manual assessment.
On the other hand, Bokhorst et al. created a pipeline integrating a U-Net model for tissue segmentation and HoVer-Net for nuclei detection at the invasive front. The tumor bulk and invasive front were identified using a convex-hull algorithm. Their strategy was validated on data from four different medical centers.
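The convex-hull step for delineating the tumor bulk can be illustrated with a few lines of SciPy (a schematic version of the idea, not the published pipeline; the function name and the use of nucleus centroids as input are assumptions):

```python
import numpy as np
from scipy.spatial import ConvexHull

def tumor_bulk_hull(tumor_nuclei_xy: np.ndarray) -> np.ndarray:
    """Approximate the tumor bulk as the convex hull of detected tumor nuclei.

    tumor_nuclei_xy: (n, 2) array of nucleus centroids, e.g., from a
    nuclei-detection model. The returned (m, 2) polygon can be offset inwards
    to define an invasive-front band in which buds are counted.
    """
    hull = ConvexHull(tumor_nuclei_xy)
    return tumor_nuclei_xy[hull.vertices]    # hull polygon vertices in order
```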
Both models present different approaches to TB detection and counting in H&E WSIs that seem promising for reducing assessment time and inter-observer variability among pathologists. However, a number of limitations must be addressed before such models can be deployed in the diagnostic routine, including small validation datasets, overestimation of TB counts (especially in necrotic areas), and an inability to distinguish pseudo-budding.