The need for QC of hippocampus imaging
To advance hippocampal research, reliable and valid imaging techniques are essential. Yet clear guidelines for quality control (QC) of hippocampal subfield segmentation have long been absent from the literature. Recently, the Hippocampal Subfields Group (HSG) addressed this gap by publishing a report (available here) that gathered insights from 37 researchers across 10 countries on best practices for QC in hippocampal imaging.
Survey respondents were also asked how important they believe QC to be, and all acknowledged its importance. Despite this, only 46% reported detailing their QC procedures in prior publications, highlighting a mismatch between the perceived importance of QC and the subsequent reporting of the steps taken to verify that images met sufficient standards.
We believe consistent application of boundary definitions is fundamental for reliable hippocampal measurements. Inconsistent QC practices can obscure real differences between groups, particularly in populations where MRI scans tend to be of lower quality due to clinical or age-related characteristics that affect acquisition. For a lengthier discussion we refer to the paper; below, we summarize key take-home messages about the impact of QC and the best practices discussed in the paper. The guide provides practical steps at each stage, from data acquisition to final publication. Note that these guidelines are designed specifically for hippocampal subfield measurements derived from high-resolution, oblique coronal T2-weighted scans acquired approximately orthogonal to the long axis of the hippocampus, which we believe are a prerequisite for segmentation at the level of subfields.
QC of Hippocampal Subfield Segmentation – Overview
Table 1 below summarizes the recommended best practices, from assessment of overall image quality to final data screening. The table highlights recommendations not only on QC itself but also on what to report in the final publication, as we believe greater transparency about the specific approach taken is necessary for reproducibility.
Each QC step in more detail
For each step of the QC procedure, we here provide more detailed information on common errors, how to identify and rate them, and the corrections considered best practice.
QC at acquisition
Acquisition is the first critical step for ensuring high-quality data. MRI scans are susceptible to several artifacts during acquisition, caused by factors such as participant movement, metal implants, magnetic field inhomogeneities related to head anatomy, and occasional mechanical issues in the gradient coils. Whether by adjusting acquisition parameters or by reviewing the sequence output at the scanner, it is essential to confirm that the data meet the conditions necessary for accurate hippocampal boundary tracing.
In our view, high-resolution T2-weighted images are a prerequisite for hippocampal subfield data; we recommend an in-plane resolution of at most 0.5 x 0.5 mm in the coronal view and a slice thickness of at most 2.5 mm. Detailed acquisition guidelines are available from our Acquisition Working Group (available here) and have been discussed in a previous publication by the community (available here).
To mitigate potential data loss from artifacts, which are especially common when acquiring data in special populations such as young children or individuals with late-stage dementia, we recommend planning for the additional time needed to acquire repeated scans whenever feasible. Additionally, if possible, reviewing image quality during the acquisition session can help determine whether further scans are needed.
Image quality and landmark visibility
There are two main considerations when checking the quality of images: judging overall image quality and checking landmark visibility. Although integrally linked and often checked in tandem, each is exemplified and discussed in turn below, including a brief survey of the approaches commonly used to rate and judge these parameters.
Image quality
MRI images are prone to several artifacts, and these become increasingly common in special populations. Artifacts not only affect the quality of the data but, when located in the temporal lobes, are likely to affect the quality of your segmentations. Deciding on a minimum quality for inclusion is an important consideration that should be settled before analysis commences.
The most common artifacts to look out for are banding, ringing, and motion artifacts. Banding artifacts are caused by inhomogeneities in the magnetic field and produce dark bands covering portions of the image. Ringing artifacts are reconstruction errors that produce alternating dark and bright rings in the image. Motion artifacts are caused by movement during scanning and result in ghosting overlays or diffusely noisy images. The majority of our survey respondents highlighted motion artifacts as the most common cause of exclusion.
Rating image quality
It is common, and recommended, to calculate and report a combination of signal-to-noise ratio (SNR) and contrast-to-noise ratio (CNR) to give an overall indication of image quality. SNR compares the mean signal of a region of interest to background noise (often the standard deviation of samples in air). CNR compares the mean signal of a region of interest to that of a reference region, in proportion to background noise, where again the standard deviation of background noise is commonly used.
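As a rough illustration of these definitions, the sketch below computes SNR and CNR with numpy and nibabel; the file names and the ROI, reference, and background masks are hypothetical placeholders, and noise is estimated as the standard deviation of voxels sampled in air.

```python
import nibabel as nib
import numpy as np

# Hypothetical inputs: a T2-weighted scan plus binary masks for the region of
# interest, a reference region, and a background (air) sample.
img = nib.load("sub-01_T2w.nii.gz").get_fdata()
roi = nib.load("sub-01_hippocampus_mask.nii.gz").get_fdata() > 0
ref = nib.load("sub-01_reference_mask.nii.gz").get_fdata() > 0
air = nib.load("sub-01_background_mask.nii.gz").get_fdata() > 0

noise_sd = img[air].std()                             # noise: SD of air voxels
snr = img[roi].mean() / noise_sd                      # mean ROI signal vs. noise
cnr = (img[roi].mean() - img[ref].mean()) / noise_sd  # ROI vs. reference contrast

print(f"SNR = {snr:.1f}, CNR = {cnr:.1f}")
```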
In addition to these quantitative measures, qualitative review of artifacts affecting image quality is recommended, as exemplified in the figure above. The intention is to systematically categorize images according to defined operational criteria for the severity of different artifacts.
These are subjective judgements and should therefore be quantified with reliability estimates across one or several independent raters (e.g., between- or within-rater kappa statistics, where a minimum of 0.75 indicates a strong level of agreement). The exact format of the rating scale varies but typically follows either a binary pass/fail format or a multilevel ordinal scale that allows borderline cases to be flagged for more detailed investigation.
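As an example of how such agreement can be estimated, the sketch below computes Cohen's kappa between two raters using scikit-learn; the rating lists are made up purely for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative quality ratings from two independent raters on the same ten scans
rater_a = ["pass", "pass", "fail", "pass", "check", "pass", "fail", "pass", "pass", "check"]
rater_b = ["pass", "pass", "fail", "check", "check", "pass", "fail", "pass", "fail", "check"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # >= 0.75 would indicate strong agreement
```

For ordinal scales with more levels, a weighted kappa (e.g., quadratic weights on numerically coded ratings) may be more appropriate.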
Finally, there are some automatic QC methods, first and foremost useful in larger datasets where manual QC would be infeasible. MRIQC (Ding et al., 2019) is one such option.
Landmark visibility
Overall image quality is a good indication of potential issues with visibility. However, for accurate hippocampal subfield segmentation, the most important point is that specific landmarks (e.g., uncus, alveus, SRLM, fimbria) are visible. Both manual tracers and automated algorithms can tolerate moderate to heavy artifacts on individual slices, but segmentation quality will decrease as artifacts cover a larger portion of slices.
Similar to image quality checking, rating scales should be applied to landmark visibility as well. Scans can, for example, be identified as “Pass/Clearly Visible,” “Check/Somewhat Visible,” or “Fail/Not Visible,” allowing one to determine whether exclusions are necessary.
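One simple way to keep such ratings auditable is a tabular log per scan and landmark; the sketch below uses pandas, with scan IDs, landmark names, and ratings that are purely illustrative.

```python
import pandas as pd

# Hypothetical visibility ratings, one row per scan and landmark
ratings = pd.DataFrame(
    [
        {"scan": "sub-01", "landmark": "SRLM", "rating": "pass"},
        {"scan": "sub-01", "landmark": "fimbria/alveus", "rating": "check"},
        {"scan": "sub-02", "landmark": "SRLM", "rating": "fail"},
    ]
)

# "fail" scans need an exclusion decision; "check" scans get a second, closer look
needs_exclusion_decision = ratings.loc[ratings["rating"] == "fail", "scan"].unique()
needs_second_look = ratings.loc[ratings["rating"] == "check", "scan"].unique()
print(needs_exclusion_decision, needs_second_look)
```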
Listed below are key landmarks used across commonly used protocols, along with examples. Other landmarks that will not be illustrated here are also sometimes used (e.g., fornix, ambient cistern, endfolial pathway).
Inferior and Superior Colliculi
These structures, also known as the lamina quadrigemina, mark the boundary between the hippocampal body and tail in many protocols, including the HSG protocol. Due to their size, they are rarely obscured by artifacts, but it remains essential to verify their visibility. For other landmarks, a specific visibility rating may be needed to track and document reliability and confidence in boundary placement in more detail; here, it suffices to ensure that image quality allows confident placement of the boundary between body and tail.
Stratum Radiatum Lacunosum Moleculare (SRLM)
The SRLM is a C-shaped white matter layer appearing as a dark band within the hippocampus on coronal T2 imaging, serving as a critical landmark for subfield segmentation by defining the boundaries between the dentate gyrus and surrounding subfields.
Reduced SRLM visibility on individual slices can sometimes be compensated by inferring its position from adjacent slices. However, this method is limited when SRLM is obscured across multiple consecutive slices, especially with large slice thickness that causes significant morphological changes between slices, hindering inference.
In the “Pass” example of the figure below, the SRLM is clearly visible across both the head and body of the hippocampus. In the “Check” example, the SRLM is slightly less visible but may be sufficiently visible across multiple slices, which needs to be investigated further. In the “Fail” example, the SRLM is too obscured within the particular slice, severely reducing confidence in border placement, and the chances of adjacent slices having sufficient visibility are low.
Fimbria and Alveus
The fimbria and alveus are interconnected white matter structures bordering the hippocampus. In T2-weighted scans, they appear as dark bands that form the medial and superior boundaries of the hippocampus. These structures are often affected by partial voluming with nearby gray matter and CSF, resulting in poor visibility. However, as with the SRLM, reliable placement of the superior and medial borders requires them to be at least partially visible across slices.
Uncus
The uncus varies in its exact placement and is surrounded by a combination of tissue types, making it prone to partial voluming effects that obscure it. Visibility issues are less common here than for the SRLM or fimbria/alveus, but in the HSG protocol and many others the uncus is necessary for identifying the transition between hippocampal head and body.
Segmentation Error Identification
Quality control (QC) for both automated and manual segmentations requires a thorough understanding of the relevant anatomy and the specific atlas or protocol used to define hippocampal subfield boundaries. In manual segmentation, researchers are responsible for accurately tracing subfield boundaries, with ongoing review both during and after the segmentation process. In contrast, automated segmentation relies entirely on post hoc inspection of the output to ensure quality.
Our survey respondents identified several common segmentation errors, each resulting in overestimation or underestimation of subfield or whole-hippocampus volumes. The full list of possible segmentation errors is too long to include here, but the most commonly identified were:
- Inclusion of the choroid plexus or cerebrospinal fluid
- Overestimation due to partial volume effects
- Voxels incorrectly extending into the fimbria
- Isolated groups of segmented pixels disconnected from the hippocampus
- Misplaced internal boundaries
- Underestimation of subfield labels
Rating segmentation errors
Segmentation results are never perfect. Similar to the QC of image quality and landmark visibility, a useful approach is to apply a severity rating system for errors, with the goal of identifying cases that pose significant threats to validity. It is advisable to have multiple independent raters assess segmentation errors and to calculate inter- or intra-rater reliability to ensure consistency in error identification.
For large datasets (e.g., the Alzheimer’s Disease Neuroimaging Initiative), conducting full QC of all segmentations may be impractical. In such cases, a random sampling approach can be used to provide a general indication of data quality.
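A minimal sketch of such a sampling scheme, assuming a list of subject IDs and using a fixed random seed so that the QC subsample is reproducible:

```python
import random

# Hypothetical list of processed subject IDs
subjects = [f"sub-{i:04d}" for i in range(1, 1201)]

random.seed(42)                             # fixed seed for a reproducible QC subsample
qc_sample = random.sample(subjects, k=100)  # e.g., review ~8% of segmentations in detail
```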
Some automated software packages export snapshots of segmented slices by default, which can be used in QC procedures. For example, Automatic Segmentation of Hippocampal Subfields (ASHS; Yushkevich et al., 2015) automatically generates mosaic screenshots consisting of excerpted coronal and sagittal slices. These provide an opportunity to check a sample of slices for potential segmentation errors, followed by a more thorough slice-by-slice review of cases with identified errors.
Manual evaluation of segmentation quality is subjective and thus requires standardized methods to ensure consistent decisions. Examples of manual evaluation from Wisse (LINK YOUTUBE) demonstrate that investigators can use different criteria yet still implement the recommended best practices for identifying segmentation errors.
Resegmentation of MR Images
When selecting a segmentation atlas, ensure it has been validated in samples similar to the dataset being processed. Automated segmentation methods are typically validated against manual segmentations of specific samples, but their performance can vary with new datasets. Poor segmentation quality may indicate an unsuitable atlas.
Even with a suitable atlas, errors can occur. Automated resegmentation using a different atlas often improves data retention and reduces manual corrections. If this is not feasible or unsuccessful, adjust software-specific parameters and resegment using the original atlas.
We recommend considering resegmentation if errors are present in over 40% of the dataset, if severe errors affect more than 40% of slices, or if errors correlate systematically with a key variable. After resegmentation, repeat QC to assess error severity and determine the need for manual correction or data exclusion.
Manual Correction of Automated Segmentation
Given rater expertise and available time, manual correction can address errors in hippocampal subfield segmentation. Selective correction of severe issues, such as obvious mislabels, allows for resolving significant errors while maintaining the efficiency of automated segmentation in large datasets. Severity ratings from prior QC steps should guide these interventions, and corrections must align with the atlas used to ensure consistency, requiring raters to have atlas-specific expertise.
Over-correction should be avoided to preserve the uniformity inherent to automated segmentation, as manual edits can introduce variability and bias. To minimize human error, establish inter- or intra-rater reliability before correcting the full dataset. Reliability estimates should focus on volume measures (ICC > .85; Koo & Li, 2016) and spatial overlap (DSC > .70; Zijdenbos et al., 1994). These can be assessed either between multiple raters or by the same rater after a delay, using a subset of scans with errors to confirm consistency in correction decisions.
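Spatial overlap between the original and corrected (or repeated) segmentations can be summarized with the Dice similarity coefficient; below is a minimal numpy sketch assuming two binary label volumes of identical shape (volume-based ICCs can be computed with standard statistical packages).

```python
import numpy as np

def dice_coefficient(label_a: np.ndarray, label_b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary segmentation masks."""
    a = label_a.astype(bool)
    b = label_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# Example (hypothetical masks):
# dsc = dice_coefficient(original_ca1_mask, corrected_ca1_mask)
# dsc > 0.70 would meet the recommended spatial-overlap threshold
```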
Maintaining external validity requires retaining as many cases as possible. However, segmentation errors that cannot be resolved through resegmentation or manual correction may necessitate excluding cases where errors compromise measurement validity.
Most survey respondents reported excluding severe errors while retaining small or moderate ones. QC often varies by subregion or hemisphere, allowing selective exclusion. For example, the left hemisphere might be excluded while retaining the right, or anterior hippocampal segments kept while posterior ones are excluded.
Exclusions reduce statistical power, so decisions about parameter estimation should consider:
- The planned statistical method.
- The extent and randomness of missing data.
- Ensuring data loss is unrelated to key demographic or study variables.
After exclusion, assess whether the remaining sample still represents the target population. Listwise or pairwise deletion yields unbiased estimates only when data are missing completely at random. Methods like multiple imputation or latent variable modeling can analyze incomplete data but also rely on assumptions about the randomness of exclusions.
Formal tests, such as Little's chi-square test, can assess whether data are missing completely at random. However, severe cases often fail QC, making this assumption of randomness difficult to meet. Therefore, resegmentation and manual correction are crucial for retaining as much data as possible. A general guideline is to limit data loss from QC to no more than 40%.
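While a formal missingness test may require specialized tooling, a complementary check is to compare excluded and retained cases on key study variables; the sketch below uses scipy and pandas and assumes a hypothetical participants table with a boolean excluded flag.

```python
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency

# Hypothetical table: one row per participant, a boolean "excluded" flag from QC,
# and key study variables such as age and diagnostic group
df = pd.read_csv("participants.csv")
df["excluded"] = df["excluded"].astype(bool)

# Continuous variable: does age differ between excluded and retained participants?
t_stat, p_age = ttest_ind(df.loc[df["excluded"], "age"],
                          df.loc[~df["excluded"], "age"])

# Categorical variable: is exclusion associated with diagnostic group?
chi2_stat, p_group, _, _ = chi2_contingency(pd.crosstab(df["excluded"], df["group"]))

print(f"age: p = {p_age:.3f}; diagnostic group: p = {p_group:.3f}")
```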
Data Screening
Statistical data screening before hypothesis testing ensures the accuracy of data, verifies compliance with statistical assumptions, and supports best practices. Screening measurement values provides a secondary check on previous QC procedures, which should have already addressed most errors.
A common first step is to inspect univariate descriptive statistics. This includes checking for out-of-range values, ensuring plausible means and standard deviations, and identifying potential outliers. Investigators can also examine interhemispheric correlations of regional measures, expecting high consistency in most populations.
Outliers are a key focus of QC. They can be identified using z-scores exceeding |3.29|, flagged, and reviewed during analysis. Decisions to remove outliers should account for relevant factors, such as severe neurodegeneration, that may explain the deviation.
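The sketch below illustrates this screening step with pandas, assuming a hypothetical table of left and right CA1 volumes (column names are placeholders): volumes are z-scored, values beyond |3.29| are flagged rather than removed, and the interhemispheric correlation is checked.

```python
import pandas as pd

# Hypothetical table of CA1 volumes (mm^3): columns participant, ca1_left, ca1_right
df = pd.read_csv("ca1_volumes.csv")

for col in ["ca1_left", "ca1_right"]:
    z = (df[col] - df[col].mean()) / df[col].std()
    df[f"{col}_outlier"] = z.abs() > 3.29    # flag for review, do not delete outright

print(df.filter(like="_outlier").sum())      # number of flagged values per hemisphere
print(df["ca1_left"].corr(df["ca1_right"]))  # interhemispheric correlation, expected high
```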
What to report in publications
Image quality
We recommend reporting the approach taken to assess image quality. This includes the criteria used for exclusion, such as whether a binary or multilevel rating scale was used, as well as reliability statistics for the raters reviewing the images.
If images are excluded at this point, we recommend reporting the number of scans excluded, as well as any covariates or variables of interest that correlate with exclusion.
Segmentation quality
Broadly speaking, we recommend clearly describing in the methods section how segmentation labels were reviewed for errors. For manual segmentation, we recommend reporting whether segmentations were independently reviewed and reporting rater reliability with the manual segmentation protocol (e.g., intra-class correlation or kappa statistics). For automated segmentation, we recommend reporting the approach taken to review segmentation quality, including the number of reviewers (and their inter-rater reliability) as well as any specific criteria applied (e.g., severe errors across at least three consecutive slices).
Segmentation correction and resegmentation
If segmentation correction is applied, we recommend reporting agreement between the raters performing the corrections (or across repetitions by the same rater). Specifically, we recommend reporting agreement in volume measures (ICC > .85) and spatial overlap (DSC > .70). We also recommend reporting the number or proportion of cases that were corrected.
When resegmentation is applied, we recommend reporting how many cases were resegmented and whether any new parameters or atlases were used in the resegmentation.
Data Exclusion
For excluded data, we recommend reporting the number or percentage of cases excluded. We also recommend reporting the number of raters and the procedure used, as well as the reliability estimates between and/or within those raters (inter- and/or intra-rater reliability). If exclusion coincides with any covariates or variables of interest, these should be reported. A Little's chi-square test assessing missingness should also be performed and reported.
Data Screening
Univariate outlier detection should be performed and reported. Beyond this, a multivariate outlier approach can be applied and reported (e.g., Mahalanobis distance). We also recommend demonstrating high correlation between hemispheres (unless specific attributes of the studied population preclude this, such as unilateral changes being highly common). For longitudinal data, within-hemisphere consistency over time (ICC3, DSC) can also be reported.
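As an illustration, multivariate outliers across several subfield volumes can be flagged with the Mahalanobis distance and a chi-square cutoff; the sketch below uses numpy and scipy, with a hypothetical input file containing only numeric volume columns.

```python
import numpy as np
from scipy.stats import chi2

# X: n_participants x n_regions matrix of subfield volumes (hypothetical CSV)
X = np.loadtxt("subfield_volumes.csv", delimiter=",", skiprows=1)

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", X - mu, cov_inv, X - mu)  # squared Mahalanobis distances

cutoff = chi2.ppf(0.999, df=X.shape[1])     # conservative p < .001 criterion
multivariate_outliers = np.where(d2 > cutoff)[0]
print(multivariate_outliers)                # row indices flagged for review
```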
