DICOM Whole Slide Imaging (WSI)

Tools and Activities

For a quick start at using DICOM for Whole Slide Imaging, these links may be helpful:


This section is adapted from the Scope and Foreword of Supplement 145.


The field of Pathology is undergoing a transformation in which digital imaging is becoming increasingly important. This transformation is fueled by the commercial availability of instruments for digitizing microscope slides. The whole-slide images (WSI) made by digitizing microscope slides at diagnostic resolution are very large. In addition to the size of WSI, the access characteristics of these images differ from other images presently stored in PACS systems. Pathologists need the ability to rapidly pan and zoom images, referred to as virtual microscopy.

In order to facilitate adoption of digital Pathology imaging into hospitals and laboratories, it is desirable that instruments that acquire WSI digital slides store these images into commercially available PACS systems using DICOM-standard messaging. Once this is done, the PACS systems capabilities for storing, archiving, retrieving, searching, and managing images can be leveraged for these new types of images. Additionally, a given case or experiment may comprise images from multiple modalities, including Radiology and Pathology, and all the images for a case or experiment could be managed together in a PACS system.

The DICOM Standard now provides support for WSI digital slides, by incorporating a way to handle tiled large images as multi-frame images and multiple images at varying resolutions.

Characteristics of Whole-Slide Images

Image dimensions, data size

Whole slide images (WSI) are large. A typical sample may be 20mm x 15mm in size, and may be digitized with a resolution of .25 micrometers/pixel (conventionally described as microns per pixel, or mpp). Most optical microscopes have an eyepiece which provides 10X magnification, so using a 40X objective lens results in 400X total magnification. Although instruments which digitize microscope slides do not use an eyepiece and may not use microscope objective lenses, by convention images captured with a resolution of .25mpp are referred to as 40X, images captured with a resolution of .5mpp are referred to as 20X, etc. At .25mpp, the image of the 20mm x 15mm sample is therefore about 80,000 x 60,000 pixels, or 4.8Gp. Images are usually captured with 24-bit color (3 bytes per pixel), so the image data size is about 15GB.

This is a typical example, but larger images may be captured. Sample sizes up to 50mm x 25mm may be captured from conventional 1 x 3 inch slides, and even larger samples may exist on 2 x 3 inch slides. Images may be digitized at resolutions higher than .25mpp; some scanning instruments now support oil immersion lenses which can magnify up to 100X, yielding .1mpp resolution. Some sample types are thicker than the depth of field of the objective lens, so capturing multiple focal planes is desirable (by convention the optical axis is Z, so focal planes are often called Z-planes). Additionally, multi-spectral imaging may capture up to 10 spectral bands at 16 bits per pixel.

Taking an extreme example, a sample of 50mm x 25mm could be captured at .1mpp with 10 Z‑planes, yielding a stack of 10 images of dimension 500,000 x 250,000 pixels. Each plane would contain 125Gp, or 375GB of data, and the entire image dataset would contain 3.75TB of data. This is a worst case but is conceivable given current technology, and in the future resolution will only increase, as will the practicality of capturing multiple Z-planes.
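
The arithmetic behind the typical and extreme examples can be checked with a short sketch. The function name wsi_size and its parameters are illustrative only; the sketch assumes 3 bytes per pixel (24-bit color) and decimal units (1GB = 10^9 bytes).

```python
def wsi_size(width_mm, height_mm, mpp, bytes_per_pixel=3, z_planes=1):
    """Estimate pixel dimensions and uncompressed size of a whole-slide image.

    mpp is the resolution in micrometers per pixel; bytes_per_pixel=3
    corresponds to 24-bit color (an assumption of this sketch).
    """
    columns = round(width_mm * 1000 / mpp)   # 1 mm = 1000 micrometers
    rows = round(height_mm * 1000 / mpp)
    pixels_per_plane = columns * rows
    bytes_per_plane = pixels_per_plane * bytes_per_pixel
    return {
        "columns": columns,
        "rows": rows,
        "pixels_per_plane": pixels_per_plane,
        "bytes_per_plane": bytes_per_plane,
        "total_bytes": bytes_per_plane * z_planes,
    }


typical = wsi_size(20, 15, 0.25)
extreme = wsi_size(50, 25, 0.1, z_planes=10)

print(typical["columns"], typical["rows"])      # 80000 60000
print(typical["total_bytes"] / 1e9, "GB")       # 14.4 GB, i.e. about 15GB
print(extreme["bytes_per_plane"] / 1e9, "GB")   # 375.0 GB per plane
print(extreme["total_bytes"] / 1e12, "TB")      # 3.75 TB
```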

Access patterns, data organization

Due to the large amount of information on a microscope slide, pathologists cannot view an entire sample at high resolution. Instead, they pan through the slide at a relatively low resolution - typically 5mpp (2X) or 2.5mpp (4X) - and then zoom in to higher resolution for selected regions of diagnostic interest. Like all microscopists, pathologists typically focus as they are panning and zooming.

When slides are digitized, the software for viewing WSI must provide equivalent functionality. Pathology image viewers must provide rapid panning and zooming capabilities. When multiple Z‑planes are captured, viewers must also provide rapid change of Z-plane selection.

DICOM is an interchange standard, supporting store and forward architectures, but it also supports interactive access, whether between an application and a local DICOM object store, or between a client workstation and a server. Because the large data sets of WSI preclude loading an entire image into application random access memory, the data organization of the DICOM images is designed to support these types of interactive access patterns.

To facilitate rapid panning, the image data is stored in a tiled fashion. This enables random access to any subregion of the image without loading large amounts of data. To facilitate rapid zooming, the image may be stored at several pre-computed resolutions. This enables synthesis of subregions at any desired resolution without scaling large amounts of data. Finally, if multiple Z-planes are captured, these are stored as separate images, to facilitate loading subregions at any desired focal location.

The simplest way to store two-dimensional image data is a single frame organization, in which image data are stored in rows which extend across the entire image. Figure 1 shows a single frame image organization:

Figure 1 - Single Frame Image Organization

In this single frame organization, image pixels are stored starting from the upper left corner (dark purple square), in rows all the way across the image (medium purple stripe). All the pixels in the image are stored as rows, like text running across a page.

This is a simple organization, but it has an important limitation for large images like WSI: To view or process a subset of the image, a much larger subset of the image must be loaded. For example, in the illustration above the dark green rectangle indicates a region of the image to be viewed or processed. If a single read operation from secondary store will be used to access this area, the light green region indicates the region of the image which must be loaded to access the dark green region.

A more sophisticated way of storing two-dimensional image data is a tiled organization, in which image data are stored in square or rectangular tiles (which are in turn stored by row). Figure 2 shows a tiled image organization:

Figure 2 - Tiled Image Organization

Image pixels are stored starting from the upper left corner (dark purple square), in tiles (medium purple rectangle). All the pixels in the image are stored as tiles, like the pages in a book.

This organization is more complicated than single frame images, but it has an important advantage for large images like WSI: To view or process a subset of the image, only a small subset of the image must be loaded (assuming that efficient access to individual tiles on secondary store is supported). For example, in the illustration above the dark green rectangle indicates a region of the image to be viewed or processed. The light green region indicates the tiles of the image which must be loaded to access the dark green region.
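
The difference between the two organizations can be made concrete with a rough sketch. All function names here are hypothetical; the sketch assumes 3 bytes per pixel, square tiles, 0-based pixel coordinates, and that the single frame case is served by one contiguous read spanning every full-width row the region touches.

```python
def single_frame_bytes(image_width, region_height, bytes_per_pixel=3):
    """Bytes read when one contiguous read must span every full-width row
    that the region touches (the light green band in Figure 1)."""
    return image_width * region_height * bytes_per_pixel


def tiles_covering(x, y, width, height, tile_size):
    """(column, row) indices of the tiles intersecting a region.

    x, y are the 0-based pixel coordinates of the region's top left corner.
    """
    first_col, last_col = x // tile_size, (x + width - 1) // tile_size
    first_row, last_row = y // tile_size, (y + height - 1) // tile_size
    return [(c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


def tiled_bytes(x, y, width, height, tile_size, bytes_per_pixel=3):
    """Bytes read when whole tiles are loaded (light green area in Figure 2)."""
    n_tiles = len(tiles_covering(x, y, width, height, tile_size))
    return n_tiles * tile_size * tile_size * bytes_per_pixel


# A 1,000 x 1,000 pixel viewport in an 80,000-pixel-wide image, 256-pixel tiles.
print(single_frame_bytes(80_000, 1_000))                 # 240000000
print(tiled_bytes(40_000, 30_000, 1_000, 1_000, 256))    # 4915200
```

In this example the tiled organization reads 25 tiles (about 5MB) where the row-based organization reads about 240MB.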

The chosen tile size for an image affects the performance of accessing the image. Large tiles mean that fewer tiles must be loaded for each region, but more data will be loaded overall. Typical tile sizes might range from 240 x 240 pixels (172KB uncompressed) to 4,096 x 4,096 pixels (50MB uncompressed).
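
The trade-off can be illustrated numerically. This is a sketch under stated assumptions (24-bit color, square tiles, uncompressed data, an arbitrarily placed 1,000 x 1,000 pixel viewport); tile_bytes and load_cost are made-up helper names.

```python
def tile_bytes(tile_size, bytes_per_pixel=3):
    """Uncompressed size in bytes of one square tile."""
    return tile_size * tile_size * bytes_per_pixel


def load_cost(x, y, width, height, tile_size):
    """(number of tiles, total bytes) needed to cover a viewport whose
    top left corner is at 0-based pixel (x, y)."""
    cols = (x + width - 1) // tile_size - x // tile_size + 1
    rows = (y + height - 1) // tile_size - y // tile_size + 1
    n_tiles = cols * rows
    return n_tiles, n_tiles * tile_bytes(tile_size)


print(tile_bytes(240))     # 172800    (about 172KB)
print(tile_bytes(4096))    # 50331648  (about 50MB)

for size in (240, 1024, 4096):
    print(size, load_cost(40_000, 30_000, 1_000, 1_000, size))
# 240 (25, 4320000)
# 1024 (4, 12582912)
# 4096 (2, 100663296)
```

For this viewport, going from 240-pixel to 4,096-pixel tiles cuts the number of tiles from 25 to 2 but raises the data loaded from about 4MB to about 100MB.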

Although storing images with a tiled organization facilitates rapid panning, there is still an issue with rapid zooming. Consider Figure 3:

Figure 3 - Issue with Rapid Zooming

The problem is that at high resolution, a small image area must be accessed to render a given region (exemplified by the dark green area in the illustration). At lower resolutions, progressively larger image areas must be accessed to render the same size region (lighter green areas in the illustration). At the limit, to render a low-resolution thumbnail of the entire image, all the data in the image must be accessed and downsampled!

A solution to this problem is to pre-compute lower resolution versions of the image. These are typically spaced some power of 2 apart, to facilitate rapid and accurate downsampling, and add some overhead to the total stored data size. For example, generating resolution levels a factor of 2 apart adds about 32% to the size of the data set, and generating resolution levels a factor of 4 apart adds about 7% to the size of the data set.
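
The overhead figures follow from a geometric series: each level spaced a factor f apart in both dimensions holds 1/f^2 as many pixels as the level below it. A minimal sketch, in which the function name and the choice of three extra levels are assumptions for illustration:

```python
def pyramid_overhead(factor, levels):
    """Extra storage, as a fraction of the base image, for `levels`
    reduced-resolution images spaced `factor` apart in each dimension."""
    return sum((1 / factor**2) ** k for k in range(1, levels + 1))


print(f"{pyramid_overhead(2, 3):.1%}")   # 32.8%
print(f"{pyramid_overhead(4, 3):.1%}")   # 6.7%
```

With more levels the overhead approaches 1/(f^2 - 1), that is, one third for a factor of 2 and one fifteenth for a factor of 4.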

The typical organization of a WSI for Pathology may be thought of as a pyramid of image data. Figure 4 shows such a pyramid:

Figure 4 - Whole-slide Image as a Pyramid of Image Data

As shown in this figure, the WSI consists of multiple images at different resolutions (the altitude of the pyramid corresponds to the zoom level). The base of the pyramid is the highest resolution image data as captured by the instrument. A thumbnail image may be created which is a low resolution version of the image to facilitate viewing the entire image at once. One or more intermediate levels of the pyramid may be created, at intermediate resolutions, to facilitate retrieval of image data at arbitrary resolution.

Each image in the pyramid may be stored as a series of tiles, to facilitate rapid retrieval of arbitrary subregions of the image.

Figure 4 shows a retrieved image region at an arbitrary resolution level, between the base level and the first intermediate level. The base image and the intermediate level image are tiled. The shaded areas indicate the image data which must be retrieved from the images to synthesize the desired subregion at the desired resolution.
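
One simple strategy a viewer might use to serve an arbitrary resolution is sketched below: read from the coarsest stored level that is still at least as fine as the request, then downsample. This is only one possible approach, not a procedure defined by the standard, and choose_source_level is a hypothetical name.

```python
def choose_source_level(level_mpps, requested_mpp):
    """Pick the pyramid level to read for a requested resolution.

    level_mpps lists the micrometers-per-pixel of each stored level.
    Returns (source_mpp, scale), where scale is the factor by which pixel
    dimensions read from the source level are multiplied to reach the
    requested resolution (scale <= 1 means downsampling).
    """
    # Levels at least as fine as the request (smaller mpp = finer detail).
    candidates = [m for m in level_mpps if m <= requested_mpp]
    # Prefer the coarsest such level; fall back to the finest stored level
    # (which then has to be upsampled) if the request is finer than the base.
    source = max(candidates) if candidates else min(level_mpps)
    return source, source / requested_mpp


levels = [0.25, 1.0, 4.0]                  # base level plus two reduced levels
print(choose_source_level(levels, 0.5))    # (0.25, 0.5)
print(choose_source_level(levels, 2.0))    # (1.0, 0.5)
print(choose_source_level(levels, 1.0))    # (1.0, 1.0)
```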

Image data compression

Because of their large size, WSI data are often compressed. Depending on the application, lossless or lossy compression techniques may be used. Lossless compression typically yields a 3X-5X reduction in size. The most frequently used lossy compression techniques are JPEG and JPEG2000. For most applications, pathologists have found that there is no loss of diagnostic information when JPEG (at 15X-20X reduction) or JPEG2000 (at 30X-50X reduction) compression is used. Lossy compression is therefore often used in present-day WSI applications. JPEG2000 yields higher compression and fewer image artifacts than JPEG; however, JPEG2000 is compute-intensive.

The typical example image described above, which contains 15GB of image data, could be compressed with JPEG2000 to about 300MB. The extreme example described above could be compressed from 3.75TB to 75GB.
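
These figures correspond to a 50X reduction, the upper end of the JPEG2000 range quoted above (the ratio is inferred from the arithmetic, not stated in the text). A small check, with compressed_size as an illustrative helper name:

```python
def compressed_size(uncompressed_bytes, ratio):
    """Approximate size after compression at the given reduction ratio."""
    return uncompressed_bytes / ratio


print(compressed_size(15e9, 50) / 1e6, "MB")     # 300.0 MB
print(compressed_size(3.75e12, 50) / 1e9, "GB")  # 75.0 GB
```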

Sparse image data

Some instruments which digitize microscope slides do not capture all areas of the slide at the highest resolution. In this case the image data within any one level of the conceptual pyramid may be sparse, i.e., lacking some of the tiles.

Similarly, some instruments which capture multiple Z-planes do not capture 3D image information for all areas of a slide. In this case the image data within any one or all Z-planes may be sparse.

Description of the DICOM Whole Slide Image Storage IOD

Pixel Matrix

In all current DICOM image IODs, pixel matrix dimensions are stored as unsigned 16-bit integers, for a maximum value of 64K columns and rows. As noted above, WSI frequently have pixel dimensions which are larger than this. To remain within the (64K)^2 size limit the standard uses tiling. The WSI IOD provides a Total Pixel Matrix up to (2^32)^2, into which the tiles fit, and which defines the spatial orientation of the tiles relative to the slide.

Uncompressed DICOM image pixel data has a maximum size of 2^32 bytes (4GB). As noted above, WSI may have data sizes which are larger than this. However, compressed DICOM pixel data is sent using a structure that allows unrestricted lengths; since WSI is typically exchanged compressed, this 4GB limitation does not apply.
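
A quick way to see why the typical WSI needs tiling is to test its dimensions against the two limits just described. The constants below simply restate the limits as quoted in this section; the function names are hypothetical and 24-bit color is assumed.

```python
MAX_FRAME_DIM = 2**16 - 1        # largest value of an unsigned 16-bit integer
MAX_UNCOMPRESSED_BYTES = 2**32   # 4GB limit as quoted in the text


def fits_in_single_frame(columns, rows):
    """True if the pixel matrix fits within 16-bit row and column counts."""
    return columns <= MAX_FRAME_DIM and rows <= MAX_FRAME_DIM


def exceeds_uncompressed_limit(columns, rows, bytes_per_pixel=3):
    """True if the uncompressed pixel data would be larger than 4GB."""
    return columns * rows * bytes_per_pixel > MAX_UNCOMPRESSED_BYTES


# The typical 80,000 x 60,000 image breaks both limits ...
print(fits_in_single_frame(80_000, 60_000))        # False
print(exceeds_uncompressed_limit(80_000, 60_000))  # True
# ... while a reduced level of 20,000 x 15,000 pixels stays within them.
print(fits_in_single_frame(20_000, 15_000))        # True
print(exceeds_uncompressed_limit(20_000, 15_000))  # False
```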

Tiled images

The basic mechanism for storing WSI in DICOM is to store the individual tiles of a single WSI pyramid level (resolution layer) as individual frames in a DICOM multi-frame image object. The tiles may be small, in which case many individual frames will be stored in the image, or they may be large, and in the limit may be so large that one or more levels of the pyramid require only one tile. In fact, an entire WSI level can be stored as one single tile (if it fits within the (64K)^2 frame pixel matrix limit).

Where multiple Z-plane images are needed for the WSI, each plane may be stored separately in an object in the series, or all the planes at one level may be stored in the same image object. Similarly, for multispectral imaging each wavelength may be stored separately, or all in the same object.

Each frame is located by three spatial coordinates relative to the WSI: X and Y offsets (by convention, the upper left corner pixel is {1,1}, and X ascends across the image to the right, while Y ascends down the image to the bottom), and Z, which indicates the plane to which the image belongs.
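
For a complete, regular tiling, the offsets of each frame can be derived from its index. The sketch below expresses them as 1-based column and row offsets of the tile's top left pixel, in the spirit of the DICOM attributes Column Position In Total Image Pixel Matrix and Row Position In Total Image Pixel Matrix. It assumes frames are stored left to right, then top to bottom, with frame_index counted from 0; frame_position is a hypothetical helper, and sparse tilings need explicit per-frame positions instead.

```python
import math


def frame_position(frame_index, total_columns, tile_columns, tile_rows):
    """1-based (column offset, row offset) of a tile's top left pixel.

    Assumes a complete regular grid with frames ordered row by row.
    """
    tiles_per_row = math.ceil(total_columns / tile_columns)
    tile_row, tile_col = divmod(frame_index, tiles_per_row)
    return tile_col * tile_columns + 1, tile_row * tile_rows + 1


# 80,000-pixel-wide level with 256 x 256 tiles: 313 tiles per row.
print(frame_position(0, 80_000, 256, 256))     # (1, 1)
print(frame_position(1, 80_000, 256, 256))     # (257, 1)
print(frame_position(313, 80_000, 256, 256))   # (1, 257)
```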

Within each image object, tiling is on a regular grid that covers the entire imaged area. The tiling may be sparse or complete. If there are multiple Z-planes in a level, or multiple spectral bands, not only may the tiling be sparse, but the sparseness may vary between the planes or bands. This applies whether the Z-planes or bands are in separate image objects, or all in the same object.

Within a level there may be several image objects, and the tiling does not need to be the same across those objects. For example, there may be some image objects with large tiles, and some with small tiles. There may be different alignments of the tiling grid relative to the imaged area. Thus tiling on a non-regular basis can be accommodated by using separate image objects.

The edge tiles may include areas that are not part of the scanned volume; those areas may use padding to fill out the tile.

Storing an Image Pyramid as a Series

Where multiple resolution images are needed or desired for the WSI, each level is stored separately in the series.

An image object describes one level. Levels are composed of tiles, and so may be sparse (not all tiles present). For any level the resolution is fixed for all tiles in the level, and all tiles have the same width and height, and may not overlap, although the level may be sparse and any number of tiles may be absent.
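
The constraints on a level (one resolution, one tile size, no overlap, tiles possibly absent) map naturally onto a dictionary keyed by grid position. This is an illustrative in-memory model only, not the DICOM encoding; the class and method names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class PyramidLevel:
    """One resolution level: fixed mpp, fixed tile size, sparse tile grid."""
    mpp: float
    tile_columns: int
    tile_rows: int
    # Maps (tile_col, tile_row) on the regular grid to that tile's pixel data.
    tiles: dict = field(default_factory=dict)

    def add_tile(self, tile_col, tile_row, data):
        key = (tile_col, tile_row)
        if key in self.tiles:
            # One tile per grid cell, so tiles can never overlap.
            raise ValueError(f"tile {key} already present")
        self.tiles[key] = data

    def get_tile(self, tile_col, tile_row):
        """Return the tile's data, or None where the level is sparse."""
        return self.tiles.get((tile_col, tile_row))


level = PyramidLevel(mpp=0.25, tile_columns=256, tile_rows=256)
level.add_tile(0, 0, b"pixel data placeholder")
print(level.get_tile(0, 0) is not None)   # True
print(level.get_tile(5, 7))               # None (tile not captured)
```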

Figure 5 illustrates the correspondence of an image pyramid to DICOM images and series:

Figure 5 - Mapping a WSI Pyramid into a DICOM Series

The Series may also contain ancillary images, such as a slide label image or whole slide macro image.

Color / Optical Path

Different WSI images may have different numbers of color channels and different numbers of bits per channel. The most typical format for simple color images will be three channels, typically RGB data or data transformed to the YCbCr color space, with pixels having 8-bit samples for each channel.

Multi-spectral images may have a single frequency band encoded in each frame with up to 16 bits pixel depth; such images will be identified as monochrome, although the image object may include many co-extensive frames representing a tile in different spectral bands. The color mapping of each frame is conveyed through a description of the optical path.

The optical path description for each frame allows the specification of the illumination and the detection spectra (which may differ with fluorescence), lenses, polarization, and other parameters. In the simple color image case, illumination would be white light with RGB detection.

The color characteristics of an RGB image are corrected by an ICC Profile (included in the image object) to account for the illumination characteristics.

For multi-spectral images, each frequency band has a recommended display color. It is the responsibility of the display application to decide how to display multiple bands (encoded in separate frames), and how to use that recommended color (including blending of multiple spectral band frames).
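
Since blending is left to the display application, many schemes are possible. The sketch below shows one simple option, additive blending with clipping, for a single pixel location. It assumes the band samples have already been normalized to 0..1 and that the recommended display colors have already been converted to 8-bit RGB triplets; blend_bands is a hypothetical function, not part of any DICOM toolkit.

```python
def blend_bands(band_values, band_colors):
    """Additively blend co-located samples from several spectral bands.

    band_values: one intensity per band, normalized to the range 0..1.
    band_colors: one (r, g, b) display color per band, each 0..255.
    """
    rgb = [0.0, 0.0, 0.0]
    for value, color in zip(band_values, band_colors):
        for i in range(3):
            rgb[i] += value * color[i]
    # Clip so that overlapping bright bands saturate instead of wrapping.
    return tuple(min(255, round(c)) for c in rgb)


# A fully lit blue band combined with a faint green band.
print(blend_bands([1.0, 0.2], [(0, 0, 255), (0, 255, 0)]))   # (0, 51, 255)
```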

WSI Frame of Reference

The DICOM Slide Coordinates Microscopy Visible Light IOD defines a Frame of Reference for localizing slide images using a slide-based coordinate system (comparable to the DICOM patient-based coordinate system). It specifies a particular corner of the slide as a nominal reference origin, and a right-handed (X,Y,Z) coordinate system for positioning from that origin.

The WSI IOD retains that Frame of Reference and coordinate system, using it to localize each frame (tile), as well as the top left hand corner of the total imaging area.

Note that while the nominal reference origin and coordinate system are clearly defined, they are not intended to be reproducible across different mountings of a slide, even on the same equipment. Also note that the slide-based (X,Y) coordinate system is rotated 180 degrees from the conventional image matrix (row, column) orientation of the image frames with the label on the left. See Figure 6.
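
The relationship between pixel matrix indices and slide coordinates can be sketched with the usual origin-plus-direction formula. Everything in this example is an assumption made for illustration: the function name, the origin value, the use of micrometers, 0-based indices, and the direction vectors chosen to represent the 180 degree rotation described above. Real objects carry their own origin and orientation values.

```python
def pixel_to_slide(col, row, origin_xy, spacing, row_dir, col_dir):
    """Map a 0-based (col, row) pixel index to slide (x, y) coordinates.

    origin_xy: slide coordinates of the top left pixel
    spacing:   (distance between columns, distance between rows)
    row_dir:   slide-space direction in which the column index increases
    col_dir:   slide-space direction in which the row index increases
    """
    x = origin_xy[0] + col * spacing[0] * row_dir[0] + row * spacing[1] * col_dir[0]
    y = origin_xy[1] + col * spacing[0] * row_dir[1] + row * spacing[1] * col_dir[1]
    return x, y


# 180 degree rotation: moving right or down in the image moves toward
# smaller slide X and Y. Units are micrometers, spacing is .25mpp.
origin = (40_000.0, 20_000.0)
rotated = dict(spacing=(0.25, 0.25), row_dir=(-1, 0), col_dir=(0, -1))

print(pixel_to_slide(0, 0, origin, **rotated))             # (40000.0, 20000.0)
print(pixel_to_slide(80_000, 60_000, origin, **rotated))   # (20000.0, 5000.0)
```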

Figure 6 - Slide Coordinates Origin and (X,Y,Z) vs. Image Matrix Origin and (Rows,Columns)


The focal plane of a frame, or Z-plane, is identified as the nominal physical height (in μm) of image focus above the reference surface, which in the slide-based coordinate system is the top surface of the (glass) slide substrate, i.e., the side on which the specimen is placed.

Z-plane information is used for relative spatial positioning of image planes, and nominal inter-plane distance. An imaging focal plane may track variations in specimen thickness or the specimen surface contour, but only one Z-value is used. It thus has meaning only in a local context; it can be used for relative depths of different frames at the same (X,Y) tile position, and it can be used to match the frame at one tile position to a frame at an adjacent tile position with the same Z-plane depth. The Z value should not be used as an absolute depth measurement. See Figure 7.

Figure 7 - Z planes track curved surface

WSI Annotation and Analysis Results


As a general principle in DICOM, annotations are conveyed in information objects separate from the image. Since annotations may be created at a time much later than the image acquisition in a different Procedure Step, and on different equipment, and because annotations are of a different modality than image acquisition (i.e., they are created by a different type of process), they must be recorded in a separate Series (as a DICOM Series is limited to objects of a single Modality, produced by a single Equipment, in a single Procedure Step).

As independent objects, multiple annotation objects can reference the same image.

Types of annotation

There are several types of DICOM annotation objects serving different purposes:

Each of these has potential applicability to WSI.

Microscopy Bulk Simple Annotations

Microscopy Bulk Simple Annotations are specifically intended for annotating very large numbers of machine generated regions of interest created from whole slide microscopy images. The binary representation of the 2D or 3D coordinates of the points defining contours or geometric shapes is compact and indexed in order to minimize size yet provide sufficient precision. Commonality is refactored. Coded constructs defining the characteristics of the annotations are shared between annotations of the same type.


Segmentation

Segmentation is a type of derived image, and is encoded using the enhanced multi-frame paradigm, extended to support tiled pyramidal images. (Note there is also a DICOM capability for Surface Segmentation, which is not discussed here.) Each segment is linked to a categorization or classification of a corresponding area in an analyzed source image. Typically, a segmentation image frame is encoded with only 1-bit/pixel to show the presence or absence of the specified category at that pixel location. Alternatively, encoding can be 1-byte/pixel to allow a fractional assessment of the classification (either probability of the classification in the referenced pixel, or fractional occupancy of the pixel by the classification).
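
The two encodings can be contrasted with a toy example that turns per-pixel classifier probabilities into either presence/absence values or one-byte fractional values. The helper names, the 0.5 threshold, and the use of 255 as the value representing 1.0 are assumptions of this sketch; packing the binary values into 1-bit/pixel storage is not shown.

```python
def to_binary(probabilities, threshold=0.5):
    """Presence (1) or absence (0) of the category at each pixel."""
    return [1 if p >= threshold else 0 for p in probabilities]


def to_fractional(probabilities, max_value=255):
    """One byte per pixel, scaling 0..1 onto 0..max_value."""
    return bytes(round(p * max_value) for p in probabilities)


probs = [0.0, 0.2, 0.8, 1.0]
print(to_binary(probs))             # [0, 0, 1, 1]
print(list(to_fractional(probs)))   # [0, 51, 204, 255]
```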

A segmentation image can be in the same Frame of Reference as the source image, in which case the spatial alignment can be specified relative to the Frame of Reference origin, and the spatial resolution (pixel spacing) can be different than the source image. However, the segmentation can also be aligned on a pixel-by-pixel basis with a source image, whether or not there is a Frame of Reference used. In that case, the segmentation frame has the same pixel spatial resolution and extent as the source frame. For WSI, segmentations can be created for any selected frames (tiles); it is not necessary to perform a segmentation across the entire image.

A segmentation frame can be derived from multiple source frames. Thus, multiple color channels can be used to perform the segmentation.

For a grayscale source image, the Blending Softcopy Presentation State can be used to control an initial presentation of the source image with the segmentation as a color overlay, with variable relative opacity. With a color source image, the segmentation image object itself can convey a recommended display color for the overlay, but there is currently no standard presentation state controlling color-on-color blending.

Structured Reporting

While Presentation State objects can carry textual annotation, that annotation is for human use only - it is not formally processable by automated applications in an interoperable manner. It does not use controlled and coded vocabulary, and conveys no structural semantics (relationships between annotations). Those capabilities are available with Structured Reporting (SR).

The areas in which SR is important are those where the annotations are intended to be used in the imaging analysis and review processes. For example, CAD analysis results, intended to be overlaid on images, and which require full contextual description of their evidentiary and inferential chain, are defined as SR objects. Similarly, SR can facilitate conveying provisional image measurements and findings (internal departmental work products), to be reviewed by a physician together with the imaging, as part of the clinical review and reporting process.

The final clinical report, intended for broad distribution outside the imaging environment, may be encoded as an HL7 CDA document. However, there are standard means of encoding DICOM object references in CDA, so that such reports can link to the imaging evidence (including reference of Presentation States to control display of referenced images).

Parametric Maps

In quantitative multi-spectral microscopy, pixel values can be mapped by the Real World Value Map within Parametric Map images to activity, concentration, or other physical measurements. Real World Value Maps can provide the conversion from pixel values to physical measurement values through a linear equation (slope and intercept), or through a look up table. Parametric Maps may be useful for encoding so-called "heat maps" that are generated by artificial intelligence algorithms and intended to be pseudo-colored and superimposed on anatomic images.
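
The two conversion mechanisms amount to a linear function and a table lookup. A minimal sketch, with hypothetical function and parameter names and made-up numbers:

```python
def linear_map(pixel_value, slope, intercept):
    """Real world value from the linear equation slope * pixel + intercept."""
    return slope * pixel_value + intercept


def lut_map(pixel_value, first_value_mapped, lut):
    """Real world value from a look up table whose first entry corresponds
    to the pixel value first_value_mapped."""
    return lut[pixel_value - first_value_mapped]


print(linear_map(100, 0.5, -10.0))             # 40.0
print(lut_map(2, 0, [0.0, 1.5, 3.0, 4.5]))     # 3.0
```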

Presentation States

The Grayscale Softcopy Presentation State (GSPS), Color Softcopy Presentation State (CSPS), and Pseudo-Color Softcopy Presentation State (PCSPS) can be used as is for annotating individual frames (tiles). However, for a single annotation to extend across tile boundaries, the annotation anchor locations need to be relative to the whole image matrix, which is supported for the WSI IOD.

Note that a Presentation State annotation can apply to multiple frames. Thus a single annotation can be identified as applying to all the tiles of different spectra (colors) and/or different focal planes that are at the same position in the Image Pixel Matrix.

Structured Display is another type of Presentation State that lays out multiple windows on a screen, and describes the images (and their initial presentation states) to be displayed in those windows.

WSI Workflow - IHE Digital Pathology Workflow – Image Acquisition (DPIA)


Traditionally, workflow management in the DICOM imaging environment uses the DICOM Modality Worklist (MWL) and Modality Performed Procedure Step (MPPS) services. These were defined to support human-controlled imaging (radiologic technicians operating a scanner modality). Although they were initially considered for automated slide scanning modalities as well, they have not proven popular, so IHE has defined an HL7 V2 based workflow instead.

The IHE Digital Pathology Workflow – Image Acquisition (DPIA) Integration Profile specifies the HL7 V2 messages and a mapping to the corresponding DICOM image attributes:

Relevant Parts of the Standard



Historical Information

The supplements that extended the DICOM Standard are listed here, but these are not maintained and only the standard itself should be used for reference.

Intellectual Property

Various patents have been asserted by DICOM members but are licensed without a fee under FRAND terms consistent with the procedures of the DICOM Standards Committee and acceptable to NEMA counsel.

See also the home page and minutes of DICOM WG 26

Related Articles and Presentations

For further information contact dicom@medicalimaging.org.

Last updated: Thu Apr 7 15:17:59 EDT 2022

Copyright ©2020-2022 NEMA