DICOM PS3.5 2024e - Data Structures and Encoding

6 Value Encoding

A Data Set is constructed by encoding the Values of Attributes specified in the Information Object Definition (IOD) of a Real-World Object. The specific content and semantics of these Attributes are specified in Information Object Definitions (see PS3.3). The range of possible data types of these Values and their encoding are specified in this section. The structure of a Data Set, which is composed of Data Elements containing these Values, is specified in Section 7.

Throughout this Part, as well as other parts of the DICOM Standard, Tags are used to identify both specific Attributes and their corresponding Data Elements.

6.1 Support of Character Repertoires

Values that are text or character strings can be composed of Graphic and Control Characters. The Graphic Character set, independent of its encoding, is referred to as a Character Repertoire. Depending on the native language context in which Application Entities wish to exchange data using the DICOM Standard, different Character Repertoires will be used. The Character Repertoires supported by DICOM are:

Note

  1. [ISO/IEC 10646] corresponds to the Unicode character set. ISO-IR 192 corresponds to the use of the UTF-8 encoding for this character set.

  2. The [GB 18030] character set is harmonized with the Unicode character set on a regular basis, to reflect updates from both the Chinese language and from Unicode extensions to support other languages.

  3. The issue of font selection is not addressed by the DICOM Standard. Issues such as proper display of words like "bone" in Chinese or Japanese usage are managed through font selection. Similarly, other user interface issues like bidirectional character display and text orientation are not addressed by the DICOM Standard. The Unicode documents provide extensive documentation on these issues.

  4. The [GBK] character set is an extension of the [GB 2312] character set and supports the Chinese characters in [GB 18030], which is the Chinese adaptation of Unicode. [GBK] is code point backward compatible with [GB 2312]. The [GB 18030] character set is an extension of the [GBK] character set for support of Unicode, and provides backward code point compatibility.

6.1.1 Representation of Encoded Character Values

As defined in the ISO Standards referenced in this section, byte values used for encoded representations of characters are represented in this section as two decimal numbers in the form column/row.

This means that the value can be calculated as (column * 16) + row, e.g., 01/11 corresponds to the value 27 (1BH).
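The arithmetic above can be sketched in Python as follows. The function names colrow_to_byte and byte_to_colrow are illustrative only and are not defined by the Standard.

```python
def colrow_to_byte(notation: str) -> int:
    """Convert column/row notation such as "01/11" to a byte value."""
    column_text, row_text = notation.split("/")
    column, row = int(column_text), int(row_text)
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError(f"column and row must each be in 0..15: {notation!r}")
    return column * 16 + row


def byte_to_colrow(value: int) -> str:
    """Convert a byte value back to zero-padded column/row notation."""
    if not 0 <= value <= 255:
        raise ValueError(f"not a byte value: {value}")
    return f"{value // 16:02d}/{value % 16:02d}"


print(colrow_to_byte("01/11"))   # 27, i.e., 1BH
print(byte_to_colrow(0x5C))      # 05/12
```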

Note

Two digit hex notation will be used throughout the remainder of this Standard to represent character encoding. The column/row notation is used only within Section 6.1 to simplify any cross referencing with applicable ISO standards.

The byte encoding space is divided into four ranges of values:

  • CL bytes from 00/00 to 01/15

  • GL bytes from 02/00 to 07/15

  • CR bytes from 08/00 to 09/15

  • GR bytes from 10/00 to 15/15
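As an illustration, the four ranges can be expressed as a small Python lookup. The function name code_table_area is an assumption made for this sketch; the boundaries are the ones listed above (01/15 = 1FH, 07/15 = 7FH, 09/15 = 9FH).

```python
def code_table_area(value: int) -> str:
    """Return the code table area (CL, GL, CR or GR) that a byte falls in."""
    if not 0 <= value <= 0xFF:
        raise ValueError(f"not a byte value: {value}")
    if value <= 0x1F:    # 00/00 to 01/15
        return "CL"
    if value <= 0x7F:    # 02/00 to 07/15
        return "GL"
    if value <= 0x9F:    # 08/00 to 09/15
        return "CR"
    return "GR"          # 10/00 to 15/15


print(code_table_area(0x1B))  # CL (ESC, 01/11)
print(code_table_area(0xFC))  # GR (15/12)
```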

Note

[ISO/IEC 8859] does not differentiate between a code element, e.g., G0, and the area in the code table, e.g., GL, where it is invoked. The term "G0" specifies the code element as well as the area in the code table. In ISO/IEC 2022 there is a clear distinction between the code elements (G0, G1, G2, and G3) and the areas in which the code elements are invoked (GL or GR). In this Standard the nomenclature of ISO/IEC 2022 is used.

The Control Character set C0 shall be invoked in CL and the Graphic Character sets G0 and G1 in GL and GR respectively. Only some Control Characters from the C0 set are used in DICOM (see Section 6.1.3), and characters from the C1 set shall not be used.

6.1.2 Graphic Characters

A Character Repertoire, or character set, is a collection of Graphic Characters specified independently of their encoding.

6.1.2.1 Default Character Repertoire

The default repertoire for character strings in DICOM shall be the Basic G0 Set of the International Reference Version of [ISO 646] (ISO-IR 6). See Annex E for a table of the DICOM default repertoire and its encoding.

Note

This Basic G0 Set is identical with the common character set of [ISO/IEC 8859].

6.1.2.2 Extension or Replacement of the Default Character Repertoire

DICOM Application Entities (AEs) that extend or replace the default repertoire convey this information in the Specific Character Set (0008,0005) Attribute.

Note

The Attribute Specific Character Set (0008,0005) is encoded using a subset of characters from ISO-IR 6. See the definition for the Value Representation (VR) of Code String (CS) in Table 6.2-1.

For Data Elements with Value Representations of SH (Short String), LO (Long String), UC (Unlimited Characters), ST (Short Text), LT (Long Text), UT (Unlimited Text) or PN (Person Name) the Default Character Repertoire may be extended or replaced (these Value Representations are described in more detail in Section 6.2). If such an extension or replacement is used, the relevant "Specific Character Set" shall be defined as an Attribute of the SOP Common Module (0008,0005) (see PS3.3) and shall be stated in the Conformance Statement. PS3.2 gives conformance guidelines.

Note

  1. Preferred repertoires as defined in ENV 41 503 and ENV 41 508 for the use in Western and Eastern Europe, respectively, are: ISO-IR 100, ISO-IR 101, ISO-IR 144, ISO-IR 126. See Section 6.1.2.3.

  2. Information Object Definitions using different character sets cannot rely per se on lexical ordering or string comparison of Data Elements represented as character strings. These operations can only be carried out within a given character repertoire and not across repertoire boundaries.

6.1.2.3 Encoding of Character Repertoires

The 7-bit Default Character Repertoire can be replaced for use in Value Representations SH, LO, ST, LT, PN, UC and UT with one of the single-byte codes defined in PS3.3.

Note

This replacement character repertoire does not apply to other textual Value Representations (AE and CS).

The replacement character repertoire shall be specified in Value 1 of the Attribute Specific Character Set (0008,0005). Defined Terms for the Attribute Specific Character Set are specified in PS3.3.

Note

  1. The code table is split into the GL area, which supports a 94 character set only (bit combinations 02/01 to 07/14) plus SPACE in 02/00, and the GR area, which supports either a 94 or 96 character set (bit combinations 10/01 to 15/14 or 10/00 to 15/15). The default character set (ISO-IR 6) is always invoked in the GL area.

  2. All character sets specified in [ISO/IEC 8859] include ISO-IR 6. This set will always be invoked in the GL area of the code table and is the equivalent of ASCII [ANSI X3.4], whereas the various extension repertoires are mapped onto the GR area of the code table.

  3. The 8-bit code table of [JIS X 0201] includes ISO-IR 14 (romaji alphanumeric characters) as the G0 code element and ISO-IR 13 (katakana phonetic characters) as the G1 code element. ISO-IR 14 is identical to ISO-IR 6, except that bit combination 05/12 represents a "¥" (YEN SIGN) and bit combination 07/14 represents an over-line.

Two character codes of the single-byte character sets invoked in the GL area of the code table, 02/00 and 05/12, have special significance in the DICOM Standard. The character SPACE, represented by bit combination 02/00, shall be used for the padding of Data Element Values that are character strings. The Graphic Character represented by the bit combination 05/12, "\" (BACKSLASH) (reverse solidus) in the repertoire ISO-IR 6, shall only be used in character strings with Value Representations of UT, ST and LT (see Section 6.2). Otherwise the character code 05/12 is used as a separator for multi-valued Data Elements (see Section 6.4).

Note

  1. When the Value of Specific Character Set (0008,0005) is either "ISO_IR 13" or "ISO 2022 IR 13", the graphic character represented by the bit combination 05/12 is a "¥" (YEN SIGN) in the character set of ISO-IR 14.

  2. The expected behavior on conversion during store-and-forward operations needs to be equivalent to the action of separating a multi-valued character stream for multi-valued VRs into individual values between 05/12 byte delimiters and to recombine them separated by 05/12 byte delimiters, regardless of which Graphic Character 05/12 represents in the respective Character Set.

  3. Graphic Characters that match the delimiter specified for the Character Set for multi-valued VRs cannot be represented as Values in that Character Set. I.e., a BACKSLASH encoded as 05/12 cannot be present within a Value (as opposed to between Values) in the Default Character Set and a YEN SIGN encoded as 05/12 cannot be present within a Value in [JIS X 0201].
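The store-and-forward behavior described in these notes can be sketched in Python for the simple case of a single-byte replacement repertoire without Code Extension. The helper names split_values and join_values are hypothetical. The sketch works on raw bytes, so it behaves the same whether 05/12 is displayed as a BACKSLASH or as a YEN SIGN.

```python
from typing import List

DELIMITER = b"\x5c"  # bit combination 05/12


def split_values(raw: bytes) -> List[bytes]:
    """Separate a multi-valued byte string into its individual Values."""
    return raw.split(DELIMITER)


def join_values(values: List[bytes]) -> bytes:
    """Recombine Values, rejecting any Value that contains the delimiter."""
    for value in values:
        if DELIMITER in value:
            raise ValueError("a Value cannot contain the 05/12 delimiter byte")
    return DELIMITER.join(values)


raw = b"ORIGINAL\\PRIMARY\\AXIAL"
values = split_values(raw)
print(len(values))                 # 3
print(join_values(values) == raw)  # True
```

This byte-level split is not appropriate for multi-byte encodings, where a byte valued 05/12 may be part of a longer character.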

The character DELETE (bit combination 07/15) shall not be used in DICOM character strings.

The replacement Character Repertoire specified in Value 1 of Specific Character Set (0008,0005) (or the Default Character Repertoire if Value 1 is empty) may be further extended with additional Coded Character Sets, if needed and permitted by the replacement Character Repertoire. The additional Coded Character Sets and extension mechanism shall be specified in additional Values of the Attribute Specific Character Set. If Attribute Specific Character Set (0008,0005) has a single Value, the DICOM SOP Instance supports only one code table and no Code Extension techniques. If Attribute Specific Character Set (0008,0005) has multiple Values, the DICOM SOP Instance supports Code Extension techniques as described in ISO/IEC 2022:1994.

The Character Repertoires that prohibit extension are identified in PS3.3.

Note

  1. Considerations on the Handling of Unsupported Character Sets:

    In DICOM, character sets are not negotiated between Application Entities but are indicated by a conditional Attribute of the SOP Common Module. Therefore, implementations may be confronted with character sets that are unknown to them.

    The Unicode Standard includes a substantial discussion of the recommended means for display and print for characters that lack font support. These same recommendations may apply to the mechanisms for unsupported character sets.

    The machine should print or display such characters by replacing all unknown characters with the four characters "\nnn", where "nnn" is the three digit octal representation of each byte.

    An example of this for an ASCII based machine would be as follows:

    Character String: Günther

    Encoded representation: 04/07 15/12 06/14 07/04 06/08 06/05 07/02

    ASCII based machine: G\374nther

    Implementations may also encounter Control Characters that they have no means to print or display. The machine may print or display such Control Characters by replacing the Control Character with the four characters "\nnn", where "nnn" is the three digit octal representation of each byte.

  2. Considerations for missing fonts

    The Unicode standard and the [GB 18030] standard define mechanisms for print and display of characters that are missing from the available fonts. If GBK is specified in Specific Character Set (0008,0005), the [GB 18030] rules of print and display of characters shall apply. The DICOM Standard does not specify user interface behavior since it does not affect network or media data exchange.

  3. The Unicode and [GB 18030] standards have distinct YEN SIGN, BACKSLASH, and several forms of reverse solidus. The separator for multi-valued Data Elements in DICOM is the character valued 05/12 regardless of what glyph is used to enter or display this character. The other reverse solidus characters that have a very similar appearance are not separators. The choice of font can affect the appearance of 05/12 significantly. Multi-byte encoding systems, such as [GB 18030], [GBK] and [ISO/IEC 2022], may generate encodings that contain a byte valued 05/12. Only the character that encodes as a single byte valued 05/12 is a delimiter.

    For multi-valued Data Elements, existing implementations that are expecting only single-byte replacement character sets may misinterpret the Value Multiplicity of the Data Element as a consequence of interpreting 05/12 bytes in multi-byte characters or [ISO/IEC 2022] escape sequences as delimiters, and this may affect the integrity of store-and-forward operations. Applications that do not explicitly state support for [GB 18030], [GBK] or [ISO/IEC 2022] in their conformance statement, might exhibit such behavior.
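Two of the points in the notes above can be illustrated with a short Python sketch. The function names are hypothetical. escape_unsupported assumes an ASCII-only display that cannot render any Control Character, so it replaces every byte outside 02/00 to 07/14 with the four characters "\nnn". split_values_gbk assumes the Value is encoded in GBK and relies on Python's standard "gbk" codec, so that only a 05/12 byte that stands alone as a character is treated as a delimiter.

```python
from typing import List


def escape_unsupported(raw: bytes) -> str:
    """Render bytes for an ASCII-only display using three-digit octal escapes."""
    parts = []
    for byte in raw:
        if 0x20 <= byte <= 0x7E:
            parts.append(chr(byte))
        else:
            parts.append(f"\\{byte:03o}")
    return "".join(parts)


def split_values_gbk(raw: bytes) -> List[str]:
    """Split a GBK encoded multi-valued string on single-byte 05/12 only."""
    # Decoding first consumes any 05/12 byte that is the second byte of a
    # two-byte character, so it can no longer be mistaken for a delimiter.
    return raw.decode("gbk").split("\\")


# "Günther" in a single-byte repertoire where ü is encoded as 15/12
print(escape_unsupported(b"G\xfcnther"))  # G\374nther
```

A naive raw.split(b"\\") on GBK encoded bytes can report a higher Value Multiplicity than split_values_gbk, which is the misinterpretation the note warns about.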

6.1.2.4 Code Extension Techniques

For Data Elements with Value Representations of SH (Short String), LO (Long String), UC (Unlimited Characters), ST (Short Text), LT (Long Text), UT (Unlimited Text) or PN (Person Name), the Default Character Repertoire or the character repertoire specified by Value 1 of Attribute Specific Character Set (0008,0005), may be extended using the Code Extension techniques specified by ISO/IEC 2022:1994.

If such Code Extension techniques are used, the related Specific Character Set or Sets shall be specified by Value 2 to Value n of Specific Character Set (0008,0005) of the SOP Common Module (see PS3.3), and shall be stated in the Conformance Statement.

Note

  1. Defined Terms for Specific Character Set (0008,0005) are defined in PS3.3.

  2. Support for Japanese kanji (ideographic), hiragana (phonetic), katakana (phonetic), Korean (Hangul phonetic and Hanja ideographic) and Chinese characters is defined in PS3.3.

  3. The Chinese Character Set (GB18030) and Unicode [ISO/IEC 10646] do not allow the use of Code Extension Techniques. If either of these character sets is used, no other character set may be specified in the Specific Character Set (0008,0005) Attribute, that is, it may have only one Value.
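A minimal Python sketch of the single-Value restriction in note 3 follows. The function name uses_code_extension is hypothetical. The inclusion of "GBK" alongside "ISO_IR 192" and "GB18030" is an assumption based on Section 6.1.2.5.4, which forbids Code Extension for all three. The Defined Terms themselves are specified in PS3.3.

```python
from typing import List

# Defined Terms that must appear as the only Value (assumption: GBK included,
# following Section 6.1.2.5.4).
NO_CODE_EXTENSION = frozenset({"ISO_IR 192", "GB18030", "GBK"})


def uses_code_extension(values: List[str]) -> bool:
    """Return True if Specific Character Set (0008,0005) implies Code Extension.

    Raises ValueError if a character set that prohibits extension is
    combined with any other Value.
    """
    terms = [value.strip() for value in values]
    if len(terms) <= 1:
        return False
    offending = NO_CODE_EXTENSION.intersection(terms)
    if offending:
        raise ValueError(
            f"{sorted(offending)} may only be used as a single Value"
        )
    return True


print(uses_code_extension(["ISO_IR 192"]))          # False
print(uses_code_extension(["", "ISO 2022 IR 13"]))  # True
```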

6.1.2.5 Usage of Code Extension

DICOM supports Code Extension techniques if Specific Character Set (0008,0005) is multi-valued. The method employed for Code Extension in DICOM is as described in ISO/IEC 2022:1994. The following assumptions shall be made and the following restrictions shall apply:

6.1.2.5.1 Assumed Initial States

  • Code element G0 and code element G1 (in 8-bit mode only) are always invoked in the GL and GR areas of the code table respectively. Designated character sets for these code elements are immediately in use. Code elements G2 and G3 are not used.

  • The primary set of Control Characters shall always be designated as the C0 code element and this shall be invoked in the CL area of the code table. The C1 code element shall not be used.

6.1.2.5.2 Restrictions for Code Extension

  • As code elements G0 and G1 always have shift status, Locking Shifts (SI, SO) are not required and shall not be used.

  • As code elements G2 and G3 are not used, Single Shifts (SS2 and SS3) cannot be used.

  • Only the ESC sequences specified in PS3.3 shall be used to activate Code Elements.

6.1.2.5.3 Requirements

The character set specified by Value 1 of Specific Character Set (0008,0005), or the Default Character Repertoire if Value 1 is missing, shall be active at the beginning of each textual Data Element Value, and at the beginning of each line (i.e., after a CR and/or LF) or page (i.e., after an FF).

If within a textual Value a character set other than the one specified in Value 1 of Specific Character Set (0008,0005), or the Default Character Repertoire if Value 1 is missing, has been invoked, the character set specified in the Value 1, or the Default Character Repertoire if Value 1 is missing, shall be active in the following instances:

  • before the end of line (i.e., before the CR and/or LF)

  • before the end of a page (i.e., before the FF)

  • before any other Control Character other than ESC (e.g., before any TAB)

  • before the end of a Data Element Value (e.g., before the 05/12 character code that separates multiple textual Data Element Values - 05/12 corresponds to "\" (BACKSLASH) in the case of default repertoire IR-6 or "¥" (YEN SIGN) in the case of IR-14).

  • before the "^" and "=" delimiters separating name components and name component groups in Data Elements with a VR of PN.

If within a textual Value a character set other than the one specified in Value 1 of Specific Character Set (0008,0005), or the Default Character Repertoire if Value 1 is missing, is used, the Escape Sequence of this character set must be inserted explicitly in the following instances:

  • before the first use of the character set in the line

  • before the first use of the character set in the page

  • before the first use of the character set in the Data Element Value

  • before the first use of the character set in the name component and name component group in Data Elements with a VR of PN

Note

These requirements allow an application to skip lines, values, or components in a textual Data Element and start the new line with a defined character set without the need to track the character set changes in the text skipped. A similar restriction appears in the RFCs describing the use of multi-byte character sets over the Internet. An Escape Sequence switching to the Value 1 or default Specific Character Set is not needed within a line, value, or component if no Code Extensions are present. Nor is a switch needed to the Value 1 or default Specific Character Set if this character set has only the G0 Code Element defined, and the G0 Code Element is still active.
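The effect of these requirements on a PN Value can be sketched in Python under several stated assumptions. Value 1 is assumed to be empty, so ISO-IR 6 is the initial G0 set. The only extension set is assumed to be a two-byte set designated to G0 by ESC 02/04 04/02, with ESC 02/08 04/02 returning to ISO-IR 6. These escape sequences are taken from PS3.3, not from this section. The helper component_delimiters is hypothetical, and Python's "iso2022_jp" codec is used only as a convenient stand-in that emits the same two escape sequences.

```python
from typing import List

ESC_TO_ISO_IR_6 = b"\x1b(B"  # ESC 02/08 04/02: single-byte ISO-IR 6 as G0
ESC_TO_TWO_BYTE = b"\x1b$B"  # ESC 02/04 04/02: two-byte set as G0


def component_delimiters(encoded: bytes, delimiters: bytes = b"^=") -> List[int]:
    """Return offsets of PN delimiters seen while ISO-IR 6 is active."""
    offsets = []
    two_byte_active = False  # the initial state is the single-byte set
    i = 0
    while i < len(encoded):
        if encoded.startswith(ESC_TO_TWO_BYTE, i):
            two_byte_active, i = True, i + 3
        elif encoded.startswith(ESC_TO_ISO_IR_6, i):
            two_byte_active, i = False, i + 3
        elif two_byte_active:
            i += 2  # one two-byte character, never a delimiter
        else:
            if encoded[i] in delimiters:
                offsets.append(i)
            i += 1
    return offsets


encoded = "山田^太郎".encode("iso2022_jp")
for offset in component_delimiters(encoded):
    # The three bytes before each delimiter switch back to ISO-IR 6.
    print(offset, encoded[offset - 3:offset] == ESC_TO_ISO_IR_6)
```

Because the initial character set is restored before every delimiter, an application can start decoding at any name component without tracking the escape sequences in the components it skipped.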

6.1.2.5.4 Levels of Implementation and Initial Designation

  1. Attribute Specific Character Set (0008,0005) not present:

    • 7-bit code

    • Implementation level: [ISO/IEC 2022] Level 1 - Elementary 7-bit code (code-level identifier 1)

    • Initial designation: ISO-IR 6 (ASCII) as G0.

    • Code Extension shall not be used.

  2. Attribute Specific Character Set (0008,0005) single Value other than "ISO_IR 192", "GB18030" or "GBK":

    • 8-bit code

    • Implementation level: [ISO/IEC 2022] Level 1 - Elementary 8-bit code (code-level identifier 11)

    • Initial designation: One of the [ISO/IEC 8859] defined character sets, the 8-bit code table [TIS 620-2533], or the 8-bit code table of [JIS X 0201] specified by Value 1 of Specific Character Set (0008,0005), as G0 and G1.

    • Code Extension shall not be used.

  3. Attribute Specific Character Set (0008,0005) multi-valued:

    • 8-bit code

    • Implementation level: [ISO/IEC 2022] Level 4 - Redesignation of Graphic Character Sets within a Code (code-level identifier 14)

    • Initial designation: One of the [ISO/IEC 8859] defined character sets, the 8-bit code table [TIS 620-2533], or the 8-bit code table of [JIS X 0201] specified by Value 1 of Specific Character Set (0008,0005), as G0 and G1. If Value 1 of Specific Character Set (0008,0005) is empty, ISO-IR 6 (ASCII) is assumed as G0, and G1 is undefined.

    • All character sets specified in the various Values of Attribute Specific Character Set (0008,0005), including Value 1, may participate in Code Extension.

  4. Attribute Specific Character Set (0008,0005) single Value "ISO_IR 192", "GB18030" or "GBK":

    • variable length code

    • Implementation level: not specified (not compatible with [ISO/IEC 2022])

    • Initial designation: as specified by Value 1 of Specific Character Set (0008,0005)

    • Code Extension shall not be used.
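The four cases above can be summarized in a small Python sketch. The names CodeLevel and classify_specific_character_set are hypothetical. Treating an Attribute that is present with a single empty Value the same way as an absent Attribute is an assumption of this sketch, since the list above only addresses the "not present" case.

```python
from typing import List, NamedTuple, Optional

VARIABLE_LENGTH_TERMS = {"ISO_IR 192", "GB18030", "GBK"}


class CodeLevel(NamedTuple):
    code: str                      # "7-bit", "8-bit" or "variable length"
    iso2022_level: Optional[str]   # None when not compatible with ISO/IEC 2022
    code_extension_allowed: bool


def classify_specific_character_set(values: Optional[List[str]]) -> CodeLevel:
    """Map the Values of Specific Character Set (0008,0005) to cases 1 to 4."""
    terms = [value.strip() for value in values] if values else []
    if not terms or terms == [""]:
        # Case 1 (assumption: a lone empty Value is treated as not present)
        return CodeLevel("7-bit", "Level 1 (code-level identifier 1)", False)
    if len(terms) > 1:
        # Case 3: multi-valued
        return CodeLevel("8-bit", "Level 4 (code-level identifier 14)", True)
    if terms[0] in VARIABLE_LENGTH_TERMS:
        # Case 4: not compatible with ISO/IEC 2022
        return CodeLevel("variable length", None, False)
    # Case 2: any other single Value
    return CodeLevel("8-bit", "Level 1 (code-level identifier 11)", False)


print(classify_specific_character_set(None).code)            # 7-bit
print(classify_specific_character_set(["ISO_IR 192"]).code)  # variable length
```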

6.1.3 Control Characters

Textual data that is interchanged may require some formatting information. Control Characters are used to indicate formatting, but their use in DICOM is kept to a minimum since some machines may handle them inappropriately. [ISO 646] and ISO 6429:1990 define Control Characters. As shown in Table 6.1-1 below, only a subset of five Control Characters from the C0 set shall be used in DICOM for the encoding of Control Characters in text strings.

Table 6.1-1. DICOM Control Characters and Their Encoding

  Acronym   Name              Coded Value
  LF        Line Feed         00/10
  FF        Form Feed         00/12
  CR        Carriage Return   00/13
  ESC       Escape            01/11
  TAB       Horizontal Tab    00/09

The ESC character shall be used only for [ISO/IEC 2022] character set control sequences, in accordance with Section 6.1.2.5.
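A check for Control Characters outside the permitted subset might look like the following Python sketch. The function name disallowed_control_bytes is hypothetical. The sketch assumes a single-byte repertoire and inspects only the CL range (00/00 to 01/15) plus DELETE (07/15), which Section 6.1.2.3 excludes from DICOM character strings. It does not verify that ESC is followed by a valid [ISO/IEC 2022] sequence.

```python
from typing import List

# The five Control Characters permitted in DICOM text strings
ALLOWED_CONTROL = {
    0x09: "TAB",  # 00/09
    0x0A: "LF",   # 00/10
    0x0C: "FF",   # 00/12
    0x0D: "CR",   # 00/13
    0x1B: "ESC",  # 01/11
}
DELETE = 0x7F     # 07/15


def disallowed_control_bytes(raw: bytes) -> List[int]:
    """Return offsets of CL bytes (or DELETE) that DICOM does not permit."""
    return [
        offset
        for offset, byte in enumerate(raw)
        if (byte <= 0x1F or byte == DELETE) and byte not in ALLOWED_CONTROL
    ]


print(disallowed_control_bytes(b"line 1\r\nline 2\tend"))  # []
print(disallowed_control_bytes(b"bell\x07 and delete\x7f"))  # [4, 16]
```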

In text strings (Value Representation ST, LT, or UT) a new line shall be represented as CR LF.

Note

  1. Some machines (such as UNIX based machines) may interpret LF (00/10) as a new line. In such cases, it is expected that the DICOM format is converted to the correct internal representation for that machine.

  2. In previous editions of the Standard (see PS3.5 2015a), the TAB character was not listed as a Control Character.
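The conversion mentioned in note 1 can be sketched in Python as follows. The function names are hypothetical. Treating a lone CR in host text as a line break when writing is an assumption of this sketch.

```python
import re


def to_dicom_newlines(text: str) -> str:
    """Represent every line break in host text as CR LF."""
    # CR LF is matched first so an existing pair is not doubled.
    return re.sub(r"\r\n|\r|\n", "\r\n", text)


def from_dicom_newlines(text: str, newline: str = "\n") -> str:
    """Convert CR LF line breaks to the host's internal representation."""
    return text.replace("\r\n", newline)


print(repr(to_dicom_newlines("first\nsecond")))      # 'first\r\nsecond'
print(repr(from_dicom_newlines("first\r\nsecond")))  # 'first\nsecond'
```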
