DICOM PS3.5 2021d - Data Structures and Encoding

J Character Sets and Person Name Value Representation using Unicode UTF-8, GB18030 and GBK (Informative)

The Unicode UTF-8 character set and the [GB 18030] character set may each be used for multiple languages. Some of these languages may also be encoded using other character sets that are defined elsewhere in the DICOM Standard. Because the Unicode UTF-8 and [GB 18030] encodings do not allow [ISO/IEC 2022] character set replacement, whichever of them is chosen must be used for all strings in a single SOP Instance. This may affect which character set is selected for encoding the SOP Instance.

Since the [GBK] character set is fully code point compatible with the larger [GB 18030] character set, and the specific examples of [GB 18030] encoding in this Annex (J.3 and J.4) include only Chinese characters that fall in the coding area common to the two standards, these examples also demonstrate person name and text encoding for [GBK]. Examples specific to [GBK] are therefore not necessary.
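The compatibility described above can be checked informally with a short sketch. It assumes Python's standard "gbk" and "gb18030" codecs as stand-ins for the two standards, and the sample string is an illustrative choice rather than one of the J.3 or J.4 examples.

```python
# Sketch: characters in the coding area shared by GBK and GB 18030
# should produce identical bytes under both encodings.
# The sample string is an assumption chosen for illustration.
sample = "王^小東"

gbk_bytes = sample.encode("gbk")
gb18030_bytes = sample.encode("gb18030")

# True when both codecs agree byte for byte on this sample.
same_encoding = gbk_bytes == gb18030_bytes

# Bytes written as GBK can be read back by a GB 18030 decoder.
round_trip = gbk_bytes.decode("gb18030")
```

If `same_encoding` is true for a given string, that string lies entirely in the common coding area, so one example serves for both character sets.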

J.1 Example of Person Name Value Representation in the Chinese Language Using Unicode

Example J.1-1. Example of Person Name Value Representation in the Chinese Language Using Unicode

Person names in the Chinese language may be written in Hanzi (ideographic characters), and/or Latin (alphabetic characters). The Latin representation may be derived using pinyin or another Romanization method, or may be a chosen "westernized" name. The two component groups should be written in the order of alphabetic, then ideographic; the phonetic component group is typically not used (see Table 6.2-1). In this example the traditional script is used.


Note

  1. Some healthcare information systems may encode a "westernized" name with other patient aliases in a separate attribute, e.g., Other Patient Names (0010,1091).

  2. Some environments using Chinese language may use the third name component, e.g., for the Yi or Mongolian script, with or without the first name component. This would be similar to the Japanese and Korean name component usage.
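The component group ordering described above (alphabetic, then ideographic, then phonetic) can be illustrated with a minimal sketch. The function name `split_person_name` and the dictionary keys are hypothetical, and the sketch assumes the usual Person Name delimiters: "=" between component groups and "^" between components.

```python
# Sketch: split a Person Name value into its component groups.
# split_person_name and the key names are hypothetical helpers,
# not part of any DICOM toolkit.
GROUP_NAMES = ("alphabetic", "ideographic", "phonetic")

def split_person_name(value: str) -> dict:
    groups = value.split("=")
    if len(groups) > len(GROUP_NAMES):
        raise ValueError("a Person Name has at most three component groups")
    # Pad missing trailing groups so all three keys are always present.
    groups += [""] * (len(GROUP_NAMES) - len(groups))
    return {
        name: group.split("^") if group else []
        for name, group in zip(GROUP_NAMES, groups)
    }

parsed = split_person_name("Wang^XiaoDong=王^小東=")
```

For this value the alphabetic and ideographic groups each hold a family name and a given name, and the phonetic group is empty.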

In the example below, the Specific Character Set attribute (0008,0005) would contain:

  • (0008,0005) ISO_IR 192

Text string:

  • Wang^XiaoDong=王^小東=

Character encoded representation is:

  • 0x57 0x61 0x6e 0x67 0x5e 0x58 0x69 0x61 0x6f 0x44 0x6f 0x6e 0x67 0x3d 0xe7 0x8e 0x8b 0x5e 0xe5 0xb0 0x8f 0xe6 0x9d 0xb1 0x3d
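The byte sequence above can be reproduced with a short sketch that uses only Python's built-in UTF-8 encoder. The variable names are illustrative.

```python
# Sketch: encode the example Person Name as UTF-8 (ISO_IR 192)
# and format the result in the 0x.. notation used in this example.
text = "Wang^XiaoDong=王^小東="

encoded = text.encode("utf-8")
hex_bytes = " ".join(f"0x{b:02x}" for b in encoded)
```

The 14 ASCII characters and the two remaining delimiters take one byte each, and each of the three Hanzi takes three bytes, for 25 bytes in total.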


The bytes 0xe7 0x8e 0x8b, 0xe5 0xb0 0x8f and 0xe6 0x9d 0xb1 in the representation above correspond to the Unicode code points for the Chinese characters:

  • 王 (U+738B)

  • 小 (U+5C0F)

  • 東 (U+6771)

and the corresponding UTF-8 encodings are:

  • UTF-8 (U+738B) = 0xe7 0x8e 0x8b

  • UTF-8 (U+5C0F U+6771) = 0xe5 0xb0 0x8f 0xe6 0x9d 0xb1
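The mapping from these code points to their three-byte encodings can be worked through by hand. The sketch below assumes the standard UTF-8 three-byte layout (1110xxxx 10xxxxxx 10xxxxxx) for code points U+0800 through U+FFFF; the function name `utf8_three_byte` is hypothetical.

```python
# Sketch: build the three-byte UTF-8 form of a code point by bit
# manipulation. utf8_three_byte is a hypothetical helper.
def utf8_three_byte(code_point: int) -> bytes:
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point does not use the three-byte form")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    return bytes([
        0xE0 | (code_point >> 12),          # top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # middle 6 bits
        0x80 | (code_point & 0x3F),         # low 6 bits
    ])

family_name = utf8_three_byte(0x738B)
given_name = utf8_three_byte(0x5C0F) + utf8_three_byte(0x6771)
```

The results can be compared against the built-in encoder, for example `"王".encode("utf-8")`.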
