C++ Character Types - dummies

By Stephen R. Davis

The standard char variable in C++ is a scant 1 byte wide and can represent only 256 different values. This is plenty for most European languages but nowhere near enough for writing systems with thousands of symbols, such as Japanese kanji.

Several standards have arisen to extend the character set to handle the demands of these languages. UTF-8 encodes each character as a sequence of one to four 8-bit units, which covers almost every kanji or hieroglyph you can think of while remaining compatible with plain ASCII. UTF-16 uses either one or two 16-bit units per character to achieve an expanded character set, and UTF-32 uses a single 32-bit unit for every character.
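To make the size differences concrete, here is a minimal sketch that reports how many storage units one character occupies in each encoding. The function names utf8Units, utf16Units, and utf32Units are invented for this illustration and are not part of the C++ library. The sketch assumes the argument is a valid Unicode code point (no larger than 0x10FFFF and not a surrogate).

```cpp
#include <cstdint>

// Number of 8-bit units UTF-8 needs for one code point.
// Assumes cp is a valid code point; no error checking is done.
inline int utf8Units(std::uint32_t cp)
{
    if (cp < 0x80)    return 1;  // ASCII stays a single byte
    if (cp < 0x800)   return 2;  // accented Latin, Greek, Cyrillic, ...
    if (cp < 0x10000) return 3;  // most kanji land here
    return 4;                    // everything above U+FFFF
}

// Number of 16-bit units UTF-16 needs for one code point.
inline int utf16Units(std::uint32_t cp)
{
    return cp < 0x10000 ? 1 : 2; // 2 means a surrogate pair
}

// UTF-32 always uses exactly one 32-bit unit.
inline int utf32Units(std::uint32_t)
{
    return 1;
}
```

For example, the kanji with code point 0x6F22 takes three units in UTF-8 but only one in UTF-16 and in UTF-32.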

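UTF stands for Unicode Transformation Format. Unicode itself is the character set that assigns a number to every character; UTF-8, UTF-16, and UTF-32 are different ways of storing those numbers, although all three are commonly just called Unicode.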

The following table describes the different character types supported by C++. At first, C++ tried to get by with a vaguely defined wide character type, wchar_t. This type was intended to be the wide character type native to the application program’s environment. C++11 introduced specific types for UTF-16 and UTF-32.

The C++ Character Types
Type      Example  What It Is
char      'c'      ASCII or UTF-8 character
wchar_t   L'c'     Character in wide format
char16_t  u'c'     UTF-16 character
char32_t  U'c'     UTF-32 character
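The prefix on a character literal is what selects its type. The following sketch, which assumes a C++11 or later compiler, asks the compiler to confirm each row of the table at compile time. The helper codeOf is a name made up for this example.

```cpp
#include <type_traits>

// Each check fails to compile if the literal does not have the listed type.
static_assert(std::is_same<decltype('c'),  char>::value,     "no prefix gives char");
static_assert(std::is_same<decltype(L'c'), wchar_t>::value,  "L prefix gives wchar_t");
static_assert(std::is_same<decltype(u'c'), char16_t>::value, "u prefix gives char16_t");
static_assert(std::is_same<decltype(U'c'), char32_t>::value, "U prefix gives char32_t");

// Hypothetical helper: the numeric code stored in a character of any type.
template <typename CharT>
constexpr unsigned long codeOf(CharT ch)
{
    return static_cast<unsigned long>(ch);
}
```

On an ASCII-based system, codeOf returns 0x63 for the letter c in all four types; only the amount of storage behind the value differs.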

UTF-16 is the native wide encoding for Windows applications. With the Code::Blocks/gcc compiler on Windows, wchar_t is 16 bits wide and holds UTF-16; gcc on Linux and macOS uses a 32-bit wchar_t instead.
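Because the width of wchar_t depends on the compiler and platform, it can be worth checking rather than assuming. This is a small sketch of one way to do so; wideCharBits and wideCharIsUtf16Sized are hypothetical names chosen for this example.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t, in bits, for the compiler building this file.
constexpr std::size_t wideCharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when wchar_t is a 16-bit type, as it is with gcc on Windows.
// A 16-bit wchar_t needs two units for characters above U+FFFF.
constexpr bool wideCharIsUtf16Sized()
{
    return wideCharBits() == 16;
}
```

Code that must behave the same everywhere is usually better off with char16_t or char32_t, whose encodings do not change from one platform to the next.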

Any of the character types in the table can be combined into strings as well:

const wchar_t* wideString = L"this is a wide string";
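The same prefixes work on string literals for every type in the table. The sketch below declares one string of each kind and adds a helper, unitCount (a name invented here), that counts storage units up to the terminating null by using std::char_traits from the standard library. It assumes a C++11 or later compiler.

```cpp
#include <cstddef>
#include <string>

// String literals are constant, so each pointer is declared const.
const char*     narrowString = "this is a narrow string";
const wchar_t*  wideString   = L"this is a wide string";
const char16_t* utf16String  = u"this is a UTF-16 string";
const char32_t* utf32String  = U"this is a UTF-32 string";

// Hypothetical helper: counts storage units, not characters,
// up to but not including the terminating null.
template <typename CharT>
std::size_t unitCount(const CharT* text)
{
    return std::char_traits<CharT>::length(text);
}
```

The count is in units rather than characters: a character above U+FFFF, written as \U0001F600 in a literal, counts as two units in a u"..." string but as one unit in a U"..." string.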