Feature Compare: Characters to Code Points

In this article we compare how to convert characters (single-character strings) to their ASCII or Unicode code points, and back again, in various languages.

For instance, character "A" has code point 65, character "~" has code point 126, and character "ด" has code point 3604.

AWK

AWK doesn’t have any builtins dedicated to these mappings, but it does have sprintf, which can be used to map a number to a single-character string:

char = sprintf ("%c", codepoint)

To map characters to code points, we need to do some preprocessing. For instance, if we need to map printable ASCII characters to code points, we may want to do:

BEGIN {
    for (o = 32; o < 127; o ++) {
        c = sprintf ("%c", o)
        ord [c] = o
    }
}

Now in the rest of the program, we can do:

codepoint = ord [char]

assuming that char is a printable ASCII character.
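
The technique here is to take a function that only goes from code point to character and invert it by precomputing a table. As a rough illustration of that idea outside AWK, here is a minimal Python sketch. Python has ord built in, so it does not need this; the names build_ord_table and ord_table are made up for the sketch.

```python
def build_ord_table(start=32, stop=127):
    """Invert a code-point-to-character function into a lookup table."""
    table = {}
    for o in range(start, stop):
        c = chr(o)        # plays the role of sprintf ("%c", o) in AWK
        table[c] = o
    return table

ord_table = build_ord_table()
print(ord_table["A"])     # 65
print(ord_table["~"])     # 126
print(len(ord_table))     # 95 printable ASCII characters
```

As in the AWK version, a lookup for a character outside the precomputed range fails, so the range has to cover every character the program expects to see.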

Do note that AWK does not know about Unicode, but gAWK (GNU AWK) does.

Example (gAWK)

BEGIN {
    for (o = 32; o < 3700; o ++) {
        c = sprintf ("%c", o)
        ord [c] = o
    }
}

BEGIN {
    print ord ["A"]        # Prints: 65
    print ord ["ด"]        # Prints: 3604
    printf "%c\n", 70      # Prints: F
    printf "%c\n", 3604    # Prints: ด
}

Bash

For bash, we need to resort to printf. The builtin behaves more or less like printf(1), and printf(1) has subtle differences between the BSD and GNU implementations. This is a bit of a problem: we cannot rely on the %c format to take an integer and return the character with that code point. In bash's builtin, as in BSD printf(1), the %c format takes a string and returns its first character. So we cannot use the %c format to map code points to characters.

But the printf builtin (and printf(1)) has a feature not present in the C implementation of printf(3): if the format string contains "\xHH", for some hexadecimal number HH, then this is mapped to the character with HH as code point. With at most two hexadecimal digits, this covers code points 0 to 255.

So, to turn a code point into a character, we’re going to use printf twice. First to create a string with a "\xHH" escape, and then another printf to turn this into a character:

printf -v char "\x$(printf %x $codepoint)"

This takes the code point in $codepoint, and places the corresponding character in $char.

To go from a character to a code point, we need to use another feature of the printf builtin/printf(1) utility: if an argument of printf starts with a single or double quote, its value is the code point of the first character following the quote. Note that we have to be careful when using this syntax: the quote character itself is special to the shell, so we need to escape it.

We can then use the following:

printf -v codepoint "%d" "'$char"

This takes the character in the variable $char, and places the corresponding code point into $codepoint.
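
To make the two-step idea explicit (first build the text of an escape sequence, then have it interpreted), here is a small Python sketch that mimics both directions. It is an analogy rather than a description of how bash works internally: the function names are hypothetical, and Python's unicode_escape codec stands in for the shell's handling of "\xHH".

```python
def codepoint_to_char(codepoint):
    """Mimic: printf "\\x$(printf %x $codepoint)", for code points 0..255."""
    if not 0 <= codepoint <= 0xFF:
        raise ValueError("\\xHH only has room for two hex digits")
    escape = "\\x%02x" % codepoint       # step 1: the literal text \x46
    # step 2: interpret the escape sequence, as the outer printf does
    return escape.encode("ascii").decode("unicode_escape")

def char_to_codepoint(char):
    """Mimic: printf "%d" "'$char", the code point of the first character."""
    return ord(char[0])

print(codepoint_to_char(70))    # F
print(char_to_codepoint("A"))   # 65
```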

Example

printf "%d\n" "'A"             # Prints: 65
printf "\x$(printf %x 70)\n"   # Prints: F

C

In C, characters are code points: a string is just an array of numbers, where each number is the corresponding code point. So we can just use:

ch [0] = codepoint;
codepoint = ch [0];

where codepoint contains the code point, and ch is a string (of type char *, and appropriately malloc-ed).

Single-quoted characters are also just code points, so we can also do things like:

codepoint = 'A';

Example

char ch [3];
ch [0] = 70;
ch [1] = 'G';
ch [2] = '\0';
printf ("%d\n", 'A');   /* Prints: 65 */
printf ("%s\n", ch);    /* Prints: FG */
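
One caveat is that a C char is a single byte, so "element equals code point" only holds for single-byte characters such as ASCII. The following Python sketch uses a bytearray as a stand-in for a C char buffer to show the difference. It assumes the text is encoded as UTF-8, which the article does not state for C.

```python
# A bytearray behaves much like a C char buffer: each element is a small integer.
ch = bytearray(2)
ch[0] = 70            # like ch [0] = 70 in C
ch[1] = ord("G")      # like ch [1] = 'G' in C
print(ch.decode("ascii"))     # FG

# Assumption: UTF-8 encoding. A character outside ASCII then occupies
# several elements, none of which is its code point.
encoded = "ด".encode("utf-8")
print(list(encoded))          # [224, 184, 148]
print(ord("ด"))               # 3604
```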

Lua

In Lua, there are methods char and byte in the string module to map between characters and code points:

char = string . char (codepoint)
codepoint = string . byte (char)

The string module does not have to be explicitly loaded.

Example

print (string . byte ("A"))  -- Prints: 65
print (string . char (70))   -- Prints: F

Node.js

Node.js has the methods fromCodePoint and codePointAt on the String object to map between characters and code points:

char = String . fromCodePoint (codepoint, ...)
codepoint = char . codePointAt (0)

Note that fromCodePoint is called as a method on String itself, while codePointAt is called on a string value. The latter takes a parameter indicating the position in the string from which we want the code point.

Node.js also has fromCharCode and charCodeAt, which are subtly different. Node.js works with UTF-16, which means each character is encoded as one or two 16-bit units. While fromCodePoint and codePointAt deal with Unicode code points, fromCharCode and charCodeAt deal with those 16-bit units. This only becomes relevant for code points outside of the Basic Multilingual Plane; that is, for code points exceeding U+FFFF (65535).

The String object does not need to be loaded explicitly.
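
To see what "one or two 16-bit units" means in practice, here is a small Python sketch that lists the UTF-16 units of a string. The helper utf16_code_units is made up for this illustration. U+1F600 is used as an example of a code point outside the Basic Multilingual Plane.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit units that UTF-16 uses to encode text."""
    raw = text.encode("utf-16-be")       # big-endian, no byte order mark
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

print(utf16_code_units("ด"))             # [3604]
print(utf16_code_units("\U0001F600"))    # [55357, 56832]
print(ord("\U0001F600"))                 # 128512
```

For the second string, codePointAt (0) corresponds to the single value 128512, while charCodeAt (0) corresponds to the first unit of the pair, 55357.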

Example

console . log ("A" . codePointAt (0))               // Prints: 65
console . log ("ด" . codePointAt (0))               // Prints: 3604
console . log (String . fromCodePoint (70))         // Prints: F
console . log (String . fromCodePoint (70, 72, 74)) // Prints: FHJ
console . log (String . fromCodePoint (3604))       // Prints: ด

Perl

Perl has chr which takes a code point and returns the corresponding character, and ord which takes a character and returns the corresponding code point:

$char = chr $codepoint;
$codepoint = ord $char;

This works for all Unicode characters.

Example

say ord "A";    # Prints: 65
say ord "ด";    # Prints: 3604
say chr 70;     # Prints: F
say chr 3604;   # Prints: ด

Python

Just like Perl, Python has chr and ord, which map between code points and characters:

char = chr (codepoint)
codepoint = ord (char)

In Python 2, this works for characters with code points up to 255; Python 3 supports the full Unicode range.

Example

print (ord ("A"))   # Prints: 65
print (ord ("ด"))   # Prints: 3604
print (chr (70))    # Prints: F
print (chr (3604))  # Prints: ด

Ruby

Ruby has chr and ord which are methods called on integers and strings:

char = codepoint . chr
codepoint = char . ord

Without arguments, chr only works on numbers 0 - 255, giving a range error on larger numbers. To make use of the full Unicode range, pass in the encoding:

char = codepoint . chr (Encoding::UTF_8)

Example

puts "A" . ord                       # Prints: 65
puts "ด" . ord                       # Prints: 3604
puts 70 . chr                        # Prints: F
puts 3604 . chr (Encoding::UTF_8)    # Prints: ด