encoding convertfrom

encoding convertfrom ?encoding? string

The low 8 bits of each character in string are taken as a single byte, and the resulting sequence of bytes is converted from encoding to a Unicode string. If encoding is not specified, the system encoding is used.
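As a small illustration of the description above, the following sketch decodes the same two bytes under two different encodings. The variable names are only for illustration; the byte values 0xC3 0xA9 are the utf-8 encoding of "é".

```tcl
# Build a two-byte binary string: 0xC3 0xA9.
set bytes [binary format H* c3a9]

# Interpreted as utf-8, the two bytes form one character (U+00E9).
set asUtf8 [encoding convertfrom utf-8 $bytes]

# Interpreted as iso8859-1, each byte becomes its own character
# (U+00C3 followed by U+00A9).
set asLatin1 [encoding convertfrom iso8859-1 $bytes]

puts [string length $asUtf8]    ;# 1
puts [string length $asLatin1]  ;# 2
```

The bytes are identical in both calls; only the encoding argument decides how they are grouped into characters.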

See Also

ycl string decode
An alternative that returns an error rather than losing information.

Invalid Data

PYK 2017-08-19: If a string to be converted from utf-8 contains invalid utf-8 byte sequences, each invalid byte is interpreted as an 8-bit integer and converted to the Unicode character at that code point. That is, encoding convertfrom utf-8 never fails, so it cannot be used to determine whether a string is valid utf-8.

set value [binary format c 239]
set value [encoding convertfrom utf-8 $value]
scan $value %c codepoint ; # $codepoint == 239

For comparison, here is the same operation on a valid utf-8 sequence:

set value [binary format ccc 239 188 129]
set value [encoding convertfrom utf-8 $value]
scan $value %c codepoint ; # $codepoint == 65281
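Given the behaviour shown above, one rough way to spot invalid input is to decode, re-encode, and compare the result with the original bytes. This is only a heuristic sketch, not an authoritative validator: looksLikeUtf8 is a hypothetical helper name, edge cases such as overlong sequences or surrogates may still pass depending on the Tcl release, and the catch is there on the assumption that some releases raise an error on invalid data instead of mapping bytes through.

```tcl
# Hypothetical helper: returns 1 if bytes survive a utf-8 round trip
# unchanged, 0 otherwise. A heuristic, not a strict validator.
proc looksLikeUtf8 {bytes} {
    # Decode; treat an error (if the interpreter raises one) as invalid.
    if {[catch {encoding convertfrom utf-8 $bytes} decoded]} {
        return 0
    }
    # Re-encode the decoded string back to utf-8 bytes.
    if {[catch {encoding convertto utf-8 $decoded} encoded]} {
        return 0
    }
    # An invalid byte decoded as a lone character re-encodes to a
    # different byte sequence, so the comparison fails.
    return [expr {$encoded eq $bytes}]
}

puts [looksLikeUtf8 [binary format c 239]]            ;# 0
puts [looksLikeUtf8 [binary format ccc 239 188 129]]  ;# 1
```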

Fonts and Encodings

MG: I have a bit of a strange problem, hopefully someone can help. This example script shows what I'm trying to do - it displays cp437-encoded text in a text widget:

text .t -font Term
pack .t
.t insert end [format %c 152]
.t insert end [encoding convertfrom cp437 [format %c 152]]

The Term font being used is available at http://8bit.memoryleak.org/Flag/Term.ttf and is designed for displaying cp437-encoded text.

Character 152 in cp437 is a y-umlaut. However, the first insert displays a placeholder character (a solid down-arrow) instead. The second does display a y-umlaut, but it does so by mapping to character 255, which isn't available in the Term font (because it has no meaning in cp437), so Tcl uses a fallback font, and it looks totally wrong (Term is fixed-width and quite bold; the fallback font, Lucida Sans Unicode, doesn't match up at all).

I can use the Term font in other (non-Tcl) applications, for instance MS Word, and insert char 152, which gives a y-umlaut without any problem. I honestly have no idea what's causing this issue; can anyone shed any light?
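The mapping MG describes can be checked without Tk. The sketch below assumes the cp437 encoding is available in the running interpreter (it normally ships with Tcl) and only confirms which code point the text widget receives; it does not address the font fallback itself.

```tcl
# Byte 152 (0x98) interpreted as cp437 text.
set byte [binary format c 152]
set char [encoding convertfrom cp437 $byte]

# cp437 places y-umlaut at 0x98; in Unicode that character is U+00FF.
scan $char %c codepoint
puts $codepoint ;# 255
```

So the second insert in the script hands the widget U+00FF, and it is then up to the font to supply a glyph for that code point.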