printf '\uFEFF' outputs invalid UTF-8 on Windows
Kalle Olavi Niemitalo
2018-11-05 17:09:06 UTC
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: msys
Compiler: gcc
Compilation CFLAGS: -DPROGRAM='bash.exe' -DCONF_HOSTTYPE='x86_64'
-DCONF_OSTYPE='msys' -DCONF_MACHTYPE='x86_64-pc-msys' -DCONF_VENDOR='pc'
-DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H
-DRECYCLES_PIDS -I. -I. -I./include -I./lib -DWORDEXP_OPTION
-Wno-discarded-qualifiers -march=x86-64 -mtune=generic -O2 -pipe
-Wno-parentheses -Wno-format-security -D_STATIC_BUILD -g
uname output: MINGW64_NT-6.1 fjkallen 2.10.0(0.325/5/3) 2018-07-25 13:06
x86_64 Msys
Machine Type: x86_64-pc-msys

Bash Version: 4.4
Patch Level: 19
Release Status: release

Description:
The builtin printf '\uFEFF' outputs ED 9F BF ED BB BF in a
UTF-8 locale on Microsoft Windows, where sizeof(wchar_t) == 2.
It should output EF BB BF, like printf (GNU coreutils) 8.30
does.

The incorrect output ED 9F BF ED BB BF is a UTF-8-like encoding
of U+D7FF U+DEFF. That looks somewhat like a UTF-16 surrogate
pair, but U+D7FF lies just below the surrogate range
(U+D800...U+DFFF); only U+DEFF is a valid low surrogate.
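
The arithmetic behind those bytes can be reproduced with a small sketch. It assumes the surrogate-pair formula is applied to a code point below U+10000 and that each resulting 16-bit unit is then encoded as a three-byte UTF-8 sequence. The helper names naive_surrogates and utf8_3byte are hypothetical and are not taken from the bash source.

```c
#include <stdint.h>

/* Hypothetical helper: apply the UTF-16 surrogate-pair formula without
   first checking that c >= 0x10000. For a BMP code point the unsigned
   subtraction wraps around, which is what this sketch illustrates. */
void naive_surrogates(uint32_t c, uint16_t out[2])
{
    c -= 0x10000;
    out[0] = (uint16_t)((c >> 10) + 0xD800);
    out[1] = (uint16_t)((c & 0x3FF) + 0xDC00);
}

/* Hypothetical helper: encode one 16-bit value in the range
   0x0800..0xFFFF as a three-byte UTF-8-style sequence. */
void utf8_3byte(uint16_t u, unsigned char out[3])
{
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
}
```

Under these assumptions, naive_surrogates(0xFEFF, pair) yields 0xD7FF and 0xDEFF, and utf8_3byte turns them into ED 9F BF and ED BB BF, while utf8_3byte(0xFEFF, ...) gives the expected EF BB BF.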

Repeat-By:
Install Git for Windows 2.19.1, on Windows 7 SP1.
Start "Git Bash" from the Start menu.
Run the command:
env --ignore-environment LANG=en_US.UTF-8 \
/usr/bin/bash --noprofile -c 'builtin printf "\ufeff"' \
| od -t x1

Fix:
In lib/sh/unicode.c, change u32toutf16 to treat characters in the
U+E000...U+FFFF range just like those in the U+0000...U+D7FF range,
i.e. copy them unchanged to the output without making a surrogate pair.
I did not test that change, but the function clearly has a bug and
it matches the symptoms perfectly.
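
A minimal sketch of the range handling described above follows. It is not the bash source and is untested against bash itself. The function name to_utf16_sketch and its return convention (number of 16-bit units written, 0 for values that cannot be encoded) are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for the conversion described above.
   Writes up to two UTF-16 code units to s and returns how many
   were written; returns 0 for surrogate code points and for
   values above U+10FFFF. */
int to_utf16_sketch(uint32_t c, uint16_t *s)
{
    /* BMP characters on either side of the surrogate range are
       copied unchanged as a single unit. */
    if (c < 0xD800 || (c >= 0xE000 && c <= 0xFFFF)) {
        s[0] = (uint16_t)c;
        return 1;
    }
    /* Only supplementary-plane characters need a surrogate pair. */
    if (c >= 0x10000 && c <= 0x10FFFF) {
        c -= 0x10000;
        s[0] = (uint16_t)((c >> 10) + 0xD800);
        s[1] = (uint16_t)((c & 0x3FF) + 0xDC00);
        return 2;
    }
    return 0;
}
```

With this shape, U+FEFF stays a single unit 0xFEFF, and a supplementary character such as U+1F600 becomes the pair 0xD83D 0xDE00.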
Chet Ramey
2018-11-05 19:27:37 UTC
Post by Kalle Olavi Niemitalo
Bash Version: 4.4
Patch Level: 19
Release Status: release
The builtin printf '\uFEFF' outputs ED 9F BF ED BB BF in a
UTF-8 locale on Microsoft Windows, where sizeof(wchar_t) == 2.
It should output EF BB BF, like printf (GNU coreutils) 8.30
does.
Thanks for the report. This has been fixed for almost exactly two years
in the devel branch, the result of

http://lists.gnu.org/archive/html/bug-bash/2016-11/msg00039.html

and is fixed in the bash-5.x alpha and beta versions.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU ***@case.edu http://tiswww.cwru.edu/~chet/