Kalle Olavi Niemitalo
2018-11-05 17:09:06 UTC
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: msys
Compiler: gcc
Compilation CFLAGS: -DPROGRAM='bash.exe' -DCONF_HOSTTYPE='x86_64'
-DCONF_OSTYPE='msys' -DCONF_MACHTYPE='x86_64-pc-msys' -DCONF_VENDOR='pc'
-DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H
-DRECYCLES_PIDS -I. -I. -I./include -I./lib -DWORDEXP_OPTION
-Wno-discarded-qualifiers -march=x86-64 -mtune=generic -O2 -pipe
-Wno-parentheses -Wno-format-security -D_STATIC_BUILD -g
uname output: MINGW64_NT-6.1 fjkallen 2.10.0(0.325/5/3) 2018-07-25 13:06
x86_64 Msys
Machine Type: x86_64-pc-msys
Bash Version: 4.4
Patch Level: 19
Release Status: release
Description:
The builtin printf '\uFEFF' outputs ED 9F BF ED BB BF in a
UTF-8 locale on Microsoft Windows, where sizeof(wchar_t) == 2.
It should output EF BB BF, like printf (GNU coreutils) 8.30
does.
The incorrect output ED 9F BF ED BB BF is a UTF-8-like encoding
of U+D7FF U+DEFF. That resembles a UTF-16 surrogate pair, but
only U+DEFF is a surrogate; U+D7FF lies just below the surrogate
range (U+D800...U+DFFF).
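The observed bytes can be reproduced with a little arithmetic. The
sketch below is not the bash source. It assumes that the conversion
applies the usual surrogate-pair arithmetic to a code point below
U+10000 with 32-bit unsigned math, and that each resulting 16-bit
unit is then written as three UTF-8-style bytes. The helper names
split_as_pair and encode3 are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: apply surrogate-pair arithmetic to c even
   when c is below 0x10000 (the suspected faulty path). */
void split_as_pair(uint32_t c, uint16_t out[2])
{
    c -= 0x10000;                             /* wraps around for c < 0x10000 */
    out[0] = (uint16_t)((c >> 10) + 0xD800);  /* high bits lost in truncation */
    out[1] = (uint16_t)((c & 0x3FF) + 0xDC00);
}

/* Hypothetical helper: write one 16-bit unit in 0x0800..0xFFFF as
   three UTF-8-style bytes, without rejecting surrogates. */
void encode3(uint16_t u, unsigned char out[3])
{
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
}
```

Under those assumptions, split_as_pair(0xFEFF, u) yields 0xD7FF and
0xDEFF, and encode3 turns them into ED 9F BF and ED BB BF, which is
the output seen above.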
Repeat-By:
Install Git for Windows 2.19.1, on Windows 7 SP1.
Start "Git Bash" from the Start menu.
Run the command:
env --ignore-environment LANG=en_US.UTF-8 \
/usr/bin/bash --noprofile -c 'builtin printf "\ufeff"' \
| od -t x1
Fix:
In lib/sh/unicode.c, change u32toutf16 to treat characters in the
U+E000...U+FFFF range just like the U+0000...U+D7FF range, i.e.
copy them unchanged to the output and not make a surrogate pair.
I have not tested this change, but the function clearly mishandles
that range, and the bug matches the symptoms perfectly.
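A minimal sketch of the suggested behaviour follows. It is untested
against bash and is not a patch to lib/sh/unicode.c. The function
name u32_to_utf16_sketch, its signature, and the choice to return 0
for surrogates and out-of-range values are assumptions made for
illustration.

```c
#include <stdint.h>

/* Sketch only: convert one code point to UTF-16.
   s must have room for three units (two plus a terminating 0).
   Returns the number of units written, or 0 for invalid input. */
int u32_to_utf16_sketch(uint32_t c, uint16_t *s)
{
    int n = 0;

    if (c < 0xD800 || (c >= 0xE000 && c <= 0xFFFF)) {
        /* Basic Multilingual Plane, outside the surrogate range:
           copy unchanged. */
        s[0] = (uint16_t)c;
        n = 1;
    } else if (c >= 0x10000 && c <= 0x10FFFF) {
        /* Supplementary planes: build a surrogate pair. */
        c -= 0x10000;
        s[0] = (uint16_t)(0xD800 + (c >> 10));
        s[1] = (uint16_t)(0xDC00 + (c & 0x3FF));
        n = 2;
    }
    s[n] = 0;
    return n;
}
```

With this logic U+FEFF stays a single unit 0xFEFF, which encodes to
EF BB BF in UTF-8, while a supplementary character such as U+1F600
still becomes the pair 0xD83D 0xDE00.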