built-in regex matches wrong character

Discussion:

m***@mamatb-laptop

2018-09-05 18:50:12 UTC

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS: -DPROGRAM='bash' -DCONF_HOSTTYPE='x86_64' -DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='x86_64-unknown-linux-gnu' -DCONF_VENDOR='unknown' -DLOCALEDIR='/usr/local/share/locale' -DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H -I. -I. -I./include -I./lib -g -O2 -Wno-parentheses -Wno-format-security
uname output: Linux mamatb-laptop 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Machine Type: x86_64-unknown-linux-gnu

Bash Version: 4.4
Patch Level: 0
Release Status: release

Description:
It seems like bash's built-in regex matching accepts some symbols that it shouldn't. The following commands show this:
[[ 'º' =~ [o-p] ]] && [[ ! 'º' =~ o ]] && [[ ! 'º' =~ p ]] && echo 'º between o and p but none of them'
[[ 'ª' =~ [a-b] ]] && [[ ! 'ª' =~ a ]] && [[ ! 'ª' =~ b ]] && echo 'ª between a and b but none of them'

Repeat-By:
I actually found this while developing a larger bash script, but it can be reproduced with the lines above. Could you reply to me at ***@gmail.com to let me know whether this is in fact a bug? Thanks.

Eric Blake

2018-09-05 20:39:01 UTC

[[ 'º' =~ [o-p] ]] && [[ ! 'º' =~ o ]] && [[ ! 'º' =~ p ]] && echo 'º between o and p but none of them'
[[ 'ª' =~ [a-b] ]] && [[ ! 'ª' =~ a ]] && [[ ! 'ª' =~ b ]] && echo 'ª between a and b but none of them'

Not a bug, but a property of your locale.

POSIX says that range expressions in regular expressions are
implementation-defined except for in the C locale, which means [a-b] is
free to match more than just the two ASCII characters 'a' and 'b', but
rather anything that your current locale considers equivalent.

If you run your script with LC_ALL=C in the environment, you won't have
that problem (because there, [a-b] is well-defined to be exactly two
characters). Or, you can use bash's 'shopt -s globasciiranges' which is
supposed to enable Rational Range Interpretation, where even in non-C
locales, a character range bounded by two ASCII characters takes on the
C locale definition of only the ASCII characters in that range, rather
than the locale's definition of whatever other characters might also be
equivalent (actually, while I know that shopt affects globbing, I don't
know if it also affects regex matching - but if it doesn't, that's
probably a bug that should be fixed).
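
A minimal sketch of the LC_ALL=C suggestion above, scoped to a single function so the rest of a script keeps its own locale. The helper name c_locale_match is made up for illustration and does not come from the thread; the sketch assumes bash re-reads the locale when LC_ALL is assigned inside the function, and that the script file is UTF-8 encoded.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the thread): run one regex match in the C locale.
c_locale_match() {
    # LC_ALL is local, so the caller's locale applies again after return.
    local LC_ALL=C
    # $2 is left unquoted on purpose so bash treats it as a regex.
    [[ $1 =~ $2 ]]
}

if c_locale_match 'º' '[o-p]'; then
    echo 'º matched [o-p]'
else
    echo 'º did not match [o-p]'
fi
```

Setting LC_ALL=C for the whole script also works, but it changes sorting, messages, and multibyte handling everywhere, so a narrow scope may be the safer choice.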

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org

Miguel Amat

2018-09-05 22:48:35 UTC

Thanks for your response Eric, please find my attached screenshot
testing both solutions. Seems like setting LC_ALL=C in the environment
works fine while 'shopt -s globasciiranges' does not (also I could be
testing this the wrong way, first time using shopt).

Regards,
Miguel

Chet Ramey

2018-09-06 14:24:17 UTC

Post by Miguel Amat
Thanks for your response Eric, please find my attached screenshot
testing both solutions. Seems like setting LC_ALL=C in the environment
works fine while 'shopt -s globasciiranges' does not (also I could be
testing this the wrong way, first time using shopt).

globasciiranges isn't going to change things here, as explained in my
previous message.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU ***@case.edu http://tiswww.cwru.edu/~chet/

Chet Ramey

2018-09-06 14:17:10 UTC

Post by Eric Blake
Or, you can use bash's 'shopt -s globasciiranges' which is
supposed to enable Rational Range Interpretation, where even in non-C
locales, a character range bounded by two ASCII characters takes on the C
locale definition of only the ASCII characters in that range, rather than
the locale's definition of whatever other characters might also be
equivalent (actually, while I know that shopt affects globbing, I don't
know if it also affects regex matching - but if it doesn't, that's probably
a bug that should be fixed).

Since bash uses the C library's regexp engine, and most C libraries don't
implement RRI, much less expose it as a flags option available via
regcomp(), there's no reason to expect that globasciiranges would have
any effect on regular expression matching.

Chet

Eric Blake

2018-09-06 14:23:33 UTC

Post by Chet Ramey

But bash could be taught to convert any regex that contains a range with
both endpoints ASCII into a different bracket expression before handing
things over to regcomp(). That is, if the user is matching against
[a-d], bash hands [abcd] to regcomp() instead. You don't need a flag in
regcomp() to get RRI, just merely some pre-processing (and often memory
allocation, as the expansion of a range into a non-range tends to
require more characters).
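
The pre-processing idea can be approximated in a script today. The sketch below is an illustration only, not code from bash: expand_ascii_range is a hypothetical helper that turns two ASCII endpoints into the explicit list of characters between them, which the caller then places inside brackets. It does not parse a whole regex, and it refuses ranges that would include characters special inside a bracket expression.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of bash): print each ASCII character from $1 to $2.
expand_ascii_range() {
    local LC_ALL=C              # count and convert bytes, not locale characters
    local lo hi code octal ch out=''
    # Both endpoints must be exactly one byte long.
    [[ ${#1} -eq 1 && ${#2} -eq 1 ]] || return 1
    printf -v lo '%d' "'$1"     # numeric value of the first endpoint
    printf -v hi '%d' "'$2"     # numeric value of the second endpoint
    # Accept only printable 7-bit ASCII in ascending order.
    (( lo >= 32 && lo <= hi && hi <= 126 )) || return 1
    for (( code = lo; code <= hi; code++ )); do
        # Reject - [ \ ] ^ because they are special inside brackets.
        case $code in 45|91|92|93|94) return 1 ;; esac
        printf -v octal '%03o' "$code"
        printf -v ch '%b' "\\0$octal"
        out+=$ch
    done
    printf '%s\n' "$out"
}

chars=$(expand_ascii_range a d) && [[ 'c' =~ [$chars] ]] && echo "c matches [$chars]"
```

A real implementation inside bash would have to walk the full pattern and cope with negation, character classes such as [:alpha:], and a leading ], none of which this sketch attempts.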

Chet Ramey

2018-09-06 14:25:22 UTC

Post by Eric Blake
But bash could be taught to convert any regex that contains a range with
both endpoints ASCII into a different bracket expression before handing
things over to regcomp(). That is, if the user is matching against [a-d],
bash hands [abcd] to regcomp() instead. You don't need a flag in regcomp()
to get RRI, just merely some pre-processing (and often memory allocation,
as the expansion of a range into a non-range tends to require more
characters).

Someone would have to write that code.

Aharon Robbins

2018-09-06 17:39:37 UTC

Post by Eric Blake
But bash could be taught to convert any regex that contains a range with
both endpoints ASCII into a different bracket expression before handing
things over to regcomp(). That is, if the user is matching against
[a-d], bash hands [abcd] to regcomp() instead. You don't need a flag in
regcomp() to get RRI, just merely some pre-processing (and often memory
allocation, as the expansion of a range into a non-range tends to
require more characters).

This is easy and inexpensive for ASCII only. Full RRI does the
same thing for wide character sets as well, though, and there
the possibility for using very large amounts of memory makes the
rewrite-the-range idea less palatable.

--
Aharon (Arnold) Robbins arnold AT skeeve DOT com

Eric Blake

2018-09-06 17:58:17 UTC

Post by Aharon Robbins

Post by Eric Blake
But bash could be taught to convert any regex that contains a range with
both endpoints ASCII into a different bracket expression before handing
things over to regcomp(). That is, if the user is matching against
[a-d], bash hands [abcd] to regcomp() instead. You don't need a flag in
regcomp() to get RRI, just merely some pre-processing (and often memory
allocation, as the expansion of a range into a non-range tends to
require more characters).

Indeed. But the bash option is named 'globasciiranges', and I find far
more use in having ranges with both endpoints in single-byte ASCII
behaving sanely than I do for ranges with one or more ends resulting in
a multibyte character (by the time my regex involves multibyte
characters, I am already admitting that I am in locale-dependent
territory, and RRI may no longer be the best action anyway). That is,
RRI makes the most sense when dealing with ASCII characters (< 128) in
the first place, and that's a reasonable stopgap for immediate
implementation, even if we don't get full RRI across all of Unicode
(assuming that such might later become available via a new regcomp() flag).

Chet Ramey

2018-09-06 14:13:45 UTC

Post by m***@mamatb-laptop
Bash Version: 4.4
Patch Level: 0
Release Status: release
It seems like bash built-in regex matches some symbols that shouldn't.

There are a couple of things to consider here.

1. Bash doesn't have a "built-in" regexp engine. It uses whatever POSIX-
compatible regexp API the C library provides.

2. POSIX range expressions are explicitly non-portable and locale-
dependent. The characters in a range depend on the locale's collation
sequence. Look back at this list for discussions of how upper and
lower case letters get into a range like a-z.
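
To see what a range actually covers under the locale in effect, a small probe can help. This is a sketch under assumptions: probe_range is a hypothetical helper, the candidate characters are arbitrary, and the result depends entirely on the C library and locale, so no particular output is claimed here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each candidate that a bracket expression matches.
probe_range() {
    # Anchor the expression so a candidate must match it as a whole.
    local re="^${1}\$" c
    shift
    for c in "$@"; do
        if [[ $c =~ $re ]]; then
            printf '%s\n' "$c"
        fi
    done
    return 0
}

# Lists whichever of these candidates the current locale places in [a-b].
probe_range '[a-b]' a A b B ª
```

Running the same probe once in the normal environment and once with LC_ALL=C shows which extra characters the locale's collation sequence adds to the range.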

Chet
