
Testing with Bash specifically, I tried using wildcards (as if it were a case statement):

[[ ${var@a} == *"A"* ]]

It surprisingly (to me) works, but it's uglier than a regex. Then I compared them time-wise:

$ time for ((i = 0 ; i < 1000000 ; i++ )); do [[ ${var@a} == *"A"* ]] && :; done

real    0m2.512s
user    0m2.500s
sys 0m0.003s

$ time for ((i = 0 ; i < 1000000 ; i++ )); do [[ ${var@a} =~ "A" ]] && :; done

real    0m3.578s
user    0m3.553s
sys 0m0.003s

Is there any explanation for why this (admittedly simple) regular expression is so much slower here? Is it that Bash's regex implementation is slow per se?
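For anyone reproducing this, here is a self-contained sketch of the attribute check (variable names are mine; requires bash >= 4.4 for the `@a` transformation):

```shell
# ${var@a} expands to the attribute flags of var:
# "a" for an indexed array, "A" for an associative array.
declare -A assoc=([key]=val)   # attributes include "A"
declare -a indexed=(1 2 3)     # attributes are "a", not "A"

[[ ${assoc@a} == *"A"* ]] && echo "assoc is associative"
[[ ${indexed@a} == *"A"* ]] || echo "indexed is not associative"
```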

6 Comments
  • While this regexp is simple, regular expressions in general are more complex. Commented Jun 12 at 16:32
  • FWIW, my experience has been that regular expression handling in Bash is slow. It's been a long time since I tested that though. The regular expression feature can often be avoided by using the extended globbing feature (see the extglob section in glob - Greg's Wiki). I normally use Bash regular expressions only when BASH_REMATCH is useful. Commented Jun 12 at 20:28
  • @pjh Thanks, I will have a look! When it's trivial like here, I would be tempted to use them. Once it gets more complicated, honestly, I prefer a Perl one-liner (if only for the familiar syntax, even when BASH_REMATCH would probably have done the job). But that's not the issue here. I was just surprised it is this different when Bash is handling both cases. Commented Jun 12 at 20:34
  • General rule: If performance is a concern, you shouldn't be using bash in the first place. The equivalent Python script takes about 0.3 sec for both glob and regexp. And 0.1 sec if you compile the regexp outside the loop. Commented Jun 12 at 22:00
  • @Barmar the only side-effects are i=1000000 and $? so a smart optimiser would generate code that simply omits the loop entirely and runs in constant time :-) Commented Jun 12 at 22:44
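Along the lines of pjh's extglob suggestion, a minimal sketch (the pattern and test strings are my own, not from the question):

```shell
# Extended globs as an alternative to a simple regex.
shopt -s extglob               # enables +( ), @( ), !( ), ?( ), *( )
var="abc123"
[[ $var == +([a-z])+([0-9]) ]] && echo "letters then digits"
[[ $var == +([0-9]) ]]         || echo "not all digits"
```

Like regular globs, these anchor to the whole string, so there is no need for `^`/`$` equivalents.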

1 Answer


Is it that Bash regex implementation is slow per se?

Seems so, at least in comparison to globs (e.g. == *A*).
However, I don't think bash is to blame here too much, because ...

Regex matching is not implemented directly in bash

For executing¹ a parsed [[ =~ ]] command, bash just calls the C standard library functions regcomp and regexec from <regex.h> here.

That means the performance of =~ depends directly on your C standard library, which is glibc in most cases. Maybe other implementations of the C standard library (e.g. musl) are faster? I don't know.

There is a bit of overhead for turning =~ a"."b into the POSIX extended regular expression a\.b when parsing the command. However, that isn't the problem here, as the test below confirms:

time for ((i=0; i<1000000; i++)); do [[ t ]]; done                   # 1.6s
time for ((i=0; i<1000000; i++)); do [[ a == *b* ]]; done            # 1.9s
time for ((i=0; i<1000000; i++)); do [[ a =~ b ]]; done              # 2.8s
time for ((i=0; i<1000000; i++)); do [[ (t || a =~ b) || t ]]; done  # 1.9s

In the last command, bash has to parse a =~ b but does not execute it, because t is true (like any other non-empty string) and the ||-operator short-circuits. Since that command takes roughly as long as the glob test (1.9s) and well below the executed =~ test (2.8s), the time required for parsing =~ is negligible.
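That parsing step is also why quoting matters on the right-hand side of =~: quoted characters are escaped and matched literally (the example strings here are mine):

```shell
# Quoted parts of the =~ pattern are taken literally; unquoted parts are regex.
[[ a.b =~ a"."b ]] && echo "quoted dot matches a literal dot"
[[ axb =~ a"."b ]] || echo "quoted dot is not a wildcard"
[[ axb =~ a.b   ]] && echo "unquoted dot matches any character"
```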

¹ Just for context: Bash starts to evaluate both =~ and == in [[ here. The implementation behind == is here -- a rather convoluted implementation with > 300 lines of code. But I assume the implementations behind regcomp and regexec always have to be longer than that, see for instance regexec in glibc (don't forget to look into the internal functions too, e.g. re_search_internal).


3 Comments

I did not expect such a well-annotated answer! :) I appreciate the effort; I actually might want to have a look into the linked code out of curiosity. It now got me wondering how come e.g. Perl is so much faster with a regex.
Perl is faster because 1. Perl compiles regex literals only once, while bash does so on every iteration of the loop, and 2. Perl puts more work into a clever/fast regex engine, while C standard libraries probably don't care that much about regex performance and just provide something because they have to.
To generalize the question: Why does bash -c 'for ((i=0; i<1000000; i++)); do true; done' take 1.96s, but perl -e 'for (my $i=0; $i<1000000; $i++) {}' only 1/50 of that time, 0.04s? Answer: Because bash does not care about performance, and its implementation does a lot of unnecessary work because it was easier to implement that way.
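A quick way to reproduce that comparison yourself (absolute timings vary by machine; the loop bodies are deliberately empty so only the interpreter's per-iteration overhead is measured):

```shell
# Same million-iteration empty loop in bash and in perl.
time bash -c 'for ((i = 0; i < 1000000; i++)); do :; done'
time perl -e 'for (my $i = 0; $i < 1000000; $i++) {}'
```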
