Why not ripgrep?
Why not ripgrep?
Here's a thread on performance vs rg (ripgrep): https://github.com/BurntSushi/ripgrep/discussions/2597. I didn't know about hypergrep either.
Haven't benchmarked *grep implementations, but assuming those are just CLI wrappers around RegEx libraries, I'd expect the RegEx benchmarks to be broader and more representative.
There, hyperscan is generally the king, which means hypergrep numbers are likely accurate: https://github.com/p-ranav/hypergrep?tab=readme-ov-file#dire...
Disclaimer: I rarely use any *grep utilities, but often implement string libraries.
OK, now that I have hands on a keyboard, this is what I meant by Hyperscan's match semantics being "peculiar":
$ echo 'foobar' | hg -o '\w{3}'
1:foobar
$ echo 'foobar' | grep -E -n -o '\w{3}'
1:foo
1:bar
Here's the aforementioned reddit thread: https://old.reddit.com/r/cpp/comments/143d148/hypergrep_a_ne...

I want to be clear that these are intended semantics as part of Hyperscan. It's not a bug with Hyperscan. But it is something you'll need to figure out how to deal with (whether that's papering over it somehow, although I'm not sure that's possible, or documenting it as a difference) if you're building a grep around Hyperscan.
How about: use Hyperscan to round up all the lines that contain matches, and process those again with regex for the "-o" semantics.
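Sketched in Python, with `re` standing in for both engines (a real implementation would use Hyperscan for the line filter; names here are illustrative):

```python
import re

def grep_o_two_pass(pattern, lines):
    """Sketch of the proposed two-pass strategy: a fast filter engine
    (the Hyperscan role) flags candidate lines, then a conventional
    engine extracts -o style matches from only those lines."""
    fast = re.compile(pattern)     # stand-in for the Hyperscan prefilter
    precise = re.compile(pattern)  # stand-in for the -o extraction engine
    out = []
    for lineno, line in enumerate(lines, 1):
        if fast.search(line):                 # pass 1: does the line hit?
            for m in precise.finditer(line):  # pass 2: extract each match
                out.append((lineno, m.group()))
    return out
```

For example, grep_o_two_pass(r'\w{3}', ['foobar']) gives [(1, 'foo'), (1, 'bar')], matching the grep -E -o output above.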
You mean two different regex engines for the same search? That is perhaps conceptually fine, but in practice any two regex engines are likely to have differences that will make that strategy fall apart in some cases. (Perhaps unless those regex engines rigorously stick to a spec like POSIX or ecmascript. But that's not the case here. IIRC Hyperscan meticulously matches the behavior of a subset of PCRE2, but ripgrep's default engine is not PCRE2.)
You could perhaps work around this by only applying it as an optimization when you know the pattern has identical semantics in both regex engines. But you would have to do the work to characterize them.
I would rather just make the regex crate faster. If you look at the rebar benchmarks, it's not that far behind and is sometimes even faster. The case where Hyperscan really destroys everything else is for searches for many patterns.
Hyperscan has other logistical issues. It is a beast to build. And its pattern compilation times can be large (again, see rebar). Hyperscan itself only supports x86-64, so one would probably want to actually use Vectorscan (a fork of Hyperscan that supports additional architectures).
It might be the intended behavior of Hyperscan but it really feels like a bug in Hypergrep to report the matches like this - you cannot report a match which doesn't fully match the regex...
I also wonder if there's a performance issue when matching a really long line, because Hyperscan is not greedy and will ping back to Hypergrep for every sub-match. I'm guessing this is the reason for those shenanigans in the callback [0].
$ python -c 'print("foo" + "bar" * 3000)' | hg -o 'foo.*bar'
[0] https://github.com/p-ranav/hypergrep/blob/ee85b713aa84e0050a...

I don't disagree. It's why I brought this up. It's tricky to use Hyperscan, as-is, as a regex engine in a grep tool for these reasons. I don't mean to claim it is impossible, but there are non-trivial issues you'll need to solve.
It's hard to learn too much from hypergrep. It still has some rough spots:
$ hgrep -o 'foo.*bar' foobarbar.txt
foobarbar.txt
1:[Omitted long line with 1 matches]
$ hgrep -M0 -o 'foo.*bar' foobarbar.txt
Too few arguments
For more information try --help
$ hgrep -M 0 -o 'foo.*bar' foobarbar.txt
foobarbar.txt
1:[Omitted long line with 1 matches]
$ hgrep -M 0 'foo.*bar' foobarbar.txt
foobarbar.txt
1:[Omitted long line with 1 matches]
$ hgrep -M0 'foo.*bar' foobarbar.txt
terminate called after throwing an instance of 'std::invalid_argument'
what(): pattern not found
zsh: IOT instruction (core dumped) hgrep -M0 'foo.*bar' foobarbar.txt
Another issue with Hyperscan is that if you enable HS_FLAG_UTF8[1], which hypergrep does[2,3], and then search invalid UTF-8, then the result is UB:

"This flag instructs Hyperscan to treat the pattern as a sequence of UTF-8 characters. The results of scanning invalid UTF-8 sequences with a Hyperscan library that has been compiled with one or more patterns using this flag are undefined."
That's another issue you'll need to grapple with if you use Hyperscan. PCRE2 used to have this issue[4], but they've since defined the semantics of searching invalid UTF-8 with Unicode mode enabled. ripgrep 14 uses that new mode, but I haven't updated that FAQ answer yet.
Hyperscan isn't alone. Many regex engines do not support searching arbitrary byte sequences[5]. And this is why many/most regex engines are awkward to use in a fast grep implementation. Because you really do not want your grep to fall over when it comes across invalid UTF-8. And the overhead of doing UTF-8 checking in the first place (and perhaps let you just skip over lines that contain invalid UTF-8) would make it difficult to be competitive in performance. It also inhibits its usage in OSINT work.
[1]: https://intel.github.io/hyperscan/dev-reference/api_files.ht...
[2]: https://github.com/p-ranav/hypergrep/blob/ee85b713aa84e0050a...
[3]: https://github.com/p-ranav/hypergrep/blob/ee85b713aa84e0050a...
[4]: https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#why...
[5]: https://github.com/BurntSushi/rebar/blob/96c6779b7e1cdd850b8...
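To illustrate the byte-vs-Unicode distinction with a quick Python sketch: a Unicode-only engine effectively requires a decode step that fails on invalid UTF-8, while a byte-oriented regex searches arbitrary bytes without any trouble.

```python
import re

data = b'caf\xe9 latte'  # Latin-1 bytes; invalid as UTF-8

# A Unicode-only engine effectively requires decoding first, which fails:
try:
    data.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# A byte-oriented regex searches the raw bytes without issue:
m = re.search(rb'caf.', data)
```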
is that an alias, or does hypergrep really use the same command name as Mercurial?
It was renamed: https://github.com/p-ranav/hypergrep/commit/ee85b713aa84e005...
I'm the author of ripgrep and its regex engine.
Your claim is true to a first approximation. But greps are line oriented, and that means there are optimizations that can be done that are hard to do in a general regex library. You can read more about that here: https://blog.burntsushi.net/ripgrep/#anatomy-of-a-grep (greps are more than simple CLI wrappers around a regex engine).
If you read my commentary in the ripgrep discussion above, you'll note that it isn't just about the benchmarks themselves being accurate, but the model they represent. Nevertheless, I linked the hypergrep benchmarks not because of Hyperscan, but because they were done by someone who isn't the author of either ripgrep or ugrep.
As for regex benchmarks, you'll want to check out rebar: https://github.com/BurntSushi/rebar
You can see my full thoughts around benchmark design and philosophy if you read the rebar documentation. Be warned though, you'll need some time.
There is a fork of ripgrep with Hyperscan support: https://sr.ht/~pierrenn/ripgrep/
Hyperscan also has some peculiarities in how it reports matches. You won't notice it in basic usage, but it will appear when using something like the -o/--only-matching flag. For example, Hyperscan will report matches of a, ab and abc for the regex \w+, whereas a normal grep will just report a match of abc. (And this makes sense given the design and motivation for Hyperscan.) Hypergrep goes to some pains to paper over this, but IIRC the logic is not fully correct. I'm on mobile, otherwise I would link to the reddit thread where I had a convo about this with the hypergrep author.
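That reporting model can be simulated in miniature (a sketch only; Python's `re` stands in, and real Hyperscan reports end offsets through a callback):

```python
import re

def end_offsets(pattern, haystack):
    """Rough simulation of Hyperscan-style reporting: emit every
    offset at which *some* match of the pattern ends."""
    p = re.compile(pattern)
    ends = []
    for end in range(1, len(haystack) + 1):
        # does any substring ending exactly at `end` match the pattern?
        if any(p.fullmatch(haystack, start, end) for start in range(end)):
            ends.append(end)
    return ends
```

For \w+ against 'abc' this reports an event at offsets 1, 2 and 3, i.e. matches 'a', 'ab' and 'abc', where a traditional grep -o reports only 'abc'.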
I'm on mobile, otherwise I would link to the reddit thread where I had a convo about this with the hypergrep author.
From some searching I think you might mean this: https://www.reddit.com/r/cpp/comments/143d148/hypergrep_a_ne...
Ah yup! I just posted a follow-up that links to that with an example (from a build of hypergrep off of latest master): https://news.ycombinator.com/item?id=38821321
rg uses a lot of memory in the OpenSubtitles test. 903M vs 29M for ugrep. Unlike the previous test, we are not told the size of the file being searched.
Would be interesting to see comparisons where memory is limited, i.e., where the file being searched will not fit entirely into memory.
Personally I'm interested in "grep -o" alternatives. The files I'm searching are text but may have few newlines. For example I use ired instead of grep -o. ired will give the offsets of all matches, e.g.,
echo /\"something\"|ired -n 1.htm
Quick and dirty script, not perfect:

#!/bin/sh
test $# -gt 0||echo "usage: echo string|${0##*/} file [blocksize] [seek] [match-no]"
{
read x;
x=$(echo /\""$x"\"|ired -n $1|sed -n ${4-1}p);
test "$x"||exit 1;
echo
printf s"$x"'\n's-${3-0}'\n'x$2'\n'|ired -n $1;
echo;
printf s"$x"'\n's-${3-0}'\n'X$2'\n'|ired -n $1;
echo;
echo w$(printf s"$x"'\n's-${3-0}'\n'X$2'\n'|ired -n $1)|ired -n /dev/stdout;
echo;
}
Another script I use loops through all the matches.

rg uses a lot of memory in the OpenSubtitles test. 903M vs 29M for ugrep. Unlike the previous test, we are not told the size of the file being searched.
Which test exactly? That's just likely because of memory maps futzing with the RSS data. Not actually more heap memory. Try with --no-mmap.
I'm not sure I understand the rest of your comment about grep -o. Grep tools usually have a flag to print the offset of each match.
EDIT: Now that I have hands on a keyboard, I'll demonstrate the mmap thing. First, ugrep:
$ time ugrep-4.4.1 -c '\w+\s+Sherlock\s+Holmes\s+\w+' sixteenth.txt
72
real 22.115
user 22.015
sys 0.093
maxmem 30 MB
faults 0
$ time ugrep-4.4.1 -c '\w+\s+Sherlock\s+Holmes\s+\w+' sixteenth.txt --mmap
72
real 21.776
user 21.749
sys 0.020
maxmem 802 MB
faults 0
And now for ripgrep:

$ time rg-14.0.3 -c '\w+\s+Sherlock\s+Holmes\s+\w+' opensubtitles/2018/en/sixteenth.txt
72
real 0.076
user 0.046
sys 0.030
maxmem 779 MB
faults 0
$ time rg-14.0.3 -c '\w+\s+Sherlock\s+Holmes\s+\w+' opensubtitles/2018/en/sixteenth.txt --no-mmap
72
real 0.087
user 0.033
sys 0.053
maxmem 15 MB
faults 0
It looks like the difference here is that ripgrep chooses to use a memory map by default. I don't think it makes much of a difference either way. If the file were bigger than available memory, then the OS would automatically handle paging.
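The effect is easy to reproduce in miniature with Python (a sketch; the file contents and path are made up):

```python
import mmap
import os
import re
import tempfile

# Search a file through a memory map, roughly what ripgrep does by
# default for a single large file. The mapped pages are charged to the
# process's RSS, which inflates maxmem, but they are page cache, not
# heap allocations.
path = os.path.join(tempfile.mkdtemp(), 'haystack.txt')
with open(path, 'wb') as f:
    f.write(b'Watson and Sherlock Holmes investigate\n' * 1000)

with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Python's bytes regexes accept any buffer object, including an mmap
    count = len(re.findall(rb'Sherlock\s+Holmes', mm))
    mm.close()
```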
I think you should try it before you read these conflicting benchmarks from the authors: https://github.com/Genivia/ugrep-benchmarks
Any particular reason why newer tools don't follow the well-established XDG standard for config files? Those folder structures probably already exist on end user machines, and keep your home directory from getting cluttered with tens of config files
For ripgrep at least, you set an environment variable telling it where to look for a config file. You can put it anywhere, so you don't need to put it in $HOME.
I didn't do XDG because this route seemed simpler, and XDG isn't something that is used everywhere.
I didn't do XDG because this route seemed simpler
Simpler how? This requires custom config, instead of following what I set system-wide.
and XDG isn't something that is used everywhere.
Yeah, that‘s why it defines defaults to fall back on.
It's far simpler to implement.
No, you don't understand. I'm not saying the XDG variables might not be defined. Give me a little credit here lol. I have more than a passing familiarity with XDG. I've implemented it before. I'm saying the XDG convention itself may not apply. For example, Windows. And it's controversial whether to use them on macOS, when I last looked into it.
I don't see any significant problem with defining an environment variable. You likely already have dozens defined. I know I do.
I'm not trying to convince you of anything. Someone asked why. This is why for ripgrep at least.
Could ripgrep not simply add a check for the XDG environment variables and use those, if no rg environment variable is given? Of course if both are not available you would use the default.
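That lookup order is easy to sketch. Note only RIPGREP_CONFIG_PATH reflects what ripgrep actually does today; the XDG fallback is the proposal here, and the ~/.ripgreprc default name is made up for illustration:

```python
import os

def config_path():
    # Proposed lookup: tool-specific variable first, then XDG, then a
    # hypothetical home-directory default.
    p = os.environ.get('RIPGREP_CONFIG_PATH')
    if p:
        return p
    xdg = os.environ.get('XDG_CONFIG_HOME',
                         os.path.expanduser('~/.config'))
    candidate = os.path.join(xdg, 'ripgrep', 'config')
    if os.path.exists(candidate):
        return candidate
    return os.path.expanduser('~/.ripgreprc')  # name invented here
```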
The issue is complexity - we could create some sort of 'standard tool' library that 'just works' on all platforms, but now building the tool and runtime bootstrapping the tool become more complex, and hence more likely to _break_.
Really most people want it in their path and it just to work in as many scenarios as possible. Config almost shouldn't be the responsibility of the tool at all... (Options passed to the tool via env variables, perhaps)...
Of course. But now you've complicated how config files are found and it doesn't seem like an improvement big enough to justify it.
Bottom line is that while ripgrep doesn't follow XDG, it also doesn't force you to litter your HOME directory. That's what most people care about in my experience.
I would encourage you to search the ripgrep issue tracker for XDG. This has all been discussed.
Standard should be - tool tells you where it's configured, how to change the config, and choose a 'standard' default config, such as XDG.
Assuming you aren't doing weird things with paths, I can work around 'dumb lazy' developers releasing half-assed tools with symlinks/junctions, but I really don't want to spend a ton of time configuring your tool or fighting its presumptions.
Oh okay, I guess you've got it figured out. Now specify it in enough detail for others to implement it, get all stakeholders to agree and get everyone to implement it exactly to the spec.
Good luck. You're already off to a rough start with XDG, since that isn't what is used on Windows. And it's unclear whether it ought to be used on macOS.
Slight rant/aside but Firefox is bad for this. You can point it to a custom profile path (e.g. .config/mozilla) but ~/.mozilla/profile.ini MUST exist. Only that one file - you can move everything else.
In my mind, this is fine, as Firefox predates the standard by a long time. But newer tools specifically should know better.
XDG isn't recognized as an authority outside of XDG.
I feel like if you're going to make a new grep and put a web page for it, your webpage should start with why your grep is better than the default (or all the other ones).
Why did you build a new grep?
I feel like if you're going to make a new grep and put a web page for it, your webpage should start with why your grep is better than the default (or all the other ones).
No snark here, but is the subtitle not enough to start? "a more powerful, ultra fast, user-friendly, compatible grep"
Not really.
* a more powerful -- This is meaningless without some sort of examples. Powerful how? What does it do that's better than grep?
* ultra fast -- This at least means something, but it should be quantified in some way. "50%+ faster for most uses cases" or something like that.
* user-friendly -- not even sure what this means. Seems kind of subjective anyway. I find grep plenty user friendly, for a command line tool.
* compatible grep -- I mean, they all are pretty much, but I guess it's good to know this?
* ultra fast -- This at least means something, but it should be quantified in some way. "50%+ faster for most uses cases" or something like that.
That would be begging for nerd rage posts, just like so many disputing the benchmarks. >:D
* user-friendly -- not even sure what this means. Seems kind of subjective anyway. I find grep plenty user friendly, for a command line tool.
Just below is a huge, captioned screenshot of the TUI?
* compatible grep -- I mean, they all are pretty much, but I guess it's good to know this?
One would think so... but I have so many scars concerning incompatibilities with different versions of grep (as do others in the comments). If you don't know, then that feature isn't listed for you. :)
no snark here, but the subtitle was the start of my confusion: what does "user-friendly" mean in the context of grep, and why should I believe the claim?
regular expressions are not friendly, but the user friendly way for a cli filter to behave is to return retvals appropriately, output to stdout, error messages to stderr... does user friendly mean copious output to stderr? what else could it possibly mean? do I want copious output to stderr?
no snark here, but the subtitle was the start of my confusion: what does "user-friendly" mean in the context of grep, and why should I believe the claim?
Granted, it is far from a thing of beauty, but there is a large, captioned screenshot of the included text user interface just beneath. Then again, it is a website for a command line tool. "Many Bothans died to bring us this information."
Important note: not actually compatible. It took me seconds to find an option that does something completely different than the GNU version.
I would assume compatible meant posix/bsd - unless explicitly advertised AS "GNU grep compatible"?
From the OP: "Ugrep is compatible with GNU grep and supports GNU grep command-line options."
Which option is that? I'm scanning the ugrep page, but nothing is popping out to me.
Indeed. And here are some concrete examples around locale:
$ grep -V | head -n1
grep (GNU grep) 3.11
$ alias ugrep-grep="ugrep-4.4.1 -G -U -Y -. --sort -Dread -dread"
$ echo 'pokémon' | LC_ALL=en_US.UTF-8 grep 'pok[[=e=]]mon'
pokémon
$ echo 'pokémon' | LC_ALL=en_US.UTF-8 ugrep-grep 'pok[[=e=]]mon'
$ echo 'γ' | LC_ALL=en_US.UTF-8 grep -i 'Γ'
γ
$ echo 'γ' | LC_ALL=en_US.UTF-8 ugrep-grep -i 'Γ'
BSD grep works like GNU grep too: $ grep -V
grep (BSD grep, GNU compatible) 2.6.0-FreeBSD
$ echo 'pokémon' | LC_ALL=en_US.UTF-8 grep 'pok[[=e=]]mon'
pokémon
$ echo 'γ' | LC_ALL=en_US.UTF-8 grep -i 'Γ'
γ
Interesting, it supports an n-gram indexer. ripgrep has had this planned for a few years now [1] but hasn't implemented it yet. For large codebases I've been using csearch, but it has a lot of limitations.
Unfortunately... I just tried the indexer and it's extremely slow on my machine. It took 86 seconds to index a Linux kernel tree, while csearch's cindex tool took 8 seconds.
That's close to a gig of disk reads; I trust you didn't try ugrep first and then cindex second, without taking into account caching.
I ran both multiple times, alternating (and making sure to clean out the indexes in between). Results were reasonably consistent across runs.
If you're gonna go the csearch route, you should also consider hound. I use it many times per day.
It creates per-directory index files on its first run. ugrep-indexer is also labeled as beta. A couple of relevant quotes from its GitHub site:
“Indexing adds a hidden index file ._UG#_Store to each directory indexed.”
“Re-indexing is incremental, so it will not take as much time as the initial indexing process.”
Someone please just standardize the grep flags across all platforms.
Specifically -P / --perl-regexp support on macOS and FreeBSD
It really would reduce the WTF moments for the students.
Insert jokes about standards below... =)
That's what POSIX was supposed to be.
It's easier IMO to just use the same tool on all platforms. Which you can of course do.
Not sure if brew's grep is as NERF'ed, but POSIX standard often is just a subset of minimal features for the GNU version.
Cheers, =)
Yes, that's the problem. You need to maintain a close attention level to know which things are POSIX. And in the case of GNU grep, you actually need to set POSIXLY_CORRECT=1. Otherwise its behavior is not a subset.
POSIX also forbids greps from searching UTF-16 because it mandates that certain characters always use a single byte. ripgrep, for example, doesn't have this constraint and thus can transparently search UTF-16 correctly via BOM sniffing.
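A sketch of the BOM-sniffing idea in Python (ripgrep's real implementation transcodes incrementally through its encoding layer; this just shows the shape of it):

```python
import codecs
import re

def search_with_bom_sniffing(pattern, raw):
    # A UTF-16 BOM selects the transcoding, so the same Unicode
    # pattern works on UTF-8 and UTF-16 inputs alike.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode('utf-16')  # the BOM determines byte order
    else:
        text = raw.decode('utf-8', errors='replace')
    return re.findall(pattern, text)
```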
A little off-topic, but I'd love to see a tool similar to this that provides real-time previews for an entire shell pipeline which, most importantly, integrates into the shell. This allows for leveraging the completion system to complete command-line flags and using the line editor to navigate the pipeline.
In zsh, the closest thing I've gotten to this was to bind Ctrl-\ to the `accept-and-hold` zle widget, which executes what is in the current buffer while still retaining it and the cursor position. That gets me close (no more ^P^B^B^B^B for editing), but I'd much rather see the result of the pipeline in real-time rather than having to manually hit a key whenever I want to see the result.
Sounds similar to this: https://github.com/akavel/up
I guess Alt+a is the default zsh shortcut for that.
Okay, this solves a feature I was occasionally missing for a long time: searching for several terms in files (the "Googling files" feature). I wrote an 8-line script a few weeks ago to do this, which I will gladly throw away. I'll look into the TUI too.
(I've been using ripgrep for quite some time now, how does this otherwise compare to it? would I be able to just replace rg with ug?)
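The "Googling files" idea is simple enough to sketch (a toy version; real tools do this far faster, and the demo files are invented):

```python
import os
import re
import tempfile

def google_files(root, terms):
    """Report files whose contents contain ALL of the given terms,
    in any order."""
    pats = [re.compile(re.escape(t)) for t in terms]
    hits = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors='ignore') as f:
                    text = f.read()
            except OSError:
                continue
            if all(p.search(text) for p in pats):
                hits.append(path)
    return hits

# tiny demo on a throwaway directory
root = tempfile.mkdtemp()
with open(os.path.join(root, 'a.txt'), 'w') as f:
    f.write('alpha beta gamma')
with open(os.path.join(root, 'b.txt'), 'w') as f:
    f.write('alpha only')
found = google_files(root, ['alpha', 'gamma'])
```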
ugrep+ has this feature similar to ripgrep-all.
For regular use, I use ugrep’s %u option with its format feature to only get one match per line same as other grep tools.
Overall, I’m a happy user of ugrep. ugrep works as well as ripgrep for me. It’s VERY fast and has built-in option to search archives within archives recursively.
Is it that different from using fzf?
I currently use ripgrep-all (which can search into anything, video captions or pdfs) and fzf.
Cool, but in a real-life scenario where the system can't pull in external packages because it's a secured environment, this seems moot, and relying on it would put you out of practice with actually running grep. I'd rather not fall out of practice with grep.
On the other hand for a non-work environment where security isn't in question this is cool.
This. I had to beg and wait about a year to get jq added to our base image once it passed sec review and all that.
I find bat pretty useful on my local machine
Also look at https://github.com/stealth/grab from Sebastian Krahmer.
ripgrep, grab, ugrep, hypergrep... Any of the four are probably fast enough for any of my use cases but I suddenly feel tempted to micro-optimize and spend ages comparing them all.
Slightly off topic, but how does one publish so many installable versions of a binary across all the package managers? I figured out how to do it for Brew, but the rest seems like a billion different steps that need to be done and I feel like I am missing something.
You only have to set up CI/CD once for each package type, afterwards all the packaging work is done for you automatically.
Ripgrep is also quite a large project (judging by both star count and contribution count), so people probably volunteer to support their platform/package manager of choice.
There are a few ripgrep-based TUIs:
- https://github.com/acheronfail/repgrep
- https://github.com/konradsz/igrep
You can also use fzf with ripgrep to great effect:
[1]: https://github.com/junegunn/fzf/blob/master/ADVANCED.md#usin...
Just tried it out. It's blazingly fast. The interactive TUI search is pretty sweet.
Very insightful discussion. Is there a regex library that is tuned for in-memory data/strings? Similar to in-memory databases?
I recall using hyperscan, but isn't it discontinued?
There are many grep variations. The Unix philosophy: do one thing well. The Unix reality: do many things poorly*
*grep, awk, sed
this is slick! easily the best of these new grep tools. thanks for sharing. i’ll use this when grep(1) doesn’t quite cut it
I will never learn this tool
I will not even contemplate using this tool.
The reason is very simple: I can trust 'grep' to be on any system I ever touch. Learning ugrep doesn't make any sense as I can't trust it to be available.
I could still use it on my own systems, but I work on customer systems which won't have this tool installed.
And I'm proficient enough with grep that it's 'good enough', I'm not focussing on a better grep. I'm focussing on fixing a problem, or trying something new.
I'd rather invest my time into something that will benefit me across all environments I work with.
Because a tool may be 'better' (whatever that means) doesn't mean it will see adoption.
This is not about being closeminded, but it's about focus on what's really important.
I really like the fuzzy match feature. Useful for typos or off by 1-2 characters.
Ugrep is also available in Debian based repos, which is super nice.
One thing I never liked about ripgrep is that it doesn't have a pager. Yes, it can be configured to use the system-wide ones, but it's an extra step (and every time I have to google how to preserve colors) and on Windows you're SOL unless you install gnu utils or something. The author always refused to fix that.
Ugrep not only has a pager built in, but it also allows searching the results which is super nice! And that feature works on all supported platforms!
This is what I do personally:
Should work just fine. For Windows, you can install `bat` to use a pager if you don't otherwise have one. You don't need GNU utils to have a pager.

hi @burntsushi,
I use windows: I didn't understand what you mean by "install `bat`" to use a pager. I use cygwin and WSL for my unix needs. I have more and less in cygwin for use in windows.
I referenced bat because I've found that suggesting cygwin sometimes provokes a negative reaction. The GP also mentioned needing to install GNU tooling as if it were a negative.
bat is fancy pager written in Rust. It's on GitHub: https://github.com/sharkdp/bat
I'm sure you know but windows command prompt always came with its inbuilt pager -- more. So, you could always do "dir | more" or "rg -p "%*" | more ". (more is good with colors without flags)
I didn't! I'm not a Windows user. Colors are half the battle, so that's good. Will it only appear if paging is actually needed? That's what the flags to `less` do in my wrapper script above. They are rather critical for this use case.
I don't believe bat is a pager; it's more of a pretty-printer that tends to call less.
Two pagers that should work on Windows are https://github.com/walles/moar (Go) and https://github.com/markbt/streampager (Rust). There might also be a newer one written in Rust, I'm unsure.
Interesting - for me a built-in pager is an antifeature. I don't want to figure out how to leave the utility. Worst of all, a pager means that sometimes the output spans multiple pages and you need to press q to exit, and sometimes not. Annoying. I often type the next command right away, and the pager means I get stuck, or worse, the pager starts doing something in response to my keys (looking at you, `git log`).
Then again I'm on Linux and can always pipe to less if I need to. I'm also not the target audience for ugrep because I've never noticed that grep would be slow. :shrug:
You might appreciate setting `PAGER=cat` in your environment. ;)
Git obeys that value, and I would hope that most other UNIXy terminal apps do too.
Oh, wow, thank you! I must try this.
Some terminal emulators (kitty for sure) support "open last command output in pager". Works great with a pager that can understand ANSI colors - less fussing around with variables and flags to preserve colors in the pager
I assume the grep compatible bit is attractive to some people. Not me, but they exist.
I find myself returning to grep from my default of rg because I'm just too lazy to learn a new regex language. Stuff like word boundaries "\<word\>" or multiple patterns "\(one\|two\)".
That seems like the weirdest take ever: ripgrep uses pretty standard PCRE-style patterns, which are a lot more common than POSIX's BRE monstrosity.
To me the regex langage is very much a reason to not use grep.
A bit hyperbolic, no?
If you consider it "the weirdest ever", I'm guessing that I'm probably older than you. I've certainly been using regex long before PCRE became common.
As a vim user I compose 10s if not 100s of regexes a day. It does not use PCRE. Nor does sed, a tool I've been using for decades. Do you also recommend not using these?
I use all of those tools but the inconsistency drives me crazy as it's hard to remember which syntax to use where. Here's how to match the end of a word:
ripgrep, Python, JavaScript, and practically every other non-C language: \b
vim: \>
BSD sed: [[:>:]]
GNU sed, GNU grep: \> or \b
BSD grep: \>, \b, or [[:>:]]
less: depends on the OS it's running on
Did you know that not all of those use the same definition of what a "word" character is? Regex engines differ on the inclusion of things like \p{Join_Control}, \p{Mark} and \p{Connector_Punctuation}. Although in the case of \p{Connector_Punctuation}, regex engines will usually at least include underscore. See: https://github.com/BurntSushi/rebar/blob/f9a4f5c9efda069e798...
And then there's \p{Letter}. It can be spelled in a lot of ways: \pL, \p{L}, \p{Letter}, \p{gc=Letter}, \p{gc:Letter}, \p{LeTtEr}. All equivalent. Very few regex engines support all of them. Several support \p{L} but not \pL. See: https://github.com/BurntSushi/rebar/blob/f9a4f5c9efda069e798...
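For a concrete instance of how engines draw their own lines, Python's `re` counts underscore and Unicode letters as word characters, yet supports no \p{...} syntax at all (the third-party `regex` module adds it):

```python
import re

# Underscore counts as a word character, and \w is Unicode-aware by
# default in Python 3:
assert re.fullmatch(r'\w+', 'foo_bar') is not None
assert re.fullmatch(r'\w+', 'héllo') is not None

# But property syntax like \p{Letter} is simply a pattern error in re
# (unknown ASCII-letter escapes have been errors since Python 3.7):
try:
    re.compile(r'\p{Letter}')
    has_property_syntax = True
except re.error:
    has_property_syntax = False
```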
`pgrep`, or `grep -P`, uses PCRE though, AFAIUI.
ripgrep's regex syntax is pretty similar to grep -E. So if you know grep -E, most of that will transfer over.
Also, \< and \> are in ripgrep 14. Although you usually just want to use the -w/--word-regexp flag.
Isn't that inconsistent with the way Perl's regex syntax was designed? In Perl's syntax an escaped non-ASCII character is always a literal [^1], and that is guaranteed not to change.
That's nice for beginners because it saves you from having to memorize all the metacharacters. If you are in doubt about whether something has a special meaning, you just escape it.
[^1]: https://perldoc.perl.org/perlrebackslash#The-backslash
Yes, it's inconsistent with Perl. But there are many things in ripgrep's default regex engine that are inconsistent with Perl, including the fact that all patterns are guaranteed to finish a search in linear time with respect to the haystack. (So no look-around or back-references are supported.) It is a non-goal of ripgrep to be consistent with Perl. Thankfully, if you want that, then you can get pretty close by passing the -P/--pcre2 flag.
With that said, I do like Perl's philosophy here. And it was my philosophy too up until recently. I decided to make an exception for \< and \> given their prevalence.
It was also only relatively recently that I made it possible for superfluous escapes to exist. Prior to ripgrep 14, unrecognized escapes were forbidden:
I had done it this way to make it possible to add new escape sequences in a semver compatible release. But in reality, if I were to ever add new escape sequences, I'd use one of the ASCII alphanumeric characters, as Perl does. So I decided it was okay to forever and always give up the ability to make, e.g., `\@` mean something other than just matching a literal `@`.

`\<` and `\>` are forever and always the lone exceptions to this. It is perhaps a trap for beginners, but there are many traps in regexes, and this seemed worth it.
Note that `\b{start}` and `\b{end}` also exist and are aliases for `\<` and `\>`. The more niche `\b{start-half}` and `\b{end-half}` also exist, and those are what are used to implement the -w/--word-regexp flag. (Their semantics match GNU grep's -w/--word-regexp.) For example, `\b-2\b` will not match in `foo -2 bar` since `-` is not a word character and `\b` demands `\w` on one side and `\W` on the other. However, `rg -w -e -2` will match `-2` in `foo -2 bar`:
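The `-2` example can be checked against Python's `re`, with lookarounds standing in for the -w half-boundary semantics (since `re` has no \b{start-half}/\b{end-half}):

```python
import re

hay = 'foo -2 bar'

# \b-2\b cannot match: \b demands a word character on one side, and
# both neighbours of the would-be match edges are non-word (' ', '-').
assert re.search(r'\b-2\b', hay) is None

# GNU grep's -w only demands the match not be flanked by word
# characters; lookarounds express that half-boundary condition:
assert re.search(r'(?<!\w)-2(?!\w)', hay).group() == '-2'
```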
Ok, makes sense. And thanks for the detailed explanation about word boundaries and the hint about the --pcre2 flag (I hadn't realized it existed).
From the ugrep README:
For an up-to-date performance comparison of the latest ugrep, please see the ugrep performance benchmarks [at https://github.com/Genivia/ugrep-benchmarks]. Ugrep is faster than GNU grep, Silver Searcher, ack, sift. Ugrep's speed beats ripgrep in most benchmarks.
Do these performance comparisons take into account the things BurntSushi (ripgrep author) pointed out in the ripgrep issue linked elsewhere ITT? https://github.com/BurntSushi/ripgrep/discussions/2597
Either way, ripgrep is awesome and I’m staying with it.
Agreed - ripgrep is great, and I'm not planning to switch either. The performance improvement is tiny, anyways.
For me, it's a lot easier to compile a static binary of a C++ app than a Rust one. Never got that to work. Also nice to have compatibility with all of grep's arguments.
Cargo is one of the main reasons to use Rust over C++. I am pretty sure there is more involved with C++ than this:
Because this is faster?
Fuzzy matching is the main reason I switched to ugrep. This is insanely useful.
ripgrep stole the name but doesn't follow the POSIX standard.
Why not ugrep?
They are more or less equivalent. One has obscure feature X other has obscure feature Y, one is a bit faster on A, other is a bit faster on B, the defaults are a bit different, and one is written in Rust, the other in C++.
Pick the one you like, or both. I have both on my machine, and tend to use the one that does what I want with the least options. I also use GNU grep when I don't need the speed or features of either ug and rg.
The best practical reason to choose this is its interactive features, like regexp building.