
Regex character "$" doesn't mean "end-of-string"

Karellen
70 replies
9h9m

Folks who've worked with regular expressions before might know about ^ meaning "start-of-string" and correspondingly see $ as "end-of-string".

Huh. I always think of them as "start-of-line" and "end-of-line". I mean, a lot of the time when I'm working with regexes, I'm working with text a line at a time so the effect is the same, but that doesn't change how I think of those operators.

Maybe because a fair amount of the work I do with regexes (and, probably, how I was introduced to them) is via `grep`, so I'm often thinking of the inputs as "lines" rather than "strings"?

wccrawford
51 replies
8h3m

It's kind of driving me nuts that the article says ^ is "start of string" when it's actually "start of line", just like $ is "end of line". \A is apparently "start of string" like \Z is "end of string".

masklinn
35 replies
7h41m

It’s not start of line though, unless the engine is in multiline mode. Here is the documentation for Python’s re for instance:

> Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

Or JavaScript:

> An input boundary is the start or end of the string; or, if the m flag is set, the start or end of a line.

\A and \Z are start/end of input regardless of mode… when they’re available, that’s not the case of all engines.
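(The difference is quick to check with Python's `re`, where `\Z` is the absolute end of the string; PCRE additionally distinguishes `\Z` from `\z`, which Python does not have.)

```python
import re

s = "one\ntwo\n"

# Default mode: $ matches at the very end, or just before a final newline.
print(bool(re.search(r"two$", s)))   # True: $ tolerates the trailing newline
print(bool(re.search(r"two\Z", s)))  # False: \Z demands the true end of string

# MULTILINE mode: ^ and $ also match around every interior newline.
print(re.findall(r"(?m)^\w+$", s))   # ['one', 'two']
```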

danbruc
32 replies
6h50m

It is start and end of line. [1]

> Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches either at the end of the string and at the end of each line (immediately preceding each newline).

In single-line [2] mode, the line starts at the start of the string and ends at the end of the line where the end of the line is either the end of the string if there is no terminating newline or just before the final newline if there is a terminating newline.

In multi-line mode a new line starts at the start of the string and after each newline and ends before each newline or at the end of the string if the last line has no terminating newline.

The confusion is that people think that they are in string-mode if they are not in multi-line mode but they are not, they are in single-line mode, ^ and $ still use the semantics of lines and a terminating newline, if present, is still not part of the content of the line.

With \n\n\n in single-line mode the non-greedy ^(\n+?)$ will capture only two of the newlines, the third one will be eaten by the $. If you make it greedy ^(\n+)$ will capture all three newlines. So arguably the implementations that do not match cat\n with cat$ are the broken ones.

[1] https://docs.python.org/3/howto/regex.html#more-metacharacte...

[2] I am using single-line to mean not multi-line for convenience even though single-line already has a different meaning.
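(The \n\n\n example above is easy to verify in Python, with no MULTILINE flag:)

```python
import re

s = "\n\n\n"

# Non-greedy: the lazy + stops as soon as $ can succeed, which is just
# before the final newline, so only two of the three newlines are captured.
print(repr(re.search(r"^(\n+?)$", s).group(1)))  # '\n\n'

# Greedy: + consumes all three newlines and $ still succeeds at the
# very end of the string.
print(repr(re.search(r"^(\n+)$", s).group(1)))   # '\n\n\n'
```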

masklinn
31 replies
6h37m

> It is start and end of line.

You seem to have redefined “line” as “not a line”.

> The confusion

I’m sure redefining “line” as “nothing like what anyone reasonable would interpret as a line” will help a lot and really clear up the confusion.

Bjartr
27 replies
5h47m

The line delimiter is a newline.

If you have a file containing `A\nB\nC`, the file is three lines long.

I guess it could be argued that a file containing `A\nB\nC\n` has four lines, with the fourth having zero length.

That a regex is applying to an in memory string vs a file doesn't feel to me like it should have different semantics.

Digging into the history a little, it looks like regexes were popularized in text editors and other file oriented tooling. In those contexts I imagine it would be far more common to want to discard or ignore the trailing zero length line than to process it like every other line in a file.

akdev1l
26 replies
5h40m

Technically the “newline” character is actually a line _terminator_. Hence “A\n” is one line, not two. The “\n” is always at the end of a line by definition.

wtetzner
15 replies
5h22m

So if you have "A" in a file with no newline, there are no lines in that file?

jepler
11 replies
5h10m

Yes, that is a file with zero lines that ends with an "incomplete line". Processing of such files by standard line-oriented utilities is undefined in the opengroup spec. So, for instance, the effect of "grep"ping such a file is not defined. Heck, even "cat"ting such a file gives non-ideal results, such as colliding with the regular shell prompt. For this reason, a lot of software projects I work on check and correct this condition whenever creating a commit.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1... ("text file")

rovr138
8 replies
4h54m

> Yes, that is a file with zero lines that ends with an "incomplete line".

It's a file with zero complete lines. But it has 1 line, that's incomplete, right?

The file starts empty. Anything in it starts "a line". So it's 1 incomplete line.

I hate weird states.

xyzzy_plugh
2 replies
4h30m

No, it is valid for a file to have content but no lines.

Semantically many libraries treat that as a line because while \n<EOF> means "the end of the last line" having just <EOF> adds additional complexity the user has to handle to read the remaining input. But by the book it's not "a line".

If I said "ten buckets of water" does that mean ten full buckets? Or does a bucket with a drop in it count as "a bucket of water?" If I asked for ten buckets of water and you brought me nine and one half-full, is that acceptable? What about ten half-full buckets?

A line ends in a newline. A file with no newlines in it has no lines.

joshjje
1 replies
1h37m

That's beyond ridiculous. In most languages, when you're reading a line from a file and it doesn't have a \n terminator, it's going to give you that line, not say "oops, this isn't a line, sorry".

LK5ZJwMwgBbHuVI
0 replies
50m

That's a relatively recent invention compared to tools like `wc` (or your favorite `sh` for that matter). See also: https://perldoc.perl.org/functions/chop wherein the norm was "just cut off the last character of the line, it will always be a newline"

coryrc
1 replies
2h7m

Pedantically, if it doesn't end with a newline, it's considered a binary file and not a text file. Binary files don't have lines.

In practice, most utilities expecting text files will still operate on it.

PaulDavisThe1st
0 replies
45m

No file has lines.

"Lines" are a convention established by (or not) software reading a data stream.

mort96
0 replies
2h50m

It's a file with 0 lines and some trailing garbage.

akdev1l
0 replies
4h28m

No, a line is defined as a sequence of characters (bytes?) with a line terminator at the end.

Technically as per posix a file as you describe is actually a binary file without any lines. Basically just random binary data that happens to kind of look like a line.

DougBTX
0 replies
2h32m

Another way to look at it is that concatenating files should sum the line count. Concatenating two empty files produces an empty file, so 0 + 0 = 0. If “incomplete lines” are not counted as lines, then the maths still works out. If they counted as lines, it would end up as 1 + 1 = 1.
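(The additivity argument can be written down as a tiny check; the two counting functions here are illustrative, not from any standard library:)

```python
def posix_lines(s: str) -> int:
    # POSIX: a line is text terminated by a newline, so just count newlines.
    return s.count("\n")

def lenient_lines(s: str) -> int:
    # Alternative: also count a trailing incomplete line as a line.
    return s.count("\n") + (1 if s and not s.endswith("\n") else 0)

a, b = "A", "B"  # two "files", neither newline-terminated

# POSIX counting keeps concatenation additive: 0 + 0 == 0.
print(posix_lines(a) + posix_lines(b) == posix_lines(a + b))        # True

# Counting incomplete lines breaks it: 1 + 1 != 1.
print(lenient_lines(a) + lenient_lines(b) == lenient_lines(a + b))  # False
```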

rerdavies
1 replies
2h45m

The opengroup spec says no such thing.

simonh
0 replies
1h51m

> 3.206 Line
>
> A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

See also ‘3.403 Text File’ for the definition of a text file. No new line characters, no lines. No lines, not a text file.

mbrubeck
1 replies
3h59m

    $ echo -n "A" | wc --lines
    0

keybored
0 replies
2h17m

Yep, since wc(1) apparently strictly adheres to what a newline-terminated text file is. This is why plain-text files should end with a newline. :)

See: https://stackoverflow.com/a/25322168/1725151

LK5ZJwMwgBbHuVI
0 replies
52m

Why don't you go ask?

    $ echo -n foo | wc -l
    0

rerdavies
3 replies
2h46m

Technically, that is one of two possible interpretations, and you seem to have invented a "by definition" out of thin air.

Very very technically a "newline" character indicates the start of a new line, which is why it is not called the "end-of-line" character.

cortesoft
1 replies
2h25m

I mean, the person you are responding to didn't invent the definition out of thin air... the POSIX standard did:

> 3.206 Line
>
> A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

https://pubs.opengroup.org/onlinepubs/9699919799.2018edition...

nomel
0 replies
1h42m

POSIX getline() includes EOF as a line terminator:

    getline() reads an entire line from stream, storing the address
       of the buffer containing the text into *lineptr.  The buffer is
       null-terminated and includes the newline character, if one was
       found.
    ...
    ... a delimiter character is not added if one was
       not present in the input before end of file was reached.
So EOF is treated the same as end-of-string.
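(Python's line iteration behaves like getline here: the final unterminated chunk is still handed back, just without a newline. A quick check with an in-memory stream:)

```python
import io

# A "file" whose last line has no terminating newline.
f = io.StringIO("A\nB")
print(list(f))  # ['A\n', 'B'] -- the incomplete final line is still yielded
```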

LK5ZJwMwgBbHuVI
0 replies
47m

It doesn't indicate the start of a new line, or files would start with it. Files end with it, which is why it is a line terminator. And it is by definition: by the standard, by the way cat and/or your shell and/or your terminal work together, and by the way standard utilities like `wc` treat the file.

Gormo
3 replies
4h14m

Suddenly the DOS/Windows solution of using \r\n instead of just \n seems to offer some advantages.

samatman
1 replies
3h57m

This does precisely nothing to solve the ambiguity issue when a final line lacks a newline. The representation of that newline isn't relevant to the problem.

Izkata
0 replies
1h17m

It's actually slightly worse: Windows defines newline as a delimiter, not a terminator. So this:

  foo\nbar\n
Would be 2 lines on *nix and 3 lines on Windows.

deaddodo
0 replies
2h12m

The "Windows way" is the "right way" for a few reasons.

This is definitely not one of them.

joshjje
1 replies
1h45m

“A\n” is two lines.

LK5ZJwMwgBbHuVI
0 replies
46m

Factually incorrect.

danbruc
2 replies
6h29m

The POSIX definition of a line is a sequence of non-newline characters - possibly zero - followed by a newline. Everything that does not end with a newline is not a [complete] line. So strictly speaking it would even be correct that cat$ does not match cat, because there is no terminating newline; it should only match cat\n. But as lines missing a terminating newline are a thing, it seems reasonable to be less strict.

masklinn
1 replies
1h5m

> a line is a sequence of non-newline characters

Works for me.

How do you square that with your assertion that in your invention of "single-line mode" you implicitly define "line" as matching \n\n?

danbruc
0 replies
41m

If you are not in multi-line mode, then a single line is expected and consequently there is at most one newline, at the end of the string. You can of course pick an input that violates this and run it against a multi-line string with several newlines in it. cat\n\n will not match cat$ because there is something between cat and the end of the line; it just happens to be a newline, but one without any special meaning, because it is not the last character and you did not say that the input is multi-line.
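(This is easy to confirm with Python's `re`:)

```python
import re

# A single trailing newline is tolerated by $ ...
print(bool(re.search(r"cat$", "cat\n")))    # True

# ... but with two newlines there is now a character between "cat" and
# the positions where $ may match, so the match fails.
print(bool(re.search(r"cat$", "cat\n\n")))  # False
```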

eastbound
1 replies
6h56m

Probably a vulnerability issue. Programmers would leave multiline mode on by mistake, then validate that some string only contains ^[a-Z]*$… only for the string to have a \n and an SQL injection on the second line.
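(A sketch of that failure mode in Python; the pattern uses a plain [a-z] class and the input strings are purely illustrative. Note the trap exists even without MULTILINE, because $ forgives one trailing newline, while \A…\Z does not:)

```python
import re

# Validation that looks airtight but is not:
pattern = re.compile(r"^[a-z]+$", re.MULTILINE)

payload = "abc\ninjected line"
print(bool(pattern.search(payload)))  # True: the first line alone satisfies ^...$

# Even without MULTILINE, $ tolerates a single trailing newline:
print(bool(re.search(r"^[a-z]+$", "abc\n")))    # True

# \A...\Z anchors the entire string and rejects both inputs:
print(bool(re.search(r"\A[a-z]+\Z", "abc\n")))  # False
```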

masklinn
0 replies
6h39m

> Probably a vulnerability issue.

No? It’s a semantics decision.

amelius
12 replies
3h53m

What is driving me nuts is that we have Unicode now, so there is no need to use common characters like $ or ^ to denote special regex state transitions.

knome
6 replies
3h51m

the idea of changing a decades-old convention to instead use, as I assume you are implying, some character that requires special entry is beyond silly.

FranOntanaya
3 replies
3h5m

I don't think anyone who writes regex would feel especially challenged by using the Alt+ / Ctrl+Shift+U key combos for Unicode entry. Having to escape fewer things in a pattern would be nice.

amelius
1 replies
2h53m

Also, code is read more often than it is written.

cortesoft
0 replies
2h21m

People say this all the time, but is it really always true? I have a ton of code that I wrote, that just works, and I never really look at it again, at least not with the level of inspection that requires parsing the regex in my head.

cortesoft
0 replies
2h22m

I write regexes all the time, and I don't know if I would be CHALLENGED by that, but it would be annoying. Escaping things is trivial, and since you do it all the time it is not anything extra to learn. Having to remember bespoke keystrokes for each character is a lot more to learn.

keybored
1 replies
2h14m

It’s not that silly. You constantly get into escape conundrums because you need to use a metacharacter which is also a metacharacter three levels deep in some embedding.

(But that might not solve that problem? Maybe the problem is mostly about using same-character delimiters for strings.)

And I guess that’s why Perl is so flexible with regards to delimiters and such.

LK5ZJwMwgBbHuVI
0 replies
44m

Yes, languages really need some sort of "raw string" feature like Python (or make regex literals their own syntax like Perl does). That's the solution here, not using weird characters...

Yujf
3 replies
3h48m

Why not? Common characters are easier to type, and presumably if you are using regex on a Unicode string it might include these special characters anyway, so what have you gained?

amelius
2 replies
2h30m

In theory yes, in practice no.

What you have gained is that the regex is now much easier to read.

knome
0 replies
1h51m

It's easy to read now.

LK5ZJwMwgBbHuVI
0 replies
42m

> In theory yes, in practice no.

That's like "in theory we need 4 bytes to represent Unicode, but in practice 3 bytes is fine" (glances at universally-maligned utf8mb3)

yjftsjthsd-h
0 replies
3h46m

If we were willing to ignore the ability to actually type it, you don't need Unicode for that; ASCII has a whole block of control characters at the beginning; I think ASCII 25 ("End of medium") works here.

tangus
0 replies
7h2m

That gives the author space for another article ;)

davidw
0 replies
2h11m

What with unicode, it'd be fun to have Α and Ω available to make our regexps that much more readable...

jamesmunns
4 replies
8h53m

Same, tho it'd be interesting to see if this behavior holds if the file ends without a trailing newline and your match is on the final newline-less line.

fooofw
3 replies
8h21m

Fortunately, it's pretty simple to test.

    $ printf 'Line with EOL\nLine without EOL' | grep 'EOL$'        
    Line with EOL
    Line without EOL
    $ grep --version | head -n1
    grep (GNU grep) 3.8

romwell
1 replies
8h4m

The line does end with the file, so it's logically consistent.

It's not matching the newline character after all.

colimbarna
0 replies
7h36m

Yes exactly, they match the end of a line, not a newline character. Some examples from documentation:

man 7 regex: '$' (matching the null string at the end of a line)

pcre2pattern: The circumflex and dollar metacharacters are zero-width assertions. That is, they test for a particular condition being true without consuming any characters from the subject string. These two metacharacters are concerned with matching the starts and ends of lines. ... The dollar character is an assertion that is true only if the current matching point is at the end of the subject string, or immediately before a newline at the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however, that it does not actually match the newline. Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.
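(The "asserts the position, does not match the newline" point is visible in Python as well: the match span stops before the newline.)

```python
import re

m = re.search(r"EOL$", "Line with EOL\n")
print(m.group())  # 'EOL' -- no newline in the matched text
print(m.span())   # (10, 13); the newline at index 13 is left unconsumed
```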

jamesmunns
0 replies
7h27m

Thanks! I was AFK and didn't have a grep (or a shell) handy on my phone.

Izkata
4 replies
4h6m

Same here; when I saw the title I was like "well obviously not, where did you hear that?"

In nearly two decades of using regex I think this might be the first time I've heard of $ being end of string. It's always been end of line for me.

michaelt
2 replies
55m

Take a look at, for example, these Stack Overflow answers about a regex to validate an e-mail address: https://stackoverflow.com/a/8829363

These people are, I think, not intending to say that a newline character is permitted at the end of an e-mail address.

(Of course people using 'grep' would have different expectations for obvious reasons)

Izkata
1 replies
48m

Even disregarding whether or not end-of-string is also an end-of-line (see all the other comments below), $ doesn't match the newline, similar to zero-width matches like \b, so the newline wouldn't be included in the matched text either way.

I think this series of comments might be clearest: https://news.ycombinator.com/item?id=39764385

LK5ZJwMwgBbHuVI
0 replies
39m

Problem is, plenty of software doesn't actually look at the match but rather just validates that there was a match (and then continues to use the input to that match).

frame_ranger
0 replies
1h26m

You couldn’t write a post like this if you didn’t start with a strawman.

absoluteunit1
2 replies
6h55m

I’ve always thought that as well; mostly due to Vim though.

^ - takes you to start of line
$ - takes you to end of line

Izkata
1 replies
4h17m

^ actually takes you to the first non-whitespace character in the line in vim. For start of line you want 0

kataklasm
0 replies
24m

I don't have (n)vi(m) open right now but I think this only applies to prepending spaces. For prepending tabs, 0 will take you to the first non-tab character as well.

notnmeyer
0 replies
2h0m

i feel like this perspective will be split between folks who use regex in code with strings and more sysadmin folks who are used to consuming lines from files in scripts and at the cli.

but yeah seems like a real misunderstanding from “start/end of string” people

kqr
0 replies
7h48m

I'm the same, but now that I try in Perl, sure enough, $ seems to default to being a positive lookahead assertion for the end of the string. It does not match and consume an EOL character.

Only in multiline mode does it match at EOL characters, but even then it does not appear to consume them. In fact, I cannot construct a regex that captures the last character of one line, then consumes the newline, and then captures the first character of the next line, while using $. The capture group simply ends at $.

cerved
0 replies
1h24m

In `sed` it's end of string.

The end of the string is usually the end of a line, but not if you use commands like `N` to manipulate multi-line strings

antegamisou
0 replies
8h16m

> Maybe because a fair amount of the work I do with regexes (and, probably, how I was introduced to them) is via `grep`, so I'm often thinking of the inputs as "lines" rather than "strings"?

Vim is what did that for me.

alphazard
0 replies
4h48m

This must be the "second problem" everyone talks about with regular expressions.

beardyw
38 replies
9h30m

Does anyone consider RegEx to be standardised? Moving to a new context is always a relearning exercise in my experience.

rusk
12 replies
9h26m

My understanding is it was standardised for POSIX, but the variants in popular use differ in many ways.

I consider sed to be the baseline. If you can do sed you can do anything but it’s seriously limited.

susam
7 replies
9h19m

POSIX specifies two flavours of regular expressions: basic regular expressions (BRE) and extended regular expressions (ERE). There are subtle differences between the two and ERE supports more features than BRE. For example, what is written as a\(bc\)\{3\}d in BRE is written as a(bc){3}d in ERE. See https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1... for more details.

The regular expression engines available in most mainstream languages go well beyond what is specified in POSIX though. An interesting example is named capturing group in Python, e.g., (?P<token>f[o]+).
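(For instance, the named-group example retrieved by name in Python:)

```python
import re

m = re.search(r"(?P<token>f[o]+)", "a foo b")
print(m.group("token"))  # 'foo'
print(m.groupdict())     # {'token': 'foo'}
```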

tankenmate
3 replies
9h2m

Indeed, and the most common is Perl since it was the source of many of the extensions.

rusk
2 replies
7h52m

I would hazard that nowadays it’s Java due to its broad permeation of the application space

account42
1 replies
5h55m

If anything it would be ECMAScript (JavaScript dwarfs Java use) or PCRE (the de-facto continuation of Perl regular expressions, written in C but used in many languages).

rusk
0 replies
5h24m

Yes I think you’re right actually. I’m about 10 years off :)

jwilk
2 replies
8h39m

> what is written as \(f..\)\1 in BRE is written as (f..)\1 in ERE

Oddly, there are no backreferences in POSIX EREs.

susam
0 replies
7h30m

You are right. I looked at the specification again and indeed there is no back-reference in POSIX ERE.

Quoting from <https://pubs.opengroup.org/onlinepubs/9699919799.2008edition...>:

> It was suggested that, in addition to interval expressions, back-references ( '\n' ) should also be added to EREs. This was rejected by the standard developers as likely to decrease consensus.

Updated my comment to present a better example that avoids back-references. Thanks!

GrumpySloth
0 replies
6h2m

That’s because POSIX EREs are actual regular expressions, thank god.

psd1
3 replies
8h32m

No GNU tool can balance brackets, AFAICS. So you can't do everything in sed. And sed is, by design, useless for matching text that spans lines, so good luck picking out paragraphs with it.

ykonstant
1 replies
7h49m

I am pretty sure even pure Awk can do it; or am I mistaken? I thought there was an even more sophisticated example in the Awk book.

Edit: oh, you mean via regex engines available in GNU tools; I am dumb. Hmm... is there no GNU extension with PCRE?

colimbarna
0 replies
7h19m

"Sed" is the name of a specific tool. It is not defined by the GNU tools, but has existed in some form since 1974, well before Perl. GNU sed and POSIX sed both support BRE and EREs, but not PCREs.

Maybe there's some other implementation of sed that supports PCREs but that would really be an extension of that implementation of sed rather than a property of sed.

And maybe there's some GNU tool that uses PCREs, but that GNU tool would not be GNU sed, so it would not be a relevant property.

Anyway, they probably should have said BREs or EREs rather than "sed"...

rusk
0 replies
7h55m

Sorry I meant to write “if you can do it in sed you can do it in anything” thereby implying it is a subset of the more generally available flavours. The issue at hand however is that there isn’t much in the way of standardisation but 95% of sed should work across all of them. Of course you should get more into the specifics of whatever your solution space supports.

telotortium
6 replies
9h25m

Languages invented after Perl will generally use some flavor of Perl regex syntax, but there are always some minor differences. The issue of the meaning of `$` and changing it via multi-line mode is usually consistent though.

usrusr
5 replies
9h5m

I like to think of "whatever browsers do in js" as an updated common baseline. Whatever your regex engine does, describe it as a delta to the js precedent. That thing is just so ubiquitous.

I do wonder though what's the highest number of different regex syntaxes I've ever encountered (perhaps written?) within a single line: bash, grep and sed are never not in a "hold my beer" mood!

psd1
0 replies
8h24m

Reason #2 to use powershell - consistent regex.

I've got "hold my beer" commits in .net - I've balanced brackets. I believe that's impossible in sed and grep. If I were going to write a json parser in a script, then a) stop me and b) it's got to be in powershell.

mwpmaybe
0 replies
3h13m

> I do wonder though what's the highest number of different regex syntaxes I've ever encountered (perhaps written?) within a single line: bash, grep and sed are never not in a "hold my beer" mood!

Your comment is missing a trigger warning, lol. But seriously, this is one of my flags for "this should probably be a script, or an awk or perl one-liner."

layer8
0 replies
7h9m

That seems like just a web front-end developer’s perspective.

kstrauser
0 replies
3h26m

I’ll go along with that, as long as someone ports pcre to JavaScript and that’s the browser syntax we land on.

jasonjayr
6 replies
9h5m

The three big ones I know of are POSIX, Perl/PCRE (aka Perl-Compatible Regular Expressions), and Go came along and <strike>added</strike> used re2, which is a bit different from the first two.

A lot of systems implemented PCRE, including JavaScript, since Perl extended the POSIX system with many useful extensions. IIRC, re2 tries to rein in some of the performance issues and quirks the original systems had, while implementing the whole thing in Go.

edit: Did not realize re2 predated go ...

jerf
3 replies
4h29m

POSIX and PCRE are arguably redundant. They both support backreferences, which puts very significant constraints on their implementations. PCRE is at least functionally a superset of POSIX, whether or not there's some quirky thing POSIX supports that PCRE does not.

re2 adds a legitimate option to the menu of using NDFAs, which have the disadvantage of not supporting backreferences, but have the advantage of having constrained complexity of scanning a string. This does not come for free; you can conceivably end up with a compiled regexp of very large size with an NDFA approach, but most of the time you won't. The result may be generally slower than a PCRE-type approach, but it can also end up safer because you can be confident that there isn't a pathological input string for a given regexp that will go exponential.

This is one of those cases where ~99% of the time, it doesn't really matter which you choose, but at the scale of the Entire Programming World, both options need to be available. I've got some security applications where I legitimately prefer the re2 implementation in Go because it is advantageous to be confident that the REs I write have no pathological cases in the arbitrary input they face. PCRE can be necessary in certain high-performance cases, as long as you can be sure you're not going to get that pathological input.

RE engines don't quite engender the same emotions as programming languages as a whole, but this is not cheerleading, this is a sober engineering assessment. I use both styles in my code. I've even got one unlucky exe I've been working with lately that has both, because it rather irreducibly has the requirements for both. Professionally annoying, but not actually a problem.
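(Python's `re` is a backtracking engine, so the pathological case jerf describes is easy to provoke with a classic nested-quantifier pattern; an RE2-style automaton engine rejects the same input in linear time. The input is kept deliberately small so this finishes quickly:)

```python
import re

# Nested quantifiers plus a forced failure: a backtracking engine must try
# every way of splitting the a's between the two +'s, so each extra 'a'
# roughly doubles the running time.
hay = "a" * 18 + "b"
print(re.match(r"(a+)+$", hay))  # None, after ~2^17 backtracking attempts
```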

burntsushi
1 replies
4h9m

I'll add two notes to this:

* Finite automata based regex engines don't necessarily have to be slower than backtracking engines like PCRE. Go's regexp is in practice slower in a lot of cases, but this is more a property of its implementation than its concept. See: https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa... --- Given "sufficient" implementation effort (~several person years of development work), backtrackers and finite automata engines can both perform very well, with one beating the other in some cases but not in others. It depends.

* Fun fact is that if you're iterating over all matches in a haystack (e.g., Go's `FindAll` routines), then you're susceptible to O(m * n^2) search time. This applies to all regex engines that implement some kind of leftmost match priority. See https://github.com/BurntSushi/rebar?tab=readme-ov-file#quadr... for a more detailed elaboration on this point.

jerf
0 replies
4h7m

Excellent, thank you.

keybored
0 replies
2h9m

> RE engines don't quite engender the same emotions as programming languages as a whole, but this is not cheerleading, this is a sober engineering assessment.

Good on you.

jpgvm
0 replies
8h58m

re2 predates Go and was written in C++.

foldr
0 replies
8h13m

Go's regex implementation is new in the sense that it's not just a binding to the re2 C++ library, but it uses the same non-backtracking algorithm.

bregma
4 replies
7h55m

The ISO/IEC 14882 C++ standard library <regex> mandates [0] implementations of six de jure standard regex grammars: IEEE Std 1003.1-2008 (POSIX) [1] BRE, ERE, awk, grep, and egrep, plus ECMA-262 ECMAScript 3 [2].

So, yes, at least someone (me) considers regex to be standardized in several published de jure standards.

  [0] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf#chapter.28
  [1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
  [2] https://262.ecma-international.org/14.0/#sec-regexp-regular-expression-objects

pjc50
1 replies
7h2m

"At least six different standards" is an XKCD comic, not a standard.

riffraff
0 replies
6h43m

"The nice thing about standards is that you have so many to choose from." - Andrew Tanenbaum (or Grace Hopper)

account42
1 replies
5h56m

<regex> is not exactly an example anyone should follow.

bregma
0 replies
2h26m

You may be prejudiced against C++, but ISO/IEC 14882 is a published international standard that links to recognized regex standards, so answers the question "does anyone consider RegEx standardised?" very much in the affirmative.

wolletd
1 replies
9h25m

At some point, I felt like I knew them all. There are probably more regex dialects out there, but I don't encounter them and my set of knowledge works most of the time.

I feel it's like driving a rental car. It behaves slightly different than your own car, some features missing, some other features added, but in general, most of the things are pretty similar.

stanislavb
0 replies
9h12m

What a nice analogy. I’ll borrow it in the future.

tonyg
0 replies
5h23m

Delightfully, RFC 9485 https://datatracker.ietf.org/doc/rfc9485/ "I-Regexp: An Interoperable Regular Expression Format" was published just back in October last year!

out-of-ideas
0 replies
9h14m

kind of a trick question; there is POSIX, and then there is the app you're using and whichever flags are enabled (whether by default or explicitly defined)

beardyw
0 replies
5h50m

And don't get me started on find and replace: what is the symbol to insert the match?

MattHeard
0 replies
9h22m

My working assumption has always been to check the docs of your specific regexp parser, and to write some tests (either automated or manually in a REPL) with specific patterns that you are interested in using.

onion2k
32 replies
9h11m

I can hear thousands of bad hiring managers adding 'How do you match the end of a string in a regex?' to their list of 'Ha! You don't know the trick!' questions designed to catch out candidates.

hoc
31 replies
8h15m

"I will hire you anyway, but I will pay you less"

Regex, useful in any job...

username_my1
30 replies
8h9m

regex is useful but chatgpt is amazing at it, so why spend a minute keeping such useless knowledge in mind.

If you know where to find something, there's no point in knowing it.

ykonstant
25 replies
7h52m

Does gpt produce efficient regex? Are there any experts here that can assess the quality and correctness of gpt-generated regex? I wonder how regex responses by gpt are validated if the prompter does not have the knowledge to read the output.

thecatspaw
18 replies
7h43m

what does gpt say about how we should validate email addresses?

layer8
7 replies
7h16m

…which both excludes addresses allowed by the RFC and includes addresses disallowed by the RFC. (For example, the RFC disallows two consecutive dots in the local-part.)

KMnO4
5 replies
5h24m

I take the descriptivist approach to email validation, rather than the prescriptivist.

I know an email has to have a domain name after the @ so I know where to send it.

I also know it has to have something before the @ so the domain’s email server knows how to handle it.

But do I care if the email server supports sub addresses, characters outside of the commonly supported range (eg quotation marks and spaces), or even characters which aren't part of the RFC? I do not.

If the user gives me that email, I’ll trust them. Worst case they won’t receive the verification email and will need to double check it. But it’s a lot better than those websites who try to tell me my email is invalid because their regex is too picky.

layer8
2 replies
5h19m

I generally agree, but the two consecutive dots (or leading/trailing dots) are an example that would very likely be a typo and that you wouldn’t particularly want to send. Similar for unbalanced quotes, angle brackets, and other grammar elements.

dumbo-octopus
1 replies
41m

I wonder whether simply (regex) replacing a sequence of .'s with a single one as part of a post-processing step would be effective.

layer8
0 replies
15m

That would be bad form, IMO. The user may have typed john..kennedy@example.com by mistake instead of john.f.kennedy@example.com, and now you’ll be sending their email to john.kennedy@example.com. Similar for leading or trailing dots. You can’t just decide what a user probably meant, when they type in something invalid.

wtetzner
0 replies
5h15m

Yeah, that's about as far as I've ever been comfortable going in terms of validating email addresses too: some stuff followed by "@" followed by more stuff.

Though I guess adding a check for invalid dot patterns might be worthwhile.
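That level of check stays easy to read. A minimal Python sketch (the pattern and helper name are illustrative only, not any standard):

```python
import re

# Pragmatic "descriptivist" check: something before "@", a dotted
# domain after it, plus a guard against the likely-typo dot patterns
# discussed above (leading/trailing/consecutive dots in the local part).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(addr: str) -> bool:
    if not EMAIL_RE.fullmatch(addr):
        return False
    local = addr.partition("@")[0]
    return not (local.startswith(".") or local.endswith(".") or ".." in local)

print(looks_like_email("john.f.kennedy@example.com"))  # True
print(looks_like_email("john..kennedy@example.com"))   # False
```

Anything that passes this still needs a confirmation email to be considered real.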

jcranmer
0 replies
4h29m

The HTML email regex validation [1] is probably the best rule to use for validating an email address in most user applications. It prohibits IP address domain literals (which the emailcore people have basically said is of limited utility [2]), and quoted strings in the localpart. Its biggest fault is allowing multiple dots to appear next to each other, which is a lot of faff to put in a regex when you already have to individually spell out every special character in atext.

[1] https://html.spec.whatwg.org/multipage/input.html#email-stat...

[2] https://datatracker.ietf.org/doc/draft-ietf-emailcore-as/

marcosdumay
0 replies
4h46m

What is maybe more important to note: it completely disallows the languages of some 4/5 of humanity, and partially disallows those of some 2/3 of the rest.

zaxomi
1 replies
6h35m

Remember to first punycode the domain part of an email address before trying to validate it, or it will not work with internationalized domain names.
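In Python, for example, the stdlib `idna` codec can do that conversion before the address hits a validation regex (a sketch; note the stdlib codec implements IDNA 2003, while the third-party `idna` package implements the newer IDNA 2008):

```python
# Punycode the domain part first, then validate the ASCII form.
addr = "user@bücher.example"
local, _, domain = addr.partition("@")
ascii_domain = domain.encode("idna").decode("ascii")
print(ascii_domain)  # xn--bcher-kva.example
```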

jameshart
0 replies
5h54m

Support for IDN email addresses is still patchy at best. Many systems can’t send to them; many email hosts still can’t handle being configured for them.

sebstefan
0 replies
7h11m

Actually pretty good response if the programmer bothers to read all of it

I'd be more emphatic that you shouldn't rely on regexes to validate emails and that this should only be used as an "in the form validation" first step to warn of user input error, but the gist is there

This regex is *practical for most applications* (??), striking a balance between complexity and adherence to the standard. It allows for basic validation but does not fully enforce the specifications of RFC 5322, which are much more intricate and challenging to implement in a single regex pattern.

^ ("challenging"? Didn't I see that email validation requires at least a grammar and not just a regex?)

For example, it doesn't account for quoted strings (which can include spaces) in the local part, nor does it fully validate all possible TLDs. Implementing a regex that fully complies with the RFC specifications is impractical due to their complexity and the flexibility allowed in the specifications.

For applications requiring strict compliance, it's often recommended to use a library or built-in function for email validation provided by the programming language or framework you're using, as these are more likely to handle the nuances and edge cases correctly. Additionally, the ultimate test of an email address's validity is sending a confirmation email to it.
bonki
0 replies
6h39m

Not good at all, but a little better than expected. I use + in email addresses prominently and there are so many websites who don't even allow that...

criley2
3 replies
7h30m

Prompt:

'I'm writing a nodejs javascript application and I need a regex to validate emails in my server. Can you write a regex that will safely and efficiently match emails?'

GPT4 / Gemini Advanced / Claude 3 Sonnet

GPT4: `const emailRegex = /^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;` Full answer: https://justpaste.it/cg4cl

Gemini Advanced: `const emailRegex = /^[a-zA-Z0-9.!#$%&'+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)$/;` Full answer: https://justpaste.it/589a5

Claude 3: `const emailRegex = /^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$/;` Full answer: https://justpaste.it/82r2v

zaxomi
0 replies
6h29m

Still doesn't support internationalized domain names.

dfawcus
0 replies
4h41m

Whereas email more or less lasts forever (mailbox contents), and has to be backwards compatible with older versions back to (at least) RFC 821/822, or those before. It also allows almost any character (when escaped at 821 level) in the host or domain part (domain names allow any byte value).

So a Internet email address match pattern has to be: "..*@..*", anything else can reject otherwise valid addresses.

That however does not account for earlier source routed addresses, nor the old style UUCP bang paths. However those can probably be ignored for newly generated email.

I regularly use an email address with a "+" in the local part. When I used qmail, I often used addresses like: "foo-a/b-bar-tat@DOMAIN". Mainly for auto filtering received messages from mailing lists.

croemer
0 replies
4h49m

Terrible answers as far as I can tell; especially ChatGPT's would throw out many valid email addresses.

skeaker
0 replies
54m

There really ought to be a regex repository of common use cases like these so we don't have to reinvent the wheel or dig up a random codebase that we hope is correct to copy from every time.

da39a3ee
4 replies
7h39m

You don't have to be an expert; you should very rarely be using regexes so complex that you can't understand them.

hnlmorg
1 replies
6h31m

...and if you can understand them then you clearly understand regex enough not to need ChatGPT to write them

kaibee
0 replies
4h56m

I understand assembly too.

zacmps
0 replies
7h22m

It might not be obvious when you hit that point, bad regexes can be subtle, just see that old cloudflare postmortem.

mnau
0 replies
1h47m

Even simple regexes can be problematic, e.g. the GitLab RCE bug through ExifTool

https://devcraft.io/2021/05/04/exiftool-arbitrary-code-execu...

"a\ > ""

The second quote was not escaped because in the regex $tok =~ /(\\+)$/ the $ will match the end of a string, but also match before a newline at the end of a string, so the code thinks that the quote is being escaped when it’s escaping the newline.
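The same footgun reproduces in Python, which shares this `$` behavior (an illustrative translation of the Perl check, not the actual ExifTool code):

```python
import re

# tok is "a" + backslash + newline. The naive "does the token end in
# backslashes?" check succeeds because $ matches *before* the trailing
# newline, not at the true end of the string.
tok = "a\\\n"
m = re.search(r"(\\+)$", tok)
print(m.group(1))                  # one backslash matched
print(re.search(r"(\\+)\Z", tok))  # None: \Z anchors at the true end
```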
2devnull
0 replies
5h19m

That was one of my first uh oh moments with gpt. Getting code that clearly had untestable/unreadable regexen, which given the source must have meant the regexes were gpt generated. So much is going to go wrong, and soon.

berkes
2 replies
7h22m

if you know where to find something no point in knowing it.

Nonsense. And you know it.

First, you need to know what to find, before knowing where to find it. And knowing what to find requires intricate knowledge of the thing. Not intricate implementation details, but enough to point yourself in the right direction.

Secondly, you need to know why to find thing X and not thing Y. If anything, ChatGPT is even worse than google or stackoverflow at "solving the XY problem for you". With an XY problem, you don't want X solved; you want to be told that you don't want to solve it.

Maybe some future LLM can also push back. Maybe some future LLM can guide you to the right answer for a problem. But at the current state: nope.

Related: regexes are almost never the best answer to any question. They are available and quick, so all considered, maybe "the best" for this case. But overall: nah.

pksebben
0 replies
2h34m

While I agree with your point that knowing things matters, it is entirely possible with the current batch of LLMs to get to an answer you don't know much about. It's actually one of the few things they do reliably well.

You start with what you do know, asking leading questions and being clear about what you don't, and you build towards deeper and deeper terminology until you get to the point where there are docs to read (because you still can't trust them to get the specifics right).

I've done this on a number of projects with pretty astonishing results, building stuff that would otherwise be completely out of my wheelhouse.

lolc
0 replies
28m

Funny for me there have been instances where the LLM did push back. I had a plan of how to solve something and tasked the LLM with a draft implementation. It kept producing another solution which I kept rejecting and specifying more details so it wouldn't stray. In the end I had to accept that my solution couldn't work, and that the proposed one was acceptable. It's going to happen again, because it often comes up with inferior solutions so I'm not very open to the reverse situation.

HumblyTossed
0 replies
4h41m

This is something ChatGPT would say.

Izmaki
20 replies
9h41m

The new-line character is an actual character "at the end" of the string though so it makes sense that $ would include the new-line character in multi-line matching.

IshKebab
18 replies
9h33m

Yes, and every implementation gets that right. The point was about when multi-line matching is disabled, where only JavaScript, Go and Rust get it right.

I'm not too surprised by PHP and Python getting it wrong. Java and C# is a slight surprise though.

danbruc
16 replies
9h9m

I don't think it is correct to say some get it right and some get it wrong, it is more of a design decision.

IshKebab
13 replies
8h55m

It's possible to get design decisions wrong. Clearly people expect `$` to only match end-of-string so they did make the wrong decision. It may not have been clear it was the wrong decision at the time.

danbruc
11 replies
8h39m

Things are obviously more complicated than that, lines are a complicated issue for historical reasons. There are two conventions, line termination and line separation. In case of line termination, the newline is part of the line and a string without a newline is not a [complete] line. In case of line separation, the newline is not part of the line but separates two lines. Also the way newlines are encoded is not universal.

fauigerzigerk
10 replies
7h51m

Why is this relevant when multi-line is disabled?

danbruc
9 replies
7h13m

Because even after disabling multi-line you are still dealing with line-based semantics when you use ^ or $, the newline at the end is still not part of the content. You have to use \A and \Z if you want to treat all characters as a string instead of one or multiple lines.
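Python illustrates the difference between the two pairs of anchors:

```python
import re

# $ is not the true end of the string: it also matches just before a
# trailing newline. \Z (Python's end-of-string anchor) does not.
print(re.search(r"cat$", "cat\n"))     # matches
print(re.search(r"cat\Z", "cat\n"))    # None
print(re.search(r"cat\n\Z", "cat\n"))  # matches, newline included
```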

burntsushi
8 replies
6h36m

Because even after disabling multi-line you are still dealing with line-based semantics when you use ^ or $

No, you're not, except for this weird corner case where `$` can match before the last `\n` in a string. It's not just any `\n` that non-multiline `$` can match before. It's when it's the last `\n` in the string. See:

    >>> re.search('cat$', 'cat\n')
    <re.Match object; span=(0, 3), match='cat'>
    >>> re.search('cat$', 'cat\n\n')
    >>>
This is weird behavior. I assume this is why RE2 didn't copy this. And it's certainly why I followed RE2 with Rust's regex crate. Non-multiline `$` should only match at the end of the string. It should not be line-aware. In regex engines like Python where it has the behavior above, it is only "partially" line-aware, and only in the sense that it treats the last `\n` as special.

danbruc
7 replies
5h5m

But that is exactly what it means, the end of the line is before the terminating newline or at the end of the string if there is no terminating newline. Both ^ and $ always match at start or end of lines, \A and \Z match at the start or end of the string. The difference between multi-line and not is whether or not internal newlines end and start lines, it does not change the semantics from end of line to end of string. And if you are not in multi-line mode but have internal newlines, then you might also want single-line/dot-all mode.

One could certainly have a debate whether this behavior is too strongly tied to the origins of regular expressions and now does more harm than good, but I am not convinced that this would be an easy and obvious breaking change to make.

burntsushi
5 replies
4h18m

re.search does not accept a "line." It accepts a "string." There is no pretext in which re.search is meant to only accept a single line. And giving it a `string` with multiple new lines doesn't necessarily mean you want to enable multi-line mode. They are orthogonal things.

Both ^ and $ always match at start or end of lines

This is trivially not true, as I showed in my previous example. The haystack `cat\n\n` contains two lines and the regex `cat$` says it should match `cat` followed by the "end of a line" according to your definition. Yet it does not match `cat` followed by the end of a line in `cat\n\n`. And it does not do so in Python or in any other regex engine.

You're trying to square a circle here. It can't be done.

Can you make sense of, historically, why this choice of semantics was made? Sure. I bet you can. But I can still evaluate the choice on its own merits today. And I did when I made the regex crate.

but I am not convinced that this would be an easy and obvious breaking change to make.

Rust's regex crate, Go's regexp package and RE2 all reject this whacky behavior. As the regex crate maintainer, I don't think I've ever seen anyone complain. Not once. This to me suggests that, at minimum, making `$` and `\z` equivalent in non-multiline mode is a reasonable choice. I would also argue it is the better and more sensible approach.

Whether other regex engines should have a breaking change or not to change the meaning of `$` is an entirely different question completely. That is neither here nor there. They absolutely will not be able to make such a change, for many good reasons.

danbruc
4 replies
2h24m

re.search does not accept a "line." It accepts a "string." There is no pretext in which re.search is meant to only accept a single line.

Sure, it takes a string which might be a line or multiple or whatever. Does not change the fact that $ matches at the end of a line. If you want the end of the string, use \Z.

This is trivially not true, as I showed in my previous example. The haystack `cat\n\n` contains two lines and the regex `cat$` says it should match `cat` followed by the "end of a line" according to your definition.

In multi-line mode it matches, in single-line mode it does not because there is a newline between cat and the end of the line. A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.

burntsushi
3 replies
2h16m

In multi-line mode it matches, in single-line mode it does not because there is a newline between cat and the end of the line. A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.

This only makes sense if re.search accepted a line to search. It doesn't. It accepts an arbitrary string.

I don't think this conversation is going anywhere. Your description of the semantics seems inconsistent and incomprehensible to me.

A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.

The first `\n` in `cat\n\n` is a terminating newline. There just happens to be one after it.

Like I said, your description makes sense if the input is meant to be interpreted as a single line. And in some contexts (like line oriented CLI tools), that can make sense. But that's not the case here. So your description makes no sense at all to me.

danbruc
2 replies
1h7m

This only makes sense if re.search accepted a line to search. It doesn't. It accepts an arbitrary string.

Which is fine because lines are a subset of strings. And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.

The first `\n` in `cat\n\n` is a terminating newline. There just happens to be one after it.

Look at where this is coming from. You do line-based stuff, there is either no newline at all or there is exactly one newline at the end. You do file-based stuff, there are many newlines. In both cases the behavior of ^ and $ makes perfect sense.

Now you come along with cat\n\n which clearly falls into the file-based stuff category as it has more than one newline in it but you also insist that it is not multiple lines. If it is not multiple lines, then only the last character can be a newline, otherwise it would be multiple lines.

And I get it, yes, you can throw arbitrary strings at a regular expression, this line-based processing is not everything, but it explains why things behave the way they do. And that is also why people added \A and \Z. And I understand that ^ and $ are much nicer and much better known than \A and \Z. Maybe the best option would be to have a separate flag that makes them synonymous with \A and \Z and this could maybe even be the default.

burntsushi
1 replies
32m

And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.

Where is this semantic explained in the `re` module docs?

This is totally and completely made up as far as I can tell.

This also seems entirely consistent with my rebuttal:

Me: What you're saying makes sense if condition foo holds.

You: Condition foo holds.

This is uninteresting to me because I see no reason to believe that condition foo holds. Where condition foo is "the input to re.search is expected to be a single line." Or more precisely, apparently, "the input to re.search is expected to be a single line when either ^ or $ appear in the pattern." That is totally bonkers.

but it explains why things behave the way they do

Firstly, I am not debating with you about the historical reasoning for this. Secondly, I am providing a commentary on the semantics themselves (they suck) and also on your explanation of them in today's context (it doesn't make sense). Thirdly, I am not making a prescriptive argument that established regex engines should change their behavior in any way.

If you're looking to explain why this semantic is the way it is, then I'd expect writing from the original implementors of it. Probably in Perl. I wouldn't at all be surprised if this was an "oops" or if it was implemented in a strictly-line-oriented context, and then someone else decided to keep it unthinkingly when they moved to a non-line-oriented context. From there, compatibility takes over as a reason for why it's with us today.

danbruc
0 replies
5m

I quoted the section from the Python module here. [1]

If you do not specify multi-line, bar$ matches a lines ending in bar, either foobar\n or foobar if the terminating newline has been removed or does not exist. If you specify multi-line, then it will also match at every bar\n within the string. So it either treats your input as a single line or as multiple lines. You can of course not specify multi-line and still pass in a string with additional newlines within the string, but then those newlines will be treated more or less as any other character, bar$ will not match bar\n\n. The exception is that dot will not match them except you set the single-line/dot-all flag, bar\n$ will match bar\n\n but bar.$ will not unless you specify the single-line/dot-all flag.

[1] https://news.ycombinator.com/item?id=39765086
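Those cases, sketched concretely in Python:

```python
import re

s = "foo\nbar\n"
print(re.search(r"bar$", s))           # matches: $ before the final \n
print(re.search(r"foo$", s))           # None: the \n after foo is internal
print(re.search(r"foo$", s, re.M))     # matches: multi-line $ at inner \n too
print(re.search(r"foo.bar", s))        # None: . does not match \n
print(re.search(r"foo.bar", s, re.S))  # matches with dot-all
```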

IshKebab
0 replies
4h46m

But that is exactly what it means

I think you've kind of missed the point. Sure if `$` in non-multiline mode means "end of line" the behaviour might be reasonable. But the big error is that people DO NOT EXPECT `$` to mean "end of line" in that case. They expect it to mean "end of string". That's clearly the least surprising and most useful behaviour.

The bug is not in how they have implemented "end of line" matching in non-multiline mode. It's that they did it at all.

dfawcus
0 replies
7h50m

Given that in unix they sort of started as:

    ed -> sed
    ed -> grep
The line oriented nature makes sense.

There is some sed multi-line capability if one uses the hold space, but it is much easier to just use awk.

tankenmate
1 replies
8h51m

Not quite, there are standards for this behaviour (de jure and de facto).

danbruc
0 replies
7h3m

And the ones that do not match cat\n with cat$ arguably have it wrong. Both ^ and $ anchor to the start and end of lines, not to the start and end of strings, whether in multi-line mode or not.

noirscape
0 replies
7h30m

It's not wrong actually. It's the difference between BRE and ERE, which are the two different POSIX standards that define regex. In BRE the $ should always match the end of the string (the spec specifically says it should match the string terminator since "newlines aren't special characters"), while the ERE spec says it should match until the end of the line.

The real issue is that no language nowadays "just" implements BRE or ERE since both specs are lacking in features.

Most languages instead implement some variant of Perl's regex instead (often called PCRE regex because of the C library that brought Perl's regex to C), which as far as I can tell isn't standardized, so you get these subtle differences between implementations.

mnw21cam
0 replies
9h33m

The article is about when multi-line is disabled.

pjc50
11 replies
9h17m

Special misery case: Visual Studio supports regex search, where '$' matches \n.

The end of line character is usually the standard Windows \r\n.

Yes, that means if you want to really match the end of line you have to match "\r$". So broken.

jbverschoor
6 replies
9h1m

The whole \r is archaic. It doesn't even behave properly in most cases. Just use \n everywhere and bite the lemon for a short while to fix your problems.

And if you believe \r\n is the way to go, please make sure \n\r also works as they should have the same results. (or \r\n\r\r\r\r for that matter)

keybored
3 replies
2h3m

Why did they even decide to use two characters for the end of line? Seems bizarre. I could have imagined that `\r` and `\n` was a tossup. But why both?

mnau
1 replies
1h37m

Likely compatibility bugs going back decades (70s?). Probably with some terminal/teletype.

\r - returned teletype head to the start of a line

\n - move paper one line down

The sequence CR+LF was commonly used on many early computer systems that had adopted Teletype machines—typically a Teletype Model 33 ASR—as a console device, because this sequence was required to position those printers at the start of a new line. The separation of newline into two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in time to print the next character. Any character printed after a CR would often print as a smudge in the middle of the page while the print head was still moving the carriage back to the first position. "The solution was to make the newline two characters: CR to move the carriage to column one, and LF to move the paper up."[2] In fact, it was often necessary to send extra padding characters—extraneous CRs or NULs—which are ignored but give the print head time to move to the left margin. Many early video displays also required multiple character times to scroll the display.

https://en.wikipedia.org/wiki/Newline

jbverschoor
0 replies
1h18m

It’s similar to an old school typewriter.

The handle does 2 things: return and feed. You can also just return by not pulling all the way or the other way around depending on the design

HideousKojima
0 replies
26m

Typewriters is why

psd1
0 replies
8h19m

There are unices that use LFCR endings... computing is an endless bath in history

HideousKojima
0 replies
4h27m

But without \r how am I supposed to print to my typewriter over serial cable? Only half-joking, that's the setup my family had in the early 90's.

skrebbel
3 replies
9h8m

FWIW, and I know this doesn't really address your complaint: I use Windows and I've set all my text editors to use LF exclusively years ago and Things Are Great. No more weird Git autocrlf warnings, no quirks when copying files over to/from people on Macs or Linuxes, etc. Even Notepad has supported LF line endings for quite a long time now - in my practical experience, there's little remaining in Windows that makes CRLF "the OS standard line ending".

I bet if someday VS Code's Windows build ships with LF default on new installations, people won't even notice.

I mean, at some point it did matter what the OS did when you pressed the "Enter" button. But this isn't really the case much anymore. VS Code catches that keypress, and inserts whatever "files.eol" is set to. Sublime does the same. I didn't check, but I assume every other IDE has this setting.

Similarly, the HTML spec, which is pretty nuts, makes browsers normalize my enters to LF characters as I type into this textarea here (I can check by reading the `value` property in devtools), but when it's submitted, it converts every LF to a CRLF because that's how HTML forms were once specced back in the day. Again though, what my OS considers to be "the standard newline" is simply not considered at all. Even CMD.EXE batch files support LF.

I don't really type newlines all that much outside IDEs and browsers (incl electron apps) and places like MS Word, all of which disregard what the OS does and insert their own thing. Maybe the terminal? I don't even know. I doubt it's very consequential.

EDIT: PSA the same holds for backslashes! Do Not Use Backslashes. Don't use "OS specific directory separator constants". It's not 1998, just type "/" - it just works.

pjc50
0 replies
6h59m

I bet if someday VS Code's Windows build ships with LF default on new installations, people won't even notice.

As with '/', they really ought to do this some day but won't.

n_plus_1_acc
0 replies
8h9m

I could never get visual studio (not code) to not use \r\n when editing a solution file via the gui

divingdragon
0 replies
7h52m

Even CMD.EXE batch files support LF.

I don't know if it is the case on Windows 11, but I have surely been bitten by CMD batch files using LF line endings. I don't remember the exact issue but it may have been the one bug affecting labels. [1]

[1]: https://www.dostips.com/forum/viewtopic.php?t=8988#p58888

ikiris
10 replies
9h51m

this is mostly due to the different types of regex and less about it being platform dependent. $ was end of string in pcre which is the "old" perl compatible regex. python has its own which has quirks as mentioned, re2 is another option in go for example, and i think rust has its own version as well iirc.

wolletd
3 replies
9h32m

The differences of the various regex "dialects" came to me over the years of using regular expressions for all kinds of stuff.

Matching EOL feels natural for every line-based process.

What I find way more annoying is escaping characters and writing character groups. Why can't all regex engines support '\d' and '\w' and such? Why, in sed, is an unescaped '.' a regex-dot matching any character, but an unescaped '(' is just a regular bracket?

somat
2 replies
9h15m

Why, in sed, is an unescaped '.' a regex-dot matching any character, but an unescaped '(' is just a regular bracket?

It is because sed predates the very influential second generation Extended Regular Expression engine and by default uses the first generation Basic Regular Expression engine. So really it is for backwards compatibility.

http://man.openbsd.org/re_format#BASIC_REGULAR_EXPRESSIONS

you can usually pass sed a -r flag to get it to use ERE's

Actually I don't really know if BRE's predate ERE's or not. I assume they do based on the name but I might be wrong.

tankenmate
0 replies
8h36m

BRE and ERE were created at the same time. Prior to this there wasn't a clear standard for regex. From my memory this was standardised in 1996 (IEEE Std 1003.1-1996).

The work originally came from work by Stephen Cole Kleene in the 1950s. It was introduced into Unix fame via the QED editor (which later became ed (and sed), then ex, then vi, then vim; all with differing authors) when Ken Thompson added regex while porting QED to CTSS (an OS developed at MIT for the IBM 709, which was later used to develop Multics, and hence led to Unix).

Also the "grep" command got its name from "ed"; "g" (the global ed command) "re" (regular expression), and "p" (the print ed command). Try it in vi/vim, :g/string/p it is the same thing as the grep command.

fsckboy
0 replies
1h49m

you can usually pass sed a -r flag

for portability, -E is the POSIX flag for the same thing

pjmlp
3 replies
9h43m

Indeed, there isn't any kind of universal regexp standard.

7bit
2 replies
9h26m

We should create a new RegEx flavour that standardises RegEx for good!

ajsnigrutin
1 replies
9h15m

"$" could be end of string or end of line in perl, depending on the setting (are you treating data as a multiline text, or each line separately). (/m, /s,...)

ikiris
0 replies
2h45m

Yeah I accidentally said string when I absolutely meant to say line there.

xlii
8 replies
9h34m

Regexp was one of the first things I truly internalized years ago when I was discovering Perl (which still lives in a cozy place in my heart due to a lovely “Camel” book).

Today most important bit of information is knowledge that implementations differ and I made a habit of pulling reference sheet for a thing I work with.

E.g. Emacs regexp annoyingly doesn't have word in the form of "\w" but uses "\s_-" (or something; no reference sheet on screen) as a character class (but Emacs has the best documentation and discoverability - a hill I'm willing to die on)

Some utilities require parenthesis escaping and some not. Sometimes this behavior is configurable and sometimes it’s not.

I lived through whole confusion, annoyance, denial phase and now I just accept it. Concept is the same everywhere but flavor changes.

ydant
3 replies
7h15m

Exactly the same here, re: Perl.

My brain thinks in Perl's regex language and then I have to translate the inconsistent bits to the language I'm using. Especially in the shell - I'm way more likely to just drop a perl into the pipeline instead of trying to remember how sed/grep/awk (GNU or BSD?) prefer their regex.

influx
1 replies
3h35m

GNU grep supports Perl regexp with -P

mwpmaybe
0 replies
3h17m

As does git grep!

mtmk
0 replies
1h23m

hah, I'm the same too, straight to 'perl -lne'. I believe that was one of Larry Wall's goals when creating Perl:

Perl is kind of designed to make awk and sed semi-obsolete.

https://github.com/Perl/perl5/commit/8d063cd8

pizzafeelsright
3 replies
3h26m

How did you internalize it? Perl looks like cat keyboarding.

mwpmaybe
1 replies
3h17m

The same way people internalize punching data and instructions into stacks of cards, or internalize advanced mathematical notation. Just because things aren't written in plain english words doesn't mean they can't be internalized.

chongli
0 replies
2h57m

Advanced math is mostly written in plain English, actually!

ydant
0 replies
41m

For me, Perl hit me at exactly the right time in my development. One or more of the various O'Reilly Perl books caught my attention in the bookstore, the foreword and the writing style was unlike anything else I'd read in programming up to that point, and I read the book and just felt a strong connection to how the language was structured, the design concepts behind it, the power of regex being built in to the language, etc. The syntax favored easy to write programs without unnecessary scaffolding (of course, leading to the jokes of it being write-only - also the jokes I could make about me programming largely in Java today), and the standard functionality plus the library set available felt like magic to me at that point.

Learning Perl today would be a very different experience. I don't think it would catch me as readily as it did back then. But it doesn't matter - it's embedded into me at a deep level because I learned it through a strong drive of fascination and infatuation.

As for the regex themselves? It's powerful and solved a lot of the problems I was trying to solve, was built fundamentally into Perl as a language, so learning it was just an easy iterative process. It didn't hurt that the particular period of time when I learned Perl/regex the community was really big on "leetcode" style exercises, they just happened to be focused around Perl Golf, being clever in how you wrote solutions to arbitrary problems, and abusive levels of regex to solve problems. We were all playing and play is a great way to learn.

ghusbands
7 replies
9h22m

Note: The table of data was gathered from regex101.com, I didn't test using the actual runtimes.

Has anyone confirmed this behaviour directly against the runtimes/languages? Newlines at the end of a string are certainly something that could get lost in transit inside an online service involving multiple runtimes.

coldtea
2 replies
8h27m

Newlines at the end of a string are certainly something that could get lost in transit inside an online service involving multiple runtimes.

In what way could newlines at the end of a string "could get lost in transit"?

ghusbands
1 replies
8h1m

If you write it to a text file by itself and then read it from that text file, each runtime can have a different definition of whether a newline at the end of the file is meaningful or not. Under POSIX, a newline should always be present at the end of a non-empty text file and is not meaningful; not everyone agrees or is aware.

There are plenty of other ways, too; bugs happen.

coldtea
0 replies
3h50m

Ideally no runtime should alter strings passing through ("in transit") from one runtime to another - unless it does some processing on them.

zimpenfish
0 replies
9h1m

https://go.dev/play/p/Tce1qWjfjOy matches their results.

I've also run that locally against "go1.22.1 darwin/arm64", "go1.21.5 windows/amd64", and "go1.21.0 linux/amd64" with the same result.

ghusbands
0 replies
7h50m

I've now tested C#, directly, and got the same result as the article. It also documents the behavior:

The ^ and $ language elements indicate the beginning and end of the input string. The end of the input string can be a trailing newline \n character.

AtNightWeCode
0 replies
9h16m

I couldn't add a carriage return to the test string on that site, which I guess would be an issue on Windows.

jewel
6 replies
4h37m

This has security implications! Example exploitable ruby code:

  unless person_id =~ /^\d+$/
    abort "Bad person ID"
  end
  sql = "select * from people where person_id = #{person_id}"
In addition to injection attacks, this also can bite people when parsing headers, where a bad header is allowed to sneak past a filter.
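The same bypass can be sketched in Python; `re.MULTILINE` mimics Ruby's always-line-based `^` and `$` (the function name and the payload are illustrative, not from the original code):

```python
import re

def looks_like_id(s):
    # Same flawed check as the Ruby example; re.MULTILINE mimics Ruby's line anchors
    return re.search(r'^\d+$', s, re.MULTILINE) is not None

assert looks_like_id("25")
assert looks_like_id("25\n; delete from people")  # sneaks past the filter!
# Anchoring to the whole string closes the hole
assert re.fullmatch(r'\d+', "25\n; delete from people") is None
```
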

jfhufl
4 replies
2h25m

Unsure what you mean?

    $ ruby -e 'x = "25" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
    yes
    $ ruby -e 'x = "25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end' 
    yes
    $ ruby -e 'x = "a25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
    no
Also, you'd want to use something that parameterizes the query with '?' (I use the Sequel gem) instead of just stuffing it into a sql string.

halostatue
1 replies
2h12m

You need to make your regex multi-line (`/^\d+$/m`), but that isn't the problem shown. Your query will be searching for `25\n`, not `25` despite your pre-check that it’s a good value.

The second line should always be no, which if you use `\A\d+\z`, it will be.

jfhufl
0 replies
2h5m

Yep, makes sense, thanks!

jfhufl
0 replies
2h21m

Well, learned something today after reading a bit further in the thread:

    ruby -e 'x = "a\n25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
    yes
Good to know.

dr-smooth
0 replies
14m

    $ ruby -e 'x = "25\n; delete from people" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
    yes

wodenokoto
4 replies
7h53m

So if you're trying to match a string without a newline at the end, you can't only use $ in Python! My expectation was having multiline mode disabled wouldn't have had this newline-matching behavior, but that isn't the case.

A reproducible example would be nice. I don’t understand what it is he cannot do. `re.search('$', 'no new lines')` returns a match.

iainmerrick
3 replies
7h48m

This unexpectedly matches:

re.match('^bob$', 'bob\n')

I didn't want the trailing newline to be included.

wodenokoto
2 replies
6h11m

But that string does have a new line at the end.

iainmerrick
1 replies
4h56m

re.match('^bob$', 'bob') → yes

re.match('^bob$', 'bobs') → no

Most people would expect 'bob\n' not to match, because I used '$' and it has an extra character at the end, just like 'bobs'. In Python it does match because '\n' is a special case.
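A minimal check of that special case, contrasting `$` with Python's `\Z` (which anchors at the absolute end of the string):

```python
import re

assert re.match(r'^bob$', 'bob\n') is not None   # '$' tolerates one trailing newline
assert re.match(r'^bob$', 'bobs') is None        # any other extra character fails
# \Z in Python anchors at the absolute end, so the newline breaks the match
assert re.match(r'bob\Z', 'bob\n') is None
assert re.match(r'bob\Z', 'bob') is not None
```
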

rerdavies
0 replies
1h50m

... for some arbitrary definition of "most people".

Scubabear68
4 replies
6h9m

In 30 years of developing software I don’t think I ever used multi-line regexp even once.

thrdbndndn
2 replies
6h5m

Definitely not common, but if you are parsing a text file you're going to use it a lot (say, you're writing a JS parser).

marcosdumay
1 replies
4h38m

You really shouldn't use a lot of regexes for parsing code.

They go only on the tokenizer, if they go somewhere at all.

thrdbndndn
0 replies
4h28m

Agreed, it's more about quick and dirty ad hoc capture than full-fledged parser though (like when you want to extract certain object when scraping).

Terretta
0 replies
3h55m

In 30 years of developing software I don’t think I ever used multi-line regexp even once.

As long as sharing anecdata, in 30 years, it's almost the only way I use it.

It's incredible for slicing and dicing repetitious text into structure. You generally want some sort of Practical Extraction and Reporting Language, the core of which is something like a regular expression, generally able to handle the, well, irregularity.

Most recent example (I did this last week) was extracting Apple's app store purchases from an OCR of the purchase history available through Apple's Music app's Account page that lets you see all purchases across all digital offerings, but only as a long scrolling dialog box (reading that dialog's contents through accessibility hooks only retrieves the first few pages, unfortunately).

Each purchase contains one or more items and each item has one or more vertical lines, and if logos contain text they add arbitrary lines per logo.

A good match-and-submatch multi-line regex folds that mess back into a CSV. In this case, the regex for this was less than an 80-character line of code and worked in the find-and-replace of Sublime Text, which has multiline matching, subgroups, and back references.

Another way to do this is something like a state match/case machine, but why write a program when you can just write a regular expression?
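A toy sketch of the idea (the pattern and data here are made up, not the actual App Store ones): a multiline regex folds two-line records back into tuples.

```python
import re

# Two-line records: item name on one line, price on the next
text = "Item A\n$1.99\nItem B\n$0.99\n"
rows = re.findall(r'(?m)^(.+)\n\$(.+)$', text)
assert rows == [('Item A', '1.99'), ('Item B', '0.99')]
```
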

tyingq
3 replies
6h50m

Seems odd to leave Perl off the list, given it's regex related.

Here's the explanation for $ in the perlre docs:

  $   Match the end of the string                 
      (or before newline at the end of the      
      string; or before any newline if /m is     
      used)

toyg
2 replies
5h49m

Yeah, omitting what is arguably the language most associated with regexes seems a bit of an oversight. I guess it shows how far off the radar Perl currently is.

demondemidi
0 replies
5h28m

Perl perfected the simplicity and flexibility of regex syntax from POSIX and it seems every other language after has just made it harder.

TillE
0 replies
2h24m

PHP uses PCRE, so it more or less serves as a stand-in for Perl in this case.

perlgeek
3 replies
7h58m

Raku (formerly Perl 6) has picked ^ and $ for start-of-string and end-of-string, and has introduced ^^ and $$ for start-of-line and end-of-line. No multi line mode is available or necessary. (There's also \h for horizontal and \v for vertical whitespace)

That's one of the benefits of a complete rethink/rewrite, you can learn from the fact that the old behavior surprised people.
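For comparison, in Python the same distinction needs a mode switch rather than separate anchors:

```python
import re

text = "line1\nline2\n"
# Default: ^ anchors only to the start of the whole string
assert re.findall(r'^\w+', text) == ['line1']
# re.MULTILINE (inline form (?m)): per-line anchors, like Raku's ^^ and $$
assert re.findall(r'(?m)^\w+', text) == ['line1', 'line2']
```
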

richardwhiuk
1 replies
3h52m

Think I would have picked exactly the reverse (i.e. ^^ being more "starty" than "^").

lcnPylGDnU4H9OF
0 replies
3h32m

Reminds me of verbosity flags in some cli utilities. Often, -v is "verbose" and -vv is "very verbose" and -vvv... etc.

Terretta
0 replies
4h1m

And this is why this curmudgeon can't use Perl 6[1]. It randomly shuffles the line noise we learned over decades.

It seems so obvious that they defaulted to the opposite of what they should have: it clearly should have been ^ and $ for lines, and ^^ and $$ for the string, nesting like ((1)(2)(3)):

^^line1$\n^line2$\n^line3$\n$

[1]: That, and it's not anywhere, while Perl 5 is everywhere.

m0rissette
3 replies
7h2m

Why isn’t Perl anywhere on that chart when mentioning regex?

burntsushi
2 replies
6h17m

Because they're using regex101 to easily test the semantics of different regex engines and Perl isn't available on regex101. PCRE is though, which is a decent approximation. And indeed, Perl and PCRE behave the same for this particular case.

account42
1 replies
5h47m

Why isn’t Perl available on regex101 when its all about regex?

burntsushi
0 replies
5h31m

I dunno. Maybe because nobody has contributed it? Maybe because Perl isn't as widely used as it once was? Maybe because it's hard to compile Perl to WASM? Maybe some other reason?

vitiral
2 replies
3h34m

In Lua it's only the start/end of the string

A pattern is a sequence of pattern items. A caret '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.

https://www.lua.org/manual/5.3/manual.html#6.4.1

Lua's pattern matching is much simpler than regexes though.

Unlike several other scripting languages, Lua does not use POSIX regular expressions (regexp) for pattern matching. The main reason for this is size: A typical implementation of POSIX regexp takes more than 4,000 lines of code. This is bigger than all Lua standard libraries together. In comparison, the implementation of pattern matching in Lua has less than 500 lines.

https://www.lua.org/pil/20.1.html

denzquix
1 replies
3h3m

In Lua it's only the start/end of the string

There's an additional caveat: if you use the optional "init" parameter to specify an offset into the string to start matching, the ^ anchor will match at that offset, which may or may not be what you expect.

vitiral
0 replies
1h30m

That is a good point, and something I've actually (personally) used quite a bit when writing parsers

PuffinBlue
2 replies
8h57m

This seems like the perfect opportunity to introduce those unfamiliar to Robert Elder. He makes cool YouTube[0] and blog content[1] and has a series on regular expressions[2] and does some quite deep dives into the differing behaviour of the different tools that implement the various versions.

His latest on the topic is cool too: https://www.youtube.com/watch?v=ys7yUyyQA-Y

He's has quite a lot of content that HN folks might be interested in I think, like the reality and woes of consulting[3]

[0] https://www.youtube.com/@RobertElderSoftware

[1] https://blog.robertelder.org/

[2] https://blog.robertelder.org/regular-expressions/

[3] https://www.youtube.com/watch?v=cK87ktENPrI

aquariusDue
0 replies
8h50m

I'm glad to see someone else that has stumbled over his content. Seconding the recommendation.

CatchSwitch
0 replies
6h17m

He has so many favorite Linux commands lol

user2342
1 replies
9h40m

I'm confused by this blog post. In the table, what is the regex pattern being tested, and against which input?

mnw21cam
0 replies
9h34m

The input being matched is "cat\n" and the regex pattern is one of:

  "cat$" with multiline enabled
  "cat$" with multiline disabled
  "cat\z"
  "cat\Z"

febeling
1 replies
8h58m

Seriously, just write one unit test for your regex.

mannykannot
0 replies
5h11m

Indeed, one should test any regex one puts any trust in, but the problem is that if you take as a fact something that is actually a false assumption (as the author did here), your test may well fail to find errors which may cause faults when the regex is put to use.

This, in a nutshell, is the sort of problem which renders fallacious the notion that you can unit-test your way to correct software.

croes
1 replies
8h8m

Isn't a string with a newline character automatically multiline?

The second line is just empty, but the string isn't a single line anymore.

Joker_vD
0 replies
8h3m

No, it is not.

    3.195 Incomplete Line

    A sequence of one or more non-<newline> characters at the end of the file.

    3.206 Line

    A sequence of zero or more non-<newline> characters plus a terminating <newline> character.
courtesy of [0]. See also [1] for rationale on "text file":

   Text File

   [...] The definition of "text file" has caused controversy. The only difference between text and binary files is that text files have lines of less than {LINE_MAX} bytes, with no NUL characters, each terminated by a <newline>. The definition allows a file with a single <newline>, or a totally empty file, to be called a text file. If a file ends with an incomplete line it is not strictly a text file by this definition. [...]
[0] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...

[1] https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd...

wruza
0 replies
9h0m

By default, '$' only matches at the end of the string and immediately before the newline (if any) at the end of the string.

The rationale was probably "it should be easier to match input strings" and now it's harder for everyone.

weinzierl
0 replies
5h0m

The table in the article makes this look complicated, but it really isn't. All the cases in the article can be grouped into two families:

- The JS/Go/Rust family, which treats $ like \z and does not support \Z at all

- The Java, .NET, PHP, Python family, which treats $ like \Z and may or may not support \z (Python doesn't).

\Z tolerates a \n just before the end of the string (it matches right before it), while \z treats \n as a regular character and matches only at the true end. For a multiline $ the distinction doesn't matter, because a \n is itself an end of line.

Really the only deviation from the rule is Python's \Z, which is indeed weird.

teknopaul
0 replies
8h55m

Tldr;

$ does not mean end of string in Python.

somat
0 replies
9h34m

Structural regexes, as found in the sam editor, are an obscure but well-engineered regex variant. I am far from an expert, but my main takeaway from them is that most regex engines have an implied structure built around "lines" of text. While you can work around this, it is awkward. Structural regexes allow you to explicitly define the structure of a match; that is, you get to tell the engine what a "line" is.

http://man.cat-v.org/plan_9/1/sam

silent_cal
0 replies
5h40m

I think there's a big opportunity to re-write Regex as a SQL-type language. It's too bad I don't feel like trying.

raldi
0 replies
5h32m

Cmd-F perl

no matches

pksebben
0 replies
2h31m

Regex would really benefit from a comprehensive industry standard. It's such a powerful tool, but you have to keep relearning it whenever you switch contexts.

nurtbo
0 replies
2h54m

Totally get the desire, but it also feels like the last two paragraphs are solvable with:

``` re.match(pattern, text).group().rstrip("\n") ```

nunez
0 replies
5h36m

You can also use (?m) to enable multiline processing on PCRE-compatible regexp engines.

nebulous1
0 replies
5h50m

The fact that there are so many different peculiarities in different regex systems has always raised the hairs on the back of my neck. As in when a tool accepts a regex and I have to trawl the manual to find out exactly which regex dialect it accepts.

mmh0000
0 replies
2h4m

  > So if you're trying to match a string without a newline at the end, you can't
  > only use $ in Python! My expectation was having multiline mode disabled
  > wouldn't have had this newline-matching behavior, but that isn't the case.
I would argue this is correct behavior: a "line" isn't a "line" if it doesn't end with \n.[1]

  > 3.206 Line - A sequence of zero or more non- <newline> characters plus a terminating <newline> character.
[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...

menacingly
0 replies
2h46m

Of course it's end-of-line. How could it be the end of the string when the matter at hand is defining the string?

mdavid626
0 replies
7h9m

Is this a bug?

masswerk
0 replies
9h20m

As for the good old reference implementation (not "Parameter Efficient Reinforcement Learning"):

  my $string = "cat\n";
  $string =~ /cat$/s;   # -> true
  $string =~ /cat\Z/s;  # -> true
  $string =~ /cat\z/s;  # -> false

k3vinw
0 replies
8h11m

Another poor soul trying to solve one problem using regex and now they have two… ;)

javier_e06
0 replies
3h48m

I would hold a code review hostage if any file doesn't end with a newline.

My reasoning: if the file is transmitted and gets truncated, nobody would know for sure, since it would no longer end with a newline. Brownie points if the code ends with a comment noting that the file ends there.

The article calls computer languages "platforms", but they are computer languages. Bash is not included, which is weird. I believe the most common use of regular expressions is grep or egrep in bash or some other shell, but who knows, maybe I am hanging with the wrong crowd.

humanlity
0 replies
7h2m

Interesting

gorjusborg
0 replies
4h57m

If you really want to learn regex, you'll have a hard time piecing it all together via blog posts.

Jeffrey Friedl's Mastering Regular Expressions is a good book to read if you want to stop being surprised/lost.

I'll admit I stopped at the dive into DFA/NFA engine details.

frou_dh
0 replies
8h25m

Something I found really surprising about Python's regexp implementation is that it doesn't support the typical character classes like [:alnum:] etc.

It must be some kind of philosophical objection because there's no way something with as much water under the bridge as Python simply hasn't got around to it.
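The usual workarounds in Python are explicit ranges, the shorthand classes, or str methods:

```python
import re

# No [:alnum:] in Python's re; spell the class out instead
assert re.findall(r'[0-9A-Za-z]+', 'ab-12') == ['ab', '12']
# \w additionally includes underscore (and Unicode word characters)
assert re.findall(r'\w+', 'ab_12!') == ['ab_12']
# Or skip the regex entirely
assert 'ab12'.isalnum()
```
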

danbruc
0 replies
7h30m

People are confused about strings and lines. A string is a sequence of characters; a line can be two different things. If you consider the newline a line terminator, then a line is a sequence of non-newline characters (possibly zero) plus a newline; if there is no newline at the end, then it is not a (complete) line. That is what POSIX uses. If you consider the newline a line separator, then a line is just a sequence of non-newline characters (possibly zero). In either case, the content of the line ends before the newline, either because the newline terminates the line or because it separates the line from the next. [1]

The semantics of ^ and $ is based on lines - whether single-line or multi-line mode. For string based semantics - which you could also think of as entire file if you are dealing with files - use \A and \Z or their equivalents.

[1] Both interpretations have their merits. If you transmit text over a serial connection, it is useful to have a newline as line terminator so that you know when you have received a complete line. If you put text into text files, it is arguably easier to view the newline as a line separator, because then you cannot have an invalid last line. On the other hand, having line terminators in text files allows you to detect incompletely written lines.
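The two interpretations map neatly onto two Python string methods:

```python
text = "a\nb\n"
# Separator view: a trailing newline implies an empty final field
assert text.split("\n") == ["a", "b", ""]
# Terminator view: the trailing newline just closes the last line
assert text.splitlines() == ["a", "b"]
```
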

cpeterso
0 replies
3h25m

$ is the regex’s “the buck stops here” symbol. Here at the end of the line. :)

aftbit
0 replies
2h9m

Wait, in non-multiline mode, it only matches _one_ trailing newline? And not any other whitespace, including \r or \r\n? That is indeed surprising behavior. Why? Why not just make it end of string like the author expected?

    >>> import re
    >>> bool(re.search('abc$', 'abc'))
    True
    >>> bool(re.search('abc$', 'abc\n'))
    True
    >>> bool(re.search('abc$', 'abc\n\n'))
    False
    >>> bool(re.search('abc$', 'abc '))
    False
    >>> bool(re.search('abc$', 'abc\t'))
    False
    >>> bool(re.search('abc$', 'abc\r'))
    False
    >>> bool(re.search('abc$', 'abc\r\n'))
    False

SAI_Peregrinus
0 replies
4h27m

POSIX regexes and Python regexes are different. In general, you need to reference the regex documentation for your implementation, since the syntax is not universal.

Per POSIX chapter 9[1]:

9.2 … "The use of regular expressions is generally associated with text processing. REs (BREs and EREs) operate on text strings; that is, zero or more characters followed by an end-of-string delimiter (typically NUL). Some utilities employing regular expressions limit the processing to lines; that is, zero or more characters followed by a <newline>."

and 9.3.8 … "A <dollar-sign> ( '$' ) shall be an anchor when used as the last character of an entire BRE. The implementation may treat a <dollar-sign> as an anchor when used as the last character of a subexpression. The <dollar-sign> shall anchor the expression (or optionally subexpression) to the end of the string being matched; the <dollar-sign> can be said to match the end-of-string following the last character."

combine to mean that $ may match the end of string OR the end of the line, and it's up to the utility (or mode) to define which. Most of the common utilities (grep, sed, awk, Python, etc) treat it as end of line by default, since they operate on lines by default.

THERE IS NO SINGLE UNIVERSAL REGULAR EXPRESSION SYNTAX. You cannot reliably read or write regular expressions without knowing which language & options are being used.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...

Existing4190
0 replies
7h23m

perlre Metacharacters documentation states: $ Match the end of the string (or before newline at the end of the string; or before any newline if /m is used)

(/m enables multiline mode)

AtNightWeCode
0 replies
9h1m

There are many differences between implementations of regex. To name a few. Lookbehind, atomic groups, named capturing groups, recursion, timeouts and my favorite interop problem, unicode.