
Regex character "$" doesn't mean "end-of-string"

Karellen
70 replies
9h9m

Folks who've worked with regular expressions before might know about ^ meaning "start-of-string" and correspondingly see $ as "end-of-string".

Huh. I always think of them as "start-of-line" and "end-of-line". I mean, a lot of the time when I'm working with regexes, I'm working with text a line at a time so the effect is the same, but that doesn't change how I think of those operators.

Maybe because a fair amount of the work I do with regexes (and, probably, how I was introduced to them) is via `grep`, so I'm often thinking of the inputs as "lines" rather than "strings"?

wccrawford
51 replies
8h3m

It's kind of driving me nuts that the article says ^ is "start of string" when it's actually "start of line", just like $ is "end of line". \A is apparently "start of string" like \Z is "end of string".

masklinn
35 replies
7h41m

It’s not start of line though, unless the engine is in multiline mode. Here is the documentation for Python’s re for instance:

> Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

Or JavaScript:

> An input boundary is the start or end of the string; or, if the m flag is set, the start or end of a line.

\A and \Z are start/end of input regardless of mode… when they’re available, that’s not the case of all engines.
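(The difference is quick to check with Python's `re`, where `\Z` is the absolute end of the string; PCRE additionally distinguishes `\Z` from `\z`, which Python does not have.)

```python
import re

s = "one\ntwo\n"

# Default mode: $ matches at the very end, or just before a final newline.
print(bool(re.search(r"two$", s)))   # True: $ tolerates the trailing newline
print(bool(re.search(r"two\Z", s)))  # False: \Z demands the true end of string

# MULTILINE mode: ^ and $ also match around every interior newline.
print(re.findall(r"(?m)^\w+$", s))   # ['one', 'two']
```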

danbruc
32 replies
6h50m

It is start and end of line. [1]

> Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches either at the end of the string and at the end of each line (immediately preceding each newline).

In single-line [2] mode, the line starts at the start of the string and ends at the end of the line where the end of the line is either the end of the string if there is no terminating newline or just before the final newline if there is a terminating newline.

In multi-line mode a new line starts at the start of the string and after each newline and ends before each newline or at the end of the string if the last line has no terminating newline.

The confusion is that people think that they are in string-mode if they are not in multi-line mode but they are not, they are in single-line mode, ^ and $ still use the semantics of lines and a terminating newline, if present, is still not part of the content of the line.

With \n\n\n in single-line mode the non-greedy ^(\n+?)$ will capture only two of the newlines, the third one will be eaten by the $. If you make it greedy ^(\n+)$ will capture all three newlines. So arguably the implementations that do not match cat\n with cat$ are the broken ones.

[1] https://docs.python.org/3/howto/regex.html#more-metacharacte...

[2] I am using single-line to mean not multi-line for convenience even though single-line already has a different meaning.
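(The \n\n\n example above is easy to verify in Python, with no MULTILINE flag:)

```python
import re

s = "\n\n\n"

# Non-greedy: the lazy + stops as soon as $ can succeed, which is just
# before the final newline, so only two of the three newlines are captured.
print(repr(re.search(r"^(\n+?)$", s).group(1)))  # '\n\n'

# Greedy: + consumes all three newlines and $ still succeeds at the
# very end of the string.
print(repr(re.search(r"^(\n+)$", s).group(1)))   # '\n\n\n'
```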

masklinn
31 replies
6h37m

> It is start and end of line.

You seem to have redefined “line” as “not a line”.

> The confusion

I’m sure redefining “line” as “nothing like what anyone reasonable would interpret as a line” will help a lot and really clear up the confusion.

Bjartr
27 replies
5h47m

The line delimiter is a newline.

If you have a file containing `A\nB\nC`, the file is three lines long.

I guess it could be argued that a file containing `A\nB\nC\n` has four lines, with the fourth having zero length.

That a regex is applying to an in memory string vs a file doesn't feel to me like it should have different semantics.

Digging into the history a little, it looks like regexes were popularized in text editors and other file oriented tooling. In those contexts I imagine it would be far more common to want to discard or ignore the trailing zero length line than to process it like every other line in a file.

akdev1l
26 replies
5h40m

Technically the “newline” character is actually a line _terminator_. Hence “A\n” is one line, not two. The “\n” is always at the end of a line by definition.

wtetzner
15 replies
5h22m

So if you have "A" in a file with no newline, there are no lines in that file?

jepler
11 replies
5h10m

Yes, that is a file with zero lines that ends with an "incomplete line". Processing of such files by standard line-oriented utilities is undefined in the opengroup spec. So, for instance, the effect of "grep"ping such a file is not defined. Heck, even "cat"ting such a file gives non-ideal results, such as colliding with the regular shell prompt. For this reason, a lot of software projects I work on check and correct this condition whenever creating a commit.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1... ("text file")

rovr138
8 replies
4h54m

> Yes, that is a file with zero lines that ends with an "incomplete line".

It's a file with zero complete lines. But it has 1 line, that's incomplete, right?

The file starts empty. Anything in it starts "a line". So it's 1 incomplete line.

I hate weird states.

xyzzy_plugh
2 replies
4h30m

No, it is valid for a file to have content but no lines.

Semantically many libraries treat that as a line because while \n<EOF> means "the end of the last line" having just <EOF> adds additional complexity the user has to handle to read the remaining input. But by the book it's not "a line".

If I said "ten buckets of water" does that mean ten full buckets? Or does a bucket with a drop in it count as "a bucket of water?" If I asked for ten buckets of water and you brought me nine and one half-full, is that acceptable? What about ten half-full buckets?

A line ends in a newline. A file with no newlines in it has no lines.

joshjje
1 replies
1h37m

That's beyond ridiculous. In most languages, when you're reading a line from a file and it doesn't have a \n terminator, it's going to give you that line, not say "oops, this isn't a line, sorry".

LK5ZJwMwgBbHuVI
0 replies
50m

That's a relatively recent invention compared to tools like `wc` (or your favorite `sh` for that matter). See also: https://perldoc.perl.org/functions/chop wherein the norm was "just cut off the last character of the line, it will always be a newline"

coryrc
1 replies
2h7m

Pedantically, if it doesn't end with a newline, it's considered a binary file and not a text file. Binary files don't have lines.

In practice, most utilities expecting text files will still operate on it.

PaulDavisThe1st
0 replies
45m

No file has lines.

"Lines" are a convention established by (or not) software reading a data stream.

mort96
0 replies
2h50m

It's a file with 0 lines and some trailing garbage.

akdev1l
0 replies
4h28m

No, a line is defined as a sequence of characters (bytes?) with a line terminator at the end.

Technically as per posix a file as you describe is actually a binary file without any lines. Basically just random binary data that happens to kind of look like a line.

DougBTX
0 replies
2h32m

Another way to look at it is that concatenating files should sum the line count. Concatenating two empty files produces an empty file, so 0 + 0 = 0. If “incomplete lines” are not counted as lines, then the maths still works out. If they counted as lines, it would end up as 1 + 1 = 1.
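(The additivity argument can be written down as a tiny check; the two counting functions here are illustrative, not from any standard library:)

```python
def posix_lines(s: str) -> int:
    # POSIX: a line is text terminated by a newline, so just count newlines.
    return s.count("\n")

def lenient_lines(s: str) -> int:
    # Alternative: also count a trailing incomplete line as a line.
    return s.count("\n") + (1 if s and not s.endswith("\n") else 0)

a, b = "A", "B"  # two "files", neither newline-terminated

# POSIX counting keeps concatenation additive: 0 + 0 == 0.
print(posix_lines(a) + posix_lines(b) == posix_lines(a + b))        # True

# Counting incomplete lines breaks it: 1 + 1 != 1.
print(lenient_lines(a) + lenient_lines(b) == lenient_lines(a + b))  # False
```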

rerdavies
1 replies
2h45m

The opengroup spec says no such thing.

simonh
0 replies
1h51m

> 3.206 Line
>
> A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

See also ‘3.403 Text File’ for the definition of a text file. No new line characters, no lines. No lines, not a text file.

mbrubeck
1 replies
3h59m

    $ echo -n "A" | wc --lines
    0

keybored
0 replies
2h17m

Yep, since wc(1) apparently strictly adheres to what a newline-terminated text file is. This is why plain-text files should end with a newline. :)

See: https://stackoverflow.com/a/25322168/1725151

LK5ZJwMwgBbHuVI
0 replies
52m

Why don't you go ask?

    $ echo -n foo | wc -l
    0

rerdavies
3 replies
2h46m

Technically, that is one of two possible interpretations, and you seem to have invented a "by definition" out of thin air.

Very very technically a "newline" character indicates the start of a new line, which is why it is not called the "end-of-line" character.

cortesoft
1 replies
2h25m

I mean, the person you are responding to didn't invent the definition out of thin air... the POSIX standard did:

> 3.206 Line
>
> A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

https://pubs.opengroup.org/onlinepubs/9699919799.2018edition...

nomel
0 replies
1h42m

POSIX getline() includes EOF as a line terminator:

    getline() reads an entire line from stream, storing the address
       of the buffer containing the text into *lineptr.  The buffer is
       null-terminated and includes the newline character, if one was
       found.
    ...
    ... a delimiter character is not added if one was
       not present in the input before end of file was reached.
So EOF is treated the same as end-of-string.
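(Python's line iteration behaves like getline here: the final unterminated chunk is still handed back, just without a newline. A quick check with an in-memory stream:)

```python
import io

# A "file" whose last line has no terminating newline.
f = io.StringIO("A\nB")
print(list(f))  # ['A\n', 'B'] -- the incomplete final line is still yielded
```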

LK5ZJwMwgBbHuVI
0 replies
47m

It doesn't indicate the start of a new line, or files would start with it. Files end with it, which is why it is a line terminator. And it is by definition: by the standard, by the way cat and/or your shell and/or your terminal work together, and by the way standard utilities like `wc` treat the file.

Gormo
3 replies
4h14m

Suddenly the DOS/Windows solution of using \r\n instead of just \n seems to offer some advantages.

samatman
1 replies
3h57m

This does precisely nothing to solve the ambiguity issue when a final line lacks a newline. The representation of that newline isn't relevant to the problem.

Izkata
0 replies
1h17m

It's actually slightly worse: Windows defines newline as a delimiter, not a terminator. So this:

  foo\nbar\n
Would be 2 lines on *nix and 3 lines on Windows.

deaddodo
0 replies
2h12m

The "Windows way" is the "right way" for a few reasons.

This is definitely not one of them.

joshjje
1 replies
1h45m

“A\n” is two lines.

LK5ZJwMwgBbHuVI
0 replies
46m

Factually incorrect.

danbruc
2 replies
6h29m

The POSIX definition of a line is a sequence of non-newline characters - possibly zero - followed by a newline. Everything that does not end with a newline is not a [complete] line. So strictly speaking it would even be correct that cat$ does not match cat, because there is no terminating newline; it should only match cat\n. But as lines missing a terminating newline are a thing, it seems reasonable to be less strict.

masklinn
1 replies
1h5m

> a line is a sequence of non-newline characters

Works for me.

How do you square that with your assertion that in your invention of "single-line mode" you implicitly define "line" as matching \n\n?

danbruc
0 replies
41m

If you are not in multi-line mode, then a single line is expected and consequently there is at most one newline, at the end of the string. You can of course pick an input that violates this and run it against a multi-line string with several newlines in it. cat\n\n will not match cat$ because there is something between cat and the end of the line; it just happens to be a newline, but one without any special meaning, because it is not the last character and you did not say that the input is multi-line.
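(This is easy to confirm with Python's `re`:)

```python
import re

# A single trailing newline is tolerated by $ ...
print(bool(re.search(r"cat$", "cat\n")))    # True

# ... but with two newlines there is now a character between "cat" and
# the positions where $ may match, so the match fails.
print(bool(re.search(r"cat$", "cat\n\n")))  # False
```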

eastbound
1 replies
6h56m

Probably a vulnerability issue. Programmers would leave multiline mode on by mistake, then validate that some string only contains ^[a-Z]*$… only for the string to have a \n and an SQL injection on the second line.
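(A sketch of that failure mode in Python; the pattern uses a plain [a-z] class and the input strings are purely illustrative. Note the trap exists even without MULTILINE, because $ forgives one trailing newline, while \A…\Z does not:)

```python
import re

# Validation that looks airtight but is not:
pattern = re.compile(r"^[a-z]+$", re.MULTILINE)

payload = "abc\ninjected line"
print(bool(pattern.search(payload)))  # True: the first line alone satisfies ^...$

# Even without MULTILINE, $ tolerates a single trailing newline:
print(bool(re.search(r"^[a-z]+$", "abc\n")))    # True

# \A...\Z anchors the entire string and rejects both inputs:
print(bool(re.search(r"\A[a-z]+\Z", "abc\n")))  # False
```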

masklinn
0 replies
6h39m

> Probably a vulnerability issue.

No? It’s a semantics decision.

amelius
12 replies
3h53m

What is driving me nuts is that we have Unicode now, so there is no need to use common characters like $ or ^ to denote special regex state transitions.

knome
6 replies
3h51m

the idea of changing a decades-old convention to instead use, as I assume you are implying, some character that requires special entry is beyond silly.

FranOntanaya
3 replies
3h5m

I don't think anyone who writes regex would feel especially challenged by using the Alt+ / Ctrl+Shift+U key combos for Unicode entry. Having to escape fewer things in a pattern would be nice.

amelius
1 replies
2h53m

Also, code is read more often than it is written.

cortesoft
0 replies
2h21m

People say this all the time, but is it really always true? I have a ton of code that I wrote, that just works, and I never really look at it again, at least not with the level of inspection that requires parsing the regex in my head.

cortesoft
0 replies
2h22m

I write regexes all the time, and I don't know if I would be CHALLENGED by that, but it would be annoying. Escaping things is trivial, and since you do it all the time it is not anything extra to learn. Having to remember bespoke keystrokes for each character is a lot more to learn.

keybored
1 replies
2h14m

It’s not that silly. You constantly get into escape conundrums because you need to use a metacharacter which is also a metacharacter three levels deep in some embedding.

(But that might not solve that problem? Maybe the problem is mostly about using same-character delimiters for strings.)

And I guess that’s why Perl is so flexible with regards to delimiters and such.

LK5ZJwMwgBbHuVI
0 replies
44m

Yes, languages really need some sort of "raw string" feature like Python (or make regex literals their own syntax like Perl does). That's the solution here, not using weird characters...

Yujf
3 replies
3h48m

Why not? Common characters are easier to type, and presumably if you are using regex on a Unicode string it might include these special characters anyway, so what have you gained?

amelius
2 replies
2h30m

In theory yes, in practice no.

What you have gained is that the regex is now much easier to read.

knome
0 replies
1h51m

It's easy to read now.

LK5ZJwMwgBbHuVI
0 replies
42m

> In theory yes, in practice no.

That's like "in theory we need 4 bytes to represent Unicode, but in practice 3 bytes is fine" (glances at universally-maligned utf8mb3)

yjftsjthsd-h
0 replies
3h46m

If we were willing to ignore the ability to actually type it, you don't need Unicode for that; ASCII has a whole block of control characters at the beginning; I think ASCII 25 ("End of medium") works here.

tangus
0 replies
7h2m

That gives the author space for another article ;)

davidw
0 replies
2h11m

What with unicode, it'd be fun to have Α and Ω available to make our regexps that much more readable...

jamesmunns
4 replies
8h53m

Same, tho it'd be interesting to see if this behavior holds if the file ends without a trailing newline and your match is on the final newline-less line.

fooofw
3 replies
8h21m

Fortunately, it's pretty simple to test.

    $ printf 'Line with EOL\nLine without EOL' | grep 'EOL$'        
    Line with EOL
    Line without EOL
    $ grep --version | head -n1
    grep (GNU grep) 3.8

romwell
1 replies
8h4m

The line does end with the file, so it's logically consistent.

It's not matching the newline character after all.

colimbarna
0 replies
7h36m

Yes exactly, they match the end of a line, not a newline character. Some examples from documentation:

man 7 regex: '$' (matching the null string at the end of a line)

pcre2pattern: The circumflex and dollar metacharacters are zero-width assertions. That is, they test for a particular condition being true without consuming any characters from the subject string. These two metacharacters are concerned with matching the starts and ends of lines. ... The dollar character is an assertion that is true only if the current matching point is at the end of the subject string, or immediately before a newline at the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however, that it does not actually match the newline. Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.
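(The "asserts the position, does not match the newline" point is visible in Python as well: the match span stops before the newline.)

```python
import re

m = re.search(r"EOL$", "Line with EOL\n")
print(m.group())  # 'EOL' -- no newline in the matched text
print(m.span())   # (10, 13); the newline at index 13 is left unconsumed
```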

jamesmunns
0 replies
7h27m

Thanks! I was AFK and didn't have a grep (or a shell) handy on my phone.

Izkata
4 replies
4h6m

Same here; when I saw the title I was like "well obviously not, where did you hear that?"

In nearly two decades of using regex I think this might be the first time I've heard of $ being end of string. It's always been end of line for me.

michaelt
2 replies
55m

Take a look at, for example, these Stack Overflow answers about a regex to validate an e-mail address: https://stackoverflow.com/a/8829363

These people are, I think, not intending to say that a newline character is permitted at the end of an e-mail address.

(Of course people using 'grep' would have different expectations for obvious reasons)

Izkata
1 replies
48m

Even disregarding whether or not end-of-string is also an end-of-line (see all the other comments below), $ doesn't match the newline, similar to zero-width matches like \b, so the newline wouldn't be included in the matched text either way.

I think this series of comments might be clearest: https://news.ycombinator.com/item?id=39764385

LK5ZJwMwgBbHuVI
0 replies
39m

Problem is, plenty of software doesn't actually look at the match but rather just validates that there was a match (and then continues to use the input to that match).

frame_ranger
0 replies
1h26m

You couldn’t write a post like this if you didn’t start with a strawman.

absoluteunit1
2 replies
6h55m

I’ve always thought that as well; mostly due to Vim though.

^ - takes you to start of line
$ - takes you to end of line

Izkata
1 replies
4h17m

^ actually takes you to the first non-whitespace character in the line in vim. For start of line you want 0

kataklasm
0 replies
24m

I don't have (n)vi(m) open right now but I think this only applies to prepending spaces. For prepending tabs, 0 will take you to the first non-tab character as well.

notnmeyer
0 replies
2h0m

i feel like this perspective will be split between folks who use regex in code with strings and more sysadmin folks who are used to consuming lines from files in scripts and at the cli.

but yeah seems like a real misunderstanding from “start/end of string” people

kqr
0 replies
7h48m

I'm the same, but now that I try in Perl, sure enough, $ seems to default to being a positive lookahead assertion for the end of the string. It does not match and consume an EOL character.

Only in multiline mode does it match at EOL characters, but even then it does not appear to consume them. In fact, I cannot construct a regex that captures the last character of one line, then consumes the newline, and then captures the first character of the next line, while using $. The capture group simply ends at $.

cerved
0 replies
1h24m

In `sed` it's end of string.

The end of the string is usually the end of a line, but not if you use commands like `N` to manipulate multi-line strings

antegamisou
0 replies
8h16m

> Maybe because a fair amount of the work I do with regexes (and, probably, how I was introduced to them) is via `grep`, so I'm often thinking of the inputs as "lines" rather than "strings"?

Vim is what did that for me.

alphazard
0 replies
4h48m

This must be the "second problem" everyone talks about with regular expressions.

beardyw
38 replies
9h30m

Does anyone consider RegEx to be standardised? Moving to a new context is always a relearning exercise in my experience.

rusk
12 replies
9h26m

My understanding is it was standardised for POSIX, but the variants in popular use differ in many ways.

I consider sed to be the baseline. If you can do sed you can do anything but it’s seriously limited.

susam
7 replies
9h19m

POSIX specifies two flavours of regular expressions: basic regular expressions (BRE) and extended regular expressions (ERE). There are subtle differences between the two and ERE supports more features than BRE. For example, what is written as a\(bc\)\{3\}d in BRE is written as a(bc){3}d in ERE. See https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1... for more details.

The regular expression engines available in most mainstream languages go well beyond what is specified in POSIX though. An interesting example is named capturing group in Python, e.g., (?P<token>f[o]+).
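(For instance, the named-group example retrieved by name in Python:)

```python
import re

m = re.search(r"(?P<token>f[o]+)", "a foo b")
print(m.group("token"))  # 'foo'
print(m.groupdict())     # {'token': 'foo'}
```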

tankenmate
3 replies
9h2m

Indeed, and the most common is Perl since it was the source of many of the extensions.

rusk
2 replies
7h52m

I would hazard that nowadays it’s Java due to its broad permeation of the application space

account42
1 replies
5h55m

If anything it would be ECMAScript (JavaScript dwarfs Java use) or PCRE (the de-facto continuation of Perl regular expressions, written in C but used in many languages).

rusk
0 replies
5h24m

Yes I think you’re right actually. I’m about 10 years off :)

jwilk
2 replies
8h39m

> what is written as \(f..\)\1 in BRE is written as (f..)\1 in ERE

Oddly, there are no backreferences in POSIX EREs.

susam
0 replies
7h30m

You are right. I looked at the specification again and indeed there is no back-reference in POSIX ERE.

Quoting from <https://pubs.opengroup.org/onlinepubs/9699919799.2008edition...>:

> It was suggested that, in addition to interval expressions, back-references ( '\n' ) should also be added to EREs. This was rejected by the standard developers as likely to decrease consensus.

Updated my comment to present a better example that avoids back-references. Thanks!

GrumpySloth
0 replies
6h2m

That’s because POSIX EREs are actual regular expressions, thank god.

psd1
3 replies
8h32m

No GNU tool can balance brackets, AFAICS. So you can't do everything in sed. And sed is, by design, useless for matching text that spans lines, so good luck picking out paragraphs with it.

ykonstant
1 replies
7h49m

I am pretty sure even pure Awk can do it; or am I mistaken? I thought there was an even more sophisticated example in the Awk book.

Edit: oh, you mean via regex engines available in GNU tools; I am dumb. Hmm... is there no GNU extension with PCRE?

colimbarna
0 replies
7h19m

"Sed" is the name of a specific tool. It is not defined by the GNU tools, but has existed in some form since 1974, well before Perl. GNU sed and POSIX sed both support BRE and EREs, but not PCREs.

Maybe there's some other implementation of sed that supports PCREs but that would really be an extension of that implementation of sed rather than a property of sed.

And maybe there's some GNU tool that uses PCREs, but that GNU tool would not be GNU sed, so it would not be a relevant property.

Anyway, they probably should have said BREs or EREs rather than "sed"...

rusk
0 replies
7h55m

Sorry I meant to write “if you can do it in sed you can do it in anything” thereby implying it is a subset of the more generally available flavours. The issue at hand however is that there isn’t much in the way of standardisation but 95% of sed should work across all of them. Of course you should get more into the specifics of whatever your solution space supports.

telotortium
6 replies
9h25m

Languages invented after Perl will generally use some flavor of Perl regex syntax, but there are always some minor differences. The issue of the meaning of `$` and changing it via multi-line mode is usually consistent though.

usrusr
5 replies
9h5m

I like to think of "whatever browsers do in js" as an updated common baseline. Whatever your regex engine does, describe it as a delta to the js precedent. That thing is just so ubiquitous.

I do wonder though what's the highest number of different regex syntaxes I've ever encountered (perhaps written?) within a single line: bash, grep and sed are never not in a "hold my beer" mood!

psd1
0 replies
8h24m

Reason #2 to use powershell - consistent regex.

I've got "hold my beer" commits in .net - I've balanced brackets. I believe that's impossible in sed and grep. If I were going to write a json parser in a script, then a) stop me and b) it's got to be in powershell.

mwpmaybe
0 replies
3h13m

> I do wonder though what's the highest number of different regex syntaxes I've ever encountered (perhaps written?) within a single line: bash, grep and sed are never not in a "hold my beer" mood!

Your comment is missing a trigger warning, lol. But seriously, this is one of my flags for "this should probably be a script, or an awk or perl one-liner."

layer8
0 replies
7h9m

That seems like just a web front-end developer’s perspective.

kstrauser
0 replies
3h26m

I’ll go along with that, as long as someone ports pcre to JavaScript and that’s the browser syntax we land on.

jasonjayr
6 replies
9h5m

The three big ones I know of are POSIX, Perl/PCRE (aka Perl-Compatible Regular Expressions), and Go came along and <strike>added</strike> used re2, which is a bit different from the first two.

A lot of systems implemented PCRE, including JavaScript, since Perl extended the POSIX system with many useful extensions. IIRC, re2 tries to rein in some of the performance issues and quirks the original systems had, while implementing the whole thing in Go.

edit: Did not realize re2 predated go ...

jerf
3 replies
4h29m

POSIX and PCRE are arguably redundant. They both support backreferences, which puts very significant constraints on their implementations. PCRE is at least functionally a superset of POSIX, whether or not there's some quirky thing POSIX supports that PCRE does not.

re2 adds a legitimate option to the menu of using NDFAs, which have the disadvantage of not supporting backreferences, but have the advantage of having constrained complexity of scanning a string. This does not come for free; you can conceivably end up with a compiled regexp of very large size with an NDFA approach, but most of the time you won't. The result may be generally slower than a PCRE-type approach, but it can also end up safer because you can be confident that there isn't a pathological input string for a given regexp that will go exponential.

This is one of those cases where ~99% of the time, it doesn't really matter which you choose, but at the scale of the Entire Programming World, both options need to be available. I've got some security applications where I legitimately prefer the re2 implementation in Go because it is advantageous to be confident that the REs I write have no pathological cases in the arbitrary input they face. PCRE can be necessary in certain high-performance cases, as long as you can be sure you're not going to get that pathological input.

RE engines don't quite engender the same emotions as programming languages as a whole, but this is not cheerleading, this is a sober engineering assessment. I use both styles in my code. I've even got one unlucky exe I've been working with lately that has both, because it rather irreducibly has the requirements for both. Professionally annoying, but not actually a problem.
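(Python's `re` is a backtracking engine, so the pathological case jerf describes is easy to provoke with a classic nested-quantifier pattern; an RE2-style automaton engine rejects the same input in linear time. The input is kept deliberately small so this finishes quickly:)

```python
import re

# Nested quantifiers plus a forced failure: a backtracking engine must try
# every way of splitting the a's between the two +'s, so each extra 'a'
# roughly doubles the running time.
hay = "a" * 18 + "b"
print(re.match(r"(a+)+$", hay))  # None, after ~2^17 backtracking attempts
```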

burntsushi
1 replies
4h9m

I'll add two notes to this:

* Finite automata based regex engines don't necessarily have to be slower than backtracking engines like PCRE. Go's regexp is in practice slower in a lot of cases, but this is more a property of its implementation than its concept. See: https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa... --- Given "sufficient" implementation effort (~several person years of development work), backtrackers and finite automata engines can both perform very well, with one beating the other in some cases but not in others. It depends.

* Fun fact is that if you're iterating over all matches in a haystack (e.g., Go's `FindAll` routines), then you're susceptible to O(m * n^2) search time. This applies to all regex engines that implement some kind of leftmost match priority. See https://github.com/BurntSushi/rebar?tab=readme-ov-file#quadr... for a more detailed elaboration on this point.

jerf
0 replies
4h7m

Excellent, thank you.

keybored
0 replies
2h9m

> RE engines don't quite engender the same emotions as programming languages as a whole, but this is not cheerleading, this is a sober engineering assessment.

Good on you.

jpgvm
0 replies
8h58m

re2 predates Go and was written in C++.

foldr
0 replies
8h13m

Go's regex implementation is new in the sense that it's not just a binding to the re2 C++ library, but it uses the same non-backtracking algorithm.

bregma
4 replies
7h55m

The ISO/IEC 14882 C++ standard library <regex> mandates [0] implementations of six de jure standard regex grammars: IEEE Std 1003.1-2008 (POSIX) [1] BRE, ERE, awk, grep, and egrep, plus ECMA-262 ECMAScript 3 [2].

So, yes, at least someone (me) considers regex to be standardized in several published de jure standards.

  [0] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf#chapter.28
  [1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
  [2] https://262.ecma-international.org/14.0/#sec-regexp-regular-expression-objects

pjc50
1 replies
7h2m

"At least six different standards" is an XKCD comic, not a standard.

riffraff
0 replies
6h43m

"The nice thing about standards is that you have so many to choose from." - Andrew Tanenbaum (or Grace Hopper)

account42
1 replies
5h56m

<regex> is not exactly an example anyone should follow.

bregma
0 replies
2h26m

You may be prejudiced against C++, but ISO/IEC 14882 is a published international standard that links to recognized regex standards, so answers the question "does anyone consider RegEx standardised?" very much in the affirmative.

wolletd
1 replies
9h25m

At some point, I felt like I knew them all. There are probably more regex dialects out there, but I don't encounter them and my set of knowledge works most of the time.

I feel it's like driving a rental car. It behaves slightly different than your own car, some features missing, some other features added, but in general, most of the things are pretty similar.

stanislavb
0 replies
9h12m

What a nice analogy. I’ll borrow it in the future.

tonyg
0 replies
5h23m

Delightfully, RFC 9485 https://datatracker.ietf.org/doc/rfc9485/ "I-Regexp: An Interoperable Regular Expression Format" was published just back in October last year!

out-of-ideas
0 replies
9h14m

kind of a trick question; there is POSIX, and then there is the app you're using and whichever flags are enabled (whether by default or explicitly defined)

beardyw
0 replies
5h50m

And don't get me started on find and replace: what is the symbol to insert the match?

MattHeard
0 replies
9h22m

My working assumption has always been to check the docs of your specific regexp parser, and to write some tests (either automated or manually in a REPL) with specific patterns that you are interested in using.

onion2k
32 replies
9h11m

I can hear thousands of bad hiring managers adding 'How do you match the end of a string in a regex?' to their list of 'Ha! You don't know the trick!' questions designed to catch out candidates.

hoc
31 replies
8h15m

"I will hire you anyway, but I will pay you less"

Regex, useful in any job...

username_my1
30 replies
8h9m

regex is useful but chatgpt is amazing at it, so why spend a minute keeping such useless knowledge in mind.

If you know where to find something, there's no point in knowing it.

ykonstant
25 replies
7h52m

Does gpt produce efficient regex? Are there any experts here that can assess the quality and correctness of gpt-generated regex? I wonder how regex responses by gpt are validated if the prompter does not have the knowledge to read the output.

thecatspaw
18 replies
7h43m

what does gpt say about how we should validate email addresses?

layer8
7 replies
7h16m

…which both excludes addresses allowed by the RFC and includes addresses disallowed by the RFC. (For example, the RFC disallows two consecutive dots in the local-part.)

KMnO4
5 replies
5h24m

I take the descriptivist approach to email validation, rather than the prescriptivist.

I know an email has to have a domain name after the @ so I know where to send it.

I also know it has to have something before the @ so the domain’s email server knows how to handle it.

But do I care if the email server supports sub addresses, characters outside of the commonly supported range (eg quotation marks and spaces), or even characters which aren't part of the RFC? I do not.

If the user gives me that email, I’ll trust them. Worst case they won’t receive the verification email and will need to double check it. But it’s a lot better than those websites who try to tell me my email is invalid because their regex is too picky.

layer8
2 replies
5h19m

I generally agree, but the two consecutive dots (or leading/trailing dots) are an example that would very likely be a typo and that you wouldn’t particularly want to send. Similar for unbalanced quotes, angle brackets, and other grammar elements.

dumbo-octopus
1 replies
41m

I wonder whether simply (regex) replacing a sequence of .'s with a single one as part of a post-processing step would be effective.

layer8
0 replies
15m

That would be bad form, IMO. The user may have typed john..kennedy@example.com by mistake instead of john.f.kennedy@example.com, and now you’ll be sending their email to john.kennedy@example.com. Similar for leading or trailing dots. You can’t just decide what a user probably meant, when they type in something invalid.

wtetzner
0 replies
5h15m

Yeah, that's about as far as I've ever been comfortable going in terms of validating email addresses too: some stuff followed by "@" followed by more stuff.

Though I guess adding a check for invalid dot patterns might be worthwhile.
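That level of check stays easy to read. A minimal Python sketch (the pattern and helper name are illustrative only, not any standard):

```python
import re

# Pragmatic "descriptivist" check: something before "@", a dotted
# domain after it, plus a guard against the likely-typo dot patterns
# discussed above (leading/trailing/consecutive dots in the local part).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(addr: str) -> bool:
    if not EMAIL_RE.fullmatch(addr):
        return False
    local = addr.partition("@")[0]
    return not (local.startswith(".") or local.endswith(".") or ".." in local)

print(looks_like_email("john.f.kennedy@example.com"))  # True
print(looks_like_email("john..kennedy@example.com"))   # False
```

Anything that passes this still needs a confirmation email to be considered real.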

jcranmer
0 replies
4h29m

The HTML email regex validation [1] is probably the best rule to use for validating an email address in most user applications. It prohibits IP address domain literals (which the emailcore people have basically said is of limited utility [2]), and quoted strings in the localpart. Its biggest fault is allowing multiple dots to appear next to each other, which is a lot of faff to put in a regex when you already have to individually spell out every special character in atext.

[1] https://html.spec.whatwg.org/multipage/input.html#email-stat...

[2] https://datatracker.ietf.org/doc/draft-ietf-emailcore-as/

marcosdumay
0 replies
4h46m

What is maybe more important to note: it completely disallows the languages of some 4/5 of humanity, and partially disallows those of some 2/3 of the rest.

zaxomi
1 replies
6h35m

Remember to first punycode the domain part of an email address before trying to validate it, or it will not work with internationalized domain names.
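In Python, for example, the stdlib `idna` codec can do that conversion before the address hits a validation regex (a sketch; note the stdlib codec implements IDNA 2003, while the third-party `idna` package implements the newer IDNA 2008):

```python
# Punycode the domain part first, then validate the ASCII form.
addr = "user@bücher.example"
local, _, domain = addr.partition("@")
ascii_domain = domain.encode("idna").decode("ascii")
print(ascii_domain)  # xn--bcher-kva.example
```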

jameshart
0 replies
5h54m

Support for IDN email addresses is still patchy at best. Many systems can’t send to them; many email hosts still can’t handle being configured for them.

sebstefan
0 replies
7h11m

Actually pretty good response if the programmer bothers to read all of it

I'd be more emphatic that you shouldn't rely on regexes to validate emails and that this should only be used as an "in the form validation" first step to warn of user input error, but the gist is there

This regex is *practical for most applications* (??), striking a balance between complexity and adherence to the standard. It allows for basic validation but does not fully enforce the specifications of RFC 5322, which are much more intricate and challenging to implement in a single regex pattern.

^ ("challenging"? Didn't I see that email validation requires at least a grammar and not just a regex?)

For example, it doesn't account for quoted strings (which can include spaces) in the local part, nor does it fully validate all possible TLDs. Implementing a regex that fully complies with the RFC specifications is impractical due to their complexity and the flexibility allowed in the specifications.

For applications requiring strict compliance, it's often recommended to use a library or built-in function for email validation provided by the programming language or framework you're using, as these are more likely to handle the nuances and edge cases correctly. Additionally, the ultimate test of an email address's validity is sending a confirmation email to it.
bonki
0 replies
6h39m

Not good at all, but a little better than expected. I use + in email addresses prominently and there are so many websites who don't even allow that...

criley2
3 replies
7h30m

Prompt:

'I'm writing a nodejs javascript application and I need a regex to validate emails in my server. Can you write a regex that will safely and efficiently match emails?'

GPT4 / Gemini Advanced / Claude 3 Sonnet

GPT4: `const emailRegex = /^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;` Full answer: https://justpaste.it/cg4cl

Gemini Advanced: `const emailRegex = /^[a-zA-Z0-9.!#$%&'+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)$/;` Full answer: https://justpaste.it/589a5

Claude 3: `const emailRegex = /^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$/;` Full answer: https://justpaste.it/82r2v

zaxomi
0 replies
6h29m

Still doesn't support internationalized domain names.

dfawcus
0 replies
4h41m

Whereas email more or less lasts forever (mailbox contents), and has to be backwards compatible with older versions back to (at least) RFC 821/822, or those before. It also allows almost any character (when escaped at 821 level) in the host or domain part (domain names allow any byte value).

So a Internet email address match pattern has to be: "..*@..*", anything else can reject otherwise valid addresses.

That however does not account for earlier source routed addresses, nor the old style UUCP bang paths. However those can probably be ignored for newly generated email.

I regularly use an email address with a "+" in the local part. When I used qmail, I often used addresses like: "foo-a/b-bar-tat@DOMAIN". Mainly for auto filtering received messages from mailing lists.

croemer
0 replies
4h49m

Terrible answers as far as I can tell; especially ChatGPT's would throw out many valid email addresses.

skeaker
0 replies
54m

There really ought to be a regex repository of common use cases like these so we don't have to reinvent the wheel or dig up a random codebase that we hope is correct to copy from every time.

da39a3ee
4 replies
7h39m

You don't have to be an expert; you should very rarely be using regexes so complex that you can't understand them.

hnlmorg
1 replies
6h31m

...and if you can understand them then you clearly understand regex enough not to need ChatGPT to write them

kaibee
0 replies
4h56m

I understand assembly too.

zacmps
0 replies
7h22m

It might not be obvious when you hit that point, bad regexes can be subtle, just see that old cloudflare postmortem.

mnau
0 replies
1h47m

Even simple regexes can be problematic, e.g. the GitLab RCE bug through ExifTool

https://devcraft.io/2021/05/04/exiftool-arbitrary-code-execu...

"a\ > ""

The second quote was not escaped because in the regex $tok =~ /(\\+)$/ the $ will match the end of a string, but also match before a newline at the end of a string, so the code thinks that the quote is being escaped when it’s escaping the newline.
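The same footgun reproduces in Python, which shares this `$` behavior (an illustrative translation of the Perl check, not the actual ExifTool code):

```python
import re

# tok is "a" + backslash + newline. The naive "does the token end in
# backslashes?" check succeeds because $ matches *before* the trailing
# newline, not at the true end of the string.
tok = "a\\\n"
m = re.search(r"(\\+)$", tok)
print(m.group(1))                  # one backslash matched
print(re.search(r"(\\+)\Z", tok))  # None: \Z anchors at the true end
```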
2devnull
0 replies
5h19m

That was one of my first uh oh moments with gpt. Getting code that clearly had untestable/unreadable regexen, which given the source must have meant the regexes were gpt generated. So much is going to go wrong, and soon.

berkes
2 replies
7h22m

if you know where to find something no point in knowing it.

Nonsense. And you know it.

First, you need to know what to find, before knowing where to find it. And knowing what to find requires intricate knowledge of the thing. Not intricate implementation details, but enough to point yourself in the right direction.

Secondly, you need to know why to find thing X and not thing Y. If anything, ChatGPT is even worse than google or stackoverflow at "solving the XY problem for you". With an XY problem, you don't want X solved; you want to be told that you don't want to solve it.

Maybe some future LLM can also push back. Maybe some future LLM can guide you to the right answer for a problem. But at the current state: nope.

Related: regexes are almost never the best answer to any question. They are available and quick, so all considered, maybe "the best" for this case. But overall: nah.

pksebben
0 replies
2h34m

While I agree with your point that knowing things matters, it is entirely possible with the current batch of LLMs to get to an answer you don't know much about. It's actually one of the few things they do reliably well.

You start with what you do know, asking leading questions and being clear about what you don't, and you build towards deeper and deeper terminology until you get to the point where there are docs to read (because you still can't trust them to get the specifics right).

I've done this on a number of projects with pretty astonishing results, building stuff that would otherwise be completely out of my wheelhouse.

lolc
0 replies
28m

Funny for me there have been instances where the LLM did push back. I had a plan of how to solve something and tasked the LLM with a draft implementation. It kept producing another solution which I kept rejecting and specifying more details so it wouldn't stray. In the end I had to accept that my solution couldn't work, and that the proposed one was acceptable. It's going to happen again, because it often comes up with inferior solutions so I'm not very open to the reverse situation.

HumblyTossed
0 replies
4h41m

This is something ChatGPT would say.

Izmaki
20 replies
9h41m

The new-line character is an actual character "at the end" of the string though so it makes sense that $ would include the new-line character in multi-line matching.

IshKebab
18 replies
9h33m

Yes, and every implementation gets that right. The point was about when multi-line matching is disabled, where only JavaScript, Go and Rust get it right.

I'm not too surprised by PHP and Python getting it wrong. Java and C# is a slight surprise though.

danbruc
16 replies
9h9m

I don't think it is correct to say some get it right and some get it wrong, it is more of a design decision.

IshKebab
13 replies
8h55m

It's possible to get design decisions wrong. Clearly people expect `$` to only match end-of-string so they did make the wrong decision. It may not have been clear it was the wrong decision at the time.

danbruc
11 replies
8h39m

Things are obviously more complicated than that, lines are a complicated issue for historical reasons. There are two conventions, line termination and line separation. In case of line termination, the newline is part of the line and a string without a newline is not a [complete] line. In case of line separation, the newline is not part of the line but separates two lines. Also the way newlines are encoded is not universal.

fauigerzigerk
10 replies
7h51m

Why is this relevant when multi-line is disabled?

danbruc
9 replies
7h13m

Because even after disabling multi-line you are still dealing with line-based semantics when you use ^ or $, the newline at the end is still not part of the content. You have to use \A and \Z if you want to treat all characters as a string instead of one or multiple lines.
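Python illustrates the difference between the two pairs of anchors:

```python
import re

# $ is not the true end of the string: it also matches just before a
# trailing newline. \Z (Python's end-of-string anchor) does not.
print(re.search(r"cat$", "cat\n"))     # matches
print(re.search(r"cat\Z", "cat\n"))    # None
print(re.search(r"cat\n\Z", "cat\n"))  # matches, newline included
```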

burntsushi
8 replies
6h36m

Because even after disabling multi-line you are still dealing with line-based semantics when you use ^ or $

No, you're not, except for this weird corner case where `$` can match before the last `\n` in a string. It's not just any `\n` that non-multiline `$` can match before. It's when it's the last `\n` in the string. See:

    >>> re.search('cat$', 'cat\n')
    <re.Match object; span=(0, 3), match='cat'>
    >>> re.search('cat$', 'cat\n\n')
    >>>
This is weird behavior. I assume this is why RE2 didn't copy this. And it's certainly why I followed RE2 with Rust's regex crate. Non-multiline `$` should only match at the end of the string. It should not be line-aware. In regex engines like Python where it has the behavior above, it is only "partially" line-aware, and only in the sense that it treats the last `\n` as special.

danbruc
7 replies
5h5m

But that is exactly what it means, the end of the line is before the terminating newline or at the end of the string if there is no terminating newline. Both ^ and $ always match at start or end of lines, \A and \Z match at the start or end of the string. The difference between multi-line and not is whether or not internal newlines end and start lines, it does not change the semantics from end of line to end of string. And if you are not in multi-line mode but have internal newlines, then you might also want single-line/dot-all mode.

One could certainly have a debate whether this behavior is too strongly tied to the origins of regular expressions and now does more harm than good, but I am not convinced that this would be an easy and obvious breaking change to make.

burntsushi
5 replies
4h18m

re.search does not accept a "line." It accepts a "string." There is no pretext in which re.search is meant to only accept a single line. And giving it a `string` with multiple new lines doesn't necessarily mean you want to enable multi-line mode. They are orthogonal things.

Both ^ and $ always match at start or end of lines

This is trivially not true, as I showed in my previous example. The haystack `cat\n\n` contains two lines and the regex `cat$` says it should match `cat` followed by the "end of a line" according to your definition. Yet it does not match `cat` followed by the end of a line in `cat\n\n`. And it does not do so in Python or in any other regex engine.

You're trying to square a circle here. It can't be done.

Can you make sense of, historically, why this choice of semantics was made? Sure. I bet you can. But I can still evaluate the choice on its own merits today. And I did when I made the regex crate.

but I am not convinced that this would be an easy and obvious breaking change to make.

Rust's regex crate, Go's regexp package and RE2 all reject this whacky behavior. As the regex crate maintainer, I don't think I've ever seen anyone complain. Not once. This to me suggests that, at minimum, making `$` and `\z` equivalent in non-multiline mode is a reasonable choice. I would also argue it is the better and more sensible approach.

Whether other regex engines should have a breaking change or not to change the meaning of `$` is an entirely different question completely. That is neither here nor there. They absolutely will not be able to make such a change, for many good reasons.

danbruc
4 replies
2h24m

re.search does not accept a "line." It accepts a "string." There is no pretext in which re.search is meant to only accept a single line.

Sure, it takes a string which might be a line or multiple or whatever. Does not change the fact that $ matches at the end of a line. If you want the end of the string, use \Z.

This is trivially not true, as I showed in my previous example. The haystack `cat\n\n` contains two lines and the regex `cat$` says it should match `cat` followed by the "end of a line" according to your definition.

In multi-line mode it matches, in single-line mode it does not because there is a newline between cat and the end of the line. A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.

burntsushi
3 replies
2h16m

In multi-line mode it matches, in single-line mode it does not because there is a newline between cat and the end of the line. A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.

This only makes sense if re.search accepted a line to search. It doesn't. It accepts an arbitrary string.

I don't think this conversation is going anywhere. Your description of the semantics seems inconsistent and incomprehensible to me.

A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.

The first `\n` in `cat\n\n` is a terminating newline. There just happens to be one after it.

Like I said, your description makes sense if the input is meant to be interpreted as a single line. And in some contexts (like line oriented CLI tools), that can make sense. But that's not the case here. So your description makes no sense at all to me.

danbruc
2 replies
1h7m

This only makes sense if re.search accepted a line to search. It doesn't. It accepts an arbitrary string.

Which is fine because lines are a subset of strings. And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.

The first `\n` in `cat\n\n` is a terminating newline. There just happens to be one after it.

Look at where this is coming from. You do line-based stuff, there is either no newline at all or there is exactly one newline at the end. You do file-based stuff, there are many newlines. In both cases the behavior of ^ and $ makes perfect sense.

Now you come along with cat\n\n which clearly falls into the file-based stuff category as it has more than one newline in it but you also insist that it is not multiple lines. If it is not multiple lines, then only the last character can be a newline, otherwise it would be multiple lines.

And I get it, yes, you can throw arbitrary strings at a regular expression, this line-based processing is not everything, but it explains why things behave the way they do. And that is also why people added \A and \Z. And I understand that ^ and $ are much nicer and much better known than \A and \Z. Maybe the best option would be to have a separate flag that makes them synonymous with \A and \Z and this could maybe even be the default.

burntsushi
1 replies
32m

And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.

Where is this semantic explained in the `re` module docs?

This is totally and completely made up as far as I can tell.

This also seems entirely consistent with my rebuttal:

Me: What you're saying makes sense if condition foo holds.

You: Condition foo holds.

This is uninteresting to me because I see no reason to believe that condition foo holds. Where condition foo is "the input to re.search is expected to be a single line." Or more precisely, apparently, "the input to re.search is expected to be a single line when either ^ or $ appear in the pattern." That is totally bonkers.

but it explains why things behave the way they do

Firstly, I am not debating with you about the historical reasoning for this. Secondly, I am providing a commentary on the semantics themselves (they suck) and also on your explanation of them in today's context (it doesn't make sense). Thirdly, I am not making a prescriptive argument that established regex engines should change their behavior in any way.

If you're looking to explain why this semantic is the way it is, then I'd expect writing from the original implementors of it. Probably in Perl. I wouldn't at all be surprised if this was an "oops" or if it was implemented in a strictly-line-oriented context, and then someone else decided to keep it unthinkingly when they moved to a non-line-oriented context. From there, compatibility takes over as a reason for why it's with us today.

danbruc
0 replies
5m

I quoted the section from the Python module here. [1]

If you do not specify multi-line, bar$ matches a lines ending in bar, either foobar\n or foobar if the terminating newline has been removed or does not exist. If you specify multi-line, then it will also match at every bar\n within the string. So it either treats your input as a single line or as multiple lines. You can of course not specify multi-line and still pass in a string with additional newlines within the string, but then those newlines will be treated more or less as any other character, bar$ will not match bar\n\n. The exception is that dot will not match them except you set the single-line/dot-all flag, bar\n$ will match bar\n\n but bar.$ will not unless you specify the single-line/dot-all flag.

[1] https://news.ycombinator.com/item?id=39765086
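Those cases, sketched concretely in Python:

```python
import re

s = "foo\nbar\n"
print(re.search(r"bar$", s))           # matches: $ before the final \n
print(re.search(r"foo$", s))           # None: the \n after foo is internal
print(re.search(r"foo$", s, re.M))     # matches: multi-line $ at inner \n too
print(re.search(r"foo.bar", s))        # None: . does not match \n
print(re.search(r"foo.bar", s, re.S))  # matches with dot-all
```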

IshKebab
0 replies
4h46m

But that is exactly what it means

I think you've kind of missed the point. Sure if `$` in non-multiline mode means "end of line" the behaviour might be reasonable. But the big error is that people DO NOT EXPECT `$` to mean "end of line" in that case. They expect it to mean "end of string". That's clearly the least surprising and most useful behaviour.

The bug is not in how they have implemented "end of line" matching in non-multiline mode. It's that they did it at all.

dfawcus
0 replies
7h50m

Given that in unix they sort of started as:

    ed -> sed
    ed -> grep
The line oriented nature makes sense.

There is some sed multi-line capability if one uses the hold space, but it is much easier to just use awk.

tankenmate
1 replies
8h51m

Not quite, there are standards for this behaviour (de jure and de facto).

danbruc
0 replies
7h3m

And the ones that do not match cat\n with cat$ arguably have it wrong. Both ^ and $ anchor to the start and end of lines, not to the start and end of strings, whether in multi-line mode or not.

noirscape
0 replies
7h30m

It's not wrong actually. It's the difference between BRE and ERE, which are the two different POSIX standards that define regex. In BRE the $ should always match the end of the string (the spec specifically says it should match the string terminator since "newlines aren't special characters"), while the ERE spec says it should match until the end of the line.

The real issue is that no language nowadays "just" implements BRE or ERE since both specs are lacking in features.

Most languages instead implement some variant of Perl's regex instead (often called PCRE regex because of the C library that brought Perl's regex to C), which as far as I can tell isn't standardized, so you get these subtle differences between implementations.

mnw21cam
0 replies
9h33m

The article is about when multi-line is disabled.

pjc50
11 replies
9h17m

Special misery case: Visual Studio supports regex search, where '$' matches \n.

The end of line character is usually the standard Windows \r\n.

Yes, that means if you want to really match the end of line you have to match "\r$". So broken.

jbverschoor
6 replies
9h1m

The whole \r is archaic. It doesn't even behave properly in most cases. Just use \n everywhere and bite the lemon for a short while to fix your problems.

And if you believe \r\n is the way to go, please make sure \n\r also works as they should have the same results. (or \r\n\r\r\r\r for that matter)

keybored
3 replies
2h3m

Why did they even decide to use two characters for the end of line? Seems bizarre. I could have imagined that `\r` and `\n` was a tossup. But why both?

mnau
1 replies
1h37m

Likely compatibility bugs going back decades (70s?). Probably with some terminal/teletype.

\r - returned teletype head to the start of a line

\n - move paper one line down

The sequence CR+LF was commonly used on many early computer systems that had adopted Teletype machines—typically a Teletype Model 33 ASR—as a console device, because this sequence was required to position those printers at the start of a new line. The separation of newline into two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in time to print the next character. Any character printed after a CR would often print as a smudge in the middle of the page while the print head was still moving the carriage back to the first position. "The solution was to make the newline two characters: CR to move the carriage to column one, and LF to move the paper up."[2] In fact, it was often necessary to send extra padding characters—extraneous CRs or NULs—which are ignored but give the print head time to move to the left margin. Many early video displays also required multiple character times to scroll the display.

https://en.wikipedia.org/wiki/Newline

jbverschoor
0 replies
1h18m

It’s similar to an old school typewriter.

The handle does 2 things: return and feed. You can also just return by not pulling all the way or the other way around depending on the design

HideousKojima
0 replies
26m

Typewriters is why

psd1
0 replies
8h19m

There are unices that use LFCR endings... computing is an endless bath in history

HideousKojima
0 replies
4h27m

But without \r how am I supposed to print to my typewriter over serial cable? Only half-joking, that's the setup my family had in the early 90's.

skrebbel
3 replies
9h8m

FWIW, and I know this doesn't really address your complaint: I use Windows and I've set all my text editors to use LF exclusively years ago and Things Are Great. No more weird Git autocrlf warnings, no quirks when copying files over to/from people on Macs or Linuxes, etc. Even Notepad has supported LF line endings for quite a long time now - in my practical experience, there's little remaining in Windows that makes CRLF "the OS standard line ending".

I bet if someday VS Code's Windows build ships with LF default on new installations, people won't even notice.

I mean, at some point it did matter what the OS did when you pressed the "Enter" button. But this isn't really the case much anymore. VS Code catches that keypress, and inserts whatever "files.eol" is set to. Sublime does the same. I didn't check, but I assume every other IDE has this setting.

Similarly, the HTML spec, which is pretty nuts, makes browsers normalize my enters to LF characters as I type into this textarea here (I can check by reading the `value` property in devtools), but when it's submitted, it converts every LF to a CRLF because that's how HTML forms were once specced back in the day. Again though, what my OS considers to be "the standard newline" is simply not considered at all. Even CMD.EXE batch files support LF.

I don't really type newlines all that much outside IDEs and browsers (incl electron apps) and places like MS Word, all of which disregard what the OS does and insert their own thing. Maybe the terminal? I don't even know. I doubt it's very consequential.

EDIT: PSA the same holds for backslashes! Do Not Use Backslashes. Don't use "OS specific directory separator constants". It's not 1998, just type "/" - it just works.

pjc50
0 replies
6h59m

I bet if someday VS Code's Windows build ships with LF default on new installations, people won't even notice.

As with '/', they really ought to do this some day but won't.

n_plus_1_acc
0 replies
8h9m

I could never get visual studio (not code) to not use \r\n when editing a solution file via the gui

divingdragon
0 replies
7h52m

Even CMD.EXE batch files support LF.

I don't know if it is the case on Windows 11, but I have surely been bitten by CMD batch files using LF line endings. I don't remember the exact issue but it may have been the one bug affecting labels. [1]

[1]: https://www.dostips.com/forum/viewtopic.php?t=8988#p58888

ikiris
10 replies
9h51m

this is mostly due to the different types of regex and less about it being platform dependent. $ was end of string in pcre which is the "old" perl compatible regex. python has its own which has quirks as mentioned, re2 is another option in go for example, and i think rust has its own version as well iirc.

wolletd
3 replies
9h32m

The differences of the various regex "dialects" came to me over the years of using regular expressions for all kinds of stuff.

Matching EOL feels natural for every line-based process.

What I find way more annoying is escaping characters and writing character groups. Why can't all regex engines support '\d' and '\w' and such? Why, in sed, is an unescaped '.' a regex-dot matching any character, but an unescaped '(' is just a regular bracket?

somat
2 replies
9h15m

Why, in sed, is an unescaped '.' a regex-dot matching any character, but an unescaped '(' is just a regular bracket?

It is because sed predates the very influential second generation Extended Regular Expression engine and by default uses the first generation Basic Regular Expression engine. So really it is for backwards compatibility.

http://man.openbsd.org/re_format#BASIC_REGULAR_EXPRESSIONS

you can usually pass sed a -r flag to get it to use ERE's

Actually I don't really know if BRE's predate ERE's or not. I assume they do based on the name but I might be wrong.

tankenmate
0 replies
8h36m

BRE and ERE were created at the same time. Prior to this there wasn't a clear standard for regex. From my memory this was standardised in 1996 (IEEE Std 1003.1-1996).

The work originally came from work by Stephen Cole Kleene in the 1950s. It was introduced into Unix fame via the QED editor (which later became ed (and sed), then ex, then vi, then vim; all with differing authors) when Ken Thompson added regex while porting QED to CTSS (an OS developed at MIT for the IBM 709, which was later used to develop Multics, and hence led to Unix).

Also the "grep" command got its name from "ed"; "g" (the global ed command) "re" (regular expression), and "p" (the print ed command). Try it in vi/vim, :g/string/p it is the same thing as the grep command.

fsckboy
0 replies
1h49m

you can usually pass sed a -r flag

for portability, -E is the POSIX flag for the same thing

pjmlp
3 replies
9h43m

Indeed, there isn't any kind of universal regexp standard.

7bit
2 replies
9h26m

We should create a new RegEx flavour that standardises RegEx for good!

ajsnigrutin
1 replies
9h15m

"$" could be end of string or end of line in perl, depending on the setting (are you treating data as a multiline text, or each line separately). (/m, /s,...)

ikiris
0 replies
2h45m

Yeah I accidentally said string when I absolutely meant to say line there.

xlii
8 replies
9h34m

Regexp was one of the first things I truly internalized years ago when I was discovering Perl (which still lives in a cozy place in my heart due to a lovely “Camel” book).

Today most important bit of information is knowledge that implementations differ and I made a habit of pulling reference sheet for a thing I work with.

E.g. Emacs regexp annoyingly doesn't have word in the form of "\w" but uses "\s_-" (or something; no reference sheet on screen) as a character class (but Emacs has the best documentation and discoverability - a hill I'm willing to die on)

Some utilities require parenthesis escaping and some not. Sometimes this behavior is configurable and sometimes it’s not.

I lived through whole confusion, annoyance, denial phase and now I just accept it. Concept is the same everywhere but flavor changes.

ydant
3 replies
7h15m

Exactly the same here, re: Perl.

My brain thinks in Perl's regex language and then I have to translate the inconsistent bits to the language I'm using. Especially in the shell - I'm way more likely to just drop a perl into the pipeline instead of trying to remember how sed/grep/awk (GNU or BSD?) prefer their regex.

influx
1 replies
3h35m

GNU grep supports Perl regexp with -P

mwpmaybe
0 replies
3h17m

As does git grep!

mtmk
0 replies
1h23m

hah, I'm the same too, straight to 'perl -lne'. I believe that was one of Larry Wall's goals when creating Perl:

Perl is kind of designed to make awk and sed semi-obsolete.

https://github.com/Perl/perl5/commit/8d063cd8

pizzafeelsright
3 replies
3h26m

How did you internalize it? Perl looks like cat keyboarding.

mwpmaybe
1 replies
3h17m

The same way people internalize punching data and instructions into stacks of cards, or internalize advanced mathematical notation. Just because things aren't written in plain english words doesn't mean they can't be internalized.

chongli
0 replies
2h57m

Advanced math is mostly written in plain English, actually!

ydant
0 replies
41m

For me, Perl hit me at exactly the right time in my development. One or more of the various O'Reilly Perl books caught my attention in the bookstore, the foreword and the writing style was unlike anything else I'd read in programming up to that point, and I read the book and just felt a strong connection to how the language was structured, the design concepts behind it, the power of regex being built in to the language, etc. The syntax favored easy to write programs without unnecessary scaffolding (of course, leading to the jokes of it being write-only - also the jokes I could make about me programming largely in Java today), and the standard functionality plus the library set available felt like magic to me at that point.

Learning Perl today would be a very different experience. I don't think it would catch me as readily as it did back then. But it doesn't matter - it's embedded into me at a deep level because I learned it through a strong drive of fascination and infatuation.

As for the regex themselves? It's powerful and solved a lot of the problems I was trying to solve, was built fundamentally into Perl as a language, so learning it was just an easy iterative process. It didn't hurt that the particular period of time when I learned Perl/regex the community was really big on "leetcode" style exercises, they just happened to be focused around Perl Golf, being clever in how you wrote solutions to arbitrary problems, and abusive levels of regex to solve problems. We were all playing and play is a great way to learn.

ghusbands
7 replies
9h22m

Note: The table of data was gathered from regex101.com, I didn't test using the actual runtimes.

Has anyone confirmed this behaviour directly against the runtimes/languages? Newlines at the end of a string are certainly something that could get lost in transit inside an online service involving multiple runtimes.

coldtea
2 replies
8h27m

Newlines at the end of a string are certainly something that could get lost in transit inside an online service involving multiple runtimes.

In what way could newlines at the end of a string "could get lost in transit"?

ghusbands
1 replies
8h1m

If you write it to a text file by itself and then read it from that text file, each runtime can have a different definition of whether a newline at the end of the file is meaningful or not. Under POSIX, a newline should always be present at the end of a non-empty text file and is not meaningful; not everyone agrees or is aware.

There are plenty of other ways, too; bugs happen.

coldtea
0 replies
3h50m

Ideally no runtime should alter strings passing through ("in transit") from one runtime to another - unless it does some processing on them.

zimpenfish
0 replies
9h1m

https://go.dev/play/p/Tce1qWjfjOy matches their results.

I've also run that locally against "go1.22.1 darwin/arm64", "go1.21.5 windows/amd64", and "go1.21.0 linux/amd64" with the same result.

ghusbands
0 replies
7h50m

I've now tested C#, directly, and got the same result as the article. It also documents the behavior:

The ^ and $ language elements indicate the beginning and end of the input string. The end of the input string can be a trailing newline \n character.

AtNightWeCode
0 replies
9h16m

I couldn't add a carriage return to the test string on that site, which I guess would be an issue on Windows.

jewel
6 replies
4h37m

This has security implications! Example exploitable ruby code:

  unless person_id =~ /^\d+$/
    abort "Bad person ID"
  end
  sql = "select * from people where person_id = #{person_id}"
In addition to injection attacks, this also can bite people when parsing headers, where a bad header is allowed to sneak past a filter.
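The same bypass can be sketched in Python; `re.MULTILINE` mimics Ruby's always-line-based `^` and `$` (the function name and the payload are illustrative, not from the original code):

```python
import re

def looks_like_id(s):
    # Same flawed check as the Ruby example; re.MULTILINE mimics Ruby's line anchors
    return re.search(r'^\d+$', s, re.MULTILINE) is not None

assert looks_like_id("25")
assert looks_like_id("25\n; delete from people")  # sneaks past the filter!
# Anchoring to the whole string closes the hole
assert re.fullmatch(r'\d+', "25\n; delete from people") is None
```
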

jfhufl
4 replies
2h25m

Unsure what you mean?

    $ ruby -e 'x = "25" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
    yes
    $ ruby -e 'x = "25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end' 
    yes
    $ ruby -e 'x = "a25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
    no
Also, you'd want to use something that parameterizes the query with '?' (I use the Sequel gem) instead of just stuffing it into a sql string.

halostatue
1 replies
2h12m

You need to make your regex multi-line (`/^\d+$/m`), but that isn't the problem shown. Your query will be searching for `25\n`, not `25` despite your pre-check that it’s a good value.

The second line should always be no, which if you use `\A\d+\z`, it will be.

jfhufl
0 replies
2h5m

Yep, makes sense, thanks!

jfhufl
0 replies
2h21m

Well, learned something today after reading a bit further in the thread:

    ruby -e 'x = "a\n25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
    yes
Good to know.

dr-smooth
0 replies
14m

    $ ruby -e 'x = "25\n; delete from people" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
    yes

wodenokoto
4 replies
7h53m

So if you're trying to match a string without a newline at the end, you can't only use $ in Python! My expectation was having multiline mode disabled wouldn't have had this newline-matching behavior, but that isn't the case.

A reproducible example would be nice. I don’t understand what it is he cannot do. `re.search('$', 'no new lines')` returns a match.

iainmerrick
3 replies
7h48m

This unexpectedly matches:

re.match('^bob$', 'bob\n')

I didn't want the trailing newline to be included.

wodenokoto
2 replies
6h11m

But that string does have a new line at the end.

iainmerrick
1 replies
4h56m

re.match('^bob$', 'bob') → yes

re.match('^bob$', 'bobs') → no

Most people would expect 'bob\n' not to match, because I used '$' and it has an extra character at the end, just like 'bobs'. In Python it does match because '\n' is a special case.
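A minimal check of that special case, contrasting `$` with Python's `\Z` (which anchors at the absolute end of the string):

```python
import re

assert re.match(r'^bob$', 'bob\n') is not None   # '$' tolerates one trailing newline
assert re.match(r'^bob$', 'bobs') is None        # any other extra character fails
# \Z in Python anchors at the absolute end, so the newline breaks the match
assert re.match(r'bob\Z', 'bob\n') is None
assert re.match(r'bob\Z', 'bob') is not None
```
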

rerdavies
0 replies
1h50m

... for some arbitrary definition of "most people".

Scubabear68
4 replies
6h9m

In 30 years of developing software I don’t think I ever used multi-line regexp even once.

thrdbndndn
2 replies
6h5m

Definitely not common, but if you are parsing a text file you're going to use it a lot (say, you're writing a JS parser).

marcosdumay
1 replies
4h38m

You really shouldn't use a lot of regexes for parsing code.

They go only on the tokenizer, if they go somewhere at all.

thrdbndndn
0 replies
4h28m

Agreed, it's more about quick and dirty ad hoc capture than full-fledged parser though (like when you want to extract certain object when scraping).

Terretta
0 replies
3h55m

In 30 years of developing software I don’t think I ever used multi-line regexp even once.

As long as sharing anecdata, in 30 years, it's almost the only way I use it.

It's incredible for slicing and dicing repetitious text into structure. You generally want some sort of Practical Extraction and Reporting Language, the core of which is something like a regular expression, generally able to handle the, well, irregularity.

Most recent example (I did this last week) was extracting Apple's app store purchases from an OCR of the purchase history available through Apple's Music app's Account page that lets you see all purchases across all digital offerings, but only as a long scrolling dialog box (reading that dialog's contents through accessibility hooks only retrieves the first few pages, unfortunately).

Each purchase contains one or more items and each item has one or more vertical lines, and if logos contain text they add arbitrary lines per logo.

A good match-and-submatch multi-line regex folds that mess back into a CSV. In this case, the regex for this was less than an 80-character line of code and worked in the find-and-replace of Sublime Text, which has multiline matching, subgroups, and back references.

Another way to do this is something like a state match/case machine, but why write a program when you can just write a regular expression?
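A toy sketch of the idea (the pattern and data here are made up, not the actual App Store ones): a multiline regex folds two-line records back into tuples.

```python
import re

# Two-line records: item name on one line, price on the next
text = "Item A\n$1.99\nItem B\n$0.99\n"
rows = re.findall(r'(?m)^(.+)\n\$(.+)$', text)
assert rows == [('Item A', '1.99'), ('Item B', '0.99')]
```
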

tyingq
3 replies
6h50m

Seems odd to leave Perl off the list, given it's regex related.

Here's the explanation for $ in the perlre docs:

  $   Match the end of the string                 
      (or before newline at the end of the      
      string; or before any newline if /m is     
      used)

toyg
2 replies
5h49m

Yeah, omitting what is arguably the language most associated with regexes seems a bit of an oversight. I guess it shows how far off the radar Perl currently is.

demondemidi
0 replies
5h28m

Perl perfected the simplicity and flexibility of regex syntax from POSIX and it seems every other language after has just made it harder.

TillE
0 replies
2h24m

PHP uses PCRE, so it more or less serves as a stand-in for Perl in this case.

perlgeek
3 replies
7h58m

Raku (formerly Perl 6) has picked ^ and $ for start-of-string and end-of-string, and has introduced ^^ and $$ for start-of-line and end-of-line. No multi line mode is available or necessary. (There's also \h for horizontal and \v for vertical whitespace)

That's one of the benefits of a complete rethink/rewrite, you can learn from the fact that the old behavior surprised people.
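For comparison, in Python the same distinction needs a mode switch rather than separate anchors:

```python
import re

text = "line1\nline2\n"
# Default: ^ anchors only to the start of the whole string
assert re.findall(r'^\w+', text) == ['line1']
# re.MULTILINE (inline form (?m)): per-line anchors, like Raku's ^^ and $$
assert re.findall(r'(?m)^\w+', text) == ['line1', 'line2']
```
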

richardwhiuk
1 replies
3h52m

Think I would have picked exactly the reverse (i.e. ^^ being more "starty" than "^").

lcnPylGDnU4H9OF
0 replies
3h32m

Reminds me of verbosity flags in some cli utilities. Often, -v is "verbose" and -vv is "very verbose" and -vvv... etc.

Terretta
0 replies
4h1m

And this is why this curmudgeon can't use Perl 6[1]. It randomly shuffles the line noise we learned over decades.

It seems so obvious that they defaulted to the opposite of what they should have: it clearly should have been ^ and $ for lines, and ^^ and $$ for the string, nesting like ((1)(2)(3)):

^^line1$\n^line2$\n^line3$\n$

[1]: That, and it's not anywhere, while Perl 5 is everywhere.

m0rissette
3 replies
7h2m

Why isn’t Perl anywhere on that chart when mentioning regex?

burntsushi
2 replies
6h17m

Because they're using regex101 to easily test the semantics of different regex engines and Perl isn't available on regex101. PCRE is though, which is a decent approximation. And indeed, Perl and PCRE behave the same for this particular case.

account42
1 replies
5h47m

Why isn’t Perl available on regex101 when its all about regex?

burntsushi
0 replies
5h31m

I dunno. Maybe because nobody has contributed it? Maybe because Perl isn't as widely used as it once was? Maybe because it's hard to compile Perl to WASM? Maybe some other reason?

vitiral
2 replies
3h34m

In Lua it's only the start/end of the string

A pattern is a sequence of pattern items. A caret '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.

https://www.lua.org/manual/5.3/manual.html#6.4.1

Lua's pattern matching is much simpler than regexes though.

Unlike several other scripting languages, Lua does not use POSIX regular expressions (regexp) for pattern matching. The main reason for this is size: A typical implementation of POSIX regexp takes more than 4,000 lines of code. This is bigger than all Lua standard libraries together. In comparison, the implementation of pattern matching in Lua has less than 500 lines.

https://www.lua.org/pil/20.1.html

denzquix
1 replies
3h3m

In Lua it's only the start/end of the string

There's an additional caveat: if you use the optional "init" parameter to specify an offset into the string to start matching, the ^ anchor will match at that offset, which may or may not be what you expect.

vitiral
0 replies
1h30m

That is a good point, and something I've actually (personally) used quite a bit when writing parsers

PuffinBlue
2 replies
8h57m

This seems like the perfect opportunity to introduce those unfamiliar to Robert Elder. He makes cool YouTube[0] and blog content[1] and has a series on regular expressions[2] and does some quite deep dives into the differing behaviour of the different tools that implement the various versions.

His latest on the topic is cool too: https://www.youtube.com/watch?v=ys7yUyyQA-Y

He's has quite a lot of content that HN folks might be interested in I think, like the reality and woes of consulting[3]

[0] https://www.youtube.com/@RobertElderSoftware

[1] https://blog.robertelder.org/

[2] https://blog.robertelder.org/regular-expressions/

[3] https://www.youtube.com/watch?v=cK87ktENPrI

aquariusDue
0 replies
8h50m

I'm glad to see someone else that has stumbled over his content. Seconding the recommendation.

CatchSwitch
0 replies
6h17m

He has so many favorite Linux commands lol

user2342
1 replies
9h40m

I'm confused by this blog post. In the table, what is the regex pattern being tested, and against which input?

mnw21cam
0 replies
9h34m

The input being matched is "cat\n" and the regex pattern is one of:

  "cat$" with multiline enabled
  "cat$" with multiline disabled
  "cat\z"
  "cat\Z"

febeling
1 replies
8h58m

Seriously, just write one unit test for your regex.

mannykannot
0 replies
5h11m

Indeed, one should test any regex one puts any trust in, but the problem is that if you take as a fact something that is actually a false assumption (as the author did here), your test may well fail to find errors which may cause faults when the regex is put to use.

This, in a nutshell, is the sort of problem which renders fallacious the notion that you can unit-test your way to correct software.

croes
1 replies
8h8m

Isn't a string with a newline character automatically multiline?

The second line is just empty, but the string isn't a single line anymore.

Joker_vD
0 replies
8h3m

No, it is not.

    3.195 Incomplete Line

    A sequence of one or more non-<newline> characters at the end of the file.

    3.206 Line

    A sequence of zero or more non-<newline> characters plus a terminating <newline> character.
courtesy of [0]. See also [1] for rationale on "text file":

   Text File

   [...] The definition of "text file" has caused controversy. The only difference between text and binary files is that text files have lines of less than {LINE_MAX} bytes, with no NUL characters, each terminated by a <newline>. The definition allows a file with a single <newline>, or a totally empty file, to be called a text file. If a file ends with an incomplete line it is not strictly a text file by this definition. [...]
[0] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...

[1] https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd...

wruza
0 replies
9h0m

By default, '$' only matches at the end of the string and immediately before the newline (if any) at the end of the string.

The rationale was probably "it should be easier to match input strings" and now it's harder for everyone.

weinzierl
0 replies
5h0m

The table in the article makes this look complicated, but it really isn't. All the cases in the article can be grouped into two families:

- The JS/Go/Rust family, which treats $ like \z and does not support \Z at all

- The Java, .NET, PHP, Python family, which treats $ like \Z and may or may not support \z (Python doesn't).

\Z tolerates a \n just before the end of the string (it matches right before it), while \z treats \n as a regular character and matches only at the true end. For a multiline $ the distinction doesn't matter, because a \n is itself an end of line.

Really the only deviation from the rule is Python's \Z, which is indeed weird.

teknopaul
0 replies
8h55m

Tldr;

$ does not mean end of string in Python.

somat
0 replies
9h34m

Structural regexes, as found in the sam editor, are an obscure but well-engineered regex variant. I am far from an expert, but my main takeaway from them is that most regex engines have an implied structure built around "lines" of text. While you can work around this, it is awkward. Structural regexes allow you to explicitly define the structure of a match; that is, you get to tell the engine what a "line" is.

http://man.cat-v.org/plan_9/1/sam

silent_cal
0 replies
5h40m

I think there's a big opportunity to re-write Regex as a SQL-type language. It's too bad I don't feel like trying.

raldi
0 replies
5h32m

Cmd-F perl

no matches

pksebben
0 replies
2h31m

Regex would really benefit from a comprehensive industry standard. It's such a powerful tool, but you have to keep relearning it whenever you switch contexts.

nurtbo
0 replies
2h54m

Totally get the desire, but it also feels like the last two paragraphs are solvable with:

``` re.match(pattern, text).group().rstrip("\n") ```

nunez
0 replies
5h36m

You can also use (?m) to enable multiline processing on PCRE-compatible regexp engines.

nebulous1
0 replies
5h50m

The fact that there are so many different peculiarities in different regex systems has always raised the hairs on the back of my neck. As in when a tool accepts a regex and I have to trawl the manual to find out exactly which regex dialect it accepts.

mmh0000
0 replies
2h4m

  > So if you're trying to match a string without a newline at the end, you can't
  > only use $ in Python! My expectation was having multiline mode disabled
  > wouldn't have had this newline-matching behavior, but that isn't the case.
I would argue this is correct behavior: a "line" isn't a "line" if it doesn't end with \n.[1]

  > 3.206 Line - A sequence of zero or more non- <newline> characters plus a terminating <newline> character.
[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...

menacingly
0 replies
2h46m

Of course it's end-of-line. How could it be the end of the string when the matter at hand is defining the string?

mdavid626
0 replies
7h9m

Is this a bug?

masswerk
0 replies
9h20m

As for the good old reference implementation (not "Parameter Efficient Reinforcement Learning"):

  my $string = "cat\n";
  $string =~ /cat$/s;   # -> true
  $string =~ /cat\Z/s;  # -> true
  $string =~ /cat\z/s;  # -> false

k3vinw
0 replies
8h11m

Another poor soul trying to solve one problem using regex and now they have two… ;)

javier_e06
0 replies
3h48m

I would hold a code review hostage if any file doesn't end with a newline.

My reasoning: if the file is transmitted and gets truncated, nobody would know for sure, since it would no longer end with a newline. Brownie points if the code ends with a comment noting that the file ends there.

The article calls computer languages "platforms", but they are computer languages. Bash is not included, which is weird. I believe the most common use of regular expressions is grep or egrep in bash or some other shell, but who knows, maybe I am hanging with the wrong crowd.

humanlity
0 replies
7h2m

Interesting

gorjusborg
0 replies
4h57m

If you really want to learn regex, you'll have a hard time piecing it all together via blog posts.

Jeffrey Friedl's Mastering Regular Expressions is a good book to read if you want to stop being surprised/lost.

I'll admit I stopped at the dive into DFA/NFA engine details.

frou_dh
0 replies
8h25m

Something I found really surprising about Python's regexp implementation is that it doesn't support the typical character classes like [:alnum:] etc.

It must be some kind of philosophical objection because there's no way something with as much water under the bridge as Python simply hasn't got around to it.
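The usual workarounds in Python are explicit ranges, the shorthand classes, or str methods:

```python
import re

# No [:alnum:] in Python's re; spell the class out instead
assert re.findall(r'[0-9A-Za-z]+', 'ab-12') == ['ab', '12']
# \w additionally includes underscore (and Unicode word characters)
assert re.findall(r'\w+', 'ab_12!') == ['ab_12']
# Or skip the regex entirely
assert 'ab12'.isalnum()
```
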

danbruc
0 replies
7h30m

People are confused about strings and lines. A string is a sequence of characters; a line can be two different things. If you consider the newline a line terminator, then a line is a sequence of non-newline characters (possibly zero) plus a newline; if there is no newline at the end, then it is not a (complete) line. That is what POSIX uses. If you consider the newline a line separator, then a line is just a sequence of non-newline characters (possibly zero). In either case, the content of the line ends before the newline, either because the newline terminates the line or because it separates the line from the next. [1]

The semantics of ^ and $ is based on lines - whether single-line or multi-line mode. For string based semantics - which you could also think of as entire file if you are dealing with files - use \A and \Z or their equivalents.

[1] Both interpretations have their merits. If you transmit text over a serial connection, it is useful to have a newline as line terminator so that you know when you have received a complete line. If you put text into text files, it is arguably easier to view the newline as a line separator, because then you cannot have an invalid last line. On the other hand, having line terminators in text files allows you to detect incompletely written lines.
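The two interpretations map neatly onto two Python string methods:

```python
text = "a\nb\n"
# Separator view: a trailing newline implies an empty final field
assert text.split("\n") == ["a", "b", ""]
# Terminator view: the trailing newline just closes the last line
assert text.splitlines() == ["a", "b"]
```
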

cpeterso
0 replies
3h25m

$ is the regex’s “the buck stops here” symbol. Here at the end of the line. :)

aftbit
0 replies
2h9m

Wait, in non-multiline mode, it only matches _one_ trailing newline? And not any other whitespace, including \r or \r\n? That is indeed surprising behavior. Why? Why not just make it end of string like the author expected?

    >>> import re
    >>> bool(re.search('abc$', 'abc'))
    True
    >>> bool(re.search('abc$', 'abc\n'))
    True
    >>> bool(re.search('abc$', 'abc\n\n'))
    False
    >>> bool(re.search('abc$', 'abc '))
    False
    >>> bool(re.search('abc$', 'abc\t'))
    False
    >>> bool(re.search('abc$', 'abc\r'))
    False
    >>> bool(re.search('abc$', 'abc\r\n'))
    False

SAI_Peregrinus
0 replies
4h27m

POSIX regexes and Python regexes are different. In general, you need to reference the regex documentation for your implementation, since the syntax is not universal.

Per POSIX chapter 9[1]:

9.2 … "The use of regular expressions is generally associated with text processing. REs (BREs and EREs) operate on text strings; that is, zero or more characters followed by an end-of-string delimiter (typically NUL). Some utilities employing regular expressions limit the processing to lines; that is, zero or more characters followed by a <newline>."

and 9.3.8 … "A <dollar-sign> ( '$' ) shall be an anchor when used as the last character of an entire BRE. The implementation may treat a <dollar-sign> as an anchor when used as the last character of a subexpression. The <dollar-sign> shall anchor the expression (or optionally subexpression) to the end of the string being matched; the <dollar-sign> can be said to match the end-of-string following the last character."

combine to mean that $ may match the end of string OR the end of the line, and it's up to the utility (or mode) to define which. Most of the common utilities (grep, sed, awk, Python, etc) treat it as end of line by default, since they operate on lines by default.

THERE IS NO SINGLE UNIVERSAL REGULAR EXPRESSION SYNTAX. You cannot reliably read or write regular expressions without knowing which language & options are being used.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...

Existing4190
0 replies
7h23m

perlre Metacharacters documentation states: $ Match the end of the string (or before newline at the end of the string; or before any newline if /m is used)

(/m enables multiline mode)

AtNightWeCode
0 replies
9h1m

There are many differences between implementations of regex. To name a few. Lookbehind, atomic groups, named capturing groups, recursion, timeouts and my favorite interop problem, unicode.