Folks who've worked with regular expressions before might know about ^ meaning "start-of-string" and correspondingly see $ as "end-of-string".
Huh. I always think of them as "start-of-line" and "end-of-line". I mean, a lot of the time when I'm working with regexes, I'm working with text a line at a time so the effect is the same, but that doesn't change how I think of those operators.
Maybe because a fair amount of the work I do with regexes (and, probably, how I was introduced to them) is via `grep`, so I'm often thinking of the inputs as "lines" rather than "strings"?
It's kind of driving me nuts that the article says ^ is "start of string" when it's actually "start of line", just like $ is "end of line". \A is apparently "start of string" like \Z is "end of string".
It’s not start of line though, unless the engine is in multiline mode. Here is the documentation for Python’s re for instance:
Or JavaScript:
\A and \Z are start/end of input regardless of mode… when they’re available, which is not the case in all engines.
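For what it's worth, the difference is easy to observe in Python's `re`. This is a minimal sketch; the sample string and variable names are made up for illustration, and other engines may behave differently.

```python
import re

# Illustrative input: two lines, no trailing newline.
text = "first\nsecond"

def positions(pattern, flags=0):
    """Return the offsets at which a zero-width pattern matches in `text`."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

# Default mode: ^ and $ only match at the ends of the whole string here.
default_starts = positions(r"^")
default_ends = positions(r"$")

# MULTILINE: ^ and $ additionally match around the embedded newline.
multiline_starts = positions(r"^", re.MULTILINE)
multiline_ends = positions(r"$", re.MULTILINE)

# \A and \Z ignore MULTILINE in Python's re.
anchored_starts = positions(r"\A", re.MULTILINE)
anchored_ends = positions(r"\Z", re.MULTILINE)

print(default_starts, default_ends)      # [0] [12]
print(multiline_starts, multiline_ends)  # [0, 6] [5, 12]
print(anchored_starts, anchored_ends)    # [0] [12]
```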
It is start and end of line. [1]
Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches either at the end of the string and at the end of each line (immediately preceding each newline).
In single-line [2] mode, the line starts at the start of the string and ends at the end of the line, where the end of the line is either the end of the string (if there is no terminating newline) or just before the final newline (if there is one).
In multi-line mode a new line starts at the start of the string and after each newline and ends before each newline or at the end of the string if the last line has no terminating newline.
The confusion is that people think they are in string mode whenever they are not in multi-line mode. They are not; they are in single-line mode. ^ and $ still use line semantics, and a terminating newline, if present, is still not part of the content of the line.
With \n\n\n in single-line mode the non-greedy ^(\n+?)$ will capture only two of the newlines, the third one will be eaten by the $. If you make it greedy ^(\n+)$ will capture all three newlines. So arguably the implementations that do not match cat\n with cat$ are the broken ones.
[1] https://docs.python.org/3/howto/regex.html#more-metacharacte...
[2] I am using single-line to mean not multi-line for convenience even though single-line already has a different meaning.
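The \n\n\n example can be checked in Python's `re` (without the MULTILINE flag). A small sketch, assuming Python's engine; the variable names are only for illustration.

```python
import re

three_newlines = "\n\n\n"

# Non-greedy: stops as soon as $ can succeed, i.e. just before the final newline.
lazy = re.match(r"^(\n+?)$", three_newlines)

# Greedy: takes all three newlines, and $ then succeeds at the very end.
greedy = re.match(r"^(\n+)$", three_newlines)

# cat$ against "cat\n": $ succeeds just before the trailing newline.
cat = re.search(r"cat$", "cat\n")

print(len(lazy.group(1)))    # 2
print(len(greedy.group(1)))  # 3
print(cat.span())            # (0, 3)
```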
You seem to have redefined “line” as “not a line”.
I’m sure redefining “line” as “nothing like what anyone reasonable would interpret as a line” will help a lot and clear the confusion right up.
The line delimiter is a newline.
If you have a file containing `A\nB\nC`, the file is three lines long.
I guess it could be argued that a file containing `A\nB\nC\n` has four lines, with the fourth having zero length.
That a regex is applying to an in memory string vs a file doesn't feel to me like it should have different semantics.
Digging into the history a little, it looks like regexes were popularized in text editors and other file oriented tooling. In those contexts I imagine it would be far more common to want to discard or ignore the trailing zero length line than to process it like every other line in a file.
Technically the “newline” character is actually a line _terminator_. Hence “A\n” is one line, not two. The “\n” is always at the end of a line by definition.
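The two readings (newline as delimiter vs. newline as terminator) give different counts, which can be sketched in Python. The helper names below are hypothetical and exist only to label each interpretation.

```python
def count_as_delimiter(text: str) -> int:
    """Newline separates lines: a trailing newline starts one more (empty) line."""
    return len(text.split("\n"))

def count_as_terminator(text: str) -> int:
    """Newline ends a line: only newline-terminated lines are counted."""
    return text.count("\n")

def count_splitlines(text: str) -> int:
    """Python's str.splitlines(): lenient, counts an unterminated last line too."""
    return len(text.splitlines())

for sample in ("A\nB\nC", "A\nB\nC\n"):
    print(repr(sample),
          count_as_delimiter(sample),
          count_as_terminator(sample),
          count_splitlines(sample))
# 'A\nB\nC'   3 2 3
# 'A\nB\nC\n' 4 3 3
```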
So if you have "A" in a file with no newline, there are no lines in that file?
Yes, that is a file with zero lines that ends with an "incomplete line". Processing of such files by standard line-oriented utilities is undefined in the opengroup spec. So, for instance, the effect of "grep"ping such a file is not defined. Heck, even "cat"ting such a file gives non-ideal results, such as colliding with the regular shell prompt. For this reason, a lot of software projects I work on check and correct this condition whenever creating a commit.
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1... ("text file")
It's a file with zero complete lines. But it has 1 line, that's incomplete, right?
The file starts empty. Anything in it starts "a line". So it's 1 incomplete line.
I hate weird states.
No, it is valid for a file to have content but no lines.
Semantically many libraries treat that as a line because while \n<EOF> means "the end of the last line" having just <EOF> adds additional complexity the user has to handle to read the remaining input. But by the book it's not "a line".
If I said "ten buckets of water" does that mean ten full buckets? Or does a bucket with a drop in it count as "a bucket of water?" If I asked for ten buckets of water and you brought me nine and one half-full, is that acceptable? What about ten half-full buckets?
A line ends in a newline. A file with no newlines in it has no lines.
That's beyond ridiculous. In most languages, when you are reading a line from a file and it doesn't have a \n terminator, it's going to give you that line, not say "oops, this isn't a line, sorry".
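As one data point, Python's `readline()` behaves this way. A minimal sketch using an in-memory stream in place of a real file; `read_all_lines` is a made-up helper name.

```python
import io

def read_all_lines(text: str) -> list[str]:
    """Read a stream line by line until readline() signals end of input."""
    stream = io.StringIO(text)
    lines = []
    while True:
        line = stream.readline()
        if line == "":  # empty string means end of input
            break
        lines.append(line)
    return lines

# The final, unterminated "B" is still handed back as a line.
print(read_all_lines("A\nB"))    # ['A\n', 'B']
print(read_all_lines("A\nB\n"))  # ['A\n', 'B\n']
```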
That's a relatively recent invention compared to tools like `wc` (or your favorite `sh` for that matter). See also: https://perldoc.perl.org/functions/chop wherein the norm was "just cut off the last character of the line, it will always be a newline"
Pedantically, if it doesn't end with a newline, it's considered a binary file and not a text file. Binary files don't have lines.
In practice, most utilities expecting text files will still operate on it.
No file has lines.
"Lines" are a convention established by (or not) software reading a data stream.
It's a file with 0 lines and some trailing garbage.
No, a line is defined as a sequence of characters (bytes?) with a line terminator at the end.
Technically, as per POSIX, a file as you describe is actually a binary file without any lines. Basically just random binary data that happens to kind of look like a line.
Another way to look at it is that concatenating files should sum the line count. Concatenating two empty files produces an empty file, so 0 + 0 = 0. If “incomplete lines” are not counted as lines, then the maths still works out. If they counted as lines, it would end up as 1 + 1 = 1.
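The additivity argument can be sketched in Python. The two counting functions are hypothetical stand-ins: one counts only newline-terminated lines (the way `wc -l` does), the other also counts an unterminated tail.

```python
def terminated_lines(data: bytes) -> int:
    """Count only newline-terminated lines."""
    return data.count(b"\n")

def lenient_lines(data: bytes) -> int:
    """Also count a trailing unterminated chunk as a line."""
    return len(data.splitlines())

a, b = b"A", b"B"  # two "files", each without a trailing newline

# Strict counting stays additive under concatenation: 0 + 0 == 0.
strict_sum = terminated_lines(a) + terminated_lines(b)
strict_cat = terminated_lines(a + b)

# Lenient counting does not: 1 + 1 == 2, but b"AB" is a single line.
lenient_sum = lenient_lines(a) + lenient_lines(b)
lenient_cat = lenient_lines(a + b)

print(strict_sum, strict_cat)    # 0 0
print(lenient_sum, lenient_cat)  # 2 1
```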
The opengroup spec says no such thing.
3.206 Line
A sequence of zero or more non- <newline> characters plus a terminating <newline> character.
See also ‘3.403 Text File’ for the definition of a text file. No new line characters, no lines. No lines, not a text file.
Yep, since wc(1) apparently adheres strictly to the definition of a newline-terminated text file. This is why plaintext files should end with a newline. :)
See: https://stackoverflow.com/a/25322168/1725151
Why don't you go ask?
Technically, that is one of two possible interpretations, and you seem to have invented a "by definition" out of thin air.
Very very technically a "newline" character indicates the start of a new line, which is why it is not called the "end-of-line" character.
I mean, the person you are responding to didn't invent the definition out of thin air... the POSIX standard did:
3.206 Line A sequence of zero or more non- <newline> characters plus a terminating <newline> character.
https://pubs.opengroup.org/onlinepubs/9699919799.2018edition...
Posix getline() includes EOF as a line terminator:
EOF seems the same as end-of-string.
It doesn't indicate the start of a new line, or files would start with it. Files end with it, which is why it is a line terminator. And it is by definition: by the standard, by the way cat and/or your shell and/or your terminal work together, and by the way standard utilities like `wc` treat the file.
Suddenly the DOS/Windows solution of using \r\n instead of just \n seems to offer some advantages.
This does precisely nothing to solve the ambiguity issue when a final line lacks a newline. The representation of that newline isn't relevant to the problem.
It's actually slightly worse: Windows defines newline as a delimiter, not a terminator. So this:
Would be 2 lines in *nix and 3 lines in Windows.
The "Windows way" is the "right way" for a few reasons.
This is definitely not one of them.
“A\n” is two lines.
Factually incorrect.
The POSIX definition of a line is a sequence of non-newline characters - possibly zero - followed by a newline. Everything that does not end with a newline is not a [complete] line. So strictly speaking it would even be correct that cat$ does not match cat because there is no terminating newline; it should only match cat\n. But as lines missing a terminating newline are a thing, it seems reasonable to be less strict.
Works for me.
How do you square that with your assertion that in your invention of "single-line mode" you implicitly define "line" as matching \n\n?
If you are not in multi-line mode, then a single line is expected and consequently there is at most one newline at the end of the string. You can of course pick an input that violates this, run it against a multi-line string with several newlines in it. cat\n\n will not match cat$ because there is something between cat and the end of the line, it just happens to be a newline but without any special meaning because it is not the last character and you did not say that the input is multi-line.
Probably a vulnerability issue. Programmers would leave multiline mode on by mistake, then validate that some string only matches ^[a-zA-Z]*$… only for the string to have a \n and an SQL injection on the second line.
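A rough illustration of that failure mode in Python's `re`, assuming a letters-only pattern and a made-up malicious payload. `re.fullmatch` is shown as one possible stricter alternative, not the only one.

```python
import re

LETTERS_ONLY = r"^[a-zA-Z]*$"

# Hypothetical input: a harmless first line, then a payload on the second.
payload = "abc\n'; DROP TABLE users; --"

# With MULTILINE left on, the first line alone satisfies ^...$.
accepted_by_mistake = re.search(LETTERS_ONLY, payload, re.MULTILINE) is not None

# Without MULTILINE the embedded newline makes the check fail...
rejected_default = re.search(LETTERS_ONLY, payload) is None

# ...but a single trailing newline still slips through $.
trailing_newline_ok = re.search(LETTERS_ONLY, "abc\n") is not None

# fullmatch requires the whole string to match, newline included.
strict_rejects = re.fullmatch(r"[a-zA-Z]*", "abc\n") is None

print(accepted_by_mistake, rejected_default, trailing_newline_ok, strict_rejects)
# True True True True
```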
No? It’s a semantics decision.
What is driving me nuts is that we have Unicode now, so there is no need to use common characters like $ or ^ to denote special regex state transitions.
The idea of changing a decades-old convention to instead use, as I assume you are implying, some character that requires special entry, is beyond silly.
I don't think anyone that writes regex would feel especially challenged by using the Alt+ | Ctrl+Shift+u key combos for unicode entry. Having to escape fewer things in a pattern would be nice.
Also, code is read more often than it is written.
People say this all the time, but is it really always true? I have a ton of code that I wrote, that just works, and I never really look at it again, at least not with the level of inspection that requires parsing the regex in my head.
I write regexes all the time, and I don't know if I would be CHALLENGED by that, but it would be annoying. Escaping things is trivial, and since you do it all the time it is not anything extra to learn. Having to remember bespoke keystrokes for each character is a lot more to learn.
It’s not that silly. You constantly get into escape conundrums because you need to use a metacharacter which is also a metacharacter three levels deep in some embedding.
(But that might not solve that problem? Maybe the problem is mostly about using same-character delimiters for strings.)
And I guess that’s why Perl is so flexible with regards to delimiters and such.
Yes, languages really need some sort of "raw string" feature like Python (or make regex literals their own syntax like Perl does). That's the solution here, not using weird characters...
Why not? Common characters are easier to type, and presumably if you are using regex on a unicode string it might include these special characters anyway, so what have you gained?
In theory yes, in practice no.
What you have gained is that the regex is now much easier to read.
It's easy to read now.
That's like "in theory we need 4 bytes to represent Unicode, but in practice 3 bytes is fine" (glances at universally-maligned utf8mb3)
If we were willing to ignore the ability to actually type it, you don't need Unicode for that; ASCII has a whole block of control characters at the beginning; I think ASCII 25 ("End of medium") works here.
That gives the author space for another article ;)
What with unicode, it'd be fun to have Α and Ω available to make our regexps that much more readable...
Same, tho it'd be interesting to see if this behavior holds if the file ends without a trailing newline and your match is on the final newline-less line.
Fortunately, it's pretty simple to test.
The line does end with the file, so it's logically consistent.
It's not matching the newline character after all.
Yes exactly, they match the end of a line, not a newline character. Some examples from documentation:
man 7 regex: '$' (matching the null string at the end of a line)
pcre2pattern: The circumflex and dollar metacharacters are zero-width assertions. That is, they test for a particular condition being true without consuming any characters from the subject string. These two metacharacters are concerned with matching the starts and ends of lines. ... The dollar character is an assertion that is true only if the current matching point is at the end of the subject string, or immediately before a newline at the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however, that it does not actually match the newline. Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.
Thanks! I was AFK and didn't have a grep (or a shell) handy on my phone.
Same here; when I saw the title I was like "well obviously not, where did you hear that?"
In nearly two decades of using regex I think this might be the first time I've heard of $ being end of string. It's always been end of line for me.
Take a look at, for example, these stackoverflow answers about a regex to validate an e-mail address: https://stackoverflow.com/a/8829363
These people are I think not intending to say a newline character is permitted at the end of an e-mail address.
(Of course people using 'grep' would have different expectations for obvious reasons)
Even disregarding whether or not end-of-string is also an end-of-line or not (see all the other comments below), $ doesn't match the newline, similar to zero-width matches like \b, so the newline wouldn't be included in the matched text either way.
I think this series of comments might be clearest: https://news.ycombinator.com/item?id=39764385
Problem is, plenty of software doesn't actually look at the match but rather just validates that there was a match (and then continues to use the input to that match).
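A sketch of that pitfall in Python. The e-mail pattern here is deliberately simplistic and hypothetical (it is not the one from the linked answers); the point is only that the match succeeds while the original input still carries the newline.

```python
import re

# Simplistic, illustrative pattern; real e-mail validation is more involved.
LOOSE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
STRICT = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # intended for fullmatch()

candidate = "user@example.com\n"

loose_match = LOOSE.match(candidate)

# The matched text excludes the newline, because $ is zero-width...
matched_text = loose_match.group(0)

# ...but code that only checks "did it match?" keeps using `candidate`,
# which still ends with "\n".
input_still_dirty = candidate.endswith("\n")

# Requiring the whole string to match rejects the trailing newline.
strict_match = STRICT.fullmatch(candidate)

print(repr(matched_text))  # 'user@example.com'
print(input_still_dirty)   # True
print(strict_match)        # None
```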
You couldn’t write a post like this if you didn’t start with a strawman.
I’ve always thought that as well; mostly due to Vim though.
^ - takes you to start of line
$ - takes you to end of line
^ actually takes you to the first non-whitespace character in the line in vim. For start of line you want 0
I don't have (n)vi(m) open right now but I think this only applies to prepending spaces. For prepending tabs, 0 will take you to the first non-tab character as well.
i feel like this perspective will be split between folks who use regex in code with strings and more sysadmin folks who are used to consuming lines from files in scripts and at the cli.
but yeah seems like a real misunderstanding from “start/end of string” people
I'm the same, but now that I try in Perl, sure enough, $ seems to default to being a positive lookahead assertion for the end of the string. It does not match and consume an EOL character.
Only in multiline mode does it match EOL characters, but it does still not appear to consume them. In fact, I cannot construct a regex that captures the last character of one line, then consumes the newline, and then captures the first character of the next line, while using $. The capture group simply ends at $.
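For comparison, in Python's `re` (not Perl, so this may not carry over) such a pattern does seem to be constructible, as long as the newline is consumed explicitly after the zero-width $. A small sketch with a made-up two-line input:

```python
import re

text = "ab\ncd"  # illustrative input

# (.)  captures the last character of the first line
# $    asserts end of line (zero-width, consumes nothing)
# \n   consumes the newline explicitly
# ^    asserts start of the next line (zero-width)
# (.)  captures the first character of the second line
m = re.search(r"(.)$\n^(.)", text, re.MULTILINE)

print(m.groups())        # ('b', 'c')
print(repr(m.group(0)))  # 'b\nc'
```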
In `sed` it's end of string.
The string is usually a single line, but not if you use commands like `N` to manipulate multi-line strings.
Maybe because a fair amount of the work I do with regexes (and, probably, how I was introduced to them) is via `grep`, so I'm often thinking of the inputs as "lines" rather than "strings"?
Vim is what did that for me.
This must be the "second problem" everyone talks about with regular expressions.