So what's different about writing a compiler in 2024 compared to, say, 10, 20, or 30 years ago? When I started writing compilers in the '80s and '90s, lex/flex and yacc/bison were popular. ANTLR came out, but I never had a chance to use it. Everything after lexing and parsing was always hand-rolled.
Weird that this is about building a C compiler[0] in OCaml. I expected the implementation language to also be C, both for consistency and because I'm willing to bet that there are more people who can read C than OCaml.
[0] actually from the readme in the github repo[1] it seems to be a C subset, not all of C
ocaml makes writing a compiler enormously more accessible, and learning to read ocaml, while it can be somewhat intimidating at first, is much easier than learning to write a compiler
(imagine a medieval accountant trying to learn to do long division in roman numerals. he'll be much better off learning the western arabic numerals fibonacci is so excited about)
I really really really want to get more into OCaml, but as far as I know there is no good support for a debugger to use in an IDE like VSCode or even vim. Everyone I talk to says they just do print debugging. It's easier to 'reason about your code' in FP, but I do NOT want to go back to a time where I coded without breakpoints. I use F# because of its first-party tooling support
it does have a debugger with breakpoints, which even supports time-travel debugging (except on windows obviously), but i've never used it. it even has first-party ide integration: https://ocaml.org/manual/5.2/debugger.html#s:inf-debugger
i use debuggers a lot when i'm programming in assembly, and from time to time when i'm programming in c or c++, but in ocaml i've never needed one. it's not that i've never had bugs in ocaml, but they tend to be of a different flavor, a flavor for which breakpoints are of little value
it sounds like your f# experience is different; what kinds of bugs have you recently found the debugger valuable for in f#?
debug logging is usually a superior option to breakpoint debugging for problems to which both are applicable, because debug logging shows you the whole history of your program's execution rather than just a single point in time. breakpoint debugging requires a lot of manual labor, painstakingly operating the machine to navigate the execution to the state that has the problem. it's like grinding in a video game. i'd rather program the computer to do that labor for me, at which point i no longer need the debugger
(except in c and assembly)
it does have a debugger with breakpoints, which even supports time-travel debugging (except on windows obviously), but i've never used it. it even has first-party ide integration: https://ocaml.org/manual/5.2/debugger.html#s:inf-debugger
1. I am developing on Windows, so that's an issue for me, and
2. I don't use emacs, I use VS Code, and I've not been able to get the experimental debugger working for the VS Code plugin.
It's not that the debugger is purely for fixing bugs; I use it as an active part of development. I want to be able to freeze the program and inspect values as they are running. I may know what the types or state of the program are just by viewing the code, but I want to be able to inspect the actual data for myself as the program is running.
debug logging shows you the whole history of your program's execution rather than just a single point in time
breakpoints also provide stack traces, so they provide a kind of history as well. I'd rather inspect and interact with a program than dig through thousands of lines of logs
breakpoint debugging requires a lot of manual labor, painstakingly operating the machine to navigate the execution to the state that has the problem
I see things the opposite as you: print debugging is tedious and requires restarting the program from the beginning whenever you make a change to the source, unless you are constructing your program piecemeal with the repl, which I consider to be extremely tedious as well. To me, a debugger with breakpoints is a far more efficient way to code than print debugging.
I think there is a cultural difference in software engineering between people who use debuggers and people who don't. John Carmack once pointed out that people who come from the game dev and Windows/PC world use debuggers while people from the linux and web dev world tend not to. It seems to be a matter of preference/taste, and I think FP programmers seem to have a distaste for debuggers and graphical debugging/development environments
"... I think FP programmers seem to have a distaste for debuggers ..." I'm unsure that I'd make this assertion so broadly. For Haskell, I'm usually left using Debug.Trace because I've found the traditional symbolic-step debugger is less than ergonomic in the face of lazy evaluation. In Common Lisp, the debugger is my best friend.
"... and graphical debugging/development environments" The loud ones might be saying they use Arch btw (on ThinkPads no less) and lean heavily on using neovim inside the current popular Rust rewrite of tmux, but I personally don't care for it.
VS Code is neat. I mostly still use Emacs because that's where the tooling investment has historically been, but my ideal is definitely much closer to Smalltalk-meets-Oberon.
mine too! except, not so much of a closed-world system? a lot of what i like about emacs is that it has some of that same live-malleability that smalltalk and oberon have
if you've tried godot i'm interested to hear what you think about it
Unfortunately, I have no idea what godot is. I would like to know more though if you're so inclined.
(Surely not the game engine? That's the only thing my disambiguation machinery can come up with.)
yup, the game engine!
John Carmack once pointed out that people who come from the game dev and Windows/PC world use debuggers while people from the linux and web dev world tend not to. It seems to be a matter of preference/taste, and I think FP programmers seem to have a distaste for debuggers and graphical debugging/development environments
i think that's true! but i don't think it's purely a matter of preference; it's also a matter of what the ecosystems support well and what you're trying to achieve. lisp is a huge exception to the rule about fp programmers; lisp systems, even including emacs, have extremely strong debugging support. but non-lisp fp languages tend to heavily emphasize static typing, thinking things through ahead of time, and correctness by construction, which reduce the need for debugging. but those are more valuable for writing compilers than for writing games or uis in general, where it's hard to define what counts as 'correct' ahead of time but easy to recognize it when you test it interactively
webdev of course has the problem that you can't stop your http response handler while the browser is waiting for a response. and, often, it's sadly not very concerned with ux. and it's often concerned with operating systems in a way where you have to debug problems after they occur, in part because of the scale of the problems
automated testing is another ecosystem thing that reduces the need for debuggers; to a significant extent it competes with static typing
one of the things i really appreciate about godot is being able to continuously adjust parameters and observe variables in my games as they're running. godot is motherfucking awesome, man. i definitely don't have a distaste for graphical debugging and development environments!
breakpoints also provide stack traces, so they provide a kind of history as well. I'd rather inspect and interact with a program than dig through thousands of lines of logs (...) print debugging is tedious and requires restarting the program from the beginning whenever you make a change to the source
oh, see, i have programs like grep and emacs that dig through thousands of lines of logs for me. often when i'm debugging from logs i don't run the program at all; i just look at the logs and the source code. sometimes the program is running someplace i can't interact with it—memorably, on some occasions, on a satellite out of range of the groundstation. and usually exceptions on python or java give me a pretty decent stack trace, though other log messages unfortunately don't
there's another ecosystem support issue here—although i've sometimes developed on systems that supported fix-and-continue (cmucl, gforth, squeak, basic-80, godot), python and gcc support it very poorly or not at all. so for me i have to restart the program from the beginning whenever i make a change to the source in any case, whether i'm doing printf debugging or not. godot, again, is a very nice exception to this rule, and incidentally lets me add print debugging to the game while it's running
one of the great things about a breakpoint debugger, from my point of view, is that it makes it possible to add logging after the fact to a running program without editing the source or restarting it
i really appreciate you sharing your experience!
If you're literally "from the Linux world", you're probably younger and less experienced, and likely from a hobby background rather than CS or engineering.
Developers in Unix shops before the Linux era used debuggers.
Obviously, the GNU project developed a debugger for an audience. GNU wouldn't be a complete replacement for proprietary systems like Unix with only compilers, but no debugger.
OCaml? Thanks for saving me a click!
OCaml is one of the most used languages for compiler design
A good engineer should be able to use the right tool for the job
For hobbyist compiler implementations, right? Compilers for the most popular languages are either written in C/C++, or self-hosted.
You can write compilers in almost any language. I fail to see how C, C++, or even Java or Python aren’t the right tool for the job here. I like pattern matching too, but given that hundreds of successful production compilers have been written without pattern matching, it’s surely just a personal preference.
I’ve worked on multiple compilers in industry that are written in OCaml. A number of industrial static analyzers are written in OCaml too (e.g., Infer from Facebook/Meta). Yes, LLVM and GCC are the big ones written in the C/C++ family, but they don't represent everything.
And the Go, Java, Ruby, JavaScript, C#, Typescript, PHP, Kotlin, R compilers, and so on.
But even for hobby projects, it’s just a matter of personal preference. OCaml is great for implementing compilers. So are Go, C++, and Java.
I mean, we _are_ talking about a book which invites you to build your own toy C compiler ^^
Nevertheless, OCaml is very strong in compiler design. For example Rust and Hack were written in OCaml initially.
Nevertheless, you are not wrong that compilers needing the very last bit of performance, like the JVM and LLVM, tend to be written in C++
But the dividing line tends much more toward high performance vs. very high performance than toy vs. production
Java and Python are suitable for implementing a toy compiler, and the author invites you to use any language you like. Just the reference implementation uses OCaml
I would however argue that using C++ is quite advanced since it does not have pattern matching, and using C is just masochism. You will be fighting against the language to do even trivial things instead of fighting the actual problem at hand
I totally agree that OCaml is a great language to write a compiler. I’ve used Rust and Haskell, and loved them both.
I was more so pushing back on the implication that if it's not OCaml, it's not the right tool for the job.
Like, I honestly can't think of a mainstream language in which it would be hard to implement a C compiler.
A Retargetable C Compiler is another book that implements a C compiler in C.
https://www.amazon.com/Retargetable-Compiler-Design-Implemen...
The bonus of writing a C compiler in C is that you get to experiment with self-compilation.
There are many disadvantages to writing a compiler in c (and I have done it). For me, the biggest is simply that it is very verbose for the type of things that you do commonly in a compiler. I have written a self-hosting compiler that transpiles to c, and the generated code is about 5x as long as the original code. I could go through and clean it up and probably get it down to 2-3x, but there are certain concepts that cannot easily be expressed compactly in c (except possibly with preprocessor abuse that I don't have patience for). A simple example is that if you want to represent a union type, you have to manually define 2 structs and an enum. Matching with switch is also verbose, as you have to both match and manually assign the value in the match case. Again, you can use macros to do this, though at that point you arguably aren't even using c but rather a hybrid language. A language like ocaml makes it much more straightforward to define and match on unions, as well as giving you other nice high-level things like gc, so you can focus on the compiler itself and not low-level details.
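To make that concrete, here is a minimal OCaml sketch (the type and names are invented for illustration, not taken from the book): one declaration replaces the C enum-tag-plus-union-plus-wrapper-struct dance, and the match both dispatches on the tag and binds the payload:

    (* one type definition instead of two structs and an enum *)
    type expr =
      | Constant of int
      | Unary of string * expr          (* operator, operand *)
      | Binary of string * expr * expr  (* operator, left, right *)

    (* the match dispatches and binds in one step; no manual tag
       checks or field assignments as in a C switch *)
    let rec eval = function
      | Constant n -> n
      | Unary ("-", e) -> - (eval e)
      | Binary ("+", l, r) -> eval l + eval r
      | Binary ("*", l, r) -> eval l * eval r
      | Unary (op, _) | Binary (op, _, _) -> failwith ("unknown operator " ^ op)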
Written in OCaml? Ahh, interesting. Suddenly I remember a similar book:
Modern Compiler Implementation in ML: https://www.cs.princeton.edu/~appel/modern/ml/
As an undergrad student, I think the C version is kinda easier to understand, though.
I wonder why there is not the same book for c++... mmmmh... I really wonder... (irony).
It is because c++ has an absurdly and grotesquely massive and complex syntax (like rust...).
Yeah, Rust is the language for people who think C++ is not complex (or hostile) _enough_.
When I tried to read some rust, I was surprised at how alien it is compared to mainstream languages and how convoluted the syntax is.
My experience (and I admit I may be too biased given years of prior C/C++ experience) is that Rust's syntax is a necessity, since no other mainstream languages besides C/C++ are as low-level as Rust.
Most mainstream languages have a GC, and don't support distinguishing between values on the stack or references, don't need to deal with lifetimes or don't provide the safety you get with them, etc.
I'm curious though, could you give an example of syntax you consider convoluted, and how you would do it instead?
How do you want to be taken seriously if you don't see the convolution of the syntax of c++ and rust? You are going against an absolute truth.
Oh, I absolutely see it.
My point was that it's necessary. How would you implement the same features Rust and C++ have, without garbage collection, but with simpler syntax?
So you are very careful not to depend on any of their toxic complex features. In the end, better not use them at all.
Even plain and simple C99 compilers can be reasonably written by a solo dev, so a motivated small team...
I do depend on their features, and I see the value in doing that.
For example, at $WORK I had to reimplement Apache's mod_rewrite.c in Rust, and I made it 100x faster for our particular use-case.
I could've done that in C as well, sure, but the simplicity of the C language just moves the complexity to the code itself; whereas, with Rust, I whipped out a prototype in just 3 days, and was able to freely pass around pointers to data, with zero allocations, zero copying, and every time it compiled I knew it was guaranteed to be safe.
You can't get that safety in C. You can't get that speed in Java.
You can't do this in any other language that does not have these features. I absolutely agree that the syntax is horrible... but I see no other way to achieve this.
I severely disagree; this is quite short-sighted.
The obscene complexity of the rust language (like c++) makes building an alternative toolchain beyond anything reasonable: that reason alone is sufficient to avoid it.
You can have as many "features" as you want; the anti-feature of absurd/grotesque syntax complexity _alone_ buries it.
There is nothing more to say or argue about. This is dead simple.
We will have to agree to disagree, then.
Indeed, but with a major difference:
Your side has a significantly higher-cost technical dependency that we don't have.
I would love to see a book that talks about going all the way to generate machine code, i.e., not stopping at generation of assembly.
Alternatively, I would like to learn about not just how to make a compiler, but also simultaneously a debugger, hot-reloading, etc.
Writing a simple assembler is trivial. Even macro assemblers are very easy.
However, it's also boring.
Nevertheless the contents of the book cover all the techniques required to write an assembler, if you'd really like to
I understand that an assembly file can be parsed in the same way. However, I want to learn about the machine instructions down to the level of bits, and likewise the layouts of binary files. Unless I am able to go all the way to machine code loaded in memory, I would not know where in memory to add a breakpoint instruction when a developer wants one on a line of code.
If there is some library that can help create machine code from assembly instructions on a line by line basis (at least as opposed to invoking a separate program that generates the entire binary collectively from the assembly code), that could also work.
In my case, I already know enough of the lexer, parser, etc., parts. What's missing is going all the way to making a debugger, profiler, etc.
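As a taste of the bit-level side, here is a toy OCaml sketch (my own illustration, not from the book or any library) encoding the x86-64 instruction mov rax, imm64 followed by ret as raw bytes:

    (* "mov rax, imm64" is REX.W (0x48), opcode 0xB8+rd with rd = 0
       for rax, then the 8-byte immediate in little-endian order *)
    let encode_mov_rax (imm : int64) : bytes =
      let b = Bytes.create 10 in
      Bytes.set b 0 '\x48';  (* REX.W prefix: 64-bit operand size *)
      Bytes.set b 1 '\xb8';  (* B8+rd, rd = 0 selects rax *)
      for i = 0 to 7 do      (* immediate, least significant byte first *)
        Int64.(shift_right_logical imm (8 * i) |> logand 0xFFL |> to_int)
        |> Char.chr |> Bytes.set b (2 + i)
      done;
      b

    let () =
      let code = Bytes.cat (encode_mov_rax 42L) (Bytes.of_string "\xc3") (* ret *) in
      Bytes.iter (fun c -> Printf.printf "%02x " (Char.code c)) code
      (* prints: 48 b8 2a 00 00 00 00 00 00 00 c3 *)

From there, the "layout of binary files" part is mostly bookkeeping: wrapping byte arrays like this one in ELF/Mach-O/PE headers and section tables.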
If there is some library that can help create machine code from assembly instructions on a line by line basis
That's what JIT libraries do, for example asmjit: https://github.com/asmjit/asmjit/blob/master/test/asmjit_tes...
asmjit is great; I used it for building a primitive (but quite fast) query engine for in-memory graphs. It made it simple to go from query AST to native instructions, most of which was "call this other compiled function".
Looks cool! Thanks.
Building a debugger and profiler is quite an advanced task compared to building an assembler though ^^
Also much of that work is heavily dependent on the used operating system.
Nevertheless, I'm wishing you all the best on your journey!
Really? You can get quite far just with ptrace() on Linux... and maybe something like system("nm xxxx > file") for the symbols.
Check out GNU Binutils and their usage of libbfd and libopcodes.
There can be weird interactions unless there are strong enough limits on what kind of expressions the assembler allows. Especially if it supports conditional assembly and loops in the macros. One ugly way around it -- which causes its own headaches -- is to introduce pass-sensitive conditional assembly (as in "if in pass 1/2/...").
It's also "fun" if some instructions come in different sizes... and you may need stronger restrictions on allowed expressions in that case.
The debugger book is coming soon. https://nostarch.com/building-a-debugger
Awesome! Thanks.
I swear I've seen this cover before... is this a new release or an updated edition of an older book?
The book is not yet published but has been in early access for a couple of years
Was featured here a couple of times.
Unfortunately, the timing of the release clashes with the summer holidays. Will take a look at it next year
The book is not yet published but has been in early access for a couple of years
According to the top post's link, it was released in July 2024.
I seem to be mistaken. The 20th of August as listed by Amazon must be the European release date then
It’s actually out now, I have a copy! Ordered directly from No Starch Press.
Many compiler-related books take inspiration from the "Dragon Book" (Compilers: Principles, Techniques, and Tools), so there are likely lots of books with similar-looking covers.
The cover looks nothing like the Dragon Book, however?
Point was more that there are bunch of compiler related books featuring a dragon (and knight/other character). The original also has a different looking dragon & knight themed cover for every edition.
Like if I’d see the book on a shelf I would instantly guess it’s related to compilers. And I bet that’s completely intentional homage to the original.
There was a HN article about the same book about a month ago:
https://news.ycombinator.com/item?id=40940799
So maybe you saw it then.
”Automate the Boring Stuff with Python” has a similar cover, by the same publisher.
I believe the author first started by making blog posts and then interrupted them to simply make a book about it
I see many comments saying that the book implements the C compiler in ocaml. In the introduction the author states that the book actually uses pseudo code so you are actually free to implement it in any language. The only recommendation is that you use a language with pattern matching because the pseudo code makes heavy use of it. The reference implementation is in ocaml.
Thanks, can you please lemme know which part uses pattern matching? I'd assume mostly in the lexer, but the parser should just be something that consumes the tokens and spits out an AST. Unless of course it combines the two.
Presumably anything that walks the syntax tree.
Thanks. At first I thought it was something like regex, but then I found it's something from functional programming. I need to read a few chapters of the book before buying it because I'm not interested in learning FP at the moment.
Question for HN, pattern matching is defined as “runtime type/value checking”, is that correct?
Is duck typing the pseudo-unsafe alternative? (Not unsafe as in accessing unsafe memory, but as in throwing exceptions if the duck-typed function doesn’t exist on the current type)
Can C handle both?
Coming from a static type system like rust and c#, i’m doing a lot of “if this is a duck, duck.quack()” and i’m looking for faster alternatives and less verbosity if possible
One thing is that pattern matching can make writing tree manipulation code succinct and easier to read. For example, take this article[0] that describes the difference list algorithm (in Haskell). Basically, it's kind of like a rope, but for lists. It's a tree with lists at the leaves, and when you want to convert it into a list, you rewrite the tree to be right-leaning, and then concatenate all the lists at once. This turns repeated concatenation at the end of lists from a quadratic-time operation into a linear-time one (strcpy can be an example of this in C [1]). The code can be written like this:
    data Tree a = Leaf a | Branch (Tree a) (Tree a)

    fromList :: [a] -> Tree [a]
    fromList = Leaf

    toList :: Tree [a] -> [a]
    toList (Leaf x) = x
    toList (Branch (Leaf x) r) = x ++ toList r
    toList (Branch (Branch l1 l2) r) = toList (Branch l1 (Branch l2 r))

    append :: Tree [a] -> Tree [a] -> Tree [a]
    append = Branch
In a language that doesn't have tree pattern matching, the code wouldn't be this short and easy to understand, and I don't think it could be replicated just by having duck typing. Rust has pattern matching, but because it's primarily focused on lower-level concerns like pointers and memory ownership, pattern matching isn't this nice, because you have to pattern match on the pointer first.

Since a compiler is all about tree manipulation, support for tree pattern matching should be a boon.
[0]: http://h2.jaguarpaw.co.uk/posts/demystifying-dlist/
[1]: https://en.wikipedia.org/wiki/Joel_Spolsky#Schlemiel_the_Pai...
I can implement it in rust?
Useful list considering that feature: https://en.wikipedia.org/wiki/Category:Pattern_matching_prog...
Somewhat unrelated: Is there a book that walks you through building a database system from storage to queries, optimizer, execution, indexing, transactions, etc?
In the early '90s, Al Stevens wrote two books, C Database Development and C++ Database Development, with source code, which might be a good starting point.
Interesting suggestion! Here is the book on archive.org: https://archive.org/details/cdatabasedevelop00stev/mode/2up
Database Design and Implementation, ISBN 3030338355 ¹). Java source code for the SimpleDB system from the book available from the author's website ²).
transaction processing by gray (rip) and reuter was pretty close back in the 90s. i don't think it covered query optimization because it's really about tp monitors rather than databases, but, perhaps surprisingly, it does cover the other topics you're asking about
I have read the first few chapters, and it expects that you either read the accompanying source code or implement your own and pass the tests. The pseudo code presented in the book often looks like function calls whose inner details are not in the book. Furthermore, as already pointed out in another comment, the available implementation is in OCaml, which is probably not something many C programmers have experience with.
Nevertheless, I think I'm learning more from this book than most other books I've tried before that are far more theoretical or abstract. I'm still eager to reach the chapter on implementing C types. I think it's a good book, but it requires more effort than something like Crafting Interpreters or Writing a Compiler/Interpreter in Go, while also covering topics not in those books.
Nand2Tetris is also like that - they provide an outline and tests, but you have to do the work. And, having the implementation language be different from the target language reduces confusion.
Plus, you get to become proficient in OCaml, which is a pretty good language.
that's a good point—it was pretty confusing when i wrote ur-scheme in scheme, or stoneknifeforth in stoneknifeforth, because i kept getting confused about which level of the language i was changing things in
I thought this book looked neat but closed the tab before reading the comments here, and after this one decided to go ahead and buy it. Sounds really fun!
How does it compare with N.Wirth's?
https://onlinebooks.library.upenn.edu/webbin/book/lookupid?k...
The book is a very hands-on tutorial, whereas Wirth's is basic literature for the general case.
While they teach similar content, they have a different approach.
There are literally thousands of compiler design books out there, I don't really see anything particularly comparable between this book and Wirth's
Wirth's book does not implement a "real" programming language. Whatever your thoughts on Oberon and Pascal-like SHOUTCASE languages, it's largely irrelevant. Oberon is arguably a "real" language (and operating system), but Wirth's book does not cover the implementation of Oberon. It covers the implementation of Oberon0, an inarguably toy subset of Oberon. (Actually, "subset" is not even correct.) The example code has also diverged from the book, with Wirth abandoning the strategy described in the book for avoiding redundant initialization of the module static base, among other things.
Aside from that, I encourage everyone who cites Compiler Construction to actually work through the first 10% of the book and then count the number of errata.
Similar to studying OS concepts using Silberschatz's Operating System Concepts and Tanenbaum's Operating Systems: Design and Implementation. The former only explains the theoretical ideas, while the latter is the documentation of an implementation.
This looks cool; I've been interested in learning more about compilers since I did the basics in college. Lots of things seem to focus on making interpreters and never make it to the code generation part, so it's nice to see that this features information about that.
With no disrespect to the book that's the subject of this thread, as I haven't read it, Bob Nystrom's Crafting Interpreters [0] is a fantastic book. It covers all phases of compilation, including both an interpreter and a VM.
It's been covered on several threads here over the years [1].
[0]: https://craftinginterpreters.com/ [1]: https://hn.algolia.com/?q=crafting+interpreters
I remember seeing this a while back. That typesetting is beautiful. Thank you for bringing it up here, I might have to pick that one up.
I’ve been bored with building line-of-business applications, despite designing for complex requirements in high-volume distributed systems.
In fact I took a break from CS learning entirely 9 months ago. Even reading HN. I’ve been studying electronics and analog signal processing instead.
But now that I’ve built about 50 guitar pedals of increasing complexity, I feel ready to switch back to CS studies again.
This book covers compiling to assembly whereas Crafting Interpreters only has a bytecode VM implementation. We'll see how good this book is when it drops, but I think that's a worthwhile feature that Crafting Interpreters punted on.
Dropping this one here! (no affiliation)
https://www.linuxfromscratch.org/
"Linux From Scratch (LFS) is a project that provides you with step-by-step instructions for building your own custom Linux system, entirely from source code."
Why though? It doesn't seem to be related at all to the OP other than both are tutorial books?
This has nothing to do with the post?
In OCaml, interesting. I was similarly surprised when I learned that the first Rust compiler was written in OCaml, too https://users.rust-lang.org/t/understanding-how-the-rust-com...
Tree processing is best done in a language with decent algebraic datatypes and pattern matching. I would’ve preferred Standard ML, but, well, pot-ay-to, pot-ah-to. Haskell is another choice but the techniques you need to use there (while undeniably gaining you some possibilities) don’t really generalize to other languages, so you’re now writing a book about compiler construction in Haskell rather than just compiler construction. Ditto for Rust. Kotlin has deliberately anemic pattern matching. C# or F# leave you depending on Microsoft’s benevolence (sic). Metalua and Sweet.js both have decent ADT support but both are pretty much dead. Racket exists, I guess, and there are some pattern-matching libraries for normal Scheme as well, but the charisma malus of the parenthesis is real even if I don’t understand what causes it.
So OCaml was probably the most mainstream choice among the languages with appropriate tools, as funny as that sounds. And honestly, once you get over the syntax, it doesn’t actually have anything outrageous.
ML (short for "meta-language") was originally designed for use in programming language research, and really shines for that purpose. And OCaml is probably the most pragmatic dialect for the purpose.
SML is very dated and the standard library and ecosystem lack many things that are considered table stakes for a viable programming language nowadays. And F# and Scala are fine as enterprise languages, but being tied to .NET and Java respectively makes them less desirable for implementing a language that won't itself be coupled to one of those runtimes.
I took a compilers course in university, and the course culminated in having a compiler for C Minus (a subset of C). The professor noted how each year the line count of the compilers was dropping as students found libraries or languages that made it easier. I think the evolution was Java -> Antlr -> Python. I used OCaml and emitted LLVM and blew that metric out of the water.
Blew it out of the water with more or less lines of code? :)
Far fewer, to the point of another student asking me what I even did for the project because I didn't have to implement any of the algorithms.
I learned how to write a compiler by studying BYTE magazine in the '70s, which published the source to a complete Pascal compiler as an article!
https://archive.org/details/byte-magazine-1978-09 (part 1)
All 3 parts of Tiny Pascal:
https://albillo.hpcalc.org/publications/Easter%20Egg%20-%20T...
Thank you for sharing this, very useful. The BYTE magazine is absolutely amazing, it's a shame nothing similar could be done today.
The Byte magazine is incredible. First time reading it. The archive.org collection is a gold mine for learning. Thank you very much for posting it.
I uh misread the title and thought someone built a C compiler in Scratch.
On topic, though: wouldn't a simpler language (maybe even a pseudo language) be a better target for a first learning compiler? I understand they don't build a full C compiler, but still. It looks to me like there's a lot of complexity added by choosing such a lofty target.
What do you think would make a better target? C maps pretty closely to assembly, so it seems like it would be the simplest. Maybe Pascal or BASIC, but most people these days don’t have experience with Pascal, and BASIC would probably be too simple for a full-length book.
For writing an interpreter or transpiler, there are probably better options, but for a true compiler I can’t think of a better choice than C (or at least a subset of C).
chibicc[0] complements this book nicely; in addition to a basic compiler, it guides you through writing the preprocessor and driver, which, although not addressed much in the literature, are the missing link between the compiler built from the book and real C projects.
Thanks, I wish the companion book were ready!
What approach does this book take to error recovery?
Several "compiler light" style articles and books kind of walk over that part, and it can be non-trivial to do properly, especially with modern expectations.
I remember way back in the day, one of the early C compilers for the PDP, and, honestly, it could almost be argued that ed(1) had better error messages than what that thing produced.
A lot of simple compilers hit an error and just give up.
So, just curious what the approach was in this book.
I don't really need to know how to build a compiler, and I've got enough other "don't need but am doing out of curiosity" things going on that I don't need any more of those, but if it wasn't $70 I'd probably get it anyway. It would be interesting to compare it to the last building-a-compiler book I read back in the day and see how things have changed. Based on the comments here, a lot has changed.
That last book was Allen Holub's "Compiler Design in C", which is from 1990. Here's how the blurb on the back describes it:
Allen I. Holub's Compiler Design in C offers a comprehensive, new approach to compilers that proves to be more accessible to computer science students than the other strictly mathematical books.
With this method in mind, the book features three major aspects:
(1) The author develops fully functional versions of lex and yacc (tools available in the UNIX® operating system to write compilers), (2) he uses lex and yacc to develop a complete C compiler that includes parts of C that are normally left out of compiler design books (eg., the complete C "type" system, and structures), and (3) the version of yacc developed here improves on the UNIX version of yacc in two ways (error recovery and the parser, which automatically produces a window-oriented debugging environment in which the parse and value stacks are visible).
It's out of print, but the author has made a searchable PDF available on his website [1]. I found it quite useful.
Holub seems to like the "learn by doing" approach. He's got another book, "Holub on Patterns" that teaches all the design patterns from the gang of four book organically by developing two programs that together use all of those patterns. The two programs are an embedded SQL interpreter and a GUI application for Conway's Game of Life.
PS: Ooh. It occurred to me that No Starch Press books are often available on O'Reilly Learning. I checked and this one is there. So I guess it is going on my "don't need but am doing out of curiosity" pile after all.
I’ve been working through this book implementing the compiler in Ada. So far, I’m really enjoying it. The book doesn’t make too many assumptions about implementation details, leaving you free to experiment and fill in the blanks yourself.
It feels like a more advanced version of Crafting Interpreters.
I haven’t looked at the OCaml implementation at all. The text and unit tests are all you need.
Discussion on the Ada Forum: https://forum.ada-lang.io/t/writing-a-c-compiler/1024
I’m working through this book now and really enjoying it!
Each chapter of the book includes a test suite to run against the code you’ve written.
In some ways, the tests in this book feel very similar to the labs in the book Computer Systems: A Programmer's Perspective — which is high praise!
cool, remember some tutorials online i think from the same author (not 100% sure) doing stuff around c compilation in python. shame it's not in a language i want to learn. the other book on compilers i got is almost too heavy to lift! :D
It also will be available via Amazon after August 20, 2024.
https://www.amazon.com/Writing-Compiler-Programming-Language...
Parser generators were always academic projects that had little relevance to making real-world programming languages -- where parsing is very easy to write by hand, and you necessarily benefit from doing so (i.e., you can get better error handling, etc.).
Today most languages are front-ends for LLVM IR, but LLVM is very slow and takes a long time to optimize. Many new languages target x86/arm directly with their own weakly optimized backends, and output an LLVM IR for "release builds".
When your computer was anemic, and could barely do the tasks required for it, eking out a few percent — or a 2x! — from an optimizer was important.
Nowadays, the difference between "big compiler optimized" and "little compiler not optimized" can be quite dramatic, but it is probably no more than 4x — certainly within range of the distinction between "systems programming language" and "highly tuned JITted scripting language". I think most people are perfectly fine with the performance of highly tuned scripting languages. The result is that all of the overhead of "big compiler" is just ... immaterial overhead. This is especially true for the case of extremely well-tuned code, where the algorithm and — as a last resort — assembly will easily beat the best optimizer by at least an order of magnitude, or more.
Wouldn’t just a simple case of bad code generation render little compiler into a toy one?
Repeating others: today’s compilers are really just “optimizing compilers”; there is no room for toying in production environments.
If you find some time to go through the gcc bugzilla, you'll find shockingly simple snippets of code that get miscompiled (often by optimization passes), with fixes never backported to older versions that production environments like RHEL are using.
Ok, but that’s a software engineering issue.
I still insist that a production-grade compiler can’t leave performance on the table. Which is where the current battlefield is.
I think a production grade compiler not only can, but must, leave performance on the table when the cost is correctness (unless the performance gain is incredibly high and the correctness loss is minimal). Correctness is not all important, but it is the most important thing. Unfortunately, compiler writers do not agree and they do silly things like "let's assume UB cannot ever happen and optimize based on that".
Correctness is already a must; how did you arrive at this?
if only the gcc and llvm maintainers, and c standard authors, agreed with you
I mean these compilers build all these SW stacks, even this very browser I’m using, where are these correctness issues you are talking about?
there are new items in this category every day, but https://blog.cr.yp.to/20240803-clang.html is noteworthy
Eh https://news.ycombinator.com/item?id=41146860
yes, that
I do not agree in the general case. There are very useful DSL compilers which do not consider performance at all, but just compile to a target which does the optimization for them (JVM, LLVM IR or even just C)
Yes but those are called transpilers, right?
There is no fundamental difference between a compiler, transpiler and interpreter.
The techniques employed are very similar
I realized that with all the RHEL systems I’m using, we never use the default toolchains on them. We just use those old systems to run stuff, even with newer toolchains.
if you aren't running on the gpu you're leaving 80+% of your computer's performance on the table. no optimizing compiler is going to make your legacy c or lisp or rust code run efficiently on the gpu, or even in most cases on a multicore cpu. nor, as thechao points out, can it compete with assembly-language programmers for simd vectorization on the cpu
in summary, optimizing compilers for c or pascal or zig or rust or whatever can only be used for code where considerations like compatibility, ease of programming, security, and predictability are more important than performance
probably the vast majority of production code is already in python and javascript, which don't even have reasonable compilers at all
How come you made this about computer’s performance :)
because computer performance is the topic of https://news.ycombinator.com/item?id=41255695, https://news.ycombinator.com/item?id=41255365, and https://news.ycombinator.com/item?id=41255195, the three levels of parent comments in the thread, one of which was by you
No.
The main post is about a C compiler, and the post I responded to is saying
which I could only interpret as a machine that can run a single process at a time, so not really about gpus or what.
yacc/bison was used for lex, bc, pcc, gcc, original awk, the bsd pascal compiler, eqn, m4 (!), and many other languages. it's still used for pcc, oawk, mawk, pari/gp, and units. that's just what i have sitting around in my downloads directory
and, while we're talking about ocaml, ocaml does use ocamllex and ocamlyacc for its own parser
so, while you can certainly do without parser generators, they have very commonly been used for making real-world programming languages. almost every programming language anyone here has ever heard of was first implemented with a parser generator. the main exceptions are probably fortran, cobol, algol, lisps, c, and pascal
I guess I wasn't very clear. I didn't mean to say that, as a historical matter, they were irrelevant.
I meant to say the idea of a parser generator is a solution to a problem that real-world langs don't really have. When writing a programming language, your issue isn't how much time the parser is going to take to write, or how complex it's going to be. The parser is a relatively trivial part of the problem.
Due to language designers often being taught to develop languages in this fashion, many have relied on these tools. But given the modern view of compilers as "programming language UIs" and the focus on DX, I'd argue it's actively pathological to use a parser generator.
Much academic work has, until recently, focused on these areas -- whereas today, the bulk of the difficulty is in understanding SSA/LLVM/ARM/basic optimizations/etc. details which are "boring, circumstantial" etc. and not really research projects. I was just pointing this out since a lot of people, myself included, go down the EBNF parser-generator rabbit hole and think inventing a language is some formal exercise -- when the reality is the opposite: it's standard programming-engineering work.
The one thing parser generators do is that they ensure that the language you implement actually matches the grammar you wrote. That’s still an important assurance to have.
I've created many programming languages. All the ones I finished and were useful did not have grammars that "I wrote".
Are you saying your programming languages don’t have a defined grammar?
The parser defines the grammar. This is quite common in mainstream languages -- iirc, only after some years did python get a formal description of a grammar.
i was puzzled about that too and got even more puzzled when i looked at the languages i could find that he's implemented
do you mean https://github.com/mjburgess/Lyssa and https://github.com/mjburgess/Quazar? it's true that i can't find a grammar in either of them (https://github.com/mjburgess/Lyssa/blob/master/src/impl.py#L... seems more forthy than anything else) but (while they are very much the sort of things that i like, thank you for sharing) they also seem somewhat less like 'real-world programming languages' than things like awk, ocaml, or our other example in this thread, gcc c and objective-c until gcc replaced the bison parser with a handwritten one in 02006. that last compiler was the compiler nextstep was built on, which got steve jobs back in control of apple, to replace macos with nextstep. seems pretty real-world to me
maybe you're talking about stuff you haven't released?
Indeed, that's a defunct profile where everything should be private anyway. The repos there are 13/14 years old: these were experiments with using RPython to create languages, I'd guess when I was ~20. The point of those was to profile RPython. I have created real front-ends and compiler backends in C for non-trivial languages.
I will soon likely create a probabilistic programming language and compiler.
they also tell you where the grammar is ambiguous, which can be useful
oh, i agree that parsing is not the hard part of writing a compiler, and that compilers classes overemphasize it
but no language starts out as a 'real world lang'; every language is initially a toy language, and only becomes a 'real world lang' in the unlikely case that it turns out to be useful. and parser generators are very useful for creating toy languages. that's why virtually every real world lang you've ever used was implemented first using a parser generator, even if the parser you're using for it now is handwritten
having a formally defined grammar is also very helpful for dx features like syntax highlighting and automated refactoring
This still isn't quite correct. Back in the day parsing was a much larger portion of the complexity of a compiler: performance was much more of a concern, as was memory usage. Parser generators were an attempt at helping with that, by allowing programmers to produce more efficient (e.g. table-driven) parsers than what they could have otherwise written by hand. They only really went out of fashion because A) computers got faster and bigger faster than programs got longer, so parsing became less and less of a percentage of total utilization, and B) you can get much better error messages out of recursive-descent parsers.
Fortran compilers did go through a phase when table-driven parsers were used, but it had the disadvantage of needing complicated lexers and statement classifiers that rely on semantic information. Fortran’s a hard language to parse, given its lack of reserved words, optional spaces, and many ambiguities.
The f18 compiler’s parser uses parser combinations to construct a backtracking recursive descent parser that builds a parse tree for the whole source file before doing any semantic analysis. This approach allows good error recovery without having to revert any updates to the symbol table.
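For anyone unfamiliar with the style, here is a tiny OCaml sketch of parser combinators (purely illustrative; f18 itself is written in C++): a parser is a function from an input position to an optional result plus a new position, and backtracking falls out of retrying alternatives from the same position:

    (* a parser consumes input from position i; None means "didn't match" *)
    type 'a parser = string -> int -> ('a * int) option

    let char_p (c : char) : char parser =
      fun s i -> if i < String.length s && s.[i] = c then Some (c, i + 1) else None

    (* alternation: try p; if it fails, backtrack and try q from the same i *)
    let ( <|> ) (p : 'a parser) (q : 'a parser) : 'a parser =
      fun s i -> match p s i with Some _ as ok -> ok | None -> q s i

    (* sequencing: run p, then build the next parser from its result *)
    let ( >>= ) (p : 'a parser) (f : 'a -> 'b parser) : 'b parser =
      fun s i -> match p s i with Some (x, j) -> f x s j | None -> None

    (* e.g., a sign followed by 'x': accepts "+x" and "-x" *)
    let sign_x : char parser =
      (char_p '+' <|> char_p '-') >>= fun sign ->
      char_p 'x' >>= fun _ ->
      fun _s i -> Some (sign, i)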
that's an interesting approach! though probably not applicable to c and c++
i assume that by 'parser combinations' you mean parser combinators
what i meant about fortran is that the first fortran compiler didn't use a parser generator
Could you give some specific examples of those new languages with their own backends for faster builds?
I was immediately thinking of Jai, Zig, .. but I've seen a few
Correct me if I'm wrong, but I think both Zig and Jai use LLVM as their default backend...at least, that's what I have seen via live streaming for Jai, and from Zig's repo.
Zig is moving away from LLVM (https://github.com/ziglang/zig/issues/16270) and Rust has added Cranelift as a debug backend (https://lwn.net/Articles/964735/).
Not sure about Jai.
I don't think it's moving away, because if you have read the entire thread, the gaming community reacts negatively to say the least, and they assure everyone that LLVM is not going anywhere any time soon.
Were they? GCC abandoned bison in favour of their own parser relatively recently.
"relatively recently" as in around 20 years ago, in GCC 3.x according to sources I found.
gcc was first released in 01987, but it didn't replace its bison parser for c until gcc 4.1 https://gcc.gnu.org/gcc-4.1/changes.html which was in 02006 https://gcc.gnu.org/releases.html, only 18 years ago, and 19 years after it was released. joseph myers first proposed doing this in 02004 https://gcc.gnu.org/legacy-ml/gcc-patches/2004-10/msg01969.h...
so gcc has literally been using a parser-generator-generated parser for c for more than half its existence, at which point it had already become the most popular c compiler across the unix world and basically the only one used for linux, which had already mostly annihilated the proprietary unix market. it was also imposingly dominant in the embedded space
and i think that kind of development path is fairly typical; getting a parser up and running is easier with a parser generator, but it can be tricky to get it to do strange things when that's what you want (especially with lalr, less so with peg)
There are actually quite a few changes!
The most obvious change you'll see is the use of SSA, which has become the dominant representation in IR starting 25-30 years ago.
There's also been an increase in the importance of compiler IRs, and especially the concept of code passing through multiple IRs before reaching machine code.
Formal semantics has become more of a thing in the past decade or so. It's now routine that even weak memory models have a detailed formal model of how they work. In LLVM, it's now a requirement that you demonstrate formal correctness of new InstCombine transformations (which are essentially peephole optimizations).
The use of parser generators has really fallen into disrepute; everything has transitioned to handrolled parsers these days. Language standards themselves are starting to rely on context-sensitive keywords, which are hard to implement in a generator-based lexer/parser setup.
Optimizations have generally broadened in scope; we're now seeing whole-function level of optimization being the default way to look at stuff for analysis, and there's been a variety of techniques introduced to make whole-program optimization (aka LTO) much more tractable.
Another subtle but major shift is that compilers are increasingly reliant on inferred analysis from dumber representations over the original declared intent of the code. For example, SROA in LLVM (which breaks up structs so that individual fields can be independently allocated in registers) relies not on looking at the way the struct is declared but the offsets within the struct that various load and store operations use.
A final major shift is the trend towards support for ancillary tooling in the programming language space, so that the compiler isn't merely a tool that goes from source to object code, but is something that can be actively queried for information about the source code. Things like the language server, or smart formatting, or automatic refactoring tooling.
This is a fantastic comment, thanks.
The book doesn't have 2024 in the title. I suspect they put it there because last time a post about this book was made, I noted that it was from 2022, not realizing that the book has now been released in 2024.
https://news.ycombinator.com/item?id=40940799
As far as I can tell, the main difference is that static single assignment (SSA) as an intermediate form was not the norm 30 years ago, but it is nowadays. Also, in newer books, it's more common to go over global register allocation now, whether that's graph coloring or linear scan register allocation. If you read old compiler books, the main optimizations they talk about are use-def chains, moving computations out of loops, and using the local and tree-based Sethi-Ullman register allocation algorithm.
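For readers who haven't seen it: in SSA form every variable is assigned exactly once, and phi nodes merge the values arriving from different predecessor blocks. A rough OCaml sketch of what such an IR looks like as a datatype (names invented for illustration):

    type operand = Const of int | Var of string  (* each Var is assigned once *)

    type instr =
      | Copy of string * operand                  (* x1 <- op *)
      | Add  of string * operand * operand        (* x2 <- a + b *)
      | Phi  of string * (string * operand) list  (* x <- phi [(pred block, op); ...] *)

    (* "x = x + 1" inside a loop becomes, roughly:
         x2 <- phi [("entry", x1); ("body", x3)]
         x3 <- x2 + 1 *)
    let loop_body =
      [ Phi ("x2", [ ("entry", Var "x1"); ("body", Var "x3") ]);
        Add ("x3", Var "x2", Const 1) ]

Making every definition explicit and unique like this is what most modern optimization passes lean on, since each name stands for a single value rather than a mutable slot.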
ANTLR generates a visitor system from a BNF grammar, and you implement the logic in your chosen language; I believe this is similar to C++'s std::visit. lex and yacc generate state machines from a BNF grammar, and you implement the logic in the tools themselves.
Compiling today might be done at run time, exploiting dynamic information. The line between a compiler and an interpreter is blurred.
You can use the "classic" tool set and parse many programming languages. The real trick in writing compilers is taking advantage of the hardware (disclaimer: I designed and wrote middle-pass portions for IBM 360/370 processors (and clones), supercomputers from Cray and ETA Systems, and Sun 4 workstations among others, including some RISC systems that just disappeared around the bursting of the dot-com bubble) I tried and failed to optimize MPP systems from all of the "name" players in the 1990's. It kind-of broke my heart...
That "middle-pass" approach that will let you address many targets is still valid; the trick is finding a sufficiently robust and flexible internal representation at the right level. You also have to be able to out-guess the chip vendors where before you could go to the architect or a complete "System" book and get the real scoop, including things you shouldn't do. Oddly enough, there is simultaneously useful and completely worthless documentation scattered about the internet.
You might want to take a look at Muchnick and Jones' _Program_Flow_Analysis_ (yes, it's from 1981) but chapters 4-6 can be applied at code-generation time. How that fits modern Intel processors (for example) is unknown. Idealizing your processor as a RISC-V might be a reasonable way to proceed but in the end, you'll have to translate the code for the target -- it will be reasonably straight-forward if you drive it all from tables but it's not trivial.