Parse, Don't Validate (2019)

dgb23
38 replies
10h27m

This is very good advice and a great article. It comes up on this site now and then because of it.

For those who don't necessarily program in statically typed functional languages:

The idea transcends paradigms.

You'll find very similar notions in 80's/90's OO literature, for example in Design by Contract. I'm sure one can dig deeper and find papers, discussions and specifications that go further back.

I think TypeScript is often written in such a way that you refine the types at runtime. I assume Design by Contract has influenced Clojure's spec (Clojure is a dynamic language).

Fundamentally this is about assumptions and guarantees (or requiring and providing). Once an assumption is checked and guarantees can be made, then other parts of the program don't need to check overlapping assumptions again.

In fact I think one of the most confusing things when you read code is seeing already guaranteed properties being checked again somewhere else. It makes code harder to reason about and improve.

atoav
25 replies
9h49m

When you have a language with a strong type system, this is one of the practical things that ultimately gives you freedom as your program gets bigger and more complex.

But you have to use it. E.g. by having a class UncheckedEmail, a class ValidEmail and a class VerifiedEmail, and ensuring that the conversion from one to the other has to involve your email-verification process.

That way you never have to guess whether the email address is unchecked, valid or verified, and there is no need for "is_email_verified" booleans that you may or may not forget to update/check. If you use the wrong thing in the wrong place your type checker yells at you, while you can focus on actually important stuff.
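
A rough TypeScript sketch of what I mean (the names and the checks are made up for illustration):

    // only the conversion functions below can produce the "later" types
    class UncheckedEmail {
      readonly kind = "unchecked"
      constructor(readonly value: string) {}
    }

    class ValidEmail {
      readonly kind = "valid" // distinct tag, so structural typing can't mix these up
      private constructor(readonly value: string) {}
      // stand-in syntactic check; the real validation goes here
      static fromUnchecked(e: UncheckedEmail): ValidEmail | null {
        return /^[^@\s]+@[^@\s]+$/.test(e.value) ? new ValidEmail(e.value) : null
      }
    }

    class VerifiedEmail {
      readonly kind = "verified"
      private constructor(readonly value: string) {}
      // only the verification flow (user clicked the link) should call this
      static fromValid(e: ValidEmail): VerifiedEmail {
        return new VerifiedEmail(e.value)
      }
    }

    // a function that needs a verified address says so in its signature
    function sendInvoice(to: VerifiedEmail) { console.log("sending to", to.value) }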

mrkeen
20 replies
9h25m

Better yet, skip the Valid* classes and only make classes for valid objects.

'Int' is fine. We don't need ActualInt and IPromiseItsReallyAnInt.

noelwelsh
15 replies
9h1m

I think you've really missed the point.

bvrmn
9 replies
8h25m

I'm curious what a real example with different Email classes would look like. It seems silly. You don't often need a VerifiedEmail. A VerifiedUser is more useful than a VerifiedEmail, and you could get an implicitly verified email from it without introducing a new type.

noelwelsh
2 replies
8h17m

Abstracting the email example, if you have a finite state machine that verifies some fact, you can model each state as a type. Their example goes Email -> ValidEmail -> VerifiedEmail. You could equally go String -> Email -> VerifiedEmail, where String is what the user submits as their email address. If you are verifying emails in the context of verifying users you could go Map<String, String> -> User -> VerifiedUser, where Map<String, String> is what you receive from an HTML form submission.

Getting hung up on the details of which type you use is missing the overall point.

bvrmn
0 replies
6h22m

The state machine is always the trivial part. Email verification is a stupidly simple task. But User propagation through the system, and the actions available based on a verified email, are not.

atoav
0 replies
7h10m

Good point, the names really don't matter and the first type could as well have been a string. I just wanted to illustrate the concept of using the type system to add important information and guarantees to certain variables.

Your state machine analogy is a good one, because the conversion between different types is a bit like the transitions between states: you have to make explicit how exactly they are meant to happen (or whether they are actually possible).

This is a good thing: having fewer degrees of freedom may seem like it makes coding harder, but in fact it lets you reason better about what the system is doing at any given point in your code.

codetrotter
2 replies
8h2m

You don't often need a VerifiedEmail. A VerifiedUser is more useful than a VerifiedEmail, and you could get an implicitly verified email from it without introducing a new type.

Be careful about how the email is implicitly verified though, and how different login methods and assumptions interact when for example migrating users from one system or platform to another.

https://krebsonsecurity.com/2024/07/researchers-weak-securit...

analysis released by security experts at Metamask and Paradigm finds the most likely explanation for what happened is that Squarespace assumed all users migrating from Google Domains would select the social login options — such “Continue with Google” or “Continue with Apple” — as opposed to the “Continue with email” choice

Squarespace never accounted for the possibility that a threat actor might sign up for an account using an email associated with a recently-migrated domain before the legitimate email holder created the account themselves

since there’s no password on the account, it just shoots them to the ‘create password for your new account’ flow. And since the account is half-initialized on the backend, they now have access to the domain in question

This kind of mistake makes me think that it’s better to be super explicit about the state of everything when it comes to accounts.

So even if it may seem excessive to have VerifiedEmail as a type instead of marking the account as verified or not, I will prefer being explicit.

And in medium to big size systems with multiple pieces of user account data that require separately keeping track of verification state it will be necessary anyway. For example even something as simple as having one email and one phone number associated with a user. Or beyond that one user having multiple email addresses or multiple phone numbers etc.

bvrmn
0 replies
6h25m

So in the end users were in state "not verified" after migration. Seems email is a secondary property.

atoav
0 replies
7h16m

Thank you for bringing up this excellent display of the hidden dangers lurking behind my very crude example of why type systems should be used if they are there.

My point was specifically about complexity. Using a string for all kinds of emails of course works, and tying guarantees to other things like VerifiedUser works as well. These implicit guarantees just fail to save your ass once things get more complex and there is an edge case you didn't think of.

TeMPOraL
2 replies
8h17m

But you don't want "implicitly verified", when there exist any code to which the difference between "verified" and "not verified" matters in some way.

In your own example, if you happen to use e-mail verification as proxy for user verification, then the very code that creates VerifiedUser instances would want to have VerifiedEmail as input!

bvrmn
1 replies
6h27m

You don't need VerifiedUser at the moment the user clicks the verification link. It's a state-changing action. You get the verified user from the db and work with it later.

TeMPOraL
0 replies
42m

Fine. It still has merit that the code pulling verified user data from the verified users table returns VerifiedUser objects, so the code processing them is clear about its assumptions and won't accidentally get reused or refactored into UnverifiedUser flows.

mrkeen
4 replies
8h11m

I regret shooting from the hip, and answering the comment directly rather than tying it back to the article.

String->(valid)Email is 'Parse', (invalid)Email->ValidEmail is 'Validate'.

That way you never have to guess whether the email address is unchecked

Now you just get to guess whether, for every class, there is actually a ValidClass that you were supposed to be using, instead of Class.

  authenticate(User user);
^ This is a bug. We just let a malicious User object in because we invested extra time, effort and sloc into making both a User and a ValidUser class, disregarding the first "simple idea" from the article:

> 1. Use a data structure that makes illegal states unrepresentable

Or as I wrote:

only make classes for valid objects
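
In sketch form (TypeScript, names invented): there is no User/ValidUser split, because the only way to obtain a User at all is for the parse to succeed.

    class User {
      readonly kind = "user"
      private constructor(readonly name: string) {}
      // the single entry point: raw input in, a User out only if it checks out
      static parse(raw: { name?: string }): User | null {
        return raw.name ? new User(raw.name) : null
      }
    }

    // no unvalidated variant exists for this to accidentally accept
    function authenticate(user: User) { console.log("authenticating", user.name) }
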
noelwelsh
3 replies
8h5m

Now you just get to guess whether, for every class, there is actually a ValidClass that you were supposed to be using, instead of Class.

The type system stops this. You can't pass, e.g., a User to authenticate, because it only accepts a ValidUser.

only make classes for valid objects

Using primitives (String, Int, etc.) for unvalidated input, and custom classes for validated input is fine in many cases. However, sometimes you need to represent data that is in the process of being validated (e.g. when validation takes time, like waiting for a user to validate their email address) and then these intermediate classes arrive.

szundi
1 replies
5h34m

But your colleague just implemented it with User - because he could

noelwelsh
0 replies
3h58m

No tool can fully protect against user incompetence (pun intended). You can have the fanciest type system in the world and still implement everything as a String, for example.

fiddlerwoaroof
0 replies
4h57m

I think, for most cases with partial validation, you should parse the fields first into valid types for those fields and store them in a hashmap or something. When all the fields have been collected, parse the hashmap into your aggregate type.
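
For instance, a rough TypeScript sketch (field names made up): each field is parsed into its valid type as it arrives, the valid pieces accumulate in a partial record, and the aggregate is parsed out of that record once everything is present.

    type Email = { address: string }   // produced by a field-level parser
    type Age = { years: number }       // likewise

    type Draft = Partial<{ email: Email; age: Age }>
    type Registration = { email: Email; age: Age }

    // the "parse the hashmap" step: only succeeds once all fields have been collected
    function toRegistration(draft: Draft): Registration | null {
      return draft.email && draft.age ? { email: draft.email, age: draft.age } : null
    }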

bell-cot
1 replies
8h17m

Ints are a very poor analogy.

Somewhat less bad would be IEEE 754 floating point numbers - where your floating-point "numbers" can include both +0 and -0, sub-normal numbers, infinities, NaNs, and other miseries.

mrkeen
0 replies
8h9m

Ints are a very poor analogy.

Then program using very poor analogies, like Int.

atoav
0 replies
7h24m

This was of course an example to illustrate the concept, so whatever makes sense in reality will differ depending on the use case.

I assumed there is a typical registration process. Future users of your application input text into a field that is meant to be a reachable email address. So the first step would be to filter out garbage or typoed text that your mail-sending function would not be able to handle. And because you don't know how long people will take to click the link in their mail, you need to store and work with that validated-but-unverified email address till they do. And for this, having your own type that cannot be mixed up with the verified email addresses can make sense, depending on how complex your application is.

DarkNova6
0 replies
5h38m

The idea is taken even further in DDD. A core idea is to allow the creation of data graphs which are always in a valid business state.

These so-called aggregates can only be created from a factory to avoid bad initialization. This way you never need to make ad hoc validations to be “really, really, really sure”.

sesm
3 replies
7h56m

Wouldn't UncheckedEmail just be a string?

atoav
1 replies
7h33m

All of these are just strings. But if you have a function that requires that string to be of type VerifiedMail, your type checker or compiler will either warn you about the misuse or not even let you compile the code (like in Rust).

This is good, because you as the programmer can rely on the fact that wherever you see a string of type VerifiedMail, it is indeed a string containing a verified email address. You don't even need to check, because you know the conversion between the different types had to be done explicitly.

You can of course extend the whole thing and have an OnboardUser with a ValidatedMail, and only convert the OnboardUser into an actual User once there is a VerifiedMail, etc.

You get the idea. Whenever you find yourself wondering if a variable is actually holding the expected information, it is a good idea to leverage the type system to replace wondering with knowing.

But you are right: unchecked email could as well just be a string.

HelloNurse
0 replies
6h49m

In context, there's presumably a form in a web page containing a claimed email address and other personal data that "evolves" a bunch of strings into valid Email, Address, Name etc. objects or into a collection of form fields with a list of error messages attached, depending on how validation goes.

dwattttt
0 replies
7h50m

In representation, yes. But if you have UncheckedEmail, you can't accidentally validate a Username, or use an UncheckedEmail as a Password.

paganel
6 replies
9h30m

is seeing already guaranteed properties being checked again somewhere else

Because at some point those "already guaranteed properties" might "disappear"; to be more exact, the process/procedure that implements and enforces them might not do its thing anymore for one reason or another.

When that happens (and statistically speaking, it will happen), all the other processes/scripts/pieces of code depending on that "original" validating process will be in a very rough place.

TeMPOraL
4 replies
8h35m

Yes, and that's what the approach from the article is trying to help with. If you're coding in a statically typed language, the "already guaranteed properties" are propagated by types, so they can't just "disappear" - there's no way to get from e.g. String to Username without going through a String->Username function that does the checks, so functions accepting Username can rely on it having the relevant properties.

Of course, this stops being true when you start hacking around the checks, side-stepping the parsers entirely (say, type-casting String to Username without any check). There's nothing a language can do to stop you when you really want to do this[0] - but then, you're an adult; if you see restrictions and safety interlocks, and put effort to hack around them, then any problem is really on you.

This gets more difficult in dynamically typed languages, as you have to rely more on naming conventions and programmers not being idiots, but even without a typechecker, it will be rather obvious when you're doing something that could break the chain of guarantees.

--

[0] - Some can try; I've seen a Haskell paper about this idea that does very complex type magic to try and truly ensure that only the parsing function can actually construct the result type. I tried reproducing that in C++ once, but C++ just can't give such guarantees.

kitd
2 replies
6h47m

Any language that allows private constructors makes this pretty easy, no?
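
For example, a minimal TypeScript sketch (names invented; in a nominally typed language the private constructor alone is enough, TypeScript also wants a tag because it's structural):

    class Username {
      readonly kind = "username" // tag to keep structural typing honest
      private constructor(readonly value: string) {}
      // the private constructor means parse() is the only way to obtain a Username
      static parse(raw: string): Username | null {
        return /^[a-z0-9_]{3,16}$/.test(raw) ? new Username(raw) : null
      }
    }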

dgb23
1 replies
5h57m

The constructor doesn't have to be private, it just has to be a constructor.

For example take this library: https://github.com/google/uuid

It represents UUIDs as `type UUID [16]byte`.

It would be trivial to circumvent a constructor: `var uuid UUID`. Go values are zero initialized.

(Aside: You might actually have a reason to do this, but often you would rather use https://pkg.go.dev/github.com/google/uuid#NullUUID).

But it's obvious that you would use one of the constructors in the library to get an actual valid UUID.

TeMPOraL
0 replies
38m

In C++ it's less obvious, so while a private constructor does most of the job, it's still tricky to come up with a way of designing those types so they feel natural, are easy to construct in the right way (via appropriate parsers), are impossible to construct the wrong way by accident, fulfill all the roles people expect, and don't devolve into an impenetrable mess of template metaprogramming that can only be debugged by the author of the type-definition framework.

ChrisSD
0 replies
8h15m

Some can try

Rust does this with standard types even. "Parse, don't validate" is the answer to the question: why does Rust have so many string types? Of course you can hack around this using `unsafe` but, as you say, in that case you have to put visible effort into deliberately subverting the type system.

dgb23
0 replies
6h5m

I understand where it comes from, but I can't ignore the feeling that this is anxiety driven development.

Sure, the farther away from your control and trust circle, the more you're inclined to check assumptions that are already guaranteed. It's a valid reason to do this as you explained, but I think the tradeoff has to be considered:

Generally speaking, checking assumptions that are already guaranteed is a _bug_. It violates the DRY principle[0], and it will break your program when those guarantees relax or the assumptions become tighter in one place or another, because they diverge.

And again, it can be confusing for maintainers and makes it harder to reason about a program and make sensible changes. The anxiety that drives the checks will leak right into the reader who might be wary of breaking Chesterton's Fence. Now you have someone testing everything in order to figure out if there are code paths that only hit one or the other of the checking code and stuff like that.

Needless to say it can also make performance worse, because you're doing more work than needed, especially if the checks require fetching data from disk or the network etc. This type of performance degradation is quite common.

[0] The real/actual one, not the one where people factor out superficially repetitive code.

Sakos
4 replies
10h6m

I've been going through the comments on previous posts. I think one of the biggest issues with the article is the title. It seems to act as an anchor for a lot of people to the point where they'll argue against things that aren't in the article, just implied by the title (without context). So there'll be people arguing that she doesn't want to validate at all and just wants to parse, when really the article is about where you validate your data (and what you do with the result). It is not about getting rid of all validation.

blowski
3 replies
9h54m

This is true of all famous essays. People remove the nuance included in the body and over-apply the title without really understanding it.

For example "Goto considered harmful". I remember working with a very good programmer who'd used a "goto", and a much less senior one[1] rejected their PR by linking to the article.

[1] I'm ashamed to say it was me, a long time ago.

ffsm8
2 replies
7h0m

That's precisely why I love to only comment on things with questions, e.g. "isn't goto considered harmful? Is this really the best way?".

It gives the person making the change the opportunity to give context for their decision - and either change or spell out their reasoning for it, potentially giving me an opportunity to learn from them.

Another good benefit is that it doesn't "attack" the PR author, in the "non-violent communication" sense.

blowski
1 replies
6h50m

Doesn't that seem a bit passive aggressive? See what I did there?

ffsm8
0 replies
3h29m

Surely it depends on both your phrasing and how you usually interact with your colleagues, no?

sriram_malhar
8 replies
9h7m

Can someone steeped in type theory explain the following:

I never understood why the default for [a] means that the list could be empty. If ...

    foo: [a] -> a
... foo is supposed to get a list of a's, it should get a list of a's, with at least one a. If the list can be empty, then explicitly annotate it so:

    foo: [a*] -> a
One way or the other one has to deal with the empty list explicitly (in the signature). If you allow empty lists, it will have to return a 'Maybe a'. It seems to me that it just makes processing the result easier in the common case if the input were to be constrained.

noelwelsh
4 replies
9h2m

The definition of a list is that it contains zero or more elements. There's nothing deep about it; it's just the way the type is defined (and always has been, back to the early days of Lisp from which functional programming evolved.)

sriram_malhar
1 replies
7h11m

Yes, that's true. But Lisp is untyped.

The notation [a] could just as easily have meant a non-empty list.

HelloNurse
0 replies
6h38m

It is necessary to represent the empty list whose elements (if it had any) are type A, and its type must be the same as a nonempty list of A, because concatenating the two must be a well-defined and representable operation while concatenating empty lists of different types must be as wrong as concatenating nonempty ones. Moreover, in practice most variables, function parameters and the like should be declared as a possibly empty list of A, without special cases.

TeMPOraL
1 replies
8h30m

Conceptually, I see this as seeing a list as its own thing, independent of its contents. That is, a List<Int> is a value in itself, that can exist even if no instance of an Int exists anywhere in the program; it'll just be a List of Int that happens to have zero Ints in it.

Probably makes more sense when you come from imperative programming background, where List<Int> is a piece of mutable state that you can construct and then fill, in two explicitly separate steps (and then possibly empty it again in yet another step). Then again, I believe even mathematicians are fine with ideas of an empty set, or of a one-element set being distinct from the element itself.

nyssos
0 replies
7h58m

Then again, I believe even mathematicians are fine with ideas of an empty set, or of a one-element set being distinct from the element itself.

Especially mathematicians. Distinguishing stuff and structure is a common theme throughout mathematics.

zeendo
0 replies
5h18m

Haskell has this simply for historical reasons. It's a wart and we wish it weren't so but changing it would break a lot of legacy code (and probably have adverse performance impacts in some cases) so it remains. So as it is today, [a] indeed means it can be empty or not. And a function signature of "[a] -> a" is essentially an unsafe partial function.

That isn't to say you _have_ to write it this way for new things and there are packages like https://hackage.haskell.org/package/safe which provide non-partial/safe versions of the various unsafe base functions (like head, maximum and friends).

The base package in Haskell also includes NonEmpty, which is probably what you want in a lot of cases anyway.

layer8
0 replies
7h22m

The type of elements and the number of elements are two orthogonal aspects. You might need at least one element, or you might need at least two elements, or you might need an even number of elements, and so on. This is independent from whether you need the elements to be of type a (or some other condition on the element types). Ideally, all these conditions could be expressed in the type system. But the requirement "all elements must be of type a" by itself does not imply "there must be at least one element", nor does it necessarily imply having to special-case the empty list.

brandonspark
0 replies
8h51m

There's no reason this couldn't be done. Indeed, in OCaml (which I am more familiar with), you could easily define:

```ocaml

type 'a nonempty = Single of 'a | Cons of 'a * 'a nonempty

```

This would be the type of lists that contain one or more elements of the type parameter.

I think it's just convention that typically, when we talk about lists, we are interested in the empty case as well. Finding "all X that satisfy P in Y", as a general computational problem, is _very_ common (consider: filtering a list, querying for a predicate in a collection, finding sequences of moves in a search space), and generally could result in an empty list as a possible output.

In a non-practical sense, if you want the type theory, another reason is you can think of `[a]` as the free monoid on the collection of `a`. In other words, strings of elements of `a`, joined via concatenation. This monoid requires a unit, which is the empty list.

PreInternet01
6 replies
10h32m

(2019), but still good-ish advice. The pattern works like a charm in modern C# as well, and has nice space-saving effects too by allowing you to omit the explicit variable declaration:

    if(!Whatever.TryParse<Thingy>(input, out var output)) output = some-sane-default;
or:

    if(!Whatever.TryParse<Thingy>(input, out var output)) throw new ApplicationException($"Not a valid Thingy: {input}");
Protip: don't do the latter in your kernel-mode driver.

yakshaving_jgt
1 replies
8h40m

(2019), but still good-ish advice

Why only "good-ish"? And how does it relate to the year the article was published? Surely you are implying that the advice in the article would be more authoritative if it were published earlier than 2019, right?

HelloNurse
0 replies
6h46m

Aversion to truth seems more fashionable now than in 2019.

zo1
0 replies
8h37m

Protip: Don't do either. And definitely don't do the first.

Explicit failure is always better than an implicit default that silently gets used when you give it a wrong value that you think is correct.

What you should do is throw your hands up early, fail to parse, and have a very clearly defined process and protocol to handle files that couldn't be loaded. It'll force you to ask yourself very difficult questions that aren't covered by either of the two options you posted.

The real failure in the recent CrowdStrike kernel-mode driver failing to parse some def/config file is that the dev/product owner/BA didn't ask "what happens if we try to load a file that's invalid?"

zeendo
0 replies
5h30m

Please don't do the first. Handle the bad cases. "sane default" fallback should be extremely rare.

Explicit > Implicit

nucleardog
0 replies
2h35m

if(!Whatever.TryParse<Thingy>(input, out var output)) output = some-sane-default;

I can't think of many (probably any) situations where I'd want to find that.

If _no_ input is provided (i.e., the parameter is optional), sure, using a sane default makes sense.

If _invalid_ input is provided, for the love of god please don't pretend like nothing's wrong.

If someone walks into a florist and asks for a coffee, the correct answer is not for them to be handed a rose. They're going to cut their mouth all up when they try and drink it.

Your method/module/program does not have an output defined for that set of inputs. Make that obvious rather than just doing wrong or non-obvious things in a way that quickly makes your program almost impossible to reason about. Do yourself a favour: clearly raise the issue and leave yourself a stack trace pointing directly at it, instead of setting yourself up for a vague bug report about incorrect behaviour when someone catches this in a few months.

Akronymus
0 replies
5h59m

if(!Whatever.TryParse<Thingy>(input, out var output)) output = some-sane-default;

I absolutely hate that. IMO you should handle the error of an invalid input outside of the parsing function. F# makes that easy.

    // assumed active pattern standing in for the real validation
    let (|ValidWhatever|_|) (input: string) =
        if input <> "" then Some input else None

    type Whatever private (value: string) =
        member _.Value = value
        static member create input =
            match input with
            | ValidWhatever x -> Some (Whatever x)
            | _ -> None

    match Whatever.create input with
    | Some x -> ()  // process the parsed data
    | None -> ()    // handle it not being parsed well

With this, you can only create instances through the create method, which parses the input.

Although you probably want to use a Result rather than an Option, but I digress.

zigzag312
4 replies
6h34m

Is this the opposite of the following opinion?

"“required” keyword in Protocol Buffers turned out to be a horrible mistake"

https://capnproto.org/faq.html#how-do-i-make-a-field-require...

Having both flexible, unvalidated parsing and validated parsing functions would probably be best IMHO.

lexicality
1 replies
5h26m

No, the idea is that your application-level validation functions take in data, own it, and return either a validation error or a new type that has invalid states removed.

In this case you could make a wrapper function that accepts the raw binary data, passes it to Cap'n'Proto, validates the output (this field is actually required, etc.) and then returns it.

zigzag312
0 replies
4h55m

Of course this is the correct way, but doing it manually is a lot of work. The ability to append a validation schema to the input schema, and have validation functions and valid-state types generated automatically, would be a big boost in productivity.

dgb23
1 replies
5h39m

The problem here as I understood it is more general and orthogonal to what the Parse, Don't Validate article is talking about.

1. A field being "required", is not a property of the field itself, but of the construct holding the field. JSON-Schema does this correctly, by letting you define an array of required fields on the schema of an object instead of it being a property of a specific field.

2. Consumers, not producers should decide what their assumptions are. Producers should decide what their guarantees are.

3. Everyone, including the parts in the middle (here it's a message bus) should only state assumptions that they actually need in order to function. Use different schemas for different assumptions (which is easier to do if the "required" assumption is a property of a construct and not a field).

The example in the article you mentioned illustrates nicely how these three principles are broken by "required" (or maybe how Protocol Buffers are used in general).
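
For reference, point 1 looks roughly like this in JSON-Schema (field name made up): "required" lives on the object schema, not on the field.

    {
      "type": "object",
      "properties": {
        "email": { "type": "string" }
      },
      "required": ["email"]
    }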

zigzag312
0 replies
5h7m

2. Consumers, not producers should decide what their assumptions are. Producers should decide what their guarantees are.

The issue with schema definition languages is that only one set of classes is generated, forcing consumers to write large amounts of mapping logic to transform input into objects with the required assumptions.

I like the idea of an array of required fields on the schema of an object. Arrays of required fields could be used to generate different sets of classes. That would make writing near-duplicate classes for input objects and the mapping logic unnecessary.

the_gipsy
4 replies
8h52m

Sadly this doesn't work at all in Go with the zero-values concept.

Smaug123
3 replies
6h34m

I imagine it can be made to work, as long as you make sure the zero value is never actually valid; then you're back in the usual billion-dollar-mistake world where everything is implicitly optional, rather than in the hellscape where you can never know whether you've got something meaningful in your hand or not. (I realise that Golang strongly believes zero values are usually valid, but Golang is wrong about this.)

the_gipsy
2 replies
5h15m

The zero value is valid in many cases: false, 0, "", etc.

Smaug123
1 replies
5h8m

There's no such thing as a "valid" value. Values can only be valid in some context. The empty string is valid only in contexts where… it has meaning. (Quite hard to say this without being obviously circular!) Golang has chosen the point of view that "in most contexts, the zero value is valid"; it is wrong about this. There are certainly some contexts where for some particular type the zero value is valid, but it is rarely true in general.

Booleans are harmful for other reasons, sometimes related to "parse, don't validate". Write a discriminated union that makes the true/false distinction meaningful, if you really do have a two-valued data type ("--dry-run" is `| Dry | Wet`, not a bool; this mistake is called "boolean blindness"). Write a type that actually contains the data you want, if the boolean is supposed to indicate the validity of some other data (`option<string>`, not `string * bool` where `something, false` implicitly means the `something` is meaningless).
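
In TypeScript terms, roughly (names made up):

    // a two-valued data type as a union instead of a bool ("boolean blindness")
    type RunMode = "dry" | "wet"
    function run(mode: RunMode) { console.log("running", mode) }

    // validity carried by the type, instead of a (string, bool) pair
    type Name = { kind: "some"; value: string } | { kind: "none" }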

the_gipsy
0 replies
1h30m

I agree, it seems I wasn't clear. The values false, zero, or empty string are "valid" in most contexts. But in Go, using the stdlib json package, it is not possible to distinguish a missing value from the zero value.

You can use pointers in some cases, but then everything becomes a full-blown pointer for the lack of an Option<T> type, and it still doesn't fully solve the problem.

When parsing "enums" in go, which are really just a type alias and some loose constants, again it's not possible to prevent zero values sneaking in with the stdlib json package. E.g. you get a value of the correct type, but it's not one of your consts, but equivalent to a new value `YourType("")`.

Thus, a lot of validation is necessary, between parsing and using values.

valenterry
2 replies
7h57m

Is it possible to implement foo? Trivially, the answer is no, as Void is a type that contains no values, so it’s impossible for any function to produce a value of type Void

That's actually not really correct. Or rather, it is technically correct, but it will confuse readers who work in languages like Java.

While void in languages like Java means that the result of the function cannot be used or has no meaning, it is NOT equivalent to types like Haskell's bottom type, because that would mean the function can never return.

Rather, void is similar to the "unit type" (https://en.wikipedia.org/wiki/Unit_type), which does have a value. It's like an empty tuple. It contains no information other than "the function call has finished". (And of course in languages with exceptions, this means that no exception was thrown.)

Otherwise, I like the article. More people should read and understand this way of thinking.

thedataangel
1 replies
7h52m

`Void` is not equivalent to the unit type. Haskell _has_ a unit type, called `()`. A return value of `Void` represents a function which is unimplementable, or which never terminates. It can also represent an unreachable possibility, e.g. in the type `Either Void a`.

valenterry
0 replies
7h14m

Yeah - I just wanted to emphasize that this article will be confusing for most developers, since most are used to void from languages like Java, and the meaning of that void is different from the one in Haskell.

tossandthrow
1 replies
9h49m

This is a super tool! But it still requires one to write good gates.

For all projects of some size, I would advise people to use Zod or the like (unless there are special circumstances such that external deps cannot be used).

zendist
1 replies
10h0m

It seems that CrowdStrike only parsed and didn't validate, to great effect :-) /s

Not saying that this advice isn't solid, just thought it's funny given the news of this week.

keybored
0 replies
7h57m

Parsing subsumes validation.

mindesc
1 replies
9h5m

Yeah. Use the raw URL datatype provided by the language and get hacked by some exotic XSS you're not aware of, because the specs have the kitchen sink included.

yakshaving_jgt
0 replies
8h42m

You can make your own types.

madduci
1 replies
8h34m

I don't know why this is good advice at all.

If you have exposed APIs, you should prevent malicious payloads. And what happens when the parser can be broken through invalid data, also causing out-of-memory exceptions?

It might work only if you have some safe guardrails around the APIs, but with just naked exposed endpoints, without a minimum of checks or a Web Application Firewall, this isn't really good advice.

Smaug123
0 replies
6h39m

Obviously if you've written a buggy parser then all bets are off, just as all bets are off if you've written bugs into any other part of your program. But those bugs would probably still be there if you didn't explicitly parse the input - they'd just appear later on, mixed in with your business logic which is performing ad-hoc parses whenever it needs the data!

kgeist
1 replies
8h58m

Now I have a single, snappy slogan that encapsulates what type-driven design means to me, and better yet, it’s only three words long:

Parse, don’t validate.

For me the slogan is rather "always validate only in the single constructor" (or constructor function, doesn't matter). That way, you cannot have invalid objects at all, and there's always a single source of truth. If you want to modify the object, implement it via constructing a new state by calling the same constructor again.

epolanski
0 replies
8h50m

Not really the same thing.

The point is that validation alone is then lost as information later on.

E.g. validating that an int is positive has limited benefit if you don't parse it into a positive-int type, because that information isn't present at the type level later on. The same applies to a non-empty array/list, where downstream consumers would otherwise need to check again whether the list is really non-empty.
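
A small TypeScript sketch of the difference (the branding trick and the names are just one way to do it):

    // the "parse" version keeps the fact in the type
    type PositiveInt = number & { readonly __brand: "PositiveInt" }

    function parsePositiveInt(n: number): PositiveInt | null {
      return Number.isInteger(n) && n > 0 ? (n as PositiveInt) : null
    }

    // consumers state the assumption in their signature and never re-check it
    function takeFirstN<T>(items: T[], n: PositiveInt): T[] {
      return items.slice(0, n)
    }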

This kind of information cannot always be encoded in objects or constructors.

WiSaGaN
1 replies
10h6m

Utilize a strong type system to make error cases unrepresentable. This is great advice to reduce bugs in software in general. It takes more time to think about the problem and to make a design following this. However, a lot of the time it is worth it.

Smaug123
0 replies
6h41m

I'm going to make the bold claim that it doesn't take more time to do this, if your language supports algebraic data types. It just sort of naturally happens. Of course, if your language requires a great deal of ceremony (C++, Java, C#, Python, Golang, Javascript, …) to model data, then it will take more time.

yakshaving_jgt
0 replies
8h32m

One of my favourite articles published during my career. I've noticed that people often just read the title and assume parsing and validation are somehow mutually exclusive, but in practice that's not the case. Parsing often includes validation. This is addressed in the article, under Use abstract datatypes to make validators “look like” parsers.

It covers the same kind of ground as avoiding primitive obsession.

wormlord
0 replies
2h11m

This idea + Domain modeling go together like peanut butter and chocolate. The idea that an invalid state in your Domain is unrepresentable takes so much work off the database and API, and makes programs so easy to test, since an invalid Domain representation simply will not parse and can never make it to your API or DB. That takes so much logic out of the parts of a software stack that are hardest to test.

willsmith72
0 replies
4h59m

This reminds me of a team I worked on which had very few unit tests and compensated with in-depth, complex type systems.

Comprehensive test coverage meets the same goal, and TDD looks a lot like "type-driven design", only easier to read and maintain.

teeheelol
0 replies
10h30m

Forwarded to CrowdStrike.

maw
0 replies
4h53m

Whenever this comes up, I'm reminded of section 5 in https://cr.yp.to/qmail/guarantee.html which among other things says "Don't parse" and "there are two types of command interfaces in the world of computing: good interfaces and user interfaces".

If I were to teach a class about programming in the medium (as opposed to in the small or in the large), I think I'd assign my students an essay comparing and contrasting these suggestions. Each has something to teach us, and maybe they're not as contradictory as they may seem at first.

keybored
0 replies
8h4m

One of my favorite things that I’ve read via HN.

It seems like this approach will often bottom out in smart constructors, since type systems are either limited or make you work too hard to prove relatively simple things.

kayo_20211030
0 replies
5h3m

Is Postel's law relevant here?

hintymad
0 replies
27m

This reminds me of a comment someone made during the craze of XMLs in the mid 2000s. In the comment the author suspected that so many organizations chose XML to implement their domain-specific languages, configuration languages included, only because XML offers a parser, while most organizations didn't want to bother with writing their own parser.

It beats me why people didn't want to write parsers, though. Writing parsers is not that hard, and is quite fun.

hatsuseno
0 replies
9h27m

I feel like this idea is another form of, or at least related to, my own habit of processing input in two phases: plan and execute. Run the input through a planner component that produces a sequential list of instructions that would 'do' whatever it is we're doing. I've changed this style to permit parallel execution or other more complicated structures than just a flat list, but I often come back to this base plan. Invalid input gets caught during the planning phase, and I need to ensure the planner can't make impossible plans, up to an extent. Can't say I've avoided every type of problem or bug this way, but it sure as hell gives me a good base to work with.
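
A toy TypeScript sketch of that split (the commands are made up): all input checking happens while building the plan, and execution just walks the already-checked steps.

    type Step =
      | { op: "print"; text: string }
      | { op: "wait"; ms: number }

    // planning phase: the only place that looks at raw input
    function plan(lines: string[]): Step[] {
      return lines.map((line): Step => {
        const [cmd, arg = ""] = line.split(" ")
        if (cmd === "print") return { op: "print", text: arg }
        if (cmd === "wait" && !Number.isNaN(Number(arg))) return { op: "wait", ms: Number(arg) }
        throw new Error(`invalid instruction: ${line}`)
      })
    }

    // execution phase: trusts the plan, no re-validation
    async function execute(steps: Step[]): Promise<void> {
      for (const step of steps) {
        if (step.op === "print") console.log(step.text)
        else await new Promise(resolve => setTimeout(resolve, step.ms))
      }
    }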

fragmede
0 replies
8h24m

(2019)

brunooliv
0 replies
10h54m

This post has been bookmarked for me since the first time I read it, and I occasionally come back to it. It's a great one.

andrewghull
0 replies
4h18m

Here's the final example in TypeScript if you find that easier to read:

  type NonEmpty<T> = [T, ...T[]]

  const head = <T>(list: NonEmpty<T>) => list[0]

  // type guard playing the role of the article's nonEmpty
  const nonEmpty = <T>(list: T[]): list is NonEmpty<T> => list.length > 0

  function getConfigurationDirectories(): NonEmpty<string> {
    const configDirsString = process.env["CONFIG_DIRS"] ?? ""
    const configDirs = configDirsString.split(',').filter(dir => dir !== "")
    if (!nonEmpty(configDirs)) throw Error("CONFIG_DIRS cannot be empty");
    return configDirs;
  }

  function main() {
    const configDirs = getConfigurationDirectories();
    initializeCache(head(configDirs))
  }

Vosporos
0 replies
9h29m

A seminal text that has had a high cultural impact.

TacticalCoder
0 replies
6h4m

What is needed is a canonical representation of the data: you must parse, re-encode, and verify that the re-encoded data matches, bit for bit, what was parsed.

Not in unit tests but when the app is running: you take the data in, you parse it, you re-encode/re-serialize it/re-whatever it. If it doesn't match the data that came in, the data that came in is rejected.

And that should just be one of the steps taken to verify that the data looks legit.
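
A tiny TypeScript sketch of that round trip (generic; parse/serialize stand for whatever codec you actually use):

    // reject any input whose re-serialization doesn't reproduce the original bytes
    function parseCanonical<T>(
      raw: string,
      parse: (s: string) => T,
      serialize: (t: T) => string
    ): T {
      const value = parse(raw)
      if (serialize(value) !== raw) throw new Error("input is not in canonical form")
      return value
    }

With JSON.parse/JSON.stringify as the pair, for example, this rejects JSON that isn't already in the exact form the serializer would produce.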

Garlef
0 replies
7h12m

Hm... I quite like the idea, but I think the initial example is not very good, and the remark about 'shotgun parsing' seems to blur some levels:

The key idea seems to be that the border between the periphery/plumbing/deserialization code and the actual business logic should be as strict and direct and isolated as possible. Only pass objects/data/payloads to the business logic that have been fully ingested into the data model of the business logic. And keep the ingestion in one place.

From this perspective, the section about "shotgun parsing" might give some people the wrong idea and derail some discussions: If it's an actual part of the business requirements that branching and validations need to happen (branching for example over the existence of an optional value), a superficial reading of the article might lead someone to incorrectly identify this as "shotgun parsing".