On one hand, thanks for being honest about how this bug came to be.
On the other hand, I don’t think it’s a good idea to advertise that the company introduced a major bug by copy-pasting ChatGPT code around, and that they then spent a week unable to even debug why it was failing.
I don’t know much about this startup, but this blog post had the opposite effect of all of the other high quality post-mortem posts I’ve read lately. Normally the goal of these posts is to demonstrate your team’s rigor and engineering skills, not reveal that ChatGPT is writing your code and your engineers can’t/won’t debug it for a very long time despite knowing that it’s costing them signups.
It read like no one really knew what they were doing. "We just let it generate the code and everything seemed to work" is certainly not a good way to market your company.
Eh I imagine they looked over the code as well, doing code review -- and at first glance, the code looks reasonable. I certainly wasn't able to catch the bug even though I tried to find it (and I was given a tiny collection of lines and the knowledge that there's a bug there!).
If anything, I think this says something about how dangerous ChatGPT and similar tools are: reading code is harder than writing code, and when you use ChatGPT, your role stops being that of a programmer and becomes that of a code reviewer. Worse, LLMs are excellent at producing plausible output (I mean that's literally all they do), which means the bugs will look like plausibly correct code as well.
I don't think this is indicative of people who don't know what they're doing. I think this is indicative of people using "AI" tools to help with programming at all.
I think using AI tools to write production code is probably indicative of people who don't really know what they are doing.
The best way not to have subtle bugs is to think deeply about your code, not subcontract it out -- whether that is to people far away who both cannot afford to think as deeply about your code and aren't as invested in it, or to an AI that is often right and doesn't know the difference between correct and incorrect.
It's just a profound abrogation of good development principles to behave this way. And where is the benefit in doing this repeatedly? You're just going to end up with a codebase nobody really owns on a cognitive level.
At least when you look at a StackOverflow answer you see the discussion around it from other real people offering critiques!
ETA in advance: and yes, I understand all the comparison points about using third party libraries, and all the left-pad stuff (don't get me started on NPM). But the point stands: the best way not to have bugs is to own your code. To my mind, anyone who is using ChatGPT in this way -- to write whole pieces of business logic, not just to get inspiration -- is failing at their one job. If it's to be yours, it has to come from the brain of someone who is yours too. This is an embarrassing and damaging admission and there is no way around it.
ETA (2): code review, as a practice, only works when you and the people who wrote the code have a shared understanding of the context and the goal of the code and are roughly equally invested in getting code through review. Because all the niche cases are illuminated by those discussions and avoided in advance. The less time you've spent on this preamble, the less effective the code review will be. It's a matter of trust and culture as much as it's a matter of comparing requirements with finished code.
You could say the same about the output of a compiler. No one owns that at a cognitive level. They own it at a higher level - the source code.
Same thing here. You own the output of the AI at a cognitive level, because you own the prompts that created it.
Notwithstanding the fact that compilers did not fall out of the sky and very much have people that own them at the cognitive level, I think this is still a different situation.
With a compiler you can expect a more or less one to one translation between source code and the operation of the resulting binary with some optimizations. When some compiler optimization causes undesired behavior, this too is a very difficult problem to solve.
Intentionally 10xing this type of problem by introducing a fuzzy translation between human language and source code then 1000xing it by repeating it all over the codebase just seems like a bad decision.
Right. I mean... I sometimes think that Webpack is a malign, inscrutable intelligence! :-)
But at least it's supposed to be deterministic. And there's a chance someone else will be able to explain the inner workings in a way I can repeatably test.
Except, for starters, that you're not using the LLM to replace a compiler.
You're using it to replace a teammate.
Yes, and when compilers fail, it's a very complex problem to solve, one that usually requires many hours from an experienced dev. Luckily,
(1) Compilers are reproducible (or at least repeatable), so you can share your problem with others, and they can help.
(2) For common languages, there are multiple compilers and multiple optimization options, which (and that's _very important_) produce identically-behaving programs - so you can try compiling the same program with different settings, and if the results differ, you know a compiler is bad.
(3) Compilers are very reliable, and bugs where the compiler succeeds but generates invalid code are even rarer - in many years of my career, I've only seen a handful of them.
Compare that to LLMs, which are non-reproducible, each giving a different answer (and that's by design), and which have a huge appear-to-succeed-but-produce-bad-output error rate, well above 1%. If you had a compiler that bad, you'd throw it away in disgust and write in assembly language.
But not now, quite obviously.
Colour me cynical, but I don't feel like pretending the future is here only to have to fix its blind incompetence.
I totally disagree with this. You might as well argue that we shouldn't use code-completion of any kind because you might accidentally pick the wrong dependency or import. Or perhaps we shouldn't use any third-party libraries at all because you can use them to write reasonable-looking but incorrect code? Heck, why even bother using a programming language at all since we don't "own" how it's interpreted or compiled? Ultimately I agree that using third-party tools saves time at the cost of potentially introducing some types of bugs. (Note that said tools may also help you avoid other types of bugs!) But it's clearly a tradeoff (and one where we've collectively disagreed with you the vast, vast majority of the time) and boiling that down to AI=bad misses the forest for the trees.
It is possible to use autocompletion correctly.
It is possible to use libraries correctly.
It is not possible to use AI correctly. It is only possible to correct its inevitable mistakes.
AI can provably write code without mistakes?
I'm a some-time Django developer and... I caught the bug instantly. Once I saw it was model/ORM code it was the first thing I looked for.
I say that not to brag because (a) default args is a known python footgun area already and (b) I'd hope most developers with any real Django or SQLAlchemy experience would have caught this pretty quick. I guess I'm just suggesting that maybe domain experience is actually worth something?
I missed it, because I was really confused what the models were doing: why is there an id and a subscription_id? Are the user_id fields related?
I've since moved on to primarily working with Java, so it's been a few years since working with Django on a daily basis and the default still jumped out to me immediately. Experience and domain knowledge is so important, especially when you need to evaluate ChatGPT's code for quality and correctness.
Also, where were their tests in the first place? Or am I expecting too much there?
This is an error that should probably have been caught just based upon the color of the text when it was typed/pasted into the source code. If the uuid() call was in quotes, it would have appeared as text. When you’re blindly using so much copy/pasted code (regardless of the source), it’s really easy to miss errors like this.
But our existing tools are already built to help us avoid this.
Back in the day, I used a tool from a group in Google called “error-prone”. It was great at catching things like this (and likely NPEs in Java). It would examine code before compiling to find common errors like this. I wish we had more “quick” check tools for more languages.
It's not in quotes. It's a function call.
The issue is that the function call happens once, when you define the class, rather than happening each time you instantiate the class.
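To make it concrete, here's a minimal sketch of that class of bug, with an invented SQLAlchemy model (not the post's actual code):

```python
import uuid
from sqlalchemy import Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Subscription(Base):
    __tablename__ = "subscription"

    # Buggy: the call runs once, when Python evaluates the class body at
    # import time, so the *same* UUID string becomes the default for every row.
    id = Column(String(36), primary_key=True, default=str(uuid.uuid4()))

    # Correct: pass a callable instead; SQLAlchemy invokes it per INSERT,
    # so each new row gets a fresh UUID.
    # id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
```

Once the same literal UUID is the default for every row, the second insert from any given process violates the primary key constraint -- the "duplicate key" symptom mentioned elsewhere in the thread.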
I don't know Python or SQLAlchemy that great, though I do have the benefit of it being cut down to a small amount of code and being told there was a bug there. That said, I didn't see the actual bug, but I did mentally flag that as something I ought to look up how it actually behaved. It's suspicious that some columns used `default` with what looks like Python code while others used `server_default` with what appears to be strings that look more like database engine code. If I was actually responsible for this, I'd want to dig into why there is that difference and where and when that code actually runs.
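From a quick look at the docs, the split appears to be roughly this (a sketch; the `now()` expression assumes PostgreSQL): `default` is applied by SQLAlchemy in Python at INSERT time, while `server_default` is baked into the DDL and evaluated by the database.

```python
from datetime import datetime
from sqlalchemy import Column, DateTime, text

# Client-side default: a Python value or callable, applied per INSERT.
created_at = Column(DateTime, default=datetime.utcnow)

# Server-side default: a SQL expression emitted into the CREATE TABLE DDL,
# evaluated by the database engine for each row.
updated_at = Column(DateTime, server_default=text("now()"))
```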
It's also the case that "code review" covers a lot of things, from quickly skimming the code and saying eh, it's probably fine, to deeply reading and ensuring that you fully understand the behavior in all possible cases of every line of code. The latter is much more effective, but probably not nearly as common as it ought to be.
This is why I typically only use LLMs in programming as a semi-intelligent doc delver with requests like, “give me an example of usage of X API with Y language on Z platform”.
Not only does it goof up less frequently on small focused snippets like this, it also requires me to pick the example apart and pay close enough attention to it that goofups don’t slip by as easily and it gets committed to memory more readily than with copypasting or LLM-backed autocomplete.
Except, have you met startup devs? This is by and large the "move fast then unbreak things" approach.
This is why working for startups gives me PTSD. I wouldn't recommend it to anyone.
The idea of inheriting a ChatGPT code base no one understands now makes it worse.
Just give it to GPT-5 for a refactor, easy!
It terrifies me that I have heard people say this unironically.
It’s the natural outcome of SV types denigrating the value of education.
Forget knowing anything, just come up with a nice pitch deck and let the LLM write the stack.
Not wholly surprised these people are YC backed. I’ve got the impression YC don’t place much weight on technical competence, assuming you can just hire the talent if you know how to sell.
Well, now replace “hire some talent” with “get a GPT subscription and YOLO”, and you get the foundation these companies of tomorrow are going to be built on.
Which hey, maybe that’s right and they know something I don’t.
For that matter, has OP even met the HN accepted wisdom? "No one knows what they're doing, everyone's faking it, it's fine if you are too" -- so don't take it as a red flag when your fumbling around keeps blowing up, because it surely must work that way everywhere else.
My early rant against this mentality: https://news.ycombinator.com/item?id=19214749
It's very humbling coming out of startup-land and working with big tech engineers and realizing their tooling runs circles around everybody else's and enables them to be much more precise with their work and scale, though it isn't without trade-offs.
Yeah but a lot of that is just the accrual of improvements that is possible with a lot of resources over a long period of time.
People working in "big tech" aren't fundamentally better at building reliable tools and systems; the time and resource constraints are entirely different.
And the stakes! This outage might have cost the OP $10k. A similar snafu at a larger company might have cost tens of millions or more.
The big tech tooling probably cost tens of millions of dollars to create, and probably had a couple $10k mistakes on the way to getting it written and running.
It’s possible to move fast the same way, but break fewer things than this. For example, in this case, they said that they introduced tests to mitigate this. I can assure you that introducing tests takes more time than a couple of minutes of Google searches to check what each line really does.
The idea of moving fast is to have extensive logs and alerts so you fix all errors fast as they appear, without "wasting time" on long, expensive tests in a phase where things change every day.
5 days to find out you have "duplicate key" errors in the db is the opposite of fast
This is getting more common. I have already had people try to tell me how something works from a chat gpt summary. This would have led to us taking a completely different direction… 5 minutes of reading the actual docs and I found out they were wrong.
Now at a new company I have caught several people copy-pasting GPT code that is just horrendous.
It seems like this is where the industry is headed. The only thing I have found GPT to be good at is solving interview questions, although it still uses phantom functions about 50% of the time. The future is bumming me out.
Like people who post "here's what ChatGPTx said" instead of their own answer. Quite literally, what is the point?
However, I don't think it's really bad for the technical industries long term. It probably does mean that some companies with loose internal quality control and enough shiftless employees pasting enough GPT spew without oversight will go to the wall because their software became unmaintainable and not useful, but this already happens. It's probably not hugely worse than the flood of bootcamp victims who wrote fizzbuzz in Python, get recruited by a credulous, cheap or desperate company and proceed to trash the place if not adequately supervised. If you can't detect bad work being committed, that's mostly on the company, not ChatGPT. Yes, it may make it harder, a bit, but it was oversight you should already have been prepared to have, doubly so if you can't trust employee output. It also probably implies strong QA, which is already a prerequisite of a solid product company.
Normal interest rates coming back will cut away companies that waste everyone's time and money by overloading themselves on endlessly compounding technical debt.
Is the idea here that normal (read: low?) interest rates will let companies spend more time getting things right?
No, the idea is that historically-normal interest rates around the 5-10% mark won't be conducive to free VC cash being sprayed around for start-ups to wank themselves silly over "piv-iterating" endlessly over spamming complete nonsense and using headcount and office shininess as a substitute for useful and robust products.
Yes, it makes the barrier higher even for good products and helps entrench incumbents, but short of a transnational revolution, the macroeconomic system is what it is and you can only choose to find the good things in it or give up entirely.
Yeah, I've seen this and I hate it. If I wanted to know what ChatGPT said I'd just ask it myself.
It's a Tower of Babel-like effect.
In my experience, hardly anyone in software does know what they're doing, for sufficiently rigorous values of "know what you're doing." We all read about other people's stupid mistakes, and think "haha, I would never have done that, because I know about XYZ!" And then we go off and happily make some equally stupid mistake because we don't know about ABC.
There’s a difference between not knowing what you are doing and making a mistake.
The "zen" of LLMs is that they do not see a real distinction between these two things, or either of these two things and success ;-)
An awful lot of mistakes are made because one didn't know something that would have enabled one to avoid it. Not knowing what you don't know is difficult to work around.
I dunno. I tend to annoy people when taking on jobs by telling people what I am concerned about and do not understand, and then sharing with them the extent to which I have managed to allay my own concerns through research.
I turn down a lot of jobs I don't feel confident with; maybe more than I should.
An LLM never will.
Everyone using C++20 compilers: side-glancing monkey.
The difference is that you can read the code that ChatGPT generates.
I don't really care about them marketing their company, but, Jesus, seriously, that's how software is going to be written now. TBH, I'm not really sure if it's that much different from how it was, but it sounds just... fabulous.
Then no one knows what they are doing. I really don't know any company that doesn't make what could be considered rookie mistakes by some armchair "developer" here on HN.
They spent 5 days. The bug type is pretty common and could easily have been introduced by a human developer. (It's in a similar class to the singleton default argument issue that many people complain about.) Meh, I don't mind the cautionary tale and don't think chatgpt was even relevant.
It's actually a tricky bug, because usual tests wouldn't catch it (db wiped for good isolation) and many ways of manual testing would restart the service (and reset the value) on any change and prevent you from seeing it. Ideally there would be a lint that catches this situation for you.
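Something like this hypothetical, very rough AST check is the kind of lint I mean -- it just flags Column(default=...) arguments that are the result of a call rather than a callable, so it would have false positives (e.g. factories that legitimately return a plain value), but it would have caught this one:

```python
# lint_column_defaults.py -- a hypothetical, very rough check: warn when a
# Column(...) call passes default= the *result* of a call (e.g. uuid.uuid4())
# instead of a callable reference.
import ast
import sys

def callee_name(func: ast.expr) -> str:
    # Handles both Column(...) and sqlalchemy.Column(...)
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return ""

def check(path: str) -> None:
    tree = ast.parse(open(path).read(), filename=path)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and callee_name(node.func) == "Column":
            for kw in node.keywords:
                if kw.arg == "default" and isinstance(kw.value, ast.Call):
                    print(f"{path}:{node.lineno}: default= receives the result "
                          f"of a call; did you mean to pass the callable itself?")

if __name__ == "__main__":
    for p in sys.argv[1:]:
        check(p)
```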
TBH, while I definitely could see this being an easy bug to write, something is definitely wrong if it took 5 days to identify the root cause of this bug.
That is, I'm struggling to understand how a dive into the logs wouldn't show that all of these inserts were failing with duplicate key constraint violations. At that point at least I'd think you'd be able to narrow down the bug to a problem with key generation, at which point you're 90% of the way there.
I also don't agree that "usual tests wouldn't catch it (db wiped for good isolation)". I'd think that you'd have at least one test case that inserted multiple users within that single test.
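Even a single test along these lines (a hypothetical pytest sketch with an invented model, not their actual code) would have failed with an IntegrityError under the buggy default:

```python
import uuid
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Subscription(Base):
    __tablename__ = "subscription"
    # Correct form; the buggy version would be default=str(uuid.uuid4()),
    # which is evaluated once at import and shared by every row.
    id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
    user_id = Column(Integer, nullable=False)

def test_two_subscriptions_in_one_process():
    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        session.add(Subscription(user_id=1))
        session.commit()
        # With the buggy default, this second insert reuses the same PK and
        # the commit raises IntegrityError -- exactly the production symptom.
        session.add(Subscription(user_id=2))
        session.commit()
        assert session.query(Subscription).count() == 2
```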
The bug was in multiple subscriptions not just users. And I can't think of one non-contrived reason to do it. Even when testing the visibility/access of subscriptions between users you need 2 users, but only one subscription.
create a subscription for a test user. delete it. Make sure you can create another subscription for the same user.
create subscriptions with and without overlapping effective windows
Those seem like very basic tests that would have highlighted the underlying issue
Hindsight is 20/20. It’s always easy to think of tests that would have caught the issue. Can you think of tests that will catch the next issue though?
Sure, hindsight is 20/20, but a bunch of these comments are replying to the assertion "And I can't think of one non-contrived reason to do it" (have a single test case with multiple subscriptions). That's the assertion I think is totally weird - I can think of tons of non-contrived reasons to have multiple subscriptions in a single test case.
I wouldn't pillory someone if they left out a test case like this, but neither would I assert that a test case like this is for some reason unthinkable or some outlandish edge case.
Go on then - so far the examples I've seen don't make sense in the context of stripe.
I'd argue that a suite of tests that exercise all reasonably likely scenarios is table stakes. And would have caught this particular bug.
I'm not talking about 100% branch coverage, but 100% coverage of all happy paths and all unhappy paths that a user might reasonably bump into.
OK maybe not 100% of the scenarios the entire system expresses, but pretty darn close for the business critical flows (signups, orders, checkouts, whatever).
Have a look at the Stripe API. You don't delete subscriptions. You change them to a free plan instead / cancel and then resume later. This flow would not result in deletion of the entry. You can also update to change the billing date, so no overlapping subscriptions are needed. Neither test would result in the described bug.
Or add some debug logging? 5 days into a revenue-block bug, if I can't repro manually or via tests, I would have logged the hell out of this code. No code path or metric would be spared.
What? This doesn't make any sense:
1. First, if you look at the code they posted, they had the same bug on line 45 where they create new Stripe customers.
2. The issue is not multiple subscriptions per user (again, if you look at the code, you'll see each Subscription has one foreign key user_id column). The problem is if you had multiple subscriptions (each from different users) created from the same backend instance then they'd get the same PK.
Not every user needs a stripe customer. I'm creating the stripe entries only on subscription in my app.
Your second point is true, but I don't see what it changes. Most automated unit/integration testing would just wipe the database between tests and needing two subscribed users in a single test is not that likely.
Apparently not.
Yeah, we have quite literally caught bugs like this in 5 minutes in prod not bc we made a mistake, but bc a customer’s internal API made a schema change without telling us and our database constraints and logging protected us.
But it took about 5 minutes, and 4 of those minutes were waiting for Kibana to load.
Why? In fact, not having good isolation would have caught this bug. Generate random emails for each test. Why would you test on a completely new db as if that is what will happen in the real world?
It makes your tests more robust. Generally you don’t want tests that are too sensitive to external state since they will fail spuriously and become useless.
Of course your tests shouldn't be sensitive to external state. Why would other tests running affect your test?
They shouldn't. But they do. We're not perfectly spherical developers and we all make mistakes. Sometimes it's also extremely tricky to figure out what state is leaking, especially if it's an access race issue and happens only for some tests and very rarely. If you haven't seen that happening, you just need to work on larger projects.
I've worked at several large companies. In our e2e tests we did not create isolated dbs per test. If the test failed because of other tests running that's a bug in the test and that person would get automatically @mentioned on slack and they would have to fix the build.
It's extremely common. You want to know that objects/rows from one test don't bleed into another by accident. It allows you to write stricter and simpler assertions - like "there's one active user in the database" after a few changes, rather than "this specific user is in the database and active and those other ones are not anymore".
Leaking information between test runs can actually make things pass by accident.
...who didn't know how the ORM they were using worked. That's what makes them look so bad here: nobody knew how it worked, not even at the surface level of knowing what the SQL actually generated by the tool looks like.
In their defense, I find SQLAlchemy syntax quite horrible, and I always have to look up everything. It also got a 2.0 release recently which changes some syntax (good luck guessing which version ChatGPT will use), and makes the process even more annoying.
SQLAlchemy syntax is ridiculously obvious and straightforward as long as you're not doing anything weird.
The takeaway here is that they weren't mature enough to realize they were, in fact, doing something "weird". I.e. Using UUIDs for PKs, because hey "Netflix does it so we have to too! Oh and we need an engineering blog to advertise it".
Edit. More clarity about why the UUID is my point of blame: If they had used a surrogate and sequential Integer PK for their tables, they would never have to tell SQLAlchemy what the default would be, it's implied because it's the standard and non-weird behavior that doesn't include a footgun.
Unfortunately, UUID as PK is an extremely common pattern these days, because devs love to believe that they’ll need a distributed DB, and also that you can’t possibly use integers with such a setup. The former is rarely true, the latter is blatantly false.
looking at the query logs for the nighttime period should have made the bug fairly obvious
They even said they had sentry set up.. they'd notice the duplicate key error immediately.
I read the part where they said they pored through "hundreds of sentry logs" and immediately was like "no you didn't."
This is not an error that would be difficult to spot in an error aggregator, it would throw some sort of constraint error with a reasonable error message.
agree, but this sounds like it would produce logs/error messages which could then lead to a solution, quicker... if the logs were captured and propagated sufficiently
Oh definitely. Their ops experience seems very low. But I've found it extremely uncommon to see anything better in smaller projects.
Volume/load testing (or really, any decent acceptance testing) would catch it.
Load testing - yes, but it's not that usual unfortunately. (Even though it should be.)
Acceptance testing - again, maybe, if they use 20 or so subscriptions in one batch, which may not be the case.
A functioning dev env would have caught this if they manually tested more than once. Typically you don't run 40 dev instances. Or a staging environment.
More importantly, what was the motivation behind a rewrite from TypeScript to Python? From the article
Seems like this entire mess could've been avoided if they had stuck with their existing codebase, which seemed to have been satisfying their business requirements.
There is well-hidden vendor lock-in when using NextJS, at least.
There is no vendor lockin with nextjs
Simply put - if you want to get the best out of the framework, you need to host it on Vercel. Otherwise, there are better options for frameworks. No need to "fight it".
You will find many issues on GitHub which are not addressed, even though resolving them would make the framework "better" or easier to use on other clouds.
Making a free offering worse so the same company can profit from its premium offerings is the pinnacle of capitalism. It reminds me of a recurring joke I have with a friend while playing Call of Duty: that they will get greedier and soon will sell not only character skins but also shaders/textures for the maps. Oh, so you want to see something better than placeholder textures? We have the DLC just for you!
More likely outcome: ads on static textures and between lobbies, oh but you can pay to turn them into _theme of the week_.
Don't give them ideas...
Environment destruction gameplay but there's always another ad under the ads, except when it's a lootbox.
I can imagine worse, too! They haven't even really started turning that knob yet.
The Video game metaphor stretches pretty far.
Madden has a monopoly license for NFL content. For a decade the biggest complaint was how they gatekept rosters behind the yearly re-release. Eventually they allowed roster sharing, but they put it behind the most god-awful, inept UI you could possibly imagine, such that casual gamers practically wouldn't bother with it.
Then Madden came out with Madden Ultimate Team (like trading cards MTX) and have been neglecting non-MUT modes ever since. They don't explicitly regress the rest of their game, they just commit resources to that effect.
It's like malicious compliance. They don't embrace, extend, extinguish, but they get a similar effect just with resourcing, layoffs, whatever.
Do you mind naming some NextJS alternatives without potential vendor lock in? Mulling a change in my fe.
For example, Remix has got a lot of traction recently. It is also backed by Shopify where the business model does not conflict.
https://remix.run/
There is no explicit vendor lock-in, but features of the framework are designed heavily towards Vercel-specific features.
The SST team actually has an open-next project[1] that does a ton of work to shim most of the Vercel-specific features into an AWS deployment. I don't think it has full parity, and it's a third-party adapter from a competing host. The fact that it's needed at all is a sign of how closely tied Next and Vercel are.
[1] https://github.com/sst/open-next
that’s a bold claim, could you give an example?
Probably the most known example https://github.com/vercel/next.js/discussions/19065
It is not an issue if you host in Vercel.
Implementing the requested feature would make the framework much better and easier to use when self-hosted elsewhere. But the issue has been neglected. This is just one case.
FWIW, we host on Cloudflare and use their API to resize images on the fly and we're fine. Not so much a "lock-in" if other vendors can fill-in, is it?
You likely needed to do more work than you otherwise would have, compared to some other options.
The lock-in here is the added developer time and complexity vs. just paying premium.
I disagree with your threshold for what makes something a lock-in but I admire your ideology of less friction in portability
What API do you use to resize images? Cloudflare? I know Vercel and NextJS have an <Image> resizing/optimization component that gets pricy.
If you need to change it when switching vendors and only they offer it/it's proprietary, it's vendor lock-in.
I’m sure they are influenced by the likes of Reddit and Twitter rewriting their stack. I mean, that’s what has to be done, right? /s
Is there a trend there of moving from Next to FastAPI? I would be surprised.
Perhaps they are doing some AI thing and want to have python everywhere.
My guess, when I read it, was this would permit them to independently scale the backend on some commodity capacity provider, and then their Nextjs frontend becomes just another React app. OP didn’t mention what their product was, but if it’s AI-adjacent then a python backend doesn’t sound like a terrible idea.
Or if you really want to rewrite your back end, why not just use Express? It would be wildly quicker to rewrite than switching languages. That along with the article makes me question the competency of the company. They got customers, sure, but in the long run these decisions will pile up.
This is a pretty common mistake with sqlalchemy whether you’re using ChatGPT or not. I learned the same lesson years ago, although I caught it while testing. I write plenty of python and I just don’t often pass functions in as parameters. In this case you need to!
For something like this where you’re generating a unique id and probably need it in every model, it’s better to write a new Base model that includes things like your unique id, created/changed at timestamps, etc. and then subclass it for every model. Basically, write your model boilerplate once and inherit it. This way you can only fuck this up once and you’re bound to catch it early before you make this mistake in a later addition like subscription management.
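A rough sketch of that pattern (class and column names invented here), in case it's useful:

```python
import uuid
from datetime import datetime, timezone
from sqlalchemy import Column, DateTime, String
from sqlalchemy.orm import declarative_base

class BaseModel(declarative_base()):
    """Shared boilerplate: if the callable-vs-call mistake happens, it happens
    here exactly once, not in every model that needs an id and timestamps."""
    __abstract__ = True  # no table is created for the base itself

    id = Column(String(36), primary_key=True,
                default=lambda: str(uuid.uuid4()))  # callable, not a call
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
    updated_at = Column(DateTime, default=lambda: datetime.now(timezone.utc),
                        onupdate=lambda: datetime.now(timezone.utc))

class Subscription(BaseModel):
    __tablename__ = "subscription"
    # Only model-specific columns live here.
    plan = Column(String, nullable=False)
```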
Depending on your viewpoint, ORM are by nature a mistake.
The time people spend learning the quirks of an ORM is much better put into learning SQL.
Honestly same can be said about a lot of frameworks. You will pry my vanilla JS debugged with print statements hand-coded in vi from these hands only when they're cold and dead.
Yeah, I have a workflow where I inject JavaScript into arbitrary webpages and store the results. There’s no substitute for vanilla js knowledge when you’re in the weeds.
I wonder whether there are linters to detect those types of mistakes for SQLAlchemy. Even though I'm aware of such pitfalls, it's nice if linters can catch them, because I'm not confident I'd catch them all the time during code review.
Some linters like pyright can identify dangerous defaults in a function definition, like `def hello(x=[]): pass` (mutable values shouldn't be a default). Linter plugins for widely-used and critical libraries like SQLAlchemy are nice to have.
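For anyone who hasn't been bitten by it, the footgun that check flags has the same "evaluated once at definition time" shape as the ORM default bug above:

```python
# The default list is created once, when the def statement runs, and is then
# shared by every call that doesn't pass its own list.
def append_bad(item, items=[]):
    items.append(item)
    return items

print(append_bad(1))  # [1]
print(append_bad(2))  # [1, 2]  <- same list as the first call

# The usual fix: use a sentinel and build the fresh value inside the call.
def append_good(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items
```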
The mutable-default-arguments issue is easy for a third-party linter to catch because it doesn't require any specific knowledge about primary keys and databases. Are there static typing plugins for other common packages that would catch issues like this?
I personally use pylint, which I've found to be the most aggressive python linter.
Ah, but it’s not a characteristic of SQLAlchemy tho. It’s how Python evaluates statements. Both Peewee and the Django ORM work on the same principle with default values.
The intent is to pass a callable, not to call a function and populate an argument with what it returns.
Correct, it's not specific to sqlalchemy - I'm just saying I notice this a lot with sqlalchemy. Probably because it was the first significant bug I had to figure out how to fix when I introduced it in one of my first apps. I guess we never forget the first time we shot ourselves in the foot.
Yea, it's more an issue of how this Python library can cause misunderstandings, with ChatGPT falling into the same misunderstanding that would have been made by an engineer who lacks experience with that library.
This mistake would have happened even if they did not use ChatGPT.
That explains where ChatGpt is getting that from. I guess it is only as smart as the average coder.
The fact that they couldn't find it by looking at error logs is weird to me.
This is an entirely forgivable error but should have been found the first time they got an email about it:
"Oh, look, the error logs have a duplicate key exception for the primary key, how do we generate primary keys.... (facepalm)"
Funnily enough, I saw the error in their snippet as soon as I read it, but dismissed it, thinking there was some new-fangled Python feature which allowed that to work -- like the function signature declaring that default= accepts only callables, so the callable gets passed? I haven't kept up with modern Python, that sounded cool, and I figured the bug couldn't be THAT simple.
I was wondering that too. Why wouldn’t the error be in the logs?
Guess: the logs were on an ec2 instance that was thrown away regularly, and the overnight reports didn't give reproduce steps or timestamps; so when they checked it "works fine".
There's value in having your backtrace surfaced to end users rather than swallowing an exception and displaying "didn't work".
I don't think showing stack traces to users is good practice? Every time one of my users gets a didn't work message I log the stack trace instead.
Why would you show them a stack trace? This should be logged.
It was on some temporary AWS service like Lambda or something? ("We had eight ECS tasks on AWS, all running five instances of our backend") -- but regardless, logs should be somewhere persistent.
If they weren't, that should be the first thing you fix.
Yeah this is not a good thing to advertise.
- They were under large time constraints, but decided a full rewrite to a completely different stack was a good idea.
- They copy-pasted a whole bunch of code, tested it manually once locally, once in production, and called it a day.
- The debugging procedure for an issue so significant it made them dread waking up involved... testing it once and moving on. Every day.
The bug is pretty easy to miss, but should also be trivial to diagnose the moment you look at the error message, and trivial to reproduce if you just try more than once.
I'd rather they admit a mistake and learn a lesson from it, even if it isn't a good thing to advertise. That said, I agree that you are identifying a more important issue here, but I also think you are being a bit too subtle about it, even if I agree with what you are saying. The real lesson that they should have learned from this ordeal is to never push code directly into production --- period. The article never mentions using a testbed or sandbox beforehand, and I kinda feel like they learned a good lesson, but it may in fact be the wrong lesson to learn here.
I don't see how testbed/sandbox would have helped, unless they'd also have a dedicated QA person _and_ configured their sandbox so have dramatically fewer instances.
Because I can see "create a new subscription" in the manual test plan, but not "create 5x new subscription".
> trivial to reproduce if you just try more than once
A lot more than once: they had 40 instances of their app, and the bug was only triggered by getting two requests on the same instance.
A bunch of developers including me once spent a whole weekend trying to reproduce a bug that was affecting production and/or guess from the logs where to look for it. Monday morning, team lead called a meeting, asked for everything we could find out, and… Opened the app in six tabs simultaneously and pressed the button in question in one of the tabs. And it froze! Knowing how to reproduce on our computers, we found and fixed the bug in the next 30 minutes.
Ironically one of my frequent GPT questions is “X is supposed to do Y, but is doing Z. What logs should I look at and what error messages to keep an eye out for?”
That's an alright takeaway: the team made a rookie mistake and then they made a PR mistake by oversharing.
Otherwise, I think this comment thread is a classic example why company engineering blogs choose to be boring. Better ten articles that have some useful information, than a single article that allows the commentariat to pile on and ruin your reputation.
I think it’s an unfair takeaway. I have over a decade of experience and still had to stare at the line to find the bug. If that makes them incompetent, I stand with them. It’s a bug I’ve seen people make in other contexts, not just chatbots.
The AI angle is probably why people are piling on. There’s a latent fear that AI will take our jobs, and this is a great way to drive home that we’re still needed. For now.
The one thing I will say is that it probably wouldn’t take me days to track it down. But that’s only because I have lots of experience dealing with bugs the way that The Wolf deals with backs of cars. When you’re trying to run a startup on top of everything else, it can be easy to miss.
I’m happy they gave us a glimpse of early stage growing pains, and I don’t think this was a PR fumble. It shows that lots of people want what they’re making, which is roughly the only thing that matters.
Eh, I think it speaks fairly well for them.
On the one hand it does seem like a fairly inexperienced organization with some pretty undercooked release and testing processes, but on the other hand all that stuff is ultimately fixable. This is a relatively harmless way of learning that lesson. Admitting a problem is the first step toward fixing it.
A culture of ass-covering is much harder to fix, and will definitely get in the way of addressing these types of issues
A mistake is interesting if the mistake itself or the RCA is interesting - using sloppy methods isn't really that interesting on its face.
Pile-on aside, the problem with this blog article is that it doesn't really have much of a useful takeaway.
They didn't even really talk about the offending line in detail. They didn't really talk about what they did to fix their engineering pipelines. It was just a story about how they let ChatGPT write some code, the code was buggy, and the bug was hard to spot because they relied on customers e-mailing them about it in a way that only happened when they were sleeping.
It's not really a postmortem, it's a story about fast and loose startup times. Which could be interesting in itself, except it's being presented more as an engineering postmortem blog minus the actionable lessons.
That's why everyone is confused about why this company posted this as a lesson: The lesson is obvious and, frankly, better left as a quiet story for the founders to chuckle about to their friends.
The bad thing is what they did, not that they disclosed it.
I agree that this is probably to their disadvantage, but I would much rather have people admitting their faults than hiding them. If everyone did this the world would be better.
Of course the best solution is to not have faults but that is like saying that the solution to being poor is to have lots of money. It's much easier to say than do.
The bad thing is their engineering culture and not anything technical. We all make mistakes; the question is how we fix them. Look at the last sentences of the post:
None of those are unconditionally bad! Every project I've worked on could use more testing; we all copy-pasted code at least occasionally, and pushing to main is fine in some circumstances.
The real problem is that they went live, but their tooling (or their knowledge of how to use it) was so bad it took 5 days to solve a simple issue; and meanwhile, they kept pushing new code ("10-20 commits/day") while their customers were suffering. This is what really causes the reputation hit.
In a way, it does give you an opportunity to think about what you appreciate in a detailed postmortem - not just a single cause, but human and organizational factors too, and an attempt to figure out explicit mitigations. I’ll admit the informality and the breezy tone here made me go “woah, they’re a bit cavalier…”
That blog post is all they have, so not much to worry about company-wise.
I appreciate the author's honesty. It's better to see transparently what happened so customers know the problem is fixed.
These criticisms about engineering PR are too heavy handed. Great engineers solve problems and describe problems without finger pointing to place blame. In fact I think that the worst engineers I’ve worked with are the ones most often reaching for someone to place it on.
This is embarrassing, I’d honestly consider pulling this post for your reputation.
CEO thoughts: "Oh, post-mortems are always well received. I should write one for that really basic bug we had and how we took 5 days to find it, and forget to mention how we fixed it or how we've changed our structure so that it never happens again."
Also the CEO: "Remember to be defensive in the Reddit comments, saying how we are a small, 1-million-dollar-backed startup and how it's normal to make this kind of rookie mistake in order to move fast."