
Spending 3 months investigating a 7-year old bug and fixing it in 1 line of code

TehShrike
49 replies
1d3h

I particularly liked this part:

> Knowing very little about USB audio processing, but having cut my teeth in college on 8-bit 8051 processors, I knew what kind of functions tended to be slow. I did a Ctrl+F for “%” and found a 16-bit modulo right in the audio processing code.

That feeling of saving days of work because you remember a clue from previous experience is so good.

nikanj
24 replies
1d2h

This essentially is why senior engineers get much bigger salaries

mulmen
23 replies
1d

As a total newbie I saved my company a quarter million dollars in Oracle licensing in a single afternoon by rewriting a PL/SQL function. That change was a few lines of SQL. Seniors don’t have a monopoly on good ideas.

Salary is driven by market conditions and nothing else. It is not an approximation of merit or even delivered value.

stavros
12 replies
20h33m

This is laughably false. The highly-paid, experienced seniors produce so much more value than juniors that it's not even in the same ballpark. It's also usually the kind of value that juniors don't even notice, because it's not measured in lines of code.

A good junior will write a hundred lines of code in a day. A good senior will delete a hundred, because they realized the user dictated a solution instead of describing their problem, asked them about it, and figured out that the problem can be solved by changing a config variable somewhere.

klabb3
6 replies
16h32m

Violent agree on variance in value produced. Violent disagree that junior, senior, or other titles or roles have such strong correlations. For very simple reasons: we can’t measure value, and we absolutely can’t measure value within a performance review cycle.

The most devastating value destruction (aside from the rare intern deleting the prod db) that I’ve seen consistently is with seniors/rockstars who introduce new tech, take credit, and move on. There’s a reason for the term resume-driven development. Think about what a negative force multiplier can do for an org.

stavros
2 replies
11h4m

I don't know, I think where I work we have a pretty good idea of the value each person brings. I don't know how much they're paid, but I do know how good each person is (including whether they tend to complicate things, to use exciting technologies, etc).

Maybe it varies per company.

mulmen
1 replies
9h15m

> I don't know how much they're paid

But that’s the entire point you’re missing. The pay is not proportional to contribution or technical skill. It’s proportional to market forces and negotiation skill.

stavros
0 replies
5h31m

I know what level each of our people is at, and levels are compensated fairly evenly. The fact that I don't know exact numbers doesn't mean I don't have a proxy.

chrismorgan
1 replies
4h59m

> Think about what a negative force multiplier can do for an org.

Negative force multipliers are easily remedied: just make sure you have an even number of them.

atherton33
0 replies
3h12m

This totally happened on my first team.

We had a guy who would argue about everything, but he knew the CTO, so we had to tolerate him.

Then we hired a second one and they just argued with each other all the time and the rest of the team could finally make progress.

It was awesome.

_rm
0 replies
6h12m

How does violent agree/disagree work? Like after you conclude you agree/disagree to this internet text, do you then proceed to scream out on your balcony that which you agree with / smash up your apartment in rage, respectively?

vaylian
1 replies
5h36m

> This is laughably false. The highly-paid, experienced seniors produce so much more value than juniors that it's not even in the same ballpark. It's also usually the kind of value that juniors don't even notice, because it's not measured in lines of code.

This particular case sounds like someone got incorrectly hired as a junior. Maybe they didn't have enough "real world" corporate experience and that is why they weren't offered a senior position?

mulmen
0 replies
1h21m

I was fresh out of college. But it’s not like that was an isolated incident or that other juniors don’t also have good ideas.

_rm
1 replies
5h58m

It's true, not "laughably false". I've seen with my own eyes the most effective developer in a company being paid in the bottom quartile, as well as a vice versa case.

In the former case, we basically had to demand management raise his salary to the low end of his market value, over the course of six months, until they finally gave in. It was just so disgusting to us we couldn't let it go.

The reason comes down to a skill bias - it's a different skill set to navigate other people into getting yourself a good salary, versus navigating the ins and outs of coding. The skills don't overlap, so time spent on one detracts from another.

In the end he finally got the message we kept ramming into his head, applied to work at a brand-name tech company, and instantly more than doubled his salary. He could've done so years earlier.

This stuff is the norm. I've been a manager having eyes on salaries while also having eyes on people's performance (although unfortunately not much of a lever on the former), and rest assured it is often a very jarring experience. Like "that person should be let go immediately / that person should job hop immediately".

stavros
0 replies
5h31m

> instantly more than doubled his salary. He could've done so years earlier.

So it's not true, then?

The GP claims this is universally true. All I need to do is post a counterexample, and I did. Yes, there are shitty companies that try to keep salaries as low as possible, not realizing that that will lose them their best people. Don't work for those!

mewpmewp2
0 replies
19h28m

Yes, but they don't get paid that many times more than junior engs, compared to how many times more value they bring.

close04
4 replies
1d

Statistically speaking a senior (more experienced) engineer is more likely to consistently deliver time saving results, while a junior is more likely to occasionally do it, if ever.

Proving it’s not a one-time thing is what pushes you up in the salary and seniority ranking.

nashashmi
3 replies
23h1m

Senior engineers have less opportunity to write time-consumingly careful code because they get paid so much. Much easier to throw great new hardware at it.

fifilura
1 replies
20h57m

Senior engineers have less time to write code period.

And this is what saves the day.

Code is a liability.

JonChesterfield
0 replies
20h0m

The corporate structures that reward people who prove especially good at building the product with more meetings and less time building the product are perhaps not optimal in their deployment of resources.

Maximising the fraction of the product built by people who don't know what they're doing would however explain the emergent properties of modern software.

dgfitz
0 replies
19h54m

Senior engineers can write time-consuming, careful code efficiently. This is why they are seniors.

dehugger
3 replies
18h57m

My view as a burgeoning senior dev is that the "senior" bit is generally less about coding and more about domain knowledge.

Understanding the business processes of your industry, how to solicit feedback from and interact with end users, how to explain things to management/sell on ideas.

If you put a junior dev in front of a panel of executives and ask them to explain requirements for a project odds are quite high they will info dump tech mumbo-jumbo. A senior should be able to explain risks, benefits, timelines, and impacted areas of the business in a manner that non technical people can easily grok.

truncate
1 replies
17h8m

> they will info dump tech mumbo-jumbo

I'm a mid-level engineer. Honestly, several staff+ engineers may not be spitting tech mumbo-jumbo, but they do dump all other kinds of BS. Political BS, "tactical tornados"[1]. That doesn't necessarily mean they were good at engineering, just good with people skills. Obviously, not everyone is like that, but I would say many are.

[1] https://news.ycombinator.com/item?id=33394287#:~:text=The%20....

skydhash
0 replies
5h26m

If it will be BS, it should be understandable BS.

vsuperpower2020
0 replies
17h51m

For me, "senior" just counts the amount of time they've been doing something. If someone isn't very good at something after putting ten thousand hours into it, they just might work at microsoft.

johnnyanmac
0 replies
20h11m

Not a monopoly, but a majority. Many juniors who do have that potential don't ever get put in such a situation.

Junior/senior isn't necessarily about skill level; I'm sure many can find a senior with 1YOE ten times over. It's about trust both in technical and sociopolitical navigation through the job. That's only really gained with time and experience (and yes, isn't perfect. Hence, the aforementioned 1x10 senior. Still "trusted" more than a 1 year junior).

djoldman
13 replies
1d2h

Those are the times one gets the opposite of imposter syndrome.

ASalazarMX
8 replies
1d1h

Fortunately it's a temporary state, otherwise there's a risk of sliding into the Dunning-Kruger effect.

"You did awesome, but don't let it go to your head."

dylan604
7 replies
23h44m

F-that! That's one of those times where I re-enact the scene from the Bond film GoldenEye where the guy jumps up, extending both arms, yelling "Yes! I am invincible!" Of course I totally expect the hubris to be short-lived, just maybe not with liquid nitrogen.

https://www.youtube.com/watch?v=fXW02XmBGQw

squigz
6 replies
23h15m

I alternate between "I am the best programmer to ever exist" and "I am completely incompetent at this and I should quit" while debugging.

dylan604
3 replies
22h53m

I've been known to inform people that the person who wrote the incredibly horrendous code that caused whatever problems to occur should be fired immediately, knowing full well that I was the only dev to write any of the code.

dmd
2 replies
5h31m

Me, yesterday: What absolute piece of shit asshole wrote this shell script?
My wife: Was it you?
Me: Well, obviously.

dctoedt
1 replies
4h50m

> my wife: Was it you? me: Well obviously

"Research shows" (I read long ago) that the happiest men are those take their wives' advice — that's certainly been true for my own N=1, for coming up on four decades now. I'd imagine we could replace "wives" with "spouses" and get comparable results.

Angostura
0 replies
2h6m

> I'd imagine we could replace "wives" with "spouses" and get comparable results.

Haven’t you just risked an infinite loop in your code?

9659
0 replies
10m

If you are honest, it may not even alternate. Both feelings can exist at the same time.

brailsafe
2 replies
18h50m

This is very true, and we need these moments. Lately I've been struggling to figure out where my place is, how to maybe get back into freelancing, what I'm technically good at etc... lots of ruminating since I've been out of work for a year.

But.. I met someone at the gym who's been struggling with an esoteric problem on an ancient piece of software for over a decade, and they approached me to ask if I could solve it. I said "maybe", sat on it for a few days, and then replicated the issue on my machine and solved their problem in about an hour. I asked for $50 and they gave me double, which was wildly more rewarding than being paid $100k to write react all year, not that that salary is on the table any longer.

manmal
0 replies
3h53m

> not that that salary is on the table any longer

Is the job market so bad right now? I'm in a privileged position (and not in the US), so have no clue of the state of things.

9659
0 replies
11m

$50 for a one-off thing for a gym buddy is fine. In the blue collar world, that would be a 'case of beer' for helping me out.

But in business, you need to charge an honest / fair amount. (Sure, sometimes that 1-hour bug fix had $100K of 'value', but we could argue about honest / fair.)

You mention being an employee at $100K a year. Double that gives you a contractor rate of $100 an hour. That is the floor of what you should be asking; as in the absolute lowest. Another $50 or $100 an hour is still fair and honest in today's economy.

justinclift
0 replies
6h27m

There's an opposite to imposter syndrome?

drewg123
4 replies
23h11m

If modulus is expensive, and he's checking a power of 2, why not just use a bitwise AND?

E.g., for positive integers, x % 16 == x & 15. That should be trivially cheap.
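
A quick illustration of that identity (assuming unsigned values; for signed negative values the two expressions differ):

    #include <assert.h>
    #include <stdint.h>

    /* For unsigned x and a power-of-two modulus 2^k,
       x % 2^k is the same as x & (2^k - 1): a single AND. */
    int main(void) {
        for (uint16_t x = 0; x < 1000; x++) {
            assert((x % 16) == (x & 15));
            assert((x % 64) == (x & 63));
        }
        return 0;
    }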

ladberg
1 replies
23h5m

It wasn't `x % 16` it was `x % y` where x and y are 16-bit integers. A compiler would also have taken care of it if it were just a literal.

drewg123
0 replies
22h51m

Whoops.. I misread what he was doing.

xgkickt
0 replies
17h45m

Reading the comments, that's kinda what they did, though they had to learn that first and only now realize they only needed one.

kevin_thibedeau
0 replies
16h5m

Any top-tier C compiler will optimize modular division by a constant into a more efficient operation (or sequence of operations). It is better to keep the intent of the code clear rather than devolve it into increasingly obtuse bit-twiddling tricks the compiler can figure out on its own.

beebmam
4 replies
22h42m

Why is this not considered a compiler optimization and/or language problem? It seems to me that compiler optimizations for expressive programming languages should be able to handle something like this.

JonChesterfield
3 replies
20h4m

What would you hope a compiler to optimise x % y into?

Higher level change-the-algorithm aspirations haven't really been met by sufficiently smart compilers yet, with the possible exception of scalar evolution turning loops into direct calculation of the result. E.g. I don't know of any that would turn bubble sort into a more reasonable sort routine.

vitus
2 replies
19h22m

If y is always a power of 2 (as suggested in the comments), then I'd expect it to turn into an AND of some sort.

And more generally, with older architectures, integer division was much slower than integer multiplication, so compilers would generally transform this into a multiplication plus some shifts [0]. For context in that timeframe, MUL on Sandy Bridge introduces 3-4 cycles worth of latency (depending on the exact variant), compared to DIV introducing 20+ (per Agner Fog's excellent instruction tables [1]). So even computing x - y * (x / y) with the clever math to replace x/y would be much faster than just x%y. (It's somewhat closer today, but integer division is still fairly slow.)

[0] https://news.ycombinator.com/item?id=1131177 (the linked article 404s now, but it's archived: https://web.archive.org/web/20110222015211/https://ridiculou...)

[1] page 220 of https://www.agner.org/optimize/instruction_tables.pdf
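
As a sketch of that divide-multiply-subtract approach for the 16-bit case (not the firmware's actual code; the function names and the 32-bit reciprocal width here are assumptions), one can precompute a fixed-point reciprocal for the constant denominator, get the quotient with a multiply and a shift, and recover the remainder from the identity x - y*(x/y):

    #include <stdint.h>

    /* Precompute once per fixed denominator d (1 <= d < 2^16):
       recip = floor(2^32 / d) + 1. */
    static inline uint64_t make_recip(uint16_t d) {
        return (UINT64_C(1) << 32) / d + 1;
    }

    /* x % d with no divide instruction in the hot path:
       q = x / d via multiply-and-shift, then remainder = x - q * d.
       Exact for all 16-bit x because the rounding error in recip,
       scaled by x, stays below 2^32. */
    static inline uint16_t mod_by_const(uint16_t x, uint16_t d, uint64_t recip) {
        uint16_t q = (uint16_t)((x * recip) >> 32);
        return (uint16_t)(x - q * d);
    }

With a fixed denominator the reciprocal is computed once, so the per-call cost is a multiply, a shift, another multiply, and a subtract.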

wizzwizz4
1 replies
19h20m

> So even computing x - y * (x / y) with the clever math to replace x/y would be much faster than just x%y.

That only works when y is constant. Otherwise, you need to work out what to replace x/y with… which ultimately takes longer than just using the DIV instruction.

vitus
0 replies
19h13m

> That only works when y is constant.

Excellent point! That said, that was the case in this particular example.

This 16-bit modulo was just a final check that the correct number of bytes or bits were being sent (expecting remainder zero), so the denominator was going to be the same every time.

(Libraries like libdivide allow you to memoize the magic numbers for frequently-used denominators, and if on x86 you have floating point operations with more precision than you need for integer division, you can potentially use the FPU instead of the ALU: https://lemire.me/blog/2017/11/16/fast-exact-integer-divisio...)

brogrammernot
10 replies
1d2h

This exact type of thing is why, when I switched to the dark side (product) and sat in management meetings where non-technical folks would often suggest “we could measure by lines of code or similar” for productivity, I pointed out what a bad idea that was.

Did I win? Of course not. It's hard for non-technical people to fully appreciate these things, or any sort of larger infrastructure work, especially for developer productivity, because it always goes back to "well, how are you going to measure that ROI?"

Anyways, this was fun to read and brought back good engineering memories. I’d also like to say, as it brought back a bug I chased forever: fuck you, ChannelFactory in C#.

khazhoux
3 replies
10h58m

But you must admit, this is not the common case. If a developer regularly takes 3 months to fix every bug, then those all better be nasty heisenbeasts, because it's more likely that the developer is just slow.

mjburgess
1 replies
4h50m

The issue is the assumption that managers can assess productivity as long as it is captured by a number, when in a conversation they'd be hopeless at it. I.e., they're reducing it to a number to hide the fact that they cannot do it.

It's strange that we haven't figured out how to trust technical leadership to assess these things for management.

In many ways, the answer is obvious: give technical leaders economic incentives for team productivity. They will then use their expertise to actually assess relevant teams.

zarathustreal
0 replies
4h20m

I think the problem is that in order to evaluate something you need to have equal understanding of it as the person that made it. This is a problem in every field, everywhere. In fact it’s the reason pure democracy isn’t the optimal strategy for governance - the masses aren’t really qualified to make decisions.

Taylor_OD
0 replies
1h32m

Slow, or has bad debugging abilities. This article is noteworthy because of the length of time taken and allowed for a bug fix. I can imagine almost any manager saying this isn't high priority enough for this time investment after week 2 or month 1.

neonsunset
2 replies
1d1h

Troubleshooting vendor WCF SDK version mismatch was not fun, and the guy who had to reverse engineer it to attempt a .NET Core port probably lost a few years off his lifespan (this was before CoreWCF was a thing).

When people bash gRPC today, they don't know of the horrors of the past.

brogrammernot
1 replies
23h33m

Yeah, I’ve lived the life of straddling .NET Core and ASP.NET while also dealing with React vs Angular2+ and having half of the system in the script bundling hell that was razor views and all sorts of craziness.

That experience is actually what led me to switch over to Product among other things, I get it when people joke (half joke) about considering retirement rather than going through that again.

neonsunset
0 replies
23h13m

At the time, we had already been using React for front-end widgets so migrating most other parts to then latest .NET Core 3.1 went surprisingly smooth. There were a couple of EF queries that stopped working as EF Core disabled application side evaluation by default, but that was ultimately a good thing as the intention wasn't to pull more data than needed.

Instead, the actual source of problems was K8S and the huge amount of institutional knowledge it required that wasn't there. I still don't think K8S is that good, it's useful but it and containerized environments in general to this day have a lot of rough edges and poorly implemented design aspects - involved runtimes like .NET CLR and OpenJDK end up having to do special handling for them because reporting of core count and available memory is still scuffed while the storage is likely to be a network drive. The latter is not an issue in C# where pretty much all I/O code is non-blocking so there is no impact on application responsiveness, but it still violates many expectations. Aspects of easy horizontal scaling and focus on lean deployments are primarily more useful for worse languages with weaker runtimes that cannot scale as well within a single process.

I suppose, a silver lining to your situation on the other hand is that developers get to have a PO/PM with strong technical background which makes so many communication issues go away.

Swizec
1 replies
1d

Have you ever suggested that management/leadership should measure productivity by lines of document text written? They might better grok how that’s a bad idea. Especially since many of them much prefer to communicate in bullet-pointed slides than documents.

joshspankit
0 replies
17h40m

Or measure their mechanic's productivity by number of hours spent on the car

jonathanlydall
0 replies
1d1h

I really miss working with WCF, said no one ever.

winrid
6 replies
1d1h

Reminds me of fixing an ~11yr old bug in Enemy Territory. I had to spend a night debugging the C code only to realize the issue was in the UI config: https://github.com/etlegacy/etlegacy-deprecated/pull/100/fil...

(IIRC UI scrolled twice for every mouse movement + you couldn't select items in server browser with mouse wheel as it would skip every other one)

lostlogin
3 replies
1d1h

That was such a great game but sadly it seemed to fizzle out. There were lots of neat exploits which made it even better. I also liked the communication style, with pre-canned messages you could send with certain key combos.

Terr_
1 replies
20h52m

In a similar vein, the voice tree from Starsiege: Tribes (1998) was mind-blowing for the dialup era.

Ex: VSAB -> "I am attacking the enemy base!" (Voice, Self, Attacking, Base)

ramses0
0 replies
19h2m

VGS! Midair keeps the torch lit, and has had a re-release, but they're less "base" and more CTF with lights.

winrid
0 replies
20h59m

There's usually a full server or two. ETLegacy has plenty of players for me. but yeah, the communication style is fun. and if you join a server on axis you'll usually get spammed with "Hallo! Hallo! Hallo!" :)

intelVISA
1 replies
5h56m

WolfET was such good fun, awesome.

winrid
0 replies
2h42m

still is! just make sure you play ETLegacy as it's a more maintained client.

rented_mule
5 replies
1d1h

The worst I experienced in this direction was also on a consumer device about 15 years ago. Performance was degraded and we couldn't explain it. A team of 5 of us was assembled to figure it out.

We spent over three months on it before finding a root cause. It was over two months before we could even understand how to measure it - we were seeing parts of the automated overnight test suite run taking longer, but every night it would be different tests that were slow. A key finding was that almost everything was slow on some boots of the device and fast on other boots of the device, and there was a reboot before each test was run. Doing some manual testing showed it being close to a 50% chance of a boot leading to slowness. Now what?

I eventually got frustrated and took the brute force / mindless approach... binary search over commits. Unfortunately, that wasn't easy because our build was 45-60 minutes, and then there was a heavily manual installation process that took 10-20 minutes, followed by several reboots to see if anything was slow. And there were several thousand commits since the last known good build (the previously shipped version of the device). The build/install/testing process was not easily automated, and we were not on git, otherwise using git-bisect would have been nice. Instead, I spent weeks doing the binary search manually.

That yielded the offending commit. The problem was that it was a massive commit (tens of thousands of lines of code) from a group in another part of the company. It was a snapshot of all of their development over the course of a couple of years. The commit message, and the authors, stated that the commit was a no-op with everything behind a disabled feature flag.

So now it was onto code level binary search. Keep deleting about half of the code in the commit, in this case by chunks that are intended to be inactive. After eventually deleting all the inactive code, there were still a few dozen lines of changes in a Linux subsystem that did window compositing. Those lines of code were all quite interdependent, so it was hard to delete much and keep things functional, so now on to walking through code. At least I could use my brain again!

Using the clue that the problem was happening about half the time and given that this code was in C, I started looking for uninitialized booleans. Sure enough, there was one called something like `enable_transparency`. Disabled code was setting it to `true`, but nothing was setting it to `false` when their system was disabled. Before their commit, there was no variable - `false` was being passed into the initializer call directly. Adding `= false` to the declaration was the fix.
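
A minimal sketch of that class of bug, with purely hypothetical names (the real code lived in the window-compositing subsystem):

    #include <stdbool.h>

    void init_compositor(bool enable_transparency);   /* illustrative prototype */

    /* Before the offending commit the call site passed a literal false.
       The commit introduced a flag variable that was only assigned inside
       the compiled-out feature code, leaving it indeterminate on many boots. */
    void setup_compositing(void) {
        bool enable_transparency;              /* BUG: never initialized when the feature is off */
    #ifdef WINDOW_TRANSPARENCY_FEATURE         /* disabled feature flag: this block never compiles */
        enable_transparency = true;
    #endif
        init_compositor(enable_transparency);  /* sometimes true by accident -> slow boots */
    }

    /* The one-line fix: bool enable_transparency = false; */

As the replies below note, tools like valgrind or MSan would typically flag the branch on that indeterminate value.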

So, well over a year of engineering hours spent to figure out the issue. The upside is that some people on the team didn't know how to proceed, so they spent their time speeding up random things that were slow. So the device ended up being noticeably faster when we were done. But it was pretty stressful as we were closing in on our launch date with little visibility into whether we'd figure it out or not.

hoten
3 replies
21h7m

Oh man, that sounds rough. I salute you.

This probably wasn't an option back then with your toolchain, but it's so reassuring to know modern compilers / ASAN are amazing at catching this class of bugs today.

leni536
2 replies
20h9m

AFAIK ASAN does not catch uninitialized variables, MSAN does. MSAN is significantly harder to set up.

JonChesterfield
1 replies
19h56m

Branch on uninit lights up beautifully in valgrind, which needs no setup: just run `valgrind ./a.out`

leni536
0 replies
19h49m

Good point, although sometimes valgrind is too slow.

namrog84
0 replies
19h48m

C++ senior dev here. Of the few teams I've been on, one of the first things I make sure is set up right is cranking up warnings and treating warnings as errors, which includes things like uninitialized variables. I then fix up the errors and make sure they are part of the build gates.

These types of problems (undefined behavior and/or uninitialized variables) are often hard (time-consuming) to diagnose and fairly common.

Lots of places overlook simple static analysis or built-in compiler features.

pvaldes
2 replies
1d2h

"I also ended up needing to find a Perl script that was buried deep in some university website. I still don’t know anything about Perl, but I got it to run"

Find dusty Perl script forgotten for years. Still works

Not the first time that I hear that

nikanj
1 replies
1d1h

Outside of javascript, it’s a pretty reasonable assumption that if you have the sources, you can get them to run

spc476
0 replies
19h13m

Most Perl code nowadays runs under Perl5. I once tried running a Perl4 script in Perl5 and did not have a good day.

omoikane
2 replies
1d

> given a fixed denominator, any 16-bit modulo can be rewritten as three 8-bit modulos

Anybody know what's the exact transformation here? I searched around and found this answer, but it doesn't work:

https://stackoverflow.com/a/10441333

o11c
0 replies
20h46m

If the denominator is a constant, wouldn't it be faster to use the divmod identity to turn it into (divide, multiply, subtract), then use the usual constant-divide-is-multiply-and-shift optimization?

justincredible
0 replies
22h7m

The article isn't very clear but assuming it's a 16-bit numerator and 8-bit denominator, then MSN's answer to [0] lays it out (although for higher bit sizes). If the denominator was 16-bit, then the top-rated answer (by caf) to the same SO question seems like another approach, but that wouldn't be a one line change.

[0] https://stackoverflow.com/questions/2566010/fastest-way-to-c...

tommiegannert
1 replies
1d

Kudos also to the original author for not doing premature optimization, of course. It wasn't until the iPad that it was needed. However, a TODO might have been useful. ;)

tedunangst
0 replies
23h20m

Only for users that didn't use both features at the same time. Users who did probably experienced the same bug, but it took until a critical mass of users reported the bug to get it fixed. At which point the fix probably took four times longer than necessary because the developer was unfamiliar with the design and the toolchain had decayed.

shermantanktop
1 replies
1d3h

This kind of bug is always an emotional rollercoaster of anticipation, discovery, disappointment, angst, self-criticality, and satisfaction.

Tao3300
0 replies
19h21m

And then they screw you over on performance reviews because it was pointed to fit within a single sprint.

m3kw9
1 replies
1d3h

These one-line fixes always seem like a stupid bug, but in reality most bugs are like this and the fix is in the discovery.

xeromal
0 replies
1d1h

One of the reasons I struggle to give ETAs on fixing a bug. The moment I know what the issue is, the solution to fix it is usually already figured out barring a rearchitecture of some services or infrastructure.

creeble
1 replies
1d1h

Ha, coincidentally, I designed and built an 8051-based MIDI switch in the early 90’s. There weren’t that many good tools at the time, and I designed everything from the software and UI to the circuit board and rack-mount case.

I even wrote an 8051 assembler in C, but found a good tiny-C compiler for it before it went into production.

You are not a programmer unless you’ve written key-debounce code :)

(OTOH, some of the worst programmers I’ve ever had the displeasure of working with were amazing low-level code hackers. In olden times, it seems like you were either good at that level of abstraction, or you were good at a much different [“higher”] level, seldom both.)

r4nd_f
0 replies
3h53m

> You are not a programmer unless you’ve written key-debounce code :)

I've had to do that once, and I still consider it a blessing from Satan above that I both a) figured it out b) it worked every time (plus bonus C: explained the logic to better students in my class)

thenoblesunfish
0 replies
11h10m

Why is it so bad for a second note-on to leave the original note sounding? It makes no sense for keyboards, but maybe you really do want two of the same note sounding.

readthenotes1
0 replies
1d2h

"...it was based on a USB product we had already been making for PCs for almost a decade.

This product was so old in fact that nobody knew how to compile the source code. "

I think you mean "Management was so bad, nobody knew how to compile the source code".

There are plenty of systems out there that can be reproduced from source and plenty that cannot. The biggest difference is the care taken to do so, not the age.

pelagicAustral
0 replies
1d3h

Some of the stuff I've struggled with the most over the years has been SQL constraints that are not documented. I remember (probably like 10 years ago) I deployed an update to an ancient Windows Forms implementation that deprecated some login and instead made use of Windows Authentication. It worked like a charm for all users but one! Checked everything, replicated the machine, tried so much weird stuff, and in the end, what was happening is that the "Users" table had a constraint on the number of characters for the username. This username was over the limit and was not being validated... Another one was a report that was giving the wrong amount, but getting the data from the database seemed to do the math right... it was the damn Money datatype; changed it to decimal, done...

magwa101
0 replies
1d1h

Similarly, I spent 6 weeks on a kernel token-ring driver intermittent initialization issue. This required kernel restarts over and over to observe the issue. Breakpoints were useless as they hid the issue. Turns out initialization in a specific step was not synchronous and reading the status was a race condition. It took weeks of staring, joking around, thinking, bs'ing, then suddenly, voila. Changed the order of the code, worked.

langsoul-com
0 replies
9h40m

Half the challenge of the bug isn't fixing the code. It's finding wtf is happening.

iam-TJ
0 replies
10h21m

Based on experience gained from debugging complex issues and code over decades I have a mantra I repeat to myself and others:

"Almost every bug turns out to be a 1 that should be 0, or a 0 that should be 1"

Keeping this in mind often keeps one focused on the detail of the underlying binary values and how they are being manipulated.

halifaxbeard
0 replies
1d

Reminds me of a bug I fixed in yamux, simply because of how long I've had to deal with it. Bug existed for as long as yamux did. (yamux is used by hashicorp for stream muxing everywhere in their products.)

If yamux's keepalive fails/times out, and you're calling Read on a demuxed stream, it blocks forever.

https://github.com/hashicorp/yamux/pull/127

gred
0 replies
4h29m

I love that the fix was an optimization allowing the code to keep the simplifying assumption that only one MIDI event ever needs to be buffered... rather than a "cleaner" / "future-proof" design change allowing buffering of more than one MIDI event.

figassis
0 replies
20h32m

The number of times I bumped my head against a desk, after missing multiple deadlines, and then out of nowhere having a random moment of clarity such as “this gives me X vibes, but it would be insane if this was actually the case”, and then I do a quick string search and there it is.

anytime5704
0 replies
16h38m

First time I’ve seen hackernews link to Lemmy.

Love to see it. That place needs more organic growth.

EvgeniyZh
0 replies
14h18m

I recently found a bug whose fix amounted to a one-liner. It all started with a random CI failure while I was working on adding some new functionality.

I reran the test locally -- no failure. I changed the seed to the one used in the failing run -- nothing.

I added a loop to the code to repeat the test a hundred times -- still nothing. I ran the test in a bash loop a hundred times -- 3 failures. So this already hints at some internal problem. I fixed every possible source of randomness and verified that all the inputs are identical between the runs -- it still fails only once in a while. I started building an MWE, but the function involved in the reproduction is fairly complicated. I'm left with hundreds of lines of Jax code which fail in a couple percent of cases.

I look at the output of the compiler, and it is identical between failing and successful runs. So the problem is in the compiled code. The compiled code is ~1000 lines of HLO (not much better than assembly). Unfortunately HLO tooling is both unfamiliar to me and not a good fit for this case (or at least I couldn't figure it out). So I start manually bisecting the code. I'm finally left with ~30 lines of HLO. It fails even less often (1% maybe), but at least it runs fast. It also seems to fail in exactly the same way (i.e., there is a single incorrect output shared between the 3 failures). Now that's something the maintainers can be hoped to look at.

It turned out that matrices with the same content but different layouts were deduplicated, leading, in my case, to a transposed matrix being replaced by a non-transposed one. The hash used for storage did take layout into account, so the bug appeared only if two entries ended up in the same bucket (~3% of the time). The fix was an obvious one-liner [1].

[1] https://github.com/openxla/xla/commit/76e7353599d914546f9b30...
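
A toy sketch of that hash/equality mismatch (nothing to do with XLA's actual data structures; all names here are made up): the hash covers the layout, the deduplication equality check does not, so the wrong merge only happens when two entries collide into the same bucket.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        const float *data;   /* element values */
        size_t len;          /* number of elements */
        int layout;          /* e.g. 0 = row-major, 1 = column-major (transposed) */
    } Constant;

    /* The hash covers both the bytes and the layout, so constants with equal
       content but different layouts usually land in different buckets... */
    static uint64_t hash_constant(const Constant *c) {
        uint64_t h = 0xcbf29ce484222325ull;               /* 64-bit FNV-1a */
        const uint8_t *bytes = (const uint8_t *)c->data;
        for (size_t i = 0; i < c->len * sizeof(float); i++)
            h = (h ^ bytes[i]) * 0x100000001b3ull;
        return (h ^ (uint64_t)c->layout) * 0x100000001b3ull;
    }

    /* ...but the equality check used for deduplication ignores layout, so on the
       rare bucket collision a transposed constant silently replaces (or is
       replaced by) its non-transposed twin. The one-line fix is to compare
       layout here as well. */
    static bool equal_constant(const Constant *a, const Constant *b) {
        return a->len == b->len &&
               memcmp(a->data, b->data, a->len * sizeof(float)) == 0;
    }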