
Ask HN: Machine learning engineers, what do you do at work?

davedx
88 replies
2d

pip install pytorch

Environment broken

Spend 4 hours fixing python environment

pip install Pillow

Something something incorrect cpu architecture for your Macbook

Spend another 4 hours reinstalling everything from scratch after nuking every single mention of python

pip install … oh time to go home!

sigmoid10
10 replies
2d

If you're still doing ML locally in 2024 and also use an ARM macbook, you're asking for trouble.

nicce
2 replies
1d23h

ARM macbook

Funnily, the only real competitor for Nvidia's GPUs is a MacBook with 128GB of RAM.

hu3
0 replies
1d5h

And they don't compete in performance.

hkt
0 replies
1d3h

I see your contemporary hardware choices and raise you my P900 ThinkStation with 256GB of RAM and 48 Xeon cores. Eventually it might even acquire modern graphics hardware.

spmurrayzzz
1 replies
1d23h

Can you expand on this a bit? My recent experiences with MLX have been really positive, so I'm curious what footguns you're alluding to here.

(I don't do most of my work locally, but for smaller models it's pretty convenient to work on my mbp).

sigmoid10
0 replies
1d5h

MPS implementations generally lag behind CUDA kernels, especially for new and cutting edge stuff. Sure, if you're only running CPU inference or only want to use the GPU for simple or well established models, then things have gotten to the point where you can almost get the plug and play experience on Apple silicon. But if you're doing research level stuff and training your own models, the hassle is just not worth it once you see how convenient ML has become in the cloud. Especially since you don't really want to store large training datasets locally anyways.

blitzar
1 replies
1d9h

Nahh im l33t - intel macbook and no troubles.

rwalle
0 replies
1d5h

can't read? parent clearly says "ARM macbook".

genevra
0 replies
1d23h

For real

davedx
0 replies
1d12h

What can I say, I enjoy pain!?

anArbitraryOne
0 replies
1d21h

I wish my company would understand this and let us use something else. Luckily, they don't really seem to care that I use my Linux based gaming machine most of the time

makapuf
10 replies
2d

Maybe pip should not work by default (but python -m venv then pip install should)
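
Something like this (a minimal sketch of that flow; the .venv name is just a convention):

    python -m venv .venv          # create an isolated environment for this project
    .venv/bin/pip install pillow  # pip only ever touches this project's environment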

avmich
8 replies
2d

Legends say there were times when you'd have a program.c file and just run cc program.c, and then could just execute the compiled result. Funny that the programmer's job is highly automatable, yet we invent tons of intermediate layers for ourselves which we absolutely have to deal with manually.

EnergyAmy
3 replies
1d23h

And then you'd have to deal with wrong glibc versions or mysterious segfaults or undefined behavior or the code assuming the wrong arch or ...

KeplerBoy
2 replies
1d22h

python solves none of those issues. It just adds a myriad of ways those problems can get to you.

All of a sudden you have people with C problems, who have no idea they're even using compiled dependencies.

EnergyAmy
1 replies
1d22h

In theory you're right, CPython is written in C and it could segfault or display undefined behavior. In practice, you're quite wrong.

It's not really much of a counterargument to say that Python is good enough that you don't have to care what's under the hood, except when it breaks because C sucks so badly.

KeplerBoy
0 replies
1d22h

I was specifically talking about python packages using C. You type "pip install" and god knows what's going to happen. It might pull a precompiled wheel, it might just compile and link some C or Fortran code, it might need external dependencies. It might install flawlessly and crash as soon as you try to run it. All bets are off.

I've never experienced CPython itself segfault; it's always due to some package.

makapuf
1 replies
1d10h

I agree simplicity is king. But you're comparing a script that uses dependencies (and the tooling for those dependencies) with a C program that has no dependencies. You can download a simple Python script and run it directly if it has no dependencies besides the stdlib (which is way larger in Python). That's why I love using bottle.py, for example.

avmich
0 replies
1d3h

Agree. But even with dependencies, running "make" seems way simpler than having to install a particular version of the tools for a project, make a venv, and then pick versions of dependencies.

The point is the same - we had it simpler and now, with all capabilities for automation, we have it more complex.

Frankly, I suspect most of the efforts now are spent fighting non-essential complexities, like compatibilities, instead of solving the problem at hand. That means we create problems for ourselves faster than removing them.

reportgunner
0 replies
3h2m

You can do the same in python if you're not new to programming.

davedx
0 replies
1d12h

I actually did a small C project a couple of years ago; the spartan simplicity there can have its own pain too, like having to maintain a Makefile. LOL. It’s swings and roundabouts!

jononor
0 replies
20h15m

Some Linux distros are moving that way, particularly for the included Python/pip version. My Arch Linux has already done so for some years, and I did not set that up myself, so I think it is the default.

SushiHippie
8 replies
1d23h

Can recommend using conda, more specifically mambaforge/micromamba (no licensing issues when used at work).

This works way better than pip, as it does more dependency checking, so it does not break as easily as pip, though this definitely makes it way slower when installing something. It also supports updating your environment to the newest versions of all packages.
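
Day to day it looks something like this (a rough sketch; the environment and package names are just examples):

    micromamba create -n ml -c conda-forge python=3.11 pytorch pillow
    micromamba activate ml
    micromamba update --all    # upgrade the whole environment in one go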

It's no silver bullet and mixing it with pip leads to even more breakages, but there is pixi [0] which aims to support interop between pypi and conda packages

[0] https://prefix.dev/

tasuki
5 replies
1d23h

I had a bad experience with Conda:

- If they're so good at dependency management, why is Conda installed through a magical shell script?

- It's slow as molasses.

- Choosing between Anaconda/Miniconda...

When forced to use Python, I prefer Poetry, or just pip with freezing the dependencies.

The Python people probably can't even imagine how great dependency management is in all the other languages...

akkad33
1 replies
1d22h

Mamba/micromamba solves the slowness problem of conda

epoxia
0 replies
1d22h

To add: Conda has parallelized downloads now and is faster. Not as fast as mamba, but faster than previously. PR merged Sep 2022 -> https://github.com/conda/conda/pull/11841

tedivm
0 replies
1d22h

I absolutely hate conda. I had to support a bunch of researchers who all used it and it was a nightmare.

semi-extrinsic
0 replies
1d12h

rye and uv, while "experimental", are orders of magnitude better than poetry and pip IMHO.

SushiHippie
0 replies
1d17h

Yeah, I agree, maybe I should have also mentioned the bad things about it, but after trying many different tools that's the one I stuck with, as creating/destroying environments is a breeze once you get it working, and the only time my environment broke was when I used pip in that environment.

The Python people probably can't even imagine how great dependency management is in all the other languages...

Yep, I wish I could use another language at work.

Choosing between Anaconda/Miniconda...

I went straight with mamba/micromamba as anaconda isn't open source.

reportgunner
0 replies
3h6m

Whenever anyone recommends conda I automatically assume they don't know much about python.

davedx
0 replies
1d12h

Yes I started with conda I think and ended up switching to venv and I can’t even remember why, it’s a painful blur now. It was almost certainly user error too somewhere along the way (probably one of the earlier steps), but recovering from it had me seriously considering buying a Linux laptop.

This happened about a week ago

ninkendo
7 replies
1d6h

I count at least a half dozen “just use X” replies to this comment, for at least a half dozen values of X, where X is some wrapper on top of pip or a replacement for pip or some virtual environment or some alternative to a virtual environment etc etc etc.

Why is python dependency management so cancerously bad? Why are there so many “solutions” to this problem that seem to be out of date as soon as they exist?

Are python engineers just bad, or?

(Background: I never used python except for one time when I took a coursera ML course and was immediately assaulted with conda/miniconda/venv/pip/etc etc and immediately came away with a terrible impression of the ecosystem.)

sjducb
2 replies
1d5h

Two problems intersect:

- You can’t have two versions of the same package in the namespace at the same time.

- The Python ecosystem is very bad at backwards compatibility

This means that you might require one package that requires foo below version 1.2 and another package that requires foo version 2 and above.

There is no good solution to the above problem.

This problem is amplified when lots of the packages were written by academics 10 years ago and are no longer maintained.

The bad solutions are:

1) Have 2 venvs - not always possible, and if you keep making venvs you’ll have loads of them.

2) Rewrite your code to only use one library.

3) Update one of the libraries.

4) Don’t care about the mismatch and cross your fingers that the old one will work with the newer library.

Most of the tooling follows approach 1 or 4

fragmede
1 replies
19h58m

Disk space is cheap, so where it's possible to have 2 (or more) venvs, that seems easiest. The problem with venvs is that they don't automatically activate. I've been using a very simple wrapper around python to automatically activate venvs so I can just cd into the directory and do python foo.py and have it use the local venv.

I threw it online at https://github.com/fragmede/python-wool/
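
The gist is something like this (just a sketch of the idea, not the actual python-wool code; it assumes a project-local venv/ directory):

    # shell function: prefer the project's venv python if one exists
    python() {
        if [ -x "./venv/bin/python" ]; then
            ./venv/bin/python "$@"
        else
            command python3 "$@"
        fi
    }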

sjducb
0 replies
11h4m

You’re already managing a few hundred dependencies and their versions. Each venv roughly doubles the number of dependencies and they all have slightly different versions.

Now you're 15 venvs deep and have over 3000 different package/version combinations installed. Your job is to upgrade them right now because of a level 8 CVE.

reportgunner
0 replies
3h8m

Just replace python with mac or apple in your comment and I think you will understand.

lloydatkinson
0 replies
8h15m

Unironically yes, it really is that bad. A moderately bad language that happened to have some popular data science and math libraries from the beginning.

I can only imagine it seemed like an oasis compared to R, which is bottom tier.

So when you combine data scientists, academics, mathematicians, juniors, grifters selling courses…

things like bad package management, horrible developer experience, absolutely no drive in the ecosystem to improve anything beyond the “pinnacle” of wrapping C libraries are all both inevitable and symptoms of a poorly designed ecosystem.

globular-toast
0 replies
20h49m

It's not bad. It works really well. There's always room for improvement. That's technology for you. Python probably does attract more than its fair share of bad engineers, though.

fbdab103
0 replies
1d2h

I think it is worth separating the Python ML ecosystem from the rest. While traditional Python environment management has many sore points, it is usually not terrible (though there are many gotchas and still-to-this-day problems that should have been corrected long ago).

The ML ecosystem is a whole other stack of problems. The elephant in the room is Nvidia, which is not known for playing well with others. Aside from that, the state of the art in ML is churning rapidly as new improvements are identified.

next_xibalba
6 replies
1d23h

Do people doing ML/DS not use conda anymore?

buildbot
2 replies
1d23h

A lot do, personally, every single time I try to go back to conda/mamba whatever, I get some extremely weird C/C++ related linking bug - just recently, I ran into an issue where the environment was _almost_ completely isolated from the OS distro's C/C++ build infra, except for LD, which was apparently so old it was missing the vpdpbusd instruction (https://github.com/google/XNNPACK/issues/6389). Except the thing was, that wouldn't happen when building outside of the Conda environment. Very confusing. Standard virtualenvs are boring but nearly always work as expected in comparison.

I'm an Applied Scientist vs. ML Engineer, if that matters.

astromaniak
1 replies
1d15h

It's probably easier to reinstall everything anew from time to time. Instead of fixing a broken 18.04, just move to 22.04. Most tools should work, if you don't have a huge codebase which requires an old compiler...

Conda... it interferes with the OS setup and doesn't always have the best utils. Like ffmpeg being compiled with limited options, probably due to licensing.

buildbot
0 replies
1d12h

I do all the time, and always have (in fact my first job was bare metal OS install automation); this was Rocky 9.4. New codebase, new compiler, weird errors. I did actually reinstall and switch over to Ubuntu 24.04 after that issue lol.

copperroof
1 replies
1d14h

If they are they should stop.

It causes so many entirely unnecessary issues. The conda developers are directly responsible for maybe a month of my wasted debugging time. At my last job one of our questions for helping debug client library issues was “are you using conda”. And if so we would just say we can’t help you. Luckily it was rare, but if conda was involved it was 100% conda’s fault somehow, and it was always a stupid decision they made that flew in the face of the rest of the python packaging community.

Data scientists' Python issues are often caused by them not taking the 1-3 days it takes to fully understand their tool chain. It’s genuinely quite difficult to fuck up if you take the time once to learn how it all works, where your Python binaries are on your system, etc. Maybe not the case 5 years ago. But today it’s pretty simple.

buildbot
0 replies
1d12h

Fully agree with this. Understand the basic tools that currently exist and you'll be fine. Conda constantly fucks things up in weird hard to debug ways...

blt
0 replies
1d11h

I can't point at a single reason, but I got sick of it.

The interminable solves were awful. Mamba made it better, but can still be slow.

Plenty of more esoteric packages are on PyPI but not Conda. (Yes, you can install pip packages in a conda env file.)

Many packages have a default version and a conda-forge version; it's not always clear which you should use.

In Github CI, it takes extra time to install.

Upon installation it (by default) wants to mess around with your .bashrc and start every shell in a "base" environment.

It's operated by another company instead of the Python Software Foundation.

idk, none of these are deal-breakers, but I switched to venv and have not considered going back.

globular-toast
6 replies
1d10h

You could learn how to use Python. Just spend one of those 4 hours actually learning. Imagine just getting into a car and pressing controls until something happened. This wouldn't be allowed to happen in any other industry.

sirlunchalot
2 replies
1d10h

Could you be a bit more specific about what you mean by "You could learn how to use python"? What resources would you recommend to learn how to work around problems the OP has? What basic procedures/resources can you recommend to "learn python"? I work as a software developer alongside my studies and often face the same problems as OP that I would like to avoid. Very grateful for any tips!

globular-toast
0 replies
1d

Basically just use virtual environments via the venv module. The only thing you really need to know is that Python doesn't support having multiple versions of a package installed in the same environment. That means you need to get very familiar with creating (and destroying) environments. You don't need to know any of this if you just use tools that happen to be written in Python. But if you plan to write Python code then you do. It should be in Python books really, but they tend to skip over the boring stuff.
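
In practice that just means something like (a minimal sketch, assuming a requirements.txt):

    python -m venv .venv && .venv/bin/pip install -r requirements.txt   # create
    rm -rf .venv                                                        # destroy and start over when it breaks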

d0mine
0 replies
5h2m

How do you learn anything in the space of software engineering? In general, there are many different problems and even more solutions with different tradeoffs.

To avoid spending hours on fixing broken environments after a single "pip install", I would make it easy to roll back to a known state. For example, recreate the virtualenv from a lock requirements file stored in git: `pipenv sync` (or the corresponding command using your preferred tool).
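
With plain pip the same idea looks roughly like this (a sketch; the file name is just illustrative):

    .venv/bin/pip freeze > requirements.lock    # record a known-good state, commit it to git
    rm -rf .venv && python -m venv .venv
    .venv/bin/pip install -r requirements.lock  # roll back after a bad install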

mountainriver
1 replies
1d5h

Oh wow, I’ve been a Python engineer for over a decade and getting dependencies right for machine learning has very little to do with Python and everything to do with c++/cuda

globular-toast
0 replies
22h21m

I've done it. Isn't it just following instructions? What part of that means destroying every mention of Python on the system?

kobalsky
0 replies
1d3h

I've been programming with python for decades and the problem they are describing says more about the disastrous state of python's package management and the insane backwards compatibility stance python devs have.

Half of the problems I've helped some people solve stem from python devs insisting on shuffling std libraries around between minor versions.

Some libraries have a compatibility grid with different python minor versions, because of how often they break things.

mysteria
5 replies
1d23h

I thought most ML engineers use their laptops as dumb terminals and just remote into a Linux GPU server.

daemonologist
3 replies
1d17h

Yeah, the workday there looks pretty similar though, except that installing pytorch and pillow is usually no problem. Today it was flash-attn I spent the afternoon on.

ungamedplayer
2 replies
1d13h

Isn't this what containers are for? Someone somewhere gets it configured right, and then you download and run the pre-set-up container and add your job data? Or am I looking at the problem wrong?

rolisz
1 replies
1d13h

But then how do you test out the latest model that came out from who knows where and has the weirdest dependencies and a super obscure command to install?

fshbbdssbbgdd
0 replies
1d12h

Just email all your data to the author and ask them to run it for you.

davedx
0 replies
1d12h

Spoiler: my main role isn’t ML engineer :) and that doesn’t sound like a bad idea at all

posix_monad
3 replies
1d6h

Python's dominance is holding us back. We need a stack with a more principled approach to environments and native dependencies.

llm_trw
1 replies
1d5h

Here's what getting PyTorch built reproducibly looks like: https://hpc.guix.info/blog/2021/09/whats-in-a-package/

Since then the whole python ecosystem has gotten worse.

We are building towers on quicksand.

It's not about python, it's about people who don't care about dependencies.

hkt
0 replies
1d3h

Dependency management is just.. hard. It is one of the things where everything relies upon it but nobody thinks "hey, this is my responsibility to improve" so it is left to people who have the most motivation, academic posts, or grant funding. This is roughly the same problem that led to heartbleed for OpenSSL.

manusachi
0 replies
1d

Do you know what other ecosystem comes closest to the existing one in Python? I've heard good things about Julia.

13 years ago, when I was trying to explore the field, R seemed to be the most popular, but it looks like that's not the case anymore. (I didn't get into the field and just do regular SWE, so I'm not aware of the trends.)

There is also a lot of development in the Elixir ecosystem around the subject [1].

[1](https://dashbit.co/blog/elixir-ml-s1-2024-mlir-arrow-instruc...)

phaedrus
2 replies
1d20h

As an amateur game engine developer, I morosely reflect that my hobby seems to actually consist of endlessly chasing things that were broken by environment updates (OS, libraries, compiler, etc.). That is, most of the time I sit down to code, I actually spend nuking and reinstalling things that (I thought) were previously working.

Your comment makes me feel a little better that this is not merely some personal failing of focus, but happens in a professional setting too.

davedx
0 replies
1d12h

Oh god yes I remember trying to support old Android games several OS releases later… Impossible, I gave up!

It’s why I still use react, their backcompat is amazing

ClimaxGravely
0 replies
1d16h

Happens in AAA too but we tend to have teams that shield everyone from that before they get to work. I ran a team like that for a couple years.

For hobby stuff at home though I don't tend to hit those types of issues because my projects are pretty frozen dependency-wise. Do you really have OS updates break stuff for you often? I'm not sure I recall that happening on a home project in quite a while.

loftyal
2 replies
1d8h

...and people criticise node's node_modules. At least you don't spend hours doing this

mountainriver
0 replies
1d5h

Not nearly as hard of a problem. Python does work just fine when it’s pure Python. The trouble comes with all the C/Cuda dependencies in machine learning

RamblingCTO
0 replies
1d7h

But you do because your local node_modules and upstream are out of sync and CI is broken. Happens at least once a month just before a release of course. I'd rather have my code failing locally than trying to debug what's out of sync on upstream.

jshbmllr
2 replies
1d22h

I do this... but air-gapped :(

fragmede
0 replies
1d9h

that sounds very painful.

daemonologist
0 replies
1d17h

Oof. At our company only CI/CD agents (and laptops) are allowed to access the internet, and that's bad enough.

__rito__
2 replies
1d13h

Just use conda.

SoftTalker
1 replies
1d13h

Then you get one of my favorites: NVIDIA-<something> has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

davedx
0 replies
1d12h

Iirc I originally used conda because I couldn’t get faiss to work in venv, lol. That was a while ago though

vergessenmir
1 replies
1d11h

Why do ML on a Mac, especially nowadays, when you can do it on an Ubuntu-based machine?

Surely work can provide that?

zxexz
0 replies
1d9h

Why Ubuntu specifically? Not even being snarky. Calling out a specific distro, vs. the operating system itself. I've had more pain setting up ML environments with Ubuntu than a MacBook, personally - though pure Debian has been the easiest to get stable from scratch. Ubuntu usually screws me over one way or another after a month or so. I think I've spent a cumulative month of my life tracking down things related to changes in netplan, cloud-init, etc. Not to mention Ubuntu Pro spam being incessant, as official policy of Canonical [0]. I first used the distro all the way back at Warty Warthog, and it was my daily driver from Feisty until ~Xenial. I think it was the Silicon Valley ad in the MotD that was the last straw for me.

[0] https://bugs.launchpad.net/ubuntu/+source/ubuntu-meta/+bug/1...

el_benhameen
1 replies
1d12h

Environment broken

Something something incorrect cpu architecture for your Macbook

I’m glad I have something in common with the smart people around here.

davedx
0 replies
1d12h

Yeah, getting a good Python environment set up is a very humbling experience

SJC_Hacker
1 replies
1d15h

Docker is your friend

or pyenv at least

navbaker
0 replies
1d5h

Yes, I’ve switched from conda to a combination of dev containers and pyenv/pyenv-virtualenv on both my Linux and MacBook machines and couldn’t be happier
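
For example (a rough sketch; the version and environment name are just placeholders):

    pyenv install 3.11.9
    pyenv virtualenv 3.11.9 myproject
    pyenv local myproject    # auto-activates whenever you cd into this directory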

wszrfcbhujnikm
0 replies
1d12h

Docker fixes this!

rqtwteye
0 replies
1d22h

Oh my! This hits home. We have some test scripts written in python. Every time I try to run them after a few months I spend a day fixing the environment, package dependencies and other random stuff. Python is pretty nice once it works, but managing the environment can be a pain.

llm_trw
0 replies
1d5h

Just use Linux. Then you only have to fight the nvidia drivers.

htrp
0 replies
1d2h

Something something incorrect cpu architecture for your Macbook

Use standard cloud images

KingMachiavelli
0 replies
11h50m

Really can't recommend Nix for Python stuff more; it's by far the best at managing dependencies that use FFI/native extensions. It can be a pain sometimes to port your existing Poetry/etc. project to work with nix2lang converters, but it really pays off.
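
For a quick taste without porting anything (a sketch; the package names are just examples of what nixpkgs provides):

    nix-shell -p 'python3.withPackages (ps: with ps; [ numpy pillow ])'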

0x008
0 replies
1d10h

I can recommend trying poetry. It is a lot more successful at resolving dependencies than pip.

Although I think the UX of poetry is stupid and I do not agree with some design decisions, I have not had any dependency conflicts since I started using it.
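
The typical flow is something like (a minimal sketch):

    poetry init          # or: poetry new myproject
    poetry add pillow    # resolves and pins exact versions in poetry.lock
    poetry install       # recreates the same environment anywhere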

schmookeeg
34 replies
1d14h

Getting my models dunked on by people who can't open MS Outlook more than 3 tries out of 5 but who nevertheless have a remarkable depth of insight into their chosen domain of expertise. It's rather humbling.

Collaborating with nontechnical people is oddly my favorite part of doing MLE work right now. It wasn't the case when I did basic web/db stuff. They see me as a magician. I see them as voodoo priests and priestesses. When we get something trained up and forecasting that we both like, it's super fulfilling. I think for both sides.

Most of my modeling is healthcare related. I tease insights out of a monstrous data lake of claims, Rx, doctor notes, vital signs, diagnostic imagery, etc. What is also monstrous is how accessible this information is. HIPAA my left foot.

Since you seemed to be asking about the temporal realities: it's about 3 hours of meetings a week, probably another 3 doing task grooming/preparatory stuff, fixing some ETL problem, or doing a one-off query for the business; the rest is swimming around in the data trying to find a slight edge to forecast something that surprised us for a $million or two using our historical snapshots. It's like playing Where's Waldo with math. And the Waldo scene ends up being about 50TB or so in size. :D

saulrh
15 replies
1d12h

HIPAA my left foot.

That was my experience as well - training documentation for fresh college grads (i.e. me) directed new engineers to just... send SQL queries to production to learn. There was a process for gaining permissions, there were audit logs, but the only sign-off you needed was your manager, permission lasted 12 months, and the managers just rubber-stamped everyone.

That was ten years ago. Every time I think about it I find myself hoping that things have gotten better and knowing they haven't.

jollofricepeas
11 replies
1d5h

So…

Is it surprising that engineers in healthcare don't read the actual HIPAA documentation?

Use of health data is permitted so long as it’s for payment, treatment or operations. Disclosures and patient consent are not required.

There are helpful summaries on the US Department of Health and Human Services website of the various rules (Security, Privacy & Notification)

Source: https://www.hhs.gov/hipaa/for-professionals/privacy/guidance...

This allowance is permitted to covered entities and by extension their vendors (business associates) by HIPAA.

If it wasn’t, then theoretically the US healthcare industry would grind to a halt, considering the number of intermediaries for a single transaction.

Example:

Doctor writes script -> EHR -> Pharmacy -> Switch -> Clearinghouse -> PA Processing -> PBM/Plan makes determination

Along this flow there are other possible branches and vendors.

It’s beyond complex.

outside1234
4 replies
1d3h

Define operations, because that sounds like a loophole that basically allows you to use it for anything

Spooky23
2 replies
4h27m

HIPAA protects you against gossipy staff. Beyond that, it’s mostly smoke - entire industries are built on loopholes.

Pre-HIPAA, the hospital would tell a news reporter about your status. Now, drug marketing companies know about your prescriptions before your pharmacy does.

specialist
1 replies
3h43m

Mid aughts, we devs sat in the meetings discussing how to enable this. With teams from Google Health, Microsoft Health Vault, some pharmas.

In attendance were marketing, biz dev, execs. Ours and theirs. But no legal. Hmmm.

And it would have worked if it weren't for those meddling teenagers.

From your comment, I'm inferring they're now allowed to use health records for marketing.

Spooky23
0 replies
1h26m

My wife about 9 years ago was admitted to the hospital for a ruptured ectopic pregnancy. The baby would have been about two months along.

On the would-be due date, a box arrived via FedEx of Enfamil samples and a chirpy “welcome baby” message. Of course there was no baby.

It turns out that using prescription data, admission data and other information that is aggregated as part of subrogation and other processes, you can partially de-anonymize/rebuild a health record and make an inference. Enfamil told me what list they used and I bought it; it contained a bunch of information, including my wife’s. I also know everyone in my zip code who had diabetes in 2015.

There’s even more intrusive stuff around opioid surveillance.

connicpu
0 replies
20h6m

Basically the only thing you can't do with the data is disclose it to someone who doesn't also fall under HIPAA

aiforecastthway
3 replies
1d4h

> If it wasn’t then, theoretically the US healthcare industry would grind to a halt considering the number of intermediaries for a single transaction.

It just occurred to me that cleaning up our country's data privacy / data ownership mess might have extraordinarily positive second-order effects on our Kafkaesque and criminally expensive healthcare "system".

Maybe making it functionally impossible for there to be hundreds of middlemen between me and my doctor would be a... good thing?

throwaway936r8
0 replies
17h20m

I think it would have the opposite effect. We could end up with a few all-in-one systems that would dominate the market and have little incentive to improve or compete on usability and price.

specialist
0 replies
3h49m

Correct. Would also mostly stop a lot of scams like identity theft.

A national identity service, like a normal mature economy, would be a massive public policy and public health win.

But it'd slightly decrease profits. So of course it is quite impossible to implement.

Spooky23
0 replies
4h28m

Only if your mortgage isn’t paid by such a middleman.

specialist
0 replies
3h54m

This is why I advocate all PII data be encrypted at rest at the field level.

Worked on EMRs (during the aughts). Had yearly HIPAA and other security theater training. Not optional for anyone in the field.

Of course we had root access. I forget the precise language, but HIPAA exempts "intermediaries" such as ourselves. How else could we build and verify the system(s)? And that's why HIPAA is a joke.

Yes, our systems had consent and permissions and audit logs cooked in. So theoretically peeking at PII could be caught. Alas, it was just CYA. None of our customers reviewed their access logs.

--

I worked very hard to figure out how to protect patient (and voter) privacy. Eventually conceded that deanon always beats anon, because of Big Data.

I eventually found the book Translucent Databases. Shows how to design schemas to protect PII for common use cases. Its One Weird Trick is applying proper protection of passwords (salt + hash) to all other PII. Super clever. And obvious once you're clued in.

That's just 1/2 of the solution.

The other 1/2 is policy (non technical). All data about a person is owned by that person. Applying property law to PII transmutes 3rd party retention of PII from an asset to a liability. And once legal, finance, and accounting people get involved, orgs won't hoard PII unless it's absolutely necessary.

(The legal term "privacy" means something like "sovereignty over oneself", not just the folk understanding of "keeping my secrets private".)

1oooqooq
0 replies
1h43m

Use of health data is permitted so long as it’s for payment, treatment or operations

please, do tell me, where does "estimating better pricing models to extract more profit" fit into those?

Doctor writes script -> EHR -> Pharmacy -> Switch -> Clearinghouse —> PA Processing -> PBM/Plan makes determination

Everything the OP mentions happens way after the fact of those paths you described. People have already been charged. Treatment was already decided.

bick_nyers
2 replies
1d12h

... you didn't have a UAT environment?

saulrh
0 replies
1d11h

There were a couple "not prod" environments, but they were either replicated directly from prod or so poorly maintained that they were unusable (empty tables, wrong schemas, off by multiple DB major versions, etc), no middle ground. So institutional culture was to just run everything against prod (for bare selects that could be copied and pasted into the textbox in the prod-access web tool) or a prod replica (for anything that needed a db connection). The training docs actually did specify Real Production, and first-week tasks included gaining Real Production access. If I walked in and was handed that training documentation today I'd raise hell and/or quit on the spot, but that was my first job out of college - it was basically everyone's first job out of college, they strongly preferred hiring new graduates - and I'd just had to give up on my PhD so I didn't have the confidence, energy, or pull to do anything about it, even bail out.

That was also the company where prod pushes happened once a month, over the weekend, and were all hands on deck in case of hiccups. It was an extraordinarily strong lesson in how not to organize software development.

(edit: if what you're really asking is "did every engineer have write access to production", the answer was, I believe, that only managers did, and they were at least not totally careless with it. not, like, actually responsible, no "formal post-mortem for why we had to use break-glass access", but it generally only got brought out to unbreak prod pushes. Still miserable.)

roughly
0 replies
1d10h

There’s an old joke that everyone’s got a testing environment, but some people are lucky enough to have a separate production environment.

nomilk
3 replies
1d3h

The Dead Internet Theory says most activity on the internet is by bots [1]. The Dead Privacy Theory says approximately all private data is not private; but rather is accessible on whim by any data scientist, SWE, analyst, or db admin with access to the database.

[1] https://en.wikipedia.org/wiki/Dead_Internet_theory

mr_toad
0 replies
11h55m

At least more of that access is logged now. It used to be only the production databases that were properly logged, now it’s more common for every query to be logged, even in dev environments. The next step will be more monitoring of those logs.

htrp
0 replies
1d3h

The Dead Privacy Theory says approximately all private data is not private; but rather is accessible on whim by any data scientist, SWE, analyst, or db admin with access to the database.

I like this so much I'm definitely stealing it!

bearjaws
0 replies
1d1h

Damn, I've talked about this many times at my last job (startup that went from 100k patients to ~2.5M in 5 years). I love the name Dead Privacy Theory

soared
2 replies
1d13h

3 hours of meetings a week, that’s incredible. Sounds like your employer understands and values your time!

visarga
0 replies
1d12h

13 meetings/week, at least one full day of work wasted for me

schmookeeg
0 replies
1d1h

They really do. This has been my longest tenure at any position by far and Engineer QoL is a massive part of it. Our CTO came up through the DBA/Data/Engineering Management ranks and the empathy is solidly there.

As we grow, I'm ever watchful for our metamorphosis into a big-dumb-company, but no symptoms yet. :)

hamasho
2 replies
1d12h

I worked on a project to analyze endoscope videos to find diseases. I examined a lot of images and videos annotated with symptoms of various diseases labeled by assistants and doctors. Most of them are really obvious, but others are almost impossible to detect. In rare cases, despite my best efforts, I couldn't see any difference between the spot labeled as a symptom of cancer and the surrounding area. There's no a-ha moment, like finding an insect mimicking its environment. No matter how many times I tried, I just couldn't see any difference.

aswegs8
1 replies
1d3h

Mind sharing how to get a foot into the field? I've got a good amount of domain knowledge from my studies in life science and rather meager experience from learning to code on my own for a few years. It seems like I can't compete with CS majors and gotta find a way to leverage my domain knowledge.

hamasho
0 replies
18h37m

I'm not an expert in machine learning, but rather a web developer and data engineer helping develop a system to detect diseases from endoscopy images using the model developed by other ML engineers. And it was 5 years ago when I worked on the project, so please take it with a grain of salt.

If you want to learn machine learning for healthcare in general, it may help to start with problems on tabular data like CSVs instead of images. Image processing is a lot harder and takes a lot of time and computational power. But it's best to learn what you're interested in the most.

Anyway, first you need to be familiar with the basics: Python, machine learning, and popular libraries like scikit-learn, matplotlib, numpy, and pandas. There are tons of articles, textbooks, and videos to help you learn them.

If you grasp the basics, I think it's better to learn from actual code to train/evaluate models rather than more theories. Kaggle may be a good starting point. They host a lot of competitions for machine learning problems. There are easy competitions for beginners, and competitions and datasets in the medical field.

You can view notebooks (actual code to solve those problems, well written by experts), and popular ones are very educational. You can learn a lot by reading that code, understanding concepts and how to use libraries, and modifying some code to see how it changes the result. ChatGPT is also helpful.

If you want to learn image classification, the technology used to detect objects in images and videos is called image classification and object detection. It uses CNNs, a kind of deep neural network. You also need to learn basic image processing, how to train a deep neural network, how to evaluate it, and libraries like OpenCV/Pillow/PyTorch/TorchVision. There are a lot of image classification competitions in the medical field on Kaggle too[0][1].

To run those notebooks, I recommend Google Colab. Image processing often needs a lot of GPU power, and you may not have GPUs; even if you do, it's difficult to set up the right environment. It's easier to use those dedicated cloud services, and it doesn't cost much.

It's hard to learn, but sometimes fun, so enjoy your journey!

[0] https://www.kaggle.com/datasets/paultimothymooney/chest-xray... [1] https://www.kaggle.com/code/arkapravagupta/endoscopy-multicl...

ProjectArcturis
2 replies
1d2h

What business are you in that predicting health data can make you millions?

dr_kiszonka
0 replies
20h18m

Insurance, health benefits.

apwheele
0 replies
2h26m

The scale of health insurance claims is incredible: my company has a process that simply identifies when car insurance should pay the bill instead of medicaid/medicare after a traffic accident (subrogation). Seems a minor thing, right? We process right at 1 billion a year in related claims (and I don't know our US market share, maybe like 10-20%).

I am guessing every data scientist who works for a BlueCross BlueShield in an individual state deals with processes that touch multiple millions of dollars of claims.

We even now have various dueling systems -- one company has a model to tack on more diagnoses to push up the bill, another has a process to identify that upcoding. One company has models to auto accept/deny claims, another has an automatic prior authorization process to try to usurp that later denial stage, etc.

constantinum
1 replies
1d12h

I tease insights out of a monstrous data lake of claims, Rx, doctor notes, vital signs

I'm curious to know the tech stack behind converting unstructured to structured data (for reporting and analysis).

dax77
0 replies
23h43m

Take a look at AWS Healthlake and AWS Comprehend Medical

kvakerok
0 replies
15h12m

HIPAA your left foot because nobody reads fine print anymore and signs their soul away for a one-time $60 discount.

htrp
0 replies
1d2h

Getting my models dunked on by people who can't open MS Outlook more than 3 tries out of 5, however, have a remarkable depth and insight into their chosen domain of expertise. It's rather humbling.

The people who have lasted in those roles have built up a large degree of intuition on how their domains work (or they would've done something else).

conkeisterdoor
0 replies
1d2h

This sounds almost exactly like my day-to-day as a solo senior data engineer — minus building and training ML models, and I don't work in healthcare. My peers are all very non-technical business directors who are very knowledgeable about their domains, and I'm like a wizard who can conjure up time savings/custom reporting/actionable insights for them.

Collaborating with them is great, and has been a great exercise in learning how to explain complex ideas to non-technical business people. Which has the side effect of helping me get better at what I do (because you need a good understanding of a topic to be able to explain it both succinctly and accurately to others). It has also taught me to appreciate the business context and reasoning that can drive decisions about how a business uses or develops data/software.

tambourineman88
25 replies
2d

The opposite of what you’d think when studying machine learning…

95% of the job is data cleaning, joining datasets together and feature engineering. 5% is fitting and testing models.

toephu2
14 replies
2d

Sounds like a Data Scientist job?

moandcompany
6 replies
2d

This is a large problem in industry: defining away some of the most important parts of a job or role as (should be) someone else's.

There is a lot of toil and unnecessary toil in the whole data field, but if you define away all of the "yucky" parts, you might find that all of those "someone elses" will end up eating your lunch.

tedivm
4 replies
1d22h

It's not about "yucky" so much as specialization and only having a limited time in life to learn everything.

Should your reseacher have to manage nvidia drivers and infiniband networking? Should your operations engineer need to understand the math behind transformers? Does your researcher really gain any value from understanding the intricacies of docker layer caching?

I've seen what it looks like when a company hires mostly researchers and ignores other expertise, versus what happens when a company hires diverse talent sets to build a cross domain team. The second option works way better.

AndrewKemendo
2 replies
1d22h

My answer is yes to both of those

If other people's work is reliant on yours, then you should know how their part of the system transforms your inputs

Similarly you should fully understand how all the inputs to your part of the system are generated

No matter your coupling pattern, if you have more than a 1-person product, knowing at least one level above and below your stack is a baseline expectation

This is true with personnel leadership too, I should be able to troubleshoot one level above and below me to some level of capacity.

otteromkram
1 replies
1d4h

The parent comment had three examples...

mrbombastic
0 replies
1d

2/3 is close enough in ML world

moandcompany
0 replies
1d22h

I've seen what it looks like when a company hires mostly researchers and ignores other expertise, versus what happens when a company hires diverse talent sets to build a cross domain team. The second option works way better.

I've seen these too, and you aren't wrong. Division into specializations can work "way better" (i.e. the overall potential is higher), but in practice the differentiating factors that matter will come down to organizational and ultimately human factors. The anecdotal cases I draw my observations from are organizations operating at the scale of 1-10 people, as well as 1,000s, working in this field.

Should your reseacher have to manage nvidia drivers and infiniband networking? Should your operations engineer need to understand the math behind transformers? Does your researcher really gain any value from understanding the intricacies of docker layer caching?

To realize the higher potential mentioned above, what they need to be doing is appreciating the value of what those things are, and of those who do them, beyond "these are the people that do the things I don't want to do or don't want to understand." That appreciation usually comes from having done and understood that work.

When specializations are used, they tend to also manifest into organizational structures and dynamics which are ultimately comprised of humans. Conway's Law is worth mentioning here because the interfaces between these specializations become the bottleneck of your system in realizing that "higher potential."

As another commenter mentions, the effectiveness of these interfaces, corresponding bottlenecking effects, and ultimately the entire people-driven system is very much driven by how the parties on each side understand each other's work/methods/priorities/needs/constraints/etc, and having an appreciation for how they affect (i.e. complement) each other and the larger system.

hiatus
0 replies
1d22h

There is a lot of toil and unnecessary toil in the whole data field, but if you define away all of the "yucky" parts, you might find that all of those "someone elses" will end up eating your lunch.

See: the use of "devops" to encapsulate "everything besides feature development"

huygens6363
1 replies
1d11h

“Scientist”? Is this like Software Engineer?

staunton
0 replies
1d7h

I guess it means "someone who has or is about to have a PhD".

auntienomen
1 replies
1d22h

A good DS can double as an MLE.

disgruntledphd2
0 replies
1d11h

And sometimes, a good MLE can double as a DS.

Personally I think we calcified the roles around data a little too soon but that's probably because there was such demand and the space is wide.

maxlamb
0 replies
1d7h

Sounds like a data engineer job to me

jamil7
0 replies
1d6h

My partner is a data engineer, from what I’ve gathered the departments are often very small or one person so the roles end up blending together a lot.

RSZC
0 replies
1d13h

Used to do this job once upon a time - can't overstate the importance of just being knee-deep in the data all day long.

If you outsource that to somebody else, you'll miss out on all the pattern-matching eureka moments, and will never know the answers to questions you never think to ask.

dblohm7
5 replies
1d23h

As somebody whose machine learning expertise consists of taking the first cohort of Andrew Ng's MOOC back in 2011, I'm not too surprised. One of the big takeaways from that experience was the importance of getting the features right.

ismailmaj
2 replies
1d7h

This was very important with classical machine learning. Now, with deep learning, feature engineering has become useless, as the model can learn the relevant features by itself.

However, having a quality and diverse dataset is more important now than ever.

beckhamc
0 replies
1d

no we just replaced feature engineering with architectural engineering

Salgat
0 replies
1d1h

That depends on the type of data, and regardless, your goal is to minimize the input data, since it has a direct impact on performance overhead and the duration of inference.

geoduck14
0 replies
1d19h

was the importance of getting the features right.

Yeah, but also knowing which features to get right. Right?

Animats
0 replies
1d12h

I remember that class. Someone from Blackrock taught it at Hacker Dojo. The good old days of support vector machines and Matlab.

llama_person
1 replies
1d4h

Same here. It's tons of work to collect, clean, and validate data, followed by a tiny fun portion where you train models; then you do the whole loop over again.

gopher_space
0 replies
22h14m

it's tons of work to collect, clean, validate data

That's my fun part. The discovery process is a joy especially if it means ingesting a whole new domain and meeting people.

whiplash451
0 replies
1d4h

In a sense, the data _is_ the model (inductive bias) so splitting « data work » and « model work » like you do is arbitrary.

AndrewKemendo
0 replies
1d22h

As it was in the beginning and now and ever shall be amen

At the staff/principal level it’s all about maintaining “data impedance” between the product features that rely on inference models and the data capture

This is to ensure that as the product or features change it doesn’t break the instrumentation and data granularity that feed your data stores and training corpus

For RL problems however it’s about making sure you have the right variables captured for state and action space tuple and then finding how to adjust the interfaces or environment models for reward feedback

burnedout_dc4e3
10 replies
1d18h

I've been doing machine learning since the mid 2000s. About half of my time is spent keeping data pipelines running to get data into shape for training and using in models.

The other half is spent doing tech support for the bunch of recently hired "AI scientists" who can barely code, and who spend their days copy/pasting stuff into various chatbot services. Stuff like telling them how to install python packages and use git. They have no plan for how their work is going to fit into any sort of project we're doing, but assert that transformer models will solve all our data handling problems.

I'm considering quitting with nothing new lined up until this hype cycle blows over.

naveen99
3 replies
1d15h

You’re living the dream. Why quit ?

bowsamic
1 replies
1d12h

Is that really your idea of a dream?

naveen99
0 replies
1d5h

My dreams are usually more disturbing, or fun…

But yes. My work is kind of similar… I do some data curation / coding, and help 2 engineers who report to me. I enjoy it.

burnedout_dc4e3
0 replies
1d3h

I like to feel useful, and like I'm actually contributing to things. I probably didn't express it well in my first post, but the attitude is very much that my current role is obsolete and a relic that's just sticking around until the AI can do everything.

It means I'm marginalized in terms of planning. The company has long term goals that involve making good use of data. Right now, the plan is that "AI" will get us there, with no plan B if it doesn't work. When it inevitably fails to live up to the hype, we're going to have a bunch of cobbled-together systems that are expensive to run, rather than something that we can keep iterating on.

It means I'm marginalized in terms of getting resources for projects. There's a lot of good my team could be doing if we had the extra budget for more engineers and computing. Instead that budget is being sent off to AI services, and expensive engineer time is being spent on tech support for people that slapped "LLM" all over their resume.

whiplash451
1 replies
1d4h

There are companies where applied scientists are required to code well. Just ask how they are hired before joining (that should be a positive feature).

burnedout_dc4e3
0 replies
23h37m

Yeah, we used to be like that. Then, when this hype cycle started ramping up, the company brought in a new exec who got rid of that. I brought it up with the CEO, but nothing changed, so that's another reason for me to leave.

bentt
1 replies
1d14h

I wonder if this is how the OG VR guys felt in 2016.

Havoc
0 replies
1d7h

Well Palmer Luckey sold oculus and now makes military gear so I guess he chose violence after his VR era

nitwit005
0 replies
13h35m

Why help them? Tell your boss you're pointing them to the other ones you helped previously.

m_ke
0 replies
1d10h

I just quit a day ago with nothing lined up for the same reason.

tenache
6 replies
1d22h

Although I studied machine learning and was originally hired for that role, the company pivoted and is now working with LLMs, so I spend most of my day working on figuring out how different LLMs work, what parameters work best for them, how to do RAG, how to integrate them with other bots.

KeplerBoy
5 replies
1d22h

Would you not consider LLMs as a part of machine learning?

chudi
1 replies
1d22h

Probably it's because we are not training them anymore and just using them with prompts. Seems like more of a regular SWE type of job.

aulin
0 replies
1d8h

except regular swe is way more fun than writing prompts

uoaei
0 replies
1d11h

There is a vanishingly small percentage of people actually working on the design and training of LLMs vs all those who call themselves "AI engineers" who are just hitting APIs.

layer8
0 replies
1d17h

They are the result of machine learning.

Cyclone_
0 replies
1d22h

I'd say deep learning is a subset of machine learning, and LLMs are a subset of deep learning.

Xenoamorphous
5 replies
1d4h

I’m a regular software dev but I’ve had to do ML stuff by necessity.

I wonder how “real” ML people deal with the stochastic/gradient results and people’s expectations.

If I do ordinary software work the thing either works or it doesn’t, and if it doesn’t I can explain why and hopefully fix it.

Now with ML I get asked "why did this text classifier not classify this text correctly?" and all I can say is "it was 0.004 points away from meeting the threshold" and "it didn't meet it because of the particular choice of words or even their order", which seems to leave everyone dissatisfied.

xtagon
2 replies
1d3h

Not all ML is built on neural nets. Genetic programming and symbolic regression is fun because the resulting model is just code, and software devs know how to read code.

nchfgsj1
0 replies
1d3h

Genetic programming however isn’t machine learning, but instead it’s an AI algorithm. An extremely interesting one as well! It was fun to have my eyes opened after being taught genetic algorithms, to then be brought into genetic programming

aiforecastthway
0 replies
22h55m

Symbolic regression has the same failure mode; the reasons why the model failed can be explained in a more digestible way, but the actual truth of what happened is fundamentally similar -- some coefficient was off by some amount and/or some monomial beat out another in some optimization process.

At least with symbolic regression you can treat the model as an analyzable entity from first principles theories. But that's not really particularly relevant to most failure modes in practice, which usually boil down to either missing some qualitative change such as a bifurcation or else just parameters being off by a bit. Or a little bit of A and a little bit of B.

hkt
1 replies
1d3h

This seems to be the absolute worst of all worlds: the burden of software engineering with the tools of an English Language undergrad.

gopher_space
0 replies
22h32m

The English degree helps explain why word choice and order matter, giving you context and guidelines for software design.

rurban
4 replies
1d3h

Highly paid cleaning lady. With dirty data you get no proper results. BTW: perl is much better than python at this.

Highly paid motherboard troubleshooter, because all those H100s really get hot, even with watercooling, and we have no dedicated HW guy.

Fighting misbehaving third-party deps, as everyone else.

shoggouth
3 replies
1d

Could you talk more about “BTW: perl is much better than python on this.”?

eb0la
2 replies
1d

I haven't touched Perl in more than 20 years... ... but I (routinely) miss something like:

   $variable = something() if sanity_check()
And

   do_something() unless $dont_do_that

reportgunner
0 replies
2h56m

Both work in python:

        variable = something() if sanity_check() else None


        do_something() if not dont_do_that else None

jononor
0 replies
20h10m

There exists a ternary if statement?

foo = something() if sanity_check() else None

Can replace None with foo (or any other expression), if desired.

angarg12
3 replies
2d

My job title is ML Engineer, but my day to day job is almost pure software engineering.

I build the systems to support ML systems in production. As others have mentioned, this includes mostly data transformation, model training, and model serving.

Our job is also to support scientists to do their job, either by building tools or modifying existing systems.

However, looking outside, I think my company is an outlier. It seems that in the industry the expectations for an ML Engineer are more aligned with what a data/applied scientist does (e.g. building and testing models). That introduces a lot of ambiguity into the expectations for each role at each company.

tedivm
0 replies
1d22h

In my experience your company is doing it right, and doing it the way that other successful companies do.

I gave a talk at the Open Source Summit on MLOps in April, and one of the big points I try to drive home is that it's 80% software development and 20% ML.

https://www.youtube.com/watch?v=pyJhQJgO8So

hnthrowaway0328
0 replies
1d23h

That's really the kind of job I'd love. Whatever the data is, I don't care. I make sure that the users get the correct data quickly.

exegete
0 replies
1d20h

My company is largely the same. I’m an MLE and partner with data scientists. I don’t train or validate the models. I productionize and instrument the feature engineering pipelines and model deployments. More data engineering and MLOps than anything. I’m in a highly regulated industry so the data scientists have many compliance tasks related to the models and we engineers have our own compliance tasks related to the deployments. I was an MLE at another company in the very same industry before and did everything in the model lifecycle and it was just too much.

trybackprop
2 replies
1d23h

In a given week, I usually do the following:

* 15% of my time in technical discussion meetings or 1:1's. Usually discussing ideas around a model, planning, or ML product support

* 40% ML development. In the early phase of the project, I'm understanding product requirements. I discuss an ML model or algorithm that might be helpful to achieve product/business goals with my team. Then I gather existing datasets from analysts and data scientists. I use those datasets to create a pipeline that results in a training and validation dataset. While I wait for the train/validation datasets to populate (could take several days or up to two weeks), I'm concurrently working on another project that's earlier or further along in its development. I'm also working on the new model (written in PyTorch), testing it out with small amounts of data to gauge its offline performance, to assess whether or not it does what I expect it to do. I sanity check it by running some manual tests using the model to populate product information. This part is more art than science because without a large scale experiment, I can only really go by the gut feel of myself and my teammates. Once the train/valid datasets have been populated, I train a model on large amounts of data, check the offline results, and tune the model or change the architecture if something doesn't look right. After offline results look decent or good, I then deploy the model to production for an experiment. Concurrently, I may be making changes to the product/infra code to prepare for the test of the new model I've built. I run the experiment and ramp up traffic slowly, and once it's at 1-5% allocation, I let it run for weeks or a month. Meanwhile, I'm observing the results and have put in alerts to monitor all relevant pipelines to ensure that the model is being trained appropriately so that my experiment results aren't altered by unexpected infra/bug/product factors that should be within my control. If the results look as expected and match my initial hypothesis, I then discuss with my team whether or not we should roll it out and if so, we launch! (Note: model development includes feature authoring, dataset preparation, analysis, creating the ML model itself, implementing product/infra code changes)

* 20% maintenance – Just because I'm developing new models doesn't mean I'm ignoring existing ones. I'm checking in on those daily to make sure they haven't degraded and resulted in unexpected performance in any way. I'm also fixing pipelines and making them more efficient.

* 15% research papers and skills – With the world of AI/ML moving so fast, I'm continually reading new research papers and testing out new technologies at home to keep up to date. It's fun for me so I don't mind it. I don't view it as a chore to keep me up-to-date.

* 10% internal research – I use this time to learn more about other products within the team or the company to see how my team can help or what technology/techniques we can borrow from them. I also use this time to write down the insights I've gained as I look back on my past 6 months/1 year of work.
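
A minimal sketch of the kind of small-data offline check mentioned in the "40% ML development" bullet above; the model, data, and metric are toy stand-ins, not the actual production setup:

    # Train a tiny PyTorch model on a toy batch, then eyeball a crude offline
    # metric before committing to a full-scale training run.
    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    X = torch.randn(256, 16)                 # toy features
    y = (torch.rand(256, 1) > 0.5).float()   # toy labels

    for _ in range(20):                      # tiny training loop
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()

    with torch.no_grad():                    # crude offline sanity check
        preds = (torch.sigmoid(model(X)) > 0.5).float()
        print("train accuracy:", (preds == y).float().mean().item())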

ZenMikey
1 replies
1d1h

How do you select what papers to read? How often does that research become relevant to your job?

trybackprop
0 replies
1d

I select papers based on references from coworkers, Twitter posts by prominent ML researchers I follow, ML podcasts, and results.

The research becomes relevant immediately because my team is always looking to incorporate it into our production models right away. Of course it does take some planning (3-6 months) before it's fully rolled out in production.

itake
2 replies
1d23h

not sure if this counts as ML engineering, but I support all the infra around the ML models: caching, scaling, queues, decision trees, rules engines, etc.

selimthegrim
0 replies
1d13h

What do you do with decision trees specifically?

barrenko
0 replies
1d11h

MLOps, sure.

singularity2001
1 replies
1d8h

Teaching others python.

reeboo
0 replies
1d3h

Underrated comment. At my place of work, I find this to be a huge part of the MLE job. Everyone knows R but none of the cloud tools have great R support.

rldjbpin
1 replies
9h3m

junior level role, but currently it is a mix of acting like a proxy product owner half the time and doing software engineering the other half.

the users are researchers and have deep technical knowledge of their use case. it is still a challenge to map their needs onto design decisions for what they want in the end. thanks to open-source efforts, the model creation is rather straightforward, but everything around making that happen and shaping it into a tool is a ride.

especially love the ever-changing technical stack of "AI" services by major cloud providers rn. it makes mlops nothing more than a demo imho.

reportgunner
0 replies
2h47m

Junior level role product owner?

redwood
1 replies
1d17h

Do people feel like they are more or less in demand with the hype around genai?

uoaei
0 replies
1d10h

Demand is higher for flashy things that look good on directors' desks, definitely. But there's less attention on less flashy applications of machine learning, unless your superiors are so clueless that they think what you're doing is GenAI. Sometimes the systems/models being trained are legitimately generative, but in the more technical, traditional sense.

hirako2000
1 replies
2d

The number of responses may speak for itself.

Not my main work, but I spend a lot of time gluing things together. Tweaking existing open source. Figuring out how to optimize resources, retraining models on different data sets. Trying to run poorly put together python code. Adding missing requirements files. Cleaning up data. Wondering what could in fact really be useful to solve with ML that hasn't already been done years ago. Browsing the prices of the newest GPUs and calculating whether it would be worth it to get one rather than renting overpriced hours from hosting providers. Reading papers until my head hurts, which, taking them one by one, is usually by the time I've finished the abstract and glanced over a few diagrams in the middle.
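
That buy-versus-rent question usually reduces to a break-even calculation; a sketch with made-up numbers (the prices are illustrative, not quotes):

    # Back-of-the-envelope: after how many GPU-hours does buying beat renting?
    purchase_price = 2000.0   # hypothetical card price, USD
    rental_rate = 1.5         # hypothetical on-demand rate, USD per hour

    break_even_hours = purchase_price / rental_rate
    print(f"break-even after ~{break_even_hours:.0f} hours of use")  # ~1333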

ZenMikey
0 replies
1d1h

Where do you locate/how do you select papers?

redwood
0 replies
1d17h

What tools do people end up using? Are feature platforms like Tecton on the list?

primaprashant
0 replies
1d10h

Been working as an MLE for the last 5 years and, as another comment said, most of the work is close to SWE. Depending on the stage of the project I'm working on, day-to-day work varies, but it's along the lines of one of these:

- Collaboration with stakeholders & TPMs and analyzing data to develop hypotheses to solve business problems with high priority

- Framing business problems as ML problems and creating suitable metrics for ML models and business problems

- Building PoCs and prototypes to validate the technical feasibility of the new features and ideas

- Creating design docs for architecture and technical decisions

- Collaborating with the platform teams to set up and maintain the data pipelines based on the needs of new and existing ML projects

- Building, deploying, and maintaining ML microservices for inference (a minimal sketch of such a service follows this list)

- Writing design docs for running A/B tests and performing post-test analyses

- Setting up pipelines for retraining of ML models
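
A minimal sketch of what one of those inference microservices might look like, assuming FastAPI and a joblib-serialized model; both are my assumptions for illustration, not choices stated above:

    # Hypothetical inference service: load a serialized model once at startup
    # and expose a single prediction endpoint.
    import joblib
    from fastapi import FastAPI
    from pydantic import BaseModel

    class Features(BaseModel):
        values: list[float]

    app = FastAPI()
    model = joblib.load("model.joblib")  # placeholder artifact path

    @app.post("/predict")
    def predict(features: Features) -> dict:
        prediction = model.predict([features.values])[0]
        return {"prediction": float(prediction)}

The retraining pipelines from the last item would then just swap in a fresh model artifact behind the same endpoint.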

npalli
0 replies
1d3h

Clean data and try to get people to understand why they need to have clean data.

mardifoufs
0 replies
1d20h

I work on optimizing our inference code, "productizing" our trained models and currently I'm working on local training and inference since I work in an industry where cloud services just aren't very commonly used yet. It's super interesting too since it's not LLMs, meaning that there aren't as many pre made tools and we have to make tons of stuff by ourselves. That means touching anything from assessing data quality (again, the local part is the challenge) to using CUDA directly as we already have signal processing libs that are built around it and that we can leverage.

Sometimes it also involves building internal tooling for our team (we are a mixed team of researchers/MLEs) to visualize the data and the inferences, as again, it's a pretty niche sector and that means having to build that ourselves. That has allowed me to have a lot of impact in my org, as we basically have complete freedom w.r.t. tooling and internal software design, and one of the tools that I built basically on a whim is now on its way to being shipped in our main products too.

lemursage
0 replies
7h52m

In larger companies, and specifically on bigger projects, systems tend to have multiple ML components, usually a mix of large NN models and more classical ML algorithms, so you end up tweaking multiple parts at once. In my case optimising such systems is ~90% of the work. For instance, can I make the model lighter and keep the performance? Or can I make it go faster? Loss changes, pruning, quantisation, dataset optimisation etc. -- most of the time I'm testing out those options and tweaking parameters. There is of course the deployment part, but that one is usually a quickie if your team has specific processes/pipelines for it: there's a checklist of what you must do while deploying, along with cost targets.
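
For the quantisation part specifically, a minimal sketch using PyTorch's dynamic quantisation (the model here is a toy stand-in, not one of the systems described above):

    # Dynamic quantisation: convert Linear layers' weights to int8, which
    # shrinks the model and often speeds up CPU inference.
    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 128)
    print(model(x).shape, quantized(x).shape)  # same interface, smaller weights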

In my case, there are established processes and designated teams for cleaning & collecting data, but you still do a part of it yourself to provide guidelines. So, even though data is always a perpetual problem, I can shed off most of that boring stuff.

Ah, and of course you're not a real engineer if you don't spend at least 1-2% of your time explaining to other people (surprisingly often technical staff, just not ML-oriented) why doing X is a really bad idea. Or just explaining how ML systems work with ill-fitted metaphors.

jackspawn
0 replies
1d4h

50%+ of my time is spent on backend engineering because the ML is used inside a bigger API.

I take responsibility for the end to end experience of said API, so I will do whatever gives the best value per time spent. This often has nothing to do with the ML models.

giantg2
0 replies
1d18h

I've interviewed for a few of the ML positions and turned them down because they were just data jockey positions.

frankPoole
0 replies
1d22h

Pretty much the same as the others: building tools, data cleaning, etc. But something I don't see mentioned: experiment design / data collection protocols.

exe34
0 replies
1d4h

90% of the time it's figuring out what data to feed into neural networks, 2% of the time it's figuring out stuff about the neural networks themselves, and the other 8% of the time it's figuring out why on earth the recall rate is 100%.
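
For anyone puzzled by that last one: recall is TP / (TP + FN), so a degenerate model that predicts the positive class for everything never produces a false negative and scores a perfect 100%. A tiny illustration with made-up labels:

    # A model that says "positive" for every example has no false negatives,
    # so its recall is 1.0 regardless of how useless it is.
    y_true = [1, 0, 1, 0, 0, 1]
    y_pred = [1, 1, 1, 1, 1, 1]

    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    print("recall:", tp / (tp + fn))  # 1.0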