
My Python code is a neural network

pakl
17 replies
1d3h

There exists the Universal (Function) Approximation Theorem for neural networks — which states that they can represent/encode any function to a desired level of accuracy[0].

However there does not exist a theorem stating that those approximations can be learned (or how).

[0] https://en.m.wikipedia.org/wiki/Universal_approximation_theo...
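
To make the representational (not learning) claim concrete, here's a minimal sketch of my own: a single hidden layer of ReLU units whose output weights are fit by plain least squares, approximating sin on an interval. Nothing here is "learned" by gradient descent, which is exactly the distinction the comment draws.

    import numpy as np

    # One hidden ReLU layer; only the output weights are fit (least squares),
    # illustrating representation capacity rather than learning.
    x = np.linspace(-np.pi, np.pi, 400)
    centers = np.linspace(-np.pi, np.pi, 50)
    H = np.maximum(0.0, x[:, None] - centers[None, :])  # hidden activations
    w, *_ = np.linalg.lstsq(H, np.sin(x), rcond=None)   # output weights
    print(np.max(np.abs(H @ w - np.sin(x))))            # max error is small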

jb1991
11 replies
1d3h

FYI, there are actually many algorithms, going back longer than neural networks, that have been proven to be universal function approximators. Neural networks are certainly not the only ones, nor the first. There are quite a few that are actually much more appropriate than a neural network for many cases.

derangedHorse
10 replies
1d2h

What other algorithms can do this, and in which situations would they be more useful than neural networks?

someoneontenet
3 replies
1d2h

Newton's method approximates square roots. It's useful if you want to approximate something like that without pulling in the computational power required of a NN.
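
For instance, a minimal sketch: Newton's method on f(x) = x^2 - a reduces to the familiar Babylonian square-root iteration.

    def newton_sqrt(a, x=1.0, tol=1e-12):
        # Newton step x - f(x)/f'(x) for f(x) = x^2 - a simplifies to the average below
        while abs(x * x - a) > tol:
            x = 0.5 * (x + a / x)
        return x

    print(newton_sqrt(2.0))  # ~1.414213562373095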

kristjansson
0 replies
1d1h

Newton's method is related to universal function approximation in the same way a natural oil seep is related to a modern IC engine...

jb1991
0 replies
11h43m

By definition, that’s not a “universal” function approximator.

astrobe_
0 replies
1d1h

I think the problem to solve is more like: given a set of inputs and outputs, find a function that gives the expected output for each input [1]. This is like Newton's method at a higher order ;-). One can find such a tool in Squeak or Pharo Smalltalk, IIRC.

[1] https://stackoverflow.com/questions/1539286/create-a-functio...

dontlikeyoueith
3 replies
17h47m

The Taylor Series dates to 1715. Fourier Series dates to the 1820s.

Both are universal function approximators and both can be learned via gradient descent.

For the case where the function you want to learn actually is polynomial or periodic (respectively), these are better than neural networks.
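
As a sketch of the "learned via gradient descent" part (my toy example, with assumed hyperparameters): fitting the sine-series coefficients of a square wave by plain gradient descent on squared error.

    import numpy as np

    x = np.linspace(0.0, 2.0 * np.pi, 200)
    y = np.sign(np.sin(x))                 # square-wave target
    k = np.arange(1, 11)                   # first 10 harmonics
    basis = np.sin(np.outer(x, k))         # (200, 10) Fourier sine basis
    a = np.zeros(10)                       # coefficients to learn

    for _ in range(2000):
        grad = 2.0 * basis.T @ (basis @ a - y) / len(x)  # d(MSE)/da
        a -= 0.1 * grad

    print(a[:4])  # odd harmonics approach 4/(pi*k), even ones approach 0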

camjw
2 replies
7h7m

For your interest, Taylor Series are not universal function approximators - the Taylor Series around 0 for

f(x) = e^(-1/x^2) if x != 0 else 0

is identically zero (all partial derivatives are 0 at 0) but the function is clearly not identically zero. So the radius of convergence for this Taylor series is infinite but it only equals the approximated function at one point.

I'm sure there are some conditions you can put on f to make the Taylor Series a UFA but it's been quite a while since I did any real analysis so I have forgotten!

Doesn't detract from the overall point though that there are UFAs that are not neural nets. I should say that I don't know what the precise definition of a UFA really is, but I assume you have to have more than equality at one point.
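
A quick numerical check of that counterexample (a sketch, not from the thread):

    import numpy as np

    # f is smooth everywhere and all its derivatives vanish at 0,
    # yet f is not the zero function.
    f = lambda t: np.exp(-1.0 / t**2) if t != 0 else 0.0
    print(f(0.5))   # ~0.0183: nonzero, though its Taylor series at 0 is identically zero
    print(f(0.05))  # ~1e-174: the function is extraordinarily flat near the origin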

Horffupolde
0 replies
6h13m

Taylor series work on differentiable intervals. You specifically chose a function and interval where this is not true. Of course it will not be a good approximation.

gnyeki
0 replies
1d2h

This area is covered by non-parametric statistics more generally. There are many other methods to non-parametrically estimate functions (that satisfy some regularity conditions). Tree-based methods are one such family, and the consensus still seems to be that they perform better than neural networks on tabular data. For example:

https://arxiv.org/abs/2106.03253
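
As a minimal illustration of the tree-based route (a sketch using scikit-learn's standard API, not taken from the paper):

    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    # Gradient-boosted trees on a small tabular dataset; no feature scaling needed.
    X, y = load_diabetes(return_X_y=True)
    model = GradientBoostingRegressor(random_state=0)
    print(cross_val_score(model, X, y, cv=5).mean())  # mean R^2 across folds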

arketyp
1 replies
1d3h

Makes you wonder what is meant by learning...

dekhn
0 replies
1d3h

Learning is using observations to create/update a model that makes predictions which are more accurate than chance. At some point the model ends up having generalizability beyond the domain.

visarga
0 replies
1d1h

They can model only continuous functions, more specifically any continuous function on compact subsets of ℝⁿ. They can approximate such functions to an arbitrary level of accuracy, given sufficient neurons.

richrichie
0 replies
1d2h

Not any function though. There are restrictions on the type of functions the "universal" approximation theorem applies to. Interestingly, the theorem is about a single-layer network. In practice, that does not work as well as having many layers.

montebicyclelo
0 replies
23h50m

People throw that proof around all the time, but all it does is show that a neural net is equivalent to a lookup table, and a lookup table with enough memory can approximate any function. It's miles away from explaining how real-world, useful neural nets, like conv-nets, transformers, LSTMs, etc., actually work.

dekhn
11 replies
1d3h

Are RNNs completely subsumed by transformers? IE, can I forget about learning anything about how to work with RNNs, and instead focus on transformers?

Voloskaya
6 replies
1d3h

Not if you want to be a PhD/Researcher in ML, yes otherwise.

Source: Working on ML/LLMs as a research engineer for the past 7 years, including for one of the FAANGs' research labs; always wanted to take time to learn about RNNs but never did and never needed to.

jszymborski
2 replies
18h28m

None of the students who have taken the classes I TA pass without learning about RNNs.

dekhn
1 replies
5h7m

Is that true also of LSTMs?

jszymborski
0 replies
2h7m

Yes. We cover Jordan and Elman RNN, LSTMs, and GRUs. Assignments only really test for LSTM knowledge, though.
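
For reference, the Elman recurrence itself is tiny; a minimal sketch of one step (weight names are my own, shapes assumed):

    import torch

    def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
        # h_t depends on the current input x_t and the previous hidden state h_prev
        return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

    # (an LSTM replaces this single update with gated input/forget/output paths)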

rolisz
1 replies
1d2h

Oh, I'm sure plenty of recent PhDs don't know about RNNs. They've been dropped like a hot potato in the last 4-5 years.

Voloskaya
0 replies
22h30m

I think to do pure research it’s definitely worth knowing about the big ideas of the past, why we moved on from them, what we learned etc.

Fripplebubby
2 replies
1d1h

To further problematize this question (which I don't feel like I can actually answer), consider this paper: "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" - https://arxiv.org/pdf/2006.16236

What this shows is that actually a specific narrow definition of transformer (a transformer with "causal masking" - see paper) is equivalent to an RNN, and vice versa.

Similarly Mamba (https://arxiv.org/abs/2312.00752), the other hot architecture at the moment, has an equivalent unit to a gated RNN. For performance reasons, I believe they use an equivalent CNN during training and an RNN during inference!
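
The recurrent form of causal linear attention from that paper is short enough to sketch (my numpy rendering; the positive feature map below stands in for the paper's elu(x)+1):

    import numpy as np

    def linear_attention_as_rnn(Q, K, V):
        # Q, K: (T, d); V: (T, d_v). Running sums S and z act as the RNN's hidden state.
        phi = lambda u: np.maximum(u, 0.0) + 1.0  # positive feature map (assumed)
        T, d = Q.shape
        S = np.zeros((d, V.shape[1]))
        z = np.zeros(d)
        out = np.zeros_like(V)
        for t in range(T):
            S += np.outer(phi(K[t]), V[t])        # accumulate key-value products
            z += phi(K[t])                        # accumulate normalizer
            out[t] = (phi(Q[t]) @ S) / (phi(Q[t]) @ z)
        return out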

visarga
0 replies
1d1h

There still are important distinctions. RNNs have constant memory while transformers expand their memory with each new token. They are related, but one could in theory process an unbounded sequence while the other cannot because of growing memory usage.

Fripplebubby
0 replies
1d1h

To be more concrete: you might decide not to learn about RNNs, but still find them lurking in the things you did learn about!

toxik
0 replies
1d1h

Transformers have finite context, RNNs don’t. In practice the RNN gradient signal is limited by backpropagation through time: it decays. This is in fact the whole selling point of transformers; association is no harder or easier at long distances than at short ones. But in theory an RNN can remember infinitely far back.

scotchmi_st
4 replies
1d3h

This is an interesting article if you read it like a howto for constructing a neural network for performing a practical task. But if you take it at face-value, and follow a similar method the next time you need to parse some input, then, well, I don't know what to say really.

The author takes a hard problem (parsing arbitrary input for loosely-defined patterns), and correctly argues that this is likely to produce hard-to-read 'spaghetti' code.

They then suggest replacing that with code that is so hard to read that there is still active research into how it works (i.e. a neural net).

Don't over-index on something that's inscrutable versus something that you can understand but is 'ugly'. Sometimes, _maybe_, an ML model is what you want for a task. But a lot of the time, something that you can read and see why it's doing what it's doing, even if that takes some effort, is better than something where that's impossible.

mr_toad
1 replies
13h39m

The spaghetti code approach is basically an expert system. An old school algorithmic AI. Outside constrained domains these systems never really performed very well. Reality is just too messy.

Having a system where you can see why it works the way it does is all very well, but not if it keeps giving the wrong answers. In real world use getting the right answer is often more important than knowing how you got that answer.

zelphirkalt
0 replies
12h20m

You can make an expert system extensible though. You can make it definitely recognize some pattern and when there is a complaint about it not recognizing another pattern, you can add it. Hopefully you wrote the code in a way that easily allows adding new patterns, of course.

thoughtlede
0 replies
1d1h

I think the mention of 'spaghetti code' is a red herring from the author. If the output from an algorithm cannot be defined precisely as a function of the input, but you have some examples to show, that's where machine learning (ML) is useful.

In the end, ML provides one more option to choose from. Whether it works for you depends on evaluations and on how much determinism and explainability you need from the chosen algorithm/option.

The thing that struck me is whether an RNN is the right choice, given that it would need to be trained and we'd need more examples than we might have. That said, maybe based on known 'rules' we can produce synthetic data for both +ve and -ve cases.

meindnoch
0 replies
1d3h

"Let's tuck all that unsightly spaghetti code behind this neat 1000x1000 matrix of floats!"

sdwr
2 replies
1d4h

This is new to me, and therefore bad and scary.

It's great that you know NN well enough to fold it into regular work. But think of all us poor regular developers! Who now have to grapple with:

- an unfamiliar architecture

- uncertainty / effectively non-deterministic results in program flow

sva_
1 replies
1d3h

NNs are in principle deterministic (unless you add randomness, as with LLM top-p/top-k sampling and temperature).

Uncertainty is probably the better word of the two, but I feel like there should be a different term.
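
A sketch of where the randomness enters (a hypothetical helper of mine, not any particular library's API):

    import numpy as np

    def sample(logits, temperature=1.0, k=5, rng=np.random.default_rng()):
        # The logits themselves are deterministic; sampling adds the randomness.
        scaled = np.asarray(logits, dtype=float) / temperature
        top = np.argsort(scaled)[-k:]               # top-k filtering
        p = np.exp(scaled[top] - scaled[top].max())
        return rng.choice(top, p=p / p.sum())

    # temperature -> 0 (or k=1) recovers deterministic argmax behavior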

zelphirkalt
0 replies
41m

Unpredictability?

jlturner
2 replies
1d3h

If this interests you, it’s worth taking a look at Genetic Programming. I find it to be a simpler approach to the same problem, no math required. It simply recombines programs by their AST and, given some heuristic, optimizes the program for it. The magic is in your heuristic function, where you can choose what you want to optimize for (i.e. speed, program length, minimizing complex constructs or function calls, network efficiency, some combination thereof, etc).

https://youtu.be/tTMpKrKkYXo
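
For a feel of the mechanics, here's a toy sketch of my own (expression trees as nested tuples; the fitness heuristic is error plus a program-length penalty, as described above):

    import random

    OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b, '*': lambda a, b: a * b}

    def rand_tree(depth=3):
        if depth == 0 or random.random() < 0.3:
            return random.choice(['x', 1.0, 2.0])
        return (random.choice(list(OPS)), rand_tree(depth - 1), rand_tree(depth - 1))

    def evaluate(t, x):
        if t == 'x':
            return x
        if isinstance(t, float):
            return t
        op, a, b = t
        return OPS[op](evaluate(a, x), evaluate(b, x))

    def size(t):
        return 1 if not isinstance(t, tuple) else 1 + size(t[1]) + size(t[2])

    def fitness(t):
        err = sum((evaluate(t, x) - (x * x + x)) ** 2 for x in range(-5, 6))
        return err + 0.1 * size(t)  # the heuristic: accuracy plus program length

    def random_subtree(t):
        while isinstance(t, tuple) and random.random() < 0.7:
            t = random.choice(t[1:])
        return t

    def crossover(a, b):
        # Replace a random subtree of a with a random subtree of b.
        if not isinstance(a, tuple) or random.random() < 0.3:
            return random_subtree(b)
        op, l, r = a
        if random.random() < 0.5:
            return (op, crossover(l, b), r)
        return (op, l, crossover(r, b))

    pop = [rand_tree() for _ in range(200)]
    for _ in range(30):
        pop.sort(key=fitness)
        pop = pop[:50] + [crossover(random.choice(pop[:50]), random.choice(pop[:50]))
                          for _ in range(150)]
    pop.sort(key=fitness)
    print(pop[0], fitness(pop[0]))  # evolves toward x*x + x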

nickpsecurity
0 replies
1d2h

I’ll add the Humies Awards that highlight human-competitive results. One can learn a lot about what can or can’t be done in this field by just skimming across all the submitted papers.

https://www.human-competitive.org/

PixelN0va
0 replies
1d3h

hmm thanks for the link

alsxnt
2 replies
1d4h

Recurrent neural networks can be used for arbitrary computations, the equivalence to Turing machines has been proven. However, they are utterly impractical for the task.

This seems to be a state machine that is somehow learned. The article could benefit from a longer synopsis and "Python" does not appear to be relevant at all. Learning real Python semantics would prove quite difficult due to the nature of the language (no standard, just do as CPython does).

danans
1 replies
1d3h

> Recurrent neural networks can be used for arbitrary computations, the equivalence to Turing machines has been proven. However, they are utterly impractical for the task.

Karpathy's 2015 RNN article [1] demonstrated that RNNs trained character-wise on Shakespeare's works could produce Shakespeare-esque text (albeit without the narrative coherence of LLMs). Given that, why wouldn't they be able to handle natural language as formulaic as code review comments?

In that case inference was run with randomized inputs in order to generate random "Shakespeare", but the structure of the language and style was still learned by the RNN. Perhaps it could be used for classification also.

1. https://karpathy.github.io/2015/05/21/rnn-effectiveness/
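
A minimal char-level RNN in that spirit (a sketch with a stand-in corpus, not Karpathy's actual code):

    import torch
    import torch.nn as nn

    text = "to be or not to be that is the question "  # stand-in corpus
    chars = sorted(set(text))
    stoi = {c: i for i, c in enumerate(chars)}
    data = torch.tensor([stoi[c] for c in text])

    class CharRNN(nn.Module):
        def __init__(self, vocab, hidden=64):
            super().__init__()
            self.emb = nn.Embedding(vocab, hidden)
            self.rnn = nn.RNN(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, x, h=None):
            z, h = self.rnn(self.emb(x), h)
            return self.out(z), h

    model = CharRNN(len(chars))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    x, y = data[:-1].unsqueeze(0), data[1:].unsqueeze(0)  # next-char prediction
    for _ in range(200):
        logits, _ = model(x)
        loss = nn.functional.cross_entropy(logits.reshape(-1, len(chars)), y.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()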

vidarh
0 replies
1d3h

For RNN abilities, RWKV is worth a look[1]

It's billed as "an RNN with GPT-level LLM performance".

[1] https://www.rwkv.com/

Fripplebubby
2 replies
1d3h

Love this post! Gets into the details of what it _really_ means to take some function and turn it into an RNN, and comparing that to the "batteries included" RNNs included in PyTorch, as a learning experience.

Question:

> To model the state, we need to add three hidden layers to the network.

How did you determine that it would be three hidden layers? Is it a consequence of the particular rule you were implementing, or is that generally how many layers you would use to implement a rule of this shape (using your architecture rather than Elman's - could we use fewer layers with Elman's?)?

gnyeki
1 replies
1d2h

I'm glad you found it valuable! Both are good questions and I haven't gone far enough mapping the code to Elman's architecture to know the answer to the second.

For your first question, using three hidden layers makes it a little clearer what the network does. Each layer performs one step of the calculation. The first layer collects what is known from the current token and what we knew after the calculation for the previous token. The second layer decides whether the current token looks like program code, by checking if it satisfies the decision rule. The third layer compares the decision with what we decided for previous tokens.

I think that this could be compressed into a single hidden layer, too. A ReLU should be good enough at capturing non-linearities so this should work.
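
A schematic of that wiring (my sketch of the described pipeline, not the article's actual weights or dimensions):

    import torch
    import torch.nn as nn

    class ThreeStepCell(nn.Module):
        def __init__(self, n_in, n_h):
            super().__init__()
            self.h1 = nn.Linear(n_in + n_h, n_h)  # combine token features with prior state
            self.h2 = nn.Linear(n_h, n_h)         # apply the "looks like code" rule
            self.h3 = nn.Linear(n_h, n_h)         # reconcile with earlier decisions
            self.relu = nn.ReLU()

        def forward(self, x, state):
            z = self.relu(self.h1(torch.cat([x, state], dim=-1)))
            z = self.relu(self.h2(z))
            return self.relu(self.h3(z))          # becomes the next state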

Fripplebubby
0 replies
1d1h

Ah, that makes sense. So, we consider two hidden layers more as "memory" or "buffers", and actually the rule is implemented in just one layer, at least for a single token.

fnord77
1 replies
1d2h

> To model the state, we need to add three hidden layers to the network

Why 3?

And why use "h" for layer names?

muricula
0 replies
23h22m

`h` is for "hidden" layers.

danans
1 replies
1d4h

I'd like to see a cost vs. precision/recall comparison of using an RNN vs. an LLM (local or API) for a problem like this.

FeepingCreature
1 replies
9h56m

Fwiw, I know LGTM as "let's get this moving" on pull requests. Seems to be contested.

random3
0 replies
8h10m

LGTM means "looks good to me" in review parlance

ultra_nick
0 replies
1d

I feel like neural networks are increasingly going to look like code.

The next big innovation will be whoever figures out how to convert MoE-style models into something like function calls.

thih9
0 replies
1d3h

> Of course, we should try and avoid writing spaghetti code if we can. But there are problems that are so ill-specified that any serious attempt to solve them results in just that.

Can you elaborate or do you have an example?

Based on just the above, I disagree - I'd say it's the job of the programmer to make sure that the problem is well-specified and that they can write maintainable code.

suzukigsx1100g
0 replies
21h38m

That’s pretty light work for a snake in general. Send me a direct message if you come up with something better.

skybrian
0 replies
1d

This article doesn't talk much about testing or getting training data. It seems like that part is key.

For code that you think you understand, it's because you've informally proven to yourself that it has some properties that generalize to all inputs. For example, a sort algorithm will sort any list, not just the ones you tested.

The thing we're uncertain about for a neural network is that we don't know how it will generalize; there are no properties that we think are guaranteed for unseen input, even if it's slightly different input. It might be because we have an ill-specified problem and we don't know how to mathematically specify what properties we want.

If you can actually specify a property well enough to write a property-based test (like QuickCheck), then you can generate large amounts of tests / training data through randomization. Start with one example of what you want, then write tests that generate every possible version of both positive and negative examples.
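
A sketch of that idea with the hypothesis library (the classifier and its rule here are hypothetical stand-ins):

    from hypothesis import given, strategies as st

    def looks_like_code(s: str) -> bool:
        # Stand-in rule for the sketch, not a real classifier.
        return "=" in s or s.rstrip().endswith(":")

    @given(st.text(alphabet=st.characters(min_codepoint=32, max_codepoint=126)))
    def test_plain_prose_is_not_code(s):
        prose = s.replace("=", "").replace(":", "")
        assert not looks_like_code(prose)  # a property; failures become training data

    test_plain_prose_is_not_code()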

It's not a proof, but it's a start. At least you know what you would prove, if you could.

If you have such a thing, relying on spaghetti code or a neural network seems kind of similar? If you want another property to hold, you can write another property-based test for it. I suppose with the neural network you can train it instead of doing the edits yourself, but then again we have AI assistance for code fixes.

I think I'd still trust code more. At least you can debug it.

ryjo
0 replies
1d3h

Really awesome. Thanks for this thorough write-up. I don't totally understand the deeper math concepts mentioned in this article around RNNs, but it's sparked some of my own thoughts. It feels similar to things I've been exploring lately, that is: building your app interwoven with forward-chaining algorithms. In your case, you're using RNNs, and in mine, I'm building on the Rete algorithm.

You also touch on something in this article that I've found quite powerful: putting things in terms of digesting an input string character-by-character. Then, we offload all of the reasoning logic to our algorithm. We write very thin i/o logic, and then the algorithm does the rest.

lowyek
0 replies
16h14m

I would love to see some work on duality, i.e. converting code to a neural network and back. Reason being: I can't debug a neural network, but if it can be linearized into if-else cases with the help of the token information => I can validate what it's doing -> fix it, and then move it back to its compressed neural representation.

Just another thought experiment -> sometimes I imagine neural networks as a zip of the training data, where the compression algorithm is backpropagation. Just like we have programs which let us see what files are inside the zip -> I imagine there can be programs which will let us select a certain inference path of the neural net and then see what data affected that => then we edit that data to fix our issues or add more data there => and we have live neural network debugging and reprogramming, in the same way we edit compressed zips.

lawlessone
0 replies
1d4h

Edit: ok i see it detects code.

I thought it was replacing bits of ANN with custom python functions.

jpe90
0 replies
3h50m

I recently wrote a blog post exploring the idea of interfacing with local LLMs for ambiguous tasks like this. Doesn't that make more sense than coding the neural network yourself? Using something like llama.cpp and evaluating whether a small model solves your problem out of the box, and fine-tuning if not, then programmatically interfacing with llama.cpp via a wrapper of your choice seems more pragmatic to me.

godelski
0 replies
1d

  > Humans are bad at managing spaghetti code. Of course, we should try and avoid writing spaghetti code if we can. But there are problems that are so ill-specified that any serious attempt to solve them results in just that.
Sounds like a skill issue.

But seriously, how many programmers do you know that reach for the documentation or help pages (man pages?) instead of just looking for the first SO post with a similar question? That's how you start programming, because you're just trying to figure out how to do anything in the first place, but it's not where you should be years later. If you've been programming in a language for years, you should have read a good portion of the docs in that time (in addition to SO posts), blogs, and so much more. Things change too, so you have to keep up, and the truth is that this will never happen if you just read SO posts to answer your one question (and the next, and the next), because they will always lag behind what tools exist, and likely lag significantly, because more recent posts have had less time to gain upvotes.

It kinda reminds me of the meme "how to exit vim." And how people state that it is so hard to learn. Not only does just typing `vim` into the terminal literally tell you how to quit, but there's a built in `vimtutor` that'll tell you how to use it and doesn't take very long to use. I've seen people go through this and be better than people that have "used" vim for years. And even then, how many people write `:help someFunction` into vim itself? Because it is FAR better than googling your question and you'll actually end up learning how the whole thing fits together because it is giving you context. The same is true for literally any programming language.

You should also be writing docs for your code, because if you have spaghetti code, there's a puzzle you haven't solved yet. And guess what, documenting is not too different from the rubber ducky method. Here's the procedure: write code to make shit work; write docs and edit your code as you realize you can make things better; go on and repeat, revisiting functions as you fuck them up with another function. It's not nearly as much work as it sounds, and the investments compound. But quality takes time and nothing worth doing is easy. It takes time to learn any habit and skill. If you always look for the quickest solution to "just get it done" and you never come back, then you probably haven't learned anything; you've just parroted someone else. Moving fast and breaking things is great, but once you have done that you've got to clean up your mess. You don't clean your kitchen by breaking your dining room table. And your house isn't clean if all your dishes are on the table! You might have to temporarily move stuff around, but eventually you need to clean shit up. And code is exactly the same way. If you regularly clean your house, it stays clean and is easy to keep clean. But if you do it once a year, it is a herculean effort that you'll dread.

dinobones
0 replies
1d

This article was going decently and then it just falls off a cliff.

The article basically says:

1) Here's this complex problem

2) Here's some hand-written heuristics

3) Here's a shitty neural net

4) Here's another neural net with some guy's last name from the PyTorch library

5) Here are the constraints of adopting neural nets

You can see why this is so unsatisfying, the leaps in logic become more and more generous.

What I would have loved to see, is a comparison of a spaghetti code implementation vs a neural net implementation on a large dataset/codebase, then show examples in the validation set that maybe the neural net generalizes to, or fails at, but the heuristic fails at, and so on.

This would demonstrate the value of neural nets, if for example, there’s a novel example that the neural net finds that the spaghetti heuristic can’t.

Show tangible results, show some comparison, show something, giving some rough numbers on the performance of each in aggregate would be really useful.

awwaiid
0 replies
5h26m

OK so first compile python to a NN. But next let's twist or overlay that onto a Transformer-based NN. Then we can have a Transformer Virtual Machine (TVM) execute arbitrary programs.

Use some of that transfer-learning (adding weights on top of each other) and an LLM can be "born" with an algorithm deeply encoded.

29athrowaway
0 replies
14h15m

Next time do one with Bayesian networks or another Probabilistic graphical model.