
σ-GPTs: A new approach to autoregressive models

cs702
30 replies
2d4h

This looks great.

The authors randomly permute (i.e., shuffle) input tokens in training and add two positional encodings to each token: one with the token's position and another with the position of the token to be predicted. Otherwise, the model is a standard autoregressive GPT. The consequences of this seemingly "simple" modification are significant:

* The authors can prompt the trained model with part of a sequence and then decode the missing tokens, all at once, in parallel, regardless of order -- i.e., the model can in-fill in parallel.

* The authors can compute conditional probability densities for every missing token in a sequence, again in parallel, i.e., densities for all missing tokens at once.

* The authors propose a rejection-sampling method for generating in-fill tokens, again in parallel. Their method seems to work well in practice.

I've added this to my reading list. Thank you for sharing it on HN.

thomashop
10 replies
2d2h

I don't understand how that parallel prediction can work...

Let's say I give it as input the sentence:

I . . . . . . . . happily.

The second word to be predicted depends on the first word.

cs702
9 replies
2d2h

Give the model the tokens "happily" and "I", and add to each input token its respective position embedding and the position embedding for the token to be predicted. You can do this in parallel for all tokens to be predicted. The model has been trained so it can predict tokens in any position.
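
Roughly, as a toy sketch (made-up shapes and token ids, not the paper's actual code):

    import torch

    vocab_size, max_len, d_model = 100, 32, 16   # made-up sizes
    tok_emb = torch.nn.Embedding(vocab_size, d_model)
    pos_emb = torch.nn.Embedding(max_len, d_model)

    input_ids  = torch.tensor([3, 7])   # hypothetical ids for "I" and "happily"
    input_pos  = torch.tensor([0, 8])   # where each known token sits
    target_pos = torch.tensor([1, 7])   # which missing position each one should predict

    # Each input vector carries the token, its own position, and the target position.
    # (A comment further down notes the paper concatenates the positionals rather
    # than summing; summing keeps the sketch short.)
    x = tok_emb(input_ids) + pos_emb(input_pos) + pos_emb(target_pos)
    # x goes through a standard causal transformer, producing a distribution
    # for every targeted position in a single forward pass.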

hexomancer
7 replies
2d2h

Yes, but is there any guarantee that the complete sentence makes sense?

entropicdrifter
5 replies
2d2h

That guarantee didn't exist with regular GPT LLMs, did it? It just came about as an emergent property of throwing more and more compute, training data, and training time at the problem

amluto
3 replies
1d22h

I think it’s effectively built in to the design. The model outputs a probability distribution for the first unknown token [0]. Then some code outside the model chooses a token and runs the model again with that token provided to the model. So the second output token’s probability distribution is automatically conditioned on the first output token, etc.
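
In sketch form (a toy loop assuming a model(ids) -> logits interface, not any particular library's API):

    import torch

    def generate(model, prompt_ids, n_new, temp=1.0):
        ids = list(prompt_ids)
        for _ in range(n_new):
            logits = model(torch.tensor([ids]))[0, -1]    # distribution over the next token
            probs = torch.softmax(logits / temp, dim=-1)  # see footnote [0]
            next_id = torch.multinomial(probs, 1).item()  # the "code outside the model" choosing
            ids.append(next_id)                           # the next step is conditioned on it
        return ids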

Sometimes people will attempt to parallelize this by using a faster model to guess a few tokens and then evaluating them as a batch with the main model to determine whether the choices were good.

[0] Usually it outputs “logits”, which become a probability distribution when combined with a “temperature” parameter.

qeternity
2 replies
1d20h

I think it’s effectively built in to the design.

It isn't. There is no guarantee that successive tokens will be comprehensible.

Usually it outputs “logits”, which become a probability distribution when combined with a “temperature” parameter.

The logits are the probability distribution (well technically, you would apply softmax). Temperature is a parameter for how you sample those logits in a non-greedy fashion.

hexaga
1 replies
1d19h

Temperature is a parameter for how you sample those logits in a non-greedy fashion.

I think temperature is better understood as a pre-softmax pass over logits. You'd divide logits by the temp, and then their softmax becomes more/less peaky.

    probs = (logits / temp).softmax(dim=-1)
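
For example, with made-up logits:

    import torch
    logits = torch.tensor([2.0, 1.0, 0.0])
    torch.softmax(logits / 0.5, dim=-1)   # low temp, peakier:  ~[0.87, 0.12, 0.02]
    torch.softmax(logits / 2.0, dim=-1)   # high temp, flatter: ~[0.51, 0.31, 0.19]
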
Sampling is a whole different thing.

qeternity
0 replies
1d19h

Sure, my comment about softmax was simply about the probability distribution. But temperature is still part of sampling. If you’re greedy decoding, temperature doesn’t matter.

alextheparrot
0 replies
1d22h

No, but it makes more conceptual sense given the model can consider what was said before it

toxik
0 replies
2d1h

That is indeed an issue. Their sampling method rejects impossible combinations.

KRAKRISMOTT
0 replies
2d2h

Isn't this bag of words all over again? Except with positional hints?

mglikesbikes
5 replies
1d20h

Off topic, but what do you use for your reading list?

ofou
0 replies
1d20h

I use Emergent Mind[1] to keep track of new research published on ArXiv. You can bookmark articles once logged in. It's very useful for keeping track of articles, reading quick summaries, and following conversations on various social media.

[1]: https://www.emergentmind.com/papers/2404.09562

inhumantsar
0 replies
1d19h

Hijacking for a bit of shameless self-promotion: if you're an Obsidian user, I recently built a plugin that simplifies web pages, parses out metadata, and saves them to Obsidian as markdown files: https://github.com/inhumantsar/slurp

arXiv comes through a bit ugly atm but it's high on my to-do list. I'm leveraging the library that Firefox uses for reader mode, so most sites come through quite well. A lot of my work right now is expanding their metadata support and fixing parser issues.

cs702
0 replies
1d19h

old-fashioned text files

concurrentsquar
0 replies
1d16h

Google Chrome has a built-in reading list (go open the 3-dotted menu at the top-right corner, then click on "Bookmarks and lists" -> "Reading list")

barfbagginus
0 replies
1d18h

Zotero is great for organizing and annotating papers, keeping notes, and building bibliographies.

You can create libraries and sub libraries according to topic, and also create libraries for projects or reading lists. You can file items into multiple libraries, and you can also create shared libraries, allowing your team to share annotated papers.

Finally it can archive offline copies of web pages, which makes it useful for blog articles and other online resources that might vanish.

There's a learning curve, but it's worth it if you find yourself juggling dozens or hundreds of technical papers! Enjoy!

tripplyons
4 replies
2d

The only difference I see from XLNet is how they use it during inference.

arnaudpannatier
3 replies
1d23h

Hey! I'm Arnaud, first author of the paper. XLNet also shuffles the data during training, but it uses a masking mechanism instead of the causal + double positional encoding. The application also differs: XLNet is not, AFAIK, focused on generation (even if it can be used for that), and the burst-sampling idea is new.

RivieraKid
1 replies
1d22h

Are there any obvious practical applications of this algorithm for existing large (10B+) text / image models?

Does the rejection sampling lead to a statistically correct sample from the joint probability distribution or is that just a (possibly rough) approximation?

arnaudpannatier
0 replies
1d5h

For the application: being able to prompt anywhere in the sequence can be of interest. From what we've seen in the experiments, the rejection sampling leads to generations similar to the autoregressive ones; we did not see any mode collapse or anything of that kind.

tripplyons
0 replies
1d22h

Thanks for the clarification!

nico
2 replies
1d22h

I know this is for tokens/text, but can the same concept be applied to images using something like a diffusion model? And then be able to upscale images arbitrarily by infilling?

gwern
1 replies
1d22h

Yes. See the related work section in the paper: there is a long history of models, recently like MAE and MaskGit, which predict pixels in basically arbitrary orders, and that is useful because it lets you train on subsets of each image, upscale/infill during generation, and so on. (If you know what MAE is, that might be the fastest way to summarize OP: "it's a GPT trained like a MAE".)

psb217
0 replies
1d21h

People also often forget "orderless autoregression", which was introduced a while back and has been reinvented many times since. See Sec 4 (pg 8) of "Neural Autoregressive Distribution Estimation" [https://arxiv.org/abs/1605.02226]. The main difference from current work is that this 2016 paper used MLPs and convnets on fixed-length observations/sequences, so sequence position is matched one-to-one with position in the network's output, rather than conditioning on a position embedding. Of course, Transformers make this type of orderless autoregression more practical for a variety of reasons -- TFs are great!

Key quote from Sec 4: "In this section we describe an order-agnostic training procedure, DeepNADE (Uria et al., 2014), which will address both of the issues above. This procedure trains a single deep neural network that can assign a conditional distribution to any variable given any subset of the others. This network can then provide the conditionals in Equation 1 for any ordering of the input observations. Therefore, the network defines a factorial number of different models with shared parameters, one for each of the D! orderings of the inputs. At test time, given an inference task, the most convenient ordering of variables can be used."

RivieraKid
1 replies
1d22h

If there are multiple missing tokens, what's the positional encoding for the "token to be predicted"?

toxik
0 replies
2d1h

This problem formulation has been around for a while, it’s kind of the holy grail of modeling. What is new compared to PixelCNN and related is this position embedding idea.

taneq
0 replies
2d1h

Wow, if that works that's wild (and also has that "damn, now you say it, it's obvious" flavour that so many really cool discoveries share...)

WanderPanda
0 replies
1d23h

Wait wasn't BERT all about non-causal masking aka predicting words in the middle?!

optimalsolver
15 replies
2d3h

Yann LeCun would say [0] that it's autoregression itself that's the problem, and ML of this type will never bring us anywhere near AGI.

At the very least you can't solve the hallucination problem while still in the autoregression paradigm.

[0] https://twitter.com/ylecun/status/1640122342570336267

andreasmetsala
4 replies
2d3h

Does everything have to take us towards AGI? If someone makes an LLM that's faster (cheaper) to run, then that has value.

I don’t think we want AGI for most tasks unless the intent is to produce suffering in sentient beings.

ben_w
3 replies
1d23h

I don’t think we want AGI for most tasks unless the intent is to produce suffering in sentient beings.

Each letter of "AGI" means different things to different people, and some use the combination to mean something not present in any of the initials.

The definition OpenAI uses is for economic impact, so for them, they do want what they call AGI for most tasks.

I have the opposite problem with the definition, as for me, InstructGPT met my long-standing definition of "artificial intelligence" while suddenly demonstrating generality in that it could perform arbitrary tasks rather than just next-token prediction… but nobody else seems to like that, and I'm a linguistic descriptivist, so I have to accept words aren't being used the way I expected and adapt rather than huff.

barfbagginus
2 replies
1d18h

I call GPT an AGI

1. To highlight that the system passes the turing test and has general intelligence abilities beyond the median human

2. To piss off people who want AGI to be a God or universal replacement for any human worker or intellectual

The problem with AGI as a universal worker replacement - the way that it can lead to sentient suffering - is the presumption that these universal worker replacements should be owned by automated corporations and hyper wealthy individuals, rather than by the currently suffering sentient individuals who actually need the AI assistance.

If we cannot make Universal Basic AGI that feeds and clothes everyone by default as part of the shared human legacy - UBAGI - then AGI will cause harm and suffering.

ben_w
1 replies
1d8h

1. To highlight that the system passes the turing test and has general intelligence abilities beyond the median human

I think that heavily depends on what you mean by "intelligence", which in turn depends on how you want to make use of it. I would agree that it's close enough to the Turing test as to make the formal test irrelevant.

AI training currently requires far more examples than any organic life. It can partially make up for this by transistors operating faster than synapses by the same ratio to which marathon runners are faster than continental drift — but only partially. In areas where there is a lot of data, the AI does well; in areas where there isn't, it doesn't.

For this reason, I would characterise them as what you might expect from a shrew that was made immortal and forced to spend 50,000 years reading the internet — it's still a shrew, just with a lot of experience. Book smarts, but not high IQ.

With LLMs, the breadth of knowledge makes it difficult to discern the degree to which they have constructed a generalised world model vs. have learned a lot of catch-phrases which are pretty close to the right answer. Asking them to play chess can result in them attempting illegal moves, for example, but even then they clearly had to build a model of a chess board good enough to support the error instead of making an infinitely tall chess board in ASCII art or switching to the style of a chess journalist explaining some famous move.

For a non-LLM example of where the data-threshold is, remember that Tesla still doesn't have a level 4 self-driving system despite millions of vehicles and most of those operating for over a year. If they were as data-efficient as us, they'd have passed the best human drivers long ago. As is, while they have faster reactions than we do and while their learning experiences can be rolled out fleet-wide overnight, they're still simply not operating at our level and do make weird mistakes.

barfbagginus
0 replies
7h44m

So your points are

I don't really pick a definition of intelligence

LLMs could be regurgitating training data

they take more data to train than humans

They can't do some tasks, like driving

However, in my experience, LLMs are more empathetic than humans, more able to help me reason about my feelings and communication problems than humans, less likely to commit microaggressions or be racist or ableist than humans, and better at math and science than most humans. These are just my personal feelings as an autistic person, which I can back up only loosely with benchmark data, but which I expect the world will come to realize over the next few years.

So in terms of being able to constructively interact with me in an intelligent and helpful way, LLMs are often more useful than the humans I have access to. I say they are smarter than those people as well, because the AI will give me solutions that are useful, and which other humans could not give me.

The fact that it cannot drive doesn't bother me, since I don't consider driving a general skill but a specialized one. It can still have general intelligence without being able to do some specific things. Going back to my original post, I specifically reject AGI definitions where, to be generally intelligent, the AI has to outperform humans in every possible skill. I would consider that a superintelligent AGI.

As for the information problem and data issue, AIs so far have been black boxes isolated from reality, and we haven't solved the online continuous learning problem. I believe that as we turn AIs into agents that are constantly interacting with reality via high-bandwidth token streams, we will have a lot more data to train with. I also believe that we'll start being able to train continuously on that data. Then, even assuming that training is no more efficient than it is today, I think the extra data could make the difference.

I'm also not convinced that AI won't eventually be able to learn from as little data as humans do. I don't think it has to be the case, and I also don't discount the possibility of an AI winter that leaves AI less efficient than humans for a long, long time, maybe even forever. However, I also feel like we may come to understand why humans learn so fast, and might be able to transfer some insights into artificial systems. I also know that people will be trying very hard to solve the AI energy and data usage problems, since they're major threats to large-scale AI adoption. So we'll be trying really hard to do it, and we'll have a blueprint for how to do it - our brains. That means there's a chance we'll crack that problem.

Finally the regurgitation issue is irrelevant to intelligence - just like it would be irrelevant if the brain is secretly just regurgitating stuff it learned. Because the brain can also do novel things.

Furthermore, we know that LLMs can learn and usefully reason about context information outside of their training distributions. This is called in-context learning.

For example if I come from a culture that the AI was not really well trained on, I can give it four or five examples of values that are important to me in that culture, and then it will be able to extrapolate how to apply or respect those values in situations that I present.

And again, here's the kicker: it'll do this more faithfully than the average person. Remember that if you tell a person five values from a culture outside of their own, and ask them to uphold those values... perhaps half will just get angry and give you some kind of racist slur, and then 80% of the remainder will lack the empathy and mental flexibility to do a good job.

Finally I need to point out that I have studied AI for over two decades out of books starting from the '80s, then the '90s, then the 00s and 10s. And the change in the literature and capabilities has been unreal.

Perhaps you are forgetting how feeble AI was before, or simply not putting it to use. There are many, many tasks that no AI from over 3 years ago could have touched, and now suddenly you can do them for just a $20-a-month subscription.

The change in capabilities is so drastic that I wonder if you're simply discounting that change because you're not using AI, comparing it to old AI, or seeing it enable things that no AI before could have possibly done, no matter how hard you tried.

So to conclude: the change has been so great, has enabled so many new things, is such a big departure from old AI, and these systems consistently outperform humans on so many tasks I find important, that I feel it would be senseless to say there isn't some intelligence there - some useful information-processing capability that I can depend on and rely on more than a human in many tasks and settings where humans are consistently bad. In fact, it would be harmful for me not to recognize that these things have changed, because I would not be benefiting from them.

barfbagginus
3 replies
1d18h

Can I please convert you into someone who summarily barks at people for committing the LeCun Fallacy, rather than committing the LeCun Fallacy yourself?

And can you stop talking about AGI when it's not relevant to a conversation? Let's call that the AGI fallacy - the argument that a given development is worthless - despite actual technical improvements - because it's not AGI or supposedly can't lead to AGI.

It's a problem.

Every single paper on transformers has some low-information comment to the effect of, "yeah, but this won't give us AGI because of the curse of LeCun". The people making these comments never care about the actual improvement, and are never looking for improvements themselves. It becomes tiring to people, like yours truly :3, who do care about the work.

Let's look at the structure of the fallacy. You're sidestepping the "without a major redesign" in his quote. That turns his statement from a statement of impossibility into a much weaker statement saying that autoregressive models currently have a weakness. A weakness which could possibly be fixed by redesign, which LeCun admits.

In fact this paper is a major redesign. It solves a parallelism problem, rather than the hallucination problem. But it still proves that major redesigns do sometimes solve major problems in the model.

There could easily arise an autoregressive model that allows progressive online updating from an external world model - that's all it takes to break LeCun's curse. There's no reason to think the curse can't be broken by redesign.

optimalsolver
2 replies
1d10h

This thing will still hallucinate, no matter what new bells and whistles have been attached to it, meaning it will never be used for anything important and critical in the real world.

barfbagginus
0 replies
8h23m

You have no proof that every modification of the architecture will continue to hallucinate. How could you prove that? Even LeCun admits that the right modification could solve the issue.

You're trying to make this point in a circular way - saying it's impossible just because you say it's impossible - for some reason other than trying to get to the bottom of the truth. You want to believe that there's some kind of guarantee that no offspring of the autoregressive architecture can ever get rid of hallucinations.

I'm saying there's simply no such guarantee.

barfbagginus
0 replies
6h17m

Plus, humans bullshit all the time, even well-paid and highly trained humans like doctors and lawyers. They will bullshit while charging you $400 an hour. Then they'll gaslight you if you try to correct their bullshit.

AI will bullshit sometimes, but you can generally call it on the bullshit and correct it.

For the tasks that it helps me with, I could work with a human. But the human I could afford would be a junior programmer. Not only do they bullshit more than a well-prompted AI, but I also have to pay them $30 an hour, and they can't properly write specs or analyze requirements. GPT-4 can analyze requirements much better than a junior, and in many ways better than me. For pennies.

I do use it in the real world, to maintain and develop software for the industrial design company I own. It would be foolish if I didn't. I've been able to modernize all our legacy code and build features that used to stump me.

Maybe the fact is that I'm an incompetent programmer, and that's why I find it so helpful.

If that's the case so be it! It's still a significant help that is accessible to me. That matters!

vessenes
2 replies
2d3h

I think this method might actually not be susceptible to the exponential divergence argument.

Depending on token sampling methods, this one could look at a proposed generation as a whole and revise it. I’m not sure the current token sampling method they propose does this right now, but I think it’s possible with the information they get out of the probabilities.

modeless
1 replies
2d1h

Yes, to me this seems to address LeCun's objection, or at least point the way to something that does. It seems possible to modify this into something that can identify and correct its own mistakes during the sampling process.

vessenes
0 replies
1d12h

Well, I understand LeCun to have a broader critique: that any sort of generated-in-a-vacuum text which doesn't interact with meatspace is fundamentally going to be prone to divergence. Which I might agree with, but which is also, just, like, his opinion, man. Or, put less colloquially, that's a philosophical stance sitting next to the math argument for divergence.

I do think this setup can answer (much of) the math argument.

sebzim4500
0 replies
1d19h

LeCun is a very smart guy but his track record predicting limitations of autoregressive LLMs is terrible.

cs702
0 replies
2d3h

LeCun may or may not be right, but I'm not sure this is relevant to the discussion here.

The OP's authors make no claims about how their work might help get us closer to AGI.

They simply enable autoregressive LLMs to do new things that were not possible before.

TheEzEzz
0 replies
2d2h

LeCun is very simply wrong in his argument here. His proof requires that all decoded tokens are conditionally independent, or at least that the chance of a wrong next token is independent across positions. This is not the case.

Intuitively, some tokens are harder than others. There may be "crux" tokens in an output, after which the remaining tokens are substantially easier. It's also possible to recover from an incorrect token auto-regressively, by outputting tokens like "actually no..."

szvsw
3 replies
2d3h

Wow, really cool concept! I wonder if this starts to become similar dynamics to what we see in image generation models, where structure/detail emerges in one region of the image and then the surrounding areas start to resolve themselves into place. That kind of behavior seems particularly useful for longer reasoning/logic/planning, where the big ideas might become apparent first, and then the interstitial details and text just naturally fill in…

byteknight
2 replies
2d3h

The process you describe is referred to as diffusion

szvsw
0 replies
2d2h

Yep yep I know, but I was trying to suggest something diffusion-like occurring with a language model through a totally separate mechanism that does not rely on the denoising process (at least not literally).

immibis
0 replies
2d2h

I'm fairly certain diffusion refers to the overall architecture, not the emergent self-organization process.

skilled
3 replies
2d4h

The main idea is to train the model to generate sequences in a random order, which allows conditional density estimation, infilling, and generating sequences in bursts using a novel rejection sampling method.

In exploring that idea, we also compared it to a discrete diffusion baseline, which can likewise generate sequences in bursts. We were surprised to see that diffusion models were able to solve the path-finding task, and we made a short Twitter thread

The said thread:

https://nitter.poast.org/ArnaudPannatier/status/176286434739...

And a showcase here:

https://www.idiap.ch/~apannatier/sigma-gpt/

(excerpt taken from here: https://www.idiap.ch/~apannatier/)

3abiton
1 replies
2d2h

I just wonder if such models based on their method would make hallucination even worse.

arnaudpannatier
0 replies
1d23h

Hey, I'm Arnaud, first author of the paper. The answer is a bit mixed. We actually started looking into this because of a repetition problem that appeared in a low-data regime for a sequence generation task. Basically, the left-to-right GPT got stuck repeating the same token once it had sampled it twice in a row during generation. To mitigate that, we tried generating the sequence in a random order; it seemed to help, and we see less of this repetition issue. We initially thought that when we don't have enough data, shuffling would act like data augmentation and might actually help the model reach better performance. But this is not what we found in the experiments: apparently, because learning in any order is a harder task, the model memorises the data more.

nico
0 replies
1d22h

We were surprised to see that diffusion models were able to solve the path-finding task

I wonder if this type of method might allow for faster solutions to the traveling salesman problem

behnamoh
3 replies
2d1h

Title is incorrect: it's σ not Σ.

modeless
2 replies
2d1h

Σ is uppercase σ. Maybe this happened automatically? Pretty funny if so. Correct in a Greek context; clearly incorrect in a math context.

mehulashah
1 replies
2d

Yes, HN automatically did that.

modeless
0 replies
2d

For future reference, it is possible to edit the titles of stories you've submitted. This allows you to correct any errors introduced by HN's title rewriting heuristics at submission time, without waiting for a moderator to do it for you. Just like for comments, though, the edit window is time limited. For comments the window is two hours. I don't know if it's the same for story titles.

vessenes
2 replies
1d12h

I kept thinking about this paper today, and I really like the capabilities.

A number of things that are relatively hard for sequential LLMs are easy here. Want JSON? Fix curly-brace tokens to the beginning and end.

Want a specific token-length explanation of an answer? Write a short answer, post-pend it, and infill.

Want a higher-density answer to something? Add a density assessment section to your generated text, a space for the LLM to score info-density, and generate looking for a high density score.

I would guess there's a lot here to be experimented with. It would be nice to get an 8B-parameter model with a reasonable number of tokens (3x, based on the paper, sadly) through it.

zakkor1
1 replies
1d12h

Fix curly brace tokens to the beginning

Regular LLMs can already do this, by prefilling the start of the assistant's response.

But there is actually something even better: you can constrain the LLM's output to a specific grammar (like JSON), so it'll only be able to answer with syntactically valid JSON.
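
Roughly, that works as a mask-then-sample step (a toy sketch with a hypothetical allowed_token_ids set, not any particular library's API):

    import torch

    def constrained_step(logits, allowed_token_ids):
        # Mask every token the grammar disallows at this position,
        # then renormalize and sample from what remains.
        mask = torch.full_like(logits, float("-inf"))
        mask[allowed_token_ids] = 0.0
        probs = torch.softmax(logits + mask, dim=-1)
        return torch.multinomial(probs, 1).item()

    # allowed_token_ids would come from a grammar engine tracking which
    # continuations keep the partial output valid JSON.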

vessenes
0 replies
22h56m

Yes. And you can have a grammar parser only select from valid tokens in a randomized distribution. But, this feels much more sophisticated to me, especially if you can mix specific token-based grammar requirements with other instructions during the token selection phase.

lukasb
1 replies
1d23h

Weird that they chose an example that ended up somewhat nonsensical.

sebzim4500
0 replies
1d18h

Part of the issue is they are training a pretty tiny model; it's not like GPT-2 ~100M is especially coherent either.

mbil
2 replies
2d

I wonder if this would help especially for computer code generation, where what is output at a given step may materially depend on what would be written at a later step.

mbil
1 replies
1d23h

And, though maybe prohibitively slow, perhaps integrate some kind of linting or syntax checking into the rejection sampling, i.e., burst-sample N candidate snippets in parallel and reject those that are syntactically invalid.
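
Something like this as a post-filter, assuming the burst sampler hands back a list of candidate Python snippets (a rough sketch; a real linter could slot in the same way):

    import ast

    def keep_syntactically_valid(snippets):
        valid = []
        for code in snippets:
            try:
                ast.parse(code)        # cheap syntax check
                valid.append(code)
            except SyntaxError:
                continue               # reject this burst-sampled candidate
        return valid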

barfbagginus
0 replies
1d18h

It would be nice if it could diffuse right on the AST. That would ensure each generated item passes a syntax check, without the waste of rejection sampling

hammock
2 replies
2d4h

Add to this some kind of "autofocus" for the user to click on the word that is the "center" of the prompt and you've really got something

MaheshNat
1 replies
1d15h

What exactly do you mean by "autofocus"?

hammock
0 replies
1d

I mean the user clicks on the word that is the "focus" of the prompt the way that you click on a camera screen

bigyikes
1 replies
2d3h

Is this applying the learnings from vision transformers to language transformers?

If I understand correctly, vision models split an image into tiles and append a positional encoding to each so the model can understand the relative position of each tile.

I admittedly only read the abstract - a lot of this stuff goes over my head - but it seems like this paper proposes a similar idea, but for 1D instead of 2D?

seurimas
0 replies
2d3h

Positional encoding is standard for transformers of all stripes. They introduce a seemingly novel, doubled positional encoding scheme. It's more difficult to train, but seems to enable producing multiple tokens at once (i.e., you could get an answer that is N tokens long in N/x steps instead of N steps).

omernivro
0 replies
20h40m

This is an interesting study. A similar permutation approach appears already in the Taylorformer paper (https://arxiv.org/pdf/2305.19141v1). The authors use a Transformer decoder for continuous processes, like time series. During training, each sequence is shuffled randomly. Each sequence element has a positional encoding. Then, they use log-likelihood on the shuffled sequence. There, the permutation helps with predictions for interpolation, extrapolation and irregularly sampled data. Also, they show it helps with 'consistency', i.e., roughly the MSE is the same regardless of the generated order.

What might this paper add to our understanding or application of these ideas?

The idea of permuting the sequence order also appears in the Transformer Neural Process paper: https://arxiv.org/pdf/2207.04179.

nsagent
0 replies
1d16h

What's old [1] is new again... without citing prior work. It's not like it's an unknown work. It was published in ICML and has ~250 citations.

[1]: https://arxiv.org/abs/1902.03249

naveen99
0 replies
1d15h

BERT had random masking of tokens in the sequence. But time is sequential.

klysm
0 replies
2d4h

Encoding the sequence like that seems like a really clever workaround for some of the data dependency limitations of GPT.

barfbagginus
0 replies
1d18h

Great, now I'm imagining GPT flexing its roided biceps while making sigma faces, as edgy incel music goes hard in the background with spiky synths and a boy's choir.

After seeing how awesome the showcase looks, I'm not even sure I'm mad about this, lol

aconz2
0 replies
1d17h

Is there code somewhere? I don't totally understand the double positional encoding and the shuffling. Interesting that they use concat instead of plus for the positionals.

ETH_start
0 replies
1d10h

This is not a phonetically friendly acronym.