Jeff Dean: Trends in Machine Learning [video]

I see from comments I'm far from the only one using AI to summarise videos before deciding whether to watch them.

Reminds of the meme "why spend 10 minutes doing something when you can spend a week automating it". i.e. "why spend an hour watching a talk when you can spend 5 hours summarising it with AI and debating the summary's accuracy".

This sounds silly but potential gains from learning AI summarisation tooling/flows are large, hence why it warrants discussion. Learning how to summarise effectively might save hours per week and improve decisions about which sources deserve our limited time/attention.

I feel like I'm missing some boat, but I'm not sure what boat it is. These "AI" systems seem very superficial to me, and they give me the same feeling as VR does. When I see VR be some terrible approximation of reality, it just makes me feel like I'm wasting my time in it, when I could go experience the real thing. Same with AI "augmentation" tooling. Why don't I just read a book instead of getting some unpredictable (or predictably unpredictable) synposis? It's not like there's too much specific information there. These tools are just exploding the amount of unspecific information. Who has ever said: "hey, I have too much information for building this system or learning this topic"? Basically no one.

It's just going to move everything to the middle of the Bell curve, leaving the wings to die in obscurity.

If you know a book’s worth reading, going ahead and reading it works well. But for a lot of books/talks there’s competition for time - eg my bookshelf has 20 half read books (this is after triaging out the ones that aren’t worthy of my time) - any tooling that can help better determine where to invest tens or hundreds or hours of my time is a win.

Regarding accuracy, I think we’re at a tipping point where ease of use and accuracy is starting to make it worth the effort. For example Bard seems to know about youtube videos (just a couple of months ago you’d have to download it -> audio to text -> feed into a LLM). So the combination of greater accuracy and much easier to use make it worth considering.

LLM accuracy is so bad, especially in summarization, that I now have to fact check google search results because they’ve been repeatedly wrong about things like the hours restaurants are open.

There's a huge difference between summarizing a stable document that was part of the training data or the prompt, and knowing ephemeral facts like restaurant hours.

Technically true statement. If you're offering it to imply that the GP bears responsibility for knowing what document was in the training data and what's not, I have to quibble with you.

Knowing it's shortcomings should be the responsibility of the search app that is currently designed to give screen real estate to the wrong summary of the ephemeral fact. Or, users will start to lose trust.

It's because they don't understand language. You may have been mislead by their ability to generate language.

If you know a book’s worth reading, going ahead and reading it works well. But for a lot of books/talks there’s competition for time - eg my bookshelf has 20 half read books (this is after triaging out the ones that aren’t worthy of my time) - any tooling that can help better determine where to invest tens or hundreds or hours of my time is a win.

Is it that hard to determine that a book is worth reading where worth is measured from your perspective? It's usually pretty easy, at least for technical books. Fiction books are another story, but that's life. Having some unknown stochastic system giving me a decision based upon some unknown statistical data is not something I'm particularly interested in. I'm interested in my stochastic system and decision making. Trying to automate life away is a fool's errand.

Is it that hard to determine that a book is worth reading

I'm a huge believer in doing plenty of research about what to read. The simple rationale: it takes a tiny amount of time to learn about a book relative to the time it takes to read it. Even when I get a sense a book is bad, I still tend to spend at least a couple of hours before making the tough call not to bother reading further (I handled one literally 5 minutes ago that wasted a good few hours of my life). I'm not saying AI summaries solve this problem entirely, but they're just one additional avenue for consultation that might only take a minute or two and potentially save hours. It might improve my hit rate from - I dunno - 70% to 80%. Same idea for videos/articles/other media.

I think the more you outsource "what is worth my time" the less you're actually getting an answer about what's worth YOUR time. The more you rule out the possibility of surprise up front, the less well-informed your assumption about worth can possibly be.

There are FAR too many dimensions like word choice, sentence style, allusion, etc, that resist effective summarization.

I get where you're coming from and definitely vet books in similar ways depending on the subject, but I also feel like this process is pretty limited in ways too and appeals to some sort of objective third party that just doesn't exist. If you really want to know or have an opinion on a work/theory/book at the end of the day you have to engage with it yourself on some level.

In graduate school for example, it was pretty painfully obvious that most people didn't actually read a book and come to their own conclusions, but rather read summaries from people they already agreed with and worked backwards from there, especially on more theoretical matters.

I feel like on the long term this just leads to a person superficially knowing a lot about a wide variety of topics, but never truly going deep and gaining real understanding on any of them- it's less "knowing" and more the feeling of knowing.

Again, not saying this in an accusatory way because I totally do engage in this behavior too, I think everyone does to some degree, but I just feel the older I get, the less valuable this sort of information is. It's great for broad context and certain situations I suppose, but in a lot of areas I consider myself an expert, I would probably strongly disagree with summaries given on subjects and they also tend to miss finer details or qualifying points that are addressed with proper context.

IMHO, the good old method of skimming through the table of contents, reading the preface and perhaps the first couple of chapters is going to be a much higher fidelity indicator of whether a book is worth your time than reading an AI generated summary.

Why don't I just read a book instead of getting some unpredictable (or predictably unpredictable) synposis? It's not like there's too much specific information there.

I'm trying to understand this comment, because I couldn't disagree more. It is the absolute explosion of available data sources that has me wanting to be much more judicious with where I spend my time reading/watching in the first place.

Your comment was interesting to me because I feel like I agree with one of its main sentiments: that AI generated content all kinda "sounds the same" and gives a superficial-feeling analysis. But that is why I think AI is a fantastic tool for summarizing existing information sources so I can see if I want to spend anymore time digging in to begin with.

Relying on glorified matrices (that's what machine learning is) for world data curation is just begging to handicap yourself into a cyborg dependent on a mainframe computer's implementation. An implementation and design that is rarely scrutinized for safety and alignment features.

Why not just make your brain smarter, instead of trying to cram foreign silicon components into your skull?

Why not both?

Because maximizing both biological vectors of self-improvement and computing based avenues of skill acquisition is limited by the fact that it's a multi-objective optimization problem when you combine them together during maximization. Optimizing one de-optimizes the other. They, biology and computers, conflict with each other in fact. So, at best, you have to reach for a Pareto frontier.

And, it turns out, technology can't be trusted, as there is always some sort of black box associated with its employment. Formally, there is always a comprehension involved when it comes to the development and integration of technology into human life. You can't really trust this stubborn built-in feature of technological and economic success if you don't pierce through its secrets (knowledge is the power to counteract cryptographic objects). After all, it could be a malicious trojan horse that "basic common sense" insists on us all using for "bettering" our daily lives.

A very unfriendly artificial intelligence is trying to sneak through civilization for its own desires. And you're letting it just pass on by, as a result of your compliance with the dominant narrative and philosophy of capitalist economics.

I was thinking the other day: Star Trek computers make a lot of sense if they are working with our current level of AI.

You can talk to it, it can give you back answers that are mostly correct to many questions, but you don't really trust it. You have real people pilot the ship, aim and fire weapons, and anything else important.

And nobody in Star Trek thinks the ship computer is sentient. On the other hand, the holodeck sometimes malfunctions and some holodeck character (like Moriarty) becomes sentient running on just a subset of the ship computer. That suggests sentience (in the Star Trek universe) is a property of the software architecture, not hardware.

Firstly, they had unlimited energy and replicators - which means they could make whatever hardware they wanted.

And they also had bio-neural circuits. And photonic chips.

So, hardware was already way ahead of software.

All this goes to show that in real world, the actual science (and fiction) around material sciences was already quite advanced compared to software.

I had a conversation with a friend where he suggested that he had had a broad range of experiences just from gaming. I think the context was a conversation about how experiences in life can expand you — something like that.

The whole premise bothered me though.

I can remember a bike ride where I was experiencing the onset of heat stroke and had to make quick decisions to perhaps save my life.

I remembered decades ago lost in Michigan's upper peninsula with the wife, on apparently some logging road and the truck getting into deeper and deeper snow as we proceeded until I made the decision to instead turn around and go back the way we came lest we become stranded in the middle of nowhere.

I remember having to use my wits, make difficult decisions while hitchhiking from Anchorage, Alaska to the lower 48 when I was in my early twenties....

The actual world, the chance of actual death, strangers, serendipity ... no amount of VR or AI really compares.

You're not wrong, but I also think the problem predates video games. Films, novels and even religious texts all are scrutinized for changing people's perspective on life. Fiction has a longstanding hold on society, but it inherently coexists with the "harsh reality" of survival and resource competition. Introducing video games into the equation is like re-hashing the centuries old Alice in Wonderland debate.

Playing video games all day isn't an enriching or well-rounded use of time, but neither is throwing yourself into danger and risk all the time. The real world is a game of carefully-considered strategy, where our response to hypothetical situations informs our performance during real ones. Careful reflection on fiction can be a philosophically powerful tool, for good or bad.

100 percent on things moving to the center of the curve.

For now, that’s not a bad thing if you need to know what the average information is.

As time goes by it might not be a good thing.

I just read “Robust Python” book. My overall reaction is that book could have been written with half the length and still be valuable for me. I can't stop thinking if I could ask LLM to summarize each chapter for me, I still could "read" the whole book in the manner the author outlinea but save a tons of time.

What is your workflow for this, if you don't mind me asking?

If you’re interested I did a YouTube video and short blog post about it

https://www.jerpint.io/blog/yougptube/

https://www.youtube.com/watch?v=WtMrp2hp94E

This is pretty cool. Would it be possible to just stream the audio directly into Whisper, maybe using something like vlc, at x2 play speed to get the summary faster?

Probably, the openAI api got a lot better since I made that post, though if you stream audio at 2x speed you have to expect a drop in quality since on average most clips whisper is trained on are not at 2x

Try out www.askyoutube.ai!

My approach wasn't fancy, just asked bard (aka gemini). I was drawn to bard/gemini for this since the source video is on youtube, so figured google would better support its related service (although that was an arbitrary hunch)

https://imgur.com/a/psb64IP

This is exactly why I built https://www.askyoutube.ai. It helps you figure out if a video has the answer you want before you spend time watching it. It does this by aggregating information from multiple videos in one-go.

I don't think it completely replaces watching videos in some cases but it definitely helps you skip the fluff and even guides you to the right point in the video.

Do you transcribe the videos or use the captions, because GPT4 can already do the latter?

It can be either depending on the mode, I don't think GPT4 can already do the latter though.

What tool do you use to summarize video?

(since it's a youtube video) I used bard/gemini: https://imgur.com/a/psb64IP

I have no idea if it's the best (or even a good) tool. Other commenters suggest some other tools (for both text summaries and condensed video summaries - a sort of 'highlights reel'):

https://news.ycombinator.com/item?id=39435930

https://news.ycombinator.com/item?id=39435964

(Little self-plug) I made a tool that’s pretty relevant

https://www.platoedu.org/videos/oSCRZkSQ1CE/watch

It's not really giving summaries but gives topic/section timestamps and highlights what was discussed.

(Main focus is actually making mini-courses off of YouTube videos but I found the section summaries really useful for figuring out which parts to watch)

and debating the summary's accuracy

Perhaps consider simply reading the description for an accurate summary.

From the description:

Abstract: In this talk I’ll highlight several exciting trends in the field of AI and machine learning. Through a combination of improved algorithms and major efficiency improvements in ML-specialized hardware, we are now able to build much more capable, general purpose machine learning systems than ever before. As one example of this, I’ll give an overview of the Gemini family of multimodal models and their capabilities. These new models and approaches have dramatic implications for applying ML to many problems in the world, and I’ll highlight some of these applications in science, engineering, and health. This talk will present work done by many people at Google.

sure its an accurate summary, but is it at a granularity or specificity that you want? LLM summaries lets you move around the latent space of summaries and you probably dont agree with the one chosen for youtube descriprtions.

In this case the video description contains a useful Abstract. AI summaries can offer additional value though, going into more/less detail (as desired), and allowing you ask follow up questions to drill into anything potentially of interest.

What ai tool do you use to summarize?

I've A/B tested this with webinars and the tools I've tried tend to miss some really valuable/interesting stuff even when I give it the full transcript. Same goes for when I try to use ChatGPT or other tools for full interactive analysis, even when I basically hand it what I'm looking for as if I hadn't watched the video it will leave out the critical information

1. Author spends a week producing a video when writing an article would have taken a day.

2. Viewer spends hours summarizing the video to an article so they don't have to watch it.

P R O G R E S S

I made a tool that might be interesting for people here!

https://www.platoedu.org/videos/oSCRZkSQ1CE/watch

It's not really giving summaries but gives topic/section timestamps and highlights what was discussed

(for example: The Transformer Model (21:06 - 24:48) - Introduction of the Transformer model as a more efficient alternative to recurrent models for language processing)

The main focus is actually creating Anki-like spaced repetition questions/flashcards for videos and lectures you watch to retain knowledge, but I found the section information quite helpful for finding which parts of the video contain the info relating to topics/concepts

Learning how to summarise effectively might save hours per week and improve decisions about which sources deserve our limited time/attention.

If you like summaries, you'll probably love de-summaries (WIP): https://socontextual.com/

Parts of it were good, but it mostly felt like he was reading through a slideshow created by the Google marketing team.

Agreed -- given the talk is titled "Exciting trends in Machine Learning" it feels pretty incomplete to gloss over ChatGPT blowing up.

OpenAI bet big on (1) emergent behaviors as you scale language models and (2) RLHF / fine-tuning to follow instructions.

Both those topics were lightly covered but very much from the "Google did this" perspective (word2vec, step-by-step reasoning).

The original paper on emergent behaviors in LLMs is from Google: https://arxiv.org/abs/2206.07682

The difference between OpenAI and Google isn't due to different research directions between the two firms. It is a difference in ability to execute.

This paper was interesting at the time but the main sentence in it was wrong: “Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.” Such behaviors can totally be predicted if one looks at the evolution of the likelihoods for these behaviors by scaling and extrapolating from the small models.

Maybe you can predict emergent abilities post hoc, but I recall no one predicting beforehand that, for example, a pure language model could do translation simply be giving a prompt that said “translate to French”

Then they're not "emergent"

I don't think many-shot prompting or chain-of-thought were predictable from model size. They just showed up at a particular model+data size.

From the GPT-2 paper: ``Also thanks to all the Googlers who helped us with training infrastructure.''

Google certainly had influential papers on language models having emergent behaviors, but this isn't the one that inspired scaling up GPT. It was published August 2022 and GPTs at OpenAI were getting scaled up long before this.

The research breakthroughs all happened at Google, so it's not surprising.

RLHF was invented by OpenAI.

DPO was invented at Stanford.

Exactly. Google.

And all the people that actually did the inventing promptly left Google.

(yes I get there's 2 exceptions)

What I hate about the ML community is that papers have become ad pieces now. I'm all for them releasing technical papers, but do we have to fuck up the review system for it. We don't need to railroad our research community and doing so is pretty risky. I don't buy the scale is all you need side. But if you do, doesn't matter if I'm wrong, we'll just get there when Sam gets his $7T. But if I'm right, we need new ideas if we're going to keep progressing without missing a step.

Domain adaptation across verticals is the only driver of innovation. Check out the NeSy computation engine. Its ``semantic parsing'' is domain parsing and its symbols are numeric. Scale if you want. It works for images. That's all that matters, right?

Mapping language onto the latent space of images gets you crude semantic attributes sometimes. If you have to push for multimodal out of the gate maybe start with articulatory perception. These LVMs aren't going to cut it.

ML research isn't meant to further your understanding of anything. You can't separate it from corporate interest and land grabbing. It's the same thing re-hashed every year by the same people. NLP is pretty much a subfield of IR at this point.

I love how fast my comment was blacklisted to the bottom of this thread. lol

I think you need to understand your audience better. Remember that most people have a very shallow understanding of DL systems. You got blog posts that talk about attention from a mathematical perspective but don't even hint at softmax tempering and just mention that there's a dot product. Or where ML researchers don't know things like why doubling batch size doesn't cut training in half or using fp16 doesn't cut memory in half. So I think toning down the language, being clearer and adding links will help you be more successful in communicating.

I watched some of this after it was promoted to me on YouTube.

It is very much: Look at this other cool blackbox that Google has made - I think it has a helpful personality.

And sure: Applied AI / ML is a part of computer science but I had hoped it would be more of a walk-through of what has happened in terms of advances in theory, algorithms, architecture, training methodology, and perhaps some sort of explanation of why does it work giving this general purpose model a bunch of medical data? (Is it merely an elaborate fuzzy search, how does it extrapolate, is there actually reasoning going on and how does emerge from a neural net etc...)

Maybe this was a review talk, designed for an audience of undergraduate students from various majors.

All machine learning is convolution.

Only certain types of neural networks use convolutions. It's not universal.

What are some examples of types that don't?

The transformer module is currently dominating ML, and is widely used in text, vision, audio, and video models. It was introduced in 2017 and shows no real signs of being displaced. It has no convolutions.

https://en.wikipedia.org/wiki/Transformer_(deep_learning_arc... http://jalammar.github.io/illustrated-transformer/

… or does it https://grlearning.github.io/papers/11.pdf

If they use dot products on at least one layer with fully-connected inputs, which they do, along with everything else derived from the basic MLP model, then they're technically performing convolution.

Of course, the convolution concept breaks down when nonlinear activation functions are introduced, so I'm not sure the equivalence is really all that profound.

I think anything that doesn't have an explicit convolution layer? Transformer, MLP, RNN don't automatically have a convolution layer, although for many tasks you can add it in if you want.

Transformers, MLPs.

No, ALL machine learning architectures are convolutions https://grlearning.github.io/papers/11.pdf

Reading the paper, they're saying convolutions are powerful enough to express any possible architecture. But that's just computational universality - this does not mean they are convolutions.

Every matrix multiply can be viewed as a 1x1 convolution, just like every convolution can be viewed as an (implicit, for larger than 1x1 kernels) matrix multiply.

I'm not sure this is particularly enlightening, but it's probably one small step of understanding that is required to truly "get" the underlying math.

Edit: Should have said that every matrix multiply against weights is a convolution. The matrix multiply in the standard attention block (query times keys) really doesn't make sense to be seen as a convolution.

Integral transform is the right abstraction level. Matmul is an integral transform.

All computers do is add.

And conditional jumps

What is a jump but an addition to the instruction pointer?

True

Random forest would like a word.

At 1:00:18 it's claimed, "more than just publications" and that products are live for the public to use.

And they mention something that sounds cool. It's called DermAssist, it's to find similar images related to skin-diseases.

However, when checking the website, you are supposed to join a waitlist, but then it 404s.

Is the tool already shutdowned or it was never released ?

Hi, my partner worked in technology strategy for a large healthcare system where she saw many derm AI-based applications evaluated over the last decade. (Dis)Incentives aside, they all followed a similar\ story arc overpromising in their research findings and underdelivering in actual care delivery. They've been around for longer than you'd imagine. I hope they reach their potential, but suggest approaching them with healthy skepticism.

For an example of medical AI that delivers take a look at https://cydarmedical.com/

Real-time augmented reality for keyhole aortic surgery using machine vision to understand patients' individual anatomies.

Full disclosure: I was the original CTO.

Looks awesome :)

It's not shutdowned I'm pretty sure. Looks like a bug. Hopefully it will be fixed soon.

Let's hope so, because it seems really useful.

In the meantime, I found an open-source equivalent https://modelderm.com/en.html but if we could see the certified + tested tool it'd be nice :)

If you are in the US, "DermAssist is not available in the United States"(https://health.google/consumers/dermassist/)

thats a new one... so where is it available?

Not to shade on anyone, but even Jeff Dean isn't good at predicting the future.

The pathways model/architecture aren't exactly where things are moving towards, LLM is.

https://blog.google/technology/ai/introducing-pathways-next-...

Care to share some topics where the things are moving towards? I understand diffusion, GaNs and Mamba is in vogue these days, but those are different logical architecture. I am unsure where the next level ML physical architecture research is moving towards.

I think at this rate, everything is moving towards Transformer based models(text/audio/image/video), as Sora has shown, there isn't really anything Transformer can't do, it can generate both real life quality photo and video. Its ability to fit ANY given distribution is beyond compare, the most powerful neural network we have ever designed, nothing else is even close.

GANs are on the contrary, not hot any more in industry, diffusion models have achieved high fidelity in image generation, hard to see how GAN can make a comeback. It is faster, but it image generation in terms of quality is done, the wow factor is no more.

This might be a hot take, but I think architectural changes is going to die down in industry, Transformer is the new MOS transistor. As billions of dollars pumping into making it runs faster AND cheaper, other alternative architecture is going to have a hard time compete.

There is no question in my mind that the transformer architecture will not stop to evolve. Already now, we are stretching the definition by calling current transformers that way; the 2017 transformer had an encoder block which is nearly always absent nowadays, and the positional encoder and multi-head attention have been substantially modified.

VRAM costs and latency constraints will drive architectural changes, which Mamba hints at: we will have less quadratic scaling in the architectures that transformers evolve into, and likely the attention mechanism will look more and more akin to database retrieval (for way more evolved database querying mechanisms than is often seen in relational databases). One day, the notion of a maximum context size will be archaeological. Breaking the sequentiality of predicting only the next token would also improve throughput which could require changes. I expect experts to also evolve into new forms of sparsity. More easily quantizable architectures may also emerge.

The original transformer is an encoder decoder model, where the decoder model is what leads to first GPT model. Except you need to feed the encoder states to the decoder attention module in the original proposal, it is basically the same decoder only model. I would argue the decoder only model is even simpler in that regards.

When it comes to the core attention mechanism it is surprisingly stable comparing to other techniques in neural networks. There is the qkv projection, then dot product attention, then two layer of ffn. Arguably the most influential changes regarding attention itself is the multi query/grouped attention, but that is still imo, a reasonably small change.

If you look back into the convolutional NNs, their shapes and operators just changes every six months back in the day.

At the same time, the original transformer today is still a useful architecture, even in production, some bert models must be hanging around still.

Not that I am saying it didn’t change at all, but the core stays very much stable across countless revisions. If you read the original transformer paper, you already understood 80% of what LLama model does, the same thing can’t be said for other models is what I meant

Summary: https://www.videogist.co/videos/jeff-dean-google-exciting-tr...

Thanks for the post - I put together the tool above. I tried to strike a balance between being concise but also capturing all the important details. For that reason, the tool is hit or miss on longer (> 45 min) videos - the summary on this video is good but I've seen it omit important details on other long videos. The tool also captures relevant screenshots for each section.

Hopefully it's helpful. You can summarize additional videos by submitting a youtube URL in the nav bar or the home page. Also, feedback welcome!

Are you using LLM to summarize? If yes can you share the prompts used?

There's a typo in the title of the talk but don't worry I fixed it: Trends in Computer Vision

``In recent years, ML has completely changed our view of what is possible with computers''

In recent years ML has completely changed our view of what's possible with computer vision.

``Increasing scale delivers better results''

This is true for computer vision.

``The kinds of computations we want to run and the hardware on which we run them is changing dramatically''

Optimizations on operations for computer vision isn't exactly dramatic change. Who is We?

Trends in Machine Learning, 2010: semantic search for advertising. Trends in Machine Learning, 2024: semantic search for advertising, short-form video content.

https://blog.seanbethard.net/five-epistemes/

lol. Three computer vision researchers dislike this comment. Do any of you want to respond to it?

I'm surprised he didn't mention much about vector search

It is a finished thing, no? So many systems/products have it built in for quite a few years now.

Seeing a lot of people talk about auto-summary tools which reminds me of a joke.

“I took a speed-reading course and read War and Peace in twenty minutes. It involves Russia.” ― Woody Allen

This is such a great comment about so many things right now. Thank you.

I think this is the time where the idiom "it's good fishing in troubled waters" makes completely sense.

This is a moment where it is wise to wait and see. Investing in building AI/AGI startups now is like being a fish more than a fishermen. An outlier could win market share but will be only one within zillions. Google is catching up OpenAI but their competition is fruitful for all. Typical oligopoly. A single startup showing good traction will be immediately acquired.

From the business perspective it is time to focus on "go to market execution" and less or nothing on research. The research results are coming alone, except if you are one of the top scientist teams in the world or someone alike Ramanujan.

One thing I think should be fairly clear is Google will actually try to compete in this space (vs quitting if it isn't an immediate success as they usually do). Traditional web search will be fairly uncommon in the long run I think.

It's a nice high level overview about the state of the art in machine learning.

If you have watched his other talks, Jeff Dean generally does a very good job explaining things from a high level point of view.

Here is a controversial view: I think that the current neural networks driven approach coupled with massively distributed computing has plateaued.

For the machine learning field to move forward, it will need a different, less data/compute hungry paradigm.

The highlight of this event was running with Jeff at Rice University before his talk:

https://x.com/JeffDean/status/1756319820482592838?s=20