
SMERF: Streamable Memory Efficient Radiance Fields

barrkel
16 replies
22h26m

The mirror on the wall of the bathroom in the Berlin location looks through to the kitchen in the next room. I guess the depth-gauging algorithm uses parallax, and mirrors confuse it by seeming like windows. The kitchen has a blob of blurriness where the rear of the mirror intrudes into the kitchen, but you can see through the blurriness to either room.

The effect is a bit spooky. I felt like a ghost going through walls.

nightpool
12 replies
22h18m

The refrigerator in the NYC scene has a very slick specular lighting effect based on the angle you're viewing it from. If you go "into" the fridge, you can see it's actually generating a whole 3D scene of blurry grey and white colors that turn out to precisely mimic the light from the windows bouncing off the metal, and you can look "out" from the fridge into the rest of the room. Same with the full-length mirror in the bedroom in the same scene—there's a whole virtual "mirror room" that's been built out behind the mirror to give the illusion of depth as you look through it. Very cool and unique consequence of the technology.

pavlov
3 replies
22h3m

Wow, thanks for the tip. Fridge reflection world is so cool. Feels like something David Lynch might dream up.

A girl is eating her morning cereal. Suddenly she looks apprehensively at the fridge. Camera dollies towards the appliance and seamlessly penetrates the reflective surface, revealing a deep hidden space that exactly matches the reflection. At the dark end of the tunnel, something stirs... A wildly grinning man takes a step forward and screams.

throwaway17_17
2 replies
16h27m

Would you be offended if I animated that scene? It's really well described!

pavlov
0 replies
9h22m

Please feel free!

npace12
0 replies
15h14m

Please share if you do, that sounded spooky af

pjc50
1 replies
4h47m

Funnily enough, this is how reflections are usually emulated in game engines that do not support raytracing: another copy of the world behind the mirror. Also used in films in a few places (e.g. Terminator)
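For anyone curious what that trick looks like in practice, here is a minimal sketch in plain NumPy; the mirror plane and camera position are made-up values. Reflecting points (or, equivalently, the camera) through the mirror plane is what produces the "second copy of the world" behind the glass:

```python
import numpy as np

def reflect_across_plane(points, plane_point, plane_normal):
    """Reflect 3D points across a mirror plane given by a point and a unit normal.

    Engines without raytracing often draw reflections by rendering a second copy
    of the scene (or a mirrored camera) obtained with exactly this transform.
    """
    n = plane_normal / np.linalg.norm(plane_normal)
    signed_dist = (points - plane_point) @ n           # signed distance of each point to the plane
    return points - 2.0 * signed_dist[:, None] * n     # move each point to its mirror image

# Hypothetical mirror: the x = 0 plane. A camera at x = 2 maps to x = -2,
# i.e. into the "virtual room" behind the mirror that these scenes reconstruct.
camera = np.array([[2.0, 1.5, 0.5]])
print(reflect_across_plane(camera, plane_point=np.zeros(3), plane_normal=np.array([1.0, 0.0, 0.0])))
```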

nightpool
0 replies
2h56m

Please look at the refrigerator I mentioned—it's definitely not the classic "mirror world" reflection that you'd normally see in video games. I'm talking about the specular / metallic highlights on the fridge being simulated entirely with depth features.

deltaburnt
0 replies
21h40m

Mirror worlds are a pretty common effect you'll see in NeRFs. Otherwise you would need a significantly more complex view-dependent feature rendered onto a flat surface.

daemonologist
0 replies
22h1m

Neat! Here are some screenshots of the same phenomenon with the TV in Berlin: https://imgur.com/a/3zAA5K8

chpatrick
0 replies
21h3m

This happens with any 3D reconstruction. It's because a mirror is indistinguishable from a window into a mirrored room. The tricky thing is if there's actually something behind the mirror as well.

alkonaut
0 replies
8h9m

What does the reconstructed space look like when there are opposing mirrors? Will it just be a long corridor of ever-blurrier rooms?

TaylorAlexander
0 replies
21h49m

Oh wow, yeah. It's interesting because when I look at the fridge my eye maps it to "this is a reflective surface", which makes sense because that's true in the source images, but it's actually rendered as a cavity with the appropriate features rendered in 3D space. What's a strange feeling is to enter the fridge and then turn around! I just watched Hbomberguy's Patreon-only video on the video game Myst. In Myst the characters are trapped in books, and if you choose the wrong path at the end of the game you get trapped in one yourself; the view from inside the book looks very similar to the view from inside the NYC fridge!

Nevermark
0 replies
15h34m

Yes!

The barely-there reflection on the Berlin TV is also a trip to enter, and observe the room from.

rzzzt
0 replies
19h56m

You can also get inside the bookcase for the ultimate Matthew McConaughey experience.

rpastuszak
0 replies
8h59m

Try noclipping through the TV in the Berlin living room. It gets pleasantly creepy.

Zetobal
0 replies
19h59m

It has exactly the same drawbacks as photogrammetry with regard to highly reflective surfaces.

VikingCoder
7 replies
22h16m

Wow. Some questions:

Take, for instance, the fulllivingroom demo. (I prefer fps mode.)

1) How many images are input?

2) How long does it take to compute these models?

3) How long does it take to prepare these models for this browser, with all levels, etc?

4) Have you tried this in VR yet?

duckworthd
5 replies
20h23m

Glad you liked our work!

1) Around 100-150 if memory serves. This scene is part of the mip-NeRF 360 benchmark, which you can download from the corresponding project website: https://jonbarron.info/mipnerf360/

2) Between 12 and 48 hours, depending on the scene. We train on 8x V100s or 16x A100s.

3) The time for preparing assets is included in 2). I don't have a breakdown for you, but it's something like 50/50.

4) Nope! A keen hacker might be able to do this themselves by editing the JavaScript code. Open your browser's DevTools and have a look -- the code is all there!

duckworthd
2 replies
18h20m
nh2
1 replies
16h57m

What is the license? The repo doesn't say.

duckworthd
0 replies
11h21m

Oops, I need to update the license files.

Our code is released under the Apache 2.0 license, as in this repo: https://github.com/google-research/google-research/blob/mast...

dougmwne
1 replies
20h3m

Do you need position data to go along with the photos or just the photos?

For VR, there’s going to be some very weird depth data from those reflections, but maybe they would not be so bad when you are in headset.

duckworthd
0 replies
18h18m

Do you need position data to go along with the photos or just the photos?

Short answer: Yes.

Long answer: Yes, but it can typically be derived from the images themselves. Structure-from-motion methods are typically used to derive lens and position information for each photo in the training set. These are then used to train Zip-NeRF (our teacher) and SMERF (our model).
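For a concrete picture of that structure-from-motion step, here is a rough sketch of the standard COLMAP command-line pipeline driven from Python. The paths are placeholders, and the exact settings used for these captures may well differ:

```python
import os
import subprocess

# Placeholder paths; swap in your own capture directory. Requires COLMAP on PATH.
IMAGES = "capture/images"
DB = "capture/database.db"
SPARSE = "capture/sparse"
os.makedirs(SPARSE, exist_ok=True)

# Standard COLMAP structure-from-motion: detect features, match them across images,
# then jointly solve for camera intrinsics, poses, and a sparse point cloud.
# The recovered poses are what NeRF-style methods are trained against.
subprocess.run(["colmap", "feature_extractor", "--database_path", DB, "--image_path", IMAGES], check=True)
subprocess.run(["colmap", "exhaustive_matcher", "--database_path", DB], check=True)
subprocess.run(["colmap", "mapper", "--database_path", DB, "--image_path", IMAGES, "--output_path", SPARSE], check=True)
```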

vyrotek
0 replies
20h57m

Not exactly what you asked for. But I recently came across this VR example using Gaussian Splatting instead. Exciting times.

https://twitter.com/gracia_vr/status/1731731549886787634

https://www.gracia.ai

modeless
6 replies
20h43m

How long until you can stitch Street View into a seamless streaming NeRF of every street in the world? I hope that's the goal you're working towards!

deelowe
2 replies
19h30m

I read another article talking about what Waymo was working on, and this looks oddly similar... My understanding is that the goal is to use this to reconstruct 3D models from Street View images in real time.

duckworthd
1 replies
18h6m

Block-NeRF is a predecessor work that helped inspire SMERF, in fact!

https://waymo.com/research/block-nerf/

deelowe
0 replies
17h17m

Very cool. Thanks!

duckworthd
1 replies
20h21m

;)

modeless
0 replies
19h51m

Haha, too bad the Earth VR team was disbanded because that would be the Holy Grail. If someone can get the budget to work on that I'd be tempted to come back to Google just to help get it done! It's what I always wanted when I was building the first Earth VR demo...

xnx
0 replies
18h50m
edrxty
6 replies
18h32m

Is there any relation between this class of rendering techniques and the way the BD scenes in Cyberpunk 2077 were created? The behavior of the volume and the "voxels" seems eerily similar.

duckworthd
4 replies
18h29m

I can't say. I'm not familiar with BD in Cyberpunk.

edrxty
3 replies
18h25m

https://youtu.be/KXXGS3MGCro?t=118

It's a sort of replayable cutscene that happens a couple of times in the game, where you can wander through it. The noteworthy bit is that it's rendered out of voxels that look very similar to the demos, but at a much lower resolution, and if you push the frustum into any objects you get the same kind of effect where the surface breaks into blocks.

duckworthd
2 replies
18h10m

Interesting effect. It does look very voxel-y. I'm not a video game developer at heart, so I can only guess how it was implemented. I doubt NeRF models were involved, but I wouldn't be surprised if some sort of voxel discretization was.

promiseofbeans
1 replies
18h6m

It seems like it might even just be some kind of shader

vanderZwan
0 replies
9h41m

If you think about how they created this from the POV of the game-creation pipeline, then that probably is the way. If this is done by creating a shader on top of "plain old" 3D assets, then aside from the programmers/artists involved in creating that shader, everyone else can go about their business with minimal retraining. There probably was a lot of content to create, so that optimization likely took priority over other methods of implementing this effect.

9dev
0 replies
10h59m

I doubt Cyberpunk uses more than a special shader for the BD sequences, but what’s a lot more remarkable to me is how similar the idea is at heart. Maybe we’re actually going to see this (maybe sans the brain-implant to record them, but hey) after all. Amazing technology, that’s for sure.

nojvek
5 replies
14h20m

Holy mother of god. Wow!

Either Matterport takes this and runs with it, or this is a startup waiting to disrupt real estate.

I can’t believe how smooth this ran on my smartphone.

Feedback: if there were a mode to use the phone's compass and gyro for navigation, it'd feel natural. It felt weird to navigate with fingers and figure out how to move in the xyz dimensions.

As others have said, VR mode would be epic.

tobr
2 replies
5h45m

Is this really something the real estate market wants though? The point of using styled and meticulously chosen images is to entice people to visit the property in person. I think it’s hard to fall for a home because you saw it through virtual reality.

iandanforth
0 replies
4h51m

It is, yes. If you browse zillow you'll find many homes have 3D views attached. These are often image-sphere captures that you can painfully move through by clicking. While I agree that full res photos can be more appealing, the user experience with SMERF is so much better it might leave end users with a more positive feeling about a property and thus increase the chances of a sale.

hobofan
0 replies
4h1m

I think it’s hard to fall for a home because you saw it through virtual reality.

I think if you take this 1-2 steps further and combine it with hallucinating already-owned furniture, or furniture that matches the prospective buyer's taste, into the property, this will make it a lot easier to fall for a home.

duckworthd
1 replies
11h24m

Thanks for the feedback!

I agree, we could do better with the movement UX. A challenge for another day.

nojvek
0 replies
6h44m

Since the viewer is on GitHub, I’ll take it for a spin.

Are you accepting pull requests?

xnx
4 replies
18h53m

Does an open-source toolchain exist for capturing, processing, and hosting navigable 3D walkthroughs like this (e.g. something like an open-source Matterport)?

duckworthd
2 replies
18h39m

Not yet, as far as I'm aware. The current flow involves a DSLR for capture, COLMAP for camera parameter estimation, one codebase for training a teacher model, our codebase for training SMERF, and our web viewer for rendering models.

Sounds like an opportunity!

strofocles
1 replies
9h32m

Is there a significant advantage for capturing using DSLRs vs using the phone camera of a decent phone?

duckworthd
0 replies
6h26m

The big differences are access to fisheye lenses, a burst mode that can run for minutes at a time, and the ability to minimize the amount of camera post-processing. In principle, the capture could be done with a smartphone, but the experience of doing so is pretty time-consuming right now.

gorkish
0 replies
18h29m

You don't need a toolchain for capturing; you just need the data. Get it now; process it when better tools become available. There are guides for shooting for Photogrammetry and NeRF that are generally applicable to what you need to do.

guywithabowtie
4 replies
22h51m

Any plans to release the models ?

duckworthd
3 replies
22h36m

The pretrained models are already available online! Check out the "demo" section of the website. Your browser is fetching the model when you run the demo.

ilaksh
2 replies
19h41m

Will the code be released, or an API endpoint? Otherwise it will be impossible for us to use it for anything... Since it's Google, I assume it will just end up in a black hole like most of the research, or five years later some AI researchers will leave and finally create a startup.

duckworthd
1 replies
18h35m

I hope to release code in the new year, but it'll take a while. The codebase is heavily wired into other not-yet-open-sourced libraries, and it'll take a while to disentangle them.

ilaksh
0 replies
13m

That sounds terrific! I really appreciate your effort. It's amazing work and so great of you to share it.

catskul2
4 replies
21h14m

When might we see this in consumer VR? I'm surprised we don't already, but I suspected it was a computation constraint.

Does this relieve the computation constraint enough to run on Quest 2/3?

Is there something else that would prevent binocular use?

duckworthd
2 replies
20h21m

I can't predict the future, but I imagine soon: all of the tools are there. The reason we didn't develop for VR is actually simpler than you'd think: we just don't have the developer time! At the end of the day, only a handful of people actively wrote code for this project.

nojvek
1 replies
14h16m

Any plans to open source the code?

duckworthd
0 replies
11h14m

Yes, I hope so! But it'll take at least a few months of work. We have some tight dependencies to not-yet-open-sourced code, and until that's released, any code we put out will be dead on arrival.

In the meantime, feel free to explore the live viewer code!

https://github.com/smerf-3d/smerf-3d.github.io/blob/main/vie...

doctoboggan
0 replies
20h29m

I recently got a new quest and I am wondering the same thing. The fact that this is currently running in a browser (and can run on a mobile device) gives me hope that we will see something like this in VR sooner rather than later.

yarg
3 replies
20h23m

What I'm seeing from all of these things is very accurate single navigable 3D images.

What I haven't seen anything of is feature and object detection, blocking and extraction.

Hopefully a more efficient and streamable codec necessitates the sort of structure that lends itself more easily to analysis.

duckworthd
1 replies
18h9m

3D understanding as a field is very much in its infancy. Good work is being done in this area, but we've got a long ways to go yet. SMERF is all about "view synthesis" -- rendering realistic images -- with no attempt at semantic understanding or segmentation.

cooper_ganglia
0 replies
16h4m

"It's my VR-deployed SMERF CLIP model with LLM integration, and I want it now!"

It is funny how quickly goalposts move! I love to see progress though, and wow, is progress happening fast!

yorwba
0 replies
7h14m

You mean something like this? https://jumpat.github.io/SA3D/

Found by putting "nerf sam segment 3d" into DuckDuckGo.

twelfthnight
3 replies
19h40m

Hope this doesn't come across as snarky, but does Google pressure researchers to do PR in their papers? This really is cool, but there is a lot of self-promotion in this paper and very little discussion of limitations (and the discussion of them is bookended by qualifications as to why they really aren't limitations).

It makes it harder for me to trust the paper if I feel like the paper is trying to persuade me of something rather than describe the complete findings.

tomatotomato31
1 replies
19h35m

People are not allowed to be proud of their work anymore?

twelfthnight
0 replies
19h26m

Oh absolutely. I guess I just got the feeling reading this that there was more than the standard pride here and that there was professional PR going on. If no one else is getting that vibe I'm okay to accept it's just me.

duckworthd
0 replies
18h30m

I won't say too much about this, but the amount of buzz around articles these days is more of a "research today" sort of thing. Top conferences like CVPR receive thousands of submissions each year, and there's a lot of upside to getting your work in front of as many eyeballs as possible.

By no means do I claim that SMERF is the be-all-end-all of real-time rendering, but I do believe it's a solid step in the right direction. There are all kinds of ways to improve this work and others in the field: smaller representation sizes, faster training, higher quality, and fewer input images would all make this technology more accessible.

slalomskiing
3 replies
13h31m

I wonder, since this runs at a real-time framerate, whether it would be possible for someone to composite a regular rasterized frame on top of something like this (with correct depth testing) to make a game.

For example, a third-person game where the character you control and the NPCs/enemies are raster but the environment is all radiance fields.

duckworthd
1 replies
11h16m

This should absolutely be possible! The hard part is making it look natural: NeRF models (including SMERF) have no explicit materials or lighting. That means that any character inserted into the game will look out of place.
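As a minimal sketch of the depth-compositing idea under discussion, assuming both renderers can produce an RGB image plus a per-pixel depth map from the same camera (the array names here are hypothetical):

```python
import numpy as np

def composite(nerf_rgb, nerf_depth, raster_rgb, raster_depth):
    """Per-pixel depth test: keep whichever source is closer to the camera.

    nerf_* and raster_* are (H, W, 3) color and (H, W) depth arrays rendered
    from the same camera pose. A real integration would also need to reconcile
    lighting, since the radiance field bakes its illumination in.
    """
    raster_wins = raster_depth < nerf_depth             # rasterized character is in front
    return np.where(raster_wins[..., None], raster_rgb, nerf_rgb)
```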

vanderZwan
0 replies
9h40m

Why bother to make it look natural when you can have a really awkward greenscreen-like effect for nostalgic and "artistic" purposes?

jasonwatkinspdx
0 replies
13h25m

If you go to one of the demos the space bar will cycle through some debug modes. One shows a surface reconstruction. It comes from the usual structure from motion techniques I presume, so it's coarse and noisy, but I think the fundamental idea is viable.

promiseofbeans
3 replies
22h18m

It runs impressively well on my 2yo s21fe. It was super impressive how it streamed in more images as I explored the space. The tv reflections in the Berlin demo were super impressive.

My one note is that it took a really long time to load all the images - the scene wouldn't render until all ~40 initial images loaded. Would it be possible to start partially rendering as the images arrive, or do you need to wait for all of them before you can do the first big render?

duckworthd
2 replies
20h29m

Pardon our dust: "images" is a bad name for what's being loaded. Past versions of this approach (MERF) stored feature vectors in PNG images. We replace them with binary arrays. Unfortunately, all such arrays need to be loaded before the first frame can be rendered.

You do however point out one weakness of SMERF: large payload sizes. If we can figure out how to compress them by 10x, it'll be a very different experience!
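A toy illustration of the PNG-versus-binary-array distinction; the file names, shapes, and quantization below are made up, and the real asset format is defined by the viewer code linked elsewhere in this thread:

```python
import numpy as np
from PIL import Image

# MERF-style: feature planes packed into the channels of 8-bit PNGs.
# (Hypothetical file name and shape, for illustration only.)
png_plane = np.asarray(Image.open("plane_features_00.png"))   # e.g. (H, W, 4), uint8
png_features = png_plane.astype(np.float32) / 255.0           # dequantize to [0, 1]

# SMERF-style, as described above: the same data shipped as a raw binary array,
# skipping the PNG decode entirely. dtype/shape would come from scene metadata.
raw = np.fromfile("plane_features_00.bin", dtype=np.uint8)
bin_features = raw.reshape(png_plane.shape).astype(np.float32) / 255.0
```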

promiseofbeans
1 replies
18h8m

Or even just breaking them down into smaller chunks (prioritise loading the ones closer to where the user is looking) could help

duckworthd
0 replies
11h22m

The viewer biases towards assets closer to the user's camera (otherwise you'd have to load the whole scene!). We tried training SMERF with a larger number of smaller submodels, but at some point it becomes too onerous to train and quality begins to suffer.
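A simplified sketch of the "bias towards nearby assets" idea, with a made-up submodel layout; the real viewer's partitioning and streaming logic live in the linked JavaScript:

```python
import numpy as np

# Hypothetical submodel centers on a regular grid. SMERF partitions the scene
# into submodels; the exact layout here is invented for illustration.
submodel_centers = np.array(
    [[x, 0.0, z] for x in range(0, 10, 2) for z in range(0, 10, 2)], dtype=np.float32)

def active_submodel(camera_position):
    """Return the index of the submodel whose center is nearest the camera."""
    distances = np.linalg.norm(submodel_centers - camera_position, axis=1)
    return int(np.argmin(distances))

# As the camera moves, the viewer would prioritize fetching this submodel's assets.
print(active_submodel(np.array([3.0, 1.5, 7.0], dtype=np.float32)))
```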

cubefox
3 replies
18h56m

Very impressive! Any information on how this compares to 3D Gaussian splatting in terms of performance, quality or data size?

duckworthd
2 replies
18h38m

All these details and more in our technical paper! In short: SMERF training takes much longer, SMERF rendering is nearly as fast as 3DGS when a CUDA GPU is available, and quality is visibly higher than 3DGS on large scenes and slightly higher on smaller scenes.

https://arxiv.org/abs/2312.07541

zyang
1 replies
14h34m

Is it possible to use Zip-NeRF to train GS to eliminate the floaters?

duckworthd
0 replies
6h26m

Maybe! That's the seed of a completely different research paper :)

annoyingnoob
3 replies
21h47m

There is a market here for Realtors to upload pictures and produce walk-throughs of homes for sale.

ibrarmalik
1 replies
20h11m
duckworthd
0 replies
18h36m

Be careful with this one! Luma's offering requires that the camera follow the recorded video path. Our method lets the camera go wherever you desire!

esafak
0 replies
20h49m
SubiculumCode
3 replies
21h43m

I'm not sure why this demo runs so horribly in Firefox but not other browsers... anyone else having this?

daemonologist
1 replies
20h27m

Runs pretty well (20-100 fps depending on the scene) for me on both Firefox 120.1.1 on Android 14 (Pixel 7; smartphone preset) and Firefox 120.0.1 on Fedora 39 (R7 5800, 64 GB memory, RX 6600 XT; 1440p; desktop preset).

SubiculumCode
0 replies
19h36m

It seems that for some reason, my firefox is stuck in software compositor. I am getting:

WebRender initialization failed: Blocklisted; failure code RcANGLE(no compositor device for EGLDisplay)(Create)_FIRST

D3D11_COMPOSITING runtime failed: Failed to acquire a D3D11 device; Blocklisted; failure code FEATURE_FAILURE_D3D11_DEVICE2

I'm running a 3060

duckworthd
0 replies
18h33m

We unfortunately haven't tested our web viewer in Firefox. Let us know which platform you're running and we'll do our best to take a look in the new year (holiday vacation!).

In the meantime, give it a shot in a WebKit- or Chromium-based browser. I've had good results with Safari on iPhone and Chrome on Android/MacBook/Windows.

zeusk
2 replies
22h42m

Are radiance fields related to Gaussian splatting?

duckworthd
0 replies
22h37m

Gaussian Splatting is heavily inspired by work in radiance fields (or NeRF) models. They use much of the same technology!

corysama
0 replies
20h38m

Similar inputs, similar outputs, different representation.

sim7c00
2 replies
23h15m

This looks really amazing. I have a relatively old smartphone (2019) and it's really surprisingly smooth and high fidelity. Amazing job!

duckworthd
1 replies
22h37m

Thank you :). I'm glad to hear it! Which model are you using?

sim7c00
0 replies
21h46m

Samsung Galaxy S10e

nox100
2 replies
20h57m

Memory efficient? It downloaded 500 MB!

bongodongobob
1 replies
20h53m

A. Storage isn't memory

B. That's hardly anything in 2023.

duckworthd
0 replies
20h14m

Right-o. The web viewer is swapping assets in and out of memory as the user explores the scene. The network and disk requirements are high, but memory usage is low.

jacoblambda
2 replies
22h27m

Is there a relatively easy way to apply these kinds of techniques (either NeRFs or Gaussian splats) to larger environments, even if it's at lower precision? Like, say, a small town or a few blocks' worth of environment.

ibrarmalik
0 replies
19h50m

You’re under the right paper for doing this. Instead of one big model, they have several smaller ones for regions in the scene. This way rendering is fast for large scenes.

This is similar to Block-NeRF [0]; on their project page they show some videos of what you're asking about.

As for an easy way of doing this, nothing out-of-the-box. You can keep an eye on nerfstudio [1], and if you feel brave you could implement this paper and make a PR!

[0] https://waymo.com/intl/es/research/block-nerf/

[1] https://github.com/nerfstudio-project/nerfstudio

duckworthd
0 replies
20h15m

In principle, there's no reason you can't fit multiple city blocks at the same time with Instant NGP on a regular desktop. The challenge is in estimating the camera and lens parameters over such a large space. I expect such a reconstruction would be quite fuzzy given the low spatial resolution.

zyang
1 replies
15h58m

Why is there a 300 m^2 footprint limit if the sub-models are dynamically loaded? Is this constrained by training, rasterizing, or both?

duckworthd
0 replies
11h20m

In terms of the live viewer, there's actually no limit on footprint size. 300 m^2 is simply the biggest indoor capture we had!

yieldcrv
1 replies
19h35m

I had read about a competing technology that was suggesting NeRFs were a dead end,

but perhaps that was biased?

duckworthd
0 replies
18h3m

You're probably thinking of 3D Gaussian Splatting (3DGS), another fantastic approach to real-time novel view synthesis. There's tons of fantastic work being built on 3DGS right now, and the dust has yet to settle with respect to which method is "better". Right now, I can say that SMERF has slightly higher quality than 3DGS on small scenes, visibly higher quality on big scenes, and runs on a wider variety of devices, but takes much longer than 3DGS to train.

https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

tomatotomato31
1 replies
20h14m

I'm following this through two minutes paper and I'm looking forward to using it.

My grandpa died 2 years ago and, in hindsight, I took pictures so I could use them in something like your demo.

Awesome thanks:)

duckworthd
0 replies
20h5m

It would be my dream to make capturing 3D memories as easy and natural as taking a 2D photo with your smartphone today. Someday!

smusamashah
1 replies
19h8m

This is very impressive, but given it's by Google, will some code ever be released?

duckworthd
0 replies
18h8m

I hope to release the code in the new year, but we have some big dependencies that need to be released first. In the meantime, you can already begin hacking on the live viewer: https://github.com/smerf-3d/smerf-3d.github.io/blob/main/vie...

rzzzt
1 replies
19h34m

What kind of modes does the viewer cycle through when I press the space key?

duckworthd
0 replies
18h37m

Nice discovery :). Check the developer console: it'll tell you.

refulgentis
1 replies
22h27m

This is __really__ stunning work, a huge, huge deal that I'm seeing this in a web browser on my phone. Congratulations!

When I look at the NYC scene at the highest quality on desktop, I'm surprised by how low-quality e.g. the stuff on the counter and shelves is. So then I load the Lego model and see that it's _very_ detailed, so it doesn't seem inherent to the method.

Is it a consequence of input photo quality, or something else?

duckworthd
0 replies
20h7m

This is __really__ stunning work

Thank you :)

Is it a consequence of input photo quality, or something else?

It's more a consequence of spatial resolution: the bigger the space, the more voxels you need to maintain a fixed resolution (e.g. 1 mm^3). At some point, we have to give up spatial resolution to represent larger scenes.

A second limitation is the teacher model we're distilling. Zip-NeRF (https://jonbarron.info/zipnerf/) is good, but it's not _perfect_. SMERF reconstruction quality is upper-bounded by its Zip-NeRF teacher.
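Some back-of-the-envelope arithmetic to make the spatial-resolution point concrete; the scene and voxel sizes below are illustrative, not the actual SMERF configuration:

```python
# Voxel counts needed to cover a cubic region at a fixed grid resolution.
# Numbers are illustrative only.
def voxel_count(extent_m, voxel_size_m):
    """Voxels needed to cover a cube of side `extent_m` at `voxel_size_m` resolution."""
    per_axis = extent_m / voxel_size_m
    return per_axis ** 3

small_object = voxel_count(1.0, 0.001)    # a 1 m object at 1 mm voxels: ~1e9 voxels
whole_room = voxel_count(10.0, 0.001)     # a 10 m room at the same 1 mm: ~1e12 voxels

# Keeping memory fixed instead forces the voxel size to grow with the scene,
# which is why fine detail on counters and shelves drops in larger captures.
print(f"{small_object:.1e} vs {whole_room:.1e}")
```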

mdrzn
1 replies
5h24m

Impressive is not a big enough statement! This is incredibly smooth on my phone and crazy good on a desktop pc. Keep it up!

duckworthd
0 replies
4h2m

Thank you :)

jerpint
1 replies
21h38m

Just ran this on my phone through a browser, this is very impressive

duckworthd
0 replies
20h15m

Thank you :)

heliophobicdude
1 replies
21h55m

Great work!!

Question for the authors, are there opportunities, where they exist, to not use optimization or tuning methods for reconstructing a model of a scene?

We are refining efficient ways of rendering a view of a scene from these models but the scenes remain static. The scenes also take a while to reconstruct too.

Can we still achieve the great look and details of RF and GS without paying for an expensive reconstruction per instance of the scene?

Are there ways of greedily reconstructing a scene with traditional CG methods into these new representations now that they are fast to render?

Please forgive any misconceptions that I may have in advance! We really appreciate the work y'all are advancing!

duckworthd
0 replies
20h17m

Are there opportunities, where they exist, to not use optimization or tuning methods for reconstructing a model of a scene?

If you know a way, let me know! Every system I'm aware of involves optimization in one way or another, from COLMAP to 3D Gaussian Splatting to Instant NGP and more. Optimization is a powerful workhorse that gives us a far wider range of models than a direct solver ever could.

Can we still achieve the great look and details of RF and GS without paying for an expensive reconstruction per instance of the scene?

In the future I hope so. We don't have a convincing way to generate 3D scenes yet, but given the progress in 2D, I think it's only a matter of time.

Are there ways of greedily reconstructing a scene with traditional CG methods into these new representations now that they are fast to render?

Not that I'm aware of! If there were, I think these works should be on the front page instead of SMERF.

germandiago
1 replies
18h47m

Amazing, impressive, almost unbelievable :O

duckworthd
0 replies
18h38m

Thank you!

fngjdflmdflg
1 replies
20h21m

Google DeepMind, Google Research, Google Inc.

What a variety of groups! How did this come about?

duckworthd
0 replies
18h36m

Collaboration is a thing at the Big G :)

durag
1 replies
22h13m

Any plans to do this in VR? I would love to try this.

duckworthd
0 replies
20h13m

Not at the moment but an intrepid hacker could surely extend our JavaScript code and put something together.

UPDATE: The code for our web viewer is here: https://github.com/smerf-3d/smerf-3d.github.io/blob/main/vie...

digdugdirk
1 replies
14h13m

Can you recommend a good entry point into the theory/math behind these? This is one of those true "wtf, we can do this now?" moments, I'm super curious about how these are generated/created.

duckworthd
0 replies
11h17m

Oof, there's a lot of machinery here. It depends a lot on your academic background.

I'd recommend starting with a tutorial on neural radiance fields, aka NeRF, (https://sites.google.com/berkeley.edu/nerf-tutorial/home) and an applied overview of Deep Learning with tools like PyTorch or JAX. This line of work is still "cutting edge" research, so a lot of knowledge hasn't been rolled up into textbook or article form yet.

blovescoffee
1 replies
21h57m

Since you're here @author :) Do you mind giving a quick rundown on how this competes with the quality of zip-nerf?

duckworthd
0 replies
19h45m

Check out our explainer video for answers to this question and more! https://www.youtube.com/watch?v=zhO8iUBpnCc

azinman2
1 replies
10h26m

Will there be any notebooks or other code released to train our own models?

duckworthd
0 replies
6h24m

I hope so, but it'll be a good while before we can release anything. We have tight dependencies on other not-yet-OSS libraries, and until they're released, ours won't work either.

asgerhb
1 replies
8h30m

Wow! What am I even looking at here? Polygons, voxels, or something else entirely? How were the benchmarks recorded?

duckworthd
0 replies
6h28m

You're looking at something called a "neural radiance field" backed by a sparse, low resolution voxel grid and a dense high resolution triplane grid. That's a bit of a word soup, but you can think of it like a glowing fog rendered with ray marching.

The benchmark details are a bit complicated. Check out the technical paper's experiment section for the nitty gritty details.
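To unpack "a glowing fog rendered with ray marching", here is a minimal, purely illustrative version of the standard NeRF volume-rendering quadrature along a single ray, with toy density and color functions standing in for the model's grid and triplane lookups:

```python
import numpy as np

def render_ray(origin, direction, density_fn, color_fn, near=0.1, far=6.0, n_samples=128):
    """Numerically integrate the volume-rendering equation along one ray.

    density_fn(x) -> sigma >= 0 and color_fn(x) -> RGB stand in for the model's
    feature lookups; here they are arbitrary toy functions.
    """
    t = np.linspace(near, far, n_samples)
    delta = np.diff(t, append=t[-1] + (t[1] - t[0]))               # spacing between samples
    xs = origin + t[:, None] * direction                           # sample positions along the ray
    sigma = np.array([density_fn(x) for x in xs])
    rgb = np.array([color_fn(x) for x in xs])
    alpha = 1.0 - np.exp(-sigma * delta)                           # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # light surviving to each sample
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)                    # accumulated color

# Toy scene: a fuzzy glowing ball of radius 1 at the origin.
density = lambda x: 5.0 if np.linalg.norm(x) < 1.0 else 0.0
color = lambda x: np.array([1.0, 0.6, 0.2])
print(render_ray(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]), density, color))
```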

aappleby
1 replies
22h30m

Very impressive demo.

duckworthd
0 replies
19h45m

Thank you!

westurner
0 replies
17h45m

"Researchers create open-source platform for Neural Radiance Field development" (2023) https://news.ycombinator.com/item?id=36966076

NeRF Studio > Included Methods, Third-party Methods: https://docs.nerf.studio/#supported-methods

Neural Radiance Field: https://en.wikipedia.org/wiki/Neural_radiance_field

monlockandkey
0 replies
20h44m

Get this on a VR headset and you have a game changer literally.

RagnarD
0 replies
6h24m

I'm curious how the creators would compare this to the capabilities of Unreal Engine 5 (as far as the display technology goes.)