ZipSplat: Fewer Gaussians, Better Splats

Alexander Veicht¹ Sunghwan Hong¹,* Dániel Baráth¹ Marc Pollefeys¹,²

¹ETH Zürich ²Microsoft

* Corresponding author

Paper Code Weights BibTeX

TL;DR

ZipSplat is a feed-forward 3D Gaussian Splatting model that predicts Gaussians directly in 3D from a compact set of scene tokens: fewer Gaussians, better splats, pose-free in under a second, with a single model that spans the quality-efficiency curve at inference.

Figure: ZipSplat teaser. PSNR vs. number of Gaussians (left) and qualitative renders (right).

ZipSplat decouples Gaussians from the pixel grid, reaching higher quality with far fewer Gaussians in under a second.

Left: PSNR against the number of Gaussians on DL3DV with 24 input views. Each star is a single ZipSplat model evaluated at a different compression ratio. At comparable quality, ZipSplat uses up to 33× fewer Gaussians; with about 6× fewer, it gains about 2.1 dB PSNR over the best pose-free baseline.

Right: A direct quality comparison on the same scene. YoNoSplat needs 380K Gaussians to reach a quality that ZipSplat achieves with just 15K. ZipSplat continues to improve well beyond that point, while C3G, with its fixed token budget, lacks the capacity for fine detail.
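To put the two Gaussian counts quoted above side by side, the reduction factor is a single division. The helper name `reduction_factor` is ours, introduced only for this illustration.

```python
def reduction_factor(baseline_gaussians: int, compact_gaussians: int) -> float:
    """How many times fewer Gaussians the compact model uses."""
    return baseline_gaussians / compact_gaussians


# Counts quoted for the qualitative comparison: 380K (YoNoSplat) vs. 15K (ZipSplat).
factor = reduction_factor(380_000, 15_000)
print(f"{factor:.1f}x fewer Gaussians")  # 25.3x fewer Gaussians
```

This is the reduction for one scene at one matched quality level; the "up to 33×" figure refers to the PSNR curve on the left, not to this pair of renders.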

Explore in 3D

Drag to orbit, scroll to zoom: a live ZipSplat reconstruction from 24 drone frames, predicted pose-free in a single forward pass. Slide Quality to vary the compression budget; switch to Tokens to color each scene token's Gaussians.


Abstract

Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forward pass, yet current approaches predict one Gaussian per input pixel, tying the representation budget to camera resolution rather than scene complexity. A flat wall and a richly textured object thus produce equally many Gaussians despite very different geometric needs.

We propose ZipSplat, a token-based feed-forward model that decouples Gaussian placement from the pixel grid. A multi-view backbone extracts dense visual tokens, and k-means clustering compresses them into a compact set of scene tokens. Cross- and self-attention refine these tokens, and a lightweight MLP decodes each into a group of Gaussians with unconstrained 3D positions. Because clustering is applied at inference, a single trained model spans the quality-efficiency curve without retraining.

ZipSplat operates without ground-truth poses or intrinsics, yet sets a new state of the art on DL3DV and RealEstate10K with ~6× fewer Gaussians than pixel-aligned methods, surpassing the best pose-free baseline by 2.1 and 1.2 dB PSNR, respectively. It further generalizes zero-shot to Mip-NeRF360 and ScanNet++, outperforming all comparable baselines.

How it works

Most feed-forward methods place one Gaussian per pixel, so the number of Gaussians is fixed by image resolution. ZipSplat instead represents the scene as a compact set of scene tokens.
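A small sketch of why a pixel-aligned budget ignores scene content: the count is a function of view count and resolution only. The 256×256 resolution below is a hypothetical value chosen for illustration and is not taken from the paper.

```python
def pixel_aligned_gaussians(num_views: int, height: int, width: int) -> int:
    """One Gaussian per input pixel: the count depends only on image size."""
    return num_views * height * width


# Hypothetical resolution, for illustration only.
flat_wall = pixel_aligned_gaussians(num_views=24, height=256, width=256)
textured_object = pixel_aligned_gaussians(num_views=24, height=256, width=256)

# Both scenes receive the same budget regardless of their geometry.
print(flat_wall, textured_object)  # 1572864 1572864
```

Doubling the resolution in each dimension quadruples the count, even though the scene itself has not changed.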

A multi-view backbone extracts dense visual tokens from the input images. ZipSplat groups similar tokens together, merging repeated observations of the same surface into a smaller set of scene tokens. Attention layers refine each scene token against the full set of visual tokens, and a lightweight MLP decodes it into a small group of Gaussians whose 3D positions are unconstrained rather than restricted to a pixel ray. Because placement is no longer tied to pixels, ZipSplat assigns Gaussians by scene complexity rather than the image grid, concentrating them where geometry is detailed.
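The grouping and decoding steps can be sketched in NumPy as follows. This is a minimal illustration under our own assumptions, not the released implementation: it uses plain Lloyd k-means with random initialisation, skips the attention refinement entirely, and decodes with a two-layer ReLU MLP with random weights. The function names, the 14-parameter Gaussian layout (3 position, 3 scale, 4 rotation, 1 opacity, 3 color), and all sizes are hypothetical.

```python
import numpy as np


def kmeans_compress(tokens, num_scene_tokens, iters=10, seed=0):
    """Cluster (N, D) visual tokens into (K, D) scene tokens with Lloyd k-means."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from K distinct input tokens (fancy indexing copies).
    init = rng.choice(len(tokens), size=num_scene_tokens, replace=False)
    centroids = tokens[init]
    for _ in range(iters):
        # Squared Euclidean distance from every token to every centroid: (N, K).
        d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for k in range(num_scene_tokens):
            members = tokens[assign == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    # assign holds the cluster index of each token from the final iteration.
    return centroids, assign


def decode_gaussians(scene_tokens, w1, w2, gaussians_per_token, params_per_gaussian):
    """Decode each scene token into a small group of Gaussians with a 2-layer MLP."""
    hidden = np.maximum(scene_tokens @ w1, 0.0)  # ReLU
    out = hidden @ w2  # (K, gaussians_per_token * params_per_gaussian)
    return out.reshape(len(scene_tokens) * gaussians_per_token, params_per_gaussian)


rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(512, 32))  # stand-in for backbone features
scene_tokens, assign = kmeans_compress(visual_tokens, num_scene_tokens=64)

# Hypothetical layout: 3 position + 3 scale + 4 rotation + 1 opacity + 3 color.
w1 = rng.normal(size=(32, 64)) * 0.1
w2 = rng.normal(size=(64, 4 * 14)) * 0.1
gaussians = decode_gaussians(scene_tokens, w1, w2, gaussians_per_token=4,
                             params_per_gaussian=14)
print(scene_tokens.shape, gaussians.shape)  # (64, 32) (256, 14)
```

The point of the sketch is the shape of the computation: the number of Gaussians is the number of scene tokens times a fixed group size, so it follows the clustering budget rather than the number of input pixels.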

The number of scene tokens is set by a single compression ratio at inference. Lowering it produces a lighter reconstruction and raising it a sharper one, both from the same trained weights.
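One way to picture the ratio is as the fraction of visual tokens kept as scene tokens. That mapping is an assumption made for this sketch; the page does not define the ratio precisely, and `scene_token_budget` is a hypothetical helper.

```python
def scene_token_budget(num_visual_tokens: int, ratio: float) -> int:
    """Map a compression ratio in (0, 1] to a scene-token count (assumed mapping)."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    # Always keep at least one scene token.
    return max(1, round(num_visual_tokens * ratio))


# The same set of visual tokens, and the same weights, at three budgets.
for ratio in (0.1, 0.5, 1.0):
    print(ratio, scene_token_budget(1000, ratio))
```

Under this assumption the budget grows monotonically with the ratio, and only the number of clusters requested at inference changes.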

Adjustable quality

One trained model, the full quality-efficiency curve. Drag the slider to trade reconstruction fidelity for fewer Gaussians. No retraining required.

Interactive demo: compression slider, shown at r = 1.00 (249K / 249K Gaussians).

Scene gallery

Interactive gallery of nine reconstructions: four RE10K scenes (6 views each) and five DL3DV scenes (one with 6 views, one with 12, three with 24).

Try ZipSplat on your own scenes or read the full paper.


BibTeX

@article{veicht2026zipsplat,
  title   = {ZipSplat: Fewer Gaussians, Better Splats},
  author  = {Veicht, Alexander and Hong, Sunghwan and Barath, Daniel and Pollefeys, Marc},
  journal = {arXiv preprint arXiv:2606.05102},
  year    = {2026}
}