Learn AI Series (#87) - 3D Vision

[IMAGE: https://images.hive.blog/DQmYPPmhKswr7977MfswpFvHiWZgdgD1ichcq3k22pLJ1xM/variant-b-07-purple.png]

What will I learn

You will learn depth estimation from single images (monocular) and stereo pairs;
point clouds: representing 3D data as collections of points in space;
PointNet: processing unordered point sets with shared MLPs and permutation-invariant pooling;
Neural Radiance Fields (NeRF): reconstructing 3D scenes from 2D photographs using volume rendering;
positional encoding for spatial coordinates and why MLPs need it for high-frequency detail;
3D Gaussian Splatting: real-time 3D rendering from point-based representations;
3D reconstruction pipelines and their practical applications across industries.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#87) - 3D Vision

Solutions to Episode #86 Exercises

Exercise 1: Inpaint mask analyzer.

import numpy as np


class InpaintMaskAnalyzer:
    """Analyze binary masks for inpainting:
    area, coverage, boundary complexity,
    classification, and step recommendations."""

    def analyze(self, mask):
        h, w = mask.shape
        total_pixels = h * w
        masked = (mask == 255)
        area = int(masked.sum())
        coverage = area / total_pixels

        # Bounding box of masked region
        rows = np.any(masked, axis=1)
        cols = np.any(masked, axis=0)
        if not rows.any():
            return {"area": 0, "coverage": 0.0,
                    "bbox": (0, 0, 0, 0),
                    "aspect_ratio": 0.0,
                    "boundary_complexity": 0.0}
        r_min, r_max = np.where(rows)[0][[0, -1]]
        c_min, c_max = np.where(cols)[0][[0, -1]]
        bbox = (int(r_min), int(c_min),
                int(r_max), int(c_max))
        bh = r_max - r_min + 1
        bw = c_max - c_min + 1
        aspect = bw / max(bh, 1)

        # Boundary complexity: fraction of masked
        # pixels that border at least one unmasked
        # 4-neighbor
        padded = np.pad(masked, 1,
                        constant_values=False)
        bp = np.zeros_like(masked)
        for dr, dc in [(-1, 0), (1, 0),
                        (0, -1), (0, 1)]:
            shifted = padded[1 + dr:h + 1 + dr,
                             1 + dc:w + 1 + dc]
            bp |= (masked & (~shifted))
        boundary_count = int(bp.sum())
        complexity = (boundary_count / area
                      if area > 0 else 0.0)

        return {
            "area": area,
            "coverage": coverage,
            "bbox": bbox,
            "aspect_ratio": aspect,
            "boundary_complexity": complexity,
        }

    def classify_mask(self, mask):
        info = self.analyze(mask)
        cov = info["coverage"]
        if cov < 0.05:
            cat = "small_patch"
            steps = 20
        elif cov < 0.25:
            cat = "medium_region"
            steps = 30
        elif cov < 0.50:
            cat = "large_area"
            steps = 40
        else:
            cat = "reconstruction"
            steps = 50
        return cat, steps, info


analyzer = InpaintMaskAnalyzer()

# Generate 4 test masks on 256x256
size = 256
masks = {}

# 30x30 centered square
m = np.zeros((size, size), dtype=np.uint8)
c = size // 2
m[c - 15:c + 15, c - 15:c + 15] = 255
masks["30x30 square"] = m

# 100x100 centered square
m = np.zeros((size, size), dtype=np.uint8)
m[c - 50:c + 50, c - 50:c + 50] = 255
masks["100x100 square"] = m

# Horizontal stripe
m = np.zeros((size, size), dtype=np.uint8)
m[c - 40:c + 40, :] = 255
masks["horiz stripe"] = m

# Checkerboard (16x16 blocks)
m = np.zeros((size, size), dtype=np.uint8)
for r in range(0, size, 32):
    for cc in range(0, size, 32):
        m[r:r + 16, cc:cc + 16] = 255
masks["checkerboard"] = m

print(f"{'Mask':<16} {'Area':>6} "
      f"{'Cov%':>6} {'Cmplx':>6} "
      f"{'Category':<16} {'Steps':>5}")
print("-" * 60)
for name, mask in masks.items():
    cat, steps, info = analyzer.classify_mask(mask)
    print(f"{name:<16} {info['area']:>6} "
          f"{info['coverage'] * 100:>5.1f}% "
          f"{info['boundary_complexity']:>6.3f} "
          f"{cat:<16} {steps:>5}")

Boundary complexity follows the perimeter-to-area ratio, which shrinks as a shape gets bigger: the 30x30 square scores about 0.129, while the 100x100 square scores only about 0.040. The horizontal stripe comes out lowest of all (about 0.033): its long edges are straight and its area is large. Note that the mask is padded with unmasked pixels, so the stripe's pixels on the left and right image borders also count as boundary. The checkerboard has the highest complexity (about 0.234) because every 16x16 block has its entire perimeter exposed to unmasked neighbors, creating an enormous amount of boundary relative to area.

Exercise 2: Style transfer weight explorer.

import numpy as np


class StyleWeightExplorer:
    """Explore content/style weight combinations
    for neural style transfer."""

    def __init__(self, seed=42):
        rng = np.random.RandomState(seed)
        self.content_feat = rng.randn(
            1, 64, 32, 32).astype(np.float32)
        self.style_feat = rng.randn(
            1, 64, 32, 32).astype(np.float32)

    def gram_matrix(self, features):
        b, c, h, w = features.shape
        F_map = features.reshape(b, c, h * w)
        G = np.matmul(F_map,
                      F_map.transpose(0, 2, 1))
        return G / (c * h * w)

    def content_loss(self, gen, content):
        return float(np.mean(
            (gen - content) ** 2))

    def style_loss(self, gen, style):
        g_gram = self.gram_matrix(gen)
        s_gram = self.gram_matrix(style)
        return float(np.mean(
            (g_gram - s_gram) ** 2))

    def run(self):
        gen = self.content_feat.copy()
        c_loss = self.content_loss(
            gen, self.content_feat)
        s_loss = self.style_loss(
            gen, self.style_feat)

        alphas = [1, 10, 100]
        betas = [1e3, 1e4, 1e5, 1e6]

        print(f"Content loss (gen=content): "
              f"{c_loss:.6f}")
        print(f"Style loss (gen=content):   "
              f"{s_loss:.6f}")
        print()
        print(f"{'alpha':>6} {'beta':>8} "
              f"{'C_part':>10} {'S_part':>10} "
              f"{'Total':>10} {'S_dom':>6}")
        print("-" * 52)

        for a in alphas:
            for b in betas:
                c_part = a * c_loss
                s_part = b * s_loss
                total = c_part + s_part
                s_dom = (s_part / total
                         if total > 0
                         else 0)
                marker = ""
                if 0.4 <= s_dom <= 0.6:
                    marker = " <-- balanced"
                elif s_dom > 0.9:
                    marker = " <-- style-dom"
                elif s_dom < 0.1:
                    marker = " <-- content-dom"
                print(f"{a:>6} {b:>8.0f} "
                      f"{c_part:>10.4f} "
                      f"{s_part:>10.4f} "
                      f"{total:>10.4f} "
                      f"{s_dom:>6.3f}{marker}")


explorer = StyleWeightExplorer()
explorer.run()

Since gen = content_features, the content loss starts at exactly 0.0 (content is perfectly preserved). The entire total loss comes from the style term. This means the style dominance ratio is 1.0 for every (alpha, beta) pair -- which makes sense: if you haven't started optimizing yet and your starting point is the content image, there's zero content loss and all the gradient comes from style. In practice, after a few hundred optimization steps, the generated image moves away from the content (increasing content loss) and toward the style (decreasing style loss), and the balance depends on the alpha/beta ratio.

Exercise 3: Diffusion strength calibrator.

import numpy as np


class StrengthCalibrator:
    """Calibrate img2img strength by simulating
    1D diffusion editing."""

    def __init__(self, dim=128, T=1000, seed=42):
        rng = np.random.RandomState(seed)
        self.T = T
        self.dim = dim
        betas = np.linspace(1e-4, 0.02, T)
        alphas = 1.0 - betas
        self.alpha_bars = np.cumprod(alphas)

        self.x0_orig = rng.randn(dim)
        freq = 2 * np.pi * 3 / dim
        edit = 0.5 * np.sin(
            freq * np.arange(dim))
        self.x0_target = self.x0_orig + edit

    def simulate_edit(self, x0, strength,
                      num_steps=50):
        # Higher strength -> start denoising from a
        # later (noisier) timestep
        t_start = int(strength * self.T)
        t_start = max(1, min(t_start, self.T - 1))

        rng = np.random.RandomState(99)
        noise = rng.randn(self.dim)
        ab = self.alpha_bars[t_start]
        xt = np.sqrt(ab) * x0 + np.sqrt(
            1 - ab) * noise

        step_size = t_start // max(num_steps, 1)
        step_size = max(step_size, 1)
        timesteps = list(
            range(0, t_start, step_size))[::-1]

        x = xt.copy()
        for i, t in enumerate(timesteps):
            ab_t = self.alpha_bars[t]
            pred_noise = (
                (x - np.sqrt(ab_t) * x0)
                / np.sqrt(1 - ab_t + 1e-12))
            pred_x0 = (
                (x - np.sqrt(1 - ab_t)
                 * pred_noise)
                / np.sqrt(ab_t + 1e-12))
            pred_x0 = np.clip(pred_x0, -3, 3)

            if i + 1 < len(timesteps):
                ab_prev = self.alpha_bars[
                    timesteps[i + 1]]
            else:
                ab_prev = 1.0
            dir_xt = np.sqrt(
                1 - ab_prev) * pred_noise
            x = np.sqrt(ab_prev) * pred_x0 + dir_xt

        return x

    def run(self):
        strengths = [0.1, 0.2, 0.3, 0.4, 0.5,
                     0.6, 0.7, 0.8, 0.9, 1.0]

        orig_norm = np.mean(self.x0_orig ** 2)

        print(f"{'Str':>5} {'MSE_orig':>10} "
              f"{'MSE_tgt':>10} {'Pres':>6} "
              f"{'Trans':>6} {'Bal':>6}")
        print("-" * 48)

        best_bal = None
        best_str = None
        for s in strengths:
            result = self.simulate_edit(
                self.x0_orig, s)
            mse_orig = np.mean(
                (result - self.x0_orig) ** 2)
            mse_tgt = np.mean(
                (result - self.x0_target) ** 2)

            pres = max(0, 1 - mse_orig / max(
                orig_norm, 1e-12))
            trans = max(0, 1 - mse_tgt / max(
                orig_norm, 1e-12))
            bal = abs(pres - trans)

            if best_bal is None or bal < best_bal:
                best_bal = bal
                best_str = s

            print(f"{s:>5.1f} {mse_orig:>10.6f} "
                  f"{mse_tgt:>10.6f} "
                  f"{pres:>6.3f} "
                  f"{trans:>6.3f} "
                  f"{bal:>6.3f}")

        print(f"\nBest balance at strength="
              f"{best_str:.1f}")


cal = StrengthCalibrator()
cal.run()

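In a real img2img pipeline, MSE to the original grows as strength increases (more noise destroys more of the input signal), while MSE to the target initially decreases and then plateaus. Low strength means almost no change -- high preservation, low transformation. High strength means almost complete destruction of the original. The best balance point typically falls somewhere in the 0.4-0.6 range, which matches the practical experience from episode #86, where we noted that strength 0.5-0.6 is the sweet spot for moderate scene edits. One caveat about the toy simulation itself: its "denoiser" computes pred_noise from the true x0, so pred_x0 collapses back to the original at every step and the printed table stays essentially flat across strengths. To see the trade-off in the numbers, the oracle would have to be replaced by a denoiser that is pulled toward the target.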

On to today's episode

Welcome back! For the past ten episodes we've been working exclusively in 2D: classifying flat images, drawing bounding boxes on flat images, segmenting flat images pixel by pixel, reading text off flat images, analyzing sequences of flat images (video), and generating or editing flat images with diffusion models. Everything so far in the computer vision arc has operated on a fundamentally two-dimensional representation of the world.

But the world is NOT flat. Objects have depth. They have volume, surfaces, and spatial relationships that a single 2D image can only hint at. A self-driving car needs to know not just that there's a pedestrian in the frame, but how far away that pedestrian is. An AR headset needs to understand the 3D geometry of a room so it can place virtual objects on real tables without them floating in mid-air. A robotic arm needs to know the exact 3D shape and position of an object before it can grasp it reliably.

3D vision bridges the gap between 2D pixel understanding (what we've been doing) and real-world spatial reasoning (what machines actually need for physical interaction). This episode covers the key techniques -- from estimating depth out of flat pictures, through representing 3D data as point clouds, all the way to reconstructing entire 3D scenes from nothing but a handful of photographs ;-)

Depth estimation: how far is everything?

The most basic 3D vision task: given a 2D image, estimate the distance from the camera to every pixel. This produces a depth map -- an image where pixel brightness corresponds to distance.

Monocular depth estimation predicts depth from a single RGB image. This is technically an ill-posed problem -- a single 2D picture contains insufficient information to determine exact 3D geometry. The model has to rely on learned priors: perspective cues (parallel lines converging toward a vanishing point), relative object size (a car that appears tiny is probably far away), texture gradients (surfaces farther away show finer texture patterns), and occlusion (if object A partially covers object B, A is closer).

MiDaS (from Intel ISL) is a widely used monocular depth model. It uses a DPT (Dense Prediction Transformer) backbone -- essentially a Vision Transformer (episode #54) adapted for pixel-level predictions:

import torch
import cv2
import numpy as np

# MiDaS: a widely used monocular depth model
model = torch.hub.load(
    "intel-isl/MiDaS", "DPT_Large")
model.eval()

transform = torch.hub.load(
    "intel-isl/MiDaS",
    "transforms").dpt_transform

image = cv2.imread("street.jpg")
image_rgb = cv2.cvtColor(
    image, cv2.COLOR_BGR2RGB)
input_tensor = transform(
    image_rgb).unsqueeze(0)

with torch.no_grad():
    depth = model(input_tensor)
    depth = torch.nn.functional.interpolate(
        depth.unsqueeze(1),
        size=image.shape[:2],
        mode="bilinear",
        align_corners=False
    ).squeeze()

depth_np = depth.numpy()
# Normalize for visualization
# (inverse depth: closer = brighter)
depth_vis = cv2.normalize(
    depth_np, None, 0, 255,
    cv2.NORM_MINMAX).astype(np.uint8)
depth_colored = cv2.applyColorMap(
    depth_vis, cv2.COLORMAP_INFERNO)
cv2.imwrite("depth_map.png", depth_colored)

MiDaS produces relative depth -- it tells you that object A is closer than object B, but not the exact metric distance in meters. For autonomous driving and robotics, you need metric depth (actual meters), which requires either stereo cameras, LiDAR data for training supervision, or camera calibration information.
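One common way to bridge that gap is to fit a scale and a shift against a few known distances. The snippet below is a minimal sketch of that idea, assuming the prediction is inverse depth up to an unknown scale and shift and that a handful of metric reference measurements (from LiDAR, say) are available. The helper name `align_inverse_depth` and all the numbers are made up for illustration; none of this is part of MiDaS itself.

```python
import numpy as np


def align_inverse_depth(pred_inv, ref_depth_m):
    """Least-squares fit of scale s and shift t so that
    s * pred_inv + t approximates 1 / ref_depth_m."""
    target_inv = 1.0 / ref_depth_m
    A = np.stack([pred_inv, np.ones_like(pred_inv)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target_inv, rcond=None)
    return s, t


# Synthetic check: hide a known scale/shift, then recover it
rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 20.0, size=50)      # meters
# Assumed "relative" prediction: true inverse depth
# distorted by scale 0.5 and shift 0.02
pred = (1.0 / true_depth - 0.02) / 0.5

s, t = align_inverse_depth(pred, true_depth)
metric = 1.0 / (s * pred + t)
max_err = np.abs(metric - true_depth).max()
print(f"scale={s:.3f}, shift={t:.3f}, "
      f"max error={max_err:.6f} m")
```

On real data the fit will not be exact, and the quality of the result depends on how well the reference points cover the depth range of the scene.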

Stereo depth uses two cameras separated by a known baseline distance (like human eyes). The disparity -- how far apart the same object appears in the left and right images -- is inversely proportional to its distance. Deep learning stereo matching networks like AANet and RAFT-Stereo learn to find pixel correspondences between left and right images more accurately than classical block matching ever could:

def disparity_to_depth(disparity,
                       focal_length,
                       baseline):
    """Convert stereo disparity to metric depth.
    focal_length: in pixels
    baseline: distance between cameras in meters
    """
    depth = (focal_length * baseline) / (
        disparity + 1e-6)
    return depth


# Example: 50mm lens on full-frame sensor
# (pixel size ~6um)
# focal_length_px = 50mm / 0.006mm = 8333 px
# baseline = 0.12 meters (12cm between cameras)
# disparity = 100 pixels
# -> depth = 8333 * 0.12 / 100 = 10.0 meters

print("Stereo depth examples:")
for disp in [200, 100, 50, 25, 10]:
    d = disparity_to_depth(disp, 8333, 0.12)
    print(f"  disparity={disp:>4}px "
          f"-> depth={d:>6.1f}m")

The relationship is elegant: double the disparity, halve the depth. Close objects show large disparity (they shift a lot between left and right views). Distant objects show tiny disparity (they barely move). At infinite distance, disparity is zero -- both cameras see the object in the exact same position.
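To make "finding pixel correspondences" concrete, here is a rough sketch of classical block matching on a single rectified scanline. It assumes a synthetic random texture and a constant true disparity; `match_scanline`, the window size, and the search range are illustrative choices, not taken from any particular library.

```python
import numpy as np


def match_scanline(left_row, right_row,
                   max_disp=32, half_win=3):
    """Brute-force block matching on one scanline.
    For each left pixel x, pick the shift d in
    0..max_disp whose right-image window at x - d has
    the smallest sum of absolute differences."""
    w = left_row.shape[0]
    disparity = np.zeros(w, dtype=np.int64)
    for x in range(half_win + max_disp, w - half_win):
        patch = left_row[x - half_win:x + half_win + 1]
        costs = [
            np.abs(patch - right_row[
                x - d - half_win:x - d + half_win + 1
            ]).sum()
            for d in range(max_disp + 1)
        ]
        disparity[x] = int(np.argmin(costs))
    return disparity


rng = np.random.default_rng(0)
true_disp = 12
right_row = rng.uniform(0, 1, size=200)
# In the left view the same texture sits true_disp
# pixels further to the right
left_row = np.roll(right_row, true_disp)

disp = match_scanline(left_row, right_row)
valid = disp[35:197]   # columns the loop actually filled
print("recovered disparity:",
      np.bincount(valid).argmax())
```

Real images have repeated textures, occlusions, and noise, which is exactly where this naive search breaks down and learned matching networks earn their keep.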

Point clouds: representing 3D data

A point cloud is the simplest 3D data format: a collection of (x, y, z) coordinates in space, optionally with color or other per-point attributes. LiDAR sensors produce point clouds directly by measuring laser return times. Depth cameras (Intel RealSense, Microsoft Kinect) produce depth maps that we can convert to point clouds using the camera's intrinsic parameters:

import numpy as np


def depth_to_point_cloud(depth_map,
                         intrinsics):
    """Convert a depth map to a 3D point cloud.
    intrinsics: (fx, fy, cx, cy) -- focal
    lengths and principal point in pixels."""
    fx, fy, cx, cy = intrinsics
    h, w = depth_map.shape

    # Create pixel coordinate grids
    u = np.arange(w)
    v = np.arange(h)
    u, v = np.meshgrid(u, v)

    # Backproject to 3D using the pinhole
    # camera model
    z = depth_map
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    # Stack into (N, 3) point cloud
    points = np.stack(
        [x, y, z], axis=-1).reshape(-1, 3)

    # Remove invalid points (zero depth)
    valid = points[:, 2] > 0
    return points[valid]


# Example with synthetic depth
depth = np.random.uniform(
    1.0, 10.0, (480, 640))
# Typical RGB-D camera intrinsics
intrinsics = (525.0, 525.0, 319.5, 239.5)
pcd = depth_to_point_cloud(depth, intrinsics)
print(f"Point cloud: {pcd.shape[0]} points")
print(f"X range: [{pcd[:, 0].min():.1f}, "
      f"{pcd[:, 0].max():.1f}]")
print(f"Y range: [{pcd[:, 1].min():.1f}, "
      f"{pcd[:, 1].max():.1f}]")
print(f"Z range: [{pcd[:, 2].min():.1f}, "
      f"{pcd[:, 2].max():.1f}]")

The backprojection formula is just the inverse of the standard pinhole camera projection. If a 3D point (X, Y, Z) projects to pixel (u, v) via u = fx * X/Z + cx, then given (u, v, Z) we can recover X = (u - cx) * Z / fx. Having said that, the hard part in practice is getting accurate depth values -- LiDAR is precise but expensive, stereo depth has noise and holes, and monocular depth gives only relative values.
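A quick way to convince yourself that the two formulas are inverses is a round trip: project a few 3D points to pixels, then backproject them with their known depth. This is a small sketch with hypothetical helper names (`project`, `backproject`) and the same example intrinsics as above.

```python
import numpy as np


def project(points, intrinsics):
    """Pinhole projection: (N, 3) camera-frame points
    -> (N, 2) pixel coordinates."""
    fx, fy, cx, cy = intrinsics
    u = fx * points[:, 0] / points[:, 2] + cx
    v = fy * points[:, 1] / points[:, 2] + cy
    return np.stack([u, v], axis=1)


def backproject(pixels, z, intrinsics):
    """Inverse of project() when the depth z is known."""
    fx, fy, cx, cy = intrinsics
    x = (pixels[:, 0] - cx) * z / fx
    y = (pixels[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


intrinsics = (525.0, 525.0, 319.5, 239.5)
rng = np.random.default_rng(0)
pts = np.column_stack([
    rng.uniform(-2, 2, 5),    # X
    rng.uniform(-1, 1, 5),    # Y
    rng.uniform(1, 10, 5),    # Z (in front of camera)
])
pix = project(pts, intrinsics)
recovered = backproject(pix, pts[:, 2], intrinsics)
print("max round-trip error:",
      np.abs(recovered - pts).max())
```

The round trip only works because Z is handed back in; the projection itself throws depth away, which is the whole reason depth estimation is needed.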

Processing point clouds with neural networks

Here's where it gets interesting. Images are regular grids -- every pixel has a fixed position relative to its neighbors, which is exactly why convolutional filters work (episode #45). Point clouds are unordered sets. There's no "left neighbor" or "top-right pixel." If you shuffle the order of points in a cloud, the 3D shape hasn't changed at all. Any network that processes point clouds must be permutation-invariant -- it must produce the same output regardless of the order the points come in.

PointNet (Qi et al., 2017) solved this with a clean architectural insight: process each point independently through shared MLPs (so each point gets the same transformation), then aggregate across all points with a max-pool operation. Max-pooling is permutation-invariant -- the maximum value of a set doesn't change if you reorder the set:

import torch
import torch.nn as nn


class PointNet(nn.Module):
    """Simplified PointNet for classification.
    Key insight: shared MLPs + max-pool gives
    permutation invariance."""

    def __init__(self, num_classes=40):
        super().__init__()
        # Shared MLPs: same weights for every
        # point (like 1x1 convolutions)
        self.mlp1 = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.BatchNorm1d(64),
            nn.Linear(64, 128), nn.ReLU(),
            nn.BatchNorm1d(128))
        self.mlp2 = nn.Sequential(
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 1024))
        # Classifier on global feature
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, x):
        # x: (batch, num_points, 3)
        b, n, _ = x.shape
        # Process each point independently
        # Flatten so every point is one row
        h = x.reshape(b * n, 3)
        h = self.mlp1(h)
        h = self.mlp2(h)
        h = h.reshape(b, n, 1024)
        # Global max pool across all points
        global_feat = h.max(dim=1).values
        return self.classifier(global_feat)


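model = PointNet(num_classes=10)
# Eval mode: turns off dropout and freezes BatchNorm
# statistics, so repeated forward passes are
# deterministic and the check below is meaningful
model.eval()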
cloud = torch.randn(4, 1024, 3)
logits = model(cloud)
print(f"Input: {cloud.shape}")
print(f"Output: {logits.shape}")
# Verify permutation invariance
perm = torch.randperm(1024)
cloud_shuffled = cloud[:, perm, :]
logits2 = model(cloud_shuffled)
diff = (logits - logits2).abs().max().item()
print(f"Max diff after shuffling: {diff:.8f}")

The permutation invariance check at the bottom is the key sanity test -- shuffling the point order should produce identical output (up to floating point precision), provided the model is in eval mode so that dropout does not randomize each forward pass. PointNet works well for classification (is this point cloud a chair, table, or airplane?) and segmentation (label each point as belonging to a specific part). The limitation is that it captures no local structure: each point is transformed on its own and then everything is collapsed by a single global max-pool, so two points that are close together in space have no special relationship in PointNet's representation. PointNet++ addressed this by applying PointNet hierarchically on local neighborhoods, similar to how CNNs build up receptive fields.
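To give a feel for how such local neighborhoods can be formed, here is a simplified sketch of the sampling and grouping step: farthest point sampling picks well-spread centroids, and each centroid then collects its nearest neighbors. Note the assumption: k-nearest-neighbor grouping stands in here for the ball query used in the PointNet++ paper, and the function names are illustrative.

```python
import numpy as np


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so that each new point
    is as far as possible from those already chosen."""
    chosen = [start]
    # Distance from every point to its nearest
    # chosen point so far
    min_dist = np.linalg.norm(
        points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        d = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
    return np.array(chosen)


def knn_group(points, centroid_idx, num_neighbors):
    """For each centroid, return the indices of its
    num_neighbors closest points (itself included)."""
    centers = points[centroid_idx]
    d = np.linalg.norm(
        points[None, :, :] - centers[:, None, :],
        axis=-1)                       # (k, n)
    return np.argsort(d, axis=1)[:, :num_neighbors]


rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
centroids = farthest_point_sampling(cloud, 32)
groups = knn_group(cloud, centroids, 16)
print("centroids:", centroids.shape)
print("groups:", groups.shape)
```

Each row of `groups` is a small point set of its own, so a PointNet-style shared MLP plus max-pool can be run per group to get one local feature per centroid.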

NeRF: 3D from photographs

Neural Radiance Fields (Mildenhall et al., 2020) are one of those ideas that feel almost like magic the first time you see them. You take maybe 50-100 photographs of a scene from different angles, and the system reconstructs a complete 3D representation that lets you render the scene from any viewpoint -- including viewpoints that were never photographed.

The core idea: train a small MLP to map any 3D coordinate (x, y, z) and viewing direction (theta, phi) to a color and density:

f(x, y, z, theta, phi) -> (r, g, b, sigma)

where sigma is the volume density -- how opaque the space is at that point. To render an image from a new viewpoint, you cast rays through each pixel, sample points along each ray, query the network at each sample point, and compose the colors using volume rendering:

import torch
import torch.nn as nn


class NeRF(nn.Module):
    """Simplified Neural Radiance Field.
    Maps (position, direction) -> (color,
    density)."""

    def __init__(self, pos_dim=63,
                 dir_dim=27, hidden=256):
        super().__init__()
        # Position encoding -> density + feature
        self.pos_net = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.feature_head = nn.Linear(
            hidden, hidden)

        # Direction encoding -> color
        self.color_net = nn.Sequential(
            nn.Linear(hidden + dir_dim,
                      hidden // 2),
            nn.ReLU(),
            nn.Linear(hidden // 2, 3),
            nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc):
        h = self.pos_net(pos_enc)
        density = torch.relu(
            self.density_head(h))
        feature = self.feature_head(h)
        color = self.color_net(
            torch.cat([feature, dir_enc],
                      dim=-1))
        return color, density


def volume_render(colors, densities, deltas):
    """Classic volume rendering: compose colors
    along a camera ray by integrating density
    and color at sampled points."""
    # alpha = probability of hitting something
    alpha = 1.0 - torch.exp(
        -densities * deltas)
    # Transmittance: probability that the ray
    # reaches this sample without hitting
    # anything earlier
    transmittance = torch.cumprod(
        1.0 - alpha + 1e-10, dim=-1)
    transmittance = torch.cat([
        torch.ones_like(
            transmittance[..., :1]),
        transmittance[..., :-1]], dim=-1)
    # Weight = "hit here AND nothing blocked it"
    weights = alpha * transmittance
    pixel_color = (weights.unsqueeze(-1)
                   * colors).sum(dim=-2)
    return pixel_color

The architecture splits position and direction deliberately. Density depends only on position -- whether a point in space is occupied or empty doesn't change based on your viewing angle. Color depends on both position and direction, because real surfaces exhibit view-dependent effects like specular reflections, glossy highlights, and transparency that change as you move around them. A shiny metal surface looks different from the left than from the right, but the surface itself is in the same location regardless.
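To see the compositing rule in action without training anything, here is a toy single-ray example in NumPy. The densities and colors are hand-picked (thin red haze in front of a solid blue wall) purely for illustration, and `render_ray` is a hypothetical helper that follows the same alpha/transmittance logic as the volume rendering described above.

```python
import numpy as np


def render_ray(colors, densities, deltas):
    """Composite samples along one ray.
    colors: (N, 3), densities: (N,), deltas: (N,)"""
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance before sample i: product of
    # (1 - alpha) over all earlier samples
    trans = np.concatenate(
        [[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = alpha * trans
    return weights @ colors, weights


# Eight samples, 0.5 units apart: empty space,
# thin red haze, a gap, then a solid blue wall
t = np.arange(8) * 0.5 + 0.25        # sample depths
deltas = np.full(8, 0.5)
densities = np.array(
    [0.0, 0.0, 0.2, 0.2, 0.0, 50.0, 50.0, 50.0])
colors = np.tile([0.0, 0.0, 1.0], (8, 1))
colors[2:4] = [1.0, 0.0, 0.0]

pixel, weights = render_ray(colors, densities, deltas)
print("weights:", np.round(weights, 3))
print("pixel color:", np.round(pixel, 3))
print("expected depth:",
      round(float(weights @ t), 3))
```

Almost all of the weight lands on the first wall sample: the haze absorbs a little light, and the samples behind the wall surface are fully occluded. The same weights can be used to read off an expected depth for the pixel.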

Positional encoding: why coordinates need frequency lifting

The pos_dim=63 and dir_dim=27 in the code above are NOT the raw coordinate dimensions (3 for the position; 2 for a direction given as angles, or 3 when it is passed as a unit vector, as is common in practice). They're the result of positional encoding -- mapping the raw coordinates through sinusoidal functions at multiple frequencies:

import torch


def positional_encoding(x, num_freqs=10):
    """Lift raw coordinates to higher
    dimensions using sinusoids. Same idea
    as transformer positional encoding
    (episode #52) applied to spatial coords."""
    encodings = [x]
    for i in range(num_freqs):
        freq = 2.0 ** i
        encodings.append(torch.sin(
            freq * x))
        encodings.append(torch.cos(
            freq * x))
    return torch.cat(encodings, dim=-1)


# 3D position -> 3 + 3*2*10 = 63 dims
pos = torch.randn(100, 3)
pos_enc = positional_encoding(pos, 10)
print(f"Position: {pos.shape} -> "
      f"encoded: {pos_enc.shape}")

# Direction: (theta, phi) is only 2 raw values,
# but in practice it is fed as a 3D unit vector
# with num_freqs=4 -> 3 + 3*2*4 = 27 dims

Why is this needed? MLPs have a strong bias toward learning smooth, low-frequency functions (this is called spectral bias). Raw coordinates like (0.312, 0.745, 2.001) vary slowly and smoothly across space, so the MLP naturally produces blurry output. By encoding coordinates into sinusoidal features at multiple frequencies -- including high frequencies like sin(512 * x) -- you give the network the ability to represent sharp edges, fine textures, and intricate details. Without positional encoding, NeRF produces blurry, oversmoothed reconstructions. With it, you get crisp, photorealistic renderings.
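One way to see what the encoding buys you: take two coordinates that are almost identical and compare how far apart they are before and after encoding. A small NumPy sketch for a single scalar coordinate, using the same frequencies as the encoding above (`encode` is just an illustrative helper):

```python
import numpy as np


def encode(x, num_freqs=10):
    """Sinusoidal lifting of one scalar coordinate:
    [x, sin(2^0 x), cos(2^0 x), ..., sin(2^9 x),
    cos(2^9 x)]."""
    feats = [x]
    for i in range(num_freqs):
        feats += [np.sin(2.0 ** i * x),
                  np.cos(2.0 ** i * x)]
    return np.array(feats)


a, b = 0.500, 0.505       # two coordinates 0.005 apart
raw_gap = abs(a - b)
enc_gap = np.linalg.norm(encode(a) - encode(b))
print(f"raw distance:     {raw_gap:.4f}")
print(f"encoded distance: {enc_gap:.4f}")
```

In raw space the two inputs are nearly indistinguishable, so a smooth MLP maps them to nearly the same output. After encoding, the high-frequency terms pull them far apart, which gives the network room to assign them very different colors or densities.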

NeRF training is supervised: you have real photos from known camera positions, you render the scene from those viewpoints by casting rays and querying the network, and you minimize the pixel-wise MSE between rendered and real images. Once trained, rendering from any new viewpoint is possible -- the MLP has learned a continuous 3D representation of the entire scene.

The major limitation: speed. Rendering one pixel requires sampling 64-256 points along a ray and running an MLP forward pass for each. A 1080p frame has ~2 million pixels. That's hundreds of millions of network evaluations per frame. Even on fast GPUs, rendering takes seconds to minutes per frame. Real-time rendering was the unsolved problem -- until Gaussian splatting came along.
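The arithmetic behind that claim is easy to check. A tiny sketch, where `mlp_evals_per_frame` is a made-up helper and one network query per sample is assumed:

```python
def mlp_evals_per_frame(width, height, samples_per_ray):
    """One ray per pixel, one MLP query per sample."""
    return width * height * samples_per_ray


for samples in (64, 128, 256):
    evals = mlp_evals_per_frame(1920, 1080, samples)
    print(f"{samples:>3} samples/ray -> "
          f"{evals / 1e6:,.0f}M evaluations per frame")
```

Multiply that by 30 frames per second and it becomes clear why a per-sample MLP is a hard fit for interactive rendering.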

Gaussian splatting: real-time 3D

3D Gaussian Splatting (Kerbl et al., 2023) takes a fundamentally different approach. Instead of representing the scene as a continuous function queried via ray marching (NeRF), it represents the scene as a collection of 3D Gaussians -- think of them as colored, translucent blobs floating in space. Each Gaussian has a position, a covariance matrix (controlling its shape and orientation), an opacity, and color represented as spherical harmonics for view-dependent effects:

import torch


class GaussianScene:
    """Conceptual 3D Gaussian splatting scene.
    Each Gaussian is a colored, translucent
    3D blob with learnable parameters."""

    def __init__(self, num_gaussians):
        # Position (XYZ center)
        self.positions = torch.randn(
            num_gaussians, 3)
        # Scale (size in each axis)
        self.scales = (
            torch.ones(num_gaussians, 3)
            * 0.01)
        # Rotation (quaternion)
        self.rotations = torch.zeros(
            num_gaussians, 4)
        self.rotations[:, 0] = 1.0
        # Opacity (logit-space)
        self.opacities = torch.zeros(
            num_gaussians, 1)
        # Color (spherical harmonic coeffs
        # for view-dependent appearance)
        self.sh_coeffs = torch.randn(
            num_gaussians, 48)

    def parameter_count(self):
        total = (3 + 3 + 4 + 1 + 48)
        return total * len(self.positions)


scene = GaussianScene(100_000)
print(f"Gaussians: {len(scene.positions):,}")
print(f"Params per Gaussian: 59")
print(f"Total params: "
      f"{scene.parameter_count():,}")
print(f"Memory (float32): "
      f"{scene.parameter_count() * 4 / 1e6:.1f}"
      f" MB")

The rendering approach is the key difference from NeRF. Instead of casting rays through the scene and querying a function at sampled points (ray marching), Gaussian splatting projects each Gaussian onto the camera's image plane (rasterization or splatting). For each Gaussian, you compute its 2D projection (an ellipse on screen), sort Gaussians by depth, and blend them front-to-back using alpha compositing. This is embarrassingly parallel and maps directly to GPU rasterization pipelines -- the same hardware that renders video games at 60+ FPS.
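The blending step can be sketched for a single pixel. This assumes the per-pixel alpha of each splat (its opacity times the Gaussian falloff at that pixel) has already been computed; the values below are invented, and the early exit is a simplification of what a real tile-based rasterizer does.

```python
import numpy as np


def composite_pixel(depths, alphas, colors):
    """Blend the splats covering one pixel,
    nearest first."""
    order = np.argsort(depths)     # front to back
    out = np.zeros(3)
    transmittance = 1.0            # light still unblocked
    for i in order:
        out += transmittance * alphas[i] * colors[i]
        transmittance *= 1.0 - alphas[i]
        if transmittance < 1e-4:   # effectively opaque
            break
    return out, transmittance


depths = np.array([4.0, 1.5, 2.5])
alphas = np.array([0.9, 0.5, 0.5])
colors = np.array([[0.0, 0.0, 1.0],    # far: blue
                   [1.0, 0.0, 0.0],    # near: red
                   [0.0, 1.0, 0.0]])   # middle: green
pixel, remaining = composite_pixel(
    depths, alphas, colors)
print("pixel:", np.round(pixel, 3))
print("remaining transmittance:", round(remaining, 3))
```

The near red splat contributes the most even though the far blue one is the most opaque, because by the time the blend reaches it only a quarter of the light is left.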

The training loop mirrors NeRF: start with a sparse point cloud from Structure-from-Motion (SfM -- a classical algorithm that reconstructs 3D points and camera poses from multiple images), initialize one Gaussian per point, render from known viewpoints, compare to real photos, backpropagate through the differentiable rasterizer to adjust Gaussian parameters. The system also adaptively splits large Gaussians that cover too much area (adding detail), clones small Gaussians in under-reconstructed regions, and prunes Gaussians with near-zero opacity (removing waste).
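One detail that makes this optimization well-behaved: the covariance is not learned directly. It is assembled from a per-axis scale and a rotation quaternion as Sigma = R S S^T R^T, which stays a valid (symmetric, positive semi-definite) covariance no matter how gradient descent nudges the parameters. A small NumPy sketch of that construction, using the (w, x, y, z) quaternion order and hypothetical helper names:

```python
import numpy as np


def quat_to_rotmat(q):
    """Quaternion (w, x, y, z) -> 3x3 rotation matrix.
    The quaternion is normalized first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z),
         2 * (x * y - w * z),
         2 * (x * z + w * y)],
        [2 * (x * y + w * z),
         1 - 2 * (x * x + z * z),
         2 * (y * z - w * x)],
        [2 * (x * z - w * y),
         2 * (y * z + w * x),
         1 - 2 * (x * x + y * y)],
    ])


def covariance(scale, quat):
    """Sigma = R S S^T R^T."""
    R = quat_to_rotmat(quat)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T


# An elongated blob, rotated 90 degrees about z
scale = np.array([0.10, 0.02, 0.01])
quat = np.array(
    [np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
sigma = covariance(scale, quat)
print(np.round(sigma, 5))
```

After the rotation, the long axis of the blob points along y instead of x, which shows up as the largest variance moving to the second diagonal entry.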

The result: 30+ FPS rendering at 1080p, compared to NeRF's seconds to minutes per frame. Quality is comparable or sometimes better, and the explicit point-based representation is easier to manipulate than NeRF's implicit function -- you can delete Gaussians, move them, or merge scenes. This has made Gaussian splatting the practical choice for real applications: VR/AR environments, game asset creation, virtual tourism, and real estate walkthroughs.

Practical applications

3D vision is driving real products across multiple industries:

Autonomous vehicles: LiDAR point clouds combined with camera depth estimation for 3D object detection and path planning. Tesla's "pure vision" approach uses monocular and stereo depth estimation from cameras to avoid the cost of LiDAR sensors altogether
Augmented reality: understanding room geometry to anchor virtual objects on real surfaces. Apple's ARKit and Google's ARCore both estimate scene geometry from phone cameras, with ARKit also drawing on LiDAR on devices that have it
Robotics: grasping objects requires knowing their precise 3D shape and position in the robot's coordinate frame. Bin-picking systems in warehouses use depth cameras and point cloud processing
Cultural preservation: scanning historical buildings, sculptures, and artifacts into digital 3D models. The Notre-Dame reconstruction effort after the 2019 fire relied heavily on 3D scanning and photogrammetry
Real estate and mapping: Google Earth's 3D cities are built from aerial photogrammetry. Matterport creates 3D home walkthroughs. Luma AI lets you capture 3D scenes from phone video using Gaussian splatting

The trend is clearly toward fewer sensors and more computation -- replacing expensive LiDAR with monocular or stereo depth from cheap cameras, replacing professional 3D scanners with phone cameras plus neural reconstruction. The models keep getting better at extracting 3D understanding from 2D inputs, which makes the hardware requirements progressively cheaper.

Summary

Monocular depth estimation predicts relative depth from a single image using learned priors (perspective cues, object size, texture gradients); models like MiDaS use DPT (ViT-based) architectures; stereo depth uses two cameras and pixel disparity for metric depth in actual meters;
point clouds represent 3D data as unordered sets of (x, y, z) points; they can be produced by LiDAR, stereo cameras, or backprojection from depth maps using camera intrinsics;
PointNet processes point clouds with shared MLPs applied to each point independently, followed by max-pooling for permutation invariance; PointNet++ extends this with hierarchical local neighborhoods;
NeRF trains an MLP to map 3D coordinates and viewing direction to color and density, enabling photorealistic novel view synthesis from photographs; positional encoding with sinusoidal frequencies is critical for capturing high-frequency detail;
volume rendering composes colors along camera rays by integrating density and color at sampled points; this is differentiable, allowing end-to-end training from 2D image supervision;
3D Gaussian Splatting represents scenes as collections of colored 3D Gaussians that are rasterized (splatted) onto the image plane, achieving real-time rendering (30+ FPS) with quality comparable to NeRF; adaptive splitting, cloning, and pruning concentrate detail where needed;
the field is moving from expensive sensors (LiDAR, 3D scanners) toward neural reconstruction from commodity cameras -- phones can now capture 3D scenes that previously required specialized equipment.

We've now covered depth estimation, point cloud processing, and two approaches to 3D reconstruction from photographs. The computer vision section of this series has taken us from raw pixel operations through detection, segmentation, OCR, video, generative models, editing, and now 3D understanding. There's still more ground to cover in how machines interpret the visual world -- particularly around understanding human faces and applying vision to specialized scientific domains.

Exercises

Exercise 1: Build a stereo depth accuracy analyzer. Create a class StereoDepthAnalyzer that: (a) takes camera parameters (focal length in pixels, baseline in meters), (b) implements disparity_to_depth(disparity) and depth_to_disparity(depth) using the standard formula depth = f * B / disparity, (c) implements depth_error_from_disparity_error(true_depth, disparity_error_px) that computes how much depth error (in meters) results from N pixels of disparity error at a given true depth -- this shows the critical insight that depth accuracy degrades quadratically with distance, (d) for true depths [1, 2, 5, 10, 20, 50, 100] meters and a disparity error of 1 pixel, prints a table showing: true depth, true disparity, erroneous disparity (true +/- 1), resulting depth error in meters, and relative error as percentage. Use focal_length=1000px and baseline=0.12m. Verify that the depth error grows quadratically (roughly proportional to depth^2 / (f * B)) -- a 1-pixel disparity error at 2m causes ~4x the depth error that the same disparity error causes at 1m.

Exercise 2: Build a point cloud statistics calculator. Create a class PointCloudStats that: (a) generates a synthetic point cloud representing a room: floor points at y=0, back wall at z=3, left wall at x=-2, right wall at x=2, plus a cube (side length 0.5m) centered at (0, 0.25, 1.5), each surface with 500 points (with small Gaussian noise sigma=0.02), (b) computes basic statistics: total point count, bounding box (min/max for each axis), centroid, (c) implements estimate_normals(points, k=20) that for each point finds its k nearest neighbors (using scipy.spatial.KDTree) and fits a plane to them via PCA (the normal is the eigenvector with the smallest eigenvalue of the covariance matrix of the neighbors), (d) classifies each point as "horizontal" (normal mostly aligned with Y axis, abs(ny) > 0.8) or "vertical" (abs(ny) < 0.3), (e) prints: total points, bounding box, percentage horizontal vs vertical, and the average normal vector for each category. Verify that floor points are classified as horizontal and wall points as vertical.

Exercise 3: Build a NeRF ray sampling analyzer. Create a class RayAnalyzer that: (a) implements cast_ray(origin, direction, near, far, num_samples) that generates num_samples evenly spaced sample points along a ray from near to far distance, returning the 3D coordinates and the delta (distance between consecutive samples), (b) implements stratified_sampling(origin, direction, near, far, num_samples) that divides the [near, far] range into num_samples equal bins and samples one random point within each bin -- this is the sampling strategy NeRF actually uses, reducing aliasing compared to uniform spacing, (c) for a camera at origin (0, 0, 0) looking along +Z with near=0.5 and far=5.0, generates rays for a 4x4 grid of pixels (using a simple pinhole camera with focal length 50 pixels and image center at (2, 2)), (d) for each ray, computes: the total ray length, the number of samples, the average delta between samples, and the total volume sampled (approximated as num_rays * avg_delta * sample_cross_section, where sample_cross_section = (far / focal_length)^2 per pixel), (e) prints a table comparing uniform vs stratified sampling for sample counts [8, 16, 32, 64, 128, 256]: for each count, show the average delta, the standard deviation of deltas (should be 0 for uniform, nonzero for stratified), and the expected rendering time relative to 64 samples (linear scaling). Verify that stratified sampling has nonzero delta variance (it's intentionally randomized) while uniform sampling has zero variance up to floating-point rounding.

Thanks for reading!

@scipio