Learn AI Series (#87) - 3D Vision
[IMAGE: https://images.hive.blog/DQmYPPmhKswr7977MfswpFvHiWZgdgD1ichcq3k22pLJ1xM/variant-b-07-purple.png]
What will I learn
- You will learn depth estimation from single images (monocular) and stereo pairs;
- point clouds: representing 3D data as collections of points in space;
- PointNet: processing unordered point sets with shared MLPs and permutation-invariant pooling;
- Neural Radiance Fields (NeRF): reconstructing 3D scenes from 2D photographs using volume rendering;
- positional encoding for spatial coordinates and why MLPs need it for high-frequency detail;
- 3D Gaussian Splatting: real-time 3D rendering from point-based representations;
- 3D reconstruction pipelines and their practical applications across industries.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
- Learn AI Series (#80) - Image Segmentation
- Learn AI Series (#81) - Pose Estimation and Tracking
- Learn AI Series (#82) - Optical Character Recognition
- Learn AI Series (#83) - Video Understanding
- Learn AI Series (#84) - Generative Images - Diffusion Models (Part 1)
- Learn AI Series (#85) - Generative Images - Diffusion Models (Part 2)
- Learn AI Series (#86) - Image-to-Image and Editing
- Learn AI Series (#87) - 3D Vision (this post)
Solutions to Episode #86 Exercises
Exercise 1: Inpaint mask analyzer.
import numpy as np

class InpaintMaskAnalyzer:
    """Analyze binary masks for inpainting:
    area, coverage, boundary complexity,
    classification, and step recommendations."""

    def analyze(self, mask):
        h, w = mask.shape
        total_pixels = h * w
        masked = (mask == 255)
        area = int(masked.sum())
        coverage = area / total_pixels
        # Bounding box of masked region
        rows = np.any(masked, axis=1)
        cols = np.any(masked, axis=0)
        if not rows.any():
            return {"area": 0, "coverage": 0.0,
                    "bbox": (0, 0, 0, 0),
                    "aspect_ratio": 0.0,
                    "boundary_complexity": 0.0}
        r_min, r_max = np.where(rows)[0][[0, -1]]
        c_min, c_max = np.where(cols)[0][[0, -1]]
        bbox = (int(r_min), int(c_min),
                int(r_max), int(c_max))
        bh = r_max - r_min + 1
        bw = c_max - c_min + 1
        aspect = bw / max(bh, 1)
        # Boundary complexity: fraction of masked
        # pixels that border at least one unmasked
        # 4-neighbor
        padded = np.pad(masked, 1,
                        constant_values=False)
        bp = np.zeros_like(masked)
        for dr, dc in [(-1, 0), (1, 0),
                       (0, -1), (0, 1)]:
            shifted = padded[1 + dr:h + 1 + dr,
                             1 + dc:w + 1 + dc]
            bp |= (masked & (~shifted))
        boundary_count = int(bp.sum())
        complexity = (boundary_count / area
                      if area > 0 else 0.0)
        return {
            "area": area,
            "coverage": coverage,
            "bbox": bbox,
            "aspect_ratio": aspect,
            "boundary_complexity": complexity,
        }

    def classify_mask(self, mask):
        info = self.analyze(mask)
        cov = info["coverage"]
        if cov < 0.05:
            cat = "small_patch"
            steps = 20
        elif cov < 0.25:
            cat = "medium_region"
            steps = 30
        elif cov < 0.50:
            cat = "large_area"
            steps = 40
        else:
            cat = "reconstruction"
            steps = 50
        return cat, steps, info

analyzer = InpaintMaskAnalyzer()
# Generate 4 test masks on 256x256
size = 256
masks = {}
# 30x30 centered square
m = np.zeros((size, size), dtype=np.uint8)
c = size // 2
m[c - 15:c + 15, c - 15:c + 15] = 255
masks["30x30 square"] = m
# 100x100 centered square
m = np.zeros((size, size), dtype=np.uint8)
m[c - 50:c + 50, c - 50:c + 50] = 255
masks["100x100 square"] = m
# Horizontal stripe
m = np.zeros((size, size), dtype=np.uint8)
m[c - 40:c + 40, :] = 255
masks["horiz stripe"] = m
# Checkerboard (16x16 blocks)
m = np.zeros((size, size), dtype=np.uint8)
for r in range(0, size, 32):
    for cc in range(0, size, 32):
        m[r:r + 16, cc:cc + 16] = 255
masks["checkerboard"] = m

print(f"{'Mask':<16} {'Area':>6} "
      f"{'Cov%':>6} {'Cmplx':>6} "
      f"{'Category':<16} {'Steps':>5}")
print("-" * 60)
for name, mask in masks.items():
    cat, steps, info = analyzer.classify_mask(mask)
    print(f"{name:<16} {info['area']:>6} "
          f"{info['coverage'] * 100:>5.1f}% "
          f"{info['boundary_complexity']:>6.3f} "
          f"{cat:<16} {steps:>5}")
The 30x30 square has a much higher boundary complexity (about 0.13) than the 100x100 square (about 0.04), because a square's perimeter-to-area ratio decreases as it gets bigger. The horizontal stripe scores lowest of all (about 0.03): its boundary is just a few long straight edges around a large area. Note that the stripe touches the left and right image borders, and because the mask is padded with unmasked pixels, those border pixels count as boundary too. The checkerboard has the highest complexity by far (about 0.23) because every 16x16 block has its entire perimeter exposed to unmasked neighbors, creating an enormous amount of boundary relative to area.
Exercise 2: Style transfer weight explorer.
import numpy as np

class StyleWeightExplorer:
    """Explore content/style weight combinations
    for neural style transfer."""

    def __init__(self, seed=42):
        rng = np.random.RandomState(seed)
        self.content_feat = rng.randn(
            1, 64, 32, 32).astype(np.float32)
        self.style_feat = rng.randn(
            1, 64, 32, 32).astype(np.float32)

    def gram_matrix(self, features):
        b, c, h, w = features.shape
        F_map = features.reshape(b, c, h * w)
        G = np.matmul(F_map,
                      F_map.transpose(0, 2, 1))
        return G / (c * h * w)

    def content_loss(self, gen, content):
        return float(np.mean(
            (gen - content) ** 2))

    def style_loss(self, gen, style):
        g_gram = self.gram_matrix(gen)
        s_gram = self.gram_matrix(style)
        return float(np.mean(
            (g_gram - s_gram) ** 2))

    def run(self):
        gen = self.content_feat.copy()
        c_loss = self.content_loss(
            gen, self.content_feat)
        s_loss = self.style_loss(
            gen, self.style_feat)
        alphas = [1, 10, 100]
        betas = [1e3, 1e4, 1e5, 1e6]
        print(f"Content loss (gen=content): "
              f"{c_loss:.6f}")
        print(f"Style loss (gen=content): "
              f"{s_loss:.6f}")
        print()
        print(f"{'alpha':>6} {'beta':>8} "
              f"{'C_part':>10} {'S_part':>10} "
              f"{'Total':>10} {'S_dom':>6}")
        print("-" * 52)
        for a in alphas:
            for b in betas:
                c_part = a * c_loss
                s_part = b * s_loss
                total = c_part + s_part
                s_dom = (s_part / total
                         if total > 0
                         else 0)
                marker = ""
                if 0.4 <= s_dom <= 0.6:
                    marker = " <-- balanced"
                elif s_dom > 0.9:
                    marker = " <-- style-dom"
                elif s_dom < 0.1:
                    marker = " <-- content-dom"
                print(f"{a:>6} {b:>8.0f} "
                      f"{c_part:>10.4f} "
                      f"{s_part:>10.4f} "
                      f"{total:>10.4f} "
                      f"{s_dom:>6.3f}{marker}")

explorer = StyleWeightExplorer()
explorer.run()
Since gen = content_features, the content loss starts at exactly 0.0 (content is perfectly preserved). The entire total loss comes from the style term. This means the style dominance ratio is 1.0 for every (alpha, beta) pair -- which makes sense: if you haven't started optimizing yet and your starting point is the content image, there's zero content loss and all the gradient comes from style. In practice, after a few hundred optimization steps, the generated image moves away from the content (increasing content loss) and toward the style (decreasing style loss), and the balance depends on the alpha/beta ratio.
Exercise 3: Diffusion strength calibrator.
import numpy as np

class StrengthCalibrator:
    """Calibrate img2img strength by simulating
    1D diffusion editing."""

    def __init__(self, dim=128, T=1000, seed=42):
        rng = np.random.RandomState(seed)
        self.T = T
        self.dim = dim
        betas = np.linspace(1e-4, 0.02, T)
        alphas = 1.0 - betas
        self.alpha_bars = np.cumprod(alphas)
        self.x0_orig = rng.randn(dim)
        freq = 2 * np.pi * 3 / dim
        edit = 0.5 * np.sin(
            freq * np.arange(dim))
        self.x0_target = self.x0_orig + edit

    def simulate_edit(self, x0, strength,
                      num_steps=50):
        # Higher strength = start from a noisier
        # timestep (more of the input destroyed)
        t_start = int(strength * self.T)
        t_start = max(1, min(t_start, self.T - 1))
        rng = np.random.RandomState(99)
        noise = rng.randn(self.dim)
        ab = self.alpha_bars[t_start]
        xt = np.sqrt(ab) * x0 + np.sqrt(
            1 - ab) * noise
        step_size = t_start // max(num_steps, 1)
        step_size = max(step_size, 1)
        timesteps = list(
            range(0, t_start, step_size))[::-1]
        x = xt.copy()
        for i, t in enumerate(timesteps):
            ab_t = self.alpha_bars[t]
            pred_noise = (
                (x - np.sqrt(ab_t) * x0)
                / np.sqrt(1 - ab_t + 1e-12))
            pred_x0 = (
                (x - np.sqrt(1 - ab_t)
                 * pred_noise)
                / np.sqrt(ab_t + 1e-12))
            pred_x0 = np.clip(pred_x0, -3, 3)
            if i + 1 < len(timesteps):
                ab_prev = self.alpha_bars[
                    timesteps[i + 1]]
            else:
                ab_prev = 1.0
            dir_xt = np.sqrt(
                1 - ab_prev) * pred_noise
            x = np.sqrt(ab_prev) * pred_x0 + dir_xt
        return x

    def run(self):
        strengths = [0.1, 0.2, 0.3, 0.4, 0.5,
                     0.6, 0.7, 0.8, 0.9, 1.0]
        orig_norm = np.mean(self.x0_orig ** 2)
        print(f"{'Str':>5} {'MSE_orig':>10} "
              f"{'MSE_tgt':>10} {'Pres':>6} "
              f"{'Trans':>6} {'Bal':>6}")
        print("-" * 48)
        best_bal = None
        best_str = None
        for s in strengths:
            result = self.simulate_edit(
                self.x0_orig, s)
            mse_orig = np.mean(
                (result - self.x0_orig) ** 2)
            mse_tgt = np.mean(
                (result - self.x0_target) ** 2)
            pres = max(0, 1 - mse_orig / max(
                orig_norm, 1e-12))
            trans = max(0, 1 - mse_tgt / max(
                orig_norm, 1e-12))
            bal = abs(pres - trans)
            if best_bal is None or bal < best_bal:
                best_bal = bal
                best_str = s
            print(f"{s:>5.1f} {mse_orig:>10.6f} "
                  f"{mse_tgt:>10.6f} "
                  f"{pres:>6.3f} "
                  f"{trans:>6.3f} "
                  f"{bal:>6.3f}")
        print(f"\nBest balance at strength="
              f"{best_str:.1f}")

cal = StrengthCalibrator()
cal.run()
As strength increases, MSE to the original grows (more noise destroys more of the input signal) while MSE to the target initially decreases then plateaus. Low strength means almost no change -- high preservation, low transformation. High strength means almost complete destruction of the original. The best balance point typically falls somewhere in the 0.4-0.6 range, which matches the practical experience from episode #86 where we noted that strength 0.5-0.6 is the sweet spot for moderate scene edits.
On to today's episode
Welcome back! For the past ten episodes we've been working exclusively in 2D: classifying flat images, drawing bounding boxes on flat images, segmenting flat images pixel by pixel, reading text off flat images, analyzing sequences of flat images (video), and generating or editing flat images with diffusion models. Everything so far in the computer vision arc has operated on a fundamentally two-dimensional representation of the world.
But the world is NOT flat. Objects have depth. They have volume, surfaces, and spatial relationships that a single 2D image can only hint at. A self-driving car needs to know not just that there's a pedestrian in the frame, but how far away that pedestrian is. An AR headset needs to understand the 3D geometry of a room so it can place virtual objects on real tables without them floating in mid-air. A robotic arm needs to know the exact 3D shape and position of an object before it can grasp it reliably.
3D vision bridges the gap between 2D pixel understanding (what we've been doing) and real-world spatial reasoning (what machines actually need for physical interaction). This episode covers the key techniques -- from estimating depth out of flat pictures, through representing 3D data as point clouds, all the way to reconstructing entire 3D scenes from nothing but a handful of photographs ;-)
Depth estimation: how far is everything?
The most basic 3D vision task: given a 2D image, estimate the distance from the camera to every pixel. This produces a depth map -- an image where pixel brightness corresponds to distance.
Monocular depth estimation predicts depth from a single RGB image. This is technically an ill-posed problem -- a single 2D picture contains insufficient information to determine exact 3D geometry. The model has to rely on learned priors: perspective cues (parallel lines converging toward a vanishing point), relative object size (a car that appears tiny is probably far away), texture gradients (surfaces farther away show finer texture patterns), and occlusion (if object A partially covers object B, A is closer).
MiDaS (from Intel ISL) is the standard monocular depth model. It uses a DPT (Dense Prediction Transformer) backbone -- essentially a Vision Transformer (episode #54) adapted for pixel-level predictions:
import torch
import cv2
import numpy as np

# MiDaS: the standard monocular depth model
model = torch.hub.load(
    "intel-isl/MiDaS", "DPT_Large")
model.eval()
transform = torch.hub.load(
    "intel-isl/MiDaS",
    "transforms").dpt_transform
image = cv2.imread("street.jpg")
image_rgb = cv2.cvtColor(
    image, cv2.COLOR_BGR2RGB)
# The MiDaS transform already returns a batched
# (1, 3, H, W) tensor, so no unsqueeze is needed
input_tensor = transform(image_rgb)
with torch.no_grad():
    depth = model(input_tensor)
    # Resize the prediction back to the
    # original image resolution
    depth = torch.nn.functional.interpolate(
        depth.unsqueeze(1),
        size=image.shape[:2],
        mode="bilinear",
        align_corners=False
    ).squeeze()
depth_np = depth.numpy()
# Normalize for visualization
# (inverse depth: closer = brighter)
depth_vis = cv2.normalize(
    depth_np, None, 0, 255,
    cv2.NORM_MINMAX).astype(np.uint8)
depth_colored = cv2.applyColorMap(
    depth_vis, cv2.COLORMAP_INFERNO)
cv2.imwrite("depth_map.png", depth_colored)
MiDaS produces relative depth -- it tells you that object A is closer than object B, but not the exact metric distance in meters. For autonomous driving and robotics, you need metric depth (actual meters), which requires either stereo cameras, LiDAR data for training supervision, or camera calibration information.
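One common way to bridge that gap, sketched here under assumptions: if a handful of pixels have a known metric depth (say from a sparse LiDAR scan), you can fit a scale and a shift that map the relative inverse-depth prediction onto metric inverse depth with least squares. The function name `align_inverse_depth` and the synthetic data are illustrative and not part of MiDaS:

```python
import numpy as np

def align_inverse_depth(pred_inv, metric_depth, mask):
    """Fit scale s and shift t so that s * pred_inv + t ~ 1 / metric_depth
    on the pixels where mask is True (ordinary least squares)."""
    x = pred_inv[mask].ravel()
    y = 1.0 / metric_depth[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    aligned_inv = s * pred_inv + t
    # Guard against division by ~0 for very distant pixels
    return 1.0 / np.clip(aligned_inv, 1e-6, None), s, t

# Synthetic check: the "relative" prediction is an unknown affine
# transform of the true inverse depth
rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 10.0, (48, 64))
pred_inv = 37.0 * (1.0 / true_depth) + 5.0
# Pretend only ~2% of the pixels have a metric measurement
mask = rng.random(true_depth.shape) < 0.02
metric, s, t = align_inverse_depth(pred_inv, true_depth, mask)
max_err = np.abs(metric - true_depth).max()
print(f"scale={s:.4f} shift={t:.4f} max error={max_err:.2e} m")
```

On real predictions the fit is never this clean, so robust variants (outlier rejection, per-region fits) are usually needed.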
Stereo depth uses two cameras separated by a known baseline distance (like human eyes). The disparity -- how far apart the same object appears in the left and right images -- is inversely proportional to its distance. Deep learning stereo matching networks like AANet and RAFT-Stereo learn to find pixel correspondences between left and right images more accurately than classical block matching ever could:
def disparity_to_depth(disparity,
                       focal_length,
                       baseline):
    """Convert stereo disparity to metric depth.
    focal_length: in pixels
    baseline: distance between cameras in meters
    """
    depth = (focal_length * baseline) / (
        disparity + 1e-6)
    return depth

# Example: 50mm lens on full-frame sensor
# (pixel size ~6um)
# focal_length_px = 50mm / 0.006mm = 8333 px
# baseline = 0.12 meters (12cm between cameras)
# disparity = 100 pixels
# -> depth = 8333 * 0.12 / 100 = 10.0 meters
print("Stereo depth examples:")
for disp in [200, 100, 50, 25, 10]:
    d = disparity_to_depth(disp, 8333, 0.12)
    print(f"  disparity={disp:>4}px "
          f"-> depth={d:>6.1f}m")
The relationship is elegant: double the disparity, halve the depth. Close objects show large disparity (they shift a lot between left and right views). Distant objects show tiny disparity (they barely move). At infinite distance, disparity is zero -- both cameras see the object in the exact same position.
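A consequence worth seeing in numbers: because depth goes as 1/disparity, a fixed matching error in pixels costs far more depth accuracy at long range (to first order, the error grows with the square of the depth). The sketch below reuses the example rig (8333 px focal length, 0.12 m baseline) and assumes a hypothetical 0.5 px matching error:

```python
def depth_from_disparity(disparity_px, focal_px=8333.0, baseline_m=0.12):
    """Metric depth in meters from disparity in pixels."""
    return focal_px * baseline_m / disparity_px

def depth_error(disparity_px, disp_err_px=0.5,
                focal_px=8333.0, baseline_m=0.12):
    """How much depth is overestimated when the matcher
    underestimates disparity by disp_err_px."""
    z = depth_from_disparity(disparity_px, focal_px, baseline_m)
    z_off = depth_from_disparity(
        disparity_px - disp_err_px, focal_px, baseline_m)
    return z_off - z

for disp in [200, 100, 50, 25, 10]:
    z = depth_from_disparity(disp)
    err = depth_error(disp)
    print(f"disparity={disp:>4}px  depth={z:>6.1f}m  "
          f"error from 0.5px mismatch=+{err:.2f}m")
```

This is why stereo rigs meant for long-range work use wider baselines or longer focal lengths: both increase disparity at a given distance.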
Point clouds: representing 3D data
A point cloud is the simplest 3D data format: a collection of (x, y, z) coordinates in space, optionally with color or other per-point attributes. LiDAR sensors produce point clouds directly by measuring laser return times. Depth cameras (Intel RealSense, Microsoft Kinect) produce depth maps that we can convert to point clouds using the camera's intrinsic parameters:
import numpy as np

def depth_to_point_cloud(depth_map,
                         intrinsics):
    """Convert a depth map to a 3D point cloud.
    intrinsics: (fx, fy, cx, cy) -- focal
    lengths and principal point in pixels."""
    fx, fy, cx, cy = intrinsics
    h, w = depth_map.shape
    # Create pixel coordinate grids
    u = np.arange(w)
    v = np.arange(h)
    u, v = np.meshgrid(u, v)
    # Backproject to 3D using the pinhole
    # camera model
    z = depth_map
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    # Stack into (N, 3) point cloud
    points = np.stack(
        [x, y, z], axis=-1).reshape(-1, 3)
    # Remove invalid points (zero depth)
    valid = points[:, 2] > 0
    return points[valid]

# Example with synthetic depth
depth = np.random.uniform(
    1.0, 10.0, (480, 640))
# Typical RGB-D camera intrinsics
intrinsics = (525.0, 525.0, 319.5, 239.5)
pcd = depth_to_point_cloud(depth, intrinsics)
print(f"Point cloud: {pcd.shape[0]} points")
print(f"X range: [{pcd[:, 0].min():.1f}, "
      f"{pcd[:, 0].max():.1f}]")
print(f"Y range: [{pcd[:, 1].min():.1f}, "
      f"{pcd[:, 1].max():.1f}]")
print(f"Z range: [{pcd[:, 2].min():.1f}, "
      f"{pcd[:, 2].max():.1f}]")
The backprojection formula is just the inverse of the standard pinhole camera projection. If a 3D point (X, Y, Z) projects to pixel (u, v) via u = fx * X/Z + cx, then given (u, v, Z) we can recover X = (u - cx) * Z / fx. Having said that, the hard part in practice is getting accurate depth values -- LiDAR is precise but expensive, stereo depth has noise and holes, and monocular depth gives only relative values.
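A quick way to convince yourself that the two formulas really are inverses is a round trip: project some 3D points to pixels, backproject them with their depth, and check that you land where you started. A minimal sketch (the helper names `project` and `backproject` are illustrative):

```python
import numpy as np

def project(points, intrinsics):
    """Pinhole projection: (N, 3) camera-frame points -> (N, 2) pixels."""
    fx, fy, cx, cy = intrinsics
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([fx * X / Z + cx, fy * Y / Z + cy], axis=1)

def backproject(pixels, depth, intrinsics):
    """Inverse of project(): (N, 2) pixels plus (N,) depth -> (N, 3)."""
    fx, fy, cx, cy = intrinsics
    u, v = pixels[:, 0], pixels[:, 1]
    return np.stack(
        [(u - cx) * depth / fx, (v - cy) * depth / fy, depth], axis=1)

intrinsics = (525.0, 525.0, 319.5, 239.5)
rng = np.random.default_rng(0)
pts = np.column_stack([
    rng.uniform(-2, 2, 1000),   # X
    rng.uniform(-2, 2, 1000),   # Y
    rng.uniform(1, 10, 1000),   # Z (in front of the camera)
])
pix = project(pts, intrinsics)
recovered = backproject(pix, pts[:, 2], intrinsics)
roundtrip_err = np.abs(recovered - pts).max()
print(f"Max round-trip error: {roundtrip_err:.2e}")
```

Note that the round trip only works because the true Z is passed back in. The pixel alone loses the depth, which is exactly why depth estimation is needed in the first place.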
Processing point clouds with neural networks
Here's where it gets interesting. Images are regular grids -- every pixel has a fixed position relative to its neighbors, which is exactly why convolutional filters work (episode #45). Point clouds are unordered sets. There's no "left neighbor" or "top-right pixel." If you shuffle the order of points in a cloud, the 3D shape hasn't changed at all. Any network that processes point clouds must be permutation-invariant -- it must produce the same output regardless of the order the points come in.
PointNet (Qi et al., 2017) solved this with a clean architectural insight: process each point independently through shared MLPs (so each point gets the same transformation), then aggregate across all points with a max-pool operation. Max-pooling is permutation-invariant -- the maximum value of a set doesn't change if you reorder the set:
import torch
import torch.nn as nn

class PointNet(nn.Module):
    """Simplified PointNet for classification.
    Key insight: shared MLPs + max-pool gives
    permutation invariance."""

    def __init__(self, num_classes=40):
        super().__init__()
        # Shared MLPs: same weights for every
        # point (like 1x1 convolutions)
        self.mlp1 = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.BatchNorm1d(64),
            nn.Linear(64, 128), nn.ReLU(),
            nn.BatchNorm1d(128))
        self.mlp2 = nn.Sequential(
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 1024))
        # Classifier on global feature
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, x):
        # x: (batch, num_points, 3)
        b, n, _ = x.shape
        # Process each point independently
        h = x.reshape(b * n, 3)
        h = self.mlp1(h)
        h = h.reshape(b, n, 128)
        h = h.reshape(b * n, 128)
        h = self.mlp2(h)
        h = h.reshape(b, n, 1024)
        # Global max pool across all points
        global_feat = h.max(dim=1).values
        return self.classifier(global_feat)

model = PointNet(num_classes=10)
# Eval mode: dropout is random per call in
# training mode, which would make the two
# forward passes below differ for reasons
# unrelated to point order
model.eval()
cloud = torch.randn(4, 1024, 3)
logits = model(cloud)
print(f"Input: {cloud.shape}")
print(f"Output: {logits.shape}")
# Verify permutation invariance
perm = torch.randperm(1024)
cloud_shuffled = cloud[:, perm, :]
logits2 = model(cloud_shuffled)
diff = (logits - logits2).abs().max().item()
print(f"Max diff after shuffling: {diff:.8f}")
The permutation invariance check at the bottom is the key sanity test -- shuffling the point order should produce identical output (up to floating point precision). PointNet works well for classification (is this point cloud a chair, table, or airplane?) and segmentation (label each point as belonging to a specific part). The limitation is that max-pooling throws away local structure -- two points that are close together in space have no special relationship in PointNet's representation. PointNet++ addressed this by applying PointNet hierarchically on local neighborhoods, similar to how CNNs build up receptive fields.
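To make "local neighborhoods" concrete: PointNet++ picks a subset of well-spread centroid points with farthest point sampling, then groups the points within a radius of each centroid and runs a small PointNet on each group. Below is a minimal NumPy sketch of just the sampling and grouping steps. The function names and the radius value are illustrative, and real implementations run this on the GPU:

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Pick k indices so that each new point is the one farthest
    from all points chosen so far."""
    chosen = np.zeros(k, dtype=np.int64)
    chosen[0] = start
    # Distance from every point to its nearest chosen point
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        chosen[i] = int(np.argmax(min_dist))
        d = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, d)
    return chosen

def ball_query(points, centroids, radius):
    """For each centroid, return the indices of points within radius."""
    d = np.linalg.norm(
        points[None, :, :] - centroids[:, None, :], axis=-1)
    return [np.where(row <= radius)[0] for row in d]

rng = np.random.default_rng(0)
cloud = rng.uniform(-1, 1, (1024, 3))
idx = farthest_point_sampling(cloud, 32)
groups = ball_query(cloud, cloud[idx], radius=0.4)
sizes = [len(g) for g in groups]
print(f"Centroids: {len(idx)}")
print(f"Group sizes: min={min(sizes)}, max={max(sizes)}")
```

Each group is itself an unordered point set, so the shared-MLP-plus-max-pool trick applies unchanged at the local level.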
NeRF: 3D from photographs
Neural Radiance Fields (Mildenhall et al., 2020) are one of those ideas that feel almost like magic the first time you see them. You take maybe 50-100 photographs of a scene from different angles, and the system reconstructs a complete 3D representation that lets you render the scene from any viewpoint -- including viewpoints that were never photographed.
The core idea: train a small MLP to map any 3D coordinate (x, y, z) and viewing direction (theta, phi) to a color and density:
f(x, y, z, theta, phi) -> (r, g, b, sigma)
where sigma is the volume density -- how opaque the space is at that point. To render an image from a new viewpoint, you cast rays through each pixel, sample points along each ray, query the network at each sample point, and compose the colors using volume rendering:
import torch
import torch.nn as nn

class NeRF(nn.Module):
    """Simplified Neural Radiance Field.
    Maps (position, direction) -> (color,
    density)."""

    def __init__(self, pos_dim=63,
                 dir_dim=27, hidden=256):
        super().__init__()
        # Position encoding -> density + feature
        self.pos_net = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.feature_head = nn.Linear(
            hidden, hidden)
        # Direction encoding -> color
        self.color_net = nn.Sequential(
            nn.Linear(hidden + dir_dim,
                      hidden // 2),
            nn.ReLU(),
            nn.Linear(hidden // 2, 3),
            nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc):
        h = self.pos_net(pos_enc)
        density = torch.relu(
            self.density_head(h))
        feature = self.feature_head(h)
        color = self.color_net(
            torch.cat([feature, dir_enc],
                      dim=-1))
        return color, density

def volume_render(colors, densities, deltas):
    """Classic volume rendering: compose colors
    along a camera ray by integrating density
    and color at sampled points."""
    # alpha = probability of hitting something
    alpha = 1.0 - torch.exp(
        -densities * deltas)
    # Transmittance: probability that the ray
    # reaches this sample without hitting
    # anything earlier
    transmittance = torch.cumprod(
        1.0 - alpha + 1e-10, dim=-1)
    transmittance = torch.cat([
        torch.ones_like(
            transmittance[..., :1]),
        transmittance[..., :-1]], dim=-1)
    # Weight = "hit here AND nothing blocked it"
    weights = alpha * transmittance
    pixel_color = (weights.unsqueeze(-1)
                   * colors).sum(dim=-2)
    return pixel_color
The architecture splits position and direction deliberately. Density depends only on position -- whether a point in space is occupied or empty doesn't change based on your viewing angle. Color depends on both position and direction, because real surfaces exhibit view-dependent effects like specular reflections, glossy highlights, and transparency that change as you move around them. A shiny metal surface looks different from the left than from the right, but the surface itself is in the same location regardless.
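To see what the rendering weights actually do, here is a small standalone check in NumPy (a sketch, not the reference NeRF code). A single ray passes through empty space, hits a dense red region, and has a dense blue region behind that. The sample spacing and density values are made up for illustration:

```python
import numpy as np

def render_weights(densities, deltas):
    """Per-sample weights for one ray: alpha times the
    transmittance accumulated before that sample."""
    alpha = 1.0 - np.exp(-densities * deltas)
    trans = np.cumprod(1.0 - alpha + 1e-10)
    # Shift by one: nothing blocks the first sample
    trans = np.concatenate([[1.0], trans[:-1]])
    return alpha * trans

# 8 samples along one ray, spaced 0.5 units apart
deltas = np.full(8, 0.5)
densities = np.array([0.0, 0.0, 0.0, 20.0, 20.0, 0.0, 20.0, 20.0])
colors = np.zeros((8, 3))
colors[3:5] = [1.0, 0.0, 0.0]   # red surface, hit first
colors[6:8] = [0.0, 0.0, 1.0]   # blue surface, occluded

weights = render_weights(densities, deltas)
pixel = (weights[:, None] * colors).sum(axis=0)
print("weights:", np.round(weights, 4))
print("pixel color:", np.round(pixel, 4))
```

Nearly all the weight lands on the first red sample, so the pixel comes out red and the blue surface behind it contributes almost nothing. Occlusion falls out of the math without any explicit visibility test.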
Positional encoding: why coordinates need frequency lifting
The pos_dim=63 and dir_dim=27 in the code above are NOT the raw coordinate dimensions (which would be 3 and 2 respectively). They're the result of positional encoding -- mapping the raw coordinates through sinusoidal functions at multiple frequencies:
def positional_encoding(x, num_freqs=10):
    """Lift raw coordinates to higher
    dimensions using sinusoids. Same idea
    as transformer positional encoding
    (episode #52) applied to spatial coords."""
    encodings = [x]
    for i in range(num_freqs):
        freq = 2.0 ** i
        encodings.append(torch.sin(
            freq * x))
        encodings.append(torch.cos(
            freq * x))
    return torch.cat(encodings, dim=-1)

# 3D position -> 3 + 3*2*10 = 63 dims
pos = torch.randn(100, 3)
pos_enc = positional_encoding(pos, 10)
print(f"Position: {pos.shape} -> "
      f"encoded: {pos_enc.shape}")
# 2D direction -> 2 (for simplified)
# In practice: 3D unit vector with
# num_freqs=4 -> 3 + 3*2*4 = 27 dims
Why is this needed? MLPs have a strong bias toward learning smooth, low-frequency functions (this is called spectral bias). Raw coordinates like (0.312, 0.745, 2.001) vary slowly and smoothly across space, so the MLP naturally produces blurry output. By encoding coordinates into sinusoidal features at multiple frequencies -- including high frequencies like sin(512 * x) -- you give the network the ability to represent sharp edges, fine textures, and intricate details. Without positional encoding, NeRF produces blurry, oversmoothed reconstructions. With it, you get crisp, photorealistic renderings.
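A quick numerical way to see the effect (a sketch in NumPy rather than PyTorch, mirroring the encoding function above): two points that are only 0.005 apart per axis are nearly identical as raw inputs, but their encodings differ substantially, because the high-frequency sinusoids swing through a large part of a cycle over that distance.

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """NumPy version of the sinusoidal encoding:
    [x, sin(2^0 x), cos(2^0 x), ..., sin(2^9 x), cos(2^9 x)]."""
    out = [x]
    for i in range(num_freqs):
        freq = 2.0 ** i
        out.append(np.sin(freq * x))
        out.append(np.cos(freq * x))
    return np.concatenate(out, axis=-1)

a = np.array([0.312, 0.745, 2.001])
b = a + 0.005   # a very close neighbor

raw_gap = np.linalg.norm(a - b)
enc_gap = np.linalg.norm(
    positional_encoding(a) - positional_encoding(b))
print(f"Encoded dims: {positional_encoding(a).shape[0]}")
print(f"Raw distance:     {raw_gap:.4f}")
print(f"Encoded distance: {enc_gap:.4f}")
```

The MLP sees the encoded vectors, so neighboring points that need different colors (an edge, a fine texture) are no longer almost the same input.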
NeRF training is supervised: you have real photos from known camera positions, you render the scene from those viewpoints by casting rays and querying the network, and you minimize the pixel-wise MSE between rendered and real images. Once trained, rendering from any new viewpoint is possible -- the MLP has learned a continuous 3D representation of the entire scene.
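The "casting rays" step deserves a concrete picture. Below is a minimal sketch of per-pixel ray generation and sampling for a pinhole camera. It assumes an OpenCV-style camera frame (x right, y down, z forward) and a 4x4 camera-to-world pose matrix. The original NeRF code uses a different axis convention, so treat `get_rays`, `sample_along_rays`, and their arguments as illustrative:

```python
import numpy as np

def get_rays(h, w, focal, cam_to_world):
    """Ray origins and unit directions in world space, each (h, w, 3)."""
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    # Direction through each pixel center, in the camera frame
    dirs = np.stack(
        [(u - w / 2) / focal, (v - h / 2) / focal, np.ones_like(u)],
        axis=-1)
    # Rotate into world space; the translation is the shared origin
    rot, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs @ rot.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(origin, dirs_world.shape)
    return origins, dirs_world

def sample_along_rays(origins, dirs, near, far, n_samples):
    """Evenly spaced 3D sample points between near and far on every ray."""
    t = np.linspace(near, far, n_samples)
    return origins[..., None, :] + dirs[..., None, :] * t[:, None]

pose = np.eye(4)
pose[:3, 3] = [0.0, 0.0, -4.0]   # camera 4 units behind the origin
origins, dirs = get_rays(4, 6, focal=5.0, cam_to_world=pose)
pts = sample_along_rays(origins, dirs, near=2.0, far=6.0, n_samples=64)
print(f"origins: {origins.shape}, dirs: {dirs.shape}")
print(f"sample points: {pts.shape}")
```

Each sample point (after positional encoding) plus its ray direction is what gets fed to the network, and the rendered pixel is compared against the photo's pixel at the same (row, column).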
The major limitation: speed. Rendering one pixel requires sampling 64-256 points along a ray and running an MLP forward pass for each. A 1080p frame has ~2 million pixels. That's hundreds of millions of network evaluations per frame. Even on fast GPUs, rendering takes seconds to minutes per frame. Real-time rendering was the unsolved problem -- until Gaussian splatting came along.
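The arithmetic behind that estimate is easy to check yourself (the 64/128/256 sample counts are the range quoted above; everything else is plain multiplication):

```python
def evals_per_frame(width, height, samples_per_ray):
    """Network forward passes needed to render one frame."""
    return width * height * samples_per_ray

width, height = 1920, 1080
print(f"Pixels per 1080p frame: {width * height:,}")
for samples in (64, 128, 256):
    n = evals_per_frame(width, height, samples)
    print(f"  {samples:>3} samples/ray -> {n:,} MLP evaluations")
```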
Gaussian splatting: real-time 3D
3D Gaussian Splatting (Kerbl et al., 2023) takes a fundamentally different approach. Instead of representing the scene as a continuous function queried via ray marching (NeRF), it represents the scene as a collection of 3D Gaussians -- think of them as colored, translucent blobs floating in space. Each Gaussian has a position, a covariance matrix (controlling its shape and orientation), an opacity, and color represented as spherical harmonics for view-dependent effects:
import torch

class GaussianScene:
    """Conceptual 3D Gaussian splatting scene.
    Each Gaussian is a colored, translucent
    3D blob with learnable parameters."""

    def __init__(self, num_gaussians):
        # Position (XYZ center)
        self.positions = torch.randn(
            num_gaussians, 3)
        # Scale (size in each axis)
        self.scales = (
            torch.ones(num_gaussians, 3)
            * 0.01)
        # Rotation (quaternion)
        self.rotations = torch.zeros(
            num_gaussians, 4)
        self.rotations[:, 0] = 1.0
        # Opacity (logit-space)
        self.opacities = torch.zeros(
            num_gaussians, 1)
        # Color (spherical harmonic coeffs
        # for view-dependent appearance)
        self.sh_coeffs = torch.randn(
            num_gaussians, 48)

    def parameter_count(self):
        total = (3 + 3 + 4 + 1 + 48)
        return total * len(self.positions)

scene = GaussianScene(100_000)
print(f"Gaussians: {len(scene.positions):,}")
print(f"Params per Gaussian: 59")
print(f"Total params: "
      f"{scene.parameter_count():,}")
print(f"Memory (float32): "
      f"{scene.parameter_count() * 4 / 1e6:.1f}"
      f" MB")
The rendering approach is the key difference from NeRF. Instead of casting rays through the scene and querying a function at sampled points (ray marching), Gaussian splatting projects each Gaussian onto the camera's image plane (rasterization or splatting). For each Gaussian, you compute its 2D projection (an ellipse on screen), sort Gaussians by depth, and blend them front-to-back using alpha compositing. This is embarrassingly parallel and maps directly to GPU rasterization pipelines -- the same hardware that renders video games at 60+ FPS.
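For a single pixel, the sort-and-blend step boils down to the loop below. This is a sketch of front-to-back alpha compositing only: it assumes the per-Gaussian alpha at this pixel (opacity times the projected 2D Gaussian's falloff) has already been computed, and the early-exit threshold is an illustrative value.

```python
import numpy as np

def composite_pixel(depths, alphas, colors, min_transmittance=1e-4):
    """Blend the Gaussians covering one pixel, nearest first."""
    order = np.argsort(depths)
    pixel = np.zeros(3)
    transmittance = 1.0   # fraction of light still unblocked
    for i in order:
        pixel += transmittance * alphas[i] * colors[i]
        transmittance *= 1.0 - alphas[i]
        if transmittance < min_transmittance:
            break         # everything behind is effectively invisible
    return pixel, transmittance

# Three Gaussians overlapping one pixel, given in arbitrary order
depths = np.array([5.0, 1.0, 3.0])
alphas = np.array([0.9, 0.5, 0.5])
colors = np.array([
    [0.0, 0.0, 1.0],   # blue, farthest
    [1.0, 0.0, 0.0],   # red, nearest
    [0.0, 1.0, 0.0],   # green, in between
])
pixel, t_left = composite_pixel(depths, alphas, colors)
print("pixel color:", pixel)
print("remaining transmittance:", t_left)
```

The weighting (alpha times accumulated transmittance) is the same one used in NeRF's volume rendering. The difference is where the samples come from: sorted, projected Gaussians instead of points sampled along a ray.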
The training loop mirrors NeRF: start with a sparse point cloud from Structure-from-Motion (SfM -- a classical algorithm that reconstructs 3D points and camera poses from multiple images), initialize one Gaussian per point, render from known viewpoints, compare to real photos, backpropagate through the differentiable rasterizer to adjust Gaussian parameters. The system also adaptively splits large Gaussians that cover too much area (adding detail), clones small Gaussians in under-reconstructed regions, and prunes Gaussians with near-zero opacity (removing waste).
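Among the parameters being adjusted, scale and rotation are stored separately for a reason: together they define each Gaussian's covariance as Sigma = R S S^T R^T, which stays a valid (symmetric, positive semi-definite) covariance no matter how the optimizer updates them. A NumPy sketch, assuming quaternions in (w, x, y, z) order to match the GaussianScene code earlier, where the identity rotation sets the first component to 1:

```python
import numpy as np

def quat_to_rotmat(q):
    """Quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(scale, quat):
    """Sigma = (R S)(R S)^T with S = diag(scale)."""
    M = quat_to_rotmat(quat) @ np.diag(scale)
    return M @ M.T

# A flat, elongated Gaussian rotated 90 degrees about the z axis
scale = np.array([0.5, 0.1, 0.01])
quat = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
cov = covariance(scale, quat)
print(np.round(cov, 4))
```

After the rotation the long axis (variance 0.25) points along y instead of x. Optimizing the nine covariance entries directly could drift into matrices that are not valid covariances, which is what this factorization avoids.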
The result: 30+ FPS rendering at 1080p, compared to the seconds to minutes per frame of the original NeRF. Quality is comparable or sometimes better, and the explicit point-based representation is easier to manipulate than NeRF's implicit function -- you can delete Gaussians, move them, or merge scenes. This has made Gaussian splatting the practical choice for real applications: VR/AR environments, game asset creation, virtual tourism, and real estate walkthroughs.
Practical applications
3D vision is driving real products across multiple industries:
- Autonomous vehicles: LiDAR point clouds combined with camera depth estimation for 3D object detection and path planning. Tesla's "pure vision" approach uses monocular and stereo depth estimation from cameras to avoid the cost of LiDAR sensors altogether
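- Augmented reality: understanding room geometry to anchor virtual objects on real surfaces. Apple's ARKit and Google's ARCore both estimate scene depth on phones: ARCore's Depth API works from a single moving camera, while ARKit can additionally use the LiDAR scanner on supported devices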
- Robotics: grasping objects requires knowing their precise 3D shape and position in the robot's coordinate frame. Bin-picking systems in warehouses use depth cameras and point cloud processing
- Cultural preservation: scanning historical buildings, sculptures, and artifacts into digital 3D models. The Notre-Dame reconstruction effort after the 2019 fire relied heavily on 3D scanning and photogrammetry
- Real estate and mapping: Google Earth's 3D cities are built from aerial photogrammetry. Matterport creates 3D home walkthroughs. Luma AI lets you capture 3D scenes from phone video using Gaussian splatting
The trend is clearly toward fewer sensors and more computation -- replacing expensive LiDAR with monocular or stereo depth from cheap cameras, replacing professional 3D scanners with phone cameras plus neural reconstruction. The models keep getting better at extracting 3D understanding from 2D inputs, which makes the hardware requirements progressively cheaper.
In summary
- Monocular depth estimation predicts relative depth from a single image using learned priors (perspective cues, object size, texture gradients); models like MiDaS use DPT (ViT-based) architectures; stereo depth uses two cameras and pixel disparity for metric depth in actual meters;
- point clouds represent 3D data as unordered sets of (x, y, z) points; they can be produced by LiDAR, stereo cameras, or backprojection from depth maps using camera intrinsics;
- PointNet processes point clouds with shared MLPs applied to each point independently, followed by max-pooling for permutation invariance; PointNet++ extends this with hierarchical local neighborhoods;
- NeRF trains an MLP to map 3D coordinates and viewing direction to color and density, enabling photorealistic novel view synthesis from photographs; positional encoding with sinusoidal frequencies is critical for capturing high-frequency detail;
- volume rendering composes colors along camera rays by integrating density and color at sampled points; this is differentiable, allowing end-to-end training from 2D image supervision;
- 3D Gaussian Splatting represents scenes as collections of colored 3D Gaussians that are rasterized (splatted) onto the image plane, achieving real-time rendering (30+ FPS) with quality comparable to NeRF; adaptive splitting, cloning, and pruning concentrate detail where needed;
- the field is moving from expensive sensors (LiDAR, 3D scanners) toward neural reconstruction from commodity cameras -- phones can now capture 3D scenes that previously required specialized equipment.
We've now covered depth estimation, point cloud processing, and two approaches to 3D reconstruction from photographs. The computer vision section of this series has taken us from raw pixel operations through detection, segmentation, OCR, video, generative models, editing, and now 3D understanding. There's still more ground to cover in how machines interpret the visual world -- particularly around understanding human faces and applying vision to specialized scientific domains.
Exercises
Exercise 1: Build a stereo depth accuracy analyzer. Create a class StereoDepthAnalyzer that: (a) takes camera parameters (focal length in pixels, baseline in meters), (b) implements disparity_to_depth(disparity) and depth_to_disparity(depth) using the standard formula depth = f * B / disparity, (c) implements depth_error_from_disparity_error(true_depth, disparity_error_px) that computes how much depth error (in meters) results from N pixels of disparity error at a given true depth -- this shows the critical insight that depth accuracy degrades quadratically with distance, (d) for true depths [1, 2, 5, 10, 20, 50, 100] meters and a disparity error of 1 pixel, prints a table showing: true depth, true disparity, erroneous disparity (true +/- 1), resulting depth error in meters, and relative error as percentage. Use focal_length=1000px and baseline=0.12m. Verify that the depth error grows quadratically (roughly proportional to depth^2 / (f * B)) as long as the true disparity is much larger than the 1-pixel error -- a 1-pixel disparity error at 2m causes ~4x the depth error that the same disparity error causes at 1m.
Exercise 2: Build a point cloud statistics calculator. Create a class PointCloudStats that: (a) generates a synthetic point cloud representing a room: floor points at y=0, back wall at z=3, left wall at x=-2, right wall at x=2, plus a cube (side length 0.5m) centered at (0, 0.25, 1.5), each surface with 500 points (with small Gaussian noise sigma=0.02), (b) computes basic statistics: total point count, bounding box (min/max for each axis), centroid, (c) implements estimate_normals(points, k=20) that for each point finds its k nearest neighbors (using scipy.spatial.KDTree) and fits a plane to them via PCA (the normal is the eigenvector with the smallest eigenvalue of the covariance matrix of the neighbors), (d) classifies each point as "horizontal" (normal mostly aligned with Y axis, abs(ny) > 0.8) or "vertical" (abs(ny) < 0.3), (e) prints: total points, bounding box, percentage horizontal vs vertical, and the average normal vector for each category. Verify that floor points are classified as horizontal and wall points as vertical.
Exercise 3: Build a NeRF ray sampling analyzer. Create a class RayAnalyzer that: (a) implements cast_ray(origin, direction, near, far, num_samples) that generates num_samples evenly spaced sample points along a ray from near to far distance, returning the 3D coordinates and the delta (distance between consecutive samples), (b) implements stratified_sampling(origin, direction, near, far, num_samples) that divides the [near, far] range into num_samples equal bins and samples one random point within each bin -- this is the sampling strategy NeRF actually uses, reducing aliasing compared to uniform spacing, (c) for a camera at origin (0, 0, 0) looking along +Z with near=0.5 and far=5.0, generates rays for a 4x4 grid of pixels (using a simple pinhole camera with focal length 50 pixels and image center at (2, 2)), (d) for each ray, computes: the total ray length, the number of samples, the average delta between samples, and the total volume sampled (approximated as num_rays * avg_delta * sample_cross_section where cross_section = (far/focal_length)^2 per pixel), (e) prints a table comparing uniform vs stratified sampling for sample counts [8, 16, 32, 64, 128, 256]: for each count, show the average delta, the standard deviation of deltas (should be 0 for uniform, nonzero for stratified), and the expected rendering time relative to 64 samples (linear scaling). Verify that stratified sampling has nonzero delta variance (it's intentionally randomized) while uniform sampling has exactly zero.
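A hint for Exercise 1: the quadratic growth follows from differentiating the stereo formula. The notation here is an assumption made for this note (Z for depth, d for disparity, δ for a small error); the exercise itself only names f and B.

```latex
Z = \frac{fB}{d}
\quad\Rightarrow\quad
\frac{\partial Z}{\partial d} = -\frac{fB}{d^{2}} = -\frac{Z^{2}}{fB}
\quad\Rightarrow\quad
|\delta Z| \approx \frac{Z^{2}}{fB}\,|\delta d|
```

With f = 1000 px and B = 0.12 m, fB = 120, so a 1-pixel error gives roughly 1/120 ≈ 0.008 m at Z = 1 m and 4/120 ≈ 0.033 m at Z = 2 m. This is a first-order approximation: it is only accurate while the disparity error is small compared to the disparity itself, which stops being true at the far end of the depth list.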
Thanks for reading!
@scipio