Abstract

Orlume is a browser-native image processing system that combines deep learning-based scene understanding with real-time physically-based rendering. The system performs monocular depth estimation using Vision Transformer architectures, generates surface normals through gradient-based reconstruction, and applies deferred shading with GGX specular reflectance and horizon-based ambient occlusion (HBAO).

  • 100% Client-Side Processing
  • 60fps Real-time Rendering
  • 150 Semantic Classes
  • Zero Server Dependencies

Key Innovation: First fully browser-based implementation of neural 3D relighting that combines monocular depth estimation with physically-based rendering for interactive photo manipulation.

System Architecture

The Orlume processing pipeline implements a multi-stage architecture optimized for GPU parallelism:

┌─────────────────────────────────────────────────────────────────────┐
│                        INPUT PROCESSING                              │
│  ┌─────────┐    ┌──────────────┐    ┌─────────────┐                 │
│  │  Image  │───▶│ sRGB→Linear  │───▶│  Normalize  │                 │
│  │ Decode  │    │  Conversion  │    │   [0,1]     │                 │
│  └─────────┘    └──────────────┘    └─────────────┘                 │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    NEURAL INFERENCE (Parallel)                       │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐         │
│  │ Depth Anything │  │   SegFormer    │  │  MediaPipe FM  │         │
│  │     V2 ViT     │  │     B0-512     │  │   468 Points   │         │
│  │  (Depth Map)   │  │  (Materials)   │  │  (Face Mesh)   │         │
│  └───────┬────────┘  └───────┬────────┘  └───────┬────────┘         │
└──────────┼───────────────────┼───────────────────┼──────────────────┘
           │                   │                   │
           ▼                   ▼                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    GEOMETRY RECONSTRUCTION                           │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐         │
│  │ Scharr Kernel  │  │ Material Map   │  │ Face Normals   │         │
│  │  ∇D → Normal   │  │  RGBA Encode   │  │  Triangulation │         │
│  └───────┬────────┘  └───────┬────────┘  └───────┬────────┘         │
└──────────┼───────────────────┼───────────────────┼──────────────────┘
           │                   │                   │
           └───────────────────┴───────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    DEFERRED RENDERING (WebGL2/WebGPU)                │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  G-Buffer: Albedo | Normals | Depth | Materials | Position    │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                              │                                       │
│              ┌───────────────┼───────────────┐                      │
│              ▼               ▼               ▼                      │
│       ┌──────────┐    ┌──────────┐    ┌──────────┐                  │
│       │   HBAO   │    │ GGX BRDF │    │  Shadow  │                  │
│       │  8-dir   │    │ Specular │    │  Raymarch│                  │
│       └────┬─────┘    └────┬─────┘    └────┬─────┘                  │
│            └───────────────┼───────────────┘                        │
│                            ▼                                         │
│       ┌─────────────────────────────────────────────────────────┐   │
│       │  Final Composite: Diffuse + Specular + AO + Shadows     │   │
│       │  Tone Mapping: ACES Filmic | Exposure Compensation      │   │
│       └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
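The sRGB→Linear conversion in the input-processing stage above can be sketched as the standard piecewise transfer function (IEC 61966-2-1); the function names are illustrative, not Orlume's actual API:

```javascript
// Piecewise sRGB decode: linear segment below 0.04045, gamma 2.4 above.
// Input and output are normalized to [0, 1].
function srgbToLinear(c) {
    return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Inverse transform, applied again at output time after tone mapping.
function linearToSrgb(c) {
    return c <= 0.0031308 ? c * 12.92 : 1.055 * Math.pow(c, 1.0 / 2.4) - 0.055;
}
```

All lighting math downstream runs in this linear space; converting back only at display time keeps blending and BRDF evaluation physically meaningful.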
                            

Technical Capabilities

Subsystem | Technology | Performance
Depth Estimation | Depth Anything V2 (ViT-Small) | ~150ms @ 1080p
Semantic Segmentation | SegFormer B0 (ADE20K, 150 classes) | ~200ms @ 512×512
Face Mesh | MediaPipe (468 landmarks) | ~16ms per frame
PBR Shading | GGX + HBAO + Soft Shadows | 60fps @ 4K
Neural Upscaling | Real-ESRGAN / ESRGAN-thick | ~2s per 2× upscale

ML Monocular Depth Estimation

Orlume employs Depth Anything V2, a state-of-the-art monocular depth estimation model based on the Vision Transformer (ViT) architecture. The model processes single RGB images to produce dense, relative depth maps that serve as the foundation for 3D scene reconstruction.

Model Specification

Property | Value
Model ID | Xenova/depth-anything-small-hf
Architecture | Vision Transformer (ViT) encoder + CNN decoder
Input Resolution | Any (internally resized to 518×518)
Output | Single-channel depth map, normalized [0, 1]
Inference Backend | ONNX Runtime (WebGPU → WASM fallback)

Depth Processing Pipeline

// Depth estimation with edge-preserving bilateral smoothing.
// `pipeline` comes from Transformers.js; normalizeMinMax and
// bilateralFilter are Orlume post-processing helpers.
const depthEstimator = await pipeline('depth-estimation', 'Xenova/depth-anything-small-hf');
const { predicted_depth } = await depthEstimator(image);
const depthMap = normalizeMinMax(predicted_depth);   // rescale to [0, 1]
const smoothedDepth = bilateralFilter(depthMap, {
    spatialSigma: 9,
    rangeSigma: 0.1   // Small range sigma preserves depth edges
});
Reference: Yang, L. et al. "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." CVPR 2024. The model is trained on 62M unlabeled images using a self-training paradigm.

ML Semantic Segmentation

Material-aware rendering is achieved through SegFormer B0, a hierarchical Transformer encoder with lightweight MLP decoder. The model classifies each pixel into one of 150 semantic categories from the ADE20K dataset, which are then mapped to physically-based material properties.

Material Property Mapping

Semantic Class | Roughness | Metallic | Subsurface | Emissive
Person/Skin | 0.60 | 0.00 | 0.35 | 0.00
Metal/Car/Building | 0.30 | 0.95 | 0.00 | 0.00
Glass/Window | 0.02 | 0.00 | 0.00 | 0.00
Vegetation | 0.85 | 0.00 | 0.10 | 0.00
Sky | 1.00 | 0.00 | 0.00 | 1.00
Lamp/Light | 0.50 | 0.00 | 0.00 | 0.80

RGBA Material Encoding

R = Roughness × 255
G = Metallic × 255
B = Subsurface Scattering × 255
A = Emissive × 255
Reference: Xie, E. et al. "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers." NeurIPS 2021.
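The per-class properties above can be packed into the RGBA material map as follows; a minimal sketch using an illustrative subset of the table (the object names and structure are assumptions, not Orlume's internals):

```javascript
// Illustrative subset of the semantic-class → material table above.
const MATERIALS = {
    skin:       { roughness: 0.60, metallic: 0.00, subsurface: 0.35, emissive: 0.00 },
    metal:      { roughness: 0.30, metallic: 0.95, subsurface: 0.00, emissive: 0.00 },
    vegetation: { roughness: 0.85, metallic: 0.00, subsurface: 0.10, emissive: 0.00 },
};

// Pack one material into the RGBA byte layout of the material map.
function encodeMaterialRGBA({ roughness, metallic, subsurface, emissive }) {
    return Uint8Array.from([
        Math.round(roughness * 255),   // R
        Math.round(metallic * 255),    // G
        Math.round(subsurface * 255),  // B
        Math.round(emissive * 255),    // A
    ]);
}
```

At render time the fragment shader reverses this by reading the texture and dividing each channel by 255.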

ML Face Mesh Detection

For portrait images, MediaPipe Face Mesh provides 468 3D facial landmarks that are triangulated into a dense mesh. This enables accurate facial geometry reconstruction for realistic skin rendering with subsurface scattering.

Mesh Generation

  • 468 vertices — Sparse 3D landmark positions
  • ~900 triangles — Dense tessellation via Delaunay triangulation
  • Smooth normals — Area-weighted vertex normal averaging
  • Depth interpolation — Barycentric coordinates for dense depth map
// Per-vertex normal via area-weighted averaging of adjacent face normals.
// v0, v1, v2 are the vertices of the t-th triangle adjacent to vertexIdx.
vec3 computeSmoothNormal(int vertexIdx) {
    vec3 normal = vec3(0.0);
    for (int t = 0; t < adjacentTriangleCount; t++) {
        vec3 faceNormal = cross(v1 - v0, v2 - v0);  // length = 2 × triangle area
        float area = length(faceNormal) * 0.5;
        normal += normalize(faceNormal) * area;     // weight by triangle area
    }
    return normalize(normal);
}

ML Neural Image Upscaling

Super-resolution is powered by Real-ESRGAN with optional face enhancement via GFPGAN. The RRDB (Residual-in-Residual Dense Block) architecture reconstructs high-frequency details that are lost in traditional bicubic upscaling.

Architecture | Scale Factor | Use Case
Real-ESRGAN | ×2 | General photo enhancement
Real-ESRGAN + GFPGAN | ×4 | Portrait restoration
Reference: Wang, X. et al. "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data." ICCV 2021 Workshop.

SHADER Surface Normal Estimation

Surface normals are computed from the depth map using the Scharr operator, which has better rotational symmetry than the traditional Sobel kernel. A 9-tap Gaussian filter is then applied to suppress gradient noise and keep surfaces artifact-free.

Scharr Gradient Kernels

Gx (Horizontal):          Gy (Vertical):
┌────┬────┬────┐          ┌────┬─────┬────┐
│ -3 │  0 │ +3 │          │ -3 │ -10 │ -3 │
├────┼────┼────┤          ├────┼─────┼────┤
│-10 │  0 │+10 │          │  0 │   0 │  0 │
├────┼────┼────┤          ├────┼─────┼────┤
│ -3 │  0 │ +3 │          │ +3 │ +10 │ +3 │
└────┴────┴────┘          └────┴─────┴────┘
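The same kernels applied on the CPU to a row-major depth array, as a reference sketch (the production path runs this in a fragment shader):

```javascript
// Scharr kernels in row-major order, matching the diagrams above.
const SCHARR_X = [-3, 0, 3, -10, 0, 10, -3, 0, 3];
const SCHARR_Y = [-3, -10, -3, 0, 0, 0, 3, 10, 3];

// 3×3 convolution at (x, y) over a row-major depth array.
// Samples outside the image are clamped to the border.
function scharr(depth, width, height, x, y, kernel) {
    let sum = 0;
    for (let ky = -1; ky <= 1; ky++) {
        for (let kx = -1; kx <= 1; kx++) {
            const sx = Math.min(Math.max(x + kx, 0), width - 1);
            const sy = Math.min(Math.max(y + ky, 0), height - 1);
            sum += depth[sy * width + sx] * kernel[(ky + 1) * 3 + (kx + 1)];
        }
    }
    // Each kernel column sums to ±16 over a 2px span, so dividing by 32
    // yields exactly 1 for a depth ramp with unit slope per pixel.
    return sum / 32;
}
```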
                            

Normal Reconstruction

// Fragment shader: depth → normal (simplified forward-difference form;
// the production path uses the Scharr kernels above)
vec3 computeNormal(vec2 uv, sampler2D depthTex) {
    float d  = texture(depthTex, uv).r;
    float dx = texture(depthTex, uv + vec2(1.0, 0.0) / resolution).r - d;
    float dy = texture(depthTex, uv + vec2(0.0, 1.0) / resolution).r - d;

    // Treat depth gradients as surface slopes; normalStrength scales relief
    vec3 normal = normalize(vec3(-dx * normalStrength,
                                 -dy * normalStrength,
                                 1.0));
    return normal * 0.5 + 0.5; // Encode [-1,1] → [0,1] for storage
}

Gaussian Smoothing (9-tap)

N_smooth = (Σᵢⱼ wᵢⱼ Nᵢⱼ) / 16

Weights (3×3 binomial, nine taps summing to 16): [1 2 1; 2 4 2; 1 2 1]

SHADER Physically-Based Rendering

Orlume implements a full Cook-Torrance BRDF with GGX microfacet distribution, Fresnel-Schlick approximation, and Smith geometry term. This provides physically-accurate light interaction that responds correctly to material properties.

BRDF Components

GGX Normal Distribution Function (D)

D(m) = α² / (π × ((n·m)² × (α² - 1) + 1)²)

where α = roughness², m = half-vector, n = surface normal

Fresnel-Schlick Approximation (F)

F(v,h) = F₀ + (1 - F₀) × (1 - v·h)⁵

where F₀ ≈ 0.04 for dielectrics and roughly 0.5–1.0 for metals

Smith Geometry Function (G)

G(l,v,h) = G₁(l) × G₁(v)
G₁(x) = 2(n·x) / (n·x + √(α² + (1-α²)(n·x)²))

Final BRDF Integration

vec3 cookTorranceBRDF(vec3 N, vec3 V, vec3 L, vec3 albedo, 
                       float roughness, float metallic) {
    vec3 H = normalize(V + L);
    float NdotL = max(dot(N, L), 0.0);
    float NdotV = max(dot(N, V), 0.0);
    float NdotH = max(dot(N, H), 0.0);
    float VdotH = max(dot(V, H), 0.0);
    
    // GGX Distribution
    float alpha = roughness * roughness;
    float alpha2 = alpha * alpha;
    float denom = NdotH * NdotH * (alpha2 - 1.0) + 1.0;
    float D = alpha2 / (PI * denom * denom);
    
    // Fresnel
    vec3 F0 = mix(vec3(0.04), albedo, metallic);
    vec3 F = F0 + (1.0 - F0) * pow(1.0 - VdotH, 5.0);
    
    // Geometry: Schlick-GGX approximation of the Smith term, with k = α/2
    float k = alpha / 2.0;
    float G1L = NdotL / (NdotL * (1.0 - k) + k);
    float G1V = NdotV / (NdotV * (1.0 - k) + k);
    float G = G1L * G1V;
    
    // Specular term
    vec3 specular = (D * F * G) / (4.0 * NdotL * NdotV + 0.001);
    
    // Diffuse (Lambert)
    vec3 kD = (1.0 - F) * (1.0 - metallic);
    vec3 diffuse = kD * albedo / PI;
    
    return (diffuse + specular) * NdotL;
}

SHADER Horizon-Based Ambient Occlusion

HBAO (Horizon-Based Ambient Occlusion) provides realistic contact shadows and ambient darkening in crevices. The algorithm ray-marches in multiple directions to find the horizon angle at each pixel.

Algorithm Parameters

Parameter | Value | Description
Directions | 8 | Cardinal + diagonal directions
Steps per Direction | 8 | Ray-marching samples
Radius | 8px | Sample distance
Bias | 0.025 | Self-occlusion prevention

// Screen-space HBAO: for each of 8 directions, march outward and track
// the steepest horizon slope visible from the center pixel
float computeHBAO(vec2 uv, float centerDepth) {
    float occlusion = 0.0;
    for (int d = 0; d < 8; d++) {
        vec2 dir = directions[d];   // precomputed unit directions
        float maxHorizon = -1.0;
        
        for (int s = 1; s <= 8; s++) {
            vec2 sampleUV = uv + dir * float(s) * radius;
            float sampleDepth = texture(depthTex, sampleUV).r;
            float heightDiff = sampleDepth - centerDepth - bias;    // bias prevents self-occlusion
            float horizonSlope = heightDiff / (float(s) * radius);  // tangent of the horizon angle
            maxHorizon = max(maxHorizon, horizonSlope);
        }
        occlusion += clamp(maxHorizon, 0.0, 1.0);
    }
    return 1.0 - (occlusion / 8.0) * intensity; // intensity: user-facing AO strength
}
Reference: Bavoil, L., Sainz, M. "Image-space horizon-based ambient occlusion." SIGGRAPH 2008, Talk.

SHADER Soft Shadow Computation

Shadows are computed via screen-space ray marching from each fragment toward the light source. A novel anti-banding technique combines per-pixel dithering with Gaussian depth sampling.

Anti-Banding Techniques

  • Pseudo-random dithering — Per-pixel offset using hash function
  • 9-tap Gaussian blur — Smooth depth sampling at 3px radius
  • Gradient accumulation — Soft blocking instead of hard thresholds
  • 48 ray steps — Quadratic distribution (denser near fragment)
float hash(vec2 p) {
    return fract(sin(dot(p, vec2(127.1, 311.7))) * 43758.5453);
}

float calculateSoftShadow(vec2 uv, vec2 lightPos) {
    float dither = hash(uv * resolution) * 0.5;
    vec2 rayDir = normalize(lightPos - uv);
    float shadow = 0.0;
    
    for (int i = 0; i < 48; i++) {
        float t = (float(i) + dither) / 48.0;
        t = t * t; // Quadratic distribution
        vec2 samplePos = mix(uv, lightPos, t);
        
        // 9-tap Gaussian depth sample
        float depth = sampleDepthSmooth(samplePos);
        float heightDiff = depth - texture(depthTex, uv).r;
        
        shadow += smoothstep(0.0, 0.05, heightDiff);
    }
    return 1.0 - (shadow / 48.0);
}

SHADER Volumetric God Rays

Atmospheric light scattering is simulated through radial blur from the light source position. The effect includes chromatic aberration, bloom, and depth-aware masking for realistic results.

Effect Parameters

Parameter | Range | Description
Intensity | 0.0 - 2.0 | Ray brightness
Decay | 0.90 - 1.0 | Falloff per sample
Samples | 32 - 128 | Ray-marching iterations
Chromatic | 0.0 - 0.1 | RGB channel separation
Scatter | 0.0 - 1.0 | Atmospheric scattering
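The core radial accumulation can be sketched on the CPU with the parameters from the table (`sampleScene` is an assumed per-sample brightness lookup; chromatic aberration and depth masking are omitted here):

```javascript
// Accumulate brightness along the ray from `uv` toward `lightPos`,
// attenuating each successive sample by `decay`.
function godRay(sampleScene, uv, lightPos, { samples = 64, decay = 0.96, intensity = 1.0 } = {}) {
    let illumination = 0;
    let weight = 1;
    for (let i = 0; i < samples; i++) {
        const t = i / samples; // march parameter in [0, 1)
        const x = uv[0] + (lightPos[0] - uv[0]) * t;
        const y = uv[1] + (lightPos[1] - uv[1]) * t;
        illumination += sampleScene(x, y) * weight;
        weight *= decay; // exponential falloff per sample
    }
    return (illumination / samples) * intensity;
}
```

With decay = 1.0 the result is a plain radial average; lowering decay concentrates the contribution near the fragment, which is what produces the streaked ray look.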

GPU WebGPU Backend

The primary rendering backend leverages WebGPU for modern GPU acceleration with WGSL shaders. This provides significant performance improvements over WebGL2, especially for compute-heavy operations.

Feature | WebGPU | WebGL2
Shader Language | WGSL | GLSL ES 3.00
Compute Shaders | Yes | No
Bind Groups | Yes | No
Multi-threaded | Yes | Limited

GPU WebGL2 Fallback

For browsers without WebGPU support, a full WebGL2 backend provides identical visual results with GLSL ES 3.00 shaders. Automatic detection ensures the best available backend is selected.

Required Extensions

  • EXT_color_buffer_float — Floating-point render targets
  • OES_texture_float_linear — Linear filtering for float textures
  • EXT_float_blend — Blending with float framebuffers
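Backend selection can be factored as a pure decision function over detected capabilities; a sketch (the capability object and function name are illustrative, not Orlume's actual API):

```javascript
// Decide the rendering backend from detected capabilities.
// `caps` mirrors what feature detection reports at startup.
function selectBackend(caps) {
    if (caps.webgpu) return 'webgpu';
    const required = ['EXT_color_buffer_float', 'OES_texture_float_linear', 'EXT_float_blend'];
    if (caps.webgl2 && required.every(ext => caps.extensions.includes(ext))) {
        return 'webgl2';
    }
    return 'unsupported';
}

// In the browser, `caps` would be built roughly like:
//   const gl = document.createElement('canvas').getContext('webgl2');
//   const caps = {
//       webgpu: 'gpu' in navigator,
//       webgl2: gl !== null,
//       extensions: gl ? gl.getSupportedExtensions() : [],
//   };
```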

GPU GLSL Shader Architecture

The shader pipeline processes images through multiple passes:

Pass | Shader | Output
1 | Develop (Exposure, WB) | Linear RGB
2 | HSL Color Mixer | Color-adjusted RGB
3 | Normal Generation | Normal map (RGB)
4 | HBAO | AO mask (R)
5 | PBR Lighting | Lit RGB
6 | Shadow Pass | Shadow mask (R)
7 | Composite + Tone Map | Final sRGB

Color Grading Pipeline

Color processing follows a strict order to maintain predictable results:

  1. Exposure — EV stops (-5 to +5)
  2. White Balance — Temperature (2000K-12000K) + Tint
  3. Contrast — S-curve with midpoint preservation
  4. Highlights/Shadows — Luminance-selective adjustment
  5. Whites/Blacks — Endpoint clipping control
  6. HSL — Per-channel hue/saturation/luminance
  7. Vibrance — Saturation-aware saturation boost
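The first and third steps can be sketched as per-channel operations on linear RGB (a simplified linear pivot stands in for the full S-curve; the function names are illustrative):

```javascript
// Step 1: exposure in EV stops — each stop doubles linear intensity.
function applyExposure(linear, ev) {
    return linear * Math.pow(2, ev);
}

// Step 3: contrast pivoting around middle gray (0.18 in linear light),
// so the midpoint is preserved as the ordering above requires.
function applyContrast(linear, contrast, midpoint = 0.18) {
    return (linear - midpoint) * contrast + midpoint;
}
```

Running exposure before contrast matters: the pivot only lands on perceptual middle gray once the image has been brought to its intended brightness.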

HSL Color Mixer

The 8-channel HSL mixer provides independent control over specific color ranges, implemented in a single-pass GPU shader:

Channel | Hue Range | Center Hue
Red | 330° - 30° | 0°
Orange | 15° - 45° | 30°
Yellow | 45° - 75° | 60°
Green | 75° - 165° | 120°
Aqua | 165° - 195° | 180°
Blue | 195° - 255° | 225°
Purple | 255° - 285° | 270°
Magenta | 285° - 330° | 310°
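Channel weights fall off with circular hue distance from each center hue; a sketch with an assumed linear falloff (the shader's actual falloff shape and width may differ):

```javascript
// Circular distance between two hues in degrees, in [0, 180].
function hueDistance(h1, h2) {
    const d = Math.abs(h1 - h2) % 360;
    return d > 180 ? 360 - d : d;
}

// Weight of a pixel's hue for a mixer channel centered at `center`,
// fading linearly to zero at `width` degrees from the center.
function channelWeight(hue, center, width = 30) {
    const d = hueDistance(hue, center);
    return Math.max(0, 1 - d / width);
}
```

The wrap-around distance is what lets the Red channel (centered at 0°) pick up hues on both sides of the 360°/0° seam.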

Tone Mapping

Final output uses ACES Filmic tone mapping for cinematic highlight roll-off:

vec3 ACESFilm(vec3 x) {
    float a = 2.51;
    float b = 0.03;
    float c = 2.43;
    float d = 0.59;
    float e = 0.14;
    return clamp((x * (a * x + b)) / (x * (c * x + d) + e), 0.0, 1.0);
}
Reference: Academy Color Encoding System (ACES). Academy of Motion Picture Arts and Sciences, 2014.

Academic References

  • Yang, L. et al. "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." CVPR 2024.
  • Xie, E. et al. "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers." NeurIPS 2021.
  • Wang, X. et al. "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data." ICCV 2021 Workshop.
  • Bavoil, L., Sainz, M. "Image-space horizon-based ambient occlusion." SIGGRAPH 2008.
  • Walter, B. et al. "Microfacet Models for Refraction through Rough Surfaces." EGSR 2007.
  • Karis, B. "Real Shading in Unreal Engine 4." SIGGRAPH 2013 Course.

Technical Dependencies

Library | Version | Purpose
Transformers.js | 2.17+ | ML model inference (ONNX Runtime)
Three.js | 0.160+ | 3D mesh rendering
MediaPipe | 0.10+ | Face mesh detection
ONNX Runtime Web | 1.17+ | Neural network execution

Browser Compatibility

Browser | WebGPU | WebGL2
Chrome | 113+ | 90+
Firefox | 120+ | 90+
Safari | 17+ | 15+
Edge | 113+ | 90+