Abstract

Orlume is a browser-native image processing system that combines deep learning-based scene understanding with real-time physically-based rendering. The system performs monocular depth estimation using Vision Transformer architectures, generates surface normals through gradient-based reconstruction, and applies deferred shading with GGX specular reflectance and horizon-based ambient occlusion (HBAO).

  • 100% Client-Side Processing
  • 60fps Real-time Rendering
  • 150 Semantic Classes
  • Zero Server Dependencies

Key Innovation: First fully browser-based implementation of neural 3D relighting that combines monocular depth estimation with physically-based rendering for interactive photo manipulation.

System Architecture

The Orlume processing pipeline implements a multi-stage architecture optimized for GPU parallelism:

┌─────────────────────────────────────────────────────────────────────┐
│                        INPUT PROCESSING                              │
│  ┌─────────┐    ┌──────────────┐    ┌─────────────┐                 │
│  │  Image  │───▶│ sRGB→Linear  │───▶│  Normalize  │                 │
│  │ Decode  │    │  Conversion  │    │   [0,1]     │                 │
│  └─────────┘    └──────────────┘    └─────────────┘                 │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    NEURAL INFERENCE (Parallel)                       │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐         │
│  │ Depth Anything │  │   SegFormer    │  │  MediaPipe FM  │         │
│  │     V2 ViT     │  │     B0-512     │  │   468 Points   │         │
│  │  (Depth Map)   │  │  (Materials)   │  │  (Face Mesh)   │         │
│  └───────┬────────┘  └───────┬────────┘  └───────┬────────┘         │
└──────────┼───────────────────┼───────────────────┼──────────────────┘
           │                   │                   │
           ▼                   ▼                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    GEOMETRY RECONSTRUCTION                           │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐         │
│  │ Scharr Kernel  │  │ Material Map   │  │ Face Normals   │         │
│  │  ∇D → Normal   │  │  RGBA Encode   │  │  Triangulation │         │
│  └───────┬────────┘  └───────┬────────┘  └───────┬────────┘         │
└──────────┼───────────────────┼───────────────────┼──────────────────┘
           │                   │                   │
           └───────────────────┴───────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    DEFERRED RENDERING (WebGL2/WebGPU)                │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  G-Buffer: Albedo | Normals | Depth | Materials | Position    │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                              │                                       │
│              ┌───────────────┼───────────────┐                      │
│              ▼               ▼               ▼                      │
│       ┌──────────┐    ┌──────────┐    ┌──────────┐                  │
│       │   HBAO   │    │ GGX BRDF │    │  Shadow  │                  │
│       │  8-dir   │    │ Specular │    │  Raymarch│                  │
│       └────┬─────┘    └────┬─────┘    └────┬─────┘                  │
│            └───────────────┼───────────────┘                        │
│                            ▼                                         │
│       ┌─────────────────────────────────────────────────────────┐   │
│       │  Final Composite: Diffuse + Specular + AO + Shadows     │   │
│       │  Tone Mapping: ACES Filmic | Exposure Compensation      │   │
│       └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
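The sRGB→Linear conversion in the input-processing stage above can be sketched as the standard piecewise transfer function (IEC 61966-2-1); the function names are illustrative, not Orlume's actual API:

```javascript
// Piecewise sRGB decode: linear segment below 0.04045, gamma 2.4 above.
// Input and output are normalized to [0, 1].
function srgbToLinear(c) {
    return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Inverse transform, applied again at output time after tone mapping.
function linearToSrgb(c) {
    return c <= 0.0031308 ? c * 12.92 : 1.055 * Math.pow(c, 1.0 / 2.4) - 0.055;
}
```

All lighting math downstream runs in this linear space; converting back only at display time keeps blending and BRDF evaluation physically meaningful.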
                            

Technical Capabilities

Subsystem | Technology | Performance
Depth Estimation | Depth Anything V2 (ViT-Small) | ~150ms @ 1080p
Semantic Segmentation | SegFormer B0 (ADE20K, 150 classes) | ~200ms @ 512×512
Face Mesh | MediaPipe (468 landmarks) | ~16ms per frame
PBR Shading | GGX + HBAO + Soft Shadows | 60fps @ 4K
Neural Upscaling | Real-ESRGAN / ESRGAN-thick | ~2s per 2× upscale

ML Monocular Depth Estimation

Orlume employs Depth Anything V2, a state-of-the-art monocular depth estimation model based on the Vision Transformer (ViT) architecture. The model processes single RGB images to produce dense, relative depth maps that serve as the foundation for 3D scene reconstruction.

Model Specification

Property | Value
Model ID | Xenova/depth-anything-small-hf
Architecture | Vision Transformer (ViT) encoder + CNN decoder
Input Resolution | Any (internally resized to 518×518)
Output | Single-channel depth map, normalized [0, 1]
Inference Backend | ONNX Runtime (WebGPU → WASM fallback)

Depth Processing Pipeline

// Depth estimation with edge-preserving bilateral smoothing.
// `pipeline` comes from Transformers.js; normalizeMinMax and
// bilateralFilter are Orlume post-processing helpers.
const depthEstimator = await pipeline('depth-estimation', 'Xenova/depth-anything-small-hf');
const { predicted_depth } = await depthEstimator(image);
const depthMap = normalizeMinMax(predicted_depth);   // rescale to [0, 1]
const smoothedDepth = bilateralFilter(depthMap, {
    spatialSigma: 9,
    rangeSigma: 0.1   // Small range sigma preserves depth edges
});
Reference: Yang, L. et al. "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." CVPR 2024. The model is trained on 62M unlabeled images using a self-training paradigm.

ML Semantic Segmentation

Material-aware rendering is achieved through SegFormer B0, a hierarchical Transformer encoder with lightweight MLP decoder. The model classifies each pixel into one of 150 semantic categories from the ADE20K dataset, which are then mapped to physically-based material properties.

Material Property Mapping

Semantic Class | Roughness | Metallic | Subsurface | Emissive
Person/Skin | 0.60 | 0.00 | 0.35 | 0.00
Metal/Car/Building | 0.30 | 0.95 | 0.00 | 0.00
Glass/Window | 0.02 | 0.00 | 0.00 | 0.00
Vegetation | 0.85 | 0.00 | 0.10 | 0.00
Sky | 1.00 | 0.00 | 0.00 | 1.00
Lamp/Light | 0.50 | 0.00 | 0.00 | 0.80

RGBA Material Encoding

R = Roughness × 255
G = Metallic × 255
B = Subsurface Scattering × 255
A = Emissive × 255
Reference: Xie, E. et al. "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers." NeurIPS 2021.
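The per-class properties above can be packed into the RGBA material map as follows; a minimal sketch using an illustrative subset of the table (the object names and structure are assumptions, not Orlume's internals):

```javascript
// Illustrative subset of the semantic-class → material table above.
const MATERIALS = {
    skin:       { roughness: 0.60, metallic: 0.00, subsurface: 0.35, emissive: 0.00 },
    metal:      { roughness: 0.30, metallic: 0.95, subsurface: 0.00, emissive: 0.00 },
    vegetation: { roughness: 0.85, metallic: 0.00, subsurface: 0.10, emissive: 0.00 },
};

// Pack one material into the RGBA byte layout of the material map.
function encodeMaterialRGBA({ roughness, metallic, subsurface, emissive }) {
    return Uint8Array.from([
        Math.round(roughness * 255),   // R
        Math.round(metallic * 255),    // G
        Math.round(subsurface * 255),  // B
        Math.round(emissive * 255),    // A
    ]);
}
```

At render time the fragment shader reverses this by reading the texture and dividing each channel by 255.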

ML Face Mesh Detection

For portrait images, MediaPipe Face Mesh provides 468 3D facial landmarks that are triangulated into a dense mesh. This enables accurate facial geometry reconstruction for realistic skin rendering with subsurface scattering.

Mesh Generation

  • 468 vertices — Sparse 3D landmark positions
  • ~900 triangles — Dense tessellation via Delaunay triangulation
  • Smooth normals — Area-weighted vertex normal averaging
  • Depth interpolation — Barycentric coordinates for dense depth map
// Per-vertex normal via area-weighted averaging of adjacent face normals.
// v0, v1, v2 are the vertices of the t-th triangle adjacent to vertexIdx.
vec3 computeSmoothNormal(int vertexIdx) {
    vec3 normal = vec3(0.0);
    for (int t = 0; t < adjacentTriangleCount; t++) {
        vec3 faceNormal = cross(v1 - v0, v2 - v0);  // length = 2 × triangle area
        float area = length(faceNormal) * 0.5;
        normal += normalize(faceNormal) * area;     // weight by triangle area
    }
    return normalize(normal);
}

ML Neural Image Upscaling

Super-resolution is powered by Real-ESRGAN with optional face enhancement via GFPGAN. The RRDB (Residual-in-Residual Dense Block) architecture reconstructs high-frequency details that are lost in traditional bicubic upscaling.

Architecture | Scale Factor | Use Case
Real-ESRGAN | ×2 | General photo enhancement
Real-ESRGAN + GFPGAN | ×4 | Portrait restoration
Reference: Wang, X. et al. "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data." ICCV 2021 Workshop.

SHADER Surface Normal Estimation

Surface normals are computed from the depth map using the Scharr operator, which has better rotational symmetry than the traditional Sobel kernel. A 9-tap Gaussian filter is then applied to suppress gradient noise and keep surfaces artifact-free.

Scharr Gradient Kernels

Gx (Horizontal):          Gy (Vertical):
┌────┬────┬────┐          ┌────┬─────┬────┐
│ -3 │  0 │ +3 │          │ -3 │ -10 │ -3 │
├────┼────┼────┤          ├────┼─────┼────┤
│-10 │  0 │+10 │          │  0 │   0 │  0 │
├────┼────┼────┤          ├────┼─────┼────┤
│ -3 │  0 │ +3 │          │ +3 │ +10 │ +3 │
└────┴────┴────┘          └────┴─────┴────┘
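The same kernels applied on the CPU to a row-major depth array, as a reference sketch (the production path runs this in a fragment shader):

```javascript
// Scharr kernels in row-major order, matching the diagrams above.
const SCHARR_X = [-3, 0, 3, -10, 0, 10, -3, 0, 3];
const SCHARR_Y = [-3, -10, -3, 0, 0, 0, 3, 10, 3];

// 3×3 convolution at (x, y) over a row-major depth array.
// Samples outside the image are clamped to the border.
function scharr(depth, width, height, x, y, kernel) {
    let sum = 0;
    for (let ky = -1; ky <= 1; ky++) {
        for (let kx = -1; kx <= 1; kx++) {
            const sx = Math.min(Math.max(x + kx, 0), width - 1);
            const sy = Math.min(Math.max(y + ky, 0), height - 1);
            sum += depth[sy * width + sx] * kernel[(ky + 1) * 3 + (kx + 1)];
        }
    }
    // Each kernel column sums to ±16 over a 2px span, so dividing by 32
    // yields exactly 1 for a depth ramp with unit slope per pixel.
    return sum / 32;
}
```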
                            

Normal Reconstruction

// Fragment shader: depth → normal (simplified forward-difference form;
// the production path uses the Scharr kernels above)
vec3 computeNormal(vec2 uv, sampler2D depthTex) {
    float d  = texture(depthTex, uv).r;
    float dx = texture(depthTex, uv + vec2(1.0, 0.0) / resolution).r - d;
    float dy = texture(depthTex, uv + vec2(0.0, 1.0) / resolution).r - d;

    // Treat depth gradients as surface slopes; normalStrength scales relief
    vec3 normal = normalize(vec3(-dx * normalStrength,
                                 -dy * normalStrength,
                                 1.0));
    return normal * 0.5 + 0.5; // Encode [-1,1] → [0,1] for storage
}

Gaussian Smoothing (9-tap)

N_smooth = (Σᵢⱼ wᵢⱼ Nᵢⱼ) / 16

Weights (3×3 binomial, nine taps summing to 16): [1 2 1; 2 4 2; 1 2 1]

SHADER Physically-Based Rendering

Orlume implements a full Cook-Torrance BRDF with GGX microfacet distribution, Fresnel-Schlick approximation, and Smith geometry term. This provides physically-accurate light interaction that responds correctly to material properties.

BRDF Components

GGX Normal Distribution Function (D)

D(m) = α² / (π × ((n·m)² × (α² - 1) + 1)²)

where α = roughness², m = half-vector, n = surface normal

Fresnel-Schlick Approximation (F)

F(v,h) = F₀ + (1 - F₀) × (1 - v·h)⁵

where F₀ ≈ 0.04 for dielectrics and roughly 0.5–1.0 for metals

Smith Geometry Function (G)

G(l,v,h) = G₁(l) × G₁(v)
G₁(x) = 2(n·x) / (n·x + √(α² + (1-α²)(n·x)²))

Final BRDF Integration

vec3 cookTorranceBRDF(vec3 N, vec3 V, vec3 L, vec3 albedo, 
                       float roughness, float metallic) {
    vec3 H = normalize(V + L);
    float NdotL = max(dot(N, L), 0.0);
    float NdotV = max(dot(N, V), 0.0);
    float NdotH = max(dot(N, H), 0.0);
    float VdotH = max(dot(V, H), 0.0);
    
    // GGX Distribution
    float alpha = roughness * roughness;
    float alpha2 = alpha * alpha;
    float denom = NdotH * NdotH * (alpha2 - 1.0) + 1.0;
    float D = alpha2 / (PI * denom * denom);
    
    // Fresnel
    vec3 F0 = mix(vec3(0.04), albedo, metallic);
    vec3 F = F0 + (1.0 - F0) * pow(1.0 - VdotH, 5.0);
    
    // Geometry: Schlick-GGX approximation of the Smith term, with k = α/2
    float k = alpha / 2.0;
    float G1L = NdotL / (NdotL * (1.0 - k) + k);
    float G1V = NdotV / (NdotV * (1.0 - k) + k);
    float G = G1L * G1V;
    
    // Specular term
    vec3 specular = (D * F * G) / (4.0 * NdotL * NdotV + 0.001);
    
    // Diffuse (Lambert)
    vec3 kD = (1.0 - F) * (1.0 - metallic);
    vec3 diffuse = kD * albedo / PI;
    
    return (diffuse + specular) * NdotL;
}

SHADER Horizon-Based Ambient Occlusion

HBAO (Horizon-Based Ambient Occlusion) provides realistic contact shadows and ambient darkening in crevices. The algorithm ray-marches in multiple directions to find the horizon angle at each pixel.

Algorithm Parameters

Parameter | Value | Description
Directions | 8 | Cardinal + diagonal directions
Steps per Direction | 8 | Ray-marching samples
Radius | 8px | Sample distance
Bias | 0.025 | Self-occlusion prevention

// Screen-space HBAO: for each of 8 directions, march outward and track
// the steepest horizon slope visible from the center pixel
float computeHBAO(vec2 uv, float centerDepth) {
    float occlusion = 0.0;
    for (int d = 0; d < 8; d++) {
        vec2 dir = directions[d];   // precomputed unit directions
        float maxHorizon = -1.0;
        
        for (int s = 1; s <= 8; s++) {
            vec2 sampleUV = uv + dir * float(s) * radius;
            float sampleDepth = texture(depthTex, sampleUV).r;
            float heightDiff = sampleDepth - centerDepth - bias;    // bias prevents self-occlusion
            float horizonSlope = heightDiff / (float(s) * radius);  // tangent of the horizon angle
            maxHorizon = max(maxHorizon, horizonSlope);
        }
        occlusion += clamp(maxHorizon, 0.0, 1.0);
    }
    return 1.0 - (occlusion / 8.0) * intensity; // intensity: user-facing AO strength
}
Reference: Bavoil, L., Sainz, M. "Image-space horizon-based ambient occlusion." SIGGRAPH 2008, Talk.

SHADER Soft Shadow Computation

Shadows are computed via screen-space ray marching from each fragment toward the light source. A novel anti-banding technique combines per-pixel dithering with Gaussian depth sampling.

Anti-Banding Techniques

  • Pseudo-random dithering — Per-pixel offset using hash function
  • 9-tap Gaussian blur — Smooth depth sampling at 3px radius
  • Gradient accumulation — Soft blocking instead of hard thresholds
  • 48 ray steps — Quadratic distribution (denser near fragment)
float hash(vec2 p) {
    return fract(sin(dot(p, vec2(127.1, 311.7))) * 43758.5453);
}

float calculateSoftShadow(vec2 uv, vec2 lightPos) {
    float dither = hash(uv * resolution) * 0.5;
    vec2 rayDir = normalize(lightPos - uv);
    float shadow = 0.0;
    
    for (int i = 0; i < 48; i++) {
        float t = (float(i) + dither) / 48.0;
        t = t * t; // Quadratic distribution
        vec2 samplePos = mix(uv, lightPos, t);
        
        // 9-tap Gaussian depth sample
        float depth = sampleDepthSmooth(samplePos);
        float heightDiff = depth - texture(depthTex, uv).r;
        
        shadow += smoothstep(0.0, 0.05, heightDiff);
    }
    return 1.0 - (shadow / 48.0);
}

SHADER Volumetric God Rays

Atmospheric light scattering is simulated through radial blur from the light source position. The effect includes chromatic aberration, bloom, and depth-aware masking for realistic results.

Effect Parameters

Parameter | Range | Description
Intensity | 0.0 - 2.0 | Ray brightness
Decay | 0.90 - 1.0 | Falloff per sample
Samples | 32 - 128 | Ray-marching iterations
Chromatic | 0.0 - 0.1 | RGB channel separation
Scatter | 0.0 - 1.0 | Atmospheric scattering
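The core radial accumulation can be sketched on the CPU with the parameters from the table (`sampleScene` is an assumed per-sample brightness lookup; chromatic aberration and depth masking are omitted here):

```javascript
// Accumulate brightness along the ray from `uv` toward `lightPos`,
// attenuating each successive sample by `decay`.
function godRay(sampleScene, uv, lightPos, { samples = 64, decay = 0.96, intensity = 1.0 } = {}) {
    let illumination = 0;
    let weight = 1;
    for (let i = 0; i < samples; i++) {
        const t = i / samples; // march parameter in [0, 1)
        const x = uv[0] + (lightPos[0] - uv[0]) * t;
        const y = uv[1] + (lightPos[1] - uv[1]) * t;
        illumination += sampleScene(x, y) * weight;
        weight *= decay; // exponential falloff per sample
    }
    return (illumination / samples) * intensity;
}
```

With decay = 1.0 the result is a plain radial average; lowering decay concentrates the contribution near the fragment, which is what produces the streaked ray look.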

GPU WebGPU Backend

The primary rendering backend leverages WebGPU for modern GPU acceleration with WGSL shaders. This provides significant performance improvements over WebGL2, especially for compute-heavy operations.

Feature | WebGPU | WebGL2
Shader Language | WGSL | GLSL ES 3.00
Compute Shaders | Yes | No
Bind Groups | Yes | No
Multi-threaded | Yes | Limited

GPU WebGL2 Fallback

For browsers without WebGPU support, a full WebGL2 backend provides identical visual results with GLSL ES 3.00 shaders. Automatic detection ensures the best available backend is selected.

Required Extensions

  • EXT_color_buffer_float — Floating-point render targets
  • OES_texture_float_linear — Linear filtering for float textures
  • EXT_float_blend — Blending with float framebuffers
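Backend selection can be factored as a pure decision function over detected capabilities; a sketch (the capability object and function name are illustrative, not Orlume's actual API):

```javascript
// Decide the rendering backend from detected capabilities.
// `caps` mirrors what feature detection reports at startup.
function selectBackend(caps) {
    if (caps.webgpu) return 'webgpu';
    const required = ['EXT_color_buffer_float', 'OES_texture_float_linear', 'EXT_float_blend'];
    if (caps.webgl2 && required.every(ext => caps.extensions.includes(ext))) {
        return 'webgl2';
    }
    return 'unsupported';
}

// In the browser, `caps` would be built roughly like:
//   const gl = document.createElement('canvas').getContext('webgl2');
//   const caps = {
//       webgpu: 'gpu' in navigator,
//       webgl2: gl !== null,
//       extensions: gl ? gl.getSupportedExtensions() : [],
//   };
```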

GPU GLSL Shader Architecture

The shader pipeline processes images through multiple passes:

Pass | Shader | Output
1 | Develop (Exposure, WB) | Linear RGB
2 | HSL Color Mixer | Color-adjusted RGB
3 | Normal Generation | Normal map (RGB)
4 | HBAO | AO mask (R)
5 | PBR Lighting | Lit RGB
6 | Shadow Pass | Shadow mask (R)
7 | Composite + Tone Map | Final sRGB

Color Grading Pipeline

Color processing follows a strict order to maintain predictable results:

  1. Exposure — EV stops (-5 to +5)
  2. White Balance — Temperature (2000K-12000K) + Tint
  3. Contrast — S-curve with midpoint preservation
  4. Highlights/Shadows — Luminance-selective adjustment
  5. Whites/Blacks — Endpoint clipping control
  6. HSL — Per-channel hue/saturation/luminance
  7. Vibrance — Saturation-aware saturation boost
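The first and third steps can be sketched as per-channel operations on linear RGB (a simplified linear pivot stands in for the full S-curve; the function names are illustrative):

```javascript
// Step 1: exposure in EV stops — each stop doubles linear intensity.
function applyExposure(linear, ev) {
    return linear * Math.pow(2, ev);
}

// Step 3: contrast pivoting around middle gray (0.18 in linear light),
// so the midpoint is preserved as the ordering above requires.
function applyContrast(linear, contrast, midpoint = 0.18) {
    return (linear - midpoint) * contrast + midpoint;
}
```

Running exposure before contrast matters: the pivot only lands on perceptual middle gray once the image has been brought to its intended brightness.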

HSL Color Mixer

The 8-channel HSL mixer provides independent control over specific color ranges, implemented in a single-pass GPU shader:

Channel | Hue Range | Center Hue
Red | 330° - 30° | 0°
Orange | 15° - 45° | 30°
Yellow | 45° - 75° | 60°
Green | 75° - 165° | 120°
Aqua | 165° - 195° | 180°
Blue | 195° - 255° | 225°
Purple | 255° - 285° | 270°
Magenta | 285° - 330° | 310°
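Channel weights fall off with circular hue distance from each center hue; a sketch with an assumed linear falloff (the shader's actual falloff shape and width may differ):

```javascript
// Circular distance between two hues in degrees, in [0, 180].
function hueDistance(h1, h2) {
    const d = Math.abs(h1 - h2) % 360;
    return d > 180 ? 360 - d : d;
}

// Weight of a pixel's hue for a mixer channel centered at `center`,
// fading linearly to zero at `width` degrees from the center.
function channelWeight(hue, center, width = 30) {
    const d = hueDistance(hue, center);
    return Math.max(0, 1 - d / width);
}
```

The wrap-around distance is what lets the Red channel (centered at 0°) pick up hues on both sides of the 360°/0° seam.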

Tone Mapping

Final output uses ACES Filmic tone mapping for cinematic highlight roll-off:

vec3 ACESFilm(vec3 x) {
    float a = 2.51;
    float b = 0.03;
    float c = 2.43;
    float d = 0.59;
    float e = 0.14;
    return clamp((x * (a * x + b)) / (x * (c * x + d) + e), 0.0, 1.0);
}
Reference: Academy Color Encoding System (ACES). Academy of Motion Picture Arts and Sciences, 2014.

Academic References

  • Yang, L. et al. "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." CVPR 2024.
  • Xie, E. et al. "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers." NeurIPS 2021.
  • Wang, X. et al. "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data." ICCV 2021 Workshop.
  • Bavoil, L., Sainz, M. "Image-space horizon-based ambient occlusion." SIGGRAPH 2008.
  • Walter, B. et al. "Microfacet Models for Refraction through Rough Surfaces." EGSR 2007.
  • Karis, B. "Real Shading in Unreal Engine 4." SIGGRAPH 2013 Course.

Technical Dependencies

Library | Version | Purpose
Transformers.js | 2.17+ | ML model inference (ONNX Runtime)
Three.js | 0.160+ | 3D mesh rendering
MediaPipe | 0.10+ | Face mesh detection
ONNX Runtime Web | 1.17+ | Neural network execution

Browser Compatibility

Browser | WebGPU | WebGL2
Chrome | 113+ | 90+
Firefox | 120+ | 90+
Safari | 17+ | 15+
Edge | 113+ | 90+