Table of Contents

Sulphur 2 Video Pipeline on RTX 3070 Laptop 8GB

This page is the technical deep-dive companion to the public showcase at davidhsaiou.com/ai-video. It documents every non-obvious design decision, quantitative evidence, and tuning history for the ai-shorts repo's T2V / I2V video generation pipeline running entirely on a single RTX 3070 Laptop 8GB — no cloud, no paid API.


1. Overview

The pipeline runs Sulphur 2 22B GGUF (based on LightTricks LTX-Video 2.3) in ComfyUI to produce broadcast-quality 1920×1080 H.264 MP4 segments, then concatenates them into a finished short-form video. Key characteristics:

  • Hardware constraint — RTX 3070 Laptop 8GB; all inference happens locally.
  • Three-pass design — Draft T2V/I2V (Pass 1) → LCM latent-space refinement (Pass 2) → Real-ESRGAN pixel-space upscale.
  • Joint AV generation — LTX-Video 2.3's joint audio-video diffusion model generates an ambient audio/music track synchronised with the video latent in Pass 1. This track is environmental sound / background music only — it is not voice narration. Voice-over and subtitle injection belong to a separate narration pipeline that is not integrated here.
  • Output — 1920×1080 H.264 MP4 with the LTX joint AV audio track muxed in.
  • Showcasedavidhsaiou.com/ai-video

2. Three-Pass Architecture

Pass 1 — Draft (T2V / I2V)

Parameter Value
Working resolution 512×288 input → LTX auto-rounds to 640×352
Sampler euler_ancestral_cfg_pp
Steps 8
CFG 1.0
LoRA ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors at strength 0.7
Scheduler LTXVScheduler — max_shift=4.0, base_shift=1.5, stretch=true, terminal=0.1
Attention SageAttention v2 (sageattn_qk_int8_pv_fp16_cuda)
Joint AV Video latent + audio latent generated simultaneously
Typical duration ~250–311 s/segment (warm model cache)

Pass 2 — LCM Refine

Parameter Value
Upsampler LTXVLatentUpsampler ×21280×704 latent
Sampler LCM
Steps 4
Manual sigmas 0.85, 0.7250, 0.4219, 0.0
CFG 1.0
LoRA cond_safe at strength 0.5 (lower than Pass 1)
Typical duration ~180 s/segment (after CFG=1.0 fix; was ~500 s at CFG>1)

Pass 3 — ESRGAN Pixel Upscale

Parameter Value
Model RealESRGAN_x2plus.pth (×2, 64 MB, RRDBNet)
Output resolution 1920×1080 after ×2 upscale + Lanczos downsample
Typical duration ~140–160 s/segment

Audio Mux

ffmpeg muxes the Pass 1 LTX audio latent output (decoded by the LTX Audio VAE) into the final MP4 as a single ambient/music track. Voice-over narration and subtitle tracks are not part of this pipeline.

Total Pipeline Time

~10 minutes per segment with warm model cache.


3. VRAM Asymmetry — Architecture Design Rationale

The three stages have fundamentally different VRAM footprints, which drives several design constraints:

Stage Working resolution VRAM usage Headroom
Pass 1 — Draft 640×352 latent ~5.8 GB ~2 GB
Pass 2 — LCM Refine 1280×704 latent ~7.8 GB Saturated
ESRGAN Upscale 1920×1080 pixel 22B model evicted ~64 MB model on RAM

Design implications:

  • Any future feature extension — IPAdapter-style identity injection, ControlNet conditioning, additional LoRAs — can only attach to Pass 1. Pass 2 is VRAM-saturated at 1280×704; any expansion of attention there causes OOM.
  • The 22B Sulphur model is fully evicted from VRAM before ESRGAN runs. ESRGAN's 64 MB RRDBNet runs in system RAM, not VRAM.
  • The Q3_K_M quantisation of the 22B base (instead of fp8mixed) is the primary lever that keeps Pass 1 at ~5.8 GB instead of exceeding 8 GB.

4. Key Technical Decisions

4.1 CFG=1.0 Is Non-Negotiable

The distillation LoRA (cond_safe) was trained at CFG=1.0. Raising CFG pushes guidance away from the training distribution, producing frame-to-frame instability.

Quantitative evidence (Session 2 measurement):

Configuration Median PSNR-Y Notes
CFG=3.0 (broken) 21.79 dB Heavy flicker, visible on playback
CFG=1.0 (fixed) 32.22 dB Stable output (+10.43 dB)
Pre-split perfect run 41.16 dB Upper bound baseline

Performance bonus: At CFG=1.0 each sampling step requires only one forward pass (conditional only). CFG>1 requires two passes (conditional + unconditional), doubling sampling cost — the ~500 s Pass 2 time at CFG=3.0 dropped to ~180 s at CFG=1.0.

Known trade-off: Negative prompts lose force at CFG=1.0. Watermarks and text artefacts may still appear. Mitigation paths: (a) prompt engineering to avoid text-triggering phrases, (b) post-generation inpaint, (c) future NAG (Negative Attention Guidance) node — the community-standard CFG=1.0 negative complement.

4.2 Alignment with Upstream Sulphur 2 Distilled Reference Workflow

The reference JSON is ltx23_t2v distilled.json from huggingface.co/SulphurAI/Sulphur-2-base.

Three-way parameter comparison:

Parameter Lightricks distilled Sulphur 2 distilled This pipeline
Base model LTX-2.3 22B fp16/fp8 sulphur_dev fp8mixed sulphur_dev-Q3_K_M.gguf
LoRA distilled-lora-384-1.1 (s=0.5/0.5) cond_safe (s=0.7/0.5) cond_safe (s=0.7/0.5)
Pass 1 scheduler ManualSigmas 9-value LTXVScheduler 8-step LTXVScheduler 8-step
Pass 1 sampler euler_ancestral_cfg_pp euler_ancestral_cfg_pp euler_ancestral_cfg_pp
Pass 2 sampler euler_cfg_pp (non-ancestral) lcm lcm
Pass 2 sigmas 0.85, 0.7250, 0.4219, 0.0 same same
CFG (both passes) 1.0 1.0 1.0

Placeholder note: The Sulphur 2 workflow JSON contains a sulphur_final.safetensors reference. This is an author-acknowledged placeholder — the instruction is "use the LoRA OR the full models, don't use both." This file is not required and is not loaded.

8GB trade-off: sulphur_dev-Q3_K_M.gguf (Q3 quantisation) replaces Sulphur's recommended fp8mixed to fit within 8 GB VRAM. Behaviour is closely equivalent; quality is slightly lower than fp8mixed but the gap is not perceptible at 640×352 draft resolution.

4.3 Real-ESRGAN ×2, Not ×4

realesr-general-x4v3.pth (×4) produces approximately 25 GB of intermediate pixel buffer for a 97-frame segment. On an 8 GB VRAM / 32 GB RAM configuration this triggers an OOM kill.

RealESRGAN_x2plus.pth (×2, 64 MB, RRDBNet) upscales the 1280×704 Pass 2 output directly to 2560×1408, which is then Lanczos-downsampled to 1920×1080. This avoids the ×4 intermediate buffer entirely.

4.4 SageAttention v2 Must Be Built from Source

The PyPI package sageattention==1.0.6 is a stub. When applied to transformer attention blocks it produces visible mesh and checkerboard artefacts on the output frames.

The working build is compiled from source using scripts/infra/sageattention-build.sh. Both Pass 1 and Pass 2 attach a PatchSageAttentionKJ (KJNodes) node configured for sageattn_qk_int8_pv_fp16_cuda.

4.5 I2V Chaining for Continuous Narrative

LTXVImgToVideoInplace takes the last frame of the preceding segment as frame-0 anchor for the next segment, producing temporally continuous output across segment boundaries.

  • Pass 1 node 17 and Pass 2 node 75 are both statically present in the ComfyUI graph.
  • Runtime switching is done via the bypass input only — no graph surgery at runtime.
  • The noise mask for Pass 2 is constructed directly from the input latent shape, sidestepping the LTXVAddGuideAdvanced patchify crash that occurs when mask dimensions do not match the upsampled latent size.

4.6 Local Prompt Enhancer

A local llama.cpp server (http://127.0.0.1:8080, OpenAI-compatible /v1/chat/completions) runs Gemma 3 12B Q3 GGUF to expand raw prompts into LTX cinematic language (camera motion, lighting, materials, atmosphere).

  • For I2V chain segments the last frame PNG is sent as a base64 image alongside the previous enhanced prompt to maintain visual and semantic continuity across segments.
  • Connection failure falls back silently to the raw prompt — no exception is raised.
  • Enhancement runs on CPU at approximately 10 seconds per segment with no GPU involvement.

5. PSNR-Y Measurement Methodology

Tool: scripts/interframe_psnr.py

How it works:

  1. ffmpeg decodes the MP4 to a raw 8-bit luma (Y) stream.
  2. Adjacent frame pairs are compared and PSNR-Y is computed for each transition.
  3. Statistics reported: mean / median / min / max / p10 / p25.
  4. The five worst frame transitions are listed with frame indices and PSNR values.

Why PSNR-Y rather than PSNR-RGB:

Human vision is far more sensitive to luminance variation than to chrominance variation. Inter-frame luma SNR is a direct proxy for visible flicker. A low PSNR-Y transition (< ~28 dB) is reliably visible as a flash or jump cut on playback.

Practical application:

PSNR-Y converts "the output looks wrong" into a measurable, comparable, regression-testable number. All hypotheses tested during Session 2 (see Section 9) were validated or rejected by PSNR-Y before any change was accepted.


6. Model Stack

All models in production use are commercial-safe licensed.

Model Purpose License
sulphur_dev-Q3_K_M.gguf (Sulphur 2 22B) Video generation (T2V / I2V) Apache-2.0 (LTX upstream)
ltx-2-3-22b-VAE.safetensors Image latent encode / decode Apache-2.0
ltx-2.3-22b-distilled_audio_vae.safetensors Synchronised audio latent decode Apache-2.0
ltx-2-3-22b-text_encoder.safetensors Pass 1 text conditioning Apache-2.0
gemma-3-12b-it-Q3_K_M.gguf Prompt enhancer (CPU) Gemma Terms of Use
ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors Distillation adapter (cond_safe) Apache-2.0
LTX-2.3 Spatial Upscaler ×2 Pass 2 latent-space upsampling Apache-2.0
RealESRGAN_x2plus.pth Pixel-space upscaling BSD-3-Clause

7. CLI Flow

The pipeline exposes three primary modes operated via scripts/pipeline.sh. State is persisted to outputs/<run_id>/run.json (Pydantic schema), allowing free interruption, prompt revision, and single-segment reruns between stages.

# Step 1 — Draft (Pass 1 only)
scripts/pipeline.sh --draft --prompt "..."
# Outputs: outputs/<run_id>/run.json, draft_<run_id>_seg0.mp4, latents, frames

# Step 2 — Resume (Pass 2 LCM refine)
scripts/pipeline.sh --resume <run_id>
# Outputs: seg_<run_id>_<i>_pass2.mp4 (1024x768 preview) + concat preview

# Step 3 — Upscale (ESRGAN + audio mux)
scripts/pipeline.sh --upscale <run_id>
# Outputs: seg_<run_id>_<i>_final.mp4, <run_id>_1080p.mp4 (final 1920×1080)

Additional CLI Modes

Flag Purpose
--append <run_id> Append a new segment to an existing draft run; automatically uses the previous segment's last frame as the I2V anchor
--rework <run_id> --segment N --prompt "..." Re-run Pass 1 for segment N; optionally cascades to re-run downstream segments
--re-enhance <run_id> Re-run the prompt enhancer only; does not touch video
--enhance-only <run_id> Same as --re-enhance but hash-gated: skips unchanged prompts and invalidates downstream cached artefacts
--cleanup <run_id> / --cleanup-all Remove latents, audio files, and run.json for one or all runs
--no-enhance Skip the prompt enhancer; use raw prompt directly
--no-cascade When used with --rework, suppresses automatic re-run of downstream segments

Design principle: Every stage is idempotent. run.json is the single source of truth for resolved parameters; --resume and --upscale read it and refuse to proceed if a CLI override conflicts (exit code 2), preventing half-mixed runs where Pass 1 and Pass 2 used different resolutions.


8. Resolution Precision Rules

LTX VAE requires width and height to be divisible by 32. The pipeline enforces a three-level precedence for resolution selection:

  1. CLI flag (--width, --height) — highest priority
  2. run.json field — written by --draft, read by subsequent stages
  3. RESOLUTION_DEFAULTS — compile-time constants: 512×288 for draft, 1920×1080 for final

--draft resolves and writes final values back to run.json. --resume and --upscale are read-only with respect to resolution. A CLI override that conflicts with run.json causes an immediate exit 2 — this prevents accidental half-mixed runs where Pass 1 and Pass 2 used different resolutions.


9. Tuning History and Lessons Learned

Session 1 — 2026-05-15: Initial Hardware Tuning

Starting from a monolithic legacy ComfyUI workflow, empirically tuned on the RTX 3070 Laptop 8GB to reach stable production settings:

  • Final per-segment time: ~5:30 (warm cache).
  • Production baseline run 1778935286 — perfect smooth output on pre-split monolithic workflow (median PSNR-Y 41.22).

Session 2 — 2026-05-18: Flicker Root Cause and Recovery

Symptom: After splitting the monolithic workflow into shots/pass1.json + shots/pass2_refine.json, 1080p output showed severe inter-frame flicker.

Root cause: Commit 5fcdfe3 introduced three coupled regressions during the split:

  1. CFG changed from 1.0 to 3.0.
  2. base_shift changed from 1.5 (official) to 1.2.
  3. Pass 2 sigma values diverged from the official Sulphur 2 reference.

Wrong-direction hypothesis (COR-1684, closed Wont Do): A hypothesis that reducing max_shift from 4.0 to 2.05 and base_shift from 1.2 to 0.95 would improve stability. Measured result: mean PSNR-Y degraded by −3.26 dB. Discarded.

Actual fix: Full parameter alignment with the Sulphur 2 official distilled T2V workflow (see Section 4.2).

Measured improvement:

Metric Post-split broken Post-alignment fixed Perfect baseline
Pass 1 mean PSNR-Y 26.59 dB 36.41 dB (+9.82) 41.22 dB

Core lesson: When splitting or refactoring a configuration file that contains coupled model hyperparameters, diff the before/after parameter set explicitly against the upstream reference. Internal consistency of the new files is not sufficient — the parameters must match the training-time values.


10. Troubleshooting

Symptom Likely cause Resolution
Inter-frame flicker CFG ≠ 1.0 Align to Sulphur 2 distilled workflow: CFG=1.0 for both passes
Mesh / checkerboard artefacts SageAttention PyPI stub installed Build SageAttention v2 from source via scripts/infra/sageattention-build.sh
ESRGAN OOM kill ×4 upscale path producing ~25 GB buffer Switch to RealESRGAN_x2plus.pth (×2)
Pass 2 taking ~500 s CFG>1 running two forward passes per step Set CFG=1.0; expected time ~180 s
Frame 0 drift across segment boundary I2V anchor not active Confirm LTXVImgToVideoInplace nodes are active in both Pass 1 (node 17) and Pass 2 (node 75)
Watermark / text residue on output Negative prompt ineffective at CFG=1.0 Avoid text-triggering phrases in prompt; post-generate inpaint; or add NAG node
llama-server connection refused Prompt enhancer offline Launch server via scripts/infra/llamacpp-launch.sh; pipeline falls back to raw prompt silently
Pass 1 OOM Working resolution exceeds 8 GB budget Keep draft resolution at ≤640×352; Pass 2 already saturates at 1280×704
ComfyUI RAM balloon Model cache not disabled Start ComfyUI with --cache-none --reserve-vram 0
--resume exits with code 2 CLI resolution override conflicts with run.json Remove the conflicting CLI flag or start a new --draft run