Sulphur 2 Video Pipeline on RTX 3070 Laptop 8GB

This page is the technical deep-dive companion to the public showcase at davidhsaiou.com/ai-video. It documents every non-obvious design decision, quantitative evidence, and tuning history for the ai-shorts repo's T2V / I2V video generation pipeline running entirely on a single RTX 3070 Laptop 8GB — no cloud, no paid API.

1. Overview

The pipeline runs Sulphur 2 22B GGUF (based on LightTricks LTX-Video 2.3) in ComfyUI to produce broadcast-quality 1920×1080 H.264 MP4 segments, then concatenates them into a finished short-form video. Key characteristics:

Hardware constraint — RTX 3070 Laptop 8GB; all inference happens locally.
Three-pass design — Draft T2V/I2V (Pass 1) → LCM latent-space refinement (Pass 2) → Real-ESRGAN pixel-space upscale.
Joint AV generation — LTX-Video 2.3's joint audio-video diffusion model generates an ambient audio/music track synchronised with the video latent in Pass 1. This track is environmental sound / background music only — it is not voice narration. Voice-over and subtitle injection belong to a separate narration pipeline that is not integrated here.
Output — 1920×1080 H.264 MP4 with the LTX joint AV audio track muxed in.
Showcase — davidhsaiou.com/ai-video

2. Three-Pass Architecture

Pass 1 — Draft (T2V / I2V)

Parameter	Value
Working resolution	512×288 input → LTX auto-rounds to 640×352
Sampler	`euler_ancestral_cfg_pp`
Steps	8
CFG	1.0
LoRA	`ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors` at strength 0.7
Scheduler	`LTXVScheduler` — max_shift=4.0, base_shift=1.5, stretch=true, terminal=0.1
Attention	SageAttention v2 (`sageattn_qk_int8_pv_fp16_cuda`)
Joint AV	Video latent + audio latent generated simultaneously
Typical duration	~250–311 s/segment (warm model cache)

Pass 2 — LCM Refine

Parameter	Value
Upsampler	`LTXVLatentUpsampler ×2` → 1280×704 latent
Sampler	LCM
Steps	4
Manual sigmas	`0.85, 0.7250, 0.4219, 0.0`
CFG	1.0
LoRA	cond_safe at strength 0.5 (lower than Pass 1)
Typical duration	~180 s/segment (after CFG=1.0 fix; was ~500 s at CFG>1)

Pass 3 — ESRGAN Pixel Upscale

Parameter	Value
Model	`RealESRGAN_x2plus.pth` (×2, 64 MB, RRDBNet)
Output resolution	1920×1080 after ×2 upscale + Lanczos downsample
Typical duration	~140–160 s/segment

Audio Mux

ffmpeg muxes the Pass 1 LTX audio latent output (decoded by the LTX Audio VAE) into the final MP4 as a single ambient/music track. Voice-over narration and subtitle tracks are not part of this pipeline.

Total Pipeline Time

~10 minutes per segment with warm model cache.

3. VRAM Asymmetry — Architecture Design Rationale

The three stages have fundamentally different VRAM footprints, which drives several design constraints:

Stage	Working resolution	VRAM usage	Headroom
Pass 1 — Draft	640×352 latent	~5.8 GB	~2 GB
Pass 2 — LCM Refine	1280×704 latent	~7.8 GB	Saturated
ESRGAN Upscale	1920×1080 pixel	22B model evicted	~64 MB model on RAM

Design implications:

Any future feature extension — IPAdapter-style identity injection, ControlNet conditioning, additional LoRAs — can only attach to Pass 1. Pass 2 is VRAM-saturated at 1280×704; any expansion of attention there causes OOM.
The 22B Sulphur model is fully evicted from VRAM before ESRGAN runs. ESRGAN's 64 MB RRDBNet runs in system RAM, not VRAM.
The Q3_K_M quantisation of the 22B base (instead of fp8mixed) is the primary lever that keeps Pass 1 at ~5.8 GB instead of exceeding 8 GB.

4. Key Technical Decisions

4.1 CFG=1.0 Is Non-Negotiable

The distillation LoRA (cond_safe) was trained at CFG=1.0. Raising CFG pushes guidance away from the training distribution, producing frame-to-frame instability.

Quantitative evidence (Session 2 measurement):

Configuration	Median PSNR-Y	Notes
CFG=3.0 (broken)	21.79 dB	Heavy flicker, visible on playback
CFG=1.0 (fixed)	32.22 dB	Stable output (+10.43 dB)
Pre-split perfect run	41.16 dB	Upper bound baseline

Performance bonus: At CFG=1.0 each sampling step requires only one forward pass (conditional only). CFG>1 requires two passes (conditional + unconditional), doubling sampling cost — the ~500 s Pass 2 time at CFG=3.0 dropped to ~180 s at CFG=1.0.

Known trade-off: Negative prompts lose force at CFG=1.0. Watermarks and text artefacts may still appear. Mitigation paths: (a) prompt engineering to avoid text-triggering phrases, (b) post-generation inpaint, (c) future NAG (Negative Attention Guidance) node — the community-standard CFG=1.0 negative complement.

4.2 Alignment with Upstream Sulphur 2 Distilled Reference Workflow

The reference JSON is ltx23_t2v distilled.json from huggingface.co/SulphurAI/Sulphur-2-base.

Three-way parameter comparison:

Parameter	Lightricks distilled	Sulphur 2 distilled	This pipeline
Base model	LTX-2.3 22B fp16/fp8	sulphur_dev fp8mixed	sulphur_dev-Q3_K_M.gguf
LoRA	distilled-lora-384-1.1 (s=0.5/0.5)	cond_safe (s=0.7/0.5)	cond_safe (s=0.7/0.5)
Pass 1 scheduler	ManualSigmas 9-value	LTXVScheduler 8-step	LTXVScheduler 8-step
Pass 1 sampler	euler_ancestral_cfg_pp	euler_ancestral_cfg_pp	euler_ancestral_cfg_pp
Pass 2 sampler	euler_cfg_pp (non-ancestral)	lcm	lcm
Pass 2 sigmas	0.85, 0.7250, 0.4219, 0.0	same	same
CFG (both passes)	1.0	1.0	1.0

Placeholder note: The Sulphur 2 workflow JSON contains a sulphur_final.safetensors reference. This is an author-acknowledged placeholder — the instruction is "use the LoRA OR the full models, don't use both." This file is not required and is not loaded.

8GB trade-off: sulphur_dev-Q3_K_M.gguf (Q3 quantisation) replaces Sulphur's recommended fp8mixed to fit within 8 GB VRAM. Behaviour is closely equivalent; quality is slightly lower than fp8mixed but the gap is not perceptible at 640×352 draft resolution.

4.3 Real-ESRGAN ×2, Not ×4

realesr-general-x4v3.pth (×4) produces approximately 25 GB of intermediate pixel buffer for a 97-frame segment. On an 8 GB VRAM / 32 GB RAM configuration this triggers an OOM kill.

RealESRGAN_x2plus.pth (×2, 64 MB, RRDBNet) upscales the 1280×704 Pass 2 output directly to 2560×1408, which is then Lanczos-downsampled to 1920×1080. This avoids the ×4 intermediate buffer entirely.

4.4 SageAttention v2 Must Be Built from Source

The PyPI package sageattention==1.0.6 is a stub. When applied to transformer attention blocks it produces visible mesh and checkerboard artefacts on the output frames.

The working build is compiled from source using scripts/infra/sageattention-build.sh. Both Pass 1 and Pass 2 attach a PatchSageAttentionKJ (KJNodes) node configured for sageattn_qk_int8_pv_fp16_cuda.

4.5 I2V Chaining for Continuous Narrative

LTXVImgToVideoInplace takes the last frame of the preceding segment as frame-0 anchor for the next segment, producing temporally continuous output across segment boundaries.

Pass 1 node 17 and Pass 2 node 75 are both statically present in the ComfyUI graph.
Runtime switching is done via the bypass input only — no graph surgery at runtime.
The noise mask for Pass 2 is constructed directly from the input latent shape, sidestepping the LTXVAddGuideAdvanced patchify crash that occurs when mask dimensions do not match the upsampled latent size.

4.6 Local Prompt Enhancer

A local llama.cpp server (http://127.0.0.1:8080, OpenAI-compatible /v1/chat/completions) runs Gemma 3 12B Q3 GGUF to expand raw prompts into LTX cinematic language (camera motion, lighting, materials, atmosphere).

For I2V chain segments the last frame PNG is sent as a base64 image alongside the previous enhanced prompt to maintain visual and semantic continuity across segments.
Connection failure falls back silently to the raw prompt — no exception is raised.
Enhancement runs on CPU at approximately 10 seconds per segment with no GPU involvement.

5. PSNR-Y Measurement Methodology

Tool: scripts/interframe_psnr.py

How it works:

ffmpeg decodes the MP4 to a raw 8-bit luma (Y) stream.
Adjacent frame pairs are compared and PSNR-Y is computed for each transition.
Statistics reported: mean / median / min / max / p10 / p25.
The five worst frame transitions are listed with frame indices and PSNR values.

Why PSNR-Y rather than PSNR-RGB:

Human vision is far more sensitive to luminance variation than to chrominance variation. Inter-frame luma SNR is a direct proxy for visible flicker. A low PSNR-Y transition (< ~28 dB) is reliably visible as a flash or jump cut on playback.

Practical application:

PSNR-Y converts "the output looks wrong" into a measurable, comparable, regression-testable number. All hypotheses tested during Session 2 (see Section 9) were validated or rejected by PSNR-Y before any change was accepted.

6. Model Stack

All models in production use are commercial-safe licensed.

Model	Purpose	License
`sulphur_dev-Q3_K_M.gguf` (Sulphur 2 22B)	Video generation (T2V / I2V)	Apache-2.0 (LTX upstream)
`ltx-2-3-22b-VAE.safetensors`	Image latent encode / decode	Apache-2.0
`ltx-2.3-22b-distilled_audio_vae.safetensors`	Synchronised audio latent decode	Apache-2.0
`ltx-2-3-22b-text_encoder.safetensors`	Pass 1 text conditioning	Apache-2.0
`gemma-3-12b-it-Q3_K_M.gguf`	Prompt enhancer (CPU)	Gemma Terms of Use
`ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors`	Distillation adapter (cond_safe)	Apache-2.0
LTX-2.3 Spatial Upscaler ×2	Pass 2 latent-space upsampling	Apache-2.0
`RealESRGAN_x2plus.pth`	Pixel-space upscaling	BSD-3-Clause

7. CLI Flow

The pipeline exposes three primary modes operated via scripts/pipeline.sh. State is persisted to outputs/<run_id>/run.json (Pydantic schema), allowing free interruption, prompt revision, and single-segment reruns between stages.

# Step 1 — Draft (Pass 1 only)
scripts/pipeline.sh --draft --prompt "..."
# Outputs: outputs/<run_id>/run.json, draft_<run_id>_seg0.mp4, latents, frames

# Step 2 — Resume (Pass 2 LCM refine)
scripts/pipeline.sh --resume <run_id>
# Outputs: seg_<run_id>_<i>_pass2.mp4 (1024x768 preview) + concat preview

# Step 3 — Upscale (ESRGAN + audio mux)
scripts/pipeline.sh --upscale <run_id>
# Outputs: seg_<run_id>_<i>_final.mp4, <run_id>_1080p.mp4 (final 1920×1080)

Additional CLI Modes

Flag	Purpose
`--append <run_id>`	Append a new segment to an existing draft run; automatically uses the previous segment's last frame as the I2V anchor
`--rework <run_id> --segment N --prompt "..."`	Re-run Pass 1 for segment N; optionally cascades to re-run downstream segments
`--re-enhance <run_id>`	Re-run the prompt enhancer only; does not touch video
`--enhance-only <run_id>`	Same as `--re-enhance` but hash-gated: skips unchanged prompts and invalidates downstream cached artefacts
`--cleanup <run_id>` / `--cleanup-all`	Remove latents, audio files, and `run.json` for one or all runs
`--no-enhance`	Skip the prompt enhancer; use raw prompt directly
`--no-cascade`	When used with `--rework`, suppresses automatic re-run of downstream segments

Design principle: Every stage is idempotent. run.json is the single source of truth for resolved parameters; --resume and --upscale read it and refuse to proceed if a CLI override conflicts (exit code 2), preventing half-mixed runs where Pass 1 and Pass 2 used different resolutions.

8. Resolution Precision Rules

LTX VAE requires width and height to be divisible by 32. The pipeline enforces a three-level precedence for resolution selection:

CLI flag (--width, --height) — highest priority
run.json field — written by --draft, read by subsequent stages
RESOLUTION_DEFAULTS — compile-time constants: 512×288 for draft, 1920×1080 for final

--draft resolves and writes final values back to run.json. --resume and --upscale are read-only with respect to resolution. A CLI override that conflicts with run.json causes an immediate exit 2 — this prevents accidental half-mixed runs where Pass 1 and Pass 2 used different resolutions.

9. Tuning History and Lessons Learned

Session 1 — 2026-05-15: Initial Hardware Tuning

Starting from a monolithic legacy ComfyUI workflow, empirically tuned on the RTX 3070 Laptop 8GB to reach stable production settings:

Final per-segment time: ~5:30 (warm cache).
Production baseline run 1778935286 — perfect smooth output on pre-split monolithic workflow (median PSNR-Y 41.22).

Session 2 — 2026-05-18: Flicker Root Cause and Recovery

Symptom: After splitting the monolithic workflow into shots/pass1.json + shots/pass2_refine.json, 1080p output showed severe inter-frame flicker.

Root cause: Commit 5fcdfe3 introduced three coupled regressions during the split:

CFG changed from 1.0 to 3.0.
base_shift changed from 1.5 (official) to 1.2.
Pass 2 sigma values diverged from the official Sulphur 2 reference.

Wrong-direction hypothesis (COR-1684, closed Wont Do): A hypothesis that reducing max_shift from 4.0 to 2.05 and base_shift from 1.2 to 0.95 would improve stability. Measured result: mean PSNR-Y degraded by −3.26 dB. Discarded.

Actual fix: Full parameter alignment with the Sulphur 2 official distilled T2V workflow (see Section 4.2).

Measured improvement:

Metric	Post-split broken	Post-alignment fixed	Perfect baseline
Pass 1 mean PSNR-Y	26.59 dB	36.41 dB (+9.82)	41.22 dB

Core lesson: When splitting or refactoring a configuration file that contains coupled model hyperparameters, diff the before/after parameter set explicitly against the upstream reference. Internal consistency of the new files is not sufficient — the parameters must match the training-time values.

10. Troubleshooting

Symptom	Likely cause	Resolution
Inter-frame flicker	CFG ≠ 1.0	Align to Sulphur 2 distilled workflow: CFG=1.0 for both passes
Mesh / checkerboard artefacts	SageAttention PyPI stub installed	Build SageAttention v2 from source via `scripts/infra/sageattention-build.sh`
ESRGAN OOM kill	×4 upscale path producing ~25 GB buffer	Switch to `RealESRGAN_x2plus.pth` (×2)
Pass 2 taking ~500 s	CFG>1 running two forward passes per step	Set CFG=1.0; expected time ~180 s
Frame 0 drift across segment boundary	I2V anchor not active	Confirm `LTXVImgToVideoInplace` nodes are active in both Pass 1 (node 17) and Pass 2 (node 75)
Watermark / text residue on output	Negative prompt ineffective at CFG=1.0	Avoid text-triggering phrases in prompt; post-generate inpaint; or add NAG node
llama-server connection refused	Prompt enhancer offline	Launch server via `scripts/infra/llamacpp-launch.sh`; pipeline falls back to raw prompt silently
Pass 1 OOM	Working resolution exceeds 8 GB budget	Keep draft resolution at ≤640×352; Pass 2 already saturates at 1280×704
ComfyUI RAM balloon	Model cache not disabled	Start ComfyUI with `--cache-none --reserve-vram 0`
`--resume` exits with code 2	CLI resolution override conflicts with `run.json`	Remove the conflicting CLI flag or start a new `--draft` run

Table of Contents