Sulphur 2 Video Pipeline on RTX 3070 Laptop 8GB
This page is the technical deep-dive companion to the public showcase at davidhsaiou.com/ai-video. It documents every non-obvious design decision, quantitative evidence, and tuning history for the ai-shorts repo's T2V / I2V video generation pipeline running entirely on a single RTX 3070 Laptop 8GB — no cloud, no paid API.
1. Overview
The pipeline runs Sulphur 2 22B GGUF (based on LightTricks LTX-Video 2.3) in ComfyUI to produce broadcast-quality 1920×1080 H.264 MP4 segments, then concatenates them into a finished short-form video. Key characteristics:
- Hardware constraint — RTX 3070 Laptop 8GB; all inference happens locally.
- Three-pass design — Draft T2V/I2V (Pass 1) → LCM latent-space refinement (Pass 2) → Real-ESRGAN pixel-space upscale.
- Joint AV generation — LTX-Video 2.3's joint audio-video diffusion model generates an ambient audio/music track synchronised with the video latent in Pass 1. This track is environmental sound / background music only — it is not voice narration. Voice-over and subtitle injection belong to a separate narration pipeline that is not integrated here.
- Output — 1920×1080 H.264 MP4 with the LTX joint AV audio track muxed in.
- Showcase — davidhsaiou.com/ai-video
2. Three-Pass Architecture
Pass 1 — Draft (T2V / I2V)
| Parameter | Value |
|---|---|
| Working resolution | 512×288 input → LTX auto-rounds to 640×352 |
| Sampler | euler_ancestral_cfg_pp |
| Steps | 8 |
| CFG | 1.0 |
| LoRA | ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors at strength 0.7 |
| Scheduler | LTXVScheduler — max_shift=4.0, base_shift=1.5, stretch=true, terminal=0.1 |
| Attention | SageAttention v2 (sageattn_qk_int8_pv_fp16_cuda) |
| Joint AV | Video latent + audio latent generated simultaneously |
| Typical duration | ~250–311 s/segment (warm model cache) |
Pass 2 — LCM Refine
| Parameter | Value |
|---|---|
| Upsampler | LTXVLatentUpsampler ×2 → 1280×704 latent |
| Sampler | LCM |
| Steps | 4 |
| Manual sigmas | 0.85, 0.7250, 0.4219, 0.0 |
| CFG | 1.0 |
| LoRA | cond_safe at strength 0.5 (lower than Pass 1) |
| Typical duration | ~180 s/segment (after CFG=1.0 fix; was ~500 s at CFG>1) |
Pass 3 — ESRGAN Pixel Upscale
| Parameter | Value |
|---|---|
| Model | RealESRGAN_x2plus.pth (×2, 64 MB, RRDBNet) |
| Output resolution | 1920×1080 after ×2 upscale + Lanczos downsample |
| Typical duration | ~140–160 s/segment |
Audio Mux
ffmpeg muxes the Pass 1 LTX audio latent output (decoded by the LTX Audio VAE) into the final MP4 as a single ambient/music track. Voice-over narration and subtitle tracks are not part of this pipeline.
Total Pipeline Time
~10 minutes per segment with warm model cache.
3. VRAM Asymmetry — Architecture Design Rationale
The three stages have fundamentally different VRAM footprints, which drives several design constraints:
| Stage | Working resolution | VRAM usage | Headroom |
|---|---|---|---|
| Pass 1 — Draft | 640×352 latent | ~5.8 GB | ~2 GB |
| Pass 2 — LCM Refine | 1280×704 latent | ~7.8 GB | Saturated |
| ESRGAN Upscale | 1920×1080 pixel | 22B model evicted | ~64 MB model on RAM |
Design implications:
- Any future feature extension — IPAdapter-style identity injection, ControlNet conditioning, additional LoRAs — can only attach to Pass 1. Pass 2 is VRAM-saturated at 1280×704; any expansion of attention there causes OOM.
- The 22B Sulphur model is fully evicted from VRAM before ESRGAN runs. ESRGAN's 64 MB RRDBNet runs in system RAM, not VRAM.
- The Q3_K_M quantisation of the 22B base (instead of fp8mixed) is the primary lever that keeps Pass 1 at ~5.8 GB instead of exceeding 8 GB.
4. Key Technical Decisions
4.1 CFG=1.0 Is Non-Negotiable
The distillation LoRA (cond_safe) was trained at CFG=1.0. Raising CFG pushes guidance away from the training distribution, producing frame-to-frame instability.
Quantitative evidence (Session 2 measurement):
| Configuration | Median PSNR-Y | Notes |
|---|---|---|
| CFG=3.0 (broken) | 21.79 dB | Heavy flicker, visible on playback |
| CFG=1.0 (fixed) | 32.22 dB | Stable output (+10.43 dB) |
| Pre-split perfect run | 41.16 dB | Upper bound baseline |
Performance bonus: At CFG=1.0 each sampling step requires only one forward pass (conditional only). CFG>1 requires two passes (conditional + unconditional), doubling sampling cost — the ~500 s Pass 2 time at CFG=3.0 dropped to ~180 s at CFG=1.0.
Known trade-off: Negative prompts lose force at CFG=1.0. Watermarks and text artefacts may still appear. Mitigation paths: (a) prompt engineering to avoid text-triggering phrases, (b) post-generation inpaint, (c) future NAG (Negative Attention Guidance) node — the community-standard CFG=1.0 negative complement.
4.2 Alignment with Upstream Sulphur 2 Distilled Reference Workflow
The reference JSON is ltx23_t2v distilled.json from huggingface.co/SulphurAI/Sulphur-2-base.
Three-way parameter comparison:
| Parameter | Lightricks distilled | Sulphur 2 distilled | This pipeline |
|---|---|---|---|
| Base model | LTX-2.3 22B fp16/fp8 | sulphur_dev fp8mixed | sulphur_dev-Q3_K_M.gguf |
| LoRA | distilled-lora-384-1.1 (s=0.5/0.5) | cond_safe (s=0.7/0.5) | cond_safe (s=0.7/0.5) |
| Pass 1 scheduler | ManualSigmas 9-value | LTXVScheduler 8-step | LTXVScheduler 8-step |
| Pass 1 sampler | euler_ancestral_cfg_pp | euler_ancestral_cfg_pp | euler_ancestral_cfg_pp |
| Pass 2 sampler | euler_cfg_pp (non-ancestral) | lcm | lcm |
| Pass 2 sigmas | 0.85, 0.7250, 0.4219, 0.0 | same | same |
| CFG (both passes) | 1.0 | 1.0 | 1.0 |
Placeholder note: The Sulphur 2 workflow JSON contains a sulphur_final.safetensors reference. This is an author-acknowledged placeholder — the instruction is "use the LoRA OR the full models, don't use both." This file is not required and is not loaded.
8GB trade-off: sulphur_dev-Q3_K_M.gguf (Q3 quantisation) replaces Sulphur's recommended fp8mixed to fit within 8 GB VRAM. Behaviour is closely equivalent; quality is slightly lower than fp8mixed but the gap is not perceptible at 640×352 draft resolution.
4.3 Real-ESRGAN ×2, Not ×4
realesr-general-x4v3.pth (×4) produces approximately 25 GB of intermediate pixel buffer for a 97-frame segment. On an 8 GB VRAM / 32 GB RAM configuration this triggers an OOM kill.
RealESRGAN_x2plus.pth (×2, 64 MB, RRDBNet) upscales the 1280×704 Pass 2 output directly to 2560×1408, which is then Lanczos-downsampled to 1920×1080. This avoids the ×4 intermediate buffer entirely.
4.4 SageAttention v2 Must Be Built from Source
The PyPI package sageattention==1.0.6 is a stub. When applied to transformer attention blocks it produces visible mesh and checkerboard artefacts on the output frames.
The working build is compiled from source using scripts/infra/sageattention-build.sh. Both Pass 1 and Pass 2 attach a PatchSageAttentionKJ (KJNodes) node configured for sageattn_qk_int8_pv_fp16_cuda.
4.5 I2V Chaining for Continuous Narrative
LTXVImgToVideoInplace takes the last frame of the preceding segment as frame-0 anchor for the next segment, producing temporally continuous output across segment boundaries.
- Pass 1 node 17 and Pass 2 node 75 are both statically present in the ComfyUI graph.
- Runtime switching is done via the
bypassinput only — no graph surgery at runtime. - The noise mask for Pass 2 is constructed directly from the input latent shape, sidestepping the
LTXVAddGuideAdvancedpatchify crash that occurs when mask dimensions do not match the upsampled latent size.
4.6 Local Prompt Enhancer
A local llama.cpp server (http://127.0.0.1:8080, OpenAI-compatible /v1/chat/completions) runs Gemma 3 12B Q3 GGUF to expand raw prompts into LTX cinematic language (camera motion, lighting, materials, atmosphere).
- For I2V chain segments the last frame PNG is sent as a base64 image alongside the previous enhanced prompt to maintain visual and semantic continuity across segments.
- Connection failure falls back silently to the raw prompt — no exception is raised.
- Enhancement runs on CPU at approximately 10 seconds per segment with no GPU involvement.
5. PSNR-Y Measurement Methodology
Tool: scripts/interframe_psnr.py
How it works:
ffmpegdecodes the MP4 to a raw 8-bit luma (Y) stream.- Adjacent frame pairs are compared and PSNR-Y is computed for each transition.
- Statistics reported: mean / median / min / max / p10 / p25.
- The five worst frame transitions are listed with frame indices and PSNR values.
Why PSNR-Y rather than PSNR-RGB:
Human vision is far more sensitive to luminance variation than to chrominance variation. Inter-frame luma SNR is a direct proxy for visible flicker. A low PSNR-Y transition (< ~28 dB) is reliably visible as a flash or jump cut on playback.
Practical application:
PSNR-Y converts "the output looks wrong" into a measurable, comparable, regression-testable number. All hypotheses tested during Session 2 (see Section 9) were validated or rejected by PSNR-Y before any change was accepted.
6. Model Stack
All models in production use are commercial-safe licensed.
| Model | Purpose | License |
|---|---|---|
sulphur_dev-Q3_K_M.gguf (Sulphur 2 22B) |
Video generation (T2V / I2V) | Apache-2.0 (LTX upstream) |
ltx-2-3-22b-VAE.safetensors |
Image latent encode / decode | Apache-2.0 |
ltx-2.3-22b-distilled_audio_vae.safetensors |
Synchronised audio latent decode | Apache-2.0 |
ltx-2-3-22b-text_encoder.safetensors |
Pass 1 text conditioning | Apache-2.0 |
gemma-3-12b-it-Q3_K_M.gguf |
Prompt enhancer (CPU) | Gemma Terms of Use |
ltx-2.3-22b-distilled-lora-1.1_fro90_ceil72_condsafe.safetensors |
Distillation adapter (cond_safe) | Apache-2.0 |
| LTX-2.3 Spatial Upscaler ×2 | Pass 2 latent-space upsampling | Apache-2.0 |
RealESRGAN_x2plus.pth |
Pixel-space upscaling | BSD-3-Clause |
7. CLI Flow
The pipeline exposes three primary modes operated via scripts/pipeline.sh. State is persisted to outputs/<run_id>/run.json (Pydantic schema), allowing free interruption, prompt revision, and single-segment reruns between stages.
# Step 1 — Draft (Pass 1 only)
scripts/pipeline.sh --draft --prompt "..."
# Outputs: outputs/<run_id>/run.json, draft_<run_id>_seg0.mp4, latents, frames
# Step 2 — Resume (Pass 2 LCM refine)
scripts/pipeline.sh --resume <run_id>
# Outputs: seg_<run_id>_<i>_pass2.mp4 (1024x768 preview) + concat preview
# Step 3 — Upscale (ESRGAN + audio mux)
scripts/pipeline.sh --upscale <run_id>
# Outputs: seg_<run_id>_<i>_final.mp4, <run_id>_1080p.mp4 (final 1920×1080)
Additional CLI Modes
| Flag | Purpose |
|---|---|
--append <run_id> |
Append a new segment to an existing draft run; automatically uses the previous segment's last frame as the I2V anchor |
--rework <run_id> --segment N --prompt "..." |
Re-run Pass 1 for segment N; optionally cascades to re-run downstream segments |
--re-enhance <run_id> |
Re-run the prompt enhancer only; does not touch video |
--enhance-only <run_id> |
Same as --re-enhance but hash-gated: skips unchanged prompts and invalidates downstream cached artefacts |
--cleanup <run_id> / --cleanup-all |
Remove latents, audio files, and run.json for one or all runs |
--no-enhance |
Skip the prompt enhancer; use raw prompt directly |
--no-cascade |
When used with --rework, suppresses automatic re-run of downstream segments |
Design principle: Every stage is idempotent. run.json is the single source of truth for resolved parameters; --resume and --upscale read it and refuse to proceed if a CLI override conflicts (exit code 2), preventing half-mixed runs where Pass 1 and Pass 2 used different resolutions.
8. Resolution Precision Rules
LTX VAE requires width and height to be divisible by 32. The pipeline enforces a three-level precedence for resolution selection:
- CLI flag (
--width,--height) — highest priority run.jsonfield — written by--draft, read by subsequent stagesRESOLUTION_DEFAULTS— compile-time constants: 512×288 for draft, 1920×1080 for final
--draft resolves and writes final values back to run.json. --resume and --upscale are read-only with respect to resolution. A CLI override that conflicts with run.json causes an immediate exit 2 — this prevents accidental half-mixed runs where Pass 1 and Pass 2 used different resolutions.
9. Tuning History and Lessons Learned
Session 1 — 2026-05-15: Initial Hardware Tuning
Starting from a monolithic legacy ComfyUI workflow, empirically tuned on the RTX 3070 Laptop 8GB to reach stable production settings:
- Final per-segment time: ~5:30 (warm cache).
- Production baseline run
1778935286— perfect smooth output on pre-split monolithic workflow (median PSNR-Y 41.22).
Session 2 — 2026-05-18: Flicker Root Cause and Recovery
Symptom: After splitting the monolithic workflow into shots/pass1.json + shots/pass2_refine.json, 1080p output showed severe inter-frame flicker.
Root cause: Commit 5fcdfe3 introduced three coupled regressions during the split:
- CFG changed from 1.0 to 3.0.
base_shiftchanged from 1.5 (official) to 1.2.- Pass 2 sigma values diverged from the official Sulphur 2 reference.
Wrong-direction hypothesis (COR-1684, closed Wont Do): A hypothesis that reducing max_shift from 4.0 to 2.05 and base_shift from 1.2 to 0.95 would improve stability. Measured result: mean PSNR-Y degraded by −3.26 dB. Discarded.
Actual fix: Full parameter alignment with the Sulphur 2 official distilled T2V workflow (see Section 4.2).
Measured improvement:
| Metric | Post-split broken | Post-alignment fixed | Perfect baseline |
|---|---|---|---|
| Pass 1 mean PSNR-Y | 26.59 dB | 36.41 dB (+9.82) | 41.22 dB |
Core lesson: When splitting or refactoring a configuration file that contains coupled model hyperparameters, diff the before/after parameter set explicitly against the upstream reference. Internal consistency of the new files is not sufficient — the parameters must match the training-time values.
10. Troubleshooting
| Symptom | Likely cause | Resolution |
|---|---|---|
| Inter-frame flicker | CFG ≠ 1.0 | Align to Sulphur 2 distilled workflow: CFG=1.0 for both passes |
| Mesh / checkerboard artefacts | SageAttention PyPI stub installed | Build SageAttention v2 from source via scripts/infra/sageattention-build.sh |
| ESRGAN OOM kill | ×4 upscale path producing ~25 GB buffer | Switch to RealESRGAN_x2plus.pth (×2) |
| Pass 2 taking ~500 s | CFG>1 running two forward passes per step | Set CFG=1.0; expected time ~180 s |
| Frame 0 drift across segment boundary | I2V anchor not active | Confirm LTXVImgToVideoInplace nodes are active in both Pass 1 (node 17) and Pass 2 (node 75) |
| Watermark / text residue on output | Negative prompt ineffective at CFG=1.0 | Avoid text-triggering phrases in prompt; post-generate inpaint; or add NAG node |
| llama-server connection refused | Prompt enhancer offline | Launch server via scripts/infra/llamacpp-launch.sh; pipeline falls back to raw prompt silently |
| Pass 1 OOM | Working resolution exceeds 8 GB budget | Keep draft resolution at ≤640×352; Pass 2 already saturates at 1280×704 |
| ComfyUI RAM balloon | Model cache not disabled | Start ComfyUI with --cache-none --reserve-vram 0 |
--resume exits with code 2 |
CLI resolution override conflicts with run.json |
Remove the conflicting CLI flag or start a new --draft run |