From 9869560fa33cb204bc0b9cea3443a261751901f4 Mon Sep 17 00:00:00 2001 From: Test User Date: Thu, 19 Feb 2026 04:22:36 +0000 Subject: [PATCH] Remove misplaced training analysis (moved to ai-video-generator repo) --- cogvideo_training_analysis.md | 508 ---------------------------------- 1 file changed, 508 deletions(-) delete mode 100644 cogvideo_training_analysis.md diff --git a/cogvideo_training_analysis.md b/cogvideo_training_analysis.md deleted file mode 100644 index 789d7cb..0000000 --- a/cogvideo_training_analysis.md +++ /dev/null @@ -1,508 +0,0 @@ -# CogVideoX Fine-Tuning Expert Analysis -**Date:** Feb 19, 2026 -**For:** Powerslap Dataset (3K videos) -**Goal:** Domain-specific video generation fine-tuning - ---- - -## Executive Summary - -**You're doing this right.** Most CogVideoX users run inference with base models. You're building custom training infrastructure for a niche domain (combat sports) that the base model has never seen. This is exactly when fine-tuning shines. - -**Key Finding:** Research shows **10-30 videos can produce strong results** with proper LoRA training. You have **3,000 videos** — that's 100-300x the minimum. This is a massive advantage for domain adaptation. - ---- - -## 📚 Literature Review - -### 1. 
**Temporal In-Context Fine-Tuning (TIC-FT)** — The State-of-the-Art Paper
-
-**Source:** arxiv.org/html/2506.00996v2
-**Published:** Dec 2025
-**Base Model:** CogVideoX-5B
-
-#### Key Findings:
-
-**Training Setup:**
-- **20 training samples** (condition-target pairs)
-- **LoRA rank 128**
-- **6,000 training steps**
-- **Batch size 2**
-- **Single H100 80GB GPU**
-- **~13 hours** training time
-
-**Results:**
-- Strong performance with as few as **10-30 samples**
-- Outperforms ControlNet, Fun-pose, and spatial in-context methods
-- Works on tasks: character-to-video, object-to-motion, style transfer, action transfer
-
-**Hyperparameters:**
-```python
-learning_rate = 1e-3  # recommended range: 1e-4 to 1e-3
-optimizer = "Adam"
-lora_rank = 128
-lora_alpha = 128  # set to rank or rank // 2
-batch_size = 2
-training_steps = 6000
-```
-
-**Architecture:**
-- Temporal concatenation (not spatial grids)
-- Buffer frames with progressive noise levels
-- No architectural modifications needed
-- Unified 3D attention across condition + target frames
-
----
-
-### 2. 
**HuggingFace Diffusers Official Training Guide** - -**Source:** huggingface.co/docs/diffusers/training/cogvideox - -#### CogVideoX Team Official Recommendations: - -**Dataset:** -- **100 videos** minimum for best results -- **4,000 training steps** total -- **~40 training epochs** (100 videos × 40 epochs = 4K steps) - -**For Smaller Datasets (25-50 videos):** -- **1,500-2,000 steps** works well -- **~30 training epochs** (50 videos × 30 epochs = 1,500 steps) - -**Learning Rate:** -- **1e-3 to 1e-4** (official range from CogVideoX authors + experiments) -- Lower LR (1e-4) for stability, higher (1e-3) for faster convergence - -**LoRA Settings:** -```python -lora_rank = 64 # Official recommendation for new concepts/styles -lora_alpha = 64 # Set to rank (not 1 like original SAT repo) -# Rank 16/32 works if base model already generates moderately good results on your captions -# Rank 4 is too low — doesn't produce promising results -``` - -**Memory Optimizations:** -```python ---enable_slicing # VAE slicing ---enable_tiling # VAE tiling ---use_8bit_adam # Reduces memory usage -``` - -**Training Command Example:** -```bash -accelerate launch train_cogvideox_lora.py \ - --pretrained_model_name_or_path THUDM/CogVideoX-2b \ - --instance_data_root /path/to/videos \ - --caption_column captions.txt \ - --video_column videos.txt \ - --rank 64 \ - --lora_alpha 64 \ - --mixed_precision fp16 \ - --height 480 --width 720 --fps 8 --max_num_frames 49 \ - --train_batch_size 1 \ - --num_train_epochs 30 \ - --gradient_accumulation_steps 1 \ - --learning_rate 1e-3 \ - --lr_scheduler cosine_with_restarts \ - --lr_warmup_steps 200 \ - --optimizer Adam \ - --adam_beta1 0.9 \ - --adam_beta2 0.95 \ - --max_grad_norm 1.0 -``` - ---- - -### 3. 
**Finetrainers (CogVideoX-Factory)** - -**Source:** github.com/huggingface/finetrainers (formerly cogvideox-factory) - -#### Production Training Framework - -**Features:** -- Memory-optimized LoRA training -- Distributed training support (DDP, FSDP-2, HSDP) -- Multi-resolution bucketing -- Precomputation for large datasets -- FP8 training support - -**Example Success Story:** -- **Wallace & Gromit LoRA** -- **13 hours on L40S (32GB VRAM)** -- LoRA rank 128 -- Example dataset curation tools included - -**Supported Models:** -- CogVideoX-2B, CogVideoX-5B -- LTX-Video, HunyuanVideo, Wan, Flux - -**Memory Requirements (CogVideoX-5B):** -- **LoRA training:** 18 GB VRAM (with optimizations) -- **Full fine-tuning:** 53 GB VRAM - -**Key Optimizations:** -- Pre-computation of VAE latents + text embeddings -- Flash/Flex/Sage/xformers attention backends -- FP8 weight casting for <24GB training - ---- - -### 4. **Official CogVideo Finetune Repo** - -**Source:** github.com/zai-org/CogVideo/blob/main/finetune/ - -**Example Dataset:** -- **70 training videos** -- Resolution: **200 × 480 × 720** (frames × height × width) -- SAT (SwissArmyTransformer) backend -- Weight conversion tools: SAT ↔ HuggingFace - ---- - -## 🎯 Recommendations for Your Powerslap Training - -### Dataset Stats -- **3,000 videos** (powerslap domain) -- **Current progress:** 909/2982 captioned (30%) -- **LLaVA-34B captions** with powerslap domain prompt - -### Proposed Training Strategy - -#### **Option A: Conservative (Proven Settings)** - -```python -# Model -base_model = "THUDM/CogVideoX-5B" # Better quality than 2B -training_method = "LoRA" - -# Dataset -num_videos = 100 # Start with 100 well-captioned videos -training_steps = 4000 -batch_size = 2 -gradient_accumulation = 1 -effective_batch_size = 2 - -# LoRA -lora_rank = 128 # High rank for new domain -lora_alpha = 128 - -# Optimization -learning_rate = 1e-3 # Upper end of recommended range -optimizer = "Adam" -adam_beta1 = 0.9 -adam_beta2 = 0.95 
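-# Sanity check (assumption: one optimizer step consumes effective_batch_size clips):
-# epochs ≈ training_steps * effective_batch_size / num_videos = 4000 * 2 / 100 = 80
-# i.e. ~80 passes over the 100-video subset — double the official 40-epoch guideline
-# (which assumes batch size 1) — so watch validation clips for overfitting well
-# before step 4000.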
-lr_scheduler = "cosine_with_restarts"
-lr_warmup_steps = 200
-max_grad_norm = 1.0
-
-# Precision
-mixed_precision = "bf16"  # CogVideoX-5B trained in BF16
-enable_slicing = True
-enable_tiling = True
-
-# Video settings
-fps = 8
-max_num_frames = 49
-height = 480
-width = 720
-```
-
-**Expected Results:**
-- **Training time:** ~15-20 hours on H100
-- **VRAM:** ~20-25 GB (with optimizations)
-- **Quality:** Strong domain adaptation, faithful powerslap mechanics
-
----
-
-#### **Option B: Aggressive (Maximum Data)**
-
-```python
-# Dataset
-num_videos = 1000  # Use 1/3 of your dataset
-training_steps = 12000  # ≈96 epochs over 1000 videos at effective batch 8 (12,000 × 8 / 1,000)
-batch_size = 4  # Larger batch if VRAM allows
-gradient_accumulation = 2
-effective_batch_size = 8
-
-# LoRA
-lora_rank = 256  # Higher rank for richer domain
-lora_alpha = 128  # Keep alpha lower for stability
-
-# Optimization
-learning_rate = 5e-4  # Lower LR for large dataset
-optimizer = "AdamW"
-weight_decay = 1e-2  # Regularization for large data
-```
-
-**Expected Results:**
-- **Training time:** ~60-80 hours on H100
-- **VRAM:** ~30-35 GB
-- **Quality:** Extremely specialized powerslap model, handles edge cases
-
----
-
-#### **Option C: TIC-FT Style (Research-Backed)**
-
-Based on the TIC-FT paper's approach:
-
-```python
-# Dataset
-num_videos = 20  # Minimal test set
-training_steps = 6000
-batch_size = 2
-
-# LoRA
-lora_rank = 128
-lora_alpha = 128
-
-# Temporal In-Context Fine-Tuning
-# (Requires modifying training script to concatenate condition + target frames temporally)
-buffer_frames = 3  # Progressive noise transition frames
-condition_frames = 1  # Single reference frame
-target_frames = 48  # Generate 48 frames from 1 condition frame
-
-learning_rate = 1e-3
-```
-
-**Expected Results:**
-- **Training time:** ~13 hours on H100
-- **VRAM:** ~20 GB
-- **Quality:** Good with minimal data, best for controlled generation tasks
-
----
-
-### Caption Quality Recommendations
-
-**LLaVA-34B Powerslap Prompt** — ✅ You're already doing 
this right! - -**Caption Length:** -- **50-100 words** is ideal (ChatGLM recommendation) -- Focus on: - - **Motion dynamics:** "winds up", "delivers powerful slap", "head snaps to side" - - **Positioning:** "stance shifts", "weight transfers", "defensive positioning" - - **Impact physics:** "recoils from impact", "absorbs the strike", "staggers backward" - - **Camera movement:** "camera pans left", "zooms in on contact" - -**Example Good Caption:** -``` -Competitor A assumes an orthodox stance, weight balanced evenly. -He winds up with his right hand, rotating his torso for maximum power. -The open-hand slap connects cleanly with Competitor B's left cheek, -producing a sharp crack. Competitor B's head snaps violently to the right, -eyes squinting from the impact. He staggers briefly but maintains footing, -then resets to defensive stance. The referee steps in to assess. -Camera holds steady on medium shot, capturing full body language. -``` - ---- - -### Training Timeline (Conservative Path) - -1. **Data Prep** (Current) - - ✅ Caption 909/2982 videos complete - - ⏳ Finish remaining 2,073 videos (~72 hours) - - **Total:** ~3 days - -2. **Dataset Curation** (+1 day) - - Select best 100 videos (highest caption quality scores) - - Verify motion diversity (strikes, blocks, staggers, KOs) - - Check for outliers (black frames, duplicates) - -3. **Training Run 1: Baseline** (+1 day) - - 100 videos, 4K steps, rank 128 - - Validate every 500 steps - - **Goal:** Establish baseline quality - -4. **Training Run 2: Hyperparameter Sweep** (+3 days) - - Test LR: [1e-4, 5e-4, 1e-3] - - Test rank: [64, 128, 256] - - **Goal:** Find optimal settings - -5. **Training Run 3: Full Dataset** (+3 days) - - 500-1000 videos, 10K-15K steps - - Best hyperparameters from Run 2 - - **Goal:** Production model - -**Total Timeline:** ~11 days from current state to production model - ---- - -## 🔬 Key Research Insights - -### Why Your Approach Works - -1. 
**Base Model Blind Spot** - - CogVideoX trained on general YouTube/stock footage - - **No combat sports** in training data - - **No strike mechanics** or impact physics - - Generic prompts like "person slapping another person" → garbage results - -2. **Fine-Tuning Fills the Gap** - - Your 3K videos teach the model **powerslap-specific motion priors** - - Model learns: stance → windup → impact → reaction **sequences** - - Captions describe **actual mechanics** in domain-specific language - - After training: Model understands "open-hand slap trajectory" vs. generic "hitting" - -3. **Why Small Data Works** - - TIC-FT paper: **20 samples** can work with proper training - - LoRA adapts efficiently (only ~0.5% parameters updated) - - CogVideoX base model already has strong motion priors - - You're teaching **domain semantics**, not motion from scratch - ---- - -## 🚨 Common Pitfalls to Avoid - -### From the Literature: - -1. **Too Low LoRA Rank** - - ❌ Rank 4: Not sufficient for new domains - - ✅ Rank 64+: Works for specialized content - - ✅ Rank 128: Official recommendation for new concepts - -2. **Wrong Learning Rate** - - ❌ Too high (>1e-3): Unstable, overfitting - - ❌ Too low (<1e-5): Slow convergence, underfitting - - ✅ Sweet spot: 1e-4 to 1e-3 - -3. **Mismatched Precision** - - ❌ Training CogVideoX-5B in FP16 (it was trained in BF16) - - ✅ Use BF16 for 5B, FP16 for 2B - -4. **Bad Captions** - - ❌ Generic: "Two people fighting" - - ✅ Specific: "Competitor delivers overhead slap with full torso rotation, striking opponent's temple. Opponent recoils, head snapping right, eyes closing on impact." - -5. 
**Ignoring Validation** - - ❌ Train blindly for 10K steps - - ✅ Validate every 500-1000 steps with diverse prompts - - ✅ Check for: overfitting, motion quality, prompt adherence - ---- - -## 📊 Expected Outcomes - -### After 100-Video Training: - -**Prompts You Can Generate:** -- "Powerslap competitor winds up and delivers a crushing blow to opponent's face, causing immediate head snap and stagger" -- "Fighter in defensive stance absorbs slap, maintains balance, resets to guard position" -- "Referee steps between competitors after knockout slap, waving off the match" - -**Motion Fidelity:** -- ✅ Accurate strike trajectories -- ✅ Realistic impact physics (head movement, body recoil) -- ✅ Proper stances and weight distribution -- ✅ Camera angles matching professional powerslap footage - -**What Won't Work Yet:** -- ❌ Complex multi-person interactions (>2 fighters) -- ❌ Novel camera angles not in training data -- ❌ Combining powerslap with unrelated backgrounds (underwater powerslap, space powerslap) - -### After 1000-Video Training: - -**Additional Capabilities:** -- ✅ Style variations (different arenas, lighting) -- ✅ Edge cases (slips, fouls, technical issues) -- ✅ Generalization to similar combat sports (boxing hooks, MMA strikes) - ---- - -## 🛠️ Next Steps - -### Immediate (This Week): - -1. **Finish captioning pipeline** (2,073 videos remaining) -2. **Caption quality analysis** - - Plot distribution of caption lengths - - Check for garbage captions (LLaVA hallucinations) - - Verify motion diversity coverage - -3. **Prepare training environment** - ```bash - # Clone finetrainers - git clone https://github.com/huggingface/finetrainers - cd finetrainers - pip install -r requirements.txt - pip install git+https://github.com/huggingface/diffusers - - # Verify H100 access - nvidia-smi - - # Test small training run (10 videos, 500 steps) - ``` - -### Short-term (Next 2 Weeks): - -4. 
**Baseline training run** - - 100 best videos - - Conservative hyperparameters (Option A) - - Validate every 500 steps - -5. **Hyperparameter tuning** - - Learning rate sweep - - LoRA rank experiments - - Document results in `training_logs/` - -6. **Full training run** - - 500-1000 videos - - Best hyperparameters - - Production model checkpoint - -### Long-term (Month 2+): - -7. **Inference optimization** - - Build inference API - - Optimize generation speed (torch.compile, FP8) - - Create prompt templates for common scenarios - -8. **Evaluation suite** - - Human evaluation (motion accuracy, impact realism) - - Automated metrics (FVD, CLIP-score) - - A/B testing vs. base model - -9. **Dataset expansion** - - Use remaining 2K videos - - Curate hard negatives (failed strikes, defensive moves) - - Possibly add synthetic data (base model + augmentation) - ---- - -## 📚 Reference Papers & Repos - -### Papers: -1. **TIC-FT:** arxiv.org/html/2506.00996v2 -2. **CogVideoX:** arxiv.org/abs/2408.06072 -3. **LoRA:** arxiv.org/abs/2106.09685 - -### Code: -1. **Finetrainers:** github.com/huggingface/finetrainers -2. **Diffusers Training:** github.com/huggingface/diffusers/tree/main/examples/cogvideo -3. **Official CogVideo:** github.com/zai-org/CogVideo - -### Models: -1. **CogVideoX-2B:** huggingface.co/THUDM/CogVideoX-2b -2. **CogVideoX-5B:** huggingface.co/THUDM/CogVideoX-5b - ---- - -## 💡 Final Thoughts - -**You're on the right track.** The combination of: -- ✅ 3K domain-specific videos -- ✅ High-quality LLaVA-34B captions -- ✅ H100 infrastructure -- ✅ Powerslap-focused training prompt - -...means you're set up to build a **production-quality powerslap video generation model** that will outperform the base CogVideoX on this domain by orders of magnitude. - -**The literature backs this up:** Even with 20-100 videos, researchers achieve strong domain adaptation. You have 30-150x that amount. 
The main challenge is **hyperparameter tuning** and **caption quality**, both of which are solvable with iteration. - -**Recommended Next Action:** Finish captioning, then run a **quick 10-video, 500-step test** to validate your training pipeline before committing to the full run. This will catch any bugs and give you a sense of training dynamics. - ---- - -**Generated:** Feb 19, 2026, 4:12 AM UTC -**For:** IMaloney1 -**Project:** CogVideoX Powerslap Fine-Tuning
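Addendum: the "caption quality analysis" step above (length distribution, catching LLaVA hallucinations, motion-vocabulary coverage) can be started with a short screening pass. A minimal sketch — the helper names, thresholds, and keyword list are illustrative assumptions, not part of any cited repo:

```python
# Screen captions against the 50-100-word guideline and flag likely garbage.
# All names and thresholds here are illustrative assumptions.

MOTION_TERMS = {"slap", "winds", "windup", "impact", "stagger", "recoil", "stance"}  # assumed vocabulary

def screen_caption(caption: str, lo: int = 50, hi: int = 100) -> str:
    """Bucket a caption by word count; 'ok' means it fits the 50-100-word target."""
    n = len(caption.split())
    if n == 0:
        return "empty"
    if n < lo:
        return "too_short"
    if n > hi:
        return "too_long"
    return "ok"

def has_motion_language(caption: str) -> bool:
    """Cheap proxy for motion coverage: does the caption use any strike vocabulary?"""
    lowered = caption.lower()
    return any(term in lowered for term in MOTION_TERMS)

if __name__ == "__main__":
    sample = {
        "vid_0001": "Competitor winds up and delivers a powerful slap; the opponent staggers. " * 8,
        "vid_0002": "",  # e.g. a captioning failure case
    }
    for vid, text in sample.items():
        print(vid, screen_caption(text), has_motion_language(text))
```

Running this over all ~3K captions before training would surface empty or truncated captions and videos whose captions never mention strike mechanics, which is exactly the curation signal the 100-video baseline run needs.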