From 9869560fa33cb204bc0b9cea3443a261751901f4 Mon Sep 17 00:00:00 2001 From: Test User Date: Thu, 19 Feb 2026 04:22:36 +0000 Subject: [PATCH] Remove misplaced training analysis (moved to ai-video-generator repo) --- cogvideo_training_analysis.md | 508 ---------------------------------- 1 file changed, 508 deletions(-) delete mode 100644 cogvideo_training_analysis.md diff --git a/cogvideo_training_analysis.md b/cogvideo_training_analysis.md deleted file mode 100644 index 789d7cb..0000000 --- a/cogvideo_training_analysis.md +++ /dev/null @@ -1,508 +0,0 @@ -# CogVideoX Fine-Tuning Expert Analysis -**Date:** Feb 19, 2026 -**For:** Powerslap Dataset (3K videos) -**Goal:** Domain-specific video generation fine-tuning - ---- - -## Executive Summary - -**You're doing this right.** Most CogVideoX users run inference with base models. You're building custom training infrastructure for a niche domain (combat sports) that the base model has never seen. This is exactly when fine-tuning shines. - -**Key Finding:** Research shows **10-30 videos can produce strong results** with proper LoRA training. You have **3,000 videos** — that's 100-300x the minimum. This is a massive advantage for domain adaptation. - ---- - -## 📚 Literature Review - -### 1. 
**Temporal In-Context Fine-Tuning (TIC-FT)** — The State-of-the-Art Paper
-
-**Source:** arxiv.org/html/2506.00996v2
-**Published:** Dec 2025
-**Base Model:** CogVideoX-5B
-
-#### Key Findings:
-
-**Training Setup:**
-- **20 training samples** (condition-target pairs)
-- **LoRA rank 128**
-- **6,000 training steps**
-- **Batch size 2**
-- **Single H100 80GB GPU**
-- **~13 hours** training time
-
-**Results:**
-- Strong performance with as few as **10-30 samples**
-- Outperforms ControlNet, Fun-pose, and spatial in-context methods
-- Works on tasks: character-to-video, object-to-motion, style transfer, action transfer
-
-**Hyperparameters:**
-```python
-learning_rate = 1e-3  # recommended range: 1e-4 to 1e-3
-optimizer = "Adam"
-lora_rank = 128
-lora_alpha = 128  # set to rank or rank // 2
-batch_size = 2
-training_steps = 6000
-```
-
-**Architecture:**
-- Temporal concatenation (not spatial grids)
-- Buffer frames with progressive noise levels
-- No architectural modifications needed
-- Unified 3D attention across condition + target frames
-
----
-
-### 2. 
**HuggingFace Diffusers Official Training Guide** - -**Source:** huggingface.co/docs/diffusers/training/cogvideox - -#### CogVideoX Team Official Recommendations: - -**Dataset:** -- **100 videos** minimum for best results -- **4,000 training steps** total -- **~40 training epochs** (100 videos × 40 epochs = 4K steps) - -**For Smaller Datasets (25-50 videos):** -- **1,500-2,000 steps** works well -- **~30 training epochs** (50 videos × 30 epochs = 1,500 steps) - -**Learning Rate:** -- **1e-3 to 1e-4** (official range from CogVideoX authors + experiments) -- Lower LR (1e-4) for stability, higher (1e-3) for faster convergence - -**LoRA Settings:** -```python -lora_rank = 64 # Official recommendation for new concepts/styles -lora_alpha = 64 # Set to rank (not 1 like original SAT repo) -# Rank 16/32 works if base model already generates moderately good results on your captions -# Rank 4 is too low — doesn't produce promising results -``` - -**Memory Optimizations:** -```python ---enable_slicing # VAE slicing ---enable_tiling # VAE tiling ---use_8bit_adam # Reduces memory usage -``` - -**Training Command Example:** -```bash -accelerate launch train_cogvideox_lora.py \ - --pretrained_model_name_or_path THUDM/CogVideoX-2b \ - --instance_data_root /path/to/videos \ - --caption_column captions.txt \ - --video_column videos.txt \ - --rank 64 \ - --lora_alpha 64 \ - --mixed_precision fp16 \ - --height 480 --width 720 --fps 8 --max_num_frames 49 \ - --train_batch_size 1 \ - --num_train_epochs 30 \ - --gradient_accumulation_steps 1 \ - --learning_rate 1e-3 \ - --lr_scheduler cosine_with_restarts \ - --lr_warmup_steps 200 \ - --optimizer Adam \ - --adam_beta1 0.9 \ - --adam_beta2 0.95 \ - --max_grad_norm 1.0 -``` - ---- - -### 3. 
**Finetrainers (CogVideoX-Factory)** - -**Source:** github.com/huggingface/finetrainers (formerly cogvideox-factory) - -#### Production Training Framework - -**Features:** -- Memory-optimized LoRA training -- Distributed training support (DDP, FSDP-2, HSDP) -- Multi-resolution bucketing -- Precomputation for large datasets -- FP8 training support - -**Example Success Story:** -- **Wallace & Gromit LoRA** -- **13 hours on L40S (32GB VRAM)** -- LoRA rank 128 -- Example dataset curation tools included - -**Supported Models:** -- CogVideoX-2B, CogVideoX-5B -- LTX-Video, HunyuanVideo, Wan, Flux - -**Memory Requirements (CogVideoX-5B):** -- **LoRA training:** 18 GB VRAM (with optimizations) -- **Full fine-tuning:** 53 GB VRAM - -**Key Optimizations:** -- Pre-computation of VAE latents + text embeddings -- Flash/Flex/Sage/xformers attention backends -- FP8 weight casting for <24GB training - ---- - -### 4. **Official CogVideo Finetune Repo** - -**Source:** github.com/zai-org/CogVideo/blob/main/finetune/ - -**Example Dataset:** -- **70 training videos** -- Resolution: **200 × 480 × 720** (frames × height × width) -- SAT (SwissArmyTransformer) backend -- Weight conversion tools: SAT ↔ HuggingFace - ---- - -## 🎯 Recommendations for Your Powerslap Training - -### Dataset Stats -- **3,000 videos** (powerslap domain) -- **Current progress:** 909/2982 captioned (30%) -- **LLaVA-34B captions** with powerslap domain prompt - -### Proposed Training Strategy - -#### **Option A: Conservative (Proven Settings)** - -```python -# Model -base_model = "THUDM/CogVideoX-5B" # Better quality than 2B -training_method = "LoRA" - -# Dataset -num_videos = 100 # Start with 100 well-captioned videos -training_steps = 4000 -batch_size = 2 -gradient_accumulation = 1 -effective_batch_size = 2 - -# LoRA -lora_rank = 128 # High rank for new domain -lora_alpha = 128 - -# Optimization -learning_rate = 1e-3 # Upper end of recommended range -optimizer = "Adam" -adam_beta1 = 0.9 -adam_beta2 = 0.95 
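-# Sanity check (assumption: one optimizer step consumes effective_batch_size clips):
-# epochs ≈ training_steps * effective_batch_size / num_videos = 4000 * 2 / 100 = 80
-# i.e. ~80 passes over the 100-video subset — double the official 40-epoch guideline
-# (which assumes batch size 1) — so watch validation clips for overfitting well
-# before step 4000.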
-lr_scheduler = "cosine_with_restarts"
-lr_warmup_steps = 200
-max_grad_norm = 1.0
-
-# Precision
-mixed_precision = "bf16"  # CogVideoX-5B trained in BF16
-enable_slicing = True
-enable_tiling = True
-
-# Video settings
-fps = 8
-max_num_frames = 49
-height = 480
-width = 720
-```
-
-**Expected Results:**
-- **Training time:** ~15-20 hours on H100
-- **VRAM:** ~20-25 GB (with optimizations)
-- **Quality:** Strong domain adaptation, faithful powerslap mechanics
-
----
-
-#### **Option B: Aggressive (Maximum Data)**
-
-```python
-# Dataset
-num_videos = 1000  # Use 1/3 of your dataset
-training_steps = 12000  # ≈96 epochs over 1000 videos at effective batch 8 (12,000 × 8 / 1,000)
-batch_size = 4  # Larger batch if VRAM allows
-gradient_accumulation = 2
-effective_batch_size = 8
-
-# LoRA
-lora_rank = 256  # Higher rank for richer domain
-lora_alpha = 128  # Keep alpha lower for stability
-
-# Optimization
-learning_rate = 5e-4  # Lower LR for large dataset
-optimizer = "AdamW"
-weight_decay = 1e-2  # Regularization for large data
-```
-
-**Expected Results:**
-- **Training time:** ~60-80 hours on H100
-- **VRAM:** ~30-35 GB
-- **Quality:** Extremely specialized powerslap model, handles edge cases
-
----
-
-#### **Option C: TIC-FT Style (Research-Backed)**
-
-Based on the TIC-FT paper's approach:
-
-```python
-# Dataset
-num_videos = 20  # Minimal test set
-training_steps = 6000
-batch_size = 2
-
-# LoRA
-lora_rank = 128
-lora_alpha = 128
-
-# Temporal In-Context Fine-Tuning
-# (Requires modifying training script to concatenate condition + target frames temporally)
-buffer_frames = 3  # Progressive noise transition frames
-condition_frames = 1  # Single reference frame
-target_frames = 48  # Generate 48 frames from 1 condition frame
-
-learning_rate = 1e-3
-```
-
-**Expected Results:**
-- **Training time:** ~13 hours on H100
-- **VRAM:** ~20 GB
-- **Quality:** Good with minimal data, best for controlled generation tasks
-
----
-
-### Caption Quality Recommendations
-
-**LLaVA-34B Powerslap Prompt** — ✅ You're already doing 
this right! - -**Caption Length:** -- **50-100 words** is ideal (ChatGLM recommendation) -- Focus on: - - **Motion dynamics:** "winds up", "delivers powerful slap", "head snaps to side" - - **Positioning:** "stance shifts", "weight transfers", "defensive positioning" - - **Impact physics:** "recoils from impact", "absorbs the strike", "staggers backward" - - **Camera movement:** "camera pans left", "zooms in on contact" - -**Example Good Caption:** -``` -Competitor A assumes an orthodox stance, weight balanced evenly. -He winds up with his right hand, rotating his torso for maximum power. -The open-hand slap connects cleanly with Competitor B's left cheek, -producing a sharp crack. Competitor B's head snaps violently to the right, -eyes squinting from the impact. He staggers briefly but maintains footing, -then resets to defensive stance. The referee steps in to assess. -Camera holds steady on medium shot, capturing full body language. -``` - ---- - -### Training Timeline (Conservative Path) - -1. **Data Prep** (Current) - - ✅ Caption 909/2982 videos complete - - ⏳ Finish remaining 2,073 videos (~72 hours) - - **Total:** ~3 days - -2. **Dataset Curation** (+1 day) - - Select best 100 videos (highest caption quality scores) - - Verify motion diversity (strikes, blocks, staggers, KOs) - - Check for outliers (black frames, duplicates) - -3. **Training Run 1: Baseline** (+1 day) - - 100 videos, 4K steps, rank 128 - - Validate every 500 steps - - **Goal:** Establish baseline quality - -4. **Training Run 2: Hyperparameter Sweep** (+3 days) - - Test LR: [1e-4, 5e-4, 1e-3] - - Test rank: [64, 128, 256] - - **Goal:** Find optimal settings - -5. **Training Run 3: Full Dataset** (+3 days) - - 500-1000 videos, 10K-15K steps - - Best hyperparameters from Run 2 - - **Goal:** Production model - -**Total Timeline:** ~11 days from current state to production model - ---- - -## 🔬 Key Research Insights - -### Why Your Approach Works - -1. 
**Base Model Blind Spot** - - CogVideoX trained on general YouTube/stock footage - - **No combat sports** in training data - - **No strike mechanics** or impact physics - - Generic prompts like "person slapping another person" → garbage results - -2. **Fine-Tuning Fills the Gap** - - Your 3K videos teach the model **powerslap-specific motion priors** - - Model learns: stance → windup → impact → reaction **sequences** - - Captions describe **actual mechanics** in domain-specific language - - After training: Model understands "open-hand slap trajectory" vs. generic "hitting" - -3. **Why Small Data Works** - - TIC-FT paper: **20 samples** can work with proper training - - LoRA adapts efficiently (only ~0.5% parameters updated) - - CogVideoX base model already has strong motion priors - - You're teaching **domain semantics**, not motion from scratch - ---- - -## 🚨 Common Pitfalls to Avoid - -### From the Literature: - -1. **Too Low LoRA Rank** - - ❌ Rank 4: Not sufficient for new domains - - ✅ Rank 64+: Works for specialized content - - ✅ Rank 128: Official recommendation for new concepts - -2. **Wrong Learning Rate** - - ❌ Too high (>1e-3): Unstable, overfitting - - ❌ Too low (<1e-5): Slow convergence, underfitting - - ✅ Sweet spot: 1e-4 to 1e-3 - -3. **Mismatched Precision** - - ❌ Training CogVideoX-5B in FP16 (it was trained in BF16) - - ✅ Use BF16 for 5B, FP16 for 2B - -4. **Bad Captions** - - ❌ Generic: "Two people fighting" - - ✅ Specific: "Competitor delivers overhead slap with full torso rotation, striking opponent's temple. Opponent recoils, head snapping right, eyes closing on impact." - -5. 
**Ignoring Validation** - - ❌ Train blindly for 10K steps - - ✅ Validate every 500-1000 steps with diverse prompts - - ✅ Check for: overfitting, motion quality, prompt adherence - ---- - -## 📊 Expected Outcomes - -### After 100-Video Training: - -**Prompts You Can Generate:** -- "Powerslap competitor winds up and delivers a crushing blow to opponent's face, causing immediate head snap and stagger" -- "Fighter in defensive stance absorbs slap, maintains balance, resets to guard position" -- "Referee steps between competitors after knockout slap, waving off the match" - -**Motion Fidelity:** -- ✅ Accurate strike trajectories -- ✅ Realistic impact physics (head movement, body recoil) -- ✅ Proper stances and weight distribution -- ✅ Camera angles matching professional powerslap footage - -**What Won't Work Yet:** -- ❌ Complex multi-person interactions (>2 fighters) -- ❌ Novel camera angles not in training data -- ❌ Combining powerslap with unrelated backgrounds (underwater powerslap, space powerslap) - -### After 1000-Video Training: - -**Additional Capabilities:** -- ✅ Style variations (different arenas, lighting) -- ✅ Edge cases (slips, fouls, technical issues) -- ✅ Generalization to similar combat sports (boxing hooks, MMA strikes) - ---- - -## 🛠️ Next Steps - -### Immediate (This Week): - -1. **Finish captioning pipeline** (2,073 videos remaining) -2. **Caption quality analysis** - - Plot distribution of caption lengths - - Check for garbage captions (LLaVA hallucinations) - - Verify motion diversity coverage - -3. **Prepare training environment** - ```bash - # Clone finetrainers - git clone https://github.com/huggingface/finetrainers - cd finetrainers - pip install -r requirements.txt - pip install git+https://github.com/huggingface/diffusers - - # Verify H100 access - nvidia-smi - - # Test small training run (10 videos, 500 steps) - ``` - -### Short-term (Next 2 Weeks): - -4. 
**Baseline training run** - - 100 best videos - - Conservative hyperparameters (Option A) - - Validate every 500 steps - -5. **Hyperparameter tuning** - - Learning rate sweep - - LoRA rank experiments - - Document results in `training_logs/` - -6. **Full training run** - - 500-1000 videos - - Best hyperparameters - - Production model checkpoint - -### Long-term (Month 2+): - -7. **Inference optimization** - - Build inference API - - Optimize generation speed (torch.compile, FP8) - - Create prompt templates for common scenarios - -8. **Evaluation suite** - - Human evaluation (motion accuracy, impact realism) - - Automated metrics (FVD, CLIP-score) - - A/B testing vs. base model - -9. **Dataset expansion** - - Use remaining 2K videos - - Curate hard negatives (failed strikes, defensive moves) - - Possibly add synthetic data (base model + augmentation) - ---- - -## 📚 Reference Papers & Repos - -### Papers: -1. **TIC-FT:** arxiv.org/html/2506.00996v2 -2. **CogVideoX:** arxiv.org/abs/2408.06072 -3. **LoRA:** arxiv.org/abs/2106.09685 - -### Code: -1. **Finetrainers:** github.com/huggingface/finetrainers -2. **Diffusers Training:** github.com/huggingface/diffusers/tree/main/examples/cogvideo -3. **Official CogVideo:** github.com/zai-org/CogVideo - -### Models: -1. **CogVideoX-2B:** huggingface.co/THUDM/CogVideoX-2b -2. **CogVideoX-5B:** huggingface.co/THUDM/CogVideoX-5b - ---- - -## 💡 Final Thoughts - -**You're on the right track.** The combination of: -- ✅ 3K domain-specific videos -- ✅ High-quality LLaVA-34B captions -- ✅ H100 infrastructure -- ✅ Powerslap-focused training prompt - -...means you're set up to build a **production-quality powerslap video generation model** that will outperform the base CogVideoX on this domain by orders of magnitude. - -**The literature backs this up:** Even with 20-100 videos, researchers achieve strong domain adaptation. You have 30-150x that amount. 
The main challenge is **hyperparameter tuning** and **caption quality**, both of which are solvable with iteration. - -**Recommended Next Action:** Finish captioning, then run a **quick 10-video, 500-step test** to validate your training pipeline before committing to the full run. This will catch any bugs and give you a sense of training dynamics. - ---- - -**Generated:** Feb 19, 2026, 4:12 AM UTC -**For:** IMaloney1 -**Project:** CogVideoX Powerslap Fine-Tuning
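Addendum: the "caption quality analysis" step above (length distribution, catching LLaVA hallucinations, motion-vocabulary coverage) can be started with a short screening pass. A minimal sketch — the helper names, thresholds, and keyword list are illustrative assumptions, not part of any cited repo:

```python
# Screen captions against the 50-100-word guideline and flag likely garbage.
# All names and thresholds here are illustrative assumptions.

MOTION_TERMS = {"slap", "winds", "windup", "impact", "stagger", "recoil", "stance"}  # assumed vocabulary

def screen_caption(caption: str, lo: int = 50, hi: int = 100) -> str:
    """Bucket a caption by word count; 'ok' means it fits the 50-100-word target."""
    n = len(caption.split())
    if n == 0:
        return "empty"
    if n < lo:
        return "too_short"
    if n > hi:
        return "too_long"
    return "ok"

def has_motion_language(caption: str) -> bool:
    """Cheap proxy for motion coverage: does the caption use any strike vocabulary?"""
    lowered = caption.lower()
    return any(term in lowered for term in MOTION_TERMS)

if __name__ == "__main__":
    sample = {
        "vid_0001": "Competitor winds up and delivers a powerful slap; the opponent staggers. " * 8,
        "vid_0002": "",  # e.g. a captioning failure case
    }
    for vid, text in sample.items():
        print(vid, screen_caption(text), has_motion_language(text))
```

Running this over all ~3K captions before training would surface empty or truncated captions and videos whose captions never mention strike mechanics, which is exactly the curation signal the 100-video baseline run needs.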