LoKR Configuration
⚠️ Experimental: LoKR is less tested than LoRA. The LyCORIS library's
interaction with Lightning Fabric may have rough edges. If you encounter issues, fall back
to LoRA.
Decompose both Kronecker factors
What is "Decompose Both Kronecker Factors"?
Ultra-simple explanation: Make the math even more compressed by breaking
down BOTH parts of the Kronecker product instead of just one.
What's a Kronecker product? A mathematical way to represent large
matrices using smaller ones:
Normal matrix: 1000×1000 = 1 million numbers
Kronecker: (10×10) ⊗ (100×100) = only 10,100 numbers! (~99×
smaller)
Decompose one factor (default):
Large Matrix = A ⊗ B
Where A is kept full, B is decomposed
Decompose both factors (this option):
Large Matrix = (A₁ × A₂) ⊗ (B₁ × B₂)
Where both A and B are decomposed
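The compression numbers above can be checked in a few lines of NumPy. This is an illustrative sketch of the Kronecker-product idea only, not the trainer's actual LoKR code:

```python
import numpy as np

# Two small factors reproduce a 1000x1000 matrix via the Kronecker product.
A = np.random.randn(10, 10)     # 100 numbers
B = np.random.randn(100, 100)   # 10,000 numbers

W = np.kron(A, B)               # shape (1000, 1000)

full_params = W.size                # 1,000,000 if stored directly
factored_params = A.size + B.size   # only 10,100 to learn (~99x fewer)
```

"Decompose both" pushes this further by additionally low-rank-factoring A and B themselves, which is where the extra savings (and the capacity loss) come from.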
Why would you enable this?
Pro: Even more compression → saves VRAM
Pro: Fewer parameters → faster training
Con: Less capacity → might lose quality
Con: More complex math → potential instability
When to enable:
You're extremely tight on VRAM
You want maximum compression
You're doing experiments to see what works
When to leave disabled (default):
First time using LoKR (learn the basics first)
You want maximum quality
You have enough VRAM
Impact:
VRAM: Saves more memory
Quality: May reduce capacity
Speed: Slightly faster training
Use Tucker decomposition
What is Tucker Decomposition?
Ultra-simple explanation: An alternative mathematical method for
compressing the adapter. Like choosing between ZIP and RAR compression formats.
The two methods:
Kronecker decomposition (default): Splits the matrix into two parts: A
⊗ B
Tucker decomposition (this option): Splits it into three parts with a
"core tensor": A × Core × B
Visual analogy:
Kronecker (default):
Large Matrix = [Left part] ⊗ [Right part]
Simple, direct
Tucker (this option):
Large Matrix = [Left] × [Core] × [Right]
Extra "core" in middle
Characteristics of Tucker:
More flexible: Core tensor adds degrees of freedom
Potentially better: Can capture different patterns than Kronecker
More complex: Three components instead of two
Less tested: Kronecker is more common in practice
When to enable:
Experimenting with different compression methods
Kronecker isn't giving good results
You're doing research or advanced optimization
When to leave disabled (default):
First time using LoKR
You want proven, stable methods
Kronecker is working fine
Real talk: This is like choosing between different compression
algorithms. Most people should stick with the default (Kronecker) unless they're
specifically researching decomposition methods.
Impact:
VRAM: Similar to Kronecker
Quality: Different patterns, may be
better/worse
Speed: Similar
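The three-part structure can be sketched with plain NumPy matrix products. The shapes and rank values here are illustrative assumptions, not what LyCORIS actually uses internally:

```python
import numpy as np

# Two-mode Tucker idea: large update = left factor x small core x right factor.
out_dim, in_dim, r1, r2 = 1000, 1000, 8, 8

A = np.random.randn(out_dim, r1)   # left factor
core = np.random.randn(r1, r2)     # the small "core tensor" in the middle
B = np.random.randn(r2, in_dim)    # right factor

W = A @ core @ B                   # reconstructed update, shape (1000, 1000)
params = A.size + core.size + B.size   # 16,064 numbers vs 1,000,000
```

The core tensor is the extra degree of freedom the text mentions: Kronecker has no analogous middle component.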
Use scalar scaling
What is Scalar Scaling?
Ultra-simple explanation: Add a single "volume knob" number that scales
the entire adapter output up or down. Like a master volume control.
Without scalar scaling (default):
Output = LoKR_matrix × Input
Fixed strength
With scalar scaling (this option):
Output = scalar × (LoKR_matrix × Input)
            ↑
   Learned "volume knob"
What it does:
Adds ONE extra learnable number (the scalar)
This scalar multiplies the entire adapter output
The network learns the best "volume" for this adapter
Benefits:
Adaptive strength: Network learns how strongly to apply the adapter
Better balance: Can fine-tune the adapter's influence automatically
Minimal cost: Just one extra number per adapter
Example:
LoKR produces output: [0.5, 1.2, -0.3, ...]
Learned scalar: 0.8
Final output: [0.4, 0.96, -0.24, ...] (everything × 0.8)
When to enable:
You want the network to learn optimal adapter strength
Experimenting with adaptive scaling
Training seems to benefit from dynamic scaling
When to leave disabled (default):
First time using LoKR (keep it simple)
You're using alpha for scaling (redundant)
Training is working fine without it
Real talk: This is a very small experimental feature. The impact is
usually minimal. Only use if you're specifically testing different scaling approaches.
Impact:
VRAM: Negligible (1 number per adapter)
Quality: May improve balance
Speed: Negligible
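The worked example above is literally one multiplication; here it is in NumPy (names like lokr_output are illustrative, not the trainer's API):

```python
import numpy as np

lokr_output = np.array([0.5, 1.2, -0.3])  # what the adapter produced
scalar = 0.8                              # the learned "volume knob"

final_output = scalar * lokr_output       # everything scaled by 0.8
```

During training, scalar would be a single learnable parameter updated by the optimizer alongside the LoKR factors.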
DoRA-style weight decomposition
What is DoRA-style Weight Decomposition?
Ultra-simple explanation: Separate the "direction" and "magnitude" of
weight changes. Like separating "which way to turn" from "how much to turn."
Normal LoRA/LoKR (default):
Weight_new = Weight_original + Adapter_change
    ↑                ↑
Base weights   Changes everything at once
               (direction + magnitude)
DoRA-style (this option):
Weight_new = Magnitude × Direction
    ↑            ↑           ↑
Final weight  How much   Which way
Where:
- Direction = normalized weight direction
- Magnitude = learned scaling factor
Car driving analogy:
Normal method: Learn "turn left 45° AND accelerate to 50mph" as one
combined action
DoRA method: Learn "turn left 45°" (direction) and "go 50mph"
(magnitude) separately
Why DoRA? Research shows separating magnitude from direction can:
Improve training stability
Better preserve base model knowledge
More efficient parameter updates
Sometimes better quality results
The trade-offs:
Pro: Can improve quality and stability
Pro: Better control over how adapter affects model
Con: More complex computation
Con: Slightly slower training (~10-15%)
Con: More memory overhead
When to enable:
You've read research on DoRA and want to try it
Standard LoKR isn't giving good results
You want maximum quality and don't mind slower training
Training is unstable and you want better stability
When to leave disabled (default):
First time using LoKR (learn basics first)
You want fastest training
You're tight on VRAM
Standard approach is working fine
Research context: DoRA (Weight-Decomposed Low-Rank Adaptation) is a 2024
research technique that showed improvements over standard LoRA in some tasks. It's
experimental but promising!
Impact:
VRAM: ~10-20% more memory
Quality: May improve results
Speed: ~10-15% slower training
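The magnitude/direction split can be sketched in NumPy. This shows only the decomposition idea from the formula above, not DoRA's actual training update:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))        # stand-in weight matrix

magnitude = np.linalg.norm(W, axis=0)  # one learned scale per column ("how much")
direction = W / magnitude              # unit-norm columns ("which way")

W_new = magnitude * direction          # Magnitude x Direction recovers W
```

In DoRA training, the direction is updated through a low-rank adapter while the magnitude vector is learned directly, which is what decouples the two.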
CFG & Loss Settings (Base/SFT only)
These settings are auto-disabled for turbo models
Memory Optimization
Gradient Checkpointing (ON by default, saves ~40-60% VRAM, ~10-30% slower)
What is Gradient Checkpointing?
Simple explanation: A memory-saving trick where we throw away some
intermediate calculations and recalculate them later when needed. Uses less VRAM but
takes a bit more time.
Think of it like: Instead of keeping all your scratch paper (uses desk
space), you erase it and redo the math later when grading (takes more time but less
space).
The trade-off:
Saves: 40-60% of VRAM (HUGE savings!)
Costs: 10-30% slower training (you're doing some calculations
twice)
Should you use it?
YES - If you have <24GB VRAM (almost everyone)
YES - If you're running out of memory
MAYBE NO - If you have 80GB+ VRAM and want max speed
Default: ON (matches original Side-Step trainer behavior). Unless you
have massive VRAM and want maximum speed, keep this ON.
Impact:
VRAM: Saves 40-60% (CRITICAL for most users)
Speed: 10-30% slower training
Quality: No impact
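The mechanism behind this toggle is PyTorch's torch.utils.checkpoint. A minimal sketch of the recompute-in-backward idea (not Side-Step's actual code):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
x = torch.randn(2, 16, requires_grad=True)

# Forward: activations inside `layer` are NOT stored (saves memory)...
out = checkpoint(layer, x, use_reentrant=False)

# ...backward: they are recomputed on the fly, then gradients flow as usual.
out.sum().backward()
```

The recomputation in backward is exactly the "redo the math later" cost described above.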
Offload Encoder to CPU (frees ~2-4GB VRAM after setup)
What is Encoder Offloading?
Simple explanation: Some parts of the AI (VAE and text encoders) are
only needed at the start. After setup, we move them to regular RAM instead of keeping
them on the GPU. This frees up GPU memory for training.
Think of it like: Moving reference books from your desk to a bookshelf
after you've taken notes. You already have what you need, so free up the desk space.
How much does it save? About 2-4GB of VRAM (exact amount depends on the
model).
Does it slow things down? NO! These components aren't used during
training anyway, so moving them to CPU has zero speed impact.
Should you use it?
YES - If you have 10-16GB VRAM
YES - If you're getting out-of-memory errors
YES - If you have <10GB VRAM (definitely enable it!)
NO - If you have 24GB+ VRAM (not needed)
Fun fact: This is a "free lunch" optimization - saves memory with no
downsides!
Impact:
VRAM: Frees 2-4GB
Speed: Zero impact
Quality: Zero impact
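The offload pattern looks like this in PyTorch. The text_encoder name is a placeholder stand-in, not the trainer's real attribute:

```python
import torch

text_encoder = torch.nn.Linear(8, 8)   # stands in for the real encoder/VAE

# 1. Do all encoding up front (captions -> embeddings, images -> latents)...
# 2. ...then park the encoder in system RAM:
text_encoder.to("cpu")

if torch.cuda.is_available():
    torch.cuda.empty_cache()           # hand cached VRAM blocks back to the driver
```

Because the cached outputs are reused for the rest of training, the encoder never needs to come back to the GPU.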
Data Loading
Pin Memory (faster GPU transfer, uses more RAM)
What is Pin Memory?
Simple explanation: "Pinning" memory locks loaded data in a special area
of RAM that can transfer to GPU faster. It's like keeping ingredients on the counter
instead of in the pantry.
Technical explanation:
Unpinned (normal) memory: OS can move data around. GPU transfer
needs an extra copy step. Slower but flexible.
Pinned memory: Data is locked in place. GPU can grab it directly
via DMA (Direct Memory Access). Faster but uses more RAM.
The trade-off:
Speed benefit: 10-30% faster data transfer from RAM → GPU
RAM cost: Pinned memory can't be swapped to disk, so it "uses" more
RAM
Should you enable it?
YES (default): If you have 16GB+ system RAM
YES: If using CUDA (NVIDIA GPUs)
MAYBE NO: If you have <16GB RAM and seeing memory pressure
NO: If getting RAM-related crashes
When to disable:
Training on a system with very limited RAM (<8GB)
Running multiple programs while training
Getting "out of memory" errors (RAM, not VRAM)
System becomes unresponsive during training
Platform compatibility:
CUDA (NVIDIA): Works great, recommended
MPS (Apple Silicon): Less benefit but still works
CPU: No benefit (disabled automatically)
Impact:
GPU VRAM: No impact
CPU RAM: Uses more (data can't be swapped)
Speed: 10-30% faster RAM→GPU transfer
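In PyTorch this toggle maps to the DataLoader's pin_memory flag. A minimal sketch (dataset and sizes are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(32, 4))
loader = DataLoader(dataset, batch_size=8, pin_memory=True)

(batch,) = next(iter(loader))
# With a CUDA device present, batches sit in page-locked RAM, so this copy
# can use DMA and overlap with compute:
#   batch = batch.to("cuda", non_blocking=True)
```

pin_memory pairs naturally with non_blocking=True on the host-to-GPU copy; without a CUDA device, PyTorch simply skips pinning.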
Persistent Workers (keep workers alive between epochs)
What are Persistent Workers?
Simple explanation: Keep the data loading workers alive between epochs
instead of killing and restarting them every epoch.
Think of it like: A restaurant with shift workers:
Non-persistent: Fire all prep cooks at end of each lunch service,
hire new ones for dinner. Lots of setup time.
Persistent: Keep the same prep cooks all day. They know the kitchen
and work efficiently.
How it works:
Persistent OFF: At the end of each epoch:
Kill all worker processes
Free their memory
At start of next epoch: spawn new workers
Workers reload dataset metadata
Persistent ON: Workers stay alive between epochs:
No respawning overhead
Dataset stays loaded in worker memory
Faster transition between epochs
Benefits of enabling:
⚡ Faster epoch transitions (no worker spawn time)
⚡ No dataset reloading between epochs
⚡ More efficient when training many epochs
Drawbacks of enabling:
💾 Workers keep RAM allocated between epochs
🐛 If workers have memory leaks, they accumulate
⚠️ Harder to recover from worker crashes
Should you enable it?
YES (default on Linux): If training many epochs (50+)
YES: If you have stable RAM usage (no leaks)
NO (default on Windows): Windows multiprocessing issues
NO: If RAM usage grows over time (memory leak)
NO: If num_workers = 0 (nothing to persist)
When to disable:
RAM usage steadily increases during training (leak)
Workers crash or hang between epochs
Training on Windows (often problematic)
You're doing quick experiments (few epochs)
Platform defaults:
Linux: ON by default (works reliably)
Windows: OFF by default (multiprocessing issues)
Mac: ON by default
⚠️ Note: This setting only matters when num_workers > 0.
With 0 workers, there's nothing to keep persistent.
Impact:
GPU VRAM: No impact
CPU RAM: Workers stay in memory (but not loading
more)
Speed: Faster epoch transitions, no spawn
overhead
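Like pin memory, this is a DataLoader flag in PyTorch. A sketch showing it in context (dataset and sizes are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(16).float())
loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=2,            # must be > 0, or there is nothing to persist
    persistent_workers=True,  # spawn workers once, reuse them every epoch
)

for epoch in range(3):        # no worker respawn between these epochs
    for (batch,) in loader:
        pass
```

With persistent_workers=False (the default), the two workers would be torn down and respawned at each of the three epoch boundaries.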