🚀 Side-Step Queue Manager

Complete training configuration with LoRA/LoKR, Preprocessing++, and job queuing

✨ Side-Step Training Form: Configure training jobs with all adapter types (LoRA, DoRA, LoKR, LoHA, OFT), preprocessing tasks, and comprehensive parameter control. Queue multiple jobs and run them overnight!

Task Type

Model & Device

Path to ACE-Step checkpoints folder
Official: turbo/base/sft. For fine-tunes: exact folder name

Job Identification

Optional: helps you identify this job in the queue
Organizational field - include this in your output directory path if you want it in your folder names

What is Run Name?

Simple explanation: An organizational name for this training run. In the wizard, this gets incorporated into the output path and TensorBoard logs. When using this form, include it manually in your "Output Directory" path.

How to use it:

  • Wizard mode: The wizard automatically adds it to paths
  • CLI/Form mode: Include it in your "Output Directory" field

Example:

  • Run Name: FMNICKS
  • Output Directory: ./output/Lora-FMNICKS-Turbo
  • Result: Easy to identify this run later!

Naming ideas:

  • FMNICKS - Artist/project name
  • rock_music_v1 - Genre + version
  • experiment_rank128 - What you're testing
  • bass_boost_lora - Descriptive feature
  • (leave empty) - No specific run name needed

Pro tip: Use consistent naming across your training runs to stay organized when you have 10+ LoRAs!

Impact:

Organization only - doesn't affect training

Training Paths

💡 Path Format: Use your platform's native format. Windows: C:\folder\subfolder or Linux/Mac: /home/user/folder. Both absolute and relative paths work.
Directory with preprocessed .pt files

What is Dataset Directory?

Simple explanation: The folder containing your preprocessed training data (.pt files and manifest.json).

Path format examples:

  • Windows:
    • E:\ace-step-dataset\FMNICKS\tensors
    • C:\Users\YourName\Documents\my_training_data
    • .\data\tensors
  • Linux/Mac:
    • /home/user/ace-step-dataset/tensors
    • /mnt/data/my_music/preprocessed
    • ./data/tensors

What should be in this folder?

  • manifest.json - Metadata file listing all samples
  • *.pt files - Preprocessed tensor files (one per song)
  • Example: track_001.pt, track_002.pt, etc.

How to get this data:

  • Run preprocessing first (using the wizard or CLI)
  • Point "Audio Directory" to your raw audio files
  • Set "Tensor Output" to where you want .pt files saved
  • After preprocessing, use that same tensor output path here
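Before queuing a job, a folder like this can be sanity-checked with a short script. A minimal sketch using only the facts above (a manifest.json plus *.pt files); the exact manifest schema depends on your preprocessing version, so this only verifies the files exist and the manifest parses:

```python
import json
import tempfile
from pathlib import Path

def validate_dataset_dir(path):
    """Check that a preprocessed dataset folder has a manifest and .pt files."""
    root = Path(path)
    manifest = root / "manifest.json"
    if not manifest.is_file():
        raise FileNotFoundError(f"missing {manifest}")
    json.loads(manifest.read_text())          # must at least be valid JSON
    pt_files = sorted(root.glob("*.pt"))
    if not pt_files:
        raise FileNotFoundError(f"no .pt tensor files found in {root}")
    return len(pt_files)

# Demo on a throwaway folder with fake files
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "manifest.json").write_text(json.dumps({"samples": ["track_001"]}))
    (root / "track_001.pt").write_bytes(b"")
    print(validate_dataset_dir(root))  # 1
```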
Where to save adapter weights and logs

What is Output Directory?

Simple explanation: Where your trained LoRA/LoKR weights will be saved, along with checkpoints and TensorBoard logs.

Path format examples:

  • Windows:
    • E:\ace-step-dataset\FMNICKS\Lora-FMNICKS-Turbo
    • C:\Users\YourName\loras\my_rock_lora
    • .\output\experiment_1
  • Linux/Mac:
    • /home/user/loras/my_lora
    • /mnt/training/output/rock_music_v1
    • ./output/my_lora

What gets saved here:

  • adapter_model.safetensors - Your trained LoRA weights (the final result!)
  • adapter_config.json - Configuration metadata
  • checkpoint-epoch-N/ - Checkpoint folders for resuming training
  • runs/ - TensorBoard log files
  • training_args.json - Record of all settings used

Naming tips:

  • Include run name: Lora-FMNICKS-Turbo
  • Include model variant: my_lora_base
  • Include experiment info: rank128_lr0001
  • Stay organized: ./output/artist/experiment_1

⚠️ Note: This folder will be created if it doesn't exist. Make sure the parent folder has write permissions.

Adapter Type

LoRA Configuration

16=quick tests, 64=recommended, 128=max quality

What is Rank?

Simple explanation: Rank controls how much the AI can learn from your data. Higher rank = more learning capacity, but uses more GPU memory.

Think of it like: A notebook with more pages. Rank 16 is a small notebook, Rank 128 is a thick textbook.

Valid values: Any number from 1 to 256, but common choices are:

  • 16 - Quick experiments, small datasets (1-5 songs)
  • 32 - Small to medium datasets (5-10 songs)
  • 64 - Recommended default for most use cases
  • 128 - Large datasets (50+ songs), maximum quality

Impact:

VRAM: Higher = more VRAM Quality: Higher = better (if you have enough data) Speed: Minimal impact
Usually 2× the rank

What is Alpha?

Simple explanation: Alpha controls how strongly your trained adapter affects the final output. It's like a volume knob for your training.

Rule of thumb: Always set this to 2× your rank value:

  • Rank 16 → Alpha 32
  • Rank 32 → Alpha 64
  • Rank 64 → Alpha 128
  • Rank 128 → Alpha 256

When to change: Almost never! The 2× rule works for 99% of cases. Only advanced users experimenting with specific techniques should adjust this.

Impact:

VRAM: None Quality: Stick to 2× rank rule
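The reason the 2× rule behaves predictably is that LoRA scales its update by alpha ÷ rank, so alpha = 2 × rank always yields the same effective strength regardless of rank. A quick illustration (not Side-Step code):

```python
def lora_scale(rank, alpha):
    """Effective scaling factor LoRA applies to the adapter's update."""
    return alpha / rank

# The 2x rule keeps adapter strength constant no matter which rank you pick:
for rank in (16, 32, 64, 128):
    print(rank, lora_scale(rank, alpha=2 * rank))  # always 2.0
```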
0.1 default, increase for small datasets (0.2-0.3)

What is Dropout?

Simple explanation: Dropout randomly "turns off" some learning during training. This prevents the AI from memorizing your specific songs instead of learning the general style.

Think of it like: Studying with occasional distractions. It forces your brain to really understand the material, not just memorize answers.

Valid values: 0.0 to 0.5 (a fraction, so 0.1 = 10%)

  • 0.05 - Large datasets (50+ songs), very diverse styles
  • 0.1 - Default, works for most cases
  • 0.2-0.3 - Small datasets (1-10 songs), similar songs

Problem it solves: "Overfitting" - when the AI memorizes your training songs instead of learning the style. Signs: perfect on training data, bad on new generations.

Impact:

VRAM: None Quality: Prevents overfitting Speed: None

What is Attention Type?

Simple explanation: The AI has two types of "learning modules" - one that learns about the music itself (self-attention), and one that learns how text prompts connect to music (cross-attention).

Your options:

  • Both (default) - Train everything. The AI learns both the music style AND how to follow your text prompts better.
  • Self-attention only - Only train music patterns (rhythm, instruments, mixing). Text prompts won't improve. Use if: training on instrumental music only.
  • Cross-attention only - Only train prompt understanding. Music style stays the same, but AI gets better at following specific words. Rare use case.

When to change: Stick with "Both" unless you're doing something very specific. Self-only makes sense for instrumental-only training.

Impact:

VRAM: Both=most, Self/Cross=half Quality: Both gives best results
Space-separated projection names

What are Target Modules?

Simple explanation: Inside each attention layer, there are 4 sub-parts (like 4 sub-workers). This setting chooses which sub-parts to train.

The four parts:

  • q_proj - "Query" - asks "what should I pay attention to?"
  • k_proj - "Key" - provides "here's what's important"
  • v_proj - "Value" - holds the actual information
  • o_proj - "Output" - combines everything together

How to write it: List the parts you want to train, separated by spaces:

  • q_proj k_proj v_proj o_proj - Train all 4 (recommended)
  • q_proj v_proj - Train only query and value (uses less VRAM)
  • q_proj k_proj - Just query and key (lightweight)

When to change: Only if you're low on VRAM or doing advanced experiments. Training all 4 gives best results.

Advanced: You can use the "estimate" mode to analyze which modules matter most for YOUR specific music, then train only those.

Impact:

VRAM: More modules = More VRAM Quality: All 4 = Best quality
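A space-separated list like the one above is typically split and validated before being handed to the adapter config. A minimal sketch of that parsing (the set of valid names mirrors this form's four projections; this is illustrative, not a specific library's API):

```python
VALID_MODULES = {"q_proj", "k_proj", "v_proj", "o_proj"}

def parse_target_modules(text):
    """Split a space-separated module string and reject unknown names."""
    modules = text.split()
    unknown = [m for m in modules if m not in VALID_MODULES]
    if unknown:
        raise ValueError(f"unknown target modules: {unknown}")
    return modules

print(parse_target_modules("q_proj k_proj v_proj o_proj"))
# ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```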

What is Bias?

Simple explanation: Bias is like a "starting point" in the math that makes the AI work. This setting decides if we should also train those starting points or leave them alone.

Your options:

  • none - Don't train bias values (default, recommended)
  • lora_only - Train bias in your LoRA adapter only (slight quality improvement)
  • all - Train ALL bias values in the model (rarely needed)

When to change: Stick with "none" for 99% of use cases. Try "lora_only" if you're experimenting and want a tiny potential quality boost.

Impact:

VRAM: Minimal increase Quality: Minimal improvement
Custom projections for self-attention (when attention-type=both)

Advanced: Separate Targets for Self-Attention

What this does: Normally, the same target modules (q_proj, k_proj, etc.) are used for both self-attention and cross-attention. This lets you specify DIFFERENT targets for just the self-attention part.

When to use: Only if you've run gradient estimation and discovered that self-attention benefits from different modules than cross-attention. This is an advanced optimization.

Example: q_proj v_proj for self-attention, while cross-attention uses all 4.

Leave empty unless: You're an advanced user doing targeted optimization based on estimation results.

Custom projections for cross-attention (when attention-type=both)

Advanced: Separate Targets for Cross-Attention

What this does: Lets you specify different target modules for just the cross-attention layers (the part that learns about text prompts).

When to use: Only if estimation shows cross-attention benefits from different modules. Advanced optimization only.

Leave empty unless: You're doing targeted optimization based on gradient estimation.

Training Hyperparameters

Beta 1 default: 3e-4 (was 1e-4 in Alpha)

What is Learning Rate?

Simple explanation: How big of a "step" the AI takes when learning. Too big = learning goes haywire. Too small = takes forever to learn.

Think of it like: Walking speed while learning to navigate. Sprint = fast but might miss things or crash. Crawl = safe but takes ages.

Common values:

  • 0.0001 (or 1e-4) - Standard default, safe for most cases
  • 0.00005 (or 5e-5) - More stable, use if training is unstable
  • 0.0002 (or 2e-4) - Faster learning, riskier
  • 1.0 - ONLY for Prodigy optimizer (it auto-adjusts internally)

How to write it: Can use either format:

  • 0.0001 = 1e-4 (scientific notation)
  • 0.00005 = 5e-5

Signs you need to adjust:

  • Loss jumping all over → Lower LR (try 5e-5)
  • Loss barely changing after many epochs → Higher LR (try 2e-4)

Impact:

VRAM: None Quality: Critical tuning parameter Speed: Higher = faster learning (but less stable)
Usually 1 for music (large tensors)

What is Batch Size?

Simple explanation: How many songs the AI looks at before updating what it learned. Batch size 1 = look at 1 song, update. Batch size 4 = look at 4 songs, then update.

Think of it like: Studying flashcards. Batch 1 = adjust after each card. Batch 10 = see 10 cards, then adjust your strategy.

Why it's usually 1 for music: Music files are HUGE (30-240 seconds of audio). Each song takes a lot of GPU memory. Batch size 2 would need double the memory.

When to change:

  • Keep at 1 - If you're using default settings
  • Try 2 - Only if you have 24GB+ VRAM and want slightly smoother training
  • Never go higher - Unless you have 80GB+ VRAM

Pro tip: Use "Gradient Accumulation" instead! It gives you the benefits of larger batch sizes without using more VRAM.

Impact:

VRAM: Higher = MUCH more VRAM Quality: Minimal impact Speed: Higher = slightly faster per epoch
Effective batch = batch × accumulation

What is Gradient Accumulation?

Simple explanation: A clever trick to get the benefits of large batch sizes WITHOUT using more GPU memory. It "accumulates" learning from multiple songs before updating.

Think of it like: Taking notes while studying 4 flashcards, THEN updating your strategy once. Same learning benefit as batch size 4, but you only hold 1 card at a time.

How it works:

  • Batch size 1, Grad accumulation 4 = "Effective batch of 4"
  • Look at song 1, take notes
  • Look at song 2, add to notes
  • Look at song 3, add to notes
  • Look at song 4, add to notes
  • NOW update the model with all 4 songs' worth of info

Common values:

  • 4 - Good default (effective batch of 4)
  • 8 - Smoother learning, slower updates
  • 2 - Faster updates, less smooth
  • 1 - No accumulation (update after every song)

Impact:

VRAM: ZERO impact (that's the magic!) Quality: Higher = smoother, more stable learning Speed: Higher = fewer updates per epoch (slightly slower)
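The accumulate-then-update cycle above can be sketched as a plain loop, with toy numbers standing in for real gradients (hypothetical helper, not the trainer's code):

```python
def train_with_accumulation(gradients, accum_steps):
    """Average `accum_steps` gradients before each update; return the updates."""
    updates, buffer = [], []
    for g in gradients:
        buffer.append(g)                               # "take notes" for this song
        if len(buffer) == accum_steps:                 # every N songs...
            updates.append(sum(buffer) / len(buffer))  # ...one averaged update
            buffer.clear()
    return updates

# 8 songs with accumulation 4 -> only 2 optimizer updates (effective batch of 4)
print(train_with_accumulation([1, 2, 3, 4, 5, 6, 7, 8], accum_steps=4))  # [2.5, 6.5]
```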
1-10 songs: 200-500, 10-50: 100-200, 50+: 50-100

What are Epochs?

Simple explanation: How many times the AI goes through your entire dataset. 1 epoch = seeing every song once. 100 epochs = seeing every song 100 times.

Think of it like: How many times you review your study materials. 1 pass = skim once. 100 passes = really know it well.

How many should you use? Depends on dataset size:

  • 1-10 songs: 200-500 epochs (small dataset needs lots of repetition)
  • 10-50 songs: 100-200 epochs
  • 50+ songs: 50-100 epochs (large dataset needs fewer passes)

How do you know when to stop?

  • Watch the loss curve in TensorBoard
  • If loss stops decreasing = you can probably stop
  • If loss starts INCREASING = you've overtrained (stop earlier next time)

Can you stop early? YES! If you're watching TensorBoard and loss plateaus at epoch 60, you can stop. No need to waste time on 40 more epochs.

Impact:

VRAM: None Quality: More = better (until overfitting) Speed: More = longer training time
LR ramps from 10% to 100%

What are Warmup Steps?

Simple explanation: Instead of starting at full learning rate immediately, we gradually increase it over the first N steps. Like warming up before exercise.

Why? Starting at full learning rate can sometimes cause the AI to "freak out" and learn bad patterns in the first few steps. Warmup prevents this.

How it works:

  • Step 1-10: Learning rate at 10%
  • Step 11-50: Gradually increase to 50%
  • Step 51-100: Gradually increase to 100%
  • Step 101+: Stay at 100% (or follow scheduler)

Common values:

  • 100 - Good default
  • 200-500 - Use if training is unstable at the start
  • 0 - No warmup (riskier)

Impact:

VRAM: None Quality: Prevents early instability Speed: Negligible
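The ramp can be written as a single function of the step number. A simple linear-warmup sketch matching the 10%-to-100% description above (real schedulers vary in the exact curve):

```python
def warmup_lr(step, base_lr=1e-4, warmup_steps=100, floor=0.1):
    """Linearly ramp the LR from 10% of base up to 100% over warmup_steps."""
    if step >= warmup_steps:
        return base_lr                      # warmup done: full learning rate
    frac = floor + (1.0 - floor) * (step / warmup_steps)
    return base_lr * frac

print(warmup_lr(0))    # 10% of base LR
print(warmup_lr(50))   # a bit over half, midway through the ramp
print(warmup_lr(100))  # full LR from here on
```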
L2 regularization (0.01 default)
Gradient clipping threshold
Random latent window per iteration (0=disabled, 60-90=recommended). Saves VRAM + data augmentation.

What is Latent Chunking?

Simple explanation: Instead of training on the entire song every time, randomly pick a shorter section (e.g., 60-90 seconds). This provides two major benefits: saves VRAM and acts as data augmentation.

How it works:

  • Each training iteration, a random window is extracted from each song
  • The model sees different parts of the same song every epoch
  • Like practicing random 60-second clips instead of the full 3-minute song

Benefits:

  • Saves VRAM: Shorter audio = much less memory needed
  • Data augmentation: Same song becomes multiple training samples (different random chunks each iteration)
  • Prevents overfitting: AI can't memorize full songs when it only sees random parts
  • Enables longer songs: Train on 5-minute songs even with limited VRAM

⚠️ CRITICAL WARNING:

Chunks shorter than 60 seconds can HURT training quality instead of enriching it. Use shorter chunks only if you need to reduce VRAM and understand the trade-off. The AI needs enough context to learn musical structure.

Recommended values:

  • 0 - Disabled (use full songs)
  • 60 - Minimum recommended for quality (1 minute chunks)
  • 90 - Sweet spot - good quality + VRAM savings (1.5 minute chunks)
  • 120 - Conservative option (2 minute chunks)

When to use:

  • Enable (60-90): Training on songs longer than 2 minutes
  • Enable (60-90): Running out of VRAM with full-length songs
  • Enable (90-120): Want extra data augmentation on long songs
  • Disable (0): Your songs are already short (under 2 minutes)
  • Disable (0): You have plenty of VRAM (24GB+) and want to train on full songs

Example scenarios:

  • 3-minute songs, 16GB VRAM: Use 90 seconds
  • 5-minute songs, 12GB VRAM: Use 60 seconds
  • 1-minute songs, any VRAM: Disable (use 0)
  • 4-minute songs, 24GB VRAM: Optional - use 90-120 for augmentation or 0 for full songs

Technical note: This slices the preprocessed latent tensors (not raw audio), so chunks are calculated in latent space. The number you enter is approximate seconds of the original audio.

Impact:

VRAM: Significant savings (proportional to chunk size) Quality: 60+ = safe, <60 = risky Speed: Faster per step (smaller chunks) Data Aug: More variety in training
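The random-window idea can be sketched in a few lines, with a plain list standing in for a latent tensor. The frames-per-second value is an assumption for illustration only; as the technical note says, the real latent rate depends on the model:

```python
import random

def random_chunk(latent_frames, chunk_seconds, frames_per_second=25):
    """Slice a random window of ~chunk_seconds out of a latent sequence.

    frames_per_second is an assumed latent rate, purely for illustration.
    """
    window = chunk_seconds * frames_per_second
    if chunk_seconds == 0 or window >= len(latent_frames):
        return latent_frames             # disabled, or song shorter than window
    start = random.randrange(len(latent_frames) - window + 1)
    return latent_frames[start:start + window]

song = list(range(180 * 25))                  # a 3-minute song in latent frames
chunk = random_chunk(song, chunk_seconds=90)  # a different window every call
print(len(chunk))                             # 2250 frames (~90 s)
```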
Auto-set: turbo=3.0, base/sft=1.0. Stored for inference, doesn't affect training
Auto-set: turbo=8, base/sft=50. Stored for inference

CFG & Loss Settings (Base/SFT only)

These settings are auto-disabled for turbo models
Null-condition probability. Base trained with 0.15 - match it
Try min_snr if base/sft output sounds mushy
Aggressiveness of min-SNR (only with min_snr)

Optimizer & Scheduler

What is an Optimizer?

Simple explanation: The optimizer is the "brain" that decides HOW to update the AI based on what it learned. Different optimizers have different strategies and memory requirements.

Your choices:

  • AdamW (default) - Industry standard. Reliable, well-tested, works for 99% of cases. Needs moderate VRAM for its "memory" of past training.
  • AdamW8bit - Same as AdamW but uses compressed memory. Saves ~30% of optimizer VRAM with virtually no quality loss. USE THIS if you're running out of VRAM!
  • AdaFactor - Uses minimal memory by being "forgetful" about past training. Only for extreme VRAM constraints (<8GB). Slightly worse quality.
  • Prodigy - Experimental "autopilot" that figures out the best learning rate automatically. Set LR to 1.0 and let it tune itself. Can be unstable.

Recommendation:

  • Have 16GB+ VRAM? Use AdamW
  • Have 10-16GB VRAM? Use AdamW8bit
  • Have <10GB VRAM? Use AdamW8bit or AdaFactor
  • Want to experiment? Try Prodigy (but set LR to 1.0!)

Impact:

VRAM: 8bit saves 30%, AdaFactor saves 50% Quality: AdamW=AdamW8bit > AdaFactor, Prodigy varies Speed: All roughly similar
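The recommendation list above can be captured in a tiny helper. This just encodes this section's rules of thumb as code; it is not Side-Step's actual selection logic, and the optimizer names are the form's labels:

```python
def recommend_optimizer(vram_gb, experimenting=False):
    """Pick an optimizer name from available VRAM, per the rules of thumb above."""
    if experimenting:
        return "prodigy"      # remember: set the learning rate to 1.0!
    if vram_gb >= 16:
        return "adamw"
    if vram_gb >= 10:
        return "adamw8bit"
    return "adamw8bit"        # or "adafactor" under extreme constraints (<8 GB)

print(recommend_optimizer(24))  # adamw
print(recommend_optimizer(12))  # adamw8bit
```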
Prodigy auto-forces 'constant'

What is a Scheduler?

Simple explanation: Controls how the learning rate changes over time. Think of it like adjusting your study intensity as the semester progresses.

Visual analogy:

  • Constant: Study at same intensity every day --------
  • Linear: Gradually study less and less ----\
  • Cosine: Study hard, then taper off smoothly ----\___
  • Cosine with Restarts: Periodic "sprints" --\__/--\__/

Your choices:

  • Cosine (default) - Smooth curve down. Starts strong, ends gentle. Best for most cases.
  • Linear - Straight line down. Predictable, but less commonly used.
  • Constant - Never change LR. Required for Prodigy optimizer. Rarely used otherwise.
  • Constant with Warmup - Warm up at start, then stay flat. Simple and effective.
  • Cosine with Restarts - Periodic "bounces" back to higher LR. Advanced, can help escape plateaus.

Which to choose:

  • Default case: Use Cosine
  • Using Prodigy: It forces Constant automatically
  • Want simplicity: Constant with Warmup
  • Experimenting: Try Cosine with Restarts

Impact:

VRAM: None Quality: Cosine generally best Speed: None
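The cosine curve is just a half-cosine mapped onto the training steps. A minimal sketch of the shape (library implementations typically add warmup and a minimum LR on top of this):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4):
    """Smoothly decay the LR from base_lr down to ~0 over total_steps."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(cosine_lr(0, total))      # full LR at the start
print(cosine_lr(500, total))    # half LR at the midpoint
print(cosine_lr(1000, total))   # ~0 at the end
```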

Memory Optimization

What is Gradient Checkpointing?

Simple explanation: A memory-saving trick where we throw away some intermediate calculations and recalculate them later when needed. Uses less VRAM but takes a bit more time.

Think of it like: Instead of keeping all your scratch paper (uses desk space), you erase it and redo the math later when grading (takes more time but less space).

The trade-off:

  • Saves: 40-60% of VRAM (HUGE savings!)
  • Costs: 10-30% slower training (you're doing some calculations twice)

Should you use it?

  • YES - If you have <24GB VRAM (almost everyone)
  • YES - If you're running out of memory
  • MAYBE NO - If you have 80GB+ VRAM and want max speed

Default: ON (matches original Side-Step trainer behavior). Unless you have massive VRAM and want maximum speed, keep this ON.

Impact:

VRAM: Saves 40-60% (CRITICAL for most users) Speed: 10-30% slower training Quality: No impact

What is Encoder Offloading?

Simple explanation: Some parts of the AI (VAE and text encoders) are only needed at the start. After setup, we move them to regular RAM instead of keeping them on the GPU. This frees up GPU memory for training.

Think of it like: Moving reference books from your desk to a bookshelf after you've taken notes. You already have what you need, so free up the desk space.

How much does it save? About 2-4GB of VRAM (exact amount depends on the model).

Does it slow things down? NO! These components aren't used during training anyway, so moving them to CPU has zero speed impact.

Should you use it?

  • YES - If you have 10-16GB VRAM
  • YES - If you're getting out-of-memory errors
  • YES - If you have <10GB VRAM (definitely enable it!)
  • NO - If you have 24GB+ VRAM (not needed)

Fun fact: This is a "free lunch" optimization - saves memory with no downsides!

Impact:

VRAM: Frees 2-4GB Speed: Zero impact Quality: Zero impact

Data Loading

Default: 4 (Linux), 0 (Windows). Set to 0 if low on RAM

What are Data Loading Workers?

Simple explanation: "Workers" are separate helper processes that load your training data from disk while the GPU is busy training. More workers = faster data loading, but uses more RAM.

Think of it like: A restaurant kitchen with multiple prep cooks (workers). While the head chef (GPU) is cooking, prep cooks are chopping ingredients (loading data) so the chef never has to wait.

How it works:

  • 0 workers: Main process does everything. GPU waits while data loads from disk.
  • 1 worker: One helper process loads next batch while GPU trains on current batch.
  • 4 workers: Four helpers load data in parallel. GPU almost never waits.
  • 8+ workers: Even more parallel loading, but diminishing returns.

Platform defaults:

  • Linux/Mac: Default 4 workers (multiprocessing works well)
  • Windows: Default 0 workers (Windows multiprocessing has issues)

Recommended values:

  • 0 - Windows (default), or very low RAM (<8GB)
  • 2 - Dual-core CPU, moderate RAM (8-16GB)
  • 4 - Quad-core+ CPU, good RAM (16GB+) - recommended for Linux
  • 6-8 - High-end CPU, lots of RAM (32GB+)

When to increase:

  • You have a fast GPU but slow disk (SSD helps too!)
  • TensorBoard shows GPU utilization dropping between batches
  • You have lots of CPU cores and RAM to spare

When to decrease (or use 0):

  • On Windows: Multiprocessing can cause crashes/hangs
  • Low RAM: Each worker loads data into memory
  • Memory leaks: If RAM usage grows over time
  • Crashes during data loading: Try 0 to debug

⚠️ Windows Warning: Windows has known issues with PyTorch multiprocessing. If training hangs at data loading or you get errors, set this to 0.

Impact:

GPU VRAM: No impact CPU RAM: Each worker uses ~2-4GB Speed: More workers = faster (if you have RAM)
Default: 2 (Linux), 0 (Windows)

What is Prefetch Factor?

Simple explanation: How many batches AHEAD each worker loads into memory. Higher = smoother training but uses more RAM.

Think of it like: A buffer at a streaming video service. Prefetch factor 2 = load 2 batches ahead. If GPU suddenly speeds up, you have data ready immediately.

How it works:

  • Prefetch 0: Workers only load when GPU asks. Can cause waiting.
  • Prefetch 2: Each worker keeps 2 batches loaded ahead. Smooth flow.
  • Prefetch 4+: Even more buffer, but uses lots of RAM.

Math: Total memory used = num_workers × prefetch_factor × batch_size × data_size

  • 4 workers × 2 prefetch × 1 batch × 500MB per song = 4GB RAM used for prefetching
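That estimate is easy to compute up front with the same formula. The 500 MB-per-song default is an assumption that depends on your audio lengths:

```python
def prefetch_ram_mb(num_workers, prefetch_factor, batch_size, sample_mb=500):
    """Estimate RAM held by prefetched batches (sample_mb is an assumed size)."""
    return num_workers * prefetch_factor * batch_size * sample_mb

print(prefetch_ram_mb(4, 2, 1))  # 4000 MB: the example above
print(prefetch_ram_mb(8, 4, 1))  # 16000 MB: why 8 workers + prefetch 4 gets costly
```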

Recommended values:

  • 0 - Windows (default, since num_workers is 0)
  • 2 - Good default for Linux/Mac (balances speed and RAM)
  • 4 - If you have lots of RAM (32GB+)
  • 1 - If running out of RAM

When to increase:

  • GPU utilization is inconsistent (drops between batches)
  • You have RAM to spare
  • Data loading is still a bottleneck despite workers

When to decrease:

  • Running out of RAM (system memory, not VRAM)
  • Getting "out of memory" errors during data loading
  • Training on a system with <16GB RAM

⚠️ Important: This only matters when num_workers > 0. If workers = 0, prefetch factor is ignored.

Platform notes:

  • Windows: Usually 0 (since num_workers = 0)
  • Linux/Mac: Usually 2 (with 4 workers)

Impact:

GPU VRAM: No impact CPU RAM: Higher = more RAM per worker Speed: Smoother training flow, fewer GPU stalls

What is Pin Memory?

Simple explanation: "Pinning" memory locks loaded data in a special area of RAM that can transfer to GPU faster. It's like keeping ingredients on the counter instead of in the pantry.

Technical explanation:

  • Unpinned (normal) memory: OS can move data around. GPU transfer needs an extra copy step. Slower but flexible.
  • Pinned memory: Data is locked in place. GPU can grab it directly via DMA (Direct Memory Access). Faster but uses more RAM.

The trade-off:

  • Speed benefit: 10-30% faster data transfer from RAM → GPU
  • RAM cost: Pinned memory can't be swapped to disk, so it "uses" more RAM

Should you enable it?

  • YES (default): If you have 16GB+ system RAM
  • YES: If using CUDA (NVIDIA GPUs)
  • MAYBE NO: If you have <16GB RAM and seeing memory pressure
  • NO: If getting RAM-related crashes

When to disable:

  • Training on a system with very limited RAM (<8GB)
  • Running multiple programs while training
  • Getting "out of memory" errors (RAM, not VRAM)
  • System becomes unresponsive during training

Platform compatibility:

  • CUDA (NVIDIA): Works great, recommended
  • MPS (Apple Silicon): Less benefit but still works
  • CPU: No benefit (disabled automatically)

Impact:

GPU VRAM: No impact CPU RAM: Uses more (data can't be swapped) Speed: 10-30% faster RAM→GPU transfer

What are Persistent Workers?

Simple explanation: Keep the data loading workers alive between epochs instead of killing and restarting them every epoch.

Think of it like: A restaurant with shift workers:

  • Non-persistent: Fire all prep cooks at end of each lunch service, hire new ones for dinner. Lots of setup time.
  • Persistent: Keep the same prep cooks all day. They know the kitchen and work efficiently.

How it works:

  • Persistent OFF: At the end of each epoch:
    • Kill all worker processes
    • Free their memory
    • At start of next epoch: spawn new workers
    • Workers reload dataset metadata
  • Persistent ON: Workers stay alive between epochs:
    • No respawning overhead
    • Dataset stays loaded in worker memory
    • Faster transition between epochs

Benefits of enabling:

  • ⚡ Faster epoch transitions (no worker spawn time)
  • ⚡ No dataset reloading between epochs
  • ⚡ More efficient when training many epochs

Drawbacks of enabling:

  • 💾 Workers keep RAM allocated between epochs
  • 🐛 If workers have memory leaks, they accumulate
  • 🐛 Harder to recover from worker crashes

Should you enable it?

  • YES (default on Linux): If training many epochs (50+)
  • YES: If you have stable RAM usage (no leaks)
  • NO (default on Windows): Windows multiprocessing issues
  • NO: If RAM usage grows over time (memory leak)
  • NO: If num_workers = 0 (nothing to persist)

When to disable:

  • RAM usage steadily increases during training (leak)
  • Workers crash or hang between epochs
  • Training on Windows (often problematic)
  • You're doing quick experiments (few epochs)

Platform defaults:

  • Linux: ON by default (works reliably)
  • Windows: OFF by default (multiprocessing issues)
  • Mac: ON by default

⚠️ Note: This setting only matters when num_workers > 0. With 0 workers, there's nothing to keep persistent.

Impact:

GPU VRAM: No impact CPU RAM: Workers stay in memory (but not loading more) Speed: Faster epoch transitions, no spawn overhead
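Taken together, the four data-loading settings above map onto PyTorch DataLoader arguments. A sketch of the platform-aware defaults this section describes, written as a plain dict so it runs without PyTorch installed (the exact defaults Side-Step applies may differ):

```python
import sys

def dataloader_kwargs(platform=None):
    """Suggested DataLoader settings per this section's platform defaults."""
    platform = platform or sys.platform
    if platform.startswith("win"):
        # Windows multiprocessing is fragile: load everything in the main process.
        # prefetch_factor is omitted because it only applies when num_workers > 0.
        return {"num_workers": 0, "pin_memory": True,
                "persistent_workers": False}
    # Linux/Mac: parallel loading with a small prefetch buffer
    return {"num_workers": 4, "prefetch_factor": 2, "pin_memory": True,
            "persistent_workers": True}

print(dataloader_kwargs("linux"))
print(dataloader_kwargs("win32"))
```

You would then pass these through as `torch.utils.data.DataLoader(dataset, batch_size=1, **kwargs)`; note that PyTorch rejects a `prefetch_factor` when `num_workers` is 0, which is why the Windows dict leaves it out.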

Checkpointing & Logging

TensorBoard basic metrics (loss, LR)
Per-layer gradient norms (expensive)
Custom path for TensorBoard logs
Path to checkpoint-epoch-N directory

Global Options

What is UV?

Simple explanation: UV is a fast Python package manager that Side-Step uses. It's like pip but much faster (10-100x).

Why it's checked by default:

  • Side-Step documentation recommends using UV
  • Ensures packages are in sync with project requirements
  • Much faster than standard pip

Command difference:

  • With UV (checked): uv run python train.py ...
  • Without UV: python train.py ...

When to disable:

  • You don't have UV installed
  • You're using standard pip/virtualenv
  • You know you don't need it

To install UV:

pip install uv

Impact:

Just changes the command prefix

Queue Execution Options

📭 No jobs in queue. Configure a job and click "Add to Queue"

Generated Queue Script