🚀 Side-Step Queue Manager

Complete training configuration with LoRA/LoKR, Preprocessing++, and job queuing

✨ Side-Step Training Form: Configure training jobs with all adapter types (LoRA, DoRA, LoKR, LoHA, OFT), preprocessing tasks, and comprehensive parameter control. Queue multiple jobs and run them overnight!

Task Type

Model & Device

Path to ACE-Step checkpoints folder
Official: turbo/base/sft. For fine-tunes: exact folder name

Job Identification

Optional: helps you identify this job in the queue
Organizational field - include this in your output directory path if you want it in your folder names

What is Run Name?

Simple explanation: An organizational name for this training run. In the wizard, this gets incorporated into the output path and TensorBoard logs. When using this form, include it manually in your "Output Directory" path.

How to use it:

  • Wizard mode: The wizard automatically adds it to paths
  • CLI/Form mode: Include it in your "Output Directory" field

Example:

  • Run Name: FMNICKS
  • Output Directory: ./output/Lora-FMNICKS-Turbo
  • Result: Easy to identify this run later!

Naming ideas:

  • FMNICKS - Artist/project name
  • rock_music_v1 - Genre + version
  • experiment_rank128 - What you're testing
  • bass_boost_lora - Descriptive feature
  • (leave empty) - No specific run name needed

Pro tip: Use consistent naming across your training runs to stay organized when you have 10+ LoRAs!

Impact:

Organization only - doesn't affect training

Training Paths

💡 Path Format: Use your platform's native format. Windows: C:\folder\subfolder or Linux/Mac: /home/user/folder. Both absolute and relative paths work.
Directory with preprocessed .pt files

What is Dataset Directory?

Simple explanation: The folder containing your preprocessed training data (.pt files and manifest.json).

Path format examples:

  • Windows:
    • E:\ace-step-dataset\FMNICKS\tensors
    • C:\Users\YourName\Documents\my_training_data
    • .\data\tensors
  • Linux/Mac:
    • /home/user/ace-step-dataset/tensors
    • /mnt/data/my_music/preprocessed
    • ./data/tensors

What should be in this folder?

  • manifest.json - Metadata file listing all samples
  • *.pt files - Preprocessed tensor files (one per song)
  • Example: track_001.pt, track_002.pt, etc.

How to get this data:

  • Run preprocessing first (using the wizard or CLI)
  • Point "Audio Directory" to your raw audio files
  • Set "Tensor Output" to where you want .pt files saved
  • After preprocessing, use that same tensor output path here
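Before queuing a job, a folder like this can be sanity-checked with a short script. A minimal sketch using only the facts above (a manifest.json plus *.pt files); the exact manifest schema depends on your preprocessing version, so this only verifies the files exist and the manifest parses:

```python
import json
import tempfile
from pathlib import Path

def validate_dataset_dir(path):
    """Check that a preprocessed dataset folder has a manifest and .pt files."""
    root = Path(path)
    manifest = root / "manifest.json"
    if not manifest.is_file():
        raise FileNotFoundError(f"missing {manifest}")
    json.loads(manifest.read_text())          # must at least be valid JSON
    pt_files = sorted(root.glob("*.pt"))
    if not pt_files:
        raise FileNotFoundError(f"no .pt tensor files found in {root}")
    return len(pt_files)

# Demo on a throwaway folder with fake files
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "manifest.json").write_text(json.dumps({"samples": ["track_001"]}))
    (root / "track_001.pt").write_bytes(b"")
    print(validate_dataset_dir(root))  # 1
```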
Where to save adapter weights and logs

What is Output Directory?

Simple explanation: Where your trained LoRA/LoKR weights will be saved, along with checkpoints and TensorBoard logs.

Path format examples:

  • Windows:
    • E:\ace-step-dataset\FMNICKS\Lora-FMNICKS-Turbo
    • C:\Users\YourName\loras\my_rock_lora
    • .\output\experiment_1
  • Linux/Mac:
    • /home/user/loras/my_lora
    • /mnt/training/output/rock_music_v1
    • ./output/my_lora

What gets saved here:

  • adapter_model.safetensors - Your trained LoRA weights (the final result!)
  • adapter_config.json - Configuration metadata
  • checkpoint-epoch-N/ - Checkpoint folders for resuming training
  • runs/ - TensorBoard log files
  • training_args.json - Record of all settings used

Naming tips:

  • Include run name: Lora-FMNICKS-Turbo
  • Include model variant: my_lora_base
  • Include experiment info: rank128_lr0001
  • Stay organized: ./output/artist/experiment_1

⚠️ Note: This folder will be created if it doesn't exist. Make sure the parent folder has write permissions.

Adapter Type

LoRA Configuration

16=quick tests, 64=recommended, 128=max quality

What is Rank?

Simple explanation: Rank controls how much the AI can learn from your data. Higher rank = more learning capacity, but uses more GPU memory.

Think of it like: A notebook with more pages. Rank 16 is a small notebook, Rank 128 is a thick textbook.

Valid values: Any number from 1 to 256, but common choices are:

  • 16 - Quick experiments, small datasets (1-5 songs)
  • 32 - Small to medium datasets (5-10 songs)
  • 64 - Recommended default for most use cases
  • 128 - Large datasets (50+ songs), maximum quality

Impact:

VRAM: Higher = more VRAM Quality: Higher = better (if you have enough data) Speed: Minimal impact
Usually 2× the rank

What is Alpha?

Simple explanation: Alpha controls how strongly your trained adapter affects the final output. It's like a volume knob for your training.

Rule of thumb: Always set this to 2× your rank value:

  • Rank 16 → Alpha 32
  • Rank 32 → Alpha 64
  • Rank 64 → Alpha 128
  • Rank 128 → Alpha 256

When to change: Almost never! The 2× rule works for 99% of cases. Only advanced users experimenting with specific techniques should adjust this.

Impact:

VRAM: None Quality: Stick to 2× rank rule
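The reason the 2× rule behaves predictably is that LoRA scales its update by alpha ÷ rank, so alpha = 2 × rank always yields the same effective strength regardless of rank. A quick illustration (not Side-Step code):

```python
def lora_scale(rank, alpha):
    """Effective scaling factor LoRA applies to the adapter's update."""
    return alpha / rank

# The 2x rule keeps adapter strength constant no matter which rank you pick:
for rank in (16, 32, 64, 128):
    print(rank, lora_scale(rank, alpha=2 * rank))  # always 2.0
```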
0.1 default, increase for small datasets (0.2-0.3)

What is Dropout?

Simple explanation: Dropout randomly "turns off" some learning during training. This prevents the AI from memorizing your specific songs instead of learning the general style.

Think of it like: Studying with occasional distractions. It forces your brain to really understand the material, not just memorize answers.

Valid values: 0.0 to 0.5 (a fraction, so 0.1 = 10%)

  • 0.05 - Large datasets (50+ songs), very diverse styles
  • 0.1 - Default, works for most cases
  • 0.2-0.3 - Small datasets (1-10 songs), similar songs

Problem it solves: "Overfitting" - when the AI memorizes your training songs instead of learning the style. Signs: perfect on training data, bad on new generations.

Impact:

VRAM: None Quality: Prevents overfitting Speed: None

What is Attention Type?

Simple explanation: The AI has two types of "learning modules" - one that learns about the music itself (self-attention), and one that learns how text prompts connect to music (cross-attention).

Your options:

  • Both (default) - Train everything. The AI learns both the music style AND how to follow your text prompts better.
  • Self-attention only - Only train music patterns (rhythm, instruments, mixing). Text prompts won't improve. Use if: training on instrumental music only.
  • Cross-attention only - Only train prompt understanding. Music style stays the same, but AI gets better at following specific words. Rare use case.

When to change: Stick with "Both" unless you're doing something very specific. Self-only makes sense for instrumental-only training.

Impact:

VRAM: Both=most, Self/Cross=half Quality: Both gives best results
Space-separated projection names

What are Target Modules?

Simple explanation: Inside each attention layer, there are 4 sub-parts (like 4 sub-workers). This setting chooses which sub-parts to train.

The four parts:

  • q_proj - "Query" - asks "what should I pay attention to?"
  • k_proj - "Key" - provides "here's what's important"
  • v_proj - "Value" - holds the actual information
  • o_proj - "Output" - combines everything together

How to write it: List the parts you want to train, separated by spaces:

  • q_proj k_proj v_proj o_proj - Train all 4 (recommended)
  • q_proj v_proj - Train only query and value (uses less VRAM)
  • q_proj k_proj - Just query and key (lightweight)

When to change: Only if you're low on VRAM or doing advanced experiments. Training all 4 gives best results.

Advanced: You can use the "estimate" mode to analyze which modules matter most for YOUR specific music, then train only those.

Impact:

VRAM: More modules = More VRAM Quality: All 4 = Best quality
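A space-separated list like the one above is typically split and validated before being handed to the adapter config. A minimal sketch of that parsing (the set of valid names mirrors this form's four projections; this is illustrative, not a specific library's API):

```python
VALID_MODULES = {"q_proj", "k_proj", "v_proj", "o_proj"}

def parse_target_modules(text):
    """Split a space-separated module string and reject unknown names."""
    modules = text.split()
    unknown = [m for m in modules if m not in VALID_MODULES]
    if unknown:
        raise ValueError(f"unknown target modules: {unknown}")
    return modules

print(parse_target_modules("q_proj k_proj v_proj o_proj"))
# ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```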

What is Bias?

Simple explanation: Bias is like a "starting point" in the math that makes the AI work. This setting decides if we should also train those starting points or leave them alone.

Your options:

  • none - Don't train bias values (default, recommended)
  • lora_only - Train bias in your LoRA adapter only (slight quality improvement)
  • all - Train ALL bias values in the model (rarely needed)

When to change: Stick with "none" for 99% of use cases. Try "lora_only" if you're experimenting and want a tiny potential quality boost.

Impact:

VRAM: Minimal increase Quality: Minimal improvement
Custom projections for self-attention (when attention-type=both)

Advanced: Separate Targets for Self-Attention

What this does: Normally, the same target modules (q_proj, k_proj, etc.) are used for both self-attention and cross-attention. This lets you specify DIFFERENT targets for just the self-attention part.

When to use: Only if you've run gradient estimation and discovered that self-attention benefits from different modules than cross-attention. This is an advanced optimization.

Example: q_proj v_proj for self-attention, while cross-attention uses all 4.

Leave empty unless: You're an advanced user doing targeted optimization based on estimation results.

Custom projections for cross-attention (when attention-type=both)

Advanced: Separate Targets for Cross-Attention

What this does: Lets you specify different target modules for just the cross-attention layers (the part that learns about text prompts).

When to use: Only if estimation shows cross-attention benefits from different modules. Advanced optimization only.

Leave empty unless: You're doing targeted optimization based on gradient estimation.

Training Hyperparameters

Beta 1 default: 3e-4 (was 1e-4 in Alpha)

What is Learning Rate?

Simple explanation: How big of a "step" the AI takes when learning. Too big = learning goes haywire. Too small = takes forever to learn.

Think of it like: Walking speed while learning to navigate. Sprint = fast but might miss things or crash. Crawl = safe but takes ages.

Common values:

  • 0.0001 (or 1e-4) - Standard default, safe for most cases
  • 0.00005 (or 5e-5) - More stable, use if training is unstable
  • 0.0002 (or 2e-4) - Faster learning, riskier
  • 1.0 - ONLY for Prodigy optimizer (it auto-adjusts internally)

How to write it: Can use either format:

  • 0.0001 = 1e-4 (scientific notation)
  • 0.00005 = 5e-5

Signs you need to adjust:

  • Loss jumping all over → Lower LR (try 5e-5)
  • Loss barely changing after many epochs → Higher LR (try 2e-4)

Impact:

VRAM: None Quality: Critical tuning parameter Speed: Higher = faster learning (but less stable)
Usually 1 for music (large tensors)

What is Batch Size?

Simple explanation: How many songs the AI looks at before updating what it learned. Batch size 1 = look at 1 song, update. Batch size 4 = look at 4 songs, then update.

Think of it like: Studying flashcards. Batch 1 = adjust after each card. Batch 10 = see 10 cards, then adjust your strategy.

Why it's usually 1 for music: Music files are HUGE (30-240 seconds of audio). Each song takes a lot of GPU memory. Batch size 2 would need double the memory.

When to change:

  • Keep at 1 - If you're using default settings
  • Try 2 - Only if you have 24GB+ VRAM and want slightly smoother training
  • Never go higher - Unless you have 80GB+ VRAM

Pro tip: Use "Gradient Accumulation" instead! It gives you the benefits of larger batch sizes without using more VRAM.

Impact:

VRAM: Higher = MUCH more VRAM Quality: Minimal impact Speed: Higher = slightly faster per epoch
Effective batch = batch × accumulation

What is Gradient Accumulation?

Simple explanation: A clever trick to get the benefits of large batch sizes WITHOUT using more GPU memory. It "accumulates" learning from multiple songs before updating.

Think of it like: Taking notes while studying 4 flashcards, THEN updating your strategy once. Same learning benefit as batch size 4, but you only hold 1 card at a time.

How it works:

  • Batch size 1, Grad accumulation 4 = "Effective batch of 4"
  • Look at song 1, take notes
  • Look at song 2, add to notes
  • Look at song 3, add to notes
  • Look at song 4, add to notes
  • NOW update the model with all 4 songs' worth of info

Common values:

  • 4 - Good default (effective batch of 4)
  • 8 - Smoother learning, slower updates
  • 2 - Faster updates, less smooth
  • 1 - No accumulation (update after every song)

Impact:

VRAM: ZERO impact (that's the magic!) Quality: Higher = smoother, more stable learning Speed: Higher = fewer updates per epoch (slightly slower)
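The accumulate-then-update cycle above can be sketched as a plain loop, with toy numbers standing in for real gradients (hypothetical helper, not the trainer's code):

```python
def train_with_accumulation(gradients, accum_steps):
    """Average `accum_steps` gradients before each update; return the updates."""
    updates, buffer = [], []
    for g in gradients:
        buffer.append(g)                               # "take notes" for this song
        if len(buffer) == accum_steps:                 # every N songs...
            updates.append(sum(buffer) / len(buffer))  # ...one averaged update
            buffer.clear()
    return updates

# 8 songs with accumulation 4 -> only 2 optimizer updates (effective batch of 4)
print(train_with_accumulation([1, 2, 3, 4, 5, 6, 7, 8], accum_steps=4))  # [2.5, 6.5]
```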
1-10 songs: 200-500, 10-50: 100-200, 50+: 50-100

What are Epochs?

Simple explanation: How many times the AI goes through your entire dataset. 1 epoch = seeing every song once. 100 epochs = seeing every song 100 times.

Think of it like: How many times you review your study materials. 1 pass = skim once. 100 passes = really know it well.

How many should you use? Depends on dataset size:

  • 1-10 songs: 200-500 epochs (small dataset needs lots of repetition)
  • 10-50 songs: 100-200 epochs
  • 50+ songs: 50-100 epochs (large dataset needs fewer passes)

How do you know when to stop?

  • Watch the loss curve in TensorBoard
  • If loss stops decreasing = you can probably stop
  • If loss starts INCREASING = you've overtrained (stop earlier next time)

Can you stop early? YES! If you're watching TensorBoard and loss plateaus at epoch 60, you can stop. No need to waste time on 40 more epochs.

Impact:

VRAM: None Quality: More = better (until overfitting) Speed: More = longer training time
LR ramps from 10% to 100%

What are Warmup Steps?

Simple explanation: Instead of starting at full learning rate immediately, we gradually increase it over the first N steps. Like warming up before exercise.

Why? Starting at full learning rate can sometimes cause the AI to "freak out" and learn bad patterns in the first few steps. Warmup prevents this.

How it works:

  • Step 1-10: Learning rate at 10%
  • Step 11-50: Gradually increase to 50%
  • Step 51-100: Gradually increase to 100%
  • Step 101+: Stay at 100% (or follow scheduler)

Common values:

  • 100 - Good default
  • 200-500 - Use if training is unstable at the start
  • 0 - No warmup (riskier)

Impact:

VRAM: None Quality: Prevents early instability Speed: Negligible
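The ramp can be written as a single function of the step number. A simple linear-warmup sketch matching the 10%-to-100% description above (real schedulers vary in the exact curve):

```python
def warmup_lr(step, base_lr=1e-4, warmup_steps=100, floor=0.1):
    """Linearly ramp the LR from 10% of base up to 100% over warmup_steps."""
    if step >= warmup_steps:
        return base_lr                      # warmup done: full learning rate
    frac = floor + (1.0 - floor) * (step / warmup_steps)
    return base_lr * frac

print(warmup_lr(0))    # 10% of base LR
print(warmup_lr(50))   # a bit over half, midway through the ramp
print(warmup_lr(100))  # full LR from here on
```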
L2 regularization (0.01 default)
Gradient clipping threshold
Random latent window per iteration (0=disabled, 60-90=recommended). Saves VRAM + data augmentation.

What is Latent Chunking?

Simple explanation: Instead of training on the entire song every time, randomly pick a shorter section (e.g., 60-90 seconds). This provides two major benefits: saves VRAM and acts as data augmentation.

How it works:

  • Each training iteration, a random window is extracted from each song
  • The model sees different parts of the same song every epoch
  • Like practicing random 60-second clips instead of the full 3-minute song

Benefits:

  • Saves VRAM: Shorter audio = much less memory needed
  • Data augmentation: Same song becomes multiple training samples (different random chunks each iteration)
  • Prevents overfitting: AI can't memorize full songs when it only sees random parts
  • Enables longer songs: Train on 5-minute songs even with limited VRAM

⚠️ CRITICAL WARNING:

Chunks shorter than 60 seconds can HURT training quality instead of enriching it. Use shorter chunks only if you need to reduce VRAM and understand the trade-off. The AI needs enough context to learn musical structure.

Recommended values:

  • 0 - Disabled (use full songs)
  • 60 - Minimum recommended for quality (1 minute chunks)
  • 90 - Sweet spot - good quality + VRAM savings (1.5 minute chunks)
  • 120 - Conservative option (2 minute chunks)

When to use:

  • Enable (60-90): Training on songs longer than 2 minutes
  • Enable (60-90): Running out of VRAM with full-length songs
  • Enable (90-120): Want extra data augmentation on long songs
  • Disable (0): Your songs are already short (under 2 minutes)
  • Disable (0): You have plenty of VRAM (24GB+) and want to train on full songs

Example scenarios:

  • 3-minute songs, 16GB VRAM: Use 90 seconds
  • 5-minute songs, 12GB VRAM: Use 60 seconds
  • 1-minute songs, any VRAM: Disable (use 0)
  • 4-minute songs, 24GB VRAM: Optional - use 90-120 for augmentation or 0 for full songs

Technical note: This slices the preprocessed latent tensors (not raw audio), so chunks are calculated in latent space. The number you enter is approximate seconds of the original audio.

Impact:

VRAM: Significant savings (proportional to chunk size) Quality: 60+ = safe, <60 = risky Speed: Faster per step (smaller chunks) Data Aug: More variety in training
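The random-window idea can be sketched in a few lines, with a plain list standing in for a latent tensor. The frames-per-second value is an assumption for illustration only; as the technical note says, the real latent rate depends on the model:

```python
import random

def random_chunk(latent_frames, chunk_seconds, frames_per_second=25):
    """Slice a random window of ~chunk_seconds out of a latent sequence.

    frames_per_second is an assumed latent rate, purely for illustration.
    """
    window = chunk_seconds * frames_per_second
    if chunk_seconds == 0 or window >= len(latent_frames):
        return latent_frames             # disabled, or song shorter than window
    start = random.randrange(len(latent_frames) - window + 1)
    return latent_frames[start:start + window]

song = list(range(180 * 25))                  # a 3-minute song in latent frames
chunk = random_chunk(song, chunk_seconds=90)  # a different window every call
print(len(chunk))                             # 2250 frames (~90 s)
```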
Auto-set: turbo=3.0, base/sft=1.0. Stored for inference, doesn't affect training
Auto-set: turbo=8, base/sft=50. Stored for inference

CFG & Loss Settings (Base/SFT only)

These settings are auto-disabled for turbo models
Null-condition probability. Base trained with 0.15 - match it
Try min_snr if base/sft output sounds mushy
Aggressiveness of min-SNR (only with min_snr)

Optimizer & Scheduler

What is an Optimizer?

Simple explanation: The optimizer is the "brain" that decides HOW to update the AI based on what it learned. Different optimizers have different strategies and memory requirements.

Your choices:

  • AdamW (default) - Industry standard. Reliable, well-tested, works for 99% of cases. Needs moderate VRAM for its "memory" of past training.
  • AdamW8bit - Same as AdamW but uses compressed memory. Saves ~30% of optimizer VRAM with virtually no quality loss. USE THIS if you're running out of VRAM!
  • AdaFactor - Uses minimal memory by being "forgetful" about past training. Only for extreme VRAM constraints (<8GB). Slightly worse quality.
  • Prodigy - Experimental "autopilot" that figures out the best learning rate automatically. Set LR to 1.0 and let it tune itself. Can be unstable.

Recommendation:

  • Have 16GB+ VRAM? Use AdamW
  • Have 10-16GB VRAM? Use AdamW8bit
  • Have <10GB VRAM? Use AdamW8bit or AdaFactor
  • Want to experiment? Try Prodigy (but set LR to 1.0!)

Impact:

VRAM: 8bit saves 30%, AdaFactor saves 50% Quality: AdamW=AdamW8bit > AdaFactor, Prodigy varies Speed: All roughly similar
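The recommendation list above can be captured in a tiny helper. This just encodes this section's rules of thumb as code; it is not Side-Step's actual selection logic, and the optimizer names are the form's labels:

```python
def recommend_optimizer(vram_gb, experimenting=False):
    """Pick an optimizer name from available VRAM, per the rules of thumb above."""
    if experimenting:
        return "prodigy"      # remember: set the learning rate to 1.0!
    if vram_gb >= 16:
        return "adamw"
    if vram_gb >= 10:
        return "adamw8bit"
    return "adamw8bit"        # or "adafactor" under extreme constraints (<8 GB)

print(recommend_optimizer(24))  # adamw
print(recommend_optimizer(12))  # adamw8bit
```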
Prodigy auto-forces 'constant'

What is a Scheduler?

Simple explanation: Controls how the learning rate changes over time. Think of it like adjusting your study intensity as the semester progresses.

Visual analogy:

  • Constant: Study at same intensity every day --------
  • Linear: Gradually study less and less ----\
  • Cosine: Study hard, then taper off smoothly ----\___
  • Cosine with Restarts: Periodic "sprints" --\__/--\__/

Your choices:

  • Cosine (default) - Smooth curve down. Starts strong, ends gentle. Best for most cases.
  • Linear - Straight line down. Predictable, but less commonly used.
  • Constant - Never change LR. Required for Prodigy optimizer. Rarely used otherwise.
  • Constant with Warmup - Warm up at start, then stay flat. Simple and effective.
  • Cosine with Restarts - Periodic "bounces" back to higher LR. Advanced, can help escape plateaus.

Which to choose:

  • Default case: Use Cosine
  • Using Prodigy: It forces Constant automatically
  • Want simplicity: Constant with Warmup
  • Experimenting: Try Cosine with Restarts

Impact:

VRAM: None Quality: Cosine generally best Speed: None
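The cosine curve is just a half-cosine mapped onto the training steps. A minimal sketch of the shape (library implementations typically add warmup and a minimum LR on top of this):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4):
    """Smoothly decay the LR from base_lr down to ~0 over total_steps."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(cosine_lr(0, total))      # full LR at the start
print(cosine_lr(500, total))    # half LR at the midpoint
print(cosine_lr(1000, total))   # ~0 at the end
```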

Memory Optimization

What is Gradient Checkpointing?

Simple explanation: A memory-saving trick where we throw away some intermediate calculations and recalculate them later when needed. Uses less VRAM but takes a bit more time.

Think of it like: Instead of keeping all your scratch paper (uses desk space), you erase it and redo the math later when grading (takes more time but less space).

The trade-off:

  • Saves: 40-60% of VRAM (HUGE savings!)
  • Costs: 10-30% slower training (you're doing some calculations twice)

Should you use it?

  • YES - If you have <24GB VRAM (almost everyone)
  • YES - If you're running out of memory
  • MAYBE NO - If you have 80GB+ VRAM and want max speed

Default: ON (matches original Side-Step trainer behavior). Unless you have massive VRAM and want maximum speed, keep this ON.

Impact:

VRAM: Saves 40-60% (CRITICAL for most users) Speed: 10-30% slower training Quality: No impact

What is Encoder Offloading?

Simple explanation: Some parts of the AI (VAE and text encoders) are only needed at the start. After setup, we move them to regular RAM instead of keeping them on the GPU. This frees up GPU memory for training.

Think of it like: Moving reference books from your desk to a bookshelf after you've taken notes. You already have what you need, so free up the desk space.

How much does it save? About 2-4GB of VRAM (exact amount depends on the model).

Does it slow things down? NO! These components aren't used during training anyway, so moving them to CPU has zero speed impact.

Should you use it?

  • YES - If you have 10-16GB VRAM
  • YES - If you're getting out-of-memory errors
  • YES - If you have <10GB VRAM (definitely enable it!)
  • NO - If you have 24GB+ VRAM (not needed)

Fun fact: This is a "free lunch" optimization - saves memory with no downsides!

Impact:

VRAM: Frees 2-4GB Speed: Zero impact Quality: Zero impact

Data Loading

Default: 4 (Linux), 0 (Windows). Set to 0 if low on RAM

What are Data Loading Workers?

Simple explanation: "Workers" are separate helper processes that load your training data from disk while the GPU is busy training. More workers = faster data loading, but uses more RAM.

Think of it like: A restaurant kitchen with multiple prep cooks (workers). While the head chef (GPU) is cooking, prep cooks are chopping ingredients (loading data) so the chef never has to wait.

How it works:

  • 0 workers: Main process does everything. GPU waits while data loads from disk.
  • 1 worker: One helper process loads next batch while GPU trains on current batch.
  • 4 workers: Four helpers load data in parallel. GPU almost never waits.
  • 8+ workers: Even more parallel loading, but diminishing returns.

Platform defaults:

  • Linux/Mac: Default 4 workers (multiprocessing works well)
  • Windows: Default 0 workers (Windows multiprocessing has issues)

Recommended values:

  • 0 - Windows (default), or very low RAM (<8GB)
  • 2 - Dual-core CPU, moderate RAM (8-16GB)
  • 4 - Quad-core+ CPU, good RAM (16GB+) - recommended for Linux
  • 6-8 - High-end CPU, lots of RAM (32GB+)

When to increase:

  • You have a fast GPU but slow disk (SSD helps too!)
  • TensorBoard shows GPU utilization dropping between batches
  • You have lots of CPU cores and RAM to spare

When to decrease (or use 0):

  • On Windows: Multiprocessing can cause crashes/hangs
  • Low RAM: Each worker loads data into memory
  • Memory leaks: If RAM usage grows over time
  • Crashes during data loading: Try 0 to debug

⚠️ Windows Warning: Windows has known issues with PyTorch multiprocessing. If training hangs at data loading or you get errors, set this to 0.

Impact:

GPU VRAM: No impact CPU RAM: Each worker uses ~2-4GB Speed: More workers = faster (if you have RAM)
Default: 2 (Linux), 0 (Windows)

What is Prefetch Factor?

Simple explanation: How many batches AHEAD each worker loads into memory. Higher = smoother training but uses more RAM.

Think of it like: A buffer at a streaming video service. Prefetch factor 2 = load 2 batches ahead. If GPU suddenly speeds up, you have data ready immediately.

How it works:

  • Prefetch 0: Workers only load when GPU asks. Can cause waiting.
  • Prefetch 2: Each worker keeps 2 batches loaded ahead. Smooth flow.
  • Prefetch 4+: Even more buffer, but uses lots of RAM.

Math: Total memory used = num_workers × prefetch_factor × batch_size × data_size

  • 4 workers × 2 prefetch × 1 batch × 500MB per song = 4GB RAM used for prefetching
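That estimate is easy to compute up front with the same formula. The 500 MB-per-song default is an assumption that depends on your audio lengths:

```python
def prefetch_ram_mb(num_workers, prefetch_factor, batch_size, sample_mb=500):
    """Estimate RAM held by prefetched batches (sample_mb is an assumed size)."""
    return num_workers * prefetch_factor * batch_size * sample_mb

print(prefetch_ram_mb(4, 2, 1))  # 4000 MB: the example above
print(prefetch_ram_mb(8, 4, 1))  # 16000 MB: why 8 workers + prefetch 4 gets costly
```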

Recommended values:

  • 0 - Windows (default, since num_workers is 0)
  • 2 - Good default for Linux/Mac (balances speed and RAM)
  • 4 - If you have lots of RAM (32GB+)
  • 1 - If running out of RAM

When to increase:

  • GPU utilization is inconsistent (drops between batches)
  • You have RAM to spare
  • Data loading is still a bottleneck despite workers

When to decrease:

  • Running out of RAM (system memory, not VRAM)
  • Getting "out of memory" errors during data loading
  • Training on a system with <16GB RAM

⚠️ Important: This only matters when num_workers > 0. If workers = 0, prefetch factor is ignored.

Platform notes:

  • Windows: Usually 0 (since num_workers = 0)
  • Linux/Mac: Usually 2 (with 4 workers)

Impact:

GPU VRAM: No impact CPU RAM: Higher = more RAM per worker Speed: Smoother training flow, fewer GPU stalls

What is Pin Memory?

Simple explanation: "Pinning" memory locks loaded data in a special area of RAM that can transfer to GPU faster. It's like keeping ingredients on the counter instead of in the pantry.

Technical explanation:

  • Unpinned (normal) memory: OS can move data around. GPU transfer needs an extra copy step. Slower but flexible.
  • Pinned memory: Data is locked in place. GPU can grab it directly via DMA (Direct Memory Access). Faster but uses more RAM.

The trade-off:

  • Speed benefit: 10-30% faster data transfer from RAM → GPU
  • RAM cost: Pinned memory can't be swapped to disk, so it "uses" more RAM

Should you enable it?

  • YES (default): If you have 16GB+ system RAM
  • YES: If using CUDA (NVIDIA GPUs)
  • MAYBE NO: If you have <16GB RAM and seeing memory pressure
  • NO: If getting RAM-related crashes

When to disable:

  • Training on a system with very limited RAM (<8GB)
  • Running multiple programs while training
  • Getting "out of memory" errors (RAM, not VRAM)
  • System becomes unresponsive during training

Platform compatibility:

  • CUDA (NVIDIA): Works great, recommended
  • MPS (Apple Silicon): Less benefit but still works
  • CPU: No benefit (disabled automatically)

Impact:

GPU VRAM: No impact CPU RAM: Uses more (data can't be swapped) Speed: 10-30% faster RAM→GPU transfer

What are Persistent Workers?

Simple explanation: Keep the data loading workers alive between epochs instead of killing and restarting them every epoch.

Think of it like: A restaurant with shift workers:

  • Non-persistent: Fire all prep cooks at end of each lunch service, hire new ones for dinner. Lots of setup time.
  • Persistent: Keep the same prep cooks all day. They know the kitchen and work efficiently.

How it works:

  • Persistent OFF: At the end of each epoch:
    • Kill all worker processes
    • Free their memory
    • At start of next epoch: spawn new workers
    • Workers reload dataset metadata
  • Persistent ON: Workers stay alive between epochs:
    • No respawning overhead
    • Dataset stays loaded in worker memory
    • Faster transition between epochs

Benefits of enabling:

  • ⚡ Faster epoch transitions (no worker spawn time)
  • ⚡ No dataset reloading between epochs
  • ⚡ More efficient when training many epochs

Drawbacks of enabling:

  • 💾 Workers keep RAM allocated between epochs
  • 🐛 If workers have memory leaks, they accumulate
  • 🐛 Harder to recover from worker crashes

Should you enable it?

  • YES (default on Linux): If training many epochs (50+)
  • YES: If you have stable RAM usage (no leaks)
  • NO (default on Windows): Windows multiprocessing issues
  • NO: If RAM usage grows over time (memory leak)
  • NO: If num_workers = 0 (nothing to persist)

When to disable:

  • RAM usage steadily increases during training (leak)
  • Workers crash or hang between epochs
  • Training on Windows (often problematic)
  • You're doing quick experiments (few epochs)

Platform defaults:

  • Linux: ON by default (works reliably)
  • Windows: OFF by default (multiprocessing issues)
  • Mac: ON by default

⚠️ Note: This setting only matters when num_workers > 0. With 0 workers, there's nothing to keep persistent.

Impact:

GPU VRAM: No impact CPU RAM: Workers stay in memory (but not loading more) Speed: Faster epoch transitions, no spawn overhead
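Taken together, the four data-loading settings above map onto PyTorch DataLoader arguments. A sketch of the platform-aware defaults this section describes, written as a plain dict so it runs without PyTorch installed (the exact defaults Side-Step applies may differ):

```python
import sys

def dataloader_kwargs(platform=None):
    """Suggested DataLoader settings per this section's platform defaults."""
    platform = platform or sys.platform
    if platform.startswith("win"):
        # Windows multiprocessing is fragile: load everything in the main process.
        # prefetch_factor is omitted because it only applies when num_workers > 0.
        return {"num_workers": 0, "pin_memory": True,
                "persistent_workers": False}
    # Linux/Mac: parallel loading with a small prefetch buffer
    return {"num_workers": 4, "prefetch_factor": 2, "pin_memory": True,
            "persistent_workers": True}

print(dataloader_kwargs("linux"))
print(dataloader_kwargs("win32"))
```

You would then pass these through as `torch.utils.data.DataLoader(dataset, batch_size=1, **kwargs)`; note that PyTorch rejects a `prefetch_factor` when `num_workers` is 0, which is why the Windows dict leaves it out.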

Checkpointing & Logging

TensorBoard basic metrics (loss, LR)
Per-layer gradient norms (expensive)
Custom path for TensorBoard logs
Path to checkpoint-epoch-N directory

Global Options

What is UV?

Simple explanation: UV is a fast Python package manager that Side-Step uses. It's like pip but much faster (10-100x).

Why it's checked by default:

  • Side-Step documentation recommends using UV
  • Ensures packages are in sync with project requirements
  • Much faster than standard pip

Command difference:

  • With UV (checked): uv run python train.py ...
  • Without UV: python train.py ...

When to disable:

  • You don't have UV installed
  • You're using standard pip/virtualenv
  • You know you don't need it

To install UV:

pip install uv

Impact:

Just changes the command prefix

Queue Execution Options

📭 No jobs in queue. Configure a job and click "Add to Queue"

Generated Queue Script