Our AI Models

22 AI video models, 4 image models, and specialist tools — text-to-video, image-to-video, portrait animation, 3D generation and more. Compare specifications and choose the right engine for your project.

WAN

14B Parameter Text-to-Video • Rapid Generation

Speed

WAN is a 14-billion parameter diffusion model by Alibaba, designed for rapid text-to-video generation. Using a two-pass KSamplerAdvanced pipeline with LoRA-enhanced distilled weights, it produces quality videos in just 4 inference steps — making it ideal for fast iteration and experimentation.
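
As a rough illustration of how a 4-step schedule might be divided between the two passes, the sketch below models the hand-off with the step-range inputs that ComfyUI's KSamplerAdvanced node exposes (`add_noise`, `start_at_step`, `end_at_step`, `return_with_leftover_noise`). The split point of 2 and the helper `two_pass_plan` are assumptions made for the example; the production workflow may hand off at a different step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassSettings:
    """Subset of KSamplerAdvanced inputs that control step splitting."""
    add_noise: bool
    start_at_step: int
    end_at_step: int
    return_with_leftover_noise: bool


def two_pass_plan(total_steps: int = 4, switch_step: int = 2) -> tuple[PassSettings, PassSettings]:
    """Split one denoising schedule into a high-noise and a low-noise pass.

    switch_step is a hypothetical hand-off point, not a documented default.
    """
    if not 0 < switch_step < total_steps:
        raise ValueError("switch_step must fall strictly inside the schedule")
    # Pass 1 adds the initial noise and stops early, keeping the leftover noise.
    high_noise = PassSettings(True, 0, switch_step, True)
    # Pass 2 continues from the partially denoised latent without adding noise.
    low_noise = PassSettings(False, switch_step, total_steps, False)
    return high_noise, low_noise


high, low = two_pass_plan()
print(high)
print(low)
```

The second pass starts exactly where the first one ends, so the two passes together cover the 4 steps once.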

The model leverages the Hunyuan latent video architecture with dual ModelSamplingSD3 shift control for both high-noise and low-noise passes, giving fine-tuned control over the generation process while maintaining exceptional speed.
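
To make the shift value more concrete, the sketch below applies the time-shift mapping commonly used with flow-matching models, sigma' = shift × sigma / (1 + (shift − 1) × sigma), to a linearly spaced 4-step schedule. The linear starting schedule and the function names are assumptions for illustration; the actual workflow may build its sigmas differently.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Remap a noise level in [0, 1]; larger shift pushes values toward 1."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(steps: int, shift: float) -> list[float]:
    """Return steps + 1 noise levels from 1.0 down to 0.0 after shifting."""
    # Assumed starting point: evenly spaced sigmas (illustrative only).
    base = [1.0 - i / steps for i in range(steps + 1)]
    return [shift_sigma(s, shift) for s in base]


schedule = shifted_schedule(steps=4, shift=5.0)
print([round(s, 3) for s in schedule])
```

With a shift of 5.0, the midpoint of the unshifted schedule (0.5) maps to roughly 0.833, so more of the limited step budget is spent at high noise levels.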

Technical Specifications

Architecture: Two-Pass KSamplerAdvanced + LoRA Distillation
Parameters: 14 Billion
Default Resolution: 848 × 480
Frame Rate: 16 fps
Default Steps: 4 (rapid)
Sampler: Euler
Scheduler: Simple
CFG Scale: 1.0
Shift: 5.0
Output Format: MP4 (via SaveVideo)
Checkpoint: wan2.2-t2v-rapid-aio-v10

Best For

Rapid Prototyping

Get results in seconds with just 4 steps

🔁
Iterative Refinement

Quickly test different prompts and settings

🎬
Motion Quality

Natural movement with two-pass sampling

📈
Batch Generation

Low step count enables high throughput

LTX 2.3

Distilled Two-Pass Pipeline • HD Output

Quality

LTX 2.3 is a sophisticated two-pass generation pipeline that first generates at a lower resolution and then upsamples to produce crisp 1280×720 HD video. The LTXVLatentUpsampler ensures sharp details and consistent motion across the upscaling process.

It uses Euler Ancestral CFG++ sampling with manual sigma scheduling for precise noise control, dual CFGGuider nodes for both passes, and LoRA-enhanced model weights for optimal quality.

Technical Specifications

Architecture: Two-Pass with LTXVLatentUpsampler
Text Encoder: LTXAVTextEncoder (Dedicated)
Default Resolution: 1280 × 720 (HD)
Frame Rate: 25 fps
Default Steps: 8 per pass
Sampler: Euler Ancestral CFG++
Scheduler: Manual Sigmas
CFG Scale: 1.0 (both passes)
Shift: 2.05
Output Format: MP4 (via SaveVideo)
Enhancement: LoRA Model Weights

Best For

💎
High Quality Output

Two-pass pipeline for crisp HD results

📷
Final Productions

Publication-ready 1280×720 video

🛠
Detail Preservation

LoRA enhancement for fine details

🎥
Professional Look

Advanced sigma scheduling for smooth motion

LTX Quality

Two-Pass Pipeline • Up to 4K Output

4K

LTX Quality is the premium configuration of the LTX 2.3 pipeline, generating at half resolution then performing a 2× spatial upscale for ultra-sharp output up to 4K (3840×2176). It uses LoRA-enhanced distilled weights and a dedicated LTXVLatentUpsampler for the refinement pass.

The two-pass approach with separate CFGGuider nodes, configurable sigma schedules, and optional tiled VAE decoding makes it ideal for high-resolution final production work.
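
The relationship between first-pass and final resolution is simple arithmetic, sketched below. The helper names are illustrative and not part of the pipeline; the only inputs taken from this page are the 2× factor and the listed sizes (960 × 544, 1920 × 1088, 3840 × 2176).

```python
def upscaled_size(width: int, height: int, factor: int = 2) -> tuple[int, int]:
    """Final output size after a spatial upscale of the first-pass frame."""
    return width * factor, height * factor


def first_pass_size(target_width: int, target_height: int, factor: int = 2) -> tuple[int, int]:
    """First-pass size needed to reach a target size with the given upscale."""
    if target_width % factor or target_height % factor:
        raise ValueError("target size must be divisible by the upscale factor")
    return target_width // factor, target_height // factor


print(upscaled_size(960, 544))       # default configuration
print(first_pass_size(3840, 2176))   # first pass required for the 4K maximum
```

A 2× upscale per side means the refinement pass works on four times as many pixels as the first pass, which is why the 4K configuration is the slowest option.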

Technical Specifications

Architecture: Two-Pass with 2× LTXVLatentUpsampler
Text Encoder: LTXAVTextEncoder (Dedicated)
Default Resolution: 960 × 544 → 1920 × 1088
Max Resolution: 1920 × 1088 → 3840 × 2176 (4K)
Frame Rate: 24 fps
Default Steps: 20 (first pass)
Sampler: Euler
Scheduler: LTXVScheduler (custom shift)
CFG Scale: 3.0 (first pass) / 1.0 (upscale)
Shift Range: 0.95 – 2.05
Output Format: MP4 (via SaveVideo)
Enhancement: LoRA Distill + Spatial Upscaler
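
The two CFG values are easier to interpret with the standard classifier-free guidance formula in mind: prediction = uncond + scale × (cond − uncond). The sketch below uses plain Python lists in place of tensors, and `apply_cfg` is an illustrative helper rather than a function from ComfyUI; the sample numbers are made up.

```python
def apply_cfg(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Blend conditional and unconditional predictions with a guidance scale."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Made-up predictions for three latent values (illustrative only).
cond = [1.0, -0.5, 0.25]
uncond = [0.5, 0.0, 0.75]

first_pass = apply_cfg(cond, uncond, 3.0)    # CFG 3.0: pushes away from uncond
upscale_pass = apply_cfg(cond, uncond, 1.0)  # CFG 1.0: equals cond exactly

print(first_pass)
print(upscale_pass)
```

At a scale of 1.0 the result is the conditional prediction itself, so the unconditional (negative prompt) branch has no influence on that pass.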

Best For

💎
4K Production

Generate up to 3840×2176 with 2× upscale

🛠
Maximum Detail

LoRA + spatial upscaler preserves fine details

🎥
Professional Output

Configurable sigma schedules for smooth motion

Full Customization

Separate controls for both generation passes

Hunyuan Video

13B Parameter Text-to-Video • High Motion Quality

Motion

Hunyuan Video is a 13-billion parameter diffusion model by Tencent, built on a Dual CLIP text encoder architecture (CLIP-L + LLaVA-LLaMA3) for rich text understanding. It uses FluxGuidance for conditioning control and ModelSamplingSD3 for shift-based sampling.

By default the model generates at 848 × 480 and uses tiled VAE decoding to manage VRAM. It produces videos with exceptional motion quality and character consistency, which makes it well suited to anime-style and character-driven animation.

Technical Specifications

Architecture: Hunyuan Video + SD3 Sampling + FluxGuidance
Parameters: 13 Billion
Text Encoder: Dual CLIP (CLIP-L + LLaVA-LLaMA3)
Default Resolution: 848 × 480
Frame Rate: 24 fps
Default Steps: 20
Sampler: Euler (SamplerCustomAdvanced)
Scheduler: Simple (BasicScheduler)
Guidance: 6.0 (FluxGuidance)
Shift: 7.0
VAE Decode: Tiled (256 tile, 64 overlap)
Output Format: MP4 (via SaveVideo)
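
To give a feel for what a 256 tile with 64 overlap means, the sketch below computes where tiles would start along one axis of an 848 × 480 frame. It assumes the tile and overlap values are in pixels and uses a simple "stride, then clamp the last tile to the edge" scheme; ComfyUI's actual tiled decoder places and blends tiles in its own way, so treat this as an illustration of the idea only.

```python
def tile_starts(length: int, tile: int = 256, overlap: int = 64) -> list[int]:
    """Start offsets of overlapping tiles that cover `length` pixels."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    if length <= tile:
        return [0]
    stride = tile - overlap  # each new tile advances by 192 px with defaults
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile sits flush with the far edge
    return starts


columns = tile_starts(848)
rows = tile_starts(480)
print(columns, rows, len(columns) * len(rows))
```

Decoding tile by tile keeps peak memory proportional to one tile rather than the whole frame, at the cost of decoding the overlapping regions more than once.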

Best For

🎬
Smooth Motion

Exceptional temporal coherence and fluid movement

🎨
Anime & Characters

Excels at character animation and anime styles

📚
Rich Prompts

Dual CLIP encoder understands complex descriptions

📈
Consistency

Maintains character identity across frames

Hunyuan 1.5

Next-Gen Text-to-Video • Native 720p • Dual CLIP v2

New

Hunyuan 1.5 is the next generation of Tencent’s video diffusion model, featuring an upgraded Dual CLIP encoder (Qwen 2.5 VL 7B + Byt5 Glyph XL) for significantly improved text comprehension and prompt following. It generates natively at 720p (1280×720) without upscaling.

The model uses CFGGuider with SamplerCustomAdvanced for precise generation control, and includes an optional super-resolution path to upscale output to 1080p using a dedicated latent upsampler.

Technical Specifications

Architecture: Hunyuan Video 1.5 + SD3 Sampling + CFGGuider
Text Encoder: Dual CLIP (Qwen 2.5 VL 7B + Byt5 Glyph XL)
Default Resolution: 1280 × 720 (HD)
Frame Rate: 24 fps
Default Steps: 20
Sampler: Euler (SamplerCustomAdvanced)
Scheduler: Simple (BasicScheduler)
CFG Scale: 6.0
Shift: 7.0
VAE Decode: Standard (VAEDecode)
Optional Upscale: 1080p Super Resolution (disabled by default)
Output Format: MP4 (via SaveVideo)

Best For

📚
Superior Prompts

Qwen 2.5 VL encoder understands complex, detailed descriptions

🎬
Native HD

Generates at 1280×720 without upscaling artefacts

📈
Smooth Motion

Inherits Hunyuan’s exceptional temporal coherence

💎
Optional 1080p

Built-in super-resolution upscale path when needed

Additional Models

More AI Engines

CogVideoX

Tsinghua University • Open-Source

Research

CogVideoX is a high-quality open-source video generation model from Tsinghua University. It delivers strong text comprehension and artistic output quality with a straightforward single-pass pipeline.

The model excels at stylised and artistic video generation, producing visually striking results with excellent prompt adherence for creative and research use cases.

Technical Specifications

Architecture: CogVideoX Transformer
Default Resolution: 720 × 480
Frame Rate: 8 fps
Default Steps: 50
CFG Scale: 6
Frame Count: 49 frames (~6s)
Sampler: Euler
Output Format: MP4

Best For

🎨
Artistic Styles

Excels at stylised and creative video output

💬
Text Comprehension

Strong understanding of complex prompts

🔬
Research Quality

Open-source model with academic backing

🎬
Longer Clips

49 frames for ~6 seconds of output

AnimateDiff

Stable Diffusion Animation • Motion Module

SD

AnimateDiff turns any Stable Diffusion checkpoint into a video animation engine by inserting a temporal motion module. This enables the vast SD ecosystem of models, LoRAs, and styles to produce animated output.

With fast 20-step generation and a lightweight architecture, AnimateDiff is ideal for quick iterations, anime-style content, and leveraging the massive library of community SD checkpoints.

Technical Specifications

Architecture: SD 1.5 + Motion Module
Default Resolution: 512 × 512
Frame Rate: 8 fps
Default Steps: 20
CFG Scale: 7
Frame Count: 32 frames (~4s)
Sampler: Euler
Output Format: MP4

Best For

Quick Iterations

20-step generation for rapid prototyping

🎨
Anime & Stylised

Leverage SD checkpoints for any art style

🛠
SD Ecosystem

Compatible with community LoRAs and models

🔁
Short Clips

Perfect for 4-second animated sequences

Image-to-Video Models

Reference Image Animation • Multi-Engine

I2V

Image-to-Video (I2V) models animate a reference image into motion video. Upload a still image and the AI generates natural movement, camera motion, and scene dynamics while preserving the original composition.

Available I2V Engines

Hunyuan 1.5 I2V | 1280×720 • 24 fps • 121 frames | Best motion quality from a reference image
LTX 2.3 I2V | 1280×720 • 25 fps • 121 frames | Fast 8-step inference, great detail
SVD XT I2V | 1024×576 • 6 fps • 25 frames | Stability AI's Stable Video Diffusion
WAN 2.2 I2V | 832×480 • 16 fps • 81 frames | 14B parameter model with Lightx2v turbo
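
Clip length follows directly from frame count divided by frame rate. The sketch below works this out for the four engines listed above; the dictionary and function names are illustrative, and the numbers are the ones from the list.

```python
# (frames, fps) per engine, copied from the list above.
I2V_ENGINES = {
    "Hunyuan 1.5 I2V": (121, 24),
    "LTX 2.3 I2V": (121, 25),
    "SVD XT I2V": (25, 6),
    "WAN 2.2 I2V": (81, 16),
}


def clip_seconds(frames: int, fps: int) -> float:
    """Duration of a clip in seconds."""
    return frames / fps


for name, (frames, fps) in I2V_ENGINES.items():
    print(f"{name}: {clip_seconds(frames, fps):.2f} s")
```

All four engines land in the range of roughly 4 to 5 seconds per clip, even though their frame counts differ by almost a factor of five.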

Advanced Video Modes

First-Last Frame • Sound-to-Video • Camera Control

Multi-Mode

Specialised generation modes that extend beyond simple text or image inputs. Interpolate between keyframes, drive video from audio, or control camera movements.

Available Modes

LTX First-Last Frame | 768×512 • 25 fps • 97 frames | Interpolate between two keyframe images with prompt guidance
WAN First-Last Frame | 832×480 • 16 fps • 81 frames | WAN 14B first & last frame interpolation
WAN Sound-to-Video | 832×480 • 16 fps • 81 frames | Generate talking-head video from portrait + audio
LTX Camera Control | 768×512 • 25 fps • 97 frames | Camera dolly (in, out, left, right) via LoRA
LTX ControlNet | 768×512 • 25 fps • 97 frames | Depth, edge, or pose-guided generation
WAN VACE | 832×480 • 16 fps • 81 frames | Video conditioning & editing with subject/scene control
WAN Animate | 832×480 • 16 fps • 81 frames | Character animation from a single reference image

Specialised Models

Portrait Animation • Face Transfer • 3D Generation

Specialist

Purpose-built models for specific creative tasks including portrait animation, face identity transfer, audio-driven talking heads, and 3D asset generation.

Available Specialist Models

LivePortrait | 512×512 • 24 fps • 81 frames | Animate portrait photos with facial expressions from a driving video
DreamID-V | 832×480 • 24 fps • 81 frames | Face identity transfer: insert your face into generated video
EchoMimic | 512×512 • 25 fps • 100 frames | Audio-driven talking portrait from a single photo + audio clip
Hunyuan3D v2 | 512×512 single output | Generate 3D model assets from text or reference image

Image Generation Models

AI Image Generation • Text-to-Image & Image Editing

Images

Four image generation models ranging from the fast and lightweight Stable Diffusion 1.5 to the powerful Flux 2 family. Use the Image Edit model to modify existing images with text instructions.

Available Image Models

Stable Diffusion 1.5 | 512×512 • 25 steps • CFG 7 | Fast & lightweight • 2 credits
Flux 2 Klein | 1024×1024 • 20 steps • CFG 4 | Balanced quality & speed • 5 credits
Flux 2 Dev | 1024×1024 • 28 steps • CFG 4 | Highest quality • 8 credits
Flux 2 Image Edit | 1024×1024 • 20 steps • CFG 4 | Edit images with text instructions • 6 credits
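
For budgeting, the credit figures above can be turned into a quick batch estimate. The sketch assumes each figure is the cost of one generated image, which the list does not state explicitly; `batch_cost` is an illustrative helper.

```python
# Credits per model, copied from the list above.
# Assumption: the figure is charged once per generated image.
CREDITS_PER_IMAGE = {
    "Stable Diffusion 1.5": 2,
    "Flux 2 Klein": 5,
    "Flux 2 Dev": 8,
    "Flux 2 Image Edit": 6,
}


def batch_cost(model: str, images: int) -> int:
    """Estimated credits for generating `images` images with `model`."""
    if images < 0:
        raise ValueError("images must be non-negative")
    return CREDITS_PER_IMAGE[model] * images


print(batch_cost("Stable Diffusion 1.5", 10))
print(batch_cost("Flux 2 Dev", 10))
```

Under that assumption, drafting with Stable Diffusion 1.5 and switching to Flux 2 Dev only for final images costs a quarter as much per draft.
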
Side by Side

Model Comparison

Feature | WAN | LTX 2.3 | LTX Quality | Hunyuan | Hunyuan 1.5
Max Resolution | 848 × 480 | 1280 × 720 | 3840 × 2176 (4K) | 848 × 480 | 1280 × 720
Frame Rate | 16 fps | 25 fps | 24 fps | 24 fps | 24 fps
Inference Steps | 4 (fastest) | 8 × 2 passes | 20 + upscale pass | 20 | 20
Generation Speed | Fastest | Moderate | Slowest | Slow | Moderate
Output Quality | Good | Excellent | Best | Excellent | Excellent
Motion Quality | Good | Good | Excellent | Best | Best
Sampler | Euler | Euler Ancestral CFG++ | Euler | Euler | Euler
Upscaling | None | Built-in (LTXVUpsampler) | 2× Spatial Upscaler | None | Optional 1080p SR
Best Use Case | Prototyping & iteration | HD production | 4K final production | Character & anime | HD motion & prompts
Image-to-Video | ✓ Available | ✓ Available | Coming Soon | Coming Soon | ✓ Available
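
One way to use the comparison is to filter it by hard requirements first and only then weigh speed against quality. The sketch below encodes the resolution, frame rate, and image-to-video rows as data; the `ModelSpec` class and `candidates` helper are illustrative and not part of any API offered by the service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    max_width: int
    max_height: int
    fps: int
    image_to_video: bool  # True where the table says "Available"


# Values copied from the comparison table above.
MODELS = [
    ModelSpec("WAN", 848, 480, 16, True),
    ModelSpec("LTX 2.3", 1280, 720, 25, True),
    ModelSpec("LTX Quality", 3840, 2176, 24, False),
    ModelSpec("Hunyuan", 848, 480, 24, False),
    ModelSpec("Hunyuan 1.5", 1280, 720, 24, True),
]


def candidates(min_height: int, need_i2v: bool = False) -> list[str]:
    """Names of models meeting a minimum height and optional I2V support."""
    return [
        m.name
        for m in MODELS
        if m.max_height >= min_height and (m.image_to_video or not need_i2v)
    ]


print(candidates(720, need_i2v=True))
print(candidates(1080))
```

For example, requiring at least 720 lines plus image-to-video support leaves LTX 2.3 and Hunyuan 1.5, while requiring 1080 lines leaves only LTX Quality.
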
Infrastructure

Powered by NVIDIA DGX B200

NVIDIA DGX B200

8x Blackwell GPUs • 1,440 GB HBM3e • 144 PFLOPS


The foundation for your AI factory. NVIDIA DGX B200 is equipped with eight NVIDIA Blackwell GPUs interconnected with fifth-generation NVLink, delivering 3X the training performance and 15X the inference performance of previous-generation systems.

💻
8x Blackwell GPUs

NVIDIA Blackwell architecture with NVLink interconnect and 2x NVSwitch

📌
1,440 GB HBM3e

64 TB/s aggregate memory bandwidth across all GPUs

144 PFLOPS FP4

72 PFLOPS FP8 Tensor Core compute power

🚀
14.4 TB/s NVLink

Fifth-generation NVLink aggregate bandwidth

ComfyUI Engine

Industry-standard node-based workflow engine for reproducible generation

🔒
Private & Secure

Self-hosted infrastructure — your data never leaves the server

Full Specifications

GPU: 8x NVIDIA Blackwell GPUs
GPU Memory: 1,440 GB total HBM3e, 64 TB/s aggregate bandwidth
FP4 Tensor Core: 144 PFLOPS (sparse)
FP8 Tensor Core: 72 PFLOPS (sparse)
NVLink: 14.4 TB/s aggregate, 5th generation via 2x NVSwitch
CPU: 2x Intel Xeon Platinum 8570 (112 cores total, 2.1 GHz base / 4 GHz max boost)
System Memory: 2 TB DDR5 (configurable to 4 TB)
Storage: 2x 1.9 TB NVMe M.2 (OS) + 8x 3.84 TB NVMe U.2 (data)
Networking: 8x 400 Gb/s ConnectX-7 + 2x 400 Gb/s BlueField-3 DPU
System Power: ~14.3 kW max
Form Factor: 10 RU rack unit
Software: NVIDIA AI Enterprise + Mission Control + DGX OS
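
The aggregate figures above translate into per-GPU numbers by simple division, sketched below. This assumes the totals are split evenly across the eight GPUs; the variable names are illustrative.

```python
GPU_COUNT = 8

# System-wide totals, copied from the specification list above.
SYSTEM_TOTALS = {
    "hbm3e_gb": 1440,
    "memory_bandwidth_tb_s": 64,
    "fp4_pflops_sparse": 144,
    "fp8_pflops_sparse": 72,
    "nvlink_tb_s": 14.4,
}

# Assumption: every resource is divided evenly between the GPUs.
per_gpu = {key: total / GPU_COUNT for key, total in SYSTEM_TOTALS.items()}

for key, value in per_gpu.items():
    print(f"{key}: {value:g} per GPU")
```

Under that assumption each GPU has 180 GB of HBM3e, which is the practical ceiling for a model that has to fit on a single device without sharding.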

Choose your model and start creating

Create a free account to access all models — text-to-video and image-to-video. Switch between them anytime.