Our AI Models

22 AI video models, 4 image models, and specialist tools — text-to-video, image-to-video, portrait animation, 3D generation and more. Compare specifications and choose the right engine for your project.

WAN

14B Parameter Text-to-Video • Rapid Generation

Speed

WAN is a 14-billion-parameter diffusion model by Alibaba, designed for rapid text-to-video generation. Using a two-pass KSamplerAdvanced pipeline with LoRA-enhanced distilled weights, it produces video in just 4 inference steps, which makes it well suited to fast iteration and experimentation.

The workflow uses the Hunyuan latent video format and applies ModelSamplingSD3 shift control separately to the high-noise and low-noise passes, giving fine control over the generation process without adding steps.

Technical Specifications

Architecture	Two-Pass KSamplerAdvanced + LoRA Distillation
Parameters	14 Billion
Default Resolution	848 × 480
Frame Rate	16 fps
Default Steps	4 (rapid)
Sampler	Euler
Scheduler	Simple
CFG Scale	1.0
Shift	5.0
Output Format	MP4 (via SaveVideo)
Checkpoint	wan2.2-t2v-rapid-aio-v10
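
As a rough illustration of how the Shift and Steps settings above interact, the sketch below applies the time-shift formula commonly used for flow-matching models, sigma' = shift × sigma / (1 + (shift − 1) × sigma), to an evenly spaced 4-step schedule. The even spacing is an assumed stand-in for the Simple scheduler, and the 2 + 2 split between the high-noise and low-noise passes is an assumption; the page does not state where the two passes hand over. The function names are hypothetical.

```python
def shifted_sigmas(steps: int, shift: float) -> list[float]:
    """Return steps + 1 noise levels from 1.0 down to 0.0 after applying the shift."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    sigmas = []
    for i in range(steps + 1):
        t = 1.0 - i / steps  # evenly spaced levels (assumed stand-in for the Simple scheduler)
        sigmas.append(shift * t / (1.0 + (shift - 1.0) * t))
    return sigmas


def split_passes(sigmas: list[float], high_noise_steps: int) -> tuple[list[float], list[float]]:
    """Split one schedule into two passes that share the boundary noise level."""
    return sigmas[: high_noise_steps + 1], sigmas[high_noise_steps:]


if __name__ == "__main__":
    schedule = shifted_sigmas(steps=4, shift=5.0)
    # The 2 + 2 split is an assumption made for illustration only.
    high, low = split_passes(schedule, high_noise_steps=2)
    print("full schedule:", [round(s, 4) for s in schedule])
    print("high-noise pass:", [round(s, 4) for s in high])
    print("low-noise pass:", [round(s, 4) for s in low])
```

A larger shift keeps more of the few available steps at high noise levels, where overall layout and motion are decided, which is one reason a 4-step schedule can still produce coherent video.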

Best For

⚡

Rapid Prototyping

Get results in seconds with just 4 steps

🔁

Iterative Refinement

Quickly test different prompts and settings

🎬

Motion Quality

Natural movement with two-pass sampling

📈

Batch Generation

Low step count enables high throughput

LTX 2.3

Distilled Two-Pass Pipeline • HD Output

Quality

LTX 2.3 is a two-pass generation pipeline: it first generates at a lower resolution, then upsamples to crisp 1280×720 HD video. The LTXVLatentUpsampler keeps details sharp and motion consistent through the upscaling process.

It uses Euler Ancestral CFG++ sampling with manual sigma scheduling for precise noise control, dual CFGGuider nodes for both passes, and LoRA-enhanced model weights for optimal quality.

Technical Specifications

Architecture	Two-Pass with LTXVLatentUpsampler
Text Encoder	LTXAVTextEncoder (Dedicated)
Default Resolution	1280 × 720 (HD)
Frame Rate	25 fps
Default Steps	8 per pass
Sampler	Euler Ancestral CFG++
Scheduler	Manual Sigmas
CFG Scale	1.0 (both passes)
Shift	2.05
Output Format	MP4 (via SaveVideo)
Enhancement	LoRA Model Weights

Best For

💎

High Quality Output

Two-pass pipeline for crisp HD results

📷

Final Productions

Publication-ready 1280×720 video

🛠

Detail Preservation

LoRA enhancement for fine details

🎥

Professional Look

Advanced sigma scheduling for smooth motion

LTX Quality

Two-Pass Pipeline • Up to 4K Output

LTX Quality is the premium configuration of the LTX 2.3 pipeline, generating at half resolution then performing a 2× spatial upscale for ultra-sharp output up to 4K (3840×2176). It uses LoRA-enhanced distilled weights and a dedicated LTXVLatentUpsampler for the refinement pass.

The two-pass approach with separate CFGGuider nodes, configurable sigma schedules, and optional tiled VAE decoding makes it ideal for high-resolution final production work.

Technical Specifications

Architecture	Two-Pass with 2× LTXVLatentUpsampler
Text Encoder	LTXAVTextEncoder (Dedicated)
Default Resolution	960 × 544 → 1920 × 1088
Max Resolution	1920 × 1088 → 3840 × 2176 (4K)
Frame Rate	24 fps
Default Steps	20 (first pass)
Sampler	Euler
Scheduler	LTXVScheduler (custom shift)
CFG Scale	3.0 (first pass) / 1.0 (upscale)
Shift Range	0.95 – 2.05
Output Format	MP4 (via SaveVideo)
Enhancement	LoRA Distill + Spatial Upscaler
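
The resolutions in the table are slightly taller than the familiar 1080p and 4K sizes (1088 and 2176 rather than 1080 and 2160). One plausible explanation, which is an inference and not something the page states, is that the first pass is rounded up to a multiple of 32 pixels before the 2× upscale. The sketch below reproduces the table's numbers under that assumption; the function names are hypothetical.

```python
def snap_up(value: int, multiple: int = 32) -> int:
    """Round value up to the nearest multiple (assumed alignment of 32 pixels)."""
    return -(-value // multiple) * multiple


def two_pass_plan(target_w: int, target_h: int, scale: int = 2,
                  multiple: int = 32) -> tuple[tuple[int, int], tuple[int, int]]:
    """Return (first-pass size, final size) for a target output resolution."""
    base_w = snap_up(-(-target_w // scale), multiple)  # ceil-divide, then align
    base_h = snap_up(-(-target_h // scale), multiple)
    return (base_w, base_h), (base_w * scale, base_h * scale)


if __name__ == "__main__":
    for target in [(1920, 1080), (3840, 2160)]:
        first_pass, final = two_pass_plan(*target)
        print(f"target {target}: first pass {first_pass}, final {final}")
```

Under this assumption a 1920×1080 request maps to the default 960×544 → 1920×1088 path, and a 3840×2160 request maps to the maximum 1920×1088 → 3840×2176 path.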

Best For

💎

4K Production

Generate up to 3840×2176 with 2× upscale

🛠

Maximum Detail

LoRA + spatial upscaler preserves fine details

🎥

Professional Output

Configurable sigma schedules for smooth motion

⚙

Full Customization

Separate controls for both generation passes

Hunyuan Video

13B Parameter Text-to-Video • High Motion Quality

Motion

Hunyuan Video is a 13-billion parameter diffusion model by Tencent, built on a Dual CLIP text encoder architecture (CLIP-L + LLaVA-LLaMA3) for rich text understanding. It uses FluxGuidance for conditioning control and ModelSamplingSD3 for shift-based sampling.

The model generates at 848×480 by default, with tiled VAE decoding enabled to manage VRAM. It produces videos with strong motion quality and character consistency, which makes it well suited to anime-style and character-driven animation.

Technical Specifications

Architecture	Hunyuan Video + SD3 Sampling + FluxGuidance
Parameters	13 Billion
Text Encoder	Dual CLIP (CLIP-L + LLaVA-LLaMA3)
Default Resolution	848 × 480
Frame Rate	24 fps
Default Steps	20
Sampler	Euler (SamplerCustomAdvanced)
Scheduler	Simple (BasicScheduler)
Guidance	6.0 (FluxGuidance)
Shift	7.0
VAE Decode	Tiled (256 tile, 64 overlap)
Output Format	MP4 (via SaveVideo)
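
To give a feel for what the tiled VAE decode setting means in practice, the sketch below counts how many overlapping tiles cover one 848×480 frame with a 256 tile and 64 overlap. It assumes the tile size is measured in output pixels and that tiles advance by tile minus overlap, with a final tile aligned to the edge; the real node also blends the overlapping regions, which is not modelled here. The function names are hypothetical.

```python
from itertools import product


def tile_starts(size: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of tiles covering [0, size) along one axis."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    stride = tile - overlap
    starts = list(range(0, max(size - tile, 0) + 1, stride))
    if starts[-1] + tile < size:
        starts.append(size - tile)  # extra tile aligned to the far edge
    return starts


def tile_grid(width: int, height: int, tile: int = 256, overlap: int = 64) -> list[tuple[int, int]]:
    """Top-left (x, y) corner of every tile needed to cover one frame."""
    return list(product(tile_starts(width, tile, overlap),
                        tile_starts(height, tile, overlap)))


if __name__ == "__main__":
    grid = tile_grid(848, 480)
    print(len(grid), "tiles per frame")
    print("first corners:", grid[:3])
```

With these settings each frame is covered by 5 × 3 = 15 tiles, so peak decode memory scales with a 256×256 tile rather than with the full frame, at the cost of extra decode passes.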

Best For

🎬

Smooth Motion

Exceptional temporal coherence and fluid movement

🎨

Anime & Characters

Excels at character animation and anime styles

📚

Rich Prompts

Dual CLIP encoder understands complex descriptions

📈

Consistency

Maintains character identity across frames

Hunyuan 1.5

Next-Gen Text-to-Video • Native 720p • Dual CLIP v2

New

Hunyuan 1.5 is the next generation of Tencent’s video diffusion model, featuring an upgraded Dual CLIP encoder (Qwen 2.5 VL 7B + ByT5 Glyph XL) for significantly improved text comprehension and prompt following. It generates natively at 720p (1280×720) without upscaling.

The model uses CFGGuider with SamplerCustomAdvanced for precise generation control, and includes an optional super-resolution path to upscale output to 1080p using a dedicated latent upsampler.

Technical Specifications

Architecture	Hunyuan Video 1.5 + SD3 Sampling + CFGGuider
Text Encoder	Dual CLIP (Qwen 2.5 VL 7B + ByT5 Glyph XL)
Default Resolution	1280 × 720 (HD)
Frame Rate	24 fps
Default Steps	20
Sampler	Euler (SamplerCustomAdvanced)
Scheduler	Simple (BasicScheduler)
CFG Scale	6.0
Shift	7.0
VAE Decode	Standard (VAEDecode)
Optional Upscale	1080p Super Resolution (disabled by default)
Output Format	MP4 (via SaveVideo)

Best For

📚

Superior Prompts

Qwen 2.5 VL encoder understands complex, detailed descriptions

🎬

Native HD

Generates at 1280×720 without upscaling artefacts

📈

Smooth Motion

Inherits Hunyuan’s exceptional temporal coherence

💎

Optional 1080p

Built-in super-resolution upscale path when needed

Additional Models

More AI Engines

CogVideoX

Tsinghua University • Open-Source

Research

CogVideoX is a high-quality open-source video generation model from Tsinghua University. It delivers strong text comprehension and artistic output quality with a straightforward single-pass pipeline.

The model excels at stylised and artistic video generation, producing visually striking results with excellent prompt adherence for creative and research use cases.

Technical Specifications

Architecture	CogVideoX Transformer
Default Resolution	720 × 480
Frame Rate	8 fps
Default Steps	50
CFG Scale	6
Frame Count	49 frames (~6s)
Sampler	Euler
Output Format	MP4

Best For

🎨

Artistic Styles

Excels at stylised and creative video output

💬

Text Comprehension

Strong understanding of complex prompts

🔬

Research Quality

Open-source model with academic backing

🎬

Longer Clips

49 frames for ~6 seconds of output

AnimateDiff

Stable Diffusion Animation • Motion Module

AnimateDiff turns any Stable Diffusion checkpoint into a video animation engine by inserting a temporal motion module. This enables the vast SD ecosystem of models, LoRAs, and styles to produce animated output.

With fast 20-step generation and a lightweight architecture, AnimateDiff is ideal for quick iterations, anime-style content, and leveraging the massive library of community SD checkpoints.

Technical Specifications

Architecture	SD 1.5 + Motion Module
Default Resolution	512 × 512
Frame Rate	8 fps
Default Steps	20
CFG Scale	7
Frame Count	32 frames (~4s)
Sampler	Euler
Output Format	MP4

Best For

⚡

Quick Iterations

20-step generation for rapid prototyping

🎨

Anime & Stylised

Leverage SD checkpoints for any art style

🛠

SD Ecosystem

Compatible with community LoRAs and models

🔁

Short Clips

Perfect for 4-second animated sequences

Image-to-Video Models

Reference Image Animation • Multi-Engine

I2V

Image-to-Video (I2V) models animate a reference image into motion video. Upload a still image and the AI generates natural movement, camera motion, and scene dynamics while preserving the original composition.

Available I2V Engines

Engine	Output	Notes
Hunyuan 1.5 I2V	1280×720 • 24 fps • 121 frames	Best motion quality from a reference image
LTX 2.3 I2V	1280×720 • 25 fps • 121 frames	Fast 8-step inference, great detail
SVD XT I2V	1024×576 • 6 fps • 25 frames	Stability AI's Stable Video Diffusion
WAN 2.2 I2V	832×480 • 16 fps • 81 frames	14B parameter model with Lightx2v turbo
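
Frame counts and frame rates only become comparable once converted to clip length. The short sketch below does that arithmetic (duration = frames ÷ fps) for the four engines listed above; the dictionary and function names are hypothetical.

```python
# (frames, fps) pairs copied from the engine list above.
I2V_ENGINES = {
    "Hunyuan 1.5 I2V": (121, 24),
    "LTX 2.3 I2V": (121, 25),
    "SVD XT I2V": (25, 6),
    "WAN 2.2 I2V": (81, 16),
}


def clip_seconds(frames: int, fps: int) -> float:
    """Playback length in seconds of a clip with the given frame count and rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps


if __name__ == "__main__":
    for name, (frames, fps) in I2V_ENGINES.items():
        print(f"{name}: {clip_seconds(frames, fps):.2f} s")
```

All four engines land between roughly 4 and 5 seconds per clip, so the practical differences are resolution and smoothness rather than length.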

Advanced Video Modes

First-Last Frame • Sound-to-Video • Camera Control

Multi-Mode

Specialised generation modes that extend beyond simple text or image inputs. Interpolate between keyframes, drive video from audio, or control camera movements.

Available Modes

Mode	Output	Notes
LTX First-Last Frame	768×512 • 25 fps • 97 frames	Interpolate between two keyframe images with prompt guidance
WAN First-Last Frame	832×480 • 16 fps • 81 frames	WAN 14B first & last frame interpolation
WAN Sound-to-Video	832×480 • 16 fps • 81 frames	Generate talking-head video from portrait + audio
LTX Camera Control	768×512 • 25 fps • 97 frames	Camera dolly (in, out, left, right) via LoRA
LTX ControlNet	768×512 • 25 fps • 97 frames	Depth, edge, or pose-guided generation
WAN VACE	832×480 • 16 fps • 81 frames	Video conditioning & editing with subject/scene control
WAN Animate	832×480 • 16 fps • 81 frames	Character animation from a single reference image

Specialised Models

Portrait Animation • Face Transfer • 3D Generation

Specialist

Purpose-built models for specific creative tasks including portrait animation, face identity transfer, audio-driven talking heads, and 3D asset generation.

Available Specialist Models

Model	Output	Notes
LivePortrait	512×512 • 24 fps • 81 frames	Animate portrait photos with facial expressions from a driving video
DreamID-V	832×480 • 24 fps • 81 frames	Face identity transfer — insert your face into generated video
EchoMimic	512×512 • 25 fps • 100 frames	Audio-driven talking portrait from a single photo + audio clip
Hunyuan3D v2	512×512 single output	Generate 3D model assets from text or reference image

Image Generation Models

AI Image Generation • Text-to-Image & Image Editing

Images

Four image generation models ranging from the fast and lightweight Stable Diffusion 1.5 to the powerful Flux 2 family. Use the Image Edit model to modify existing images with text instructions.

Available Image Models

Model	Default Settings	Notes
Stable Diffusion 1.5	512×512 • 25 steps • CFG 7	Fast & lightweight • 2 credits
Flux 2 Klein	1024×1024 • 20 steps • CFG 4	Balanced quality & speed • 5 credits
Flux 2 Dev	1024×1024 • 28 steps • CFG 4	Highest quality • 8 credits
Flux 2 Image Edit	1024×1024 • 20 steps • CFG 4	Edit images with text instructions • 6 credits
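
For budgeting, the credit figures above can be turned into a quick estimate. The sketch below assumes each listed price is charged per generated image, which the page implies but does not state outright; the names are hypothetical.

```python
# Credit prices copied from the list above; assumed to be charged per image.
CREDITS_PER_IMAGE = {
    "Stable Diffusion 1.5": 2,
    "Flux 2 Klein": 5,
    "Flux 2 Dev": 8,
    "Flux 2 Image Edit": 6,
}


def batch_cost(model: str, images: int) -> int:
    """Total credits for generating the given number of images."""
    if images < 0:
        raise ValueError("images must be non-negative")
    return CREDITS_PER_IMAGE[model] * images


def affordable_images(model: str, budget: int) -> int:
    """How many whole images a credit budget covers."""
    return budget // CREDITS_PER_IMAGE[model]


if __name__ == "__main__":
    print(batch_cost("Flux 2 Dev", 10), "credits for 10 Flux 2 Dev images")
    print(affordable_images("Flux 2 Klein", 23), "Flux 2 Klein images for 23 credits")
```

On these prices, drafting with Stable Diffusion 1.5 and finishing with Flux 2 Dev costs a quarter as much per draft as iterating in Flux 2 Dev directly.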

Side by Side

Model Comparison

Feature	WAN	LTX 2.3	LTX Quality	Hunyuan	Hunyuan 1.5
Max Resolution	848 × 480	1280 × 720	3840 × 2176 (4K)	848 × 480	1280 × 720
Frame Rate	16 fps	25 fps	24 fps	24 fps	24 fps
Inference Steps	4 (fastest)	8 × 2 passes	20 + upscale pass	20	20
Generation Speed	Fastest	Moderate	Slowest	Slow	Moderate
Output Quality	Good	Excellent	Best	Excellent	Excellent
Motion Quality	Good	Good	Excellent	Best	Best
Sampler	Euler	Euler Ancestral CFG++	Euler	Euler	Euler
Upscaling	None	Built-in (LTXVLatentUpsampler)	2× Spatial Upscaler	None	Optional 1080p SR
Best Use Case	Prototyping & iteration	HD production	4K final production	Character & anime	HD motion & prompts
Image-to-Video	✓ Available	✓ Available	Coming Soon	Coming Soon	✓ Available
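
The comparison above can also be read as a simple filter: given a minimum output size and whether image-to-video is needed, which engines qualify? The sketch below encodes only the Max Resolution, Frame Rate, and Image-to-Video rows; the class and function names are hypothetical, and "Coming Soon" is treated as not available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoModel:
    name: str
    max_width: int
    max_height: int
    fps: int
    image_to_video: bool  # "Coming Soon" is recorded as False


# Values copied from the comparison table above.
MODELS = [
    VideoModel("WAN", 848, 480, 16, True),
    VideoModel("LTX 2.3", 1280, 720, 25, True),
    VideoModel("LTX Quality", 3840, 2176, 24, False),
    VideoModel("Hunyuan", 848, 480, 24, False),
    VideoModel("Hunyuan 1.5", 1280, 720, 24, True),
]


def candidates(min_width: int, min_height: int, need_i2v: bool = False) -> list[str]:
    """Names of models that can reach the requested size (and I2V, if required)."""
    return [
        m.name
        for m in MODELS
        if m.max_width >= min_width
        and m.max_height >= min_height
        and (m.image_to_video or not need_i2v)
    ]


if __name__ == "__main__":
    print("HD text-to-video:", candidates(1280, 720))
    print("HD image-to-video:", candidates(1280, 720, need_i2v=True))
    print("4K:", candidates(3840, 2160))
```

The subjective rows (speed, output quality, motion quality) are deliberately left out of the filter, since they are relative rankings and not hard limits.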

Infrastructure

Powered by NVIDIA DGX B200

NVIDIA DGX B200

8x Blackwell GPUs • 1,440 GB HBM3e • 144 PFLOPS

The foundation for your AI factory. NVIDIA DGX B200 is equipped with eight NVIDIA Blackwell GPUs interconnected with fifth-generation NVLink, delivering 3X the training performance and 15X the inference performance of previous-generation systems.

💻

8x Blackwell GPUs

NVIDIA Blackwell architecture with NVLink interconnect and 2x NVSwitch

📌

1,440 GB HBM3e

64 TB/s aggregate memory bandwidth across all GPUs

⚡

144 PFLOPS FP4

72 PFLOPS FP8 Tensor Core compute power

🚀

14.4 TB/s NVLink

Fifth-generation NVLink aggregate bandwidth

⚙

ComfyUI Engine

Industry-standard node-based workflow engine for reproducible generation

🔒

Private & Secure

Self-hosted infrastructure — your data never leaves the server

Full Specifications

GPU	8x NVIDIA Blackwell GPUs
GPU Memory	1,440 GB total HBM3e — 64 TB/s aggregate bandwidth
FP4 Tensor Core	144 PFLOPS (sparse)
FP8 Tensor Core	72 PFLOPS (sparse)
NVLink	14.4 TB/s aggregate — 5th generation via 2x NVSwitch
CPU	2x Intel Xeon Platinum 8570 (112 cores total, 2.1 GHz base / 4 GHz max boost)
System Memory	2 TB DDR5 (configurable to 4 TB)
Storage	2x 1.9 TB NVMe M.2 (OS) + 8x 3.84 TB NVMe U.2 (data)
Networking	8x 400 Gb/s ConnectX-7 + 2x 400 Gb/s BlueField-3 DPU
System Power	~14.3 kW max
Form Factor	10 RU (rack units)
Software	NVIDIA AI Enterprise + Mission Control + DGX OS
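
The headline figures above are system totals across all eight GPUs. The sketch below divides them out to per-GPU values, assuming the totals are spread evenly across the GPUs; the names are hypothetical.

```python
GPU_COUNT = 8

# System-wide totals copied from the specification table above.
SYSTEM_TOTALS = {
    "HBM3e memory (GB)": 1440,
    "Memory bandwidth (TB/s)": 64,
    "FP4 compute, sparse (PFLOPS)": 144,
    "FP8 compute, sparse (PFLOPS)": 72,
    "NVLink bandwidth (TB/s)": 14.4,
}


def per_gpu(totals: dict[str, float], gpu_count: int = GPU_COUNT) -> dict[str, float]:
    """Divide each system total evenly across the GPUs (assumed even split)."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return {name: value / gpu_count for name, value in totals.items()}


if __name__ == "__main__":
    for name, value in per_gpu(SYSTEM_TOTALS).items():
        print(f"{name}: {value:g} per GPU")
```

The per-GPU memory figure (180 GB) is the one that matters most for model selection, since a single generation job typically has to fit its weights and latents on the GPU it runs on.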

Choose your model and start creating

Create a free account to access all models — text-to-video and image-to-video. Switch between them anytime.