LTX 2.3 AI Video Generator
Input/Output Moderation & Blocking Enabled
Create high-fidelity AI video from text or images with synchronized audio, up to 4K resolution, and compliance-first safety controls. Compliance update: where required, we replace restricted model routes; otherwise, model access is retained only with a full moderation evidence package, including input/output review and blocking controls.
From Idea to Finished Clip in Four Steps
No technical background needed. Our LTX 2.3 video generator handles the heavy lifting.
Write Your Prompt
Describe the scene you want to create — camera angle, lighting, mood, character motion, and environment. The more specific your prompt, the more control you have over the output. LTX 2.3's gated attention text connector parses spatial relationships, timing cues, and emotional tone with significantly higher accuracy than previous models. You can also upload a reference image instead of writing from scratch.
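For instance, a prompt along these lines covers camera, lighting, mood, and a timing cue. The example text is illustrative, not an official template:

```python
# Illustrative prompt only; the structure (shot, lighting, mood, timing cue)
# is a suggestion, not a required format.
prompt = (
    "Low-angle tracking shot of a cyclist crossing a rain-slicked city street "
    "at dusk. Neon signs reflect off the wet asphalt, shallow depth of field, "
    "melancholic mood. After 3 seconds the camera pans left to reveal traffic."
)
```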
Configure Settings
Choose the aspect ratio that fits your platform — 16:9 for YouTube and landscape content, or 9:16 for TikTok, Reels, and Shorts. Set your clip length from 6 to 20 seconds, select your target resolution (1080p, 2K, or 4K), and pick between Fast mode for quick iterations or Pro mode for production-ready output. Frame rates are selectable at 24, 25, 48, or 50 FPS.
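As a rough sketch, the settings in this step reduce to a handful of parameters. The field names below are hypothetical and only illustrate the documented ranges:

```python
# Hypothetical settings object; the key names are illustrative, not this
# platform's documented API. Values reflect the ranges described above.
settings = {
    "aspect_ratio": "9:16",   # "16:9" for YouTube, "9:16" for TikTok/Reels/Shorts
    "duration_seconds": 12,   # clip length: 6 to 20 seconds per single pass
    "resolution": "4K",       # "1080p", "2K", or "4K"
    "fps": 25,                # 24, 25, 48, or 50
    "mode": "pro",            # "fast" for quick drafts, "pro" for final renders
}
```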
Generate with LTX 2.3
The Diffusion Transformer model encodes your prompt through the gated attention text connector and decodes frames through the redesigned VAE, synthesizing video with synchronized audio — ambient sound, music, and effects — in a single forward pass. No separate audio post-production is needed. The entire generation runs on cloud GPUs, so your local machine stays free.
Preview & Export
Watch your generated clip in the browser player. If the result needs adjustment, refine your prompt or tweak settings and regenerate with a single click. Once you're satisfied, download the final video in your chosen resolution — ready to upload directly to any social media platform, video editor timeline, or presentation deck.
Built for Creators Who Demand More
A redesigned VAE, 4× larger text connector, and native portrait support — all in a single open-source release.
Redesigned VAE for Visual Fidelity
LTX 2.3 ships with a completely rebuilt Variational Autoencoder that reconstructs finer spatial details than any previous version. Hair strands, fabric weave, skin pores, specular highlights on metallic surfaces — all rendered with noticeably sharper edges and more consistent textures. The result is output that holds up at 4K resolution without smearing or artifact banding, even in fast-motion sequences.
4× Larger Text Connector
The gated attention text connector in LTX 2.3 is four times the size of the previous generation. This means the model actually follows complex, multi-clause prompts — timing cues like 'after 3 seconds the camera pans left,' spatial instructions like 'subject in the foreground, mountains behind,' and emotional directions like 'melancholy lighting' translate reliably into the generated video rather than being simplified or ignored.
Image-to-Video Generation
Upload any still image — a product photo, concept art, portrait, or landscape — and LTX 2.3 generates natural, physically plausible motion from the source frame. Unlike earlier models that relied on cheap zoom-and-pan tricks, LTX 2.3 produces genuine object motion: liquids pour, fabrics billow, characters turn, and cameras track through the scene. Visual consistency with the source image is maintained throughout the clip.
Synchronized Audio Generation
LTX 2.3 generates matching audio in the same forward pass as video — no separate audio model, no manual syncing. The output includes ambient environmental sound, music beds, and sound effects that align with on-screen motion. Ocean waves crash as they hit the shore, footsteps land in time with walking, and engines rev as vehicles accelerate. The audio arrives production-ready, mixed at broadcast-safe levels.
Native Portrait Mode
Vertical video is no longer an afterthought. LTX 2.3 was trained with native 9:16 data, which means portrait video comes out properly composed from the start — centered subjects, correct headroom, no letterboxing or awkward crops. TikTok, Instagram Reels, and YouTube Shorts creators can generate scroll-stopping vertical content without any post-production reframing.
Open-Source Under Apache 2.0
LTX 2.3 is released under the Apache 2.0 license by Lightricks. Model weights are available on Hugging Face, and training code plus ComfyUI custom nodes are on GitHub. No vendor lock-in, full weight transparency, and freedom to self-host for enterprise or research use cases.
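For self-hosters, a minimal text-to-video sketch using the Hugging Face diffusers integration for the LTX-Video family could look like the following. Treat the checkpoint id and parameter values as assumptions; confirm the id for this release and the recommended settings on the model card:

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

# Checkpoint id assumed from the LTX-Video family on Hugging Face;
# verify the id for this release on the model card before running.
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")

video = pipe(
    prompt="Aerial flyover of a fjord at sunrise, soft volumetric light",
    width=704,
    height=480,
    num_frames=161,           # frame count for roughly a 6-7 s clip at 24 FPS
    num_inference_steps=50,
).frames[0]

export_to_video(video, "fjord.mp4", fps=24)
```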
LTX 2.3 Capabilities at a Glance
A major quality upgrade to the LTX-Video family — redesigned VAE, 4× larger text connector, native portrait, and synchronized audio.
From Words to Cinematic Clips
Describe a scene in plain English and LTX 2.3 renders it into a high-fidelity video clip with matching audio. The upgraded text connector catches subtle prompt details — camera angles, timing cues, emotional tone — that earlier models would ignore.
- Natural language scene descriptions
- Camera movement & timing control
- Synchronized audio generation
- Fast & Pro quality modes
Supported Workflows
LTX 2.3 supports multiple generation workflows beyond basic text-to-video. Each mode is designed for a specific creative use case and can be combined for more complex productions.
Text to Video (Core)
Describe any scene in natural language — specify camera movement, lighting conditions, subject action, environment, and mood. LTX 2.3 translates your text prompt into a high-fidelity video clip with matching synchronized audio. The 4× larger gated attention text connector ensures complex, multi-clause prompts are followed accurately rather than simplified.
Image to Video (Core)
Upload a reference image and add an optional motion prompt. LTX 2.3 generates physically plausible motion from the source frame — liquids flow, fabrics move, characters animate — while maintaining visual consistency with the original image. Ideal for animating product photos, portfolio stills, concept art, and design mockups.
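A comparable self-hosting sketch for this mode uses the diffusers image-to-video pipeline for the LTX-Video family. As with the text-to-video sketch above, the checkpoint id is an assumption to verify on the model card:

```python
import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Same caveat as the text-to-video sketch: the checkpoint id is assumed
# from the LTX-Video family; confirm it on the Hugging Face model card.
pipe = LTXImageToVideoPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = load_image("product_photo.png")  # the still you want to animate
video = pipe(
    image=image,
    prompt="Slow dolly-in on the bottle as condensation beads roll down the glass",
    width=704,
    height=480,
    num_frames=161,
    num_inference_steps=50,
).frames[0]

export_to_video(video, "product_demo.mp4", fps=24)
```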
Audio to Video (New)
Provide an audio clip and LTX 2.3 generates matching video synchronized to the sound. Music videos, audio visualizations, and sound-driven content creation become straightforward — the model aligns visual motion and timing to audio beats, tempo changes, and rhythmic patterns.
Video Extension (Advanced)
Extend an existing clip beyond its original duration. LTX 2.3 continues the scene with consistent motion, lighting, and subject positioning, enabling you to produce longer sequences from a single starting generation. Chain multiple extensions together for clips well beyond the 20-second single-pass limit.
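The chaining pattern could look like the sketch below. generate() and extend() are hypothetical stand-ins for whatever your pipeline exposes, shown only to illustrate the loop; neither is a documented function:

```python
# generate() and extend() are hypothetical helpers, not a documented API.
# The pattern: each extension is conditioned on the previous clip, so motion,
# lighting, and subject positioning stay consistent across segments.
clip = generate(prompt, seconds=20)  # first single-pass segment (20 s max)
for _ in range(3):
    clip = extend(clip, seconds=20)  # continue the scene from the final frames
clip.save("extended_sequence.mp4")   # roughly 80 s across four segments
```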
Who's Using LTX 2.3
From solo TikTok creators to agency teams, LTX 2.3 is showing up in real production workflows.
Short-Form & Social Creators
Generate native 9:16 portrait video with synchronized sound for TikTok, Instagram Reels, and YouTube Shorts. Create scroll-stopping content in minutes instead of hours — from concept to publishable clip without a camera, studio, or editing suite. Fast mode lets you iterate on ideas rapidly before committing to a final Pro-quality render.
YouTube & Long-Form Producers
Turn prompt descriptions into cinematic B-roll — aerial flyovers, product close-ups, abstract transitions, and establishing shots — that cut seamlessly into your existing timeline. LTX 2.3 output matches standard frame rates (24/25/48/50 FPS) and resolutions, so there's no quality mismatch when interleaving generated footage with camera-shot material.
Educators & Course Builders
Create illustrative video clips for online courses, presentations, and training materials — molecular animations, historical recreations, scientific visualizations, and abstract concept explainers — without hiring a motion graphics team. The synchronized audio feature automatically adds relevant narration-ready background sound, saving additional production steps.
E-Commerce & Marketing Teams
Transform static product photos into cinematic demo videos with realistic lighting, camera movement, and environmental context. Generate multiple ad creative variations for A/B testing across platforms in a fraction of the time and cost of traditional video production. Image-to-video mode preserves product accuracy while adding compelling visual motion.
Indie Filmmakers & Animators
Pre-visualize shots from script to screen before committing budget to live production. Camera angles, focal length, scene timing, and character staging translate reliably from text prompts, giving directors and cinematographers a tangible reference to align their crew around.
Agencies & Freelancers
Produce polished concept videos for client pitches and stakeholder reviews in minutes rather than days. Fast mode generates rough drafts for creative alignment, while Pro mode delivers presentation-quality output. No rendering queues, no expensive stock footage licenses — just describe the vision and generate.
Technical Specifications
LTX 2.3 is built on a Diffusion Transformer (DiT) architecture optimized for spatiotemporal video generation. Below are the core technical details behind the model's capabilities.
Architecture
Diffusion Transformer (DiT)
Optimized transformer backbone designed for spatiotemporal video generation with efficient attention patterns.
VAE
Redesigned in v2.3
Rebuilt Variational Autoencoder with improved latent space reconstruction — sharper edges, cleaner textures, fewer artifacts at high resolution.
Text Connector
4× Gated Attention
Gated attention mechanism four times larger than the previous generation, delivering substantially better prompt adherence for complex multi-clause descriptions.
Max Resolution
3840 × 2160 (4K)
Supports 1080p, 2K, and 4K output resolutions across landscape (16:9) and portrait (9:16) aspect ratios.
Frame Rates
24 / 25 / 48 / 50 FPS
Flexible frame rate control matching broadcast (24/25 FPS) and high-frame-rate (48/50 FPS) standards.
Clip Length
Up to 20 seconds
Single-pass generation up to 20 seconds. Extend-video workflow enables longer sequences by chaining multiple generations.
Audio
Synchronized single-pass
Ambient sound, music beds, and sound effects generated in the same forward pass as video — no separate audio model or post-production sync required.
License
Apache 2.0
Fully open-source. Model weights on Hugging Face, training code and ComfyUI nodes on GitHub. Free for commercial and research use.
Frequently Asked Questions
What is LTX 2.3?
LTX 2.3 is the latest open-source video generation model from the LTX-Video family, originally developed by Lightricks. It is built on a Diffusion Transformer (DiT) architecture and can generate high-fidelity video with synchronized audio from text prompts, images, or audio clips in a single forward pass. The model is released under Apache 2.0 and weights are freely available on Hugging Face.
Do I need to install anything to use it?
No. Our platform runs LTX 2.3 on cloud-hosted GPUs. You access everything through your web browser — no software downloads, no local GPU required, and no command-line setup. You can generate videos from any device with an internet connection, including laptops, tablets, and smartphones.
What resolutions, frame rates, and aspect ratios are supported?
LTX 2.3 supports output up to 4K (3840 × 2160) at frame rates of 24, 25, 48, or 50 FPS. Available aspect ratios include landscape (16:9) and portrait (9:16). Portrait mode is natively trained — not cropped from landscape — so vertical video is properly composed from the start.
How long can a generated clip be?
A single generation pass produces clips up to 20 seconds. For longer sequences, you can use the video extension workflow to chain multiple generations together, maintaining scene consistency across extensions. There is no hard limit on the total length of extended clips.
How does the synchronized audio work?
LTX 2.3 generates video and audio in the same forward pass. The model synthesizes ambient sound, music, and sound effects that align with the visual content — waves crash on screen when water hits the shore, engines rev when vehicles accelerate, and atmospheric music matches the mood of the scene. No separate audio tool or manual syncing is needed.
What is the difference between text-to-video and image-to-video?
Text-to-video generates a clip entirely from a text description. Image-to-video takes a reference photo as a starting point and generates natural motion from that frame — the camera can move, subjects can animate, and the environment evolves while staying visually consistent with the original image. You can combine both by uploading an image and adding a text prompt to direct the motion.
Can I use LTX 2.3 commercially?
Yes. LTX 2.3 is released under the Apache 2.0 license, which permits commercial and research use without restriction. Model weights are published on Hugging Face, and the full training code, ComfyUI custom nodes, and reference inference workflows are available on the official GitHub repository at github.com/Lightricks/LTX-Video.
Your Next Video Starts with a Prompt
Type a scene, drop in a photo, or pick a template — and let LTX 2.3 handle the rest. Up to 4K, synchronized audio, native portrait mode. No credit card required.