Ramvoy Docs

How Ramvoy works.

Ramvoy uses a multi-model AI pipeline. The planner chooses the right tools for each job, the runtime prepares and executes each step, and an internal assembly step renders the final output.

Total systems: 14. Across generation, audio, video, and final assembly.

AI models: 13. Used for images, video, speech, and music.

Internal systems: 1. Final assembly is handled internally with FFmpeg.

Workflow

From prompt to final output.

1. User input

The flow starts with the user selecting an agent type, writing a prompt, choosing a target length, and optionally uploading images.

2. Planner creates a structured plan

The planning API converts the request into a structured execution plan with strategy, reasoning, assets needed, steps, and final assembly rules.
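
As an illustration, such a plan could be represented as a plain dictionary. The field names below mirror the description above (strategy, reasoning, assets needed, steps, assembly rules) but are assumptions for illustration, not Ramvoy's actual schema:

```python
# Hypothetical shape of a planner output. Keys and values are illustrative
# stand-ins, not the real Ramvoy plan format.
plan = {
    "strategy": "narrated_slideshow",
    "reasoning": "Static visuals plus voiceover fit the 30-second target.",
    "assets_needed": ["images", "narration", "music"],
    "steps": [
        {"id": "s1", "model": "flux_2_pro", "prompt": "city skyline at dusk"},
        {"id": "s2", "model": "tts_1_5_max", "text": "Welcome to the city."},
        {"id": "s3", "model": "lyria_2", "prompt": "calm ambient soundtrack"},
    ],
    "assembly": {"tool": "internal_ffmpeg", "target_length_seconds": 30},
}
```

Each step names a model by id, which is what lets the later adapter and execution stages stay generic.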

3. The plan selects models

Each step in the plan points to the best model for that task, such as Flux for images, TTS for narration, Kling or PixVerse for video, or FFmpeg for final assembly.

4. Runtime adapters prepare inputs

Each model has an adapter that converts plan step data into the exact input shape required by the runtime or provider.
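
A minimal sketch of what one such adapter could look like, assuming a plan-step dict like the one the planner emits. The function name and payload keys are hypothetical, not the actual provider API:

```python
def flux_adapter(step: dict) -> dict:
    """Map a generic plan step to a hypothetical Flux request payload.

    The payload keys (prompt, num_images, image_urls) are illustrative
    assumptions, not the real provider input shape.
    """
    payload = {
        "prompt": step["prompt"],
        "num_images": step.get("count", 1),
    }
    # Flux 2 Pro supports multiple reference images, so pass them through
    # when the plan step provides any.
    if step.get("reference_images"):
        payload["image_urls"] = step["reference_images"]
    return payload

request = flux_adapter({"id": "s1", "model": "flux_2_pro",
                        "prompt": "city skyline at dusk"})
```

Keeping one adapter per model means the planner never needs to know any provider's exact input shape.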

5. Assets are generated and stored

Outputs like images, clips, and audio are generated step by step and stored so later steps can reuse them.
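
In sketch form, step-by-step storage can be as simple as keying each output by its step id (all names here are illustrative):

```python
# Each generation step records its output under the step id so that later
# steps (and the final assembly) can look the asset up by reference.
assets: dict[str, str] = {}

def run_step(step_id: str, producer) -> str:
    """Run one generation step and store where its output landed."""
    assets[step_id] = producer()  # e.g. a storage URL or file path
    return assets[step_id]

run_step("s1", lambda: "/assets/s1/image.png")      # image step
run_step("s2", lambda: "/assets/s2/narration.mp3")  # narration step
```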

6. Final assembly renders the deliverable

The internal FFmpeg step combines visuals, narration, music, timing, and export settings into the final output.
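
To make the idea concrete, here is the kind of FFmpeg invocation the assembly step could build: loop a still image into a video segment, duck the music, and mix in the narration. The paths, durations, and filter settings are illustrative, not Ramvoy's actual render command:

```python
# Build (but do not run) an illustrative FFmpeg command:
# - loop one still image for 10 seconds to form the video track
# - lower the music bed to 30% volume and mix it with the narration
cmd = [
    "ffmpeg", "-y",
    "-loop", "1", "-t", "10", "-i", "scene.png",   # input 0: still image
    "-i", "narration.mp3",                         # input 1: voiceover
    "-i", "music.mp3",                             # input 2: music bed
    "-filter_complex",
    "[2:a]volume=0.3[bg];[1:a][bg]amix=inputs=2:duration=first[a]",
    "-map", "0:v", "-map", "[a]",
    "-shortest", "out.mp4",
]
```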

Model stack

The models behind Ramvoy.

Image generation

These models generate or edit still images used as source frames, concept art, references, or scene anchors.

Flux 2 Pro

flux_2_pro

High-quality image generation with strong prompt handling, structured prompting, and support for multiple reference images.

Scene concept images · Character visuals · Product or cinematic stills · Reference-based image generation

Flux 2 Pro Edit

flux_2_pro_edit

Image editing version of Flux 2 Pro for transforming or refining uploaded images with natural language instructions.

Editing uploaded images · Adjusting style or composition · Creating polished source images for video steps

Video generation

These models generate motion clips from text or images. Some are optimized for realism, some for cost, and some for audio-driven scenes.

Grok Imagine Video

grok_imagine_video

Generates short videos from text or images and can include synchronized audio.

Short cinematic clips · Image-to-video animation · Audio-aware scenes

Runway Gen-4.5

runway_gen_4_5

High-fidelity cinematic video generation with strong prompt adherence and realism.

Premium hero clips · Cinematic scenes · Higher-end visual sequences

Kling Video 3

kling_video_3

Flexible cinematic video model with support for multi-shot generation and optional audio-oriented workflows.

Story-driven clips · Multi-shot scenes · Social and cinematic video sequences

PixVerse v5.6

pixverse_v5_6

Cost-effective video model with support for native audio-style generation, dialogue-style scenes, and camera movement.

Affordable short video generation · Stylized clips · Audio-aware short scenes

Seedance 1 Pro

seedance_1_pro

High-quality 1080p text-to-video and image-to-video model with strong narrative and multi-shot potential.

Narrative scenes · Longer visual storytelling chains · Higher-resolution outputs

Ray Flash 2

ray_flash_2

Video model focused on realistic motion, cinematic composition, and strong physical scene behavior.

Motion-heavy shots · Environmental scenes · Physics-oriented realism

Fabric 1.0

fabric_1_0

Specialized talking-head model that uses an image plus audio to generate a speaking video.

Presenter clips · Avatar-style videos · Audio-driven face animation

Speech and music

These models generate voiceovers, music, and longer-form audio tracks that can later be combined with images or videos.

TTS 1.5 Max

tts_1_5_max

Text-to-speech model used for narration, voiceover, expressive delivery, and multilingual speech generation.

Narration · Voiceover · Dialogue audio

Music 1.5

music_1_5

Music generation model that can create accompaniment and vocal-based music from lyrics and prompts.

Songs with lyrics · Prompt-driven music tracks · Audio for music-led projects

Lyria 2

lyria_2

High-fidelity music generation for clean instrumental or soundtrack-style outputs.

Background music · Soundtracks · Short music generation

ACE-Step

ace_step

Flexible music model for longer song generation, lyric-aware outputs, and more creative music workflows.

Longer songs · Prompt-based audio generation · Creative music workflows

Assembly

After model outputs are generated, Ramvoy uses an internal assembly step to combine clips, audio, timing, and final render output.

Internal FFmpeg Assembly

internal_ffmpeg

Internal rendering step that stitches clips, loops images into video segments, mixes narration and music, and exports the final video.

Final render · Clip concatenation · Audio mixing · Image-to-video assembly

Architecture

Planner first, adapters second, render last.

Planner: structured execution plan

The planner creates the step-by-step run structure.

Adapters: model-specific inputs

Each adapter transforms a plan step into model-ready input.

Execution: assets are generated

Images, audio, and video clips are created and stored.

Assembly: FFmpeg final render

Internal FFmpeg stitches and mixes everything into the final output.
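
The four stages above can be sketched as one loop: planner output in, an adapter and a runner per model, assembly last. Every name below is an illustrative stand-in, not Ramvoy's API:

```python
def assemble(assets: dict, rules: dict) -> str:
    """Stand-in for the internal FFmpeg render step."""
    return f"final.mp4 ({rules['tool']})"

def run_pipeline(plan: dict, adapters: dict, runners: dict) -> str:
    """Execute a plan step by step, then hand all assets to assembly."""
    assets = {}
    for step in plan["steps"]:
        payload = adapters[step["model"]](step)               # model-ready input
        assets[step["id"]] = runners[step["model"]](payload)  # generated asset
    return assemble(assets, plan["assembly"])

result = run_pipeline(
    {"steps": [{"id": "s1", "model": "flux_2_pro", "prompt": "skyline"}],
     "assembly": {"tool": "internal_ffmpeg"}},
    adapters={"flux_2_pro": lambda s: {"prompt": s["prompt"]}},
    runners={"flux_2_pro": lambda p: "/assets/s1/image.png"},
)
```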

Example flow

A typical run

Step 1: Generate visuals with Flux 2 Pro

Step 2: Generate narration with TTS 1.5 Max

Step 3: Generate music with Lyria 2 or Music 1.5

Step 4: Assemble everything with internal FFmpeg