How Ramvoy works.
Ramvoy uses a multi-model AI pipeline. The planner chooses the right tools for each job, the runtime prepares and executes each step, and an internal assembly step renders the final output.
Total systems
13
Across generation, audio, video, and final assembly.
AI models
12
Used for images, video, speech, and music.
Internal systems
1
Final assembly is handled internally with FFmpeg.
Workflow
From prompt to final output.
User input
The user selects an agent type, writes a prompt, chooses a target length, and optionally uploads images.
Planner creates a structured plan
The planning API converts the request into a structured execution plan with strategy, reasoning, assets needed, steps, and final assembly rules.
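As a rough illustration of what such a plan could look like in code, the sketch below models the five parts named above (strategy, reasoning, assets needed, steps, final assembly rules) as Python dataclasses. The class names, field names, and sample values are assumptions made for this example; the actual schema of the planning API is not documented here.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    step_id: str                                  # hypothetical step identifier
    model: str                                    # model key, e.g. "flux_2_pro"
    params: dict = field(default_factory=dict)    # hypothetical per-step settings


@dataclass
class ExecutionPlan:
    strategy: str
    reasoning: str
    assets_needed: list[str]
    steps: list[PlanStep]
    final_assembly: dict                          # hypothetical assembly rules


def plan_from_dict(raw: dict) -> ExecutionPlan:
    """Build an ExecutionPlan from a plain dict, such as decoded JSON."""
    return ExecutionPlan(
        strategy=raw["strategy"],
        reasoning=raw["reasoning"],
        assets_needed=list(raw["assets_needed"]),
        steps=[PlanStep(**step) for step in raw["steps"]],
        final_assembly=dict(raw["final_assembly"]),
    )


# Illustrative payload only; every value here is made up for the example.
raw_plan = {
    "strategy": "narrated slideshow",
    "reasoning": "Short explainer, still images are sufficient.",
    "assets_needed": ["images", "narration"],
    "steps": [
        {"step_id": "visuals", "model": "flux_2_pro", "params": {"prompt": "a harbor at dawn"}},
        {"step_id": "narration", "model": "tts_1_5_max", "params": {"text": "Welcome."}},
    ],
    "final_assembly": {"model": "internal_ffmpeg", "format": "mp4"},
}

plan = plan_from_dict(raw_plan)
```

Keeping the plan as plain data means it can be logged, validated, and replayed before any model is called.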
The plan selects models
Each step in the plan names the model suited to that task, such as Flux for images, TTS for narration, Kling or PixVerse for video, or FFmpeg for final assembly.
Runtime adapters prepare inputs
Each model has an adapter that converts plan step data into the exact input shape required by the runtime or provider.
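The adapter idea can be sketched as a small interface with one implementation per model. Everything below is hypothetical: the `Adapter` protocol, the step fields (`prompt`, `images`, `script`, `language`), and the payload shapes are invented for the example and do not reflect any provider's real input format.

```python
from typing import Protocol


class Adapter(Protocol):
    """Converts one plan step into the payload a model expects."""

    def build_input(self, step: dict) -> dict: ...


class ImageAdapter:
    def build_input(self, step: dict) -> dict:
        # Hypothetical payload shape for an image model.
        return {
            "prompt": step["prompt"],
            "reference_images": step.get("images", []),
        }


class SpeechAdapter:
    def build_input(self, step: dict) -> dict:
        # Hypothetical payload shape for a text-to-speech model.
        return {
            "text": step["script"],
            "language": step.get("language", "en"),
        }


# One adapter per model key; only two are shown.
ADAPTERS: dict[str, Adapter] = {
    "flux_2_pro": ImageAdapter(),
    "tts_1_5_max": SpeechAdapter(),
}


def prepare(step: dict) -> dict:
    """Look up the adapter for the step's model and build its input."""
    adapter = ADAPTERS[step["model"]]
    return adapter.build_input(step)
```

The point of the pattern is that the planner never needs to know provider-specific field names; only the adapter for that model does.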
Assets are generated and stored
Outputs like images, clips, and audio are generated step by step and stored so later steps can reuse them.
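A minimal sketch of this reuse, assuming an in-memory store and a made-up `$step_id` placeholder convention for referring to earlier outputs. Neither the store nor the placeholder syntax comes from Ramvoy; they only illustrate how a later step could pick up an earlier asset.

```python
class AssetStore:
    """Keeps the output location of each finished step (in memory only)."""

    def __init__(self) -> None:
        self._assets: dict[str, str] = {}

    def put(self, step_id: str, uri: str) -> None:
        self._assets[step_id] = uri

    def get(self, step_id: str) -> str:
        if step_id not in self._assets:
            raise KeyError(f"no asset stored for step {step_id!r}")
        return self._assets[step_id]


def resolve_inputs(params: dict, store: AssetStore) -> dict:
    """Replace "$step_id" string values with the stored asset for that step."""
    resolved = {}
    for key, value in params.items():
        if isinstance(value, str) and value.startswith("$"):
            resolved[key] = store.get(value[1:])
        else:
            resolved[key] = value
    return resolved


store = AssetStore()
store.put("portrait", "assets/portrait.png")      # illustrative paths
store.put("narration", "assets/narration.mp3")

# A talking-head step that reuses the image and audio produced earlier.
talking_head_inputs = resolve_inputs(
    {"image": "$portrait", "audio": "$narration", "fps": 24},
    store,
)
```

Here the talking-head step receives the stored image and audio paths without regenerating either asset.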
Final assembly renders the deliverable
The internal FFmpeg step combines visuals, narration, music, timing, and export settings into the final output.
Model stack
The models behind Ramvoy.
Image generation
These models generate or edit still images used as source frames, concept art, references, or scene anchors.
Flux 2 Pro
flux_2_pro
High-quality image generation with strong prompt handling, structured prompting, and support for multiple reference images.
Flux 2 Pro Edit
flux_2_pro_edit
Image editing version of Flux 2 Pro for transforming or refining uploaded images with natural language instructions.
Video generation
These models generate motion clips from text or images. Some are optimized for realism, some for cost, and some for audio-driven scenes.
Grok Imagine Video
grok_imagine_video
Generates short videos from text or images and can include synchronized audio.
Runway Gen-4.5
runway_gen_4_5
High-fidelity cinematic video generation with strong prompt adherence and realism.
Kling Video 3
kling_video_3
Flexible cinematic video model with support for multi-shot generation and optional audio-oriented workflows.
PixVerse v5.6
pixverse_v5_6
Cost-effective video model with support for native audio-style generation, dialogue-style scenes, and camera movement.
Seedance 1 Pro
seedance_1_pro
High-quality 1080p text-to-video and image-to-video model with strong narrative and multi-shot potential.
Ray Flash 2
ray_flash_2
Video model focused on realistic motion, cinematic composition, and strong physical scene behavior.
Fabric 1.0
fabric_1_0
Specialized talking-head model that uses an image plus audio to generate a speaking video.
Speech and music
These models generate voiceovers, music, and longer-form audio tracks that can later be combined with images or videos.
TTS 1.5 Max
tts_1_5_max
Text-to-speech model used for narration, voiceover, expressive delivery, and multilingual speech generation.
Music 1.5
music_1_5
Music generation model that can create accompaniment and vocal-based music from lyrics and prompts.
Lyria 2
lyria_2
High-fidelity music generation for clean instrumental or soundtrack-style outputs.
ACE-Step
ace_step
Flexible music model for longer song generation, lyric-aware outputs, and more creative music workflows.
Assembly
After model outputs are generated, Ramvoy uses an internal assembly step to combine clips and audio, apply timing, and render the final output.
Internal FFmpeg Assembly
internal_ffmpeg
Internal rendering step that stitches clips, loops images into video segments, mixes narration and music, and exports the final video.
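To make the assembly step concrete, the sketch below builds an FFmpeg command that loops a still image into a video segment, mixes narration over quieter background music, and exports an MP4. The flags and filters used (`-loop`, `-filter_complex`, `volume`, `amix`, `-map`, `-shortest`) are standard FFmpeg options, but the function name, file paths, music volume, and codec choices are assumptions; Ramvoy's actual render settings are not documented here. The command is only constructed, not executed.

```python
def build_assembly_command(
    image: str,
    narration: str,
    music: str,
    output: str,
    music_volume: float = 0.3,   # assumed background level
) -> list[str]:
    """Return an ffmpeg argument list for a narrated still-image segment."""
    # Lower the music, then mix it under the narration. duration=first makes
    # the mix end when the narration (the first amix input) ends.
    audio_filter = (
        f"[2:a]volume={music_volume}[bg];"
        "[1:a][bg]amix=inputs=2:duration=first[aout]"
    )
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", image,      # input 0: still image, looped
        "-i", narration,                # input 1: narration audio
        "-i", music,                    # input 2: background music
        "-filter_complex", audio_filter,
        "-map", "0:v", "-map", "[aout]",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                    # stop when the mixed audio ends
        output,
    ]


command = build_assembly_command(
    "scene.png", "narration.mp3", "music.mp3", "final.mp4"
)
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine with FFmpeg installed. Passing arguments as a list rather than a shell string avoids quoting problems with the filter expression.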
Architecture
Planner first, adapters second, render last.
Planner
Structured execution plan
The planner creates the step-by-step run structure.
Adapters
Model-specific inputs
Each adapter transforms a plan step into model-ready input.
Execution
Assets are generated
Images, audio, and video clips are created and stored.
Assembly
FFmpeg final render
Internal FFmpeg stitches and mixes everything into the final output.
Example flow
A typical run
Step 1
Generate visuals with Flux 2 Pro
Step 2
Generate narration with TTS 1.5 Max
Step 3
Generate music with Lyria 2 or Music 1.5
Step 4
Assemble everything with internal FFmpeg
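The four steps above can also be written down as data with explicit dependencies. The sketch below assumes, for illustration, that visuals, narration, and music do not depend on each other and that assembly needs all three. The step names and the `needs` field are hypothetical; the ordering uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical description of the typical run; model keys come from this page.
typical_run = {
    "visuals": {"model": "flux_2_pro", "needs": []},
    "narration": {"model": "tts_1_5_max", "needs": []},
    "music": {"model": "lyria_2", "needs": []},
    "assemble": {
        "model": "internal_ffmpeg",
        "needs": ["visuals", "narration", "music"],
    },
}

# TopologicalSorter expects a mapping of node -> set of predecessors.
graph = {step_id: set(step["needs"]) for step_id, step in typical_run.items()}
order = list(TopologicalSorter(graph).static_order())

for step_id in order:
    print(step_id, "->", typical_run[step_id]["model"])
```

Whatever order the three generation steps come out in, `assemble` is always last, because it is the only step that depends on the others.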