How I created an AI-driven ecosystem to locally generate videos

Written by János Bebesi

Part 1: Overview and Architecture

I decided to create a local AI application that can generate video from a short text prompt. This sounded like a fun and useful project, since AI applications from the major providers (ChatGPT, Microsoft Copilot, Gemini, etc.) are highly limited in usage. Local models can be fine-tuned to our needs, and since we control the parameters and resource requirements, they can be a better option than the hosted services.

To enable generation from a short text prompt while maximizing the quality of the final video, I decided to split the application into five parts:

  • A text-generation AI: It ensures that from a short prompt we get a full text, a description, and prompts for the rest of the application. Text generation is quick compared to video generation, so if the text quality is not sufficient, modifying the prompt is a cheap way to improve the description.
  • A text-to-speech AI: This uses part of the text-generation output. On my machine, one minute of audio can be generated in less than a minute (this depends on the AI model, so it can differ with other models). Fine-tuning the prompt should be possible here to adjust the result to our original needs.
  • An image-generation AI: Prompt-based image generation is also relatively quick; it takes a couple of minutes, depending on image size, quality, and model.
  • A video-generation AI driven by the voice, image, and text: All of the above is done to ensure the starting image and the audio are correct before we start generating the video. On my setup, generating a 40-second video can take 4 hours.
  • An application for the user interface that handles the communication between the AIs and provides a human-usable interface. It is also important to expose the right set of parameters for fine-tuning the model(s): tokens used, video length, context size, and temperature are all parameters that can strongly affect the performance, quality, and usability of the project.
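
The five stages above can be sketched as a simple pipeline. This is a minimal illustration only: the type and function names are hypothetical, and the placeholder stage functions stand in for the real model calls described later in the article.

```python
# Minimal sketch of the five-stage pipeline described above (hypothetical
# names; each stage function would wrap the real model behind it).
from dataclasses import dataclass, field

@dataclass
class VideoJob:
    prompt: str                                  # short user prompt
    script: dict = field(default_factory=dict)   # text-generation output
    audio_path: str = ""
    image_path: str = ""
    video_path: str = ""

# Placeholder stages; in the real application these would call Ollama / ComfyUI.
def generate_script(prompt):  return {"audio": {}, "image": {}, "video": {}}
def synthesize_speech(spec):  return "out/audio.wav"
def generate_image(spec):     return "out/start.png"
def generate_video(job):      return "out/final.mp4"

def run_pipeline(job: VideoJob) -> VideoJob:
    # The UI layer can pause between stages so the user edits intermediate
    # results before committing to the slow video-generation step.
    job.script = generate_script(job.prompt)
    job.audio_path = synthesize_speech(job.script["audio"])
    job.image_path = generate_image(job.script["image"])
    job.video_path = generate_video(job)
    return job
```

Splitting the work like this is what makes the cheap stages (text, audio, image) correctable before the expensive video stage runs.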

Constraints for the project:

  • The application should be able to run on a local machine equipped with an NVIDIA RTX 3060 or similar hardware.
  • The AI models should be pre-trained; fine-tuning a model is not part of this project for now.

Advantages of local SLM AI

While LLMs can do many things, they come at the cost of strong hardware requirements and high energy consumption. Their small yet specialized counterparts, Small Language Models (SLMs), have various advantages:

  • Privacy: If needed, external data transmission can be blocked. Context, training data, output, telemetry, usage traces, and legal or other sensitive information stay on our server, and we are in full control. The responsibility is also ours.
  • Model behaviour: Token limits, resource usage, and context size can be adjusted as needed. Swapping between models based on the task is also easy.
  • Fine-tuning: Models can be trained with confidential or highly sensitive data without fear of leakage, and we can choose a model that was not trained on irrelevant data.
  • Offline capability: downloaded models do not require an internet connection.
  • No usage limits, or limits set only by your hardware.
  • Lower energy consumption.
  • Can run on cheap hardware.

Text Generation AI

The first component of the application is text generation. It accepts the following input:

  • A prompt about the topic it should generate.
  • The input arrives as JSON for further processing.
  • The goal is to produce pre-filled prompts for the following AI models, making them consistent and easy to customize from a draft state.

To ensure that all responses follow the required structure and size, a prompt template extends the user prompt. This helps keep the response within the application's constraints. This is the prompt template:

You are a multimodal content generator. Your task is to produce a structured JSON object that describes a short video scene based on the given topic and tone.

The output must include the following fields:

– "audio": An object containing audio-specific instructions and preferences:
  – "lyrics": Text of any spoken words or dialogue.
  – "tags": List of keywords or phrases to enhance audio quality.
– "video": An object containing video-specific instructions:
  – "positive_prompt": Text of what will happen on screen.
  – "negative_prompt": Text of what to avoid in the visuals.
– "image": An object containing image-specific instructions:
  – "positive_prompt": Text of what is in the image.
  – "negative_prompt": Text of what to avoid in the image.

Constraints:
– Ensure all fields are filled.
– Format the output as valid JSON.
– The audio section should complement the visual and narrative elements.

Topic: {{Insert topic here}}
Tone: {{Insert tone here}}

Generate the structured output now.

This ensures that the output is well formatted and ready for the next AI application. The topic is inserted by the user, so the user does not need to worry about the rest of the template, which guarantees the right response structure from the model. The tone is not adjustable.

Ollama

My choice for running text generation was Ollama. It is an easy-to-use yet powerful model runner. I can set the main parameters for the model, such as the context window, temperature, output format, or maximum number of tokens to generate, which helps build a finely tuned, tailored solution. It has easy-to-use API access and GPU acceleration. The only functionality Ollama is missing is fine-tuning support, but for this experiment it is not needed. Ollama can also list the available models, which is handy when we use multiple models for different tasks.
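
As a sketch of how this stage could be wired up, the request below targets Ollama's local REST endpoint (`/api/generate` on its default port 11434) with the prompt template from above. The model name is an example, and `TEMPLATE` abbreviates the full template; the `format: "json"` field asks Ollama to constrain the output to valid JSON, matching the template's constraint.

```python
# Sketch of calling Ollama's REST API with the prompt template from above.
# Assumes Ollama runs locally on its default port; the model name is an
# example, and TEMPLATE abbreviates the full template shown earlier.
import json
import urllib.request

TEMPLATE = (
    "You are a multimodal content generator. ...\n"  # full template goes here
    "Topic: {topic}\nTone: {tone}\n"
    "Generate the structured output now.\n"
)

def build_payload(topic: str, tone: str, model: str = "llama3") -> dict:
    return {
        "model": model,
        "prompt": TEMPLATE.format(topic=topic, tone=tone),
        "format": "json",   # tells Ollama to constrain output to valid JSON
        "stream": False,    # return one complete response instead of chunks
        "options": {"temperature": 0.7, "num_ctx": 4096, "num_predict": 1024},
    }

def generate_script(topic: str, tone: str) -> dict:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_payload(topic, tone)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(json.load(resp)["response"])  # parse the model's JSON
```

The `options` block is where the parameters mentioned above (context size, temperature, token limit) are set per request, without touching the model itself.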

Text-to-Speech AI

It will get the following input:

  • The editable output of the text-generation AI, so the user can fine-tune the text when only partial improvement is needed.
  • Instructions about customization options for how the voice should sound.
  • Length and quality settings for the generated audio.
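
Put together, the three inputs above could form a request like the following. This is purely a hypothetical shape, not a real TTS API: the field names (`voice`, `max_seconds`, `sample_rate`) are illustrative defaults.

```python
# Hypothetical shape of a text-to-speech request in this pipeline; the field
# names are illustrative, not a real TTS API.
def build_tts_request(script: dict, voice: str = "narrator_female",
                      max_seconds: int = 60, sample_rate: int = 24000) -> dict:
    # "lyrics" and "tags" come straight from the text-generation output,
    # but the user may have edited them in the UI first.
    return {
        "text": script["audio"]["lyrics"],
        "tags": script["audio"]["tags"],
        "voice": voice,               # customization: how the voice sounds
        "max_seconds": max_seconds,   # length setting
        "sample_rate": sample_rate,   # quality setting
    }
```

The point of the structure is that the editable text and the fixed generation settings travel together, so the UI only has to expose the fields the user is allowed to change.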

Text-to-Image AI

It will get the following input:

  • The editable output of the text-generation AI, so the user can fine-tune the text when only partial improvement is needed.
  • Instructions about customization options for how the image should be generated.
  • Size and quality settings for the generated image.

Video Generation AI

It uses the image, speech, and text to generate the video. The video content must also be described: how it should look and what should happen. The avatar and its emotional reactions should be coordinated with the tone. The user should be able to define the length of the video.

ComfyUI

My goal was to find something that can handle most of the generation tasks with a smooth UI and an easy-to-use API. With ComfyUI both are possible: the UI itself is intuitive, and templates are available for the various tasks, including image, audio, and video generation and editing. AI models are also pre-selected for the templates. The only difficulty I found was selecting the right model from the list once I had downloaded multiple models; after download, only manual tagging and naming help determine which model to use. Using its node structure, it is possible to achieve the same as with the UI (or even more!), which makes it a perfect candidate for this application.
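
The node structure is what makes ComfyUI scriptable: a workflow exported in ComfyUI's "API format" is a JSON graph that can be posted to its `/prompt` endpoint. The sketch below assumes ComfyUI is running on its default port (8188); the workflow dict itself would come from one of the exported templates.

```python
# Sketch of queuing a workflow through ComfyUI's HTTP API. Assumes ComfyUI
# runs on its default port (8188); the workflow dict is a node graph exported
# in ComfyUI's "API format" from one of the built-in templates.
import json
import urllib.request
import uuid

def build_queue_payload(workflow: dict) -> dict:
    # client_id lets us match progress messages on the websocket to this job
    return {"prompt": workflow, "client_id": str(uuid.uuid4())}

def queue_workflow(workflow: dict, host: str = "http://127.0.0.1:8188") -> str:
    req = urllib.request.Request(
        host + "/prompt",
        data=json.dumps(build_queue_payload(workflow)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["prompt_id"]  # poll /history/<prompt_id> for results
```

Because image, audio, and video templates all go through the same endpoint, one small wrapper like this can drive three of the pipeline's stages.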

Next steps

As a next step, we will look at Ollama's and ComfyUI's capabilities and limitations for our project.
