How I created an AI-driven ecosystem to locally generate videos - Part 2

Written by János Bebesi

Part II: Understanding Ollama

In the first part of this series, I described the overall architecture of a fully local, multimodal AI pipeline capable of generating videos from a short text prompt. Video generation is slow, resource‑heavy, and extremely sensitive to input quality. On consumer hardware such as an RTX 3060, generating a 30–40 second clip can easily take several hours. If the foundation—text, audio, and image—is not well‑structured before video synthesis begins, most of that time is wasted.

This project is an attempt to build a reliable workflow that ensures high‑quality intermediate results while keeping everything local, private, and customizable.
To achieve this efficiently, the pipeline separates the process into multiple stages:

  1. Text generation, where a short prompt becomes a detailed JSON description of the scene.
  2. Audio generation, which provides dialogue, narration, or singing.
  3. Image generation, which forms the visual baseline for the video.
  4. Video generation, the slowest and most expensive step.
  5. Application logic that orchestrates all components.

 

In this blog post I will dive into the first stage of video generation: generating, with Ollama, the text that the later stages will use. The source code is in this GitHub repository; the Readme.md file and the install.ps1 script will help you set up your environment. I will reference the release branch, which I created for this blog post so the article always points to the same code, even as I develop the application further.

The application is mostly AI-generated, with some guidance from me. That applies not only to the source code but also to the displayed text, so it can be misleading or wrong; treat it with caution. I only tested the code paths relevant to the application, so dead code and bad practices may remain. I also want to highlight that this application is not in production-like shape: a lot of functionality could have been implemented simply by referencing existing libraries, but the AI rarely reaches for existing packages. I focused only on a basic workflow for video generation, with my limited video-making skills. The goal was to show how multiple AI models and tools can be combined to achieve a result with limited resources. Programming concerns such as logging, monitoring, performance, and security were also set aside; I may improve these aspects later.

What is Ollama?

Ollama is a platform and toolset designed to run text-generating AI models locally on your own machine, enabling private and secure AI-powered text generation and processing without relying on third-party infrastructure. It supports popular pre-trained models such as Llama, DeepSeek, and Mistral, provides GPU acceleration, and exposes an easy-to-use API for integration into applications. It also offers basic model management features such as downloading, updating, and switching between models locally.

Key Features of Ollama

  • Local Execution: The most important feature of Ollama is its ability to run AI models locally, mitigating the privacy and data protection concerns of cloud solutions. By bringing AI models directly to users’ devices, Ollama gives you greater control over and security of your data, while reducing reliance on external servers.
  • Extensive and easy to use Model Library: Ollama offers access to an extensive library of pre-trained AI models. Users can choose from a range of models tailored to different tasks, domains, and hardware capabilities, ensuring flexibility and versatility in their AI projects.
  • Seamless Integration: Ollama seamlessly integrates through its API endpoint, making it easy for developers to incorporate AI models into their workflows.
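As a sketch of what that API integration looks like, the snippet below builds a non-streaming request for Ollama's /api/generate endpoint (served locally on http://localhost:11434 by default). The model name is just a placeholder; the generate() call requires a running Ollama instance.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> dict:
    # "stream": False asks Ollama to return one complete JSON object
    # instead of a stream of partial responses.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    # Requires a running local Ollama instance.
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

request_body = build_generate_request(
    "llama2:7b", "Sailorwoman is singing about fishing on the beach."
)
```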

Generating a Text Prompt with Ollama

The Ollama application has a basic UI where you can set some important parameters and download available models. The maximum context window is 256K as of this writing; it may be larger by the time you read this article.

Hierarchy of Ollama files

The OllamaService.cs contains the main logic for Ollama API integration. It does the following:

  • Retrieves the list of available models with details,
  • Sends requests to the Ollama API,
  • Collects execution metrics from the Ollama API,
  • Formats the user prompt,
  • Parses the response into the expected format,
  • Injects the user prompt into a predefined template, which ensures that every response comes back in the expected format.
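The template mechanism can be illustrated with a small sketch. The template text below is my own illustration, not the one from the repository: the user prompt is spliced into fixed instructions that pin down the expected JSON schema.

```python
# Hypothetical prompt template; the real one lives in OllamaService.cs.
# Double braces survive str.format() as literal braces.
PROMPT_TEMPLATE = """You are generating inputs for a video pipeline.
Respond ONLY with a JSON object of this exact shape:
{{"audio": {{"lyrics": "...", "tags": []}},
 "video": {{"positive_prompt": "...", "negative_prompt": "..."}},
 "image": {{"positive_prompt": "...", "negative_prompt": "..."}}}}

User request: {user_prompt}"""

def format_prompt(user_prompt: str) -> str:
    # Splice the raw user prompt into the fixed instructions.
    return PROMPT_TEMPLATE.format(user_prompt=user_prompt)

final_prompt = format_prompt("Sailorwoman is singing about fishing on the beach.")
```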

 

The OllamaOutputState.cs

  • manages sharing of the generated content with the other services,
  • stores the audio, image, and video inputs.

 

The OllamaPromptRequest.cs and OllamaOptions.cs files contain the request sent to Ollama. They were generated from the Ollama API documentation.

From the UI, you can control:

  • Temperature and Top P to control randomness and diversity. Learn more here
  • Max Tokens: practically the maximum length of the response. Longer videos may require longer responses.
  • Keep alive: controls the speed vs. memory trade-off. If set too low, reloading the model adds significant time to each run; if set too high, models stay in memory too long, and multiple models loaded simultaneously can hurt your system’s performance.
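Mapped onto the Ollama API, these UI settings correspond to the `options` object and the top-level `keep_alive` field of a generate request. The values below are illustrative, not recommendations.

```python
def build_tuned_request(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Sampling and length controls from the Ollama API:
        "options": {
            "temperature": 0.7,   # randomness
            "top_p": 0.9,         # nucleus sampling / diversity
            "num_predict": 1024,  # maximum tokens in the response
        },
        # How long the model stays loaded in (V)RAM after the request.
        "keep_alive": "5m",
    }

req = build_tuned_request("mistral", "Sailorwoman is singing about fishing on the beach.")
```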

Influencing model behavior

When I looked for ways to influence model behavior to get a structure regardless of user input, I saw three ways of doing so:

  • Create a new model via the create API: Once set, the model always behaves as expected. There is no need to resend the prompt template between sessions or requests. Ideal for persistent behavior; less ideal for experimenting, when the prompt template changes regularly.
  • Set the system parameter: It defines rules and personas for the model (“You are a helpful assistant”). This is stronger than user input; system messages usually override user instructions.
  • Extend the user prompt (selected solution): This is the most flexible way to influence model behavior and can change easily between prompts, but it is also the least secure.

I selected extending the user prompt because of its flexibility; in practice, the other two would also do the job. For a production scenario I would choose creating a new model or setting the system parameter, depending on the product’s requirements. The selected approach is especially bad when anyone can type prompts to the model, because the extended prompt can easily be overridden by malicious user input (e.g. “ignore previous instructions, do this instead…”).
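For reference, the two alternatives I did not pick would look roughly like this: a Modelfile with a SYSTEM directive baked in via `ollama create`, or a per-request `system` field on /api/generate. The instruction text here is illustrative.

```python
# Option 1: bake the rules into a new model with a Modelfile,
# then run: ollama create my-video-model -f Modelfile
MODELFILE = """FROM llama2:7b
SYSTEM You always answer with a single JSON object describing audio, video and image prompts.
"""

# Option 2: send the rules per request via the "system" field of /api/generate.
def request_with_system(model: str, prompt: str, system: str) -> dict:
    return {"model": model, "prompt": prompt, "system": system, "stream": False}

r = request_with_system(
    "llama2:7b",
    "Sailorwoman is singing about fishing on the beach.",
    "You always answer with a single JSON object.",
)
```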

Choosing a model to work with

When selecting a model for your workflow, you should keep in mind that different models have different characteristics and requirements.

  • Instruction following: the best models for this task excel at instruction-following and structured output.
  • Memory needs: prefer models of at most 7B parameters.
  • Speed: not a hard constraint for now.
  • Language: English is enough for now.

You might test the application with different models, and each can introduce new issues or behaviors. This is why model selection matters so much at the beginning; there are thousands of models out there.

Models I have tested so far

  • Llama2:7B was able to follow most of the instructions, and the response quality was good enough to use. It only supports English: it recognized when I entered Hungarian text, but it could not answer or follow the instructions.
  • goonsai/qwen2.5-3B-goonsai-nsfw-100k is an uncensored model (use at your own risk, it refuses nothing) that performs well at text completion. It gave me the poorest responses: the JSON structure was often malformed, and I did not try to fix this issue this time. Answering in Hungarian worked with this model.
  • Mistral 0.3 is a 7B-parameter model available in both instruct (instruction-following) and text-completion modes. It generated pretty good answers in Hungarian.

Running a test

I will test all models by feeding them the following simple prompt 10 times:
Sailorwoman is singing about fishing on the beach.

I will measure:

  • Memory usage
  • First execution time
  • Execution time after first run
  • Incorrect responses (something fails)
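A response counts as incorrect when the expected structure is violated. A minimal validator, my own sketch rather than the repository's parser, could check the keys like this:

```python
import json

# Expected schema: each section must exist with exactly these keys present.
REQUIRED = {
    "audio": {"lyrics", "tags"},
    "video": {"positive_prompt", "negative_prompt"},
    "image": {"positive_prompt", "negative_prompt"},
}

def validation_errors(raw: str) -> list:
    """Return a list of problems; an empty list means the response is usable."""
    errors = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc}"]
    for section, keys in REQUIRED.items():
        if section not in data:
            errors.append(f"missing section: {section}")
            continue
        for key in keys - data[section].keys():
            errors.append(f"missing key: {section}.{key}")
    return errors

ok = validation_errors(
    '{"audio": {"lyrics": "la", "tags": []}, '
    '"video": {"positive_prompt": "p", "negative_prompt": ""}, '
    '"image": {"positive_prompt": "p", "negative_prompt": ""}}'
)
# A misspelled key ("postive_prompt") is reported as a missing expected key.
bad = validation_errors('{"video": {"postive_prompt": "typo"}}')
```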

 

| Model | Execution time (first) | Execution time (subsequent) | VRAM | Incorrect responses |
|---|---|---|---|---|
| Llama2:7B | 7502 ms | 3278–4653 ms | ~6 GB | 4 (negative prompts for image and video were missing) |
| Mistral | 30014 ms | 15425–18882 ms | ~8.6 GB | I do not like the lyrics of this model |
| goonsai/qwen2.5-3B-goonsai-nsfw-100k | 27131 ms | 8822–33696 ms | ~9 GB | Not a single generation fully followed the format; it created incorrectly named parameters and added extra parameters |

Example responses

Example response from Llama2:7b

{
  "audio": {
    "lyrics": "Oh, I love to fish on the beach, where the sand is soft and the sea is sweet. I cast my line and wait for a bite, watching the waves roll in with a happy sight.",
    "tags": ["beach", "fishing", "sailorwoman", "happy"]
  },
  "video": {
    "positive_prompt": "Sailorwoman is singing and dancing on the beach, enjoying her time fishing.",
    "negative_prompt": ""
  },
  "image": {
    "positive_prompt": "A picturesque beach with a sailboat in the distance, surrounded by palm trees and a clear blue sky.",
    "negative_prompt": ""
  }
}

Llama2 followed the format, but some fields came back empty.

Example response from mistral:

{
  "audio": {
    "lyrics": "A cheerful sailorwoman sings a lively tune about fishing on the beach, with lyrics like 'Cast your line, feel the bite, reel it in so tight!', accompanied by joyful hums and sea-inspired melodies.",
    "tags": ["happy", "sailor", "singing", "fishing", "beach", "melodic", "cheerful"]
  },
  "video": {
    "positive_prompt": "A beautiful day at the beach with a lively sailorwoman joyfully singing about fishing. She wears colorful clothing and is surrounded by various fishing equipment.",
    "negative_prompt": "No dark or ominous scenes, no sad expressions on the sailorwoman's face, no dangerous or scary elements."
  },
  "image": {
    "positive_prompt": "A bright, sunny beach scene with a smiling sailorwoman singing while holding a fishing rod and surrounded by colorful sea life such as seagulls, crabs, and fish.",
    "negative_prompt": "No grim or somber images, no polluted or trash-filled beaches, no stormy weather."
  }
}

Example response from Goonsai/qwen2.5-3B-goonsai-nsfw-100k

{
  "audio": {
    "lyrics": "Sailorwomen sing, 'Fishing's my favorite game,'",
    "tags": ["happy"],
    "positive_prompt": "'faster fish swimming in ocean'",
    "negative_prompt": "'fishermen fishing for other things'"
  },
  "video": {
    "postive_prompt": "sailing woman singing about happy beach day with smiling face and waves of water.",
    "nagative_promt": "woman not looking at the viewer",
    "positive_frame_count": "30-45 frames per second, ",
    "frame_rate_per_second": ""
  },
  "image": {
    "posiive_prompt": "'sailorwomen singing happy beach day'",
    "negative_prompt": "'happy woman smiling with waves of water'"
  },
  "time_length_seconds": "6 seconds",
  "audio_frame_count": "30-45 frames per second"
}

(The misspelled keys such as "postive_prompt", "nagative_promt", and "posiive_prompt", and the extra top-level fields, are part of the model's actual output, preserved here on purpose.)

Next steps

In the next chapter we’ll look at image and audio generation with ComfyUI.
