sum-vid-llm-commercial (Summarization of video inputs)

Version Changelog

| Plugin Version | Change |
| --- | --- |
| v1.0.0 | Initial plugin release with OLIVE 6.1.0. |

Description

This Video Summarization plugin creates a concise summary of the provided input video(s), transforming raw video into a short paragraph that preserves the key information contained in the input.

This plugin performs this task using a large language model (LLM) external to the plugin, either run in OLIVE or hosted separately (advanced users).

LLMs have a maximum context length that they can use for processing inputs (prompts) and outputs. Each video frame is encoded as an image and sent to the LLM, which consumes part of the LLM's token context window. Long videos are therefore processed by splitting them into overlapping sequences of images. Each sequence is summarized by the LLM, semantic embeddings are generated from each text summary, clusters are built from the summary embeddings, and a high-level summary is created for each cluster using the LLM. Finally, an overall summary is created from the high-level summaries using the LLM. A sketch of this pipeline is shown below.
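The following Python sketch illustrates this hierarchical summarization flow under stated assumptions; it is not the plugin's actual implementation. The endpoint, model name, chunk sizes, embedding model, and clustering parameters are assumptions chosen for the example, and text frame descriptions stand in for the image frames the plugin actually sends.

```python
# Illustrative sketch of hierarchical video summarization via an
# OpenAI-API-compatible LLM server. NOT the plugin's actual code.
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

client = OpenAI(base_url="http://127.0.0.1:5007/v1", api_key="token")
MODEL = "gemma3-4b"  # example model name; assumption

def summarize(text: str) -> str:
    """Ask the LLM for a short summary of the given text."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Summarize the following concisely:\n\n{text}"}],
    )
    return resp.choices[0].message.content

def chunks(items, size=16, overlap=4):
    """Split a sequence into overlapping chunks that fit the context window."""
    step = size - overlap
    return [items[i:i + size] for i in range(0, max(len(items) - overlap, 1), step)]

# Placeholder stand-in for encoded frames; the real plugin sends images.
frame_descriptions = [f"frame {i}: <caption>" for i in range(64)]

# 1. Summarize each overlapping sequence of frames.
chunk_summaries = [summarize("\n".join(c)) for c in chunks(frame_descriptions)]

# 2. Embed the per-chunk summaries and cluster the embeddings.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
embeddings = embedder.encode(chunk_summaries)
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0
).fit_predict(embeddings)

# 3. One high-level summary per cluster, then an overall summary.
cluster_summaries = [
    summarize("\n".join(s for s, l in zip(chunk_summaries, labels) if l == k))
    for k in sorted(set(labels))
]
final_summary = summarize("\n".join(cluster_summaries))
print(final_summary)
```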

Domains

  • multi-v1
    • Uses an LLM to create a summary of the input video. Accuracy and language support vary between different LLMs.

Inputs

A video file to process.

Outputs

The output format for Summarization plugins is simply text.

Functionality (Traits)

The functions of this plugin are defined by its Traits and the API messages it implements. These Traits are listed below, along with the corresponding API messages and additional implementation details for each.

Compatibility

OLIVE 6.1+

Limitations

This plugin is based on an LLM. Its performance therefore depends critically on the LLM's performance on the task, particularly for low-resource languages. There is often a correlation between an LLM's parameter count and its performance on complex tasks, so larger LLMs tend to perform better. However, they also tend to use more resources and process more slowly on the same hardware.

We tested this plugin using Google's Gemma-3-4B-it-qat-q4_0-gguf and Gemma-3-12B-it-qat-q4_0-gguf. Performance with other LLMs may vary.

Comments

Large Language Model (LLM) Required

This plugin relies on a Large Language Model (LLM) for the heavy lifting of its task. This plugin can only be used with an appropriately configured OLIVE server that has been started with the LLM server active. See the LLM Configuration Documentation for more information, and refer to the Martini documentation to make sure the appropriate startup procedure is followed.

GPU Support

Please refer to the OLIVE GPU Installation and Support documentation page for instructions on how to enable and configure GPU capability in supported plugins. By default this plugin will run on CPU only; however, the speed of the embedding model computation used for long video inputs is greatly enhanced when using GPU.

Text Transformation Options

The following configuration options are available to this plugin, adjustable in the plugin's configuration file, plugin_config.py.

| Option Name | Description | Default | Expected Range |
| --- | --- | --- | --- |
| llm_base_url | Base URL of an OpenAI-API-compatible LLM server (such as llama-server, vLLM, or hosted LLMs such as OpenAI). | http://127.0.0.1:5007/v1 | |
| model | Name of the LLM model to use. | gemma3-4b | |
| api_key | API key/token to use for the LLM server, if required. | token | |
| llm_ctx_window | LLM context window length, i.e., the number of tokens in the context window of the LLM. By default, the plugin queries the LLM server for the maximum context window; this setting is used as a fallback if the query fails. | 8192 | |
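As a concrete illustration, these options might be set in plugin_config.py as shown below. This is a hypothetical sketch: the actual structure of the plugin's configuration file may differ, and the values shown (a local llama-server endpoint and a Gemma model) are example assumptions rather than requirements.

```python
# Hypothetical plugin_config.py sketch; the real file's layout may differ.
llm_base_url = "http://127.0.0.1:5007/v1"  # OpenAI-API-compatible server (llama-server, vLLM, hosted)
model = "gemma3-4b"                        # model name as registered with the server
api_key = "token"                          # only needed if the server enforces authentication
llm_ctx_window = 8192                      # fallback context length if the server query fails
```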

If you find this plugin to not perform adequately for your data conditions, or have a specific use case, please get in touch with SRI to discuss how the plugin can be tuned for optimal performance on your data.