Executes a semantic video search using a natural language text query (e.g., 'person walking in park' or 'cityscape at sunset') within a specified dataset (dataset_id). The text query is encoded into an embedding using the dataset's configured encoder (e.g., Perception Encoder or other vision-language embedding models for visual modalities, or Qwen or other text encoders for audio transcript modality), then searched against video content embeddings using vector similarity. The search behavior is controlled by the modality parameter: 'video' (default) searches video-level embeddings for overall video similarity, 'shot' searches shot-level embeddings to find videos with similar scenes, 'image' searches frame-level embeddings to find videos with similar individual frames, and 'audio_speech_to_text' searches transcript text embeddings for spoken content. All modalities return one result per video, surfacing the most relevant composite slice (shot or scene) and a preview frame. Results can be filtered using optional metadata filters and include the composite slice with start/end timestamps, frame numbers, relevance scores, and video metadata.
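To make the request shape concrete, here is a minimal Python sketch. The base URL and endpoint path are placeholders (this reference does not state them), and the response handling is an assumption; only the documented parameters (dataset_id, text_query, modality, offset, limit) come from this page.

```python
import requests

BASE_URL = "https://api.example.com/v1"  # hypothetical base URL
API_TOKEN = "YOUR_AUTH_TOKEN"            # placeholder auth token

def search_videos(dataset_id: str, text_query: str, modality: str = "video",
                  limit: int = 60, offset: int = 0) -> dict:
    """Run a semantic text-to-video search against a dataset."""
    response = requests.post(
        f"{BASE_URL}/search",  # assumed path; check your API reference
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "dataset_id": dataset_id,  # Required, UUID
            "text_query": text_query,  # Required, >= 2 characters
            "modality": modality,      # "video" (default), "shot", "image", ...
            "limit": limit,            # default 60, max 1000
            "offset": offset,          # default 0
        },
    )
    response.raise_for_status()
    return response.json()

results = search_videos(
    dataset_id="123e4567-e89b-12d3-a456-426614174000",
    text_query="person walking in park",
    modality="shot",
)
```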
Authentication
Authorization: Bearer
Bearer authentication of the form Bearer <token>, where token is your auth token.
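For example, a client would construct the header like this (the token value is a placeholder):

```python
API_TOKEN = "YOUR_AUTH_TOKEN"  # placeholder; substitute your real auth token
headers = {"Authorization": f"Bearer {API_TOKEN}"}
```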
Request
This endpoint expects an object.
dataset_id string Required format: "uuid"
The unique identifier for the dataset
text_query string Required >=2 characters
The natural language search string
metadata_filters object or null Optional
JSON string containing a list of metadata filters
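The exact filter schema is not specified in this section. As an illustrative sketch only, a JSON string containing a list of filter objects might look like the following; the field names (field, operator, value) are hypothetical:

```python
import json

# Hypothetical filter shape -- the exact schema is not documented here.
metadata_filters = json.dumps([
    {"field": "category", "operator": "eq", "value": "sports"},
    {"field": "duration_seconds", "operator": "lte", "value": 120},
])
# Pass the resulting JSON string as the metadata_filters request field.
```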
offset integer Optional Defaults to 0
Starting index of the results to return
limit integer Optional Defaults to 60
Maximum number of items to return (max 1000)
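Because offset and limit follow standard pagination semantics, a client can walk the full result set as sketched below. The endpoint path and the "results" response key are assumptions; the loop stops when a page comes back smaller than the requested limit.

```python
import requests

BASE_URL = "https://api.example.com/v1"  # hypothetical base URL
API_TOKEN = "YOUR_AUTH_TOKEN"            # placeholder auth token

def iter_all_results(dataset_id: str, text_query: str, page_size: int = 100):
    """Page through search results using offset/limit (limit capped at 1000)."""
    offset = 0
    while True:
        resp = requests.post(
            f"{BASE_URL}/search",  # assumed path
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            json={"dataset_id": dataset_id, "text_query": text_query,
                  "offset": offset, "limit": page_size},
        )
        resp.raise_for_status()
        items = resp.json().get("results", [])  # response key is an assumption
        yield from items
        if len(items) < page_size:  # last page reached
            break
        offset += page_size
```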
modality enum or null Optional
Modality to search: "video", "shot", "image", "audio_speech_to_text", or "capped-shot-segment". Defaults to "video".
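To illustrate how modality changes what is matched, the request bodies below differ only in that field (the dataset_id and transcript query are placeholder values):

```python
# Each body targets a different embedding granularity; only modality changes.
base = {"dataset_id": "123e4567-e89b-12d3-a456-426614174000",
        "text_query": "cityscape at sunset"}

video_level = {**base, "modality": "video"}   # whole-video similarity
shot_level  = {**base, "modality": "shot"}    # videos with similar scenes
frame_level = {**base, "modality": "image"}   # videos with similar frames
transcript  = {**base, "text_query": "welcome to the show",
               "modality": "audio_speech_to_text"}  # spoken content
```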
skip_moderation boolean Optional Defaults to false
Skip content moderation if it is enabled
moderation_score_type enum Optional
Type of moderation scores to return when moderation is enabled
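As a final sketch, the moderation flags slot into the request body as shown; the allowed values of moderation_score_type are not enumerated in this section, so that line is left commented:

```python
body = {
    "dataset_id": "123e4567-e89b-12d3-a456-426614174000",  # placeholder UUID
    "text_query": "person walking in park",
    "skip_moderation": False,  # default False; True bypasses content moderation
    # "moderation_score_type": <enum value>,  # allowed values not listed here
}
```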