Analyzing Dynamic Tags with SQL

Understanding the distribution of dynamic tags in a dataset is crucial for metadata generation, content categorization, and trend analysis. This tutorial will guide you through SQL queries that help you:

  • Analyze the distribution of dynamic tags to determine the most frequent categories.
  • Retrieve the top tags for each image to identify key labels in your dataset.
  • Retrieve the top tags for each video to generate meaningful metadata at the video level.

By the end of this tutorial, you’ll be able to extract structured insights from dynamic tag metadata using SQL.

Note: This tutorial uses the V3 Dynamic Tags table schema. If you’re using an older version, table names may differ (e.g., group_[name] instead of dt_[group_name]_visual).

Analyze the Distribution of Dynamic Tags in a Dataset

This query calculates how often each dynamic tag appears across the dataset, helping you understand category distribution.

1-- Step 1: Rank dynamic tags for each image based on their scores
2WITH ranked AS (
3 SELECT
4 coactive_image_id, -- The unique ID for each image
5 tag_name, -- The tag assigned to the image
6 score, -- The confidence score for the tag
7 ROW_NUMBER() OVER (
8 PARTITION BY coactive_image_id -- Separate ranking for each image
9 ORDER BY score DESC -- Rank tags by confidence score in descending order
10 ) AS rn -- Assign a rank to each tag per image
11 FROM
12 dt_sports_visual -- V3 Dynamic Tags visual table
13),
14
15-- Step 2: Filter to keep the most relevant tags for each image
16image_tag_table AS (
17 SELECT
18 coactive_image_id, -- Image ID
19 tag_name, -- Tag name
20 score -- Confidence score for the tag
21 FROM
22 ranked
23 WHERE
24 rn <= 3 -- Keep only the top 3 tags per image
25 AND score > 0.01 -- Exclude low-confidence tags (score <= 0.01)
26)
27
28-- Step 3: Aggregate tag distribution across the dataset
29SELECT
30 tag_name, -- Tag name
31 COUNT(DISTINCT coactive_image_id) AS image_count -- Count the number of unique images with this tag
32FROM
33 image_tag_table
34GROUP BY
35 tag_name -- Group results by tag name
36ORDER BY
37 image_count DESC; -- Order by the frequency of each tag in descending order

Explanation

  1. CTE (ranked): Assigns a rank to each dynamic tag within every image based on the tag’s score.

  2. CTE (image_tag_table): Filters the ranked results to include:

    • The top 3 tags per image (rn <= 3).
    • Tags with scores above a configurable threshold (dynamic_tag_score > 0.01)
  3. Final Query: Counts how many unique images contain each dynamic tag.

Use Cases

  • Identify most common categories in a dataset.
  • Prioritize frequently occurring tags for search and recommendation systems.

Retrieve the Top Tags for Each Image

This query retrieves the most significant dynamic tags for each image, helping with content categorization.

1-- Step 1: Join with core table to get video association information
2WITH video_info AS (
3 SELECT DISTINCT
4 dt.coactive_image_id,
5 t.coactive_video_id
6 FROM dt_sports_visual dt
7 JOIN coactive_table t ON dt.coactive_image_id = t.coactive_image_id
8 WHERE t.coactive_video_id IS NOT NULL -- Filter for video keyframes
9),
10
11-- Step 2: Rank tags for each image, excluding video keyframes for standalone image analysis
12ranked AS (
13 SELECT
14 dt.coactive_image_id,
15 dt.tag_name,
16 dt.score,
17 ROW_NUMBER() OVER (
18 PARTITION BY dt.coactive_image_id -- Rank tags for each image
19 ORDER BY dt.score DESC -- Highest scoring tags first
20 ) AS rn
21 FROM dt_sports_visual dt
22 WHERE dt.coactive_image_id NOT IN (SELECT coactive_image_id FROM video_info) -- Exclude video keyframes
23),
24
25-- Step 3: Filter to retain the top 5 tags per image
26image_tag_table AS (
27 SELECT
28 coactive_image_id,
29 tag_name,
30 score
31 FROM ranked
32 WHERE rn <= 5 -- Keep only the top 5 tags per image
33 AND score > 0.001 -- Exclude low-confidence tags
34)
35
36-- Final Output: Retrieve the most significant tags for each image
37SELECT
38 coactive_image_id,
39 tag_name,
40 score
41FROM image_tag_table;

Explanation

  1. CTE (video_only): Ensures the analysis focuses on standalone images by excluding video-related data in the next step, which is not necessary if your dataset only contains images.

  2. CTE (ranked): Assigns a rank to each dynamic tag for every image (coactive_image_id), based on the tag’s score.

  3. Final Query: Filters results to include only:

    • The top 5 tags for each image (rn <= 5).
    • Tags with a score above 0.001 to exclude irrelevant or low-confidence tags.

Use Cases

  • Enable image-based metadata tagging for search and filtering.
  • Improve content organization by identifying dominant tags in images.

Retrieve the Top Tags for Each Video

This query retrieves the most significant dynamic tags for each video, helping structure video metadata.

1-- Step 1: Get video associations from keyframe data
2WITH video_keyframes AS (
3 SELECT *
4 dt.coactive_image_id,
5 dt.tag_name,
6 dt.score,
7 t.coactive_video_id
8 FROM dt_sports_visual dt
9 JOIN coactive_table t ON dt.coactive_image_id = t.coactive_image_id
10 WHERE t.coactive_video_id IS NOT NULL -- Ensure valid video data
11),
12
13-- Step 2: Rank tags for each keyframe within each video
14ranked AS (
15 SELECT
16 coactive_video_id,
17 coactive_image_id,
18 tag_name,
19 score,
20 ROW_NUMBER() OVER (
21 PARTITION BY coactive_video_id, coactive_image_id -- Rank tags separately for each keyframe within a video
22 ORDER BY score DESC -- Highest scoring tags first
23 ) AS rn
24 FROM video_keyframes
25),
26
27-- Step 3: Remove duplicates for tags within each video
28distinct_tags AS (
29 SELECT
30 coactive_video_id,
31 coactive_image_id,
32 tag_name,
33 score,
34 ROW_NUMBER() OVER (
35 PARTITION BY coactive_video_id, tag_name -- Rank each tag within the same video
36 ORDER BY score DESC -- Keep the highest score for each tag
37 ) AS tag_rank
38 FROM ranked
39 WHERE rn <= 5 -- Keep only the top 5 tags per keyframe
40 AND score > 0.001 -- Exclude low-confidence tags
41),
42
43-- Step 4: Rank tags across the video to find the most significant ones
44video_tag_table AS (
45 SELECT
46 coactive_video_id,
47 coactive_image_id,
48 tag_name,
49 score,
50 DENSE_RANK() OVER (
51 PARTITION BY coactive_video_id -- Rank tags globally for the video
52 ORDER BY score DESC
53 ) AS rn_video
54 FROM distinct_tags
55 WHERE tag_rank = 1 -- Include only the best instance of each tag
56),
57
58-- Step 5: Select the top 5 tags for each video
59video_top_tag_table AS (
60 SELECT *
61 FROM video_tag_table
62 WHERE rn_video <= 5 -- Limit to the top 5 tags per video
63 ORDER BY coactive_video_id, score DESC
64)
65
66-- Final Output: Retrieve the most significant tags for each video
67SELECT *
68FROM video_top_tag_table;

Explanation

  1. CTE (video_only): Filters to keep only valid video data.

  2. CTE (ranked): Ranks tags within each video by confidence score.

  3. CTE (distinct_tags): Ensures that each tag appears only once per video, keeping the highest confidence score.

  4. CTE (video_tag_table): Retrieves the top 5 tags per video.

Use Cases

  • Enhance video metadata for search and categorization.
  • Provide a structured summary of video content for analysis.