
fiftyone-find-duplicates

Find, review, and remove duplicate or near-duplicate images in FiftyOne datasets using computer vision similarity embeddings.

Introduction

This skill enables AI assistants to autonomously manage image dataset quality by identifying and removing redundant content. Leveraging FiftyOne's brain similarity computation, the agent computes image embeddings, builds similarity indexes, and flags duplicates based on configurable distance thresholds. It is designed for data scientists, machine learning engineers, and data annotators who need to clean large-scale computer vision datasets to improve training-data efficiency and reduce model overfitting caused by duplication.

The workflow guides the agent through the complete lifecycle of dataset curation: from initial environment setup and plugin verification to compute-intensive similarity operations and final manual or automated review. The agent uses FiftyOne operators to handle exact byte-level matches and near-duplicate visual patterns, allowing for precise control over what qualifies as a duplicate.

  • Computes image embeddings using pre-trained models like mobilenet-v2-imagenet-torch to quantify visual similarity.

  • Supports automated and manual identification of near-duplicate images using customizable distance thresholds (e.g., 0.1 for near-exact, 0.3 for recommended near-duplicates).
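The thresholding logic itself is simple. The toy sketch below (plain NumPy, illustrating the idea rather than the brain internals) flags a pair as near-duplicates when the cosine distance between their embeddings falls under the threshold:

```python
import numpy as np


def near_duplicate_pairs(embeddings: np.ndarray, thresh: float = 0.3):
    """Return index pairs whose cosine distance is below `thresh`."""
    # Normalize rows so the dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = 1.0 - normed @ normed.T
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if dists[i, j] < thresh:
                pairs.append((i, j))
    return pairs


# Two nearly identical vectors and one distinct one
embs = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(near_duplicate_pairs(embs, thresh=0.1))  # → [(0, 1)]
```

Lowering the threshold (e.g. 0.1) keeps only near-exact matches; raising it (e.g. 0.3) also catches looser visual duplicates, at the cost of more false positives to review.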

  • Provides deep integration with the FiftyOne App for visual validation, enabling users to review duplicate groups, load saved views of representatives, and interactively delete redundant samples.

  • Includes dedicated workflows for both exact file-based deduplication and complex semantic near-duplicate removal.
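Exact file-based deduplication reduces to hashing file bytes. A minimal stdlib sketch (independent of the FiftyOne operators) looks like this:

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(filepaths):
    """Group file paths by the MD5 hash of their raw bytes.

    Returns only the groups that contain more than one file, i.e.
    the sets of byte-identical images.
    """
    groups = defaultdict(list)
    for path in filepaths:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        groups[digest].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Within each group, one file is kept as the representative and the rest can be safely deleted, since they are byte-for-byte identical.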

  • Manages interaction with the FiftyOne Brain plugin (@voxel51/brain) to handle high-performance similarity indexing.

  • Prerequisites: Requires the FiftyOne Python library, an initialized FiftyOne dataset, and the @voxel51/brain plugin installed.
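Assuming a standard pip setup, the prerequisites can be installed as follows (the plugin URL is the official voxel51/fiftyone-plugins repository):

```shell
# Install the FiftyOne Python library
pip install fiftyone

# Download the brain plugin from the official plugins repository
fiftyone plugins download \
    https://github.com/voxel51/fiftyone-plugins \
    --plugin-names @voxel51/brain

# Verify the plugin is available
fiftyone plugins list
```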

  • Inputs: Expects a valid dataset name and optional threshold parameters; output is a cleaned dataset with optimized sample distribution.

  • Operational constraints: Depends on local or remote machine availability for embedding computation; large datasets may require significant memory or GPU resources.

  • Best practices: Always launch the FiftyOne App within the session context before executing brain operators to ensure proper GUI state synchronization.

  • Pro-tip: Use the created saved views (e.g., 'near duplicates' or 'representatives') to speed up the auditing process before calling the final deduplication operators.

Repository Stats

  • Stars: 26

  • Forks: 5

  • Open Issues: 8

  • Language: JavaScript

  • Default Branch: main