cheesechaser.datapool.base

This module provides functionality for managing and downloading resources from a data pool, with a focus on datasets hosted on Hugging Face. It includes a class for describing data locations, an abstract data pool interface, and concrete implementations for Hugging Face repositories and incremental ID-based datasets.

The module offers features such as:

  • Normalized path handling

  • Custom exception classes for resource-related errors

  • A generic DataPool class with batch downloading capabilities

  • A HuggingFace-based data pool implementation

  • An incremental ID-based data pool implementation

Key components:

  • DataLocation: Represents the location of a file within a tar archive

  • DataPool: Abstract base class for data pool operations

  • HfBasedDataPool: Implementation of DataPool for Hugging Face datasets

  • IncrementIDDataPool: Specialized implementation for incremental ID-based datasets

This module is useful for efficiently managing and retrieving resources from large datasets, especially those hosted on Hugging Face.

DataLocation

class cheesechaser.datapool.base.DataLocation(tar_file: str, filename: str)[source]

Represents the location of a file within a tar archive.

Parameters:
  • tar_file (str) – The name of the tar file containing the data.

  • filename (str) – The name of the file within the tar archive.

__eq__(other)

Return self==value.

__hash__ = None
__init__(tar_file: str, filename: str) → None
__repr__()

Return repr(self).

__weakref__

list of weak references to the object (if defined)
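
The members listed above (`__eq__`, `__repr__`, `__hash__ = None`) match what Python generates for a plain, non-frozen dataclass, although the source does not say how the class is actually defined. Under that assumption, a stand-in named `DataLocationSketch` (hypothetical, not the library class) shows the practical consequence: instances compare by value but cannot be used as dictionary keys or set members. The archive and file names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class DataLocationSketch:
    """Hypothetical stand-in for DataLocation; assumes a plain dataclass."""
    tar_file: str   # name of the tar archive holding the data
    filename: str   # name of the member file inside that archive


a = DataLocationSketch('images/0000.tar', '1000.png')
b = DataLocationSketch('images/0000.tar', '1000.png')

# The generated __eq__ compares instances field by field.
same_location = (a == b)

try:
    hash(a)
    hashable = True
except TypeError:
    # __hash__ is None, so instances are unhashable.
    hashable = False

# When a hashable key is needed, one can be built from the fields.
key = (a.tar_file, a.filename)
```

If the real class follows the same pattern, a tuple of its two fields is a reasonable key for caches or deduplication.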

DataPool

class cheesechaser.datapool.base.DataPool[source]

Abstract base class for data pool operations.

This class defines the interface for data pool operations and provides a method for batch downloading resources to a directory.

__weakref__

list of weak references to the object (if defined)

batch_download_to_directory(resource_ids, dst_dir: str, max_workers: int = 12, save_metainfo: bool = True, metainfo_fmt: str = '{resource_id}_metainfo.json')[source]

Download multiple resources to a directory.

This method downloads a batch of resources to a specified directory, optionally saving metadata for each resource.

Parameters:
  • resource_ids (Iterable[Union[str, Tuple[str, Any]]]) – Iterable of resource IDs, or of (resource_id, resource_info) tuples, to download.

  • dst_dir (str) – Destination directory for downloaded files.

  • max_workers (int) – Maximum number of worker threads for parallel downloads.

  • save_metainfo (bool) – Whether to save metadata information for each resource.

  • metainfo_fmt (str) – Format string for metadata filenames.

Raises:

OSError – If there’s an issue creating the destination directory or copying files.
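
Two details of these parameters can be illustrated without touching the network: `resource_ids` may mix bare IDs with (resource_id, resource_info) tuples, and `metainfo_fmt` contains a `{resource_id}` placeholder. The sketch below is not the library's code. The helpers `split_item` and `metainfo_names` are hypothetical, treating a bare ID as having `resource_info = None` is an assumption, and so is the use of `str.format` to expand the placeholder.

```python
from typing import Any, Iterable, List, Tuple

# Default value taken from the documented signature.
DEFAULT_METAINFO_FMT = '{resource_id}_metainfo.json'


def split_item(item: Any) -> Tuple[Any, Any]:
    """Hypothetical helper: return (resource_id, resource_info) for one entry."""
    if isinstance(item, tuple):
        resource_id, resource_info = item
        return resource_id, resource_info
    # Assumption: a bare ID carries no extra info.
    return item, None


def metainfo_names(resource_ids: Iterable[Any],
                   metainfo_fmt: str = DEFAULT_METAINFO_FMT) -> List[str]:
    """Hypothetical helper: expand the format string for every entry."""
    return [
        metainfo_fmt.format(resource_id=split_item(item)[0])
        for item in resource_ids
    ]


names = metainfo_names(['1001', ('1002', {'note': 'example'})])
```

A custom `metainfo_fmt` such as `'meta/{resource_id}.json'` would expand the same way under this assumption.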

mock_resource(resource_id, resource_info) → AbstractContextManager[Tuple[str, Any]][source]

Context manager to mock a resource.

This method should be implemented by subclasses to provide a way to temporarily access a resource.

Parameters:
  • resource_id – The ID of the resource to mock.

  • resource_info – Additional information about the resource.

Returns:

A context manager that yields a tuple containing the path to the mocked resource and its info.

Raises:

NotImplementedError – If not implemented by a subclass.
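
To make the contract concrete, here is a minimal sketch of what an implementation might look like. `LocalDirPoolSketch` is hypothetical and does not subclass the real `DataPool`, so that the example runs without the package installed. It assumes each resource is a single `.txt` file in a local directory, and it assumes the yielded path should stop existing once the context exits.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from typing import Any, Iterator, Tuple


class LocalDirPoolSketch:
    """Hypothetical pool that serves '<resource_id>.txt' files from a directory."""

    def __init__(self, src_dir: str):
        self.src_dir = src_dir

    @contextmanager
    def mock_resource(self, resource_id, resource_info) -> Iterator[Tuple[str, Any]]:
        src = os.path.join(self.src_dir, f'{resource_id}.txt')
        if not os.path.isfile(src):
            raise FileNotFoundError(src)
        # The temporary directory is removed when the caller's block exits.
        with tempfile.TemporaryDirectory() as tmp_dir:
            shutil.copy(src, tmp_dir)
            yield tmp_dir, resource_info


# Demonstration with a throwaway source directory.
with tempfile.TemporaryDirectory() as src_dir:
    with open(os.path.join(src_dir, 'demo.txt'), 'w') as f:
        f.write('hello')

    pool = LocalDirPoolSketch(src_dir)
    with pool.mock_resource('demo', {'tag': 'x'}) as (path, info):
        files_seen = sorted(os.listdir(path))
        existed_inside = os.path.isdir(path)
    exists_after = os.path.exists(path)
```

Because the path is only valid inside the `with` block in this sketch, callers should copy or process the files before leaving it.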

HfBasedDataPool

class cheesechaser.datapool.base.HfBasedDataPool(data_repo_id: str, data_revision: str = 'main', idx_repo_id: str | None = None, idx_revision: str = 'main')[source]

Implementation of DataPool for Hugging Face datasets.

This class provides methods to interact with and download resources from Hugging Face datasets.

Parameters:
  • data_repo_id (str) – The ID of the Hugging Face dataset repository.

  • data_revision (str) – The revision of the dataset to use.

  • idx_repo_id (str | None) – The ID of the index repository (defaults to data_repo_id if not provided).

  • idx_revision (str) – The revision of the index to use.

__init__(data_repo_id: str, data_revision: str = 'main', idx_repo_id: str | None = None, idx_revision: str = 'main')[source]
mock_resource(resource_id, resource_info) → AbstractContextManager[Tuple[str, Any]][source]

Context manager to temporarily access a resource.

This method downloads the requested resource to a temporary directory and provides access to it.

Parameters:
  • resource_id – The ID of the resource to mock.

  • resource_info – Additional information about the resource.

Returns:

A context manager that yields a tuple containing the path to the temporary directory and the resource info.

Raises:

ResourceNotFoundError – If the resource cannot be found or downloaded.
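
A caller that fetches a single resource may want to treat a missing resource as a normal outcome rather than a crash. The sketch below shows one way to do that using only the documented interface. The helper `copy_resource` is hypothetical, passing `None` as `resource_info` is an assumption, and the fallback exception class exists only so the snippet can be read and run without the package installed.

```python
import os
import shutil

try:
    from cheesechaser.datapool.base import ResourceNotFoundError
except ImportError:
    # Stand-in so the sketch runs without the package installed.
    class ResourceNotFoundError(Exception):
        pass


def copy_resource(pool, resource_id, dst_dir: str) -> bool:
    """Hypothetical helper: copy one resource's files into dst_dir.

    Returns False when the pool reports the resource as missing.
    """
    os.makedirs(dst_dir, exist_ok=True)
    try:
        # Assumption: None is acceptable as resource_info.
        with pool.mock_resource(resource_id, None) as (tmp_dir, _info):
            for name in os.listdir(tmp_dir):
                src = os.path.join(tmp_dir, name)
                if os.path.isfile(src):
                    # Copy while the temporary directory is still available.
                    shutil.copy(src, os.path.join(dst_dir, name))
    except ResourceNotFoundError:
        return False
    return True
```

With the package installed, `pool` would be an `HfBasedDataPool` or a subclass of it. Any object with a compatible `mock_resource` works for experimentation.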

IncrementIDDataPool

class cheesechaser.datapool.base.IncrementIDDataPool(data_repo_id: str, data_revision: str = 'main', idx_repo_id: str | None = None, idx_revision: str = 'main', base_level: int = 3, base_dir: str = 'images')[source]

A specialized implementation of HfBasedDataPool for incremental ID-based datasets.

This class is designed to work with datasets where resources are identified by incremental integer IDs and are organized in a hierarchical directory structure.

Parameters:
  • data_repo_id (str) – The ID of the Hugging Face dataset repository.

  • data_revision (str) – The revision of the dataset to use.

  • idx_repo_id (str | None) – The ID of the index repository (defaults to data_repo_id if not provided).

  • idx_revision (str) – The revision of the index to use.

  • base_level (int) – The base level for the hierarchical structure.

  • base_dir (str) – The base directory for the dataset files.

__init__(data_repo_id: str, data_revision: str = 'main', idx_repo_id: str | None = None, idx_revision: str = 'main', base_level: int = 3, base_dir: str = 'images')[source]
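
The source does not spell out how `base_level` and `base_dir` map an ID to a location in the repository. Purely as an illustration, the sketch below assumes IDs are bucketed by their last `base_level` decimal digits into tar archives under `base_dir`. The function `archive_path_for_id` and that layout are assumptions, not the library's documented behavior. The second function, `download_by_ids`, is a usage sketch built only from the signatures documented on this page. It is defined but not executed here, and the repository ID is left to the caller.

```python
from typing import Iterable


def archive_path_for_id(resource_id: int, base_level: int = 3,
                        base_dir: str = 'images') -> str:
    """Hypothetical layout: bucket an ID by its last `base_level` digits."""
    bucket = resource_id % (10 ** base_level)
    # Zero-pad the bucket so every archive name has the same width.
    return f'{base_dir}/{bucket:0{base_level}d}.tar'


def download_by_ids(repo_id: str, ids: Iterable[int], dst_dir: str) -> None:
    """Usage sketch; needs the cheesechaser package and network access."""
    from cheesechaser.datapool.base import IncrementIDDataPool

    pool = IncrementIDDataPool(data_repo_id=repo_id)
    pool.batch_download_to_directory(
        resource_ids=ids,
        dst_dir=dst_dir,
        max_workers=4,  # the documented default is 12
    )


example_path = archive_path_for_id(1234567)
padded_path = archive_path_for_id(5)
```

Whatever the real layout is, the point of a scheme like this is that the archive holding a given ID can be computed from the ID itself.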

InvalidResourceDataError

class cheesechaser.datapool.base.InvalidResourceDataError[source]

Base exception for invalid resource data.

__weakref__

list of weak references to the object (if defined)

FileUnrecognizableError

class cheesechaser.datapool.base.FileUnrecognizableError[source]

Exception raised when a file cannot be recognized or processed.

__weakref__

list of weak references to the object (if defined)

ResourceNotFoundError

class cheesechaser.datapool.base.ResourceNotFoundError[source]

Exception raised when a requested resource is not found.