cheesechaser.datapool.base
This module provides functionality for managing and downloading resources from a data pool, particularly focused on Hugging Face-based datasets. It includes classes for handling data locations, managing data pools, and specific implementations for incremental ID-based data pools.
The module offers features such as:

- Normalized path handling
- Custom exception classes for resource-related errors
- A generic DataPool class with batch downloading capabilities
- A Hugging Face-based data pool implementation
- An incremental ID-based data pool implementation
Key components:

- DataLocation: Represents the location of a file within a tar archive
- DataPool: Abstract base class for data pool operations
- HfBasedDataPool: Implementation of DataPool for Hugging Face datasets
- IncrementIDDataPool: Specialized implementation for incremental ID-based datasets
This module is useful for efficiently managing and retrieving resources from large datasets, especially those hosted on Hugging Face.
DataLocation
- class cheesechaser.datapool.base.DataLocation(tar_file: str, filename: str)[source]
Represents the location of a file within a tar archive.
- Parameters:
tar_file (str) – The name of the tar file containing the data.
filename (str) – The name of the file within the tar archive.
- __eq__(other)
Return self==value.
- __hash__ = None
Instances are unhashable: defining __eq__ without __hash__ sets __hash__ to None.
- __init__(tar_file: str, filename: str) → None
- __repr__()
Return repr(self).
- __weakref__
list of weak references to the object (if defined)
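The `__eq__`/`__hash__ = None` pair follows standard Python semantics: a class that defines `__eq__` without `__hash__` becomes unhashable. A minimal stand-in (not the real class, which lives in `cheesechaser.datapool.base`) illustrates the behavior:

```python
from dataclasses import dataclass


# Hypothetical stand-in mirroring DataLocation's documented fields.
# dataclass(eq=True) generates __eq__ and, because no __hash__ is
# defined, sets __hash__ = None -- matching the documented class.
@dataclass(eq=True)
class Location:
    tar_file: str
    filename: str


a = Location('images-0001.tar', '12345.png')
b = Location('images-0001.tar', '12345.png')
print(a == b)             # field-wise equality
print(Location.__hash__)  # None: instances cannot be used as dict keys
```

A consequence worth noting: such objects compare by value but cannot be collected into a set or used as dictionary keys.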
DataPool
- class cheesechaser.datapool.base.DataPool[source]
Abstract base class for data pool operations.
This class defines the interface for data pool operations and provides a method for batch downloading resources to a directory.
- __weakref__
list of weak references to the object (if defined)
- batch_download_to_directory(resource_ids, dst_dir: str, max_workers: int = 12, save_metainfo: bool = True, metainfo_fmt: str = '{resource_id}_metainfo.json')[source]
Download multiple resources to a directory.
This method downloads a batch of resources to a specified directory, optionally saving metadata for each resource.
- Parameters:
resource_ids (Iterable[Union[str, Tuple[str, Any]]]) – List of resource IDs or tuples of (resource_id, resource_info) to download.
dst_dir (str) – Destination directory for downloaded files.
max_workers (int) – Maximum number of worker threads for parallel downloads.
save_metainfo (bool) – Whether to save metadata information for each resource.
metainfo_fmt (str) – Format string for metadata filenames.
- Raises:
OSError – If there’s an issue creating the destination directory or copying files.
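The method's behavior can be sketched with a simplified, self-contained re-implementation. The `fetch_file` callable and the `.dat` naming below are hypothetical stand-ins; the real method resolves each resource through `mock_resource` and preserves original filenames:

```python
import json
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def batch_download_sketch(resource_ids, dst_dir, fetch_file, max_workers=12,
                          save_metainfo=True,
                          metainfo_fmt='{resource_id}_metainfo.json'):
    """Simplified sketch of DataPool.batch_download_to_directory.

    fetch_file is a hypothetical callable standing in for the real
    resource resolution; it returns a local path for a resource id.
    """
    os.makedirs(dst_dir, exist_ok=True)  # may raise OSError, as documented

    def _one(item):
        # Each item is either a bare id or a (resource_id, resource_info) pair.
        resource_id, resource_info = item if isinstance(item, tuple) else (item, None)
        # The sketch stores under a synthetic name; the real method keeps
        # the resource's own filename.
        shutil.copyfile(fetch_file(resource_id),
                        os.path.join(dst_dir, f'{resource_id}.dat'))
        if save_metainfo:
            # Metadata filename is built from the documented format string.
            meta_path = os.path.join(dst_dir,
                                     metainfo_fmt.format(resource_id=resource_id))
            with open(meta_path, 'w') as f:
                json.dump({'id': resource_id, 'info': resource_info}, f)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(_one, resource_ids))
```

Calling it with a mixed list such as `[1, (2, {'tag': 'x'})]` produces one downloaded file plus one metadata JSON per resource in the destination directory.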
- mock_resource(resource_id, resource_info) → AbstractContextManager[Tuple[str, Any]][source]
Context manager to mock a resource.
This method should be implemented by subclasses to provide a way to temporarily access a resource.
- Parameters:
resource_id – The ID of the resource to mock.
resource_info – Additional information about the resource.
- Returns:
A context manager yielding a tuple of the path to the mocked resource and its info.
- Raises:
NotImplementedError – If not implemented by a subclass.
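Subclasses supply this method as a context manager. A minimal, self-contained sketch of the pattern follows; `InMemoryDataPool` is a hypothetical class for illustration, whereas real implementations such as HfBasedDataPool fetch the resource from a Hugging Face repository instead:

```python
import os
import tempfile
from contextlib import contextmanager


class InMemoryDataPool:
    """Hypothetical DataPool-like class keeping resource bytes in a dict."""

    def __init__(self, resources):
        self._resources = resources  # maps resource_id -> bytes

    @contextmanager
    def mock_resource(self, resource_id, resource_info):
        # Materialize the resource in a temporary directory, yield the
        # (path, info) pair, and clean up when the with-block exits.
        with tempfile.TemporaryDirectory() as td:
            path = os.path.join(td, str(resource_id))
            with open(path, 'wb') as f:
                f.write(self._resources[resource_id])
            yield path, resource_info


pool = InMemoryDataPool({'r1': b'hello'})
with pool.mock_resource('r1', {'kind': 'text'}) as (path, info):
    content = open(path, 'rb').read()
```

The temporary-directory lifetime is the key design point: the resource exists only inside the `with` block, which is what lets batch_download_to_directory copy it out and discard the scratch space afterwards.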
HfBasedDataPool
- class cheesechaser.datapool.base.HfBasedDataPool(data_repo_id: str, data_revision: str = 'main', idx_repo_id: str | None = None, idx_revision: str = 'main')[source]
Implementation of DataPool for Hugging Face datasets.
This class provides methods to interact with and download resources from Hugging Face datasets.
- Parameters:
data_repo_id (str) – The ID of the Hugging Face dataset repository.
data_revision (str) – The revision of the dataset to use.
idx_repo_id (str | None) – The ID of the index repository (defaults to data_repo_id if not provided).
idx_revision (str) – The revision of the index to use.
- __init__(data_repo_id: str, data_revision: str = 'main', idx_repo_id: str | None = None, idx_revision: str = 'main')[source]
- mock_resource(resource_id, resource_info) → AbstractContextManager[Tuple[str, Any]][source]
Context manager to temporarily access a resource.
This method downloads the requested resource to a temporary directory and provides access to it.
- Parameters:
resource_id – The ID of the resource to mock.
resource_info – Additional information about the resource.
- Returns:
A context manager yielding a tuple of the path to the temporary directory and the resource info.
- Raises:
ResourceNotFoundError – If the resource cannot be found or downloaded.
IncrementIDDataPool
- class cheesechaser.datapool.base.IncrementIDDataPool(data_repo_id: str, data_revision: str = 'main', idx_repo_id: str | None = None, idx_revision: str = 'main', base_level: int = 3, base_dir: str = 'images')[source]
A specialized implementation of HfBasedDataPool for incremental ID-based datasets.
This class is designed to work with datasets where resources are identified by incremental integer IDs and are organized in a hierarchical directory structure.
- Parameters:
data_repo_id (str) – The ID of the Hugging Face dataset repository.
data_revision (str) – The revision of the dataset to use.
idx_repo_id (str | None) – The ID of the index repository (defaults to data_repo_id if not provided).
idx_revision (str) – The revision of the index to use.
base_level (int) – The base level for the hierarchical structure.
base_dir (str) – The base directory for the dataset files.
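To illustrate the kind of hierarchical, ID-based layout such a pool manages, the sketch below groups integer IDs into archives by their last `base_level` digits. The grouping rule and the `.tar` naming are assumptions for illustration only; the actual on-repository layout is determined by the dataset, not by this sketch:

```python
def grouped_archive_path(resource_id: int, base_level: int = 3,
                         base_dir: str = 'images') -> str:
    """Hypothetical mapping from an incremental integer ID to an
    archive path, grouping IDs by their last base_level digits so each
    tar archive holds a bounded slice of the ID space."""
    group = resource_id % 10 ** base_level  # last base_level digits
    return f'{base_dir}/{group:0{base_level}d}.tar'


print(grouped_archive_path(123456))  # images/456.tar
```

In typical use, one constructs the pool from its documented signature and downloads by integer ID, e.g. `IncrementIDDataPool(data_repo_id='user/dataset').batch_download_to_directory([1, 2, 3], 'output')` (repository name here is a placeholder).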