cheesechaser.datapool.base
This module provides functionality for managing and downloading resources from a data pool, with a particular focus on datasets hosted on Hugging Face. It includes classes for handling data locations, managing data pools, and a specialized implementation for incremental ID-based data pools.
The module offers features such as:
Normalized path handling
Custom exception classes for resource-related errors
A generic DataPool class with batch downloading capabilities
A HuggingFace-based data pool implementation
An incremental ID-based data pool implementation
Key components:
DataLocation: Represents the location of a file within a tar archive
DataPool: Abstract base class for data pool operations
HfBasedDataPool: Implementation of DataPool for Hugging Face datasets
IncrementIDDataPool: Specialized implementation for incremental ID-based datasets
This module is useful for efficiently managing and retrieving resources from large datasets, especially those hosted on Hugging Face.
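For orientation, here is a minimal end-to-end sketch; the repository ID, resource IDs, and destination path are placeholders, and the calls follow the reference entries later in this section.
>>> from cheesechaser.datapool.base import HfBasedDataPool
>>> pool = HfBasedDataPool('username/dataset', data_revision='main')  # hypothetical repo
>>> # Download two resources in parallel into a local directory.
>>> pool.batch_download_to_directory(['resource1', 'resource2'], '/path/to/destination')
>>> # Or temporarily materialize a single resource for in-process work.
>>> with pool.mock_resource('resource1', None) as (path, info):
...     pass  # process the files under 'path'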
DataLocation
- class cheesechaser.datapool.base.DataLocation(resource_id: int, tar_file: str, filename: str)[source]
Represents the location of a file within a tar archive.
- Parameters:
resource_id (int) – The unique identifier for the resource.
tar_file (str) – The name of the tar file containing the data.
filename (str) – The name of the file within the tar archive.
- __eq__(other)
Return self==value.
- __hash__ = None
- __init__(resource_id: int, tar_file: str, filename: str) → None
- __repr__()
Return repr(self).
- __weakref__
list of weak references to the object (if defined)
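- Example:
For illustration, the doctest below builds two records with hypothetical values; because __eq__ is defined and __hash__ is set to None, instances with equal fields compare equal but are unhashable:
>>> from cheesechaser.datapool.base import DataLocation
>>> a = DataLocation(resource_id=42, tar_file='images/0000042.tar', filename='42.png')
>>> b = DataLocation(resource_id=42, tar_file='images/0000042.tar', filename='42.png')
>>> a == b
True
>>> {a}
Traceback (most recent call last):
    ...
TypeError: unhashable type: 'DataLocation'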
DataPool
- class cheesechaser.datapool.base.DataPool[source]
Abstract base class for data pool operations.
This class defines the interface for data pool operations and provides a method for batch downloading resources to a directory. Subclasses should implement the mock_resource method to provide specific functionality for different types of data pools.
The DataPool class is designed to be extended for various data sources and storage mechanisms.
- __weakref__
list of weak references to the object (if defined)
- batch_download_to_directory(resource_ids, dst_dir: str, max_workers: int = 12, save_metainfo: bool = True, metainfo_fmt: str = '{resource_id}_metainfo.json', max_downloads: int | None = None, silent: bool = False)[source]
Download multiple resources to a directory in parallel.
This method efficiently downloads a batch of resources to a specified directory, optionally saving metadata for each resource. It utilizes a thread pool to parallelize downloads for improved performance.
- Parameters:
resource_ids (Iterable[Union[str, Tuple[str, Any]]]) – Iterable of resource IDs, or of (resource_id, resource_info) tuples, to download.
dst_dir (str) – Destination directory for downloaded files.
max_workers (int) – Maximum number of worker threads for parallel downloads. Defaults to 12.
save_metainfo (bool) – Whether to save metadata information for each resource. Defaults to True.
metainfo_fmt (str) – Format string for metadata filenames. Defaults to ‘{resource_id}_metainfo.json’.
max_downloads (Optional[int]) – Maximum number of downloads to perform. If None, all resources will be downloaded. Defaults to None.
silent (bool) – If True, suppresses the per-file progress bars during the download process. Defaults to False.
- Raises:
OSError – If there’s an issue creating the destination directory or copying files.
Note
The max_downloads argument provides a rough limit on the download count. Due to parallel processing, the actual number of downloads may slightly exceed this limit.
- Example:
>>> data_pool = SomeDataPoolImplementation()
>>> data_pool.batch_download_to_directory(['resource1', 'resource2'], '/path/to/destination')
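Resource IDs may also be passed as (resource_id, resource_info) tuples; a sketch with hypothetical IDs and metadata is shown below, where save_metainfo=True writes each resource's info next to it using metainfo_fmt:
>>> data_pool.batch_download_to_directory(
...     resource_ids=[('resource1', {'tag': 'a'}), ('resource2', {'tag': 'b'})],
...     dst_dir='/path/to/destination',
...     max_workers=4,
...     save_metainfo=True,  # writes e.g. 'resource1_metainfo.json'
...     silent=True,         # hide per-file progress bars
... )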
- mock_resource(resource_id, resource_info, silent: bool = False) → AbstractContextManager[Tuple[str, Any]][source]
Context manager to mock a resource.
This method should be implemented by subclasses to provide a way to temporarily access a resource. It’s typically used to download or generate a temporary copy of the resource for processing.
- Parameters:
resource_id – The ID of the resource to mock.
resource_info – Additional information about the resource.
silent (bool) – If True, suppresses the per-file progress bars during the mocking process.
- Returns:
A tuple containing the path to the mocked resource and its info.
- Raises:
NotImplementedError – If not implemented by a subclass.
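To make the expected contract concrete, here is a minimal subclass sketch assuming a hypothetical pool that serves resources from a local directory; the class name and on-disk layout are invented for illustration, and only the context-manager shape mirrors the documented interface.
import os
import shutil
import tempfile
from contextlib import contextmanager

from cheesechaser.datapool.base import DataPool

class LocalDirDataPool(DataPool):
    # Hypothetical pool serving each resource from '<root_dir>/<resource_id>'.
    def __init__(self, root_dir: str):
        self.root_dir = root_dir

    @contextmanager
    def mock_resource(self, resource_id, resource_info, silent: bool = False):
        # Copy the resource into a temporary directory, yield its path together
        # with the caller-supplied info, and clean up on exit.
        with tempfile.TemporaryDirectory() as td:
            src = os.path.join(self.root_dir, str(resource_id))
            dst = os.path.join(td, str(resource_id))
            shutil.copyfile(src, dst)
            yield dst, resource_info
With such an implementation in place, the inherited batch_download_to_directory can drive parallel downloads through this context manager.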
HfBasedDataPool
- class cheesechaser.datapool.base.HfBasedDataPool(data_repo_id: str, data_revision: str = 'main', idx_repo_id: str | None = None, idx_revision: str = 'main', hf_token: str | None = None)[source]
Implementation of DataPool for Hugging Face datasets.
This class provides methods to interact with and download resources from Hugging Face datasets. It handles the complexities of working with Hugging Face’s repository structure and file organization.
- Parameters:
data_repo_id (str) – The ID of the Hugging Face dataset repository.
data_revision (str) – The revision of the dataset to use.
idx_repo_id (Optional[str]) – The ID of the index repository (defaults to data_repo_id if not provided).
idx_revision (str) – The revision of the index to use.
hf_token (Optional[str]) – Optional Hugging Face authentication token.
- Example:
>>> data_pool = HfBasedDataPool('username/dataset', data_revision='main')
>>> with data_pool.mock_resource('resource1', None) as (path, info):
...     # Work with the resource at 'path'
...     pass
- __init__(data_repo_id: str, data_revision: str = 'main', idx_repo_id: str | None = None, idx_revision: str = 'main', hf_token: str | None = None)[source]
- mock_resource(resource_id, resource_info, silent: bool = False) → AbstractContextManager[Tuple[str, Any]][source]
Context manager to temporarily access a resource.
This method downloads the requested resource to a temporary directory and provides access to it.
- Parameters:
resource_id – The ID of the resource to mock.
resource_info – Additional information about the resource.
silent (bool) – If True, suppresses the per-file progress bars during the mocking process.
- Returns:
A tuple containing the path to the temporary directory and the resource info.
- Raises:
ResourceNotFoundError – If the resource cannot be found or downloaded.
- Example:
>>> data_pool = HfBasedDataPool('username/dataset')
>>> with data_pool.mock_resource('resource1', {'metadata': 'value'}) as (path, info):
...     # Work with the resource at 'path'
...     print(f"Resource path: {path}")
...     print(f"Resource info: {info}")
IncrementIDDataPool
- class cheesechaser.datapool.base.IncrementIDDataPool(data_repo_id: str, data_revision: str = 'main', idx_repo_id: str | None = None, idx_revision: str = 'main', base_level: int = 3, base_dir: str = 'images', hf_token: str | None = None)[source]
A specialized implementation of HfBasedDataPool for incremental ID-based datasets.
This class is designed to work with datasets where resources are identified by incremental integer IDs and are organized in a hierarchical directory structure.
- Parameters:
data_repo_id (str) – The ID of the Hugging Face dataset repository.
data_revision (str) – The revision of the dataset to use.
idx_repo_id (Optional[str]) – The ID of the index repository (defaults to data_repo_id if not provided).
idx_revision (str) – The revision of the index to use.
base_level (int) – The base level for the hierarchical structure.
base_dir (str) – The base directory for the dataset files.
hf_token (Optional[str]) – Optional Hugging Face authentication token.
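- Example:
A minimal usage sketch, assuming a hypothetical repository whose archives live under images/ with a three-level ID hierarchy:
>>> from cheesechaser.datapool.base import IncrementIDDataPool
>>> pool = IncrementIDDataPool(
...     data_repo_id='username/dataset',  # hypothetical repository ID
...     data_revision='main',
...     base_level=3,
...     base_dir='images',
... )
>>> pool.batch_download_to_directory([1, 2, 3], '/path/to/destination')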
InvalidResourceDataError
- class cheesechaser.datapool.base.InvalidResourceDataError[source]
Exception raised when a resource's data is invalid or corrupted.
FileUnrecognizableError
- class cheesechaser.datapool.base.FileUnrecognizableError[source]
Exception raised when a file cannot be recognized or processed.
This exception is used when the system encounters a file that it cannot parse or interpret according to the expected format or structure.
- __weakref__
list of weak references to the object (if defined)
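As a closing sketch, defensive handling around a resource access might look as follows, assuming the documented exceptions (including ResourceNotFoundError, referenced above) are importable from this module:
>>> from cheesechaser.datapool.base import (
...     HfBasedDataPool, ResourceNotFoundError, FileUnrecognizableError)
>>> pool = HfBasedDataPool('username/dataset')  # hypothetical repository
>>> try:
...     with pool.mock_resource('resource1', None) as (path, info):
...         pass  # process the files under 'path'
... except ResourceNotFoundError:
...     print('resource missing from the pool')
... except FileUnrecognizableError:
...     print('file could not be parsed')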