cheesechaser.datapool.table

This module provides classes for managing and accessing data pools stored on Hugging Face.

It implements table-based data pools, which retrieve data resources stored in archives in Hugging Face repositories. A table maps each resource ID to the archive that holds it and to the file's name inside that archive; in SimpleTableHfDataPool, the table is loaded from a CSV or Parquet file.

TableBasedHfDataPool

class cheesechaser.datapool.table.TableBasedHfDataPool(data_repo_id: str, archive_column: str, file_in_archive_column: str, id_column: str = 'id', data_revision: str = 'main', mock_use_id: bool = True, hf_token: str | None = None)

A class representing a table-based data pool stored on Hugging Face.

This class extends HfBasedDataPool to manage data organized as a table, where each row describes one data item: its ID, the archive file that contains it, and its filename inside that archive.

Parameters:
  • data_repo_id (str) – The ID of the Hugging Face repository containing the data.

  • archive_column (str) – The name of the column containing archive filenames.

  • file_in_archive_column (str) – The name of the column containing filenames within archives.

  • id_column (str) – The name of the column containing unique identifiers for each data item (default is 'id').

  • data_revision (str) – The revision of the data to use (default is 'main').

  • mock_use_id (bool) – Whether to use the ID as part of the filename when extracting (default is True).

  • hf_token (Optional[str]) – An optional Hugging Face API token for authentication (default is None).

__init__(data_repo_id: str, archive_column: str, file_in_archive_column: str, id_column: str = 'id', data_revision: str = 'main', mock_use_id: bool = True, hf_token: str | None = None)
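
The role of the three column parameters can be illustrated with a minimal, standalone sketch. It is not the library's implementation: the CSV contents, the column names archive and filename, and the helper build_id_map are assumptions made up for this example. It only shows how a table with one row per item can be turned into a lookup from ID to archive location.

```python
import csv
import io

# Hypothetical table: one row per data item.
# The column names and values are illustrative, not taken from a real repository.
TABLE_CSV = """id,archive,filename
1,images/0000.tar,1.png
2,images/0000.tar,2.png
3,images/0001.tar,3.png
"""


def build_id_map(table_text, id_column="id", archive_column="archive",
                 file_in_archive_column="filename"):
    """Map each item ID to a tuple (archive file, filename inside the archive)."""
    reader = csv.DictReader(io.StringIO(table_text))
    return {
        int(row[id_column]): (row[archive_column], row[file_in_archive_column])
        for row in reader
    }


id_map = build_id_map(TABLE_CSV)
archive, filename = id_map[3]
print(archive, filename)
```

With a mapping like this, fetching item 3 comes down to opening one archive and reading one member from it, rather than scanning every archive in the repository.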

SimpleTableHfDataPool

class cheesechaser.datapool.table.SimpleTableHfDataPool(data_repo_id: str, archive_column: str, file_in_archive_column: str, table_file: str, id_column: str = 'id', data_revision: str = 'main', mock_use_id: bool = True, hf_token: str | None = None)

A simple implementation of TableBasedHfDataPool that loads data from a single table file.

This class loads the table from a CSV or Parquet file stored in a Hugging Face repository.

Parameters:
  • data_repo_id (str) – The ID of the Hugging Face repository containing the data.

  • archive_column (str) – The name of the column containing archive filenames.

  • file_in_archive_column (str) – The name of the column containing filenames within archives.

  • table_file (str) – The name of the CSV or Parquet file containing the data table.

  • id_column (str) – The name of the column containing unique identifiers for each data item (default is 'id').

  • data_revision (str) – The revision of the data to use (default is 'main').

  • mock_use_id (bool) – Whether to use the ID as part of the filename when extracting (default is True).

  • hf_token (Optional[str]) – An optional Hugging Face API token for authentication (default is None).

__init__(data_repo_id: str, archive_column: str, file_in_archive_column: str, table_file: str, id_column: str = 'id', data_revision: str = 'main', mock_use_id: bool = True, hf_token: str | None = None)
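
A construction call might look like the sketch below. The keyword names come from the signature documented above; the repository ID, table filename, and column names are hypothetical placeholders that would need to match a real dataset. The import is kept inside a function so the snippet can be read and loaded without cheesechaser installed, and nothing is sent to Hugging Face until make_pool() is actually called.

```python
# Keyword arguments follow the documented constructor signature.
# All values below are hypothetical placeholders, not a real dataset.
pool_kwargs = dict(
    data_repo_id="your-name/your-dataset",   # hypothetical repository ID
    archive_column="archive",                # hypothetical column name
    file_in_archive_column="filename",       # hypothetical column name
    table_file="table.parquet",              # hypothetical table file
    id_column="id",
    data_revision="main",
    mock_use_id=True,
    hf_token=None,  # set a token here if the repository requires authentication
)


def make_pool():
    """Build the pool; requires the cheesechaser package to be installed."""
    from cheesechaser.datapool.table import SimpleTableHfDataPool

    return SimpleTableHfDataPool(**pool_kwargs)
```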