torchoutil.extras.hdf.pack module

bytearray_to_bytes(x: Any) → Any

hdf_dtype_to_fill_value(hdf_dtype: dtype | Literal['b', 'i', 'u', 'f', 'c'] | type | None) → bool | int | float | complex | None | str | bytes

hdf_dtype_to_numpy_dtype(hdf_dtype: dtype | Literal['b', 'i', 'u', 'f', 'c'] | type) → dtype

numpy_dtype_to_hdf_dtype(dtype: dtype | None, *, encoding: str = 'utf-8') → dtype

pack_to_hdf(dataset: SupportsGetitemLen[T], hdf_fpath: str | Path, pre_transform: Callable[[T], T_DictOrTuple] | None = None, *, exists: Literal['overwrite', 'skip', 'error'] = 'error', verbose: int = 0, batch_size: int = 32, num_workers: int | Literal['auto'] = 'auto', shape_suffix: str = '__shape', store_str_as_vlen: bool = False, file_kwds: Dict[str, Any] | None = None, ds_kwds: Dict[str, Any] | None = None, user_attrs: Any = None, skip_scan: bool = False) → HDFDataset[T_DictOrTuple, T_DictOrTuple]

Pack a dataset into an HDF file.

Args:

dataset: The dataset to pack. It must be sized, and every item must be a dict whose keys are strings and whose values are int, float, str, Tensor, non-empty List[int], non-empty List[float] or non-empty List[str]. If values are tensors or lists, the number of dimensions must be the same for all items in the dataset.

hdf_fpath: The path to the HDF file.

pre_transform: The optional transform to apply to each item returned by the dataset BEFORE storing it in the HDF file. Can be used for deterministic transforms like Resample, LogMelSpectrogram, etc. Defaults to None.

exists: The action to perform if the target HDF file already exists. Defaults to “error”.

“overwrite”: Replace the target file, then pack the dataset.

“skip”: Skip packing and return the already packed dataset.

“error”: Raise a ValueError.

verbose: Verbose level. Defaults to 0.

batch_size: The batch size of the dataloader. Defaults to 32.

num_workers: The number of workers of the dataloader. If “auto”, it will be set to len(os.sched_getaffinity(0)). Defaults to “auto”.

shape_suffix: Shape column suffix in the HDF file. Defaults to “__shape”.

store_str_as_vlen: If True, store strings as a variable-length string dtype. Defaults to False.

file_kwds: Options passed to the h5py file object. Defaults to None.

ds_kwds: Keyword arguments passed to the returned HDFDataset instance if the target file already exists and exists == “skip”. Defaults to None.

user_attrs: Additional metadata to add to the HDF file. It must be convertible to JSON with json.dumps. Defaults to None.

skip_scan: If True, the input dataset is assumed to be fully homogeneous, meaning that all values in a column have the same shape and dtype, which are inferred from the first batch. This skips the first step, which scans each dataset item once, and speeds up packing to the HDF file. Defaults to False.
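The semantics of exists and num_workers described above can be sketched with the standard library alone. The helper names resolve_num_workers and check_exists_policy are hypothetical and are not part of torchoutil. The fallback to os.cpu_count() on platforms without os.sched_getaffinity is also an assumption. This is an illustration of the documented behaviour, not the library's implementation.

```python
import os
from pathlib import Path
from typing import Literal, Union


def resolve_num_workers(num_workers: Union[int, Literal["auto"]]) -> int:
    """Hypothetical helper: turn "auto" into a concrete worker count."""
    if num_workers != "auto":
        return int(num_workers)
    if hasattr(os, "sched_getaffinity"):
        # Number of CPUs the current process is allowed to run on.
        return len(os.sched_getaffinity(0))
    # Assumed fallback where sched_getaffinity is unavailable.
    return os.cpu_count() or 1


def check_exists_policy(hdf_fpath: Union[str, Path], exists: str) -> str:
    """Hypothetical helper: decide what to do with the target path.

    Returns "pack" when packing should proceed and "skip" when the
    existing file should be reused. Raises ValueError for "error".
    """
    if exists not in ("overwrite", "skip", "error"):
        raise ValueError(f"Invalid argument exists={exists!r}.")
    path = Path(hdf_fpath)
    if not path.exists():
        return "pack"
    if exists == "error":
        raise ValueError(f"Target file {path} already exists.")
    if exists == "skip":
        return "skip"
    # "overwrite": the existing file is replaced by the new pack.
    return "pack"
```

For instance, resolve_num_workers(4) returns 4 unchanged, while resolve_num_workers("auto") returns a value that depends on the machine.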

Returns:

hdf_dataset: The target HDF dataset object.
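As an illustration of the input requirements, the sketch below builds a small sized dataset of dict items. It checks the one documented constraint on list values, namely that each column has the same number of dimensions in every item. ToyDataset, list_ndim and check_same_ndim are hypothetical names introduced for this example only. The check uses plain Python lists instead of tensors, so it runs without torch or h5py.

```python
import json
from typing import Any, Dict, List


class ToyDataset:
    """Hypothetical sized dataset whose items are dicts with string keys."""

    def __init__(self) -> None:
        self._items: List[Dict[str, Any]] = [
            {"index": 0, "label": "cat", "values": [0.1, 0.2, 0.3]},
            {"index": 1, "label": "dog", "values": [0.4, 0.5]},
        ]

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        return self._items[idx]

    def __len__(self) -> int:
        return len(self._items)


def list_ndim(value: Any) -> int:
    """Nesting depth of a non-empty list; 0 for scalars and strings."""
    ndim = 0
    while isinstance(value, list):
        ndim += 1
        value = value[0]
    return ndim


def check_same_ndim(dataset: Any) -> Dict[str, int]:
    """Raise ValueError if a column's ndim differs between items."""
    ndims: Dict[str, int] = {}
    for i in range(len(dataset)):
        for key, value in dataset[i].items():
            ndim = list_ndim(value)
            if ndims.setdefault(key, ndim) != ndim:
                raise ValueError(f"Column {key!r} has inconsistent ndim at item {i}.")
    return ndims


dataset = ToyDataset()
print(check_same_ndim(dataset))  # {'index': 0, 'label': 0, 'values': 1}

# user_attrs must be JSON-convertible; json.dumps raises TypeError otherwise.
user_attrs = {"source": "toy example"}
json.dumps(user_attrs)
```

Following the signature above, such an object could then be passed as pack_to_hdf(dataset, "toy.hdf", exists="overwrite", user_attrs=user_attrs). The file name "toy.hdf" is a placeholder.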