How to create a dataset in PyTorch?

PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

  • Dataset: torch.utils.data.Dataset
  • DataLoader: torch.utils.data.DataLoader

Custom dataset

from torch.utils.data import Dataset

class CustomDataset(Dataset):
    """
    CustomDataset stores the samples and their targets,
    returns the length of the dataset, and
    returns a (sample, target) pair for a given index.
    """
    def __init__(self, datapath):
        self.datapath = datapath

        self.data = []
        self.label = []

        # iterate over the entries in datapath and add them to
        # self.data and self.label; here we assume one
        # "sample<TAB>label" pair per line
        with open(datapath, 'r') as f:
            for line in f:
                sample, label = line.rstrip('\n').split('\t')
                self.data.append(sample)
                self.label.append(label)

    def __len__(self):
        """
        return the number of entries in the datapath
        """
        return len(self.data)
    
    def __getitem__(self, index):
        """
        return a sample from the dataset at the given index
        """

        return self.data[index], self.label[index]
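As a quick sanity check, the class above can be exercised on a tiny file. The tab-separated format and the temporary file are just assumptions for illustration; adapt the parsing to your own data layout:

```python
import os
import tempfile

from torch.utils.data import Dataset

class CustomDataset(Dataset):
    """Loads one 'sample<TAB>label' pair per line (illustrative format)."""
    def __init__(self, datapath):
        self.data, self.label = [], []
        with open(datapath) as f:
            for line in f:
                sample, label = line.rstrip("\n").split("\t")
                self.data.append(sample)
                self.label.append(label)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index], self.label[index]

# write a tiny tab-separated file to load
path = os.path.join(tempfile.mkdtemp(), "toy.tsv")
with open(path, "w") as f:
    f.write("hello\t0\nworld\t1\n")

dataset = CustomDataset(path)
print(len(dataset))   # 2
print(dataset[1])     # ('world', '1')
```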

Iterate through the DataLoader

The Dataset retrieves our dataset’s features and labels one sample at a time. While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval.

  • shuffle
    • shuffle=True: reshuffle the data at the start of every epoch
    • setting a seed: ensures every training run sees the same random order (reproducibility)
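A minimal sketch of the seeding point: passing a seeded torch.Generator to the DataLoader fixes the shuffle order, so two runs with the same seed produce identical batches. The toy dataset and seed value here are invented for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))

def batches_with_seed(seed):
    # a seeded generator makes the shuffle order deterministic
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=5, shuffle=True, generator=g)
    return [batch[0].tolist() for batch in loader]

# the same seed reproduces the same shuffled batches
assert batches_with_seed(0) == batches_with_seed(0)
```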
import matplotlib.pyplot as plt

# Display an image and its label from the first batch.
# iter() returns an iterator over batches; next() fetches the next batch.
train_features, train_labels = next(iter(train_dataloader))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")
img = train_features[0].squeeze()
label = train_labels[0]
plt.imshow(img, cmap="gray")
plt.show()
print(f"Label: {label}")

torch.utils.data

Dataset types

  • Map-style datasets
  • Iterable-style datasets (TBD)

Loading datasets

  • Load a dataset with the DataLoader class: instantiate a DataLoader with inputs such as the dataset, batch size, sampler, etc.
      train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
    
  • Sampler
    • DataLoader automatically constructs a shuffled sampler when shuffle=True;
    • or we can provide our own Sampler, for example torch.utils.data.distributed.DistributedSampler.
  • Batch (TBD)

  • collate_fn (TBD)
    • collate_fn prepares the samples in a batch (converting to tensors, padding to the same length, stacking to the same dimensions, etc.)
    • users can customize their own collate_fn
  • multi-processing (TBD)
    • python multiprocessing
    • torch.multiprocessing
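As a sketch of a custom collate_fn: the function below pads each sequence in a batch to the length of the longest one, then stacks them into a single tensor. The variable-length toy sequences are invented for illustration:

```python
import torch
from torch.utils.data import DataLoader

# toy variable-length sequences with integer labels (illustrative only)
samples = [(torch.tensor([1, 2, 3]), 0),
           (torch.tensor([4, 5]), 1),
           (torch.tensor([6]), 0)]

def pad_collate(batch):
    # pad every sequence in the batch to the longest length, then stack
    seqs, labels = zip(*batch)
    max_len = max(len(s) for s in seqs)
    padded = torch.stack([
        torch.cat([s, s.new_zeros(max_len - len(s))]) for s in seqs
    ])
    return padded, torch.tensor(labels)

loader = DataLoader(samples, batch_size=3, collate_fn=pad_collate)
features, labels = next(iter(loader))
print(features.shape)  # torch.Size([3, 3])
```

Without a custom collate_fn, the default collation would fail here because the sequences have different lengths and cannot be stacked directly.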
