Network learning
DataLoader()
```python
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)
```
dataset style
- dataset (Dataset class): if an `IterableDataset` is used, the data loading order is entirely controlled by the user-defined iterable.
- data loading order and sampler: a map-style dataset represents a map from indices/keys (possibly non-integral) to data samples, so the sequence of indices/keys must be specified. A sampler could randomly permute a list of indices and yield them one at a time, or yield a small number of them at once for mini-batch SGD. This is why `sampler` and `batch_sampler` are not compatible with iterable-style datasets: such datasets have no notion of a key or an index.
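To make the two dataset styles concrete, here is a minimal pure-Python sketch (the class and function names are made up for illustration; in PyTorch these roles are played by `Dataset`, `IterableDataset`, and `Sampler`):

```python
import random

# Map-style: a mapping from integer indices to samples.
class MapStyleDataset:
    def __init__(self, samples):
        self.samples = samples

    def __getitem__(self, index):
        return self.samples[index]

    def __len__(self):
        return len(self.samples)

# Iterable-style: the loading order is whatever the iterator yields,
# so there is no index for a sampler to work with.
class IterableStyleDataset:
    def __init__(self, samples):
        self.samples = samples

    def __iter__(self):
        return iter(self.samples)

dataset = MapStyleDataset(["a", "b", "c", "d"])

# A sampler just yields indices; here, a random permutation.
def random_sampler(ds):
    indices = list(range(len(ds)))
    random.shuffle(indices)
    yield from indices

samples = [dataset[i] for i in random_sampler(dataset)]
# samples is some permutation of ["a", "b", "c", "d"]
```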
batched and non-batched data
- batched and non-batched data: the relevant arguments are `batch_size`, `drop_last` and `batch_sampler`:
  - `batch_size` (int, optional, default=1): when not `None` (i.e. it can be `None`), the dataloader yields batched samples instead of individual samples.
  - `drop_last` (bool, optional, default=False): set `True` to drop the last incomplete batch, if the dataset size is not divisible by `batch_size`.
  - `batch_sampler` (Sampler class or iterable, optional): yields a list of indices for each batch.
- when automatic batching is enabled, loading from a map-style dataset is roughly equivalent to:
```python
for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])
```
and loading from an iterable-style dataset is roughly equivalent to:
```python
dataset_iter = iter(dataset)
for indices in batch_sampler:
    yield collate_fn([next(dataset_iter) for _ in indices])
```
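The batched map-style loop can be exercised end to end with toy stand-ins (the dataset, batch_sampler, and collate_fn below are made-up illustrations, not PyTorch objects):

```python
# Pure-Python demo of batched loading from a map-style dataset.
dataset = {0: "a", 1: "b", 2: "c", 3: "d", 4: "e"}
batch_sampler = [[0, 1], [2, 3], [4]]  # indices grouped into batches
collate_fn = list  # trivial collate: keep the fetched batch as a list

def loader():
    for indices in batch_sampler:
        yield collate_fn([dataset[i] for i in indices])

batches = list(loader())
# batches == [["a", "b"], ["c", "d"], ["e"]]
```

With `drop_last=True`, the final incomplete batch `["e"]` would be discarded instead of yielded.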
- disable automatic batching: when both `batch_size` and `batch_sampler` are `None`, automatic batching is disabled. Each sample obtained from the `dataset` is processed with the function passed as the `collate_fn` argument.
In this case, loading from a map-style dataset is roughly equivalent to:
```python
for index in sampler:
    yield collate_fn(dataset[index])
```
and loading from an iterable-style dataset is roughly equivalent to:
```python
for data in iter(dataset):
    yield collate_fn(data)
```
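A small runnable sketch of the disabled-batching case, where `collate_fn` receives one sample at a time (`convert` is a made-up stand-in for the default `collate_fn`):

```python
# With automatic batching disabled, each sample passes through
# collate_fn individually and is yielded as-is.
map_dataset = ["a", "b", "c"]
sampler = range(len(map_dataset))  # a trivial sequential sampler

def convert(sample):
    # Stand-in for the default collate_fn, which converts NumPy
    # arrays to tensors and passes everything else through unchanged.
    return sample

items = [convert(map_dataset[i]) for i in sampler]
# items == ["a", "b", "c"]
```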
- `collate_fn` (callable, optional):
  - when automatic batching is disabled, `collate_fn` is called with each individual data sample, and the output is yielded from the dataloader iterator. In this case, the default `collate_fn` simply converts NumPy arrays into PyTorch tensors and keeps everything else untouched.
  - when automatic batching is enabled, `collate_fn` is called with a list of data samples each time. It is expected to collate the input samples into a batch for yielding from the dataloader iterator. For instance, if the dataset returns tuples (data, index), the default `collate_fn` collates them into a list of data and a list of indices, i.e. [batch_size, data] and [batch_size, index].
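The tuple-collation behaviour described above can be sketched in plain Python (a simplified stand-in for the default `collate_fn`, which would additionally stack arrays into tensors):

```python
# A batch of (data, label) tuples is transposed into
# (batch_of_data, batch_of_labels).
def collate(batch):
    data, labels = zip(*batch)  # list of tuples -> tuple of lists
    return list(data), list(labels)

batch = [([1, 2], 0), ([3, 4], 1), ([5, 6], 0)]
data, labels = collate(batch)
# data == [[1, 2], [3, 4], [5, 6]], labels == [0, 1, 0]
```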
other arguments
- `shuffle` (bool, optional, default=False): set `True` to have the data reshuffled at every epoch.
- `num_workers` (int, optional, default=0): how many subprocesses to use for data loading. 0 means the data will be loaded in the main process.
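As a rough sketch of what `shuffle=True` means, reshuffling at every epoch amounts to drawing a fresh permutation of the indices per epoch (`epoch_order` is a hypothetical helper for illustration, not a DataLoader API):

```python
import random

data = list(range(6))

def epoch_order(seed):
    # One epoch's visiting order: a fresh permutation of all indices.
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    return indices

orders = [epoch_order(epoch) for epoch in range(2)]
# every epoch still visits each sample exactly once, just in a new order
```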
single-process or multi-process data loading
not needed for now.