import multiprocessing as mp

# Experiment configuration.
BATCH_SIZE = 64
LANGUAGE_MODEL = "bert-base-uncased"
MAX_TEXT_LENGTH = 256
NUM_WORKERS = mp.cpu_count()
N = 100000
Sachin Abeywardana
November 17, 2021
Since I have been using collate functions a lot lately, I wanted to see how fast they actually are. TL;DR: it's quicker to use the tokenizer after normal batching than it is through a collate function. Not sure why.
We will be using the SNLI dataset sentences (and throwing away labels) for this experiment.
import datasets
from torch.utils.data import Dataset

snli = datasets.load_dataset('snli', split='train')

class Sentences(Dataset):
    def __init__(self, data: datasets.Dataset, limit: int) -> None:
        # Flatten each (hypothesis, premise) pair into a single list of sentences.
        sentences = [[pair["hypothesis"], pair["premise"]] for pair in data]
        sentences = [sentence for pair in sentences for sentence in pair]
        self.sentences = sentences[:limit]

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, i):
        return self.sentences[i]

sentence_ds = Sentences(snli, N)
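Just to make the setup concrete, here is a quick sanity check (mine, not a cell from the original notebook): each item of the dataset is a raw string, which is why the default DataLoader collate can batch items into a plain list of strings later on.

print(len(sentence_ds))  # 100000, i.e. N
print(sentence_ds[0])    # a single hypothesis/premise sentence from SNLI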
Downloading and preparing dataset snli/plain_text (download: 90.17 MiB, generated: 65.51 MiB, post-processed: Unknown size, total: 155.68 MiB) to /root/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b...
Dataset snli downloaded and prepared to /root/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b. Subsequent calls will reuse this data.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
Let's define a collate function, which is just your usual HuggingFace tokenizer, but with some defaults.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(LANGUAGE_MODEL)

class CollateFn:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, x):
        # Tokenize a list of strings into padded, truncated PyTorch tensors.
        return self.tokenizer(
            x,
            max_length=MAX_TEXT_LENGTH,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )

collate_fn = CollateFn(tokenizer)
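As a quick illustration (my own check, not from the post), calling the collate function on a small batch of strings returns a dict of padded tensors:

batch = collate_fn(["A man inspects the uniform.", "The man is sleeping."])
print(batch["input_ids"].shape)       # torch.Size([2, 256]), since MAX_TEXT_LENGTH = 256
print(batch["attention_mask"].shape)  # torch.Size([2, 256])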
As can be seen in the following two experiments, the inline collate_fn is twice as slow. It would be great to hear your opinions as to why. My only guess is that the DataLoader's multiprocessing is clashing with the tokenizer's own multiprocessing. However, changing the number of workers to 1 in the second cell below did nothing to help.
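For reference, here is a minimal sketch of the two setups being timed (the loop and variable names are mine, not the actual cells): the first tokenizes inside the DataLoader workers via collate_fn, the second lets the default collate return a plain list of strings and tokenizes afterwards in the main process.

import time
from torch.utils.data import DataLoader

# Experiment 1: tokenization happens inside the DataLoader worker processes.
loader_collate = DataLoader(
    sentence_ds, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, collate_fn=collate_fn
)
start = time.perf_counter()
for batch in loader_collate:
    pass  # each batch is already a dict of padded tensors
print(f"collate_fn: {time.perf_counter() - start:.1f}s")

# Experiment 2: the default collate yields a plain list of strings,
# which we tokenize in the main process after batching.
loader_plain = DataLoader(sentence_ds, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS)
start = time.perf_counter()
for batch in loader_plain:
    tokens = collate_fn(batch)
print(f"tokenize after batching: {time.perf_counter() - start:.1f}s")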