HuggingFace Tokenizers as Collate Functions Timing 🤗 🤖

pytorch
huggingface
Timing comparison of tokenizer as collate function and after batching
Author

Sachin Abeywardana

Published

November 17, 2021

Since I have been using collate functions a lot lately, I wanted to see how they perform. TL;DR: it is quicker to run the tokenizer on an already-assembled batch than to run it inside a collate function. I am not sure why.

Code
import multiprocessing as mp
import datasets
from torch.utils.data import DataLoader, Dataset
from tqdm.auto import tqdm
from transformers import AutoTokenizer
BATCH_SIZE = 64
LANGUAGE_MODEL = "bert-base-uncased"
MAX_TEXT_LENGTH = 256
NUM_WORKERS = mp.cpu_count()
N = 100000

We will be using the SNLI dataset sentences (and throwing away labels) for this experiment.

Code
snli = datasets.load_dataset('snli', split='train')

class Sentences(Dataset):
    def __init__(self, data: datasets.Dataset, limit: int) -> None:
        # Flatten the (hypothesis, premise) pairs into a single list of sentences
        # and keep only the first `limit` of them.
        sentences = [[pair["hypothesis"], pair["premise"]] for pair in data]
        sentences = [sentence for pair in sentences for sentence in pair]
        self.sentences = sentences[:limit]

    def __len__(self) -> int:
        return len(self.sentences)

    def __getitem__(self, i: int) -> str:
        return self.sentences[i]


sentence_ds = Sentences(snli, N)
Downloading and preparing dataset snli/plain_text (download: 90.17 MiB, generated: 65.51 MiB, post-processed: Unknown size, total: 155.68 MiB) to /root/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b...
Dataset snli downloaded and prepared to /root/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b. Subsequent calls will reuse this data.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
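
As a quick sanity check (not part of the original notebook), the dataset should now hold exactly N raw sentence strings:

Code
# Sanity check (assumed output, not from the original run).
print(len(sentence_ds))  # 100000
print(sentence_ds[0])    # the first hypothesis sentence from SNLI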

Let’s define a collate function, which is just the usual HuggingFace tokenizer call with some fixed defaults (truncation to MAX_TEXT_LENGTH, padding to max length, and PyTorch tensors).

Code
tokenizer = AutoTokenizer.from_pretrained(LANGUAGE_MODEL)

class CollateFn:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, x):
        # x is a list of raw strings (one batch): tokenize, truncate and pad
        # to MAX_TEXT_LENGTH, and return PyTorch tensors.
        return self.tokenizer(
            x,
            max_length=MAX_TEXT_LENGTH,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )

collate_fn = CollateFn(tokenizer)
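
To see what the collate function returns, here is a quick illustration (the two sentences are made up, not from SNLI; the shapes follow from the defaults above):

Code
# Illustrative only: made-up sentences, not part of the original experiment.
example = collate_fn(["A person is outside.", "A dog runs through a field."])
print(example.keys())              # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(example["input_ids"].shape)  # torch.Size([2, 256]) due to padding="max_length"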

As the following two experiments show, passing collate_fn to the DataLoader is roughly twice as slow as calling it on each batch afterwards. It would be great to hear your opinions as to why. My only guess is that the DataLoader's multiprocessing is clashing with the tokenizer's own multiprocessing. However, setting num_workers to 1 in the second cell below did nothing to help.

%%time
sentence_dl = DataLoader(
    sentence_ds,
    BATCH_SIZE,
    num_workers=NUM_WORKERS,
    shuffle=False,
    drop_last=False,
    pin_memory=True,
)

for batch in tqdm(sentence_dl):
    x = collate_fn(batch)
CPU times: user 15.5 s, sys: 743 ms, total: 16.3 s
Wall time: 13.8 s
%%time
sentence_dl = DataLoader(
    sentence_ds,
    BATCH_SIZE,
    num_workers=NUM_WORKERS,
    shuffle=False,
    drop_last=False,
    pin_memory=True,
    collate_fn=collate_fn,
)

for batch in tqdm(sentence_dl):
    continue
CPU times: user 13.4 s, sys: 1.66 s, total: 15.1 s
Wall time: 28.1 s
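
One way to probe the multiprocessing-clash guess (not tried in this post) would be to disable the fast tokenizer's own parallelism before the DataLoader spawns its workers, using the TOKENIZERS_PARALLELISM environment variable that the tokenizers library respects. A minimal sketch, assuming everything above has already been defined:

Code
import os

# Assumption: if the slowdown comes from the tokenizer's internal parallelism
# fighting with the DataLoader worker processes, disabling it should narrow the gap.
# Ideally set this before the tokenizer is first used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

sentence_dl = DataLoader(
    sentence_ds,
    BATCH_SIZE,
    num_workers=NUM_WORKERS,
    shuffle=False,
    drop_last=False,
    pin_memory=True,
    collate_fn=collate_fn,
)

for batch in tqdm(sentence_dl):
    continue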