Glossary

  • Batch Size

    In transformers, the batch size is the number of sequences processed in parallel during each training iteration. Large batch sizes can speed up training and make better use of hardware (GPUs/TPUs), but may hurt generalization. Small batch sizes produce noisier gradient estimates, which can act as a regularizer and improve generalization, but they use the hardware less efficiently and make training slower.
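As a minimal sketch (plain Python, no training framework assumed), batching simply groups individual sequences so they can be fed to the model together; `make_batches` is a hypothetical helper, not part of any library:

```python
def make_batches(sequences, batch_size):
    """Group sequences into batches of at most batch_size (the last batch may be smaller)."""
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]

# Five toy token sequences, batched two at a time:
sequences = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
batches = make_batches(sequences, batch_size=2)
# → [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10]]]
```

With a larger `batch_size`, more sequences are processed per iteration, so fewer iterations are needed per epoch.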

  • Block Size

    In transformers, the block size (or context length) is the maximum length of input sequence the model can process at once. If the block size is 8, the model uses up to 8 characters of context to predict the ninth character in the sequence. Larger block sizes allow the model to capture more contextual information, potentially improving performance on tasks requiring long-range dependencies. Smaller block sizes can lead to faster training and lower memory usage, especially when dealing with limited computational resources.
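A toy sketch of the block-size-8 example from the definition, assuming a simple character-level setup (the vocabulary and text here are made up for illustration): each position in the block predicts the next character from all the characters before it.

```python
text = "hello transformers"
block_size = 8

# Toy character vocabulary: map each distinct character to an integer id.
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
data = [stoi[ch] for ch in text]

x = data[:block_size]        # up to 8 characters of context
y = data[1:block_size + 1]   # targets: the same window shifted one step right

for t in range(block_size):
    context = x[:t + 1]      # characters seen so far
    target = y[t]            # the next character the model should predict
```

The final iteration is exactly the case in the text: 8 characters of context predicting the ninth.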

  • Embedding

    A dense, low-dimensional vector representation of data that captures semantic relationships and allows models to better understand and process complex information. Essentially, it is a way to represent data such as words, images, or user interactions as numerical vectors, such that similar items end up closer together in the vector space.
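The "similar items are closer together" idea can be sketched with a toy, hand-written embedding table (real embeddings are learned by the model and have hundreds of dimensions; these 3-dimensional vectors are purely illustrative), compared with cosine similarity:

```python
import math

# Hypothetical 3-dimensional embeddings; in practice these are learned, not hand-set.
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# "king" and "queen" point in nearly the same direction, "apple" does not,
# so the king/queen similarity comes out higher than king/apple.
royal = cosine(embeddings["king"], embeddings["queen"])
fruit = cosine(embeddings["king"], embeddings["apple"])
```

This geometric closeness is what lets models treat semantically related inputs similarly.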