To train a language model, we need to break language into pieces (so-called tokens) and feed them to the model incrementally. Tokenization can be performed on multiple levels.
Character-level: Text is treated as a sequence of individual characters (including whitespace). This granular approach allows every possible word to be formed from a sequence of characters, but it makes it harder for the model to capture semantic relationships between words.
Word-level: Text is represented as a sequence of words. The downside is that the model’s vocabulary is limited to the words that occur in the training data. Both granularities are illustrated in the short sketch below.
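As a minimal sketch of the difference (the example string here is made up and is not taken from the actual training corpus):

# Toy comparison of character-level vs. word-level tokenization.
text = "Alice: Hi, how are you guys?"

# Character-level: every character, including spaces, becomes a token.
char_tokens = list(text)
# ['A', 'l', 'i', 'c', 'e', ':', ' ', 'H', 'i', ',', ...]

# Word-level: a naive whitespace split; the article uses NLTK's
# RegexpTokenizer below for a more careful word-level split.
word_tokens = text.split()
# ['Alice:', 'Hi,', 'how', 'are', 'you', 'guys?']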
It turned out that my training data has a vocabulary of ~70,000 unique words. However, since many words appear only once or twice, I decided to replace such rare words with an "<UNK>" special token. This reduced the vocabulary to ~25,000 words, which in turn means a smaller model to train later.
from typing import List

from nltk.tokenize import RegexpTokenizer


def custom_tokenizer(txt: str, spec_tokens: List[str], pattern: str = "|\\d|\\w+|[^\\s]") -> List[str]:
    """
    Tokenize text into words or characters using NLTK's RegexpTokenizer,
    considering given special combinations as single tokens.

    :param txt: The corpus as a single string element.
    :param spec_tokens: A list of special tokens (e.g. ending, out-of-vocab).
    :param pattern: By default the corpus is tokenized on a word level (split by spaces).
                    Numbers are considered single tokens.
    >> note: The pattern for character level tokenization is '|.'
    """
    # Special tokens are prepended to the pattern so they are matched as single tokens.
    pattern = "|".join(spec_tokens) + pattern
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(txt)
    return tokens
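As a usage sketch (the corpus variable and the exact list of special tokens are assumptions, not taken from the article), the tokenizer would be applied to the raw text roughly like this:

# Hypothetical call: `corpus` is assumed to hold the raw training text as one string,
# and the special tokens are assumed to include an end-of-message and an out-of-vocab marker.
spec_tokens = ["<END>", "<UNK>"]
tokens = custom_tokenizer(corpus, spec_tokens)

The first elements of the resulting token list then look like this: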
["Alice:", "Hi", "how", "are", "you", "guys", "?", "<END>", "Tom:", … ]
from collections import Counter
from typing import List, Set, Union


def get_infrequent_tokens(tokens: Union[List[str], str], min_count: int) -> Set[str]:
    """
    Identify tokens that appear less than a minimum count.

    :param tokens: When it is the raw text in a string, frequencies are counted on character level.
                   When it is the tokenized corpus as list, frequencies are counted on token level.
    :param min_count: Threshold of occurrence to flag a token.
    :return: Set of tokens that appear infrequently.
    """
    counts = Counter(tokens)
    infreq_tokens = {k for k, v in counts.items() if v <= min_count}
    return infreq_tokens
def mask_tokens(tokens: List[str], mask: Set[str]) -> List[str]:
    """
    Iterate through all tokens. Any token that is part of the set is replaced by the unknown token.

    :param tokens: The tokenized corpus.
    :param mask: Set of tokens that shall be masked in the corpus.
    :return: List of tokenized corpus after the masking operation.
    """
    # `unknown_token` (e.g. "<UNK>") is assumed to be defined elsewhere in the script.
    return [unknown_token if t in mask else t for t in tokens]
infreq_tokens = get_infrequent_tokens(tokens, min_count=2)
tokens = mask_tokens(tokens, infreq_tokens)
["Alice:", "Hi", "how", "are", "you", "<UNK>", "?", "<END>", "Tom:", … ]
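As a quick sanity check on the vocabulary reduction described above (the ~25,000 figure comes from the article; the exact count depends on the corpus, and the variable names follow the snippets above):

# Count the unique tokens that remain after masking rare words.
vocab = sorted(set(tokens))
print(f"Vocabulary size after masking: {len(vocab):,}")  # roughly 25,000 in the article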