To train a language model, we need to break language into pieces (so-called tokens) and feed them to the model incrementally. Tokenization can be performed on multiple levels.
Character-level: Text is treated as a sequence of individual characters (including whitespace). This granular approach allows every possible word to be formed from a sequence of characters, but it makes it harder for the model to capture semantic relationships between words.
Word-level: Text is represented as a sequence of words. The downside is that the model’s vocabulary is limited to the words that occur in the training data. Both granularities are illustrated in the short sketch below.
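As a minimal sketch of the difference (the example string here is made up and is not taken from the actual training corpus):

# Toy comparison of character-level vs. word-level tokenization.
text = "Alice: Hi, how are you guys?"

# Character-level: every character, including spaces, becomes a token.
char_tokens = list(text)
# ['A', 'l', 'i', 'c', 'e', ':', ' ', 'H', 'i', ',', ...]

# Word-level: a naive whitespace split; the article uses NLTK's
# RegexpTokenizer below for a more careful word-level split.
word_tokens = text.split()
# ['Alice:', 'Hi,', 'how', 'are', 'you', 'guys?']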
It turned out that my training data has a vocabulary of ~70,000 unique words. However, since many words appear only once or twice, I decided to replace such rare words with an "<UNK>" special token. This reduced the vocabulary to ~25,000 words, which in turn means a smaller model to train later.
from typing import List

from nltk.tokenize import RegexpTokenizer


def custom_tokenizer(txt: str, spec_tokens: List[str], pattern: str = "|\\d|\\w+|[^\\s]") -> List[str]:
    """
    Tokenize text into words or characters using NLTK's RegexpTokenizer,
    considering given special combinations as single tokens.

    :param txt: The corpus as a single string element.
    :param spec_tokens: A list of special tokens (e.g. ending, out-of-vocab).
    :param pattern: By default the corpus is tokenized on a word level (split by spaces).
                    Numbers are considered single tokens.
    >> note: The pattern for character level tokenization is '|.'
    """
    # Special tokens are prepended to the pattern so they are matched as single tokens.
    pattern = "|".join(spec_tokens) + pattern
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(txt)
    return tokens
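As a usage sketch (the corpus variable and the exact list of special tokens are assumptions, not taken from the article), the tokenizer would be applied to the raw text roughly like this:

# Hypothetical call: `corpus` is assumed to hold the raw training text as one string,
# and the special tokens are assumed to include an end-of-message and an out-of-vocab marker.
spec_tokens = ["<END>", "<UNK>"]
tokens = custom_tokenizer(corpus, spec_tokens)

The first elements of the resulting token list then look like this: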
["Alice:", "Hi", "how", "are", "you", "guys", "?", "<END>", "Tom:", … ]
from collections import Counter
from typing import List, Set, Union


def get_infrequent_tokens(tokens: Union[List[str], str], min_count: int) -> Set[str]:
    """
    Identify tokens that appear less than a minimum count.

    :param tokens: When it is the raw text in a string, frequencies are counted on character level.
                   When it is the tokenized corpus as list, frequencies are counted on token level.
    :param min_count: Threshold of occurrence to flag a token.
    :return: Set of tokens that appear infrequently.
    """
    counts = Counter(tokens)
    infreq_tokens = {k for k, v in counts.items() if v <= min_count}
    return infreq_tokens
def mask_tokens(tokens: List[str], mask: Set[str]) -> List[str]:
    """
    Iterate through all tokens. Any token that is part of the set is replaced by the unknown token.

    :param tokens: The tokenized corpus.
    :param mask: Set of tokens that shall be masked in the corpus.
    :return: List of tokenized corpus after the masking operation.
    """
    # `unknown_token` (e.g. "<UNK>") is assumed to be defined elsewhere in the script.
    return [unknown_token if t in mask else t for t in tokens]
infreq_tokens = get_infrequent_tokens(tokens, min_count=2)
tokens = mask_tokens(tokens, infreq_tokens)
["Alice:", "Hi", "how", "are", "you", "<UNK>", "?", "<END>", "Tom:", … ]
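As a quick sanity check on the vocabulary reduction described above (the ~25,000 figure comes from the article; the exact count depends on the corpus, and the variable names follow the snippets above):

# Count the unique tokens that remain after masking rare words.
vocab = sorted(set(tokens))
print(f"Vocabulary size after masking: {len(vocab):,}")  # roughly 25,000 in the article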