Named Entity Recognition (NER) is a core task in natural language processing, with applications in medical coding, financial analysis, and legal document parsing. Custom NER models are typically built on transformer encoders pre-trained with self-supervised objectives such as masked language modeling (MLM). Large language models (LLMs) like GPT-3 and GPT-4 can perform NER out of the box, but their high inference costs and potential privacy issues make them impractical for many deployments.
The NuMind team proposes an approach that leverages LLMs to reduce the need for human annotations when creating custom models. Instead of using an LLM to annotate a single-domain dataset for one specific NER task, the idea is to have the LLM annotate a diverse, multi-domain dataset covering many NER problems. A smaller foundation model such as BERT is then further pre-trained on this annotated dataset, and the resulting model can be fine-tuned for any downstream NER task.
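To make that last stage concrete, below is a minimal sketch of fine-tuning such a pre-trained encoder for a downstream NER task with Hugging Face transformers. The label set is a hypothetical example, and the checkpoint id reflects NuMind's published NuNER releases but should be treated as an assumption, not NuMind's prescribed recipe.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical downstream label set (e.g., a medical-coding NER task).
labels = ["O", "B-DISEASE", "I-DISEASE"]

# NuNER encoder further pre-trained on the LLM-annotated multi-domain data;
# the exact checkpoint id is an assumption based on NuMind's HF releases.
model_name = "numind/NuNER-v2.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# From here, fine-tune as a standard token-classification model
# (e.g., with transformers' Trainer) on task-specific annotated data.
```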
The team has introduced three NER models as follows:
NuNER Zero: A zero-shot NER model built on the GLiNER (Generalist Model for Named Entity Recognition using Bidirectional Transformer) architecture, which takes as input a concatenation of the entity types and the text. Unlike GLiNER, which is a span classifier, NuNER Zero operates as a token classifier, which lets it detect arbitrarily long entities (a usage sketch follows these descriptions). It was trained on the NuNER v2.0 dataset, which combines subsets of Pile and C4 annotated via LLMs using NuNER's procedure. NuNER Zero emerges as the leading compact zero-shot NER model, with a +3.1% token-level F1-score improvement over GLiNER-large-v2.1 on GLiNER's benchmark.
NuNER Zero 4k: The long-context (4k tokens) version of NuNER Zero. It generally performs slightly worse than NuNER Zero but can outperform it in scenarios where context size matters.
NuNER Zero-span: The span-prediction version of NuNER Zero. It shows slightly better performance than NuNER Zero but cannot detect entities longer than 12 tokens.
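All three models load through the open-source gliner Python package. The following is a minimal inference sketch assuming that package's published API; the text and entity types are illustrative, and the model card advises lower-cased labels.

```python
from gliner import GLiNER  # pip install gliner

# Load NuNER Zero from the Hugging Face Hub.
model = GLiNER.from_pretrained("numind/NuNER_Zero")

# Entity types are supplied at inference time, lower-cased per the model card.
labels = ["person", "organization", "location"]

text = "Tim Cook announced Apple's new campus in Austin, Texas."

# The library concatenates the entity types with the text before encoding.
entities = model.predict_entities(text, labels)
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

NuNER Zero 4k and NuNER Zero-span load the same way via their respective model ids.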
The key features of the three models are:
NuNER Zero: Derived from NuNER and the default choice for inputs of moderate length; a sketch of merging its token-level predictions into longer entities follows this list.
NuNER Zero 4k: The variant of NuNER Zero that excels when context size plays a significant role.
NuNER Zero-span: The span-prediction version of NuNER Zero, not suitable for entities longer than 12 tokens.
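Because NuNER Zero is a token classifier, the library can return one long entity as several adjacent same-label pieces, and merging them recovers the full span. The helper below is a sketch in the spirit of the snippet on the model card, assuming GLiNER-style prediction dicts with start, end, label, and text keys over character offsets.

```python
def merge_entities(text, entities):
    """Merge adjacent predictions that share a label into a single entity,
    so arbitrarily long entities come back as one span."""
    if not entities:
        return []
    merged = []
    current = dict(entities[0])
    for nxt in entities[1:]:
        # Same label and adjacent (or separated by one space): extend the span.
        if nxt["label"] == current["label"] and nxt["start"] <= current["end"] + 1:
            current["end"] = nxt["end"]
            current["text"] = text[current["start"]:current["end"]].strip()
        else:
            merged.append(current)
            current = dict(nxt)
    merged.append(current)
    return merged

# Usage: entities = merge_entities(text, model.predict_entities(text, labels))
```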
In conclusion, NER plays a critical role in natural language processing, and custom models are usually built on transformer encoders pre-trained with MLM. The advent of LLMs like GPT-3 and GPT-4, however, brings high inference costs. The NuMind team addresses this with a method that uses LLMs to reduce human annotation by annotating a multi-domain dataset. They introduce three NER models: NuNER Zero, a compact zero-shot model; NuNER Zero 4k, which emphasizes longer context; and NuNER Zero-span, which focuses on span prediction with slightly better accuracy but is limited to entities of at most 12 tokens.
Sources
https://huggingface.co/numind/NuNER_Zero-4k
https://huggingface.co/numind/NuNER_Zero
https://huggingface.co/numind/NuNER_Zero-span
https://arxiv.org/pdf/2402.15343
https://www.linkedin.com/posts/tomaarsen_numind-yc-s22-has-just-released-3-new-state-of-the-art-activity-7195863382783049729-kqko/
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.