Monday, June 23, 2025
News PouroverAI
Visit PourOver.AI
No Result
View All Result
  • Home
  • AI Tech
  • Business
  • Blockchain
  • Data Science & ML
  • Cloud & Programming
  • Automation
  • Front-Tech
  • Marketing
  • Home
  • AI Tech
  • Business
  • Blockchain
  • Data Science & ML
  • Cloud & Programming
  • Automation
  • Front-Tech
  • Marketing
News PouroverAI
No Result
View All Result

The Long and Short of It: Proportion-Based Relevance to Capture Document Semantics End-to-End | by Anthony Alcaraz | Nov, 2023

November 25, 2023
in AI Technology
Reading Time: 3 mins read
0 0
A A
0
Share on FacebookShare on Twitter


Towards Data Science

Dominant search methods today typically rely on keywords matching or vector space similarity to estimate relevance between a query and documents. However, these techniques struggle when it comes to searching corpora using entire files, papers or even books as search queries.

Some fun with Dall-E 3

Keyword-based Retrieval

While keywords searches excel for short look up, they fail to capture semantics critical for long-form content. A document correctly discussing “cloud platforms” may be completely missed by a query seeking expertise in “AWS”. Exact term matches face vocabulary mismatch issues frequently in lengthy texts.

Vector Similarity Search

Modern vector embedding models like BERT condensed meaning into hundreds of numerical dimensions accurately estimating semantic similarity. However, transformer architectures with self-attention don’t scale beyond 512–1024 tokens due to exploding computation.

Without the capacity to fully ingest documents, the resulting “bag-of-words” partial embeddings lose the nuances of meaning interspersed across sections. The context gets lost in abstraction.

The prohibitive compute complexity also restricts fine-tuning on most real-world corpora limiting accuracy. Unsupervised learning provides one alternative but solid techniques are lacking.

In a recent paper, researchers address exactly these pitfalls by re-imagining relevance for ultra-long queries and documents. Their innovations unlock new potential for AI document search.

Dominant search paradigms today are ineffective for queries that run into thousands of words as input text. Key issues faced include:

Transformers like BERT have quadratic self-attention complexity, making them infeasible for sequences beyond 512–1024 tokens. Their sparse attention alternatives compromise on accuracy.Lexical models matching based on exact term overlaps cannot infer semantic similarity critical for long-form text.Lack of labelled training data for most domain collections necessitates…



Source link

Tags: AlcarazAnthonyCaptureDocumentendtoendLongNovProportionBasedRelevanceSemanticsshort
Previous Post

Bitcoin ETF Optimism Grows Amid Regulatory Developments and Market Resilience By Investing.com

Next Post

😱⚠️FIX THIS BEFORE LEARNING DATA SCIENCE/ML

Related Posts

How insurance companies can use synthetic data to fight bias
AI Technology

How insurance companies can use synthetic data to fight bias

June 10, 2024
From Low-Level to High-Level Tasks: Scaling Fine-Tuning with the ANDROIDCONTROL Dataset
AI Technology

From Low-Level to High-Level Tasks: Scaling Fine-Tuning with the ANDROIDCONTROL Dataset

June 10, 2024
Decoding Decoder-Only Transformers: Insights from Google DeepMind’s Paper
AI Technology

Decoding Decoder-Only Transformers: Insights from Google DeepMind’s Paper

June 9, 2024
How Game Theory Can Make AI More Reliable
AI Technology

How Game Theory Can Make AI More Reliable

June 9, 2024
Buffer of Thoughts (BoT): A Novel Thought-Augmented Reasoning AI Approach for Enhancing Accuracy, Efficiency, and Robustness of LLMs
AI Technology

Buffer of Thoughts (BoT): A Novel Thought-Augmented Reasoning AI Approach for Enhancing Accuracy, Efficiency, and Robustness of LLMs

June 9, 2024
Deciphering Doubt: Navigating Uncertainty in LLM Responses
AI Technology

Deciphering Doubt: Navigating Uncertainty in LLM Responses

June 9, 2024
Next Post
😱⚠️FIX THIS BEFORE LEARNING DATA SCIENCE/ML

😱⚠️FIX THIS BEFORE LEARNING DATA SCIENCE/ML

Working as an Automation Engineer

Working as an Automation Engineer

JP Morgan bullish on six Big Biotechs, rates nine others neutral or lower (NASDAQ:INSM)

JP Morgan bullish on six Big Biotechs, rates nine others neutral or lower (NASDAQ:INSM)

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

  • Trending
  • Comments
  • Latest
23 Plagiarism Facts and Statistics to Analyze Latest Trends

23 Plagiarism Facts and Statistics to Analyze Latest Trends

June 4, 2024
How ‘Chain of Thought’ Makes Transformers Smarter

How ‘Chain of Thought’ Makes Transformers Smarter

May 13, 2024
Managing PDFs in Node.js with pdf-lib

Managing PDFs in Node.js with pdf-lib

November 16, 2023
Is C.AI Down? Here Is What To Do Now

Is C.AI Down? Here Is What To Do Now

January 10, 2024
The Importance of Choosing a Reliable Affiliate Network and Why Olavivo is Your Ideal Partner

The Importance of Choosing a Reliable Affiliate Network and Why Olavivo is Your Ideal Partner

October 30, 2023
How To Build A Quiz App With JavaScript for Beginners

How To Build A Quiz App With JavaScript for Beginners

February 22, 2024
Can You Guess What Percentage Of Their Wealth The Rich Keep In Cash?

Can You Guess What Percentage Of Their Wealth The Rich Keep In Cash?

June 10, 2024
AI Compared: Which Assistant Is the Best?

AI Compared: Which Assistant Is the Best?

June 10, 2024
How insurance companies can use synthetic data to fight bias

How insurance companies can use synthetic data to fight bias

June 10, 2024
5 SLA metrics you should be monitoring

5 SLA metrics you should be monitoring

June 10, 2024
From Low-Level to High-Level Tasks: Scaling Fine-Tuning with the ANDROIDCONTROL Dataset

From Low-Level to High-Level Tasks: Scaling Fine-Tuning with the ANDROIDCONTROL Dataset

June 10, 2024
UGRO Capital: Targeting to hit milestone of Rs 20,000 cr loan book in 8-10 quarters: Shachindra Nath

UGRO Capital: Targeting to hit milestone of Rs 20,000 cr loan book in 8-10 quarters: Shachindra Nath

June 10, 2024
Facebook Twitter LinkedIn Pinterest RSS
News PouroverAI

The latest news and updates about the AI Technology and Latest Tech Updates around the world... PouroverAI keeps you in the loop.

CATEGORIES

  • AI Technology
  • Automation
  • Blockchain
  • Business
  • Cloud & Programming
  • Data Science & ML
  • Digital Marketing
  • Front-Tech
  • Uncategorized

SITEMAP

  • Disclaimer
  • Privacy Policy
  • DMCA
  • Cookie Privacy Policy
  • Terms and Conditions
  • Contact us

Copyright © 2023 PouroverAI News.
PouroverAI News

No Result
View All Result
  • Home
  • AI Tech
  • Business
  • Blockchain
  • Data Science & ML
  • Cloud & Programming
  • Automation
  • Front-Tech
  • Marketing

Copyright © 2023 PouroverAI News.
PouroverAI News

Welcome Back!

Login to your account below

Forgotten Password? Sign Up

Create New Account!

Fill the forms bellow to register

All fields are required. Log In

Retrieve your password

Please enter your username or email address to reset your password.

Log In