Neural Information Retrieval Talks — Zeta Alpha

A monthly podcast where we discuss recent research and developments in the world of Neural IR (Information Retrieval) and Natural Language Processing with our co-hosts Sergi Castella (content creator at Zeta Alpha) and Andrew Yates (Assistant Professor in

Zeta Alpha

Latest episode: Apr 11, 2023
New episodes: infrequent
Average duration: 1h 3m
Episodes: 11

Latest episodes from Neural Information Retrieval Talks — Zeta Alpha

The Promise of Language Models for Search: Generative Information Retrieval

Apr 11, 2023 · 67:31

In this episode of Neural Search Talks, Andrew Yates (Assistant Prof at the University of Amsterdam), Sergi Castella (Analyst at Zeta Alpha), and Gabriel Bénédict (PhD student at the University of Amsterdam) discuss the prospect of using GPT-like models as a replacement for conventional search engines.

Generative Information Retrieval (Gen IR) SIGIR Workshop, organized by Gabriel Bénédict, Ruqing Zhang, and Donald Metzler: https://coda.io/@sigir/gen-ir
Resources on Gen IR: https://github.com/gabriben/awesome-generative-information-retrieval

References:
Rethinking Search: https://arxiv.org/abs/2105.02274
Survey on Augmented Language Models: https://arxiv.org/abs/2302.07842
Differentiable Search Index: https://arxiv.org/abs/2202.06991
Recommender Systems with Generative Retrieval: https://shashankrajput.github.io/Generative.pdf

Timestamps:
00:00 Introduction, ChatGPT Plugins
02:01 ChatGPT plugins, LangChain
04:37 What is even Information Retrieval?
06:14 Index-centric vs. model-centric Retrieval
12:22 Generative Information Retrieval (Gen IR)
21:34 Gen IR emerging applications
24:19 How Retrieval Augmented LMs incorporate external knowledge
29:19 What is hallucination?
35:04 Factuality and Faithfulness
41:04 Evaluating generation of Language Models
47:44 Do we even need to "measure" performance?
54:07 How would you evaluate Bing's Sydney?
57:22 Will language models take over commercial search?
1:01:44 NLP academic research in the times of GPT-4
1:06:59 Outro

Task-aware Retrieval with Instructions

Jan 27, 2023 · 71:13

Andrew Yates (Assistant Prof at University of Amsterdam) and Sergi Castella (Analyst at Zeta Alpha) discuss the paper "Task-aware Retrieval with Instructions" by Akari Asai et al. This paper proposes to augment a conglomerate of existing retrieval and NLP datasets with natural language instructions (BERRI, Bank of Explicit RetRieval Instructions) and use it to train TART (Multi-task Instructed Retriever).

Generating Training Data with Large Language Models w/ Special Guest Marzieh Fadaee

Dec 13, 2022 · 76:14

Marzieh Fadaee, NLP Research Lead at Zeta Alpha, joins Andrew Yates and Sergi Castella to chat about her work on using large language models like GPT-3 to generate domain-specific training data for retrieval models with little-to-no human input. The two papers discussed are "InPars: Data Augmentation for Information Retrieval using Large Language Models" and "Promptagator: Few-shot Dense Retrieval From 8 Examples".

InPars: https://arxiv.org/abs/2202.05144
Promptagator: https://arxiv.org/abs/2209.11755

Timestamps:
00:00 Introduction
02:00 Background and journey of Marzieh Fadaee
03:10 Challenges of leveraging Large LMs in Information Retrieval
05:20 InPars, motivation and method
14:30 Vanilla vs GBQ prompting
24:40 Evaluation and Benchmark
26:30 Baselines
27:40 Main results and takeaways (Table 1, InPars)
35:40 Ablations: prompting, in-domain vs. MSMARCO input documents
40:40 Promptagator overview and main differences with InPars
48:40 Retriever training and filtering in Promptagator
54:37 Main Results (Table 2, Promptagator)
1:02:30 Ablations on consistency filtering (Figure 2, Promptagator)
1:07:39 Is this the magic black-box pipeline for neural retrieval on any documents?
1:11:14 Limitations of using LMs for synthetic data
1:13:00 Future directions for this line of research

ColBERT + ColBERTv2: late interaction at a reasonable inference cost

Aug 16, 2022 · 57:30

Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella (Analyst at Zeta Alpha) discuss the two influential papers introducing ColBERT (from 2020) and ColBERT v2 (from 2022), which mainly propose a fast late interaction operation that achieves performance close to full cross-encoders at a more manageable computational cost at inference, along with many other optimizations.

Evaluating Extrapolation Performance of Dense Retrieval: How does DR compare to cross encoders when it comes to generalization?

Jul 20, 2022 · 58:30

How much do the training and test sets in TREC or MS MARCO overlap? Can we evaluate on different splits of the data to isolate the extrapolation performance? In this episode of Neural Information Retrieval Talks, Andrew Yates and Sergi Castella i Sapé discuss the paper "Evaluating Extrapolation Performance of Dense Retrieval" by Jingtao Zhan, Xiaohui Xie, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma.

Open Pre-Trained Transformer Language Models (OPT): What does it take to train GPT-3?

Jun 16, 2022 · 47:12

Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella i Sapé discuss the recent "Open Pre-trained Transformer (OPT) Language Models" from Meta AI (formerly Facebook). In this replication work, Meta developed and trained a 175-billion-parameter Transformer very similar to GPT-3 from OpenAI, documenting the process in detail to share their findings with the community. The code, pretrained weights, and logbook are available on their GitHub repository (links below).

Links:
❓ Feedback Form: https://scastella.typeform.com/to/rg7a5GfJ

Few-Shot Conversational Dense Retrieval (ConvDR) w/ special guest Antonios Krasakis

May 11, 2022 · 83:11

We discuss Conversational Search with our usual co-hosts Andrew Yates and Sergi Castella i Sapé, along with special guest Antonios Minas Krasakis, PhD candidate at the University of Amsterdam. We center our discussion around the ConvDR paper, "Few-Shot Conversational Dense Retrieval" by Shi Yu et al., which was the first work to perform Conversational Search without an explicit conversation-to-query rewriting step.

Timestamps:
00:00 Introduction
00:50 Conversational AI and Conversational Search
05:40 What makes Conversational Search challenging
07:00 ConvDR paper introduction
10:10 Passage representations
11:30 Conversation representations: query rewriting
19:12 ConvDR novel proposed method: teacher-student setup with ANCE
22:50 Datasets and benchmarks: CAsT, CANARD
25:32 Teacher-student advantages and knowledge distillation vs. ranking loss functions
28:09 TREC CAsT and OR-QuAC
35:50 Metrics: MRR, NDCG, holes@10
44:16 Main Results on CAsT and OR-QuAC (Table 2)
57:35 Ablations on combinations of loss functions (Table 4)
1:00:10 How fast is ConvDR? (Table 3)
1:02:40 Qualitative analysis on ConvDR embeddings (Figure 4)
1:04:50 How has this work aged? More recent works in similar directions: Contextualized Query Embeddings for Conversational Search.
1:07:02 Is "end-to-end" the silver-bullet for Conversational Search?
1:10:04 Will conversational search become more mainstream?
1:18:44 Latest initiatives for Conversational Search

Transformer Memory as a Differentiable Search Index: memorizing thousands of random doc ids works!?

Mar 23, 2022 · 61:40

Andrew Yates and Sergi Castella discuss the paper titled "Transformer Memory as a Differentiable Search Index" by Yi Tay et al. at Google. This work proposes a new approach to document retrieval in which document ids are memorized by a transformer during training (or "indexing"). For retrieval, a query is fed to the model, which then autoregressively generates relevant doc ids for that query.

Paper: https://arxiv.org/abs/2202.06991

Timestamps:
00:00 Intro: Transformer memory as a Differentiable Search Index (DSI)
01:15 The gist of the paper, motivation
04:20 Related work: Autoregressive Entity Linking
07:38 What is an index? Conventional vs. "differentiable"
10:20 Indexing and Retrieval definitions in the context of the DSI
12:40 Learning representations for documents
17:20 How to represent document ids: atomic, string, semantically relevant
22:00 Zero-shot vs. finetuned settings
24:10 Datasets and baselines
27:08 Finetuned results
36:40 Zero-shot results
43:50 Ablation results
47:15 Where could this model be used?
52:00 Is memory efficiency a fundamental problem of this approach?
55:14 What about semantically relevant doc ids?
60:30 Closing remarks

Contact: castella@zeta-alpha.com

Learning to Retrieve Passages without Supervision: finally unsupervised Neural IR?

Feb 16, 2022 · 59:10

In this third episode of the Neural Information Retrieval Talks podcast, Andrew Yates and Sergi Castella discuss the paper "Learning to Retrieve Passages without Supervision" by Ori Ram et al. Despite the massive advances in Neural Information Retrieval in the past few years, statistical models still outperform neural models when no annotations are available at all. This paper proposes a new self-supervised pretraining task for Dense Information Retrieval that manages to beat BM25 on some benchmarks without using any labels.

Paper: https://arxiv.org/abs/2112.07708

Timestamps:
00:00 Introduction
00:36 "Learning to Retrieve Passages Without Supervision"
02:20 Open Domain Question Answering
05:05 Related work: Families of Retrieval Models
08:30 Contrastive Learning
11:18 Siamese Networks, Bi-Encoders and Dual-Encoders
13:33 Choosing Negative Samples
17:46 Self supervision: how to train IR models without labels
21:31 The modern recipe for SOTA Retrieval Models
23:50 Methodology: a new proposed self supervision task
26:40 Datasets, metrics and baselines
33:50 Results: Zero-Shot performance
43:07 Results: Few-shot performance
47:15 Practically, is not using labels relevant after all?
51:37 How would you "break" the Spider model?
53:23 How long until Neural IR models outperform BM25 out-of-the-box robustly?
54:50 Models as a service: OpenAI's text embeddings API

Contact: castella@zeta-alpha.com

The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes

Jan 21, 2022 · 54:13

We discuss the Information Retrieval publication "The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes" by Nils Reimers and Iryna Gurevych, which explores how Dense Passage Retrieval performance degrades as the index size varies and how it compares to traditional sparse or keyword-based methods.

Timestamps:
00:00 Co-host introduction
00:26 Paper introduction
02:18 Dense vs. Sparse retrieval
05:46 Theoretical analysis of false positives (1)
08:17 What are low- vs. high-dimensional representations?
11:49 Theoretical analysis of false positives (2)
20:10 First results: growing the MS-Marco index
28:35 Adding random strings to the index
39:17 Discussion, takeaways
44:26 Will dense retrieval replace or coexist with sparse methods?
50:50 Sparse, Dense and Attentional Representations for Text Retrieval

Referenced work: Sparse, Dense and Attentional Representations for Text Retrieval by Yi Luan et al. 2020.

Shallow Pooling for Sparse Labels: the shortcomings of MS MARCO

Dec 16, 2021 · 67:17

In this first episode of Neural Information Retrieval Talks, Andrew Yates and Sergi Castella discuss the paper "Shallow Pooling for Sparse Labels" by Negar Arabzadeh, Alexandra Vtyurina, Xinyi Yan and Charles L. A. Clarke from the University of Waterloo, Canada. This paper puts the spotlight on the popular IR benchmark MS MARCO and investigates whether modern neural retrieval models retrieve documents that are even more relevant than the original top relevance annotations. The results have important implications and raise the question of to what degree this benchmark is still an informative north star to follow.

Contact: castella@zeta-alpha.com

Timestamps:
00:00 Introduction
01:52 Overview and motivation of the paper
04:00 Origins of MS MARCO
07:30 Modern approaches to IR: keyword-based, dense retrieval, rerankers and learned sparse representations
13:40 What is "better than perfect" performance on MS MARCO?
17:15 Results and discussion: how often are neural rankers preferred over original annotations on MS MARCO? How should we interpret these results?
26:55 The authors' proposal to "fix" MS MARCO: shallow pooling
32:40 How does TREC Deep Learning compare?
38:30 How do models compare after re-annotating MS MARCO passages?
45:00 Figure 5 audio description
47:00 Discussion on models' performance after re-annotations
51:50 Exciting directions in the space of IR benchmarking
1:06:20 Outro

Related material:
Leo Boytsov paper critique blog post: http://searchivarius.org/blog/ir-leaderboards-never-tell-full-story-they-are-still-useful-and-what-can-be-done-make-them-even
"MS MARCO Chameleons: Challenging the MS MARCO Leaderboard with Extremely Obstinate Queries": https://dl.acm.org/doi/abs/10.1145/3459637.3482011
