Welcome to the NLP Highlights podcast, where we invite researchers to talk about their work in various areas of natural language processing. The hosts are members of the AllenNLP team at the Allen Institute for AI. All views expressed belong to the hosts and guests and do not represent their employers.
This podcast episode features Dr. Mohamed Elhoseiny, a true luminary in the realm of computer vision with over a decade of groundbreaking research. An Assistant Professor at KAUST, Dr. Elhoseiny works at the intersections of Computer Vision, Language & Vision, and Computational Creativity in Art, Fashion, and AI. Notably, he co-organized the 1st and 2nd Workshops on Closing the Loop between Vision and Language, demonstrating his commitment to advancing interdisciplinary research. With an educational background that includes Stanford University's Graduate School of Business Ignite Program and MS/PhD research at Rutgers University, coupled with influential stints at Stanford, Baidu Research, Facebook AI Research, Adobe Research, and SRI International, Dr. Elhoseiny brings a wealth of experience to our discussion.
Our first guest with this new format is Kyle Lo, the most senior lead scientist on the Semantic Scholar team at the Allen Institute for AI (AI2), who kindly agreed to share his perspective on the Science of Science (SciSci) on our podcast. SciSci is concerned with studying how people do science, and includes developing methods and tools to help people consume AND produce science. Kyle has made several critical contributions in this field that enabled a lot of SciSci work over the past 5+ years, ranging from novel NLP methods (e.g., SciBERT https://lnkd.in/gTP_tYiF), to open data collections (e.g., S2ORC https://lnkd.in/g4J6tXCG), to toolkits for manipulating scientific documents (e.g., PaperMage https://lnkd.in/gwU7k6mJ, which just received a Best Paper Award).
In this special episode of NLP Highlights, we discussed building and open sourcing language models. What is the usual recipe for building large language models? What does it mean to open source them? What new research questions can we answer by open sourcing them? We particularly focused on the ongoing Open Language Model (OLMo) project at AI2, and invited Iz Beltagy and Dirk Groeneveld, the research and engineering leads of the OLMo project to chat. Blog post announcing OLMo: https://blog.allenai.org/announcing-ai2-olmo-an-open-language-model-made-by-scientists-for-scientists-ab761e4e9b76 Organizations interested in partnership can express their interest here: https://share.hsforms.com/1blFWEWJ2SsysSXFUEJsxuA3ioxm You can find Iz at twitter.com/i_beltagy and Dirk at twitter.com/mechanicaldirk
In this special episode, we chatted with Chris Callison-Burch about his testimony in the recent U.S. Congress Hearing on the Interoperability of AI and Copyright Law. We started by asking Chris about the purpose and the structure of this hearing. Then we talked about the ongoing discussion on how the copyright law is applicable to content generated by AI systems, the potential risks generative AI poses to artists, and Chris' take on all of this. We ended the episode with a recording of Chris' opening statement at the hearing.
How can we generate coherent long stories from language models? Ensuring that the generated story has long-range consistency and that it conforms to a high-level plan is typically challenging. In this episode, Kevin Yang describes their system, which prompts language models to first generate an outline and then iteratively generate the story following that outline, reranking and editing the outputs for coherence along the way. We also discussed the challenges involved in evaluating long generated texts. Kevin Yang is a PhD student at UC Berkeley. Kevin's webpage: https://people.eecs.berkeley.edu/~yangk/ Papers discussed in this episode: 1. Re3: Generating Longer Stories With Recursive Reprompting and Revision (https://www.semanticscholar.org/paper/Re3%3A-Generating-Longer-Stories-With-Recursive-and-Yang-Peng/2aab6ca1a8dae3f3db6d248231ac3fa4e222b30a) 2. DOC: Improving Long Story Coherence With Detailed Outline Control (https://www.semanticscholar.org/paper/DOC%3A-Improving-Long-Story-Coherence-With-Detailed-Yang-Klein/ef6c768f23f86c4aa59f7e859ca6ffc1392966ca)
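To make the outline-then-generate loop concrete, here is a schematic sketch. The functions lm_generate and coherence_score are hypothetical stand-ins for prompting a language model and for the learned rerankers; this is not the actual Re3 or DOC interface.

```python
# Schematic sketch of outline-guided story generation with reranking.
def lm_generate(prompt, n_samples=3):
    # Stand-in: a real system would sample continuations from a language model.
    return [f"[continuation {i} for: {prompt[:30]}...]" for i in range(n_samples)]

def coherence_score(story, candidate, outline_item):
    # Stand-in: a real reranker scores consistency with the story and the plan.
    return len(candidate)

def write_story(premise, outline):
    story = []
    for item in outline:
        prompt = f"Outline point: {item}\nStory so far: {' '.join(story)}\nContinue:"
        candidates = lm_generate(prompt)                       # draft passages
        best = max(candidates, key=lambda c: coherence_score(story, c, item))
        story.append(best)                                     # rerank, then keep
    return " ".join(story)

print(write_story("a lighthouse keeper", ["setup", "conflict", "resolution"]))
```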
Compositional generalization refers to the capability of models to generalize to out-of-distribution instances by composing information obtained from the training data. In this episode we chatted with Najoung Kim, on how to explicitly evaluate specific kinds of compositional generalization in neural network models of language. Najoung described COGS, a dataset she built for this, some recent results in the space, and why we should be careful about interpreting the results given the current practice of pretraining models on lots of unlabeled text. Najoung's webpage: https://najoungkim.github.io/ Papers we discussed: 1. COGS: A Compositional Generalization Challenge Based on Semantic Interpretation (Kim et al., 2020): https://www.semanticscholar.org/paper/b20ddcbd239f3fa9acc603736ac2e4416302d074 2. Compositional Generalization Requires Compositional Parsers (Weissenhorn et al., 2022): https://www.semanticscholar.org/paper/557ebd17b7c7ac4e09bd167d7b8909b8d74d1153 3. Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models (Kim et al., 2022): https://www.semanticscholar.org/paper/8969ea3d254e149aebcfd1ffc8f46910d7cb160e Note that we referred to the final paper by an earlier name in the discussion.
We invited Urvashi Khandelwal, a research scientist at Google Brain, to talk about nearest neighbor language and machine translation models. These models interpolate parametric (conditional) language models with non-parametric distributions over the closest values in some data stores built from relevant data. Not only are these models shown to outperform the usual parametric language models, they also have important implications for memorization and generalization in language models. Urvashi's webpage: https://urvashik.github.io Papers discussed: 1) Generalization through Memorization: Nearest Neighbor Language Models (https://www.semanticscholar.org/paper/7be8c119dbe065c52125ee7716601751f3116844) 2) Nearest Neighbor Machine Translation (https://www.semanticscholar.org/paper/20d51f8e449b59c7e140f7a7eec9ab4d4d6f80ea)
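As a rough illustration of the interpolation these models perform, here is a minimal kNN-LM-style sketch over a toy datastore; a real system would use the hidden states of a trained LM as keys and an approximate nearest-neighbor index such as FAISS.

```python
# Minimal sketch: interpolate a parametric LM distribution with a kNN distribution.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Datastore: (context vector, next-token id) pairs collected from training data.
keys = np.random.randn(1000, 4)                 # toy 4-dim context vectors
values = np.random.randint(0, 50, size=1000)    # toy vocabulary of 50 tokens

def knn_distribution(query, k=8, vocab_size=50):
    # Retrieve the k nearest stored contexts and turn their distances
    # into a distribution over the tokens that followed them.
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = softmax(-dists[nearest])
    p = np.zeros(vocab_size)
    for idx, w in zip(nearest, weights):
        p[values[idx]] += w
    return p

def interpolate(p_lm, query, lam=0.25):
    # Final distribution: lambda * kNN + (1 - lambda) * parametric LM.
    return lam * knn_distribution(query) + (1 - lam) * p_lm

p_lm = softmax(np.random.randn(50))             # stand-in for the LM's output
p_final = interpolate(p_lm, np.random.randn(4))
assert abs(p_final.sum() - 1.0) < 1e-6
```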
In this episode, we talk with Kayo Yin, an incoming PhD student at UC Berkeley, and Malihe Alikhani, an assistant professor at the University of Pittsburgh, about opportunities for the NLP community to contribute to Sign Language Processing (SLP). We talked about the history of and misconceptions about sign languages, high-level similarities and differences between spoken and sign languages, distinct linguistic features of signed languages, representations, computational resources, SLP tasks, and suggestions for better design and implementation of SLP models.
This episode is the third in our current series on PhD applications. We talk about what the PhD application process looks like after applications are submitted. We start with a general overview of the timeline, then talk about how to approach interviews and conversations with faculty, and finish by discussing the different factors to consider in deciding between programs. The guests for this episode are Rada Mihalcea (Professor at the University of Michigan), Aishwarya Kamath (PhD student at NYU), and Sanjay Subramanian (PhD student at UC Berkeley). Homepages: - Aishwarya Kamath: https://ashkamath.github.io/ - Sanjay Subramanian: https://sanjayss34.github.io/ - Rada Mihalcea: https://web.eecs.umich.edu/~mihalcea/ The hosts for this episode are Alexis Ross and Nishant Subramani.
This episode is the second in our current series on PhD applications. How do PhD programs in Europe differ from PhD programs in the US, and how should people decide between them? In this episode, we invite Barbara Plank (Professor at ITU, IT University of Copenhagen) and Gonçalo Correia (ELLIS PhD student at University of Lisbon and University of Amsterdam) to share their perspectives on this question. We start by talking about the main differences between pursuing a PhD in Europe and the US. We then talk about the application requirements for European PhD programs and factors to consider when deciding whether to apply in Europe or the US. We conclude by talking about the ELLIS PhD program, a relatively new program for PhD students that facilitates collaborations across Europe. ELLIS PhD program: https://ellis.eu/phd-postdoc (Application Deadline: November 15, 2021) Homepages: - Barbara Plank: https://bplank.github.io/ - Gonçalo Correia: https://goncalomcorreia.github.io/
This episode is the first in our current series on PhD applications. How should people prepare their applications to PhD programs in NLP? In this episode, we invite Nathan Schneider (Professor of Linguistics and Computer Science at Georgetown University) and Roma Patel (PhD student in Computer Science at Brown University) to share their perspectives on preparing application materials. We start by talking about what factors should go into the decision to apply for PhD programs and how to gain relevant experience. We then talk about the most important parts of an application, focusing particularly on how to write a strong statement of purpose and choose recommendation letter writers. Blog posts mentioned in this episode: - Nathan Schneider's Advice on Statements of Purpose: https://nschneid.medium.com/inside-ph-d-admissions-what-readers-look-for-in-a-statement-of-purpose-3db4e6081f80 - Student Perspectives on Applying to NLP PhD Programs: https://blog.nelsonliu.me/2019/10/24/student-perspectives-on-applying-to-nlp-phd-programs/ Homepages: - Nathan Schneider: https://people.cs.georgetown.edu/nschneid/ - Roma Patel: http://cs.brown.edu/people/rpatel59/ The hosts for this episode are Alexis Ross and Nishant Subramani.
In this episode, we discussed the Alexa Prize Socialbot Grand Challenge and this year's winning submission, Alquist 4.0, with Petr Marek, a member of the winning team. Petr gave us an overview of their submission and the design choices that led to them winning the competition, including combining a hardcoded dialog tree with a neural generator model and extracting implicit personal information about users from their responses, as well as some outstanding challenges. Petr Marek is a PhD student at the Czech Technical University in Prague. More about the Alexa Prize challenges: https://developer.amazon.com/alexaprize Technical report on Alquist 4.0: https://arxiv.org/abs/2109.07968
What can NLP researchers learn from Human Computer Interaction (HCI) research? We chatted with Nanna Inie and Leon Derczynski to find out. We discussed HCI's research processes, including methods of inquiry, the data annotation processes used in HCI and how they differ from those in NLP, and the cognitive methods used in HCI for qualitative error analyses. We also briefly talked about the opportunities the field of HCI presents for NLP researchers. This discussion is based on the following paper: https://aclanthology.org/2021.hcinlp-1.16/ Nanna Inie is a postdoctoral researcher and Leon Derczynski is an associate professor in CS at the IT University of Copenhagen. The hosts for this episode are Ana Marasović and Pradeep Dasigi.
In this episode, we talk with Lisa Beinborn, an assistant professor at Vrije Universiteit Amsterdam, about how to use human cognitive signals to improve and analyze NLP models. We start by discussing different kinds of cognitive signals (eye-tracking, EEG, MEG, and fMRI) and challenges associated with using them. We then turn to Lisa's recent work connecting interpretability measures with eye-tracking data, which reflects the relative importance of different tokens in human reading comprehension. We discuss empirical results suggesting that eye-tracking signals correlate strongly with gradient-based saliency measures, but not with attention, in NLP methods. We conclude with a discussion of the implications of these findings, as well as avenues for future work. Papers discussed in this episode: Towards Best Practices for Leveraging Human Language Processing Signals for Natural Language Processing: https://api.semanticscholar.org/CorpusID:219309655 Relative Importance in Sentence Processing: https://api.semanticscholar.org/CorpusID:235358922 Lisa Beinborn's webpage: https://beinborn.eu/ The hosts for this episode are Alexis Ross and Pradeep Dasigi.
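For readers unfamiliar with gradient-based saliency, here is a toy input-times-gradient sketch of the kind of per-token importance score that can then be correlated with eye-tracking measures such as fixation time; the tiny embedding and classifier below are stand-ins for a real NLP model.

```python
# Minimal input-x-gradient saliency over a toy embedding + linear classifier.
import torch
import torch.nn as nn

vocab_size, dim = 100, 16
emb = nn.Embedding(vocab_size, dim)
clf = nn.Linear(dim, 2)

tokens = torch.tensor([3, 17, 42, 8])      # a toy 4-token "sentence"
x = emb(tokens)                             # (4, dim) token embeddings
x.retain_grad()                             # keep gradients for a non-leaf tensor
logits = clf(x.mean(dim=0))                 # mean-pool, then classify
logits[1].backward()                        # gradient of one class score

# One importance score per token: |embedding . gradient|.
saliency = (x * x.grad).sum(dim=1).abs().detach()
print(saliency)
```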
In this episode, we talk to Shunyu Yao about recent insights into how transformers can represent hierarchical structure in language. Bounded-depth hierarchical structure is thought to be a key feature of natural languages, motivating Shunyu and his coauthors to show that transformers can efficiently represent bounded-depth Dyck languages, which can be thought of as a formal model of the structure of natural languages. We went on to discuss some of the intuitive ideas that emerge from the proofs, connections to RNNs, and insights about positional encodings that may have practical implications. More broadly, we also touched on the role of formal languages and other theoretical tools in modern NLP. Papers discussed in this episode: - Self-Attention Networks Can Process Bounded Hierarchical Languages (https://arxiv.org/abs/2105.11115) - Theoretical Limitations of Self-Attention in Neural Sequence Models (https://arxiv.org/abs/1906.06755) - RNNs can generate bounded hierarchical languages with optimal memory (https://arxiv.org/abs/2010.07515) - On the Practical Computational Power of Finite Precision RNNs for Language Recognition (https://arxiv.org/abs/1805.04908) Shunyu Yao's webpage: https://ysymyth.github.io/ The hosts for this episode are William Merrill and Matt Gardner.
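As a concrete picture of the formal languages in question, here is a small recognizer for bounded-depth Dyck strings (balanced brackets of several types with a cap on nesting depth); this is just an illustration of the language class, not code from the paper.

```python
# Recognizer for Dyck-(k, D): k bracket types, nesting depth at most D.
PAIRS = {')': '(', ']': '[', '}': '{'}   # k = 3 bracket types

def is_bounded_dyck(s, max_depth):
    stack = []
    for ch in s:
        if ch in PAIRS.values():          # opening bracket
            stack.append(ch)
            if len(stack) > max_depth:    # depth bound exceeded
                return False
        elif ch in PAIRS:                 # closing bracket must match the top
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False
    return not stack                      # everything must be closed

assert is_bounded_dyck("([]{})", max_depth=2)
assert not is_bounded_dyck("((()))", max_depth=2)   # depth 3 exceeds the bound
assert not is_bounded_dyck("(]", max_depth=2)       # mismatched brackets
```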
We discussed adversarial dataset construction and dynamic benchmarking in this episode with Douwe Kiela, a research scientist at Facebook AI Research who has been working on a dynamic benchmarking platform called Dynabench. Dynamic benchmarking tries to address the issue of many recent datasets getting solved with little progress being made towards solving the corresponding tasks. The idea is to involve models in the data collection loop to encourage humans to provide data points that are hard for those models, thereby continuously collecting harder datasets. We discussed the details of this approach, and some potential caveats. We also discussed dynamic leaderboards, a recent addition to Dynabench that ranks systems based on their utility given specific use cases. Papers discussed in this episode: 1. Dynabench: Rethinking Benchmarking in NLP (https://www.semanticscholar.org/paper/Dynabench%3A-Rethinking-Benchmarking-in-NLP-Kiela-Bartolo/77a096d80eb4dd4ccd103d1660c5a5498f7d026b) 2. Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking (https://www.semanticscholar.org/paper/Dynaboard%3A-An-Evaluation-As-A-Service-Platform-for-Ma-Ethayarajh/d25bb256e5b69f769a429750217b0d9ec1cf4d86) 3. Adversarial NLI: A New Benchmark for Natural Language Understanding (https://www.semanticscholar.org/paper/Adversarial-NLI%3A-A-New-Benchmark-for-Natural-Nie-Williams/9d87300892911275520a4f7a5e5abf4f1c002fec) 4. DynaSent: A Dynamic Benchmark for Sentiment Analysis (https://www.semanticscholar.org/paper/DynaSent%3A-A-Dynamic-Benchmark-for-Sentiment-Potts-Wu/284dfcf7f25ca87b2db235c6cdc848b4143d3923) Douwe Kiela's webpage: https://douwekiela.github.io/ The hosts for this episode are Pradeep Dasigi and Alexis Ross.
We invited members of Masakhane, Tosin Adewumi and Perez Ogayo, to talk about their EMNLP Findings paper that discusses why typical research is limited for low-resourced NLP and how participatory research can help. As a result of participatory research, Masakhane has many, many success stories: first datasets and benchmarks in African languages, first research on human evaluation specifically for MT for low-resource languages, etc. In this episode, we talked about one of them—MasakhaNER—in more detail. The hosts for this episode are Pradeep Dasigi and Ana Marasović. -------------------------- Tosin Adewumi is a PhD student at the Luleå University of Technology in Sweden. His Twitter handle: @tosintwit Perez Ogayo is an undergrad student at the African Leadership University in Rwanda. Her Twitter handle: @a_ogayo Masakhane is a grassroots organization whose mission is to strengthen and spur NLP research in African languages, for Africans, by Africans: https://www.masakhane.io/ Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages (Findings of EMNLP 2020): https://arxiv.org/abs/2010.02353 MasakhaNER: Named Entity Recognition for African languages (AfricaNLP Workshop @ EACL 2021): https://arxiv.org/abs/2103.11811
We invited Lisa Li to talk about her recent work, Prefix-Tuning: Optimizing Continuous Prompts for Generation. Prefix tuning is a lightweight alternative to finetuning: the idea is to tune only a fixed-length, task-specific sequence of continuous vectors (the prefix) while keeping the pretrained transformer parameters frozen. We discussed how prefix tuning compares with finetuning and other efficient alternatives on two tasks in various experimental settings, and in what scenarios prefix tuning is preferable. Lisa is a PhD student at Stanford University. Lisa's webpage: https://xiangli1999.github.io/ The hosts for this episode are Pradeep Dasigi and Ana Marasović.
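Here is a minimal sketch of the prefix-tuning idea using a generic frozen encoder: only the prefix (and a small task head) receive gradient updates. Note that the actual method injects key/value prefixes at every transformer layer, which this simplified version does not do.

```python
# Minimal prefix-tuning sketch: frozen encoder, trainable prefix embeddings.
import torch
import torch.nn as nn

dim, prefix_len = 64, 10
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
for p in encoder.parameters():
    p.requires_grad = False                # pretrained weights stay frozen

prefix = nn.Parameter(torch.randn(1, prefix_len, dim) * 0.02)  # tuned weights
head = nn.Linear(dim, 2)                   # small task head (also trainable)

def forward(token_embeddings):             # (batch, seq, dim)
    batch = token_embeddings.size(0)
    x = torch.cat([prefix.expand(batch, -1, -1), token_embeddings], dim=1)
    return head(encoder(x).mean(dim=1))

opt = torch.optim.Adam([prefix] + list(head.parameters()), lr=1e-3)
logits = forward(torch.randn(8, 20, dim))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (8,)))
loss.backward(); opt.step()                # only prefix and head are updated
```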
How can we build Visual Question Answering systems for real users? For this episode, we chatted with Danna Gurari about her work in building datasets and models towards VQA for people who are blind. We talked about the differences between existing datasets and VizWiz, a dataset built by Gurari et al., and the resulting algorithmic changes. We also discussed the unsolved challenges in this field, and the new tasks they result in. Danna Gurari is an Assistant Professor and Founding Director of the Image and Video Computing group in the School of Information at the University of Texas at Austin (UT-Austin). VizWiz project page: https://vizwiz.org/ The hosts for this episode are Ana Marasović and Pradeep Dasigi.
We invited Jayant Krishnamurthy and Hao Fang, researchers at Microsoft Semantic Machines to discuss their platform for building task-oriented dialog systems, and their recent TACL paper on the topic. The paper introduces a new formalism for task-oriented dialog to effectively handle references and revisions in complex dialog, and a large realistic dataset that uses this formalism. Leaderboard associated with the dataset: https://microsoft.github.io/task_oriented_dialogue_as_dataflow_synthesis/ Jayant's Twitter handle: https://twitter.com/jayantkrish Hao's Twitter handle: https://twitter.com/hfang90
In this episode, Robin Jia talks about how to build robust NLP systems. We discuss the different senses in which a system can be robust, reasons to care about system robustness, and the challenges involved in evaluating robustness of NLP models. We talk about how to build certifiably robust models through interval bound propagation and discrete encoding functions, as well as how to modify data collection procedures through active learning for more robust model development. Robin Jia is currently a visiting researcher at Facebook AI Research, and will be an assistant professor in the Department of Computer Science at the University of Southern California starting Fall 2021.
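As a small illustration of interval bound propagation, the sketch below pushes a perturbation box through one affine layer and a ReLU; certifiably robust training propagates such bounds through the whole network and penalizes worst-case outputs.

```python
# Minimal interval bound propagation (IBP) through affine + ReLU.
import numpy as np

def ibp_affine(lower, upper, W, b):
    # Propagate an axis-aligned box through x -> Wx + b using the
    # center/radius form: the radius is scaled by |W|.
    center = (lower + upper) / 2
    radius = (upper - lower) / 2
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius
    return new_center - new_radius, new_center + new_radius

def ibp_relu(lower, upper):
    # ReLU is monotone, so bounds pass through elementwise.
    return np.maximum(lower, 0), np.maximum(upper, 0)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
x = rng.normal(size=4)
eps = 0.1                                   # perturbation budget

l, u = ibp_affine(x - eps, x + eps, W, b)
l, u = ibp_relu(l, u)

# Sanity check: any concrete perturbed input stays inside the bounds.
x_pert = x + rng.uniform(-eps, eps, size=4)
y = np.maximum(W @ x_pert + b, 0)
assert np.all(l <= y + 1e-9) and np.all(y <= u + 1e-9)
```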
We invited Nils Holzenberger, a PhD student at JHU, to talk about a dataset on statutory reasoning in tax law that Holzenberger et al. released recently. This dataset includes difficult textual entailment and question answering problems that involve reasoning about how sections of tax law apply to specific cases. They also released a Prolog solver that fully solves the problems, and showed that learned models using dense representations of text perform poorly. We discussed why this is the case, and how one can train models to solve these challenges. Project webpage: https://nlp.jhu.edu/law/
We invited Alona Fyshe to talk about the link between NLP and the human brain. We began by talking about what we currently know about the connection between representations used in NLP and representations recorded in the brain. We also discussed how different brain imaging techniques compare to each other. We then dove into experiments investigating how hidden states of LSTM language models correlate with EEG brain imaging data on three types of language inputs: well-formed grammatical sentences, pseudo-word sentences preserving syntax but not semantics, and word-lists preserving neither. We talk about the kinds of conclusions that can be drawn from these correlations and conclude by discussing avenues for future work.
We invited Asli Celikyilmaz for this episode to talk about evaluation of text generation systems. We discussed the challenges in evaluating generated text, and covered human and automated metrics, with a discussion of recent developments in learning metrics. We also talked about some open research questions, including the difficulties in evaluating factual correctness of generated text. Asli Celikyilmaz is a Principal Researcher at Microsoft Research. Link to a survey co-authored by Asli on this topic: https://arxiv.org/abs/2006.14799
In this episode, Diyi Yang gives us an overview of using NLP models for social applications, including understanding social relationships, processes, roles, and power. As NLP systems are getting used more and more in the real world, they additionally have increasing social impacts that must be studied. We talk about how to get started in this field, what datasets exist and are commonly used, and potential ethical issues. We additionally cover two of Diyi's recent papers, on neutralizing subjective bias in text, and on modeling persuasiveness in text. Diyi Yang is an assistant professor in the School of Interactive Computing at Georgia Tech.
In this episode, we talked about Coreference Resolution with Marta Recasens, a Research Scientist at Google. We discussed the complexity involved in resolving references in language, the simplified version of the problem that the NLP community has focused on, as reflected in specific datasets, and the complex coreference phenomena that are not yet captured in those datasets. We also briefly talked about how coreference is handled in languages other than English, and how some of the notions we have about modeling coreference phenomena in English do not necessarily transfer to other languages. We ended the discussion by talking about large language models, and to what extent they might be good at handling coreference.
We interviewed Sameer Singh for this episode, and discussed an overview of recent work in interpreting NLP model predictions, particularly instance-level interpretations. We started out by talking about why it is important to interpret model outputs and why it is a hard problem. We then dove into the details of three kinds of interpretation techniques: attribution based methods, interpretation using influence functions, and generating explanations. Towards the end, we spent some time discussing how explanations of model behavior can be evaluated, and some limitations and potential concerns in evaluation methods. Sameer Singh is an Assistant Professor of Computer Science at the University of California, Irvine. Some of the techniques discussed in this episode have been implemented in the AllenNLP Interpret framework (details and demo here: https://allennlp.org/interpret).
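As one concrete example of attribution, here is a simplified perturbation-based sketch in the spirit of LIME (which Sameer co-authored): mask out words, record how a black-box prediction changes, and fit a linear surrogate whose weights act as word importances. The predict_proba function below is a hypothetical stand-in for the model being explained.

```python
# Simplified LIME-style word attribution via random masking + linear surrogate.
import numpy as np

WORDS = ["the", "movie", "was", "genuinely", "great"]

def predict_proba(words):
    # Hypothetical sentiment model: positive score driven by "great".
    return 0.9 if "great" in words else 0.2

rng = np.random.default_rng(0)
masks, scores = [], []
for _ in range(200):
    mask = rng.integers(0, 2, size=len(WORDS))        # 1 = keep the word
    kept = [w for w, m in zip(WORDS, mask) if m]
    masks.append(mask)
    scores.append(predict_proba(kept))

# Least-squares linear surrogate: one weight per word, plus a bias column.
X, y = np.array(masks, dtype=float), np.array(scores)
weights, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
for w, imp in zip(WORDS, weights[:-1]):
    print(f"{w:>10}: {imp:+.3f}")                     # "great" should dominate
```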
We invited Yonatan Bisk to talk about grounded language understanding. We started off by discussing an overview of the topic, its research goals, and the challenges involved. In the latter half of the conversation, we talked about ALFRED (Shridhar et al., 2019), a grounded instruction following benchmark that simulates training a robot butler. The current best models built for this benchmark perform very poorly compared to humans. We discussed why that might be, and what could be done to improve their performance. Yonatan Bisk is currently an assistant professor at the Language Technologies Institute at Carnegie Mellon University. The data and the leaderboard for ALFRED can be accessed here: https://askforalfred.com/.
In this special episode, Carissa Schoenick, a program manager and communications director at AI2, interviewed Matt Gardner about AllenNLP. We chatted about the origins of AllenNLP, the early challenges in building it, and the design decisions behind the library. Given the release of AllenNLP 1.0 this week, we asked Matt what users can expect from the new release, and what improvements the AllenNLP team is working on for future versions.
We invited Marco Tulio Ribeiro, a Senior Researcher at Microsoft, to talk about evaluating NLP models using behavioral testing, a framework borrowed from Software Engineering. Marco describes three kinds of black-box tests that check whether NLP models satisfy certain necessary conditions. Though it breaks the standard IID assumption, this framework presents a way to evaluate whether NLP systems are ready for real-world use. We also discuss what capabilities can be tested using this framework, how one can come up with good tests, and the need for an evolving set of behavioral tests for NLP systems. Marco's homepage: https://homes.cs.washington.edu/~marcotcr/
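Here is a minimal sketch of one test type from this framework, an invariance test: a label-irrelevant perturbation (swapping a person's name) should not change the prediction. The predict function is a hypothetical stand-in for whatever black-box model is under test.

```python
# Minimal behavioral (invariance) test for a black-box NLP model.
def predict(text):
    # Hypothetical sentiment model under test.
    return "positive" if "love" in text.lower() else "negative"

NAMES = ["Maria", "John", "Wei", "Aisha"]
TEMPLATE = "{name} loved the food at this restaurant."

def invariance_test():
    # The prediction must be identical for every name substitution.
    predictions = {predict(TEMPLATE.format(name=n)) for n in NAMES}
    assert len(predictions) == 1, f"prediction changed with the name: {predictions}"

invariance_test()   # a robust model passes regardless of the name used
```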
We invited Fernando Pereira, a VP and Distinguished Engineer at Google, where he leads NLU and ML research, to talk about managing NLP research teams in industry. Topics we discussed include balancing research priorities against product development, collaborating effectively with product teams, dealing with potential mismatches between the research interests of individuals and the company, managing publications, hiring new researchers, and diversity and inclusion.
We invited Steven Cao to talk about his paper on multilingual alignment of contextual word embeddings. We started by discussing how multilingual transformers work in general, and then focused on Steven's work on aligning word representations. The core idea is to start from a list of words automatically aligned from parallel corpora and to ensure the representations of the aligned words are similar to each other while not moving too far away from their original representations. We discussed the experiments on the XNLI dataset in the paper, the analysis, and the decision to do the alignment at the word level, comparing it to other possibilities such as aligning word pieces or higher-level encoded representations in transformers. Paper: https://openreview.net/forum?id=r1xCMyBtPS Steven Cao's webpage: https://stevenxcao.github.io/
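A minimal sketch of the alignment objective described above, using toy embedding tables: aligned source/target representations are pulled together while a regularizer keeps both near their pretrained values. The regularization weight lam is an assumed hyperparameter, not the paper's setting.

```python
# Minimal alignment-with-anchoring objective over toy word embeddings.
import torch

dim, n_pairs = 32, 50
src_init = torch.randn(n_pairs, dim)       # stand-ins for pretrained embeddings
tgt_init = torch.randn(n_pairs, dim)
src = src_init.clone().requires_grad_()
tgt = tgt_init.clone().requires_grad_()

opt = torch.optim.Adam([src, tgt], lr=0.01)
lam = 0.1                                   # anchoring strength (assumed)

for step in range(200):
    opt.zero_grad()
    align = ((src - tgt) ** 2).sum(dim=1).mean()            # aligned pairs agree
    anchor = ((src - src_init) ** 2).sum(dim=1).mean() \
           + ((tgt - tgt_init) ** 2).sum(dim=1).mean()      # stay near originals
    loss = align + lam * anchor
    loss.backward()
    opt.step()
```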
We invited Jon Clark from Google to talk about TyDi QA, a new question answering dataset, for this episode. The dataset contains information-seeking questions in 11 typologically diverse languages, i.e., languages that differ from each other in terms of key structural and functional features. The questions in TyDi QA are information-seeking, like those in Natural Questions, which we discussed in the previous episode. In addition, TyDi QA also has questions collected in multiple languages using independent crowdsourcing pipelines, as opposed to some other multilingual QA datasets like XQuAD and MLQA where English data is translated into other languages. The dataset and the leaderboard can be accessed at https://ai.google.com/research/tydiqa.
In this episode, Tom Kwiatkowski and Michael Collins talk about Natural Questions, a benchmark for question answering research. We discuss how the dataset was collected to reflect naturally-occurring questions, the criteria used for identifying short and long answers, how this dataset differs from other QA datasets, and how easy it might be to game the benchmark with superficial processing of the text. We also contrast the holistic design in Natural Questions to deliberately targeting specific linguistic phenomena of interest when building a QA dataset. Dataset: https://ai.google.com/research/NaturalQuestions Paper: https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00276
How do we know, in a concrete quantitative sense, what a deep learning model knows about language? In this episode, Ellie Pavlick talks about two broad directions to address this question: structural and behavioral analysis of models. In structural analysis, we often train a linear classifier for some linguistic phenomenon we'd like to probe (e.g., syntactic dependencies) while using the (frozen) weights of a model pre-trained on some task (e.g., masked language modeling). What can we conclude from the results of probing experiments? What does probing tell us about the linguistic abstractions encoded in each layer of an end-to-end pre-trained model? How well does it match classical NLP pipelines? How important is it to freeze the pre-trained weights in probing experiments? In contrast, behavioral analysis evaluates a model's ability to distinguish between inputs which respect vs. violate a linguistic phenomenon using acceptability or entailment tasks, e.g., can the model predict which is more likely: "dog bites man" vs. "man bites dog"? We discuss the significance of which format to use for behavioral tasks, and how easy it is for humans to perform such tasks. Ellie Pavlick's homepage: https://cs.brown.edu/people/epavlick/ BERT Rediscovers the Classical NLP Pipeline, by Ian Tenney, Dipanjan Das, and Ellie Pavlick: https://arxiv.org/pdf/1905.05950.pdf Inherent Disagreements in Human Textual Inferences, by Ellie Pavlick and Tom Kwiatkowski: https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00293
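For concreteness, here is a minimal structural-probing sketch: with features from a frozen model already extracted, a linear classifier is trained on top and its accuracy is read as evidence about what the representation encodes. The features and labels below are synthetic stand-ins.

```python
# Minimal linear probe over (synthetic) frozen model features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_tokens, dim = 2000, 64
features = rng.normal(size=(n_tokens, dim))     # stand-in for frozen hidden states
labels = (features[:, 0] > 0).astype(int)       # toy "linguistic" property

train, test = slice(0, 1500), slice(1500, None)
probe = LogisticRegression(max_iter=1000).fit(features[train], labels[train])
# High probe accuracy is taken as evidence that the property is encoded;
# careful probing work also compares against random-feature baselines.
print("probe accuracy:", probe.score(features[test], labels[test]))
```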
In this episode we invite Verena Rieser and Ondřej Dušek on to talk to us about the complexities of generating natural language when you have some kind of structured meaning representation as input. We talk about when you might want to do this, which is often in some kind of dialog system, but also for generating game summaries, and even in some language modeling work. We then talk about why this is hard, which in large part is due to the difficulty of collecting data, and how to evaluate the output of these systems. We then move on to discussing the details of a major challenge that Verena and Ondřej put on, called the End-to-End Natural Language Generation Challenge (E2E NLG). This was a dataset of task-based dialog generation focused on the restaurant domain, with some very innovative data collection techniques. They held a shared task with 16 participating teams in 2017, and the data has been further used since. We talk about the methods that people used for the task, and what we can learn today from what methods have been used on this data. Verena's website: https://sites.google.com/site/verenateresarieser/ Ondřej's website: https://tuetschek.github.io/ The E2E NLG Challenge that we talked about quite a bit: http://www.macs.hw.ac.uk/InteractionLab/E2E/
In this episode, we invite Hao Tan and Mohit Bansal to talk about multi-modal training of transformers, focusing in particular on their EMNLP 2019 paper that introduced LXMERT, a vision+language transformer. We spend the first third of the episode talking about why you might want to have multi-modal representations. We then move to the specifics of LXMERT, including the model structure, the losses that are used to encourage cross-modal representations, and the data that is used. Along the way, we mention latent alignments between images and captions, the granularity of captions, and machine translation even comes up a few times. We conclude with some speculation on the future of multi-modal representations. Hao's website: http://www.cs.unc.edu/~airsplay/ Mohit's website: http://www.cs.unc.edu/~mbansal/ LXMERT paper: https://www.aclweb.org/anthology/D19-1514/
In this episode, we talked to Emily Bender about the ethical considerations in developing NLP models and putting them in production. Emily cited specific examples of ethical issues, and talked about the kinds of potential concerns to keep in mind, both when releasing NLP models that will be used by real people, and also while conducting NLP research. We concluded by discussing a set of open-ended questions about designing tasks, collecting data, and publishing results, that Emily has put together towards addressing these concerns. Emily M. Bender is a Professor in the Department of Linguistics and an Adjunct Professor in the Department of Computer Science and Engineering at the University of Washington. She's active on Twitter at @emilymbender.
In this episode we invite Sudha Rao to talk about question generation. We talk about different settings where you might want to generate questions: for human testing scenarios (rare), for data augmentation (which has been done a bunch for SQuAD-like tasks), for detecting missing information and asking clarification questions, for dialog uses, and others. After giving an overview of the general area, we talk about the specifics of some of Sudha's work, including her ACL 2018 best paper on ranking clarification questions using EVPI (expected value of perfect information). We conclude with a discussion of evaluating question generation, which is a hard problem, and of the exciting open questions in this research area. Sudha's website: https://raosudha.weebly.com/
In this episode we talked with Victor Sanh and Thomas Wolf from HuggingFace about model distillation, and DistilBERT as one example of distillation. The idea behind model distillation is compressing a large model by building a smaller model, with far fewer parameters, that approximates the output distribution of the original model, typically for increased efficiency. We discussed how model distillation was typically done previously, and then focused on the specifics of DistilBERT, including the training objective, empirical results, ablations, etc. We finally discussed what kinds of information you might lose when doing model distillation.
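Here is a minimal sketch of a standard distillation objective of the kind discussed: a temperature-scaled KL term that matches the student to the teacher's softened output distribution, mixed with the usual supervised loss. DistilBERT's exact losses and weights differ (it also uses an MLM and a cosine embedding term); treat this as the generic recipe.

```python
# Minimal knowledge-distillation loss: softened KL + hard-label cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL between temperature-softened distributions; the T*T factor keeps
    # gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 100, requires_grad=True)  # stand-in student outputs
teacher_logits = torch.randn(8, 100)                       # frozen teacher outputs
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```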
We talked to Brendan O'Connor for this episode about processing language in social media. Brendan started off by telling us about his projects that studied the linguistic and geographical patterns of African American English (AAE), and how obtaining data from Twitter made these projects possible. We then talked about how many tools built for standard English perform very poorly on AAE, and why collecting dialect-specific data is important. For the rest of the conversation, we discussed the issues involved in scraping data from social media, including ethical considerations and the biases that the data comes with. Brendan O'Connor is an Assistant Professor at the University of Massachusetts, Amherst. Warning: This episode contains explicit language (one swear word).
What exciting NLP research problems are involved in processing biomedical and clinical data? In this episode, we spoke with Dina Demner-Fushman, who leads NLP and IR research at the Lister Hill National Center for Biomedical Communications, part of the National Library of Medicine. We talked about processing biomedical scientific literature, understanding clinical notes, and answering consumer health questions, and the challenges involved in each of these applications. Dina listed some specific tasks and relevant data sources for NLP researchers interested in such applications, and concluded with some pointers to getting started in this field.
In this episode, Jonathan Frankle describes the lottery ticket hypothesis, a popular explanation of how over-parameterization helps in training neural networks. We discuss pruning methods used to uncover subnetworks (winning tickets) which were initialized in a particularly effective way. We also discuss patterns observed in pruned networks, stability of networks pruned at different time steps and transferring uncovered subnetworks across tasks, among other topics. A recent paper on the topic by Frankle and Carbin, ICLR 2019: https://arxiv.org/abs/1803.03635 Jonathan Frankle's homepage: http://www.jfrankle.com/
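As a sketch of one round of the magnitude pruning used to uncover winning tickets: train, drop the smallest-magnitude weights, rewind the survivors to their original initialization, and retrain. Training itself is elided here with a stand-in.

```python
# Minimal sketch of one round of magnitude pruning with rewind-to-init.
import torch

torch.manual_seed(0)
w_init = torch.randn(256, 256) * 0.05       # saved initialization

# ... (train the weights on the task here) ...
w_trained = w_init + 0.01 * torch.randn_like(w_init)   # stand-in for trained weights

prune_frac = 0.2
threshold = w_trained.abs().flatten().kthvalue(
    int(prune_frac * w_trained.numel())).values
mask = (w_trained.abs() > threshold).float()            # keep largest-magnitude weights

w_rewound = w_init * mask                   # rewind survivors to their init values
print("kept:", mask.mean().item())          # ~80% of weights survive this round
# Retraining from (w_rewound, mask) tests whether the subnetwork is a winning ticket;
# the iterative variant repeats train -> prune -> rewind for several rounds.
```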
For our 100th episode, we invite AI2 CEO Oren Etzioni to talk to us about NLP startups. Oren has founded several successful startups, is himself an investor in startups, and helps with AI2's startup incubator. Some of our discussion topics include: What's the similarity between being a researcher and an entrepreneur? How do you transition from being a researcher to doing a startup? How do you evaluate early-stage startups? What advice would you give to a researcher who's thinking about a startup? What are some typical mistakes that you've seen startups make? Along the way, Oren predicts that we'll see a whole generation of startup companies based on the technology underlying ELMo, BERT, etc.
For this episode, we chatted with Neil Thomas and Roshan Rao about modeling protein sequences and evaluating transfer learning methods for a set of five protein modeling tasks. Learning representations using self-supervised pretraining objectives has shown promising results in transferring to downstream tasks in protein sequence modeling, just like it has in NLP. We started off by discussing the similarities and differences between language and protein sequence data, and how contextual embedding techniques are also applicable to protein sequences. Neil and Roshan then described a set of five benchmark tasks to assess the quality of protein embeddings (TAPE), particularly in terms of how well they capture the structural, functional, and evolutionary aspects of proteins. The results from the experiments they ran with various model architectures indicated that there was not a single best performing model across all tasks, and that there is a lot of room for future work in protein sequence modeling. Neil Thomas and Roshan Rao are PhD students at UC Berkeley. Paper: https://www.biorxiv.org/content/10.1101/676825v1 Blog post: https://bair.berkeley.edu/blog/2019/11/04/proteins/
What function do the different attention heads serve in multi-headed attention models? In this episode, Lena describes how to use attribution methods to assess the importance and contribution of different heads in several tasks, and describes a gating mechanism to prune the number of effective heads when combined with an auxiliary loss. Then, we discuss Lena's work on studying the evolution of representations of individual tokens in transformer models. Lena's homepage: https://lena-voita.github.io/ Blog posts: https://lena-voita.github.io/posts/acl19_heads.html https://lena-voita.github.io/posts/emnlp19_evolution.html Papers: https://arxiv.org/abs/1905.09418 https://arxiv.org/abs/1909.01380
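A simplified sketch of the gating mechanism: each head's output is scaled by a learnable gate, and an auxiliary sparsity penalty pushes gates toward zero. The paper uses a stochastic hard-concrete L0 relaxation; the plain L1 penalty here is a simplification.

```python
# Simplified head gating with an auxiliary sparsity penalty.
import torch
import torch.nn as nn

n_heads, head_dim = 8, 16
gates = nn.Parameter(torch.ones(n_heads))        # one learnable gate per head

def gated_heads(head_outputs):                   # (batch, n_heads, head_dim)
    return head_outputs * gates.view(1, -1, 1)   # scale each head's output

head_outputs = torch.randn(4, n_heads, head_dim)
task_loss = gated_heads(head_outputs).pow(2).mean()   # stand-in for the real loss
sparsity = gates.abs().sum()                          # auxiliary penalty on gates
(task_loss + 0.01 * sparsity).backward()              # gates near zero -> prunable heads
```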
In this episode, we talk to Taylor Berg-Kirkpatrick about optical character recognition (OCR) on historical documents. Taylor starts off by describing some practical issues related to old scanning processes of documents that make performing OCR on them a difficult problem. Then he explains how one can build latent variable models for this data using unsupervised methods, the relative importance of various modeling choices, and summarizes how well the models do. We then take a higher level view of historical OCR as a Machine Learning problem, and discuss how it is different from other ML problems in terms of the tradeoff between learning from data and imposing constraints based on prior knowledge of the underlying process. Finally, Taylor talks about the applications of this research, and how these predictions can be of interest to historians studying the original texts.
In this episode, we chat with Luke Zettlemoyer about Question Answering as a format for crowdsourcing annotations of various semantic phenomena in text. We start by talking about QA-SRL and QAMR, two datasets that use QA pairs to annotate predicate-argument relations at the sentence level. Luke describes how this annotation scheme makes it possible to obtain annotations from non-experts, and discusses the tradeoffs involved in choosing this scheme. Then we talk about the challenges involved in using QA-based annotations for more complex phenomena like coreference. Finally, we briefly discuss the value of crowd-labeled datasets given the recent developments in pretraining large language models. Luke is an associate professor at the University of Washington and a Research Scientist at Facebook AI Research.
In this episode, we invite Yejin Choi to talk about common sense knowledge and reasoning, a growing area in NLP. We start by discussing a working definition of “common sense” and the practical utility of studying it. We then talk about some of the datasets and resources focused on studying different aspects of common sense (e.g., ReCoRD, CommonsenseQA, ATOMIC) and contrast implicit vs. explicit modeling of common sense, and what it means for downstream applications. To conclude, Yejin shares her thoughts on some of the open problems in this area and where it is headed in the future. Yejin Choi's homepage: https://homes.cs.washington.edu/~yejin/ ATOMIC: https://homes.cs.washington.edu/~msap/atomic/ ReCoRD: https://sheng-z.github.io/ReCoRD-explorer/ CommonsenseQA: https://www.tau-nlp.org/commonsenseqa
In this episode, Aaron White tells us about the decompositional semantics initiative (Decomp), an attempt to re-think the prototypical approach to semantic representation and annotation. The basic idea is to decompose complex semantic classes such as 'agent' and 'patient' into simpler semantic properties such as 'causation' and 'volition', while embracing the uncertainty inherent in language by allowing annotators to choose answers such as 'probably' or 'probably not'. In order to scale the collection of labeled data, each property is annotated by asking crowd workers intuitive questions about phrases in a given sentence. Aaron White's homepage: http://aaronstevenwhite.io/ Decomp initiative page: http://decomp.io/