Deep Papers – Details, Episodes, and Analysis
Podcast Details
Technical and general information from the podcast's RSS feed.

Deep Papers
Arize AI
Frequency: 1 episode every 18 days. Total episodes: 53

Deep Papers is a podcast series featuring deep dives on today’s most important AI papers and research. Hosted by Arize AI founders and engineers, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning.
Recent Rankings
Latest positions in the Apple Podcasts and Spotify charts.
Apple Podcasts
🇨🇦 Canada - mathematics: #40 (13/08/2025)
🇬🇧 Great Britain - mathematics: #21 (13/08/2025)
🇩🇪 Germany - mathematics: #18 (13/08/2025)
🇺🇸 United States - mathematics: #14 (13/08/2025)
🇫🇷 France - mathematics: #31 (13/08/2025)
🇨🇦 Canada - mathematics: #37 (12/08/2025)
🇬🇧 Great Britain - mathematics: #30 (12/08/2025)
🇩🇪 Germany - mathematics: #16 (12/08/2025)
🇺🇸 United States - mathematics: #8 (12/08/2025)
🇫🇷 France - mathematics: #30 (12/08/2025)
Spotify
No recent rankings available
Shared Links Across Episodes and Podcasts
Links found in episode descriptions, along with other podcasts that also use them.
- https://arize.com/community/ (56 shares)
- https://arize.com/llm-evaluation/ (53 shares)
- https://twitter.com/arizeai (56 shares)
- https://mobile.twitter.com/ai__pub (2 shares)
- https://www.linkedin.com/company/arizeai/ (56 shares)
RSS Feed Quality and Score
Technical evaluation of the quality and structure of the RSS feed.
Overall score: 48%
Publication History
Monthly breakdown of episode publications over the years.
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Friday, August 16, 2024 • Duration 39:05
This week’s paper presents a comprehensive study of the performance of various LLMs acting as judges. The researchers leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which they find to have a high inter-annotator agreement. The study includes nine judge models and nine exam-taker models – both base and instruction-tuned. They assess the judge models’ alignment across different model sizes, families, and judge prompts to answer questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold.
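For readers who want to see the paradigm in code, here is a minimal, hypothetical sketch (not the paper's implementation): a judge model grades exam-taker answers with a simple prompt, and its verdicts are compared to human labels via Cohen's kappa. The judge prompt, model name, and toy data are all assumptions for illustration; the client call follows the OpenAI-style chat API.

```python
# Minimal sketch of the LLM-as-judge paradigm: a judge model grades an
# exam-taker's answer, and judge verdicts are compared against human
# annotations with Cohen's kappa. Illustrative only; not the paper's code.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set

JUDGE_PROMPT = """You are grading a quiz answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, reference: str, candidate: str) -> int:
    """Return 1 if the judge model marks the answer correct, else 0."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return int(verdict.startswith("CORRECT"))

# Toy evaluation set with human labels (1 = correct, 0 = incorrect).
examples = [
    {"question": "Capital of France?", "reference": "Paris",
     "candidate": "Paris", "human": 1},
    {"question": "Largest planet?", "reference": "Jupiter",
     "candidate": "Saturn", "human": 0},
]

judge_labels = [judge(e["question"], e["reference"], e["candidate"])
                for e in examples]
human_labels = [e["human"] for e in examples]
print("Judge/human agreement (Cohen's kappa):",
      cohen_kappa_score(human_labels, judge_labels))
```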
Read it on the blog: https://arize.com/blog/judging-the-judges-llm-as-a-judge/
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
Breaking Down Meta's Llama 3 Herd of Models
Tuesday, August 6, 2024 • Duration 44:40
Meta just released Llama 3.1 405B. According to them, it’s “the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation.” Will the latest Llama herd ignite new applications and modeling paradigms like synthetic data generation? Will it enable the improvement and training of smaller models, as well as model distillation? Meta thinks so. We’ll take a look at what they did here, talk about open source, and decide if we want to believe the hype.
Read it on the blog: https://arize.com/blog/breaking-down-meta-llama-3/
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
Reinforcement Learning in the Era of LLMs
Friday, March 15, 2024 • Duration 44:49
We’re exploring Reinforcement Learning in the Era of LLMs this week with Claire Longo, Arize’s Head of Customer Success. Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). This week’s paper aims to link research in conventional RL to the RL techniques used in LLM research, and to demystify the technique by discussing why, when, and how RL excels.
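As a concrete anchor for the discussion, here is a minimal PyTorch sketch of RLHF's reward-modeling stage: fitting a scalar reward so that human-preferred responses outscore rejected ones under the Bradley-Terry pairwise loss. The toy model and random embeddings are stand-ins, not anyone's production code.

```python
# Minimal sketch of the reward-modeling step in RLHF: fit a scalar reward
# so that human-preferred responses score higher than rejected ones via
# the Bradley-Terry pairwise loss. Toy model and data; illustrative only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy stand-in for an LLM backbone with a scalar reward head."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)  # one reward per response

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake embeddings for (chosen, rejected) response pairs.
chosen = torch.randn(32, 16)
rejected = torch.randn(32, 16)

for step in range(100):
    # Bradley-Terry: maximize log sigmoid(r_chosen - r_rejected).
    loss = -torch.nn.functional.logsigmoid(
        model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final pairwise loss:", loss.item())
```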
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
Sora: OpenAI’s Text-to-Video Generation Model
Friday, March 1, 2024 • Duration 45:08
This week, we discuss the implications of Text-to-Video Generation and speculate as to the possibilities (and limitations) of this incredible technology with some hot takes. Dat Ngo, ML Solutions Engineer at Arize, is joined by community member and AI Engineer Vibhu Sapra to review OpenAI’s technical report on their Text-To-Video Generation Model: Sora.
According to OpenAI, “Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.” At the time of this recording, the model had not been widely released yet, but was becoming available to red teamers to assess risk, and also to artists to receive feedback on how Sora could be helpful for creatives.
At the end of our discussion, we also explore EvalCrafter: Benchmarking and Evaluating Large Video Generation Models. This recent paper proposes a new framework and pipeline for exhaustively evaluating the performance of generated videos, which we look at in light of Sora.
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
RAG vs Fine-Tuning
Thursday, February 8, 2024 • Duration 39:49
This week, we’re discussing "RAG vs Fine-Tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture." This paper explores a pipeline for fine-tuning and RAG, and presents the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4.
The authors propose a pipeline that consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 for evaluating the results. Overall, the results point to how systems built using LLMs can be adapted to respond and incorporate knowledge across a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.
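To make the pipeline shape concrete, here is a minimal, hypothetical sketch of two of its stages: generating Q&A pairs from extracted text (as fine-tuning data) and grading predictions with GPT-4 as the evaluator. The prompts, model names, and sample text are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the pipeline shape described above: turn extracted
# text into Q&A pairs for fine-tuning, then use GPT-4 as the evaluator.
# Prompts and model names are illustrative assumptions only.
import json
from openai import OpenAI

client = OpenAI()

def generate_qa_pairs(chunk: str, n: int = 3) -> list:
    """Ask a model for Q&A pairs grounded in a document chunk.
    Assumes the model returns bare JSON, which a real pipeline would validate."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
            f"Write {n} question-answer pairs about the text below as a "
            f"JSON list of objects with keys 'question' and 'answer'.\n\n{chunk}"}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

def evaluate_answer(question: str, gold: str, predicted: str) -> str:
    """Grade a prediction with GPT-4, mirroring the paper's evaluation stage."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
            f"Question: {question}\nReference: {gold}\nPrediction: {predicted}\n"
            "Grade the prediction as CORRECT or INCORRECT."}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

chunk = "Cover crops such as clover reduce soil erosion between plantings."
pairs = generate_qa_pairs(chunk)
print(evaluate_answer(pairs[0]["question"], pairs[0]["answer"],
                      "Clover helps prevent soil erosion."))
```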
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
Phi-2 Model
Friday, February 2, 2024 • Duration 44:29
We dive into Phi-2 and some of the major differences and use cases for a small language model (SLM) versus an LLM.
With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.
Find the transcript and live recording: https://arize.com/blog/phi-2-model
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels
Friday, February 2, 2024 • Duration 36:22
We discuss HyDE: a thrilling zero-shot learning technique that combines GPT-3’s language understanding with contrastive text encoders.
HyDE revolutionizes information retrieval and grounding in real-world data by generating hypothetical documents from queries and retrieving similar real-world documents. It outperforms traditional unsupervised retrievers, rivaling fine-tuned retrievers across diverse tasks and languages. This leap in zero-shot learning efficiently retrieves relevant real-world information without task-specific fine-tuning, broadening AI model applicability and effectiveness.
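The HyDE idea fits in a few lines. The sketch below assumes an OpenAI-style API for generation and a sentence-transformers encoder as a stand-in for the paper's contrastive encoder: generate a hypothetical answer document, embed it instead of the raw query, and retrieve the nearest real documents.

```python
# Minimal HyDE sketch: generate a *hypothetical* answer document for the
# query, embed it, and retrieve the nearest real documents by cosine
# similarity. Models here are stand-ins for those used in the paper.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

corpus = [
    "The Amazon rainforest produces roughly 20 percent of Earth's oxygen.",
    "Transformers use self-attention to model token interactions.",
    "Photosynthesis converts sunlight, water, and CO2 into glucose.",
]
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

query = "How do plants make their food?"

# Step 1: generate a hypothetical document that *answers* the query.
hypothetical = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": f"Write a short passage answering: {query}"}],
).choices[0].message.content

# Step 2: embed the hypothetical document, not the raw query.
hyde_emb = encoder.encode([hypothetical], normalize_embeddings=True)

# Step 3: retrieve real documents nearest to the hypothetical one.
scores = corpus_emb @ hyde_emb.T
print(corpus[int(np.argmax(scores))])
```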
Link to transcript and live recording: https://arize.com/blog/hyde-paper-reading-and-discussion/
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
A Deep Dive Into Generative's Newest Models: Gemini vs Mistral (Mixtral-8x7B) – Part I
Wednesday, December 27, 2023 • Duration 47:50
For the last paper read of the year, Arize CPO & Co-Founder Aparna Dhinakaran is joined by Dat Ngo (ML Solutions Architect) and Aman Khan (Product Manager) for an exploration of the new kids on the block: Gemini and Mixtral-8x7B.
There's a lot to cover, so this week's paper read is Part I in a series about Mixtral and Gemini. In Part I, we provide some background and context for Mixtral 8x7B from Mistral AI, a high-quality sparse mixture-of-experts (SMoE) model that outperforms Llama 2 70B on most benchmarks with 6x faster inference. Mixtral also matches or outperforms GPT-3.5 on most benchmarks. This open-source model was optimized through supervised fine-tuning and direct preference optimization.
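To make "sparse mixture of experts" concrete, here is a minimal PyTorch sketch of top-2 routing in the spirit of Mixtral's layers: eight expert MLPs exist, but each token only runs through the two the router selects, which is why inference cost stays far below a dense model of equal parameter count. The dimensions and routing loop are toy assumptions, not Mistral's code.

```python
# Minimal sketch of top-2 sparse mixture-of-experts routing, the mechanism
# behind Mixtral 8x7B's speed: 8 expert MLPs exist, but each token only
# runs through the 2 the router picks. Toy sizes; not Mistral's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim=32, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(),
                           nn.Linear(4 * dim, dim)) for _ in range(n_experts)])
        self.router = nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        logits = self.router(x)
        weights, chosen = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, k] == e  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 32)
print(SparseMoE()(tokens).shape)  # torch.Size([10, 32])
```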
Stay tuned for Part II in January, where we'll build on this conversation and discuss Gemini, developed by teams at DeepMind and Google Research.
Link to transcript and live recording: https://arize.com/blog/a-deep-dive-into-generatives-newest-models-mistral-mixtral-8x7b/
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings
Monday, December 18, 2023 • Duration 44:59
We’re thrilled to be joined by Shuaichen Chang, LLM researcher and the author of this week’s paper to discuss his findings. Shuaichen’s research investigates the impact of prompt constructions on the performance of large language models (LLMs) in the text-to-SQL task, particularly focusing on zero-shot, single-domain, and cross-domain settings. Shuaichen and his team explore various strategies for prompt construction, evaluating the influence of database schema, content representation, and prompt length on LLMs’ effectiveness. The findings emphasize the importance of careful consideration in constructing prompts, highlighting the crucial role of table relationships and content, the effectiveness of in-domain demonstration examples, and the significance of prompt length in cross-domain scenarios.
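As a rough illustration of the prompt-construction choices the paper studies, the sketch below assembles a text-to-SQL prompt from a table schema, a few rows of content, and optional in-domain demonstrations. The schema, rows, and helper name are invented for this example.

```python
# Minimal sketch of the prompt-construction choices studied for
# text-to-SQL: include the table schema, some table content, and
# optional in-domain demonstrations. All data here is invented.
def build_text_to_sql_prompt(schema, sample_rows, question, demos=()):
    parts = [f"-- SQLite schema\n{schema}",
             f"-- Example rows\n{sample_rows}"]
    # In-domain demonstrations matter most in single-domain settings,
    # per the paper's findings summarized above.
    for q, sql in demos:
        parts.append(f"-- Q: {q}\n{sql}")
    parts.append(f"-- Q: {question}\nSELECT")
    return "\n\n".join(parts)

schema = "CREATE TABLE singer(id INT, name TEXT, country TEXT, age INT);"
rows = "1, 'Joe', 'USA', 52\n2, 'Mia', 'France', 34"
demos = [("How many singers are there?", "SELECT COUNT(*) FROM singer;")]
print(build_text_to_sql_prompt(schema, rows,
                               "List singers from France.", demos))
```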
Read the blog and watch the discussion: https://arize.com/blog/how-to-prompt-llms-for-text-to-sql-paper-reading/
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets
Thursday, November 30, 2023 • Duration 41:02
For this paper read, we’re joined by Samuel Marks, Postdoctoral Research Associate at Northeastern University, to discuss his paper, “The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets.” Samuel and his team curated high-quality datasets of true/false statements and used them to study in detail the structure of LLM representations of truth. Overall, they present evidence that language models linearly represent the truth or falsehood of factual statements and also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
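Mass-mean probing is simple to state in code: the probe direction is just the mean activation of true statements minus the mean of false ones. Below is a minimal NumPy sketch on synthetic activations standing in for the LLM hidden states used in the paper.

```python
# Minimal sketch of mass-mean probing: the probe direction is simply
# mean(true activations) - mean(false activations). Synthetic activations
# stand in for the LLM hidden states used in the paper.
import numpy as np

rng = np.random.default_rng(0)
dim = 64
truth_direction = rng.normal(size=dim)  # pretend truth is linearly encoded

# Synthetic hidden states: true statements shifted along the direction.
acts_true = rng.normal(size=(200, dim)) + truth_direction
acts_false = rng.normal(size=(200, dim)) - truth_direction

# Mass-mean probe: difference of class means, no regression fit needed.
probe = acts_true.mean(axis=0) - acts_false.mean(axis=0)

# Classify held-out statements by the sign of their projection.
test = np.vstack([rng.normal(size=(50, dim)) + truth_direction,
                  rng.normal(size=(50, dim)) - truth_direction])
labels = np.array([1] * 50 + [0] * 50)
preds = (test @ probe > 0).astype(int)
print("accuracy:", (preds == labels).mean())
```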
Find the transcript and read more here: https://arize.com/blog/the-geometry-of-truth-emergent-linear-structure-in-llm-representation-of-true-false-datasets-paper-reading/
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.