Explore every episode of the podcast Data Skeptic
| Title | Pub. Date | Duration | |
|---|---|---|---|
| Shilling Attacks on Recommender Systems | 05 Nov 2025 | 00:34:48 | |
In this episode of Data Skeptic's Recommender Systems series, Kyle sits down with Aditya Chichani, a senior machine learning engineer at Walmart, to explore the darker side of recommendation algorithms. The conversation centers on shilling attacks—a form of manipulation where malicious actors create multiple fake profiles to game recommender systems, either to promote specific items or sabotage competitors. Aditya, who researched these attacks during his undergraduate studies at SPIT before completing his master's in computer science with a data science specialization at UC Berkeley, explains how these vulnerabilities emerge particularly in collaborative filtering systems. From promoting a friend's ska band on Spotify to inflating product ratings on e-commerce platforms, shilling attacks represent a significant threat in an industry where approximately 4% of reviews are fake, translating to $800 billion in annual sales in the US alone. The discussion delves deep into collaborative filtering, explaining both user-user and item-item approaches that create similarity matrices to predict user preferences. However, these systems face various shilling attacks of increasing sophistication: random attacks use minimal information with average ratings, while segmented attacks strategically target popular items (like Taylor Swift albums) to build credibility before promoting target items. Bandwagon attacks focus on highly popular items to connect with genuine users, and average attacks leverage item rating knowledge to appear authentic. User-user collaborative filtering proves particularly vulnerable, requiring as few as 500 fake profiles to impact recommendations, while item-item filtering demands significantly more resources. Aditya addresses detection through machine learning techniques that analyze behavioral patterns using methods like PCA to identify profiles with unusually high correlation and suspicious rating consistency. However, this remains an evolving challenge as attackers adapt strategies, now using large language models to generate more authentic-seeming fake reviews. His research with the MovieLens dataset tested detection algorithms against synthetic attacks, highlighting how these concerns extend to modern e-commerce systems. While companies rarely share attack and detection data publicly to avoid giving attackers advantages, academic research continues advancing both offensive and defensive strategies in recommender systems security. | |||
| Music Playlist Recommendations | 29 Oct 2025 | 00:52:29 | |
In this episode, Rebecca Salganik, a PhD student at the University of Rochester with a background in vocal performance and composition, discusses her research on fairness in music recommendation systems. She explores three key types of fairness—group, individual, and counterfactual—and examines how algorithms create challenges like popularity bias (favoring mainstream content) and multi-interest bias (underserving users with diverse tastes). Rebecca introduces LARP, her multi-stage multimodal framework for playlist continuation that uses contrastive learning to align text and audio representations, learn song relationships, and create playlist-level embeddings to address the cold start problem. A significant contribution of Rebecca's work is the Music Semantics dataset, created by scraping Reddit discussions to capture how people naturally describe music using atmospheric qualities, contextual comparisons, and situational associations rather than just technical features. This dataset, available on Hugging Face, enables more nuanced recommendation systems that better understand user preferences and support niche tastes. Her research utilizes industry datasets including Last.fm and Spotify's Million Playlist Dataset, and points toward exciting future applications in music generation and multimodal systems that combine audio, text, and video.
| |||
| Complex Dynamic in Networks | 28 Jun 2025 | 00:56:00 | |
In this episode, we learn why simply analyzing the structure of a network is not enough, and how the dynamics - the actual mechanisms of interaction between components - can drastically change how information or influence spreads. Our guest, Professor Baruch Barzel of Bar-Ilan University, is a leading researcher in network dynamics and complex systems ranging from biology to infrastructure and beyond. Paper in focus: Universality in network dynamics, 2013 | |||
| AGI Can Be Safe | 26 Jun 2023 | 00:45:57 | |
We are joined by Koen Holtman, an independent AI researcher focusing on AI safety. Koen is the Founder of Holtman Systems Research, a research company based in the Netherlands. Koen started the conversation with his take on an AI apocalypse in the coming years. He discussed the obedience problem with AI models and the safe form of obedience. Koen explained the concept of Markov Decision Process (MDP) and how it is used to build machine learning models. Koen spoke about the problem of AGIs not being able to allow changing their utility function after the model is deployed. He shared another alternative approach to solving the problem. He shared how to engineer AGI systems now and in the future safely. He also spoke about how to implement safety layers on AI models. Koen discussed the ultimate goal of a safe AI system and how to check that an AI system is indeed safe. He discussed the intersection between large language Models (LLMs) and MDPs. He shared the key ingredients to scale the current AI implementations. | |||
| AI Fails on Theory of Mind Tasks | 19 Jun 2023 | 00:52:21 | |
An assistant professor of Psychology at Harvard University, Tomer Ullman, joins us. Tomer discussed the theory of mind and whether machines can indeed pass it. Using variations of the Sally-Anne test and the Smarties tube test, he explained how LLMs could fail the theory of mind test. | |||
| AI for Mathematics Education | 12 Jun 2023 | 00:35:36 | |
The application of LLMs cuts across various industries. Today, we are joined by Steven Van Vaerenbergh, who discussed the application of AI in mathematics education. He discussed how AI tools have changed the landscape of solving mathematical problems. He also shared LLMs' current strengths and weaknesses in solving math problems. | |||
| Evaluating Jokes with LLMs | 06 Jun 2023 | 00:43:11 | |
Fabricio Goes, a Lecturer in Creative Computing at the University of Leicester, joins us today. Fabricio discussed what creativity entails and how to evaluate jokes with LLMs. He specifically shared the process of evaluating jokes with GPT-3 and GPT-4. He concluded with his thoughts on the future of LLMs for creative tasks. | |||
| Why Machines Will Never Rule the World | 29 May 2023 | 00:55:15 | |
Barry Smith and Jobst Landgrebe, authors of the book "Why Machines will never Rule the World," join us today. They discussed the limitations of AI systems in today's world. They also shared elaborate reasons AI will struggle to attain the level of human intelligence. | |||
| A Psychopathological Approach to Safety in AGI | 23 May 2023 | 00:49:00 | |
While the possibilities with AGI emergence seem great, it also calls for safety concerns. On the show, Vahid Behzadan, an Assistant Professor of Computer Science and Data Science, joins us to discuss the complexities of modeling AGIs to accurately achieve objective functions. He touched on tangent issues such as abstractions during training, the problem of unpredictability, communications among agents, and so on. | |||
| The NLP Community Metasurvey | 15 May 2023 | 00:49:45 | |
Julian Michael, a postdoc at the Center for Data Science, New York University, joins us today. Julian's conversation with Kyle was centered on the NLP community metasurvey: a survey aimed at understanding expert opinions on controversial NLP issues. He shared the process of preparing the survey as well as some shocking results. | |||
| Skeptical Survey Interpretation | 10 May 2023 | 00:21:39 | |
Kyle shares his own perspectives on challenges getting insight from surveys. The discussion ranges from commentary on the market research industry to specific advice for detecting disingenuous or fraudulent responses and filtering them from your analysis. Finally, he shares some quick thoughts on the usage of the Chi-Square test for interpreting cross tab results in survey analysis. | |||
| The Gallup Poll | 01 May 2023 | 00:40:26 | |
Jeff Jones, a Senior Editor at Gallup, joins us today. His conversation with Kyle spanned a range of topics on Gallup's poll creation process. He discussed how Gallup generates unbiased questionnaires, gets respondents, analyzes results, and everything in between. | |||
| Inclusive Study Group Formation at Scale | 25 Apr 2023 | 00:32:17 | |
Gireeja Ranade, a University of California at Berkeley professor, speaks with us today. She presented her study on implementing inclusive study groups at scale and shared the observed student performance improvements after the intervention. | |||
| Github Network Analysis | 22 Jun 2025 | 00:36:46 | |
In this episode we'll discuss how to use Github data as a network to extract insights about teamwork. Our guest, Gabriel Ramirez, manager of the notifications team at GitHub, will show how to apply network analysis to better understand and improve collaboration within his engineering team by analyzing GitHub metadata - such as pull requests, issues, and discussions - as a bipartite graph of people and projects. Some insights we'll discuss are how network centrality measures (like eigenvector and betweenness centrality) reveal organizational dynamics, how vacation patterns influence team connectivity, and how decentralizing communication hubs can foster healthier collaboration. Gabriel's open-source project, GH Graph Explorer, enables other managers and engineers to extract, visualize, and analyze their own GitHub activity using tools like Python, Neo4j, Gephi and LLMs for insight generation, but always remember – don't take the results on face value. Instead, use the results to guide your qualitative investigation. | |||
| The PhilPapers Survey | 21 Apr 2023 | 00:31:36 | |
Today, we are joined by David Bourget. David is an Associate Professor in Philosophy at Western University in London, Ontario. David is also the co-director of the PhilPapers Foundation and Director of the Center for Digital Philosophy. He joins us to discuss the PhilPapers Survey project. The PhilPapers survey was initially taken in 2009, but there was a follow-up survey in 2020. David discussed the need for the subsequent survey and what changed. He mentioned the metric for measuring the opinion changes between the 2009 and 2020 surveys. He also shared future plans for the PhilPapers surveys. | |||
| Non-Response Bias | 10 Apr 2023 | 00:35:33 | |
Today's show focused on an essential part of surveys — missing values. This is typically caused by a low response rate or non-response from respondents. Yajuan Si is a Research Associate Professor at the Survey Research Center at the University of Michigan. She joins us to discuss dealing with bias from low survey response rates. | |||
| Measuring Trust in Robots with Likert Scales | 03 Apr 2023 | 00:47:24 | |
We are joined by two guests today, Mariah, a Ph.D. student in the CORE Robotics Lab at Georgia Tech, and Matthew Gombolay, the Director of the CORE Robotics Lab. They both discuss practices for measuring a respondent's perception in a survey. | |||
| CAREER Prediction | 27 Mar 2023 | 00:40:31 | |
Ever wondered what your next career would be? Today, Keyon Vafa, a computer science Ph.D. student at Columbia University, joins us to discuss his latest research on developing a machine-learning model for career prediction. Keyon extensively spoke about how the model was developed and the possibilities it brings. | |||
| The Panel Study of Income Dynamics | 21 Mar 2023 | 00:34:03 | |
Noura Insolera, a Research Investigator with the Panel Study of Income Dynamics (PSID), joins us to share how PSID conducts longitudinal household surveys. She also shared some interesting findings from their data exploration, particularly on the observation and trends in food insecurity. | |||
| Survey Design Working Session | 14 Mar 2023 | 01:01:42 | |
Susan Gerbic joins Kyle to review some of the surveys Data Skeptic has launch, draft a new survey about podcast listening habits, and then review the results of that survey. You can see those results at the link below. https://survey.dataskeptic.com/survey/result/1675102237053 Watch the videos Susan mentioned on her Youtube page at the link below. https://www.youtube.com/playlist?list=PL7VAuaQDhPTVaLeI1IcpYph5lH19xA1u4 | |||
| Bot Detection and Dyadic Surveys | 06 Mar 2023 | 00:35:24 | |
The use of social bots to fill out online surveys is becoming prevalent. Today, we speak with Sara Bybee, a postdoctoral research scholar at the University of Utah. Sara shares from her research, how she detected social bots, the strategies to curb them, and how underrepresented groups can be more represented in surveys. | |||
| Reproducible ESP Testing | 20 Feb 2023 | 00:47:10 | |
Our guest today is Zoltán Kekecs, a Ph.D. holder in Behavioural Science. Zoltán highlights the problem of low replicability in journal papers and illustrates how researchers can better ensure complete replication of their research and findings. He used Bem's experiment as an example, extensively talking about his methodology and results. | |||
| A Survey of Data Science Methodologies | 13 Feb 2023 | 00:24:58 | |
On the show, Iñigo Martinez, a Ph.D. student at the University of Navarra shares his survey results which investigated how data practitioners perform data science projects. He revealed the methodologies typically used by data practitioners and the success factors in data science projects. | |||
| Opinion Dynamics Models | 06 Feb 2023 | 00:35:45 | |
On the show today, Dino Carpentras, a post-doctoral researcher at the Computational Social Science group at ETH Zürich joins us to discuss how opinion dynamics models are built and validated. He explained how quantifying opinions is complex, and strategies to develop robust models for measuring and predicting public opinions. | |||
| Networks and Complexity | 14 Jun 2025 | 00:17:49 | |
In this episode, Kyle does an overview of the intersection of graph theory and computational complexity theory. In complexity theory, we are about the runtime of an algorithm based on its input size. For many graph problems, the interesting questions we want to ask take longer and longer to answer! This episode provides the fundamental vocabulary and signposts along the path of exploring the intersection of graph theory and computational complexity theory. | |||
| Casual Affective Triggers | 30 Jan 2023 | 00:35:48 | |
Crafting survey questions is one thing but getting your audience to fill it is yet another. On the show today, we speak with Alexander Nolte, an Associate Professor at the University of Tartu. Alexander discussed the use of Casual Affective Triggers (CAT) to incentivize people to accept survey invitations and improve the completion rate. He revealed the impact of CATs on survey response rates from a study he conducted. | |||
| Conversational Surveys | 23 Jan 2023 | 00:39:49 | |
Traditional surveys have straight-jacket questions to be answered, thus restricting the information that can be gotten. Today, Ziang Xiao, a Postdoc Researcher in the FATE group at Microsoft Research Montréal, talks about conversational surveys, a type of survey that asks questions based on preceding answers. He discussed the benefits of conversational surveys and some of the challenges it poses. | |||
| Do Results Generalize for Privacy and Security Surveys | 17 Jan 2023 | 00:40:21 | |
Today, Jenny Tang, a Ph.D. student of societal computing at Carnegie Mellon University discusses her work on the generalization of privacy and security surveys on platforms such as Amazon MTurk and Prolific. Jenny shared the drawbacks of using such online platforms, the discrepancies observed about the samples drawn, and key insights from her results. | |||
| 4 out of 5 Data Scientists Agree | 10 Jan 2023 | 00:28:52 | |
This episode kicks off the new season of the show, Data Skeptic: Surveys. Linhda rejoins the show for a conversation with Kyle about her experience taking surveys and what questions she has for the season. Lastly, Kyle announces the launch of survey.dataskeptic.com, a new site we're launching to gather your opinions. Please take a moment and share your thoughts! | |||
| Crowdfunded Board Games | 26 Dec 2022 | 00:34:31 | |
It may be intuitive to think crowdfunding a project drives its innovation and novelty, but there are no empirical studies that prove this. On the show, Johannes Wachs shares his research that sought to determine whether crowdfunding truly drives innovation. He used board games as a case study and shared the results he found. | |||
| Russian Election Interference Effectiveness | 19 Dec 2022 | 00:41:55 | |
There were reports of Russia's interference in the 2016 US elections. In today's episode, Koustuv Saha, a researcher at Microsoft Research walks us through the effect of targeted ads for political campaigns. Using practical examples, he discusses how targeted ads can propagate fake news, its ripple effects on electioneering, and how to find a sweet spot with targeted ads. | |||
| Placement Laundering Fraud | 15 Dec 2022 | 00:32:59 | |
There is an unsung kind of ad fraud brewing in the ad tech space — placement laundering fraud. On the show, Jeff Kline discusses what placement laundering fraud is, how it can be identified, and possible solutions to it. Listen to learn more. | |||
| Data Clean Rooms | 12 Dec 2022 | 00:31:49 | |
Bosko Milekic, the Co-founder of Optable, a data collaboration platform for the media and advertising industry, joins us today. Bosko talked about the clean rooms, the technology driving data privacy during collaboration. He discussed why clean rooms are gaining widespread adoption, and how users can exploit Optable's clean room platform for a secured data-sharing experience. | |||
| Dark Patterns in Site Design | 05 Dec 2022 | 00:34:54 | |
Kerstin Bongard-Blanchy is a Research Associate at the University of Luxembourg. She joins us to discuss her study that investigated dark patterns in web designs. She discussed the results, the effect of dark patterns effect on users, whether an average user can detect them, and the way forward to a more ethical web space. | |||
| Internet Advertising Bureau Media Lab | 03 Dec 2022 | 00:37:01 | |
We are joined by Anthony Katsur, the CEO of IAB Tech Lab. Anthony discusses standards within the ad tech industry. He explained how IAB Tech Lab set and propagates global standards, actions to ensure compliance from advertisers, and industry trends for a more privacy-centric ad tech space. | |||
| Graphs for Causal AI | 24 May 2025 | 00:41:00 | |
How to build artificial intelligence systems that understand cause and effect, moving beyond simple correlations? As we all know, correlation is not causation. "Spurious correlations" can show, for example, how rising ice cream sales might statistically link to more drownings, not because one causes the other, but due to an unobserved common cause like warm weather. Our guest, Utkarshani Jaimini, a researcher from the University of South Carolina's Artificial Intelligence Institute, tries to tackle this problem by using knowledge graphs that incorporate domain expertise. Knowledge graphs (structured representations of information) are combined with neural networks in the field of neurosymbolic AI to represent and reason about complex relationships. This involves creating causal ontologies, incorporating the "weight" or strength of causal relationships and hyperrelations. This field has many practical applications such as for AI explainability, healthcare and autonomous driving. Follow our guest Papers in focusCausalLP: Learning causal relations with weighted knowledge graph link prediction, 2024 HyperCausalLP: Causal Link Prediction using Hyper-Relational Knowledge Graph, 2024
| |||
| Your Mouse Reveals Your Gender and Age | 28 Nov 2022 | 00:39:47 | |
When we navigate a webpage, it is fairly easy for our mouse movement to be tracked and collected. Today, Luis Leiva, a Professor of Computer Science discusses how these mouse tracking data can be used to predict age, gender and user attention. He also discusses the privacy concerns with mouse tracking data and possible ways it can be curtailed. | |||
| Measuring Web Search Behavior | 21 Nov 2022 | 00:36:08 | |
On the show, Aleksandra Urman and Mykola Makhortykh join us to discuss their work on the comparative analysis of web search behavior using web tracking data. They shared interesting results from their analysis, bordering around the user preferences for search engines, demographic patterns, and differences between how men and women surf the net. | |||
| StrategyQA and Big Bench | 18 Nov 2022 | 00:41:56 | |
Did Aristotle Use a Laptop? That's a question from the StrategyQA benchmark which highlights the stretch goals for current artificial intelligence systems. Answering a question like that requires several cognitive steps and reasoning. Constructing a dataset of similarly challenging questions is a major undertaking. On today's episode, Mor Geva returns to share details about the creation of StrategyQA and the larger Big Bench dataset it has been included in. | |||
| Ad Blockers Effect on News Consumption | 14 Nov 2022 | 00:38:59 | |
While at first glance, the use of ad blockers drops the revenue of news publishers, this may not be completely true. On the show today, Shunyao Yan, an Assistant Professor in Marketing at Leavey School of Business, Santa Clara University, discussed the effect of ad blockers on news consumption and how ad blockers can potentially be helpful for news publishers. | |||
| Your Consent is Worth 75 Euros a Year | 07 Nov 2022 | 00:24:04 | |
People who do not want their data tracked and shared online can pay a token for a cookie paywall. But are the websites keeping to their side of the bargain? Victor Morel, a Postdoc candidate at the Chalmers University of Technology joins us to discuss his work around auditing the activities of cookie paywalls. He discussed the findings from his analysis and proffers some solutions to making cookie paywalls more transparent. | |||
| Automated Email Generation for Targeted Attacks | 31 Oct 2022 | 00:45:06 | |
The advancement of generative language models has been a force for good, but also for evil. On the show, Avisha Das, a post-doctoral scholar at the University of Texas Health Center, joins us to discuss how attackers use machine learning to create unsuspecting phishing emails. She also discussed how she used RNN for automated email generation, with the goal of defeating statistical detectors. | |||
| Tribal Marketing | 24 Oct 2022 | 00:37:38 | |
Peter Gloor, a Research Scientist at the MIT Center for Collective Intelligence, takes us on a new world of tribe classification. He extensively discussed the need for such classification on the internet and how he built a machine learning model that does it. Listen to find out more! | |||
| Nano-targetted Facebook Ads | 17 Oct 2022 | 00:44:46 | |
| Debiasing GPT-3 Job Ads | 10 Oct 2022 | 00:48:57 | |
We hear about the impeccable achievements of GPT-3 models, but such large generative models come with their bias. On the show today, Conrad Borchers, a Ph.D. student in Human-Computer Interaction, joins us to discuss the bias in GPT-3 for job ads and how such large models can be de-biased. Listen to learn more! | |||
| ML Ops in Production | 06 Oct 2022 | 00:41:58 | |
Moses Guttman from Clear ML joins us to share insights about how organizations leveraging machine learning keep their programs on track. While many parallels exist between the software development life cycle (SWLC) and the machine learning development life cycle, successful deployments of ML in production have demonstrated that a unique set of tools is required. Moses and I discuss the emergence of ML Ops, success stories, and how modern teams leverage tools like Clear ML's open source solution to maximize the value of ML in the organization.
| |||
| Power Networks | 16 May 2025 | 00:41:50 | |
| Ad Network Tomography | 03 Oct 2022 | 00:35:20 | |
Data sharing in the ad tech space has largely been a black box system. While it is obvious the data is being collected, the data sharing process is obscure to users. On the show today, Maaz Bin Musa and Rishab, both researchers at the University of Iowa, speak about the importance of data transparency and their tool, ATOM for data transparency. Listen to find out how ATOM uncovers data-sharing relationships in the ad-tech space. | |||
| First Party Tracking Cookies | 26 Sep 2022 | 00:35:06 | |
When you accept cookies on a website, you cannot tell whether the cookies are used for tracking your personal data or not. Shaoor Munir's machine learning model does that. On the show today, the Ph.D student at the University of California, discussed the world of first-party cookies and how he developed a machine learning model that predicts whether a first-party cookie is used for tracking purposes. | |||
| The Harms of Targeted Weight Loss Ads | 19 Sep 2022 | 00:34:57 | |
Liza Gak, a Ph.D. student at UC Berkeley, joins us to discuss her research on harmful weight loss advertising. She discussed how weight loss ads are not fact-checked, and how they typically target the most vulnerable. She extensively discussed her interview process, data analysis, and results. Listen for more! | |||