Back

Explore every episode of the podcast Data Skeptic

Dive into the complete episode list for Data Skeptic. Each episode is cataloged with detailed descriptions, making it easy to find and explore specific topics. Keep track of all episodes from your favorite podcast and never miss a moment of insightful content.

Rows per page:

1–50 of 599

TitlePub. DateDuration
Shilling Attacks on Recommender Systems05 Nov 202500:34:48

In this episode of Data Skeptic's Recommender Systems series, Kyle sits down with Aditya Chichani, a senior machine learning engineer at Walmart, to explore the darker side of recommendation algorithms. The conversation centers on shilling attacks—a form of manipulation where malicious actors create multiple fake profiles to game recommender systems, either to promote specific items or sabotage competitors. Aditya, who researched these attacks during his undergraduate studies at SPIT before completing his master's in computer science with a data science specialization at UC Berkeley, explains how these vulnerabilities emerge particularly in collaborative filtering systems. From promoting a friend's ska band on Spotify to inflating product ratings on e-commerce platforms, shilling attacks represent a significant threat in an industry where approximately 4% of reviews are fake, translating to $800 billion in annual sales in the US alone.

The discussion delves deep into collaborative filtering, explaining both user-user and item-item approaches that create similarity matrices to predict user preferences. However, these systems face various shilling attacks of increasing sophistication: random attacks use minimal information with average ratings, while segmented attacks strategically target popular items (like Taylor Swift albums) to build credibility before promoting target items. Bandwagon attacks focus on highly popular items to connect with genuine users, and average attacks leverage item rating knowledge to appear authentic. User-user collaborative filtering proves particularly vulnerable, requiring as few as 500 fake profiles to impact recommendations, while item-item filtering demands significantly more resources. Aditya addresses detection through machine learning techniques that analyze behavioral patterns using methods like PCA to identify profiles with unusually high correlation and suspicious rating consistency. However, this remains an evolving challenge as attackers adapt strategies, now using large language models to generate more authentic-seeming fake reviews. His research with the MovieLens dataset tested detection algorithms against synthetic attacks, highlighting how these concerns extend to modern e-commerce systems. While companies rarely share attack and detection data publicly to avoid giving attackers advantages, academic research continues advancing both offensive and defensive strategies in recommender systems security.

Music Playlist Recommendations29 Oct 202500:52:29

In this episode, Rebecca Salganik, a PhD student at the University of Rochester with a background in vocal performance and composition, discusses her research on fairness in music recommendation systems. She explores three key types of fairness—group, individual, and counterfactual—and examines how algorithms create challenges like popularity bias (favoring mainstream content) and multi-interest bias (underserving users with diverse tastes). Rebecca introduces LARP, her multi-stage multimodal framework for playlist continuation that uses contrastive learning to align text and audio representations, learn song relationships, and create playlist-level embeddings to address the cold start problem.

A significant contribution of Rebecca's work is the Music Semantics dataset, created by scraping Reddit discussions to capture how people naturally describe music using atmospheric qualities, contextual comparisons, and situational associations rather than just technical features. This dataset, available on Hugging Face, enables more nuanced recommendation systems that better understand user preferences and support niche tastes. Her research utilizes industry datasets including Last.fm and Spotify's Million Playlist Dataset, and points toward exciting future applications in music generation and multimodal systems that combine audio, text, and video.

 

Complex Dynamic in Networks28 Jun 202500:56:00

In this episode, we learn why simply analyzing the structure of a network is not enough, and how the dynamics - the actual mechanisms of interaction between components - can drastically change how information or influence spreads.  Our guest, Professor Baruch Barzel of Bar-Ilan University, is a leading researcher in network dynamics and complex systems ranging from biology to infrastructure and beyond. 

BarzelLab

BarzelLab on Youtube

Paper in focus: Universality in network dynamics, 2013

AGI Can Be Safe26 Jun 202300:45:57

We are joined by Koen Holtman, an independent AI researcher focusing on AI safety. Koen is the Founder of Holtman Systems Research, a research company based in the Netherlands.

Koen started the conversation with his take on an AI apocalypse in the coming years. He discussed the obedience problem with AI models and the safe form of obedience.

Koen explained the concept of Markov Decision Process (MDP) and how it is used to build machine learning models.

Koen spoke about the problem of AGIs not being able to allow changing their utility function after the model is deployed. He shared another alternative approach to solving the problem. He shared how to engineer AGI systems now and in the future safely. He also spoke about how to implement safety layers on AI models.

Koen discussed the ultimate goal of a safe AI system and how to check that an AI system is indeed safe. He discussed the intersection between large language Models (LLMs) and MDPs. He shared the key ingredients to scale the current AI implementations.

AI Fails on Theory of Mind Tasks19 Jun 202300:52:21

An assistant professor of Psychology at Harvard University, Tomer Ullman, joins us. Tomer discussed the theory of mind and whether machines can indeed pass it. Using variations of the Sally-Anne test and the Smarties tube test, he explained how LLMs could fail the theory of mind test.

AI for Mathematics Education12 Jun 202300:35:36

The application of LLMs cuts across various industries. Today, we are joined by Steven Van Vaerenbergh, who discussed the application of AI in mathematics education. He discussed how AI tools have changed the landscape of solving mathematical problems. He also shared LLMs' current strengths and weaknesses in solving math problems.

Evaluating Jokes with LLMs06 Jun 202300:43:11

Fabricio Goes, a Lecturer in Creative Computing at the University of Leicester, joins us today. Fabricio discussed what creativity entails and how to evaluate jokes with LLMs. He specifically shared the process of evaluating jokes with GPT-3 and GPT-4. He concluded with his thoughts on the future of LLMs for creative tasks.

Why Machines Will Never Rule the World29 May 202300:55:15

Barry Smith and Jobst Landgrebe, authors of the book "Why Machines will never Rule the World," join us today. They discussed the limitations of AI systems in today's world. They also shared elaborate reasons AI will struggle to attain the level of human intelligence.

A Psychopathological Approach to Safety in AGI23 May 202300:49:00

While the possibilities with AGI emergence seem great, it also calls for safety concerns. On the show, Vahid Behzadan, an Assistant Professor of Computer Science and Data Science, joins us to discuss the complexities of modeling AGIs to accurately achieve objective functions. He touched on tangent issues such as abstractions during training, the problem of unpredictability, communications among agents, and so on.

The NLP Community Metasurvey15 May 202300:49:45

Julian Michael, a postdoc at the Center for Data Science, New York University, joins us today. Julian's conversation with Kyle was centered on the NLP community metasurvey: a survey aimed at understanding expert opinions on controversial NLP issues. He shared the process of preparing the survey as well as some shocking results.

Skeptical Survey Interpretation10 May 202300:21:39

Kyle shares his own perspectives on challenges getting insight from surveys. The discussion ranges from commentary on the market research industry to specific advice for detecting disingenuous or fraudulent responses and filtering them from your analysis. Finally, he shares some quick thoughts on the usage of the Chi-Square test for interpreting cross tab results in survey analysis.

 
The Gallup Poll01 May 202300:40:26

Jeff Jones, a Senior Editor at Gallup, joins us today. His conversation with Kyle spanned a range of topics on Gallup's poll creation process. He discussed how Gallup generates unbiased questionnaires, gets respondents, analyzes results, and everything in between.

Inclusive Study Group Formation at Scale25 Apr 202300:32:17

Gireeja Ranade, a University of California at Berkeley professor, speaks with us today. She presented her study on implementing inclusive study groups at scale and shared the observed student performance improvements after the intervention.

Github Network Analysis22 Jun 202500:36:46

In this episode we'll discuss how to use Github data as a network to extract insights about teamwork.

Our guest, Gabriel Ramirez, manager of the notifications team at GitHub, will show how to apply network analysis to better understand and improve collaboration within his engineering team by analyzing GitHub metadata - such as pull requests, issues, and discussions - as a bipartite graph of people and projects.

Some insights we'll discuss are how network centrality measures (like eigenvector and betweenness centrality) reveal organizational dynamics, how vacation patterns influence team connectivity, and how decentralizing communication hubs can foster healthier collaboration. 

Gabriel's open-source project, GH Graph Explorer, enables other managers and engineers to extract, visualize, and analyze their own GitHub activity using tools like Python, Neo4j, Gephi and LLMs for insight generation, but always remember – don't take the results on face value. Instead, use the results to guide your qualitative investigation. 

The PhilPapers Survey21 Apr 202300:31:36

Today, we are joined by David Bourget. David is an Associate Professor in Philosophy at Western University in London, Ontario. David is also the co-director of the PhilPapers Foundation and Director of the Center for Digital Philosophy. He joins us to discuss the PhilPapers Survey project.

The PhilPapers survey was initially taken in 2009, but there was a follow-up survey in 2020. David discussed the need for the subsequent survey and what changed. He mentioned the metric for measuring the opinion changes between the 2009 and 2020 surveys. He also shared future plans for the PhilPapers surveys.

Non-Response Bias10 Apr 202300:35:33

Today's show focused on an essential part of surveys — missing values. This is typically caused by a low response rate or non-response from respondents. Yajuan Si is a Research Associate Professor at the Survey Research Center at the University of Michigan. She joins us to discuss dealing with bias from low survey response rates.

Measuring Trust in Robots with Likert Scales03 Apr 202300:47:24

We are joined by two guests today, Mariah, a Ph.D. student in the CORE Robotics Lab at Georgia Tech, and Matthew Gombolay, the Director of the CORE Robotics Lab. They both discuss practices for measuring a respondent's perception in a survey.

CAREER Prediction27 Mar 202300:40:31

Ever wondered what your next career would be? Today, Keyon Vafa, a computer science Ph.D. student at Columbia University, joins us to discuss his latest research on developing a machine-learning model for career prediction. Keyon extensively spoke about how the model was developed and the possibilities it brings.

The Panel Study of Income Dynamics21 Mar 202300:34:03

Noura Insolera, a Research Investigator with the Panel Study of Income Dynamics (PSID), joins us to share how PSID conducts longitudinal household surveys. She also shared some interesting findings from their data exploration, particularly on the observation and trends in food insecurity.

Survey Design Working Session14 Mar 202301:01:42

Susan Gerbic joins Kyle to review some of the surveys Data Skeptic has launch, draft a new survey about podcast listening habits, and then review the results of that survey. You can see those results at the link below.

https://survey.dataskeptic.com/survey/result/1675102237053

Watch the videos Susan mentioned on her Youtube page at the link below.

https://www.youtube.com/playlist?list=PL7VAuaQDhPTVaLeI1IcpYph5lH19xA1u4

Bot Detection and Dyadic Surveys06 Mar 202300:35:24

The use of social bots to fill out online surveys is becoming prevalent. Today, we speak with Sara Bybee, a postdoctoral research scholar at the University of Utah. Sara shares from her research, how she detected social bots, the strategies to curb them, and how underrepresented groups can be more represented in surveys.

Reproducible ESP Testing20 Feb 202300:47:10

Our guest today is Zoltán Kekecs, a Ph.D. holder in Behavioural Science. Zoltán highlights the problem of low replicability in journal papers and illustrates how researchers can better ensure complete replication of their research and findings. He used Bem's experiment as an example, extensively talking about his methodology and results.

A Survey of Data Science Methodologies13 Feb 202300:24:58

On the show, Iñigo Martinez, a Ph.D. student at the University of Navarra shares his survey results which investigated how data practitioners perform data science projects. He revealed the methodologies typically used by data practitioners and the success factors in data science projects.

Opinion Dynamics Models06 Feb 202300:35:45

On the show today, Dino Carpentras, a post-doctoral researcher at the Computational Social Science group at ETH Zürich joins us to discuss how opinion dynamics models are built and validated. He explained how quantifying opinions is complex, and strategies to develop robust models for measuring and predicting public opinions.

Networks and Complexity14 Jun 202500:17:49

In this episode, Kyle does an overview of the intersection of graph theory and computational complexity theory.  In complexity theory, we are about the runtime of an algorithm based on its input size.  For many graph problems, the interesting questions we want to ask take longer and longer to answer!  This episode provides the fundamental vocabulary and signposts along the path of exploring the intersection of graph theory and computational complexity theory.

Casual Affective Triggers30 Jan 202300:35:48

Crafting survey questions is one thing but getting your audience to fill it is yet another. On the show today, we speak with Alexander Nolte, an Associate Professor at the University of Tartu. Alexander discussed the use of Casual Affective Triggers (CAT) to incentivize people to accept survey invitations and improve the completion rate. He revealed the impact of CATs on survey response rates from a study he conducted.

Conversational Surveys23 Jan 202300:39:49

Traditional surveys have straight-jacket questions to be answered, thus restricting the information that can be gotten. Today, Ziang Xiao, a Postdoc Researcher in the FATE group at Microsoft Research Montréal, talks about conversational surveys, a type of survey that asks questions based on preceding answers. He discussed the benefits of conversational surveys and some of the challenges it poses.

Do Results Generalize for Privacy and Security Surveys17 Jan 202300:40:21

Today, Jenny Tang, a Ph.D. student of societal computing at Carnegie Mellon University discusses her work on the generalization of privacy and security surveys on platforms such as Amazon MTurk and Prolific. Jenny shared the drawbacks of using such online platforms, the discrepancies observed about the samples drawn, and key insights from her results.

4 out of 5 Data Scientists Agree10 Jan 202300:28:52

This episode kicks off the new season of the show, Data Skeptic: Surveys.  Linhda rejoins the show for a conversation with Kyle about her experience taking surveys and what questions she has for the season.  Lastly, Kyle announces the launch of survey.dataskeptic.com, a new site we're launching to gather your opinions.  Please take a moment and share your thoughts!

Crowdfunded Board Games26 Dec 202200:34:31

It may be intuitive to think crowdfunding a project drives its innovation and novelty, but there are no empirical studies that prove this. On the show, Johannes Wachs shares his research that sought to determine whether crowdfunding truly drives innovation. He used board games as a case study and shared the results he found.

Russian Election Interference Effectiveness19 Dec 202200:41:55

There were reports of Russia's interference in the 2016 US elections. In today's episode, Koustuv Saha, a researcher at Microsoft Research walks us through the effect of targeted ads for political campaigns. Using practical examples, he discusses how targeted ads can propagate fake news, its ripple effects on electioneering, and how to find a sweet spot with targeted ads.

Placement Laundering Fraud15 Dec 202200:32:59

There is an unsung kind of ad fraud brewing in the ad tech space — placement laundering fraud. On the show, Jeff Kline discusses what placement laundering fraud is, how it can be identified, and possible solutions to it. Listen to learn more.

Data Clean Rooms12 Dec 202200:31:49

Bosko Milekic, the Co-founder of Optable, a data collaboration platform for the media and advertising industry, joins us today. Bosko talked about the clean rooms, the technology driving data privacy during collaboration. He discussed why clean rooms are gaining widespread adoption, and how users can exploit Optable's clean room platform for a secured data-sharing experience.

Dark Patterns in Site Design05 Dec 202200:34:54

Kerstin Bongard-Blanchy is a Research Associate at the University of Luxembourg. She joins us to discuss her study that investigated dark patterns in web designs. She discussed the results, the effect of dark patterns effect on users, whether an average user can detect them, and the way forward to a more ethical web space.

Internet Advertising Bureau Media Lab03 Dec 202200:37:01

We are joined by Anthony Katsur, the CEO of IAB Tech Lab. Anthony discusses standards within the ad tech industry. He explained how IAB Tech Lab set and propagates global standards, actions to ensure compliance from advertisers, and industry trends for a more privacy-centric ad tech space.

Graphs for Causal AI24 May 202500:41:00

How to build artificial intelligence systems that understand cause and effect, moving beyond simple correlations?

As we all know, correlation is not causation. "Spurious correlations" can show, for example, how rising ice cream sales might statistically link to more drownings, not because one causes the other, but due to an unobserved common cause like warm weather.

Our guest, Utkarshani Jaimini, a researcher from the University of South Carolina's Artificial Intelligence Institute, tries to tackle this problem by using knowledge graphs that incorporate domain expertise. 

Knowledge graphs (structured representations of information) are combined with neural networks in the field of neurosymbolic AI to represent and reason about complex relationships. This involves creating causal ontologies, incorporating the "weight" or strength of causal relationships and hyperrelations. This field has many practical applications such as for AI explainability, healthcare and autonomous driving.

Follow our guest

Utkarshani Jaimini's Webpage

Linkedin

Papers in focus

CausalLP: Learning causal relations with weighted knowledge graph link prediction, 2024

HyperCausalLP: Causal Link Prediction using Hyper-Relational Knowledge Graph, 2024

 

Your Mouse Reveals Your Gender and Age28 Nov 202200:39:47

When we navigate a webpage, it is fairly easy for our mouse movement to be tracked and collected. Today, Luis Leiva, a Professor of Computer Science discusses how these mouse tracking data can be used to predict age, gender and user attention. He also discusses the privacy concerns with mouse tracking data and possible ways it can be curtailed.

Measuring Web Search Behavior21 Nov 202200:36:08

On the show, Aleksandra Urman and Mykola Makhortykh join us to discuss their work on the comparative analysis of web search behavior using web tracking data. They shared interesting results from their analysis, bordering around the user preferences for search engines, demographic patterns, and differences between how men and women surf the net.

StrategyQA and Big Bench18 Nov 202200:41:56

Did Aristotle Use a Laptop?  That's a question from the StrategyQA benchmark which highlights the stretch goals for current artificial intelligence systems.  Answering a question like that requires several cognitive steps and reasoning.  Constructing a dataset of similarly challenging questions is a major undertaking.  On today's episode, Mor Geva returns to share details about the creation of StrategyQA and the larger Big Bench dataset it has been included in.

Ad Blockers Effect on News Consumption14 Nov 202200:38:59

While at first glance, the use of ad blockers drops the revenue of news publishers, this may not be completely true. On the show today, Shunyao Yan, an Assistant Professor in Marketing at Leavey School of Business, Santa Clara University, discussed the effect of ad blockers on news consumption and how ad blockers can potentially be helpful for news publishers.

Your Consent is Worth 75 Euros a Year07 Nov 202200:24:04

People who do not want their data tracked and shared online can pay a token for a cookie paywall. But are the websites keeping to their side of the bargain? Victor Morel, a Postdoc candidate at the Chalmers University of Technology joins us to discuss his work around auditing the activities of cookie paywalls. He discussed the findings from his analysis and proffers some solutions to making cookie paywalls more transparent.

Automated Email Generation for Targeted Attacks31 Oct 202200:45:06

The advancement of generative language models has been a force for good, but also for evil. On the show, Avisha Das, a post-doctoral scholar at the University of Texas Health Center, joins us to discuss how attackers use machine learning to create unsuspecting phishing emails. She also discussed how she used RNN for automated email generation, with the goal of defeating statistical detectors. 

Tribal Marketing24 Oct 202200:37:38

Peter Gloor, a Research Scientist at the MIT Center for Collective Intelligence, takes us on a new world of tribe classification. He extensively discussed the need for such classification on the internet and how he built a machine learning model that does it. Listen to find out more!

Nano-targetted Facebook Ads17 Oct 202200:44:46
Debiasing GPT-3 Job Ads10 Oct 202200:48:57

We hear about the impeccable achievements of GPT-3 models, but such large generative models come with their bias. On the show today, Conrad Borchers, a Ph.D. student in Human-Computer Interaction, joins us to discuss the bias in GPT-3 for job ads and how such large models can be de-biased. Listen to learn more!

ML Ops in Production06 Oct 202200:41:58

Moses Guttman from Clear ML joins us to share insights about how organizations leveraging machine learning keep their programs on track.  While many parallels exist between the software development life cycle (SWLC) and the machine learning development life cycle, successful deployments of ML in production have demonstrated that a unique set of tools is required.  Moses and I discuss the emergence of ML Ops, success stories, and how modern teams leverage tools like Clear ML's open source solution to maximize the value of ML in the organization.

 

Power Networks16 May 202500:41:50
Ad Network Tomography03 Oct 202200:35:20

Data sharing in the ad tech space has largely been a black box system. While it is obvious the data is being collected, the data sharing process is obscure to users. On the show today, Maaz Bin Musa and Rishab, both researchers at the University of Iowa, speak about the importance of data transparency and their tool, ATOM for data transparency. Listen to find out how ATOM uncovers data-sharing relationships in the ad-tech space.

First Party Tracking Cookies26 Sep 202200:35:06

When you accept cookies on a website, you cannot tell whether the cookies are used for tracking your personal data or not. Shaoor Munir's machine learning model does that. On the show today, the Ph.D student at the University of California, discussed the world of first-party cookies and how he developed a machine learning model that predicts whether a first-party cookie is used for tracking purposes.

The Harms of Targeted Weight Loss Ads19 Sep 202200:34:57

Liza Gak, a Ph.D. student at UC Berkeley, joins us to discuss her research on harmful weight loss advertising. She discussed how weight loss ads are not fact-checked, and how they typically target the most vulnerable. She extensively discussed her interview process, data analysis, and results. Listen for more!

© My Podcast Data