Data Engineering Podcast – Details, episodes & analysis

Podcast details

Technical and general information from the podcast's RSS feed.

Data Engineering Podcast

Tobias Macey

Technology

Education

Frequency: 1 episode/7d. Total Eps: 474

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Site

RSS

Apple

Recent rankings

Latest chart positions across Apple Podcasts and Spotify rankings.

Apple Podcasts

🇬🇧 Great Britain - technology
28/07/2025
#80
🇬🇧 Great Britain - technology
27/07/2025
#69
🇫🇷 France - technology
27/07/2025
#91
🇬🇧 Great Britain - technology
26/07/2025
#65
🇫🇷 France - technology
17/07/2025
#98
🇨🇦 Canada - technology
11/07/2025
#87
🇬🇧 Great Britain - technology
11/07/2025
#95
🇬🇧 Great Britain - technology
10/07/2025
#98
🇨🇦 Canada - technology
09/07/2025
#64
🇨🇦 Canada - technology
08/07/2025
#43

Spotify

No recent rankings available

Shared links between episodes and podcasts

Links found in episode descriptions and other podcasts that share them.

See all

https://www.canva.com/
1022 shares
http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/?utm_source=rss&utm_medium=rss
717 shares
https://zapier.com/
560 shares

https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
64 shares
https://en.wikipedia.org/wiki/Extract
44 shares
https://en.wikipedia.org/wiki/C%2B%2B
32 shares

https://github.com/
182 shares
https://github.com/features/copilot
121 shares
https://github.com/features/actions
57 shares

RSS feed quality and score

Technical evaluation of the podcast's RSS feed quality and structure.

See all

RSS feed quality

To improve

Score global : 58%

Publication history

Monthly episode publishing history over the past years.

Year

Episodes published by month in

Latest published episodes

Recent episodes with titles, durations, and descriptions.

See all

The Evolution of DataOps: Insights from DataKitchen's CEO

Episode 437

dimanche 4 août 2024 • Duration 53:30

Summary
In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Chris Berg, CEO of DataKitchen, to discuss his ongoing mission to simplify the lives of data engineers. Chris explains the challenges faced by data engineers, such as constant system failures, the need for rapid changes, and high customer demands. Chris delves into the concept of DataOps, its evolution, and the misappropriation of related terms like data mesh and data observability. He emphasizes the importance of focusing on processes and systems rather than just tools to improve data engineering workflows. Chris also introduces DataKitchen's open-source tools, DataOps TestGen and DataOps Observability, designed to automate data quality validation and monitor data journeys in production.
Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Chris Bergh about his tireless quest to simplify the lives of data engineers

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what DataKitchen is and the story behind it?
You helped to define and popularize "DataOps", which then went through a journey of misappropriation similar to "DevOps", and has since faded in use. What is your view on the realities of "DataOps" today?
Out of the popularized wave of "DataOps" tools came subsequent trends in data observability, data reliability engineering, etc. How have those cycles influenced the way that you think about the work that you are doing at DataKitchen?
The data ecosystem went through a massive growth period over the past ~7 years, and we are now entering a cycle of consolidation. What are the fundamental shifts that we have gone through as an industry in the management and application of data?
What are the challenges that never went away?
You recently open sourced the dataops-testgen and dataops-observability tools. What are the outcomes that you are trying to produce with those projects?
What are the areas of overlap with existing tools and what are the unique capabilities that you are offering?
Can you talk through the technical implementation of your new obserability and quality testing platform?
What does the onboarding and integration process look like?
Once a team has one or both tools set up, what are the typical points of interaction that they will have over the course of their workday?
What are the most interesting, innovative, or unexpected ways that you have seen dataops-observability/testgen used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on promoting DataOps?
What do you have planned for the future of your work at DataKitchen?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Achieving Data Reliability: The Role of Data Contracts in Modern Data Management

Episode 436

dimanche 28 juillet 2024 • Duration 49:26

Summary
Data contracts are both an enforcement mechanism for data quality, and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data

Interview

Introduction
How did you get involved in the area of data management?
Can you describe the scope and purpose of data contracts in the context of this conversation?
In what way(s) do they differ from data quality/data observability?
Data contracts are also known as the API for data, can you elaborate on this?
What are the types of guarantees and requirements that you can enforce with these data contracts?
What are some examples of constraints or guarantees that cannot be represented in these contracts?
Are data contracts related to the shift-left?
Data contracts are also known as the API for data, can you elaborate on this?
The obvious application of data contracts are in the context of pipeline execution flows to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
How did you approach the design of the syntax and implementation for Soda's data contracts?
Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap in e.g. dbt, great expectations?
Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
When are data contracts the wrong choice?
What do you have planned for the future of data contracts?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Migration Strategies For Large Scale Systems

Episode 427

lundi 27 mai 2024 • Duration 01:00:00

Summary

Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer the process becomes more challenging. Sriram Panyam has been involved in several projects that required migration of large volumes of data in high traffic environments. In this episode he shares some of the valuable lessons that he learned about how to make those projects successful.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Commentst" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.
Your host is Tobias Macey and today I'm interviewing Sriram Panyam about his experiences conducting large scale data migrations and the useful strategies that he learned in the process

Interview

Introduction
How did you get involved in the area of data management?
Can you start by sharing some of your experiences with data migration projects?
- As you have gone through successive migration projects, how has that influenced the ways that you think about architecting data systems?
How would you categorize the different types and motivations of migrations?
- How does the motivation for a migration influence the ways that you plan for and execute that work?
Can you talk us through one or two specific projects that you have taken part in?
Part 1: The Triggers
- Section 1: Technical Limitations triggering Data Migration
  - Scaling bottlenecks: Performance issues with databases, storage, or network infrastructure
  - Legacy compatibility: Difficulties integrating with modern tools and cloud platforms
  - System upgrades: The need to migrate data during major software changes (e.g., SQL Server version upgrade)
- Section 2: Types of Migrations for Infrastructure Focus
  - Storage migration: Moving data between systems (HDD to SSD, SAN to NAS, etc.)
  - Data center migration: Physical relocation or consolidation of data centers
  - Virtualization migration: Moving from physical servers to virtual machines (or vice versa)
- Section 3: Technical Decisions Driving Data Migrations
  - End-of-life support: Forced migration when older software or hardware is sunsetted
  - Security and compliance: Adopting new platforms with better security postures
  - Cost Optimization: Potential savings of cloud vs. on-premise data centers
Part 2: Challenges (and Anxieties)
- Section 1: Technical Challenges
  - Data transformation challenges: Schema changes, complex data mappings
  - Network bandwidth and latency: Transferring large datasets efficiently
  - Performance testing and load balancing: Ensuring new systems can handle the workload
  - Live data consistency: Maintaining data integrity while updates occur in the source system
  - Minimizing Lag: Techniques to reduce delays in replicating changes to the new system
  - Change data capture: Identifying and tracking changes to the source system during migration
- Section 2: Operational Challenges
  - Minimizing downtime: Strategies for service continuity during migration
  - Change management and rollback plans: Dealing with unexpected issues
  - Technical skills and resources: In-house expertise/data teams/external help
- Section 3: Security & Compliance Challenges
  - Data encryption and protection: Methods for both in-transit and at-rest data
  - Meeting audit requirements: Documenting data lineage & the chain of custody
  - Managing access controls: Adjusting identity and role-based access to the new systems
Part 3: Patterns
- Section 1: Infrastructure Migration Strategies
  - Lift and shift: Migrating as-is vs. modernization and re-architecting during the move
  - Phased vs. big bang approaches: Tradeoffs in risk vs. disruption
  - Tools and automation: Using specialized software to streamline the process
  - Dual writes: Managing updates to both old and new systems for a time
  - Change data capture (CDC) methods: Log-based vs. trigger-based approaches for tracking changes
  - Data validation & reconciliation: Ensuring consistency between source and target
- Section 2: Maintaining Performance and Reliability
  - Disaster recovery planning: Failover mechanisms for the new environment
  - Monitoring and alerting: Proactively identifying and addressing issues
  - Capacity planning and forecasting growth to scale the new infrastructure
- Section 3: Data Consistency and Replication
  - Replication tools - strategies and specialized tooling
  - Data synchronization techniques, eg Pros and cons of different methods (incremental vs. full)
  - Testing/Verification Strategies for validating data correctness in a live environment
  - Implication of large scale systems/environments
  - Comparison of interesting strategies:
    - DBLog, Debezium, Databus, Goldengate etc
What are the most interesting, innovative, or unexpected approaches to data migrations that you have seen or participated in?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data migrations?
When is a migration the wrong choice?
What are the characteristics or features of data technologies and the overall ecosystem that can reduce the burden of data migration in the future?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story.

Links

DagKnows
Google Cloud Dataflow
Seinfeld Risk Management
ACL == Access Control List
LinkedIn Databus - Change Data Capture
Espresso Storage
HDFS
Kafka
Postgres Replication Slots
Queueing Theory
Apache Beam
Debezium
Airbyte
[Fivetran](fivetran.com)
Designing Data Intensive Applications by Martin Kleppman (affiliate link)
Vector Databases
Pinecone
Weaviate
LAMP Stack
Netflix DBLog

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Going From Transactional To Analytical And Self-managed To Cloud On One Database With MariaDB

Episode 335

dimanche 23 octobre 2022 • Duration 52:04

Summary

The database market has seen unprecedented activity in recent years, with new options addressing a variety of needs being introduced on a nearly constant basis. Despite that, there are a handful of databases that continue to be adopted due to their proven reliability and robust features. MariaDB is one of those default options that has continued to grow and innovate while offering a familiar and stable experience. In this episode field CTO Manjot Singh shares his experiences as an early user of MySQL and MariaDB and explains how the suite of products being built on top of the open source foundation address the growing needs for advanced storage and analytical capabilities.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis⁠—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Manjot Singh about MariaDB, one of the leading open source database engines

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what MariaDB is and the story behind it?
MariaDB started as a fork of the MySQL engine, what are the notable differences that have evolved between the two projects?
- How have the MariaDB team worked to maintain compatibility for users who want to switch from MySQL?
What are the unique capabilities that MariaDB offers?
Beyond the core open source project you have built a suite of commercial extensions. What are the use cases/capabilities that you are targeting with those products?
How do you balance the time and effort invested in the open source engine against the commercial projects to ensure that the overall effort is sustainable?
- What are your guidelines for what features and capabilities are released in the community edition and which are more suited to the commercial products?
For your managed cloud service, what are the differentiating factors for that versus the database services provided by the major cloud platforms?
- What do you see as the future of the database market and how we interact and integrate with them?
What are the most interesting, innovative, or unexpected ways that you have seen MariaDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on MariaDB?
When is MariaDB the wrong choice?
What do you have planned for the future of MariaDB?

Contact Info

LinkedIn
@ManjotSingh on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

An Exploration Of The Open Data Lakehouse And Dremio's Contribution To The Ecosystem

Episode 333

dimanche 16 octobre 2022 • Duration 50:44

Summary

The "data lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse. In this episode Jason Hughes explains what it means for a lakehouse to be "open" and describes the different components that the Dremio team build and contribute to.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis⁠—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Jason Hughes about the work that Dremio is doing to support the open lakehouse

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Dremio is and the story behind it?
What are some of the notable changes in the Dremio product and related ecosystem over the past ~4 years?
- How has the advent of the lakehouse paradigm influenced the product direction?
What are the main benefits that a lakehouse design offers to a data platform?
What are some of the architectural patterns that are only possible with a lakehouse?
What is the distinction you make between a lakehouse and an open lakehouse?
What are some of the unique features that Dremio offers for lakehouse implementations?
What are some of the investments that Dremio has made to the broader open source/open lakehouse ecosystem?
- How are those projects/investments being used in the commercial offering?
What is the purchase/usage model that customers expect for lakehouse implementations?
- How have those expectations shifted since the first iterations of Dremio?
Dremio has its ancestry in the Drill project. How has that history influenced the capabilities (e.g. integrations, scalability, deployment models, etc.) and evolution of Dremio compared to systems like Trino/Presto and Spark SQL?
What are the most interesting, innovative, or unexpected ways that you have seen Dremio used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dremio?
When is Dremio the wrong choice?
What do you have planned for the future of Dremio?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Speeding Up The Time To Insight For Supply Chains And Logistics With The Pathway Database That Thinks

Episode 334

dimanche 16 octobre 2022 • Duration 01:02:36

Summary

Logistics and supply chains are under increased stress and scrutiny in recent years. In order to stay ahead of customer demands, businesses need to be able to react quickly and intelligently to changes, which requires fast and accurate insights into their operations. Pathway is a streaming database engine that embeds artificial intelligence into the storage, with functionality designed to support the spatiotemporal data that is crucial for shipping and logistics. In this episode Adrian Kosowski explains how the Pathway product got started, how its design simplifies the creation of data products that support supply chain operations, and how developers can help to build an ecosystem of applications that allow businesses to accelerate their time to insight.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Adrian Kosowski about Pathway, an AI powered database and streaming framework. Pathway is used for analyzing and optimizing supply chains and logistics in real-time.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Pathway is and the story behind it?
What are the primary challenges that you are working to solve?
- Who are the target users of the Pathway product and how does it fit into their work?
Your tagline is that Pathway is "the database that thinks". What are some of the ways that existing database and stream-processing architectures introduce friction on the path to analysis?
- How does Pathway incorporate computational capabilities into its engine to address those challenges?
What are the types of data that Pathway is designed to work with?
Can you describe how the Pathway engine is implemented?
- What are some of the ways that the design and goals of the product have shifted since you started working on it?
What are some of the ways that Pathway can be integrated into an analytical system?
What is involved in adapting its capabilities to different industries?
What are the most interesting, innovative, or unexpected ways that you have seen Pathway used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pathway?
When is Pathway the wrong choice?
What do you have planned for the future of Pathway?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Pathway
Pathway for developers
SPOJ.com – competitive programming community
Spatiotemporal Data
Pointers in programming
Clustering
The Halting Problem
Pytorch
- Podcast.__init__ Episode
Tensorflow
Markov Chains
NetworkX
Finite State Machine
DTW == Dynamic Time Warping

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Making The Open Data Lakehouse Affordable Without The Overhead At Iomete

Episode 332

lundi 10 octobre 2022 • Duration 55:24

Summary

The core of any data platform is the centralized storage and processing layer. For many that is a data warehouse, but in order to support a diverse and constantly changing set of uses and technologies the data lakehouse is a paradigm that offers a useful balance of scale and cost, with performance and ease of use. In order to make the data lakehouse available to a wider audience the team at Iomete built an all-in-one service that handles management and integration of the various technologies so that you can worry about answering important business questions. In this episode Vusal Dadalov explains how the platform is implemented, the motivation for a truly open architecture, and how they have invested in integrating with the broader ecosystem to make it easy for you to get started.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Vusal Dadalov about Iomete, an open and affordable lakehouse platform

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Iomete is and the story behind it?
The selection of the storage/query layer is the most impactful decision in the implementation of a data platform. What do you see as the most significant factors that are leading people to Iomete/lakehouse structures rather than a more traditional db/warehouse?
The principle of the Lakehouse architecture has been gaining popularity recently. What are some of the complexities/missing pieces that make its implementation a challenge?
- What are the hidden difficulties/incompatibilities that come up for teams who are investing in data lake/lakehouse technologies?
- What are some of the shortcomings of lakehouse architectures?
What are the fundamental capabilities that are necessary to run a fully functional lakehouse?
Can you describe how the Iomete platform is implemented?
- What was your process for deciding which elements to adopt off the shelf vs. building from scratch?
- What do you see as the strengths of Spark as the query/execution engine as compared to e.g. Presto/Trino or Dremio?
What are the integrations and ecosystem investments that you have had to prioritize to simplify adoption of Iomete?
What have been the most challenging aspects of building a competitive business in such an active product category?
What are the most interesting, innovative, or unexpected ways that you have seen Iomete used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iomete?
When is Iomete the wrong choice?
What do you have planned for the future of Iomete?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Investing In Understanding The Customer Journey At American Express

Episode 331

lundi 10 octobre 2022 • Duration 40:43

Summary

For any business that wants to stay in operation, the most important thing they can do is understand their customers. American Express has invested substantial time and effort in their Customer 360 product to achieve that understanding. In this episode Purvi Shah, the VP of Enterprise Big Data Platforms at American Express, explains how they have invested in the cloud to power this visibility and the complex suite of integrations they have built and maintained across legacy and modern systems to make it possible.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis⁠—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Purvi Shah about building the Customer 360 data product for American Express and migrating their enterprise data platform to the cloud

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what the Customer 360 project is and the story behind it?
What are the types of questions and insights that the C360 project is designed to answer?
- Can you describe the types of information and data sources that you are relying on to feed this project?
What are the different axes of scale that you have had to address in the design and architecture of the C360 project? (e.g. geographical, volume/variety/velocity of data, scale of end-user access and data manipulation, etc.)
What are some of the challenges that you have had to address in order to build and maintain the map between organizational and technical requirements/semantics in the platform?
- What were some of the early wins that you targeted, and how did the lessons from those successes drive the product design going forward?
Can you describe the platform architecture for your data systems that are powering the C360 product?
- How have the design/goals/requirements of the system changed since you first started working on it?
How have you approached the integration and migration of legacy data systems and assets into this new platform?
- What are some of the ongoing maintenance challenges that the legacy platforms introduce?
Can you describe how you have approached the question of data quality/observability and the validation/verification of the generated assets?
What are the aspects of governance and access control that you need to deal with being part of a financial institution?
Now that the C360 product has been in use for a few years, what are the strategic and tactical aspects of the ongoing evolution and maintenance of the product which you have had to address?
What are the most interesting, innovative, or unexpected ways that you have seen the C360 product used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on C360 for American Express?
When is a C360 project the wrong choice?
What do you have planned for the future of C360 and enterprise data platforms at American Express?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast