Data Engineering Podcast – Details, episodes & analysis
Podcast details
Technical and general information from the podcast's RSS feed.

Data Engineering Podcast
Tobias Macey
Frequency: 1 episode every 7 days. Total episodes: 474

Recent rankings
Latest chart positions across Apple Podcasts and Spotify.
Apple Podcasts
- 28/07/2025: #80 in 🇬🇧 Great Britain (technology)
- 27/07/2025: #69 in 🇫🇷 France (technology)
- 27/07/2025: #91 in 🇬🇧 Great Britain (technology)
- 26/07/2025: #65 in 🇫🇷 France (technology)
- 17/07/2025: #98 in 🇨🇦 Canada (technology)
- 11/07/2025: #87 in 🇬🇧 Great Britain (technology)
- 11/07/2025: #95 in 🇬🇧 Great Britain (technology)
- 10/07/2025: #98 in 🇨🇦 Canada (technology)
- 09/07/2025: #64 in 🇨🇦 Canada (technology)
- 08/07/2025: #43
Spotify
No recent rankings available
Shared links between episodes and podcasts
Links found in episode descriptions and in other podcasts that share them.
- https://github.com/ (182 shares)
- https://github.com/features/copilot (121 shares)
- https://github.com/features/actions (57 shares)
RSS feed quality and score
Technical evaluation of the podcast's RSS feed quality and structure.
Overall score: 58%
Publication history
Monthly episode publishing history over the past years.
The Evolution of DataOps: Insights from DataKitchen's CEO
Episode 437
Sunday, August 4, 2024 • Duration 53:30
In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Chris Bergh, CEO of DataKitchen, to discuss his ongoing mission to simplify the lives of data engineers. Chris explains the challenges data engineers face, such as constant system failures, the need for rapid changes, and high customer demands. He delves into the concept of DataOps, its evolution, and the misappropriation of related terms like data mesh and data observability, and he emphasizes the importance of focusing on processes and systems rather than just tools to improve data engineering workflows. Chris also introduces DataKitchen's open-source tools, DataOps TestGen and DataOps Observability, designed to automate data quality validation and monitor data journeys in production.
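The premise of deriving data quality tests from the data itself can be illustrated with a small, generic sketch: profile a baseline table, infer simple column-level checks, and apply them to later batches. This is not DataKitchen's TestGen API; the profiling rules, thresholds, and function names below are hypothetical.

```python
import pandas as pd

def infer_column_tests(df: pd.DataFrame) -> list[dict]:
    """Derive simple data quality checks from a profiling pass over a baseline table."""
    tests = []
    for column in df.columns:
        series = df[column]
        # If a column never contained nulls in the sample, assert it stays that way.
        if series.notna().all():
            tests.append({"column": column, "check": "not_null"})
        # If every observed value was unique, flag duplicates in future loads.
        if series.is_unique:
            tests.append({"column": column, "check": "unique"})
        # For numeric columns, remember the observed range as a sanity bound.
        if pd.api.types.is_numeric_dtype(series):
            tests.append({
                "column": column,
                "check": "between",
                "min": float(series.min()),
                "max": float(series.max()),
            })
    return tests

def run_tests(df: pd.DataFrame, tests: list[dict]) -> list[str]:
    """Evaluate previously inferred checks against a new batch of data."""
    failures = []
    for test in tests:
        col = df[test["column"]]
        if test["check"] == "not_null" and col.isna().any():
            failures.append(f"{test['column']}: unexpected nulls")
        elif test["check"] == "unique" and not col.is_unique:
            failures.append(f"{test['column']}: duplicate values")
        elif test["check"] == "between" and ((col < test["min"]) | (col > test["max"])).any():
            failures.append(f"{test['column']}: values outside observed range")
    return failures

if __name__ == "__main__":
    baseline = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.25]})
    new_batch = pd.DataFrame({"order_id": [4, 4, 5], "amount": [12.0, None, 9999.0]})
    inferred = infer_column_tests(baseline)
    print(run_tests(new_batch, inferred))
```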
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Chris Bergh about his tireless quest to simplify the lives of data engineers
- Introduction
- How did you get involved in the area of data management?
- Can you describe what DataKitchen is and the story behind it?
- You helped to define and popularize "DataOps", which then went through a journey of misappropriation similar to "DevOps", and has since faded in use. What is your view on the realities of "DataOps" today?
- Out of the popularized wave of "DataOps" tools came subsequent trends in data observability, data reliability engineering, etc. How have those cycles influenced the way that you think about the work that you are doing at DataKitchen?
- The data ecosystem went through a massive growth period over the past ~7 years, and we are now entering a cycle of consolidation. What are the fundamental shifts that we have gone through as an industry in the management and application of data?
- What are the challenges that never went away?
- You recently open sourced the dataops-testgen and dataops-observability tools. What are the outcomes that you are trying to produce with those projects?
- What are the areas of overlap with existing tools and what are the unique capabilities that you are offering?
- Can you talk through the technical implementation of your new observability and quality testing platform?
- What does the onboarding and integration process look like?
- Once a team has one or both tools set up, what are the typical points of interaction that they will have over the course of their workday?
- What are the most interesting, innovative, or unexpected ways that you have seen dataops-observability/testgen used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on promoting DataOps?
- What do you have planned for the future of your work at DataKitchen?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- DataKitchen
- Podcast Episode
- NASA
- DataOps Manifesto
- Data Reliability Engineering
- Data Observability
- dbt
- DevOps Enterprise Summit
- Building The Data Warehouse by Bill Inmon (affiliate link)
- dataops-testgen, dataops-observability
- Free Data Quality and Data Observability Certification
- Databricks
- DORA Metrics
- DORA for data
Achieving Data Reliability: The Role of Data Contracts in Modern Data Management
Episode 436
Sunday, July 28, 2024 • Duration 49:26
Data contracts are both an enforcement mechanism for data quality, and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.
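To make the "promise to downstream consumers" concrete, here is a minimal sketch of a data contract check in plain Python: a declared schema plus constraints is verified before a batch is handed to downstream consumers. The contract structure and function names are hypothetical and deliberately much simpler than Soda's contract syntax or ODCS.

```python
import pandas as pd

# A hypothetical, minimal contract: expected columns, types, and constraints.
ORDERS_CONTRACT = {
    "columns": {
        "order_id": {"dtype": "int64", "not_null": True, "unique": True},
        "customer_id": {"dtype": "int64", "not_null": True},
        "amount": {"dtype": "float64", "not_null": True, "min": 0.0},
    }
}

def enforce_contract(df: pd.DataFrame, contract: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the batch passes."""
    violations = []
    for name, rules in contract["columns"].items():
        if name not in df.columns:
            violations.append(f"missing column: {name}")
            continue
        col = df[name]
        if str(col.dtype) != rules["dtype"]:
            violations.append(f"{name}: expected {rules['dtype']}, got {col.dtype}")
        if rules.get("not_null") and col.isna().any():
            violations.append(f"{name}: null values present")
        if rules.get("unique") and not col.is_unique:
            violations.append(f"{name}: duplicate values present")
        if "min" in rules and (col.dropna() < rules["min"]).any():
            violations.append(f"{name}: values below {rules['min']}")
    return violations

if __name__ == "__main__":
    batch = pd.DataFrame({
        "order_id": [1, 2, 2],
        "customer_id": [10, 11, 12],
        "amount": [19.99, -5.0, 42.0],
    })
    problems = enforce_contract(batch, ORDERS_CONTRACT)
    if problems:
        # Acting as a circuit breaker: stop the pipeline instead of propagating bad data.
        raise SystemExit("contract violations: " + "; ".join(problems))
```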
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
- Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data
- Introduction
- How did you get involved in the area of data management?
- Can you describe the scope and purpose of data contracts in the context of this conversation?
- In what way(s) do they differ from data quality/data observability?
- Data contracts are also known as "the API for data". Can you elaborate on this?
- What are the types of guarantees and requirements that you can enforce with these data contracts?
- What are some examples of constraints or guarantees that cannot be represented in these contracts?
- Are data contracts related to the shift-left movement?
- The obvious application of data contracts are in the context of pipeline execution flows to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
- How did you approach the design of the syntax and implementation for Soda's data contracts?
- Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap with tools such as dbt and Great Expectations?
- Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
- What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
- When are data contracts the wrong choice?
- What do you have planned for the future of data contracts?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- Soda
- Podcast Episode
- JBoss
- Data Contract
- Airflow
- Unit Testing
- Integration Testing
- OpenAPI
- GraphQL
- Circuit Breaker Pattern
- SodaCL
- Soda Data Contracts
- Data Mesh
- Great Expectations
- dbt Unit Tests
- Open Data Contracts
- ODCS == Open Data Contract Standard
- ODPS == Open Data Product Specification
Data Migration Strategies For Large Scale Systems
Episode 427
Monday, May 27, 2024 • Duration 01:00:00
Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer the process becomes more challenging. Sriram Panyam has been involved in several projects that required migration of large volumes of data in high traffic environments. In this episode he shares some of the valuable lessons that he learned about how to make those projects successful.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.
- Your host is Tobias Macey and today I'm interviewing Sriram Panyam about his experiences conducting large scale data migrations and the useful strategies that he learned in the process
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing some of your experiences with data migration projects?
- As you have gone through successive migration projects, how has that influenced the ways that you think about architecting data systems?
- How would you categorize the different types and motivations of migrations?
- How does the motivation for a migration influence the ways that you plan for and execute that work?
- Can you talk us through one or two specific projects that you have taken part in?
- Part 1: The Triggers
- Section 1: Technical Limitations triggering Data Migration
- Scaling bottlenecks: Performance issues with databases, storage, or network infrastructure
- Legacy compatibility: Difficulties integrating with modern tools and cloud platforms
- System upgrades: The need to migrate data during major software changes (e.g., SQL Server version upgrade)
- Section 2: Types of Migrations for Infrastructure Focus
- Storage migration: Moving data between systems (HDD to SSD, SAN to NAS, etc.)
- Data center migration: Physical relocation or consolidation of data centers
- Virtualization migration: Moving from physical servers to virtual machines (or vice versa)
- Section 3: Technical Decisions Driving Data Migrations
- End-of-life support: Forced migration when older software or hardware is sunsetted
- Security and compliance: Adopting new platforms with better security postures
- Cost Optimization: Potential savings of cloud vs. on-premise data centers
- Part 2: Challenges (and Anxieties)
- Section 1: Technical Challenges
- Data transformation challenges: Schema changes, complex data mappings
- Network bandwidth and latency: Transferring large datasets efficiently
- Performance testing and load balancing: Ensuring new systems can handle the workload
- Live data consistency: Maintaining data integrity while updates occur in the source system
- Minimizing Lag: Techniques to reduce delays in replicating changes to the new system
- Change data capture: Identifying and tracking changes to the source system during migration
- Section 2: Operational Challenges
- Minimizing downtime: Strategies for service continuity during migration
- Change management and rollback plans: Dealing with unexpected issues
- Technical skills and resources: In-house expertise/data teams/external help
- Section 3: Security & Compliance Challenges
- Data encryption and protection: Methods for both in-transit and at-rest data
- Meeting audit requirements: Documenting data lineage & the chain of custody
- Managing access controls: Adjusting identity and role-based access to the new systems
- Part 3: Patterns
- Section 1: Infrastructure Migration Strategies
- Lift and shift: Migrating as-is vs. modernization and re-architecting during the move
- Phased vs. big bang approaches: Tradeoffs in risk vs. disruption
- Tools and automation: Using specialized software to streamline the process
- Dual writes: Managing updates to both old and new systems for a time (a minimal code sketch of this pattern appears after this episode's links below)
- Change data capture (CDC) methods: Log-based vs. trigger-based approaches for tracking changes
- Data validation & reconciliation: Ensuring consistency between source and target
- Section 2: Maintaining Performance and Reliability
- Disaster recovery planning: Failover mechanisms for the new environment
- Monitoring and alerting: Proactively identifying and addressing issues
- Capacity planning: Forecasting growth to scale the new infrastructure
- Section 3: Data Consistency and Replication
- Replication tools - strategies and specialized tooling
- Data synchronization techniques: pros and cons of different methods (incremental vs. full)
- Testing/Verification Strategies for validating data correctness in a live environment
- Implications of large-scale systems/environments
- Comparison of interesting strategies:
- DBLog, Debezium, Databus, GoldenGate, etc.
- What are the most interesting, innovative, or unexpected approaches to data migrations that you have seen or participated in?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data migrations?
- When is a migration the wrong choice?
- What are the characteristics or features of data technologies and the overall ecosystem that can reduce the burden of data migration in the future?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- DagKnows
- Google Cloud Dataflow
- Seinfeld Risk Management
- ACL == Access Control List
- LinkedIn Databus - Change Data Capture
- Espresso Storage
- HDFS
- Kafka
- Postgres Replication Slots
- Queueing Theory
- Apache Beam
- Debezium
- Airbyte
- [Fivetran](https://fivetran.com)
- Designing Data-Intensive Applications by Martin Kleppmann (affiliate link)
- Vector Databases
- Pinecone
- Weaviate
- LAMP Stack
- Netflix DBLog
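The dual-write and reconciliation items in the outline above can be sketched briefly. The in-memory "stores", class names, and keys below are stand-ins for illustration only; real migrations rely on database clients and CDC tooling such as Debezium or DBLog, plus far more careful failure handling.

```python
from dataclasses import dataclass, field

@dataclass
class KeyValueStore:
    """Stand-in for a legacy or target datastore; real systems would be DB clients."""
    name: str
    rows: dict = field(default_factory=dict)

    def upsert(self, key: str, value: dict) -> None:
        self.rows[key] = value

class DualWriter:
    """Write every change to both the legacy and the new system during a migration.

    Failures against the new system are recorded for later reconciliation rather
    than failing the caller, so the legacy path remains the source of truth.
    """

    def __init__(self, legacy: KeyValueStore, target: KeyValueStore):
        self.legacy = legacy
        self.target = target
        self.pending_repairs: list[str] = []

    def upsert(self, key: str, value: dict) -> None:
        self.legacy.upsert(key, value)        # source of truth until cutover
        try:
            self.target.upsert(key, value)    # best-effort shadow write
        except Exception:
            self.pending_repairs.append(key)  # reconcile this key later

def reconcile(legacy: KeyValueStore, target: KeyValueStore) -> list[str]:
    """Report keys whose values differ between the two systems."""
    mismatches = []
    for key, value in legacy.rows.items():
        if target.rows.get(key) != value:
            mismatches.append(key)
    return mismatches

if __name__ == "__main__":
    old_db, new_db = KeyValueStore("legacy"), KeyValueStore("target")
    writer = DualWriter(old_db, new_db)
    writer.upsert("order:1", {"status": "shipped"})
    writer.upsert("order:2", {"status": "pending"})
    # Simulate drift introduced outside the dual-write path.
    old_db.upsert("order:3", {"status": "cancelled"})
    print("keys needing repair:", reconcile(old_db, new_db))
```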
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Red Hat Code Comments Podcast:  Putting new technology to use is an exciting prospect. But going from purchase to production isn’t always smooth—even when it’s something everyone is looking forward to. Code Comments covers the bumps, the hiccups, and the setbacks teams face when adjusting to new technology—and the triumphs they pull off once they really get going. Follow Code Comments [anywhere you listen to podcasts](https://link.chtbl.com/codecomments?sid=podcast.dataengineering).
- Starburst:  This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. Go to [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
Analytics Engineering Without The Friction Of Complex Pipeline Development With Optimus and dbt
Episode 337
Sunday, October 30, 2022 • Duration 40:10
One of the most impactful technologies for data analytics in recent years has been dbt. It’s hard to have a conversation about data engineering or analysis without mentioning it. Despite its widespread adoption there are still rough edges in its workflow that cause friction for data analysts. To help simplify the adoption and management of dbt projects Nandam Karthik helped create Optimus. In this episode he shares his experiences working with organizations to adopt analytics engineering patterns and the ways that Optimus and dbt were combined to let data analysts deliver insights without the roadblocks of complex pipeline management.
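As background for the discussion below, here is a minimal sketch of how a thin orchestration layer, whether hand-rolled or a tool in the spirit of Optimus, might drive dbt from Python: it simply shells out to the dbt CLI and stops when a step fails. The model selector and project setup are assumptions for illustration; this is not Optimus's implementation.

```python
import subprocess
import sys

def run_dbt(args: list[str]) -> None:
    """Invoke the dbt CLI and stop the pipeline if the step fails."""
    result = subprocess.run(["dbt", *args], check=False)
    if result.returncode != 0:
        sys.exit(f"dbt {' '.join(args)} failed with exit code {result.returncode}")

if __name__ == "__main__":
    # Assumes a dbt project on disk and a configured profile; the selector
    # "staging+" is a placeholder, not part of any real project.
    run_dbt(["run", "--select", "staging+"])   # build staging models and their children
    run_dbt(["test", "--select", "staging+"])  # run the tests for the same selection
```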
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Nandam Karthik about his experiences building analytics projects with dbt and Optimus for his clients at Sigmoid.
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Sigmoid is and the types of projects that you are involved in?
- What are some of the core challenges that your clients are facing when they start working with you?
- An ELT workflow with dbt as the transformation utility has become a popular pattern for building analytics systems. Can you share some examples of projects that you have built with this approach?
- What are some of the ways that this pattern becomes bespoke as you start exploring a project more deeply?
- What are the sharp edges/white spaces that you encountered across those projects?
- Can you describe what Optimus is?
- How does Optimus improve the user experience of teams working in dbt?
- What are some of the tactical/organizational practices that you have found most helpful when building with dbt and Optimus?
- What are the most interesting, innovative, or unexpected ways that you have seen Optimus/dbt used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbt/Optimus projects?
- When is Optimus/dbt the wrong choice?
- What are your predictions for how "best practices" for analytics projects will change/evolve in the near/medium term?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Datafold:  Datafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying, you can now know exactly what will change in your database ahead of time. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit our site at [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today to book a demo with Datafold.
How To Bring Agile Practices To Your Data Projects
Episode 336
Sunday, October 23, 2022 • Duration 01:12:18
Agile methodologies have been adopted by a majority of teams for building software applications. Applying those same practices to data can prove challenging due to the number of systems that need to be included to implement a complete feature. In this episode Shane Gibson shares practical advice and insights from his years of experience as a consultant and engineer working in data about how to adopt agile principles in your data work so that you can move faster and provide more value to the business, while building systems that are maintainable and adaptable.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
- Your host is Tobias Macey and today I’m interviewing Shane Gibson about how to bring Agile practices to your data management workflows
- Introduction
- How did you get involved in the area of data management?
- Can you describe what AgileData is and the story behind it?
- What are the main industries and/or use cases that you are focused on supporting?
- The data ecosystem has been trying on different paradigms from software development for some time now (e.g. DataOps, version control, etc.). What are the aspects of Agile that do and don’t map well to data engineering/analysis?
- One of the perennial challenges of data analysis is how to approach data modeling. How do you balance the need to provide value with the long-term impacts of incomplete or underinformed modeling decisions made in haste at the beginning of a project?
- How do you design in affordances for refactoring of the data models without breaking downstream assets?
- Another aspect of implementing data products/platforms is how to manage permissions and governance. What are the incremental ways that those principles can be incorporated early and evolved along with the overall analytical products?
- What are some of the organizational design strategies that you find most helpful when establishing or training a team who is working on data products?
- In order to have a useful target to work toward it’s necessary to understand what the data consumers are hoping to achieve. What are some of the challenges of doing requirements gathering for data products? (e.g. not knowing what information is available, consumers not understanding what’s hard vs. easy, etc.)
- How do you work with the "customers" to help them understand what a reasonable scope is and translate that to the actual project stages for the engineers?
- What are some of the perennial questions or points of confusion that you have had to address with your clients on how to design and implement analytical assets?
- What are the most interesting, innovative, or unexpected ways that you have seen agile principles used for data?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AgileData?
- When is agile the wrong choice for a data project?
- What do you have planned for the future of AgileData?
- @shagility on Twitter
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
- AgileData
- OptimalBI
- How To Make Toast
- Data Mesh
- Information Product Canvas
- DataKitchen
- Great Expectations
- Soda Data
- Google DataStore
- Unfix.work
- Activity Schema
- Data Vault
- Star Schema
- Lean Methodology
- Scrum
- Kanban
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Atlan:  Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to [dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan) and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Prefect:  Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit…
Going From Transactional To Analytical And Self-managed To Cloud On One Database With MariaDB
Episode 335
Sunday, October 23, 2022 • Duration 52:04
The database market has seen unprecedented activity in recent years, with new options addressing a variety of needs being introduced on a nearly constant basis. Despite that, there are a handful of databases that continue to be adopted due to their proven reliability and robust features. MariaDB is one of those default options that has continued to grow and innovate while offering a familiar and stable experience. In this episode field CTO Manjot Singh shares his experiences as an early user of MySQL and MariaDB and explains how the suite of products being built on top of the open source foundation address the growing needs for advanced storage and analytical capabilities.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Manjot Singh about MariaDB, one of the leading open source database engines
- Introduction
- How did you get involved in the area of data management?
- Can you describe what MariaDB is and the story behind it?
- MariaDB started as a fork of the MySQL engine. What are the notable differences that have evolved between the two projects?
- How have the MariaDB team worked to maintain compatibility for users who want to switch from MySQL?
- What are the unique capabilities that MariaDB offers?
- Beyond the core open source project you have built a suite of commercial extensions. What are the use cases/capabilities that you are targeting with those products?
- How do you balance the time and effort invested in the open source engine against the commercial projects to ensure that the overall effort is sustainable?
- What are your guidelines for what features and capabilities are released in the community edition and which are more suited to the commercial products?
- For your managed cloud service, what are the differentiating factors for that versus the database services provided by the major cloud platforms?
- What do you see as the future of the database market and how we interact and integrate with them?
- What are the most interesting, innovative, or unexpected ways that you have seen MariaDB used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on MariaDB?
- When is MariaDB the wrong choice?
- What do you have planned for the future of MariaDB?
- @ManjotSingh on Twitter
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
- MariaDB
- HTML Goodies
- MySQL
- PHP
- MySQL/MariaDB Pluggable Storage
- InnoDB
- MyISAM
- Aria Storage
- SQL/PSM
- MyRocks
- MariaDB XPand
- BSL == Business Source License
- Paxos
- MariaDB MongoDB Compatibility
- Vertica
- MariaDB Spider Storage Engine
- IHME == Institute for Health Metrics and Evaluation
- Rundeck
- MaxScale
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
An Exploration Of The Open Data Lakehouse And Dremio's Contribution To The Ecosystem
Episode 333
Sunday, October 16, 2022 • Duration 50:44
The "data lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse. In this episode Jason Hughes explains what it means for a lakehouse to be "open" and describes the different components that the Dremio team build and contribute to.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Jason Hughes about the work that Dremio is doing to support the open lakehouse
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Dremio is and the story behind it?
- What are some of the notable changes in the Dremio product and related ecosystem over the past ~4 years?
- How has the advent of the lakehouse paradigm influenced the product direction?
- What are the main benefits that a lakehouse design offers to a data platform?
- What are some of the architectural patterns that are only possible with a lakehouse?
- What is the distinction you make between a lakehouse and an open lakehouse?
- What are some of the unique features that Dremio offers for lakehouse implementations?
- What are some of the investments that Dremio has made to the broader open source/open lakehouse ecosystem?
- How are those projects/investments being used in the commercial offering?
- What is the purchase/usage model that customers expect for lakehouse implementations?
- How have those expectations shifted since the first iterations of Dremio?
- Dremio has its ancestry in the Drill project. How has that history influenced the capabilities (e.g. integrations, scalability, deployment models, etc.) and evolution of Dremio compared to systems like Trino/Presto and Spark SQL?
- What are the most interesting, innovative, or unexpected ways that you have seen Dremio used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dremio?
- When is Dremio the wrong choice?
- What do you have planned for the future of Dremio?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
- Dremio
- Dremio Sonar
- Dremio Arctic
- DML == Data Modification Language
- Spark
- Data Lake
- Trino
- Presto
- Dremio Data Reflections
- Tableau
- Delta Lake
- Apache Impala
- Apache Arrow
- DuckDB
- Google BigLake
- Project Nessie
- Apache Iceberg
- Hive Metastore
- AWS Glue Catalog
- Dremel
- Apache Drill
- Arrow Gandiva
- dbt
- Airbyte
- Singer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Speeding Up The Time To Insight For Supply Chains And Logistics With The Pathway Database That Thinks
Episode 334
Sunday, October 16, 2022 • Duration 01:02:36
Logistics and supply chains are under increased stress and scrutiny in recent years. In order to stay ahead of customer demands, businesses need to be able to react quickly and intelligently to changes, which requires fast and accurate insights into their operations. Pathway is a streaming database engine that embeds artificial intelligence into the storage, with functionality designed to support the spatiotemporal data that is crucial for shipping and logistics. In this episode Adrian Kosowski explains how the Pathway product got started, how its design simplifies the creation of data products that support supply chain operations, and how developers can help to build an ecosystem of applications that allow businesses to accelerate their time to insight.
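To make the "react quickly to changes" idea concrete, here is a toy incremental aggregation in plain Python: a per-route count of late deliveries is updated one event at a time instead of being recomputed from scratch. This is only a conceptual sketch with invented event fields and route codes; it does not use or represent Pathway's actual API.

```python
from collections import defaultdict
from datetime import datetime

# Running state: late-delivery counts per route, maintained incrementally.
late_by_route: dict[str, int] = defaultdict(int)

def on_delivery_event(event: dict) -> None:
    """Update the aggregate for a single event rather than recomputing everything."""
    promised = datetime.fromisoformat(event["promised_at"])
    delivered = datetime.fromisoformat(event["delivered_at"])
    if delivered > promised:
        late_by_route[event["route"]] += 1

# A few example events, standing in for a live stream (Kafka topic, CDC feed, ...).
events = [
    {"route": "WAW-BER", "promised_at": "2022-10-01T12:00", "delivered_at": "2022-10-01T15:30"},
    {"route": "WAW-BER", "promised_at": "2022-10-02T12:00", "delivered_at": "2022-10-02T11:45"},
    {"route": "PAR-LYS", "promised_at": "2022-10-02T09:00", "delivered_at": "2022-10-03T10:00"},
]
for event in events:
    on_delivery_event(event)

print(dict(late_by_route))  # e.g. {'WAW-BER': 1, 'PAR-LYS': 1}
```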
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
- Your host is Tobias Macey and today I’m interviewing Adrian Kosowski about Pathway, an AI-powered database and streaming framework used for analyzing and optimizing supply chains and logistics in real time.
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Pathway is and the story behind it?
- What are the primary challenges that you are working to solve?
- Who are the target users of the Pathway product and how does it fit into their work?
- Your tagline is that Pathway is "the database that thinks". What are some of the ways that existing database and stream-processing architectures introduce friction on the path to analysis?
- How does Pathway incorporate computational capabilities into its engine to address those challenges?
- What are the types of data that Pathway is designed to work with?
- Can you describe how the Pathway engine is implemented?
- What are some of the ways that the design and goals of the product have shifted since you started working on it?
- What are some of the ways that Pathway can be integrated into an analytical system?
- What is involved in adapting its capabilities to different industries?
- What are the most interesting, innovative, or unexpected ways that you have seen Pathway used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pathway?
- When is Pathway the wrong choice?
- What do you have planned for the future of Pathway?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
- Pathway
- Pathway for developers
- SPOJ.com – competitive programming community
- Spatiotemporal Data
- Pointers in programming
- Clustering
- The Halting Problem
- Pytorch
- Tensorflow
- Markov Chains
- NetworkX
- Finite State Machine
- DTW == Dynamic Time Warping
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Making The Open Data Lakehouse Affordable Without The Overhead At Iomete
Episode 332
Monday, October 10, 2022 • Duration 55:24
The core of any data platform is the centralized storage and processing layer. For many that is a data warehouse, but in order to support a diverse and constantly changing set of uses and technologies the data lakehouse is a paradigm that offers a useful balance of scale and cost, with performance and ease of use. In order to make the data lakehouse available to a wider audience the team at Iomete built an all-in-one service that handles management and integration of the various technologies so that you can worry about answering important business questions. In this episode Vusal Dadalov explains how the platform is implemented, the motivation for a truly open architecture, and how they have invested in integrating with the broader ecosystem to make it easy for you to get started.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows (a brief illustrative sketch follows this episode’s links below). Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business-critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day, especially once they realize that 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, and spreadsheets are already available as plug-and-play connectors in reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, is why users consistently vote Hevo the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
- Your host is Tobias Macey and today I’m interviewing Vusal Dadalov about Iomete, an open and affordable lakehouse platform
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Iomete is and the story behind it?
- The selection of the storage/query layer is the most impactful decision in the implementation of a data platform. What do you see as the most significant factors that are leading people to Iomete/lakehouse structures rather than a more traditional db/warehouse?
- The principle of the Lakehouse architecture has been gaining popularity recently. What are some of the complexities/missing pieces that make its implementation a challenge?
- What are the hidden difficulties/incompatibilities that come up for teams who are investing in data lake/lakehouse technologies?
- What are some of the shortcomings of lakehouse architectures?
- What are the fundamental capabilities that are necessary to run a fully functional lakehouse?
- Can you describe how the Iomete platform is implemented?
- What was your process for deciding which elements to adopt off the shelf vs. building from scratch?
- What do you see as the strengths of Spark as the query/execution engine as compared to e.g. Presto/Trino or Dremio?
- What are the integrations and ecosystem investments that you have had to prioritize to simplify adoption of Iomete?
- What have been the most challenging aspects of building a competitive business in such an active product category?
- What are the most interesting, innovative, or unexpected ways that you have seen Iomete used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iomete?
- When is Iomete the wrong choice?
- What do you have planned for the future of Iomete?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
- Iomete
- Fivetran
- Airbyte
- Snowflake
- Databricks
- Collibra
- Talend
- Parquet
- Trino
- Spark
- Presto
- Snowpark
- Iceberg
- Iomete dbt adapter
- Singer
- Meltano
- AWS Interface Gateway
- Apache Hudi
- Delta Lake
- Amundsen
- AWS EMR
- AWS Athena
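Referring back to the Prefect announcement above, the following hypothetical sketch illustrates the "write code as workflows" style it describes, using Prefect’s flow and task decorators; the task names and logic are invented for illustration and are unrelated to Iomete or this episode.

```python
from prefect import flow, task

# Hypothetical tasks: ordinary Python functions become retryable,
# observable workflow steps via the @task decorator.
@task(retries=2)
def extract() -> list:
    return [1, 2, 3]

@task
def transform(rows: list) -> list:
    return [r * 10 for r in rows]

@task
def load(rows: list) -> None:
    print(f"loaded {len(rows)} rows")

# The workflow itself is plain Python control flow; Prefect tracks,
# retries, and reports on each run.
@flow
def etl_pipeline() -> None:
    rows = extract()
    load(transform(rows))

if __name__ == "__main__":
    etl_pipeline()
```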
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Investing In Understanding The Customer Journey At American Express
Episode 331
lundi 10 octobre 2022 • Duration 40:43
For any business that wants to stay in operation, the most important thing it can do is understand its customers. American Express has invested substantial time and effort in their Customer 360 product to achieve that understanding. In this episode Purvi Shah, the VP of Enterprise Big Data Platforms at American Express, explains how they have invested in the cloud to power this visibility, and the complex suite of integrations they have built and maintained across legacy and modern systems to make it possible.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Purvi Shah about building the Customer 360 data product for American Express and migrating their enterprise data platform to the cloud
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the Customer 360 project is and the story behind it?
- What are the types of questions and insights that the C360 project is designed to answer?
- Can you describe the types of information and data sources that you are relying on to feed this project?
- What are the different axes of scale that you have had to address in the design and architecture of the C360 project? (e.g. geographical, volume/variety/velocity of data, scale of end-user access and data manipulation, etc.)
- What are some of the challenges that you have had to address in order to build and maintain the map between organizational and technical requirements/semantics in the platform?
- What were some of the early wins that you targeted, and how did the lessons from those successes drive the product design going forward?
- Can you describe the platform architecture for your data systems that are powering the C360 product?
- How have the design/goals/requirements of the system changed since you first started working on it?
- How have you approached the integration and migration of legacy data systems and assets into this new platform?
- What are some of the ongoing maintenance challenges that the legacy platforms introduce?
- Can you describe how you have approached the question of data quality/observability and the validation/verification of the generated assets?
- What are the aspects of governance and access control that you need to deal with being part of a financial institution?
- Now that the C360 product has been in use for a few years, what are the strategic and tactical aspects of the ongoing evolution and maintenance of the product which you have had to address?
- What are the most interesting, innovative, or unexpected ways that you have seen the C360 product used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on C360 for American Express?
- When is a C360 project the wrong choice?
- What do you have planned for the future of C360 and enterprise data platforms at American Express?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA