Machine Learning Guide – Details, episodes & analysis

Podcast details

Technical and general information from the podcast's RSS feed.

Machine Learning Guide

OCDevel

Technology
Education

Frequency: about 1 episode every 55 days. Total episodes: 57

Libsyn
Machine learning audio course, teaching the fundamentals of machine learning and artificial intelligence. It covers intuition, models (shallow and deep), math, languages, frameworks, etc. Where your other ML resources provide the trees, I provide the forest. Consider MLG your syllabus, with highly-curated resources for each episode's details at ocdevel.com. Audio is a great supplement during exercise, commute, chores, etc.

Recent rankings

Latest chart positions across Apple Podcasts and Spotify rankings.

Apple Podcasts

  • 🇫🇷 France - technology

    26/07/2025
    #73
  • 🇫🇷 France - technology

    24/07/2025
    #93
  • 🇫🇷 France - technology

    16/07/2025
    #91
  • 🇨🇦 Canada - technology

    09/07/2025
    #93
  • 🇩🇪 Germany - technology

    09/07/2025
    #93
  • 🇨🇦 Canada - technology

    08/07/2025
    #64
  • 🇬🇧 Great Britain - technology

    01/07/2025
    #87
  • 🇨🇦 Canada - technology

    30/06/2025
    #79
  • 🇬🇧 Great Britain - technology

    29/06/2025
    #77
  • 🇨🇦 Canada - technology

    28/06/2025
    #94

Spotify

    No recent rankings available



RSS feed quality and score

Technical evaluation of the podcast's RSS feed quality and structure.

RSS feed quality
Good

Overall score: 73%


Publication history

Monthly episode publishing history over the past years.

Latest published episodes

Recent episodes with titles, durations, and descriptions.


MLA 021 Databricks: Cloud Analytics and MLOps

Season 1 · Episode 53

Wednesday, June 22, 2022 · Duration: 26:28

Databricks is a cloud-based platform for data analytics and machine learning operations, integrating features such as a hosted Spark cluster, Python notebook execution, Delta Lake for data management, and seamless IDE connectivity. Raybeam utilizes Databricks and other ML Ops tools according to client infrastructure, scaling needs, and project goals, favoring Databricks for its balanced feature set, ease of use, and support for both startups and enterprises.

Raybeam and Databricks
  • Raybeam is a data science and analytics company, recently acquired by Dept Agency.
  • While Raybeam focuses on data analytics, its acquisition has expanded its expertise into ML Ops and AI.
  • The company recommends tools based on client requirements, frequently utilizing Databricks for its comprehensive nature.
Understanding Databricks
  • Databricks is not merely an analytics platform; it is a competitor in the ML Ops space alongside tools like SageMaker and Kubeflow.
  • It provides interactive notebooks, Python code execution, and runs on a hosted Apache Spark cluster.
  • Databricks includes Delta Lake, which acts as a storage and data management layer.
Choosing the Right MLOps Tool
  • Raybeam evaluates each client’s needs, existing expertise, and infrastructure before recommending a platform.
  • Databricks, SageMaker, Kubeflow, and Snowflake are common alternatives, with the final selection dependent on current pipelines and operational challenges.
  • Maintaining existing workflows is prioritized unless scalability or feature limitations necessitate migration.
Databricks Features
  • Databricks is accessible via a web interface similar to Jupyter Hub and can be integrated with local IDEs (e.g., VS Code, PyCharm) using Databricks Connect.
  • Notebooks on Databricks can be version-controlled with Git repositories, enhancing collaboration and preventing data loss.
  • The platform supports configuration of computing resources to match model size and complexity.
  • Databricks clusters are hosted on AWS, Azure, or GCP, with users selecting the underlying cloud provider at sign-up.
Parquet and Delta Lake
  • Parquet files store data in a columnar format, which improves efficiency for aggregation and analytics tasks.
  • Delta Lake provides transactional operations on top of Parquet files by maintaining a version history, enabling row edits and deletions.
  • This approach offers a database-like experience for handling large datasets, simplifying both analytics and machine learning workflows (a short Parquet sketch follows this list).
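The columnar-read benefit described above can be sketched with plain pandas and PyArrow, independent of Databricks; the file name and columns below are made up, and Delta Lake adds a transaction log on top of files like these.

```python
import pandas as pd

# Write a DataFrame to Parquet (columnar storage), then read back only the
# columns needed for an aggregation -- the core efficiency win of the format.
df = pd.DataFrame({
    "region": ["us", "eu", "us", "apac"],
    "revenue": [120.0, 80.5, 95.2, 60.1],
    "units": [12, 8, 9, 6],
})
df.to_parquet("sales.parquet", engine="pyarrow", index=False)  # hypothetical path

# Column pruning: only 'region' and 'revenue' are read from disk.
subset = pd.read_parquet("sales.parquet", columns=["region", "revenue"])
print(subset.groupby("region")["revenue"].sum())
```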
Pricing and Usage
  • Pricing for Databricks depends on the chosen cloud provider (AWS, Azure, or GCP) with an additional fee for Databricks’ services.
  • The added cost is described as relatively small, and the platform is accessible to both individual developers and large enterprises.
  • Databricks is recommended for newcomers to data science and ML for its breadth of features and straightforward setup.
Databricks, MLflow, and Other Integrations
  • Databricks provides a hosted MLflow solution, offering experiment tracking and model management (a minimal tracking sketch follows this list).
  • The platform can access data stored in services like S3, Snowflake, and other cloud provider storage options.
  • Integration with tools such as PyArrow is supported, facilitating efficient data access and manipulation.
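As a rough, non-Databricks-specific illustration of the MLflow experiment tracking mentioned above, here is a minimal sketch; the experiment name, model, and hyperparameter are invented for the example, and on Databricks the tracking server is preconfigured.

```python
import mlflow
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("demo-ridge")  # hypothetical experiment name

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))

    # Log the hyperparameter and the evaluation metric for this run;
    # model artifacts can be logged alongside them as well.
    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("mse", mse)
```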
Example Use Cases and Decision Process
  • Migration to Databricks is recommended when a client’s existing infrastructure (e.g., on-premises Spark clusters) cannot scale effectively.
  • The selection process involves an in-depth exploration of a client’s operational challenges and goals.
  • Databricks is chosen for clients lacking feature-specific needs but requiring a unified data analytics and ML platform.
Personal Projects by Ming Chang
  • Ming Chang has explored automated stock trading using APIs such as Alpaca, focusing on downloading and analyzing market data.
  • He has also developed drone-related projects with Raspberry Pi, emphasizing real-world applications of programming and physical computing.
Additional Resources

MLA 020 Kubeflow and ML Pipeline Orchestration on Kubernetes

Season 1 · Episode 52

Saturday, January 29, 2022 · Duration: 01:08:47

Machine learning pipeline orchestration tools, such as SageMaker and Kubeflow, streamline the end-to-end process of data ingestion, model training, deployment, and monitoring, with Kubeflow providing an open-source, cross-cloud platform built atop Kubernetes. Organizations typically choose between cloud-native managed services and open-source solutions based on required flexibility, scalability, integration with existing cloud environments, and vendor lock-in considerations.

Links

Dirk-Jan Verdoorn - Data Scientist at Dept Agency

Managed vs. Open-Source ML Pipeline Orchestration
  • Cloud providers such as AWS, Google Cloud, and Azure offer managed machine learning orchestration solutions, including SageMaker (AWS) and Vertex AI (GCP).
  • Managed services provide integrated environments that are easier to set up and operate but often result in vendor lock-in, limiting portability across cloud platforms.
  • Open-source tools like Kubeflow extend Kubernetes to support end-to-end machine learning pipelines, enabling portability across AWS, GCP, Azure, or on-premises environments.
Introduction to Kubeflow
  • Kubeflow is an open-source project aimed at making machine learning workflow deployment on Kubernetes simple, portable, and scalable.
  • Kubeflow enables data scientists and ML engineers to build, orchestrate, and monitor pipelines using popular frameworks such as TensorFlow, scikit-learn, and PyTorch.
  • Kubeflow can integrate with TensorFlow Extended (TFX) for complete end-to-end ML pipelines, covering data ingestion, preprocessing, model training, evaluation, and deployment (a pipeline-definition sketch follows this list).
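A minimal sketch of what a pipeline definition might look like with the Kubeflow Pipelines SDK, assuming kfp v2; the component logic and names are placeholders, not the episode's actual pipeline.

```python
from kfp import dsl, compiler


@dsl.component(base_image="python:3.11")
def preprocess(rows: int) -> int:
    # Placeholder preprocessing step; real components read and write artifacts.
    return rows * 2


@dsl.component(base_image="python:3.11")
def train(rows: int) -> str:
    return f"trained on {rows} rows"


@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(rows: int = 100):
    # Each step runs as its own container on the Kubernetes cluster.
    prep_task = preprocess(rows=rows)
    train(rows=prep_task.output)


if __name__ == "__main__":
    # Compile to a YAML spec that can be uploaded to a Kubeflow Pipelines cluster.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```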
Machine Learning Pipelines: Concepts and Motivation
  • Production machine learning systems involve not just model training but also complex pipelines for data ingestion, feature engineering, validation, retraining, and monitoring.
  • Pipelines automate retraining based on model performance drift or updated data, supporting continuous improvement and adaptation to changing data patterns.
  • Scalable, orchestrated pipelines reduce manual overhead, improve reproducibility, and ensure that models remain accurate as underlying business conditions evolve.
Pipeline Orchestration Analogies and Advantages
  • ML pipeline orchestration tools in machine learning fulfill a role similar to continuous integration and continuous deployment (CI/CD) in traditional software engineering.
  • Pipelines enable automated retraining, modularization of pipeline steps (such as ingestion, feature transformation, and deployment), and robust monitoring.
  • Adopting pipeline orchestrators, rather than maintaining standalone models, helps organizations handle multiple models and varied business use cases efficiently.
Choosing Between Managed and Open-Source Solutions
  • Managed services (e.g., SageMaker, Vertex AI) offer streamlined user experiences and seamless integration but restrict cross-cloud flexibility.
  • Kubeflow, as an open-source platform on Kubernetes, enables cross-platform deployment, integration with multiple ML frameworks, and minimizes dependency on a single cloud provider.
  • The complexity of Kubernetes and Kubeflow setup is offset by significant flexibility and community-driven improvements.
Cross-Cloud and Local Development
  • Kubeflow operates on any Kubernetes environment including AWS EKS, GCP GKE, and Azure AKS, as well as on-premises or local clusters.
  • Local and cross-cloud development are facilitated in Kubeflow, while managed services like SageMaker and Vertex AI are better suited to cloud-native workflows.
  • Debugging and development workflows can be challenging in highly secured cloud environments; Kubeflow’s local deployment flexibility addresses these hurdles.
Relationship to TensorFlow Extended (TFX) and Machine Learning Frameworks
  • TensorFlow Extended (TFX) is an end-to-end platform for creating production ML pipelines, tightly integrated with Kubeflow for deployment and execution.
  • While Kubeflow originally focused on TensorFlow, it has grown to support PyTorch, scikit-learn, and other major ML frameworks, offering wider applicability.
  • TFX provides modular pipeline components (data ingestion, transformation, validation, model training, evaluation, and deployment) that execute within Kubeflow’s orchestration platform.
Alternative Pipeline Orchestration Tools
  • Airflow is a general-purpose workflow orchestrator using DAGs, suited for data engineering and automation, but less resource-capable for heavy ML training within the pipeline.
    • Airflow often submits jobs to external compute resources (e.g., AI Platform) for resource-intensive workloads.
    • In organizations using both Kubeflow and Airflow, Airflow may handle data workflows, while Kubeflow is reserved for ML pipelines.
  • MLflow and other solutions also exist, each with unique integrations and strengths; their adoption depends on use case requirements.
Selecting a Cloud Platform and Orchestration Approach
  • The optimal choice of cloud platform and orchestration tool is typically guided by client needs, existing integrations (e.g., organizational use of Google or Microsoft solutions), and team expertise.
  • Agencies with diverse client portfolios often benefit from open-source, cross-cloud tools like Kubeflow to maximize flexibility and knowledge sharing across projects.
  • Users entrenched in a single cloud provider may prefer managed offerings for ease of use and integration, while those prioritizing portability and flexibility often choose open-source solutions.
Cost Optimization in Model Training
  • Both AWS and GCP offer cost-saving compute options for training, such as spot instances (AWS) and preemptible instances (GCP), which are suitable for non-production, batch training jobs.
  • Production workloads that require high uptime and reliability do not typically utilize cost-saving transient compute resources, as these can be interrupted.
Machine Learning Project Lifecycle Overview
  • Project initiation begins with data discovery and validation of the client’s requirements against available data.
  • Cloud environment selection is influenced by client infrastructure, business applications, and platform integrations rather than solely by technical features.
  • Data cleaning, exploratory analysis, model prototyping, advanced model refinement, and deployment are handled collaboratively with data engineering and machine learning teams.
  • The pipeline is gradually constructed in modular steps, facilitating scalable, automated retraining and integration with business applications.
Educational Pathways for Data Science and Machine Learning Careers
  • Advanced mathematics or statistics education provides a strong foundation for work in data science and machine learning.
  • Master’s degrees in data science add the most value for candidates from non-technical undergraduate backgrounds; those with backgrounds in statistics, mathematics, or computer science may benefit more from self-study or targeted upskilling.
  • When evaluating online or accelerated degree programs, candidates should scrutinize the curriculum, instructor engagement, and peer interaction to ensure comprehensive learning.

MLG 032 Cartesian Similarity Metrics

Season 1 · Episode 43

Sunday, November 8, 2020 · Duration: 41:52

Try a walking desk to stay healthy while you study or work!

Show notes at ocdevel.com/mlg/32.

L1/L2 norm, Manhattan, Euclidean, cosine distances, dot product

Normed distances

  • A norm is a function that assigns a strictly positive length to each vector in a vector space.
  • Minkowski is the generalized form: (sum_i |x_i - y_i|^p)^(1/p); choosing p = 1, 2, ... yields the specific distances below.
  • L1: Manhattan / city-block / taxicab. |x2 - x1| + |y2 - y1|. Grid-like distance (the triangle's legs). Preferred for high-dimensional spaces.
  • L2: Euclidean. sqrt((x2 - x1)^2 + (y2 - y1)^2), equivalently the square root of the dot product of the difference vector with itself. Straight-line, shortest distance (the Pythagorean hypotenuse).
  • Others: Mahalanobis, Chebyshev (p = infinity), etc.

Dot product

  • A type of inner product.
    The outer product of two vectors lies outside the planes/axes involved; the inner product lies inside them. The dot product is the inner product on a finite-dimensional Euclidean space.

Cosine (normalized dot)
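The feed's notes end at the heading above; as a small NumPy illustration of the metrics covered in this episode (the two vectors are chosen arbitrarily):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 1.0])

l1 = np.sum(np.abs(x - y))                       # Manhattan / taxicab distance
l2 = np.sqrt(np.sum((x - y) ** 2))               # Euclidean distance
dot = np.dot(x, y)                               # dot product
cosine_sim = dot / (np.linalg.norm(x) * np.linalg.norm(y))  # normalized dot
cosine_dist = 1.0 - cosine_sim

print(l1, l2, dot, cosine_sim, cosine_dist)
```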

MLA 011 Practical Clustering Tools

Season 1 · Episode 42

Sunday, November 8, 2020 · Duration: 34:50

Primary clustering tools for practical applications include K-means using scikit-learn or Faiss, agglomerative clustering leveraging cosine similarity with scikit-learn, and density-based methods like DBSCAN or HDBSCAN. For determining the optimal number of clusters, silhouette score is generally preferred over inertia-based visual heuristics, and it natively supports pre-computed distance matrices.

K-means Clustering
  • K-means is the most widely used clustering algorithm and is typically the first method to try for general clustering tasks.
  • The scikit-learn KMeans implementation is suitable for small to medium-sized datasets, while Faiss's kmeans is more efficient and accurate for very large datasets (a basic scikit-learn example follows this list).
  • K-means requires the number of clusters to be specified in advance and relies on the Euclidean distance metric, which performs poorly in high-dimensional spaces.
  • When document embeddings have high dimensionality (e.g., 768 dimensions from sentence transformers), K-means becomes less effective due to the limitations of Euclidean distance in such spaces.
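A minimal scikit-learn sketch of the default K-means workflow described above, using synthetic blob data and a cluster count fixed in advance:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic low-dimensional data with three "true" clusters.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])   # cluster assignment per sample
print(kmeans.inertia_)       # sum of squared distances, used by the elbow method
```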
Alternatives to K-means for High Dimensions
  • For text embeddings with high dimensionality, agglomerative (hierarchical) clustering methods are preferable, particularly because they allow the use of different similarity metrics.
  • Agglomerative clustering in scikit-learn accepts a pre-computed distance matrix (e.g., cosine distance, 1 - cosine similarity), which is more appropriate for natural language processing (see the sketch after this list).
  • Constructing the pre-computed distance (or similarity) matrix involves normalizing vectors and computing dot products, which can be efficiently achieved with linear algebra libraries like PyTorch.
  • Hierarchical algorithms do not use inertia in the same way as K-means and instead rely on external metrics, such as silhouette score.
  • Other clustering algorithms exist, including spectral, mean shift, and affinity propagation, which are not covered in this episode.
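A sketch of the precomputed-cosine approach above; scikit-learn expects distances rather than similarities, so cosine distance is used, and the embeddings are random stand-ins for real document vectors. The metric parameter assumes a recent scikit-learn release (older versions call it affinity).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances

# Stand-in for high-dimensional document embeddings (e.g., 768-dim vectors).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 768))

dist = cosine_distances(embeddings)  # 1 - cosine similarity, shape (50, 50)

clusterer = AgglomerativeClustering(
    n_clusters=5,
    metric="precomputed",   # older scikit-learn versions use affinity="precomputed"
    linkage="average",      # "ward" only supports Euclidean, not precomputed
)
labels = clusterer.fit_predict(dist)
print(labels)
```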
Semantic Search and Vector Indexing
  • Libraries such as Faiss, Annoy, and HNSWlib provide approximate nearest neighbor search for efficient semantic search on large-scale vector data.
  • These systems create an index of your embeddings to enable rapid similarity search, often with the ability to specify cosine similarity as the metric (a minimal Faiss example follows this list).
  • Sample code using these libraries with sentence transformers can be found in the UKP Lab sentence-transformers examples directory.
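A minimal Faiss sketch of the indexing idea above: a flat inner-product index over L2-normalized float32 vectors gives exact cosine-similarity search, and the approximate index types expose the same add/search interface. The embeddings here are random placeholders.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 384                                    # embedding dimensionality (stand-in)
rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, d)).astype("float32")
query = rng.normal(size=(1, d)).astype("float32")

# Normalize so that inner product == cosine similarity.
faiss.normalize_L2(corpus)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(d)               # exact inner-product index
index.add(corpus)

scores, ids = index.search(query, 5)       # top-5 most similar corpus vectors
print(ids[0], scores[0])
```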
Determining the Optimal Number of Clusters
  • Both K-means and agglomerative clustering require a predefined number of clusters, but this is often unknown beforehand.
  • The "elbow" method involves running the clustering algorithm with varying cluster counts and plotting the inertia (sum of squared distances within clusters) to visually identify the point of diminishing returns; see kmeans.inertia_.
  • The kneed package can automatically detect the "elbow" or "knee" in the inertia plot, eliminating subjective human judgment.
  • The silhouette score, calculated via silhouette_score, considers both inter- and intra-cluster distances and allows for direct selection of the number of clusters with the maximum score.
  • The silhouette score can be computed using a pre-computed distance matrix (such as from cosine similarities), making it well-suited for applications involving non-Euclidean metrics and hierarchical clustering (a cluster-count selection sketch follows this list).
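A sketch of choosing the cluster count by maximizing silhouette score over a precomputed cosine-distance matrix, as described above; the data and the range of k are arbitrary.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))
dist = cosine_distances(embeddings)

best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = AgglomerativeClustering(
        n_clusters=k, metric="precomputed", linkage="average"
    ).fit_predict(dist)
    # Silhouette accepts the same precomputed distance matrix.
    score = silhouette_score(dist, labels, metric="precomputed")
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)
```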
Density-Based Clustering: DBSCAN and HDBSCAN
  • DBSCAN is a density-based clustering method that does not require specifying the number of clusters, instead discovering clusters from the density of the data.
  • HDBSCAN is a more popular and versatile implementation of density-based clustering, capable of handling various types of data without significant parameter tuning.
  • DBSCAN and HDBSCAN can be preferable to K-means or agglomerative clustering when automatic determination of cluster count or robustness to noise is important.
  • However, these algorithms may not perform well with all types of high-dimensional embedding data, as illustrated by the challenges faced when clustering 768-dimensional text embeddings.
Summary Recommendations and Links

MLA 010 NLP packages: transformers, spaCy, Gensim, NLTK

Season 1 · Episode 41

Wednesday, October 28, 2020 · Duration: 26:22

The landscape of Python natural language processing tools has evolved from broad libraries like NLTK toward more specialized packages such as Gensim for topic modeling, SpaCy for linguistic analysis, and Hugging Face Transformers for advanced tasks, with Sentence Transformers extending transformer models to enable efficient semantic search and clustering. Each library occupies a distinct place in the NLP workflow, from fundamental text preprocessing to semantic document comparison and large-scale language understanding.

Historical Foundation: NLTK
  • NLTK ("Natural Language Toolkit") was one of the earliest and most popular Python libraries for natural language processing, covering tasks from tokenization and stemming to document classification and syntax parsing.
  • NLTK remains a catch-all "Swiss Army knife" for NLP, but many of its functions have been supplemented or superseded by newer tools tailored to specific tasks.
Specialized Topic Modeling and Phrase Analysis: Gensim
  • Gensim emerged as the leading library for topic modeling in Python, most notably via its LDA Topic Modeling implementation, which groups documents according to topic distributions.
  • Topic modeling workflows often use NLTK for initial preprocessing (tokenization, stop word removal, lemmatization), then vectorize with scikit-learn’s TF-IDF, and finally model topics with Gensim’s LDA.
  • Gensim also provides bigram/trigram (phrase) detection, combining commonly co-occurring word pairs or triplets (n-grams) into single tokens to improve analysis accuracy (used in the sketch after this list).
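A compressed sketch of the Gensim flow described above, on a tiny toy corpus of pre-tokenized documents; a real pipeline would add the NLTK preprocessing and TF-IDF steps mentioned earlier.

```python
from gensim import corpora
from gensim.models import LdaModel, Phrases

docs = [
    ["machine", "learning", "model", "training", "data"],
    ["neural", "network", "deep", "learning", "training"],
    ["stock", "market", "price", "trading", "data"],
    ["market", "trading", "price", "volatility"],
]

# Detect common word pairs; detected bigrams are merged into single tokens
# such as "machine_learning".
bigram = Phrases(docs, min_count=1, threshold=1)
docs = [bigram[doc] for doc in docs]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```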
Linguistic Structure and Manipulation: SpaCy and Related Tools
  • spaCy is a deep-learning-based library for high-performance linguistic analysis, focusing on tasks such as part-of-speech tagging, named entity recognition, and syntactic parsing.
  • SpaCy supports integrated sentence and word tokenization, stop word removal, and lemmatization, but for advanced lemmatization and inflection, LemmInflect can be used to derive proper inflections for part-of-speech tags.
  • For even more accurate (but slower) linguistic tasks, consider Stanford CoreNLP via SpaCy integration as spacy-stanza.
  • SpaCy can examine parse trees to identify sentence components, enabling sophisticated NLP applications like grammatical corrections and intent detection in conversation agents (a minimal example follows this list).
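A minimal spaCy sketch of the tagging, lemmatization, and entity recognition described above; it assumes the small English model has been installed via python -m spacy download en_core_web_sm.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Part-of-speech tags, lemmas, and dependency labels per token.
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)

# Named entities recognized in the sentence.
for ent in doc.ents:
    print(ent.text, ent.label_)
```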
High-Level NLP Tasks: Hugging Face Transformers
  • huggingface/transformers provides interfaces to transformer-based models (like BERT and its successors) capable of advanced NLP tasks including question answering, summarization, translation, and sentiment analysis.
  • Its Pipelines allow users to accomplish over ten major NLP applications with minimal code (illustrated after this list).
  • The library’s model repository hosts a vast collection of pre-trained models that can be used for both research and production.
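A minimal illustration of the pipeline interface mentioned above; calling pipeline() without a model name downloads a default pre-trained model for the task.

```python
from transformers import pipeline

# Sentiment analysis with the default pre-trained model for the task.
classifier = pipeline("sentiment-analysis")
print(classifier("Machine Learning Guide is a great supplement to reading."))

# Question answering over a short context.
qa = pipeline("question-answering")
print(qa(
    question="What does MLG provide?",
    context="MLG provides the forest; other resources provide the trees.",
))
```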
Semantic Search and Clustering: Sentence Transformers
  • UKPLab/sentence-transformers extends the transformer approach to create dense document embeddings, enabling semantic search, clustering, and similarity comparison via cosine distance or similar metrics (see the example after this list).
  • Example applications include finding the most similar documents, clustering user entries, or summarizing clusters of text.
  • The repository offers application examples for tasks such as semantic search and clustering, often using cosine similarity.
  • For very large-scale semantic search (such as across Wikipedia), approximate nearest neighbor (ANN) libraries like Annoy, FAISS, and hnswlib enable rapid similarity search with embeddings; practical examples are provided in the Sentence Transformers documentation.
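A minimal Sentence Transformers sketch of embedding and cosine-similarity comparison; all-MiniLM-L6-v2 is one small pre-trained model from the hub, and the corpus is a toy example.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "How do I normalize a vector?",
    "Recipes for sourdough bread",
    "Scaling features before training a neural network",
]
query = "Preprocessing data for machine learning"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus entry.
scores = util.cos_sim(query_emb, corpus_emb)[0]
best = scores.argmax().item()
print(corpus[best], scores[best].item())
```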
Additional Resources and Library Landscape

Summary of Library Roles and Use Cases
  • NLTK: Foundational and comprehensive for most classic NLP needs; still covers a broad range of preprocessing and basic analytic tasks.
  • Gensim: Best for topic modeling and phrase extraction (bigrams/trigrams); especially useful in workflows relying on document grouping and label generation.
  • SpaCy: Leading tool for syntactic, linguistic, and grammatical analysis; supports integration with advanced lemmatizers and external tools like Stanford CoreNLP.
  • Hugging Face Transformers: The standard for modern, high-level NLP tasks and quick prototyping, featuring simple pipelines and an extensive model hub.
  • Sentence Transformers: The main approach for embedding text for semantic search, clustering, and large-scale document comparison, supporting ANN methodologies via companion libraries.

MLA 009 Charting and Visualization Tools for Data Science

Season 1 · Episode 39

Tuesday, November 6, 2018 · Duration: 24:43

This episode surveys Python charting libraries - Matplotlib, Seaborn, and Bokeh - explaining their strengths from quick EDA to interactive, HTML-exported visualizations, and clarifies where D3.js fits as a JavaScript alternative for end-user applications. It also evaluates major software solutions like Tableau, Power BI, QlikView, and Excel, detailing how modern BI tools now integrate drag-and-drop analytics with embedded machine learning, potentially allowing business users to automate entire workflows without coding.

Core Phases in Data Science Visualization
  • Exploratory Data Analysis (EDA):
    • EDA occupies an early stage in the Business Intelligence (BI) pipeline, positioned just before or sometimes merged with the data cleaning (“munging”) phase.
    • The outputs of EDA (e.g., correlation matrices, histograms) often serve as inputs to subsequent machine learning steps.
Python Visualization Libraries

1. Matplotlib
  • The foundational plotting library in Python, supporting static, basic chart types.
  • Requires substantial boilerplate code for custom visualizations.
  • Serves as the core engine for many higher-level visualization tools.
  • Common EDA tasks (such as plotting histograms with .hist(), scatter plots with .plot.scatter(), or visualizing the output of .corr() on pandas DataFrames) depend on Matplotlib under the hood.
2. Pandas Plotting
  • Pandas integrates tightly with Matplotlib and exposes simple, one-line commands for common plots (e.g., df.hist(), df.boxplot()).
  • Designed to make quick EDA accessible without requiring detailed knowledge of Matplotlib’s verbose syntax.
3. Seaborn
  • A high-level wrapper around Matplotlib, analogous to how Keras wraps TensorFlow.
  • Sets sensible defaults for chart styles, fonts, colors, and sizes, improving aesthetics with minimal effort.
  • Importing Seaborn and applying its theme can globally enhance the appearance of all Matplotlib plots, even without direct usage of Seaborn’s plotting functions (see the sketch after this list).
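A quick sketch contrasting pandas' one-line plotting calls with a Seaborn-styled chart, as described above; the DataFrame is synthetic, and sns.set_theme() is the explicit call that applies Seaborn's defaults to Matplotlib.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme()  # apply Seaborn's nicer default styles to Matplotlib

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 12, 500).clip(18, 90),
    "income": rng.lognormal(10.5, 0.4, 500),
})

df.hist(figsize=(8, 3))                        # quick pandas/Matplotlib histograms
sns.scatterplot(data=df, x="age", y="income")  # a styled Seaborn chart
print(df.corr())                               # correlation matrix for EDA
plt.show()
```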
4. Bokeh
  • A powerful library for creating interactive, web-ready plots from Python.
  • Enables user interactions such as hovering, zooming, and panning within rendered plots.
  • Exports visualizations as standalone HTML files or can operate as a server-linked app for live data exploration.
  • Supports advanced features like cross-filtering, allowing dynamic slicing and dicing of data across multiple axes or columns.
  • More suited for creating reusable, interactive dashboards rather than quick, one-off EDA visuals (a short example follows this list).
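A minimal Bokeh sketch of an interactive scatter plot with hover tooltips exported to a standalone HTML file; the data and file name are arbitrary.

```python
from bokeh.plotting import figure, output_file, show

x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

output_file("interactive_scatter.html")  # standalone HTML, shareable without a server

p = figure(
    title="Interactive scatter",
    tools="pan,wheel_zoom,box_zoom,reset,hover",
    tooltips=[("x", "@x"), ("y", "@y")],  # shown when hovering over points
)
p.scatter(x, y, size=10)
show(p)  # opens the HTML file in a browser
```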
5. D3.js
  • Unlike previous libraries, D3.js is a JavaScript framework for creating complex, highly customized data visualizations for web and mobile apps.
  • Used predominantly on the client-side to build interactive front-end graphics for end users, not as an EDA tool for analysts.
  • Common in production-grade web apps, but not typically part of a Python-based data science workflow.
Dedicated Visualization and BI Software

Tableau
  • Leading commercial drag-and-drop BI tool for data visualization and dashboarding.
  • Connects to diverse data sources (CSV, Excel, databases), auto-detects column types, and suggests default chart types.
  • Users can interactively build visualizations, cross-filter data, and switch chart types without coding.
Power BI
  • Microsoft’s BI suite, similar to Tableau, supporting end-to-end data analysis and visualization.
  • Integrates data preparation, visualization, and increasingly, built-in machine learning workflows.
  • Focused on empowering business users or analysts to run the BI pipeline without programming.
QlikView
  • Another major BI offering, emphasizing interactive dashboards and data exploration.
Excel
  • Still widely used for basic EDA and visualizations directly on spreadsheets.
  • Offers limited but accessible charting tools for histograms, scatter plots, and simple summary statistics.
  • Data often originates from Excel/CSV files before being ingested for further analysis in Python/pandas.
Trends & Insights
  • Workflow Integration: Modern BI tools are converging, adding both classic EDA capabilities and basic machine learning modeling, often through a code-free interface.
  • Automation Risks and Opportunities: As drag-and-drop BI tools increase in capabilities (including model training and selection), some data science coding work traditionally required for BI pipelines may become accessible to non-programmers.
  • Distinctions in Use:
    • Python libraries (Matplotlib, Seaborn, Bokeh) excel in automating and scripting EDA, report generation, and static analysis as part of data pipelines.
    • BI software (Tableau, Power BI, QlikView) shines for interactive exploration and democratized analytics, integrated from ingestion to reporting.
    • D3.js stands out for tailored, production-level, end-user app visualizations, rarely leveraged by data scientists for EDA.

Key Takeaways

  • For quick, code-based EDA: Use Pandas’ built-in plotters (wrapping Matplotlib).
  • For pre-styled, pretty plots: Use Seaborn (with or without direct API calls).
  • For interactive, shareable dashboards: Use Bokeh for Python or BI tools for no-code operation.
  • For enterprise, end-user-facing dashboards: Choose BI software like Tableau or build custom apps using D3.js for total control.

MLA 008 Exploratory Data Analysis (EDA)

Season 1 · Episode 38

Friday, October 26, 2018 · Duration: 25:07

Exploratory data analysis (EDA) sits at the critical pre-modeling stage of the data science pipeline, focusing on uncovering missing values, detecting outliers, and understanding feature distributions through both statistical summaries and visualizations, such as Pandas' info(), describe(), histograms, and box plots. Visualization tools like Matplotlib, along with processes including imputation and feature correlation analysis, allow practitioners to decide how best to prepare, clean, or transform data before it enters a machine learning model.

EDA in the Data Science Pipeline
  • Position in Pipeline: EDA is an essential pre-processing step in the business intelligence (BI) or data science pipeline, occurring after data acquisition but before model training.
  • Purpose: The goal of EDA is to understand the data by identifying:
    • Missing values (nulls)
    • Outliers
    • Feature distributions
    • Relationships or correlations between variables
Data Acquisition and Initial Inspection
  • Data Sources: Data may arrive from various streams (e.g., Twitter, sensors) and is typically stored in structured formats such as databases or spreadsheets.
  • Loading Data: In Python, data is often loaded into a Pandas DataFrame using commands like pd.read_csv('filename.csv').
  • Initial Review:
    • df.info(): Displays data types and counts of non-null entries by column, quickly highlighting missing values.
    • df.describe(): Provides summary statistics for each column, including count, mean, standard deviation, min/max, and quartiles.
Handling Missing Data and Outliers
  • Imputation:
    • Missing values must often be filled (imputed), as most machine learning algorithms cannot handle nulls.
    • Common strategies: impute with mean, median, or another context-appropriate value.
    • For example, missing ages can be filled with the column's average rather than zero, to avoid introducing skew.
  • Outlier Strategy:
    • Outliers can be removed, replaced (e.g., by nulls and subsequently imputed), or left as-is if legitimate.
    • Treatment depends on whether outliers represent true data points or data errors (one possible handling approach is sketched after this list).
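A compact sketch of the inspection and imputation steps above; the CSV path and column names are hypothetical, and in practice each decision follows from inspecting the actual output.

```python
import pandas as pd

df = pd.read_csv("housing.csv")   # hypothetical dataset

df.info()               # dtypes and non-null counts -> reveals missing values
print(df.describe())    # count, mean, std, min/max, quartiles per column

# Impute missing ages with the column mean rather than zero to avoid skew.
df["age"] = df["age"].fillna(df["age"].mean())

# One simple outlier strategy: null out extreme values, then impute them too.
upper = df["income"].quantile(0.99)
df.loc[df["income"] > upper, "income"] = None
df["income"] = df["income"].fillna(df["income"].median())
```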
Visualization Techniques
  • Purpose: Visualizations help reveal data distributions, outliers, and relationships that may not be apparent from raw statistics.
  • Common Visualization Tools:
    • Matplotlib: The primary Python library for static data visualizations.
    • Visualization Methods:
      • Histogram: Ideal for visualizing the distribution of a single variable (e.g., age), making outliers visible as isolated bars.
      • Box Plot: Summarizes quartiles, median, and range, with 'whiskers' showing min/max; useful for spotting outliers and understanding data spread.
      • Line Chart: Used for time-series data, highlighting trends and anomalies (e.g., sudden spikes in stock price).
      • Correlation Matrix: Visual grid (often of scatterplots) comparing each feature against every other, helping to detect strong or weak linear relationships between features.
Feature Correlation and Dimensionality
  • Correlation Plot:
    • Generated with df.corr() in Pandas to assess linear relationships between features.
    • High correlation between features may suggest redundancy (e.g., number of bedrooms and square footage) and inform feature selection or removal.
  • Limitations:
    • While correlation plots provide intuition, automated approaches like Principal Component Analysis (PCA) or autoencoders are typically superior for feature reduction and target prediction tasks.
Data Transformation Prior to Modeling
  • Scaling:
    • Machine learning models, especially neural networks, often require input features to be scaled (normalized or standardized).
    • StandardScaler (from scikit-learn): Standardizes features, but is sensitive to outliers.
    • RobustScaler: A variant that compresses the influence of outliers, keeping data within interquartile ranges, simplifying preprocessing steps (compared with StandardScaler in the sketch below).
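A short comparison of the two scalers mentioned above on a toy column with one injected outlier:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # last value is an outlier

print(StandardScaler().fit_transform(X).ravel())  # outlier drags the mean and std
print(RobustScaler().fit_transform(X).ravel())    # median/IQR-based, outlier-resistant
```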
Summary of EDA Workflow
  • Initial Steps:
    • Load data into a DataFrame.
    • Examine data types and missing values with df.info().
    • Review summary statistics with df.describe().
  • Visualization:
    • Use histograms and box plots to explore feature distributions and detect anomalies.
    • Leverage correlation matrices to identify related features.
  • Data Preparation:
    • Impute missing values thoughtfully (e.g., with means or medians).
    • Decide on treatment for outliers: removal, imputation, or scaling with tools like RobustScaler.
  • Outcome:
    • Proper EDA ensures that data is cleaned, features are well-understood, and inputs are suitable for effective machine learning model training.

MLA 007 Jupyter Notebooks

Season 1 · Episode 37

Tuesday, October 16, 2018 · Duration: 16:52

Jupyter Notebooks, originally conceived as IPython Notebooks, enable data scientists to combine code, documentation, and visual outputs in an interactive, browser-based environment supporting multiple languages like Python, Julia, and R. This episode details how Jupyter Notebooks structure workflows into executable cells - mixing markdown explanations and inline charts - which is essential for documenting, demonstrating, and sharing data analysis and machine learning pipelines step by step.

Overview of Jupyter Notebooks
  • Historical Context and Scope

    • Jupyter Notebooks began as IPython Notebooks focused solely on Python.
    • The project was renamed Jupyter to support additional languages - namely Julia ("JU"), Python ("PY"), and R ("R") - broadening its applicability for data science and machine learning across multiple languages.
  • Interactive, Narrative-Driven Coding

    • Jupyter Notebooks allow for the mixing of executable code, markdown documentation, and rich media outputs within a browser-based interface.
    • The coding environment is structured as a sequence of cells where each cell can independently run code and display its output directly underneath.
    • Unlike traditional Python scripts, which output results linearly and impermanently, Jupyter Notebooks preserve the stepwise development process and its outputs for later review or publication.
Typical Workflow Example
  • Stepwise Data Science Pipeline Construction
    • Import necessary libraries: Each new notebook usually starts with a cell for imports (e.g., matplotlib, scikit-learn, keras, pandas).
    • Data ingestion phase: Read data into a pandas DataFrame via read_csv for CSVs or read_sql for databases.
    • Exploratory analysis steps: Use DataFrame methods like .info() and .describe() to inspect the dataset; results are rendered below the respective cell.
    • Model development: Train a machine learning model - for example using Keras - and output performance metrics such as loss, mean squared error, or classification accuracy directly beneath the executed cell.
    • Data visualization: Leverage charting libraries like matplotlib to produce inline plots (e.g., histograms, correlation matrices), which remain visible as part of the notebook for later reference.
Publishing and Documentation Features
  • Markdown Support and Storytelling

    • Markdown cells enable the inclusion of formatted explanations, section headings, bullet points, and even inline images and videos, allowing for clear documentation and instructional content interleaved with code.
    • This format makes it simple to delineate different phases of a pipeline (e.g., "Data Ingestion", "Data Cleaning", "Model Evaluation") with descriptive context.
  • Inline Visual Outputs

    • Outputs from code cells, such as tables, charts, and model training logs, are preserved within the notebook interface, making it easy to communicate findings and reasoning steps alongside the code.
    • Visualization libraries (like matplotlib) can render charts directly in the notebook without the need to generate separate files.
  • Reproducibility and Sharing

    • Notebooks can be published to platforms like GitHub, where the full code, markdown, and most recent cell outputs are viewable in-browser.
    • This enables transparent workflow documentation and facilitates tutorials, blog posts, and collaborative analysis.
Practical Considerations and Limitations
  • Cell-based Execution Flexibility

    • Each cell can be run independently, so developers can repeatedly rerun specific steps (e.g., re-trying a modeling cell after code fixes) without needing to rerun the entire notebook.
    • This is especially useful for iterative experimentation with large or slow-to-load datasets.
  • Primary Use Cases

    • Jupyter Notebooks excel at "storytelling" - presenting an analytical or modeling process along with its rationale and findings, primarily for publication or demonstration.
    • For regular development, many practitioners prefer traditional editors or IDEs (like PyCharm or Vim) due to advanced features such as debugging, code navigation, and project organization.
Summary

Jupyter Notebooks serve as a central tool for documenting, presenting, and sharing the entirety of a machine learning or data analysis pipeline - combining code, output, narrative, and visualizations into a single, comprehensible document ideally suited for tutorials, reports, and reproducible workflows.

MLA 006 Salaries for Data Science & Machine Learning

Season 1 · Episode 36

Thursday, July 19, 2018 · Duration: 19:35

O'Reilly's 2017 Data Science Salary Survey finds that location is the most significant salary determinant for data professionals, with median salaries ranging from $134,000 in California to under $30,000 in Eastern Europe, and highlights that negotiation skills can lead to salary differences as high as $45,000. Other key factors impacting earnings include company age and size, job title, industry, and education, while popular tools and languages—such as Python, SQL, and Spark—do not strongly influence salary despite widespread use.

Global and Regional Salary Differences
  • Median Global Salary: $90,000 USD, up from $85,000 the previous year.
  • Regional Breakdown:
    • United States: $112,000 median; California leads at $134,000.
    • Western Europe: $57,000—about half the US median.
    • Australia & New Zealand: Second after the US.
    • Eastern Europe: Below $30,000.
    • Asia: Wide interquartile salary range, indicating high variability.
Demographic and Personal Factors
  • Gender: Women's median salaries are $8,000 lower than men's. Women make up 20% of respondents but are increasing in number.
  • Age & Experience: Higher age/experience correlates with higher salaries, but the proportion of older professionals declines.
  • Education: Nearly all respondents have at least a master's; PhD holders earn only about $5,000 more than those with a master’s.
  • Negotiation Skills: Self-reported strong salary negotiation skills are linked to $45,000 higher median salaries (from $70,000 for lowest to $115,000 for highest bargaining skill).
Industry, Company, and Role
  • Industry Impact:
    • Highest salaries found in search/social networking and media/entertainment.
    • Education and non-profit offer the lowest pay.
  • Company Age & Size:
    • Companies aged 2–5 years offer higher than average pay; less than 2 years old offer much lower salaries (~$40,000).
    • Large organizations generally pay more.
  • Job Title:
    • "Data scientist" and "data analyst" titles carry higher medians than "engineer" titles by around $7,000.
    • Executive titles (CTO, VP, Director) see the highest pay, with CTOs at $150,000 median.
Tools, Languages, and Technologies
  • Operating Systems:
    • Windows: 67% usage, but declining.
    • Linux: 55%; Unix: 18%; macOS: 46%; Unix-based systems are rising in use.
  • Programming Languages:
    • SQL: 64% (most used for database querying).
    • Python: 63% (most popular procedural language).
    • R: 54%.
    • Others (Java, Scala, C/C++, C#): Each less than 20%.
    • Salary difference across languages is minor; C/C++ users earn more but not enough to outweigh the difficulty.
  • Databases:
    • MySQL (37%), MS SQL Server (30%), PostgreSQL (28%).
    • Popularity of the database has little impact on pay.
  • Big Data and Search Tools:
    • Spark: Most popular big data platform, especially for large-scale data processing.
    • Elasticsearch: Most common search engine, but Solr pays more.
  • Machine Learning Libraries:
    • Scikit-learn (37%) and Spark MLlib (16%) are most used.
  • Visualization Tools:
    • R’s ggplot2 and Python’s matplotlib are leading choices.
Key Salary Differentiators (per Machine Learning Analysis)
  • Top Predictors (explaining ~60% of salary variance):
    • World/US region
    • Experience
    • Gender
    • Company size
    • Education (but amounting to only ~$5,000 difference)
    • Job title
    • Industry
  • Lesser Impact: Specific tools, languages, and databases do not meaningfully affect salary.
Summary Takeaways
  • The greatest leverage for a higher salary comes from geography and individual negotiation capability, with up to $45,000 differences possible.
  • Role/title selection, industry, company age, and size are also significant, while mastering the most commonly used tools is essential but does not strongly differentiate pay.
  • For aspiring data professionals: focus on developing negotiation skills and, where possible, optimize for location and title to maximize earning potential.

MLA 005 Shapes and Sizes: Tensors and NDArrays

Season 1 · Episode 35

Saturday, June 9, 2018 · Duration: 27:18

Explains the fundamental differences between tensor dimensions, size, and shape, clarifying frequent misconceptions—such as the distinction between the number of features (“columns”) and true data dimensions—while also demystifying reshaping operations like expand_dims, squeeze, and transpose in NumPy. Through practical examples from images and natural language processing, listeners learn how to manipulate tensors to match model requirements, including scenarios like adding dummy dimensions for grayscale images or reordering axes for sequence data.

Definitions
  • Tensor: A general term for an array of any number of dimensions.

    • 0D Tensor (Scalar): A single number (e.g., 5).
    • 1D Tensor (Vector): A simple list of numbers.
    • 2D Tensor (Matrix): A grid of numbers (rows and columns).
    • 3D+ Tensors: Higher-dimensional arrays, such as images or batches of images.
  • NDArray (NumPy): Stands for "N-dimensional array," the foundational array type in NumPy, synonymous with "tensor."

Tensor Properties

Dimensions
  • Number of nested levels in the array (e.g., a matrix has two dimensions: rows and columns).
  • Access in NumPy: Via .ndim property (e.g., array.ndim).
Size
  • Total number of elements in the tensor.
  • Examples:
    • Scalar: size = 1
    • Vector: size equals number of elements (e.g., 5 for [1, 2, 3, 4, 5])
    • Matrix: size = rows × columns (e.g., 10×10 = 100)
  • Access in NumPy: Via .size property.
Shape
  • Tuple listing the number of elements per dimension.
  • Example: An image with 256×256 pixels and 3 color channels has shape = (256, 256, 3); all three properties are illustrated in the sketch below.
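A small NumPy illustration of the three properties just defined:

```python
import numpy as np

scalar = np.array(5)
vector = np.array([1, 2, 3, 4, 5])
image = np.zeros((256, 256, 3))          # e.g., an RGB image

for name, t in [("scalar", scalar), ("vector", vector), ("image", image)]:
    print(name, "ndim:", t.ndim, "size:", t.size, "shape:", t.shape)
```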
Common Scenarios & Examples

Data Structures in Practice
  • CSV/Spreadsheet Example: Dataset with 1 million housing examples and 50 features:
    • Shape: (1_000_000, 50)
    • Size: 50,000,000
  • Image Example (RGB): 256×256 pixel image:
    • Shape: (256, 256, 3)
    • Dimensions: 3 (width, height, channels)
  • Batching for Models:
    • For a convolutional neural network, shape might become (batch_size, width, height, channels), e.g., (32, 256, 256, 3).
Conceptual Clarifications
  • The term "dimensions" in data science often refers to features (columns), but technically in tensors it means the number of structural axes.
  • The "curse of dimensionality" often uses "dimensions" to refer to features, not tensor axes.
Reshaping and Manipulation in NumPy

Reshaping Tensors
  • Adding Dimensions:

    • Useful when a model expects higher-dimensional input than currently available (e.g., converting grayscale image from shape (256, 256) to (256, 256, 1)).
    • Use np.expand_dims or array.reshape.
  • Removing Singleton Dimensions:

    • Occurs when, for example, model output is (N, 1) and single dimension should be removed to yield (N,).
    • Use np.squeeze or array.reshape.
  • Wildcard with -1:

    • In reshaping, -1 is a placeholder for NumPy to infer the correct size, useful when batch size or another dimension is variable.
  • Flattening:

    • Use np.ravel to turn a multi-dimensional tensor into a contiguous 1D array.
Axis Reordering
  • Transposing Axes:
    • Needed when model input or output expects axes in a different order (e.g., sequence length and embedding dimensions in NLP).
    • Use np.transpose for general axis permutations.
    • Use np.swapaxes to swap two specific axes but prefer transpose for clarity and flexibility.
Practical Example
  • In NLP sequence models:
    • 3D tensor with (batch_size, sequence_length, embedding_dim) might need to be reordered to (batch_size, embedding_dim, sequence_length) for certain models.
    • Achieved using array.transpose(0, 2, 1), as shown in the sketch below.
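A compact NumPy sketch of the reshaping operations discussed in this episode, including the NLP-style axis permutation above; shapes are arbitrary examples.

```python
import numpy as np

gray = np.zeros((256, 256))
gray_with_channel = np.expand_dims(gray, axis=-1)   # (256, 256) -> (256, 256, 1)

preds = np.zeros((32, 1))
flat_preds = np.squeeze(preds, axis=1)              # (32, 1) -> (32,)

batch = np.zeros((32, 256, 256, 3))
flattened = batch.reshape(32, -1)                   # -1 infers 256*256*3

seq = np.zeros((8, 100, 300))                       # (batch, seq_len, embedding_dim)
reordered = seq.transpose(0, 2, 1)                  # (batch, embedding_dim, seq_len)

print(gray_with_channel.shape, flat_preds.shape, flattened.shape, reordered.shape)
```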
Core NumPy Functions for Manipulation
  • reshape: General function for changing the shape of a tensor, including adding or removing dimensions.
  • expand_dims: Adds a new axis with size 1.
  • squeeze: Removes axes with size 1.
  • ravel: Flattens to 1D.
  • transpose: Changes the order of axes.
  • swapaxes: Swaps specified axes (less general than transpose).
Summary Table of Operations

  • Add dimension: np.expand_dims (convert (256, 256) to (256, 256, 1))
  • Remove dimension: np.squeeze (convert (N, 1) to (N,))
  • General reshape: np.reshape (any change matching the total size)
  • Flatten: np.ravel (convert (a, b) to (a*b,))
  • Swap axes: np.swapaxes (exchange the positions of two axes)
  • Permute axes: np.transpose (reorder any sequence of axes)

Closing Notes
  • A deep understanding of tensor structure - dimensions, size, and shape - is vital for preparing data for machine learning models.
  • Reshaping, expanding, squeezing, and transposing tensors are everyday tasks in model development, especially for adapting standard datasets and models to each other.

Related Shows Based on Content Similarities

Discover shows related to Machine Learning Guide, based on actual content similarities. Explore podcasts with similar topics, themes, and formats, backed by real data.
UI Breakfast: UI/UX Design and Product Strategy
My First Million
REWORK
All-In with Chamath, Jason, Sacks & Friedberg
Unlocking The AI Advantage
In Depth
Thinking Elixir Podcast
BVL Podcast
Creator Toolbox: Tools, Mindset and Workflows for Content Creators
AWS Podcast