All You Can Books

Data Science Decoded

Mike E

We discuss seminal mathematical papers (sometimes really old) that have shaped and established the fields of machine learning and data science as we know them today. The goal of the podcast is to introduce you to the evolution of these fields from a mathematical and slightly philosophical perspective. We will discuss the contribution of these papers not just from a pure math aspect, but also how they influenced the discourse in the field, which areas were opened up as a result, and so on. Our podcast episodes are also available on our YouTube channel: https://youtu.be/wThcXx_vXjQ?si=vnMfs

Podcast Episodes

Data Science #15 - The First Decision Tree Algorithm (1963)

In the 15th episode we went over the paper "Problems in the Analysis of Survey Data, and a Proposal" by James N. Morgan and John A. Sonquist from 1963. It highlights seven key issues in analyzing complex survey data: high dimensionality, categorical variables, measurement errors, sample variability, intercorrelations, interaction effects, and causal chains.


These challenges complicate efforts to draw meaningful conclusions about relationships between factors like income, education, and occupation. To address these problems, the authors propose a method that sequentially splits data by identifying features that reduce unexplained variance, much like modern decision trees.


The method focuses on maximizing explained variance (equivalently, minimizing the residual sum of squared errors, SSE), capturing interaction effects, and accounting for sample variability.


It handles both categorical and continuous variables while respecting logical causal priorities. This paper has had a significant influence on modern data science and AI, laying the groundwork for decision trees, CART, random forests, and boosting algorithms.
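
To make the splitting criterion concrete, here is a minimal Python sketch of a single AID-style split (our own illustration, not code from the paper; the function and variable names are ours): for every candidate feature and threshold it measures how much the within-group error sum of squares (SSE) would drop relative to the parent group, and keeps the best split.

```python
import numpy as np

def sse(y):
    """Sum of squared deviations from the group mean (unexplained variance)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    """Greedy AID-style split: choose the (feature, threshold) pair that most
    reduces SSE compared with keeping the whole group together."""
    parent_sse = sse(y)
    best = None  # (sse_reduction, feature_index, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:              # candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            reduction = parent_sse - (sse(left) + sse(right))
            if best is None or reduction > best[0]:
                best = (reduction, j, t)
    return best

# Toy example: income (in $1000s) explained by years of education.
X = np.array([[8], [10], [12], [12], [16], [18]], dtype=float)
y = np.array([20, 25, 30, 32, 55, 60], dtype=float)
print(best_split(X, y))   # splits low- vs. high-education respondents
```

Applying the same search recursively to each resulting group is essentially how later decision-tree methods such as CART grew out of this idea.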


Its method of splitting data to reduce error, handle interactions, and respect feature hierarchies is foundational in many machine learning models used today. Link to full paper at our website:

https://datasciencedecodedpodcast.com/episode-15-the-first-decision-tree-algorithm-1963

Download This Episode

Data Science #14 - The original k-means algorithm paper review (1957)

In the 14th episode we go over Stuart Lloyd's 1957 paper, "Least Squares Quantization in PCM" (which was only published in 1982). The k-means algorithm can be traced back to this paper. Lloyd introduces an approach to quantization in pulse-code modulation (PCM), which is essentially one-dimensional k-means clustering. Lloyd discusses how quantization intervals and their corresponding quantum values should be adjusted based on the signal's amplitude distribution to minimize noise, improving efficiency in PCM systems.


He derives an optimization framework that minimizes quantization noise under finite quantization schemes. Lloyd’s algorithm bears significant resemblance to the k-means clustering algorithm, both seeking to minimize a sum of squared errors.

In Lloyd's method, the quantization process is analogous to assigning data points (signal amplitudes) to clusters (quantization intervals) based on proximity to centroids (quantum values), with the centroids updated iteratively based on the mean of the assigned points.

This iterative process of recalculating quantization values mirrors k-means’ recalculation of cluster centroids. While Lloyd’s work focuses on signal processing in telecommunications, its underlying principles of optimizing quantization have clear parallels with the k-means method used in clustering tasks in data science. The paper's influence on modern data science is profound. Lloyd's algorithm not only laid the groundwork for k-means but also provided a fundamental understanding of quantization error minimization, critical in fields such as machine learning, image compression, and signal processing.
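
To illustrate this assign-then-update loop, here is a minimal one-dimensional sketch of the Lloyd/k-means iteration in Python (our own illustration, not the paper's PCM derivation; names are ours): samples play the role of signal amplitudes and centroids play the role of quantum values.

```python
import numpy as np

def lloyd_1d(samples, k, iters=100, seed=0):
    """One-dimensional Lloyd/k-means iteration: assign each sample (signal
    amplitude) to its nearest centroid (quantum value), then move each
    centroid to the mean of the samples assigned to it, until convergence."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(samples, size=k, replace=False)
    for _ in range(iters):
        # Assignment step: nearest centroid for every sample.
        labels = np.argmin(np.abs(samples[:, None] - centroids[None, :]), axis=1)
        # Update step: recompute each centroid as the mean of its cluster.
        new = np.array([samples[labels == j].mean() if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return np.sort(centroids)

# Toy signal: amplitudes drawn from three well-separated levels.
rng = np.random.default_rng(1)
amplitudes = np.concatenate([rng.normal(-2, 0.3, 200),
                             rng.normal(1, 0.3, 200),
                             rng.normal(4, 0.3, 200)])
print(lloyd_1d(amplitudes, k=3))  # approximately [-2, 1, 4]
```

Each pass can only lower the total squared quantization error, which is why the same iteration reappears as the standard k-means algorithm.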


The algorithm's simplicity, combined with its iterative nature, has led to its wide adoption in various data science applications. Lloyd's work remains a cornerstone in both the theory of clustering algorithms and practical applications in signal and data compression technologies.

Download This Episode

Data Science #13 - Kolmogorov complexity paper review (1965) - Part 2

In the 13th episode we review the second part of Kolmogorov's seminal paper: "Three Approaches to the Quantitative Definition of Information." Problems of Information Transmission 1.1 (1965): 1-7. The paper introduces algorithmic complexity (or Kolmogorov complexity), which measures the amount of information in an object based on the length of the shortest program that can describe it.

This shifts focus from Shannon entropy, which measures uncertainty probabilistically, to understanding the complexity of structured objects.


Kolmogorov argues that systems like texts or biological data, governed by rules and patterns, are better analyzed by their compressibility—how efficiently they can be described—rather than by random probabilistic models. In modern data science and AI, these ideas are crucial. Machine learning models, like neural networks, aim to compress data into efficient representations to generalize and predict. Kolmogorov complexity underpins the idea of minimizing model complexity while preserving key information, which is essential for preventing overfitting and improving generalization.
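
Kolmogorov complexity is uncomputable in general, but an off-the-shelf compressor gives a crude, computable upper bound on description length, which is enough to illustrate the core idea that structured data admits a much shorter description than random data. A minimal Python sketch (our own illustration, not from the paper):

```python
import os
import zlib

def compressed_length(data: bytes) -> int:
    """Length of a zlib-compressed encoding: a rough, computable upper bound
    on the (uncomputable) Kolmogorov complexity of the data."""
    return len(zlib.compress(data, 9))

structured = b"ab" * 5000          # highly patterned: a short description exists
random_ish = os.urandom(10000)     # near-incompressible: no shorter description

print(compressed_length(structured))  # a few dozen bytes
print(compressed_length(random_ish))  # close to the original 10,000 bytes
```

The same intuition, that a good model is a short description of the data, is what connects Kolmogorov's definition to the idea of minimizing model complexity discussed above.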


In AI, tasks such as text generation and data compression directly apply Kolmogorov's concept of finding the most compact representation, making his work foundational for building efficient, powerful models. This is part 2 out of 2 episodes covering this paper (the first one is in Episode 12).

Download This Episode

Data Science #12 - Kolmogorov complexity paper review (1965) - Part 1

In the 12th episode we review the first part of Kolmogorov's seminal paper:

"Three Approaches to the Quantitative Definition of Information." Problems of Information Transmission 1.1 (1965): 1-7. The paper introduces algorithmic complexity (or Kolmogorov complexity), which measures the amount of information in an object based on the length of the shortest program that can describe it.

This shifts focus from Shannon entropy, which measures uncertainty probabilistically, to understanding the complexity of structured objects.


Kolmogorov argues that systems like texts or biological data, governed by rules and patterns, are better analyzed by their compressibility—how efficiently they can be described—rather than by random probabilistic models. In modern data science and AI, these ideas are crucial. Machine learning models, like neural networks, aim to compress data into efficient representations to generalize and predict. Kolmogorov complexity underpins the idea of minimizing model complexity while preserving key information, which is essential for preventing overfitting and improving generalization.

In AI, tasks such as text generation and data compression directly apply Kolmogorov's concept of finding the most compact representation, making his work foundational for building efficient, powerful models. This is part 1 out of 2 episodes covering this paper (the second is in Episode 13).

Download This Episode

How It Works

30-day FREE trial

Get ALL YOU CAN BOOKS absolutely FREE for 30 days. Download our FREE app and enjoy unlimited downloads of our entire library with no restrictions.

UNLIMITED access

Have immediate access and unlimited downloads to over 200,000 books, courses, podcasts, and more with no restrictions.

Forever Downloads

Everything you download during your trial is yours to keep and enjoy for free, even if you cancel during the trial. Cancel Anytime. No risk. No obligations.

Significant Savings

For just $24.99 per month, you can continue to have unlimited access to our entire library. To put that into perspective, most other services charge the same amount for just one book!

Start Your Free Trial Now

Our Story

Welcome to All You Can Books, the ultimate destination for book lovers.

As avid readers, we understand the joy of immersing ourselves in a captivating story or getting lost in the pages of a good book. That's why we founded All You Can Books back in 2010, to create a platform where people can access an extensive library of quality content and discover new favorites.

Since our founding days, we’ve continuously added to our vast library and currently have over 200,000 titles, including ebooks, audiobooks, language learning courses, podcasts, bestseller summaries, travel books, and more! Our goal at All You Can Books is to ensure we have something for everyone.

Join our community of book lovers and explore the world of literature and beyond!