Statistics for Data Science

Next course starts on April 1st

Introduction

Give yourself a boost with this 4-session focused course. We'll cover everything you need to land your dream job, from functions and random variables to state-of-the-art methods like XGBoost.

Syllabus

Session 1 — Foundations: Functions, Random Variables, and ML Motivation

1. Mathematical Foundations

  • What is a function? Domain, codomain, mappings

  • Deterministic vs stochastic functions

2. Random Variables & Distributions

  • Discrete vs continuous RVs

  • PMF, PDF, CDF

  • Joint distributions, independence

3. Moments & Their Role in ML

  • Expectation

  • Variance

  • Covariance, correlation

4. Connecting Probability to ML

  • Bias–variance decomposition

  • Underfitting vs overfitting

  • Why randomness matters for generalization
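As a taste of the session's material, here is a minimal sketch (assuming NumPy is installed; the functions `f` and `g` are illustrative) contrasting a deterministic function with a stochastic one, and estimating a random variable's first two moments from a sample:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A deterministic function: the same input always gives the same output.
def f(x):
    return 2 * x + 1

# A stochastic "function": the output also depends on random noise.
def g(x):
    return 2 * x + 1 + rng.normal(loc=0.0, scale=0.5)

# Draw a large sample from a continuous random variable
# and estimate its moments from the data.
sample = rng.normal(loc=3.0, scale=2.0, size=100_000)
print(sample.mean())  # sample mean: close to the expectation, 3.0
print(sample.var())   # sample variance: close to 2.0 ** 2 = 4.0
```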

Session 2 — Asymptotic Theory + ML Framework

1. Asymptotics for Learning

  • Convergence in probability and distribution

  • Law of Large Numbers, Central Limit Theorem

2. Classical Inference

  • Hypothesis testing and confidence intervals

  • Linking statistical inference to ML validation

3. ML Workflow & Core Concepts

  • Data Generating Process (DGP)

  • Train/validation/test splits and cross-validation

4. Loss Functions & Optimization

  • MSE, MAE, log-loss, hinge loss

  • Choosing the right loss

  • Gradient descent

5. Model Evaluation

  • Precision, recall, F1, accuracy, and class imbalance

Session 3 — Foundational Algorithms: Regression & Simple ML Models

1. Supervised Learning Types

  • Regression vs classification

  • Continuous vs discrete outputs

2. Linear Models

  • Linear regression

  • Logistic regression

  • Interpretation, assumptions

  • When linear models work (and don’t)

3. Distance-Based Methods

  • k-Nearest Neighbors

  • Distance metrics

  • Strengths and limitations

4. Basic Unsupervised (Minimal Coverage)

  • k-Means clustering

  • Use cases (segmentation, exploration)

  • Caveats (scaling, shape assumptions)

5. How These Models Fit Together

  • Parametric vs non-parametric

  • Low-variance vs high-variance models

Session 4 — Advanced Algorithms: Trees, Ensembles, and Modern ML

1. Decision Trees

  • Splitting criteria (Gini, entropy)

  • Depth, pruning, overfitting

2. Bagging & Random Forests

  • Bootstrapping

  • Aggregation

  • Feature randomness

  • Out-of-bag evaluation

3. Boosting

  • AdaBoost intuition

  • Gradient boosting framework

  • High-level intuition of XGBoost

4. Comparing the Major Algorithms

  • When linear models win

  • When kNN is appropriate

  • When trees and ensembles dominate

  • Interpretability vs predictive performance

5. Practical Modeling Pipeline Wrap-Up

  • Choosing models based on data

  • Feature engineering considerations

  • Real-world ML workflow
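By the end of the course, these pieces fit into a complete workflow. A minimal sketch, assuming scikit-learn is installed (the built-in dataset and the choice of logistic regression are illustrative): split the data, fit a model, and score it with precision, recall, and F1.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

# Load a small built-in binary classification dataset.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set so evaluation reflects generalization, not memorization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a simple baseline classifier.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Evaluate on the held-out data.
pred = model.predict(X_test)
print(f"precision = {precision_score(y_test, pred):.3f}")
print(f"recall    = {recall_score(y_test, pred):.3f}")
print(f"f1        = {f1_score(y_test, pred):.3f}")
```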

Schedule & Location

A 4-session, in-person course in the heart of Budapest. Classes are held on Sundays, usually from 9am to 1 or 2pm.

Pricing

Statistics for Data Science - 110.000 HUF