Core Concepts of Machine Learning (classical)

Machine learning is often discussed in the context of deep learning and neural networks — but long before that, a rich set of classical techniques laid the groundwork for intelligent systems.

In this post, I’ll walk through the core concepts of classical machine learning, from how models learn patterns in data to essential ideas like overfitting, bias-variance trade-off, and model evaluation.

Whether you’re new to the field or revisiting the fundamentals, this guide aims to clarify the building blocks that every ML practitioner should understand.

Supervised Learning

Supervised learning is a type of machine learning where the model is trained on labeled data, meaning each input has a corresponding known output (target). The algorithm learns a mapping function from inputs to outputs to make predictions or classifications on unseen data.

Key Concept: Input → Output (with labels)
Example Use Case: Predicting house prices based on size, location, etc.

Unsupervised Learning

Unsupervised learning is a type of machine learning where the model is given data without explicit labels or targets. The goal is to discover underlying patterns, structures, or relationships in the data, such as clustering similar items or reducing dimensions.

Key Concept: Input → Discover hidden structures (no labels)
Example Use Case: Grouping customers into clusters based on purchasing behavior.

Reinforcement Learning

Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions and aims to maximize cumulative rewards over time by improving its decision-making policy.

Key Concept: Agent → Environment → Rewards/Penalties → Policy improvement
Example Use Case: Teaching a robot to navigate a maze or training an AI to play games like chess.

Two Main Techniques for Reinforcement Learning

Value-Based Methods
- Description: These methods focus on learning a value function that estimates the expected cumulative reward for each state (or state-action pair). The agent uses this value function to choose actions that maximize the future reward.
- Key Example: Q-Learning
  Q-Learning is an off-policy algorithm that learns the optimal action-value function $Q(s,a)$ by iteratively updating it based on the Bellman equation. The agent chooses actions based on the current estimate of $Q(s,a)$ (e.g., using an epsilon-greedy policy).
- Strength: Simple and effective for problems with discrete action spaces.
- Limitation: Struggles with large or continuous state-action spaces.
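
As a rough illustration of the Q-Learning update described above, here is a minimal tabular sketch. The five-state corridor environment, the `step` function, the reward of 1 at the rightmost state, and all hyperparameter values are assumptions made for this example, not part of any particular library.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (0, 1)    # 0 = move left, 1 = move right

def step(state, action):
    """Toy environment (an assumption): reward 1 on reaching the last state."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy behaviour policy, breaking ties at random
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            nxt, reward, done = step(state, action)
            # Off-policy target: greedy value of the next state (Bellman update)
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q
```

After training, reading off `max(q[s])` in each state gives the greedy policy, which in this toy corridor should prefer moving right.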

Policy-Based Methods
- Description: These methods focus on directly learning the policy $\pi(a|s)$, which is the probability distribution over actions given a state. Instead of estimating value functions, the agent learns to optimize the policy to maximize expected rewards.
- Key Example: REINFORCE (Policy Gradient)
  REINFORCE uses gradient ascent to update the policy parameters by maximizing the expected reward. The policy is typically modeled as a neural network.
- Strength: Works well for high-dimensional and continuous action spaces.
- Limitation: Can suffer from high variance in gradient estimates.
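
To make the policy-gradient idea concrete, the sketch below applies the REINFORCE update to a one-step problem (a two-armed bandit). It uses a small softmax policy over action preferences instead of a neural network, and the reward values, step count, and learning rate are hypothetical choices for illustration.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(rewards=(0.0, 1.0), steps=500, alpha=0.5, seed=0):
    """REINFORCE on one-step episodes with a softmax policy (illustrative)."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)    # policy parameters (action preferences)
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        ret = rewards[action]       # return of this one-step episode
        # d/d theta_j of log pi(action) is 1[j == action] - pi(j)
        for j in range(len(theta)):
            grad_log = (1.0 if j == action else 0.0) - probs[j]
            theta[j] += alpha * ret * grad_log   # gradient ascent step
    return softmax(theta)
```

Each update is based on a single sampled action, which is where the high variance mentioned above comes from.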

Key Difference:

Value-Based Methods: Indirectly learn the policy by first estimating value functions.
Policy-Based Methods: Directly learn the policy without relying on value functions.

In practice, a combination of both, called Actor-Critic Methods, is often used to leverage the strengths of both approaches.

Feature Engineering

Feature engineering is the process of transforming raw data into meaningful inputs for machine learning algorithms. It plays a critical role in classical ML, where performance often depends less on the model itself and more on the quality and representation of features.

Typical feature engineering tasks include:

Encoding categorical variables (e.g., one-hot encoding, label encoding)
Scaling and normalization (e.g., Min-Max, StandardScaler)
Creating interaction terms (e.g., feature₁ × feature₂)
Extracting domain-specific features (e.g., time-of-day from timestamps)
Handling missing values through imputation or removal
Dimensionality reduction (e.g., PCA, feature selection)
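
A few of the tasks listed above can be sketched in plain Python. The function names here are hypothetical helpers written for this post; in practice a library such as scikit-learn would usually handle these steps.

```python
def min_max_scale(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """Return the sorted vocabulary and one indicator row per input item."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    rows = []
    for c in categories:
        row = [0] * len(vocab)
        row[index[c]] = 1
        rows.append(row)
    return vocab, rows

def interaction(a, b):
    """Element-wise product of two feature columns."""
    return [x * y for x, y in zip(a, b)]
```

For example, `min_max_scale([10, 20, 30])` gives `[0.0, 0.5, 1.0]`.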

A strong mental model here is:

Your model is only as good as the information you feed it.

Well-engineered features expose patterns that a learning algorithm can exploit, while poor features can bury useful signal or mislead the model entirely.

Model Evaluation

Evaluating a machine learning model means measuring how well it generalizes to unseen data — not how well it memorizes the training set. This distinction is vital, and classical ML emphasizes this separation strongly.

Common evaluation metrics include:

Classification: Accuracy, Precision, Recall, F1-Score, ROC AUC
Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), $R^2$ Score
Clustering: Silhouette Score, Adjusted Rand Index, Davies–Bouldin Index
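
To show how a few of these metrics relate to each other, here is a small sketch that computes them from raw predictions. The helper names are illustrative, and returning 0.0 when a denominator is zero is one possible convention, assumed here for simplicity.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive=1):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```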

Other key practices:

Train/test split or cross-validation to simulate out-of-sample performance
Confusion matrices to understand classification trade-offs
Learning curves to diagnose underfitting vs. overfitting
Hyperparameter tuning (e.g., grid search, random search)
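
The cross-validation idea from the list above can be sketched as an index generator. This is a simplified version that assumes the data has already been shuffled; `k_fold_indices` is a hypothetical helper, not a library function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

Each sample lands in the test set exactly once, so averaging a metric over the folds gives an estimate of out-of-sample performance that does not depend on a single split.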

A good rule of thumb:

Always evaluate with data the model has never seen — and use multiple metrics to avoid tunnel vision.

Linear Regression

Explanation: A supervised algorithm used for predicting continuous numerical outputs by finding the best-fit linear relationship between input features and the target variable.
Examples:
1. Predicting house prices based on size, number of rooms, and location.
2. Estimating sales revenue based on advertising spending.

Mathematical Principle:

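The model predicts the target as a weighted sum of the input features plus a bias term:

$\hat{y} = w_1x_1 + w_2x_2 + \cdots + w_nx_n + b$

The weights $w_j$ and bias $b$ are chosen to minimize the Mean Squared Error (MSE) over the $m$ training examples:

$\text{MSE} = \frac{1}{m} \sum_{i=1}^m (y_i - \hat{y}_i)^2$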
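
One way to carry out this minimization is batch gradient descent. The sketch below is a minimal version under assumed settings (learning rate, epoch count, zero initialization); real implementations typically use a closed-form or library solver instead.

```python
def predict(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def mse(weights, bias, X, y):
    return sum((yi - predict(weights, bias, xi)) ** 2
               for xi, yi in zip(X, y)) / len(y)

def fit_linear_regression(X, y, lr=0.05, epochs=2000):
    """Minimize MSE with batch gradient descent (illustrative settings)."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        errors = [predict(weights, bias, xi) - yi for xi, yi in zip(X, y)]
        # Gradients of the MSE with respect to each weight and the bias
        grad_w = [2.0 / n_samples * sum(e * xi[j] for e, xi in zip(errors, X))
                  for j in range(n_features)]
        grad_b = 2.0 / n_samples * sum(errors)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias
```

On data generated exactly by a line, such as `y = 2x + 1`, the fitted weight and bias should approach 2 and 1.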

Logistic Regression

Explanation: A supervised algorithm used for binary classification (and, through extensions such as softmax or one-vs-rest, multi-class classification) by modeling the probability of a target class using a logistic (sigmoid) function.
Examples:
1. Predicting whether an email is spam or not.
2. Diagnosing whether a tumor is benign or malignant.

Mathematical Principle:

Logistic regression models the probability of an outcome as:

$P(y=1|x) = \frac{1}{1 + e^{-(w^Tx + b)}}$

The parameters $w$ and $b$ are fit by maximizing the log-likelihood of the training labels:

$\text{Log-Likelihood} = \sum_{i=1}^n \left[ y_i \log(P(y=1|x_i)) + (1-y_i) \log(1 - P(y=1|x_i)) \right]$
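
The two formulas above translate fairly directly into code. The sketch below fits the parameters with plain gradient ascent on the average log-likelihood; the learning rate, epoch count, and function names are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def log_likelihood(weights, bias, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(weights, bias, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def fit_logistic_regression(X, y, lr=0.1, epochs=1000):
    """Gradient ascent on the average log-likelihood (illustrative settings)."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        # The gradient of the log-likelihood is sum_i (y_i - p_i) * x_i
        diffs = [yi - predict_proba(weights, bias, xi) for xi, yi in zip(X, y)]
        for j in range(n_features):
            weights[j] += lr * sum(d * xi[j] for d, xi in zip(diffs, X)) / n_samples
        bias += lr * sum(diffs) / n_samples
    return weights, bias
```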

Decision Trees

Explanation: A supervised algorithm that splits the data into subsets based on feature conditions to make decisions. It’s interpretable and useful for both classification and regression.
Examples:
1. Determining loan eligibility based on income, age, and credit score.
2. Predicting whether a customer will churn.

Mathematical Principle:

Decision Trees split data using impurity measures like:

Gini Index: $\text{Gini} = 1 - \sum_{i=1}^k p_i^2$
Entropy: $\text{Entropy} = -\sum_{i=1}^k p_i \log_2(p_i)$

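Information Gain is the parent's impurity minus the weighted average impurity of the children, where each child $j$ is weighted by its share $\frac{n_j}{n}$ of the parent's samples: $\text{Information Gain} = \text{Parent Impurity} - \sum_j \frac{n_j}{n} \, \text{Child Impurity}_j$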
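
These impurity measures are short enough to write out directly. In this sketch the information gain weights each child's impurity by its share of the samples, which is the usual convention; the function names are chosen for this example.

```python
import math
from collections import Counter

def class_probabilities(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted
```

A split that separates a balanced two-class node into two pure children removes all impurity, so its information gain equals the parent's impurity.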

Random Forests

Explanation: An ensemble method that combines multiple decision trees to improve accuracy and reduce overfitting through random sampling of data and features.
Examples:
1. Predicting customer segmentation in marketing.
2. Detecting fraudulent transactions in financial systems.

Mathematical Principle:

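Random forests combine decision trees via bagging (bootstrap aggregating):

Each tree is trained on a bootstrap sample of the data (drawn with replacement) and considers only a random subset of features when splitting. The trees' outputs are then aggregated using: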

Classification: Majority vote of tree outputs.
Regression: Average of tree outputs.
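
The sampling and aggregation steps can be sketched on their own, leaving the tree training aside. The helper names below are hypothetical, and the predictions passed to the aggregators stand in for the outputs of already-trained trees.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Aggregate class predictions from several trees."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Aggregate numeric predictions from several trees."""
    return sum(predictions) / len(predictions)
```

Because each tree sees a slightly different sample and feature subset, their individual errors are less correlated, which is what makes the vote or average more stable than any single tree.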

Gradient Boosting (e.g., XGBoost, LightGBM, CatBoost)

Explanation: A powerful ensemble method that builds models sequentially, correcting errors of previous models, and works well for both classification and regression tasks.
Examples:
1. Predicting credit scores based on financial history.
2. Winning Kaggle competitions with tabular data.

Mathematical Principle:

Gradient Boosting minimizes a loss function $L(y, \hat{y})$ using gradients:

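$$
\hat{y}_{t+1} = \hat{y}_t - \eta \cdot \nabla_{\hat{y}} L(y, \hat{y}_t)
$$

where $\eta$ is the learning rate. In practice, each step fits a new weak learner (typically a shallow tree) to the negative gradient and adds its scaled prediction to the ensemble.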
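
For squared-error loss, the negative gradient with respect to the current prediction is simply the residual, so each boosting round fits a weak learner to the residuals. The sketch below uses one-feature regression stumps as the weak learners; the stump fitting, round count, and learning rate are simplifying assumptions, not how XGBoost, LightGBM, or CatBoost work internally.

```python
def fit_stump(xs, residuals):
    """Best single-threshold stump under squared error (needs 2+ distinct xs)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def gradient_boost(xs, ys, rounds=50, eta=0.1):
    base = sum(ys) / len(ys)            # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        # Residuals are the negative gradient of 0.5 * (y - pred)^2
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + eta * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + eta * sum(s(x) for s in stumps)
```

A small `eta` means each round corrects only part of the remaining error, which is why boosting usually needs many rounds.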

k-Means (Clustering)

Explanation: An unsupervised algorithm that groups data into k clusters by minimizing the variance within each cluster.
Examples:
1. Grouping similar customers based on purchasing behavior.
2. Segmenting regions in satellite images.

Mathematical Principle:

k-Means minimizes the Within-Cluster Sum of Squares (WCSS):

$$
\text{WCSS} = \sum_{i=1}^k \sum_{x \in C_i} \|x - \mu_i\|^2
$$

where $C_i$ is a cluster and $\mu_i$ is its centroid.
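
A common way to reduce the WCSS is Lloyd's algorithm, which alternates between assigning points to the nearest centroid and recomputing centroids. The sketch below is a minimal version; initializing from randomly chosen data points and the fixed iteration cap are assumptions made for brevity.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_point(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def k_means(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialize from random data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

def wcss(centroids, clusters):
    return sum(squared_distance(p, mu)
               for mu, cluster in zip(centroids, clusters) for p in cluster)
```

The result depends on the initial centroids, so in practice the algorithm is often run several times and the run with the lowest WCSS is kept.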

DBSCAN (Density-Based Clustering)

Explanation: An unsupervised algorithm that identifies clusters based on the density of data points, detecting outliers as noise.
Examples:
1. Identifying anomalies in network traffic data.
2. Grouping galaxies in astronomical datasets.

Mathematical Principle:

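A core point has at least $\text{MinPts}$ neighbors within a radius $\varepsilon$. Clusters grow by density connectivity: any point within $\varepsilon$ of a core point joins that core point's cluster, even if it is not a core point itself (a border point). Points that are not within $\varepsilon$ of any core point are labeled as noise (outliers).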
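
The density rules above can be sketched as a short, unoptimized implementation. It assumes that a point counts as its own neighbor when comparing against `min_pts` and labels noise as `-1`; both are conventions picked for this example.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE           # may later be relabeled as a border point
            continue
        labels[i] = cluster             # i is a core point: start a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster     # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is also core: keep expanding
        cluster += 1
    return labels
```

Unlike k-Means, the number of clusters is not specified up front; it falls out of the choice of `eps` and `min_pts`.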

Support Vector Machines (SVMs)

Explanation: A supervised algorithm that finds the optimal hyperplane separating the classes by maximizing the margin between them. A variant, support vector regression, applies the same idea to continuous targets. It can use kernels for non-linear problems.
Examples:
1. Classifying handwritten digits in the MNIST dataset.
2. Categorizing emails as spam or not spam.

Mathematical Principle:

Maximizes the margin:

$$
\text{Maximize: } \frac{2}{\|w\|}
$$

Subject to:

$$
y_i (w^T x_i + b) \geq 1, \; \forall i
$$

Non-linear problems are solved using the Kernel Trick.
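
The sketch below does not train an SVM; it only evaluates the quantities in the formulation above for a given hyperplane, and adds an RBF kernel as one common example of a kernel function. The function names and the `gamma` default are assumptions for illustration.

```python
import math

def decision_value(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Width of the margin, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def satisfies_constraints(w, b, X, y):
    """Check y_i (w^T x_i + b) >= 1 for every training point (labels are +1/-1)."""
    return all(yi * decision_value(w, b, xi) >= 1 for xi, yi in zip(X, y))

def rbf_kernel(x, z, gamma=1.0):
    """Similarity used in place of the dot product for non-linear boundaries."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))
```

Among all hyperplanes that satisfy the constraints, training selects the one with the smallest $\|w\|$, and therefore the widest margin.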

k-Nearest Neighbors (k-NN)

Explanation: A supervised algorithm that classifies a data point by the majority vote of its k nearest labeled points, or predicts a value by averaging their targets.
Examples:
1. Recommending movies based on user ratings.
2. Identifying species of flowers based on petal and sepal measurements.

Mathematical Principle:

Uses a distance metric (e.g., Euclidean Distance):

$$
d(x_1, x_2) = \sqrt{\sum_{i=1}^n (x_{1,i} - x_{2,i})^2}
$$
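
Because k-NN has no training step beyond storing the data, a brute-force version fits in a few lines. This sketch assumes Euclidean distance and a default of `k=3`; ties in the vote are resolved by whichever label was counted first.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(train_points, train_targets, query, k):
    """Targets of the k training points closest to the query."""
    ranked = sorted(zip(train_points, train_targets),
                    key=lambda pair: euclidean(pair[0], query))
    return [target for _, target in ranked[:k]]

def knn_classify(train_points, train_labels, query, k=3):
    votes = k_nearest(train_points, train_labels, query, k)
    return Counter(votes).most_common(1)[0][0]

def knn_regress(train_points, train_values, query, k=3):
    nearest = k_nearest(train_points, train_values, query, k)
    return sum(nearest) / len(nearest)
```

Since every prediction scans the whole training set, this approach becomes slow on large datasets, and features usually need to be scaled so that no single one dominates the distance.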

Conclusion

While modern machine learning is often dominated by deep learning and large language models, the classical foundations remain essential. Concepts like feature engineering, model evaluation, generalization, and learning paradigms form the core mental models every ML engineer or data scientist should carry with them — regardless of the complexity of the tools they use.

Understanding these fundamentals not only improves your ability to choose and apply algorithms effectively, but also helps you communicate clearly, troubleshoot with confidence, and build more reliable systems.

In the end, mastering the basics is what enables you to go further — with clarity, precision, and purpose.