Supervised and Unsupervised Learning
Machine learning algorithms are broadly categorized based on how they utilize data. The two most prominent paradigms are Supervised Learning and Unsupervised Learning.
Supervised Learning
In Supervised Learning, models are trained on labeled data. This means that for every input example x, the algorithm is also given a "ground truth" target label y.
The goal is to learn a mapping from inputs to labels that generalizes well to unseen data.
Key Tasks:
- Regression: Predicting a continuous value (e.g., estimating house prices from square footage).
- Classification: Predicting a discrete category (e.g., deciding whether an email is "Spam" or "Not Spam").
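To make the regression task concrete, here is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The function names (`fit_line`, `predict_price`) and the tiny data set are made up for illustration; real projects would normally rely on a library rather than hand-written formulas.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(model, x):
    """Apply the fitted line to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical labeled data: square footage -> price (in thousands).
square_feet = [1000, 1500, 2000, 2500]
prices = [200, 300, 400, 500]

model = fit_line(square_feet, prices)
estimate = predict_price(model, 1800)  # prediction for an unseen house
```

The labels (`prices`) are what make this supervised: the fit is judged by how closely predictions match known answers.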
Common Algorithms:
- Linear Regression
- Logistic Regression
- Decision Trees & Random Forests
- Support Vector Machines (SVM)
- Neural Networks
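As an illustration of classification, the sketch below trains a decision stump, that is, a decision tree with a single split, on one numeric feature. The feature (a count of suspicious words per email), the data, and the function names are assumptions made for this example, and the stump only considers rules of the form "predict 1 when the feature exceeds a threshold".

```python
def fit_stump(xs, ys):
    """Pick the threshold t with the fewest training errors
    for the rule: predict 1 if x > t, else 0."""
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_t, best_errors = None, len(xs) + 1
    for t in candidates:
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def predict_stump(threshold, x):
    """Classify a new example with the learned threshold."""
    return 1 if x > threshold else 0


# Hypothetical labeled data: suspicious-word counts, 1 = "Spam", 0 = "Not Spam".
counts = [0, 1, 2, 3, 7, 8, 9, 10]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = fit_stump(counts, labels)
```

Full decision trees repeat this kind of split recursively over many features, and random forests combine many such trees.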
Unsupervised Learning
Unsupervised Learning works with unlabeled data. The algorithm is given only the inputs x and must discover hidden patterns, structures, or distributions within the data.
There is no "right answer" provided; the model focuses on finding similarities or differences.
Key Tasks:
- Clustering: Grouping similar data points together (e.g., customer segmentation).
- Dimensionality Reduction: Simplifying data by reducing the number of features while preserving essential information (e.g., PCA).
- Anomaly Detection: Finding unusual data points that deviate from the norm.
- Density Estimation: Estimating the underlying distribution of the data.
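The last two tasks are closely related, and a small sketch can show how. The code below fits a single Gaussian to one-dimensional data (a very simple density estimate) and flags values that lie far from the mean. The readings, the function names, and the cutoff of 2 standard deviations are all assumptions chosen for illustration.

```python
import math


def fit_gaussian(values):
    """Estimate the mean and (population) standard deviation of a 1-D sample."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)


def find_anomalies(values, max_z=2.0):
    """Return the values lying more than `max_z` standard deviations from the mean."""
    mean, std = fit_gaussian(values)
    return [v for v in values if abs(v - mean) / std > max_z]


# Hypothetical unlabeled sensor readings; nothing marks which one is unusual.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9, 25.0]
outliers = find_anomalies(readings)
```

One caveat of this simple approach is that the anomaly itself inflates the estimated mean and standard deviation, so more robust estimators are often preferred in practice.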
Common Algorithms:
- k-means Clustering
- Principal Component Analysis (PCA)
- Autoencoders
- Gaussian Mixture Models (GMM)
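As an example of the first algorithm in this list, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The initial centroids are passed in explicitly to keep the example deterministic, which is an assumption of this sketch; library implementations typically choose them randomly or with a scheme such as k-means++. The points and names are hypothetical.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid (squared Euclidean distance).
        assignments = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: mean of the members of each cluster.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # centroids stopped moving, so the assignments are final
        centroids = new_centroids
    return centroids, assignments


# Hypothetical unlabeled 2-D points forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, assignments = kmeans(points, initial_centroids=[points[0], points[3]])
```

No labels are used anywhere: the cluster indices in `assignments` are arbitrary identifiers discovered from the geometry of the data.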
Visualizing the Difference
Observe how supervised learning uses labels (shown as colors) to learn class distinctions, while unsupervised learning sees only the raw distribution of the data.
Supervised Learning (Labeled Data): Data points are associated with known target classes.
Unsupervised Learning (Unlabeled Data): Raw data points without any target labels or categories.
Semi-Supervised Learning: A hybrid approach where the model is trained on a small amount of labeled data and a large amount of unlabeled data. This is common when labeling data is expensive but raw data is abundant.
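One way to picture the hybrid approach is self-training (pseudo-labeling), sketched below with a nearest-centroid classifier on one-dimensional data. This is only one of several semi-supervised techniques, chosen here for brevity. The data, class names, and function names are hypothetical, and the sketch pseudo-labels every unlabeled point, whereas practical versions usually keep only high-confidence predictions.

```python
def centroids_from(labeled):
    """Compute the mean feature value per class from (value, label) pairs."""
    sums, counts = {}, {}
    for value, label in labeled:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(centroids, value):
    """Assign the class whose centroid is closest to `value`."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def self_train(labeled, unlabeled, rounds=5):
    """Fit on the labeled data, then repeatedly pseudo-label the
    unlabeled data and refit on the combined set."""
    centroids = centroids_from(labeled)
    for _ in range(rounds):
        pseudo = [(v, classify(centroids, v)) for v in unlabeled]
        centroids = centroids_from(labeled + pseudo)
    return centroids


# Two labeled examples (expensive to obtain) and many unlabeled ones (cheap).
labeled = [(1.0, "A"), (9.0, "B")]
unlabeled = [1.5, 2.0, 2.5, 3.0, 7.0, 7.5, 8.0, 8.5]

model = self_train(labeled, unlabeled)
```

The unlabeled points pull each centroid toward the center of its group, so the final model reflects the shape of the raw data and not just the two labeled examples.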