How Machines Learn: The Complete Guide to Training, Features, and Labels

Understand the core mechanics of machine learning: training data, features, labels, and the learning loop. Includes hands-on gradient descent example.


How Machines Learn: The Core Mechanics

Reading Time: 20 minutes | Difficulty: Beginner | Track: Practical

Prerequisites: What is AI and Machine Learning - Understanding of AI/ML distinction

You've heard that machines can "learn." But what does that actually mean? How does showing a computer thousands of cat pictures teach it to recognize cats?

In this article, we'll demystify the learning process. You'll understand training data, features, labels, and most importantly, the learning loop that makes it all work. By the end, you'll code a simple learning algorithm from scratch.



What You'll Learn

By the end of this article, you'll be able to:

  • [ ] Explain how machines learn from data
  • [ ] Understand training data, features, and labels
  • [ ] Describe the learning loop (predict → error → adjust)
  • [ ] Distinguish between supervised, unsupervised, and reinforcement learning
  • [ ] Implement gradient descent from scratch

Time investment: ~30 minutes reading + coding


The Big Idea: Learning as Function Approximation {#the-big-idea}

As a Developer, You Write Functions

def predict_house_price(square_feet, bedrooms, location):
    """Traditional programming: YOU write the logic."""
    if location == "downtown":
        base_price = 500_000
    else:
        base_price = 300_000

    return base_price + (square_feet * 200) + (bedrooms * 50_000)

# Predict
price = predict_house_price(1800, 3, "suburbs")  # $810,000 (300,000 + 360,000 + 150,000)

The problem: How did you know those numbers? What if they're wrong? What about 100 other factors you haven't considered?

Machine Learning: The Algorithm Writes the Function

# You provide examples (data)
training_data = [
    {"sqft": 1500, "beds": 3, "location": "suburbs", "price": 350_000},
    {"sqft": 2000, "beds": 4, "location": "downtown", "price": 550_000},
    {"sqft": 1200, "beds": 2, "location": "suburbs", "price": 280_000},
    # ... thousands more examples
]

# Algorithm LEARNS the function
model = train_model(training_data)

# Now predict (model figured out the logic!)
price = model.predict({"sqft": 1800, "beds": 3, "location": "suburbs"})

The magic: The model finds patterns in your data and creates its own function.
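
To make the idea concrete, here is a minimal sketch of what a train_model function could look like. It assumes a plain linear model fitted with ordinary least squares; the encode helper and the LinearModel class are hypothetical names made up for this illustration, and real projects would usually reach for a library instead.

```python
import numpy as np

def encode(example):
    """Turn one example into a numeric row: sqft, beds, is_downtown, constant 1."""
    return [
        example["sqft"],
        example["beds"],
        1.0 if example["location"] == "downtown" else 0.0,
        1.0,  # constant term so the model can learn a base price
    ]

class LinearModel:
    """Hypothetical container for the learned weights."""

    def __init__(self, weights):
        self.weights = weights

    def predict(self, example):
        return float(np.dot(encode(example), self.weights))

def train_model(training_data):
    """Illustrative stand-in for train_model: ordinary least squares."""
    X = np.array([encode(ex) for ex in training_data], dtype=float)
    y = np.array([ex["price"] for ex in training_data], dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return LinearModel(weights)

training_data = [
    {"sqft": 1500, "beds": 3, "location": "suburbs", "price": 350_000},
    {"sqft": 2000, "beds": 4, "location": "downtown", "price": 550_000},
    {"sqft": 1200, "beds": 2, "location": "suburbs", "price": 280_000},
]

model = train_model(training_data)
print(model.predict({"sqft": 1800, "beds": 3, "location": "suburbs"}))
```

With only three examples the model can reproduce every training price exactly, which says nothing about how well it predicts new houses. That is why the "thousands more examples" matter.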

Key Insight: ML is function approximation. The model learns to map inputs → outputs from examples.


Training Data: The Foundation {#training-data}

What is Training Data?

Training data = A collection of examples that teach the model what to predict.

Each example has:

  • Input data (features)
  • Correct answer (label) - for supervised learning

Think of it like flashcards:

  • Front of card: Question (features)
  • Back of card: Answer (label)
  • Deck: Training dataset

Real Example: Email Spam Detection

# Training data structure
training_emails = [
    {
        # Features (input)
        "text": "Meeting tomorrow at 3pm",
        "sender": "colleague@company.com",
        "has_links": False,
        "urgency_words": 0,

        # Label (output)
        "is_spam": False
    },
    {
        "text": "URGENT: You've won $1,000,000!!!",
        "sender": "unknown@suspicious.ru",
        "has_links": True,
        "urgency_words": 2,

        "is_spam": True
    },
    # ... thousands more examples
]
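
Most learning algorithms expect the features and labels as two separate collections. A minimal sketch of that split, assuming we only use the two numeric features from the example above (FEATURE_NAMES and split_features_and_labels are illustrative names, not part of any library):

```python
training_emails = [
    {"text": "Meeting tomorrow at 3pm", "sender": "colleague@company.com",
     "has_links": False, "urgency_words": 0, "is_spam": False},
    {"text": "URGENT: You've won $1,000,000!!!", "sender": "unknown@suspicious.ru",
     "has_links": True, "urgency_words": 2, "is_spam": True},
]

# Assumption: only the already-numeric features are used here.
FEATURE_NAMES = ["has_links", "urgency_words"]

def split_features_and_labels(examples):
    """Return (X, y): one feature row per example and one label per example."""
    X = [[float(ex[name]) for name in FEATURE_NAMES] for ex in examples]
    y = [int(ex["is_spam"]) for ex in examples]
    return X, y

X, y = split_features_and_labels(training_emails)
print(X)  # [[0.0, 0.0], [1.0, 2.0]]
print(y)  # [0, 1]
```

The text and sender fields are left out on purpose: they need to be turned into numbers first, which is what feature engineering is about.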

The Cardinal Rule

Your model can only be as good as your training data.

| Data Problem | Result |
| --- | --- |
| Too few examples | Model can't generalize (memorizes instead of learning) |
| Biased examples | Model inherits bias (e.g., only photos of cats in baskets → thinks all cats are in baskets) |
| Noisy labels | Model learns noise (garbage in = garbage out) |
| Missing key features | Model can't find patterns (like predicting house prices without location) |

Features: What the Model Sees {#features}

Features Defined

Features (also called attributes, variables, predictors, inputs) = The information your model uses to make predictions.

Think of features as the questions you'd ask to make a decision:

  • Predicting house price? Ask: square footage, bedrooms, location, age
  • Detecting spam? Ask: sender, keywords, link count, urgency words
  • Diagnosing disease? Ask: symptoms, test results, patient history

Raw Data vs Features

Raw data (free text, timestamps, images) usually has to be transformed into numeric or categorical features before a model can use it.
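
As a rough sketch of that transformation, the snippet below turns a raw email into a few features similar to the spam example earlier. The keyword list and the extract_features helper are assumptions chosen for illustration only.

```python
import re

# Hypothetical keyword list, picked only for this example.
URGENCY_WORDS = {"urgent", "won", "winner", "immediately"}

def extract_features(raw_email):
    """Turn a raw email (text + sender) into simple model-ready features."""
    text = raw_email["text"]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "has_links": "http://" in text or "https://" in text,
        "urgency_words": sum(1 for word in words if word in URGENCY_WORDS),
        "exclamation_marks": text.count("!"),
        "sender_domain": raw_email["sender"].split("@")[-1],
    }

raw = {
    "text": "URGENT: You've won $1,000,000!!! Claim at http://example.com/prize",
    "sender": "unknown@suspicious.ru",
}

features = extract_features(raw)
print(features)
# {'has_links': True, 'urgency_words': 2, 'exclamation_marks': 3, 'sender_domain': 'suspicious.ru'}
```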

Types of Features

# Numerical features
age = 35  # Continuous number
years_experience = 10  # Discrete count
temperature = 72.5  # Continuous measurement

# Categorical features
department = "engineering"  # Nominal (no order)
education = "masters"  # Ordinal (has order: HS < BS < MS < PhD)
color = "red"  # Nominal

# Binary features
is_active = True
has_subscription = False

# Text features (need special handling)
product_description = "Wireless bluetooth headphones with noise cancellation"

# Date/Time features
signup_date = "2023-01-15"
last_login = "2024-01-10"
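
Most models only accept numbers, so each feature type above gets its own encoding. A minimal sketch, assuming a made-up list of departments and an education ranking (both are illustrative choices, not fixed conventions):

```python
from datetime import date

def one_hot(value, categories):
    """Nominal feature -> one 0/1 column per category (no order implied)."""
    return [1 if value == category else 0 for category in categories]

# Ordinal feature -> integer rank that preserves HS < BS < MS < PhD.
EDUCATION_RANK = {"high_school": 0, "bachelors": 1, "masters": 2, "phd": 3}

def days_between(start_iso, end_iso):
    """Date features -> a number, here the elapsed days between two dates."""
    return (date.fromisoformat(end_iso) - date.fromisoformat(start_iso)).days

DEPARTMENTS = ["engineering", "sales", "support"]  # hypothetical category list

row = (
    one_hot("engineering", DEPARTMENTS)             # nominal
    + [EDUCATION_RANK["masters"]]                   # ordinal
    + [int(True)]                                   # binary (is_active)
    + [days_between("2023-01-15", "2024-01-10")]    # date/time
)
print(row)  # [1, 0, 0, 2, 1, 360]
```

One-hot encoding avoids inventing an order between departments, while the integer rank keeps the order that education levels really have.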

Feature Engineering: The Art

Feature engineering = Creating informative features from raw data.

Example: User churn prediction

from datetime import datetime

def engineer_features(user_data, now=None):
    """Transform raw data into useful features.

    Pass `now` explicitly for reproducible results; it defaults to the current time.
    """

    now = now or datetime.now()
    signup = datetime.fromisoformat(user_data["signup_date"])
    last_login = datetime.fromisoformat(user_data["last_login"])

    account_age_days = (now - signup).days
    days_since_login = (now - last_login).days

    return {
        # Original features
        "subscription_tier": user_data["tier"],

        # Engineered features (new!)
        "account_age_days": account_age_days,
        "days_since_login": days_since_login,
        "is_active": days_since_login < 30,

        # Derived ratios
        "logins_per_month": user_data["total_logins"] / max(1, account_age_days / 30),
        "feature_usage_rate": user_data["features_used"] / user_data["features_available"],

        # Binned features
        "user_segment": "new" if account_age_days < 90 else "established"
    }

# Raw data
raw = {
    "tier": "premium",
    "signup_date": "2023-06-15",
    "last_login": "2024-01-20",
    "total_logins": 145,
    "features_used": 12,
    "features_available": 20
}

# Fix the reference date so the output below is reproducible
features = engineer_features(raw, now=datetime(2024, 1, 24))
print(features)
# Output (rounded and reformatted for readability):
# {
#   "subscription_tier": "premium",
#   "account_age_days": 223,
#   "days_since_login": 4,
#   "is_active": True,
#   "logins_per_month": 19.5,
#   "feature_usage_rate": 0.6,
#   "user_segment": "established"
# }

Pro tip: Good features often matter more than fancy algorithms. Domain knowledge is your superpower!


Labels: The Ground Truth {#labels}

Labels Defined

Labels (also called targets, outputs, dependent variables) = The answers you want the model to predict.

Classification Labels

Predict categories:

# Binary classification (2 classes)
email_label = "spam"      # one of: "spam", "not_spam"
tumor_label = "benign"    # one of: "malignant", "benign"

# Multi-class classification (3+ classes)
sentiment = "neutral"     # one of: "positive", "negative", "neutral"
animal = "cat"            # one of: "cat", "dog", "bird", "fish"

Regression Labels

Predict numbers:

house_price = 450_000.00  # Continuous
temperature = 23.5  # Continuous
stock_price = 156.32  # Continuous
num_customers = 1_247  # Discrete count

The Labeling Challenge

Getting labels is often the hardest part:

Easy labels (automated):

# Labels from user actions
{"email_id": 1, "user_marked_spam": True}  # User labeled it
{"product_id": 42, "was_purchased": True}  # Action is the label
{"video_id": 99, "watch_time_seconds": 145}  # Behavior is the label

Hard labels (require human effort):

# Need experts to label
{"medical_image": img_1, "diagnosis": "???"}  # Radiologist needed
{"legal_document": doc_1, "category": "???"}  # Lawyer needed
{"support_ticket": ticket_1, "priority": "???"}  # Manual review needed

Reality check: Labeling data is expensive. Large companies spend millions on data labeling. Startups get creative (weak supervision, active learning, synthetic data).


The Learning Loop {#the-learning-loop}

This is where the magic happens. How does the model actually learn?

In Code (Simplified)

# Pseudocode of the learning loop

def train_model(training_data, epochs=100, learning_rate=0.01, threshold=1e-3):
    """Train a model using gradient descent, stopping early once the average loss drops below threshold."""

    # Start with random model
    model = initialize_random_model()

    # Repeat for many epochs
    for epoch in range(epochs):
        total_loss = 0

        # Process each training example
        for features, true_label in training_data:

            # 1. PREDICT
            prediction = model.predict(features)

            # 2. MEASURE ERROR
            error = compute_loss(prediction, true_label)
            total_loss += error

            # 3. COMPUTE GRADIENTS
            # "How should I change my weights to reduce error?"
            gradients = compute_gradients(error, model)

            # 4. UPDATE MODEL
            # Take a small step in the direction that reduces error
            model.weights -= learning_rate * gradients

        avg_loss = total_loss / len(training_data)
        print(f"Epoch {epoch}: Average Loss = {avg_loss:.4f}")

        # Early stopping if good enough
        if avg_loss < threshold:
            break

    return model

The Intuition

Imagine you're blindfolded on a hilly landscape. Your goal: reach the lowest point.

  1. Feel around (compute gradient) - Which direction is downhill?
  2. Take a step (update weights) - Move slightly downhill
  3. Repeat until you can't go any lower

This is gradient descent - the optimization method at the core of much of modern machine learning!
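
The blindfolded walk can be sketched in a few lines. This toy example assumes a one-dimensional "landscape", loss(x) = (x - 3)², whose lowest point is at x = 3; the function and the starting point are made up for illustration.

```python
def loss(x):
    """Height of the landscape at position x (lowest at x = 3)."""
    return (x - 3.0) ** 2

def gradient(x):
    """Slope at position x: positive means uphill is to the right."""
    return 2.0 * (x - 3.0)

x = 10.0          # start somewhere on the hill
step_size = 0.1   # how far to move on each step

for step in range(50):
    x -= step_size * gradient(x)   # step in the downhill direction

print(f"x = {x:.4f}, loss = {loss(x):.8f}")  # x ends up very close to 3
```

Each step shrinks the distance to the minimum by a constant factor (0.8 with these settings), which is why the walker closes in on the bottom quickly and then barely moves.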


Types of Learning {#types-of-learning}

Now that you understand the learning loop, let's see different types of learning.

Supervised Learning

Training: "Here are emails. I've labeled each as spam/not spam. Learn the pattern."

Two types:

| Type | Description | Examples |
| --- | --- | --- |
| Classification | Predict discrete categories | Spam/not spam, cat/dog/bird, disease diagnosis |
| Regression | Predict continuous numbers | House prices, temperature, stock prices |

Common algorithms:

  • Linear Regression, Logistic Regression
  • Decision Trees, Random Forests
  • Support Vector Machines (SVM)
  • Neural Networks
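
To see supervised classification in action, here is a small sketch of logistic regression trained with the same predict → error → adjust loop. The "hours studied vs. passed" data is invented for this example, and the feature is centered (mean subtracted) to keep training well behaved.

```python
import numpy as np

# Hypothetical toy data: hours studied -> passed the exam (1) or not (0).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

x = hours - hours.mean()   # center the feature

w, b = 0.0, 0.0
learning_rate = 0.1

for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # predicted probability of passing
    grad_w = np.mean((p - passed) * x)        # gradient of the mean log loss w.r.t. w
    grad_b = np.mean(p - passed)              # gradient of the mean log loss w.r.t. b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

def predict(hours_studied):
    """Return 1 (pass) if the predicted probability is above 0.5, else 0."""
    z = w * (hours_studied - hours.mean()) + b
    return int(z > 0)

print(predict(2.0), predict(7.0))  # 0 1
```

The labels are what make this supervised: every training example tells the model what the right answer was.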

Unsupervised Learning

Training: "Here's customer data. Find natural groupings I haven't identified."

Types:

| Type | Description | Example |
| --- | --- | --- |
| Clustering | Group similar items | Customer segments, document topics |
| Dimensionality Reduction | Compress features | 1000 features → 10 (while keeping info) |
| Anomaly Detection | Find outliers | Fraud detection, network intrusions |

Common algorithms:

  • K-Means, DBSCAN (clustering)
  • PCA, t-SNE (dimensionality reduction)
  • Isolation Forest (anomaly detection)
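
As an illustration of clustering, here is a bare-bones sketch of K-Means on one feature. The spending figures are invented, and starting the centroids at the smallest and largest value is a simplification; library implementations pick starting points more carefully and handle empty clusters.

```python
import numpy as np

# Hypothetical toy data: monthly spend for six customers (no labels!).
spend = np.array([10.0, 12.0, 11.0, 95.0, 100.0, 105.0])

k = 2
centroids = np.array([spend.min(), spend.max()])  # simple deterministic start

for _ in range(10):
    # Assignment step: each customer joins the nearest centroid.
    distances = np.abs(spend[:, None] - centroids[None, :])
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of its members.
    centroids = np.array([spend[labels == j].mean() for j in range(k)])

print(labels)     # [0 0 0 1 1 1]
print(centroids)  # [ 11. 100.]
```

Nobody told the algorithm that there are "low spenders" and "high spenders". It found the two groups from the data alone.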

Reinforcement Learning

Training: "Try actions in this environment. I'll give rewards/penalties based on results."

Key concepts:

  • Agent: The learner (your model)
  • Environment: Where actions happen (game, robot, market)
  • State: Current situation
  • Action: What the agent can do
  • Reward: Feedback signal (e.g., +1 for good, -1 for bad)
  • Policy: Strategy for choosing actions

Examples:

  • Game AI (Chess, Go, Atari games)
  • Robotics (walking, grasping)
  • Resource optimization (data center cooling, traffic lights)

Common algorithms:

  • Q-Learning
  • Deep Q-Networks (DQN)
  • Policy Gradient methods
  • Proximal Policy Optimization (PPO)
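
Here is a minimal sketch of tabular Q-Learning to tie the key concepts together. The environment is a made-up five-cell corridor where the agent starts at the left end and is rewarded for reaching the right end; during training the agent simply explores with random actions, which Q-Learning allows.

```python
import random

N_STATES = 5             # states 0..4; state 4 is the goal (terminal)
ACTIONS = [-1, +1]       # index 0 = move left, index 1 = move right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor

def step(state, action_index):
    """Environment: move, stay inside the corridor, reward +1 at the goal."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action] = expected return

for episode in range(200):
    state, done = 0, False
    while not done:
        action = random.randrange(2)  # explore with uniformly random actions
        next_state, reward, done = step(state, action)
        # Q-Learning update: move the estimate toward reward + discounted best future value
        target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state

# Policy: in each non-terminal state, pick the action with the higher Q-value.
policy = ["left" if q[0] > q[1] else "right" for q in Q[:-1]]
print(policy)  # ['right', 'right', 'right', 'right']
```

No example ever says "go right". The agent works it out from the reward alone, which is the defining trait of reinforcement learning.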

Hands-On: Gradient Descent from Scratch {#hands-on}

Let's implement the learning loop ourselves! We'll learn a simple linear relationship.

The Problem

Given data points, learn the line that fits them best.

True relationship: y = 2x + 1 (but the model doesn't know this!)

The Code

import numpy as np
import matplotlib.pyplot as plt

# Generate training data
np.random.seed(42)
X = np.linspace(0, 10, 50)  # 50 points from 0 to 10
y = 2 * X + 1 + np.random.randn(50) * 2  # y = 2x + 1 + noise

# Visualize data
plt.scatter(X, y, alpha=0.6, label='Training Data')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Training Data')
plt.legend()
plt.grid(True)
plt.show()

# Initialize model with random parameters
w = np.random.randn()  # Weight (slope)
b = np.random.randn()  # Bias (intercept)

print(f"Initial: w={w:.2f}, b={b:.2f}")

# Hyperparameters
learning_rate = 0.01
epochs = 100

# Training loop
losses = []

for epoch in range(epochs):
    # 1. PREDICT
    y_pred = w * X + b

    # 2. MEASURE ERROR (Mean Squared Error)
    loss = np.mean((y_pred - y) ** 2)
    losses.append(loss)

    # 3. COMPUTE GRADIENTS
    # Calculus tells us:
    # dL/dw = 2 * mean((y_pred - y) * X)
    # dL/db = 2 * mean(y_pred - y)
    grad_w = 2 * np.mean((y_pred - y) * X)
    grad_b = 2 * np.mean(y_pred - y)

    # 4. UPDATE MODEL
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

    if epoch % 20 == 0:
        print(f"Epoch {epoch}: w={w:.2f}, b={b:.2f}, loss={loss:.4f}")

print(f"\nFinal: w={w:.2f}, b={b:.2f}")
print(f"True:  w=2.00, b=1.00")

# Visualize result
plt.figure(figsize=(12, 5))

# Plot 1: Final fit
plt.subplot(1, 2, 1)
plt.scatter(X, y, alpha=0.6, label='Data')
plt.plot(X, w * X + b, 'r-', linewidth=2, label=f'Learned: y={w:.2f}x+{b:.2f}')
plt.plot(X, 2 * X + 1, 'g--', linewidth=2, label='True: y=2x+1')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Final Model Fit')
plt.legend()
plt.grid(True)

# Plot 2: Loss over time
plt.subplot(1, 2, 2)
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.grid(True)

plt.tight_layout()
plt.show()

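Example output (illustrative; treat the numbers as indicative of the trend rather than exact, and compare against your own run):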

Initial: w=-0.22, b=1.53
Epoch 0: w=-0.16, b=1.51, loss=51.8234
Epoch 20: w=1.34, b=1.42, loss=5.2156
Epoch 40: w=1.85, b=1.23, loss=4.1892
Epoch 60: w=1.98, b=1.15, loss=4.1234
Epoch 80: w=2.01, b=1.12, loss=4.1156

Final: w=2.03, b=1.10
True:  w=2.00, b=1.00

What just happened:

  1. Started with random w and b
  2. Made predictions (straight line)
  3. Measured error (how far predictions are from true values)
  4. Computed gradients (which direction to adjust w and b)
  5. Updated parameters slightly
  6. Repeated 100 times
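  7. Ended up close to the true values (the intercept b converges more slowly than the slope w, and noise in the data keeps the fit from being exact)

This is gradient descent - the foundation of much of modern ML, including neural network training!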
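
For the curious, the gradient formulas in step 3 of the code come from differentiating the mean squared error with the chain rule. Written out, with n for the number of data points and η for the learning rate (symbols introduced here for illustration):

```latex
\begin{aligned}
L(w, b) &= \frac{1}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)^2 \\[4pt]
\frac{\partial L}{\partial w} &= \frac{2}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right) x_i \\[4pt]
\frac{\partial L}{\partial b} &= \frac{2}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right) \\[4pt]
w &\leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
\end{aligned}
```

The term in parentheses is y_pred - y from the code, so the second and third lines match grad_w and grad_b exactly.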

🎯 Exercise: Modify the Code

Try these experiments:

  1. Change learning rate:
     • Set learning_rate = 0.001 (too small) - What happens?
     • Set learning_rate = 0.5 (too large) - Does it still converge?
  2. More epochs:
     • Run for 500 epochs - Does it get closer to the true values?
  3. Different relationship:
     • Change true relationship to y = 3x + 5 - Does gradient descent find the new values?
  4. Add more features:
     • Try fitting y = w1*x + w2*x² + b (polynomial)

Hints:

  • Watch the loss curve - it should decrease
  • If loss increases, learning rate is too high
  • If loss decreases very slowly, learning rate is too small or need more epochs

Summary

Key Takeaways

  • 🎯 ML is function approximation: Learn from examples, not explicit rules
  • 🎯 Training data = examples: Each has features (input) and label (output)
  • 🎯 Features = what model sees: Good features are critical
  • 🎯 Labels = what we predict: Classification (categories) or regression (numbers)
  • 🎯 Learning loop: Predict → Measure error → Compute gradients → Update → Repeat
  • 🎯 Gradient descent: The optimization algorithm that powers much of ML

Types of Learning

| Type | Has Labels? | Goal | Example |
| --- | --- | --- | --- |
| Supervised | ✅ Yes | Predict labels | Spam detection, price prediction |
| Unsupervised | ❌ No | Find patterns | Customer segmentation, anomaly detection |
| Reinforcement | ⚡ Rewards | Learn through trial/error | Game AI, robotics |


Next Steps

Immediate Actions

  1. ✅ Run the gradient descent code yourself
  2. 📝 Experiment with different learning rates
  3. 🤔 Think about your own problem: What are the features? What's the label?

Continue Learning

In this module (Foundations):

  1. What is AI, ML, and Deep Learning?
  2. ← You are here: How Machines Learn
  3. Python for Machine Learning ← Coming next!
  4. Math Essentials
  5. Your First Model

FAQ

Q: Why does gradient descent work?
A: Think of the error as a hilly landscape. The gradient points in the direction of steepest ascent, so the negative gradient points in the direction of steepest descent. We want to minimize error, so we repeatedly take small steps against the gradient and move downhill until we reach a valley (a minimum of the error). On a bumpy landscape that valley can be a local minimum rather than the lowest point overall.
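
You can check this claim numerically. The sketch below uses a tiny made-up dataset, compares the calculus-based gradient with a finite-difference estimate, and then takes one step with and one step against the gradient.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1   # points on the line y = 2x + 1

def loss(w, b):
    return np.mean((w * x + b - y) ** 2)

def analytic_gradient(w, b):
    """Gradient of the mean squared error, from calculus."""
    residual = w * x + b - y
    return 2 * np.mean(residual * x), 2 * np.mean(residual)

def numeric_gradient(w, b, eps=1e-6):
    """Same gradient, estimated by nudging each parameter a tiny bit."""
    dw = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
    db = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
    return dw, db

w, b = 0.0, 0.0
grad_w, grad_b = analytic_gradient(w, b)
num_w, num_b = numeric_gradient(w, b)
print(grad_w, grad_b)   # -17.0 -8.0
print(num_w, num_b)     # approximately the same values

step = 0.05
loss_before = loss(w, b)
loss_downhill = loss(w - step * grad_w, b - step * grad_b)  # against the gradient
loss_uphill = loss(w + step * grad_w, b + step * grad_b)    # with the gradient
print(loss_downhill < loss_before < loss_uphill)  # True
```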
Q: Can the model memorize instead of learning?
A: Yes! This is called overfitting: the model memorizes training examples instead of learning general patterns, like a student who memorizes answers to practice problems but can't solve new ones. Solution: Hold out separate test data (never seen during training) to check whether the model generalizes. We'll cover this in detail in the evaluation module.
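
A quick way to see memorization is to compare training error with error on held-out points. This sketch assumes noisy data from y = 2x + 1 and compares a straight-line fit with a degree-5 polynomial that has enough freedom to pass through every training point; the data generator is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_line(x):
    """Hypothetical data source: y = 2x + 1 plus a little noise."""
    return 2 * x + 1 + rng.normal(scale=0.2, size=x.shape)

x_train = np.linspace(0, 1, 6)
y_train = noisy_line(x_train)
x_test = np.linspace(0.1, 0.9, 5)   # points never seen during training
y_test = noisy_line(x_test)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

simple = np.polyfit(x_train, y_train, deg=1)    # 2 parameters
flexible = np.polyfit(x_train, y_train, deg=5)  # 6 parameters for 6 points

for name, coeffs in [("degree 1", simple), ("degree 5", flexible)]:
    print(f"{name}: train MSE = {mse(coeffs, x_train, y_train):.6f}, "
          f"test MSE = {mse(coeffs, x_test, y_test):.6f}")
```

The degree-5 model drives its training error to essentially zero because it has memorized the noise, yet its test error stays clearly above that. A large gap between training and test error is the typical warning sign of overfitting.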
Q: How much training data do I need?
A: It depends, but rough guidelines:
  • Classical ML: 10-100x more examples than features (10 features → 100-1,000 examples)
  • Deep Learning: thousands to millions (depends on complexity)
  • Transfer Learning: can work with hundreds of examples
Quality > Quantity: 1,000 high-quality labeled examples often beat 10,000 noisy ones.
Q: What if I don't have labeled data?
A: Options:
  1. Label it yourself (small datasets)
  2. Hire labelers (Amazon Mechanical Turk, Labelbox)
  3. Use weak supervision (generate noisy labels programmatically)
  4. Semi-supervised learning (use a little labeled + lots of unlabeled data)
  5. Unsupervised learning (find patterns without labels)
  6. Synthetic data (generate examples programmatically)
Each has trade-offs. We'll cover these strategies later.
Q: How do I choose a learning rate?
A: Trial and error, but guidelines:
  • Too small: Learning is very slow (loss decreases too slowly)
  • Too large: Training is unstable (loss oscillates or increases)
  • Just right: Steady decrease in loss
Typical values: 0.001 to 0.1
Advanced: Use learning rate schedulers (start large, decrease over time) or adaptive optimizers (such as Adam, which adapts the step size for each parameter automatically).
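
The three regimes are easy to reproduce. The sketch below reuses the kind of data from the hands-on section (y = 2x + 1 plus noise, generated here with its own random generator, so the numbers will not match the earlier run) and trains from w = b = 0 with three learning rates.

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.linspace(0, 10, 50)
y = 2 * X + 1 + rng.normal(scale=2.0, size=50)

def final_loss(learning_rate, epochs=50):
    """Run batch gradient descent from w = b = 0 and return the final MSE."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        residual = w * X + b - y
        w -= learning_rate * 2 * np.mean(residual * X)
        b -= learning_rate * 2 * np.mean(residual)
    return np.mean((w * X + b - y) ** 2)

print(f"starting loss: {np.mean(y ** 2):.4g}")
for lr in (0.001, 0.01, 0.5):
    print(f"learning_rate={lr}: final loss = {final_loss(lr):.4g}")
```

On this data, 0.001 makes progress but lags behind 0.01 after the same number of epochs, while 0.5 overshoots on every step and the loss explodes. The right range depends on the scale of your features, so treat these values as specific to this example.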

Exercises & Practice

🎯 Exercise 1: Identify Features and Labels

Challenge: For each scenario, identify features and labels.

  1. Predicting customer lifetime value
     • Features: ?
     • Label: ?
  2. Detecting fraudulent credit card transactions
     • Features: ?
     • Label: ?
  3. Recommending movies
     • Features: ?
     • Label: ?

Answers:

  1. Customer lifetime value
     • Features: Purchase history, demographics, engagement metrics, customer since date
     • Label: Total revenue from customer (continuous number) - Regression
  2. Fraud detection
     • Features: Transaction amount, merchant, location, time, user history
     • Label: Is fraudulent? (yes/no) - Classification
  3. Movie recommendations
     • Features: User's rating history, movie genres, actors, user demographics
     • Label: Rating user would give (1-5 stars) - Regression OR will user watch? (yes/no) - Classification

🎯 Exercise 2: Learning Type Classification

Challenge: Classify each as Supervised, Unsupervised, or Reinforcement Learning.

  1. Grouping news articles by topic (no predefined topics)
  2. Predicting next word in a sentence
  3. Training a robot to walk
  4. Finding unusual patterns in network traffic
  5. Classifying emails as work/personal/spam

Answers:

  1. Unsupervised (clustering - no labels, finding natural groups)
  2. Supervised (have pairs: previous words → next word; often called self-supervised because the labels come from the text itself)
  3. Reinforcement (trial and error, reward for staying upright)
  4. Unsupervised (anomaly detection - no labels for "unusual")
  5. Supervised (have labeled examples of each category)

Last updated: 2026-01-24