CNNs Explained Simply — Looking at Images with a Flashlight

Before 2012, computer vision was one of the most stubborn problems in artificial intelligence. Computers could play chess, beat humans at trivia, and translate languages crudely, but asking one "is there a cat in this picture?" was genuinely humiliating. The best systems in the world, built by the best labs in the world, were no match for a determined toddler at recognising animals.

Then, in one ten-month stretch in 2012, that problem fell off a cliff. A team at the University of Toronto entered a public image-recognition contest with a neural network called AlexNet and cut the top-5 error rate from about 26 percent (the runner-up's score) to about 15 percent. Nearly half. Other researchers stared at the numbers for a week, assumed they were wrong, and then quietly blew up their own research agendas. Almost overnight, serious computer-vision groups around the world started switching to the approach AlexNet had used. The approach was a convolutional neural network, CNN for short, and within about three years CNNs were reported to match or beat human-level error rates on that same benchmark.

The trick at the heart of a CNN is one of the most satisfying ideas in modern AI. It's also a metaphor you can hold in one hand. It's a little flashlight that slides across an image, looking for the same kinds of small patterns no matter where in the image those patterns show up. Once you see the flashlight, you understand why CNNs work, why they work so dramatically better than anything that came before, and why almost every image-related product you use today has some descendant of this idea inside it. No calculus. One flashlight.

The problem vanilla networks couldn't solve

In the last post, we saw how training works: a gradient walk downhill on a loss landscape. In M3.2, we saw why depth works: each layer builds more abstract features from the layer below.

Those ideas alone should let you recognise a cat in a photo, right? Just flatten the photo into a long list of pixel brightnesses, feed it through a stack of layers, and train the thing.

People tried this for years. It barely worked. Here's why.

Pretend your input image is just a 100-by-100 grid of pixels — already tiny by modern standards. That's 10,000 pixels. A fully connected first layer, the kind we described in M3.2, needs one weight from every pixel to every neuron. If you want, say, 1,000 neurons in your first layer, you need 10,000 × 1,000 = ten million weights just for the first layer. For a real 1,000-by-1,000 image, that's a billion weights for the first layer alone.
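If you like seeing the arithmetic spelled out, here is a minimal sketch of that count. The function name `dense_weight_count` is made up for illustration, and like the text it ignores bias terms and color channels.

```python
def dense_weight_count(height, width, neurons):
    """Weights in a fully connected layer: one per (pixel, neuron) pair."""
    return height * width * neurons

small = dense_weight_count(100, 100, 1_000)
large = dense_weight_count(1_000, 1_000, 1_000)

print(f"100x100 image:   {small:,} weights")
print(f"1000x1000 image: {large:,} weights")
```

Making the image ten times wider and ten times taller multiplies the weight count by a hundred, which is the scaling problem in one line.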

And that's not even the biggest problem. The biggest problem is that the network has to separately learn "what a cat looks like here" for every possible position in the image. A cat in the top-left corner activates a totally different set of weights than a cat in the bottom-right corner. As far as the network is concerned, these are two unrelated problems. Move the cat five pixels to the right and the whole system gets confused.
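A toy version of the position problem, as a sketch only. It assumes a one-dimensional "image" of eight pixels to keep things readable, and the weights of the single neuron are hand-written here (a real network would learn them).

```python
# The same two-pixel "cat", once at the left edge and once at the right edge.
cat_left = [1, 1, 0, 0, 0, 0, 0, 0]
cat_right = [0, 0, 0, 0, 0, 0, 1, 1]

# Hypothetical weights of one fully connected neuron that has only ever
# seen the cat on the left: one weight per pixel position.
neuron_weights = [1, 1, 0, 0, 0, 0, 0, 0]

def neuron_response(pixels, weights):
    """Weighted sum over all pixels, which is what a fully connected neuron computes."""
    return sum(p * w for p, w in zip(pixels, weights))

left_score = neuron_response(cat_left, neuron_weights)
right_score = neuron_response(cat_right, neuron_weights)
print(left_score, right_score)
```

The neuron fires for the left-hand cat and stays silent for the identical cat on the right, because the pixels that matter now land on weights that were never trained to care.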

This was the impasse. The math worked, the training worked, but the approach just scaled badly on real images. Something had to change.

The core move: scan the image with a flashlight

Here's the CNN idea in one picture.

Instead of a layer where every neuron looks at every pixel, a CNN has a single small filter — think of it as a tiny flashlight a few pixels wide — that slides across the whole image, one position at a time. At every position, it asks one question: how much does this little patch of the image look like what I'm looking for?

The filter might be looking for a short horizontal edge. At each spot, it measures how much horizontal-edge-iness is in this patch. If it slides to a patch with a strong horizontal edge, it returns a high score. If it slides to a patch that's just a blank wall, it returns a low score. When it's swept across the whole image, you get a new grid — a feature map — with one score per position, showing where horizontal edges were found.

That's the whole convolution operation. A filter. Sliding. Scoring. Out comes a map.
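If you'd like to watch the sliding in slow motion, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption: the function name `slide_filter`, the 5×5 toy image, and the hand-picked horizontal-edge filter (a real CNN learns its filter values during training). It skips padding and moves one pixel at a time.

```python
def slide_filter(image, kernel):
    """Slide `kernel` over `image` and return the grid of scores (the feature map)."""
    k = len(kernel)
    feature_map = []
    for r in range(len(image) - k + 1):
        row = []
        for c in range(len(image[0]) - k + 1):
            # Multiply each pixel in the patch by the matching filter weight, then add up.
            score = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(score)
        feature_map.append(row)
    return feature_map

# Dark (0) on top, bright (1) below: one horizontal edge.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]

# Hand-picked filter: rewards "dark above, bright below".
horizontal_edge = [
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1],
]

feature_map = slide_filter(image, horizontal_edge)
for row in feature_map:
    print(row)
```

The top two rows of the feature map score 3, because there the window straddles the dark-to-bright boundary. The bottom row scores 0: that is the "blank wall" case, where the window sees only bright pixels.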

A CNN usually runs many filters in parallel — maybe 64 or 128 of them — each looking for a different kind of small pattern. One might look for horizontal edges, another for diagonal edges, another for small dots, another for a sudden color change. Each filter produces its own feature map. The first layer of a CNN turns one image into 64 feature maps, each one a "where was my pattern?" report for a different pattern.
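Here is a small sketch of the "many filters, many maps" idea, using three filters instead of 64. The names `slide_filter` and `filter_bank` and all the filter values are assumptions chosen by hand for illustration; in a trained CNN the values are learned.

```python
def slide_filter(image, kernel):
    """Score every patch of `image` against `kernel` (no padding, one pixel per step)."""
    k = len(kernel)
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(len(image[0]) - k + 1)
        ]
        for r in range(len(image) - k + 1)
    ]

# Dark left half, bright right half: one vertical edge.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Each filter looks for a different small pattern.
filter_bank = {
    "horizontal_edge": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "vertical_edge": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    "single_dot": [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
}

# One feature map per filter: a separate "where was my pattern?" report.
feature_maps = {name: slide_filter(image, kernel) for name, kernel in filter_bank.items()}
for name, fmap in feature_maps.items():
    print(name, fmap)
```

The vertical-edge filter lights up everywhere, the horizontal-edge filter finds nothing, and the dot filter just reports the brightness under its center. Same image, three different reports.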

The honest one-line definition:

A convolutional layer is a set of small patterns the network looks for everywhere in the image. The output is a collection of maps showing where each pattern appeared.

Why this one idea changes everything

The flashlight trick sounds almost too simple to matter. It does, though, because of three consequences, and each of them is load-bearing.

Consequence 1 · drastically fewer weights. A single filter is small, typically 3×3 or 5×5 pixels. That's 9 or 25 weights, regardless of the image size, because the same filter is reused at every position. So a first convolutional layer with 64 filters of size 3×3 has only 64 × 9 = 576 weights on a grayscale image (three times that on a color image with red, green, and blue channels), not ten million. A CNN can be dozens or even hundreds of layers deep and still have far fewer weights than a shallow fully connected network on the same image. That makes training feasible.
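The count for a convolutional layer can be sketched the same way. The function name `conv_weight_count` is hypothetical, bias terms are ignored, and the `in_channels` argument is an addition for color images (1 for grayscale, 3 for red, green, and blue).

```python
def conv_weight_count(num_filters, kernel_size, in_channels=1):
    """Weights in a convolutional layer. Note that image size does not appear at all."""
    return num_filters * kernel_size * kernel_size * in_channels

DENSE_FIRST_LAYER = 10_000_000  # the fully connected figure from earlier in the post

conv_gray = conv_weight_count(64, 3)
conv_color = conv_weight_count(64, 3, in_channels=3)

print(f"64 filters, 3x3, grayscale: {conv_gray:,} weights")
print(f"64 filters, 3x3, color:     {conv_color:,} weights")
print(f"dense layer is about {DENSE_FIRST_LAYER // conv_gray:,}x larger")
```

Because the image's height and width are not inputs to the function, the same 576 weights serve a 100-by-100 thumbnail and a 1,000-by-1,000 photo alike.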

Consequence 2 · the network learns position-agnostic features. Because the same filter sees the whole image, when it learns "horizontal edge," it learns it for every position simultaneously. A cat in the top-left and a cat in the bottom-right will both trigger the same edge detectors. The network stops having to learn "cat in this spot" vs "cat in that spot" separately. It learns "what does a cat look like, anywhere."
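A quick way to see this is to slide one filter over the same pattern in two places. This is a sketch under simplifying assumptions: a one-dimensional "image", a hand-written two-pixel filter, and a made-up helper name `slide_1d`.

```python
def slide_1d(signal, kernel):
    """Slide `kernel` along `signal`, one position at a time, and score each window."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [1, 1]  # hypothetical filter: "two bright pixels side by side"

cat_left = [1, 1, 0, 0, 0, 0, 0, 0]
cat_right = [0, 0, 0, 0, 0, 0, 1, 1]

map_left = slide_1d(cat_left, kernel)
map_right = slide_1d(cat_right, kernel)

print(map_left)
print(map_right)
print("peak at", map_left.index(max(map_left)), "and", map_right.index(max(map_right)))
```

The peak score is the same in both cases; only its position in the feature map moves. The filter recognised the pattern once and got every location for free.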

Consequence 3 · the abstraction ladder from M3.2 now has a natural structure. Early filters in a CNN learn tiny low-level things: edges, textures, color gradients. The feature maps they produce become the input to the next convolutional layer, which runs its own set of filters over those feature maps. But those filters aren't looking at pixels anymore; they're looking for patterns of edges, which turn out to be things like corners, curves, and small shapes. The next layer looks for patterns of shapes, which turn out to be things like eyes, ears, wheels, leaves. And so on, up the ladder.

The ladder is the same one we met in M3.2, but the structure of convolution means the bottom rung is "tiny local patterns" and each rung up combines patterns into slightly bigger patterns until the top of the ladder is "whole objects." Nobody hand-designed this. It falls out of stacking convolutional layers and training them downhill on a lot of labelled images. This is the famous result that made researchers believe deep learning was really going to work.
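One way to put numbers on "slightly bigger patterns" is to track how much of the original image a single unit can see at each rung, often called its receptive field. This is a simplified sketch: it assumes every layer uses the same filter size, moves one pixel at a time, and has no shrinking step in between. The function name is made up.

```python
def receptive_field(num_layers, kernel_size=3):
    """Width, in original pixels, that one unit sees after `num_layers` stacked convolutions."""
    field = 1  # before any layer, a unit is just one pixel
    for _ in range(num_layers):
        field += kernel_size - 1  # each layer widens the view by (kernel_size - 1)
    return field

for depth in (1, 2, 3, 5):
    size = receptive_field(depth)
    print(f"after {depth} layer(s): one unit sees a {size}x{size} patch")
```

Each 3×3 layer adds two pixels to the view, so the window grows from 3 to 5 to 7 and onward. That steady widening is why later layers can respond to shapes that no single first-layer filter could cover.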

The two helper moves: pooling and padding

You'll sometimes hear two other words alongside CNNs. Don't let them throw you.

Pooling is the move where, after a convolutional layer, the feature maps get shrunk. Each small patch — say, 2×2 — gets collapsed to a single number, usually the biggest one in that patch. It's a way of saying "we don't care about the exact pixel where the pattern showed up, just that it showed up somewhere around here." Pooling throws away precise location to keep approximate presence. As you go deeper into a CNN, pooling keeps shrinking the feature maps. By the top of the ladder, you might have a handful of tiny maps each asking "does this whole image contain something cat-like?" That's exactly what you want.
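Here is what that shrinking step might look like for the 2×2, keep-the-biggest case (usually called max pooling). It is a sketch with a made-up function name and a made-up feature map, and it assumes the patches do not overlap.

```python
def max_pool_2x2(feature_map):
    """Collapse each non-overlapping 2x2 patch to its largest value."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            patch = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            row.append(max(patch))
        pooled.append(row)
    return pooled

original = [
    [0, 3, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 5, 0],
]

# Same map, but the strong "3" response has moved one pixel down and to the left.
nudged = [
    [0, 0, 0, 0],
    [3, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 5, 0],
]

print(max_pool_2x2(original))
print(max_pool_2x2(nudged))
```

Both inputs pool to the same 2×2 result. The small shift inside the patch is exactly the detail pooling is designed to forget.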

Padding is a tiny technical fix for what happens at the edges of an image, where the flashlight would hang off the side. You pad the image with a strip of zeros around the border so the filter always has pixels to look at. Boring bookkeeping; worth knowing the word exists but nothing conceptual.
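For the bookkeeping-minded, a sketch of zero padding and its effect on output size. The helper names are hypothetical, and the size formula assumes a square image, a square filter, and a filter that moves one pixel at a time.

```python
def zero_pad(image, border=1):
    """Return a copy of `image` with `border` rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * border
    padded = [[0] * width for _ in range(border)]
    for row in image:
        padded.append([0] * border + list(row) + [0] * border)
    padded.extend([0] * width for _ in range(border))
    return padded

def output_size(image_size, kernel_size, border=0):
    """Side length of the feature map for a given image, filter, and padding."""
    return image_size + 2 * border - kernel_size + 1

image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = zero_pad(image)
for row in padded:
    print(row)

print("5x5 image, 3x3 filter, no padding ->", output_size(5, 3))
print("5x5 image, 3x3 filter, 1px padding ->", output_size(5, 3, border=1))
```

Without padding, every convolution trims the map a little (5 becomes 3 here). One pixel of zeros around a 3×3 filter's input keeps the output the same size as the input.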

These two together with the convolution step are basically the entire CNN toolkit. Stack them in a thoughtful order, train with gradient descent, and you get an image recognizer that works.

Why 2012 was the snap

We've been circling 2012 since M1. Here's the specific reason it was a snap for computer vision.

Before 2012, CNNs already existed. The ideas went back to the 1980s with Yann LeCun's work on handwritten digit recognition. But nobody could scale them up. Training a deep CNN on real photos required more compute than any lab could afford, and more data than anyone had collected.

Three things arrived at the same time, and the dam broke.

ImageNet. A Stanford project led by Fei-Fei Li built a dataset of roughly 14 million labelled images across 20,000 categories. This was absurdly more data than any previous visual dataset, and it was made public. The annual contest built on it used a subset of about 1.2 million training images in 1,000 categories. For the first time, there was a fair benchmark big enough to tell whether a method was really working.
GPUs. Graphics processing units, originally built for video games, turned out to be wildly good at the multiply-and-add math convolutions are made of. A GPU could do in hours what a CPU would take days or weeks to do. Training a deep CNN on millions of photos went from impractical to a job that took days.
AlexNet. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton took all the existing CNN ideas, added some modern tricks (better activation functions, better regularization), scaled the whole thing to about 60 million parameters, trained it on ImageNet across two GPUs for a week, and entered the annual ImageNet contest. They won by a gap that embarrassed everyone else in the field.

Every visual-AI product you've touched in the last decade is downstream of that moment. Your phone's face unlock is a CNN descendant. The automatic alt-text on your photos app is a CNN descendant. The self-checkout camera that figures out you're holding an avocado is a CNN descendant. The medical imaging system reading a chest X-ray is a CNN descendant. They're bigger, fancier, occasionally replaced by transformer-based models for some tasks, but the flashlight idea is still the structural core of most of them.

What CNNs are not good at

One honest limitation worth naming before we leave this topic.

CNNs are spectacularly good at locally-structured data — data where nearby things are meaningfully related. In images, nearby pixels form edges, which form textures, which form objects. Perfect fit for a flashlight.

CNNs are not naturally great at data where the important relationships are far apart. In a sentence, the subject at the start might relate to a verb at the end, and a flashlight sliding a few words at a time has to pass information up through many layers to connect them. It works, clumsily, but it's not what CNNs were built for. That's why text processing, for years, leaned on a different architecture — recurrent networks — and then, more recently, transformers. We'll meet both.
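To get a feel for "many layers," here is a rough back-of-the-envelope sketch. It assumes plain stacked convolutions over a sequence, each moving one position at a time, with no pooling or other shortcuts that real text CNNs sometimes use to widen their view faster. The function name is made up.

```python
import math

def layers_to_connect(distance, kernel_size=3):
    """Fewest stacked convolutions before one unit can see two words `distance` positions apart."""
    # Each layer widens the view by (kernel_size - 1) positions.
    return math.ceil(distance / (kernel_size - 1))

for gap in (2, 10, 20, 50):
    print(f"words {gap} positions apart: at least {layers_to_connect(gap)} layer(s)")
```

With a three-word flashlight, linking a subject to a verb twenty words later takes at least ten layers. The path length grows with the gap, which is the clumsiness described above.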

The takeaway: every architecture is shaped by an assumption about the data. CNNs assume "useful patterns are local and repeat everywhere." That assumption is dead-on for images and works pretty well for audio. It's weaker for text, where the interesting patterns are often non-local.

What just changed in your head

You started this post with "convolutional neural network" as an imposing term. You're ending it with a picture of a little flashlight sliding across an image, asking the same question at every position, producing a map of where that question got a yes. Stack a bunch of those flashlights, each looking for a different tiny pattern, feed their outputs into the next layer, and you've got the single most important architecture in the history of computer vision.

One sentence worth carrying forward:

A CNN is a neural network where every filter is shared across every position, so the network learns "what patterns matter" separately from "where in the image they happen to be."

That sentence, plus the ladder from M3.2, plus the downhill walk from M3.3, is the core of how most modern vision models work. The rest is engineering.

In the next post, we turn to a different challenge: what do you do when the data isn't an image but a sequence — a sentence, a speech recording, a stream of stock prices? A flashlight sliding across an image doesn't quite fit, because sequences have a direction and a memory. We'll meet the old-school answer — recurrent neural networks — and see both what they solved and why they eventually got replaced. It's a short detour, but it sets up the transformer revolution in Module 4.

⬅️ Previous M3.3 · Training, Downhill Blindfolded	M3.4	Next ➡️ M3.5 · RNNs — One Word at a Time

📚 AI Zero to Hero · Course Home — all 33 posts, six modules.

Cover photo via Unsplash. This post is part of the AI Zero to Hero series.