A quick exploration of gradient descent - how it works and why it's the backbone of modern ML.
Problem
I wanted to deeply understand how gradient descent actually works under the hood, not just use it as a black box in PyTorch or TensorFlow.
Blocker
The math looked intimidating at first - partial derivatives, learning rates, convergence… Where do I even start?
Solution
The core idea is beautifully simple: take small steps in the direction that reduces error the most.
# Simple gradient descent implementation
def gradient_descent(x, compute_gradient, learning_rate=0.01, epochs=100):
    for _ in range(epochs):
        gradient = compute_gradient(x)    # derivative of loss w.r.t. x
        x = x - learning_rate * gradient  # step opposite the gradient
    return x
# For a simple quadratic: f(x) = x^2
# Gradient: f'(x) = 2x
# Starting at x=10, learning_rate=0.1
# x_new = 10 - 0.1 * 20 = 8
# x_new = 8 - 0.1 * 16 = 6.4
# ... converges to 0 (the minimum)
Key insight: the gradient points in the direction of steepest increase, so stepping the opposite way decreases the loss.
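The worked quadratic example above can be run end to end. A minimal sketch (the `compute_gradient` argument here is a plain function, passed in so the routine stays generic):

```python
def gradient_descent(x, compute_gradient, learning_rate=0.01, epochs=100):
    for _ in range(epochs):
        gradient = compute_gradient(x)    # derivative of loss w.r.t. x
        x = x - learning_rate * gradient  # step opposite the gradient
    return x

# f(x) = x^2, so f'(x) = 2x; the minimum is at x = 0.
# Each step multiplies x by (1 - 0.1 * 2) = 0.8, matching the hand trace:
# 10 -> 8 -> 6.4 -> ...
result = gradient_descent(10.0, lambda x: 2 * x, learning_rate=0.1, epochs=100)
print(result)  # very close to 0
```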
Resources
Next Steps
- Implement batch vs stochastic gradient descent
- Explore momentum and Adam optimizer
- Apply to a real neural network from scratch
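As a starting point for the momentum item above, a hedged sketch of classical momentum (the function and parameter names here are illustrative, not from the original notes):

```python
def gradient_descent_momentum(x, compute_gradient, learning_rate=0.01,
                              beta=0.9, epochs=100):
    velocity = 0.0
    for _ in range(epochs):
        gradient = compute_gradient(x)
        # Accumulate a decaying average of past steps; beta controls the decay.
        velocity = beta * velocity - learning_rate * gradient
        x = x + velocity
    return x

# Same quadratic as before: f(x) = x^2, gradient 2x.
# Momentum overshoots and oscillates around the minimum before settling.
result_m = gradient_descent_momentum(10.0, lambda x: 2 * x,
                                     learning_rate=0.1, epochs=300)
print(result_m)
```

Adam extends this idea with a second decaying average of squared gradients to scale each step; that seems worth implementing next alongside the batch/stochastic split.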