Neural Networks from Scratch
1/24/26
- Neural Networks Architecture
- how machines actually learn
- ![[Screenshot 2026-01-24 at 10.02.45 AM.png]]
- first one -> input layer
- last layer -> output layer
- layers in the middle -> hidden layers, generally n of them
- the circular things are the neurons
- ML
- Goal: understand the fundamental mechanism of pattern discovery in data, i.e. mathematical relationships within large datasets.
- Training is a simple, iterative loop of guessing and correcting:
- start with a guess -> random values
- measure error -> how far is the guess from the truth ?
- adjust -> modify the guess to be slightly less wrong
- repeat -> do this millions of times, until the error reaches zero or gets very, very close to it.
- ![[Screenshot 2026-01-24 at 10.21.57 AM.png]]
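The guess-measure-adjust loop above can be sketched on the smallest possible problem: learning a single target number. This is a toy illustration (the target, starting guess, and correction factor are all made up), not the full network update rule.

```python
# Minimal guess-and-correct loop: learn the value of one target number.
target = 7.0
guess = 0.0          # start with a (bad) guess
lr = 0.1             # how much to correct on each step

for step in range(100):
    error = guess - target    # measure: how far is the guess from the truth?
    guess -= lr * error       # adjust: move slightly toward the truth

print(round(guess, 4))  # ends up very close to 7.0
```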
- Neuron -> A Tiny Decision Maker
- The building block of AI is a simple mathematical function, not a biological mystery.
- It takes some numbers in, performs basic arithmetic, and outputs a signal.
- Inputs -> numerical data points representing features (eg: house size, pixel intensity)
- weights and summation -> importance assigned to each input, combined into a single weighted sum.
- activation -> a non-linear function that decides whether the signal should "fire" or not.
- ![[Screenshot 2026-01-24 at 10.24.49 AM.png]]
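A single neuron is small enough to write out directly. This is a minimal sketch with a sigmoid activation and made-up inputs and weights (the "house features" are hypothetical):

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum + bias, passed through sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias   # weighted sum
    return 1.0 / (1.0 + math.exp(-z))                        # activation: "fire" strength

# Hypothetical feature vector [size, rooms] with made-up weights.
out = neuron([2.0, 3.0], [0.5, -0.2], bias=0.1)
print(out)  # a value strictly between 0 and 1
```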
- Weights -> W1, W2, W3, ..., values the network adjusts during training to reduce the loss
- Inputs -> Features
- non-linearity -> activation function
- Anatomy of a neuron:
- output = activation(\sum w_i x_i + b)
- weights (w) -> determine the strength and importance of each input signal
- bias (b) -> an offset that allows the neuron to shift its decision boundary
- activation -> a nonlinear function that decides if the signal should “fire”.
- the need for activation functions :
- Consider
- Layer 1 -> y = W1x + b1
- Layer 2 -> z = W2y + b2
- Combining them we get:
- z = W2(W1x + b1) + b2 = (W2W1)x + (W2b1 + b2)
- this will then simplify to:
- z = W'x + b', where W' = W2W1 and b' = W2b1 + b2
- so stacking linear layers just results in another linear function.
- 100 layers of this would be equivalent to just one.
- adding non-linearity between layers breaks this collapse and makes depth useful.
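The collapse derivation above can be checked numerically. A minimal sketch using scalar "layers" (the weight and bias values are arbitrary):

```python
import random

random.seed(0)
# Two stacked linear layers (scalars for simplicity): y = w1*x + b1, z = w2*y + b2
w1, b1 = 2.0, 1.0
w2, b2 = 3.0, -0.5

# Collapsed single layer: W' = w2*w1, b' = w2*b1 + b2
w_prime = w2 * w1
b_prime = w2 * b1 + b2

for _ in range(5):
    x = random.uniform(-10.0, 10.0)
    stacked = w2 * (w1 * x + b1) + b2
    collapsed = w_prime * x + b_prime
    assert abs(stacked - collapsed) < 1e-9   # identical for every input
print("two linear layers == one linear layer")
```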
- How Activation Function solves this :
- activation functions add "bends" to the math, preventing layers from collapsing into each other -> breaking the chain of purely linear operations.
- Enabling Depth -> “secret sauce” that allows deep networks to learn complex, multi-layered representations.
- Universal Approximation -> non-linearity allows the network to model any continuous function, no matter how complex.
- Different Activations Functions
- sigmoid
- classic S-curve, outputs between 0 and 1.
- classic default for probability outputs.
- tanh
- zero centered (-1 to 1)
- often provides faster convergence than sigmoid.
- ReLU
- ![[Screenshot 2026-01-24 at 10.55.52 AM.png]]
- max(0, x) -> simple, efficient, and enables deep networks.
- Modern Variants
- GELU -> modern smooth version of ReLU.
- Swish -> x * sigmoid(x), another smooth ReLU-like variant.
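The four activations above, written out with only the standard library. A minimal sketch to compare their shapes at a couple of inputs:

```python
import math

def sigmoid(x):  # classic S-curve, outputs in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):     # zero-centered, outputs in (-1, 1)
    return math.tanh(x)

def relu(x):     # max(0, x): cheap, and works well in deep networks
    return max(0.0, x)

def swish(x):    # smooth ReLU-like variant: x * sigmoid(x)
    return x * sigmoid(x)

for f in (sigmoid, tanh, relu, swish):
    print(f.__name__, round(f(-2.0), 3), round(f(2.0), 3))
```

Note how ReLU zeroes out negative inputs entirely, while swish lets a small negative signal through — that smoothness is what the "modern variants" bullet is about.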
- Loss Functions:
- where the feedback loop starts.
- i.e. by minimizing this score, ML systems get better
- loss = a single number measuring how wrong we are
- lower loss implies better predictions
- the training goal of an ML system is to minimize the loss.
- most of the time we use mean squared error (MSE) to compute the loss:
- L = (1/n) \sum (prediction - target)^2
- squaring makes all the errors positive
- big errors will be penalized more than small errors
- good for regression tasks.
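The MSE formula above as a direct one-liner, with toy numbers chosen to show the quadratic penalty:

```python
def mse(predictions, targets):
    """Mean squared error: average of the squared differences (always >= 0)."""
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

# A big miss (error 3 -> penalty 9) costs far more than a small one (error 1 -> penalty 1).
print(mse([2.0, 5.0], [1.0, 2.0]))  # (1 + 9) / 2 = 5.0
```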
- Backpropagation:
- ![[Screenshot 2026-01-24 at 11.06.04 AM.png]]
- how does the machine know which weights to adjust to minimize the loss?
- it traces backwards through the network to find which weights contributed to the error.
- Intuition for backprop:
- in a big org, something went wrong -> why are things down?
- the CEO asks the VPs, the VPs ask the managers, and the managers ask the employees
- they are backpropagating to find the error and optimize the process
- basically a chain of responsibility.
- error signal travels backwards from the output layer through the hidden layers to the input.
- adjustment: each weight is adjusted proportionally to its contribution to the final mistake.
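The "proportional adjustment" can be made concrete on the smallest possible network: one weight, one bias, squared loss. A minimal sketch of the chain rule doing the credit assignment (the input, target, and learning rate are made up):

```python
# Tiny "network": pred = w*x + b, loss L = (pred - target)^2.
x, target = 2.0, 10.0
w, b = 1.0, 0.0

pred = w * x + b              # forward pass
loss = (pred - target) ** 2   # measure how wrong we are

# backward pass (chain rule): how much did each parameter contribute?
dL_dpred = 2 * (pred - target)   # how loss changes with the prediction
dL_dw = dL_dpred * x             # w's blame, scaled by the input it multiplied
dL_db = dL_dpred * 1.0           # b's blame

# adjust each parameter proportionally to its contribution to the mistake
lr = 0.01
w -= lr * dL_dw
b -= lr * dL_db
print(w, b)  # both nudged in the direction that lowers the loss
```

After one update, recomputing `(w*x + b - target)**2` gives a smaller loss than before — the blame-proportional step really does reduce the error.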
- Learning Rate:
- a hyperparameter -> a tunable value we set ourselves.
- controls how fast or slow we move while minimizing the loss
- ![[Screenshot 2026-01-24 at 11.15.52 AM.png]]
- (loss with a 2d graph and 3d repr)
- loss landscape -> visualize how the loss looks for our loss function
- we try to move in the direction of steepest descent (against the gradient)
- the learning rate tells us how big each step is
- a high learning rate might get us there quickly, or it might overshoot the minimum
- a low learning rate might take a very long time to reach the optimum
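The too-slow / just-right / overshoot trade-off can be seen on the simplest loss landscape, f(x) = x², whose gradient is 2x and whose minimum is at 0. A toy sketch (the three rates are arbitrary picks):

```python
# Gradient descent on f(x) = x^2, starting from x = 5.
def descend(lr, steps=20, x=5.0):
    for _ in range(steps):
        x -= lr * 2 * x      # step against the gradient
    return x

print(descend(0.01))   # too small: after 20 steps, barely moved toward 0
print(descend(0.3))    # reasonable: very close to the minimum
print(descend(1.1))    # too large: overshoots back and forth, diverging
```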
- Gradient Descent:
- Strategy:
- Step Downhill : Feel the slope and take a step in that direction that reduces the loss.
- Iterative Progress : repeat the process, taking step after step until we reach the bottom
- Goal : reach the global minimum - the point where the model's error is lowest.
- Whole Training Loop:
- ![[Screenshot 2026-01-24 at 11.31.22 AM.png]]
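All the pieces above — random guess, forward pass, MSE loss, gradients, and step-downhill updates — fit into one loop. A minimal from-scratch sketch that fits a single linear "neuron" to y = 3x + 2 (the data, learning rate, and epoch count are made up for the toy):

```python
import random

random.seed(1)
# Toy dataset: points on the line y = 3x + 2.
data = [(x, 3.0 * x + 2.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
w, b = random.random(), random.random()   # start with random guesses
lr = 0.05

for epoch in range(500):
    dw = db = 0.0
    for x, y in data:
        pred = w * x + b                        # forward pass: make a guess
        dw += 2 * (pred - y) * x / len(data)    # gradient of MSE w.r.t. w
        db += 2 * (pred - y) / len(data)        # gradient of MSE w.r.t. b
    w -= lr * dw                                # adjust: step downhill
    b -= lr * db                                # repeat until the error ~ zeroes

print(round(w, 2), round(b, 2))  # converges to 3.0 and 2.0
```

This is the whole training loop in miniature: the only things a full network adds are more parameters, an activation between layers, and backpropagation to route the gradients through them.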