Ting-Yu Dai
November 25, 2025
The trade-off between different regularization methods
Today, I want to start with an interview question that I encountered two years ago. What is the difference between lasso and ridge regularization?
Let's start with a linear regression:

$$\hat{y} = w x$$

It is a one-parameter linear regression, and the goal is to find the optimal $w$ that minimizes the error between $\hat{y}$ and $y$. We can set our loss function to the squared loss:

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - w x_i)^2$$
Let's say we want to optimize $w$ through gradient descent. We calculate the slope, i.e. the derivative of the loss function with respect to $w$, and move in the opposite direction.
Calculate Gradient:

$$\frac{\partial L}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i \, (y_i - w x_i)$$

(This is roughly "Error × Input".)
Update Rule: We update the weight using a learning rate $\eta$ (a small number like 0.01):

$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$

If the gradient is positive (the slope goes up), we subtract to go down. If it is negative, we add to go up. We keep doing this until the gradient is zero (the bottom of the valley).
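The loop above can be sketched in a few lines. This is a minimal illustration, not code from the post; the data and learning rate are made up (here $y = 3x$, so the optimum is $w = 3$):

```python
# Minimal sketch of one-parameter gradient descent on squared loss.
# Illustrative data: y = 3x, so the optimal weight is w = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0      # initial guess
eta = 0.01   # learning rate

for _ in range(1000):
    # dL/dw = -(2/n) * sum(x_i * (y_i - w * x_i))  -- roughly "Error x Input"
    grad = -2.0 / len(xs) * sum(x * (y - w * x) for x, y in zip(xs, ys))
    w -= eta * grad  # step opposite to the slope

print(round(w, 4))  # converges toward 3.0
```

Each step moves $w$ a fraction of the way toward the minimum, so the error shrinks geometrically until the gradient is essentially zero.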
Regularization is a technique that implements the trade-off between bias and variance to help reduce prediction error. From an equation perspective, it adds a penalty on the weight that constrains the weight from growing large. Lasso regularization, i.e. L1 regularization, adds the absolute value of the weight, so the new loss function becomes:

$$L_{L1} = \frac{1}{n} \sum_{i=1}^{n} (y_i - w x_i)^2 + \lambda |w|$$

(Where $\lambda$ is the regularization strength.)
After adding that penalty term, how does it impact the optimization? First, the derivative of $\lambda |w|$ is either $+\lambda$ (if $w > 0$) or $-\lambda$ (if $w < 0$).
The New Update Rule:

$$w \leftarrow w - \eta \left( \frac{\partial L}{\partial w} + \lambda \, \mathrm{sign}(w) \right)$$

What does that mean? Looking at the equation, we keep applying a constant subtraction (of $\eta \lambda$) that pushes the weight toward zero. You may also notice that the strength of that push does not depend on $w$, so even as $w$ gets smaller, the push stays the same, and the weight can hit exactly 0. Once $w$ hits $0$, the gradient of $|w|$ becomes undefined (a "sub-gradient" interval $[-\lambda, \lambda]$), effectively trapping the weight at zero unless the data error is large enough to push it out.
Result: L1 is aggressive. It can force the parameter to become exactly zero.
On the other hand, Ridge regularization, i.e. L2 regularization, adds the square of the weight as the penalty term:

$$L_{L2} = \frac{1}{n} \sum_{i=1}^{n} (y_i - w x_i)^2 + \lambda w^2$$

The derivative of $\lambda w^2$ is $2 \lambda w$.

The New Update Rule:

$$w \leftarrow w - \eta \left( \frac{\partial L}{\partial w} + 2 \lambda w \right)$$

Now the clear difference compared to L1 is that the penalty depends on $w$ itself. If $w$ is large, the penalty is large, and vice versa. Moreover, as $w$ approaches zero, the penalty force, i.e. $2 \lambda w$, also approaches zero. It pushes hard at the beginning and becomes vanishingly small as you get close to zero. As a result, you get extremely small values that never hit exactly zero.
Result: L2 is conservative. It shrinks the parameter to be very small, but rarely eliminates it entirely.
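The contrast between the two update rules can be isolated in a short sketch. The numbers here are illustrative, and the L1 step uses the standard trick of clipping at zero so the constant push cannot overshoot past it; the data gradient is left out entirely so only the penalty's pull is visible:

```python
# Contrast the two penalties' pull on a weight, with no data term at all.
# eta and lam are illustrative values.
eta, lam = 0.1, 1.0
w_l1 = w_l2 = 5.0

for _ in range(100):
    # L1: constant push of eta * lam toward zero. Clip at zero so the
    # fixed-size step cannot overshoot past it (the standard proximal trick).
    step = eta * lam
    w_l1 = max(w_l1 - step, 0.0) if w_l1 > 0 else min(w_l1 + step, 0.0)

    # L2: push proportional to the weight itself (derivative of lam * w^2
    # is 2 * lam * w), so the step shrinks as the weight shrinks.
    w_l2 -= eta * (2.0 * lam * w_l2)

print(w_l1)  # reaches exactly 0.0 after ~50 steps
print(w_l2)  # tiny (around 1e-9 here), but never exactly zero
```

The L1 weight walks to zero in equal-sized steps and stays there; the L2 weight decays by a constant factor each step, so it only ever approaches zero.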
Much of the explanation below is adapted from here.
In simple terms, imagine your data labels as a distribution. The bias represents the average difference between your predictions and the true values, indicating the ability of the model to capture the centroid of the true value distribution.

On the other hand, variance is a measure of variability, i.e. the spread of the predicted values. I picture it as the width of the distribution. Lower variance means the model is limited to a small space when giving predictions.
Combining the previous example with the bias and variance concepts, let's first think about what could happen in the one-parameter example. An intuitive thought: since it contains only one changeable parameter, it must be hard to give a good generalization of the underlying trend. That produces high errors on both the training and testing datasets, i.e. high bias, i.e. underfitting.

Overfitting occurs when a model is fitted too hard and starts to model the noise of the training dataset. Metric-wise, it commonly shows low error in training and high error in testing, i.e. high variance.

The names basically explain themselves. Underfitting means the model fits badly, so it performs badly everywhere. Overfitting means the model fits so hard that it sticks to the training dataset.
Why is adding that penalty term considered a cure for overfitting? Remember, the overall point of regularization is to keep the weights small via the penalty term. Put another way, as the weights become smaller, the variance becomes smaller: with a small weight, even a highly perturbed input produces a controlled change in the output, because the output shift is just the weight times the input shift.
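That claim can be checked in a couple of lines (numbers made up): for a linear model, an input perturbation moves the prediction by exactly weight × perturbation, so shrinking the weight shrinks the sensitivity.

```python
# For a linear model, an input perturbation of `noise` moves the
# prediction by exactly w * noise. Values are illustrative.
x, noise = 10.0, 2.0
for w in (5.0, 0.5):  # large weight vs. regularized (small) weight
    shift = w * (x + noise) - w * x
    print(w, shift)   # the shift shrinks along with the weight
```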
Let's say you want to predict energy consumption using a linear regression model. The input data is building metadata such as the square footage of the space, the year the building was constructed, the HVAC system efficiency, etc. Among all these parameters, we happen to have two very similar, highly correlated variables.
What would happen with Lasso and Ridge here?

L1 (Lasso) Prediction Behavior: it tends to pick one of the two correlated variables and drive the other's weight to exactly zero.

L2 (Ridge) Prediction Behavior: it tends to spread the weight across both variables, effectively averaging them.
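This selection behavior can be reproduced in a toy sketch with hand-rolled solvers rather than a library (all numbers illustrative): two identical input columns explaining $y = 2x$, coordinate descent with soft-thresholding for Lasso, and plain gradient descent for Ridge.

```python
# Two identical features explaining y = 2x: Lasso picks one, Ridge splits.
xs = [1.0, 2.0, 3.0, 4.0]        # both features are this same column
ys = [2.0 * x for x in xs]

def soft_threshold(z, t):
    """Shrink z toward zero by t; snap to exactly zero inside [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

# --- Lasso via cyclic coordinate descent on (1/2)||y - Xw||^2 + lam*||w||_1
lam = 0.5
sq = sum(x * x for x in xs)      # column norm, same for both features
w = [0.0, 0.0]
for _ in range(100):
    for j in (0, 1):
        other = w[1 - j]
        # correlation of feature j with the residual excluding feature j
        rho = sum(x * (y - other * x) for x, y in zip(xs, ys))
        w[j] = soft_threshold(rho / sq, lam / sq)
lasso_w = w[:]

# --- Ridge via gradient descent on ||y - Xw||^2 + lam*(w0^2 + w1^2)
eta = 0.01
w = [0.0, 0.0]
for _ in range(5000):
    err = [y - (w[0] + w[1]) * x for x, y in zip(xs, ys)]
    g_data = -2.0 * sum(x * e for x, e in zip(xs, err))  # same for both
    w = [wj - eta * (g_data + 2.0 * lam * wj) for wj in w]
ridge_w = w

print(lasso_w)   # one weight carries the signal; the other collapses to ~0
print(ridge_w)   # the two weights end up equal and both nonzero
```

Coordinate descent updates one weight at a time, so the first feature absorbs the signal and the soft-threshold snaps the second to zero; Ridge's gradient is symmetric in the two identical columns, so they keep equal weights throughout.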
Now imagine the dataset suddenly contains a completely irrelevant variable, say the "tea price".

L1 Prediction Behavior: Lasso tends to drive the weight on the irrelevant variable to exactly zero, so the prediction ignores it entirely.

L2 Prediction Behavior: Ridge keeps a small but nonzero weight on it, so the prediction still reacts slightly to its noise.
Imagine a sensor glitches and sends a value of 1,000,000 for a feature that is usually between 1 and 10. If L1 zeroed that feature's weight, the glitch has no effect at all; under L2, even a tiny surviving weight multiplied by 1,000,000 can still swing the prediction noticeably.
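A back-of-the-envelope check (the weights are made-up stand-ins for a trained model, not from the text): the damage from the glitch is just weight × glitch.

```python
# Impact of a glitched reading on the prediction, for a feature whose
# weight L1 zeroed out vs. one L2 merely shrank. Weights are illustrative.
glitch = 1_000_000.0
w_l1, w_l2 = 0.0, 0.001

print(w_l1 * glitch)  # 0.0     -> L1 ignores the glitch entirely
print(w_l2 * glitch)  # 1000.0  -> small weight, huge input, big swing
```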
| Scenario | L1 (Lasso) Behavior | L2 (Ridge) Behavior |
|---|---|---|
| New data has noise | Ignores it (if weight is 0). | Reacts slightly (all weights are active). |
| Two inputs are identical | Picks one, kills the other. | Averages them (uses both). |
| Model "Personality" | Specialist: Relies heavily on a few key factors. | Generalist: Relies on a weighted sum of everything. |