Linear Regression From Scratch in NumPy

Linear regression is a model for predicting a numeric target. It assumes the prediction can be expressed as a weighted sum of input features plus an intercept:

y_hat = X @ w + b

The model learns w and b by making the prediction error small. In ordinary least squares, that means minimizing the sum of squared residuals between observed targets and predicted targets.
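As a minimal sketch of that objective, the snippet below computes the sum of squared residuals for one candidate pair of parameters. The three data points and the names X_toy, y_toy, w_toy, and b_toy are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical toy values, not part of the rent dataset used later.
X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([2.0, 4.0, 6.5])
w_toy = np.array([2.0])
b_toy = 0.0

# Residual = observed target minus predicted target.
residuals = y_toy - (X_toy @ w_toy + b_toy)

# Ordinary least squares looks for the w and b that make this number smallest.
ssr = np.sum(residuals**2)
print(residuals)
print(ssr)
```

Only the last point is missed, by 0.5, so the sum of squared residuals for this candidate is 0.25.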

That definition matters because it draws a useful boundary:

predicting rent, salary, temperature, latency, or revenue is a regression problem;
predicting whether something is a cat, spam message, fraud attempt, or churned customer is a classification problem.

Classification can still use a linear score internally, but once you add a sigmoid and train against class labels, you are usually talking about logistic regression, not linear regression. More on that at the end.

A Small Rent Dataset

This toy dataset has two features:

number of rooms;
area in square meters.

The target is monthly rent.

python

import numpy as np

data = np.array(
    [
        # rooms, square meters, rent
        [3, 62, 798],
        [1, 35, 454],
        [2, 38, 615],
        [3, 100, 1474],
        [1, 37, 491],
        [2, 80, 921],
        [2, 82, 983],
        [2, 80, 1044],
        [3, 107, 1290],
        [2, 80, 1413],
    ],
    dtype=float,
)

X = data[:, :2]
y = data[:, 2]

The dataset is intentionally tiny. It is good enough for learning the mechanics, not good enough for estimating real market rent.

Scale the Features

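Gradient descent behaves better when features live on a comparable scale. Here rooms ranges from 1 to 3, while square meters ranges from 35 to 107. Without scaling, the gradient for the area weight is far larger than the one for the rooms weight, so no single learning rate suits both: a rate small enough to keep the area weight stable makes the rooms weight crawl.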

Use the training mean and standard deviation, then keep those values for future predictions:

python

x_mean = X.mean(axis=0)
x_std = X.std(axis=0)
X_scaled = (X - x_mean) / x_std

This is the same idea behind StandardScaler in scikit-learn. Do not recompute those statistics from each new prediction request; the model must see new data through the same transformation it saw during training.
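A quick sanity check can confirm that the scaling did what it should. The sketch below repeats the feature columns so it runs on its own, checks that each standardized column has mean 0 and standard deviation 1, and compares the condition number of X.T @ X before and after scaling. The condition number is used here as a rough proxy for how hard the problem is for gradient descent; the names cond_raw and cond_scaled are introduced only for this illustration.

```python
import numpy as np

# Feature columns copied from the rent dataset above: rooms, square meters.
X = np.array(
    [[3, 62], [1, 35], [2, 38], [3, 100], [1, 37],
     [2, 80], [2, 82], [2, 80], [3, 107], [2, 80]],
    dtype=float,
)

x_mean = X.mean(axis=0)
x_std = X.std(axis=0)  # NumPy default is ddof=0, the population standard deviation
X_scaled = (X - x_mean) / x_std

# Each standardized column should have mean ~0 and standard deviation ~1.
print(np.allclose(X_scaled.mean(axis=0), 0.0))
print(np.allclose(X_scaled.std(axis=0), 1.0))

# X.T @ X is proportional to the curvature of the squared-error loss in w.
# A smaller condition number means a single learning rate fits all directions better.
cond_raw = np.linalg.cond(X.T @ X)
cond_scaled = np.linalg.cond(X_scaled.T @ X_scaled)
print(cond_raw, cond_scaled)
```

The scaled condition number comes out far smaller than the raw one, which is the numerical side of "behaves better". Note that np.std with its default ddof=0 matches the estimator StandardScaler uses.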

The Model, Loss, and Gradients

The prediction function is only a matrix multiplication plus an intercept:

python

def predict(X, w, b):
    return X @ w + b

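The loss is mean squared error: the sum of squared residuals divided by the number of rows. Dividing by a constant does not move the minimum, so minimizing it gives the same fit as ordinary least squares: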

python

def mse(X, y, w, b):
    error = predict(X, w, b) - y
    return np.mean(error**2)

For this loss, the gradients are:

python

def gradients(X, y, w, b):
    error = predict(X, w, b) - y
    m = len(X)
    dw = (2 / m) * X.T @ error
    db = 2 * np.mean(error)
    return dw, db
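Hand-derived gradients are easy to get wrong by a sign or a factor of 2, so it can help to compare them against a finite-difference estimate. The sketch below is one way to do that, not part of the original walkthrough: numerical_gradients, the random test data, and the names ending in _check are all hypothetical helpers introduced for this check.

```python
import numpy as np

def predict(X, w, b):
    return X @ w + b

def mse(X, y, w, b):
    return np.mean((predict(X, w, b) - y) ** 2)

def gradients(X, y, w, b):
    error = predict(X, w, b) - y
    return (2 / len(X)) * X.T @ error, 2 * np.mean(error)

def numerical_gradients(X, y, w, b, eps=1e-6):
    """Central-difference estimate of d(mse)/dw and d(mse)/db."""
    dw = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        dw[j] = (mse(X, y, w + step, b) - mse(X, y, w - step, b)) / (2 * eps)
    db = (mse(X, y, w, b + eps) - mse(X, y, w, b - eps)) / (2 * eps)
    return dw, db

# Random test problem with a fixed seed so the check is repeatable.
rng = np.random.default_rng(0)
X_check = rng.normal(size=(20, 2))
y_check = rng.normal(size=20)
w_check = rng.normal(size=2)
b_check = 0.5

dw, db = gradients(X_check, y_check, w_check, b_check)
dw_num, db_num = numerical_gradients(X_check, y_check, w_check, b_check)

print(np.allclose(dw, dw_num, atol=1e-5))
print(np.isclose(db, db_num, atol=1e-5))
```

Both checks should print True. If either prints False, the analytic gradient and the loss function disagree and one of them has a bug.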

Then gradient descent repeatedly moves the parameters in the opposite direction of the gradient:

python

w = np.zeros(X_scaled.shape[1])
b = 0.0
learning_rate = 0.05

for _ in range(5000):
    dw, db = gradients(X_scaled, y, w, b)
    w -= learning_rate * dw
    b -= learning_rate * db

print("weights:", w)
print("bias:", b)
print("mse:", mse(X_scaled, y, w, b))

On this dataset the output is:

weights: [ 12.74533963 307.78532541]
bias: 948.2999999999995
mse: 20084.833366523806

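Because the standardized features have zero mean, the fitted bias equals the mean rent in the training data (948.3), up to floating-point error. The weights are in units of rent per standard deviation of each feature. The larger coefficient on area says that, in this small sample, a one-standard-deviation change in area moves the prediction far more than a one-standard-deviation change in room count.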
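Weights fitted on standardized features can be converted back to the original units, which makes them easier to read as "rent per room" and "rent per square meter". The sketch below repeats the data and the training loop so it runs on its own; the names w_raw and b_raw are introduced here for illustration and do not appear elsewhere in the article.

```python
import numpy as np

data = np.array(
    [[3, 62, 798], [1, 35, 454], [2, 38, 615], [3, 100, 1474], [1, 37, 491],
     [2, 80, 921], [2, 82, 983], [2, 80, 1044], [3, 107, 1290], [2, 80, 1413]],
    dtype=float,
)
X, y = data[:, :2], data[:, 2]
x_mean, x_std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - x_mean) / x_std

# Same gradient descent settings as above: learning rate 0.05, 5000 steps.
w = np.zeros(2)
b = 0.0
for _ in range(5000):
    error = X_scaled @ w + b - y
    w -= 0.05 * (2 / len(y)) * X_scaled.T @ error
    b -= 0.05 * 2 * error.mean()

# Undo the scaling: ((x - mean) / std) * w + b == x * (w / std) + (b - sum(mean * w / std))
w_raw = w / x_std
b_raw = b - np.sum(w * x_mean / x_std)

print("rent per extra room:", w_raw[0])
print("rent per extra square meter:", w_raw[1])
print("intercept in raw units:", b_raw)

# Both parameterizations must give identical predictions.
print(np.allclose(X @ w_raw + b_raw, X_scaled @ w + b))
```

The raw-unit intercept is an extrapolation to zero rooms and zero square meters, so it has no practical meaning on its own.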

Make a Prediction

To predict a new apartment, scale it with the same x_mean and x_std:

python

apartment = np.array([[2, 80]], dtype=float)
apartment_scaled = (apartment - x_mean) / x_std

rent = predict(apartment_scaled, w, b)[0]
print(rent)

The model predicts:

1069.701285769946

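Read that as “the fitted model says about 1070”, not as a market price. The training data itself contains three apartments with two rooms and 80 square meters, rented at 921, 1044, and 1413, so the spread around that prediction is wide. There are only ten observations, no location feature, no train/test split, and several hidden variables that matter more than this toy model can know.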

Check Against the Closed-Form Solution

For ordinary least squares, we can also solve the same problem directly with linear algebra:

python

X_design = np.c_[np.ones(len(X_scaled)), X_scaled]
theta = np.linalg.lstsq(X_design, y, rcond=None)[0]

print(theta)

The result is the same fit:

[948.3         12.74533963 307.78532541]

The first value is the intercept. The next two are the feature weights. Gradient descent is useful for learning how optimization works, and it extends to models that have no closed-form solution or where computing one is too expensive. For this small ordinary least squares problem, the closed-form solution is the cleaner verification.
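Rather than comparing printed digits by eye, the agreement can be checked in code. The sketch below is self-contained: it reruns gradient descent, solves the same problem with np.linalg.lstsq, and also solves the normal equations with np.linalg.solve as a third route. The names theta_gd and theta_normal are introduced only for this comparison.

```python
import numpy as np

data = np.array(
    [[3, 62, 798], [1, 35, 454], [2, 38, 615], [3, 100, 1474], [1, 37, 491],
     [2, 80, 921], [2, 82, 983], [2, 80, 1044], [3, 107, 1290], [2, 80, 1413]],
    dtype=float,
)
X, y = data[:, :2], data[:, 2]
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Route 1: gradient descent with the settings used earlier.
w = np.zeros(2)
b = 0.0
for _ in range(5000):
    error = X_scaled @ w + b - y
    w -= 0.05 * (2 / len(y)) * X_scaled.T @ error
    b -= 0.05 * 2 * error.mean()
theta_gd = np.concatenate(([b], w))

# Route 2: least squares on the design matrix with a leading column of ones.
X_design = np.c_[np.ones(len(X_scaled)), X_scaled]
theta = np.linalg.lstsq(X_design, y, rcond=None)[0]

# Route 3: normal equations, (X^T X) theta = X^T y.
theta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

print(np.allclose(theta_gd, theta))
print(np.allclose(theta_normal, theta))
```

Solving the normal equations directly is fine for a small, well-conditioned problem like this one. lstsq is generally the safer default because it does not form X^T X explicitly.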

What This Example Leaves Out

A production regression model needs more discipline:

split data into training and validation sets;
keep preprocessing in a reproducible pipeline;
measure error on data the model did not train on;
watch out for outliers and correlated features;
consider Ridge or Lasso when coefficients are unstable;
use a maintained library such as scikit-learn unless the goal is to learn the math.

The from-scratch version is valuable because it makes the mechanics visible. It is not a replacement for battle-tested model code.
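The first three items on that list can be sketched in plain NumPy. With only ten rows the resulting numbers are not meaningful; the point is the order of operations, in particular that the scaling statistics come from the training rows only. The seed, the 7/3 split, and the helper name design are arbitrary choices made for this illustration.

```python
import numpy as np

data = np.array(
    [[3, 62, 798], [1, 35, 454], [2, 38, 615], [3, 100, 1474], [1, 37, 491],
     [2, 80, 921], [2, 82, 983], [2, 80, 1044], [3, 107, 1290], [2, 80, 1413]],
    dtype=float,
)
X, y = data[:, :2], data[:, 2]

# Shuffle row indices with a fixed seed, then hold out three rows for validation.
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
train_idx, val_idx = order[:7], order[7:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# Scaling statistics are computed from the training rows only.
x_mean = X_train.mean(axis=0)
x_std = X_train.std(axis=0)

def design(X_part):
    """Scale with the training statistics and prepend the intercept column."""
    scaled = (X_part - x_mean) / x_std
    return np.c_[np.ones(len(scaled)), scaled]

theta = np.linalg.lstsq(design(X_train), y_train, rcond=None)[0]

train_mse = np.mean((design(X_train) @ theta - y_train) ** 2)
val_mse = np.mean((design(X_val) @ theta - y_val) ** 2)
print("train mse:", train_mse)
print("validation mse:", val_mse)
```

In a real project the same steps would normally be handled by scikit-learn's train_test_split and a Pipeline, so the preprocessing cannot drift between training and prediction.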

Classification Is Not Linear Regression

A common beginner mistake is to say: “I will use linear regression for classification, then pass the result through a sigmoid.”

That describes the shape of logistic regression, but not ordinary least squares linear regression. Logistic regression uses a linear score internally:

z = X @ w + b

Then it maps the score to a probability:

p = 1 / (1 + exp(-z))

And it is trained with a classification loss, typically log loss. The target is categorical, and the result is interpreted as a probability or class decision.
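The two pieces that separate logistic regression from linear regression, the sigmoid and the log loss, fit in a few lines. This is a minimal sketch under stated assumptions: the scores and labels are made up, and the clipping constant eps is an illustrative guard against log(0), not a recommended setting.

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p, eps=1e-12):
    """Mean binary cross-entropy; p is clipped so log() never sees 0."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical linear scores z = X @ w + b and their true class labels.
z = np.array([-2.0, 0.0, 3.0])
labels = np.array([0.0, 1.0, 1.0])

p = sigmoid(z)
print(p)
print(log_loss(labels, p))
```

A score of 0 maps to a probability of exactly 0.5, which is why 0 is the natural decision boundary on the linear score.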

So the practical rule is simple:

use linear regression when the target is a continuous number;
use logistic regression or another classifier when the target is a class.

Keeping that distinction clear prevents a lot of quiet modeling bugs.
