    Linear Regression From Scratch in NumPy

    Linear regression is a model for predicting a numeric target. It assumes the prediction can be expressed as a weighted sum of the input features plus an intercept:

    y_hat = X @ w + b

    The model learns w and b by making the prediction error small. In ordinary least squares, that means minimizing the sum of squared residuals between observed and predicted targets. Dividing that sum by the number of rows gives the mean squared error, which has the same minimizer.
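
    To make "sum of squared residuals" concrete, here is a minimal sketch on three made-up rows. The numbers and the helper name sum_squared_residuals are illustrative assumptions, not part of the rent example that follows.

```python
import numpy as np

# Made-up numbers for illustration only: one feature, three rows.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])  # generated as y = 2 * x + 1


def sum_squared_residuals(X, y, w, b):
    """Sum of squared differences between observed and predicted targets."""
    residuals = y - (X @ w + b)
    return float(np.sum(residuals**2))


# The generating parameters leave no residual; a slightly wrong slope does.
perfect = sum_squared_residuals(X, y, np.array([2.0]), 1.0)
worse = sum_squared_residuals(X, y, np.array([1.5]), 1.0)
print(perfect, worse)  # 0.0 3.5
```

    Least squares picks the w and b that make this quantity as small as possible.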

    That definition matters because it draws a useful boundary:

    • predicting rent, salary, temperature, latency, or revenue is a regression problem;
    • predicting whether something is a cat, spam message, fraud attempt, or churned customer is a classification problem.

    Classification can still use a linear score internally, but once you add a sigmoid and train against class labels, you are usually talking about logistic regression, not linear regression. More on that at the end.

    A Small Rent Dataset

    This toy dataset has two features:

    • number of rooms;
    • area in square meters.

    The target is monthly rent.

    import numpy as np
    
    data = np.array(
        [
            # rooms, square meters, rent
            [3, 62, 798],
            [1, 35, 454],
            [2, 38, 615],
            [3, 100, 1474],
            [1, 37, 491],
            [2, 80, 921],
            [2, 82, 983],
            [2, 80, 1044],
            [3, 107, 1290],
            [2, 80, 1413],
        ],
        dtype=float,
    )
    
    X = data[:, :2]
    y = data[:, 2]

    The dataset is intentionally tiny. It is good enough for learning the mechanics, not good enough for estimating real market rent.

    Scale the Features

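    Gradient descent behaves better when features live on a comparable scale. Here rooms ranges from 1 to 3, while square meters ranges from 35 to 107. Without scaling, the gradient for the square-meters weight is much larger than the one for the rooms weight, so a single learning rate is either too large for one or too slow for the other.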
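
    One rough way to see the effect is to compare the condition number of the loss curvature before and after standardizing (subtracting the column mean and dividing by the column standard deviation). The sketch below assumes a mean squared error loss; the helper name mse_hessian is hypothetical. A large condition number means the loss surface is a long, narrow valley, which is where plain gradient descent struggles.

```python
import numpy as np

# The two feature columns from the toy dataset: rooms, square meters.
X = np.array(
    [[3, 62], [1, 35], [2, 38], [3, 100], [1, 37],
     [2, 80], [2, 82], [2, 80], [3, 107], [2, 80]],
    dtype=float,
)


def mse_hessian(features):
    """Hessian of the MSE with respect to (b, w): (2 / m) * A.T @ A,
    where A is the feature matrix with a leading column of ones."""
    A = np.c_[np.ones(len(features)), features]
    return (2 / len(A)) * A.T @ A


X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = np.linalg.cond(mse_hessian(X))
cond_scaled = np.linalg.cond(mse_hessian(X_standardized))
print(f"condition number, raw features:          {cond_raw:.1f}")
print(f"condition number, standardized features: {cond_scaled:.1f}")
```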

    Use the training mean and standard deviation, then keep those values for future predictions:

    x_mean = X.mean(axis=0)
    x_std = X.std(axis=0)
    X_scaled = (X - x_mean) / x_std

    This is the same idea behind StandardScaler in scikit-learn. Do not recompute those statistics from each new prediction request; the model must see new data through the same transformation it saw during training.
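
    A minimal sketch of that separation between fitting the statistics and applying them. The function names fit_scaler and apply_scaler are hypothetical helpers for this article, not scikit-learn API.

```python
import numpy as np


def fit_scaler(X_train):
    """Return the per-column mean and standard deviation of the training data."""
    return X_train.mean(axis=0), X_train.std(axis=0)


def apply_scaler(X_new, mean, std):
    """Standardize rows using statistics computed earlier on the training data."""
    return (np.asarray(X_new, dtype=float) - mean) / std


X_train = np.array(
    [[3, 62], [1, 35], [2, 38], [3, 100], [1, 37],
     [2, 80], [2, 82], [2, 80], [3, 107], [2, 80]],
    dtype=float,
)
x_mean, x_std = fit_scaler(X_train)

# A single incoming request: scaling it by its own statistics would give
# 0 / 0, so the stored training statistics are the only sensible choice.
request = [[2, 80]]
print(apply_scaler(request, x_mean, x_std))
```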

    The Model, Loss, and Gradients

    The prediction function is only a matrix multiplication plus an intercept:

    def predict(X, w, b):
        return X @ w + b

    Mean squared error is easy to inspect:

    def mse(X, y, w, b):
        error = predict(X, w, b) - y
        return np.mean(error**2)

    For this loss, the gradients are:

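    def gradients(X, y, w, b):
        # Residuals: prediction minus target, one entry per row.
        error = predict(X, w, b) - y
        m = len(X)
        # Partial derivatives of the mean squared error with respect to w and b.
        dw = (2 / m) * X.T @ error
        db = 2 * np.mean(error)
        return dw, db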
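
    Hand-derived gradients are easy to get subtly wrong, so it can help to compare them against finite differences. The sketch below repeats the three functions so it runs on its own; numerical_gradients and the random test data are assumptions added for the check.

```python
import numpy as np


def predict(X, w, b):
    return X @ w + b


def mse(X, y, w, b):
    return np.mean((predict(X, w, b) - y) ** 2)


def gradients(X, y, w, b):
    error = predict(X, w, b) - y
    return (2 / len(X)) * X.T @ error, 2 * np.mean(error)


def numerical_gradients(X, y, w, b, eps=1e-6):
    """Central finite differences of the MSE, one parameter at a time."""
    dw = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        dw[j] = (mse(X, y, w + step, b) - mse(X, y, w - step, b)) / (2 * eps)
    db = (mse(X, y, w, b + eps) - mse(X, y, w, b - eps)) / (2 * eps)
    return dw, db


# Random data and parameters, used only to exercise the formulas.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = rng.normal(size=20)
w = rng.normal(size=2)
b = 0.3

dw, db = gradients(X, y, w, b)
dw_num, db_num = numerical_gradients(X, y, w, b)
print("dw matches:", np.allclose(dw, dw_num, atol=1e-5))
print("db matches:", np.isclose(db, db_num, atol=1e-5))
```

    If either check fails after a change to the loss, the analytic gradient no longer matches it.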

    Then gradient descent repeatedly moves the parameters in the opposite direction of the gradient:

    w = np.zeros(X_scaled.shape[1])
    b = 0.0
    learning_rate = 0.05
    
    for _ in range(5000):
        dw, db = gradients(X_scaled, y, w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
    
    print("weights:", w)
    print("bias:", b)
    print("mse:", mse(X_scaled, y, w, b))

    On this dataset the output is:

    weights: [ 12.74533963 307.78532541]
    bias: 948.2999999999995
    mse: 20084.833366523806

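    Because the features were standardized, each column has zero mean, so the bias converges to the mean rent in the training data, 948.3. The area coefficient is much larger than the rooms coefficient, so in this small sample the fitted rent tracks area far more closely than room count. The two features are correlated, though, so read the individual weights as a description of this fit rather than as separate effects.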
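
    Two quick sanity checks on the training run are whether the loss actually went down and whether the bias landed on the mean rent. The sketch below repeats the data and the loop so it runs on its own; the losses list is an addition for inspection only.

```python
import numpy as np

data = np.array(
    [[3, 62, 798], [1, 35, 454], [2, 38, 615], [3, 100, 1474], [1, 37, 491],
     [2, 80, 921], [2, 82, 983], [2, 80, 1044], [3, 107, 1290], [2, 80, 1413]],
    dtype=float,
)
X, y = data[:, :2], data[:, 2]
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

w = np.zeros(X_scaled.shape[1])
b = 0.0
learning_rate = 0.05
losses = []  # MSE measured before each update

for _ in range(5000):
    error = X_scaled @ w + b - y
    losses.append(np.mean(error**2))
    w -= learning_rate * (2 / len(y)) * X_scaled.T @ error
    b -= learning_rate * 2 * np.mean(error)

print("first and last recorded loss:", losses[0], losses[-1])
print("bias equals mean rent:", np.isclose(b, y.mean()))
```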

    Make a Prediction

    To predict a new apartment, scale it with the same x_mean and x_std:

    apartment = np.array([[2, 80]], dtype=float)
    apartment_scaled = (apartment - x_mean) / x_std
    
    rent = predict(apartment_scaled, w, b)[0]
    print(rent)

    The model predicts:

    1069.701285769946

    Read that as “the fitted line says about 1070”, not as a market price. There are only ten observations, no location feature, no train/test split, and several hidden variables that matter more than this toy model can know.
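
    The weights above are expressed per standard deviation of each feature. If you want them per room and per square meter, the scaling can be undone algebraically. This is a sketch under the same setup; w_raw and b_raw are names introduced here for illustration.

```python
import numpy as np

data = np.array(
    [[3, 62, 798], [1, 35, 454], [2, 38, 615], [3, 100, 1474], [1, 37, 491],
     [2, 80, 921], [2, 82, 983], [2, 80, 1044], [3, 107, 1290], [2, 80, 1413]],
    dtype=float,
)
X, y = data[:, :2], data[:, 2]
x_mean, x_std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - x_mean) / x_std

# Same gradient descent loop as in the article, written compactly.
w, b = np.zeros(2), 0.0
for _ in range(5000):
    error = X_scaled @ w + b - y
    w -= 0.05 * (2 / len(y)) * X_scaled.T @ error
    b -= 0.05 * 2 * np.mean(error)

# Undo the scaling:
#   w . (x - mean) / std + b  ==  (w / std) . x + (b - sum(w * mean / std))
w_raw = w / x_std
b_raw = b - np.sum(w * x_mean / x_std)

apartment = np.array([[2, 80]], dtype=float)
scaled_prediction = ((apartment - x_mean) / x_std) @ w + b
raw_prediction = apartment @ w_raw + b_raw

print("rent per extra room, per extra square meter:", w_raw)
print("same prediction either way:", np.allclose(scaled_prediction, raw_prediction))
```

    The prediction does not change; only the units in which the weights are read do.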

    Check Against the Closed-Form Solution

    For ordinary least squares, we can also solve the same problem directly with linear algebra:

    X_design = np.c_[np.ones(len(X_scaled)), X_scaled]
    theta = np.linalg.lstsq(X_design, y, rcond=None)[0]
    
    print(theta)

    The result is the same fit:

    [948.3         12.74533963 307.78532541]

    The first value is the intercept; the next two are the feature weights. Gradient descent is worth learning because it exposes the optimization process and carries over to models that have no convenient closed-form solution. For this small ordinary least squares problem, the closed-form solution is the cleaner check.
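
    For reference, the textbook form of the closed-form solution is the normal equations, (A^T A) theta = A^T y, where A is the design matrix. The sketch below solves that system directly and compares it with np.linalg.lstsq. It is meant as an illustration; lstsq is generally the more robust choice when columns are nearly collinear.

```python
import numpy as np

data = np.array(
    [[3, 62, 798], [1, 35, 454], [2, 38, 615], [3, 100, 1474], [1, 37, 491],
     [2, 80, 921], [2, 82, 983], [2, 80, 1044], [3, 107, 1290], [2, 80, 1413]],
    dtype=float,
)
X, y = data[:, :2], data[:, 2]
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
X_design = np.c_[np.ones(len(X_scaled)), X_scaled]

# Solve the linear system instead of forming an explicit matrix inverse.
theta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
theta_lstsq = np.linalg.lstsq(X_design, y, rcond=None)[0]

# At the least squares solution the residuals are orthogonal to every column.
residuals = X_design @ theta_normal - y
print("same solution:", np.allclose(theta_normal, theta_lstsq))
print("gradient is zero:", np.allclose(X_design.T @ residuals, 0, atol=1e-6))
```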

    What This Example Leaves Out

    A production regression model needs more discipline:

    • split data into training and validation sets;
    • keep preprocessing in a reproducible pipeline;
    • measure error on data the model did not train on;
    • watch out for outliers and correlated features;
    • consider Ridge or Lasso when coefficients are unstable;
    • use a maintained library such as scikit-learn unless the goal is to learn the math.
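
    As a small illustration of the first three points, the sketch below runs leave-one-out validation in plain NumPy: each row is held out in turn, the scaling statistics and the fit are computed from the remaining nine rows, and the error is measured on the held-out row. The helper names fit_ols and predict_ols are hypothetical, and ten rows are far too few for the resulting number to mean much.

```python
import numpy as np

data = np.array(
    [[3, 62, 798], [1, 35, 454], [2, 38, 615], [3, 100, 1474], [1, 37, 491],
     [2, 80, 921], [2, 82, 983], [2, 80, 1044], [3, 107, 1290], [2, 80, 1413]],
    dtype=float,
)
X, y = data[:, :2], data[:, 2]


def fit_ols(X_train, y_train):
    """Standardize with training statistics, then solve least squares."""
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    A = np.c_[np.ones(len(X_train)), (X_train - mean) / std]
    theta = np.linalg.lstsq(A, y_train, rcond=None)[0]
    return mean, std, theta


def predict_ols(X_new, mean, std, theta):
    """Apply the stored scaling, then the fitted intercept and weights."""
    return np.c_[np.ones(len(X_new)), (X_new - mean) / std] @ theta


# Training error: fit and score on the same ten rows.
train_mse = np.mean((predict_ols(X, *fit_ols(X, y)) - y) ** 2)

# Leave-one-out: hold out each row in turn and fit on the other nine.
held_out_errors = []
for i in range(len(X)):
    keep = np.arange(len(X)) != i
    params = fit_ols(X[keep], y[keep])
    held_out_errors.append(predict_ols(X[~keep], *params)[0] - y[i])
loo_mse = np.mean(np.square(held_out_errors))

print(f"training MSE:      {train_mse:.0f}")
print(f"leave-one-out MSE: {loo_mse:.0f}")
```

    The held-out error is the more honest of the two numbers, because each prediction is made for a row the fit never saw.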

    The from-scratch version is valuable because it makes the mechanics visible. It is not a replacement for battle-tested model code.

    Classification Is Not Linear Regression

    A common beginner mistake is to say: “I will use linear regression for classification, then pass the result through a sigmoid.”

    That describes the shape of logistic regression, but not ordinary least squares linear regression. Logistic regression uses a linear score internally:

    z = X @ w + b

    Then it maps the score to a probability:

    p = 1 / (1 + exp(-z))

    It is then trained with a classification loss, typically log loss (binary cross-entropy). The target is a class label, and the output is interpreted as a probability or thresholded into a class decision.
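
    Put together, a minimal logistic regression sketch looks like the following. The six data points are made up for illustration, and the learning rate and step count are arbitrary choices that happen to work on them.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def log_loss(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # keep log() away from 0
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


# Made-up, already-centered feature with binary labels, for illustration only.
X = np.array([[-1.75], [-1.25], [-0.75], [0.75], [1.25], [1.75]])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

w = np.zeros(X.shape[1])
b = 0.0
learning_rate = 0.1
initial_loss = log_loss(y, sigmoid(X @ w + b))

for _ in range(2000):
    p = sigmoid(X @ w + b)
    # Per row, the derivative of the log loss with respect to the score z is (p - y).
    w -= learning_rate * X.T @ (p - y) / len(y)
    b -= learning_rate * np.mean(p - y)

p = sigmoid(X @ w + b)
final_loss = log_loss(y, p)
print("log loss before and after:", initial_loss, final_loss)
print("predicted classes:", (p >= 0.5).astype(int))
```

    The linear score is the same expression as in linear regression. What changes is the sigmoid on top of it, the loss being minimized, and the meaning of the output.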

    So the practical rule is simple:

    • use linear regression when the target is a continuous number;
    • use logistic regression or another classifier when the target is a class.

    Keeping that distinction clear prevents a lot of quiet modeling bugs.