Statistics & Data

How I Used Regression Analysis to Predict a Database Crisis 23 Days Early

When our PostgreSQL table started growing unpredictably, I built a simple regression model in an afternoon to estimate when we would run out of space. It worked. This is the practical case for regression analysis in software engineering.

Norehan Norrizan
10 min read

We had a PostgreSQL table that stored audit logs. For eighteen months it grew at a steady, predictable rate and we paid no attention to it. Then a product change caused a new type of event to be logged far more frequently, and the growth rate jumped. Nobody noticed at first — the table was still small relative to our storage. By the time the DBA flagged it, we needed to understand: how long do we have?

I spent about an afternoon pulling the table size data from pg_catalog, fitting a linear regression to the growth rate, and projecting forward. The model predicted we would run out of allocated storage in 23 days. We had actually allocated just enough headroom that a visual inspection of the graph looked "fine". The regression line made the collision point obvious.

We scheduled the cleanup window for day 18. Problem solved, no crisis. That experience taught me that regression analysis is not a data science tool — it is a practical engineering tool that happens to come from statistics.

What Regression Analysis Actually Is

Regression models the relationship between an outcome variable (the thing you want to understand or predict) and one or more input variables (the things you think influence it). The result is a mathematical function — a model — that you can use to predict new values or understand how much each input contributes to the outcome.

Linear regression is the simplest form: fit a straight line through the data. The line is described by y = β₀ + β₁x, where β₀ is the intercept and β₁ is the slope. The slope is the number you actually care about most: it tells you how much y changes per unit change in x.

import numpy as np
from sklearn.linear_model import LinearRegression

# Table size over 90 days (in GB)
days = np.array([1, 15, 30, 45, 60, 75, 90]).reshape(-1, 1)
size_gb = np.array([12.1, 12.8, 14.2, 17.1, 21.3, 26.8, 33.9])

model = LinearRegression()
model.fit(days, size_gb)

print(f"Growth rate:  {model.coef_[0]:.3f} GB/day")
print(f"Baseline:     {model.intercept_:.1f} GB")
print(f"R² (fit):     {model.score(days, size_gb):.3f}")

# Project to storage limit (60 GB)
# Solving: 60 = intercept + slope * day
days_until_full = (60 - model.intercept_) / model.coef_[0]
print(f"Days until 60 GB limit: {days_until_full:.0f}")

Reading the Output: What Actually Matters

The slope coefficient (β₁) is your most actionable number. In the example above, if the slope is 0.24 GB/day, you can immediately translate that into concrete planning decisions: "at this rate, we need X additional storage in Y days."

The R² score (coefficient of determination) tells you what proportion of the outcome's variance the model explains. R² of 0.95 means the model explains 95% of the variation in table size — excellent fit, projections are reliable. R² of 0.35 means the growth is driven by factors your single-variable model is not capturing, and projections should be treated with much more caution.

I generally trust a linear model for projection when R² > 0.85 and when the residuals (differences between model predictions and actual data) look random — no pattern, no trend. Patterned residuals signal that the relationship is not actually linear and you need either polynomial regression or a different model entirely.
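As a sketch of that residual check on the table-size data from earlier, counting sign changes in the residual sequence gives a rough, plot-free signal (a heuristic of my own framing here; an actual residual plot is still the better tool):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

days = np.array([1, 15, 30, 45, 60, 75, 90]).reshape(-1, 1)
size_gb = np.array([12.1, 12.8, 14.2, 17.1, 21.3, 26.8, 33.9])

model = LinearRegression().fit(days, size_gb)
residuals = size_gb - model.predict(days)

# Random residuals flip sign often; a U-shaped (curved) pattern flips rarely.
sign_changes = int(np.sum(np.diff(np.sign(residuals)) != 0))
print("Residuals:", np.round(residuals, 2))
print(f"Sign changes: {sign_changes} of {len(residuals) - 1} possible")
```

For this data the residuals start positive, dip negative through the middle, and swing positive again: a U shape that says the growth is accelerating, which is exactly the situation where a straight-line projection comes out too optimistic.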

Multiple Regression for More Complex Problems

API latency almost never depends on just one thing. It depends on concurrent users, query complexity, cache state, time of day, and a dozen other factors simultaneously. Multiple regression extends the single-variable case to handle all of these:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Collected from production monitoring
X = pd.DataFrame({
    "concurrent_users":    [10,  50, 100, 200, 300, 400],
    "cache_miss_rate":     [0.05, 0.08, 0.15, 0.28, 0.41, 0.56],
    "db_queries_per_req":  [2.1,  2.4,  3.1,  4.8,  6.2,  8.7],
})
y = [48, 89, 142, 264, 451, 718]  # median response time (ms)

# Standardise for interpretable coefficients
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = LinearRegression().fit(X_scaled, y)

for name, coef in zip(X.columns, model.coef_):
    print(f"{name:25s} {coef:+.1f} ms per standard deviation")

Standardising the features first means the coefficients are directly comparable: a larger absolute coefficient means that variable has more influence on the outcome per unit of natural variation. This tells you which lever matters most for performance improvement.

Where Linear Regression Falls Short

I want to be honest about the limitations, because misapplied regression is worse than no regression.

Extrapolation beyond your data is unreliable. A model trained on data up to 200 concurrent users may not accurately predict behaviour at 500. Systems often exhibit non-linear behaviour near capacity limits — latency can spike sharply past a threshold that the linear model will underpredict. Always sanity-check projections against your understanding of the system's physical constraints.
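To make that concrete, here is a small illustration with made-up latency numbers: a model fitted on the 10–200 user range will happily emit a prediction at 500 users, and nothing in the API warns you that you have left the training range:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: load tests up to 200 concurrent users only
users = np.array([10, 50, 100, 150, 200]).reshape(-1, 1)
latency_ms = np.array([40, 55, 75, 95, 120])

model = LinearRegression().fit(users, latency_ms)

# The model extrapolates to 500 users without complaint; but if the system
# saturates around 300 users, the real latency curve bends sharply upward
# and this number will be a serious underestimate.
print(f"Predicted latency at 500 users: {model.predict([[500]])[0]:.0f} ms")
```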

Correlation is not causation. A regression that shows a strong relationship between two variables does not mean one causes the other. Both might be driven by a third variable you have not included. I use regression for projection and quantification, not for causal claims.

Outliers distort the model. Ordinary least squares regression is sensitive to outliers. One anomalous data point (a maintenance window, a traffic spike, a bug that was immediately fixed) can tilt the fitted line significantly. Always plot your data before fitting, and investigate outliers before deciding whether to include them.
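One way to see this, and a pragmatic mitigation when you decide an outlier should stay in the data, is to compare ordinary least squares against a robust estimator such as scikit-learn's HuberRegressor, which downweights extreme residuals. A sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

days = np.arange(1, 31).reshape(-1, 1)
size_gb = 10 + 0.25 * days.ravel()  # true growth: 0.25 GB/day
size_gb[-1] += 20                   # one anomalous spike, e.g. a bulk import

ols = LinearRegression().fit(days, size_gb)
huber = HuberRegressor().fit(days, size_gb)

print(f"OLS slope:   {ols.coef_[0]:.3f} GB/day")    # tilted by the spike
print(f"Huber slope: {huber.coef_[0]:.3f} GB/day")  # stays near 0.25
```

A single point can pull the OLS slope well away from the true rate, while the robust fit barely moves. That said, robust regression is a complement to plotting and investigating, not a substitute for it.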

A Practical Workflow

When I approach a regression problem in engineering work, I follow this sequence:

  • Plot first. Is the relationship roughly linear? Are there obvious outliers? Does the data look like it follows a different curve (exponential, logarithmic)?
  • Fit the model and check R². If it is low, I am missing important variables or the relationship is non-linear.
  • Plot residuals. Residuals should be randomly scattered around zero. Patterns indicate model misspecification.
  • Make the projection with uncertainty bounds. A prediction interval is more honest than a point estimate. "The table will reach 60 GB between day 19 and day 27, with 95% confidence" is more useful than "day 23".

Regression will not replace good engineering judgement. But it gives that judgement a quantitative foundation — turning "this table is growing fast" into "we have 23 days to act" is exactly the kind of precision that turns an eventual fire into a scheduled maintenance task.

Filed under

regression · statistics · performance · capacity planning