
Logistic Regression#

Logistic regression is a method for modeling the probability of a binary outcome (e.g., yes/no, sick/healthy) as a function of one or more independent variables. Unlike in linear regression, the dependent variable is categorical (dichotomous) rather than continuous.

When to Use#

Use logistic regression when you want to:

  • Predict a binary dependent variable (e.g., 0/1, yes/no)
  • Identify the factors that influence an event
  • Estimate the probability that an event occurs

The predictors themselves can be metric, ordinal, or categorical (mixed data types are allowed).

Assumptions#

  • Dependent variable is binary coded (0/1)
  • Independence of observations
  • No multicollinearity among predictors (VIF < 10)
  • Linear relationship between predictors and the logit of the dependent variable
  • Sufficiently large sample size (rule of thumb: at least 10 events per predictor)
  • No influential outliers
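The sample-size rule of thumb above is easy to check mechanically. A minimal sketch in plain Python (the function name and the default threshold of 10 are illustrative, not from any library):

```python
def enough_events(n_events, n_predictors, events_per_predictor=10):
    """Rule of thumb: at least 10 events (cases of the rarer
    outcome class) per predictor included in the model."""
    return n_events >= events_per_predictor * n_predictors

# With 3 predictors, at least 30 events of the rarer class are needed:
enough_events(n_events=45, n_predictors=3)   # satisfied
enough_events(n_events=25, n_predictors=3)   # too few events
```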

Formula#

The logistic regression model uses the logit function:

\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k

where p is the probability of the event. Solving for p:

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)}}

The odds ratio for a predictor X_i:

\mathrm{OR} = e^{\beta_i}
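The two formulas above translate directly into code. A minimal sketch in plain Python (function names are illustrative):

```python
import math

def predicted_probability(beta0, betas, xs):
    """p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))"""
    z = beta0 + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-z))

def odds_ratio(beta_i):
    """OR = e^beta_i: the factor by which the odds change
    when the predictor increases by one unit."""
    return math.exp(beta_i)

# With all coefficients zero, the model predicts p = 0.5:
predicted_probability(0.0, [0.0, 0.0], [1.0, 2.0])  # → 0.5
```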

Example#

Practical Example: Customer Churn

A telecommunications company wants to predict whether a customer will cancel their contract (1) or stay (0). The predictors used are:

  • X₁: Contract duration (in months)
  • Xβ‚‚: Monthly costs (in EUR)
  • X₃: Number of complaints

Result: the odds ratio for complaints is OR = 1.85. This means that with each additional complaint, the odds of cancellation increase by a factor of 1.85 (i.e., by 85%), holding all other variables constant.
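The interpretation of OR = 1.85 can be verified numerically. In the sketch below, the baseline churn probability is made up for illustration; only the odds ratio itself comes from the example:

```python
import math

beta_complaints = math.log(1.85)   # coefficient implied by OR = e^beta = 1.85

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

p_baseline = 0.20                  # hypothetical baseline churn probability
odds_after = odds(p_baseline) * math.exp(beta_complaints)
p_after = odds_after / (1 + odds_after)

# The odds grow by exactly the factor 1.85, regardless of the baseline:
factor = odds_after / odds(p_baseline)
```

Note that the *odds* change by a constant factor, while the change in the *probability* itself depends on where the baseline lies.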

Effect Size#

Various pseudo-RΒ² measures are used for logistic regression:

Nagelkerke's RΒ²:

R^2_{\text{Nagelkerke}} = \frac{1 - \left(\frac{L_0}{L_M}\right)^{2/n}}{1 - L_0^{2/n}}

where L_0 is the likelihood of the null model and L_M is the likelihood of the full model.

| Effect size | Nagelkerke's R² |
| ----------- | --------------- |
| Small       | 0.02            |
| Medium      | 0.13            |
| Large       | 0.26            |
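In terms of log-likelihoods, which statistical software typically reports directly, Nagelkerke's R² can be computed as follows. A sketch; the function name is chosen here for illustration:

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's R² from the log-likelihoods of the
    null model (ll_null) and the full model (ll_model)."""
    # Cox & Snell R² = 1 - (L0 / LM)^(2/n)
    cox_snell = 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))
    # Rescale by its maximum, 1 - L0^(2/n), so the range becomes [0, 1]
    max_r2 = 1.0 - math.exp((2.0 / n) * ll_null)
    return cox_snell / max_r2

# A model that fits no better than the null model gets R² = 0:
nagelkerke_r2(ll_null=-120.0, ll_model=-120.0, n=200)  # → 0.0
```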

Additionally, odds ratios (e^β) are an important measure of the practical significance of individual predictors. Classification accuracy and the ROC curve (AUC) evaluate overall model fit.
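The AUC mentioned above can be computed without any library via its rank interpretation: the probability that a randomly chosen event case receives a higher predicted probability than a randomly chosen non-event case. A minimal sketch:

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which
    the positive case gets the higher score (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation of the two classes gives AUC = 1.0:
auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])  # → 1.0
```

An AUC of 0.5 corresponds to a model that predicts no better than chance.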

Further Reading#

  • Hosmer, D. W., Lemeshow, S. & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
  • Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). SAGE.