Linear and Logistic Regression

 While correlation is used to describe the strength and direction of an association between two random, uncontrolled continuous variables, regression analysis goes a step further by establishing a functional relationship that allows researchers to predict the value of a dependent (response) variable based on the value of at least one independent (predictor) variable. In the broader context of testing for relationships, regression provides a specific mathematical model to describe the exact nature of how variables interact.

Simple Linear Regression

Simple linear regression evaluates the relationship between exactly one independent variable and one continuous dependent variable. In this model, researchers assume that a straight line provides the best description of the relationship, which can be expressed mathematically as $y = \beta_0 + \beta_1x + \epsilon$ (or $y = a + bx$), where $\beta_0$ is the y-intercept, $\beta_1$ is the slope, and $\epsilon$ represents random error.

To determine the most accurate straight line through a scatter plot of data, statisticians use the method of least squares. This objective mathematical procedure calculates the "best-fitting" line by minimizing the sum of the squared vertical deviations (residuals) between the actual observed data points and the predicted values on the line.

To evaluate how well this linear model actually fits the data, researchers calculate the coefficient of determination ($r^2$ or $R^2$). This value represents the proportion or percentage of the total variation in the dependent variable that is directly explained by the linear relationship with the independent variable. For example, if $r^2 = 0.95$, it means 95% of the variability in the outcome is explained by the regression line.

For simple linear regression to be statistically valid, several strict assumptions must be met:

  • The independent variable ($x$) is fixed and controlled by the researcher without error.
  • For any given value of $x$, the corresponding $y$ values are normally distributed.
  • The variance of $y$ remains constant across all values of $x$ (homogeneity of variance).
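As a quick sketch with made-up data, the least-squares estimates and $r^2$ follow directly from the standard formulas (this is a toy illustration, not data from the text):

```python
# Least-squares fit of y = a + b*x on a small made-up data set
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.1, 8.0, 9.9]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)

b = sxy / sxx                 # slope: minimizes the sum of squared residuals
a = ybar - b * xbar           # intercept: the fitted line passes through (xbar, ybar)
r2 = sxy ** 2 / (sxx * syy)   # coefficient of determination

print(f"y = {a:.2f} + {b:.2f}x, r^2 = {r2:.3f}")
```

Here $r^2$ is close to 1, meaning nearly all of the variability in $y$ is explained by the fitted line.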

Multiple Linear Regression

Because a single predictor variable is rarely sufficient to accurately forecast an outcome in the real world, researchers use multiple linear regression to evaluate the effects of two or more independent variables simultaneously on a single continuous dependent variable.

Instead of a simple straight line, multiple regression creates a "plane" (for two independent variables) or a "hyperplane" (for three or more) in multidimensional space to best fit the data. This model calculates specific beta coefficients (beta weights) for each independent variable. These coefficients are highly valuable because they indicate the relative importance of each predictor variable and describe the change in the dependent variable when that specific predictor is altered while all other predictors are held constant. Because evaluating many variables can become incredibly complex, researchers often use automated stepwise regression to systematically add or subtract variables from the equation to identify the most statistically useful subset of predictors.
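A minimal sketch of multiple regression via the normal equations $(X^{T}X)\hat{\beta} = X^{T}y$, using noise-free illustrative data generated from $y = 2 + 3x_1 + 0.5x_2$ (all numbers here are invented for the demonstration):

```python
# Multiple linear regression via the normal equations (X'X) b = X'y,
# on noise-free illustrative data where y = 2 + 3*x1 + 0.5*x2.
rows = [  # (x1, x2, y)
    (0, 0, 2.0), (1, 0, 5.0), (0, 1, 2.5),
    (1, 1, 5.5), (2, 1, 8.5), (1, 2, 6.0),
]
X = [[1.0, x1, x2] for x1, x2, _ in rows]   # design matrix with intercept column
y = [r[2] for r in rows]
p = 3

# Build X'X and X'y
XtX = [[sum(X[i][a] * X[i][c] for i in range(len(X))) for c in range(p)] for a in range(p)]
Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(p)]

def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    out = [0.0] * n
    for i in range(n - 1, -1, -1):
        out[i] = (M[i][n] - sum(M[i][j] * out[j] for j in range(i + 1, n))) / M[i][i]
    return out

beta = solve(XtX, Xty)   # [intercept, beta_1, beta_2]
print([round(c, 2) for c in beta])
```

The recovered coefficients are the beta weights: each one describes the change in $y$ per unit change in that predictor while the other predictor is held constant.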

Logistic Regression

While linear regression requires a continuous dependent variable, logistic regression is the appropriate predictive model when the dependent variable is a dichotomous or binary discrete outcome (e.g., success/failure, live/die, presence/absence of a disease).

Attempting to use standard linear regression for binary outcomes creates several mathematical problems: the errors cannot be normally distributed, the variance is not constant, and the linear equation could predict impossible probabilities greater than 1.0 or less than 0.0.

To solve this, logistic regression utilizes a logarithmic transformation called the logit function or log-odds. Instead of predicting the exact value of a continuous variable, logistic regression models the natural logarithm of the odds that the specific binary event will occur. This transformation successfully forces the predicted probabilities to always fall strictly between 0 and 1. Like multiple regression, logistic regression can evaluate multiple independent variables at the same time—and these predictor variables can be either continuous or discrete. It is heavily used in clinical and epidemiological studies, such as using factors like a mother's age, weight, and smoking history to predict the binary probability of a premature birth.
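The logit/sigmoid pair behind this transformation can be sketched directly; the coefficients below are purely hypothetical, not fitted to any data:

```python
import math

def logit(p):
    """Log-odds: ln(p / (1 - p)), mapping (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse logit: maps any real log-odds back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients (for illustration only):
# log-odds of the binary event = b0 + b1 * (mother's age)
b0, b1 = -4.0, 0.08

for age in (20, 30, 40, 60):
    p = sigmoid(b0 + b1 * age)
    assert 0 < p < 1              # the logit transform guarantees this
    print(age, round(p, 3))

# exp(b1) is the odds ratio: each extra unit of x multiplies the odds by this factor
print("odds ratio per year:", round(math.exp(b1), 3))
```

However extreme the linear predictor becomes, the sigmoid keeps every predicted probability strictly between 0 and 1, which is exactly the property plain linear regression lacks for binary outcomes.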


What Is Significance and Its Level?


Significance in statistics is tied to the level of significance ($\alpha$), which is fundamentally the probability of rejecting a true null hypothesis ($H_0$); a result is called statistically significant when the sample evidence is strong enough to justify that rejection.



  1. Definition and Measurement of Significance:

    • Hypothesis Testing is sometimes referred to as significance testing. It is the process of inferring from a sample whether to reject a certain statement about a population.
    • The level of significance ($\alpha$) is the probability that the statistical test results in rejecting the null hypothesis ($H_0$) when $H_0$ is actually true. This mistake is known as a Type I error.
    • By convention, the significance level is set at 5% ($\alpha = 0.05$, a 1/20 chance), so an outcome with probability below this threshold is considered unlikely under $H_0$. If the difference is significant at the 5% level, it is often expressed as $p < 0.05$.
    • When results are considered "statistically significant," it means the sample data is incompatible with the null hypothesis, leading to its rejection in favor of the alternate hypothesis ($H_1$).
    • The $p$-value (or significance probability) is a post hoc measure of error. It is the probability, calculated assuming $H_0$ is true, that the test statistic takes a value equal to or more extreme than the value actually observed. A small $p$-value provides strong evidence against $H_0$.
  2. Statistical Tests for Significance (Hypothesis Testing): A wide range of inferential statistical tests are used to determine significance, typically categorized based on the type of data (continuous/discrete) and assumptions (parametric/nonparametric). These tests compare an observed test statistic (a ratio based on sample data) to a preset critical value or calculate a $p$-value to determine if the result is extreme enough to reject $H_0$.

    Common statistical tests employed for significance testing include:

    • Parametric Procedures (generally assume normality and homogeneity of variance):

      • $t$-Tests (used primarily when comparing one or two means, or paired data):
        • One-Sample $t$-Test.
        • Two-Sample $t$-Test.
        • Matched Pair $t$-Test (Paired $t$-Test).
      • Analysis of Variance (ANOVA) (used for comparing means of three or more groups, relying on the $F$-distribution).
      • $Z$-Tests (used for large samples, especially concerning proportions or means with known population variance):
        • $Z$-Test of Proportions (One-sample or Two-sample case).
    • Tests for Relationships and Association:

      • Correlation and Regression (to test if a relationship exists, usually $H_0: r_{xy} = 0$ or $H_0: \beta_1 = 0$).
      • Chi Square ($\chi^2$) Tests (used when only discrete variables are involved):
        • Chi Square Goodness-of-Fit Test.
        • Chi Square Test of Independence (or Test for Association).
        • Related tests: Fisher’s exact test, McNemar's test, Cochran-Mantel-Haenszel test.
    • Nonparametric Tests (alternatives used when assumptions like normality are not met):

      • Wilcoxon Signed Rank Test (alternative to paired $t$-test).
      • Wilcoxon Rank Sum Test (Mann–Whitney $U$ test, alternative to two-sample $t$-test).
      • Kruskal–Wallis Test (alternative to One-Way ANOVA).
      • Sign Test (a median-based alternative to the one-sample or paired $t$-test).
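As a sketch with made-up data, a one-sample $t$-test compares the observed statistic to a tabulated critical value (here $t(0.975, 9) = 2.262$ for a two-sided 5% test, the same value used in the R examples later in this post):

```python
import math

# One-sample t-test sketch with made-up data: H0: mu = 99.5 vs H1: mu != 99.5
x = [98.6, 99.1, 98.9, 99.4, 98.8, 99.0, 99.2, 98.7, 99.3, 99.0]
mu0 = 99.5
n = len(x)
mean = sum(x) / n
s = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))   # sample std dev
t_stat = (mean - mu0) / (s / math.sqrt(n))                   # observed test statistic

t_crit = 2.262   # tabulated t(0.975, df = 9) for a two-sided 5% test
print(round(t_stat, 3), abs(t_stat) > t_crit)                # True => reject H0
```

Because $|t|$ far exceeds the critical value, the result is significant at the 5% level and $H_0$ is rejected.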

Example of Shelf Life Calculation with No Variation


Based on the requirement that the three batches exhibit similarity (no significant difference), the stability data can be combined (pooled) to determine a single, unified shelf life.

The FDA guideline specifies that the expiration dating period (shelf life, $\xi$) is determined as the time point at which the 95% one-sided lower confidence limit for the mean degradation curve intersects the acceptable lower specification limit ($\eta$).

Here is a simulated example demonstrating this process for three similar batches ($K=3$).

1. Simulated Stability Study Data and Parameters

Objective: Determine the shelf life ($\xi$) for a drug product using three validation batches.

  • Acceptable Lower Specification Limit ($\eta$): 90% of label claim.
  • Model: Linear degradation ($Y = \alpha + \beta X + \epsilon$).
  • Time Points ($X_j$): 0, 3, 6, 9, and 12 months ($n = 5$ time points).
  • Total Observations ($N$): $K \times n = 3 \times 5 = 15$.

The observed Potency (% Label Claim) data are simulated to be consistent with a common degradation rate of approximately -0.5% of label claim per month, indicating high similarity across batches:

| Batch ($i$) | Time $X_j$ (Months) | Potency $Y_{i,j}$ (%) |
|---|---|---|
| 1 | 0 | 100.2 |
| 1 | 3 | 98.6 |
| 1 | 6 | 97.1 |
| 1 | 9 | 95.3 |
| 1 | 12 | 94.1 |
| 2 | 0 | 99.9 |
| 2 | 3 | 98.3 |
| 2 | 6 | 96.9 |
| 2 | 9 | 95.6 |
| 2 | 12 | 93.8 |
| 3 | 0 | 100.0 |
| 3 | 3 | 98.5 |
| 3 | 6 | 97.0 |
| 3 | 9 | 95.4 |
| 3 | 12 | 94.2 |

2. Preliminary Test for Batch Similarity

A preliminary statistical test for batch similarity (equality of slopes and intercepts) is conducted at a significance level of $0.25$.

Assumption: The statistical test demonstrates that the three batches are statistically similar (the null hypothesis of no difference in slopes and intercepts is not rejected). This justifies pooling the $N=15$ data points into one overall analysis.
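The similarity check can be sketched as an extra-sum-of-squares $F$-test, one common way to test equality of slopes and intercepts (the exact procedure in the guideline may differ): fit one pooled line (2 parameters) and separate per-batch lines (6 parameters), then compare the residual sums of squares.

```python
# Extra-sum-of-squares F-test for batch similarity: one pooled line
# (2 parameters) versus separate lines per batch (6 parameters).
batches = [
    [(0, 100.2), (3, 98.6), (6, 97.1), (9, 95.3), (12, 94.1)],
    [(0, 99.9),  (3, 98.3), (6, 96.9), (9, 95.6), (12, 93.8)],
    [(0, 100.0), (3, 98.5), (6, 97.0), (9, 95.4), (12, 94.2)],
]

def sse(points):
    """Residual sum of squares of a least-squares line through `points`."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    b = sum((x - xbar) * (y - ybar) for x, y in points) / sum((x - xbar) ** 2 for x, _ in points)
    a = ybar - b * xbar
    return sum((y - (a + b * x)) ** 2 for x, y in points)

pooled = [pt for batch in batches for pt in batch]
N, K = len(pooled), len(batches)
sse_pooled = sse(pooled)                   # reduced model: one common line
sse_full = sum(sse(b) for b in batches)    # full model: a line per batch
df_extra = 2 * (K - 1)                     # 4 extra parameters in the full model
df_full = N - 2 * K                        # 9 residual degrees of freedom
F = ((sse_pooled - sse_full) / df_extra) / (sse_full / df_full)
print(round(F, 2))
```

On these data $F \approx 1.42$, which falls below the upper 25% point of the $F(4, 9)$ distribution (roughly 1.6 by table), so the hypothesis of similarity is not rejected at the 0.25 level and pooling is justified.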

3. Statistical Calculation (Pooled Data)

The Ordinary Least Squares (OLS) method is applied to the combined data set to estimate the common intercept ($\hat{\alpha}$) and common slope ($\hat{\beta}$).

| Parameter | Calculation Result (Pooled Data) |
|---|---|
| Mean Time ($\overline{X}$) | 6.0 months |
| Pooled Sum of Squares of X ($K\sum_{j=1}^{n}(x_{j}-\overline{x})^{2}$) | 270 |
| Estimated Intercept ($\hat{\alpha}$) | $100.00$ (Potency %) |
| Estimated Slope ($\hat{\beta}$) | $-0.50$ (% per month) |
| Mean Squared Error (MSE) | $0.021$ |
| Degrees of Freedom ($N-2$) | 13 |
| $t$-value ($t(0.95, 13)$) | $\approx 1.771$ |

The pooled mean degradation curve is: $\hat{Y}(X) = 100.00 - 0.50 X$

4. Determination of Tentative Shelf Life ($\xi$)

The tentative shelf life ($\xi$) is the solution to the equation where the lower 95% confidence bound intersects the lower specification limit ($\eta = 90$):

$$ \eta = \hat{\alpha} + \hat{\beta}\xi - t(.95)S(\xi) $$

Where $S(\xi)$ is the standard error of the estimated mean degradation curve at time $\xi$:

$$S^{2}(\xi) = \text{MSE} \left\{ \frac{1}{N} + \frac{(\xi-\overline{X})^{2}}{K\sum_{j=1}^{n}(x_{j}-\overline{x})^{2}} \right\}$$

Substituting the calculated pooled values:

$$ 90 = 100.00 - 0.50\xi - 1.771 \sqrt{0.021 \left( \frac{1}{15} + \frac{(\xi-6)^{2}}{270} \right)} $$

Solving this equation for $\xi$ yields the estimated shelf life:

$$ \hat{\xi} \approx 19.5 \text{ months} $$

5. Conclusion

The estimated tentative shelf life is $\mathbf{19.5}$ months.

Since the batches were determined to be similar, pooling the data was justified, resulting in a narrower confidence limit due to the larger degrees of freedom ($N-2=13$) and improved precision. This yielded a statistically determined shelf life of about $19.5$ months, based on the time point where the lower 95% confidence boundary for the mean degradation profile of the combined batches meets the 90% specification limit.
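The whole pooled calculation (OLS fit plus solving for where the lower confidence bound crosses 90%) can be sketched in a few lines; the critical value $t(0.95, 13) = 1.771$ is taken from a table rather than computed, and the root is found by simple bisection:

```python
import math

# Pooled stability data (time in months, potency % of label claim), K = 3 batches
data = [
    (0, 100.2), (3, 98.6), (6, 97.1), (9, 95.3), (12, 94.1),   # batch 1
    (0, 99.9),  (3, 98.3), (6, 96.9), (9, 95.6), (12, 93.8),   # batch 2
    (0, 100.0), (3, 98.5), (6, 97.0), (9, 95.4), (12, 94.2),   # batch 3
]
N = len(data)                                    # 15
xbar = sum(x for x, _ in data) / N               # 6.0
ybar = sum(y for _, y in data) / N
sxx = sum((x - xbar) ** 2 for x, _ in data)      # 270 (= K * 90)
sxy = sum((x - xbar) * (y - ybar) for x, y in data)
beta = sxy / sxx                                 # common slope, about -0.50 %/month
alpha = ybar - beta * xbar                       # common intercept, about 100.0 %
mse = sum((y - (alpha + beta * x)) ** 2 for x, y in data) / (N - 2)

t95 = 1.771                                      # tabulated t(0.95, 13)

def lower_cl(xi):
    """95% one-sided lower confidence limit for the mean curve at time xi."""
    se = math.sqrt(mse * (1 / N + (xi - xbar) ** 2 / sxx))
    return alpha + beta * xi - t95 * se

# Shelf life: the time where the lower confidence limit crosses eta = 90 (bisection)
eta, lo, hi = 90.0, 0.0, 40.0
for _ in range(60):
    mid = (lo + hi) / 2
    if lower_cl(mid) > eta:
        lo = mid
    else:
        hi = mid

print(round(alpha, 2), round(beta, 2), round(mse, 3), round(lo, 1))
```

The bisection converges to roughly 19.5 months, matching the hand calculation above.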


P.S.: I am using NotebookLM to create this blog.

How to calculate drug shelf life

 Shelf life, or the expiration dating period, is defined as the interval that a drug product is expected to remain within the approved specifications after manufacture. The calculation of the shelf life is the primary objective of a stability study.

The general method for determining the shelf life, as recommended by the FDA and ICH guidelines, involves statistical analysis of stability data:

Primary Calculation Method (Long-Term Stability)

The shelf life is determined as the time point at which the 95% one-sided lower confidence limit for the mean degradation curve intersects the acceptable lower specification limit ($\tau_{\eta}$).

  1. Modeling Degradation: The stability data, typically using percent of label claim as the primary variable, are fitted to a mathematical relationship.

    • The degradation relationship can usually be represented by a linear, quadratic, or cubic function on an arithmetic or logarithmic scale.
    • For characteristics expected to decrease (e.g., strength), the 95% one-sided lower confidence limit is used.
    • For characteristics expected to increase (e.g., degradation products), the 95% one-sided upper confidence limit is used.
  2. Statistical Calculation (Linear Model): Assuming the strength decreases linearly over time (a zero-order reaction), the expected degradation is modeled by linear regression, $E(Y_{j}) = \alpha + \beta X_{j}$.

    • The shelf life ($x_{L}$) is calculated by solving the quadratic equation that results from setting the 95% lower confidence limit for the mean degradation line, $L(x)$, equal to the lower specification limit, $\tau_{\eta}$. $x_{L}$ is the smaller root of this equation.
    • It is not acceptable to determine the expiration dating period by simply finding where the fitted least-squares line intersects the specification limit (which would only provide a 50% confidence level).

Handling Multiple Batches

When multiple batches (a minimum of three) are tested, the approach depends on batch-to-batch variability:

  • Pooling Data: If analysis shows that the batch-to-batch variability is small (e.g., slopes and intercepts are sufficiently similar, sometimes assessed using a significance level of 0.25), the data from different batches may be combined into one overall estimate to establish a single, more precise shelf life.
  • Minimum Approach (Fixed Effects): If it is inappropriate to combine data due to significant batch-to-batch variability, the overall expiration dating period may be based on the minimum of the individual shelf lives estimated from each batch. This is considered a conservative estimate.
  • Random Batch Effects (Advanced Methods): For establishing a shelf life applicable to all future production batches, statistical methods incorporating random batch effects are used (e.g., Chow and Shao's approach or the HLC method). These methods include the between-batch variability when constructing the confidence limit for the mean degradation curve.

Tentative Shelf Life (Accelerated Testing)

Accelerated stability testing (or stress testing) is used primarily to predict a tentative expiration dating period in a shorter timeframe by increasing the rate of chemical or physical degradation under exaggerated conditions.

The prediction relies on kinetic models:

  1. Reaction Order: The analysis involves empirically determining the order of the reaction (e.g., zero-order for linear degradation or first-order for logarithmic degradation).
  2. Arrhenius Equation: The relationship between the degradation rate and temperature is characterized using the Arrhenius equation.
  3. Extrapolation: The tentative shelf life is obtained by extrapolating the relationship to ambient (marketing) storage conditions.
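The three steps above can be sketched with a two-point Arrhenius fit; the rate constants, temperatures, and zero-order assumption below are all hypothetical, chosen only to show the mechanics:

```python
import math

# Arrhenius extrapolation sketch with hypothetical accelerated-study rates.
# Assume zero-order (linear) loss with rates in % of label claim per month:
k_40C, T_40C = 0.5, 313.15   # observed at 40 deg C (hypothetical)
k_60C, T_60C = 2.0, 333.15   # observed at 60 deg C (hypothetical)

# ln k = ln A - Ea/(R*T)  =>  Ea/R from the two accelerated conditions
Ea_over_R = math.log(k_60C / k_40C) / (1 / T_40C - 1 / T_60C)

# Extrapolate the rate to the ambient storage temperature, 25 deg C (298.15 K)
T_25C = 298.15
k_25C = k_40C * math.exp(-Ea_over_R * (1 / T_25C - 1 / T_40C))

# Tentative shelf life: time for a 10% potency loss at the market temperature
t90 = 10 / k_25C
print(round(k_25C, 3), "%/month ->", round(t90, 1), "months")
```

In practice more than two conditions are used and the reaction order must be verified first, but the extrapolation step itself is just this: fit $\ln k$ against $1/T$, then evaluate at the marketing temperature.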

"lower.tail" confusion in R.


I often get confused about how to use the lower.tail argument in the pt() function and similar functions. Here I will focus on the t-distribution, and I will use Minitab for the graphical presentation.

A:- lower.tail is FALSE 

    In R

The code is 
> pt(q = -2.262, df = 9, lower.tail = FALSE)
The output 
[1] 0.9749936

    In Minitab

This is shown in the graph from Minitab below.



So when FALSE is chosen, R computes the upper-tail area, i.e. the area to the right of (after) the given quantile: P(T > q).

B:- "lower.tail" is TRUE

    In R 

The code is 
 pt(q = -2.262, df = 9, lower.tail = TRUE)
The output is 
[1] 0.02500642

    In Minitab


When TRUE is used, R computes the lower-tail area, i.e. the area to the left of (before) the given quantile: P(T ≤ q).

How to compute the probability between two values using t-distribution in R.

 To compute the probability between two cutoff points (two quantiles) of the t-distribution in R:

The example uses d.f. = 9 (i.e. n = 10); the first quantile is -2.262 and the second quantile is 2.262.

I used Minitab to give a graphical representation of that as below:-



The code in R to use is as follows:- 

pt(q = 2.262, df = 9, lower.tail = TRUE) - pt(q = -2.262, df = 9, lower.tail = TRUE)

I recommend playing a little with the above code to see how the argument "lower.tail" behaves when it is TRUE versus FALSE.

The output from R is :-

[1] 0.9499872

which, when rounded, gives 0.95, the same as Minitab.

P.S.

d.f.: degrees of freedom.
