
I am doing a regression analysis, and my primary interest for this data set is to find the regression line that minimizes the average standard error of the estimate (SEE), rather than just the line of best fit. I thought these two were the same thing until I noticed that, when I recalculated the line of best fit after including a new set of data, the SEE actually increased, whereas the old regression equation produced a lower SEE even with the new data included. The SEE is calculated as follows:

SEE = abs((x1-x0)/x0)

where x1 is the estimated value and x0 is the actual value. Does anyone know a way to write code in R so that the calculated regression line minimizes the average SEE rather than maximizing r²? Alternatively, does anyone know why the line of best fit does not minimize the average SEE?
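For reference, here is how I compute the average SEE for a candidate line in R (a minimal sketch; the helper name "mean_see" and the column names "x"/"y" are just illustrative):

# Average SEE of the line y = a*x + b over a data frame with columns x and y.
mean_see <- function(a, b, data) {
  estimated <- a*data$x + b                   # x1, the estimated value
  mean( abs((estimated - data$y)/data$y) )    # abs((x1-x0)/x0), averaged
}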

  • What does your data look like? Is there a bunch of heteroskedasticity? Commented Jul 24, 2015 at 4:06
  • I am not sure. The data is log-transformed, so it might be. Commented Jul 25, 2015 at 17:48

1 Answer


If I understand the problem correctly it can be solved by a method called linear programming, using the R library "lpSolve":

library(lpSolve)

regression_1 <- function( data )
{
  n <- nrow(data)

  # Variables: u[1..n] (absolute relative errors), a (slope), b (intercept).
  # Objective: minimize u[1] + ... + u[n].
  L.obj <- c( rep(1,n), 0, 0 )

  # The two constraint blocks together encode y[i]*u[i] >= |a*x[i] + b - y[i]|.
  L.con <- rbind( cbind( diag(data$y),  data$x,  matrix(1,n,1) ),
                  cbind( diag(data$y), -data$x, -matrix(1,n,1) ) )
  L.rhs <- matrix( cbind( data$y, -data$y ), 2*n, 1 )
  L.dir <- rep(">=",2*n)

  M <- lp("min", L.obj, L.con, L.dir, L.rhs )
  a <- M$solution[n+1]   # optimal slope
  b <- M$solution[n+2]   # optimal intercept

  return ( c(a,b) )
}
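One caveat worth flagging: "lp()" in lpSolve assumes every decision variable is nonnegative, so "regression_1" can only return lines with a >= 0 and b >= 0, which happens to hold in the example below. If negative coefficients are possible, the standard linear-programming remedy is to split each free variable into a difference of two nonnegative ones; a sketch of the modified setup:

regression_1_free <- function( data )
{
  n <- nrow(data)

  # Variables: u[1..n], a.pos, a.neg, b.pos, b.neg with a = a.pos - a.neg
  # and b = b.pos - b.neg, so slope and intercept may be negative.
  L.obj <- c( rep(1,n), 0, 0, 0, 0 )
  ones  <- matrix(1,n,1)
  L.con <- rbind( cbind( diag(data$y),  data$x, -data$x,  ones, -ones ),
                  cbind( diag(data$y), -data$x,  data$x, -ones,  ones ) )
  L.rhs <- c( data$y, -data$y )
  L.dir <- rep(">=",2*n)

  M <- lp("min", L.obj, L.con, L.dir, L.rhs )
  a <- M$solution[n+1] - M$solution[n+2]
  b <- M$solution[n+3] - M$solution[n+4]

  return ( c(a,b) )
}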

#--------------------------------------------------------------------

Error <- function( data, ab )
{
  # Sum of absolute relative errors of the line y = ab[1]*x + ab[2].
  # (Minimizing the sum and the average is equivalent: they differ by
  # the constant factor n.)
  ab <- unlist(ab)
  sum( abs((ab[1]*data$x+ab[2]-data$y)/data$y) )
}

#====================================================================
# Example:

data.x <- 0:12
data.y <- (3.0+0.3*data.x) * (1+sample(-150:150,length(data.x),TRUE)/1000)  # true line plus up to 15% relative noise
data <- data.frame( x = data.x,
                    y = data.y  )

ab <- regression_1(data)

# Check optimality numerically: evaluate the error on a grid of
# (slope, intercept) pairs in a neighborhood of the LP solution.
N <- 30
eps <- (-N:N)/1000
neighborhood <- array( unlist(expand.grid(ab[1]+eps,ab[2]+eps)), c(2*N+1,2*N+1,2))

E <- apply(neighborhood,c(1,2),function(ab_plus_eps){Error(data,ab_plus_eps)})

t(data)
min(E)
Error(data,ab)
ab

Let "n" be he number of rows in the data frame "data" and assume that

  • y[i] is the measured value given x[i] and

  • y[i] is positive for each i. (If both positive and negative values were allowed, the error function below would run into trouble near 0; see the short illustration after this list.)

(So "x" and "y" correspond to "X1" and "X0" in the formulation of the question, respectively.)

The goal is to estimate "y" by a linear function with slope "a" and y-intercept "b". More precisely, we want to minimize the error function

  • Error(a,b) <- sum(abs((a*x+b-y)/y)).

Our approach is to use linear programming. Define auxiliary variables "u[1],...,u[n+2]". After optimization, "u[i]" will be equal to "abs((a*x[i]+b-y[i])/y[i])" for each i<=n, and "u[n+1]", "u[n+2]" will be equal to the optimal values of "a" and "b", respectively. For this,

  • minimize the function "u[1]+...+u[n]" subject to the constraints
  • u[i]*y[i] >= u[n+1]*x[i]+u[n+2]-y[i] and
  • u[i]*y[i] >= -u[n+1]*x[i]-u[n+2]+y[i] for each i<=n.

Upon minimization of "u[1]+...+u[n]", "u[i]" is equal to "abs((u[n+1]*x[i]+u[n+2]-y[i])/y[i])" for each i<=n; otherwise the value of "u[i]" could be decreased while keeping all other "u[j]"'s fixed. Taking this into account, under the above constraints the function "u[1]+...+u[n]" is minimal if "u[n+1]" and "u[n+2]" are the optimal values of "a" and "b", respectively.
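To make the construction concrete, here is what the constraint matrix looks like for a toy three-point data set (purely illustrative values; the layout matches "L.con" in "regression_1" above):

toy  <- data.frame( x = c(0,1,2), y = c(3,4,5) )
ones <- matrix(1,3,1)

# Columns correspond to the variables u1, u2, u3, a, b.
L.con <- rbind( cbind( diag(toy$y),  toy$x,  ones ),   # y[i]*u[i] + a*x[i] + b >= y[i]
                cbind( diag(toy$y), -toy$x, -ones ) )  # y[i]*u[i] - a*x[i] - b >= -y[i]
L.con
#       u1 u2 u3  a  b
#  [1,]  3  0  0  0  1
#  [2,]  0  4  0  1  1
#  [3,]  0  0  5  2  1
#  [4,]  3  0  0  0 -1
#  [5,]  0  4  0 -1 -1
#  [6,]  0  0  5 -2 -1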

Here is the output of the example:

> t(data)
   [,1]   [,2]   [,3]   [,4]   [,5]   [,6] [,7]   [,8]   [,9]  [,10]  [,11]   [,12]  [,13]
x 0.000 1.0000 2.0000 3.0000 4.0000 5.0000 6.00 7.0000 8.0000 9.0000 10.000 11.0000 12.000
y 3.081 3.4353 3.2472 4.4772 3.7758 4.4055 5.04 5.5131 5.4378 5.5119  5.784  6.0102  5.907

> min(E)
[1] 0.6575712

> Error(data,ab)
[1] 0.6575712

> ab
[1] 0.2701 3.0810

For comparison:

> lm(data$y~data$x)

Call:
lm(formula = data$y ~ data$x)

Coefficients:
(Intercept)       data$x  
     3.1741       0.2611  

> Error(data,c(0.2611,3.1741))
[1] 0.67915

The values are distinct for two reasons:

  • "lm" minimizes the squared distance between the regression line and the sampled data, not the absolute value of the distance. (A fit that removes only this difference is sketched after this list.)
  • In the error terms used by "lm" there is no division by the "y"-values. (In particular, it does not have the problem near 0 mentioned above.)
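To isolate the first difference, one can drop the squaring but keep absolute, unscaled distances with a least-absolute-deviations fit, e.g. quantile regression at the median via the "quantreg" package (a sketch, assuming quantreg is installed; it removes reason one but not reason two, so in general it yields yet another line):

library(quantreg)

fit_lad <- rq( y ~ x, tau = 0.5, data = data )         # median (L1) regression
coef(fit_lad)
Error( data, c(coef(fit_lad)[2], coef(fit_lad)[1]) )   # slope first, then intercept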