
I am doing a regression analysis, and my primary interest for this data set is to find the regression line that minimizes the average standard error of the estimate (SEE), rather than just the line of best fit. I thought these two were the same thing until I noticed that, when I recalculated the line of best fit after including a new set of data, the SEE actually increased, whereas the old regression equation produced a lower SEE even with the new data included. The SEE is calculated as follows:

SEE = abs((x1-x0)/x0)

where x1 is the estimated value and x0 is the actual value. Does anyone know a way to write code in R so that the calculated regression line minimizes the average SEE rather than maximizing r²? Alternatively, does anyone know why the line of best fit does not minimize the average SEE?
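For reference, here is how I compute the average SEE for a candidate line in R (a minimal sketch; the helper name "mean_see" and the column names "x"/"y" are just illustrative):

# Average SEE of the line y = a*x + b over a data frame with columns x and y.
mean_see <- function(a, b, data) {
  estimated <- a*data$x + b                   # x1, the estimated value
  mean( abs((estimated - data$y)/data$y) )    # abs((x1-x0)/x0), averaged
}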

  • What does your data look like? Is there a bunch of heteroskedasticity? Commented Jul 24, 2015 at 4:06
  • I am not sure. The data is log-transformed, so it might be. Commented Jul 25, 2015 at 17:48

1 Answer


If I understand the problem correctly it can be solved by a method called linear programming, using the R library "lpSolve":

library(lpSolve)

regression_1 <- function( data )
{
  n <- nrow(data)

  # Variables: u[1..n] (absolute relative errors), a (slope), b (intercept).
  # Objective: minimize u[1] + ... + u[n].
  L.obj <- c( rep(1,n), 0, 0 )

  # The two constraint blocks together encode y[i]*u[i] >= |a*x[i] + b - y[i]|.
  L.con <- rbind( cbind( diag(data$y),  data$x,  matrix(1,n,1) ),
                  cbind( diag(data$y), -data$x, -matrix(1,n,1) ) )
  L.rhs <- matrix( cbind( data$y, -data$y ), 2*n, 1 )
  L.dir <- rep(">=",2*n)

  M <- lp("min", L.obj, L.con, L.dir, L.rhs )
  a <- M$solution[n+1]   # optimal slope
  b <- M$solution[n+2]   # optimal intercept

  return ( c(a,b) )
}
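One caveat worth flagging: "lp()" in lpSolve assumes every decision variable is nonnegative, so "regression_1" can only return lines with a >= 0 and b >= 0, which happens to hold in the example below. If negative coefficients are possible, the standard linear-programming remedy is to split each free variable into a difference of two nonnegative ones; a sketch of the modified setup:

regression_1_free <- function( data )
{
  n <- nrow(data)

  # Variables: u[1..n], a.pos, a.neg, b.pos, b.neg with a = a.pos - a.neg
  # and b = b.pos - b.neg, so slope and intercept may be negative.
  L.obj <- c( rep(1,n), 0, 0, 0, 0 )
  ones  <- matrix(1,n,1)
  L.con <- rbind( cbind( diag(data$y),  data$x, -data$x,  ones, -ones ),
                  cbind( diag(data$y), -data$x,  data$x, -ones,  ones ) )
  L.rhs <- c( data$y, -data$y )
  L.dir <- rep(">=",2*n)

  M <- lp("min", L.obj, L.con, L.dir, L.rhs )
  a <- M$solution[n+1] - M$solution[n+2]
  b <- M$solution[n+3] - M$solution[n+4]

  return ( c(a,b) )
}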

#--------------------------------------------------------------------

Error <- function( data, ab )
{
  # Sum of absolute relative errors of the line y = ab[1]*x + ab[2].
  # (Minimizing the sum and the average is equivalent: they differ by
  # the constant factor n.)
  ab <- unlist(ab)
  sum( abs((ab[1]*data$x+ab[2]-data$y)/data$y) )
}

#====================================================================
# Example:

data.x <- 0:12
data.y <- (3.0+0.3*data.x) * (1+sample(-150:150,length(data.x),TRUE)/1000)  # true line plus up to 15% relative noise
data <- data.frame( x = data.x,
                    y = data.y  )

ab <- regression_1(data)

# Check optimality numerically: evaluate the error on a grid of
# (slope, intercept) pairs in a neighborhood of the LP solution.
N <- 30
eps <- (-N:N)/1000
neighborhood <- array( unlist(expand.grid(ab[1]+eps,ab[2]+eps)), c(2*N+1,2*N+1,2))

E <- apply(neighborhood,c(1,2),function(ab_plus_eps){Error(data,ab_plus_eps)})

t(data)
min(E)
Error(data,ab)
ab

Let "n" be he number of rows in the data frame "data" and assume that

  • y[i] is the measured value given x[i] and

  • y[i] is positive for each i. (If both positive and negative values were allowed, the error function below would run into trouble near 0; see the short illustration after this list.)

(So "x" and "y" correspond to "X1" and "X0" in the formulation of the question, respectively.)

The goal is to estimate "y" by a linear function with slope "a" and y-intercept "b". More precisely, we want to minimize the error function

  • Error(a,b) <- sum(abs((a*x+b-y)/y)).

Our approach is to use linear programming. Define auxiliary variables "u[1],...,u[n+2]". After optimization, "u[i]" will be equal to "abs((a*x[i]+b-y[i])/y[i])" for each i<=n, and "u[n+1]", "u[n+2]" will be equal to the optimal values of "a" and "b", respectively. For this,

  • minimize the function "u[1]+...+u[n]" subject to the constraints
  • u[i]*y[i] >= u[n+1]*x[i]+u[n+2]-y[i] and
  • u[i]*y[i] >= -u[n+1]*x[i]-u[n+2]+y[i] for each i<=n.

Upon minimization of "u[1]+...+u[n]", "u[i]" is equal to "abs((u[n+1]*x[i]+u[n+2]-y[i])/y[i])" for each i<=n; otherwise the value of "u[i]" could be decreased while keeping all other "u[j]"'s fixed. Taking this into account, under the above constraints the function "u[1]+...+u[n]" is minimal if "u[n+1]" and "u[n+2]" are the optimal values of "a" and "b", respectively.
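To make the construction concrete, here is what the constraint matrix looks like for a toy three-point data set (purely illustrative values; the layout matches "L.con" in "regression_1" above):

toy  <- data.frame( x = c(0,1,2), y = c(3,4,5) )
ones <- matrix(1,3,1)

# Columns correspond to the variables u1, u2, u3, a, b.
L.con <- rbind( cbind( diag(toy$y),  toy$x,  ones ),   # y[i]*u[i] + a*x[i] + b >= y[i]
                cbind( diag(toy$y), -toy$x, -ones ) )  # y[i]*u[i] - a*x[i] - b >= -y[i]
L.con
#       u1 u2 u3  a  b
#  [1,]  3  0  0  0  1
#  [2,]  0  4  0  1  1
#  [3,]  0  0  5  2  1
#  [4,]  3  0  0  0 -1
#  [5,]  0  4  0 -1 -1
#  [6,]  0  0  5 -2 -1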

Here is the output of the example:

> t(data)
   [,1]   [,2]   [,3]   [,4]   [,5]   [,6] [,7]   [,8]   [,9]  [,10]  [,11]   [,12]  [,13]
x 0.000 1.0000 2.0000 3.0000 4.0000 5.0000 6.00 7.0000 8.0000 9.0000 10.000 11.0000 12.000
y 3.081 3.4353 3.2472 4.4772 3.7758 4.4055 5.04 5.5131 5.4378 5.5119  5.784  6.0102  5.907

> min(E)
[1] 0.6575712

> Error(data,ab)
[1] 0.6575712

> ab
[1] 0.2701 3.0810

For comparison:

> lm(data$y~data$x)

Call:
lm(formula = data$y ~ data$x)

Coefficients:
(Intercept)       data$x  
     3.1741       0.2611  

> Error(data,c(0.2611,3.1741))
[1] 0.67915

The values are distinct for two reasons:

  • "lm" minimizes the squared distance between the regression line and the sampled data, not the absolute value of the distance. (A fit that removes only this difference is sketched after this list.)
  • In the error terms used by "lm" there is no division by the "y"-values. (In particular, it does not have the problem near 0 mentioned above.)
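To isolate the first difference, one can drop the squaring but keep absolute, unscaled distances with a least-absolute-deviations fit, e.g. quantile regression at the median via the "quantreg" package (a sketch, assuming quantreg is installed; it removes reason one but not reason two, so in general it yields yet another line):

library(quantreg)

fit_lad <- rq( y ~ x, tau = 0.5, data = data )         # median (L1) regression
coef(fit_lad)
Error( data, c(coef(fit_lad)[2], coef(fit_lad)[1]) )   # slope first, then intercept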