I have the following DATA

And the following code that builds a binomial logistic regression model, wherein all variables are factors:

#setwd("wherever you downloaded the file")
data_ev <- read.csv("all_EV.csv")
df_all_EV <- data.frame(data_ev)

#remove extra columns
df_all_EV <- df_all_EV[,-1]
df_all_EV <- df_all_EV[,-3]
df_all_EV <- df_all_EV[,-4]

#remove unneeded rows
df_ev2 <- subset(df_all_EV, EV!="unknown")

#factorize
df_ev2$EV <- as.factor(df_ev2$EV)
df_ev2$Speech_VP <- as.factor(df_ev2$Speech_VP)
df_ev2$Genre <- as.factor(df_ev2$Genre)

#set response variable ref level
df_ev2$EV <- relevel(df_ev2$EV, ref = "self")

#create glm object
ev2.glm <- glm(EV ~ Genre + Speech_VP, data = df_ev2, family = binomial)
summary(ev2.glm)

#plot glm
library(visreg)
visreg(ev2.glm, "Speech_VP")
visreg(ev2.glm, "Genre")
visreg(ev2.glm, "Speech_VP", by = "Genre")

This produces a logistic regression with the following output:

Call:
glm(formula = EV ~ Genre + Speech_VP, family = binomial, data = df_ev2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0115  -0.4628  -0.1381   0.5326   3.0519  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -4.6475     0.5115  -9.086  < 2e-16 ***
GenreTN       4.0611     0.4379   9.274  < 2e-16 ***
Speech_VPN    2.4675     0.3762   6.559  5.4e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 434.02  on 313  degrees of freedom
Residual deviance: 227.37  on 311  degrees of freedom
AIC: 233.37

Number of Fisher Scoring iterations: 5

I am interested in visualizing this regression model, so I use the visreg package.

For example, this plot shows the outcomes of the Genre variable: [image: visreg plot for Genre]

However, there's a problem here. I believe the y-axis shows the log odds, but the log odds given in the summary of the glm object don't seem to match the plot. For Genre, the summary gives the log odds of the response variable occurring with TN as 4.06, but in this plot the blue line (fitted values) for TN sits at around 2.

The same is true for the other variable. The plot captures the overall relationship well, but I can't figure out what the y-axis values are supposed to line up with.

So what accounts for this apparent disconnect between what the summary of the model shows, and what the plot of the model shows?
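One way to check the y-axis is to rebuild the fitted log odds by hand from the printed coefficients. The sketch below assumes that visreg is drawing its default conditional plot, in which the other predictor is held at one fixed level, and that Speech_VP is being held at the level N (a guess, plausible only if N is the most common level in this data). Genre's reference level is taken to be PEN, as named in the comments. The variable names are illustrative and use only numbers copied from the summary.

```r
# Coefficients copied from the summary(ev2.glm) output
b0         <- -4.6475  # (Intercept): Genre at its reference level, Speech_VP at its reference level
b_genre_TN <-  4.0611  # GenreTN: change in log odds from the reference genre to TN
b_speech_N <-  2.4675  # Speech_VPN: change in log odds from the reference Speech_VP level to N

# ASSUMPTION: the Genre plot holds Speech_VP at level "N"
eta_PEN <- b0 + b_speech_N               # fitted log odds for the reference genre
eta_TN  <- b0 + b_genre_TN + b_speech_N  # fitted log odds for TN

print(c(PEN = eta_PEN, TN = eta_TN))  # PEN about -2.18, TN about 1.88
print(eta_TN - eta_PEN)               # the gap between the levels equals the GenreTN coefficient
```

Under that assumption, the TN value of about 1.88 is close to the "around 2" read from the plot, while 4.06 shows up as the distance between the two genre levels rather than as the height of the TN line.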

  • It's hard to say without reproducible code/data, but could the coefficient 4.06 here refer to the difference between the levels "PEN" and "TN", not the mean value of "TN"? Commented Apr 7, 2024 at 22:53
  • The data is available in the hyperlink, so this is reproducible with a couple of extra steps. I think you are right: the coefficient does seem to be the difference between the levels, not the distance from zero. However, visreg lets you change the y-axis to probability, and then it is not so clear what the difference between the levels is. 4.0611 converted to probability is 0.9830618, or 98%, but in the plot that does not seem to be the distance between the two levels: the PEN bar hovers around .12 and the TN bar is at .87, a difference of .75, not .98. Commented Apr 8, 2024 at 4:36
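The probability figures in the last comment can be checked the same way. This is a rough sketch using only the coefficients printed in the summary; it assumes Speech_VP is held at level N in the plot (a guess that happens to reproduce bars near .10 and .87), and the variable names are illustrative. The point it illustrates is that the inverse logit is nonlinear, so a difference in log odds cannot be converted to a difference in probability by applying plogis() to the coefficient alone.

```r
# Coefficients copied from the summary(ev2.glm) output
b0         <- -4.6475  # (Intercept)
b_genre_TN <-  4.0611  # GenreTN
b_speech_N <-  2.4675  # Speech_VPN

# ASSUMPTION: Speech_VP is held at level "N" in the Genre plot
p_PEN <- plogis(b0 + b_speech_N)               # fitted probability for the reference genre
p_TN  <- plogis(b0 + b_genre_TN + b_speech_N)  # fitted probability for TN

print(c(PEN = p_PEN, TN = p_TN))  # roughly 0.10 and 0.87
print(p_TN - p_PEN)               # difference on the probability scale, roughly 0.77

# plogis() of the coefficient by itself is the probability at log odds 4.0611,
# not the distance between the two bars
print(plogis(b_genre_TN))         # about 0.983

# On the odds scale the coefficient is a ratio: odds for TN relative to the reference genre
print(exp(b_genre_TN))            # about 58
```

If the assumption holds, the bars differ by about .77 in probability even though the coefficient is 4.06 in log odds, and that gap would change if Speech_VP were held at its other level.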
