# 分段回归的拐点连续性

### 谢益辉 / 2012-04-30

``````set.seed(123)
# 真实模型
x = sort(runif(100))
y = 2 + 1 * x + 4 * (x > 0.5) + 3 * (x - 0.5) * (x > 0.5) + rnorm(100)
par(mar = c(4, 4, 0, 0), family = "serif", mgp = c(2, 1, 0))
plot(x, y, pch = 20, col = "darkgray")
# 斜率不同，截距限制
fit1 = lm(y ~ 1 + x + I((x - 0.5) * (x > 0.5)))
lines(x, fitted(fit1), lwd = 2, col = 2)
# 斜率不同，截距也不同
fit2 = lm(y ~ 1 + x + I(x > 0.5) + I((x - 0.5) * (x > 0.5)))
lines(x, fitted(fit2), lwd = 2, lty = 2)
``````

``````# 两个嵌套模型做F检验
anova(fit1, fit2)
``````
``````## Analysis of Variance Table
##
## Model 1: y ~ 1 + x + I((x - 0.5) * (x > 0.5))
## Model 2: y ~ 1 + x + I(x > 0.5) + I((x - 0.5) * (x > 0.5))
##   Res.Df RSS Df Sum of Sq    F  Pr(>F)
## 1     97 155
## 2     96  88  1      66.6 72.6 2.2e-13 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
``````

``````fit3 = loess(y ~ x, span = 0.2)
fit4 = loess(y ~ 1 + x + I(x > 0.5) + I((x - 0.5) * (x > 0.5)),
span = 1, degree = 1)
par(mar = c(4, 4, 0, 0), family = "serif", mgp = c(2, 1, 0))
plot(x, y, pch = 20, col = "darkgray")
lines(x, fitted(fit3), lwd = 2, col = 2)
lines(x, fitted(fit4), lwd = 2, lty = 2)
``````

``````anova(fit3, fit4)
``````
``````## Model 1: loess(formula = y ~ x, span = 0.2)
## Model 2: loess(formula = y ~ 1 + x + I(x > 0.5) + I((x - 0.5) * (x > 0.5)), span = 1, degree = 1)
##
## Analysis of Variance:   denominator df 82.38
##
##        ENP  RSS F-value Pr(>F)
## [1,] 15.16 83.2
## [2,]  4.02 87.7   0.321   0.99
``````

P值很大，这毫不奇怪，因为真实模型就是按照两段直线构造的。弯弯曲曲的复杂模型无法打败简单模型，此时我们可以有点底气说两个线段的直线回归可能是对数据的一个恰当描述。

``````library(ggplot2)
qplot(x, y) + geom_smooth()  # 总趋势
``````

``````qplot(x, y, group = x > 0.5) + geom_smooth()  # 按0.5前后分组
``````