# Stacking

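> Stacking is a markedly different way of combining multiple models.
>
> Unlike bagging and boosting, which combine learners of the same type,
>
> it combines different kinds of models.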

## Steps:

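> Input: training set $$D=\lbrace (x\_1,y\_1),(x\_2,y\_2),...,(x\_m,y\_m)\rbrace$$;
>
> the $$T$$ primary (weak) learning algorithms to be combined: $$\xi\_1,\xi\_2,...,\xi\_T$$;
>
> the secondary learning algorithm $$\xi$$.
>
> Procedure:
>
> 1: for $$t=1,2,...,T$$ do
>
> $$h\_t=\xi\_t(D)$$; (each of the $$T$$ primary learners is trained once on $$D$$)
>
> end for
>
> 2: $$D'=\varnothing$$;
>
> 3: for $$i=1,2,...,m$$ do
>
> for $$t=1,2,...,T$$ do
>
> $$z\_{it}=h\_t(x\_i)$$; (each primary learner makes one prediction for sample $$x\_i$$)
>
> end for
>
> $$D'=D'\cup\lbrace ((z\_{i1},z\_{i2},...,z\_{iT}),y\_i)\rbrace$$; (the $$T$$ predictions for sample $$x\_i$$ form the new feature vector $$X=(z\_{i1},z\_{i2},...,z\_{iT})$$)
>
> end for
>
> 4: $$h'=\xi(D')$$; (the new data $$(X,y\_i)$$ is used as input to train the secondary learning algorithm $$\xi$$)
>
> 5: Output: $$H(x)=h'(h\_1(x),h\_2(x),...,h\_T(x))$$; (an abstraction of steps 1-4: the outputs of the $$T$$ primary learners are the input of the new function $$h'$$, which produces the final prediction)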
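The steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming scikit-learn estimators and predicted class labels as the meta-features $$z\_{it}$$; the helper names `fit_naive_stacking` and `predict_naive_stacking` and the demo data are hypothetical and not part of any library.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier


def fit_naive_stacking(base_learners, meta_learner, X, y):
    """Steps 1-4: fit the primary learners on D, build D', fit the secondary learner."""
    # Step 1: h_t = xi_t(D)
    fitted = [clone(est).fit(X, y) for est in base_learners]
    # Steps 2-3: z_it = h_t(x_i); row i of Z is (z_i1, ..., z_iT)
    Z = np.column_stack([h.predict(X) for h in fitted])
    # Step 4: h' = xi(D')
    meta = clone(meta_learner).fit(Z, y)
    return fitted, meta


def predict_naive_stacking(fitted, meta, X):
    """Step 5: H(x) = h'(h_1(x), ..., h_T(x))."""
    Z = np.column_stack([h.predict(X) for h in fitted])
    return meta.predict(Z)


# Hypothetical demo data: 200 samples, 4 features, 2 classes
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
base = [KNeighborsClassifier(n_neighbors=3), GaussianNB()]
fitted, meta = fit_naive_stacking(base, LogisticRegression(), X_demo, y_demo)
pred = predict_naive_stacking(fitted, meta, X_demo)
print(pred[:10])
```

Note that the secondary training set here is built from predictions on the same data the primary learners were trained on, exactly as in the pseudocode.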

Figure 8.9 from the Watermelon Book 🍉 (Zhou Zhihua, *Machine Learning*):

![](/files/-Lq7QSHZ8Zv0DAgTwk7t)

## Problem:

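This implementation has a serious flaw.

The primary learners are trained on the original dataset D, and the same models are then used to predict on D to build the secondary training set. **Those predictions are bound to look very good**, so the secondary learner ends up **overfitting**.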
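A small dependency-free sketch makes the leak concrete. It assumes a toy 1-D dataset with 20% label noise and a hand-written 1-nearest-neighbour learner (all names here are hypothetical). On D itself every point is its own nearest neighbour, so the learner's prediction simply reproduces the label, and a secondary learner trained on such a feature would learn to trust it completely.

```python
import random

rng = random.Random(0)


def sample(n):
    """Draw n points on [0, 1); the label is 1 when x > 0.5, flipped 20% of the time."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def accuracy(train, data):
    """Fraction of points in data whose 1-NN prediction matches the label."""
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)


D = sample(200)         # original training set
held_out = sample(200)  # fresh data from the same distribution

acc_on_D = accuracy(D, D)             # every point is its own nearest neighbour
acc_held_out = accuracy(D, held_out)  # an honest estimate of the learner's quality
print(f"accuracy on D: {acc_on_D:.2f}, on held-out data: {acc_held_out:.2f}")
```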

## Improvement:

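Use **k-fold** cross-validation: split the initial training set D into k folds. In each round, hold out one fold $$D\_j$$ as the test set and train the T primary learners on the remaining k-1 folds. Each sample in the held-out fold is then passed through the T learners, and the resulting T outputs form that sample's input X for the secondary learner; the label is still the original y.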
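The k-fold construction can be sketched with scikit-learn's `cross_val_predict`, which returns, for every sample, the prediction of a model trained on the folds that do not contain it. This is an illustrative sketch rather than a reference implementation: the demo data, the choice of k = 5, and the final refit of the primary learners on all of D (a common choice that the text above does not specify) are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical demo data: 300 samples, 6 features, 2 classes
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
base_learners = [KNeighborsClassifier(n_neighbors=1), GaussianNB()]

# For each primary learner, every sample is predicted by a model
# trained on the other k-1 folds (k = 5 here).
Z = np.column_stack([
    cross_val_predict(est, X_demo, y_demo, cv=5) for est in base_learners
])

# Secondary training set D' = (Z, y): T columns, original labels.
meta = LogisticRegression().fit(Z, y_demo)

# Assumption: for predictions on new data, refit each primary learner on all of D.
fitted = [est.fit(X_demo, y_demo) for est in base_learners]
```

Because no sample is predicted by a model that saw it during training, the meta-features reflect how the primary learners behave on unseen data.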

## Running the code

This example uses the mlxtend library: `pip install mlxtend`

> Primary learners: Gaussian naive Bayes, KNN, random forest
>
> Secondary learner: logistic regression
>
> Task: classification

```python
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier

# Newer scikit-learn releases no longer provide sklearn.datasets.samples_generator;
# make_classification is imported from sklearn.datasets directly.
from sklearn.datasets import make_classification

# Primary learners (3)
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

# Secondary learner: logistic regression
lr = LogisticRegression()


# Stacked ensemble
# use_probas=True: feed the primary learners' class probabilities to the
# meta-classifier; otherwise their predicted labels are used directly.
# average_probas=False: concatenate the probabilities instead of averaging them.
'''
classifier 1: [0.2, 0.5, 0.3]
classifier 2: [0.3, 0.4, 0.4]
   1) average_probas=True:
meta-feature: [0.25, 0.45, 0.35]
   2) average_probas=False:
meta-feature: [0.2, 0.5, 0.3, 0.3, 0.4, 0.4]
'''
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          use_probas=True,
                          average_probas=False,
                          meta_classifier=lr)


# Input data
# X holds the sample features, y the class labels: 1000 samples, 2 features each,
# 3 classes, no redundant features, one cluster per class
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=3)

# Iterate over the models and their names
for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN',
                       'Random Forest',
                       'Naive Bayes',
                       'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, X, y,
                                             cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))

# Results with 3-fold cross-validation (one run; the data is random, so numbers vary):
# Accuracy: 0.96 (+/- 0.00) [KNN]
# Accuracy: 0.97 (+/- 0.00) [Random Forest]
# Accuracy: 0.95 (+/- 0.01) [Naive Bayes]
# Accuracy: 0.97 (+/- 0.00) [StackingClassifier]
```
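For comparison, the same ensemble can be sketched with scikit-learn's own `StackingClassifier` (swapped in here for mlxtend's class), which builds the meta-features with internal cross-validation in the spirit of the k-fold improvement described earlier. The values `cv=5` and `random_state=0` are assumptions made for this sketch, so the scores will not necessarily match the run above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Same data settings as above, with a fixed seed for repeatability (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                                     n_clusters_per_class=1, n_classes=3,
                                     random_state=0)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=1)),
        ("rf", RandomForestClassifier(random_state=1)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,                          # folds used internally to build the meta-features
    stack_method="predict_proba",  # class probabilities as meta-features
)

scores = cross_val_score(stack, X_demo, y_demo, cv=3, scoring="accuracy")
print("Accuracy: %0.2f (+/- %0.2f) [sklearn StackingClassifier]"
      % (scores.mean(), scores.std()))
```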

