Dive into Deep Learning (4): Predicting House Prices on Kaggle

Code Walkthrough

The code below predicts house prices. For preparing the dataset, refer to the corresponding article; here we assume the data is already available, stored as CSV files in the appropriate directory.

from utils.deep_learning_util import download
import pandas as pd

train_data = pd.read_csv(download('kaggle_house_train'))
test_data = pd.read_csv(download('kaggle_house_test'))

# Gather all features from train and test: drop the Id column (first) and, for the training set, the SalePrice label (last)
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
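On a toy example (column names match the Kaggle dataset, values made up), the slicing above drops the `Id` column from both sets and the label from the training set before concatenating:

```python
import pandas as pd

# Miniature train/test frames: Id comes first, SalePrice last in train
train = pd.DataFrame({'Id': [1, 2], 'LotArea': [8450, 9600],
                      'MSZoning': ['RL', 'RM'], 'SalePrice': [208500, 181500]})
test = pd.DataFrame({'Id': [3], 'LotArea': [11250], 'MSZoning': ['RL']})

# Same slicing as above: drop Id everywhere, drop the label from train
features = pd.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))
print(list(features.columns))  # ['LotArea', 'MSZoning']
print(len(features))           # 3
```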

Next we preprocess the data, which contains both numeric and categorical features.

For numeric features, we standardize them. This removes differences in scale between features and puts them all on a common footing. The code is as follows:

# Standardize all numeric features using their mean and standard deviation
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std())
)
# After standardization the mean is 0, so filling missing values with 0 imputes the mean
all_features[numeric_features] = all_features[numeric_features].fillna(0)
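As a quick sanity check on toy data: after standardization each numeric column has mean 0 and sample standard deviation 1, and filling NaN with 0 afterwards is equivalent to imputing the column mean:

```python
import pandas as pd

col = pd.Series([10.0, 20.0, 30.0, None])
standardized = (col - col.mean()) / col.std()  # NaN stays NaN
standardized = standardized.fillna(0)          # 0 is the post-standardization mean

print(round(standardized[:3].mean(), 10))  # 0.0
print(round(standardized[:3].std(), 10))   # 1.0
print(standardized[3])                     # 0.0
```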

For categorical features, we convert them into one-hot encodings:

# One-hot encode all categorical features (dummy_na=True adds an indicator column for missing values)
# Note: pandas 2.x produces True/False columns instead of 0/1, so an extra cast to float is needed
all_features = pd.get_dummies(all_features, dummy_na=True)
all_features = all_features.astype(float)
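A small illustration (made-up values) of what `get_dummies` does here, including the bool-to-float cast required in pandas 2.x:

```python
import pandas as pd

df = pd.DataFrame({'MSZoning': ['RL', 'RM', None]})
dummies = pd.get_dummies(df, dummy_na=True)
print(list(dummies.columns))  # ['MSZoning_RL', 'MSZoning_RM', 'MSZoning_nan']

# In pandas 2.x the columns are bool; cast to float to get 0.0/1.0
dummies = dummies.astype(float)
print(dummies.iloc[0].tolist())  # [1.0, 0.0, 0.0]
```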

Once preprocessing is done, we can convert the DataFrame into torch Tensors:

import torch

n_train = train_data.shape[0]
# A DataFrame's .values attribute returns the underlying numpy.ndarray
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(train_data['SalePrice'].values.reshape(-1, 1), dtype=torch.float32)
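The `.values`/`reshape` step can be checked in isolation on a toy frame: the label column becomes an (n, 1) array, matching the shape of the network's output; `torch.tensor(..., dtype=torch.float32)` then simply wraps this array:

```python
import pandas as pd

df = pd.DataFrame({'SalePrice': [208500, 181500, 223500]})
labels = df['SalePrice'].values.reshape(-1, 1)  # column vector, one label per row
print(type(labels).__name__)  # ndarray
print(labels.shape)           # (3, 1)
```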

Next come the training-related steps. First we define the loss function and the network architecture; here we use the simplest possible linear model. When tackling a new problem, common practice is to take a linear model as a baseline against which later improvements can be compared.

from torch import nn

# Loss function
loss_func = nn.MSELoss()
num_features = train_features.shape[1]


# Network architecture
def get_net():
    net = nn.Sequential(nn.Linear(num_features, 1))
    return net

In fact, plain MSE is not a great fit here: house prices can differ by orders of magnitude, and what we care about is relative rather than absolute error. One approach is to compute the root mean squared error of the logarithm of the prices: \[ \sqrt{\frac{1}{n}\sum_{i=1}^n(\log y_i - \log \hat{y}_i)^2} \] The implementation is as follows:

def log_rmse(net, features, labels):
    # Clamp predictions to [1, inf) so that the logarithm stays well-defined
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(loss_func(torch.log(clipped_preds), torch.log(labels)))
    return rmse.item()
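To see why the log makes the metric relative, consider two predictions that are each 10% too high: one on a cheap house, one on an expensive one. Their contributions to the log RMSE are identical. A pure-Python sketch of the formula above (no torch needed):

```python
import math

def log_rmse_plain(preds, labels):
    # Same formula as above, without torch
    n = len(preds)
    return math.sqrt(sum((math.log(p) - math.log(y)) ** 2
                         for p, y in zip(preds, labels)) / n)

cheap = log_rmse_plain([110_000], [100_000])       # 10% over on a cheap house
pricey = log_rmse_plain([1_100_000], [1_000_000])  # 10% over on an expensive house
print(abs(cheap - pricey) < 1e-12)  # True: only the ratio prediction/label matters
```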

Next we can write the training routine, which follows the standard loop. The training and test losses after each epoch are recorded in lists:

from utils.deep_learning_util import load_array


def train(net, train_features, train_labels,
          test_features, test_labels, num_epochs, learning_rate, weight_decay, batch_size):
    train_loss, test_loss = [], []
    train_iter = load_array((train_features, train_labels), batch_size)
    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate, weight_decay=weight_decay)

    for epoch in range(num_epochs):
        for X, y in train_iter:
            optimizer.zero_grad()
            loss = loss_func(net(X), y)
            loss.backward()
            optimizer.step()
        train_loss.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_loss.append(log_rmse(net, test_features, test_labels))
    return train_loss, test_loss

The load_array function used above turns a tuple of tensors into a DataLoader:

from torch.utils import data

# Build a DataLoader from a tuple of tensors
def load_array(data_arrays, batch_size, is_train=True):
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)
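A minimal usage example on synthetic tensors (sizes chosen arbitrarily): 10 samples with batch_size 4 yield two full batches and one remainder batch of 2.

```python
import torch
from torch.utils import data

def load_array(data_arrays, batch_size, is_train=True):
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

features = torch.randn(10, 4)  # 10 synthetic samples, 4 features each
labels = torch.randn(10, 1)
loader = load_array((features, labels), batch_size=4)

batches = [X.shape[0] for X, y in loader]
print(sorted(batches))  # [2, 4, 4]
```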

In addition, we use K-fold cross-validation, which helps with model selection and hyperparameter tuning. The code is below:

# K-fold cross-validation
# Returns the data for fold i in the order train_features, train_labels, valid_features, valid_labels
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid


# Returns the average training error and average validation error over k runs
def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay, batch_size):
    train_loss_sum, valid_loss_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_loss, valid_loss = train(net, *data, num_epochs, learning_rate, weight_decay, batch_size)
        train_loss_sum += train_loss[-1]
        valid_loss_sum += valid_loss[-1]
        print(f'fold {i + 1}, train log rmse {float(train_loss[-1]):f}, '
              f'valid log rmse {float(valid_loss[-1]):f}')

    return train_loss_sum / k, valid_loss_sum / k
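The fold-slicing arithmetic can be checked independently of torch. For example, with n = 10 samples and k = 5 folds, fold i reserves rows [2i, 2i + 2) for validation and trains on the rest. A pure-index sketch mirroring get_k_fold_data:

```python
def fold_indices(k, i, n):
    # Mirrors the slice arithmetic in get_k_fold_data, on plain index lists
    fold_size = n // k
    valid = list(range(i * fold_size, (i + 1) * fold_size))
    train = [j for j in range(k * fold_size) if j not in valid]
    return train, valid

train_idx, valid_idx = fold_indices(k=5, i=2, n=10)
print(valid_idx)       # [4, 5]
print(len(train_idx))  # 8
```

Each sample lands in the validation set of exactly one fold, so the k validation errors together cover the whole training set.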

With that, the groundwork is done; what remains is to tune the hyperparameters and train. Based on how different hyperparameter settings perform, we pick the best model and parameters. Here is an example:

k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr, weight_decay, batch_size)
print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, avg valid log rmse: {float(valid_l):f}')

References

  1. 4.10. Predicting House Prices on Kaggle — Dive into Deep Learning 2.0.0 documentation

http://example.com/2023/09/14/动手学深度学习-4-实战Kaggle房价预测/
Author: EverNorif
Published: September 14, 2023