- I joined the Kaggle data-mining competitions and, starting with the first challenge (Titanic), studied the related data preprocessing and model building. This post focuses on the pandas-based preprocessing workflow, covering data statistics, discretization of attributes, and analysis of how attributes relate to the target.
Importing packages and loading the data
import pandas as pd
import numpy as np
train_df =pd.read_csv('../datas/train.csv') # train set
test_df = pd.read_csv('../datas/test.csv') # test set
combine = [train_df, test_df]
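After loading, a quick check of the shape and the first few rows confirms the files were read correctly (a minimal sketch; the paths above are assumed to point at the Kaggle Titanic CSVs):
print(train_df.shape)  # (891, 12) for the standard Titanic train set
print(train_df.head()) # first 5 rows
train_df.info()        # prints column dtypes and non-null counts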
What the functions do
print(train_df.columns.values) # list the attribute (column) names of the table
- Output:
['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Ticket' 'Fare' 'Cabin' 'Embarked']
# summary statistics for the object (categorical) attributes
print(train_df.describe(include=['O']))
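The same call without the include argument summarizes the numeric attributes instead (count, mean, std, quartiles):
print(train_df.describe())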
- Count how many times each value appears in the Title column
print(train_df['Title'].value_counts())
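Note that Title is not a column of the raw CSV; it is derived from Name. A minimal sketch of one common way to extract it (the regex, which grabs the word ending in a period, is an assumption, not code from this post):
for dataset in combine:
    dataset['Title'] = dataset['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)  # e.g. 'Braund, Mr. Owen Harris' -> 'Mr'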
- Dropping attribute columns
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
Handling missing values
- Drop rows or columns that contain missing values
print(df4.dropna(axis=0, subset=['col1'])) # drop rows where col1 is NaN; subset restricts the check to the listed columns
print(df4.dropna(axis=1)) # drop columns that contain NaN
- Fill with a fixed value
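df4 above is just a placeholder name for some DataFrame with missing values; a small self-contained example of both calls:
df4 = pd.DataFrame({'col1': [1, np.nan, 3], 'col2': [4, 5, 6]})
print(df4.dropna(axis=0, subset=['col1'])) # drops the middle row, where col1 is NaN
print(df4.dropna(axis=1))                  # drops col1, the only column containing NaN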
dataset['Cabin'] = dataset['Cabin'].fillna('U')
dataset['Title'] = dataset['Title'].fillna(0)
- Fill with the most frequent value
freq_port = train_df.Embarked.dropna().mode()[0] # most common port of embarkation
dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
- Fill with the median or the mean
test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True) # fill with the median
test_df['Fare'].fillna(test_df['Fare'].dropna().mean(), inplace=True)   # or, alternatively, fill with the mean
Discretizing numeric attributes and numerically encoding object attributes
- Create a new column, AgeBand, by cutting the continuous Age attribute into 5 equal-width bins (a sketch that converts the bands to an ordinal index follows the output below)
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
print(train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand',ascending=True))
Output:
AgeBand Survived
0 (-0.08, 16.0] 0.550000
1 (16.0, 32.0] 0.337374
2 (32.0, 48.0] 0.412037
3 (48.0, 64.0] 0.434783
4 (64.0, 80.0] 0.090909
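After inspecting the bands, the continuous Age values can be replaced by an ordinal band index. A minimal sketch, assuming the same 5 equal-width bins (labels=False makes pd.cut return the integer bin index; rows with missing Age stay NaN):
train_df['Age'] = pd.cut(train_df['Age'], 5, labels=False)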
- Other related code:
Check how a binned attribute relates to the target attribute Survived:
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
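FareBand itself is not created in the snippet above; a minimal sketch, assuming four quantile-based bins (pd.qcut splits on quartiles, so each band holds roughly the same number of passengers):
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)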
Build a mapping dictionary for an object attribute and apply it:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Royalty":5, "Officer": 6}
dataset['Title'] = dataset['Title'].map(title_mapping)
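Any title missing from the dictionary becomes NaN after map(); a common follow-up (an assumption, mirroring the fillna usage above) is to map those to 0 and cast the column to int:
dataset['Title'] = dataset['Title'].fillna(0).astype(int)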
- Building a summary table with the DataFrame() constructor (a self-contained sketch follows the output below)
models = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Linear SVC',
'Decision Tree'],
'Score': [acc_svc, acc_knn, acc_log,
acc_random_forest, acc_gaussian, acc_perceptron,
acc_sgd, acc_linear_svc, acc_decision_tree]})
print(models.sort_values(by='Score', ascending=False))
Output:
Model Score
3 Random Forest 86.64
8 Decision Tree 86.64
1 KNN 84.06
0 Support Vector Machines 83.50
2 Logistic Regression 81.26
7 Linear SVC 79.46
5 Perceptron 78.79
4 Naive Bayes 76.88
6 Stochastic Gradient Descent 76.77
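The acc_* names above are accuracy values computed from models fitted earlier in the notebook (not shown here); the same constructor pattern works with any dict of equal-length lists, e.g. a self-contained two-row version using scores taken from the output above:
scores = pd.DataFrame({'Model': ['Random Forest', 'KNN'], 'Score': [86.64, 84.06]})
print(scores.sort_values(by='Score', ascending=False))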