• While working on Kaggle's introductory Titanic competition, I studied the data preprocessing and model-building workflow. This post focuses on preprocessing with pandas, covering data statistics, discretization, and attribute-correlation analysis.
    Importing packages and loading the data
import pandas as pd
import numpy as np
train_df = pd.read_csv('../datas/train.csv')  # train set
test_df = pd.read_csv('../datas/test.csv') # test set
combine = [train_df, test_df]
What these functions do
print(train_df.columns.values)  # inspect the attribute (column) names in the table
  • Output:
['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Ticket' 'Fare' 'Cabin' 'Embarked']

# inspect summary statistics for the object-typed columns
print(train_df.describe(include=['O']))
  • Count the occurrences of each value in the Title column (a column derived from Name, not in the original attributes)
print(train_df['Title'].value_counts())
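As a minimal sketch of what value_counts() returns (on made-up toy data rather than the real Titanic frame), it yields per-value counts sorted in descending order:

```python
import pandas as pd

# hypothetical toy Title data, not the actual Titanic column
titles = pd.Series(['Mr', 'Miss', 'Mr', 'Mrs', 'Mr', 'Miss'])
counts = titles.value_counts()  # counts unique values, most frequent first
print(counts)  # Mr: 3, Miss: 2, Mrs: 1
```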
  • Dropping attribute columns
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
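A self-contained sketch of drop() on a toy frame (the column values are made up) confirms that axis=1 removes columns rather than rows:

```python
import pandas as pd

# hypothetical miniature frame with the same column names
df = pd.DataFrame({'Name': ['a', 'b'], 'PassengerId': [1, 2], 'Age': [22, 38]})
df = df.drop(['Name', 'PassengerId'], axis=1)  # axis=1: drop by column label
print(df.columns.tolist())  # only 'Age' remains
```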

Handling missing values

  • Drop the rows (or columns) that contain missing values outright (df4 here stands for a generic example frame)
print(df4.dropna(axis=0, subset=['col1']))  # drop rows containing NaN; subset restricts the check to the listed columns
print(df4.dropna(axis=1))  # drop columns containing NaN
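Since df4 is not constructed above, here is a hypothetical one to check the axis/subset behavior of dropna():

```python
import pandas as pd
import numpy as np

# toy frame with NaNs in both columns (names and values are made up)
df4 = pd.DataFrame({'col1': [1.0, np.nan, 3.0],
                    'col2': [4.0, 5.0, np.nan]})
rows_kept = df4.dropna(axis=0, subset=['col1'])  # only col1's NaNs count: row 1 is dropped
cols_kept = df4.dropna(axis=1)                   # both columns contain a NaN, so none survive
```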
  • Fill with a fixed value
dataset['Cabin'] = dataset['Cabin'].fillna('U') 
dataset['Title'] = dataset['Title'].fillna(0)
  • Fill with the most frequent value
freq_port = train_df.Embarked.dropna().mode()[0]
dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
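The most-frequent-value fill can be sketched end to end on made-up Embarked data:

```python
import pandas as pd
import numpy as np

# hypothetical embarkation ports with one missing value
embarked = pd.Series(['S', 'C', np.nan, 'S', 'Q'])
freq_port = embarked.dropna().mode()[0]  # mode() returns a Series; [0] takes the top value
filled = embarked.fillna(freq_port)      # NaN replaced by the most frequent port
```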
  • Fill with the median or the mean
test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)  # fill with the median
# or, alternatively, fill with the mean (the two calls are alternatives; after the first
# fill there are no NaNs left for the second to act on):
# test_df['Fare'].fillna(test_df['Fare'].dropna().mean(), inplace=True)
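A quick check of the median fill on hypothetical fares:

```python
import pandas as pd
import numpy as np

# made-up fares with one missing entry
fares = pd.Series([7.25, 71.28, np.nan, 8.05])
by_median = fares.fillna(fares.median())  # Series.median() skips NaN by default
```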

Discretizing numeric attributes and converting object attributes to numbers

  • Create a new column, AgeBand, by cutting the continuous Age attribute into 5 bins
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
print(train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand',ascending=True))
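With an integer second argument, pd.cut() produces that many equal-width interval bins, as in the AgeBand output below. A toy sketch (the ages are made up):

```python
import pandas as pd

# hypothetical ages spanning the same range as the Titanic data
ages = pd.Series([1, 16, 25, 40, 55, 70, 80])
bands = pd.cut(ages, 5)  # 5 equal-width bins over [min, max]
print(bands.value_counts().sort_index())
```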
  • Output:

             AgeBand  Survived
    0  (-0.08, 16.0]  0.550000
    1   (16.0, 32.0]  0.337374
    2   (32.0, 48.0]  0.412037
    3   (48.0, 64.0]  0.434783
    4   (64.0, 80.0]  0.090909
  • Other code:

Inspect the relation between the binned attribute and the target attribute Survived

train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
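The groupby-then-mean pattern used above can be sketched on a tiny hypothetical frame, where the mean of a 0/1 Survived column is the survival rate per group:

```python
import pandas as pd

# made-up bands and outcomes
df = pd.DataFrame({'Band': ['low', 'low', 'high', 'high'],
                   'Survived': [0, 1, 1, 1]})
# as_index=False keeps 'Band' as a regular column instead of the index
rates = df.groupby('Band', as_index=False)['Survived'].mean()
print(rates)  # survival rate per band
```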

Build a mapping dictionary for an object attribute

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Royalty":5, "Officer": 6}
dataset['Title'] = dataset['Title'].map(title_mapping)
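A self-contained sketch of the mapping step ('Dr' is a made-up title absent from the dictionary, showing why the earlier fillna(0) on Title matters):

```python
import pandas as pd

# hypothetical titles; 'Dr' has no entry in the mapping
titles = pd.Series(['Mr', 'Miss', 'Mrs', 'Dr'])
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Royalty": 5, "Officer": 6}
# map() leaves unmapped values as NaN, hence the fillna(0) before casting to int
mapped = titles.map(title_mapping).fillna(0).astype(int)
```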
  • Using the DataFrame() constructor to compare model scores
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian, acc_perceptron,
              acc_sgd, acc_linear_svc, acc_decision_tree]})
print(models.sort_values(by='Score', ascending=False))
  • Output:

                             Model  Score
    3                Random Forest  86.64
    8                Decision Tree  86.64
    1                          KNN  84.06
    0      Support Vector Machines  83.50
    2          Logistic Regression  81.26
    7                   Linear SVC  79.46
    5                   Perceptron  78.79
    4                  Naive Bayes  76.88
    6  Stochastic Gradient Descent  76.77