ITPub博客

首页 > 大数据 > 数据挖掘 > 如何赢得英雄联盟LOL?游戏获胜与哪些特征更相关

如何赢得英雄联盟LOL?游戏获胜与哪些特征更相关

原创 数据挖掘 作者:helaoshi 时间:2020-06-26 15:36:02 0 删除 编辑

设置

In [1]:
# 导入包和数据import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as sns
In [2]:
%matplotlib inlinesns.set_style('darkgrid')
In [3]:
df = pd.read_csv('/home/kesci/input/lol8974/high_diamond_ranked_10min.csv')df.head()
Out[3]:
gameId blueWins blueWardsPlaced blueWardsDestroyed blueFirstBlood blueKills blueDeaths blueAssists blueEliteMonsters blueDragons ... redTowersDestroyed redTotalGold redAvgLevel redTotalExperience redTotalMinionsKilled redTotalJungleMinionsKilled redGoldDiff redExperienceDiff redCSPerMin redGoldPerMin
0 4519157822 0 28 2 1 9 6 11 0 0 ... 0 16567 6.8 17047 197 55 -643 8 19.7 1656.7
1 4523371949 0 12 1 0 5 5 5 0 0 ... 1 17620 6.8 17438 240 52 2908 1173 24.0 1762.0
2 4521474530 0 15 0 0 7 11 4 1 1 ... 0 17285 6.8 17254 203 28 1172 1033 20.3 1728.5
3 4524384067 0 43 1 0 4 5 5 1 0 ... 0 16478 7.0 17961 235 47 1321 7 23.5 1647.8
4 4436033771 0 75 4 0 6 6 6 0 0 ... 0 17404 7.0 18313 225 67 1004 -230 22.5 1740.4

5 rows × 40 columns

EDA

In [4]:
# 检查缺少的值和数据类型df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9879 entries, 0 to 9878
Data columns (total 40 columns):
gameId                          9879 non-null int64
blueWins                        9879 non-null int64
blueWardsPlaced                 9879 non-null int64
blueWardsDestroyed              9879 non-null int64
blueFirstBlood                  9879 non-null int64
blueKills                       9879 non-null int64
blueDeaths                      9879 non-null int64
blueAssists                     9879 non-null int64
blueEliteMonsters               9879 non-null int64
blueDragons                     9879 non-null int64
blueHeralds                     9879 non-null int64
blueTowersDestroyed             9879 non-null int64
blueTotalGold                   9879 non-null int64
blueAvgLevel                    9879 non-null float64
blueTotalExperience             9879 non-null int64
blueTotalMinionsKilled          9879 non-null int64
blueTotalJungleMinionsKilled    9879 non-null int64
blueGoldDiff                    9879 non-null int64
blueExperienceDiff              9879 non-null int64
blueCSPerMin                    9879 non-null float64
blueGoldPerMin                  9879 non-null float64
redWardsPlaced                  9879 non-null int64
redWardsDestroyed               9879 non-null int64
redFirstBlood                   9879 non-null int64
redKills                        9879 non-null int64
redDeaths                       9879 non-null int64
redAssists                      9879 non-null int64
redEliteMonsters                9879 non-null int64
redDragons                      9879 non-null int64
redHeralds                      9879 non-null int64
redTowersDestroyed              9879 non-null int64
redTotalGold                    9879 non-null int64
redAvgLevel                     9879 non-null float64
redTotalExperience              9879 non-null int64
redTotalMinionsKilled           9879 non-null int64
redTotalJungleMinionsKilled     9879 non-null int64
redGoldDiff                     9879 non-null int64
redExperienceDiff               9879 non-null int64
redCSPerMin                     9879 non-null float64
redGoldPerMin                   9879 non-null float64
dtypes: float64(6), int64(34)
memory usage: 3.0 MB
In [5]:
df_clean = df.copy()
In [6]:
# #删除一些不必要的列。例如,blueFirstblood/redfirst blood blueEliteMonster/redEliteMonster blueDeath/redKills等重复cols = ['gameId', 'redFirstBlood', 'redKills', 'redEliteMonsters', 'redDragons','redTotalMinionsKilled',
       'redTotalJungleMinionsKilled', 'redGoldDiff', 'redExperienceDiff', 'redCSPerMin', 'redGoldPerMin', 'redHeralds',
       'blueGoldDiff', 'blueExperienceDiff', 'blueCSPerMin', 'blueGoldPerMin', 'blueTotalMinionsKilled']df_clean = df_clean.drop(cols, axis = 1)
In [7]:
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9879 entries, 0 to 9878
Data columns (total 23 columns):
blueWins                        9879 non-null int64
blueWardsPlaced                 9879 non-null int64
blueWardsDestroyed              9879 non-null int64
blueFirstBlood                  9879 non-null int64
blueKills                       9879 non-null int64
blueDeaths                      9879 non-null int64
blueAssists                     9879 non-null int64
blueEliteMonsters               9879 non-null int64
blueDragons                     9879 non-null int64
blueHeralds                     9879 non-null int64
blueTowersDestroyed             9879 non-null int64
blueTotalGold                   9879 non-null int64
blueAvgLevel                    9879 non-null float64
blueTotalExperience             9879 non-null int64
blueTotalJungleMinionsKilled    9879 non-null int64
redWardsPlaced                  9879 non-null int64
redWardsDestroyed               9879 non-null int64
redDeaths                       9879 non-null int64
redAssists                      9879 non-null int64
redTowersDestroyed              9879 non-null int64
redTotalGold                    9879 non-null int64
redAvgLevel                     9879 non-null float64
redTotalExperience              9879 non-null int64
dtypes: float64(2), int64(21)
memory usage: 1.7 MB
In [8]:
# 接下来让我们检查blue team特征参数之间的关系g = sns.PairGrid(data=df_clean, vars=['blueKills', 'blueAssists', 'blueWardsPlaced', 'blueTotalGold'], hue='blueWins', size=3, palette='Set1')g.map_diag(plt.hist)g.map_offdiag(plt.scatter)g.add_legend();
/opt/conda/lib/python3.6/site-packages/seaborn/axisgrid.py:1241: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(UserWarning(msg))

我们可以看到变量之间有很多线性对称

In [9]:
# 我们可以看到很多特征是高度相关的,让我们得到相关矩阵plt.figure(figsize=(16, 12))sns.heatmap(df_clean.drop('blueWins', axis=1).corr(), cmap='YlGnBu', annot=True, fmt='.2f', vmin=0);
In [10]:
# 基于相关性矩阵,让我们稍微清理一下数据集,以避免共线性cols = ['blueAvgLevel', 'redWardsPlaced', 'redWardsDestroyed', 'redDeaths', 'redAssists', 'redTowersDestroyed',
       'redTotalExperience', 'redTotalGold', 'redAvgLevel']df_clean = df_clean.drop(cols, axis=1)
In [11]:
# 接下来让我们删除与bluewins关系不大的列corr_list = df_clean[df_clean.columns[1:]].apply(lambda x: x.corr(df_clean['blueWins']))cols = []for col in corr_list.index:
    if (corr_list[col]>0.2 or corr_list[col]<-0.2):
        cols.append(col)cols
Out[11]:
['blueFirstBlood',
 'blueKills',
 'blueDeaths',
 'blueAssists',
 'blueEliteMonsters',
 'blueDragons',
 'blueTotalGold',
 'blueTotalExperience']
In [12]:
df_clean = df_clean[cols]df_clean.head()
Out[12]:
blueFirstBlood blueKills blueDeaths blueAssists blueEliteMonsters blueDragons blueTotalGold blueTotalExperience
0 1 9 6 11 0 0 17210 17039
1 0 5 5 5 0 0 14712 16265
2 0 7 11 4 1 1 16113 16221
3 0 4 5 5 1 0 15157 17954
4 0 6 6 6 0 0 16400 18543
In [13]:
df_clean.hist(alpha = 0.7, figsize=(12,10), bins=5);

模型选择

In [14]:
from sklearn.preprocessing import MinMaxScalerfrom sklearn.model_selection import train_test_splitX = df_cleany = df['blueWins']scaler = MinMaxScaler()scaler.fit(X)X = scaler.transform(X)X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

朴素贝叶斯

In [15]:
from sklearn.naive_bayes import GaussianNBfrom sklearn.metrics import accuracy_score# 拟合模型clf_nb = GaussianNB()clf_nb.fit(X_train, y_train)pred_nb = clf_nb.predict(X_test)# 得到准确性分数acc_nb = accuracy_score(pred_nb, y_test)print(acc_nb)
0.7176113360323887

决策树

In [16]:
# 拟合决策树模型from sklearn import treefrom sklearn.model_selection import GridSearchCVtree = tree.DecisionTreeClassifier()# 搜索最好的参数grid = {'min_samples_split': [5, 10, 20, 50, 100]},clf_tree = GridSearchCV(tree, grid, cv=5)clf_tree.fit(X_train, y_train)pred_tree = clf_tree.predict(X_test)# 获得准确度分数acc_tree = accuracy_score(pred_tree, y_test)print(acc_tree)
0.6928137651821862

随机森林

In [17]:
# 拟合模型from sklearn.ensemble import RandomForestClassifierrf = RandomForestClassifier()# 搜索最好的参数grid = {'n_estimators':[100,200,300,400,500], 'max_depth': [2, 5, 10]}clf_rf = GridSearchCV(rf, grid, cv=5)clf_rf.fit(X_train, y_train)pred_rf = clf_rf.predict(X_test)# 获得准确分数acc_rf = accuracy_score(pred_rf, y_test)print(acc_rf)
0.7282388663967612

逻辑回归

In [18]:
# 拟合逻辑回归模型from sklearn.linear_model import LogisticRegressionlm = LogisticRegression()lm.fit(X_train, y_train)# 得到准确分数pred_lm = lm.predict(X_test)acc_lm = accuracy_score(pred_lm, y_test)print(acc_lm)
0.7302631578947368
/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

K-近邻算法(KNN)

In [19]:
# 拟合模型from sklearn.neighbors import KNeighborsClassifierknn = KNeighborsClassifier() # 搜索最好的参数grid = {"n_neighbors":np.arange(1,100)}clf_knn = GridSearchCV(knn, grid, cv=5)clf_knn.fit(X_train,y_train) # 得到准确分数pred_knn = clf_knn.predict(X_test) acc_knn = accuracy_score(pred_knn, y_test)print(acc_knn)
0.7171052631578947

结论

In [20]:
data_dict = {'Naive Bayes': [acc_nb], 'DT': [acc_tree], 'Random Forest': [acc_rf], 'Logistic Regression': [acc_lm], 'K_nearest Neighbors': [acc_knn]}df_c = pd.DataFrame.from_dict(data_dict, orient='index', columns=['Accuracy Score'])print(df_c)
                     Accuracy Score
Naive Bayes                0.717611
DT                         0.692814
Random Forest              0.728239
Logistic Regression        0.730263
K_nearest Neighbors        0.717105

从精度分数可以看出,逻辑回归和随机森林的预测效果最好。接下来让我们仔细看看这两种方法的召回率和精确率

In [21]:
# 召回率和精确率from sklearn.metrics import recall_score, precision_score# lm参数 recall_lm = recall_score(pred_lm, y_test, average = None)precision_lm = precision_score(pred_lm, y_test, average = None)print('precision score for naive bayes: {}\n recall score for naive bayes:{}'.format(precision_lm, recall_lm))
precision score for naive bayes: [0.72736521 0.73313192]
 recall score for naive bayes:[0.72959184 0.73092369]
In [22]:
# rf参数recall_rf = recall_score(pred_rf, y_test, average = None)precision_rf = precision_score(pred_rf, y_test, average = None)print('precision score for naive bayes: {}\n recall score for naive bayes:{}'.format(precision_rf, recall_rf))
precision score for naive bayes: [0.73550356 0.72104733]
 recall score for naive bayes:[0.723      0.73360656]

一般情况下,还是会选择逻辑回归

In [23]:
df_clean.columns
Out[23]:
Index(['blueFirstBlood', 'blueKills', 'blueDeaths', 'blueAssists',
       'blueEliteMonsters', 'blueDragons', 'blueTotalGold',
       'blueTotalExperience'],
      dtype='object')
In [24]:
lm.coef_
Out[24]:
array([[ 0.09223084,  1.6863957 , -4.9521688 , -0.28960539,  0.30701029,
         0.29102466,  5.33422084,  1.55350489]])
In [25]:
np.exp(lm.coef_)
Out[25]:
array([[1.09661794e+00, 5.39998244e+00, 7.06806308e-03, 7.48558900e-01,
        1.35935495e+00, 1.33779758e+00, 2.07311157e+02, 4.72801236e+00]])
In [26]:
coef_data = np.concatenate((lm.coef_, np.exp(lm.coef_)),axis=0)coef_df = pd.DataFrame(data=coef_data, columns=df_clean.columns).T.reset_index().rename(columns={'index': 'Var', 0: 'coef', 1: 'oddRatio'})coef_df.sort_values(by='coef', ascending=False)
Out[26]:
Var coef oddRatio
6 blueTotalGold 5.334221 207.311157
1 blueKills 1.686396 5.399982
7 blueTotalExperience 1.553505 4.728012
4 blueEliteMonsters 0.307010 1.359355
5 blueDragons 0.291025 1.337798
0 blueFirstBlood 0.092231 1.096618
3 blueAssists -0.289605 0.748559
2 blueDeaths -4.952169 0.007068

PCA

In [27]:
# 使用PCA让结果可视化X = df_cleany = df['blueWins']# PCA受scale的影响,首先要对数据集进行scalefrom sklearn import preprocessing # S标准化特征X = preprocessing.StandardScaler().fit_transform(X)
In [28]:
from sklearn.decomposition import PCApca = PCA(n_components=2)components = pca.fit_transform(X)print(pca.explained_variance_ratio_)
[0.42435947 0.20805752]
In [29]:
# 创造可视化dfdf_vis = pd.DataFrame(data = components, columns = ['pc1', 'pc2'])df_vis = pd.concat([df_vis, df['blueWins']], axis = 1)X = df_vis[['pc1', 'pc2']]y = df_vis['blueWins']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [30]:
# 重新定义pca数据lm.fit(X_train, y_train)
/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[30]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
In [31]:
# 可视化函数from matplotlib.colors import ListedColormapdef DecisionBoundary(clf):
    X = df_vis[['pc1', 'pc2']]
    y = df_vis['blueWins']
    
    h = .02 
    # 创建颜色映射
    cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000', '#0000FF'])
    
    # 绘制决策边界。为此,我们将为每一个分配一个颜色
    # 网格 [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
    y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 8))
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    
    # Plot also the training points
    plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.show()
In [32]:
DecisionBoundary(lm)


来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/69977871/viewspace-2700669/,如需转载,请注明出处,否则将追究法律责任。

上一篇: 没有了~
请登录后发表评论 登录
全部评论

注册时间:2020-06-26

  • 博文量
    7
  • 访问量
    726