Analyzing and Predicting Why Top Employees Leave (R)

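I. Background

1. A large number of the company's best and most experienced employees are leaving prematurely.

2. Data source: Kaggle

3. Variables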

satisfaction: Employee satisfaction level
evaluation: Last evaluation
project: Number of projects
hours: Average monthly hours
years: Time spent at the company
accident: Whether they have had a work accident
promotion: Whether they have had a promotion in the last 5 years
sales: Department
salary: Salary
left: Whether the employee has left

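4. Goals of the analysis:

(1) Identify the most likely main reasons why top employees leave
(2) Build a predictive model to predict which top employee will leave next

II. Data analysis

Load the required packages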

library(readr)
library(dplyr)
library(ggplot2)
library(gmodels)

(1) Import and inspect the data

## 1.1 Import the data

hr <- read_csv("HR_comma_sep.csv")
hr <- as_tibble(hr)  # tbl_df() is deprecated in current dplyr
View(hr)
str(hr)

Classes ‘tbl_df’, ‘tbl’ and ‘data.frame’: 14999 obs. of 10 variables:

$ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 …
$ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 …
$ number_project : int 2 5 7 5 2 2 6 5 5 2 …
$ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 …
$ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 …
$ Work_accident : int 0 0 0 0 0 0 0 0 0 0 …
$ left : int 1 1 1 1 1 1 1 1 1 1 …
$ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 …
$ sales : chr “sales” “sales” “sales” “sales” …
$ salary : chr “low” “medium” “medium” “low” …

## 1.2 Rename the variables

colnames(hr) <- c("satisfaction","evaluation","project","hours","years","accident","left","promotion","sales","salary")

## 1.3 Convert to factors

hr$sales <- factor(hr$sales)
hr$salary <- factor(hr$salary, levels=c("low","medium","high"))

## 1.4 Inspect the data

sum(is.na(hr))

# [1] 0

summary(hr)

satisfaction_level last_evaluation number_project average_montly_hours
Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0

time_spend_company Work_accident left promotion_last_5years
Min. : 2.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median : 3.000 Median :0.0000 Median :0.0000 Median :0.00000
Mean : 3.498 Mean :0.1446 Mean :0.2381 Mean :0.02127
3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :10.000 Max. :1.0000 Max. :1.0000 Max. :1.00000

sales salary
sales :4140 low :7316
technical :2720 medium:6446
support :2229 high :1237
IT :1227
product_mng: 902
marketing : 858
(Other) :2923

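(2) Select the subset of top employees by definition and run a first analysis

A top employee is defined as one who has:
(1) an evaluation score (evaluation) >= 0.75
(2) a number of projects (project) >= 4
(3) experience, i.e. tenure (years) >= 4

## 2.1 Select the subset according to the definition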

hr_good <- filter(hr, evaluation>=0.75 & years>=4 & project>= 4)

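## 2.2 Compare the share of leavers in the full data set and in the selected subset

[Conclusion]: Attrition among top employees is very serious.
1. Top employees account for about 50% (1778/3571) of all employees who left.
2. Within the top-employee subset, 64% (1778/2763) have left.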

CrossTable(hr$left)

Total Observations in Table: 14999

CrossTable(hr_good$left)
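
The two shares quoted in the conclusion are plain ratios of counts. A minimal base R sketch on a small made-up data frame shows the calculation; the `toy` data and the `top` flag are illustrative assumptions, not the HR data.

```r
# Toy data: 'left' is 1 for leavers, 'top' flags rows that meet the top-employee definition.
toy <- data.frame(
  left = c(1, 1, 1, 0, 0, 1, 0, 0, 1, 0),
  top  = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Share of all leavers who are top employees
share_of_leavers <- sum(toy$left == 1 & toy$top) / sum(toy$left == 1)

# Attrition rate inside the top-employee subset
rate_in_top <- mean(toy$left[toy$top] == 1)

print(c(share_of_leavers = share_of_leavers, rate_in_top = rate_in_top))
```

The first ratio answers "how many of the leavers were top employees", the second "how many of the top employees left"; they share a numerator but have different denominators.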

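## 2.3 Summary statistics for the top-employee subset

summary(hr_good)

## 2.4 Correlations among the variables in the top-employee subset

[Conclusion]: Leaving is negatively correlated with satisfaction, and this is the strongest correlation.

library(corrplot)
hr_good_corr <- select(hr_good, -sales, -salary) %>% cor()
corrplot(hr_good_corr, method="circle", tl.col="black", title="Leaving is negatively correlated with satisfaction, the strongest correlation", mar=c(1,1,3,1))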
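
To read the same information without a plot, the row of the correlation matrix that belongs to `left` can be ranked by absolute value. The sketch below uses simulated data in which low satisfaction is made to drive leaving; that relationship and all values are assumptions of the toy example.

```r
set.seed(42)  # toy data only; values are made up for illustration
n <- 200
satisfaction <- runif(n)
hours <- round(runif(n, 100, 300))
# Leaving is made more likely when satisfaction is low (an assumption of this toy example)
left <- as.integer(satisfaction + rnorm(n, sd = 0.2) < 0.4)
toy <- data.frame(satisfaction, hours, left)

corr <- cor(toy)                                   # Pearson correlation matrix
with_left <- corr["left", colnames(corr) != "left"]
# Rank the other variables by absolute correlation with 'left'
print(sort(abs(with_left), decreasing = TRUE))
```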

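(3) Variable-by-variable analysis of how leaving and satisfaction relate to the other variables

## 3.1 Distribution of satisfaction

hr_good$left <- factor(hr_good$left, levels=c(0,1), labels=c("stay", "left"))

# theme3 is the author's custom ggplot2 theme; its definition is not shown in this post
ggplot(hr_good, aes(satisfaction, fill=left)) +
  geom_histogram(position="dodge") +
  scale_x_continuous(breaks=c(0.1,0.13,0.25,0.50,0.73,0.75,0.92,1.00)) +
  theme3 + theme(axis.text.x=element_text(angle=90)) +
  labs(title="Many leavers have satisfaction in [0.1,0.13] or [0.73,0.92]")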

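## 3.2 Salary, working hours, and satisfaction

[Conclusion]: Employees working very long hours have low satisfaction and a high attrition rate.
1. Among low- and medium-salary employees, many of those with long working hours have left.
2. Employees with extremely long hours all have very low satisfaction, and almost all of them have left.

ggplot(hr_good, aes(salary, hours, alpha=satisfaction, color=left)) + geom_jitter() + theme3 + labs(title=paste("Low and medium salary: many employees with long hours have left","\n","Extremely long hours: very low satisfaction and almost all have left"))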

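## 3.3 Promotion, evaluation, and leaving

[Conclusion]: Almost none of the highly evaluated employees were promoted, and the leavers are concentrated among those who were not promoted.

ggplot(hr_good, aes(promotion, evaluation, color=left)) + geom_jitter() + theme3 + scale_x_continuous(breaks=c(0,1)) + labs(title=paste("Almost no highly evaluated employee was promoted","\n","Leavers are concentrated among those not promoted"))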

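## 3.4 Tenure, satisfaction, and leaving

[Conclusion]:
1. At the low satisfaction level (0.1), many employees with 4 years of tenure left.
2. Many highly satisfied employees left in their 5th and 6th years.
3. No one with 7 or more years of tenure left.

ggplot(hr_good, aes(years,satisfaction, color=left)) + geom_jitter() + scale_x_continuous(breaks=4:10) + theme3 + labs(title=paste("At low satisfaction (0.1), many employees with 4 years of tenure left","\n","Many highly satisfied employees left in years 5 and 6"))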

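## 3.5 Department, number of projects, and leaving

[Conclusion]: For employees with 6 or more projects, the attrition rate is very high in every department.

ggplot(hr_good, aes(sales,fill=left)) + geom_bar(position="fill") + facet_wrap(~factor(project),ncol=1) + theme3 + theme(axis.text.x=element_text(angle=270)) + labs(y="number of projects",title="With 6 or more projects, attrition is very high in every department")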

## 3.6 Department and leaving

[Conclusion]: In every department except management, leavers outnumber those who stayed.

ggplot(hr_good, aes(sales, fill=left)) + geom_bar(position="dodge") + coord_flip() + scale_x_discrete(limits=c("management","RandD","hr","accounting","marketing","product_mng","IT","support","technical","sales")) + labs(title="Leavers outnumber stayers in every department except management") + theme3

III. Predictive model 1: classification

## Split the data

library(caret)
set.seed(0001)
train <- createDataPartition(hr_good$left, p=0.75, list=FALSE)
hr_good_train <- hr_good[train, ]
hr_good_test <- hr_good[-train, ]
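
For a factor outcome, createDataPartition() samples the training rows within each class, so both parts keep roughly the same class balance. A rough base R analogue, for illustration only: `stratified_index` is a hypothetical helper, the data are made up, and caret's exact sampling and rounding may differ.

```r
stratified_index <- function(y, p = 0.75) {
  # Sample a fraction p of the row indices within each class of y
  idx <- lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })
  sort(unlist(idx, use.names = FALSE))
}

set.seed(1)
y <- factor(rep(c("stay", "left"), times = c(40, 80)))   # toy outcome, 1:2 class ratio
train_rows <- stratified_index(y, p = 0.75)

print(table(y[train_rows]))    # training part
print(table(y[-train_rows]))   # test part
```

Both printed tables keep the 1:2 ratio of the toy outcome, which is the property that matters when one class is much larger than the other.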

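(1) Logistic regression

## 1. Fit the logistic regression and validate it

# method="LogitBoost" fits a boosted logistic regression (caTools), not a plain glm
ctrl <- trainControl(method="cv",number=5)
logit <- train(left~., hr_good_train, method="LogitBoost", trControl=ctrl)
logit.pred <- predict(logit, hr_good_test)  # returns predicted class labels
# The actual labels are passed first, so the rows labelled "Prediction" below are the actual classes
confusionMatrix(hr_good_test$left, logit.pred)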

Confusion Matrix and Statistics

Reference

Prediction stay left
stay 213 33
left 8 436

Accuracy : 0.9406
95% CI : (0.9203, 0.957)
No Information Rate : 0.6797
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8675
Mcnemar’s Test P-Value : 0.0001781
Sensitivity : 0.9638
Specificity : 0.9296
Pos Pred Value : 0.8659
Neg Pred Value : 0.9820
Prevalence : 0.3203
Detection Rate : 0.3087
Detection Prevalence : 0.3565
Balanced Accuracy : 0.9467
‘Positive’ Class : stay
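
The headline figures can be recomputed by hand from the four counts. In the call above the actual labels were passed as the first argument, so the rows of the printed table are the actual classes and the columns are the predictions. The sketch below makes that explicit; the names `recall_left` and `precision_left` are introduced here for illustration.

```r
# Counts copied from the table above; rows = actual class, columns = predicted class
cm <- matrix(c(213,  33,
                 8, 436),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("stay", "left"),
                             predicted = c("stay", "left")))

accuracy       <- sum(diag(cm)) / sum(cm)
recall_left    <- cm["left", "left"] / sum(cm["left", ])   # actual leavers that were caught
precision_left <- cm["left", "left"] / sum(cm[, "left"])   # predicted leavers that really left

print(round(c(accuracy = accuracy, recall_left = recall_left,
              precision_left = precision_left), 4))
```

With this orientation, the value printed as "Neg Pred Value" (0.9820) is the recall for leavers and the value printed as "Specificity" (0.9296) is the precision for leavers.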

## 2. Evaluate the model: plot the ROC curve and AUC

library(pROC)
roc(as.numeric(hr_good_test$left), as.numeric(logit.pred), plot=TRUE, print.thres=TRUE, print.auc=TRUE, col="black")
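
Because roc() is given hard class labels here, the resulting curve has a single operating point. An AUC that reflects ranking quality needs scores or class probabilities (for a caret model, predict(..., type="prob") returns them). A minimal base R sketch of the rank-based AUC follows; `auc_rank` is a hypothetical helper and the data are made up.

```r
# AUC via the Mann-Whitney rank formula: the probability that a randomly chosen
# positive case receives a higher score than a randomly chosen negative case.
auc_rank <- function(labels, scores) {
  r  <- rank(scores)            # average ranks handle ties
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

labels <- c(0, 0, 1, 1)               # 1 = left, 0 = stay (toy values)
scores <- c(0.10, 0.40, 0.35, 0.80)   # predicted probability of leaving (toy values)
print(auc_rank(labels, scores))
```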

(2) Decision tree

## 1. Build the decision tree

library(rpart)
dtree <- rpart(left~., hr_good_train, method="class",parms=list(split="information"))
dtree$cptable

CP nsplit rel error xerror xstd

1 0.55209743 0 1.00000000 1.00000000 0.02950911
2 0.12855210 1 0.44790257 0.44790257 0.02256805
3 0.08254398 3 0.19079838 0.19079838 0.01551204
4 0.02300406 4 0.10825440 0.10825440 0.01186737
5 0.01000000 5 0.08525034 0.08525034 0.01057607

plotcp(dtree)
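
One common way to pick the pruning parameter from this table is the one-standard-error rule: take the smallest tree whose cross-validated error is within one standard error of the minimum. The sketch below applies it to the values printed above; the rule itself is a suggestion added here, not something the original analysis states it used.

```r
# Values copied from dtree$cptable above
cptab <- data.frame(
  CP     = c(0.55209743, 0.12855210, 0.08254398, 0.02300406, 0.01000000),
  nsplit = c(0, 1, 3, 4, 5),
  xerror = c(1.00000000, 0.44790257, 0.19079838, 0.10825440, 0.08525034),
  xstd   = c(0.02950911, 0.02256805, 0.01551204, 0.01186737, 0.01057607)
)

best      <- which.min(cptab$xerror)                    # row with the lowest cross-validated error
threshold <- cptab$xerror[best] + cptab$xstd[best]      # 1-SE cut-off
# Smallest tree whose xerror is within one standard error of the minimum
chosen <- min(which(cptab$xerror <= threshold))
print(cptab[chosen, ])
```

On these numbers the rule selects the last row, cp = 0.01, so pruning at that value leaves the tree with 5 splits.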

dtree.pruned <- prune(dtree, cp=0.01)
library(partykit)
library(grid)
plot(as.party(dtree.pruned),main="Decision Tree")

dtree.pruned.pred <- predict(dtree.pruned, hr_good_test, type="class")
confusionMatrix(hr_good_test$left, dtree.pruned.pred)

Confusion Matrix and Statistics

Reference

Prediction stay left
stay 236 10
left 13 431

Accuracy : 0.9667
95% CI : (0.9504, 0.9788)
No Information Rate : 0.6391
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9275
Mcnemar’s Test P-Value : 0.6767
Sensitivity : 0.9478
Specificity : 0.9773
Pos Pred Value : 0.9593
Neg Pred Value : 0.9707
Prevalence : 0.3609
Detection Rate : 0.3420
Detection Prevalence : 0.3565
Balanced Accuracy : 0.9626
‘Positive’ Class : stay

## 2. Evaluate the model: plot the ROC curve and AUC

roc(as.numeric(hr_good_test$left),as.numeric(dtree.pruned.pred), plot=TRUE, print.thres=TRUE, print.auc=TRUE,col="blue")

(3) Random forest

## 1. Build the random forest

library(randomForest)
set.seed(0002)
forest <- randomForest(left~., hr_good_train, importance=TRUE, na.action=na.roughfix)
forest

Call:

randomForest(formula = left ~ ., data = hr_good_train, importance = TRUE, na.action = na.roughfix)

Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 1.35%
Confusion matrix:
stay left class.error
stay 731 8 0.01082544
left 20 1314 0.01499250

importance(forest, type=2)

MeanDecreaseGini

satisfaction 315.455428
evaluation 63.101527
project 51.659717
hours 307.201129
years 168.530576
accident 6.232136
promotion 1.897123
sales 20.666829
salary 12.338717

forest.pred <- predict(forest, hr_good_test)
confusionMatrix(hr_good_test$left, forest.pred)

Confusion Matrix and Statistics

Reference

Prediction stay left
stay 241 5
left 4 440

Accuracy : 0.987
95% CI : (0.9754, 0.994)
No Information Rate : 0.6449
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9715
Mcnemar’s Test P-Value : 1
Sensitivity : 0.9837
Specificity : 0.9888
Pos Pred Value : 0.9797
Neg Pred Value : 0.9910
Prevalence : 0.3551
Detection Rate : 0.3493
Detection Prevalence : 0.3565
Balanced Accuracy : 0.9862
‘Positive’ Class : stay

## 2. Evaluate the model: plot the ROC curve and AUC

roc(as.numeric(hr_good_test$left), as.numeric(forest.pred), plot=TRUE, print.thres=TRUE, print.auc=TRUE, col="green")

(4) Support vector machine (SVM)

## 1. Build the SVM

library(e1071)
set.seed(0003)
svm <- svm(left~., hr_good_train)
svm.pred <- predict(svm, na.omit(hr_good_test))
confusionMatrix(na.omit(hr_good_test)$left, svm.pred)

Confusion Matrix and Statistics

Reference

Prediction stay left
stay 211 35
left 11 433

Accuracy : 0.9333
95% CI : (0.9121, 0.9508)
No Information Rate : 0.6783
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8515
Mcnemar’s Test P-Value : 0.000696
Sensitivity : 0.9505
Specificity : 0.9252
Pos Pred Value : 0.8577
Neg Pred Value : 0.9752
Prevalence : 0.3217
Detection Rate : 0.3058
Detection Prevalence : 0.3565
Balanced Accuracy : 0.9378
‘Positive’ Class : stay

## 2. Evaluate the model: plot the ROC curve and AUC

roc(as.numeric(na.omit(hr_good_test)$left), as.numeric(svm.pred), plot=TRUE, print.thres=TRUE, print.auc=TRUE, col="orange")

(5) Compare the models and choose the most accurate one

[Conclusion]: The random forest fits best, so it is chosen as the predictive model.

roc(as.numeric(hr_good_test$left), as.numeric(logit.pred), plot=TRUE,col="black",main=paste("ROC curves:","Logistic(black)","dtree(blue)","randomForest(green)","SVM(orange)",sep=" "))
roc(as.numeric(hr_good_test$left),as.numeric(dtree.pruned.pred), plot=TRUE, col="blue", add=TRUE)
roc(as.numeric(hr_good_test$left), as.numeric(forest.pred), plot=TRUE, col="green", add=TRUE)
roc(as.numeric(na.omit(hr_good_test)$left), as.numeric(svm.pred), plot=TRUE, col="orange", add=TRUE)

(6) Applying the model

importance(forest,type=2)

MeanDecreaseGini

satisfaction 315.455428
evaluation 63.101527
project 51.659717
hours 307.201129
years 168.530576
accident 6.232136
promotion 1.897123
sales 20.666829
salary 12.338717

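## 1. Drop the clearly unimportant predictors and refit the model

# Note: this refit uses all of hr_good, which includes the rows of hr_good_test,
# so the test-set figures below are optimistic; fit on hr_good_train for an unbiased estimate
forest2 <- randomForest(left~.-promotion-accident-salary-sales, hr_good, na.action=na.roughfix, importance=TRUE)
importance(forest2, type=2)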

MeanDecreaseGini

satisfaction 462.97925
evaluation 72.49430
project 71.09463
hours 417.63782
years 231.64080

forest2.pred <- predict(forest2, hr_good_test)
# The actual labels are passed first, so the rows labelled "Prediction" below are the actual classes
confusionMatrix(hr_good_test$left, forest2.pred)

Confusion Matrix and Statistics

Reference

Prediction stay left
stay 244 2
left 1 443

Accuracy : 0.9957
95% CI : (0.9873, 0.9991)
No Information Rate : 0.6449
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9905
Mcnemar’s Test P-Value : 1
Sensitivity : 0.9959
Specificity : 0.9955
Pos Pred Value : 0.9919
Neg Pred Value : 0.9977
Prevalence : 0.3551
Detection Rate : 0.3536
Detection Prevalence : 0.3565
Balanced Accuracy : 0.9957
‘Positive’ Class : stay

## 2. Evaluate the model: plot the ROC curve and AUC

roc(as.numeric(hr_good_test$left), as.numeric(forest2.pred), plot=TRUE, print.thres=TRUE, print.auc=TRUE, main="Random Forest", col="green")

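(7) Conclusions

1. The adjusted random forest predicts attrition with about 99.6% accuracy on the test set. Because the actual labels were passed as the first argument to confusionMatrix, the rows of the table are the actual classes: of the employees who actually left, 99.8% (443/444) were predicted to leave, and of those predicted to leave, 99.5% (443/445) actually left. Note that forest2 was fitted on all of hr_good, test rows included, so these figures are optimistic.

2. Dropping the unimportant variables (promotion, accident, sales, salary) does not hurt the model.

3. Satisfaction (satisfaction), average monthly hours (hours), and tenure (years) are the three main variables behind the departure of top employees.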

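IV. Predictive model 2: principal component analysis

(1) Determine the number of principal components

[Conclusion]: Based on the result, 2 principal components are retained.

library(psych)
hr_good_pc <- select(hr_good, -left, -sales, -salary)
fa.parallel(hr_good_pc, fa="pc", n.iter=100, show.legend=FALSE, main="Scree plot with parallel analysis")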

## Parallel analysis suggests that the number of factors = NA and the number of components = 2

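(2) Extract the principal components

[Conclusion]:
1. Of the retained components, RC1 explains 33% of the variance and RC2 explains 16%; together the two components explain 49%.
2. RC1 is positively related to satisfaction and years and negatively related to project and hours; call it composite factor 1.
3. RC2 is positively related to accident and promotion and negatively related to evaluation; call it composite factor 2.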

hr.good.rc <- principal(hr_good_pc, nfactors=2, scores=T)
hr.good.rc

Principal Components Analysis

Call: principal(r = hr_good_pc, nfactors = 2, scores = T)

Standardized loadings (pattern matrix) based upon correlation matrix

RC1  RC2  h2  u2 com

satisfaction 0.84 -0.19 0.74 0.26 1.1
evaluation 0.25 -0.66 0.49 0.51 1.3
project -0.85 0.03 0.73 0.27 1.0
hours -0.64 -0.30 0.50 0.50 1.4
years 0.61 0.10 0.39 0.61 1.1
accident 0.14 0.55 0.32 0.68 1.1
promotion 0.08 0.50 0.26 0.74 1.1

RC1  RC2

SS loadings 2.30 1.12
Proportion Var 0.33 0.16
Cumulative Var 0.33 0.49
Proportion Explained 0.67 0.33
Cumulative Proportion 0.67 1.00
Mean item complexity = 1.2
Test of the hypothesis that 2 components are sufficient.
The root mean square of the residuals (RMSR) is 0.14
with the empirical chi square 2283.14 with prob < 0
Fit based upon off diagonal values = 0.66
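
The "SS loadings" and "Proportion Var" rows follow directly from the loading matrix: square the loadings, sum them per component, and divide by the number of variables. A short sketch using the rounded loadings printed above (so the sums can differ from the printed ones in the last digit):

```r
# Loadings copied from the principal() output above (rounded to two decimals)
loadings <- matrix(c( 0.84, -0.19,
                      0.25, -0.66,
                     -0.85,  0.03,
                     -0.64, -0.30,
                      0.61,  0.10,
                      0.14,  0.55,
                      0.08,  0.50),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("satisfaction", "evaluation", "project", "hours",
                                     "years", "accident", "promotion"),
                                   c("RC1", "RC2")))

ss_loadings <- colSums(loadings^2)            # "SS loadings" row
prop_var    <- ss_loadings / nrow(loadings)   # "Proportion Var": divide by the number of variables
h2          <- rowSums(loadings^2)            # communality (h2) of each variable
print(round(prop_var, 2))
print(round(h2, 2))
```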

(3) Obtain the principal component scores

[Conclusion]: Top employees who left fall into two regions:
1. Region 1: RC1 in [-2,-1], RC2 in [-1,1]
2. Region 2: RC1 in [0,1], RC2 in [-1.5,0]

hr.good.rc.scores <- as.data.frame(hr.good.rc$scores)
hr.good.rc.scores$left <- hr_good$left
ggplot(hr.good.rc.scores, aes(RC1,RC2,color=factor(left))) + geom_point()

V. Predictive model 3: clustering

library(cluster)
library(fpc)

(1) Select suitable data and standardize it

hr_good_cl <- select(hr_good,-left,-sales,-salary)
hr_good_cl_scale <- as.data.frame(scale(hr_good_cl))
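
Standardizing matters because distance-based clustering would otherwise be dominated by variables on large scales (hours in the hundreds versus satisfaction between 0 and 1). A small sketch on made-up values shows that scale() is the usual z-score:

```r
toy <- data.frame(hours = c(150, 200, 250, 300),        # made-up values
                  satisfaction = c(0.2, 0.4, 0.6, 0.8))

scaled <- as.data.frame(scale(toy))   # same call pattern as above

# Manual z-score for one column: subtract the mean, divide by the standard deviation
manual_hours <- (toy$hours - mean(toy$hours)) / sd(toy$hours)
print(cbind(scaled, manual_hours))
```

After scaling, every column has mean 0 and standard deviation 1, so each variable contributes on a comparable footing to the distances.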

(2) Choose a suitable number of clusters

[Conclusion]: Based on the plot and the pamk result, 6 clusters are chosen.

source("wssplot.r")
wssplot(hr_good_cl_scale)

set.seed(0004)
pamk.best <- pamk(hr_good_cl_scale)
pamk.best$nc

# [1] 6
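
The file wssplot.r is not included in the post. A common version of such a helper plots the total within-cluster sum of squares from k-means against the number of clusters and looks for an elbow; the sketch below assumes that behaviour. `wss_by_k` is a hypothetical name and the data are simulated.

```r
# Total within-cluster sum of squares for k = 1..max_k using stats::kmeans
wss_by_k <- function(x, max_k = 8, seed = 1234) {
  x <- as.matrix(x)
  wss <- numeric(max_k)
  # k = 1: total sum of squares around the column means
  wss[1] <- (nrow(x) - 1) * sum(apply(x, 2, var))
  for (k in 2:max_k) {
    set.seed(seed)
    wss[k] <- sum(kmeans(x, centers = k, nstart = 10)$withinss)
  }
  wss
}

set.seed(1)
# Simulated data: three well-separated groups in two dimensions
x <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 6),  ncol = 2),
           matrix(rnorm(100, mean = 12), ncol = 2))
wss <- wss_by_k(scale(x), max_k = 6)
plot(seq_along(wss), wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
```

The elbow is read off by eye, which is why the analysis also consults pamk() for a criterion-based choice.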

(3) Fit the clustering and inspect the cluster structure

fit.pam <- pam(hr_good_cl_scale, k=6)
cl.pam <- table(hr_good$left, fit.pam$clustering)
cl.pam

       1   2   3   4   5  6
stay 228  72 150 138 364 33
left 326 838 502  87  21  4

hr_good$clustering <- fit.pam$clustering
ggplot(hr_good, aes(clustering, fill=factor(left))) + geom_bar(position="dodge") + scale_x_continuous(breaks=1:6) + theme3 + labs(title="Clusters 2 and 3 have a far higher share of leavers than the other clusters")

clusplot(fit.pam, main="Six clusters obtained with the PAM algorithm")

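(4) Evaluate the clustering: Rand index

[Conclusion]: The clustering agrees only weakly with the actual stay/left outcome (adjusted Rand index of about 0.17). The earlier analysis suggests why: top employees leave at high rates in two quite different situations.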

library(flexclust)
randIndex(cl.pam)

ARI
0.1742728
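
The adjusted Rand index can be reproduced from the cross-tabulation alone. The sketch below uses the stay/left counts per PAM cluster and the standard pair-counting formula; `adjusted_rand` is a helper written for this illustration, not the flexclust implementation.

```r
# Counts of stay/left by PAM cluster, as produced by table(hr_good$left, fit.pam$clustering)
tab <- matrix(c(228,  72, 150, 138, 364, 33,
                326, 838, 502,  87,  21,  4),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("stay", "left"), as.character(1:6)))

adjusted_rand <- function(tab) {
  pairs <- function(x) sum(choose(x, 2))     # number of object pairs within each count
  n_pairs  <- choose(sum(tab), 2)
  index    <- pairs(tab)                     # pairs grouped together in both partitions
  expected <- pairs(rowSums(tab)) * pairs(colSums(tab)) / n_pairs
  maximum  <- (pairs(rowSums(tab)) + pairs(colSums(tab))) / 2
  (index - expected) / (maximum - expected)
}

print(adjusted_rand(tab))
```

A value of 0 corresponds to chance-level agreement and 1 to identical partitions, so a value near 0.17 indicates weak agreement.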

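(5) Split the data by cluster and describe each subset

## 1. Summary statistics for each cluster

[Conclusion]: Judging by the means, employees in the following two situations are very likely to leave:
1. Cluster 2, with the highest share of leavers (left=0.92): very low satisfaction (0.11), the most projects (6.17), and the longest average monthly hours (274.6).
2. Cluster 3, with the second-highest share of leavers (left=0.77): very high evaluation (0.96), high satisfaction (0.78), average monthly hours of 238.9, and about 5.2 years of tenure.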

hr_good <- as.data.frame(hr_good)
hr_good$left <- as.integer(hr_good$left)
hr_good$left <- ifelse(hr_good$left==1,0,1)
select(hr_good,-sales,-salary) %>% group_by(., clustering) %>% summarize_all(.,mean)

# A tibble: 6 × 9
  clustering satisfaction evaluation  project    hours    years  accident     left promotion
1          1    0.7574729  0.8453791 4.673285 246.5036 5.223827 0.0000000 0.588447         0
2          2    0.1114945  0.8714176 6.167033 274.6451 4.137363 0.0000000 0.920879         0
3          3    0.7809969  0.9649080 4.605828 238.9632 5.170245 0.0000000 0.769938         0
4          4    0.5258667  0.8865778 4.960000 219.9511 5.142222 1.0000000 0.386667         0
5          5    0.5542078  0.8585714 4.446753 161.2779 5.088312 0.0000000 0.054545         0
6          6    0.5908108  0.8648649 4.864865 225.5135 5.567568 0.1891892 0.108108         1
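
The recoding step before the summary is needed because as.integer() on a factor returns the level codes (1 and 2), not 0 and 1. Once `left` is a 0/1 indicator, its group mean is the attrition rate of each cluster. A toy sketch in base R (the data are made up, and aggregate() stands in for the dplyr pipeline):

```r
toy <- data.frame(
  clustering = c(1, 1, 1, 2, 2, 2),    # made-up cluster labels
  left = factor(c("stay", "left", "stay", "left", "left", "left"),
                levels = c("stay", "left"))
)

# as.integer() on a factor returns the level codes (stay = 1, left = 2), not 0/1
codes <- as.integer(toy$left)
toy$left <- ifelse(codes == 1, 0, 1)   # recode to 0 = stay, 1 = left

# The mean of a 0/1 indicator is the attrition rate within each cluster
rates <- aggregate(left ~ clustering, data = toy, FUN = mean)
print(rates)
```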

## 2. Select the subset of clusters 2 and 3

hr_good_cl_select <- filter(hr_good, clustering %in% c(2,3))
summary(hr_good_cl_select)

satisfaction evaluation project hours years accident
Min. :0.090 Min. :0.7600 Min. :4.000 Min. :137.0 Min. : 4.000 Min. :0
1st Qu.:0.100 1st Qu.:0.8525 1st Qu.:5.000 1st Qu.:243.0 1st Qu.: 4.000 1st Qu.:0
Median :0.110 Median :0.9200 Median :6.000 Median :260.0 Median : 4.000 Median :0
Mean :0.391 Mean :0.9104 Mean :5.515 Mean :259.8 Mean : 4.569 Mean :0
3rd Qu.:0.790 3rd Qu.:0.9700 3rd Qu.:6.000 3rd Qu.:282.0 3rd Qu.: 5.000 3rd Qu.:0
Max. :1.000 Max. :1.0000 Max. :7.000 Max. :310.0 Max. :10.000 Max. :0
left promotion sales salary clustering
Min. :0.000 Min. :0 sales :408 low :895 Min. :2.000
1st Qu.:1.000 1st Qu.:0 technical :341 medium:628 1st Qu.:2.000
Median :1.000 Median :0 support :213 high : 39 Median :2.000
Mean :0.858 Mean :0 IT :129 Mean :2.417
3rd Qu.:1.000 3rd Qu.:0 product_mng:104 3rd Qu.:3.000
Max. :1.000 Max. :0 accounting : 94 Max. :3.000
(Other) :273

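## 3. Examine the distribution of each variable in the subset

## 3.1 Distribution of satisfaction

hr_good_cl_select$clustering <- factor(hr_good_cl_select$clustering)
# theme2 is the author's custom ggplot2 theme; its definition is not shown in this post
# breaks (not limits) takes the vector of tick positions; limits must have length 2
ggplot(hr_good_cl_select , aes(satisfaction,fill=factor(left))) + geom_histogram() + facet_wrap(~clustering,ncol=1) + theme2 + scale_x_continuous(breaks=c(0.1,0.25,0.50,0.75,0.9,1.0))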

## 3.2 Distribution of evaluation

ggplot(hr_good_cl_select , aes(evaluation,fill=factor(left))) + geom_histogram() + facet_wrap(~clustering,ncol=1) + theme2 + scale_x_continuous(breaks=c(0.77,0.79,0.8,0.84,0.85,0.88,0.89,0.91,0.93,0.94,0.98,1.0)) + theme(axis.text.x=element_text(angle=90),panel.grid.minor=element_blank())

## 3.3 Distribution of the number of projects

ggplot(hr_good_cl_select , aes(project,fill=factor(left))) + geom_bar() + facet_wrap(~clustering,ncol=1) + theme2

## 3.4 Distribution of average monthly hours

ggplot(hr_good_cl_select , aes(hours,fill=factor(left))) + geom_histogram() + facet_wrap(~clustering,ncol=1) + theme2 + scale_x_continuous(breaks=c(160,200,220,245,275,310)) + theme(panel.grid.minor=element_blank())

## 3.5 Distribution of tenure

ggplot(hr_good_cl_select , aes(years,fill=factor(left))) + geom_histogram(binwidth=0.5) + facet_wrap(~clustering,ncol=1) + theme2 + scale_x_continuous(breaks=c(4,5,6,7,8,10))

## 3.6 Distribution of work accidents

ggplot(hr_good_cl_select , aes(accident,fill=factor(left))) + geom_bar(width=0.3) + facet_wrap(~clustering,ncol=1) + theme2

## 3.7 Distribution of promotions

ggplot(hr_good_cl_select , aes(promotion,fill=factor(left))) + geom_bar() + facet_wrap(~clustering,ncol=1) + theme2 + scale_x_continuous(breaks=c(0,1))

## 3.8 Distribution by department

ggplot(hr_good_cl_select , aes(sales,fill=factor(left))) + geom_bar() + facet_wrap(~clustering,ncol=1) + theme2 + theme(axis.text.x=element_text(angle=270,vjust=0.5))

## 3.9 Distribution of salary level

ggplot(hr_good_cl_select , aes(salary,fill=factor(left))) + geom_bar(width=0.5) + facet_wrap(~clustering,ncol=1) + theme2 + scale_y_continuous(breaks=c(0,20,100,200,300,400,500)) + theme(panel.grid.minor=element_blank())
