On Choosing Where to Spend Your Skill-Tree Points

Posted on 2016-09-04

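The Data Skill Tree

After seeing the full-stack skill tree compiled earlier, I spent the past two days putting together a must-have skill tree for data people. Without further ado, here is the chart.

In short, once you have fallen into the data pit, filling it in takes far more than a day or two. Settle down and be patient; after all, "a noisy bird never flies high."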

NaiveBayes

Posted on 2016-09-04

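About Naive Bayes

How naive Bayes classification works:
Given an item to classify, compute the probability of each class conditional on the item's observed features; the item is assigned to whichever class has the highest probability.
I won't write out Bayes' theorem here; it is fairly easy to understand.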

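Many packages provide a naive Bayes classifier. The two I find most comfortable are e1071 and klaR; this post uses e1071.
Its main function is naiveBayes(formula, data, laplace = 0)
formula: usually of the form type ~ . ; interaction terms are not allowed (for dependencies between variables, see Bayesian belief networks)
laplace: a positive double controlling Laplace smoothing; the default of 0 disables smoothing.
data: the data set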
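To make the "pick the most probable class" rule and the laplace argument concrete, here is a minimal sketch of a categorical naive Bayes classifier in Python. It is a generic illustration under my own assumptions, not e1071's internals: the function names and the toy weather rows are invented, and it only handles categorical features.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels, laplace=0.0):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # value_counts[j][c][v] = how often feature j took value v within class c
    value_counts = [defaultdict(Counter) for _ in range(n_features)]
    levels = [set() for _ in range(n_features)]
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[j][label][value] += 1
            levels[j].add(value)
    return class_counts, value_counts, levels, laplace

def posterior(model, row):
    """Return P(class | row) for every class (all zeros if nothing is possible)."""
    class_counts, value_counts, levels, laplace = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for j, value in enumerate(row):
            count = value_counts[j][c][value]
            # smoothed estimate of P(value | c); laplace = 0 gives raw frequencies
            score *= (count + laplace) / (n_c + laplace * len(levels[j]))
        scores[c] = score
    norm = sum(scores.values())
    return {c: (s / norm if norm > 0 else 0.0) for c, s in scores.items()}

def predict(model, row):
    """Assign the row to the class with the highest posterior probability."""
    post = posterior(model, row)
    return max(post, key=post.get)

# Hypothetical toy data: (weather, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]

plain = train_naive_bayes(rows, labels)                # laplace = 0
smoothed = train_naive_bayes(rows, labels, laplace=1)  # add-one smoothing

print(posterior(plain, ("sunny", "cool")))
print(posterior(smoothed, ("sunny", "cool")))
```

Without smoothing, the combination ("sunny", "cool") never occurs within either class, so both scores collapse to zero and the classifier has nothing to choose between. With laplace = 1 the posterior comes out at roughly 0.6 for "no" and 0.4 for "yes".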

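library(e1071)    # provides naiveBayes()
library(mlbench)  # provides the Glass data set
data("Glass")
model <- naiveBayes(Type ~ ., data = Glass)
# type = "class" returns predicted labels; type = "raw" would return class probabilities
prd_test_Glass <- predict(model, Glass[1:10, ], type = "class")
prd_test_Glass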

The result:
> prd_test_Glass
 [1] 2 1 1 1 1 1 1 1 3 1
Levels: 1 2 3 5 6 7

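The Levels line shows six glass types in the data (labelled 1, 2, 3, 5, 6 and 7), and the predicted classes for the first ten rows are: 2 1 1 1 1 1 1 1 3 1
Comparing with the data, the first ten rows of Glass are all actually type 1, so two of the ten are misclassified and eight are correct. That is a reasonable hit rate, although these rows were also part of the training data.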

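To sum up: I once saw an expert apply naive Bayes classification to the Titanic data set to find out which conditions made a passenger most likely to be rescued, which I found quite meaningful. So I tried it myself for comparison, using both e1071 and klaR, and the two packages did not give identical results. There is still a lot more that naive Bayes can do: social-network analysis based on user behaviour and churn analysis for online games, for example, can both use a naive Bayes classifier.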

A New Era for Multi-Panel Plots: cowplot

Posted on 2016-09-02

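Here is my discovery of the past couple of days: the cowplot package, a new package on CRAN. Let me walk through its handy features.
For many R coders, it is frustrating not to know which package solves the problem at hand, and even more frustrating to know which package to use but not how to use it. Because cowplot was released fairly recently, there is very little documentation about it in Chinese, so consider this post my contribution of a few commits, hahaha.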

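display0: adding a grid to a ggplot figure

This is a freshly drawn plot with a blank background.

After applying background_grid(major = "xy", minor = "none")
the background gains a grid. Here major = "xy" adds grid lines along both the x and y axes; to add grid lines in one direction only, pass just "x" or just "y".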

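display1: adding labels to ggplot figures

With multiple plots, labelling each one becomes especially important: it makes every panel stand out and easy to identify.
plot_grid(test_01, test_02, labels = c("Pic_01", "Pic_02"))
This uses the labels argument, which names the panels through a character vector.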

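display2: laying out ggplot figures

Again with multiple plots, their arrangement matters a great deal. Previously in R one would use grid.arrange(), but cowplot offers the handy plot_grid().
The call looks like this:
plot_grid(test_01, NULL, test_02, NULL, labels = c("Pic01", "Pic02", "Pic03", "Pic04"), ncol = 2)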

At this point, however, the axes of the panels are not aligned with each other, which looks ugly and unprofessional.
After adding align:
plot_grid(test_01, NULL, test_02, NULL, labels = c("Pic01", "Pic02", "Pic03", "Pic04"), align = "v", ncol = 2)

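display3: adding a watermark to a ggplot figure

Once the figure is drawn, a watermark can be added when needed, not only to deter image theft but also to help tell figures apart when comparing and categorising them.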
draw_label("Live's chart" , angle = 45, size = 80, alpha = .1)

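Overall, cowplot's philosophy is quite close to ggplot2's: you build the figure by adding layers onto the plot panel, and it is far easier to use than the earlier approaches.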

Decision Trees: C50

Posted on 2016-08-29

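I happened to come across a commonly used data set on data flow and downloaded it. It is a data set about bank customer defaults, which gave me exactly the excuse I needed to try C5.0 in practice.
Fields in the data set: account balance, years employed, loan-to-income ratio, years at current residence, age, other credit records, housing, existing loan accounts, job type, dependents, telephone, default status
The general approach is as follows: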
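set.seed(1024)  # fix the random seed so the split is reproducible
credit_sample <- credit[order(runif(1000)), ]  # shuffle the 1000 rows into random order

train <- credit_sample[1:900, ]    # first 900 rows as the training set (9:1 split)
test <- credit_sample[901:1000, ]  # remaining 100 rows as the test set

library(C50)
credit_model <- C5.0(train[-17], train$default)  # fit the C5.0 model; column 17 is left out of the predictors

credit_pred <- predict(credit_model, test)  # predict on the test set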
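The split above works by shuffling the rows with a fixed seed and then cutting the shuffled table at row 900. As a rough sketch of the same idea in Python (the shuffle_split helper and the integer stand-in records are hypothetical, and Python's random generator will not reproduce R's ordering for the same seed):

```python
import random

def shuffle_split(records, n_train, seed=1024):
    """Shuffle a copy of records with a fixed seed, then cut it in two."""
    rng = random.Random(seed)  # local generator, so the split is reproducible
    shuffled = list(records)   # leave the caller's list untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

records = list(range(1000))  # hypothetical stand-in for the 1000 credit rows
train, test = shuffle_split(records, n_train=900)
print(len(train), len(test))
```

Shuffling before cutting matters because a table sorted by some field (loan amount, say) would otherwise put systematically different rows into the training and test sets.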

The results look like this:
Total Observations in Table:  100

               | predicted default
actual default |        no |       yes | Row Total |
---------------|-----------|-----------|-----------|
            no |        57 |        11 |        68 |
               |     0.570 |     0.110 |           |
---------------|-----------|-----------|-----------|
           yes |        16 |        16 |        32 |
               |     0.160 |     0.160 |           |
---------------|-----------|-----------|-----------|
  Column Total |        73 |        27 |       100 |
---------------|-----------|-----------|-----------|
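Applied to the test set, the model is 73% accurate ((57 + 16) / 100), but it identifies only 50% of the 32 borrowers who actually defaulted (16 of 32), which is clearly not good enough.
Later on, the decision tree can be refined by pruning. To raise accuracy one can use bagging, boosting or random forests; pessimistic pruning, a post-pruning method, is another option worth particular mention.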
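The 73% and 50% figures can be rechecked directly from the four cells of the cross table. The short Python calculation below does that and also derives precision, which the post does not quote; the variable names are mine.

```python
# Counts copied from the cross table (rows = actual, columns = predicted).
true_negative = 57   # actual no,  predicted no
false_positive = 11  # actual no,  predicted yes
false_negative = 16  # actual yes, predicted no
true_positive = 16   # actual yes, predicted yes

total = true_negative + false_positive + false_negative + true_positive
accuracy = (true_positive + true_negative) / total
recall = true_positive / (true_positive + false_negative)      # share of real defaults caught
precision = true_positive / (true_positive + false_positive)   # share of predicted defaults that were real

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
# prints: accuracy=0.73 recall=0.50 precision=0.59
```

For a default model, recall on the "yes" class is usually the number to watch, since a missed default tends to cost more than a wrongly flagged good customer.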

Drawing Flashy Plots in R

Posted on 2016-08-28

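Come on in. When it comes to plotting in R, ggplot2 has to be mentioned; it is hailed as the most polished plotting system in R. I sincerely admire the thinking and philosophy behind it. When I read "Elegant Graphics for Data Analysis" I was already in considerable awe of its author, Wickham. Although I found the book rather disorganised, it gave me my first sense of how powerful ggplot2 is, and using it continually since then has given me a deeper understanding of its essence and ideas.
To make my plots flashier, I reached for RColorBrewer without hesitation and kept tweaking until the figure really caught the eye.
Most people know RColorBrewer, but perhaps not its details. The package contains three kinds of palettes: sequential, diverging and qualitative. Sequential palettes suit ordered data. Diverging palettes suit data whose extremes matter and can emphasise the contrast between high and low values. Qualitative palettes use vivid, high-contrast colours, so they suit categorical variables. brewer.pal.info lists the details, display.brewer.pal shows a palette's colours, and, most importantly, the palettes can be combined with colorRamp/colorRampPalette.
scale_color_manual is then used to set the palette.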
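The reason combining a Brewer palette with colorRampPalette is so useful is that a palette only has a handful of colours, while a plot may need many more; a colour ramp fills the gaps by interpolating between the anchors. Here is a minimal sketch of that idea in Python, assuming plain linear interpolation in RGB space. The function names are hypothetical, it expects at least two anchor colours, and it is not a reimplementation of R's function, so exact colour values may differ from R's.

```python
def hex_to_rgb(color):
    """'#RRGGBB' -> (r, g, b) integers."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))

def rgb_to_hex(rgb):
    """(r, g, b) integers -> '#RRGGBB'."""
    return "#{:02X}{:02X}{:02X}".format(*rgb)

def color_ramp(anchors, n):
    """Return n colours spread evenly along the path through the anchor colours."""
    if n == 1:
        return [anchors[0]]
    points = [hex_to_rgb(c) for c in anchors]
    ramp = []
    for i in range(n):
        # position along the anchor list, from 0 to len(points) - 1
        pos = i * (len(points) - 1) / (n - 1)
        lo = min(int(pos), len(points) - 2)  # index of the anchor on the left
        frac = pos - lo                      # how far we are towards the next anchor
        rgb = tuple(round(a + (b - a) * frac)
                    for a, b in zip(points[lo], points[lo + 1]))
        ramp.append(rgb_to_hex(rgb))
    return ramp

# Stretch a three-colour palette (arbitrary example colours) to seven colours.
print(color_ramp(["#000000", "#FF0000", "#FFFFFF"], 7))
```

The first and last colours of the ramp are always the first and last anchors; everything in between is blended from the two nearest anchors.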

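This plot uses the Paired palette from the qualitative group.
A rose chart is another option, with even more dazzling colours and shapes. I'll post flashier plots when I find the time!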

Word Clouds: A Fun Way to Present Data

Posted on 2016-08-28

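About word clouds in R
Lang Dawei's R package has recently been released. As a new R package, and as a word-cloud solution in particular, it is enough to excite R fans, word-cloud enthusiasts, and perhaps the jokesters too.
wordcloud2 has been very well received since it appeared and has been called the best word-cloud solution currently available.
Now to cut the package open and look at how it works. It is built on the wordcloud2.js library. While calling that library, it uses the gaps between words to place the data. One new feature is that a word cloud can be customised from an image or from text, which means word clouds can take many different forms.
First, the word cloud I made on first getting my hands on the package: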

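This one was produced with the core function wordcloud2(), after adjusting the background a little.

This one uses the package's new feature to produce a custom word cloud to my own taste. The anti-war symbol was drawn by hand in Sketch; I was impatient to try the new feature, and this simple, clearly structured symbol was the first thing that came to mind.
There are many situations where this new feature comes in handy. It made me think of custom word clouds for user profiles; profiles of fraudulent users in P2P finance anti-fraud work, in particular, could be done this way. This package is simply great!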

SQL Code for Measuring Turnaround Time

Posted on 2016-08-27

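Turnaround time across product lines

For the past couple of days I have been stuck without a good solution to the turnaround-time problem across our products. The cause is requirements, always requirements! My boss has taken optimisation of the whole product line to a new level: approval times across the industry keep falling, yet we cannot give up much on risk control. You cannot have both the fish and the bear's paw, that is, good MOB data and a shorter approval time. So here is the requirement: which stage's turnaround time should we optimise?

Hence the following code: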

select '2016-06-12' as 周期, kx.产品线, kx.初审, sum(kx.初审等待) 初审等待, sum(kx.初审处理) 初审处理, sum(时效3) 时效3, count(transport_id) 初审时效处理量
from (
select
transport_id, cs 初审, product as 产品, round(cs_dd,2) as 初审等待, round(cs_cl,2) as 初审处理, round(cs_dd + cs_cl,2) as 时效3,
-- map each product to its product line by the first three characters of its name
-- (entries longer than three characters, such as 'pos贷' and '企合消费金融', can never match)
(case
when substr(product,0,3) in ('新薪贷','精英贷','新薪宜','助业贷','线下金','线下信','新薪（','助业宜','MSE') then '城市信贷'
when substr(product,0,3) in ('网商贷','乐购分','宜学贷','pos贷','供应链','汽车金','宜车购','企合消费金融') then '渠道'
when substr(product,0,3) in ('线上精','线上码','新线上') then '宜人线上'
when substr(product,0,3) in ('线上瞬','瞬时贷','公积金') then 'k计划'
else product end) 产品线
from (select transport_id, sum(csdd) as cs_dd, sum(cscl) as cs_cl
from (select transport_id, process_node,
-- nodes 2-1 to 2-4 count as waiting time, node 2-0 as handling time
(case when process_node in ('2-1','2-2','2-3','2-4') then all_time else 0 end) as csdd, (case when process_node in ('2-0') then all_time else 0 end) as cscl

-- time1..time3 are fractions of a day, so multiplying by 24 gives hours
from (select transport_id, process_node, s_time, e_time, 24*(time1+time2+time3) as all_time
from (select transport_id, process_node, s_time, e_time,
-- time1: overlap of [s_time, e_time] with the 09:30-18:30 working window on 2016-03-30, in days (0.375 = 9 hours)
(case when to_date(e_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-03-30 09:30:00','yyyy-mm-dd hh24:mi:ss') then 0
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-03-30 18:30:00','yyyy-mm-dd hh24:mi:ss') then 0
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-03-30 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-03-30 18:30:00','yyyy-mm-dd hh24:mi:ss') then 0.375
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-03-30 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-03-30 18:30:00','yyyy-mm-dd hh24:mi:ss') then to_date(e_time,'yyyy-mm-dd hh24:mi:ss') - to_date(s_time,'yyyy-mm-dd hh24:mi:ss')
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-03-30 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-03-30 18:30:00','yyyy-mm-dd hh24:mi:ss') then to_date(e_time,'yyyy-mm-dd hh24:mi:ss') - to_date('2016-03-30 09:30:00','yyyy-mm-dd hh24:mi:ss')
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-03-30 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-03-30 18:30:00','yyyy-mm-dd hh24:mi:ss') then to_date('2016-03-30 18:30:00','yyyy-mm-dd hh24:mi:ss') - to_date(s_time,'yyyy-mm-dd hh24:mi:ss') else 0 end) time1,

-- time2: the same overlap for 2016-03-31
(case when to_date(e_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-03-31 09:30:00','yyyy-mm-dd hh24:mi:ss') then 0
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-03-31 18:30:00','yyyy-mm-dd hh24:mi:ss') then 0
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-03-31 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-03-31 18:30:00','yyyy-mm-dd hh24:mi:ss') then 0.375
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-03-31 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-03-31 18:30:00','yyyy-mm-dd hh24:mi:ss') then to_date(e_time,'yyyy-mm-dd hh24:mi:ss') - to_date(s_time,'yyyy-mm-dd hh24:mi:ss')
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-03-31 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-03-31 18:30:00','yyyy-mm-dd hh24:mi:ss') then to_date(e_time,'yyyy-mm-dd hh24:mi:ss') - to_date('2016-03-31 09:30:00','yyyy-mm-dd hh24:mi:ss')
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-03-31 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-03-31 18:30:00','yyyy-mm-dd hh24:mi:ss') then to_date('2016-03-31 18:30:00','yyyy-mm-dd hh24:mi:ss') - to_date(s_time,'yyyy-mm-dd hh24:mi:ss') else 0 end) time2,

-- time3: the same overlap for 2016-04-01
(case when to_date(e_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-04-01 09:30:00','yyyy-mm-dd hh24:mi:ss') then 0
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-04-01 18:30:00','yyyy-mm-dd hh24:mi:ss') then 0
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-04-01 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-04-01 18:30:00','yyyy-mm-dd hh24:mi:ss') then 0.375
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-04-01 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-04-01 18:30:00','yyyy-mm-dd hh24:mi:ss') then to_date(e_time,'yyyy-mm-dd hh24:mi:ss') - to_date(s_time,'yyyy-mm-dd hh24:mi:ss')
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-04-01 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') <= to_date('2016-04-01 18:30:00','yyyy-mm-dd hh24:mi:ss') then to_date(e_time,'yyyy-mm-dd hh24:mi:ss') - to_date('2016-04-01 09:30:00','yyyy-mm-dd hh24:mi:ss')
  when to_date(s_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-04-01 09:30:00','yyyy-mm-dd hh24:mi:ss') and to_date(e_time,'yyyy-mm-dd hh24:mi:ss') >= to_date('2016-04-01 18:30:00','yyyy-mm-dd hh24:mi:ss') then to_date('2016-04-01 18:30:00','yyyy-mm-dd hh24:mi:ss') - to_date(s_time,'yyyy-mm-dd hh24:mi:ss') else 0 end) time3

from (select transport_id from
( select transport_id, max(decesion_id) as decision_id from clic_sele_v.v_xg_tc_busi_decision
where to_date(inspection_time,'yyyy-mm-dd hh24:mi:ss') > to_date('2016-06-08 17:30:01','yyyy-mm-dd hh24:mi:ss') and to_date(inspection_time,'yyyy-mm-dd hh24:mi:ss') [...] to_date('2016-06-08 17:30:01','yyyy-mm-dd hh24:mi:ss') and to_date(inspection_time,'yyyy-mm-dd hh24:mi:ss') [...] flow_log_id group by transport_id )
natural left join (select flow_log_id as n_log_id, process_node from clic_sele_v.v_xg_tc_flow_log ) )
natural left join ( select transport_id, min(flow_log_id) as flow_2 from clic_sele_v.v_xg_tc_flow_log where process_node in ('5-1','6-1','1-2','15-3') group by transport_id)
where process_node in ('11-1','11-2','20-1') and flow_2 is null )
natural left join

-- pair each flow-log row with the start time of the row that follows it (s_time2)
(select transport_id, process_node, s_time1 as s_time,
case when s_time1>=e_time1 and e_time1 is not null then s_time2
else e_time1 end e_time
from ( select rownum+1 rn, transport_id, process_node, s_time1, e_time1
from ( select
transport_id, process_node,
nvl(start_time, create_date) s_time1,
nvl(end_time,assign_time) e_time1
from clic_sele_v.v_xg_tc_flow_log
where to_date(create_date,'yyyy-mm-dd hh24:mi:ss') > to_date('2015-10-01 00:00:01','yyyy-mm-dd hh24:mi:ss')
order by transport_id, flow_log_id )
)
natural left join ( select *
from ( select rownum rn, s_time2
from
( select
nvl(start_time, create_date) s_time2
from clic_sele_v.v_xg_tc_flow_log
where to_date(create_date,'yyyy-mm-dd hh24:mi:ss') > to_date('2015-10-01 00:00:01','yyyy-mm-dd hh24:mi:ss')
order by transport_id, flow_log_id )
)
where rn!=1
)
)
))) group by transport_id )
natural left join ( select transport_id, product_type, submit_dept_no as dept_id from clic_sele_v.v_xg_tc_bs_transport )
natural left join ( select dept_id, dept_name from clic_sele_v.v_xg_TC_BS_DEPARTMENT )
natural left join ( select system_id as product_type, remark as product from clic_sele_v.v_xg_s_data_dic where system_type='COMMON_PRODUCT_TYPE')

natural left join
(select transport_id, beg_jude_name as cs
from
(select transport_id, inspection_man from clic_sele_v.v_xg_tc_busi_decision
where decesion_id in
( select max(decesion_id) as decision_id from clic_sele_v.v_xg_tc_busi_decision
where data_source='2' and is_decision='1'
group by transport_id ))
natural left join
(select distinct user_code as inspection_man, user_name as beg_jude_name
from clic_sele_v.v_xg_tc_user))
) kx
group by kx.产品线, kx.初审
order by kx.产品线

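I have to say that although it is ugly and long, it covers, and I do mean covers, every case in the workflow: anti-fraud submissions, second-round negotiation, rollbacks and so on are all taken into account. It runs in under three minutes and works for every product line. I feel my logic has reached a new level! After all, this really did come out of sparks struck with the front-line reviewers and the system engineers.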
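The heart of the query is the three long case expressions: each one measures how much of a step's [s_time, e_time] interval falls inside the 09:30 to 18:30 working window of one calendar day (0.375 of a day is the full 9 hours), and the results are summed and multiplied by 24 to get hours. The same rule can be written much more compactly as a clipped interval overlap. The Python sketch below is my own restatement of that logic, not the production code; the function name and the sample timestamps are invented.

```python
from datetime import date, datetime, time, timedelta

WORK_START = time(9, 30)
WORK_END = time(18, 30)

def working_hours(start, end, days):
    """Hours of [start, end] that fall inside 09:30-18:30 on the given days."""
    total = timedelta(0)
    for day in days:
        window_start = datetime.combine(day, WORK_START)
        window_end = datetime.combine(day, WORK_END)
        # clip the interval to the day's working window
        overlap_start = max(start, window_start)
        overlap_end = min(end, window_end)
        if overlap_end > overlap_start:
            total += overlap_end - overlap_start
    return total.total_seconds() / 3600

days = [date(2016, 3, 30), date(2016, 3, 31), date(2016, 4, 1)]

# A step that starts late on the 30th and ends the next morning:
# 17:00-18:30 on day one plus 09:30-10:30 on day two.
start = datetime(2016, 3, 30, 17, 0)
end = datetime(2016, 3, 31, 10, 30)
print(working_hours(start, end, days))  # prints 2.5
```

Writing it as max/min clipping also makes it easy to extend the window list beyond three days, which the hand-written case blocks cannot do without more copy and paste.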

Partial Prototype

Posted on 2016-08-23

结游: Prototype Mockups (Partial)

Prototype

Field Summary

Posted on 2016-08-22

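Summary of database field issues

Issues found:

Database	Field	Table	Issue
db4g	recruitment_date (hire date)	Borrower table (TC_MORTGAGOR)	Some January 2016 records are malformed and some are empty, e.g. 0015-08-31, so the field cannot be used in calculations
db4g	month_income (assessed monthly income)	Income statement table (TC_SALARY)	Partly missing; most of the gaps are rejected 宜人贷 products, although applications rejected over bank-statement problems do have statements recorded
db4g	content_type (social insurance / housing fund type); month_payment_base (contribution base)	Social insurance payment certificate table (TC_IDENTIFI_WELFARE)	Contribution base partly missing; most of the gaps are rejected 宜人贷 products where it was never entered
db4g	charge_against_liabilities	Credit summary table	Partly missing; most of the gaps are rejected 宜人贷 products, and some customers simply have no credit liabilities
db4g	tc_loan_summary	Credit summary table	The database displays the table name as loan_summary; in actual use the C must be dropped from the table name
bi	apply_id (varchar2)	bi_ods.v_CE_borrow	Field types differ; convert with cast() and make the types consistent before a union
bi	clic_apply_id (number)	bi_ods.yrd_audit	Field types differ; convert with cast() and make the types consistent before a union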
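The last two rows describe the same key stored as varchar2 in one table and as a number in the other. In SQL the fix is cast() before the union; the Python sketch below shows the same idea in miniature, with made-up ids and a hypothetical normalise_id helper. Python happens to accept the mixed types silently, whereas a database will usually reject or implicitly convert them, but the consequence of not aligning the types is easy to see here.

```python
def normalise_id(value):
    """Cast an application id to a string so ids from both sources compare equal."""
    return str(value)

borrow_ids = ["1001", "1002", "1003"]  # hypothetical varchar2-style apply_id values
audit_ids = [1002, 1003, 1004]         # hypothetical number-style clic_apply_id values

# Without a cast, "1002" and 1002 are different values, so nothing is de-duplicated.
naive_union = set(borrow_ids) | set(audit_ids)

# After casting both sides to one type, the shared ids collapse as intended.
clean_union = {normalise_id(v) for v in borrow_ids} | {normalise_id(v) for v in audit_ids}

print(len(naive_union), len(clean_union))  # prints 6 4
```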

stay up all night

Posted on 2016-08-03

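A few words

Last night was the latest I have ever worked overtime; it was already half past two when I got home. Yet I found it unusually hard to fall asleep. Normally, if I am still awake at that hour, it is because I am excited about having written something new, but this time I was tossing and turning because my code would not run: code that worked perfectly last month suddenly failed to match some fields that should have been there. When I hit the problem again at noon today, I tentatively concluded that it was not the code but the SAS environment inside the virtual machine. That left me with an even deeper "impression" of SAS, though I suppose the virtual machine is partly to blame too. Thinking about the road ahead, I am actually quite excited; really, I am just waiting to leave.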