1. What is XGBoost

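Any discussion of XGBoost has to start with GBDT (Gradient Boosting Decision Tree). XGBoost is essentially still a GBDT, but it strives to push speed and efficiency to the extreme, hence the name X (eXtreme) Gradient Boosting. As mentioned earlier, both are boosting methods.
1.1 Definition of the XGBoost tree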


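In this way, two trees, tree1 and tree2, are trained. As with GBDT, the conclusions of the two trees are added together to give the final conclusion, so the child's predicted score is the sum of the scores of the nodes the child falls into in the two trees: 2 + 0.9 = 2.9. The grandfather's predicted score is obtained the same way: -1 + (-0.9) = -1.9. The details are shown in the figure below: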



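[*]The L indicated by the red arrow is the loss function (for example, the squared loss $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$).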






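So how do we choose which f to add in each round? The answer is very direct: pick the f that reduces our objective function as much as possible. The objective can then be approximated with a Taylor expansion.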
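The Taylor approximation mentioned above can be written out as follows. This is a sketch in the standard XGBoost notation, which is an assumption here because the original formulas are not reproduced in this section: $\hat{y}_i^{(t-1)}$ is the prediction after $t-1$ rounds, $f_t$ is the tree added in round $t$, $l$ is the loss, and $\Omega$ is the regularization term.

$$
\begin{aligned}
\mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} l\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t) \\
&\approx \sum_{i=1}^{n} \Bigl[\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \Bigr] + \Omega(f_t),
\end{aligned}
$$

$$
g_i = \frac{\partial\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \bigl(\hat{y}_i^{(t-1)}\bigr)^{2}} .
$$

The term $l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)$ does not depend on $f_t$, so choosing $f_t$ only requires the per-sample values $g_i$ and $h_i$.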

1.2 Regularization term: tree complexity



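Let us look again at the XGBoost objective function (the loss function captures the training error, and the regularization term defines the complexity):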
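For reference, the objective is usually written as below. This is the standard formulation rather than a reproduction of the original figure, so the symbols are assumptions: $K$ is the number of trees, $T$ the number of leaves of a tree, $w_j$ the weight of leaf $j$, and $\gamma$, $\lambda$ the regularization coefficients.

$$
\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma\, T + \tfrac{1}{2}\, \lambda \sum_{j=1}^{T} w_j^{2} .
$$

The first sum is the training error and the second is the complexity penalty: more leaves or larger leaf weights make a tree more expensive.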


1.3 How the tree should grow

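Keep enumerating different tree structures, use a scoring function to find the tree with the best structure, add it to the model, and repeat. This search uses a greedy algorithm: choose a feature to split on and compute the minimum of the loss function, then choose another feature to split on and get another minimum. Once all candidates have been enumerated, pick the one that works best and split the tree on it, which gives you a sapling.
1.4 How to stop growing the tree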
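The greedy search described above can be sketched for a single feature as follows. This is a minimal illustration, not XGBoost's actual implementation: the names `Sample`, `SplitSearch` and `bestSplit` are hypothetical, and the gain formula is the standard second-order one, with `g` and `h` the first and second derivatives of the loss for each sample.

```scala
// Hypothetical types and names, for illustration only.
// g and h are the per-sample first and second derivatives of the loss.
final case class Sample(feature: Double, g: Double, h: Double)

object SplitSearch {
  // Structure score of a node with gradient sum G and hessian sum H.
  def score(G: Double, H: Double, lambda: Double): Double = G * G / (H + lambda)

  // Gain of splitting one node into a left and a right child.
  def gain(gl: Double, hl: Double, gr: Double, hr: Double,
           lambda: Double, gamma: Double): Double =
    0.5 * (score(gl, hl, lambda) + score(gr, hr, lambda) -
      score(gl + gr, hl + hr, lambda)) - gamma

  // Scan the sorted samples once and return (threshold, gain) of the best split.
  def bestSplit(samples: Seq[Sample], lambda: Double, gamma: Double): Option[(Double, Double)] = {
    val sorted = samples.sortBy(_.feature)
    val gTotal = sorted.map(_.g).sum
    val hTotal = sorted.map(_.h).sum
    var gl = 0.0
    var hl = 0.0
    var best: Option[(Double, Double)] = None
    for (i <- 0 until sorted.length - 1) {
      gl += sorted(i).g
      hl += sorted(i).h
      // Only split between two distinct feature values.
      if (sorted(i).feature < sorted(i + 1).feature) {
        val candidate = gain(gl, hl, gTotal - gl, hTotal - hl, lambda, gamma)
        val threshold = (sorted(i).feature + sorted(i + 1).feature) / 2
        if (best.forall(_._2 < candidate)) best = Some((threshold, candidate))
      }
    }
    best
  }
}
```

A full tree builder would repeat this scan over every feature, keep the best split overall, and recurse into the two children.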


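[*]When the gain from a candidate split is below a set threshold, we can ignore that split, so a split is not kept merely because it changes the loss; the gain has to exceed the threshold. This works somewhat like pre-pruning. The threshold parameter is γ (the coefficient of the number of leaf nodes T in the regularization term).
[*]Tree building stops when the sum of sample weights falls below a set threshold. This involves a hyperparameter, the minimum sum of sample weights min_child_weight, which is similar to GBM's min_samples_leaf parameter but not exactly the same. Roughly speaking, if a leaf node holds too few samples, splitting stops, again to prevent overfitting.
2. How XGBoost differs from GBDT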
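The two stopping rules above can be combined into one predicate. This is a minimal sketch under stated assumptions: `StopRules` and `acceptSplit` are hypothetical names, `rawGain` is the gain of a split before the γ penalty is subtracted, and the sample weight sum of a child is taken to be its sum of second derivatives.

```scala
object StopRules {
  // rawGain: loss reduction of the candidate split, before the gamma penalty.
  // hLeft, hRight: sums of second derivatives (sample weights) in each child.
  // A split is kept only if it beats gamma and both children are heavy enough.
  def acceptSplit(rawGain: Double, hLeft: Double, hRight: Double,
                  gamma: Double, minChildWeight: Double): Boolean =
    rawGain > gamma && hLeft >= minChildWeight && hRight >= minChildWeight
}
```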


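[*]When CART is used as the base learner, XGBoost explicitly adds a regularization term to control model complexity, which helps prevent overfitting and improves the model's ability to generalize.
[*]GBDT uses only the first-order derivative of the cost function during training, whereas XGBoost applies a second-order Taylor expansion to the cost function and can use both first- and second-order derivatives.
[*]Traditional GBDT uses CART as its base learner, whereas XGBoost supports several types of base learner, such as linear models.
[*]Traditional GBDT uses all of the data in every iteration, whereas XGBoost adopts a strategy similar to random forests and supports sampling the data.
[*]Traditional GBDT has no built-in handling of missing values, whereas XGBoost can automatically learn a strategy for handling them.
3. Why does XGBoost use a Taylor expansion, and what is the advantage?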

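XGBoost uses both first- and second-order partial derivatives, and the second derivative helps gradient descent converge faster and more accurately. Using a Taylor expansion to express the objective in terms of derivatives with respect to the prediction means that the leaf-split optimization can be computed from the input data alone, without fixing the specific form of the loss function. In essence, this separates the choice of loss function from the optimization of the model and the selection of its parameters. This decoupling widens XGBoost's applicability: the loss function can be chosen as needed, so it can be used for classification as well as regression.
4. Code implementation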
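The decoupling described above can be illustrated with a small sketch. The names `Loss`, `SquaredLoss`, `LogisticLoss` and `LeafWeight` are hypothetical and this is not XGBoost's API; the leaf weight formula w* = -G / (H + λ) is the standard result of minimizing the second-order objective. The point is that the leaf computation only sees first and second derivatives, never the formula of the loss itself.

```scala
// A loss only exposes its first and second derivatives with respect to
// the prediction; the tree code below never needs its closed form.
trait Loss {
  def grad(y: Double, pred: Double): Double
  def hess(y: Double, pred: Double): Double
}

// Squared loss 1/2 * (y - pred)^2, for regression.
object SquaredLoss extends Loss {
  def grad(y: Double, pred: Double): Double = pred - y
  def hess(y: Double, pred: Double): Double = 1.0
}

// Logistic loss on a raw score, for binary labels y in {0, 1}.
object LogisticLoss extends Loss {
  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))
  def grad(y: Double, pred: Double): Double = sigmoid(pred) - y
  def hess(y: Double, pred: Double): Double = {
    val p = sigmoid(pred)
    p * (1.0 - p)
  }
}

object LeafWeight {
  // Optimal weight of a leaf holding the given samples: w* = -G / (H + lambda).
  def optimal(loss: Loss, ys: Seq[Double], preds: Seq[Double], lambda: Double): Double = {
    val pairs = ys.zip(preds)
    val g = pairs.map { case (y, p) => loss.grad(y, p) }.sum
    val h = pairs.map { case (y, p) => loss.hess(y, p) }.sum
    -g / (h + lambda)
  }
}
```

Switching from regression to classification only means passing a different `Loss` object; `LeafWeight.optimal` stays unchanged.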

Big data and machine learning both deal with data, so it's important to keep the data in the system correct. Inaccurate data not only reduces the efficiency of the system, but also leads to misleading insights. One of the big steps toward ensuring the correctness of data is data quality and validation. With an increasing volume of data, and the noise that goes along with it, new methods and checks are added every day to ensure this data's quality. Since the amount of data is huge, one more thing to consider is how to process all of these checks and validations quickly; i.e., with a system that can go through each and every ingested record in a highly distributed way. This post covers some examples of data quality and validation checks and shows how easy it is to ensure data quality programmatically with the help of Apache Spark and Scala.
Data accuracy: This refers to the closeness of the results of observations to the true values, or to values accepted as being true.

[*]Null Value: Records that contain a null value. For example: male/female/null


[*]Specific Value: company ID


Schema Validation: Every batch of data should follow the same column names and data types.

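for (elem <- sampledataframe.schema) {
  // expectedDataType: the org.apache.spark.sql.types.DataType every column must have (assumed to be defined)
  if (elem.dataType != expectedDataType) {
    println(s"Schema error: column ${elem.name} has type ${elem.dataType}")
  }
}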

Column Value Duplicates (like duplicate email in records)

val dataframe1 = sampledataframe.groupBy("columnname").count()
val dataframe2 = dataframe1.filter("count = 1")
// Number of distinct values that occur more than once in the column
println("No. of duplicated values: " + (dataframe1.count() - dataframe2.count()))

Uniqueness Check: Records should be unique with respect to a given column.
This is similar to the duplicate check.

val dataframe1 = sampledataframe.groupBy("columnname").count()
dataframe1.filter("count = 1").count() // this will give unique count.

Accuracy Check: Regular expressions can be used. For example, we can check that email IDs contain @.
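A regular-expression check of this kind might look like the sketch below. It uses plain Scala collections so it runs without a cluster; `AccuracyCheck` and its pattern are illustrative assumptions, and the pattern is deliberately loose rather than a full email validator. On a DataFrame, the same pattern could be applied with the `rlike` method of a column.

```scala
object AccuracyCheck {
  // Loose pattern: some text, an @, some text, a dot, some text.
  private val emailPattern = "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$".r

  def looksLikeEmail(value: String): Boolean =
    value != null && emailPattern.pattern.matcher(value).matches()

  // Returns the values that fail the check, nulls included.
  def invalid(values: Seq[String]): Seq[String] =
    values.filterNot(looksLikeEmail)
}
```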


Data currency: How up-to-date is your data? Here the assumption is that data is coming in on a daily basis and is then checked and timestamped.
This list can go on and on, but the good thing about this approach based on Spark and Scala is that a lot can be achieved with little code, even on a huge amount of data.
Sometimes, a system may have specific requirements related to who is consuming the data and in what form, and the consumer may have assumptions about the data.
Data usability: Consumer applications may apply certain expectations like:

[*]column1.value should not be equal to column2.value
[*]column3.value should always be column1.value + column2.value
[*]No value in column x should appear more than x% of the time

val arr = Array("ColumnName1", "ColumnName2", "ColumnName3")
// Items that appear in at least 40% of the rows of each column (may include false positives)
val freq = sampledataframe.stat.freqItems(arr, 0.4)
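The first two expectations in the list above are row-level rules. A minimal sketch on plain Scala collections follows; `Row3` and `UsabilityRules` are hypothetical names, and the column types are assumed to be integers.

```scala
// Hypothetical record with the three columns named in the rules.
final case class Row3(column1: Int, column2: Int, column3: Int)

object UsabilityRules {
  // A row is a violation if column1 equals column2,
  // or if column3 is not the sum of column1 and column2.
  def violations(rows: Seq[Row3]): Seq[Row3] =
    rows.filter(r => r.column1 == r.column2 || r.column3 != r.column1 + r.column2)
}
```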

While these are considered basic validations, we also have some advanced level checks to ensure data quality, like:

[*]Anomaly Detection: This includes two major points:

[*]If the dimension is given, like a time-based anomaly. This means that within any timeframe (slice period), the number of records should not exceed the average by more than x%. To achieve this with Spark:

[*]Let's assume the slice period is 1 minute.
[*]First, the timestamp column needs to be filtered/formatted such that the unit representation of the timestamp is a minute. This will produce duplicates, but that should not be an issue.
[*]Next, use groupBy, like so: sampledataframe.groupBy("timestamp").count().
[*]Get the average of that count and also find the slice period (if it exists), which has x% more records than the average.
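The steps above can be sketched on plain Scala collections as follows. `TimeSliceAnomaly` is a hypothetical name, timestamps are assumed to be epoch milliseconds, and the slice period is one minute as in the example.

```scala
object TimeSliceAnomaly {
  // timestampsMs: event times in epoch milliseconds (assumed input format).
  // Returns the minute buckets whose record count exceeds the average
  // count per bucket by more than pct percent.
  def anomalousMinutes(timestampsMs: Seq[Long], pct: Double): Map[Long, Int] = {
    // Truncate each timestamp to its minute, then count records per minute.
    val counts = timestampsMs.groupBy(_ / 60000L).map { case (minute, ts) => minute -> ts.size }
    if (counts.isEmpty) Map.empty[Long, Int]
    else {
      val avg = counts.values.sum.toDouble / counts.size
      counts.filter { case (_, c) => c > avg * (1 + pct / 100.0) }
    }
  }
}
```

Minutes with no records at all do not appear in the grouping, so the average is taken over non-empty slices only.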


[*]The record should follow a certain order. For example, within a day the records for a particular consumer should start with impressions, clicks, landing page, cart, and end with purchases. There may be partial records, but it should follow a general order. To check this with Spark:

[*]Run the order check for the group.
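An order check of this kind might be sketched as below, on plain Scala collections. The names `OrderCheck` and `Event` and the stage labels are assumptions; partial records are allowed, so the check only requires that stages never go backwards in time. An event kind outside the funnel would throw here, which a real check would need to handle.

```scala
object OrderCheck {
  // Expected funnel order, following the example in the text.
  val funnel: Seq[String] = Seq("impression", "click", "landing_page", "cart", "purchase")
  private val rank: Map[String, Int] = funnel.zipWithIndex.toMap

  final case class Event(consumer: String, timestamp: Long, kind: String)

  // Events are in order if, sorted by time, the stage rank never decreases.
  def inOrder(events: Seq[Event]): Boolean = {
    val ranks = events.sortBy(_.timestamp).map(e => rank(e.kind))
    ranks.zip(ranks.drop(1)).forall { case (a, b) => a <= b }
  }

  // Group by consumer and report the consumers whose events break the order.
  def consumersOutOfOrder(events: Seq[Event]): Set[String] =
    events.groupBy(_.consumer).collect { case (c, es) if !inOrder(es) => c }.toSet
}
```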

[*]Circular dependency: Let me explain this with an example.

[*]Suppose two columns are related such that column A => column B, and the records look like:

ID  Name   Father's Name
1   Alpha  Bravo
2   Bravo  Gamma
3   Gamma  Alpha

[*]If the consuming application tries to print the family hierarchy, it may fall into a loop.
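One way to detect such a loop before handing the data to a consumer is to follow the chain and remember the names already visited. The sketch below uses hypothetical names (`CycleCheck`, `parentOf`) and assumes the two columns have been collected into a map from Name to Father's Name.

```scala
import scala.annotation.tailrec

object CycleCheck {
  // parentOf maps each name to its parent's name.
  // Follows the chain from start; returns true if a name is visited twice.
  def hasCycle(parentOf: Map[String, String], start: String): Boolean = {
    @tailrec
    def walk(current: String, seen: Set[String]): Boolean =
      if (seen.contains(current)) true
      else parentOf.get(current) match {
        case Some(parent) => walk(parent, seen + current)
        case None => false
      }
    walk(start, Set.empty[String])
  }
}
```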

[*]Failure Trend

[*]Consider that data is coming into the system every day. Let's assume it's behavioral/touchpoint data. For simplicity, let's call each day's data a 'batch.' If we get exactly the same set of failures in every batch, then there must be a failure trend running across batches.
[*]If the failures keep coming from the same set of email_id values (email_id being one column), then it might be a symptom of bot behavior.
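A failure trend of this kind can be found by intersecting the failing identifiers of each batch. This is a minimal sketch with hypothetical names (`FailureTrend`, `failuresPerBatch`), assuming the failing email_id values of each batch have already been collected into a set.

```scala
object FailureTrend {
  // failuresPerBatch: for each batch, the email_ids whose records failed validation.
  // Returns the ids that fail in every batch; empty if there are no batches.
  def recurring(failuresPerBatch: Seq[Set[String]]): Set[String] =
    failuresPerBatch.reduceOption(_ intersect _).getOrElse(Set.empty[String])
}
```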

[*]Data Bias: This means a consistent shift in the graph. Like:

[*]If 30 minutes is being added to the timestamp, then all the records will always carry this implicit 30-minute bias. So, if a prediction algorithm uses this data, the bias will affect its results.
[*]If the algorithm that produces this data has a learning bias, then it produces more default values for one set of data than for another. For example, based on buying behaviour, it can predict the wrong gender.

Bot Behaviour: Usually, a bot's behaviour is something like:

[*]It generates records with the same set of unique identifiers, such as the same set of email_ids.
[*]It generates website traffic at any particular time. This is a time-based anomaly.
[*]It generates records in a defined order: ordering checks across data batches.
