# Xgboost Sklearn

**By Edwin Lisowski, CTO at Addepto**

Instead of only comparing XGBoost and Random Forest, in this post we will try to explain how to use these two very popular approaches with Bayesian optimisation, and what the main pros and cons of these models are. XGBoost (XGB) and Random Forest (RF) are both ensemble learning methods that predict (for classification or regression) by combining the outputs of individual decision trees (we assume tree-based XGB and RF).

**Let’s dive deeper into comparison – XGBoost vs Random Forest**

The XGBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the XGBClassifier and XGBRegressor classes. Let’s take a closer look at each in turn.

Let’s use scikit-learn to train one of the best classifiers on the market. The theoretical background of the classifier is out of the scope of this tutorial; keep in mind, though, that XGBoost has won lots of Kaggle competitions.

**XGBoost or Gradient Boosting**

XGBoost builds one decision tree at a time. Each new tree corrects the errors made by the previously trained trees.

**Example of XGBoost application**

At Addepto we use XGBoost models to solve anomaly detection problems, e.g. in a supervised learning approach. In such cases XGB is very helpful because the data sets are often highly imbalanced. Examples of such data sets are user/consumer transactions, energy consumption, or user behaviour in a mobile app.

**Pros**

Since boosted trees are derived by optimizing an objective function, XGB can basically be used to solve almost any objective for which we can write out a gradient. This includes things like ranking and Poisson regression, which are harder to achieve with RF.

**Cons**

An XGB model is more sensitive to overfitting if the data is noisy. Training generally takes longer because trees are built sequentially. GBMs are also harder to tune than RF: there are typically three parameters (number of trees, depth of trees, and learning rate), and each tree built is generally shallow.

**Random Forest**

Random Forest (RF) trains each tree independently, using a random sample of the data. This randomness makes the model more robust than a single decision tree, so RF is less likely to overfit on the training data.

**Example of Random Forest application**

The random forest dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.[1] The Random Forest model is very attractive for these kinds of applications in the following two cases:

- Our goal is to achieve high predictive accuracy for a high-dimensional problem with strongly correlated features.
- Our data set is very noisy and contains a lot of missing values, e.g., some of the attributes are categorical or semi-continuous.

**Pros**

Model tuning in Random Forest is much easier than in the case of XGBoost. In RF there are two main parameters: the number of features to select at each node and the number of decision trees. Also, RF is harder to overfit than XGB.


**Cons**

The main limitation of the Random Forest algorithm is that a large number of trees can make the algorithm slow for real-time prediction. For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels.

Bayesian optimization is a technique for optimizing functions that are expensive to evaluate.[2] It builds a posterior distribution for the objective function, quantifies the uncertainty in that distribution using Gaussian process regression, and then uses an acquisition function to decide where to sample next. Bayesian optimization focuses on solving the problem:

*max_{x∈A} f(x)*

In most successful applications, the dimension of the hyperparameters (*x∈R^{d}*) is d < 20.

Typically the set A is a hyper-rectangle (*x∈R^{d}: a_{i} ≤ x_{i} ≤ b_{i}*). The objective function is continuous, which is required for modeling with Gaussian process regression. It also lacks special structure like concavity or linearity, which makes techniques that leverage such structure futile. Bayesian optimization consists of two main components: a Bayesian statistical model for modeling the objective function, and an acquisition function for deciding where to sample next.

After evaluating the objective according to an initial space-filling experimental design, these components are used iteratively to allocate the remainder of a budget of N evaluations, as shown below:

- Observe *f* at the initial points; set *n* to the number of initial observations
- While *n ≤ N* do:
    - Update the posterior probability distribution on *f* using all available data
    - Let *x_{n}* be a maximizer of the acquisition function
    - Observe *y_{n} = f(x_{n})*
    - Increment *n*
- End while
- Return a solution: the point evaluated with the largest *f(x)*


We can summarize this problem by saying that Bayesian optimization is designed for black-box derivative-free global optimization. It has become extremely popular for tuning hyperparameters in machine learning.

Below is a graphical summary of the whole optimization: Gaussian Process with posterior distribution along with observations and confidence interval, and Utility Function where the maximum value indicates the next sample point.

Thanks to the utility function, Bayesian optimization is much more efficient at tuning the parameters of machine learning algorithms than grid or random search techniques. It can effectively balance “exploration” and “exploitation” while searching for the global optimum.

To present Bayesian optimization in action, we use the BayesianOptimization [3] library, written in Python, to tune the hyperparameters of Random Forest and XGBoost classification algorithms. We need to install it via pip:

`pip install bayesian-optimization`

Now let’s train our model. First we import required libraries:

We define a function to run Bayesian optimization given data, function to optimize and its hyperparameters:

We define the function to optimize, which is a Random Forest classifier with the hyperparameters n_estimators, max_depth, and min_samples_split. Additionally, we use the mean cross-validation score on the given dataset:
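The original function is not shown; a sketch of such an objective (the hyperparameter names follow scikit-learn's `RandomForestClassifier`; the bounds are illustrative) might be:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rfc_optimization(X, y, cv_splits=3):
    """Return an objective (mean cross-validated ROC AUC) and bounds for Random Forest."""
    def objective(n_estimators, max_depth, min_samples_split):
        # The optimizer proposes floats, so integer-valued parameters are rounded.
        model = RandomForestClassifier(
            n_estimators=int(round(n_estimators)),
            max_depth=int(round(max_depth)),
            min_samples_split=int(round(min_samples_split)),
            n_jobs=-1,
            random_state=42,
        )
        return cross_val_score(model, X, y, cv=cv_splits, scoring="roc_auc").mean()

    bounds = {
        "n_estimators": (10, 1000),
        "max_depth": (1, 100),
        "min_samples_split": (2, 10),
    }
    return objective, bounds
```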

Analogously, we define a function and hyperparameters for the XGBoost classifier:

Now, based on the chosen classifier, we can optimize it and train the model:

As example data we use the view [dbo].[vTargetMail] from the AdventureWorksDW2017 SQL Server database, where based on personal data we need to predict whether a person will buy a bike. As a result of Bayesian optimization we present the consecutive function samples:

| iter | AUC    | max_depth | min_samples_split | n_estimators |
|------|--------|-----------|-------------------|--------------|
| 1    | 0.8549 | 45.88     | 6.099             | 34.82        |
| 2    | 0.8606 | 15.85     | 2.217             | 114.3        |
| 3    | 0.8612 | 47.42     | 8.694             | 306.0        |
| 4    | 0.8416 | 10.09     | 5.987             | 563.0        |
| 5    | 0.7188 | 4.538     | 7.332             | 766.7        |
| 6    | 0.8436 | 100.0     | 2.0               | 448.6        |
| 7    | 0.6529 | 1.012     | 2.213             | 315.6        |
| 8    | 0.8621 | 100.0     | 10.0              | 1e+03        |
| 9    | 0.8431 | 100.0     | 2.0               | 154.1        |
| 10   | 0.653  | 1.0       | 2.0               | 1e+03        |
| 11   | 0.8621 | 100.0     | 10.0              | 668.3        |
| 12   | 0.8437 | 100.0     | 2.0               | 867.3        |
| 13   | 0.637  | 1.0       | 10.0              | 10.0         |
| 14   | 0.8518 | 100.0     | 10.0              | 10.0         |
| 15   | 0.8435 | 100.0     | 2.0               | 317.6        |
| 16   | 0.8621 | 100.0     | 10.0              | 559.7        |
| 17   | 0.8612 | 89.86     | 10.0              | 82.96        |
| 18   | 0.8616 | 49.89     | 10.0              | 212.0        |
| 19   | 0.8622 | 100.0     | 10.0              | 771.7        |
| 20   | 0.8622 | 38.33     | 10.0              | 469.2        |
| 21   | 0.8621 | 39.43     | 10.0              | 638.6        |
| 22   | 0.8622 | 83.73     | 10.0              | 384.9        |
| 23   | 0.8621 | 100.0     | 10.0              | 936.1        |
| 24   | 0.8428 | 54.2      | 2.259             | 122.4        |
| 25   | 0.8617 | 99.62     | 9.856             | 254.8        |

As we can see, Bayesian optimization found the best parameters, giving a 0.8622 AUC score on the test dataset. This result could probably be improved given more samples to check. Our optimized Random Forest model has the ROC AUC curve presented below:

We presented a simple way of tuning hyperparameters in machine learning [4] using Bayesian optimization, which is a faster and more sophisticated method of finding optimal values than Grid or Random Search.

**Bio: Edwin Lisowski** is CTO at Addepto. He is an experienced advanced analytics consultant with a demonstrated history of working in the information technology and services industry. He is skilled in Predictive Modeling, Big Data, Cloud Computing and Advanced Analytics.

**Related:**

Xgboost is a powerful gradient boosting framework. It provides interfaces in many languages: Python, R, Java, C++, Julia, Perl, and Scala. In this post, I will show you how to save and load Xgboost models in Python. Xgboost provides several Python API types, which can be a source of confusion at the beginning of the Machine Learning journey. I will show different ways of saving and loading Xgboost models, and which one is the safest.

Useful links:

- Xgboost documentation: https://xgboost.readthedocs.io,
- Xgboost GitHub: https://github.com/dmlc/xgboost,
- Xgboost website: https://xgboost.ai/.

The Xgboost model can be trained in two ways:

- We can use the Python API that connects Python with the Xgboost internals. It is called the `Learning API` in the Xgboost documentation.
- Or we can use the Xgboost API that provides a scikit-learn interface, documented as the scikit-learn compatible API.

Depending on the way you use for training, the saving will be slightly different.

Let’s create the data:
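The data-creation snippet is missing from the source; a synthetic stand-in (the post's actual dataset is not shown here) keeps the rest of the walkthrough reproducible:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic regression data as a stand-in for the original dataset.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```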

## Xgboost Learning API

Let’s train Xgboost with learning API:

We will get the output like below:

We can see that 83 trees were trained. Let’s check the optimal tree number:

I will show you something that might surprise you. Let’s compute the predictions:

Again, let’s compute the predictions with an **additional** parameter *ntree_limit*:

You see the difference in the predicted values! By default, the `predict()` method does **not** use the optimal number of trees. You need to specify the number of trees yourself, by setting the *ntree_limit* parameter.

### Save the Xgboost Booster object

There are two methods that can cause confusion: `save_model()` and `dump_model()`.

For saving and loading the model, `save_model()` should be used. `dump_model()` is for model exporting, which should be used for further model interpretation, for example visualization.

OK, so we will use `save_model()`. The next thing to remember is the extension of the saved file. If it is `*.json`, the model will be saved in JSON format. Otherwise, it will be saved in Xgboost’s internal binary format.

Let’s check:

There is a difference in file size: the `model.json` file is `100.8 KB`, while the `model.txt` file is `57.9 KB` (much smaller).

Let’s load the model:

You can also load the model from the `model.txt` file; the result will be the same. And now the surprise: let’s check the optimal number of trees:

That’s right: the `best_ntree_limit` variable is not saved. You must be very careful with this API. Let’s take a look at the scikit-learn compatible API (it is much more user-friendly!).

## Xgboost with Scikit-learn API

Let’s train the Xgboost model with scikit-learn compatible API:

The output from training is the same as earlier, so I don’t post it here. Let’s check `predict()`:

… and `predict()` with *ntree_limit*:

They are the same! Nice. It is intuitive and works as expected.

Let’s save the model:

The `model_sklearn.json` file size is `103.3 KB` and the `model_sklearn.txt` size is `60.4 KB`.

To load the model:

Check the optimal number of trees:

The `best_ntree_limit` is saved!

## Conclusions

I recommend using the Xgboost Python API that is scikit-learn compatible. It is much simpler and more intuitive than the `Learning API`, and it behaves as expected. For saving and loading the model, you can use the `save_model()` and `load_model()` methods.

There is also an option to use `pickle.dump()` for saving the Xgboost model. It makes a memory snapshot and can be used to resume training. However, this method doesn’t guarantee backward compatibility between different versions, so for long-term storage `save_model()` should be used.

Xgboost is an amazing framework. However, its training may require a lot of coding (even with the scikit-learn compatible API). You might be interested in trying our open-source AutoML package: https://github.com/mljar/mljar-supervised. With MLJAR you can train Xgboost with two lines of code:

That’s all. Thank you!