This Week I Learned 3: XGBoost

I had heard about XGBoost while taking a few modules in school, so getting to learn this really popular library and apply it to real-world problems in my current internship is quite cool!

This is not a comprehensive walkthrough of the XGBoost library (the docs explain it much better than I can). But here are a few small things that were important for me to realise.

XGBoost

1. Dataset preparation (DMatrixes, and NaN values)

  1. XGBoost can load data from different formats (csv, numpy arrays, dataframes). We are recommended to convert the data to a DMatrix (XGBoost's own data structure, built for more efficient training).
    Personally, I didn’t use it because, as with numpy arrays, I couldn’t see the names of the features. So I stuck to dataframes. If speed is a priority, DMatrix should help.
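import xgboost as xgb

# non-dataframe way
# x_train / x_val are dataframes of all the features
# y_train / y_val are the labels
# (create_training_data stands in for your own train/validation split)
x_train, y_train, x_val, y_val = create_training_data(your_df)

# loading from a LibSVM-format text file
# (newer XGBoost releases may need 'train.svm.txt?format=libsvm')
dtrain = xgb.DMatrix('train.svm.txt')
# or converting df to DMatrix
dtrain = xgb.DMatrix(x_train, label = y_train)
dval = xgb.DMatrix(x_val, label = y_val)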
  2. I learned that XGBoost is quite robust: it handles missing values behind the scenes, so we don't need a lot of preprocessing on our side. I would definitely like to understand more about this, but for now I’m grateful that the library takes the imperfection of real-world data into account.
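As far as I understand it, the idea behind the missing-value handling is that every split learns a "default direction": rows where the feature is missing are tried on the left branch and on the right branch, and the branch that gives the better result is kept. The sketch below is not XGBoost's code. It is a small pure-Python illustration on a single feature using plain squared error, and the names `sse` and `best_default_direction` are made up for this example.

```python
import math

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_default_direction(x, y, threshold):
    """Try sending rows with a missing feature left, then right,
    and keep whichever direction gives the lower total squared error."""
    left = [yi for xi, yi in zip(x, y) if not math.isnan(xi) and xi < threshold]
    right = [yi for xi, yi in zip(x, y) if not math.isnan(xi) and xi >= threshold]
    missing = [yi for xi, yi in zip(x, y) if math.isnan(xi)]

    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"

nan = float("nan")
x = [1.0, 2.0, nan, 8.0, 9.0, nan]        # two rows have no feature value
y = [10.0, 11.0, 30.0, 29.0, 31.0, 28.0]  # their labels look like the "high" group
direction = best_default_direction(x, y, threshold=5.0)
print(direction)
```

Here the rows with missing values have labels close to the right-hand group, so sending them right gives the lower error.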

2. xgb.XGBRegressor.fit() vs xgb.train: The sklearn wrappers

To stay consistent with other machine learning libraries, XGBoost ships an sklearn wrapper for its models. As I used XGBoost for regression, I used the xgb.XGBRegressor wrapper. So these two models are equivalent:

The base implementation:

## setting params
param = {
  'objective': 'reg:squarederror'
  }
param['eval_metric'] = ['rmse', ]
# specify evallist to watch performance over iterations
evallist = [(dtrain, 'train'), (dval, 'validation')]

## training the model
num_round = 1000
bst = xgb.train(
  param
  , dtrain
  , num_round
  , evallist
  , early_stopping_rounds = 50
  )

The sklearn wrapper equivalent:

new_bst = xgb.XGBRegressor(
    n_estimators = 1000
    , objective = 'reg:squarederror'
)

# note: newer XGBoost releases (2.0+) expect eval_metric and
# early_stopping_rounds in the XGBRegressor(...) constructor instead of fit()
new_bst.fit(
    X = x_train
    , y = y_train
    , eval_set = [(x_train, y_train), (x_val, y_val)]
    , eval_metric = ['rmse', ]
    , early_stopping_rounds = 50
    , verbose = True
#     , xgb_model = # if you want to add more training
)
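One thing that helped me see why the two versions are equivalent is that the wrapper mostly renames things: n_estimators plays the role of num_boost_round, and learning_rate is the wrapper's name for eta. The helper below is hypothetical (`to_native_args` is not part of XGBoost); it is only a sketch of how sklearn-style keyword arguments line up with what xgb.train takes.

```python
def to_native_args(n_estimators=100, learning_rate=None, **other_params):
    """Split sklearn-style keyword arguments into the (params, num_boost_round)
    pair that xgb.train takes. Hypothetical helper for illustration only."""
    params = dict(other_params)
    if learning_rate is not None:
        params["eta"] = learning_rate  # 'eta' is the native name for learning_rate
    return params, n_estimators        # n_estimators maps to num_boost_round

params, num_round = to_native_args(
    n_estimators=1000, objective="reg:squarederror", learning_rate=0.1
)
print(params)
print(num_round)
```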

3. Issues with saving and loading models (Use pickles 🥒!)

This is how I save my models (which is not the recommended way):

import pickle
file_name = "models/xgb_01.pkl"

# save
with open(file_name, "wb") as f:
    pickle.dump(new_bst, f)

# load
with open(file_name, "rb") as f:
    new_bst = pickle.load(f)
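To check the round trip without needing a trained model, here is a minimal sketch that pickles a stand-in object. The dict below is an assumption for illustration (it is not what an XGBoost model looks like inside), and the temporary directory is only there so the example does not leave files behind.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: a plain dict, so the sketch runs without XGBoost.
model = {"feature_names": ["age", "income"], "weights": [0.4, 1.7]}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, "xgb_01.pkl")

    # save: the file handle is closed as soon as the with-block exits
    with open(file_name, "wb") as f:
        pickle.dump(model, f)

    # load
    with open(file_name, "rb") as f:
        restored = pickle.load(f)

print(restored["feature_names"])
```

Everything stored on the object comes back, which is why the feature names survive a pickle round trip.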

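The recommended way (from the docs):

# dump model
bst.dump_model('dump.raw.txt')
# dump model with feature map
bst.dump_model('dump.raw.txt', 'featmap.txt') # I seem to get an error when I run this (featmap.txt probably has to exist already)

Using this method, I lost the feature names when I tried reloading the model. Note that dump_model writes a human-readable text dump of the trees; for a model you intend to reload, the docs pair it with save_model and load_model.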

This also only seems to apply to the base implementation. XGBRegressor() does not have the dump_model function. What we can do is:

new_bst.get_booster().dump_model('models/my_model.model')

Again, the issue of missing feature names still exists. So stick to pickling!

4. Use early_stopping_rounds: faster, better results.

When you add a validation set to your training, you can use the early_stopping_rounds parameter when fitting the model. This also requires at least one evaluation metric (e.g. RMSE).

  • The model computes the RMSE for both the train and validation sets after every round. As long as the validation RMSE keeps improving, training continues. Once it has gone early_stopping_rounds rounds in a row without improving (e.g. 10 rounds), XGBoost decides that the model is unlikely to get any better and ends the training. If you pass more than one evaluation set, the last one is the one used for early stopping, which is why the validation set goes last.

This definitely saves a lot of time. If you feel your model can still improve with more iterations, you can always set a higher early_stopping_rounds value.
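The stopping rule itself is easy to sketch in plain Python. This is not XGBoost's implementation, just a minimal simulation under the assumption that we already have one validation RMSE per round; the function name `train_with_early_stopping` and the numbers are made up for illustration.

```python
def train_with_early_stopping(val_scores, early_stopping_rounds):
    """Walk through per-round validation RMSE values and return
    (best_round, stopped_at_round). Lower RMSE is better."""
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            # new best: remember it and keep training
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= early_stopping_rounds:
            # no improvement for early_stopping_rounds rounds: stop here
            return best_round, round_idx
    # ran out of rounds before the patience was used up
    return best_round, len(val_scores) - 1

val_rmse = [5.0, 4.2, 3.9, 3.7, 3.8, 3.75, 3.9, 4.0, 4.1]
best_round, stopped_at = train_with_early_stopping(val_rmse, early_stopping_rounds=3)
print(best_round, stopped_at)
```

With a patience of 3, the best round is round 3 (RMSE 3.7) and training stops at round 6, three rounds later. A larger early_stopping_rounds would let this run go through all nine rounds.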