catboost.load_pool

catboost.load_pool(data,
                   label = NULL,
                   cat_features = NULL,
                   column_description = NULL,
                   pairs = NULL,
                   delimiter = "\t",
                   has_header = FALSE,
                   weight = NULL,
                   group_id = NULL,
                   group_weight = NULL,
                   subgroup_id = NULL,
                   pairs_weight = NULL,
                   baseline = NULL,
                   feature_names = NULL,
                   thread_count = -1)

Purpose

Load the CatBoost dataset.

Arguments

data

Description

A file path, data.frame or matrix with features.

The following column types are supported:

  • double
  • factor. Categorical features are assumed to be given in columns of this type. The standard CatBoost processing procedure is applied to such columns:
    1. The values are converted to strings.
    2. The ConvertCatFeatureToFloat function is applied to the resulting string.
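
As a minimal sketch of the two supported column types, the snippet below builds a toy data.frame with one double column and one factor column. The column names and values are made up for illustration, and the call to catboost.load_pool is guarded so the snippet also runs where the catboost package is not installed.

```r
# Toy data.frame (hypothetical values): one double column, one factor column.
features <- data.frame(
  age  = c(25, 32, 47, 51),
  city = factor(c("Oslo", "Lima", "Oslo", "Kyiv"))
)
labels <- c(0, 1, 1, 0)

# Inspect the R class of every column before building the pool.
column_classes <- vapply(features, function(col) class(col)[1], character(1))
print(column_classes)

# Build the pool only if the catboost package is available.
if (requireNamespace("catboost", quietly = TRUE)) {
  pool <- catboost::catboost.load_pool(data = features, label = labels)
}
```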

Default value

Required argument

label

Description

The target variables (in other words, the objects' label values) of the dataset.

This parameter is used if the input data format is matrix or data.frame. Otherwise it must be set to NULL.

Default value

NULL

cat_features

Description

A vector of categorical feature indices.

The indices are zero-based and can differ from the ones given in the columns description file.

If the data parameter is a data.frame, do not use cat_features: categorical features are determined automatically from the data.frame column types.
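
Because R indexes columns from one while cat_features is zero-based, a conversion step is usually needed for matrix input. The following sketch uses a hypothetical three-column matrix in which the categorical values are already integer-coded. The catboost call is guarded so the snippet runs without the package.

```r
# Hypothetical feature matrix; columns 2 and 3 hold integer-coded categories.
feature_matrix <- cbind(
  height = c(1.7, 1.6, 1.8, 1.5),
  color  = c(1, 2, 1, 3),
  shape  = c(2, 2, 1, 1)
)
labels <- c(1, 0, 1, 0)

r_cat_columns <- c(2, 3)           # one-based, as R indexes columns
cat_features  <- r_cat_columns - 1 # zero-based, as cat_features expects

if (requireNamespace("catboost", quietly = TRUE)) {
  pool <- catboost::catboost.load_pool(feature_matrix,
                                       label = labels,
                                       cat_features = cat_features)
}
```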

Default value

NULL (it is assumed that all columns are the values of numerical features)

column_description

Description

The path to the input file that contains the columns description.

This parameter is used if the data is input from a file.

Default value

NULL (it is assumed that the first column in the file with the dataset description defines the label value, and the other columns are the values of numerical features)
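
A small sketch of file-based input with a columns description follows. The file contents are invented for illustration, and the layout of the description file (zero-based column index, a tab, then a type such as Label, Num or Categ) is assumed from CatBoost's columns description format rather than stated on this page.

```r
dataset_path <- tempfile(fileext = ".tsv")
cd_path      <- tempfile(fileext = ".cd")

# Hypothetical dataset: label, one numerical feature, one categorical feature.
writeLines(c("1\t0.5\tred",
             "0\t1.2\tblue",
             "1\t0.7\tred",
             "0\t2.0\tgreen"), dataset_path)

# Columns description (assumed format): zero-based index, tab, column type.
writeLines(c("0\tLabel", "1\tNum", "2\tCateg"), cd_path)

# Read the description back to show what the loader is given.
cd <- read.table(cd_path, sep = "\t", col.names = c("index", "type"),
                 stringsAsFactors = FALSE)
print(cd)

if (requireNamespace("catboost", quietly = TRUE)) {
  pool <- catboost::catboost.load_pool(dataset_path,
                                       column_description = cd_path)
}
```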

pairs

Description

A file path, matrix or data.frame with the pairs description of shape N by 2:

  • N is the number of pairs.
  • The first element of the pair is the zero-based index of the winner object from the input dataset for pairwise comparison.
  • The second element of the pair is the zero-based index of the loser object from the input dataset for pairwise comparison.

This information is used for calculation and optimization of Pairwise metrics.

Default value

NULL

Pairwise metrics require pairs data. If this data is not provided explicitly by specifying this parameter, pairs are generated automatically in each group using object label values.
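
The N by 2 layout can be sketched in base R as below. The row numbers are hypothetical, and the snippet only builds and checks the matrix. Passing it through the pairs argument is left out because the element type the loader expects is not stated here.

```r
# Hypothetical comparisons, written with one-based R row numbers:
# row 1 beats row 2, and row 3 beats row 4.
winner_rows <- c(1L, 3L)
loser_rows  <- c(2L, 4L)

# Convert to the zero-based object indices described above.
pairs <- cbind(winner = winner_rows - 1L,
               loser  = loser_rows - 1L)

# Sanity checks: two columns and no negative indices.
stopifnot(ncol(pairs) == 2, all(pairs >= 0))
print(pairs)
```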

delimiter

Description

The delimiter character used to separate the data in the dataset description input file.

Only single-character delimiters are supported. If the specified value contains more than one character, only the first one is used.

Note

Used only if the dataset is given in the Delimiter-separated values format.

Default value

\t

has_header

Description

Read the column names from the first line of the dataset description file if this parameter is set.

Note

Used only if the dataset is given in the Delimiter-separated values format.
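
As an illustration of delimiter and has_header together, the sketch below writes a small comma-separated file with a header line. The column names and values are invented. With no columns description, the first column is taken as the label, as described for column_description above.

```r
csv_path <- tempfile(fileext = ".csv")

# Hypothetical comma-separated dataset with a header line.
writeLines(c("target,width,height",
             "1,0.5,2.0",
             "0,1.5,1.0",
             "1,0.7,2.4"), csv_path)

# The header line that has_header = TRUE tells the loader to read as names.
header <- strsplit(readLines(csv_path, n = 1), ",", fixed = TRUE)[[1]]
print(header)

if (requireNamespace("catboost", quietly = TRUE)) {
  pool <- catboost::catboost.load_pool(csv_path,
                                       delimiter = ",",
                                       has_header = TRUE)
}
```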

Default value

FALSE

weight

Description

The weights of objects.

Default value

NULL

group_id

Description

Group identifiers for all input objects.

Warning

All objects in the dataset must be grouped by group identifiers if they are present. That is, objects with the same group identifier must follow each other in the dataset.

Example

For example, let's assume that the dataset consists of documents $d_{1}, d_{2}, d_{3}, d_{4}, d_{5}$. The corresponding groups are $g_{1}, g_{2}, g_{3}, g_{2}, g_{2}$, respectively. The feature vectors for the given documents are $f_{1}, f_{2}, f_{3}, f_{4}, f_{5}$, respectively. Then the dataset can take the following form:

$$\begin{pmatrix} d_{2}&g_{2}&f_{2}\\ d_{4}&g_{2}&f_{4}\\ d_{5}&g_{2}&f_{5}\\ d_{3}&g_{3}&f_{3}\\ d_{1}&g_{1}&f_{1} \end{pmatrix}$$

The grouped blocks of lines can be input in any order. For example, the following order is equivalent to the previous one:

$$\begin{pmatrix} d_{1}&g_{1}&f_{1}\\ d_{3}&g_{3}&f_{3}\\ d_{2}&g_{2}&f_{2}\\ d_{4}&g_{2}&f_{4}\\ d_{5}&g_{2}&f_{5} \end{pmatrix}$$
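
The grouping requirement can be checked and enforced in base R before the data is passed to catboost.load_pool. The sketch below reuses the five documents from the example. The helper is_grouped is a hypothetical name introduced here and is not part of the catboost package.

```r
# The documents and groups from the example, in their original order.
docs <- data.frame(
  doc   = c("d1", "d2", "d3", "d4", "d5"),
  group = c("g1", "g2", "g3", "g2", "g2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every group id forms one contiguous block.
# rle() collapses consecutive repeats, so a repeated run value means a split group.
is_grouped <- function(ids) {
  anyDuplicated(rle(as.character(ids))$values) == 0
}

# Reorder rows so each group is contiguous, keeping groups in order of first
# appearance and rows in their original order within a group.
grouped <- docs[order(match(docs$group, unique(docs$group))), ]

print(is_grouped(docs$group))     # the original order splits g2
print(is_grouped(grouped$group))  # the reordered rows keep g2 together
```

After reordering, the group column of the reordered data.frame is what would be supplied as group_id, with the feature and label rows reordered in the same way.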

Default value

NULL

group_weight

Description

The weights of all objects within the defined groups from the input data in the form of one-dimensional array-like data.

Used for calculating the final values of trees. By default, it is set to 1 for all objects in all groups.

Alert

Only one of the following parameters can be used at a time:

  • weight
  • group_weight

Default value

NULL

subgroup_id

Description

Subgroup identifiers for all input objects.

Default value

NULL

pairs_weight

Description

The weight of each input pair of objects.

This information is used for calculation and optimization of Pairwise metrics.

By default, it is set to 1 for all pairs.

Do not use this parameter if an input file is specified in the pairs parameter.

Default value

NULL

baseline

Description

A vector of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.

Default value

NULL

feature_names

Description

A list of names for each feature in the dataset.

Default value

NULL

thread_count

Description

The number of threads to use while reading the data.
Using more threads can shorten the reading time. This parameter does not affect the results.

Default value

-1 (the number of threads is equal to the number of processor cores)

Examples

Load the model from a file

The following Python example illustrates how to save a trained model to a file and then load it.

from catboost import CatBoostClassifier, Pool

train_data = [[1, 3],
              [0, 4],
              [1, 7]]
train_labels = [1, 0, 1]

# catboost_pool = Pool(train_data, train_labels)

model = CatBoostClassifier(learning_rate=0.03)
model.fit(train_data,
          train_labels,
          verbose=False)

model.save_model("model")

from_file = CatBoostClassifier()

from_file.load_model("model")

Load the dataset from the CatBoost R package (this dataset is a subset of the Adult Data Set distributed through the UCI Machine Learning Repository):

library(catboost)

pool_path = system.file("extdata",
                        "adult_train.1000",
                        package="catboost")

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'

data <- read.table(pool_path,
                   head = F,
                   sep = "\t",
                   colClasses = column_description_vector,
                   na.strings='NAN')

# Transform categorical features to numerical
for (i in cat_features)
  data[,i] <- as.numeric(factor(data[,i]))

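# The label is stored in the first column
target <- c(1)

# Subtract 2 from the column numbers: 1 for the removed label column
# and 1 to convert to zero-based indices
pool <- catboost.load_pool(as.matrix(data[,-target]),
                           label = as.matrix(data[,target]),
                           cat_features = cat_features - 2)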
head(pool, 1)[[1]]

Load the dataset from data.frame:

library(catboost)

train_path = system.file("extdata",
                         "adult_train.1000",
                         package="catboost")
test_path = system.file("extdata",
                        "adult_test.1000",
                        package="catboost")

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'

train <- read.table(train_path,
                    head = F,
                    sep = "\t",
                    colClasses = column_description_vector,
                    na.strings='NAN')
test <- read.table(test_path,
                   head = F,
                   sep = "\t",
                   colClasses = column_description_vector,
                   na.strings='NAN')
target <- c(1)
train_pool <- catboost.load_pool(data=train[,-target],
                                 label = train[,target])
test_pool <- catboost.load_pool(data=test[,-target],
                                label = test[,target])
head(train_pool, 1)[[1]]
head(test_pool, 1)[[1]]