catboost.load_pool
catboost.load_pool(data,
                   label = NULL,
                   cat_features = NULL,
                   column_description = NULL,
                   pairs = NULL,
                   delimiter = "\t",
                   has_header = FALSE,
                   weight = NULL,
                   group_id = NULL,
                   group_weight = NULL,
                   subgroup_id = NULL,
                   pairs_weight = NULL,
                   baseline = NULL,
                   feature_names = NULL,
                   thread_count = -1)
Purpose
Load the CatBoost dataset.
Arguments
data
Description
A file path, data.frame or matrix with features.
The following column types are supported:
- double
- factor. Columns of this type are assumed to contain categorical features. The standard CatBoost processing procedure is applied to them:
  - The values are converted to strings.
  - The ConvertCatFeatureToFloat function is applied to the resulting string.
Default value
Required argument
label
Description
The target variables (in other words, the objects' label values) of the dataset.
This parameter is used if the input data format is matrix or data.frame. Otherwise it must be set to NULL.
Default value
NULL
cat_features
Description
A vector of categorical feature indices.
The indices are zero-based and can differ from the ones given in the column description file.
If the data parameter is a data.frame, do not use cat_features: categorical features are determined automatically from the data.frame column types.
Default value
NULL (it is assumed that all columns are the values of numerical features)
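The difference between data.frame input and matrix input can be sketched as follows. The toy data and the variable names (df, cat_idx) are made up for illustration, and the catboost calls run only if the package is installed.

```r
# Toy data: 'city' is categorical, the other columns are numeric.
df <- data.frame(
  age    = c(25, 32, 47, 51),
  city   = factor(c("Oslo", "Rome", "Oslo", "Lima")),
  income = c(30.5, 42.0, 58.3, 61.1)
)
label <- c(0, 1, 1, 0)

# Zero-based positions of the factor columns. R counts columns from 1,
# so 'city' (column 2) becomes index 1.
cat_idx <- unname(which(sapply(df, is.factor)) - 1)

if (requireNamespace("catboost", quietly = TRUE)) {
  # data.frame input: factor columns are detected automatically,
  # so cat_features is left unset.
  pool_df <- catboost::catboost.load_pool(data = df, label = label)

  # matrix input: factors are encoded as numeric codes and the
  # zero-based indices are passed explicitly.
  pool_m <- catboost::catboost.load_pool(data = data.matrix(df),
                                         label = label,
                                         cat_features = cat_idx)
}
```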
column_description
Description
The path to the input file that contains the column descriptions.
This parameter is used if the data is input from a file.
Default value
NULL (it is assumed that the first column of the dataset file defines the label value and that the other columns contain the values of numerical features)
pairs
Description
A file path, matrix or data.frame with the pairs description of shape N by 2:
- N is the number of pairs.
- The first element of the pair is the zero-based index of the winner object from the input dataset for pairwise comparison.
- The second element of the pair is the zero-based index of the loser object from the input dataset for pairwise comparison.
This information is used for calculation and optimization of Pairwise metrics.
Default value
NULL
Pairwise metrics require pairs data. If this data is not provided explicitly by specifying this parameter, pairs are generated automatically in each group using object label values.
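As an illustration of the N by 2 layout, the sketch below builds a pairs matrix in base R from toy labels and group identifiers. The helper make_pairs is hypothetical and is not part of the catboost package; it only shows one way to produce zero-based winner and loser indices that could then be passed as the pairs argument.

```r
label    <- c(3, 1, 2, 5, 4)
group_id <- c(1, 1, 1, 2, 2)

# Hypothetical helper: every (winner, loser) combination inside a group
# where the winner has the strictly larger label.
make_pairs <- function(label, group_id) {
  idx  <- seq_along(label)
  grid <- expand.grid(winner = idx, loser = idx)
  keep <- group_id[grid$winner] == group_id[grid$loser] &
          label[grid$winner] > label[grid$loser]
  m <- as.matrix(grid[keep, , drop = FALSE]) - 1  # shift to zero-based indices
  rownames(m) <- NULL
  m
}

pairs <- make_pairs(label, group_id)
print(pairs)
```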
delimiter
Description
The delimiter character used to separate the data in the dataset description input file.
Only single-character delimiters are supported. If the specified value contains more than one character, only the first one is used.
Note
Used only if the dataset is given in the Delimiter-separated values format.
Default value
\t
has_header
Description
Read the column names from the first line of the dataset description file if this parameter is set to TRUE.
Note
Used only if the dataset is given in the Delimiter-separated values format.
Default value
FALSE
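A minimal sketch of loading from a delimiter-separated file with a header line. The file contents are made up, and the column description layout used here (zero-based column index, a tab, then a column type such as Label or Categ) is an assumption about the column description format that should be checked against the CatBoost documentation. The catboost call runs only if the package is installed.

```r
data_path <- tempfile(fileext = ".csv")
cd_path   <- tempfile(fileext = ".cd")

# Comma-separated dataset whose first line holds the column names.
writeLines(c("target,age,city",
             "1,25,Oslo",
             "0,32,Rome",
             "1,47,Oslo"), data_path)

# Assumed column description layout: column 0 is the label and
# column 2 is categorical; unlisted columns are treated as numerical.
writeLines(c("0\tLabel",
             "2\tCateg"), cd_path)

if (requireNamespace("catboost", quietly = TRUE)) {
  pool <- catboost::catboost.load_pool(data = data_path,
                                       column_description = cd_path,
                                       delimiter = ",",
                                       has_header = TRUE)
}
```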
weight
Description
The weights of objects.
Default value
NULL
group_id
Description
Group identifiers for all input objects.
Warning
If group identifiers are present, all objects in the dataset must be grouped by them. That is, objects with the same group identifier must follow each other in the dataset.
Example
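For example, let's assume that the dataset consists of documents $d_1, d_2, d_3, d_4, d_5$. The corresponding groups are $g_1, g_2, g_3, g_2, g_2$, respectively. The feature vectors for the given documents are $f_1, f_2, f_3, f_4, f_5$, respectively. Then the dataset can take the following form:
$$\begin{pmatrix} d_2 & g_2 & f_2 \\ d_4 & g_2 & f_4 \\ d_5 & g_2 & f_5 \\ d_3 & g_3 & f_3 \\ d_1 & g_1 & f_1 \end{pmatrix}$$
The grouped blocks of lines can be input in any order. For example, the following order is equivalent to the previous one:
$$\begin{pmatrix} d_1 & g_1 & f_1 \\ d_3 & g_3 & f_3 \\ d_2 & g_2 & f_2 \\ d_4 & g_2 & f_4 \\ d_5 & g_2 & f_5 \end{pmatrix}$$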
Default value
NULL
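The grouping requirement for group_id can be checked and enforced in base R before the pool is built. The sketch below uses five hypothetical documents and a hypothetical helper, is_grouped, which is not part of the catboost package.

```r
docs <- data.frame(
  doc      = c("d1", "d2", "d3", "d4", "d5"),
  group_id = c(1L, 2L, 3L, 2L, 2L),
  f        = c(0.1, 0.2, 0.3, 0.4, 0.5)
)

# TRUE when every group identifier forms a single contiguous run.
is_grouped <- function(g) !anyDuplicated(rle(as.vector(g))$values)

is_grouped(docs$group_id)     # group 2 is split, so the check fails

# Ordering by group identifier makes each group contiguous; order()
# keeps the original relative order of rows within a group.
grouped <- docs[order(docs$group_id), ]
is_grouped(grouped$group_id)
```

Sorting is only one option: any row order that keeps each group in a single block satisfies the requirement.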
group_weight
Description
The weights of all objects within the defined groups from the input data in the form of one-dimensional array-like data.
Used for calculating the final values of trees. By default, it is set to 1 for all objects in all groups.
Alert
Only one of the following parameters can be used at a time:
- weight
- group_weight
Default value
NULL
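Because group_weight is a one-dimensional vector over objects, per-group weights have to be expanded to one value per object. The sketch below shows one way to do that in base R. The weights are made up, and it assumes that all objects in the same group should share the same weight.

```r
group_id <- c(1L, 1L, 2L, 2L, 2L)

# Hypothetical per-group weights, keyed by group identifier.
weight_by_group <- c("1" = 0.5, "2" = 2.0)

# One entry per object, looked up by the object's group identifier.
group_weight <- unname(weight_by_group[as.character(group_id)])
print(group_weight)
```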
subgroup_id
Description
Subgroup identifiers for all input objects.
Default value
NULL
pairs_weight
Description
The weight of each input pair of objects.
This information is used for calculation and optimization of Pairwise metrics.
By default, it is set to 1 for all pairs.
Do not use this parameter if an input file is specified in the pairs parameter.
Default value
NULL
baseline
Description
A vector of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.
Default value
NULL
feature_names
Description
A list of names for each feature in the dataset.
Default value
NULL
thread_count
Description
The number of threads to use while reading the data.
Optimizes the reading time. This parameter doesn't affect the results.
Default value
-1 (the number of threads is equal to the number of processor cores)
Examples
Load the model from a file
The following example uses the CatBoost Python package to save a trained model to a file and then load it.
from catboost import CatBoostClassifier, Pool
train_data = [[1, 3],
              [0, 4],
              [1, 7]]
train_labels = [1, 0, 1]
# catboost_pool = Pool(train_data, train_labels)
model = CatBoostClassifier(learning_rate=0.03)
model.fit(train_data,
          train_labels,
          verbose=False)
model.save_model("model")
from_file = CatBoostClassifier()
from_file.load_model("model")
Load the dataset from the CatBoost R package (this dataset is a subset of the Adult Data Set distributed through the UCI Machine Learning Repository):
library(catboost)
pool_path = system.file("extdata",
                        "adult_train.1000",
                        package = "catboost")
# Columns are numeric unless listed (with 1-based indices) in cat_features
column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
    column_description_vector[i] <- 'factor'
data <- read.table(pool_path,
                   header = FALSE,
                   sep = "\t",
                   colClasses = column_description_vector,
                   na.strings = 'NAN')
# Transform categorical features to numerical
for (i in cat_features)
    data[,i] <- as.numeric(factor(data[,i]))
# The first column holds the label
target <- c(1)
# Subtract 2 from the indices: 1 for the dropped label column, 1 for zero-based indexing
pool <- catboost.load_pool(as.matrix(data[,-target]),
                           label = as.matrix(data[,target]),
                           cat_features = cat_features - 2)
head(pool, 1)[[1]]
Load the dataset from data.frame:
library(catboost)
train_path = system.file("extdata",
                         "adult_train.1000",
                         package = "catboost")
test_path = system.file("extdata",
                        "adult_test.1000",
                        package = "catboost")
column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
    column_description_vector[i] <- 'factor'
train <- read.table(train_path,
                    header = FALSE,
                    sep = "\t",
                    colClasses = column_description_vector,
                    na.strings = 'NAN')
test <- read.table(test_path,
                   header = FALSE,
                   sep = "\t",
                   colClasses = column_description_vector,
                   na.strings = 'NAN')
target <- c(1)
train_pool <- catboost.load_pool(data = train[,-target],
                                 label = train[,target])
test_pool <- catboost.load_pool(data = test[,-target],
                                label = test[,target])
head(train_pool, 1)[[1]]
head(test_pool, 1)[[1]]