Pool¶

class catboost_spark.Pool(data_frame_or_java_object, pairs_data_frame=None)[source]¶

Bases: pyspark.ml.wrapper.JavaParams

CatBoost’s abstraction of a dataset. Feature data can be stored in raw form (the features column has pyspark.ml.linalg.Vector type) or in quantized form (float feature values are quantized into integer bin values and the features column has Array[Byte] type).

A raw Pool can be transformed to quantized form using the quantize method. This is useful if the dataset is used for training multiple times and the quantization parameters do not change. A pre-quantized Pool lets you cache the quantized feature data, so the feature quantization step is not re-run at the start of each training.
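The reuse pattern described above can be sketched as follows. This is a minimal sketch, not part of catboost_spark: `ensure_quantized` is a hypothetical helper that uses only the `isQuantized` and `quantize` methods documented on this page, and it assumes `quantize` returns the new Pool rather than modifying the original in place.

```python
def ensure_quantized(pool, quantization_params=None):
    """Return a quantized Pool, quantizing only when the data is still raw.

    `pool` is expected to be a catboost_spark.Pool; only the documented
    isQuantized() and quantize() methods are called on it.
    """
    if pool.isQuantized():
        # Already quantized: reuse it as is, there is no work to repeat.
        return pool
    if quantization_params is None:
        # No explicit parameters given: rely on the method's default.
        return pool.quantize()
    return pool.quantize(quantization_params)
```

A typical use would be `train_pool = ensure_quantized(raw_pool)` before the first of several training runs that share the same quantization parameters.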

Attributes Summary

`data`	DataFrame with the main data (features, label, (optionally) weight etc.)
`pairsData`	DataFrame with the pairs data (groupId, winnerId, loserId and optionally weight).

Methods Summary

`count`()	Returns the number of rows in the main data DataFrame.
`getBaselineCol`()	Returns the baseline column name.
`getBaselineCount`()	Returns the dimension of the baseline data (0 if not specified).
`getFeatureCount`()	Returns the number of features.
`getFeatureNames`()	Returns the list of feature names.
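`getFeaturesCol`()	Returns the features column name.
`getGroupIdCol`()	Returns the groupId column name.
`getGroupWeightCol`()	Returns the groupWeight column name.
`getLabelCol`()	Returns the label column name.
`getSampleIdCol`()	Returns the sampleId column name.
`getSubgroupIdCol`()	Returns the subgroupId column name.
`getTimestampCol`()	Returns the timestamp column name.
`getWeightCol`()	Returns the weight column name.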
`isQuantized`()	Returns whether the main data has already been quantized.
`load`(sparkSession, dataPathWithScheme[, …])	Loads a dataset in one of CatBoost’s natively supported formats.
`pairsCount`()	Returns the number of rows in the pairsData DataFrame.
`quantize`([quantizationParams])	Creates a Pool with quantized features from a Pool with raw features.
`repartition`(partitionCount, …)	Repartitions data to the specified number of partitions.
`setBaselineCol`(value)	Sets the baseline column name.
`setFeaturesCol`(value)	Sets the features column name.
`setGroupIdCol`(value)	Sets the groupId column name.
`setGroupWeightCol`(value)	Sets the groupWeight column name.
`setLabelCol`(value)	Sets the label column name.
`setParams`([baselineCol, featuresCol, …])	Sets the (keyword only) parameters.
`setSampleIdCol`(value)	Sets the sampleId column name.
`setSubgroupIdCol`(value)	Sets the subgroupId column name.
`setTimestampCol`(value)	Sets the timestamp column name.
`setWeightCol`(value)	Sets the weight column name.

Attributes Documentation

data¶: DataFrame with the main data (features, label, (optionally) weight etc.)

pairsData¶: DataFrame with the pairs data (groupId, winnerId, loserId and optionally weight). Can be None.

Methods Documentation

count()[source]¶: Returns the number of rows in the main data DataFrame.

getBaselineCol()[source]¶

Returns

str: baseline column name

getBaselineCount()[source]¶: Returns the dimension of the baseline data (0 if not specified).

getFeatureCount()[source]¶: Returns the number of features.

getFeatureNames()[source]¶: Returns the list of feature names.
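As an illustration of the size-related getters documented above, the sketch below gathers them into one dictionary. `describe_pool` is a hypothetical helper, not part of catboost_spark, and the dictionary keys are arbitrary names chosen for the example. Since `count()` reports the number of rows of a Spark DataFrame, it will most likely trigger a Spark job.

```python
def describe_pool(pool):
    """Collect basic size information about a catboost_spark.Pool.

    Only the documented count(), getFeatureCount(), getFeatureNames()
    and getBaselineCount() methods are called.
    """
    return {
        "rows": pool.count(),
        "features": pool.getFeatureCount(),
        "feature_names": list(pool.getFeatureNames()),
        # 0 means that no baseline was specified.
        "baseline_dimension": pool.getBaselineCount(),
    }
```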

getFeaturesCol()[source]¶

Returns

str: features column name

getGroupIdCol()[source]¶

Returns

str: groupId column name

getGroupWeightCol()[source]¶

Returns

str: groupWeight column name

getLabelCol()[source]¶

Returns

str: label column name

getSampleIdCol()[source]¶

Returns

str: sampleId column name

getSubgroupIdCol()[source]¶

Returns

str: subgroupId column name

getTimestampCol()[source]¶

Returns

str: timestamp column name

getWeightCol()[source]¶

Returns

str: weight column name. If this is not set or empty, we treat all instance weights as 1.0

isQuantized()[source]¶: Returns whether the main data has already been quantized.

static load(sparkSession, dataPathWithScheme, columnDescription=None, poolLoadParams=None, pairsDataPathWithScheme=None)[source]¶

Load a dataset in one of CatBoost’s natively supported formats.

Parameters

sparkSession (SparkSession)
dataPathWithScheme (str): Path with scheme to dataset in CatBoost format. For example, dsv:///home/user/datasets/my_dataset/train.dsv or libsvm:///home/user/datasets/my_dataset/train.libsvm
columnDescription (str, optional): Path to column description file. See https://catboost.ai/docs/concepts/input-data_column-descfile.html
poolLoadParams (PoolLoadParams, optional): Additional params specifying data format.
pairsDataPathWithScheme (str, optional): Path with scheme to dataset pairs in CatBoost format. Only “dsv-grouped” format is supported for now. For example, dsv-grouped:///home/user/datasets/my_dataset/train_pairs.dsv

Returns

Pool: Pool containing loaded data
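A call to `load` might look like the sketch below. Both functions are hypothetical helpers written for this example: `with_scheme` only builds the "path with scheme" strings shown in the parameter descriptions, and `load_dsv_pool` passes them to `Pool.load` using the keyword names from the signature above. The sketch assumes absolute paths, as in the examples on this page, and an already created SparkSession.

```python
def with_scheme(scheme, path):
    """Prefix a path with a CatBoost format scheme such as 'dsv' or 'libsvm'.

    The examples on this page use absolute paths, so the result has
    three slashes: 'dsv://' followed by '/home/...'.
    """
    return f"{scheme}://{path}"


def load_dsv_pool(spark, data_path, column_description=None, pairs_path=None):
    """Load a dsv dataset, and optionally its pairs, into a Pool."""
    # Imported lazily so that with_scheme() stays usable without Spark.
    import catboost_spark

    pairs = None
    if pairs_path is not None:
        # Only the dsv-grouped format is documented for pairs data.
        pairs = with_scheme("dsv-grouped", pairs_path)

    return catboost_spark.Pool.load(
        spark,
        with_scheme("dsv", data_path),
        columnDescription=column_description,
        pairsDataPathWithScheme=pairs,
    )
```

For example, `load_dsv_pool(spark, "/home/user/datasets/my_dataset/train.dsv")` would request the first path shown in the `dataPathWithScheme` description.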

pairsCount()[source]¶: Returns the number of rows in the pairsData DataFrame.

quantize(quantizationParams=None)[source]¶: Create Pool with quantized features from Pool with raw features

repartition(partitionCount, byGroupColumnsIfPresent)[source]¶: Repartition data to the specified number of partitions. Useful to repartition data to create one partition per executor for training (where each executor gets its own CatBoost worker with a part of the training data).
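The one-partition-per-executor use case could be wrapped as in the sketch below. `repartition_per_executor` is a hypothetical helper, and it rests on two assumptions not stated on this page: that `repartition` returns the repartitioned Pool, and that `byGroupColumnsIfPresent` is a boolean flag. Determining the executor count is left to the caller.

```python
def repartition_per_executor(pool, executor_count, by_group_columns=True):
    """Repartition a catboost_spark.Pool into one partition per executor.

    Assumptions: repartition() returns the repartitioned Pool, and its
    second argument is a boolean.
    """
    if executor_count < 1:
        raise ValueError("executor_count must be at least 1")
    return pool.repartition(executor_count, by_group_columns)
```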

setBaselineCol(value)[source]¶

Parameters

value (str): baseline column name

setFeaturesCol(value)[source]¶

Parameters

value (str): features column name

setGroupIdCol(value)[source]¶

Parameters

value (str): groupId column name

setGroupWeightCol(value)[source]¶

Parameters

value (str): groupWeight column name

setLabelCol(value)[source]¶

Parameters

value (str): label column name

setParams(baselineCol=None, featuresCol='features', groupIdCol=None, groupWeightCol=None, labelCol='label', sampleIdCol=None, subgroupIdCol=None, timestampCol=None, weightCol=None)[source]¶

Set the (keyword only) parameters

Parameters

baselineCol (str): baseline column name
featuresCol (str, default: “features”): features column name
groupIdCol (str): groupId column name
groupWeightCol (str): groupWeight column name
labelCol (str, default: “label”): label column name
sampleIdCol (str): sampleId column name
subgroupIdCol (str): subgroupId column name
timestampCol (str): timestamp column name
weightCol (str): weight column name. If this is not set or empty, we treat all instance weights as 1.0
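Because these parameters are keyword only, several column names can be set in one call. The sketch below shows one way to do that for a grouped dataset. `configure_group_columns` is a hypothetical helper, and any column names passed to it are the caller's own. It does not rely on the return value of `setParams`.

```python
def configure_group_columns(pool, group_id_col, label_col="label", weight_col=None):
    """Point a catboost_spark.Pool at the columns of a grouped dataset."""
    params = {"groupIdCol": group_id_col, "labelCol": label_col}
    if weight_col:
        # Leaving weightCol unset means all instance weights are treated as 1.0.
        params["weightCol"] = weight_col
    # setParams accepts keyword arguments only.
    pool.setParams(**params)
    return pool
```

For instance, `configure_group_columns(pool, "queryId")` would set `groupIdCol` to a hypothetical `queryId` column and keep the default label column name.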

setSampleIdCol(value)[source]¶

Parameters

value (str): sampleId column name

setSubgroupIdCol(value)[source]¶

Parameters

value (str): subgroupId column name

setTimestampCol(value)[source]¶

Parameters

value (str): timestamp column name

setWeightCol(value)[source]¶

Parameters

value (str): weight column name. If this is not set or empty, we treat all instance weights as 1.0
