Pool

class catboost_spark.Pool(data_frame_or_java_object, pairs_data_frame=None)

Bases: pyspark.ml.wrapper.JavaParams

CatBoost’s abstraction of a dataset. Feature data can be stored in raw form (the features column has pyspark.ml.linalg.Vector type) or in quantized form (float feature values are quantized into integer bin values and the features column has Array[Byte] type).

A raw Pool can be transformed to quantized form using the quantize method. This is useful if the dataset is used for training multiple times and the quantization parameters do not change. A pre-quantized Pool lets you cache the quantized feature data, so the feature quantization step is not re-run at the start of each training.
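The reuse pattern described above can be sketched as follows. This is a minimal illustration rather than part of the library: ensure_quantized is a hypothetical helper that relies only on the isQuantized and quantize methods documented on this page, and it assumes that quantize() called without arguments applies default quantization parameters.

```python
def ensure_quantized(pool):
    """Return a quantized version of `pool`, quantizing only when needed.

    `pool` is expected to be a catboost_spark.Pool (or any object that
    exposes the same isQuantized() and quantize() methods).
    """
    if pool.isQuantized():
        # Already quantized: reuse it as is, no second quantization pass.
        return pool
    # Raw features: create a new Pool with quantized features.
    return pool.quantize()
```

With a real Spark session, one would typically call ensure_quantized(catboost_spark.Pool(train_df)) once, where train_df is your own DataFrame, and pass the result to every training run that shares the same quantization parameters.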

Attributes Summary

data

DataFrame with the main data (features, label, optionally weight, etc.)

pairsData

DataFrame with the pairs data (groupId, winnerId, loserId and optionally weight).

Methods Summary

count()

Returns the number of rows in the main data DataFrame.

getBaselineCol()

Returns the baseline column name.

getBaselineCount()

Returns the dimension of the baseline data (0 if not specified).

getFeatureCount()

Returns the number of features.

getFeatureNames()

Returns the list of feature names.

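getFeaturesCol()

Returns the features column name.

getGroupIdCol()

Returns the groupId column name.

getGroupWeightCol()

Returns the groupWeight column name.

getLabelCol()

Returns the label column name.

getSampleIdCol()

Returns the sampleId column name.

getSubgroupIdCol()

Returns the subgroupId column name.

getTimestampCol()

Returns the timestamp column name.

getWeightCol()

Returns the weight column name.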

isQuantized()

Returns whether the main data has already been quantized.

load(sparkSession, dataPathWithScheme[, …])

Load a dataset in one of CatBoost’s natively supported formats.

pairsCount()

Returns the number of rows in the pairsData DataFrame.

quantize([quantizationParams])

Create a Pool with quantized features from a Pool with raw features.

repartition(partitionCount, …)

Repartition data to the specified number of partitions.

setBaselineCol(value)

Sets the baseline column name.

setFeaturesCol(value)

Sets the features column name.

setGroupIdCol(value)

Sets the groupId column name.

setGroupWeightCol(value)

Sets the groupWeight column name.

setLabelCol(value)

Sets the label column name.

setParams([baselineCol, featuresCol, …])

Set the (keyword only) parameters

setSampleIdCol(value)

Sets the sampleId column name.

setSubgroupIdCol(value)

Sets the subgroupId column name.

setTimestampCol(value)

Sets the timestamp column name.

setWeightCol(value)

Sets the weight column name.

Attributes Documentation

data

DataFrame with the main data (features, label, optionally weight, etc.)

pairsData

DataFrame with the pairs data (groupId, winnerId, loserId and optionally weight). Can be None.

Methods Documentation

count()

Returns the number of rows in the main data DataFrame.

getBaselineCol()
Returns
str

baseline column name

getBaselineCount()

Returns the dimension of the baseline data (0 if not specified).

getFeatureCount()

Returns the number of features.

getFeatureNames()

Returns the list of feature names.

getFeaturesCol()
Returns
str

features column name

getGroupIdCol()
Returns
str

groupId column name

getGroupWeightCol()
Returns
str

groupWeight column name

getLabelCol()
Returns
str

label column name

getSampleIdCol()
Returns
str

sampleId column name

getSubgroupIdCol()
Returns
str

subgroupId column name

getTimestampCol()
Returns
str

timestamp column name

getWeightCol()
Returns
str

weight column name. If this is not set or empty, we treat all instance weights as 1.0

isQuantized()

Returns whether the main data has already been quantized.
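The inspection methods documented above can be combined into a quick summary of a Pool. The sketch below is illustrative only: describe_pool and the dictionary keys are hypothetical names, and the function assumes nothing beyond the count, getFeatureCount, getFeatureNames, getBaselineCount and isQuantized methods listed on this page.

```python
def describe_pool(pool):
    """Collect basic facts about a Pool into a plain dict.

    `pool` is expected to be a catboost_spark.Pool. Note that count()
    counts rows of the underlying DataFrame, so on a real Pool it may
    trigger a Spark job.
    """
    return {
        "rows": pool.count(),
        "features": pool.getFeatureCount(),
        "feature_names": list(pool.getFeatureNames()),
        "baseline_dim": pool.getBaselineCount(),  # 0 if no baseline is set
        "quantized": pool.isQuantized(),
    }
```

A summary like this can be logged before training to confirm that the Pool has the expected shape and that quantization has (or has not) already happened.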

static load(sparkSession, dataPathWithScheme, columnDescription=None, poolLoadParams=None, pairsDataPathWithScheme=None)
Load a dataset in one of CatBoost’s natively supported formats (for example, dsv or libsvm).
Parameters
sparkSession : SparkSession
dataPathWithScheme : str

Path with scheme to dataset in CatBoost format. For example, dsv:///home/user/datasets/my_dataset/train.dsv or libsvm:///home/user/datasets/my_dataset/train.libsvm

columnDescription : str, optional

Path to column description file. See https://catboost.ai/docs/concepts/input-data_column-descfile.html

poolLoadParams : PoolLoadParams, optional

Additional parameters specifying the data format.

pairsDataPathWithScheme : str, optional

Path with scheme to dataset pairs in CatBoost format. Only “dsv-grouped” format is supported for now. For example, dsv-grouped:///home/user/datasets/my_dataset/train_pairs.dsv

Returns
Pool

Pool containing loaded data
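A call to load might look like the sketch below. It is an illustration under stated assumptions, not a definitive recipe: with_scheme and load_dsv_pool are hypothetical helpers, the paths are expected to be absolute local paths like the ones in the examples above, and only the parameter names from the load signature on this page are used.

```python
def with_scheme(scheme, absolute_path):
    """Build a 'scheme://path' string such as dsv:///home/user/train.dsv."""
    if not absolute_path.startswith("/"):
        # Assumption of this sketch: only absolute local paths are handled.
        raise ValueError("expected an absolute path, got: " + absolute_path)
    return scheme + "://" + absolute_path


def load_dsv_pool(spark, data_path, column_description_path=None, pairs_path=None):
    """Load a dsv dataset (and an optional pairs file) into a Pool."""
    # Imported lazily so that with_scheme can be used without Spark installed.
    import catboost_spark

    return catboost_spark.Pool.load(
        spark,
        with_scheme("dsv", data_path),
        columnDescription=column_description_path,
        # Pairs are only supported in the "dsv-grouped" format for now.
        pairsDataPathWithScheme=(
            with_scheme("dsv-grouped", pairs_path) if pairs_path else None
        ),
    )
```

Here spark is an existing SparkSession that has the CatBoost Spark package available. Keeping the scheme handling in a separate function makes the path convention easy to check without starting Spark.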

pairsCount()

Returns the number of rows in the pairsData DataFrame.

quantize(quantizationParams=None)

Create a Pool with quantized features from a Pool with raw features.

repartition(partitionCount, byGroupColumnsIfPresent)

Repartition data to the specified number of partitions. This is useful for creating one partition per executor for training (where each executor gets its own CatBoost worker with a part of the training data).
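The one-partition-per-executor idea can be sketched as a small wrapper. This is a hedged example: repartition_for_executors is a hypothetical helper, and it assumes that byGroupColumnsIfPresent is a boolean flag and that repartition returns the repartitioned Pool, neither of which is stated on this page.

```python
def repartition_for_executors(pool, executor_count, by_groups=True):
    """Repartition `pool` so that each executor gets one partition.

    `pool` is expected to be a catboost_spark.Pool and `executor_count`
    the number of executors available for training.
    """
    if executor_count < 1:
        raise ValueError("executor_count must be at least 1")
    # Assumption: the second argument is a boolean and the call returns
    # the repartitioned Pool.
    return pool.repartition(executor_count, by_groups)
```

The executor count itself has to come from your cluster configuration; this sketch does not try to detect it.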

setBaselineCol(value)
Parameters
value : str

baseline column name

setFeaturesCol(value)
Parameters
value : str

features column name

setGroupIdCol(value)
Parameters
value : str

groupId column name

setGroupWeightCol(value)
Parameters
value : str

groupWeight column name

setLabelCol(value)
Parameters
value : str

label column name

setParams(baselineCol=None, featuresCol='features', groupIdCol=None, groupWeightCol=None, labelCol='label', sampleIdCol=None, subgroupIdCol=None, timestampCol=None, weightCol=None)

Set the (keyword only) parameters

Parameters
baselineCol : str

baseline column name

featuresCol : str, default: “features”

features column name

groupIdCol : str

groupId column name

groupWeightCol : str

groupWeight column name

labelCol : str, default: “label”

label column name

sampleIdCol : str

sampleId column name

subgroupIdCol : str

subgroupId column name

timestampCol : str

timestamp column name

weightCol : str

weight column name. If this is not set or empty, we treat all instance weights as 1.0
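Because setParams takes keyword-only arguments, several column names can be set in one call. The sketch below is a hypothetical convenience wrapper (set_columns is not part of the library) that forwards only the names actually provided. It assumes, as is usual for pyspark parameter setters, that parameters not passed to setParams are left unchanged.

```python
def set_columns(pool, **columns):
    """Set only the column names that are given (values that are not None).

    Keyword names must match the setParams parameters listed above,
    for example labelCol, groupIdCol or weightCol.
    Returns the dict of parameters that was forwarded to setParams.
    """
    given = {name: value for name, value in columns.items() if value is not None}
    pool.setParams(**given)
    return given
```

For example, set_columns(pool, labelCol="target", groupIdCol="queryId", weightCol=None) would forward labelCol and groupIdCol and skip weightCol. The column names "target" and "queryId" are placeholders for columns in your own DataFrame.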

setSampleIdCol(value)
Parameters
value : str

sampleId column name

setSubgroupIdCol(value)
Parameters
value : str

subgroupId column name

setTimestampCol(value)
Parameters
value : str

timestamp column name

setWeightCol(value)
Parameters
value : str

weight column name. If this is not set or empty, we treat all instance weights as 1.0