Pool
- class catboost_spark.Pool(data_frame_or_java_object, pairs_data_frame=None)
Bases: pyspark.ml.wrapper.JavaParams
CatBoost’s abstraction of a dataset. Feature data can be stored in raw form (the features column has pyspark.ml.linalg.Vector type) or in quantized form (float feature values are quantized into integer bin values, and the features column has Array[Byte] type).
A raw Pool can be transformed to quantized form using the quantize method. This is useful when the dataset is used for training multiple times and the quantization parameters do not change. A pre-quantized Pool caches the quantized feature data, so the feature quantization step is not re-run at the start of each training.
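To make the raw versus quantized distinction concrete, the sketch below maps float feature values to integer bin indices using a sorted list of borders. It is a plain-Python illustration only: the `quantize_value` and `quantize_column` helpers and the border values are hypothetical, and CatBoost selects its borders with its own algorithms rather than the fixed list assumed here.

```python
from bisect import bisect_left

# Assumption for this sketch: a bin index must fit in one unsigned byte (0..255).
MAX_BORDERS = 255


def quantize_value(value, borders):
    """Return the bin index of value: the number of borders strictly below it."""
    return bisect_left(borders, value)


def quantize_column(values, borders):
    """Quantize a column of floats into a bytes object of bin indices."""
    if len(borders) > MAX_BORDERS:
        raise ValueError("too many borders for one-byte bins")
    if list(borders) != sorted(borders):
        raise ValueError("borders must be sorted in ascending order")
    return bytes(quantize_value(v, borders) for v in values)


borders = [0.5, 1.5, 3.0]         # hypothetical borders for one feature
raw = [0.1, 0.5, 0.7, 2.0, 10.0]  # raw float feature values
quantized = quantize_column(raw, borders)
print(list(quantized))  # [0, 0, 1, 2, 3]
```

Because the borders are fixed once they are chosen, the same byte values can be reused across training runs, which is the saving the pre-quantized Pool is described as providing.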
Attributes Summary
data
DataFrame with the main data (features, label, optionally weight, etc.)
pairsData
DataFrame with the pairs data (groupId, winnerId, loserId and optionally weight).
Methods Summary
count()
Returns the number of rows in the main data DataFrame.
Returns the dimension of the baseline data (0 if not specified).
Returns the number of features.
Returns the list of feature names.
Returns whether the main data has already been quantized.
load(sparkSession, dataPathWithScheme[, …])
Load a dataset in one of CatBoost’s natively supported formats.
Returns the number of rows in the pairsData DataFrame.
quantize([quantizationParams])
Create a Pool with quantized features from a Pool with raw features.
repartition(partitionCount, …)
Repartition data to the specified number of partitions.
setBaselineCol(value)
setFeaturesCol(value)
setGroupIdCol(value)
setGroupWeightCol(value)
setLabelCol(value)
setParams([baselineCol, featuresCol, …])
Set the (keyword-only) parameters.
setSampleIdCol(value)
setSubgroupIdCol(value)
setTimestampCol(value)
setWeightCol(value)
Attributes Documentation
- data
DataFrame with the main data (features, label, optionally weight, etc.)
- pairsData
DataFrame with the pairs data (groupId, winnerId, loserId and optionally weight). Can be None.
Methods Documentation
- getWeightCol()
- Returns
- str
Weight column name. If this is not set or is empty, all instance weights are treated as 1.0.
- static load(sparkSession, dataPathWithScheme, columnDescription=None, poolLoadParams=None, pairsDataPathWithScheme=None)
Load a dataset in one of CatBoost’s natively supported formats.
- Parameters
- sparkSession : SparkSession
- dataPathWithScheme : str
Path with scheme to the dataset in CatBoost format. For example, dsv:///home/user/datasets/my_dataset/train.dsv or libsvm:///home/user/datasets/my_dataset/train.libsvm
- columnDescription : str, optional
Path to the column description file. See https://catboost.ai/docs/concepts/input-data_column-descfile.html
- poolLoadParams : PoolLoadParams, optional
Additional parameters specifying the data format.
- pairsDataPathWithScheme : str, optional
Path with scheme to the dataset pairs in CatBoost format. Only the “dsv-grouped” format is supported for now. For example, dsv-grouped:///home/user/datasets/my_dataset/train_pairs.dsv
- Returns
- Pool
Pool containing the loaded data.
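As a rough illustration of the "path with scheme" convention used by load, the sketch below builds such a string and passes it to Pool.load. The helper names `with_scheme` and `load_train_pool` are hypothetical, and the call assumes a Spark session that already has CatBoost for Apache Spark available; only the argument names shown in the signature above are used.

```python
import posixpath


def with_scheme(scheme, absolute_path):
    """Build a 'scheme://' + absolute path string, e.g. dsv:///home/user/train.dsv."""
    if not posixpath.isabs(absolute_path):
        raise ValueError("expected an absolute path, got %r" % absolute_path)
    return "%s://%s" % (scheme, absolute_path)


def load_train_pool(spark, data_path, column_description_path=None, pairs_path=None):
    """Load a Pool from a dsv file (hypothetical helper, not part of catboost_spark)."""
    # Imported lazily so that with_scheme can be used without Spark installed.
    import catboost_spark

    return catboost_spark.Pool.load(
        spark,
        with_scheme("dsv", data_path),
        columnDescription=column_description_path,
        # Pairs are only supported in the "dsv-grouped" format, per the docs above.
        pairsDataPathWithScheme=(
            with_scheme("dsv-grouped", pairs_path) if pairs_path else None
        ),
    )


print(with_scheme("dsv", "/home/user/datasets/my_dataset/train.dsv"))
# dsv:///home/user/datasets/my_dataset/train.dsv
```

The three slashes in the result come from the `://` separator followed by the leading slash of the absolute path, which is why the helper rejects relative paths.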
- quantize(quantizationParams=None)
Create a Pool with quantized features from a Pool with raw features.
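A minimal sketch of the intended usage pattern, quantizing once and reusing the result, might look like the following. The `quantize_once` helper is hypothetical, and the type of quantizationParams is not specified on this page, so the sketch simply forwards whatever it is given (None by default, matching the signature above).

```python
def quantize_once(raw_pool, quantization_params=None):
    """Return a quantized Pool intended to be reused across several training runs.

    raw_pool is expected to be a catboost_spark.Pool with raw features; any
    object with a compatible quantize method works for this sketch.
    """
    return raw_pool.quantize(quantization_params)


# Usage with a real Pool (requires Spark and catboost_spark; df is a DataFrame
# with features and label columns):
#   quantized_pool = quantize_once(catboost_spark.Pool(df))
#   ... pass quantized_pool to each training run instead of the raw Pool ...
```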
- repartition(partitionCount, byGroupColumnsIfPresent)
Repartition data to the specified number of partitions. This is useful for creating one partition per executor for training, where each executor gets its own CatBoost worker with a part of the training data.
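The one-partition-per-executor pattern described above can be sketched as a small wrapper. The helper name `one_partition_per_executor` is hypothetical, the executor count is supplied by the caller, and the sketch assumes that repartition returns the repartitioned Pool, which this page does not state explicitly.

```python
def one_partition_per_executor(pool, executor_count, by_group_columns=True):
    """Repartition pool so that each training executor receives one partition.

    by_group_columns is forwarded as byGroupColumnsIfPresent; its exact
    semantics are assumed from the parameter name, not documented here.
    """
    if executor_count < 1:
        raise ValueError("executor_count must be at least 1")
    # Assumption: repartition returns the repartitioned Pool.
    return pool.repartition(executor_count, by_group_columns)


# Usage with a real Pool (requires Spark and catboost_spark):
#   train_pool = one_partition_per_executor(train_pool, executor_count=4)
```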
- setParams(baselineCol=None, featuresCol='features', groupIdCol=None, groupWeightCol=None, labelCol='label', sampleIdCol=None, subgroupIdCol=None, timestampCol=None, weightCol=None)
Set the (keyword-only) parameters.
- Parameters
- baselineCol : str
Baseline column name.
- featuresCol : str, default: “features”
Features column name.
- groupIdCol : str
groupId column name.
- groupWeightCol : str
groupWeight column name.
- labelCol : str, default: “label”
Label column name.
- sampleIdCol : str
sampleId column name.
- subgroupIdCol : str
subgroupId column name.
- timestampCol : str
Timestamp column name.
- weightCol : str
Weight column name. If this is not set or is empty, all instance weights are treated as 1.0.
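As an illustration of the keyword-only convention, the sketch below sets only the columns that differ from the defaults. The helper `configure_columns` and the column names in the usage comment are hypothetical; the keyword names are the ones listed in the setParams signature above.

```python
def configure_columns(pool, label_col="label", weight_col=None, group_id_col=None):
    """Point pool at the DataFrame columns to use, passing keywords only."""
    params = {"labelCol": label_col}
    # Optional columns are passed only when the caller names them, so unset
    # parameters stay unset (for example, no weightCol means weights of 1.0).
    if weight_col:
        params["weightCol"] = weight_col
    if group_id_col:
        params["groupIdCol"] = group_id_col
    pool.setParams(**params)
    return pool


# Usage with a real Pool (requires Spark and catboost_spark; the column names
# "target" and "w" are made up for this example):
#   pool = configure_columns(catboost_spark.Pool(df), label_col="target", weight_col="w")
```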