CatBoostRegressor

class catboost_spark.CatBoostRegressor(allowConstLabel=None, allowWritingFiles=None, approxOnFullHistory=None, baggingTemperature=None, bestModelMinTrees=None, bootstrapType=None, borderCount=None, connectTimeout=datetime.timedelta(seconds=60), customMetric=None, depth=None, diffusionTemperature=None, earlyStoppingRounds=None, evalMetric=None, featureBorderType=None, featureWeightsList=None, featureWeightsMap=None, featuresCol='features', firstFeatureUsePenaltiesList=None, firstFeatureUsePenaltiesMap=None, foldLenMultiplier=None, foldPermutationBlock=None, hasTime=None, ignoredFeaturesIndices=None, ignoredFeaturesNames=None, inputBorders=None, iterations=None, l2LeafReg=None, labelCol='label', leafEstimationBacktracking=None, leafEstimationIterations=None, leafEstimationMethod=None, learningRate=None, loggingLevel=None, lossFunction=None, metricPeriod=None, modelShrinkMode=None, modelShrinkRate=None, mvsReg=None, nanMode=None, odPval=None, odType=None, odWait=None, oneHotMaxSize=None, penaltiesCoefficient=None, perFloatFeatureQuantizaton=None, perObjectFeaturePenaltiesList=None, perObjectFeaturePenaltiesMap=None, predictionCol='prediction', randomSeed=None, randomStrength=None, rsm=None, samplingFrequency=None, samplingUnit=None, saveSnapshot=None, scoreFunction=None, snapshotFile=None, snapshotInterval=None, sparkPartitionCount=None, subsample=None, threadCount=None, trainDir=None, useBestModel=None, weightCol=None, workerInitializationTimeout=datetime.timedelta(seconds=600), workerMaxFailures=4)[source]

Bases: pyspark.ml.wrapper.JavaEstimator, pyspark.ml.util.MLReadable, pyspark.ml.util.JavaMLWritable

Class to train CatBoostRegressionModel

Methods Summary

fit(trainDataset[, evalDatasets])

Extended variant of the standard Estimator’s fit method that accepts CatBoost’s Pools and allows specifying additional datasets for computing evaluation metrics and overfitting detection, similarly to CatBoost’s other APIs.

getAllowConstLabel()

Returns

getAllowWritingFiles()

Returns

getApproxOnFullHistory()

Returns

getBaggingTemperature()

Returns

getBestModelMinTrees()

Returns

getBootstrapType()

Returns

getBorderCount()

Returns

getConnectTimeout()

Returns

getCustomMetric()

Returns

getDepth()

Returns

getDiffusionTemperature()

Returns

getEarlyStoppingRounds()

Returns

getEvalMetric()

Returns

getFeatureBorderType()

Returns

getFeatureWeightsList()

Returns

getFeatureWeightsMap()

Returns

getFeaturesCol()

Returns

getFirstFeatureUsePenaltiesList()

Returns

getFirstFeatureUsePenaltiesMap()

Returns

getFoldLenMultiplier()

Returns

getFoldPermutationBlock()

Returns

getHasTime()

Returns

getIgnoredFeaturesIndices()

Returns

getIgnoredFeaturesNames()

Returns

getInputBorders()

Returns

getIterations()

Returns

getL2LeafReg()

Returns

getLabelCol()

Returns

getLeafEstimationBacktracking()

Returns

getLeafEstimationIterations()

Returns

getLeafEstimationMethod()

Returns

getLearningRate()

Returns

getLoggingLevel()

Returns

getLossFunction()

Returns

getMetricPeriod()

Returns

getModelShrinkMode()

Returns

getModelShrinkRate()

Returns

getMvsReg()

Returns

getNanMode()

Returns

getOdPval()

Returns

getOdType()

Returns

getOdWait()

Returns

getOneHotMaxSize()

Returns

getPenaltiesCoefficient()

Returns

getPerFloatFeatureQuantizaton()

Returns

getPerObjectFeaturePenaltiesList()

Returns

getPerObjectFeaturePenaltiesMap()

Returns

getPredictionCol()

Returns

getRandomSeed()

Returns

getRandomStrength()

Returns

getRsm()

Returns

getSamplingFrequency()

Returns

getSamplingUnit()

Returns

getSaveSnapshot()

Returns

getScoreFunction()

Returns

getSnapshotFile()

Returns

getSnapshotInterval()

Returns

getSparkPartitionCount()

Returns

getSubsample()

Returns

getThreadCount()

Returns

getTrainDir()

Returns

getUseBestModel()

Returns

getWeightCol()

Returns

getWorkerInitializationTimeout()

Returns

getWorkerMaxFailures()

Returns

read()

Returns an MLReader instance for this class.

setAllowConstLabel(value)

Parameters

setAllowWritingFiles(value)

Parameters

setApproxOnFullHistory(value)

Parameters

setBaggingTemperature(value)

Parameters

setBestModelMinTrees(value)

Parameters

setBootstrapType(value)

Parameters

setBorderCount(value)

Parameters

setConnectTimeout(value)

Parameters

setCustomMetric(value)

Parameters

setDepth(value)

Parameters

setDiffusionTemperature(value)

Parameters

setEarlyStoppingRounds(value)

Parameters

setEvalMetric(value)

Parameters

setFeatureBorderType(value)

Parameters

setFeatureWeightsList(value)

Parameters

setFeatureWeightsMap(value)

Parameters

setFeaturesCol(value)

Parameters

setFirstFeatureUsePenaltiesList(value)

Parameters

setFirstFeatureUsePenaltiesMap(value)

Parameters

setFoldLenMultiplier(value)

Parameters

setFoldPermutationBlock(value)

Parameters

setHasTime(value)

Parameters

setIgnoredFeaturesIndices(value)

Parameters

setIgnoredFeaturesNames(value)

Parameters

setInputBorders(value)

Parameters

setIterations(value)

Parameters

setL2LeafReg(value)

Parameters

setLabelCol(value)

Parameters

setLeafEstimationBacktracking(value)

Parameters

setLeafEstimationIterations(value)

Parameters

setLeafEstimationMethod(value)

Parameters

setLearningRate(value)

Parameters

setLoggingLevel(value)

Parameters

setLossFunction(value)

Parameters

setMetricPeriod(value)

Parameters

setModelShrinkMode(value)

Parameters

setModelShrinkRate(value)

Parameters

setMvsReg(value)

Parameters

setNanMode(value)

Parameters

setOdPval(value)

Parameters

setOdType(value)

Parameters

setOdWait(value)

Parameters

setOneHotMaxSize(value)

Parameters

setParams([allowConstLabel, …])

Set the (keyword only) parameters

setPenaltiesCoefficient(value)

Parameters

setPerFloatFeatureQuantizaton(value)

Parameters

setPerObjectFeaturePenaltiesList(value)

Parameters

setPerObjectFeaturePenaltiesMap(value)

Parameters

setPredictionCol(value)

Parameters

setRandomSeed(value)

Parameters

setRandomStrength(value)

Parameters

setRsm(value)

Parameters

setSamplingFrequency(value)

Parameters

setSamplingUnit(value)

Parameters

setSaveSnapshot(value)

Parameters

setScoreFunction(value)

Parameters

setSnapshotFile(value)

Parameters

setSnapshotInterval(value)

Parameters

setSparkPartitionCount(value)

Parameters

setSubsample(value)

Parameters

setThreadCount(value)

Parameters

setTrainDir(value)

Parameters

setUseBestModel(value)

Parameters

setWeightCol(value)

Parameters

setWorkerInitializationTimeout(value)

Parameters

setWorkerMaxFailures(value)

Parameters

Methods Documentation

fit(trainDataset, evalDatasets=None)[source]

Extended variant of the standard Estimator’s fit method that accepts CatBoost’s Pools and allows specifying additional datasets for computing evaluation metrics and overfitting detection, similarly to CatBoost’s other APIs.

Parameters
trainDataset : Pool or DataFrame

The input training dataset.

evalDatasets : Pools, optional
The validation datasets used for the following processes:
  • overfitting detector
  • best iteration selection
  • monitoring metrics’ changes

Returns
trained model: CatBoostRegressionModel
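
A minimal sketch of calling fit with a training Pool and one validation Pool. It assumes that the catboost_spark package and a Spark session are available; train_df and eval_df are hypothetical DataFrames that already contain the default 'features' and 'label' columns, and the parameter values are arbitrary. The import sits inside the function so that the snippet can be loaded without Spark installed.

```python
# Illustrative parameter values only; they are not recommendations.
REGRESSOR_PARAMS = {"iterations": 200, "learningRate": 0.1, "depth": 6}


def train_regressor(train_df, eval_df, params=None):
    """Fit a CatBoostRegressor on train_df and validate on eval_df.

    train_df and eval_df are assumed to be Spark DataFrames with the
    default 'features' and 'label' columns (hypothetical inputs).
    """
    import catboost_spark  # lazy import: needs Spark and catboost_spark

    train_pool = catboost_spark.Pool(train_df)
    eval_pool = catboost_spark.Pool(eval_df)

    regressor = catboost_spark.CatBoostRegressor(**(params or REGRESSOR_PARAMS))
    # evalDatasets feeds the overfitting detector, best iteration selection
    # and metric monitoring listed above.
    return regressor.fit(train_pool, evalDatasets=[eval_pool])
```

The returned object is the trained CatBoostRegressionModel described under Returns.
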
getAllowConstLabel()[source]
Returns
bool

Use it to train models with datasets that have equal label values for all objects.

getAllowWritingFiles()[source]
Returns
bool

Allows writing analytical and snapshot files during training. Enabled by default.

getApproxOnFullHistory()[source]
Returns
bool

Use all the preceding rows in the fold for calculating the approximated values. This mode is slower and in rare cases slightly more accurate.

getBaggingTemperature()[source]
Returns
float

This parameter can be used if the selected bootstrap type is Bayesian. Possible values are in the range [0, +inf). The higher the value, the more aggressive the bagging is. Default value is 1.0.

getBestModelMinTrees()[source]
Returns
int

The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the best model is located within these trees. Should be used with the useBestModel parameter. No limit by default.

getBootstrapType()[source]
Returns
EBootstrapType

Bootstrap type. Defines the method for sampling the weights of objects. The default value depends on the selected mode and processing unit type: QueryCrossEntropy, YetiRankPairwise, PairLogitPairwise: Bernoulli with the subsample parameter set to 0.5. MultiClass and MultiClassOneVsAll: Bayesian. Other modes: MVS with the subsample parameter set to 0.8.
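
The default-selection rule above can be written out as a small lookup. This is only a transcription of the rule as stated in this entry, not CatBoost's actual code; the function name default_bootstrap is hypothetical, the processing unit type is ignored, and the real subsample default may also depend on dataset size (see getSubsample).

```python
def default_bootstrap(loss_function):
    """Return (bootstrap type name, default subsample or None).

    Transcribes the rule stated in the text; hypothetical helper.
    """
    # Drop optional metric parameters, e.g. "RMSE:use_weights=false" -> "RMSE".
    mode = loss_function.split(":", 1)[0]
    if mode in {"QueryCrossEntropy", "YetiRankPairwise", "PairLogitPairwise"}:
        return ("Bernoulli", 0.5)
    if mode in {"MultiClass", "MultiClassOneVsAll"}:
        return ("Bayesian", None)
    return ("MVS", 0.8)


print(default_bootstrap("RMSE"))
print(default_bootstrap("MultiClass"))
```
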

getBorderCount()[source]
Returns
int

The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusively. Default value is 254.

getConnectTimeout()[source]
Returns
datetime.timedelta

Timeout to wait while establishing socket connections between TrainingDriver and workers. Default is 1 minute.

getCustomMetric()[source]
Returns
list

Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

getDepth()[source]
Returns
int

Depth of the tree. Default value is 6.

getDiffusionTemperature()[source]
Returns
float

The diffusion temperature of the Stochastic Gradient Langevin Boosting mode. Only non-negative values are supported. Default value is 10000.

getEarlyStoppingRounds()[source]
Returns
int

Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
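
To make the stopping rule concrete, here is a simplified simulation of an Iter-style detector. It is an illustration under stated assumptions, not CatBoost's implementation: the function name is hypothetical, and the metric is assumed to be one where lower is better (for example RMSE).

```python
def iter_detector_stop(metric_values, early_stopping_rounds):
    """Return the 0-based iteration at which training would stop, or None.

    Training stops once `early_stopping_rounds` iterations have passed
    since the iteration with the best (lowest) metric value.
    """
    best_value = float("inf")
    best_iteration = -1
    for iteration, value in enumerate(metric_values):
        if value < best_value:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            return iteration
    return None


history = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
print(iter_detector_stop(history, early_stopping_rounds=3))  # 5
```

In this hypothetical history the best value is reached at iteration 2, so with three rounds of patience the loop stops at iteration 5.
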

getEvalMetric()[source]
Returns
str

The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

getFeatureBorderType()[source]
Returns
EBorderSelectionType

The quantization mode for numerical features. See documentation for details. Default value is ‘GreedyLogSum’

getFeatureWeightsList()[source]
Returns
list

Per-feature multiplication weights used when choosing the best split. Array indices correspond to feature indices. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsMap.

getFeatureWeightsMap()[source]
Returns
dict

Per-feature multiplication weights used when choosing the best split. Map is ‘feature_name’ -> weight. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsList.
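
Several parameters in this reference come as List/Map pairs that are marked mutually exclusive. The sketch below shows one way to check a parameter dictionary before passing it to the constructor. check_exclusive is a hypothetical helper, not part of catboost_spark, and the feature names in the example are made up.

```python
# Pairs that this reference describes as mutually exclusive.
MUTUALLY_EXCLUSIVE = [
    ("featureWeightsList", "featureWeightsMap"),
    ("firstFeatureUsePenaltiesList", "firstFeatureUsePenaltiesMap"),
    ("perObjectFeaturePenaltiesList", "perObjectFeaturePenaltiesMap"),
]


def check_exclusive(params):
    """Raise ValueError if both members of a List/Map pair are set."""
    for list_name, map_name in MUTUALLY_EXCLUSIVE:
        if params.get(list_name) is not None and params.get(map_name) is not None:
            raise ValueError(f"{list_name} and {map_name} are mutually exclusive")
    return params


# Index-based form: position i holds the weight of feature i.
by_index = check_exclusive({"featureWeightsList": [1.0, 0.5, 2.0]})
# Name-based form: hypothetical feature names mapped to weights.
by_name = check_exclusive({"featureWeightsMap": {"age": 0.5, "income": 2.0}})
```
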

getFeaturesCol()[source]
Returns
str

features column name

getFirstFeatureUsePenaltiesList()[source]
Returns
list

Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesMap.

getFirstFeatureUsePenaltiesMap()[source]
Returns
dict

Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesList.

getFoldLenMultiplier()[source]
Returns
float

Coefficient for changing the length of folds. The value must be greater than 1. The best validation result is achieved with minimum values. Default value is 2.0.

getFoldPermutationBlock()[source]
Returns
int

Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation. Default value is 1.

getHasTime()[source]
Returns
bool

Use the order of objects in the input data (do not perform random permutations during the ‘Choosing the tree structure’ stage).

getIgnoredFeaturesIndices()[source]
Returns
list

Feature indices to exclude from the training

getIgnoredFeaturesNames()[source]
Returns
list

Feature names to exclude from the training

getInputBorders()[source]
Returns
str

Load custom quantization borders and missing value modes from a file (do not generate them).

getIterations()[source]
Returns
int

The maximum number of trees that can be built when solving machine learning problems. When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter. Default value is 1000.

getL2LeafReg()[source]
Returns
float

Coefficient at the L2 regularization term of the cost function. Any positive value is allowed. Default value is 3.0.

getLabelCol()[source]
Returns
str

label column name

getLeafEstimationBacktracking()[source]
Returns
ELeavesEstimationStepBacktracking

When the value of the leafEstimationIterations parameter is greater than 1, CatBoost makes several gradient or Newton steps when calculating the resulting leaf values of a tree. The behaviour differs depending on the value of this parameter. See documentation for details. Default value is ‘AnyImprovement’.

getLeafEstimationIterations()[source]
Returns
int

CatBoost might calculate leaf values using several gradient or Newton steps instead of a single one. This parameter regulates how many steps are done in every tree when calculating leaf values.

getLeafEstimationMethod()[source]
Returns
ELeavesEstimation

The method used to calculate the values in leaves. See documentation for details.

getLearningRate()[source]
Returns
float

The learning rate. Used for reducing the gradient step. The default value is defined automatically for Logloss, MultiClass & RMSE loss functions depending on the number of iterations if none of ‘leaf_estimation_iterations’, ‘leaf_estimation_method’, ‘l2_leaf_reg’ is set. In this case, the selected learning rate is printed to stdout and saved in the model. In other cases, the default value is 0.03.

getLoggingLevel()[source]
Returns
ELoggingLevel

The logging level to output to stdout. See documentation for details. Default value is ‘Verbose’

getLossFunction()[source]
Returns
str

The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

getMetricPeriod()[source]
Returns
int

The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer. The usage of this parameter speeds up the training. Default value is 1.

getModelShrinkMode()[source]
Returns
EModelShrinkMode

Determines how the actual model shrinkage coefficient is calculated at each iteration. See documentation for details. Default value is ‘Constant’

getModelShrinkRate()[source]
Returns
float

The constant used to calculate the coefficient for multiplying the model on each iteration. See documentation for details.

getMvsReg()[source]
Returns
float

Affects the weight of the denominator and can be used for balancing between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling and setting it to +Inf implies Bernoulli sampling). Note: This parameter is supported only for the MVS sampling method.

getNanMode()[source]
Returns
ENanMode

The method for processing missing values in the input dataset. See documentation for details. Default value is ‘Min’

getOdPval()[source]
Returns
float

The threshold for the IncToDec overfitting detector type. The training is stopped when the specified value is reached. Requires a validation dataset to be provided. See documentation for details. Turned off by default.

getOdType()[source]
Returns
EOverfittingDetectorType

The type of the overfitting detector to use. See documentation for details. Default value is ‘IncToDec’

getOdWait()[source]
Returns
int

The number of iterations to continue the training after the iteration with the optimal metric value. See documentation for details. Default value is 20.

getOneHotMaxSize()[source]
Returns
int

Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.

getPenaltiesCoefficient()[source]
Returns
float

A single-value common coefficient to multiply all penalties. Non-negative values are supported. Default value is 1.0.

getPerFloatFeatureQuantizaton()[source]
Returns
list

The quantization description for the given list of features (one or more). Description format for a single feature: FeatureId[:border_count=BorderCount][:nan_mode=BorderType][:border_type=border_selection_method]
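
The description format above is a colon-separated string with optional key=value parts. The helper below assembles such strings; quantization_description is a hypothetical convenience function, not part of catboost_spark, and the example values are arbitrary.

```python
def quantization_description(feature_id, border_count=None, nan_mode=None,
                             border_type=None):
    """Build one per-feature description in the format shown above."""
    parts = [str(feature_id)]
    if border_count is not None:
        parts.append(f"border_count={border_count}")
    if nan_mode is not None:
        parts.append(f"nan_mode={nan_mode}")
    if border_type is not None:
        parts.append(f"border_type={border_type}")
    return ":".join(parts)


# A list with one description per feature that needs custom quantization.
descriptions = [
    quantization_description(0, border_count=1024),
    quantization_description(3, nan_mode="Min", border_type="Median"),
]
print(descriptions)
```
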

getPerObjectFeaturePenaltiesList()[source]
Returns
list

Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesMap.

getPerObjectFeaturePenaltiesMap()[source]
Returns
dict

Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesList.

getPredictionCol()[source]
Returns
str

prediction column name

getRandomSeed()[source]
Returns
int

The random seed used for training. Default value is 0.

getRandomStrength()[source]
Returns
float

The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. See documentation for details. Default value is 1.0

getRsm()[source]
Returns
float

Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. The value must be in the range (0, 1]. Default value is 1.

getSamplingFrequency()[source]
Returns
ESamplingFrequency

Frequency to sample weights and objects when building trees. Default value is ‘PerTreeLevel’

getSamplingUnit()[source]
Returns
ESamplingUnit

The sampling scheme, see documentation for details. Default value is ‘Object’

getSaveSnapshot()[source]
Returns
bool

Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshotInterval parameter to change this period.

getScoreFunction()[source]
Returns
EScoreFunction

The score type used to select the next split during the tree construction. See documentation for details. Default value is ‘Cosine’

getSnapshotFile()[source]
Returns
str

The name of the file to save the training progress information in. This file is used for recovering training after an interruption.

getSnapshotInterval()[source]
Returns
datetime.timedelta

The interval between saving snapshots. See documentation for details. Default value is 600 seconds.

getSparkPartitionCount()[source]
Returns
int

The number of partitions used during training. Corresponds to the number of active parallel tasks. Set to the number of active executors by default

getSubsample()[source]
Returns
float

Sample rate for bagging. The default value depends on the dataset size and the bootstrap type, see documentation for details.

getThreadCount()[source]
Returns
int

Number of CPU threads in parallel operations on client

getTrainDir()[source]
Returns
str

The directory on the driver node for storing the files generated during training. Default value is ‘catboost_info’.

getUseBestModel()[source]
Returns
bool

If this parameter is set, the number of trees that are saved in the resulting model is selected based on the optimal value of the evalMetric. This option requires a validation dataset to be provided.

getWeightCol()[source]
Returns
str

weight column name. If this is not set or empty, we treat all instance weights as 1.0

getWorkerInitializationTimeout()[source]
Returns
datetime.timedelta

Timeout to wait until CatBoost workers on Spark executors are initialized and have sent their info to the master. Depends on dataset size. Default is 10 minutes.

getWorkerMaxFailures()[source]
Returns
int

Number of individual CatBoost worker failures before giving up training. Should be greater than or equal to 1. Default is 4.

classmethod read()[source]

Returns an MLReader instance for this class.
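
Because the class derives from MLReadable and JavaMLWritable, an estimator's configuration can presumably be persisted with the standard pyspark.ml writer and reader. The sketch below assumes catboost_spark and a Spark session are available; save_and_reload and the path argument are hypothetical, and the import is deferred so the snippet loads without Spark.

```python
def save_and_reload(regressor, path):
    """Save an estimator's parameters to `path` and load them back.

    Uses the generic pyspark.ml persistence calls (write/overwrite/save
    and read/load); `path` is any location the Spark session can reach.
    """
    import catboost_spark  # lazy import: needs Spark and catboost_spark

    regressor.write().overwrite().save(path)
    return catboost_spark.CatBoostRegressor.read().load(path)
```
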

setAllowConstLabel(value)[source]
Parameters
value : bool

Use it to train models with datasets that have equal label values for all objects.

setAllowWritingFiles(value)[source]
Parameters
value : bool

Allows writing analytical and snapshot files during training. Enabled by default.

setApproxOnFullHistory(value)[source]
Parameters
value : bool

Use all the preceding rows in the fold for calculating the approximated values. This mode is slower and in rare cases slightly more accurate.

setBaggingTemperature(value)[source]
Parameters
value : float

This parameter can be used if the selected bootstrap type is Bayesian. Possible values are in the range [0, +inf). The higher the value, the more aggressive the bagging is. Default value is 1.0.

setBestModelMinTrees(value)[source]
Parameters
value : int

The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the best model is located within these trees. Should be used with the useBestModel parameter. No limit by default.

setBootstrapType(value)[source]
Parameters
value : EBootstrapType

Bootstrap type. Defines the method for sampling the weights of objects. The default value depends on the selected mode and processing unit type: QueryCrossEntropy, YetiRankPairwise, PairLogitPairwise: Bernoulli with the subsample parameter set to 0.5. MultiClass and MultiClassOneVsAll: Bayesian. Other modes: MVS with the subsample parameter set to 0.8.

setBorderCount(value)[source]
Parameters
value : int

The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusively. Default value is 254.

setConnectTimeout(value)[source]
Parameters
value : datetime.timedelta

Timeout to wait while establishing socket connections between TrainingDriver and workers. Default is 1 minute.

setCustomMetric(value)[source]
Parameters
value : list

Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

setDepth(value)[source]
Parameters
value : int

Depth of the tree. Default value is 6.

setDiffusionTemperature(value)[source]
Parameters
value : float

The diffusion temperature of the Stochastic Gradient Langevin Boosting mode. Only non-negative values are supported. Default value is 10000.

setEarlyStoppingRounds(value)[source]
Parameters
value : int

Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.

setEvalMetric(value)[source]
Parameters
value : str

The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

setFeatureBorderType(value)[source]
Parameters
value : EBorderSelectionType

The quantization mode for numerical features. See documentation for details. Default value is ‘GreedyLogSum’

setFeatureWeightsList(value)[source]
Parameters
value : list

Per-feature multiplication weights used when choosing the best split. Array indices correspond to feature indices. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsMap.

setFeatureWeightsMap(value)[source]
Parameters
value : dict

Per-feature multiplication weights used when choosing the best split. Map is ‘feature_name’ -> weight. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsList.

setFeaturesCol(value)[source]
Parameters
value : str

features column name

setFirstFeatureUsePenaltiesList(value)[source]
Parameters
value : list

Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesMap.

setFirstFeatureUsePenaltiesMap(value)[source]
Parameters
value : dict

Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesList.

setFoldLenMultiplier(value)[source]
Parameters
value : float

Coefficient for changing the length of folds. The value must be greater than 1. The best validation result is achieved with minimum values. Default value is 2.0.

setFoldPermutationBlock(value)[source]
Parameters
value : int

Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation. Default value is 1.

setHasTime(value)[source]
Parameters
value : bool

Use the order of objects in the input data (do not perform random permutations during the ‘Choosing the tree structure’ stage).

setIgnoredFeaturesIndices(value)[source]
Parameters
value : list

Feature indices to exclude from the training

setIgnoredFeaturesNames(value)[source]
Parameters
value : list

Feature names to exclude from the training

setInputBorders(value)[source]
Parameters
value : str

Load custom quantization borders and missing value modes from a file (do not generate them).

setIterations(value)[source]
Parameters
value : int

The maximum number of trees that can be built when solving machine learning problems. When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter. Default value is 1000.

setL2LeafReg(value)[source]
Parameters
value : float

Coefficient at the L2 regularization term of the cost function. Any positive value is allowed. Default value is 3.0.

setLabelCol(value)[source]
Parameters
value : str

label column name

setLeafEstimationBacktracking(value)[source]
Parameters
value : ELeavesEstimationStepBacktracking

When the value of the leafEstimationIterations parameter is greater than 1, CatBoost makes several gradient or Newton steps when calculating the resulting leaf values of a tree. The behaviour differs depending on the value of this parameter. See documentation for details. Default value is ‘AnyImprovement’.

setLeafEstimationIterations(value)[source]
Parameters
value : int

CatBoost might calculate leaf values using several gradient or Newton steps instead of a single one. This parameter regulates how many steps are done in every tree when calculating leaf values.

setLeafEstimationMethod(value)[source]
Parameters
value : ELeavesEstimation

The method used to calculate the values in leaves. See documentation for details.

setLearningRate(value)[source]
Parameters
value : float

The learning rate. Used for reducing the gradient step. The default value is defined automatically for Logloss, MultiClass & RMSE loss functions depending on the number of iterations if none of ‘leaf_estimation_iterations’, ‘leaf_estimation_method’, ‘l2_leaf_reg’ is set. In this case, the selected learning rate is printed to stdout and saved in the model. In other cases, the default value is 0.03.

setLoggingLevel(value)[source]
Parameters
value : ELoggingLevel

The logging level to output to stdout. See documentation for details. Default value is ‘Verbose’

setLossFunction(value)[source]
Parameters
value : str

The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

setMetricPeriod(value)[source]
Parameters
value : int

The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer. The usage of this parameter speeds up the training. Default value is 1.

setModelShrinkMode(value)[source]
Parameters
value : EModelShrinkMode

Determines how the actual model shrinkage coefficient is calculated at each iteration. See documentation for details. Default value is ‘Constant’

setModelShrinkRate(value)[source]
Parameters
value : float

The constant used to calculate the coefficient for multiplying the model on each iteration. See documentation for details.

setMvsReg(value)[source]
Parameters
value : float

Affects the weight of the denominator and can be used for balancing between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling and setting it to +Inf implies Bernoulli sampling). Note: This parameter is supported only for the MVS sampling method.

setNanMode(value)[source]
Parameters
value : ENanMode

The method for processing missing values in the input dataset. See documentation for details. Default value is ‘Min’

setOdPval(value)[source]
Parameters
value : float

The threshold for the IncToDec overfitting detector type. The training is stopped when the specified value is reached. Requires a validation dataset to be provided. See documentation for details. Turned off by default.

setOdType(value)[source]
Parameters
value : EOverfittingDetectorType

The type of the overfitting detector to use. See documentation for details. Default value is ‘IncToDec’

setOdWait(value)[source]
Parameters
value : int

The number of iterations to continue the training after the iteration with the optimal metric value. See documentation for details. Default value is 20.

setOneHotMaxSize(value)[source]
Parameters
value : int

Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.

setParams(allowConstLabel=None, allowWritingFiles=None, approxOnFullHistory=None, baggingTemperature=None, bestModelMinTrees=None, bootstrapType=None, borderCount=None, connectTimeout=datetime.timedelta(seconds=60), customMetric=None, depth=None, diffusionTemperature=None, earlyStoppingRounds=None, evalMetric=None, featureBorderType=None, featureWeightsList=None, featureWeightsMap=None, featuresCol='features', firstFeatureUsePenaltiesList=None, firstFeatureUsePenaltiesMap=None, foldLenMultiplier=None, foldPermutationBlock=None, hasTime=None, ignoredFeaturesIndices=None, ignoredFeaturesNames=None, inputBorders=None, iterations=None, l2LeafReg=None, labelCol='label', leafEstimationBacktracking=None, leafEstimationIterations=None, leafEstimationMethod=None, learningRate=None, loggingLevel=None, lossFunction=None, metricPeriod=None, modelShrinkMode=None, modelShrinkRate=None, mvsReg=None, nanMode=None, odPval=None, odType=None, odWait=None, oneHotMaxSize=None, penaltiesCoefficient=None, perFloatFeatureQuantizaton=None, perObjectFeaturePenaltiesList=None, perObjectFeaturePenaltiesMap=None, predictionCol='prediction', randomSeed=None, randomStrength=None, rsm=None, samplingFrequency=None, samplingUnit=None, saveSnapshot=None, scoreFunction=None, snapshotFile=None, snapshotInterval=None, sparkPartitionCount=None, subsample=None, threadCount=None, trainDir=None, useBestModel=None, weightCol=None, workerInitializationTimeout=datetime.timedelta(seconds=600), workerMaxFailures=4)[source]

Set the (keyword only) parameters

Parameters
allowConstLabelbool

Use it to train models with datasets that have equal label values for all objects.

allowWritingFilesbool

Allow writing analytical and snapshot files during training. Enabled by default.

approxOnFullHistorybool

Use all the preceding rows in the fold for calculating the approximated values. This mode is slower and in rare cases slightly more accurate.

baggingTemperaturefloat

This parameter can be used if the selected bootstrap type is Bayesian. Possible values are in the range [0, +inf). The higher the value, the more aggressive the bagging is. Default value is 1.0.

bestModelMinTreesint

The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the best model is located within these trees. Should be used with the useBestModel parameter. No limit by default.

bootstrapTypeEBootstrapType

Bootstrap type. Defines the method for sampling the weights of objects. The default value depends on the selected mode and processing unit type. QueryCrossEntropy, YetiRankPairwise, PairLogitPairwise: Bernoulli with the subsample parameter set to 0.5. MultiClass and MultiClassOneVsAll: Bayesian. Other modes: MVS with the subsample parameter set to 0.8.

borderCountint

The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusively. Default value is 254.

connectTimeoutdatetime.timedelta, default: datetime.timedelta(milliseconds=60000)

Timeout to wait while establishing socket connections between TrainingDriver and workers. Default is 1 minute.

customMetriclist

Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

depthint

Depth of the tree. Default value is 6.

diffusionTemperaturefloat

The diffusion temperature of the Stochastic Gradient Langevin Boosting mode. Only non-negative values are supported. Default value is 10000.

earlyStoppingRoundsint

Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.

evalMetricstr

The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

featureBorderTypeEBorderSelectionType

The quantization mode for numerical features. See documentation for details. Default value is ‘GreedyLogSum’

featureWeightsListlist

Per-feature multiplication weights used when choosing the best split. Array indices correspond to feature indices. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsMap.

featureWeightsMapdict

Per-feature multiplication weights used when choosing the best split. Map is ‘feature_name’ -> weight. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsList.

featuresColstr, default: “features”

features column name

firstFeatureUsePenaltiesListlist

Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesMap.

firstFeatureUsePenaltiesMapdict

Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesList.

foldLenMultiplierfloat

Coefficient for changing the length of folds. The value must be greater than 1. The best validation result is achieved with minimum values. Default value is 2.0.

foldPermutationBlockint

Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation. Default value is 1.

hasTimebool

Use the order of objects in the input data (do not perform random permutations during the ‘Choosing the tree structure’ stage).

ignoredFeaturesIndiceslist

Feature indices to exclude from the training

ignoredFeaturesNameslist

Feature names to exclude from the training

inputBordersstr

Load custom quantization borders and missing value modes from a file (do not generate them).

iterationsint

The maximum number of trees that can be built when solving machine learning problems. When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter. Default value is 1000.

l2LeafRegfloat

Coefficient at the L2 regularization term of the cost function. Any positive value is allowed. Default value is 3.0.

labelColstr, default: “label”

label column name

leafEstimationBacktrackingELeavesEstimationStepBacktracking

When the value of the leafEstimationIterations parameter is greater than 1, CatBoost makes several gradient or Newton steps when calculating the resulting leaf values of a tree. The behaviour differs depending on the value of this parameter. See documentation for details. Default value is ‘AnyImprovement’

leafEstimationIterationsint

CatBoost might calculate leaf values using several gradient or Newton steps instead of a single one. This parameter regulates how many steps are done in every tree when calculating leaf values.

leafEstimationMethodELeavesEstimation

The method used to calculate the values in leaves. See documentation for details.

learningRatefloat

The learning rate. Used for reducing the gradient step. The default value is defined automatically for the Logloss, MultiClass and RMSE loss functions depending on the number of iterations if none of ‘leaf_estimation_iterations’, ‘leaf_estimation_method’, ‘l2_leaf_reg’ is set. In this case, the selected learning rate is printed to stdout and saved in the model. In other cases, the default value is 0.03.

loggingLevelELoggingLevel

The logging level to output to stdout. See documentation for details. Default value is ‘Verbose’

lossFunctionstr

The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

metricPeriodint

The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer. The usage of this parameter speeds up the training. Default value is 1.

modelShrinkModeEModelShrinkMode

Determines how the actual model shrinkage coefficient is calculated at each iteration. See documentation for details. Default value is ‘Constant’

modelShrinkRatefloat

The constant used to calculate the coefficient for multiplying the model on each iteration. See documentation for details.

mvsRegfloat

Affects the weight of the denominator and can be used for balancing between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling, and setting it to +Inf implies Bernoulli sampling). Note: This parameter is supported only for the MVS sampling method.

nanModeENanMode

The method for processing missing values in the input dataset. See documentation for details. Default value is ‘Min’

odPvalfloat

The threshold for the IncToDec overfitting detector type. The training is stopped when the specified value is reached. Requires a validation dataset to be provided. See documentation for details. Turned off by default.

odTypeEOverfittingDetectorType

The type of the overfitting detector to use. See documentation for details. Default value is ‘IncToDec’

odWaitint

The number of iterations to continue the training after the iteration with the optimal metric value. See documentation for details. Default value is 20.

oneHotMaxSizeint

Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.

penaltiesCoefficientfloat

A single-value common coefficient to multiply all penalties. Non-negative values are supported. Default value is 1.0.

perFloatFeatureQuantizatonlist

The quantization description for the given list of features (one or more). Description format for a single feature: FeatureId[:border_count=BorderCount][:nan_mode=BorderType][:border_type=border_selection_method]

perObjectFeaturePenaltiesListlist

Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesMap.

perObjectFeaturePenaltiesMapdict

Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesList.

predictionColstr, default: “prediction”

prediction column name

randomSeedint

The random seed used for training. Default value is 0.

randomStrengthfloat

The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. See documentation for details. Default value is 1.0

rsmfloat

Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. The value must be in the range (0, 1]. Default value is 1.

samplingFrequencyESamplingFrequency

Frequency to sample weights and objects when building trees. Default value is ‘PerTreeLevel’

samplingUnitESamplingUnit

The sampling scheme, see documentation for details. Default value is ‘Object’

saveSnapshotbool

Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshotInterval parameter to change this period.

scoreFunctionEScoreFunction

The score type used to select the next split during the tree construction. See documentation for details. Default value is ‘Cosine’

snapshotFilestr

The name of the file to save the training progress information in. This file is used for recovering training after an interruption.

snapshotIntervaldatetime.timedelta

The interval between saving snapshots. See documentation for details. Default value is 600 seconds.

sparkPartitionCountint

The number of partitions used during training. Corresponds to the number of active parallel tasks. Set to the number of active executors by default

subsamplefloat

Sample rate for bagging. The default value depends on the dataset size and the bootstrap type, see documentation for details.

threadCountint

Number of CPU threads in parallel operations on client

trainDirstr

The directory for storing the files on Driver node generated during training. Default value is ‘catboost_info’

useBestModelbool

If this parameter is set, the number of trees that are saved in the resulting model is selected based on the optimal value of the evalMetric. This option requires a validation dataset to be provided.

weightColstr

weight column name. If this is not set or empty, we treat all instance weights as 1.0

workerInitializationTimeoutdatetime.timedelta, default: datetime.timedelta(milliseconds=600000)

Timeout to wait until CatBoost workers on Spark executors are initialized and have sent their info to master. Depends on dataset size. Default is 10 minutes.

workerMaxFailuresint, default: 4

Number of individual CatBoost worker failures before giving up training. Should be greater than or equal to 1. Default is 4.
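Several of the parameters above come in List/Map pairs that are documented as mutually exclusive. The following is a minimal sketch of a pre-check that could be run on a keyword dictionary before calling setParams. The helper `check_exclusive_pairs` and the example values are hypothetical; only the parameter names come from the list above.

```python
from typing import Any, Dict, List, Tuple

# Pairs documented above as mutually exclusive.
EXCLUSIVE_PAIRS: List[Tuple[str, str]] = [
    ("featureWeightsList", "featureWeightsMap"),
    ("firstFeatureUsePenaltiesList", "firstFeatureUsePenaltiesMap"),
    ("perObjectFeaturePenaltiesList", "perObjectFeaturePenaltiesMap"),
]


def check_exclusive_pairs(params: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return the pairs for which both members are set (not None)."""
    return [
        (a, b)
        for a, b in EXCLUSIVE_PAIRS
        if params.get(a) is not None and params.get(b) is not None
    ]


# Hypothetical parameter dictionaries.
ok_params = {"iterations": 500, "featureWeightsMap": {"age": 2.0}}
bad_params = {
    "featureWeightsList": [1.0, 2.0],
    "featureWeightsMap": {"age": 2.0},
}

conflicts_ok = check_exclusive_pairs(ok_params)
conflicts_bad = check_exclusive_pairs(bad_params)
```

If no conflicts are reported, the dictionary could then be passed as keyword arguments, for example `regressor.setParams(**ok_params)`, where `regressor` is assumed to be an existing CatBoostRegressor instance.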

setPenaltiesCoefficient(value)[source]
Parameters
valuefloat

A single-value common coefficient to multiply all penalties. Non-negative values are supported. Default value is 1.0.

setPerFloatFeatureQuantizaton(value)[source]
Parameters
valuelist

The quantization description for the given list of features (one or more). Description format for a single feature: FeatureId[:border_count=BorderCount][:nan_mode=BorderType][:border_type=border_selection_method]
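The description format above can be assembled programmatically. The sketch below builds such strings from optional parts; `quantization_description` is a hypothetical helper, and the example values (‘Min’, ‘GreedyLogSum’, 1024) are only illustrative choices taken from the ranges and defaults mentioned elsewhere in this reference.

```python
from typing import Optional


def quantization_description(
    feature_id: int,
    border_count: Optional[int] = None,
    nan_mode: Optional[str] = None,
    border_type: Optional[str] = None,
) -> str:
    """Build 'FeatureId[:border_count=..][:nan_mode=..][:border_type=..]'."""
    parts = [str(feature_id)]
    if border_count is not None:
        parts.append(f"border_count={border_count}")
    if nan_mode is not None:
        parts.append(f"nan_mode={nan_mode}")
    if border_type is not None:
        parts.append(f"border_type={border_type}")
    return ":".join(parts)


descriptions = [
    quantization_description(0, border_count=1024),
    quantization_description(3, nan_mode="Min", border_type="GreedyLogSum"),
]
```

The resulting list of strings is the kind of value this setter expects, for example `regressor.setPerFloatFeatureQuantizaton(descriptions)` on an assumed CatBoostRegressor instance named `regressor`.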

setPerObjectFeaturePenaltiesList(value)[source]
Parameters
valuelist

Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesMap.

setPerObjectFeaturePenaltiesMap(value)[source]
Parameters
valuedict

Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesList.
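The List and Map variants carry the same information in two shapes: one indexed by feature position, the other keyed by feature name. The sketch below converts a name-keyed map into an index-aligned list. The feature names and the helper `penalties_map_to_list` are hypothetical, and using 0.0 for features missing from the map is an assumption of this sketch, not documented behaviour.

```python
from typing import Dict, List, Sequence


def penalties_map_to_list(
    feature_names: Sequence[str],
    penalties: Dict[str, float],
    default: float = 0.0,
) -> List[float]:
    """Align a 'feature_name' -> penalty map with the feature order.

    Features absent from the map get `default` (an assumption of this sketch).
    """
    unknown = set(penalties) - set(feature_names)
    if unknown:
        raise KeyError(f"unknown feature names: {sorted(unknown)}")
    return [penalties.get(name, default) for name in feature_names]


# Hypothetical feature order of the 'features' column.
names = ["age", "income", "zip_code"]
as_list = penalties_map_to_list(names, {"zip_code": 0.5})
```

Only one of the two forms should be set on the estimator, since the parameters are mutually exclusive.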

setPredictionCol(value)[source]
Parameters
valuestr

prediction column name

setRandomSeed(value)[source]
Parameters
valueint

The random seed used for training. Default value is 0.

setRandomStrength(value)[source]
Parameters
valuefloat

The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. See documentation for details. Default value is 1.0

setRsm(value)[source]
Parameters
valuefloat

Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. The value must be in the range (0, 1]. Default value is 1.

setSamplingFrequency(value)[source]
Parameters
valueESamplingFrequency

Frequency to sample weights and objects when building trees. Default value is ‘PerTreeLevel’

setSamplingUnit(value)[source]
Parameters
valueESamplingUnit

The sampling scheme, see documentation for details. Default value is ‘Object’

setSaveSnapshot(value)[source]
Parameters
valuebool

Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshotInterval parameter to change this period.

setScoreFunction(value)[source]
Parameters
valueEScoreFunction

The score type used to select the next split during the tree construction. See documentation for details. Default value is ‘Cosine’

setSnapshotFile(value)[source]
Parameters
valuestr

The name of the file to save the training progress information in. This file is used for recovering training after an interruption.

setSnapshotInterval(value)[source]
Parameters
valuedatetime.timedelta

The interval between saving snapshots. See documentation for details. Default value is 600 seconds.

setSparkPartitionCount(value)[source]
Parameters
valueint

The number of partitions used during training. Corresponds to the number of active parallel tasks. Set to the number of active executors by default

setSubsample(value)[source]
Parameters
valuefloat

Sample rate for bagging. The default value depends on the dataset size and the bootstrap type, see documentation for details.

setThreadCount(value)[source]
Parameters
valueint

Number of CPU threads in parallel operations on client

setTrainDir(value)[source]
Parameters
valuestr

The directory for storing the files on Driver node generated during training. Default value is ‘catboost_info’

setUseBestModel(value)[source]
Parameters
valuebool

If this parameter is set, the number of trees that are saved in the resulting model is selected based on the optimal value of the evalMetric. This option requires a validation dataset to be provided.
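The selection described above can be sketched as picking the iteration with the best validation metric and keeping that many trees. The sketch assumes a lower-is-better metric and a made-up metric history; `best_tree_count` is a hypothetical helper, and its `min_trees` argument follows the description of the bestModelMinTrees parameter rather than CatBoost's actual implementation.

```python
from typing import Sequence


def best_tree_count(eval_metric_values: Sequence[float], min_trees: int = 0) -> int:
    """Number of trees to keep, given one metric value per iteration.

    Assumes lower metric values are better. Raises ValueError on empty input.
    """
    best_iter = min(
        range(len(eval_metric_values)), key=eval_metric_values.__getitem__
    )
    # Keep at least `min_trees`, but never more trees than were built.
    return min(len(eval_metric_values), max(best_iter + 1, min_trees))


# Made-up validation metric history, one value per tree.
history = [0.9, 0.6, 0.5, 0.55, 0.58]
kept = best_tree_count(history)
kept_with_minimum = best_tree_count(history, min_trees=4)
```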

setWeightCol(value)[source]
Parameters
valuestr

weight column name. If this is not set or empty, we treat all instance weights as 1.0

setWorkerInitializationTimeout(value)[source]
Parameters
valuedatetime.timedelta

Timeout to wait until CatBoost workers on Spark executors are initialized and have sent their info to master. Depends on dataset size. Default is 10 minutes.
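The default appears in this reference both as `datetime.timedelta(seconds=600)` and as `datetime.timedelta(milliseconds=600000)`. A short check with the standard library shows that these spellings denote the same duration:

```python
import datetime

ten_minutes = datetime.timedelta(minutes=10)
same_in_seconds = datetime.timedelta(seconds=600)
same_in_millis = datetime.timedelta(milliseconds=600000)

# timedelta normalizes its arguments, so all three compare equal.
all_equal = ten_minutes == same_in_seconds == same_in_millis
total_seconds = ten_minutes.total_seconds()
```

For a larger dataset a longer timeout could be passed in whichever unit is most readable, for example `regressor.setWorkerInitializationTimeout(datetime.timedelta(minutes=30))`, where `regressor` is assumed to be an existing CatBoostRegressor instance.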

setWorkerMaxFailures(value)[source]
Parameters
valueint

Number of individual CatBoost worker failures before giving up training. Should be greater than or equal to 1. Default is 4.