CatBoostRegressor

class catboost_spark.CatBoostRegressor(allowConstLabel=None, allowWritingFiles=None, approxOnFullHistory=None, baggingTemperature=None, bestModelMinTrees=None, bootstrapType=None, borderCount=None, connectTimeout=datetime.timedelta(seconds=60), customMetric=None, depth=None, diffusionTemperature=None, earlyStoppingRounds=None, evalMetric=None, featureBorderType=None, featureWeightsList=None, featureWeightsMap=None, featuresCol='features', firstFeatureUsePenaltiesList=None, firstFeatureUsePenaltiesMap=None, foldLenMultiplier=None, foldPermutationBlock=None, hasTime=None, ignoredFeaturesIndices=None, ignoredFeaturesNames=None, inputBorders=None, iterations=None, l2LeafReg=None, labelCol='label', leafEstimationBacktracking=None, leafEstimationIterations=None, leafEstimationMethod=None, learningRate=None, loggingLevel=None, lossFunction=None, metricPeriod=None, modelShrinkMode=None, modelShrinkRate=None, mvsReg=None, nanMode=None, odPval=None, odType=None, odWait=None, oneHotMaxSize=None, penaltiesCoefficient=None, perFloatFeatureQuantizaton=None, perObjectFeaturePenaltiesList=None, perObjectFeaturePenaltiesMap=None, predictionCol='prediction', randomSeed=None, randomStrength=None, rsm=None, samplingFrequency=None, samplingUnit=None, saveSnapshot=None, scoreFunction=None, snapshotFile=None, snapshotInterval=None, sparkPartitionCount=None, subsample=None, threadCount=None, trainDir=None, useBestModel=None, weightCol=None, workerInitializationTimeout=datetime.timedelta(seconds=600), workerMaxFailures=4)

Bases: pyspark.ml.wrapper.JavaEstimator, pyspark.ml.util.MLReadable, pyspark.ml.util.JavaMLWritable

Class to train CatBoostRegressionModel

Methods Summary

`fit`(trainDataset[, evalDatasets])	Extended variant of the standard Estimator’s fit method that accepts CatBoost’s Pools and allows specifying additional datasets for computing evaluation metrics and overfitting detection, similarly to CatBoost’s other APIs.
`getAllowConstLabel`()	Returns
`getAllowWritingFiles`()	Returns
`getApproxOnFullHistory`()	Returns
`getBaggingTemperature`()	Returns
`getBestModelMinTrees`()	Returns
`getBootstrapType`()	Returns
`getBorderCount`()	Returns
`getConnectTimeout`()	Returns
`getCustomMetric`()	Returns
`getDepth`()	Returns
`getDiffusionTemperature`()	Returns
`getEarlyStoppingRounds`()	Returns
`getEvalMetric`()	Returns
`getFeatureBorderType`()	Returns
`getFeatureWeightsList`()	Returns
`getFeatureWeightsMap`()	Returns
`getFeaturesCol`()	Returns
`getFirstFeatureUsePenaltiesList`()	Returns
`getFirstFeatureUsePenaltiesMap`()	Returns
`getFoldLenMultiplier`()	Returns
`getFoldPermutationBlock`()	Returns
`getHasTime`()	Returns
`getIgnoredFeaturesIndices`()	Returns
`getIgnoredFeaturesNames`()	Returns
`getInputBorders`()	Returns
`getIterations`()	Returns
`getL2LeafReg`()	Returns
`getLabelCol`()	Returns
`getLeafEstimationBacktracking`()	Returns
`getLeafEstimationIterations`()	Returns
`getLeafEstimationMethod`()	Returns
`getLearningRate`()	Returns
`getLoggingLevel`()	Returns
`getLossFunction`()	Returns
`getMetricPeriod`()	Returns
`getModelShrinkMode`()	Returns
`getModelShrinkRate`()	Returns
`getMvsReg`()	Returns
`getNanMode`()	Returns
`getOdPval`()	Returns
`getOdType`()	Returns
`getOdWait`()	Returns
`getOneHotMaxSize`()	Returns
`getPenaltiesCoefficient`()	Returns
`getPerFloatFeatureQuantizaton`()	Returns
`getPerObjectFeaturePenaltiesList`()	Returns
`getPerObjectFeaturePenaltiesMap`()	Returns
`getPredictionCol`()	Returns
`getRandomSeed`()	Returns
`getRandomStrength`()	Returns
`getRsm`()	Returns
`getSamplingFrequency`()	Returns
`getSamplingUnit`()	Returns
`getSaveSnapshot`()	Returns
`getScoreFunction`()	Returns
`getSnapshotFile`()	Returns
`getSnapshotInterval`()	Returns
`getSparkPartitionCount`()	Returns
`getSubsample`()	Returns
`getThreadCount`()	Returns
`getTrainDir`()	Returns
`getUseBestModel`()	Returns
`getWeightCol`()	Returns
`getWorkerInitializationTimeout`()	Returns
`getWorkerMaxFailures`()	Returns
`read`()	Returns an MLReader instance for this class.
`setAllowConstLabel`(value)	Parameters
`setAllowWritingFiles`(value)	Parameters
`setApproxOnFullHistory`(value)	Parameters
`setBaggingTemperature`(value)	Parameters
`setBestModelMinTrees`(value)	Parameters
`setBootstrapType`(value)	Parameters
`setBorderCount`(value)	Parameters
`setConnectTimeout`(value)	Parameters
`setCustomMetric`(value)	Parameters
`setDepth`(value)	Parameters
`setDiffusionTemperature`(value)	Parameters
`setEarlyStoppingRounds`(value)	Parameters
`setEvalMetric`(value)	Parameters
`setFeatureBorderType`(value)	Parameters
`setFeatureWeightsList`(value)	Parameters
`setFeatureWeightsMap`(value)	Parameters
`setFeaturesCol`(value)	Parameters
`setFirstFeatureUsePenaltiesList`(value)	Parameters
`setFirstFeatureUsePenaltiesMap`(value)	Parameters
`setFoldLenMultiplier`(value)	Parameters
`setFoldPermutationBlock`(value)	Parameters
`setHasTime`(value)	Parameters
`setIgnoredFeaturesIndices`(value)	Parameters
`setIgnoredFeaturesNames`(value)	Parameters
`setInputBorders`(value)	Parameters
`setIterations`(value)	Parameters
`setL2LeafReg`(value)	Parameters
`setLabelCol`(value)	Parameters
`setLeafEstimationBacktracking`(value)	Parameters
`setLeafEstimationIterations`(value)	Parameters
`setLeafEstimationMethod`(value)	Parameters
`setLearningRate`(value)	Parameters
`setLoggingLevel`(value)	Parameters
`setLossFunction`(value)	Parameters
`setMetricPeriod`(value)	Parameters
`setModelShrinkMode`(value)	Parameters
`setModelShrinkRate`(value)	Parameters
`setMvsReg`(value)	Parameters
`setNanMode`(value)	Parameters
`setOdPval`(value)	Parameters
`setOdType`(value)	Parameters
`setOdWait`(value)	Parameters
`setOneHotMaxSize`(value)	Parameters
`setParams`([allowConstLabel, …])	Set the (keyword only) parameters
`setPenaltiesCoefficient`(value)	Parameters
`setPerFloatFeatureQuantizaton`(value)	Parameters
`setPerObjectFeaturePenaltiesList`(value)	Parameters
`setPerObjectFeaturePenaltiesMap`(value)	Parameters
`setPredictionCol`(value)	Parameters
`setRandomSeed`(value)	Parameters
`setRandomStrength`(value)	Parameters
`setRsm`(value)	Parameters
`setSamplingFrequency`(value)	Parameters
`setSamplingUnit`(value)	Parameters
`setSaveSnapshot`(value)	Parameters
`setScoreFunction`(value)	Parameters
`setSnapshotFile`(value)	Parameters
`setSnapshotInterval`(value)	Parameters
`setSparkPartitionCount`(value)	Parameters
`setSubsample`(value)	Parameters
`setThreadCount`(value)	Parameters
`setTrainDir`(value)	Parameters
`setUseBestModel`(value)	Parameters
`setWeightCol`(value)	Parameters
`setWorkerInitializationTimeout`(value)	Parameters
`setWorkerMaxFailures`(value)	Parameters

Methods Documentation

fit(trainDataset, evalDatasets=None)

Extended variant of the standard Estimator’s fit method that accepts CatBoost’s Pools and allows specifying additional datasets for computing evaluation metrics and overfitting detection, similarly to CatBoost’s other APIs.

Parameters

trainDataset (Pool or DataFrame)

The input training dataset.

evalDatasets (Pools, optional)

The validation datasets used for the following processes:

overfitting detector
best iteration selection
monitoring metrics’ changes

Returns

trained model: CatBoostRegressionModel
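
A minimal usage sketch, assuming `train_df` and `eval_df` are Spark DataFrames that already contain a `features` vector column and a `label` column. The function name, variable names and parameter values are illustrative assumptions, not recommendations. The `catboost_spark` import is kept inside the function so the parameter dictionary can be inspected without a running Spark session.

```python
# Illustrative settings only; every key is a constructor argument of CatBoostRegressor.
REGRESSOR_PARAMS = {
    "iterations": 200,
    "learningRate": 0.1,
    "depth": 6,
    "lossFunction": "RMSE",
    "earlyStoppingRounds": 20,
}


def train_regressor(train_df, eval_df):
    """Fit a CatBoostRegressor on train_df, using eval_df as a validation dataset."""
    import catboost_spark  # needs CatBoost for Apache Spark available to the session

    train_pool = catboost_spark.Pool(train_df)
    eval_pool = catboost_spark.Pool(eval_df)

    regressor = catboost_spark.CatBoostRegressor(**REGRESSOR_PARAMS)
    # evalDatasets feeds the overfitting detector and best iteration selection.
    model = regressor.fit(train_pool, evalDatasets=[eval_pool])

    # The trained CatBoostRegressionModel is applied like any Spark Transformer.
    return model.transform(eval_pool.data)
```

Passing a plain DataFrame as `trainDataset` is also accepted according to the parameter description above; wrapping it in a `Pool` is shown here because `evalDatasets` expects Pools.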

getAllowConstLabel()

Returns

bool: Use it to train models with datasets that have equal label values for all objects.

getAllowWritingFiles()

Returns

bool: Allow writing analytical and snapshot files during training. Enabled by default.

getApproxOnFullHistory()

Returns

bool: Use all the preceding rows in the fold for calculating the approximated values. This mode is slower and in rare cases slightly more accurate.

getBaggingTemperature()

Returns

float: This parameter can be used if the selected bootstrap type is Bayesian. Possible values are in the range [0, +inf). The higher the value, the more aggressive the bagging is. Default value is 1.0.
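
As a rough illustration of how the temperature shapes Bayesian bootstrap weights, the sketch below assumes the weight form w = (-ln u)^t with u drawn uniformly from (0, 1]. The function name is hypothetical and this is not CatBoost's internal code.

```python
import math
import random


def bayesian_bootstrap_weights(n_objects, bagging_temperature, seed=0):
    """Sample one weight per object as (-ln u) ** t, with u uniform on (0, 1]."""
    rng = random.Random(seed)
    weights = []
    for _ in range(n_objects):
        u = 1.0 - rng.random()  # rng.random() is in [0, 1), so u is in (0, 1]
        weights.append((-math.log(u)) ** bagging_temperature)
    return weights


# A temperature of 0 turns every weight into 1.0, i.e. no bagging at all.
print(bayesian_bootstrap_weights(3, 0.0))  # [1.0, 1.0, 1.0]
```

With a temperature of 1 the weights follow an exponential distribution, and larger temperatures spread them further apart.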

getBestModelMinTrees()

Returns

int: The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the best model is located within these trees. Should be used with the useBestModel parameter. No limit by default.

getBootstrapType()

Returns

EBootstrapType: Bootstrap type. Defines the method for sampling the weights of objects. The default value depends on the selected mode and processing unit type. For QueryCrossEntropy, YetiRankPairwise and PairLogitPairwise: Bernoulli with the subsample parameter set to 0.5. For MultiClass and MultiClassOneVsAll: Bayesian. For other modes: MVS with the subsample parameter set to 0.8.

getBorderCount()

Returns

int: The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusive. Default value is 254.
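
To make "number of splits" concrete, here is a small sketch of mapping a numerical value to a bin given a sorted list of borders. It is an illustration only: the `quantize` helper is hypothetical, and the treatment of a value that lands exactly on a border is an assumption.

```python
from bisect import bisect_left


def quantize(value, borders):
    """Return the bin index of value: the number of borders strictly below it."""
    return bisect_left(borders, value)


borders = [0.5, 1.5, 2.5]  # 3 borders split the axis into 4 bins
print([quantize(v, borders) for v in (0.0, 1.0, 2.0, 3.0)])  # [0, 1, 2, 3]
```

N borders always yield N + 1 bins, so a larger border count gives finer-grained candidate splits for each numerical feature.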

getConnectTimeout()

Returns

datetime.timedelta: Timeout to wait while establishing socket connections between TrainingDriver and workers. Default is 1 minute.
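
Timeouts are expressed as `datetime.timedelta` values, so the same duration can be written in several equivalent ways. The short sketch below (variable names are illustrative) shows the defaults from the class signature in different units.

```python
from datetime import timedelta

connect_timeout = timedelta(seconds=60)               # default in the class signature
same_in_milliseconds = timedelta(milliseconds=60000)  # identical duration
worker_init_timeout = timedelta(minutes=10)           # equals timedelta(seconds=600)

print(connect_timeout == same_in_milliseconds)        # True
print(worker_init_timeout.total_seconds())            # 600.0
```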

getCustomMetric()

Returns

list: Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

getDepth()

Returns

int: Depth of the tree. Default value is 6.

getDiffusionTemperature()

Returns

float: The diffusion temperature of the Stochastic Gradient Langevin Boosting mode. Only non-negative values are supported. Default value is 10000.

getEarlyStoppingRounds()

Returns

int: Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
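
The stopping rule described above can be sketched in a few lines. This is a simplified illustration with a hypothetical function name, assuming one metric value per iteration; CatBoost's exact bookkeeping may differ.

```python
def iter_detector_stop_iteration(metric_values, wait, lower_is_better=True):
    """Return the 0-based iteration at which training stops, or None if it never does."""
    best_value = None
    best_iteration = -1
    for iteration, value in enumerate(metric_values):
        improved = best_value is None or (
            value < best_value if lower_is_better else value > best_value
        )
        if improved:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= wait:
            # `wait` iterations have passed since the best metric value.
            return iteration
    return None


history = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.05]
print(iter_detector_stop_iteration(history, wait=3))  # 5
```

In the example the best value is reached at iteration 2, and training stops three iterations later because no later value improves on it.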

getEvalMetric()

Returns

str: The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

getFeatureBorderType()

Returns

EBorderSelectionType: The quantization mode for numerical features. See documentation for details. Default value is ‘GreedyLogSum’

getFeatureWeightsList()

Returns

list: Per-feature multiplication weights used when choosing the best split. Array indices correspond to feature indices. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsMap.

getFeatureWeightsMap()

Returns

dict: Per-feature multiplication weights used when choosing the best split. Map is ‘feature_name’ -> weight. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsList.
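
The list and map forms carry the same information in different layouts. The hypothetical helper below sketches the correspondence, assuming that features absent from the map keep a neutral weight of 1.0.

```python
def feature_weights_map_to_list(feature_names, weights_by_name, default_weight=1.0):
    """Return per-index weights; features missing from the map get default_weight."""
    unknown = set(weights_by_name) - set(feature_names)
    if unknown:
        raise KeyError(f"unknown feature names: {sorted(unknown)}")
    return [weights_by_name.get(name, default_weight) for name in feature_names]


print(feature_weights_map_to_list(["age", "income", "zip"], {"income": 0.5}))
# [1.0, 0.5, 1.0]
```

Because the two parameters are mutually exclusive, only one of the two representations should be passed to the estimator.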

getFeaturesCol()

Returns

str: features column name

getFirstFeatureUsePenaltiesList()

Returns

list: Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesMap.

getFirstFeatureUsePenaltiesMap()

Returns

dict: Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesList.

getFoldLenMultiplier()

Returns

float: Coefficient for changing the length of folds. The value must be greater than 1. The best validation result is achieved with minimum values. Default value is 2.0.

getFoldPermutationBlock()

Returns

int: Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation. Default value is 1.

getHasTime()

Returns

bool: Use the order of objects in the input data (do not perform random permutations during the ‘Choosing the tree structure’ stage).

getIgnoredFeaturesIndices()

Returns

list: Feature indices to exclude from the training

getIgnoredFeaturesNames()

Returns

list: Feature names to exclude from the training

getInputBorders()

Returns

str: Load custom quantization borders and missing value modes from a file (do not generate them).

getIterations()

Returns

int: The maximum number of trees that can be built when solving machine learning problems. When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter. Default value is 1000.

getL2LeafReg()

Returns

float: Coefficient of the L2 regularization term of the cost function. Any positive value is allowed. Default value is 3.0.

getLabelCol()

Returns

str: label column name

getLeafEstimationBacktracking()

Returns

ELeavesEstimationStepBacktracking: When the value of the leafEstimationIterations parameter is greater than 1, CatBoost makes several gradient or Newton steps when calculating the resulting leaf values of a tree. The behaviour differs depending on the value of this parameter. See documentation for details. Default value is ‘AnyImprovement’.

getLeafEstimationIterations()

Returns

int: CatBoost might calculate leaf values using several gradient or Newton steps instead of a single one. This parameter regulates how many steps are done in every tree when calculating leaf values.

getLeafEstimationMethod()

Returns

ELeavesEstimation: The method used to calculate the values in leaves. See documentation for details.

getLearningRate()

Returns

float: The learning rate. Used for reducing the gradient step. The default value is defined automatically for Logloss, MultiClass & RMSE loss functions depending on the number of iterations if none of ‘leaf_estimation_iterations’, ‘leaf_estimation_method’, ‘l2_leaf_reg’ is set. In this case, the selected learning rate is printed to stdout and saved in the model. In other cases, the default value is 0.03.

getLoggingLevel()

Returns

ELoggingLevel: The logging level to output to stdout. See documentation for details. Default value is ‘Verbose’

getLossFunction()

Returns

str: The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

getMetricPeriod()

Returns

int: The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer. The usage of this parameter speeds up the training. Default value is 1.

getModelShrinkMode()

Returns

EModelShrinkMode: Determines how the actual model shrinkage coefficient is calculated at each iteration. See documentation for details. Default value is ‘Constant’

getModelShrinkRate()

Returns

float: The constant used to calculate the coefficient for multiplying the model on each iteration. See documentation for details.

getMvsReg()

Returns

float: Affects the weight of the denominator and can be used for balancing between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling; setting it to +Inf implies Bernoulli sampling). Note: This parameter is supported only for the MVS sampling method.

getNanMode()

Returns

ENanMode: The method for processing missing values in the input dataset. See documentation for details. Default value is ‘Min’

getOdPval()

Returns

float: The threshold for the IncToDec overfitting detector type. The training is stopped when the specified value is reached. Requires a validation dataset to be provided. See documentation for details. Turned off by default.

getOdType()

Returns

EOverfittingDetectorType: The type of the overfitting detector to use. See documentation for details. Default value is ‘IncToDec’

getOdWait()

Returns

int: The number of iterations to continue the training after the iteration with the optimal metric value. See documentation for details. Default value is 20.

getOneHotMaxSize()

Returns

int: Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.
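
The rule above reduces to a simple threshold on the number of distinct values per categorical feature. The sketch below uses a hypothetical helper and plain Python lists in place of real dataset columns.

```python
def one_hot_encoded_features(categorical_columns, one_hot_max_size):
    """Return names of categorical features whose distinct-value count fits the limit."""
    return [
        name
        for name, values in categorical_columns.items()
        if len(set(values)) <= one_hot_max_size
    ]


columns = {
    "color": ["red", "green", "blue", "red"],  # 3 distinct values
    "city": ["a", "b", "c", "d", "e"],         # 5 distinct values
}
print(one_hot_encoded_features(columns, one_hot_max_size=3))  # ['color']
```

Features that exceed the limit, such as `city` here, are the ones for which Ctrs would be calculated instead.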

getPenaltiesCoefficient()

Returns

float: A single-value common coefficient to multiply all penalties. Non-negative values are supported. Default value is 1.0.

getPerFloatFeatureQuantizaton()

Returns

list: The quantization description for the given list of features (one or more). Description format for a single feature: FeatureId[:border_count=BorderCount][:nan_mode=BorderType][:border_type=border_selection_method]
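
Description strings in the format above can be assembled programmatically. The helper below is a hypothetical convenience function, not part of `catboost_spark`; it only concatenates the optional fields in the documented order.

```python
def quantization_description(feature_id, border_count=None, nan_mode=None, border_type=None):
    """Build 'FeatureId[:border_count=..][:nan_mode=..][:border_type=..]'."""
    parts = [str(feature_id)]
    if border_count is not None:
        parts.append(f"border_count={border_count}")
    if nan_mode is not None:
        parts.append(f"nan_mode={nan_mode}")
    if border_type is not None:
        parts.append(f"border_type={border_type}")
    return ":".join(parts)


descriptions = [
    quantization_description(0, border_count=1024),
    quantization_description(3, nan_mode="Min", border_type="GreedyLogSum"),
]
print(descriptions)
# ['0:border_count=1024', '3:nan_mode=Min:border_type=GreedyLogSum']
```

A list built this way has the shape expected by the `perFloatFeatureQuantizaton` parameter, one string per feature.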

getPerObjectFeaturePenaltiesList()

Returns

list: Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesMap.

getPerObjectFeaturePenaltiesMap()

Returns

dict: Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesList.

getPredictionCol()

Returns

str: prediction column name

getRandomSeed()

Returns

int: The random seed used for training. Default value is 0.

getRandomStrength()

Returns

float: The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. See documentation for details. Default value is 1.0

getRsm()

Returns

float: Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. The value must be in the range (0, 1]. Default value is 1.

getSamplingFrequency()

Returns

ESamplingFrequency: Frequency to sample weights and objects when building trees. Default value is ‘PerTreeLevel’

getSamplingUnit()

Returns

ESamplingUnit: The sampling scheme, see documentation for details. Default value is ‘Object’

getSaveSnapshot()

Returns

bool: Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshotInterval parameter to change this period.

getScoreFunction()

Returns

EScoreFunction: The score type used to select the next split during the tree construction. See documentation for details. Default value is ‘Cosine’

getSnapshotFile()

Returns

str: The name of the file to save the training progress information in. This file is used for recovering training after an interruption.

getSnapshotInterval()

Returns

datetime.timedelta: The interval between saving snapshots. See documentation for details. Default value is 600 seconds.

getSparkPartitionCount()

Returns

int: The number of partitions used during training. Corresponds to the number of active parallel tasks. Set to the number of active executors by default

getSubsample()

Returns

float: Sample rate for bagging. The default value depends on the dataset size and the bootstrap type, see documentation for details.

getThreadCount()

Returns

int: Number of CPU threads in parallel operations on client

getTrainDir()

Returns

str: The directory for storing the files on Driver node generated during training. Default value is ‘catboost_info’

getUseBestModel()

Returns

bool: If this parameter is set, the number of trees that are saved in the resulting model is selected based on the optimal value of the evalMetric. This option requires a validation dataset to be provided.
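
How `useBestModel` and `bestModelMinTrees` interact can be sketched as follows. This is an illustration under stated assumptions (one eval metric value per tree, iteration `i` corresponding to a model of `i + 1` trees), and the function name is hypothetical.

```python
def trees_to_keep(eval_metric_values, best_model_min_trees=0, lower_is_better=True):
    """Return how many trees the saved model would contain under the assumptions above."""
    pick = min if lower_is_better else max
    best_iteration = pick(
        range(len(eval_metric_values)), key=lambda i: eval_metric_values[i]
    )
    kept = max(best_iteration + 1, best_model_min_trees)
    return min(kept, len(eval_metric_values))  # cannot keep more trees than were built


history = [0.9, 0.5, 0.4, 0.45, 0.6]
print(trees_to_keep(history))                          # 3
print(trees_to_keep(history, best_model_min_trees=4))  # 4
```

The second call shows the minimum overriding the best iteration: the optimum lies within the first three trees, yet four are kept.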

getWeightCol()

Returns

str: weight column name. If this is not set or empty, we treat all instance weights as 1.0

getWorkerInitializationTimeout()

Returns

datetime.timedelta: Timeout to wait until CatBoost workers on Spark executors are initialized and have sent their info to the master. Depends on dataset size. Default is 10 minutes.

getWorkerMaxFailures()

Returns

int: Number of individual CatBoost worker failures to tolerate before giving up training. Should be greater than or equal to 1. Default is 4.

classmethod read()

Returns an MLReader instance for this class.

setAllowConstLabel(value)

Parameters

value (bool): Use it to train models with datasets that have equal label values for all objects.

setAllowWritingFiles(value)

Parameters

value (bool): Allow writing analytical and snapshot files during training. Enabled by default.

setApproxOnFullHistory(value)

Parameters

value (bool): Use all the preceding rows in the fold for calculating the approximated values. This mode is slower and in rare cases slightly more accurate.

setBaggingTemperature(value)

Parameters

value (float): This parameter can be used if the selected bootstrap type is Bayesian. Possible values are in the range [0, +inf). The higher the value, the more aggressive the bagging is. Default value is 1.0.

setBestModelMinTrees(value)

Parameters

value (int): The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the best model is located within these trees. Should be used with the useBestModel parameter. No limit by default.

setBootstrapType(value)

Parameters

value (EBootstrapType): Bootstrap type. Defines the method for sampling the weights of objects. The default value depends on the selected mode and processing unit type. For QueryCrossEntropy, YetiRankPairwise and PairLogitPairwise: Bernoulli with the subsample parameter set to 0.5. For MultiClass and MultiClassOneVsAll: Bayesian. For other modes: MVS with the subsample parameter set to 0.8.

setBorderCount(value)

Parameters

value (int): The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusive. Default value is 254.

setConnectTimeout(value)

Parameters

value (datetime.timedelta): Timeout to wait while establishing socket connections between TrainingDriver and workers. Default is 1 minute.

setCustomMetric(value)

Parameters

value (list): Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

setDepth(value)

Parameters

value (int): Depth of the tree. Default value is 6.

setDiffusionTemperature(value)

Parameters

value (float): The diffusion temperature of the Stochastic Gradient Langevin Boosting mode. Only non-negative values are supported. Default value is 10000.

setEarlyStoppingRounds(value)

Parameters

value (int): Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.

setEvalMetric(value)

Parameters

value (str): The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

setFeatureBorderType(value)

Parameters

value (EBorderSelectionType): The quantization mode for numerical features. See documentation for details. Default value is ‘GreedyLogSum’.

setFeatureWeightsList(value)

Parameters

value (list): Per-feature multiplication weights used when choosing the best split. Array indices correspond to feature indices. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsMap.

setFeatureWeightsMap(value)

Parameters

value (dict): Per-feature multiplication weights used when choosing the best split. Map is ‘feature_name’ -> weight. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsList.

setFeaturesCol(value)

Parameters

value (str): features column name

setFirstFeatureUsePenaltiesList(value)

Parameters

value (list): Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesMap.

setFirstFeatureUsePenaltiesMap(value)

Parameters

value (dict): Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesList.

setFoldLenMultiplier(value)

Parameters

value (float): Coefficient for changing the length of folds. The value must be greater than 1. The best validation result is achieved with minimum values. Default value is 2.0.

setFoldPermutationBlock(value)

Parameters

value (int): Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation. Default value is 1.

setHasTime(value)

Parameters

value (bool): Use the order of objects in the input data (do not perform random permutations during the ‘Choosing the tree structure’ stage).

setIgnoredFeaturesIndices(value)

Parameters

value (list): Feature indices to exclude from the training

setIgnoredFeaturesNames(value)

Parameters

value (list): Feature names to exclude from the training

setInputBorders(value)

Parameters

value (str): Load custom quantization borders and missing value modes from a file (do not generate them).

setIterations(value)

Parameters

value (int): The maximum number of trees that can be built when solving machine learning problems. When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter. Default value is 1000.

setL2LeafReg(value)

Parameters

value (float): Coefficient of the L2 regularization term of the cost function. Any positive value is allowed. Default value is 3.0.

setLabelCol(value)

Parameters

value (str): label column name

setLeafEstimationBacktracking(value)

Parameters

value (ELeavesEstimationStepBacktracking): When the value of the leafEstimationIterations parameter is greater than 1, CatBoost makes several gradient or Newton steps when calculating the resulting leaf values of a tree. The behaviour differs depending on the value of this parameter. See documentation for details. Default value is ‘AnyImprovement’.

setLeafEstimationIterations(value)

Parameters

value (int): CatBoost might calculate leaf values using several gradient or Newton steps instead of a single one. This parameter regulates how many steps are done in every tree when calculating leaf values.

setLeafEstimationMethod(value)

Parameters

value (ELeavesEstimation): The method used to calculate the values in leaves. See documentation for details.

setLearningRate(value)

Parameters

value (float): The learning rate. Used for reducing the gradient step. The default value is defined automatically for Logloss, MultiClass & RMSE loss functions depending on the number of iterations if none of ‘leaf_estimation_iterations’, ‘leaf_estimation_method’, ‘l2_leaf_reg’ is set. In this case, the selected learning rate is printed to stdout and saved in the model. In other cases, the default value is 0.03.

setLoggingLevel(value)

Parameters

value (ELoggingLevel): The logging level to output to stdout. See documentation for details. Default value is ‘Verbose’.

setLossFunction(value)

Parameters

value (str): The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).

setMetricPeriod(value)

Parameters

value (int): The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer. The usage of this parameter speeds up the training. Default value is 1.

setModelShrinkMode(value)

Parameters

value (EModelShrinkMode): Determines how the actual model shrinkage coefficient is calculated at each iteration. See documentation for details. Default value is ‘Constant’.

setModelShrinkRate(value)

Parameters

value (float): The constant used to calculate the coefficient for multiplying the model on each iteration. See documentation for details.

setMvsReg(value)

Parameters

value (float): Affects the weight of the denominator and can be used for balancing between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling; setting it to +Inf implies Bernoulli sampling). Note: This parameter is supported only for the MVS sampling method.

setNanMode(value)

Parameters

value (ENanMode): The method for processing missing values in the input dataset. See documentation for details. Default value is ‘Min’.

setOdPval(value)

Parameters

value (float): The threshold for the IncToDec overfitting detector type. The training is stopped when the specified value is reached. Requires a validation dataset to be provided. See documentation for details. Turned off by default.

setOdType(value)

Parameters

value (EOverfittingDetectorType): The type of the overfitting detector to use. See documentation for details. Default value is ‘IncToDec’.

setOdWait(value)

Parameters

value (int): The number of iterations to continue the training after the iteration with the optimal metric value. See documentation for details. Default value is 20.

setOneHotMaxSize(value)

Parameters

value (int): Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.

setParams(allowConstLabel=None, allowWritingFiles=None, approxOnFullHistory=None, baggingTemperature=None, bestModelMinTrees=None, bootstrapType=None, borderCount=None, connectTimeout=datetime.timedelta(seconds=60), customMetric=None, depth=None, diffusionTemperature=None, earlyStoppingRounds=None, evalMetric=None, featureBorderType=None, featureWeightsList=None, featureWeightsMap=None, featuresCol='features', firstFeatureUsePenaltiesList=None, firstFeatureUsePenaltiesMap=None, foldLenMultiplier=None, foldPermutationBlock=None, hasTime=None, ignoredFeaturesIndices=None, ignoredFeaturesNames=None, inputBorders=None, iterations=None, l2LeafReg=None, labelCol='label', leafEstimationBacktracking=None, leafEstimationIterations=None, leafEstimationMethod=None, learningRate=None, loggingLevel=None, lossFunction=None, metricPeriod=None, modelShrinkMode=None, modelShrinkRate=None, mvsReg=None, nanMode=None, odPval=None, odType=None, odWait=None, oneHotMaxSize=None, penaltiesCoefficient=None, perFloatFeatureQuantizaton=None, perObjectFeaturePenaltiesList=None, perObjectFeaturePenaltiesMap=None, predictionCol='prediction', randomSeed=None, randomStrength=None, rsm=None, samplingFrequency=None, samplingUnit=None, saveSnapshot=None, scoreFunction=None, snapshotFile=None, snapshotInterval=None, sparkPartitionCount=None, subsample=None, threadCount=None, trainDir=None, useBestModel=None, weightCol=None, workerInitializationTimeout=datetime.timedelta(seconds=600), workerMaxFailures=4)

Set the (keyword only) parameters

Parameters

allowConstLabelbool: Use it to train models with datasets that have equal label values for all objects.
allowWritingFilesbool: Allow writing analytical and snapshot files during training. Enabled by default.
approxOnFullHistorybool: Use all the preceding rows in the fold for calculating the approximated values. This mode is slower and in rare cases slightly more accurate.
baggingTemperaturefloat: This parameter can be used if the selected bootstrap type is Bayesian. Possible values are in the range [0, +inf). The higher the value, the more aggressive the bagging is. Default value is 1.0.
bestModelMinTreesint: The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the best model is located within these trees. Should be used with the useBestModel parameter. No limit by default.
bootstrapTypeEBootstrapType: Bootstrap type. Defines the method for sampling the weights of objects. The default value depends on the selected mode and processing unit type: QueryCrossEntropy, YetiRankPairwise, PairLogitPairwise: Bernoulli with the subsample parameter set to 0.5; MultiClass and MultiClassOneVsAll: Bayesian; other modes: MVS with the subsample parameter set to 0.8.
borderCountint: The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusively. Default value is 254.
connectTimeoutdatetime.timedelta, default: datetime.timedelta(milliseconds=60000): Timeout to wait while establishing socket connections between TrainingDriver and workers. Default is 1 minute.
customMetriclist: Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
depthint: Depth of the tree. Default value is 6.
diffusionTemperaturefloat: The diffusion temperature of the Stochastic Gradient Langevin Boosting mode. Only non-negative values are supported. Default value is 10000.
earlyStoppingRoundsint: Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
evalMetricstr: The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
featureBorderTypeEBorderSelectionType: The quantization mode for numerical features. See documentation for details. Default value is ‘GreedyLogSum’
featureWeightsListlist: Per-feature multiplication weights used when choosing the best split. Array indices correspond to feature indices. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsMap.
featureWeightsMapdict: Per-feature multiplication weights used when choosing the best split. Map is ‘feature_name’ -> weight. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsList.
featuresColstr, default: “features”: features column name
firstFeatureUsePenaltiesListlist: Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesMap.
firstFeatureUsePenaltiesMapdict: Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesList.
foldLenMultiplierfloat: Coefficient for changing the length of folds. The value must be greater than 1. The best validation result is achieved with minimum values. Default value is 2.0.
foldPermutationBlockint: Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller is the value, the slower is the training. Large values may result in quality degradation. Default value is 1.
hasTimebool: Use the order of objects in the input data (do not perform random permutations during the ‘Choosing the tree structure’ stage).
ignoredFeaturesIndiceslist: Feature indices to exclude from the training
ignoredFeaturesNameslist: Feature names to exclude from the training
inputBordersstr: Load custom quantization borders and missing value modes from a file (do not generate them).
iterationsint: The maximum number of trees that can be built when solving machine learning problems. When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter. Default value is 1000.
l2LeafRegfloat: Coefficient at the L2 regularization term of the cost function. Any positive value is allowed. Default value is 3.0.
labelColstr, default: “label”: label column name
leafEstimationBacktrackingELeavesEstimationStepBacktracking: When the value of the leafEstimationIterations parameter is greater than 1, CatBoost makes several gradient or Newton steps when calculating the resulting leaf values of a tree. The behaviour differs depending on the value of this parameter. See documentation for details. Default value is ‘AnyImprovement’
leafEstimationIterationsint: CatBoost might calculate leaf values using several gradient or Newton steps instead of a single one. This parameter regulates how many steps are done in every tree when calculating leaf values.
leafEstimationMethodELeavesEstimation: The method used to calculate the values in leaves. See documentation for details.
learningRatefloat: The learning rate. Used for reducing the gradient step. The default value is defined automatically for Logloss, MultiClass & RMSE loss functions depending on the number of iterations if none of ‘leaf_estimation_iterations’, ‘leaf_estimation_method’, ‘l2_leaf_reg’ is set. In this case, the selected learning rate is printed to stdout and saved in the model. In other cases, the default value is 0.03.
loggingLevelELoggingLevel: The logging level to output to stdout. See documentation for details. Default value is ‘Verbose’
lossFunctionstr: The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
metricPeriodint: The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer. The usage of this parameter speeds up the training. Default value is 1.
modelShrinkModeEModelShrinkMode: Determines how the actual model shrinkage coefficient is calculated at each iteration. See documentation for details. Default value is ‘Constant’
modelShrinkRatefloat: The constant used to calculate the coefficient for multiplying the model on each iteration. See documentation for details.
mvsRegfloat: Affects the weight of the denominator and can be used for balancing between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling, and setting it to +Inf implies Bernoulli sampling). Note: This parameter is supported only for the MVS sampling method.
nanModeENanMode: The method for processing missing values in the input dataset. See documentation for details. Default value is ‘Min’
odPvalfloat: The threshold for the IncToDec overfitting detector type. The training is stopped when the specified value is reached. Requires that a validation dataset was input. See documentation for details. Turned off by default.
odTypeEOverfittingDetectorType: The type of the overfitting detector to use. See documentation for details. Default value is ‘IncToDec’
odWaitint: The number of iterations to continue the training after the iteration with the optimal metric value. See documentation for details. Default value is 20.
oneHotMaxSizeint: Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.
penaltiesCoefficientfloat: A single-value common coefficient to multiply all penalties. Non-negative values are supported. Default value is 1.0.
perFloatFeatureQuantizatonlist: The quantization description for the given list of features (one or more). Description format for a single feature: FeatureId[:border_count=BorderCount][:nan_mode=BorderType][:border_type=border_selection_method]
perObjectFeaturePenaltiesListlist: Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesMap.
perObjectFeaturePenaltiesMapdict: Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesList.
predictionColstr, default: “prediction”: prediction column name
randomSeedint: The random seed used for training. Default value is 0.
randomStrengthfloat: The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. See documentation for details. Default value is 1.0
rsmfloat: Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. The value must be in the range (0;1]. Default value is 1.
samplingFrequencyESamplingFrequency: Frequency to sample weights and objects when building trees. Default value is ‘PerTreeLevel’
samplingUnitESamplingUnit: The sampling scheme, see documentation for details. Default value is ‘Object’
saveSnapshotbool: Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshotInterval parameter to change this period.
scoreFunctionEScoreFunction: The score type used to select the next split during the tree construction. See documentation for details. Default value is ‘Cosine’
snapshotFilestr: The name of the file to save the training progress information in. This file is used for recovering training after an interruption.
snapshotIntervaldatetime.timedelta: The interval between saving snapshots. See documentation for details. Default value is 600 seconds.
sparkPartitionCountint: The number of partitions used during training. Corresponds to the number of active parallel tasks. Set to the number of active executors by default
subsamplefloat: Sample rate for bagging. The default value depends on the dataset size and the bootstrap type, see documentation for details.
threadCountint: Number of CPU threads in parallel operations on client
trainDirstr: The directory on the driver node for storing the files generated during training. Default value is ‘catboost_info’
useBestModelbool: If this parameter is set, the number of trees that are saved in the resulting model is selected based on the optimal value of the evalMetric. This option requires a validation dataset to be provided.
weightColstr: weight column name. If this is not set or empty, we treat all instance weights as 1.0
workerInitializationTimeoutdatetime.timedelta, default: datetime.timedelta(milliseconds=600000): Timeout to wait until CatBoost workers on Spark executors are initialized and have sent their info to the master. Depends on dataset size. Default is 10 minutes.
workerMaxFailuresint, default: 4: Number of individual CatBoost worker failures before giving up training. Should be greater than or equal to 1. Default is 4.
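A minimal sketch of applying several of the parameters above through setParams. The parameter values are hypothetical, and the catboost_spark import is deferred inside the function because constructing the estimator is assumed to require an active Spark session with the CatBoost for Apache Spark package available.

```python
import datetime

# Hypothetical parameter values chosen only for illustration.
regressor_params = {
    "iterations": 500,
    "learningRate": 0.05,
    "depth": 6,
    "labelCol": "label",
    "featuresCol": "features",
    "connectTimeout": datetime.timedelta(seconds=60),
    "workerMaxFailures": 4,
}


def make_regressor(params):
    """Create a CatBoostRegressor and apply keyword-only parameters."""
    # Deferred import: assumed to need a running Spark session.
    import catboost_spark

    regressor = catboost_spark.CatBoostRegressor()
    regressor.setParams(**params)
    return regressor
```

Keeping the parameters in a plain dictionary makes it easy to log or validate them before the estimator is created.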

setPenaltiesCoefficient(value)

Parameters

valuefloat: A single-value common coefficient to multiply all penalties. Non-negative values are supported. Default value is 1.0.

setPerFloatFeatureQuantizaton(value)

Parameters

valuelist: The quantization description for the given list of features (one or more). Description format for a single feature: FeatureId[:border_count=BorderCount][:nan_mode=BorderType][:border_type=border_selection_method]
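The description format above can be assembled with a small helper. This is a sketch with a hypothetical function name; the option values used (‘Min’, ‘GreedyLogSum’) are the defaults mentioned elsewhere on this page for nanMode and featureBorderType.

```python
def quantization_description(feature_id, border_count=None, nan_mode=None,
                             border_type=None):
    """Build one per-feature quantization description string in the form
    FeatureId[:border_count=...][:nan_mode=...][:border_type=...]."""
    parts = [str(feature_id)]
    if border_count is not None:
        parts.append(f"border_count={border_count}")
    if nan_mode is not None:
        parts.append(f"nan_mode={nan_mode}")
    if border_type is not None:
        parts.append(f"border_type={border_type}")
    return ":".join(parts)


descriptions = [
    quantization_description(0, border_count=1024),
    quantization_description(2, border_count=254, nan_mode="Min",
                             border_type="GreedyLogSum"),
]
```

The resulting list of strings is the kind of value that setPerFloatFeatureQuantizaton expects.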

setPerObjectFeaturePenaltiesList(value)

Parameters

valuelist: Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesMap.

setPerObjectFeaturePenaltiesMap(value)

Parameters

valuedict: Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesList.
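Because each List parameter is mutually exclusive with its Map counterpart, it can help to check a parameter dictionary before passing it to the estimator. The following is a sketch with hypothetical helper names; it only inspects a plain dictionary and does not call CatBoost.

```python
# Pairs documented on this page as mutually exclusive.
EXCLUSIVE_PAIRS = [
    ("featureWeightsList", "featureWeightsMap"),
    ("firstFeatureUsePenaltiesList", "firstFeatureUsePenaltiesMap"),
    ("perObjectFeaturePenaltiesList", "perObjectFeaturePenaltiesMap"),
]


def find_conflicts(params):
    """Return the (list_name, map_name) pairs for which both are set."""
    return [
        (list_name, map_name)
        for list_name, map_name in EXCLUSIVE_PAIRS
        if params.get(list_name) is not None
        and params.get(map_name) is not None
    ]


# Hypothetical parameter sets.
conflicting = find_conflicts({
    "perObjectFeaturePenaltiesList": [0.1, 0.0],
    "perObjectFeaturePenaltiesMap": {"age": 0.1},
})
clean = find_conflicts({"featureWeightsMap": {"age": 2.0}})
```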

setPredictionCol(value)

Parameters

valuestr: prediction column name

setRandomSeed(value)

Parameters

valueint: The random seed used for training. Default value is 0.

setRandomStrength(value)

Parameters

valuefloat: The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. See documentation for details. Default value is 1.0

setRsm(value)

Parameters

valuefloat: Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. The value must be in the range (0;1]. Default value is 1.

setSamplingFrequency(value)

Parameters

valueESamplingFrequency: Frequency to sample weights and objects when building trees. Default value is ‘PerTreeLevel’

setSamplingUnit(value)

Parameters

valueESamplingUnit: The sampling scheme, see documentation for details. Default value is ‘Object’

setSaveSnapshot(value)

Parameters

valuebool: Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshotInterval parameter to change this period.

setScoreFunction(value)

Parameters

valueEScoreFunction: The score type used to select the next split during the tree construction. See documentation for details. Default value is ‘Cosine’

setSnapshotFile(value)

Parameters

valuestr: The name of the file to save the training progress information in. This file is used for recovering training after an interruption.

setSnapshotInterval(value)

Parameters

valuedatetime.timedelta: The interval between saving snapshots. See documentation for details. Default value is 600 seconds.

setSparkPartitionCount(value)

Parameters

valueint: The number of partitions used during training. Corresponds to the number of active parallel tasks. Set to the number of active executors by default

setSubsample(value)

Parameters

valuefloat: Sample rate for bagging. The default value depends on the dataset size and the bootstrap type, see documentation for details.

setThreadCount(value)

Parameters

valueint: Number of CPU threads in parallel operations on client

setTrainDir(value)

Parameters

valuestr: The directory on the driver node for storing the files generated during training. Default value is ‘catboost_info’

setUseBestModel(value)

Parameters

valuebool: If this parameter is set, the number of trees that are saved in the resulting model is selected based on the optimal value of the evalMetric. This option requires a validation dataset to be provided.
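Since useBestModel requires a validation dataset, a typical call passes evaluation data to the extended fit method. The sketch below assumes that catboost_spark.Pool can wrap a Spark DataFrame and that evalDatasets is given as a list of pools. The function name and its arguments are hypothetical, and the import is deferred because it is assumed to need an active Spark session.

```python
def fit_with_validation(train_df, eval_df):
    """Train a regressor and keep the number of trees that is optimal on
    the validation data.

    Assumptions: train_df and eval_df are Spark DataFrames with the
    default 'features' and 'label' columns.
    """
    # Deferred import: assumed to need a running Spark session.
    import catboost_spark

    train_pool = catboost_spark.Pool(train_df)
    eval_pool = catboost_spark.Pool(eval_df)

    regressor = catboost_spark.CatBoostRegressor()
    regressor.setUseBestModel(True)

    # The second argument corresponds to evalDatasets in fit's signature.
    return regressor.fit(train_pool, [eval_pool])
```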

setWeightCol(value)

Parameters

valuestr: weight column name. If this is not set or empty, we treat all instance weights as 1.0

setWorkerInitializationTimeout(value)

Parameters

valuedatetime.timedelta: Timeout to wait until CatBoost workers on Spark executors are initialized and have sent their info to the master. Depends on dataset size. Default is 10 minutes.
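The timeout is a datetime.timedelta, so the default of 10 minutes can be written in several equivalent ways. The short sketch below shows that equivalence and builds a longer, hypothetical value for a large dataset.

```python
import datetime

# Equivalent spellings of the documented 10-minute default.
default_in_seconds = datetime.timedelta(seconds=600)
default_in_millis = datetime.timedelta(milliseconds=600000)
default_in_minutes = datetime.timedelta(minutes=10)

# A longer, hypothetical timeout for a large dataset.
longer_timeout = datetime.timedelta(minutes=30)
```

A value such as longer_timeout can then be passed to setWorkerInitializationTimeout.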

setWorkerMaxFailures(value)

Parameters

valueint: Number of individual CatBoost worker failures before giving up training. Should be greater than or equal to 1. Default is 4.
