CatBoostRegressor
- class catboost_spark.CatBoostRegressor(allowConstLabel=None, allowWritingFiles=None, approxOnFullHistory=None, baggingTemperature=None, bestModelMinTrees=None, bootstrapType=None, borderCount=None, connectTimeout=datetime.timedelta(seconds=60), customMetric=None, depth=None, diffusionTemperature=None, earlyStoppingRounds=None, evalMetric=None, featureBorderType=None, featureWeightsList=None, featureWeightsMap=None, featuresCol='features', firstFeatureUsePenaltiesList=None, firstFeatureUsePenaltiesMap=None, foldLenMultiplier=None, foldPermutationBlock=None, hasTime=None, ignoredFeaturesIndices=None, ignoredFeaturesNames=None, inputBorders=None, iterations=None, l2LeafReg=None, labelCol='label', leafEstimationBacktracking=None, leafEstimationIterations=None, leafEstimationMethod=None, learningRate=None, loggingLevel=None, lossFunction=None, metricPeriod=None, modelShrinkMode=None, modelShrinkRate=None, mvsReg=None, nanMode=None, odPval=None, odType=None, odWait=None, oneHotMaxSize=None, penaltiesCoefficient=None, perFloatFeatureQuantizaton=None, perObjectFeaturePenaltiesList=None, perObjectFeaturePenaltiesMap=None, predictionCol='prediction', randomSeed=None, randomStrength=None, rsm=None, samplingFrequency=None, samplingUnit=None, saveSnapshot=None, scoreFunction=None, snapshotFile=None, snapshotInterval=None, sparkPartitionCount=None, subsample=None, threadCount=None, trainDir=None, useBestModel=None, weightCol=None, workerInitializationTimeout=datetime.timedelta(seconds=600), workerMaxFailures=4)[source]¶
Bases: pyspark.ml.wrapper.JavaEstimator, pyspark.ml.util.MLReadable, pyspark.ml.util.JavaMLWritable
Class to train CatBoostRegressionModel
Methods Summary
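fit(trainDataset[, evalDatasets]): Extended variant of the standard Estimator’s fit method that accepts CatBoost’s Pools and allows specifying additional datasets for computing evaluation metrics and for overfitting detection, similarly to CatBoost’s other APIs.
getAllowConstLabel()
getAllowWritingFiles()
getApproxOnFullHistory()
getBaggingTemperature()
getBestModelMinTrees()
getBootstrapType()
getBorderCount()
getConnectTimeout()
getCustomMetric()
getDepth()
getDiffusionTemperature()
getEarlyStoppingRounds()
getEvalMetric()
getFeatureBorderType()
getFeatureWeightsList()
getFeatureWeightsMap()
getFeaturesCol()
getFirstFeatureUsePenaltiesList()
getFirstFeatureUsePenaltiesMap()
getFoldLenMultiplier()
getFoldPermutationBlock()
getHasTime()
getIgnoredFeaturesIndices()
getIgnoredFeaturesNames()
getInputBorders()
getIterations()
getL2LeafReg()
getLabelCol()
getLeafEstimationBacktracking()
getLeafEstimationIterations()
getLeafEstimationMethod()
getLearningRate()
getLoggingLevel()
getLossFunction()
getMetricPeriod()
getModelShrinkMode()
getModelShrinkRate()
getMvsReg()
getNanMode()
getOdPval()
getOdType()
getOdWait()
getOneHotMaxSize()
getPenaltiesCoefficient()
getPerFloatFeatureQuantizaton()
getPerObjectFeaturePenaltiesList()
getPerObjectFeaturePenaltiesMap()
getPredictionCol()
getRandomSeed()
getRandomStrength()
getRsm()
getSamplingFrequency()
getSamplingUnit()
getSaveSnapshot()
getScoreFunction()
getSnapshotFile()
getSnapshotInterval()
getSparkPartitionCount()
getSubsample()
getThreadCount()
getTrainDir()
getUseBestModel()
getWeightCol()
getWorkerInitializationTimeout()
getWorkerMaxFailures()
read(): Returns an MLReader instance for this class.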
setAllowConstLabel(value)
setAllowWritingFiles(value)
setApproxOnFullHistory(value)
setBaggingTemperature(value)
setBestModelMinTrees(value)
setBootstrapType(value)
setBorderCount(value)
setConnectTimeout(value)
setCustomMetric(value)
setDepth(value)
setDiffusionTemperature(value)
setEarlyStoppingRounds(value)
setEvalMetric(value)
setFeatureBorderType(value)
setFeatureWeightsList(value)
setFeatureWeightsMap(value)
setFeaturesCol(value)
setFirstFeatureUsePenaltiesList(value)
setFirstFeatureUsePenaltiesMap(value)
setFoldLenMultiplier(value)
setFoldPermutationBlock(value)
setHasTime(value)
setIgnoredFeaturesIndices(value)
setIgnoredFeaturesNames(value)
setInputBorders(value)
setIterations(value)
setL2LeafReg(value)
setLabelCol(value)
setLeafEstimationBacktracking(value)
setLeafEstimationIterations(value)
setLeafEstimationMethod(value)
setLearningRate(value)
setLoggingLevel(value)
setLossFunction(value)
setMetricPeriod(value)
setModelShrinkMode(value)
setModelShrinkRate(value)
setMvsReg(value)
setNanMode(value)
setOdPval(value)
setOdType(value)
setOdWait(value)
setOneHotMaxSize(value)
setParams([allowConstLabel, …]): Set the (keyword-only) parameters.
setPenaltiesCoefficient(value)
setPerFloatFeatureQuantizaton(value)
setPerObjectFeaturePenaltiesList(value)
setPerObjectFeaturePenaltiesMap(value)
setPredictionCol(value)
setRandomSeed(value)
setRandomStrength(value)
setRsm(value)
setSamplingFrequency(value)
setSamplingUnit(value)
setSaveSnapshot(value)
setScoreFunction(value)
setSnapshotFile(value)
setSnapshotInterval(value)
setSparkPartitionCount(value)
setSubsample(value)
setThreadCount(value)
setTrainDir(value)
setUseBestModel(value)
setWeightCol(value)
setWorkerInitializationTimeout(value)
setWorkerMaxFailures(value)
Methods Documentation
- fit(trainDataset, evalDatasets=None)
Extended variant of the standard Estimator’s fit method that accepts CatBoost’s Pools and allows specifying additional datasets for computing evaluation metrics and for overfitting detection, similarly to CatBoost’s other APIs.
- Parameters
- trainDataset : Pool or DataFrame
The input training dataset.
- evalDatasets : Pools, optional
The validation datasets used for the following processes:
- overfitting detector
- best iteration selection
- monitoring metrics’ changes
- Returns
- trained model : CatBoostRegressionModel
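A minimal usage sketch of fit with one validation dataset may help here. It assumes that catboost_spark is installed and that train_df and eval_df are Spark DataFrames with 'features' and 'label' columns. The helper names regressor_params and train_with_eval and the parameter values are hypothetical, chosen only for illustration.

```python
def regressor_params():
    """Hypothetical starting parameters; the values are illustrative only."""
    return {
        "iterations": 200,
        "learningRate": 0.1,
        "depth": 6,
        "labelCol": "label",
        "featuresCol": "features",
    }


def train_with_eval(train_df, eval_df):
    """Fit on train_df while computing metrics on eval_df (Spark DataFrames)."""
    # Imported lazily so this module loads even where Spark is not installed.
    import catboost_spark

    # Wrap the DataFrames in CatBoost Pools, which fit() accepts directly.
    train_pool = catboost_spark.Pool(train_df)
    eval_pool = catboost_spark.Pool(eval_df)

    regressor = catboost_spark.CatBoostRegressor(**regressor_params())

    # evalDatasets takes a list of Pools used for metric monitoring,
    # overfitting detection and best iteration selection.
    model = regressor.fit(train_pool, evalDatasets=[eval_pool])

    # The trained model is a Spark transformer; predictions land in
    # the column named by predictionCol ('prediction' by default).
    return model.transform(eval_df)
```

Passing a plain DataFrame as trainDataset is also allowed by the signature above; wrapping it in a Pool is shown because evalDatasets is documented as Pools.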
- getAllowConstLabel()
- Returns
- bool
Use it to train models with datasets that have equal label values for all objects.
- getAllowWritingFiles()
- Returns
- bool
Allows writing analytical and snapshot files during training. Enabled by default.
- getApproxOnFullHistory()
- Returns
- bool
Use all the preceding rows in the fold for calculating the approximated values. This mode is slower and in rare cases slightly more accurate.
- getBaggingTemperature()
- Returns
- float
This parameter can be used if the selected bootstrap type is Bayesian. Possible values are in the range [0, +inf). The higher the value, the more aggressive the bagging is. Default value is 1.0.
- getBestModelMinTrees()
- Returns
- int
The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the best model is located within these trees. Should be used with the useBestModel parameter. No limit by default.
- getBootstrapType()
- Returns
- EBootstrapType
Bootstrap type. Defines the method for sampling the weights of objects. The default value depends on the selected mode and processing unit type:
- QueryCrossEntropy, YetiRankPairwise, PairLogitPairwise: Bernoulli with the subsample parameter set to 0.5.
- MultiClass and MultiClassOneVsAll: Bayesian.
- Other modes: MVS with the subsample parameter set to 0.8.
- getBorderCount()
- Returns
- int
The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusive. Default value is 254.
- getConnectTimeout()
- Returns
- datetime.timedelta
Timeout to wait while establishing socket connections between TrainingDriver and workers. Default is 1 minute.
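Timeout parameters are datetime.timedelta values, so the same default can be written in several equivalent ways. The sketch below uses only the Python standard library; the five-minute value is a hypothetical choice for a slow network, not a recommendation.

```python
import datetime

# The documented default of 1 minute, written two equivalent ways.
default_connect_timeout = datetime.timedelta(seconds=60)
same_timeout_in_ms = datetime.timedelta(milliseconds=60000)

# A hypothetical longer timeout that could be passed to setConnectTimeout(value).
slow_network_timeout = datetime.timedelta(minutes=5)

# timedelta normalizes its arguments, so both spellings compare equal.
print(default_connect_timeout == same_timeout_in_ms)
print(slow_network_timeout.total_seconds())
```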
- getCustomMetric()
- Returns
- list
Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
- getDiffusionTemperature()
- Returns
- float
The diffusion temperature of the Stochastic Gradient Langevin Boosting mode. Only non-negative values are supported. Default value is 10000.
- getEarlyStoppingRounds()
- Returns
- int
Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
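The stopping rule described for earlyStoppingRounds can be sketched in plain Python. This is a simplified illustration of "stop after N iterations since the best one", not CatBoost's actual implementation. The function name find_stop_iteration and the sample metric values are hypothetical, and the metric is assumed to be one where lower is better.

```python
def find_stop_iteration(metric_values, early_stopping_rounds):
    """Return (stop_index, best_index) for a lower-is-better metric.

    Training stops at the first iteration that is early_stopping_rounds
    iterations past the best one; otherwise it runs to the last iteration.
    Assumes a non-empty list and early_stopping_rounds >= 1.
    """
    best_index = 0
    for i, value in enumerate(metric_values):
        if value < metric_values[best_index]:
            # New optimum: the waiting counter effectively restarts here.
            best_index = i
        elif i - best_index >= early_stopping_rounds:
            return i, best_index
    return len(metric_values) - 1, best_index


# Hypothetical validation metric values per iteration.
history = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.63]
stop_index, best_index = find_stop_iteration(history, early_stopping_rounds=3)
print(stop_index, best_index)
```

With useBestModel enabled, the saved model would correspond to the best iteration rather than the one at which training stopped.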
- getEvalMetric()
- Returns
- str
The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
- getFeatureBorderType()
- Returns
- EBorderSelectionType
The quantization mode for numerical features. See documentation for details. Default value is ‘GreedyLogSum’.
- getFeatureWeightsList()
- Returns
- list
Per-feature multiplication weights used when choosing the best split. Array indices correspond to feature indices. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsMap.
- getFeatureWeightsMap()
- Returns
- dict
Per-feature multiplication weights used when choosing the best split. Map is ‘feature_name’ -> weight. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsList.
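The relationship between the list and map forms, and the way a weight scales a candidate score, can be sketched as follows. This is an illustration under stated assumptions, not CatBoost's internal code. The helper names and feature names are hypothetical, and features missing from the map are assumed to keep a weight of 1.0.

```python
def weights_map_to_list(feature_names, weights_map):
    """Convert a 'feature_name' -> weight map into an index-aligned list.

    Assumption: features absent from the map keep a neutral weight of 1.0.
    """
    return [weights_map.get(name, 1.0) for name in feature_names]


def weighted_split_score(raw_score, split_feature_indices, weights_list):
    """Multiply a candidate's score by the weights of the features it uses."""
    score = raw_score
    for index in split_feature_indices:
        score *= weights_list[index]
    return score


# Hypothetical feature layout and weights.
names = ["age", "income", "city_code"]
weights = weights_map_to_list(names, {"income": 0.5})

print(weights)
print(weighted_split_score(2.0, [1], weights))
```

Because the two parameters are mutually exclusive, only one of the two forms would be set on the estimator at a time.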
- getFirstFeatureUsePenaltiesList()
- Returns
- list
Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesMap.
- getFirstFeatureUsePenaltiesMap()
- Returns
- dict
Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesList.
- getFoldLenMultiplier()
- Returns
- float
Coefficient for changing the length of folds. The value must be greater than 1. The best validation result is achieved with minimum values. Default value is 2.0.
- getFoldPermutationBlock()
- Returns
- int
Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation. Default value is 1.
- getHasTime()
- Returns
- bool
Use the order of objects in the input data (do not perform random permutations during the stage of choosing the tree structure).
- getInputBorders()
- Returns
- str
Load custom quantization borders and missing value modes from a file (do not generate them).
- getIterations()
- Returns
- int
The maximum number of trees that can be built when solving machine learning problems. When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter. Default value is 1000.
- getL2LeafReg()
- Returns
- float
Coefficient at the L2 regularization term of the cost function. Any positive value is allowed. Default value is 3.0.
- getLeafEstimationBacktracking()
- Returns
- ELeavesEstimationStepBacktracking
When the value of the leafEstimationIterations parameter is greater than 1, CatBoost makes several gradient or Newton steps when calculating the resulting leaf values of a tree. The behaviour differs depending on the value of this parameter. See documentation for details. Default value is ‘AnyImprovement’.
- getLeafEstimationIterations()
- Returns
- int
CatBoost might calculate leaf values using several gradient or Newton steps instead of a single one. This parameter regulates how many steps are done in every tree when calculating leaf values.
- getLeafEstimationMethod()
- Returns
- ELeavesEstimation
The method used to calculate the values in leaves. See documentation for details.
- getLearningRate()
- Returns
- float
The learning rate. Used for reducing the gradient step. The default value is defined automatically for the Logloss, MultiClass and RMSE loss functions depending on the number of iterations if none of ‘leaf_estimation_iterations’, ‘leaf_estimation_method’, ‘l2_leaf_reg’ is set. In this case, the selected learning rate is printed to stdout and saved in the model. In other cases, the default value is 0.03.
- getLoggingLevel()
- Returns
- ELoggingLevel
The logging level to output to stdout. See documentation for details. Default value is ‘Verbose’.
- getLossFunction()
- Returns
- str
The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
- getMetricPeriod()
- Returns
- int
The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer. The usage of this parameter speeds up the training. Default value is 1.
- getModelShrinkMode()
- Returns
- EModelShrinkMode
Determines how the actual model shrinkage coefficient is calculated at each iteration. See documentation for details. Default value is ‘Constant’.
- getModelShrinkRate()
- Returns
- float
The constant used to calculate the coefficient for multiplying the model on each iteration. See documentation for details.
- getMvsReg()
- Returns
- float
Affects the weight of the denominator and can be used for balancing between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling, and setting it to +Inf implies Bernoulli sampling). Note: This parameter is supported only for the MVS sampling method.
- getNanMode()
- Returns
- ENanMode
The method for processing missing values in the input dataset. See documentation for details. Default value is ‘Min’.
- getOdPval()
- Returns
- float
The threshold for the IncToDec overfitting detector type. The training is stopped when the specified value is reached. Requires a validation dataset to be provided. See documentation for details. Turned off by default.
- getOdType()
- Returns
- EOverfittingDetectorType
The type of the overfitting detector to use. See documentation for details. Default value is ‘IncToDec’.
- getOdWait()
- Returns
- int
The number of iterations to continue the training after the iteration with the optimal metric value. See documentation for details. Default value is 20.
- getOneHotMaxSize()
- Returns
- int
Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.
- getPenaltiesCoefficient()
- Returns
- float
A single-value common coefficient to multiply all penalties. Non-negative values are supported. Default value is 1.0.
- getPerFloatFeatureQuantizaton()
- Returns
- list
The quantization description for the given list of features (one or more). Description format for a single feature: FeatureId[:border_count=BorderCount][:nan_mode=BorderType][:border_type=border_selection_method]
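The description format above is a colon-separated string with optional key=value parts. The following sketch assembles such strings in plain Python. The helper name quantization_description is hypothetical, and the sample values ‘Min’ and ‘GreedyLogSum’ are taken from the defaults documented for nanMode and featureBorderType.

```python
def quantization_description(feature_id, border_count=None, nan_mode=None, border_type=None):
    """Build one 'FeatureId[:border_count=..][:nan_mode=..][:border_type=..]' string."""
    parts = [str(feature_id)]
    # Optional parts are appended only when given, in the documented order.
    if border_count is not None:
        parts.append(f"border_count={border_count}")
    if nan_mode is not None:
        parts.append(f"nan_mode={nan_mode}")
    if border_type is not None:
        parts.append(f"border_type={border_type}")
    return ":".join(parts)


# A list with one description per feature, the shape expected by
# the perFloatFeatureQuantizaton parameter.
descriptions = [
    quantization_description(0, border_count=1024),
    quantization_description(3, border_count=32, nan_mode="Min", border_type="GreedyLogSum"),
]
print(descriptions)
```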
- getPerObjectFeaturePenaltiesList()
- Returns
- list
Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesMap.
- getPerObjectFeaturePenaltiesMap()
- Returns
- dict
Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesList.
- getRandomStrength()
- Returns
- float
The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. See documentation for details. Default value is 1.0.
- getRsm()
- Returns
- float
Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. The value must be in the range (0, 1]. Default value is 1.
- getSamplingFrequency()
- Returns
- ESamplingFrequency
Frequency to sample weights and objects when building trees. Default value is ‘PerTreeLevel’.
- getSamplingUnit()
- Returns
- ESamplingUnit
The sampling scheme. See documentation for details. Default value is ‘Object’.
- getSaveSnapshot()
- Returns
- bool
Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshotInterval parameter to change this period.
- getScoreFunction()
- Returns
- EScoreFunction
The score type used to select the next split during the tree construction. See documentation for details. Default value is ‘Cosine’.
- getSnapshotFile()
- Returns
- str
The name of the file to save the training progress information in. This file is used for recovering training after an interruption.
- getSnapshotInterval()
- Returns
- datetime.timedelta
The interval between saving snapshots. See documentation for details. Default value is 600 seconds.
- getSparkPartitionCount()
- Returns
- int
The number of partitions used during training. Corresponds to the number of active parallel tasks. Set to the number of active executors by default.
- getSubsample()
- Returns
- float
Sample rate for bagging. The default value depends on the dataset size and the bootstrap type. See documentation for details.
- getTrainDir()
- Returns
- str
The directory on the driver node for storing the files generated during training. Default value is ‘catboost_info’.
- getUseBestModel()
- Returns
- bool
If this parameter is set, the number of trees that are saved in the resulting model is selected based on the optimal value of the evalMetric. This option requires a validation dataset to be provided.
- getWeightCol()
- Returns
- str
Weight column name. If this is not set or empty, all instance weights are treated as 1.0.
- getWorkerInitializationTimeout()
- Returns
- datetime.timedelta
Timeout to wait until CatBoost workers on Spark executors are initialized and have sent their info to the master. Depends on dataset size. Default is 10 minutes.
- getWorkerMaxFailures()
- Returns
- int
Number of individual CatBoost worker failures before giving up training. Should be greater than or equal to 1. Default is 4.
- setAllowConstLabel(value)
- Parameters
- value : bool
Use it to train models with datasets that have equal label values for all objects.
- setAllowWritingFiles(value)
- Parameters
- value : bool
Allows writing analytical and snapshot files during training. Enabled by default.
- setApproxOnFullHistory(value)
- Parameters
- value : bool
Use all the preceding rows in the fold for calculating the approximated values. This mode is slower and in rare cases slightly more accurate.
- setBaggingTemperature(value)
- Parameters
- value : float
This parameter can be used if the selected bootstrap type is Bayesian. Possible values are in the range [0, +inf). The higher the value, the more aggressive the bagging is. Default value is 1.0.
- setBestModelMinTrees(value)
- Parameters
- value : int
The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the best model is located within these trees. Should be used with the useBestModel parameter. No limit by default.
- setBootstrapType(value)
- Parameters
- value : EBootstrapType
Bootstrap type. Defines the method for sampling the weights of objects. The default value depends on the selected mode and processing unit type:
- QueryCrossEntropy, YetiRankPairwise, PairLogitPairwise: Bernoulli with the subsample parameter set to 0.5.
- MultiClass and MultiClassOneVsAll: Bayesian.
- Other modes: MVS with the subsample parameter set to 0.8.
- setBorderCount(value)
- Parameters
- value : int
The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusive. Default value is 254.
- setConnectTimeout(value)
- Parameters
- value : datetime.timedelta
Timeout to wait while establishing socket connections between TrainingDriver and workers. Default is 1 minute.
- setCustomMetric(value)
- Parameters
- value : list
Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
- setDiffusionTemperature(value)
- Parameters
- value : float
The diffusion temperature of the Stochastic Gradient Langevin Boosting mode. Only non-negative values are supported. Default value is 10000.
- setEarlyStoppingRounds(value)
- Parameters
- value : int
Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
- setEvalMetric(value)
- Parameters
- value : str
The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
- setFeatureBorderType(value)
- Parameters
- value : EBorderSelectionType
The quantization mode for numerical features. See documentation for details. Default value is ‘GreedyLogSum’.
- setFeatureWeightsList(value)
- Parameters
- value : list
Per-feature multiplication weights used when choosing the best split. Array indices correspond to feature indices. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsMap.
- setFeatureWeightsMap(value)
- Parameters
- value : dict
Per-feature multiplication weights used when choosing the best split. Map is ‘feature_name’ -> weight. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsList.
- setFirstFeatureUsePenaltiesList(value)
- Parameters
- value : list
Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesMap.
- setFirstFeatureUsePenaltiesMap(value)
- Parameters
- value : dict
Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesList.
- setFoldLenMultiplier(value)
- Parameters
- value : float
Coefficient for changing the length of folds. The value must be greater than 1. The best validation result is achieved with minimum values. Default value is 2.0.
- setFoldPermutationBlock(value)
- Parameters
- value : int
Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation. Default value is 1.
- setHasTime(value)
- Parameters
- value : bool
Use the order of objects in the input data (do not perform random permutations during the stage of choosing the tree structure).
- setIgnoredFeaturesIndices(value)
- Parameters
- value : list
Feature indices to exclude from training.
- setIgnoredFeaturesNames(value)
- Parameters
- value : list
Feature names to exclude from training.
- setInputBorders(value)
- Parameters
- value : str
Load custom quantization borders and missing value modes from a file (do not generate them).
- setIterations(value)
- Parameters
- value : int
The maximum number of trees that can be built when solving machine learning problems. When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter. Default value is 1000.
- setL2LeafReg(value)
- Parameters
- value : float
Coefficient at the L2 regularization term of the cost function. Any positive value is allowed. Default value is 3.0.
- setLeafEstimationBacktracking(value)
- Parameters
- value : ELeavesEstimationStepBacktracking
When the value of the leafEstimationIterations parameter is greater than 1, CatBoost makes several gradient or Newton steps when calculating the resulting leaf values of a tree. The behaviour differs depending on the value of this parameter. See documentation for details. Default value is ‘AnyImprovement’.
- setLeafEstimationIterations(value)
- Parameters
- value : int
CatBoost might calculate leaf values using several gradient or Newton steps instead of a single one. This parameter regulates how many steps are done in every tree when calculating leaf values.
- setLeafEstimationMethod(value)
- Parameters
- value : ELeavesEstimation
The method used to calculate the values in leaves. See documentation for details.
- setLearningRate(value)
- Parameters
- value : float
The learning rate. Used for reducing the gradient step. The default value is defined automatically for the Logloss, MultiClass and RMSE loss functions depending on the number of iterations if none of ‘leaf_estimation_iterations’, ‘leaf_estimation_method’, ‘l2_leaf_reg’ is set. In this case, the selected learning rate is printed to stdout and saved in the model. In other cases, the default value is 0.03.
- setLoggingLevel(value)
- Parameters
- value : ELoggingLevel
The logging level to output to stdout. See documentation for details. Default value is ‘Verbose’.
- setLossFunction(value)
- Parameters
- value : str
The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
- setMetricPeriod(value)
- Parameters
- value : int
The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer. The usage of this parameter speeds up the training. Default value is 1.
- setModelShrinkMode(value)
- Parameters
- value : EModelShrinkMode
Determines how the actual model shrinkage coefficient is calculated at each iteration. See documentation for details. Default value is ‘Constant’.
- setModelShrinkRate(value)
- Parameters
- value : float
The constant used to calculate the coefficient for multiplying the model on each iteration. See documentation for details.
- setMvsReg(value)
- Parameters
- value : float
Affects the weight of the denominator and can be used for balancing between importance sampling and Bernoulli sampling (setting it to 0 implies importance sampling, and setting it to +Inf implies Bernoulli sampling). Note: This parameter is supported only for the MVS sampling method.
- setNanMode(value)
- Parameters
- value : ENanMode
The method for processing missing values in the input dataset. See documentation for details. Default value is ‘Min’.
- setOdPval(value)
- Parameters
- value : float
The threshold for the IncToDec overfitting detector type. The training is stopped when the specified value is reached. Requires a validation dataset to be provided. See documentation for details. Turned off by default.
- setOdType(value)
- Parameters
- value : EOverfittingDetectorType
The type of the overfitting detector to use. See documentation for details. Default value is ‘IncToDec’.
- setOdWait(value)
- Parameters
- value : int
The number of iterations to continue the training after the iteration with the optimal metric value. See documentation for details. Default value is 20.
- setOneHotMaxSize(value)
- Parameters
- value : int
Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.
- setParams(allowConstLabel=None, allowWritingFiles=None, approxOnFullHistory=None, baggingTemperature=None, bestModelMinTrees=None, bootstrapType=None, borderCount=None, connectTimeout=datetime.timedelta(seconds=60), customMetric=None, depth=None, diffusionTemperature=None, earlyStoppingRounds=None, evalMetric=None, featureBorderType=None, featureWeightsList=None, featureWeightsMap=None, featuresCol='features', firstFeatureUsePenaltiesList=None, firstFeatureUsePenaltiesMap=None, foldLenMultiplier=None, foldPermutationBlock=None, hasTime=None, ignoredFeaturesIndices=None, ignoredFeaturesNames=None, inputBorders=None, iterations=None, l2LeafReg=None, labelCol='label', leafEstimationBacktracking=None, leafEstimationIterations=None, leafEstimationMethod=None, learningRate=None, loggingLevel=None, lossFunction=None, metricPeriod=None, modelShrinkMode=None, modelShrinkRate=None, mvsReg=None, nanMode=None, odPval=None, odType=None, odWait=None, oneHotMaxSize=None, penaltiesCoefficient=None, perFloatFeatureQuantizaton=None, perObjectFeaturePenaltiesList=None, perObjectFeaturePenaltiesMap=None, predictionCol='prediction', randomSeed=None, randomStrength=None, rsm=None, samplingFrequency=None, samplingUnit=None, saveSnapshot=None, scoreFunction=None, snapshotFile=None, snapshotInterval=None, sparkPartitionCount=None, subsample=None, threadCount=None, trainDir=None, useBestModel=None, weightCol=None, workerInitializationTimeout=datetime.timedelta(seconds=600), workerMaxFailures=4)[source]¶
Set the (keyword only) parameters
- Parameters
- allowConstLabelbool
Use it to train models with datasets that have equal label values for all objects.
- allowWritingFilesbool
Allow to write analytical and snapshot files during training. Enabled by default.
- approxOnFullHistorybool
Use all the preceding rows in the fold for calculating the approximated values. This mode is slower and in rare cases slightly more accurate.
- baggingTemperaturefloat
This parameter can be used if the selected bootstrap type is Bayesian. Possible values are in the range [0, +inf). The higher the value the more aggressive the bagging is.Default value in 1.0.
- bestModelMinTreesint
The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the best model is located within these trees. Should be used with the useBestModel parameter. No limit by default.
- bootstrapTypeEBootstrapType
Bootstrap type. Defines the method for sampling the weights of objects.The default value depends on the selected mode and processing unit type: QueryCrossEntropy, YetiRankPairwise, PairLogitPairwise: Bernoulli with the subsample parameter set to 0.5. MultiClass and MultiClassOneVsAll: Bayesian. Other modes: MVS with the subsample parameter set to 0.8.
- borderCountint
The number of splits for numerical features. Allowed values are integers from 1 to 65535 inclusively. Default value is 254.
- connectTimeoutdatetime.timedelta, default: datetime.timedelta(milliseconds=60000)
Timeout to wait while establishing socket connections between TrainingDriver and workers.Default is 1 minute
- customMetriclist
Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
- depthint
Depth of the tree. Default value is 6.
- diffusionTemperaturefloat
The diffusion temperature of the Stochastic Gradient Langevin Boosting mode. Only non-negative values are supported. Default value is 10000.
- earlyStoppingRoundsint
Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
- evalMetricstr
The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
- featureBorderTypeEBorderSelectionType
The quantization mode for numerical features. See documentation for details. Default value is ‘GreedyLogSum’
- featureWeightsListlist
Per-feature multiplication weights used when choosing the best split. Array indices correspond to feature indices. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsMap.
- featureWeightsMapdict
Per-feature multiplication weights used when choosing the best split. Map is ‘feature_name’ -> weight. The score of each candidate is multiplied by the weights of features from the current split. This parameter is mutually exclusive with featureWeightsList.
- featuresColstr, default: “features”
features column name
- firstFeatureUsePenaltiesListlist
Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesMap.
- firstFeatureUsePenaltiesMapdict
Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with firstFeatureUsePenaltiesList.
- foldLenMultiplierfloat
Coefficient for changing the length of folds. The value must be greater than 1. The best validation result is achieved with minimum values. Default value is 2.0.
- foldPermutationBlockint
Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation. Default value is 1.
- hasTimebool
Use the order of objects in the input data (do not perform random permutations during the ‘Choosing the tree structure’ stage).
- ignoredFeaturesIndiceslist
Feature indices to exclude from the training
- ignoredFeaturesNameslist
Feature names to exclude from the training
- inputBordersstr
Load custom quantization borders and missing value modes from a file (do not generate them).
- iterationsint
The maximum number of trees that can be built when solving machine learning problems. When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter. Default value is 1000.
- l2LeafRegfloat
Coefficient at the L2 regularization term of the cost function. Any positive value is allowed. Default value is 3.0.
- labelColstr, default: “label”
label column name
- leafEstimationBacktrackingELeavesEstimationStepBacktracking
When the value of the leafEstimationIterations parameter is greater than 1, CatBoost makes several gradient or Newton steps when calculating the resulting leaf values of a tree. The behaviour differs depending on the value of this parameter. See documentation for details. Default value is ‘AnyImprovement’
- leafEstimationIterationsint
CatBoost might calculate leaf values using several gradient or Newton steps instead of a single one. This parameter regulates how many steps are done in every tree when calculating leaf values.
- leafEstimationMethodELeavesEstimation
The method used to calculate the values in leaves. See documentation for details.
- learningRatefloat
The learning rate. Used for reducing the gradient step. The default value is defined automatically for the Logloss, MultiClass and RMSE loss functions depending on the number of iterations if none of ‘leaf_estimation_iterations’, ‘leaf_estimation_method’, ‘l2_leaf_reg’ is set. In this case, the selected learning rate is printed to stdout and saved in the model. In other cases, the default value is 0.03.
- loggingLevelELoggingLevel
The logging level to output to stdout. See documentation for details. Default value is ‘Verbose’
- lossFunctionstr
The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics documentation section for details on each metric).
- metricPeriodint
The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer. The usage of this parameter speeds up the training. Default value is 1.
- modelShrinkModeEModelShrinkMode
Determines how the actual model shrinkage coefficient is calculated at each iteration. See documentation for details. Default value is ‘Constant’
- modelShrinkRatefloat
The constant used to calculate the coefficient for multiplying the model on each iteration. See documentation for details.
- mvsRegfloat
Affects the weight of the denominator and can be used for balancing between the importance and Bernoulli sampling (setting it to 0 implies importance sampling and to +Inf - Bernoulli). Note: This parameter is supported only for the MVS sampling method.
- nanModeENanMode
The method for processing missing values in the input dataset. See documentation for details. Default value is ‘Min’
- odPvalfloat
The threshold for the IncToDec overfitting detector type. The training is stopped when the specified value is reached. Requires that a validation dataset was input. See documentation for details. Turned off by default.
- odTypeEOverfittingDetectorType
The type of the overfitting detector to use. See documentation for details. Default value is ‘IncToDec’
- odWaitint
The number of iterations to continue the training after the iteration with the optimal metric value. See documentation for details. Default value is 20.
- oneHotMaxSizeint
Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.
- penaltiesCoefficientfloat
A single-value common coefficient to multiply all penalties. Non-negative values are supported. Default value is 1.0.
- perFloatFeatureQuantizatonlist
The quantization description for the given list of features (one or more). Description format for a single feature: FeatureId[:border_count=BorderCount][:nan_mode=BorderType][:border_type=border_selection_method]
- perObjectFeaturePenaltiesListlist
Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesMap.
- perObjectFeaturePenaltiesMapdict
Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesList.
- predictionColstr, default: “prediction”
prediction column name
- randomSeedint
The random seed used for training. Default value is 0.
- randomStrengthfloat
The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. See documentation for details. Default value is 1.0
- rsmfloat
Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. The value must be in the range (0;1]. Default value is 1.
- samplingFrequencyESamplingFrequency
Frequency to sample weights and objects when building trees. Default value is ‘PerTreeLevel’
- samplingUnitESamplingUnit
The sampling scheme, see documentation for details. Default value is ‘Object’
- saveSnapshotbool
Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshotInterval parameter to change this period.
- scoreFunctionEScoreFunction
The score type used to select the next split during the tree construction. See documentation for details. Default value is ‘Cosine’
- snapshotFilestr
The name of the file to save the training progress information in. This file is used for recovering training after an interruption.
- snapshotIntervaldatetime.timedelta
The interval between saving snapshots. See documentation for details. Default value is 600 seconds.
- sparkPartitionCountint
The number of partitions used during training. Corresponds to the number of active parallel tasks. Set to the number of active executors by default
- subsamplefloat
Sample rate for bagging. The default value depends on the dataset size and the bootstrap type, see documentation for details.
- threadCountint
Number of CPU threads in parallel operations on client
- trainDirstr
The directory for storing the files on Driver node generated during training. Default value is ‘catboost_info’
- useBestModelbool
If this parameter is set, the number of trees that are saved in the resulting model is selected based on the optimal value of the evalMetric. This option requires a validation dataset to be provided.
- weightColstr
weight column name. If this is not set or empty, we treat all instance weights as 1.0
- workerInitializationTimeoutdatetime.timedelta, default: datetime.timedelta(milliseconds=600000)
Timeout to wait until CatBoost workers on Spark executors are initialized and have sent their info to the master. Depends on dataset size. Default is 10 minutes.
- workerMaxFailuresint, default: 4
Number of individual CatBoost worker failures before giving up training. Should be greater than or equal to 1. Default is 4.
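As a rough illustration of how the keyword-only parameters above might be assembled, the sketch below collects a few of them in a plain dictionary and checks the mutually exclusive List/Map pairs before they are forwarded. The helpers `check_mutually_exclusive` and `apply_params` are hypothetical and not part of catboost_spark, and the parameter values (including the feature name "age") are arbitrary.

```python
import datetime

# Pairs documented above as mutually exclusive.
EXCLUSIVE_PAIRS = [
    ("featureWeightsList", "featureWeightsMap"),
    ("firstFeatureUsePenaltiesList", "firstFeatureUsePenaltiesMap"),
    ("perObjectFeaturePenaltiesList", "perObjectFeaturePenaltiesMap"),
]


def check_mutually_exclusive(params):
    """Hypothetical helper: raise ValueError if both members of a pair are set."""
    for list_name, map_name in EXCLUSIVE_PAIRS:
        if params.get(list_name) is not None and params.get(map_name) is not None:
            raise ValueError(f"{list_name} and {map_name} are mutually exclusive")
    return params


# Arbitrary example values; every key is a parameter name listed above.
params = check_mutually_exclusive({
    "iterations": 500,
    "depth": 6,
    "learningRate": 0.05,
    "featureWeightsMap": {"age": 2.0},
    "connectTimeout": datetime.timedelta(minutes=2),
    "workerMaxFailures": 2,
})


def apply_params(regressor, params):
    """Hypothetical helper: forward the checked keyword arguments to setParams."""
    regressor.setParams(**params)
    return regressor
```

With catboost_spark installed and a Spark session available, `apply_params(catboost_spark.CatBoostRegressor(), params)` would be one way to use this; the check itself runs without Spark.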
- setPenaltiesCoefficient(value)[source]¶
- Parameters
- valuefloat
A single-value common coefficient to multiply all penalties. Non-negative values are supported. Default value is 1.0.
- setPerFloatFeatureQuantizaton(value)[source]¶
- Parameters
- valuelist
The quantization description for the given list of features (one or more). Description format for a single feature: FeatureId[:border_count=BorderCount][:nan_mode=BorderType][:border_type=border_selection_method]
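The description format above can be assembled mechanically. The following is a minimal sketch, assuming each optional part is simply appended in the documented order; `quantization_spec` is a hypothetical helper, not a catboost_spark function, and the feature indices and border count are arbitrary.

```python
def quantization_spec(feature_id, border_count=None, nan_mode=None, border_type=None):
    """Hypothetical helper: build one per-feature description string in the
    FeatureId[:border_count=..][:nan_mode=..][:border_type=..] format."""
    parts = [str(feature_id)]
    if border_count is not None:
        parts.append(f"border_count={border_count}")
    if nan_mode is not None:
        parts.append(f"nan_mode={nan_mode}")
    if border_type is not None:
        parts.append(f"border_type={border_type}")
    return ":".join(parts)


# One description per feature; the resulting list is the kind of value
# that setPerFloatFeatureQuantizaton(value) expects.
specs = [
    quantization_spec(0, border_count=1024),
    quantization_spec(3, nan_mode="Min", border_type="GreedyLogSum"),
]
print(specs)
```

The values ‘Min’ and ‘GreedyLogSum’ are the defaults named for nanMode and featureBorderType on this page; other accepted values are listed in the CatBoost documentation.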
- setPerObjectFeaturePenaltiesList(value)[source]¶
- Parameters
- valuelist
Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Array indices correspond to feature indices. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesMap.
- setPerObjectFeaturePenaltiesMap(value)[source]¶
- Parameters
- valuedict
Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time. Map is ‘feature_name’ -> penalty. See documentation for details. This parameter is mutually exclusive with perObjectFeaturePenaltiesList.
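The List and Map variants carry the same information in two shapes: one indexed by feature position, the other keyed by feature name. The sketch below converts a name-keyed map into an index-ordered list. It assumes that a penalty of 0.0 is an acceptable filler for features absent from the map and that the list should cover every feature; neither assumption is confirmed by this page. `penalties_map_to_list` and the feature names are hypothetical.

```python
def penalties_map_to_list(feature_names, penalties_by_name, default=0.0):
    """Hypothetical helper: order per-feature penalties by feature index.

    feature_names: feature names in the order of their indices.
    penalties_by_name: 'feature_name' -> penalty, as in the Map variant.
    default: assumed filler for features without an explicit penalty.
    """
    unknown = set(penalties_by_name) - set(feature_names)
    if unknown:
        raise KeyError(f"unknown feature names: {sorted(unknown)}")
    return [float(penalties_by_name.get(name, default)) for name in feature_names]


feature_names = ["age", "income", "tenure"]
penalties_list = penalties_map_to_list(feature_names, {"income": 0.5})
print(penalties_list)
```

Only one of the two forms should be passed to the estimator, since the parameters are mutually exclusive.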
- setRandomSeed(value)[source]¶
- Parameters
- valueint
The random seed used for training. Default value is 0.
- setRandomStrength(value)[source]¶
- Parameters
- valuefloat
The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. See documentation for details. Default value is 1.0
- setRsm(value)[source]¶
- Parameters
- valuefloat
Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. The value must be in the range (0;1]. Default value is 1.
- setSamplingFrequency(value)[source]¶
- Parameters
- valueESamplingFrequency
Frequency to sample weights and objects when building trees. Default value is ‘PerTreeLevel’
- setSamplingUnit(value)[source]¶
- Parameters
- valueESamplingUnit
The sampling scheme, see documentation for details. Default value is ‘Object’
- setSaveSnapshot(value)[source]¶
- Parameters
- valuebool
Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshotInterval parameter to change this period.
- setScoreFunction(value)[source]¶
- Parameters
- valueEScoreFunction
The score type used to select the next split during the tree construction. See documentation for details. Default value is ‘Cosine’
- setSnapshotFile(value)[source]¶
- Parameters
- valuestr
The name of the file to save the training progress information in. This file is used for recovering training after an interruption.
- setSnapshotInterval(value)[source]¶
- Parameters
- valuedatetime.timedelta
The interval between saving snapshots. See documentation for details. Default value is 600 seconds.
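The three snapshot-related setters are typically used together. The following sketch shows one possible way to combine them, with the interval expressed as a `datetime.timedelta` as the signature above indicates. `enable_snapshots` is a hypothetical helper, the five-minute interval is arbitrary, and the setters are called as separate statements so that the sketch does not rely on their return values.

```python
import datetime

# Shorter than the documented 600-second default; the value is arbitrary.
snapshot_interval = datetime.timedelta(minutes=5)


def enable_snapshots(regressor, snapshot_file, interval=snapshot_interval):
    """Hypothetical helper: turn on snapshotting for an estimator that
    exposes the setters documented on this page."""
    regressor.setSaveSnapshot(True)
    regressor.setSnapshotFile(snapshot_file)
    regressor.setSnapshotInterval(interval)
    return regressor
```

With catboost_spark available, `enable_snapshots(catboost_spark.CatBoostRegressor(), "training.snapshot")` would be a possible call; the file name is a placeholder.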
- setSparkPartitionCount(value)[source]¶
- Parameters
- valueint
The number of partitions used during training. Corresponds to the number of active parallel tasks. Set to the number of active executors by default
- setSubsample(value)[source]¶
- Parameters
- valuefloat
Sample rate for bagging. The default value depends on the dataset size and the bootstrap type, see documentation for details.
- setThreadCount(value)[source]¶
- Parameters
- valueint
Number of CPU threads in parallel operations on client
- setTrainDir(value)[source]¶
- Parameters
- valuestr
The directory for storing the files on Driver node generated during training. Default value is ‘catboost_info’
- setUseBestModel(value)[source]¶
- Parameters
- valuebool
If this parameter is set, the number of trees that are saved in the resulting model is selected based on the optimal value of the evalMetric. This option requires a validation dataset to be provided.
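Because useBestModel needs a validation dataset, it is usually combined with the extended `fit` variant that accepts evaluation datasets. The sketch below is one possible arrangement under stated assumptions: `best_model_params` and `fit_with_validation` are hypothetical helpers, the metric and round count are arbitrary, and wrapping DataFrames with `catboost_spark.Pool` is assumed to match the installed version of the package.

```python
def best_model_params(early_stopping_rounds=50, eval_metric="RMSE"):
    """Hypothetical helper: keyword arguments that select the best iteration
    on a validation set and stop early once the metric stops improving."""
    return {
        "useBestModel": True,
        "evalMetric": eval_metric,
        "earlyStoppingRounds": early_stopping_rounds,
    }


def fit_with_validation(train_df, eval_df):
    """Sketch only: requires catboost_spark and an active Spark session.

    train_df and eval_df are assumed to be Spark DataFrames with the
    default 'features' and 'label' columns.
    """
    import catboost_spark  # imported lazily so the helper above works without it

    regressor = catboost_spark.CatBoostRegressor(**best_model_params())
    train_pool = catboost_spark.Pool(train_df)
    eval_pool = catboost_spark.Pool(eval_df)
    # evalDatasets supplies the validation data that useBestModel requires.
    return regressor.fit(train_pool, evalDatasets=[eval_pool])
```

Keeping the parameter dictionary separate from the Spark-dependent call makes the configuration easy to inspect or reuse across estimators.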
- setWeightCol(value)[source]¶
- Parameters
- valuestr
weight column name. If this is not set or empty, we treat all instance weights as 1.0