CatBoostRegressionModel

class catboost_spark.CatBoostRegressionModel(java_model)[source]

Bases: pyspark.ml.regression.JavaRegressionModel, pyspark.ml.util.MLReadable, pyspark.ml.util.JavaMLWritable

Regression model trained by CatBoost. Use CatBoostRegressor to train it.
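
A minimal end-to-end sketch, assuming the catboost-spark package is already on the Spark classpath (the schema and data below are illustrative):

from pyspark.sql import SparkSession, Row
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import DoubleType, StructField, StructType
import catboost_spark

spark = (SparkSession.builder
         .master("local[*]")
         .appName("CatBoostRegressionExample")
         .getOrCreate())

srcDataSchema = StructType([
    StructField("features", VectorUDT()),
    StructField("label", DoubleType())
])
trainData = [
    Row(Vectors.dense(0.1, 0.2, 0.11), 0.12),
    Row(Vectors.dense(0.97, 0.82, 0.33), 1.1),
    Row(Vectors.dense(0.13, 0.22, 0.23), 2.1),
    Row(Vectors.dense(0.8, 0.62, 0.0), 0.0),
]
trainDf = spark.createDataFrame(spark.sparkContext.parallelize(trainData), srcDataSchema)
trainPool = catboost_spark.Pool(trainDf)

regressor = catboost_spark.CatBoostRegressor()
model = regressor.fit(trainPool)  # model is a CatBoostRegressionModel

# Apply the model; the prediction column name defaults to "prediction"
predictions = model.transform(trainPool.data)
predictions.show()

Later examples on this page reuse model and trainPool from this sketch.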

Methods Summary

getFeatureImportance([fstrType, data, calcType])

Calculate feature importances (ordered by the features' order in the model).

getFeatureImportanceInteraction()

Calculate pairwise feature interaction scores.

getFeatureImportancePrettified([fstrType, …])

Calculate feature importances sorted in descending order by importance.

getFeatureImportanceShapInteractionValues(data)

Calculate SHAP interaction values; all feature pairs are used if neither featureIndices nor featureNames is specified.

getFeatureImportanceShapValues(data[, …])

Calculate SHAP values for the given dataset.

getFeaturesCol()

Return the features column name.

getLabelCol()

Return the label column name.

getPredictionCol()

Return the prediction column name.

loadNativeModel(fileName[, format])

Load the model from a local file.

read()

Returns an MLReader instance for this class.

saveNativeModel(fileName[, format, …])

Save the model to a local file.

setFeaturesCol(value)

Set the features column name.

setLabelCol(value)

Set the label column name.

setParams([featuresCol, labelCol, predictionCol])

Set the (keyword only) parameters.

setPredictionCol(value)

Set the prediction column name.

transformPool(pool)

Apply the model to a Pool; useful when the dataset has already been quantized, but works with any Pool.

Methods Documentation

getFeatureImportance(fstrType=EFstrType.FeatureImportance, data=None, calcType=ECalcTypeShapValues.Regular)[source]
Parameters
fstrType : EFstrType

Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff

data : Pool

If fstrType is PredictionDiff, this parameter is required and must contain 2 samples. If fstrType is PredictionValuesChange, it is required only if the model was explicitly trained with the flag to store no leaf weights. Otherwise it can be None.

calcType : ECalcTypeShapValues

Used only for PredictionValuesChange. Possible values:

  • Regular

    Calculate regular SHAP values

  • Approximate

    Calculate approximate SHAP values

  • Exact

    Calculate exact SHAP values

Returns
list of float

array of feature importances (index corresponds to the order of features in the model)
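
A brief sketch, reusing model and trainPool from the class-level example above:

from catboost_spark import EFstrType

# Default fstrType is EFstrType.FeatureImportance
importances = model.getFeatureImportance()
for featureIdx, importance in enumerate(importances):
    print(featureIdx, importance)

# LossFunctionChange requires a dataset to evaluate on
lossChangeImportances = model.getFeatureImportance(
    fstrType=EFstrType.LossFunctionChange,
    data=trainPool)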

getFeatureImportanceInteraction()[source]
Returns
list of FeatureInteractionScore
getFeatureImportancePrettified(fstrType=EFstrType.FeatureImportance, data=None, calcType=ECalcTypeShapValues.Regular)[source]
Parameters
fstrType : EFstrType

Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff

data : Pool

If fstrType is PredictionDiff, this parameter is required and must contain 2 samples. If fstrType is PredictionValuesChange, it is required only if the model was explicitly trained with the flag to store no leaf weights. Otherwise it can be None.

calcType : ECalcTypeShapValues

Used only for PredictionValuesChange. Possible values:

  • Regular

    Calculate regular SHAP values

  • Approximate

    Calculate approximate SHAP values

  • Exact

    Calculate exact SHAP values

Returns
list of FeatureImportance

array of feature importances sorted in descending order by importance
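
A short sketch; the featureName()/importance() accessors on the returned FeatureImportance objects are an assumption about the Py4J wrapper, not documented above:

prettified = model.getFeatureImportancePrettified()
for fi in prettified:  # sorted in descending order by importance
    print(fi.featureName(), fi.importance())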

getFeatureImportanceShapInteractionValues(data, featureIndices=None, featureNames=None, preCalcMode=EPreCalcShapValues.Auto, calcType=ECalcTypeShapValues.Regular, outputColumns=None)[source]
SHAP interaction values are calculated for all feature pairs if neither featureIndices nor featureNames is specified.

Parameters
data : Pool

dataset to calculate SHAP interaction values for

featureIndices : (int, int), optional

pair of feature indices to calculate SHAP interaction values for.

featureNames : (str, str), optional

pair of feature names to calculate SHAP interaction values for.

preCalcMode : EPreCalcShapValues

Possible values:

  • Auto

    Use direct SHAP values calculation only if the dataset size is smaller than the average number of leaves (the best of the two strategies below is chosen).

  • UsePreCalc

    Calculate SHAP values for every leaf in preprocessing. Final complexity is O(NT(D+F)) + O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features per tree, and L is the average number of leaves per tree. This is much faster (because of a smaller constant) than direct calculation when N >> L.

  • NoPreCalc

    Use direct SHAP values calculation with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888).

calcType : ECalcTypeShapValues

Possible values:

  • Regular

    Calculate regular SHAP values

  • Approximate

    Calculate approximate SHAP values

  • Exact

    Calculate exact SHAP values

outputColumns : list of str

columns from data to add to the output DataFrame; if None, all columns are added

Returns
DataFrame
  • for regression and binclass models: contains outputColumns and “featureIdx1”, “featureIdx2”, “shapInteractionValue” columns

  • for multiclass models: contains outputColumns and “classIdx”, “featureIdx1”, “featureIdx2”, “shapInteractionValue” columns
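
A brief sketch, continuing the class-level example (the feature index pair is illustrative):

shapInteractions = model.getFeatureImportanceShapInteractionValues(
    data=trainPool,
    featureIndices=(0, 1))
shapInteractions.select(
    "featureIdx1", "featureIdx2", "shapInteractionValue").show()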

getFeatureImportanceShapValues(data, preCalcMode=EPreCalcShapValues.Auto, calcType=ECalcTypeShapValues.Regular, modelOutputType=EExplainableModelOutput.Raw, referenceData=None, outputColumns=None)[source]
Parameters
data : Pool

dataset to calculate SHAP values for

preCalcMode : EPreCalcShapValues
Possible values:
  • Auto

    Use direct SHAP values calculation only if the dataset size is smaller than the average number of leaves (the best of the two strategies below is chosen).

  • UsePreCalc

    Calculate SHAP values for every leaf in preprocessing. Final complexity is O(NT(D+F)) + O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features per tree, and L is the average number of leaves per tree. This is much faster (because of a smaller constant) than direct calculation when N >> L.

  • NoPreCalc

    Use direct SHAP values calculation with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888).

calcType : ECalcTypeShapValues

Possible values:

  • Regular

    Calculate regular SHAP values

  • Approximate

    Calculate approximate SHAP values

  • Exact

    Calculate exact SHAP values

referenceData : Pool

reference data for Independent Tree SHAP values from https://arxiv.org/abs/1905.04610v1. If referenceData is not None, Independent Tree SHAP values are calculated.

outputColumns : list of str

columns from data to add to the output DataFrame; if None, all columns are added

Returns
DataFrame
  • for regression and binclass models: contains outputColumns and “shapValues” column with Vector of length (n_features + 1) with SHAP values

  • for multiclass models: contains outputColumns and “shapValues” column with Matrix of shape (n_classes x (n_features + 1)) with SHAP values
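
A brief sketch, continuing the class-level example:

shapValues = model.getFeatureImportanceShapValues(trainPool)
# For this regression model "shapValues" holds a Vector of length n_features + 1
shapValues.select("shapValues").show(truncate=False)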

getFeaturesCol()[source]
Returns
str

features column name

getLabelCol()[source]
Returns
str

label column name

getPredictionCol()[source]
Returns
str

prediction column name

static loadNativeModel(fileName, format=EModelType.CatboostBinary)[source]

Load the model from a local file. See https://catboost.ai/docs/concepts/python-reference_catboostclassifier_load_model.html for a detailed description of the parameters.

classmethod read()[source]

Returns an MLReader instance for this class.

saveNativeModel(fileName, format=EModelType.CatboostBinary, exportParameters=None, pool=None)[source]

Save the model to a local file. See https://catboost.ai/docs/concepts/python-reference_catboostclassifier_save_model.html for a detailed description of the parameters.
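
A round-trip sketch for saveNativeModel/loadNativeModel, continuing the class-level example (the path is illustrative):

# Saves in the default CatboostBinary format
model.saveNativeModel("/tmp/regression_model.cbm")

restoredModel = catboost_spark.CatBoostRegressionModel.loadNativeModel(
    "/tmp/regression_model.cbm")
restoredModel.transform(trainPool.data).show()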

setFeaturesCol(value)[source]
Parameters
value : str

features column name

setLabelCol(value)[source]
Parameters
value : str

label column name

setParams(featuresCol='features', labelCol='label', predictionCol='prediction')[source]

Set the (keyword only) parameters

Parameters
featuresCol : str, default: “features”

features column name

labelCol : str, default: “label”

label column name

predictionCol : str, default: “prediction”

prediction column name

setPredictionCol(value)[source]
Parameters
value : str

prediction column name

transformPool(pool)[source]

This function applies the model to the Pool's data. It is useful when the dataset has already been quantized, but it works with any Pool.
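
A brief sketch, continuing the class-level example (Pool.quantize() with default parameters is assumed here):

# Quantize once, then predict on the quantized Pool directly
quantizedPool = trainPool.quantize()
predictions = model.transformPool(quantizedPool)
predictions.show()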