CatBoostClassificationModel¶
- class catboost_spark.CatBoostClassificationModel(java_model)[source]¶
Bases: pyspark.ml.classification.JavaProbabilisticClassificationModel, pyspark.ml.util.MLReadable, pyspark.ml.util.JavaMLWritable
Classification model trained by CatBoost. Use CatBoostClassifier to train it.
Methods Summary
- getFeatureImportance([fstrType, data, calcType])
- getFeatureImportancePrettified([fstrType, …])
- getFeatureImportanceShapInteractionValues(data[, …]): SHAP interaction values are calculated for all feature pairs if neither featureIndices nor featureNames is specified.
- getFeatureImportanceShapValues(data[, …])
- getProbabilityCol()
- getThresholds()
- loadNativeModel(fileName[, format]): Load the model from a local file.
- read(): Returns an MLReader instance for this class.
- saveNativeModel(fileName[, format, …]): Save the model to a local file.
- setFeaturesCol(value)
- setLabelCol(value)
- setParams([featuresCol, labelCol, …]): Set the (keyword only) parameters.
- setPredictionCol(value)
- setProbabilityCol(value)
- setRawPredictionCol(value)
- setThresholds(value)
- transformPool(pool): Useful when the dataset has already been quantized, but works with any Pool.
Methods Documentation
- getFeatureImportance(fstrType=EFstrType.FeatureImportance, data=None, calcType=ECalcTypeShapValues.Regular)[source]¶
- Parameters
- fstrType : EFstrType
Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff
- data : Pool
If fstrType is PredictionDiff, this parameter is required and must contain 2 samples. If fstrType is PredictionValuesChange, it is required only if the model was explicitly trained with the flag to store no leaf weights; otherwise it can be null
- calcType : ECalcTypeShapValues
Used only for PredictionValuesChange. Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- Returns
- list of float
Array of feature importances (index corresponds to the order of features in the model)
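For illustration, getFeatureImportance returns one float per feature in the model's feature order, so pairing the result with feature names and sorting reproduces what getFeatureImportancePrettified returns. A plain-Python sketch with made-up importances and feature names (not taken from any real model):

```python
# Hypothetical values as returned by model.getFeatureImportance():
# one float per feature, in the model's feature order.
importances = [12.4, 0.8, 31.7, 5.1]
feature_names = ["age", "income", "tenure", "clicks"]  # assumed names

# Pair names with importances and sort in descending order by importance,
# mirroring what getFeatureImportancePrettified does.
prettified = sorted(
    zip(feature_names, importances),
    key=lambda pair: pair[1],
    reverse=True,
)
```

With these values, `prettified` lists "tenure" first, since it has the largest importance.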
- getFeatureImportancePrettified(fstrType=EFstrType.FeatureImportance, data=None, calcType=ECalcTypeShapValues.Regular)[source]¶
- Parameters
- fstrType : EFstrType
Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff
- data : Pool
If fstrType is PredictionDiff, this parameter is required and must contain 2 samples. If fstrType is PredictionValuesChange, it is required only if the model was explicitly trained with the flag to store no leaf weights; otherwise it can be null
- calcType : ECalcTypeShapValues
Used only for PredictionValuesChange. Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- Returns
- list of FeatureImportance
Array of feature importances sorted in descending order by importance
- getFeatureImportanceShapInteractionValues(data, featureIndices=None, featureNames=None, preCalcMode=EPreCalcShapValues.Auto, calcType=ECalcTypeShapValues.Regular, outputColumns=None)[source]¶
- SHAP interaction values are calculated for all feature pairs if neither featureIndices nor featureNames is specified.
- Parameters
- data : Pool
Dataset to calculate SHAP interaction values for
- featureIndices : (int, int), optional
Pair of feature indices to calculate SHAP interaction values for
- featureNames : (str, str), optional
Pair of feature names to calculate SHAP interaction values for
- preCalcMode : EPreCalcShapValues
Possible values:
- Auto
Choose the best of the two strategies below: direct SHAP values calculation is used only if the dataset size is smaller than the average number of leaves
- UsePreCalc
Calculate SHAP values for every leaf in preprocessing. Final complexity is O(NT(D+F)) + O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features in a tree, and L is the average number of leaves in a tree. This is much faster (because of a smaller constant) than direct calculation when N >> L
- NoPreCalc
Use direct SHAP values calculation with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888)
- calcType : ECalcTypeShapValues
Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- outputColumns : list of str
Columns from data to add to the output DataFrame; if None, add all columns
- Returns
- DataFrame
For regression and binclass models: contains outputColumns and “featureIdx1”, “featureIdx2”, “shapInteractionValue” columns
For multiclass models: contains outputColumns and “classIdx”, “featureIdx1”, “featureIdx2”, “shapInteractionValue” columns
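The returned (featureIdx1, featureIdx2, shapInteractionValue) rows for one object can be pivoted into a symmetric per-object interaction matrix. A plain-Python sketch with made-up values (the column triplet follows the description above; the numbers are hypothetical):

```python
# Hypothetical (featureIdx1, featureIdx2, shapInteractionValue) rows for a
# single object of a 3-feature binclass model; only the upper triangle is listed.
rows = [(0, 0, 0.50), (0, 1, -0.10), (0, 2, 0.02),
        (1, 1, 0.30), (1, 2, 0.05), (2, 2, -0.20)]

n_features = 3
# SHAP interaction values are symmetric: value(i, j) == value(j, i),
# so mirror each entry across the diagonal.
matrix = [[0.0] * n_features for _ in range(n_features)]
for i, j, value in rows:
    matrix[i][j] = value
    matrix[j][i] = value
```

The diagonal entries hold each feature's main effect, off-diagonal entries the pairwise interactions.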
- getFeatureImportanceShapValues(data, preCalcMode=EPreCalcShapValues.Auto, calcType=ECalcTypeShapValues.Regular, modelOutputType=EExplainableModelOutput.Raw, referenceData=None, outputColumns=None)[source]¶
- Parameters
- data : Pool
Dataset to calculate SHAP values for
- preCalcMode : EPreCalcShapValues
Possible values:
- Auto
Choose the best of the two strategies below: direct SHAP values calculation is used only if the dataset size is smaller than the average number of leaves
- UsePreCalc
Calculate SHAP values for every leaf in preprocessing. Final complexity is O(NT(D+F)) + O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features in a tree, and L is the average number of leaves in a tree. This is much faster (because of a smaller constant) than direct calculation when N >> L
- NoPreCalc
Use direct SHAP values calculation with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888)
- calcType : ECalcTypeShapValues
Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- referenceData : Pool
Reference data for Independent Tree SHAP values from https://arxiv.org/abs/1905.04610v1. If referenceData is not null, Independent Tree SHAP values are calculated
- outputColumns : list of str
Columns from data to add to the output DataFrame; if None, add all columns
- Returns
- DataFrame
For regression and binclass models: contains outputColumns and a “shapValues” column with a Vector of length (n_features + 1) with SHAP values
For multiclass models: contains outputColumns and a “shapValues” column with a Matrix of shape (n_classes x (n_features + 1)) with SHAP values
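The preCalcMode trade-off described above can be checked numerically from the stated complexities. The sketch below ignores constant factors and uses made-up model dimensions, so it only illustrates the asymptotic argument:

```python
def precalc_cost(N, T, D, F, L):
    # UsePreCalc: O(N*T*(D+F)) + O(T*L^2*D^2)
    return N * T * (D + F) + T * L**2 * D**2

def direct_cost(N, T, D, F, L):
    # NoPreCalc: O(N*T*L*D^2)
    return N * T * L * D**2

# Made-up model shape: 1000 trees, average depth 6, ~6 features
# and 64 leaves per tree.
T, D, F, L = 1000, 6, 6, 64

# N >> L: the per-object term N*T*(D+F) is much cheaper than N*T*L*D^2,
# so precalc wins despite its one-time T*L^2*D^2 preprocessing.
precalc_wins_on_many = precalc_cost(1_000_000, T, D, F, L) < direct_cost(1_000_000, T, D, F, L)

# N < L: the preprocessing term dominates and direct calculation is cheaper.
direct_wins_on_few = direct_cost(32, T, D, F, L) < precalc_cost(32, T, D, F, L)
```

This matches the guidance in the parameter description: prefer UsePreCalc when N >> L, NoPreCalc when N < L, and Auto to let the library choose.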
- getProbabilityCol()[source]¶
- Returns
- str
Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- getThresholds()[source]¶
- Returns
- list
Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold
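The p/t selection rule above can be sketched in plain Python. The probabilities and thresholds below are made up for illustration and all thresholds are taken as strictly positive:

```python
# Hypothetical per-class probabilities from the model and user-set thresholds.
probabilities = [0.5, 0.3, 0.2]
thresholds = [0.9, 0.4, 0.5]

# Predict the class k with the largest p_k / t_k, where p_k is the class
# probability and t_k is that class's threshold.
predicted = max(range(len(probabilities)),
                key=lambda k: probabilities[k] / thresholds[k])
```

Here class 1 is predicted even though class 0 has the highest raw probability, because 0.3/0.4 = 0.75 exceeds 0.5/0.9 ≈ 0.56 and 0.2/0.5 = 0.4.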
- static loadNativeModel(fileName, format=EModelType.CatboostBinary)[source]¶
Load the model from a local file. See https://catboost.ai/docs/concepts/python-reference_catboostclassifier_load_model.html for a detailed description of the parameters
- saveNativeModel(fileName, format=EModelType.CatboostBinary, exportParameters=None, pool=None)[source]¶
Save the model to a local file. See https://catboost.ai/docs/concepts/python-reference_catboostclassifier_save_model.html for a detailed description of the parameters
- setParams(featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', thresholds=None)[source]¶
Set the (keyword only) parameters.
- Parameters
- featuresColstr, default: “features”
features column name
- labelColstr, default: “label”
label column name
- predictionColstr, default: “prediction”
prediction column name
- probabilityColstr, default: “probability”
Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- rawPredictionColstr, default: “rawPrediction”
raw prediction (a.k.a. confidence) column name
- thresholdslist
Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold
- setProbabilityCol(value)[source]¶
- Parameters
- valuestr
Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- setRawPredictionCol(value)[source]¶
- Parameters
- valuestr
raw prediction (a.k.a. confidence) column name
- setThresholds(value)[source]¶
- Parameters
- valuelist
Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold