CatBoostClassificationModel¶
- class catboost_spark.CatBoostClassificationModel(java_model)[source]¶
Bases: pyspark.ml.classification.JavaProbabilisticClassificationModel, pyspark.ml.util.MLReadable, pyspark.ml.util.JavaMLWritable
Classification model trained by CatBoost. Use CatBoostClassifier to train it.
Methods Summary
- getFeatureImportance([fstrType, data, calcType])
- getFeatureImportancePrettified([fstrType, …])
- getFeatureImportanceShapInteractionValues(data[, …]): SHAP interaction values are calculated for all feature pairs if neither featureIndices nor featureNames is specified.
- getFeatureImportanceShapValues(data[, …])
- getProbabilityCol()
- getThresholds()
- loadNativeModel(fileName[, format]): Load the model from a local file.
- read(): Returns an MLReader instance for this class.
- saveNativeModel(fileName[, format, …]): Save the model to a local file.
- setFeaturesCol(value)
- setLabelCol(value)
- setParams([featuresCol, labelCol, …]): Set the (keyword only) parameters.
- setPredictionCol(value)
- setProbabilityCol(value)
- setRawPredictionCol(value)
- setThresholds(value)
- transformPool(pool): Useful when the dataset has already been quantized, but works with any Pool.
Methods Documentation
- getFeatureImportance(fstrType=EFstrType.FeatureImportance, data=None, calcType=ECalcTypeShapValues.Regular)[source]¶
- Parameters
- fstrType : EFstrType
Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff
- data : Pool
If fstrType is PredictionDiff, this parameter is required and must contain 2 samples. If fstrType is PredictionValuesChange, it is required only if the model was explicitly trained with the flag to store no leaf weights; otherwise it can be null
- calcType : ECalcTypeShapValues
Used only for PredictionValuesChange. Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- Returns
- list of float
Array of feature importances (index corresponds to the order of features in the model)
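For illustration, getFeatureImportance returns one float per feature in the model's feature order, so pairing the result with feature names and sorting reproduces what getFeatureImportancePrettified returns. A plain-Python sketch with made-up importances and feature names (not taken from any real model):

```python
# Hypothetical values as returned by model.getFeatureImportance():
# one float per feature, in the model's feature order.
importances = [12.4, 0.8, 31.7, 5.1]
feature_names = ["age", "income", "tenure", "clicks"]  # assumed names

# Pair names with importances and sort in descending order by importance,
# mirroring what getFeatureImportancePrettified does.
prettified = sorted(
    zip(feature_names, importances),
    key=lambda pair: pair[1],
    reverse=True,
)
```

With these values, `prettified` lists "tenure" first, since it has the largest importance.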
- getFeatureImportancePrettified(fstrType=EFstrType.FeatureImportance, data=None, calcType=ECalcTypeShapValues.Regular)[source]¶
- Parameters
- fstrType : EFstrType
Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff
- data : Pool
If fstrType is PredictionDiff, this parameter is required and must contain 2 samples. If fstrType is PredictionValuesChange, it is required only if the model was explicitly trained with the flag to store no leaf weights; otherwise it can be null
- calcType : ECalcTypeShapValues
Used only for PredictionValuesChange. Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- Returns
- list of FeatureImportance
Array of feature importances sorted in descending order by importance
- getFeatureImportanceShapInteractionValues(data, featureIndices=None, featureNames=None, preCalcMode=EPreCalcShapValues.Auto, calcType=ECalcTypeShapValues.Regular, outputColumns=None)[source]¶
- SHAP interaction values are calculated for all feature pairs if neither featureIndices nor featureNames is specified.
- Parameters
- data : Pool
Dataset to calculate SHAP interaction values for
- featureIndices : (int, int), optional
Pair of feature indices to calculate SHAP interaction values for
- featureNames : (str, str), optional
Pair of feature names to calculate SHAP interaction values for
- preCalcMode : EPreCalcShapValues
Possible values:
- Auto
Choose the best of the two strategies below: direct SHAP values calculation is used only if the dataset size is smaller than the average number of leaves
- UsePreCalc
Calculate SHAP values for every leaf in preprocessing. Final complexity is O(NT(D+F)) + O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features in a tree, and L is the average number of leaves in a tree. This is much faster (because of a smaller constant) than direct calculation when N >> L
- NoPreCalc
Use direct SHAP values calculation with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888)
- calcType : ECalcTypeShapValues
Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- outputColumns : list of str
Columns from data to add to the output DataFrame; if None, add all columns
- Returns
- DataFrame
For regression and binclass models: contains outputColumns and “featureIdx1”, “featureIdx2”, “shapInteractionValue” columns
For multiclass models: contains outputColumns and “classIdx”, “featureIdx1”, “featureIdx2”, “shapInteractionValue” columns
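The returned (featureIdx1, featureIdx2, shapInteractionValue) rows for one object can be pivoted into a symmetric per-object interaction matrix. A plain-Python sketch with made-up values (the column triplet follows the description above; the numbers are hypothetical):

```python
# Hypothetical (featureIdx1, featureIdx2, shapInteractionValue) rows for a
# single object of a 3-feature binclass model; only the upper triangle is listed.
rows = [(0, 0, 0.50), (0, 1, -0.10), (0, 2, 0.02),
        (1, 1, 0.30), (1, 2, 0.05), (2, 2, -0.20)]

n_features = 3
# SHAP interaction values are symmetric: value(i, j) == value(j, i),
# so mirror each entry across the diagonal.
matrix = [[0.0] * n_features for _ in range(n_features)]
for i, j, value in rows:
    matrix[i][j] = value
    matrix[j][i] = value
```

The diagonal entries hold each feature's main effect, off-diagonal entries the pairwise interactions.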
- getFeatureImportanceShapValues(data, preCalcMode=EPreCalcShapValues.Auto, calcType=ECalcTypeShapValues.Regular, modelOutputType=EExplainableModelOutput.Raw, referenceData=None, outputColumns=None)[source]¶
- Parameters
- data : Pool
Dataset to calculate SHAP values for
- preCalcMode : EPreCalcShapValues
Possible values:
- Auto
Choose the best of the two strategies below: direct SHAP values calculation is used only if the dataset size is smaller than the average number of leaves
- UsePreCalc
Calculate SHAP values for every leaf in preprocessing. Final complexity is O(NT(D+F)) + O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features in a tree, and L is the average number of leaves in a tree. This is much faster (because of a smaller constant) than direct calculation when N >> L
- NoPreCalc
Use direct SHAP values calculation with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888)
- calcType : ECalcTypeShapValues
Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- referenceData : Pool
Reference data for Independent Tree SHAP values from https://arxiv.org/abs/1905.04610v1. If referenceData is not null, Independent Tree SHAP values are calculated
- outputColumns : list of str
Columns from data to add to the output DataFrame; if None, add all columns
- Returns
- DataFrame
For regression and binclass models: contains outputColumns and a “shapValues” column with a Vector of length (n_features + 1) with SHAP values
For multiclass models: contains outputColumns and a “shapValues” column with a Matrix of shape (n_classes x (n_features + 1)) with SHAP values
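The preCalcMode trade-off described above can be checked numerically from the stated complexities. The sketch below ignores constant factors and uses made-up model dimensions, so it only illustrates the asymptotic argument:

```python
def precalc_cost(N, T, D, F, L):
    # UsePreCalc: O(N*T*(D+F)) + O(T*L^2*D^2)
    return N * T * (D + F) + T * L**2 * D**2

def direct_cost(N, T, D, F, L):
    # NoPreCalc: O(N*T*L*D^2)
    return N * T * L * D**2

# Made-up model shape: 1000 trees, average depth 6, ~6 features
# and 64 leaves per tree.
T, D, F, L = 1000, 6, 6, 64

# N >> L: the per-object term N*T*(D+F) is much cheaper than N*T*L*D^2,
# so precalc wins despite its one-time T*L^2*D^2 preprocessing.
precalc_wins_on_many = precalc_cost(1_000_000, T, D, F, L) < direct_cost(1_000_000, T, D, F, L)

# N < L: the preprocessing term dominates and direct calculation is cheaper.
direct_wins_on_few = direct_cost(32, T, D, F, L) < precalc_cost(32, T, D, F, L)
```

This matches the guidance in the parameter description: prefer UsePreCalc when N >> L, NoPreCalc when N < L, and Auto to let the library choose.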
- getProbabilityCol()[source]¶
- Returns
- str
Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- getThresholds()[source]¶
- Returns
- list
Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold
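The p/t selection rule above can be sketched in plain Python. The probabilities and thresholds below are made up for illustration and all thresholds are taken as strictly positive:

```python
# Hypothetical per-class probabilities from the model and user-set thresholds.
probabilities = [0.5, 0.3, 0.2]
thresholds = [0.9, 0.4, 0.5]

# Predict the class k with the largest p_k / t_k, where p_k is the class
# probability and t_k is that class's threshold.
predicted = max(range(len(probabilities)),
                key=lambda k: probabilities[k] / thresholds[k])
```

Here class 1 is predicted even though class 0 has the highest raw probability, because 0.3/0.4 = 0.75 exceeds 0.5/0.9 ≈ 0.56 and 0.2/0.5 = 0.4.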
- static loadNativeModel(fileName, format=EModelType.CatboostBinary)[source]¶
Load the model from a local file. See https://catboost.ai/docs/concepts/python-reference_catboostclassifier_load_model.html for a detailed description of the parameters
- saveNativeModel(fileName, format=EModelType.CatboostBinary, exportParameters=None, pool=None)[source]¶
Save the model to a local file. See https://catboost.ai/docs/concepts/python-reference_catboostclassifier_save_model.html for a detailed description of the parameters
- setParams(featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', thresholds=None)[source]¶
Set the (keyword only) parameters.
- Parameters
- featuresColstr, default: “features”
features column name
- labelColstr, default: “label”
label column name
- predictionColstr, default: “prediction”
prediction column name
- probabilityColstr, default: “probability”
Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- rawPredictionColstr, default: “rawPrediction”
raw prediction (a.k.a. confidence) column name
- thresholdslist
Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold
- setProbabilityCol(value)[source]¶
- Parameters
- valuestr
Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- setRawPredictionCol(value)[source]¶
- Parameters
- valuestr
raw prediction (a.k.a. confidence) column name
- setThresholds(value)[source]¶
- Parameters
- valuelist
Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold