CatBoostRegressionModel
- class catboost_spark.CatBoostRegressionModel(java_model)
Bases: pyspark.ml.regression.JavaRegressionModel, pyspark.ml.util.MLReadable, pyspark.ml.util.JavaMLWritable
Regression model trained by CatBoost. Use CatBoostRegressor to train it.
Methods Summary
getFeatureImportance
([fstrType, data, calcType])- Parameters
- Returns
getFeatureImportancePrettified
([fstrType, …])- Parameters
SHAP interaction values are calculated for all feature pairs if neither featureIndices nor featureNames
getFeatureImportanceShapValues
(data[, …])- Parameters
- Returns
- Returns
- Returns
loadNativeModel
(fileName[, format])Load the model from a local file.
read
()Returns an MLReader instance for this class.
saveNativeModel
(fileName[, format, …])Save the model to a local file.
setFeaturesCol
(value)- Parameters
setLabelCol
(value)- Parameters
setParams
([featuresCol, labelCol, predictionCol])Set the (keyword only) parameters
setPredictionCol
(value)- Parameters
transformPool
(pool)This function is useful when the dataset has been already quantized but works with any Pool
Methods Documentation
- getFeatureImportance(fstrType=EFstrType.FeatureImportance, data=None, calcType=ECalcTypeShapValues.Regular)
- Parameters
- fstrTypeEFstrType
Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff
- dataPool
Required if fstrType is PredictionDiff, in which case it must contain 2 samples. Also required if fstrType is PredictionValuesChange and the model was explicitly trained with the flag to store no leaf weights. Otherwise it can be None.
- calcTypeECalcTypeShapValues
Used only for PredictionValuesChange. Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- Returns
- list of float
array of feature importances (index corresponds to the order of features in the model)
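Because the returned list is positional, it is usually paired with feature names before it is read. A minimal sketch, assuming `importances` came from a call such as `model.getFeatureImportance()` and that the caller supplies `feature_names` in the same order as the features in the model. The helper `rank_importances` and the sample names and values are hypothetical and not part of catboost_spark.

```python
def rank_importances(feature_names, importances):
    """Pair positional importances with names and sort by importance, descending."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances must have the same length")
    pairs = zip(feature_names, importances)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Illustrative values only; in practice: importances = model.getFeatureImportance()
ranked = rank_importances(["age", "income", "tenure"], [12.5, 60.0, 27.5])
for name, score in ranked:
    print(f"{name}: {score}")
```

The length check guards against silently misaligned names, which is the main risk when the list carries no labels of its own.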
- getFeatureImportancePrettified(fstrType=EFstrType.FeatureImportance, data=None, calcType=ECalcTypeShapValues.Regular)
- Parameters
- fstrTypeEFstrType
Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff
- dataPool
Required if fstrType is PredictionDiff, in which case it must contain 2 samples. Also required if fstrType is PredictionValuesChange and the model was explicitly trained with the flag to store no leaf weights. Otherwise it can be None.
- calcTypeECalcTypeShapValues
Used only for PredictionValuesChange. Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- Returns
- list of FeatureImportance
array of feature importances sorted in descending order by importance
- getFeatureImportanceShapInteractionValues(data, featureIndices=None, featureNames=None, preCalcMode=EPreCalcShapValues.Auto, calcType=ECalcTypeShapValues.Regular, outputColumns=None)
- SHAP interaction values are calculated for all feature pairs if neither featureIndices nor featureNames
is specified.
- Parameters
- dataPool
dataset to calculate SHAP interaction values
- featureIndices(int, int), optional
pair of feature indices to calculate SHAP interaction values for.
- featureNames(str, str), optional
pair of feature names to calculate SHAP interaction values for.
- preCalcModeEPreCalcShapValues
Possible values:
- Auto
Use direct SHAP Values calculation only if data size is smaller than average leaves number (the best of two strategies below is chosen).
- UsePreCalc
Calculate SHAP Values for every leaf in preprocessing. Final complexity is O(NT(D+F))+O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features in a tree, and L is the average number of leaves in a tree. This is much faster (because of a smaller constant) than direct calculation when N >> L
- NoPreCalc
Use direct SHAP Values calculation with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888)
- calcTypeECalcTypeShapValues
Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- outputColumnslist of str
columns from data to add to the output DataFrame; if None, all columns are added
- Returns
- DataFrame
for regression and binclass models: contains outputColumns and “featureIdx1”, “featureIdx2”, “shapInteractionValue” columns
for multiclass models: contains outputColumns and “classIdx”, “featureIdx1”, “featureIdx2”, “shapInteractionValue” columns
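To get a feel for how the two preCalcMode complexity expressions trade off, they can be evaluated directly for a given model and dataset size. This is only a rough sketch: big-O expressions hide constant factors (the text above notes that the precalculated path has a smaller constant), so the numbers indicate growth rather than real running time. The function name `shap_cost_estimates` and the sample sizes are hypothetical.

```python
def shap_cost_estimates(n_docs, n_trees, avg_depth, avg_features, avg_leaves):
    """Evaluate the two documented complexity expressions, ignoring constant factors."""
    # UsePreCalc: O(N*T*(D+F)) + O(T*L^2*D^2)
    pre_calc = (n_docs * n_trees * (avg_depth + avg_features)
                + n_trees * avg_leaves ** 2 * avg_depth ** 2)
    # NoPreCalc: O(N*T*L*D^2)
    direct = n_docs * n_trees * avg_leaves * avg_depth ** 2
    return {"UsePreCalc": pre_calc, "NoPreCalc": direct}


# Few documents relative to leaves: the direct expression is smaller.
small = shap_cost_estimates(n_docs=10, n_trees=100, avg_depth=6,
                            avg_features=6, avg_leaves=64)
# Many documents relative to leaves: the precalculated expression is smaller.
large = shap_cost_estimates(n_docs=1_000_000, n_trees=100, avg_depth=6,
                            avg_features=6, avg_leaves=64)
print(small)
print(large)
```

The leaf-dependent term of UsePreCalc is paid once regardless of N, which is why it wins as the number of documents grows.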
- getFeatureImportanceShapValues(data, preCalcMode=EPreCalcShapValues.Auto, calcType=ECalcTypeShapValues.Regular, modelOutputType=EExplainableModelOutput.Raw, referenceData=None, outputColumns=None)
- Parameters
- dataPool
dataset to calculate SHAP values for
- preCalcModeEPreCalcShapValues
Possible values:
- Auto
Use direct SHAP Values calculation only if data size is smaller than average leaves number (the best of two strategies below is chosen).
- UsePreCalc
Calculate SHAP Values for every leaf in preprocessing. Final complexity is O(NT(D+F))+O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features in a tree, and L is the average number of leaves in a tree. This is much faster (because of a smaller constant) than direct calculation when N >> L
- NoPreCalc
Use direct SHAP Values calculation with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888)
- calcTypeECalcTypeShapValues
Possible values:
- Regular
Calculate regular SHAP values
- Approximate
Calculate approximate SHAP values
- Exact
Calculate exact SHAP values
- referenceDataPool
reference data for Independent Tree SHAP values from https://arxiv.org/abs/1905.04610v1. If referenceData is not None, Independent Tree SHAP values are calculated
- outputColumnslist of str
columns from data to add to the output DataFrame; if None, all columns are added
- Returns
- DataFrame
for regression and binclass models: contains outputColumns and “shapValues” column with Vector of length (n_features + 1) with SHAP values
for multiclass models: contains outputColumns and “shapValues” column with Matrix of shape (n_classes x (n_features + 1)) with SHAP values
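The “shapValues” vector has one more element than there are features. A minimal sketch of unpacking one such vector for a regression model, under the assumption (carried over from the CatBoost Python package and not stated on this page) that the extra trailing element is the expected model output, so that the contributions plus that term add up to the raw prediction. The helper `split_shap_vector` is hypothetical, and the input is a plain list standing in for a Vector from a row of the returned DataFrame.

```python
def split_shap_vector(shap_values):
    """Split a vector of length n_features + 1 into per-feature values and the trailing term."""
    values = [float(v) for v in shap_values]
    if len(values) < 2:
        raise ValueError("expected at least one feature value plus the trailing term")
    return values[:-1], values[-1]


# Illustrative numbers only; a real vector would come from the "shapValues" column.
contributions, base_value = split_shap_vector([0.5, -0.25, 1.0, 2.0])
# Assumption: trailing element is the expected model output (see lead-in).
raw_prediction = sum(contributions) + base_value
print(contributions, base_value, raw_prediction)
```

For multiclass models the same split would apply to each row of the matrix, one row per class.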
- static loadNativeModel(fileName, format=EModelType.CatboostBinary)
Load the model from a local file. See https://catboost.ai/docs/concepts/python-reference_catboostclassifier_load_model.html for a detailed description of the parameters.
- saveNativeModel(fileName, format=EModelType.CatboostBinary, exportParameters=None, pool=None)
Save the model to a local file. See https://catboost.ai/docs/concepts/python-reference_catboostclassifier_save_model.html for a detailed description of the parameters.
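The two native-format methods are typically used together: saveNativeModel is called on a trained model instance, and loadNativeModel is a static method called on the class. A minimal sketch of a save-and-reload round trip, written against any object exposing these two methods so it can be read without a Spark session. The helper `round_trip_native` and the file name `model.cbm` are hypothetical; both calls rely on the default format, EModelType.CatboostBinary.

```python
import os
import tempfile


def round_trip_native(model, model_cls, directory=None):
    """Save `model` in the native format and load it back through `model_cls`."""
    directory = directory or tempfile.mkdtemp()
    path = os.path.join(directory, "model.cbm")  # hypothetical file name
    model.saveNativeModel(path)             # instance method, default format
    return model_cls.loadNativeModel(path)  # static method on the model class


# Intended use, assuming `model` is a trained CatBoostRegressionModel:
#   import catboost_spark
#   restored = round_trip_native(model, catboost_spark.CatBoostRegressionModel)
```

Passing the class explicitly keeps the helper independent of the import, at the cost of one extra argument.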