class CatBoostRegressionModel extends RegressionModel[Vector, CatBoostRegressionModel] with CatBoostModelTrait[CatBoostRegressionModel]
Regression model trained by CatBoost. Use CatBoostRegressor to train it.
Serialization
Supports standard Spark MLlib serialization. The model can be saved to a distributed filesystem such as HDFS or to local files.
When the model is saved to <path>, two files are created:
- <path>/metadata, which contains Spark-specific metadata in JSON format
- <path>/model, which contains the model in the usual CatBoost format. It can be read using other local CatBoost APIs (if it is stored in a distributed filesystem, it has to be copied to the local filesystem first).
Saving to and loading from local files in standard CatBoost model formats is also supported.
Save model
val trainPool : Pool = ... init Pool ...
val regressor = new CatBoostRegressor
val model = regressor.fit(trainPool)
val path = "/home/user/catboost_spark_models/model0"
model.write.save(path)
Load model
val dataFrameForPrediction : DataFrame = ... init DataFrame ...
val path = "/home/user/catboost_spark_models/model0"
val model = CatBoostRegressionModel.load(path)
val predictions = model.transform(dataFrameForPrediction)
predictions.show()
Save as a native model
val trainPool : Pool = ... init Pool ...
val regressor = new CatBoostRegressor
val model = regressor.fit(trainPool)
val path = "/home/user/catboost_native_models/model0.cbm"
model.saveNativeModel(path)
Load native model
val dataFrameForPrediction : DataFrame = ... init DataFrame ...
val path = "/home/user/catboost_native_models/model0.cbm"
val model = CatBoostRegressionModel.loadNativeModel(path)
val predictions = model.transform(dataFrameForPrediction)
predictions.show()
Instance Constructors
- new CatBoostRegressionModel(nativeModel: TFullModel)
- new CatBoostRegressionModel(uid: String, nativeModel: TFullModel = null, nativeDimension: Int)
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
$[T](param: Param[T]): T
- Attributes
- protected
- Definition Classes
- Params
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
final
def
clear(param: Param[_]): CatBoostRegressionModel.this.type
- Definition Classes
- Params
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
copy(extra: ParamMap): CatBoostRegressionModel
- Definition Classes
- CatBoostRegressionModel → Model → Transformer → PipelineStage → Params
-
def
copyValues[T <: Params](to: T, extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
-
final
def
defaultCopy[T <: Params](extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
explainParam(param: Param[_]): String
- Definition Classes
- Params
-
def
explainParams(): String
- Definition Classes
- Params
-
def
extractInstances(dataset: Dataset[_], validateInstance: (Instance) ⇒ Unit): RDD[Instance]
- Attributes
- protected
- Definition Classes
- PredictorParams
-
def
extractInstances(dataset: Dataset[_]): RDD[Instance]
- Attributes
- protected
- Definition Classes
- PredictorParams
-
final
def
extractParamMap(): ParamMap
- Definition Classes
- Params
-
final
def
extractParamMap(extra: ParamMap): ParamMap
- Definition Classes
- Params
-
final
val
featuresCol: Param[String]
- Definition Classes
- HasFeaturesCol
-
def
featuresDataType: DataType
- Attributes
- protected
- Definition Classes
- PredictionModel
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
final
def
get[T](param: Param[T]): Option[T]
- Definition Classes
- Params
-
def
getAdditionalColumnsForApply: Seq[StructField]
- Attributes
- protected
- Definition Classes
- CatBoostRegressionModel → CatBoostModelTrait
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
getDefault[T](param: Param[T]): Option[T]
- Definition Classes
- Params
-
def
getFeatureImportance(fstrType: EFstrType = EFstrType.FeatureImportance, data: Pool = null, calcType: ECalcTypeShapValues = ECalcTypeShapValues.Regular): Array[Double]
- fstrType
Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff
- data
Required if fstrType is PredictionDiff, in which case it must contain 2 samples. Also required if fstrType is PredictionValuesChange and the model was explicitly trained with the flag to store no leaf weights. Otherwise it can be null.
- calcType
Used only for PredictionValuesChange. Possible values:
- Regular: calculate regular SHAP values
- Approximate: calculate approximate SHAP values
- Exact: calculate exact SHAP values
- returns
array of feature importances (index corresponds to the order of features in the model)
- Definition Classes
- CatBoostModelTrait
-
def
getFeatureImportanceInteraction(): Array[FeatureInteractionScore]
- returns
array of feature interaction scores
- Definition Classes
- CatBoostModelTrait
-
def
getFeatureImportancePrettified(fstrType: EFstrType = EFstrType.FeatureImportance, data: Pool = null, calcType: ECalcTypeShapValues = ECalcTypeShapValues.Regular): Array[FeatureImportance]
- fstrType
Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff
- data
Required if fstrType is PredictionDiff, in which case it must contain 2 samples. Also required if fstrType is PredictionValuesChange and the model was explicitly trained with the flag to store no leaf weights. Otherwise it can be null.
- calcType
Used only for PredictionValuesChange. Possible values:
- Regular: calculate regular SHAP values
- Approximate: calculate approximate SHAP values
- Exact: calculate exact SHAP values
- returns
array of feature importances sorted in descending order by importance
- Definition Classes
- CatBoostModelTrait
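The relationship between the two methods above can be illustrated with plain Scala: getFeatureImportance returns values indexed by feature order, while getFeatureImportancePrettified returns them sorted in descending order. The following is a minimal sketch of that reordering only. The NamedImportance case class, the rank helper, and the feature names are hypothetical and are not part of the CatBoost API.

```scala
object ImportanceRanking {
  // Hypothetical holder for a (feature name, importance) pair; not a CatBoost type.
  final case class NamedImportance(name: String, importance: Double)

  // importances(i) is assumed to belong to the i-th feature of the model,
  // as described for the return value of getFeatureImportance.
  def rank(featureNames: Seq[String], importances: Array[Double]): Seq[NamedImportance] = {
    require(featureNames.length == importances.length, "one name per importance value")
    featureNames
      .zip(importances)
      .map { case (name, value) => NamedImportance(name, value) }
      .sortWith(_.importance > _.importance) // descending by importance
  }
}

// Usage sketch (needs a trained model; the names are illustrative):
// val ranked = ImportanceRanking.rank(Seq("age", "income", "tenure"), model.getFeatureImportance())
```

When the feature names stored in the model are sufficient, calling getFeatureImportancePrettified directly avoids this manual step.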
-
def
getFeatureImportanceShapInteractionValues(data: Pool, featureIndices: Pair[Int, Int] = null, featureNames: Pair[String, String] = null, preCalcMode: EPreCalcShapValues = EPreCalcShapValues.Auto, calcType: ECalcTypeShapValues = ECalcTypeShapValues.Regular, outputColumns: Array[String] = null): DataFrame
SHAP interaction values are calculated for all feature pairs if neither featureIndices nor featureNames is specified.
- data
dataset to calculate SHAP interaction values
- featureIndices
(optional) pair of feature indices to calculate SHAP interaction values for.
- featureNames
(optional) pair of feature names to calculate SHAP interaction values for.
- preCalcMode
Possible values:
- Auto: use direct SHAP values calculation only if the data size is smaller than the average number of leaves (the better of the two strategies below is chosen).
- UsePreCalc: calculate SHAP values for every leaf during preprocessing. The final complexity is O(NT(D+F)) + O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features in a tree, and L is the average number of leaves in a tree. This is much faster (because of a smaller constant) than direct calculation when N >> L.
- NoPreCalc: use direct SHAP values calculation, with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888).
- calcType
Possible values:
- Regular: calculate regular SHAP values
- Approximate: calculate approximate SHAP values
- Exact: calculate exact SHAP values
- outputColumns
columns from data to add to output DataFrame, if null - add all columns
- returns
- for binclass or regression: DataFrame which contains outputColumns and "featureIdx1", "featureIdx2", "shapInteractionValue" columns
- for multiclass: DataFrame which contains outputColumns and "classIdx", "featureIdx1", "featureIdx2", "shapInteractionValue" columns
- Definition Classes
- CatBoostModelTrait
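To make the trade-off between UsePreCalc and NoPreCalc concrete, the complexity estimates quoted for preCalcMode can be evaluated for sample sizes. This is only a rough sketch: it ignores constant factors, the object and method names are hypothetical, and it is not the rule that Auto applies internally.

```scala
object ShapCostSketch {
  // Operation-count estimates taken from the complexity formulas in the
  // preCalcMode description; constant factors are ignored.
  // n: documents, t: trees, d: average tree depth,
  // f: average features per tree, l: average leaves per tree
  def preCalcCost(n: Double, t: Double, d: Double, f: Double, l: Double): Double =
    n * t * (d + f) + t * l * l * d * d // O(NT(D+F)) + O(TL^2 D^2)

  def directCost(n: Double, t: Double, d: Double, l: Double): Double =
    n * t * l * d * d // O(NTLD^2)

  // Returns the name of the mode with the lower estimated cost.
  def cheaper(n: Double, t: Double, d: Double, f: Double, l: Double): String =
    if (preCalcCost(n, t, d, f, l) < directCost(n, t, d, l)) "UsePreCalc" else "NoPreCalc"
}
```

With 1000 trees of average depth 6, 6 features and 64 leaves each, the estimate favours UsePreCalc for a million documents and NoPreCalc for ten documents, which matches the N >> L and N < L guidance.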
-
def
getFeatureImportanceShapValues(data: Pool, preCalcMode: EPreCalcShapValues = EPreCalcShapValues.Auto, calcType: ECalcTypeShapValues = ECalcTypeShapValues.Regular, modelOutputType: EExplainableModelOutput = EExplainableModelOutput.Raw, referenceData: Pool = null, outputColumns: Array[String] = null): DataFrame
- data
dataset to calculate SHAP values for
- preCalcMode
Possible values:
- Auto: use direct SHAP values calculation only if the data size is smaller than the average number of leaves (the better of the two strategies below is chosen).
- UsePreCalc: calculate SHAP values for every leaf during preprocessing. The final complexity is O(NT(D+F)) + O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features in a tree, and L is the average number of leaves in a tree. This is much faster (because of a smaller constant) than direct calculation when N >> L.
- NoPreCalc: use direct SHAP values calculation, with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888).
- calcType
Possible values:
- Regular: calculate regular SHAP values
- Approximate: calculate approximate SHAP values
- Exact: calculate exact SHAP values
- referenceData
Reference data for Independent Tree SHAP values from https://arxiv.org/abs/1905.04610v1. If referenceData is not null, Independent Tree SHAP values are calculated.
- outputColumns
columns from data to add to output DataFrame, if null - add all columns
- returns
- for regression and binclass models: DataFrame which contains outputColumns and "shapValues" column with Vector of length (n_features + 1) with SHAP values
- for multiclass models: DataFrame which contains outputColumns and "shapValues" column with Matrix of shape (n_classes x (n_features + 1)) with SHAP values
- Definition Classes
- CatBoostModelTrait
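For a regression model, each row of the "shapValues" column holds n_features + 1 values. The sketch below shows one way to take such a row apart once it has been converted to an Array[Double]. It assumes, following the convention described for CatBoost's other APIs, that the trailing element is the expected value of the model and that, with raw model output, all elements sum to the raw prediction. That assumption is not stated on this page, and the object and method names are hypothetical.

```scala
object ShapVectorSketch {
  // Splits one row of SHAP values (length n_features + 1) into the
  // per-feature contributions and the trailing element.
  // Assumption: the trailing element is the model's expected value (bias).
  def split(shap: Array[Double]): (Array[Double], Double) = {
    require(shap.nonEmpty, "expected n_features + 1 values")
    (shap.init, shap.last)
  }

  // Assumption: with raw model output, the contributions plus the expected
  // value add up to the raw prediction for the row.
  def reconstructRawPrediction(shap: Array[Double]): Double = shap.sum
}

// Usage sketch (needs Spark; `row` is a Row of the returned DataFrame):
// val shap = row.getAs[org.apache.spark.ml.linalg.Vector]("shapValues").toArray
// val (contributions, expectedValue) = ShapVectorSketch.split(shap)
```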
-
final
def
getFeaturesCol: String
- Definition Classes
- HasFeaturesCol
-
final
def
getLabelCol: String
- Definition Classes
- HasLabelCol
-
final
def
getOrDefault[T](param: Param[T]): T
- Definition Classes
- Params
-
def
getParam(paramName: String): Param[Any]
- Definition Classes
- Params
-
final
def
getPredictionCol: String
- Definition Classes
- HasPredictionCol
-
def
getResultIteratorForApply(objectsDataProvider: SWIGTYPE_p_NCB__TObjectsDataProviderPtr, dstRows: ArrayBuffer[Array[Any]], localExecutor: TLocalExecutor): Iterator[Row]
- Attributes
- protected
- Definition Classes
- CatBoostRegressionModel → CatBoostModelTrait
-
final
def
hasDefault[T](param: Param[T]): Boolean
- Definition Classes
- Params
-
def
hasParam(paramName: String): Boolean
- Definition Classes
- Params
-
def
hasParent: Boolean
- Definition Classes
- Model
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
final
def
isDefined(param: Param[_]): Boolean
- Definition Classes
- Params
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
final
def
isSet(param: Param[_]): Boolean
- Definition Classes
- Params
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
final
val
labelCol: Param[String]
- Definition Classes
- HasLabelCol
-
def
log: Logger
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
var
nativeDimension: Int
- Attributes
- protected
- Definition Classes
- CatBoostRegressionModel → CatBoostModelTrait
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
numFeatures: Int
- Definition Classes
- PredictionModel
- Annotations
- @Since( "1.6.0" )
-
lazy val
params: Array[Param[_]]
- Definition Classes
- Params
-
var
parent: Estimator[CatBoostRegressionModel]
- Definition Classes
- Model
-
def
predict(features: Vector): Double
Prefer batch computations operating on datasets as a whole for efficiency
- Definition Classes
- CatBoostRegressionModel → PredictionModel
-
final
def
predictRawImpl(features: Vector): Array[Double]
Prefer batch computations operating on datasets as a whole for efficiency
- Definition Classes
- CatBoostModelTrait
-
final
val
predictionCol: Param[String]
- Definition Classes
- HasPredictionCol
-
def
save(path: String): Unit
- Definition Classes
- MLWritable
- Annotations
- @Since( "1.6.0" ) @throws( ... )
-
def
saveNativeModel(fileName: String, format: EModelType = EModelType.CatboostBinary, exportParameters: Map[String, Any] = null, pool: Pool = null): Unit
Save the model to a local file.
- fileName
The path to the output model.
- format
The output format of the model. Possible values:
- CatboostBinary: CatBoost binary format (default).
- AppleCoreML: Apple CoreML format (only datasets without categorical features are currently supported).
- Cpp: Standalone C++ code (multiclassification models are not currently supported). See the C++ section for details on applying the resulting model.
- Python: Standalone Python code (multiclassification models are not currently supported). See the Python section for details on applying the resulting model.
- Json: JSON format. Refer to the CatBoost JSON model tutorial for format details.
- Onnx: ONNX-ML format (only datasets without categorical features are currently supported). Refer to https://onnx.ai for details.
- Pmml: PMML version 4.3 format. Categorical features must be interpreted as one-hot encoded during training if present in the training dataset. This can be accomplished by setting the --one-hot-max-size/one_hot_max_size parameter to a value that is greater than the maximum number of unique categorical feature values among all categorical features in the dataset. Note: multiclassification models are not currently supported. See the PMML section for details on applying the resulting model.
- exportParameters
Additional format-dependent parameters for AppleCoreML, Onnx or Pmml formats. See the Python API documentation for details.
- pool
The dataset previously used for training. This parameter is required if the model contains categorical features and the output format is Cpp, Python, or Json.
- Definition Classes
- CatBoostModelTrait
Example:
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("testSaveLocalModel")
  .getOrCreate()

val pool = Pool.load(
  spark,
  "dsv:///home/user/datasets/my_dataset/train.dsv",
  columnDescription = "/home/user/datasets/my_dataset/cd"
)

val regressor = new CatBoostRegressor()
val model = regressor.fit(pool)

// save in CatBoostBinary format
model.saveNativeModel("/home/user/model/model.cbm")

// save in ONNX format with metadata
model.saveNativeModel(
  "/home/user/model/model.onnx",
  EModelType.Onnx,
  Map(
    "onnx_domain" -> "ai.catboost",
    "onnx_model_version" -> 1,
    "onnx_doc_string" -> "test model for regression",
    "onnx_graph_name" -> "CatBoostModel_for_regression"
  )
)
-
final
def
set(paramPair: ParamPair[_]): CatBoostRegressionModel.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
set(param: String, value: Any): CatBoostRegressionModel.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
set[T](param: Param[T], value: T): CatBoostRegressionModel.this.type
- Definition Classes
- Params
-
final
def
setDefault(paramPairs: ParamPair[_]*): CatBoostRegressionModel.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
setDefault[T](param: Param[T], value: T): CatBoostRegressionModel.this.type
- Attributes
- protected
- Definition Classes
- Params
-
def
setFeaturesCol(value: String): CatBoostRegressionModel
- Definition Classes
- PredictionModel
-
def
setParent(parent: Estimator[CatBoostRegressionModel]): CatBoostRegressionModel
- Definition Classes
- Model
-
def
setPredictionCol(value: String): CatBoostRegressionModel
- Definition Classes
- PredictionModel
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- Identifiable → AnyRef → Any
-
def
transform(dataset: Dataset[_]): DataFrame
- Definition Classes
- PredictionModel → Transformer
-
def
transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame
- Definition Classes
- Transformer
- Annotations
- @Since( "2.0.0" )
-
def
transform(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DataFrame
- Definition Classes
- Transformer
- Annotations
- @Since( "2.0.0" ) @varargs()
-
def
transformCatBoostImpl(dataset: Dataset[_]): DataFrame
- Attributes
- protected
- Definition Classes
- CatBoostModelTrait
-
def
transformImpl(dataset: Dataset[_]): DataFrame
- Definition Classes
- CatBoostRegressionModel → PredictionModel
-
def
transformPool(dataset: Pool): DataFrame
This function is useful when the dataset has been already quantized but works with any Pool
- Definition Classes
- CatBoostModelTrait
-
def
transformSchema(schema: StructType): StructType
- Definition Classes
- PredictionModel → PipelineStage
-
def
transformSchema(schema: StructType, logging: Boolean): StructType
- Attributes
- protected
- Definition Classes
- PipelineStage
- Annotations
- @DeveloperApi()
-
val
uid: String
- Definition Classes
- CatBoostRegressionModel → Identifiable
-
def
validateAndTransformSchema(schema: StructType, fitting: Boolean, featuresDataType: DataType): StructType
- Attributes
- protected
- Definition Classes
- PredictorParams
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
write: MLWriter
- Definition Classes
- CatBoostModelTrait → MLWritable