Columns description
- Contains
-
A description of the data type of each column in the dataset description file, which is either in delimiter-separated values format or in extended libsvm format.
Note. The columns description file is optional. If it is omitted, the first column of the dataset description file is assumed to contain the label value, and all other columns are assumed to contain numerical features.
The supported column types are listed below, each type name followed by its description.
Label
The target variable (in other words, the object's label value).
The type of data depends on the machine learning task being solved:
- Regression, multiregression and ranking: numeric values.
- Binary classification: numeric values.
  The interpretation of numeric values depends on the selected loss function:
  - Logloss: the value is considered a positive class if it is strictly greater than the value of the border parameter of the loss function. Otherwise, it is considered a negative class.
  - CrossEntropy: the value is interpreted as the probability that the dataset object belongs to the positive class. Possible values are in the range [0; 1].
- Multiclassification: integers or strings that represent the labels of the classes.
Num
A numerical feature.
A tab-delimited feature ID can be added for this type of column. The specified value replaces the feature ID in the output files that report it (for example, the files with information on the feature strength).
Categ
A categorical feature.
A tab-delimited feature ID can be added for this type of column. The specified value replaces the feature ID in the output files that report it (for example, the files with information on the feature strength).
Text
A text feature.
Auxiliary
Any data.
The value of this column is ignored (the behavior is the same as when this column is omitted in the file with the dataset description).
SampleId
Alias: DocId
An alphanumeric ID of the object.
Weight
The object's weight.
Used as an additional coefficient in the objective functions and metrics. By default, it is set to 1 for all objects.
Note. Do not use this column type if the GroupWeight column is defined in the dataset description.
GroupWeight
The group weight.
Used as an additional coefficient in the objective functions and metrics. By default, it is set to 1 for all objects in the group.
Note.
- The weight must be the same for all objects in one group.
- Do not use this column type if the Weight column is defined in the dataset description.
Baseline
The initial formula values for all input objects.
Used for calculating the final values of trees.
The required number of these columns depends on the machine learning mode:
- For classification and regression – one column.
- For multiclassification – the same as the number of classes.
GroupId
Alias: QueryId
The identifier of the object's group. An arbitrary string, possibly representing an integer.
Attention. If group identifiers are present, all objects in the dataset must be grouped by them. That is, objects with the same group identifier must follow each other in the dataset.
SubgroupId
The identifier of the object's subgroup. Used to divide objects within a group. An arbitrary string, possibly representing an integer.
Timestamp
The timestamp of the object.
Should be a non-negative integer.
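Several of the constraints listed above can be checked before training. The following is a minimal sketch in plain Python, not part of the library: the function names are hypothetical, and the border value used in the example is chosen only for illustration.

```python
def logloss_class(label_value, border):
    """Logloss rule: positive class (1) only if the label is strictly greater than border."""
    return 1 if label_value > border else 0


def is_grouped(group_ids):
    """Return True if objects with the same GroupId always follow each other."""
    finished = set()   # groups whose run of rows has already ended
    previous = None
    for group_id in group_ids:
        if group_id != previous:
            if group_id in finished:
                return False  # the group reappears after another group
            if previous is not None:
                finished.add(previous)
            previous = group_id
    return True


def inconsistent_group_weights(rows):
    """rows: iterable of (group_id, group_weight) pairs.

    Return the sorted IDs of groups that carry more than one distinct weight.
    """
    first_weight = {}
    bad = set()
    for group_id, weight in rows:
        if first_weight.setdefault(group_id, weight) != weight:
            bad.add(group_id)
    return sorted(bad)


# border=0.5 is an illustrative value for this sketch
classes = [logloss_class(v, border=0.5) for v in [0.0, 0.5, 0.7, 1.0]]
grouped_ok = is_grouped(["q1", "q1", "q2", "q2"])
grouped_bad = is_grouped(["q1", "q2", "q1"])
bad_groups = inconsistent_group_weights(
    [("q1", 1.0), ("q1", 1.0), ("q2", 0.5), ("q2", 2.0)]
)
```

Note that a label equal to the border falls into the negative class, because the comparison is strict.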
- Specification
-
- List each column on a new line.
- Additional properties are set on the corresponding line.
- Use a tab as the delimiter to separate data for a single column.
- Columns that contain numerical features don't require descriptions. Any columns that aren't specified in the file are assumed to be Num.
- Row format
-
<column ID (numbering starts from zero)><\t><data type><\t><feature id (optional, applicable for Num and Categ column types only)>
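The row format can be illustrated with a small reader and writer. This is a sketch under stated assumptions: `ColumnSpec`, `format_row` and `parse_row` are hypothetical helper names, and treating an empty third field as "no feature ID" is a choice made for this example only.

```python
from typing import NamedTuple, Optional


class ColumnSpec(NamedTuple):
    index: int                       # column ID, numbering starts from zero
    column_type: str                 # for example "Label", "Num", "Categ"
    feature_id: Optional[str] = None


def format_row(spec: ColumnSpec) -> str:
    """Build one tab-delimited line of a columns description file."""
    fields = [str(spec.index), spec.column_type]
    if spec.feature_id is not None:
        if spec.column_type not in ("Num", "Categ"):
            raise ValueError("a feature ID applies to Num and Categ columns only")
        fields.append(spec.feature_id)
    return "\t".join(fields)


def parse_row(line: str) -> ColumnSpec:
    """Parse one tab-delimited line into a ColumnSpec."""
    fields = line.rstrip("\n").split("\t")
    if not 2 <= len(fields) <= 3:
        raise ValueError("expected 2 or 3 tab-separated fields")
    # Assumption of this sketch: an empty third field means no feature ID.
    feature_id = fields[2] if len(fields) == 3 and fields[2] else None
    return ColumnSpec(int(fields[0]), fields[1], feature_id)


row = format_row(ColumnSpec(3, "Categ", "wind direction"))
spec = parse_row(row)
```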
- Specifics
-
The feature indices and the column indices usually differ.
The table below shows the difference between these indices for the columns description from the Example section.
Column index    Column data                Feature index
0               Label                      -
1               Num                        0
2               Num                        1
3               Categ<\t>wind direction    2
4               Auxiliary                  -
5               Num                        3
Multiregression labels are specified in several separate columns.
Example:
0<\t>Label
1<\t>Label
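The relationship between column indices and feature indices described in this section can be sketched as a simple counter that advances only on feature columns. The function name is hypothetical, and counting Text columns as features alongside Num and Categ is an assumption of this sketch.

```python
# Assumption: these column types count as features and receive a feature index.
FEATURE_TYPES = {"Num", "Categ", "Text"}


def feature_indices(column_types):
    """Map each column index to its feature index (None for non-feature columns)."""
    mapping = {}
    next_feature = 0
    for column_index, column_type in enumerate(column_types):
        if column_type in FEATURE_TYPES:
            mapping[column_index] = next_feature
            next_feature += 1
        else:
            mapping[column_index] = None
    return mapping


# Column types of the weather example, in column order
weather_mapping = feature_indices(
    ["Label", "Num", "Num", "Categ", "Auxiliary", "Num"]
)
```

For the weather example, column 5 maps to feature index 3 because the Label and Auxiliary columns are skipped.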
- Example
- An object contains information about the weather, and the features represent:
- temperature (degrees Celsius)
- wind speed (meters per second)
- wind direction (“south”, “west”, “north”, “east”)
- pressure (mmHg)
The label (target) takes binary values:
- “0” stands for the absence of precipitation
- “1” stands for the presence of precipitation
A column with arbitrary data is provided.
The feature representing the wind direction should be renamed to “wind direction” in the output files with information on the feature strength.
The columns description file with tab-separated data looks like this:
0<\t>Label
3<\t>Categ<\t>wind direction
4<\t>Auxiliary
The following variant is equivalent to the previous one but is redundant:
0<\t>Label<\t>
1<\t>Num
2<\t>Num
3<\t>Categ<\t>wind direction
4<\t>Auxiliary
5<\t>Num
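To see why the two variants are equivalent, both can be expanded to a full per-column list in which every unspecified column defaults to Num. The sketch below assumes the dataset has six columns (indices 0 to 5); `expand_description` is a hypothetical helper, not a library function.

```python
def expand_description(text, column_count):
    """Return a list of (column type, feature ID) pairs, one per column.

    Columns that are not listed in the description are treated as Num.
    """
    columns = [("Num", None)] * column_count
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        index, column_type = int(fields[0]), fields[1]
        # Assumption of this sketch: an empty third field means no feature ID.
        feature_id = fields[2] if len(fields) > 2 and fields[2] else None
        columns[index] = (column_type, feature_id)
    return columns


short_variant = "0\tLabel\n3\tCateg\twind direction\n4\tAuxiliary\n"
full_variant = (
    "0\tLabel\t\n1\tNum\n2\tNum\n"
    "3\tCateg\twind direction\n4\tAuxiliary\n5\tNum\n"
)

expanded_short = expand_description(short_variant, column_count=6)
expanded_full = expand_description(full_variant, column_count=6)
```

Both expansions describe the same six columns, which is why the Num lines in the second variant are redundant.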