Dataset in delimiter-separated values format

Contains

For each object:

A list of features.
The target or multiple targets for multiregression (optional).
Other types of data.

Feature indices used in training and feature importance are numbered from 0 to featureCount – 1. Any non-feature column types are ignored when calculating these indices.
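The numbering rule above can be illustrated with a minimal sketch. The function name `feature_indices` and the column layout are hypothetical and chosen only for illustration; the logic simply skips non-feature columns when assigning feature indices.

```python
def feature_indices(column_count, non_feature_columns):
    """Map column index -> feature index, skipping non-feature columns."""
    mapping = {}
    next_feature = 0
    for column in range(column_count):
        if column in non_feature_columns:
            # Labels, auxiliary data, etc. do not consume a feature index.
            continue
        mapping[column] = next_feature
        next_feature += 1
    return mapping


# Hypothetical layout: 6 columns, column 0 is a label, column 4 is auxiliary.
mapping = feature_indices(6, {0, 4})
print(mapping)  # {1: 0, 2: 1, 3: 2, 5: 3}
```

In this layout the column indexed 3 is the feature indexed 2, and featureCount is 4.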

Specification

List each object on a new line.
If group identifiers are present, all objects in the dataset must be grouped by them: objects with the same group identifier must follow each other in the dataset.
If the group weight is specified, it must be the same for all objects in one group.
Use any single-character delimiter to separate the values of a single object. The delimiter can be specified in the training parameters. The tab character is the default delimiter.
Use the feature types that are specified in the columns description.
List features in the same order for all the objects.
Feature numbering starts from zero.
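Some of the rules above can be checked before training. The following is a minimal sketch, not part of any library: the function `check_rows` and its parameters are hypothetical, and it only verifies that every line has the same number of fields and that objects with the same group identifier are contiguous. It does not check feature types or group weights.

```python
def check_rows(lines, delimiter="\t", group_column=None):
    """Return a list of problems found in delimiter-separated rows."""
    problems = []
    expected_width = None
    seen_groups = set()
    previous_group = None
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        if expected_width is None:
            # The first row defines how many fields every row must have.
            expected_width = len(fields)
        elif len(fields) != expected_width:
            problems.append(
                f"line {number}: expected {expected_width} fields, got {len(fields)}"
            )
            continue
        if group_column is not None:
            group = fields[group_column]
            if group != previous_group:
                # A group that reappears after another group is not contiguous.
                if group in seen_groups:
                    problems.append(f"line {number}: group {group!r} is not contiguous")
                seen_groups.add(group)
                previous_group = group
    return problems


# Hypothetical rows: group identifier in column 0, then a label and one feature.
good = ["g1\t1\t0.5", "g1\t0\t0.7", "g2\t1\t0.1"]
bad = ["g1\t1\t0.5", "g2\t0\t0.7", "g1\t1\t0.1"]
print(check_rows(good, group_column=0))
print(check_rows(bad, group_column=0))
```

The first call reports no problems; the second reports that group `g1` reappears on line 3 after `g2`.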

Example

The dataset consists of 6 columns.

The first column (indexed 0) contains label values.

The label (target) takes binary values:

0 stands for the absence of precipitation
1 stands for the presence of precipitation

Columns indexed 1, 2, 3 and 5 contain features.

The column indexed 4 contains arbitrary data.

The tab-separated columns description file looks like this:

0<\t>Label
3<\t>Categ<\t>wind direction
4<\t>Auxiliary

The column indexed 3 contains a categorical feature, so the value in the second column of the description file is set to Categ. The name of this feature is set to wind direction in the third column of the description file.

Other features are numerical and are omitted from the columns description file.
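As an illustration, a columns description like the one above could be read with a few lines of Python. This is a sketch under stated assumptions: the helper names `parse_column_description` and `column_type` are hypothetical, real tab characters replace the `<\t>` placeholders, and the default type name `Num` for columns that are not listed is an assumption based on the statement that omitted features are numerical.

```python
def parse_column_description(text, delimiter="\t"):
    """Parse lines of the form: index, type, optional name."""
    columns = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split(delimiter)
        index = int(parts[0])
        type_name = parts[1]
        name = parts[2] if len(parts) > 2 else None
        columns[index] = (type_name, name)
    return columns


def column_type(columns, index):
    """Columns missing from the description are treated as numerical (assumed 'Num')."""
    return columns.get(index, ("Num", None))[0]


cd_text = "0\tLabel\n3\tCateg\twind direction\n4\tAuxiliary\n"
columns = parse_column_description(cd_text)
for index in range(6):
    print(index, column_type(columns, index))
```

Columns 1, 2 and 5 do not appear in the description and therefore fall back to the numerical type in this sketch.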

The dataset file looks like this:

1<\t>-10<\t>5<\t>north<\t>Memphis TN<\t>753
0<\t>30<\t>1<\t>south<\t>Los Angeles CA<\t>760
0<\t>40<\t>0.1<\t>south<\t>Las Vegas NV<\t>705
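To make the roles of the columns concrete, the sketch below splits each row of this example into a label and a feature list using only the Python standard library. The constants and the `split_row` helper are hypothetical names; the data is the sample above with real tab characters in place of `<\t>` and an ASCII minus sign in the first row.

```python
import csv
import io

DATASET = (
    "1\t-10\t5\tnorth\tMemphis TN\t753\n"
    "0\t30\t1\tsouth\tLos Angeles CA\t760\n"
    "0\t40\t0.1\tsouth\tLas Vegas NV\t705\n"
)
LABEL_COLUMN = 0        # Label in the columns description
CATEGORICAL_COLUMN = 3  # Categ in the columns description
AUXILIARY_COLUMN = 4    # Auxiliary in the columns description


def split_row(fields):
    """Return (label, features); the auxiliary column is dropped."""
    label = int(fields[LABEL_COLUMN])
    features = []
    for index, value in enumerate(fields):
        if index in (LABEL_COLUMN, AUXILIARY_COLUMN):
            continue
        # Categorical values stay as strings; all other features are numerical.
        features.append(value if index == CATEGORICAL_COLUMN else float(value))
    return label, features


reader = csv.reader(io.StringIO(DATASET), delimiter="\t")
rows = [split_row(fields) for fields in reader]
for label, features in rows:
    print(label, features)
```

Each feature list has four entries, and the categorical feature sits at feature index 2 even though it is stored in the column indexed 3. When the CatBoost Python package is installed, the dataset and columns description files can typically be passed to `catboost.Pool` through its `data`, `column_description` and `delimiter` arguments instead of being parsed by hand; consult the package reference for the exact behavior.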
