
Possible bug: the random value generator used in svm_binary_svc_probability() function will not work well when training data size is large. #103

@yangguangfd


In the svm_binary_svc_probability() function, the training data is randomly shuffled before it is used in the 5-fold cross-validation. The shuffle is implemented by the following code:

for(i=0;i<prob->l;i++) perm[i]=i;
for(i=0;i<prob->l;i++)
{
    int j = i+rand()%(prob->l-i);
    swap(perm[i],perm[j]);
}

The rand() function in this code returns a random integer between 0 and RAND_MAX. The value of RAND_MAX is implementation-defined: the C standard only guarantees that it is at least 32767, and on my PC (Windows, x64-based processor) it is exactly 32767. So if prob->l-i is larger than RAND_MAX, rand()%(prob->l-i) is never larger than RAND_MAX, and index i can only be swapped with an index between i and i+RAND_MAX instead of with any index up to prob->l-1.

I noticed that the training data passed to svm_binary_svc_probability() as svm_problem *prob has already been sorted by label (+1, -1 for binary classification), so the first part of prob->y[i] all has label +1. If the number of training samples with label +1 is well above RAND_MAX, then in the 5-fold cross-validation the first validation fold will probably contain only samples with label +1. This gives strange results when estimating probA and probB.
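To make the limited reach of the shuffle concrete, here is a minimal sketch that reproduces the loop outside of libsvm. It is not libsvm code: small_rand(), shuffle_like_libsvm() and max_index_in_first_fold() are hypothetical names, and small_rand() only imitates a rand() with RAND_MAX equal to 32767 so that the effect shows up on any platform. It also assumes that the first validation fold is taken from perm[0] up to perm[l/nr_fold - 1].

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// Hypothetical stand-in for rand() on a platform where RAND_MAX == 32767.
int small_rand(std::mt19937 &gen)
{
    return static_cast<int>(gen() & 0x7FFFu); // value in [0, 32767]
}

// Same shuffle loop as quoted in the issue, with rand() replaced by small_rand().
std::vector<int> shuffle_like_libsvm(int l, std::mt19937 &gen)
{
    assert(l > 0);
    std::vector<int> perm(l);
    for (int i = 0; i < l; i++) perm[i] = i;
    for (int i = 0; i < l; i++)
    {
        // When l - i > 32767 the modulo has no effect, so j <= i + 32767.
        int j = i + small_rand(gen) % (l - i);
        std::swap(perm[i], perm[j]);
    }
    return perm;
}

// Largest original index that lands in the first of nr_fold validation folds
// (assumed here to be perm[0 .. l/nr_fold - 1]).
int max_index_in_first_fold(const std::vector<int> &perm, int nr_fold)
{
    int fold_end = static_cast<int>(perm.size()) / nr_fold;
    assert(fold_end > 0);
    return *std::max_element(perm.begin(), perm.begin() + fold_end);
}
```

For example, with l = 200000 and nr_fold = 5, every index in the first fold is below 40000 + 32767. If the first 100000 samples all have label +1, the first fold then contains no sample with label -1 at all.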

So I suggest using the random number generator from William H. Press et al., Numerical Recipes in C, which returns a random floating-point value between 0 and 1. Another question: why does svm_binary_svc_probability() not use a stratified shuffle, as svm_cross_validation() does?
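As an illustration of a shuffle that does not depend on RAND_MAX, here is a minimal sketch. It does not use the Numerical Recipes generator suggested above. Instead it assumes a C++11 compiler and uses std::mt19937 with std::uniform_int_distribution from the standard <random> header. The function name shuffle_full_range() is hypothetical and this is not a patch against libsvm.

```cpp
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// Fisher-Yates shuffle in which j is drawn uniformly from [i, l - 1],
// regardless of how large l is compared to RAND_MAX.
std::vector<int> shuffle_full_range(int l, std::mt19937 &gen)
{
    assert(l > 0);
    std::vector<int> perm(l);
    for (int i = 0; i < l; i++) perm[i] = i;
    for (int i = 0; i < l; i++)
    {
        std::uniform_int_distribution<int> pick(i, l - 1); // inclusive bounds
        int j = pick(gen);
        std::swap(perm[i], perm[j]);
    }
    return perm;
}
```

With this version, indices from the whole range can reach the first fold, so sorted labels no longer force the first fold to contain a single class. It still does not guarantee the same class ratio in every fold, which is what a stratified shuffle would provide.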
