Re: data scaling and validation for learning classifier



Thanks Greg!
Sorry for my poor description. All I want to do is train and validate
on the dataset that is currently available; the testing set will be
provided later. Before reading your post, I only knew about dividing
the available dataset, which I call the "original training set", into
a training set and a validation set. It is really good to know that
validation and testing can both be done in one cross validation by
also holding out a fold for testing.

Let me try to restate my question. If I apply a feature
transformation in preprocessing, such as scaling, normalization or
standardization, which of these two orderings of the transformation
and cross validation is proper?

1. First compute the scales on the WHOLE training set and scale the
features, then do the cross validation. When testing, apply the same
scales that were computed on the training set.

2. First divide the training set into folds for training and
validation. In each round of cross validation, compute the scales on
the SHRUNK training set (the original training set minus the
validation fold) and apply them to the corresponding validation fold
in the same round. After cross validation is complete, compute the
scales again on the original training set (training + validation
sets), and apply those scales to the testing set.

So I think what you described in your 2.a is basically what I
described in my 2? The argument for my approach 2 is: If you have a
training set T and a validation set V, then the samples in V cannot be
used for ANY aspect of the learning. So you cannot determine data
transformation based on T and V together and then test on V. On the
other hand the argument for my approach 1 is: We can use common
scaling during cross-validation, for "new_training + validation" sets,
where new_training and validation sets really are parts of training
set.
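
To make approach 2 concrete, here is a minimal sketch of how I picture
it, using only the Python standard library. The helper names
(fit_scaler, apply_scaler, kfold_indices) and the toy data are my own
assumptions for illustration, and standardization stands in for
whatever scaling is actually used. No classifier is trained here; the
point is only where the scales get computed.

```python
import random
import statistics

def fit_scaler(rows):
    """Per-feature mean and standard deviation, estimated from `rows` only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 so a constant feature does not cause division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds

def apply_scaler(rows, scaler):
    """Standardize `rows` with previously estimated means and deviations."""
    means, stds = scaler
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate_with_per_fold_scaling(X, k=5, seed=0):
    folds = kfold_indices(len(X), k, seed)
    results = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        train = [X[j] for j in train_idx]
        val = [X[j] for j in val_idx]
        # Scales come from the shrunk training set only;
        # the validation fold never influences them.
        scaler = fit_scaler(train)
        results.append((apply_scaler(train, scaler),
                        apply_scaler(val, scaler),
                        scaler))
    return results

# Toy data (made up): 50 samples, 2 features on very different scales.
rng = random.Random(1)
X = [[rng.gauss(10, 3), rng.gauss(-5, 0.5)] for _ in range(50)]

results = cross_validate_with_per_fold_scaling(X, k=5)

# After cross validation: refit the scales on the whole original training
# set; these are the scales that would later be applied to the testing set.
final_scaler = fit_scaler(X)
```

Approach 1 would instead call fit_scaler(X) once before the loop and
reuse that scaler in every round, so each validation fold would have
contributed to its own scales.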

Thanks!
-Tim





On Apr 23, 9:48 am, Greg Heath <he...@xxxxxxxxxxxxxxxx> wrote:
On Apr 22, 4:23 pm, Tim <timlee...@xxxxxxxxx> wrote:

Hi,

I now scale the features before feeding them to classifier.  The
scales are computed on training set and stored and then applied to
test set.

OK

I also tune some parameter of the classifier using cross
validation.

This requires a separate hold out set that is neither used
for training nor testing.

I was wondering which way is proper regarding the order
between cross validation and data scaling:
1. first scale the features in the whole training set,

No.

then do the
cross validation. When testing use the same scales on the training
set.

Not clear. Separate validation and testing XVALs? OR
validation and testing within one XVAL experiment?

2. in each step of cross validation, just before feeding the shrinking

What does shrinking mean ???

training set to training, do the scaling, and the specific scales only
apply to the corresponding validation set. After cross validation, do
another scaling on the original training set when applying the tuned
parameter to train on the original training set. These scales will be
applied to the testing set.

Very unclear.

I think this question also applies to any kind of preprocessing
transformation besides scaling.

Thanks and regards!

f-fold XVAL:

1. Randomly partition the data into f subsets
2. At each stage
   a. Combine f-2 subsets for training (i.e., determining
      scale factors and regression coefficients)
   b. Use 1 holdout subset for validation (i.e., tuning model
      topology and learning algorithm parameters)
   c. Use the remaining holdout subset for testing (estimating
      performance parameters).
3. Obtain the summary stats (e.g., min,median,mean,stdv,max)
   of the f performance estimates.

Therefore, at each stage,
1. There is a separate scaling using parameters estimated
   from the training subset.
2. There may be multiple validation trials to determine
   topology and learning algorithm parameters.
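
The stages above can be sketched roughly as follows. This is only an
illustration under stated assumptions: a single toy feature, the
validation subset chosen as the one cyclically after the test subset,
and hypothetical helper names (partition, stage_splits,
standardize_params). Model tuning and performance estimation are left
out; the sketch only shows that the scale factors come from the f-2
training subsets at each stage.

```python
import random
import statistics

def partition(n, f, seed=0):
    """Randomly partition the indices 0..n-1 into f subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::f] for i in range(f)]

def stage_splits(subsets, val_k, test_k):
    """Combine the f-2 remaining subsets for training; hold out two."""
    train = [j for k, s in enumerate(subsets)
             if k not in (val_k, test_k) for j in s]
    return train, subsets[val_k], subsets[test_k]

def standardize_params(values):
    """Mean and standard deviation estimated from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid division by zero
    return mean, std

def scale(x, indices, mean, std):
    return [(x[j] - mean) / std for j in indices]

# Toy data (made up): one feature, 60 samples.
rng = random.Random(0)
x = [rng.uniform(0, 100) for _ in range(60)]

f = 6
subsets = partition(len(x), f)
stages = []
for test_k in range(f):
    # Assumption: one of the f-1 possible validation choices, picked cyclically.
    val_k = (test_k + 1) % f
    train, val, test = stage_splits(subsets, val_k, test_k)
    # Scale factors are determined from the training subsets only...
    mean, std = standardize_params([x[j] for j in train])
    # ...and then applied unchanged to the validation and test subsets.
    stages.append({
        "train": scale(x, train, mean, std),
        "val": scale(x, val, mean, std),
        "test": scale(x, test, mean, std),
    })
```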

Notice that, for each of the f test subsets, there are
f-1 ways to choose a validation subset. Some experimenters
just make sure that the f pair selections are unique. Others
use all f*(f-1) pair selections to try to obtain more precision.
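
The two ways of selecting (validation, test) pairs can be enumerated
as below. The cyclic rule in unique_pairs is just one assumed way of
getting f unique pairs; other assignments would serve equally well.
Both function names are hypothetical.

```python
from itertools import permutations

def unique_pairs(f):
    """f (validation, test) pairs: every subset is the test subset once
    and the validation subset once (cyclic assignment, an assumption)."""
    return [((k + 1) % f, k) for k in range(f)]

def all_pairs(f):
    """All f*(f-1) ordered (validation, test) pairs of distinct subsets."""
    return [(v, t) for t, v in permutations(range(f), 2)]

f = 5
few = unique_pairs(f)   # f stages per XVAL experiment
many = all_pairs(f)     # f*(f-1) stages per XVAL experiment
```

For f = 5 this gives 5 stages in the first case and 20 in the second,
so the second costs f-1 times as many training runs.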

I haven't seen any comparisons of the two techniques.
However, Warren Sarle has suggested averaging over
M separate repartitioned f-fold XVAL experiments with
f unique val/tst combinations instead of using f*(f-1)
combinations in one XVAL experiment.
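
A rough sketch of that alternative follows: repeat the f-fold
experiment M times with a fresh random partition each time, using f
unique val/tst combinations per experiment, and average the results.
The function names and the `evaluate` callback are hypothetical; a
real `evaluate` would scale, train, tune and test, whereas the
placeholder here only returns the fraction of data used for training
so that the sketch runs.

```python
import random
import statistics

def repartitioned_xval(n, f, M, evaluate, seed=0):
    """Average over M separately repartitioned f-fold XVAL experiments."""
    rng = random.Random(seed)
    per_experiment = []
    for _ in range(M):
        # Fresh random partition for each experiment.
        idx = list(range(n))
        rng.shuffle(idx)
        subsets = [idx[i::f] for i in range(f)]
        scores = []
        for test_k in range(f):
            val_k = (test_k + 1) % f  # assumed cyclic val/tst pairing
            train = [j for k, s in enumerate(subsets)
                     if k not in (val_k, test_k) for j in s]
            scores.append(evaluate(train, subsets[val_k], subsets[test_k]))
        per_experiment.append(statistics.fmean(scores))
    return statistics.fmean(per_experiment), per_experiment

def placeholder_evaluate(train, val, test):
    # Stand-in for a real performance estimate (not a real metric).
    return len(train) / (len(train) + len(val) + len(test))

overall, per_experiment = repartitioned_xval(
    n=60, f=6, M=3, evaluate=placeholder_evaluate)
```

This costs M*f training runs, compared with f*(f-1) for the
all-pairs variant in a single experiment.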

Hope this helps.

Greg
