IlkkaKosunenimec-2146 asked IlkkaKosunenimec-2146 commented

Leave-one-group-out cross-validation in Azure AutoML

I have a dataset where each row is a data sample, and there is a column indicating the group the sample came from. Each group therefore has several data points, each stored as a row in the dataframe. I would like to run cross-validation so that, at each fold, the data points from one group are used as the validation set and the data points from the other groups as the training set. Is this currently possible in Azure AutoML?
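
The scheme described above is commonly called leave-one-group-out cross-validation. A minimal sketch of the intended folds in plain Python, assuming the group labels are available as a list (the function name and the sample labels are hypothetical, chosen only for illustration):

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, validation_indices), one fold per group."""
    # Collect the row indices belonging to each group, in first-seen order.
    rows_by_group = defaultdict(list)
    for row_index, group in enumerate(groups):
        rows_by_group[group].append(row_index)

    # Each group is held out once; all other rows form the training set.
    for held_out, validation in rows_by_group.items():
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, validation


# Hypothetical labels: five rows measured at three sites.
site_labels = ["A", "A", "B", "C", "C"]
folds = list(leave_one_group_out(site_labels))
```

With 20 groups this would produce 20 folds, each validating on exactly one group.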

azure-machine-learning

1 Answer

GiftA-MSFT answered IlkkaKosunenimec-2146 commented

Yes, you can specify custom cross-validation data folds based on columns. More details are provided in the following document. Hope this helps.

Example:

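 from azureml.train.automl import AutoMLConfig

 # aml_remote_compute and dataset are assumed to be defined earlier.
 # 'cv1' and 'cv2' are columns of the dataset that mark the custom splits.
 automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                              task = 'classification',
                              primary_metric = 'AUC_weighted',
                              training_data = dataset,
                              label_column_name = 'y',
                              cv_split_column_names = ['cv1', 'cv2']
                             )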







Thank you for your reply! Unfortunately, this does not help: I looked at cv_split_column_names earlier, and it only allows me to specify one specific fold and then do a single run with that split. What I need, for example with 20 sites or groups, is 20 runs, where in each run one group is the validation set.

I really hope there is some solution to this, because otherwise it seems there is no way to use AutoML for any dataset that has repeated measures, longitudinal measures, data measured at different sites, etc. Indeed, would that not make AutoML unusable for anything but the most trivial examples?

I hope I'm wrong because I really, really want to use AutoML in many interesting projects!



GiftA-MSFT replied to IlkkaKosunenimec-2146

Hi, thanks for your feedback. cv_split_column_names takes a list of column names that contain custom cross-validation splits. Each column represents one CV split, and each row in it is marked either 1 for training or 0 for validation. I'm making some inquiries and will get back to you as soon as possible.
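
Given that convention (one column per split, 1 for training and 0 for validation), a leave-one-group-out setup could in principle be encoded as one split column per group. The following is only a sketch under that assumption: the helper name, the "cv_" prefix, and the sample labels are hypothetical and not part of the AutoML API.

```python
def loo_group_split_columns(groups, prefix="cv_"):
    """Return {column_name: [1 or 0 per row]} with one split column per group.

    Convention taken from the description above: 1 = training, 0 = validation.
    """
    columns = {}
    # dict.fromkeys keeps the unique group labels in first-seen order.
    for held_out in dict.fromkeys(groups):
        columns[f"{prefix}{held_out}"] = [0 if g == held_out else 1 for g in groups]
    return columns


# Hypothetical labels: five rows measured at three sites.
site_labels = ["A", "A", "B", "C", "C"]
split_columns = loo_group_split_columns(site_labels)

# The names that would be passed as cv_split_column_names.
cv_split_column_names = list(split_columns)
```

If the data is held in a pandas DataFrame df, df.assign(**split_columns) would attach these columns before the dataset is created, and cv_split_column_names would then be passed to AutoMLConfig. Whether AutoML evaluates every listed column within a single experiment is the open question in this thread, so it is worth checking on a small dataset first.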


Thanks a lot for the reply! And sorry if I was a bit provocative in my question; I just wanted to provoke a productive response :)

Also, I'm not sure if I can be of any help, but I've implemented that kind of cross-validation for both sklearn and pytorch for my own use.

Thanks again for the reply and all the effort!
