Sampling from two different datasets for training and testing a model

I am doing an experiment to test the ability of a neural network model to generalize. I have two datasets of different sizes. The second dataset is different from the first one: it contains some words that are not in the first dataset. I want to train on examples from the first dataset and test on examples from the second. Is it correct to take a sample from the first dataset and use it as the training set, and take a sample from the second dataset and use it as the test set? More precisely, if the first dataset contains 66360 examples and the second one contains 56112 examples, can I sample 50000 examples from the first dataset and use them as the training set, and sample 50000 examples from the second dataset and use them as the test set?
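For the mechanics of drawing the two samples, here is a minimal NumPy sketch; dataset_a and dataset_b are placeholder arrays standing in for the 66360-example and 56112-example datasets:

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # draw 50000 distinct indices from each (placeholder) dataset
    train_idx = rng.choice(len(dataset_a), size=50000, replace=False)
    test_idx = rng.choice(len(dataset_b), size=50000, replace=False)

    train_set = dataset_a[train_idx]   # training sample from the first dataset
    test_set = dataset_b[test_idx]     # test sample from the second dataset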

Related

Loading a dataset in parts for training a neural network

This is my first post so please ask me if something is not clear.
I am currently working on training a neural network on a custom dataset that I have created. This dataset consists of 1000 folders, each containing 81 images (512x512 px) that are loaded, processed and used as input. My issue is that my computer cannot handle such a large dataset and I have to find a way to use the whole dataset.
The neural network that I am working on can be found here https://github.com/chshin10/epinet.
In the EPINET_train.py file you can see the data generator that is being used.
The neural network uses the RMSProp optimizer.
What I did to deal with this issue is split the data into 2 folders, one for training and one for testing, with an 80%-20% split. Then I load 10% of the data from each folder in order to train the neural network (the data was not chosen randomly). I train the neural network for 100 epochs, then I load the next set of data, until all of the sets have been used for training. Then I repeat the procedure.
After 3 iterations it seems to me that the loss function is not decreasing any further for each set of data. Is this approach used in similar scenarios? Is there something I can do better?
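One way to avoid swapping 10% chunks by hand is to stream batches from disk with a keras.utils.Sequence, so every epoch iterates over the full training set without it ever being fully in memory. This is only a rough sketch under assumed names (load_and_preprocess, the file lists and labels are placeholders, not the EPINET code):

    import numpy as np
    from tensorflow import keras

    class FolderSequence(keras.utils.Sequence):
        """Loads one batch of images from disk at a time."""
        def __init__(self, file_paths, labels, batch_size=8):
            super().__init__()
            self.file_paths = file_paths   # list of image paths, one per example
            self.labels = labels
            self.batch_size = batch_size

        def __len__(self):
            return int(np.ceil(len(self.file_paths) / self.batch_size))

        def __getitem__(self, idx):
            lo, hi = idx * self.batch_size, (idx + 1) * self.batch_size
            # load_and_preprocess is a hypothetical helper: read one image and preprocess it
            batch_x = np.stack([load_and_preprocess(p) for p in self.file_paths[lo:hi]])
            batch_y = np.array(self.labels[lo:hi])
            return batch_x, batch_y

    # model.fit(FolderSequence(train_paths, train_labels), epochs=100,
    #           validation_data=FolderSequence(val_paths, val_labels))

With a generator like this, each epoch covers the entire training set, so the loss curve is comparable from epoch to epoch instead of resetting with every new chunk.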

For a Kolmogorov-Smirnov two-sample test on very large data, is there a way to split the data into multiple samples and do the computation per sample?

Suppose I have two very large lists with many millions of values. Is there any way to do the computation on smaller samples individually and then combine the results from the different samples?
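Running the test separately on disjoint subsamples and combining the results is not equivalent to the full test, but if the two lists can be kept sorted (for example by sorting them externally, chunk by chunk), the exact two-sample KS statistic can be computed in a single streaming pass without holding everything in memory. A sketch, assuming xs and ys are iterators over the already-sorted values and n, m are their total lengths:

    def ks_statistic_sorted(xs, ys, n, m):
        """Exact two-sample KS statistic D = sup_x |F1(x) - F2(x)|,
        computed by a streaming merge of two sorted iterables of lengths n and m."""
        xs, ys = iter(xs), iter(ys)
        x, y = next(xs, None), next(ys, None)
        i = j = 0          # values consumed so far from each sample
        d = 0.0
        while x is not None or y is not None:
            # take the smallest remaining value, consuming every copy of it from
            # both streams before comparing the empirical CDFs (handles ties)
            v = min(v for v in (x, y) if v is not None)
            while x is not None and x == v:
                i += 1
                x = next(xs, None)
            while y is not None and y == v:
                j += 1
                y = next(ys, None)
            d = max(d, abs(i / n - j / m))
        return d

For samples this large, an approximate p-value can then be read off the asymptotic Kolmogorov distribution, e.g. scipy.special.kolmogorov(d * np.sqrt(n * m / (n + m))).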

Statistical testing of classifier performance on different population subsets

I have a dataset of people's medical records which contains people of different races, income levels, genders, etc.
I have a binary classification problem, and want to compare the accuracy of the two models:
Model 1: I have trained and tested a classifier on a random sample from the population.
Model 2: I have trained and tested a classifier on a sample of the population where the income level is above say 100k.
The test sets are of different sizes (both contain over 50,000 people). The test sets may of course contain some of the same people - I have a lot of data and will be doing the above experiment for many different conditions, so I am not stating how many people overlap between the sets, as it changes depending on the condition.
I believe I can't use a standard or modified t-test to compare the performance on the separate test sets, since the demographics are different in the two test sets - is this correct?
Instead, I think the only option is to compare the accuracy of model 1 on the same test set as model 2, to figure out whether model 2 performs better?
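If both models are scored on the same test set, the comparison is paired at the level of individual people, and a McNemar test on the two models' correct/incorrect indicators is one common choice. This is only a sketch of that paired comparison, not an answer to the demographics question; y_true, model_1_preds and model_2_preds are assumed to be arrays of labels and predictions on the shared test set:

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    # per-person correct/incorrect indicators for each model on the SAME test set
    correct_1 = (model_1_preds == y_true)
    correct_2 = (model_2_preds == y_true)

    # 2x2 agreement/disagreement table between the two models
    table = np.array([
        [np.sum(correct_1 & correct_2), np.sum(correct_1 & ~correct_2)],
        [np.sum(~correct_1 & correct_2), np.sum(~correct_1 & ~correct_2)],
    ])

    result = mcnemar(table, exact=False, correction=True)  # chi-square version for large n
    print(result.statistic, result.pvalue)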

Multiple regressor time series producing one output

Absolute beginner here. I'm trying to use a neural network to predict the price of a product that's being shipped, using temperature, deaths during a pandemic, rain volume, and a column of 0s and 1s (a dummy variable).
So imagine that I have a dataset with those values as well as a column giving me time in a year/week format.
I started reading Rob Hyndman's forecasting book but I haven't yet seen anything that can help me. One idea I have is to take each column of the dataframe and turn it into a time series. For example, for rain, I can do something like
    rain <- df$rain_inches
    cost <- mainset3 %>% select(approx_cost)
    raintimeseries <- ts(rain, frequency = 52, start = c(2015, 1), end = c(2021, 5))
I would do the same for the other regressors.
I want to use neural networks on each of the regressors to predict cost and then put them all together.
Ideally I'm thinking it would be a good idea to train on, say, three-quarters of the time series data, test on the remaining quarter, and then possibly make predictions.
I'm now seeing that even if I am using one regressor I'm still left with a multivariate time series and I've only found examples online for univariate models.
I'd appreciate it if someone could give me ideas on how to model my problem using neural networks.
I saw this link: Keras RNN with LSTM cells for predicting multiple output time series based on multiple input time series
but I just see a bunch of functions and nowhere that I can actually plug in my own data.
The solution to your problem is the same as for the univariate case you found online, except that you need to work differently with your feature/independent set. Your y variable (cost) remains as is, but your X variables have to be arranged in 3 dimensions: (number of observations, time steps, number of independent variables).
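Here is a minimal Keras sketch of that shape requirement; the window length, layer sizes and the random stand-in data are only illustrations of the (observations, time steps, features) layout, not a recommended architecture:

    import numpy as np
    from tensorflow import keras

    n_obs, time_steps, n_features = 300, 8, 4            # 4 regressors: temp, deaths, rain, dummy
    X = np.random.rand(n_obs, time_steps, n_features)     # stand-in for your windowed regressors
    y = np.random.rand(n_obs)                              # stand-in for the cost series

    model = keras.Sequential([
        keras.Input(shape=(time_steps, n_features)),       # 3D input: (time steps, features)
        keras.layers.LSTM(32),
        keras.layers.Dense(1),                             # single output: cost
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=10, batch_size=16, verbose=0)

In practice X would be built by sliding a window of length time_steps over the weekly data, with y holding the cost at the step following each window.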

Concept of validation for a neural network

I have a problem with the concept of validation for a NN. Suppose I have 100 sets of input variables (for example 8 inputs, X1, ..., X8) and want to predict one target (Y). Now I have two ways to use the NN:
1- Use 70 sets of data for training the NN, then use the trained NN to predict the target for the other 30 sets, and plot output vs. target for these 30 sets as the validation plot.
2- Use all 100 sets of data for training the NN, then divide all outputs into two parts (70% and 30%). Plot 70% of the outputs vs. the corresponding targets as the training plot, then plot the other 30% of the outputs vs. their corresponding targets as the validation plot.
Which one is correct?
Also, what is the difference between checking the NN with a new data set and with the validation data set?
Thanks
You cannot use data for validation if it has already been used for training, because the trained NN will already "know" your validation examples. The result of such a validation will be very biased. I would definitely use the first way.
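A minimal sketch of the first way; scikit-learn's MLPRegressor and the random stand-in data are just placeholders for whatever network and data are actually used:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor

    X = np.random.rand(100, 8)   # stand-in for the 100 sets of inputs X1..X8
    y = np.random.rand(100)      # stand-in for the target Y

    # hold out 30 of the 100 sets BEFORE training
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    net.fit(X_train, y_train)        # the network never sees the 30 validation sets

    val_pred = net.predict(X_val)    # plot val_pred vs. y_val as the validation plot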