Multiple regression analysis split result concern - linear-regression

I am currently running a study for my thesis, and when I ran the MRA in SPSS and split the main sample into two, the results for the predictors of Split 1 and Split 2 were not the same. I ran through the data set again but got the same results. My problem is: how should this be interpreted?
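For intuition (a toy sketch, not the thesis data): two random halves of the same sample are expected to give somewhat different coefficient estimates purely from sampling variability. The sketch below assumes Python with scikit-learn, purely for illustration.

# Toy illustration: the same regression fitted on two random halves of one
# sample yields different coefficient estimates due to sampling variability.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))                # three predictors
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=1.0, size=400)

X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)
print(LinearRegression().fit(X1, y1).coef_)  # estimates from split 1
print(LinearRegression().fit(X2, y2).coef_)  # close, but not identical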

Related

Usage of CP-SAT to solve for 3 million Boolean variables

Dear all,
I want to understand whether I am using the CP-SAT solver improperly. Basically, my code automatically creates a model by reading a CSV containing a dataset. The code calls model.NewBoolVar() for each record of the dataset, multiplied by the number of possible decisions the optimization problem has to take...
For example, if I have a dataset with 1 million records and have to decide between 3 options, the model will contain 3 million Boolean variables. The combination of those 3 million Booleans is the solution to my optimization problem.
Currently, once I go past about 100K variables the program becomes unstable and Python crashes. Do you think I'm using CP-SAT improperly? Do you have experience with these kinds of volumes?
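In essence the construction looks like this (a minimal sketch of the pattern with scaled-down sizes, using the OR-Tools CP-SAT Python API; the real case would be 1,000,000 records x 3 options):

# Minimal sketch of the variable-creation pattern described above.
from ortools.sat.python import cp_model

n_records = 1000   # stand-in for ~1 million CSV records
n_options = 3      # possible decisions per record

model = cp_model.CpModel()
choice = {}
for r in range(n_records):
    for o in range(n_options):
        # One Boolean per (record, option) pair: n_records * n_options in total.
        choice[r, o] = model.NewBoolVar(f"choice_{r}_{o}")
    # Each record must pick exactly one option.
    model.Add(sum(choice[r, o] for o in range(n_options)) == 1)

solver = cp_model.CpSolver()
print(solver.StatusName(solver.Solve(model)))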
Thank you very much.
Cheers
You are aware that this is an NP-complete problem?
With 3 million Boolean variables, you are thus potentially creating a search tree of size 2^3,000,000.

Some questions about the train_test_split() function

I am currently trying to use scikit-learn's LinearRegression() model to describe the relationship between two variables X and Y. Given a dataset with 8 columns and 1000 rows, I want to split this dataset into training and test sets using train_test_split.
My question: what is the difference between train_test_split(dataset, test_size, random_state=int) and train_test_split(dataset, test_size)? Does the second one (without setting random_state) give me a different test set and training set each time I re-run my program? Does the first one give me the same test set and training set every time I re-run my program? And what is the difference between setting random_state=42 vs random_state=43, for example?
In Python, scikit-learn's train_test_split will split your input data into two sets: (i) train and (ii) test. It has a random_state argument that controls the pseudo-random shuffling used to split the data.
If the argument is not set, the split is seeded from a fresh source of randomness, so you will generally get a different train/test split each time you re-run the program. (Stratified splitting is controlled by the separate stratify argument, not by random_state.)
If you want a reproducible random split, for example so that you can measure the performance of your regression on the same data with the same split later, pass a fixed integer as random_state. Each value gives you a different pseudo-random split of your initial data, so random_state=42 and random_state=43 produce different splits, but re-using the same value on the same data always reproduces the same split.
This is useful for cross-validation techniques in machine learning.
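A short sketch of both behaviours, assuming scikit-learn (the array shapes are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000 * 8).reshape(1000, 8)   # 1000 rows, 8 columns
y = np.arange(1000)

# Fixed seed: the same split on every run. random_state=43 would give a
# different, but equally reproducible, split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# No seed: a different split each time the program runs.
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2)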

Machine Learning: How to handle discrete and continuous data together

I'm posting to ask whether there are any methodologies or ideas for handling discrete and continuous data together in a classification problem.
In my situation, I have a bunch of independent "batches" with discrete data. This is process-related data, so for each batch there are separate points. I also have a dataset that varies with time for the same batches; this time, however, there are many time observations for every batch. The data sets look like this:
Data Set 1
Batch 1 DiscreteInfo(1) DiscreteInfo(2) ....... DiscreteInfo(n)
Batch 2 DiscreteInfo(1) DiscreteInfo(2) ....... DiscreteInfo(n)
Batch 3 DiscreteInfo(1) DiscreteInfo(2) ....... DiscreteInfo(n)
Batch 4 DiscreteInfo(1) DiscreteInfo(2) ....... DiscreteInfo(n)
Data Set 2
Batch 1 t(1) TimeData
Batch 1 t(2) TimeData
Batch 1 t(3) TimeData
Batch 1 t(4) TimeData
...
Batch n t(1) TimeData
Batch n t(2) TimeData
Batch n t(3) TimeData
I am trying to classify whether all this data belongs to a 'Good' batch, a 'Bad' batch, or a 'so-so' batch. This is determined by one specific discrete parameter (not included in the data sets).
I'm very new to machine learning; any input or ideas would be appreciated. I'm using the MATLAB Classification Learner to try to tackle this problem.
There are a few things that you need to consider when dealing with a classification problem.
Training data: for classification we need all of the above-mentioned attributes' values along with the class label, i.e. whether the batch is 'Good', 'Bad', or 'so-so'.
Using this we can train a model, and then, given new data for the same attributes, we can predict which class it belongs to.
As far as discrete vs. continuous is concerned, there is no fundamental difference in the way we handle them. In fact, for this case we can generate new attributes that are functions of the time observations for a given batch and then perform the classification, as sketched below. If you provide an instance of the data set, the question can be answered more precisely.
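A hedged sketch of that "summarise the time series per batch" idea, assuming pandas and scikit-learn (all column names and values below are illustrative, not from the question's data):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Data Set 1: one row of discrete process attributes per batch.
discrete = pd.DataFrame({
    "batch": [1, 2, 3, 4],
    "d1":    [0, 1, 0, 1],
    "d2":    [3, 5, 2, 4],
})
# Data Set 2: many timestamped observations per batch.
timeseries = pd.DataFrame({
    "batch": [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "value": [0.1, 0.3, 0.2, 1.1, 0.9, 0.4, 0.5, 0.7, 0.6],
})
# Collapse each batch's time series into fixed-length summary attributes.
summaries = (timeseries.groupby("batch")["value"]
             .agg(["mean", "std", "min", "max"])
             .reset_index())
features = discrete.merge(summaries, on="batch")

labels = ["Good", "Bad", "so-so", "Good"]   # illustrative class labels
clf = RandomForestClassifier(random_state=0)
clf.fit(features.drop(columns="batch"), labels)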

Make a neural network continue number patterns

I discovered neural networks a little while ago and I know how they work,
but I have a question: how can I train a NN to continue number patterns,
e.g. 0, 2, 4, 6, 8 as input and 10, 12, 14, 16, 18 as output?
I don't know how to set this up, but I thought about using [0, 1, 2, 3, 4] as input and
[0, 2, 4, 6, 8] as output, meaning each input is the position of a number in the sequence (0 is the first number, 3 the fourth).
Can this work, or what are the other ways?
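One way to realise the position-to-value idea from the question (a sketch using scikit-learn's MLPRegressor as one possible network; layer sizes and iteration counts are illustrative):

import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.arange(10).reshape(-1, 1)   # positions 0..9 in the sequence
y = 2.0 * X.ravel()                # the pattern values 0, 2, 4, ..., 18

net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
net.fit(X, y)

# Ask the network to continue the pattern at unseen positions. Note that
# plain MLPs extrapolate poorly outside the training range, so these
# predictions will only be rough approximations of 20, 22, 24.
print(net.predict(np.array([[10], [11], [12]])))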

Weka classifier MultilayerPerceptron

I have a problem with Weka =/.
I'm using Weka for data mining time series with a neural network, in other words the MultilayerPerceptron classifier.
My configuration is "MultilayerPerceptron -L 0.3 -M 0.1 -N 1000 -V 0 -S 0 -E 20 -H a".
There is the problem... Weka never finishes.
I have 1904 instances and 18 attributes, corresponding to five days of time series; it is not much data =/.
The last time, Weka ran for 8 days and then stopped running, but didn't give me a result.
Any ideas?
I ran a MultilayerPerceptron with 10-fold cross-validation on a generated dataset containing 1904 instances and 18 attributes.
Given the configuration outlined above, each fold took 12 seconds on my PC and completed fine. Given the size of the dataset and the number of training runs, it shouldn't really take long to train the MLP.
Perhaps there is something up with the data you are using (perhaps you could supply the ARFF header and some sample lines), or the system stopped training for some reason. You could try another computer, but I'm not sure that would resolve the issue.
I can't see why it would take 8 days to train a network like this. You probably don't need to wait that long before concluding that there is an issue in the training. :)
Hope this helps!
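If you want to reproduce the run outside the GUI, for instance on another machine, one option is something like the following (a sketch assuming the third-party python-weka-wrapper3 package; the ARFF file name is illustrative):

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.core.classes import Random
from weka.classifiers import Classifier, Evaluation

jvm.start()
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("timeseries.arff")   # illustrative file name
data.class_is_last()

# Same options as the configuration string in the question.
mlp = Classifier(
    classname="weka.classifiers.functions.MultilayerPerceptron",
    options=["-L", "0.3", "-M", "0.1", "-N", "1000",
             "-V", "0", "-S", "0", "-E", "20", "-H", "a"])

evaluation = Evaluation(data)
evaluation.crossvalidate_model(mlp, data, 10, Random(1))
print(evaluation.summary())
jvm.stop()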