Prediction/delay forecasting using Machine Learning? - matlab

I have a data set covering the past 5 years: approximately 7000 rows, with features that are either binary {yes/no} or multi-class {product A, B, C}, for a total of 20+ features.
I am trying to make a program (or a one-time analysis project) to predict the product ship date (shipping delay in days) based on this historical data. I have 2 columns: one indicating when a product was planned to ship and another indicating when it actually shipped.
I'm wondering how I can make a prediction program that determines, based on the historical data, when a new product entry can be expected to ship. I don't need a specific date; even a program that tells me the number of delay days to add would do.
I took an ML class a while back, and I'm not sure how to start something like this. Any advice? The closest thing to this I can think of is an image recognition assignment using a NN, but that was easier; here I have to deal with a date instead of black/white pixels. I used Matlab back in the day (I still know how to use it), but I just downloaded the Weka data mining tool.
I was thinking of a neural network, but I'm not sure how to set it up so that my program gives me the expected delay time (# of days/months) from the input ship date.
Basically,
I want to input (size = 5, prod = A, ....,expected ship date = jan 1st)
and the program returns the number of days to add as a delay onto my expected ship date given the historical trends...
I would appreciate any help on how to start something like this the correct/easiest/best way. Thanks in advance.

If you use Weka, get your input/label data into the ARFF format and then try out the different regressors (this is a regression problem, after all). To avoid having to do too much programming yet (if you are still in an exploratory phase), use the Weka Experimenter, which has a GUI for trying out a whole bunch of regressors on your dataset.
Then, when you find one that behaves as expected and you want to do more data analysis using MATLAB, you can use a Weka/MATLAB interface.
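As a rough illustration of the first step, here is a minimal Python sketch that derives the regression label (delay in days) from the planned and actual ship dates and writes the rows out as ARFF text. The column names (`size`, `product`, `rush_order`) and the sample rows are hypothetical stand-ins for the asker's 20+ features, not taken from the real data.

```python
from datetime import date

# Hypothetical rows: (size, product, rush_order, planned_ship, actual_ship).
# The real data set would have 20+ feature columns.
rows = [
    (5, "A", "yes", date(2015, 1, 1), date(2015, 1, 8)),
    (3, "B", "no", date(2015, 2, 10), date(2015, 2, 10)),
    (7, "C", "no", date(2015, 3, 5), date(2015, 3, 19)),
]


def delay_days(planned, actual):
    """Regression label: number of days between planned and actual ship date."""
    return (actual - planned).days


def to_arff(rows, relation="ship_delay"):
    """Render the rows as ARFF text with delay_days as the last attribute."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute size numeric",
        "@attribute product {A,B,C}",
        "@attribute rush_order {yes,no}",
        "@attribute delay_days numeric",
        "",
        "@data",
    ]
    for size, product, rush, planned, actual in rows:
        lines.append(f"{size},{product},{rush},{delay_days(planned, actual)}")
    return "\n".join(lines) + "\n"


arff_text = to_arff(rows)
print(arff_text)
```

The raw dates are replaced by a single numeric target, so any regressor can be tried on it. If seasonality matters, a feature derived from the planned date (for example its month) could be added as another nominal attribute.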

Related

tbl_regression (gtsummary) ordering covariables levels and processing time

Originally in my df, I had my BMI in numeric format (1-5), which I recoded (underweight to obese), converted to a factor, and gave a specific reference level using relevel (Normal, originally 3). Then I ran a logistic regression: y ~ BMI + other covariates. My questions are the following:
1- When I plug my logistic model into tbl_regression, the levels appear in an undesired order (underweight, obese 1, obese 2, overweight). Is there a way to rearrange the levels the way I want (underweight, overweight, obese 1, obese 2)?
2- I used tbl_regression on a small data set, which went OK. My new model, however, is based on 3M observations and 13 variables (the database is 1 GB). This time tbl_regression takes about 1 hour to process and output the table, which does not seem normal since I have a fast laptop. Is there a way to make this more efficient? I tried keeping only the model in memory while running tbl_regression and removing the database, but it is still extremely slow. I tried with the trial data and it was OK.
1 - I recommend using contrasts() to set the reference level. The relevel() function just moves a factor level to the first position. There are examples in the question "Is there a way to relevel a variable in gtsummary after generating the beautiful table?"
2 - I suspect with such a large model, the confidence interval calculation is what is slowing you down. If you see a big difference in the computation times of summary() and broom::tidy() with the CI calculation compared to tbl_regression(), please create an illustrative example (that anyone can run locally) and it can be looked into further.

Circular System, how to get numbers back into stock 1

I am creating a system dynamics and agent-based model for my dissertation.
Numbers generated through the different flows must be added back to the start to continue through the process.
For example, numbers flow from a parameter to stock 1, which goes through a flow process at a specific rate to stock 2. From stock 2, there is another flow process based on a particular rate to stock 3. The numbers from stock 3 need to go back into stock 1 to repeat the process.
Methods I have tried have been adding flows, links, and changing the initial value of stock 1.
Any help or suggestions are greatly appreciated!
Updated:
Added screenshots.
I think it is because of the difference between the two flows, e.g. a -9 based on the difference between flow and flow3 as shown in the screenshot.
Screenshots:
Graph of Stock 1
Model as a whole
In system dynamics, if you want to have a circular system (feedback loop), it needs to contain at least 1 stock inside the loop, which means that there is at least 1 delay in the feedback loop.
I will explain your model:
The stock has an outflow of 20 (flow) and an inflow of 11 (flow3), which produces a net outflow of 9/timeUnit.
This is what you see in the graph. It doesn't even matter whether your system is circular or not: that stock will lose 9/timeUnit forever.
Your wording is very strange when you say "numbers are generated and numbers are flowing"... it's not numbers that flow through your system... you can't really say "i have 3 numbers per minute flowing into a pool of numbers"
In your model, the "numbers" are definitely going back to the initial stock, but system dynamics is like water flowing, it is not a discrete paradigm, so you will not see the same "numbers" going back because system dynamics doesn't differentiate what individual "numbers" are flowing.
It's so weird already to have to use the word numbers to be consistent with your question.
So in order to have a better answer, you will need to specify:
what behavior you see here
what behavior you expect, and how it differs from what you see
what you would need to see in the system in order to say "yes, my system is working exactly as expected"
It would help if you let us know what your system represents, and if you used names that represent what is flowing through the system (instead of using numbers, stock and flow, because any explanation becomes confusing).
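The net-outflow arithmetic in this answer can be checked with a small simulation. The following is a minimal Python sketch, assuming a closed loop stock1 -> stock2 -> stock3 -> stock1 with constant flow rates and simple Euler integration. Only the rates 20 (flow) and 11 (flow3) come from the question; the initial stock values and the rate of the middle flow (`flow2`) are hypothetical.

```python
def simulate(stock1=100.0, stock2=0.0, stock3=0.0,
             flow=20.0, flow2=15.0, flow3=11.0, dt=1.0, steps=5):
    """Euler-integrate a closed loop stock1 -> stock2 -> stock3 -> stock1.

    flow  : stock1 -> stock2 (20 per time unit, from the question)
    flow2 : stock2 -> stock3 (hypothetical value)
    flow3 : stock3 -> stock1 (11 per time unit, from the question)
    """
    history = [(stock1, stock2, stock3)]
    for _ in range(steps):
        stock1 += (flow3 - flow) * dt   # inflow 11, outflow 20: net -9
        stock2 += (flow - flow2) * dt
        stock3 += (flow2 - flow3) * dt
        history.append((stock1, stock2, stock3))
    return history


history = simulate()
for t, (s1, s2, s3) in enumerate(history):
    # The last column is the total, which stays constant in a closed loop.
    print(t, s1, s2, s3, s1 + s2 + s3)
```

Stock 1 drops by 9 every time unit even though everything that leaves it eventually returns, because the return flow is slower than the outflow; the material simply accumulates in the other stocks. For stock 1 to level off, the flow rates would have to depend on the stock values rather than being constants.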

Modeler question: Is there a function in SPSS for multiple 'if' statements? Forecasting dates

I am trying to build a forecast for interest expense for floating debt in my company.
I have been given a set of ResetDates which help me match a given rate based on when the ResetDate is.
I have been successful in forecasting one period, but I need a much longer set of periods to satisfy my requirements.
I've tried derive nodes and nested if statements as well as filler nodes.
I am given this data to work with, I can only look at one ResetDate ahead.
Here you will find the data I used: Columns A/B/C/D are what I'm given, and Column E (the 5th column from left to right) is what I want to derive as my output.
I want to use 'InterestPayDate' and derive:
if it's more than 'NextReset', then add 90 days to 'NextReset' to create 'NextReset2'
That is as far as I can get.... where my problem lies is I want to look at NextReset2 and derive:
if 'InterestPayDate' is more than 'NextReset2', then add 90 days to 'NextReset2', if it's less than 'NextReset2', keep the current value for 'NextReset2'
Output should look like Column E here
Not sure if I need to dig deeper into the logical functions, in all honesty, I've just picked up SPSS and I am really trying to learn. Hopefully, you can point me in the right direction.
Thank you.
After computing the first NextReset2, you need to use a Filler node like the one below to change the value of the field.
You might need more than one identical node like this: one for each potential 90-day period by which you want to extend the NextReset2 date. In your sample data, you will need at least two Filler nodes to get the correct value of NextReset2 for the last of the records.
There might be a more elegant way to do it, but this will work and it's easy enough to make copies of a node and string them together like this.
Please also see a sample IBM SPSS Modeler stream showing this approach here and using your sample data.
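The logic that the chained Filler nodes implement can be written as a loop. This is a minimal Python sketch of that rule, not SPSS Modeler syntax; the function name `roll_reset` and the sample dates are hypothetical, while the 90-day step and the comparison against 'InterestPayDate' come from the question.

```python
from datetime import date, timedelta


def roll_reset(interest_pay_date, next_reset, period_days=90):
    """Push next_reset forward in 90-day steps until it is no longer
    earlier than interest_pay_date. Each loop iteration corresponds to
    one Filler node in the chained-node approach."""
    while interest_pay_date > next_reset:
        next_reset += timedelta(days=period_days)
    return next_reset


# Hypothetical record: the pay date is two reset periods past NextReset.
print(roll_reset(date(2020, 6, 1), date(2020, 1, 1)))

# A pay date on or before NextReset leaves the value unchanged.
print(roll_reset(date(2020, 1, 1), date(2020, 3, 1)))
```

A loop needs as many iterations as the data requires, whereas a fixed chain of nodes only handles as many 90-day periods as there are nodes, which is why the number of copies has to cover the longest gap in the data.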

prediction and time series

How do I decide how far in advance my prediction is?
I am following the featuretools churn tutorial https://github.com/Featuretools/predict-customer-churn
What I don't quite understand is how it decided that the prediction is for one month in advance. In previous churn examples I tried, I just took aggregated data (it could be historical data covering years or months), then built a churn model and predicted, but I don't know whether my prediction is for a month, a year, or some number of days in advance. How is that decided?
Does it depend on the period of aggregation or on the data I didn't use? I know the cutoff time is the time at which I want to make the prediction, but how do I tell the system I want to predict 2 months in advance? Do I just disregard the data for the last two months by setting the cutoff time, provide the label from after the two months, and say that my model, based on the features I get, makes a 2-month advance prediction?
For example, the cutoff date is 1/8/2010 and the label is the customer state on 1/10/2010.
So is the two-month period the advance prediction, with all historical data before the cutoff time being used?
This might be a time series problem that has been turned into a simple classification, but I am not sure!
You pick the amount of time in advance (called "lead time") using your domain expertise. Depending on the real-world application, the lead time might be longer or shorter. Sometimes you might even build multiple models with different lead times to apply in different situations.
You control the lead time by moving the cutoff earlier with respect to the time the label became known. So, the example you give looks correct.
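The relationship between label time, lead time, and cutoff can be sketched in a few lines. This is a minimal plain-Python illustration rather than Featuretools code; the event records are hypothetical, and the dates from the question are read as day/month/year (1 August and 1 October 2010), which is an assumption.

```python
from datetime import date, timedelta


def cutoff_for(label_time, lead_time):
    """The cutoff is the label time moved earlier by the lead time."""
    return label_time - lead_time


# Hypothetical customer events: (customer_id, event_date).
events = [
    ("c1", date(2010, 6, 15)),
    ("c1", date(2010, 7, 30)),
    ("c1", date(2010, 9, 10)),  # falls inside the lead-time gap
]

label_time = date(2010, 10, 1)   # when the churn label becomes known
lead_time = timedelta(days=61)   # roughly two months
cutoff = cutoff_for(label_time, lead_time)

# Only events at or before the cutoff may be used to build features.
usable = [event for event in events if event[1] <= cutoff]

print(cutoff)
print(usable)
```

Everything between the cutoff and the label time is deliberately ignored when building features, and that gap is what makes the model a two-month-ahead predictor.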

running NN software with my own data

New with Matlab.
When I try to load my own data using the NN pattern recognition app window, I can load the source data, but not the target (it is never in the drop-down list). Both source and target are in the same directory. The source is 5000 observations with 400 variables per observation, and the target can take on 10 different values (recognizing digits). Any ideas?
Before you do anything with your own data you might want to try out the example data sets available in the toolbox. That should make many problems easier to find later on because they definitely work, so you can see what's wrong with your code.
Regarding your actual question: Without more details, e.g. what your matrices contain and what their dimensions are, it's hard to help you. In your case some of the problems mentioned here might be similar to yours:
http://www.mathworks.com/matlabcentral/answers/17531-problem-with-targets-in-nprtool
From what I understand about nprtool, your targets have to be a matrix in which each row or each column (depending on the orientation of the input matrix) contains exactly one 1, marking the correct class, so make sure that's the case.
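To make the target layout concrete, here is a minimal sketch of that one-1-per-sample encoding, written in Python purely for illustration (it is not MATLAB code and does not use any toolbox function). It assumes the class labels are integers 0 to 9 and that samples are stored as rows; both are assumptions about the asker's data.

```python
def one_hot(labels, num_classes):
    """Turn a list of integer class labels into rows containing a single 1."""
    rows = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1  # mark the correct class
        rows.append(row)
    return rows


labels = [3, 0, 9]            # hypothetical digit labels for three samples
targets = one_hot(labels, 10)

for row in targets:
    print(row)
```

With 5000 observations and 10 classes, the equivalent target matrix would be 5000 x 10 when samples are rows, or 10 x 5000 when samples are columns, matching the orientation of the input matrix.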