Why is AddDisjunction in the VRP solver not working as expected? - or-tools

Cross-posted from the OR-tools Google group.
I am working with a multi-vehicle VRP with due dates over 5 periods (e.g., some nodes are due at time t=0, so I give them a penalty cost of 10000, with 1000 for all other nodes, and so on through period 4). Initially I followed the exact steps laid out here with AddDisjunction to set priorities for certain nodes, and so expected that the solution would always pick up more, rather than fewer, nodes. However, in my example code, you'll see that the solver is dropping multiple nodes with smaller demand and picking up nodes with larger demand instead. I came across this same issue when working on a single-vehicle problem, but there I was able to use AddSoftSameVehicleConstraint as a workaround.
My code is here; I would direct you to the cell titled "Basic SVRP" onwards, as the cells prior are for data generation. The most important thing in the output is that all nodes whose names start with "a" or "b" have demands of only 1-2 units, while "c" and "d" nodes have demands between 4 and 8 units. Ideally, then, "a" and "b" nodes should only be dropped as a last resort.
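For reference, the disjunction setup looks roughly like this (a condensed sketch with toy stand-ins for the notebook's data, not the actual notebook code):

from ortools.constraint_solver import pywrapcp

# toy data: node 0 is the depot
demands = [0, 1, 2, 4, 8]
due_period = [0, 0, 1, 2, 3]   # the period in which each node is due
vehicle_caps = [10]

manager = pywrapcp.RoutingIndexManager(len(demands), len(vehicle_caps), 0)
routing = pywrapcp.RoutingModel(manager)

# constant arc costs keep the sketch short
transit = routing.RegisterTransitCallback(lambda i, j: 1)
routing.SetArcCostEvaluatorOfAllVehicles(transit)

demand_cb = routing.RegisterUnaryTransitCallback(
    lambda i: demands[manager.IndexToNode(i)])
routing.AddDimensionWithVehicleCapacity(demand_cb, 0, vehicle_caps, True, "Capacity")

# one disjunction per customer: dropping a node costs its penalty
for node in range(1, len(demands)):
    penalty = 10000 if due_period[node] == 0 else 1000
    routing.AddDisjunction([manager.NodeToIndex(node)], penalty)

solution = routing.SolveWithParameters(pywrapcp.DefaultRoutingSearchParameters())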
Any help rectifying this would be greatly appreciated; happy to simplify/clarify where needed.

Related

tbl_regression (gtsummary): ordering covariate levels and processing time

Originally in my df, I had BMI in numeric format (1-5), which I recoded (underweight to obese), converted to a factor, and chose a specific reference level using relevel (Normal, originally 3). Then I ran a logistic regression: y ~ BMI + other covariates. My questions are the following:
1. When I plug my logistic model into tbl_regression, the levels appear in an undesired order (underweight, obese 1, obese 2, overweight). Is there a way to rearrange the levels the way I want (underweight, overweight, obese 1, obese 2)?
2. I used tbl_regression on a small data set, which went fine. My new model, however, is based on 3M observations and 13 variables (the database is 1 GB). This time tbl_regression takes about an hour to process and output the table, which is not normal, since I have a fast laptop. Is there a way to make this more efficient? I tried keeping only the model while using tbl_regression and removing the database, but it is still hellishly long. I tried with the trial data and it was fine.
1. I recommend using contrasts() to set the reference level; the relevel() function just moves a factor level to the first position. Examples here: Is there a way to relevel a variable in gtsummary after generating the beautiful table?
2. I suspect that with such a large model, the confidence interval calculation is what is slowing you down. If you see a big difference in computation time between summary() or broom::tidy() with the CI calculation and tbl_regression(), please create an illustrative example (that anyone can run locally) and it can be looked into further.

Circular system: how to get numbers back into stock 1

I am creating a combined system dynamics and agent-based model for my dissertation.
Numbers generated through the different flows must be added back to the start to continue through the process.
For example, numbers flow from a parameter to stock 1, which goes through a flow process at a specific rate to stock 2. From stock 2, there is another flow process based on a particular rate to stock 3. The numbers from stock 3 need to go back into stock 1 to repeat the process.
I have tried adding flows and links, and changing the initial value of stock 1.
Any help or suggestions are greatly appreciated!
Updated:
Added screenshots.
I think it is because of the difference between the two flows, e.g. a net -9 resulting from the difference between flow and flow3, as shown in the screenshot.
Screenshots:
Graph of Stock 1
Model as a whole
In system dynamics, if you want to have a circular system (feedback loop), it needs to contain at least one stock inside the loop, which means there is at least one delay in the feedback loop.
I will explain your model:
Stock 1 has an outflow of 20 (flow) and an inflow of 11 (flow3)... this produces a net outflow of 9 per time unit.
This is what you see in the graph... and it doesn't even matter whether your system is circular or not; that stock will lose 9 per time unit forever.
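To make that arithmetic concrete, here is a minimal plain-Python sketch (not AnyLogic; the initial value is made up) of a stock with these two constant flows:

stock1 = 100.0                  # hypothetical starting value
inflow, outflow = 11.0, 20.0    # flow3 and flow
for t in range(10):
    stock1 += inflow - outflow  # net change: -9 per time unit
    print(t + 1, stock1)        # declines linearly, circular structure or not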
Your wording is very strange when you say "numbers are generated and numbers are flowing"... it's not numbers that flow through your system... you can't really say "I have 3 numbers per minute flowing into a pool of numbers".
In your model, the "numbers" are definitely going back to the initial stock, but system dynamics is like water flowing; it is not a discrete paradigm, so you will not see the same "numbers" going back, because system dynamics doesn't differentiate which individual "numbers" are flowing.
It's already awkward to have to use the word "numbers" just to be consistent with your question.
So in order to have a better answer, you will need to specify:
what behavior you currently see
what is the behavior you expect, and how it differs from what you see
what you would need to see in the system in order to say "yes, my system is working exactly as expected"
It would help if you let us know what your system represents, and if you used names that represent what is flowing through the system (instead of "numbers", "stock", and "flow", because otherwise any explanation becomes confusing).

Average result of 50 NetLogo simulations - agent-based simulation

I run an infectious disease spread model similar to the "Virus" model in the Models Library, changing the "infectiousness" parameter.
I did 20 runs each for infectiousness values of 98%, 95%, and 93%, and the maximum infected counts were 74.05, 73, and 78.9 respectively (the peak was at tick 38 for all 3 infectiousness values).
[I took the average of the infected count at each tick across runs, and took the maximum of these averages as the "maximum infected".]
I was expecting the maximum infected count to decrease when the infectiousness is reduced, but it didn't. As I understand it, this happens because I considered the average values across the simulation runs (it is as if I am considering a new simulation run whose infected count at each tick is the average).
What I want is to take all 20 simulation runs into account. Is there a way to do that other than the averaging approach I used?
In the Models Library Virus model, with default parameter settings for the other values and those high infectiousness values, what I see when I run the model is a periodic variation in the numbers of the three classes of people. Look at the plot in the lower left corner, and you'll see this. What is happening, I believe, is this:
1. When there are many healthy, non-immune people, there are many people who can get infected, so the number of infected people goes up, and the number of healthy people goes down.
2. Soon after that, the number of sick, infectious people goes down, because they either die or become immune.
3. Since there are now more immune people and fewer infectious people, the number of healthy, non-immune people grows; they are reproducing. (See "How it works" in the Info tab.) But now we have returned to the situation in step 1... so the cycle continues.
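To illustrate the cycle, here is a minimal deterministic SIR-style sketch in plain Python (this is not the NetLogo model; the parameters are made up, and the real model adds stochasticity, space, and waning immunity). With these parameters the infected count overshoots, crashes, and then oscillates toward an endemic level:

N, beta, gamma, mu = 1000.0, 0.6, 0.1, 0.02   # mu: birth/death ("reproduction") rate
S, I, R = N - 10.0, 10.0, 0.0
dt, peak = 0.1, 0.0
for step in range(6000):
    new_inf = beta * S * I / N
    S += dt * (mu * N - new_inf - mu * S)
    I += dt * (new_inf - (gamma + mu) * I)
    R += dt * (gamma * I - mu * R)
    peak = max(peak, I)
    if step % 600 == 0:
        print(round(step * dt), round(S), round(I), round(R))
print("peak infected:", round(peak))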
If your model is sufficiently similar to the Models Library Virus model, I'd bet that this is part of what's happening. If you don't have a plot window like the Virus model's, I recommend adding one.
Also, you didn't say how many ticks you are running the model for. If you run it for only a small number of ticks, you won't notice the periodic behavior, but that doesn't mean it hasn't begun.
What this all means is that increasing infectiousness wouldn't necessarily increase the maximum number infected: a faster rate of infection means that the number of individuals who can be infected drops faster. I'm not sure that the maximum number infected over the whole run is an interesting number with this model and a high infectiousness value; it depends on what you are trying to understand.
One of the great things about NetLogo and some other ABM systems is that you can watch the system evolve over time, using various tools such as plots and monitors, as well as just watching the agents move around or change state over time. This can help you understand what is going on in a way that a single number like an average won't. You can then use this insight to figure out a more informative way of measuring what is happening.
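On the measurement question specifically: one way to use all 20 runs without flattening peaks that occur at different ticks is to take each run's own maximum first and then average those maxima. A rough sketch (with random stand-in data; you would substitute your per-tick infected counts, e.g. from BehaviorSpace output):

import random
random.seed(0)
runs = [[random.randint(0, 100) for _ in range(200)] for _ in range(20)]  # 20 runs x 200 ticks

peak_of_means = max(sum(tick) / len(tick) for tick in zip(*runs))
mean_of_peaks = sum(max(run) for run in runs) / len(runs)
print(peak_of_means, mean_of_peaks)  # the first can never exceed the second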
Another model where you can see a similar kind of periodic pattern is Wolf Sheep Predation; I recommend looking at that. It may be easier to understand the pattern there. (If you are interested in mathematical models of this kind of phenomenon, look up Lotka-Volterra models.)
(Real virus transmission can be more complicated, because a person (or other animal) is a kind of big "island" where viruses can reproduce quickly. If they reproduce too quickly, this can kill the host, and prevent further transmission of the virus. Sometimes a virus that reproduces more slowly can harm more people, because there is time for them to infect others. This blog post by Elliott Sober gives a relatively simple mathematical introduction to some of the issues involved, but his simple mathematical models don't take into account all of the complications involved in real virus transmission.)
EDIT: You added a comment Lawan, saying that you are interested in modeling COVID-19 transmission. This paper, Variation and multilevel selection of SARS‐CoV‐2 by Blackstone, Blackstone, and Berg, suggests that some of the dynamics that I mentioned in the preceding remarks might be characteristic of COVID-19 transmission. That paper is about six months old now, and it offered some speculations based on limited information. There's probably more known now, but this might suggest avenues for further investigation.
If you're interested, you might also consider asking general questions about virus transmission on the Biology Stackexchange site.

Alerting on RAM/CPU usage anomalies in an e-commerce server

Currently I'm building monitoring services for my e-commerce server, mostly focused on CPU/RAM usage. It's essentially anomaly detection on time-series data.
My approach is to build an LSTM neural network to predict the next CPU/RAM value in the trend, and to compare the actual value against the standard deviation (STD) multiplied by some factor (currently 10).
But in real-life conditions, it depends on many different factors, such as:
1. Maintenance time (during this period an "anomaly" is not an anomaly).
2. Sales periods during day-off events, holidays, etc., when increased RAM/CPU usage is of course normal.
3. If the percentage decrease in CPU/RAM is sustained across 3 observations (5 min, 10 min, and 15 min) -> anomaly. But if usage decreased 50% over 5 minutes and then barely changed over 10 minutes (-5% to +5%) -> not an anomaly.
Currently I detect anomalies with a formula like this:
isAlert = (Diff5m >= 10 && Diff10m >= 15 && Diff30m >= 40)
where Diff is the percentage difference in absolute value.
Unfortunately I didn't save my "pure" data for building the neural network; for example, when the system detected an anomaly, I modified the record so it is no longer marked as an anomaly.
I would like to add some attributes to my model's input, such as isMaintenance, isPromotion, isHoliday, etc., but sometimes this leads to overfitting.
I also want my NN to be able to adjust its baseline over time, for example as my service becomes more popular.
Are there any hints on these aims?
Thanks
I would say that an anomaly is an unusual outcome, i.e. an outcome that's not expected given the inputs. As you've figured out, there are a few variables that can be expected to influence CPU and RAM usage. So why not feed those to the network? That's the whole point of machine learning. Your network will make a prediction of CPU usage taking into account the sales volume, whether there is (or was) a maintenance window, etc.
Note that you probably don't need an isPromotion input if you include actual sales volumes. The former is a discrete input, and only captures a fraction of the information present in a totalSales input.
Machine learning definitely needs data. If you threw that away, you'll have to start capturing it again. As for adjusting the baseline, you can achieve that by overweighting recent input data.
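As a concrete (hypothetical) sketch of both suggestions in Python with Keras: exogenous features such as sales volume and maintenance flags go into the input window, and recent samples get larger weights so the baseline drifts with the service's popularity. Shapes and feature names here are assumptions, not your actual data:

import numpy as np
import tensorflow as tf

window, n_features = 12, 4   # e.g. [cpu, ram, total_sales, is_maintenance]
X = np.random.rand(500, window, n_features).astype("float32")  # stand-in windows
y = np.random.rand(500, 1).astype("float32")                   # next CPU value

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(window, n_features)))
model.add(tf.keras.layers.LSTM(32))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer="adam", loss="mse")

# overweight recent samples so the learned baseline adapts over time
weights = np.exp(np.linspace(-2.0, 0.0, len(X)))
model.fit(X, y, sample_weight=weights, epochs=2, verbose=0)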

Mahout K-means has different behavior based on the number of mapping tasks

I experience a strange situation when running Mahout K-means:
Using a pre-selected set of initial centroids, I run K-means on a SequenceFile generated by lucene.vector. The run is for testing purposes, so the file is small (around 10 MB, ~10,000 vectors).
When K-means is executed with a single mapper (the default, given the Hadoop split size, which in my cluster is 128 MB), it reaches a given clustering result in 2 iterations (case A).
However, I wanted to test whether there would be any improvement or deterioration in the algorithm's execution speed by firing more mapping tasks (the Hadoop cluster has 6 nodes in total).
I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes, in order to make Mahout fire 2 mapping tasks (case B).
I indeed succeeded in starting two mappers, but the strange thing was that the job finished after 5 iterations instead of 2, and that even at the first assignment of points to clusters, the mappers made different choices compared to the single-mapper execution. What I mean is that, after close inspection of the clusterDump for the first iteration in both cases, I found that in case B some points were not assigned to their closest cluster.
Could this behavior be justified by the existing K-means Mahout implementation?
From a quick look at the sources, I see two problems with the Mahout k-means implementation.
First of all, the way the S0, S1, S2 statistics are kept is probably not numerically stable for large data sets. Oh, and since k-means actually does not even use S2, it is also unnecessarily slow. I bet a good implementation can beat this version of k-means by a factor of 2-5 at least.
For small data sets split across multiple machines, there seems to be an error in the way the means are computed. Ouch. This will be amplified if the reducer is applied to more than one input, in particular when the partitions are small. To be more precise: the cluster mean apparently is initialized with the previous mean instead of the 0 vector. Now if you reduce 't' copies of it, the resulting vector will be off by 't' times the previous mean.
Initialization of AbstractCluster:
setS1(center.like());
Update of the mean:
getS1().assign(x, Functions.PLUS);
Merge of multiple copies of a cluster:
setS1(getS1().plus(cl.getS1()));
Finalization to new center:
setCenter(getS1().divide(getS0()));
So with this approach, the center will be offset from the proper value by the previous center times t / n, where t is the number of splits and n the number of objects.
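A toy reproduction of that arithmetic in plain Python/NumPy (this is not Mahout code, just the merge logic described above):

import numpy as np

prev_center = np.array([10.0, 10.0])   # previous iteration's center (hypothetical)
points = np.random.rand(100, 2)        # data not centered on the 0 vector
t = 4                                  # number of splits/mappers

# each mapper starts S1 at the previous center, then adds its partition's points
partitions = np.array_split(points, t)
S0 = sum(len(p) for p in partitions)
S1 = sum(prev_center + p.sum(axis=0) for p in partitions)

print(S1 / S0 - points.mean(axis=0))   # offset ~ prev_center * t / len(points)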
To fix the numerical instability (which arises whenever the data set is not centered on the 0 vector), I recommend replacing the S1 statistic with the true mean, not S0*mean. Both S1 and S2 can be incrementally updated at little cost using the incremental mean formula, which AFAICT was used in the original "k-means" publication by MacQueen (which actually is an online k-means, while this is Lloyd-style batch iteration). Well, for an incremental k-means you obviously need the updatable mean vector anyway... I believe the formula was also discussed by Knuth in one of his essential books. I'm surprised that Mahout does not seem to use it. It's fairly cheap (just a few extra CPU instructions and no additional data, so it all happens in the CPU cache line) and gives you extra precision when you are dealing with large data sets.
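For reference, the incremental update being recommended is the standard one (shown here as a plain Python/NumPy sketch, not Mahout code):

import numpy as np

def incremental_mean(points):
    # MacQueen/Knuth-style update: mean += (x - mean) / n
    mean = np.zeros_like(points[0], dtype=float)
    for n, x in enumerate(points, start=1):
        mean += (x - mean) / n
    return mean

data = np.random.rand(10000, 3) + 1e6   # data far from the 0 vector
print(incremental_mean(data) - data.mean(axis=0))  # error stays tiny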