Reshaping and merging simulations in Stata

I have a dataset, which consists of 1000 simulations. The output of each simulation is saved as a row of data. There are variables alpha, beta and simulationid.
Here's a sample dataset:
simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
I want to estimate a new value - let's call it new - which depends on alpha and beta as well as different levels of two other variables which we'll call risk and price. Values of risk range from 0 to 100, price from 0 to 500 in steps of 5.
What I want to achieve is a dataset that consists of values representing the probability that (across the simulations) new is greater than 0 for combinations of risk and price.
I can achieve this using the code below. However, the reshape process takes more hours than I'd like. And it seems to me to be something that could be completed a lot quicker.
So, my question is either:
i) is there an efficient way to generate multiple datasets from a single row of data without multiple reshapes, or
ii) am I going about this in totally the wrong way?
set maxvar 15000
/* Input sample data */
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
forvalues risk = 0(1)100 {
forvalues price = 0(5)500 {
gen new_r`risk'_p`price' = `price' * (`risk'/200)* beta - alpha
gen probnew_r`risk'_p`price' = 0
replace probnew_r`risk'_p`price' = 1 if new_r`risk'_p`price' > 0
sum probnew_r`risk'_p`price', mean
gen mnew_r`risk'_p`price' = r(mean)
drop new_r`risk'_p`price' probnew_r`risk'_p`price'
}
}
drop if simulationid > 1
save simresults.dta, replace
forvalues risk = 0(1)100 {
clear
use simresults.dta
reshape long mnew_r`risk'_p, i(simulationid) j(price)
keep simulation price mnew_r`risk'_p
rename mnew_r`risk'_p risk`risk'
save risk`risk'.dta, replace
}
clear
use risk0.dta
forvalues risk = 1(1)100 {
merge m:m price using risk`risk'.dta, nogen
save merged.dta, replace
}

Here's a start on your problem.
So far as I can see, you don't need more than one dataset.
The various reshapes and merges just rearrange what was first generated and that can be done within one dataset.
The code here is, in the first instance, for just one pair of values of alpha and beta. To simulate 1000 such pairs, you would need 1000 times more observations, i.e. about 10 million, which is not usually a problem, and to loop over the alphas and betas. But the loop can be tacit. We'll get to that.
This code has been run and is legal. It's limited to one alpha, beta pair.
clear
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
local N = 101 * 101                    // 101 risk values x 101 price values
set obs `N'
egen risk = seq(), block(101)          // blocks of 101: 1,...,1,2,...,2,...
replace risk = risk - 1                // now 0 to 100
egen price = seq(), from(0) to(100)    // cycles 0..100 within each risk block
replace price = 5 * price              // 0 to 500 in steps of 5
gen result = (price * (risk/200) * beta[1] - alpha[1]) > 0
bysort price risk: gen mean = sum(result)      // running sum within group
by price risk: replace mean = mean[_N]/_N      // last running sum / count = group mean
Assuming now that you first read in 1000 values, here is a sketch of how to get the whole thing. This code has not been tested. That is, your dataset starts with 1000 observations; you then enlarge it to 10 million or so, and get your results. The tricksy part is using an expression for the subscript to ensure that each block of results is for a distinct alpha, beta pair. That's not compulsory; you could do it in a loop, but then you would need to generate outside the loop and replace within it.
local N = 101 * 101 * 1000
set obs `N'
egen risk = seq(), to(101) block(101)   // to() makes the 1..101 blocks recycle for each simulation
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
egen sim = seq(), block(10201)          // 10201 = 101 * 101 observations per simulation
gen result = (price * (risk/200) * beta[ceil(_n/10201)] - alpha[ceil(_n/10201)]) > 0
bysort sim price risk: gen mean = sum(result)
by sim price risk: replace mean = mean[_N]/_N
Other devices used: egen to set up in blocks; getting the mean without repeated calls to summarize; using a true-or-false expression directly.
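As a micro-illustration of the last device: a true-or-false expression evaluates directly to 1 or 0, so the question's gen/replace pattern for the indicator collapses to one line. A tiny sketch with placeholder variables y and x:
* these two ways of building a 0/1 indicator are equivalent
gen flag = 0
replace flag = 1 if y > x
gen flag2 = (y > x)
(In both forms Stata treats missing values of y as larger than any number, so guard with & !missing(y) if that matters.)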
NB: I haven't tried to understand what you are doing, but it seems to me that the price-risk-simulation conditions define single values, so calculating a mean looks redundant. But perhaps that is there because you wish to add further detail once you have the code working.
NB2: This seems a purely deterministic calculation. Not sure that you need this code at all.
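For completeness, here is a hedged sketch of the final step the question actually wants: the probability, across simulations, that new exceeds 0 for each risk-price pair. It assumes the enlarged dataset above with variables sim, risk, price, and result; pgt0 is my name, not one from the question. Untested, like the sketch above:
* the mean of a 0/1 indicator is the probability that it equals 1
collapse (mean) pgt0 = result, by(risk price)
This leaves one observation per (risk, price) combination, which is the single dataset of probabilities the question was building via reshape and merge.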


Detect contiguous numbers - MATLAB

I coded a program that creates a bunch of binary numbers like this:
out = [0,1,1,0,1,1,1,0,0,0,1,0];
I want to check for the existence of nine 1s together in out above; for example, when we have this in our output:
out_2 = [0,0,0,0,1,1,1,1,1,1,1,1,1];
or
out_3 = [0,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,0,0,0,1,1,0];
a condition variable should be set to 1. We don't know the exact position where the ones start in the out variable; it is random. I only want to find whether such a run of ones exists in the variable above (one occurrence or more).
PS.
We are searching for a general answer that can find runs of other numbers too (not only 1, and not only for binary data; this is just an example).
You can use convolution to solve such r-contiguous detection cases.
Case #1 : To find contiguous 1s in a binary array -
check = any(conv(double(input_arr),ones(r,1))>=r)
Sample run -
input_arr =
0 0 0 0 1 1 1 1 1 1 1 1 1
r =
9
check =
1
Case #2 : For detecting any number as contiguous, you could modify it a bit, like so -
check = any(conv(double(diff(input_arr)==0),ones(1,r-1))>=r-1)
Sample run -
input_arr =
3 5 2 4 4 4 5 5 2 2
r =
3
check =
1
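The same window-sum idea also generalizes to detecting r contiguous occurrences of one particular target value, without the diff step. A minimal MATLAB sketch (the function name and its arguments are mine, not from the question):
% true if arr contains at least r contiguous copies of val
function check = has_r_contiguous(arr, val, r)
    matches = double(arr == val);                  % indicator of the target value
    check = any(conv(matches, ones(1, r)) >= r);   % a window sum of r means r in a row
end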
To save Stack Overflow from further duplicates, also feel free to look into related problems -
Fast r-contiguous matching (based on location similarities).
r-contiguous matching, MATLAB.

OneR WEKA - wrong prediction?

I am trying to make a ranking of attributes depending on their predictive power by using OneR in WEKA iteratively. At every run I remove the chosen attribute to see what the next best is.
I have done this for all my attributes and some (3 out of ten attributes) get 'ranked' higher than others, although they have less % correct prediction, a smaller ROC Area average and their rules are less compact.
As I understand, OneR just looks at the frequency tables for the attribute it has and then the class values, so it wouldn't care about whether I take attributes out or not...but I am probably missing something
Would anyone have an idea?
As an alternative you can use the OneR package (available on CRAN, more information here: OneR - Establishing a New Baseline for Machine Learning Classification Models)
With the option verbose = TRUE you get the accuracy of all attributes, e.g.:
> library(OneR)
> example(OneR)
OneR> data <- optbin(iris)
OneR> model <- OneR(data, verbose = TRUE)
Attribute Accuracy
1 * Petal.Width 96%
2 Petal.Length 95.33%
3 Sepal.Length 74.67%
4 Sepal.Width 55.33%
---
Chosen attribute due to accuracy
and ties method (if applicable): '*'
OneR> summary(model)
Rules:
If Petal.Width = (0.0976,0.791] then Species = setosa
If Petal.Width = (0.791,1.63] then Species = versicolor
If Petal.Width = (1.63,2.5] then Species = virginica
Accuracy:
144 of 150 instances classified correctly (96%)
Contingency table:
Petal.Width
Species (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
setosa * 50 0 0 50
versicolor 0 * 48 2 50
virginica 0 4 * 46 50
Sum 50 52 48 150
---
Maximum in each column: '*'
Pearson's Chi-squared test:
X-squared = 266.35, df = 4, p-value < 2.2e-16
(full disclosure: I am the author of this package and I would be very interested in the results you get)
The OneR classifier looks a bit like nearest-neighbor in this respect. In the source code of the OneR classifier, it says:
// if this attribute is the best so far, replace the rule
if (noRule || r.m_correct > m_rule.m_correct) {
    m_rule = r;
}
Thus, it should be possible (either in 1-R generally or in this implementation) for one attribute to block another until the blocking attribute is removed in your iterative process.
Say you have attributes 1, 2, and 3 with the distribution 1: 50%, 2: 30%, 3: 20%. In all cases where attribute 1 is best, attribute 3 is second best.
Thus, when attribute 1 is left out, attribute 3 wins with 70%, even though before attribute 2 ranked as "better" than 3 in the comparison of all three.
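To make that arithmetic concrete, here is a small R sketch of the scenario (all numbers are made up for illustration):
# which attribute 'wins' per case: 1 in 50%, 2 in 30%, 3 in 20% of cases
best <- sample(c(1, 2, 3), 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2))
# assume attribute 3 is the runner-up whenever attribute 1 wins
winner_without_1 <- ifelse(best == 1, 3, best)
prop.table(table(winner_without_1))   # attribute 3 now wins ~70% of cases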

arc4random() and arc4random_uniform() not really random?

I have been using arc4random() and arc4random_uniform(), and I always had the feeling that they weren't exactly random. For example, I was randomly choosing values from an array, but the values that came out were often the same when I generated them multiple times in a row. So today I thought I would use an Xcode playground to see how these functions behave. I first tested arc4random_uniform to generate a number between 0 and 4, using this algorithm:
import Cocoa

var number = 0
for _ in 1...20 {
    number = Int(arc4random_uniform(5))
}
I ran it several times, and here is how the values evolve most of the time (the playground value-history screenshots are not reproduced here):
As you can see, the values increase and decrease repeatedly, and once they are at the maximum/minimum they often stay there for some time (in the first screenshot, at the 5th step, the value stays at 3 for 6 steps). The problem is that this isn't unusual at all; the function actually behaves that way most of the time in my tests.
Now, if we look at arc4random(), it's basically the same.
So here are my questions:
Why is this function behaving in this way?
How can I make it more random?
Thank you.
EDIT:
Finally, I made two experiments whose results surprised me, the first one with a real die:
What surprised me is that I wouldn't have said it was random, since I was seeing the same sort of patterns that I had described as non-random for arc4random() & arc4random_uniform(). So, as Jean-Baptiste Yunès pointed out, humans aren't good at judging whether a sequence of numbers is really random.
I also wanted to do a more "scientific" experiment, so I made this algorithm:
import Foundation

var appeared = [0,0,0,0,0,0,0,0,0,0,0]
let numberOfGenerations = 1000
for _ in 1...numberOfGenerations {
    let randomNumber = Int(arc4random_uniform(11))
    appeared[randomNumber] += 1
}
for (number, numberOfTimes) in appeared.enumerated() {
    // report each count and its share of all draws as a percentage
    print("\(number) appeared \(numberOfTimes) times (\(100.0 * Double(numberOfTimes) / Double(numberOfGenerations))%)")
}
This counts how many times each number appeared, and indeed the numbers are randomly generated. For example, here is one output from the console:
0 appeared 99 times.
1 appeared 97 times.
2 appeared 78 times.
3 appeared 80 times.
4 appeared 87 times.
5 appeared 107 times.
6 appeared 86 times.
7 appeared 97 times.
8 appeared 100 times.
9 appeared 91 times.
10 appeared 78 times.
So it's definitely OK 😊
EDIT #2: I ran the dice experiment again with more rolls, and it's still just as surprising to me:
A truly random sequence of numbers cannot be generated by an algorithm; algorithms can only produce pseudo-random sequences of numbers (something that looks like a random sequence). So depending on the algorithm chosen, the quality of the "randomness" may vary. The randomness of arc4random() sequences is generally considered good.
You cannot analyze the randomness of a sequence visually... Humans are very bad at detecting randomness! They tend to find structure where there is none. Nothing is really wrong in your diagrams (except the rare run of six 3s in a row, but that is randomness: sometimes unusual things happen). You would be surprised if you had used a die to generate a sequence and drawn its graph. Beware that a sample of only 20 numbers cannot seriously be analyzed for randomness; you need much bigger samples.
If you need some other kind of randomness, you can try the /dev/random pseudo-file, which generates random bytes each time you read from it. The sequence is generated by a mix of algorithms and external physical events that happen in your computer.
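For instance, here is a minimal Swift sketch of reading from /dev/random on macOS (error handling omitted; the trailing modulo is only for illustration and introduces the small bias that arc4random_uniform is designed to avoid):
import Foundation

if let fh = FileHandle(forReadingAtPath: "/dev/random") {
    let bytes = fh.readData(ofLength: 4)   // 4 random bytes
    // assemble a UInt32 by hand to avoid alignment issues
    let value = bytes.reduce(UInt32(0)) { ($0 << 8) | UInt32($1) }
    print(value % 5)   // a value in 0...4, with a tiny modulo bias
}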
It depends on what you mean when you say random.
As stated in the comments, true randomness is clumpy. Long strings of repeats or close values are expected.
If this doesn't fit your requirement, then you need to better define your requirement.
Other options include using a shuffle algorithm to disorder things in an array, or using a low-discrepancy sequence to give a more even distribution of values.
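For example, a minimal Fisher-Yates shuffle sketch in Swift, written against arc4random_uniform since that is what the question uses (recent Swift also has a built-in shuffled()):
// return a shuffled copy of the array
func shuffledCopy<T>(_ xs: [T]) -> [T] {
    var a = xs
    for i in stride(from: a.count - 1, to: 0, by: -1) {
        let j = Int(arc4random_uniform(UInt32(i + 1)))   // uniform index in 0...i
        a.swapAt(i, j)
    }
    return a
}
A shuffle guarantees each value appears exactly once per pass, which removes repeats at the cost of the draws no longer being independent.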
I don’t really agree with the idea that humans are very bad at detecting randomness.
Would you be satisfied if you obtained 1-1-2-2-3-3-4-4-5-5-6-6 after throwing six pairs of dice? Yet the dice frequencies are perfect…
This is exactly the problem I’m encountering with the arc4random and arc4random_uniform functions.
I’ve been developing a backgammon application for many years; it is based on a neural network trained by world-champion players. I DO know that it plays much better than anyone, but many users think it is cheating. I sometimes have doubts too, so I’ve decided to throw all the dice myself…
I’m not at all satisfied with arc4random, even if the frequencies are OK.
I always throw a pair of dice, and the results lead to unacceptable situations, for example: five consecutive doubles for the same player, or waiting 12 turns (24 dice) until the first 6 occurs.
It is easy to test (Objective-C code):
void randomDices ( int * dice1, int * dice2, int player )
{
    // arc4random_uniform(6) returns 0...5; add 1 for die faces 1...6
    ( * dice1 ) = arc4random_uniform ( 6 ) + 1 ;
    ( * dice2 ) = arc4random_uniform ( 6 ) + 1 ;

    // Add to your statistics
    [self didRandomDice1:( * dice1 ) dice2:( * dice2 ) forPlayer:player] ;
}
Maybe arc4random doesn’t like to be called twice during a short time…
So I’ve tried several solutions and finally chose this code, which runs a second level of randomization after arc4random_uniform:
int CFRandomDice ()
{
    int __result = -1 ;
    BOOL __found = NO ;
    while ( ! __found )
    {
        // random int big enough but not too big
        int __bigint = arc4random_uniform ( 10000 ) ;
        // Searching for the first character between '1' and '6'
        // in the string version of bigint :
        NSString * __bigString = @( __bigint ).stringValue ;
        NSInteger __nbcar = __bigString.length ;
        NSInteger __i = 0 ;
        while ( ( __i < __nbcar ) && ( ! __found ) )
        {
            unichar __ch = [__bigString characterAtIndex:__i] ;
            if ( ( __ch >= '1' ) && ( __ch <= '6' ) )
            {
                __found = YES ;
                __result = __ch - '1' + 1 ;
            }
            else
            {
                __i++ ;
            }
        }
    }
    return ( __result ) ;
}
This code creates a random number with arc4random_uniform ( 10000 ), converts it to a string, and then searches for the first digit between ‘1’ and ‘6’ in the string.
This appeared to me to be a very good way to randomize the dice because:
1/ the frequencies are OK (see the statistics below);
2/ exceptional dice sequences occur at exceptional times.
10000 dice test:
----------
Game Stats
----------
HIM :
Total 1 = 3297
Total 2 = 3378
Total 3 = 3303
Total 4 = 3365
Total 5 = 3386
Total 6 = 3271
----------
ME :
Total 1 = 3316
Total 2 = 3289
Total 3 = 3282
Total 4 = 3467
Total 5 = 3236
Total 6 = 3410
----------
HIM doubles = 1623
ME doubles = 1648
Now I’m sure that players won’t complain…

Using SUM and UNIQUE to count occurrences of value within subset of a matrix

So, presume a matrix like so:
20 2
20 2
30 2
30 1
40 1
40 1
I want to count the number of times 1 occurs for each unique value of column 1. I could do this the long way with sum(x(1:2,2)==1) for each value, but I think this would be the perfect use for the unique function. How could I use it to get an output like this:
20 0
30 1
40 2
Sorry if the solution seems obvious, my grasp of loops is very poor.
Indeed unique is a good option:
u = unique(x(:,1))
res = arrayfun(@(y) length(x(x(:,1)==y & x(:,2)==1)), u)
Taking apart that last line:
arrayfun(fun, array) applies fun to each element in the array and puts the results in a new array, which it returns.
Here fun is the anonymous function @(y) length(x(x(:,1)==y & x(:,2)==1)), which finds the length of the portion of x where the condition x(:,1)==y & x(:,2)==1 holds (this is called logical indexing). So for each of the unique elements, it counts the rows of x where the first column is that unique element and the second column is one.
Try this (as specified in this answer):
>> [c,~,d] = unique(a(a(:,2)==1))
c =
    30
    40
d =
    1
    2
    2
>> counts = accumarray(d(:), 1, [], @sum)
counts =
    1
    2
>> res = [c, counts]
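If you also want the zero counts (the 20 0 row from the desired output), one sketch combining unique and accumarray over the full matrix is:
% counts of x(:,2)==1 per unique value of x(:,1), zeros included
[u, ~, g] = unique(x(:,1));                    % g maps each row to its group
counts = accumarray(g, double(x(:,2) == 1));   % sum the indicator per group
res = [u, counts]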
Suppose you have an array of various integers in array. The tabulate function will sort the unique values and count the occurrences:
table = tabulate(array)
Look for your counts in column 2 of table.

Change the class of columns in a data frame

First of all, excuse me if I make any mistakes, but English is not a language I use very often.
I have a data frame with numbers. A small part of the data frame is this:
nominal ordinal
      2       2
      2       1
      2       1
      2       2
So, I want to use the Gower distance function on these numbers.
Here ( http://rgm2.lab.nig.ac.jp/RGM2/R_man-2.9.0/library/StatMatch/man/gower.dist.html ) it says that in order to use gower.dist, all nominal variables must be of class "factor" and all ordinal variables of class "ordered".
By default, all the columns are of class "integer" and mode "numeric". In order to change the class of the columns, I use these commands:
DF = read.table("clipboard", header=TRUE, sep="\t")
# I select all the cells and copy them to the clipboard.
# Then R, with this command, reads the data from there.
MyHeader = names(DF)  # save the headers of the data frame to a temporary vector
for (i in 1:length(DF)) {
    if (MyHeader[[i]] == "nominal") DF[[i]] = as.factor(DF[[i]])
}
for (i in 1:length(DF)) {
    if (MyHeader[[i]] == "ordinal") DF[[i]] = as.ordered(DF[[i]])
}
The first for/if loop changes the class from integer to factor, which is what I want, but the second changes the class of ordinal variables to: "ordered" "factor".
I need to change all the columns with the header "ordinal" to "ordered", as the gower.dist function says.
Thanks in advance,
B.T.
What you are doing is fine --- if perhaps a little inelegant.
With your ordered factor, you have something like:
> foo <- as.ordered(1:10)
> foo
[1] 1 2 3 4 5 6 7 8 9 10
Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 10
> class(foo)
[1] "ordered" "factor"
Notice that it has two classes, indicating that it is an ordered factor and that it is a factor:
> is.ordered(as.ordered(1:10))
[1] TRUE
> is.factor(as.ordered(1:10))
[1] TRUE
In some senses, you might like to think of foo as an ordered factor that also inherits from the factor class. Alternatively, if there isn't a specific method that handles ordered factors but there is a method for factors, R will use the factor method. As far as R is concerned, an ordered factor is an object with classes "ordered" and "factor", and this is what your function for Gower's distance will require.
You could easily do this with:
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
which gives you a data frame with the correct structure. If you work with data frames, please stay away from [[]] unless you know very well what you're doing. Take Dirk's advice, and check Owen's R Guide as well. You definitely need it.
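If several columns share the same header, a compact alternative to the two loops is to convert them all at once. A sketch, assuming the column names are literally "nominal" and "ordinal" (note that read.table may have made duplicate names unique, e.g. nominal.1):
# convert every matching column in one statement each
DF[names(DF) == "nominal"] <- lapply(DF[names(DF) == "nominal"], as.factor)
DF[names(DF) == "ordinal"] <- lapply(DF[names(DF) == "ordinal"], as.ordered)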
If I do the conversion as shown above, gower.dist() works perfectly fine. As a side note, Gower's distance can easily be calculated using the daisy() function as well:
DF <- data.frame(
    ordinal = c(1, 2, 3, 1, 2, 1),
    nominal = c(2, 2, 2, 2, 2, 2)
)
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
library(cluster)
daisy(DF, metric = "gower")
library(StatMatch)
gower.dist(DF)