Change the class of columns in a data frame

First of all, excuse me if I make any mistakes; English is not a language I use very often.
I have a data frame with numbers. A small part of the data frame is this:
nominal  ordinal
2        2
2        1
2        1
2        2
So, I want to use the Gower distance function on these numbers.
The documentation here ( http://rgm2.lab.nig.ac.jp/RGM2/R_man-2.9.0/library/StatMatch/man/gower.dist.html ) says that in order to use gower.dist, all nominal variables must be of class "factor" and all ordinal variables of class "ordered".
By default, all the columns are of class "integer" and mode "numeric". In order to change the class of the columns, I use these commands:
DF=read.table("clipboard",header=TRUE,sep="\t")
# I select all the cells and I copy them to the clipboard.
#Then R, with this command, reads the data from there.
MyHeader=names(DF) # I save the headers of the data frame to a temp matrix
for (i in 1:length(DF)) {
  if (MyHeader[[i]] == "nominal") DF[[i]] = as.factor(DF[[i]])
}
for (i in 1:length(DF)) {
  if (MyHeader[[i]] == "ordinal") DF[[i]] = as.ordered(DF[[i]])
}
The first for/if loop changes the class from integer to factor, which is what I want, but the second changes the class of ordinal variables to: "ordered" "factor".
I need to change all the columns with the header "ordinal" to "ordered", as the gower.dist function says.
Thanks in advance,
B.T.

What you are doing is fine, if perhaps a little inelegant.
With your ordered factor, you have something like:
> foo <- as.ordered(1:10)
> foo
[1] 1 2 3 4 5 6 7 8 9 10
Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 10
> class(foo)
[1] "ordered" "factor"
Notice that it has two classes, indicating that it is an ordered factor and that it is also a factor:
> is.ordered(as.ordered(1:10))
[1] TRUE
> is.factor(as.ordered(1:10))
[1] TRUE
In some sense, you might like to think of foo as an ordered factor that also inherits from the factor class. In practice, if there isn't a specific method that handles ordered factors but there is a method for factors, R will use the factor method. As far as R is concerned, an ordered factor is an object with classes "ordered" and "factor", and this is exactly what your function for Gower's distance requires.
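For instance, a minimal sketch of that inheritance, checked at the console:
> foo <- as.ordered(1:3)
> inherits(foo, "ordered")
[1] TRUE
> inherits(foo, "factor")
[1] TRUE
Because "factor" is one of foo's classes, anything written only for plain factors will still accept the column.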

You could easily do this with:
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
which gives you a data frame with the correct structure. If you work with data frames, stay away from [[]] unless you know very well what you're doing. Take Dirk's advice, and check Owen's R Guide as well; you definitely need it.
If I do the conversion as shown above, gower.dist() works perfectly fine. On a side note, Gower's distance can easily be calculated using the daisy() function as well:
DF <- data.frame(
  ordinal = c(1, 2, 3, 1, 2, 1),
  nominal = c(2, 2, 2, 2, 2, 2)
)
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
library(cluster)
daisy(DF,metric="gower")
library(StatMatch)
gower.dist(DF)

Related

Multiple Knapsacks with Fungible Items

I am using cp_model to solve a problem very similar to the multiple-knapsack problem (https://developers.google.com/optimization/bin/multiple_knapsack). Just like in the example code, I use some boolean variables to encode membership:
# Variables
# x[i, j] = 1 if item i is packed in bin j.
x = {}
for i in data['items']:
    for j in data['bins']:
        x[(i, j)] = solver.IntVar(0, 1, 'x_%i_%i' % (i, j))
What is specific to my problem is that there are a large number of fungible items. There may be 5 items of type 1 and 10 items of type 2. Any item is exchangeable with items of the same type. Using the Boolean variables to encode the problem implicitly assumes that the order of the assignment for items of the same type matters. But in fact the order does not matter and only takes up unnecessary computation time.
I am wondering if there is any way to design the model so that it accurately expresses that we are allocating from fungible pools of items to save computation.
Instead of creating 5 Boolean variables for 5 items of type i in bin b, just create an integer variable count from 0 to 5 for items of type i in bin b. Then add the constraint that the sum over b of count[i][b] equals the number of available items of type i, as in the sketch below.
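A minimal CP-SAT sketch of that idea (the item types, weights and capacities below are made-up placeholders; only the count-variable pattern matters):

from ortools.sat.python import cp_model

# Hypothetical data: pool size per item type, per-item weight, bin capacities.
num_items = {'type1': 5, 'type2': 10}
weight    = {'type1': 4, 'type2': 7}
capacity  = [20, 25, 30]

model = cp_model.CpModel()

# count[t, b] = how many items of type t go into bin b
# (one integer variable per (type, bin) instead of one Boolean per item).
count = {}
for t in num_items:
    for b in range(len(capacity)):
        count[t, b] = model.NewIntVar(0, num_items[t], f'count_{t}_{b}')

# Never allocate more items of a type than its pool contains.
for t in num_items:
    model.Add(sum(count[t, b] for b in range(len(capacity))) <= num_items[t])

# Respect each bin's capacity.
for b in range(len(capacity)):
    model.Add(sum(weight[t] * count[t, b] for t in num_items) <= capacity[b])

# One possible objective: maximise the total packed weight.
model.Maximize(sum(weight[t] * count[t, b]
                   for t in num_items for b in range(len(capacity))))

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for (t, b), var in count.items():
        print(t, 'in bin', b, '->', solver.Value(var))

Because one integer variable stands in for a whole block of interchangeable Booleans, the solver no longer wastes time exploring permutations of identical items.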

Postgres: How to increment the index (pointer) to access other rows

I have been trying to understand how to increment the reference to some value.
In C I would simply increment the pointer to retrieve a value in the next array location.
How does this mechanism work in Postgres? Is it possible?
For an example, I have created a table with some data in:
create table mathtest (
    x int,
    y int,
    val int
);

insert into mathtest (x, y, val)
values (1,1,10), (2,2,20), (3,3,30), (4,4,40), (5,5,50), (6,6,60),
       (7,7,70), (8,8,80), (9,9,90), (10,10,100), (11,11,110);
What I want to do is add the val value from the current row to the val value from the row where x equals the current x plus 2, and then the row where x equals the current x plus 4. I realise that I can't assume the rows will be retrieved in a set order, so I can't use 'lead'.
If it was C I would simply increment the pointer.
The output only needs to include rows where x modulo one divisor and y modulo another are both 0 (this bit works):
select
    x base,
    (x + 2) plus1x,
    (x + 4) plus2x,
    y,
    val
from mathtest
where x % 2 = 0 and y % 3 = 0;
This outputs the following:
base  plus1x  plus2x  y  val
   6       8      10  6   60
The output I would like is:
60 + 80 +100 = 240
I can't conceptualise how to do it. My mind seems to be stuck in procedural C mode!
Whatever I type and try gives an error.
Can anybody help me get over this hurdle?
Welcome to the world of window functions.
You need an explicit ordering, otherwise it makes no sense to speak of the "previous row".
As a simple example, to get the difference to the previous value, you can query like
SELECT val -
lag(val) OVER (ORDER BY x)
FROM mathtest;
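For the sum you actually describe, one possible query is sketched below. It assumes x always increases in steps of 1, so the row two positions ahead holds x+2 and the row four positions ahead holds x+4; window functions cannot appear in WHERE, hence the subquery:
SELECT base, val + next1 + next2 AS total
FROM (
    SELECT x AS base, y, val,
           lead(val, 2) OVER (ORDER BY x) AS next1,
           lead(val, 4) OVER (ORDER BY x) AS next2
    FROM mathtest
) t
WHERE base % 2 = 0 AND y % 3 = 0;
With your sample data this yields 60 + 80 + 100 = 240.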

Reshaping and merging simulations in Stata

I have a dataset, which consists of 1000 simulations. The output of each simulation is saved as a row of data. There are variables alpha, beta and simulationid.
Here's a sample dataset:
simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
I want to estimate a new value - let's call it new - which depends on alpha and beta as well as different levels of two other variables which we'll call risk and price. Values of risk range from 0 to 100, price from 0 to 500 in steps of 5.
What I want to achieve is a dataset that consists of values representing the probability that (across the simulations) new is greater than 0 for combinations of risk and price.
I can achieve this using the code below. However, the reshape process takes more hours than I'd like. And it seems to me to be something that could be completed a lot quicker.
So, my question is either:
i) is there an efficient way to generate multiple datasets from a single row of data without multiple reshapes, or
ii) am I going about this in totally the wrong way?
set maxvar 15000
/* Input sample data */
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
forvalues risk = 0(1)100 {
    forvalues price = 0(5)500 {
        gen new_r`risk'_p`price' = `price' * (`risk'/200) * beta - alpha
        gen probnew_r`risk'_p`price' = 0
        replace probnew_r`risk'_p`price' = 1 if new_r`risk'_p`price' > 0
        sum probnew_r`risk'_p`price', mean
        gen mnew_r`risk'_p`price' = r(mean)
        drop new_r`risk'_p`price' probnew_r`risk'_p`price'
    }
}
drop if simulationid > 1
save simresults.dta, replace
forvalues risk = 0(1)100 {
    clear
    use simresults.dta
    reshape long mnew_r`risk'_p, i(simulationid) j(price)
    keep simulationid price mnew_r`risk'_p
    rename mnew_r`risk'_p risk`risk'
    save risk`risk'.dta, replace
}
clear
use risk0.dta
forvalues risk = 1(1)100 {
    merge m:m price using risk`risk'.dta, nogen
    save merged.dta, replace
}
Here's a start on your problem.
So far as I can see, you don't need more than one dataset.
The various reshapes and merges just rearrange what was first generated and that can be done within one dataset.
The code here in the first instance is for just one pair of values of alpha and beta. To simulate 1000 such pairs, you would need 1000 times more observations, i.e. about 10 million, which is not usually a problem, and you would need to loop over the alphas and betas. But the loop can be tacit. We'll get to that.
This code has been run and is legal. It's limited to one alpha, beta pair.
clear
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
local N = 101 * 101
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
gen result = (price * (risk/200)* beta[1] - alpha[1]) > 0
bysort price risk: gen mean = sum(result)
by price risk: replace mean = mean[_N]/_N
Assuming now that you first read in 1000 values, here is a sketch of how to get the whole thing. This code has not been tested. That is, your dataset starts with 1000 observations; you then enlarge it to 10 million or so, and get your results. The tricksy part is using an expression for the subscript to ensure that each block of results is for a distinct alpha, beta pair. That's not compulsory; you could do it in a loop, but then you would need to generate outside the loop and replace within it.
local N = 101 * 101 * 1000
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
egen sim = seq(), block(10201)
gen result = (price * (risk/200)* beta[ceil(_n/10201)] - alpha[ceil(_n/10201)]) > 0
bysort sim price risk: gen mean = sum(result)
by sim price risk: replace mean = mean[_N]/_N
Other devices used: egen to set up in blocks; getting the mean without repeated calls to summarize; using a true-or-false expression directly.
NB: I haven't tried to understand what you are doing, but it seems to me that the price-risk-simulation conditions define single values, so calculating a mean looks redundant. But perhaps that is in the code because you wish to add further detail to the code once you have it working.
NB2: This seems a purely deterministic calculation. Not sure that you need this code at all.

How to create a unique integer from 3 different integers (1 Oracle LONG, 1 date field, 1 short)

The thing is that the 1st number is already an Oracle LONG,
the second one a date (SQL DATE, no extra timestamp info), and the last one a short value in the range 1000-100,000.
How can I create a sort of hash value that will be unique for each combination, optimally?
String concatenation and converting to a long later is not what I want, because, for example:
Day Month
12 1 --> 121
1 12 --> 121
When you have a few numeric values and need a single "unique" value out of them (that is, one where duplicates are statistically improbable), you can usually use a formula like:
h = (a*P1 + b)*P2 + c
where P1 and P2 are either well-chosen numbers (e.g. if you know 'a' is always in the 1-31 range, you can use P1 = 32) or, when you know nothing particular about the allowable ranges of a, b and c, big prime numbers (they have the least chance of generating values that collide).
For an optimal solution the math is a bit more complex than that, but using prime numbers you can usually have a decent solution.
For example, Java implementation for .hashCode() for an array (or a String) is something like:
h = 0;
for (int i = 0; i < a.length; ++i)
    h = h * 31 + a[i];
Personally, though, I would have chosen a prime bigger than 31, as values inside a String can easily collide, since a delta of 31 between character values is quite common, e.g.:
"BB".hashCode() == "Aa".hashCode() == 2112
Your
12 1 --> 121
1 12 --> 121
problem is easily fixed by zero-padding your input numbers to the maximum width expected for each input field.
For example, if the first field can range from 0 to 10000 and the second field can range from 0 to 100, your example becomes:
00012 001 --> 00012001
00001 012 --> 00001012
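A tiny Python sketch of that padding, assuming maximum widths of 5 and 3 digits:
# Zero-pad each field to its maximum width before concatenating,
# so (12, 1) and (1, 12) can no longer collide.
a, b = 12, 1
print(int(f"{a:05d}{b:03d}"))   # 12001
a, b = 1, 12
print(int(f"{a:05d}{b:03d}"))   # 1012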
In Python, you can use this:
# pip install pairing
import pairing as pf

n = [12, 6, 20, 19]
print(n)
key = pf.pair(pf.pair(n[0], n[1]),
              pf.pair(n[2], n[3]))
print(key)
m = [pf.depair(pf.depair(key)[0]),
     pf.depair(pf.depair(key)[1])]
print(m)
Output is:
[12, 6, 20, 19]
477575
[(12, 6), (20, 19)]

Generate a hash sum for several integers

I am facing the problem of having several integers, and I have to generate one using them. For example.
Int 1: 14
Int 2: 4
Int 3: 8
Int 4: 4
Hash Sum: 43
I have some restrictions on the values: the maximum value an attribute can have is 30, the sum of all of them is always 30, and the attributes are always positive.
The key is that I want to generate the same hash sum for similar integers; for example, if I have the integers 14, 4, 10, 2, then I want to get the same hash sum as in the case above, 43. But of course if the integers are very different (4, 4, 2, 20) then I should get a different hash sum. It also needs to be fast.
Ideally I would like the output of the hash sum to be between 0 and 512, and it should be evenly distributed. With my restrictions I can have around 5K different possibilities, so what I would like to have is around 10 per bucket.
I am sure there are many algorithms that do this, but I could not find a way of googling this thing. Can anyone please post an algorithm to do this?
Some more information
The whole thing with this is that those integers are attributes for a function. I want to store the values of the function in a table, but I do not have enough memory to store all the different options. That is why I want to generalize between similar attributes.
The reason why 10, 5, 15 is totally different from 5, 10, 15 is that, if you imagine this in 3D, the two are completely different points.
Some more information 2
Some answers try to solve the problem using hashing, but I do not think this is so complex. Thanks to one of the comments I have realized that this is a clustering problem. If we have only 3 attributes and we imagine the problem in 3D, what I need is just to divide the space into blocks.
In fact this can be solved with rules of this type
if (att[0] < 5 && att[1] < 5 && att[2] < 5 && att[3] < 5)
Block = 21
if ( (5 < att[0] < 10) && (5 < att[1] < 10) && (5 < att[2] < 10) && (5 < att[3] < 10))
Block = 45
The problem is that I need a fast and general way to generate those ifs; I cannot write out all the possibilities.
The simple solution:
Convert the integers to strings separated by commas, and hash the resulting string using a common hashing algorithm (md5, sha, etc).
If you really want to roll-your-own, I would do something like:
Generate large prime P
Generate random numbers 0 < a[i] < P (for each dimension you have)
To generate hash, calculate: sum(a[i] * x[i]) mod P
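A rough Python sketch of those three steps (the prime and the multipliers here are arbitrary, assumed choices, fixed once and reused for every hash):
import random

P = 2_147_483_647                              # a large prime (2**31 - 1)
rng = random.Random(42)
a = [rng.randrange(1, P) for _ in range(4)]    # one multiplier per dimension

def hash_attrs(x):
    # sum(a[i] * x[i]) mod P
    return sum(ai * xi for ai, xi in zip(a, x)) % P

print(hash_attrs([14, 4, 8, 4]))
Note that, like any good general-purpose hash, this spreads values uniformly rather than keeping similar inputs together, which is what the later answers address.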
Given the inputs a, b, c, and d, each ranging in value from 0 to 30 (5 bits), the following will produce a number in the range of 0 to 255 (8 bits).
bucket = ((a & 0x18) << 3) | ((b & 0x18) << 1) | ((c & 0x18) >> 1) | ((d & 0x18) >> 3)
Whether the general approach is appropriate depends on how the question is interpreted. The 3 least significant bits are dropped, grouping 0-7 in the same set, 8-15 in the next, and so forth.
0-7,0-7,0-7,0-7 -> bucket 0
0-7,0-7,0-7,8-15 -> bucket 1
0-7,0-7,0-7,16-23 -> bucket 2
...
24-30,24-30,24-30,24-30 -> bucket 255
Trivially tested with:
for (int a = 0; a <= 30; a++)
    for (int b = 0; b <= 30; b++)
        for (int c = 0; c <= 30; c++)
            for (int d = 0; d <= 30; d++) {
                int bucket = ((a & 0x18) << 3) |
                             ((b & 0x18) << 1) |
                             ((c & 0x18) >> 1) |
                             ((d & 0x18) >> 3);
                printf("%d, %d, %d, %d -> %d\n",
                       a, b, c, d, bucket);
            }
You want a hash function that depends on the order of inputs and where similar sets of numbers will generate the same hash? That is, you want 50 5 5 10 and 5 5 10 50 to generate different values, but you want 52 7 4 12 to generate the same hash as 50 5 5 10? A simple way to do something like this is:
long hash = 13;
for (int i = 0; i < array.length; i++) {
    hash = hash * 37 + array[i] / 5;
}
This is imperfect, but should give you an idea of one way to implement what you want. It will treat the values 50 - 54 as the same value, but it will treat 49 and 50 as different values.
If you want the hash to be independent of the order of the inputs (so the hash of 5 10 20 and 20 10 5 are the same) then one way to do this is to sort the array of integers into ascending order before applying the hash. Another way would be to replace
hash = hash * 37 + array[i] / 5;
with
hash += array[i] / 5;
EDIT: Taking into account your comments in response to this answer, it sounds like my attempt above may serve your needs well enough. It won't be ideal, nor perfect. If you need high performance you have some research and experimentation to do.
To summarize, order is important, so 5 10 20 differs from 20 10 5. Also, you would ideally store each "vector" separately in your hash table, but to handle space limitations you want to store some groups of values in one table entry.
An ideal hash function would return a number evenly spread across the possible values based on your table size. Doing this right depends on the expected size of your table and on the number of and expected maximum value of the input vector values. If you can have negative values as "coordinate" values then this may affect how you compute your hash. If, given your range of input values and the hash function chosen, your maximum hash value is less than your hash table size, then you need to change the hash function to generate a larger hash value.
You might want to try using vectors to describe each number set as the hash value.
EDIT:
Since you don't describe why you want to avoid running the function itself, I'm guessing it's long-running, and you haven't described the breadth of the argument set.
If every value is expected then a full lookup table in a database might be faster.
If you're expecting repeated calls with the same arguments and little overall variation, then you could look at memoizing so only the first run for a argument set is expensive, and each additional request is fast, with less memory usage.
You would need to define what you mean by "similar". Hashes are generally designed to create unique results from unique input.
One approach would be to normalize your input and then generate a hash from the results.
Generating the same hash sum for different inputs is called a collision, and it is normally a bad thing for a hash to have; it makes the hash less useful.
If you want similar values to give the same output, you can divide the input by however close you want them to count. If the order makes a difference, use a different divisor for each number. The following function does what you describe:
int SqueezedSum(int a, int b, int c, int d)
{
    return (a/11) + (b/7) + (c/5) + (d/3);
}
This is not a hash, but does what you describe.
You want to look into geometric hashing. In "standard" hashing you want:
1. a short key
2. inverse resistance
3. collision resistance
With geometric hashing you substitute number 3 with something which is almost the opposite: close initial values give close hash values.
Another way to view my problem is via multidimensional scaling (MDS). In MDS we start with a matrix of dissimilarities between items, and what we want is to assign each item a location in an N-dimensional space, reducing in this way the number of dimensions.
http://en.wikipedia.org/wiki/Multidimensional_scaling