Let's say I have a case class MyCaseClass with two fields in the constructor, and a sequence of values of this case class, sequence.
How do I unzip sequence?
If the fields are a and b then I'd just write
(sequence map (_.a), sequence map (_.b))
OK, you traverse sequence twice, but list traversal is so cheap, I'd wager that this is quicker than using Option.get.
edit: After Rex's comment, couldn't resist running a benchmark myself; results below...
Times in ms for 100 traversals of a 10000-element collection; L = List, A = Array, V = Vector:
// Java 6 // Java 7
sequence.unzip{case MyCaseClass(a,b) => (a,b)} //L 173 A 101 V 87 //L 27 A 29 V 21
sequence.unzip{MyCaseClass.unapply(_).get} //L 194 A 116 V 100 //L 35 A 32 V 25
(sequence map (_.a), sequence map (_.b)) //L 177 A 70 V 86 //L 34 A 20 V 23
Your results may vary, according to CPU, memory, JRE version, collection size, phase of the moon etc.
Case classes don't extend Product2, Product3 etc., so a simple unzip doesn't work.
This does:
sequence.unzip { MyCaseClass.unapply(_).get }
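For reference, here is a minimal self-contained sketch of both forms; the Int/String field types are just assumptions for illustration:

case class MyCaseClass(a: Int, b: String)

val sequence = List(MyCaseClass(1, "x"), MyCaseClass(2, "y"))

// Both forms yield (List(1, 2), List(x, y))
val viaPattern = sequence.unzip { case MyCaseClass(a, b) => (a, b) }
val viaUnapply = sequence.unzip { MyCaseClass.unapply(_).get }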
I am working with TraMineR and I am new to R and TraMineR.
I made a typology of a life course dataset with TraMineR and the cluster library in R
(I used this guide: http://traminer.unige.ch/preview-typology.shtml).
Now I have different cases sorted into different types (4 types in total).
I want to do a deeper analysis of a certain type, but I need to know which cases (I have case numbers) belong to which type.
Is it possible to write the type a case is sorted into back into the dataset itself, as a new variable? Or is there another way?
In the example of the referenced guide, the Type is obtained as follows, using optimal matching distances with substitution costs based on transition probabilities:
library(TraMineR)
data(mvad)
mvad.seq <- seqdef(mvad, 17:86)
dist.om1 <- seqdist(mvad.seq, method = "OM", indel = 1, sm = "TRATE")
library(cluster)
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
cl1.4 <- cutree(clusterward1, k = 4)
cl1.4 is a vector with the cluster membership of the sequences, in the order corresponding to the mvad dataset. (It could be convenient to transform it into a factor.) Therefore, you can simply add this variable as an additional column to the dataset. For instance, if we want to name this new column Type:
mvad$Type <- cl1.4
tail(mvad[,c("id","Type")]) ## id and Type of the last 6 sequences
## id Type
## 707 707 3
## 708 708 3
## 709 709 4
## 710 710 2
## 711 711 1
## 712 712 4
I am learning how to use Spark and Scala and I am trying to write a Scala Spark program that receives an input of string values such as:
12 13
13 14
13 12
15 16
16 17
17 16
I initially create my pair RDD with:
val myRdd = sc.textFile(args(0)).map(line => (line.split("\\s+")(0), line.split("\\s+")(1))).distinct()
Now this is where I am getting stuck. In the set of values there are instances like (12,13) and (13,12). In the context of the data these two are the same instance. Simply put, (a,b) = (b,a).
I need to create an RDD that has one or the other, but not both. So the result, once this is done, would look something like this:
12 13
13 14
15 16
16 17
The only way I can see to do it right now is to take one tuple and compare it with the rest of the RDD to make sure it isn't the same data, just swapped.
The numbers just need to be sorted before creating a tuple.
val myRdd = sc.textFile(args(0))
  .map(line => {
    val nums = line.split("\\s+").sorted  // sort the pair so (13,12) and (12,13) become the same tuple
    (nums(0), nums(1))
  })
  .distinct
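A quick sanity check on the sample pairs, using a local collection instead of the text file just for illustration:

val sample = sc.parallelize(Seq("12 13", "13 14", "13 12", "15 16", "16 17", "17 16"))
sample.map { line =>
  val nums = line.split("\\s+").sorted
  (nums(0), nums(1))
}.distinct.collect
// contains (12,13), (13,14), (15,16), (16,17), in no particular order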
If I have a table of prices
t:([]date:2018.01.01+til 30;px:100+sums 30?(-1;1))
date px
2018.01.01 101
2018.01.02 102
2018.01.03 103
2018.01.04 102
2018.01.05 103
2018.01.06 102
2018.01.07 103
...
how do I compute the returns over n days? I am interested in computing both
(px[i] - px[i-n])/px[i-n] and (px[date] - px[date-n])/px[date-n], i.e. one where the column px is shifted n slots by index, and one where the previous price is the price at date-n.
Thanks for the help
Well you've pretty much got it right with the first one. To get the returns you can use this lambda:
{update return1:(px-px[i-x])%px[i-x] from t}[5]
For the date shift you can use an aj like this:
select date,return2:(px-pr)%pr from aj[`date;t;select date,pr:px from update date:date+5 from t]
Basically what you are trying to do here is shift the date by the number of days you want and then extract the price. The aj creates a table which will look something like this:
q)aj[`date;t;select date,pr:px from update date:date+5 from t]
date px pr
----------------
2018.01.01 99 98
2018.01.02 98 97
2018.01.03 97 98
Where px is the price on that date and pr is the price from 5 days earlier (the right-hand table's dates were shifted forward by 5, so each date picks up the price from 5 days before).
Then the return is calculated just the normal way.
Hope this helps!
I have a dataset that contains DocID, WordID and frequency (count) as shown below.
Note that the first three numbers represent 1. the number of documents, 2. the number of words in the vocabulary, and 3. the total number of words in the collection.
189
1430
12300
1 2 1
1 39 1
1 42 3
1 77 1
1 95 1
1 96 1
2 105 1
2 108 1
3 133 3
What I want to do is read the data (ignoring the first three lines), combine the words per document, and finally represent each document as a vector that contains the frequency of each wordID.
Based on the above dataset, the representation of documents 1, 2 and 3 will be (note that vocab_size can be extracted from the second line of the data):
val data = Array(
  Vectors.sparse(vocab_size, Seq((2, 1.0), (39, 1.0), (42, 3.0), (77, 1.0), (95, 1.0), (96, 1.0))),
  Vectors.sparse(vocab_size, Seq((105, 1.0), (108, 1.0))),
  Vectors.sparse(vocab_size, Seq((133, 3.0))))
The problem is that I am not quite sure how to read the .txt.gz file as an RDD and create an Array of sparse vectors as described above. Please note that I actually want to pass the data array to the PCA transformer.
Something like this should do the trick:
sc.textFile("path/to/file").flatMap(r => r.split(' ') match {
case Array(doc, word, freq) => Some((doc.toInt, (word.toInt, freq.toDouble)))
case _ => None
}).groupByKey().mapValues(a => Vectors.sparse(vocab_size, a.toSeq))
Note that groupByKey will load all the values for each document into memory at once, so you might want to use one of its alternatives, reduceByKey or aggregateByKey, instead (I would have, but I don't know what methods your sparse vectors have, although you probably have something to merge them together).
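As a rough sketch of that last point (the file path is a placeholder and vocab_size is assumed to be in scope), you can accumulate a plain Map per document with aggregateByKey and only build the sparse vector at the end, so no vector merging is needed:

import org.apache.spark.mllib.linalg.Vectors

val docVectors = sc.textFile("path/to/file")
  .flatMap(r => r.split("\\s+") match {
    case Array(doc, word, freq) => Some((doc.toInt, (word.toInt, freq.toDouble)))
    case _ => None  // drops the three header lines
  })
  .aggregateByKey(Map.empty[Int, Double])(
    (acc, wf) => acc + wf,  // add one (wordID, freq) pair; assumes each (doc, word) appears once
    (m1, m2) => m1 ++ m2    // merge partial maps from different partitions
  )
  .mapValues(m => Vectors.sparse(vocab_size, m.toSeq))

Here only the per-document maps are shuffled, rather than every individual (wordID, frequency) pair being grouped in memory at once.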
The thing is that the 1st number is already an ORACLE LONG,
the second one a Date (SQL DATE, no extra timestamp info), and the last one a Short value in the range 1000-100'000.
How can I create some sort of hash value that will be unique for each combination, optimally?
I don't want string concatenation and converting to a long later, because of cases like this, for example:
Day Month
12 1 --> 121
1 12 --> 121
When you have a few numeric values and need to have a single "unique" (that is, statistically improbable duplicate) value out of them you can usually use a formula like:
h = (a*P1 + b)*P2 + c
where P1 and P2 are either well-chosen numbers (e.g. if you know 'a' is always in the 1-31 range, you can use P1=32) or, when you know nothing particular about the allowable ranges of a, b, c, the best approach is to make P1 and P2 big prime numbers (they have the least chance of generating values that collide).
For an optimal solution the math is a bit more complex than that, but using prime numbers you can usually have a decent solution.
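As a quick sketch of that formula applied to the asker's day/month example (with only two fields the second multiplication isn't needed, and P1 = 32 because a day always fits in 1-31):

// Sketch: combine two bounded numbers without collisions, assuming day in 1..31
def combine(day: Long, month: Long): Long = day * 32 + month

combine(12, 1)   // 385
combine(1, 12)   // 44 -- no longer collides with (12, 1)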
For example, the Java implementation of .hashCode() for an array (or a String) is something like:
h = 0;
for (int i = 0; i < a.length; ++i)
    h = h * 31 + a[i];
Personally, though, I would have chosen a prime bigger than 31, since values inside a String can easily collide: a difference of 31 positions is quite common, e.g.:
"BB".hashCode() == "Aa".hashCode() == 2122
Your
12 1 --> 121
1 12 --> 121
problem is easily fixed by zero-padding your input numbers to the maximum width expected for each input field.
For example, if the first field can range from 0 to 10000 and the second field can range from 0 to 100, your example becomes:
00012 001 --> 00012001
00001 012 --> 00001012
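A quick sketch of that zero-padding, assuming those example widths of 5 and 3 digits:

// Sketch: zero-pad each field to its maximum expected width, then concatenate
def key(a: Int, b: Int): Long = f"$a%05d$b%03d".toLong

key(12, 1)   // 12001
key(1, 12)   // 1012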
In Python, you can use this:
# pip install pairing
import pairing as pf

n = [12, 6, 20, 19]
print(n)

key = pf.pair(pf.pair(n[0], n[1]),
              pf.pair(n[2], n[3]))
print(key)

m = [pf.depair(pf.depair(key)[0]),
     pf.depair(pf.depair(key)[1])]
print(m)
Output is:
[12, 6, 20, 19]
477575
[(12, 6), (20, 19)]