Pair RDD tuple comparison - eclipse

I am learning how to use Spark and Scala, and I am trying to write a Scala Spark program that receives an input of string values such as:
12 13
13 14
13 12
15 16
16 17
17 16
I initially create my pair rdd with:
val myRdd = sc.textFile(args(0)).map(line => (line.split("\\s+")(0), line.split("\\s+")(1))).distinct()
Now this is where I am getting stuck. In the set of values there are instances like (12,13) and (13,12). In the context of the data these two are the same instance. Simply put, (a,b) = (b,a).
I need to create an RDD that has one or the other, but not both. So the result, once this is done, would look something like this:
12 13
13 14
15 16
16 17
The only way I can see to do it right now is to take one tuple and compare it with the rest in the RDD to make sure it isn't the same data, just swapped.

The numbers just need to be sorted before creating a tuple.
val myRdd = sc.textFile(args(0))
  .map(line => {
    val nums = line.split("\\s+").sorted
    (nums(0), nums(1))
  }).distinct
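As a sanity check, here is the same sort-then-dedupe idea as a plain local sketch (no Spark involved), run against the sample lines above; it yields the four expected pairs:
// Local sketch of the same idea, without Spark, using the sample input above.
val lines = Seq("12 13", "13 14", "13 12", "15 16", "16 17", "17 16")
val pairs = lines.map { line =>
  val nums = line.split("\\s+").sorted  // lexicographic sort; fine here since the sample numbers have equal width
  (nums(0), nums(1))
}.distinct
pairs.foreach { case (a, b) => println(s"$a $b") }
// 12 13
// 13 14
// 15 16
// 16 17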

Related

kdb - how to create a sum of a list of dynamic columns using functional select

I want to be able to construct (+; (+; `a; `b); `c) given a list of `a`b`c
Similarly, if I have a list of `a`b`c`d, I want to be able to construct another nest, and so on and so forth.
I've been trying to use scan but I can't get it right:
q)fsum:(+;;)/
enlist[+;;]/
q)fsum `a`b`c`d
+
(+;(+;`a;`b);`c)
`d
If you only want the raw parse tree output, one way is to form the equivalent string and use parse. This isn't recommended for more complex examples, but in this case it is clear.
{parse "+" sv string x}[`a`b`c`d]
+
`d
(+;`c;(+;`b;`a))
If you are looking to use this in a functional select, we can use +/ instead of adding each column individually, as you specified in your example:
q)parse"+/[(a;b;c;d)]"
(/;+)
(enlist;`a;`b;`c;`d)
q)f:{[t;c] ?[t;();0b;enlist[`res]!enlist (+/;(enlist,c))]};
q)t:([]a:1 2 3;b:4 5 6;c:7 8 9;d:10 11 12)
q)f[t;`a`b`c]
res
---
12
15
18
q)f[t;`a`b]
res
---
5
7
9
q)f[t;`a`b`c]~?[t;();0b;enlist[`res]!enlist (+;(+;`a;`b);`c)]
1b
You can also get the sum by indexing directly to return a list of each column's values and summing over these. We use (), to turn any input into a list; otherwise it would sum the values within a single column and return only a single value:
q)f:{[t;c] sum t (),c}
q)f[t;`a`b`c]
12 15 18

Performing random trials in pyspark

I have been learning pyspark recently and wanted to apply it to one of my problems. Basically, I want to perform random trials on each record in a dataframe. My dataframe is structured as below.
order_id,order_date,distribution,quantity
O1,D1,3 4 4 5 6 7 8 ... ,10
O2,D2,1 6 9 10 12 16 18 ..., 20
O3,D3,7 12 15 16 18 20 ... ,50
Here the distribution column holds 100 percentile points, with each value space separated.
I want to loop through each of these rows in the dataframe, randomly select a point in the distribution, add that many days to order_date, and create a new column arrival_date.
At the end I want to get the avg(quantity) by arrival_date. So my final dataframe should look like:
arrival_date,qty
A1,5
A2,10
What I have achieved so far is below:
import datetime
import random

df = spark.read.option("header", True).csv("/tmp/test.csv")

def randSample(row):
    order_id = row.order_id
    quantity = int(row.quantity)
    data = []
    for i in range(1, 20):
        n = random.randint(0, 99)
        randnum = int(float(row.distribution.split(" ")[n]))
        arrival_date = datetime.datetime.strptime(row.order_date.split(" ")[0], "%Y-%m-%d") + datetime.timedelta(days=randnum)
        data.append((arrival_date, quantity))
    return data

finalRDD = df.rdd.map(randSample)
The calculations look correct; however, the finalRDD is structured as a list of lists, as below:
[
[(),(),(),()]
,[(),(),(),()]
,[(),(),(),()]
,[(),(),(),()]
]
Each list inside the main list is a single record, and each tuple inside the nested list is a trial of that record.
Basically, I want the final output as flattened records, so that I can perform the average.
[
(),
(),
(),
]

How to generate a random sequence of binary strings of fixed size (say 36 bits) in Scala

I'm trying to generate a unique random sequence of 50 binary strings of 36 bits each. I tried nextInt followed by toBinaryString, which didn't solve my problem since nextInt doesn't support such big numbers, and I also checked nextString, which generates a string of random characters (not 0/1). Is there any other way to achieve this?
To add one more requirement: I want all 36 bits to be present every time. For example, if the random generator produced 3, I would want the output to be 000...(34 zeros)11.
I'm quite new to Scala; pardon me if my question seems irrelevant or redundant.
You can try
import scala.util.Random

val r = Random
val a: Seq[Int] = (0 until 50).map(_ => r.nextInt(1000000))  // 50 random values
val y = a.map { x =>
  val bin = x.toBinaryString
  val zero = 36 - bin.length
  List.fill(zero)(0).mkString ++ bin  // left-pad with zeros to 36 characters
}
println(Random.shuffle(y))
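Note that nextInt(1000000) only spans roughly 20 bits, so the high bits are always zero, and duplicates are not ruled out. As a minimal sketch (my own assumption, not part of the original answer), BigInt's random constructor can draw genuinely 36-bit values, and a set can keep the 50 strings unique:
import scala.util.Random

// Sketch: BigInt(36, Random) draws a uniformly random value in [0, 2^36),
// and we left-pad its binary form to exactly 36 characters.
def random36BitStrings(n: Int = 50): Seq[String] = {
  val seen = scala.collection.mutable.LinkedHashSet.empty[String]
  while (seen.size < n) {
    val bits = BigInt(36, Random).toString(2)
    seen += ("0" * (36 - bits.length)) + bits  // keep only distinct strings
  }
  seen.toSeq
}

random36BitStrings().foreach(println)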

Moving Data Between PySpark, SparkR, and Scala Interpreters

Using Apache Zeppelin, I have the following notebook paragraphs that load content into the zeppelinContext object. One from python (pyspark):
%pyspark
py_list = [5,6,7,8]
z.put("p1", py_list)
And one from scala (spark):
val scala_arr1 = Array(Array(1, 4), Array(8, 16))
z.put("s1", scala_arr1)
val scala_arr2 = Array(1,2,3,4)
z.put("s2", scala_arr2)
val scala_vec = Vector(4,3,2,1)
z.put("s3", scala_vec)
I am trying to access these values from a sparkR paragraph using the following:
%r
unlist(z.get("s1"))
unlist(z.get("s2"))
unlist(z.get("s3"))
unlist(z.get("p1"))
However, the result is:
[1] 1 4 8 16
[1] 1 2 3 4
Java ref type scala.collection.immutable.Vector id 51
Java ref type java.util.ArrayList id 53
How can I get the values that were in the Scala Vector and the Python list? Having Scala and Java objects inside an R interpreter is not particularly useful because, to my knowledge, no R functions can make sense of them. Am I outside the range of what Zeppelin is currently capable of? I am on a snapshot of zeppelin-0.6.0.
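One possible workaround to sketch (an untested assumption on my part, not a confirmed Zeppelin behaviour): the array-backed values s1 and s2 did unlist cleanly above, so converting other collections to plain Arrays before z.put may make them readable from R as well.
// Untested sketch, in a Scala (spark) paragraph: publish Arrays instead of richer collections.
val scala_vec = Vector(4, 3, 2, 1)
z.put("s3", scala_vec.toArray)  // Array[Int] instead of scala.collection.immutable.Vector

// Hypothetically, the Python list (seen above as a java.util.ArrayList) could also be
// re-published as an Array from a Scala paragraph before reading it in R.
import scala.collection.JavaConverters._
val p1 = z.get("p1").asInstanceOf[java.util.List[Any]]
z.put("p1_arr", p1.asScala.toArray)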

Unzip a sequence of case classes with two fields

Let's say I have a case class MyCaseClass with two fields in the constructor, and a sequence of values of this case class, sequence.
How do I unzip sequence?
If the fields are a and b then I'd just write
(sequence map (_.a), sequence map (_.b))
OK, you traverse sequence twice, but list traversal is so cheap, I'd wager that this is quicker than using Option.get.
edit: After Rex's comment, couldn't resist running a benchmark myself; results below...
times in ms for 100 traversals of a 10000-element collection
L = List, A = Array, V = Vector
                                                  // Java 6             // Java 7
sequence.unzip{case MyCaseClass(a,b) => (a,b)}    // L 173 A 101 V 87   // L 27 A 29 V 21
sequence.unzip{MyCaseClass.unapply(_).get}        // L 194 A 116 V 100  // L 35 A 32 V 25
(sequence map (_.a), sequence map (_.b))          // L 177 A 70 V 86    // L 34 A 20 V 23
Your results may vary, according to CPU, memory, JRE version, collection size, phase of the moon etc.
Case classes don't extend Product2, Product3 etc., so a simple unzip doesn't work.
This does:
sequence.unzip { MyCaseClass.unapply(_).get }
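For completeness, a small self-contained sketch (with a hypothetical MyCaseClass, since the original doesn't name its field types) showing that both approaches produce the same result:
case class MyCaseClass(a: Int, b: String)

val sequence = Seq(MyCaseClass(1, "x"), MyCaseClass(2, "y"), MyCaseClass(3, "z"))

// Two traversals, one per field
val byField = (sequence map (_.a), sequence map (_.b))

// Single traversal via the generated extractor
val byUnapply = sequence.unzip { MyCaseClass.unapply(_).get }

println(byField == byUnapply)  // true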