Using Apache Zeppelin, I have the following notebook paragraphs that load content into the zeppelinContext object. One from python (pyspark):
%pyspark
py_list = [5,6,7,8]
z.put("p1", py_list)
And one from scala (spark):
val scala_arr1 = Array(Array(1, 4), Array(8, 16))
z.put("s1", scala_arr1)
val scala_arr2 = Array(1,2,3,4)
z.put("s2", scala_arr2)
val scala_vec = Vector(4,3,2,1)
z.put("s3", scala_vec)
I am trying to access these values from a sparkR paragraph using the following:
%r
unlist(z.get("s1"))
unlist(z.get("s2"))
unlist(z.get("s3"))
unlist(z.get("p1"))
However, the result is:
[1] 1 4 8 16
[1] 1 2 3 4
Java ref type scala.collection.immutable.Vector id 51
Java ref type java.util.ArrayList id 53
How can I get the values that were in the Scala Vector and the Python list? Having Scala and Java objects inside an R interpreter is not particularly useful because, to my knowledge, no R functions can make sense of them. Am I outside the range of what Zeppelin is currently capable of? I am on a snapshot of zeppelin-0.6.0.
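One workaround sketch, based on the output above (the plain arrays s1 and s2 do come through as R vectors, while the Vector and the Python list arrive as opaque Java refs): convert the collections to arrays before calling z.put. This is an assumption on my part, not something verified against a 0.6.0 snapshot.
// Hypothetical workaround: put an Array instead of a Vector so the R side can unlist it.
z.put("s3", scala_vec.toArray)
The Python list ends up as a java.util.ArrayList on the JVM side (as the output shows), so it would need a similar array-friendly representation; I don't know of an equivalent conversion on the pyspark side.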
I have started studying pyspark and for some reason I'm not able to get the concept of Resiliency property of RDD. My understanding is RDD is a data structure like dataframe in pandas and is immutable. But I wrote a code (shown below) and it works.
file = sc.textFile('file name')
filterData = file.map(lambda x: x.split(','))
filterData = filterData.reduceByKey(lambda x,y: x+y)
filterData = filterData.sortBy(lambda x: x[1])
result = filterData.collect()
Doesn't this violate the immutability property? As you can see, I'm modifying the same RDD again and again.
The file is a CSV with 2 columns: column 1 is an id and column 2 is just some integer.
Can you guys please explain where I'm going wrong with my understanding?
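For what it's worth, a minimal Scala sketch (illustration only, not the code above) of why this kind of reassignment does not conflict with immutability: every transformation returns a new RDD, and the variable name is simply rebound to it; no existing RDD is modified in place.
// Illustration: each transformation yields a new, immutable RDD;
// reassigning the variable only rebinds the name.
val original = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
var data = original.reduceByKey(_ + _)   // new RDD; `original` is untouched
data = data.sortBy(_._2)                 // another new RDD; the previous one still exists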
In Scala, Spark, and a lot of other "big data"-type frameworks, languages, and libraries, I see methods named "zip*". For instance, in Scala, List types have a built-in zipWithIndex method that you can use like so:
val listOfNames : List[String] = getSomehow()
for((name,i) <- listOfNames.zipWithIndex) {
println(s"Names #${i+1}: ${name}")
}
Similarly Spark has RDD methods like zip, zipPartitions, etc.
But the method name "zip" is totally throwing me off. Is this a concept in computing or discrete math?! What's the motivation for all these methods with "zip" in their names?
They are named zip because you are zipping two datasets like a zipper.
To visualize it, take two datasets:
x = [1,2,3,4,5,6]
y = [a,b,c,d,e,f]
and then zip them together to get
1 a
 2 b
  3 c
   4 d
    5 e
     6 f
I put in the extra spacing just to give the zipper illusion as you move down the dataset :)
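In Scala this is literally the zip method on collections; a minimal sketch:
// Pair the two sequences element by element, like closing a zipper.
val x = List(1, 2, 3, 4, 5, 6)
val y = List("a", "b", "c", "d", "e", "f")
val zipped = x.zip(y)   // List((1,a), (2,b), (3,c), (4,d), (5,e), (6,f))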
Using Spark 2.1.1, I have an N-row CSV as 'fileInput':
colname  datatype  elems  start  end
colA     float     10     0      1
colB     int       10     0      9
I have successfully made an array of sql.Rows ...
val df = spark.read.format("com.databricks.spark.csv").option("header", "true").load(fileInput)
val rowCnt:Int = df.count.toInt
val aryToUse = df.take(rowCnt)
Array[org.apache.spark.sql.Row] = Array([colA,float,10,0,1], [colB,int,10,0,9])
Against those Rows and using my random-value-generator scripts, I have successfully populated an empty ListBuffer[Any] ...
res170: scala.collection.mutable.ListBuffer[Any] = ListBuffer(List(0.24455154, 0.108798146, 0.111522496, 0.44311434, 0.13506883, 0.0655781, 0.8273762, 0.49718297, 0.5322746, 0.8416396), List(1, 9, 3, 4, 2, 3, 8, 7, 4, 6))
Now I have a mixed-type ListBuffer[Any] containing differently typed lists.
How do I iterate through and zip these? [Any] seems to defy mapping/zipping. I need to take the N lists generated by the inputFile's definitions, then save them to a CSV file. The final output should be:
ColA, ColB
0.24455154, 1
0.108798146, 9
0.111522496, 3
... etc
The inputFile can then be used to create any number of 'colname's, of any 'datatype' (I have scripts for that), with each type appearing 1 to n times, and any number of rows (defined as 'elems'). My random-generating scripts customize the values per 'start' & 'end', but those columns are not relevant to this question.
Given a List[List[Any]], you can "zip" all these lists together using transpose, if you don't mind the result being a list-of-lists instead of a list of Tuples:
val result: Seq[List[Any]] = list.transpose
If you then want to write this into a CSV, you can start by mapping each "row" into a comma-separated String:
val rows: Seq[String] = result.map(_.mkString(","))
(note: I'm ignoring the Apache Spark part, which seems completely irrelevant to this question... the "metadata" is loaded via Spark, but then it's collected into an Array so it becomes irrelevant)
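Putting those two steps together, a rough end-to-end sketch in plain Scala (no Spark; the header and the output path are made up for illustration):
import java.io.PrintWriter

// Sketch only: transpose the column lists into rows, then write them out as CSV.
val list: List[List[Any]] = List(
  List(0.24455154, 0.108798146, 0.111522496),
  List(1, 9, 3)
)
val rows: Seq[String] = list.transpose.map(_.mkString(","))
val writer = new PrintWriter("output.csv")   // hypothetical output path
writer.println("ColA,ColB")                  // header taken from the desired output above
rows.foreach(writer.println)
writer.close()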
I think the RDD.zipWithUniqueId() or RDD.zipWithIndex() methods can do what you want.
Please refer to the official documentation for more information. Hope this helps.
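For example, a minimal sketch (not tied to the asker's data):
// zipWithIndex pairs each element with its position in the RDD.
val rdd = sc.parallelize(Seq("a", "b", "c"))
val indexed = rdd.zipWithIndex()   // RDD[(String, Long)]: (a,0), (b,1), (c,2)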
Spark Version 1.2.1
Scala Version 2.10.4
I have 2 SchemaRDDs which are associated by a numeric field:
RDD 1: (Big table - about a million records)
[A,3]
[B,4]
[C,5]
[D,7]
[E,8]
RDD 2: (Small table, < 100 records, so I'm using it as a Broadcast Variable)
[SUM, 2]
[WIN, 6]
[MOM, 7]
[DOM, 9]
[POM, 10]
Result:
[C,5, WIN]
[D,7, MOM]
[E,8, DOM]
[E,8, POM]
I want the max(field) from RDD1 which is <= the field from RDD2.
I am trying to approach this using a merge, by:
Sorting the RDD by a key (the sort is within a group, and a group will have no more than 100 records; the above example is all within one group).
Performing a merge operation similar to mergesort. Here I need to keep track of the previous value as well to find the max; still, I traverse the list only once.
Since there are too many variables here, I am getting a "Task not serializable" exception. Is this implementation approach correct? I am trying to avoid a Cartesian product here. Is there a better way to do it?
Adding the code -
rdd1.groupBy(itm => (itm(2), itm(3))).mapValues(itmorg => {
  val miorec = itmorg.toList.sortBy(_(1).toString)
  for (r <- 0 to miorec.length) {
    for (q <- 0 to rdd2.value.length) {
      if ((miorec(r)(1).toString > rdd2.value(q).toString && miorec(r - 1)(1).toString <= rdd2.value(q).toString && r > 0) || r == miorec.length)
        org.apache.spark.sql.Row(miorec(r - 1)(0), miorec(r - 1)(1), miorec(r - 1)(2), miorec(r - 1)(3), rdd2.value(q))
    }
  }
}).collect.foreach(println)
I would not do a global sort. It is an expensive operation for what you need. Finding the maximum is certainly cheaper than getting a global ordering of all values. Instead, do this:
For each partition, build a structure that keeps the max on RDD1 for each row on RDD2. This can be trivially done using mapPartitions and normal scala data structures. You can even use your one-pass merge code here. You should get something like a HashMap(WIN -> (C, 5), MOM -> (D, 7), ...)
Once this is done locally on each executor, merging these resulting data structures should be simple using reduce.
The goal here is to do little to no shuffling and to keep the most complex operation local, since the result size you want is very small (it would be easier in code to just create all valid key/values with RDD1 and RDD2 and then aggregateByKey, but it would be less efficient).
As for your exception, you would need to show the code; "Task not serializable" usually means you are passing around closures which are not, well, serializable ;-)
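A rough sketch of the per-partition idea described above, assuming rdd1 holds (name, value) pairs, rdd2 is the broadcast list of (label, threshold) pairs, and the rule is "max value <= threshold"; the names and shapes are assumptions for illustration, not the asker's actual schema:
// Per partition, keep the best (max value <= threshold) row per label,
// then merge the partial maps with reduce.
val partials = rdd1.mapPartitions { iter =>
  val best = scala.collection.mutable.Map[String, (String, Int)]()
  for ((name, value) <- iter; (label, threshold) <- rdd2.value) {
    if (value <= threshold && best.get(label).forall(_._2 < value))
      best(label) = (name, value)
  }
  Iterator(best.toMap)
}
val result = partials.reduce { (m1, m2) =>
  (m1.keySet ++ m2.keySet).map { label =>
    val candidates = m1.get(label).toSeq ++ m2.get(label).toSeq
    label -> candidates.maxBy(_._2)
  }.toMap
}
// result: Map(WIN -> (C,5), MOM -> (D,7), DOM -> (E,8), POM -> (E,8))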
I'm trying to union a Seq[DataSet[(Long, Long, Double)]] into a single DataSet[(Long, Long, Double)] in Flink:
val neighbors = graph.map(el => zKnn.neighbors(results, el.vector, 150, metric))
  .reduce((a, b) => a.union(b))
  .collect()
where graph is a regular Scala collection but can be converted to a DataSet;
results is a DataSet[Vector] that should not be collected and is needed in the neighbors method.
I always get a Flink runtime exception:
cannot currently handle nodes with more than 64 outputs.
org.apache.flink.optimizer.CompilerException: Cannot currently handle nodes with more than 64 outputs.
at org.apache.flink.optimizer.dag.OptimizerNode.addOutgoingConnection(OptimizerNode.java:347)
at org.apache.flink.optimizer.dag.SingleInputNode.setInput(SingleInputNode.java:202)
Flink does not support union operators with more than 64 input data sets at the moment.
As a workaround you can hierarchically union up to 64 data sets and inject an identity mapper between levels of the hierarchy.
Something like:
DataSet level1a = data1.union(data2.union(data3...(data64))).map(new IDMapper());
DataSet level1b = data65.union(data66...(data128)).map(new IDMapper());
DataSet level2 = level1a.union(level1b);
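In the Flink Scala API, one way to apply that workaround to the original Seq of DataSets (a sketch; neighborSets is an assumed name standing in for the Seq[DataSet[(Long, Long, Double)]] produced from graph):
import org.apache.flink.api.scala._

// Union at most 64 data sets per level, inject an identity mapper between
// levels, and recurse until a single DataSet is left. Assumes a non-empty Seq.
def unionAll(sets: Seq[DataSet[(Long, Long, Double)]]): DataSet[(Long, Long, Double)] =
  if (sets.size <= 1) sets.head
  else unionAll(sets.grouped(64).map(_.reduce(_ union _).map(t => t)).toSeq)

val neighbors = unionAll(neighborSets).collect()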