Union of DataSets in Apache Flink - Scala

I'm trying to union a Seq[DataSet[(Long, Long, Double)]] into a single DataSet[(Long, Long, Double)] in Flink:
val neighbors = graph.map(el => zKnn.neighbors(results, el.vector, 150, metric))
  .reduce((a, b) => a.union(b))
  .collect()
Here, graph is a regular Scala collection but can be converted to a DataSet; results is a DataSet[Vector] that is needed in the neighbors method and should not be collected.
I always get a Flink runtime exception: cannot currently handle nodes with more than 64 outputs.
org.apache.flink.optimizer.CompilerException: Cannot currently handle nodes with more than 64 outputs.
at org.apache.flink.optimizer.dag.OptimizerNode.addOutgoingConnection(OptimizerNode.java:347)
at org.apache.flink.optimizer.dag.SingleInputNode.setInput(SingleInputNode.java:202)

Flink does not support union operators with more than 64 input data sets at the moment.
As a workaround you can hierarchically union up to 64 data sets and inject an identity mapper between levels of the hierarchy.
Something like:
DataSet level1a = data1.union(data2.union(data3...(data64))).map(new IDMapper());
DataSet level1b = data65.union(data66.union(...(data128))).map(new IDMapper());
DataSet level2 = level1a.union(level1b);
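In Scala, and for the Seq[DataSet[(Long, Long, Double)]] from the question, the same workaround could look roughly like this (a minimal sketch; the batch size of 60 and the name datasets are illustrative, not from the original post):
import org.apache.flink.api.scala._

// datasets: Seq[DataSet[(Long, Long, Double)]], e.g. graph.map(el => zKnn.neighbors(results, el.vector, 150, metric))
val unioned: DataSet[(Long, Long, Double)] =
  datasets
    .grouped(60)                                        // stay safely below the 64-output limit
    .map(batch => batch.reduce(_ union _).map(t => t))  // union one batch, then break the chain with an identity mapper
    .reduce(_ union _)                                  // union the per-batch results
With 60 inputs per batch this covers up to roughly 3600 data sets; for more, add another identity-mapper level in the same way.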

Related

How to read a CSV in PySpark as different types, or map a dataset into two different types

Is there a way to map an RDD as
covidRDD = sc.textFile("us-states.csv") \
    .map(lambda x: x.split(","))

# reducing states and cases by key
reducedCOVID = covidRDD.reduceByKey(lambda accum, n: accum + n)
print(reducedCOVID.take(1))
The dataset consists of one column of states and one column of cases. When it's created, it is read as
[[u'Washington', u'1'],...]
Thus, I want to have a column of strings and a column of ints. I am doing a project on RDDs, so I want to avoid using DataFrames... any thoughts?
Thanks!
As the dataset contains key-value pairs, use groupByKey and aggregate the counts.
If you have a dataset like [['WH', 10], ['TX', 5], ['WH', 2], ['IL', 5], ['TX', 6]], the code below gives the output [('IL', 5), ('TX', 11), ('WH', 12)]:
data.groupByKey().map(lambda row: (row[0], sum(row[1]))).collect()
You can also use aggregateByKey with a UDF. This method requires three parameters: the start (zero) value, an aggregation function within each partition, and an aggregation function across partitions.
The code below produces the same result as above:
def addValues(a, b):
    return a + b

data.aggregateByKey(0, addValues, addValues).collect()

PySpark - Split the combined key

I have an RDD that is structured in this format:
(MAC_address, dst_ip_address, 1)
Here, 1 means the machine with the MAC_address has accessed the dst_ip_address once. I need to count how many times a specific machine with MAC_address has reached a specific dst_ip_address.
I created an RDD with the combined MAC_address and dst_ip_address as the key, and applied reduceByKey to count the occurrences.
def processJson(data):
    return ((MAC_address, dst_ip_address), 1)

def countreducer(a, b):
    return a + b

tt = df.map(processJson).reduceByKey(countreducer)
I am able to get an RDD of the form ((MAC_address, dst_ip_address), 52).
I need to write the RDD in a JSON format like this:
MAC_address_1:
[dst_ip_1: 52],
[dst_ip_2: 38]
MAC_address_2:
[dst_ip_1: 12]
My intuition is to split the combined key first, but there is no function to flatten a combined key. Thus, I wonder whether the above approach is on the right track.

Implement a MergeSort-like feature in Spark with Scala

Spark Version 1.2.1
Scala Version 2.10.4
I have 2 SchemaRDDs which are associated by a numeric field:
RDD 1: (Big table - about a million records)
[A,3]
[B,4]
[C,5]
[D,7]
[E,8]
RDD 2: (Small table < 100 records so using it as a Broadcast Variable)
[SUM, 2]
[WIN, 6]
[MOM, 7]
[DOM, 9]
[POM, 10]
Result
[C,5, WIN]
[D,7, MOM]
[E,8, DOM]
[E,8, POM]
I want the max(field) from RDD1 which is <= the field from RDD2.
I am trying to approach this using a merge:
Sorting the RDD by a key (the sort is within a group, and a group will have no more than 100 records; in the example above, everything is within one group).
Performing a merge operation similar to merge sort. Here I need to keep track of the previous value as well to find the max; still, I traverse the list only once.
Since there are too many variables here, I am getting a "Task not serializable" exception. Is this implementation approach correct? I am trying to avoid the Cartesian product here. Is there a better way to do it?
Adding the code:
rdd1.groupBy(itm => (itm(2), itm(3))).mapValues( itmorg => {
  val miorec = itmorg.toList.sortBy(_(1).toString)
  for (r <- 0 to miorec.length) {
    for (q <- 0 to rdd2.value.length) {
      if ((miorec(r)(1).toString > rdd2.value(q).toString &&
           miorec(r - 1)(1).toString <= rdd2.value(q).toString && r > 0) || r == miorec.length)
        org.apache.spark.sql.Row(miorec(r - 1)(0), miorec(r - 1)(1), miorec(r - 1)(2), miorec(r - 1)(3), rdd2.value(q))
    }
  }
}).collect.foreach(println)
I would not do a global sort. It is an expensive operation for what you need. Finding the maximum is certainly cheaper than getting a global ordering of all values. Instead, do this:
For each partition, build a structure that keeps the max on RDD1 for each row on RDD2. This can be trivially done using mapPartitions and normal Scala data structures. You can even use your one-pass merge code here. You should get something like a HashMap(WIN -> (C, 5), MOM -> (D, 7), ...)
Once this is done locally on each executor, merging these resulting data structures should be simple using reduce.
The goal here is to do little to no shuffling and to keep the most complex operation local, since the result size you want is very small (it would be easier in code to just create all valid key/values with RDD1 and RDD2 and then aggregateByKey, but less efficient).
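A rough sketch of that idea, assuming the big table is simplified to rdd1: RDD[(String, Int)] and the broadcast side to small: Seq[(String, Int)] (names and types are illustrative, not the question's SchemaRDDs):
import org.apache.spark.rdd.RDD

def bestPerThreshold(rdd1: RDD[(String, Int)], small: Seq[(String, Int)]): Map[String, (String, Int)] = {
  // Per partition: for each (key, threshold) in the small table, keep the rdd1 row
  // with the largest value that is still <= threshold.
  val perPartition = rdd1.mapPartitions { rows =>
    val local = scala.collection.mutable.Map.empty[String, (String, Int)]
    rows.foreach { case (name, value) =>
      small.foreach { case (key, threshold) =>
        if (value <= threshold && local.get(key).forall(_._2 < value))
          local(key) = (name, value)
      }
    }
    Iterator(local.toMap)
  }
  // Merge the per-partition maps, keeping the larger value per key.
  perPartition.reduce { (a, b) =>
    (a.keySet ++ b.keySet).map { k =>
      k -> (a.get(k).toSeq ++ b.get(k).toSeq).maxBy(_._2)
    }.toMap
  }
}
Called as bestPerThreshold(rdd1, broadcastSmall.value), this returns something like Map(WIN -> (C, 5), MOM -> (D, 7), ...) without any global sort or shuffle of RDD1.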
As for your exception, you would need to show the code; "Task not serializable" usually means you are passing around closures which are not, well, serializable ;-)

FlinkML: Joining DataSets of LabeledVector does not work

I am currently trying to join two DataSets (part of the Flink 0.10-SNAPSHOT API). Both DataSets have the same form:
predictions:
6.932018685453303E155 DenseVector(0.0, 1.4, 1437.0)
org:
2.0 DenseVector(0.0, 1.4, 1437.0)
general form:
LabeledVector(Double, DenseVector(Double,Double,Double))
What I want to create is a new DataSet[(Double,Double)] containing only the labels of the two DataSets i.e.:
join:
6.932018685453303E155 2.0
Therefore I tried the following command:
val join = org.join(predictions).where(0).equalTo(0){
(l, r) => (l.label, r.label)
}
But as a result 'join' is empty. Am I missing something?
You are joining on the label field (index 0) of the LabeledVector type, i.e., building all pairs of elements with matching labels. Your example indicates that you want to join on the vector field instead.
However, joining on the vector field, for example by calling:
org.join(predictions).where("vector").equalTo("vector"){
(l, r) => (l.label, r.label)
}
will not work, because DenseVector, the type of the vector field, is not recognized as a key type by Flink (just like all kinds of arrays).
Till describes how to compare prediction and label values in a comment below.
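Not part of the original answer, but one possible workaround (a sketch, assuming org and predictions are DataSet[LabeledVector] as in the question) is to key both DataSets by a derived representation of the vector, here its string form, and join on that:
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector

// Key each element by the string rendering of its vector, keep only the label as the value.
val predKeyed = predictions.map(lv => (lv.vector.toString, lv.label))
val orgKeyed = org.map(lv => (lv.vector.toString, lv.label))

// Join on the derived string key and keep the two labels.
val joined: DataSet[(Double, Double)] =
  predKeyed.join(orgKeyed).where(0).equalTo(0) {
    (p, o) => (p._2, o._2)
  }
Note that this only matches vectors whose string renderings are identical; if the feature values can differ by floating-point noise, a more robust key (for example, a rounded rendering) would be needed.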

Scala + Spark collections interactions

I'm working on a little project that uses a graph as its main structure. The graph consists of vertices that have this structure:
import java.sql.Timestamp
import scala.reflect.ClassTag

class SWVertex[T: ClassTag](
    val id: Long,
    val data: T,
    var neighbors: Vector[Long] = Vector.empty[Long],
    val timestamp: Timestamp = new Timestamp(System.currentTimeMillis())
) extends Serializable {

  def addNeighbor(neighbor: Long): Unit = {
    if (neighbor >= 0) { neighbors = neighbors :+ neighbor }
  }
}
Notes:
There will be a lot of vertices, possibly more than MAX_INT I think.
Each vertex has a mutable collection of neighbors (which are just the IDs of other vertices).
There is a special function for adding a vertex to the graph that uses a BFS algorithm to choose the best vertex in the graph to connect the new vertex to, modifying the neighbor lists of existing and added vertices.
I've decided to use Apache Spark and Scala for processing and navigating my graph, but I'm stuck on some misunderstandings: I know that an RDD is a parallel dataset, which I'm making from the main collection using the parallelize() method, and I've discovered that modifying the source collection affects the created RDD as well. I used this piece of code to find this out:
val newVertex1 = new SWVertex[String](1, "test1")
val newVertex2 = new SWVertex[String](2, "test2")
var vertexData = Seq(newVertex1, newVertex2)
val testRDD1 = sc.parallelize(vertexData, vertexData.length)
testRDD1.collect().foreach(
f => println("| ID: " + f.id + ", data: " + f.data + ", neighbors: "
+ f.neighbors.mkString(", "))
)
// The result is:
// | ID: 1, data: test1, neighbors:
// | ID: 2, data: test2, neighbors:
// Calling simple procedure, that uses `addNeighbor` on both parameters
makeFriends(vertexData(0), vertexData(1))
testRDD1.collect().foreach(
f => println("| ID: " + f.id + ", data: " + f.data + ", neighbors: "
+ f.neighbors.mkString(", "))
)
// Now the result is:
// | ID: 1, data: test1, neighbors: 2
// | ID: 2, data: test2, neighbors: 1
But I didn't find a way to do the same thing using RDD methods (and honestly, I'm not sure this is even possible due to RDD immutability). In this case, the question is:
Is there any way to deal with such a big amount of data while keeping the ability to access random vertices, modify their neighbor lists, and continuously append new vertices?
I believe the solution must involve some kind of Vector data structure, and in that case I have another question:
Is it possible to store Scala structures in cluster memory?
P.S. I'm planning to use Spark for the BFS search at least, but I will be really happy to hear any other suggestions.
P.P.S. I've read about the .view method for creating "lazy" collection transformations, but I still have no clue how it could be used...
Update 1: As far as I can tell from the Scala Cookbook, Vector will be the best choice, because working with the graph in my case means a lot of random access to the vertices (i.e. the elements of the graph) and appending new vertices; but still, I'm not sure that using Vector for such a large number of vertices won't cause an OutOfMemoryException.
Update 2: I've found several interesting things going on with the memory in the test above. Here's the deal (keep in mind, I'm using a single-node Spark cluster):
// Tests were performed using these lines of code:
val runtime = Runtime.getRuntime
var usedMemory = runtime.totalMemory - runtime.freeMemory
// In the beginning of my work, before creating vertices and collection:
usedMemory = 191066456 bytes // ~182 MB, 1st run
usedMemory = 173991072 bytes // ~166 MB, 2nd run
// After creating collection with two vertices:
usedMemory = 191066456 bytes // ~182 MB, 1st run
usedMemory = 173991072 bytes // ~166 MB, 2nd run
// After creating testRDD1
usedMemory = 191066552 bytes // ~182 MB, 1st run
usedMemory = 173991168 bytes // ~166 MB, 2nd run
// After performing first testRDD1.collect() function
usedMemory = 212618296 bytes // ~203 MB, 1st run
usedMemory = 200733808 bytes // ~191 MB, 2nd run
// After calling makeFriends on source collection
usedMemory = 212618296 bytes // ~203 MB, 1st run
usedMemory = 200733808 bytes // ~191 MB, 2nd run
// After calling testRDD1.collect() for modified collection
usedMemory = 216645128 bytes // ~207 MB, 1st run
usedMemory = 203955264 bytes // ~195 MB, 2nd run
I know that this number of tests is too low to be confident in my conclusions, but I noticed that:
Nothing happens when you create the collection.
After creating the RDD on this sample, 96 bytes were allocated, perhaps for storing partition data or something.
The most memory was allocated when I called the .collect() method, because I basically collect all data to one node and, probably because of the single-node Spark installation, I'm getting a double copy of the data (not sure here), which took about 23 MB of memory.
An interesting moment happens after modifying the neighbor arrays, which requires an additional 4 MB of memory to store them.
Let me try to address the different questions here:
RDD is a parallel dataset, which I'm making from the main collection using the parallelize() method, and I've discovered that modifying the source collection affects the created RDD as well.
RDDs are parallel, distributed datasets. parallelize lets you take a local collection and distribute it over a cluster. The behavior you are observing, where mutating the underlying objects also mutates the RDD representation, happens only because the program is currently running on one node. In a cluster that behavior would not be possible.
Immutability is key to distribute a computation either 'vertically': over several cores of the same processor or 'horizontally': over several machines in a cluster.
I didn't find a way to update the graph structure using RDD methods
To achieve that, you will need to re-think the graph structure in terms of a distributed collection. In the current OO model, each Vertex contains its own list of adjacent vertices and requires mutation of the object in order to build up the graph.
We would need to make vertices immutable, creating them only with their properties, and externalize the relationships as a list of edges. In a nutshell, this is what GraphX does. Your Vertex would look like:
case class Vertex[T: ClassTag](
    val id: Long,
    val data: T,
    val timestamp: Timestamp = new Timestamp(System.currentTimeMillis())
)
and then we can build a collection of Edges:
val edges: RDD[(Long, Long)] // (source vertex id, dest vertex id)
Then, given:
val usr1 = Vertex(1, "SuppieRK")
val usr2 = Vertex(2, "maasg")
val usr3 = Vertex(3, "graphy")
val usr4 = Vertex(4, "spark")
And some initial relationship:
val edgeSeq = Seq((1,2), (2,3))
and the RDD of such relationship:
val relations = sparkContext.parallelize(edgeSeq)
then adding new relationships will mean creating new edges:
val newRelations = sparkContext.parallelize(Seq((1,4), (2,4), (3,4)))
and union-ing those collections together.
val allRel = relations.union(newRelations)
This is how "addFriend" would be implemented, but we probably will be reading that data from somewhere. This method is not to be used to do a one-by-one addition to the Edges collection. You are using Spark because the dataset to consider is very large and you need the possibility to distribute the computation across several machines.
If the collection fits in one node, I would stick to "standard" Scala representations and algorithms.
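To connect this back to the neighbors field from the question: with an edge RDD, a vertex's neighbor list is derived on demand rather than mutated in place. A small sketch reusing the allRel value above (treating edges as undirected, as the makeFriends example suggests):
import org.apache.spark.SparkContext._ // pair-RDD operations such as groupByKey (needed explicitly in older Spark versions)

// Emit each edge in both directions so every vertex sees all of its neighbors.
val undirected = allRel.flatMap { case (src, dst) => Seq((src, dst), (dst, src)) }

// Adjacency lists: vertex id -> ids of its neighbors.
val neighbors = undirected.groupByKey()

neighbors.collect().foreach { case (id, ns) =>
  println("| ID: " + id + ", neighbors: " + ns.mkString(", "))
}
Adding a vertex or a relationship then just means unioning new pairs in, exactly as newRelations was handled above.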