How to make spark partition sticky, i.e. stay with node? - streaming

I am trying to use Spark Streaming 1.2.0. At some point, I grouped streaming data by key and then applied some operation on them.
The following is a segment of the test code:
...
JavaPairDStream<Integer, Iterable<Integer>> grouped = mapped.groupByKey();
JavaPairDStream<Integer, Integer> results = grouped.mapToPair(
new PairFunction<Tuple2<Integer, Iterable<Integer>>, Integer, Integer>() {
#Override
public Tuple2<Integer, Integer> call(Tuple2<Integer, Iterable<Integer>> tp) throws Exception {
TaskContext tc = TaskContext.get();
String ip = InetAddress.getLocalHost().getHostAddress();
int key = tp._1();
System.out.println(ip + ": Partition: " + tc.partitionId() + "\tKey: " + key);
return new Tuple2<>(key, 1);
}
});
results.print();
mapped is an JavaPairDStream wrapping a dummy receiver that stores an array of integers every second.
I ran this app on a cluster with two slaves, each has 2 cores.
When I checked out the printout, it seems that partitions were not assigned to nodes permanently (or in a "sticky" fashion). They moved between the two nodes frequently. This creates a problem for me.
In my real application, I need to load fairly large amount of geo data per partition. These geo data will be used to process the data in the streams. I can only afford to load part of the geo data set per partition. If the partition moves between nodes, I will have to move the geo data too, which can be very expensive.
Is there a way to make the partitions sticky, i.e. partition 0,1,2,3 stay with node 0, partition 4,5,6,7 stay with node 1?
I have tried setting spark.locality.wait to a large number, say, 1000000. And it did not work.
Thanks.

I found a workaround.
I can make my auxiliary data a RDD. Partition it and cache it.
Later, I can cogroup it with other RDDs and Spark will try to keep the cached RDD partitions where they are and not shuffle them. E.g.
...
JavaPairRDD<Integer, GeoData> geoRDD =
geoRDD1.partitionBy(new HashPartitioner(num)).cache();
Later, do this
JavaPairRDD<Integer, Integer> someOtherRDD = ...
JavaPairRDD<Integer, Tuple2<Iterator<GeoData>>, Iterator<Integer>>> grp =
geoRDD.cogroup(someOtherRDD);
Then, you can use foreach on the cogroupped rdd to process the input data with geo data.

Related

10 most common female first names - order changes

I am running through the exercise in Databricks and the below code returns firstName in different order everytime I run. Please explain the reason why the order is not same for every run:
val peopleDF = spark.read.parquet("/mnt/training/dataframes/people-10m.parquet")
id:integer
firstName:string
middleName:string
lastName:string
gender:string
birthDate:timestamp
ssn:string
salary:integer
/* Create a DataFrame called top10FemaleFirstNamesDF that contains the 10 most common female first names out of the people data set.*/
import org.apache.spark.sql.functions.count
val top10FemaleFirstNamesDF_1 = peopleDF.filter($"gender"=== "F").groupBy($"firstName").agg(count($"firstName").alias("cnt_firstName")).withColumn("cnt_firstName",$"cnt_firstName".cast("Int")).sort($"cnt_firstName".desc).limit(10)
val top10FemaleNamesDF = top10FemaleFirstNamesDF_1.orderBy($"firstName")
Some runs the assertion passes and in some run the assertion fails:
lazy val results = top10FemaleNamesDF.collect()
dbTest("DF-L2-names-0", Row("Alesha", 1368), results(0))
// dbTest("DF-L2-names-1", Row("Alice", 1384), results(1))
// dbTest("DF-L2-names-2", Row("Bridgette", 1373), results(2))
// dbTest("DF-L2-names-3", Row("Cristen", 1375), results(3))
// dbTest("DF-L2-names-4", Row("Jacquelyn", 1381), results(4))
// dbTest("DF-L2-names-5", Row("Katherin", 1373), results(5))
// dbTest("DF-L2-names-5", Row("Lashell", 1387), results(6))
// dbTest("DF-L2-names-7", Row("Louie", 1382), results(7))
// dbTest("DF-L2-names-8", Row("Lucille", 1384), results(8))
// dbTest("DF-L2-names-9", Row("Sharyn", 1394), results(9))
println("Tests passed!")
The problem might be the limit 10. Due to distributed nature of spark, you can't assume every time it runs the limit function it is going to give you same result. Spark might find different partition in different runs to give you 10 elements.
If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition.
However, I do realize you are sorting the data first and then limiting on that. The limit function supposed to return deterministically when the underlying rdd is sorted. It might be non-deterministic for unsorted data.
It will be worthwhile to see the physical plan of your query.

spark No space left on device when working on extremely large data

The followings are my scala spark code:
val vertex = graph.vertices
val edges = graph.edges.map(v=>(v.srcId, v.dstId)).toDF("key","value")
var FMvertex = vertex.map(v => (v._1, HLLCounter.encode(v._1)))
var encodedVertex = FMvertex.toDF("keyR", "valueR")
var Degvertex = vertex.map(v => (v._1, 0.toLong))
var lastRes = Degvertex
//calculate FM of the next step
breakable {
for (i <- 1 to MaxIter) {
var N_pre = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
var adjacency = edges.join(
encodedVertex,//FMvertex.toDF("keyR", "valueR"),
$"value" === $"keyR"
).rdd.map(r => (r.getAs[VertexId]("key"), r.getAs[Array[Byte]]("valueR"))).reduceByKey((a,b)=>HLLCounter.Union(a,b))
FMvertex = FMvertex.union(adjacency).reduceByKey((a,b)=>HLLCounter.Union(a,b))
// update vetex encode
encodedVertex = FMvertex.toDF("keyR", "valueR")
var N_curr = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
lastRes = N_curr
var middleAns = N_curr.union(N_pre).reduceByKey((a,b)=>Math.abs(a-b))//.mapValues(x => x._1 - x._2)
if (middleAns.values.sum() == 0){
println(i)
break
}
Degvertex = Degvertex.join(middleAns).mapValues(x => x._1 + i * x._2)//.map(identity)
}
}
val res = Degvertex.join(lastRes).mapValues(x => x._1.toDouble / x._2.toDouble)
return res
In which I use several functions I defined in Java:
import net.agkn.hll.HLL;
import com.google.common.hash.*;
import com.google.common.hash.Hashing;
import java.io.Serializable;
public class HLLCounter implements Serializable {
private static int seed = 1234567;
private static HashFunction hs = Hashing.murmur3_128(seed);
private static int log2m = 15;
private static int regwidth = 5;
public static byte[] encode(Long id) {
HLL hll = new HLL(log2m, regwidth);
Hasher myhash = hs.newHasher();
hll.addRaw(myhash.putLong(id).hash().asLong());
return hll.toBytes();
}
public static byte[] Union(byte[] byteA, byte[] byteB) {
HLL hllA = HLL.fromBytes(byteA);
HLL hllB = HLL.fromBytes(byteB);
hllA.union(hllB);
return hllA.toBytes();
}
public static long decode(byte[] bytes) {
HLL hll = HLL.fromBytes(bytes);
return hll.cardinality();
}
}
This code is used for calculating Effective Closeness on a large graph, and I used Hyperloglog package.
The code works fine when I ran it on a graph with about ten million vertices and hundred million of edges. However, when I ran it on a graph with thousands million of graph and billions of edges, after several hours running on clusters, it shows
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 91 in stage 29.1 failed 4 times, most recent failure: Lost task 91.3 in stage 29.1 (TID 17065, 9.10.135.216, executor 102): java.io.IOException: : No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
Can anybody help me? I just begin to use spark for several days. Thank you for helping.
Xiaotian, you state "The shuffle read and shuffle write is about 1TB. I do not need those intermediate values or RDDs". This statement affirms that you are not familiar with Apache Spark or possibly the algorithm you are running. Please let me explain.
When adding three numbers, you have to make a choice about the first two numbers to add. For example (a+b)+c or a+(b+c). Once that choice is made, there is a temporary intermediate value that is held for the number within the parenthesis. It is not possible to continue the computation across all three numbers without the intermediary number.
The RDD is a space efficient data structure. Each "new" RDD represents a set of operations across an entire data set. Some RDDs represent a single operation, like "add five" while others represent a chain of operations, like "add five, then multiply by six, and subtract by seven". You cannot discard an RDD without discarding some portion of your mathematical algorithm.
At its core, Apache Spark is a scatter-gather algorithm. It distributes a data set to a number of worker nodes, where that data set is part of a single RDD that gets distributed, along with the needed computations. At this point in time, the computations are not yet performed. As the data is requested from the computed form of the RDD, the computations are performed on-demand.
Occasionally, it is not possible to finish a computation on a single worker without knowing some of the intermediate values from other workers. This kind of cross communication between the workers always happens between the head node which distributes the data to the various workers and collects and aggregates the data from the various workers; but, depending on how the algorithm is structured, it can also occur mid-computation (especially in algorithms that groupBy or join data slices).
You have an algorithm that requires shuffling, in such a manner that a single node cannot collect the results from all of the other nodes because the single node doesn't have enough ram to hold the intermediate values collected from the other nodes.
In short, you have an algorithm that can't scale to accommodate the size of your data set with the hardware you have available.
At this point, you need to go back to your Apache Spark algorithm and see if it is possible to
Tune the partitions in the RDD to reduce the cross talk (partitions that are too small might require more cross talk in shuffling as a fully connected inter-transfer grows at O(N^2), partitions that are too big might run out of ram within a compute node).
Restructure the algorithm such that full shuffling is not required (sometimes you can reduce in stages such that you are dealing with more reduction phases, each phase having less data combine).
Restructure the algorithm such that shuffling is not required (it is possible, but unlikely that the algorithm is simply mis-written, and factoring it differently can avoid requesting remote data from a node's perspective).
If the problem is in collecting the results, rewrite the algorithm to return the results not in the head node's console, but in a distributed file system that can accommodate the data (like HDFS).
Without the nuts-and-bolts of your Apache Spark program, and access to your data set, and access to your Spark cluster and it's logs, it's hard to know which one of these common approaches would benefit you the most; so I listed them all.
Good Luck!

How to Persist an array in spark

I'm comparing two tables to find out difference between them (i.e Source and destination), for that I'm loading those tables to memory and the comparison happens as expected in the machine of configuration 8GB memory and 4 cores but when comparing large amount of data the system hangs and runs out of memory, so I used persist() of storagelevel DISK_ONLY
the machine is capable of holding 100,000 rows in memory to store that to DISK at a time and do the rest comparison operations, I'm trying like below:
var partition = math.ceil(c / 100000.toFloat).toInt
println(partition + " partition")
var a = 1
var data = spark.sparkContext.parallelize(Seq(""))
var offset = 0
for (s <- a to partition) {
val query = "(select * from destination LIMIT 100000 OFFSET " + offset + ") as src"
data = data.union(spark.read.jdbc(url, query, connectionProperties).rdd.map(_.mkString(","))).persist(StorageLevel.DISK_ONLY)
offset += 100000
}
val dest = data.collect.toArray
val s = spark.sparkContext.parallelize(dest, 1).persist(StorageLevel.DISK_ONLY)
yes off-course I can use partition but the problem is I need to supply Lowerbounds,Upperbounds,NumPartitions dynamically for fetching 100,000 I tried like:
val destination = spark.read.options(options).jdbc(options("url"), options("dbtable"), "EMPLOYEE_ID", 1, 22, 21, new java.util.Properties()).rdd.map(_.mkString(","))
it takes too much of time and storing those files into partitions though comparing operation is Iterative in nature its reading all the partitions for each and every step.
Coming to the problem
val dest = data.collect.toArray
val s = spark.sparkContext.parallelize(dest, 1).persist(StorageLevel.DISK_ONLY)
the above lines convert all the partitioned RDD's to Array and parallelizing it to single partition so I don't want to iterate through all the partitions again and again. But val dest = data.collect.toArray can't able to convert some huge amount of lines because of shortage in memory and seems it won't allow to Persist() an array in spark.
Is there is any way I can store and parallelize to one partition in DISK
Sorry for being a noob.
Thanks you..!

Scala + Spark collections interactions

I'm working under my little project that using graph as the main structure. Graph consists of Vertices that have this structure:
class SWVertex[T: ClassTag](
val id: Long,
val data: T,
var neighbors: Vector[Long] = Vector.empty[Long],
val timestamp: Timestamp = new Timestamp(System.currentTimeMillis())
) extends Serializable {
def addNeighbor(neighbor: Long): Unit = {
if (neighbor >= 0) { neighbors = neighbors :+ neighbor }
}
}
Notes:
There are will be a lot of vertices, possibly over MAX_INT I think.
Each vertex has a mutable array of neighbors (which are just ID's of another vertices).
There are special function for adding vertex to the graph that using BFS algorithm to choose the best vertex in graph for connecting new vertex - modifying existing and adding vertices' neighbors arrays.
I've decided to use Apache Spark and Scala for processing and navigating through my graph, but I stuck with some misunderstandings: I know, that RDD is a parallel dataset, which I'm making from main collection using parallelize() method and I've discovered, that modifying source collection will take affect on created RDD as well. I used this piece of code to find this out:
val newVertex1 = new SWVertex[String](1, "test1")
val newVertex2 = new SWVertex[String](2, "test2")
var vertexData = Seq(newVertex1, newVertex2)
val testRDD1 = sc.parallelize(vertexData, vertexData.length)
testRDD1.collect().foreach(
f => println("| ID: " + f.id + ", data: " + f.data + ", neighbors: "
+ f.neighbors.mkString(", "))
)
// The result is:
// | ID: 1, data: test1, neighbors:
// | ID: 2, data: test2, neighbors:
// Calling simple procedure, that uses `addNeighbor` on both parameters
makeFriends(vertexData(0), vertexData(1))
testRDD1.collect().foreach(
f => println("| ID: " + f.id + ", data: " + f.data + ", neighbors: "
+ f.neighbors.mkString(", "))
)
// Now the result is:
// | ID: 1, data: test1, neighbors: 2
// | ID: 2, data: test2, neighbors: 1
, but I didn't found the way to make the same thing using RDD methods (and honestly I'm not sure that this is even possible due to RDD immutability). In this case, the question is:
Is there any way to deal with such big amount of data, keeping the ability to access to the random vertices for modifying their neighbors lists and continuous appending of new vertices?
I believe that solution must be in using some kind of Vector data structures, and in this case I have another question:
Is it possible to store Scala structures in cluster memory?
P.S. I'm planning to use Spark for processing BFS search at least, but I will be really happy to hear any of other suggestions.
P.P.S. I've read about .view method for creating "lazy" collections transformations, but still have no clue how it could be used...
Update 1: As far as I'm reading Scala Cookbook, I think that choosing Vector will be the best choice, because working with graph in my case means a lot of random accessing to the vertices aka elements of the graph and appending new vertices, but still - I'm not sure that using Vector for such large amount of vertices won't cause OutOfMemoryException
Update 2: I've found several interesting things going on with the memory in the test above. Here's the deal (keep in mind, I'm using single-node Spark cluster):
// Test were performed using these lines of code:
val runtime = Runtime.getRuntime
var usedMemory = runtime.totalMemory - runtime.freeMemory
// In the beginning of my work, before creating vertices and collection:
usedMemory = 191066456 bytes // ~182 MB, 1st run
usedMemory = 173991072 bytes // ~166 MB, 2nd run
// After creating collection with two vertices:
usedMemory = 191066456 bytes // ~182 MB, 1st run
usedMemory = 173991072 bytes // ~166 MB, 2nd run
// After creating testRDD1
usedMemory = 191066552 bytes // ~182 MB, 1st run
usedMemory = 173991168 bytes // ~166 MB, 2nd run
// After performing first testRDD1.collect() function
usedMemory = 212618296 bytes // ~203 MB, 1st run
usedMemory = 200733808 bytes // ~191 MB, 2nd run
// After calling makeFriends on source collection
usedMemory = 212618296 bytes // ~203 MB, 1st run
usedMemory = 200733808 bytes // ~191 MB, 2nd run
// After calling testRDD1.collect() for modified collection
usedMemory = 216645128 bytes // ~207 MB, 1st run
usedMemory = 203955264 bytes // ~195 MB, 2nd run
I know that this amount of test is too low to be sure in my conclusions, but I noticed, that:
There's nothing happens, when you creating collection.
After creating RDD on this sample, there are 96 bytes allocated, perhaps for storing partitions data or something.
The most amount of memory was allocated when I called .collect() method, because I basically collect all data to one node, and, probably because of single-node Spark installation, I'm getting double copy of data (not sure here), which has taken about 23 MB of memory.
Interesting moment happens after modifying neighbors' arrays, which requires additional 4 MB of memory to store them.
Let me try to address the different questions here:
RDD is a parallel dataset, which I'm making from main collection using
parallelize() method and I've discovered, that modifying source
collection will take affect on created RDD as well.
RDDs are parallel, distributed datasets. parallelize lets you take a local collection and distribute it over a cluster. The current behavior you are observing that when mutating the underlying objects the RDD representation also mutates is only because the program is currently running in 1 node. In a cluster that behavior would not be possible.
Immutability is key to distribute a computation either 'vertically': over several cores of the same processor or 'horizontally': over several machines in a cluster.
I didn't found the way to update the graph structure using RDD
methods
To achieve that you will need to re-think the graph structure in terms of a distributed collection. In the current OO model, each Vertex contains their own list of adjacent vertices and require mutation of the object in order to build up the graph.
We would need to make vertex immutable, by creating them only with their properties and externalize the relationships as a list of edges. In a nutshell, this is what GraphX does. Your Edge would look like:
case class Vertex[T: ClassTag](
val id: Long,
val data: T,
val timestamp: Timestamp = new Timestamp(System.currentTimeMillis())
)
and then we can build a collection of Edges:
val Edges:RDD[(Long, Long)] // (Source Vertex Id, Dest Vertex Id)
Then, given:
val usr1 = Vertex(1, "SuppieRK")
val usr2 = Vertex(2, "maasg")
val usr3 = Vertex(3, "graphy")
val usr4 = Vertex(4, "spark")
And some initial relationship:
val edgeSeq = Seq((1,2), (2,3))
and the RDD of such relationship:
val relations = sparkContext.parallelize(edgeSeq)
then adding new relationships will mean creating new edges:
val newRelations = sparkContext.parallelize(Seq((1,4),(2,4),(3,4))
and union-ing those collections together.
val allRel = relations.union(newRelations)
This is how "addFriend" would be implemented, but we probably will be reading that data from somewhere. This method is not to be used to do a one-by-one addition to the Edges collection. You are using Spark because the dataset to consider is very large and you need the possibility to distribute the computation across several machines.
If the collection fits in one node, I would stick to "standard" Scala representations and algorithms.

Spark: coalesce very slow even the output data is very small

I have the following code in Spark:
myData.filter(t => t.getMyEnum() == null)
.map(t => t.toString)
.saveAsTextFile("myOutput")
There are 2000+ files in the myOutput folder, but only a few t.getMyEnum() == null, so there are only very few output records. Since I don't want to search just a few outputs in 2000+ output files, I tried to combine the output using coalesce like below:
myData.filter(t => t.getMyEnum() == null)
.map(t => t.toString)
.coalesce(1, false)
.saveAsTextFile("myOutput")
Then the job becomes EXTREMELY SLOW! I am wondering why it is so slow? There was just a few output records scattering in 2000+ partitions? Is there a better way to solve this problem?
if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
Note: With shuffle = true, you can actually coalesce to a larger
number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.
So try by passing the true to coalesce function. i.e.
myData.filter(_.getMyEnum == null)
.map(_.toString)
.coalesce(1, shuffle = true)
.saveAsTextFile("myOutput")