How to Persist an array in spark - scala

I'm comparing two tables (source and destination) to find the differences between them. To do that I load both tables into memory, and the comparison works as expected on a machine with 8 GB of memory and 4 cores. But when comparing a large amount of data the system hangs and runs out of memory, so I used persist() with storage level DISK_ONLY.
The machine is capable of holding 100,000 rows in memory, so I want to fetch 100,000 rows at a time, store them to disk, and then do the remaining comparison operations. I'm trying it like below:
var partition = math.ceil(c / 100000.toFloat).toInt
println(partition + " partition")
var a = 1
var data = spark.sparkContext.parallelize(Seq(""))
var offset = 0
for (s <- a to partition) {
val query = "(select * from destination LIMIT 100000 OFFSET " + offset + ") as src"
data = data.union(spark.read.jdbc(url, query, connectionProperties).rdd.map(_.mkString(","))).persist(StorageLevel.DISK_ONLY)
offset += 100000
}
val dest = data.collect.toArray
val s = spark.sparkContext.parallelize(dest, 1).persist(StorageLevel.DISK_ONLY)
Yes, of course I can use a partitioned JDBC read, but the problem is that I would need to supply lowerBound, upperBound, and numPartitions dynamically to fetch 100,000 rows at a time. I tried:
val destination = spark.read.options(options).jdbc(options("url"), options("dbtable"), "EMPLOYEE_ID", 1, 22, 21, new java.util.Properties()).rdd.map(_.mkString(","))
but it takes too much time and stores those files as partitions; and although the comparison operation is iterative in nature, it reads all the partitions at each and every step.
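For reference, this is roughly how the bounds could be derived dynamically instead of hard-coding them (a sketch, assuming EMPLOYEE_ID is a dense numeric key on the destination table; I have not verified that it makes the read any faster):
val bounds = spark.read
  .jdbc(url, "(select min(EMPLOYEE_ID) as lo, max(EMPLOYEE_ID) as hi from destination) as b", connectionProperties)
  .collect()(0)
val lower = bounds.getAs[Number](0).longValue   // smallest key
val upper = bounds.getAs[Number](1).longValue   // largest key
val numPartitions = math.ceil((upper - lower + 1) / 100000.0).toInt   // ~100,000 rows per partition

val destination = spark.read
  .jdbc(url, "destination", "EMPLOYEE_ID", lower, upper, numPartitions, connectionProperties)
  .rdd
  .map(_.mkString(","))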
Coming to the problem
val dest = data.collect.toArray
val s = spark.sparkContext.parallelize(dest, 1).persist(StorageLevel.DISK_ONLY)
The above lines convert all the partitioned RDDs to an array and parallelize it into a single partition, so that I don't have to iterate through all the partitions again and again. But val dest = data.collect.toArray can't convert a huge number of rows because of the shortage of memory, and it seems Spark won't allow you to persist() an array.
Is there any way I can store the data and parallelize it into one partition on disk?
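For example, something along these lines is what I'm hoping for (a sketch; I'm not sure whether coalescing to a single partition and persisting with DISK_ONLY really avoids the driver-side array):
// keep everything as an RDD: move it to one partition and spill it to disk
// instead of collecting it into an Array on the driver
val singlePartition = data
  .coalesce(1)                       // one partition, no driver-side copy
  .persist(StorageLevel.DISK_ONLY)   // blocks are written to disk
singlePartition.count()              // materialise the persisted data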
Sorry for being a noob.
Thank you..!

Related

Spark dataframe process partitions in batches, N partitions at a time

I need to process Spark dataframe partitions in batches, N partitions at a time. For example, if I have 1000 partitions in a Hive table, I need to process 100 partitions at a time.
I tried the following approach:
Get partition list from hive table and find total count
Get loop count using total_count/100
Then
for x in range(loop_count):
    files_list = partition_path_list[start_index:end_index]
    df = spark.read.option("basePath", target_table_location).parquet(*files_list)
But this is not working as expected. Can anyone suggest a better method? A solution in Spark Scala is preferred.
The for loop you have just increments x each time; that's why the start and end indices never move.
Not sure why you mention Scala since your code is in Python.
Here's an example with loop count being 1000.
partitions_per_iteration = 100
loop_count = 1000
for start_index in range(0, loop_count, partitions_per_iteration):
    files_list = partition_path_list[start_index:start_index + partitions_per_iteration]
    df = spark.read.option("basePath", target_table_location).parquet(*files_list)
In Scala, you can do a similar loop:
val partitionsPerIteration = 100
val total = 1000
for {
  startIndex <- 0 until total by partitionsPerIteration
} {
  val filesList = partitionsPathList.slice(startIndex, startIndex + partitionsPerIteration)
  val df = ...
}
I think total or totalPartitions is a clearer variable name than "loop count".

Reading Mongo data from Spark

I am reading data from mongodb on a spark job, using the com.mongodb.spark.sql connector (v 2.0.0).
It works fine for most db's, but for a specific db, the stage takes a long time and the number of partitions is very high.
My program is set to 128 partitions (2x the number of vCPUs), which works fine after some testing that we did. On this load the number jumps to 2061 partitions and the stage takes several minutes to process, even though I am using a filter and the documentation clearly states that filters are applied to the underlying data source (https://docs.mongodb.com/spark-connector/v2.0/scala/datasets-and-sql/).
This is how I read data:
val readConfig: ReadConfig = ReadConfig(
  Map(
    "spark.mongodb.input.uri" -> s"${mongodb.uri}/?${mongodb.uriParams}",
    "spark.mongodb.input.database" -> s"${mongodb.dbNamesConfig.siteInstances}",
    "collection" -> params.collectionName
  ), None)
val df: DataFrame = sparkSession.read.format("com.mongodb.spark.sql").options(readConfig.asOptions)
  .schema(impressionSchema)
  .load()
println("df: " + df.rdd.getNumPartitions) // this is 2061 partitions
val filteredDF = df.coalesce(128).filter(
$"_timestamp".isNotNull
.and($"_timestamp".between(new Timestamp(start.getMillis()), new Timestamp(end.getMillis())))
.and($"component_type" === SITE_INSTANCE_CHART_COMPONENT)
)
println("filteredDF: " + filteredDF.rdd.getNumPartitions) // 128 after using coalesce
filteredDF.select(
$"iid",
$"instance_id".as("instanceId"),
$"_global_visitor_key".as("globalVisitorKey"),
$"_timestamp".as("timestamp"),
$"_timestamp".cast(DataTypes.DateType).as("date")
)
The data is not very big (the Shuffle Write for this stage is 20 MB), and even if I filter down to a single document the run time is the same (only the Shuffle Write is much smaller).
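Possibly relevant: the connector's partitioning can be configured through ReadConfig. A sketch of what I mean (assuming the 2.0 connector's documented partitioner options, and a partition size picked arbitrarily; I don't know whether this is the right lever here):
val readConfig: ReadConfig = ReadConfig(
  Map(
    "spark.mongodb.input.uri" -> s"${mongodb.uri}/?${mongodb.uriParams}",
    "spark.mongodb.input.database" -> s"${mongodb.dbNamesConfig.siteInstances}",
    "collection" -> params.collectionName,
    // larger partition size => fewer partitions (the default is 64 MB)
    "spark.mongodb.input.partitioner" -> "MongoSamplePartitioner",
    "spark.mongodb.input.partitionerOptions.partitionSizeMB" -> "256"
  ), None)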
How can I solve this?
Thanks
Nir

Spark DataFrame save as parquet - out of memory

I am using Spark to read a file from S3, load it into a DataFrame, and then write it to HDFS as Parquet.
The thing is that when the file is big (65 GB), I get out of memory, and I have no idea why, because the data looks like it is partitioned pretty well.
This is a sample of my code:
val records = gzCsvFile.filter { x => x.length == 31 }.map { x =>
  var d: Date = Date.valueOf(x(0))
  //var z = new GregorianCalendar(); z.getWeekYear
  var week = (1900 + d.getYear) * 1000 + d.getMonth() * 10 + Math.ceil(d.getDate() / 7.0).toInt
  Row(d, Timestamp.valueOf(x(1)), toLong(x(2)), toLong(x(3)), toLong(x(4)), toLong(x(5)), toLong(x(6)), toLong(x(7)), toLong(x(8)), toLong(x(9)), toLong(x(10)), toLong(x(11)), toLong(x(12)), toLong(x(13)), toLong(x(14)), toLong(x(15)), toLong(x(16)), toLong(x(17)), toLong(x(18)), toLong(x(19)), toLong(x(20)), toLong(x(21)), toLong(x(22)), toLong(x(23)), toLong(x(24)), toLong(x(25)), x(26).trim(), toLong(x(27)), toLong(x(28)), toLong(x(29)), toInt(x(30)), week)
}
var cubeDF = sqlContext.createDataFrame(records, cubeSchema)
cubeDF.write.mode(SaveMode.Overwrite).partitionBy("CREATION_DATE","COUNTRY_ID","CHANNEL_ID" ).parquet(cubeParquetUrl)
Does anyone have any idea what is going on?
You are hitting this bug: https://issues.apache.org/jira/browse/SPARK-8890
Parquet's memory consumption when writing output is substantially larger than we thought. In the soon-to-be-released Spark 1.5, we switched to sorting the data before writing out a large number of Parquet partitions, in order to reduce memory consumption.
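Until 1.5 is released, the same idea can be approximated manually: sort by the partition columns before writing, so each task writes to only a few partition directories at a time instead of keeping many Parquet writers open at once. A sketch using the DataFrame from the question (not verified at the 65 GB scale):
cubeDF
  .sort("CREATION_DATE", "COUNTRY_ID", "CHANNEL_ID")   // cluster rows by partition key
  .write.mode(SaveMode.Overwrite)
  .partitionBy("CREATION_DATE", "COUNTRY_ID", "CHANNEL_ID")
  .parquet(cubeParquetUrl)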

Scala + Spark collections interactions

I'm working on a little project that uses a graph as its main structure. The graph consists of vertices that have this structure:
class SWVertex[T: ClassTag](
  val id: Long,
  val data: T,
  var neighbors: Vector[Long] = Vector.empty[Long],
  val timestamp: Timestamp = new Timestamp(System.currentTimeMillis())
) extends Serializable {
  def addNeighbor(neighbor: Long): Unit = {
    if (neighbor >= 0) { neighbors = neighbors :+ neighbor }
  }
}
Notes:
There will be a lot of vertices, possibly more than MAX_INT, I think.
Each vertex has a mutable array of neighbors (which are just ID's of another vertices).
There is a special function for adding a vertex to the graph; it uses a BFS algorithm to choose the best vertex in the graph to connect the new vertex to, modifying the neighbors arrays of both the existing and the new vertices.
I've decided to use Apache Spark and Scala for processing and navigating through my graph, but I'm stuck on some misunderstandings: I know that an RDD is a parallel dataset, which I create from the main collection using the parallelize() method, and I've discovered that modifying the source collection affects the created RDD as well. I used this piece of code to find this out:
val newVertex1 = new SWVertex[String](1, "test1")
val newVertex2 = new SWVertex[String](2, "test2")
var vertexData = Seq(newVertex1, newVertex2)
val testRDD1 = sc.parallelize(vertexData, vertexData.length)
testRDD1.collect().foreach(
f => println("| ID: " + f.id + ", data: " + f.data + ", neighbors: "
+ f.neighbors.mkString(", "))
)
// The result is:
// | ID: 1, data: test1, neighbors:
// | ID: 2, data: test2, neighbors:
// Calling simple procedure, that uses `addNeighbor` on both parameters
makeFriends(vertexData(0), vertexData(1))
testRDD1.collect().foreach(
f => println("| ID: " + f.id + ", data: " + f.data + ", neighbors: "
+ f.neighbors.mkString(", "))
)
// Now the result is:
// | ID: 1, data: test1, neighbors: 2
// | ID: 2, data: test2, neighbors: 1
, but I haven't found a way to do the same thing using RDD methods (and honestly I'm not sure it is even possible, due to RDD immutability). In this case, the question is:
Is there any way to deal with such a big amount of data while keeping the ability to access random vertices to modify their neighbor lists and to continuously append new vertices?
I believe the solution lies in using some kind of Vector data structure, and in this case I have another question:
Is it possible to store Scala structures in cluster memory?
P.S. I'm planning to use Spark for processing the BFS search at least, but I will be really happy to hear any other suggestions.
P.P.S. I've read about the .view method for creating "lazy" collection transformations, but still have no clue how it could be used...
Update 1: From what I'm reading in the Scala Cookbook, I think Vector will be the best choice, because working with the graph in my case means a lot of random access to the vertices (the elements of the graph) and appending new vertices. But still, I'm not sure that using a Vector for such a large number of vertices won't cause an OutOfMemoryException.
Update 2: I've found several interesting things going on with the memory in the test above. Here's the deal (keep in mind, I'm using a single-node Spark cluster):
// Test were performed using these lines of code:
val runtime = Runtime.getRuntime
var usedMemory = runtime.totalMemory - runtime.freeMemory
// In the beginning of my work, before creating vertices and collection:
usedMemory = 191066456 bytes // ~182 MB, 1st run
usedMemory = 173991072 bytes // ~166 MB, 2nd run
// After creating collection with two vertices:
usedMemory = 191066456 bytes // ~182 MB, 1st run
usedMemory = 173991072 bytes // ~166 MB, 2nd run
// After creating testRDD1
usedMemory = 191066552 bytes // ~182 MB, 1st run
usedMemory = 173991168 bytes // ~166 MB, 2nd run
// After performing first testRDD1.collect() function
usedMemory = 212618296 bytes // ~203 MB, 1st run
usedMemory = 200733808 bytes // ~191 MB, 2nd run
// After calling makeFriends on source collection
usedMemory = 212618296 bytes // ~203 MB, 1st run
usedMemory = 200733808 bytes // ~191 MB, 2nd run
// After calling testRDD1.collect() for modified collection
usedMemory = 216645128 bytes // ~207 MB, 1st run
usedMemory = 203955264 bytes // ~195 MB, 2nd run
I know that this number of tests is too small to be sure of my conclusions, but I noticed that:
Nothing happens when you create the collection.
After creating the RDD on this sample, 96 bytes are allocated, perhaps for storing partition data or something.
The most memory was allocated when I called the .collect() method, because I basically collect all the data on one node, and, probably because of the single-node Spark installation, I'm getting a double copy of the data (not sure here), which took about 23 MB of memory.
An interesting moment happens after modifying the neighbors arrays, which requires an additional 4 MB of memory to store them.
Let me try to address the different questions here:
an RDD is a parallel dataset, which I create from the main collection
using the parallelize() method, and I've discovered that modifying the
source collection affects the created RDD as well.
RDDs are parallel, distributed datasets. parallelize lets you take a local collection and distribute it over a cluster. The behavior you are observing, where mutating the underlying objects also mutates the RDD representation, happens only because the program is currently running on one node. In a cluster that behavior would not be possible.
Immutability is key to distribute a computation either 'vertically': over several cores of the same processor or 'horizontally': over several machines in a cluster.
I haven't found a way to update the graph structure using RDD
methods
To achieve that you will need to rethink the graph structure in terms of a distributed collection. In the current OO model, each Vertex contains its own list of adjacent vertices and requires mutation of the object in order to build up the graph.
We would need to make vertices immutable, by creating them only with their own properties and externalizing the relationships as a list of edges. In a nutshell, this is what GraphX does. Your Vertex would look like:
case class Vertex[T: ClassTag](
  val id: Long,
  val data: T,
  val timestamp: Timestamp = new Timestamp(System.currentTimeMillis())
)
and then we can build a collection of Edges:
val edges: RDD[(Long, Long)] // (source vertex id, dest vertex id)
Then, given:
val usr1 = Vertex(1, "SuppieRK")
val usr2 = Vertex(2, "maasg")
val usr3 = Vertex(3, "graphy")
val usr4 = Vertex(4, "spark")
And some initial relationship:
val edgeSeq = Seq((1,2), (2,3))
and the RDD of such relationship:
val relations = sparkContext.parallelize(edgeSeq)
then adding new relationships will mean creating new edges:
val newRelations = sparkContext.parallelize(Seq((1,4), (2,4), (3,4)))
and union-ing those collections together.
val allRel = relations.union(newRelations)
This is how "addFriend" would be implemented, but we probably will be reading that data from somewhere. This method is not to be used to do a one-by-one addition to the Edges collection. You are using Spark because the dataset to consider is very large and you need the possibility to distribute the computation across several machines.
If the collection fits in one node, I would stick to "standard" Scala representations and algorithms.
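For illustration, a minimal GraphX sketch of the same model (the Int edge attribute is just a placeholder, and collectNeighborIds is one way to derive the neighbor lists without mutating anything):
import org.apache.spark.graphx.{Edge, EdgeDirection, Graph}
import org.apache.spark.rdd.RDD

// vertices carry only their own properties; relationships live in the edge RDD
val vertices: RDD[(Long, String)] = sparkContext.parallelize(Seq(
  (1L, "SuppieRK"), (2L, "maasg"), (3L, "graphy"), (4L, "spark")))
val edgeRDD: RDD[Edge[Int]] = sparkContext.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

val graph = Graph(vertices, edgeRDD)

// neighbor lists are derived from the edges, not stored inside the vertices
val neighborIds = graph.collectNeighborIds(EdgeDirection.Either)
neighborIds.collect().foreach { case (id, ns) => println(s"$id -> ${ns.mkString(", ")}") }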

How to make spark partition sticky, i.e. stay with node?

I am trying to use Spark Streaming 1.2.0. At some point, I grouped streaming data by key and then applied some operations to them.
The following is a segment of the test code:
...
JavaPairDStream<Integer, Iterable<Integer>> grouped = mapped.groupByKey();
JavaPairDStream<Integer, Integer> results = grouped.mapToPair(
    new PairFunction<Tuple2<Integer, Iterable<Integer>>, Integer, Integer>() {
        @Override
        public Tuple2<Integer, Integer> call(Tuple2<Integer, Iterable<Integer>> tp) throws Exception {
            TaskContext tc = TaskContext.get();
            String ip = InetAddress.getLocalHost().getHostAddress();
            int key = tp._1();
            System.out.println(ip + ": Partition: " + tc.partitionId() + "\tKey: " + key);
            return new Tuple2<>(key, 1);
        }
    });
results.print();
mapped is a JavaPairDStream wrapping a dummy receiver that stores an array of integers every second.
I ran this app on a cluster with two slaves, each with 2 cores.
When I checked the printout, it seems that partitions were not assigned to nodes permanently (or in a "sticky" fashion). They moved between the two nodes frequently. This creates a problem for me.
In my real application, I need to load a fairly large amount of geo data per partition. This geo data will be used to process the data in the streams. I can only afford to load part of the geo data set per partition. If a partition moves between nodes, the geo data has to move too, which can be very expensive.
Is there a way to make the partitions sticky, i.e. partition 0,1,2,3 stay with node 0, partition 4,5,6,7 stay with node 1?
I have tried setting spark.locality.wait to a large number, say, 1000000. And it did not work.
Thanks.
I found a workaround.
I can make my auxiliary data an RDD, partition it, and cache it.
Later, I can cogroup it with other RDDs and Spark will try to keep the cached RDD partitions where they are and not shuffle them. E.g.
...
JavaPairRDD<Integer, GeoData> geoRDD =
    geoRDD1.partitionBy(new HashPartitioner(num)).cache();
Later, do this
JavaPairRDD<Integer, Integer> someOtherRDD = ...
JavaPairRDD<Integer, Tuple2<Iterable<GeoData>, Iterable<Integer>>> grp =
    geoRDD.cogroup(someOtherRDD);
Then you can use foreach on the cogrouped RDD to process the input data with the geo data.
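A rough Scala sketch of the same pattern, for comparison (GeoData, the Int key type, and the events RDD are placeholders for illustration):
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

case class GeoData(region: String)   // stand-in for the real geo payload

def joinWithGeo(geoRaw: RDD[(Int, GeoData)], events: RDD[(Int, Int)], num: Int): Unit = {
  // partition and cache the auxiliary geo data once ...
  val geoRDD = geoRaw.partitionBy(new HashPartitioner(num)).cache()

  // ... then cogroup: the other side is moved to the cached geo partitions,
  // so the geo data itself is not shuffled again
  val grouped: RDD[(Int, (Iterable[GeoData], Iterable[Int]))] = geoRDD.cogroup(events)

  grouped.foreach { case (key, (geo, values)) =>
    // process this key's input values against its geo data
    println(s"$key: ${geo.size} geo records, ${values.size} events")
  }
}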