Spark: unpersist RDDs for which I have lost the reference - scala

How can I unpersist RDDs that were generated in an MLlib model for which I don't have a reference?
I know that in PySpark you can unpersist all DataFrames with sqlContext.clearCache(); is there something similar for RDDs in the Scala API? Furthermore, is there a way to unpersist only some RDDs without having to unpersist all of them?

You can call
val rdds = sparkContext.getPersistentRDDs // the result is a Map[Int, RDD[_]]
and then filter the values to get the ones you want (1):
rdds.filter(x => filterLogic(x._2)).foreach(x => x._2.unpersist())
(1) - written by hand, without a compiler - sorry if there's some error, but there shouldn't be ;)
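For example, a rough sketch of unpersisting only the RDDs that a model fit left behind, by comparing the set of persistent RDD ids before and after training (where exactly the training call goes is an assumption):
val before = sparkContext.getPersistentRDDs.keySet
// ... fit the MLlib model here (assumption: this is where the un-referenced RDDs get persisted) ...
sparkContext.getPersistentRDDs
  .filter { case (id, _) => !before.contains(id) } // keep only RDDs persisted after the snapshot
  .values
  .foreach(_.unpersist())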

Related

How to concatenate transformations on a spark scala dataframe?

I am teaching myself scala (so as to use it with Apache Spark) and wanted to know if there would be some way to concatenate a series of transformations on a Spark DataFrame. E.g. let's assume we have a list of transformations
l: List[(String, String)] = List(("field1", "nonEmpty"), ("field2", "notNull"))
and a Spark DataFrame
df, such that the desired result would be
df.filter(df("field1") =!= "").filter(df("field2").isNotNull).
I was thinking perhaps this could be done using function composition or list folding or something, but I really don't know how. Any help would be greatly appreciated.
Thanks!
Yes, it is perfectly possible, but it depends on what you really want. Spark provides Pipelines, which let you compose your transformations into a pipeline that can be serialized. You can also create your own custom transformers and wrap your "filter" stages in them, so you can reuse them later, for example in Spark Structured Streaming.
Another option is to use Spark Datasets and the transform API, which seems more functional and elegant.
Scala gives you many ways to build your own API, but take a look at these approaches first.
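As a minimal sketch of the transform approach (the column names come from the question, and df is assumed to already exist):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Each helper takes a DataFrame and returns a filtered DataFrame, so they chain with transform.
def nonEmpty(field: String)(df: DataFrame): DataFrame = df.filter(col(field) =!= "")
def notNull(field: String)(df: DataFrame): DataFrame = df.filter(col(field).isNotNull)

val result = df
  .transform(nonEmpty("field1"))
  .transform(notNull("field2"))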
Yes, you can fold over an existing DataFrame. You could keep all the columns in a list and not bother with other intermediary types:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val df: DataFrame = ???

val columns =
  List(
    col("1") =!= "",
    col("2").isNotNull,
    col("3") > 10
  )

val filtered =
  columns.foldLeft(df)((df, col) => df.filter(col))
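If you want to start from the question's List[(String, String)] of rules rather than ready-made columns, a rough sketch (only the "nonEmpty" and "notNull" rules from the question are handled; anything else is an assumption):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Translate a (field, rule) pair into a Column predicate; unknown rules are not handled here.
def toPredicate(rule: (String, String)): Column = rule match {
  case (field, "nonEmpty") => col(field) =!= ""
  case (field, "notNull")  => col(field).isNotNull
}

val l = List(("field1", "nonEmpty"), ("field2", "notNull"))
val filteredFromRules = l.map(toPredicate).foldLeft(df)((acc, c) => acc.filter(c))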

Load RDD from name

In Spark, you can call setName on an RDD.
Is it possible to load an RDD from its name?
Something like spark.loadRDD(name)?
Thanks.
There is no such option, because names are not unique identifiers. They are just a way to attach additional information that will be shown in the UI or in debug strings.
It is perfectly fine to have:
val rdd1 = sc.parallelize(Seq(1, 2, 3)).setName("foo")
val rdd2 = sc.parallelize(Seq(4, 5, 6)).setName("foo")
and Spark wouldn't "know" which RDD to return.
Additionally, Spark doesn't track RDDs in general. Only objects that are cached or persisted in some other way are "known" to Spark.
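That said, for RDDs that are persisted you can approximate a lookup by name through getPersistentRDDs, keeping in mind that names may collide; a rough sketch:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Returns the first persisted RDD with the given name, if any (names are not unique).
def findCachedRdd(sc: SparkContext, name: String): Option[RDD[_]] =
  sc.getPersistentRDDs.values.find(_.name == name)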

Is a filtered RDD still in cache when performed on a cached RDD

I'm wondering what happens if we perform the following instructions:
val rdd = sc.textFile("myfile").zipWithIndex.cache
val size = rdd.count
val filter = rdd.filter(_._2 % 2 == 0)
val sizeF = filter.count
Is the action performed on the filter RDD executed as if the data were in the cache or not? Even though we create a second RDD from the first one, the information comes from the same place, so I'm wondering whether it is copied into a new object that needs to be cached, or whether the filtered object is directly linked to its parent, allowing faster actions.
Since filter is a transformation and not an action, and since Spark is lazy, nothing is actually done in the following line:
val filter = rdd.filter(_._2 % 2 == 0)
The following line:
val sizeF = filter.count
will use the cached RDD and perform the filter transformation followed by the count action.
Hence, there is nothing to cache in the filter transformation.
Spark Guide
Transformations
The following table lists some of the common transformations supported
by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair
RDD functions doc (Scala, Java) for details.
filter(func) Return a new dataset formed by selecting those elements
of the source on which func returns true.
Note: even if filter were an action and a new RDD were created, it wouldn't be cached; only the RDDs on which the cache() operation was executed are cached.
No,
the child RDD will not be cached; the cache will keep only the original RDD on your workers, and the derived data will not be cached.
If you run another step on this filtered RDD that doesn't change the data, the response will still be fast, because Spark keeps the cached parent on the workers until it is evicted or unpersisted.
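You can check this yourself by looking at the storage levels of the two RDDs; a minimal sketch:
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("myfile").zipWithIndex.cache
val filter = rdd.filter(_._2 % 2 == 0)

println(rdd.getStorageLevel == StorageLevel.MEMORY_ONLY) // true: the parent is marked for caching
println(filter.getStorageLevel == StorageLevel.NONE)     // true: the filtered child is not cached itself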

How to parallelize list iteration and be able to create RDDs in Spark?

I've just started learning Spark and Scala.
From what I understand, it's bad practice to use collect, because it gathers all the data in the driver's memory, and it's also bad practice to use a for loop, because the code inside the block is not executed concurrently by more than one node.
Now, I have a List of numbers from 1 to 10:
List(1,2,3,4,5,6,7,8,9,10)
and for each of these values I need to generate an RDD using that value.
In such cases, how can I generate the RDDs?
By doing
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)).map(number => generate_rdd(number))
I get an error because an RDD cannot be created inside another RDD.
What is the best workaround to this problem?
Assuming generate_rdd is defined like def generate_rdd(n: Int): Seq[Something], i.e. it produces a local collection rather than an RDD, what you need is flatMap instead of map:
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)).flatMap(number => generate_rdd(number))
This gives a single RDD that is the concatenation of the results for the numbers 1 to 10. If generate_rdd really has to return an RDD, the RDDs must be created on the driver and combined there, because an RDD cannot be created inside another RDD's transformation.
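A minimal sketch of the driver-side variant, with a hypothetical generate_rdd for illustration:
import org.apache.spark.rdd.RDD

// Hypothetical generator; the real one would build an RDD from the number in some other way.
def generate_rdd(n: Int): RDD[Int] = sc.parallelize(1 to n)

// Build the RDDs on the driver and concatenate them with sc.union.
val combined: RDD[Int] = sc.union((1 to 10).map(generate_rdd))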
Assuming that the number of RDDs you would like to create is relatively small, so that the parallelization itself does not need to be done through an RDD, we can use Scala's parallel collections instead. For example, I tried to count the number of lines in about 40 HDFS files simultaneously using the following piece of code [ignore the setting of the delimiter; for newline-delimited text this could just as well have been sc.textFile]:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "~^~")

val parSeq = List("path of file1.xsv", "path of file2.xsv", ...).par
parSeq.map { x =>
  val rdd = sc.newAPIHadoopFile(x, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  println(rdd.count())
}
In the Spark UI you can see that most of the RDD count operations started at the same time.

Actions on the same apache Spark RDD cause all statements re-execution

I am using Apache Spark to process a huge amount of data. I need to execute many Spark actions on the same RDD. My code looks like the following:
val rdd = /* Get the rdd using the SparkContext */
val map1 = rdd.map(/* Some transformation */)
val map2 = map1.map(/* Some other transformation */)
map2.count
val map3 = map2.map(/* More transformation */)
map3.count
The problem is that calling the second action map3.count forces the re-execution of the transformations rdd.map and map1.map.
What the hell is going on? I think the DAG built by Spark is responsible for this behaviour.
This is expected behavior. Unless one of the ancestors can be fetched from the cache (typically this means it has been persisted explicitly, or implicitly during a shuffle), every action recomputes the whole lineage.
Recomputation can also be triggered if the RDD has been persisted but the data has been lost, removed from the cache, or the amount of available space is too low to store all records.
In this particular case you should cache in the following order:
...
val map2 = map1.map(/* Some other transformation */)
map2.cache
map2.count
val map3 = map2.map(/* More transformation */)
...
if you want to avoid repeated evaluation of rdd, map1 and map2.
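If memory is tight, you can also choose an explicit storage level instead of the default cache(), and release the data once it is no longer needed; a small sketch (MEMORY_AND_DISK is just one possible choice):
import org.apache.spark.storage.StorageLevel

map2.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk instead of recomputing when memory runs out
// ... run the actions that reuse map2 here ...
map2.unpersist() // free the cached blocks once they are no longer needed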