How to concatenate transformations on a spark scala dataframe?

I am teaching myself scala (so as to use it with Apache Spark) and wanted to know if there would be some way to concatenate a series of transformations on a Spark DataFrame. E.g. let's assume we have a list of transformations
l: List[(String, String)] = List(("field1", "nonEmpty"), ("field2", "notNull"))
and a Spark DataFrame
df, such that the desired result would be
df.filter(df("field1") =!= "").filter(df("field2").isNotNull).
I was thinking perhaps this could be done using function composition or list folding or something, but I really don't know how. Any help would be greatly appreciated.
Thanks!

Yes, it is perfectly possible. But it depends on what you really want: Spark provides Pipelines, which allow you to compose your transformations and create a pipeline that can be serialized. You can create your own custom transformers (here is an example), include your "filter" stages in them, and reuse them later, for example in Spark Structured Streaming.
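To give a rough idea, here is a minimal sketch of what such a custom transformer could look like; the class name, parameter, and filter logic are my own assumptions for illustration, not taken from the linked example:
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical transformer that drops rows where the configured column is empty
class NonEmptyFilter(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("nonEmptyFilter"))

  final val inputCol: Param[String] =
    new Param[String](this, "inputCol", "column that must be non-empty")
  def setInputCol(value: String): this.type = set(inputCol, value)

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.filter(dataset($(inputCol)) =!= "").toDF()

  // Filtering rows does not change the schema
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): NonEmptyFilter = defaultCopy(extra)
}
A stage like this can then be chained with others through a Pipeline's setStages.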
Another option is to use Spark Datasets and the transform API, which seems more functional and elegant.
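A minimal sketch of the transform-based approach, reusing the df from the question; the helper names nonEmpty and notNull are mine, not part of Spark's API:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helpers, one per kind of check in the question
def nonEmpty(field: String)(df: DataFrame): DataFrame = df.filter(col(field) =!= "")
def notNull(field: String)(df: DataFrame): DataFrame = df.filter(col(field).isNotNull)

val result = df
  .transform(nonEmpty("field1"))
  .transform(notNull("field2"))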
Scala gives you plenty of ways to build your own API, but take a look at these two approaches first.

Yes, you can fold over an existing DataFrame. You can keep all the filter conditions in a list of Columns and not bother with other intermediary types:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val df: DataFrame = ??? // your DataFrame

val columns = List(
  col("1") =!= "",
  col("2").isNotNull,
  col("3") > 10
)

val filtered = columns.foldLeft(df)((df, cond) => df.filter(cond))
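To connect that back to the List[(String, String)] from the question, one possible sketch is to translate each (field, check) pair into a Column first and then fold; the meaning of "nonEmpty" and "notNull" below is my assumption:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

val l = List(("field1", "nonEmpty"), ("field2", "notNull"))

// Translate each (field, check) pair into a filter condition;
// an unknown check name would throw a MatchError in this sketch.
val conditions: List[Column] = l.map {
  case (field, "nonEmpty") => col(field) =!= ""
  case (field, "notNull")  => col(field).isNotNull
}

val result = conditions.foldLeft(df)((acc, cond) => acc.filter(cond))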

Related

How to use Latent Dirichlet Allocation (migrating from spark.mllib package)?

I am using Apache Spark 2.1.2 and I want to use Latent Dirichlet allocation (LDA).
Previously I was using the org.apache.spark.mllib package and I could run this without any problems, but now, after switching to spark.ml, I am getting an error.
val lda = new LDA().setK(numTopics).setMaxIter(numIterations)
val docs = spark.createDataset(documents)
val ldaModel = lda.fit(docs)
As you may have noticed, I'm converting documents RDD to a dataset object and am not sure if this is the correct way of doing this.
In this last line with .fit I am getting the following error:
java.lang.IllegalArgumentException: Field "features" does not exist.
My docs dataset looks like this:
scala> docs.take(2)
res28: Array[(Long, org.apache.spark.ml.linalg.Vector)] = Array((0,(7336,[1,2,4,5,12,13,19,24,26,42,48,49,57,59,63,73,81,89,99,106,113,114,141,151,157,160,177,181,198,261,266,267,272,297,307,314,315,359,383,385,410,416,422,468,471,527,564,629,717,744,763,837,890,928,932,951,961,1042,1134,1174,1305,1604,1653,1850,2119,2159,2418,2634,2836,3002,3132,3594,4103,4316,4852,5065,5107,5632,5945,6378,6597,6658],[1.0,1.0,1.0.......
My previous documents before converting them to a dataset:
documents: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)] = MapPartitionsRDD[2520]
How to get rid of the error above?
The main difference between spark.mllib and spark.ml is that spark.ml operates on DataFrames (or Datasets), while spark.mllib operates directly on RDDs with a very specific structure.
You don't need to do much to make your code work with spark.ml, but I'd still suggest going through its documentation and understanding the differences, because you will run into more and more of them as you shift further towards spark.ml. A good starting page with all the basics is https://spark.apache.org/docs/2.1.0/ml-pipeline.html.
As for your code, all that is needed is to give each column the correct name and it should work just fine. Probably the easiest way to do that is to use the implicit toDF method on the underlying RDD:
import spark.implicits._
val lda = new LDA().setK(numTopics).setMaxIter(numIterations)
val docs = documents.toDF("label", "features")
val ldaModel = lda.fit(docs)

Spark: unpersist RDDs for which I have lost the reference

How can I unpersist RDDs that were generated in an MLlib model and for which I don't have a reference?
I know in pyspark you could unpersist all dataframes with sqlContext.clearCache(), is there something similar but for RDDs in the scala API? Furthermore, is there a way I could unpersist only some RDDs without having to unpersist all?
You can call
val rdds = sparkContext.getPersistentRDDs // result is Map[Int, RDD[_]]
and then filter the values to get the ones that you want (1):
rdds.filter(x => filterLogic(x._2)).foreach(x => x._2.unpersist())
(1) - written by hand, without compiler - sorry if there's some error, but there shouldn't be ;)
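For example, here is a sketch of dropping only the cached RDDs whose name matches some pattern; the startsWith check is a placeholder for whatever filterLogic you actually need:
// getPersistentRDDs returns Map[Int, RDD[_]]; rdd.name may be null if it was never set
sparkContext.getPersistentRDDs
  .filter { case (_, rdd) => Option(rdd.name).exists(_.startsWith("temp")) } // placeholder criterion
  .foreach { case (_, rdd) => rdd.unpersist() }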

Split Spark DataFrame based on condition

I need something similar to the randomSplit function:
val Array(df1, df2) = myDataFrame.randomSplit(Array(0.6, 0.4))
However, I need to split myDataFrame based on a boolean condition. Does anything like the following exist?
val Array(df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)
I'd like not to do two separate .filter calls.
Unfortunately the DataFrame API doesn't have such a method; to split by a condition you'll have to perform two separate filter transformations:
import org.apache.spark.sql.functions.{col, not}

myDataFrame.cache() // recommended to prevent repeating the calculation
val condition = col("myColumn") > 100
val df1 = myDataFrame.filter(condition)
val df2 = myDataFrame.filter(not(condition))
I understand that caching and filtering twice looks a bit ugly, but please bear in mind that DataFrames are translated to RDDs, which are evaluated lazily, i.e. only when they are directly or indirectly used in an action.
If a method booleanSplit as suggested in the question existed, the result would be translated to two RDDs, each of which would be evaluated lazily. One of the two RDDs would be evaluated first and the other second, strictly after the first. At the point the first RDD is evaluated, the second RDD would not yet have "come into existence". (EDIT: I just noticed that there is a similar question for the RDD API with an answer that gives similar reasoning.)
To actually gain any performance benefit, the second RDD would have to be (partially) persisted during the iteration of the first RDD (or, actually, during the iteration of the parent RDD of both, which is triggered by the iteration of the first RDD). IMO this wouldn't align overly well with the design of the rest of the RDD API. Not sure if the performance gains would justify this.
I think the best you can achieve is to avoid writing the two filter calls directly in your business code: write an implicit class with a booleanSplit utility method that does that part in a similar way to Tzach Zohar's answer, perhaps using something along the lines of myDataFrame.withColumn("__condition_value", condition).cache() so that the value of the condition is not calculated twice.
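A rough sketch of such a utility, under the assumption that materialising the condition in a helper column and caching is acceptable; none of these names exist in Spark itself:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, not}

object DataFrameSplitOps {
  implicit class BooleanSplitOps(df: DataFrame) {
    def booleanSplit(condition: Column): (DataFrame, DataFrame) = {
      // Compute the condition once, cache it, then filter on the precomputed column
      val withCond = df.withColumn("__condition_value", condition).cache()
      val pass = withCond.filter(col("__condition_value")).drop("__condition_value")
      val fail = withCond.filter(not(col("__condition_value"))).drop("__condition_value")
      (pass, fail)
    }
  }
}

// Usage:
// import DataFrameSplitOps._
// val (df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)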

How to parallelize list iteration and be able to create RDDs in Spark?

I've just started learning Spark and Scala.
From what I understand, it's bad practice to use collect, because it gathers all the data into the driver's memory, and it's also bad practice to use a for loop, because the code inside the block is not executed concurrently by more than one node.
Now, I have a List of numbers from 1 to 10:
List(1,2,3,4,5,6,7,8,9,10)
and for each of these values I need to generate an RDD using that value.
In such cases, how can I generate the RDD?
By doing
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)).map(number => generate_rdd(number))
I get an error because an RDD cannot be created inside another RDD.
What is the best workaround to this problem?
Assuming generate_rdd is defined like def generate_rdd(n: Int): RDD[Something], keep in mind that RDDs cannot be created inside another RDD's map or flatMap, because the SparkContext is only available on the driver. Instead, build the RDDs on the driver from the plain list and concatenate them with sc.union:
sc.union(List(1,2,3,4,5,6,7,8,9,10).map(number => generate_rdd(number)))
This gives a single RDD that is the concatenation of all the RDDs created for the numbers from 1 to 10.
Assuming the number of RDDs you would like to create is relatively small, so the parallelization itself does not need to be done through an RDD, we can use Scala's parallel collections instead. For example, I tried to count the number of lines in about 40 HDFS files simultaneously using the following piece of code (ignore the setting of the delimiter; for newline-delimited text this could just as well have been sc.textFile):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "~^~")

// .par turns the list into a parallel collection, so the counts run concurrently from the driver
val parSeq = List("path of file1.xsv", "path of file2.xsv", ...).par
parSeq.map { x =>
  val rdd = sc.newAPIHadoopFile(x, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  println(rdd.count())
}
Part of the output in the Spark UI showed that most of the RDD count operations started at the same time.

Spark: Single pipelined scala command better than separate commands?

I am using Spark with Scala. I wanted to know whether a single one-line command is better than separate commands. What are the benefits, if any? Does it gain more efficiency in terms of speed? Why?
for e.g.
var d = data.filter(_(1)==user).map(f => (f(2),f(5).toInt)).groupByKey().map(f=> (f._1,f._2.count(x=>true), f._2.sum))
against
var a = data.filter(_(1)==user)
var b = a.map(f => (f(2),f(5).toInt))
var c = b.groupByKey()
var d = c.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
There is no performance difference between your two examples; the decision to chain RDD transformations or to explicitly represent the intermediate RDDs is just a matter of style. Spark's lazy evaluation means that no actual distributed computation will be performed until you invoke an RDD action like take() or count().
During execution, Spark will pipeline as many transformations as possible. For your example, Spark won't materialize the entire filtered dataset before it maps it: the filter() and map() transformations will be pipelined together and executed in a single stage. The groupByKey() transformation (usually) needs to shuffle data over the network, so it's executed in a separate stage. Spark would materialize the output of filter() only if it had been cache()d.
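If you want to see how Spark splits these transformations into stages for yourself, one quick check (reusing the data and user names from the question) is the RDD's toDebugString, which prints the lineage with shuffle boundaries as separate indentation levels:
// Just an inspection sketch; `data` and `user` are assumed from the question
val d = data.filter(_(1) == user)
  .map(f => (f(2), f(5).toInt))
  .groupByKey()

// filter and map appear pipelined within one stage; groupByKey sits behind a ShuffledRDD
println(d.toDebugString)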
You might need to use the second style if you want to cache an intermediate RDD and perform further processing on it. For example, if I wanted to perform multiple actions on the output of the groupByKey() transformation, I would write something like
val grouped = data.filter(_(1)==user)
.map(f => (f(2),f(5).toInt))
.groupByKey()
.cache()
val mapped = grouped.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
val counted = grouped.count()
There is no difference in terms of execution, but you might want to consider the readability of your code. I would go with your first example but like this:
var d = data.filter(_(1)==user)
.map(f => (f(2),f(5).toInt))
.groupByKey()
.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
Really, this is more of a Scala question than a Spark one, though. Still, as you can see from the word count example in Spark's documentation
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
you don't need to worry about those kinds of things. The Scala language (through laziness, etc.) and Spark's RDD implementation handle all that at a higher level of abstraction.
If you find really bad performance, then you should take the time to explore why. As Knuth said, "premature optimization is the root of all evil."