I have tried to make a function that can enrich a given DataFrame with a "session" column using a window function. So I need to use partitionBy and orderBy.
val by_uuid_per_date = Window.partitionBy("uuid").orderBy("year","month","day")
// A Session = A day of events for a certain user. uuid x (year+month+day)
val enriched_df = df
.withColumn("session", dense_rank().over(by_uuid_per_date))
.orderBy("uuid","timestamp")
.select("uuid","year","month","day","session")
This works perfectly, but when I try to write a function that encapsulates this behavior:
PS: I used the _* splat operator.
def enrich_with_session(df:DataFrame,
window_partition_cols:Array[String],
window_order_by_cols:Array[String],
presentation_order_by_cols:Array[String]):DataFrame={
val by_uuid_per_date = Window.partitionBy(window_partition_cols: _*).orderBy(window_order_by_cols: _*)
df.withColumn("session", dense_rank().over(by_uuid_per_date))
.orderBy(presentation_order_by_cols:_*)
.select("uuid","year","month","mday","session")
}
I get the following error:
notebook:6: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
val by_uuid_per_date = Window.partitionBy(window_partition_cols: _*).orderBy(window_order_by_cols: _*)
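partitionBy and orderBy each have two overloads: one takes Column varargs (cols: Column*) and the other takes (colName: String, colNames: String*). Only the Column overload is a pure *-parameter, so a Seq[Column] or Array[Column] can be expanded with : _*, while an Array[String] cannot. See below: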
val data = Seq(
(1,99),
(1,99),
(1,70),
(1,20)
).toDF("id","value")
data.select('id,'value, rank().over(Window.partitionBy('id).orderBy('value))).show()
val partitionBy: Seq[Column] = Seq(data("id"))
val orderBy: Seq[Column] = Seq(data("value"))
data.select('id,'value, rank().over(Window.partitionBy(partitionBy:_*).orderBy(orderBy:_*))).show()
So in this case, your code should look like this:
def enrich_with_session(df:DataFrame,
window_partition_cols:Array[String],
window_order_by_cols:Array[String],
presentation_order_by_cols:Array[String]):DataFrame={
val window_partition_cols_2: Array[Column] = window_partition_cols.map(df(_))
val window_order_by_cols_2: Array[Column] = window_order_by_cols.map(df(_))
val presentation_order_by_cols_2: Array[Column] = presentation_order_by_cols.map(df(_))
val by_uuid_per_date = Window.partitionBy(window_partition_cols_2: _*).orderBy(window_order_by_cols_2: _*)
df.withColumn("session", dense_rank().over(by_uuid_per_date))
.orderBy(presentation_order_by_cols_2:_*)
.select("uuid","year","month","mday","session")
}
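For what it's worth, the compile error comes from the shape of the String overloads rather than from anything specific to windows. Here is a minimal sketch in plain Scala; byName and byCols are hypothetical stand-ins that only mimic the two parameter shapes, they are not Spark functions:

```scala
object SplatDemo {
  // Hypothetical stand-in for the (colName: String, colNames: String*) shape.
  def byName(colName: String, colNames: String*): List[String] =
    colName :: colNames.toList

  // Hypothetical stand-in for the pure varargs (cols: Column*) shape.
  def byCols(cols: String*): List[String] = cols.toList

  val names: Array[String] = Array("year", "month", "day")

  // byName(names: _*) does not compile: the first parameter is not a *-parameter.
  // Splitting the array into head and tail satisfies both parameters.
  val viaHeadTail: List[String] = byName(names.head, names.tail: _*)

  // A pure *-parameter accepts the whole array directly.
  val viaVarargs: List[String] = byCols(names: _*)
}
```

By the same reasoning, Window.partitionBy(cols.head, cols.tail: _*) should be an alternative to mapping the names to Column first, assuming the array is never empty.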
Related
I have a method, createDataFrame, which returns an Option[DataFrame]. I then want to 'get' the DataFrame and use it in later code. I'm getting a type mismatch that I can't fix:
val df2: DataFrame = createDataFrame("filename.txt") match {
case Some(df) => { //proceed with pipeline
df.filter($"activityLabel" > 0)
}
case None => println("could not create dataframe")
}
val Array(trainData, testData) = df2.randomSplit(Array(0.5,0.5),seed = 12345)
I need df2 to be of type: DataFrame otherwise later code won't recognise df2 as a DataFrame e.g. val Array(trainData, testData) = df2.randomSplit(Array(0.5,0.5),seed = 12345)
However, the case None statement is not of type DataFrame, it returns Unit, so won't compile. But if I don't declare the type of df2 the later code won't compile as it is not recognised as a DataFrame. If someone can suggest a fix that would be helpful - been going round in circles with this for some time. Thanks
What you need is a map. Mapping over an Option[T] means: "if it is None, do nothing; otherwise transform the content of the Option into something else". In your case the content is the DataFrame itself. So inside the function passed to myDFOpt.map() you can put all your DataFrame transformations, and only at the end do the pattern matching you did, where you may print something if you have a None.
edit:
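// map keeps the result wrapped in an Option, so the declared type must be an Option too
val splits: Option[Array[DataFrame]] = createDataFrame("filename.txt").map { df =>
val filteredDF = df.filter($"activityLabel" > 0)
filteredDF.randomSplit(Array(0.5,0.5),seed = 12345) // last expression is the value of the block
}
if (splits.isEmpty) println("could not create dataframe")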
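To see how map behaves on an Option without needing Spark, here is a small sketch in plain Scala. The load function is a hypothetical stand-in for createDataFrame, and a Vector[Int] plays the role of the DataFrame:

```scala
object OptionDemo {
  // Hypothetical stand-in for createDataFrame: Some(data) on success, None on failure.
  def load(name: String): Option[Vector[Int]] =
    if (name.endsWith(".txt")) Some(Vector(-1, 0, 3, 5)) else None

  // map transforms the content only when it exists; a None passes through untouched.
  def positives(name: String): Option[Vector[Int]] =
    load(name).map(_.filter(_ > 0))

  // Pattern match once, at the end, to handle the missing case.
  def describe(name: String): String = positives(name) match {
    case Some(rows) => s"kept ${rows.size} rows"
    case None       => "could not create dataframe"
  }
}
```

The point of the design is that the type stays Option[...] until the very end, so no branch ever has to return Unit where a DataFrame is expected.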
I'm trying to load several input files in to a single dataframe:
val inputs = List[String]("input1.txt", "input2.txt", "input3.txt")
val dataFrames = for (
i <- inputs;
df <- sc.textFile(i).toDF()
) yield {df}
val inputDataFrame = unionAll(dataFrames, sqlContext)
// union of all given DataFrames
private def unionAll(dataFrames: Seq[DataFrame], sqlContext: SQLContext): DataFrame = dataFrames match {
case Nil => sqlContext.emptyDataFrame
case head :: Nil => head
case head :: tail => head.unionAll(unionAll(tail, sqlContext))
}
Compiler says
Error:(40, 8) type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: scala.collection.GenTraversableOnce[?]
df <- sc.textFile(i).toDF()
^
Any idea?
First, SQLContext.read.text(...) accepts multiple filename arguments, so you can simply do:
val inputs = List[String]("input1.txt", "input2.txt", "input3.txt")
val inputDataFrame = sqlContext.read.text(inputs: _*)
Or:
val inputDataFrame = sqlContext.read.text("input1.txt", "input2.txt", "input3.txt")
As for your code - when you write:
val dataFrames = for (
i <- inputs;
df <- sc.textFile(i).toDF()
) yield df
It is translated into:
inputs.flatMap(i => sc.textFile(i).toDF().map(df => df))
Which can't compile, because flatMap expects a function that returns a GenTraversableOnce[?], while the supplied function returns an RDD[Row] (See signature of DataFrame.map). In other words, when you write df <- sc.textFile(i).toDF() you're actually taking each row in the dataframe, and yielding a new RDD with these rows, which isn't what you intended.
What you were trying to do is simpler:
val dataFrames = for (
i <- inputs
) yield sc.textFile(i).toDF()
But, as mentioned at the beginning, the recommended approach is using sqlContext.read.text.
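The desugaring described above can be checked on ordinary collections. This is only a sketch with made-up inputs (strings standing in for file names, their characters standing in for rows):

```scala
object ForDesugarDemo {
  val inputs: List[String] = List("ab", "cd")

  // Two generators: the second one iterates over the elements of each value.
  val twoGenerators: List[Char] = for (i <- inputs; c <- i.toList) yield c

  // What the compiler turns the expression above into.
  val desugared: List[Char] = inputs.flatMap(i => i.toList.map(c => c))

  // One generator: each input is mapped to exactly one result.
  val oneGenerator: List[List[Char]] = for (i <- inputs) yield i.toList
}
```

With two generators the result is flattened element by element, which is why the original code tried to flatMap over the rows of each dataframe instead of collecting the dataframes themselves.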
I have an RDD[String], wordRDD. I also have a function that creates an RDD[String] from a string/word. I would like to create a new RDD for each string in wordRDD. Here are my attempts:
1) Failed because Spark does not support nested RDDs:
var newRDD = wordRDD.map( word => {
// execute myFunction()
(new MyClass(word)).myFunction()
})
2) Failed (possibly due to scope issue?):
var newRDD = sc.parallelize(new Array[String](0))
val wordArray = wordRDD.collect
for (w <- wordArray){
newRDD = sc.union(newRDD,(new MyClass(w)).myFunction())
}
My ideal result would look like:
// input RDD (wordRDD)
wordRDD: org.apache.spark.rdd.RDD[String] = ('apple','banana','orange'...)
// myFunction behavior
new MyClass('apple').myFunction(): RDD[String] = ('pple','aple'...'appl')
// after executing myFunction() on each word in wordRDD:
newRDD: RDD[String] = ('pple','aple',...,'anana','bnana','baana',...)
I found a relevant question here: Spark when union a lot of RDD throws stack overflow error, but it didn't address my issue.
Use flatMap to get RDD[String] as you desire.
var allWords = wordRDD.flatMap { word =>
(new MyClass(word)).myFunction().collect()
}
You cannot create an RDD from within another RDD.
However, it is possible to rewrite your function myFunction: String => RDD[String], which generates all words from the input where one letter is removed, into another function modifiedFunction: String => Seq[String] such that it can be used from within an RDD. That way, it will also be executed in parallel on your cluster. Having the modifiedFunction you can obtain the final RDD with all words by simply calling wordRDD.flatMap(modifiedFunction).
The crucial point is to use flatMap (to map and flatten the transformations):
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val input = sc.parallelize(Seq("apple", "ananas", "banana"))
// RDD("pple", "aple", ..., "nanas", ..., "anana", "bnana", ...)
val result = input.flatMap(modifiedFunction)
}
def modifiedFunction(word: String): Seq[String] = {
word.indices map {
index => word.substring(0, index) + word.substring(index+1)
}
}
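If it helps to see the difference between map and flatMap before running anything on a cluster, the same function can be tried on a local List. This is a sketch only; dropOneLetter is a hypothetical name for the same logic as modifiedFunction:

```scala
object FlatMapDemo {
  // Same logic as modifiedFunction: drop one character at each index.
  def dropOneLetter(word: String): Seq[String] =
    word.indices.map(i => word.substring(0, i) + word.substring(i + 1))

  val words: List[String] = List("apple", "kiwi")

  // map keeps one collection per input word (nested result).
  val nested: List[Seq[String]] = words.map(dropOneLetter)

  // flatMap concatenates the per-word collections into one flat list.
  val flat: List[String] = words.flatMap(dropOneLetter)
}
```

Note that repeated letters produce duplicates ("apple" yields "aple" twice), so you may want to call distinct on the result if that matters for your use case.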
I'm trying to transform a dataframe via a function that takes an array as a parameter. My code looks something like this:
def getCategory(categories:Array[String], input:String): String = {
categories(input.toInt)
}
val myArray = Array("a", "b", "c")
val myCategories =udf(getCategory _ )
val df = sqlContext.parquetFile("myfile.parquet")
val df1 = df.withColumn("newCategory", myCategories(lit(myArray), col("myInput")))
However, lit doesn't like arrays and this script errors. I tried defining a new partially applied function and then the udf after that:
val newFunc = getCategory(myArray, _:String)
val myCategories = udf(newFunc)
val df1 = df.withColumn("newCategory", myCategories(col("myInput")))
This doesn't work either: I get a NullPointerException and it appears myArray is not being recognized. Any ideas on how to pass an array as a parameter to a function used with a dataframe?
On a separate note, any explanation as to why doing something simple like using a function on a dataframe is so complicated (define function, redefine it as UDF, etc, etc)?
Most likely not the prettiest solution but you can try something like this:
def getCategory(categories: Array[String]) = {
udf((input:String) => categories(input.toInt))
}
df.withColumn("newCategory", getCategory(myArray)(col("myInput")))
You could also try an array of literals:
val getCategory = udf(
// array columns reach a Scala UDF as a Seq, so declare Seq[String] rather than Array[String]
(input:String, categories: Seq[String]) => categories(input.toInt))
df.withColumn(
"newCategory", getCategory($"myInput", array(myArray.map(lit(_)): _*)))
On a side note using Map instead of Array is probably a better idea:
def mapCategory(categories: Map[String, String], default: String) = {
udf((input:String) => categories.getOrElse(input, default))
}
val myMap = Map[String, String]("1" -> "a", "2" -> "b", "3" -> "c")
df.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput")))
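The first and third variants above rely on a closure: the outer function receives the array or map and returns a one-argument function that captures it. A minimal sketch of just that mechanism in plain Scala, without the udf wrapper (categoryLookup and mapLookup are hypothetical names):

```scala
object ClosureDemo {
  // Returns a function that closes over `categories`; wrapping the lambda
  // in udf(...) would follow the same shape.
  def categoryLookup(categories: Array[String]): String => String =
    input => categories(input.toInt)

  // Map-based variant with a default for unknown keys.
  def mapLookup(categories: Map[String, String], default: String): String => String =
    input => categories.getOrElse(input, default)
}
```

The array version still throws on a non-numeric or out-of-range input, which is one reason the Map with a default is the safer choice.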
Since Spark 1.5.0 you can also use an array function:
import org.apache.spark.sql.functions.array
val colArray = array(myArray map(lit _): _*)
myCategories(lit(colArray), col("myInput"))
See also Spark UDF with varargs
I am trying to perform Scala operation on Shark. I am creating an RDD as follows:
val tmp: shark.api.TableRDD = sc.sql2rdd("select duration from test")
I need to convert it to RDD[Array[Double]]. I tried toArray, but it doesn't seem to work.
I also tried converting it to Array[String] and then converting using map as follows:
val tmp_2 = tmp.map(row => row.getString(0))
val tmp_3 = tmp_2.map { row =>
val features = Array[Double] (row(0))
}
But this gives me an RDD[Unit], which cannot be used in the function. Is there any other way to proceed with this type conversion?
Edit: I also tried using toDouble, but this gives me an RDD[Double] type, not RDD[Array[Double]]
val tmp_5 = tmp_2.map(_.toDouble)
Edit 2:
I managed to do this as follows:
A sample of the data:
296.98567000000003
230.84362999999999
212.89751000000001
914.02404000000001
305.55383
A Spark Table RDD was created first.
val tmp = sc.sql2rdd("select duration from test")
I made use of getString to translate it to an RDD[String] and then converted it to an RDD[Array[Double]].
val duration = tmp.map(row => Array[Double](row.getString(0).toDouble))
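For anyone wondering why the earlier attempt produced Unit: a block whose last statement is a val definition has no result value. Here is a small sketch on a local Seq (plain Scala, using two of the sample values above rather than a TableRDD):

```scala
object UnitVsValueDemo {
  val raw: Seq[String] = Seq("296.98567000000003", "305.55383")

  // The block ends with a `val` definition, so each element becomes Unit.
  val units: Seq[Unit] = raw.map { s => val features = Array(s.toDouble) }

  // Making the array the last expression yields Seq[Array[Double]].
  val arrays: Seq[Array[Double]] = raw.map(s => Array(s.toDouble))
}
```

The same rule applies inside RDD.map, so returning the array as the final expression is what gives RDD[Array[Double]].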