Loading several input files into one Dataframe in Scala / Spark 1.6 - scala

I'm trying to load several input files into a single DataFrame:
val inputs = List[String]("input1.txt", "input2.txt", "input3.txt")
val dataFrames = for (
i <- inputs;
df <- sc.textFile(i).toDF()
) yield {df}
val inputDataFrame = unionAll(dataFrames, sqlContext)
// union of all given DataFrames
private def unionAll(dataFrames: Seq[DataFrame], sqlContext: SQLContext): DataFrame = dataFrames match {
case Nil => sqlContext.emptyDataFrame
case head :: Nil => head
case head :: tail => head.unionAll(unionAll(tail, sqlContext))
}
The compiler says:
Error:(40, 8) type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: scala.collection.GenTraversableOnce[?]
df <- sc.textFile(i).toDF()
^
Any idea?

First, SQLContext.read.text(...) accepts multiple filename arguments, so you can simply do:
val inputs = List[String]("input1.txt", "input2.txt", "input3.txt")
val inputDataFrame = sqlContext.read.text(inputs: _*)
Or:
val inputDataFrame = sqlContext.read.text("input1.txt", "input2.txt", "input3.txt")
As for your code - when you write:
val dataFrames = for (
i <- inputs;
df <- sc.textFile(i).toDF()
) yield df
It is translated into:
inputs.flatMap(i => sc.textFile(i).toDF().map(df => df))
This can't compile, because flatMap expects a function that returns a GenTraversableOnce[?], while the supplied function returns an RDD[Row] (see the signature of DataFrame.map). In other words, when you write df <- sc.textFile(i).toDF() you're actually taking each row in the DataFrame and yielding a new RDD with those rows, which isn't what you intended.
What you were trying to do is simpler:
val dataFrames = for (i <- inputs) yield sc.textFile(i).toDF()
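With that fix, dataFrames is a List[DataFrame], and you can still combine everything with the unionAll helper from your question:
val inputDataFrame = unionAll(dataFrames, sqlContext)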
But, as mentioned at the beginning, the recommended approach is using sqlContext.read.text.

Cannot splat an Array into function's arguments that accepts varargs

I have tried to make a function that can enrich a given DataFrame with a "session" column using a window function. So I need to use partitionBy and orderBy.
val by_uuid_per_date = Window.partitionBy("uuid").orderBy("year","month","day")
// A Session = A day of events for a certain user. uuid x (year+month+day)
val enriched_df = df
.withColumn("session", dense_rank().over(by_uuid_per_date))
.orderBy("uuid","timestamp")
.select("uuid","year","month","day","session")
This works perfectly, but when I try to make a function that encapsulates this behavior:
PS: I used the _* splat operator.
def enrich_with_session(df:DataFrame,
window_partition_cols:Array[String],
window_order_by_cols:Array[String],
presentation_order_by_cols:Array[String]):DataFrame={
val by_uuid_per_date = Window.partitionBy(window_partition_cols: _*).orderBy(window_order_by_cols: _*)
df.withColumn("session", dense_rank().over(by_uuid_per_date))
.orderBy(presentation_order_by_cols:_*)
.select("uuid","year","month","mday","session")
}
I get the following error:
notebook:6: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
val by_uuid_per_date = Window.partitionBy(window_partition_cols: _*).orderBy(window_order_by_cols: _*)
partitionBy and orderBy expect Column arguments here (a Column* varargs), so you should splat a Seq[Column] or Array[Column] rather than an Array[String]; see below:
val data = Seq(
(1,99),
(1,99),
(1,70),
(1,20)
).toDF("id","value")
data.select('id,'value, rank().over(Window.partitionBy('id).orderBy('value))).show()
val partitionBy: Seq[Column] = Seq(data("id"))
val orderBy: Seq[Column] = Seq(data("value"))
data.select('id,'value, rank().over(Window.partitionBy(partitionBy:_*).orderBy(orderBy:_*))).show()
So in this case, your code should look like this:
def enrich_with_session(df:DataFrame,
window_partition_cols:Array[String],
window_order_by_cols:Array[String],
presentation_order_by_cols:Array[String]):DataFrame={
val window_partition_cols_2: Array[Column] = window_partition_cols.map(df(_))
val window_order_by_cols_2: Array[Column] = window_order_by_cols.map(df(_))
val presentation_order_by_cols_2: Array[Column] = presentation_order_by_cols.map(df(_))
val by_uuid_per_date = Window.partitionBy(window_partition_cols_2: _*).orderBy(window_order_by_cols_2: _*)
df.withColumn("session", dense_rank().over(by_uuid_per_date))
.orderBy(presentation_order_by_cols_2:_*)
.select("uuid","year","month","mday","session")
}
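Alternatively, if you prefer to keep working with column names directly, here is a minimal sketch (with the same imports as above in scope, and assuming the String-varargs overloads Window.partitionBy(colName, colNames*), WindowSpec.orderBy(colName, colNames*) and DataFrame.orderBy(sortCol, sortCols*)); the name enrich_with_session2 is just illustrative:
// Sketch only: relies on the String varargs overloads, so the arrays must be non-empty.
def enrich_with_session2(df: DataFrame,
                         window_partition_cols: Array[String],
                         window_order_by_cols: Array[String],
                         presentation_order_by_cols: Array[String]): DataFrame = {
  val by_uuid_per_date = Window
    .partitionBy(window_partition_cols.head, window_partition_cols.tail: _*)
    .orderBy(window_order_by_cols.head, window_order_by_cols.tail: _*)
  df.withColumn("session", dense_rank().over(by_uuid_per_date))
    .orderBy(presentation_order_by_cols.head, presentation_order_by_cols.tail: _*)
    .select("uuid", "year", "month", "mday", "session")
}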

How to fix this type error, after using UDF function in Spark SQL?

I want to explode my features column (type: ml.linalg sparse vector) into each feature's index and value, so I do the following:
def zipKeyValue(vec:linalg.Vector) : Array[(Int,Double)] = {
val indice:Array[Int] = vec.toSparse.indices;
val value:Array[Double] = vec.toSparse.values;
indice.zip(value)
}
val udf1 = udf( zipKeyValue _)
val df1 = df.withColumn("features",udf1(col("features")));
val df2 = df1.withColumn("features",explode(col("features")) );
val udf2 = udf( ( f:Tuple2[Int,Double]) => f._1.toString ) ;
val udf3 = udf( (f:Tuple2[Int,Double]) =>f._2) ;
val df3 = df2.withColumn("key",udf2(col("features"))).withColumn("value",udf3(col("features")));
df3.show();
But I got this error:
Failed to execute user defined function(anonfun$38: (struct<_1:int,_2:double>) => string)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
This confuses me, since my function zipKeyValue returns (Int, Double) tuples, but what I actually get is a struct<_1:int,_2:double>. How can I fix it?
You don't need UDFs here. Just select the nested fields:
df2
.withColumn("key", col("features._1"))
.withColumn("value", col("features._2"))
In the general case, you should use Rows, not Tuples:
import org.apache.spark.sql.Row
val udf2 = udf((f: Row) => f.getInt(0).toString)
val udf3 = udf((f: Row) => f.getDouble(1))
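Putting it together, a minimal sketch of the corrected pipeline (reusing your zipKeyValue UDF and column names, and using the nested-field access from the first suggestion) might look like this:
import org.apache.spark.sql.functions.{col, explode, udf}

// Wrap the question's zipKeyValue as a UDF and explode the resulting array of structs.
val udf1 = udf(zipKeyValue _)
val df1 = df.withColumn("features", udf1(col("features")))
val df2 = df1.withColumn("features", explode(col("features")))
// Each exploded element is a struct<_1:int,_2:double>; read its fields directly.
val df3 = df2
  .withColumn("key", col("features._1").cast("string"))
  .withColumn("value", col("features._2"))
df3.show()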

"Recursive value X$3 needs type" in a fold operation even though the types are specified

For the following fold invocation we can see that the types of each return value have been indicated:
(Note: the first three lines shown below are actually all on one line, #59, in the actual code.)
val (nRows, dfsOut, dfOut): (Int,DataFrameMap, DataFrame)
= (1 to nLevels).foldLeft((0, dfsIn, dfIn)) {
case ((nRowsPrior, dfsPrior, dfPrior), level) =>
..
(nnRows, dfs, dfOut1) // These return values are verified as correctly
// matching the listed return types
}
But we have the following error:
Error:(59, 10) recursive value x$3 needs type
val (nRows, dfsOut, dfOut): (Int,DataFrameMap, DataFrame) = (1 to nLevels).foldLeft((0, dfsIn, dfIn)) { case ((nRowsPrior, dfsPrior, dfPrior), level) =>
Column 10 points at the first entry, nRows, which is set as follows:
val nnRows = cntAccum.value.toInt
That is definitely an Int, so it is unclear what the root issue is.
(FYI, there is another similarly titled question, recursive value x$5 needs type, but that question was doing strange things with the output parameters, whereas mine is a straightforward value assignment.)
Here is an MCVE that does not have any dependencies:
trait DataFrameMap
trait DataFrame
val dfsIn: DataFrameMap = ???
val dfIn: DataFrame = ???
val nLevels: Int = 0
val (_, _) = (1, 2)
val (_, _) = (3, 4)
val (nRows, dfsOut, dfOut): (Int,DataFrameMap, DataFrame) =
(1 to nLevels).foldLeft((0, dfsIn, dfIn)) {
case ((nRowsPrior, dfsPrior, dfPrior), level) =>
val nnRows: Int = nRows
val dfs: DataFrameMap = ???
val dfOut1: DataFrame = ???
(nnRows, dfs, dfOut1)
}
It reproduces the error message exactly:
error: recursive value x$3 needs type
val (nRows, dfsOut, dfOut): (Int,DataFrameMap, DataFrame) =
^
You must have used nRows, dfsOut or dfOut somewhere inside the body of the foldLeft. This compiles just fine:
trait DataFrameMap
trait DataFrame
val dfsIn: DataFrameMap = ???
val dfIn: DataFrame = ???
val nLevels: Int = 0
val (_, _) = (1, 2)
val (_, _) = (3, 4)
val (nRows, dfsOut, dfOut): (Int,DataFrameMap, DataFrame) =
(1 to nLevels).foldLeft((0, dfsIn, dfIn)) {
case ((nRowsPrior, dfsPrior, dfPrior), level) =>
val nnRows: Int = ???
val dfs: DataFrameMap = ???
val dfOut1: DataFrame = ???
(nnRows, dfs, dfOut1)
}
Fun fact: the x$3 does not refer to dfOut (third component of the tuple), but rather to the entire tuple (nRows, dfsOut, dfOut) itself. This is why I had to add two (_, _) = ...'s before the val (nRows, dfsOut, dfOut) definition to get x$3 instead of x$1.
The problem was inside a print statement: the outer value being defined by the foldLeft was referenced by accident instead of the inner loop value.
info(s"$tag: ** ROWS ** ${props(OutputTag)} at ${props(OutputPath)} count=$nRows")
The $nRows is the outer-scoped value, and referencing it inside the fold body is what causes the recursion; the intention had been to reference $nnRows.
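In other words, the fix is simply to log the inner value in that line:
info(s"$tag: ** ROWS ** ${props(OutputTag)} at ${props(OutputPath)} count=$nnRows")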

How to convert var to List?

How can I convert one variable into two Lists?
Below is my input variable:
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
I want my result should be:
val first= List(List("name","name1"),List("add1"))
val second= List(List("id","id1"),List("city"))
First of all, the input is not valid JSON:
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
You have to turn it into an RDD of valid JSON strings (since you are using Apache Spark):
val validJsonRdd = sc.parallelize(Seq(input)).flatMap(x => x.replace(",", "\",\"").replace(":", "\":\"").replace("[", "{\"").replace("]", "\"}").replace("}{", "}&{").split("&"))
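For reference, applying that replace chain to the sample input turns each block into a valid JSON string:
validJsonRdd.take(1).foreach(println)
// prints: {"level":"1","var1":"name","var2":"id"}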
Once you have a valid JSON RDD, you can easily convert it to a DataFrame and then apply your logic:
import org.apache.spark.sql.functions._
val df = spark.read.json(validJsonRdd)
.groupBy("level")
.agg(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
.select(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
You should get the desired output in the DataFrame:
+------------------------------------------------+--------------------------------------------+
|var1 |var2 |
+------------------------------------------------+--------------------------------------------+
|[WrappedArray(name1, name2), WrappedArray(add1)]|[WrappedArray(id1, id2), WrappedArray(city)]|
+------------------------------------------------+--------------------------------------------+
And you can convert the arrays to lists if required.
To get the values exactly as in the question, you can do the following:
val rdd = df.collect().map(row => (row(0).asInstanceOf[Seq[Seq[String]]], row(1).asInstanceOf[Seq[Seq[String]]]))
val first = rdd(0)._1.map(x => x.toList).toList
//first: List[List[String]] = List(List(name1, name2), List(add1))
val second = rdd(0)._2.map(x => x.toList).toList
//second: List[List[String]] = List(List(id1, id2), List(city))
I hope the answer is helpful.

reduceByKey is the important function for achieving your required output; a step-by-step explanation of reduceByKey is worth reading for more detail. You can do the following:
val input="[level:1,var1:name1,var2:id1][level:1,var1:name2,var2:id2][level:2,var1:add1,var2:city]"
val groupedrdd = sc.parallelize(Seq(input)).flatMap(_.split("]\\[").map(x => {
val values = x.replace("[", "").replace("]", "").split(",").map(y => y.split(":")(1))
(values(0), (List(values(1)), List(values(2))))
})).reduceByKey((x, y) => (x._1 ::: y._1, x._2 ::: y._2))
val first = groupedrdd.map(x => x._2._1).collect().toList
//first: List[List[String]] = List(List(add1), List(name1, name2))
val second = groupedrdd.map(x => x._2._2).collect().toList
//second: List[List[String]] = List(List(city), List(id1, id2))
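Note that reduceByKey does not guarantee the order of the groups (the lists above come out with level 2 first). If the order matters, one small sketch of a fix (assuming the level keys sort lexicographically the way you want) is to sort by key before collecting:
val sortedRdd = groupedrdd.sortByKey()
val first = sortedRdd.map(_._2._1).collect().toList
// expected: List(List(name1, name2), List(add1))
val second = sortedRdd.map(_._2._2).collect().toList
// expected: List(List(id1, id2), List(city))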

Recursive value outputColumns needs type -- spark-scala

I am getting the error "recursive value outputColumns needs type" while running the code below. Can anyone help me with this?
import sqlContext.implicits._
import org.apache.spark.sql.types.StringType
val zipArrays = udf { seqs: Seq[Seq[String]] => for(i <- seqs.head.indices) yield seqs.fold(Seq.empty)((accu, seq) => accu :+ seq(i)) }
val columnsToSelect = Seq($"CP_PAY_MADE_ON", $"CP_PRV_TIN", $"CP_PAYER_835_ID")
val columnsToZip = Seq($"CLM_STR_DT", $"CLM_END_DT")
val outputColumns = columnsToSelect ++ columnsToZip.zipWithIndex.map { case (column, index) => $"col".getItem(index).as(column.toString())
val output = payment_summary_new_columns.select($"CP_PAY_MADE_ON", $"CP_PRV_TIN", $"CP_PAYER_835_ID", explode(zipArrays(array(columnsToZip: _*)))).select(outputColumns:_*) //gives error recursive value outputColumns needs type
output.show()