Convert RDD[String] to RDD[Row] to Dataframe Spark Scala - scala

I am reading in a file that has many spaces and need to filter out the space. Afterwards we need to convert it to a dataframe. Example input below.
2017123 ¦ ¦10¦running¦00000¦111¦-EXAMPLE
My solution to this was the following function which parses out all spaces and trims the file.
def truncateRDD(fileName : String): RDD[String] = {
val example = sc.textFile(fileName)
example.map(lines => lines.replaceAll("""[\t\p{Zs}]+""", ""))
}
However, I am not sure how to get it into a dataframe. sc.textFile returns a RDD[String]. I tried the case class way but the issue is we have 800 field schema, case class cannot go beyond 22.
I was thinking of somehow converting RDD[String] to RDD[Row] so I can use the createDataFrame function.
val DF = spark.createDataFrame(rowRDD, schema)
Any suggestions on how to do this?

First split/parse your strings into the fields.
rdd.map( line => parse(line)) where parse is some parsing function. It could be as simple as split but you may want something more robust. This will get you an RDD[Array[String]] or similar.
You can then convert to an RDD[Row] with rdd.map(a => Row.fromSeq(a))
From there you can convert to DataFrame wising sqlContext.createDataFrame(rdd, schema) where rdd is your RDD[Row] and schema is your schema StructType.

In your case simple way :
val RowOfRDD = truncateRDD("yourfilename").map(r => Row.fromSeq(r))
How to solve productarity issue if you are using scala 2.10 ?
However, I am not sure how to get it into a dataframe. sc.textFile
returns a RDD[String]. I tried the case class way but the issue is we
have 800 field schema, case class cannot go beyond 22.
Yes, There are some limitations like productarity but we can overcome...
you can do like below example for < versions 2.11 :
prepare a case class which extends Product and overrides methods.
like...
productArity():Int: This returns the size of the attributes. In our case, it's 33. So, our implementation looks like this:
productElement(n:Int):Any: Given an index, this returns the attribute. As protection, we also have a default case, which throws an IndexOutOfBoundsException exception:
canEqual (that:Any):Boolean: This is the last of the three functions, and it serves as a boundary condition when an equality check is being done against class:
Example implementation you can refer this Student case class which has 33 fields in it
Example student dataset description here

Related

Spark scala data frame udf returning rows

Say I have an dataframe which contains a column (called colA) which is a seq of row. I want to to append a new field to each record of colA. (And the new filed is associated with the former record, so I have to write an udf.)
How should I write this udf?
I have tried to write a udf, which takes colA as input, and output Seq[Row] where each record contains the new filed. But the problem is the udf cannot return Seq[Row]/ The exception is 'Schema for type org.apache.spark.sql.Row is not supported'.
What should I do?
The udf that I wrote:
val convert = udf[Seq[Row], Seq[Row]](blablabla...)
And the exception is java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
since spark 2.0 you can create UDFs which return Row / Seq[Row], but you must provide the schema for the return type, e.g. if you work with an Array of Doubles :
val schema = ArrayType(DoubleType)
val myUDF = udf((s: Seq[Row]) => {
s // just pass data without modification
}, schema)
But I cant really imagine where this is useful, I would rather return tuples or case classes (or Seq thereof) from the UDFs.
EDIT : It could be useful if your row contains more than 22 fields (limit of fields for tuples/case classes)
This is an old question, I just wanted to update it according to the new version of Spark.
Since Spark 3.0.0, the method that #Raphael Roth has mentioned is deprecated. Hence, you might get an AnalysisException. The reason is that the input closure using this method doesn't have type checking and the behavior might be different from what we expect in SQL when it comes to null values.
If you really know what you're doing, you need to set spark.sql.legacy.allowUntypedScalaUDF configuration to true.
Another solution is to use case class instead of schema. For example,
case class Foo(field1: String, field2: String)
val convertFunction: Seq[Row] => Seq[Foo] = input => {
input.map {
x => // do something with x and convert to Foo
}
}
val myUdf = udf(convertFunction)

Dataset.groupByKey + untyped aggregation functions

Suppose I have types like these:
case class SomeType(id: String, x: Int, y: Int, payload: String)
case class Key(x: Int, y: Int)
Then suppose I did groupByKey on a Dataset[SomeType] like this:
val input: Dataset[SomeType] = ...
val grouped: KeyValueGroupedDataset[Key, SomeType] =
input.groupByKey(s => Key(s.x, s.y))
Then suppose I have a function which determines which field I want to use in an aggregation:
val chooseDistinguisher: SomeType => String = _.id
And now I would like to run an aggregation function over the grouped dataset, for example, functions.countDistinct, using the field obtained by the function:
grouped.agg(
countDistinct(<something which depends on chooseDistinguisher>).as[Long]
)
The problem is, I cannot create a UDF from chooseDistinguisher, because countDistinct accepts a Column, and to turn a UDF into a Column you need to specify the input column names, which I cannot do - I do not know which name to use for the "values" of a KeyValueGroupedDataset.
I think it should be possible, because KeyValueGroupedDataset itself does something similar:
def count(): Dataset[(K, Long)] = agg(functions.count("*").as(ExpressionEncoder[Long]()))
However, this method cheats a bit because it uses "*" as the column name, but I need to specify a particular column (i.e. the column of the "value" in a key-value grouped dataset). Also, when you use typed functions from the typed object, you also do not need to specify the column name, and it works somehow.
So, is it possible to do this, and if it is, how to do it?
As I know it's not possible with agg transformation, which expects TypedColumn type which is constructed based on Column type using as method, so you need to start from not type-safe expression. If somebody knows solution I would be interested to see it...
If you need to use type-safe aggregation you can use one of below approaches:
mapGroups - where you can implement Scala function responsible for aggregating Iterator
implement your custom Aggregator as suggested above
First approach needs less code, so below I'm showing quick example:
def countDistinct[T](values: Iterator[T])(chooseDistinguisher: T => String): Long =
values.map(chooseDistinguisher).toSeq.distinct.size
ds
.groupByKey(s => Key(s.x, s.y))
.mapGroups((k,vs) => (k, countDistinct(vs)(_.name)))
In my opinion Spark Dataset type-safe API is still much less mature than not type safe DataFrame API. Some time ago I was thinking that it could be good idea to implement simple to use type-safe aggregation API for Spark Dataset.
Currently, this use case is better handled with DataFrame, which you can later convert back into a Dataset[A].
// Code assumes SQLContext implicits are present
import org.apache.spark.sql.{functions => f}
val colName = "id"
ds.toDF
.withColumn("key", f.concat('x, f.lit(":"), 'y))
.groupBy('key)
.agg(countDistinct(f.col(colName)).as("cntd"))

How to create udf containing Array (case class) for complex column in a dataframe

I have a dataframe which have a complex column datatype of Arraytype>. For transforming this dataframe I have created udf which can consume this column using Array [case class] as parameter. The main bottle neck here is when I create case class according to stucttype, the structfield name contains special characters for example "##field". So I provide same name to case class like this way case class (##field) and attach this to udf parameter. After interpreted in spark udf definition change name of case class field to this "$hash$hashfield". When performing transform using this dataframe it is failing because of this miss match. Please help ...
Due JVM limitations Scala stores identifiers in encoded form and currently Spark can't map ##field to $hash$hashfield.
One possible solution is to extract fields manually from raw row (but you need to know order of the fields in df, you can use df.schema for that):
val myUdf = udf { (struct: Row) =>
// Pattern match struct:
struct match {
case Row(a: String) => Foo(a)
}
// .. or extract values from Row
val `##a` = struct.getAs[String](0)
}

Convert Dataframe back to RDD of case class in Spark

I am trying to convert a dataframe of multiple case classes to an rdd of these multiple cases classes. I cant find any solution. This wrappedArray has drived me crazy :P
For example, assuming I am having the following:
case class randomClass(a:String,b: Double)
case class randomClass2(a:String,b: Seq[randomClass])
case class randomClass3(a:String,b:String)
val anRDD = sc.parallelize(Seq(
(randomClass2("a",Seq(randomClass("a1",1.1),randomClass("a2",1.1))),randomClass3("aa","aaa")),
(randomClass2("b",Seq(randomClass("b1",1.2),randomClass("b2",1.2))),randomClass3("bb","bbb")),
(randomClass2("c",Seq(randomClass("c1",3.2),randomClass("c2",1.2))),randomClass3("cc","Ccc"))))
val aDF = anRDD.toDF()
Assuming that I am having the aDF how can I get the anRDD???
I tried something like this just to get the second column but it was giving an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
Spark DatataFrame / Dataset[Row] represents data as the Row objects using mapping described in Spark SQL, DataFrames and Datasets Guide Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As commented by Tzach Zohar to get back a full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
I don't know the scala API but have you considered the rdd value?
Maybe something like :
aDR.rdd.map { case r:Row => r.getAs[randomClass3]("_2")}

What is the best practice for reading/parsing Spark generated files which saved using SaveAsTextFile?

As you know, if you use saveAsTextFile on an RDD[String, Int], the output looks like this:
(T0000036162,1747)
(T0000066859,1704)
(T0000043861,1650)
(T0000075501,1641)
(T0000071951,1638)
(T0000075623,1638)
(T0000070102,1635)
(T0000043868,1627)
(T0000094043,1626)
You may want to use this file in Spark again and what should be best practice for reading and parsing it? Should it be something like that or is there any elegant way for it?
val lines = sc.textFile("result/hebe")
case class Foo(id: String, count: Long)
val parsed = lines
.map(l => l.stripPrefix("(").stripSuffix(")").split(","))
.map(l => new Foo(id=l(0),count = l(1).toLong))
It depends what you're looking for.
If you want something pretty I'd consider possibly adding an alternative constructor to Foo which takes a single string so you could have something like
lines.map(new Foo)
And Foo would look like
case class Foo(id: String, count: Long) {
def apply(l: String): Foo = {
val split = l.stripPrefix("(").stripSuffix(")").split(",")
new Foo(l(0), l(1))
}
}
If you have no requirement to output the data like that then I'd consider saving it as a sequence file.
If performance isn't an issue then its fine. I'd just say the most important thing is to just isolate the text parsing so that later you could unit test it and come back to it later and easily edit it.
you should either save it as a Dataframe which will use the case class as a schema (that allows you to easily parse it back into Spark) or you should map out the individual components of your RDD (so you remove the brackets before saving) since it only makes the file larger:
yourRDD.toDF("id","count").saveAsParquetFile(path)
when you load in the DF, you can pass it through a schema definition to get it back into a RDD if you want
RDDInput = input.map(x=>(x.getAs[Long]("id"),x.getAs[Int]("count")))
If you prefer to store as a text file, you could consider mapping the elements without the brackets:
yourRDD.map(x => s"${x._1}, ${x._2}")
The best way will be, you write dataframes instead of RDD directly as file.
Code that writing files -
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd.toDF()
df.write.parquet("dir”)
Code that reading files -
val rdd = sqlContext.read.parquet(“dir”).rdd.map(row => (row.getString(0),row.getLong(1)))
Before making saveAsTextFile use map(x=>x.mkString(",").
rdd.map(x=>x.mkString(",").saveAsTextFile(path). Output will not have bracket.
Output of this will be:-
T0000036162,1747
T0000066859,1704