toDF() not handling RDD - scala

I have an RDD of Rows called RowRDD. I am simply trying to convert it into a DataFrame. From the examples I have seen in various places on the internet, it seems I should be calling RowRDD.toDF(), but I am getting the error:
value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]

It doesn't work because Row is not a Product type, and createDataFrame with a single RDD argument is defined only for RDD[A] where A <: Product.
If you want to use RDD[Row] you have to provide a schema as the second argument. If you think about it, this should be obvious: Row is just a container of Any, so it doesn't provide enough information for schema inference.
Assuming this is the same RDD as defined in your previous question then schema is easy to generate:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

val rowRdd: RDD[Row] = ???
val schema = StructType(
  (1 to rowRdd.first.size).map(i => StructField(s"_$i", StringType, false))
)
val df = sqlContext.createDataFrame(rowRdd, schema)
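As a side note, the field names generated by the comprehension above follow Spark's positional `_1`, `_2`, … convention. The naming logic itself is plain Scala and can be sanity-checked without a SparkSession (here `numCols` is a hypothetical stand-in for `rowRdd.first.size`):

```scala
// Generate positional field names the same way the schema comprehension does.
// numCols stands in for rowRdd.first.size (illustrative value only).
val numCols = 3
val fieldNames = (1 to numCols).map(i => s"_$i")
println(fieldNames.mkString(", ")) // _1, _2, _3
```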

How to extract data from MapType Scala Spark Column as Scala Map?

Well, the question is pretty much that. Let me provide sample:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, Dataset, Row}
import org.apache.spark.sql.types._

val data = List(
  Row("miley",
    Map(
      "good_songs" -> "wrecking ball",
      "bad_songs" -> "younger now"
    )
  ),
  Row("kesha",
    Map(
      "good_songs" -> "tik tok",
      "bad_songs" -> "rainbow"
    )
  )
)

val schema = List(
  StructField("singer", StringType, true),
  StructField("songs", MapType(StringType, StringType, true))
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

// This returns scala.collection.Map[Nothing,Nothing]
someDF.select($"songs").head().getMap(0)

// Therefore, this won't work:
val myHappyMap: Map[String, String] = someDF.select($"songs").head().getMap(0)
I don't understand why I'm getting a Map[Nothing, Nothing] if I properly described my desired schema for the MapType column - not only that: when I do someDF.schema, what I get is
org.apache.spark.sql.types.StructType = StructType(StructField(singer,StringType,true), StructField(songs,MapType(StringType,StringType,true),true)), showing that the DataFrame schema is properly set.
I've read extract or filter MapType of Spark DataFrame
, and also How to get keys and values from MapType column in SparkSQL DataFrame
. I thought the latter would solve my problem by at least being able to extract the keys and the values separately, but, still, I get the values as WrappedArray(Nothing), which means it just adds extra complication for no real gain.
What am I missing here?
.getMap is a typed method and it's incapable of inferring the types of your map, so you have to tell it explicitly:
val myHappyMap: Map[String, String] = someDF.select($"songs").head().getMap[String, String](0).toMap
The toMap at the end just converts the result from scala.collection.Map to scala.collection.immutable.Map (they are different types, and an unqualified Map annotation usually refers to the second one).
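The distinction between the two Map types can be seen without Spark at all; a minimal plain-Scala sketch:

```scala
// getMap hands back the general scala.collection.Map supertype; an
// unqualified `Map` annotation means scala.collection.immutable.Map,
// which is why the extra .toMap call is needed.
val general: scala.collection.Map[String, String] =
  scala.collection.mutable.Map("good_songs" -> "tik tok")
val immutableCopy: Map[String, String] = general.toMap
println(immutableCopy("good_songs")) // tik tok
```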

unable to store row elements of a dataset, via mapPartitions(), in variables

I am trying to create a Spark Dataset and then, using mapPartitions, trying to access each of its elements and store them in variables. I am using the following piece of code:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val df = spark.sql("select col1, col2, col3 from table limit 10")
val schema = StructType(Seq(
  StructField("col1", StringType),
  StructField("col2", StringType),
  StructField("col3", StringType)))
val encoder = RowEncoder(schema)
df.mapPartitions { iterator => {
  val myList = iterator.toList
  myList.map(x => {
    val value1 = x.getString(0)
    val value2 = x.getString(1)
    val value3 = x.getString(2)
  }).iterator
}}(encoder)
The error I am getting against this code is:
<console>:39: error: type mismatch;
found : org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Encoder[Unit]
val value3 = x.getString(2)}).iterator}} (encoder)
Eventually, I am aiming to store the row elements in variables and perform some operations with them. Not sure what I am missing here. Any help would be highly appreciated!
Actually, there are several problems with your code:
Your map statement has no return value, so its result type is Unit.
If you return a tuple of Strings from mapPartitions, you don't need a RowEncoder: you don't return a Row but a Tuple3, which needs no explicit encoder because it is a Product.
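The first point (a block whose last statement is a val definition has type Unit) can be reproduced in plain Scala, independent of Spark:

```scala
// A block ending in a val definition evaluates to Unit, so map
// produces List[Unit]; returning the values gives a List of tuples.
val rows = List(("a", "b", "c"), ("d", "e", "f"))
val wrong = rows.map { r => val value1 = r._1 }    // List[Unit]
val right = rows.map { r => (r._1, r._2, r._3) }   // List[(String, String, String)]
```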
You can write your code like this:
df
  .mapPartitions { itr => itr.map(x => (x.getString(0), x.getString(1), x.getString(2))) }
  .toDF("col1", "col2", "col3") // convert the Dataset to a DataFrame and set the desired field names
But you could just use a simple select statement in the DataFrame API; there is no need for mapPartitions here:
df
.select($"col1",$"col2",$"col3")

schema error while converting Vector collection to dataframe

I have a vector collection named values which I'm trying to convert to a dataframe
scala.collection.immutable.Vector[(String, Double)] = Vector((1,1.0), (2,2.4), (3,3.7), (4,5.0), (5,4.9))
I have defined a custom schema as follows and tried to do the conversion.
val customSchema = new StructType()
.add("A", IntegerType, true)
.add("B", DoubleType, true)
val df = values.toDF.schema(customSchema)
This gives me an error saying,
error: overloaded method value apply with alternatives:
(fieldIndex: Int)org.apache.spark.sql.types.StructField <and>
(names: Set[String])org.apache.spark.sql.types.StructType <and>
(name: String)org.apache.spark.sql.types.StructField
cannot be applied to (org.apache.spark.sql.types.StructType)
I've tried all the methods described here and here as well as the StructType documentation to create the schema. However all methods lead to the same custom schema, customSchema: org.apache.spark.sql.types.StructType = StructType(StructField(A,IntegerType,true), StructField(B,DoubleType,true))
toDF method works just fine without a custom schema. However I want to force a custom schema. Can anyone tell me what I'm doing wrong here?
schema is a property. You should use schema when you want to get the StructType of a DataFrame or Dataset.
val df = values.toDF
df.schema
//prints
StructType(StructField(_1,IntegerType,false), StructField(_2,DoubleType,false))
To convert a vector to a DataFrame or Dataset, you can use spark.createDataFrame or spark.createDataset. These methods are overloaded and expect an RDD, JavaRDD, or java.util.List of Row, plus schema information. You can do the following to convert your Vector into a DataFrame:
val df = spark.createDataFrame(vec.toDF.rdd, customSchema)
df.schema
//prints
StructType(StructField(A,IntegerType,true), StructField(B,DoubleType,true))
I hope it helps!
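One caveat worth noting: the tuples in the original Vector carry a String in the first slot, while customSchema declares IntegerType for the first column, so it may be safer to convert the values before building the DataFrame. A plain-Scala sketch of just that conversion (variable names are illustrative):

```scala
// Convert the first tuple element from String to Int so the data
// matches a schema whose first column is IntegerType.
val values = Vector(("1", 1.0), ("2", 2.4), ("3", 3.7), ("4", 5.0), ("5", 4.9))
val typed = values.map { case (a, b) => (a.toInt, b) }
println(typed.head) // (1,1.0)
```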

Encoder[Row] in Scala Spark

I'm trying to perform a simple map on a Dataset[Row] (DataFrame) in Spark 2.0.0. Something as simple as this
val df: Dataset[Row] = ...
df.map { r: Row => r }
But the compiler is complaining that I'm not providing the implicit Encoder[Row] argument to the map function:
not enough arguments for method map: (implicit evidence$7:
Encoder[Row]).
Everything works fine if I convert to an RDD first ds.rdd.map { r: Row => r } but shouldn't there be an easy way to get an Encoder[Row] like there is for tuple types Encoders.product[(Int, Double)]?
[Note that my Row is dynamically sized in such a way that it can't easily be converted into a strongly-typed Dataset.]
An Encoder needs to know how to pack the elements inside the Row. So you could write your own Encoder[Row] by using the row's schema (row.schema), which determines the elements of your Row at runtime, and use the corresponding decoders.
Or if you know more about the data that goes into Row, you could use https://github.com/adelbertc/frameless/
Sorry to be a "bit" late. Hopefully this helps someone who is hitting the problem right now. The easiest way to define an encoder is deriving the structure from an existing DataFrame:
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
val myEncoder = RowEncoder(df.schema)
Such an approach can be useful when you need to alter existing fields of your original DataFrame.
If you're dealing with a completely new structure, use an explicit definition relying on StructType and StructField (as suggested in #Reactormonk's slightly cryptic response).
Example defining the same encoder:
val myEncoder2 = RowEncoder(StructType(
  Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)
  )))
Please remember that org.apache.spark.sql._, org.apache.spark.sql.types._ and org.apache.spark.sql.catalyst.encoders.RowEncoder have to be imported.
In your specific case where the mapped function does not change the schema, you can pass in the encoder of the DataFrame itself:
df.map(r => r)(df.encoder)

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:
// sc is the SparkContext, while sqlContext is the SQLContext.
// Define the case class and raw data
case class Dog(name: String)
val data = Array(
  Dog("Rex"),
  Dog("Fido")
)
// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)
// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)
// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I'm seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
All you need is just
val dogDF = sqlContext.createDataFrame(dogRDD)
The second parameter is part of the Java API and expects your class to follow the Java beans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
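For contrast, here is a minimal sketch of the bean shape that overload inspects (a hypothetical DogBean class, plain Scala; the getter/setter pair is what the Java API detects as a property):

```scala
// A Java-bean-style class: no-arg constructor plus getter/setter,
// the property shape createDataFrame(rdd, classOf[...]) looks for.
class DogBean {
  private var name: String = ""
  def getName: String = name
  def setName(n: String): Unit = { name = n }
}

val bean = new DogBean
bean.setName("Rex")
println(bean.getName) // Rex
```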
You can create a DataFrame directly from a Seq of case class instances using toDF as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
The case class approach won't work in cluster mode; it'll give a ClassNotFoundException for the case class you defined.
Convert it to an RDD[Row], define the schema of your RDD with StructField, and then use createDataFrame, like:
val rdd = data.map { attrs => Row(attrs(0), attrs(1)) }
val rddStruct = new StructType(Array(
  StructField("id", StringType, nullable = true),
  StructField("pos", StringType, nullable = true)
))
sqlContext.createDataFrame(rdd, rddStruct)
toDF() won't work either.