Converting an RDD into a DataFrame - Scala

I am new to Spark/Scala.
I have created the RDD below by loading data from multiple paths. Now I want to create a DataFrame from it for further operations.
The DataFrame should have the following schema:
schema[UserId, EntityId, WebSessionId, ProductId]
rdd.foreach(println)
545456,5615615,DIKFH6545614561456,PR5454564656445454
875643,5485254,JHDSFJD543514KJKJ4
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR54545DSKJD541054
264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515
732543,8765984,UJHSG4240323545144
564574,6276832,KJDXSGFJFS2545DSAS
Will anyone please help me?
I have tried defining a schema class and mapping it against the RDD, but I get the error
"ArrayIndexOutOfBoundsException: 3"

If you treat your columns as String, you can create the DataFrame as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val rdd: RDD[Row] = ???
val df = spark.createDataFrame(rdd, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))
Note that you must "map" your RDD to an RDD[Row] for the compiler to let you use the createDataFrame method. For the missing fields you can declare the columns as nullable in the DataFrame schema.
In your example you are using spark.sparkContext.textFile(). This method returns an RDD[String], which means that each element of your RDD is a whole line. But you need an RDD[Row], so you need to split each line on commas, like this:
val list = List(
  "545456,5615615,DIKFH6545614561456,PR5454564656445454",
  "875643,5485254,JHDSFJD543514KJKJ4",
  "545456,5615615,DIKFH6545614561456,PR5454564656445454",
  "545456,5615615,DIKFH6545614561456,PR5454564656445454",
  "545456,5615615,DIKFH6545614561456,PR54545DSKJD541054",
  "264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515",
  "732543,8765984,UJHSG4240323545144",
  "564574,6276832,KJDXSGFJFS2545DSAS")
val FilterReadClicks = spark.sparkContext.parallelize(list)
val rows: RDD[Row] = FilterReadClicks.map(_.split(",").toSeq).map { fields =>
  // Keep the original field order and pad short lines so every row has the
  // four expected columns; the missing ProductId becomes an empty string.
  if (fields.length == 4) Row.fromSeq(fields)
  else Row.fromSeq(fields :+ "")
}
rows.foreach(el => println(el.toSeq))
val df = spark.createDataFrame(rows, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))
df.show()
+------+--------+------------------+------------------+
|userId|EntityId|      WebSessionId|         ProductId|
+------+--------+------------------+------------------+
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|875643| 5485254|JHDSFJD543514KJKJ4|                  |
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR54545DSKJD541054|
|264264| 3254564|MNXZCBMNABC5645SAD|PR5142545564542515|
|732543| 8765984|UJHSG4240323545144|                  |
|564574| 6276832|KJDXSGFJFS2545DSAS|                  |
+------+--------+------------------+------------------+
With the rows RDD you will be able to create the DataFrame.
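As a side note (not from the original answer): if the source files are plain comma-separated text, a shorter sketch is to let the CSV reader apply the schema directly. The paths value below is assumed to hold your input paths:
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(Seq(
  StructField("UserId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true)))

// Missing trailing fields are read as null rather than an empty string.
val df = spark.read.schema(schema).csv(paths: _*)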

Related

Cannot resolve overloaded method 'createDataFrame'

The following code:
val data1 = Seq(("Android", 1, "2021-07-24 12:01:19.000", "play"), ("Android", 1, "2021-07-24 12:02:19.000", "stop"),
("Apple", 1, "2021-07-24 12:03:19.000", "play"), ("Apple", 1, "2021-07-24 12:04:19.000", "stop"))
val schema1 = StructType(Array(StructField("device_id", StringType, true),
StructField("video_id", IntegerType, true),
StructField("event_timestamp", StringType, true),
StructField("event_type", StringType, true)
))
val spark = SparkSession.builder()
.enableHiveSupport()
.appName("PlayStop")
.getOrCreate()
var transaction=spark.createDataFrame(data1, schema1)
produces the error:
Cannot resolve overloaded method 'createDataFrame'
Why?
And how to fix it?
If your schema consists of default StructField settings, the easiest way to create a DataFrame is to simply apply toDF() (after import spark.implicits._):
val transaction = data1.toDF("device_id", "video_id", "event_timestamp", "event_type")
To specify a custom schema definition, note that createDataFrame() takes an RDD[Row] and a schema as its parameters. In your case, you could transform data1 into an RDD[Row] like below (this needs import org.apache.spark.sql.Row):
val transaction = spark.createDataFrame(spark.sparkContext.parallelize(data1.map(Row.fromTuple)), schema1)
An alternative is to use toDF, followed by .rdd, which returns the DataFrame (i.e. Dataset[Row]) as an RDD[Row]:
val transaction = spark.createDataFrame(data1.toDF.rdd, schema1)
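For reference, a minimal self-contained sketch of the Row-based fix (not from the original answer; a local master is assumed). The original call fails to compile because the createDataFrame overloads that accept a schema expect an RDD[Row] or a java.util.List[Row], not a Scala Seq of tuples:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val spark = SparkSession.builder().appName("PlayStop").master("local[*]").getOrCreate()

val data1 = Seq(
  ("Android", 1, "2021-07-24 12:01:19.000", "play"),
  ("Android", 1, "2021-07-24 12:02:19.000", "stop"),
  ("Apple", 1, "2021-07-24 12:03:19.000", "play"),
  ("Apple", 1, "2021-07-24 12:04:19.000", "stop"))

val schema1 = StructType(Array(
  StructField("device_id", StringType, true),
  StructField("video_id", IntegerType, true),
  StructField("event_timestamp", StringType, true),
  StructField("event_type", StringType, true)))

// Convert each tuple to a Row, then apply the custom schema.
val transaction = spark.createDataFrame(
  spark.sparkContext.parallelize(data1.map(Row.fromTuple)), schema1)

transaction.show(false)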

Define StructType as input datatype of a Function Spark-Scala 2.11 [duplicate]

This question already has an answer here:
Defining a UDF that accepts an Array of objects in a Spark DataFrame?
I'm trying to write a Spark UDF in Scala, and I need to define the function's input datatype.
I have a schema variable with the StructType, shown below.
import org.apache.spark.sql.types._
val relationsSchema = StructType(Seq(
  StructField("relation", ArrayType(
    StructType(Seq(
      StructField("attribute", StringType, true),
      StructField("email", StringType, true),
      StructField("fname", StringType, true),
      StructField("lname", StringType, true)
    )),
    true), true)
))
I'm trying to write a function like the one below:
val relationsFunc: Array[Map[String,String]] => Array[String] = _.map(do something)
val relationUDF = udf(relationsFunc)
input.withColumn("relation",relationUDF(col("relation")))
The above code throws the exception below:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(relation)' due to data type mismatch: argument 1 requires array<map<string,string>> type, however, '`relation`' is of array<struct<attribute:string,email:string,fname:string,lname:string>> type.;;
'Project [relation#89, UDF(relation#89) AS proc#273]
If I give the input type as
val relationsFunc: StructType => Array[String] =
I'm not able to implement the logic, as _.map gives me metadata, field names, etc.
Please advise how to define relationsSchema as the input datatype in the function below.
val relationsFunc: ? => Array[String] = _.map(somelogic)
Your structure under relation is a Row, and an array column arrives in a Scala UDF as a Seq, so your function should have the following signature:
val relationsFunc: Seq[Row] => Array[String]
Then you can access your data either by position or by name, i.e.:
{ r: Row => r.getAs[String]("email") }
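For completeness, a minimal sketch (not from the original answer) of wiring that signature into a UDF; input and the relation column name come from the question:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

val relationsFunc: Seq[Row] => Array[String] =
  _.map(r => r.getAs[String]("email")).toArray // replace with your own per-struct logic

val relationUDF = udf(relationsFunc)
input.withColumn("relation", relationUDF(col("relation")))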
Check the mapping table in the documentation to determine the data type representations between Spark SQL and Scala: https://spark.apache.org/docs/2.4.4/sql-reference.html#data-types
Your relation field is an array of the Spark SQL complex type StructType, and StructType is represented by the Scala type org.apache.spark.sql.Row, so Row is the element type you should be using.
I used your code to create this complete working example that extracts email values:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val relationsSchema = StructType(Seq(
  StructField("relation", ArrayType(
    StructType(Seq(
      StructField("attribute", StringType, true),
      StructField("email", StringType, true),
      StructField("fname", StringType, true),
      StructField("lname", StringType, true)
    )),
    true), true)
))
val data = Seq(
  // One record whose relation column holds a single struct; arrays of structs
  // are passed to createDataFrame as a Seq of Rows.
  Row(Seq(Row("1", "johnny#example.com", "Johnny", "Appleseed")))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  relationsSchema
)

import org.apache.spark.sql.functions.{col, udf}

val relationsFunc = (relation: Seq[Row]) => relation.map(_.getAs[String]("email"))
val relationUdf = udf(relationsFunc)

df.withColumn("relation", relationUdf(col("relation")))

Spark error when using except on a dataframe with MapType

I am seeing the error "Cannot have map type columns in DataFrame which calls set operations" when using Spark MapType.
Below is the sample code I wrote to reproduce it. I understand this is happening because MapType objects are not hashable, but I have a use case where I need to do the following.
val schema1 = StructType(Seq(
  StructField("a", MapType(StringType, StringType, true)),
  StructField("b", StringType, true)
))
val df = spark.read.schema(schema1).json("path")
val filteredDF = df.filter($"b" === "apple")
val otherDF = df.except(filteredDF)
Any suggestions for workarounds?
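As a hedged workaround sketch (not from the original thread): since the limitation is only that map columns cannot be compared, one option is to serialise the map column to JSON with to_json (supported for map columns in recent Spark versions) for the set operation and parse it back afterwards; for a filter as simple as the one in the example, negating it directly avoids except entirely:
import org.apache.spark.sql.functions.{to_json, from_json}
import org.apache.spark.sql.types.{MapType, StringType}

// Option 1: for this particular example, the complement of the filter avoids except.
val otherDF1 = df.filter(!($"b" <=> "apple"))

// Option 2: run except on a JSON representation of the map column, then restore it.
// Note that the string comparison is sensitive to key order inside the map.
val otherDF2 = df.withColumn("a", to_json($"a"))
  .except(filteredDF.withColumn("a", to_json($"a")))
  .withColumn("a", from_json($"a", MapType(StringType, StringType, true)))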

Spark : ClassCastException while converting string into date

I am trying to read a parquet file into a dataframe with the following code:
val data = spark.read.schema(schema)
.option("dateFormat", "YYYY-MM-dd'T'hh:mm:ss").parquet(<file_path>)
data.show()
Here's the schema:
def schema: StructType = StructType(Array[StructField](
  StructField("id", StringType, false),
  StructField("text", StringType, false),
  StructField("created_date", DateType, false)
))
When I try to execute data.show(), it throws the following exception:
Caused by: java.lang.ClassCastException: [B cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.spark.sql.catalyst.expressions.MutableInt.update(SpecificInternalRow.scala:74)
at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:240)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.set(ParquetRowConverter.scala:159)
at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addBinary(ParquetRowConverter.scala:89)
at org.apache.parquet.column.impl.ColumnReaderImpl$2$6.writeValue(ColumnReaderImpl.java:324)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:372)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
Apparently, it's because of the date format and the DateType in my schema. If I change DateType to StringType, it works fine and outputs the following:
+--------------------+--------------------+--------------------+
|                  id|                text|        created_date|
+--------------------+--------------------+--------------------+
|id..................|text................|2017-01-01T00:08:09Z|
I want to read created_date as a DateType; do I need to change anything else?
The following works under Spark 2.1. Note the change of the date format and the usage of TimestampType instead of DateType.
val schema = StructType(Array[StructField](
  StructField("id", StringType, false),
  StructField("text", StringType, false),
  StructField("created_date", TimestampType, false)
))

val data = spark
  .read
  .schema(schema)
  .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'")
  .parquet("s3a://thisisnotabucket")
In older versions of Spark (I can confirm this works under 1.5.2), you can create a UDF to do the conversion for you in SQL.
def cvtDt(d: String): java.sql.Date = {
  val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
  new java.sql.Date(fmt.parseDateTime(d).getMillis)
}

def cvtTs(d: String): java.sql.Timestamp = {
  val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
  new java.sql.Timestamp(fmt.parseDateTime(d).getMillis)
}

sqlContext.udf.register("mkdate", cvtDt(_: String))
sqlContext.udf.register("mktimestamp", cvtTs(_: String))

sqlContext.read.parquet("s3a://thisisnotabucket").registerTempTable("dttest")

val query = "select *, mkdate(created_date), mktimestamp(created_date) from dttest"
sqlContext.sql(query).collect.foreach(println)
NOTE: I did this in the REPL, so I had to create the DateTimeFormat pattern on every call to the cvt* methods to avoid serialization issues. If you're doing this in an application, I recommend extracting the formatter into an object.
object DtFmt {
  val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
}
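A short usage sketch of that object (not from the original answer; it assumes the same sqlContext and Joda-Time as above), registering the UDFs against the shared formatter so it is only built once:
// Assumes the DtFmt object above is visible to the executors.
sqlContext.udf.register("mkdate", (d: String) =>
  new java.sql.Date(DtFmt.fmt.parseDateTime(d).getMillis))
sqlContext.udf.register("mktimestamp", (d: String) =>
  new java.sql.Timestamp(DtFmt.fmt.parseDateTime(d).getMillis))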

How to convert a VertexRDD to a DataFrame

I have a VertexRDD[DenseVector[Double]] and I want to convert it to a dataframe. I don't understand how to map the values from the DenseVector to new columns in a data frame.
I am trying to specify the schema as:
val schemaString = "id prop1 prop2 prop3 prop4 prop5 prop6 prop7"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
I think an option is to convert my VertexRDD - where the breeze.linalg.DenseVector holds all the values - into an RDD[Row], so that I can finally create a data frame like:
val myRDD = myvertexRDD.map(f => Row(f._1, f._2.toScalaVector().toSeq))
val mydataframe = SQLContext.createDataFrame(myRDD, schema)
But I get a
// scala.MatchError: 20502 (of class java.lang.Long)
Any hint is more than welcome.
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType, DoubleType}

val rows = myvertexRDD.map {
  case (id, v) => Row.fromSeq(id +: v.toArray)
}

val schema = StructType(
  StructField("id", LongType, false) +:
  (1 to 7).map(i => StructField(s"prop$i", DoubleType, false)))

val df = sqlContext.createDataFrame(rows, schema)
Notes:
Declared types have to match the actual types. You cannot declare a string and pass a long or a double.
The structure of the row has to match the declared structure. In your case you're trying to create a row with a Long and a Vector[Double] (two fields), but you declare 8 columns.
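A minimal self-contained sketch of the same approach (not from the original answer): the RDD of (id, vector) pairs below merely stands in for the VertexRDD, which is also an RDD[(VertexId, VD)], and a SparkSession named spark is assumed:
import breeze.linalg.DenseVector
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType, DoubleType}

// Stand-in for the VertexRDD[DenseVector[Double]].
val vertexLike = spark.sparkContext.parallelize(Seq(
  (20502L, DenseVector(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)),
  (20503L, DenseVector(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7))))

// One Row per vertex: the id followed by the 7 vector components.
val rows = vertexLike.map { case (id, v) => Row.fromSeq(id +: v.toArray.toSeq) }

val schema = StructType(
  StructField("id", LongType, false) +:
  (1 to 7).map(i => StructField(s"prop$i", DoubleType, false)))

spark.createDataFrame(rows, schema).show()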