I have a VertexRDD[DenseVector[Double]] and I want to convert it to a dataframe. I don't understand how to map the values from the DenseVector to new columns in a data frame.
I am trying to specify the schema as:
val schemaString = "id prop1 prop2 prop3 prop4 prop5 prop6 prop7"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
I think an option is to convert my VertexRDD - where the breeze.linalg.DenseVector holds all the values - into a RDD[Row], so that I can finally create a data frame like:
val myRDD = myvertexRDD.map(f => Row(f._1, f._2.toScalaVector().toSeq))
val mydataframe = sqlContext.createDataFrame(myRDD, schema)
But I get a
// scala.MatchError: 20502 (of class java.lang.Long)
Any hint is more than welcome.
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType, DoubleType}
val rows = myvertexRDD.map{
case(id, v) => Row.fromSeq(id +: v.toArray)
}
val schema = StructType(
StructField("id", LongType, false) +:
(1 to 7).map(i => StructField(s"prop$i", DoubleType, false)))
val df = sqlContext.createDataFrame(rows, schema)
Notes:
Declared types have to match the actual types. You cannot declare a StringType column and pass a Long or a Double; that is what triggers the MatchError on the Long id.
The structure of the row has to match the declared structure. In your case you are creating a Row with two fields, a Long and a Vector[Double], but you declare 8 columns.
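To make the second note concrete, here is a small Spark-free sketch of the two row shapes. The object and method names (RowShape, flatten, nested) are made up for illustration; the point is only that the flattened version yields one value per declared column, while the nested version yields two values no matter how long the vector is.

```scala
object RowShape {
  // Hypothetical helper: one value per column, id first, then each vector entry.
  def flatten(id: Long, values: Array[Double]): Seq[Any] =
    Seq[Any](id) ++ values.toSeq

  // Shape of the failing attempt: two fields, the second one a whole sequence.
  def nested(id: Long, values: Array[Double]): Seq[Any] =
    Seq[Any](id, values.toSeq)
}
```

With a 7-element vector, flatten gives 8 values (matching id plus prop1 to prop7), whereas nested gives only 2. The flattened Seq is the kind of input Row.fromSeq expects in the answer above.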
I don't have schema information in advance, because the DataFrame is created from the data of different tables. So I want to detect any LongType column and cast it to IntegerType.
My approach is to create a new DataFrame with a new schema in which the LongType fields are converted to IntegerType.
val df = spark.read.format("bigquery").load(sql)
// cast long types to int
val newSchemaArr = df.schema.fields.map(f => if(f.dataType.isInstanceOf[LongType]) StructField(name = f.name, dataType = IntegerType, nullable = f.nullable) else f)
val newSchema = new StructType(newSchemaArr)
val df2 = spark.createDataFrame(df.rdd, newSchema)
// write to hdfs files
df2.write.format("avro").save(destinationPath)
But I get this error when writing the data:
Caused by: java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of int
Is there a way to fix this, or another approach to handle the problem?
Spark version: 3.2.0
Scala version: 2.12
The easiest way to do this is to simply cast columns when necessary:
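import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType, StructField}
// cast long columns to integer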
val columns = df.schema.map {
case StructField(name, LongType, _, _) => col(name).cast(IntegerType)
case f => col(f.name)
}
// write modified columns
df.select(columns: _*).write.format("avro").save(destinationPath)
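Another option is to fold over the column names and cast only the LongType columns:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType}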
val sch = df.schema
val df2 = sch.fieldNames.foldLeft(df) { (tmpDF, colName) =>
if (tmpDF.schema(colName).dataType == LongType)
tmpDF.withColumn(colName, col(colName).cast(IntegerType))
else tmpDF
}
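One thing to keep in mind with either variant: narrowing a Long to an Int can lose information when a value falls outside the Int range. The sketch below is plain Scala with no Spark involved, and the names Narrowing, fitsInInt and toIntOption are made up for illustration. It only shows the range check you might apply to your data before deciding that the cast is safe; how Spark itself treats out-of-range values depends on its settings, so verify that for your version.

```scala
object Narrowing {
  // True when the Long can be represented as an Int without loss.
  def fitsInInt(v: Long): Boolean =
    v >= Int.MinValue && v <= Int.MaxValue

  // Narrow only when it is safe; None signals a value that would overflow.
  def toIntOption(v: Long): Option[Int] =
    if (fitsInInt(v)) Some(v.toInt) else None
}
```

For example, Narrowing.toIntOption(20502L) is Some(20502), while a value such as 3000000000L yields None, because a plain toInt on it would wrap around to a negative number.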
I am new to Spark/Scala.
I have created the RDD below by loading data from multiple paths. Now I want to create a DataFrame from it for further operations.
The DataFrame should have the following schema:
schema[UserId, EntityId, WebSessionId, ProductId]
rdd.foreach(println)
545456,5615615,DIKFH6545614561456,PR5454564656445454
875643,5485254,JHDSFJD543514KJKJ4
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR54545DSKJD541054
264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515
732543,8765984,UJHSG4240323545144
564574,6276832,KJDXSGFJFS2545DSAS
Can anyone please help?
I have tried this by defining a schema class and mapping it against the RDD, but I get the error
"ArrayIndexOutOfBoundsException :3"
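If you treat all of your columns as String, you can create the DataFrame with the following:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val rdd: RDD[Row] = ??? // your data, already mapped to Row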
val df = spark.createDataFrame(rdd, StructType(Seq(
StructField("userId", StringType, false),
StructField("EntityId", StringType, false),
StructField("WebSessionId", StringType, false),
StructField("ProductId", StringType, true))))
Note that you must map your RDD to an RDD[Row] before the compiler lets you call this createDataFrame overload. For the missing fields, you can declare the columns as nullable in the DataFrame schema.
In your example you are using spark.sparkContext.textFile(). This method returns an RDD[String], which means that each element of your RDD is a line. But you need an RDD[Row], so you have to split each line on commas, like this:
val list =
List("545456,5615615,DIKFH6545614561456,PR5454564656445454",
"875643,5485254,JHDSFJD543514KJKJ4",
"545456,5615615,DIKFH6545614561456,PR5454564656445454",
"545456,5615615,DIKFH6545614561456,PR5454564656445454",
"545456,5615615,DIKFH6545614561456,PR54545DSKJD541054",
"264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515",
"732543,8765984,UJHSG4240323545144","564574,6276832,KJDXSGFJFS2545DSAS")
val FilterReadClicks = spark.sparkContext.parallelize(list)
val rows: RDD[Row] = FilterReadClicks.map(line => line.split(",")).map { arr =>
  // keep the original field order and pad a missing ProductId with an empty string
  Row.fromSeq(arr.toSeq.padTo(4, ""))
}
rows.foreach(el => println(el.toSeq))
val df = spark.createDataFrame(rows, StructType(Seq(
StructField("userId", StringType, false),
StructField("EntityId", StringType, false),
StructField("WebSessionId", StringType, false),
StructField("ProductId", StringType, true))))
df.show()
+------+--------+------------------+------------------+
|userId|EntityId|      WebSessionId|         ProductId|
+------+--------+------------------+------------------+
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|875643| 5485254|JHDSFJD543514KJKJ4|                  |
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR54545DSKJD541054|
|264264| 3254564|MNXZCBMNABC5645SAD|PR5142545564542515|
|732543| 8765984|UJHSG4240323545144|                  |
|564574| 6276832|KJDXSGFJFS2545DSAS|                  |
+------+--------+------------------+------------------+
With the rows RDD in this shape, createDataFrame accepts it together with the schema.
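The ArrayIndexOutOfBoundsException in the question most likely comes from lines that have only three fields. As a Spark-free sketch of the padding step, the helper below splits a line and always returns four values in the original order. The names LineParser, parse and filler are hypothetical, and using null as the default filler is an assumption that fits a nullable ProductId column; pass an empty string if you prefer that.

```scala
object LineParser {
  // Number of columns in the target schema: UserId, EntityId, WebSessionId, ProductId.
  val fieldCount = 4

  // Split on commas (keeping trailing empty fields) and pad short lines with the filler.
  def parse(line: String, filler: String = null): Seq[String] =
    line.split(",", -1).toSeq.padTo(fieldCount, filler)
}
```

LineParser.parse("875643,5485254,JHDSFJD543514KJKJ4") returns the three values followed by null, so wrapping the result in Row.fromSeq gives every row the same width.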
I have a list of RDDs. I iterate over it and apply some parsing logic to each element. In the end I have:
val mRdd = nRdd.map {
ele => // parsing logic; afterwards I have the fields below
colum = Array[String] // example ['id','name','dept']
c_type = Array[String] // example ['Int','String','String']
value = ArrayBuffer[String] // [1,lucy,it][2,denis,cs]
}
How can I get a list of DataFrames in mRdd?
I tried some logic to create a DataFrame, but for that I need an RDD first, and I can't create an RDD inside an RDD.
I am new to Spark. I am using Spark 1.6.3.
Please help me.
In order to convert an RDD into a DataFrame, you can do one of the following:
Approach 1 - Use the createDataFrame function:
val mRdd: Seq[DataFrame] = nRdd.map {ele =>
val parsedRDD = ele //apply parse logic here
val schema = StructType(Seq(
StructField("id", IntegerType),
StructField("name", StringType),
StructField("dept", StringType)
))
sqlContext.createDataFrame(parsedRDD, schema) // parsedRDD must be an RDD[Row]
}
Read more about this approach here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
Approach 2 - Use toDF implicit function:
import sqlContext.implicits._
val mRdd: Seq[DataFrame] = nRdd.map {ele =>
val parsedRDD = ele //apply parse logic here
val columns = Seq("id", "name", "dept")
parsedRDD.toDF(columns: _*)
}
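Both approaches hard-code the schema, while the question holds the column types as strings (c_type). A possible building block, sketched here in plain Scala without Spark, is to convert each raw value according to its type name so that the values match whatever types you later declare. TypedValues, convert and convertRecord are hypothetical names, and only the two type names from the question's example are handled.

```scala
object TypedValues {
  // Convert one raw string according to its declared type name.
  def convert(raw: String, typeName: String): Any = typeName match {
    case "Int"    => raw.toInt
    case "String" => raw
    case other    => throw new IllegalArgumentException(s"Unsupported type name: $other")
  }

  // Convert a whole record; expects exactly one type name per value.
  def convertRecord(raw: Seq[String], typeNames: Seq[String]): Seq[Any] = {
    require(raw.length == typeNames.length, "one type name per value is expected")
    raw.zip(typeNames).map { case (value, typeName) => convert(value, typeName) }
  }
}
```

For the example record, TypedValues.convertRecord(Seq("1", "lucy", "it"), Seq("Int", "String", "String")) yields an Int followed by two Strings, which lines up with an IntegerType, StringType, StringType schema.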
I have created a schema with the following code:
val schema= new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
Then I created an RDD from a text file:
val data = spark.sparkContext.textFile("cities.txt")
And converted it to an RDD of Row to apply the schema:
val cities = data.map(line => line.split(";")).map(row => Row.fromSeq(row.zip(schema.toSeq)))
val citiesRDD = spark.sqlContext.createDataFrame(cities, schema)
This gives me an error
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string
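You don't need a schema to create a Row; you need the schema when you create the DataFrame. In your code, row.zip(schema.toSeq) pairs every string with a StructField, so each field of the Row is a Tuple2, which is what the error complains about. You also need some logic to convert the split line (which yields 3 strings) into the declared types, here one string and two integers.
Here is a minimal solution without exception handling: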
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
val schema = new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
val cities = data.map(line => {
val Array(city,female,male) = line.split(";")
Row(
city,
female.toInt,
male.toInt
)
}
)
val citiesDF = sqlContext.createDataFrame(cities, schema)
I normally use case classes to create a DataFrame, because Spark can infer the schema from the case class:
// "schema" for dataframe, define outside of main method
case class MyRow(city:Option[String],female:Option[Int],male:Option[Int])
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
import sqlContext.implicits._
val citiesDF = data.map(line => {
val Array(city,female,male) = line.split(";")
MyRow(
Some(city),
Some(female.toInt),
Some(male.toInt)
)
}
).toDF()
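Both snippets above skip exception handling, so a malformed line would fail the job. Since the case class already uses Option fields, one way to add that handling is sketched below in plain Scala. CityParser, ParsedCity and parse are hypothetical names, and treating a non-numeric count as None (rather than dropping the line) is an assumption about what you want.

```scala
import scala.util.Try

object CityParser {
  // Same shape as the case class above, under a different name to avoid a clash.
  case class ParsedCity(city: Option[String], female: Option[Int], male: Option[Int])

  // Returns None when the line does not have exactly three fields;
  // numeric fields that fail to parse become None instead of throwing.
  def parse(line: String): Option[ParsedCity] = line.split(";", -1) match {
    case Array(city, female, male) =>
      Some(ParsedCity(
        Option(city).filter(_.nonEmpty),
        Try(female.trim.toInt).toOption,
        Try(male.trim.toInt).toOption))
    case _ => None
  }
}
```

On an RDD[String] you could then use data.flatMap(line => CityParser.parse(line)) before calling toDF(), which keeps only the lines that have the expected number of fields.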
I have a case class which I want to convert to a schema in Spark:
case class test(request1:Map[String, Any],response1:Option[String] = None)
How do I convert this class to a schema object?
val mySchema = StructType(
StructField("request1", Map[String, Any], false),StructField(" response1", Option[String],true))
Map and Option are not available as DataTypes.
It is not possible to use this case class to create a DataFrame schema. While Spark supports maps via MapType, and Options are handled using the wrapped type with None converted to NULL, a schema of type Any is not supported.
Assuming you change the value type to String:
case class Test(request1: Map[String, String], response1: Option[String] = None)
the corresponding schema should look like this:
StructType(Seq(
StructField("request1", MapType(StringType, StringType, true), true),
StructField("response1", StringType, true)
))
As @zero323 already eloquently said, even though you can use MapType, it is probably not the best choice in your case. Your request and response are probably already structured, and you should invest a bit of time in defining that structure/schema. For example, you can define all the string columns programmatically at once, and likewise all the int columns, as in the code below.
In SQL terms, Option translates to the third argument of StructField, nullable, which is either true or false. Most of the time you will set it to true so that null values are allowed.
You can define nested structures like this:
import org.apache.spark.sql.types._
case class Request(url:String, enc:String)
case class Response(code:Int, body:String)
case class Record( request:Request, response:Response)
val names = Array("url", "enc")
val requestStructType = StructType( names.map( name => StructField(name, StringType, true)))
// example of a StructType with differing types, built programmatically; add more field names if needed
val respNamesInt = Array("code")
val respNamesString = Array("body")
val responseStructType =
StructType( respNamesInt.map( name => StructField(name, IntegerType, true)) ++
respNamesString.map( name => StructField(name, StringType, true)))
// example of nested structures
val recordStructType =
StructType( Array(StructField("request", requestStructType, false), // nullable = false
StructField("response", responseStructType, true))) // nullable = true