Error in creating dataframe: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string - scala

I have created a schema with following code
val schema= new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
Created a RDD from
val data = spark.sparkContext.textFile("cities.txt")
Converted to RDD of Row to apply schema
val cities = data.map(line => line.split(";")).map(row => Row.fromSeq(row.zip(schema.toSeq)))
val citiesRDD = spark.sqlContext.createDataFrame(cities, schema)
This gives me an error
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string

You don't need a schema to create a Row, you need the schema when you create the DataFrame. You also need to introduce some logic how to convert your splitted line (which produces 3 strings) into integers:
here a minimal solution without exception-handling:
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
val schema = new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
val cities = data.map(line => {
val Array(city,female,male) = line.split(";")
Row(
city,
female.toInt,
male.toInt
)
}
)
val citiesDF = sqlContext.createDataFrame(cities, schema)
I normally use case-classes to create a dataframe, because spark can infer the schema from the case class:
// "schema" for dataframe, define outside of main method
case class MyRow(city:Option[String],female:Option[Int],male:Option[Int])
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
import sqlContext.implicits._
val citiesDF = data.map(line => {
val Array(city,female,male) = line.split(";")
MyRow(
Some(city),
Some(female.toInt),
Some(male.toInt)
)
}
).toDF()

Related

Spark cast all LongType columns to IntegerType dynamically

I dont have schema information because, i have different tables and the dataframe will be created with their data. So i want to detect any LongType column and cast it to IntegerType.
My approach is the create new dataFrame with the new schema which LongType fields converted to IntegerType.
val df = spark.read.format("bigquery").load(sql)
// cast long types to int
val newSchemaArr = df.schema.fields.map(f => if(f.dataType.isInstanceOf[LongType]) StructField(name = f.name, dataType = IntegerType, nullable = f.nullable) else f)
val newSchema = new StructType(newSchemaArr)
val df2 = spark.createDataFrame(df.rdd, newSchema)
// write to hdfs files
df2.write.format("avro").save(destinationPath)
But i got this error, when i writing the data.
Caused by: java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of int
Is there any solution to fix, or any another approach to handle this problem?
Spark version: 3.2.0
Scala version: 2.12
The easiest way to do this is to simply cast columns when necessary:
// cast long columns to integer
val columns = df.schema.map {
case StructField(name, LongType, _, _) => col(name).cast(IntegerType)
case f => col(f.name)
}
// write modified columns
df.select(columns: _*).write.format("avro").save(destinationPath)
import org.apache.spark.sql.types.{IntegerType, LongType}
val sch = df.schema
val df2 = sch.fieldNames.foldLeft(df) { (tmpDF, colName) =>
if (tmpDF.schema(colName).dataType == LongType)
tmpDF.withColumn(colName, col(colName).cast(IntegerType))
else tmpDF
}

Dynamically build case class or schema

Given a list of strings, is there a way to create a case class or a Schema without inputing the srings manually.
For eaxample, I have a List,
val name_list=Seq("Bob", "Mike", "Tim")
The List will not always be the same. Sometimes it will contain different names and will vary in size.
I can create a case class
case class names(Bob:Integer, Mike:Integer, Time:Integer)
or a schema
val schema = StructType(StructFiel("Bob", IntegerType,true)::
StructFiel("Mike", IntegerType,true)::
StructFiel("Tim", IntegerType,true)::Nil)
but I have to do it manually. I am looking for a method to perform this operation dynamically.
Assuming the data type of the columns are the same:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val nameList=Seq("Bob", "Mike", "Tim")
val schema = StructType(nameList.map(n => StructField(n, IntegerType, true)))
// schema: org.apache.spark.sql.types.StructType = StructType(
// StructField(Bob,IntegerType,true), StructField(Mike,IntegerType,true), StructField(Tim,IntegerType,true)
// )
spark.createDataFrame(rdd, schema)
If the data types are different, you'll have to provide them as well (in which case it might not save much time compared with assembling the schema manually):
val typeList = Array[DataType](StringType, IntegerType, DoubleType)
val colSpec = nameList zip typeList
val schema = StructType(colSpec.map(cs => StructField(cs._1, cs._2, true)))
// schema: org.apache.spark.sql.types.StructType = StructType(
// StructField(Bob,StringType,true), StructField(Mike,IntegerType,true), StructField(Tim,DoubleType,true)
// )
If you have all the fields with same datatype than you can simply create as
val name_list=Seq("Bob", "Mike", "Tim")
val fields = name_list.map(name => StructField(name, IntegerType, true))
val schema = StructType(fields)
If you have different datatype than create a map of fields and type and create a schema as above.
Hope this helps!
All the answers above only covered one aspect which is create the schema. Here is one solution you can use to create the case class from the generated schema:
https://gist.github.com/yoyama/ce83f688717719fc8ca145c3b3ff43fd

Not able to create parquet files in hdfs using spark shell

I want to create parquet file in hdfs and then read it through hive as external table. I'm struck with stage failures in spark-shell while writing parquet files.
Spark Version: 1.5.2
Scala Version: 2.10.4
Java: 1.7
Input file:(employee.txt)
1201,satish,25
1202,krishna,28
1203,amith,39
1204,javed,23
1205,prudvi,23
In Spark-Shell:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType, StructField, StringType};
val schema = StructType(schemaString.split(" ").map(fieldName ⇒ StructField(fieldName, StringType, true)))
val rowRDD = employee.map(_.split(",")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")
When I type the last command I get,
ERROR
SPARK APPLICATION MANAGER
I even tried increasing the executor memory, its still failing.
Also Importantly , finalDF.show() is producing the same error.
So, I believe I have made a logical error here.
Thanks for supporting
The issue here is you are creating a schema with all the fields/columns type defaulted to StringType. But while passing the values in the schema, the value of Id and Age is being converted to Integer as per the code.Hence, throwing the Matcherror while running.
The data types of columns in the schema should match the data type of values being passed to it. Try the below code.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
//val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types._;
val schema = StructType(StructField("id", IntegerType, true) :: StructField("name", StringType, true) :: StructField("age", IntegerType, true) :: Nil)
val rowRDD = employee.map(_.split(" ")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")
This code should run fine.

how to convert VertexRDD to DataFrame

I have a VertexRDD[DenseVector[Double]] and I want to convert it to a dataframe. I don't understand how to map the values from the DenseVector to new columns in a data frame.
I am trying to specify the schema as:
val schemaString = "id prop1 prop2 prop3 prop4 prop5 prop6 prop7"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
I think an option is to convert my VertexRDD - where the breeze.linalg.DenseVector holds all the values - into a RDD[Row], so that I can finally create a data frame like:
val myRDD = myvertexRDD.map(f => Row(f._1, f._2.toScalaVector().toSeq))
val mydataframe = SQLContext.createDataFrame(myRDD, schema)
But I get a
// scala.MatchError: 20502 (of class java.lang.Long)
Any hint more than welcome
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType, DoubleType}
val rows = myvertexRDD.map{
case(id, v) => Row.fromSeq(id +: v.toArray)
}
val schema = StructType(
StructField("id", LongType, false) +:
(1 to 7).map(i => StructField(s"prop$i", DoubleType, false)))
val df = sqlContext.createDataFrame(rows, schema)
Notes:
declared types have to match actual types. You cannot declare string and pass long or double
structure of the row has to match declared structure. In your case you're trying to create row with a Long and an Vector[Double] but declare 8 columns

Spark SQL: automatic schema from csv

does spark sql provide any way to automatically load csv data?
I found the following Jira: https://issues.apache.org/jira/browse/SPARK-2360 but it was closed....
Currently I would load a csv file as follows:
case class Record(id: String, val1: String, val2: String, ....)
sc.textFile("Data.csv")
.map(_.split(","))
.map { r =>
Record(r(0),r(1), .....)
}.registerAsTable("table1")
Any hints on the automatic schema deduction from csv files? In particular a) how can I generate a class representing the schema and b) how can I automatically fill it (i.e. Record(r(0),r(1), .....))?
Update:
I found a partial answer to the schema generation here:
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#data-sources
// The schema is encoded in a string
val schemaString = "name age"
// Generate the schema based on the string of schema
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
So the only question left would be how to do the step
map(p => Row(p(0), p(1).trim)) dynamically for the given number of attributes?
Thanks for your support!
Joerg
You can use spark-csv where you can save a few keystrokes not having to define the column names and auto-use the headers.
val schemaString = "name age".split(" ")
// Generate the schema based on the string of schema
val schema = StructType(schemaString.map(fieldName => StructField(fieldName, StringType, true)))
val lines = people.flatMap(x=> x.split("\n"))
val rowRDD = lines.map(line=>{
Row.fromSeq(line.split(" "))
})
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
May be this link will help you.
http://devslogics.blogspot.in/2014/11/spark-sql-automatic-schema-from-csv.html