Spark cast all LongType columns to IntegerType dynamically - scala

I don't have schema information, because I have different tables and the DataFrame is created from their data. So I want to detect every LongType column and cast it to IntegerType.
My approach is to create a new DataFrame with a new schema in which the LongType fields are converted to IntegerType.
val df = spark.read.format("bigquery").load(sql)
// cast long types to int
val newSchemaArr = df.schema.fields.map(f => if(f.dataType.isInstanceOf[LongType]) StructField(name = f.name, dataType = IntegerType, nullable = f.nullable) else f)
val newSchema = new StructType(newSchemaArr)
val df2 = spark.createDataFrame(df.rdd, newSchema)
// write to hdfs files
df2.write.format("avro").save(destinationPath)
But I get this error when writing the data:
Caused by: java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of int
Is there a way to fix this, or another approach to handle the problem?
Spark version: 3.2.0
Scala version: 2.12

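The rows in df.rdd still hold java.lang.Long values; createDataFrame only attaches the new schema and does not convert them, which is why the write fails. The easiest fix is to cast the columns that need it:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType, StructField}
// cast long columns to integer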
val columns = df.schema.map {
case StructField(name, LongType, _, _) => col(name).cast(IntegerType)
case f => col(f.name)
}
// write modified columns
df.select(columns: _*).write.format("avro").save(destinationPath)

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType}
val sch = df.schema
val df2 = sch.fieldNames.foldLeft(df) { (tmpDF, colName) =>
if (tmpDF.schema(colName).dataType == LongType)
tmpDF.withColumn(colName, col(colName).cast(IntegerType))
else tmpDF
}
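Both answers can be wrapped in a small reusable helper. The following is only a sketch under stated assumptions: the names CastLongsDemo and castLongsToInt are hypothetical, the local[1] session and the sample data exist only for the demonstration, and only top-level columns are handled (LongType fields nested inside structs or arrays are left alone). Column names containing dots would additionally need backtick quoting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType, StructField}

object CastLongsDemo {
  // Hypothetical helper: cast every top-level LongType column to IntegerType,
  // keeping column order and names unchanged.
  def castLongsToInt(df: DataFrame): DataFrame = {
    val columns = df.schema.fields.map {
      case StructField(name, LongType, _, _) => col(name).cast(IntegerType).as(name)
      case f                                 => col(f.name)
    }
    df.select(columns: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its own session.
    val spark = SparkSession.builder().master("local[1]").appName("cast-longs").getOrCreate()
    import spark.implicits._

    // Sample data standing in for the BigQuery table.
    val df = Seq((1L, "a", 2.5), (2L, "b", 3.5)).toDF("id", "name", "score")
    val out = castLongsToInt(df)
    out.printSchema()

    assert(out.schema("id").dataType == IntegerType)
    spark.stop()
  }
}
```

Keep in mind that a long value outside the Int range does not fit in the target type, so it is worth checking the value range of the source columns before narrowing them.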

Related

Infer Schema from rdd to Dataframe in Spark Scala

This question follows on from (Spark - creating schema programmatically with different data types).
I am trying to infer a schema when converting an RDD to a DataFrame. Below is my code:
def inferType(field: String) = field.split(":")(1) match {
case "Integer" => IntegerType
case "Double" => DoubleType
case "String" => StringType
case "Timestamp" => TimestampType
case "Date" => DateType
case "Long" => LongType
case _ => StringType
}
val header = "c1:String|c2:String|c3:Double|c4:Integer|c5:String|c6:Timestamp|c7:Long|c8:Date"
val df1 = Seq(("a|b|44.44|5|c|2018-01-01 01:00:00|456|2018-01-01")).toDF("data")
val rdd1 = df1.rdd.map(x => Row(x.getString(0).split("\\|"): _*))
val schema = StructType(header.split("\\|").map(column => StructField(column.split(":")(0), inferType(column), true)))
val df = spark.createDataFrame(rdd1, schema)
df.show()
When I call show, it throws the error below. I have to perform this operation on larger-scale data and am having trouble finding the right solution. Can anybody please help me find a solution, or another way to achieve this?
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of int
Thanks in advance
Short answer: declaring a type in the schema does not parse string values into that type.
What you are trying to do is parse a string into SQL columns. The other example differs in that it loads its data from CSV, whereas here you are only splitting a string, so every field is still a String.
Working version can be achieved like this:
// skipped other details such as schematype, spark session...
val header = "c1:String|c2:String|c3:Double|c4:Integer"
// Create `Row` from `Seq`
val row = Row.fromSeq(Seq("a|b|44.44|12|"))
// Create `RDD` from `Row`
val rdd: RDD[Row] = spark.sparkContext
.makeRDD(List(row))
.map { row =>
row.getString(0).split("\\|") match {
case Array(col1, col2, col3, col4) =>
Row.fromTuple(col1, col2, col3.toDouble, col4.toInt)
}
}
val stt: StructType = StructType(
header
.split("\\|")
.map(column => StructField(column.split(":")(0), inferType(column), true)) // use only the name part of "name:Type"
)
val dataFrame = spark.createDataFrame(rdd, stt)
dataFrame.show()
The Row is built from Scala values whose types already match the schema (Double for DoubleType, Int for IntegerType), because createDataFrame does not convert them for you.
Note that I skipped the date- and time-related fields, since date conversions are tricky. You can check my other answer on how to use formatted dates and timestamps here
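For the full header from the question, the per-field conversion could be driven by the type names themselves. This is a minimal sketch, not the answerer's method: TypedFields, convert and convertLine are hypothetical names, and it assumes the date and timestamp strings use the default JDBC formats that java.sql.Date.valueOf and java.sql.Timestamp.valueOf accept. With Spark's default settings, java.sql.Date and java.sql.Timestamp are the external types expected for DateType and TimestampType.

```scala
import java.sql.{Date, Timestamp}

object TypedFields {
  // Convert one raw string to the JVM type matching the declared type name.
  // Unknown type names fall back to the raw string, mirroring inferType.
  def convert(typeName: String, raw: String): Any = typeName match {
    case "Integer"   => raw.toInt
    case "Long"      => raw.toLong
    case "Double"    => raw.toDouble
    case "Timestamp" => Timestamp.valueOf(raw) // expects yyyy-[m]m-[d]d hh:mm:ss[.f...]
    case "Date"      => Date.valueOf(raw)      // expects yyyy-[m]m-[d]d
    case _           => raw
  }

  // Pair each header entry ("name:Type") with the field at the same position.
  def convertLine(header: String, line: String): Seq[Any] = {
    val types  = header.split("\\|").map(_.split(":")(1))
    val fields = line.split("\\|", -1) // -1 keeps trailing empty fields
    types.zip(fields).map { case (t, v) => convert(t, v) }.toSeq
  }
}
```

The result can then be handed to Row.fromSeq inside the map over the RDD, so that each Row carries values of the types the schema declares. Malformed input still throws here; wrap the conversions in scala.util.Try if bad records should be tolerated.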

How to read a BigDecimal type in spark sql [duplicate]

What is the correct DataType to use for reading from a schema listed as Decimal - and with underlying java type of BigDecimal ?
Here is the schema entry for that field:
-- realmId: decimal(38,9) (nullable = true)
When I tried a java.lang.Long it ends up with the following error:
java.lang.ClassCastException: java.math.BigDecimal cannot be cast to java.lang.Long
I noticed there is a DecimalType but it extends AbstractDataType and not DataType and it is not clear how to specify it as a return type.
Here is the tricky part: it is actually the way you match on DecimalType that is unusual (note the parentheses in case DecimalType()).
import org.apache.spark.SparkContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}
val spark: SparkSession = SparkSession.builder().getOrCreate()
val sc: SparkContext = spark.sparkContext
def rg(r: Row, fname: String, ftype: DataType = StringType) = ftype match {
case StringType => r.getString(r.schema.fieldIndex(fname))
case DecimalType() => r.getDecimal(r.schema.fieldIndex(fname))
case _ => "error"
}
Let's now test that. First we need to create our decimal type as follows:
val decimalType : DecimalType = DataTypes.createDecimalType(15, 10)
val sch = StructType(StructField("x1", StringType, true) :: StructField("x2", decimalType, true) :: Nil)
val row = sc.parallelize(Seq("abc,0.352", "def,0.27", "foo,8.35", "bar,-153.890"))
.map(x => x.split(",")).map(x => Row(x(0), BigDecimal.decimal(x(1).toDouble)))
val df = spark.createDataFrame(row, sch)
// df: org.apache.spark.sql.DataFrame = [x1: string, x2: decimal(15,10)]
Let's now check what that function does:
println(rg(df.first(), "x2", decimalType))
// 0.3520000000
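For the decimal(38,9) field from the question, the value can also be read by name as a java.math.BigDecimal and converted explicitly afterwards. The sketch below makes several assumptions: DecimalReadDemo and readDecimal are hypothetical names, the single-row DataFrame stands in for the real data, and the local[1] session is for illustration only.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

object DecimalReadDemo {
  // Hypothetical helper: read a decimal column by name.
  // Rows obtained from a DataFrame carry their schema, so lookup by name works.
  def readDecimal(row: Row, name: String): java.math.BigDecimal =
    row.getAs[java.math.BigDecimal](name)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("decimal-read").getOrCreate()

    // Same type as in the question: decimal(38,9), nullable.
    val schema = StructType(Seq(StructField("realmId", DecimalType(38, 9), nullable = true)))
    val rows   = spark.sparkContext.parallelize(Seq(Row(new java.math.BigDecimal("123.45"))))
    val df     = spark.createDataFrame(rows, schema)

    val realmId = readDecimal(df.first(), "realmId")
    println(realmId)

    // Convert explicitly instead of casting the object to java.lang.Long.
    val asLong: Long = realmId.longValue() // drops the fractional part
    println(asLong)

    spark.stop()
  }
}
```

If the column is nullable, check row.isNullAt(row.fieldIndex(name)) before converting, since the returned reference is null for missing values.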

Error in creating dataframe: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string

I have created a schema with following code
val schema= new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
Created a RDD from
val data = spark.sparkContext.textFile("cities.txt")
Converted to RDD of Row to apply schema
val cities = data.map(line => line.split(";")).map(row => Row.fromSeq(row.zip(schema.toSeq)))
val citiesRDD = spark.sqlContext.createDataFrame(cities, schema)
This gives me an error
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string
You don't need a schema to create a Row; you need the schema when you create the DataFrame. You also need some logic to convert your split line (which produces 3 strings) into the declared types, here two integers.
Here is a minimal solution without exception handling:
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
val schema = new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
val cities = data.map(line => {
val Array(city,female,male) = line.split(";")
Row(
city,
female.toInt,
male.toInt
)
}
)
val citiesDF = sqlContext.createDataFrame(cities, schema)
I normally use case-classes to create a dataframe, because spark can infer the schema from the case class:
// "schema" for dataframe, define outside of main method
case class MyRow(city:Option[String],female:Option[Int],male:Option[Int])
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
import sqlContext.implicits._
val citiesDF = data.map(line => {
val Array(city,female,male) = line.split(";")
MyRow(
Some(city),
Some(female.toInt),
Some(male.toInt)
)
}
).toDF()
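Since the versions above skip exception handling, here is one way the parsing step might tolerate bad input. It is only a sketch: CityRow and CityParser are hypothetical names (CityRow has the same shape as MyRow), lines with the wrong number of fields are dropped, and numbers that fail to parse become None.

```scala
import scala.util.Try

// Same shape as MyRow above; Option fields map to nullable columns.
final case class CityRow(city: Option[String], female: Option[Int], male: Option[Int])

object CityParser {
  // Returns None for lines that do not have exactly three fields.
  def parse(line: String): Option[CityRow] = line.split(";", -1) match {
    case Array(city, female, male) =>
      Some(CityRow(
        Option(city.trim).filter(_.nonEmpty), // empty city becomes None
        Try(female.trim.toInt).toOption,      // non-numeric value becomes None
        Try(male.trim.toInt).toOption
      ))
    case _ => None
  }
}
```

With the implicits imported as above, this could be used as data.flatMap(line => CityParser.parse(line)).toDF(), which silently skips malformed lines. Whether skipping is acceptable depends on the data; logging or counting the rejected lines may be preferable.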

Scala java.lang.String cannot be cast to java.lang.Double error when converting double type dataframe to LabeledPoint in Spark

I have a dataset of 2002 variables. All variables are numeric. I first read the dataset into Spark 1.5.0 and created a DataFrame of DoubleType columns following the instructions here. Then I converted the DataFrame to LabeledPoint following the instructions here and here. However, when I tried to print out sample rows of the generated LabeledPoint, I got the "java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double" error. Below is the Scala code I used. Sorry for the long code, but I hope it helps with debugging.
Could anyone please tell me where the error is coming from and how to resolve the problem? Thank you very much for your help!
Below is the Scala code I used:
// Read in dataset but drop the header row
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val trainRDD = sc.textFile("train.txt").filter(line => !line.contains("target"))
// Read in header file to get column names. Store in an Array.
val dictFile = "header.txt"
var arrName = new Array[String](2002)
for (line <- Source.fromFile(dictFile).getLines) {
arrName = line.split('\t').map(_.trim).toArray
}
// Create dataframe using programmatically specifying the schema method
// Encode schema in a string
var schemaString = arrName.mkString(" ")
// Import Row
import org.apache.spark.sql.Row
// Import RDD
import org.apache.spark.rdd.RDD
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,LongType,FloatType,DoubleType}
// Generate the Double Type schema based on the string of schema
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))
// Create rowRDD and convert String type to Double type
val arrVar = sc.broadcast(0 to 2001 toArray)
def createRowRDD(rdd:RDD[String], anArray:org.apache.spark.broadcast.Broadcast[Array[Int]]) : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
val rowRDD = rdd.map(_.split("\t")).map(_.map({y => y.toDouble})).map(p => Row.fromSeq(anArray.value map p))
return rowRDD
}
val rowRDDTrain = createRowRDD(trainRDD, arrVar)
// Apply the schema to the RDD.
val trainDF = sqlContext.createDataFrame(rowRDDTrain, schema)
trainDF.printSchema
// Verified all 2002 variables are in "double (nullable = true)" format
// Define toLabeledPoint( ) to convert dataframe to LabeledPoint format
// Reference: https://stackoverflow.com/questions/31638770/rdd-to-labeledpoint-conversion
def toLabeledPoint(dataDF:org.apache.spark.sql.DataFrame) : org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = {
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val targetInd = dataDF.columns.indexOf("target")
val ignored = List("target")
val featInd = dataDF.columns.diff(ignored).map(dataDF.columns.indexOf(_))
val dataLP = dataDF.rdd.map(r => LabeledPoint(r.getDouble(targetInd),
Vectors.dense(featInd.map(r.getDouble(_)).toArray)))
return dataLP
}
// Create LabeledPoint from dataframe
val trainLP = toLabeledPoint(trainDF)
// Print out sample rows in the generated LabeledPoint
trainLP.take(5).foreach(println)
// Failed: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
Update:
Thanks a lot to David Griffin and zero323 for their comments below. David is correct: the exception is indeed caused by null values in the data. I replaced the following original code:
def createRowRDD(rdd:RDD[String], anArray:org.apache.spark.broadcast.Broadcast[Array[Int]]) : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
val rowRDD = rdd.map(_.split("\t")).map(_.map({y => y.toDouble})).map(p => Row.fromSeq(anArray.value map p))
return rowRDD
}
with this one to impute null values to 0.0 and then the problem is gone:
def createRowRDD(rdd:RDD[String], anArray:org.apache.spark.broadcast.Broadcast[Array[Int]]) : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
val rowRDD = rdd.map(_.split("\t")).map(_.map({y => try {y.toDouble} catch {case _ : Throwable => 0.0}})).map(p => Row.fromSeq(anArray.value map p))
return rowRDD
}
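Catching Throwable also swallows errors unrelated to parsing, so a narrower variant may be safer. The sketch below is an alternative to the code above, not the poster's version: ParseDoubles and parseLine are hypothetical names. It uses scala.util.Try, which only catches non-fatal exceptions, and passes -1 to split because String.split otherwise drops trailing empty fields, which would make rows with missing trailing values shorter than the schema.

```scala
import scala.util.Try

object ParseDoubles {
  // Hypothetical helper: parse one tab-separated line into doubles,
  // replacing anything that is not a number with `default`.
  def parseLine(line: String, default: Double = 0.0): Array[Double] =
    line.split("\t", -1) // -1 keeps trailing empty fields
      .map(s => Try(s.trim.toDouble).getOrElse(default))

  def main(args: Array[String]): Unit = {
    // The second and fourth fields are empty and fall back to 0.0.
    println(parseLine("1.5\t\t3.0\t").mkString(", ")) // prints "1.5, 0.0, 3.0, 0.0"
  }
}
```

Inside createRowRDD this would replace the split and toDouble steps, for example rdd.map(line => Row.fromSeq(ParseDoubles.parseLine(line))). Imputing 0.0 is the choice made in the update above; whether it is appropriate for model training depends on the data.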

how to convert VertexRDD to DataFrame

I have a VertexRDD[DenseVector[Double]] and I want to convert it to a dataframe. I don't understand how to map the values from the DenseVector to new columns in a data frame.
I am trying to specify the schema as:
val schemaString = "id prop1 prop2 prop3 prop4 prop5 prop6 prop7"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
I think an option is to convert my VertexRDD - where the breeze.linalg.DenseVector holds all the values - into a RDD[Row], so that I can finally create a data frame like:
val myRDD = myvertexRDD.map(f => Row(f._1, f._2.toScalaVector().toSeq))
val mydataframe = SQLContext.createDataFrame(myRDD, schema)
But I get a
// scala.MatchError: 20502 (of class java.lang.Long)
Any hints are more than welcome.
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType, DoubleType}
val rows = myvertexRDD.map{
case(id, v) => Row.fromSeq(id +: v.toArray)
}
val schema = StructType(
StructField("id", LongType, false) +:
(1 to 7).map(i => StructField(s"prop$i", DoubleType, false)))
val df = sqlContext.createDataFrame(rows, schema)
Notes:
The declared types have to match the actual types. You cannot declare a string and pass a long or a double.
The structure of the row has to match the declared structure. In your case you are trying to create a row with a Long and a Vector[Double] (two fields) while declaring 8 columns.
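The second note can be illustrated without Spark. In this sketch a plain Array[Double] stands in for the breeze DenseVector (an assumption made to keep the example self-contained), and FlattenVertex and flatten are hypothetical names. It contrasts the nested shape from the question with the flattened shape the 8-column schema requires.

```scala
object FlattenVertex {
  // Flatten (id, values) into one sequence: the id first,
  // then each value as its own element.
  def flatten(id: Long, values: Array[Double]): Seq[Any] = id +: values.toSeq

  def main(args: Array[String]): Unit = {
    val values = Array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)

    // Shape from the question: one id plus one nested collection.
    val nested: Seq[Any] = Seq(20502L, values.toSeq)
    // Shape the schema expects: id plus seven separate doubles.
    val flat: Seq[Any] = flatten(20502L, values)

    println(nested.length) // prints 2
    println(flat.length)   // prints 8
  }
}
```

Row.fromSeq(id +: v.toArray) in the answer above produces the flattened shape, which is why it lines up with a schema of one LongType column and seven DoubleType columns.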