Cast schema of a DataFrame in Spark and Scala

I want to cast the schema of a dataframe to change the type of some columns
using Spark and Scala.
Specifically, I am trying to use the as[U] function, whose description reads:
"Returns a new Dataset where each record has been mapped on to the specified type.
The method used to map columns depend on the type of U"
In principle this is exactly what I want, but I cannot get it to work.
Here is a simple example taken from
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
// definition of data
val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")
As expected the schema of data is:
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
I would like to cast the column "b" to Double. So I try the following:
import session.implicits._;
println(" --------------------------- Casting using (String Double)")
val data_TupleCast=data.as[(String, Double)]
data_TupleCast.show()
data_TupleCast.printSchema()
println(" --------------------------- Casting using ClassData_Double")
case class ClassData_Double(a: String, b: Double)
val data_ClassCast= data.as[ClassData_Double]
data_ClassCast.show()
data_ClassCast.printSchema()
As I understand the definition of as[U], the new Datasets should have the following schema:
root
|-- a: string (nullable = true)
|-- b: double (nullable = false)
But the output is
--------------------------- Casting using (String Double)
+---+---+
| a| b|
+---+---+
| a| 1|
| b| 2|
+---+---+
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
--------------------------- Casting using ClassData_Double
+---+---+
| a| b|
+---+---+
| a| 1|
| b| 2|
+---+---+
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
which shows that column "b" has not been cast to double.
Any hints on what I am doing wrong?
BTW: I am aware of the previous post "How to change column types in Spark SQL's DataFrame?" (see How to change column types in Spark SQL's DataFrame?). I know I can change the type of columns one at a time, but I am looking for a more general solution that changes the schema of the whole data in one shot (and I am trying to understand Spark in the process).

Well, since the calls are chained and Spark evaluates lazily,
it actually does change the schema of the whole data in one shot, even if you write it as changing one column at a time, like this:
import org.apache.spark.sql.types.{DoubleType, StringType}
import spark.implicits._
df.withColumn("x", 'x.cast(DoubleType)).withColumn("y", 'y.cast(StringType))...
As an alternative, I'm thinking you could use map to do your cast in one go, like:
df.map{t => (t._1, t._2.asInstanceOf[Double], t._3.asInstanceOf[], ...)}
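Another option, if you want the "whole schema in one shot" behaviour more literally: build the target schema as a StructType and cast every column in a single select. This is just a sketch (not taken from the question or the answer above) and assumes the target schema uses the same column names as the source DataFrame:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Cast every column of df to the type declared for it in the target schema
def castToSchema(df: DataFrame, target: StructType): DataFrame =
  df.select(target.fields.map(f => col(f.name).cast(f.dataType)): _*)

val targetSchema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", DoubleType)
))

val dataCast = castToSchema(data, targetSchema)
dataCast.printSchema()  // column "b" now shows up as double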

Related

What happens when data is loaded into Scala Spark and there is integer overflow?

How will Scala/Spark handle loading data that is larger than its assigned type in a schema? I.e., if I define a column with IntegerType but load a number from an external dataset that is larger than Scala's Int, will the program fail or will the data just be dropped?
I've tried this out to get the answer.
Test Data
name,id
"one",1234
"two",123456789
"three",1234567890123456123456789
Reading Data
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType().add("name", StringType).add("id", IntegerType)
val test = spark.read.option("header", "true").schema(schema).csv("test.csv")
Resulting read
test.show()
+----+---------+
|name| id|
+----+---------+
| one| 1234|
| two|123456789|
|null| null|
+----+---------+
test.printSchema()
root
|-- name: string (nullable = true)
|-- id: integer (nullable = true)
So since the value was too large for the declared integer type, that row ended up being read as null instead of the load failing.
Non-Nullable Field?
Even if a field in the schema cannot be null, Spark will still end up reading the data as null.
val schema2 = new StructType().add("name", StringType).add("id", IntegerType, false)
val test2 = spark.read...
:
:
test2.show()
+----+---------+
|name| id|
+----+---------+
| one| 1234|
| two|123456789|
|null| null|
+----+---------+
test2.printSchema()
root
|-- name: string (nullable = true)
|-- id: integer (nullable = true)
My understanding here is that Spark would rather treat the field as nullable when it sees data it cannot read, even if it is specified as a non-nullable field in the schema. This way, Spark reads the entire dataset rather than failing at runtime.
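If you would rather have the read fail loudly instead of silently nulling out rows, a sketch along these lines should work (the "mode" option is a standard CSV reader option, though its exact behaviour can differ between Spark versions, and the error only surfaces once an action runs):
val strictTest = spark.read
  .option("header", "true")
  .option("mode", "FAILFAST")  // default is PERMISSIVE, which nulls out malformed rows
  .schema(schema)              // reusing the schema defined above
  .csv("test.csv")

strictTest.show()  // throws on the oversized id instead of showing a null row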

Scala - How to convert a Dataset[Row] to a column that can be added to a Dataframe

I'm trying to add a one-column dataframe to a larger dataframe. The issue is that after creating the first dataframe and trying to add it to the main dataframe via the command:
df.withColumn("name", dataframe)
I get the error:
found   : org.apache.spark.sql.DataFrame
    (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Column
I understand that a Dataset[Row] is supposed to be synonymous with a DataFrame; however, I'm not sure how to get around this error.
For context, a (really) watered down version of my code is below:
// test function - will be used as part of the main script below
def Test(inputone: Double, inputtwo: Double): Double = {
  var test = (2 * inputone) + inputtwo
  test
}
For the main script (i.e. where the problem lies)
//Importing the data via CSV
var df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/root/file.csv")
To give a context of what the data looks like:
df: org.apache.spark.sql.DataFrame = [ID: int, blue: int ... 8 more fields]
+---+----+------+-----+------+------+----+---+-----+-----+
| ID|blue|purple|green|yellow|orange|pink|red|white|black|
+---+----+------+-----+------+------+----+---+-----+-----+
| 1| 500| 44| 0| 0| 3| 0| 5| 43| 2|
| 2| 560| 33| 1| 0| 4| 0| 22| 33| 4|
| 3| 744| 44| 1| 99| 3|1000| 78| 90| 0|
+---+----+------+-----+------+------+----+---+-----+-----+
root
|-- ID: integer (nullable = true)
|-- blue: integer (nullable = true)
|-- purple: integer (nullable = true)
|-- green: integer (nullable = true)
|-- yellow: integer (nullable = true)
|-- orange: integer (nullable = true)
|-- pink: integer (nullable = true)
|-- red: integer (nullable = true)
|-- white: integer (nullable = true)
|-- black: integer (nullable = true)
From then on, the script continues
// Creating a list for which columns to draw from the main dataframe
val a = List("green", "blue")
// Creating the mini dataframe to perform the function upon
val test_df = df.select(a.map(col): _*)
// The new dataframe will now go through the 'Test' function defined above
val df_function = test_df.rdd.map(col => Test(col(0).toString.toDouble, col(1).toString.toDouble))
// Converting the RDD output back to a dataframe (of one column)
val df_convert = df_function.toDF
As a reference, the output looks as follows
+-----+
|value|
+-----+
|500.0|
|562.0|
|746.0|
+-----+
The last line of the script is to add it to the main dataframe as follows
df = df.withColumn("new column", df_convert)
But as stated above, I receive the following error:
found : org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Column
//////////EDIT////////////
@user9819212's solution works for simplistic methods, but when calling one that is a bit more complex, I get the following error:
test2_udf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function5>,DoubleType,Some(List(DoubleType, IntegerType, StringType, DoubleType, DoubleType)))
java.lang.ClassCastException: $anonfun$1 cannot be cast to scala.Function1
So I tried to create another simplistic version of my code with a few extra changes to the test function that is called
// test function - will be used as part of the main script below
def Test(valueone: Double, valuetwo: Integer): Double = {
  val test = if (valuetwo > 2000) valueone + 4000 else valueone
  val fakeList = List(3000, 4000, 500000000)
  val index = fakeList.indexWhere(x => x >= valueone)
  val test2 = fakeList(index - 1) * valueone
  test2
}
val test_udf = udf(Test _)
df = df.withColumn(
"new column",
test_udf(col("green").cast("double"), col("blue").cast("integer"))
)
At first that seems to work but when I try to view the dataframe with the command
df.show
I get the following error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 153.0 failed 1 times, most recent failure: Lost task 0.0 in stage 153.0 (TID 192, localhost, executor driver):
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (double, int) => double)
You cannot add columns from another DataFrame this way. Just use a UserDefinedFunction:
import org.apache.spark.sql.functions._
val test_udf = udf(Test _)
df.withColumn(
"new column",
test_udf(col("green").cast("double"), col("blue").cast("double"))
)
Or, with such a simple function:
df.withColumn(
"new column",
2 * col("green").cast("double") + col("blue").cast("double")
)
If you look at the API documentation, it is clearly stated:
public DataFrame withColumn(java.lang.String colName, Column col)
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
As you can see, the second argument should be a Column, and you have been passing a DataFrame; that is the cause of the issue.
You are trying to add a column from df_convert to df, but the two dataframes are completely different. For that case you would have to look either at a join, if you want to keep the dataframes separate, or at Spark SQL functions that produce a Column to be used with the withColumn API.
Updated
Looking at your first log
test2_udf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function5>,DoubleType,Some(List(DoubleType, IntegerType, StringType, DoubleType, DoubleType)))
suggests that you have your udf function defined as
def Test(valueone: Double, valuetwo: Integer, valuethree: String, valuefour: Double, valuefive: Double): Double = {
  ??? // calculation parts
}
val test2_udf = udf(Test _)
//Test: Test[](val valueone: Double,val valuetwo: Integer,val valuethree: String,val valuefour: Double,val valuefive: Double) => Double
//test2_udf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function5>,DoubleType,Some(List(DoubleType, IntegerType, StringType, DoubleType, DoubleType)))
and your second log
java.lang.ClassCastException: $anonfun$1 cannot be cast to scala.Function1
suggests that you are passing only one argument in the test2_udf call as
df.withColumn("new column", test2_udf(col("green").cast("double"))).show(false)
//java.lang.ClassCastException: A$A30$A$A30$$anonfun$test2_udf$1 cannot be cast to scala.Function1
If you focus on the cannot be cast to scala.Function1 part of the error message, it clearly indicates how many columns were passed to the udf function (FunctionN means N arguments).
If you pass three arguments, then you get the following:
df.withColumn("new column", test2_udf(col("green").cast("double"),col("green").cast("double"),col("green").cast("double"))).show(false)
//java.lang.ClassCastException: A$A31$A$A31$$anonfun$test2_udf$1 cannot be cast to scala.Function3
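So the fix is simply to call test2_udf with exactly five columns, matching the parameter types the UDF was defined with. A hedged sketch (the column choices here are only illustrative, since I don't know which of your real columns feed the third, fourth and fifth parameters):
import org.apache.spark.sql.functions.col

df.withColumn(
  "new column",
  test2_udf(
    col("green").cast("double"),    // Double
    col("blue").cast("integer"),    // Integer
    col("ID").cast("string"),       // String (illustrative choice)
    col("yellow").cast("double"),   // Double (illustrative choice)
    col("orange").cast("double")    // Double (illustrative choice)
  )
).show(false)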

spark sql cast function creates column with NULLS

I have the following dataframe and schema in Spark
val df = spark.read.options(Map("header"-> "true")).csv("path")
scala> df.show()
+-------+-------+-----+
| user| topic| hits|
+-------+-------+-----+
| om| scala| 120|
| daniel| spark| 80|
|3754978| spark| 1|
+-------+-------+-----+
scala> df.printSchema
root
|-- user: string (nullable = true)
|-- topic: string (nullable = true)
|-- hits: string (nullable = true)
I want to change the column hits to integer
I tried this:
scala> df.createOrReplaceTempView("test")
val dfNew = spark.sql("select *, cast('hist' as integer) as hist2 from test")
scala> dfNew.printSchema
root
|-- user: string (nullable = true)
|-- topic: string (nullable = true)
|-- hits: string (nullable = true)
|-- hist2: integer (nullable = true)
but when I print the dataframe, the column hist2 is filled with NULLs:
scala> dfNew.show()
+-------+-------+-----+-----+
| user| topic| hits|hist2|
+-------+-------+-----+-----+
| om| scala| 120| null|
| daniel| spark| 80| null|
|3754978| spark| 1| null|
+-------+-------+-----+-----+
I also tried this:
scala> val df2 = df.withColumn("hitsTmp",
df.hits.cast(IntegerType)).drop("hits"
).withColumnRenamed("hitsTmp", "hits")
and got this:
<console>:26: error: value hits is not a member of org.apache.spark.sql.DataFrame
Also tried this:
scala> val df2 = df.selectExpr ("user","topic","cast(hits as int) hits")
and got this:
org.apache.spark.sql.AnalysisException: cannot resolve '`topic`' given input columns: [user, topic, hits]; line 1 pos 0;
'Project [user#0, 'topic, cast('hits as int) AS hits#22]
+- Relation[user#0, topic#1, hits#2] csv
with
scala> val df2 = df.selectExpr ("cast(hits as int) hits")
I get a similar error.
Any help will be appreciated. I know this question has been addressed before, but I tried 3 different approaches (published here) and none of them works.
Thanks.
How do we make the Spark cast throw an exception instead of generating all the null values?
Do I have to calculate the total number of null values before and after the cast in order to see whether the cast is actually successful?
This post How to test datatype conversion during casting is doing that. I wonder if there is a better solution here; a rough sketch of what I have in mind is below.
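One rough check I have in mind (just a sketch; the ANSI remark assumes Spark 3.x, where spark.sql.ansi.enabled=true makes invalid casts throw instead of returning null) is to count the rows where the source value was non-null but the cast result is null:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

val casted = df.withColumn("hits_int", col("hits").cast(IntegerType))

// Rows where a non-null string could not be converted to an integer
val failedCasts = casted
  .filter(col("hits").isNotNull && col("hits_int").isNull)
  .count()

if (failedCasts > 0) {
  throw new IllegalStateException(s"$failedCasts value(s) in 'hits' could not be cast to integer")
}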
You can cast a column to integer type in the following ways:
df.withColumn("hits", df("hits").cast("integer"))
Or
data.withColumn("hitsTmp",
data("hits").cast(IntegerType)).drop("hits").
withColumnRenamed("hitsTmp", "hits")
Or
data.selectExpr ("user","topic","cast(hits as int) hits")
I know that this answer probably won't be useful for the OP since it comes with a ~2 year delay. It might however be helpful for someone facing this problem.
Just like you, I had a dataframe with a column of strings which I was trying to cast to integer:
> df.show
+-------+
| id|
+-------+
|4918088|
|4918111|
|4918154|
...
> df.printSchema
root
|-- id: string (nullable = true)
But after doing the cast to IntegerType the only thing I obtained, just as you did, was a column of nulls:
> df.withColumn("test", $"id".cast(IntegerType))
.select("id","test")
.show
+-------+----+
| id|test|
+-------+----+
|4918088|null|
|4918111|null|
|4918154|null|
...
By default, if you try to cast a string that contains non-numeric characters to integer, the cast of the column won't fail, but those values will be set to null, as you can see in the following example:
> val testDf = sc.parallelize(Seq(("1"), ("2"), ("3A") )).toDF("n_str")
> testDf.withColumn("n_int", $"n_str".cast(IntegerType))
.select("n_str","n_int")
.show
+-----+-----+
|n_str|n_int|
+-----+-----+
| 1| 1|
| 2| 2|
| 3A| null|
+-----+-----+
The thing with our initial dataframe is that, at first sight, when we use the show method, we can't see any non-numeric characters. However, if you inspect a row from your dataframe you'll see something different:
> df.first
org.apache.spark.sql.Row = [4?9?1?8?0?8?8??]
Why is this happening? You are probably reading a CSV file with an unsupported encoding.
You can solve this by changing the encoding of the file you are reading. If that is not an option you can also clean each column before doing the type cast. An example:
> val df_cast = df.withColumn("test", regexp_replace($"id", "[^0-9]","").cast(IntegerType))
.select("id","test")
> df_cast.show
+-------+-------+
| id| test|
+-------+-------+
|4918088|4918088|
|4918111|4918111|
|4918154|4918154|
...
> df_cast.printSchema
root
|-- id: string (nullable = true)
|-- test: integer (nullable = true)
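As a follow-up to the encoding point above: the read-side fix the answer alludes to is the CSV reader's encoding option. A sketch only, since the actual charset of your file is an assumption here:
val dfFixed = spark.read
  .option("header", "true")
  .option("encoding", "UTF-16")  // hypothetical: use whatever charset the file is really in
  .csv("path")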
Try removing the quotes around hist (and note that the column is actually named hits). If that does not work, then try trimming the column:
val dfNew = spark.sql("select *, cast(trim(hits) as integer) as hist2 from test")
This response is delayed, but I was facing the same issue and this worked, so I thought I would put it here; it might help someone.
Try declaring the schema explicitly as a StructType. Reading from CSV files and providing an inferred schema via a case class gives weird errors for data types, even though all the data formats are properly specified.
I had a similar problem where I was casting Strings to integers but I realized I needed to cast them to longs instead. It was hard to realize this at first since my column's type was an int when I tried to print the type using
print(df.dtypes)
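For reference, a minimal sketch of the Int vs Long difference (the column name id is just illustrative): a numeric string above Int.MaxValue (2147483647) casts to null with IntegerType but keeps its value with LongType.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType}

val checked = df
  .withColumn("as_int",  col("id").cast(IntegerType))  // null for values above 2147483647
  .withColumn("as_long", col("id").cast(LongType))     // value is preserved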

Spark: Convert column of string to an array

How to convert a column that has been read as a string into a column of arrays?
i.e. convert from the schema below:
scala> test.printSchema
root
|-- a: long (nullable = true)
|-- b: string (nullable = true)
+---+---+
| a| b|
+---+---+
| 1|2,3|
+---+---+
| 2|4,5|
+---+---+
To:
scala> test1.printSchema
root
|-- a: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: long (containsNull = true)
+---+-----+
| a| b |
+---+-----+
| 1|[2,3]|
+---+-----+
| 2|[4,5]|
+---+-----+
Please share both scala and python implementation if possible.
On a related note, how do I take care of it while reading from the file itself?
I have data with ~450 columns, and a few of them I want to specify in this format.
Currently I am reading in pyspark as below:
df = spark.read.format('com.databricks.spark.csv').options(
header='true', inferschema='true', delimiter='|').load(input_file)
Thanks.
There are various methods.
The best way is to use the split function and cast to array<long>:
data.withColumn("b", split(col("b"), ",").cast("array<long>"))
You can also create a simple udf to convert the values:
val tolong = udf((value : String) => value.split(",").map(_.toLong))
data.withColumn("newB", tolong(data("b"))).show
Hope this helps!
Using a UDF would give you the exact required schema, like this:
val toArray = udf((b: String) => b.split(",").map(_.toLong))
val test1 = test.withColumn("b", toArray(col("b")))
It would give you schema as follows:
scala> test1.printSchema
root
|-- a: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: long (containsNull = true)
+---+-----+
| a| b |
+---+-----+
| 1|[2,3]|
+---+-----+
| 2|[4,5]|
+---+-----+
As far as applying the schema at file-read time itself is concerned, I think that is a tough task. So, for now, you can apply the transformation after reading test as a DataFrame.
I hope this helps!
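Regarding the "~450 columns, only a few of which should be arrays" part of the question: one possible sketch (not from the answers above; the column list is illustrative) is to read the file normally and then fold the split/cast over just the columns that should become arrays:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, split}

val arrayColumns = Seq("b")  // the handful of columns to convert

val test1: DataFrame = arrayColumns.foldLeft(test) { (acc, c) =>
  acc.withColumn(c, split(col(c), ",").cast("array<long>"))
}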
In python (pyspark) it would be:
from pyspark.sql.types import *
from pyspark.sql.functions import col, split
test = test.withColumn(
"b",
split(col("b"), ",\s*").cast("array<int>").alias("ev")
)

How to cast Array[Struct[String,String]] column type in Hive to Array[Map[String,String]]?

I've a column in a Hive table:
Column Name: Filters
Data Type:
|-- filters: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: string (nullable = true)
I want to get the value from this column by its corresponding name.
What I did so far:
val sdf: DataFrame = sqlContext.sql("select * from <tablename> where id='12345'")
val sdfFilters = sdf.select("filters").rdd.map(r => r(0).asInstanceOf[Seq[(String,String)]]).collect()
Output: sdfFilters: Array[Seq[(String, String)]] = Array(WrappedArray([filter_RISKFACTOR,OIS.SPD.*], [filter_AGGCODE,IR]), WrappedArray([filter_AGGCODE,IR_]))
Note: Casting to Seq because WrappedArray to Map conversion is not possible.
What to do next?
I want to get the value from this column by its corresponding name.
If you want a simple and reliable way to get all values by name, you can flatten your table using explode and filter:
case class Data(name: String, value: String)
case class Filters(filters: Array[Data])
val df = sqlContext.createDataFrame(Seq(Filters(Array(Data("a", "b"), Data("a", "c"))), Filters(Array(Data("b", "c")))))
df.show()
+--------------+
| filters|
+--------------+
|[[a,b], [a,c]]|
| [[b,c]]|
+--------------+
df.withColumn("filter", explode($"filters"))
.select($"filter.name" as "name", $"filter.value" as "value")
.where($"name" === "a")
.show()
+----+-----+
|name|value|
+----+-----+
| a| b|
| a| c|
+----+-----+
You can also collect your data any way you want:
val flatDf = df.withColumn("filter", explode($"filters")).select($"filter.name" as "name", $"filter.value" as "value")
flatDf.rdd.map(r => Array(r(0), r(1))).collect()
res0: Array[Array[Any]] = Array(Array(a, b), Array(a, c), Array(b, c))
flatDf.rdd.map(r => r(0) -> r(1)).groupByKey().collect() //not the best idea, if you have many values per key
res1: Array[(Any, Iterable[Any])] = Array((b,CompactBuffer(c)), (a,CompactBuffer(b, c)))
If you want to cast array[struct] to map[string, string] for later saving to some storage, that is a different story, and this case is better solved by a UDF (a sketch follows below). In any case, you should avoid collect() as long as possible to keep your code scalable.
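A sketch of that UDF approach (accessing the fields by name assumes the struct fields are called name and value, as in the schema above); on Spark 2.4+ the built-in map_from_entries function can do the same without a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Collapse array<struct<name,value>> into map<string,string>
val structsToMap = udf { (filters: Seq[Row]) =>
  filters.map(r => r.getAs[String]("name") -> r.getAs[String]("value")).toMap
}

val withMap = df.withColumn("filters_map", structsToMap(col("filters")))
withMap.printSchema()  // filters_map: map (key: string, value: string)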