When I want to rename columns of my DataFrame in Spark 2.2 and print its content using show(), I get the following errors:
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'cluster' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'project' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'client' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'twitter_mentioned_user' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'author' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'cluster' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 7)
scala.MatchError: Buffer(13145439) (of class scala.collection.convert.Wrappers$JListWrapper)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:61)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:58)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
Caused by: scala.MatchError: Buffer(13145439) (of class scala.collection.convert.Wrappers$JListWrapper)
I printed the schema and it looks as follows:
df_processed
.withColumn("srcId", toInt(df_processed("srcId")))
.withColumn("dstId", toInt(df_processed("dstId")))
.withColumn("attr", rand).printSchema()
Output:
root
|-- srcId: integer (nullable = true)
|-- dstId: integer (nullable = true)
|-- attr: double (nullable = false)
The error occurs when I run this code:
df_processed
.withColumn("srcId", toInt(df_processed("srcId")))
.withColumn("dstId", toInt(df_processed("dstId")))
.withColumn("attr", rand).show()
It occurs when I add .withColumn("attr", rand), but it works when I use .withColumn("attr2", lit(0)).
UPDATE:
df_processed.printSchema()
root
|-- srcId: double (nullable = true)
|-- dstId: double (nullable = true)
df_processed.show() does not give any error.
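Side note on the warnings above: they come from the elasticsearch-hadoop connector, and its documented es.read.field.as.array.include setting is one way to tell Spark which fields are array-backed. A minimal sketch, assuming df_processed is read via the elasticsearch-spark connector (the index name below is a placeholder):
// Hypothetical read; only the option name and field list are taken from the warnings above.
val df_processed = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.as.array.include", "cluster,project,client,twitter_mentioned_user,author")
  .load("my_index")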
Here is a similar example to what you are trying to do. To cast a column to a different data type you can use the cast function:
import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

val ds = Seq(
  (1.2, 3.5),
  (1.2, 3.5),
  (1.2, 3.5)
).toDF("srcId", "dstId")

ds.withColumn("srcId", $"srcId".cast(IntegerType))
  .withColumn("dstId", $"dstId".cast(IntegerType))
  .withColumn("attr", rand)
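As a quick check, printing the schema of the result should match what the asker printed above (the rand values themselves are of course nondeterministic):
ds.withColumn("srcId", $"srcId".cast(IntegerType))
  .withColumn("dstId", $"dstId".cast(IntegerType))
  .withColumn("attr", rand)
  .printSchema()
// root
//  |-- srcId: integer (nullable = true)
//  |-- dstId: integer (nullable = true)
//  |-- attr: double (nullable = false)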
Hope this helps!
You can add a UDF that generates the random value:
import org.apache.spark.sql.functions.udf

val getRandom = udf(() => scala.math.random)
val df2 = df.withColumn("randomColumn", getRandom())
I am trying to write the following pyspark dataframe to CSV using
df_final.write.csv("v1_results/run1",header=True, emptyValue='',nullValue='')
Please help me figure out why this issue occurs.
Below is the schema associated with the dataframe
root
|-- a: string (nullable = true)
|-- id: string (nullable = true)
|-- b_name: string (nullable = true)
|-- sim: float (nullable = true)
Below is the stack trace of the error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 17.0 failed 4 times, most recent failure: Lost task 16.3 in stage 17.0 (TID 1074) (cluster-a183-w-1.us-central1-c.c.analytics-online-data-sci-thd.internal executor 8): java.lang.RuntimeException: Error while decoding: java.lang.NullPointerException: Null value appeared in non-nullable field:
- array element class: "scala.Double"
- root class: "scala.collection.Seq"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
mapobjects(lambdavariable(MapObject, DoubleType, true, -1), assertnotnull(lambdavariable(MapObject, DoubleType, true, -1)), input[0, array<double>, true], Some(interface scala.collection.Seq))
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:186)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$scalaConverter$2(ScalaUDF.scala:159)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1160)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1214)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1217)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.writeIteratorToStream(PythonUDFRunner.scala:53)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)
Caused by: java.lang.NullPointerException: Null value appeared in non-nullable field:
- array element class: "scala.Double"
- root class: "scala.collection.Seq"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
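For what it's worth, this class of error can be reproduced in a few lines (a sketch with hypothetical data, not the asker's code): deserializing an array<double> that contains NULL into Seq[scala.Double] fails, while Seq[Option[Double]] works, which is what the "please try to use scala.Option[_]" hint refers to.
import spark.implicits._
import org.apache.spark.sql.functions.expr

// Hypothetical data: an array<double> column containing a NULL element.
val df = Seq("x").toDF("id")
  .withColumn("vals", expr("array(cast(1.0 as double), cast(null as double))"))

// Decoding into Seq[Double] hits the same NullPointerException,
// because scala.Double cannot represent NULL:
// df.as[(String, Seq[Double])].collect()

// Decoding into Seq[Option[Double]] succeeds:
df.as[(String, Seq[Option[Double]])].collect()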
I have a piece of code where, at the end, I write a dataframe to a parquet file.
The logic is such that the dataframe could sometimes be empty, and hence I get the error below.
df.write.format("parquet").mode("overwrite").save(somePath)
org.apache.spark.sql.AnalysisException: Parquet data source does not support null data type.;
When I print the schema of "df", I get the following:
df.schema
res2: org.apache.spark.sql.types.StructType =
StructType(
StructField(rpt_date_id,IntegerType,true),
StructField(rpt_hour_no,ShortType,true),
StructField(kpi_id,IntegerType,false),
StructField(kpi_scnr_cd,StringType,false),
StructField(channel_x_id,IntegerType,false),
StructField(brand_id,ShortType,true),
StructField(kpi_value,FloatType,false),
StructField(src_lst_updt_dt,NullType,true),
StructField(etl_insrt_dt,DateType,false),
StructField(etl_updt_dt,DateType,false)
)
Is there a workaround to just write the empty file with schema, or not write the file at all when empty?
Thanks
The error you are getting is not related to the fact that your dataframe is empty. I don't see the point of saving an empty dataframe, but you can do it if you want. Try this if you don't believe me:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(
  Array(
    StructField("col1", StringType, true),
    StructField("col2", StringType, false)
  )
)

spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  .write
  .format("parquet")
  .save("/tmp/test_empty_df")
You are getting that error because one of your columns is of NullType, and as the thrown exception indicates, "Parquet data source does not support null data type".
I can't know for sure why you have a column of NullType, but that usually happens when you read your data from a source and let Spark infer the schema. If that source contains an empty column, Spark won't be able to infer its type and will set it to NullType.
If this is what's happening, my advice is that you specify the schema on read.
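For instance, a sketch of specifying the schema up front on read (the source format, path, and the subset of columns here are placeholders):
import org.apache.spark.sql.types._

// Placeholder schema reusing a couple of the columns from the question.
val readSchema = StructType(Seq(
  StructField("kpi_id", IntegerType, nullable = false),
  StructField("src_lst_updt_dt", DateType, nullable = true)
))

val dfTyped = spark.read
  .schema(readSchema)
  .option("header", "true")
  .csv("/path/to/source.csv")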
If this is not the case, a possible solution is to cast all the columns of NullType to a parquet-compatible type (like StringType). Here is an example of how to do it:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{NullType, StringType}

// df is a dataframe with a column of NullType
val df = Seq(("abc", null)).toDF("col1", "col2")
df.printSchema
root
|-- col1: string (nullable = true)
|-- col2: null (nullable = true)
// fold left to cast all NullType columns to StringType
val df1 = df.columns.foldLeft(df) { (acc, cur) =>
  if (df.schema(cur).dataType == NullType)
    acc.withColumn(cur, col(cur).cast(StringType))
  else
    acc
}
df1.printSchema
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
Hope this helps
'Or not write the file at all when empty?' Check that df is not empty and only write it in that case:
if (!df.isEmpty)
df.write.format("parquet").mode("overwrite").save("somePath")
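Note that Dataset.isEmpty was only added in Spark 2.4; on older versions an equivalent and still cheap check is (a sketch):
// head(1) fetches at most one row, so this avoids a full count.
if (df.head(1).nonEmpty)
  df.write.format("parquet").mode("overwrite").save("somePath")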
I'm trying to retrieve a list from a row with the following schema element.
[info] |-- ARRAY_FIELD: array (nullable = false)
[info] | |-- element: string (containsNull = false)
When printing using
row.getAs[WrappedArray[String]]("ARRAY_FIELD")
I get the following result
WrappedArray(Some String value)
But when I attempt to print the data at that index as a list
using....
row.getList(0)
I get the following exception
java.lang.ClassCastException: java.math.BigDecimal cannot be cast to scala.collection.Seq
Does anyone have any ideas on why this happens and how it can be resolved?
I was actually pulling from the wrong index in the schema. I assumed that the index for getList was based on the index of the elements shown when using df.printSchema, but I was wrong. Off by 6 positions.
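One way to avoid guessing the position is to resolve the index from the field name via the Row API (a small sketch; row as in the question):
// Look up the position by field name instead of hard-coding it.
val idx = row.fieldIndex("ARRAY_FIELD")
val values: Seq[String] = row.getSeq[String](idx)
// or directly by name, as already used above:
// row.getAs[Seq[String]]("ARRAY_FIELD")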
I'm getting this null error in Spark Dataset.filter.
Input CSV:
name,age,stat
abc,22,m
xyz,,s
Working code:
case class Person(name: String, age: Long, stat: String)
val peopleDS = spark.read.option("inferSchema","true")
.option("header", "true").option("delimiter", ",")
.csv("./people.csv").as[Person]
peopleDS.show()
peopleDS.createOrReplaceTempView("people")
spark.sql("select * from people where age > 30").show()
Failing code (adding the following lines returns the error):
val filteredDS = peopleDS.filter(_.age > 30)
filteredDS.show()
Returns null error
java.lang.RuntimeException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "age")
- root class: "com.gcp.model.Person"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
The exception you get should explain everything, but let's go step by step:
When you load data using the csv data source, all fields are marked as nullable:
val path: String = ???
val peopleDF = spark.read
.option("inferSchema","true")
.option("header", "true")
.option("delimiter", ",")
.csv(path)
peopleDF.printSchema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- stat: string (nullable = true)
A missing field is represented as SQL NULL:
peopleDF.where($"age".isNull).show
+----+----+----+
|name| age|stat|
+----+----+----+
| xyz|null| s|
+----+----+----+
Next you convert Dataset[Row] to Dataset[Person], which uses Long to encode the age field. Long in Scala cannot be null. Because the input schema is nullable, the output schema stays nullable despite that:
val peopleDS = peopleDF.as[Person]
peopleDS.printSchema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- stat: string (nullable = true)
Note that as[T] doesn't affect the schema at all.
When you query the Dataset using SQL (on the registered table) or the DataFrame API, Spark won't deserialize the objects. Since the schema is still nullable, we can execute:
peopleDS.where($"age" > 30).show
+----+---+----+
|name|age|stat|
+----+---+----+
+----+---+----+
without any issues. This is just plain SQL logic and NULL is a valid value.
When we use statically typed Dataset API:
peopleDS.filter(_.age > 30)
Spark has to deserialize the objects. Because Long cannot be null (SQL NULL), it fails with the exception you've seen.
If it wasn't for that you'd get NPE.
The correct statically typed representation of your data should use optional types:
case class Person(name: String, age: Option[Long], stat: String)
with adjusted filter function:
peopleDS.filter(_.age.map(_ > 30).getOrElse(false))
+----+---+----+
|name|age|stat|
+----+---+----+
+----+---+----+
If you prefer, you can use pattern matching:
peopleDS.filter(_.age match {
  case Some(age) => age > 30
  case _         => false // or case None => false
})
Note that you don't have to (though it would be recommended anyway) use optional types for name and stat. Because Scala's String is just Java's String, it can be null. Of course, if you go with this approach you have to explicitly check whether the accessed values are null.
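For example, a minimal sketch of such an explicit check on a String field:
// name is a plain String and may be null, so guard it before dereferencing:
peopleDS.filter(p => p.name != null && p.name.startsWith("a"))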
Related: Spark 2.0 Dataset vs DataFrame
I have determined how to use the spark-shell to show the field names, but it's ugly and does not include the types:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
println(sqlContext.parquetFile(path))
prints:
ParquetTableScan [cust_id#114,blar_field#115,blar_field2#116], (ParquetRelation /blar/blar), None
You should be able to do this:
sqlContext.read.parquet(path).printSchema()
From Spark docs:
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
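If you want the field names together with their types programmatically rather than as printed output, the schema exposes both (a sketch; path as in the question):
val df = sqlContext.read.parquet(path)
df.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))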
OK, I think I have an OK way of doing it: just peek at the first row to infer the schema. (Though I'm not sure how elegant this is; what if it happens to be empty? I'm sure there has to be a better solution.)
sqlContext.parquetFile(p).first()
At some point prints:
{
optional binary cust_id;
optional binary blar;
optional double foo;
}
fileSchema: message schema {
optional binary cust_id;
optional binary blar;
optional double foo;
}
The result of parquetFile() is a SchemaRDD (1.2) or a DataFrame (1.3), both of which have the .printSchema() method.