Data format inconsistency during read/write parquet file with spark


Here is the schema of the input data that I read from the file myIntialFile.parquet with spark/scala:
val df = spark.read.format("parquet").load("/usr/sample/myIntialFile.parquet")
df.printSchema
root
|-- details: string (nullable = true)
|-- infos: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- text: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- value: string (nullable = true)
Then I just did a .select("infos") and wrote the dataframe out as a parquet file (let's say sparkProcessedFile.parquet). As expected, the schema of the infos column remained unchanged in Spark.
On the other hand, when I compare the schemas of myIntialFile.parquet and sparkProcessedFile.parquet using pyarrow, I don't get the same schema:
import pyarrow.parquet as pq
initialTable = pq.read_table('myIntialFile.parquet')
initialTable.schema
infos: list<array: struct<text: string, id: string, value: string> not null>
sparkProcessedTable = pq.read_table('sparkProcessedFile.parquet')
sparkProcessedTable.schema
infos: list<element: struct<text: string, id: string, value: string>>
I don't understand why there is a difference (list<array<struct>> instead of list<struct>) and why spark changed the initial nested structure with a simple select.
Thanks for any suggestions.

The actual data type didn't change. In both cases infos is a variable-sized list of structs. In other words, each value in the infos column is a list of structs.
Arguably, there isn't much point in the name array or element. I think different parquet readers/writers basically just make something up here. Note that pyarrow calls the field item when creating a new list type in memory:
>>> import pyarrow as pa
>>> pa.list_(pa.struct([pa.field('text', pa.string()), pa.field('id', pa.string()), pa.field('value', pa.string())]))
ListType(list<item: struct<text: string, id: string, value: string>>)
It appears that Spark is normalizing the "list element name" to element (or perhaps whatever is in its schema) regardless of what is actually in the parquet file. This seems like a reasonable thing to do (although one could imagine it causing a compatibility issue).
Perhaps more concerning is the fact that the list items changed from "not null" to nullable. Spark also reports the field as nullable, so either Spark has decided that all array columns are nullable or Spark has decided that the schema requires it to be nullable for some other reason.
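The nullability change described above can be observed from the Spark side with a small round trip. This is only a sketch: the data, the column names id and scores, and the temporary path are made up for illustration, and it assumes Spark 2.x or later with a local session.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("nullable-roundtrip").getOrCreate()
import spark.implicits._

// Hypothetical data: `id` comes from a Scala Long, so Spark marks it non-nullable,
// and `scores` holds Ints, so the array is created with containsNull = false.
val original = Seq((1L, Seq(1, 2)), (2L, Seq(3))).toDF("id", "scores")
original.printSchema()

// Round-trip through a temporary Parquet directory (path is made up for the example).
val out = Files.createTempDirectory("nullable_roundtrip").resolve("data.parquet").toString
original.write.parquet(out)
val roundTripped = spark.read.parquet(out)
roundTripped.printSchema()

// Compare the nullability flag of the same column before and after the round trip.
val idBefore = original.schema("id").nullable
val idAfter = roundTripped.schema("id").nullable
println(s"id nullable before: $idBefore, after: $idAfter")
```

Comparing the two printed schemas shows which flags Spark relaxed; the data itself is unchanged.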


Saving empty dataframe to parquet results in error - Spark 2.4.4

I have a piece of code where, at the end, I write a dataframe to a parquet file.
The logic is such that the dataframe could sometimes be empty, and hence I get the error below.
df.write.format("parquet").mode("overwrite").save(somePath)
org.apache.spark.sql.AnalysisException: Parquet data source does not support null data type.;
When I print the schema of "df", I get the following.
df.schema
res2: org.apache.spark.sql.types.StructType =
StructType(
StructField(rpt_date_id,IntegerType,true),
StructField(rpt_hour_no,ShortType,true),
StructField(kpi_id,IntegerType,false),
StructField(kpi_scnr_cd,StringType,false),
StructField(channel_x_id,IntegerType,false),
StructField(brand_id,ShortType,true),
StructField(kpi_value,FloatType,false),
StructField(src_lst_updt_dt,NullType,true),
StructField(etl_insrt_dt,DateType,false),
StructField(etl_updt_dt,DateType,false)
)
Is there a workaround to just write the empty file with schema, or not write the file at all when empty?
Thanks

The error you are getting is not related to the fact that your dataframe is empty. I don't see the point of saving an empty dataframe, but you can do it if you want. Try this if you don't believe me:
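import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val schema = StructType(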
Array(
StructField("col1",StringType,true),
StructField("col2",StringType,false)
)
)
spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
.write
.format("parquet")
.save("/tmp/test_empty_df")
You are getting that error because one of your columns (src_lst_updt_dt) is of NullType and, as the exception indicates, "Parquet data source does not support null data type".
I can't know for sure why you have a column of NullType, but that usually happens when you read your data from a source and let spark infer the schema. If that source contains an empty column, spark won't be able to infer its type and will set it to NullType.
If this is what's happening, my advice is to specify the schema on read.
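Specifying the schema on read could look roughly like the sketch below. The CSV file, its two columns, and the temporary paths are hypothetical stand-ins for the real source, which the question does not show; the column names are borrowed from the schema in the question.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, IntegerType, StructField, StructType}

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("schema-on-read").getOrCreate()

// Hypothetical input: a CSV whose src_lst_updt_dt column is empty in every row.
val dir = Files.createTempDirectory("schema_on_read")
val csvPath = dir.resolve("kpi.csv")
Files.write(csvPath, "kpi_id,src_lst_updt_dt\n1,\n2,\n".getBytes("UTF-8"))

// Declare the types up front instead of letting Spark infer them.
val schema = StructType(Seq(
  StructField("kpi_id", IntegerType, nullable = true),
  StructField("src_lst_updt_dt", DateType, nullable = true)
))

val df = spark.read.option("header", "true").schema(schema).csv(csvPath.toString)
df.printSchema()

// With a concrete DateType the column can be written to Parquet even though every value is null.
val outPath = dir.resolve("kpi.parquet").toString
df.write.mode("overwrite").parquet(outPath)
val written = spark.read.parquet(outPath)
```

The same .schema(...) call is available on the other built-in readers, so the idea carries over to sources other than CSV.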
If this is not the case, a possible solution is to cast all the columns of NullType to a parquet-compatible type (like StringType). Here is an example of how to do it:
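import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{NullType, StringType}
//df is a dataframe with a column of NullType (toDF needs spark.implicits._, which spark-shell imports for you)
val df = Seq(("abc",null)).toDF("col1", "col2")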
df.printSchema
root
|-- col1: string (nullable = true)
|-- col2: null (nullable = true)
//fold left to cast all NullType to StringType
val df1 = df.columns.foldLeft(df){
(acc,cur) => {
if(df.schema(cur).dataType == NullType)
acc.withColumn(cur, col(cur).cast(StringType))
else
acc
}
}
df1.printSchema
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
Hope this helps

'or not write the file at all when empty?' Check that df is not empty and only then write it (Dataset.isEmpty is available from Spark 2.4):
if (!df.isEmpty)
df.write.format("parquet").mode("overwrite").save("somePath")

Filter using Nested Struct Field

|-- data: struct (nullable = true)
| |-- keyNote: struct (nullable = true)
| | |-- key: string (nullable = true)
| | |-- note: string (nullable = true)
With the example structure above, how would I select the note field under the structs data and keyNote?
I need to filter using two different data frames and cannot seem to select a nested field. I am using Spark 1.6.2, where the left anti join isn't available, so I used the following filter. Below are the ways I have tried.
val dataFrame = esData.join(broadcastDataFrame, esData.select(esData.col("data.keyNote")).col("note") !== broadcastDataFrame("id"))
Error: Cannot resolve column name "note" among (keyNote)
val dataFrame = esData.join(broadcastDataFrame, esData.select(esData.col("data.keyNote.*")).col("note") !== broadcastDataFrame("id"))
Error: No such struct field * in key, note
val dataFrame = esData.join(broadcastDataFrame, esData("data.keyNote.note") !== broadcastDataFrame("id"))
java.lang.IllegalArgumentException: Field "note" does not exist.(..)
val dataFrame = esData.join(broadcastDataFrame, esData.select($"data.keyNote.note").col("note") !== broadcastDataFrame("id"))
Error: resolved attribute(s) note#9 missing from data#1,id#3 in operator !Join Inner, Some(NOT (note#9 = id#3))
The dataFrame used is created from Elastic Search (artifact: elastic-spark-13_2.10, Version:5.1.1)
val dataFrameES = context.read.format("org.elasticsearch.spark.sql")
.options(Map("es.read.field.exclude" ->
"<Excluding All the fields except those I need>"))
.load("<Index>/<Type>")
I attempted to use es.read.field.include, but nothing I tried retrieved the nested items; the only thing that worked was excluding everything else. I tried including data, data.keyNote, data.keyNote.key, and every permutation with a * wildcard after each. I am not sure whether this is a spark thing or an elastic search thing.
I thought the schema was being read wrong until I excluded all the unwanted fields and successfully retrieved the ones I wanted.
I now think it is the join, because I am able to grab that field with no errors in a filter like so:
esData.filter(esData("data.keyNote.key").equalTo("x"))
I just continue to get errors when I try to complete the join above, which is required because I have two data sets. Also, when I run the filter above right after creating the elastic search data frame, it takes far longer than running a curl.

The correct syntax is:
df1.join(df2, df1("x.y.z") !== df2("v"))
or
df1.join(df2).where(df1("x.y.z") !== df2("v"))
Full example
scala> :paste
// Entering paste mode (ctrl-D to finish)
val esData = sqlContext.read.json(sc.parallelize(Seq(
"""{"data": {"keyNote": {"key": "foo", "note": "bar"}}}""")))
val broadcastDataFrame = Seq((1L, "foo"), (2L, "bar")).toDF("n", "id")
esData.join(
broadcastDataFrame, esData("data.keyNote.note") !== broadcastDataFrame("id")
).show
// Exiting paste mode, now interpreting.
+-----------+---+---+
| data| n| id|
+-----------+---+---+
|[[foo,bar]]| 1|foo|
+-----------+---+---+
esData: org.apache.spark.sql.DataFrame = [data: struct<keyNote:struct<key:string,note:string>>]
broadcastDataFrame: org.apache.spark.sql.DataFrame = [n: bigint, id: string]
If you want an anti-join, it is better to use an outer join and filter out nulls.
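The outer-join-then-filter idea could be sketched as follows. This uses the SparkSession API (Spark 2.2 or later) to keep the example self-contained; on 1.6 the frames would be built through sqlContext as in the example above. The second JSON document is made up so that one row survives the filter.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("anti-join-sketch").getOrCreate()
import spark.implicits._

// Toy stand-ins for the two frames in the question (the second document is hypothetical).
val esData = spark.read.json(Seq(
  """{"data": {"keyNote": {"key": "foo", "note": "bar"}}}""",
  """{"data": {"keyNote": {"key": "qux", "note": "baz"}}}"""
).toDS())
val broadcastDataFrame = Seq((1L, "foo"), (2L, "bar")).toDF("n", "id")

// Left outer join on equality, then keep only the left rows that found no partner.
val unmatched = esData
  .join(broadcastDataFrame, esData("data.keyNote.note") === broadcastDataFrame("id"), "left_outer")
  .where(broadcastDataFrame("id").isNull)
  .select(esData("data"))

unmatched.show(false)
val remainingNotes = unmatched.select("data.keyNote.note").as[String].collect().toSeq
```

Note that the join condition is an equality here; the null filter is what turns it into "rows with no match".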

Spark 2 Dataset Null value exception

Getting this null error in spark Dataset.filter
Input CSV:
name,age,stat
abc,22,m
xyz,,s
Working code:
case class Person(name: String, age: Long, stat: String)
val peopleDS = spark.read.option("inferSchema","true")
.option("header", "true").option("delimiter", ",")
.csv("./people.csv").as[Person]
peopleDS.show()
peopleDS.createOrReplaceTempView("people")
spark.sql("select * from people where age > 30").show()
Failing code (Adding following lines return error):
val filteredDS = peopleDS.filter(_.age > 30)
filteredDS.show()
Returns null error
java.lang.RuntimeException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "age")
- root class: "com.gcp.model.Person"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).

The exception you get should explain everything, but let's go step by step:
When you load data using the csv data source, all fields are marked as nullable:
val path: String = ???
val peopleDF = spark.read
.option("inferSchema","true")
.option("header", "true")
.option("delimiter", ",")
.csv(path)
peopleDF.printSchema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- stat: string (nullable = true)
A missing field is represented as SQL NULL:
peopleDF.where($"age".isNull).show
+----+----+----+
|name| age|stat|
+----+----+----+
| xyz|null| s|
+----+----+----+
Next you convert the Dataset[Row] to a Dataset[Person], which uses Long to encode the age field. A Long in Scala cannot be null. Because the input schema is nullable, the output schema stays nullable despite that:
val peopleDS = peopleDF.as[Person]
peopleDS.printSchema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- stat: string (nullable = true)
Note that as[T] doesn't affect the schema at all.
When you query a Dataset using SQL (on a registered table) or the DataFrame API, Spark won't deserialize the object. Since the schema is still nullable, we can execute:
peopleDS.where($"age" > 30).show
+----+---+----+
|name|age|stat|
+----+---+----+
+----+---+----+
without any issues. This is just a plain SQL logic and NULL is a valid value.
When we use statically typed Dataset API:
peopleDS.filter(_.age > 30)
Spark has to deserialize the object. Because a Long cannot be null (SQL NULL), it fails with the exception you've seen.
If it wasn't for that check you'd get an NPE.
A correct statically typed representation of your data should use Option types:
case class Person(name: String, age: Option[Long], stat: String)
with adjusted filter function:
peopleDS.filter(_.age.map(_ > 30).getOrElse(false))
+----+---+----+
|name|age|stat|
+----+---+----+
+----+---+----+
If you prefer you can use pattern matching on the age field:
peopleDS.filter(_.age match {
case Some(age) => age > 30
case _ => false // or case None => false
})
Note that you don't have to use optional types for name and stat (although it is recommended anyway). Because a Scala String is just a Java String, it can be null. Of course, if you go with this approach, you have to explicitly check whether the accessed values are null.
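The Option handling above can be checked without Spark on a plain Scala collection. This is a sketch only: the third row is made up so that the filter keeps something, and exists is shown as an equivalent shorthand for the map/getOrElse form.

```scala
// Same shape as the corrected case class above; no Spark needed for this check.
case class Person(name: String, age: Option[Long], stat: String)

val people = Seq(
  Person("abc", Some(22L), "m"),
  Person("xyz", None, "s"),       // the row with the missing age
  Person("old", Some(41L), "m")   // hypothetical extra row, not in the original CSV
)

// Three equivalent ways to say "age is present and greater than 30".
val viaMap = people.filter(_.age.map(_ > 30).getOrElse(false))
val viaExists = people.filter(_.age.exists(_ > 30))
val viaMatch = people.filter(_.age match {
  case Some(age) => age > 30
  case None => false
})

println(viaMap.map(_.name))
```

All three predicates treat a missing age as "does not match", which mirrors how the SQL filter drops the NULL row.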
Related: Spark 2.0 Dataset vs DataFrame

How can I change column types in Spark SQL's DataFrame?

Suppose I'm doing something like:
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()
root
|-- year: string (nullable = true)
|-- make: string (nullable = true)
|-- model: string (nullable = true)
|-- comment: string (nullable = true)
|-- blank: string (nullable = true)
df.show()
year make model comment blank
2012 Tesla S No comment
1997 Ford E350 Go get one now th...
But I really wanted the year as Int (and perhaps transform some other columns).
The best I could come up with was
df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]
which is a bit convoluted.
I'm coming from R, and I'm used to being able to write, e.g.
df2 <- df %>%
mutate(year = year %>% as.integer,
make = make %>% toupper)
I'm likely missing something, since there should be a better way to do this in Spark/Scala...

Edit: Newest newest version
Since spark 2.x you should use the Dataset API when using Scala [1]. Check the docs here:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
If you are working with Python it is even easier; I leave the link here as this is a very highly voted question:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]
[1] https://spark.apache.org/docs/latest/sql-programming-guide.html:
In the Scala API, DataFrame is simply a type alias of Dataset[Row].
While, in Java API, users need to use Dataset to represent a
DataFrame.
Edit: Newest version
Since spark 2.x you can use .withColumn. Check the docs here:
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
Oldest answer
Since Spark version 1.4 you can apply the cast method with DataType on the column:
import org.apache.spark.sql.types.IntegerType
val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
.drop("year")
.withColumnRenamed("yearTmp", "year")
If you are using sql expressions you can also do:
val df2 = df.selectExpr("cast(year as int) year",
"make",
"model",
"comment",
"blank")
For more info check the docs:
http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame

[EDIT: March 2016: thanks for the votes! Though really, this is not the best answer, I think the solutions based on withColumn, withColumnRenamed and cast put forward by msemelman, Martin Senne and others are simpler and cleaner].
I think your approach is ok. Recall that a Spark DataFrame is an (immutable) RDD of Rows, so we're never really replacing a column, just creating a new DataFrame each time with a new schema.
Assuming you have an original df with the following schema:
scala> df.printSchema
root
|-- Year: string (nullable = true)
|-- Month: string (nullable = true)
|-- DayofMonth: string (nullable = true)
|-- DayOfWeek: string (nullable = true)
|-- DepDelay: string (nullable = true)
|-- Distance: string (nullable = true)
|-- CRSDepTime: string (nullable = true)
And some UDFs defined on one or several columns:
import org.apache.spark.sql.functions._
val toInt = udf[Int, String]( _.toInt)
val toDouble = udf[Double, String]( _.toDouble)
val toHour = udf((t: String) => "%04d".format(t.toInt).take(2).toInt )
val days_since_nearest_holidays = udf(
(year:String, month:String, dayOfMonth:String) => year.toInt + 27 + month.toInt-12
)
Changing column types or even building a new DataFrame from another can be written like this:
val featureDf = df
.withColumn("departureDelay", toDouble(df("DepDelay")))
.withColumn("departureHour", toHour(df("CRSDepTime")))
.withColumn("dayOfWeek", toInt(df("DayOfWeek")))
.withColumn("dayOfMonth", toInt(df("DayofMonth")))
.withColumn("month", toInt(df("Month")))
.withColumn("distance", toDouble(df("Distance")))
.withColumn("nearestHoliday", days_since_nearest_holidays(
df("Year"), df("Month"), df("DayofMonth"))
)
.select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth",
"month", "distance", "nearestHoliday")
which yields:
scala> featureDf.printSchema
root
|-- departureDelay: double (nullable = true)
|-- departureHour: integer (nullable = true)
|-- dayOfWeek: integer (nullable = true)
|-- dayOfMonth: integer (nullable = true)
|-- month: integer (nullable = true)
|-- distance: double (nullable = true)
|-- nearestHoliday: integer (nullable = true)
This is pretty close to your own solution. Simply keeping the type changes and other transformations as separate udf vals makes the code more readable and reusable.

As the cast operation is available for Spark Columns (and as I personally do not favour UDFs as proposed by @Svend at this point), how about:
df.select( df("year").cast(IntegerType).as("year"), ... )
to cast to the requested type? As a neat side effect, values that cannot be cast / "converted" in that sense will become null.
In case you need this as a helper method, use:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DataType
object DFHelper{
def castColumnTo( df: DataFrame, cn: String, tpe: DataType ) : DataFrame = {
df.withColumn( cn, df(cn).cast(tpe) )
}
}
which is used like:
import DFHelper._
val df2 = castColumnTo( df, "year", IntegerType )
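The in-place cast and the null-on-failure side effect mentioned above can be checked on a toy DataFrame. This is a sketch under assumptions: the rows (including the "n/a" year) are made up, and it assumes non-ANSI cast semantics, which is why the ANSI flag is switched off explicitly; with ANSI mode on, an invalid cast raises an error instead of producing null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.IntegerType

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Assumption: non-ANSI cast semantics, so an uncastable string becomes null.
spark.conf.set("spark.sql.ansi.enabled", "false")

// Hypothetical rows; the last year is deliberately not a number.
val cars = Seq(("2012", "Tesla"), ("1997", "Ford"), ("n/a", "Unknown")).toDF("year", "make")

// Reusing the existing column name replaces the column instead of adding a new one.
val typed = cars
  .withColumn("year", col("year").cast(IntegerType))
  .withColumn("make", upper(col("make")))

typed.printSchema()
typed.show()
val nullYears = typed.filter(col("year").isNull).count()
```

Counting the nulls after a cast is a cheap way to spot values that did not convert.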

First, if you want to cast a type, do this:
import org.apache.spark.sql
df.withColumn("year", $"year".cast(sql.types.IntegerType))
When the column name is the same, the column is replaced with the new one. You don't need separate add and delete steps.
Second, about Scala vs R.
This is the code most similar to R that I can come up with:
val df2 = df.select(
df.columns.map {
case year @ "year" => df(year).cast(IntegerType).as(year)
case make @ "make" => functions.upper(df(make)).as(make)
case other => df(other)
}: _*
)
Though the code is a little longer than R's, that has nothing to do with the verbosity of the language. In R, mutate is a special function for R data frames, while in Scala you can easily build one ad hoc thanks to the language's expressive power.
In a word, it avoids specific solutions, because the language design is good enough for you to quickly and easily build your own domain language.
Side note: df.columns is surprisingly an Array[String] instead of an Array[Column]; maybe they want it to look like Python pandas's dataframe.

You can use selectExpr to make it a little cleaner:
df.selectExpr("cast(year as int) as year", "upper(make) as make",
"model", "comment", "blank")

Java code for modifying the datatype of the DataFrame from String to Integer
df.withColumn("col_name", df.col("col_name").cast(DataTypes.IntegerType))
It will simply cast the existing column (of String datatype) to Integer.

I think this is a lot more readable.
import org.apache.spark.sql.types._
df.withColumn("year", df("year").cast(IntegerType))
This will convert your year column to IntegerType without creating any temporary columns and dropping them afterwards.
If you want to convert to any other datatype, you can check the types inside org.apache.spark.sql.types package.

To convert the year from string to int, you can add the following option to the csv reader: "inferSchema" -> "true", see DataBricks documentation

Generate a simple dataset containing five values and convert int to string type:
val df = spark.range(5).select( col("id").cast("string") )

This only really applies if you're having issues saving through a JDBC driver such as SQL Server, but it's really helpful for the syntax and type errors you will run into.
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._
val SQLServerDialect = new JdbcDialect {
override def canHandle(url: String): Boolean = url.startsWith("jdbc:jtds:sqlserver") || url.contains("sqlserver")
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
case StringType => Some(JdbcType("VARCHAR(5000)", java.sql.Types.VARCHAR))
case BooleanType => Some(JdbcType("BIT(1)", java.sql.Types.BIT))
case IntegerType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
case LongType => Some(JdbcType("BIGINT", java.sql.Types.BIGINT))
case DoubleType => Some(JdbcType("DOUBLE PRECISION", java.sql.Types.DOUBLE))
case FloatType => Some(JdbcType("REAL", java.sql.Types.REAL))
case ShortType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
case ByteType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
case BinaryType => Some(JdbcType("BINARY", java.sql.Types.BINARY))
case TimestampType => Some(JdbcType("DATE", java.sql.Types.DATE))
case DateType => Some(JdbcType("DATE", java.sql.Types.DATE))
// case DecimalType.Fixed(precision, scale) => Some(JdbcType("NUMBER(" + precision + "," + scale + ")", java.sql.Types.NUMERIC))
case t: DecimalType => Some(JdbcType(s"DECIMAL(${t.precision},${t.scale})", java.sql.Types.DECIMAL))
case _ => throw new IllegalArgumentException(s"Don't know how to save ${dt.json} to JDBC")
}
}
JdbcDialects.registerDialect(SQLServerDialect)

To the answers suggesting cast: FYI, the cast method in spark 1.4.1 is broken.
For example, a dataframe with a string column having the value "8182175552014127960", when cast to bigint, has the value "8182175552014128100":
df.show
+-------------------+
| a|
+-------------------+
|8182175552014127960|
+-------------------+
df.selectExpr("cast(a as bigint) a").show
+-------------------+
| a|
+-------------------+
|8182175552014128100|
+-------------------+
We had to face a lot of issues before finding this bug because we had bigint columns in production.
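The changed trailing digits look like what happens when a large integer passes through a double on the way to a long. That is an assumption about the cause, not something confirmed here, but the effect itself is easy to reproduce in plain Scala without Spark:

```scala
// The string value from the example above.
val text = "8182175552014127960"

// Exact parse: the value fits in a 64-bit Long.
val exact = text.toLong

// Lossy path: a Double has a 53-bit significand, so not every Long of this size is representable.
val viaDouble = text.toDouble.toLong

println(s"exact:      $exact")
println(s"viaDouble:  $viaDouble")
println(s"difference: ${viaDouble - exact}")
```

If a cast in your Spark version shows the same kind of drift, comparing against an exact parse like this on a few sample values is a quick sanity check.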

df.select($"long_col".cast(IntegerType).as("int_col"))

You can use below code.
df.withColumn("year", df("year").cast(IntegerType))
Which will convert year column to IntegerType column.

Using Spark Sql 2.4.0 you can do that:
spark.sql("SELECT STRING(NULLIF(column,'')) as column_string")

This method will drop the old column and create new columns with same values and new datatype. My original datatypes when the DataFrame was created were:-
root
|-- id: integer (nullable = true)
|-- flag1: string (nullable = true)
|-- flag2: string (nullable = true)
|-- name: string (nullable = true)
|-- flag3: string (nullable = true)
After this I ran following code to change the datatype:-
df=df.withColumnRenamed(<old column name>,<dummy column>) // This was done for both flag1 and flag3
df=df.withColumn(<old column name>,df.col(<dummy column>).cast(<datatype>)).drop(<dummy column>)
After this my result came out to be:-
root
|-- id: integer (nullable = true)
|-- flag2: string (nullable = true)
|-- name: string (nullable = true)
|-- flag1: boolean (nullable = true)
|-- flag3: boolean (nullable = true)

So many answers and not many thorough explanations.
The following syntax works in a Databricks notebook with Spark 2.4:
from pyspark.sql.functions import *
df = df.withColumn("COL_NAME", to_date(df["LOAD_DATE"], "MM-dd-yyyy"))
Note that you have to specify the format of your input (in my case "MM-dd-yyyy"), and the import is mandatory because to_date is a spark sql function.
I also tried this syntax but got nulls instead of a proper cast:
df = df.withColumn("COL_NAME", df["COL_NAME"].cast("Date"))
(Note that I had to use brackets and quotes for it to be syntactically correct.)
PS: I have to admit this is like a syntax jungle; there are many possible entry points, and the official API references lack proper examples.

Another solution is as follows:
1) Keep "inferSchema" as False
2) While running 'Map' functions on the row, you can read 'asString' (row.getString...)
//Read CSV and create dataset
Dataset<Row> enginesDataSet = sparkSession
.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema","false")
.load(args[0]);
JavaRDD<Box> vertices = enginesDataSet
.select("BOX","BOX_CD")
.toJavaRDD()
.map(new Function<Row, Box>() {
@Override
public Box call(Row row) throws Exception {
return new Box((String)row.getString(0),(String)row.get(1));
}
});

Why not just do as described under http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast
df.select(df.year.cast("int"),"make","model","comment","blank")

You can change the data type of a column by using cast in spark sql.
Suppose the table is named table, has only two columns, column1 and column2, and the data type of column1 is to be changed. For example:
spark.sql("select cast(column1 as Double) column1NewName,column2 from table")
Replace Double with the data type you need.

Another way:
// Generate a simple dataset containing five values and convert int to string type
val df = spark.range(5).select( col("id").cast("string")).withColumnRenamed("id","value")

In case you have to cast dozens of columns given by their names, the following example takes the approach of @dnlbrky and applies it to several columns at once:
df.selectExpr(df.columns.map(cn => {
if (Set("speed", "weight", "height").contains(cn)) s"cast($cn as double) as $cn"
else if (Set("isActive", "hasDevice").contains(cn)) s"cast($cn as boolean) as $cn"
else cn
}):_*)
Columns that are not cast are kept unchanged. All columns stay in their original order.

val fact_df = df.select($"data"(30) as "TopicTypeId", $"data"(31) as "TopicId",$"data"(21).cast(FloatType).as( "Data_Value_Std_Err")).rdd
//Schema to be applied to the table
val fact_schema = (new StructType).add("TopicTypeId", StringType).add("TopicId", StringType).add("Data_Value_Std_Err", FloatType)
val fact_table = sqlContext.createDataFrame(fact_df, fact_schema).dropDuplicates()

If you want to change multiple columns of a specific type to another type without specifying individual column names:
/* Get names of all columns that you want to change type.
In this example I want to change all columns of type Array to String*/
val arrColsNames = originalDataFrame.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)
//iterate columns you want to change type and cast to the required type
val updatedDataFrame = arrColsNames.foldLeft(originalDataFrame){(tempDF, colName) => tempDF.withColumn(colName, tempDF.col(colName).cast(DataTypes.StringType))}
//display
updatedDataFrame.show(truncate = false)

How to show the schema (including type) of a parquet file from command line or spark shell?

I have determined how to use the spark-shell to show the field names but it's ugly and does not include the type
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
println(sqlContext.parquetFile(path))
prints:
ParquetTableScan [cust_id#114,blar_field#115,blar_field2#116], (ParquetRelation /blar/blar), None

You should be able to do this:
sqlContext.read.parquet(path).printSchema()
From Spark docs:
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
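For a self-contained check, the sketch below writes a tiny Parquet file to a temporary directory and then prints its schema in a few forms. It uses the SparkSession API (Spark 2.x or later); the rows and the path are made up for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("parquet-schema").getOrCreate()
import spark.implicits._

// Hypothetical file: two rows with an age and a name column.
val path = Files.createTempDirectory("parquet_schema").resolve("people.parquet").toString
Seq((30L, "Andy"), (19L, "Justin")).toDF("age", "name").write.parquet(path)

val df = spark.read.parquet(path)

// Tree format, as in the docs excerpt above.
df.printSchema()

// One-line form of the same schema.
println(df.schema.simpleString)

// (column name, type name) pairs, handy for programmatic checks.
val columnTypes = df.dtypes.toMap
columnTypes.foreach { case (name, tpe) => println(s"$name: $tpe") }
```

Reading the schema this way only touches file metadata plus whatever the final actions need, so it also works when the file has no rows.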

OK, I think I have an OK way of doing it: just peek at the first row to infer the schema. (Though I'm not sure how elegant this is; what if the file happens to be empty? I'm sure there has to be a better solution.)
sqlContext.parquetFile(p).first()
At some point prints:
{
optional binary cust_id;
optional binary blar;
optional double foo;
}
fileSchema: message schema {
optional binary cust_id;
optional binary blar;
optional double foo;
}

The result of parquetFile() is a SchemaRDD (1.2) or DataFrame (1.3) which have the .printSchema() method.
