Spark DataFrame duplicate column names while using mergeSchema - scala

I have a huge Spark DataFrame which I create using the following statement
val df = sqlContext.read.option("mergeSchema", "true").parquet("parquet/partitions/path")
Now when I try to do a column rename or select operation on the above DataFrame, it fails saying ambiguous columns were found, with the following exception:
org.apache.spark.sql.AnalysisException: Reference 'Product_Type' is
ambiguous, could be Product_Type#13, Product_Type#235
Looking at the columns, I found two, Product_Type and Product_type, which appear to be the same column differing only in letter case, presumably created by schema merging over time. I don't mind keeping the duplicate columns, but the Spark sqlContext for some reason doesn't like it.
I believe the spark.sql.caseSensitive config is true by default, so I don't know why this fails. I am using Spark 1.5.2 and I am new to Spark.

By default, the spark.sql.caseSensitive property is false, so before your rename or select statement you should set it to true:
sqlContext.sql("set spark.sql.caseSensitive=true")

Cannot create a table having a column whose name contains commas in Hive metastore

I tried to create a Delta table from a Spark DataFrame using the command below:
destination_path = "/dbfs/mnt/kidneycaredevstore/delta/df_corr_feats_spark_4"
df_corr_feats_spark.write.format("delta").option("delta.columnMapping.mode", "name").option("path",destination_path).saveAsTable("CKD_Features_4")
I'm getting the error below:
AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. Table: default.abc_features_4; Column: Adverse, abc initial encounter
Please note that there are around 6k columns in this DataFrame; they are features generated by data scientists, so we cannot rename the columns.
How can I fix this error? Is there a metastore configuration change that would solve the issue?
The column mapping feature requires writer version 5 and reader version 2 of the Delta protocol, so you need to specify them when saving:
(df_corr_feats_spark.write.format("delta")
  .option("delta.columnMapping.mode", "name")
  .option("delta.minReaderVersion", "2")
  .option("delta.minWriterVersion", "5")
  .option("path", destination_path)
  .saveAsTable("CKD_Features_4"))

How does a Spark RDD map to a Cassandra table?

I am new to Spark, and recently I saw some code that saves RDD data to a Cassandra table. But I am not able to figure out how it does the column mapping. It neither uses a case class nor specifies any column names, as in the code below:
rdd
  .map(x => (x._1, x._2, x._3)) // x is a List here
  .repartitionByCassandraReplica(keyspace, tableName)
  .saveToCassandra(keyspace, tableName)
Since x here is simply a List[(Int, String, Int)], not a case class, there is no name-based mapping to the Cassandra table. So is there a definite column order in the Cassandra table that matches the order of the values we specify in the code?
This mapping relies on the order of columns in the Cassandra table definition, which is as follows:
Partition key columns, in the specified order
Clustering columns, in the specified order
The rest of the columns, sorted alphabetically by name
The Spark Cassandra Connector relies on these columns from the table definition being matched to the order of fields in the Scala tuple. You can see that in the source code of the TupleColumnMapper class.
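To make the ordering concrete, here is a hypothetical example (keyspace, table, and column names are invented for illustration) showing which tuple position each column is matched to:
// Hypothetical table definition:
//   CREATE TABLE ks.user_scores (
//     user_id int,          -- partition key      -> tuple._1 (Int)
//     name    text,         -- clustering column  -> tuple._2 (String)
//     score   int,          -- remaining column   -> tuple._3 (Int), alphabetical among the rest
//     PRIMARY KEY (user_id, name)
//   );
rdd
  .map(x => (x._1, x._2, x._3))                    // (user_id, name, score)
  .repartitionByCassandraReplica("ks", "user_scores")
  .saveToCassandra("ks", "user_scores")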

Insert Spark DF as column in existing hive table

I'm looking for a way to append a Spark DF as a column to an existing Hive table. I'm using the code below to overwrite the table, but it only works when the DF schema and the Hive table schema are equal; sometimes I need to add one column, and since the schemas then don't match, it does not work.
Is there a way to append the df as a column?
Or I have to make an ALTER TABLE ADD COLUMN in a spark.sql()?
temp = spark.table('temp')
temp.write.mode('overwrite').insertInto(datalab + '.' + table,overwrite=True)
Hope my question is understandable, thanks.
You can take a DataFrame with the new set of data that has the additional columns and append it to the existing table, in the following manner:
new_data_df = ...  # your DataFrame with the additional columns
new_data_df.write.mode('append').saveAsTable('same_table_name', mergeSchema=True)
So suppose the new column you added is 'column_new'; the older records in the table will have it set to null.
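A sketch of the same idea in Scala (table and column names are placeholders loosely taken from the thread; this assumes the target table format supports the mergeSchema write option for schema evolution on append, e.g. a Delta table):
import org.apache.spark.sql.functions.lit

// Existing data plus the new, initially empty column.
val newDataDf = spark.table("temp")
  .withColumn("column_new", lit(null).cast("string"))

newDataDf.write
  .mode("append")
  .option("mergeSchema", "true")
  .saveAsTable("datalab.some_table")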

Cannot read case-sensitive Glue table backed by Parquet

Spark version: 2.4.2 on Amazon EMR 5.24.0
I have a Glue Catalog table backed by an S3 Parquet directory. The Parquet files have case-sensitive column names (like lastModified). No matter what I do, I get lowercase column names (lastmodified) when reading the Glue Catalog table with Spark:
for {
  i <- Seq(false, true)
  j <- Seq("NEVER_INFER", "INFER_AND_SAVE", "INFER_ONLY")
  k <- Seq(false, true)
} {
  val spark = SparkSession.builder()
    .config("spark.sql.hive.convertMetastoreParquet", i)
    .config("spark.sql.hive.caseSensitiveInferenceMode", j)
    .config("spark.sql.parquet.mergeSchema", k)
    .enableHiveSupport()
    .getOrCreate()
  import spark.sql
  val df = sql("""SELECT * FROM ecs_db.test_small""")
  df.columns.foreach(println)
}
[1] https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0
[2] https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
Edit
The below solution is incorrect.
Glue Crawlers are not supposed to set the spark.sql.sources.schema.* properties; Spark SQL should. The default for spark.sql.hive.caseSensitiveInferenceMode in Spark 2.4 is INFER_AND_SAVE, which means that Spark infers the schema from the underlying files and alters the table to add the spark.sql.sources.schema.* properties to SERDEPROPERTIES. In our case, Spark failed to do so because of an IllegalArgumentException: Can not create a Path from an empty string, which is thrown because the Hive Database class instance has an empty locationUri property string. That, in turn, is because the Glue database does not have a Location property. After the schema is saved, Spark reads it from the table.
There could be a way around this, by setting INFER_ONLY, which should only infer the schema from the files and not attempt to alter the table SERDEPROPERTIES. However, this doesn't work because of a Spark bug, where the inferred schema is then lowercased (see here).
Original solution (incorrect)
This bug happens because the Glue table's SERDEPROPERTIES is missing two important properties:
spark.sql.sources.schema.numParts
spark.sql.sources.schema.part.0
To solve the problem, I had to add those two properties via the Glue console (couldn't do it with ALTER TABLE …)
I guess this is a bug with Glue crawlers, which do not set these properties when creating the table.
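For reference, a sketch of what those two properties contain: Spark stores the case-preserving schema as a JSON string split across numbered part properties. One way to produce that JSON with the correct casing is to build the schema by hand (the column names below are invented; use the table's real ones):
import org.apache.spark.sql.types._

// spark.sql.sources.schema.numParts = "1"
// spark.sql.sources.schema.part.0   = <the JSON printed below>
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("lastModified", TimestampType)   // hypothetical columns
))
println(schema.json)   // value for spark.sql.sources.schema.part.0, added via the Glue console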

pySpark jdbc write error: An error occurred while calling o43.jdbc. : scala.MatchError: null

I am trying to write a simple Spark DataFrame to a DB2 database using PySpark. The DataFrame has only one column, with double as its data type.
(Screenshots omitted: the DataFrame contains a single row and a single column, and its schema shows that column as double.)
When I try to write this dataframe to db2 table with this syntax:
dataframe.write.mode('overwrite').jdbc(url=url, table=source, properties=prop)
it creates the table in the database the first time without any issue, but if I run the code a second time, it throws the scala.MatchError: null exception shown in the title.
On the DB2 side the column datatype is also DOUBLE.
Not sure what I am missing.
I just changed a small part of the code and it worked perfectly.
Here is the small change I made to the syntax:
dataframe.write.jdbc(url=url, table=source, mode='overwrite', properties=prop)