Unable to remove spaces from column names in Spark Scala

I have a parquet dataset whose column names contain spaces between words, e.g. BRANCH NAME. When I replace the space with "_" and try to print the column, it results in an error. Below is my code with multiple approaches, followed by the error:
Approach 1:
var df = spark.read.parquet("s3://tvsc-lumiq-edl/raw-v2/LMSDB/DESUSR/TBL_DES_SLA_MIS1")
for (c <- df.columns) {
  df = df.withColumnRenamed(c, c.replace(" ", ""))
}
Approach 2:
df = df.columns.foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "")))
Approach 3:
val new_cols = df.columns.map(x => x.replaceAll(" ", ""))
val df2 = df.toDF(new_cols : _*)
Error:
org.apache.spark.sql.AnalysisException: Attribute name "BRANCH NAME" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
Below is the schema:
scala> df.printSchema()
root
|-- dms_timestamp: string (nullable = true)
|-- BRANCH NAME: string (nullable = true)
|-- BRANCH CODE: string (nullable = true)
|-- DEALER NAME: string (nullable = true)
|-- DEALER CODE: string (nullable = true)
|-- DEALER CATEGORY: string (nullable = true)
|-- PRODUCT: string (nullable = true)
|-- CREATION DATE: string (nullable = true)
|-- CHANNEL TYPE: string (nullable = true)
|-- DELAY DAYS: string (nullable = true)
I have also referred to multiple SO posts, but they didn't help.

This was fixed in the Spark 3.3.0 release (I tested it). So your only option is to upgrade (or use pandas to rename the fields).

Thanks to @Andrey's update:
Now we can correctly load parquet files with column names containing these special characters. Just make sure you're using Spark 3.3.0 or later.
If all the datasets are in parquet files, I'm afraid we're out of luck and you have to load them in Pandas and then do the renaming.
Spark won't read parquet files with column names containing characters among " ,;{}()\n\t=" at all. AFAIK, the Spark devs declined to resolve this issue; the root cause lies in the parquet files themselves. According to the devs, parquet files should not have these "invalid characters" in their column names in the first place.
See https://issues.apache.org/jira/browse/SPARK-27442. It was marked as "won't fix".
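To make the upgrade path concrete, here is a minimal sketch assuming Spark 3.3.0 or later and a spark-shell (or an existing SparkSession); the path is the one from the question, and the rename just swaps whitespace for underscores:
// Assumes Spark >= 3.3.0, where the read itself no longer fails (SPARK-27442).
val raw = spark.read.parquet("s3://tvsc-lumiq-edl/raw-v2/LMSDB/DESUSR/TBL_DES_SLA_MIS1")
// Replace every whitespace character in the column names with an underscore.
val renamed = raw.toDF(raw.columns.map(_.replaceAll("\\s", "_")): _*)
renamed.printSchema()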

Try the code below:
df
.select(df.columns.map(c => col(s"`${c}`").as(c.replace(" ",""))):_*)
.show(false)

This worked for me:
val dfnew = df.select(df.columns.map(i => col(i).as(i.replaceAll(" ", ""))): _*)

Related

Error while inserting into a partitioned Hive table from Spark Scala

I have a Hive table with the following structure:
CREATE TABLE gcganamrswp_work.historical_trend_result(
column_name string,
metric_name string,
current_percentage string,
lower_threshold double,
upper_threshold double,
calc_status string,
final_status string,
support_override string,
dataset_name string,
insert_timestamp string,
appid string,
currentdate string,
indicator map<string,string>)
PARTITIONED BY (
appname string,
year_month int)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");
I have a Spark DataFrame with the following schema:
root
|-- metric_name: string (nullable = true)
|-- column_name: string (nullable = true)
|-- Lower_Threshold: double (nullable = true)
|-- Upper_Threshold: double (nullable = true)
|-- Current_Percentage: double (nullable = true)
|-- Calc_Status: string (nullable = false)
|-- Final_Status: string (nullable = false)
|-- support_override: string (nullable = false)
|-- Dataset_Name: string (nullable = false)
|-- insert_timestamp: string (nullable = false)
|-- appId: string (nullable = false)
|-- currentDate: string (nullable = false)
|-- indicator: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = false)
|-- appname: string (nullable = false)
|-- year_month: string (nullable = false)
When I try to insert into the Hive table using the code below, it fails:
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
data_df.repartition(1)
.write.mode("append")
.format("hive")
.insertInto(Outputhive_table)
Spark Version : Spark 2.4.0
Error:
ERROR Hive:1987 - Exception when loading partition with parameters
  partPath=hdfs://gcgprod/data/work/hive/historical_trend_result/.hive-staging_hive_2021-09-01_04-34-04_254_8783620706620422928-1/-ext-10000/_temporary/0,
  table=historical_trend_result, partSpec={appname=, year_month=},
  replace=false, listBucketingEnabled=false, isAcid=false, hasFollowingStatsTask=false
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Partition spec is incorrect. {appname=, year_month=})
  at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1662)
  at org.apache.hadoop.hive.ql.metadata.Hive.lambda$loadDynamicPartitions$4(Hive.java:1970)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: MetaException(message:Partition spec is incorrect. {appname=, year_month=})
  at org.apache.hadoop.hive.metastore.Warehouse.makePartName(Warehouse.java:329)
  at org.apache.hadoop.hive.metastore.Warehouse.makePartPath(Warehouse.java:312)
  at org.apache.hadoop.hive.ql.metadata.Hive.genPartPathFromTable(Hive.java:1751)
  at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1607)
I have specified the partition columns as the last columns of the DataFrame, so I expect it to treat the last two columns as partition columns. I want to use the same routine for inserting into different tables, so I don't want to specify the partition columns explicitly.
Just to recap: you are using Spark to write data to a Hive table with dynamic partitions. My answer below is based on that understanding; if it is incorrect, please feel free to correct me in a comment.
While the table is dynamically partitioned (by appname and year_month), the Spark job doesn't know the partitioning fields of the destination, so you still have to tell it about the partition columns of the destination table.
Something like this should work:
data_df.repartition(1)
.write
.partitionBy("appname", "year_month")
.mode(SaveMode.Append)
.saveAsTable(Outputhive_table)
Make sure that you enable support for dynamic partitions by executing something like:
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
Check out this post by Itai Yaffe; it may be handy: https://medium.com/nmc-techblog/spark-dynamic-partition-inserts-part-1-5b66a145974f
I think the problem is that some records have appname and year_month as empty strings. At least this is suggested by:
Partition spec is incorrect. {appname=, year_month=}
Make sure the partition columns are never empty or null! Also note that the type of year_month is not consistent between the DataFrame and your table schema (string vs. int).
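A hedged sketch combining both answers' suggestions: cast year_month to int, drop rows with empty or null partition values, and reorder the columns to mirror the table definition from the question (insertInto resolves columns by position, so the partition columns must come last):
import org.apache.spark.sql.functions.{col, trim}

// Match the table's year_month type and keep only rows with usable partition values.
// Assumes hive.exec.dynamic.partition(.mode) are set as shown in the question.
val cleaned = data_df
  .withColumn("year_month", col("year_month").cast("int"))
  .filter(col("appname").isNotNull && trim(col("appname")) =!= "" && col("year_month").isNotNull)

// Column order mirrors the CREATE TABLE above: data columns first, partition columns last.
val tableColumns = Seq("column_name", "metric_name", "current_percentage",
  "lower_threshold", "upper_threshold", "calc_status", "final_status",
  "support_override", "dataset_name", "insert_timestamp", "appid",
  "currentdate", "indicator", "appname", "year_month")

cleaned
  .select(tableColumns.map(col): _*)
  .write
  .mode("append")
  .format("hive")
  .insertInto(Outputhive_table)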

How to add a new nullable String column in a DataFrame using Scala

There are probably at least 10 questions very similar to this, but I still have not found a clear answer.
How can I add a nullable string column to a DataFrame using Scala? I was able to add a column with null values, but its DataType shows up as null.
val testDF = myDF.withColumn("newcolumn", when(col("UID") =!= "not", null).otherwise(null))
However, the schema shows
root
|-- UID: string (nullable = true)
|-- IsPartnerInd: string (nullable = true)
|-- newcolumn: null (nullable = true)
I want the new column to be a string: |-- newcolumn: string (nullable = true)
Please don't mark this as a duplicate unless it's really the same question and in Scala.
Just explicitly cast the null literal to StringType.
scala> val testDF = myDF.withColumn("newcolumn", when(col("UID") =!= "not", lit(null).cast(StringType)).otherwise(lit(null).cast(StringType)))
scala> testDF.printSchema
root
|-- UID: string (nullable = true)
|-- newcolumn: string (nullable = true)
Why do you want a column that is always null? There are several ways; I would prefer the solution with typedLit:
myDF.withColumn("newcolumn", typedLit[String](null))
or for older Spark versions:
myDF.withColumn("newcolumn",lit(null).cast(StringType))

How to remove/replace whitespace in the column names of a Spark DataFrame read from a parquet file? [duplicate]

This question already has answers here:
Spark Dataframe validating column names for parquet writes
(7 answers)
Closed last year.
The dataset I'm working on has whitespace in its column names, and I got stuck while trying to rename the Spark DataFrame columns. I have tried almost all the solutions available on Stack Overflow; nothing seems to work.
Note: The file must be a parquet file.
df.printSchema
root
|-- Type: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- ID: string (nullable = true)
|-- Catg Name: string (nullable = true)
|-- Error Msg: string (nullable = true)
df.show()
Error:
warning: there was one deprecation warning; re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Attribute name "Catg Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
Tried:
df.select(df.col("Catg Name").alias("Catg_Name"))
and then df.printSchema
root
|-- Type: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- ID: string (nullable = true)
|-- Catg_Name: string (nullable = true)
|-- Error_Msg: string (nullable = true)
This works well, but when I use df.show() it throws the same error.
warning: there was one deprecation warning; re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Attribute name "Catg Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
How about this idea: remove the spaces from the column names and reassign the result to the DataFrame?
val df1 = df.toDF("col 1","col 2","col 3") // Dataframe with spaces in column names
val new_cols = df1.columns.map(x => x.replaceAll(" ", "")) // new column names array with spaces removed
val df2 = df1.toDF(new_cols : _*) // df2 with new column names(spaces removed)

Spark Scala DataFrame Single Row conversion to JSON for PostgreSQL Insertion

With a DataFrame called lastTail, I can iterate like this:
import scalikejdbc._
// ...
// Do Kafka Streaming to create DataFrame lastTail
// ...
lastTail.printSchema
lastTail.foreachPartition(iter => {
// open database connection from connection pool
// with scalikeJDBC (to PostgreSQL)
while(iter.hasNext) {
val item = iter.next()
println("****")
println(item.getClass)
println(item.getAs("fileGid"))
println("Schema: "+item.schema)
println("String: "+item.toString())
println("Seqnce: "+item.toSeq)
// convert this item into an XXX format (like JSON)
// write row to DB in the selected format
}
})
This outputs "something like" (with redaction):
root
|-- fileGid: string (nullable = true)
|-- eventStruct: struct (nullable = false)
| |-- eventIndex: integer (nullable = true)
| |-- eventGid: string (nullable = true)
| |-- eventType: string (nullable = true)
|-- revisionStruct: struct (nullable = false)
| |-- eventIndex: integer (nullable = true)
| |-- eventGid: string (nullable = true)
| |-- eventType: string (nullable = true)
and (with just one iteration item - redacted, but hopefully with good enough syntax as well)
****
class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
12345
Schema: StructType(StructField(fileGid,StringType,true), StructField(eventStruct,StructType(StructField(eventIndex,IntegerType,true), StructField(eventGid,StringType,true), StructField(eventType,StringType,true)), StructField(revisionStruct,StructType(StructField(eventIndex,IntegerType,true), StructField(eventGid,StringType,true), StructField(eventType,StringType,true), StructField(editIndex,IntegerType,true)),false))
String: [12345,[1,4,edit],[1,4,revision]]
Seqnce: WrappedArray(12345, [1,4,edit], [1,4,revision])
Note: I'm doing the part like val metric = iter.sum from https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerPartition.scala, but with DataFrames instead. I am also following "Design Patterns for using foreachRDD" from http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning.
How can I convert this
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
(see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala)
iteration item into something that is easily written (JSON or ...? - I'm open) to PostgreSQL? (If not JSON, please suggest how to read this value back into a DataFrame for use at another point.)
Well, I figured out a different way to do this as a workaround.
val ltk = lastTail.select($"fileGid").rdd.map(fileGid => fileGid.toString)
val ltv = lastTail.toJSON
val kvPair = ltk.zip(ltv)
Then I would simply iterate over the RDD instead of the DataFrame.
kvPair.foreachPartition(iter => {
while(iter.hasNext) {
val item = iter.next()
println(item.getClass)
println(item)
}
})
Data aside, I get class scala.Tuple2, which makes for an easier way to store KV pairs over JDBC in PostgreSQL.
I'm sure there are other ways that are not workarounds.
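For reference, a hedged sketch of what the write side could look like with scalikeJDBC inside foreachPartition; the JDBC URL, credentials, and the events(file_gid, payload) table are hypothetical placeholders:
import scalikejdbc._

kvPair.foreachPartition(iter => {
  // Placeholder connection settings; initialize the pool once per executor JVM.
  Class.forName("org.postgresql.Driver")
  ConnectionPool.singleton("jdbc:postgresql://localhost:5432/mydb", "user", "password")

  DB.localTx { implicit session =>
    iter.foreach { case (fileGid, json) =>
      // Hypothetical target table: events(file_gid text, payload jsonb)
      sql"INSERT INTO events (file_gid, payload) VALUES ($fileGid, CAST($json AS jsonb))"
        .update.apply()
    }
  }
})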

inferSchema in spark-csv package

When a CSV is read as a DataFrame in Spark, all the columns are read as strings. Is there any way to get the actual type of each column?
I have the following csv file
Name,Department,years_of_experience,DOB
Sam,Software,5,1990-10-10
Alex,Data Analytics,3,1992-10-10
I've read the CSV using the code below:
val df = sqlContext.
read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "true").
load(sampleAdDataS3Location)
df.schema
All the columns are read as strings. I expect the column years_of_experience to be read as int and DOB to be read as date.
Please note that I've set the option inferSchema to true.
I am using the latest version (1.0.3) of spark-csv package
Am I missing something here?
2015-07-30
The latest version is actually 1.1.0, but it doesn't really matter since it looks like inferSchema is not included in the latest release.
2015-08-17
The latest version of the package is now 1.2.0 (published on 2015-08-06) and schema inference works as expected:
scala> df.printSchema
root
|-- Name: string (nullable = true)
|-- Department: string (nullable = true)
|-- years_of_experience: integer (nullable = true)
|-- DOB: string (nullable = true)
Regarding automatic date parsing, I doubt it will ever happen, or at least not without providing additional metadata.
Even if all fields follow some date-like format, it is impossible to say whether a given field should be interpreted as a date. So it is either a lack of automatic date inference or a spreadsheet-like mess, not to mention issues with time zones, for example.
Finally, you can easily parse the date string manually:
sqlContext
.sql("SELECT *, DATE(dob) as dob_d FROM df")
.drop("DOB")
.printSchema
root
|-- Name: string (nullable = true)
|-- Department: string (nullable = true)
|-- years_of_experience: integer (nullable = true)
|-- dob_d: date (nullable = true)
so it is really not a serious issue.
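The same conversion is also possible with the DataFrame API instead of SQL; a small sketch (to_date handles the default yyyy-MM-dd format used in this file):
import org.apache.spark.sql.functions.{col, to_date}

val withDate = df
  .withColumn("dob_d", to_date(col("DOB")))
  .drop("DOB")

withDate.printSchema()  // dob_d: date (nullable = true)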
2017-12-20:
The built-in CSV parser, available since Spark 2.0, supports schema inference for dates and timestamps. It uses two options:
timestampFormat with default yyyy-MM-dd'T'HH:mm:ss.SSSXXX
dateFormat with default yyyy-MM-dd
See also How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
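A sketch of that built-in reader (Spark 2.x+), with both options spelled out at their defaults; depending on your Spark version, the DOB column may be inferred as date, timestamp, or still string, and the path below is a placeholder:
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
  .csv("/path/to/sample.csv")  // placeholder path

df.printSchema()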