inferSchema in spark-csv package - scala

When a CSV is read as a dataframe in Spark, all the columns are read as strings. Is there any way to get the actual type of each column?
I have the following csv file
Name,Department,years_of_experience,DOB
Sam,Software,5,1990-10-10
Alex,Data Analytics,3,1992-10-10
I've read the CSV using the below code
val df = sqlContext.
read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "true").
load(sampleAdDataS3Location)
df.schema
All the columns are read as strings. I expect the column years_of_experience to be read as int and DOB to be read as date.
Please note that I've set the option inferSchema to true.
I am using the latest version (1.0.3) of spark-csv package
Am I missing something here?

2015-07-30
The latest version is actually 1.1.0, but it doesn't really matter since it looks like inferSchema is not included in the latest release.
2015-08-17
The latest version of the package is now 1.2.0 (published on 2015-08-06) and schema inference works as expected:
scala> df.printSchema
root
|-- Name: string (nullable = true)
|-- Department: string (nullable = true)
|-- years_of_experience: integer (nullable = true)
|-- DOB: string (nullable = true)
Regarding automatic date parsing, I doubt it will ever happen, at least not without providing additional metadata.
Even if all fields follow some date-like format, it is impossible to say whether a given field should be interpreted as a date. So the choice is either no automatic date inference or a spreadsheet-like mess, not to mention issues with time zones, for example.
Finally, you can easily parse the date string manually (assuming df has been registered as a temporary table named df):
sqlContext
.sql("SELECT *, DATE(dob) as dob_d FROM df")
.drop("DOB")
.printSchema
root
|-- Name: string (nullable = true)
|-- Department: string (nullable = true)
|-- years_of_experience: integer (nullable = true)
|-- dob_d: date (nullable = true)
So it is really not a serious issue.
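If you prefer to stay in the DataFrame API, here is a minimal sketch of the same idea, assuming the column names from the sample CSV above (to_date is available since Spark 1.5):
import org.apache.spark.sql.functions.to_date

// DOB holds ISO-formatted strings such as 1990-10-10, so to_date can parse them directly
val withDate = df
  .withColumn("dob_d", to_date(df("DOB")))
  .drop("DOB")

withDate.printSchema()
// dob_d ends up as date (nullable = true), just like the SQL version above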
2017-12-20:
The built-in CSV parser, available since Spark 2.0, supports schema inference for dates and timestamps. It uses two options:
timestampFormat with default yyyy-MM-dd'T'HH:mm:ss.SSSXXX
dateFormat with default yyyy-MM-dd
See also How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
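To illustrate, a minimal sketch of reading the sample file with the built-in reader on Spark 2.0+ (the path is a placeholder, and both format options are simply the defaults spelled out):
// spark is the SparkSession; inferSchema plus the two format options drive type inference
val df2 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
  .csv("/path/to/sample.csv")

df2.printSchema()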

Related

Error while inserting into partitioned hive table for spark scala

I have a Hive table with the following structure:
CREATE TABLE gcganamrswp_work.historical_trend_result(
column_name string,
metric_name string,
current_percentage string,
lower_threshold double,
upper_threshold double,
calc_status string,
final_status string,
support_override string,
dataset_name string,
insert_timestamp string,
appid string,
currentdate string,
indicator map<string,string>)
PARTITIONED BY (
appname string,
year_month int)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");
I have a Spark dataframe with the following schema:
root
|-- metric_name: string (nullable = true)
|-- column_name: string (nullable = true)
|-- Lower_Threshold: double (nullable = true)
|-- Upper_Threshold: double (nullable = true)
|-- Current_Percentage: double (nullable = true)
|-- Calc_Status: string (nullable = false)
|-- Final_Status: string (nullable = false)
|-- support_override: string (nullable = false)
|-- Dataset_Name: string (nullable = false)
|-- insert_timestamp: string (nullable = false)
|-- appId: string (nullable = false)
|-- currentDate: string (nullable = false)
|-- indicator: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = false)
|-- appname: string (nullable = false)
|-- year_month: string (nullable = false)
When I try to insert into the Hive table using the code below, it fails:
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
data_df.repartition(1)
.write.mode("append")
.format("hive")
.insertInto(Outputhive_table)
Spark Version : Spark 2.4.0
Error:
ERROR Hive:1987 - Exception when loading partition with parameters
partPath=hdfs://gcgprod/data/work/hive/historical_trend_result/.hive-staging_hive_2021-09-01_04-34-04_254_8783620706620422928-1/-ext-10000/_temporary/0,
table=historical_trend_result, partSpec={appname=, year_month=},
replace=false, listBucketingEnabled=false, isAcid=false,
hasFollowingStatsTask=false
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Partition spec is incorrect. {appname=, year_month=})
    at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1662)
    at org.apache.hadoop.hive.ql.metadata.Hive.lambda$loadDynamicPartitions$4(Hive.java:1970)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: MetaException(message:Partition spec is incorrect. {appname=, year_month=})
    at org.apache.hadoop.hive.metastore.Warehouse.makePartName(Warehouse.java:329)
    at org.apache.hadoop.hive.metastore.Warehouse.makePartPath(Warehouse.java:312)
    at org.apache.hadoop.hive.ql.metadata.Hive.genPartPathFromTable(Hive.java:1751)
    at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1607)
I have specified the partition columns as the last columns of the dataframe, so I expect it to consider the last two columns as partition columns. I want to use the same routine for inserting into different tables, so I don't want to mention the partition columns explicitly.
Just to recap: you are using Spark to write data to a Hive table with dynamic partitions. My answer below is based on that understanding; if it is incorrect, please feel free to correct me in a comment.
While the table is dynamically partitioned (by appname and year_month), the Spark job doesn't know the partitioning fields of the destination, so you still have to tell it about the partition columns of the destination table.
Something like this should work
data_df.repartition(1)
.write
.partitionBy("appname", "year_month")
.mode(SaveMode.Append)
.saveAsTable(Outputhive_table)
Make sure that you enable support for dynamic partitions by executing something like
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
Check out this post by Itai Yaffe, this may be handy https://medium.com/nmc-techblog/spark-dynamic-partition-inserts-part-1-5b66a145974f
I think the problem is that some records have appname and year_month as empty strings. At least this is suggested by
Partition spec is incorrect. {appname=, year_month=}
Make sure the partition columns are never empty or null! Also note that the type of year_month is not consistent between the DataFrame and your schema (string vs. int).
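To illustrate that last point, a hedged sketch of normalizing the partition columns before the write (variable and column names are taken from the question; the filter is just an example policy, not the asker's code):
import org.apache.spark.sql.functions.{col, trim}

// Cast year_month to int to match the table definition and drop rows whose
// partition columns are null or blank, which is what the MetaException points at
val cleaned = data_df
  .withColumn("year_month", col("year_month").cast("int"))
  .filter(col("appname").isNotNull && trim(col("appname")) =!= "" && col("year_month").isNotNull)

cleaned.repartition(1)
  .write.mode("append")
  .format("hive")
  .insertInto(Outputhive_table)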

Unable to remove the space from column names in spark scala

I have a parquet dataset whose column names contain spaces between the words, e.g. BRANCH NAME. Now, when I replace the space with "_" and try printing the column, it results in an error. Below is my code with multiple approaches, followed by the error:
Approach 1:
var df = spark.read.parquet("s3://tvsc-lumiq-edl/raw-v2/LMSDB/DESUSR/TBL_DES_SLA_MIS1")
for (c <- df.columns){
df = df.withColumnRenamed(c, c.replace(" ", ""))
}
Approach 2:
df = df.columns.foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "")))
Approach 3:
val new_cols = df.columns.map(x => x.replaceAll(" ", ""))
val df2 = df.toDF(new_cols : _*)
Error:
org.apache.spark.sql.AnalysisException: Attribute name "BRANCH NAME" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
Below is the schema:
scala> df.printSchema()
root
|-- dms_timestamp: string (nullable = true)
|-- BRANCH NAME: string (nullable = true)
|-- BRANCH CODE: string (nullable = true)
|-- DEALER NAME: string (nullable = true)
|-- DEALER CODE: string (nullable = true)
|-- DEALER CATEGORY: string (nullable = true)
|-- PRODUCT: string (nullable = true)
|-- CREATION DATE: string (nullable = true)
|-- CHANNEL TYPE: string (nullable = true)
|-- DELAY DAYS: string (nullable = true)
I have also referred to multiple SO posts, but they didn't help.
This was fixed in the Spark 3.3.0 release (I tested it). So your only option is to upgrade (or use pandas to rename the fields).
Thanks to @Andrey's update:
Now we can correctly load parquet files with column names containing these special characters. Just make sure you're using Spark 3.3.0 or later.
If all the datasets are in parquet files, I'm afraid we're out of luck and you have to load them in Pandas and then do the renaming.
Spark won't read parquet files with column names containing characters among " ,;{}()\n\t=" at all. AFAIK, Spark devs refused to resolve this issue. The root cause of it lies in your parquet files themselves. At least according to the dev, parquet files should not have these "invalid characters" in their column names in the first place.
See https://issues.apache.org/jira/browse/SPARK-27442 . It was marked as "won't fix".
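If upgrading is an option, a minimal Scala sketch of the whole round trip on Spark 3.3.0+ (paths are placeholders, and replacing spaces with underscores is just an example choice):
// On Spark 3.3.0+ the parquet file with spaces in its column names can be read directly
val df = spark.read.parquet("/path/to/input_with_spaces")

// Rename every column, then write a copy whose names are safe for older Spark versions too
val renamed = df.columns.foldLeft(df) { (cur, c) =>
  cur.withColumnRenamed(c, c.replaceAll("\\s", "_"))
}

renamed.write.mode("overwrite").parquet("/path/to/output_no_spaces")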
Try the code below (it needs import org.apache.spark.sql.functions.col):
df
.select(df.columns.map(c => col(s"`${c}`").as(c.replace(" ",""))):_*)
.show(false)
This worked for me:
val dfnew =df.select(df.columns.map(i => col(i).as(i.replaceAll(" ", ""))): _*)

How to add a new nullable String column in a DataFrame using Scala

There are probably at least 10 questions very similar to this one, but I still have not found a clear answer.
How can I add a nullable string column to a DataFrame using Scala? I was able to add a column with null values, but the DataType shows null.
val testDF = myDF.withColumn("newcolumn", when(col("UID") =!= "not", null).otherwise(null))
However, the schema shows
root
|-- UID: string (nullable = true)
|-- IsPartnerInd: string (nullable = true)
|-- newcolumn: null (nullable = true)
I want the new column to be string |-- newcolumn: string (nullable = true)
Please don't mark as duplicate, unless it's really the same question and in scala.
Just explicitly cast the null literal to StringType.
scala> val testDF = myDF.withColumn("newcolumn", when(col("UID") =!= "not", lit(null).cast(StringType)).otherwise(lit(null).cast(StringType)))
scala> testDF.printSchema
root
|-- UID: string (nullable = true)
|-- newcolumn: string (nullable = true)
Why do you want a column which is always null? There are several ways; I would prefer the solution with typedLit:
myDF.withColumn("newcolumn", typedLit[String](null))
or for older Spark versions:
myDF.withColumn("newcolumn",lit(null).cast(StringType))

Spark apply custom schema to a DataFrame

I have data in a Parquet file and want to apply a custom schema to it.
My initial data within the Parquet file is as below:
root
|-- CUST_ID: decimal(9,0) (nullable = true)
|-- INACTV_DT: string (nullable = true)
|-- UPDT_DT: string (nullable = true)
|-- ACTV_DT: string (nullable = true)
|-- PMT_AMT: decimal(9,4) (nullable = true)
|-- CMT_ID: decimal(38,14) (nullable = true)
My custom schema is below,
root
|-- CUST_ID: decimal(38,0) (nullable = false)
|-- INACTV_DT: timestamp (nullable = false)
|-- UPDT_DT: timestamp (nullable = false)
|-- ACTV_DT: timestamp (nullable = true)
|-- PMT_AMT: decimal(19,4) (nullable = true)
|-- CMT_ID: decimal(38,14) (nullable = false)
Below is my code to apply the new schema to it:
val customSchema = getOracleDBSchema(sparkSession, QUERY).schema
val DF_frmOldParkquet = sqlContext_par.read.parquet("src/main/resources/data_0_0_0.parquet")
val rows: RDD[Row] = DF_frmOldParkquet.rdd
val newDataFrame = sparkSession.sqlContext.createDataFrame(rows, tblSchema)
newDataFrame.printSchema()
newDataFrame.show()
I am getting the below error when I perform this operation:
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of timestamp
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), fromDecimal, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, CUST_ID), DecimalType(38,0)), true) AS CUST_ID#27
There are two main applications of a schema in Spark SQL:
the schema argument passed to the schema method of the DataFrameReader, which is used to transform data in some formats (primarily plain text files). In this case the schema can be used to automatically cast input records.
the schema argument passed to createDataFrame (the variants which take an RDD or a List of Rows) of the SparkSession. In this case the schema has to conform to the data, and it is not used for casting.
None of the above is applicable in your case:
Input is strongly typed, therefore schema, if present, is ignored by the reader.
Schema doesn't match the data, therefore it cannot be used to createDataFrame.
In this scenario you should cast each column to the desired type. Assuming the types are compatible, something like this should work (note the fold has to be over the target schema, customSchema from your code, not over the DataFrame's own schema):
val newDataFrame = customSchema.fields.foldLeft(DF_frmOldParkquet){
  (df, s) => df.withColumn(s.name, df(s.name).cast(s.dataType))
}
Depending on the format of the data this might be sufficient or not. For example if fields that should be transformed to timestamps don't use standard formatting, casting won't work, and you'll have to use Spark datetime processing utilities.
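For example, if the date columns used a non-ISO format, a hedged sketch of that manual route with to_timestamp (Spark 2.2+; the format string is an assumption, not taken from the question):
import org.apache.spark.sql.functions.{col, to_timestamp}

// Parse strings such as "31-12-2017 23:59:59" explicitly instead of relying on a plain cast
val parsed = DF_frmOldParkquet
  .withColumn("INACTV_DT", to_timestamp(col("INACTV_DT"), "dd-MM-yyyy HH:mm:ss"))
  .withColumn("UPDT_DT", to_timestamp(col("UPDT_DT"), "dd-MM-yyyy HH:mm:ss"))
  .withColumn("ACTV_DT", to_timestamp(col("ACTV_DT"), "dd-MM-yyyy HH:mm:ss"))

parsed.printSchema()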

spark convert a string to TimestampType

I have a dataframe that I want to insert into PostgreSQL from Spark. In Spark the DateTimestamp column is in string format; in PostgreSQL it is timestamp without time zone.
Spark errors out when inserting into the database on the datetime column. I did try to change the data type, but the insert still errors out, and I am unable to figure out why the cast does not work. If I paste the same insert string into PgAdmin and run it, the insert statement runs fine.
import java.text.SimpleDateFormat;
import java.util.Calendar
object EtlHelper {
// Return the current time stamp
def getCurrentTime() : String = {
val now = Calendar.getInstance().getTime()
val hourFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
return hourFormat.format(now)
}
}
In another file
object CreateDimensions {
def createDimCompany(spark:SparkSession, location:String, propsLocation :String):Unit = {
import spark.implicits._
val dimCompanyStartTime = EtlHelper.getCurrentTime()
val dimcompanyEndTime = EtlHelper.getCurrentTime()
val prevDimCompanyId = 2
val numRdd = 27
val AuditDF = spark.createDataset(Array(("dim_company", prevDimCompanyId,numRdd,dimCompanyStartTime,dimcompanyEndTime))).toDF("audit_tbl_name","audit_tbl_id","audit_no_rows","audit_tbl_start_date","audit_tbl_end_date")//.show()
AuditDF.withColumn("audit_tbl_start_date",AuditDF.col("audit_tbl_start_date").cast(DataTypes.TimestampType))
AuditDF.withColumn("audit_tbl_end_date",AuditDF.col("audit_tbl_end_date").cast(DataTypes.TimestampType))
AuditDF.printSchema()
}
}
root
|-- audit_tbl_name: string (nullable = true)
|-- audit_tbl_id: long (nullable = false)
|-- audit_no_rows: long (nullable = false)
|-- audit_tbl_start_date: string (nullable = true)
|-- audit_tbl_end_date: string (nullable = true)
This is the error I get
INSERT INTO etl.audit_master ("audit_tbl_name","audit_tbl_id","audit_no_rows","audit_tbl_start_date","audit_tbl_end_date") VALUES ('dim_company',27,2,'2018-05-02 12:15:54','2018-05-02 12:15:59') was aborted: ERROR: column "audit_tbl_start_date" is of type timestamp without time zone but expression is of type character varying
Hint: You will need to rewrite or cast the expression.
Any Help is appreciated.
Thank you
AuditDF.printSchema() prints the schema of the original AuditDF dataframe, because you didn't save the result of the .withColumn transformations by assigning it. Dataframes are immutable objects: they can be transformed into other dataframes, but they cannot change themselves. So you always need an assignment to keep the transformations you've applied.
So the correct way is to assign the result in order to save the changes:
val transformedDF = AuditDF.withColumn("audit_tbl_start_date",AuditDF.col("audit_tbl_start_date").cast(DataTypes.TimestampType))
.withColumn("audit_tbl_end_date",AuditDF.col("audit_tbl_end_date").cast("timestamp"))
transformedDF.printSchema()
You will see the changes:
root
|-- audit_tbl_name: string (nullable = true)
|-- audit_tbl_id: integer (nullable = false)
|-- audit_no_rows: integer (nullable = false)
|-- audit_tbl_start_date: timestamp (nullable = true)
|-- audit_tbl_end_date: timestamp (nullable = true)
.cast(DataTypes.TimestampType) and .cast("timestamp") are the same thing.
The root of your problem is what @Ramesh mentioned, i.e. that you didn't assign the changes to AuditDF to a new value (val). Note that both the dataframe and the value you assigned it to are immutable (AuditDF was defined as a val, so it can't be changed either).
Another thing is that you don't need to reinvent the wheel with EtlHelper; Spark has a built-in function, current_timestamp, that gives you the current timestamp:
import org.apache.spark.sql.functions._
val AuditDF = spark.createDataset(Array(("dim_company", prevDimCompanyId,numRdd)))
.toDF("audit_tbl_name","audit_tbl_id","audit_no_rows")
.withColumn("audit_tbl_start_date"current_timestamp())
.withColumn("audit_tbl_end_date",current_timestamp())