I am using schema evolution with a Delta table, and the code is written in a Databricks notebook.
df_read = (spark.read.schema(schema)
    .format(file_format)
    .option("inferSchema", "false")
    .option("mergeSchema", "true")
    .load(source_folders))
But I still got the error below. Is it correct to define the schema and enable the mergeSchema at the same time?
AnalysisException: The specified schema does not match the existing schema at path.
== Specified ==
root
-- A: string (nullable = false)
-- B: string (nullable = true)
-- C: long (nullable = true)
== Existing ==
root
-- A: string (nullable = true)
-- B: string (nullable = true)
-- C: long (nullable = true)
== Differences ==
- Field A is non-nullable in specified schema but nullable in existing schema.
If your intention is to keep the existing schema, you can omit the
schema from the create table command. Otherwise please ensure that
the schema matches.
Trying to run:
val outputDF = hiveContext.createDataFrame(myRDD, schema)
Getting this error:
Caused by: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of struct<col1name:string,col2name:string>
myRDD.take(5).foreach(println)
[string number,[Lscala.Tuple2;@163601a5]
[1234567890,[Lscala.Tuple2;@6fa7a81c]
Structure of the data in the RDD:
RDD[Row]: [string number, [(string key, string value)]]
i.e. each element is Row(string, Array(Tuple2(String, String)))
where each Tuple2 contains data like this:
(string key, string value)
schema:
root
|-- col1name: string (nullable = true)
|-- col2name: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col3name: string (nullable = true)
| | |-- col4name: string (nullable = true)
StructType(
  StructField(col1name, StringType, true),
  StructField(col2name,
    ArrayType(
      StructType(
        StructField(col3name, StringType, true),
        StructField(col4name, StringType, true)
      ),
      true
    ),
    true
  )
)
This code used to run on Spark 1.6 without problems. In Spark 2.4, it appears that Tuple2 no longer counts as a struct type? In that case, what should it be changed to?
I'm assuming the easiest solution would be to adjust the schema to suit the data.
Let me know if more details are needed.
The answer to this is to change the tuple that contained the two strings into a Row containing the two strings instead.
So for the provided schema, the incoming data structure was
Row(string, Array(Tuple(String, String)))
This was changed to
Row(string, Array(Row(String, String)))
in order to continue using the same schema.
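For illustration, a minimal sketch of that conversion, assuming the RDD is built from (String, Array[(String, String)]) pairs (pairRDD and the field names below are hypothetical):

import org.apache.spark.sql.Row

// Wrap each inner Tuple2 in a Row so the array elements match the struct element type.
val rowRDD = pairRDD.map { case (number, pairs) =>
  Row(number, pairs.map { case (k, v) => Row(k, v) })
}

val outputDF = hiveContext.createDataFrame(rowRDD, schema)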
I am trying to read data from a CSV file using Scala and Spark, but the values of the columns come back as null.
I tried to read the data from the CSV and also provided a schema for querying the data easily.
private val myData = sparkSession.read.schema(createDataSchema).csv("data/myData.csv")

def createDataSchema = {
  import org.apache.spark.sql.types._

  val schema = StructType(
    Array(
      StructField("data_index", StringType, nullable = false),
      StructField("property_a", IntegerType, nullable = false),
      StructField("property_b", IntegerType, nullable = false)
      //some other columns
    )
  )
  schema
}
Querying the data:
val myProperty = myData.select($"property_b")
myProperty.collect()
I expect the data to be returned as a list of the actual values, but it is returned as a list containing null values.
Why?
When I print the schema, nullable is set to true instead of false.
I am using Scala 2.12.9 and Spark 2.4.3.
When loading data from a CSV file, even though the schema is provided with nullable = false, Spark overrides it to nullable = true so that null pointer exceptions can be avoided during the data load.
Let's take an example: assume the CSV file has two rows, where the second row has an empty or null column value.
CSV:
a,1,2
b,,2
If nullable = false, a null pointer exception would be thrown while loading the data as soon as an action is called on the DataFrame, because there is an empty/null value to be loaded and no default value to fall back on. To avoid this, Spark overrides the flag to nullable = true.
However, this can be handled by replacing all nulls with a default value and then re-applying the schema.
import org.apache.spark.sql.functions.{col, when}

val df = spark.read.schema(schema).csv("data/myData.csv")
val dfWithDefault = df.withColumn("property_a", when(col("property_a").isNull, 0).otherwise(df.col("property_a")))
val dfNullableFalse = spark.sqlContext.createDataFrame(dfWithDefault.rdd, schema)
dfNullableFalse.show(10)
df.printSchema()
root
|-- data_index: string (nullable = true)
|-- property_a: integer (nullable = true)
|-- property_b: integer (nullable = true)
dfNullableFalse.printSchema()
root
|-- data_index: string (nullable = false)
|-- property_a: integer (nullable = false)
|-- property_b: integer (nullable = false)
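As a small alternative sketch (not from the original answer), the same default-filling step can be done for several columns at once with na.fill before re-applying the strict schema; the 0 default is only an assumption:

// Fill 0 into any null integer columns, then re-apply the nullable = false schema.
val dfFilled = df.na.fill(0, Seq("property_a", "property_b"))
val dfStrict = spark.createDataFrame(dfFilled.rdd, schema)
dfStrict.printSchema()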
I have data in a Parquet file and want to apply a custom schema to it.
My initial data within the Parquet file is as below:
root
|-- CUST_ID: decimal(9,0) (nullable = true)
|-- INACTV_DT: string (nullable = true)
|-- UPDT_DT: string (nullable = true)
|-- ACTV_DT: string (nullable = true)
|-- PMT_AMT: decimal(9,4) (nullable = true)
|-- CMT_ID: decimal(38,14) (nullable = true)
My custom schema is below,
root
|-- CUST_ID: decimal(38,0) (nullable = false)
|-- INACTV_DT: timestamp (nullable = false)
|-- UPDT_DT: timestamp (nullable = false)
|-- ACTV_DT: timestamp (nullable = true)
|-- PMT_AMT: decimal(19,4) (nullable = true)
|-- CMT_ID: decimal(38,14) (nullable = false)
Below is my code to apply the new schema to it:
val customSchema = getOracleDBSchema(sparkSession, QUERY).schema
val DF_frmOldParkquet = sqlContext_par.read.parquet("src/main/resources/data_0_0_0.parquet")
val rows: RDD[Row] = DF_frmOldParkquet.rdd
val newDataFrame = sparkSession.sqlContext.createDataFrame(rows, customSchema)
newDataFrame.printSchema()
newDataFrame.show()
I am getting the below error when I perform this operation.
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of timestamp
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), fromDecimal, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, CUST_ID), DecimalType(38,0)), true) AS CUST_ID#27
There are two main applications of a schema in Spark SQL:
The schema argument passed to the schema method of the DataFrameReader, which is used when reading data in some formats (primarily plain-text files). In this case the schema can be used to automatically cast input records.
The schema argument passed to createDataFrame (the variants which take an RDD or a List of Rows) on the SparkSession. In this case the schema has to conform to the data and is not used for casting.
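For illustration only (the path, rowRDD and matchingSchema below are made up; customSchema and sparkSession come from the question), the two uses look roughly like this:

// 1. Reader schema: drives parsing and casting of plain-text sources such as CSV.
val parsed = sparkSession.read.schema(customSchema).csv("src/main/resources/some_file.csv")

// 2. createDataFrame schema: must already describe the RDD[Row] contents exactly;
//    no casting is performed here.
val assembled = sparkSession.createDataFrame(rowRDD, matchingSchema)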
Neither of the above is applicable in your case:
The input is strongly typed, so the schema, if present, is ignored by the reader.
The schema doesn't match the data, so it cannot be used with createDataFrame.
In this scenario you should cast each column to the desired type. Assuming the types are compatible, something like this should work
val newDataFrame = customSchema.fields.foldLeft(DF_frmOldParkquet) {
  (df, s) => df.withColumn(s.name, df(s.name).cast(s.dataType))
}
Depending on the format of the data, this might or might not be sufficient. For example, if the fields that should become timestamps don't use standard formatting, casting won't work and you'll have to use Spark's datetime processing functions.
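For example, a minimal sketch assuming (purely for illustration) that INACTV_DT holds strings like "31-DEC-18"; the pattern is hypothetical:

import org.apache.spark.sql.functions.{col, to_timestamp}

// Parse a non-standard date string with an explicit pattern instead of a plain cast.
val withTimestamps = DF_frmOldParkquet
  .withColumn("INACTV_DT", to_timestamp(col("INACTV_DT"), "dd-MMM-yy"))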
I have a DataFrame that I want to insert into PostgreSQL from Spark. In Spark the DateTimestamp column is in string format; in PostgreSQL it is timestamp without time zone.
Spark errors out when inserting into the database on the datetime column. I did try to change the data type, but the insert still errors out. I am unable to figure out why the cast does not work. If I paste the same insert string into PgAdmin and run it, the insert statement runs fine.
import java.text.SimpleDateFormat
import java.util.Calendar

object EtlHelper {
  // Return the current time stamp
  def getCurrentTime(): String = {
    val now = Calendar.getInstance().getTime()
    val hourFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    hourFormat.format(now)
  }
}
In another file
object CreateDimensions {

  def createDimCompany(spark: SparkSession, location: String, propsLocation: String): Unit = {
    import spark.implicits._

    val dimCompanyStartTime = EtlHelper.getCurrentTime()
    val dimcompanyEndTime = EtlHelper.getCurrentTime()
    val prevDimCompanyId = 2
    val numRdd = 27

    val AuditDF = spark.createDataset(Array(("dim_company", prevDimCompanyId, numRdd, dimCompanyStartTime, dimcompanyEndTime)))
      .toDF("audit_tbl_name", "audit_tbl_id", "audit_no_rows", "audit_tbl_start_date", "audit_tbl_end_date") //.show()

    AuditDF.withColumn("audit_tbl_start_date", AuditDF.col("audit_tbl_start_date").cast(DataTypes.TimestampType))
    AuditDF.withColumn("audit_tbl_end_date", AuditDF.col("audit_tbl_end_date").cast(DataTypes.TimestampType))
    AuditDF.printSchema()
  }
}
root
|-- audit_tbl_name: string (nullable = true)
|-- audit_tbl_id: long (nullable = false)
|-- audit_no_rows: long (nullable = false)
|-- audit_tbl_start_date: string (nullable = true)
|-- audit_tbl_end_date: string (nullable = true)
This is the error I get
INSERT INTO etl.audit_master ("audit_tbl_name","audit_tbl_id","audit_no_rows","audit_tbl_start_date","audit_tbl_end_date") VALUES ('dim_company',27,2,'2018-05-02 12:15:54','2018-05-02 12:15:59') was aborted: ERROR: column "audit_tbl_start_date" is of type timestamp without time zone but expression is of type character varying
Hint: You will need to rewrite or cast the expression.
Any help is appreciated. Thank you.
AuditDF.printSchema() is printing the original AuditDF because you didn't save the transformations of .withColumn by assigning the result. DataFrames are immutable objects that can be transformed into other DataFrames but cannot change themselves, so you always need an assignment to keep the transformations you've applied.
So the correct way is to assign the result in order to keep the changes:
val transformedDF = AuditDF.withColumn("audit_tbl_start_date",AuditDF.col("audit_tbl_start_date").cast(DataTypes.TimestampType))
.withColumn("audit_tbl_end_date",AuditDF.col("audit_tbl_end_date").cast("timestamp"))
transformedDF.printSchema()
You will see the changes:
root
|-- audit_tbl_name: string (nullable = true)
|-- audit_tbl_id: integer (nullable = false)
|-- audit_no_rows: integer (nullable = false)
|-- audit_tbl_start_date: timestamp (nullable = true)
|-- audit_tbl_end_date: timestamp (nullable = true)
.cast(DataTypes.TimestampType) and .cast("timestamp") are the same.
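Once the columns are real timestamps, the insert into PostgreSQL can go through Spark's JDBC writer. A minimal sketch, with placeholder connection details (the URL, user and password below are made up; the table name comes from the error message):

transformedDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/etl")
  .option("dbtable", "etl.audit_master")
  .option("user", "etl_user")
  .option("password", "********")
  .mode("append")
  .save()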
The root of your problem is what @Ramesh mentioned, i.e. that you didn't assign the changes to AuditDF to a new value (val). Note that both the DataFrame and the value you assigned it to are immutable (AuditDF was defined as a val, so it also can't be changed).
Another thing is that you don't need to reinvent the wheel with EtlHelper; Spark has a built-in function that gives you the current timestamp:
import org.apache.spark.sql.functions._

val AuditDF = spark.createDataset(Array(("dim_company", prevDimCompanyId, numRdd)))
  .toDF("audit_tbl_name", "audit_tbl_id", "audit_no_rows")
  .withColumn("audit_tbl_start_date", current_timestamp())
  .withColumn("audit_tbl_end_date", current_timestamp())
I have a DataFrame without a schema, where every column is stored as StringType, such as:
ID | LOG_IN_DATE | USER
1 | 2017-11-01 | Johns
Now I have a schema list [(ID,"double"),("LOG_IN_DATE","date"),(USER,"string")] and I would like to apply it to the above DataFrame in Spark 2.0.2 with Scala 2.11.
I already tried:
schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))
There's no error while running this, but afterwards when I call df.schema, nothing has changed.
Any idea how I could programmatically apply the schema to df? My friend told me I can use the foldLeft method, but I don't think that is a method available in Spark 2.0.2, either on df or on rdd.
If you already have the list [(ID,"double"),("LOG_IN_DATE","date"),(USER,"string")], you can use select, casting each column to its type from the list.
Your dataframe
val df = Seq(("1", "2017-11-01", "Johns"), ("2", "2018-01-03", "jons2")).toDF("ID", "LOG_IN_DATE", "USER")
Your schema
val schema = List(("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string"))
Cast all the columns to their types from the list
val newColumns = schema.map(c => col(c._1).cast(c._2))
Select all the cast columns
val newDF = df.select(newColumns:_*)
Print Schema
newDF.printSchema()
root
|-- ID: double (nullable = true)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
Show Dataframe
newDF.show()
Output:
+---+-----------+-----+
|ID |LOG_IN_DATE|USER |
+---+-----------+-----+
|1.0|2017-11-01 |Johns|
|2.0|2018-01-03 |jons2|
+---+-----------+-----+
My friend told me I can use foldLeft method but I don't think this is a method in Spark 2.0.2 neither in df nor rdd
Yes, foldLeft is the way to go
This is the schema before using foldLeft
root
|-- ID: string (nullable = true)
|-- LOG_IN_DATE: string (nullable = true)
|-- USER: string (nullable = true)
Using foldLeft
val schema = List(("ID","double"),("LOG_IN_DATE","date"),("USER","string"))
import org.apache.spark.sql.functions._
schema.foldLeft(df){case(tempdf, x)=> tempdf.withColumn(x._1, col(x._1).cast(x._2))}.printSchema()
and this is the schema after foldLeft
root
|-- ID: double (nullable = true)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
I hope the answer is helpful
If you apply any Scala transformation, it returns a new, modified DataFrame; you can't change the data types of the existing schema in place.
Below is the code to create a new DataFrame with the modified schema by casting the columns.
1. Create a new DataFrame
val df=Seq((1,"2017-11-01","Johns"),(2,"2018-01-03","Alice")).toDF("ID","LOG_IN_DATE","USER")
2. Register DataFrame as temp table
df.registerTempTable("user")
3. Now create a new DataFrame by casting the column data types
val new_df=spark.sql("""SELECT ID,TO_DATE(CAST(UNIX_TIMESTAMP(LOG_IN_DATE, 'yyyy-MM-dd') AS TIMESTAMP)) AS LOG_IN_DATE,USER from user""")
4. Display schema
new_df.printSchema
root
|-- ID: integer (nullable = false)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
Actually what you did:
schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))
could work, but you need to define your DataFrame as a var and do it like this:
for ((name, dtype) <- schema) {
  df = df.withColumn(name, col(name).cast(dtype))
}
Also, you could try reading your DataFrame like this:
import java.sql.Date
import spark.implicits._

case class MyClass(ID: Int, LOG_IN_DATE: Date, USER: String)
// Suppose you are reading from JSON
val df = spark.read.json(path).as[MyClass]
Hope this helps!