Reading data from CSV returns null values - Scala

I am trying to read data from a CSV file using Scala and Spark, but
the column values come back null.
I read the CSV file and also provided a schema
so the data can be queried easily.
private val myData = sparkSession.read.schema(createDataSchema).csv("data/myData.csv")

def createDataSchema = {
  val schema = StructType(
    Array(
      StructField("data_index", StringType, nullable = false),
      StructField("property_a", IntegerType, nullable = false),
      StructField("property_b", IntegerType, nullable = false),
      // some other columns
    )
  )
  schema
}
Querying data:
val myProperty = myData.select($"property_b")
myProperty.collect()
I expect the data to be returned as a list of the actual values,
but instead it comes back as a list of nulls.
Why?
When I print the schema, nullable is set to true instead of false.
I am using Scala 2.12.9 and Spark 2.4.3.

When loading data from a CSV file, even though the schema is provided with nullable = false, Spark overwrites it as nullable = true so that a null pointer exception can be avoided during the load.
As an example, assume the CSV file has two rows, where the second row has an empty or null column value.
CSV:
a,1,2
b,,2
If nullable = false were kept, a null pointer exception would be thrown when an action is called on the DataFrame: there is an empty/null value to load and no default value to fall back on. To avoid this, Spark overwrites the flag as nullable = true.
However, this can be handled by replacing all nulls with a default value and then re-applying the schema:
val df = spark.read.schema(schema).csv("data/myData.csv")
// Replace nulls in property_a with a default value (0)
val dfWithDefault = df.withColumn("property_a", when(col("property_a").isNull, 0).otherwise(df.col("property_a")))
// Re-apply the original schema so that nullable = false is preserved
val dfNullableFalse = spark.sqlContext.createDataFrame(dfWithDefault.rdd, schema)
dfNullableFalse.show(10)
df.printSchema()
root
|-- data_index: string (nullable = true)
|-- property_a: integer (nullable = true)
|-- property_b: integer (nullable = true)
dfNullableFalse.printSchema()
root
|-- data_index: string (nullable = false)
|-- property_a: integer (nullable = false)
|-- property_b: integer (nullable = false)
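As a side note, the same default-filling can also be written with DataFrame.na.fill. This is only a minimal sketch under the same assumptions as above (spark, schema and the file path from the question); it should behave the same as the when/otherwise version.
// Fill nulls in the integer columns with 0, then re-apply the non-nullable schema
val filled = spark.read.schema(schema).csv("data/myData.csv")
  .na.fill(0, Seq("property_a", "property_b"))
val nonNullable = spark.sqlContext.createDataFrame(filled.rdd, schema)
nonNullable.printSchema()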

Related

pyspark delta table schema evolution nullable

I am using schema evolution in a Delta table, and the code is written in a Databricks notebook.
df_read = (spark.read.schema(schema)
           .format(file_format)
           .option("inferSchema", "false")
           .option("mergeSchema", "true")
           .load(source_folders))
But I still get the error below. Is it correct to define the schema and enable mergeSchema at the same time?
AnalysisException: The specified schema does not match the existing schema at path.
== Specified ==
root
-- A: string (nullable = false)
-- B: string (nullable = true)
-- C: long (nullable = true)
== Existing ==
root
-- A: string (nullable = true)
-- B: string (nullable = true)
-- C: long (nullable = true)
== Differences==
- Field A is non-nullable in specified schema but nullable in existing schema.
If your intention is to keep the existing schema, you can omit the
schema from the create table command. Otherwise please ensure that
the schema matches.
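Following the hint in the error message, one option is to relax the nullability of the specified schema so it matches the existing schema (or omit the schema entirely). Below is only a minimal sketch, shown in Scala since that is this thread's main language; relaxNullability is a hypothetical helper name.
import org.apache.spark.sql.types.StructType

// Hypothetical helper: mark every field nullable so the specified schema
// matches the existing table schema reported in the error.
def relaxNullability(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(nullable = true)))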

Schema error using hiveContext.createDataFrame from an RDD [scala spark 2.4]

Trying to run:
val outputDF = hiveContext.createDataFrame(myRDD, schema)
Getting this error:
Caused by: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of struct<col1name:string,col2name:string>
myRDD.take(5).foreach(println)
[string number,[Lscala.Tuple2;@163601a5]
[1234567890,[Lscala.Tuple2;@6fa7a81c]
Structure of the data in the RDD:
RDD[Row]: [string number, [(string key, string value)]]
Row(string, Array(Tuple(String, String)))
where each Tuple2 contains data like this:
(string key, string value)
schema:
root
|-- col1name: string (nullable = true)
|-- col2name: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col3name: string (nullable = true)
| | |-- col4name: string (nullable = true)
StructType(
  StructField(col1name, StringType, true),
  StructField(col2name,
    ArrayType(
      StructType(
        StructField(col3name, StringType, true),
        StructField(col4name, StringType, true)
      ),
      true
    ),
    true
  )
)
This code used to run in Spark 1.6 without problems. In Spark 2.4, it appears that a Tuple2 no longer counts as a struct type? If so, what should it be changed to?
I'm assuming the easiest solution would be to adjust the schema to suit the data.
Let me know if more details are needed.
The answer to this is to change the tuple that contained the two strings to a Row containing the two strings instead.
So for the provided schema, the incoming data structure was
Row(string, Array(Tuple(String, String)))
This was changed to
Row(string, Array(Row(String, String)))
in order to continue using the same schema.
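A minimal sketch of that change, using the names from the question (myRDD, schema, and hiveContext are assumed to be in scope); the idea is just to wrap each (key, value) pair in a Row so the element type matches the struct declared in the schema:
import org.apache.spark.sql.Row

// Convert the Array[(String, String)] elements into rows so they match the
// ArrayType(StructType(...)) declared in the schema.
val fixedRDD = myRDD.map { r =>
  val id    = r.getString(0)
  val pairs = r.getAs[Array[(String, String)]](1)
  Row(id, pairs.map { case (k, v) => Row(k, v) }.toSeq)
}

val outputDF = hiveContext.createDataFrame(fixedRDD, schema)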

withColumnRenamed changes the nullability of a Column

df = df.withColumnRenamed('mail', 'EmailAddress')
changes the nullability I declared as part of the schema (declared as False). Is there a way to prevent this?
Nothing in the PySpark documentation really mentions this.
schema = StructType([StructField("mail", StringType(), False)])
df = spark.read.json(inputPath, schema = schema)
df = df.withColumnRenamed('mail', 'EmailAddress')
df.printSchema()
This outputs:
|-- EmailAddress: string (nullable = true)
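One possible workaround, mirroring the createDataFrame re-apply trick from the first answer above, is to rename the field in the schema itself and re-apply that schema after the rename. This is only a sketch in Scala (the thread's main language) rather than PySpark, and renamedSchema is a hypothetical name:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema with the new column name and the desired nullability
val renamedSchema = StructType(Seq(StructField("EmailAddress", StringType, nullable = false)))
val renamed = df.withColumnRenamed("mail", "EmailAddress")
// Re-apply the schema so nullable = false is kept after the rename
val withNullability = spark.createDataFrame(renamed.rdd, renamedSchema)
withNullability.printSchema()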

Spark apply custom schema to a DataFrame

I have data in a Parquet file and want to apply a custom schema to it.
My initial schema within the Parquet file is as below:
root
|-- CUST_ID: decimal(9,0) (nullable = true)
|-- INACTV_DT: string (nullable = true)
|-- UPDT_DT: string (nullable = true)
|-- ACTV_DT: string (nullable = true)
|-- PMT_AMT: decimal(9,4) (nullable = true)
|-- CMT_ID: decimal(38,14) (nullable = true)
My custom schema is below:
root
|-- CUST_ID: decimal(38,0) (nullable = false)
|-- INACTV_DT: timestamp (nullable = false)
|-- UPDT_DT: timestamp (nullable = false)
|-- ACTV_DT: timestamp (nullable = true)
|-- PMT_AMT: decimal(19,4) (nullable = true)
|-- CMT_ID: decimal(38,14) (nullable = false)
Below is my code to apply the custom schema to the data frame:
val customSchema = getOracleDBSchema(sparkSession, QUERY).schema
val DF_frmOldParkquet = sqlContext_par.read.parquet("src/main/resources/data_0_0_0.parquet")
val rows: RDD[Row] = DF_frmOldParkquet.rdd
val newDataFrame = sparkSession.sqlContext.createDataFrame(rows, customSchema)
newDataFrame.printSchema()
newDataFrame.show()
I get the error below when I perform this operation.
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of timestamp
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), fromDecimal, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, CUST_ID), DecimalType(38,0)), true) AS CUST_ID#27
There are two main applications of a schema in Spark SQL:
the schema argument passed to the schema method of the DataFrameReader, which is used to transform data in some formats (primarily plain text files). In this case the schema can be used to automatically cast the input records.
the schema argument passed to createDataFrame (the variants which take an RDD or a List of Rows) of the SparkSession. In this case the schema has to conform to the data and is not used for casting.
Neither of these is applicable in your case:
The input is strongly typed, therefore the schema, if present, is ignored by the reader.
The schema doesn't match the data, therefore it cannot be used with createDataFrame.
In this scenario you should cast each column to the desired type. Assuming the types are compatible, something like this should work:
val newDataFrame = customSchema.fields.foldLeft(DF_frmOldParkquet) {
  // cast each column to the type declared in the custom schema
  (df, s) => df.withColumn(s.name, df(s.name).cast(s.dataType))
}
Depending on the format of the data, this may or may not be sufficient. For example, if fields that should be transformed to timestamps don't use a standard format, casting won't work and you'll have to use Spark's datetime processing functions.
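A sketch of that explicit-parse approach with to_timestamp (the "MM/dd/yyyy" pattern is only an assumption and must be replaced with the format actually used in the data; DF_frmOldParkquet is the DataFrame from the question):
import org.apache.spark.sql.functions.{col, to_timestamp}

// Parse the string date columns explicitly instead of relying on a plain cast.
val parsed = DF_frmOldParkquet
  .withColumn("INACTV_DT", to_timestamp(col("INACTV_DT"), "MM/dd/yyyy"))
  .withColumn("UPDT_DT", to_timestamp(col("UPDT_DT"), "MM/dd/yyyy"))
  .withColumn("ACTV_DT", to_timestamp(col("ACTV_DT"), "MM/dd/yyyy"))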

spark convert a string to TimestampType

I have a DataFrame that I want to insert into PostgreSQL from Spark. In Spark the timestamp column is in string format; in PostgreSQL it is timestamp without time zone.
Spark errors out on the datetime column when inserting into the database. I did try to change the data type, but the insert still errors out, and I am unable to figure out why the cast does not work. If I paste the same insert statement into pgAdmin and run it, it runs fine.
import java.text.SimpleDateFormat;
import java.util.Calendar
object EtlHelper {
// Return the current time stamp
def getCurrentTime() : String = {
val now = Calendar.getInstance().getTime()
val hourFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
return hourFormat.format(now)
}
}
In another file
object CreateDimensions {
def createDimCompany(spark:SparkSession, location:String, propsLocation :String):Unit = {
import spark.implicits._
val dimCompanyStartTime = EtlHelper.getCurrentTime()
val dimcompanyEndTime = EtlHelper.getCurrentTime()
val prevDimCompanyId = 2
val numRdd = 27
val AuditDF = spark.createDataset(Array(("dim_company", prevDimCompanyId,numRdd,dimCompanyStartTime,dimcompanyEndTime))).toDF("audit_tbl_name","audit_tbl_id","audit_no_rows","audit_tbl_start_date","audit_tbl_end_date")//.show()
AuditDF.withColumn("audit_tbl_start_date",AuditDF.col("audit_tbl_start_date").cast(DataTypes.TimestampType))
AuditDF.withColumn("audit_tbl_end_date",AuditDF.col("audit_tbl_end_date").cast(DataTypes.TimestampType))
AuditDF.printSchema()
}
}
root
|-- audit_tbl_name: string (nullable = true)
|-- audit_tbl_id: long (nullable = false)
|-- audit_no_rows: long (nullable = false)
|-- audit_tbl_start_date: string (nullable = true)
|-- audit_tbl_end_date: string (nullable = true)
This is the error I get
INSERT INTO etl.audit_master ("audit_tbl_name","audit_tbl_id","audit_no_rows","audit_tbl_start_date","audit_tbl_end_date") VALUES ('dim_company',27,2,'2018-05-02 12:15:54','2018-05-02 12:15:59') was aborted: ERROR: column "audit_tbl_start_date" is of type timestamp without time zone but expression is of type character varying
Hint: You will need to rewrite or cast the expression.
Any help is appreciated.
Thank you.
AuditDF.printSchema() is printing the original AuditDF DataFrame, because you didn't save the .withColumn transformations by assigning their result. DataFrames are immutable objects: they can be transformed into new DataFrames, but they cannot change themselves. So you always need an assignment to keep the transformations you've applied.
The correct way is therefore to assign the result in order to save the changes:
val transformedDF = AuditDF.withColumn("audit_tbl_start_date",AuditDF.col("audit_tbl_start_date").cast(DataTypes.TimestampType))
.withColumn("audit_tbl_end_date",AuditDF.col("audit_tbl_end_date").cast("timestamp"))
transformedDF.printSchema()
You will see the changes:
root
|-- audit_tbl_name: string (nullable = true)
|-- audit_tbl_id: integer (nullable = false)
|-- audit_no_rows: integer (nullable = false)
|-- audit_tbl_start_date: timestamp (nullable = true)
|-- audit_tbl_end_date: timestamp (nullable = true)
.cast(DataTypes.TimestampType) and .cast("timestamp") are the same.
The root of your problem is what @Ramesh mentioned, i.e. that you didn't assign the changed AuditDF to a new value (val). Note that both the DataFrame and the value you assigned it to are immutable (AuditDF was defined as a val, so it can't be reassigned either).
Another thing: you don't need to reinvent the wheel with EtlHelper. Spark has a built-in function that gives you the current timestamp:
import org.apache.spark.sql.functions._

val AuditDF = spark.createDataset(Array(("dim_company", prevDimCompanyId, numRdd)))
  .toDF("audit_tbl_name", "audit_tbl_id", "audit_no_rows")
  .withColumn("audit_tbl_start_date", current_timestamp())
  .withColumn("audit_tbl_end_date", current_timestamp())