pyspark hive - Insert NULL as DB null through text file

While inserting a text file from the pyspark shell into a Hive table, NULL values are being treated as the string 'NULL' in the table.
If I query the Hive table, those records can be retrieved only with the filter condition = 'NULL' rather than IS NULL.
Can anyone suggest how to insert the data as DB NULLs in the table?

Check whether your Spark dataframe contains null or None.
While writing to the Hive table, set the nullValue option:
df.write.option('nullValue', None).saveAsTable(path)
This will solve your issue.
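Alternatively, if the literal string 'NULL' has already landed in the dataframe, one option is to convert it back to a real null before writing. A minimal sketch, assuming a single hypothetical column named col1:

from pyspark.sql import functions as F

# Turn the literal string 'NULL' from the text file into a real SQL NULL
cleaned = df.withColumn(
    'col1',
    F.when(F.col('col1') == 'NULL', F.lit(None)).otherwise(F.col('col1')))

cleaned.write.saveAsTable('db.target_table')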

Related

Glue load job not retaining default column value in redshift

I have a Glue job that loads a CSV from S3 into a Redshift table. There is one column (updated_date) which is not mapped. The default value for that column is set to current_timestamp in UTC. But each time the Glue job runs, this updated_date column is null.
I tried removing updated_dt from the Glue metadata table. I tried removing updated_dt from SelectFields.apply() in the Glue script.
When I do a normal insert statement in Redshift without using the updated_dt column, the default current_timestamp() value is inserted for those rows.
Thanks
Well I had the same problem. AWS support told me to convert the Glue DynamicFrame into a Spark DataFrame and use the Spark writer to load data into Redshift:
SparkDF = GlueDynFrame.toDF()
SparkDF.write.format('jdbc').options(
    url='<JDBC url>',
    dbtable='<schema>.<table>',
    user='<username>',
    password='<password>').mode('append').save()
I, on the other hand, handled the problem by either dropping and creating the target table using preactions, or just using an update to set the default values in the postactions.
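For the preactions/postactions route, a rough sketch of what that could look like with the Glue Redshift writer (connection name, database, table, bucket and the SQL statements are all placeholders, not taken from the original job):

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=GlueDynFrame,
    catalog_connection='<redshift-connection>',
    connection_options={
        'database': '<database>',
        'dbtable': '<schema>.<table>',
        # recreate the target table before the load
        'preactions': 'DROP TABLE IF EXISTS <schema>.<table>; CREATE TABLE <schema>.<table> (...);',
        # backfill the default value after the load
        'postactions': 'UPDATE <schema>.<table> SET updated_dt = GETDATE() WHERE updated_dt IS NULL;'
    },
    redshift_tmp_dir='s3://<temp-bucket>/tmp/')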

How to differentiate between null and missing MongoDB values in a spark dataframe?

Pre-requisite: The schema of MongoDB documents is unknown.
The data has both null values and missing values:
For instance, CCNO in the following documents
(1) has a value
(2) has a null value
(3) is missing.
I am fetching them into Spark using MongoSpark.load(SparkSession, ReadConfig), but it replaces missing values with nulls in the dataframe.
Please suggest a way to differentiate between the manually inserted nulls and the Spark-inserted nulls (for missing values).
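For illustration only, the three cases described above could look like this (every field name other than CCNO is an assumption):

doc_1 = {'_id': 1, 'CCNO': '12345'}   # (1) CCNO has a value
doc_2 = {'_id': 2, 'CCNO': None}      # (2) CCNO is an explicit null
doc_3 = {'_id': 3}                    # (3) CCNO is missing altogether

# After MongoSpark.load(...), doc_2 and doc_3 both surface as null in the
# CCNO column of the dataframe, which is the ambiguity in question.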

Hive Partition Table with Date Datatype via Spark

I have a scenario and would like to get an expert opinion on it.
I have to load a Hive table in partitions from a relational DB via Spark (Python). I cannot create the Hive table up front, as I am not sure how many columns there are in the source and they might change in the future, so I have to fetch the data using select * from tablename.
However, I am sure of the partition column and know that it will not change. This column is of "date" datatype in the source DB.
I am using saveAsTable with the partitionBy option and am able to properly create folders as per the partition column. The Hive table is also getting created.
The issue I am facing is that the partition column is of "date" data type, which is not supported in Hive for partitions. Because of this I am unable to read the data via Hive or Impala queries, as they say date is not supported as a partition column.
Please note that I cannot typecast the column at the time of issuing the select statement, as I have to do a select * from tablename and not select a, b, cast(c) as varchar from table.
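A minimal sketch of the flow described above (connection details and column names are hypothetical; only the select * plus partitionBy approach comes from the question):

# Fetch everything from the source table; the column list is unknown up front
df = spark.read.format('jdbc').options(
    url='<JDBC url>',
    dbtable='tablename',
    user='<username>',
    password='<password>').load()

# part_dt stands for the date-typed partition column; Hive does not accept
# DATE as a partition column type, which is what breaks the Hive/Impala reads.
df.write.mode('overwrite').partitionBy('part_dt').saveAsTable('db.tablename')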

PySpark - Saving Hive Table - org.apache.spark.SparkException: Cannot recognize hive type string

I am saving a Spark dataframe to a Hive table. The Spark dataframe is a nested JSON data structure. I am able to save the dataframe as files, but it fails at the point where it creates a Hive table on top of it, saying:
org.apache.spark.SparkException: Cannot recognize hive type string
I cannot create a Hive table schema first and then insert into it, since the dataframe consists of a couple of hundred nested columns.
So I am saving it as:
df.write.partitionBy("dt","file_dt").saveAsTable("df")
I am not able to debug what the issue is.
The issue I was having was to do with a few columns which were named as numbers: "1", "2", "3". Removing such columns from the dataframe let me create the Hive table without any errors.
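A small sketch of that clean-up, assuming the offending names are purely numeric top-level columns (nested fields would need a similar treatment):

# Drop columns whose names are just numbers, e.g. "1", "2", "3"
numeric_named = [c for c in df.columns if c.isdigit()]
cleaned = df.drop(*numeric_named)

cleaned.write.partitionBy("dt", "file_dt").saveAsTable("df")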

Typecasting a Dataframe returns 'null' for empty fields

I have raw data loaded into my Hive tables with all columns as strings by default. Now I need to change the datatypes of the Hive tables in order to export to SQL Server.
When typecasting the Hive columns, the empty fields return 'NULL'. I tried loading the Hive tables into a dataframe and typecasting the columns there, but the dataframe also returns 'null' for empty fields. SQL Server couldn't recognise such values.
Can anyone suggest a solution to avoid the 'null' values in the output when I get the data from Hive or dataframes?
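To illustrate the behaviour being described, a small sketch (column names and data are made up): casting an empty string to a numeric type in Spark yields null.

from pyspark.sql import functions as F

df = spark.createDataFrame([('1', '10'), ('2', '')], ['id', 'amount'])

# Casting the empty string to int produces null rather than an empty value,
# so the second row ends up with null in amount_int
df.withColumn('amount_int', F.col('amount').cast('int')).show()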
If you want to change the data type only because you want that particular format in the exported data, consider writing to a directory as per your requirement and then exporting using sqoop or any other tool.
INSERT OVERWRITE DIRECTORY '<HDFS path>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<delimiter>'
SELECT
    a,
    b
FROM
    table_name
WHERE <condition>;
While exporting, if you have null values, consider using these arguments in your sqoop command:
--null-string "\\N" --null-non-string "\\N"
Hope this helps you