Typecasting a Dataframe returns 'null' for empty fields - scala

I have raw data loaded into my Hive tables with all the columns as strings by default. Now I need to change the data types of the Hive tables so that I can export them to SQL Server.
When I typecast the Hive columns, the empty fields come back as 'NULL'. I also tried loading the Hive tables into a DataFrame and typecasting the columns there, but the DataFrame returns 'null' for empty fields as well. SQL Server doesn't recognize such values.
Can anyone suggest a way to avoid the 'null' values when I read the data from Hive or from DataFrames?

If you want to change the data type only because the exported data needs that particular format, consider writing the data to a directory in the format you need and then exporting it with sqoop or any other tool.
INSERT OVERWRITE DIRECTORY '<HDFS path>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<delimiter>'
SELECT
a,
b
FROM
table_name
WHERE <condition>;
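If the data contains null values, tell sqoop which string represents them. On sqoop import the arguments are
--null-string "\\N" --null-non-string "\\N"
and on sqoop export the equivalents are --input-null-string and --input-null-non-string.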
Hope this helps you

Related

Store JSONB PostgreSQL data type column into Athena

I am creating an Athena external table on a CSV file that I generated from my PostgreSQL database.
The CSV contains a column that has the jsonb data type.
If possible, I want to exclude this column from the table created in Athena. Otherwise, please suggest a way to include this data type.

Column mapping option argument is not supported for PARQUET based COPY

I have to insert Parquet file data into a Redshift table. The Parquet file may have fewer columns than the Redshift table. I used the command below.
COPY table_name
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
FORMAT AS PARQUET
But I get the following error when I run the COPY command:
Column mapping option argument is not supported for PARQUET based COPY
I tried using a column list like this:
COPY table_name(column1, column2..)
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
But then I get a "Delimiter not found" error. If I add FORMAT AS PARQUET to the COPY command above (the one with the column list), I get "Column mapping option argument is not supported for PARQUET based COPY" again.
Could you please let me know how to resolve this?
Thanks
The number of columns in the Parquet file MUST match the number of columns in the table (reference). You can't use column mapping with Parquet files.
What you can do: create a staging table and copy the Parquet file content into it. Then run an insert into your final table using insert into final_table (col1, col2) select col1, col2 from stg_table
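The staging-table approach can be sketched as follows. This is only an illustration: sqlite3 stands in for Redshift so the example runs locally, the COPY step is replaced by plain inserts, and the column col3 is a hypothetical extra column that the file does not supply.

```python
import sqlite3

# sqlite3 is only a local stand-in for Redshift; col3 is a hypothetical extra column.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The final table has more columns than the incoming file.
cur.execute("CREATE TABLE final_table (col1 INTEGER, col2 TEXT, col3 TEXT)")
# The staging table mirrors the file's columns exactly, so COPY needs no column list.
cur.execute("CREATE TABLE stg_table (col1 INTEGER, col2 TEXT)")

# Stand-in for: COPY stg_table FROM s3_path ... FORMAT AS PARQUET
cur.executemany("INSERT INTO stg_table VALUES (?, ?)", [(1, "a"), (2, "b")])

# The column list goes on the INSERT instead of on the COPY.
cur.execute("INSERT INTO final_table (col1, col2) SELECT col1, col2 FROM stg_table")
conn.commit()

rows = cur.execute(
    "SELECT col1, col2, col3 FROM final_table ORDER BY col1"
).fetchall()
print(rows)  # columns missing from the file are left as NULL (None)
```

In Redshift the staging table would typically be dropped or truncated after the insert.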

Load data with default values into Redshift from a parquet file

I need to load data with a default value column into Redshift, as outlined in the AWS docs.
Unfortunately the COPY command doesn't allow loading data with default values from a parquet file, so I need to find a different way to do that.
My table requires a column with the getdate function from Redshift:
LOAD_DT TIMESTAMP DEFAULT GETDATE()
If I use the COPY command and add the column names as arguments I get the error:
Column mapping option argument is not supported for PARQUET based COPY
What is a workaround for this?
Can you post a reference for Redshift not supporting default values for a Parquet COPY? I haven't heard of this restriction.
As for workarounds, I can think of two:
1. Copy the file to a temp table and then insert from this temp table into your table with the default value.
2. Define an external table that uses the Parquet file as its source and insert from this table into the table with the default value.
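A minimal sketch of the first workaround, under stated assumptions: sqlite3 stands in for Redshift, CURRENT_TIMESTAMP stands in for GETDATE(), and the table names target and stage plus the id and name columns are hypothetical. The point is that omitting the column from the INSERT column list lets the default fill it in.

```python
import sqlite3

# sqlite3 is a local stand-in for Redshift; names here are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CURRENT_TIMESTAMP plays the role of Redshift's DEFAULT GETDATE().
cur.execute(
    "CREATE TABLE target ("
    "id INTEGER, name TEXT, load_dt TEXT DEFAULT CURRENT_TIMESTAMP)"
)
# The temp table holds only the columns present in the file.
cur.execute("CREATE TEMP TABLE stage (id INTEGER, name TEXT)")

# Stand-in for the Parquet COPY into the temp table.
cur.executemany("INSERT INTO stage VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# load_dt is left out of the column list, so the default is applied per row.
cur.execute("INSERT INTO target (id, name) SELECT id, name FROM stage")
conn.commit()

loaded = cur.execute("SELECT id, name, load_dt FROM target ORDER BY id").fetchall()
for row in loaded:
    print(row)
```

The second workaround follows the same INSERT ... SELECT shape, with the external table taking the place of the temp table.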

Import wizard gives all NULL values in PostgreSQL when copying a CSV file with two columns of numeric data

I have a two-column CSV file that contains only numeric data. I created a table in PostgreSQL with both columns as numeric. The import wizard reports success, but every value in the PostgreSQL table is null. I'm not sure why.

PostgreSQL COPY CSV with two NULL strings

I have a source of csv files from a web query which contains two variations of a string that I would like to class as NULL when copying to a PostgreSQL table.
e.g.
COPY my_table FROM STDIN WITH CSV DELIMITER AS ',' NULL AS ('N/A', 'Not applicable');
I know this query will throw an error, so I'm looking for a way to specify two separate NULL strings in a COPY CSV query.
COPY does not support multiple NULL strings, so I think your best bet is to set the NULL string argument to one of them. Once everything is loaded, run an UPDATE that sets the column to an actual NULL wherever it contains the other string (the exact query depends on which columns can hold those values).
If you have a bunch of columns, you could use CASE statements in your SET clause to return NULL if it matches your special string, or the value otherwise. NULLIF could also be used (that would be more compact). e.g. NULLIF(col1, 'Not applicable')
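The load-then-UPDATE idea can be sketched like this. It is an illustration only: sqlite3 stands in for PostgreSQL (both provide NULLIF), the COPY step is simulated with plain inserts in which 'N/A' has already become NULL, and col2 is a hypothetical second column.

```python
import sqlite3

# sqlite3 is a local stand-in for PostgreSQL; col2 is a hypothetical column.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE my_table (col1 TEXT, col2 TEXT)")

# Stand-in for COPY ... NULL AS 'N/A': the first marker is already NULL (None),
# while the second marker is still stored as a literal string.
cur.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("x", "Not applicable"), (None, "y"), ("Not applicable", None)],
)

# NULLIF(a, b) returns NULL when a = b and returns a otherwise,
# so one UPDATE can clean several columns at once.
cur.execute(
    "UPDATE my_table SET "
    "col1 = NULLIF(col1, 'Not applicable'), "
    "col2 = NULLIF(col2, 'Not applicable')"
)
conn.commit()

cleaned = cur.execute("SELECT col1, col2 FROM my_table ORDER BY rowid").fetchall()
print(cleaned)
```

Values that were already NULL stay NULL, because NULLIF applied to a NULL input returns NULL.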