Druid deletes a column when all of its rows are empty; how can I make Druid keep these empty columns? - druid

Druid deletes a column when all of its rows are empty.
For example, take a table like the one below:
A     | B     | C
Data  | Data  |
Data  | Data  |
Then I use Hadoop batch indexing to ingest the CSV from HDFS into Druid. Columns A and B load successfully, but column C does not show up in Druid. It seems Druid drops column C because it is empty in every row.
So, I want to understand how to make Druid keep these empty columns.
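For reference, a minimal sketch (as a Python dict) of the parser portion of a Hadoop index task spec, assuming the CSV also has a timestamp column called ts, which is an assumption and not from the question. The usual approach is to list the dimensions explicitly in dimensionsSpec instead of relying on schemaless discovery; whether a completely empty column is then materialized can still depend on the Druid version.
# A sketch only: parser section of a Hadoop index task, with the dimensions
# listed explicitly so column C is not left to schemaless discovery.
parser = {
    "type": "hadoopyString",
    "parseSpec": {
        "format": "csv",
        "columns": ["ts", "A", "B", "C"],               # columns as they appear in the CSV
        "timestampSpec": {"column": "ts", "format": "auto"},
        "dimensionsSpec": {
            # Listing A, B and C explicitly asks Druid to treat C as a dimension
            # even though every row is empty; behaviour can vary by Druid version.
            "dimensions": ["A", "B", "C"]
        }
    }
}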

Related

Skip null columns while writing to the sink in mapping dataflow

What I need to solve: skip a column if its value comes in as NULL from the data source, so that it does not overwrite the existing value.
Scenario: I am processing CDC for a master table and a referenced table. To get the CDC changes from both tables, driven by the master table, I do a left join between the master and reference CDC data.
When a change happened only in the master table, the left join returns NULLs for all of the reference columns, and in the mapping dataflow the reference column values in the target get overwritten with NULLs.
Any suggestions on how to skip the columns with null values in the dataflow?
The reason for the left join is that there is a very high chance changes will happen only in table 1 but not in table 2.
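This is not ADF mapping data flow expression syntax, just a PySpark sketch of the underlying idea with hypothetical table and column names: whenever the left join produced a null for a reference column, fall back to the value already in the target, so nulls never overwrite existing data. In a data flow the same intent is typically expressed per column in a derived-column step.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

incoming = spark.table("cdc_joined")     # master LEFT JOIN reference CDC rows (hypothetical)
existing = spark.table("target_table")   # current contents of the target (hypothetical)

merged = (
    incoming.alias("i")
    .join(existing.alias("e"), on="key_col", how="left")
    .select(
        "key_col",
        # Keep the existing value whenever the incoming reference value is null.
        F.coalesce(F.col("i.ref_value"), F.col("e.ref_value")).alias("ref_value"),
    )
)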

Hive Partition Table with Date Datatype via Spark

I have a scenario and would like to get an expert opinion on it.
I have to load a Hive table in partitions from a relational DB via Spark (Python). I cannot create the Hive table up front, as I am not sure how many columns there are in the source and they might change in the future, so I have to fetch the data using select * from tablename.
However, I am sure of the partition column and know that it will not change. This column is of "date" datatype in the source DB.
I am using saveAsTable with the partitionBy option, and the folders are created properly per partition column value. The Hive table is also getting created.
The issue I am facing is that the partition column is of "date" data type, which is not supported for Hive partitions. Because of this I am unable to read the data via Hive or Impala queries; they report that date is not supported as a partition column.
Please note that I cannot typecast the column at the time of issuing the select statement, because I have to do a select * from tablename and not a select a, b, cast(c as varchar) from table.
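A hedged PySpark sketch of one way to handle this, keeping the select * intact and only casting the known partition column afterwards; the column name part_dt and the connection details are placeholders, not taken from the question.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:<source-db-url>")   # placeholder connection string
    .option("dbtable", "tablename")
    .load()
)

# Only the partition column is touched; every other column stays exactly as
# selected, so new source columns are still picked up automatically.
df = df.withColumn("part_dt", F.col("part_dt").cast("string"))

df.write.mode("overwrite").partitionBy("part_dt").saveAsTable("db.hive_table")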

Typecasting a Dataframe returns 'null' for empty fields

I have raw data loaded into my Hive tables with all the columns as strings by default. Now I need to change the datatypes of the Hive tables in order to export to SQL Server.
When typecasting the Hive columns, the empty fields return 'NULL'. I tried loading the Hive tables into a dataframe and typecasting the columns there, but the dataframe also returns 'null' for empty fields, and SQL Server cannot recognize such values.
Can anyone suggest a solution to avoid the 'null' values when I read the data from Hive or from dataframes?
If you want to change the data type only because you need that particular format in the exported data, consider writing to a directory in the format you require and then exporting with Sqoop or any other tool:
INSERT OVERWRITE DIRECTORY '<HDFS path>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<delimiter>'
SELECT
  a,
  b
FROM
  table_name
WHERE <condition>;
While exporting, if you have null values, consider using these arguments in your Sqoop command:
--null-string "\\N" --null-non-string "\\N"
Hope this helps you
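Separately, if substituting a default value is acceptable, a hedged PySpark sketch of the other common workaround (table and column names are placeholders): cast first, then fill the resulting nulls so the export never sees them.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("db.hive_table")

# Cast the string columns to the required types, then replace the nulls that
# empty strings turn into: 0 for numeric columns, "" for string columns.
typed = df.selectExpr("cast(a as int) as a", "cast(b as string) as b")
cleaned = typed.na.fill(0).na.fill("")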

How can I update or delete records of a Hive table from Spark, without loading the entire table into a dataframe?

I have a Hive ORC table with around 2 million records. Currently, to update or delete, I load the entire table into a dataframe, apply the update, and save the result as a new dataframe using Overwrite mode (the command is below). So, to update a single record, do I really need to load and process the entire table's data?
I'm unable to do objHiveContext.sql("update myTable set columnName='' ")
I'm using Spark 1.4.1, Hive 1.2.1
myData.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("myTable"), where myData is the updated dataframe.
How can I avoid loading all 2-3 million records just to update a single record of the Hive table?
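For illustration only, a hedged PySpark sketch of the full-table rewrite the question describes, with the single-record "update" expressed through when/otherwise (the id filter is a placeholder). It still rewrites the whole table, which is exactly the cost being asked about.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("myTable")

# Blank out columnName only for the row(s) being "updated".
updated = df.withColumn(
    "columnName",
    F.when(F.col("id") == 42, F.lit("")).otherwise(F.col("columnName")),
)

# As in the question; note that some Spark versions refuse to overwrite a table
# that is also being read in the same job.
updated.write.format("orc").mode("overwrite").saveAsTable("myTable")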

Hive partitioning external table based on range

I want to partition an external table in Hive based on ranges of numbers, say numbers from 1 to 100 go to one partition. Is it possible to do this in Hive?
I am assuming here that you have a table with some records from which you want to load data into an external table that is partitioned by some field, say RANGEOFNUMS.
Now, suppose we have a table called testtable with columns name and value. The contents are like:
India,1
India,2
India,3
India,3
India,4
India,10
India,11
India,12
India,13
India,14
Now, suppose we have an external table called testext with some columns along with a partition column, say RANGEOFNUMS.
Now you can do the following:
insert into table testext partition(rangeofnums="your value")
select * from testtable where value>=1 and value<=5;
This way all records from the testtable having value 1 to 5 will come into one partition of the external table.
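For reference, a hedged sketch of what the partitioned external table might look like; the column types and location are assumptions, not from the question, and the embedded statement is ordinary Hive DDL issued here through spark.sql.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Assumed definition of the external table used in the example above.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS testext (
        name  STRING,
        value INT
    )
    PARTITIONED BY (rangeofnums STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/path/to/testext'
""")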
The scenario is my assumption only. Please comment if this is not the scenario you have.
Achyut