remove NULL columns in Spark SQL - Scala

How do I remove columns containing only null values from a table? Suppose I have a table:
SnapshotDate CreationDate Country Region CloseDate Probability BookingAmount RevenueAmount SnapshotDate1 CreationDate1 CloseDate1
null null null null null 25 882000 0 null null null
null null null null null 25 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
null null null null null 0 882000 0 null null null
So I would just like to keep the Probability, BookingAmount and RevenueAmount columns and ignore the rest.
Is there a way to dynamically select the columns?
I am using Spark 1.6.1.

I solved this with a global groupBy. This works for numeric and non-numeric columns:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

case class Entry(id: Long, name: String, value: java.lang.Float)

val results = Seq(
  Entry(10, null, null),
  Entry(10, null, null),
  Entry(20, null, null)
)

// On Spark 1.6, use sqlContext.createDataFrame instead of spark.
val df: DataFrame = spark.createDataFrame(results)

// Mark each value: 1 where the column is non-null, 0 where it is null,
// then take the max per column; an all-null column yields max = 0.
val row = df
  .select(df.columns.map(c => when(col(c).isNull, 0).otherwise(1).as(c)): _*)
  .groupBy().max(df.columns: _*)
  .first

// Keep the names of the aggregated columns whose max is 1,
// i.e. the columns that contain at least one non-null value.
val colKeep = row.getValuesMap[Int](row.schema.fieldNames)
  .map { c => if (c._2 == 1) Some(c._1) else None }
  .flatten.toArray

// The aggregated names look like "max(id)"; drop(4).dropRight(1)
// strips the "max(" prefix and the ")" suffix to recover the original name.
df.select(row.schema.fieldNames.intersect(colKeep)
  .map(c => col(c.drop(4).dropRight(1))): _*).show(false)
+---+
|id |
+---+
|10 |
|10 |
|20 |
+---+
Edit: I removed the shuffling of columns. The new approach keeps the given order of the columns.
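As an aside, and not part of the original answer: the same result can be had in a single aggregation pass by counting non-null values per column, since count(col) skips nulls. A minimal sketch, assuming the df defined above:
import org.apache.spark.sql.functions.{col, count}

// count(col(c)) counts only non-null values, so 0 means the column is all null.
val countExprs = df.columns.map(c => count(col(c)).as(c))
val counts = df.agg(countExprs.head, countExprs.tail: _*).first

// Keep only the columns with at least one non-null value, in original order.
val keep = df.columns.filter(c => counts.getAs[Long](c) > 0)
df.select(keep.map(col): _*).show(false)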

You can register a custom UDF and use it in Spark SQL:
sqlContext.udf.register("ISNOTNULL", (str: String) => Option(str).getOrElse(""))
And with Spark SQL you can do:
SELECT ISNOTNULL(Probability) Probability, ISNOTNULL(BookingAmount) BookingAmount, ISNOTNULL(RevenueAmount) RevenueAmount FROM df
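For completeness, a sketch of the full wiring from Scala on Spark 1.6; the temp-table name df is an assumption, and note that this substitutes empty strings for the nulls rather than dropping the all-null columns:
// Register the DataFrame as a temp table so the query above can see it
// (the name "df" is assumed; Spark 1.6 API).
df.registerTempTable("df")
sqlContext.udf.register("ISNOTNULL", (str: String) => Option(str).getOrElse(""))
val cleaned = sqlContext.sql(
  """SELECT ISNOTNULL(Probability) Probability,
    |       ISNOTNULL(BookingAmount) BookingAmount,
    |       ISNOTNULL(RevenueAmount) RevenueAmount
    |FROM df""".stripMargin)
cleaned.show(false)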

Related

Hive select outputs null values

I get NULL for every column in the select output for the following Hive table.
Describe studentdetails;
clustername string
schemaname string
tablename string
primary_key map<string,int>
schooldata struct<alternate_aliases:string,application_deadline:bigint,application_deadline_early_action:string,application_deadline_early_decision:bigint,calendaring_system:string,fips_code:string,funding_type:string,gender_preference:string,iped_id:bigint,learning_environment:string,mascot:string,offers_open_admission:boolean,offers_rolling_admission:boolean,region:string,religious_affiliation:string,school_abbreviation:string,school_colors:string,school_locale:string,school_term:string,short_name:string,created_date:bigint,modified_date:bigint,percent_students_outof_state:float> from deserializer
deletedind boolean
truncatedind boolean
versionid bigint
select * from studentdetails limit 3;
Output:
NULL NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL NULL
I have used the following properties while creating the table.
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
And the following properties while selecting the data.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
ADD JAR s3://emr/hive/lib/hive-serde-1.0.jar;
Thank you for the comments, I have found the solution.
The issue was that the column names in my JSON file and the column names I used while creating the table were different.
When I synced the column names between the Hive table and the JSON file, the issue was resolved.
Thanks & Regards,
Srivignesh KN
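A quick way to double-check which field names a JSON file actually contains is to let Spark infer the schema (a sketch, not from the original thread; the path is hypothetical):
// Prints the field names Spark infers from the JSON; these must match the
// Hive column names exactly for the JsonSerDe to populate the columns.
val json = sqlContext.read.json("s3://bucket/path/studentdetails.json")  // hypothetical path
json.printSchema()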

T-SQL: Timeseries filling ranges

I have a dataset with a time series in YYYYMM format and two columns that act as true/false flags. I would like to add two extra columns, based on these flags, that retrieve the current range:
Month Default Cure
201301 0 NULL
201302 0 NULL
201303 0 NULL
201304 1 NULL
201305 1 NULL
201306 1 NULL
201307 1 NULL
201308 NULL 0
201309 NULL 0
201310 NULL 1
201311 0 NULL
201312 0 NULL
201401 0 NULL
201402 0 NULL
201403 1 NULL
201404 1 NULL
201405 0 NULL
201406 0 NULL
201407 NULL 1
201408 NULL 0
201409 NULL 0
201410 NULL 0
201411 NULL 0
201412 NULL 0
In this dataset you can see the Default column set to 1 for the periods 201304 through 201307, and the Cure column set to 1 in period 201310.
This basically means the Default time series is valid from period 201304 until period 201310. Ultimately I would like to generate the following set:
Month Default Cure DefaultPeriod CurePeriod
201301 0 NULL NULL NULL
201302 0 NULL NULL NULL
201303 0 NULL NULL NULL
201304 1 NULL 201304 201310
201305 1 NULL 201304 201310
201306 1 NULL 201304 201310
201307 1 NULL 201304 201310
201308 NULL 0 201304 201310
201309 NULL 0 201304 201310
201310 NULL 1 201304 201310
201311 0 NULL NULL NULL
201312 0 NULL NULL NULL
201401 0 NULL NULL NULL
201402 0 NULL NULL NULL
201403 1 NULL 201403 201407
201404 1 NULL 201403 201407
201405 0 NULL 201403 201407
201406 0 NULL 201403 201407
201407 NULL 1 201403 201407
201408 NULL 0 NULL NULL
201409 NULL 0 NULL NULL
201410 NULL 0 NULL NULL
201411 NULL 0 NULL NULL
201412 NULL 0 NULL NULL
Multiple ranges can occur, but they cannot overlap. How would I go about achieving this? I have tried all sorts of min/max period joins on the same table, but I can't seem to find a working solution.
This was a real thinker :)
Basically I am dividing up the data on the "Cure" dates (c1), numbering each group (c2), then looking for the min and max within each group (c3, c4), then applying some logic to filter out the rows that come before the min.
declare @t table
(
    [Month] varchar(6),
    [Default] bit,
    [Cure] bit
);

insert into @t values ('201301', 0, NULL);
insert into @t values ('201302', 0, NULL);
insert into @t values ('201303', 0, NULL);
insert into @t values ('201304', 1, NULL);
insert into @t values ('201305', 1, NULL);
insert into @t values ('201306', 1, NULL);
insert into @t values ('201307', 1, NULL);
insert into @t values ('201308', NULL, 0);
insert into @t values ('201309', NULL, 0);
insert into @t values ('201310', NULL, 1);
insert into @t values ('201311', 0, NULL);
insert into @t values ('201312', 0, NULL);
insert into @t values ('201401', 0, NULL);
insert into @t values ('201402', 0, NULL);
insert into @t values ('201403', 1, NULL);
insert into @t values ('201404', 1, NULL);
insert into @t values ('201405', 0, NULL);
insert into @t values ('201406', 0, NULL);
insert into @t values ('201407', NULL, 1);
insert into @t values ('201408', NULL, 0);
insert into @t values ('201409', NULL, 0);
insert into @t values ('201410', NULL, 0);
insert into @t values ('201411', NULL, 0);
insert into @t values ('201412', NULL, 0);

-- c1: the boundary months: the first month overall, plus every cure month
with c1 as
(
    select min([Month]) [Month], 1 x from @t
    union all
    select [Month], 1 from @t
    where Cure = 1
),
-- c2: number each group by summing the boundary flags of all preceding rows
c2 as
(
    select t.[Month], [Default], [Cure],
           sum(x) over (order by t.[Month]
                        rows between unbounded preceding and 1 preceding) grp
    from @t t
    left outer join c1 on c1.[Month] = t.[Month]
),
-- c3: the first defaulted month within each group
c3 as
(
    select grp, min([Month]) [Month]
    from c2
    where [Default] = 1
    group by grp
),
-- c4: the cure month within each group
c4 as
(
    select grp, max([Month]) [Month]
    from c2
    where [Cure] = 1
    group by grp
)
select c2.[Month], c2.[Default], c2.[Cure],
       case when c2.[Month] >= c3.[Month] then c3.[Month] else null end as DefaultPeriod,
       case when c2.[Month] >= c3.[Month] then c4.[Month] else null end as CurePeriod
from c2
left outer join c3 on c2.grp = c3.grp
left outer join c4 on c2.grp = c4.grp;
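Not part of the original answer, but since the rest of this page is Spark-flavored: the grouping step from c2 ports to Spark with a window function. A rough sketch, where ts is a hypothetical DataFrame with the Month, Default, and Cure columns from the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, when}

// Flag each cure month as a group boundary; a running sum over all preceding
// rows then numbers the groups much like CTE c2 above (rows before the first
// boundary get null, which still forms one group).
val boundary = when(col("Cure") === 1, 1).otherwise(0)
val w = Window.orderBy("Month").rowsBetween(Long.MinValue, -1)
val grouped = ts.withColumn("grp", sum(boundary).over(w))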

How to use RAISEERROR statement?

This is something very basic, but I can't understand it, and the manual is not helping:
declare @rule int =
    (select id from menu_availability_rules
     where (daily_serving_start = null or
            (daily_serving_start is null and null is null)) and
           (daily_serving_end = null or
            (daily_serving_end is null and null is null)) and
           (weekly_service_off = 3 or
            (weekly_service_off is null and 3 is null)) and
           (one_time_service_off = null or
            (one_time_service_off is null and null is null)));

print @rule;

-- syntax error here --\/
if (@rule is not null) raiseerror ('test error', 42, 42);

if @rule is not null
begin
    delete from menu_availability
    where menu_id = 5365 and rule_id = @rule

    delete from menu_availability_rules
    where (daily_serving_start = null or
           (daily_serving_start is null and null is null)) and
          (daily_serving_end = null or
           (daily_serving_end is null and null is null)) and
          (weekly_service_off = 3 or
           (weekly_service_off is null and 3 is null)) and
          (one_time_service_off = null or
           (one_time_service_off is null and null is null))
          and not exists
          (select rule_id from menu_availability
           where rule_id = @rule)
end
Why is it a syntax error? How would I write it? I need to throw an error for debugging purposes, just to make sure the code reached the conditional branch.
I can just replace the raiseerror with select 1 / 0 and I will get what I need, but why can't I do it normally?
The correct name is RAISERROR, with a single E. Note also that the severity argument must be between 0 and 25 (and values above 18 require sysadmin rights), so for a debugging check something like raiserror ('test error', 16, 42) is the safe form.

How to replace nulls with zeros in postgresql crosstabs

I have a product table with product_id and 100+ attributes. The product_id is text, whereas the attribute columns are integer, i.e. 1 if the attribute exists. When the PostgreSQL crosstab is run, non-matching attributes return null values. How do I replace the nulls with zeros instead?
SELECT ct.*
INTO ct3
FROM crosstab(
    'SELECT account_number, attr_name, sub FROM products ORDER BY 1, 2',
    'SELECT DISTINCT attr_name FROM attr_names ORDER BY 1')
AS ct(
    account_number text,
    Attr1 integer,
    Attr2 integer,
    Attr3 integer,
    Attr4 integer,
    ...
)
Replace this result:
account_number Attr1 Attr2 Attr3 Attr4
1.00000001 1 null null null
1.00000002 null null 1 null
1.00000003 null null 1 null
1.00000004 1 null null null
1.00000005 1 null null null
1.00000006 null null null 1
1.00000007 1 null null null
with this below:
account_number Attr1 Attr2 Attr3 Attr4
1.00000001 1 0 0 0
1.00000002 0 0 1 0
1.00000003 0 0 1 0
1.00000004 1 0 0 0
1.00000005 1 0 0 0
1.00000006 0 0 0 1
1.00000007 1 0 0 0
A workaround would be to do a select account_number, coalesce(Attr1, 0)... on the result, but typing out coalesce for each of the 100+ columns is rather unwieldy. Is there a way to handle this using crosstab? Thanks
You can use coalesce:
select account_number,
       coalesce(Attr1, 0) as Attr1,
       coalesce(Attr2, 0) as Attr2,
       ...
If you can put those Attrs into a table like
attr
-----
Attr1
Attr2
Attr3
...
then you could automatically generate the repeating coalesce expressions like
SELECT 'coalesce("' || attr || '", 0) "' || attr || '",' from table;
to save some typing.
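The same list can equally be generated client-side. A minimal sketch in Scala (the attrs values and the ct3 table name are assumptions for illustration; in practice the names could be read from the attr_names table above):
// Hypothetical list of attribute column names.
val attrs = Seq("Attr1", "Attr2", "Attr3", "Attr4")

// Build one coalesce(...) expression per attribute, plus the key column.
val selectList = ("account_number" +: attrs.map(a => s"coalesce($a, 0) AS $a")).mkString(", ")
val query = s"SELECT $selectList FROM ct3;"
println(query)
// SELECT account_number, coalesce(Attr1, 0) AS Attr1, ..., coalesce(Attr4, 0) AS Attr4 FROM ct3;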

SQL WHERE statement?

What should my WHERE clause be in a SQL Statement in which I want to return those rows where column A is null or column B is null, but not where both are null?
WHERE (ColA IS NULL AND ColB IS NOT NULL)
   OR (ColB IS NULL AND ColA IS NOT NULL)
Or, equivalently:
(A IS NULL OR B IS NULL) AND NOT (A IS NULL AND B IS NULL)
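For what it's worth, the same exactly-one-null condition written against a Spark DataFrame (a sketch; the rows DataFrame and the ColA/ColB names are assumptions, used only because the rest of this page is Spark-flavored):
import org.apache.spark.sql.functions.col

// Keep rows where exactly one of the two columns is null.
val exactlyOneNull = (col("ColA").isNull && col("ColB").isNotNull) ||
                     (col("ColB").isNull && col("ColA").isNotNull)
rows.filter(exactlyOneNull).show()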