How to add validation on a delta table column dynamically? - pyspark

I'm working on a transformation and am stuck on a common problem. Any help is appreciated.
Scenario:
Step-1: Reading from a delta table.
+--------+-----------------+
| emp_id | str             |
+--------+-----------------+
| 1      | name=qwerty.    |
| 2      | age=22          |
| 3      | job=googling    |
| 4      | dob=12-Jan-2001 |
| 5      | weight=62.7.    |
+--------+-----------------+
Step-2: I'm refining the data and writing it out to another delta table dynamically (no predefined schema). Let's say I'm adding null if an attribute is not found.
+--------+--------+------+----------+-------------+--------+
| emp_id | name   | age  | job      | dob         | weight |
+--------+--------+------+----------+-------------+--------+
| 1      | qwerty | null | null     | null        | null   |
| 2      | null   | 22   | null     | null        | null   |
| 3      | null   | null | googling | null        | null   |
| 4      | null   | null | null     | 12-Jan-2001 | null   |
| 5      | null   | null | null     | null        | 62.7   |
+--------+--------+------+----------+-------------+--------+
Is there a way to apply validation in step-2 based on the column name? I'm splitting the string on = while deriving the above table. Or do I have to do the validation in a step-3, working on the new DataFrame?
Second question: Is there a way to achieve the following table?
+--------+--------+------+----------+-------------+--------+---------------------+
| emp_id | name   | age  | job      | dob         | weight | missing_attributes  |
+--------+--------+------+----------+-------------+--------+---------------------+
| 1      | qwerty | null | null     | null        | null   | age,job,dob,weight  |
| 2      | null   | 22   | null     | null        | null   | name,job,dob,weight |
| 3      | null   | null | googling | null        | null   | name,age,dob,weight |
| 4      | null   | null | null     | 12-Jan-2001 | null   | name,age,job,weight |
| 5      | null   | null | null     | null        | 62.7   | name,age,job,dob    |
+--------+--------+------+----------+-------------+--------+---------------------+
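For what it's worth, a minimal PySpark sketch of how step-2 could do the split on =, the pivot, a per-column check, and the missing_attributes column in one pass. The DataFrame name df, the expected_cols list and the age regex rule are assumptions for illustration, not a definitive implementation:
from pyspark.sql import functions as F

# Assumed names: df is the step-1 DataFrame, "str" holds the key=value pairs,
# and expected_cols lists the attributes you want to emit/validate.
expected_cols = ["name", "age", "job", "dob", "weight"]

parsed = (
    df.withColumn("key", F.split("str", "=").getItem(0))
      .withColumn("value", F.split("str", "=").getItem(1))
)

# Pivot the key=value pairs into columns; attributes that never appear
# for an emp_id come out as null.
wide = (
    parsed.groupBy("emp_id")
          .pivot("key", expected_cols)
          .agg(F.first("value"))
)

# Per-column validation can be applied right here, keyed off the column
# name (illustrative rule: age must be all digits, otherwise null it out).
wide = wide.withColumn(
    "age", F.when(F.col("age").rlike("^[0-9]+$"), F.col("age"))
)

# missing_attributes: names of the expected columns that ended up null.
# concat_ws skips null array elements, so only the missing names remain.
wide = wide.withColumn(
    "missing_attributes",
    F.concat_ws(",", F.array(*[
        F.when(F.col(c).isNull(), F.lit(c)) for c in expected_cols
    ]))
)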

Related

Postgres - Add new column to existing table

I want to alter the table and add a new column, but I also want to set the column's Storage value.
I tried the following and I get an error. Any idea how to fix this?
ALTER TABLE main_workflowjobtemplate
ADD COLUMN "ask_credential_on_launch" BOOLEAN NOT NULL STORAGE plain;
ERROR: syntax error at or near "STORAGE"
LINE 2: ...OLUMN "ask_credential_on_launch" BOOLEAN NOT NULL STORAGE pl...
Here is the table schema.
awx=# \d+ main_workflowjobtemplate;
Table "public.main_workflowjobtemplate"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------------------------+-----------------------+-----------+----------+---------+----------+--------------+-------------
unifiedjobtemplate_ptr_id | integer | | not null | | plain | |
extra_vars | text | | not null | | extended | |
admin_role_id | integer | | | | plain | |
execute_role_id | integer | | | | plain | |
read_role_id | integer | | | | plain | |
survey_enabled | boolean | | not null | | plain | |
survey_spec | text | | not null | | extended | |
allow_simultaneous | boolean | | not null | | plain | |
ask_variables_on_launch | boolean | | not null | | plain | |
ask_inventory_on_launch | boolean | | not null | | plain | |
inventory_id | integer | | | | plain | |
approval_role_id | integer | | | | plain | |
ask_limit_on_launch | boolean | | not null | | plain | |
ask_scm_branch_on_launch | boolean | | not null | | plain | |
char_prompts | text | | not null | | extended | |
webhook_credential_id | integer | | | | plain | |
webhook_key | character varying(64) | | not null | | extended | |
webhook_service | character varying(16) | | not null | | extended | |
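In case it helps, a sketch of one way around the error (not tested against this exact AWX schema): this Postgres version does not accept STORAGE inline in ADD COLUMN, so add the column first and set the storage mode in a separate statement. The DEFAULT false is an assumption, needed only because NOT NULL requires a default when the table already has rows; and since boolean is a fixed-width type whose storage is already plain, the second statement is usually redundant anyway.
-- Add the column; NOT NULL needs a default if the table already has rows.
ALTER TABLE main_workflowjobtemplate
    ADD COLUMN "ask_credential_on_launch" BOOLEAN NOT NULL DEFAULT false;

-- Set the storage mode separately (for boolean this is already the default).
ALTER TABLE main_workflowjobtemplate
    ALTER COLUMN "ask_credential_on_launch" SET STORAGE PLAIN;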

Flagging records after meeting a condition using Spark Scala

I need some expert opinion on the below scenario:
I have the following dataframe df1:
+------------+------------+-------+-------+
| Date1      | OrderDate  | Value | group |
+------------+------------+-------+-------+
| 10/10/2020 | 10/01/2020 | hostA | grp1  |
| 10/01/2020 | 09/30/2020 | hostB | grp1  |
| Null       | 09/15/2020 | hostC | grp1  |
| 08/01/2020 | 08/30/2020 | hostD | grp1  |
| Null       | 10/01/2020 | hostP | grp2  |
| Null       | 09/28/2020 | hostQ | grp2  |
| 07/11/2020 | 08/08/2020 | hostR | grp2  |
| 07/01/2020 | 08/01/2020 | hostS | grp2  |
| NULL       | 07/01/2020 | hostL | grp2  |
| NULL       | 08/08/2020 | hostM | grp3  |
| NULL       | 08/01/2020 | hostN | grp3  |
| NULL       | 07/01/2020 | hostO | grp3  |
+------------+------------+-------+-------+
Each group is ordered by OrderDate in descending order. After ordering, each Value having Current_date < (Date1 + 31 days), or Date1 as NULL, needs to be flagged as Valid, until the first row where Current_date > (Date1 + 31 days) is reached.
From that row onward, every Value should be marked as Invalid, irrespective of its Date1 value.
If all the Date1 values of a group are NULL, all of its Values should be tagged as Valid.
My output df should look like below:
+------------+------------+-------+-------+---------+
| Date1      | OrderDate  | Value | group | Flag    |
+------------+------------+-------+-------+---------+
| 10/10/2020 | 10/01/2020 | hostA | grp1  | Valid   |
| 10/01/2020 | 09/30/2020 | hostB | grp1  | Valid   |
| Null       | 09/15/2020 | hostC | grp1  | Valid   |
| 08/01/2020 | 08/30/2020 | hostD | grp1  | Invalid |
| Null       | 10/01/2020 | hostP | grp2  | Valid   |
| Null       | 09/28/2020 | hostQ | grp2  | Valid   |
| 07/11/2020 | 08/08/2020 | hostR | grp2  | Invalid |
| 07/01/2020 | 08/01/2020 | hostS | grp2  | Invalid |
| NULL       | 07/01/2020 | hostL | grp2  | Invalid |
| NULL       | 08/08/2020 | hostM | grp3  | Valid   |
| NULL       | 08/01/2020 | hostN | grp3  | Valid   |
| NULL       | 07/01/2020 | hostO | grp3  | Valid   |
+------------+------------+-------+-------+---------+
My approach:
I created row_number for each group after ordering by OrderDate.
After that, I get the min(row_number) having Current_date > (Date1 + 31 days) for each group and save it as a new dataframe dfMin.
I then join df1 and dfMin on group and filter on row_number (row_number < min(row_number)).
This approach works for most cases. But when all values of Date1 in a group are NULL, this approach fails.
Is there any other better approach to include the above scenario as well?
Note: I am using a pretty old version of Spark, Spark 1.5. Window functions also won't work in my environment (it's a custom framework and there are many restrictions in place). For row_number, I used the zipWithIndex method.
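One window-free shape that also covers the all-NULL groups (a sketch only, written against a newer Spark DataFrame API for readability; it assumes MM/dd/yyyy dates and the current date as the reference point — the same groupBy/agg + join structure should carry over to Spark 1.5, though the date-parsing helpers differ there):
import org.apache.spark.sql.functions._

// Parse the string dates and mark rows whose 31-day window has already elapsed.
val parsed = df1
  .withColumn("d1", to_date(col("Date1"), "MM/dd/yyyy"))
  .withColumn("od", to_date(col("OrderDate"), "MM/dd/yyyy"))
  .withColumn("expired",
    col("d1").isNotNull && current_date() > date_add(col("d1"), 31))

// Per group: the latest OrderDate among expired rows. In the descending
// OrderDate ordering, everything at or after that point becomes Invalid.
val cutoff = parsed
  .filter(col("expired"))
  .groupBy("group")
  .agg(max("od").as("cutoff_od"))

// Groups with no expired rows (e.g. all Date1 NULL) get no cutoff at all,
// so every row in them stays Valid.
val flagged = parsed
  .join(cutoff, Seq("group"), "left")
  .withColumn("Flag",
    when(col("cutoff_od").isNull || col("od") > col("cutoff_od"), "Valid")
      .otherwise("Invalid"))
  .drop("d1", "od", "expired", "cutoff_od")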

Why aren't columns with the citext datatype processed by Presto?

I'm running queries in the SQL console provided by the Presto client, connected to a Presto server running on top of Postgres. The result set of the queries contains only the columns that aren't of citext type.
DataDetails Table Description:
Table "public.datadetails"
Column | Type | Modifiers | Storage | Stats target | Description
------------------+----------+------------------------------+----------+--------------+-------------
data_sequence_id | bigint | not null | plain | |
key | citext | not null | extended | |
uploaded_by | bigint | not null | plain | |
uploaded_time | bigint | not null | plain | |
modified_by | bigint | | plain | |
modified_time | bigint | | plain | |
retrieved_by | bigint | | plain | |
retrieved_time | bigint | | plain | |
file_name | citext | not null | extended | |
file_type | citext | not null | extended | |
file_size | bigint | not null default 0::bigint | plain | |
Indexes:
"datadetails_pk1" PRIMARY KEY, btree (data_sequence_id)
"datadetails_uk0" UNIQUE CONSTRAINT, btree (key)
Check constraints:
"datadetails_file_name_c" CHECK (length(file_name::text) <= 32)
"datadetails_file_type_c" CHECK (length(file_type::text) <= 2048)
"datadetails_key_c" CHECK (length(key::text) <= 64)
Query Result in Presto-Client:
presto:public> select * from datadetails;
data_sequence_id | uploaded_by | uploaded_time | modified_by | modified_time | retrieved_by | retrieved_time | file_size |
------------------+-------------+---------------+-------------+---------------+--------------+----------------+-----------+
2000000000007 | 15062270 | 1586416286363 | 0 | 0 | 0 | 0 | 61 |
2000000000011 | 15062270 | 1586416299159 | 0 | 0 | 15062270 | 1586417517045 | 36 |
(2 rows)
Query 20200410_130419_00017_gmjgh, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [2 rows, 0B] [10 rows/s, 0B/s]
In the above result set, it is evident that the columns of citext type are missing.
Does Presto support the citext datatype, or is there any configuration needed to process citext columns with Presto?
Postgres: PostgreSQL 9.4.0-relocatable (Red Hat 4.4.7-11), 64-bit
Presto-Server: presto-server-0.230
Presto-Client: presto-cli-332

T-SQL : Pivot table without aggregate

I am trying to understand how to pivot data within T-SQL but can't seem to get it working. I have the following table structure
+-------------------+-----------------------+
| Name | Value |
+-------------------+-----------------------+
| TaskId | 12417 |
| TaskUid | XX00044497 |
| TaskDefId | 23 |
| TaskStatusId | 4 |
| Notes | |
| TaskActivityIndex | 0 |
| ModifiedBy | Orange |
| Modified | /Date(1554540200000)/ |
| CreatedBy | Apple |
| Created | /Date(2121212100000)/ |
| TaskPriorityId | 40 |
| OId | 2 |
+-------------------+-----------------------+
I want to pivot the Name column values into columns. Expected output:
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
| TASKID | TASKUID | TASKDEFID | TASKSTATUSID | NOTES | TASKACTIVITYINDEX | MODIFIEDBY | MODIFIED | CREATEDBY | CREATED | TASKPRIORITYID | OID |
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
| | | | | | | | | | | | |
| 12417 | XX00044497 | 23 | 4 | | 0 | Orange | /Date(1554540200000)/ | Apple | /Date(2121212100000)/ | 40 | 2 |
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
Is there an easy way of doing it? The columns are fixed (not dynamic).
Any help appreciated
Try this:
select * from yourtable
pivot
(
    min(value)
    for Name in ([TaskID], [TaskUID], [TaskDefID]......)
) as pivotable
You can also use case statements (conditional aggregation).
You must use an aggregate function inside the PIVOT.
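For example, the conditional-aggregation ("case statement") variant might look like the following; the column list is abridged, and MAX is just a convenient aggregate here because there is only one row per Name:
SELECT
    MAX(CASE WHEN Name = 'TaskId'       THEN Value END) AS TaskId,
    MAX(CASE WHEN Name = 'TaskUid'      THEN Value END) AS TaskUid,
    MAX(CASE WHEN Name = 'TaskDefId'    THEN Value END) AS TaskDefId,
    MAX(CASE WHEN Name = 'TaskStatusId' THEN Value END) AS TaskStatusId
    -- ...and so on for the remaining fixed columns
FROM yourtable;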
If you want to learn more, here is the reference:
https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-2017
Output (I only tried three columns):
DB<>Fiddle

Spark merge rows based on some condition and retain the values

I have a dataframe like below:
+-------+------+------+------+------+
| cond  | val  | val1 | val2 | val3 |
+-------+------+------+------+------+
| cond1 | 1    | null | null | null |
| cond1 | null | 2    | null | null |
| cond1 | null | null | 3    | null |
| cond1 | null | null | null | 4    |
| cond2 | null | null | null | 44   |
| cond2 | null | 22   | null | null |
| cond2 | null | null | 33   | null |
| cond2 | 11   | null | null | null |
| cond3 | null | null | null | 444  |
| cond3 | 111  | 222  | null | null |
| cond3 | 1111 | null | null | null |
| cond3 | null | null | 333  | null |
+-------+------+------+------+------+
I want to merge the rows based on the value of the cond column, so that the resulting dataframe looks like below:
+-------+----------+------+------+------+
| cond  | val      | val1 | val2 | val3 |
+-------+----------+------+------+------+
| cond1 | 1        | 2    | 3    | 4    |
| cond2 | 11       | 22   | 33   | 44   |
| cond3 | 111,1111 | 222  | 333  | 444  |
+-------+----------+------+------+------+
Try using .groupBy() with a single .agg() call (chaining .agg() re-aggregates the already-grouped result); if you want the comma-separated strings shown in the expected output rather than arrays, wrap each collect_list in concat_ws(",", ...). E.g.:
import org.apache.spark.sql.functions.collect_list

val output = input.groupBy("cond")
  .agg(
    collect_list("val").name("val"),
    collect_list("val1").name("val1"),
    collect_list("val2").name("val2"),
    collect_list("val3").name("val3")
  )