Postgres - Add new column to existing table - postgresql

I want to alter a table and add a new column, but I also want to set the new column's Storage mode.
I tried the following and I get an error. Any idea how to fix this?
ALTER TABLE main_workflowjobtemplate
ADD COLUMN "ask_credential_on_launch" BOOLEAN NOT NULL STORAGE plain;
ERROR: syntax error at or near "STORAGE"
LINE 2: ...OLUMN "ask_credential_on_launch" BOOLEAN NOT NULL STORAGE pl...
Here is the table schema.
awx=# \d+ main_workflowjobtemplate;
Table "public.main_workflowjobtemplate"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------------------------+-----------------------+-----------+----------+---------+----------+--------------+-------------
unifiedjobtemplate_ptr_id | integer | | not null | | plain | |
extra_vars | text | | not null | | extended | |
admin_role_id | integer | | | | plain | |
execute_role_id | integer | | | | plain | |
read_role_id | integer | | | | plain | |
survey_enabled | boolean | | not null | | plain | |
survey_spec | text | | not null | | extended | |
allow_simultaneous | boolean | | not null | | plain | |
ask_variables_on_launch | boolean | | not null | | plain | |
ask_inventory_on_launch | boolean | | not null | | plain | |
inventory_id | integer | | | | plain | |
approval_role_id | integer | | | | plain | |
ask_limit_on_launch | boolean | | not null | | plain | |
ask_scm_branch_on_launch | boolean | | not null | | plain | |
char_prompts | text | | not null | | extended | |
webhook_credential_id | integer | | | | plain | |
webhook_key | character varying(64) | | not null | | extended | |
webhook_service | character varying(16) | | not null | | extended | |
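In case it helps, a sketch of one way around the error, assuming a PostgreSQL version whose ADD COLUMN does not accept a STORAGE clause: add the column first, then set its storage mode with a separate ALTER COLUMN ... SET STORAGE statement. The DEFAULT below is an assumption so the NOT NULL constraint can be applied to a table that already has rows; also note that a fixed-length type such as boolean only allows PLAIN storage, so it is plain by default anyway.
ALTER TABLE main_workflowjobtemplate
    ADD COLUMN "ask_credential_on_launch" BOOLEAN NOT NULL DEFAULT FALSE;  -- DEFAULT assumed so existing rows satisfy NOT NULL

ALTER TABLE main_workflowjobtemplate
    ALTER COLUMN "ask_credential_on_launch" SET STORAGE PLAIN;             -- PLAIN is the only storage mode for boolean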

Related

How to add a validation on delta table column dynamically?

I'm working on a transformation and am stuck on a common problem. Any help is appreciated.
Scenario:
Step-1: Reading from a delta table.
+--------+------------------+
| emp_id | str              |
+--------+------------------+
| 1      | name=qwerty      |
| 2      | age=22           |
| 3      | job=googling     |
| 4      | dob=12-Jan-2001  |
| 5      | weight=62.7      |
+--------+------------------+
Step-2: I'm refining the data and outputting it into another delta table dynamically (No predefined schema). Let's say I'm adding null if the column name is not found.
+--------+--------+------+----------+-------------+--------+
| emp_id | name | age | job | dob | weight |
+--------+--------+------+----------+-------------+--------+
| 1 | qwerty | null | null | null | null |
| 2 | null | 22 | null | null | null |
| 3 | null | null | googling | null | null |
| 4 | null | null | null | 12-Jan-2001 | null |
| 5 | null | null | null | null | 62.7 |
+--------+--------+------+----------+-------------+--------+
Is there a way to apply validation in step-2 based on the column name? I'm splitting on = while deriving the above table. Or do I have to do the validation in step-3 while working on the new df?
Second question: Is there a way to achieve the following table? (A sketch follows the table below.)
+--------+--------+------+----------+-------------+--------+---------------------+
| emp_id | name | age | job | dob | weight | missing_attributes |
+--------+--------+------+----------+-------------+--------+---------------------+
| 1 | qwerty | null | null | null | null | age,job,dob,weight |
| 2 | null | 22 | null | null | null | name,job,dob,weight |
| 3 | null | null | googling | null | null | name,age,dob,weight |
| 4 | null | null | null | 12-Jan-2001 | null | name,age,job,weight |
| 5 | null | null | null | null | 62.7 | name,age,job,dob |
+--------+--------+------+----------+-------------+--------+---------------------+
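For the second question, a minimal sketch assuming Spark with Scala and a hypothetical refinedDF holding the step-2 output: concat_ws skips nulls, so emitting each column name only when its value is null and concatenating the survivors yields missing_attributes. The same when(...) pattern can carry per-column validation inside step-2.
import org.apache.spark.sql.functions._

// Hypothetical list of attribute columns produced by the dynamic pivot in step-2.
val attrCols = Seq("name", "age", "job", "dob", "weight")

// Emit the column name when its value is null (otherwise null), then
// concatenate the surviving names; concat_ws silently drops nulls.
val withMissing = refinedDF.withColumn(
  "missing_attributes",
  concat_ws(",", attrCols.map(c => when(col(c).isNull, lit(c))): _*)
)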

Union two DataFrame using spark 2.x with different schema/dataTypes

I'm trying to merge multiple Hive tables using Spark, where some columns with the same name have different data types, especially string and bigint.
My final table (hiveDF) should have a schema like below:
+--------------------------+------------+----------+--+
| col_name | data_type | comment |
+--------------------------+------------+----------+--+
| announcementtype | bigint | |
| approvalstatus | string | |
| capitalrate | double | |
| cash | double | |
| cashinlieuprice | double | |
| costfactor | double | |
| createdby | string | |
| createddate | string | |
| currencycode | string | |
| declarationdate | string | |
| declarationtype | bigint | |
| divfeerate | double | |
| divonlyrate | double | |
| dividendtype | string | |
| dividendtypeid | bigint | |
| editedby | string | |
| editeddate | string | |
| exdate | string | |
| filerecordid | string | |
| frequency | string | |
| grossdivrate | double | |
| id | bigint | |
| indicatedannualdividend | string | |
| longtermrate | double | |
| netdivrate | double | |
| newname | string | |
| newsymbol | string | |
| note | string | |
| oldname | string | |
| oldsymbol | string | |
| paydate | string | |
| productid | bigint | |
| qualifiedratedollar | double | |
| qualifiedratepercent | double | |
| recorddate | string | |
| sharefactor | double | |
| shorttermrate | double | |
| specialdivrate | double | |
| splitfactor | double | |
| taxstatuscodeid | bigint | |
| lastmodifieddate | timestamp | |
| active_status | boolean | |
+--------------------------+------------+----------+--+
This final table (hiveDF) schema can be built from JSON like the below:
{
"id": -2147483647,
"productId": 150816,
"dividendTypeId": 2,
"dividendType": "Dividend/Capital Gain",
"payDate": null,
"exDate": "2009-03-25",
"oldSymbol": "ILAAX",
"newSymbol": "ILAAX",
"oldName": "",
"newName": "",
"grossDivRate": 0.115,
"shortTermRate": 0,
"longTermRate": 0,
"splitFactor": 0,
"shareFactor": 0,
"costFactor": 0,
"cashInLieuPrice": 0,
"cash": 0,
"note": "0",
"createdBy": "Yahoo",
"createdDate": "2009-08-03T06:44:19.677-05:00",
"editedBy": "Yahoo",
"editedDate": "2009-08-03T06:44:19.677-05:00",
"netDivRate": null,
"divFeeRate": null,
"specialDivRate": null,
"approvalStatus": null,
"capitalRate": null,
"qualifiedRateDollar": null,
"qualifiedRatePercent": null,
"declarationDate": null,
"declarationType": null,
"currencyCode": null,
"taxStatusCodeId": null,
"announcementType": null,
"frequency": null,
"recordDate": null,
"divOnlyRate": 0.115,
"fileRecordID": null,
"indicatedAnnualDividend": null
}
I am doing something like the below:
var hiveDF = spark.sqlContext.sql("select * from final_destination_tableName")
var newDataDF = spark.sqlContext.sql("select * from incremental_table_1 where id > 866000")
My incremental table (newDataDF) has some columns with different data types. I have around 10 incremental tables, and a column that is bigint in one of them may be a string in another, so I can't be sure how to typecast. Typecasting may be easy, but I'm not sure which type to cast to since there are multiple tables. I'm looking for an approach that works without typecasting.
For example, an incremental table looks something like the below:
+--------------------------+------------+----------+--+
| col_name | data_type | comment |
+--------------------------+------------+----------+--+
| announcementtype | string | |
| approvalstatus | string | |
| capitalrate | string | |
| cash | double | |
| cashinlieuprice | double | |
| costfactor | double | |
| createdby | string | |
| createddate | string | |
| currencycode | string | |
| declarationdate | string | |
| declarationtype | string | |
| divfeerate | string | |
| divonlyrate | double | |
| dividendtype | string | |
| dividendtypeid | bigint | |
| editedby | string | |
| editeddate | string | |
| exdate | string | |
| filerecordid | string | |
| frequency | string | |
| grossdivrate | double | |
| id | bigint | |
| indicatedannualdividend | string | |
| longtermrate | double | |
| netdivrate | string | |
| newname | string | |
| newsymbol | string | |
| note | string | |
| oldname | string | |
| oldsymbol | string | |
| paydate | string | |
| productid | bigint | |
| qualifiedratedollar | string | |
| qualifiedratepercent | string | |
| recorddate | string | |
| sharefactor | double | |
| shorttermrate | double | |
| specialdivrate | string | |
| splitfactor | double | |
| taxstatuscodeid | string | |
| lastmodifieddate | timestamp | |
| active_status | boolean | |
+--------------------------+------------+----------+--+
I'm doing the union of the tables something like the below:
var combinedDF = hiveDF.unionAll(newDataDF)
but no luck. I tried to impose the final schema as below, but again no luck:
val rows = newDataDF.rdd
val newDataDF2 = spark.sqlContext.createDataFrame(rows, hiveDF.schema)
var combinedDF = hiveDF.unionAll(newDataDF2)
combinedDF.coalesce(1).write.mode(SaveMode.Overwrite).option("orc.compress", "snappy").orc("/apps/hive/warehouse/" + database + "/" + tableLower + "_temp")
As per this, I tried the below:
var combinedDF = sparkSession.read.json(hiveDF.toJSON.union(newDataDF.toJSON).rdd)
Finally, I am trying to write into the table as above, but no luck. Please help.
I also faced this situation while merging an incremental table with an existing table. There are generally two cases to handle.
1. Incremental data with extra columns:
This can be solved by the normal merging process you are trying here.
2. Incremental data with the same column names but a different schema:
This is the tricky one. One easy solution is to convert both datasets with toJSON and union them:
hiveDF.toJSON.union(newDataDF.toJSON). This, however, triggers JSON schema merging and will change the existing schema. For example, if the column is a:Long in the table and a:String in the incremental table, the final schema after merging will be a:String. There is no way to avoid this if you want to do a JSON union.
The alternative is a strict schema check on the incremental data: test whether the incremental table has the same schema as the Hive table, and if the schemas differ, don't merge.
This is, however, a little too stringent, since for real-time data it is pretty hard to enforce a schema.
So the way I solved this is to run a separate enrichment process before merging. The process checks the schema, and if an incoming column can be upgraded/downgraded to the current Hive table schema, it does so (see the sketch below).
Essentially it iterates over the incoming delta and converts each row to the correct schema. This is a little expensive but gives a very good guarantee of data correctness. If the process fails to convert a row, I sideline the row and raise an alarm so the data can be validated manually for bugs in the upstream system generating it.
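A minimal sketch of that enrichment step, assuming Spark 2.x with Scala (alignToSchema is a hypothetical helper, not the answer's original code): select the incremental columns in the order of the Hive schema, casting each to the target type and filling absent columns with typed nulls.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

// Cast and reorder the incremental DataFrame so it matches the target schema.
def alignToSchema(df: DataFrame, target: StructType): DataFrame = {
  val existing = df.columns.map(_.toLowerCase).toSet
  val cols = target.fields.map { f =>
    if (existing.contains(f.name.toLowerCase)) col(f.name).cast(f.dataType).as(f.name)
    else lit(null).cast(f.dataType).as(f.name)  // column absent in this increment
  }
  df.select(cols: _*)
}

val combinedDF = hiveDF.union(alignToSchema(newDataDF, hiveDF.schema))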
This is the code I use to validate whether the two schemas are mergeable or not.
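The original snippet is not reproduced here; the following is a minimal sketch of such a check under the assumption that only a string/bigint mismatch should be tolerated (isMergeable is a hypothetical name, not the answer's original code):
import org.apache.spark.sql.types._

// True when every target column is present in the incoming schema with the
// same type, or with a type we are prepared to cast (string <-> bigint here).
def isMergeable(target: StructType, incoming: StructType): Boolean = {
  val incomingTypes = incoming.fields.map(f => f.name.toLowerCase -> f.dataType).toMap
  target.fields.forall { tf =>
    incomingTypes.get(tf.name.toLowerCase) match {
      case Some(dt) if dt == tf.dataType               => true
      case Some(StringType) if tf.dataType == LongType => true  // castable during enrichment
      case Some(LongType) if tf.dataType == StringType => true
      case _                                           => false
    }
  }
}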

Spark Dataframe Union giving duplicates

I have a base dataset, and one of its columns has both null and non-null values.
so I do:
val nonTrained_ds = base_ds.filter(col("col_name").isNull)
val trained_ds = base_ds.filter(col("col_name").isNotNull)
When I print those out, I get a clean separation of rows. But when I do,
val combined_ds = nonTrained_ds.union(trained_ds)
I get duplicate rows from nonTrained_ds, and the strange thing is that the rows from trained_ds are no longer in the combined dataset.
Why does this happen?
The values of trained_ds are:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700001|16 |
|0456700004|16 |
|0456700007|16 |
|0456700010|16 |
|0456700013|16 |
|0456700016|16 |
|0456700019|16 |
|0456700022|16 |
|0456700025|16 |
|0456700028|16 |
|0456700031|16 |
|0456700034|16 |
|0456700037|16 |
|0456700040|16 |
|0456700043|16 |
|0456700046|16 |
|0456700049|16 |
|0456700052|16 |
|0456700055|16 |
|0456700058|16 |
|0456700061|16 |
|0456700064|16 |
|0456700067|16 |
|0456700070|16 |
+----------+----------------+
The values of nonTrained_ds are:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700002|null |
|0456700003|null |
|0456700005|null |
|0456700006|null |
|0456700008|null |
|0456700009|null |
|0456700011|null |
|0456700012|null |
|0456700014|null |
|0456700015|null |
|0456700017|null |
|0456700018|null |
|0456700020|null |
|0456700021|null |
|0456700023|null |
|0456700024|null |
|0456700026|null |
|0456700027|null |
|0456700029|null |
|0456700030|null |
|0456700032|null |
|0456700033|null |
|0456700035|null |
|0456700036|null |
|0456700038|null |
|0456700039|null |
|0456700041|null |
|0456700042|null |
|0456700044|null |
|0456700045|null |
|0456700047|null |
|0456700048|null |
|0456700050|null |
|0456700051|null |
|0456700053|null |
|0456700054|null |
|0456700056|null |
|0456700057|null |
|0456700059|null |
|0456700060|null |
|0456700062|null |
|0456700063|null |
|0456700065|null |
|0456700066|null |
|0456700068|null |
|0456700069|null |
|0456700071|null |
|0456700072|null |
+----------+----------------+
The values of the combined dataset are:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700002|null |
|0456700003|null |
|0456700005|null |
|0456700006|null |
|0456700008|null |
|0456700009|null |
|0456700011|null |
|0456700012|null |
|0456700014|null |
|0456700015|null |
|0456700017|null |
|0456700018|null |
|0456700020|null |
|0456700021|null |
|0456700023|null |
|0456700024|null |
|0456700026|null |
|0456700027|null |
|0456700029|null |
|0456700030|null |
|0456700032|null |
|0456700033|null |
|0456700035|null |
|0456700036|null |
|0456700038|null |
|0456700039|null |
|0456700041|null |
|0456700042|null |
|0456700044|null |
|0456700045|null |
|0456700047|null |
|0456700048|null |
|0456700050|null |
|0456700051|null |
|0456700053|null |
|0456700054|null |
|0456700056|null |
|0456700057|null |
|0456700059|null |
|0456700060|null |
|0456700062|null |
|0456700063|null |
|0456700065|null |
|0456700066|null |
|0456700068|null |
|0456700069|null |
|0456700071|null |
|0456700072|null |
|0456700002|16 |
|0456700005|16 |
|0456700008|16 |
|0456700011|16 |
|0456700014|16 |
|0456700017|16 |
|0456700020|16 |
|0456700023|16 |
|0456700026|16 |
|0456700029|16 |
|0456700032|16 |
|0456700035|16 |
|0456700038|16 |
|0456700041|16 |
|0456700044|16 |
|0456700047|16 |
|0456700050|16 |
|0456700053|16 |
|0456700056|16 |
|0456700059|16 |
|0456700062|16 |
|0456700065|16 |
|0456700068|16 |
|0456700071|16 |
+----------+----------------+
This did the trick:
val nonTrained_ds = base_ds.filter(col("primary_offer_id").isNull).distinct()
val trained_ds = base_ds.filter(col("primary_offer_id").isNotNull).distinct()
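If the duplicates come from base_ds being lazily re-evaluated for each filter, which is one possible explanation not confirmed in the question, materialising it once before splitting is another option; a sketch assuming the same column names:
import org.apache.spark.sql.functions.col

// Cache the base dataset so both filters and the union see a single snapshot.
val stable_base_ds = base_ds.cache()
stable_base_ds.count()  // force materialisation

val nonTrained_ds = stable_base_ds.filter(col("primary_offer_id").isNull)
val trained_ds    = stable_base_ds.filter(col("primary_offer_id").isNotNull)
val combined_ds   = nonTrained_ds.union(trained_ds)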

Redshift: tables info query not working via spark

I am trying to run this query from Spark code on Databricks:
select * from svv_table_info
but I am getting this error message:
Exception in thread "main" java.sql.SQLException: Amazon Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
Any idea why I am getting this?
That view returns table_id, which has the Postgres system type OID.
psql=# \d+ svv_table_info
Column | Type | Modifiers | Storage | Description
---------------+---------------+-----------+----------+-------------
database | text | | extended |
schema | text | | extended |
table_id | oid | | plain |
table | text | | extended |
encoded | text | | extended |
diststyle | text | | extended |
sortkey1 | text | | extended |
max_varchar | integer | | plain |
sortkey1_enc | character(32) | | extended |
sortkey_num | integer | | plain |
size | bigint | | plain |
pct_used | numeric(10,4) | | main |
empty | bigint | | plain |
unsorted | numeric(5,2) | | main |
stats_off | numeric(5,2) | | main |
tbl_rows | numeric(38,0) | | main |
skew_sortkey1 | numeric(19,2) | | main |
skew_rows | numeric(19,2) | | main |
You can cast it to INTEGER and Spark should be able to handle it.
SELECT database,schema,table_id::INT
,"table",encoded,diststyle,sortkey1
,max_varchar,sortkey1_enc,sortkey_num
,size,pct_used,empty,unsorted,stats_off
,tbl_rows,skew_sortkey1,skew_rows
FROM svv_table_info;
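For reference, here is a sketch of how the casted query might be issued from Databricks through the Redshift connector; the format name, option names, jdbcUrl, and the S3 tempdir are assumptions to adapt to your setup:
// Assumed connector options; jdbcUrl and the tempdir bucket are placeholders.
val tableInfo = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("query",
    """SELECT database, schema, table_id::INT AS table_id, "table", encoded,
      |       diststyle, sortkey1, max_varchar, sortkey1_enc, sortkey_num,
      |       size, pct_used, empty, unsorted, stats_off, tbl_rows,
      |       skew_sortkey1, skew_rows
      |FROM svv_table_info""".stripMargin)
  .option("tempdir", "s3a://my-bucket/tmp/")
  .option("forward_spark_s3_credentials", "true")
  .load()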

Error in Insert query: syntax error at or near ","

My insert query is:
insert into app_library_reports
(app_id,adp_id,reportname,description,searchstr,command,templatename,usereporttemplate,reporttype,sentbothfiles,useprevioustime,usescheduler,cronstr,option,displaysettings,isanalyticsreport,report_columns,chart_config)
values
(25,18,"Report_Barracuda_SpamDomain_summary","Report On Domains Sending Spam Emails","tl_tag:Barracuda_spam AND action:2","BarracudaSpam/Report_Barracuda_SpamDomain_summary.py",,,,,,,,,,,,);
Schema for the table 'app_library_reports' is:
Table "public.app_library_reports"
Column | Type | Modifiers | Storage | Stats target | Description
-------------------+---------+------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('app_library_reports_id_seq'::regclass) | plain | |
app_id | integer | | plain | |
adp_id | integer | | plain | |
reportname | text | | extended | |
description | text | | extended | |
searchstr | text | | extended | |
command | text | | extended | |
templatename | text | | extended | |
usereporttemplate | boolean | | plain | |
reporttype | text | | extended | |
sentbothfiles | text | | extended | |
useprevioustime | text | | extended | |
usescheduler | text | | extended | |
cronstr | text | | extended | |
option | text | | extended | |
displaysettings | text | | extended | |
isanalyticsreport | boolean | | plain | |
report_columns | json | | extended | |
chart_config | json | | extended | |
Indexes:
"app_library_reports_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"app_library_reports_adp_id_fkey" FOREIGN KEY (adp_id) REFERENCES app_library_adapter(id)
"app_library_reports_app_id_fkey" FOREIGN KEY (app_id) REFERENCES app_library_definition(id)
When I execute the insert query, it gives the error: ERROR: syntax error at or near ","
Please help me find this error. Thank you.
I'm fairly certain your immediate error comes from the run of bare commas (i.e. ,,,,,,,) at the end of the INSERT. If you don't want to specify a value for a particular column, you can pass NULL for it. But in your case, since you only supply values for the first 6 columns, another way is to just list those 6 column names when you insert (note also that string literals in PostgreSQL use single quotes; double quotes denote identifiers):
INSERT INTO app_library_reports
(app_id, adp_id, reportname, description, searchstr, command)
VALUES
(25, 18, 'Report_Barracuda_SpamDomain_summary',
'Report On Domains Sending Spam Emails', 'tl_tag:Barracuda_spam AND action:2',
'BarracudaSpam/Report_Barracuda_SpamDomain_summary.py')
This insert will only work if the columns not specified accept NULL or have a default. If some of the other columns are not nullable and have no default, then you would have to pass in values for them.