Is there any way we can compare two different tables based on a specific column using Spring Batch?
For example:
Table 1 columns:
cust_id, cust_name, flag
Table 2 columns:
cust_id, cust_name, flag
My requirement is this:
whenever a record is available in Table 1, I will insert the same record into Table 2;
when a record is deleted from Table 1, I will mark the flag in Table 2 as 'N'.
Why do you need Spring Batch here? It creates the overhead of extra DB I/O reads and writes. Create a trigger in the database so that when an insert/delete is performed on Table 1, the trigger fires and performs the corresponding action on Table 2, as sketched below.
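For illustration, here is a minimal sketch of that trigger idea. The table and column names (table1, table2, cust_id, cust_name, flag) follow the question; the syntax shown is MySQL-style and will need adjusting for your actual database, so treat it as an assumption rather than a drop-in script.

-- Sketch only: MySQL-style trigger syntax; adjust for your database.
-- Copy each new Table 1 row into Table 2 with flag = 'Y'.
CREATE TRIGGER trg_table1_insert
AFTER INSERT ON table1
FOR EACH ROW
    INSERT INTO table2 (cust_id, cust_name, flag)
    VALUES (NEW.cust_id, NEW.cust_name, 'Y');

-- Mark the matching Table 2 row with flag = 'N' when a Table 1 row is deleted.
CREATE TRIGGER trg_table1_delete
AFTER DELETE ON table1
FOR EACH ROW
    UPDATE table2
    SET flag = 'N'
    WHERE cust_id = OLD.cust_id;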
I have an ADF data flow that will only insert. It never updates rows.
Below is a screenshot of the flow, and of the Alter Row task that sets the insert/update policies.
[Screenshot: data flow]
[Screenshot: alter row task]
There is a source table and a destination table.
There is a source table for new data.
A lookup is done against the key of the destination table.
Two columns are then generated: a hash of the source data and a hash of the destination data.
In the Alter Row task, the policies are as follows:
Insert: if the lookup found no matching id.
Update: if lookup found a matching id and the checksums do not match (i.e. user exists but data is different between the source and existing record).
Otherwise it should do nothing.
The Sink allows inserts and updates.
Even so, on first run it inserts all records but on second run it inserts all the records again, even if they exist.
I think I am misunderstanding the process, so I would appreciate any expertise or advice.
Thank you, Joel Cochran, for your valuable input. I reproduced the scenario and am posting it as an answer to help other community members.
If you are using the upsert method in the sink, add an Alter Row transformation with an "Upsert if" condition and write the expression for the upsert condition.
If you are using insert and update as the update methods in the sink, then in the Alter Row transformation use both "Insert if" and "Update if" conditions, so that data is inserted or updated in the sink according to those Alter Row conditions.
Does anyone know the best way to load delta or CDC data, using any tools?
I have a big table with billions of records and want to update or insert, like MERGE in SQL Server or Oracle, but in Amazon Redshift / S3.
We also have a lot of columns, so comparing all of the columns is not practical.
For example, TableA has columns Col1, Col2, Col3, ... and already contains records.
So when inserting new records, I need to check whether each record already exists: if it does and is unchanged, skip it; if it does not exist, insert it; and if it exists but has changed, update it.
I do have key id and date columns, but since the table has 200+ columns it is not easy to check all of them, and doing so takes a lot of time.
Many thanks in advance
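For reference, one common pattern for this in Redshift is a staging-table merge. A minimal sketch follows, assuming a hypothetical staging table stage_tablea with the same layout as TableA, a key column id, and a few data columns. Comparing an md5 hash of the concatenated data columns avoids comparing all 200+ columns one by one (in real use, cast non-text columns to varchar and wrap nullable ones in NVL).

-- Sketch only: stage_tablea, id, col1..col3 are assumed names, not from the question.
BEGIN;

-- Remove target rows whose staged version differs (compare one hash, not 200+ columns).
DELETE FROM tablea
USING stage_tablea s
WHERE tablea.id = s.id
  AND md5(tablea.col1 || '|' || tablea.col2 || '|' || tablea.col3)
   <> md5(s.col1 || '|' || s.col2 || '|' || s.col3);

-- Insert staged rows that are new or were just removed above.
INSERT INTO tablea
SELECT s.*
FROM stage_tablea s
LEFT JOIN tablea t ON t.id = s.id
WHERE t.id IS NULL;

COMMIT;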
I'm trying to upsert/delete some of the values in a DB2 database source table, which is an existing table on DB2. Is this possible using PySpark / Spark SQL / DataFrames?
There is no direct way to update/delete rows in a relational database from a PySpark job, but there are workarounds.
(1) You can create an identical empty table (a secondary table) in the relational database, insert data into the secondary table from the PySpark job, and write a DML trigger on the secondary table that performs the desired DML operation on your primary table (see the sketch after this list).
(2) You can create a DataFrame (e.g. a) in Spark that is a copy of your existing relational table, merge that existing-table DataFrame with the current DataFrame (e.g. b) to produce a new DataFrame (e.g. c) holding the latest changes, then truncate the relational database table and reload it from the latest-changes DataFrame (c).
These are just workarounds and not an optimal solution for huge amounts of data.
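As a minimal sketch of the trigger in option (1), assuming hypothetical tables primary_tab and secondary_tab with columns id and val. The names and the exact syntax are assumptions; DB2 trigger syntax varies by version, and you would run this with an alternate statement terminator such as @ in the CLP.

-- Sketch only: primary_tab, secondary_tab, id and val are hypothetical names.
-- Each row the PySpark job writes into secondary_tab is applied to primary_tab as an upsert.
CREATE TRIGGER apply_to_primary
AFTER INSERT ON secondary_tab
REFERENCING NEW AS n
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
    -- Update the primary row if it already exists ...
    UPDATE primary_tab SET val = n.val WHERE id = n.id;
    -- ... and insert it if it does not.
    INSERT INTO primary_tab (id, val)
    SELECT n.id, n.val FROM SYSIBM.SYSDUMMY1
    WHERE NOT EXISTS (SELECT 1 FROM primary_tab WHERE id = n.id);
END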
I want to partition an external table in Hive based on ranges of numbers. Say numbers from 1 to 100 go to one partition. Is it possible to do this in Hive?
I am assuming here that you have a table with some records from which you want to load data into an external table that is partitioned by some field, say RANGEOFNUMS.
Now, suppose we have a table called testtable with columns name and value. The contents are like:
India,1
India,2
India,3
India,3
India,4
India,10
India,11
India,12
India,13
India,14
Now, suppose we have an external table called testext with some columns along with a partition column, say RANGEOFNUMS.
Now you can do one thing:
insert into table testext partition(rangeofnums="your value")
select * from testtable where value>=1 and value<=5;
This way, all records from testtable having value 1 to 5 will go into one partition of the external table.
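If you would rather not run one insert per range, a variant (a sketch only, assuming testext has exactly the columns name and value plus the partition column, and a bucket width of 5) is to derive the range label in the query and let Hive's dynamic partitioning create the partitions for you:

-- Sketch: derive a range bucket such as '1-5' or '6-10' from value and partition on it dynamically.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert into table testext partition(rangeofnums)
select
    name,
    value,
    concat(cast(floor((value - 1) / 5) * 5 + 1 as string), '-',
           cast((floor((value - 1) / 5) + 1) * 5 as string)) as rangeofnums
from testtable;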
The scenario is my assumption only. Please comment if this is not the scenario you have.
Achyut
I am running Hive 0.7.1, processing existing data which has the following directory layout:
-TableName
- d= (e.g. 2011-08-01)
- d=2011-08-02
- d=2011-08-03
... etc
Under each date directory I have the data files.
Now, to load the data I'm using:
CREATE EXTERNAL TABLE table_name (i int)
PARTITIONED BY (date String)
LOCATION '${hiveconf:basepath}/TableName';
I would like my Hive script to be able to load the relevant partitions according to some input date and a number of days. So if I pass date='2011-08-03' and days='7',
the script should load the following partitions:
- d=2011-08-03
- d=2011-08-04
- d=2011-08-05
- d=2011-08-06
- d=2011-08-07
- d=2011-08-08
- d=2011-08-09
I haven't found any decent way to do it except explicitly running:
ALTER TABLE table_name ADD PARTITION (d='2011-08-03');
ALTER TABLE table_name ADD PARTITION (d='2011-08-04');
ALTER TABLE table_name ADD PARTITION (d='2011-08-05');
ALTER TABLE table_name ADD PARTITION (d='2011-08-06');
ALTER TABLE table_name ADD PARTITION (d='2011-08-07');
ALTER TABLE table_name ADD PARTITION (d='2011-08-08');
ALTER TABLE table_name ADD PARTITION (d='2011-08-09');
and then running my query
select count(1) from table_name;
However, this is of course not automated according to the date and days inputs.
Is there any way I can define the external table to load partitions according to a date range, or date arithmetic?
I have a very similar issue where, after a migration, I have to recreate a table for which I have the data, but not the metadata. The solution seems to be, after recreating the table:
MSCK REPAIR TABLE table_name;
Explained here
This also mentions the ALTER TABLE X RECOVER PARTITIONS statement that the OP mentioned in a comment on his own post. MSCK REPAIR TABLE table_name; works on non-Amazon-EMR implementations (Cloudera in my case).
The partitions are a physical segmenting of the data, where each partition is maintained by the directory system and the queries use the metadata to determine where the partition is located. So if you can make the directory structure match the query, it should find the data you want. For example:
select count(*) from table_name where (d >= '2011-08-03') and (d <= '2011-08-09');
But I do not know of any date-range operations otherwise; you'll have to do the math to create the query pattern first.
You can also create external tables, and add partitions to them that define the location, as in the sketch below.
This allows you to shred the data as you like and still use the partition scheme to optimize the queries.
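For illustration, a minimal sketch of registering an existing date directory as a partition with an explicit location, reusing the table name and base path from the question:

-- Register an existing date directory as a partition of the external table.
ALTER TABLE table_name ADD PARTITION (d='2011-08-03')
LOCATION '${hiveconf:basepath}/TableName/d=2011-08-03';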
I do not believe there is any built-in functionality for this in Hive. You may be able to write a plugin; see Creating custom UDFs.
I probably do not need to mention this, but have you considered a simple bash script that would take your parameters and pipe the commands to hive?
Oozie workflows would be another option; however, that might be overkill. See the Oozie Hive Extension. After some thinking, I don't think Oozie would work for this.
I have explained a similar scenario in my blog post:
1) You need to set properties:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
2) Create an external staging table and load the input files' data into this table.
3) Create a main production external table "production_order" with the date field as one of the partition columns.
4) Load the production table from the staging table so that the data is distributed into partitions automatically, as in the sketch below.
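A minimal sketch of steps 2-4 (production_order and the dynamic-partition settings come from the answer; the staging table staging_order, its columns, and the locations are assumptions):

-- Step 1: enable dynamic partitioning (from the answer above).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Steps 2-3: an unpartitioned staging table and a production table partitioned by order_date.
-- staging_order, order_id, amount, order_date and the locations are assumed names.
CREATE EXTERNAL TABLE IF NOT EXISTS staging_order (
    order_id   INT,
    amount     DOUBLE,
    order_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging_order';

CREATE EXTERNAL TABLE IF NOT EXISTS production_order (
    order_id INT,
    amount   DOUBLE
)
PARTITIONED BY (order_date STRING)
LOCATION '/data/production_order';

-- Step 4: the partition column goes last in the SELECT; Hive creates the partitions automatically.
INSERT OVERWRITE TABLE production_order PARTITION (order_date)
SELECT order_id, amount, order_date
FROM staging_order;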
I explained a similar concept in the blog post below, if you want to see the full code:
http://exploredatascience.blogspot.in/2014/06/dynamic-partitioning-with-hive.html