How to perform conditional de-duplication in Talend

I have a table with the employee ID, name, and last effective date. I want to keep only the employee ID with max(last effective date) and discard the other duplicate employee id rows that have an earlier last effective date.
I am trying to implement this using tAggregate and tFilterRow. I attempted to perform a count using tAggregate, but it does not capture max(lasteffectivedate).

With a tAggregateRow you can do:
If you want to get the ID too you need to reuse your primary flow in tMap (Main branch) and put the flow with the tAggregateRow in the lookup of the tMap.
After that you do a join on name AND date.
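The aggregate-then-join idea can be sketched outside Talend in plain Python. This is only an illustration of the logic, not a Talend job: the sample rows, the function name `latest_per_name`, and grouping by name are assumptions.

```python
from datetime import date

# Hypothetical sample rows: (employee_id, name, last_effective_date)
rows = [
    (1, "Ann", date(2022, 1, 1)),
    (1, "Ann", date(2022, 6, 1)),
    (2, "Bob", date(2021, 3, 15)),
    (2, "Bob", date(2020, 3, 15)),
]

def latest_per_name(rows):
    # Step 1 (the role of tAggregateRow): max effective date per name.
    max_date = {}
    for _id, name, eff in rows:
        if name not in max_date or eff > max_date[name]:
            max_date[name] = eff
    # Step 2 (the role of the tMap join): keep main-flow rows whose
    # (name, date) pair matches the aggregated lookup.
    return [r for r in rows if max_date[r[1]] == r[2]]

deduped = latest_per_name(rows)
```

If two rows for the same name share the same maximum date, both survive the join, so a further de-duplication step would be needed in that case.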

Related

DynamoDB Model Chronological Shipment Update Data

I recently just started learning about DynamoDB single table design. Now, I am trying to model Shipment Update data that has the following properties:
an account has multiple users
an account has multiple shipments
a shipment can change eta multiple times
each time there's a shipment update, a new record will be inserted
Access patterns:
get all shipments of an account displaying the last updated status ordered by eta in an ascending order
for a given shipment, get the chronological updates
I am having difficulty resolving the 2 access patterns mentioned above. If I only had 1 record per shipment, I could just update the sort key for the shipment update items to be shpm#55abc, and retrieving all shipments for a given account by eta would be straightforward via the gsi accountEta.
How do I resolve this to get the access patterns I need? Should I consider having a separate table for the shipment update audit, i.e. to store just the shipment updates? So that when I need access pattern #2, then I query this audit table by the shipment id to get all the chronological updates. But, I feel like this defeats the purpose of the single table design.
A single-table design is a good fit for these access patterns. Use overloadable, generic key names like PK and SK. Here is one approach*:
Shipments have a "current" record. Add a global secondary index (GSI1) to create an alternate Primary Key for querying by account in ETA order (pattern #1). All changes to the shipment are executed as updates to this "current" record.
# shipment "current" record
PK           SK         GSI1PK       GSI1SK
shpmt#55abc  x_current  account#123  x_eta#2022-07-01
Next, enable DynamoDB Streams on the table to capture shipment changes. Each time a "current" record is updated, the Lambda backing the Stream writes the OLD_IMAGE to the table as a change control record. This enables pattern #2 by shipment and account.
# shipment update record
PK           SK                               GSI1PK       GSI1SK
shpmt#55abc  update#2022-06-28T06:10:33.247Z  account#123  update#2022-06-28T06:10:33.247Z
One virtue of this approach is that a single query operation can retrieve both the current shipment record and its full/partial change history in reverse order. This is the reason for the x_ prefixes on the current record's keys. A query with a key expression of PK = shpmt#55abc AND SK >= "update", DESC sorting with ScanIndexForward=False and a limit of 2 returns the current record (x_current) and the latest update record.
* Whether this is a good solution for you also depends on expected read/write volumes.
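To see why the x_ prefix puts the current record ahead of the update records in a descending query, here is a small pure-Python simulation of the sort-key ordering. It does not call DynamoDB; the helper `query_desc` and the extra sample update item are assumptions made for illustration.

```python
# Hypothetical items sharing the partition key shpmt#55abc.
items = [
    {"PK": "shpmt#55abc", "SK": "x_current"},
    {"PK": "shpmt#55abc", "SK": "update#2022-06-27T09:00:00.000Z"},
    {"PK": "shpmt#55abc", "SK": "update#2022-06-28T06:10:33.247Z"},
]

def query_desc(items, pk, sk_lower_bound, limit):
    """Mimic a Query with SK >= bound, ScanIndexForward=False and a Limit."""
    matched = [i for i in items if i["PK"] == pk and i["SK"] >= sk_lower_bound]
    # Descending string order: "x_..." sorts after "update#...".
    matched.sort(key=lambda i: i["SK"], reverse=True)
    return matched[:limit]

result = query_desc(items, "shpmt#55abc", "update", 2)
```

With these items, the two results are the x_current record followed by the most recent update record, because "x" sorts after "u" and the ISO timestamps sort chronologically as strings.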

How to handle future-dated records in PostgreSQL using EF Core

I am working on a microservices architecture for a payroll application.
ORM: EF Core
I have an Employee table where employee details (first name, last name, department, etc.) are stored in a jsonb column in PostgreSQL.
One use case is that I may receive requests for future-dated changes. Example: an employee's designation changes next month, but I receive the request for that change in the current month.
I have two approaches to handle this scenario.
Approach 1:
When I get a future-dated record (effective date > current date), I store it in a separate table, not in the employee master table.
I will create a console application that runs every day (cron), picks up the due records (effectivedate == currentdate), and updates the employee master table.
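A rough sketch of what the daily job in approach 1 could look like, written in Python with SQLite purely for illustration rather than EF Core and PostgreSQL. The table names `employee` and `employee_future`, their columns, and the function `apply_due_changes` are all hypothetical.

```python
import sqlite3
from datetime import date

def apply_due_changes(conn, today):
    """Copy changes whose effective date is today into the master table."""
    day = today.isoformat()
    due = conn.execute(
        "SELECT employee_id, details FROM employee_future WHERE effective_date = ?",
        (day,),
    ).fetchall()
    with conn:  # one transaction: update master rows, then clear applied changes
        for employee_id, details in due:
            conn.execute(
                "UPDATE employee SET details = ? WHERE id = ?",
                (details, employee_id),
            )
        conn.execute("DELETE FROM employee_future WHERE effective_date = ?", (day,))
    return len(due)

# Hypothetical in-memory setup standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, details TEXT)")
conn.execute(
    "CREATE TABLE employee_future (employee_id INTEGER, effective_date TEXT, details TEXT)"
)
conn.execute("INSERT INTO employee VALUES (?, ?)", (1, '{"designation": "Engineer"}'))
conn.execute(
    "INSERT INTO employee_future VALUES (?, ?, ?)",
    (1, "2024-02-01", '{"designation": "Senior Engineer"}'),
)
applied = apply_due_changes(conn, date(2024, 2, 1))
```

Matching on equality means a missed run leaves changes unapplied; comparing with "on or before today" instead would let the next run catch up, at the cost of needing an ordering rule when several changes are pending for one employee.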
Approach 2:
Almost the same as approach 1, but instead of using a separate table for future-dated records, I store them in the employee master table.
If I go with approach 2:
I need to delete the existing record when the effective date becomes the current date.
When I handle a get request, I should return only the current record, not the future record. To achieve this, I need to add a condition that checks the effective date. All employee details are stored in the jsonb column, so I need to fetch all records, current and future-dated, and filter out everything but the current record.
I feel approach 1 is better. Please help me with this. I would also like to know of any other approaches that may fit this use case.
Thanks,
Revathi

Tableau - Passing result of one calculation as a filter to get data from another data source

I have 2 data sources. One data source provides details of employees (ID, name, etc.) and their departments. This is a database. The other data source is a manually maintained Excel sheet on a shared drive that has the employee ID and a flag that states whether the employee is a New Joiner or a Leaver. This sheet, however, doesn't have department information for employees.
I need to create a dashboard where the user can select a department and get details of employees that are flagged as Leavers in the Excel data source.
How can this be achieved?
I suggest you join your data in Tableau. Add your database table first, then join the Excel sheet to it, using ID as the join key.
After the join, you will have a single dataset in which the flag is one column among the others, so you can filter on department and flag together.
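The effect of that join can be illustrated with a few lines of Python. This is only a sketch of the resulting data shape, not Tableau configuration; the sample rows and the function `leavers_in` are made up.

```python
# Hypothetical rows from the database source (ID, name, department).
employees = [
    {"id": 1, "name": "Ann", "department": "Sales"},
    {"id": 2, "name": "Bob", "department": "Sales"},
    {"id": 3, "name": "Cy", "department": "IT"},
]
# Hypothetical rows from the Excel sheet (ID, flag).
flags = [
    {"id": 1, "flag": "Leaver"},
    {"id": 2, "flag": "New Joiner"},
    {"id": 3, "flag": "Leaver"},
]

def leavers_in(department):
    flag_by_id = {f["id"]: f["flag"] for f in flags}
    # Left join on ID: every employee row gains a flag column (None if absent).
    joined = [dict(e, flag=flag_by_id.get(e["id"])) for e in employees]
    return [
        r["name"]
        for r in joined
        if r["department"] == department and r["flag"] == "Leaver"
    ]
```

Selecting a department then amounts to filtering the joined rows on both columns, for example `leavers_in("Sales")`.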

Filter and display database audit / changelog (activity stream)

I'm developing an application with SQLAlchemy and PostgreSQL. Users of the system modify data in 8 or so tables. Consider this contrived example schema:
I want to add visible logging to the system to record what has changed, but not necessarily how it has changed. For example: "User A modified product Foo", "User A added user B" or "User C purchased product Bar". So basically I want to store:
Who made the change
A message describing the change
Enough information to reference the object that changed, e.g. the product_id and customer_id when an order is placed, so the user can click through to that entity
I want to show each user a list of recent and relevant changes when they log in to the application (a bit like the main timeline in Facebook etc). And I want to store subscriptions, so that users can subscribe to changes, e.g. "tell me when product X is modified", or "tell me when any products in store S are modified".
I have seen the audit trigger recipe, but I'm not sure it's what I want. That audit trigger might do a good job of recording changes, but how can I quickly filter it to show recent, relevant changes to the user? Options that I'm considering:
Have one column per ID type in the log and subscription tables, with an index on each column
Use full text search, combining the ID types as a tsvector
Use an hstore or json column for the IDs, and index the contents somehow
Store references as URIs (strings) without an index, and walk over the logs in reverse date order, using application logic to filter by URI
Any insights appreciated :)
Edit: It seems what I'm talking about is an activity stream. The suggestion in this answer to filter by time first is sounding pretty good.
Since the objects all use uuid for the id field, I think I'll create the activity table like this:
Have a generic reference to the target object, with a uuid column with no foreign key, and an enum column specifying the type of object it refers to.
Have an array column that stores generic uuids (maybe as text[]) of the target object and its parents (e.g. parent categories, store and organisation), and search the array for matching subscriptions. That way a subscription for a parent category can match a child in one step (denormalised).
Put a btree index on the date column, and (maybe) a GIN index on the array UUID column.
I'll probably filter by time first to reduce the amount of searching required. Later, if needed, I'll look at using GIN to index the array column (this partially answers my question "Is there a trick for indexing an hstore in a flexible way?")
Update: this is working well. The SQL to fetch a timeline looks something like this:
SELECT *
FROM (
    SELECT DISTINCT ON (activity.created, activity.id)
        *
    FROM activity
    LEFT OUTER JOIN unnest(activity.object_ref) WITH ORDINALITY AS act_ref
        ON true
    LEFT OUTER JOIN subscription
        ON subscription.object_id = act_ref.act_ref
    WHERE activity.created BETWEEN :lower_date AND :upper_date
        AND subscription.user_id = :user_id
    ORDER BY activity.created DESC,
        activity.id,
        act_ref.ordinality DESC
) AS sub
WHERE sub.subscribed = true;
Joining with unnest(...) WITH ORDINALITY, ordering by ordinality, and selecting distinct on the activity ID filters out activities that have been unsubscribed from at a deeper level. If you don't need to do that, then you could avoid the unnest and just use the array containment @> operator, and no subquery:
SELECT *
FROM activity
-- @> takes an array on both sides, so the scalar id is wrapped in ARRAY[...]
JOIN subscription ON activity.object_ref @> ARRAY[subscription.object_id]
WHERE subscription.user_id = :user_id
    AND activity.created BETWEEN :lower_date AND :upper_date
ORDER BY activity.created DESC;
You could also join with the other object tables to get the object titles - but instead, I decided to add a title column to the activity table. This is denormalised, but it doesn't require a complex join with many tables, and it tolerates objects being deleted (which might be the action that triggered the activity logging).
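The "deepest match wins" behaviour of the first query can be sketched in plain Python. The function name `visible`, the sample ids, and the assumption that the array is ordered from outermost parent to the target object are illustrative, not part of the schema above.

```python
def visible(activity_refs, subscriptions):
    """Return True if the deepest matching subscription is active.

    activity_refs: ids ordered from outermost parent to the target object
                   (assumed ordering, mirroring ORDER BY ordinality DESC).
    subscriptions: dict of object id -> subscribed flag for one user.
    """
    for ref in reversed(activity_refs):   # deepest level first
        if ref in subscriptions:
            return subscriptions[ref]     # first hit wins, like DISTINCT ON
    return False                          # no subscription matches at all

# Hypothetical user: subscribed to a store, unsubscribed from one product in it.
subs = {"store-S": True, "product-X": False}
```

Here an activity on product-Y in store-S is shown, while an activity on product-X is hidden, because the product-level row overrides the store-level subscription.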

TSQL - Deleting with Inner Joins and multiple conditions

My question is a variation on one already asked and answered (TSQL Delete Using Inner Joins) but I have a different level of complexity and I couldn't see a solution to it.
My requirement is to delete Special Prices which haven't been accessed in 90 days. Special Prices are keyed on Customer ID and Product ID, and the products have to be matched to a Customer Order Detail table which also contains a Customer ID and a Product ID. I want to write one function that will look at the Special Price table for each Customer, compare each Product for that Customer with the Customer Order Detail table and, if the Maximum Order Date is more than 90 days earlier than today, delete it from the Special Price table.
I know I can use a CURSOR (slow but effective) but would prefer to have a single query like the one in the TSQL Delete Using Inner Joins example. Any ideas and/or is more information required?
I can't dig into the specifics of your system, but if it suits you, take a look at the MERGE statement. It may help you avoid cursors.
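As a set-based alternative to a cursor, here is a minimal sketch of a single DELETE driven by a correlated subquery rather than MERGE. It runs against SQLite from Python for illustration only, not T-SQL, and the table and column names (`special_price`, `order_detail`, `order_date`) are assumptions.

```python
import sqlite3

def delete_stale_special_prices(conn, cutoff):
    """Delete special prices whose latest matching order predates cutoff (ISO date)."""
    with conn:
        cur = conn.execute(
            """
            DELETE FROM special_price
            WHERE (
                SELECT MAX(od.order_date)
                FROM order_detail AS od
                WHERE od.customer_id = special_price.customer_id
                  AND od.product_id = special_price.product_id
            ) < ?
            """,
            (cutoff,),
        )
    return cur.rowcount

# Hypothetical in-memory data standing in for the real tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE special_price (customer_id INTEGER, product_id INTEGER)")
conn.execute(
    "CREATE TABLE order_detail (customer_id INTEGER, product_id INTEGER, order_date TEXT)"
)
conn.executemany("INSERT INTO special_price VALUES (?, ?)", [(1, 10), (1, 20), (2, 10)])
conn.executemany(
    "INSERT INTO order_detail VALUES (?, ?, ?)",
    [(1, 10, "2023-01-05"), (1, 10, "2024-03-01"), (1, 20, "2023-02-01")],
)
deleted = delete_stale_special_prices(conn, "2023-12-03")
remaining = conn.execute(
    "SELECT customer_id, product_id FROM special_price ORDER BY 1, 2"
).fetchall()
```

In this sketch a special price with no orders at all is kept, because MAX over no rows is NULL and the comparison is then not true; whether such rows should be deleted is a decision the question leaves open. In T-SQL the cutoff would typically be computed with DATEADD(day, -90, GETDATE()).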