How to tell Talend not to hold data in memory while joining - Talend

I am asking how to do something in Talend that is a feature in DataStage.
I am looking at a Talend job where, whenever I perform a join or lookup, Talend tries to "memorize" the entire lookup or reference dataset prior to the join. My datasets are too large for Talend to 'memorize', and this kills the job.
In DataStage, I can avoid this by putting sort stages in front of the join stage; the join stage takes advantage of this by doing a "sorted join", so the entire dataset isn't held in memory but is joined and sent to the next stage while the join is in progress, saving memory.
How do I accomplish this in Talend?
Thank you.

When you are retrieving 180 million records from the database it can hurt ETL performance, so you can also push this kind of join down to the database server.
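If the stream and the lookup live in the same database, the pushed-down join could be a plain query like the sketch below (big_fact, lookup_ref, lookup_id and ref_description are made-up names for illustration):
-- the database joins the large stream with its lookup; Talend only receives the joined rows
SELECT f.*, l.ref_description
FROM big_fact AS f
JOIN lookup_ref AS l ON l.id = f.lookup_id;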

I think you can:
Use the temporary data storage option of tMap (the "Store temp data" setting on the lookup), which stores the lookup data on disk instead of in memory.
Enlarge the buffer size in tMap.
Or maybe use the "tMemorizeRows" component.

Related

Good way to ensure data is ready to be queried while Glue partition is created?

We have queries that run on a schedule every several minutes and join a few different Glue tables (via Athena) before coming up with some results. For the table in question, we have Glue Crawlers set up, with partitions based on snapshot_date and a couple of other columns.
In the query, we grab the latest snapshot_date and use only data from that snapshot_date. The data in S3 gets updated and put into the right folder a few times a day, but it looks like sometimes, if we query right as the data in S3 is being updated, we end up with empty results because the query hits the new snapshot_date partition while Glue is still setting the data up.
Is there a built-in way to ensure that our Glue partitions are ready before we start querying them? So far, we have considered building artificial time "buffers" into our query around when we expect the snapshot_date partition data to be written and the Glue update to be complete, but I'm aware that this is really brittle and depends on exact timing.

Huge data reporting using PostgreSQL and Data Studio

I manage a health-care database which is hosted in AWS RDS. The system info is as follows:
PostgreSQL 9.6
8 v-cores and 16GB RAM
DB size now is 35GB
The problem is that I want to join a few thousand users in the accounts table with other health-metric tables (up to 10 of them, with a few million records per table) to build a custom data report (using Google Data Studio).
Here is what I did:
Joined all the needed tables into one materialized view.
Fed Google Data Studio from this materialized view.
But I have waited 10 hours and it still runs without end. I think it will never finish. Does anyone have experience with huge data reports? Just give me the keywords.
Here is my materialized view definition:
CREATE MATERIALIZED VIEW report_20210122 AS
SELECT /* long, but simple list */
FROM accounts
INNER JOIN user_weartime ON accounts.id = user_weartime.user_id
INNER JOIN admin_exchanges ON accounts.id = admin_exchanges.user_id
INNER JOIN user_health_source_stress_history ON accounts.id = user_health_source_stress_history.user_id
INNER JOIN user_health_source_step_history ON accounts.id = user_health_source_step_history.user_id
INNER JOIN user_health_source_nutri_history ON accounts.id = user_health_source_nutri_history.user_id
INNER JOIN user_health_source_heart_history ON accounts.id = user_health_source_heart_history.user_id
INNER JOIN user_health_source_energy_history ON accounts.id = user_health_source_energy_history.user_id
INNER JOIN user_health_source_bmi_history ON accounts.id = user_health_source_bmi_history.user_id
WHERE accounts.id IN (/* 438 numbers */);
Creating a materialized view for a huge join is probably not going to help you.
You didn't show us the query for the report, but I expect that it contains some aggregate functions, and you don't want a report that lists millions of raw rows.
First, make sure that you have all the appropriate indexes in place. Which indexes you need depends on the query. For the one you are showing, you would want an index on accounts(id), and (if you want a nested loop join) on admin_exchanges(user_id), and similarly for the other tables.
But to find out the correct indexes for your eventual query, you'd have to look at its execution plan.
Sometimes a materialized view can help considerably, but typically by pre-aggregating some data.
If you join more than 8 tables, increasing join_collapse_limit can give you a better plan.
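As a sketch based on the view definition above (whether each index actually helps depends on the execution plan, so check it first):
-- one index per table that is joined on user_id; accounts.id is normally covered by its primary key
CREATE INDEX ON admin_exchanges (user_id);
CREATE INDEX ON user_weartime (user_id);
CREATE INDEX ON user_health_source_stress_history (user_id);
-- ... and likewise for the remaining *_history tables
-- the view joins 9 tables, so raise join_collapse_limit at least that high for the session
SET join_collapse_limit = 9;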
I changed my approach and figured out how to do it using FULL JOINs ON start_date AND user_id, so that each health metric becomes a column in one huge view. My report now has more than 500k rows and 40 columns, but the view creation is still very fast, and so is the query time on the view.
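Roughly, the wide view would look like the sketch below; the per-metric value columns are assumptions, since the original schema isn't shown:
CREATE MATERIALIZED VIEW report_wide AS
SELECT COALESCE(s.user_id, h.user_id)       AS user_id,
       COALESCE(s.start_date, h.start_date) AS start_date,
       s.value AS stress,      -- assumed value column of the stress history table
       h.value AS heart_rate   -- assumed value column of the heart history table
FROM user_health_source_stress_history s
FULL JOIN user_health_source_heart_history h
  ON h.user_id = s.user_id AND h.start_date = s.start_date;
-- each further metric table is FULL JOINed on (user_id, start_date) in the same way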
I would ask why you are using a direct connection to PostgreSQL to display data in Data Studio. Although this is supported, it only makes sense if you don't want to invest time in developing a good data flow (that is, your data is small) or if you want to display real-time data.
But since your data is huge and you're using a materialized view, I guess neither of these is the case.
I suggest you move to BigQuery. Data Studio and BigQuery play really nicely together, and BigQuery is made to process huge amounts of data very fast. I bet your query would run in seconds on BigQuery and cost cents.
Sadly, BigQuery only supports Cloud SQL external connectors and can't connect directly to your AWS RDS service. You'll need to write an ETL job somewhere, or move your database to Cloud SQL for PostgreSQL (which I recommend, if possible).
Check out these answers if you're interested in transferring data from AWS RDS to BigQuery:
how to load data from AWS RDS to Google BigQuery in streaming mode?
Synchronize Amazon RDS with Google BigQuery

PySpark - Perform Merge in Synapse using Databricks Spark

We are facing a tricky situation while performing an ACID operation using Databricks Spark.
We want to perform an UPSERT on an Azure Synapse table over a JDBC connection using PySpark. We are aware that Spark provides only two write modes that are useful in our case: APPEND and OVERWRITE. Based on these two modes, we thought of the options below:
We write the whole dataframe into a stage table and use this stage table to perform a MERGE operation (~ UPSERT) with the final table. The stage table is truncated/dropped after that.
We bring the target table data into Spark as well. Inside Spark we perform a MERGE using Delta Lake and generate a final dataframe, which is written back to the target table in OVERWRITE mode.
Considering the downsides:
In option 1, we have to use two tables just to write the final data, and in case both the stage and target tables are big, performing the MERGE operation inside Synapse is another herculean task and may take time.
In option 2, we have to bring the target table into Spark memory. Even though network IO is not much of a concern, as both Databricks and Synapse will be in the same Azure AZ, it may lead to memory issues on the Spark side.
Are there any other feasible options? Or any recommendations?
The answer would depend on many factors not listed in your question. It's a very open-ended question.
(Given the way your question is phrased, I'm assuming you're using Dedicated SQL Pools and not on-demand Synapse.)
Here are some thoughts:
You'll be using Synapse's compute for the MERGE in option 1 and the Spark cluster's compute in option 2. Compare the costs.
Pick the lower cost.
Reads and writes between Spark and Synapse via the connector use the Data Lake as a staging area. That is, while reading a table from Synapse into a dataframe in Spark, the driver will first make Synapse export the data to the Data Lake (as Parquet, IIRC) and then read those files from the Data Lake to create the dataframe. This scales nicely when you're talking about tens of millions or billions of rows, but the overhead can become a performance problem when row counts are low (tens to hundreds of thousands).
Test and pick the faster one.
Remember that Synapse is not like a traditional MySQL or SQL Server. It's an MPP database.
"Performing the MERGE operation inside Synapse is another herculean task and may take time" is a wrong statement. It scales just like a Spark cluster.
"It may lead to memory issues on the Spark side": yes and no. On one hand, all the data isn't going to be loaded into a single worker node. On the other hand, yes, you do need enough memory for each node to do its own part.
Although Synapse can be scaled up and down dynamically, I've seen it take up to 40 minutes to complete a scale-up. Databricks, on the other hand, is fully on-demand, and you can probably get away with turning on the cluster, doing the upsert, and shutting the cluster down. With Synapse you'll probably have other clients using it, so you may not be able to shut it down.
So with Synapse, either you'll have to live with 40-80 minutes of downtime for each upsert (scale up, upsert, scale down), or pay for a high DWU flat rate all the time, even though your usage is high only when you upsert and otherwise pretty low.
Lastly, remember that MERGE is in preview at the time of writing. That means no Sev-A support cases/immediate support if something breaks in your prod because you're using MERGE.
You can always use DELETE + INSERT instead. This assumes the delta you receive has all the columns of the target table and not just the updated ones.
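A minimal sketch of that DELETE + INSERT path, assuming a staging table dbo.stage_table, a target dbo.target_table and a key column id (all placeholder names):
BEGIN TRANSACTION;
-- drop the rows that are about to be replaced, then load the full delta
DELETE FROM dbo.target_table
WHERE id IN (SELECT id FROM dbo.stage_table);
INSERT INTO dbo.target_table
SELECT * FROM dbo.stage_table;
COMMIT;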
Did you try creating a checksum so that the merge/upsert only touches rows whose data has actually changed?
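A sketch of that checksum filter in T-SQL, with placeholder payload columns col1 and col2; only new rows or rows whose hash differs would be fed into the upsert:
SELECT s.*
FROM dbo.stage_table AS s
LEFT JOIN dbo.target_table AS t ON t.id = s.id
WHERE t.id IS NULL
   OR HASHBYTES('SHA2_256', CONCAT(s.col1, '|', s.col2))
      <> HASHBYTES('SHA2_256', CONCAT(t.col1, '|', t.col2));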

An alternative design to insert/update in Talend

I have a requirement in Talend where I have to update/insert rows from a source table into a destination table. The source and destination tables are identical. The source gets refreshed by a business process, and I need to update/insert these results into the destination table.
I designed this with 'insert or update' in tMap and tMysqlOutput. However, the job turns out to be super slow.
As an alternative to the above solution, I am trying to design the insert and the update separately. In order to do this, I want to hash the source rows, as the number of rows is usually small.
So, my question is: I will hash the input rows, but when I join them with the destination rows in tMap, should I hash the destination rows as well? Or should I use the destination rows as they are and join them directly?
Any suggestions on the job design here?
Thanks
Rathi
If you are using the same database, you should not use ETL loading techniques but ELT loading, so that all processing happens in the database. Talend offers a few ELT components which are a bit different to use but very helpful for this case. I've seen things speed up by multiple orders of magnitude using only those components.
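For example, with MySQL the kind of statement an ELT-style job would push down could look like the sketch below (table and column names are made up; it assumes a primary or unique key on id):
INSERT INTO dest_table (id, col_a, col_b)
SELECT id, col_a, col_b
FROM src_table
ON DUPLICATE KEY UPDATE
  col_a = VALUES(col_a),
  col_b = VALUES(col_b);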
It is still a good idea to use an indexed hash field in both the source and the target, which is done in the same way as loading Satellites in the Data Vault 2.0 model.
Alternatively, if you have direct access to the source table's database, you could consider adding triggers for C(R)UD scenarios. That way, every action on the source database can be reflected in your database immediately. Remember, though, that you might need a buffer ("staging") table where you store the changes, so that you can ingest fast and process later. This table would contain only the changed rows and the change type (create, update, delete) for you to process. This decouples loading from processing, which helps if there is ever a problem with loading or processing.
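A sketch of such a trigger in MySQL, with made-up table and column names; matching triggers would cover INSERT and DELETE:
CREATE TRIGGER src_after_update
AFTER UPDATE ON src_table
FOR EACH ROW
  INSERT INTO staging_changes (id, col_a, col_b, change_type, changed_at)
  VALUES (NEW.id, NEW.col_a, NEW.col_b, 'update', NOW());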
Yes, I believe you should use the hash component for the destination table as well, because then your processing (the lookup) will be very fast, as it happens in memory.
If not, the lookup load may take more time.

Aggregate as part of ETL or within the database?

Is there a general preference or best practice as to whether data should be aggregated in memory on an ETL worker (with pandas groupby or pd.pivot_table, for example) versus doing a GROUP BY query at the database level?
At the visualization layer, I connect to the last 30 days of detailed interaction-level data, and then the last few years of aggregated data (daily level).
I suppose that if I plan on materializing the aggregated table, it would be best to just do it during the ETL phase since that can be done remotely and not waste the resources of the database server. Is that correct?
If your concern is to put as little load on the source database server as possible, it is best to pull the tables from the source database into a staging area and do the joins and aggregations there. But take care that the ETL tool does not perform a nested loop join on the source database tables, that is, pull in one of the tables and then run thousands of queries against the other table to find matching rows.
If your goal is to perform joins and aggregations as fast and efficiently as possible, by all means push them down to the source database. This may put more load on the source database, though. I say "may" because if all you need is an aggregation on a single table, it can be cheaper to perform it in the source database than to pull the whole table.
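For example, instead of pulling all the raw interaction rows into pandas and grouping there, a daily aggregation can be pushed to the database with something like this (table and column names are illustrative):
SELECT user_id,
       date_trunc('day', event_time) AS day,
       count(*) AS interactions
FROM interactions
WHERE event_time >= now() - interval '30 days'
GROUP BY user_id, date_trunc('day', event_time);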
If you aggregate by day, what if your boss wants it aggregated by hour or by week?
The general rule is: Your fact table granularity should be as granular as possible. Then you can drill-down.
You can create pre-aggregated tables too, for example by hour, day, week, month, etc. Space is cheap these days.
Tools like Pentaho Aggregation Designer can automate this for you.
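As an illustration of layered pre-aggregation (all names made up), a daily rollup can be built from the detailed fact table and coarser grains from the daily rollup:
CREATE TABLE fact_interactions_daily AS
SELECT user_id, date_trunc('day', event_time) AS day, count(*) AS interactions
FROM fact_interactions
GROUP BY user_id, date_trunc('day', event_time);
CREATE TABLE fact_interactions_weekly AS
SELECT user_id, date_trunc('week', day) AS week, sum(interactions) AS interactions
FROM fact_interactions_daily
GROUP BY user_id, date_trunc('week', day);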