I have a requirement in Talend wherein I have to update/insert rows from a source table into a destination table. The source and destination tables are identical. The source gets refreshed by a business process, and I need to update/insert those results into the destination table.
I originally designed the 'insert or update' with tMap and tMysqlOutput. However, the job turns out to be super slow.
As an alternative to the above solution, I am trying to design the insert and the update separately. To do this, I want to hash the source rows, since the number of source rows is usually small.
So, my question: I will hash the input rows, but when I join them with the destination rows in tMap, should I hash the destination rows as well? Or should I use the destination rows as they are and join them directly?
Any suggestions on the job design here?
Thanks
Rathi
If you are using the same database, you should not use ETL loading techniques but ELT loading, so that all processing happens in the database. Talend offers a few ELT components which are a bit different to use but very helpful for this case. I have seen jobs speed up by multiple orders of magnitude using only those components.
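For instance, a minimal sketch of such an in-database upsert, assuming MySQL (given tMysqlOutput) and a unique key on id; table and column names here are assumptions:

    -- One set-based statement instead of a row-by-row insert-or-update
    INSERT INTO dest_table (id, col1, col2)
    SELECT id, col1, col2
    FROM src_table
    ON DUPLICATE KEY UPDATE
        col1 = VALUES(col1),
        col2 = VALUES(col2);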
It is still a good idea to use an indexed hashed field in both the source and the target, much like the way Satellites are loaded in the Data Vault 2.0 model.
Alternatively, if you have direct access to the source table's database, you could consider adding triggers for C(R)UD scenarios. That way, every action on the source database could be reflected in your database immediately. Remember, though, that you might need a buffer table ("staging") where you store the changes, so that you can ingest fast and process later. This table would hold only the changed rows and the change type (create, update, delete) for you to process. That decouples loading from processing, which can be helpful if there is ever a problem with either one later on.
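A minimal sketch of one such trigger (the update case), assuming MySQL and an illustrative change_staging buffer table; all names here are assumptions:

    CREATE TABLE change_staging (
        id          INT NOT NULL,
        change_type CHAR(1) NOT NULL,    -- 'C', 'U' or 'D'
        changed_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Record every update on the source table; insert and delete triggers
    -- would follow the same pattern with 'C' and 'D'.
    CREATE TRIGGER source_table_after_update
    AFTER UPDATE ON source_table
    FOR EACH ROW
        INSERT INTO change_staging (id, change_type) VALUES (NEW.id, 'U');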
Yes, I believe you should use the hash component for the destination table as well.
That way the lookup will be very fast, because it happens in memory.
If not, the lookup load may take more time.
I'm using Data Factory (well, Synapse pipelines) to ingest data from sources into a staging layer. I am using the Copy Data activity with UPSERT. However, I found the performance of incrementally loading large tables particularly slow, so I did some digging.
My incremental load brought in 193k new/modified records from the source. These get stored in the transient staging/landing table that the Copy Data activity creates in the database in the background. In this table it adds a column called BatchIdentifier; however, the batch identifier value is different for every row.
Profiling the load, I can see individual statements issued for each BatchIdentifier, so effectively it's processing the incoming data row by row rather than as a set.
I tried setting the sink writeBatchSize property on the Copy Data activity to 10k, but that doesn't make any difference.
Has anyone else come across this, or found a better way to perform a dynamic upsert without having to specify all the columns in advance (which I'm really hoping to avoid)?
As an example, this is the shape of the SQL statement issued 193k times during my load.
It does a check to see if the record exists in the target table; if so, it performs an update, otherwise it performs an insert. The logic makes sense, but it is executed on a row-by-row basis when the same thing could be done in bulk.
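An illustrative reconstruction of that per-row pattern (not the exact statement ADF generates; table, key, and column names here are assumptions):

    -- Sketch of the per-BatchIdentifier upsert, with assumed names
    IF EXISTS (SELECT 1 FROM dbo.TargetTable WHERE [Key] = @key)
        UPDATE dbo.TargetTable SET Col1 = @col1, Col2 = @col2 WHERE [Key] = @key
    ELSE
        INSERT INTO dbo.TargetTable ([Key], Col1, Col2) VALUES (@key, @col1, @col2)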
Is your primary key definition in the source the same as in the sink?
I just ran into this same behavior when the source and destination tables used different key columns.
It also appears that ADF/Synapse does not use MERGE for upserts, but its own IF EXISTS THEN UPDATE ELSE INSERT logic, so there may be something behind the scenes making it select single rows for those BatchId executions.
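For comparison, a set-based upsert from the interim table would look something like the sketch below (assumed table and column names; this is not what ADF actually emits):

    -- Upsert the whole batch from the staging/interim table in one statement
    MERGE dbo.TargetTable AS t
    USING dbo.InterimTable AS s
        ON t.[Key] = s.[Key]
    WHEN MATCHED THEN
        UPDATE SET t.Col1 = s.Col1, t.Col2 = s.Col2
    WHEN NOT MATCHED THEN
        INSERT ([Key], Col1, Col2) VALUES (s.[Key], s.Col1, s.Col2);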
I want to periodically export data from db2 and load it in another database for analysis.
In order to do this, I would need to know which rows have been inserted or updated since the last time I exported a given table.
A simple solution would probably be to add a timestamp column to every table and use that as a reference, but I don't have such a column at the moment, and I would like to avoid adding one if possible.
Is there any other solution for finding the rows which have been added/updated after a given time (or something else that would solve my issue)?
There is an easy option for a timestamp in Db2 (for LUW) called ROW CHANGE TIMESTAMP.
This column is managed by Db2 and can be defined as IMPLICITLY HIDDEN, so existing SELECT * queries will not retrieve the new column, which would otherwise cause extra costs.
Check out the Db2 CREATE TABLE documentation
This functionality was originally added for optimistic locking but can be used for such situations as well.
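A minimal sketch of adding such a column to an existing table (my_table and row_changed are placeholder names):

    -- Db2 maintains this column automatically on every insert/update;
    -- IMPLICITLY HIDDEN keeps it out of SELECT * results.
    ALTER TABLE my_table
        ADD COLUMN row_changed TIMESTAMP NOT NULL
        GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP
        IMPLICITLY HIDDEN;

    -- Export only the rows changed since the last run (referencing the
    -- hidden column explicitly makes it usable):
    SELECT * FROM my_table WHERE row_changed > ?;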
There is a similar concept for Db2 for z/OS; you would have to check that out yourself, as I have not tried that one.
Of course there are other ways to solve this, like replication, etc.
That is not possible if you do not have a timestamp column. With a timestamp, you can tell which rows are new or modified.
You can also use the Time Travel feature in order to get the new values, but that also implies timestamp columns.
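A sketch of such a query, assuming my_table has already been set up as a system-period temporal table (that setup is what adds the timestamp columns; all names and values here are placeholders):

    -- Rows as changed between the last export and now
    SELECT *
    FROM my_table
    FOR SYSTEM_TIME BETWEEN TIMESTAMP('2024-01-01-00.00.00') AND CURRENT TIMESTAMP;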
Another option is to put the tables in append mode and then get the rows after a given one. However, this option is not reliable after a reorg, and it affects performance and space utilisation.
One possible option is to use SQL replication, but that needs extra tables for staging.
Finally, another option is to read the logs with the db2ReadLog API, but that implies custom development. Also, just applying the archived logs to the new database is possible; however, that database will remain in roll-forward pending state.
So a little background - we have a large number of data sources ranging from RDBMSs to S3 files. We would like to synchronize and integrate this data with various other data warehouses, databases, etc.
At first, this seemed like the canonical model for Kafka. We would like to stream the data changes through Kafka to the data output sources. In our test case we are capturing the changes with Oracle Golden Gate and successfully pushing the changes to a Kafka queue. However, pushing these changes through to the data output source has proven challenging.
I realize that this would work very well if we were just adding new data to the Kafka topics and queues. We could cache the changes and write them to the various data output sources. However, this is not the case. We will be updating, deleting, modifying partitions, etc. The logic for handling this seems to be much more complicated.
We tried using staging tables and joins to update/delete the data but I feel that would become quite unwieldy quickly.
This brings me to my question - are there any different approaches we could take to handle these operations? Or should we move in a totally different direction?
Any suggestions/help is much appreciated. Thank you!
There are 3 approaches you can take:
Full dump load
Incremental dump load
Binlog replication
Full dump load
Periodically, dump your RDBMS source table into a file, and load that into the data warehouse, replacing the previous version. This approach is mostly useful for small tables, but it is very simple to implement and supports updates and deletes to the data easily.
Incremental dump load
Periodically, get the records that changed since your last query, and send them to be loaded to the data warehouse. Something along the lines of
SELECT *
FROM my_table
WHERE last_update > #{last_import}
This approach is slightly more complex to implement, because you have to maintain state ("last_import" in the snippet above), and it does not support deletes. It can be extended to support deletes, but that makes it more complicated. Another disadvantage of this approach is that it requires your tables to have a last_update column.
Binlog replication
Write a program that continuously listens to the binlog of your RDBMS and sends these updates to be loaded to an intermediate table in the data warehouse, containing the updated values of the row, and whether it is a delete operation or update/create. Then write a query that periodically consolidates these updates to create a table that mirrors the original table. The idea behind this consolidation process is to select, for each id, the last (most advanced) version as seen in all the updates, or in the previous version of the consolidated table.
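A sketch of that consolidation query, assuming an intermediate table updates (with an is_delete flag and a monotonically increasing log_position) and the previous snapshot consolidated; all names here are illustrative:

    -- For each id, keep only the newest version seen across the new updates
    -- and the previous snapshot, then drop ids whose latest version is a delete.
    CREATE TABLE consolidated_new AS
    SELECT id, col1, col2
    FROM (
        SELECT id, col1, col2, is_delete,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY log_position DESC) AS rn
        FROM (
            SELECT id, col1, col2, is_delete, log_position FROM updates
            UNION ALL
            SELECT id, col1, col2, 0 AS is_delete, 0 AS log_position FROM consolidated
        ) AS all_versions
    ) AS ranked
    WHERE rn = 1 AND is_delete = 0;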
This approach is slightly more complex to implement, but allows achieving high performance even on large tables and supports updates and deletes.
Kafka is relevant to this approach in that it can be used as a pipeline for the row updates between the binlog listener and the loading to the data warehouse intermediate table.
You can read more about these different replication approaches in this blog post.
Disclosure: I work at Alooma (a co-worker wrote the blog post linked above, and we provide data pipelines as a service, solving problems like this).
Greetings Overflowers,
Is there an SQL DBMS that allows me to create an indexed view into which I can insert new rows without modifying the original tables of the view? I will need to query this view after performing the in-view-only inserts. If the answer is no, what other methods can do the job? I simply want to merge a set of rows that comes from another server with the set of rows in the created view, in a specific order, to be able to perform fast queries against the merged set (i.e. the indexed view) without having to persist the received set to disk. I am not sure whether an in-memory database would perform well as the merged sets grow ridiculously large.
What do you think guys?
Kind regards
Well, there's no supported way to do that, since the view has to be based on some table(s).
Besides that, indexed views are not meant to be used like that. You shouldn't push data into the indexed view itself in the hope of making retrieval faster.
I suggest you keep your view just the way it is, and then have a staging table, with the proper indexes created on it, into which you insert the data coming from the external system.
The staging table should be truncated anytime you want to get rid of the data (so right before you're inserting new data). That should be done in a SNAPSHOT ISOLATION transaction, so your existing queries don't read dirty data, or deadlock.
Then you have two options:
Use a UNION ALL clause to merge the results from the view and the staging table when you want to retrieve your data (see the sketch after this list).
If the staging table shouldn't be merged but inner joined, then perhaps you can integrate it into the indexed view.
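A sketch of the first option, assuming SQL Server (given the indexed view and SNAPSHOT ISOLATION terminology) and illustrative object names:

    -- NOEXPAND forces use of the view's own index rather than
    -- expanding the view definition
    SELECT id, col1, col2 FROM dbo.MyIndexedView WITH (NOEXPAND)
    UNION ALL
    SELECT id, col1, col2 FROM dbo.Staging;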
I have loaded a huge CSV dataset -- Eclipse's Filtered Usage Data -- using PostgreSQL's COPY, and it's taking a huge amount of space because it's not normalized: three of the TEXT columns are much more efficiently refactored into separate tables, referenced from the main table with foreign key columns.
My question is: is it faster to refactor the database after loading all the data, or to create the intended tables with all the constraints, and then load the data? The former involves repeatedly scanning a huge table (close to 10^9 rows), while the latter would involve doing multiple queries per CSV row (e.g. has this action type been seen before? If not, add it to the actions table, get its ID, create a row in the main table with the correct action ID, etc.).
Right now each refactoring step is taking roughly a day or so, and the initial loading also takes about the same time.
In my experience, you want to get all the data you care about into a staging table in the database and go from there; after that, do as much set-based logic as you can, most likely via stored procedures. When you load into the staging table, don't have any indexes on it. Create the indexes after the data is loaded.
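A minimal sketch of that flow, assuming a staging table named staging loaded via COPY and an illustrative TEXT column action_type; all table and column names are assumptions:

    -- Lookup table for one of the repeated TEXT columns
    CREATE TABLE actions (
        id   serial PRIMARY KEY,
        name text UNIQUE NOT NULL
    );

    -- Populate it in a single set-based pass over the staging data
    INSERT INTO actions (name)
    SELECT DISTINCT action_type FROM staging;

    -- Build the normalized main table by joining back to the lookup table
    CREATE TABLE events AS
    SELECT s.event_time, a.id AS action_id
    FROM staging s
    JOIN actions a ON a.name = s.action_type;

    -- Add indexes only after the data is in place
    CREATE INDEX ON events (action_id);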
Check out this link for some tips: http://www.postgresql.org/docs/9.0/interactive/populate.html