Is there a way to update side inputs in Apache Beam? - apache-beam

I am developing a data transformation pipeline in Apache Beam, where I need a lookup table to help transform each incoming record.
I can pass in the lookup table as a side input, but the caveat is that an incoming record could update the lookup table I am using.
Is there a way to update the lookup table and then broadcast the update to every other worker?
Update: One possibility is to use a data-driven trigger to signal the end of the current window when an incoming record updates the side input. Does the side input get refreshed automatically at the start of the next window? In my case I am retrieving the side input from an external source.

If the lookup table is related to the key and window, you can try using state:
https://beam.apache.org/blog/2017/02/13/stateful-processing.html
Otherwise you may need external data storage, such as a database or an in-memory cache. Just be aware that DoFns are serializable, so opening and closing connections needs to be done carefully.
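
A minimal sketch of that caution, written in plain Python rather than against the Beam API. The class name `LookupFn`, the `lookup` table, and the SQLite backend are assumptions made for illustration. The point is that only picklable configuration lives on the object, and the connection is opened after deserialization and closed explicitly. In Beam's Python SDK the same open/close logic would go in `DoFn.setup` and `DoFn.teardown`.

```python
import sqlite3


class LookupFn:
    """Stand-in for a DoFn: keeps only picklable config, opens its connection late."""

    def __init__(self, db_path):
        self.db_path = db_path  # plain string, safe to serialize
        self._conn = None       # live connection, must never be serialized

    def __getstate__(self):
        # Drop the live connection before the object is shipped to a worker.
        state = self.__dict__.copy()
        state["_conn"] = None
        return state

    def setup(self):
        # Runs once per instance, after deserialization on the worker.
        self._conn = sqlite3.connect(self.db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS lookup (k TEXT PRIMARY KEY, v TEXT)"
        )

    def process(self, record):
        # A record is (key, new_value); new_value is None for a pure lookup.
        key, new_value = record
        row = self._conn.execute(
            "SELECT v FROM lookup WHERE k = ?", (key,)
        ).fetchone()
        if new_value is not None:
            # This record updates the lookup table for later records.
            self._conn.execute(
                "INSERT OR REPLACE INTO lookup (k, v) VALUES (?, ?)",
                (key, new_value),
            )
            self._conn.commit()
        return key, (row[0] if row else None)

    def teardown(self):
        if self._conn is not None:
            self._conn.close()
            self._conn = None
```

For updates to be visible to every worker, `db_path` would have to point at a store that all workers share. An in-memory SQLite database is private to each connection.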

Related

Azure Data Factory - Copy Data Upsert only updating a single row at a time

I'm using Data Factory (well, Synapse pipelines) to ingest data from sources into a staging layer. I am using the Copy Data activity with UPSERT. However, I found the performance of incrementally loading large tables particularly slow, so I did some digging.
My incremental load brought in 193k new/modified records from the source. These get stored in the transient staging/landing table that the Copy Data activity creates in the database in the background. In this table it adds a column called BatchIdentifier, but the batch identifier value is different for every row.
Profiling the load, I can see individual statements issued for each BatchIdentifier, so effectively it's processing the incoming data row by row rather than using a batch process to do the same thing.
I tried setting the sink writeBatchSize property on the Copy Data activity to 10k, but that doesn't make any difference.
Has anyone else come across this, or found a better way to perform a dynamic upsert without having to specify all the columns in advance (which I'm really hoping to avoid)?
This is the SQL statement issued 193k times on my load as an example.
It checks whether the record exists in the target table; if so it performs an update, otherwise it performs an insert. The logic makes sense, but it's performing this row by row when it could just be done in bulk.
Is your primary key definition in the source the same as in the sink?
I just ran into this same behavior when the source and destination tables used different key columns.
It also appears ADF/Synapse does not use MERGE for upserts, but its own IF EXISTS THEN UPDATE ELSE INSERT logic, so there may be something behind the scenes making it select single rows for those BatchId executions.
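
To illustrate the difference between the row-by-row pattern and a bulk upsert, here is a rough sketch that applies a whole staging table with two set-based statements. It uses SQLite only so that it runs anywhere. The table names `target` and `staging` and the columns `id` and `val` are hypothetical, and the T-SQL you would run against SQL Server or Synapse would differ in syntax.

```python
import sqlite3


def bulk_upsert(conn):
    """Apply every staged row to target using two set-based statements."""
    with conn:  # both statements commit together
        # Update rows that already exist in the target.
        conn.execute(
            """
            UPDATE target
            SET val = (SELECT s.val FROM staging AS s WHERE s.id = target.id)
            WHERE id IN (SELECT id FROM staging)
            """
        )
        # Insert rows that are new to the target.
        conn.execute(
            """
            INSERT INTO target (id, val)
            SELECT s.id, s.val FROM staging AS s
            WHERE NOT EXISTS (SELECT 1 FROM target AS t WHERE t.id = s.id)
            """
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "old"), (2, "keep")])
conn.executemany("INSERT INTO staging VALUES (?, ?)", [(1, "new"), (3, "added")])

bulk_upsert(conn)
rows = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
print(rows)
```

The number of statements stays at two no matter how many rows are staged, which is the property the row-by-row IF EXISTS logic lacks.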

Kafka Connect and Custom Queries

I'm interested in using the Kafka Connect JDBC source connector to publish to Kafka when an invoice gets created. On the source side, the data is split across two tables, Invoice and InvoiceLine.
Is this possible using custom queries? What would the query look like?
Also, since it's polling, could what gets published to the topic contain one or more invoices?
Thanks
Yes, you can use custom queries. From the docs:
Custom Query: The source connector supports using custom queries instead of copying whole tables. With a custom query, one of the other update automatic update modes can be used as long as the necessary WHERE clause can be correctly appended to the query. Alternatively, the specified query may handle filtering to new updates itself; however, note that no offset tracking will be performed (unlike the automatic modes where incrementing and/or timestamp column values are recorded for each record), so the query must track offsets itself.
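
As a rough sketch of what such a configuration might look like, the snippet below builds the JSON body for a JDBC source connector in query mode. The column names (`InvoiceId`, `CustomerId`, `InvoiceLineId`, `Amount`), the connection URL, the connector name, and the topic name are all assumptions, since the question does not give the schema. The join is wrapped in a subquery so that the connector can append its own WHERE clause on the incrementing column.

```python
import json

# Hypothetical schema: Invoice(InvoiceId, CustomerId),
# InvoiceLine(InvoiceLineId, InvoiceId, Amount).
# The outer SELECT has no WHERE clause, so one can be appended safely.
QUERY = (
    "SELECT * FROM ("
    "SELECT l.InvoiceLineId, i.InvoiceId, i.CustomerId, l.Amount "
    "FROM Invoice i JOIN InvoiceLine l ON l.InvoiceId = i.InvoiceId"
    ") AS invoice_lines"
)

connector = {
    "name": "invoice-source",  # assumed connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:sqlserver://db-host:1433;databaseName=sales",  # assumed
        "mode": "incrementing",
        "incrementing.column.name": "InvoiceLineId",
        "query": QUERY,
        "topic.prefix": "invoices",  # in query mode this is used as the topic name
        "poll.interval.ms": "5000",
    },
}

# This JSON document is what would be submitted to the Kafka Connect REST API.
print(json.dumps(connector, indent=2))
```

With a setup like this, each row returned by a poll becomes one record, so a single poll can publish lines belonging to one or more invoices.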

Saving JDBC db data as shared state Spark

I have an MSSQL table as a data source and I would like to save some kind of processing offset in the form of a timestamp (it is one of the table's columns), so that it is possible to process the data from the latest offset. I would like to save it as some kind of shared state between Spark sessions. I have researched shared state in Spark sessions, but I did not find a way to store this offset there. Is it possible to use existing Spark constructs to perform this task?
As far as I know there is no official built-in feature for passing data between sessions in Spark. As an alternative I would consider the following options/suggestions:
First, the offset column must be an indexed field in MSSQL so that it can be queried quickly.
If an in-memory system (e.g. Redis or Apache Ignite) is already installed and used by your project, I would store the offset there.
I wouldn't use a message queue system such as Kafka, because once you consume the message you would need to resend it, so that wouldn't make sense.
As a solution I would prefer to save it in the filesystem or in Hive, even though Hive would add extra overhead since you will have only one value in that table. With the filesystem the performance would of course be much better.
Let me know if further information is needed
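
A minimal sketch of the filesystem option, assuming the offset is kept as a small JSON file. The function names, the file layout, and the default timestamp are all hypothetical. The string returned by `incremental_query` is a parenthesized subquery with an alias, which is the form Spark's JDBC reader accepts for its `dbtable` option.

```python
import json
import os
import tempfile


def load_offset(path, default="1900-01-01 00:00:00"):
    """Return the last processed timestamp, or a default on the first run."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)["last_ts"]
    except FileNotFoundError:
        return default


def save_offset(path, last_ts):
    """Write the offset atomically so a crash cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"last_ts": last_ts}, f)
    os.replace(tmp, path)


def incremental_query(table, ts_column, last_ts):
    """Build the subquery that selects only rows newer than the stored offset."""
    # last_ts comes from our own offset file, not from user input.
    return f"(SELECT * FROM {table} WHERE {ts_column} > '{last_ts}') AS src"
```

A job would call `load_offset` before reading, pass the result of `incremental_query` to the JDBC reader, and call `save_offset` with the maximum timestamp it saw only after the batch has been processed successfully.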

An alternative design to insert/update of talend

I have a requirement in Talend where I have to update/insert rows from the source table into the destination table. The source and destination tables are identical. The source gets refreshed by a business process, and I need to update/insert these results in the destination table.
I had designed this with 'insert or update' in tMap and tMysqlOutput. However, the job turns out to be super slow.
As an alternative to the above solution I am trying to design the insert and update separately. To do this, I wanted to hash the source rows, as the number of rows would usually be smaller.
So, my question: I will hash the input rows, but when I join them with the destination rows in tMap, should I hash the destination rows as well? Or should I use the destination rows as they are and then join them?
Any suggestions on the job design here?
Thanks
Rathi
If you are using the same database, you should not use ETL loading techniques but ELT loading, so that all processing happens in the database. Talend offers a few ELT components which are a bit different to use but very helpful for this case. I've seen jobs speed up by several orders of magnitude using only those components.
It is still a good idea to use an indexed hash field in both the source and the target, which is done in the same way when loading Satellites in the Data Vault 2.0 model.
Alternatively, if you have direct access to the source table's database, you could consider adding triggers for C(R)UD scenarios. That way, every action on the source database could be reflected in your database immediately. Remember though that you might need a buffer ("staging") table where you store your changes, so that you can ingest fast and process later. This table would hold only the changed rows and the change type (create, update, delete) for you to process. This decouples loading from processing, which can be helpful if there is a problem with loading or processing later on.
Yes, I believe you should use the hash component for the destination table as well.
Then your processing (lookup) will be very fast, as it happens in memory.
If not, the lookup load may take more time.
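
The idea of hashing both sides and splitting inserts from updates can be sketched outside Talend as follows. This is only an illustration of the technique, not Talend component code, and the function names and row layout (dicts with a key column) are assumptions.

```python
import hashlib


def row_hash(row, columns):
    """Hash the compared columns of a row into one fixed-size value."""
    payload = "\x1f".join(str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_changes(source_rows, dest_rows, key, columns):
    """Return (inserts, updates) by comparing source rows with the destination."""
    # In-memory lookup: destination key -> hash of its compared columns.
    dest_hashes = {r[key]: row_hash(r, columns) for r in dest_rows}
    inserts, updates = [], []
    for row in source_rows:
        existing = dest_hashes.get(row[key])
        if existing is None:
            inserts.append(row)   # key not in destination yet
        elif existing != row_hash(row, columns):
            updates.append(row)   # key exists but content changed
        # identical rows are skipped entirely
    return inserts, updates


source = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "changed"},
    {"id": 3, "name": "new"},
]
dest = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
inserts, updates = split_changes(source, dest, key="id", columns=["name"])
```

Unchanged rows never reach the database, and the two result lists can be written with a plain insert and a plain update.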

Detecting new data in a replicated MongoDB slave

Background: I have a very specific use case where I have an existing MongoDB that I need to interact with via reads, but I have to ensure that the data can never be modified. However I also need to trigger some form of event when new data comes in so I can do post processing on it.
The current plan is to use replication to get the data onto a slave for the read processing. However, for my purposes I only care about new data in various document stores. Part of the issue is that I cannot modify the existing MongoDB and not all the data is timestamped, so there is no incremental way to handle this that I can think of.
Question: Is it possible to fire an event from a slave that would tell me I have new data and what it is? I will only have access to the slave DB, as the master will be locked.
I may have some limited ability to change the master DB, but I cannot expect to change the document structure at all.
Instead of using a master/slave configuration, you could use a replica set with a priority 0 secondary (so that it can never become primary).
You can tail the oplog on that secondary looking for insert operations.
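
A small sketch of the filtering step, assuming the oplog entries are already available as dictionaries. Oplog entries carry an `op` field (`"i"` for inserts), an `ns` field with the `database.collection` namespace, and an `o` field with the document. The function name and the sample data are hypothetical. In a live setup the entries would come from a tailable cursor over the `local.oplog.rs` collection on the secondary.

```python
def inserted_documents(oplog_entries, namespace):
    """Yield the documents of insert operations for one namespace."""
    for entry in oplog_entries:
        if entry.get("op") == "i" and entry.get("ns") == namespace:
            yield entry["o"]


# Hypothetical sample of oplog entries, in the order they were applied.
sample = [
    {"op": "i", "ns": "shop.orders", "o": {"_id": 1, "total": 10}},
    {"op": "u", "ns": "shop.orders", "o": {"$set": {"total": 12}}},
    {"op": "i", "ns": "shop.users", "o": {"_id": 7}},
    {"op": "i", "ns": "shop.orders", "o": {"_id": 2, "total": 5}},
]

new_orders = list(inserted_documents(sample, "shop.orders"))
```

Because this only reads the oplog, nothing on the primary or in the document structure has to change.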