BigQuery - how to determine Data Availability - streaming

In regards to: https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
What's the best way to programatically determine if a table's data is available after streaming?
I am getting unexpected results trying to fetch Rows and TotalRows with the following apis: Jobs.Query, Jobs.GetQueryResults, Tables.Get, Tabledata.List
Thanks.

You can tell whether data is flushed on the table by executing the Tables.Get() API and looking at the streamingBuffer.oldestEntryTime value. This can be considered a high-water mark of data that has been flushed out of the buffer.
Any data before this timestamp should be available for copy, export, and list operations.
Also, I should clarify that data in the table is available for query immediately after streaming. It only is unavailable to table copy, export, and tabledata.list() operations. Yes, this is confusing, but yes, we're also working on addressing the problem.
For tables that haven't been streamed to before or recently, there is a warmup period where new streaming data won't show up.
See https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability for more information.

Related

Is it possible to query the diff between two Apache Iceberg snapshots?

I have two snapshots in my Iceberg history table, and I want to be able to see the difference between them, or at least with columns/ rows that have been affected on the last snapshot. Is there an easy way of getting this information?
You can use the java api to get the incremental change log between two snapshot id in a table.
table
.newIncrementalChangelogScan()
.fromSnapshotExclusive(startSnapshotId)
.toSnapshot(toSnapshot)
.caseSensitive(caseSensitive)
.filter(filterExpression())
.project(expectedSchema)
.planTasks();
It will get the full change log.
If you just want to query incremental data, an easier way is to use spark or flink:
spark.read()
.format("iceberg")
.option("start-snapshot-id", "10963874102873")
.option("end-snapshot-id", "63874143573109")
.load("path/to/table")
Currently gets only the data from append operation. Cannot support replace, overwrite, delete operations.
Enjoy yourself.

How to process and insert millions of MongoDB records into Postgres using Talend Open Studio

I need to process millions of records coming from MongoDb and put a ETL pipeline to insert that data into a PostgreSQL database. However, in all the methods I've tried, I keep getting the out memory heap space exception. Here's what I've already tried -
Tried connecting to MongoDB using tMongoDBInput and put a tMap to process the records and output them using a connection to PostgreSQL. tMap could not handle it.
Tried to load the data into a JSON file and then read from the file to PostgreSQL. Data got loaded into JSON file but from there on got the same memory exception.
Tried increasing the RAM for the job in the settings and tried the above two methods again, still no change.
I specifically wanted to know if there's any way to stream this data or process it in batches to counter the memory issue.
Also, I know that there are some components dealing with BulkDataLoad. Could anyone please confirm whether it would be helpful here since I want to process the records before inserting and if yes, point me to the right kind of documentation to get that set up.
Thanks in advance!
As you already tried all the possibilities the only way that I can see to do this requirement is breaking done the job into multiple sub-jobs or going with incremental load based on key columns or date columns, Considering this as a one-time activity for now.
Please let me know if it helps.

Saving JDBC db data as shared state Spark

I have an MSSQL table as a data source and I would like to save some kind of the processing offset in the form of the timestamp (it is one of the table's columns). So it would be possible to process the data from the latest offset. I would like to save as some kind of shared state between Spark sessions. I have researched shared state in Spark session, however, I did not find the way to store this offset in the shared state. So is it possible to use existing Spark constructs to perform this task?
As far as I know there is no official built-in feature supporting passing data between sessions in Spark. As alternative I would consider the following options/suggestions:
First the offset column must be an indexed field in MSSQL in order to be able to query it fast.
If there is already an in-memory (i.e Redis, Apache Ignite) system installed and used by your project I would store there the offset.
I wouldn't use a message queue system such as Kafka because once you consume one message you will need to resend it therefore that would't make sense.
As solution I would prefer to save it in the filesystem or in Hive even if it would add extra overhead since you will have only one value in that table. In the case of the filesystem of course the performance would be much better.
Let me know if further information is needed

An alternative design to insert/update of talend

I have a requirement in Talend where in I have to update/insert rows from the source table to the destination table. The source and destination tables are identical. The source gets refreshed by a business process and need to update/insert these results in the destination table.
I had designed for the 'insert or update' in tmap and tmysqloutput. However, the job turns out to be super slow
As an alternative to the above solution I am trying to do design the insert and update separately.In order to do this, I was wanting to hash the source rows as the number of rows would be usually less.
So, my question I will hash the input rows but when I join them with the destination rows in tmap should I hash the destination rows as well? Or should I use the destination rows as it is and then join them?
Any suggestions on the job design here?
Thanks
Rathi
If you are using the same database, you should not use ETL loading techniques but ELT loading so that all processing will happen in the database. Talend offers a few ELT components which are a bit different to use but very helpful for this case. I've had things to speed up by multiple magnitudes using only those components.
It is still a good idea to use an indexed hashed field both in the source and the target, which is done in a same way in loading Satellites in the Data Vault 2.0 model.
Alternatively, if you have direct access to the source table database, you could consider adding triggers for C(R)UD scenarios. Doing this, every action on the source database could be reflected in your database immediately. Remember though that you might need to think about a buffer table ("staging") where you could store your changes so that you are able to ingest fast, process later. In this table only the changed rows and the change type (create, update, delete) would be present for you to process. This decouples loading and processing which can be helpful if there will be a problem with loading or processing later on.
Yes i believe that you should use hash component for destination table as well.
Because than your processing (lookup) will be very fast as its happening in memory
If not than lookup load may take more time.

Using Kafka for Data Integration with Updates & Deletes

So a little background - we have a large number of data sources ranging from RDBMS's to S3 files. We would like to synchronize and integrate this data with other various data warehouses, databases, etc.
At first, this seemed like the canonical model for Kafka. We would like to stream the data changes through Kafka to the data output sources. In our test case we are capturing the changes with Oracle Golden Gate and successfully pushing the changes to a Kafka queue. However, pushing these changes through to the data output source has proven challenging.
I realize that this would work very well if we were just adding new data to the Kafka topics and queues. We could cache the changes and write the changes to the various data output sources. However this is not the case. We will be updating, deleting, modifying partitions, etc. The logic for handling this seems to be much more complicated.
We tried using staging tables and joins to update/delete the data but I feel that would become quite unwieldy quickly.
This comes to my question - are there any different approaches we could go about handling these operations? Or should we totally move in a different direction?
Any suggestions/help is much appreciated. Thank you!
There are 3 approaches you can take:
Full dump load
Incremental dump load
Binlog replication
Full dump load
Periodically, dump your RDBMS data source table into a file, and load that into the datawarehouse, replacing the previous version. This approach is mostly useful for small tables, but is very simple to implement, and supports updates and deletes to the data easily.
Incremental dump load
Periodically, get the records that changed since your last query, and send them to be loaded to the data warehouse. Something along the lines of
SELECT *
FROM my_table
WHERE last_update > #{last_import}
This approach is slightly more complex to implement, because you have to maintain the state ("last_import" in the snippet above), and it does not support deletes. It can be extended to support deletes, but that makes it more complicated. Another disadvantage of this approach that it requires your tables to have a last_update column.
Binlog replication
Write a program that continuously listens to the binlog of your RDBMS and sends these updates to be loaded to an intermediate table in the data warehouse, containing the updated values of the row, and whether it is a delete operation or update/create. Then write a query that periodically consolidates these updates to create a table that mirrors the original table. The idea behind this consolidation process is to select, for each id, the last (most advanced) version as seen in all the updates, or in the previous version of the consolidated table.
This approach is slightly more complex to implement, but allows achieving high performance even on large tables and supports updates and deletes.
Kafka is relevant to this approach in that it can be used as a pipeline for the row updates between the binlog listener and the loading to the data warehouse intermediate table.
You can read more about these different replication approaches in this blog post.
Disclosure: I work in Alooma (a co-worker wrote the blog post linked above, and we provide data-pipelines as a service, solving problems like this).