BigQuery streaming insert using template tables: data availability issue

We have been using BigQuery for over a year now with no issues. We load data as batch jobs every few hours and it is usually available almost instantly.
We just started experimenting with streaming inserts using template tables. In our first test we saw no errors and the data showed up instantly. The test created approximately 120 tables. A simple SELECT COUNT (via the web UI) across the tables returned the expected total of ~8,000 rows. After a couple more hours of streaming, the total dropped to ~1,400 rows.
Unsure what had happened, we dropped the dataset, recreated the template table and re-ran the streaming. This time the tables showed up right away but the data did not. On our third attempt the tables themselves did not show up for more than a couple of hours. We are now on the fourth attempt, and this time we streamed data belonging to only one table. The table showed up right away, but it has been over an hour and the data still has not appeared.
The streaming service uses the latest Java client library, inserts only one record at a time and logs the response. When no exception is thrown, the response is always {"kind":"bigquery#tableDataInsertAllResponse"} with no errors.
Any help trying to understand what is happening would be great. Thanks.
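As an aside, the kind of cross-table row count described above can be written as a single wildcard query in today's standard SQL; the project, dataset and table prefix below are placeholders:

-- count rows across every table generated from the template prefix
SELECT COUNT(*) AS total_rows
FROM `my-project.my_dataset.events_*`;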

Looks like we've identified the issue. It appears there is a race, in the template-tables path only, that causes our system to think the first chunk of data was deleted by user action (table truncation, which it obviously wasn't), so that chunk is dropped. We've identified the fix and will push it out shortly.
Thanks for letting us know!

Related

How to process and insert millions of MongoDB records into Postgres using Talend Open Studio

I need to process millions of records coming from MongoDB and build an ETL pipeline to insert that data into a PostgreSQL database. However, with every method I've tried, I keep getting an out-of-memory (Java heap space) exception. Here's what I've already tried:
Tried connecting to MongoDB using tMongoDBInput with a tMap to process the records and output them over a PostgreSQL connection. tMap could not handle it.
Tried loading the data into a JSON file and then reading from the file into PostgreSQL. The data loaded into the JSON file fine, but from there on I got the same memory exception.
Tried increasing the RAM allocated to the job in the settings and re-ran the two methods above; still no change.
I specifically want to know whether there is any way to stream this data or process it in batches to get around the memory issue.
Also, I know there are some components dealing with BulkDataLoad. Could anyone confirm whether that would be helpful here, given that I want to process the records before inserting, and if so, point me to the right documentation to get it set up?
Thanks in advance!
Since you have already tried all the obvious possibilities, the only way I can see to meet this requirement is to break the job down into multiple sub-jobs, or to go with an incremental load based on key or date columns, treating this as a one-time activity for now.
Please let me know if it helps.

DB2 Tables Not Loading when run in Batch

I have been working on a reporting database in DB2 for a month or so, and I have it set up to a pretty decent approximation of what I want. I am, however, noticing small inconsistencies that I have not been able to work out.
Less important, but still annoying:
1) Users claim it takes two login attempts to connect: the first always fails, the second succeeds. (Is there a recommendation for what to check for this?)
More importantly:
2) Whenever I want to refresh the data (which will be nightly), I have a script that drops and then recreates all of the tables. There are 66 tables, each ranging from tens of records to just under 100,000 records. The data is not massive, and the script takes about 2 minutes to run all 66 tables.
The issue is that once it reports completion, there are usually at least 3-4 tables that did not load any data. So the table is dropped and recreated, but is left empty. The log shows that the command completed successfully, and if I run the statements independently the tables populate just fine.
If it helps, 95% of the commands are just CAST functions.
While I am sure I am not doing this the recommended way, is there a reason why a number of my tables are not populating? Are the commands executing too fast? Should I delay the CREATE after the DROP?
(This is DB2 Express-C 11.1 on Windows 2012 R2; the source DB is remote.)
Example of my SQL:
DROP TABLE TEST.TIMESHEET;
CREATE TABLE TEST.TIMESHEET AS (
  SELECT NAME00, CAST(TIMESHEET_ID AS INTEGER(34)) TIMESHEET_ID ....
  .. (for 5-50 more columns)
  FROM REMOTE_DB.TIMESHEET
) WITH DATA;
It is possible to configure DB2 to tolerate certain SQL errors in nested table expressions.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.data.fluidquery.doc/topics/iiyfqetnint.html
When the federated server encounters an allowable error, the server allows the error and continues processing the remainder of the query rather than returning an error for the entire query. The result set that the federated server returns can be a partial or an empty result.
However, I assume that your REMOTE_DB.TIMESHEET is simply a nickname, and not a view with nested table expressions, and so any errors when pulling data from the source should be surfaced by DB2. Taking a look at the db2diag.log is likely the way to go - you might even be hitting a Db2 issue.
It might be useful to change your script to TRUNCATE and INSERT into your local tables and see if that helps avoid the issue.
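A minimal sketch of that variant, reusing the table names from the question (note that TRUNCATE must be the first statement in its unit of work, and the cast and column list are abbreviated here):

-- keep the table definition, just reload its contents
TRUNCATE TABLE TEST.TIMESHEET IMMEDIATE;

INSERT INTO TEST.TIMESHEET
SELECT NAME00,
       CAST(TIMESHEET_ID AS INTEGER) AS TIMESHEET_ID
       -- ... remaining columns ...
FROM REMOTE_DB.TIMESHEET;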
As you say, you are maybe not doing things the most efficient way. You could consider using cache tables to take a periodic copy of your remote data: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.data.fluidquery.doc/topics/iiyvfed_tuning_cachetbls.html
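If your setup allows materialized query tables over the nickname, the cache-table idea can be sketched roughly like this; the names are reused from the question and the exact options depend on your federation configuration, so treat it as a sketch rather than a drop-in:

-- deferred-refresh MQT holding a local copy of the remote data
CREATE TABLE TEST.TIMESHEET_CACHE AS (
  SELECT * FROM REMOTE_DB.TIMESHEET
) DATA INITIALLY DEFERRED REFRESH DEFERRED;

-- re-pull the remote data on whatever schedule suits (e.g. nightly)
REFRESH TABLE TEST.TIMESHEET_CACHE;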

Reading newest rows from an updated database table: AnyLogic 8

In my project I have to keep inserting new rows into a table based on some logic. After that, each time an event is triggered, I want the rows of the updated table to be fetched.
But the problem is that the new rows aren't accessible. The table only appears updated after I close the current simulation. A similar case was posted last year, but the answer wasn't clear, and due to my low reputation score I am unable to comment on it. Does anyone know whether AnyLogic 8.1.0 PLE supports reading newly updated database table records at runtime, or is there some other workable solution?
This works correctly in AnyLogic (at least in the latest 8.2.3 version) so I suspect there is another problem with your model.
I just tested it:
set up a simple 2-column database table;
list its contents (via query) at model startup;
update values in all rows (and add a bunch of rows) via a time 1 event;
list its contents (via query) via a time 2 event.
All the new and updated rows show correctly (including when viewing the table in AnyLogic, even when I do this during the simulation, pausing it just after the changes).
Note that, if you're checking the database contents via the AnyLogic client, you need to close/reopen the table to see the changes if you were already in it when starting the run. This view does auto-update when you close the experiment, so I suspect that is what you were seeing. Basically, the rows had been added (and will be there when/if you query them later in the model) but the table in the AnyLogic client only shows the changes when closing/reopening it or when the experiment is closed.
Since you used the SQL syntax (rather than the QueryDSL alternative syntax) to do your inserts, I also checked with both options (and everything works the same in either case).
The table is always updated after i close the current simulation
Do you mean when you close the experiment?
It might help if you can show the logic/syntax you are using for your database inserts and your queries.
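For reference, the SQL-syntax form of an insert and a later read is just plain SQL passed to the built-in database; the table and column names below are made up for illustration:

-- executed from model logic whenever new data is produced
INSERT INTO my_table (item_id, item_value) VALUES (42, 3.14);

-- executed from the event that should see the latest rows
SELECT item_id, item_value FROM my_table ORDER BY item_id;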

SQL Query slow during batch update of table

I have a PostgreSQL table with about 250K records. It gets updated a few times an hour; however, each update deletes the entire table's contents and inserts new records (a batch job).
I don't have much control over that process. While the transaction is deleting/reloading, queries on the table basically lock/hang until the job finishes. The job takes about a minute to run. We have real-time users looking at this data (spatial data on a map with a time slider), and they notice the missing records very easily.
Is there anything that can be done about these roughly 60-second query times during the update? I've thought about loading into a second table, dropping the original and renaming the second table to the original name, but that introduces more chance of error. Are there any settings that will just grab the data as-is and not necessarily look for a consistent view of it?
Basically just looking for ideas on how to handle this situation.
I'm running PostgreSQL 9.3.1.
Thanks
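For what it's worth, the staging-table swap mentioned above can be arranged so that only the final rename takes a heavy lock; a rough sketch with placeholder table names (views, foreign keys and grants on the old table would need recreating):

-- build the fresh copy without touching the live table
CREATE TABLE positions_new (LIKE positions INCLUDING ALL);
-- ... bulk-load the new batch into positions_new (COPY / INSERT) ...

-- swap in one short transaction; readers block only for this moment
BEGIN;
DROP TABLE positions;
ALTER TABLE positions_new RENAME TO positions;
COMMIT;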

PostgreSQL INSERT - auto-commit mode vs non auto-commit mode

I'm new to PostgreSQL and still learning a lot as I go. My company is using PostgreSQL and we are populating the database with tons of data. The data we collect is quite bulky in nature and is derived from certain types of video footage. For example, data related to about 15 minutes' worth of video took me about 2 days to ingest into the database.
My problem is that I have data sets relating to hours' worth of video, which would take weeks to ingest into the database. I was told that part of the reason this takes so long is that PostgreSQL has auto-commit set to true by default, and committing transactions takes a lot of time/resources. I was also told that I could turn auto-commit off, which would speed the process up tremendously. However, my concern is that multiple users are going to be populating this database. If I change the program to commit only every 10 seconds, say, and two people are populating the same table, the first person gets an ID, and while he is on, say, record 7, the second person inserting into the same table is given the same ID key; once the first person commits his changes, the second person's ID key will already be used, throwing an error.
So what is the best way to insert data into a PostgreSQL database when multiple people are ingesting data at the same time? Is there a way to avoid handing out the same ID key to multiple users when inserting data this way?
If the IDs are coming from the serial type or a PostgreSQL sequence (which is used by the serial type), then you never have to worry about two users getting the same ID from the sequence. It simply isn't possible. The nextval() function only ever hands out a given ID a single time.
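A minimal illustration of that guarantee, using a made-up table; each session gets its own ID from the sequence even before anyone commits:

-- id is backed by a sequence created automatically for the serial column
CREATE TABLE video_frames (
    id   serial PRIMARY KEY,
    data text
);

-- session A (auto-commit off, transaction still open)
BEGIN;
INSERT INTO video_frames (data) VALUES ('frame 1') RETURNING id;  -- e.g. 1

-- session B, started before session A commits
BEGIN;
INSERT INTO video_frames (data) VALUES ('frame 2') RETURNING id;  -- e.g. 2

-- nextval() is never rolled back, so IDs may have gaps but never collide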