Oracle NoSQL cloud service - OperationThrottlingException when ingesting data - nosql

I got an error saying my tenancy exceeded the DDL operation rate while writing data into an existing NoSQL table. My code has a step that creates the table if it doesn't exist, like:
CREATE TABLE IF NOT EXISTS " + tableName + "(id STRING) . . . etc.
Until now I was running small tests and didn't have any problems. But now that I have started testing a significant amount of data ingestion in sequence, I get this error:
"oracle.nosql.driver.OperationThrottlingException: Tenant exceeded DDL operation rate limit of 4 per minute: 5"
This DDL is not creating any table because the table already exists. Is this a bug, or does it work as expected?

This is expected behavior. Only 4 DDL operations (e.g. create/drop table or index) are allowed per minute for a given tenancy. This isn't related to data ingestion, just the management of tables themselves.
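
If the CREATE TABLE IF NOT EXISTS statement counts against the limit even when the table already exists (which the error in the question suggests), one way to stay under it is to issue the statement once per process instead of before every write. A minimal sketch of such a guard follows; run_ddl is a hypothetical callable standing in for whatever driver call executes the statement, and is not part of the Oracle NoSQL SDK:

```python
from typing import Callable, Set


class DdlOnce:
    """Issue the table-creation DDL at most once per table name in this process."""

    def __init__(self, run_ddl: Callable[[str], None]) -> None:
        # run_ddl is a hypothetical hook that executes a single DDL statement.
        self._run_ddl = run_ddl
        self._ensured: Set[str] = set()

    def ensure_table(self, table_name: str, ddl: str) -> bool:
        """Return True if the DDL was issued, False if it was skipped."""
        if table_name in self._ensured:
            return False
        self._run_ddl(ddl)
        self._ensured.add(table_name)
        return True
```

Calling ensure_table before each batch then costs one DDL operation per table for the lifetime of the process, rather than one per write.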

Related

Are there any alternatives to the LOAD command in DB2?

We receive 7 million records daily and append them to an existing target table. The target table is partitioned by date.
We are using the DB2 LOAD command to load data from one DB2 table (stage) into another DB2 table (target):
call SYSPROC.ADMIN_CMD('LOAD FROM (SELECT * FROM stage_table )
OF CURSOR INSERT INTO target_table NONRECOVERABLE INDEXING MODE INCREMENTAL ALLOW READ ACCESS')
According to the IBM documentation, ALLOW READ ACCESS is being deprecated, and the INGEST method is suggested instead:
https://www.ibm.com/docs/en/db2/10.1.0?topic=functionality-fp1-allow-read-access-parameter-load-command
Questions:
How can I use the INGEST method to load data from one DB2 table into another?
What other alternatives are there for loading millions of records with better performance?

PostgreSQL - Creating index on multiple partitioned tables

I am trying to create indexes on multiple (1000) partitioned tables. As I'm using Postgres 10.2, I have to do this for each partition separately, which means executing 1000 queries.
I have figured out how to do it, and it works in environments where the table sizes and the number of transactions are small. Below is the query for one of the tables (it is repeated for all the other tables: user_2, user_3, etc.):
CREATE INDEX IF NOT EXISTS user_1_idx_location_id
ON users.user_1 ( user_id, ( user_data->>'locationId') );
where user_data is a jsonb column.
This query does not work for large tables with a high number of transactions when I run it for all the tables at once. Error thrown:
ERROR: SQL State : 40P01
Error Code : 0
Message : ERROR: deadlock detected
Detail: Process 77999 waits for ShareLock on relation 1999264 of database 16311; blocked by process 77902.
Process 77902 waits for RowExclusiveLock on relation 1999077 of database 16311; blocked by process 77999
I am able to run it in small batches (of 25 each). I still hit the issue at times, but it succeeds when I retry once or twice. The smaller the batch, the lower the chance of a deadlock.
I think this happens because all the user tables (user_1, user_2, etc.) are linked to the parent table: user. I don't want to lock the entire table for the index creation (since in theory only one table is being modified at a time). Why does this happen, and is there any way around it, so that the index is created without deadlocks?
This worked:
CREATE INDEX CONCURRENTLY IF NOT EXISTS user_1_idx_location_id
ON users.user_1 ( user_id, ( user_data->>'locationId') );
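
Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so each statement has to be sent on its own (for example over a connection with autocommit enabled). A small sketch that generates the per-table statements so they can be executed one at a time; the contiguous user_N numbering is assumed from the question:

```python
from typing import Iterator


def index_statements(first: int, last: int) -> Iterator[str]:
    """Yield one CREATE INDEX CONCURRENTLY statement per users.user_N table."""
    for n in range(first, last + 1):
        yield (
            f"CREATE INDEX CONCURRENTLY IF NOT EXISTS user_{n}_idx_location_id "
            f"ON users.user_{n} (user_id, (user_data->>'locationId'));"
        )
```

If a concurrent build fails partway, PostgreSQL leaves an invalid index behind, and IF NOT EXISTS will then skip it on a rerun, so it is worth checking for invalid indexes and dropping them before retrying.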

invalid input syntax for type json aws dms postgres

I'm running a task that migrates all data from PostgreSQL 10.4 to RDS PostgreSQL 10.4.
I am not able to migrate tables that have a jsonb column.
After the error, the whole table gets suspended. The table contains only 449 rows.
I have set the following error policy, but the whole table is still suspended:
"DataErrorPolicy": "IGNORE_RECORD",
"DataTruncationErrorPolicy": "IGNORE_RECORD",
"DataErrorEscalationPolicy": "SUSPEND_TABLE",
"DataErrorEscalationCount": 1000,
My expectation is that the whole table should be transferred; it can ignore a record if its JSON is invalid.
I don't know why it gives the error 'invalid input syntax for type json'. I have checked all the JSON values and all of them are valid.
After more debugging, I found that this error is treated as a TABLE error, but why? That is why the table got suspended, since TableErrorPolicy is 'SUSPEND_TABLE'.
Why is this error treated as a table error instead of a record error?
Is the JSONB column type not supported by DMS, and is that why we get the error below?
Logs:
2020-09-01T12:10:04 https://forums.aws.amazon.com/I: Next table to load 'public'.'TEMP_TABLE' ID = 1, order = 0 (tasktablesmanager.c:1817)
2020-09-01T12:10:04 https://forums.aws.amazon.com/I: Start loading table 'public'.'TEMP_TABLE' (Id = 1) by subtask 1.
Start load timestamp 0005AE3F66381F0F (replicationtask_util.c:755)
2020-09-01T12:10:04 https://forums.aws.amazon.com/I: REPLICA IDENTITY information for table 'public'.'TEMP_TABLE': Query status='Success' Type='DEFAULT'
Description='Old values of the Primary Key columns (if any) will be captured.' (postgres_endpoint_unload.c:191)
2020-09-01T12:10:04 https://forums.aws.amazon.com/I: Unload finished for table 'public'.'TEMP_TABLE' (Id = 1). 449 rows sent. (streamcomponent.c:3485)
2020-09-01T12:10:04 https://forums.aws.amazon.com/I: Table 'public'.'TEMP_TABLE' contains LOB columns, change working mode to default mode (odbc_endpoint_imp.c:4775)
2020-09-01T12:10:04 https://forums.aws.amazon.com/I: Table 'public'.'TEMP_TABLE' has Non-Optimized Full LOB Support (odbc_endpoint_imp.c:4788)
2020-09-01T12:10:04 https://forums.aws.amazon.com/I: Load finished for table 'public'.'TEMP_TABLE' (Id = 1). 449 rows received. 0 rows skipped.
Volume transferred 190376. (streamcomponent.c:3770)
2020-09-01T12:10:04 https://forums.aws.amazon.com/E: RetCode: SQL_ERROR SqlState: 22P02 NativeError: 1 Message: ERROR: invalid input syntax for type json;
Error while executing the query https://forums.aws.amazon.com/ (ar_odbc_stmt.c:2648)
2020-09-01T12:10:04 https://forums.aws.amazon.com/W: Table 'public'.'TEMP_TABLE' (subtask 1 thread 1) is suspended (replicationtask.c:2471)
Edit: after more debugging, I found that this error is treated as a TABLE error, but why?
The JSONB column must be nullable in the target DB.
Note: in my case, the error disappeared after I made the JSONB column nullable.
As mentioned in the AWS documentation:
In this case, AWS DMS treats JSONB data as if it were a LOB column. During the full load phase of a migration, the target column must be nullable.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html#CHAP_Source.PostgreSQL.Prerequisites
https://aws.amazon.com/premiumsupport/knowledge-center/dms-error-null-value-column/
AWS DMS treats the JSON data type in PostgreSQL as a LOB data type column. This means that the LOB size limitation when you use limited LOB mode applies to JSON data. For example, suppose that limited LOB mode is set to 4,096 KB. In this case, any JSON data larger than 4,096 KB is truncated at the 4,096 KB limit and fails the validation test in PostgreSQL.
Reference: AWS DMS - JSON data types being truncated
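
To check whether truncation could explain the error, the stored JSON values can be measured against the limited LOB size before migrating. A rough sketch, assuming the limit applies to the UTF-8 byte length of the serialized value (how DMS measures it exactly is an assumption here); rows is a hypothetical list of (id, json_text) pairs fetched from the source table:

```python
import json
from typing import Iterable, List, Tuple


def find_problem_rows(rows: Iterable[Tuple[int, str]], limit_kb: int) -> List[Tuple[int, str]]:
    """Return (id, reason) for values that are invalid JSON or larger than limit_kb."""
    problems = []
    for row_id, text in rows:
        try:
            json.loads(text)
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError.
            problems.append((row_id, "invalid json"))
            continue
        if len(text.encode("utf-8")) > limit_kb * 1024:
            problems.append((row_id, "larger than LOB limit"))
    return problems
```

An empty result for the configured limit would suggest looking elsewhere, for example at the nullability of the target column.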
Update: You can tweak the error handling task settings to skip erroneous rows by setting DataErrorPolicy to IGNORE_RECORD. This setting determines the action AWS DMS takes when there is an error related to data processing at the record level.
Some examples of data processing errors include conversion errors, errors in transformation, and bad data. The default is LOG_ERROR. With IGNORE_RECORD, the task continues and the data for that record is ignored.
Reference: AWS DMS - Error handling task settings
You mentioned that you're migrating from PostgreSQL to PostgreSQL. Is there a specific reason to use AWS DMS?
AWS Docs: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html#CHAP_Source.PostgreSQL.Homogeneous
When you migrate from a database engine other than PostgreSQL to a PostgreSQL database, AWS DMS is almost always the best migration tool to use. But when you are migrating from a PostgreSQL database to a PostgreSQL database, PostgreSQL tools can be more effective.
...
We recommend that you use PostgreSQL database migration tools such as pg_dump under the following conditions:
You have a homogeneous migration, where you are migrating from a source PostgreSQL database to a target PostgreSQL database.
You are migrating an entire database.
The native tools allow you to migrate your data with minimal downtime.

SCD2 Implementation in Redshift using AWS Glue PySpark

I have a requirement to move data from S3 to Redshift. Currently I am using Glue for this.
Current Requirement:
Compare the primary key of each record in the Redshift table with the incoming file. If a match is found, close the old record's end date (update it from the high date to the current date) and insert the new one.
If no primary key match is found, insert the new record.
Implementation:
I have implemented it in Glue using PySpark with the following steps:
Created dataframes that cover three scenarios:
If a match is found, update the existing record's end date to the current date.
Insert the new record into the Redshift table where a PPK match is found.
Insert the new record into the Redshift table where a PPK match is not found.
Finally, union these three dataframes into one and write it to the Redshift table.
With this approach, both the old record (which has the high date value) and the new record (which was updated with the current date value) will be present.
Is there a way to delete the old record with the high date value using PySpark? Please advise.
We have successfully implemented the desired functionality using AWS RDS (PostgreSQL) as the database service and Glue as the ETL service. My suggestion: instead of computing the delta in Spark dataframes, it would be a far easier and more elegant solution to create stored procedures and call them from the PySpark Glue shell.
[for example: S3 bucket -> Staging table -> Target table]
In addition, if your logic executes in less than 10 minutes, I suggest using a Python shell job with external libraries such as psycopg2 / sqlalchemy for DB operations.
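
Whether the logic lives in a stored procedure or in Python, the key point is that the open record is updated in place rather than unioned with its closed copy. A plain-Python sketch of that rule, with hypothetical field names (pk, end_date) and an assumed high date of 9999-12-31; it is an illustration only, not Glue or Redshift code:

```python
from datetime import date
from typing import Dict, List

HIGH_DATE = date(9999, 12, 31)  # assumed "high date" that marks the open record


def apply_scd2(existing: List[Dict], incoming: List[Dict], today: date) -> List[Dict]:
    """Close open records whose key reappears, then append the new versions."""
    incoming_keys = {row["pk"] for row in incoming}
    result = []
    for row in existing:
        if row["pk"] in incoming_keys and row["end_date"] == HIGH_DATE:
            # Replace the open record with a closed copy instead of keeping both.
            row = {**row, "end_date": today}
        result.append(row)
    for row in incoming:
        # Every incoming record becomes the new open version for its key.
        result.append({**row, "end_date": HIGH_DATE})
    return result
```

In SQL terms this corresponds to an UPDATE of the matching open rows followed by an INSERT of the incoming rows, which is what a stored procedure on the staging table would do.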

Redshift time-series table loading questions

Redshift documentation identifies time-series tables as a best practice:
http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html
However, it doesn't address any of the following issues:
how many tables within a union-all view is reasonable - hundreds? (unanswered)
any method of writing to the union-all view and having redshift direct those inserts to the correct underlying tables? (Answer: no)
most effective method of loading underlying tables? Perhaps using firehose to insert into a staging table then periodically inserting those rows into appropriate table within union-all view? (unanswered)
any way to enable redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria? (Answer: No)
can redshift support dropping old tables, adding new tables and rebuilding union-all view within a transaction? (unanswered)
My situation:
100 million rows added daily, which will grow to 500 million in 3 years
12 month retention desired
Estimated 99% of all queries will hit the most recent 1-7 days
Data is written to existing table via kinesis firehose to s3 which then triggers a copy to redshift table.
My proposed solution:
Create a year of daily tables with a union all view, along with a dist_key of sensor_id (100,000+ unique values) and a sort_key of (timestamp, sensor_id).
Have firehose load into a staging table.
Create a separate process that once an hour queries the staging table to discover the dates of the data in it, then for each date performs an insert into the appropriate table, selecting from the staging table the rows whose timestamp falls on that table's date.
This hourly writer can probably wrap a table rename, multiple insert-selects, and table recreate in a transaction to be invisible to firehose.
Once a month drop old tables, create next month of tables, and rebuild view.
This union-all view maintenance can probably be wrapped in a transaction to avoid impacts to users.
Once a night run the vacuum analyzer.
EDITS: added notes identifying which issues have been answered, and added some detail to the proposed solution.
Your proposed process sounds quite good! While I can't answer all your questions, here is some information:
Any method of writing to the union-all view and having redshift direct those inserts to the correct underlying tables?
Views are read-only. It is not possible to write to a view, nor is it possible to insert data while expecting Redshift to send it to an appropriate table (e.g. a specific table for the given day).
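
Since Redshift will not route the rows, the loader has to pick the target table itself. A minimal sketch of that routing step, assuming each staging row carries a timestamp field and that daily tables follow a hypothetical sensor_data_YYYYMMDD naming scheme:

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List


def table_for(ts: datetime, prefix: str = "sensor_data_") -> str:
    """Name of the daily table for a timestamp (hypothetical naming scheme)."""
    return f"{prefix}{ts:%Y%m%d}"


def group_by_table(rows: Iterable[dict]) -> Dict[str, List[dict]]:
    """Bucket staging rows by the daily table they belong to."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        buckets[table_for(row["timestamp"])].append(row)
    return dict(buckets)
```

In practice the same grouping would more likely be done inside Redshift with one INSERT ... SELECT per date found in the staging table; the sketch only shows the decision the loader has to make.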
Any way to enable redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria?
Redshift will not exclude specific tables from the query, but it will avoid reading particular disk blocks through the use of Zone Maps. Each block of data written to disk is associated with a specific table and column. The block has a Zone Map, which indicates the minimum and maximum values of that field stored within the block.
If a query includes a WHERE clause, Redshift can skip blocks that do not contain relevant data. This is particularly powerful when used on the SORTKEY column, since similar ranges of data are grouped together.
Given that you are using a date as the SORTKEY, Redshift will read very few disk blocks if the query includes a WHERE clause based on that column. This is very similar to the idea of skipping tables, but it actually skips reading disk blocks.
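
The pruning idea can be illustrated with a toy model. This is a simplified sketch of the min/max check only, not a description of how Redshift is implemented internally:

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    min_value: int  # smallest sort-key value stored in the block
    max_value: int  # largest sort-key value stored in the block


def blocks_to_read(blocks: List[Block], lo: int, hi: int) -> List[Block]:
    """Keep only blocks whose [min, max] range overlaps the query range [lo, hi]."""
    return [b for b in blocks if b.max_value >= lo and b.min_value <= hi]
```

When the data is sorted on the filtered column, each block covers a narrow, mostly non-overlapping range, so a query over a few days touches only a few blocks. With unsorted data, most blocks span a wide range and few can be skipped.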