Redshift Copy and auto-increment does not work - amazon-redshift

I am using Redshift's COPY command to load JSON data from S3.
The table definition is as follows:
CREATE TABLE my_raw
(
id BIGINT IDENTITY(1,1),
...
...
) diststyle even;
The COPY command I am using is as follows:
COPY my_raw FROM 's3://dev-usage/my/2015-01-22/my-usage-1421928858909-15499f6cc977435b96e610298919db26' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXX' json 's3://bemole-usage/schemas/json_schema' ;
I expected that any newly inserted id would always be greater than select max(id) from my_raw. That is clearly not the case.
If I issue the above COPY command twice, the first time the ids run from 1 to N even though that file only creates 114 records (that's a known issue with Redshift when it has multiple shards). The second time the ids are also between 1 and N, but they fill in numbers that were not used in the first copy.
See below for a demo:
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=#
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/my_json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
Thanks in advance.

The only solution I found to guarantee sequential ids based on insertion order is to maintain a pair of tables. The first is a stage table into which the items are inserted by the COPY command; the stage table has no id column at all.
Then there is a master table that is an exact replica of the stage table, except that it has an additional column for the ids. A job then fills the master table from the stage table using the ROW_NUMBER() function.
In practice, this means executing the following statement after each Redshift COPY is performed:
insert into master (id, result_code, ct_timestamp, ...)
select
    #{startIncrement} + row_number() over (order by ct_timestamp) as id,
    result_code,
    ...
from stage;
The ids are then guaranteed to be sequential/consecutive in the master table.
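For completeness, a minimal sketch of that post-COPY step, assuming the stage and master tables share columns named result_code and ct_timestamp and that the start increment is simply the current maximum id in master (all names here are illustrative, not from the original setup):
begin;
-- compute the next id range from the current maximum and append the staged rows
insert into master (id, result_code, ct_timestamp)
select
    (select coalesce(max(id), 0) from master)
        + row_number() over (order by ct_timestamp) as id,
    result_code,
    ct_timestamp
from stage;
-- empty the stage table inside the same transaction
-- (TRUNCATE would commit immediately in Redshift, so DELETE is used here)
delete from stage;
commit;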

I can't reproduce your problem, but it is interesting to look at how identity columns behave in conjunction with COPY. Here is a small summary:
Be aware that you can specify the columns (and their order) for a COPY command:
COPY my_table (col1, col2, col3) FROM 's3://...'
So if:
the EXPLICIT_IDS flag is NOT set,
no column list is given (as shown above),
and your file does not contain data for the IDENTITY column,
then the identity values in the table will be set automatically and monotonically, just as we all want.
From the documentation:
If an IDENTITY column is included in the column list, then EXPLICIT_IDS must also be specified; if an IDENTITY column is omitted, then EXPLICIT_IDS cannot be specified. If no column list is specified, the command behaves as if a complete, in-order column list was specified, with IDENTITY columns omitted if EXPLICIT_IDS was also not specified.
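To make the two cases concrete, here is a hedged sketch; the bucket, IAM role, and non-identity column names are placeholders, not taken from the question:
-- Case 1: let Redshift generate the IDENTITY values.
-- The id column is omitted from the column list and EXPLICIT_IDS is not set.
COPY my_raw (result_code, ct_timestamp)
FROM 's3://example-bucket/input/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/example-role'
JSON 's3://example-bucket/schemas/json_schema';

-- Case 2: load pre-generated ids from the file.
-- The id column is listed, which requires EXPLICIT_IDS.
COPY my_raw (id, result_code, ct_timestamp)
FROM 's3://example-bucket/input/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/example-role'
JSON 's3://example-bucket/schemas/json_schema'
EXPLICIT_IDS;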

Related

Using MERGE command in Upsolver

I would like to use the Upsolver MERGE command in my new transformations to populate S3/Athena and Snowflake tables. Since Snowflake supports upserts, when defining my transformation job, should I rely on the Snowflake functionality and use an Upsolver INSERT statement, or define an Upsolver MERGE transformation the same way I do it for Athena, i.e.
CREATE JOB my_job_upsert
START_FROM = BEGINNING
ADD_MISSING_COLUMNS = TRUE
RUN_INTERVAL = 1 MINUTE
AS MERGE INTO default_glue_catalog.upsolver_samples.test_upsert_with_merge AS target
/*
Use the SELECT statement below to choose your columns and perform the desired transformations.
In this example, we aggregate the sample orders data by customer and filter it to only include repeat purchasers.
*/
USING (SELECT field1 AS email,
COUNT(DISTINCT field2) AS count,
MIN(field3) AS min_number,
MAX(date) AS last_date
FROM default_glue_catalog.upsolver_samples.test_raw_data
WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
GROUP BY 1
HAVING COUNT(DISTINCT field2) > 1) source
ON (target.email = source.email)--primary key
WHEN MATCHED THEN REPLACE -- Update if primary keys match
WHEN NOT MATCHED THEN INSERT MAP_COLUMNS_BY_NAME; -- Insert if primary key is unique (new record)
It would be nice to know in general if MERGE command syntax is consistent across various target platforms.
I have already built the Athena transformation and it works as expected.
You can use the same approach you did for Athena. The Upsolver INSERT command will insert new keys (append), and if the table has a primary key defined, INSERT will update the existing keys (upsert) as its default behavior.
MERGE, by definition, is for upserts and can handle deletes as well. The syntax is consistent across all database/data warehouse/catalog targets.
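If you do go the INSERT route, a hedged sketch adapted from the MERGE job above might look like the following; it assumes the INSERT job accepts the same job options and that the target table has email defined as its primary key, so verify the exact syntax against the Upsolver documentation:
CREATE JOB my_job_insert
START_FROM = BEGINNING
ADD_MISSING_COLUMNS = TRUE
RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.test_upsert_with_merge MAP_COLUMNS_BY_NAME
/* With a primary key defined on the target table, this INSERT behaves as an
upsert: new keys are appended and existing keys are updated. */
SELECT field1 AS email,
COUNT(DISTINCT field2) AS count,
MIN(field3) AS min_number,
MAX(date) AS last_date
FROM default_glue_catalog.upsolver_samples.test_raw_data
WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
GROUP BY 1
HAVING COUNT(DISTINCT field2) > 1;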

Azure Data Factory LOOKUP Activity issues

I have the following pipeline with a range of activities (image not included here).
I keep getting the following error from my Lookup activity:
Failure happened on 'Source' side.
ErrorCode=SqlOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=A database operation failed with the following error: 'Invalid column name 'updated_at'.',
Source=,''Type=System.Data.SqlClient.SqlException,Message=Invalid column name 'updated_at'.,
Source=.Net SqlClient Data Provider,SqlErrorNumber=207,Class=16,ErrorCode=-2146232060,State=1,
Errors=[{Class=16,Number=207,State=1,Message=Invalid column name 'updated_at'.,},],'
I kind of know what the problem is: the Lookup isn't looping through the individual tables to find the column named 'updated_at'. But I don't understand why.
The Lookup activity 'Lookup New Watermark' has the following query:
SELECT MAX(updated_at) as NewWatermarkvalue FROM #{item().Table_Name}
The ForEach activity 'For Each Table' has the following value for Items:
#activity('Find SalesDB Tables').output.value
The Lookup activity 'Find SalesDB Tables' has the following query
SELECT QUOTENAME(table_schema)+'.'+QUOTENAME(table_name) AS Table_Name FROM information_Schema.tables WHERE table_name not in ('watermarktable', 'database_firewall_rules')
The only thing I can see that is wrong with the 'Lookup New Watermark' activity is that it's not looping through the tables. Can someone let me know what is needed?
Just to show that the column exists, I adjusted the connection (screenshots not included here), and the Lookup was then able to find the updated_at column on dbo.Products, but it couldn't locate the updated_at column on the other 4 tables.
Therefore, I'm suggesting the problem is that the Lookup activity isn't iterating over the tables automatically.
The error occurs when the following query is used on a table that does not have an updated_at column:
SELECT MAX(updated_at) as NewWatermarkvalue FROM #{item().Table_Name}
The Items field of the ForEach activity was given the value #activity('FindSalesDBTables').output.value (which returns a list of table names). Inside the ForEach, the above query will be executed as follows:
#first iteration
SELECT MAX(updated_at) as NewWatermarkvalue FROM <table_1>
#second iteration
SELECT MAX(updated_at) as NewWatermarkvalue FROM <table_2>
...
During this process, whenever the query runs against a table that does not have an updated_at column, it gives the same error. The following is a demonstration.
I created 2 tables (for demonstration) called t1 and t2.
create table t1(id int, updated_at int)
create table t2(id int, up int)
I used a Lookup activity to get the list of table names using the following query:
SELECT QUOTENAME(table_schema)+'.'+QUOTENAME(table_name) AS Table_Name FROM information_Schema.tables WHERE table_name not in ('watermarktable', 'database_firewall_rules','ipv6_database_firewall_rules')
Inside the ForEach activity (looping through #activity('lookup1').output.value), I tried the same query as given:
SELECT MAX(updated_at) as NewWatermarkvalue FROM #{item().Table_Name}
After debugging the pipeline, we can observe that it produces the same error.
For the iteration where the table is t1 (which has the updated_at column), the query succeeds.
For the iteration where the table is t2 (which does not have the updated_at column), it fails with the same error (debug screenshots not included here).
If you publish and run this pipeline, it will fail with the same error.
Therefore, check whether the updated_at column exists in the particular table (the current ForEach item), and only query it if it does.
Inside the ForEach, use a Lookup activity with the following query. It returns the length of the column in bytes if the column exists in the table, and null otherwise. Use this result together with an If Condition activity.
select COL_LENGTH('#{item().Table_Name}','updated_at') as column_exists
Use the following condition in the If Condition activity. If it evaluates to false, the table contains the updated_at column and we can work with it.
#equals(activity('check for column in table').output.firstRow['column_exists'],null)
The debug output for the t1 and t2 tables shows the same behavior (screenshots not included here).
You can then continue with the other required activities inside the False branch of the If Condition activity using the above process.
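As a quick illustration of the COL_LENGTH check against the demo tables above (assuming they live in the default dbo schema):
-- t1 has the column, so COL_LENGTH returns its size in bytes (4 for int)
select COL_LENGTH('dbo.t1', 'updated_at') as column_exists;
-- t2 does not have the column, so COL_LENGTH returns NULL
select COL_LENGTH('dbo.t2', 'updated_at') as column_exists;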

Using results from PostgreSQL query in FROM clause of another query

I have a table that's designed as follows.
master_table
id -> serial
timestamp -> timestamp without time zone
fk_slave_id -> integer
fk_id -> id of the table
fk_table1_id -> foreign key relationship with table1
...
fk_table30_id -> foreign key relationship with table30
Every time a new table is added, this table gets altered to include a new column to link. I've been told it was designed as such to allow for deletes in the tables to cascade in the master.
The issue I'm having is finding a proper solution to linking the master table to the other tables. I can do it programmatically using loops and such, but that would be incredibly inefficient.
Here's the query being used to grab the id of the table and the id of the row within that table.
SELECT fk_slave_id, concat(fk_table1_id,...,fk_table30_id) AS id
FROM master_table
ORDER BY id DESC
LIMIT 100;
The results are:
fk_slave_id | id
-------------+-----
30 | 678
25 | 677
29 | 676
1 | 675
15 | 674
9 | 673
The next step is to use this data to work out which table holds the required data. For example, data is required from table30 with id 678.
This is where I'm stuck. If I use WITH, it doesn't seem to accept the output in the FROM clause.
WITH items AS (
SELECT fk_slave_id, concat(fk_table1_id,...,fk_table30_id) AS id
FROM master_table
ORDER BY id DESC
LIMIT 100
)
SELECT data
FROM concat('table', items.fk_slave_id)
WHERE id = items.id;
This produces the following error.
ERROR: missing FROM-clause entry for table "items"
LINE x: FROM string_agg('table', items.fk_slave_id)
plpgsql is an option to use EXECUTE with format, but then I'd have to loop through each result and process it with EXECUTE.
Is there any way to achieve what I'm after using SQL or is it a matter of needing to do it programmatically?
Apologies for the bad title. I can't think of another way to word this question.
edit 1: Replaced rows with items
edit 2: Based on the responses it doesn't seem like this can be accomplished cleanly. I'll be resorting to creating an additional column and using triggers instead.
I don't think you can reference a dynamically named table like that in your FROM clause:
FROM concat('table', rows.fk_slave_id)
Have you tried building and executing that SQL from a stored procedure/function? You can create the SQL you want to execute as a string and then just EXECUTE it.
Take a look at this one:
PostgreSQL - Writing dynamic sql in stored procedure that returns a result set
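For illustration, a minimal PL/pgSQL sketch of that approach, assuming each tableN has an integer id and a text column named data; the function name, return columns, and default limit are made up for the example:
CREATE OR REPLACE FUNCTION fetch_latest_slave_data(p_limit int DEFAULT 100)
RETURNS TABLE (slave_id int, row_id bigint, data text)
LANGUAGE plpgsql AS $$
DECLARE
    rec record;
BEGIN
    FOR rec IN
        SELECT m.fk_slave_id,
               concat(m.fk_table1_id, /* ..., */ m.fk_table30_id)::bigint AS fk_id
        FROM master_table m
        ORDER BY m.id DESC
        LIMIT p_limit
    LOOP
        -- build "SELECT ... FROM tableN WHERE id = ..." dynamically and
        -- append its rows to the function's result set
        RETURN QUERY EXECUTE format(
            'SELECT %s::int, %s::bigint, data FROM %I WHERE id = %s',
            rec.fk_slave_id, rec.fk_id, 'table' || rec.fk_slave_id, rec.fk_id);
    END LOOP;
END;
$$;
-- usage: SELECT * FROM fetch_latest_slave_data(100);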

Cassandra CQL3 select row keys from table with compound primary key

I'm using Cassandra 1.2.7 with the official Java driver that uses CQL3.
Suppose a table created by
CREATE TABLE foo (
row int,
column int,
txt text,
PRIMARY KEY (row, column)
);
Then I'd like to perform the equivalent of SELECT DISTINCT row FROM foo.
As far as I understand, it should be possible to execute this query efficiently inside Cassandra's data model (given the way compound primary keys are implemented), as it would just query the 'raw' table.
I searched the CQL documentation but I didn't find any options to do that.
My backup plan is to create a separate table - something like
CREATE TABLE foo_rows (
row int,
PRIMARY KEY (row)
);
But this requires the hassle of keeping the two in sync: writing to foo_rows for every write to foo (also a performance penalty).
So is there any way to query for distinct row(partition) keys?
I'll give you the bad way to do this first. If you insert these rows:
insert into foo (row,column,txt) values (1,1,'First Insert');
insert into foo (row,column,txt) values (1,2,'Second Insert');
insert into foo (row,column,txt) values (2,1,'First Insert');
insert into foo (row,column,txt) values (2,2,'Second Insert');
Doing a
'select row from foo;'
will give you the following:
row
-----
1
1
2
2
Not distinct since it shows all possible combinations of row and column. To query to get one row value, you can add a column value:
select row from foo where column = 1;
But then you will get this warning:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Ok. Then with this:
select row from foo where column = 1 ALLOW FILTERING;
row
-----
1
2
Great. What I wanted. Let's not ignore that warning though. If you only have a small number of rows, say 10000, then this will work without a huge hit on performance. Now what if I have 1 billion? Depending on the number of nodes and the replication factor, your performance is going to take a serious hit. First, the query has to scan every possible row in the table (read full table scan) and then filter the unique values for the result set. In some cases, this query will just time out. Given that, probably not what you were looking for.
You mentioned that you were worried about a performance hit from inserting into multiple tables. Multiple-table inserts are a perfectly valid data modeling technique, and Cassandra can handle an enormous amount of writes. As for it being a pain to keep in sync, I don't know your exact application, but I can give general tips.
If you need a distinct scan, you need to think in terms of partition columns. This is what we call an index or query table. The important thing to consider in any Cassandra data model is the application's queries. If I were using IP addresses as the row key, I might create something like this to scan all the IP addresses I have in order.
CREATE TABLE ip_addresses (
first_quad int,
last_quads ascii,
PRIMARY KEY (first_quad, last_quads)
);
Now, to insert some rows in my 192.x.x.x address space:
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000002');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001255');
To get the distinct rows in the 192 space, I do this:
SELECT * FROM ip_addresses WHERE first_quad = 192;
first_quad | last_quads
------------+------------
192 | 000000001
192 | 000000002
192 | 000001001
192 | 000001255
To get every single address, you would just need to iterate over every possible row key from 0 to 255. In my example, I would expect the application to ask for specific ranges to keep things performant. Your application may have different needs, but hopefully you can see the pattern here.
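For illustration, the per-partition queries the application would drive might look like this (just a sketch; how the loop is implemented is up to the application):
-- one partition-scoped query per first quad, issued from application code
SELECT * FROM ip_addresses WHERE first_quad = 0;
SELECT * FROM ip_addresses WHERE first_quad = 1;
-- ... and so on, up to first_quad = 255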
According to the documentation, as of CQL version 3.1.1 Cassandra understands the DISTINCT modifier.
So you can now write
SELECT DISTINCT row FROM foo
#edofic
Partition (row) keys are used as a unique index to distinguish different rows in the storage engine, so by nature row keys are always distinct. You don't need to put DISTINCT in the SELECT clause.
Example
INSERT INTO foo(row,column,txt) VALUES (1,1,'1-1');
INSERT INTO foo(row,column,txt) VALUES (2,1,'2-1');
INSERT INTO foo(row,column,txt) VALUES (1,2,'1-2');
Then
SELECT row FROM foo
will return 2 values: 1 and 2
Below is how things are persisted in Cassandra
+----------+-------------------+------------------+
| row key | column1/value | column2/value |
+----------+-------------------+------------------+
| 1        | 1/'1-1'           | 2/'1-2'          |
| 2        | 1/'2-1'           |                  |
+----------+-------------------+------------------+

SQLite - a smart way to remove and add new objects

I have a table in my database, and I want each row to have a unique id and the rows to be numbered sequentially.
For example: I have 10 rows, each with an id, starting from 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" appears. Afterwards I add more data, but the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every id so that I can access my table at arbitrary positions.
Is there a way in SQLite to do this, or do I have to manage removing and adding data manually?
Thank you in advance,
Ilya.
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
UPDATE table_name SET id=id-1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
And delete where id=2, afterwards the table will be:
id name
1 First
2 Third
3 Fourth
This trigger can take a long time and has very poor scaling properties (it takes longer for each row you delete and each remaining row in the table). On my computer, deleting 15 rows at the beginning of a 1000 row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you rethink your design. In my opinion you're asking for trouble in the future (e.g. if you create another table and want to have relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in the order of id, just define this field using the PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
SQLite creates an index for the primary key field, so this query is fast.
I think you would also be interested in reading about the LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing in with.
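For example, assuming the hypothetical mytable above and a 1-based position variable, fetching the 3rd row in id order looks like this:
-- zero-based offset: 3rd row => OFFSET 3 - 1 = 2
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET 2;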
If you want to reclaim deleted row ids, the VACUUM command or the auto_vacuum pragma may be what you seek:
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum
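A brief, hedged sketch of both options; note that switching auto_vacuum on only takes effect after a VACUUM on an existing database, and that VACUUM may renumber rowids only for tables without an explicit INTEGER PRIMARY KEY:
-- one-off: rebuild the database file and reclaim free pages
VACUUM;
-- ongoing: let SQLite reclaim free pages automatically
PRAGMA auto_vacuum = FULL;
VACUUM; -- needed once so the new setting takes effect on an existing database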