Redshift Temp Table Identity column - amazon-redshift

My stored procedure includes the following code:
CREATE TEMPORARY TABLE #lala
(
    idx int IDENTITY(1,1),
    tablename nvarchar(128)
);
INSERT INTO #lala(tablename)
SELECT LEFT(tablename, LEN(tablename) - 3)
FROM SVV_EXTERNAL_TABLES
WHERE schemaname = 'spectrum'
AND tablename LIKE '%_v2';
I'm then calling it like this:
BEGIN;
CALL myschema.make_union_views('spectrum_views','spectrum','mycursor');
FETCH ALL FROM mycursor;
COMMIT;
At first it was running successfully.
Then it began falling over, so while debugging I listed the contents of #lala.
I am confused as to how this has come about - why is the idx column not sequential?
I hope that someone can shed some light on what might be happening.

This is by design. Redshift is a cluster and as such communications between parts of the cluster are expensive. Redshift ensures the uniqueness of identity columns but NOT sequentiality. Per the CREATE TABLE documentation (https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html):
When you load the table using an INSERT INTO [tablename] SELECT * FROM
or COPY statement, the data is loaded in parallel and distributed to
the node slices. To be sure that the identity values are unique,
Amazon Redshift skips a number of values when creating the identity
values.
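If the procedure genuinely needs a gap-free sequence, one workaround (a sketch, not part of the quoted documentation) is to ignore the IDENTITY value and number the rows at read time with ROW_NUMBER():
-- Number the rows when reading the temp table instead of trusting idx to be dense.
-- The ORDER BY column is an assumption; use whatever ordering the procedure needs.
SELECT ROW_NUMBER() OVER (ORDER BY tablename) AS seq_idx,
       tablename
FROM #lala
ORDER BY seq_idx;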

Related

Postgres partition size increase but no selectable data

I have a postgres table set up with a list partition defined on a column like so:
create table table1 (
    val varchar(10),
    t_type varchar(10)
) partition by list(t_type);
create table table1_xvals partition of table1 for values in ('xval1','xval2');
create table table1_yvals partition of table1 for values in ('yval1','yval2');
As I insert data into table1, I can see the size of the table and of the individual partitions increasing; however, when I try to select data from any of those tables, nothing shows up (select * from ). Is there anything wrong with how I'm creating the tables or selecting the data?
That is normal. You are loading the data in a single database transaction, and the effects of a database transaction only become visible when the transaction is completed (committed). If you are loading the data with a single INSERT or COPY statement, the transaction will end as soon as the statement is done.
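A minimal sketch of what this looks like in practice, assuming the table1 definition from the question: while the loading transaction is still open, other sessions see the partition files grow but no rows.
-- Session 1: load inside an explicit transaction
BEGIN;
INSERT INTO table1 (val, t_type) VALUES ('a', 'xval1'), ('b', 'yval2');
-- ... transaction still open, no COMMIT yet ...

-- Session 2, meanwhile:
SELECT pg_relation_size('table1_xvals');  -- already greater than 0
SELECT count(*) FROM table1;              -- still returns 0 rows

-- Session 1:
COMMIT;
-- Only now does session 2 see the two rows.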

How to do PostgreSQL Bulk INSERT without Primary Key Violation [duplicate]

I'm trying to achieve database abstraction in my project, but now I got stuck with doing a bulk INSERT in PostgreSQL. My project is in C# and I'm using PostgreSQL 9.3 with npgsql.dll 2.0.14.
For Microsoft SQL Server I'm doing the bulk INSERT simply by concatenating all statements and then performing an ExecuteNonQuery:
IF NOT EXISTS (SELECT id FROM table WHERE id = 1) INSERT INTO table (id) VALUES (1);
IF NOT EXISTS (SELECT id FROM table WHERE id = 2) INSERT INTO table (id) VALUES (2);
IF NOT EXISTS (SELECT id FROM table WHERE id = 3) INSERT INTO table (id) VALUES (3);
Though the IF NOT EXISTS clause can be replaced in PostgreSQL by a SELECT ... WHERE, this approach unfortunately still doesn't work - because every single statement in PostgreSQL is committed separately.
So I googled for another solution and found the approach of using the COPY command along with NpgsqlCopySerializer/NpgsqlCopyIn to "stream" the bulk data efficiently. But now I'm getting primary key violation errors all the time - because the EXISTS/WHERE clause seemingly cannot be used together with the COPY statement.
I would really like to avoid doing the INSERTs one by one, as this would slow down my application considerably, so I hope that someone has already solved this issue!
Generally for this type of situation I'd have a separate staging table that does not have the PK constraint, which I'd populate using COPY (assuming the data were in a format for which it makes sense to do a COPY). Then I'd do something like:
insert into table
select a.*
from staging a
where not exists (select 1
                  from table b
                  where a.id = b.id)
That approach isn't too far off from your original design.
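For concreteness, here is a sketch of that staging flow. The table and column names (target_table, id, payload) and the CSV path are assumptions, and with npgsql you would normally stream via COPY ... FROM STDIN rather than a server-side file:
-- Staging table without the PK constraint, mirroring the target's columns
CREATE TEMP TABLE staging (LIKE target_table INCLUDING DEFAULTS);

-- Bulk-load the raw data into staging (server-side file shown only for illustration)
COPY staging (id, payload) FROM '/tmp/data.csv' WITH (FORMAT csv);

-- Move only the rows whose keys are not already present in the target
INSERT INTO target_table (id, payload)
SELECT a.id, a.payload
FROM staging a
WHERE NOT EXISTS (SELECT 1
                  FROM target_table b
                  WHERE a.id = b.id);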
I don't totally understand this part of your question, though, and it doesn't seem entirely relevant to the rest of it:
this approach unfortunately still doesn't work - because every single
statement in PostgreSQL is committed separately.
That's not true at all, for any RDBMS. Sure, auto-commit might be enabled on your client, but that doesn't mean that Postgres commits every statement separately, or that you can't disable auto-commit. This approach would work:
begin;
insert into table (id) select 1 where not exists (select 1 from table where id = 1);
insert into table (id) select 2 where not exists (select 1 from table where id = 2);
insert into table (id) select 3 where not exists (select 1 from table where id = 3);
commit;
As you pointed out, however, if you've got more than a handful of such statements you'll quickly be hitting some performance concerns.

SQL Server - How to find records in INSERTED when the database generates a primary key

I've never had to post a question on StackOverflow before because I can always find an answer here by just searching. Only this time, I think I've got a real stumper....
I'm writing code that automates the process of moving data from one SQL Server database to another. I have some pretty standard SQL Server databases with foreign key relationships between some of their tables. Straightforward stuff. One of my requirements is that each entire table needs to be copied in one fell swoop, without looping through rows or using a cursor. Another requirement is that I have to do this in SQL, with no SSIS or other external helpers.
For example:
INSERT INTO TargetDatabase.dbo.MasterTable
SELECT * FROM SourceDatabase.dbo.MasterTable
That's easy enough. Then, once the data from the MasterTable has been moved, I move the data of the child table.
INSERT INTO TargetDatabase.dbo.ChildTable
SELECT * FROM SourceDatabase.dbo.ChildTable
Of course, in reality I use more explicit SQL... like I specifically name all the fields and things like that, but this is just a simplified version. Anyway, so far everything's going alright, except ...
The problem is that the primary key of the master table is defined as an identity field. So, when I insert into the MasterTable, the primary key for the new table gets calculated by the database. So to deal with that, I tried using the OUTPUT INTO statement to get the updated values into a Temp table:
INSERT INTO TargetDatabase.dbo.MasterTable
OUTPUT INSERTED.* INTO #MyTempTable
SELECT * FROM SourceDatabase.dbo.MasterTable
So here's where it all falls apart. Since the database changed the primary key, how on earth do I figure out which record in the temp table matches up with the original record in the source table?
Do you see the problem? I know what the new ID is, I just don't know how to match it with the original record reliably. SQL Server lets me output the INSERTED values, but doesn't let me output the FROM TABLE values alongside the INSERTED values. I've tried it with triggers, I've tried it with a stored procedure; I always hit the same problem.
If I were just updating one record at a time, I could easily match up my INSERTED values with the original record I was trying to insert to see the old and new primary key values, but I have this requirement to do it in a batch.
Any Ideas?
PS: I'm not allowed to change the table structure of the target or source table.
You can use MERGE.
declare @Source table (SourceID int identity(1,2), SourceName varchar(50))
declare @Target table (TargetID int identity(2,2), TargetName varchar(50))
insert into @Source values ('Row 1'), ('Row 2')
merge @Target as T
using @Source as S
on 0 = 1
when not matched then
    insert (TargetName) values (SourceName)
output inserted.TargetID, S.SourceID;
Result:
TargetID    SourceID
----------- -----------
2           1
4           3
Covered in this blog post by Adam Machanic: Dr. OUTPUT or: How I Learned to Stop Worrying and Love the MERGE
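Applied to the tables in the question, that trick might look like the sketch below. The column names (MasterName, ChildName) and the child table's foreign key column (MasterID) are assumptions, since the real schemas aren't shown:
-- Capture a mapping from the source's old IDs to the newly generated target IDs
DECLARE @IdMap TABLE (OldID int, NewID int);

MERGE TargetDatabase.dbo.MasterTable AS t
USING SourceDatabase.dbo.MasterTable AS s
ON 0 = 1                                   -- never matches, so every source row is inserted
WHEN NOT MATCHED THEN
    INSERT (MasterName) VALUES (s.MasterName)
OUTPUT s.MasterID, inserted.MasterID INTO @IdMap (OldID, NewID);

-- Use the mapping to remap the foreign keys while copying the child rows
INSERT INTO TargetDatabase.dbo.ChildTable (MasterID, ChildName)
SELECT m.NewID, c.ChildName
FROM SourceDatabase.dbo.ChildTable AS c
JOIN @IdMap AS m ON m.OldID = c.MasterID;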
To illustrate what I mentioned in the comment:
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable ON
INSERT INTO TargetDatabase.dbo.MasterTable (IdentityColumn, OtherColumn1, OtherColumn2, ...)
SELECT IdentityColumn, OtherColumn1, OtherColumn2, ...
FROM SourceDatabase.dbo.MasterTable
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable OFF
Okay, since that didn't work for you (pre-existing values in target tables), how about adding a fixed increment (offset) to the id values in both tables (use the current max id value). Assuming the identity column is "id" in both tables:
DECLARE #incr int
BEGIN TRAN
SELECT #incr = max(id)
FROM TargetDatabase.dbo.MasterTable AS m WITH (TABLOCKX, HOLDLOCK)
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable ON
INSERT INTO TargetDatabase.dbo.MasterTable (id{, othercolumns...})
SELECT id+#incr{, othercolumns...}
FROM SourceDatabase.dbo.MasterTable
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable OFF
INSERT INTO TargetDatabase.dbo.ChildTable (id{, othercolumns...})
SELECT id+#incr{, othercolumns...}
FROM SourceDatabase.dbo.ChildTable
COMMIT TRAN

What is the difference between these two T-SQL statements?

In an SSIS package at work there are some SQL tasks that create staging tables for holding import data. All the statements take this form:
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.tbNewTable') AND type in (N'U'))
BEGIN
TRUNCATE TABLE dbo.tbNewTable
END
ELSE
BEGIN
CREATE TABLE dbo.tbNewTable (
    ColumnA VARCHAR(10) NULL,
    ColumnB VARCHAR(10) NULL,
    ColumnC INT NULL
) ON [PRIMARY]
END
In Itzik Ben-Gan's T-SQL Fundamentals I see a different form of statement for creating a table:
IF OBJECT_ID('dbo.tbNewTable', 'U') IS NOT NULL
BEGIN
DROP TABLE dbo.tbNewTable
END
CREATE TABLE dbo.tbNewTable (
    ColumnA VARCHAR(10) NULL,
    ColumnB VARCHAR(10) NULL,
    ColumnC INT NULL
) ON [PRIMARY]
Each of these appears to do the same thing: after execution, there will be an empty table called tbNewTable in the dbo schema.
Are there any practical or theoretical differences between the two? What implications might they have?
The first one assumes that if the table exists, it has the same columns as those it would create. The second one does not make that assumption. So if a table with that name happened to exist and had a different set of columns, the two would have very different results.
The first will not actually DROP the table - it merely TRUNCATEs all the data in the table, which is why the CREATE is guarded.
Thus the form with the DROP will allow the subsequent CREATE to change the schema (when the new table is created) even if tbNewTable previously existed.
Because DROP/CREATE alters the database schema, it may also not be allowed in all cases. For instance, a view created WITH SCHEMABINDING will prevent the table from being dropped. (The same holds true for more general FK relationships, should any exist.)
...when SCHEMABINDING is specified, the base table or tables cannot be modified in a way that would affect the view definition.
The TRUNCATE should be marginally faster in one of those constant "don't care" ways: there should be no performance consideration given to one over the other.
There are also permission differences. TRUNCATE only requires the ALTER permission.
The minimum permission required is ALTER on table_name. TRUNCATE TABLE permissions default to the table owner...
Happy coding.
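As a side note (not from either answer): on SQL Server 2016 and later, the drop-and-recreate form can be written without the OBJECT_ID guard:
-- SQL Server 2016+ shorthand for the guarded drop
DROP TABLE IF EXISTS dbo.tbNewTable

CREATE TABLE dbo.tbNewTable (
    ColumnA VARCHAR(10) NULL,
    ColumnB VARCHAR(10) NULL,
    ColumnC INT NULL
) ON [PRIMARY]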
These are very different.
The first does an equality check against the sys.objects system table to see whether there is a matching table name. If so, it truncates the table, removing all rows but keeping the table structure itself - i.e. the actual table is never dropped.
In the second, the check that the table exists is done implicitly via the OBJECT_ID() function. If the table exists, it is dropped completely - rows and structure.
If you have a primary key and a foreign key constraint on the table, you'll certainly have issues dropping it completely... and if you have other tables that are linked to the table you are trying to 'truncate', you'll have issues there too, unless you have cascade deletion turned on.
I tend to dislike either construction in an SSIS package. I create the tables in a deployment script and I want the package to fail if one of the tables I use is missing later on because then something drastically wrong has happened and I want to investigate what before I try putting data anywhere.

How to add a new identity column to a table in SQL Server?

I am using SQL Server 2008 Enterprise. I want to add an identity column (as a unique clustered index and primary key) to an existing table. An integer-based identity column auto-incrementing by 1 is fine. Any solutions?
BTW: my main confusion is about the existing rows - how will the new identity column be filled in automatically for them?
Thanks in advance,
George
You can use:
alter table <mytable> add ident INT IDENTITY
This adds an ident column to your table and populates it with data starting at 1 and incrementing by 1.
To add a clustered index:
CREATE CLUSTERED INDEX <indexName> on <mytable>(ident)
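Since the question also asks for the column to be the primary key, a possible variation (a sketch, not part of the original answer; the table and constraint names are made up) is to add the identity column and then make it a clustered primary key instead of a plain clustered index:
-- Add the identity column; existing rows are numbered 1, 2, 3, ... automatically
ALTER TABLE dbo.MyTable ADD ident INT IDENTITY(1,1) NOT NULL

-- Make it the clustered primary key
ALTER TABLE dbo.MyTable
    ADD CONSTRAINT PK_MyTable_ident PRIMARY KEY CLUSTERED (ident)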
I have one approach in mind, though I'm not sure whether it is feasible at your end. Let me assure you, it is a very effective approach: create a table that has an identity column and insert your entire data set into it; from there on, handling any duplicate data is child's play. There are two ways of adding an identity column to a table with existing data:
Create a new table with the identity column, copy the data into it, then drop the existing table and rename the new one.
Create a new column with the identity property and drop the existing column.
For reference, I have found two articles:
http://blog.sqlauthority.com/2009/05/03/sql-server-add-or-remove-identity-property-on-column/
http://cavemansblog.wordpress.com/2009/04/02/sql-how-to-add-an-identity-column-to-a-table-with-data/
You don't always have permissions for DBCC commands.
Solution #2:
create table #tempTable1 (Column1 int)
declare #new_seed varchar(20) = CAST((select max(ID) from SomeOtherTable) as varchar(20))
exec (N'alter table #tempTable1 add ID int IDENTITY(' + @new_seed + ', 1)')
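For completeness, a sketch of the DBCC-based route that the answer alludes to, assuming it means reseeding a plain identity column (it requires permission to run DBCC CHECKIDENT):
-- Create the table with an ordinary identity column, then reseed it so that
-- rows inserted afterwards continue from the max ID of the other table.
create table #tempTable2 (Column1 int, ID int IDENTITY(1,1))

declare @new_seed2 int = (select max(ID) from SomeOtherTable)
DBCC CHECKIDENT ('#tempTable2', RESEED, @new_seed2)
-- Rows inserted from here on continue from the new seed value.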