Very simple bulk insert inserts NULLs for each row in CSV - tsql

I can hear you saying: "Then it ain't so simple, is it?"
Here's the table:
CREATE TABLE [bulkimport].[Test](
[Id] [int] IDENTITY(1,1) NOT NULL,
[Data] [nvarchar](50) NULL,
CONSTRAINT [PK_Test_1] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
Here's the T-SQL:
TRUNCATE TABLE bulkimport.Test
BULK INSERT bulkimport.Test
FROM 'D:\Referenzliste_VKnr.csv'
WITH (
FORMATFILE='D:\VKnrImport.xml',
CODEPAGE=28591,
ERRORFILE='D:\VKnrImportError.txt'
)
My XML format file looks like this:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\r\n" COLLATION="SQL_Latin1_General_CP1_CI_AS" />
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="Data" LENGTH="50" xsi:type="SQLNVARCHAR" />
</ROW>
</BCPFORMAT>
and finally, an excerpt from the data:
18
26
34
42
59
67
75
83
91
109
117
125
133
There's a carriage return line feed after each row, leaving one empty line at the end. I get a NULL for each entry in the csv file, using SQL Server 2012.

Your question isn't an exact duplicate of this one (you didn't get any errors) but it's the same problem. As the documentation says:
With an XML format file, you cannot skip a column when you are importing directly into a table by using a bcp command or a BULK INSERT statement. However, you can import into all but the last column of a table. If you have to skip any but the last column, you must create a view of the target table that contains only the columns contained in the data file. Then, you can bulk import data from that file into the view.
In other words, you can only skip the last column in the table when using an XML format file. But you are trying to skip the Id column, which is the first column.
Two solutions are:
Use a non-XML format file
Use OPENROWSET
When I used this non-XML format file, your data loaded fine:
10.0
1
1 SQLCHAR 0 100 "\r\n" 2 Data Latin1_General_CI_AS
But OPENROWSET is more flexible, because you can use the SELECT to re-order the columns or otherwise manipulate the data coming from the flat file.
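For example, a rough sketch of the OPENROWSET route, reusing your XML format file (with OPENROWSET the format file only describes the file, so the single Data field maps cleanly; the INSERT column list below is my assumption):
INSERT INTO bulkimport.Test (Data)
SELECT t.Data
FROM OPENROWSET(
    BULK 'D:\Referenzliste_VKnr.csv',
    FORMATFILE = 'D:\VKnrImport.xml',
    CODEPAGE = '28591'
) AS t;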

Related

Unnest vs just having every row needed in table

I have a choice in how a data table is created and am wondering which approach is more performant.
1. Making a table with a row for every data point, or
2. Making a table with an array column that will allow repeated content to be unnested.
That is, if I have the data:
day | val1 | val2
----+------+-----
Mon |    7 |   11
Tue |    7 |   11
Wed |    8 |    9
Thu |    1 |    4
Is it better to enter the data in as shown, or instead:
day       | val1 | val2
----------+------+-----
(Mon,Tue) |    7 |   11
(Wed)     |    8 |    9
(Thu)     |    1 |    4
And then use unnest() to explode those into unique rows when I need them?
Assume that we're talking about large data in reality - 100k rows of data generated every day x 20 columns. Using the array would greatly reduce the number of rows in the table but I'm concerned that unnest would be less performant than just having all of the rows.
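For clarity, a minimal sketch of what I mean by the unnest() route, assuming an illustrative table day_arr(days text[], val1 int, val2 int) holding the second layout (names are just for the example):
SELECT unnest(days) AS day, val1, val2
FROM   day_arr;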
I believe making a table with a row for every data point would be the option I would go for, as unnest for large amounts of data would be just as slow, if not slower. Plus, unless your data is very repetitive, 20 columns is a lot to align.
"100k rows of data generated every day x 20 columns"
And:
"the array would greatly reduce the number of rows" - so lots of duplicates.
Based on this I would suggest a third option:
Create a table with your 20 columns of data and add a surrogate bigint PK to it. To enforce uniqueness across all 20 columns, add a generated hash and make it UNIQUE. I suggest a custom function for the purpose:
-- hash function
CREATE OR REPLACE FUNCTION public.f_uniq_hash20(col1 text, col2 text, ... , col20 text)
RETURNS uuid
LANGUAGE sql IMMUTABLE COST 30 PARALLEL SAFE AS
'SELECT md5(textin(record_out(($1,$2, ... ,$20))))::uuid';
-- data table
CREATE TABLE data (
data_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, col1 text
, col2 text
, ...
, col20 text
, uniq_hash uuid GENERATED ALWAYS AS (public.f_uniq_hash20(col1, col2, ... , col20)) STORED
, CONSTRAINT data_uniq_hash_uni UNIQUE (uniq_hash)
);
-- reference data_id in next table
CREATE TABLE day_data (
day text
, data_id bigint REFERENCES data ON UPDATE CASCADE -- FK to enforce referential integrity
, PRIMARY KEY (day, data_id) -- must be unique?
);
db<>fiddle here
With only text columns, the function is actually IMMUTABLE (which we need!). For other data types (like timestamptz) it would not be.
In-depth explanation in this closely related answer:
Why doesn't my UNIQUE constraint trigger?
You could use uniq_hash as PK directly, but for many references, a bigint is more efficient (8 vs. 16 bytes).
About generated columns:
Computed / calculated / virtual / derived columns in PostgreSQL
Basic technique to avoid duplicates while inserting new data:
INSERT INTO data (col1, col2) VALUES
('foo', 'bam')
ON CONFLICT DO NOTHING
RETURNING *;
If there can be concurrent writes, see:
How to use RETURNING with ON CONFLICT in PostgreSQL?

Use SQL to Evaluate XML string broken into several rows

I have an application that stores a single XML record broken up into 3 separate rows, I'm assuming due to length limits. The first two rows each max out the storage at 4000 characters, and unfortunately the break doesn't fall at the same place for each record.
I'm trying to find a way to combine the three rows into a complete XML record that I can then extract data from.
I've tried concatenating the rows but can't find a data type or anything else that will let me pull the three rows into a single readable XML record.
I have several limitations I'm up against: we have select-only access to the DB, and I'm stuck using just SQL, as I don't have enough access to implement any kind of external program to pull out the data and manipulate it with something else.
Any ideas would be very appreciated.
Without sample data, and desired results, we can only offer a possible approach.
Since you are on 2017, you have access to string_agg()
Here I am using ID as the proper sequence.
I should add that try_convert() will return a NULL if the conversion to XML fails.
Example
Declare @YourTable table (ID int, SomeCol varchar(4000))
Insert Into @YourTable values
(1,'<root><name>XYZ Co')
,(2,'mpany</')
,(3,'name></root>')
Select try_convert(xml, string_agg(SomeCol,'') within group (order by ID))
From @YourTable
Returns
<root>
<name>XYZ Company</name>
</root>
EDIT - SQL Server 2014 option (string_agg() requires 2017+):
Select try_convert(xml,(Select '' + SomeCol
From @YourTable
Order By ID
For XML Path(''), TYPE).value('.', 'varchar(max)')
)
Or Even
Declare @S varchar(max) = ''
Select @S = @S + SomeCol
From @YourTable
Order By ID
Select try_convert(xml, @S)
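Once the pieces are stitched back into a single xml value, you can pull data out with the xml methods. A minimal sketch reusing the @YourTable data from the example above (the XPath and output alias are assumptions):
;WITH Combined AS (
    Select try_convert(xml, string_agg(SomeCol,'') within group (order by ID)) AS Doc
    From @YourTable
)
Select Doc.value('(/root/name)[1]', 'nvarchar(100)') AS CompanyName
From Combined;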

Redshift Copy and auto-increment does not work

I am using the Redshift COPY command to copy JSON data from S3.
The table definition is as follows:
CREATE TABLE my_raw
(
id BIGINT IDENTITY(1,1),
...
...
) diststyle even;
The COPY command I am using is as follows:
COPY my_raw FROM 's3://dev-usage/my/2015-01-22/my-usage-1421928858909-15499f6cc977435b96e610298919db26' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXX' json 's3://bemole-usage/schemas/json_schema' ;
I am expecting that any newly inserted id will always be > select max(id) from my_raw. That's clearly not the case.
If I issue the above COPY command twice, the first time the ids fall between 1 and N even though the file only creates 114 records (that's a known issue with Redshift when it has multiple shards). The second time the ids also fall between 1 and N, but they use free numbers that were not taken by the first copy.
See below for a demo:
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=#
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/my_json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
Thanks in advance.
The only solution I found to make sure the ids are sequential, based on insertion order, is to maintain a pair of tables. The first one is the stage table, into which the items are inserted by the COPY command. The stage table actually has no id column.
Then I have another table that is an exact replica of the stage table, except that it has an additional column for the ids. A job takes care of filling the master table from the stage table using the ROW_NUMBER() function.
In practice, this means executing the following statement after each Redshift COPY is performed:
insert into master
(id,result_code,ct_timestamp,...)
select
#{startIncrement}+row_number() over(order by ct_timestamp) as id,
result_code,...
from stage;
Then the ids are guaranteed to be sequential/consecutive in the master table.
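A rough sketch of the two-table layout described above (the payload column names and types are assumptions that follow the INSERT example; remaining columns elided):
-- stage table: loaded by COPY, deliberately has no id column
CREATE TABLE stage (
    result_code  varchar(32),
    ct_timestamp timestamp
    -- ... remaining payload columns
) diststyle even;

-- master table: same payload columns plus the id filled in by the job above
CREATE TABLE master (
    id           bigint,
    result_code  varchar(32),
    ct_timestamp timestamp
    -- ... remaining payload columns
) diststyle even;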
I can't reproduce your problem; however, it is interesting how to get identity columns set correctly in conjunction with COPY. Here's a small summary:
Be aware that you can specify the columns (and their order) for a copy command.
COPY my_table (col1, col2, col3) FROM s3://...
So if:
the EXPLICIT_IDS flag is NOT set,
no columns are listed as shown above,
and your CSV does not contain data for the IDENTITY column,
then the identity values in the table will be set automatically and monotonically, as we all want.
From the docs:
If an IDENTITY column is included in the column list, then EXPLICIT_IDS must also be specified; if an IDENTITY column is omitted, then EXPLICIT_IDS cannot be specified. If no column list is specified, the command behaves as if a complete, in-order column list was specified, with IDENTITY columns omitted if EXPLICIT_IDS was also not specified.
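Conversely, if the source data does carry values for the IDENTITY column and you want to load them as-is, the EXPLICIT_IDS route from the docs looks roughly like this (paths, credentials and the remaining column list elided as above):
COPY my_raw (id, ...)
FROM 's3://bemole-usage/my/...'
credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX'
json 's3://bemole-usage/schemas/json_schema'
EXPLICIT_IDS;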

Updating a table in SQL Server 2008 R2

I have a Customer table that has 55 million records. I need to update a column HHPK with incrementing values.
Example: 1, 2, 3, 4, 5, ... up to 55 million
I'm using the following script, but it errors out because the transaction log for the database is getting full.
The DB is using the simple recovery model.
DECLARE @SEQ BIGINT
SET @SEQ = 0
UPDATE Customers
SET @SEQ = HHPK = @SEQ + 1
Is there any other way to do that task? Please help
As your table already has a CustomerPK identity column, just use:
UPDATE dbo.Customers
SET HHPK = CustomerPK
Of course - with 55 million rows, this will be a strain on your log file. So you might want to do this in batches - preferably of less than 5000 rows to avoid lock escalation effects that would exclusively lock the entire table:
UPDATE TOP (4500) dbo.Customers
SET HHPK = CustomerPK
WHERE HHPK IS NULL
and repeat this until the entire table has been updated.
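A minimal sketch of that loop (the batch size and the CHECKPOINT are just one reasonable choice under the simple recovery model):
WHILE 1 = 1
BEGIN
    UPDATE TOP (4500) dbo.Customers
    SET    HHPK = CustomerPK
    WHERE  HHPK IS NULL;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to update

    CHECKPOINT;                -- lets the log truncate between batches (simple recovery)
END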
But really: if you already have an INT IDENTITY column CustomerPK - why do you need a second column to hold the same values? Doesn't make a lot of sense to me ....

SQL Server 2012 poor performance when selecting using LIKE from a large table

I have a table with ~1M rows and run the following SQL against it:
select * from E where sys like '%,141,%'
which takes 2-5 seconds to execute (returning ~10 rows). I need it to be at least 10 times faster. Is that something that can be achieved with SQL Server 2012?
A sample sys value (sys values range in length from 5 to 1000 characters):
1,2,3,7,9,10,11,12,14,17,28,29,30,33,35,37,40,41,42,43,44,45,46,47,48,50,51,53,55,63,69,
72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,97,109,110,111,113,117,
119,121,122,123,124,130,131,132,133,134,135,139,141,146
The table's DDL:
CREATE TABLE [dbo].[E](
[o] [int] NOT NULL,
[sys] [varchar](8000) NULL,
[s] [varchar](8000) NULL,
[eys] [varchar](8000) NULL,
[e] [varchar](8000) NULL,
CONSTRAINT [PK_E] PRIMARY KEY CLUSTERED
(
[o] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Your LIKE clause is causing a full table scan.
If you want instant performance for this query, you will need a one-to-many table that contains the following fields:
E_Key <-- Foreign Key, points to primary key of E table
sys   <-- Each record contains one number, not multiple numbers separated by commas
You can then index sys, and use an ordinary WHERE clause.
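A rough sketch of that layout and the resulting query (table, column and index names are assumptions):
CREATE TABLE dbo.E_sys (
    E_Key int NOT NULL REFERENCES dbo.E ([o]),
    sys   int NOT NULL,
    CONSTRAINT PK_E_sys PRIMARY KEY CLUSTERED (sys, E_Key)  -- leading sys supports the lookup
);

SELECT e.*
FROM   dbo.E AS e
JOIN   dbo.E_sys AS s
       ON s.E_Key = e.[o]
WHERE  s.sys = 141;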
If you can't change the table schema, you can enable full-text search, create a full-text index on the table, and then do:
select * from E where CONTAINS(sys, ",141,")
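For completeness, a rough sketch of the full-text setup that approach assumes (the catalog name is an assumption; PK_E is the table's primary key from the DDL above):
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.E ([sys]) KEY INDEX PK_E;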
The LIKE operator is always going to be slower because it forces SQL Server to scan each row for the data you are looking for. Below is an alternative to LIKE that may work a little better (although it will still scan the data).
SELECT * FROM E WHERE CHARINDEX(',141,', sys) > 0
I realize this is an older post, but...
If you're absolutely hell-bent on storing denormalized data in a table, convert it to XML so you can at least index it.
However, the best thing to do would be to normalize that data by splitting it out into a one-to-many lookup table (as Robert Harvey suggested above).
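A rough sketch of that XML variant, assuming you can add a new column and index it (column and index names are mine):
ALTER TABLE dbo.E ADD sys_xml xml;

UPDATE dbo.E
SET    sys_xml = CAST('<i>' + REPLACE(sys, ',', '</i><i>') + '</i>' AS xml);

CREATE PRIMARY XML INDEX PXML_E_sys ON dbo.E (sys_xml);
-- a secondary VALUE index can further speed up equality lookups

SELECT *
FROM   dbo.E
WHERE  sys_xml.exist('/i[. = "141"]') = 1;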