insert based on value in first row - tsql

I have a fixed-width file that I am importing into a single column with data similar to what you see below:
ABC$ WC 11683
11608000163118430001002010056788000000007680031722800315723
11683000486080280000002010043213000000007120012669100126691
ABC$ WC 000000020000000148000
ABC$ WC 11683
1168101057561604000050200001234000000027020023194001231940
54322010240519720000502000011682000000035640006721001067210
1167701030336257000050200008765000000023610029066101151149
11680010471244820000502000011680000000027515026398201263982
I want to split and insert this data into another table but I want to do so as long as the '11683' is equal to a column value in a different table + 1. I will then increment that value (not seen here).
I tried the following:
declare @blob as varchar(5)
declare @Num as varchar(5)
set @blob = substring(sdg_winn_blob.blob, 23,5)
set @Num = (Cnum.num + 1)
IF @blob = @Num
INSERT INTO SDG_CWF
(
GAME,SERIAL,WINNER,TYPE
)
SELECT convert(numeric, substring(blob,28, 5)),convert(numeric, substring(blob, 8, 9)),
(Case when (substring(blob, 6,2)='10') then '3'
when (substring(blob, 6,2)='11') then '4'
else substring(blob, 7, 1)
End),
(Case when (substring(blob, 52,2)='10') then '3'
when (substring(blob, 52,2)='11') then '4'
else substring(blob, 53, 1)
End)
FROM sdg_winn_blob
WHERE blob not like 'ABC$%'
else
print 'The Job Failed'
The insert works fine until I try to check to see if the number at position (23, 5) is the same as the number in the Cnum table. I get the error:
Msg 4104, Level 16, State 1, Line 4
The multi-part identifier "sdg_winn_blob.blob" could not be bound.
Msg 4104, Level 16, State 1, Line 5
The multi-part identifier "Cnum.num" could not be bound.

It looks like you may be used to a procedural, object-oriented style of coding. SQL Server wants you to think quite differently...
This line:
set @blob = substring(sdg_winn_blob.blob, 23,5)
Is failing because SQL interprets it in isolation. Within just that line, you haven't told SQL what the object sdg_winn_blob is, nor its member blob.
Since those things are database tables / columns, they can only be accessed as part of a query including a FROM clause. It's the FROM that tells SQL where these things are.
So you'll need to replace that line (and the immediate next one) with something like the following:
Select @blob = substring(sdg_winn_blob.blob, 23,5)
From sdg_winn_blob
Where...
Furthermore, as far as I can tell, your whole approach here is conceptually iterative: you're thinking about this in terms of looking at each line in turn, processing it, then moving onto the next. SQL does provide facilities to do that (which you've not used here), but they are very rarely the best solution. SQL prefers (and is optimised for) a set based approach: design a query that will operate on all rows in one go.
As it stands I don't think your query will ever do quite what you want, because you're expecting iterative behaviour that SQL doesn't follow.
The way you need to approach this if you want to "think like SQL Server" is to construct (using just SELECT type queries) a set of rows that has the '11683' type values from the header rows, applied to each corresponding "data" row that you want to insert to SDG_CWF.
Then you can use a SQL JOIN to link this row set to your Cnum table and ascertain, for each row, whether it meets the condition you want in Cnum. This set of rows can then just be inserted into SDG_CWF. No variables or IF statement involved (they're necessary in SQL far less often than some people think).
There are multiple possible approaches to this, none of them terribly easy (unless I'm missing something obvious). All will need you to break your logic down into steps, taking your initial set of data (just a blob column) and turning it into something a bit closer to what you need, then repeating. You might want to work this out yourself but if not, I've set out an example in this SQLFiddle.
I don't claim that example is the fastest or neatest (it isn't) but hopefully it'll show what I mean about thinking the way SQL wants you to think. The SQL engine behind that website is using SQL 2008, but the solution I give should work equally well on 2005. There are niftier possible ways if you get access to 2012 or later versions.
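To make the shape of that idea concrete, here is a rough sketch (this is not the SQLFiddle example, just an illustration, and it assumes a row_id column that preserves the original file order, which your staging table may or may not have):
-- Sketch only: row_id is a hypothetical column preserving the file order.
;WITH src AS (
    SELECT row_id,
           blob,
           CASE WHEN blob LIKE 'ABC$%'
                THEN SUBSTRING(blob, 23, 5)          -- header value, offset taken from the question
           END AS header_num
    FROM sdg_winn_blob
),
data_rows AS (
    SELECT s.row_id,
           s.blob,
           (SELECT TOP 1 h.header_num                -- closest header row above this data row
            FROM src h
            WHERE h.header_num IS NOT NULL
              AND h.row_id < s.row_id
            ORDER BY h.row_id DESC) AS header_num
    FROM src s
    WHERE s.blob NOT LIKE 'ABC$%'
)
INSERT INTO SDG_CWF (GAME, SERIAL, WINNER, TYPE)
SELECT CONVERT(numeric, SUBSTRING(d.blob, 28, 5)),
       CONVERT(numeric, SUBSTRING(d.blob, 8, 9)),
       CASE SUBSTRING(d.blob, 6, 2)  WHEN '10' THEN '3' WHEN '11' THEN '4' ELSE SUBSTRING(d.blob, 7, 1)  END,
       CASE SUBSTRING(d.blob, 52, 2) WHEN '10' THEN '3' WHEN '11' THEN '4' ELSE SUBSTRING(d.blob, 53, 1) END
FROM data_rows d
JOIN Cnum c ON d.header_num = c.num + 1;             -- keep only rows whose header value is Cnum.num + 1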

Related

SQL Server 2012 - Is use of cursor/dynamicSQL combo good practice?

Most common answers are: NO, don't use cursors, don't use dynamic SQL
But this question is to solicit feedback on a coding style that seems nifty but may be bad practice, may slow processing down, and may open the door to SQL injection(?).
I learned this style due to an annoyance with copy-pasting sets of queries where only one or two items change between each query. I also find this style easier to code-review because there is only one code block, which removes the need to scroll up and down.
An example use case is: a big slow table with historical data from 20 different insurance companies that is 1.7 billion rows by 200 columns. To do productive analysis, 10 columns are retrieved into separate tables for each of the 20 insurance companies.
Before the cursor/dynamic combination, a query was built and code-reviewed for one plan, then copied 19 times, each time retrieving a different plan.
After utilizing cursor/dynamic combo, there is one 20 item cursor and one dynamic SQL block. Review wise, it seems more consistent and less prone to human error.
A code example of the combo is below:
Declare @company_name varchar(10)
       ,@SQL_STATEMENT varchar(100)
Declare company_cursor cursor fast_forward for
    SELECT * FROM (VALUES ('APPLE'), ('GOOGLE'), ('AMAZON')) AS TABLE_NAME(COLUMN_NAME)
open company_cursor
fetch next from company_cursor into @company_name
while @@fetch_status = 0
begin
    set @SQL_STATEMENT = 'select * from database.schema.' + @company_name
    print (@SQL_STATEMENT)
    fetch next from company_cursor into @company_name
end
close company_cursor
deallocate company_cursor
I also noticed that using PRINT instead of EXEC prints the query statement instantly, so it acts as a code generator if the PRINT results are copied back into the SQL editor.
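When actually executing rather than printing, I assume that wrapping the object name in QUOTENAME and running the statement through sp_executesql would at least limit the injection surface; a rough sketch of that variant (the nvarchar variable is new here):
declare @safe_statement nvarchar(200)
set @safe_statement = N'select * from database.schema.' + QUOTENAME(@company_name)
exec sp_executesql @safe_statement   -- or PRINT it first while reviewing the generated code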
Can somebody offer opinions, advice, or general practice rules surrounding this style of T-SQL coding? (bracing for downvotes...sniff)

Sort data within a subquery with another subquery?

I am trying to sort the OUN.note column by using the OUN.outcomeKey, since the way it is working right now puts the notes in the wrong order (sorting alphabetically). Any idea how to go about this? I've been trying to sort the data using another sub-query within, but I haven't had much luck (I don't have a plethora of experience).
Here's my current query:
SELECT DISTINCT OC.outcomeKey [Outcome Key], OC.outcome [Result],
STUFF((SELECT ','+' '+ OUN.note
FROM
Outcome AS OUT
JOIN OutcomeNote AS OUN
ON OUT.outcomeKey = OUN.outcomeKey
WHERE OUN.outcomeKey = OC.outcomeKey
GROUP BY OUN.note
FOR XML PATH ('')), 1, 1, '') [Outcome Note]
FROM Outcome AS OC
Any help or tips would be greatly appreciated! Also, please let me know if any more info is needed.
You may replace the line
GROUP BY OUN.note
with the line
ORDER BY OUN.outcomeKey
Also, because the concatenation starts with ', ', you may want to use 1, 2, '' as the additional arguments of the STUFF function. Otherwise, the values in your [Outcome note] column always start with a space.
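Applied to the query in the question, the two changes together look roughly like this (the ORDER BY column is only a placeholder for whichever column really defines the desired order; see the edit below):
SELECT DISTINCT OC.outcomeKey [Outcome Key], OC.outcome [Result],
STUFF((SELECT ','+' '+ OUN.note
       FROM
       Outcome AS OUT
       JOIN OutcomeNote AS OUN
       ON OUT.outcomeKey = OUN.outcomeKey
       WHERE OUN.outcomeKey = OC.outcomeKey
       ORDER BY OUN.outcomeKey   -- placeholder: use whichever column defines the order you want
       FOR XML PATH ('')), 1, 2, '') [Outcome Note]
FROM Outcome AS OC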
Edit:
By the way, sorting the notes by outcomeKey in the subquery that generates the values for the [Outcome note] column has no effect... since all the notes in each subquery result will have the same outcomeKey value...
But you may sort on any column you want, of course. Perhaps there are other columns in your OutcomeNote table that can serve as a useful sorting column for your outcome notes.
If I misunderstood your question, please provide definitions of the Outcome and OutcomeNote tables, together with a demo population of those tables and the desired/expected query result.
Edit 2:
Starting with SQL Server 2017, Transact-SQL contains a function called STRING_AGG, which seems to be functionally equivalent (more or less) to MySQL's GROUP_CONCAT function. Using this function, your query would become something like this:
SELECT
OUN.outcomeKey [Outcome Key],
OC.outcome [Result],
STRING_AGG(OUN.[Note], ', ') WITHIN GROUP (ORDER BY OUN.outcomeKey) [Outcome Note]
FROM
Outcome AS OC
JOIN OutcomeNote AS OUN ON OUN.outcomeKey = OC.outcomeKey
GROUP BY
OUN.outcomeKey,
OC.outcome;
When using SQL Server 2017 or SQL Azure, this might be a more fitting choice, since it not only makes the query more readable, but also eliminates the use of the (far less efficient) XML functions in your query.
I too have used the XML-functionality for field concatenation (the way you use it) intensively in the past, but I noticed a considerable drop in performance of my queries (which sometimes contained up to 10 columns with concatenated data). Since then, I tend to go for recursive common table expressions or scalar UDF with recursion approaches in pre SQL Server 2017 environments.
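For completeness, a recursive-CTE version of the concatenation might look roughly like this on pre-2017 servers (a sketch only: it reuses the table and column names from the question, numbers the notes alphabetically just to have a defined order, and is untested here):
WITH notes AS (
    SELECT OUN.outcomeKey,
           OUN.note,
           ROW_NUMBER() OVER (PARTITION BY OUN.outcomeKey ORDER BY OUN.note) AS rn
    FROM OutcomeNote AS OUN
),
rec AS (
    SELECT outcomeKey, rn, CAST(note AS nvarchar(max)) AS notes_concat
    FROM notes
    WHERE rn = 1
    UNION ALL
    SELECT n.outcomeKey, n.rn,
           CAST(r.notes_concat + N', ' + n.note AS nvarchar(max))
    FROM rec AS r
    JOIN notes AS n
      ON n.outcomeKey = r.outcomeKey AND n.rn = r.rn + 1
)
SELECT OC.outcomeKey [Outcome Key], OC.outcome [Result], r.notes_concat [Outcome Note]
FROM Outcome AS OC
JOIN rec AS r ON r.outcomeKey = OC.outcomeKey
WHERE r.rn = (SELECT MAX(n2.rn) FROM notes AS n2 WHERE n2.outcomeKey = r.outcomeKey)
OPTION (MAXRECURSION 0);   -- the default limit of 100 would stop outcomes with many notes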

Postgres 'if not exists' fails because the sequence exists

I have several counters in an application I am building, and am trying to get them to be dynamically created by the application as required.
For a simplistic example, if someone types a word into a script it should return the number of times that word has been entered previously. Here is an example of the SQL that may be executed if they typed the word 'example'.
CREATE SEQUENCE IF NOT EXISTS example START WITH 1;
SELECT nextval('example')
This would return 1 the first time it ran, 2 the second time, etc.
The problem is when 2 people click the button at the same time.
First, please note that a lot more is happening in my application than just these statements, so the chances of them overlapping is much more significant than it would be if this was all that was happening.
1> BEGIN;
2> BEGIN;
1> CREATE SEQUENCE IF NOT EXISTS example START WITH 1;
2> CREATE SEQUENCE IF NOT EXISTS example START WITH 1; -- is blocked by previous statement
1> SELECT nextval('example') -- returns 1 to user.
1> COMMIT; -- unblocks second connection
2> ERROR: duplicate key value violates unique constraint
"pg_type_typname_nsp_index"
DETAIL: Key (typname, typnamespace)=(example, 109649) already exists.
I was under the impression that by using "IF NOT EXISTS", the statement should just be a no-op if it does exist, but it seems to have this race condition where that is not the case. I say race condition because if these two are not executed at the same time, it works as one would expect.
I have noticed that IF NOT EXISTS is fairly new to postgres, so maybe they haven't worked out all of the kinks yet?
EDIT:
The main reason we were considering doing things this way was to avoid excess locking. The thought being that if two people were to increment at the same time, using a sequence would mean that neither user should have to wait for the other (except, as in this example, for the initial creation of that sequence)
Sequences are part of the database schema. If you find yourself modifying the schema dynamically based on the data stored in the database, you are probably doing something wrong. This is especially true for sequences, which have special properties, e.g. regarding their behavior with respect to transactions. Specifically, if you increment a sequence (with the help of nextval) in the middle of a transaction and then roll back that transaction, the value of the sequence will not be rolled back. So most likely, this kind of behavior is something that you don't want with your data.
In your example, imagine that a user tries to add a word. This results in the corresponding sequence being incremented. Now imagine that the transaction does not complete for some reason (e.g. maybe the computer crashes) and it gets rolled back. You would end up with the word not being added to the database, but with the sequence still incremented.
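A quick psql session makes the rollback behavior concrete (using the example sequence from the question):
BEGIN;
SELECT nextval('example');   -- suppose this returns 5
ROLLBACK;
SELECT nextval('example');   -- returns 6: the increment survived the rollback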
For the particular example that you mentioned, there is an easy solution; create an ordinary table to store all the "sequences". Something like that would do it:
CREATE TABLE word_frequency (
word text NOT NULL UNIQUE,
frequency integer NOT NULL
);
Now I understand that this is just an example, but if this approach doesn't work for your actual use case, let us know and we can adjust it to your needs.
Edit: Here's how the above solution works. If a new word is added, run the following query ("UPSERT" syntax in postgres 9.5+ only):
INSERT INTO word_frequency(word,frequency)
VALUES ('foo',1)
ON CONFLICT (word)
DO UPDATE
SET frequency = word_frequency.frequency + excluded.frequency
RETURNING frequency;
This query will insert a new word in word_frequency with frequency 1, or if the word exists already it will increment the existing frequency by 1. Now what happens if two transactions try to do that at the same time? Consider the following scenario:
client 1                 client 2
--------                 --------
BEGIN
                         BEGIN
UPSERT ('foo',1)
                         UPSERT ('foo',1)   <====
COMMIT
                         COMMIT
What will happen is that as soon as client 2 tries to increment the frequency for foo (marked with the arrow above), that operation will block because the row was modified by a different transaction. When client 1 commits, client 2 will get unblocked and continue without any errors. This is exactly how we wanted it to work. Also note that PostgreSQL will use row-level locking to implement this behavior, so other insertions will not be blocked.
EDIT: The main reason we were considering doing things this way was to avoid excess locking. The thought being that if two people were to increment at the same time, using a sequence would mean that neither user should have to wait for the other (except, as in this example, for the initial creation of that sequence)
It sounds like you're optimizing for a problem that likely does not exist. Sure, if you have 100,000 simultaneous users that are only inserting rows (since a sequence will normally only be used then) there is the possibility of some contention with the sequence, but realistically there will be other bottlenecks long before the sequence gets in the way.
I'd advise you to first prove that the sequence is an issue. With a proper database design (which dynamic DDL is not) the sequence will not be the bottleneck.
As a reference, DDL is not transaction safe in most databases.

Understanding MON$STAT_ID in the Firebird monitoring tables

I posted a few weeks back inquiring about the Firebird DB and how to monitor it. Since then I have come up with a nifty script that monitors all of the page reads/writes/fetches/marks. Among the columns I am monitoring are the MON$STAT_ID and MON$STAT_GROUP fields. These print out nice numbers for me; however, I have no way to correlate them and understand what exactly they are. I thought printing out the MON$STAT_GROUP would help but it has yet to assist me in any way...
I have also looked into the RDB$ commands but have found very limited documentation to see if they might assist me in monitoring my database.
So I decided to come here and inquire first off whether I am monitoring my database in a way that others can view the data from page reads/writes/fetches/marks and make an intelligent decision on whether or not the database is performing as expected.
Secondly, would adding RDB$ commands to my script add anything to the value of the data that I will be giving our database folks?
Lastly, and maybe most importantly, is there any way to correlate the MON$STAT_ID fields to an actual table in the database to understand when something is going on that should not be? I am currently monitoring the database every minute, which may be too frequent, but I am getting valid data out. The only question now is how to interpret this data. Can someone give me advice on methods they use/have used in the past that have worked for them?
(NOTE: Running firebird 2.1)
The column MON$STAT_ID in MON$IO_STATS (and MON$RECORD_STATS and MON$MEMORY_USAGE) is the primary key of the record in the monitoring table. Almost all other monitoring tables include a MON$STAT_ID to point to these statistics: MON$ATTACHMENTS, MON$CALL_STACK, MON$DATABASE, MON$STATEMENTS, MON$TRANSACTIONS.
In other words: the statistics apply at the database, attachment, transaction, statement or call level (PSQL executions). The statistics tables contain a column called MON$STAT_GROUP to discern these types. The values of MON$STAT_GROUP are described in RDB$TYPES:
0 : DATABASE
1 : ATTACHMENT
2 : TRANSACTION
3 : STATEMENT
4 : CALL
Typically the statistics of level 0 contain all from level 1, level 1 contains all from level 2 for that attachment, level 2 contains all from level 3 for that transaction, level 3 contains all from level 4 for that statement.
As there might be data processed that is unrelated to a lower level, or a specific attachment, transaction or statement handle may already have been dropped, the numbers at a lower level do not necessarily add up to the totals of the higher level.
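You can see this for yourself by summing the I/O counters per level (a sketch; the same idea works for MON$RECORD_STATS):
SELECT MON$STAT_GROUP,
       SUM(MON$PAGE_READS)   AS PAGE_READS,
       SUM(MON$PAGE_WRITES)  AS PAGE_WRITES,
       SUM(MON$PAGE_FETCHES) AS PAGE_FETCHES,
       SUM(MON$PAGE_MARKS)   AS PAGE_MARKS
FROM MON$IO_STATS
GROUP BY MON$STAT_GROUP
ORDER BY MON$STAT_GROUP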
There is no way to correlate the statistics to a specific table (as this information isn't table related, but - simplified - from executing statements which might cover multiple tables).
As I also commented, I am unsure what you mean with "RDB$ commands". But I am assuming you are talking about RDB$GET_CONTEXT() and RDB$SET_CONTEXT(). You could use RDB$GET_CONTEXT() to obtain the current connection (SESSION_ID) and transaction id (TRANSACTION_ID). These values can be used for MON$ATTACHMENT_ID and MON$TRANSACTION_ID in the monitoring tables. I don't think the other variables in the SYSTEM namespace are interesting, and those in USER_SESSION and USER_TRANSACTION are all user-defined (and initially those namespaces are empty).
It is far easier to use the CURRENT_CONNECTION and CURRENT_TRANSACTION context variables within a statement. As documented in doc\README.monitoring_tables.txt in the Firebird installation:
System variables CURRENT_CONNECTION and CURRENT_TRANSACTION could be used to select data about the current (for the caller) connection and transaction respectively. These variables correspond to the ID columns of the appropriate monitoring tables.
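For example, something along these lines (an untested sketch) returns the page I/O counters for the connection that runs it:
SELECT io.MON$PAGE_READS, io.MON$PAGE_WRITES, io.MON$PAGE_FETCHES, io.MON$PAGE_MARKS
FROM MON$ATTACHMENTS a
JOIN MON$IO_STATS io ON io.MON$STAT_ID = a.MON$STAT_ID
WHERE a.MON$ATTACHMENT_ID = CURRENT_CONNECTION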
Note: my answer is based on Firebird 2.5.
To present statistics by specific tables I use this SQL (FB 3)
select t.mon$table_name,trim(
case when r.mon$record_seq_reads>0 then 'Non index Reads: '||r.mon$record_seq_reads else '' end||
case when r.mon$record_idx_reads>0 then ' Index Reads: '||r.mon$record_idx_reads else '' end||
case when r.mon$record_inserts>0 then ' Inserts: '||r.mon$record_inserts else '' end||
case when r.mon$record_updates>0 then ' Updates: '||r.mon$record_updates else '' end||
case when r.mon$record_deletes>0 then ' Deletes: '||r.mon$record_deletes else '' end)
from MON$TABLE_STATS t
join mon$record_stats r on r.mon$stat_id=t.mon$record_stat_id
where t.mon$table_name not starting 'RDB$' and r.mon$stat_group=2
order by 1

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on the first letter of a certain nvarchar column, rather than the usual paging on the number of results.
And now I'm faced with the choice of whether to filter results using the LIKE operator or the equality (=) operator.
select *
from table
where name like @firstletter + '%'
vs.
select *
from table
where left(name, 1) = @firstletter
I've tried searching the net for speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not LEFT function.
"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is SQL server has to evaluate the function for every record that's returned.
"Substring" and other similar functions are also culprits.
Your best bet would be to measure the performance on real production data rather than trying to guess (or ask us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, hence why you should check).
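One quick way to do that comparison on your own data (a sketch; the table and column names are just the placeholders from your question):
declare @firstletter nvarchar(1)
set @firstletter = N'a'

SET STATISTICS IO ON
SET STATISTICS TIME ON

select * from table where name like @firstletter + '%'
select * from table where left(name, 1) = @firstletter

SET STATISTICS IO OFF
SET STATISTICS TIME OFF
-- compare logical reads and elapsed time in the Messages tab,
-- and check the execution plans for an index seek vs. a scan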
If this is a query you will be doing a lot, you should consider another (indexed) column which contains the lowercased first letter of name and have it set by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = @firstletter
That's because most databases are read far more often than written to, and this will amortise the cost of the calculation (done only for writes) across all reads.
It introduces redundant data but it's okay to do that for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
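If you'd rather not write the trigger, a persisted computed column gives much the same effect (a sketch; the column and index names are invented here, and [table]/name stand in for your real names):
alter table [table]
    add name_first_char_lower as lower(left(name, 1)) persisted

create index ix_table_name_first_char_lower
    on [table] (name_first_char_lower)

-- the paging query can then seek on the new index:
select * from [table] where name_first_char_lower = @firstletter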
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER,5) = 'PCNSF'
or LEFT(VOUCHER,5)='PCLTF'
or LEFT(VOUCHER,5) = 'PCACH'
or LEFT(VOUCHER,4)='PCWP'
or LEFT (VOUCHER,5) ='PCINT')
Returned 1434 rows in 1 min 27 seconds
My data is faster with the left 5. As an aside my overall query does hit some indexes.
I would always suggest using the LIKE operator when the search column has an index. I tested the above in my production environment with select count(column_name) from table_name where left(column_name,3)='AAA' OR left(column_name,3)='ABA' OR ... up to 9 OR clauses. The count returned 7301477 records in 4 seconds with LEFT, and in 1 second with LIKE, i.e. where column_name like 'AAA%' OR column_name like 'ABA%' or ... up to 9 LIKE clauses.
Calling a function in the WHERE clause is not best practice. Refer to http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.StartsWith(...) and you'll get just a LIKE function in the generated SQL instead of all this 'LEFT' craziness!
Depending upon your needs you will probably need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in the Entity Framework (non-Core) EntityFunctions, so I'm not sure how to do it for EF6.