Build tree-like hash (YAML) from a complex database extraction (SQL)? - perl

Introduction
Considering the tables given at the end of this question, I would like an algorithm or a simple solution that returns a nested tree from a YAML description. Using the YAML format itself is optional; what I really need as output is an array of ordered hashes that may or may not contain nested ordered hashes or arrays of ordered hashes.
In short, I am talking about a tree-like structure.
For a better understanding of my question, I will work through a simple example that covers all my needs; it is in fact the example I am using to implement this algorithm.
I decided to ask this question in parallel with my own investigation, as my knowledge of Perl is limited. I don't want to dig down the wrong tunnel, and that's why I am asking for help.
I am currently focusing on the DBI module. I tried to look at other modules such as DBIx::Tree::NestedSet, but I don't think that is what I need.
So, let's get down to the details of my example.
Example
The initial idea is to write a Perl program that takes a YAML description and outputs the extracted data.
This input description follows simple rules:
query is the data we are looking for. It can contain the following keys:
sql is the SQL query
hide hides columns from the final output. This field is used when a column is required by a subquery but is not wanted in the end.
subquery is a nested query executed for each row of the parent query
bind binds column values to the placeholders of the SQL query
hash tells the program to group the results not as an array of hashes but as a hash of hashes. This could actually be handed straight to DBI's selectall_hashref (see the sketch after this list). If this field is omitted, the output is listed as an array of ordered hashes.
key is the name of the key listed at the same level as the parent's results. We will see later that a key name can mask a result column.
list tells the program to output the result as an array. Notice that only one column can be displayed, i.e. list: name displays the list of names
connect is the DBI connection string
format is the output format. It can be either XML, YAML or JSON. I primarily focus on the YAML format, as it can easily be translated. When omitted, the default output is YAML.
indent is the number of spaces that make up one indentation level. The values tabs or tab are also supported.
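To make the hash key concrete, here is a minimal sketch (not part of the program below; the drink query is just an example) of the two DBI calls it maps to: selectall_hashref groups rows into a hash of hashes keyed by a column, while selectall_arrayref with a hash slice produces the default array-of-hashes layout.
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=einstein-puzzle.sqlite',
                       '', '', { RaiseError => 1 });

# hash: [name] --> { water => { name => 'water', calories => 0, ... }, ... }
my $by_name = $dbh->selectall_hashref('SELECT * FROM drink', 'name');

# no hash key --> [ { name => 'tea', calories => 1, ... }, ... ]
my $as_list = $dbh->selectall_arrayref('SELECT * FROM drink', { Slice => {} });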
In addition, we know that hashes are not ordered in Perl. Here, the order of the output keys is important and should be the same as in the SQL query. Because of this, I cannot simply use the YAML module :(
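A small illustration of that ordering problem, using the Tie::IxHash module that the current implementation below relies on (note that YAML.pm additionally sorts hash keys on Dump by default, which is why ordering remains a problem even with tied hashes):
use strict;
use warnings;
use Tie::IxHash;

# A plain hash returns its keys in an unpredictable order ...
my %plain = (nationality => 'Norvegian', smoke => 'Dunhill', pet => 'cats');

# ... while a hash tied to Tie::IxHash preserves insertion order
tie my %ordered, 'Tie::IxHash';
%ordered = (nationality => 'Norvegian', smoke => 'Dunhill', pet => 'cats');
print join(', ', keys %ordered), "\n";   # always: nationality, smoke, pet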
In summary, in the end we will just execute this command:
$ cat desc.yml | ./fetch > data.yml
The desc.yml description is given below:
---
connect: "dbi:SQLite:dbname=einstein-puzzle.sqlite"
indent: 4
query:
    - sql: SELECT * from people
      hide:
          - pet_id
          - house_id
          - id
      subquery:
          - key: brevage
            bind: id
            sql: |
                SELECT name, calories, potassium FROM drink
                LEFT JOIN people_has_drink ON drink.id = people_has_drink.id_drink
                WHERE people_has_drink.id_people = ?
            hash:
                - name
          - key: house
            sql: SELECT color as paint, size, id from house WHERE id = ?
            hide: id
            bind: house_id
            subquery:
                - key: color
                  sql: SELECT name, ral, hex from color WHERE short LIKE ?
                  bind: paint
          - key: pet
            sql: SELECT name from pet WHERE id = ?
            bind: pet_id
            list: name
Expected Output
From the description above, the output data would be this:
---
- nationality: Norvegian
  smoke: Dunhill
  brevage:
      orange juice:
          calories: 45
          potassium: 200 mg
      water:
          calories: 0
          potassium: 3 mg
  house:
      color:
          name: Zinc yellow
          ral: RAL 1018
          hex: '#F8F32B'
      paint: yellow
      size: small
  pet:
      - cats
- nationality: Brit
  smoke: Pall Mall
  brevage:
      milk:
          calories: 42
          potassium: 150 mg
  house:
      color:
          name: Vermilion
          ral: RAL 2002
          hex: '#CB2821'
      paint: red
      size: big
  pet:
      - birds
      - phasmatodea
Where I am
I have not yet fully implemented the nested queries. My current state is given here:
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

use DBI;
use YAML;
use Data::Dumper;
use Tie::IxHash;

# Read the configuration and open the database connection
my %yml = %{ Load(do { local $/; <DATA> }) };
my $dbh = DBI->connect($yml{connect});

# Fill the bind values of the first query with command-line information
my %bind;
for (@ARGV) {
    next unless /--(\w+)=(.*)/;
    $bind{$1} = $2;
}
my $q0 = $yml{query}[0];
if ($q0->{bind} and keys %bind > 0) {
    $q0->{bind_values} = arrayref($q0->{bind});
    $q0->{bind_values}[$_] = $bind{$q0->{bind}[$_]} foreach (0 .. @{$q0->{bind}} - 1);
}

# Fetch all data from the database recursively
my $out = fetch($q0);

sub fetch {
    # As long as we have a query, process it
    my $query = shift;
    return undef unless $query;
    $query->{bind_values} = [] unless ref $query->{bind_values} eq 'ARRAY';

    # Execute the SQL query
    my $sth = $dbh->prepare($query->{sql});
    $sth->execute(@{ $query->{bind_values} });
    my @columns = @{ $sth->{NAME} };

    # Fetch all the current level's data and preserve column order
    my @return;
    for my $row (@{ $sth->fetchall_arrayref() }) {
        my %data;
        tie %data, 'Tie::IxHash';
        $data{$columns[$_]} = $row->[$_] for (0 .. $#columns);
        for my $subquery (@{ $query->{subquery} }) {
            my @bind;
            push @bind, $data{$_} for (@{ arrayref($subquery->{bind}) });
            $subquery->{bind_values} = \@bind;
            my $sub = fetch($subquery);
            # Present the output as a list
            if ($subquery->{list}) {
                #if ( map { $query->{list} eq $_ } keys %$sub )
                my @list;
                for (@$sub) {
                    push @list, $_->{$subquery->{list}};
                }
                $sub = \@list;
            }
            if ($subquery->{key}) {
                $data{$subquery->{key}} = $sub;
            } else {
                die "[Error] Key is missing for query '$subquery->{sql}'";
            }
        }
        # Remove unwanted columns from the output
        if ($query->{hide}) {
            delete $data{$_} for (@{ arrayref($query->{hide}) });
        }
        push @return, \%data;
    }
    \@return;
}

DumpYaml($out);

sub arrayref {
    my $ref = shift;
    return (ref $ref ne 'ARRAY') ? [$ref] : $ref;
}

sub DumpYaml {
    # I am not happy with this current dumper: I cannot specify the indent
    # and it does not preserve the extraction order
    print Dump shift;
}

__DATA__
---
connect: "dbi:SQLite:dbname=einstein-puzzle.sqlite"
indent: 4
query:
    - sql: SELECT * from people
      hide:
          - pet_id
          - house_id
          - id
      subquery:
          - key: brevage
            bind: id
            sql: |
                SELECT name, calories, potassium FROM drink
                LEFT JOIN people_has_drink ON drink.id = people_has_drink.id_drink
                WHERE people_has_drink.id_people = ?
            hash:
                - name
          - key: house
            sql: SELECT color as paint, size, id from house WHERE id = ?
            hide: id
            bind: house_id
            subquery:
                - key: color
                  sql: SELECT short, ral, hex from color WHERE short LIKE ?
                  bind: paint
          - key: pet
            sql: SELECT name from pet WHERE id = ?
            bind: pet_id
            list: name
And this is what output I get:
---
- brevage:
    - calories: 0
      name: water
      potassium: 3 mg
    - calories: 45
      name: orange juice
      potassium: 200 mg
  house:
    - color:
        - hex: '#F8F32B'
          ral: RAL 1018
          short: yellow
      paint: yellow
      size: small
  nationality: Norvegian
  pet:
    - cats
  smoke: Dunhill
- brevage:
    - calories: 42
      name: milk
      potassium: 150 mg
  house:
    - color:
        - hex: '#CB2821'
          ral: RAL 2002
          short: red
      paint: red
      size: big
  nationality: Brit
  pet:
    - birds
    - phasmatodea
  smoke: Pall Mall
Database
My test database is an SQLite DB whose tables are listed below:
Table People
.----+-------------+----------+--------+-----------.
| id | nationality | house_id | pet_id | smoke |
+----+-------------+----------+--------+-----------+
| 1 | Norvegian | 4 | 3 | Dunhill |
| 2 | Brit | 1 | 2 | Pall Mall |
'----+-------------+----------+--------+-----------'
Table Drink
.----+--------------+----------+-----------.
| id | name | calories | potassium |
+----+--------------+----------+-----------+
| 1 | tea | 1 | 18 mg |
| 2 | coffee | 0 | 49 mg |
| 3 | milk | 42 | 150 mg |
| 4 | beer | 43 | 27 mg |
| 5 | water | 0 | 3 mg |
| 6 | orange juice | 45 | 200 mg |
'----+--------------+----------+-----------'
Table People Has Drink
.-----------+----------.
| id_people | id_drink |
+-----------+----------+
| 1 | 5 |
| 1 | 6 |
| 2 | 3 |
'-----------+----------'
Table House
+----+--------+--------+
| id | color | size |
+----+--------+--------+
| 1 | red | big |
| 2 | green | small |
| 3 | white | middle |
| 4 | yellow | small |
| 5 | blue | huge |
+----+--------+--------+
Table Color
.--------+-------------+----------+---------.
| short | color | ral | hex |
+--------+-------------+----------+---------+
| red | Vermilion | RAL 2002 | #CB2821 |
| green | Pale green | RAL 6021 | #89AC76 |
| white | Light grey | RAL 7035 | #D7D7D7 |
| yellow | Zinc yellow | RAL 1018 | #F8F32B |
| blue | Capri blue | RAL 5019 | #1B5583 |
'--------+-------------+----------+---------'
Table Pet
+----+-------------+
| id | name |
+----+-------------+
| 1 | dogs |
| 2 | birds |
| 3 | cats |
| 4 | horses |
| 5 | fishes |
| 2 | phasmatodea |
+----+-------------+
Database data
If you wish to use the same data as mine, the dump below gives you all you need:
BEGIN TRANSACTION;
CREATE TABLE "pet" (
`id` INTEGER,
`name` TEXT
);
INSERT INTO `pet` VALUES (1,'dogs');
INSERT INTO `pet` VALUES (2,'birds');
INSERT INTO `pet` VALUES (3,'cats');
INSERT INTO `pet` VALUES (4,'horses');
INSERT INTO `pet` VALUES (5,'fishes');
INSERT INTO `pet` VALUES (2,'phasmatodea');
CREATE TABLE `people_has_drink` (
`id_people` INTEGER NOT NULL,
`id_drink` INTEGER NOT NULL,
PRIMARY KEY(id_people,id_drink)
);
INSERT INTO `people_has_drink` VALUES (1,5);
INSERT INTO `people_has_drink` VALUES (1,6);
INSERT INTO `people_has_drink` VALUES (2,3);
CREATE TABLE "people" (
`id` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
`nationality` VARCHAR(45),
`house_id` INT,
`pet_id` INT,
`smoke` VARCHAR(45)
);
INSERT INTO `people` VALUES (1,'Norvegian',4,3,'Dunhill');
INSERT INTO `people` VALUES (2,'Brit',1,2,'Pall Mall');
CREATE TABLE "house" (
`id` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
`color` TEXT,
`size` TEXT
);
INSERT INTO `house` VALUES (1,'red','big');
INSERT INTO `house` VALUES (2,'green','small');
INSERT INTO `house` VALUES (3,'white','middle');
INSERT INTO `house` VALUES (4,'yellow','small');
INSERT INTO `house` VALUES (5,'blue','huge');
CREATE TABLE `drink` (
`id` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
`name` TEXT,
`calories` INTEGER,
`potassium` TEXT
);
INSERT INTO `drink` VALUES (1,'tea',1,'18 mg');
INSERT INTO `drink` VALUES (2,'coffee',0,'49 mg');
INSERT INTO `drink` VALUES (3,'milk',42,'150 mg');
INSERT INTO `drink` VALUES (4,'beer',43,'27 mg');
INSERT INTO `drink` VALUES (5,'water',0,'3 mg');
INSERT INTO `drink` VALUES (6,'orange juice',45,'200 mg');
CREATE TABLE `color` (
`short` TEXT UNIQUE,
`color` TEXT,
`ral` TEXT,
`hex` TEXT,
PRIMARY KEY(short)
);
INSERT INTO `color` VALUES ('red','Vermilion','RAL 2002','#CB2821');
INSERT INTO `color` VALUES ('green','Pale green','RAL 6021','#89AC76');
INSERT INTO `color` VALUES ('white','Light grey','RAL 7035','#D7D7D7');
INSERT INTO `color` VALUES ('yellow','Zinc yellow','RAL 1018','#F8F32B');
INSERT INTO `color` VALUES ('blue','Capri blue','RAL 5019','#1B5583');
COMMIT;

Is my implementation good?
This is a rather broad question, and the answer probably depends on what you want from your code. For instance:
Does it work? Does it have all the features you need? Does it do what you want? Does it respond appropriately for all the ranges of inputs you want to cater for (and input you don't)? If you aren't sure, write some tests.
Is it fast enough? If not, what are the slow bits? Use Devel::NYTProf to find them.
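For example, profiling the fetch script from this question could look like this (a sketch; it assumes Devel::NYTProf is installed and that the script reads the description on stdin as shown earlier):
$ perl -d:NYTProf ./fetch < desc.yml > data.yml
$ nytprofhtml --open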
If it's working, you probably also want to turn your code into a module rather than just a script so you can use it again.
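A minimal sketch of that refactoring, under an assumed (hypothetical) package name:
package My::TreeFetch;   # hypothetical module name
use strict;
use warnings;
use DBI;

sub new {
    my ($class, %args) = @_;
    my $self = bless {}, $class;
    $self->{dbh} = DBI->connect($args{connect}) or die DBI->errstr;
    return $self;
}

# Move the recursive fetch() from the script here as a method,
# replacing the global $dbh with $self->{dbh}.
sub fetch_tree {
    my ($self, $query) = @_;
    # ... body of fetch() from the script above ...
}

1;
Callers could then write: my $tree = My::TreeFetch->new(connect => $yml{connect})->fetch_tree($yml{query}[0]);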
And if not (I'm supposing that I am doing it all wrong), what modules should I use to get the desired behavior?
It sounds very much like you're trying to do something like DBIx::Class (aka DBIC) does when you ask it to prefetch; it will build you a data structure of objects.
If you need to do this dynamically in response to arbitrary databases and YAML, that's not quite what DBIC was designed for; it's probably possible, but it will likely involve dynamically creating packages, which will not be easy.
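For a fixed schema, prefetching with DBIC looks roughly like this (a sketch, not a drop-in solution: MyApp::Schema and the relationship names house and pet are assumed to be defined elsewhere, e.g. generated by DBIx::Class::Schema::Loader):
use strict;
use warnings;
use MyApp::Schema;   # hypothetical DBIx::Class schema for the puzzle DB

my $schema = MyApp::Schema->connect('dbi:SQLite:dbname=einstein-puzzle.sqlite');

# One JOINed query; related rows come back as objects instead of hand-built hashes
my $people = $schema->resultset('People')->search(
    {},
    { prefetch => [ 'house', 'pet' ] },
);

while (my $person = $people->next) {
    printf "%s lives in a %s house\n",
        $person->nationality, $person->house->size;
}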

Related

How to update JSONB column with value coming from another table column in PostgreSQL

I have a source table, which looks like this:
public.source
Id | part_no | category
1 | 01270-4 | Landscape
2 | 01102-3 | Sports
Then I have a target table with a jsonb column (combinations), which looks like this:
public.target
Id | part_no | combinations
7 | 01270-4 | {"subject":""}
8 | 01102-3 | {"subject":""}
My problem is: how can I update the jsonb column (combinations) of the target table with the values coming from the source table, matching on the part_no column?
The desired output:
Id | part_no | combinations
7 | 01270-4 | {"subject":"Landscape"}
8 | 01102-3 | {"subject":"Sports"}
I tried the below, but it gives an error:
UPDATE public.target t
SET combinations = jsonb_set(combinations,'{subject}','s.category',false)
FROM public.source s
WHERE s.part_no = t.part_no;
ERROR: invalid input syntax for type json
LINE 2: SET combinations = jsonb_set(combinations,'{subject}', 's.categor...
^
DETAIL: Token "s" is invalid.
CONTEXT: JSON data, line 1: s...
SQL state: 22P02
Character: 77
You should use the to_jsonb function to convert s.category to JSON:
Demo
UPDATE public.target t
SET combinations = jsonb_set(combinations,'{subject}',to_jsonb(s.category),false)
FROM public.source s
WHERE s.part_no = t.part_no
Or you can join and update the JSON field by merging in a jsonb_build_object:
Demo
UPDATE public.target t
SET combinations = combinations || jsonb_build_object('subject', s.category)
FROM public.source s
WHERE s.part_no = t.part_no

problems with full-text search in postgres

I have the following table and data:
/* script for people table, with field tsvector and gin */
CREATE TABLE public.people (
id INTEGER,
name VARCHAR(30),
lastname VARCHAR(30),
complete TSVECTOR
)
WITH (oids = false);
CREATE INDEX idx_complete ON public.people
USING gin (complete);
/* data for people table */
INSERT INTO public.people ("id", "name", "lastname", "complete")
VALUES
(1, 'MICHAEL', 'BRYANT BRYANT', '''bryant'':2,3 ''michael'':1'),
(2, 'HENRY STEVEN', 'BUSH TIESSEN', '''bush'':3 ''henri'':1 ''steven'':2 ''tiessen'':4'),
(3, 'WILLINGTON STEVEN', 'STEPHENS FLINN', '''flinn'':4 ''stephen'':3 ''steven'':2 ''willington'':1'),
(4, 'BRET', 'MARTINEZ AROCH', '''aroch'':3 ''bret'':1 ''martinez'':2'),
(5, 'TERENCE BERT', 'CAVALIERE ENRON', '''bert'':2 ''cavalier'':3 ''terenc'':1');
I need to retrieve the names and last names according to the tsvector field. Currently I have this query:
SELECT * FROM people WHERE complete @@ to_tsquery('WILLINGTON & FLINN');
And the result is right (the third record). BUT if I try with
SELECT * FROM people WHERE complete @@ to_tsquery('STEVEN & FLINN');
/* the same record! */
I get no results. Why? What can I do?
You should search your table with the same language that was used when the values in your complete field were inserted.
Check the result of this query, which compares english and german:
select * ,
to_tsvector('english', concat_ws(' ', name, lastname )) as english,
to_tsvector('german', concat_ws(' ', name, lastname )) as german
from public.people
So this should work for you:
SELECT * FROM people WHERE complete @@ to_tsquery('english','STEVEN & FLINN');
You are probably using a text search configuration where either STEVEN or FLINN are modified by stemming.
I can reproduce this here:
test=> SHOW default_text_search_config;
default_text_search_config
----------------------------
pg_catalog.german
(1 row)
test=> SELECT complete FROM public.people WHERE id = 3;
complete
-------------------------------------------------
'flinn':4 'stephen':3 'steven':2 'willington':1
(1 row)
test=> SELECT * FROM ts_debug('STEVEN & FLINN');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+--------+---------------+-------------+---------
asciiword | Word, all ASCII | STEVEN | {german_stem} | german_stem | {stev}
blank | Space symbols | | {} | |
blank | Space symbols | & | {} | |
asciiword | Word, all ASCII | FLINN | {german_stem} | german_stem | {flinn}
(4 rows)
test=> SELECT * FROM public.people
WHERE complete @@ to_tsquery('STEVEN & FLINN');
id | name | lastname | complete
----+------+----------+----------
(0 rows)
So you see, the German Snowball dictionary stems STEVEN to stev.
Since complete contains the unstemmed version steven, no match is found.
You should use the same text search configuration when you populate complete and in the query.
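Concretely, assuming the english configuration suggested above is the right one for your data, that means rebuilding the stored vectors and querying with the same explicit configuration:
-- Rebuild the stored vectors with an explicit configuration ...
UPDATE public.people
SET complete = to_tsvector('english', concat_ws(' ', name, lastname));

-- ... and query with that same configuration
SELECT * FROM public.people
WHERE complete @@ to_tsquery('english', 'STEVEN & FLINN');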

EXISTS(select 1 from t1) vs EXISTS(select * from t1) [duplicate]

I used to write my EXISTS checks like this:
IF EXISTS (SELECT * FROM TABLE WHERE Columns=@Filters)
BEGIN
    UPDATE TABLE SET ColumnsX=ValuesX WHERE Columns=@Filters
END
One of the DBAs in a previous life told me that when I do an EXISTS clause, I should use SELECT 1 instead of SELECT *:
IF EXISTS (SELECT 1 FROM TABLE WHERE Columns=@Filters)
BEGIN
    UPDATE TABLE SET ColumnsX=ValuesX WHERE Columns=@Filters
END
Does this really make a difference?
No, SQL Server is smart and knows it is being used for an EXISTS, and returns NO DATA to the system.
Quoth Microsoft:
http://technet.microsoft.com/en-us/library/ms189259.aspx?ppud=4
The select list of a subquery introduced by EXISTS almost always consists of an asterisk (*). There is no reason to list column names because you are just testing whether rows that meet the conditions specified in the subquery exist.
To check yourself, try running the following:
SELECT whatever
FROM yourtable
WHERE EXISTS ( SELECT 1/0
               FROM someothertable
               WHERE a_valid_clause )
If it was actually doing something with the SELECT list, it would throw a div by zero error. It doesn't.
EDIT: Note, the SQL Standard actually talks about this.
ANSI SQL 1992 Standard, pg 191, http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
3) Case:
a) If the <select list> "*" is simply contained in a <subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.
The reason for this misconception is presumably because of the belief that it will end up reading all columns. It is easy to see that this is not the case.
CREATE TABLE T
(
X INT PRIMARY KEY,
Y INT,
Z CHAR(8000)
)
CREATE NONCLUSTERED INDEX NarrowIndex ON T(Y)
IF EXISTS (SELECT * FROM T)
PRINT 'Y'
Gives this plan [execution plan image not reproduced here].
This shows that SQL Server was able to use the narrowest index available to check the result despite the fact that the index does not include all columns. The index access is under a semi join operator which means that it can stop scanning as soon as the first row is returned.
So it is clear the above belief is wrong.
However Conor Cunningham from the Query Optimiser team explains here that he typically uses SELECT 1 in this case as it can make a minor performance difference in the compilation of the query.
The QP will take and expand all *'s early in the pipeline and bind them to objects (in this case, the list of columns). It will then remove unneeded columns due to the nature of the query.
So for a simple EXISTS subquery like this:
SELECT col1 FROM MyTable WHERE EXISTS (SELECT * FROM Table2 WHERE MyTable.col1=Table2.col2)
The * will be expanded to some potentially big column list and then it will be determined that the semantics of the EXISTS does not require any of those columns, so basically all of them can be removed.
"SELECT 1" will avoid having to examine any unneeded metadata for that table during query compilation.
However, at runtime the two forms of the query will be identical and will have identical runtimes.
I tested four possible ways of expressing this query on an empty table with various numbers of columns. SELECT 1 vs SELECT * vs SELECT Primary_Key vs SELECT Other_Not_Null_Column.
I ran the queries in a loop using OPTION (RECOMPILE) and measured the average number of executions per second. Results below
+-------------+----------+---------+---------+--------------+
| Num of Cols | * | 1 | PK | Not Null col |
+-------------+----------+---------+---------+--------------+
| 2 | 2043.5 | 2043.25 | 2073.5 | 2067.5 |
| 4 | 2038.75 | 2041.25 | 2067.5 | 2067.5 |
| 8 | 2015.75 | 2017 | 2059.75 | 2059 |
| 16 | 2005.75 | 2005.25 | 2025.25 | 2035.75 |
| 32 | 1963.25 | 1967.25 | 2001.25 | 1992.75 |
| 64 | 1903 | 1904 | 1936.25 | 1939.75 |
| 128 | 1778.75 | 1779.75 | 1799 | 1806.75 |
| 256 | 1530.75 | 1526.5 | 1542.75 | 1541.25 |
| 512 | 1195 | 1189.75 | 1203.75 | 1198.5 |
| 1024 | 694.75 | 697 | 699 | 699.25 |
+-------------+----------+---------+---------+--------------+
| Total | 17169.25 | 17171 | 17408 | 17408 |
+-------------+----------+---------+---------+--------------+
As can be seen there is no consistent winner between SELECT 1 and SELECT * and the difference between the two approaches is negligible. The SELECT Not Null col and SELECT PK do appear slightly faster though.
All four of the queries degrade in performance as the number of columns in the table increases.
As the table is empty, this relationship seems explicable only by the amount of column metadata. For COUNT(1) it is easy to see that this gets rewritten to COUNT(*) at some point in the process, from the below.
SET SHOWPLAN_TEXT ON;
GO
SELECT COUNT(1)
FROM master..spt_values
Which gives the following plan
|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1004],0)))
|--Stream Aggregate(DEFINE:([Expr1004]=Count(*)))
|--Index Scan(OBJECT:([master].[dbo].[spt_values].[ix2_spt_values_nu_nc]))
Attaching a debugger to the SQL Server process and randomly breaking whilst executing the below
DECLARE @V int
WHILE (1=1)
SELECT @V=1 WHERE EXISTS (SELECT 1 FROM ##T) OPTION(RECOMPILE)
I found that in the cases where the table has 1,024 columns the call stack usually looks something like the below, indicating that a large proportion of the time is indeed spent loading column metadata even when SELECT 1 is used (in the case where the table has 1 column, randomly breaking didn't hit this bit of the call stack in 10 attempts):
sqlservr.exe!CMEDAccess::GetProxyBaseIntnl() - 0x1e2c79 bytes
sqlservr.exe!CMEDProxyRelation::GetColumn() + 0x57 bytes
sqlservr.exe!CAlgTableMetadata::LoadColumns() + 0x256 bytes
sqlservr.exe!CAlgTableMetadata::Bind() + 0x15c bytes
sqlservr.exe!CRelOp_Get::BindTree() + 0x98 bytes
sqlservr.exe!COptExpr::BindTree() + 0x58 bytes
sqlservr.exe!CRelOp_FromList::BindTree() + 0x5c bytes
sqlservr.exe!COptExpr::BindTree() + 0x58 bytes
sqlservr.exe!CRelOp_QuerySpec::BindTree() + 0xbe bytes
sqlservr.exe!COptExpr::BindTree() + 0x58 bytes
sqlservr.exe!CScaOp_Exists::BindScalarTree() + 0x72 bytes
... Lines omitted ...
msvcr80.dll!_threadstartex(void * ptd=0x0031d888) Line 326 + 0x5 bytes C
kernel32.dll!_BaseThreadStart@8() + 0x37 bytes
This manual profiling attempt is backed up by the VS 2012 code profiler which shows a very different selection of functions consuming the compilation time for the two cases (Top 15 Functions 1024 columns vs Top 15 Functions 1 column).
Both the SELECT 1 and SELECT * versions wind up checking column permissions and fail if the user is not granted access to all columns in the table.
An example I cribbed from a conversation on the heap
CREATE USER blat WITHOUT LOGIN;
GO
CREATE TABLE dbo.T
(
X INT PRIMARY KEY,
Y INT,
Z CHAR(8000)
)
GO
GRANT SELECT ON dbo.T TO blat;
DENY SELECT ON dbo.T(Z) TO blat;
GO
EXECUTE AS USER = 'blat';
GO
SELECT 1
WHERE EXISTS (SELECT 1
FROM T);
/* ↑↑↑↑
Fails unexpectedly with
The SELECT permission was denied on the column 'Z' of the
object 'T', database 'tempdb', schema 'dbo'.*/
GO
REVERT;
DROP USER blat
DROP TABLE T
So one might speculate that the minor apparent difference when using SELECT some_not_null_col is that it only winds up checking permissions on that specific column (though still loads the metadata for all). However this doesn't seem to fit with the facts as the percentage difference between the two approaches if anything gets smaller as the number of columns in the underlying table increases.
In any event I won't be rushing out and changing all my queries to this form as the difference is very minor and only apparent during query compilation. Removing the OPTION (RECOMPILE) so that subsequent executions can use a cached plan gave the following.
+-------------+-----------+------------+-----------+--------------+
| Num of Cols | * | 1 | PK | Not Null col |
+-------------+-----------+------------+-----------+--------------+
| 2 | 144933.25 | 145292 | 146029.25 | 143973.5 |
| 4 | 146084 | 146633.5 | 146018.75 | 146581.25 |
| 8 | 143145.25 | 144393.25 | 145723.5 | 144790.25 |
| 16 | 145191.75 | 145174 | 144755.5 | 146666.75 |
| 32 | 144624 | 145483.75 | 143531 | 145366.25 |
| 64 | 145459.25 | 146175.75 | 147174.25 | 146622.5 |
| 128 | 145625.75 | 143823.25 | 144132 | 144739.25 |
| 256 | 145380.75 | 147224 | 146203.25 | 147078.75 |
| 512 | 146045 | 145609.25 | 145149.25 | 144335.5 |
| 1024 | 148280 | 148076 | 145593.25 | 146534.75 |
+-------------+-----------+------------+-----------+--------------+
| Total | 1454769 | 1457884.75 | 1454310 | 1456688.75 |
+-------------+-----------+------------+-----------+--------------+
The test script I used can be found here
Best way to know is to performance test both versions and check out the execution plan for both versions. Pick a table with lots of columns.
There is no difference in SQL Server and it has never been a problem in SQL Server. The optimizer knows that they are the same. If you look at the execution plans, you will see that they are identical.
Personally I find it very, very hard to believe that they don't optimize to the same query plan. But the only way to know in your particular situation is to test it. If you do, please report back!
Not any real difference but there might be a very small performance hit. As a rule of thumb you should not ask for more data than you need.

Split a string and populate a table for all records in table in SQL Server 2008 R2

I have a table EmployeeMoves:
| EmployeeID | CityIDs
+------------------------------
| 24 | 23,21,22
| 25 | 25,12,14
| 29 | 1,2,5
| 31 | 7
| 55 | 11,34
| 60 | 7,9,21,23,30
I'm trying to figure out how to expand the comma-delimited values from the EmployeeMoves.CityIDs column to populate an EmployeeCities table, which should look like this:
| EmployeeID | CityID
+------------------------------
| 24 | 23
| 24 | 21
| 24 | 22
| 25 | 25
| 25 | 12
| 25 | 14
| ... and so on
I already have a function called SplitADelimitedList that splits a comma-delimited list of integers into a rowset. It takes the delimited list as a parameter. The SQL below will give me a table with split values under the column Value:
select value from dbo.SplitADelimitedList ('23,21,1,4');
| Value
+-----------
| 23
| 21
| 1
| 4
The question is: How do I populate EmployeeCities from EmployeeMoves with a single (even if complex) SQL statement using the comma-delimited list of CityIDs from each row in the EmployeeMoves table, but without any cursors or looping in T-SQL? I could have 100 records in the EmployeeMoves table for 100 different employees.
This is how I tried to solve the problem. It seems to work and performs very quickly.
INSERT INTO EmployeeCities
SELECT
em.EmployeeID,
c.Value
FROM EmployeeMoves em
CROSS APPLY dbo.SplitADelimitedList(em.CityIDs) c;
UPDATE 1:
This update provides the definition of the user-defined function dbo.SplitADelimitedList, which is used in the above query to split a comma-delimited list into a table of integer values.
CREATE FUNCTION dbo.SplitADelimitedList
(
    @String NVARCHAR(MAX)
)
RETURNS @SplittedValues TABLE(
    Value INT
)
AS
BEGIN
    DECLARE @SplitLength INT
    DECLARE @Delimiter VARCHAR(10)

    SET @Delimiter = ','  -- set this to the delimiter you are using

    WHILE len(@String) > 0
    BEGIN
        SELECT @SplitLength = (CASE charindex(@Delimiter, @String)
                                   WHEN 0 THEN datalength(@String) / 2
                                   ELSE charindex(@Delimiter, @String) - 1
                               END)

        INSERT INTO @SplittedValues
        SELECT cast(substring(@String, 1, @SplitLength) AS INTEGER)
        WHERE ltrim(rtrim(isnull(substring(@String, 1, @SplitLength), ''))) <> '';

        SELECT @String = (CASE ((datalength(@String) / 2) - @SplitLength)
                              WHEN 0 THEN ''
                              ELSE right(@String, (datalength(@String) / 2) - @SplitLength - 1)
                          END)
    END
    RETURN
END
Preface
This is not the right way to do it. You shouldn't create comma-delimited lists in SQL Server. This violates first normal form, which should sound like an unbelievably vile expletive to you.
It is trivial for a client-side application to select rows of employees and related cities and display this as a comma-separated list. It shouldn't be done in the database. Please do everything you can to avoid this kind of construction in the future. If at all possible, you should refactor your database.
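For reference, the properly normalized shape is exactly the EmployeeCities table this question already populates; a minimal sketch of its definition (the column types and constraints here are assumptions):
-- One row per employee/city pair instead of a comma-delimited list
CREATE TABLE dbo.EmployeeCities (
    EmployeeID INT NOT NULL,
    CityID     INT NOT NULL,
    PRIMARY KEY (EmployeeID, CityID)
);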
The Right Answer
To get the list of cities, properly expanded, from a table containing lists of cities, you can do this:
INSERT dbo.EmployeeCities
SELECT
M.EmployeeID,
C.CityID
FROM
EmployeeMoves M
CROSS APPLY dbo.SplitADelimitedList(M.CityIDs) C
;
The Wrong Answer
I wrote this answer due to a misunderstanding of what you wanted: I thought you were trying to query against properly-stored data to produce a list of comma-separated CityIDs. But I realize now you wanted the reverse: to query the list of cities using existing comma-separated values already stored in a column.
WITH EmployeeData AS (
SELECT
M.EmployeeID,
M.CityID
FROM
dbo.SplitADelimitedList ('23,21,1,4') C
INNER JOIN dbo.EmployeeMoves M
ON Convert(int, C.Value) = M.CityID
)
SELECT
E.EmployeeID,
CityIDs = Substring((
SELECT ',' + Convert(varchar(max), CityID)
FROM EmployeeData C
WHERE E.EmployeeID = C.EmployeeID
FOR XML PATH (''), TYPE
).value('.[1]', 'varchar(max)'), 2, 2147483647)
FROM
(SELECT DISTINCT EmployeeID FROM EmployeeData) E
;
Part of my difficulty in understanding is that your question is a bit disorganized. Next time, please clearly label your example data and show what you have, and what you're trying to work toward. Since you put the data for EmployeeCities last, it looked like it was what you were trying to achieve. It's not a good use of people's time when questions are not laid out well.

Filter an ID Column against a range of values

I have the following SQL:
SELECT ',' + LTRIM(RTRIM(CAST(vessel_is_id as CHAR(2)))) + ',' AS 'Id'
FROM Vessels
WHERE ',' + LTRIM(RTRIM(CAST(vessel_is_id as varCHAR(2)))) + ',' IN (',1,2,3,4,5,6,')
Basically, I want to filter the vessel_is_id against a variable list of integer values (which is passed into the stored proc as a varchar). Now, the above SQL does not work. I do have rows in the table with a vessel_is_id of 1, but they are not returned.
Can someone suggest a better approach to this for me? Or, if the above approach is OK, what is wrong with it?
EDIT:
Sample data
| vessel_is_id |
| ------------ |
| 1 |
| 2 |
| 5 |
| 3 |
| 1 |
| 1 |
So I want to return all of the above where vessel_is_id is in a variable filter, e.g. '1,3', which should return 4 records.
Cheers.
Jas.
IF OBJECT_ID(N'dbo.fn_ArrayToTable', N'TF') IS NOT NULL
    DROP FUNCTION [dbo].[fn_ArrayToTable]
GO
CREATE FUNCTION [dbo].fn_ArrayToTable (@array VARCHAR(MAX))
-- =============================================
-- Author:      Dan Andrews
-- Create date: 04/11/11
-- Description: String to Table-Valued Function
-- =============================================
RETURNS @output TABLE (data VARCHAR(256))
AS
BEGIN
    DECLARE @pointer INT
    SET @pointer = CHARINDEX(',', @array)

    WHILE @pointer != 0
    BEGIN
        INSERT INTO @output
        SELECT RTRIM(LTRIM(LEFT(@array, @pointer - 1)))

        SELECT @array   = RIGHT(@array, LEN(@array) - @pointer),
               @pointer = CHARINDEX(',', @array)
    END

    -- Keep whatever remains after the last delimiter (the original loop drops it)
    IF LTRIM(RTRIM(@array)) <> ''
        INSERT INTO @output
        SELECT RTRIM(LTRIM(@array))

    RETURN
END
Which you may apply like:
SELECT * FROM dbo.fn_ArrayToTable('2,3,4,5,2,2')
and in your case:
SELECT LTRIM(RTRIM(CAST(vessel_is_id AS CHAR(2)))) AS 'Id'
FROM Vessels
WHERE LTRIM(RTRIM(CAST(vessel_is_id AS VARCHAR(2)))) IN (SELECT data FROM dbo.fn_ArrayToTable('1,2,3,4,5,6'))
Since SQL Server doesn't have an array type, you may want to consider passing in a set of values as an XML parameter. You can then turn the XML into a relation and join on it, drawing on the time-tested pubs database for example. Of course, your client may or may not have an easy time generating the XML for the parameter value, but this approach is safe from SQL injection, which most "comma separated" value approaches are not.
declare @stateSelector xml
set @stateSelector = '<values>
    <value>or</value>
    <value>ut</value>
    <value>tn</value>
</values>'

select * from authors
where state in ( select c.value('.', 'varchar(2)') from @stateSelector.nodes('//value') as t(c) )