SQL Server: Why is SELECT on case insensitive column faster than on case sensitive? - database-performance

I use SQL Server 2016 Express and a Java application with JDBC driver version 4.2.
My database has a collation of Latin1_General_CI_AS (case insensitive). My table has a column of type VARCHAR(128) NOT NULL, and there is a unique index on that column.
My test scenario is as follows:
After inserting 150,000 strings of 48 characters each, I run 200 SELECTs for randomly chosen, existing strings and measure the total execution time of all queries.
Then I drop the index, alter the table to change the column's collation to Latin1_General_CS_AS (case sensitive), and recreate the unique index.
After that, the same 200 selects take more time in total.
In both cases (CI and CS) the execution plans are simple and identical (search by using the index).
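The collation switch between test runs looks roughly like this (table, column and index names are placeholders, not my real schema):
DROP INDEX UX_MyTable_MyColumn ON dbo.MyTable;
ALTER TABLE dbo.MyTable
    ALTER COLUMN MyColumn VARCHAR(128) COLLATE Latin1_General_CS_AS NOT NULL;
CREATE UNIQUE INDEX UX_MyTable_MyColumn ON dbo.MyTable (MyColumn);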
The query execution time does not depend on case sensitivity alone: with the CS collation it grows noticeably when the strings share an identical prefix. Here are my results (execution time in seconds):
+----+---------+------------------+-------------------+-------------------+
|    | RND(48) | CONST(3)+RND(45) | CONST(10)+RND(38) | CONST(20)+RND(28) |
+----+---------+------------------+-------------------+-------------------+
| CI |       6 |                6 |                 7 |                 9 |
| CS |      10 |               20 |                45 |                78 |
+----+---------+------------------+-------------------+-------------------+
The longer the identical prefix of the random strings, the more time the case-sensitive queries take.
Why is the search on a case-insensitive column faster than on a case-sensitive column?
What is the reason for the identical-prefix behavior?

The reason is that your SQL Server installation (I am guessing) was done with a CI collation. This means your tempdb and master databases are using CI, and so is your own database's default collation. Therefore, even though you changed the character column to CS, when that column is used in tempdb for sorting/merging operations, the work is executed in a CI context. To get an accurate comparison, change your installation collation to CS, or run the comparisons side by side on two SQL Server instances - one using CS and one using CI.
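You can verify where CI is still in effect with something like this (dbo.MyTable is a placeholder for your table):
-- Server (instance) collation
SELECT SERVERPROPERTY('Collation') AS server_collation;
-- tempdb and current database collations
SELECT DATABASEPROPERTYEX('tempdb', 'Collation') AS tempdb_collation,
       DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS current_db_collation;
-- Column-level collations in your table
SELECT name, collation_name
FROM sys.columns
WHERE object_id = OBJECT_ID('dbo.MyTable')
  AND collation_name IS NOT NULL;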

Related

PostgreSQL update with indexed joins

PostgreSQL 14 for Windows on a medium-sized machine, using the default settings - literally as shipped. I'm new to PostgreSQL, coming from MS SQL Server.
A seemingly simple statement that runs in a minute in MS SQL Server is taking hours in PostgreSQL, and I'm not sure why. I'm busy migrating over, i.e. it is the exact same data on the exact same hardware.
It's an update statement that joins a master table (roughly 1,000 records) and a fact table (roughly 8 million records). I've masked the tables and the exact application here, but the structure exactly reflects the real data.
CREATE TABLE public.tmaster(
masterid SERIAL NOT NULL,
masterfield1 character varying,
PRIMARY KEY(masterid)
);
-- I've read that the primary key tag creates an index on that field automatically - correct?
CREATE TABLE public.tfact(
factid SERIAL NOT NULL,
masterid int not null,
fieldtoupdate character varying NULL,
PRIMARY KEY(factid),
CONSTRAINT fk_public_tfact_tmaster
FOREIGN KEY(masterid)
REFERENCES public.tmaster(masterid)
);
CREATE INDEX idx_public_fact_master on public.tfact(masterid);
The idea is to set public.tfact.fieldtoupdate = public.tmaster.masterfield1
I've tried the following ways (all taking over an hour to complete):
update public.tfact b
set fieldtoupdate = c.masterfield1
from public.tmaster c
where c.masterid = b.masterid;
update public.tfact b
set fieldtoupdate = c.masterfield1
from public.tfact bb
join public.tmaster c
on c.masterid = bb.masterid
where bb.factid = b.factid;
with t as (
select b.factid,
c.masterfield1 as fieldtoupdate -- tmaster has masterfield1, not fieldtoupdate
from public.tfact b
join public.tmaster c
on c.masterid = b.masterid
)
update public.tfact b
set fieldtoupdate = t.fieldtoupdate
from t
where t.factid = b.factid;
What am I missing? This should take no time at all, yet it takes over an hour.
Any help is appreciated...
If the table was tightly packed to start with, there will be no room to use the HOT (heap-only tuple) UPDATE shortcut. Updating 8 million rows will then mean inserting 8 million new row versions and doing index maintenance on each one.
If your indexed columns on tfact are not clustered, this can involve large amounts of random IO, and if your RAM is small most of this may be uncached. With slow disk, I can see why this would take a long time, maybe even much longer than an hour.
If you will be doing this on a regular basis, you should change the table's fillfactor to keep it loosely packed.
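For example (50 is just an illustrative value; pick something that matches how much of the table each update touches):
ALTER TABLE public.tfact SET (fillfactor = 50);
-- The new fillfactor only affects future writes; rewrite the table
-- (e.g. VACUUM FULL, which takes an exclusive lock) to repack existing rows.
VACUUM FULL public.tfact;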
Note that the default settings are generally suited to a small machine, or at least a machine where running the database is only a small part of its workload. But the only setting likely to affect you here is work_mem, and even that is probably not really a problem for this task.
If you use psql, the command \d+ tfact will show you what the fillfactor is set to if it is not the default. But note that changing the fillfactor only applies to new tuples, not to existing ones. To see the fill on an existing table, you would want to check the free space map using the pg_freespacemap extension and verify that every block has about half of its space available.
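A rough way to do that check (a sketch using pg_freespacemap; avail is the per-block free space in bytes, out of an 8 kB block):
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
SELECT blkno, avail
FROM pg_freespace('public.tfact')
ORDER BY blkno
LIMIT 20;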
To see if an index is well clustered, you can check the correlation column of pg_stats on the table for the leading column ("attname") of the index.
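For example, for the index on masterid in the schema above:
SELECT tablename, attname, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'tfact'
  AND attname = 'masterid';
-- correlation near +1 or -1 means the table is well ordered on that column;
-- near 0 means lookups through idx_public_fact_master are mostly random IO.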

How to index variable character column for pattern matching in Postgres?

Postgres Version: 9.5.19
I have the following table containing domains:
CREATE TABLE sites (
id SERIAL PRIMARY KEY,
domain character varying(255)
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX sites_pkey ON sites(id int4_ops);
CREATE INDEX index_sites_on_domain ON sites(domain text_ops);
id | domain
---| -----------
1 | www.abc.com
2 | alpha.net
3 | catfood.xyz
4 | example.org
5 | un.gov
6 | xyz.com
. | .......
The total number of records in the table is around 1 million, and running a pattern-matching query easily takes 20+ seconds:
SELECT * from sites where domain LIKE '%abc.com%';
I have a normal btree index on domain but that isn't being used for the above query. EXPLAIN shows a sequential scan.
How do I index this column so that the queries are fully optimized?
First create an expression index on the domain parts, e.g. turning 'www.abc.com' into the array '{www,abc,com}'.
CREATE INDEX index_on_domain ON sites USING GIN (regexp_split_to_array(domain, '\.'));
Then run a slightly modified query.
SELECT *
FROM sites
WHERE regexp_split_to_array(domain, '\.') @> '{abc,com}'
AND domain LIKE '%abc.com%';
It's definitely uglier than your original query, but it will do the job.
The first condition of the WHERE clause will use the GIN index to reduce the result set dramatically by checking whether the domain parts contain the parts you are searching for. The second acts as a verifying condition to ensure that 'abc' immediately precedes 'com', but it is only evaluated against rows that match the first, indexed condition. Use GiST (or GIN with fastupdate) instead if the sites table is frequently updated: as of 9.5, standard GIN indexes are better for mostly-read-only use cases, while GiST performs better on tables/indexes that are updated regularly.
Trigrams could be used, but I think they are more useful for unstructured text. Domains, on the other hand, have structure in which the parts carry distinct (and atomic) meanings. It of course depends on your use case, but from your description I'm assuming you aren't also looking for results like 'sample.xyzabc.com'. If you are, trigrams may in fact be the way to go, since your queries truly are unstructured. If you are actually looking for complete domain parts, splitting the domain and searching by independent, cohesive parts will probably give you cleaner results.
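If your queries really are unstructured and you go the trigram route, a minimal sketch (for 9.5, using pg_trgm) would be:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- A trigram GIN index can serve unanchored LIKE/ILIKE patterns without rewriting the query
CREATE INDEX index_sites_on_domain_trgm ON sites USING GIN (domain gin_trgm_ops);
SELECT * FROM sites WHERE domain LIKE '%abc.com%';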
Additionally, if your query cannot be changed because the code is unmodifiable or is generated by a library that is not easily modified, you may have to accept the poor performance of a sequential scan for open-ended LIKE wildcard patterns like the one shown here.

Convert T-SQL Cross Apply to Redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to convert a column containing a comma-delimited string of up to 60 values into multiple rows with one value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL Server this processes 10 million records in just under 1 hour, which is fine for my purposes. Obviously a direct conversion to Redshift isn't possible, since Redshift has neither CROSS APPLY nor STRING_SPLIT, so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows), which uses split_part() to split the comma-delimited string into multiple columns, followed by another query that unions everything to get the final output into a single column. But the typical run takes over 6 hours to process the same amount of data.
Given the power difference between the machines, I wasn't expecting to run into this issue. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM, while the Redshift cluster is based on dc1.8xlarge nodes (I don't know the total count). The instance is shared with other teams, but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift, so I'm assuming I'm not understanding something, but I have no idea what I'm missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an admin, so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations, but short of looking into cross joins - something I'd like to avoid (and when I tried to talk to the DBAs running the Redshift cluster about that option, their response was a flat "No, can't do that.") - I'm not even sure where to go at this point, so any help would be much appreciated!
Thanks!
I've found a solution that works for me.
You need to do a JOIN on a numbers table, for which you can take any table as long as it has more rows than the number of fields you need. Make sure the numbers are integers by casting the type. Using the regexp_count function on the column to be split in the ON condition to count the number of fields (delimiter count + 1) generates one row per field.
Then you use the split_part function on the column, with numbers.num selecting a different part of the text for each of those rows.
SELECT comma_delimited_string,
       numbers.num,
       REGEXP_COUNT(comma_delimited_string, ',') + 1 AS nfields,
       SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN (
    SELECT (ROW_NUMBER() OVER (ORDER BY 1))::INT AS num
    FROM mytable
    LIMIT 15 -- max number of fields
) AS numbers
    ON numbers.num <= REGEXP_COUNT(comma_delimited_string, ',') + 1
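To mirror the original SELECT ... INTO dbo.myResultsTable, the same pattern can be wrapped in a CTAS; a rough sketch (untested, with column names taken from the T-SQL above and 60 as the stated maximum number of values):
CREATE TABLE myResultsTable AS
SELECT t.id_1,
       t.id_2,
       SPLIT_PART(t.comma_delimited_string, ',', numbers.num) AS value
FROM myTable t
JOIN (
    SELECT (ROW_NUMBER() OVER (ORDER BY 1))::INT AS num
    FROM myTable
    LIMIT 60 -- max number of values per string, per the question
) AS numbers
    ON numbers.num <= REGEXP_COUNT(t.comma_delimited_string, ',') + 1
WHERE t.comma_delimited_string IS NOT NULL;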

Postgres Crosstab Dynamic Number of Columns

In Postgres 9.4, I have a table like this:
id | extra_col | days | value
---+-----------+------+------
 1 | rev       |    0 |     4
 1 | rev       |   30 |     5
 2 | cost      |   60 |     6
I want this pivoted result:
id | extra_col | 0 | 30 | 60
---+-----------+---+----+---
 1 | rev       | 4 |  5 |
 2 | cost      |   |    |  6
This is simple enough with a crosstab.
But I want the following specifications:
The days column will be dynamic: sometimes increments of 1, 2, 3 (days), sometimes 0, 30, 60 (accounting months), and sometimes 360, 720 (accounting years).
The range of days will be dynamic (e.g., 0..500 days versus 1..10 days).
The first two columns (id and extra_col) are static.
The return type of all the dynamic columns will be the same (in this example, integer).
Here are the solutions I've explored, none of which work for me for the following reasons:
Automatically creating pivot table column names in PostgreSQL -
requires two trips to the database.
Using crosstab_hash - is not dynamic
From all the solutions I've explored, it seems the only one that allows this to occur in one trip to the database requires that the same query be run three times. Is there a way to store the query as a CTE within the crosstab function?
SELECT *
FROM
CROSSTAB(
--QUERY--,
$$--RUN QUERY AGAIN TO GET NUMBER OF COLUMNS--$$
)
as ct (
--RUN QUERY AGAIN AND CREATE STRING OF COLUMNS WITH TYPE--
)
Every solution based on built-in functionality needs to know the number of output columns; the PostgreSQL planner requires it. There is a workaround based on cursors - it is the only way to get a truly dynamic result from Postgres.
The example is relatively long and unreadable (SQL really doesn't support cross tabulation natively), so I will not rewrite the code from the blog here: http://okbob.blogspot.cz/2008/08/using-cursors-for-generating-cross.html.
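For reference, the static form, where the output column list must be spelled out, looks roughly like this (a sketch assuming the question's data lives in a table named tbl and that the tablefunc extension is installed):
-- requires: CREATE EXTENSION tablefunc;
SELECT *
FROM crosstab(
    $$ SELECT id, extra_col, days, value FROM tbl ORDER BY 1, 2 $$,
    $$ VALUES (0), (30), (60) $$
) AS ct (id int, extra_col text, "0" int, "30" int, "60" int);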

Is Hadoop Suitable For This?

We have some Postgres queries that take 6-12 hours to complete and are wondering if Hadoop is suited to doing them faster. We have two 64-core servers with 256 GB of RAM that Hadoop could use.
We're running PostgreSQL 9.2.4. Postgres only uses one core on one server for the query, so I'm wondering if Hadoop could do it roughly 128 times faster, minus overhead.
We have two sets of data, each with millions of rows.
Set One:
id character varying(20),
a_lat double precision,
a_long double precision,
b_lat double precision,
b_long double precision,
line_id character varying(20),
type character varying(4),
freq numeric(10,5)
Set Two:
a_lat double precision,
a_long double precision,
b_lat double precision,
b_long double precision,
type character varying(4),
freq numeric(10,5)
We have indexes on all lat, long, type, and freq fields, using btree. Both tables have "VACUUM ANALYZE" run right before the query.
The Postgres query is:
SELECT
id
FROM
setone one
WHERE
not exists (
SELECT
'x'
FROM
settwo two
WHERE
two.a_lat >= one.a_lat - 0.000278 and
two.a_lat <= one.a_lat + 0.000278 and
two.a_long >= one.a_long - 0.000278 and
two.a_long <= one.a_long + 0.000278 and
two.b_lat >= one.b_lat - 0.000278 and
two.b_lat <= one.b_lat + 0.000278 and
two.b_long >= one.b_long - 0.000278 and
two.b_long <= one.b_long + 0.000278 and
(
two.type = one.type or
two.type = 'S'
) and
two.freq >= one.freq - 1.0 and
two.freq <= one.freq + 1.0
)
ORDER BY
line_id
Is that the type of thing Hadoop can do? If so can you point me in the right direction?
I think Hadoop is very appropriate for that, but consider using HBase too.
You can run a Hadoop MapReduce routine to fetch the data, process it, and save it in an optimal way to an HBase table. That way, reading the data back would be much faster.
Try Stado at http://stado.us. Use this branch: https://code.launchpad.net/~sgdg/stado/stado, which will be used for the next release.
Even with 64 cores, you will only be using one core to process that query. With Stado you can create multiple PostgreSQL-based "nodes" even on a single box and leverage parallelism and get those cores working.
In addition, I have had success converting correlated not exists queries into WHERE (SELECT COUNT(*) ...) = 0.
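Applied to the query above, that rewrite would look roughly like this (a sketch, untested against your data):
SELECT id
FROM setone one
WHERE (
    SELECT COUNT(*)
    FROM settwo two
    WHERE two.a_lat  BETWEEN one.a_lat  - 0.000278 AND one.a_lat  + 0.000278
      AND two.a_long BETWEEN one.a_long - 0.000278 AND one.a_long + 0.000278
      AND two.b_lat  BETWEEN one.b_lat  - 0.000278 AND one.b_lat  + 0.000278
      AND two.b_long BETWEEN one.b_long - 0.000278 AND one.b_long + 0.000278
      AND (two.type = one.type OR two.type = 'S')
      AND two.freq BETWEEN one.freq - 1.0 AND one.freq + 1.0
) = 0
ORDER BY line_id;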
Pure Hadoop isn't suitable because it doesn't have indexes. An HBase implementation is very tricky in this case because only one key is possible per table. In any case, both of them require at least 5 servers before you see a significant improvement. The best you can do with PostgreSQL is to partition the data by the type column, use a second server as a replica of the first, and execute several queries in parallel, one for each particular type.
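A sketch of that per-type approach - each concurrent session runs one slice of the original query with a hard-coded type ('A' is a placeholder for one of your actual type values):
-- Session 1; run the same statement with a different type literal in each session
SELECT id
FROM setone one
WHERE one.type = 'A'
  AND NOT EXISTS (
      SELECT 1
      FROM settwo two
      WHERE two.a_lat  BETWEEN one.a_lat  - 0.000278 AND one.a_lat  + 0.000278
        AND two.a_long BETWEEN one.a_long - 0.000278 AND one.a_long + 0.000278
        AND two.b_lat  BETWEEN one.b_lat  - 0.000278 AND one.b_lat  + 0.000278
        AND two.b_long BETWEEN one.b_long - 0.000278 AND one.b_long + 0.000278
        AND (two.type = one.type OR two.type = 'S')
        AND two.freq BETWEEN one.freq - 1.0 AND one.freq + 1.0
  )
ORDER BY line_id;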
To be honest, PostgreSQL isn't the best solution for this. Sybase IQ (best case) or Oracle Exadata (in a worse case) can do it much better because of their columnar data structures and Bloom filtering.