Ordering numeric-like strings in T-SQL

I have strings (called Alleles) in a table (called Antigens) that I'm having trouble sorting into the correct order.
A representative sample set of Alleles might be:
01:01
01:02
02:01
04:01
09:01N
10:01
104:01
105:01
11:01N
03:01:01
03:01:02
I need these Alleles to be in the following order:
01:01
01:02
02:01
03:01:01
03:01:02
04:01
09:01N
10:01
11:01N
104:01
105:01
I can't sort the Alleles as strings because 104:01 & 105:01 will appear before 11:01.
I can't strip out the ':' characters and sort numerically as that will put 03:01:01 & 03:01:02 at the end as the numeric values would be 30101 & 30102 respectively.
I'm stumped as to how this can be achieved and would be grateful for any suggestions.
Cheers

Assuming the maximum number of characters before/between/after each ':' is 3, you could pad all the string values to the same length and order as below. It looks a bit complex, though!
Fiddle demo
;with cte as (
  select val, charindex(':', val, 1) index1,
         charindex(':', val, charindex(':', val, 1) + 1) index2
  from t
)
select val,
       right('000' + left(val, index1 - 1), 3) + ':' +
       case when index2 - index1 > 0
            then right('000' + substring(val, index1 + 1, index2 - index1 - 1), 3)
            else right('000' + substring(val, index1 + 1, len(val)), 3) end + ':' +
       case when index2 > 0
            then right('000' + right(val, len(val) - index2), 3)
            else '000' end as odr
from cte
order by odr
| VAL | ODR |
--------------------------
| 01:01 | 001:001:000 |
| 01:02 | 001:002:000 |
| 02:01 | 002:001:000 |
| 03:01:01 | 003:001:001 |
| 03:01:02 | 003:001:002 |
| 04:01 | 004:001:000 |
| 09:01N | 009:01N:000 |
| 10:01 | 010:001:000 |
| 11:01N | 011:01N:000 |
| 104:01 | 104:001:000 |
| 105:01 | 105:001:000 |
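Outside the database, the same fixed-width trick is easy to check. The sketch below (plain Python, with a hypothetical helper name allele_sort_key) builds the same '000'-padded key and sorts the sample list with it:

```python
def allele_sort_key(allele):
    """Normalise each colon-separated field to width 3 and pad missing
    trailing fields with '000', so plain string comparison gives the
    desired order (e.g. '03:01:01' -> '003:001:001')."""
    parts = allele.split(":")
    parts += ["0"] * (3 - len(parts))          # pad to three fields
    return ":".join(p.rjust(3, "0") for p in parts)

alleles = ["01:01", "01:02", "02:01", "04:01", "09:01N", "10:01",
           "104:01", "105:01", "11:01N", "03:01:01", "03:01:02"]
ordered = sorted(alleles, key=allele_sort_key)
```

Note that '09:01N' normalises to '009:01N:000', just as in the ODR column above: the suffixed field is already three characters, so the padding leaves it alone.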

Use
ORDER BY CAST(
    SUBSTRING(Alleles
    , 1
    , CHARINDEX(':', Alleles) - 1)
    AS INTEGER)
(Note: T-SQL's function is SUBSTRING, not SUBSTR, its string indexes are 1-based, and CHARINDEX takes the search string as its first argument.)

Do you want to sort numerically by the part before the colon? If yes, something like this should do it:
select *
from mytable
order by cast(substring(col, 1, charindex(':', col) - 1) as int),
         col -- tie-breaker so values with the same leading number sort consistently
Results:
01:01
01:02
02:01
03:01:01
03:01:02
04:01
09:01N
10:01
11:01N
104:01
105:01
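For illustration only, the same "numeric on the first segment" sort can be sketched in Python, with the raw string added as a tie-breaker (sorting on the first segment alone leaves rows with equal leading numbers in arbitrary order):

```python
alleles = ["01:01", "01:02", "02:01", "04:01", "09:01N", "10:01",
           "104:01", "105:01", "11:01N", "03:01:01", "03:01:02"]

# Sort by the integer value of the part before the first ':', then by the
# whole string so ties (e.g. 03:01:01 vs 03:01:02) keep a defined order.
ordered = sorted(alleles, key=lambda s: (int(s.split(":")[0]), s))
```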

Related

Increment date within a WHILE loop using PostgreSQL on a Redshift table

MY SITUATION:
I have written a piece of code that returns a dataset containing a web user's aggregated activity for the previous 90 days and, after some calculation, returns a score. Essentially, like RFV.
A (VERY) simplified version of the code can be seen below:
WITH start_data AS (
SELECT user_id
,COUNT(web_visits) AS count_web_visits
,COUNT(button_clicks) AS count_button_clicks
,COUNT(login) AS count_log_in
,SUM(time_on_site) AS total_time_on_site
,CURRENT_DATE AS run_date
FROM web.table
WHERE TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, CURRENT_DATE) AND CURRENT_DATE
AND some_flag = 1
AND some_other_flag = 2
GROUP BY user_id
ORDER BY user_id DESC
)
The output might look something like the below:
| user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 256 | 932 |16 | 1200 | 23-01-20 |
| 2391823 | 710 | 1345 |308 | 6000 | 23-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 23-01-20 |
| 5561296 | 437 | 339 |172 | 3600 | 23-01-20 |
This output is then stored in its own AWS/Redshift table and will form the base table for the task.
SELECT *
into myschema.base_table
FROM start_data
DESIRED OUTPUT:
What I need to be able to do is iteratively run this code such that I append new data to myschema.base_table, every day, for the previous 90 days' aggregation.
The way I see it, I can either go forwards or backwards, it doesn't matter.
That is to say, I can either:
Starting from today, run the code, everyday, for the preceding 90 days, going BACK to the (first date in the table + 90 days)
OR
Starting from the (first date in the table + 90 days), run the code for the preceding 90 days, everyday, going FORWARD to today.
Option 2 seems the best option to me and the desired output looks like this (PARTITION FOR ILLUSTRATION ONLY):
| user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 412 | 339 |180 | 3600 | 20-01-20 |
| 2391823 | 417 | 6253 |863 | 2400 | 20-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 20-01-20 |
| 5561296 | 281 | 679 |262 | 4200 | 20-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 331 | 204 |83 | 3200 | 21-01-20 |
| 2391823 | 652 | 1222 |409 | 7200 | 21-01-20 |
| 3729128 | 71 | 248 |71 | 720 | 21-01-20 |
| 5561296 | 366 | 722 |519 | 3600 | 21-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 213 | 808 |57 | 3600 | 22-01-20 |
| 2391823 | 817 | 4265 |476 | 1200 | 22-01-20 |
| 3729128 | 33 | 128 |62 | 120 | 22-01-20 |
| 5561296 | 623 | 411 |283 | 2400 | 22-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 256 | 932 |16 | 1200 | 23-01-20 |
| 2391823 | 710 | 1345 |308 | 6000 | 23-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 23-01-20 |
| 5561296 | 437 | 339 |172 | 3600 | 23-01-20 |
WHAT I HAVE TRIED:
I have successfully created a WHILE loop to sequentially increment the date as follows:
CREATE OR REPLACE PROCEDURE retrospective_data()
LANGUAGE plpgsql
AS $$
DECLARE
start_date DATE := '2020-11-20' ;
BEGIN
WHILE CURRENT_DATE > start_date
LOOP
RAISE INFO 'Date: %', start_date;
start_date = start_date + 1;
END LOOP;
RAISE INFO 'Loop Statement Executed Successfully';
END;
$$;
CALL retrospective_data();
Thus producing the dates as follows:
INFO: Date: 2020-11-20
INFO: Date: 2020-11-21
INFO: Date: 2020-11-22
INFO: Date: 2020-11-23
INFO: Date: 2020-11-24
INFO: Date: 2020-11-25
INFO: Date: 2020-11-26
INFO: Loop Statement Executed Successfully
Query 1 OK: CALL
WHAT I NEED HELP WITH:
I need to be able to apply the WHILE loop to the initial code such that the WHERE clause becomes:
WHERE TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, start_date) AND start_date
But where start_date is the result of each incremental loop. Additionally, the result of each execution needs to be appended to the previous.
Any help appreciated.
It is fairly clear that you come from a procedural programming background, so the first recommendation is to stop thinking in terms of loops. Databases are giant, powerful data-filtering machines, and thinking in terms of 'do step 1, then step 2' often means missing out on all that power.
You want to look into window functions, which allow you to look over ranges of other rows for each row you are evaluating. This is exactly what you are trying to do.
Also, you shouldn't cast a date to a string just to compare it to other dates (in the WHERE clause). This is extra casting, and it defeats Redshift's table-scan optimizations: Redshift uses block metadata to decide which data needs to be read from disk, but that cannot work if the column is being cast to another data type.
Now to your code (an off-the-cuff rewrite, and for just the first column). Be aware that GROUP BY clauses run BEFORE window functions, and that I'm assuming not all users have a visit every day. Since Redshift doesn't support RANGE in window functions, we will need to make sure all dates are represented for all user_ids. This is done by UNIONing with a sufficient number of rows to cover the date range. You may have a table like this, or may want to create one, but I'll just generate something on the fly to show the process (which assumes there are fewer distinct dates than rows in the table: likely, but not iron-clad).
SELECT user_id
    ,COUNT(web_visits) AS count_web_visits_by_day
    ,SUM(COUNT(web_visits)) OVER (PARTITION BY user_id ORDER BY visit_date
                                  ROWS BETWEEN 90 PRECEDING AND CURRENT ROW) AS count_web_visits_90_days
    ...
    ,visit_date
FROM (
    SELECT visit_date, user_id, web_visits, ...
    FROM web.table
    WHERE some_flag = 1 AND some_other_flag = 2
    UNION ALL -- this is where I want to union with a full set of dates by user_id
    SELECT d.visit_date, u.user_id, NULL AS web_visits, ...
    FROM (
        SELECT DISTINCT user_id FROM web.table
    ) u
    CROSS JOIN (
        SELECT CURRENT_DATE + 1 - row_number() OVER (ORDER BY visit_date) AS visit_date
        FROM web.table
    ) d
) s
GROUP BY visit_date, user_id
ORDER BY visit_date ASC, user_id DESC;
The idea here is to set up your data so that you have at least one row for each user_id for each date. Then the window functions can operate on the information grouped by date and user_id to sum and count over the past 90 rows (which, with one row per user per day, is the same as the past 90 days). You now have all the information you want for every date, each looking back over 90 days. One query gives you everything: no WHILE loop, no stored procedures.
Untested but should give you the pattern. You may want to massage the output to give you the range you are looking for and clean up NULL result rows.
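For intuition, here is what the rolling window computes, sketched in plain Python with made-up data (the function name rolling_sum and the tiny data set are illustrative, not part of the SQL above):

```python
from datetime import date, timedelta

# Made-up daily visit counts: (user_id, visit_date) -> visits that day.
daily_counts = {(1, date(2020, 1, d)): d for d in range(1, 6)}

def rolling_sum(counts, user_id, run_date, window_days=90):
    """Sum one user's daily counts over the window ending at run_date,
    mirroring SUM(...) OVER (... ROWS BETWEEN 90 PRECEDING AND CURRENT ROW)
    when every date has exactly one row per user."""
    start = run_date - timedelta(days=window_days)
    return sum(c for (u, d), c in counts.items()
               if u == user_id and start <= d <= run_date)
```

Each (user_id, run_date) pair gets its trailing-window aggregate from one pass over the data, which is exactly what the window function does set-wise instead of with a loop.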

Flattening the LEFT JOIN outcome in PostgreSQL

I have two tables, eventtags and filtervalues, with columns like:
eventtags:
event_id, key_id, value_id, event_date
filtervalues:
value_id, key, value, counts_seen
Let's say I have 2 events reported with multiple key/value pairs in the eventtags table:
event_id | key_id | value_id | event_date
---------+--------+----------+-----------
1 | 20 | 32 | xx-xx-xxxx
1 | 21 | 34 | xx-xx-xxxx
2 | 20 | 35 | yy-yy-yyyy
2 | 21 | 39 | yy-yy-yyyy
The corresponding filtervalues table has data as below:
value_id | key | value | counts_seen
----------+-------+-------+----------
32 | type | staff | 52
34 | tag | tag1 | 13
35 | type | user | 10
39 | tag | tag2 | 35
Based on this, I tried the query below to consolidate the data from the two tables:
SELECT t.event_id as Event_Id,
DATE (t.event_date) as Event_Date,
v.key as Keys,
v.value as Values
FROM eventtags t
LEFT JOIN filtervalues as v ON t.value_id = v.value_id
This results in something like this
Event_Id | Keys | Values | Event_Date
---------+--------+----------+-----------
1 | type | staff | xx-xx-xxxx
1 | tag | tag1 | xx-xx-xxxx
2 | type | user | yy-yy-yyyy
2 | tag | tag2 | yy-yy-yyyy
I want the data to be in the below format
Event_Id | type | tag | Event_Date
---------+--------+---------+-----------
1 | staff | tag1 | xx-xx-xxxx
2 | user | tag2 | yy-yy-yyyy
What changes do I need to make on the query above to obtain this format?
Note: I cannot use pivots (crosstab), since the system I'm working on doesn't support them.
Any help is much appreciated
Try this for your scenario without pivot(crosstab):
SELECT t.event_id as Event_Id,
max(v.value) filter (where v.key='type') as "type",
max(v.value) filter (where v.key='tag') as "tag",
DATE (t.event_date) as Event_Date
FROM eventtags t
LEFT JOIN filtervalues as v ON t.value_id = v.value_id
group by t.event_id,t.event_date
DEMO
The above works only on PostgreSQL 9.4 and above, where the aggregate FILTER clause was introduced.
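Conceptually, max(...) FILTER (WHERE ...) is just conditional aggregation. Here is a sketch of the same pivot in plain Python (data copied from the question; the names pivoted and result are illustrative):

```python
from collections import defaultdict

# Rows as the LEFT JOIN returns them: (event_id, key, value, event_date).
rows = [
    (1, "type", "staff", "xx-xx-xxxx"),
    (1, "tag",  "tag1",  "xx-xx-xxxx"),
    (2, "type", "user",  "yy-yy-yyyy"),
    (2, "tag",  "tag2",  "yy-yy-yyyy"),
]

# Group by (event_id, event_date) and keep the value whose key matches
# each wanted column -- the analogue of max(value) FILTER (WHERE key = ...).
pivoted = defaultdict(dict)
for event_id, key, value, event_date in rows:
    pivoted[(event_id, event_date)][key] = value

result = [(eid, cols.get("type"), cols.get("tag"), edate)
          for (eid, edate), cols in sorted(pivoted.items())]
```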

SQL Insert fails on i.name does not exist, when it seemingly does, during insert

I'm using PostgreSQL and pgAdmin. I'm attempting to copy data between a staging table and a production table using INSERT INTO with a SELECT FROM statement, with a to_char conversion along the way. This may or may not be the wrong approach. The SELECT fails because apparently "column i.dates does not exist".
The question is: Why am I getting 'column i.dates does not exist'?
The schema for both tables is identical except for a date conversion.
I've tried matching the schemas of the tables, with the exception of the to_char conversion, and I've checked and double-checked that the column exists.
This is the code I'm trying:
INSERT INTO weathergrids (location, dates, temperature, rh, wd, ws, df, cu, cc)
SELECT
i.location AS location,
i.dates as dates,
i.temperature as temperature,
i.rh as rh,
i.winddir as winddir,
i.windspeed as windspeed,
i.droughtfactor as droughtfactor,
i.curing as curing,
i.cloudcover as cloudcover
FROM (
SELECT location,
to_char(to_timestamp(dates, 'YYYY-DD-MM HH24:MI'), 'HH24:MI YYYY-MM-DD HH24:MI'),
temperature, rh, wd, ws, df, cu, cc
FROM wosweathergrids
) i;
The error I'm receiving is:
ERROR: column i.dates does not exist
LINE 4: i.dates as dates,
^
SQL state: 42703
Character: 151
My data schema is like:
+-----------------+-----+-------------+-----------------------------+-----+
| TABLE | NUM | COLNAME | DATATYPE | LEN |
+-----------------+-----+-------------+-----------------------------+-----+
| weathergrids | 1 | id | integer | 32 |
| weathergrids | 2 | location | numeric | 6 |
| weathergrids | 3 | dates | timestamp without time zone | |
| weathergrids | 4 | temperature | numeric | 3 |
| weathergrids | 5 | rh | numeric | 4 |
| weathergrids | 6 | wd | numeric | 4 |
| weathergrids | 7 | wsd | numeric | 4 |
| weathergrids | 8 | df | numeric | 4 |
| weathergrids | 9 | cu | numeric | 4 |
| weathergrids | 10 | cc | numeric | 4 |
| wosweathergrids | 1 | id | integer | 32 |
| wosweathergrids | 2 | location | numeric | 6 |
| wosweathergrids | 3 | dates | character varying | 16 |
| wosweathergrids | 4 | temperature | numeric | 3 |
| wosweathergrids | 5 | rh | numeric | 4 |
| wosweathergrids | 6 | wd | numeric | 4 |
| wosweathergrids | 7 | ws | numeric | 4 |
| wosweathergrids | 8 | df | numeric | 4 |
| wosweathergrids | 9 | cu | numeric | 4 |
| wosweathergrids | 10 | cc | numeric | 4 |
+-----------------+-----+-------------+-----------------------------+-----+
Your derived table (sub-query) named i has no column named dates: the column dates is "hidden" inside the to_char() call, and since that expression has no alias, no column named dates is visible outside the derived table.
But I don't see the reason for a derived table to begin with. Also, aliasing a column with its own name is unnecessary: i.location as location is exactly the same thing as i.location.
So your query can be simplified to:
INSERT INTO weathergrids (location, dates, temperature, rh, wd, ws, df, cu, cc)
SELECT
    location,
    to_timestamp(dates, 'YYYY-DD-MM HH24:MI'),
    temperature,
    rh,
    wd,
    ws,
    df,
    cu,
    cc
FROM wosweathergrids
You don't need to give an alias to the to_timestamp() expression, as the columns are matched by position, not by name, in an INSERT ... SELECT statement.
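The same naming behaviour can be reproduced outside PostgreSQL; here is a small sketch using Python's sqlite3 (table trimmed to one column for brevity):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE wosweathergrids (dates TEXT)")
con.execute("INSERT INTO wosweathergrids VALUES ('2020-01-02 03:04')")

# Without an alias, the derived table exposes no column named 'dates':
# the expression hides the original name, just as to_char() does above.
no_alias_fails = False
try:
    con.execute("SELECT i.dates FROM (SELECT upper(dates) FROM wosweathergrids) i")
except sqlite3.OperationalError:
    no_alias_fails = True

# Aliasing the expression makes the name visible to the outer query again.
row = con.execute(
    "SELECT i.dates FROM (SELECT upper(dates) AS dates FROM wosweathergrids) i"
).fetchone()
```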

PostgreSQL and columns to rows

I have a table (in PostgreSQL 9.3) with the number of people in cities, grouped by age, like this:
city | year | sex | age_0 | age_1 | age_2 | ... | age_115
---------------------------------------------------------
city1| 2014 | M | 12313 | 23414 | 52345 | ... | 0
city1| 2014 | F | 34562 | 23456 | 53456 | ... | 6
city2| 2014 | M | 3 | 2 | 2 | ... | 99
I'd like to break the columns down to rows, ending up with rows like this:
city | year | sex | age | amount | age_group
--------------------------------------------
city1| 2014 | M | 0 | 12313 | 0-6
city1| 2014 | M | 1 | 23414 | 0-6
city1| 2014 | M | 2 | 52345 | 0-6
city1| 2014 | M | ... | ... | ...
city1| 2014 | M | 115 | 0 | 7-115
and so on. I know I could do it with several (a lot of) queries and UNIONs, but I was wondering whether there is a more elegant (less copy-and-paste) way of writing such a query?
Use arrays and unnest:
select city,
year,
sex,
unnest(array[age_0 , age_1 , age_2 , ..., age_115]) as amount,
unnest(array[ 0 , 1 , 2 , ... , 115]) as age
from mytable
On large datasets this might be slow.
A quick look shows there are many similar questions already asked; one good one includes a guide to dynamically generating the query you need (less pasting for you): link
The query-generation idea:
SELECT 'SELECT city , year , sex , unnest(ARRAY[' || string_agg(quote_ident(attname) , ',') || ']) AS amount from mytable' AS sql
FROM pg_attribute
WHERE attrelid = 'mytable'::regclass and attname ~ 'age_'
AND attnum > 0
AND NOT attisdropped
GROUP BY attrelid;
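For intuition, the unnest trick is an unpivot: one output row per (input row, age column). The same reshaping sketched in plain Python with numbers from the question (only three age columns shown):

```python
# Wide rows: city, year, sex, then one count per age column (3 of 116 shown).
wide = [
    ("city1", 2014, "M", {"age_0": 12313, "age_1": 23414, "age_2": 52345}),
    ("city2", 2014, "M", {"age_0": 3, "age_1": 2, "age_2": 2}),
]

# One output row per (input row, age column), like unnest over the
# parallel amount/age arrays in the query above.
long_rows = [(city, year, sex, int(col.split("_")[1]), amount)
             for city, year, sex, ages in wide
             for col, amount in sorted(ages.items())]
```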

Linear regression with postgres

I use Postgres and I have a large number of rows with a value and a date per station.
(Dates can be separated by several days.)
id | value | idstation | udate
--------+-------+-----------+-----
1 | 5 | 12 | 1984-02-11 00:00:00
2 | 7 | 12 | 1984-02-17 00:00:00
3 | 8 | 12 | 1984-02-21 00:00:00
4 | 9 | 12 | 1984-02-23 00:00:00
5 | 4 | 12 | 1984-02-24 00:00:00
6 | 8 | 12 | 1984-02-28 00:00:00
7 | 9 | 14 | 1984-02-21 00:00:00
8 | 15 | 15 | 1984-02-21 00:00:00
9 | 14 | 18 | 1984-02-21 00:00:00
10 | 200 | 19 | 1984-02-21 00:00:00
Forgive what may be a silly question, but I'm not much of a database guru.
Is it possible to write a SQL query directly that will calculate a linear regression per station for each date, knowing that the regression must be calculated using only the actual id's date, the previous id's date and the next id's date?
For example, the linear regression for id 2 must be calculated with the values 7 (actual), 5 (previous) and 8 (next), for the dates 1984-02-17, 1984-02-11 and 1984-02-21.
Edit: I have to use regr_intercept(value, udate), but I really don't know how to do this if I have to use only the actual, previous and next value/date for each row.
Edit 2: 3 rows have been added to idstation 12; the id and date numbers have changed.
Hope you can help me, thank you!
This is the combination of Joop's statistics and Denis's window functions:
WITH num AS (
SELECT id, idstation
, (udate - '1984-01-01'::date) as idate -- count in days since Jan 1984
, value AS value
FROM thedata
)
-- id + the ids of the {prev,next} records
-- within the same idstation group
, drag AS (
SELECT id AS center
, LAG(id) OVER www AS prev
, LEAD(id) OVER www AS next
FROM thedata
WINDOW www AS (partition by idstation ORDER BY id)
)
-- junction CTE between ID and its three feeders
, tri AS (
SELECT center AS this, center AS that FROM drag
UNION ALL SELECT center AS this , prev AS that FROM drag
UNION ALL SELECT center AS this , next AS that FROM drag
)
SELECT t.this, n.idstation
, regr_intercept(value,idate) AS intercept
, regr_slope(value,idate) AS slope
, regr_r2(value,idate) AS rsq
, regr_avgx(value,idate) AS avgx
, regr_avgy(value,idate) AS avgy
FROM num n
JOIN tri t ON t.that = n.id
GROUP BY t.this, n.idstation
;
Results:
this | idstation | intercept | slope | rsq | avgx | avgy
------+-----------+-------------------+-------------------+-------------------+------------------+------------------
1 | 12 | -46 | 1 | 1 | 52 | 6
2 | 12 | -24.2105263157895 | 0.578947368421053 | 0.909774436090226 | 53.3333333333333 | 6.66666666666667
3 | 12 | -10.6666666666667 | 0.333333333333333 | 1 | 54.5 | 7.5
4 | 14 | | | | 51 | 9
5 | 15 | | | | 51 | 15
6 | 18 | | | | 51 | 14
7 | 19 | | | | 51 | 200
(7 rows)
The clustering of the group-of-three can probably be done more elegantly using a rank() or row_number() function, which would also allow larger sliding windows to be used.
DROP SCHEMA zzz CASCADE;
CREATE SCHEMA zzz ;
SET search_path=zzz;
CREATE TABLE thedata
( id INTEGER NOT NULL PRIMARY KEY
, value INTEGER NOT NULL
, idstation INTEGER NOT NULL
, udate DATE NOT NULL
);
INSERT INTO thedata(id,value,idstation,udate) VALUES
(1 ,5 ,12 ,'1984-02-21' )
,(2 ,7 ,12 ,'1984-02-23' )
,(3 ,8 ,12 ,'1984-02-26' )
,(4 ,9 ,14 ,'1984-02-21' )
,(5 ,15 ,15 ,'1984-02-21' )
,(6 ,14 ,18 ,'1984-02-21' )
,(7 ,200 ,19 ,'1984-02-21' )
;
WITH a AS (
SELECT idstation
, (udate - '1984-01-01'::date) as idate -- count in days since Jan 1984
, value AS value
FROM thedata
)
SELECT idstation
, regr_intercept(value,idate) AS intercept
, regr_slope(value,idate) AS slope
, regr_r2(value,idate) AS rsq
, regr_avgx(value,idate) AS avgx
, regr_avgy(value,idate) AS avgy
FROM a
GROUP BY idstation
;
output:
idstation | intercept | slope | rsq | avgx | avgy
-----------+-------------------+-------------------+-------------------+------------------+------------------
15 | | | | 51 | 15
14 | | | | 51 | 9
19 | | | | 51 | 200
12 | -24.2105263157895 | 0.578947368421053 | 0.909774436090226 | 53.3333333333333 | 6.66666666666667
18 | | | | 51 | 14
(5 rows)
Note: if you want a spline-like regression you should also use the lag() and lead() window functions, like in Denis's answer.
If the average is OK for you, you could use the built-in avg(). Something like
SELECT avg("value") FROM "my_table" WHERE "idstation" = 3;
should do. For more complicated things you will need to write a PL/pgSQL function, I'm afraid, or check for an add-on for PostgreSQL.
Look into window functions. If I understand your question correctly, lead() and lag() will likely give you precisely what you want. Example usage:
select idstation as idstation,
id as curr_id,
udate as curr_date,
lag(id) over w as prev_id,
lag(udate) over w as prev_date,
lead(id) over w as next_id,
lead(udate) over w as next_date
from dates
window w as (
partition by idstation order by udate, id
)
order by idstation, udate, id
http://www.postgresql.org/docs/current/static/tutorial-window.html
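For intuition, the "centre row plus its two neighbours" regression can be sketched in plain Python. This is ordinary least squares over at most three (day-number, value) points, using the example from the question (the helper names ols and day are illustrative, not part of any answer above):

```python
from datetime import date

def ols(points):
    """Ordinary least-squares slope and intercept for (x, y) pairs;
    returns (None, None) when x has no variance (e.g. a single point),
    matching regr_slope/regr_intercept returning NULL in that case."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0:
        return None, None
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return slope, my - slope * mx

day = lambda d: (d - date(1984, 1, 1)).days  # count dates as day numbers

# id 2's window: previous (Feb 11, 5), actual (Feb 17, 7), next (Feb 21, 8)
pts = [(day(date(1984, 2, 11)), 5),
       (day(date(1984, 2, 17)), 7),
       (day(date(1984, 2, 21)), 8)]
slope, intercept = ols(pts)
```

The SQL above feeds exactly such three-row groups (built by the lag/lead junction CTE) into regr_slope and regr_intercept, one group per centre id.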