Declaring variables in Redshift

Background
I have been using Amazon Redshift to execute my queries.
I know a similar question has been asked before, but I don't understand how to incorporate UDFs.
I want to assign a particular value to a temporary variable to make my script dynamic. For instance, this is my usual way of writing code:
SELECT * FROM transaction_table WHERE invoice_date >= '2013-01-01'
AND invoice_date <= '2013-06-30';
What I want to do is something like what you see below. I believe SQL Server has a DECLARE statement that does this sort of thing.
SET start_date TO '2013-01-01';
SET end_date TO '2013-06-30';
SELECT * FROM transaction_table WHERE invoice_date >= start_date
AND invoice_date <= end_date;
This way I don't have to search deep in my script; I can just have a SET statement up top and change only that.
Any feedback is greatly appreciated.

There are no variables in Redshift, unfortunately. You can, however, get variable-like behaviour by creating a temporary table and referring to it as follows:
CREATE TEMPORARY TABLE _variables AS (
    SELECT
        '2013-01-01'::date AS start_date,
        '2013-06-30'::date AS end_date
);

SELECT
    transaction_table.*
FROM
    transaction_table, _variables
WHERE
    invoice_date >= start_date
    AND invoice_date <= end_date;
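If you'd rather skip the DDL entirely, the same pattern works inline with a CTE. This is just a sketch of the equivalent idea (the variables name is illustrative); the temporary table above has the advantage of surviving for the whole session, so you only declare it once at the top of your script:
WITH variables AS (
    SELECT '2013-01-01'::date AS start_date, '2013-06-30'::date AS end_date
)
SELECT transaction_table.*
FROM transaction_table, variables
WHERE invoice_date >= start_date
  AND invoice_date <= end_date;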

Related

Postgres - Select all rows with a date column value within a range of another row's value?

I'm trying to use a date value as a starting point to construct a date range within a single Postgres query. The date value would be something like
SELECT upgraded_at FROM accounts ORDER BY upgraded_at DESC LIMIT 1;
which would then be used as the starting point. I then want to do something like
SELECT * from accounts WHERE upgraded_at >= (basis_date - 2 days) AND upgraded_at < (basis_date + 2 days);
Ideally I'd like to accomplish this with a single query, so I'll need some kind of subquery to get the starting date and then use that as a variable within the rest of the query.
Also, eventually I'm going to be doing this within Sequelize. I definitely need the raw SQL way to do it, but I'm also curious whether there's a Sequelize-specific way for later.
You can actually avoid making two references to the basis date here.
WITH cte AS (
SELECT *, MAX(upgraded_at) OVER () AS max_upgraded_at
FROM accounts
)
SELECT *
FROM cte
WHERE upgraded_at - max_upgraded_at BETWEEN -2 AND 2;
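One caveat worth noting (my addition, not from the original answer): the subtraction trick assumes upgraded_at is a date, since date - date yields an integer number of days in Postgres, which is what makes BETWEEN -2 AND 2 work. If upgraded_at is a timestamp, the subtraction yields an interval, so a sketch of the same idea would compare against an interval instead:
WITH cte AS (
    SELECT *, MAX(upgraded_at) OVER () AS max_upgraded_at
    FROM accounts
)
SELECT *
FROM cte
-- max_upgraded_at is the maximum, so only the lower bound actually filters rows
WHERE upgraded_at >= max_upgraded_at - INTERVAL '2 days';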

Update performance issues - best practice

I've just started working with PostgreSQL. I used to work with SQL Server and I'm currently migrating some of the existing processes.
The current issue I'm facing is the performance of an UPDATE statement.
I'm trying to update all records from one table (e.g. MyTable_History) and set new values for some columns.
In SQL Server I used the following syntax:
declare @NewEndDate datetime = (select dateadd(minute, -1, getdate()))
update MyTable_History
set isLatestVersion = 0, ValidTo = @NewEndDate, ModifiedBy = 'TestSCriptSql', ModifiedTime = GETDATE()
The code I could come up with for PostgreSQL (since I don't know how to simply use variables, I used a temp table) is:
CREATE TEMP TABLE dates AS VALUES (current_timestamp + (-1 ||' minutes')::interval);
with d as (
select th.validto as validto, th.islatestversion as islatestversion,
th.modifiedby as modifiedby, th.modifiedtime as modifiedtime, d.column1 as newvalidto
from MyTable_History th, dates d
)
update MyTable_History
set validto = d.newvalidto, islatestversion=false, modifiedby='test_update_script', modifiedtime=current_timestamp
from d
SQL Server runs locally on my laptop (not a super config) and the PostgreSQL server runs on AWS as RDS (I don't know the exact specs).
My question is: am I doing something wrong in the PostgreSQL UPDATE statement? On a sample of 5000+ rows, SQL Server performs the statement instantly, while PostgreSQL takes around 50 seconds to finish.
Also, from my point of view it seems over-engineered: on SQL Server I had 3 lines of code, while on PostgreSQL I'm using a CTE.
Regards,
I don't see why you would need a variable to begin with. current_timestamp returns the same value throughout a transaction as documented in the manual and thus will have the same value for all updated rows.
update mytable_history
set islatestversion = false,
    validto = current_timestamp - interval '1 minute',
    modifiedby = 'test_update_script',
    modifiedtime = current_timestamp;
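If you want to convince yourself of that transaction-stable behaviour, a throwaway check (my addition, not part of the migration) makes it visible:
SELECT current_timestamp AS t1,
       pg_sleep(2),              -- pause for two seconds mid-statement
       current_timestamp AS t2;  -- t2 equals t1 despite the pause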
But your usage of FROM in the UPDATE statement is wrong. The semantics of using FROM in an UPDATE statement are very different between Postgres and SQL Server.
The way you use it creates a cross join between the CTE and mytable_history (so essentially a cross join of the table with itself).
You need to have a join condition in the WHERE clause on the primary key:
with d as (...)
update MyTable_History
set validto = d.newvalidto, islatestversion=false,
modifiedby='test_update_script', modifiedtime=current_timestamp
from d
where d.pk_column = MyTable_History.pk_column;
But if you really want to simulate something like variables, you don't need the CTE:
update mytable_history
set islatestversion = false,
    validto = t.newvalidto,
    modifiedby = 'test_update_script',
    modifiedtime = current_timestamp
from (
    values (current_timestamp - interval '1 minute')
) t (newvalidto);
The above still creates a "cross join" but as the joined table (from (values ...)) only contains a single row, it's not really a cross join.

T-SQL ORDER BY but first show these

I'm researching a dataset.
I just wonder if there is a way to order like below in one query:
Select * From MyTable where name ='international%' order by id
Select * From MyTable where name != 'international%' order by id
So first show all international items, then the names that don't start with international.
My question is not about adding columns to make this work, using multiple DBs, or a larger T-SQL script to clone a DB into a new order.
I just wonder if anything in the WHERE or ORDER BY can be tricked into doing this.
You can use expressions in the ORDER BY:
SELECT * FROM MyTable
ORDER BY
    CASE
        WHEN name LIKE 'international%' THEN 0
        ELSE 1
    END,
    id;
(From your narrative, it also sounded like you wanted LIKE, not =, so I changed that too.)
Another way (slightly cleaner and a tiny bit faster):
-- Sample data
DECLARE @mytable TABLE (id INT IDENTITY, [name] VARCHAR(100));
INSERT @mytable([name])
VALUES('international something'),('ACME'),('international waffles'),('ABC Co.');
-- Solution
SELECT t.*
FROM @mytable AS t
ORDER BY -PATINDEX('international%', t.[name]);
Note too that you can add a persisted computed column for -PATINDEX('international%', t.[name]) to speed things up.
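For a real table, that persisted column might look like the sketch below (the table, column, and index names are illustrative). PATINDEX over a column with a constant pattern is deterministic, so it can be persisted and indexed:
ALTER TABLE dbo.MyTable
    ADD intl_sort AS (-PATINDEX('international%', [name])) PERSISTED;
CREATE INDEX IX_MyTable_intl_sort ON dbo.MyTable (intl_sort, id);

-- the ORDER BY can then read straight off the index
SELECT t.* FROM dbo.MyTable AS t ORDER BY t.intl_sort, t.id;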

T-SQL Count of items based on date

To make the example super simple, let's say that I have a table with three columns: ID, Name, and Date. I need to find the count of all IDs belonging to a specific name where the ID's date is not in this month.
In other words, I want to count how many IDs each name has that aren't in this month/year.
I'm more into PowerShell and still fairly new to SQL. I tried a CASE statement, but because it's not a foreach it seems to return "if the Name has ANY date in this month, return NULL", which is not what I want. I want it to count how many IDs per name do not appear in this month.
SELECT NAME,
CASE
WHEN ( Month(date) NOT LIKE Month(Getdate())
AND Year(date) NOT LIKE Year(Getdate()) ) THEN Count(id)
END AS TotalCount
FROM dbo.table
GROUP BY NAME,
date
I really hope this makes sense, but if it doesn't please let me know and I can try to clarify more. I tried researching cursors, but I'm having a hard time grasping them to get them into my statement. Any help would be greatly appreciated!
You only want to group by the non-aggregated columns that are in the result set (in this case, Name). You totally don't need a cursor for this; it's a fairly straightforward query.
select
    Name,
    Count(*) count
from
    tbl
where
    tbl.date > eomonth(getdate()) or
    tbl.date <= eomonth(dateadd(mm, -1, getdate()))
group by
    Name
I did a little bit of trickery on the exclusion of rows that are in the current month. Generally, you want to avoid running functions on the columns you're comparing if you can, so that SQL Server can use an index to speed up its search. I assumed that the ID column is unique; if it's not, change count(*) to count(distinct ID).
Here's an alternative WHERE clause if you're using older versions of SQL Server (EOMONTH was introduced in SQL Server 2012). If the table is small enough, you can just do it directly (similar to what you tried originally; it just goes in the query's WHERE clause and is not embedded in a CASE):
where
    Month(date) <> Month(Getdate()) OR
    Year(date) <> Year(Getdate())
If you have a large table and a sargable predicate on the index is important, there's some fun stuff you can do to build eomonth out of dateadd and the date-part functions, but it's a pain.
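For the curious, here is a sketch of that trick (my addition, using the same hypothetical tbl as above): DATEADD/DATEDIFF arithmetic against day 0 (1900-01-01) computes the month boundaries once per query, so the date column itself is never wrapped in a function and the predicate stays sargable:
where
    tbl.date < dateadd(mm, datediff(mm, 0, getdate()), 0)         -- before the 1st of this month
    or tbl.date >= dateadd(mm, datediff(mm, 0, getdate()) + 1, 0) -- on/after the 1st of next month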
SELECT Name, COUNT(ID) AS TotalCount
FROM dbo.[table]
WHERE DATEPART(MONTH, [Date]) != DATEPART(MONTH, GETDATE()) OR DATEPART(YEAR, [Date]) != DATEPART(YEAR, GETDATE())
GROUP BY Name;
In T-SQL:
SELECT
NAME,
COUNT(id)
FROM dbo.table
WHERE MONTH(Date_M) <> MONTH(GETDATE()) OR YEAR(Date_M) <> YEAR(GETDATE())
GROUP BY NAME

Creating views in Redshift

I'm writing some code at the moment that has to access a transactions table multiple times for a given date range. I was wondering if it's possible to set up a "view" of my table that lets me do a single DELETE FROM at the start of the code (without affecting the underlying table), so that the date range is always applied throughout the code.
So, in a simplified example, changing the code from...
SELECT SUM(sales)
FROM trans_file
WHERE date_field BETWEEN '2012-01-01' AND '2012-01-31'
To this...
DELETE
FROM trans_file
WHERE date_field NOT BETWEEN '2012-01-01' AND '2012-01-31';

SELECT SUM(sales)
FROM trans_file;
What you can do is perform a deep copy first, access the data multiple times, and then drop the "view", like this:
CREATE TABLE trans_file_view AS (
SELECT * FROM trans_file
WHERE date_field BETWEEN '2012-01-01' AND '2012-01-31'
);
SELECT SUM(sales) FROM trans_file_view;
...next SELECT statements...
DROP TABLE trans_file_view;
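As an aside (my addition, not part of the original answer): Redshift also supports ordinary views, so if copying the data is too expensive, a filtered view gives the same "date range applied everywhere" effect without duplicating or deleting anything; the view name here is illustrative:
CREATE VIEW trans_file_v AS
SELECT * FROM trans_file
WHERE date_field BETWEEN '2012-01-01' AND '2012-01-31';

SELECT SUM(sales) FROM trans_file_v;
Unlike the deep copy, the view re-applies the filter on every query, which can be slower on a big table but always reflects the current data.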
Bibliography:
You can read more about deep copies in the Amazon Redshift Database Developer Guide under "Performing a deep copy".