How do I convert the SQL below to PySpark SQL?
WITH groups AS (
SELECT RANK() OVER (ORDER BY date_completed) AS row_number,
date_completed,
DATEADD(day, -RANK() OVER (ORDER BY date_completed),
date_completed) AS date_group
FROM lesson_completed)
SELECT COUNT(*) AS days_streak,
MIN(date_completed) AS min_date,
MAX(date_completed) AS max_date
FROM groups
GROUP BY date_group
I need the above SQL in PySpark 3.1.x.
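A minimal sketch of one way to do this: Spark SQL already supports CTEs and RANK() OVER, so essentially only DATEADD needs translating; in Spark 3.1 the equivalent is date_add(start_date, num_days), which accepts a negative day count. Assuming lesson_completed is registered as a temp view, the statement below (the unused row_number column is dropped) could be passed to spark.sql("""...""") from PySpark:
-- Spark SQL sketch (untested outline)
WITH groups AS (
    SELECT date_completed,
           -- DATEADD(day, -n, d) becomes date_add(d, -n)
           date_add(date_completed,
                    -RANK() OVER (ORDER BY date_completed)) AS date_group
    FROM lesson_completed
)
SELECT COUNT(*) AS days_streak,
       MIN(date_completed) AS min_date,
       MAX(date_completed) AS max_date
FROM groups
GROUP BY date_group
The result of spark.sql(...) is an ordinary DataFrame, so it can be shown, filtered, or written out like any other.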
Say I have a query with a nested query inside a WHERE condition.
SELECT COUNT(id)
FROM table
WHERE create_date = date_trunc('month', current_timestamp)
and id NOT IN (
SELECT DISTINCT id
FROM some_table
WHERE date_trunc('month', create_date) = date_trunc('month', current_timestamp)
)
This query gets the metric for this month. However, what if I want it for all months?
I tried this query, but it doesn't seem to finish (it takes a very long time):
SELECT date_trunc('month', t.create_date), COUNT(id)
FROM table t
WHERE id NOT IN (
SELECT DISTINCT id
FROM some_table tt
WHERE date_trunc('month', tt.create_date) = date_trunc('month', t.create_date)
)
GROUP BY date_trunc('month', t.create_date)
I would like to execute this command via Postgres CLI (from the command line).
Any guidance on making this query more efficient or more logical is appreciated!
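One common rewrite, offered only as a sketch using the question's placeholder names (table is renamed my_table here because TABLE is a keyword): a correlated NOT IN forces the inner query to be re-evaluated per row, and NOT IN also behaves surprisingly when the subquery yields NULLs, so NOT EXISTS usually plans and reads better in Postgres:
-- sketch: per-month counts of ids absent from some_table in the same month
SELECT date_trunc('month', t.create_date) AS month,
       COUNT(t.id) AS metric
FROM my_table t
WHERE NOT EXISTS (
    SELECT 1
    FROM some_table tt
    WHERE tt.id = t.id
      AND date_trunc('month', tt.create_date) = date_trunc('month', t.create_date)
)
GROUP BY 1
ORDER BY 1;
To run it from the command line, save it to a file and use psql -d your_db -f query.sql, or pass it inline with psql -d your_db -c '...'.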
I would like to use a variable for the number of rows in an OVER clause. So far I have only gotten it working by building the SQL statement in a string and executing it.
The final purpose is to also use it in SSIS, but the dynamic version does not work there, because SSIS does not recognize the fields in the dynamic query.
What works is:
select
[GUID_Fund], [Date], [Close],
avg([Close]) over (order by [GUID_Fund], [Date] rows 7 preceding) as MA_Low
from fundrates
group by [GUID_Fund], [Date], [Close]
order by [GUID_Fund] asc, [Date] desc;
The number 7 needs to be a variable so I was trying to do something like this:
declare @var_MA_Low as int;
select distinct @var_MA_Low = [Value1]
from Variables
where [Name]='MA_Low';
select
[GUID_Fund], [Date], [Close],
avg([Close]) over (order by [GUID_Fund], [Date] rows @var_MA_Low preceding) as MA_Low
from fundrates
group by [GUID_Fund], [Date], [Close]
order by [GUID_Fund] asc, [Date] desc;
This results in a syntax error at @var_MA_Low, just after 'rows'.
What works is the same statement as above, but then I cannot use it as a source in SSIS:
declare @MA as nvarchar(max);
declare @var_MA_Low as nvarchar(max);
select distinct @var_MA_Low = [Value1] from Variables where [Name]='MA_Low';
set @MA = N'select [GUID_Fund], [Date], [Close], avg([Close])
over (order by [GUID_Fund], [Date] rows '+@var_MA_Low+' preceding) as MA_Low
from fundrates
group by [GUID_Fund], [Date], [Close] order by [GUID_Fund] asc, [Date] desc;'
execute sp_executesql @MA;
Does anybody have an idea how to pass the number of rows as a variable into the second option?
What if you create a stored procedure with the working query and use that SP as the source?
I might try to improve this answer, but if you take your working dynamic-SQL solution, combine it with a temp table and the "insert into ... exec ..." syntax (https://stackoverflow.com/a/24073229/3591870), and then return just "select * from #holdertable" to SSIS, SSIS should be able to determine the columns being returned and generate your source. I don't really like that this forces you to use dynamic SQL, however.
According to the docs (http://msdn.microsoft.com/en-us/library/ms189461(v=sql.120).aspx), it really does specify "unsigned integer literal" for the ROWS clause, so I think dynamic SQL is going to be the only way.
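Putting the two suggestions above together, here is a sketch of what an SSIS-friendly version could look like; the names come from the question and the #holdertable comment, while the column types are guesses:
-- run the dynamic SQL into a temp table so the column set is fixed for SSIS
declare @var_MA_Low nvarchar(10);
select distinct @var_MA_Low = [Value1] from Variables where [Name]='MA_Low';

create table #holdertable
    ([GUID_Fund] uniqueidentifier, [Date] date, [Close] money, MA_Low money); -- types assumed

declare @MA nvarchar(max) =
    N'select [GUID_Fund], [Date], [Close],
             avg([Close]) over (order by [GUID_Fund], [Date]
                                rows ' + @var_MA_Low + N' preceding) as MA_Low
      from fundrates
      group by [GUID_Fund], [Date], [Close]';

insert into #holdertable
exec sp_executesql @MA;

select * from #holdertable
order by [GUID_Fund] asc, [Date] desc;
The same body could live in a stored procedure, as suggested above, so the SSIS source is just an EXEC call.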
I have a table in which I have to find the maximum date for each unique EmpId and testId.
Below is the input table and expected output.
I tried a correlated subquery, but that didn't work.
Is there a quick way to update the table with the max date?
You can use a common table expression (CTE) and the OVER clause with PARTITION BY:
WITH CTE AS
(
SELECT EmpId, [Hall Id], testId, Date, [Max date],
MaxDate = MAX(Date) OVER (PARTITION BY EmpId, testId)
FROM dbo.TableName
)
UPDATE CTE SET [Max date] = MaxDate
If you want to see what will happen first, replace the UPDATE with a SELECT * FROM CTE.
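Spelled out, that preview is simply the same CTE with a SELECT (mirroring the statement above):
WITH CTE AS
(
SELECT EmpId, [Hall Id], testId, Date, [Max date],
MaxDate = MAX(Date) OVER (PARTITION BY EmpId, testId)
FROM dbo.TableName
)
SELECT * FROM CTE; -- shows the stored [Max date] next to the computed MaxDate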
You can use a CTE to select all maximum dates and join this with your original data like this:
WITH MaxDates AS (
SELECT empid
, testid
, MAX(Date) AS MaxDate
FROM table
GROUP BY empid
, testid
)
SELECT table.*
, MaxDate
FROM table
INNER JOIN MaxDates ON table.empid = MaxDates.empid AND table.testid = MaxDates.testid
I have a field called "Users", and I want to run SUM() on that field that returns the sum of all DISTINCT records. I thought that this would work:
SELECT SUM(DISTINCT table_name.users)
FROM table_name
But it's not selecting DISTINCT records, it's just running as if I had run SUM(table_name.users).
What would I have to do to add only the distinct records from this field?
Use count()
SELECT count(DISTINCT table_name.users)
FROM table_name
SQLFiddle demo
This code seems to indicate that sum(distinct) and sum() return different values.
with t as (
select 1 as a
union all
select 1
union all
select 2
union all
select 4
)
select sum(distinct a) as DistinctSum, sum(a) as allSum, count(distinct a) as distinctCount, count(a) as allCount from t
Do you actually have non-distinct values?
select count(1), users
from table_name
group by users
having count(1) > 1
If not, the sums will be identical.
You can see for yourself that distinct works with the following example. Here I create a subquery with duplicate values, then I do a sum distinct on those values.
select DistinctSum=sum(distinct x), RegularSum=Sum(x)
from
(
select x=1
union All
select 1
union All
select 2
union All
select 2
) x
You can see that the distinct sum column returns 3 and the regular sum returns 6 in this example.
You can use a sub-query (note that most engines require an alias on the derived table, hence the trailing t):
select sum(users)
from (select distinct users from table_name) t;
SUM(DISTINCTROW table_name.something)
It worked for me (InnoDB).
Description - "DISTINCTROW omits data based on entire duplicate records, not just duplicate fields." http://office.microsoft.com/en-001/access-help/all-distinct-distinctrow-top-predicates-HA001231351.aspx
;WITH cte
as
(
SELECT table_name.users , rn = ROW_NUMBER() OVER (PARTITION BY users ORDER BY users)
FROM table_name
)
SELECT SUM(users)
FROM cte
WHERE rn = 1
SQL Fiddle demo (try it yourself)
Test:
DECLARE #table_name Table (Users INT );
INSERT INTO #table_name Values (1),(1),(1),(3),(3),(5),(5);
;WITH cte
as
(
SELECT users , rn = ROW_NUMBER() OVER (PARTITION BY users ORDER BY users)
FROM #table_name
)
SELECT SUM(users) DisSum
FROM cte
WHERE rn = 1
Result
DisSum
9
If circumstances make it difficult to weave a "distinct" into the sum clause, it will usually be possible to add an extra "where" clause to the entire query - something like:
select sum(t.ColToSum)
from SomeTable t
where (select count(*) from SomeTable t1 where t1.ColToSum = t.ColToSum and t1.ID < t.ID) = 0
This may be a duplicate of Trying to sum distinct values SQL.
As per Declan_K's answer, get the distinct list first:
SELECT SUM(SQ.COST)
FROM
(SELECT DISTINCT [Tracking #] as TRACK,[Ship Cost] as COST FROM YourTable) SQ
I need to select one row from each group, based on a COUNT(1) field.
In other databases I'd use the ROW_NUMBER() function, which is not yet supported in Redshift.
The answer is to use a SUM(1) OVER (PARTITION BY group_field ORDER BY order_field ROWS UNBOUNDED PRECEDING) construct, like this:
SELECT id,
name,
cnt
FROM
(SELECT id,
name,
count(*) cnt,
sum(1) over (PARTITION BY id ORDER BY cnt DESC ROWS UNBOUNDED PRECEDING) AS row_number
FROM table
GROUP BY id,
name) t
WHERE row_number = 1
ORDER BY name