Add Sequence to Rows in SELECT Statement - tsql

I'm required to extract transactions from a table which may have multiple transactions for a customer on the same day. For these transactions I must add a sequence column, but only for the same-day transactions.
CustAcct Transdate TransAmt
00001 2/1/2000 100
00001 2/1/2000 150
00005 3/2/2000 250
00001 2/1/2000 100
We want data to be shown as:
CustAcct Transdate TransAmt Seq
00001 2/1/2000 100 1
00001 2/1/2000 150 2
00005 3/2/2000 250 NULL
00001 2/1/2000 100 3
I thought of using the ROW_NUMBER() function but am not sure how to use it only for rows with the same date and account number. Any help would be greatly appreciated.

I believe this is what you're looking for:
SELECT
CustAcct
,TransDate
,TransAmt
,ROW_NUMBER() OVER (PARTITION BY TransDate, CustAcct ORDER BY CustAcct) AS Seq
FROM Cust
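If the NULL shown for the single-transaction row matters, the numbering can be gated with a windowed COUNT. A rough sketch against the same Cust table (not part of the answer above; the ORDER BY column here is arbitrary, since the sample data has no timestamp or identity column that fixes the within-day order):
SELECT
CustAcct
,TransDate
,TransAmt
,CASE WHEN COUNT(*) OVER (PARTITION BY CustAcct, TransDate) > 1
      THEN ROW_NUMBER() OVER (PARTITION BY CustAcct, TransDate ORDER BY TransAmt)
 END AS Seq
FROM Cust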

Related

Create deciles to group and label records where the sum of a value is the same for each decile

I have something similar to the following table, which is a randomly ordered list of thousands of transactions with a Customer_ID and an order_cost for each transaction.
Customer_ID order_cost
1      $503
53     $7
4      $80
13     $76
6      $270
78     $2
8      $45
910    $89
10     $3
1130   $43
etc... etc...
I want to group the transactions by Customer_ID, aggregate the cost of all the orders into a "spending" column, and then create a new "decile" column that assigns a number 1-10 to each customer so that, when the "spending" for all customers in a decile is added up, each decile contains 10% of all the spending.
The resulting table would look something like the one below, where each ascending decile contains fewer customers, but the total sum of "spending" for the records in each decile group is the same for deciles 1-10. (The actual numbers in this sample don't add up; it's just to show the concept.)
Customer_ID spending Decile
45     $500    1
3      $700    1
349    $800    1
23     $1,000  1
64     $2,000  1
718    $2,100  1
3452   $2,300  1
1276   $2,600  2
10     $3,000  2
34     $4,000  2
etc... etc...  etc...
So far I have grouped by Customer_ID, aggregated order_cost into a spending column, ordered the customers in ascending order of spending, and then partitioned them into 5000 groups with ntile. From there I manually found the values for each .when statement that would put the right number of customers into deciles 1-10 so that each decile has 10% of the sum of the entire spending column. It's pretty time-consuming to find, by trial and error, the bucket configuration that results in each decile having 10% of the spending column.
I'm trying to find a way to automate this process so I don't have to find the right bucketing ratio for each decile by trial and error.
This is my code so far:
import pyspark.sql.functions as F
import pyspark.sql.window as W

deciles = (table
    .groupBy('Customer_ID')
    .agg(F.sum('order_cost').alias('spending')).alias('a')
    .withColumn('rank', F.ntile(5000).over(W.Window.partitionBy()
                                            .orderBy(F.asc('spending'))))
    .withColumn('rank', F.when(F.col('rank') <= 4628, F.lit(1))
                         .when(F.col('rank') <= 4850, F.lit(2))
                         .when(F.col('rank') <= 4925, F.lit(3))
                         .when(F.col('rank') <= 4965, F.lit(4))
                         .when(F.col('rank') <= 4980, F.lit(5))
                         .when(F.col('rank') <= 4987, F.lit(6))
                         .when(F.col('rank') <= 4993, F.lit(7))
                         .when(F.col('rank') <= 4997, F.lit(8))
                         .when(F.col('rank') <= 4999, F.lit(9))
                         .when(F.col('rank') <= 5000, F.lit(10))
                         .otherwise(F.lit(0)))
)
end_table = (table.alias('a').join(deciles.alias('b'), ['Customer_ID'], 'left')
    .selectExpr('a.*', 'b.rank')
)
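One way to derive the cut-offs automatically, rather than by trial and error, is to take a running total of spending (ordered ascending), divide it by the grand total, and take the ceiling of ten times that fraction. A sketch of that idea in SQL (Spark SQL accepts this window syntax); customer_spending is an assumed name standing for the grouped and summed result above:
SELECT Customer_ID,
       spending,
       CEIL(10.0 * SUM(spending) OVER (ORDER BY spending
                                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
                 / SUM(spending) OVER ()) AS decile
FROM customer_spending
Each customer then lands in decile 1-10 according to which tenth of total spending their running total falls into, with no hand-tuned .when() cut-offs.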

How to apply partition by in a lag function using PostgreSQL

I have a table like as shown below
subject_id, date_inside, value
1 2110-02-12 19:41:00 1.3
1 2110-02-15 01:40:00 1.4
1 2110-02-15 02:40:00 1.5
2 2110-04-15 04:07:00 1.6
2 2110-04-15 08:00:00 1.7
2 2110-04-15 18:30:00 1.8
I would like to compute the date difference between consecutive rows for each subject.
I tried the below:
select a.subject_id, a.date_inside, a.value,
a.date_inside - lag(a.date_inside) over (order by a.date_inside) as difference
from table1 a
While the above works, I am not able to apply PARTITION BY for each subject, so it ends up calculating the difference across all the rows without considering the subject_id. Basically, the last row of each subject has to be NULL, because that is his or her last record (and it should not be paired with the next subject's consecutive record).
I expect my output to be like as shown below
subject_id, date_inside, difference
1 2110-02-12 19:41:00 53 hours, 59 minutes
1 2110-02-15 01:40:00 1 hour
1 2110-02-15 02:40:00 NULL
2 2110-04-15 04:07:00 3 hours, 53 minutes
2 2110-04-15 08:00:00 10 hours, 30 minutes
2 2110-04-15 18:30:00 NULL
Just add a PARTITION BY clause, and also your expected output seems to want LEAD, not LAG:
SELECT subject_id, date_inside, value,
LEAD(date_inside) OVER (PARTITION BY subject_id ORDER BY date_inside)
- date_inside AS difference
FROM table1
ORDER BY
subject_id,
date_inside;
Think of "partition by" to be simiar to how you could use "group by". In this case the logical boundaries are determined by subject_id so just include as part of the over clause:
select a.subject_id, a.date_inside, a.value,
a.date_inside - lag(a.date_inside) over (partition by a.subject_id order by a.date_inside) as difference
from table1 a
(Note that with LAG the NULL lands on each subject's first row; use LEAD, as in the answer above, if it should land on the last row as in your expected output.)

Default value in select query for null values in postgres

I have a table with sales ID, product code and amount. In some rows the product code is null. I want to show 'Missing' instead of null. Below is my table.
salesId prodTypeCode amount
1 123 150
2 123 200
3 234 3000
4 234 400
5 234 500
6 123 200
7 111 40
8 111 500
9 1000
10 123 100
I want to display the total amount for every prodTypeCode, with the caveat that if the prodTypeCode is null then 'Missing' should be displayed.
select (CASE WHEN prodTypeCode IS NULL THEN
'Missing'
ELSE
prodTypeCode
END) as ProductCode, SUM(amount) From sales group by prodTypeCode
The above query gives an error. Please suggest how I can overcome this issue. I have created a SQLFiddle.
The problem is a mismatch of datatypes; 'Missing' is text, but the product type code is numeric.
Cast the product type code to text so the two values are compatible:
select (CASE WHEN prodTypeCode IS NULL THEN
'Missing'
ELSE
prodTypeCode::varchar(40)
END) as ProductCode, SUM(amount) From sales group by prodTypeCode
See SQLFiddle.
Or, simpler:
select coalesce(prodTypeCode::varchar(40), 'Missing') ProductCode, SUM(amount)
from sales
group by prodTypeCode
See SQLFiddle.
Perhaps you have a type mismatch:
select coalesce(cast(prodTypeCode as varchar(255)), 'Missing') as ProductCode,
SUM(amount)
From sales s
group by prodTypeCode;
I prefer coalesce() to the case, simply because it is shorter.
I tried both answers above and neither worked in my case. I hope this snippet can help if they do not work for someone else (note that the NULLIF here also treats empty strings as missing, which assumes the column is a text type):
SELECT
COALESCE(NULLIF(prodTypeCode,''), 'Missing') AS ProductCode,
SUM(amount)
From sales s
group by prodTypeCode;

TSQL: in a 3 column table, I need the MAX score with the earliest attempt_id

I have a 3 column table that shows a person's score and the ID representing the record of their "test attempt".
TABLE1
empid score attempt_id
1 10565 10001
1 10700 10010
1 12500 10009
1 13000 10025
1 13000 10021
2 10565 10041
2 10700 10020
2 12500 10029
3 13000 10035
4 13000 10051
I'm trying to pull a recordset that contains the employee id along with their maximum score and smallest attempt_id (if there are multiple records with the same max score).
Result
empid score attempt_id
1 13000 10021
2 12500 10029
3 13000 10035
4 13000 10051
I can't seem to get the right SQL.
Any help?
Give this a whirl: get the max score in a subquery, then in the main query join to it and take the MIN attempt_id.
SELECT ms.empid, ms.max_score, MIN(attempt_id)
FROM Table1 ma
JOIN (
SELECT empid, Max(score) as max_score
FROM Table1
GROUP BY empid ) ms ON ma.empid = ms.empid AND ma.score = ms.max_score
GROUP BY ms.empid, ms.max_score
ORDER BY ms.empid
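Not part of the answer above, but the same result can also be had in one pass with ROW_NUMBER(), ranking each employee's attempts by score descending and then attempt_id ascending; a sketch:
SELECT empid, score, attempt_id
FROM (
    SELECT empid, score, attempt_id,
           ROW_NUMBER() OVER (PARTITION BY empid
                              ORDER BY score DESC, attempt_id ASC) AS rn
    FROM Table1
) t
WHERE rn = 1
ORDER BY empid
Row 1 per employee is the highest score, with ties broken by the smallest attempt_id.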

SELECT record based upon dates

Assuming data such as the following:
ID EffDate Rate
1 12/12/2011 100
1 01/01/2012 110
1 02/01/2012 120
2 01/01/2012 40
2 02/01/2012 50
3 01/01/2012 25
3 03/01/2012 30
3 05/01/2012 35
How would I find the rate for ID 2 as of 1/15/2012?
Or, the rate for ID 1 for 1/15/2012?
In other words, how do I write a query that finds the correct rate when the date falls between the EffDate values of two records? (The rate should come from the record with the latest EffDate prior to, or on, the selected date.)
Thanks,
John
How about this:
SELECT Rate
FROM Table1
WHERE ID = 1 AND EffDate = (
SELECT MAX(EffDate)
FROM Table1
WHERE ID = 1 AND EffDate <= '2012-01-15');
Here's an SQL Fiddle to play with. I assume here that the ID/EffDate pair is unique across the whole table (at least the opposite doesn't make sense).
SELECT TOP 1 Rate FROM the_table
WHERE ID=whatever AND EffDate <='whatever'
ORDER BY EffDate DESC
if I read you right.
(Edited to suit my idea of MS SQL, which I'm not familiar with.)
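Another window-function take (a sketch, not from either answer above, reusing the placeholder the_table): give each row an end date with LEAD, then keep the row whose range covers the requested date:
SELECT ID, Rate
FROM (
    SELECT ID, Rate, EffDate,
           LEAD(EffDate) OVER (PARTITION BY ID ORDER BY EffDate) AS NextEffDate
    FROM the_table
) t
WHERE ID = 1
  AND EffDate <= '2012-01-15'
  AND (NextEffDate IS NULL OR NextEffDate > '2012-01-15')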