Compare number of instances in a column for two different dates in SAS

I have the following dataset (items) with transactions on various dates and the amount paid on the next business day.
On the next business day, $10 is paid for each instance whose rate is > 5.
My task is to check that the number of instances where rate > 5 is in line with the amount paid on the next business day (the payment row carries the standard code 121).
For instance, there are four instances with rate > 5 on 4/14/2017, so the amount $40 (4 * 10) is paid on 4/17/2017.
Date id rate code batch
4/14/2017 1 12 100 A1
4/14/2017 1 2 101 A1
4/14/2017 1 13 101 A1
4/14/2017 1 10 100 A1
4/14/2017 1 10 100 A1
4/17/2017 1 40 121
4/20/2017 2 12 100 A1
4/20/2017 2 2 101 A1
4/20/2017 2 3 101 A1
4/20/2017 2 10 100 A1
4/20/2017 2 10 100 A1
4/21/2017 2 30 121
My code
proc sql;
create table items2 as select
count(id) as id_count,
(case when code='121' then rate/10 else 0 end) as rate_count
from items
group by date,id;
quit;
This has not yielded the desired result, and the challenge here is to match the transaction dates (4/14/2017 and 4/20/2017) with the next-business-day dates (4/17/2017 and 4/21/2017).
Appreciate your help.

The LAG function will do the trick here: we can use the lagged values to build the condition we want without needing the rate > 5 check.
Here is the solution:
data items;
    set items;
    /* carry the previous row's date, id and rate forward */
    lag_dt   = lag(Date);
    lag_id   = lag(id);
    lag_rate = lag(rate);
    /* on a payment row (code 121) for the same id on a later date,
       derive the instance count from the paid amount and the lagged rate */
    if (id = lag_id) and (code = 121) and (Date > lag_dt) then rate_count = rate / lag_rate;
    else rate_count = 0;
    drop lag_dt lag_id lag_rate;
run;
Hope this helps.

Related

Create deciles to group and label records where the sum of a value is the same for each decile

I have something similar to the following table, which is a randomly ordered list of thousands of transactions with a Customer_ID and an order_cost for each transaction.
Customer_ID  order_cost
1            $503
53           $7
4            $80
13           $76
6            $270
78           $2
8            $45
910          $89
10           $3
1130         $43
etc...       etc...
I want to group the transactions by Customer_ID, aggregate the cost of all the orders into a spending column, and then create a new "decile" column that assigns a number 1-10 to each customer so that when the "spending" of all customers in a decile is added up, each decile contains 10% of all the spending.
The resulting table would look something like the table below, where each ascending decile contains fewer customers, but the total sum of "spending" for all the records in each decile group is the same for deciles 1-10. (The actual numbers in this sample don't add up; it's just to illustrate the concept.)
Customer_ID  spending  Decile
45           $500      1
3            $700      1
349          $800      1
23           $1,000    1
64           $2,000    1
718          $2,100    1
3452         $2,300    1
1276         $2,600    2
10           $3,000    2
34           $4,000    2
etc...       etc...    etc...
So far I have grouped by Customer_ID, aggregated order_cost into a spending column, ordered the customers in ascending order of spending, and then partitioned them into 5000 groups. From there I manually found the values for each .when statement that would put the right number of customers in each of deciles 1-10 so that each decile has 10% of the sum of the entire spending column. It's pretty time-consuming to find the right bucket configuration by trial and error.
I'm trying to find a way to automate this process so I don't have to find the right bucketing ratio for each decile by trial and error.
This is my code so far:
import pyspark.sql.functions as F
import pyspark.sql.window as W

deciles = (table
    .groupBy('Customer_ID')
    .agg(F.sum('order_cost').alias('spending')).alias('a')
    .withColumn('rank', F.ntile(5000).over(W.Window.partitionBy()
        .orderBy(F.asc('spending'))))
    .withColumn('rank', F.when(F.col('rank') <= 4628, F.lit(1))
        .when(F.col('rank') <= 4850, F.lit(2))
        .when(F.col('rank') <= 4925, F.lit(3))
        .when(F.col('rank') <= 4965, F.lit(4))
        .when(F.col('rank') <= 4980, F.lit(5))
        .when(F.col('rank') <= 4987, F.lit(6))
        .when(F.col('rank') <= 4993, F.lit(7))
        .when(F.col('rank') <= 4997, F.lit(8))
        .when(F.col('rank') <= 4999, F.lit(9))
        .when(F.col('rank') <= 5000, F.lit(10))
        .otherwise(F.lit(0)))
    )
end_table = (table.alias('a').join(deciles.alias('b'), ['Customer_ID'], 'left')
.selectExpr('a.*', 'b.rank')
)
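One way to automate the bucketing (a sketch only, assuming the same table DataFrame with Customer_ID and order_cost columns as above) is to drop ntile entirely: aggregate spending per customer, take a running total of spending ordered from smallest to largest, and derive the decile from the cumulative fraction of total spending, so no cut-off values have to be found by trial and error.
# Sketch (hypothetical approach, not from the original post): derive the decile
# from the cumulative share of spending instead of hand-tuned ntile() cut-offs.
import pyspark.sql.functions as F
from pyspark.sql import Window

spending = (table
    .groupBy('Customer_ID')
    .agg(F.sum('order_cost').alias('spending')))

# running total of spending, smallest spenders first
running = (Window.orderBy(F.asc('spending'))
    .rowsBetween(Window.unboundedPreceding, Window.currentRow))
# grand total of spending (an unpartitioned, unordered window spans all rows)
overall = Window.partitionBy()

deciles = (spending
    .withColumn('cum_spending', F.sum('spending').over(running))
    .withColumn('cum_frac', F.col('cum_spending') / F.sum('spending').over(overall))
    # first 10% of total spending -> decile 1, next 10% -> decile 2, ...
    .withColumn('decile', F.least(F.ceil(F.col('cum_frac') * 10), F.lit(10))))

end_table = (table.alias('a')
    .join(deciles.select('Customer_ID', 'decile').alias('b'), ['Customer_ID'], 'left'))
Because each customer's decile comes straight from its running share of total spending, the same code keeps each decile at roughly 10% of spending no matter how skewed the distribution is.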

How do I write a GROUP BY query in PostgreSQL

I'm getting errors with PostgreSQL when writing a GROUP BY query.
I'm sure someone will tell me to put all the columns I've selected in the GROUP BY, but that will not give me the correct results.
I'm writing a query that selects all the vehicles in the database and groups the results by vehicle, giving me the total distance and cost for a given period.
Here is how I'm doing the query.
SELECT i.vehicle AS vehicle,
i.costcenter AS costCenter,
i.department AS department,
SUM(i.quantity) AS liters,
SUM(i.totalcost) AS Totalcost,
v.model AS model,
v.vtype AS vtype
FROM fuelissuances AS i
LEFT JOIN vehicles AS v ON i.vehicle = v.id
WHERE i.dates::text LIKE '%2019-03%' AND i.deleted_at IS NULL
GROUP BY i.vehicle;
If I put all the columns that are in the SELECT into the GROUP BY, the results will not be correct.
How do I go about this without putting all the columns in the GROUP BY and without creating sub-queries?
The fuel table looks like:
vehicle dates department quantity totalcost
1 2019-01-01 102 12 1200
1 2019-01-05 102 15 1500
1 2019-01-13 102 18 1800
1 2019-01-22 102 10 1000
2 2019-01-01 102 12 1260
2 2019-01-05 102 19 1995
2 2019-01-13 102 28 2940
Vehicle Table
id model vtype
1 1 2
2 4 6
2 5 7
These are the results I expect from the query:
vehicle dates department quantity totalcost model vtype
1 2019-01-01 102 12 1200 1 2
1 2019-01-05 102 15 1500 1 2
1 2019-01-13 102 18 1800 1 2
1 2019-01-22 102 10 1000 1 2
1 2019-01-18 102 10 1000 1 2
1 65 6500
2 2019-01-01 102 12 1260 5 7
2 2019-01-05 102 19 1995 5 7
2 2019-01-13 102 28 2940 5 7
1 45 6195
Your query doesn't really make sense. Apparently there can be multiple departments and costcenters per vehicle in the fuelissuances table - which of those should be returned?
One way to deal with that, is to return all of them, e.g. as an array:
SELECT i.vehicle,
array_agg(i.costcenter) as costcenters,
array_agg(i.department) as departments,
SUM(i.quantity) AS liters,
SUM(i.totalcost) AS Totalcost,
v.model,
v.vtype
FROM fuelissuances AS i
LEFT JOIN vehicles AS v ON i.vehicle = v.id
WHERE i.dates >= date '2019-03-01'
and i.dates < date '2019-04-01'
AND i.deleted_at IS NULL
group by i.vehicle, v.model, v.vtype;
Instead of an array, you could also return a comma-separated list of those values, e.g. string_agg(i.costcenter, ',') as costcenters.
Adding the columns v.model and v.vtype won't (shouldn't) change anything, as the GROUP BY i.vehicle only returns one row per vehicle anyway, so model and vtype are constant within each group.
Note that I removed the useless aliases and replaced the condition on the date with a proper range condition that can make use of an index on the dates column.
Edit
Based on your new sample data, you want a running total rather than a "regular" aggregation. This can easily be done using window functions:
SELECT i.vehicle,
i.costcenter,
i.department,
SUM(i.quantity) over (w) AS liters,
SUM(i.totalcost) over (w) AS Totalcost,
v.model,
v.vtype
FROM fuelissuances AS i
LEFT JOIN vehicles AS v ON i.vehicle = v.id
WHERE i.dates >= date '2019-01-01'
and i.dates < date '2019-02-01'
AND i.deleted_at IS NULL
window w as (partition by i.vehicle order by i.dates)
order by i.vehicle, i.dates;
I would not create those "total" lines using SQL, but rather in the front end that displays the data.
Online example: https://rextester.com/CRJZ27446
You need to use nested queries to get the SUMs you want inside that query.
SELECT i.vehicle AS vehicle,
i.costcenter AS costCenter,
i.department AS department,
(SELECT SUM(i.quantity) FROM TABLES WHERE CONDITIONS GROUP BY vehicle) AS liters,
(SELECT SUM(i.totalcost) FROM TABLES WHERE CONDITIONS GROUP BY vehicle) AS Totalcost,
v.model AS model,
v.vtype AS vtype
FROM fuelissuances AS i
LEFT JOIN vehicles AS v ON i.vehicle = v.id
WHERE i.dates::text LIKE '%2019-03%' AND i.deleted_at IS NULL;

duplicating table columns in KDB

Consider the code below:
q)tab:flip `items`sales`prices!(`nut`bolt`cam`cog;6 8 0 3;10 20 15 20)
q)tab
items sales prices
------------------
nut 6 10
bolt 8 20
cam 0 15
cog 3 20
I would like to duplicate the prices column. I can write a query like this:
q)update prices_copy: prices from tab
I also can write a query like this:
q)select items, sales, prices, prices_copy: first prices by items from tab
Both would work. I would like to know how the "by" version works and the motivation for writing each version. I cannot help but think the "by" version is more like thinking in rows.
Your initial query is ideally what you want for duplicating a column.
The by creates groups of the column items in your example and collapses every other column in the select query according to the indices calculated from grouping items. More info on by here - http://code.kx.com/wiki/Reference/select and http://code.kx.com/wiki/JB:QforMortals2/queries_q_sql#The_by_Phrase
In your example, the column items is already unique and so no collapsing into groups is actually performed; however, the by will create nested lists from the other columns (i.e. lists of lists). The use of first just un-nests the prices column, collapsing it back to a normal (long-typed) vector.
When the grouping is finished the by columns are used as the key column[s] of the result and you will see this by the use of a vertical line to the right hand side of the key column[s]. All other columns within the select query are placed to the right hand side of the key.
The logic of the by version coincidentally creates a copy of prices. But by changes the order:
q)ungroup select sales, prices by items from tab
items sales prices
------------------
bolt 8 20
cam 0 15
cog 3 20
nut 6 10
q)tab
items sales prices
------------------
nut 6 10
bolt 8 20
cam 0 15
cog 3 20
The by version works only because items is unique. For a table with multiple rows per item, e.g. 8#tab, the query only produces 4 values for prices_copy.
q)select items, sales, prices, prices_copy: first prices by items from 8#tab
items| items sales prices prices_copy
-----| ----------------------------------
bolt | bolt bolt 8 8 20 20 20
cam | cam cam 0 0 15 15 15
cog | cog cog 3 3 20 20 20
nut | nut nut 6 6 10 10 10
There is a fundamental difference between a simple update and update by queries.
Let's explore it by adding an extra column, brand, to the table:
tab2:flip `items`sales`prices`brand!(`nut`bolt`cam`cog`nut`bolt`cam`cog;6 8 0 3 1 2 3 4;10 20 15 20 30 40 50 60;`b1`b1`b1`b1`b2`b2`b2`b2)
The following will now simply copy the column:
asc update prices_copy: prices from tab2
However, the following query copies the first price of each item regardless of brand and assigns it to all other brands of the same item.
asc ungroup select sales, prices,brand, prices_copy: first prices by items from tab2
items sales prices brand prices_copy
------------------------------------
bolt 2 40 b2 20 //b1 price copied to the b2 row
bolt 8 20 b1 20
cam 0 15 b1 15 
cam 3 50 b2 15 //b1 price
cog 3 20 b1 20
cog 4 60 b2 20 //b1 price
nut 1 30 b2 10 //b1 price
nut 6 10 b1 10
update by might be useful in cases where you want to copy the max price of each item regardless of brand, or apply some other aggregation:
asc ungroup select sales, prices,brand, prices_copy: max prices by items from tab2
items sales prices brand prices_copy
------------------------------------
bolt 2 40 b2 40
bolt 8 20 b1 40 //max price in bolts regardless of the brand
cam 0 15 b1 50
cam 3 50 b2 50
cog 3 20 b1 60
cog 4 60 b2 60
nut 1 30 b2 30
nut 6 10 b1 30

PostgreSQL window function & difference between dates

Suppose I have data formatted in the following way (FYI, total row count is over 30K):
customer_id order_date order_rank
A 2017-02-19 1
A 2017-02-24 2
A 2017-03-31 3
A 2017-07-03 4
A 2017-08-10 5
B 2016-04-24 1
B 2016-04-30 2
C 2016-07-18 1
C 2016-09-01 2
C 2016-09-13 3
I need a 4th column, let's call it days_since_last_order, which is 0 where order_rank = 1 and otherwise the number of days since the previous order (the one with rank n-1).
So, the above would return:
customer_id order_date order_rank days_since_last_order
A 2017-02-19 1 0
A 2017-02-24 2 5
A 2017-03-31 3 35
A 2017-07-03 4 94
A 2017-08-10 5 38
B 2016-04-24 1 0
B 2016-04-30 2 6
C 2016-07-18 1 0
C 2016-09-01 2 45
C 2016-09-13 3 12
Is there an easier way to calculate the above with a window function (or similar), rather than joining the entire dataset against itself (e.g. on A.order_rank = B.order_rank - 1) and doing the calculation?
Thanks!
Use the lag window function:
SELECT
  customer_id
, order_date
, order_rank
, COALESCE(
    DATE(order_date)
    - DATE(LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date))
  , 0) AS days_since_last_order
FROM <table_name>

Insert rownumber repeatedly in records in t-sql

I want to insert a row number into the records that counts rows in a repeating range, as in the example output:
RowNumber ID Name
1 20 a
2 21 b
3 22 c
1 23 d
2 24 e
3 25 f
1 26 g
2 27 h
3 28 i
1 29 j
2 30 k
I would rather use row_number() over (partition by ... order by column name), but my real records do not contain columns that can be used to count into a 1-3 row number.
I have already tried looping over each record to insert a row count of 1-3, but the loop affects the performance of the query. The query will be used for an RDL report, which is why the performance must be as good as possible.
Any suggestions are welcome. Thanks
Have you tried taking row_number() modulo 3?
SELECT
((row_number() over (order by ID)-1) % 3) +1 as RowNumber
FROM table