I have the following table:
Item           Ordqty   Total_Costprice   TotalSaleprice   onhand     Markup
ENG-MAN-0102            3852              203.34           2494       73.992
SPG-P-018               2716              1232.80          473.2232
A8                      8.62              9.335                       0.71
A136                    1621              148.35           518        0.3777
LA             1228     7.68              14.897                      7.217
ENG-DOR        1039     34.94             50.8166                     15.8766
A13-00-S       968      153.64            107                         0.9997
My code is:
SELECT
total_costprice,
markup,
CASE WHEN markup=0 THEN 0 ELSE 100*(markup)/costprice END AS pctmarkup
This gives a divide by zero error. I need to show the percentage markup for the markup values.
You need to use the NULLIF function:
select
total_costprice
,markup
,case when markup=0 then 0 else 100*(markup/NULLIF(costprice,0)) END as pctmarkup
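For context, NULLIF(costprice, 0) returns NULL when the divisor is 0, and dividing by NULL yields NULL instead of raising an error; wrapping the whole expression in ISNULL maps that NULL back to 0. A minimal sketch, assuming the divisor is the total_costprice column from the sample data and a hypothetical table name of products (the question does not name the table):
select
total_costprice
,markup
-- NULLIF turns a 0 divisor into NULL, so the division returns NULL instead of erroring;
-- ISNULL then maps that NULL back to 0 for the report.
,ISNULL(100 * markup / NULLIF(total_costprice, 0), 0) as pctmarkup
from products  -- hypothetical table name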
Based on your values this will work. I inserted 0 where you don't have any data - I don't know if that is true.
declare @myt table (
    item nvarchar(50), ordqty int, total_costprice numeric(18,4),
    totalsalesprice numeric(18,4), onhand numeric(18,4), markup numeric(18,4)
)
insert into @myt
values
('ENG-MAN-0102', 0 , 3852 , 203.34 , 2494 , 73.992 ),
('SPG-P-018' , 0 , 2716 , 1232.80 , 473.2232 , 0 ),
('A8' , 0 , 8.62 , 9.335 , 0 , 0.71 ),
('A136' , 0 , 1621 , 148.35 , 518 , 0.3777 ),
('LA' , 1228 , 7.68 , 14.897 , 0 , 7.217 ),
('ENG-DOR' , 1039 , 34.94 , 50.8166 , 0 , 15.8766 ),
('A13-00-S' , 968 , 153.64 , 107 , 0 , 0.9997 )
select *, CASE WHEN markup=0 THEN 0 ELSE 100*(markup)/total_costprice END as Pct from @myt
Result
I have a dataframe with the following format...
id , name, start_date, end_date , active
1 , albert , 2019-08-14, 3499-12-31, 1
1 , albert , 2019-08-13, 2019-08-14, 0
1 , albert , 2019-06-26, 2019-08-13, 0
1 , brian , 2018-01-17, 2019-06-26, 0
1 , brian , 2017-07-31, 2018-01-17, 0
1 , albert , 2017-03-31, 2018-07-31, 0
2 , diane , 2019-07-14, 3499-12-31, 1
2 , diane , 2019-06-13, 2019-07-14, 0
2 , ethel , 2019-03-20, 2019-06-13, 0
2 , ethel , 2018-01-17, 2019-03-20, 0
2 , frank , 2017-07-31, 2018-01-17, 0
2 , frank , 2015-03-21, 2018-07-31, 0
And I want to merge consecutive rows where name is the same as the previous row, but maintain the correct start and end dates in the final output dataframe. So the correct output would be...
id , name, start_date, end_date , active
1 , albert , 2019-06-26, 3499-12-31, 1
1 , brian , 2017-07-31, 2019-06-26, 0
1 , albert , 2017-03-31, 2018-07-31, 0
2 , diane , 2019-06-13, 3499-12-31, 1
2 , ethel , 2018-01-17, 2019-06-13, 0
2 , frank , 2017-03-31, 2018-01-17, 0
The number of entries per id varies as does the number of different names per id.
How could this be achieved in pyspark?
Thanks
Are you looking for df.groupby(["name", "start_date", "end_date"]).sum("active")?
If I understood your question right, the above code will do the job.
So after a bit of thinking I figured out how to do this. There may be a better way, but this works.
First, create a window partitioned by id and ordered by start_date descending, and capture the name from the next row.
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, when, sum
frame = Window.partitionBy('id').orderBy(col('start_date').desc())
df = df.select('*', lag(col('name'), default=0).over(frame).alias('next_name'))
Then, if the current row's name and the next name match, set 0, otherwise set 1...
df = df.withColumn('countrr', when(col('name') == col('next_name'), 0).otherwise(1))
Next create an extension of the frame to take the rows between the start of the window and the current row, and sum the count col for the frame...
frame2 = Window.partitionBy('id').orderBy(col('start_date').desc()).rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn('sumrr', sum('countrr').over(frame2))
This effectively creates a column that increases by one when name changes. Finally you can use this new sumrr column and the other columns to group by and take the max and min dates as required...
gb_df = df.groupby(['id', 'name', 'sumrr'])
result = gb_df.agg({'start_date':'min', 'end_date':'max'})
Then you have to join back the active flag on id, name and end_date.
Gives the correct output...
I've been given a TSQL table, an extract of which is below:
Serial Number Code 1 Code 2 Code 3
15872 1242 NULL NULL
15872 NULL 558 222
99955 995 452 NULL
I'd like to group these four fields together to form the following output
Serial Number Code 1 Code 2 Code 3
15872 1242 558 222
99955 995 452 NULL
This looks a simple problem, but I just can't get it right. Any advice would be very much appreciated!
Kind regards,
DJ
You need to use an aggregate function in conjunction with the COALESCE function, something like this:
select Serial , COALESCE(SUM(code1),0),COALESCE(SUM(code2),0),COALESCE(SUM(code3),0)
from yourTable
group by Serial
COALESCE replaces a NULL result with 0, so the SUM comes out as intended (SUM itself already ignores NULL values while aggregating).
Hope this helps!
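Note that your desired output keeps NULL for serial 99955 / Code 3. If you want to preserve that NULL, a minimal variant is to drop the COALESCE, since SUM ignores NULLs within a group and only returns NULL when every value in the group is NULL:
select Serial
, SUM(code1) as Code1
, SUM(code2) as Code2
, SUM(code3) as Code3
from yourTable
group by Serial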
Oh...misread question twice, haha.
CREATE TABLE #X
(
SerialNumber INT
, Code1 INT
, Code2 INT
, Code3 INT
)
INSERT INTO #X
VALUES
(15872, 1242, NULL, NULL)
, (15872, NULL, 558, 222)
, (99955, 995, 452, NULL)
SELECT
[SerialNumber]
, MAX([Code1]) AS [Code1]
, MAX([Code2]) AS [Code2]
, MAX([Code3]) AS [Code3]
FROM #x
GROUP BY [SerialNumber]
Now, you've not mentioned anything about aggregation, so this assumes that only one row can possibly hold each value, i.e. rows like this don't exist:
Serial Number Code 1 Code 2 Code 3
15872 1242 773 NULL
15872 NULL 558 222
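If rows like that did exist, MAX would silently keep the larger value per column (773 rather than 558 for Code 2 above), so you would need a rule for which row wins. A quick sanity check for such conflicts, sketched against the #X temp table from the example (COUNT of a column counts only its non-NULL values):
SELECT [SerialNumber]
FROM #X
GROUP BY [SerialNumber]
HAVING COUNT([Code1]) > 1 OR COUNT([Code2]) > 1 OR COUNT([Code3]) > 1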
I'm trying to do some Market Basket Analysis using Spark MLlib with this dataset:
Purchase_ID Category Furnisher Value
1 , A , 1 , 7
1 , B , 2 , 7
2 , A , 1 , 1
3 , C , 2 , 4
3 , A , 1 , 4
3 , D , 3 , 4
4 , D , 3 , 10
4 , A , 1 , 10
5 , E , 1 , 8
5 , B , 3 , 8
5 , A , 1 , 8
6 , A , 1 , 3
6 , B , 1 , 3
6 , C , 5 , 3
7 , D , 3 , 4
7 , A , 1 , 4
The transaction value (Value) is the same for every row of a Purchase_ID. What I want is to return the categories of the top 3 purchases by Value. Basically, I want to return this dataset:
D,A
E,B,A
A,B
For that I'm trying the following code:
val data = sc.textFile("PATH");
case class Transactions(Purchase_ID:String,Category:String,Furnisher:String,Value:String);
def csvToMyClass(line: String) = {
val split = line.split(',');
Transactions(split(0),split(1),split(2),split(3))}
val df = data.map(csvToMyClass)
.toDF("Purchase_ID","Category","Furnisher","Value")
.(select("Purchase_ID","Category") FROM (SELECT "Purchase_ID","Category",dense_rank() over (PARTITION BY "Category" ORDER BY "Value" DESC) as rank) tmp WHERE rank <= 3)
.distinct();
The rank function isn't correct...
Anyone knows how to solve this problem?
Many thanks!
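For reference, here is a minimal sketch of how that ranking could be written in Spark SQL, assuming the DataFrame is registered as a temporary view named transactions and that Value is cast to a numeric type (the case class above keeps it as a String):
-- Sketch only: "transactions" is a hypothetical view name and Value is assumed numeric.
SELECT Purchase_ID,
       concat_ws(',', collect_list(Category)) AS categories
FROM (
    SELECT Purchase_ID, Category,
           dense_rank() OVER (ORDER BY Value DESC) AS rnk
    FROM transactions
) ranked
WHERE rnk <= 3
GROUP BY Purchase_ID
Ranking without a partition (rather than PARTITION BY Category, as in the attempt above) is what picks the three highest-value purchases, and collect_list then gathers their categories into one row per Purchase_ID, matching the expected output.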
I have a table that has data of user_id and the timestamp they joined.
If I need to display the data month-wise I could just use:
select
count(user_id),
date_trunc('month',(to_timestamp(users.timestamp))::timestamp)::date
from
users
group by 2
The date_trunc function accepts 'second', 'day', 'week', etc., so I can get data grouped by such periods.
How do I get data grouped by an "n-day" period, say 45 days?
Basically I need to display the number of users per 45-day period.
Any suggestion or guidance appreciated!
Currently I get:
Date Users
2015-03-01 47
2015-04-01 72
2015-05-01 123
2015-06-01 132
2015-07-01 136
2015-08-01 166
2015-09-01 129
2015-10-01 189
I would like the data to come in 45 days interval. Something like :-
Date Users
2015-03-01 85
2015-04-15 157
2015-05-30 192
2015-07-14 229
2015-08-28 210
2015-10-12 294
UPDATE:
I used the following to get the output, but one problem remains. I'm getting values that are offset.
with
new_window as (
select
generate_series as cohort
, lag(generate_series, 1) over () as cohort_lag
from
(
select
*
from
generate_series('2015-03-01'::date, '2016-01-01', '45 day')
)
t
)
select
--cohort
cohort_lag -- This worked. !!!
, count(*)
from
new_window
join users on
user_timestamp <= cohort
and user_timestamp > cohort_lag
group by 1
order by 1
But the output I am getting is:
Date Users
2015-04-15 85
2015-05-30 157
2015-07-14 193
2015-08-28 225
2015-10-12 210
Basically, the users displayed at 2015-03-01 should be the users between 2015-03-01 and 2015-04-15, and so on.
But I seem to be getting counts of users up to a date, i.e. up to 2015-04-15: 85 users, which is not the result I want.
Any help here ?
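For what it's worth, a minimal sketch of another way to label each 45-day bucket by its own start date, using plain date arithmetic instead of joining against a lagged series; it assumes the anchor date 2015-03-01 and that users.timestamp holds epoch seconds, as in the first query:
select
-- integer division of the day offset assigns each user to a 45-day bucket,
-- and the bucket is labelled by its start date
date '2015-03-01'
  + 45 * ((to_timestamp(users.timestamp)::date - date '2015-03-01') / 45) as period_start,
count(user_id) as users
from users
group by 1
order by 1
Because each bucket is labelled by its computed start rather than by the end of the previous series element, the offset problem described above does not arise.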
Try this query :
SELECT to_char(i::date,'YYYY-MM-DD') as date, 0 as users
FROM generate_series('2015-03-01', '2015-11-30','45 day'::interval) as i;
OUTPUT :
date users
2015-03-01 0
2015-04-15 0
2015-05-30 0
2015-07-14 0
2015-08-28 0
2015-10-12 0
2015-11-26 0
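To actually fill in the user counts per 45-day span, that series can be left joined back to the users table; a sketch, again assuming users.timestamp holds epoch seconds (count(u.user_id) keeps empty buckets at 0):
SELECT to_char(i::date,'YYYY-MM-DD') as date,
       count(u.user_id) as users
FROM generate_series('2015-03-01', '2015-11-30','45 day'::interval) as i
LEFT JOIN users u
       ON to_timestamp(u.timestamp) >= i
      AND to_timestamp(u.timestamp) < i + interval '45 day'
GROUP BY i
ORDER BY i;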
This looks like a hot mess, and it might be better wrapped in a function where you could use some variables, but would something like this work?
with number_of_intervals as (
select
min (timestamp)::date as first_date,
ceiling (extract (days from max (timestamp) - min (timestamp)) / 45)::int as num
from users
),
intervals as (
select
generate_series(0, num - 1, 1) int_start,
generate_series(1, num, 1) int_end
from number_of_intervals
),
date_spans as (
select
n.first_date + 45 * i.int_start as interval_start,
n.first_date + 45 * i.int_end as interval_end
from
number_of_intervals n
cross join intervals i
)
select
d.interval_start, count (*) as user_count
from
users u
join date_spans d on
u.timestamp >= d.interval_start and
u.timestamp < d.interval_end
group by
d.interval_start
order by
d.interval_start
With this sample data:
User Id   timestamp    derived range   count
1         3/1/2015     3/1-4/15
2         3/26/2015    "
3         4/4/2015     "
4         4/6/2015     "               (4)
5         5/6/2015     4/16-5/30
6         5/19/2015    "               (2)
7         6/16/2015    5/31-7/14
8         6/27/2015    "
9         7/9/2015     "               (3)
10        7/15/2015    7/15-8/28
11        8/8/2015     "
12        8/9/2015     "
13        8/22/2015    "
14        8/27/2015    "               (5)
Here is the output:
2015-03-01 4
2015-04-15 2
2015-05-30 3
2015-07-14 5
I use Oracle 10g and I have a table that stores a snapshot of data on a person for a given day. Every night an outside process adds new rows to the table for any person who has had any changes to their core data (stored elsewhere). This allows a query to be written using a date to find out what a person 'looked' like on some past day. A new row is added to the table even if only a single aspect of the person has changed, the implication being that many columns have duplicate values from slice to slice, since not every detail changed in each snapshot.
Below is a data sample:
SliceID PersonID StartDt Detail1 Detail2 Detail3 Detail4 ...
1 101 08/20/09 Red Vanilla N 23
2 101 08/31/09 Orange Chocolate N 23
3 101 09/15/09 Yellow Chocolate Y 24
4 101 09/16/09 Green Chocolate N 24
5 102 01/10/09 Blue Lemon N 36
6 102 01/11/09 Indigo Lemon N 36
7 102 02/02/09 Violet Lemon Y 36
8 103 07/07/09 Red Orange N 12
9 104 01/31/09 Orange Orange N 12
10 104 10/20/09 Yellow Orange N 13
I need to write a query that pulls out time-slice records where some pertinent bits, not the whole record, have changed. So, referring to the above, if I only want to know the slices in which Detail3 has changed from its previous value, then I would expect to get only SliceID 1, 3 and 4 for PersonID 101; SliceID 5 and 7 for PersonID 102; SliceID 8 for PersonID 103; and SliceID 9 for PersonID 104.
I'm thinking I should be able to use some sort of Oracle Hierarchical Query (using CONNECT BY [PRIOR]) to get what I want, but I have not figured out how to write it yet. Perhaps YOU can help.
Thank you for your time and consideration.
Here is my take on the LAG() solution, which is basically the same as that of egorius, but I show my workings ;)
SQL> select * from
2 (
3 select sliceid
4 , personid
5 , startdt
6 , detail3 as new_detail3
7 , lag(detail3) over (partition by personid
8 order by startdt) prev_detail3
9 from some_table
10 )
11 where prev_detail3 is null
12 or ( prev_detail3 != new_detail3 )
13 /
SLICEID PERSONID STARTDT N P
---------- ---------- --------- - -
1 101 20-AUG-09 N
3 101 15-SEP-09 Y N
4 101 16-SEP-09 N Y
5 102 10-JAN-09 N
7 102 02-FEB-09 Y N
8 103 07-JUL-09 N
9 104 31-JAN-09 N
7 rows selected.
SQL>
The point about this solution is that it hauls in results for 103 and 104, who don't have slice records where detail3 has changed. If that is a problem we can apply an additional filtration, to return only rows with changes:
SQL> with subq as (
2 select t.*
3 , row_number () over (partition by personid
4 order by sliceid ) rn
5 from
6 (
7 select sliceid
8 , personid
9 , startdt
10 , detail3 as new_detail3
11 , lag(detail3) over (partition by personid
12 order by startdt) prev_detail3
13 from some_table
14 ) t
15 where t.prev_detail3 is null
16 or ( t.prev_detail3 != t.new_detail3 )
17 )
18 select sliceid
19 , personid
20 , startdt
21 , new_detail3
22 , prev_detail3
23 from subq sq
24 where exists ( select null from subq x
25 where x.personid = sq.personid
26 and x.rn > 1 )
27 order by sliceid
28 /
SLICEID PERSONID STARTDT N P
---------- ---------- --------- - -
1 101 20-AUG-09 N
3 101 15-SEP-09 Y N
4 101 16-SEP-09 N Y
5 102 10-JAN-09 N
7 102 02-FEB-09 Y N
SQL>
edit
As egorius points out in the comments, the OP does want hits for all users, even if they haven't changed, so the first version of the query is the correct solution.
In addition to OMG Ponies' answer: if you need to query slices for all persons, you'll need partition by:
SELECT s.sliceid
, s.personid
FROM (SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (
PARTITION BY t.personid ORDER BY t.startdt
) prev_val
FROM t) s
WHERE (s.prev_val IS NULL OR s.prev_val != s.detail3)
I think you'll have better luck with the LAG function:
SELECT s.sliceid
FROM (SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (PARTITION BY t.personid ORDER BY t.startdt) prev_val
FROM your_table t) s
WHERE s.personid = 101
AND (s.prev_val IS NULL OR s.prev_val != s.detail3)
Subquery Factoring alternative:
WITH slices AS (
SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (PARTITION BY t.personid ORDER BY t.startdt) prev_val
FROM your_table t)
SELECT s.sliceid
FROM slices s
WHERE s.personid = 101
AND (s.prev_val IS NULL OR s.prev_val != s.detail3)