I'm trying to do some Market Basket Analysis using Spark MLlib with this dataset:
Purchase_ID Category Furnisher Value
1 , A , 1 , 7
1 , B , 2 , 7
2 , A , 1 , 1
3 , C , 2 , 4
3 , A , 1 , 4
3 , D , 3 , 4
4 , D , 3 , 10
4 , A , 1 , 10
5 , E , 1 , 8
5 , B , 3 , 8
5 , A , 1 , 8
6 , A , 1 , 3
6 , B , 1 , 3
6 , C , 5 , 3
7 , D , 3 , 4
7 , A , 1 , 4
The transaction value (Value) is repeated on every row of a Purchase_ID; it is the value of the whole purchase. What I want is to return the categories of the top 3 purchases with the highest Value. Basically, I want to return this dataset:
D,A
E,B,A
A,B
For that, I'm trying the following code:
val data = sc.textFile("PATH")
case class Transactions(Purchase_ID: String, Category: String, Furnisher: String, Value: String)
def csvToMyClass(line: String) = {
  // trim each field, since the sample rows have spaces around the commas
  val split = line.split(',').map(_.trim)
  Transactions(split(0), split(1), split(2), split(3))
}
val df = data.map(csvToMyClass)
  .toDF("Purchase_ID", "Category", "Furnisher", "Value")
df.createOrReplaceTempView("transactions")
val result = spark.sql("""
  SELECT Purchase_ID, Category
  FROM (SELECT Purchase_ID, Category,
               dense_rank() OVER (PARTITION BY Category ORDER BY Value DESC) AS rank
        FROM transactions) tmp
  WHERE rank <= 3""").distinct()
The rank function isn't correct...
Anyone knows how to solve this problem?
Many thanks!
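In case it helps, here is a minimal sketch of one way to get the expected output, assuming the transactions temp view registered above: rank whole purchases by Value instead of partitioning by Category, keep ranks 1 to 3, and collect each purchase's categories. Note that Value is parsed as a string above, so cast it before ordering:

SELECT Purchase_ID, collect_list(Category) AS Categories
FROM (SELECT Purchase_ID, Category,
             dense_rank() OVER (ORDER BY CAST(Value AS int) DESC) AS rnk
      FROM transactions) tmp
WHERE rnk <= 3
GROUP BY Purchase_ID

With the sample data this keeps purchases 4, 5 and 1 (values 10, 8 and 7), i.e. the D,A / E,B,A / A,B category sets.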
Related
Input:
Name GroupId Processed NewGroupId NgId
Mike 1 N 9 NULL
Mikes 1 N 9 NULL
Miken 5 Y 9 5
Mikel 5 Y 9 5
Output:
Name GroupId Processed NewGroupId NgId
Mike 1 N 9 5
Mikes 1 N 9 5
Miken 5 Y 9 5
Mikel 5 Y 9 5
The query below worked in SQL Server; because of the correlated subquery, the same does not work in Spark SQL.
Is there an alternative, either with Spark SQL or the PySpark DataFrame API?
SELECT Name,groupid,IsProcessed,ngid,
CASE WHEN ngid IS NULL THEN
COALESCE((SELECT top 1 ngid FROM temp D
WHERE D.NewGroupId = T.NewGroupId AND
D.ngid IS NOT NULL ), null)
ELSE ngid
END AS ngid
FROM temp T
This worked in Spark SQL:
spark.sql("select LKUP, groupid, IsProcessed, NewGroupId, coalesce((select max(D.ngid) from test2 D where D.NewGroupId = T.NewGroupId AND D.ngid is not null), null) as ngid from test2 T")
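A sketch of an alternative that avoids the correlated subquery entirely, assuming the same test2 table: since max() ignores NULLs, a window max over NewGroupId fills in the non-null ngid for every row of the group.

spark.sql("""
  SELECT LKUP, groupid, IsProcessed, NewGroupId,
         max(ngid) OVER (PARTITION BY NewGroupId) AS ngid
  FROM test2""")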
I have table tbl_Survey:
SurveyID 1 2 3 4
7 4 4 4 4
8 3 3 3 3
9 2 2 2 2
My goal is to turn the column headers - 1 2 3 4 - into rows, as follows:
SurveyID Question Rating
7 1 4
7 2 4
7 3 4
7 4 4
8 1 3
8 2 3
8 3 3
8 4 3
9 1 2
9 2 2
9 3 2
9 4 2
My code is (trying to follow help recommendations):
SELECT [SurveyID]
,[Question]
,[Rating]
FROM
[tbl_Survey]
cross apply
(
values
('1', 1 ),
('2', 2 ),
('3', 3 ),
('4', 4 )
) c (Question, Rating);
The results are not fully correct (the Rating column is the problem):
SurveyID Question Rating
7 1 1
7 2 2
7 3 3
7 4 4
8 1 1
8 2 2
8 3 3
8 4 4
9 1 1
9 2 2
9 3 3
9 4 4
Please, help...
My problem (because of which I couldn't proceed) was that I hadn't used brackets in my code.
Here is the updated code:
SELECT [SurveyID], [Question], [Rating]
FROM [dbo].[tbl_Survey]
UNPIVOT
(
[Rating]
FOR [Question] in ([1], [2], [3], [4])
) AS SurveyUnpivot
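For completeness: the original CROSS APPLY attempt was close as well; the VALUES rows just need to reference the numbered columns instead of repeating literal constants, something like:

SELECT [SurveyID], c.[Question], c.[Rating]
FROM [tbl_Survey]
CROSS APPLY
(
    VALUES ('1', [1]),
           ('2', [2]),
           ('3', [3]),
           ('4', [4])
) c (Question, Rating);

Unlike UNPIVOT, this variant also keeps rows whose rating is NULL.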
How about this:
DECLARE @T TABLE (SurveyID int, q1 int, q2 int, q3 int, q4 int)
INSERT @T (SurveyID, q1, q2, q3, q4)
VALUES (7,4,4,4,4), (8,3,3,3,3), (9,2,2,2,2)
SELECT SurveyID, REPLACE(Question,'q','') AS Question, Rating
FROM @T UNPIVOT (Rating FOR Question IN (q1, q2, q3, q4)) AS UPV
Same approach. Just make sure you use a global temporary table as a temp table will not be visible in the scope of the EXEC statement. This should work with any column name and any number of columns.
IF OBJECT_ID('tempdb..##T') IS NOT NULL DROP TABLE ##T
CREATE TABLE ##T (SurveyID int, xxxxx int, yyyyy int, zzzzzz int, tttttt int)
INSERT ##T VALUES (7,4,4,4,4), (8,3,3,3,3), (9, 2, 2, 2, 2)
DECLARE @Colnames nvarchar(4000)
SELECT @Colnames = STUFF((SELECT ',[' + [name] + ']' FROM tempdb.sys.columns WHERE object_id = OBJECT_ID('tempdb..##T') AND name <> 'SurveyID' FOR XML PATH('')), 1, 1, '')
DECLARE @SQL nvarchar(4000)
SET @SQL = 'SELECT SurveyID, Question, Rating FROM ##T UNPIVOT (Rating FOR Question IN (' + @Colnames + ')) AS UPV'
EXEC(@SQL)
Suppose I have data formatted in the following way (FYI, total row count is over 30K):
customer_id order_date order_rank
A 2017-02-19 1
A 2017-02-24 2
A 2017-03-31 3
A 2017-07-03 4
A 2017-08-10 5
B 2016-04-24 1
B 2016-04-30 2
C 2016-07-18 1
C 2016-09-01 2
C 2016-09-13 3
I need a 4th column, let's call it days_since_last_order, which should be 0 where order_rank = 1, and otherwise the number of days since the previous order (the one with rank n-1).
So, the above would return:
customer_id order_date order_rank days_since_last_order
A 2017-02-19 1 0
A 2017-02-24 2 5
A 2017-03-31 3 35
A 2017-07-03 4 94
A 2017-08-10 5 38
B 2016-04-24 1 0
B 2016-04-30 2 6
C 2016-07-18 1 0
C 2016-09-01 2 45
C 2016-09-13 3 12
Is there an easier way to calculate the above with a window function (or similar), rather than joining the entire dataset against itself (e.g. on A.order_rank = B.order_rank - 1) and doing the calculation?
Thanks!
Use the LAG window function:
SELECT
  customer_id
, order_date
, order_rank
, COALESCE(
      DATE(order_date)
    - DATE(LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date))
    , 0) AS days_since_last_order
FROM <table_name>
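The date subtraction above is PostgreSQL-style; on SQL Server the same idea would be a sketch along these lines, using DATEDIFF (with <table_name> as a placeholder, as above):

SELECT customer_id,
       order_date,
       order_rank,
       COALESCE(DATEDIFF(day,
                         LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date),
                         order_date),
                0) AS days_since_last_order
FROM <table_name>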
Let's say I have an array of integers
1 6 6 3 3 8 4 4
It will always be of the form: n pairs of numbers, plus 2 unique numbers.
Is there an efficient way of keeping only the 2 unique values (i.e. the two with a single occurrence)?
Here, I would like to get 1 and 8.
So far, this is what I have:
SELECT node_id
FROM
( SELECT node_id, COUNT(*)
FROM unnest(array[1, 6, 6 , 3, 3 , 8 , 4 ,4]) AS node_id
GROUP BY node_id
) foo
ORDER BY count LIMIT 2;
You are very close, I think:
SELECT node_id
FROM (SELECT node_id, COUNT(*)
FROM unnest(array[1, 6, 6 , 3, 3 , 8 , 4 ,4]) AS node_id
GROUP BY node_id
HAVING count(*) = 1
) foo ;
You can group these back into an array, if you like, using array_agg().
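For example, a sketch of that array_agg() variant:

SELECT array_agg(node_id) AS unique_values
FROM (SELECT node_id
      FROM unnest(array[1, 6, 6, 3, 3, 8, 4, 4]) AS node_id
      GROUP BY node_id
      HAVING count(*) = 1
     ) foo;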
I use Oracle 10g and I have a table that stores a snapshot of data on a person for a given day. Every night an outside process adds new rows to the table for any person who's had any changes to their core data (stored elsewhere). This allows a query to be written using a date to find out what a person 'looked' like on some past day. A new row is added even if only a single aspect of the person has changed; the implication is that many columns carry duplicate values from slice to slice, since not every detail changes in each snapshot.
Below is a data sample:
SliceID PersonID StartDt Detail1 Detail2 Detail3 Detail4 ...
1 101 08/20/09 Red Vanilla N 23
2 101 08/31/09 Orange Chocolate N 23
3 101 09/15/09 Yellow Chocolate Y 24
4 101 09/16/09 Green Chocolate N 24
5 102 01/10/09 Blue Lemon N 36
6 102 01/11/09 Indigo Lemon N 36
7 102 02/02/09 Violet Lemon Y 36
8 103 07/07/09 Red Orange N 12
9 104 01/31/09 Orange Orange N 12
10 104 10/20/09 Yellow Orange N 13
I need to write a query that pulls out time slice records where some pertinent bits, not the whole record, have changed. So, referring to the above, if I only want to know the slices in which Detail3 has changed from its previous value, then I would expect to get only rows having SliceID 1, 3 and 4 for PersonID 101; SliceID 5 and 7 for PersonID 102; SliceID 8 for PersonID 103; and SliceID 9 for PersonID 104.
I'm thinking I should be able to use some sort of Oracle Hierarchical Query (using CONNECT BY [PRIOR]) to get what I want, but I have not figured out how to write it yet. Perhaps YOU can help.
Thank you for your time and consideration.
Here is my take on the LAG() solution, which is basically the same as that of egorius, but I show my workings ;)
SQL> select * from
2 (
3 select sliceid
4 , personid
5 , startdt
6 , detail3 as new_detail3
7 , lag(detail3) over (partition by personid
8 order by startdt) prev_detail3
9 from some_table
10 )
11 where prev_detail3 is null
12 or ( prev_detail3 != new_detail3 )
13 /
SLICEID PERSONID STARTDT N P
---------- ---------- --------- - -
1 101 20-AUG-09 N
3 101 15-SEP-09 Y N
4 101 16-SEP-09 N Y
5 102 10-JAN-09 N
7 102 02-FEB-09 Y N
8 103 07-JUL-09 N
9 104 31-JAN-09 N
7 rows selected.
SQL>
The point about this solution is that it hauls in results for 103 and 104, who don't have slice records where detail3 has changed. If that is a problem, we can apply an additional filter to return only rows with changes:
SQL> with subq as (
2 select t.*
3 , row_number () over (partition by personid
4 order by sliceid ) rn
5 from
6 (
7 select sliceid
8 , personid
9 , startdt
10 , detail3 as new_detail3
11 , lag(detail3) over (partition by personid
12 order by startdt) prev_detail3
13 from some_table
14 ) t
15 where t.prev_detail3 is null
16 or ( t.prev_detail3 != t.new_detail3 )
17 )
18 select sliceid
19 , personid
20 , startdt
21 , new_detail3
22 , prev_detail3
23 from subq sq
24 where exists ( select null from subq x
25 where x.personid = sq.personid
26 and x.rn > 1 )
27 order by sliceid
28 /
SLICEID PERSONID STARTDT N P
---------- ---------- --------- - -
1 101 20-AUG-09 N
3 101 15-SEP-09 Y N
4 101 16-SEP-09 N Y
5 102 10-JAN-09 N
7 102 02-FEB-09 Y N
SQL>
edit
As egorius points out in the comments, the OP does want hits for all users, even if they haven't changed, so the first version of the query is the correct solution.
In addition to OMG Ponies' answer: if you need to query slices for all persons, you'll need partition by:
SELECT s.sliceid
, s.personid
FROM (SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (
PARTITION BY t.personid ORDER BY t.startdt
) prev_val
FROM t) s
WHERE (s.prev_val IS NULL OR s.prev_val != s.detail3)
I think you'll have better luck with the LAG function:
SELECT s.sliceid
FROM (SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (PARTITION BY t.personid ORDER BY t.startdt) AS prev_val
FROM TABLE t) s
WHERE s.personid = 101
AND (s.prev_val IS NULL OR s.prev_val != s.detail3)
Subquery Factoring alternative:
WITH slices AS (
SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (PARTITION BY t.personid ORDER BY t.startdt) AS prev_val
FROM TABLE t)
SELECT s.sliceid
FROM slices s
WHERE s.personid = 101
AND (s.prev_val IS NULL OR s.prev_val != s.detail3)