Spark: How do I find the passengers who have been on more than 3 flights together? - scala

I have a dataset as the following
passengerId, flightId, from, to, date
56, 0, cg, ir, 2017-01-01
78, 0, cg, ir, 2017-01-01
12, 0, cg, ir, 2017-02-01
34, 0, cg, ir, 2017-02-01
51, 0, cg, ir, 2017-02-01
56, 1, ir, uk, 2017-01-02
78, 1, ir, uk, 2017-01-02
12, 1, ir, uk, 2017-02-02
34, 1, ir, uk, 2017-02-02
51, 1, ir, uk, 2017-02-02
56, 2, uk, in, 2017-01-05
78, 2, uk, in, 2017-01-05
12, 2, uk, in, 2017-02-05
34, 2, uk, in, 2017-02-05
51, 3, uk, in, 2017-02-05
I need to present a report in the following formats.
Passenger 1 ID Passenger 2 ID No_flights_together
56 78 6
12 34 8
… … …
Find the passengers who have been on more than N flights together within a given date range:
Passenger 1 ID Passenger 2 ID No_Flights_Together From To
56 78 6 2017-01-01 2017-03-01
12 34 8 2017-04-05 2017-12-01
… … … … …
I'm not sure how to go about it. Help would be appreciated.

You can self-join the DataFrame on df1.passengerId < df2.passengerId (so each pair appears exactly once, in a canonical order) along with matching flightId and date, followed by the necessary count(*), min(date) and max(date) using groupBy/agg:
val df = Seq(
  (56, 0, "2017-01-01"),
  (78, 0, "2017-01-01"),
  (12, 0, "2017-02-01"),
  (34, 0, "2017-02-01"),
  (51, 0, "2017-02-01"),
  (56, 1, "2017-01-02"),
  (78, 1, "2017-01-02"),
  (12, 1, "2017-02-02"),
  (34, 1, "2017-02-02"),
  (51, 1, "2017-02-02"),
  (56, 2, "2017-01-05"),
  (78, 2, "2017-01-05"),
  (12, 2, "2017-02-01"),
  (34, 2, "2017-02-01"),
  (51, 3, "2017-02-01")
).toDF("passengerId", "flightId", "date")
df.as("df1").join(df.as("df2"),
$"df1.passengerId" < $"df2.passengerId" &&
$"df1.flightId" === $"df2.flightId" &&
$"df1.date" === $"df2.date",
"inner"
).
groupBy($"df1.passengerId", $"df2.passengerId").
agg(count("*").as("flightsTogether"), min($"df1.date").as("from"), max($"df1.date").as("to")).
where($"flightsTogether" >= 3).
show
// +-----------+-----------+---------------+----------+----------+
// |passengerId|passengerId|flightsTogether| from| to|
// +-----------+-----------+---------------+----------+----------+
// | 12| 34| 3|2017-02-01|2017-02-02|
// | 56| 78| 3|2017-01-01|2017-01-05|
// +-----------+-----------+---------------+----------+----------+
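For the second report (passengers with more than N flights together within a given range), the same join can be wrapped in a function that filters on the date window before grouping. A minimal sketch, assuming spark.implicits._ is in scope and that date is an ISO-formatted string (so between compares correctly); the function name and parameters below are illustrative:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def flownTogether(df: DataFrame, n: Int, from: String, to: String): DataFrame =
  df.as("df1").join(df.as("df2"),
      $"df1.passengerId" < $"df2.passengerId" &&
      $"df1.flightId" === $"df2.flightId" &&
      $"df1.date" === $"df2.date",
      "inner"
    ).
    where($"df1.date".between(from, to)).  // keep only flights inside the requested range
    groupBy($"df1.passengerId", $"df2.passengerId").
    agg(count("*").as("flightsTogether"), min($"df1.date").as("from"), max($"df1.date").as("to")).
    where($"flightsTogether" > n)

// e.g. flownTogether(df, 3, "2017-01-01", "2017-03-01").show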


Looking to get counts of items within ArrayType column without using Explode

NOTE: I'm working with Spark 2.4
Here is my dataset:
df
col
[1,3,1,4]
[1,1,1,2]
I'd like to essentially get a value_counts of the values in the array. The resulting df would look like this:
df_upd
col
[{1:2},{3:1},{4:1}]
[{1:3},{2:1}]
I know I can do this by exploding df and then taking a group by but I'm wondering if I can do this without exploding.
Here's a solution using a udf that outputs the result as a MapType. It expects integer values in your arrays (easily changed) and returns integer counts.
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = sc.parallelize([([1, 2, 3, 3, 1],),([4, 5, 6, 4, 5],),([2, 2, 2],),([3, 3],)]).toDF(['arrays'])
df.show()
+---------------+
| arrays|
+---------------+
|[1, 2, 3, 3, 1]|
|[4, 5, 6, 4, 5]|
| [2, 2, 2]|
| [3, 3]|
+---------------+
from collections import Counter

@F.udf(returnType=T.MapType(T.IntegerType(), T.IntegerType(), valueContainsNull=False))
def count_elements(array):
    return dict(Counter(array))

df.withColumn('counts', count_elements(F.col('arrays'))).show(truncate=False)
+---------------+------------------------+
|arrays |counts |
+---------------+------------------------+
|[1, 2, 3, 3, 1]|[1 -> 2, 2 -> 1, 3 -> 2]|
|[4, 5, 6, 4, 5]|[4 -> 2, 5 -> 2, 6 -> 1]|
|[2, 2, 2] |[2 -> 3] |
|[3, 3] |[3 -> 2] |
+---------------+------------------------+
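Since you're on Spark 2.4, the udf can also be avoided entirely with the built-in higher-order functions. Below is a hedged, untested sketch in Scala (the SQL string itself is language-agnostic, so from PySpark you could pass the same expression to F.expr): for each distinct element, count its occurrences with filter/size, then zip values and counts into a map with map_from_arrays:
import org.apache.spark.sql.functions.expr

// counts each distinct value of the `arrays` column without exploding
val counted = df.withColumn("counts", expr(
  "map_from_arrays(array_distinct(arrays), " +
  "transform(array_distinct(arrays), x -> size(filter(arrays, y -> y = x))))"))
counted.show(truncate = false)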

scala dataframe join columns and split arrays explode spark

I have some co-ordinates in multiple array columns in a dataframe and want to split them out so that x, y, z each end up in a separate column, in order: column 1 data first, then column 2.
for example...
COL 1 | COL2
[[x,y,z],[x,y,z],[x,y,z]...] | [[x,y,z],[x,y,z],[x,y,z]...]
e.g
[[1,1,1],[2,2,2],[3,3,3]...] | [[8,8,8],[9,9,9],[10,10,10]...]
required OUTPUT
COL X | COL Y | COL Z
x,x,x,x,x.... | y,y,y,y,y.... | z,z,z,z,z....
e.g.
1,2,3,..,8,9,10.. | 1,2,3,..,8,9,10.. | 1,2,3,..,8,9,10..
any help appreciated
You can use the array_union function as follows (this assumes each column is an array of structs, so col1._1 selects the array of the structs' first fields):
df.select(
  array_union($"col1._1", $"col2._1").as("x"),
  array_union($"col1._2", $"col2._2").as("y"),
  array_union($"col1._3", $"col2._3").as("z"))
INPUT
+--------------------------------------------+--------------------------------------------------+
|col1 |col2 |
+--------------------------------------------+--------------------------------------------------+
|[[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]|[[8, 8, 8], [9, 9, 9], [10, 10, 10], [11, 11, 11]]|
+--------------------------------------------+--------------------------------------------------+
OUTPUT
+--------------------------+--------------------------+--------------------------+
|x |y |z |
+--------------------------+--------------------------+--------------------------+
|[1, 2, 3, 4, 8, 9, 10, 11]|[1, 2, 3, 4, 8, 9, 10, 11]|[1, 2, 3, 4, 8, 9, 10, 11]|
+--------------------------+--------------------------+--------------------------+
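Note that array_union removes duplicate values, so the output above contains only unique coordinates. If duplicates must be preserved, concat on array columns (Spark 2.4+) keeps them. A sketch under the same assumption that col1 and col2 are arrays of (x, y, z) structs; the sample rows are hypothetical:
import org.apache.spark.sql.functions._
import spark.implicits._

// hypothetical input: each column is an array of (x, y, z) tuples
val df = Seq((
  Seq((1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4)),
  Seq((8, 8, 8), (9, 9, 9), (10, 10, 10), (11, 11, 11))
)).toDF("col1", "col2")

df.select(
  concat($"col1._1", $"col2._1").as("x"),  // keeps duplicates, unlike array_union
  concat($"col1._2", $"col2._2").as("y"),
  concat($"col1._3", $"col2._3").as("z")).show(false)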

efficient way to reformat/shift time series data using Spark

I want to build some time series models using Spark. The first step is to reformat the sequence data into training samples. The idea is:
original sequential data (each t* is a number)
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
desired output
t1 t2 t3 t4 t5 t6
t2 t3 t4 t5 t6 t7
t3 t4 t5 t6 t7 t8
..................
How do I write a function in Spark to do this?
The function signature should be something like
reformat(Array[Integer], n: Integer)
with a DataFrame or Vector as the return type.
==========The code I tried on Spark 1.6.1 =========
val arraydata=Array[Double](1,2,3,4,5,6,7,8,9,10)
val slideddata = arraydata.sliding(4).toSeq
val rows = arraydata.sliding(4).map{x=>Row(x:_*)}
sc.parallelize(arraydata.sliding(4).toSeq).toDF("Values")
The final line does not compile, failing with this error:
Error:(52, 48) value toDF is not a member of org.apache.spark.rdd.RDD[Array[Double]]
sc.parallelize(arraydata.sliding(4).toSeq).toDF("Values")
I was not able to figure out the significance of n, since it could be either the window size or the amount by which to shift.
Hence, here are both flavours:
If n is the window size:
def reformat(arrayOfInteger: Array[Int], shiftValue: Int) = {
  sc.parallelize(arrayOfInteger.sliding(shiftValue).toSeq).toDF("values")
}
On REPL:
scala> def reformat(arrayOfInteger:Array[Int], shiftValue: Int) ={
| sc.parallelize(arrayOfInteger.sliding(shiftValue).toSeq).toDF("values")
| }
reformat: (arrayOfInteger: Array[Int], shiftValue: Int)org.apache.spark.sql.DataFrame
scala> val arrayofInteger=(1 to 10).toArray
arrayofInteger: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> reformat(arrayofInteger,3).show
+----------+
| values|
+----------+
| [1, 2, 3]|
| [2, 3, 4]|
| [3, 4, 5]|
| [4, 5, 6]|
| [5, 6, 7]|
| [6, 7, 8]|
| [7, 8, 9]|
|[8, 9, 10]|
+----------+
If n is the value to be shifted:
def reformat(arrayOfInteger: Array[Int], shiftValue: Int) = {
  val slidingValue = arrayOfInteger.size - shiftValue
  sc.parallelize(arrayOfInteger.sliding(slidingValue).toSeq).toDF("values")
}
On REPL:
scala> def reformat(arrayOfInteger:Array[Int], shiftValue: Int) ={
| val slidingValue=arrayOfInteger.size-shiftValue
| sc.parallelize(arrayOfInteger.sliding(slidingValue).toSeq).toDF("values")
| }
reformat: (arrayOfInteger: Array[Int], shiftValue: Int)org.apache.spark.sql.DataFrame
scala> val arrayofInteger=(1 to 10).toArray
arrayofInteger: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> reformat(arrayofInteger,3).show(false)
+----------------------+
|values |
+----------------------+
|[1, 2, 3, 4, 5, 6, 7] |
|[2, 3, 4, 5, 6, 7, 8] |
|[3, 4, 5, 6, 7, 8, 9] |
|[4, 5, 6, 7, 8, 9, 10]|
+----------------------+
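If the series already lives in a DataFrame (one value per row) rather than in a local array, the same sliding windows can be built with collect_list over a row-based window. This is a hedged sketch for newer Spark versions than the asker's 1.6.1, assuming a hypothetical ord column that fixes the ordering:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// hypothetical input: one value per row, with an explicit ordering column
val series = (1 to 10).zipWithIndex.map { case (v, i) => (i, v) }.toDF("ord", "value")

val n = 4 // window size
val w = Window.orderBy("ord").rowsBetween(Window.currentRow, n - 1)

series
  .withColumn("values", collect_list($"value").over(w))
  .where(size($"values") === n) // drop the shorter windows at the tail
  .select("values")
  .show()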

DateDiff with multiple events Same Column SQL Server

Thank you in advance for any assistance you can provide. I am looking to get the date difference between events that are stored in the same column. Referring to the sample data, I am looking for the differences between the "Partial Submission" events and their respective subsequent "Jr Reviewed" events.
Referring to the same data again, I need the DATEDIFF from:
1st "Partial Submission" to 1st "Jr Reviewed"
2nd "Partial Submission" to 2nd "Jr Reviewed"
6th "Partial Submission" to 3rd "Jr Reviewed"
I am not sure where to start; all I have done is add the row numbers, which are partitioned by "Descrip" and ordered by "Date" ascending. Any guidance on or method of accomplishing this (a recursive CTE?) would be greatly appreciated.
DECLARE #Tbl TABLE (RowNumber INT, RecordNumber INT, IDX INT, DESCRIP NVARCHAR(50), DATES DATETIME, EVENTNUM INT)
INSERT INTO #Tbl
VALUES
(1, 11515, 13, 'Partial Submission', '8/12/16 00:21', 3078),
(1, 11515, 14, 'Junior Reviewed', '8/12/16 15:52', 3089),
(2, 11515, 26, 'Partial Submission', '8/18/16 15:24', 3078),
(3, 11515, 33, 'Partial Submission', '9/6/16 9:47', 3078),
(4, 11515, 34, 'Partial Submission', '9/6/16 9:47', 3078),
(5, 11515, 39, 'Partial Submission', '9/9/16 13:19', 3078),
(2, 11515, 40, 'Junior Reviewed', '9/11/16 8:30', 3089),
(6, 11515, 46, 'Partial Submission', '9/15/16 12:30', 3078),
(3, 11515, 54, 'Junior Reviewed', '9/17/16 10:01', 3089),
(7, 11515, 57, 'Full! Submission', '9/19/16 9:16', 3079),
(1, 11520, 19, 'Partial Submission', '8/20/16 00:42', 3078),
(1, 11520, 22, 'Junior Reviewed', '8/22/16 9:06', 3089),
(2, 11520, 28, 'Partial Submission', '8/29/16 20:12', 3078),
(2, 11520, 34, 'Junior Reviewed', '9/1/16 8:20', 3089),
(3, 11520, 38, 'Partial Submission', '9/8/16 15:03', 3078),
(4, 11520, 39, 'Partial Submission', '9/8/16 15:03', 3078),
(3, 11520, 47, 'Junior Reviewed', '9/14/16 13:53', 3089),
(5, 11520, 48, 'Full! Submission', '9/16/16 13:19', 3079),
(4, 11520, 52, 'Junior Reviewed', '9/17/16 10:51', 3089),
(6, 11520, 53, 'Full! Submission', '9/19/16 16:21', 3079)
;WITH CTE
AS
(
SELECT
*,
RowId = ROW_NUMBER() OVER (Partition BY Recordnumber ORDER BY Recordnumber, IDX),
RowIdByDescrip = ROW_NUMBER() OVER (PARTITION BY Recordnumber, DESCRIP ORDER BY Recordnumber, IDX)
FROM #tbl
)
,Test as
(
SELECT
A.Recordnumber,
A.DESCRIP,
A.EVENTNUM,
A.IDX,
A.DATES StartDate,
LEAD(A.DATES) OVER ( Partition BY A.Recordnumber ORDER BY A.IDX) EndDate,
DATEDIFF(HOUR, A.DATES, LEAD(A.DATES) OVER (Partition BY A.Recordnumber ORDER BY A.IDX)) AS DateDifff
FROM #tbl A INNER JOIN
(
SELECT
C.Recordnumber,
MIN(C.IDX) AS IDX
FROM
CTE C
GROUP BY
C.RowId - C.RowIdByDescrip,
C.DESCRIP,
C.Recordnumber
) B ON A.IDX = B.IDX and A.Recordnumber = B.Recordnumber
)
Select
*
From Test
Where eventnum in ('3078')
order by Recordnumber, IDX
Try the below. The RowId - RowIdByDescrip difference groups each consecutive run of the same DESCRIP into an island, so MIN(IDX) picks the first event of each run:
DECLARE #Tbl TABLE (RowNumber INT, RecordNumber INT, IDX INT, DESCRIP NVARCHAR(50), DATES DATETIME, EVENTNUM INT)
INSERT INTO #Tbl
VALUES
(1, 11515, 13, 'Partial Submission', '8/12/16 00:21', 3078),
(1, 11515, 14, 'Junior Reviewed', '8/12/16 15:52', 3089),
(2, 11515, 26, 'Partial Submission', '8/18/16 15:24', 3078),
(3, 11515, 33, 'Partial Submission', '9/6/16 9:47', 3078),
(4, 11515, 34, 'Partial Submission', '9/6/16 9:47', 3078),
(5, 11515, 39, 'Partial Submission', '9/9/16 13:19', 3078),
(2, 11515, 40, 'Junior Reviewed', '9/11/16 8:30', 3089),
(6, 11515, 46, 'Partial Submission', '9/15/16 12:30', 3078),
(3, 11515, 54, 'Junior Reviewed', '9/17/16 10:01', 3089),
(7, 11515, 57, 'Full! Submission', '9/19/16 9:16', 3079),
(1, 11520, 19, 'Partial Submission', '8/20/16 00:42', 3078),
(1, 11520, 22, 'Junior Reviewed', '8/22/16 9:06', 3089),
(2, 11520, 28, 'Partial Submission', '8/29/16 20:12', 3078),
(2, 11520, 34, 'Junior Reviewed', '9/1/16 8:20', 3089),
(3, 11520, 38, 'Partial Submission', '9/8/16 15:03', 3078),
(4, 11520, 39, 'Partial Submission', '9/8/16 15:03', 3078),
(3, 11520, 47, 'Junior Reviewed', '9/14/16 13:53', 3089),
(5, 11520, 48, 'Full! Submission', '9/16/16 13:19', 3079),
(4, 11520, 52, 'Junior Reviewed', '9/17/16 10:51', 3089),
(6, 11520, 53, 'Full! Submission', '9/19/16 16:21', 3079)
;WITH CTE
AS
(
SELECT
*,
ROW_NUMBER() OVER (ORDER BY IDX) RowId,
ROW_NUMBER() OVER (PARTITION BY DESCRIP ORDER BY IDX) RowIdByDescrip
FROM #Tbl
WHERE
EVENTNUM IN
(
3078, --Partial Submission
3089 -- Junior Reviewed
)
), CTE2
AS
(
SELECT
MIN(C.IDX) AS IDX
FROM
CTE C
GROUP BY
C.RowId - C.RowIdByDescrip,
C.DESCRIP
)
SELECT
R.RecordNumber,
R.IDX ,
R.StartDate ,
R.EndDate ,
R.DateDifff
FROM
(
SELECT
A.EVENTNUM,
A.RecordNumber,
A.DESCRIP,
A.IDX,
A.DATES StartDate,
LEAD(A.DATES) OVER (ORDER BY A.IDX) EndDate,
DATEDIFF(HOUR, A.DATES, LEAD(A.DATES) OVER (ORDER BY A.IDX)) AS DateDifff
FROM
#Tbl A INNER JOIN
CTE2 B ON A.IDX = B.IDX
) R
WHERE
R.EVENTNUM = 3078 --Partial Submission
ORDER BY R.RecordNumber
Result:
RecordNumber IDX StartDate EndDate DateDifff
------------ ----------- ---------------- ---------------- -----------
11515 13 2016-08-12 00:21 2016-08-12 15:52 15
11515 26 2016-08-18 15:24 2016-09-06 09:47 450
11515 34 2016-09-06 09:47 2016-09-01 08:20 -121
11515 46 2016-09-15 12:30 2016-09-14 13:53 -23
11520 38 2016-09-08 15:03 2016-09-11 08:30 65
11520 19 2016-08-20 00:42 2016-08-22 09:06 57
This is not an answer, just too long for a comment.
I did not understand the question exactly, so let me tell you what I did.
First, I sort according to IDX.
I work only between the two events, Partial Submission and Junior Reviewed.
Intermediate result (Partial Submission = Start, Junior Reviewed = End):
RowNumber RecordNumber IDX DESCRIP DATES EVENTNUM RowId RowIdByDescrip
----------- ------------ ----------- -------------------------------------------------- ----------------------- ----------- -------------------- --------------------
1 11515 13 Partial Submission Start 2016-08-12 00:21:00.000 3078 1 1
1 11515 14 Junior Reviewed END 2016-08-12 15:52:00.000 3089 2 1
1 11520 19 Partial Submission Start 2016-08-20 00:42:00.000 3078 3 2
1 11520 22 Junior Reviewed END 2016-08-22 09:06:00.000 3089 4 2
2 11515 26 Partial Submission Start 2016-08-18 15:24:00.000 3078 5 3
2 11520 28 Partial Submission 2016-08-29 20:12:00.000 3078 6 4
3 11515 33 Partial Submission 2016-09-06 09:47:00.000 3078 7 5
4 11515 34 Partial Submission 2016-09-06 09:47:00.000 3078 8 6
2 11520 34 Junior Reviewed End 2016-09-01 08:20:00.000 3089 9 3
3 11520 38 Partial Submission Start 2016-09-08 15:03:00.000 3078 10 7
4 11520 39 Partial Submission 2016-09-08 15:03:00.000 3078 11 8
5 11515 39 Partial Submission 2016-09-09 13:19:00.000 3078 12 9
2 11515 40 Junior Reviewed End 2016-09-11 08:30:00.000 3089 13 4
6 11515 46 Partial Submission Start 2016-09-15 12:30:00.000 3078 14 10
3 11520 47 Junior Reviewed End 2016-09-14 13:53:00.000 3089 15 5
4 11520 52 Junior Reviewed 2016-09-17 10:51:00.000 3089 16 6
3 11515 54 Junior Reviewed 2016-09-17 10:01:00.000 3089 17 7
Result:
RecordNumber IDX StartDate EndDate DateDifff
------------ ----------- ---------------- ---------------- -----------
11515 13 2016-08-12 00:21 2016-08-12 15:52 15
11515 26 2016-08-18 15:24 2016-09-06 09:47 450
11515 34 2016-09-06 09:47 2016-09-01 08:20 -121
11515 46 2016-09-15 12:30 2016-09-14 13:53 -23
11520 38 2016-09-08 15:03 2016-09-11 08:30 65
11520 19 2016-08-20 00:42 2016-08-22 09:06 57

Use psycopg2 to do loop in postgresql

I use postgresql 8.4 to route a river network, and I want to use psycopg2 to loop through all data points in my river network.
#set up python and postgresql connection
import psycopg2
query = """
select *
from driving_distance ($$
select
gid as id,
start_id::int4 as source,
end_id::int4 as target,
shape_leng::double precision as cost
from network
$$, %s, %s, %s, %s
)
;"""
conn = psycopg2.connect("dbname = 'routing_template' user = 'postgres' host = 'localhost' password = '****'")
cur = conn.cursor()
while True:
    i = 1
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        i = i + 1
    else:
        break
rs = cur.fetchall()
conn.close()
print rs
The code above takes a long time to run even though I have set the maximum iterator i to 2, and the output is an error message containing garbage.
I am thinking that perhaps PostgreSQL can accept only one result set at a time, so I tried to put this line in my loop:
rs(i) = cur.fetchall()
but the error message said that this line has a bug.
I know that I can't write code like rs(i), but I don't know what replacement would let me validate my assumption.
So should I save one result to a file first, then use the next iterator to run the loop, again and again?
I am working with postgresql 8.4, python 2.7.6 under Windows 8.1 x64.
Update#1
I can do the loop using Clodoaldo Neto's code (thanks), and the result looks like this:
[(1, 2, 0.0), (2, 2, 4729.33082850235), (3, 19, 4874.27571718902), (4, 3, 7397.215962901), (5, 4,
6640.31749097187), (6, 7, 10285.3869655786), (7, 7, 14376.1087618696), (8, 5, 15053.164236979), (9, 10, 16243.5973710466), (10, 8, 19307.3024368889), (11, 9, 21654.8669532788), (12, 11, 23522.6224229233), (13, 18, 29706.6964721152), (14, 21, 24034.6792693279), (15, 18, 25408.306370489), (16, 20, 34204.1769580924), (17, 11, 26465.8348728118), (18, 20, 38596.7313209197), (19, 13, 35184.9925532175), (20, 16, 36530.059646027), (21, 15, 35789.4069722436), (22, 15, 38168.1750567026)]
[(1, 2, 4729.33082850235), (2, 2, 0.0), (3, 19, 144.944888686669), (4, 3, 2667.88513439865), (5, 4, 1910.98666246952), (6, 7, 5556.05613707624), (7, 7, 9646.77793336723), (8, 5, 10323.8334084767), (9, 10, 11514.2665425442), (10, 8, 14577.9716083866), (11, 9, 16925.5361247765), (12, 11, 18793.2915944209), (13, 18, 24977.3656436129), (14, 21, 19305.3484408255), (15, 18, 20678.9755419867), (16, 20, 29474.8461295901), (17, 11, 21736.5040443094), (18, 20, 33867.4004924174), (19, 13, 30455.6617247151), (20, 16, 31800.7288175247), (21, 15, 31060.0761437413), (22, 15, 33438.8442282003)]
but if I want the output to look like this instead,
(1, 2, 7397.215962901)
(2, 2, 2667.88513439865)
(3, 19, 2522.94024571198)
(4, 3, 0.0)
(5, 4, 4288.98201949483)
(6, 7, 7934.05149410155)
(7, 7, 12024.7732903925)
(8, 5, 12701.828765502)
(9, 10, 13892.2618995696)
(10, 8, 16955.9669654119)
(11, 9, 19303.5314818018)
(12, 11, 21171.2869514462)
(13, 18, 27355.3610006382)
(14, 21, 21683.3437978508)
(15, 18, 23056.970899012)
(16, 20, 31852.8414866154)
(17, 11, 24114.4994013347)
(18, 20, 36245.3958494427)
(19, 13, 32833.6570817404)
(20, 16, 34178.72417455)
(21, 15, 33438.0715007666)
(22, 15, 35816.8395852256)
what small change should I make to the code?
rs = []
while True:
    i = 1
    if i <= 2:
        cur.execute(query, (i, 1000000, False, False))
        rs.extend(cur.fetchall())
        i = i + 1
    else:
        break
conn.close()
print rs
If it is just a counter that breaks the loop, then:
rs = []
i = 1
while i <= 2:
    cur.execute(query, (i, 1000000, False, False))
    rs.extend(cur.fetchall())
    i = i + 1
conn.close()
print rs