T-SQL: Consider timestamp ordering to determine group

I have an interesting grouping case that I can't figure out the best way to approach. I'm dealing with a lot of data, so I'm trying to find the most efficient way to write my query.
Here's an example of the data I could have:
|Line |Trip |Speed  |Timestamp
=============================================
|100  |1    |50 kmh |2017-06-12 22:34:50
|100  |1    |55 kmh |2017-06-12 22:36:44
|100  |1    |56 kmh |2017-06-12 22:37:12
|200  |5    |12 kmh |2017-06-12 22:40:11
|200  |5    |18 kmh |2017-06-12 22:43:13
|100  |1    |23 kmh |2017-06-12 22:49:11
|100  |1    |45 kmh |2017-06-12 22:53:49
I would like to assign a sequential group number based on Line & Trip, ordered by the timestamp. So at the end I would expect something such as:
|Line |Trip |Speed  |Timestamp           |GroupNumber
=======================================================
|100  |1    |50 kmh |2017-06-12 22:34:50 | 1
|100  |1    |55 kmh |2017-06-12 22:36:44 | 1
|100  |1    |56 kmh |2017-06-12 22:37:12 | 1
|200  |5    |12 kmh |2017-06-12 22:40:11 | 2
|200  |5    |18 kmh |2017-06-12 22:43:13 | 2
|100  |1    |23 kmh |2017-06-12 22:49:11 | 3
|100  |1    |45 kmh |2017-06-12 22:53:49 | 3
I can't find any way to do this using DENSE_RANK or ROW_NUMBER, because the last group of 100/1 gets merged back with the first records. It should not, since there is another group (200/5) between the two occurrences.
Any help would be appreciated.
Thank you.

This is known as a Gaps and Islands problem.
To identify your islands, you need two ranking functions: the first finds the position of a row within the whole set, and the second finds the position of the row within its subset (Line, Trip). So you would end up with:
SELECT  Line,
        Trip,
        Speed,
        Timestamp,
        R1 = ROW_NUMBER() OVER(ORDER BY Timestamp),
        R2 = ROW_NUMBER() OVER(PARTITION BY Line, Trip ORDER BY Timestamp)
FROM    dbo.YourTable;
|Line |Trip |Speed  |Timestamp           |R1 |R2
==================================================
|100  |1    |50 kmh |2017-06-12 22:34:50 |1  |1
|100  |1    |55 kmh |2017-06-12 22:36:44 |2  |2
|100  |1    |56 kmh |2017-06-12 22:37:12 |3  |3
|200  |5    |12 kmh |2017-06-12 22:40:11 |4  |1   <-- starts new sequence for new group
|200  |5    |18 kmh |2017-06-12 22:43:13 |5  |2
|100  |1    |23 kmh |2017-06-12 22:49:11 |6  |4   <-- follows on from where it left off
|100  |1    |45 kmh |2017-06-12 22:53:49 |7  |5
Now, if you subtract one from the other, you get a unique identifier for each island:
|Line |Trip |Speed  |Timestamp           |R1 |R2 |(R1 - R2)
=============================================================
|100  |1    |50 kmh |2017-06-12 22:34:50 |1  |1  |0
|100  |1    |55 kmh |2017-06-12 22:36:44 |2  |2  |0
|100  |1    |56 kmh |2017-06-12 22:37:12 |3  |3  |0
|200  |5    |12 kmh |2017-06-12 22:40:11 |4  |1  |3
|200  |5    |18 kmh |2017-06-12 22:43:13 |5  |2  |3
|100  |1    |23 kmh |2017-06-12 22:49:11 |6  |4  |2
|100  |1    |45 kmh |2017-06-12 22:53:49 |7  |5  |2
Finally, you can use this unique identifier to get a starting time for each group:
SELECT  Line,
        Trip,
        Speed,
        Timestamp,
        GroupStart = MIN(Timestamp) OVER(PARTITION BY Line, Trip, IslandID)
FROM    (   SELECT  Line,
                    Trip,
                    Speed,
                    Timestamp,
                    IslandID = ROW_NUMBER() OVER(ORDER BY Timestamp) -
                               ROW_NUMBER() OVER(PARTITION BY Line, Trip ORDER BY Timestamp)
            FROM    dbo.YourTable
        ) AS t;
Finally, you can apply DENSE_RANK() to the group start time to get an integer ranking. With your sample data you would get:
-- SAMPLE DATA
DECLARE @T TABLE (Line INT, Trip INT, Speed VARCHAR(6), Timestamp DATETIME);
INSERT @T (Line, Trip, Speed, Timestamp)
VALUES
    (100, 1, '50 kmh', '2017-06-12 22:34:50'),
    (100, 1, '55 kmh', '2017-06-12 22:36:44'),
    (100, 1, '56 kmh', '2017-06-12 22:37:12'),
    (200, 5, '12 kmh', '2017-06-12 22:40:11'),
    (200, 5, '18 kmh', '2017-06-12 22:43:13'),
    (100, 1, '23 kmh', '2017-06-12 22:49:11'),
    (100, 1, '45 kmh', '2017-06-12 22:53:49');

WITH GroupedData AS
(   SELECT  Line,
            Trip,
            Speed,
            Timestamp,
            GroupStart = MIN(Timestamp) OVER(PARTITION BY Line, Trip, IslandID),
            IslandID
    FROM    (   SELECT  Line,
                        Trip,
                        Speed,
                        Timestamp,
                        IslandID = ROW_NUMBER() OVER(ORDER BY Timestamp) -
                                   ROW_NUMBER() OVER(PARTITION BY Line, Trip ORDER BY Timestamp)
                FROM    @T
            ) AS t
)
SELECT  Line,
        Trip,
        Speed,
        Timestamp,
        GroupNumber = DENSE_RANK() OVER(ORDER BY GroupStart, IslandID)
FROM    GroupedData
ORDER BY Timestamp;
OUTPUT
Line  Trip  Speed   Timestamp                GroupNumber
--------------------------------------------------------------
100   1     50 kmh  2017-06-12 22:34:50.000  1
100   1     55 kmh  2017-06-12 22:36:44.000  1
100   1     56 kmh  2017-06-12 22:37:12.000  1
200   5     12 kmh  2017-06-12 22:40:11.000  2
200   5     18 kmh  2017-06-12 22:43:13.000  2
100   1     23 kmh  2017-06-12 22:49:11.000  3
100   1     45 kmh  2017-06-12 22:53:49.000  3


Loop through large dataframe in Pyspark - alternative

df_hrrchy
|lefId |Lineage |
|-------|--------------------------------------|
|36326 |["36326","36465","36976","36091","82"]|
|36121 |["36121","36908","36976","36091","82"]|
|36380 |["36380","36465","36976","36091","82"]|
|36448 |["36448","36465","36976","36091","82"]|
|36683 |["36683","36465","36976","36091","82"]|
|36949 |["36949","36908","36976","36091","82"]|
|37349 |["37349","36908","36976","36091","82"]|
|37026 |["37026","36908","36976","36091","82"]|
|36879 |["36879","36465","36976","36091","82"]|
df_trans
|tranID | T_Id |
|-----------|-------------------------------------------------------------------------|
|1000540 |["36121","36326","37349","36949","36380","37026","36448","36683","36879"]|
df_creds
|T_Id |T_val |T_Goal |Parent_T_Id |Parent_Val |parent_Goal|
|-------|-------|-------|---------------|----------------|-----------|
|36448 |100 |1 |36465 |200 |1 |
|36465 |200 |1 |36976 |300 |2 |
|36326 |90 |1 |36465 |200 |1 |
|36091 |500 |19 |82 |600 |4 |
|36121 |90 |1 |36908 |200 |1 |
|36683 |90 |1 |36465 |200 |1 |
|36908 |200 |1 |36976 |300 |2 |
|36949 |90 |1 |36908 |200 |1 |
|36976 |300 |2 |36091 |500 |19 |
|37026 |90 |1 |36908 |200 |1 |
|37349 |100 |1 |36908 |200 |1 |
|36879 |90 |1 |36465 |200 |1 |
|36380 |90 |1 |36465 |200 |1 |
Desired Result
|T_id  |children                                   |T_Val |T_Goal |parent_T_id |parent_Goal |trans_id |
|------|-------------------------------------------|------|-------|------------|------------|---------|
|36091 |["36976"]                                  |500   |19     |82          |4           |1000540  |
|36465 |["36448","36326","36683","36879","36380"] |200   |1      |36976       |2           |1000540  |
|36908 |["36121","36949","37026","37349"]          |200   |1      |36976       |2           |1000540  |
|36976 |["36465","36908"]                          |300   |2      |36091       |19          |1000540  |
|36683 |null                                       |90    |1      |36465       |1           |1000540  |
|37026 |null                                       |90    |1      |36908       |1           |1000540  |
|36448 |null                                       |100   |1      |36465       |1           |1000540  |
|36949 |null                                       |90    |1      |36908       |1           |1000540  |
|36326 |null                                       |90    |1      |36465       |1           |1000540  |
|36380 |null                                       |90    |1      |36465       |1           |1000540  |
|36879 |null                                       |90    |1      |36465       |1           |1000540  |
|36121 |null                                       |90    |1      |36908       |1           |1000540  |
|37349 |null                                       |100   |1      |36908       |1           |1000540  |
Code Tried
from pyspark.sql import functions as F
from pyspark.sql import DataFrame
from pyspark.sql.functions import explode, collect_set, expr, col, collect_list, array_contains, lit
from functools import reduce

for row in df_transactions.rdd.toLocalIterator():
    # def find_nodemap(row):
    dfs = []
    df_hy_set = (df_hrrchy.filter(df_hrrchy.lefId.isin(row["T_ds"]))
                 .select(explode("Lineage").alias("Terrs"))
                 .agg(collect_set(col("Terrs")).alias("hierarchy_list"))
                 .select(F.lit(row["trans_id"]).alias("trans_id"), "hierarchy_list")
                 )
    df_childrens = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                    .select("T_id", "T_Val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                    .groupBy("parent_T_id").agg(collect_list("T_id").alias("children"))
                    )
    df_filter_creds = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                       .select("T_id", "T_val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                       )
    df_nodemap = (df_filter_creds.alias("A").join(df_childrens.alias("B"),
                                                  col("A.T_id") == col("B.parent_T_id"), "left")
                  .select("A.T_id", "B.children", "A.T_val", "A.T_Goal",
                          "A.parent_T_id", "A.parent_Goal", "A.trans_id")
                  )
    display(df_nodemap)
    # dfs.append(df_nodemap)
    # df = reduce(DataFrame.union, dfs)
    # display(df)
My problem: this is a bad design. df_trans has millions of rows, and looping through the DataFrame takes forever. Can I do this without looping? I tried a couple of other methods but could not get the desired result.
You certainly need to process the entire DataFrame in batch rather than iterating row by row.
The key point is to "reverse" df_hrrchy, i.e. from the parent lineage obtain the list of children for every T_Id:
val df_children = df_hrrchy.withColumn("children", slice($"Lineage", lit(1), size($"Lineage") - 1))
.withColumn("parents", slice($"Lineage", 2, 999999))
.select(explode(arrays_zip($"children", $"parents")).as("rels"))
.distinct
.groupBy($"rels.parents".as("T_Id"))
.agg(collect_set($"rels.children").as("children"))
df_children.show(false)
+-----+-----------------------------------+
|T_Id |children |
+-----+-----------------------------------+
|36091|[36976] |
|36465|[36448, 36380, 36326, 36879, 36683]|
|36976|[36465, 36908] |
|82 |[36091] |
|36908|[36949, 37349, 36121, 37026] |
+-----+-----------------------------------+
Then expand the list of T_Ids in df_trans and also include all T_Ids from the hierarchy:
val df_trans_map = df_trans.withColumn("T_Id", explode($"T_Id"))
.join(df_hrrchy, array_contains($"Lineage", $"T_Id"))
.select($"tranID", explode($"Lineage").as("T_Id"))
.distinct
df_trans_map.show(false)
+-------+-----+
|tranID |T_Id |
+-------+-----+
|1000540|36976|
|1000540|82 |
|1000540|36091|
|1000540|36465|
|1000540|36326|
|1000540|36121|
|1000540|36908|
|1000540|36380|
|1000540|36448|
|1000540|36683|
|1000540|36949|
|1000540|37349|
|1000540|37026|
|1000540|36879|
+-------+-----+
With this, it is just a simple join to obtain the final result:
df_trans_map.join(df_creds, Seq("T_Id"))
.join(df_children, Seq("T_Id"), "left_outer")
.show(false)
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
|T_Id |tranID |T_val|T_Goal|Parent_T_Id|Parent_Val|parent_Goal|children |
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
|36976|1000540|300 |2 |36091 |500 |19 |[36465, 36908] |
|36091|1000540|500 |19 |82 |600 |4 |[36976] |
|36465|1000540|200 |1 |36976 |300 |2 |[36448, 36380, 36326, 36879, 36683]|
|36326|1000540|90 |1 |36465 |200 |1 |null |
|36121|1000540|90 |1 |36908 |200 |1 |null |
|36908|1000540|200 |1 |36976 |300 |2 |[36949, 37349, 36121, 37026] |
|36380|1000540|90 |1 |36465 |200 |1 |null |
|36448|1000540|100 |1 |36465 |200 |1 |null |
|36683|1000540|90 |1 |36465 |200 |1 |null |
|36949|1000540|90 |1 |36908 |200 |1 |null |
|37349|1000540|100 |1 |36908 |200 |1 |null |
|37026|1000540|90 |1 |36908 |200 |1 |null |
|36879|1000540|90 |1 |36465 |200 |1 |null |
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
You need to rewrite this to use the full cluster; using a localIterator means that you aren't fully utilizing the cluster for shared work.
The code below was not run, as you didn't provide a workable data set to test with. If you do, I'll run the code to make sure it's sound.
from pyspark.sql import functions as F
from pyspark.sql import DataFrame
from pyspark.sql.functions import explode, collect_set, expr, col, collect_list, array_contains, lit
from functools import reduce

# uses explode -- I know this will create a lot of short-lived records, but the flip side
# is that it will use the entire cluster to complete the work instead of the driver.
df_trans_expld = df_trans.select(df_trans.tranID, explode(df_trans.T_Id).alias("T_Id"))

# uses explode
df_hrrchy_expld = df_hrrchy.select(df_hrrchy.lefId, explode(df_hrrchy.Lineage).alias("Lineage"))

# uses the exploded data to join, which is the same as a filter
df_hy_set = (df_trans_expld.join(df_hrrchy_expld,
                                 df_hrrchy_expld.lefId == df_trans_expld.T_Id, "left")
             .groupBy("tranID")
             .agg(collect_set(col("Lineage")).alias("hierarchy_list"))
             .withColumnRenamed("tranID", "trans_id"))

# logic unchanged from here down
df_childrens = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                .select("T_id", "T_Val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                .groupBy("parent_T_id").agg(collect_list("T_id").alias("children"))
                )
df_filter_creds = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                   .select("T_id", "T_val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                   )
df_nodemap = (df_filter_creds.alias("A")
              .join(df_childrens.alias("B"), col("A.T_id") == col("B.parent_T_id"), "left")
              .select("A.T_id", "B.children", "A.T_val", "A.T_Goal",
                      "A.parent_T_id", "A.parent_Goal", "A.trans_id")
              )
# no need to append/union data, as it's now just one DataFrame: df_nodemap
I'd have to look into this more, but I'm pretty sure your existing code pulls all the data through the driver, which will really slow things down; this version makes use of all executors to complete the work.
There may be another optimization to get rid of the array_contains (and use a join instead). I'd have to look at the explain plan to see if you could get even more performance out of it. I don't remember off the top of my head; since you are avoiding a shuffle, it may be better as is.
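As a rough, unbenchmarked sketch of that idea (reusing the column names from the code above, i.e. hierarchy_list, trans_id, T_id), you could explode hierarchy_list and turn the array_contains predicate join into a plain equi-join:
from pyspark.sql import functions as F

# Hypothetical sketch: replace the array_contains predicate join with an
# equi-join on an exploded hierarchy list, so Spark can pick a hash/sort-merge
# join instead of evaluating array_contains row by row.
df_hier_rows = df_hy_set.select(
    "trans_id",
    F.explode("hierarchy_list").alias("T_id")   # one row per hierarchy member
)

# Equivalent of df_filter_creds above, but as an equi-join on T_id.
df_filter_creds = df_creds.join(df_hier_rows, on="T_id", how="inner")

# Children per parent, same grouping as before.
df_childrens = (df_filter_creds
                .groupBy("parent_T_id")
                .agg(F.collect_list("T_id").alias("children")))
Checking the physical plan (df_filter_creds.explain()) would tell you whether this actually beats the array_contains version, since the equi-join introduces a shuffle that the array_contains join avoids.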

MariaDB - Conjunction-Search in Many-to-Many

I am having problems implementing an "and-concatenated" (conjunction) search with many-to-many tables. I have tried to present a simple example below. I use MariaDB.
I have a table with processes. To a process I can assign persons and tags. There is a table for tags and a table for persons.
There are two many-to-many relationships: tags_to_processes and persons_to_processes.
Example: find all processes with person 1 and person 2 and with tags 1 and 2. Result: Process 1.
Example: find all processes with person 1 and person 2 and with tag 2. Result: Process 1 and Process 2.
Thank you very much!
'processes' Table
+-----------+-------------------+
|process_id |process_name |
+-----------+-------------------+
|1 |Process 1 |
|2 |Process 2 |
|3 |Process 3 |
+-----------+-------------------+
'persons' table
+----------+------------+
|person_id |person_name |
+----------+------------+
|1 |Person 1 |
|2 |Person 2 |
|3 |Person 3 |
|4 |Person 4 |
|5 |Person 5 |
+----------+------------+
'tags' table
+----------+-----------+
|tag_id |tag_name |
+----------+-----------+
|1 |Tag 1 |
|2 |Tag 2 |
|3 |Tag 3 |
|4 |Tag 4 |
|5 |Tag 5 |
|6 |Tag 6 |
+----------+-----------+
'persons_to_processes' table
+----------+-----------+
|person_id |process_id |
+----------+-----------+
|1 |1 |
|2 |1 |
|3 |1 |
|4 |1 |
|5 |1 |
|1 |2 |
|2 |2 |
|4 |3 |
+----------+-----------+
'tags_to_processes' table
+----------+-----------+
|tag_id |process_id |
+----------+-----------+
|1 |1 |
|2 |1 |
|3 |1 |
|6 |1 |
|2 |2 |
|2 |3 |
+----------+-----------+
You can join persons_to_processes to persons, filter the results for the persons that you want, and use aggregation:
SELECT ptp.process_id
FROM persons_to_processes ptp INNER JOIN persons p
ON p.person_id = ptp.person_id
WHERE p.person_name IN ('Person 1', 'Person 2')
GROUP BY ptp.process_id
HAVING COUNT(*) = 2 -- 2 persons
Similarly for the tables tags_to_processes and tags:
SELECT ttp.process_id
FROM tags_to_processes ttp INNER JOIN tags t
ON t.tag_id = ttp.tag_id
WHERE t.tag_name IN ('Tag 1', 'Tag 2')
GROUP BY ttp.process_id
HAVING COUNT(*) = 2 -- 2 tags
Finally, you can combine the 2 queries to get their common results with INTERSECT:
WITH
cte1 AS (
SELECT ptp.process_id
FROM persons_to_processes ptp INNER JOIN persons p
ON p.person_id = ptp.person_id
WHERE p.person_name IN ('Person 1', 'Person 2')
GROUP BY ptp.process_id
HAVING COUNT(*) = 2 -- 2 persons
),
cte2 AS (
SELECT ttp.process_id
FROM tags_to_processes ttp INNER JOIN tags t
ON t.tag_id = ttp.tag_id
WHERE t.tag_name IN ('Tag 1', 'Tag 2')
GROUP BY ttp.process_id
HAVING COUNT(*) = 2 -- 2 tags
)
SELECT process_id FROM cte1
INTERSECT
SELECT process_id FROM cte2;

Find new, left and existing records in PySpark

I have a dataframe with data like this
I want to compare each record with the next year's record and see if that id is there. If it's there in the year and in the next year, it's 'Existing'. If it's there in the year and not in the next year, it's 'Left'. If it's not there in the year but is there in the next year, it's 'New'. I want output like this below. The columns 2017-18, 2018-19, etc. should be created dynamically.
How do I achieve this?
After getting the data in the above format, I need to aggregate the sales for each year band as below. For example, for 2017-2018:
New_sales = sum of all sales of 2018 (the later year in 2017-2018) where it's marked as 'New', which is 25 here.
Left_sales = sum of all sales of 2017 (the earlier year in 2017-2018) where it's marked as 'Left', which is 100 here.
Existing_sales = sum of sales of 2017 where it's marked as 'Existing' minus the sum of sales of 2018 where it's marked as 'Existing'.
Existing_sales = 50+75 (sales of 2017, 'Existing') - (20+50) (sales of 2018, 'Existing') = 125-70 = 55
How do I achieve this?
As your date is a string, I think you can do the following. Start from the sample data:
from pyspark.sql import functions as func

df = spark.createDataFrame(
    data=[
        (1, '31/12/2017'),
        (2, '31/12/2017'),
        (3, '31/12/2017'),
        (1, '31/12/2018'),
        (3, '31/12/2018'),
        (5, '31/12/2018'),
    ],
    schema=['id', 'date']
)
First, extract the year:
df2 = df.withColumn('year', func.split(func.col('Date'), '/').getItem(2))
df2.show(10, False)
+---+----------+----+
|id |date |year|
+---+----------+----+
|1 |31/12/2017|2017|
|2 |31/12/2017|2017|
|3 |31/12/2017|2017|
|1 |31/12/2018|2018|
|3 |31/12/2018|2018|
|5 |31/12/2018|2018|
+---+----------+----+
Then you can collect the set of years for each id as a reference:
df3 = df2.groupby('id')\
.agg(func.collect_set('year').alias('year_lst'))
df3.show(3, False)
+---+------------+
|id |year_lst |
+---+------------+
|1 |[2017, 2018]|
|2 |[2017] |
|3 |[2017, 2018]|
+---+------------+
Then you can join the reference back to the data:
df4 = df2.join(df3, on='id', how='left')
df4.show(10, False)
+---+----------+----+------------+
|id |date |year|year_lst |
+---+----------+----+------------+
|1 |31/12/2017|2017|[2017, 2018]|
|2 |31/12/2017|2017|[2017] |
|3 |31/12/2017|2017|[2017, 2018]|
|1 |31/12/2018|2018|[2017, 2018]|
|3 |31/12/2018|2018|[2017, 2018]|
|5 |31/12/2018|2018|[2018] |
+---+----------+----+------------+
The last step is to create the columns dynamically. I think you can use a for loop:
year_loop = ['2017', '2018', '2019', '2020', '2021']

for idx in range(len(year_loop)-1):
    this_year = year_loop[idx]
    next_year = year_loop[idx+1]
    column_name = f"{this_year}-{next_year}"

    new_condition = (~func.array_contains(func.col('year_lst'), this_year)) & (func.array_contains(func.col('year_lst'), next_year))
    exist_condition = (func.array_contains(func.col('year_lst'), this_year)) & (func.array_contains(func.col('year_lst'), next_year))
    left_condition = (func.array_contains(func.col('year_lst'), this_year)) & (~func.array_contains(func.col('year_lst'), next_year))

    df4 = df4.withColumn(column_name, func.when(new_condition, func.lit('New'))
                                          .when(exist_condition, func.lit('Existing'))
                                          .when(left_condition, func.lit('Left')))

df4.show(10, False)
+---+----------+----+------------+---------+---------+---------+---------+
|id |date |year|year_lst |2017-2018|2018-2019|2019-2020|2020-2021|
+---+----------+----+------------+---------+---------+---------+---------+
|1 |31/12/2017|2017|[2017, 2018]|Existing |Left |null |null |
|2 |31/12/2017|2017|[2017] |Left |null |null |null |
|3 |31/12/2017|2017|[2017, 2018]|Existing |Left |null |null |
|1 |31/12/2018|2018|[2017, 2018]|Existing |Left |null |null |
|3 |31/12/2018|2018|[2017, 2018]|Existing |Left |null |null |
|5 |31/12/2018|2018|[2018] |New |Left |null |null |
+---+----------+----+------------+---------+---------+---------+---------+
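For the second part of the question (the per-band sales aggregation), here is a minimal sketch, assuming df4 above additionally carries a numeric sales column as in your original data (it is not part of the sample data used in this answer):
from pyspark.sql import functions as func

# Sketch: per year band, sum sales by status following the rules in the
# question (New -> later year, Left -> earlier year,
# Existing -> earlier-year sales minus later-year sales).
agg_exprs = []
for idx in range(len(year_loop) - 1):
    this_year, next_year = year_loop[idx], year_loop[idx + 1]
    band = f"{this_year}-{next_year}"
    agg_exprs += [
        func.sum(func.when((func.col(band) == 'New') & (func.col('year') == next_year),
                           func.col('sales'))).alias(f'{band}_new_sales'),
        func.sum(func.when((func.col(band) == 'Left') & (func.col('year') == this_year),
                           func.col('sales'))).alias(f'{band}_left_sales'),
        (func.sum(func.when((func.col(band) == 'Existing') & (func.col('year') == this_year),
                            func.col('sales')))
         - func.sum(func.when((func.col(band) == 'Existing') & (func.col('year') == next_year),
                              func.col('sales')))).alias(f'{band}_existing_sales'),
    ]

df4.agg(*agg_exprs).show(truncate=False)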

Trying to unwind two or more arrays in OrientDB

I'm using OrientDB's UI/Query tool to analyze some graph data, and I've spent a couple of days unsuccessfully trying to unwind two arrays.
The unwind clause works just fine for one array but I can't seem to get the output I'm looking for when trying to unwind two arrays.
Here's a simplified example of my data:
#class      | amt | storeID | customerID
transaction | $4  | 1       | 1
transaction | $2  | 1       | 1
transaction | $6  | 1       | 4
transaction | $3  | 1       | 4
transaction | $2  | 2       | 1
transaction | $7  | 2       | 1
transaction | $8  | 2       | 2
transaction | $3  | 2       | 2
transaction | $4  | 2       | 3
transaction | $9  | 2       | 3
transaction | $10 | 3       | 4
transaction | $3  | 3       | 4
transaction | $4  | 3       | 5
transaction | $10 | 3       | 5
Each customer is a document with the following information:
#class   | customerID | State
customer | 1          | NY
customer | 2          | NJ
customer | 3          | PA
customer | 4          | NY
customer | 5          | NY
Each store is a document with the following information:
#class | storeID | State | Zip
store  | 1       | NY    | 1
store  | 2       | NJ    | 3
store  | 3       | NY    | 2
Assuming I did not have storeID (nor wanted to create it), I want to recover a flattened table with the following distinct values: the name of the store, the city, the account numbers, and the total spent.
The query would hopefully generate something like the table below (for a given depth value).
State | Zip | customerID
NY    | 1   | 4
NY    | 1   | 5
NY    | 2   | 1
NY    | 2   | 4
NJ    | 3   | 1
NJ    | 3   | 2
NJ    | 3   | 3
I've tried various expand/flatten/unwind operations but I can't seem to get my query to work.
Here's the query I have that recovers the State and Zip as two arrays and flattens the customerID:
SELECT out().State AS State,
       out().Zip AS Zip,
       customerID
FROM ( SELECT EXPAND(IN())
       FROM ( TRAVERSE * FROM
              ( SELECT FROM transaction )
            )
     );
Which yields,
State            | Zip       | customerID
[NY, NY, NJ, NJ] | [1,1,2,2] | 1
[NY, NY, NJ, NJ] | [1,1,2,2] | 1
[NY, NY, PA, PA] | [1,1,3,3] | 4
[NY, NY, PA, PA] | [1,1,3,3] | 4
...              | ...       | ...
Which is not what I'm looking for. Can someone provide a little help on how I can flatten/unwind these two arrays all together?
I tried your case with this structure (based on your example):
I used these queries to retrieve State, Zip and customerID (not as arrays):
Query 1:
SELECT State, Zip, in('transaction').customerID AS customerID FROM Store
ORDER BY Zip UNWIND customerID
----+------+-----+----+----------
# |#CLASS|State|Zip |customerID
----+------+-----+----+----------
0 |null |NY |1 |1
1 |null |NY |1 |1
2 |null |NY |1 |4
3 |null |NY |1 |4
4 |null |NY |2 |4
5 |null |NY |2 |4
6 |null |NY |2 |5
7 |null |NY |2 |5
8 |null |NJ |3 |1
9 |null |NJ |3 |1
10 |null |NJ |3 |2
11 |null |NJ |3 |2
12 |null |NJ |3 |3
13 |null |NJ |3 |3
----+------+-----+----+----------
Query 2:
SELECT inV('transaction').State AS State, inV('transaction').Zip AS Zip,
outV('transaction').customerID AS customerID FROM transaction ORDER BY Zip
----+------+-----+----+----------
# |#CLASS|State|Zip |customerID
----+------+-----+----+----------
0 |null |NY |1 |1
1 |null |NY |1 |1
2 |null |NY |1 |4
3 |null |NY |1 |4
4 |null |NY |2 |4
5 |null |NY |2 |4
6 |null |NY |2 |5
7 |null |NY |2 |5
8 |null |NJ |3 |1
9 |null |NJ |3 |1
10 |null |NJ |3 |2
11 |null |NJ |3 |2
12 |null |NJ |3 |3
13 |null |NJ |3 |3
----+------+-----+----+----------
EDITED
With the following query you can retrieve the average and the total spent for every customerID/storeID pair:
SELECT customerID, storeID, avg(amt) AS averagePerStore, sum(amt) AS totalPerStore
FROM transaction GROUP BY customerID,storeID ORDER BY customerID
----+------+----------+-------+---------------+-------------
# |#CLASS|customerID|storeID|averagePerStore|totalPerStore
----+------+----------+-------+---------------+-------------
0 |null |1 |1 |3.0 |6.0
1 |null |1 |2 |4.5 |9.0
2 |null |2 |2 |5.5 |11.0
3 |null |3 |2 |6.5 |13.0
4 |null |4 |1 |4.5 |9.0
5 |null |4 |3 |6.5 |13.0
6 |null |5 |3 |7.0 |14.0
----+------+----------+-------+---------------+-------------
Hope it helps

PostgreSQL: Free time slot algorithm

I have a table with some time slots in it, example:
|#id | datet               | userid | agentid | duration
+========================================================+
|1   | 2013-08-20 08:00:00 | -1     | 3       | 5
|2   | 2013-08-20 08:05:00 | -1     | 3       | 5
|3   | 2013-08-20 08:10:00 |  3     | 3       | 5
|4   | 2013-08-20 08:15:00 | -1     | 3       | 5
|5   | 2013-08-20 08:20:00 | -1     | 3       | 5
|6   | 2013-08-20 08:25:00 | -1     | 3       | 5
|7   | 2013-08-20 08:30:00 | -1     | 3       | 5
|8   | 2013-08-20 08:05:00 | -1     | 7       | 15
|9   | 2013-08-20 08:20:00 | -1     | 7       | 15
+========================================================+
In the above example, the user with id 3 has a slot at 8:10 (if userid = -1, it means it is a free slot). He has an appointment with agent 3. Now user 3 would like another time slot, but this time with agent 7. So the algorithm should keep only the free slots for agent 7 that don't overlap with his existing appointments. In this case that means only the 9th record is a solution (but in another case there could be multiple solutions). Another thing: a user can only have one appointment with the same agent.
Any ideas how to implement this? I was thinking of using the OVERLAPS operator, but I can't figure out how to do it.
Try something like:
select *
from time_slots ts
where ts.agentid = 7      -- or any agent
  and ts.userid = -1      -- it is free
  and not exists (        -- and an overlapping booked slot does not exist
        select 1
        from time_slots ts_2
        where ts_2.userid <> -1   -- not free
          -- duration is in minutes in the sample data
          and (ts.datet,   ts.datet   + interval '1 minute' * ts.duration) OVERLAPS
              (ts_2.datet, ts_2.datet + interval '1 minute' * ts_2.duration)
      );