OSError: [Errno 99] Cannot assign requested address on pyspark streaming - pyspark

I'm trying to run the below code to calculate the time difference between two timestamp columns of a PySpark streaming dataframe, and I'm getting this error:
ERROR Utils: Aborting task
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/lib64/python3.7/site-packages/thriftpy2/transport/socket.py", line 96, in open
self.sock.connect(addr)
OSError: [Errno 99] Cannot assign requested address
Code -
from pyspark.sql.functions import col, datediff

parsed_df = parsed_df.withColumn('time', datediff(col('last_transaction_date_ts'), col('transaction_dt_ts')))
Dataframe -
+----------------+-------------+-------+---------------+--------+-------------------+-----+-------------+-----------+------------------------+
|card_id |member_id |amount |pos_id |postcode|transaction_dt_ts |score|last_postcode|UCL |last_transaction_date_ts|
+----------------+-------------+-------+---------------+--------+-------------------+-----+-------------+-----------+------------------------+
|348702330256514 |37495066290 |4380912|248063406800722|96774 |2018-03-01 08:24:29|339 |33946 |13011526.35|2018-02-11 00:00:00 |
|348702330256514 |37495066290 |6703385|786562777140812|84758 |2018-06-02 04:15:03|339 |33946 |13011526.35|2018-02-11 00:00:00 |
|348702330256514 |37495066290 |7454328|466952571393508|93645 |2018-02-12 09:56:42|339 |33946 |13011526.35|2018-02-11 00:00:00 |
|348702330256514 |37495066290 |4013428|45845320330319 |15868 |2018-06-13 05:38:54|339 |33946 |13011526.35|2018-02-11 00:00:00 |
|348702330256514 |37495066290 |5495353|545499621965697|79033 |2018-06-16 21:51:54|339 |33946 |13011526.35|2018-02-11 00:00:00 |
|348702330256514 |37495066290 |3966214|369266342272501|22832 |2018-10-21 03:52:51|339 |33946 |13011526.35|2018-02-11 00:00:00 |
|348702330256514 |37495066290 |1753644|9475029292671 |17923 |2018-08-23 00:11:30|339 |33946 |13011526.35|2018-02-11 00:00:00 |
|348702330256514 |37495066290 |1692115|27647525195860 |55708 |2018-11-23 17:02:39|339 |33946 |13011526.35|2018-02-11 00:00:00 |
|5189563368503974|117826301530 |9222134|525701337355194|64002 |2018-03-01 20:22:10|289 |14894 |11249668.43|2018-02-01 04:58:38 |
|5189563368503974|117826301530 |4133848|182031383443115|26346 |2018-09-09 01:52:32|289 |14894 |11249668.43|2018-02-01 04:58:38 |
|5189563368503974|117826301530 |8938921|799748246411019|76934 |2018-12-09 05:20:53|289 |14894 |11249668.43|2018-02-01 04:58:38 |
|5189563368503974|117826301530 |1786366|131276818071265|63431 |2018-08-12 14:29:38|289 |14894 |11249668.43|2018-02-01 04:58:38 |
|5189563368503974|117826301530 |9142237|564240259678903|50635 |2018-06-16 19:37:19|289 |14894 |11249668.43|2018-02-01 04:58:38 |
|5407073344486464|1147922084344|6885448|887913906711117|59031 |2018-05-05 07:53:53|393 |63770 |11892709.56|2018-01-18 10:55:30 |
|5407073344486464|1147922084344|4028209|116266051118182|80118 |2018-08-11 01:06:50|393 |63770 |11892709.56|2018-01-18 10:55:30 |
|5407073344486464|1147922084344|3858369|896105817613325|53820 |2018-07-12 17:37:26|393 |63770 |11892709.56|2018-01-18 10:55:30 |
|5407073344486464|1147922084344|9307733|729374116016479|14898 |2018-07-13 04:50:16|393 |63770 |11892709.56|2018-01-18 10:55:30 |
|5407073344486464|1147922084344|4011296|543373367319647|44028 |2018-10-17 13:09:34|393 |63770 |11892709.56|2018-01-18 10:55:30 |
|5407073344486464|1147922084344|9492531|211980095659371|49453 |2018-04-21 14:12:26|393 |63770 |11892709.56|2018-01-18 10:55:30 |
|5407073344486464|1147922084344|7550074|345533088112099|15030 |2018-09-29 02:34:52|393 |63770 |11892709.56|2018-01-18 10:55:30 |
+----------------+-------------+-------+---------------+--------+-------------------+-----+-------------+-----------+------------------------+
Note - I'm running this on an EMR cluster.
I tried this with a UDF as well, but still no luck:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def time_cal(last_date, curr_date):
    # difference between the two timestamps, returned in hours
    diff = curr_date - last_date
    return diff.total_seconds() / 3600

time_udf = udf(time_cal, DoubleType())
parsed_df = parsed_df.withColumn('time_taken', time_udf(parsed_df.last_transaction_date_ts, parsed_df.transaction_dt_ts))
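For reference, the hour gap can also be computed with built-in column expressions instead of a Python UDF; below is a minimal sketch using the column names from the dataframe above (untested here, and note that the traceback itself points at thriftpy2's socket transport, so the failure may come from a connection made elsewhere in the job rather than from this line):

from pyspark.sql.functions import col

# Casting a timestamp to long gives epoch seconds, so the difference divided by 3600 is hours.
parsed_df = parsed_df.withColumn(
    'time_taken',
    (col('transaction_dt_ts').cast('long') - col('last_transaction_date_ts').cast('long')) / 3600.0
)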

Related

How to do a yearly comparison in Spark Scala

I have a dataframe which contains columns like Month, Fruit and Qty, as you can see in the table below:
| Month | Fruit | Qty |
| -------- | ------ | ------ |
| 2021-01 | orange | 5223 |
| 2021-02 | orange | 23 |
| ...... | ..... | ..... |
| 2022-01 | orange | 2342 |
| 2022-02 | orange | 37667 |
I need to sum the Qty grouped by Fruit for each year. My output DF will be the table below:
| Year | Fruit | sum_of_qty_This_year | sum_of_qty_previous_year |
| ---- | -------- | --------------------- | -------------------------- |
| 2022 | orange | 29384 | 34534 |
| 2021 | orange | 34534 | 93584 |
But there is a catch here; consider the table below.
| current year  | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
|---------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| previous year | jan | feb |     | apr | may | jun | jul | aug |     | oct | nov | dec |
As you can see, the data for mar and sep is missing in the previous year. So when we calculate the sum for the current year, Qty should exclude those missing months, and this should be done for each year.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df1 = Seq(
  ("2021-01", "orange", 5223),
  ("2021-02", "orange", 23),
  ("2022-01", "orange", 2342),
  ("2022-02", "orange", 37667),
  ("2022-03", "orange", 50000)
).toDF("Month", "Fruit", "Qty")

val currentYear = 2022
val priorYear = 2021

val currentYearDF = df1
  .filter(col("Month").substr(1, 4) === currentYear)

val priorYearDF = df1
  .filter(col("Month").substr(1, 4) === priorYear)
  .withColumnRenamed("Month", "MonthP")
  .withColumnRenamed("Fruit", "FruitP")
  .withColumnRenamed("Qty", "QtyP")

// Inner join on fruit and calendar month: months with no prior-year match drop out.
val resDF = priorYearDF
  .join(
    currentYearDF,
    priorYearDF.col("FruitP") === currentYearDF.col("Fruit") &&
      priorYearDF.col("MonthP").substr(6, 2) === currentYearDF.col("Month").substr(6, 2)
  )
  .select(
    currentYearDF.col("Fruit").as("Fruit"),
    currentYearDF.col("Qty").as("CurrentYearQty"),
    priorYearDF.col("QtyP").as("PriorYearQty")
  )
  .groupBy("Fruit")
  .agg(
    sum("CurrentYearQty").as("sum_of_qty_This_year"),
    sum("PriorYearQty").as("sum_of_qty_previous_year")
  )

resDF.show(false)
// +------+--------------------+------------------------+
// |Fruit |sum_of_qty_This_year|sum_of_qty_previous_year|
// +------+--------------------+------------------------+
// |orange|40009 |5246 |
// +------+--------------------+------------------------+
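For comparison, here is a rough PySpark sketch of the same month-matching inner join (the session setup is illustrative, not part of the original answer). Because 2022-03 has no 2021-03 counterpart, its 50000 drops out of both sums, which is how the 40009 / 5246 result above arises:

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring, col, sum as spark_sum

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("2021-01", "orange", 5223), ("2021-02", "orange", 23),
     ("2022-01", "orange", 2342), ("2022-02", "orange", 37667),
     ("2022-03", "orange", 50000)],
    ["Month", "Fruit", "Qty"])

current = df1.filter(substring("Month", 1, 4) == "2022")
prior = (df1.filter(substring("Month", 1, 4) == "2021")
            .select(col("Month").alias("MonthP"),
                    col("Fruit").alias("FruitP"),
                    col("Qty").alias("QtyP")))

# Inner join on fruit and calendar month: months with no prior-year match are dropped.
res = (current
       .join(prior, (current["Fruit"] == prior["FruitP"]) &
                    (substring(current["Month"], 6, 2) == substring(prior["MonthP"], 6, 2)))
       .groupBy("Fruit")
       .agg(spark_sum("Qty").alias("sum_of_qty_This_year"),
            spark_sum("QtyP").alias("sum_of_qty_previous_year")))
res.show(truncate=False)  # orange | 40009 | 5246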

How to correctly join 2 dataframes in Apache Spark?

I am new to Apache Spark and need some help. Can someone tell me how to correctly join the following 2 dataframes?
First dataframe:
| DATE_TIME | PHONE_NUMBER |
|---------------------|--------------|
| 2019-01-01 00:00:00 | 7056589658 |
| 2019-02-02 00:00:00 | 7778965896 |
Second dataframe:
| DATE_TIME | IP |
|---------------------|---------------|
| 2019-01-01 01:00:00 | 194.67.45.126 |
| 2019-02-02 00:00:00 | 102.85.62.100 |
| 2019-03-03 03:00:00 | 102.85.62.100 |
Final dataframe which I want:
| DATE_TIME | PHONE_NUMBER | IP |
|---------------------|--------------|---------------|
| 2019-01-01 00:00:00 | 7056589658 | |
| 2019-01-01 01:00:00 | | 194.67.45.126 |
| 2019-02-02 00:00:00 | 7778965896 | 102.85.62.100 |
| 2019-03-03 03:00:00 | | 102.85.62.100 |
Here is the code which I tried:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df1 = Seq(
  ("2019-01-01 00:00:00", "7056589658"),
  ("2019-02-02 00:00:00", "7778965896")
).toDF("DATE_TIME", "PHONE_NUMBER")
df1.show()

val df2 = Seq(
  ("2019-01-01 01:00:00", "194.67.45.126"),
  ("2019-02-02 00:00:00", "102.85.62.100"),
  ("2019-03-03 03:00:00", "102.85.62.100")
).toDF("DATE_TIME", "IP")
df2.show()

val total = df1.join(df2, Seq("DATE_TIME"), "left_outer")
total.show()
Unfortunately, it raises an error:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:367)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:140)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:135)
...
You need a full outer join, but your code is otherwise fine. Your issue might be something else, but from the stack trace alone it is hard to conclude what the issue is.
val total = df1.join(df2, Seq("DATE_TIME"), "full_outer")
You can do this:
val total = df1.join(df2, (df1("DATE_TIME") === df2("DATE_TIME")), "left_outer")
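For reference, the full outer join suggested above looks like this in PySpark (a sketch; passing the join key as a list merges the shared DATE_TIME column, which matches the shape of the desired output):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("2019-01-01 00:00:00", "7056589658"),
     ("2019-02-02 00:00:00", "7778965896")],
    ["DATE_TIME", "PHONE_NUMBER"])
df2 = spark.createDataFrame(
    [("2019-01-01 01:00:00", "194.67.45.126"),
     ("2019-02-02 00:00:00", "102.85.62.100"),
     ("2019-03-03 03:00:00", "102.85.62.100")],
    ["DATE_TIME", "IP"])

# full_outer keeps unmatched rows from both sides, filling the missing side with null
total = df1.join(df2, ["DATE_TIME"], "full_outer").orderBy("DATE_TIME")
total.show()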

Window function to achieve running sum that resets in Postgres SQL [duplicate]

I wrote a query that creates two columns: the_day, and the amount_raised on that day. Here is what I have:
And I would like to add a column that has a running sum of amount_raised:
Ultimately, I would like the sum column to reset after it reaches 1 million.
The recursive approach is above my pay grade, so if anyone knows a way to reset the sum without creating an entirely new table, please comment (maybe with a RESET function?). Thank you
I'd like to thank Juan Carlos Oropeza for providing a script and SQLFiddle with the test data. George, you should have done that.
The query itself is rather simple.
First, calculate a simple running sum (CTE_RunningSum) and divide it by 1,000,000 to get the number of whole millions.
Then calculate the running sum again, partitioned by that number of millions.
SQL Fiddle
I included the columns RunningSum and Millions in the final result to illustrate how the query works.
WITH CTE_RunningSum AS
(
    SELECT
        id
        ,day_t
        ,collect
        ,SUM(collect) OVER (ORDER BY day_t, id) AS RunningSum
        ,(SUM(collect) OVER (ORDER BY day_t, id)) / 1000000 AS Millions
    FROM "myTable"
)
SELECT
    id
    ,day_t
    ,collect
    ,RunningSum
    ,Millions
    ,SUM(collect) OVER (PARTITION BY Millions ORDER BY day_t, id) AS Result
FROM CTE_RunningSum
ORDER BY day_t, id;
Result
| id | day_t | collect | runningsum | millions | result |
|-----|-----------------------------|---------|------------|----------|---------|
| 90 | March, 11 2015 00:00:00 | 69880 | 69880 | 0 | 69880 |
| 13 | March, 25 2015 00:00:00 | 69484 | 139364 | 0 | 139364 |
| 49 | March, 27 2015 00:00:00 | 57412 | 196776 | 0 | 196776 |
| 41 | March, 30 2015 00:00:00 | 56404 | 253180 | 0 | 253180 |
| 99 | April, 03 2015 00:00:00 | 59426 | 312606 | 0 | 312606 |
| 1 | April, 10 2015 00:00:00 | 65825 | 378431 | 0 | 378431 |
| 100 | April, 27 2015 00:00:00 | 60884 | 439315 | 0 | 439315 |
| 50 | May, 11 2015 00:00:00 | 39641 | 478956 | 0 | 478956 |
| 58 | May, 11 2015 00:00:00 | 49759 | 528715 | 0 | 528715 |
| 51 | May, 17 2015 00:00:00 | 32895 | 561610 | 0 | 561610 |
| 15 | May, 19 2015 00:00:00 | 50847 | 612457 | 0 | 612457 |
| 66 | May, 29 2015 00:00:00 | 66332 | 678789 | 0 | 678789 |
| 4 | June, 04 2015 00:00:00 | 46891 | 725680 | 0 | 725680 |
| 38 | June, 09 2015 00:00:00 | 64732 | 790412 | 0 | 790412 |
| 79 | June, 14 2015 00:00:00 | 62843 | 853255 | 0 | 853255 |
| 37 | June, 28 2015 00:00:00 | 54315 | 907570 | 0 | 907570 |
| 59 | June, 30 2015 00:00:00 | 34885 | 942455 | 0 | 942455 |
| 71 | July, 08 2015 00:00:00 | 46440 | 988895 | 0 | 988895 |
| 31 | July, 10 2015 00:00:00 | 39649 | 1028544 | 1 | 39649 |
| 91 | July, 12 2015 00:00:00 | 65048 | 1093592 | 1 | 104697 |
| 57 | July, 14 2015 00:00:00 | 60394 | 1153986 | 1 | 165091 |
| 98 | July, 20 2015 00:00:00 | 34481 | 1188467 | 1 | 199572 |
| 3 | July, 26 2015 00:00:00 | 58672 | 1247139 | 1 | 258244 |
| 95 | August, 19 2015 00:00:00 | 52393 | 1299532 | 1 | 310637 |
| 74 | August, 20 2015 00:00:00 | 37972 | 1337504 | 1 | 348609 |
| 20 | August, 27 2015 00:00:00 | 36882 | 1374386 | 1 | 385491 |
| 2 | September, 07 2015 00:00:00 | 39408 | 1413794 | 1 | 424899 |
| 14 | September, 09 2015 00:00:00 | 40234 | 1454028 | 1 | 465133 |
| 6 | September, 17 2015 00:00:00 | 65957 | 1519985 | 1 | 531090 |
| 93 | September, 29 2015 00:00:00 | 47213 | 1567198 | 1 | 578303 |
| 35 | September, 30 2015 00:00:00 | 49446 | 1616644 | 1 | 627749 |
| 86 | October, 11 2015 00:00:00 | 34291 | 1650935 | 1 | 662040 |
| 75 | October, 12 2015 00:00:00 | 31448 | 1682383 | 1 | 693488 |
| 19 | October, 14 2015 00:00:00 | 48509 | 1730892 | 1 | 741997 |
| 56 | October, 26 2015 00:00:00 | 30072 | 1760964 | 1 | 772069 |
| 48 | October, 28 2015 00:00:00 | 58527 | 1819491 | 1 | 830596 |
| 40 | November, 05 2015 00:00:00 | 67293 | 1886784 | 1 | 897889 |
| 33 | November, 09 2015 00:00:00 | 41944 | 1928728 | 1 | 939833 |
| 34 | November, 11 2015 00:00:00 | 35516 | 1964244 | 1 | 975349 |
| 85 | November, 20 2015 00:00:00 | 43920 | 2008164 | 2 | 43920 |
| 18 | November, 23 2015 00:00:00 | 44925 | 2053089 | 2 | 88845 |
| 62 | December, 24 2015 00:00:00 | 34678 | 2087767 | 2 | 123523 |
| 67 | December, 25 2015 00:00:00 | 35323 | 2123090 | 2 | 158846 |
| 81 | December, 28 2015 00:00:00 | 37071 | 2160161 | 2 | 195917 |
| 54 | January, 02 2016 00:00:00 | 32330 | 2192491 | 2 | 228247 |
| 70 | January, 06 2016 00:00:00 | 47875 | 2240366 | 2 | 276122 |
| 28 | January, 23 2016 00:00:00 | 40250 | 2280616 | 2 | 316372 |
| 65 | January, 25 2016 00:00:00 | 49404 | 2330020 | 2 | 365776 |
| 73 | January, 26 2016 00:00:00 | 65879 | 2395899 | 2 | 431655 |
| 5 | February, 05 2016 00:00:00 | 53953 | 2449852 | 2 | 485608 |
| 32 | February, 11 2016 00:00:00 | 44988 | 2494840 | 2 | 530596 |
| 53 | February, 25 2016 00:00:00 | 68948 | 2563788 | 2 | 599544 |
| 83 | March, 11 2016 00:00:00 | 47244 | 2611032 | 2 | 646788 |
| 8 | March, 25 2016 00:00:00 | 51809 | 2662841 | 2 | 698597 |
| 82 | March, 25 2016 00:00:00 | 66506 | 2729347 | 2 | 765103 |
| 88 | April, 06 2016 00:00:00 | 69288 | 2798635 | 2 | 834391 |
| 89 | April, 14 2016 00:00:00 | 43162 | 2841797 | 2 | 877553 |
| 52 | April, 23 2016 00:00:00 | 47772 | 2889569 | 2 | 925325 |
| 7 | April, 27 2016 00:00:00 | 33368 | 2922937 | 2 | 958693 |
| 84 | April, 27 2016 00:00:00 | 57644 | 2980581 | 2 | 1016337 |
| 17 | May, 17 2016 00:00:00 | 35416 | 3015997 | 3 | 35416 |
| 61 | May, 17 2016 00:00:00 | 64603 | 3080600 | 3 | 100019 |
| 87 | June, 07 2016 00:00:00 | 41865 | 3122465 | 3 | 141884 |
| 97 | June, 08 2016 00:00:00 | 64982 | 3187447 | 3 | 206866 |
| 92 | June, 15 2016 00:00:00 | 58684 | 3246131 | 3 | 265550 |
| 23 | June, 26 2016 00:00:00 | 46147 | 3292278 | 3 | 311697 |
| 46 | June, 30 2016 00:00:00 | 61921 | 3354199 | 3 | 373618 |
| 94 | July, 03 2016 00:00:00 | 55535 | 3409734 | 3 | 429153 |
| 60 | July, 07 2016 00:00:00 | 63607 | 3473341 | 3 | 492760 |
| 45 | July, 20 2016 00:00:00 | 51965 | 3525306 | 3 | 544725 |
| 96 | July, 20 2016 00:00:00 | 46684 | 3571990 | 3 | 591409 |
| 29 | August, 09 2016 00:00:00 | 37707 | 3609697 | 3 | 629116 |
| 69 | August, 11 2016 00:00:00 | 37194 | 3646891 | 3 | 666310 |
| 80 | August, 19 2016 00:00:00 | 62673 | 3709564 | 3 | 728983 |
| 36 | August, 28 2016 00:00:00 | 48237 | 3757801 | 3 | 777220 |
| 39 | August, 29 2016 00:00:00 | 48159 | 3805960 | 3 | 825379 |
| 25 | August, 30 2016 00:00:00 | 60958 | 3866918 | 3 | 886337 |
| 68 | September, 04 2016 00:00:00 | 50167 | 3917085 | 3 | 936504 |
| 55 | September, 08 2016 00:00:00 | 31193 | 3948278 | 3 | 967697 |
| 64 | September, 10 2016 00:00:00 | 31157 | 3979435 | 3 | 998854 |
| 42 | September, 14 2016 00:00:00 | 52878 | 4032313 | 4 | 52878 |
| 43 | September, 15 2016 00:00:00 | 54728 | 4087041 | 4 | 107606 |
| 77 | September, 18 2016 00:00:00 | 65320 | 4152361 | 4 | 172926 |
| 12 | September, 23 2016 00:00:00 | 43597 | 4195958 | 4 | 216523 |
| 30 | September, 26 2016 00:00:00 | 32764 | 4228722 | 4 | 249287 |
| 10 | September, 27 2016 00:00:00 | 47038 | 4275760 | 4 | 296325 |
| 47 | October, 08 2016 00:00:00 | 46280 | 4322040 | 4 | 342605 |
| 26 | October, 10 2016 00:00:00 | 69487 | 4391527 | 4 | 412092 |
| 63 | October, 30 2016 00:00:00 | 49561 | 4441088 | 4 | 461653 |
| 78 | November, 15 2016 00:00:00 | 40138 | 4481226 | 4 | 501791 |
| 27 | November, 27 2016 00:00:00 | 57378 | 4538604 | 4 | 559169 |
| 21 | December, 01 2016 00:00:00 | 35336 | 4573940 | 4 | 594505 |
| 16 | December, 03 2016 00:00:00 | 39671 | 4613611 | 4 | 634176 |
| 22 | December, 13 2016 00:00:00 | 34574 | 4648185 | 4 | 668750 |
| 72 | January, 29 2017 00:00:00 | 55084 | 4703269 | 4 | 723834 |
| 44 | January, 30 2017 00:00:00 | 36742 | 4740011 | 4 | 760576 |
| 24 | February, 01 2017 00:00:00 | 31061 | 4771072 | 4 | 791637 |
| 76 | February, 12 2017 00:00:00 | 35059 | 4806131 | 4 | 826696 |
| 9 | February, 27 2017 00:00:00 | 39767 | 4845898 | 4 | 866463 |
| 11 | February, 28 2017 00:00:00 | 66007 | 4911905 | 4 | 932470 |
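As a quick check of the bucketing: integer division of the running sum by 1,000,000 is what moves a row into a new Millions partition, so the first row past one million restarts the partitioned sum at its own amount:

print(988895 // 1000000)   # 0 -> the 2015-07-08 row is still in the first bucket
print(1028544 // 1000000)  # 1 -> the 2015-07-10 row opens a new bucket, so Result restarts at 39649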
I took another look and couldn't solve it with a window function, so I took the recursive approach.
SQL Fiddle Demo
Sample data: 100 rows with random dates between 2015 and 2017 and amounts between 10k and 70k.
DROP TABLE IF EXISTS "myTable";
CREATE TABLE "myTable" (
id SERIAL PRIMARY KEY,
day_t varchar(255),
collect integer NULL
);
INSERT INTO "myTable" (day_t,collect) VALUES ('2015-04-10',65825),('2015-09-07',39408),('2015-07-26',58672),('2015-06-04',46891),('2016-02-05',53953),('2015-09-17',65957),('2016-04-27',33368),('2016-03-25',51809),('2017-02-27',39767),('2016-09-27',47038);
INSERT INTO "myTable" (day_t,collect) VALUES ('2017-02-28',66007),('2016-09-23',43597),('2015-03-25',69484),('2015-09-09',40234),('2015-05-19',50847),('2016-12-03',39671),('2016-05-17',35416),('2015-11-23',44925),('2015-10-14',48509),('2015-08-27',36882);
INSERT INTO "myTable" (day_t,collect) VALUES ('2016-12-01',35336),('2016-12-13',34574),('2016-06-26',46147),('2017-02-01',31061),('2016-08-30',60958),('2016-10-10',69487),('2016-11-27',57378),('2016-01-23',40250),('2016-08-09',37707),('2016-09-26',32764);
INSERT INTO "myTable" (day_t,collect) VALUES ('2015-07-10',39649),('2016-02-11',44988),('2015-11-09',41944),('2015-11-11',35516),('2015-09-30',49446),('2016-08-28',48237),('2015-06-28',54315),('2015-06-09',64732),('2016-08-29',48159),('2015-11-05',67293);
INSERT INTO "myTable" (day_t,collect) VALUES ('2015-03-30',56404),('2016-09-14',52878),('2016-09-15',54728),('2017-01-30',36742),('2016-07-20',51965),('2016-06-30',61921),('2016-10-08',46280),('2015-10-28',58527),('2015-03-27',57412),('2015-05-11',39641);
INSERT INTO "myTable" (day_t,collect) VALUES ('2015-05-17',32895),('2016-04-23',47772),('2016-02-25',68948),('2016-01-02',32330),('2016-09-08',31193),('2015-10-26',30072),('2015-07-14',60394),('2015-05-11',49759),('2015-06-30',34885),('2016-07-07',63607);
INSERT INTO "myTable" (day_t,collect) VALUES ('2016-05-17',64603),('2015-12-24',34678),('2016-10-30',49561),('2016-09-10',31157),('2016-01-25',49404),('2015-05-29',66332),('2015-12-25',35323),('2016-09-04',50167),('2016-08-11',37194),('2016-01-06',47875);
INSERT INTO "myTable" (day_t,collect) VALUES ('2015-07-08',46440),('2017-01-29',55084),('2016-01-26',65879),('2015-08-20',37972),('2015-10-12',31448),('2017-02-12',35059),('2016-09-18',65320),('2016-11-15',40138),('2015-06-14',62843),('2016-08-19',62673);
INSERT INTO "myTable" (day_t,collect) VALUES ('2015-12-28',37071),('2016-03-25',66506),('2016-03-11',47244),('2016-04-27',57644),('2015-11-20',43920),('2015-10-11',34291),('2016-06-07',41865),('2016-04-06',69288),('2016-04-14',43162),('2015-03-11',69880);
INSERT INTO "myTable" (day_t,collect) VALUES ('2015-07-12',65048),('2016-06-15',58684),('2015-09-29',47213),('2016-07-03',55535),('2015-08-19',52393),('2016-07-20',46684),('2016-06-08',64982),('2015-07-20',34481),('2015-04-03',59426),('2015-04-27',60884);
Create a row_number, because the recursion needs consecutive IDs:
CREATE TABLE sortDates as
SELECT day_t,
collect,
row_number() over (order by day_t) rn
FROM "myTable";
Recursive Query
As you can see in the CASE expression, if the previous total (m.collect) is bigger than 1 million, the total is reset:
WITH RECURSIVE million(rn, day_t, collect) AS (
(
SELECT rn, day_t, collect
FROM sortDates
WHERE rn = 1
)
UNION
(
SELECT s.rn, s.day_t, CASE WHEN m.collect > 1000000 THEN s.collect
ELSE m.collect + s.collect
END as collect
FROM sortDates s
JOIN million m
ON s.rn = m.rn + 1
)
)
SELECT *
FROM million
WHERE collect > 1000000
Finally, just bring back the rows where you break the 1 million limit.
OUTPUT
| rn | day_t | collect |
|----|------------|---------|
| 19 | 2015-07-10 | 1028544 |
| 41 | 2015-11-23 | 1024545 |
| 62 | 2016-05-17 | 1027511 |
| 82 | 2016-09-15 | 1006441 |
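To make the recursion's reset rule concrete, here is a minimal Python sketch of the same carry-and-reset walk over rows already sorted by date (illustrative only, not part of the original answer):

def rows_breaking_million(rows, limit=1000000):
    # rows: iterable of (day_t, collect) pairs sorted by day_t (the sortDates table).
    # Mirrors the recursive CTE: the running total resets once the previous total
    # has exceeded the limit, and only the rows that break the limit are reported.
    total = 0
    breaks = []
    for day_t, collect in rows:
        total = collect if total > limit else total + collect
        if total > limit:
            breaks.append((day_t, total))
    return breaks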

Symfony2 Query to find last working date from Holiday Calendar

I have a calendar entity in my project which manages the open and close times of business days for the whole year.
Below are the records for a specific month:
id | today_date | year | month_of_year | day_of_month | is_business_day
-------+---------------------+------+---------------+-------------+---------------+
10103 | 2016-02-01 00:00:00 | 2016 | 2 | 1 | t
10104 | 2016-02-02 00:00:00 | 2016 | 2 | 2 | t
10105 | 2016-02-03 00:00:00 | 2016 | 2 | 3 | t
10106 | 2016-02-04 00:00:00 | 2016 | 2 | 4 | t
10107 | 2016-02-05 00:00:00 | 2016 | 2 | 5 | t
10108 | 2016-02-06 00:00:00 | 2016 | 2 | 6 | f
10109 | 2016-02-07 00:00:00 | 2016 | 2 | 7 | f
10110 | 2016-02-08 00:00:00 | 2016 | 2 | 8 | t
10111 | 2016-02-09 00:00:00 | 2016 | 2 | 9 | t
10112 | 2016-02-10 00:00:00 | 2016 | 2 | 10 | t
10113 | 2016-02-11 00:00:00 | 2016 | 2 | 11 | t
10114 | 2016-02-12 00:00:00 | 2016 | 2 | 12 | t
10115 | 2016-02-13 00:00:00 | 2016 | 2 | 13 | f
10116 | 2016-02-14 00:00:00 | 2016 | 2 | 14 | f
10117 | 2016-02-15 00:00:00 | 2016 | 2 | 15 | t
10118 | 2016-02-16 00:00:00 | 2016 | 2 | 16 | t
10119 | 2016-02-17 00:00:00 | 2016 | 2 | 17 | t
10120 | 2016-02-18 00:00:00 | 2016 | 2 | 18 | t
I want to get the today_date from 7 working days ago. Suppose today_date is 2016-02-18; then the date 7 working days back would be 2016-02-09.
You can use row_number() for this, like so:
SELECT *
FROM (
    SELECT t.*, row_number() OVER (ORDER BY today_date DESC) AS rnk
    FROM Calender t
    WHERE today_date <= current_date
      AND is_business_day = 't'
) b
WHERE b.rnk = 7
This will give you the row for the 7th business day back from today's date.
I see that you tagged your question with Doctrine, ORM and Datetime. Were you after a QueryBuilder solution? Maybe this is closer to what you want:
$qb->select('c.today_date')
    ->from(Calendar::class, 'c')
    ->where("c.today_date <= :today")
    ->andWhere("c.is_business_day = 't'")
    ->setMaxResults(7)
    ->orderBy("c.today_date", "DESC")
    ->setParameter('today', new \DateTime('now'), \Doctrine\DBAL\Types\Type::DATETIME);

postgres sql bucket values into generated time sequence

I am trying to transform data from a table of recorded events into a consistent 'daily half-hour view', i.e. 48 half-hour periods (padding out half hours with zero when there are no matching events). I have completed this with partial success.
SELECT t1.generate_series,
v1.begin_time,
v1.end_time,
v1.volume
FROM tbl_my_values v1
RIGHT JOIN ( SELECT generate_series.generate_series
FROM generate_series((to_char(now(), 'YYYY-MM-dd'::text) || ' 22:00'::text)::timestamp without time zone,
(to_char(now() + '1 day'::interval, 'YYYY-MM-dd'::text) || ' 22:00'::text)::timestamp without time zone, '00:30:00'::interval)
generate_series(generate_series)) t1 ON t1.generate_series = v1.begin_time
order by 1 ;
This provides the following results:
2015-12-19 22:00:00 | 2015-12-19 22:00:00+00 | 2015-12-19 23:00:00+00 | 172.10
2015-12-19 22:30:00 | | |
2015-12-19 23:00:00 | 2015-12-19 23:00:00+00 | 2015-12-20 00:00:00+00 | 243.60
2015-12-20 00:30:00 | | |
2015-12-20 01:00:00 | | |
However, based on the begin_time and end_time columns, the view should be:
2015-12-19 22:00:00 | 2015-12-19 22:00:00+00 | 2015-12-19 23:00:00+00 | 172.10
2015-12-19 22:30:00 | | | 172.10
2015-12-19 23:00:00 | 2015-12-19 23:00:00+00 | 2015-12-20 00:00:00+00 | 243.60
2015-12-20 00:30:00 | | | 243.60
2015-12-20 01:00:00 | | |
because the values in this example span 2 half hours, i.e. they are valid for one hour.
All help is very welcome. Thanks
Your ON clause only compares against begin_time. I think you want a range condition:
on t1.generate_series between v1.begin_time and v1.end_time