Spark dataframe transform in time window - scala

I have two dataframes. [AllAccounts]: contains audit for all accounts for all users
UserId, AccountId, Balance, CreatedOn
1, acc1, 200.01, 2016-12-06T17:09:36.123-05:00
1, acc2, 189.00, 2016-12-06T17:09:38.123-05:00
1, acc1, 700.01, 2016-12-07T17:09:36.123-05:00
1, acc2, 189.00, 2016-12-07T17:09:38.123-05:00
1, acc3, 010.01, 2016-12-07T17:09:39.123-05:00
1, acc1, 900.01, 2016-12-08T17:09:36.123-05:00
[ActiveAccounts]: contains audit for only the active account(could be zero or 1) for any user
UserId, AccountId, Balance, CreatedOn
1, acc2, 189.00, 2016-12-06T17:09:38.123-05:00
1, acc3, 010.01, 2016-12-07T17:09:39.123-05:00
I want to transform these into a single DF which is of the format
UserId, AccountId, Balance, CreatedOn, IsActive
1, acc1, 200.01, 2016-12-06T17:09:36.123-05:00, false
1, acc2, 189.00, 2016-12-06T17:09:38.123-05:00, true
1, acc1, 700.01, 2016-12-07T17:09:36.123-05:00, false
1, acc2, 189.00, 2016-12-07T17:09:38.123-05:00, true
1, acc3, 010.01, 2016-12-07T17:09:39.123-05:00, true
1, acc1, 900.01, 2016-12-08T17:09:36.123-05:00, false
So based on the accounts in ActiveAccounts, I need to flag the rows in the first DF appropriately. As in the example, acc2 for userId 1 was marked active on 2016-12-06T17:09:38.123-05:00 and acc3 was marked active on 2016-12-07T17:09:39.123-05:00. So between these times acc2 will be marked true, and from 2016-12-07T17:09:39 onwards acc3 will be marked true.
What would be an efficient way to do this?

If I understand properly, the account (1, acc2) is active between its creation time and that of (1, acc3).
We can do this in a few steps:
create a data frame with the start/end times for each account
join with AllAccounts
flag the rows of the resulting dataframe
I haven't tested this, so there may be syntax mistakes.
To accomplish the first task, we need to partition the dataframe by user and then look at the next creation time. This calls for a window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead
val window = Window.partitionBy("UserId").orderBy("StartTime")
val activeTimes = ActiveAccounts.withColumnRenamed("CreatedOn", "StartTime")
  .withColumn("EndTime", lead("StartTime", 1).over(window))
Note that the last EndTime for each user will be null. Now join:
val withActive = AllAccounts.join(activeTimes, Seq("UserId", "AccountId"))
(This should be a left join, i.e. pass "left" as a third argument, if some accounts have no active times, such as acc1 in the example. With an inner join those rows are dropped.)
Then you have to go through and flag the accounts as active:
val withFlags = withActive.withColumn("isActive",
  $"CreatedOn" >= $"StartTime" &&
  ($"EndTime".isNull || ($"CreatedOn" < $"EndTime")))
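The three steps above can also be checked without a cluster. The following is a minimal plain-Python sketch of the same interval logic, run on the sample rows from the question. The helper names `active_intervals` and `flag_rows` are made up for illustration, and rows with no matching interval are flagged False, which corresponds to the left-join variant.

```python
from datetime import datetime

# Sample rows from the question: (UserId, AccountId, Balance, CreatedOn)
all_accounts = [
    (1, "acc1", "200.01", "2016-12-06T17:09:36.123-05:00"),
    (1, "acc2", "189.00", "2016-12-06T17:09:38.123-05:00"),
    (1, "acc1", "700.01", "2016-12-07T17:09:36.123-05:00"),
    (1, "acc2", "189.00", "2016-12-07T17:09:38.123-05:00"),
    (1, "acc3", "010.01", "2016-12-07T17:09:39.123-05:00"),
    (1, "acc1", "900.01", "2016-12-08T17:09:36.123-05:00"),
]
# (UserId, AccountId, CreatedOn): the time each account became the active one
active_accounts = [
    (1, "acc2", "2016-12-06T17:09:38.123-05:00"),
    (1, "acc3", "2016-12-07T17:09:39.123-05:00"),
]


def active_intervals(active_rows):
    """Step 1: per user, pair each start time with the next start time (or None)."""
    by_user = {}
    for user, account, created in active_rows:
        by_user.setdefault(user, []).append((datetime.fromisoformat(created), account))
    intervals = []
    for user, events in by_user.items():
        events.sort()
        for i, (start, account) in enumerate(events):
            end = events[i + 1][0] if i + 1 < len(events) else None
            intervals.append((user, account, start, end))
    return intervals


def flag_rows(all_rows, intervals):
    """Steps 2 and 3: match on (user, account) and test start <= CreatedOn < end."""
    flagged = []
    for user, account, balance, created in all_rows:
        ts = datetime.fromisoformat(created)
        is_active = any(
            u == user and a == account and start <= ts and (end is None or ts < end)
            for u, a, start, end in intervals
        )
        flagged.append((user, account, balance, created, is_active))
    return flagged


flags = [row[4] for row in flag_rows(all_accounts, active_intervals(active_accounts))]
print(flags)
```

The flags come out in the same order as the desired output table in the question.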


PostgreSQL - Delete values with conditions

In a PostgreSQL table, I want to delete some values based on conditions. These conditions test the beginning of a field using substr. In my example, I want to delete the values not starting with '01', '02' or '03'. When I first run a SELECT on my values with the conditions, it's OK.
Sample data
id | serial_number
---------------------
1 | 01A
2 | 01B
3 | 02A
4 | 02B
5 | 03A
6 | 03B
7 | 03C
8 | 04A
9 | 05A
10 | 06A
Example of a selection
SELECT * FROM my_table
WHERE substr(serial_number, 1, 2) LIKE '01' OR substr(serial_number, 1, 2) LIKE '02' OR substr(serial_number, 1, 2) LIKE '03'
But when I apply the same conditions in a DELETE and add a negation, it removes everything. I need to change the operators from OR to AND.
Delete with same condition (not expected result)
DELETE FROM my_table
WHERE substr(serial_number, 1, 2) NOT LIKE '01' OR substr(serial_number, 1, 2) NOT LIKE '02' AND substr(serial_number, 1, 2) NOT LIKE '03'
Delete with different operators (expected result)
DELETE FROM my_table
WHERE substr(serial_number, 1, 2) NOT LIKE '01' AND substr(serial_number, 1, 2) NOT LIKE '02' AND substr(serial_number, 1, 2) NOT LIKE '03'
How can multiple conditions be true with an AND? It's not cumulative. I don't understand why I need to change the operators.
Some basic boolean math:
True OR False = True
True AND False = False
Consider a single record in your table with value 01A. You don't want to delete this since it's in your list 01, 02, 03.
Running your logic:
substr(serial_number, 1, 2) NOT LIKE '01' = False
substr(serial_number, 1, 2) NOT LIKE '02' = True
Taking that back to the Boolean logic above you'll see that False OR True = True which means we delete the record. However, switching to an AND means False AND True = False and your record 01A is retained as expected.
This is a common issue when dealing with negation and booleans, in that it doesn't match how we use OR in English.
Consider a store that sells gift cards in values of 20, 50 and 100 dollars:
"Can you go get a gift card for Chili's restaurant. I can't remember
what dollar amounts they sell, but I don't need the 20 dollar one or
the 50 dollar one"
We all understand perfectly well that they want neither the 20 nor the 50, but rather the 100 dollar gift card. But in boolean logic the person being asked would return with ALL the gift cards available, because 20 != 50, 50 != 20, and 100 is equal to neither. So the OR sees a TRUE for at least one of the conditions specified, regardless of the gift card amount.
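The rule behind this is De Morgan's law: NOT (a OR b OR c) is the same as (NOT a) AND (NOT b) AND (NOT c). A small Python sketch over the sample serial numbers from the question shows which rows each version of the condition would delete. The function names are made up for illustration, and the plain string comparison stands in for the LIKE without wildcards.

```python
serials = ["01A", "01B", "02A", "02B", "03A", "03B", "03C", "04A", "05A", "06A"]
keep_prefixes = ("01", "02", "03")


def deleted_with_or(serial):
    # NOT LIKE '01' OR NOT LIKE '02' OR NOT LIKE '03'
    prefix = serial[:2]
    return any(prefix != p for p in keep_prefixes)


def deleted_with_and(serial):
    # NOT LIKE '01' AND NOT LIKE '02' AND NOT LIKE '03'
    prefix = serial[:2]
    return all(prefix != p for p in keep_prefixes)


def deleted_with_negated_or(serial):
    # NOT (LIKE '01' OR LIKE '02' OR LIKE '03'), i.e. the SELECT condition negated as a whole
    prefix = serial[:2]
    return not any(prefix == p for p in keep_prefixes)


or_deleted = [s for s in serials if deleted_with_or(s)]
and_deleted = [s for s in serials if deleted_with_and(s)]
negated_deleted = [s for s in serials if deleted_with_negated_or(s)]

print(or_deleted)       # every row: a prefix always differs from at least one of the three
print(and_deleted)      # only the rows whose prefix differs from all three
print(negated_deleted)  # same rows as the AND version
```

In SQL terms, wrapping the original SELECT condition in NOT ( ... ) is equivalent to the AND version of the DELETE.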

How to handle this use case (running-window data) in Spark

I am using spark-sql-2.4.1v with Java 1.8.
I have source data as below:
val df_data = Seq(
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2020-03-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-06-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",21789933,"2020-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",300789933,"2018-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",27989978,"2019-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-06-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2020-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2019-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2017-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-06-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2018-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-12-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2019-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-09-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2016-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2020-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2019-09-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2016-03-01")
).toDF("industry_id","industry_name","country","state","revenue","generated_date");
Query :
val distinct_gen_date = df_data.select("generated_date").distinct.orderBy(desc("generated_date"));
For each "generated_date" in distinct_gen_date, I need to get all unique industry_ids from the preceding 6 months of data.
val cols = {col("industry_id")}
val ws = Window.partitionBy(cols).orderBy(desc("generated_date"));
val newDf = df_data
.withColumn("rank",rank().over(ws))
.where(col("rank").equalTo(lit(1)))
//.drop(col("rank"))
.select("*");
How can I compute a moving aggregate (over the unique industry_ids in 6 months of data) for each distinct generated_date?
More details:
For example, assume the sample data ranges from "2016-03-01" to "2020-03-01". If some industry_x has no row for "2020-03-01", I need to check "2020-02-01", "2020-01-01", "2019-12-01", "2019-11-01", "2019-10-01" and "2019-09-01" in sequence; the first row found (rank 1) is the one used for that industry when calculating the "2020-03-01" data set. Then I move on to the next distinct "generated_date", e.g. "2020-02-01": go back 6 months, get the unique industries, pick the rank-1 row for each, and that is the data set for "2020-02-01". The same is repeated for every distinct "generated_date", so the data set keeps changing. I can do this with a for loop, but that gives no parallelism. How can I build the data set for each distinct "generated_date" in parallel?
I don't know how to do this with window functions but a self join can solve your problem.
First, you need a DataFrame with distinct dates:
val df_dates = df_data
.select("generated_date")
.withColumnRenamed("generated_date", "distinct_date")
.distinct()
Next, for each row in your industries data you need to calculate up to which date that industry will be included, i.e., add 6 months to generated_date. I think of them as active dates. I've used add_months() to do this but you can think of different logics.
import org.apache.spark.sql.functions.{add_months, col}
val df_active = df_data.withColumn("active_date", add_months(col("generated_date"), 6))
If we start with this data (separated by date just for our eyes):
industry_id generated_date
(("Indus_1", ..., "2020-03-01"),
("Indus_1", ..., "2019-12-01"),
("Indus_2", ..., "2019-12-01"),
("Indus_3", ..., "2018-06-01"))
It has now:
industry_id generated_date active_date
(("Indus_1", ..., "2020-03-01", "2020-09-01"),
("Indus_1", ..., "2019-12-01", "2020-06-01"),
("Indus_2", ..., "2019-12-01", "2020-06-01"),
("Indus_3", ..., "2018-06-01", "2018-12-01"))
Now proceed with self join based on dates, using the join condition that will match your 6 month period:
import org.apache.spark.sql.Column
val condition: Column = (
col("distinct_date") >= col("generated_date")).and(
col("distinct_date") <= col("active_date"))
val df_joined = df_dates.join(df_active, condition, "inner")
df_joined has now:
distinct_date industry_id generated_date active_date
(("2020-03-01", "Indus_1", ..., "2020-03-01", "2020-09-01"),
("2020-03-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2020-03-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2018-06-01", "Indus_3", ..., "2018-06-01", "2018-12-01"))
Drop that auxiliary column active_date or even better, drop duplicates based on your needs:
val df_result = df_joined.dropDuplicates(Seq("distinct_date", "industry_id"))
Which drops the duplicated "Indus_1" in "2020-03-01" (It appeared twice because it's retrieved from two different generated_dates):
distinct_date industry_id
(("2020-03-01", "Indus_1"),
("2020-03-01", "Indus_2"),
("2019-12-01", "Indus_1"),
("2019-12-01", "Indus_2"),
("2018-06-01", "Indus_3"))
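The self-join logic can be traced by hand with a minimal plain-Python sketch on the four illustrative rows used above. The `add_months` helper here is a simplified stand-in written for this example, not Spark's function, and it assumes the day of the month exists in the target month (true for the 1st).

```python
from datetime import date


def add_months(d, months):
    """Simplified month arithmetic; assumes d.day is valid in the target month."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, d.day)


# (industry_id, generated_date): the illustrative rows from the answer
rows = [
    ("Indus_1", date(2020, 3, 1)),
    ("Indus_1", date(2019, 12, 1)),
    ("Indus_2", date(2019, 12, 1)),
    ("Indus_3", date(2018, 6, 1)),
]

# df_dates: the distinct generated dates
distinct_dates = {generated for _, generated in rows}

# Join on generated_date <= distinct_date <= active_date; using a set of
# (distinct_date, industry_id) pairs plays the role of dropDuplicates.
result = {
    (d, industry)
    for d in distinct_dates
    for industry, generated in rows
    if generated <= d <= add_months(generated, 6)
}

for d, industry in sorted(result, reverse=True):
    print(d.isoformat(), industry)
```

Every pair is computed independently of the others, which is why the join form parallelizes where a loop over dates does not.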

How to get value of previous row in scala apache rdd[row]?

I need to get a value from the previous or next row while I'm iterating through an RDD[Row]
(10,1,string1)
(11,1,string2)
(21,1,string3)
(22,1,string4)
I need to sum strings for rows where difference between 1st value is not higher than 3. 2nd value is ID. So the result should be:
(1, string1string2)
(1, string3string4)
I tried using groupBy, reduce and partitioning, but I still can't achieve what I want.
I'm trying to do something like this (I know it's not the proper way):
rows.groupBy(row => {
row(1)
}).map(rowList => {
rowList.reduce((acc, next) => {
diff = next(0) - acc(0)
if(diff <= 3){
val strings = acc(2) + next(2)
(acc(1), strings)
}else{
//create new group to aggregatre strings
(acc(1), acc(2))
}
})
})
I wonder if my idea is proper to solve this problem.
Looking for help!
I think you can solve your problem with sqlContext by using the lag function.
Create RDD:
val rdd = sc.parallelize(List(
(10, 1, "string1"),
(11, 1, "string2"),
(21, 1, "string3"),
(22, 1, "string4"))
)
Create DataFrame:
import sqlContext.implicits._
val df = rdd.toDF("a", "b", "c") // column c holds strings, so it must not be converted with toInt
Register your Dataframe:
df.registerTempTable("df")
Query the result:
val res = sqlContext.sql("""
  SELECT CASE WHEN l <= 3 THEN ROW_NUMBER() OVER (ORDER BY a) - 1
              ELSE ROW_NUMBER() OVER (ORDER BY a)
         END m, b, c
  FROM (
    SELECT a, b,
           (a - CASE WHEN lag(a, 1) OVER (ORDER BY a) IS NOT NULL
                     THEN lag(a, 1) OVER (ORDER BY a)
                     ELSE 0
                END) l, c
    FROM df) A
""")
Show the Results:
res.show
I hope this helps.
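For reference, the grouping rule from the question (same ID, difference from the previous row's first value not higher than 3) can be written as a minimal plain-Python sketch. This is not the Spark solution, only a way to check the expected result. The function name `concat_by_gap` is made up for illustration, and it assumes each row is compared with the row immediately before it, as lag does.

```python
rows = [
    (10, 1, "string1"),
    (11, 1, "string2"),
    (21, 1, "string3"),
    (22, 1, "string4"),
]


def concat_by_gap(rows, max_gap=3):
    """Concatenate strings of consecutive rows with the same ID and a gap <= max_gap."""
    groups = []
    prev = None  # (value, key) of the previous row in sorted order
    for value, key, text in sorted(rows, key=lambda r: (r[1], r[0])):
        if prev is not None and prev[1] == key and value - prev[0] <= max_gap:
            # Close enough to the previous row: extend the group that is still open
            groups[-1] = (key, groups[-1][1] + text)
        else:
            # Different ID or gap too large: start a new group
            groups.append((key, text))
        prev = (value, key)
    return groups


result = concat_by_gap(rows)
print(result)
```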

One2many field issue Odoo 10.0

I have this very weird issue with One2many field.
First, let me explain the scenario.
I have a One2many field in sale.order.line; the code below shows the structure:
from odoo import fields, models

class testModule(models.Model):
    _name = 'test.module'
    name = fields.Char()

class testModule2(models.Model):
    _name = 'test.module2'
    location_id = fields.Many2one('test.module')
    field1 = fields.Char()
    field2 = fields.Many2one('sale.order.line')

class testModule3(models.Model):
    _inherit = 'sale.order.line'
    test_location = fields.One2many('test.module2', 'field2')
CASE 1:
I create a new sales order, select the partner_id, add a sale.order.line, add a line to the One2many field test_location inside it, and then save.
CASE 2:
I create a new sales order, select the partner_id, add a sale.order.line and, inside the sale.order.line, add a test_location line [then close the sales order line window]. After that, before hitting save, I change a field, say partner_id, and then click save.
CASE 3:
This case is the same as case 2, except that I change the partner_id field again [two changes in total: the one from case 2 and this one], and then click save.
RESULTS
CASE 1 works fine.
CASE 2 fails with:
odoo.sql_db: bad query: INSERT INTO "test_module2" ("id", "field2", "field1", "location_id", "create_uid", "write_uid", "create_date", "write_date") VALUES(nextval('test_module2_id_seq'), 27, 'asd', ARRAY[1, '1'], 1, 1, (now() at time zone 'UTC'), (now() at time zone 'UTC')) RETURNING id
ProgrammingError: column "location_id" is of type integer but expression is of type integer[]
LINE 1: ...VALUES(nextval('test_module2_id_seq'), 27, 'asd', ARRAY[1, '...
For this case I put a debugger on the create/write methods of sale.order.line to see what values are being passed:
values = {u'product_uom': 1, u'sequence': 0, u'price_unit': 885, u'product_uom_qty': 1, u'qty_invoiced': 0, u'procurement_ids': [[5]], u'qty_delivered': 0, u'qty_to_invoice': 0, u'qty_delivered_updateable': False, u'customer_lead': 0, u'analytic_tag_ids': [[5]], u'state': u'draft', u'tax_id': [[5]], u'test_location': [[5], [0, 0, {u'field1': u'asd', u'location_id': [1, u'1']}]], 'order_id': 20, u'price_subtotal': 885, u'discount': 0, u'layout_category_id': False, u'product_id': 29, u'price_total': 885, u'invoice_status': u'no', u'name': u'[CARD] Graphics Card', u'invoice_lines': [[5]]}
In the above values, location_id is passed as u'location_id': [1, u'1'], which is not correct. To work around this, I correct the value in code, update the values and pass them on.
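A correction of this kind could look roughly like the plain-Python sketch below. It is only a workaround sketch, not an explanation of why the client sends the pair. The helper names `normalize_many2one` and `normalize_one2many_commands` are hypothetical, and it assumes the bad value is always an [id, display_name] pair nested inside a (0, 0, vals) or (1, id, vals) command.

```python
def normalize_many2one(vals, field_names):
    """Return a copy of vals where [id, display_name] pairs are replaced by the bare id."""
    cleaned = dict(vals)
    for name in field_names:
        value = cleaned.get(name)
        if isinstance(value, (list, tuple)) and value:
            cleaned[name] = value[0]
    return cleaned


def normalize_one2many_commands(commands, field_names):
    """Clean the vals dict of every create (0) and update (1) command; copy the rest."""
    fixed = []
    for command in commands:
        if len(command) == 3 and command[0] in (0, 1) and isinstance(command[2], dict):
            fixed.append([command[0], command[1], normalize_many2one(command[2], field_names)])
        else:
            fixed.append(list(command))
    return fixed


# The test_location commands captured in the debugger for case 2
commands = [[5], [0, 0, {"field1": "asd", "location_id": [1, "1"]}]]
fixed = normalize_one2many_commands(commands, ["location_id"])
print(fixed)
```

Such a helper would be called on values['test_location'] inside the overridden create/write before delegating to super.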
CASE 3
If the user changes the field two or more times, the values are:
values = {u'invoice_lines': [[5]], u'procurement_ids': [[5]], u'tax_id': [[5]], u'test_location': [[5], [1, 7, {u'field1': u'asd', u'location_id': False}]], u'analytic_tag_ids': [[5]]}
here
u'location_id': False
MULTIPLE CASE
If the user does case 1 and then, on the same record, does case 2 or case 3, the line is sometimes saved with field2 = Null or False in the database; other fields such as location_id and field1 have data, but field2 does not.
NOTE: THIS HAPPENS WITH ANY FIELD NOT ONLY PARTNER_ID FIELD ON HEADER LEVEL OF SALE ORDER
I tried debugging it myself but couldn't find the reason why this is happening.

What does the exclude_nodata_value argument to ST_DumpValues do?

Could anyone explain what the exclude_nodata_value argument to ST_DumpValues does?
For example, given the following:
WITH
-- Create a 4x4 raster, with each value set to 8 and NODATA set to -99.
tbl_1 AS (
SELECT
ST_AddBand(
ST_MakeEmptyRaster(4, 4, 0, 0, 1, -1, 0, 0, 4326),
1, '32BF', 8, -99
) AS rast
),
-- Set the values in rows 1 and 2 to -99.
tbl_2 AS (
SELECT
ST_SetValues(
rast, 1, 1, 1, 4, 2, -99, FALSE
) AS rast FROM tbl_1)
Why does the following select statement return NULLs in the first two rows:
SELECT ST_DumpValues(rast, 1, TRUE) AS cell_values FROM tbl_2;
Like this:
{{NULL,NULL,NULL,NULL},{NULL,NULL,NULL,NULL},{8,8,8,8},{8,8,8,8}}
But the following select statement returns -99s?
SELECT ST_DumpValues(rast, 1, FALSE) AS cell_values FROM tbl_2;
Like this:
{{-99,-99,-99,-99},{-99,-99,-99,-99},{8,8,8,8},{8,8,8,8}}
Clearly, with both statements the first two rows really contain -99s. However, in the first case (exclude_nodata_value=TRUE) these values have been masked (but not replaced) by NULLs.
Thanks for any help. The subtle differences between NULL and NODATA within PostGIS have been driving me crazy for several days.
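The behaviour observed above can be modelled in a few lines of plain Python. This is only a conceptual sketch of the masking, not PostGIS code: the `dump_values` function and the nested-list raster are stand-ins invented for illustration. The stored cell values never change; the flag only decides whether cells equal to the band's NODATA value are reported as None (NULL) on output.

```python
NODATA = -99.0

# Conceptual stand-in for band 1 of tbl_2: rows 1 and 2 hold the NODATA value
grid = [
    [-99.0, -99.0, -99.0, -99.0],
    [-99.0, -99.0, -99.0, -99.0],
    [8.0, 8.0, 8.0, 8.0],
    [8.0, 8.0, 8.0, 8.0],
]


def dump_values(grid, nodata, exclude_nodata_value=True):
    """Return a copy of the cell values, masking NODATA cells as None when asked to."""
    return [
        [None if exclude_nodata_value and value == nodata else value for value in row]
        for row in grid
    ]


masked = dump_values(grid, NODATA, exclude_nodata_value=True)
raw = dump_values(grid, NODATA, exclude_nodata_value=False)
print(masked)
print(raw)
```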