PySpark Window.partitionBy() treats identical cell values as different - pyspark

I have a dataframe with columns like these:
df_lp.show()
+--------------------+-------------+--------------------+------+
| ts| uid| pid|action|
+--------------------+-------------+--------------------+------+
|2017-03-28 09:34:...|1950269663250|IST334-0149064968...| <L|
|2017-03-31 05:50:...|S578448696405|IST334-0149089179...| <L|
|2017-03-28 09:38:...|1950269663250|IST334-0149064968...| <L|
|2017-03-30 09:26:...| 412310802992|IST334-1212011845...| <L|
ts is a timestamp; all the other columns are strings. action takes one of a set of values (<L, +L, <B, +B, etc.).
Now I defined a window, and two window functions like this:
ts_w_lp = Window.partitionBy(df_lp['pid']).orderBy(df_lp['ts'].asc())
first_lp = fns.min(df_lp['ts']).over(ts_w_lp)
last_lp = fns.max(df_lp['ts']).over(ts_w_lp)
I want to find the first and last timestamps for a given pid and action. So I do this:
df_lp_pB = df_lp.filter(df_lp['action'] == '+B')\
.select('pid', first_lp.alias('tsf'), last_lp.alias('tsl'))\
.distinct().sort('pid')
However, the dataframe I get has extra rows where one of the window functions seems to have intermediate values.
df_lp_pB.filter('pid = "BAG26723881"').toPandas()
pid tsf tsl
0 BAG26723881 2017-04-11 15:10:35.674 2017-04-11 15:10:35.674
1 BAG26723881 2017-04-11 15:10:35.674 2017-04-11 15:10:35.736
If I do the same with Spark SQL, it works as expected.
df_lp.createOrReplaceTempView('scans_lp')
df_sql = spark.sql("SELECT pid , min(ts) AS tsf, max(ts) AS tsl FROM scans_lp \
WHERE action='+B' GROUP BY pid ORDER BY pid")
df_sql.filter('pid = "BAG26723881"').toPandas()
pid tsf tsl
0 BAG26723881 2017-04-11 15:10:35.674 2017-04-11 15:10:35.736
In fact, when I invert the sorting of the timestamp column to descending, the other window function has the issue!
ts_w_lp2 = Window.partitionBy(df_lp['pid']).orderBy(df_lp['ts'].desc())
first_lp2 = fns.min(df_lp['ts']).over(ts_w_lp2)
last_lp2 = fns.max(df_lp['ts']).over(ts_w_lp2)
df_lp_pB2 = df_lp.filter(df_lp['action'] == '+B')\
.select('pid', first_lp2.alias('tsf'), last_lp2.alias('tsl'))\
.distinct().sort('pid')
df_lp_pB2.filter('pid = "BAG26723881"').toPandas()
pid tsf tsl
0 BAG26723881 2017-04-11 15:10:35.736 2017-04-11 15:10:35.736
1 BAG26723881 2017-04-11 15:10:35.674 2017-04-11 15:10:35.736
If I investigate further, I see all my pids are considered distinct, even for rows where they are not! See this:
df_lp.filter('action = "+B"').select('pid').distinct().count()
6382
df_sql.count()
6382
df_lp.filter('action = "+B"').select('pid').count()
120303
df_lp_pB.count()
120303
What is going on? Did I misunderstand what Window.partitionBy() is supposed to do?

Related

Is it possible to have hibernate generate update from values statements for postgresql?

Given a postgresql table
Table "public.test"
Column | Type | Modifiers
----------+-----------------------------+-----------
id | integer | not null
info | text |
And the following values :
# select * from test;
id | info
----+--------------
3 | value3
4 | value4
5 | value5
As you may know, with PostgreSQL you can use this kind of statement to update multiple rows with different values:
update test set info=tmp.info from (values (3,'newvalue3'),(4,'newvalue4'),(5,'newvalue5')) as tmp (id,info) where test.id=tmp.id;
And it results in the table being updated in a single query to:
# select * from test;
id | info
----+--------------
3 | newvalue3
4 | newvalue4
5 | newvalue5
I have been looking around everywhere for how to make Hibernate generate this kind of statement for update queries. I know how to make it work for insert queries (with the reWriteBatchedInserts JDBC option and Hibernate batch config options).
But is it possible for update queries, or do I have to write the native query myself?
No matter what I do, Hibernate always sends separate update queries to the database (I'm looking at the PostgreSQL server statement logs to verify this).
2020-06-18 08:19:48.895 UTC [1642] LOG: execute S_6: BEGIN
2020-06-18 08:19:48.895 UTC [1642] LOG: execute S_8: update test set info = $1 where id = $2
2020-06-18 08:19:48.895 UTC [1642] DETAIL: parameters: $1 = 'newvalue3', $2 = '3'
2020-06-18 08:19:48.896 UTC [1642] LOG: execute S_8: update test set info = $1 where id = $2
2020-06-18 08:19:48.896 UTC [1642] DETAIL: parameters: $1 = 'newvalue4', $2 = '4'
2020-06-18 08:19:48.896 UTC [1642] LOG: execute S_8: update test set info = $1 where id = $2
2020-06-18 08:19:48.896 UTC [1642] DETAIL: parameters: $1 = 'newvalue5', $2 = '5'
2020-06-18 08:19:48.896 UTC [1642] LOG: execute S_1: COMMIT
I always find it many times faster to issue a single massive update query than many separate updates targeting single rows. With many separate update queries, even though they are sent in a batch by the JDBC driver, they still need to be processed sequentially by the server, so it is not as efficient as a single update query targeting multiple rows. So if anyone has a solution that wouldn't involve writing native queries for my entities, I would be very glad!
Update
To further refine my question I want to add a clarification. I'm looking for a solution that wouldn't abandon Hibernate's dirty checking feature for entity updates. I'm trying to avoid writing batch update queries by hand for the general case of having to update a few basic fields with different values on an entity list. I'm currently looking into the SPI of Hibernate to see if it's doable. org.hibernate.engine.jdbc.batch.spi.Batch seems to be the proper place, but I'm not quite sure yet because I've never done anything with the Hibernate SPI. Any insights would be welcome!
You can use Blaze-Persistence for this, which is a query builder on top of JPA that supports many advanced DBMS features on top of the JPA model.
It does not yet support the FROM clause in DML, but that is about to land in the next release: https://github.com/Blazebit/blaze-persistence/issues/693
Meanwhile you could use CTEs for this. First you need to define a CTE entity (a concept of Blaze-Persistence):
@CTE
@Entity
public class InfoCte {
    @Id Integer id;
    String info;
}
I'm assuming your entity model looks roughly like this:
@Entity
public class Test {
    @Id Integer id;
    String info;
}
Then you can use Blaze-Persistence like this:
criteriaBuilderFactory.update(entityManager, Test.class, "test")
.with(InfoCte.class, false)
.fromValues(Test.class, "newInfos", newInfosCollection)
.bind("id").select("newInfos.id")
.bind("info").select("newInfos.info")
.end()
.set("info")
.from(InfoCte.class, "cte")
.select("cte.info")
.where("cte.id").eqExpression("test.id")
.end()
.whereExists()
.from(InfoCte.class, "cte")
.where("cte.id").eqExpression("test.id")
.end()
.executeUpdate();
This will create an SQL query similar to the following
WITH InfoCte(id, info) AS (
SELECT t.id, t.info
FROM (VALUES(1, 'newValue', ...)) t(id, info)
)
UPDATE test
SET info = (SELECT cte.info FROM InfoCte cte WHERE cte.id = test.id)
WHERE EXISTS (SELECT 1 FROM InfoCte cte WHERE cte.id = test.id)

Combining postgres query and log duration

I am aware that you can show the duration and log queries using the configuration below in postgresql.conf
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------
log_statement = 'all'
log_duration = on
log_line_prefix = '{"Time":[%t], Host:%h} '
And then returns logs like
{"Time":[2018-08-13 16:24:20 +08], Host:172.18.0.2} LOG: statement: SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user" WHERE "auth_user"."id" = 1
{"Time":[2018-08-13 16:24:20 +08], Host:172.18.0.2} LOG: duration: 7.694 ms
But can I combine the duration and statement in a single line like?
LOG: { statement: ..., duration: 7.694 ms}
The way you are logging, the statement is logged when the server starts processing it, but the duration is only known at the end of the execution.
This is why it has to be logged as two different messages.
If you use log_min_duration_statement = 0 instead, the statement is logged at the end of execution together with the duration.
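For example, a postgresql.conf fragment along those lines (a sketch; the log_line_prefix is carried over from the question, and the exact combination should be verified against your server version):

```
log_statement = 'none'              # don't log statements separately when they start
log_duration = off                  # no separate duration-only lines
log_min_duration_statement = 0      # 0 = log every statement with its duration, in one line
log_line_prefix = '{"Time":[%t], Host:%h} '
```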

PgBadger report from AWS RDS log parsed not fully

I use pgbadger as follows:
pgbadger -p %t:%r:%u#%d:[%p]: postgresql.log
log_line_prefix is set by RDS and cannot be changed. It's the same prefix that I pass to pgbadger (%t:%r:%u#%d:[%p]:).
When I launch pgbadger I get the following stdout output.
[=======================> ] Parsed 52063631 bytes of 52063634 (100.00%), queries: 66116, events: 0
LOG: Ok, generating html report...
So it parsed the queries, and in the output I see most of the stats. But in the Top section I see wrong info: Time consuming queries and Slowest individual queries say "no dataset", and in Most frequent queries (N) all queries have all durations as 0. See the screenshot here: http://clip2net.com/s/3wUxfXg . And the query examples don't show any examples at all.
I checked the PostgreSQL log and the duration is there. For example:
2016-04-13 22:00:02 UTC:blabla.com(43372):blabla#blabla:[20584]:LOG: statement: SELECT DISTINCT "reports2_report"."id", "reports2_report"."created", "reports2_report"."modified", "reports2_report"."data", "reports2_report"."person_info", "reports2_report"."status", "reports2_report"."source_profile_id", "reports2_report"."application_id", "reports2_report"."error_detail" FROM "reports2_report" INNER JOIN "reports2_reportsourceprofile" ON ( "reports2_report"."source_profile_id" = "reports2_reportsourceprofile"."id" ) INNER JOIN "reports2_reportsource" ON ( "reports2_reportsourceprofile"."report_source_id" = "reports2_reportsource"."id" ) INNER JOIN "applications_applicationdocument" ON ( "reports2_report"."application_id" = "applications_applicationdocument"."slug" ) WHERE ("reports2_reportsource"."identifier" = 'redridge_credit' AND "reports2_report"."application_id" = 'jqLoMe' AND ("reports2_report"."application_id" IN (SELECT DISTINCT V0."slug" FROM "applications_applicationdocument" V0 LEFT OUTER JOIN "auth_user" V1 ON ( V0."seller_id" = V1."id" ) LEFT OUTER JOIN "accounts_companymembership" V2 ON ( V1."id" = V2."user_id" ) LEFT OUTER JOIN "applications_applicationbundle" V5 ON ( V0."bundle_id" = V5."id" ) LEFT OUTER JOIN "applications_applicationbundle_sharees" V6 ON ( V5."id" = V6."applicationbundle_id" ) WHERE (V2."company_id" IN (SELECT U0."id" FROM "accounts_company" U0 WHERE (U0."lft" > 2 AND U0."tree_id" = 6 AND U0."rght" < 3)) OR V0."applicant_id" = 111827 OR V0."seller_id" = 111827 OR V6."user_id" = 111827)) OR "applications_applicationdocument"."seller_id" = 111827 OR "applications_applicationdocument"."applicant_id" = 111827 OR "reports2_reportsourceprofile"."user_id" = 111827)) ORDER BY "reports2_report"."created" DESC LIMIT 20
2016-04-13 22:00:02 UTC:blabla.com(43372):blabla#blabla:[20584]:LOG: duration: 517.047 ms
How can I get PgBadger to generate a full, proper report?
It was due to log_statement = 'all' instead of log_statement = 'none'. log_min_duration_statement works only if log_statement = 'none'.
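On RDS that means changing the DB parameter group accordingly (a sketch, using stock PostgreSQL parameter names):

```
log_statement = 'none'            # 'all' prevents pgbadger from pairing statements with durations
log_min_duration_statement = 0    # log every statement together with its duration
```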

How can I query Dynamodb for different hash keys and secondaryindex?

I am trying to do log table in dynamodb and my table looks like
Pid[HashKey] || TableName[SecondaryIndex] || CreateDate[RangeKey] || OldValue || NewValue
10 || Product || 10.10.2013 00:00:01 || Shoe || Skirt
10 || Product || 10.10.2013 00:00:02 || Skirt || Pant
11 || ProductCategory || 10.10.2013 00:00:01 || Shoes || Skirts
19 || ProductCategory || 10.10.2013 00:00:01 || Tables || Armchairs
Pid = My main db tables primary key
TableName = My main db table names
CreateDate = Row created date
now I want to get list of
where (Pid = 10 AND TableName = "Product") OR (Pid = 11 AND
TableName="ProductCategory")
in a single request (the real condition wouldn't be as short as this; it could include many tables and pids)
I tried BatchGetItem, but couldn't use it because it can't query on a secondary index - it needs the range key with an equality operator.
I tried Query, but this time I couldn't send multiple hash keys in the same query.
Any ideas or suggestions?
Thank you.
The problem here is the OR. Generally you cannot get this where condition with a single Query operation without modifying your rows.
Solution 1: You have to issue 2 query operations, and append them to the same resultset.
where (Pid = 10 AND TableName = "Product")
union
where (Pid = 11 AND TableName = "ProductCategory")
Those operations should run in parallel to optimize performance.
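A minimal sketch of Solution 1 in Python: run the two Query operations in parallel and append them to one result set. query_table here is a hypothetical stand-in for a real DynamoDB Query call (e.g. boto3's Table.query); it filters made-up in-memory items instead:

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up items mirroring the table in the question
SAMPLE_ITEMS = [
    {"Pid": "10", "TableName": "Product", "OldValue": "Shoe", "NewValue": "Skirt"},
    {"Pid": "11", "TableName": "ProductCategory", "OldValue": "Shoes", "NewValue": "Skirts"},
    {"Pid": "19", "TableName": "ProductCategory", "OldValue": "Tables", "NewValue": "Armchairs"},
]

def query_table(pid, table_name):
    # Hypothetical stand-in for one DynamoDB Query operation
    return [i for i in SAMPLE_ITEMS
            if i["Pid"] == pid and i["TableName"] == table_name]

# Issue both Query operations in parallel and merge the results
conditions = [("10", "Product"), ("11", "ProductCategory")]
with ThreadPoolExecutor() as pool:
    pages = list(pool.map(lambda c: query_table(*c), conditions))
results = [item for page in pages for item in page]
```

With real boto3 calls, query_table would also have to follow LastEvaluatedKey pagination.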
Solution 2: create a field xxx that describes your condition and maintain it on writes; then you could create a global secondary index on it and perform a single query.

PlayFramework 2 + Ebean - raw Sql Update query - makes no effect on db

I have a Play Framework 2.0.4 application that needs to modify rows in the db.
I need to update a few messages in the db to status "opened" (read messages).
I did it like below
String sql = " UPDATE message SET opened = true, opened_date = now() "
+" WHERE id_profile_to = :id1 AND id_profile_from = :id2 AND opened IS NOT true";
SqlUpdate update = Ebean.createSqlUpdate(sql);
update.setParameter("id1", myProfileId);
update.setParameter("id2", conversationProfileId);
int modifiedCount = update.execute();
I have modified the postgresql to log all the queries.
modifiedCount is the actual number of modified rows - but the query runs in a transaction.
After the query is done in the db there is a ROLLBACK - so the UPDATE is not made.
I have tried to change db to H2 - with the same result.
This is the query from postgres audit log
2012-12-18 00:21:17 CET : S_1: BEGIN
2012-12-18 00:21:17 CET : <unnamed>: UPDATE message SET opened = true, opened_date = now() WHERE id_profile_to = $1 AND id_profile_from = $2 AND opened IS NOT true
2012-12-18 00:21:17 CET : parameters: $1 = '1', $2 = '2'
2012-12-18 00:21:17 CET : S_2: ROLLBACK
..........
The Play Framework documentation and the Ebean docs state that there is no transaction (if not declared, or a transient one per query if needed).
So... I have made the trick
Ebean.beginTransaction();
int modifiedCount = update.execute();
Ebean.commitTransaction();
Ebean.endTransaction();
Logger.info("update mod = " + modifiedCount);
But this makes no difference - the same behavior ...
Ebean.execute(update);
Again - the same ..
Next I annotated the method with
@Transactional(type=TxType.NEVER)
and
@Transactional(type=TxType.MANDATORY)
Neither of them made a difference.
I am so frustrated with Ebean :(
Anybody can help, please ?
BTW.
I set
Ebean.getServer(null).getAdminLogging().setDebugGeneratedSql(true);
Ebean.getServer(null).getAdminLogging().setDebugLazyLoad(true);
Ebean.getServer(null).getAdminLogging().setLogLevel(LogLevel.SQL);
to see the query in the Play console - other queries are logged, but this update is not.
Just remove the initial space... Yes, I couldn't believe it either...
Change " UPDATE... to "UPDATE...
And that's all...
I think you have to use raw SQL instead of a createSqlUpdate statement.