I have two DataFrames that I join based on whether a string column in one is contained in the Message column of the other.
Example DFs:
DF1

| Id | Application | Message |
| --- | --- | --- |
| 1 | facebook | could not connect to facebook. java.lang.NullPointerException |
| 2 | google | Connection to the application could not established .Description[stepupRequestId:may not be null ] |
| 3 | homedepot | PorcupineCourier -Execution of Connection to the application could not established. java.lang.NullPointerException |
DF2

| Event_Id | Token | SortOrder | Action |
| --- | --- | --- | --- |
| 10 | java.lang.NullPointerException | 25 | yes |
| 20 | PorcupineCourier -Execution of | 9 | no |
| 30 | stepupRequestId:may not be null | 1 | no |
If I do a left join of these two DataFrames like below

```scala
val result1 = df1.join(df2, $"Message".contains($"Token"), "left")
```
the output DF looks like this:
| Id | Application | Message | Event_Id | Token | SortOrder | Action |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | facebook | could not connect to facebook. java.lang.NullPointerException | 10 | java.lang.NullPointerException | 25 | yes |
| 2 | google | Connection to the application could not established .Description[stepupRequestId:may not be null ] | 30 | stepupRequestId:may not be null | 1 | no |
| 3 | homedepot | PorcupineCourier -Execution of Connection to the application could not established. java.lang.NullPointerException | 20 | PorcupineCourier -Execution of | 9 | no |
| 3 | homedepot | PorcupineCourier -Execution of Connection to the application could not established. java.lang.NullPointerException | 10 | java.lang.NullPointerException | 25 | yes |
Message = "PorcupineCourier -Execution of Connection to the application could not established. java.lang.NullPointerException" in DF1 has multiple tokens and i end up having 2 matches for the same message in my output.
But, i would like to use SortOrder Column in DF2 to get the right match and drop the duplicate row.I would like to get Event_Id with lowest SortOrder in case of multiple token matches, hence in the above output Event_Id = 20 has the lowest SortOrder, my output should have only 3 rows as below
Expected output:

| Id | Application | Message | Event_Id | Token | SortOrder | Action |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | facebook | could not connect to facebook. java.lang.NullPointerException | 10 | java.lang.NullPointerException | 25 | yes |
| 2 | google | Connection to the application could not established .Description[stepupRequestId:may not be null ] | 30 | stepupRequestId:may not be null | 1 | no |
| 3 | homedepot | PorcupineCourier -Execution of Connection to the application could not established. java.lang.NullPointerException | 20 | PorcupineCourier -Execution of | 9 | no |
Also, if the message matches multiple tokens that share the same SortOrder, I would like to fall back to the row with the lowest Event_Id value.
I am stuck and would appreciate any help in getting the logic right.
This is a fairly standard pattern for dropping duplicate rows:
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

result1
  .withColumn("rn", row_number().over(Window.partitionBy($"Id").orderBy($"SortOrder", $"Event_Id")))
  .where($"rn" === 1)
```
We put SortOrder as the first ranking criterion, then Event_Id. Window ordering is ascending by default. If you have nulls, you may want to use .asc_nulls_last on the sort columns.
I implemented a reusable version of this pattern here: https://github.com/kanielc/jarvis-utils/blob/d95442ffad1d61d9269e87228e0911508784655f/src/main/scala/com/jarvis/utils/DatasetFunctions.scala#L27
but the version above will work just as well.
I know the wording of my title is vague, so I will try to clarify my question.
tasks_table

| user_id | completed_date | task_type | task_id |
| --- | --- | --- | --- |
| 1 | 11/14/2021 | A | 34 |
| 1 | 11/13/2021 | B | 35 |
| 1 | 11/11/2021 | A | 36 |
| 1 | 11/09/2021 | B | 37 |
| 2 | 11/12/2021 | A | 38 |
| 2 | 11/02/2021 | A | 39 |
| 2 | 11/14/2021 | B | 40 |
| 2 | 10/14/2021 | B | 41 |
The table I am working with has more fields than this, but these are the ones pertinent to the question. The task type can be either A or B.
What I am currently trying to do is get a result set that contains, at most, two tasks per user_id (one of task type A and one of task type B) that have been completed in the past 7 days. For example, the query should generate the following result set:
| user_id | completed_date | task_type | task_id |
| --- | --- | --- | --- |
| 1 | 11/14/2021 | A | 34 |
| 1 | 11/13/2021 | B | 35 |
| 2 | 11/12/2021 | A | 38 |
| 2 | 11/14/2021 | B | 40 |
There is a possibility that a user may have only done tasks of one type in that time period, but it is guaranteed that any given user will have done at least one task within that time. My question: is it possible to create a query that can return such a result, or would I have to query for a more generalized result and then trim the set down through logic in my JPA layer?
To select the most recent task for a given user_id and task_type within the last 7 days, if one exists, you can try this:
```sql
SELECT DISTINCT ON (t.user_id, t.task_type) t.*
FROM tasks_table AS t
WHERE t.completed_date >= current_date - interval '7 days'
ORDER BY t.user_id, t.task_type, t.completed_date DESC;
```
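DISTINCT ON is PostgreSQL-specific. If you ever need a portable version, the same logic can be sketched with a window function (same table and column names as above):

```sql
SELECT user_id, completed_date, task_type, task_id
FROM (
    SELECT t.*,
           -- number each user's tasks of each type, most recent first
           ROW_NUMBER() OVER (PARTITION BY t.user_id, t.task_type
                              ORDER BY t.completed_date DESC) AS rn
    FROM tasks_table AS t
    WHERE t.completed_date >= current_date - interval '7 days'
) AS ranked
WHERE rn = 1;
```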
We host data for an auditing service. Every few months, a new audit comes out with similar questions to previous audits of the same category. Since questions can change verbiage and number, we store each question in each audit separately (we do link them through a "related_questions" table).
audits

| id | name | passing_score |
| --- | --- | --- |
| 1 | audit_1 | 100 |
| 2 | audit_2 | 150 |
questions

| id | audit_id | text |
| --- | --- | --- |
| 1 | 1 | q1 |
| 2 | 1 | q2 |
| 3 | 2 | q1 |
| 4 | 2 | q2 |
We then have surveys and responses tables. Surveys are the overall response to an audit, while responses store the individual responses to each question.
surveys

| id | audit_id | overall_score | pass |
| --- | --- | --- | --- |
| 1 | 1 | 120 | true |
| 2 | 1 | 95 | false |
| 3 | 2 | 200 | true |
| 4 | 2 | 100 | false |
responses

| id | survey_id | question_id | score |
| --- | --- | --- | --- |
| 1 | 1 | 1 | 60 |
| 2 | 1 | 2 | 60 |
| 3 | 2 | 1 | 60 |
| 4 | 2 | 2 | 35 |
| 5 | 3 | 3 | 100 |
| 6 | 3 | 4 | 100 |
| 7 | 4 | 3 | 50 |
| 8 | 4 | 4 | 50 |
The analyze threshold is base threshold + scale factor * number of tuples; with the defaults (autovacuum_analyze_threshold = 50, autovacuum_analyze_scale_factor = 0.1), a table with 1,000,000 rows is only re-analyzed after roughly 100,050 row modifications. The problem with this is that once an audit has finished (after a few months), we'll never receive new surveys or responses for that category of data. The new data that comes in is conceptually all that needs to be analyzed. All data is queried, but the new data gets the most traffic.
If 10% is the ideal scale factor today and analyze autoruns once a week, a couple of years from now analyze may autorun only once every 4 months due to the number of tuples. This is problematic when the past 3 months of data is for questions the analyzer has never seen, so there are no helpful stats for the query planner on this data.
We could set the scale factor extremely low for this table, but that seems like a cheap solution that could cause issues in the future.
If you have a constant data-modification rate, setting autovacuum_analyze_scale_factor to 0 for that table and relying only on autovacuum_analyze_threshold is a good idea.
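For example (a sketch; the table name and the threshold of 10000 are assumptions you would tune to your actual insert rate):

```sql
ALTER TABLE responses SET (
    autovacuum_analyze_scale_factor = 0,     -- ignore table size entirely
    autovacuum_analyze_threshold    = 10000  -- analyze after ~10k row modifications
);
```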
I have a table like this:
id visited_time page visitor_id
1 2019-04-29T10:44:53.847014+02:00 1 1
2 2019-04-29T10:46:53.174894+02:00 1 3
3 2019-04-29T10:49:44.000390+02:00 2 1
18 2019-04-29T10:52:46.574140+02:00 2 3
19 2019-04-29T10:52:58.158146+02:00 3 1
20 2019-04-29T10:53:27.402038+02:00 1 9
25 2019-04-29T10:55:18.275441+02:00 2 9
54 2019-04-29T11:10:01.818343+02:00 1 13
72 2019-04-29T11:40:28.056813+02:00 2 13
A visitor will be going from page 1 to 2 to 3 and so forth (but can drop out along the way). I want to find the average time spent on each page. Logically this is the difference between the time a unique visitor_id visited page 1 and the time they visited page 2, and so on.
Is there a smart way to do this in Postgres?
Here you go:
```sql
SELECT
    page,
    avg(visited_time_next - visited_time)
FROM
(
    SELECT
        page,
        visited_time,
        -- the time of the next page view by a certain visitor...
        lead(visited_time) OVER (PARTITION BY visitor_id ORDER BY visited_time) AS visited_time_next
    FROM visits_so_56097366
) AS tmp
GROUP BY page
ORDER BY page;
```
Online example: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=e64dd8862350b9357d9a4384937868c9
Please also make sure that you have an index on visitor_id and visited_time, otherwise you'll end up with very expensive sorts for larger numbers of intermediate rows.
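A minimal sketch of such an index, using the example table from the query above (the index name is arbitrary):

```sql
CREATE INDEX visits_visitor_id_visited_time_idx
    ON visits_so_56097366 (visitor_id, visited_time);
```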
I would like some help filtering this table, which has different account statuses for the same UUIDs: I want to keep all rows for the UUIDs that have at least one Active status.
Sample data:
UUID Status
1 Active
1 Rejected
1 Rejected
2 Rejected
2 Waitlisted
2 Processing
3 Active
3 Active
3 Processing
4 Rejected
4 Processing
Expected output:
UUID Status
1 Active
1 Rejected
1 Rejected
3 Active
3 Active
3 Processing
I am trying to use some kind of RANK function, but I can't find a way to keep the rows where a UUID has an Active row while the row's own status is not Active.
Thanks
demo: db<>fiddle
```sql
SELECT *
FROM status
WHERE uuid IN
    (SELECT uuid FROM status WHERE status = 'Active');
```
1. Select the uuids with an Active status.
2. Select the rows with these uuids.
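An equivalent formulation with EXISTS, which expresses the same "has at least one Active row" check and can be friendlier to the planner on large tables:

```sql
SELECT s.*
FROM status AS s
WHERE EXISTS (
    -- keep this row if its uuid has any Active row
    SELECT 1
    FROM status AS a
    WHERE a.uuid = s.uuid
      AND a.status = 'Active'
);
```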
I am trying to work out how I would ensure I only get one instance of each user and their ID when I try to do an inner join on my source table.
The source data is a series of user names and IDs:
userid username
1 alice
2 bob
3 charley
4 dave
5 robin
6 jon
7 lou
8 scott
I have had to write the algorithm in Python to make sure I only get one set of user data matched with another (so we can make sure each user's data is used once in each round).
We store the pairings, and how many rounds have been completed successfully after the tests, but I'd like to try and shortcut the process.
I can get all the results, but I want to find a better way to remove each matched pair from the results so they can't be matched again.
```sql
SELECT u.user_id, u.user, ur.user_id, ur.user
FROM userResults u
INNER JOIN userResults ur
        ON u.user_id < ur.user_id
       AND (u.user_id, ur.user_id) NOT IN (SELECT uid, uid2 FROM rounds)
WHERE u.match <= ur.match
  AND u.user_id NOT IN %s
  AND ur.user_id NOT IN %s
LIMIT 1;
```
I've tried making materialised views with a unique constraint, but it doesn't seem to affect it - I get each possible pairing once, rather than each user paired only once
I'm trying to work out how I only get 4 results, in the right order.
Every time I look at the underlying code, I can't help but think there's a better way to write it natively in SQL rather than having to iterate over results in python.
Edit: assuming each user has been matched 0 or more times, you might have a situation where user_ids 1-4 have rounds set to 1 and matches set to 1, and the remaining 4 have rounds set to 1 and no matches.
I have a view which will return a default value of 0 and 0 for rounds and matches if they haven't yet played, and you can't assume all rounds entered have met with a match.
If the first 4 have all matched, and have generated rounds, user 1 and user 2 have already met and matched in a round, so they won't be matched again, so user 1 will match user 3 (or 4) and user 2 will match user 4 (or 3)
The issue I'm having is that when I remove the limit and iterate through manually, the first three matches I always get are 2,4, then 1,3, then 2,3 (rather than 5,7 or 6,8).
Adding the sample data and current rounds:
table rounds
uid uid2
1 2
3 4
userresults view
user_id user rounds score
1 alice 1 0
2 bob 1 1
3 charley 1 1
4 dave 1 0
5 robin 0 0
6 jon 0 0
7 lou 0 0
8 scott 0 0
I'm currently getting results like:
2,4
2,3
1,3
1,4
4,6
...
These are all valid results, but I would like to limit them to a single instance of each ID in each column, so just the first match of each valid pairing.
I've created a new view to try and simplify things a bit, and populated it with dummy data and tried to generate matches.
All these matches are valid, and I'm trying to add some form of filter or restriction to bring it back to sensible numbers.
777;"Bongo Steveson";779;"Drongo Mongoson"
777;"Bongo Steveson";782;"Cheesey McHamburger"
777;"Bongo Steveson";780;"Buns Mattburger"
779;"Drongo Mongoson";782;"Cheesey McHamburger"
779;"Drongo Mongoson";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";777;"Bongo Steveson"
778;"Mongo Bongoson";779;"Drongo Mongoson"
775;"Bob Jobsworth";778;"Mongo Bongoson"
778;"Mongo Bongoson";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";782;"Cheesey McHamburger"
775;"Bob Jobsworth";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";780;"Buns Mattburger"
776;"Steve Bobson";777;"Bongo Steveson"
776;"Steve Bobson";779;"Drongo Mongoson"
776;"Steve Bobson";782;"Cheesey McHamburger"
776;"Steve Bobson";778;"Mongo Bongoson"
776;"Steve Bobson";781;"Hamburgler Bunburger"
780;"Buns Mattburger";782;"Cheesey McHamburger"
780;"Buns Mattburger";781;"Hamburgler Bunburger"
I still can't work out a sensible way to restrict these values, and it's driving me nuts.
I've implemented a solution in code but I'd really like to see if I can get this working in native Postgres.
At this point I'm monkeying around with a new test database schema, and this is my view. Adding UNIQUE to the index generates an error, and I can't add a CHECK constraint to a materialised view (grrrr).
You can try joining a subquery to ensure distinct records from the user table.
```sql
SELECT *
FROM any_table t1
INNER JOIN (
    SELECT DISTINCT userid, username
    FROM source_table
) t2 ON t1.userid = t2.userid;
```
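Applied to the sample data in the question, that might look like the sketch below (assuming the pairings live in the rounds table and the view is named userresults; "user" is quoted because it is a reserved word):

```sql
-- resolve both members of each recorded pairing to their names,
-- deduplicating the view first so each user appears once per side
SELECT r.uid, u1."user", r.uid2, u2."user"
FROM rounds AS r
INNER JOIN (SELECT DISTINCT user_id, "user" FROM userresults) AS u1
        ON r.uid = u1.user_id
INNER JOIN (SELECT DISTINCT user_id, "user" FROM userresults) AS u2
        ON r.uid2 = u2.user_id;
```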