Postgres funnel analysis (time spent) - postgresql

I have a table like this:
id  visited_time                      page  visitor_id
1   2019-04-29T10:44:53.847014+02:00  1     1
2   2019-04-29T10:46:53.174894+02:00  1     3
3   2019-04-29T10:49:44.000390+02:00  2     1
18  2019-04-29T10:52:46.574140+02:00  2     3
19  2019-04-29T10:52:58.158146+02:00  3     1
20  2019-04-29T10:53:27.402038+02:00  1     9
25  2019-04-29T10:55:18.275441+02:00  2     9
54  2019-04-29T11:10:01.818343+02:00  1     13
72  2019-04-29T11:40:28.056813+02:00  2     13
A visitor goes from page 1 to 2 to 3 and so forth (but can drop out along the way). I want to find the average time spent on each page. Logically, this is the difference between the time a given visitor_id visited page 1 and the time the same visitor visited page 2, and so on.
Is there a smart way to do this in postgres?

Here you go:
SELECT
    page,
    avg(visited_time_next - visited_time)
FROM
(
    SELECT
        page,
        visited_time,
        -- the time of the next page view by a certain visitor...
        lead(visited_time) OVER (PARTITION BY visitor_id ORDER BY visited_time) AS visited_time_next
    FROM visits_so_56097366
) AS tmp
GROUP BY page
ORDER BY page;
Online example: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=e64dd8862350b9357d9a4384937868c9
Please also make sure that you have an index on (visitor_id, visited_time); otherwise you'll end up with very expensive sorts when the number of intermediate rows gets large.
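To sanity-check the query's logic outside the database, here is a minimal Python sketch of the same idea: for each visitor, pair every page view with that visitor's next page view (what lead() does in the query) and average the differences per page. The function name average_time_per_page is hypothetical, and the rows are the sample data from the question. A visitor's last page has no next view, so it contributes no duration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def average_time_per_page(visits):
    """visits: iterable of (visited_time ISO string, page, visitor_id)."""
    by_visitor = defaultdict(list)
    for ts, page, visitor in visits:
        by_visitor[visitor].append((datetime.fromisoformat(ts), page))

    durations = defaultdict(list)
    for views in by_visitor.values():
        views.sort()  # order each visitor's views by time
        # Pair each view with the same visitor's next view, like lead().
        for (start, page), (next_start, _) in zip(views, views[1:]):
            durations[page].append(next_start - start)

    # Average the collected durations per page.
    return {
        page: sum(d, timedelta()) / len(d)
        for page, d in sorted(durations.items())
    }


visits = [
    ("2019-04-29T10:44:53.847014+02:00", 1, 1),
    ("2019-04-29T10:46:53.174894+02:00", 1, 3),
    ("2019-04-29T10:49:44.000390+02:00", 2, 1),
    ("2019-04-29T10:52:46.574140+02:00", 2, 3),
    ("2019-04-29T10:52:58.158146+02:00", 3, 1),
    ("2019-04-29T10:53:27.402038+02:00", 1, 9),
    ("2019-04-29T10:55:18.275441+02:00", 2, 9),
    ("2019-04-29T11:10:01.818343+02:00", 1, 13),
    ("2019-04-29T11:40:28.056813+02:00", 2, 13),
]
averages = average_time_per_page(visits)
for page, avg in averages.items():
    print(page, avg)
```

Pages for which no visitor has a later view (page 3 in the sample) are simply absent from the result, whereas the SQL query returns a row with a NULL average for them.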


How would I configure analyze threshold for a table where the data is categorically different every couple months?

We host data for an auditing service. Every few months, a new audit comes out with similar questions to previous audits of the same category. Since questions can change verbiage and number, we store each question in each audit separately (we do link them through a "related_questions" table).
audits
id  name     passing_score
1   audit_1  100
2   audit_2  150
questions
id  audit_id  text
1   1         q1
2   1         q2
3   2         q1
4   2         q2
We then have surveys and responses tables. A survey is the overall response to an audit, while responses store the individual answers to each question.
surveys
id  audit_id  overall_score  pass
1   1         120            true
2   1         95             false
3   2         200            true
4   2         100            false
responses
id  survey_id  question_id  score
1   1          1            60
2   1          2            60
3   2          1            60
4   2          2            35
5   3          3            100
6   3          4            100
7   4          3            50
8   4          4            50
The analyze threshold is base threshold + scale factor * number of tuples. The problem with this is that once an audit has finished (after a few months), we'll never receive new surveys or responses for that category of data. The new data that comes in is conceptually all that needs to be analyzed. All data is queried, but the new data has the most traffic.
If 10% is the ideal scale factor for today and analyze autoruns once every week, a couple years from now analyze may autorun once every 4 months due to the number of tuples. This is problematic when the past 3 months of data is for questions that the analyzer has never seen and so there are no helpful stats for the query planner on this data.
We could set the scale factor extremely low for this table, but that seems like a cheap solution that could cause issues in the future.
If you have a constant data modification rate, setting autovacuum_analyze_scale_factor to 0 for that table and relying only on autovacuum_analyze_threshold is a good idea.
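The arithmetic behind this can be sketched in a few lines of Python. This is a rough model, not a simulation of autovacuum: it assumes a constant, hypothetical rate of 10,000 changed rows per week and ignores table growth between two analyze runs. The function names are made up for illustration; the defaults 50 and 0.1 are the PostgreSQL defaults for autovacuum_analyze_threshold and autovacuum_analyze_scale_factor.

```python
def analyze_trigger(n_tuples, base_threshold=50, scale_factor=0.1):
    """Number of changed rows needed before autoanalyze fires."""
    return base_threshold + scale_factor * n_tuples


def weeks_between_analyzes(n_tuples, rows_per_week,
                           base_threshold=50, scale_factor=0.1):
    """Approximate gap between analyze runs at a constant change rate."""
    return analyze_trigger(n_tuples, base_threshold, scale_factor) / rows_per_week


ROWS_PER_WEEK = 10_000  # hypothetical, constant modification rate

# With a 10% scale factor, the gap grows with the table.
for n_tuples in (100_000, 1_700_000):
    print(n_tuples, round(weeks_between_analyzes(n_tuples, ROWS_PER_WEEK), 1))

# With scale factor 0 and a fixed threshold, the gap no longer depends on table size.
print(round(weeks_between_analyzes(1_700_000, ROWS_PER_WEEK,
                                   base_threshold=10_000, scale_factor=0.0), 1))
```

Under these assumptions a 100,000-row table is analyzed roughly weekly and a 1.7-million-row table about every 17 weeks, while the fixed-threshold setting keeps the interval at about one week regardless of size.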

Grouping multiple values

Grouping multiple values on Details section
I have got an output from SQL query:
ID Value
1 1
1 3
1 5
1 7
1 9
2 1
2 4
3 1
3 2
3 3
On each page I want to show one ID and the whole list of values assigned to that ID. The next page should show the next ID and its values.
As you can see, ID 1 has 5 values, ID 2 has only 2 values, and ID 3 has 3 values. In other words, the number of values per ID can differ.
I don't know what this kind of grouping is called. If someone can name it, I will be able to search the Internet for a solution.
If someone knows how to do this and shares the knowledge, I will really appreciate it.
Best regards,
Volcano
You should add a group (Insert Group) for ID and put Value in the detail section. Make sure to start each group on a new page: open the Section Expert for your group header or footer, then tick New Page Before or New Page After.

SPSS - Create dummy for top volume months within customer grouping

I need to create a dummy for the top purchase months within each customer ID. That is, if a month is one of the four months within the year in which the customer purchased the most, it is marked with the number 1, otherwise 0.
Example of data, cust id, order date, volume and new variable dummy:
This code creates some sample data:
data list free/ID volume (2f4).
begin data
1 100 1 500 2 1 2 2 2 3 2 90 1 600 1 90 1 870 2 9 2 8 2 10
end data.
Using the sample data in the question, this code will create a new variable containing the dummy according to your definition:
RANK VARIABLES=volume (D) BY ID /RANK.
compute high4=(Rvolume<=4).
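For readers who want to check the expected result outside SPSS, here is a minimal Python sketch of the same idea: rank the rows by volume within each ID, highest first, and flag the top four. The function name top_n_dummy is hypothetical, and the rows are the sample data above. Unlike a rank-based cutoff, this sketch always flags exactly n rows per ID (or fewer if the ID has fewer rows), so ties at the boundary are broken by input order.

```python
from collections import defaultdict


def top_n_dummy(rows, n=4):
    """rows: list of (customer_id, volume). Returns 0/1 flags aligned with rows."""
    by_id = defaultdict(list)
    for idx, (cust_id, volume) in enumerate(rows):
        by_id[cust_id].append((volume, idx))

    flags = [0] * len(rows)
    for members in by_id.values():
        # Sort by volume, highest first, and flag the first n rows.
        ranked = sorted(members, key=lambda m: m[0], reverse=True)
        for _, idx in ranked[:n]:
            flags[idx] = 1
    return flags


# Sample data from the DATA LIST block: (ID, volume) pairs in input order.
rows = [(1, 100), (1, 500), (2, 1), (2, 2), (2, 3), (2, 90),
        (1, 600), (1, 90), (1, 870), (2, 9), (2, 8), (2, 10)]
flags = top_n_dummy(rows)
for (cust_id, volume), flag in zip(rows, flags):
    print(cust_id, volume, flag)
```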

"Inserting" Records into Fields from a Database Feed

So the background to this is I'm trying to create a survival curve based on a database feed from the directions here.
What I have so far is three calculated fields per below. Patient ID is not a calculated field or necessary for the survival analysis, but I believe it could be useful for this question. For reference, there are about 20,000 unique patients.
Patient ID | Time | Censor | Group
Id1        | 3    | 0      | 1
Id2        | 8    | 0      | 2
Id3        | 1    | 1      | 1
Id4        | 3    | 1      | 1
Id5        | 11   | 0      | 1
Id5        | 7    | 1      | 2
What I would like to do is insert two records (one for each group), like this:
Patient ID | Time | Censor | Group | Link
           | 0    |        | 1     |
           | 0    |        | 2     |
Id1        | 3    | 0      | 1     | link
Id2        | 8    | 0      | 2     | link
Id3        | 1    | 1      | 1     | link
Id4        | 3    | 1      | 1     | link
Id5        | 11   | 0      | 1     | link
Id5        | 7    | 1      | 2     | link
I tried, unsuccessfully, to create an Excel spreadsheet with these base attributes to union with the columns; however, an Excel spreadsheet does not appear to be able to union with a database.
My next idea is to find 2 of the 20,000 patients where I can create a calculated field along these lines (not sure this is feasible in Tableau, please excuse my syntax):
IF [Patient ID] = Id3 THEN [TIME] = 0 AND [CENSOR] IS NULL
END
and then a [Link] calculated formula:
IF [Patient ID] = Id3 THEN NULL
ELSE "link"
END
Any help would be appreciated. Would like to avoid inserting these records in the database.
The best and easiest option is to use an outer join to your Excel workbook. This is a new feature in Tableau version 10 (cross-database joins).
Then, once the dataset is combined, you can build business logic through a filter or calculated field based on the absence or presence of the Excel data.
http://www.tableau.com/about/blog/2016/7/integrate-your-data-cross-database-joins-56724

Calculating change in leaders for baseball stats in MSSQL

Imagine I have an MSSQL 2005 table (bbstats) that updates weekly, showing various cumulative categories of baseball accomplishments for a team
week 1
Player  H   SO  HR
Sammy   7   11  2
Ted     14  3   0
Arthur  2   15  0
Zach    9   14  3
week 2
Player  H   SO  HR
Sammy   12  16  4
Ted     21  7   1
Arthur  3   18  0
Zach    12  18  3
I wish to highlight textually where there has been a change in leader for each category, so after week 2 there would be nothing to report on hits (H); Zach has joined Arthur with the most strikeouts (SO) at 18; and Sammy is the new leader in home runs (HR) with 4
So I would want to set up a process something like
a) save the past data (week 1) as table bbstatsPrior,
b) update bbstats with the new results (I do not need assistance with this),
c) compare, between the two tables, the player (or players, with ties) with the max value for each column, and output only where they differ,
d) move on to the next column and repeat
In any real world example there would be significantly more columns to calculate for
Thanks
Responding to Brent's comments: I am really after any changes in the leaders for each category
So I would have something like
select top 1 with ties player
from bbstatsPrior
order by H desc
and
select top 1 with ties player,H
from bbstats
order by H desc
I then want to compare the player from each query (do I need temp tables?). If they differ, I want to output the result of the second select statement. For the H category Ted is the leader in both tables, but for other categories there are changes between the weeks
I can then loop through the columns using
select name from sys.all_columns sc
where sc.object_id=object_id('bbstats') and name <>'player'
If the number of stats doesn't change often, you could easily just write a single query to get this data. Join bbStats to bbStatsPrior where bbstatsprior.week < bbstats.week and bbstats.week=#weekNumber. Then just do a simple comparison of bbstats.Hits with bbstatsPrior.Hits to get your difference.
If the stats change often, you could use dynamic SQL to do this for all columns that match a certain pattern, or that are in a list of columns built from sys.columns for that table.
You could add a column for each stat column to designate the leader, using a correlated subquery to find the max value for that column and checking whether it equals the current record's value.
This might get you started, but I'd recommend posting what you currently have to achieve this, and the community can help you from there.
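To make the expected output concrete, here is a minimal Python sketch of the comparison described in the question: for each category, find the leader or leaders (with ties) in the prior and current data and report only the categories where the set of leaders changed. It is plain Python rather than T-SQL, the function names leaders and leader_changes are hypothetical, and the dictionaries hold the week 1 and week 2 figures from the question.

```python
def leaders(stats, category):
    """Return (best value, set of players holding it) for one category."""
    best = max(row[category] for row in stats.values())
    return best, {player for player, row in stats.items() if row[category] == best}


def leader_changes(prior, current):
    """Map each category whose leaders changed to (new leaders, new best value)."""
    changes = {}
    categories = next(iter(current.values())).keys()
    for category in categories:
        _, old_leaders = leaders(prior, category)
        value, new_leaders = leaders(current, category)
        if old_leaders != new_leaders:
            changes[category] = (sorted(new_leaders), value)
    return changes


week1 = {
    "Sammy": {"H": 7, "SO": 11, "HR": 2},
    "Ted": {"H": 14, "SO": 3, "HR": 0},
    "Arthur": {"H": 2, "SO": 15, "HR": 0},
    "Zach": {"H": 9, "SO": 14, "HR": 3},
}
week2 = {
    "Sammy": {"H": 12, "SO": 16, "HR": 4},
    "Ted": {"H": 21, "SO": 7, "HR": 1},
    "Arthur": {"H": 3, "SO": 18, "HR": 0},
    "Zach": {"H": 12, "SO": 18, "HR": 3},
}
changes = leader_changes(week1, week2)
print(changes)
```

For the sample data this reports SO (Arthur and Zach tied at 18) and HR (Sammy with 4), and nothing for H, which matches the result the question describes.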