I'm currently working on a web application (Node.js) that has bar graphs showing the number of sales made per month.
I have a table that has id, job_name, date_completed, and status columns (the first two columns seem to be irrelevant for this question).
I want to be able to query the number of completed jobs per month between two values (e.g. 2020-01-01 [January 2020] to 2020-12-01 [Dec. 2020]), with the condition that it will only count those with status value '1' (1 = complete, 0 = ongoing) with the query producing this table:
Months   | Jobs Completed
---------+----------------
Jan 2020 | 3
Feb 2020 | 1
Mar 2020 | 5
...
Dec 2020 | 3
I'll be turning the data into JSON using node.js / pg-node.
I've looked at generate_series and row_to_json / json_agg, but I'm stumped on how to make this query work.
Thank you!
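SELECT
    -- date_trunc groups by calendar month; to_char formats it as 'Jan 2020'
    to_char(date_trunc('month', tbl.date_completed), 'Mon YYYY') AS month,
    COUNT(*) AS jobs
FROM yourTable tbl
WHERE
    -- half-open range so that all of December 2020 is included
    tbl.date_completed >= DATE '2020-01-01'
    AND tbl.date_completed < DATE '2021-01-01'
    AND tbl.status = 1
GROUP BY date_trunc('month', tbl.date_completed)
ORDER BY date_trunc('month', tbl.date_completed);
This should work, but I have not tested it.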
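A grouped query like the one above only returns months that have at least one matching row, which is presumably why generate_series came up in the question. To make the zero-filling logic concrete, here is a minimal Python sketch of the same idea outside the database. The sample rows and the helper names `month_range` and `jobs_per_month` are made up for illustration, and this is not meant to replace doing the work in SQL.

```python
import json
from collections import Counter
from datetime import date

MONTH_NAMES = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def month_range(start, end):
    """Yield (year, month) for every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def jobs_per_month(rows, start, end):
    """Count rows with status == 1 per month; months without jobs get 0."""
    counts = Counter(
        (completed.year, completed.month)
        for completed, status in rows
        if status == 1
    )
    return [
        {"month": f"{MONTH_NAMES[m - 1]} {y}", "jobs": counts.get((y, m), 0)}
        for y, m in month_range(start, end)
    ]


# Hypothetical (date_completed, status) rows, not taken from the question.
rows = [
    (date(2020, 1, 5), 1),
    (date(2020, 1, 20), 1),
    (date(2020, 1, 25), 0),  # ongoing, so not counted
    (date(2020, 3, 2), 1),
]

result = jobs_per_month(rows, date(2020, 1, 1), date(2020, 3, 1))
print(json.dumps(result))
```

February shows up with a count of 0 here even though no row falls in it, which is the behaviour a plain GROUP BY does not give you.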
I am getting some statistics using a query like
SELECT date_trunc('month', created_at) AS time, count(DISTINCT "user_id") AS mau
FROM "session"
GROUP BY time
ORDER BY time;
This works fine if I want monthly active users for each calendar month. But I would like to shift the result to show the last X months counted back from today instead of actual calendar months. How do I do that? Can I add an offset in some way?
EDIT
As an example, I am currently getting results like
time | mau
2022-04-01 | 10
2022-05-01 | 20
2022-06-01 | 30
But I would like it to be something like (where 2022-06-07 is today)
time | mau
2022-04-07 | 10
2022-05-07 | 20
2022-06-07 | 30
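To pin down what "months counted back from today" could mean, here is a small Python sketch of the bucketing. It assumes each window is labelled by its end date and covers the month-long span (end minus one month, end]; that boundary choice, the sample sessions, and the helper names `shift_months` and `rolling_mau` are assumptions for illustration, not something stated in the question.

```python
import calendar
from datetime import date


def shift_months(d, months):
    """Move d by a number of months, clamping the day to the month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def rolling_mau(sessions, today, periods):
    """Count distinct users in month-long windows anchored on today's date.

    Each window is labelled by its end date and covers (end - 1 month, end].
    """
    result = []
    for back in range(periods - 1, -1, -1):
        end = shift_months(today, -back)
        start = shift_months(today, -back - 1)
        users = {user for created, user in sessions if start < created <= end}
        result.append((end, len(users)))
    return result


# Hypothetical (created_at, user_id) rows.
sessions = [
    (date(2022, 4, 10), "u1"),
    (date(2022, 5, 6), "u1"),
    (date(2022, 5, 7), "u2"),
    (date(2022, 5, 8), "u2"),
    (date(2022, 6, 1), "u3"),
    (date(2022, 6, 7), "u3"),
]

mau = rolling_mau(sessions, date(2022, 6, 7), periods=2)
for end, count in mau:
    print(end, count)
```

With today set to 2022-06-07, the windows end on 2022-05-07 and 2022-06-07, matching the shape of the desired output above.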
I called dropDuplicates on a dataframe with the subset Region, store, and id.
The dataframe contains some other columns such as latitude, longitude, address, Zip, Year, Month...
When I count the derived dataframe, I get a constant value.
But when I count the rows for a selected year, say 2018, df.count() returns a different value on each run.
Could anyone please explain why this is happening?
Df.dropDuplicates("region","store","id")
Df.createOrReplaceTempView(Df)
spark.sql("select * from Df").count()
returns the same value whenever I run it.
But if I add a where clause on Year or Month, I get varying counts.
E.g.:
spark.sql("select * from Df where Year =2018").count()
This statement gives a different value on each execution.
Intermediate output
Region  store  objectnr  latitude  longitude  newid  month  year  uid
Abc     20     4572      46.6383   8.7383     1      4      2018  0
Sgs     21     1425      47.783    6.7282     2      5      2019  1
Efg     26     1277      48.8293   8.2727     3      7      2019  2
Output
Region  store  objectnr  latitude  longitude  newid  month  year  uid
Abc     20     4572      46.6383   8.7383     1277   4      2018  0
Sgs     21     1425      47.783    6.7282     1425   5      2019  1
Efg     26     1277      48.8293   8.2727     1277   7      2019  2
So here newid gets the value of objectnr.
When newid comes out the same, I need to assign the latest objectnr to newid, considering the year and month.
The line
Df.dropDuplicates("region","store","id")
creates a new Dataframe and it is not modifying the existing one. Dataframes are immutable.
To solve your issue you need to save the output of the dropDuplicates statement into a new Dataframe as shown below:
val Df2 = Df.dropDuplicates("region","store","id")
Df2.createOrReplaceTempView("Df2")
spark.sql("select * from Df2").count()
In addition, you may get different counts when applying the filter Year = 2018 because the Year column is not one of the three columns you used to drop the duplicates. Apparently you have data in your Dataframe that shares the same values in those three columns but differs in Year. Dropping duplicates is not a deterministic process, as it depends on the ordering of your data, which can vary on every run of your code.
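The effect of ordering can be reproduced without Spark. The sketch below is a plain-Python stand-in, not Spark's actual implementation: the hypothetical `drop_duplicates` helper keeps the first row it sees per key, and the three sample rows are invented. It only shows that the total stays fixed while the Year = 2018 count depends on which duplicate happens to survive.

```python
def drop_duplicates(rows, keys):
    """Keep the first row seen for each key combination."""
    seen = {}
    for row in rows:
        seen.setdefault(tuple(row[k] for k in keys), row)
    return list(seen.values())


# Hypothetical rows: the first two share region/store/id but differ in Year.
rows = [
    {"region": "Abc", "store": 20, "id": 1, "Year": 2018},
    {"region": "Abc", "store": 20, "id": 1, "Year": 2019},
    {"region": "Sgs", "store": 21, "id": 2, "Year": 2018},
]
keys = ("region", "store", "id")


def count_2018(ordered_rows):
    """Count the deduplicated rows whose Year is 2018."""
    deduped = drop_duplicates(ordered_rows, keys)
    return sum(1 for r in deduped if r["Year"] == 2018)


# Same data, two different orderings.
print(len(drop_duplicates(rows, keys)), count_2018(rows))              # 2 2
print(len(drop_duplicates(rows[::-1], keys)), count_2018(rows[::-1]))  # 2 1
```

Both orderings leave two rows after deduplication, but the number of 2018 rows changes from 2 to 1.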
I have a table containing courses run by teachers. I want to get the number of taught days, split by year and by teacher status.
The table contains the following fields:
id teacher_id course_name course_date course_duration teacher_status
--------------------------------------------------------------------------
1 Teacher_01 Course_AA 2012-02-01 2 volunteer
2 Teacher_02 Course_BB 2012-02-01 7 employee
3 Teacher_03 Course_BB 2013-02-01 7 contractor
4 Teacher_01 Course_AA 2014-02-01 2 paid volunteer
5 Teacher_04 Course_AA 2014-06-01 2 paid volunteer
Teachers may run a course under various statuses: volunteer, paid volunteer, contractor, employee, etc. The status of a given teacher can change through time. The duration of a course is expressed in days.
I can already gather the sum of taught days by teachers, split by status. This is done by
SELECT
teacher_status,
sum(course_duration) AS "Taught days"
FROM
my_table
GROUP BY
teacher_status
;
But data is not normalized and different families of statuses have been mixed. So I want to gather the same info (number of taught days) split:
by 3 statuses: volunteer, paid volunteer, all other statuses,
and by years.
What is expected is:
Year Teacher_status Taught_days
---------------------------------------
2012 volunteer 2
2012 employee 7
2013 contractor 7
2014 paid volunteer 4
I've tried various combinations of aggregate functions, GROUP BY / HAVING / ROLLUP statements but without success. How should I achieve this?
You'll want to select a complex expression and then GROUP BY that, not just by a raw column value. You could either repeat the expression or, in Postgres, also refer to the column alias:
SELECT
EXTRACT(year FROM course_date) as year,
(CASE teacher_status
WHEN 'volunteer' THEN 'volunteer'
WHEN 'paid volunteer' THEN 'paid'
ELSE 'other'
END) AS status,
SUM(course_duration) AS "Taught days"
FROM
my_table
GROUP BY
year,
status;
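As a cross-check of what grouping by a computed expression produces, here is the same aggregation as a short Python sketch over the five rows from the question. The `bucket` function is a hypothetical name that mirrors the CASE expression; nothing here is needed to run the SQL.

```python
from collections import defaultdict
from datetime import date

# Rows from the question's table: (course_date, course_duration, teacher_status)
courses = [
    (date(2012, 2, 1), 2, "volunteer"),
    (date(2012, 2, 1), 7, "employee"),
    (date(2013, 2, 1), 7, "contractor"),
    (date(2014, 2, 1), 2, "paid volunteer"),
    (date(2014, 6, 1), 2, "paid volunteer"),
]


def bucket(status):
    """Mirror the CASE expression: two named statuses, the rest is 'other'."""
    if status == "volunteer":
        return "volunteer"
    if status == "paid volunteer":
        return "paid"
    return "other"


# Group by the pair (year, bucketed status) and sum the durations.
taught_days = defaultdict(int)
for course_date, duration, status in courses:
    taught_days[(course_date.year, bucket(status))] += duration

for (year, status), days in sorted(taught_days.items()):
    print(year, status, days)
```

For the sample data this prints one line per group: 2012 other 7, 2012 volunteer 2, 2013 other 7, 2014 paid 4.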
To get your example result, I have this query
SELECT extract (year from course_date),
teacher_status,
sum(course_duration) AS "Taught days"
FROM
my_table
GROUP BY
extract (year from course_date),
teacher_status;
Issue:
I need to show RUNNING DISTINCT users per 3-month interval^^ (see the goal table for reference). However, “COUNTD” does not help, even with a table calculation or the “WINDOW_COUNT” or “WINDOW_SUM” functions.
^^RUNNING DISTINCT users means the DISTINCT users in a period of time (Jan - Mar, Feb - Apr, etc.). The COUNTD option only counts DISTINCT users within a single window. The calculation should slide over a 3-month window to find the DISTINCT users.
Original Table
Date Username
1/1/2016 A
1/1/2016 B
1/2/2016 C
2/1/2016 A
2/1/2016 B
2/2/2016 B
3/1/2016 B
3/1/2016 C
3/2/2016 D
4/1/2016 A
4/1/2016 C
4/2/2016 D
4/3/2016 F
5/1/2016 D
5/2/2016 F
6/1/2016 D
6/2/2016 F
6/3/2016 G
6/4/2016 H
Goal Table
Tried Methods:
Step-by-step:
I tried to break the problem into steps, but due to the columnar nature of Tableau, I cannot successfully run COUNT or SUM (or any aggregate) on the LAST STEP of the solution.
STEP 0 Raw Data
This table shows the structure of the data as it is in the original table.
STEP 1 COUNT usernames by MONTH
This table shows the count of users by month. You will notice that because user B had 2 entries, he is counted twice. In the next step we use DISTINCT COUNT to fix this issue.
STEP 2 DISTINCT COUNT by MONTH
Now we can see who was present in each month. The next step is to see the running DISTINCT COUNT by MONTH over 3 months.
STEP 3 RUNNING DISTINCT COUNT for 3 months
Now we can see the SUM of the DISTINCT COUNT of usernames over a running 3 months. If you change the MONTH INTERVAL from 3 to 1, you get the STEP 2 table.
LAST STEP Issue Step
GOAL: Need the GRAND TOTAL to be the SUM of MONTH column.
Request:
I want to calculate the SUM of '1' by MONTH. However, I am using a WINDOW function and aggregating the data, which gives me an error.
WHAT I NEED
Jan Feb March April May Jun
3 3 4 5 5 6
WHAT I GOT
Jan Feb March April May Jun
1 1 1 1 1 1
My Output after tried methods: Attached twbx file. DISTINCT_count_running_v1
HELP taken:
https://community.tableau.com/thread/119179 ; Tried this method but stuck at last step
https://community.tableau.com/thread/122852 ; Used some parts of this solution
The way I approached the problem was identifying the minimum login date for each user and then using that date to count the distinct number of users. For example, I have data in this format. I created a calculated field called Min User Login Date as { FIXED [User]:MIN([Date])} and then did a CNTD(USER) on Min User Login Date to get the unique user count by date. If you want running total, then you can do quick table calculation on Running Total on CNTD(USER) field.
You need to put Month(date) and count(username) in the columns; then you will get the result you expect.
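For reference, the numbers under WHAT I NEED in the question can be reproduced outside Tableau. The following Python sketch is not a Tableau calculation; it only spells out the "running distinct" definition from the question, assuming the dates in the Original Table are month/day/year. The function name `rolling_distinct` is made up for the example.

```python
from datetime import date

# (Date, Username) rows from the question's Original Table, read as m/d/yyyy.
logins = [
    (date(2016, 1, 1), "A"), (date(2016, 1, 1), "B"), (date(2016, 1, 2), "C"),
    (date(2016, 2, 1), "A"), (date(2016, 2, 1), "B"), (date(2016, 2, 2), "B"),
    (date(2016, 3, 1), "B"), (date(2016, 3, 1), "C"), (date(2016, 3, 2), "D"),
    (date(2016, 4, 1), "A"), (date(2016, 4, 1), "C"), (date(2016, 4, 2), "D"),
    (date(2016, 4, 3), "F"),
    (date(2016, 5, 1), "D"), (date(2016, 5, 2), "F"),
    (date(2016, 6, 1), "D"), (date(2016, 6, 2), "F"), (date(2016, 6, 3), "G"),
    (date(2016, 6, 4), "H"),
]


def rolling_distinct(rows, window=3):
    """Distinct users over each month plus the (window - 1) months before it."""
    users_by_month = {}
    for day, user in rows:
        index = day.year * 12 + day.month  # consecutive months differ by 1
        users_by_month.setdefault(index, set()).add(user)

    result = {}
    for index in sorted(users_by_month):
        seen = set()
        for offset in range(window):
            seen |= users_by_month.get(index - offset, set())
        year, month = divmod(index - 1, 12)
        result[(year, month + 1)] = len(seen)
    return result


counts = rolling_distinct(logins)
print(list(counts.values()))  # [3, 3, 4, 5, 5, 6]
```

Setting `window=1` gives the plain distinct count per month, which corresponds to STEP 2 in the question.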
In SQL Task Editor I have the following Query
DELETE FROM
[TICKETS_DATA]
where BILLING_TICKETS_DATA_Time_ID <
(SELECT TIME_ID
FROM [TIME]
WHERE (TIME_Year = ?) AND (TIME_Month = ?)) - 1
Each row of the TIME table has a TIME_ID with the relevant Month and Year.
I have two variables, Time_Month (Int32) and Time_Year (Int32), e.g. 08 and 2012 respectively.
I want to pick up the current TIME_ID and use it in the query above in the SQL Task Editor.
Currently the table stores 1 month of data, and I now want to store 3 months of data.
Kindly assist me with the parameter mapping and with how to pass the variables into the SQL command query.
As long as the Time_id in the table is a numeric value with, as stated, one record per year/month combination, and the values increase sequentially by one in date order (i.e. 2000 01 has time_id 1, 2000 02 has time_id 2, and 2001 01 has time_id 13), you can just change the -1 to -3 to delete records from your table that are older than three months. Bear in mind that since this was probably run last month, you will have two months in the table on the first run after this change, and it will delete 0 records. On the next run you will have three months, and it will delete 0 records again. On the third run (assuming it is only run once a month) you will have four months of data, and it will delete the records from 4 months prior to that date.
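The run-by-run behaviour described above can be checked with a small simulation. This is only a sketch of the cutoff arithmetic in Python, not SSIS code: the starting time_ids 10 and 11, the `cleanup` helper, and the assumption that each month's data is loaded after the cleanup runs are all made up for illustration.

```python
def cleanup(stored_ids, current_id, keep_back):
    """Apply the DELETE condition: drop time_ids below current_id - keep_back."""
    cutoff = current_id - keep_back
    deleted = sorted(t for t in stored_ids if t < cutoff)
    kept = sorted(t for t in stored_ids if t >= cutoff)
    return kept, deleted


# Hypothetical starting point: the table holds time_ids 10 and 11,
# and each monthly run happens before that month's data is loaded.
stored = [10, 11]
history = []
for current_id in (12, 13, 14):
    stored, deleted = cleanup(stored, current_id, keep_back=3)
    history.append((current_id, deleted))
    stored.append(current_id)  # this month's data arrives after the cleanup

print(history)  # [(12, []), (13, []), (14, [10])]
```

The first two runs delete nothing, and the third run removes time_id 10, which is four months before the current time_id 14.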