Can MongoDB's aggregation pipeline or MapReduce functionality query many documents, each with a timestamp, and group those whose timestamps are within a set range of each other, essentially "clumping" them logically?
For example, if I have 6 documents with the following timestamps:
2017-10-15 11:25:00
2017-10-15 11:28:00
2017-10-15 14:59:00
2017-10-15 15:01:00
2017-10-15 15:06:00
2017-10-15 15:13:00
And my goal is to group them into the following documents:
2017-10-15 11:28:00 (2 events)
2017-10-15 15:06:00 (3 events)
2017-10-15 15:13:00 (1 event)
What I'm logically doing is saying "Group all documents timestamped within 5 minutes of each other".
It's important to note that I'm not just saying "group documents by every five minutes", because if that were the case (assuming the five-minute buckets started on the hour), documents 3 and 4 would not end up in the same group.
The other tricky thing is handling documents 3 and 4 in relation to 5. There are more than 5 minutes between documents 3 and 5, but since there are no more than 5 minutes between 3 and 4, and between 4 and 5, they'd all get grouped.
(I think it would be OK to get this working without that caveat, where 5 ends up in a group separate from 3 and 4 because the group starts at 3 and runs for a fixed 5 minutes, excluding 5. But it would be awesome if the group could be extended like that.)
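One way to get this kind of gap-based clumping is to sort by timestamp and start a new group whenever the gap from the previous document exceeds 5 minutes. A minimal client-side sketch with pymongo, assuming a collection named events and a ts field holding the timestamp (both names are made up here); newer servers also offer $setWindowFields for computing the gap to the previous document server-side, but the client-side version keeps the idea simple:
from datetime import timedelta
from pymongo import MongoClient

coll = MongoClient()["mydb"]["events"]         # hypothetical db/collection names
GAP = timedelta(minutes=5)

groups = []                                    # each group is a list of documents
last_ts = None
for doc in coll.find().sort("ts", 1):          # walk the documents in timestamp order
    if last_ts is None or doc["ts"] - last_ts > GAP:
        groups.append([])                      # gap too big: start a new clump
    groups[-1].append(doc)
    last_ts = doc["ts"]

for g in groups:
    print(g[-1]["ts"], len(g))                 # e.g. 2017-10-15 15:06:00 (3 events)
Because each document is compared to the one immediately before it, the 14:59 / 15:01 / 15:06 chain lands in a single clump even though its first and last entries are more than 5 minutes apart, which matches the extended-group behaviour described above.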
I'm very new to pyspark. Here is the situation: I created a dataframe for each file (9 in total; each file represents the counts for one month), and then I need to union them all into one big df. The thing is, I need it to come out with each month as its own separate column, like this:
name_id | 2020_01 | 2020_02 | 2020_03
1       | 23      | 43534   | 3455
2       | 12      | 34534   | 34534
3       | 2352    | 32525   | 23
However, with my current code, it puts all the months under the same column. I've been searching the internet for a long time but could not find anything that solves this (maybe I need a groupBy, but I'm not sure how to do that). Below is my code. Thanks!
df1=spark.read.format("parquet").load("dbfs:")
df2=spark.read.format("parquet").load("dbfs:")
df3=spark.read.format("parquet").load("dbfs:")
df4=spark.read.format("parquet").load("dbfs:")
df5=spark.read.format("parquet").load("dbfs:")
df6=spark.read.format("parquet").load("dbfs:")
df7=spark.read.format("parquet").load("dbfs:")
df8=spark.read.format("parquet").load("dbfs:")
df9=spark.read.format("parquet").load("dbfs:")
#union all 9 files
union_all=df1.unionAll(df2).unionAll(df3).unionAll(df4).unionAll(df5).unionAll(df6).unionAll(df7).unionAll(df8).unionAll(df9)
Here is the current output:
name_id | count | date
1       | 23    | 2020_01
2       | 12    | 2020_01
1       | 43534 | 2020_02
2       | 34534 | 2020_02
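A pivot after the union should give each month its own column. A minimal sketch, assuming the unioned frame has the name_id, count and date columns shown above; first is used as the aggregate because each name_id/date pair appears once (use sum if duplicates are possible):
from functools import reduce
from pyspark.sql import functions as F

# union the 9 monthly frames (unionByName is safer if column order ever differs)
union_all = reduce(lambda a, b: a.unionByName(b),
                   [df1, df2, df3, df4, df5, df6, df7, df8, df9])

# turn each distinct date value (2020_01, 2020_02, ...) into its own column
wide = union_all.groupBy("name_id").pivot("date").agg(F.first("count"))
wide.orderBy("name_id").show()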
In MongoDB I have documents like the ones below (I crossed out the names of the calls for confidentiality).
Now I need to build a query that returns results grouped by the name of the call, and for each type of call I need the number of calls by month, day, and hour. The query also needs to accept a range between two dates (including time).
In SQL Server this is done with window functions (partitions) combined with aggregations, but how can I do the same in Mongo?
I am using MongoDB Compass as my Mongo client.
I need to obtain something like this:
call name    month    day  hour  #ByMonth  #ByDay  #ByHour
GetEmployee  January  1    14    10        6       1
GetEmployee  January  1    18    10        6       5
GetEmployee  January  3    12    10        4       4
GetEmployee  March    5    20    8         8       8
GetEmployee  April    12   17    45        35      35
GetEmployee  April    20   10    45        10      10
For example, for the GetEmployee call the distribution is as below:
10 calls done in January
8 calls done in March
45 calls done in April
For January, the 10 calls are distributed as follows:
6 calls done on 1st January (distributed as 1 call at 14h and 5 calls at 18h)
4 calls done on 3rd January (all 4 done at 12h)
and so on for the rest of the months.
For example, in SQL Server, if I have the below table:
processName initDateTime
processA 2020-06-15 13:31:15.330
processB 2020-06-20 10:00:30.000
processA 2020-06-20 13:31:15.330
...
and so on
The SQL query is:
select
processName,
month(initDateTime),
day(initDateTime),
datepart(hour, initDateTime),
sum(count(*)) over(partition by processName, year(initDateTime), month(initDateTime)) byMonth,
sum(count(*)) over(partition by processName, year(initDateTime), month(initDateTime), day(initDateTime)) byDay,
count(*) byHour
from mytable
group by
processName,
year(initDateTime),
month(initDateTime),
day(initDateTime),
datepart(hour, initDateTime)
So how can I do the same in Mongo? The processName and initDateTime fields above correspond to the "call" and "created" attributes respectively in MongoDB.
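One way to mimic those window-function totals in MongoDB is a chain of $group stages: count at the hour grain, roll the hourly buckets up to days and then months (keeping the finer rows in arrays), and unwind back to one row per call/month/day/hour. A sketch with pymongo, assuming the fields are call and created (a BSON date); the database and collection names here are made up, and the date range is only an example:
from datetime import datetime
from pymongo import MongoClient

coll = MongoClient()["mydb"]["calls"]          # hypothetical db/collection names
start, end = datetime(2020, 1, 1), datetime(2020, 12, 31, 23, 59, 59)

pipeline = [
    {"$match": {"created": {"$gte": start, "$lte": end}}},
    # count at the finest grain: call + year/month/day/hour  -> byHour
    {"$group": {
        "_id": {"call": "$call",
                "year": {"$year": "$created"}, "month": {"$month": "$created"},
                "day": {"$dayOfMonth": "$created"}, "hour": {"$hour": "$created"}},
        "byHour": {"$sum": 1}}},
    # roll hours up to days, keeping the hour rows            -> byDay
    {"$group": {
        "_id": {"call": "$_id.call", "year": "$_id.year",
                "month": "$_id.month", "day": "$_id.day"},
        "byDay": {"$sum": "$byHour"},
        "hours": {"$push": {"hour": "$_id.hour", "byHour": "$byHour"}}}},
    # roll days up to months, keeping the day rows            -> byMonth
    {"$group": {
        "_id": {"call": "$_id.call", "year": "$_id.year", "month": "$_id.month"},
        "byMonth": {"$sum": "$byDay"},
        "days": {"$push": {"day": "$_id.day", "byDay": "$byDay", "hours": "$hours"}}}},
    # flatten back to one row per call/month/day/hour
    {"$unwind": "$days"},
    {"$unwind": "$days.hours"},
    {"$project": {"_id": 0, "call": "$_id.call", "month": "$_id.month",
                  "day": "$days.day", "hour": "$days.hours.hour",
                  "byMonth": 1, "byDay": "$days.byDay",
                  "byHour": "$days.hours.byHour"}},
]
for row in coll.aggregate(pipeline):
    print(row)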
I am using a KSQL stream and counting the events coming in every 5 minutes. Here is my query:
select count(*), created_on_date from TABLE_NAME window tumbling (size 5 minutes) group by created_on_date;
This produces the following results:
2 | 2018-11-13 09:54:50
3 | 2018-11-13 09:54:49
3 | 2018-11-13 09:54:52
3 | 2018-11-13 09:54:51
3 | 2018-11-13 09:54:50
The query without window tumbling:
select count(*), created_on_date from OP_UPDATE_ONLY group by created_on_date;
Result:
1 | 2018-11-13 09:55:08
2 | 2018-11-13 09:55:09
1 | 2018-11-13 09:55:10
3 | 2018-11-13 09:55:09
4 | 2018-11-13 09:55:12
Both queries return the same results, so what difference does the tumbling window make?
The tumbling window is a rolling aggregation and counts the number of events based on a key within a given window of time. The window of time is based on the timestamp of your stream, inherited from your Kafka message by default but overrideable by WITH (TIMESTAMP='my_column'). So you could pass created_on_date as the timestamp column and then aggregate by the values there.
The second query aggregates over the entire stream of messages. Since you happen to have a timestamp in the message itself, grouping by it gives the illusion of a time-based aggregation. However, if you wanted to find out, for example, how many events occurred within an hour, it would be no use: you can only count at the grain of created_on_date.
So the first example, with a window, is usually the correct way to do it because you usually want to answer a business question about an aggregation within a given time period, not over the course of an arbitrary stream of data.
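To make the keying difference concrete, here is a toy Python illustration (not ksqlDB code, and the timestamps are made up): with the tumbling window the aggregation key is effectively (window start, created_on_date), and counts reset each window; without it the key is just created_on_date and counts accumulate over the whole stream. Because every created_on_date value in the output above falls inside a single 5-minute window, the two keyings collapse to the same counts, which is why the results look identical here.
from collections import Counter

WINDOW = 5 * 60                                  # 5 minutes, in seconds
events = [100, 100, 101, 105, 400, 400]          # hypothetical event times (epoch seconds)

# windowed: key = (window start, timestamp)
windowed = Counter(((ts // WINDOW) * WINDOW, ts) for ts in events)

# unwindowed: key = timestamp only
unwindowed = Counter(events)

print(windowed)      # (0, 100): 2, (0, 101): 1, (0, 105): 1, (300, 400): 2
print(unwindowed)    # 100: 2, 101: 1, 105: 1, 400: 2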
Issue:
I need to show RUNNING DISTINCT users per 3-month interval^^ (see the goal table for reference). However, COUNTD does not help, even as a table calculation, and neither do the WINDOW_COUNT or WINDOW_SUM functions.
^^RUNNING DISTINCT users means the DISTINCT users over a period of time (Jan - Mar, Feb - Apr, etc.). COUNTD only counts the DISTINCT users within a single window; the calculation needs to slide over a 3-month window to find the DISTINCT users.
Original Table
Date Username
1/1/2016 A
1/1/2016 B
1/2/2016 C
2/1/2016 A
2/1/2016 B
2/2/2016 B
3/1/2016 B
3/1/2016 C
3/2/2016 D
4/1/2016 A
4/1/2016 C
4/2/2016 D
4/3/2016 F
5/1/2016 D
5/2/2016 F
6/1/2016 D
6/2/2016 F
6/3/2016 G
6/4/2016 H
Goal Table
Tried Methods:
Step-by-step:
I tried to break the problem into steps, but due to the columnar nature of Tableau, I cannot successfully run COUNT or SUM (or any aggregate command) on the LAST STEP of the solution.
STEP 0 Raw Data
This table shows the structure of the data, as it is in the original table.
STEP 1 COUNT usernames by MONTH
The table shows the count of users by month. You will notice that because user B had 2 entries, he is counted twice. In the next step we use a DISTINCT COUNT to fix this issue.
STEP 2 DISTINCT COUNT by MONTH
Now we can see who was present in each month; the next step is to see the running DISTINCT COUNT by MONTH over 3 months.
STEP 3 RUNNING DISTINCT COUNT for 3 months
Now we can see the SUM of the DISTINCT COUNT of usernames over a running 3 months. If you change the MONTH INTERVAL from 3 to 1, you get the STEP 2 table.
LAST STEP Issue Step
GOAL: Need the GRAND TOTAL to be the SUM of MONTH column.
Request:
I want to calculate the SUM of the '1's by MONTH. However, I am using a WINDOW function and aggregating the data, which gives me an error.
WHAT I NEED
Jan Feb March April May Jun
3 3 4 5 5 6
WHAT I GOT
Jan Feb March April May Jun
1 1 1 1 1 1
My output after the tried methods: attached twbx file, DISTINCT_count_running_v1.
HELP taken:
https://community.tableau.com/thread/119179 ; Tried this method but stuck at last step
https://community.tableau.com/thread/122852 ; Used some parts of this solution
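For reference, a small pandas sketch (outside Tableau, purely to pin down the target numbers) of the trailing 3-month distinct-user count; on the original table above it reproduces the 3 3 4 5 5 6 row from WHAT I NEED:
import pandas as pd

data = [("1/1/2016", "A"), ("1/1/2016", "B"), ("1/2/2016", "C"),
        ("2/1/2016", "A"), ("2/1/2016", "B"), ("2/2/2016", "B"),
        ("3/1/2016", "B"), ("3/1/2016", "C"), ("3/2/2016", "D"),
        ("4/1/2016", "A"), ("4/1/2016", "C"), ("4/2/2016", "D"), ("4/3/2016", "F"),
        ("5/1/2016", "D"), ("5/2/2016", "F"),
        ("6/1/2016", "D"), ("6/2/2016", "F"), ("6/3/2016", "G"), ("6/4/2016", "H")]
df = pd.DataFrame(data, columns=["Date", "Username"])
df["Month"] = pd.to_datetime(df["Date"]).dt.to_period("M")

for m in sorted(df["Month"].unique()):
    window = [m - 2, m - 1, m]                   # the month itself plus the two before it
    users = df.loc[df["Month"].isin(window), "Username"].nunique()
    print(m, users)                              # 2016-01 3 ... 2016-06 6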
The way I approached the problem was to identify the minimum login date for each user and then use that date to count the distinct number of users. For example, I have data in this format. I created a calculated field called Min User Login Date as { FIXED [User]:MIN([Date])} and then did a CNTD(USER) on Min User Login Date to get the unique user count by date. If you want a running total, you can apply the Running Total quick table calculation to the CNTD(USER) field.
You need to put Month(date) and count(username) in the columns, and then you will get the result you expect.
See the screenshot below.
Postgresql 8.4.
I'm new to this concept so if people could teach me I'd appreciate it.
For Obamacare, anyone that works 30 hours per week or more must be offered the same healthcare as is offered to any other worker. We can't afford that so we have to limit work hours for temp and part-timers. This is affecting the whole country.
I need to calculate the hours worked (it doesn't matter if it's overtime, regular time, double time, etc.) between two dates, say Jan 1, 2014 and Nov 1, 2014 (a Saturday), for each custom week (which begins on Sunday), not the week as defined by PostgreSQL (which begins on Monday).
Each of my custom work weeks begins on Sunday and ends on Saturday.
I don't know if I have to include weeks where they did not work at all in the average, but let's assume I do; zero hours that week would draw down the average.
The table name is 'employeetime', the date field is 'employeetime.stopdate', the hours worked per day are in the field 'employeetime.hours', and the employee id field is 'employeetime.empid'.
I'd prefer to do this in one query per employee, and I will execute the query once per employee as I loop through employees. If not, I'm open to suggestions, but I'd like to understand the SQL presented in the answer.
Currently EXTRACT(week from '2014-01-01') calculates the start of the week as a Monday, so that doesn't work for me.
How would I do that without doing, say a separate query for each week, per person? We have 200 people to process.
Thank you.
I have set up a table to match your format:
select * from employeetime order by date;
id date hours
1 2014-11-06 10
1 2014-11-07 3
1 2014-11-08 5
1 2014-11-09 3
1 2014-11-10 5
You can get weeks that start on Sunday by shifting the date. Note that here the 9th is a Sunday, so that is where we want the boundary.
select *, extract(week from date + '1 day'::interval) as week
from employeetime
order by week;
id date hours week
1 2014-11-07 3 45
1 2014-11-06 10 45
1 2014-11-08 5 45
1 2014-11-09 3 46
1 2014-11-10 5 46
And now the week turns over on Sunday rather than Monday. From here, the query to get hours by week and employee is simple:
select id, sum(hours) as hours, extract(week from date + '1 day'::interval) as week
from employeetime
group by id, week
order by id, week;
id hours week
1 18 45
1 8 46
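As a side note (a sketch, not part of the answer above): if you ever need the actual Sunday that starts each custom week rather than the ISO week number, the same idea can be expressed directly in Python; grouping on that date also avoids week-number collisions when a range spans a year boundary.
from datetime import date, timedelta

def week_start_sunday(d: date) -> date:
    """Return the Sunday on or before d (custom Sunday-to-Saturday week)."""
    return d - timedelta(days=(d.weekday() + 1) % 7)   # Monday=0 ... Sunday=6

print(week_start_sunday(date(2014, 11, 6)))   # 2014-11-02, a Sunday
print(week_start_sunday(date(2014, 11, 9)))   # 2014-11-09, already a Sunday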