Grouping by intake and identifying the number of students who did not enroll in other classes among the students in the intake - group-by

intake           class     student_id
Sep 2022 - Eng   English   100
Sep 2022 - Eng   English   101
Nov 2022 - Sc    Science   100
Jan 2023 - Bio   Biology   101
Nov 2022 - Sc    Science   102
Sep 2022 - Eng   English   102
Jan 2023 - Bio   Biology   102
Jan 2023 - Bio   Biology   103
Jan 2023 - Bio   Biology   105
Feb 2023 - Eng   English   104
Feb 2023 - Eng   English   103
Hello everyone,
I have a table as shown above. Each row represents a student who is going to attend a class. For example, looking at the Sep 2022 English class, I know that students 100, 101, and 102 are going to attend it, students 100 and 102 are going to attend the Nov 2022 Science class, and so on.
What I want to do is transform the table into another format that tells, for each intake, how many of the students attending that class did not attend and are not going to attend each of the other classes. The expected output is:

intake           no_english   no_science   no_biology
Sep 2022 - Eng   0            1            1
Nov 2022 - Sc    0            0            1
Jan 2023 - Bio   1            3            0
Feb 2023 - Eng   0            2            1

I will show how the values above are derived:
For example
When students 100, 101, and 102 are attending the Sep 2022 English class, among the three of them:
None of them did not attend or is not going to attend an English class (they are attending the English class right now);
One of them (student 101) did not attend and is not going to attend a Science class, since only students 100 and 102 are in the Science class list;
One of them (student 100) did not attend and is not going to attend a Biology class, since only students 101 and 102 are in the Biology class list and student 100 is not.
Hence, for the Sep 2022 - Eng intake:
no_english = 0
no_science = 1
no_biology = 1
Another example: when students 101, 102, 103, and 105 are attending the Jan 2023 Biology class, among the four of them:
One of them (student 105) did not attend and is not going to attend an English class, since students 101 and 102 attended the Sep 2022 English class and student 103 is going to attend the Feb 2023 English class;
Three of them (students 101, 103, and 105) did not attend and are not going to attend a Science class, since only student 102 is in the Science class list;
None of them did not attend or is not going to attend a Biology class, since all of them are attending the Biology class right now.
Hence, for the Jan 2023 - Bio intake:
no_english = 1
no_science = 3
no_biology = 0
I have been struggling to transform the data into the desired format shown above. In fact, I'm not sure whether it is even possible using Power Query or DAX. Any help or advice will be greatly appreciated. Let me know if my question is not clear.
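For reference, the counting rule can be made concrete with a quick pandas sketch (not a Power BI solution, just the same logic applied to the sample rows above):

import pandas as pd

# Sample rows from the question
rows = [
    ("Sep 2022 - Eng", "English", 100), ("Sep 2022 - Eng", "English", 101),
    ("Nov 2022 - Sc", "Science", 100), ("Jan 2023 - Bio", "Biology", 101),
    ("Nov 2022 - Sc", "Science", 102), ("Sep 2022 - Eng", "English", 102),
    ("Jan 2023 - Bio", "Biology", 102), ("Jan 2023 - Bio", "Biology", 103),
    ("Jan 2023 - Bio", "Biology", 105), ("Feb 2023 - Eng", "English", 104),
    ("Feb 2023 - Eng", "English", 103),
]
df = pd.DataFrame(rows, columns=["intake", "class", "student_id"])

# All students enrolled in each class, across every intake
by_class = df.groupby("class")["student_id"].agg(set)

# For each intake, count attendees missing from each class roster
out = {
    intake: {"no_" + c.lower(): len(set(grp["student_id"]) - by_class[c])
             for c in by_class.index}
    for intake, grp in df.groupby("intake")
}
# Rows are intakes; columns come out alphabetically (no_biology, no_english, no_science).
# The Sep 2022 - Eng row shows no_english 0, no_science 1, no_biology 1, as above.
print(pd.DataFrame(out).T)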

Add 3 measures to a table as follows:
no_science =
// Students in the current filter context (the current intake)
VAR ids = VALUES ( 'Table'[student_id] )
// Students enrolled in any Science class, ignoring the current filters
VAR ids_sci = CALCULATETABLE ( VALUES ( 'Table'[student_id] ), REMOVEFILTERS ( 'Table' ), 'Table'[class] = "Science" )
// Count intake students who never appear in a Science class; + 0 turns BLANK into 0
RETURN COUNTX ( EXCEPT ( ids, ids_sci ), 'Table'[student_id] ) + 0

no_english =
VAR ids = VALUES ( 'Table'[student_id] )
VAR ids_eng = CALCULATETABLE ( VALUES ( 'Table'[student_id] ), REMOVEFILTERS ( 'Table' ), 'Table'[class] = "English" )
RETURN COUNTX ( EXCEPT ( ids, ids_eng ), 'Table'[student_id] ) + 0

no_biology =
VAR ids = VALUES ( 'Table'[student_id] )
VAR ids_bio = CALCULATETABLE ( VALUES ( 'Table'[student_id] ), REMOVEFILTERS ( 'Table' ), 'Table'[class] = "Biology" )
RETURN COUNTX ( EXCEPT ( ids, ids_bio ), 'Table'[student_id] ) + 0
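Then put intake on the rows of a table or matrix visual and add the three measures as values; each intake row should show the counts worked out in the examples above (0 / 1 / 1 for Sep 2022 - Eng, and so on).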

For fun, an M version
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    // One row per (intake, student) pair
    #"Grouped Rows" = Table.Group(Source, {"intake", "student_id"}, {{"data", each _, type table}}),
    // Cross every (intake, student) pair with every class
    AllCombos = Table.ExpandListColumn(Table.AddColumn(#"Grouped Rows", "class", each List.Distinct(Source[class])), "class"),
    // Cross every class with every student
    T1 = Table.ExpandListColumn(Table.AddColumn(Table.FromList(List.Distinct(Source[class]), null, {"class"}), "student_id", each List.Distinct(Source[student_id])), "student_id"),
    // Flag (class, student) pairs that have no matching enrolment row
    #"Merged Queries0" = Table.NestedJoin(T1, {"class", "student_id"}, Source, {"class", "student_id"}, "Table1", JoinKind.LeftOuter),
    StudentNo = Table.AddColumn(#"Merged Queries0", "No", each if Table.RowCount([Table1]) = 0 then 1 else 0),
    // Attach the flag to each (intake, student, class) combination
    #"Merged Queries" = Table.NestedJoin(AllCombos, {"student_id", "class"}, StudentNo, {"student_id", "class"}, "Table2", JoinKind.LeftOuter),
    #"Expanded Table2" = Table.ExpandTableColumn(#"Merged Queries", "Table2", {"No"}, {"No"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Table2", {"student_id", "data"}),
    // Sum the flags per intake, one column per class
    #"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[class]), "class", "No", List.Sum)
in
    #"Pivoted Column"

Related

Apache Superset Pivot Table Grouping Problem

I am facing a grouping problem in Chart >>> Pivot Table in Apache Superset (version 2023.01.1). I created a pivot table chart with the following criteria: Column A: product list; Column B: Jan; Columns C onward: Feb through Dec.
So in the Superset pivot table: Time = Months, Columns = date, Rows = Product list, and
Metric =
CASE
    WHEN Product_List = 'Product_A' AND to_char(date, 'Mon') = 'Jan'
        THEN SUM(sales) / (10000 * 0.01)
    WHEN Product_List = 'Product_B' AND to_char(date, 'Mon') = 'Feb'
        THEN SUM(sales) / (20000 * 0.01)
END
So the goal is to find the rate for each product in each month against that month's target sales. For example:
Sales for Jan = 100 units
Target sales for Jan = 1000 units
Rate = 100 / (1000 * 0.01) = 10% for Jan
Sales for Feb = 150 units
Target sales for Feb = 2000 units
Rate = 150 / (2000 * 0.01) = 7.5% for Feb
Pivot Table:
Product Name   Jan
Product A      %20
Product B      %25
Note: The SQL code above works in SQL Lab in Apache Superset. However, when I apply it under Chart > Pivot Table > Metrics, it gives me the error 'must appear in the GROUP BY clause or be used in an aggregate function'.
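(That error usually means the metric expression references bare columns outside an aggregate: here to_char(date, 'Mon'), and possibly Product_List, sit outside the SUM, so the grouped query Superset generates rejects them. Moving the condition inside the aggregate, for example SUM(CASE WHEN to_char(date, 'Mon') = 'Jan' THEN sales END) / (10000 * 0.01), is a common way around it.)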

daily refresh of MongoDB Collection with Insert and update

I have a MongoDB collection where certain fields need to be refreshed every night. The target collection has 3 extra custom fields that hold end-user input for the respective documents.
When the daily refresh happens overnight, the data source can send new documents or updated data for existing documents. There can be up to 10,000 documents.
I am using PyMongo and MongoDB to achieve this. My problem: how do I identify which records need to be updated and which need to be inserted (with those 3 extra custom fields) without impacting the end-user data?
For Example:
Data Source:
Manufacture Name   Model     Year   Units
BMW                5Series   2019   10
BMW                5Series   2020   5
AUDI               A4        2020   20
AUDI               A7        2019   3
TOYOTA             COROLLA   2020   5
TOYOTA             CAMRY     2020   6
HONDA              ACCORD    2020   10
HONDA              PILOT     2019   15
HONDA              CRV       2019   20
Once loaded, the app table has one custom column (Location) for user input:
Manufacture Name   Model     Year   Location    Units
BMW                5Series   2019   London      10
BMW                5Series   2020   New York    5
AUDI               A4        2020   Melbourne   20
AUDI               A7        2019   London      3
TOYOTA             COROLLA   2020   New York    5
TOYOTA             CAMRY     2020   London      6
HONDA              ACCORD    2020   Sydney      10
HONDA              PILOT     2019   Tokyo       15
HONDA              CRV       2019
On the second day, we get new data as below:
Manufacture Name   Model     Year   Units
BMW                5Series   2019   10
BMW                5Series   2020   **35**
**BMW              7Series   2020   12**
AUDI               A4        2020   20
AUDI               A7        2019   3
**AUDI             A6        2019   1**
TOYOTA             COROLLA   2020   5
TOYOTA             CAMRY     2020   6
HONDA              ACCORD    2020   10
HONDA              PILOT     2019   15
*HONDA             CRV       2019   20*   -- deleted in second refresh
The data can be up to 10,000 records. How do I achieve this with PyMongo or MongoDB? I wrote the PyMongo code up to retrieving the source data and storing the cursor contents in a list. I'm not sure how to proceed after that using a MongoDB upsert or bulk write while preserving/updating the Location column for existing records and assigning NULL values for new records.
Thanks
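For what it's worth, MongoDB's $setOnInsert update operator is designed for exactly this pattern: it only applies when the upsert inserts a new document, so existing documents keep their user-entered values. A minimal PyMongo sketch, assuming a db handle and that (name, model) identifies a car as in the answer below:

from datetime import datetime
from pymongo import MongoClient, UpdateOne

client = MongoClient()   # adjust connection details as needed
db = client['DBName']    # hypothetical database name

ops = [
    UpdateOne(
        {'name': rec['name'], 'model': rec['model']},
        {
            # Refreshed every night, for new and existing documents alike
            '$set': {'year': rec['year'], 'units': rec['units'],
                     'last_refresh': datetime.now()},
            # Applied only when the upsert inserts a new document
            '$setOnInsert': {'Location': None},
        },
        upsert=True,
    )
    for rec in db['cars2'].find()
]
if ops:
    result = db['cars'].bulk_write(ops)
    print("Upserted:", result.upserted_count, "Modified:", result.modified_count)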
Finally, this is achieved as below:
import pymongo
from pymongo import UpdateOne
from datetime import datetime

# Define MongoDB connection
client = pymongo.MongoClient(server, username='User', password='password', authSource='DBName', authMechanism='SCRAM-SHA-256')
db = client['DBName']

# Source collection
collection2 = db['cars2']
count123 = collection2.count_documents({})
#print("New Cars2 Data - Count before Insert:", count123)
source_cursor = collection2.find()

# Target collection
collection = db['cars']
tgt_count = collection.count_documents({})
print("Cars Collection - Count before Insert:", tgt_count)

# Since this is a MongoDB Cursor object, we need to push it to a list using list()
sourcedata = list(source_cursor)
source_cursor.close()
print("Cars2 - Count in Cursor:", len(sourcedata))

# Add the new columns to the data before writing to MongoDB
for item in sourcedata:
    item['Location'] = None
    item['last_refresh'] = datetime.now()

ops = []
if tgt_count == 0:
    print("Loading for first time:")
    # Load the data for the first time, including the new fields
    for rec in sourcedata:
        ops.append(UpdateOne(
            {'name': rec['name'], 'model': rec['model']},
            {'$set': {'name': rec['name'], 'model': rec['model'], 'year': rec['year'],
                      'units': rec['units'], 'Location': rec['Location'],
                      'last_refresh': rec['last_refresh']}},
            upsert=True))
else:
    print("Updating the Load:")
    # The Location field is NOT included here, so existing user-entered values
    # are not replaced with NULL
    for rec in sourcedata:
        ops.append(UpdateOne(
            {'name': rec['name'], 'model': rec['model']},
            {'$set': {'name': rec['name'], 'model': rec['model'], 'year': rec['year'],
                      'units': rec['units'], 'last_refresh': rec['last_refresh']}},
            upsert=True))

result = collection.bulk_write(ops)
print("Inserted Count:", result.inserted_count)
print("Matched Count:", result.matched_count)
print("Modified Count:", result.modified_count)
print("Upserted Count:", result.upserted_count)

# Because the Location field was not included in the update branch, newly
# upserted records do not have a Location field at all. Update the collection
# once more to materialize it: the filter {'Location': None} also matches
# documents where the field is missing.
nullfld_result = collection.update_many({'Location': None}, {'$set': {'Location': None}})

count2 = collection.count_documents({})
print("Count after Insert:", count2)

MongoDB query dates get no results

I maintain a database called "human" with more than 300,000 news records, but when I use a date query to get the results where the date is after 2018-01-04, it always shows 0 results.
> db.human.count({"crawled_time":{"$gte":new Date("2018-01-04")}})
0
> db.human.count({"crawled_time":{"$gte":new Date("2018-01-04T00:00:00.000Z")}})
0
> db.human.count({"version_created":{"$gte":new Date("2018-01-04T00:00:00.000Z")}})
0
> db.human.count({"version_created":{$gte:new Date("2018-01-04T00:00:00.000Z")}})
0
> db.human.count({"version_created":{$gte:new Date("2018-01-04T00:00:00Z")}})
0
> db.human.count({"version_created":{"$gte":ISODate("2018-01-04T00:00:00.0000Z")}})
0
A sample of the database JSON file looks like this:
{"_id":"21adb21dc225406182f031c8e67699cc","_class":"com.pats.reuters.pojo.NewsData","alt_id":"nWGB30349","audiences":["NP:MNI"],"body":"ISSUER: City of Spencer, IA\nAMOUNT: $1,500,000\nDESCRIPTION: General Obligation Corporate Purpose Bonds, Series 2018\n------------------------------------------------------------------------\nSELLING: Feb 5 TIME: 11:00 AM., EST\nFINANCIAL ADVISOR: PFM Fin Advisors\n------------------------------------------------------------------------\n ","first_created":"2018-01-30T06:12:05.000Z","headline":"SEALED BIDS: City of Spencer, IA, $1.5M Ult G.O. On Feb 5","instances_of":[],"language":"en","message_type":2,"mime_type":"text/plain","provider":"NS:RTRS","pub_status":"stat:usable","subjects":["A:R","E:T","E:5I","E:S","A:95","A:85","M:1QD","N2:MUNI","N2:PS1","N2:SL1","N2:CM1","N2:IA1","N2:GOS"],"takeSequence":1,"urgency":3,"version_created":"2018-01-30T06:12:05.000Z","source_id":"WGB30349__180130279kIQIcAh81BiGVmb/Js54Wg3naQC6GXEu9+H","crawled_time":"2018-01-30 14:12:05"}
{"_id":"8ba08c4af9464c6b23cc508645d5bf03","_class":"com.pats.reuters.pojo.NewsData","alt_id":"nWGB3034a","audiences":["NP:MNI"],"body":"ISSUER: City of Long Branch, NJ\nAMOUNT: $31,629,415\nDESCRIPTION: Bond Anticipation Notes, Consisting of $22,629,415 Bond Anticipation Notes, Series 2018A and\n------------------------------------------------------------------------\nSELLING: Feb 1 TIME: 11:30 AM., EST\nFINANCIAL ADVISOR: N.A.\n------------------------------------------------------------------------\n ","first_created":"2018-01-30T06:12:06.000Z","headline":"SEALED BIDS: City of Long Branch, NJ, $31.629M Ult G.O. On Feb 1","instances_of":[],"language":"en","message_type":2,"mime_type":"text/plain","provider":"NS:RTRS","pub_status":"stat:usable","subjects":["G:6J","A:R","E:T","E:5I","E:S","A:9M","E:U","M:1QD","N2:US","N2:MUNI","N2:PS1","N2:SL1","N2:CM1","N2:NJ1","N2:NT1"],"takeSequence":1,"urgency":3,"version_created":"2018-01-30T06:12:06.000Z","source_id":"WGB3034a__1801302ksv4Iy0zSP5cscas0FlZgu1TpQ4Zh25VKCtSt","crawled_time":"2018-01-30 14:12:06"}
{"_id":"537f70076ef056c9a43d30c89500353a","_class":"com.pats.reuters.pojo.NewsData","alt_id":"nWGB3034b","audiences":["NP:MNI"],"body":"ISSUER: Independent School District No. 76 of Canadian County (Calumet), OK\nAMOUNT: $1,630,000\nDESCRIPTION: Combined Purpose Bonds of 2018\n------------------------------------------------------------------------\nSELLING: Feb 12 TIME: 05:00 PM., EST\nFINANCIAL ADVISOR: Stephen H. McDonald\n------------------------------------------------------------------------\n ","first_created":"2018-01-30T06:12:07.000Z","headline":"SEALED BIDS: Independent School District No. 76 of Canadian County (Calumet), OK, $1.63M Ult G.O. On Feb 12","instances_of":[],"language":"en","message_type":2,"mime_type":"text/plain","provider":"NS:RTRS","pub_status":"stat:usable","subjects":["A:R","E:T","E:5I","E:S","A:9R","M:1QD","N2:MUNI","N2:PS1","N2:SL1","N2:CM1","N2:OK1"],"takeSequence":1,"urgency":3,"version_created":"2018-01-30T06:12:07.000Z","source_id":"WGB3034b__1801302ev7DqID2Wr/BAJHrC/plpNKBQhrfuHBnlSldz","crawled_time":"2018-01-30 14:12:07"}
Your field values are not date objects; they are strings, so you can't use new Date("2018-01-04"). Use the queries below instead (string comparison works here because the values are zero-padded and ordered like ISO dates).
db.human.count({"crawled_time":{"$gte":"2018-01-04"}})
db.human.count({"version_created":{"$gte":"2018-01-04"}})

Date range queries mongodb

I have two collections: one is random and the other one is 'msg'.
In msg I have a document like:
{ "message": "sssss", postedAt: Fri Jul 17 2015 09:03:43 GMT+0530 (IST) }
For the random collection, there is a script that generates a random number every minute, like:
{ "randomStr": "sss", postedAt: Fri Jul 17 2015 09:03:43 GMT+0530 (IST) }
{ "randomStr": "xxx", postedAt: Fri Jul 17 2015 09:04:43 GMT+0530 (IST) }
{ "randomStr": "yyy", postedAt: Fri Jul 17 2015 09:05:43 GMT+0530 (IST) }
Notice the change in timings: for every minute there is a new record.
Now, my issue is: when I query the message collection, I'll get one record, let's say this one:
{ "message": "sssss", postedAt: Fri Jul 17 2015 09:03:13 GMT+0530 (IST) }
Now I want to get the record from the random collection that was posted in the exact same minute: this message was posted at 09:03, so I want the record from random whose postedAt is also 09:03.
How to do that? Any help appreciated. Thanks.
Note: I'm doing this in Meteor.
So the point here is to "actually use a range", and also to be aware of the Date value you get in return. As a basic example in principle:
The time right now as I execute this is:
var date = new Date()
ISODate("2015-07-17T04:02:04.471Z")
So if you presume that the actual timestamp in your document is "not exactly to the minute" (like the one above), and that the "random record" likely isn't either, then the first thing to do is "round" it down to the minute:
date = new Date(date.valueOf() - date.valueOf() % ( 1000 * 60 ))
ISODate("2015-07-17T04:02:00Z")
And of course the "end date" is just one minute after that:
var endDate = new Date(date.valueOf() + ( 1000 * 60 ))
ISODate("2015-07-17T04:03:00Z")
Then when you query "random" you just get the "range", with the $gte and $lt operators:
db.random.find({ "postedAt": { "$gte": date, "$lt": endDate } })
Which just retrieves your single "write once a minute" item from "random" at any possible value within that minute.
So basically:
Round your retrieved input date down to the minute.
Search on the "range" between that value and the next minute.
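The same rounding and range query in Python/PyMongo, for reference (a sketch, assuming a postedAt date field as above; the database name is an assumption):

from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient()
db = client['test']   # hypothetical database name

posted_at = datetime(2015, 7, 17, 4, 2, 4, 471000)  # the message's timestamp
start = posted_at.replace(second=0, microsecond=0)  # round down to the minute
end = start + timedelta(minutes=1)

# The single once-a-minute document written within that same minute
doc = db.random.find_one({"postedAt": {"$gte": start, "$lt": end}})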

MongoDB row index limit query like SQL

I would like to retrieve a list that contains a specified record plus a number of records before and a number of records after that record, under some conditions. Are there any solutions?
For example, I have a MongoDB schema { id, date, section }
Data set:
100, 26 Aug 2014 11:00, A
110, 26 Aug 2014 11:01, A
140, 26 Aug 2014 12:00, A
141, 27 Aug 2014 12:00, B
200, 30 Aug 2014 11:00, A
210, 01 Sep 2014 11:01, B
290, 02 Sep 2014 12:00, A
300, 26 Sep 2014 12:00, A
301, 27 Oct 2014 12:00, B
302, 30 Oct 2014 11:23, A
410, 01 Oct 2014 15:01, B
590, 02 Oct 2014 12:00, A
600, 26 Nov 2014 00:00, A
I would like to get a list that contains the unique id = 300 plus the 3 records before and the 3 records after it, sorted by date, within section A.
The output:
140, 26 Aug 2014 12:00, A
200, 30 Aug 2014 11:00, A
290, 02 Sep 2014 12:00, A
300, 26 Sep 2014 12:00, A <-- middle
302, 30 Oct 2014 11:23, A
590, 02 Oct 2014 12:00, A
600, 26 Nov 2014 00:00, A
I have a stupid approach (sketched in PyMongo below):
Get the date (let's say it's 26 Sep 2014 12:00) of the specified id = 300 in section A.
Query for records whose date is greater than or equal to 26 Sep 2014 12:00, ordered by date, limited to 3 records.
Query for records whose date is less than 26 Sep 2014 12:00, ordered by date descending, limited to 3 records.
Combine the two lists.
Is there any better approach to retrieve this kind of list in a single query, or with better performance? Thank you.
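That two-query approach is straightforward to express in PyMongo, for what it's worth (a sketch, assuming a collection named data with the { id, date, section } schema above):

from pymongo import MongoClient

client = MongoClient()
db = client['test']   # hypothetical database name

anchor = db.data.find_one({"id": 300, "section": "A"})

# The 3 records just before the anchor: take them descending, then reverse
before = list(db.data.find({"section": "A", "date": {"$lt": anchor["date"]}})
              .sort("date", -1).limit(3))[::-1]
# The 3 records just after the anchor
after = list(db.data.find({"section": "A", "date": {"$gt": anchor["date"]}})
             .sort("date", 1).limit(3))

result = before + [anchor] + after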
Let's say your collection is named user. You could use the MongoDB query below to fetch the result.
For the $sort stage:
1 represents ascending
-1 represents descending
Refer to the documentation: $sort
The query will be:
db.user.aggregate([{$match : { "id" : 300}}, {$sort : {"date": 1 }}, {$skip : 0}, {$limit : 10}]);
The $skip value will be your limit value from the first query:
db.user.aggregate([{$match : { "id" : 300}}, {$sort : {"date": 1 }}, {$skip : 10}, {$limit : 10}]);
Refer to the documentation: Aggregation
$skip and $limit in aggregation framework
Here is a good example of using skip and limit, which should help you achieve the SELECT TOP X / LIMIT X behavior.
Note: I'm assuming you want to use the aggregation framework based on the tagging of this question.
I believe this should do it:
x = 300;
db.user.aggregate([{$match : { "id" : {$lte: x + 10, $gte: x - 10}}}, {$sort : {"date": 1}}]);