Daily refresh of a MongoDB collection with insert and update

I have a MongoDB collection where data for certain fields needs to be refreshed every night. The target collection has 3 extra custom fields that hold end-user input for the respective documents.
When the daily refresh happens overnight, the data source can send new documents or updated data for existing documents. There can be up to 10,000 documents.
I am using PyMongo and MongoDB to achieve this. My problem: how do I identify which records need to be updated and which need to be inserted with those 3 extra custom fields, without impacting end-user data?
For Example:
Data Source:
Manufacturer Model Year Units
BMW 5Series 2019 10
BMW 5Series 2020 5
AUDI A4 2020 20
AUDI A7 2019 3
TOYOTA COROLLA 2020 5
TOYOTA CAMRY 2020 6
HONDA ACCORD 2020 10
HONDA PILOT 2019 15
HONDA CRV 2019 20
Once loaded, the app collection has 1 custom column (Location) for user input:
Manufacturer Model Year Location Units
BMW 5Series 2019 London 10
BMW 5Series 2020 New York 5
AUDI A4 2020 Melbourne 20
AUDI A7 2019 London 3
TOYOTA COROLLA 2020 New York 5
TOYOTA CAMRY 2020 London 6
HONDA ACCORD 2020 Sydney 10
HONDA PILOT 2019 Tokyo 15
HONDA CRV 2019 (blank) 20
On the second day, we get new data as below (changed rows are marked "updated", new rows "new"):
Manufacturer Model Year Units
BMW 5Series 2019 10
BMW 5Series 2020 35 (updated)
BMW 7Series 2020 12 (new)
AUDI A4 2020 20
AUDI A7 2019 3
AUDI A6 2019 1 (new)
TOYOTA COROLLA 2020 5
TOYOTA CAMRY 2020 6
HONDA ACCORD 2020 10
HONDA PILOT 2019 15
HONDA CRV 2019 20 (deleted: not present in the second refresh)
The data can be 10,000 records. How do I achieve this with PyMongo or MongoDB? I wrote the PyMongo code up to the point of retrieving the source data and storing the cursor in a dictionary. I am not sure how to proceed after this using a MongoDB upsert or bulk write so that the Location column is preserved/updated for existing records and set to NULL for new records.
Thanks

Finally, this was achieved as below:
import pymongo
from pymongo import UpdateOne
from datetime import datetime

# Define MongoDB connection ('server' holds the host / connection string)
client = pymongo.MongoClient(server, username='User', password='password', authSource='DBName', authMechanism='SCRAM-SHA-256')
db = client['DBName']  # database name assumed to match authSource

# Source collection
collection2 = db['cars2']
count123 = collection2.count_documents({})
#print("New Cars2 Data - Count before Insert:", count123)
source_cursor = collection2.find()
print("Cars2 - Count in source:", count123)

# Target collection
collection = db['cars']
tgt_count = collection.count_documents({})
print("Cars Collection - Count before Insert:", tgt_count)

# Since this is a MongoDB Cursor object, push it into a list using list()
sourcedata = list(source_cursor)
source_cursor.close()

# Add the new columns to the data before writing to MongoDB
for item in sourcedata:
    item['Location'] = None
    item['last_refresh'] = datetime.now()

ops = []
if tgt_count == 0:
    print("Loading for the first time:")
    for rec in sourcedata:
        # Load data for the first time, including the new fields.
        # (Add 'year' to the filter as well if (name, model) alone is not unique.)
        ops.append(UpdateOne(
            {'name': rec['name'], 'model': rec['model']},
            {'$set': {'name': rec['name'], 'model': rec['model'], 'year': rec['year'],
                      'units': rec['units'], 'Location': rec['Location'],
                      'last_refresh': rec['last_refresh']}},
            upsert=True))
    result = collection.bulk_write(ops)
    print("Inserted Count:", result.inserted_count)
    print("Matched Count:", result.matched_count)
    print("Modified Count:", result.modified_count)
    print("Upserted Count:", result.upserted_count)
elif tgt_count > 0:
    print("Updating the load:")
    for rec in sourcedata:
        # The Location field is NOT included here, to avoid overwriting end-user values with NULL
        ops.append(UpdateOne(
            {'name': rec['name'], 'model': rec['model']},
            {'$set': {'name': rec['name'], 'model': rec['model'], 'year': rec['year'],
                      'units': rec['units'], 'last_refresh': rec['last_refresh']}},
            upsert=True))
    result = collection.bulk_write(ops)
    print("Inserted Count:", result.inserted_count)
    print("Matched Count:", result.matched_count)
    print("Modified Count:", result.modified_count)
    print("Upserted Count:", result.upserted_count)
    # Because Location was not included above, documents created by the upsert do not have a
    # Location field at all, so update the collection once more to add it.
    # ({'Location': None} also matches documents where the field is missing.)
    nullfld_result = collection.update_many({'Location': None}, {'$set': {'Location': None}})

count2 = collection.count_documents({})
print("Count after Insert:", count2)

Related

Apache Superset Pivot Table Grouping Problem

I am facing a grouping problem in Chart >>> Pivot Table of Apache Superset (version 2023.01.1). I created a pivot table chart with the criteria below: Column A: Product list, Column B: Jan, Column C: Feb, and so on till Dec.
So in the Superset pivot table: Time = Months, Columns = date, Rows = Product list, and
Metric =
CASE
  WHEN Product_List = 'Product_A' AND to_char(date, 'Mon') = 'Jan'
    THEN SUM(sales) / (10000 * 0.01)
  WHEN Product_List = 'Product_B' AND to_char(date, 'Mon') = 'Feb'
    THEN SUM(sales) / (20000 * 0.01)
END
So the goal is to find the rate of products for each month based on different target sales. For example:
Sales for Jan = 100 units
Target sales for Jan = 1000 units
Rate = 100 / (1000 * 0.01) = %10 for Jan
Sales for Feb = 150 units
Target sales for Feb = 2000 units
Rate = 150 / (2000 * 0.01) = %7.5 for Feb
Pivot Table:
Product Name Jan
Product A %20
Product B %25
Note: The SQL code above works in SQL Lab in Apache Superset. However, when I apply it to Chart > Pivot Table > Metrics, it gives me the error 'must appear in the GROUP BY clause or be used in an aggregate function'.
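
For reference, the intended arithmetic from the example above, sketched as plain Python outside Superset (the product, month, and target figures are taken from the example; the variable names are just illustrative):

# Illustrative only: the rate formula from the CASE metric, applied to the example figures
targets = {("Product_A", "Jan"): 1000, ("Product_B", "Feb"): 2000}   # target units per product/month
sales = {("Product_A", "Jan"): 100, ("Product_B", "Feb"): 150}       # units actually sold

for key, sold in sales.items():
    rate = sold / (targets[key] * 0.01)   # SUM(sales) / (target * 0.01)
    print(key, f"%{rate:g}")              # -> %10 for Jan, %7.5 for Feb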

Grouping the intake and identifying the number of students who did not enroll in other classes among the students in the intake

intake class student_id
Sep 2022 - Eng English 100
Sep 2022 - Eng English 101
Nov 2022 - Sc Science 100
Jan 2023 - Bio Biology 101
Nov 2022 - Sc Science 102
Sep 2022 - Eng English 102
Jan 2023 - Bio Biology 102
Jan 2023 - Bio Biology 103
Jan 2023 - Bio Biology 105
Feb 2023 - Eng English 104
Feb 2023 - Eng English 103
Hello everyone,
I have a table as shown above. Each row in the table represents a student who is going to attend a class. For example, by looking at the Sep 2022 English class, I know that students with IDs 100, 101, 102 are going to attend that class, students 100 and 102 are going to attend the Nov 2022 Science class, etc.
What I want to do is transform the table into another format that tells how many students did not attend, or are not going to attend, the other classes among the students attending the class right now. The table below is the expected output:
I will show how to get the values in the table shown in the screenshot:
For example
When students 100, 101, 102 are attending the Sep 2022 English class, among the three of them:
None of them did not attend or is not going to attend the English class (as they are attending the English class right now);
One of them did not attend or is not going to attend the Science class (student 101), since only students 100 and 102 are in the list for the Science class;
One of them did not attend or is not going to attend the Biology class (student 100), since only students 101 and 102 are in the list to attend the Biology class and student 100 is not in the list.
Hence, for Sep 2022 - Eng intake:
no_english = 0
no_science = 1
no_biology = 1
Giving another example:
When students 101, 102, 103, 105 are attending the Jan 2023 Biology class, among the 4 of them:
One of them did not attend or is not going to attend the English class (student 105), since students 101 and 102 attended the Sep 2022 English class and student 103 is going to attend the Feb 2023 English class;
Three of them did not attend or are not going to attend the Science class (students 101, 103, 105), since only student 102 is in the list for the Science class;
None of them did not attend or is not going to attend the Biology class, since all of them are attending the Biology class right now.
Hence, for Jan 2023 - Bio intake:
no_english = 1
no_science = 3
no_biology = 0
I have been struggling to transform the data into the desired format shown in the screenshot. In fact, I'm not sure whether it is possible to do this using Power Query or DAX. Any help or advice will be greatly appreciated. Let me know if my question is not clear.
Add 3 measures to a table as follows:
no_science =
VAR ids = VALUES('Table'[student_id])
VAR ids_sci = CALCULATETABLE(VALUES( 'Table'[student_id]), REMOVEFILTERS('Table'), 'Table'[class] = "Science")
RETURN COUNTX( EXCEPT(ids, ids_sci), 'Table'[student_id])+0
no_english =
VAR ids = VALUES('Table'[student_id])
VAR ids_eng = CALCULATETABLE(VALUES( 'Table'[student_id]), REMOVEFILTERS('Table'), 'Table'[class] = "English")
RETURN COUNTX( EXCEPT(ids, ids_eng), 'Table'[student_id])+0
no_biology =
VAR ids = VALUES('Table'[student_id])
VAR ids_bio = CALCULATETABLE(VALUES( 'Table'[student_id]), REMOVEFILTERS('Table'), 'Table'[class] = "Biology")
RETURN COUNTX( EXCEPT(ids, ids_bio), 'Table'[student_id])+0
For fun, an M version
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Grouped Rows" = Table.Group(Source, {"intake", "student_id"}, {{"data", each _, type table }}),
AllCombos =Table.ExpandListColumn( Table.AddColumn(#"Grouped Rows", "class", each List.Distinct(Source[class])), "class"),
T1 = Table.ExpandListColumn(Table.AddColumn(Table.FromList(List.Distinct(Source[class]), null,{"class"} ),"student_id", each List.Distinct(Source[student_id])), "student_id"),
#"Merged Queries0" = Table.NestedJoin(T1, {"class", "student_id"}, Source, {"class", "student_id"}, "Table1", JoinKind.LeftOuter),
StudentNo = Table.AddColumn(#"Merged Queries0", "No", each if Table.RowCount([Table1])=0 then 1 else 0),
#"Merged Queries" = Table.NestedJoin(AllCombos, {"student_id", "class"}, StudentNo, {"student_id", "class"}, "Table2", JoinKind.LeftOuter),
#"Expanded Table2" = Table.ExpandTableColumn(#"Merged Queries", "Table2", {"No"}, {"No"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded Table2",{"student_id", "data"}),
#"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[class]), "class", "No", List.Sum)
in #"Pivoted Column"

How to create calculated field based on another time series

I have a set of prices as a data source for a given time series, and I would like to create a calculated field by combining the two prices for each date, i.e.
Price A * 5 - Price B.
Data source:
Date Product Price
01.01.2018 A 10
01.01.2018 B 15
02.01.2018 A 20
02.01.2018 B 30
03.01.2018 A 10
03.01.2018 B 30
I don't know how to write the formula correctly for the Calculated field.
What I expect is to build the following table:
Date A B Combined Price (A *5 - B)
01.01.2018 10 15 35
02.01.2018 20 30 70
03.01.2018 10 30 20
Thank you
The answer from Mohfooj can be found in the Tableau forum here: https://community.tableau.com/message/900181#900181
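
Outside Tableau, the same computation can be sketched in pandas for clarity (a sketch only, using the sample data from the question):

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Date": ["01.01.2018", "01.01.2018", "02.01.2018", "02.01.2018", "03.01.2018", "03.01.2018"],
    "Product": ["A", "B", "A", "B", "A", "B"],
    "Price": [10, 15, 20, 30, 10, 30],
})

# Pivot so each date has one row with columns A and B, then combine the two prices
wide = df.pivot(index="Date", columns="Product", values="Price")
wide["Combined"] = wide["A"] * 5 - wide["B"]
print(wide)   # Combined: 35, 70, 20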

MongoDB query dates get no results

I maintain a database with a collection called "human" holding more than 300,000 news results, but when I use a date query to get the results where the date is after 2018-01-04, it always shows 0 results.
> db.human.count({"crawled_time":{"$gte":new Date("2018-01-04")}})
0
> db.human.count({"crawled_time":{"$gte":new Date("2018-01-04T00:00:00.000Z")}})
0
> db.human.count({"version_created":{"$gte":new Date("2018-01-04T00:00:00.000Z")}})
0
> db.human.count({"version_created":{$gte:new Date("2018-01-04T00:00:00.000Z")}})
0
> db.human.count({"version_created":{$gte:new Date("2018-01-04T00:00:00Z")}})
0
> db.human.count({"version_created":{"$gte":ISODate("2018-01-04T00:00:00.0000Z")}})
0
A sample of the database file json looks like this:
{"_id":"21adb21dc225406182f031c8e67699cc","_class":"com.pats.reuters.pojo.NewsData","alt_id":"nWGB30349","audiences":["NP:MNI"],"body":"ISSUER: City of Spencer, IA\nAMOUNT: $1,500,000\nDESCRIPTION: General Obligation Corporate Purpose Bonds, Series 2018\n------------------------------------------------------------------------\nSELLING: Feb 5 TIME: 11:00 AM., EST\nFINANCIAL ADVISOR: PFM Fin Advisors\n------------------------------------------------------------------------\n ","first_created":"2018-01-30T06:12:05.000Z","headline":"SEALED BIDS: City of Spencer, IA, $1.5M Ult G.O. On Feb 5","instances_of":[],"language":"en","message_type":2,"mime_type":"text/plain","provider":"NS:RTRS","pub_status":"stat:usable","subjects":["A:R","E:T","E:5I","E:S","A:95","A:85","M:1QD","N2:MUNI","N2:PS1","N2:SL1","N2:CM1","N2:IA1","N2:GOS"],"takeSequence":1,"urgency":3,"version_created":"2018-01-30T06:12:05.000Z","source_id":"WGB30349__180130279kIQIcAh81BiGVmb/Js54Wg3naQC6GXEu9+H","crawled_time":"2018-01-30 14:12:05"}
{"_id":"8ba08c4af9464c6b23cc508645d5bf03","_class":"com.pats.reuters.pojo.NewsData","alt_id":"nWGB3034a","audiences":["NP:MNI"],"body":"ISSUER: City of Long Branch, NJ\nAMOUNT: $31,629,415\nDESCRIPTION: Bond Anticipation Notes, Consisting of $22,629,415 Bond Anticipation Notes, Series 2018A and\n------------------------------------------------------------------------\nSELLING: Feb 1 TIME: 11:30 AM., EST\nFINANCIAL ADVISOR: N.A.\n------------------------------------------------------------------------\n ","first_created":"2018-01-30T06:12:06.000Z","headline":"SEALED BIDS: City of Long Branch, NJ, $31.629M Ult G.O. On Feb 1","instances_of":[],"language":"en","message_type":2,"mime_type":"text/plain","provider":"NS:RTRS","pub_status":"stat:usable","subjects":["G:6J","A:R","E:T","E:5I","E:S","A:9M","E:U","M:1QD","N2:US","N2:MUNI","N2:PS1","N2:SL1","N2:CM1","N2:NJ1","N2:NT1"],"takeSequence":1,"urgency":3,"version_created":"2018-01-30T06:12:06.000Z","source_id":"WGB3034a__1801302ksv4Iy0zSP5cscas0FlZgu1TpQ4Zh25VKCtSt","crawled_time":"2018-01-30 14:12:06"}
{"_id":"537f70076ef056c9a43d30c89500353a","_class":"com.pats.reuters.pojo.NewsData","alt_id":"nWGB3034b","audiences":["NP:MNI"],"body":"ISSUER: Independent School District No. 76 of Canadian County (Calumet), OK\nAMOUNT: $1,630,000\nDESCRIPTION: Combined Purpose Bonds of 2018\n------------------------------------------------------------------------\nSELLING: Feb 12 TIME: 05:00 PM., EST\nFINANCIAL ADVISOR: Stephen H. McDonald\n------------------------------------------------------------------------\n ","first_created":"2018-01-30T06:12:07.000Z","headline":"SEALED BIDS: Independent School District No. 76 of Canadian County (Calumet), OK, $1.63M Ult G.O. On Feb 12","instances_of":[],"language":"en","message_type":2,"mime_type":"text/plain","provider":"NS:RTRS","pub_status":"stat:usable","subjects":["A:R","E:T","E:5I","E:S","A:9R","M:1QD","N2:MUNI","N2:PS1","N2:SL1","N2:CM1","N2:OK1"],"takeSequence":1,"urgency":3,"version_created":"2018-01-30T06:12:07.000Z","source_id":"WGB3034b__1801302ev7DqID2Wr/BAJHrC/plpNKBQhrfuHBnlSldz","crawled_time":"2018-01-30 14:12:07"}
Your field value is not a date object; it is a string, so you can't compare it with new Date("2018-01-04"). Use the queries below instead. Because the strings are stored in a year-month-day format, a lexicographic string comparison works:
db.human.count({"crawled_time":{"$gte":"2018-01-04"}})
db.human.count({"version_created":{"$gte":"2018-01-04"}})

spark-dataframe/mongo - append data

I need to append data to MongoDB using a Spark DataFrame. For example, let's say there are 100k stocks in a portfolio:
Stock A
Jan 2018
Profit: $30k
Stock B
Jan 2018
Profit: -$10k
MongoDB:
{
  "_id": ObjectId('XXX1'),
  "stock": "Stock A",
  "monthlyProfit": [
    { "Month": "Jan 2018", "Profit": "30k" }
  ]
}
{
  "_id": ObjectId('XXX2'),
  "stock": "Stock B",
  "monthlyProfit": [
    { "Month": "Jan 2018", "Profit": "-10k" }
  ]
}
If I were to append the February profit, how do I add an element to the existing array and push it to MongoDB without a performance issue, given that the same kind of update needs to happen to all 100k documents in the collection?
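
There is no answer in this thread, but for illustration, one common approach is to append the new month with a bulk $push so that whole documents are not rewritten. A minimal sketch in PyMongo rather than Spark, with placeholder database/collection names and assumed February values:

from pymongo import MongoClient, UpdateOne

client = MongoClient()                         # fill in the real connection details
collection = client["portfolio"]["stocks"]     # placeholder database/collection names

# February figures to append, e.g. collected from the Spark DataFrame beforehand (values assumed)
feb_profit = {"Stock A": "12k", "Stock B": "-5k"}

ops = [
    UpdateOne({"stock": stock},
              {"$push": {"monthlyProfit": {"Month": "Feb 2018", "Profit": profit}}})
    for stock, profit in feb_profit.items()
]

# A single unordered bulk_write keeps the round trips low, even with ~100k stocks
result = collection.bulk_write(ops, ordered=False)
print("Modified:", result.modified_count)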