pymongo - Upsert collection to add object with nested objects and nested data - mongodb

I have spent weeks researching and am finally asking the question....
Using Python/pymongo I am stuck at the simple step of adding a nested object with nested data to an existing object or nested object, so I have not even made it to the upsert yet. I have code that confirms whether the collection "serial_number" exists and creates it if not; however, I am stuck at adding nested objects to an existing or newly created collection. I am looking for an example to build from, not for this forum to solve the full solution :-)
Attached is a screenshot of the desired layout. Use case: several cars are available to be rented.
MongoDB database with a collection that holds each car's serial number and all activities related to that car:
a. Need to capture details of each rental activity - nested data under each serial number.
b. Need to capture maintenance details for each car - nested data under each serial number.
When either "a" or "b" occurs, I need to check that the record is not already there; if not, add a new nested object that contains the nested data.
I have tried for loops and followed numerous postings referring to collections and sub-collections. All of them seem to require highly detailed information in code to parse the dictionary (which comes from semi-structured, inconsistent sources) and add it to a collection, but I cannot find a starting point for adding to an existing object, or to a nested object, within an existing collection.
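To make the target shape concrete (the serial number and field values below are invented, since the screenshot is not reproduced here), a single document in the "vehicle" collection would look roughly like this:

# Hypothetical document shape for one car (values are examples only)
vehicle_doc = {
    "_id": "car_2020-01-15_SN12345",
    "SN12345": {                                   # serial_number object
        "car_details": {"mileage": 42000, "fuel_level": 0.75},
        "rental_activity": {"2023-05-01": {"days": 3}},
        "maintenance": {"2023-04-15": {"work": "oil change"}},
    },
}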
# Assume MongoDB is structured like this:
#   rent_activity(DB) > vehicle(document) > serial_number(object) > car_details(object under serial_number)
#   rent_activity(DB) > vehicle(document) > serial_number(object) > rental_activity(object under serial_number)
#   rent_activity(DB) > vehicle(document) > serial_number(object) > maintenance(object under serial_number)
# The example below is attempting to insert new objects with nested data into the car_details object
import os, sys
import csv
import time
from datetime import datetime

import boto3
import pandas as pd
import pandas.io.sql as sqlio
import psycopg2
import pymongo
from pymongo import errors

my_connection = psycopg2.connect(dbname='xxx', host='xxx', port=xxx, user='xxx', password='xxx')
client = pymongo.MongoClient("xxxxxxx")
mydb = client["rent_activity"]  # This is the MongoDB database
collection = mydb["vehicle"]

car_details_sql = "SELECT serial_number, purchase_date, mileage, fuel_level, other FROM car_details"
data = sqlio.read_sql_query(car_details_sql, my_connection)  # Create dataframe from query results

try:
    # Goal is to add each line of the dataframe returned by the SQL query as a record under
    # collection "vehicle", object "serial_number", nested object "car_details"
    for item in data.to_dict(orient="records"):  # For each row, transform to dict
        mycarkey = str(item.get("purchase_date")) + "_" + str(item.get("serial_number"))  # Create unique string
        car_number = "car_" + mycarkey
        finaldata = {}  # Create empty dict
        finaldata[mycarkey] = item  # Add outer unique key to existing dict for this set of data
        finaldata.update({"_id": car_number})  # Define custom object _id for this set of data
        collection.insert_one(finaldata)
        # This inserts into the collection "vehicle".
        # What is needed is to nest this to:
        #   rent_activity(DB) > vehicle(document) > serial_number(object) > car_details(object under serial_number)
        # NOTE: If I simply add another outer key "serial_number" I receive a "key already exists" error
        # ...so the issue is how to add an object with nested values beneath an existing object??
except Exception as e:
    raise e
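For what it's worth, the usual pattern for adding or extending a nested object inside an existing document (rather than inserting a whole new document) is update_one() with upsert=True, using dot notation inside $set. A minimal sketch along those lines, reusing the collection handle from above; the serial number and field values are made up:

serial_number = "SN12345"                      # example value only
car_details = {"mileage": 42000, "fuel_level": 0.75}

# Creates the document for this serial number if it does not exist, otherwise
# only sets/overwrites the nested car_details object beneath it.
collection.update_one(
    {"_id": "car_" + serial_number},
    {"$set": {serial_number + ".car_details": car_details}},
    upsert=True,
)

# Adding one rental event under rental_activity without touching anything else:
collection.update_one(
    {"_id": "car_" + serial_number},
    {"$set": {serial_number + ".rental_activity.2023-05-01": {"days": 3}}},
    upsert=True,
)

The same call covers both cases in the question: if the record is not already there, the upsert creates it; if it is, the dot-notation path adds the new nested object beneath the existing one without disturbing the rest of the document.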

Related

MongoDB - Best way to bulk insert or update using MongoEngine

I need to write CSV data into MongoDB (version 4.4.4). Right now I'm using MongoEngine as a data layer for my application.
Each CSV has at least 4 million records and 8 columns.
What is the fastest way to bulk insert (if the data doesn't exist yet) or update (if the data is already in the collection)?
Right now I'm doing the following:
for inf in list:
    daily_report = DailyReport.objects.get(company_code=code, date=date)
    if daily_report is not None:
        inf.id = daily_report.id
    inf.save()
The list is a list of DailyReports built from the CSV data.
The _id is auto-generated. However, for business purposes the primary keys are the variables company_code (StringField) and date (DateTimeField).
The DailyReport class has a unique compound index made of the following fields: company_code and date.
The previous code traverses the list and, for each DailyReport, looks for an existing DailyReport in the database with the same company_code and date. If one exists, the id from the DailyReport in the database is assigned to the DailyReport built from the CSV data. With the id assigned to the object, the object is saved using the Object.save() method from MongoEngine.
The insert operation is being done one object at a time and is incredibly slow.
Any ideas on how to make this process faster?
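One common way to speed this up is to send everything as a single batch of upserts keyed on the compound index. A sketch, not a drop-in answer: it assumes the CSV rows are available as a list of dicts called reports, and it reaches down to the raw pymongo collection, which MongoEngine exposes via Document._get_collection():

from pymongo import UpdateOne

# Build one upsert per CSV row; the filter is the business key (company_code, date),
# so existing documents are updated and missing ones are inserted.
ops = [
    UpdateOne(
        {"company_code": r["company_code"], "date": r["date"]},
        {"$set": r},
        upsert=True,
    )
    for r in reports
]

raw_collection = DailyReport._get_collection()  # underlying pymongo collection
# ordered=False lets the server process the batch without stopping at the first failure
result = raw_collection.bulk_write(ops, ordered=False)
print(result.upserted_count, "inserted,", result.modified_count, "updated")

With 4 million rows it is worth splitting ops into chunks (say 10,000 operations per bulk_write call) to keep memory in check.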

How do I query PostgreSQL with IDs from a parquet file in a Data Factory pipeline

I have an Azure pipeline that moves data from one point to another in parquet files. I need to join some data from a PostgreSQL database that sits in an AWS tenancy by a unique ID. I am using a dataflow to create the unique ID I need from two separate columns using a concatenate. I am trying to create a where clause, e.g.
select * from tablename where unique_id in ('id1','id2','id3'...)
I can do a lookup query to the database, but I can't figure out how to get the list of IDs from the dataflow output into a parameter that I can use in the select statement. I tried using a set variable and was going to put that into a for-each, but the set variable doesn't like the output of the dataflow (object instead of array): "The variable 'xxx' of type 'Array' cannot be initialized or updated with value of type 'Object'. The variable 'xxx' only supports values of types 'Array'." I've used a flatten to try to transform it to an array, but I think the sink operation is putting it back into JSON?
What is a workable approach to getting the IDs into a string that I can put into a lookup query?
Some notes:
The parquet file has a small number of unique IDs compared to the total unique IDs in the database.
If this were an Azure PostgreSQL database I could just use a join in the dataflow to do the join, but the generic PostgreSQL driver isn't available in dataflows. I can't copy the entire database over to Azure just to do the join, and I need the dataflow in Azure for non-technical reasons.
Edit:
For clarity sake, I am trying to replace local python code that does the following:
import pandas as pd

query = "select * from mytable where id_number in "
df = pd.read_parquet("input_file.parquet")
df['id_number'] = df.country_code + df.id
df_other_data = pd.read_sql(query + str(tuple(df.id_number)), conn)
I'd like to replace this locally executing code with ADF. In the ADF process, I have to replace the transformation of the IDs, which seems easy enough in a couple of different ways. Once I have the IDs in the proper format in a column in a dataset, I can't figure out how to query a database that isn't supported by Data Flow and restrict it to only the IDs I need, so that I don't bring down the entire database.
Variables in ADF can only store simple types, but we can define an Array type parameter in ADF and set a default value. Parameters in ADF support any type of element, including complex JSON structures.
For example:
Define a JSON array:
[{"name": "Steve","id": "001","tt_1": 0,"tt_2": 4,"tt3_": 1},{"name": "Tom","id": "002","tt_1": 10,"tt_2": 8,"tt3_": 1}]
Define an Array type parameter and set its default value:
That way we will not get the error above.

How to SET jsonb_column = json_build_array( string_column ) in Sequelize UPDATE?

I'm converting a one-to-one relationship into a one-to-many relationship. The old relationship was just a foreign key on the parent record. The new relationship will be an array of foreign keys on the parent record.
(Using Postgres dialect, BTW.)
First I'll add a new JSONB column, which will hold an array of UUIDs.
Then I'll run a query to update all existing rows such that the value from the old column is now stored in the new column (as the first element in an array).
Finally, I'll remove the old column.
I'm looking for help with step 2: writing the update statement that will update all rows, setting the value of the new column based on the value of the old column. Basically, I'm trying to figure out how to express this SQL query using Sequelize:
UPDATE "myTable"
SET "newColumn" = json_build_array("oldColumn")
-- ^^ this really works, btw
Where:
newColumn is type JSONB, and should hold an array (of UUIDs)
oldColumn is type UUID
names are double-quoted because they're mixed case in the DB (shrug)
Expressed using Sequelize sugar, that might be something like:
const { models } = require('../sequelize')
await models.MyModel.update({ newColumn: [ 'oldColumn' ] })
...except that would result in saving an array that contains the string "oldColumn" rather than an array whose first element is the value in that row's oldColumn column.
My experience, and the Sequelize documentation, is focused on working with individual rows via the standard instance methods. I could do that here, but it'd be a lot better to have the database engine do the work internally instead of forcing it to transfer every row to Node and then back again.
Looking for whatever is the most Sequelize-idiomatic way of doing this, if there is one.
Any help is appreciated.

Retrieve the fields we are interested in and write the fields + count result to a result file using Scala Spark

I need to import a CSV file that contains several fields; I must later loop over the fields that interest us to retrieve the data they contain.
In the file there is a field named query that contains SQL queries that must be executed, with the results stored in another CSV file that will contain the fields to retrieve as well as the result of each query.
Below is my code so far:
// step 1: read the file
val table_requete = spark.read.format("com.databricks.spark.csv").option("header","true").option("delimiter", ";").load("/user/swychowski/ClientAnlytics_Controle/00_Params/filtre.csv")
table_requete.registerTempTable("req")
// step 2: loop over the queries and store the results (this is where I'm stuck)
However, I don't know how to loop and store the results in another file at the same time.
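Since the rest of this thread is in Python, here is a rough PySpark sketch of one way to structure that loop; the column name query comes from the question, while the output path and the idea of writing query + count pairs are assumptions. The Scala version would follow the same shape:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# step 1: read the parameter file that holds the SQL queries
params = (spark.read
          .option("header", "true")
          .option("delimiter", ";")
          .csv("/user/swychowski/ClientAnlytics_Controle/00_Params/filtre.csv"))

# step 2: loop over the rows, run each query and keep its count
results = []
for row in params.collect():                 # the parameter file is small, so collect() is fine
    count = spark.sql(row["query"]).count()  # assumes the referenced tables/views are registered
    results.append((row["query"], count))

# step 3: write the queries and their counts to one result CSV
(spark.createDataFrame(results, ["query", "count"])
 .coalesce(1)
 .write.mode("overwrite")
 .option("header", "true")
 .csv("/user/swychowski/ClientAnlytics_Controle/01_Results"))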

How to assign foreign keys in Access within imported table from Excel

I will use an Access database instead of Excel. But I need to import data from one huge Excel sheet into several pre-prepared, normalized tables in Access. In the core Access table I have mainly the foreign keys from other tables (of course some other fields are texts or dates).
How should I perform the import in the easiest way? I cannot perform the import directly, because there is, for example, no "United States" string in the Access field 'Country'; there must be foreign key no. 84 from the table tblCountries. I am thinking about a DLOOKUP function in Excel to replace the strings with FKs... Do you know a simpler method?
Thank you, Martin
You don’t mention how you will get the Excel data into several Access tables, so I will assume you will import the entire Excel file into ONE large table then break out the data from there. I assume the imported data may NOT match with existing Access keys (i.e. misspellings, new values, etc.) so you will need to locate those so you can make corrections. This will involve creating a number of ‘unmatched queries’ then a number of ‘Update queries’, finally you can use Append queries to pull data from your import table into the final resting place. Using your example, you have imported ‘Country = United States’, but you need to relate that value to key “84”?
Let’s set some examples:
Assume you imported your Excel data into one large Access table. Also assume your import has three fields you need to get keys for.
You already have several control tables in Access similar to the following:
a. tblRegion: contains RegionCode, RegionName (i.e. 1=Pacific, 2=North America, 3=Asia, …)
b. tblCountry: contains CountryCode, Country, Region (i.e. 84 | United States | 2)
c. tblProductType: contains ProdCode, ProductType (i.e. VEH | vehicles; ELE | electrical; etc.)
d. Assume your imported data has fields
Here are the steps I would take:
If your Excel file does not already have columns to hold the key values (i.e. 84), add them before the import. Or after the import, modify the table to add the columns.
Create an ‘Unmatched query’ for each key field you need to relate. (Use ‘Query Wizard’, ‘Find Unmatched Query Wizard’.) This will show you all imported data that does not have a match in your key table, and you will need to correct those values. i.e.:
SELECT tblFromExcel.Country, tblFromExcel.Region, tblFromExcel.ProductType, tblFromExcel.SomeData
FROM tblFromExcel LEFT JOIN tblCountry ON tblFromExcel.[Country] = tblCountry.[CountryName]
WHERE (((tblCountry.CountryName) Is Null));
Update the FK with matching values:
UPDATE tblCountry
INNER JOIN tblFromExcel ON tblCountry.CountryName = tblFromExcel.Country
SET tblFromExcel.CountryFK = [CountryNbr];
Repeat the above Unmatched / Matched for all other key fields.