Spark 2 MongoDB connector polymorphic schema

I have a collection col that contains
{
  '_id': ObjectId(...),
  'type': "a",
  'f1': data1
}
and, in the same collection,
{
  '_id': ObjectId(...),
  'f2': 222.234,
  'type': "b"
}
The Spark MongoDB connector does not handle this well: it shuffles values into the wrong fields.
For example, given:
{
  '_id': ObjectId(...),
  'type': "a",
  'f1': data1
}
{
  '_id': ObjectId(...),
  'f1': data2,
  'type': "a"
}
the resulting RDD will be:
--------------------------
| id   | f1    | type  |
--------------------------
| .... | a     | data1 |
| .... | data2 | a     |
--------------------------
Are there any suggestions for working with a polymorphic schema?

Are there any suggestions for working with a polymorphic schema?
(Opinion alert) The best suggestion is not to have one in the first place. It is impossible to maintain in the long term, extremely error-prone, and requires complex compensation on the client side.
What to do if you have one:
You can try using the Aggregation Framework with $project to sanitize the data before it is fetched into Spark. See the Aggregation section of the connector docs for an example, and the sketch after this list.
Don't try to couple it with a structured format. Use RDDs, fetch the data as plain Python dicts, and deal with the problem manually.
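One way to combine both suggestions without relying on connector-specific options is to run the aggregation with PyMongo and hand Spark plain dicts. This is only a sketch under assumptions: the host, database and collection names are placeholders, and the results are collected on the driver, so it only suits modest data sizes.

from pymongo import MongoClient
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("polymorphic-read").getOrCreate()
col = MongoClient("mongodb://host:27017")["db"]["col"]

# Sanitize on the server: keep one "type" and project a fixed set of fields,
# so every document handed to Spark has the same shape. _id is dropped here
# only to keep the example simple (ObjectId needs separate conversion).
pipeline = [
    {"$match": {"type": "a"}},
    {"$project": {"_id": 0, "type": 1, "f1": 1}},
]
docs_a = list(col.aggregate(pipeline))

# Either stay with plain dicts as an RDD...
rdd_a = spark.sparkContext.parallelize(docs_a)
# ...or build a DataFrame, now that every record has an identical, known shape.
df_a = spark.createDataFrame(docs_a)
df_a.show()

Reading each type through its own pipeline gives you one well-defined schema per DataFrame; the dict-based route trades schema safety for flexibility.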

Related

Pymongo Aggregate with Group and Push

I have a collection in the form below:
UserName | Device | Device Rights
I have multiple documents for each user with different devices.
I am trying to aggregate and store the data grouped by user, to get output like below:
UserName | Device | Device Rights | Device 2 | Device 2 Rights ...
Using the below code in Pymongo:
cur = data1_dbh.access_data
cur.aggregate = ([
    {'$group': {
        '_id': "$username",
        'accessdetails': {'$push': {'device': "$device_id", 'accesstype': "$access_type"}},
    }}
])
x = cur.aggregate
for i in x:
    print(i)
But I get a single line as output. What am I missing?
{'$group': {'_id': '$username', 'accessdetails': {'$push': {'device': '$device_id', 'accesstype': '$access_type'}}}}
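The snippet above assigns a list to cur.aggregate instead of calling the method, so iterating over it just prints the pipeline back. A minimal sketch of the intended call, with the collection and field names taken from the question, might look like this:

pipeline = [
    {'$group': {
        '_id': '$username',
        'accessdetails': {'$push': {'device': '$device_id', 'accesstype': '$access_type'}},
    }}
]

# aggregate() must be called with the pipeline; the result is a cursor of grouped documents.
for doc in data1_dbh.access_data.aggregate(pipeline):
    print(doc)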

Debezium + Schema Registry Avro Schema: why do I have the "before" and "after" fields, and how do I use that with HudiDeltaStreamer?

I have a table in PostgreSQL with the following schema:
Table "public.kc_ds"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-----------------------+-----------+----------+-----------------------------------+----------+--------------+-------------
id | integer | | not null | nextval('kc_ds_id_seq'::regclass) | plain | |
num | integer | | not null | | plain | |
text | character varying(50) | | not null | | extended | |
Indexes:
"kc_ds_pkey" PRIMARY KEY, btree (id)
Publications:
"dbz_publication"
When I run a Debezium source connector for this table that uses io.confluent.connect.avro.AvroConverter and Schema Registry, it creates a Schema Registry schema that looks like this (some fields are omitted here):
"fields":[
{
"name":"before",
"type":[
"null",
{
"type":"record",
"name":"Value",
"fields":[
{
"name":"id",
"type":"int"
},
{
"name":"num",
"type":"int"
},
{
"name":"text",
"type":"string"
}
],
"connect.name":"xxx.public.kc_ds.Value"
}
],
"default":null
},
{
"name":"after",
"type":[
"null",
"Value"
],
"default":null
},
]
The messages in my Kafka topic that are produced by Debezium look like this (some fields are omitted):
{
  "before": null,
  "after": {
    "xxx.public.kc_ds.Value": {
      "id": 2,
      "num": 2,
      "text": "text version 1"
    }
  }
}
When I INSERT or UPDATE, "before" is always null, and "after" contains my data; when I DELETE, the inverse holds true: "after" is null and "before" contains the data (although all fields are set to default values).
Question #1: Why does Kafka Connect create a schema with "before" and "after" fields? Why do those fields behave in such a weird way?
Question #2: Is there a built-in way to make Kafka Connect send flat messages to my topics while still using Schema Registry? Please note that the Flatten transform is not what I need: if enabled, I will still have the "before" and "after" fields.
Question #3 (not actually hoping for anything, but maybe someone knows): The necessity to flatten my messages comes from the fact that I need to read the data from my topics using HudiDeltaStreamer, and it seems like this tool expects flat input data. The "before" and "after" fields end up being separate object-like columns in the resulting .parquet files. Does anyone have any idea how HudiDeltaStreamer is supposed to integrate with messages produced by Kafka Connect?

Write update mongo query with multiple where clause

I'm new to Mongo and struggling to write an update query with multiple where clauses.
Let's say I have an employee list:
name |emp_id | old phone number | new phone number
Steve |123 | 123-456-7896 | 801-123-4567
John |456 | 123-654-9878 | 702-123-4567
Steve |789 | 789-123-7890 | 504-123-4567
I would like to write a mongo query that essentially says: update to the new phone number where both name and emp_id match. There are about 200 entries.
Since you have multiple entries to update, you can do something like this.
You say you have a list; assuming it is in a file or somewhere you can process through code, first read the data into an array:
let dataArray = [{name: 'Steve', emp_id: 123, new_phone_number: '801-123-4567'}, {name: 'John', emp_id: 456, new_phone_number: '702-123-4567'}, ....]
db.getCollection('YourCollection').bulkWrite(
    dataArray.map((each) =>
        ({
            updateOne: {
                filter: { 'name': each.name, 'emp_id': each.emp_id },
                update: { $set: { phone_number: each.new_phone_number } }
            }
        })
    )
)
Here we're treating phone_number as the field to be updated; we're not maintaining two separate fields old_phone_number and new_phone_number in the documents.
Or, if you want to keep both phone numbers in the document:
db.getCollection('YourCollection').bulkWrite(
    dataArray.map((each) =>
        ({
            updateOne: {
                filter: { 'name': each.name, 'emp_id': each.emp_id },
                update: { $set: { new_phone_number: each.new_phone_number } }
            }
        })
    )
)
As above, but here we're adding a new field new_phone_number to all matched documents; the final outcome will have the old phone number in phone_number and the new one in new_phone_number.
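If you are driving this from Python instead of the shell, a rough PyMongo equivalent of the same bulk update might look like this (a sketch only: the connection URI, database/collection names and the data list are assumptions based on the question):

from pymongo import MongoClient, UpdateOne

coll = MongoClient("mongodb://localhost:27017")["mydb"]["employees"]

data = [
    {"name": "Steve", "emp_id": 123, "new_phone_number": "801-123-4567"},
    {"name": "John",  "emp_id": 456, "new_phone_number": "702-123-4567"},
]

# One round trip for all updates; each entry matches on both name and emp_id.
requests = [
    UpdateOne({"name": d["name"], "emp_id": d["emp_id"]},
              {"$set": {"phone_number": d["new_phone_number"]}})
    for d in data
]
result = coll.bulk_write(requests)
print(result.modified_count)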
Also, since you haven't mentioned anything in particular: if you really do need to update each record individually, making multiple DB calls, you can use .update(), .updateOne(), .updateMany() or .findAndModify(), depending on your requirements.
By default, MongoDB's update() method updates a single record. But if you want to update multiple documents matching your criteria, you can use the multi option (set it to true).
db.people.update(
    { name: "Steve", emp_id: 123 },                   // first parameter: the query
    { $set: { new_phone_number: "111-111-111" } },    // second parameter: the update operation
    { multi: true }                                   // third parameter: options
)
Also look at updateOne() and updateMany().

How to apply a user defined function (UDF) to a document field in mongoDb?

Let's say you have the following schema:
employee( firstname, lastname, dept)
Say we want the lastname to be uppercase for all the rows in the table (uppercasing might be a built-in function in the DBMS, but let's assume it's a UDF).
In SQL we would have done the following:
SELECT firstname, toUpper(lastname) AS lastname, dept
FROM employees
What is the equivalent in MongoDB, given that we have a collection employees containing the following documents:
...
{ "firstname": "jhon", "lastname": "doe", "dept": "marketing"}
...
{ "firstname": "jean", "lastname": "dupont", "dept": "sales"}
...
The goal is to obtain the transformed data directly using the MongoDB query API, without writing extra JS code to get the job done.
I know how to save/import a UDF in MongoDB. What I don't know is how to apply it to a result:
db.employees.find().apply_my_udf(....)
Use .aggregate() and $toUpper in a $project:
db.employees.aggregate([
    { "$project": {
        "firstname": 1,
        "lastname": { "$toUpper": "$lastname" },
        "dept": 1
    }}
])
The aggregation framework is basically where you go for "transformations" on data of any kind. The .find() method is really just for "plain selection" of documents and supports simple "inclusion" or "exclusion" of properties only.
If you have some familiarity with SQL, then the SQL to Aggregation Mapping Chart in the core documentation is a good place to start for understanding how to apply common phrases.
Of course as an "alternative" approach you can also apply the "transform" to the data "after" the results are fetched from the server. All drivers provide some form of cursor.map() for this very purpose.
As a JavaScript shell example:
db.employees.find().map( d => Object.assign(d, { lastname: d.lastname.toUpperCase() }) )
Which essentially does the same thing, but the transformation is done as each document is retrieved from the server as opposed to transforming on the server itself.
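For reference, the same server-side aggregation from Python with PyMongo might look like the sketch below; the connection URI and database name are assumptions, and the pipeline simply mirrors the shell example above.

from pymongo import MongoClient

employees = MongoClient("mongodb://localhost:27017")["mydb"]["employees"]

pipeline = [
    {"$project": {
        "firstname": 1,
        "lastname": {"$toUpper": "$lastname"},  # transform happens on the server
        "dept": 1,
    }}
]

for doc in employees.aggregate(pipeline):
    print(doc)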

JSON import nested object to relational tables with Talend

I am trying to import data from MongoDB to a relational DB (SQL Server).
I don't have access to the MongoDB components, so I am querying my collection with the Mongo Java driver in a tJava component.
I get a:
List< DBObject >
which I send to a tExtractJSONFields component.
An object in my collection looks like this:
[
    {
        "_id": {
            "$oid": "1564t8re13e4ter86"
        },
        "object": {
            "shop": "shop1",
            "domain": "Divers",
            "sell": [
                {
                    "location": {
                        "zipCode": "58000",
                        "city": "NEVERS"
                    },
                    "properties": {
                        "description": "ddddd!!!!",
                        "id": "f1re67897116fre87"
                    },
                    "employee": [
                        {
                            "name": "employee1",
                            "id": "245975"
                        },
                        {
                            "name": "employee2",
                            "id": "458624"
                        }
                    ],
                    "customer": {
                        "name": "Customer1",
                        "custid": "test_réf"
                    }
                }
            ]
        }
    }
]
For a sell, I can have several employees. I have an array of employees and I want to store the affected employees in another table. So I would have 2 tables:
Sell
oid | shop | domain | zipCode | ...
1564t8re13e4ter86 | shop1 | Divers | 58000 | ...
Affected employee
employee_id | employee_name | oid
245975 | employee1 | 1564t8re13e4ter86
458624 | employee2 | 1564t8re13e4ter86
So I want to loop over the employee array with a JSONPath query:
"$[*].object.sell[0].employee"
The problem is that, doing it like this, I can't get the object _id. It seems that I can't access an attribute of a parent node when I define my JSONPath query this way.
I also saw that I can do it as in the following link:
http://techpoet.blogspot.ro/2014/06/dealing-with-nested-documents-in.html?utm_content=buffer02d59&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
But I don't understand how he gets the object_id at the lower levels.
How can I do this?
My tests with JSONPath failed as well, but I think this must be a bug in the component, because when I query $..['$oid'] I get back [].
This seems to be the case whenever you try to get a node that is at a higher level than the loop query.
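As a workaround that is independent of Talend, you could flatten the document in code (for example in the tJava step that already holds the DBObject list) before loading it into SQL Server. A rough Python sketch of the shape of that transformation, using the sample document above (field names as in the question, and only a subset of the Sell columns shown):

# Walk each document, emit one "sell" row and one "employee" row per array element,
# carrying the parent oid into both tables.
doc = {
    "_id": {"$oid": "1564t8re13e4ter86"},
    "object": {
        "shop": "shop1",
        "domain": "Divers",
        "sell": [{
            "location": {"zipCode": "58000", "city": "NEVERS"},
            "employee": [
                {"name": "employee1", "id": "245975"},
                {"name": "employee2", "id": "458624"},
            ],
        }],
    },
}

sell_rows, employee_rows = [], []
oid = doc["_id"]["$oid"]
for sell in doc["object"]["sell"]:
    sell_rows.append({
        "oid": oid,
        "shop": doc["object"]["shop"],
        "domain": doc["object"]["domain"],
        "zipCode": sell["location"]["zipCode"],
    })
    for emp in sell["employee"]:
        employee_rows.append({
            "employee_id": emp["id"],
            "employee_name": emp["name"],
            "oid": oid,
        })

print(sell_rows)      # -> rows for the Sell table
print(employee_rows)  # -> rows for the Affected employee table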