Import nested JSON objects into relational tables with Talend - mongodb

I am trying to import data from MongoDB to a relational DB (SQL Server).
I don't have access to the MongoDB components, so I am querying my collection with the MongoDB Java driver in a tJava component.
I get a List<DBObject>, which I send to a tExtractJSONFields component.
A document in my collection looks like this:
[
  {
    "_id": {
      "$oid": "1564t8re13e4ter86"
    },
    "object": {
      "shop": "shop1",
      "domain": "Divers",
      "sell": [
        {
          "location": {
            "zipCode": "58000",
            "city": "NEVERS"
          },
          "properties": {
            "description": "ddddd!!!!",
            "id": "f1re67897116fre87"
          },
          "employee": [
            {
              "name": "employee1",
              "id": "245975"
            },
            {
              "name": "employee2",
              "id": "458624"
            }
          ],
          "customer": {
            "name": "Customer1",
            "custid": "test_réf"
          }
        }
      ]
    }
  }
]
For a sell, I can have several employees. I have an array of employees and I want to store the affected employees in another table. So I would have two tables:
Sell
oid | shop | domain | zipCode | ...
1564t8re13e4ter86 | shop1 | Divers | 58000 | ...
Affected employee
employee_id | employee_name | oid
245975 | employee1 | 1564t8re13e4ter86
458624 | employee2 | 1564t8re13e4ter86
So I want to loop over the employee array with a JSONPath query:
"$[*].object.sell[0].employee"
The problem is that, this way, I can't get the object _id. It seems that I can't read an attribute from a parent node when I define my JSONPath loop query like this.
I also saw that I can do it like in the following link:
http://techpoet.blogspot.ro/2014/06/dealing-with-nested-documents-in.html?utm_content=buffer02d59&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
But I don't understand where he gets the object_id at the lower levels.
How can I do this?

My tests with JSONPath failed as well, but I think this must be a bug in the component, because when I query
$..['$oid'] I get back -> [].
This seems to happen whenever you try to read a node that sits at a higher level than the loop query.
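One possible workaround, since the collection is already queried through the MongoDB Java driver in tJava: flatten the sell and employee arrays on the MongoDB side so each employee row already carries the parent oid before it reaches tExtractJSONFields. A minimal mongo shell sketch of such a pipeline (the collection name mycollection is a placeholder; the same pipeline can also be issued through the Java driver):
// Hedged sketch: unwind sell and employee so each output row pairs an employee with its parent _id.
db.mycollection.aggregate([
  { $unwind: "$object.sell" },
  { $unwind: "$object.sell.employee" },
  {
    $project: {
      _id: 0,
      oid: "$_id",
      employee_id: "$object.sell.employee.id",
      employee_name: "$object.sell.employee.name"
    }
  }
])
Each output document then maps directly onto a row of the "Affected employee" table above.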

Related

How to find elements in an array exported from Firestore in BigQuery

I need to get the count of each item that has been sold, so I am exporting my Firestore data to BigQuery.
My Firestore collection contains an array of objects in a field called line_items:
[
  {
    price: 10,
    qyt: 2,
    name: car
  },
  {
    price: 4,
    qyt: 1,
    name: pen
  }
]
I created a schema.json, declared the field type as array, and exported the collection data:
{
  "name": "line_items",
  "type": "array"
}
Now I am trying to write a query in the Google Cloud console (BigQuery) to get this result:
name |qty
car | 2
pen | 1
My issue is that I am not able to convert my array of objects into columns, so I can't get the quantity of each item sold. This is how the data looks in BigQuery; it is not showing as an array of objects.
You have to UNNEST the array before you can access its elements:
SELECT li.name, li.qyt AS qty FROM your_dataset.your_table, UNNEST(line_items) AS li
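If the goal is the total quantity sold per item, a hedged follow-up on the same UNNEST (table name as above; note the quantity field is spelled qyt in the exported data):
-- Sum the quantity per item name across all rows
SELECT li.name, SUM(li.qyt) AS total_qty
FROM your_dataset.your_table, UNNEST(line_items) AS li
GROUP BY li.name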

How to get all field names and data types from a MongoDB instance

I have a MongoDB instance with a bunch of collections. I need to create an Excel sheet where one column is the path to the field, and the second one is the type this key has.
For instance, if this is my item:
{
  _id: ObjectId("697a6s98689asdfd89s"),
  name: "matias",
  status: {
    enabled: true,
    role: "developer"
  }
}
Then I want to get this:
Field | Type
_id | ObjectID
name | string
status | Object
status.enabled | boolean
status.role | string
Obviously this can be done through code, but is there any way to do this using the mongo shell with a query? Or maybe get the JSON out from the shell and use jq/bash to print the table?
Note: the following is adapted from another answer; it does not work, but it gets really close.
jq -r '. as $root |
  path(..) | . as $path |
  $root | getpath($path) as $value |
  select($value | scalars) |
  ([$path[] | tostring] | join(".")) + " = " + ($value | type)
' < item.json
Well, I ended up using this tool: https://github.com/variety/variety
It is easy to download and use and does exactly what I needed.
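For a shell-only alternative without an external tool, here is a rough aggregation sketch that lists top-level field names and BSON types using $objectToArray; it does not recurse into nested documents such as status.enabled, which is the part variety.js handles. The collection name mycollection is a placeholder:
// Hedged sketch: list each top-level field name with its BSON type.
// Nested paths would need recursion or a tool like variety.js.
db.mycollection.aggregate([
  { $project: { kv: { $objectToArray: "$$ROOT" } } },
  { $unwind: "$kv" },
  { $group: { _id: { field: "$kv.k", type: { $type: "$kv.v" } } } },
  { $project: { _id: 0, field: "$_id.field", type: "$_id.type" } },
  { $sort: { field: 1 } }
])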

Debezium + Schema Registry Avro Schema: why do I have the "before" and "after" fields, and how do I use that with HudiDeltaStreamer?

I have a table in PostgreSQL with the following schema:
Table "public.kc_ds"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-----------------------+-----------+----------+-----------------------------------+----------+--------------+-------------
id | integer | | not null | nextval('kc_ds_id_seq'::regclass) | plain | |
num | integer | | not null | | plain | |
text | character varying(50) | | not null | | extended | |
Indexes:
"kc_ds_pkey" PRIMARY KEY, btree (id)
Publications:
"dbz_publication"
When I run a Debezium source connector for this table that uses io.confluent.connect.avro.AvroConverter and Schema Registry, it creates a Schema Registry schema that looks like this (some fields are omitted here):
"fields":[
{
"name":"before",
"type":[
"null",
{
"type":"record",
"name":"Value",
"fields":[
{
"name":"id",
"type":"int"
},
{
"name":"num",
"type":"int"
},
{
"name":"text",
"type":"string"
}
],
"connect.name":"xxx.public.kc_ds.Value"
}
],
"default":null
},
{
"name":"after",
"type":[
"null",
"Value"
],
"default":null
},
]
The messages in my Kafka topic that are produced by Debezium look like this (some fields are omitted):
{
  "before": null,
  "after": {
    "xxx.public.kc_ds.Value": {
      "id": 2,
      "num": 2,
      "text": "text version 1"
    }
  }
}
When I INSERT or UPDATE, "before" is always null, and "after" contains my data; when I DELETE, the inverse holds true: "after" is null and "before" contains the data (although all fields are set to default values).
Question #1: Why does Kafka Connect create a schema with "before" and "after" fields? Why do those fields behave in such a weird way?
Question #2: Is there a built-in way to make Kafka Connect send flat messages to my topics while still using Schema Registry? Please note that the Flatten transform is not what I need: if enabled, I will still have the "before" and "after" fields.
Question #3 (not actually hoping for anything, but maybe someone knows): The necessity to flatten my messages comes from the fact that I need to read the data from my topics using HudiDeltaStreamer, and it seems like this tool expects flat input data. The "before" and "after" fields end up being separate object-like columns in the resulting .parquet files. Does anyone have any idea how HudiDeltaStreamer is supposed to integrate with messages produced by Kafka Connect?
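Regarding Question #2: Debezium itself ships a single message transform, io.debezium.transforms.ExtractNewRecordState (distinct from Kafka Connect's generic Flatten transform), which unwraps the before/after envelope so only the new row state is emitted. A hedged connector configuration sketch, with connector-specific settings omitted:
# Hedged sketch: unwrap the Debezium CDC envelope so records carry only the row fields.
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
# optionally keep tombstone records for deletes
transforms.unwrap.drop.tombstones=false
Whether the flattened records are enough for HudiDeltaStreamer would still need to be verified on the Hudi side.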

Write an update mongo query with multiple where clauses

I'm new to mongo and struggling to write an update query with multiple where clauses.
Let's say I have an employee list:
name |emp_id | old phone number | new phone number
Steve |123 | 123-456-7896 | 801-123-4567
John |456 | 123-654-9878 | 702-123-4567
Steve |789 | 789-123-7890 | 504-123-4567
I would like to write a mongo query essentially saying:
Update to the new phone number where both name and emp_id match. The list has about 200 entries.
Since you have multiple entries to update, you can do something like this.
You say you have a list; assuming it is in a file or somewhere to be processed through code, first read the data into an array:
let dataArray = [
  { name: 'Steve', emp_id: 123, new_phone_number: '801-123-4567' },
  { name: 'John', emp_id: 456, new_phone_number: '702-123-4567' },
  ....
]
db.getCollection('YourCollection').bulkWrite(
  dataArray.map((each) => ({
    updateOne: {
      filter: { 'name': each.name, 'emp_id': each.emp_id },
      update: { $set: { phone_number: each.new_phone_number } }
    }
  }))
)
Here we're treating phone_number as the field to be updated; we didn't consider maintaining the two fields old_phone_number and new_phone_number in the documents.
Or, in case you want to keep two phone numbers in each document:
db.getCollection('YourCollection').bulkWrite(
  dataArray.map((each) => ({
    updateOne: {
      filter: { 'name': each.name, 'emp_id': each.emp_id },
      update: { $set: { new_phone_number: each.new_phone_number } }
    }
  }))
)
As above, we're adding a new field new_phone_number to all matched documents; the final outcome will have the old phone number in the field phone_number and the new phone number in the field new_phone_number.
Also, as you haven't mentioned anything particular, if you really need to update each record individually, making multiple DB calls, then you can use .update(), .updateOne(), .updateMany(), or .findAndModify(), depending on your requirement/interest:
findAndModify, update, updateOne, updateMany
By default, MongoDB's update() method updates a single document. But if you want to update multiple documents matching the criteria, you can use the multi option (set it to true):
db.people.update(
  { name: "Steve", emp_id: 123 }, // first parameter of the update method: define your query here
  {
    $set: { new_phone_number: "111-111-111" } // second parameter: the update operation
  },
  { multi: true } // third parameter: options
)
Also look at updateOne() and updateMany().
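For reference, a minimal sketch of the same "multiple where clause" update with updateMany(), reusing the people collection and field names from the example above; it updates every document matching both conditions without needing the multi option:
// Hedged sketch: update all documents where both name and emp_id match.
db.people.updateMany(
  { name: "Steve", emp_id: 123 },
  { $set: { new_phone_number: "111-111-111" } }
)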

Spark2 mongodb connector polymorphic schema

I have a collection col that contains:
{
  '_id': ObjectId(...),
  'type': "a",
  'f1': data1
}
and in the same collection I have:
{
  '_id': ObjectId(...),
  'f2': 222.234,
  'type': "b"
}
The Spark MongoDB connector is not working correctly: it puts the data into the wrong fields. For example, given these two documents:
{
  '_id': ObjectId(...),
  'type': "a",
  'f1': data1
}
{
  '_id': ObjectId(...),
  'f1': data2,
  'type': "a"
}
The RDD will be:
------------------------
| id | f1 | type |
------------------------
| .... | a | data1 |
| .... | data2 | a |
------------------------
Are there any suggestions for working with a polymorphic schema?
Are there any suggestions for working with a polymorphic schema?
(Opinion alert) The best suggestion is not to have one in the first place. It is impossible to maintain in the long term, extremely error prone and requires complex compensation on the client side.
What to do if you have one:
You can try using the Aggregation Framework with $project to sanitize the data before it is fetched into Spark (see the Aggregation section of the docs for an example; a sketch of such a pipeline is shown after this answer).
Don't try to couple it with a structured format. Use RDDs, fetch the data as plain Python dicts, and deal with the problem manually.
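As an illustration of the first suggestion, a minimal mongo shell sketch of the kind of $project pipeline that could normalize the col collection before Spark reads it; the default values chosen here are assumptions:
// Hedged sketch: give every document the same set of fields so the inferred schema stays stable.
db.col.aggregate([
  {
    $project: {
      _id: 1,
      type: { $ifNull: ["$type", "unknown"] }, // always present, defaulted when missing
      f1: { $ifNull: ["$f1", null] },          // only set on type "a" documents
      f2: { $ifNull: ["$f2", null] }           // only set on type "b" documents
    }
  }
])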