How to query nested fields in MongoDB using Presto - mongodb

I'm setting up a Presto cluster which I'd like to use to query a MongoDB instance. Data in my Mongo instance has the following structure:
{
    _id: <value>,
    somefield: <value>,
    otherfield: <value>,
    nesting_1: {
        nested_field_1_1: <value>,
        nested_field_1_2: <value>,
        ...
    },
    nesting_2: {
        nesting_2_1: {
            nested_field_2_1_1: <value>,
            nested_field_2_1_2: <value>,
            ...
        },
        nesting_2_2: {
            nested_field_2_2_1: <value>,
            nested_field_2_2_2: <value>,
            ...
        }
    }
}
Just by plugging it in, Presto correctly identifies and creates columns for the values in the top level (e.g. somefield, otherfield) and in the first nesting level -- that is, it creates a column for nesting_1 whose content is a row(nested_field_1_1 <type>, nested_field_1_2 <type>, ...), and I can query table.nesting_1.nested_field_1_1.
However, fields with an extra nesting layer (e.g. nesting_2 and everything within it) are missing from the table schema. Presto's documentation for the MongoDB connector does mention that:
At startup, this connector tries guessing fields’ types, but it might not be correct for your collection. In that case, you need to modify it manually. CREATE TABLE and CREATE TABLE AS SELECT will create an entry for you.
While that seems to explain my use case, it's not very clear on how to "modify it manually" -- a CREATE TABLE statement doesn't seem appropriate, as the table is already there. The documentation also has a section on how to declare fields and their types, but it's also not very clear on how to deal with multiple nesting levels.
My question is: how do I set up Presto's MongoDB connector so that I can query fields in the third nesting layer?
Answers can assume that:
all nested fields' names are known;
there are only 3 layers;
there is no need to preserve the layered table layout (i.e. I don't mind if my resulting Presto table has all nested fields as unique columns like somefield, rather than one field with rows like nesting_1 in the above example);
extra points if the solution doesn't require me to explicitly declare the names and types of all columns in the third layer, as I have over 1500 of them -- but this is not a hard requirement.

In mongodb.properties, the property mongodb.schema-collection can be used to describe the schemas of your MongoDB collections. As described in the documentation, this property is optional and the default is _schema.
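For reference, a minimal catalog file could look like this (the host name is a placeholder, and _schema is already the default; it is shown only for illustration):
connector.name=mongodb
mongodb.seeds=mongo-host:27017
mongodb.schema-collection=_schema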
it's not very clear on how to "modify it manually" -- a CREATE TABLE statement doesn't seem appropriate, as the table is already there.
It is supposed to be created and populated automatically, but what I've noticed is that it is not populated until some queries are executed, and it only generates the schema for the collections that are queried.
However, there is an open bug: some fields/columns are not automatically picked up.
Also, once an entry for a collection is created/populated it won't be updated automatically; any update needs to be done manually (if the collection starts to have new fields, they won't be detected automatically).
To update the schema manually, each column is just another entry in the fields array. As mentioned in the docs, an entry has three parts:
name: name of the column in the Presto table; it needs to match the name of the collection field.
type: Presto type of the column. Here are the available types; the ROW type can be used for nested properties.
hidden: hides the column from DESCRIBE <table name> and SELECT *. Defaults to false.
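Concretely, editing an existing entry from the mongo shell could look like this (a sketch only; mydb and mycollection are placeholders, and the fields array holds entries of the form described above):
db.getSiblingDB("mydb").getCollection("_schema").updateOne(
    { "table": "mycollection" },
    { "$set": { "fields": [ /* column entries as described above */ ] } }
)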
My question is: how do I setup Presto's MongoDB connector so that I can query fields in the third nesting layer?
The schema definition for a MongoDB collection like the one you posted would contain something like:
...
"fields": [
{
"name": "_id",
"type": "ObjectId",
"hidden": true
},
{
"name": "somefield",
"type": "varchar",
"hidden": false
},
{
"name": "otherfield",
"type": "varchar",
"hidden": false
},
{
"name": "nesting_1",
"type": "row(nested_field_1_1 varchar, nested_field_1_2 bigint)",
"hidden": false
},
{
"name": "nesting_2",
"type": "row(nesting_2_1 row(nested_field_2_1_1 varchar, nested_field_2_1_2 varchar),nesting_2_2 row(nested_field_2_2_1 varchar, nested_field_2_2_2 varchar))",
"hidden": false
}
]
...
The nested fields can then be queried using dot notation over the columns, like:
SELECT nesting_2.nesting_2_1.nested_field_2_1_1 FROM table;

If the MongoDB collection being queried does not have a fixed schema indicated in the _schema collection, Presto is not able to infer the document structure.
If you prefer, an option is to explicitly declare the schema in the connector configuration, using the mongodb.schema-collection property, as described in the documentation. You can set it to a different MongoDB collection that stores the same kind of schema entries, and create that collection directly.
Nested fields can be declared using the ROW data type, which is also described in the docs and behaves like what would be a struct or dictionary in other programming languages.
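As a sketch based on the structure in the question (the database name mydb and collection name mycollection are placeholders), a third-level field can be declared by nesting row(...) types inside the schema entry:
db.getSiblingDB("mydb").getCollection("_schema").insertOne({
    "table": "mycollection",
    "fields": [
        { "name": "_id", "type": "ObjectId", "hidden": true },
        {
            "name": "nesting_2",
            "type": "row(nesting_2_1 row(nested_field_2_1_1 varchar, nested_field_2_1_2 varchar))",
            "hidden": false
        }
    ]
})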

You can create a collection in MongoDB, for example "presto_schema" in your database, and insert a sample schema like this:
db.presto_schema.insertOne({
    "table": "your_collection",
    "fields": [
        {
            "name": "_id",
            "type": "ObjectId",
            "hidden": true
        },
        {
            "name": "last_name",
            "type": "varchar",
            "hidden": false
        },
        {
            "name": "id",
            "type": "varchar",
            "hidden": false
        }
    ]
})
In your presto mongodb.properties, add the property like this:
mongodb.schema-collection=presto_schema
From now on, Presto will use "presto_schema" instead of the default "_schema" to resolve your collection schemas.
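A quick way to check that the schema is picked up (the catalog and schema names here are assumptions; the table and column come from the sample above):
DESCRIBE mongodb.yourdatabase.your_collection;
SELECT last_name FROM mongodb.yourdatabase.your_collection LIMIT 10;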

Related

Is there a way to define single fields that are never indexed in firestore in all collections

I understand that indexes have a cost in Firestore. Most of the time we simply store objects without really caring about indexes, even when we don't want most of the fields to be indexed.
If I understand correctly, any field at any level is indexed by default. I.e. for the following document in pseudo JSON:
{
    "root_field1": "abc"  (indexed)
    "root_field2": "def"  (indexed)
    "root_field3": {
        "sub_field1": "ghi"  (indexed)
        "sub_field2": "jkl"  (indexed)
        "sub_field3": {
            "inner_field1": "mno"  (indexed)
            "inner_field2": "pqr"  (indexed)
        }
    }
}
Let’s assume that I have the following record
{
    "name": "abc",
    "birthdate": "2000-01-01",
    "gender": "m"
}
Let’s assume that I just want the field "name" to be indexed. One solution (A), which avoids having to specify every field, is to define the document this way (i.e. move the root fields into an unindexed sub-level) and exclude unindexed from being indexed:
{
    "name": "abc",
    "unindexed": {
        "birthdate": "2000-01-01",
        "gender": "m"
    }
}
Ideally I would like to just specify a prefix such as _ to prevent a field from being indexed, but there is no global solution for that.
{
    "name": "abc",
    "_birthdate": "2000-01-01",
    "_gender": "m"
}
Is my solution (A) correct and is there a more elegant generic solution?
Thanks!
According to the documentation
https://cloud.google.com/firestore/docs/query-data/indexing
Add a single-field index exemption
Single-field index exemptions allow you to override automatic index settings for specific fields in a collection. You can add a single-field exemption from the console:
Go to the Single Field Indexes section.
Click Add Exemption.
Enter a Collection ID and Field path.
Select new indexing settings for this field. Enable or disable automatically updated ascending, descending, and array-contains single-field indexes for this field.
Click Save Exemption.
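If you manage indexes with the Firebase CLI, the same kind of exemption can also be declared in firestore.indexes.json via fieldOverrides and deployed with firebase deploy --only firestore:indexes. The collection and field names below are just illustrative, and as far as I know an empty indexes array disables the automatic single-field indexes for that field:
{
    "indexes": [],
    "fieldOverrides": [
        {
            "collectionGroup": "people",
            "fieldPath": "birthdate",
            "indexes": []
        }
    ]
}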

Index Elasticsearch document with existing "id" field

I have documents that I want to index into Elasticsearch with an existing unique "id" field.
I get an array of documents from a REST API endpoint (e.g. http://some.url/api/products) in no particular order, and if a document with the _id already exists in Elasticsearch it should update and reindex the document.
I want to create a new document if no document with the _id in Elasticsearch exists and then update a document, if it matches with an existing document in Elasticsearch.
This could be done with:
PUT products/product/un1qu3-1d-b718-105973677e95
{
"id": "un1qu3-1d-b718-105973677e95",
"state": "packaged"
}
The basic idea is to use the provided "id" field to create or update a document. Extraction of _id from document fields seems deprecated (link), but the indexing/reindexing of documents with the "id" field can be done manually very easily with the Kibana dev tools, Postman, or a cURL request.
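For example, the same request as above could be sent with cURL (the host and port are assumptions):
# create or overwrite the document under the provided id
curl -X PUT "http://localhost:9200/products/product/un1qu3-1d-b718-105973677e95" \
  -H 'Content-Type: application/json' \
  -d '{"id": "un1qu3-1d-b718-105973677e95", "state": "packaged"}'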
I want to achieve this (re-)indexing of documents that I receive over this api endpoint programmatically.
Is it possible to achieve this with logstash or a simple cronjob? Does Elasticsearch provide any functionality for this? Or do I need to write some custom backend to achieve this?
I thought of either:
1) index the document into Elasticsearch with the "id" field of my document or
2) find an Elasticsearch query that first searches for the document with the specific "id" field and then updates the document.
I was unable to find a solution for either way and have no clue what a good approach would look like.
Can anyone point me into the right direction on how to achieve this, suggest a better approach or provide a solution?
Any help much appreciated!
Update
I solved the problem with the help of the accepted answer. I used Logstash, the Http_poller input plugin, this article: https://www.elastic.co/blog/new-way-to-ingest-part-1 and this elastic.co question: https://discuss.elastic.co/t/upsert-with-logstash/59116
My output of logstash looks like this at the moment:
output {
    elasticsearch {
        index => "products"
        document_type => "product"
        pipeline => "rename_id"
        document_id => "%{id}"
        doc_as_upsert => true
        action => "update"
    }
}
Update 2
Just for the sake of completeness, I added the "rename_id" pipeline:
{
    "rename_id": {
        "description": "_description",
        "processors": [
            {
                "set": {
                    "field": "_id",
                    "value": "{{id}}"
                }
            }
        ]
    }
}
It works this way!
Thanks a lot!
Peter,
If I understand correctly, you want to ingest your documents into Elasticsearch and will have some updates for these documents in the future?
If that's the case:
- Use your document's primary key as the id for the Elasticsearch documents.
- You can ingest the entire document with updated values; Elasticsearch will replace the previous document with the new one, given the primary key is the same. The old document with the same id will be deleted.
We use this approach for our search data.
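As a sketch using the index/type names from the question (the second id is made up), several documents can be indexed or reindexed under their own ids in a single round trip with the bulk API:
POST _bulk
{ "index": { "_index": "products", "_type": "product", "_id": "un1qu3-1d-b718-105973677e95" } }
{ "id": "un1qu3-1d-b718-105973677e95", "state": "packaged" }
{ "index": { "_index": "products", "_type": "product", "_id": "another-unique-id" } }
{ "id": "another-unique-id", "state": "shipped" }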
You can use ingest pipelines to extract the id from the body and the _create endpoint to only create a document if it does not exist. Minor note: if you could specify the id on the client side, indexing would be faster, as adding a pipeline adds a certain overhead.
PUT _ingest/pipeline/my_pipeline
{
    "description": "_description",
    "processors": [
        {
            "set": {
                "field": "_id",
                "value": "{{id}}"
            }
        }
    ]
}
PUT twitter/tweet/1?op_type=create&pipeline=my_pipeline
{
    "foo" : "bar",
    "id" : "123"
}

GET twitter/tweet/123

# this call will fail
PUT twitter/tweet/1?op_type=create&pipeline=my_pipeline
{
    "foo" : "bar",
    "id" : "123"
}
You can use a script to upsert (update or insert) your document:
POST /products/product/un1qu3-1d-b718-105973677e95/_update
{
    "script": {
        "inline": "ctx._source.state = \"packaged\"",
        "lang": "painless"
    },
    "upsert": {
        "id": "un1qu3-1d-b718-105973677e95",
        "state": "packaged"
    }
}
The above query finds the document with _id = "un1qu3-1d-b718-105973677e95".
If it finds such a document, it updates state to "packaged"; otherwise it creates a new document with the fields "id" and "state" (you can insert as many fields as you want).

How do I manage a sublist in Mongodb?

I have different types of data that would be difficult to model and scale with a relational database (e.g., a product type).
I'm interested in using Mongodb to solve this problem.
I am referencing the documentation at mongodb's website:
http://docs.mongodb.org/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/
For the data type that I am storing, I need to also maintain a relational list of id's where this particular product is available (e.g., store location id's).
In their example regarding "one-to-many relationships with embedded documents", they have the following:
{
    name: "O'Reilly Media",
    founded: 1980,
    location: "CA",
    books: [12346789, 234567890, ...]
}
I am currently importing the data with a spreadsheet, and want to use a batchInsert.
To avoid duplicates, I assume that:
1) I need to do an ensure index on the ID, and ignore errors on the insert?
2) Do I then need to loop through all the ID's to insert a new related ID to the books?
Your question could possibly be defined a little better, but let's consider the case that you have rows in a spreadsheet or other source that are all de-normalized in some way. So in a JSON representation the rows would be something like this:
{
    "publisher": "O'Reilly Media",
    "founded": 1980,
    "location": "CA",
    "book": 12346789
},
{
    "publisher": "O'Reilly Media",
    "founded": 1980,
    "location": "CA",
    "book": 234567890
}
So in order to get that sort of row result into the structure you wanted, one way to do this would be to use the "upsert" functionality of the .update() method.
Assuming you have some way of looping the input values and they are identified with some structure, an analog to this would be something like:
books.forEach(function(book) {
    db.publishers.update(
        { "name": book.publisher },
        {
            "$setOnInsert": {
                "founded": book.founded,
                "location": book.location
            },
            "$addToSet": { "books": book.book }
        },
        { "upsert": true }
    );
})
This essentially simplifies the code so that MongoDB is doing all of the data collection work for you. So where the "name" of the publisher is considered to be unique, what the statement does is first search for a document in the collection that matches the query condition given, i.e. the "name".
In the case where that document is not found, then a new document is inserted. So either the database or driver will take care of creating the new _id value for this document and your "condition" is also automatically inserted to the new document since it was an implied value that should exist.
The usage of the $setOnInsert operator is to say that those fields will only be set when a new document is created. The final part uses $addToSet in order to "push" the book values that have not already been found into the "books" array (or set).
The reason for the separation is for when a document is actually found to exist with the specified "publisher" name. In this case, all of the fields under the $setOnInsert will be ignored as they should already be in the document. So only the $addToSet operation is processed and sent to the server in order to add the new entry to the "books" array (set) and where it does not already exist.
So that would be simplified logic compared to aggregating the new records in code before sending a new insert operation. However, it is not very "batch"-like, as you are still sending one operation to the server for each row.
This is fixed in MongoDB version 2.6 and above as there is now the ability to do "batch" updates. So with a similar analog:
var batch = [];
books.forEach(function(book) {
    batch.push({
        "q": { "name": book.publisher },
        "u": {
            "$setOnInsert": {
                "founded": book.founded,
                "location": book.location
            },
            "$addToSet": { "books": book.book }
        },
        "upsert": true
    });
    if ( ( batch.length % 500 ) == 0 ) {
        db.runCommand({ "update": "publishers", "updates": batch });
        batch = [];
    }
});
if ( batch.length > 0 ) {
    db.runCommand({ "update": "publishers", "updates": batch });
}
So what this is doing is setting up all of the constructed update statements into a single call to the server, with a sensible number of operations sent in each batch, in this case once every 500 items processed. The actual limit is the BSON document maximum of 16MB, so this can be altered as appropriate for your data.
If your MongoDB version is lower than 2.6 then you either use the first form or do something similar to the second form using the existing batch insert functionality. But if you choose to insert then you need to do all the pre-aggregation work within your code.
All of the methods are of course supported with the PHP driver, so it is just a matter of adapting this to your actual code and which course you want to take.

Is it possible to make a "not modify" constraint on MongoDB subdocuments at creation?

I'd like to make a specific subdocument value from a MongoDB document fixed, so that it cannot be modified by a later update or any other MongoDB operation that can modify documents.
For example, if a document like the one below is inserted, I would like "eyesColor" to be a value that cannot be changed:
{
"id" : "someId",
"name": "Jane",
"eyesColor" : "blue"
}
A possible update can be:
{
"id" : "someId",
"name": "Amy",
"eyesColor" : "green"
}
And the result I need after this update is:
{
"id" : "someId",
"name": "Amy",
"eyesColor" : "blue"
}
I'd like to do this because the possibility of using the $set and $unset operators is not present in the project I'm creating, and a read of the existing document before the update, in order to get the value of the subdocument ("eyesColor"), would decrease the performance of the application I work on.
Actually the constraint I need is similar to the fixed size on collections (capped collections). The difference is that it applies to a subdocument instead of a collection, and to the value contained in the subdocument instead of the size.
Is there any solution for this type of constraint?
There are no constraints in MongoDB (the only exception: unique indexes). There is no way to make fields "read-only" at the database layer.
When you want to use upserts (db.collection.update with upsert: true) which add certain fields when inserting new documents but don't affect those fields on updates of existing documents, you can place these fields behind the $setOnInsert operator.
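A minimal sketch with the document from the question (the collection name people is an assumption): eyesColor is only written when the upsert inserts a new document, while name is updated either way.
db.people.update(
    { "id": "someId" },
    {
        "$set": { "name": "Amy" },
        "$setOnInsert": { "eyesColor": "blue" }
    },
    { "upsert": true }
)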

How do I rename a nested key in mongodb

I want to rename my dict key in MongoDB.
Normally it works like this: db.update({'_id':id},{$rename:{'oldfieldname':'newfieldname'}})
My document structure looks like this:
{
'data':'.....',
'field':{'1':{'data':....},'2':{'data'...}},
'more_data':'....',
}
If I want to set a new field in field 1, I do db.update({'_id':id},{$set:{'field.0.1.name':'peter'}}), and for field two it is 'field.1.2.name'.
I thought the rename would be similar, but it isn't... (like $rename:{'field.0.1': 2})
Here's a flexible method for renaming keys in a database
Given a document structure like this...
{
    "_id": ObjectId("4ee5e9079b14f74ef14ddd2f"),
    "code": "130.4",
    "description": "4'' Socket Plug",
    "technicalData": {
        "Drawing No": "50",
        "length": "200mm",
        "diameter": "20mm"
    }
}
I want to loop through all documents and rename technicalData["Drawing No"] to technicalData["Drawing Number"]
Run the following javascript in the execute panel in (the excellent) RockMongo
function remap(x) {
    var dNo = x.technicalData["Drawing No"];
    db.products.update({ "_id": x._id }, {
        $set: { "technicalData.Drawing Number": dNo },
        $unset: { "technicalData.Drawing No": 1 }
    });
}

db.products.find({ "technicalData.Drawing No": { $ne: null } }).forEach(remap);
The code will also run in a mongo shell
Your question is unclear but it seems you'd like to rename a field name within an array.
The short answer is you can't. As stated in the docs, $rename doesn't expand arrays to find a matching name. It only works on top level fields.
What you can do to simulate rename is by copying the field and its data to the new name, and then deleting the original field. You might also need a way to account for potentially concurrent writes if you have a lot of writes to that object/field.
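A sketch of that copy-then-delete approach for the structure in the question (the collection name and the new key name "one" are placeholders):
db.mycollection.find({ "field.1": { "$ne": null } }).forEach(function(doc) {
    db.mycollection.update(
        { "_id": doc._id },
        {
            // copy the value to the new key, then remove the old key
            "$set": { "field.one": doc.field["1"] },
            "$unset": { "field.1": 1 }
        }
    );
});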