Database with ability to search JSON object - mongodb

Please help me find a database solution.
Here is the main requirement: One of the columns (the data column) holds a JSON data object which has to be easily searched.
Database row fields:
source (string)
action (string)
description (string)
timestamp (epoch timestamp)
data (can be anything but most likely JSON object or a string)
The nature/schema of this JSON object is unknown (e.g. it can be 1 level deep or 3). The important thing is that the object's tree can be searched.
One possible object:
{
"things": [
{
"ip": "123.13.3.3.",
"sn": "ADF"
},
{
"ip": "123.13.3.3.",
"sn": "ABC"
}
]
}
Another possible object:
{
"ip": "123.13.3.3.",
"sn": "ABC"
}
Example: I want to return all the rows where the data column has a JSON object that has the key/value pair of "sn": "ABC" somewhere in its tree.
Additional Requirements:
Low overhead and/or simple to set up
This database will not be getting hundreds/thousands/millions of hits per minute (now or ever)
Python and/or PHP hooks
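For illustration only, the tree search described above can be sketched in plain Python, independent of any particular database. The function name contains_pair and the sample rows are hypothetical, and this is a client-side filter rather than an indexed database query.

```python
import json


def contains_pair(node, key, value):
    """Return True if the key/value pair occurs anywhere in the JSON tree."""
    if isinstance(node, dict):
        if key in node and node[key] == value:
            return True
        return any(contains_pair(child, key, value) for child in node.values())
    if isinstance(node, list):
        return any(contains_pair(child, key, value) for child in node)
    return False  # scalars (strings, numbers, None) cannot hold a pair


# Hypothetical rows; only the fields needed for the example are shown.
rows = [
    {"source": "a", "data": json.loads(
        '{"things": [{"ip": "123.13.3.3.", "sn": "ADF"},'
        ' {"ip": "123.13.3.3.", "sn": "ABC"}]}')},
    {"source": "b", "data": json.loads('{"ip": "123.13.3.3.", "sn": "XYZ"}')},
    {"source": "c", "data": "just a string"},
]

matches = [row["source"] for row in rows if contains_pair(row["data"], "sn", "ABC")]
print(matches)
```

Because the walk handles both objects and arrays, it does not matter whether the pair sits at the top level or inside a nested list.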


Check If the Array contains value in Azure Data Factory

I need to process files in a container using Azure Data Factory and keep track of the processed files for the next execution.
So I keep a table in the DB which stores the processed file information.
In ADF I get the file names of the processed files, and I want to check whether the current file has been processed or not.
I am Using Lookup activity: Get All Files Processed
to get the processed files from DB by using below query:
select FileName from meta.Processed_Files;
Then I am traversing over the directory, and getting File Details for current File in the directory by using Get Metadata Activity: "Get Detail of Current File in Iteration"
and in the If Condition activity, I am using following Expression:
@not(contains(activity('Get All Files Processed').output.value,activity('Get Detail of current file in iteration').output.itemName))
This always returns True, even if the file has already been processed.
How do we compare the FileName against the returned value?
Output of activity('Get All Files Processed').output.value
{
"count": 37,
"value": [
{
"FileName": "20210804074153AlteryxRunStats.xlsx"
},
{
"FileName": "20210805074129AlteryxRunStats.xlsx"
},
{
"FileName": "20210806074152AlteryxRunStats.xlsx"
},
{
"FileName": "20210809074143AlteryxRunStats.xlsx"
},
{
"FileName": "20210809074316AlteryxRunStats.xlsx"
},
{
"FileName": "20210810074135AlteryxRunStats.xlsx"
},
{
"FileName": "20210811074306AlteryxRunStats.xlsx"
},
Output of activity('Get Detail of current file in iteration').output.itemName
"20210804074153AlteryxRunStats.xlsx"
I often pass this type of thing off to SQL in Azure Data Factory (ADF) too, especially if I've got one in the architecture. However bearing in mind that any hand-offs in ADF take time, it is possible to check if an item exists in an array using contains, eg a set of files returned from a Lookup.
Background
Ordinary arrays normally look like this: [1,2,3] or ["a","b","c"], but if you think about values that get returned in ADF, eg from Lookups, they look more like this:
{
"count": 3,
"value": [
{
"Filename": "file1.txt"
},
{
"Filename": "file2.txt"
},
{
"Filename": "file3.txt"
}
],
"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (North Europe)",
"billingReference": {
"activityType": "PipelineActivity",
"billableDuration": [
{
...
So what you've got is a complex piece of JSON representing an object (the return value of the Lookup activity plus some additional useful info about the execution), and the array we are interested in is held in the value property. However, each element of that array has additional curly brackets, ie each element is itself an object rather than a plain string.
Solution
So the thing to do is to pass to contains something that will look like your object which has the single attribute Filename. Use concat to create the string and json to make it authentic:
@contains(activity('Lookup').output.value, json(concat('{"Filename":"',pipeline().parameters.pFileToCheck,'"}')))
Here I'm using a parameter which holds the filename to check but this could also be a variable or output from another Lookup activity.
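The difference between comparing a string and comparing an object can be illustrated with a rough Python analogy. This is not ADF code: Python list membership merely stands in for contains, and the names lookup_value and file_to_check are hypothetical.

```python
import json

# Hypothetical stand-in for activity('Lookup').output.value
lookup_value = [
    {"Filename": "file1.txt"},
    {"Filename": "file2.txt"},
    {"Filename": "file3.txt"},
]

# Stands in for pipeline().parameters.pFileToCheck
file_to_check = "file2.txt"

# A bare string is never equal to an object, so this test fails.
string_match = file_to_check in lookup_value

# Wrapping the name in a one-attribute object first, as json(concat(...)) does,
# gives something that can equal an element of the array.
wrapped = json.loads('{"Filename":"' + file_to_check + '"}')
object_match = wrapped in lookup_value

print(string_match, object_match)
```

JSON keys are case-sensitive, so the attribute name in the constructed object should match the column name returned by the Lookup exactly (the question's data uses FileName, for instance).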
I tried something like this.
From the SQL table, bring back all the processed files as comma-separated values using select STRING_AGG(processedfile, ',') as files in a Lookup activity.
Assign the comma-separated value to an array variable (test) using the split function:
@split(activity('Lookup1').output.value[0]['files'],',')
Use a Get Metadata activity to get the current files in the directory.
Use a Filter activity to filter the files in the current directory against the processed files.
items:
@activity('Get Metadata1').output.childitems
condition:
@not(contains(variables('test'),item().name))

Query nested array performance, Mongo vs ElasticSearch

I have a music app that has a job to find music recommendations based on a tag id.
There are two entities involved:
Song - a song record contains its name and a list of music tag ids (genres) this song belongs to
MusicTag - the music tag itself, includes id, name etc.
Data is currently stored in MongoDB.
The Songs collections in mongo have millions of songs, and each song has an average of 7 tag ids.
The MusicTags has about 30K records.
The Songs collection looks like that:
[
{
name: "Metallica - one",
tags: [
"6018703624d8a5e8efa1b76e", // Rock
"601861cc8cef62ba86765017", // Heavy metal
"5fda07ac8db0615c1c503a46" // Hard Rock
]
},
{
name: "Metallica - unforgiven",
tags: [
"6018703624d8a5e8efa1b76e", // Rock
"5fda07ac8db0615c1c503a46", // Metal
]
},
{
name: "Lady Gaga - Bad Romance",
tags: [
"5fc7b9f95e38e17282896b64", // Pop
"5fc729be5e38e17282844eff", // Dance
]
}
]
Given the tag "6018703624d8a5e8efa1b76e" (Rock), I want to query the Songs collection and find all songs that have Rock tag in their tags array.
In Mongo this is the query i'm doing:
db.songs.find({ tags: { $in: [ObjectId("6018703624d8a5e8efa1b76e")] }});
The performance is very bad (between 10 and 40 seconds, and getting worse as the collection grows). I tried to index Mongo in various ways (the collection contains more data that is involved in the search, such as score and duration, but that's not relevant for now), but my queries still take too long. I can't explain it (and I have read a lot of official and unofficial material), but I have a feeling that holding the data in this nested form makes the index worthless and somehow still causes a full scan of the collection each time. I can't prove it, though (Mongo's "explain" didn't really explain anything to me :) ).
I'm thinking of using ElasticSearch for it: sync all the song data and query it there, while Mongo stays as the data SSOT and handles other lightweight ops.
But one question remains open and I want to make sure: in Elastic, can I hold the data in that form (a nested array inside the song), or do I need to represent it differently (e.g. flatten it so that every record is a song_tag entry, etc.)?
Thanks.
Elasticsearch doesn't offer a dedicated array type so what you'd typically do is define the mapping based on the type of the individual array items -- in your case a keyword:
PUT songs
{
"mappings": {
"properties": {
"tags": {
"type": "keyword"
}
}
}
}
Then you'd index the docs:
POST songs/_doc
{
"name": "Metallica - one",
"tags": [
"6018703624d8a5e8efa1b76e",
"601861cc8cef62ba86765017",
"5fda07ac8db0615c1c503a46"
]
}
and query the tags:
POST songs/_search
{
"query": {
"bool": {
"must": [
{ ... other queries },
{
"terms": {
"tags": [
"6018703624d8a5e8efa1b76e" // one or more
]
}
}
]
}
}
}
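If the query is assembled from Python, the request body above is just a nested dictionary. The following is a minimal sketch that only builds and prints the body; the helper name build_tag_query is hypothetical and no Elasticsearch client call is shown.

```python
import json


def build_tag_query(tag_ids, other_queries=()):
    """Build a bool query whose must clause includes a terms match on tags."""
    must = list(other_queries)
    must.append({"terms": {"tags": list(tag_ids)}})
    return {"query": {"bool": {"must": must}}}


# One tag id here, but the list may hold several.
body = build_tag_query(["6018703624d8a5e8efa1b76e"])
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent as the body of the POST songs/_search request.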
The tags are unique keywords but are not human-readable so you'd need to keep the map of them vs. the actual genres somewhere. Since the genres are probably set once and rarely, if ever, updated, you could use nested fields too. But your tags would then become an array of key-value pairs:
POST songs/_doc
{
"name": "Metallica - one",
"tags": [
{
"tag": "6018703624d8a5e8efa1b76e",
"genre": "Rock"
}
...
]
}
The mapping would be slightly different and so would be the queries but now you wouldn't need the translation map, plus you could query or aggregate by human-readable values -- tags.genre.

How to push a JSON object to a nested array in a JSONB column

I need to somehow push a JSON object to a nested array of potentially existing JSON objects - see "pages" in the below JSON snippet.
{
"session_id": "someuuid",
"visitor_ui": 1,
"pages": [
{
"datetime": "2016-08-13T19:45:40.259Z",
"duration": 0,
"device_id": 1,
"url": {
"path": "/"
}
},
{
"datetime": "2016-08-14T19:45:40.259Z",
"duration": 0,
"device_id": 1,
"url": {
"path": "/test"
}
},
// how can i push a new value (page) here??
],
"visit_page_count": 2
}
I'm aware of jsonb_set(target jsonb, path text[], new_value jsonb[, create_missing boolean]) (although I still find it a bit hard to comprehend), but I guess using it would require that I first SELECT the whole JSONB column in order to find out how many elements already exist inside "pages" and which index to push the new one to, right? I'm hoping there's a way in Postgres 9.5 / 9.6 to achieve the equivalent of what we know from programming languages, eg. pages.push({"key": "val"}).
What would be the best and easiest way to do this with Postgresql 9.5 or 9.6?
The trick to jsonb_set() is that it modifies part of a jsonb object, but it returns the entire object. So you pass it the current value of the column and the path you want to modify ("pages" here, as a string array), then you take the existing array (my_column->'pages') and append || the new object to it. All other parts of the jsonb object remain as they were. You are effectively assigning a completely new object to the column but that is irrelevant because an UPDATE writes a new row to the physical table anyway.
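UPDATE my_table
SET my_column = jsonb_set(my_column, '{pages}', COALESCE(my_column->'pages', '[]'::jsonb) || new_json, true);
The optional create_missing parameter, set to true here, adds the "pages" key if it does not already exist. The COALESCE supplies an empty array in that case, because my_column->'pages' would otherwise be NULL and jsonb_set() returns NULL when any of its arguments is NULL.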
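The "modify one part, return the whole object" behaviour can be modelled in a few lines of Python. This is only an in-memory analogy, not Postgres code; the function name append_page is hypothetical, and treating a missing "pages" key as an empty array is an assumption of the sketch.

```python
import copy


def append_page(document, new_page):
    """Model of jsonb_set(doc, '{pages}', doc->'pages' || new_page):
    returns a whole new document and leaves the input untouched."""
    updated = copy.deepcopy(document)
    # Assumption: a missing "pages" key is treated as an empty array.
    updated["pages"] = updated.get("pages", []) + [new_page]
    return updated


session = {
    "session_id": "someuuid",
    "pages": [{"url": {"path": "/"}}],
    "visit_page_count": 1,
}
result = append_page(session, {"url": {"path": "/test"}})
print(len(session["pages"]), len(result["pages"]))
```

As in the UPDATE statement, every other key of the document is carried over unchanged and only the array grows.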

MongoDB - Document with different type of value

I'm very new to MongoDB, and I apologize for this question, but I have trouble understanding how to create a document that can contain a value with a different "type":
My document can contain data like this:
// Example ONE
{
"customer" : "aCustomer",
"type": "TYPE_ONE",
"value": "Value here"
}
// Example TWO
{
"customer": "aCustomer",
"type": "TYPE_TWO",
"value": {
"parameter1": "value for parameter one",
"parameter2": "value for parameter two"
}
}
// Example THREE
{
"customer": "aCustomer",
"type": "TYPE_THREE",
"value": {
"anotherParameter": "another value",
{
"someParameter": "value for some parameter",
...
}
}
}
The customer field will always be present; the type can differ (TYPE_ONE, TYPE_TWO and so on), and depending on the type the value can be a string, an object, an array, etc.
Looking at this example, should I create three kinds of collections (one per type), or can the same collection (for example, a collection named "measurements") contain different kinds of value in the field "value"?
Trying some inserts in my DB instance, I don't get any error (I'm able to insert an object, a string and an array in the value property), but I would like to know whether this is the correct way...
I come from RDBMS and I'm a bit confused right now. Thanks a lot for your support.
You can find the answer here: https://docs.mongodb.com/drivers/use-cases/product-catalog
MongoDB's dynamic schema means that each document need not conform to the same schema.
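On the application side, documents from such a mixed collection are usually handled by branching on the shape of the field. A minimal Python sketch follows, assuming documents shaped like the examples in the question; the function name describe is hypothetical and no driver calls are involved.

```python
def describe(measurement):
    """Summarize a document whose "value" field varies by "type"."""
    value = measurement["value"]
    if isinstance(value, str):
        return f'{measurement["type"]}: text "{value}"'
    if isinstance(value, dict):
        return f'{measurement["type"]}: object with keys {sorted(value)}'
    if isinstance(value, list):
        return f'{measurement["type"]}: array of {len(value)} items'
    return f'{measurement["type"]}: {type(value).__name__}'


docs = [
    {"customer": "aCustomer", "type": "TYPE_ONE", "value": "Value here"},
    {"customer": "aCustomer", "type": "TYPE_TWO",
     "value": {"parameter1": "value for parameter one",
               "parameter2": "value for parameter two"}},
]
summaries = [describe(doc) for doc in docs]
print(summaries)
```

Keeping the "type" field as a discriminator, as the question already does, makes this kind of branching straightforward.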

How do I manage a sublist in Mongodb?

I have different types of data that would be difficult to model and scale with a relational database (e.g., a product type)
I'm interested in using Mongodb to solve this problem.
I am referencing the documentation at mongodb's website:
http://docs.mongodb.org/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/
For the data type that I am storing, I need to also maintain a relational list of id's where this particular product is available (e.g., store location id's).
In their example regarding "one-to-many relationships with embedded documents", they have the following:
{
name: "O'Reilly Media",
founded: 1980,
location: "CA",
books: [12346789, 234567890, ...]
}
I am currently importing the data with a spreadsheet, and want to use a batchInsert.
To avoid duplicates, I assume that:
1) I need to do an ensure index on the ID, and ignore errors on the insert?
2) Do I then need to loop through all the ID's to insert a new related ID to the books?
Your question could possibly be defined a little better, but let's consider the case that you have rows in a spreadsheet or other source that are all de-normalized in some way. So in a JSON representation the rows would be something like this:
{
"publisher": "O'Reilly Media",
"founded": 1980,
"location": "CA",
"book": 12346789
},
{
"publisher": "O'Reilly Media",
"founded": 1980,
"location": "CA",
"book": 234567890
}
So in order to get those sorts of row results into the structure you wanted, one way to do this would be using the "upsert" functionality of the .update() method.
So assuming you have some way of looping over the input values and they are identified with some structure, then an analog to this would be something like:
books.forEach(function(book) {
    db.publishers.update(
        {
            "name": book.publisher
        },
        {
            "$setOnInsert": {
                "founded": book.founded,
                "location": book.location
            },
            "$addToSet": { "books": book.book }
        },
        { "upsert": true }
    );
})
This essentially simplifies the code so that MongoDB is doing all of the data collection work for you. So where the "name" of the publisher is considered to be unique, what the statement does is first search for a document in the collection that matches the query condition given, as the "name".
In the case where that document is not found, then a new document is inserted. So either the database or driver will take care of creating the new _id value for this document and your "condition" is also automatically inserted to the new document since it was an implied value that should exist.
The usage of the $setOnInsert operator is to say that those fields will only be set when a new document is created. The final part uses $addToSet in order to "push" the book values that have not already been found into the "books" array (or set).
The reason for the separation is for when a document is actually found to exist with the specified "publisher" name. In this case, all of the fields under the $setOnInsert will be ignored as they should already be in the document. So only the $addToSet operation is processed and sent to the server in order to add the new entry to the "books" array (set) and where it does not already exist.
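The combined effect of the upsert, $setOnInsert and $addToSet can be modelled in memory with a short Python sketch. This is an analogy only, not driver code; the dictionary publishers stands in for the collection and the function name upsert_book is hypothetical.

```python
def upsert_book(publishers, row):
    """In-memory model of the upsert; publishers maps name -> document."""
    name = row["publisher"]
    doc = publishers.get(name)
    if doc is None:
        # Insert path: the query condition and the $setOnInsert fields are written.
        doc = {"name": name, "founded": row["founded"],
               "location": row["location"], "books": []}
        publishers[name] = doc
    # $addToSet path: append only if the value is not already present.
    if row["book"] not in doc["books"]:
        doc["books"].append(row["book"])
    return doc


rows = [
    {"publisher": "O'Reilly Media", "founded": 1980, "location": "CA", "book": 12346789},
    {"publisher": "O'Reilly Media", "founded": 1980, "location": "CA", "book": 234567890},
    # Repeated book with different details: existing fields and books stay as they are.
    {"publisher": "O'Reilly Media", "founded": 1999, "location": "NY", "book": 12346789},
]
publishers = {}
for row in rows:
    upsert_book(publishers, row)
print(publishers)
```

The third row shows both rules at once: the founded and location values from the first insert are kept, and the duplicate book id is not added again.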
So that would be simplified logic compared to aggregating the new records in code before sending a new insert operation. However it is not very "batch" like as you are still performing some operation to the server for each row.
This is fixed in MongoDB version 2.6 and above as there is now the ability to do "batch" updates. So with a similar analog:
var batch = [];
books.forEach(function(book) {
    batch.push({
        "q": { "name": book.publisher },
        "u": {
            "$setOnInsert": {
                "founded": book.founded,
                "location": book.location
            },
            "$addToSet": { "books": book.book }
        },
        "upsert": true
    });
    // Send a full batch of 500 and start collecting again
    if ( ( batch.length % 500 ) == 0 ) {
        db.runCommand({ "update": "publishers", "updates": batch });
        batch = [];
    }
});
// Send whatever is left over, if anything
if ( batch.length > 0 ) {
    db.runCommand({ "update": "publishers", "updates": batch });
}
So what this is doing is collecting all of the constructed update statements into a single call to the server, with a sensible number of operations sent in each batch, in this case once every 500 items processed. The actual limit is the BSON document maximum of 16MB, so this can be altered as appropriate to your data.
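The batching pattern itself is independent of MongoDB and can be sketched as a small Python generator. The name chunked and the default size of 500 are illustrative; each yielded list would be what gets sent to the server in one call.

```python
def chunked(items, size=500):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # send the remainder, skipping an empty final batch
        yield batch


sizes = [len(batch) for batch in chunked(range(1200))]
print(sizes)
```

Checking for a non-empty remainder avoids issuing a final call with nothing in it when the number of rows is an exact multiple of the batch size.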
If your MongoDB version is lower than 2.6 then you either use the first form or do something similar to the second form using the existing batch insert functionality. But if you choose to insert then you need to do all the pre-aggregation work within your code.
All of the methods are of course supported with the PHP driver, so it is just a matter of adapting this to your actual code and which course you want to take.