Given the JSON object below:
{
  "player": {
    "francesco totti": {
      "position": "forward"
    },
    "andrea pirlo": {
      "position": "midfielder"
    }
  }
}
I would like to import the above file into Redshift as the rows below:
name, position
"francesco totti", "forward"
"andrea pirlo", "midfielder"
The thing is, the 'player' object has a dynamic number of objects each hour (the cadence at which I import into Redshift). For example, the next hour's run may look like the following:
{
  "player": {
    "fabio cannavaro": {
      "position": "defender"
    }
  }
}
Is it possible to use a JSONPaths file to import this file every hour or does it require preprocessing?
You can reuse the JSONPaths file as often as you like. You will just need to rerun the COPY statement, but remember that this adds rows to the table rather than replacing them. If you are replacing, you will want to clear the table out first (DELETE, DROP/recreate, or TRUNCATE, each with its own performance trade-offs and limitations).
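If each hourly run should replace the previous data, a minimal sketch of the reload pattern (the table name players is a placeholder):

TRUNCATE TABLE players;
-- then rerun the COPY for the new hour's file (see the example after the JSONPaths file below)

Note that TRUNCATE in Redshift commits immediately, so it cannot be rolled back as part of a larger transaction.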
Now, your JSON format isn't going to work for Redshift AFAIK. You have the player name as the field identifier (key) and want to set it as the value of a column. You will want one JSON object per record, separated by whitespace rather than commas, something like this (sorry, these aren't tested):
{
  "player": {
    "name": "francesco totti",
    "position": "forward"
  }
}
{
  "player": {
    "name": "andrea pirlo",
    "position": "midfielder"
  }
}
And a JSONPaths file like this:
{
  "jsonpaths": [
    "$.player.name",
    "$.player.position"
  ]
}
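The hourly COPY itself would then look something like the sketch below; the table name, S3 paths, IAM role, and region are all placeholders:

COPY players
FROM 's3://my-bucket/players/latest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
JSON 's3://my-bucket/jsonpaths/players.jsonpaths'
REGION 'us-east-1';

Because the JSONPaths expressions only reference $.player.name and $.player.position, the same file keeps working no matter how many player records arrive in a given hour.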
I have a collection. The document structure is:
{
  model: {
    name: 'string name'
  }
}
I have enabled Atlas Search and created a search index for the model.name field. Search works fine, but the only issue is that I can't get results for very short query strings.
Example:
I have a document:
{
  model: {
    name: "space1duplicate"
  }
}
If I query space, I don't get the result:
{
  index: 'search_index',
  compound: {
    must: [
      {
        text: {
          query: 'space',
          path: 'model.name'
        }
      }
    ]
  }
}
But if I query space1duplica, it returns the result.
During indexing, the full-text search engine tokenizes the input by splitting the text into searchable chunks. Check out the relevant section in the documentation.
By default Atlas Search does not split words by digits, but if you need that, try defining a custom analyzer with the regex tokenizer and using it for your field:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "model": {
        "type": "document",
        "fields": {
          "name": [
            {
              "analyzer": "digitSplitter",
              "type": "string"
            }
          ]
        }
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "digitSplitter",
      "tokenFilters": [],
      "tokenizer": {
        "pattern": "[0-9]+",
        "type": "regexSplit"
      }
    }
  ]
}
Also note that you can use multiple analyzers for string fields, if needed.
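With that index in place, a text query for the short token should match, because "space1duplicate" is now split into the tokens "space" and "duplicate". A minimal sketch, assuming the index is named search_index and a collection named models (both names are assumptions):

db.models.aggregate([
  {
    $search: {
      index: "search_index",
      text: {
        query: "space",
        path: "model.name"
      }
    }
  }
])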
Atlas Search uses Lucene to do the job. The documentation on the MongoDB site is mostly focused on Mongo-specific syntax for passing the query to Lucene, and it might be a bit confusing if you are not familiar with its query language.
First of all, there are a number of tokenizers and analyzers available, each serving a specific purpose. You really need to include the index definition when you ask questions about Atlas Search.
The default tokenizer uses word separators to build the index, then removes endings to store stems, depending on the language (English by default).
So in order to find "space1duplicate" by the beginning of the word, you can use the "autocomplete" type with nGram tokenization. The index should be created as follows:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "model": {
        "type": "document",
        "fields": {
          "name": {
            "tokenization": "nGram",
            "type": "autocomplete"
          }
        }
      }
    }
  },
  "storedSource": {
    "include": [
      "model.name"
    ]
  }
}
Once it's indexed (you may need to wait a bit if you have a larger dataset), you can find the document with the following search:
{
  index: 'search_index',
  compound: {
    must: [
      {
        autocomplete: {
          query: 'spa',
          path: 'model.name'
        }
      }
    ]
  }
}
I have a collection in MongoDB containing search history of a user where each document is stored like:
{
  "_id": "user1",
  "searchHistory": {
    "product1": [
      {
        "timestamp": 1623482432,
        "query": {
          "query": "chocolate",
          "qty": 2
        }
      },
      {
        "timestamp": 1623481234,
        "query": {
          "query": "lindor",
          "qty": 4
        }
      }
    ],
    "product2": [
      {
        "timestamp": 1623473622,
        "query": {
          "query": "table",
          "qty": 1
        }
      },
      {
        "timestamp": 1623438232,
        "query": {
          "query": "ike",
          "qty": 1
        }
      }
    ]
  }
}
Here the _id of the document acts like a foreign key to the user document in another collection.
I have a backend running on Node.js, and this function is used to store a new search history entry in the record:
exports.updateUserSearchCount = function (userId, productId, searchDetails) {
  let addToSetData = {}
  let key = `searchHistory.${productId}`
  addToSetData[key] = { "timestamp": new Date().getTime(), "query": searchDetails }
  // Return the promise; upsert creates the user document if it does not exist yet
  return client.db("mydb").collection("userSearchHistory")
    .updateOne({ "_id": userId }, { "$addToSet": addToSetData }, { upsert: true })
}
Now, I want to get the search history of a user based on the query only, using db.find().
I want something like this:
db.find({"_id": "user1", "searchHistory.somewildcard.query": "some query"})
I need a wildcard to replace ".somewildcard." so that the search covers all products searched.
I saw a suggestion that we should store the document like:
{
  "_id": "user1",
  "searchHistory": [
    {
      "key": "product1",
      "value": [
        {
          "timestamp": 1623482432,
          "query": {
            "query": "chocolate",
            "qty": 2
          }
        }
      ]
    }
  ]
}
However, if I store the document like this, then adding search history to an existing document becomes a tedious and confusing task.
What should I do?
It's always a bad idea to save values as keys, for exactly the reason you're facing: it heavily limits querying that field. The obvious trade-off is that it makes updates much easier.
I personally recommend you do not save these searches in nested form at all; this will cause you scaling issues quite quickly. Assuming these fields are indexed, you will start seeing performance issues when the arrays get too large (a few hundred searches).
So my personal recommendation is for you to save it in a new collection like so:
{
  "user_id": "1",
  "key": "product1",
  "timestamp": 1623482432,
  "query": {
    "query": "chocolate",
    "qty": 2
  }
}
Now querying a specific user, a specific product, or even a query substring is all very easily supported by creating some basic indexes, and an "update" in this case is just inserting a new document, which is also much faster. A sketch follows below.
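A minimal sketch of that setup, assuming a collection named userSearches (the collection and index choices are illustrative, not from the original post):

// Basic indexes to support per-user and per-product lookups and query-text filters
db.userSearches.createIndex({ user_id: 1, key: 1 })
db.userSearches.createIndex({ user_id: 1, "query.query": 1 })

// An "update" is just an insert of a new search event
db.userSearches.insertOne({
  user_id: "1",
  key: "product1",
  timestamp: new Date().getTime(),
  query: { query: "chocolate", qty: 2 }
})

// Find all of user1's searches for a given query string
db.userSearches.find({ user_id: "1", "query.query": "chocolate" })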
If you still prefer to keep the nested structure, then I recommend you switch to the key/value structure you posted. As you mentioned, updates will become slightly more tedious, but you can still do them quite easily, using arrayFilters to update a specific element or $push to add a new search, as sketched below.
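For example, a hedged sketch of adding a search under the key/value structure (it assumes the product key already exists in the array; a separate upsert step would be needed the first time a product appears):

db.userSearchHistory.updateOne(
  { _id: "user1", "searchHistory.key": "product1" },
  {
    $push: {
      // "$" targets the matched searchHistory element for product1
      "searchHistory.$.value": {
        timestamp: new Date().getTime(),
        query: { query: "chocolate", qty: 2 }
      }
    }
  }
)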
I have another problem to solve here. Thinking in arrays can sometimes be very challenging. Here is what I am up against. This is what my data looks like:
{
  "_id": { "Firm": "ABC", "year": 2014 },
  "Headings": [
    {
      "costHead": "MNF",
      "amount": 500000
    },
    {
      "costHead": "SLS",
      "amount": 25000
    },
    {
      "costHead": "OVRHD",
      "amount": 100
    }
  ]
}
{
  "_id": { "Firm": "CDF", "year": 2015 },
  "Headings": [
    {
      "costHead": "MNF",
      "amount": 15000
    },
    {
      "costHead": "SLS",
      "amount": 100500
    },
    {
      "costHead": "MNTNC",
      "amount": 7500
    }
  ]
}
As you can see, I have a list that has a whole bunch of sub-documents.
Here is what I want to do: I need to add more elements to this "Headings" list, which should be:
{
  "costHead": "FxdCost",
  "amount": "$Headings.amount (for costHead MNF) + $Headings.amount (for costHead OVRHD)"
}
I am unsure how to produce the above. Here are some challenges:
I can addToSet the new subdocument I wish to add but the problem is addToSet can only be used in the group stage - which would be expensive (unless of course there is no other way).
Even if I use addToSet, I always have to use the $ operator to refer to elements that I read from my JSON file. Now the element I am trying to add here (costHead: FxdCost) is not present in my JSON file and hence I cannot use the $ operator.
Does anyone have any advice on how to go about this? This is, after all, basic ETL.
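For illustration only, a hedged sketch of the kind of stage being described, which appends the computed element without a $group (the collection name costs and the choice of $addFields with $concatArrays/$filter are assumptions, not something from the original post):

db.costs.aggregate([
  {
    $addFields: {
      Headings: {
        $concatArrays: [
          "$Headings",
          [
            {
              costHead: "FxdCost",
              amount: {
                $sum: {
                  $map: {
                    input: {
                      $filter: {
                        input: "$Headings",
                        cond: { $in: ["$$this.costHead", ["MNF", "OVRHD"]] }
                      }
                    },
                    in: "$$this.amount"
                  }
                }
              }
            }
          ]
        ]
      }
    }
  }
])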
I am looking for a value in a Mongo collection where its parent key might not have a descriptive or known name. Here is an example of what one of our documents looks like:
{
  "assetsId": {
    "0": "546cf2f8585ffa451bb68369"
  },
  "slotTypes": {
    "0": { "usage": "json" },
    "1": { "usage": "image" }
  }
}
I am looking to see if this contains "usage": "json" in slotTypes, but I can't guarantee that the parent key for this usage will be "0".
I tried using the following query without any luck:
db.documents.find({
  slotTypes: {
    $elemMatch: {
      "usage": "json"
    }
  }
})
Sorry in advance if this is a really basic question, but I'm not used to working in a nosql database.
I'm not sure you're going to be able to elegantly solve this with your current schema; slotTypes should be an array of sub-documents, which would allow your $elemMatch query to work. Right now, it's an object with numeric-ish keys.
That is, your document schema should be something like:
{
  "assetsId": {
    "0": "546cf2f8585ffa451bb68369"
  },
  "slotTypes": [
    { "usage": "json" },
    { "usage": "image" }
  ]
}
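With slotTypes stored as an array, the $elemMatch query from the question matches as written, and plain dot notation is an equivalent, simpler form:

// Either of these returns documents whose slotTypes array contains { usage: "json" }
db.documents.find({ slotTypes: { $elemMatch: { usage: "json" } } })
db.documents.find({ "slotTypes.usage": "json" })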
If changing the data layout isn't an option, then you're going to need to basically scan through every document to find matches with $where. This is slow, unindexable, and awkward.
db.objects.find({
  $where: function () {
    for (var key in this.slotTypes) {
      if (this.slotTypes[key].usage == "json") return true;
    }
    return false;
  }
})
You should read the documentation on $where to make sure you understand the caveats of it, and for the love of all that is holy, sanitize your inputs to the function; this is live code that is executing in the context of your database.
I want to perform both exact word match and partial word/substring match. For example, if I search for "men's shaver" then I should be able to find "men's shaver" in the result. But if I search for "en's shaver" then I should also be able to find "men's shaver" in the result.
I am using the following settings and mappings:
Index settings:
PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
Mappings:
PUT /my_index/my_type/_mapping
{
  "my_type": {
    "properties": {
      "name": {
        "type": "string",
        "index_analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
Insert records:
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "men's shaver" }
{ "index": { "_id": 2 }}
{ "name": "women's shaver" }
Query:
1. To search by exact phrase match --> "men's"
POST /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "men's"
    }
  }
}
The above query returns "men's shaver" in the result.
2. To search by partial word match --> "en's"
POST /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "en's"
    }
  }
}
The above query DOES NOT return anything.
I have also tried the following query:
POST /my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "%en's%"
      }
    }
  }
}
Still not getting anything.
I figured it is because of the "edge_ngram" type filter on the index, which is not able to find a partial word/substring match.
I tried an "n-gram" type filter as well, but it slows down the search a lot.
Please suggest how I can achieve both exact phrase match and partial phrase match using the same index settings.
To search for partial field matches and exact matches, it will work better if you define the fields as "not analyzed" or as keywords (rather than text), then use a wildcard query.
See also this.
To use a wildcard query, append * on both ends of the string you are searching for:
POST /my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*en's*"
      }
    }
  }
}
To use with case insensitivity, use a custom analyzer with a lowercase filter and keyword tokenizer.
Custom analyzer:
"custom_analyzer": {
  "tokenizer": "keyword",
  "filter": ["lowercase"]
}
Make the search string lowercase: if you get the search string as AsD, change it to *asd*. A sketch of the full setup follows below.
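A sketch of how this might be wired up on a recent Elasticsearch version (the index and field names follow the question, the analyzer name comes from the snippet above, and the exact mapping shape is an assumption):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}

Because the whole value is indexed as a single lowercased token, a wildcard query for *en's* (with the search string lowercased by the client) matches "men's shaver".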
The answer given by #BlackPOP will work, but it uses the wildcard approach, which is not preferred as it has performance issues and, if abused, can create a huge domino effect (performance degradation) in the Elastic cluster.
I have written a detailed blog on partial search/autocomplete covering the latest options available in Elasticsearch as of today (Dec 2020) with performance in mind. For more trade-off information please refer to this answer.
IMHO a better approach is to use a customized n-gram tokenizer suited to the use case, which will already have the tokens needed for the search term, so searches will be faster. It does produce a bigger index, but storage is not that costly, and speed will be better, with more control over exactly how you want substring search to work.
Index size can also be controlled by being conservative when defining the min and max gram in the tokenizer settings; a sketch follows below.
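For illustration, a hedged sketch of such an index (the analyzer/tokenizer names and the gram sizes are assumptions; tune them to your data):

PUT /my_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 8
    },
    "analysis": {
      "analyzer": {
        "substring_analyzer": {
          "type": "custom",
          "tokenizer": "substring_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "substring_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit", "punctuation"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "substring_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

With index-time n-grams between min_gram and max_gram characters, both whole terms (up to max_gram characters) and substrings (down to min_gram characters) are present as tokens, so an ordinary match query on name covers exact and partial matches; longer exact terms would need a larger max_gram (and max_ngram_diff).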
To search by any string or prefix of a word, you can use match_phrase_prefix on the candidate fields, wrapped in a bool/should:
query: {
  bool: {
    should: [
      {
        match_phrase_prefix: {
          name: str
        }
      },
      {
        match_phrase_prefix: {
          surname: str
        }
      }
    ]
  }
}
Happy coding with Elasticsearch....