PostgreSQL jsonb indexing for optimized search performance

I'm using PostgreSQL 10.1 jsonb data type and designing a JSON document of the following structure:
{
  "guid": "9c36adc1-7fb5-4d5b-83b4-90356a46061a",
  "name": "Angela Barton",
  "is_active": true,
  "company": "Magnafone",
  "address": "178 Howard Place, Gulf, Washington, 702",
  "registered": "2009-11-07T08:53:22 +08:00",
  "latitude": 19.793713,
  "longitude": 86.513373,
  "timestamp": "2001-09-28 01:00:00",
  "tags": [
    "enim",
    "aliquip",
    "qui"
  ]
}
I need to retrieve JSON documents by searching based on tags and sorted by timestamp.
I have read this documentation, and it says jsonb_path_ops offers better performance:
https://www.postgresql.org/docs/current/static/datatype-json.html#JSON-INDEXING
https://www.postgresql.org/docs/current/static/gin-builtin-opclasses.html
To index tags, the first document gives an example:
CREATE INDEX idxgintags ON api USING GIN ((jdoc -> 'tags'));
However, the example uses jsonb_ops but I think it is better to use jsonb_path_ops.
Based on the needs of my case, which is to search on tags and sort by timestamp, what is the best way, in terms of optimized search performance, to create the indexes? I'd appreciate it if an expert could give me the SQL to create the index and some example queries.
Thanks!

http://dbfiddle.uk/?rdbms=postgres_10&fiddle=cbb2336edf796f9b9458be64c4654394
As you can see, jsonb_ops on a small part of the jsonb document is close to jsonb_path_ops: similar index sizes, similar timing, only the supported operators differ. The results would start to differ a great deal if you instead indexed the whole document:
CREATE INDEX idxgintags ON api USING GIN (jdoc);
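For the original question (search on tags, sort by timestamp), a minimal sketch along these lines (assuming the api table and jdoc column from the question, and that the timestamp value is stored as a string):
-- jsonb_path_ops index on just the "tags" key: smaller than a whole-document index,
-- but it only supports the @> containment operator on that expression.
CREATE INDEX idxgintags ON api USING GIN ((jdoc -> 'tags') jsonb_path_ops);

-- Find documents tagged "qui"; the containment test can use the index above.
-- The GIN index does not help with the ORDER BY, which sorts on the extracted
-- "timestamp" key cast to a timestamp.
SELECT jdoc->'guid', jdoc->'name'
FROM api
WHERE jdoc -> 'tags' @> '["qui"]'
ORDER BY (jdoc ->> 'timestamp')::timestamp DESC;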

Related

For an array of objects, should I be creating a multikey index?

For a document that resembles the following,
{
  "translations": [
    {
      "source": "hello",
      "lang": "en",
      "target": "some target"
    },
    {
      "source": "hey",
      "lang": "en",
      "target": "target string"
    }
  ]
}
should I create a multikey index or a compound index? What I want is that when a query happens on this collection on source or lang or target, it must return the results quickly.
It depends on multiple factors. One is the amount of data. Another is the resources you have, such as RAM, shards, and nodes.
As you need to query more than one field at a time from the nested documents, you can go for a compound index. But I suggest you try out the following:
Multikey index - examine your use cases - confirm that Mongo uses index intersection by explaining the query
Compound index - ensure that the compound index is actually used most of the time for your use cases
It would be quick in both cases. You need to consider writes as well: each write results in an index update.
Any answer you get will not be accurate, because you need to provide a lot more information about your use case, such as:
How many documents do you have?
How many array elements, on average, would be in each document?
Is your data static and read-only, or are there updates and deletes?
What are the most frequent queries you expect on the collection?
Note that your index(es) on "source" and/or "target" must use the same collation as your queries.
Queries that ensure selectivity: while "source" and "target" have high cardinality, "lang", in comparison, would naturally have a lower cardinality (fewer unique values). Test how your queries will benefit from indexing "lang" standalone vs. compound with source or target.
Make sure the size of your indexes (db.collection.totalIndexSize()) fits entirely in RAM in order to avoid disk reads.
If you have little information about the application, you can evaluate (explain, $indexStats) the performance with or without various combinations of single-key or compound indexes (use hint if needed to force the use of a particular index).
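As an illustration of the suggestions above, a hypothetical mongo-shell sketch (the collection name docs is made up; the field names come from the question's document):
// Multikey index: indexing a field inside the "translations" array indexes every element.
db.docs.createIndex({ "translations.source": 1 });

// Compound (still multikey) index covering lang + source lookups.
db.docs.createIndex({ "translations.lang": 1, "translations.source": 1 });

// Check which index the planner actually uses for a typical query.
db.docs.find({ "translations.lang": "en", "translations.source": "hello" })
       .explain("executionStats");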

MongoDB inserting duplicate document within upsert

My current situation is that I have several Writer objects which dump data into MongoDB. There are no unique indexes, so duplicates are allowed and are a possibility, but they shouldn't occur.
I was checking existing data within the DB and found several documents in which the fields that should be used to match in the upsert phase are duplicated and contain different counters.
{"date": "today", "k1": "sample", "count": 5}
{"date": "today", "k1": "sample", "count": 2}
That is a very, very simple example of my current situation. The count field should be 7, and there shouldn't be two separate documents with the same keys I use to perform the upsert, but this barely happens and affects only a small part of the data... Just wondering what could be causing this?
Is there any situation where this can happen? An R/W lock?
For things such as counters, I would recommend using the $inc operator: https://docs.mongodb.com/manual/reference/operator/update/inc/
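One known cause: without a unique index on the match keys, two concurrent upserts can both miss the existing document and both insert, which produces exactly this kind of duplicate. A hypothetical sketch of an $inc-based upsert (the collection name counters is made up; the keys come from the example documents), with a unique index to make concurrent upserts safe:
// Increment the counter, creating the document if it does not exist yet.
db.counters.updateOne(
  { date: "today", k1: "sample" },   // keys used to match
  { $inc: { count: 5 } },            // adds 5, or sets count to 5 on insert
  { upsert: true }
);

// A unique index on the match keys prevents two concurrent upserts from both
// inserting; one of them will fail with a duplicate-key error and can be retried.
db.counters.createIndex({ date: 1, k1: 1 }, { unique: true });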

Why use Postgres JSON column type?

The JSON column type accepts non-valid JSON,
e.g.
[1,2,3] can be inserted without the enclosing {}.
Is there any difference between JSON and string?
While [1,2,3] is valid JSON, as zerkms has stated in the comments, to answer the primary question: Is there any difference between JSON and string?
The answer is yes. A whole new set of query operations, functions, etc. apply to json or jsonb columns that do not apply to text (or related types) columns.
For example, while with text columns you would need to use regular expressions and related string functions to parse the string (or a custom function), with json or jsonb, there exists a separate set of query operators that works within the structured nature of JSON.
From the Postgres doc, given the following JSON:
{
  "guid": "9c36adc1-7fb5-4d5b-83b4-90356a46061a",
  "name": "Angela Barton",
  "is_active": true,
  "company": "Magnafone",
  "address": "178 Howard Place, Gulf, Washington, 702",
  "registered": "2009-11-07T08:53:22 +08:00",
  "latitude": 19.793713,
  "longitude": 86.513373,
  "tags": [
    "enim",
    "aliquip",
    "qui"
  ]
}
The doc then says:
We store these documents in a table named api, in a jsonb column named jdoc. If a GIN index is created on this column, queries like the following can make use of the index:
-- Find documents in which the key "company" has value "Magnafone"
SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"company": "Magnafone"}';
This allows you to query jsonb (or json) fields very differently than if they were simply text or related fields.
The Postgres documentation lists those query operators and functions in full.
Basically, if you have JSON data that you want to treat as JSON data, then a column is best specified as json or jsonb (which one you choose depends on whether you want to store it as plain text or binary, respectively).
The above data can be stored as text, but the JSON data types have the advantage that you can apply JSON operators and functions to those columns. There are several JSON-specific functions that cannot be used on text fields.
Refer to the PostgreSQL documentation on JSON functions to learn more about them.
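A few illustrative queries against the same api/jdoc example, just to show the kind of operators and functions available for jsonb but not for plain text (a sketch, not an exhaustive list):
-- ->> extracts a key as text.
SELECT jdoc ->> 'name' AS name FROM api;

-- jsonb_array_elements_text() expands the "tags" array into one row per tag.
SELECT jdoc ->> 'guid' AS guid, tag
FROM api, jsonb_array_elements_text(jdoc -> 'tags') AS tag;

-- ? tests whether a top-level key exists.
SELECT count(*) FROM api WHERE jdoc ? 'latitude';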

Best way to represent multilingual database on mongodb

I have a MySQL database to support a multilingual website where the data is represented as the following:
table1
  id
  is_active
  created
table1_lang
  table1_id
  name
  surname
  address
What's the best way to achieve the same on mongo database?
You can design a schema where you either reference or embed documents. Let's look at the first option, embedded documents. With your application above, you might store the information in a document as follows:
// db.table1 schema
{
  "_id": 3, // table1_id
  "is_active": true,
  "created": ISODate("2015-04-07T16:00:30.798Z"),
  "lang": [
    {
      "name": "foo",
      "surname": "bar",
      "address": "xxx"
    },
    {
      "name": "abc",
      "surname": "def",
      "address": "xyz"
    }
  ]
}
In the example schema above, you would have essentially embedded the table1_lang information within the main table1 document. This design has its merits, one of them being data locality. Since MongoDB stores data contiguously on disk, putting all the data you need in one document ensures that the spinning disks will take less time to seek to a particular location on the disk. If your application frequently accesses table1 information along with the table1_lang data, then you'll almost certainly want to go the embedded route.
The other advantage of embedded documents is atomicity and isolation when writing data. To illustrate this, say you want to remove a document which has a lang entry whose "name" is "foo"; this can be done with one single (atomic) operation:
db.table1.remove({"lang.name": "foo"});
For more details on data modelling in MongoDB, please read the docs Data Modeling Introduction, specifically Model One-to-Many Relationships with Embedded Documents
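For example, with the embedded design you could fetch a parent document while projecting only the matching embedded lang entry (a sketch using the positional $ projection):
// Return is_active, created and only the lang element whose name is "foo".
db.table1.find(
  { "lang.name": "foo" },
  { "is_active": 1, "created": 1, "lang.$": 1 }
);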
The other design option is referencing documents where you follow a normalized schema. For example:
// db.table1 schema
{
  "_id": 3,
  "is_active": true,
  "created": ISODate("2015-04-07T16:00:30.798Z")
}
// db.table1_lang schema
/* 1 */
{
  "_id": 1,
  "table1_id": 3,
  "name": "foo",
  "surname": "bar",
  "address": "xxx"
}
/* 2 */
{
  "_id": 2,
  "table1_id": 3,
  "name": "abc",
  "surname": "def",
  "address": "xyz"
}
The above approach gives increased flexibility in performing queries. For instance, retrieving all child table1_lang documents for the main parent entity table1 with id 3 is straightforward: simply query the table1_lang collection:
db.table1_lang.find({"table1_id": 3});
The above normalized schema, using the document-reference approach, also has an advantage when you have one-to-many relationships with very unpredictable arity. If you have hundreds or thousands of table1_lang documents per table1 entity, embedding has significant drawbacks as far as space constraints are concerned, because the larger the document, the more RAM it uses, and MongoDB documents have a hard size limit of 16MB.
The general rule of thumb is that if your application's query pattern is well known and data tends to be accessed only in one way, an embedded approach works well. If your application queries data in many ways, or you are unable to anticipate the data query patterns, a more normalized document-referencing model will be appropriate.
Ref:
MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database, by Rick Copeland

DynamoDB Model/Keys Advice

I was hoping someone could help me understand how to best design my table(s) for DynamoDB. I'm building an application which is used to track the visits a certain user makes to another user's profile.
Currently I have a MongoDB where one entry contains the following fields:
userId
visitedProfileId
date
status
isMobile
How would this translate to DynamoDB in a way that would not be too slow? I would need to run search queries to select all items that have a certain userId, taking status and isMobile into account. What would my keys be? Can I use the limit functionality to only request the latest x entries (sorted on date)?
I really like the way DynamoDB can be used, but it seems kind of complicated to make the switch between a regular NoSQL database and a key-value NoSQL database.
There are a couple of ways you could do this - and it probably depends on any other querying you may want to do on this table.
Make the HashKey of the table the userId, and then the RangeKey can be <status>:<isMobile>:<date> (e.g. active:true:2013-03-25T04:05:06.789Z). Then you can query using BEGINS_WITH in the RangeKeyCondition (with ScanIndexForward set to false to return results in descending order, i.e. newest first).
So let's say you wanted to find the 20 most recent rows for user ID 1234abcd that have a status of active and an isMobile of true (I'm guessing that's what you mean by "taking [them] into account"), then your query would look like:
{
  "TableName": "Users",
  "Limit": 20,
  "HashKeyValue": { "S": "1234abcd" },
  "RangeKeyCondition": {
    "ComparisonOperator": "BEGINS_WITH",
    "AttributeValueList": [{ "S": "active:true:" }]
  },
  "ScanIndexForward": false
}
Another way would be to make the HashKey <userId>:<status>:<isMobile>, and the RangeKey would just be the date. You wouldn't need a RangeKeyCondition in this case (and in the example, the HashKeyValue would be { "S": "1234abcd:active:true" }).
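For completeness, a sketch of what that second variant's query could look like (same made-up table name and values as above); since the RangeKey is the date, ScanIndexForward set to false keeps the newest items first:
{
  "TableName": "Users",
  "Limit": 20,
  "HashKeyValue": { "S": "1234abcd:active:true" },
  "ScanIndexForward": false
}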