Is it possible to create a custom index on Postgres catalog table pg_largeobject? - postgresql

I am aware that large objects are stored in a separate system table called pg_largeobject, which keeps the data as B-tree-indexed chunk rows (loid, pageno, data), and the user table merely stores the Oid of the object kept in pg_largeobject.
Now, creating an index on the column that stores just the Oid(s) is kind of absurd. So, can we create custom indexes on the pg_largeobject table for better data retrieval performance?

No, you cannot do that, because pg_largeobject is a system catalog. It also wouldn't do you much good, since the objects are stored in chunks there.
If you want to index a large object, you are doing something wrong. The large object would be too large to fit into an index entry anyway, and who wants to search like WHERE blob = '...'?
I suspect that you have some information stored inside the large object that you would like to index, like the (benighted) idea of keeping your state in a JSON document, storing that as a large object and then indexing one of its attributes.
It would be better to store the attributes you want to search for outside the large object, as regular table columns; then the problem goes away.
That said, PostgreSQL lets you define indexes on expressions. So if you use bytea rather than a large object (which is preferable for smaller binary data anyway), you can define an index on an expression that extracts the desired attribute from the binary data. You cannot do that with a large object, because the functions that access large objects are not IMMUTABLE: the contents of a large object can change while the oid stays the same.
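For illustration, here is a minimal sketch of that approach using psycopg2; the documents table, its data bytea column and the JSON attribute state are hypothetical, and the extraction is wrapped in a function declared IMMUTABLE by hand because convert_from() itself is only marked STABLE:

```python
# Minimal sketch, assuming a table documents(id, data bytea) whose bytea
# column holds a UTF-8 encoded JSON payload.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# Wrap the attribute extraction in a function we declare IMMUTABLE so that
# it can be used in an index expression.
cur.execute("""
    CREATE OR REPLACE FUNCTION doc_state(data bytea) RETURNS text
    LANGUAGE sql IMMUTABLE AS
    $$ SELECT convert_from(data, 'UTF8')::jsonb ->> 'state' $$;
""")

# Expression index on the extracted attribute.
cur.execute("CREATE INDEX documents_state_idx ON documents (doc_state(data));")
conn.commit()

# Queries have to use the same expression for the index to be considered.
cur.execute("SELECT id FROM documents WHERE doc_state(data) = %s;", ("active",))
print(cur.fetchall())
```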

Related

Export index definitions in underlying collection when dumping views?

I have a database for development from which I only need to dump a subset of the fields, but all documents. So I created a view on the collection I need and mongodumped the view. Unfortunately, the underlying collection had indexes defined which were not rebuilt when I mongorestored the collections from the dump, because the index definitions were not dumped along with the data, apparently because they are defined for the collection, not for the view.
Is there a way to have the index definitions of the underlying collection dumped along with the data from the view?
Of course I can manually tell MongoDB to rebuild the indexes on the restored target collections, but that seems error-prone.
The fact that some indexes are on fields that are not part of the view may be a problem or even a blocker.
I believe the direct answer to your question is: No, mongodump will not pull index definitions from the source collection(s) associated with the view. Some degree of manual intervention or a change of approach is going to be needed here.
The specific approach you take depends on your specific constraints and goals. A few general things come to mind for consideration:
If the data isn't actually moving clusters, then perhaps $merge by itself would be sufficient to move the subset of fields to a different collection. The rest of this answer assumes that this is not the case and that you do intend to actually move the data to a different cluster.
$merge may still be of interest even if you are moving the data, since you could use it on the source cluster (combined with a script to copy indexes) and then run mongodump on that new collection instead. It's an extra data copy, but it allows a script to programmatically recreate the indexes directly, which should help prevent human error.
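As a rough sketch of that approach (database, collection, and field names here are made up), an aggregation with $merge on the source cluster could materialize just the fields you need into a new collection that you then mongodump:

```python
# Hypothetical pymongo sketch: copy a subset of fields into a new
# collection on the source cluster, then run mongodump against it.
from pymongo import MongoClient

db = MongoClient("mongodb://source-host:27017")["app"]

db.orders.aggregate([
    {"$project": {"customerId": 1, "total": 1, "createdAt": 1}},
    {"$merge": {"into": "orders_subset", "whenMatched": "replace"}},
])
```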
If you did continue with the current approach mentioned in the question, you could use a similar script to grab the index definitions (and then have them recreated).
Another thing you could do is run a second mongodump against the source collection with a --query that doesn't match any documents (e.g. { _id: 'missing' }). The outcome would be a dump that doesn't contain any data, only index definitions. Those index definitions are just JSON text, so you could update the namespace and then combine it with the data dumped from the view to be restored together.
The script to copy indexes mentioned in a couple of these alternatives depends a little on your environment, but it would basically leverage the db.collection.getIndexes() helper to gather the list of existing indexes and then iterate over them to generate the appropriate command(s) to create the new ones.
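Here is a rough sketch of such a script using pymongo; the hosts, database, and collection names are placeholders:

```python
# Copy index definitions from the source collection to the restored target.
from pymongo import MongoClient

source = MongoClient("mongodb://source-host:27017")["app"]["orders"]
target = MongoClient("mongodb://target-host:27017")["app"]["orders_subset"]

for spec in source.list_indexes():
    name = spec["name"]
    if name == "_id_":                    # the _id index always exists
        continue
    keys = list(spec["key"].items())      # e.g. [("customerId", 1)]
    # Carry over common options; add others (partialFilterExpression, ...)
    # if your indexes use them.
    options = {k: spec[k] for k in ("unique", "sparse", "expireAfterSeconds")
               if k in spec}
    target.create_index(keys, name=name, **options)
    print("recreated index", name)
```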
I also want to address these statements:
The fact that some indexes are on fields that are not part of the view may be a problem or even a blocker.
it might be a problem that some index definitions are for fields that are not included in the view.
From MongoDB's perspective, there is no issue with creating indexes on fields that do not exist. Since it has a flexible schema, new fields could be added at any point. The fact that indexes aren't dumped for views is really more related to the fact that the views are not materialized. Now if some of those indexes are not appropriate for the transformed data (which doesn't have all of the fields from the original data), then of course you should consider dropping (or not creating) those indexes.

Can I store data that won't affect query performance in MongoDB?

We have an application that needs to save data as documents, for querying and sorting purposes. The data should be schemaless, as some of the fields only become known through usage. For this, MongoDB is a great solution and it works great for us.
Part of the data in each document is for display purposes only, meaning the data can be objects (let's say JSON) that the client side uses to plot diagrams.
I tried saving this data using GridFS, but for our use cases it is not responsive enough. Also, the documents won't exceed the 16 MB limit even with the diagram data inside them. In fact, when we tried saving this data directly within the documents, we got better results.
This data is used only for client-side responses, meaning we should never query it. So my question is: can I insert this data into MongoDB and mark it as 'not for query' data? Meaning, can I insert this data without affecting Mongo's performance? The data is strict: once a document is inserted, only existing fields might be updated; no new fields are added.
I've noticed there is a Binary Data type in Mongo, and I am wondering if I should use this type for objects that are not binary. Can this give me what I'm looking for?
Also, I would love to know what is the advantage in using this type inside my documents. Can it save me disk space?
As of MongoDB 3.4, read and write operations are atomic on the level of a single document from the storage/memory point of view. If the MongoDB server needs to fetch a document from memory or disk (even when projecting a subset of fields to return), the full document generally has to be loaded into memory on a mongod. The only exception is if you can take advantage of covered queries, where all of the fields filtered on and returned are included in the index used.
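For example, a minimal sketch with pymongo (collection and field names are hypothetical) of a query that stays covered:

```python
# Covered query sketch: the index contains every field that is filtered on
# and returned, so the documents themselves never have to be fetched.
from pymongo import MongoClient

coll = MongoClient()["app"]["measurements"]
coll.create_index([("status", 1), ("value", 1)])

# _id must be excluded from the projection, otherwise the document is loaded.
cursor = coll.find({"status": "ok"}, {"_id": 0, "status": 1, "value": 1})
print(list(cursor))
```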
This data is used only for client side responses, meaning we should never query it.
Data fields which aren't queried directly do not need to be in any indexes. However, there is currently no concept like "not for query" fields in MongoDB. You can query or project any field (with or without an index).
Meaning, can I insert this data without affecting Mongo's performance?
Data with very different access or growth patterns (such as your infrequently requested client data) is a recommended candidate for storing separately from a parent document with frequently accessed data. This will improve the efficiency of memory usage for mongod by avoiding unnecessary retrieval of data when working with documents in the parent collection.
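A minimal sketch of that split, with hypothetical collection and field names: the frequently queried fields live in one collection and the large display-only payload in another, keyed by the parent _id.

```python
from pymongo import MongoClient

db = MongoClient()["app"]

# Queried/sorted fields stay in the parent collection.
doc_id = db.measurements.insert_one(
    {"sensor": "A7", "status": "ok", "value": 42}
).inserted_id

# The display-only payload goes into a side collection keyed by the parent _id.
db.measurement_charts.insert_one(
    {"_id": doc_id, "chart": {"points": [[0, 1], [1, 3], [2, 2]]}}
)

# Hot-path queries never touch the chart payload...
for m in db.measurements.find({"status": "ok"}).sort("value", -1):
    print(m["sensor"])

# ...and the chart is fetched only when the client actually asks for it.
chart = db.measurement_charts.find_one({"_id": doc_id})
```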
I've noticed there is a Binary Data type in Mongo, and I am wondering if I should use this type for objects that are not binary. Can this give me what I'm looking for? Also, I would love to know what is the advantage in using this type inside my documents. Can it save me disk space?
You should use a type that is most appropriate for the data that you are storing. Storing text data as binary will not gain you any obvious efficiencies in server storage. However, storing a complex object as a single value (for example, a JSON document serialized as a string) could save some serialization overhead if that object will only be interpreted by your client-side code. Binary data stored in MongoDB will be an opaque blob as far as indexing or querying is concerned, which sounds fine for your purposes.
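As an illustration (again with made-up names), a client-only object can be stored as a single opaque value, here a JSON string wrapped in a BSON Binary, which the server never interprets:

```python
import json

from bson import Binary
from pymongo import MongoClient

db = MongoClient()["app"]
payload = {"series": [{"x": 0, "y": 1}, {"x": 1, "y": 3}]}

# Serialize once on the way in; MongoDB treats the value as an opaque blob.
db.widgets.insert_one({
    "name": "latency-chart",
    "display_blob": Binary(json.dumps(payload).encode("utf-8")),
})

# Deserialize only on the client when the diagram is actually rendered.
doc = db.widgets.find_one({"name": "latency-chart"})
restored = json.loads(doc["display_blob"].decode("utf-8"))
```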

Can Lucene store more than 100Gb original's documents in index?

I'm writing an application that will work with more than 100 GB of text documents. The size of each document is 2 KB-100 KB.
At first I intended to use a DBMS such as MySQL or Firebird to store the raw documents, and keep the search index in Lucene. This approach has some disadvantages. For example, database transactions know nothing about the Lucene index and vice versa, so I would need to keep them synchronized.
Then I realized that Lucene can store entire documents in its index. I would need to create regular backups of the index, but that is easy: I can simply copy the whole index directory. In effect I would be using Lucene as a kind of NoSQL storage and could drop the DBMS entirely.
What is the best practice: to store the original documents in the index or not? I really don't want to use a DBMS for this purpose. Is it possible?
You would not want to store the raw documents in a Lucene index, especially at the size you are talking about. I have done this a couple of ways, but both ONLY store the indexed fields in the Lucene index, plus an ID/pointer to the raw document. I have dealt with indexes of well over 100 million records and they work fine on a single server.
The reason this is important is that the index builds much faster and is far easier to manage if you don't need to store an additional 100 GB of data in it.
Basically, you need to index all the fields you need for searching/satisfying search queries. If a user clicks on the item in a grid, I assume you want to show the raw text (the UI pattern is that most of the time you will access a lot of the Lucene fields, but RARELY need to pull down the full binary text file).
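A rough sketch of that pattern using PyLucene (the Java API has the same calls; the field names and index path are just placeholders): the body is indexed for search but not stored, while the stored ID points at the raw document kept in external storage.

```python
# Index searchable fields only and store just an ID that points to the raw
# document held elsewhere (DBMS, blob store, cache, ...).
import lucene
from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, StringField, TextField
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import FSDirectory

lucene.initVM()

writer = IndexWriter(FSDirectory.open(Paths.get("idx")),
                     IndexWriterConfig(StandardAnalyzer()))

doc = Document()
doc.add(StringField("id", "doc-42", Field.Store.YES))   # pointer, stored
doc.add(TextField("body", "full text of the document...",
                  Field.Store.NO))                       # searchable, not stored
writer.addDocument(doc)
writer.close()
```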
The raw access I have used in conjunction with Lucene is:
SQL Server FILESTREAM, which is optimized for large binary file storage. It is really fast too. Not sure if MySQL has this (never worked with it)
Azure Table Storage, which is a key-value NoSQL cloud database. That was used to store the binary blobs.
It really doesn't matter what the persistent storage is, as long as it is optimized for larger binary files that can be accessed/streamed quickly by key. You could use an in-memory cache like Redis too, as long as Lucene has the ID pointer needed to access the binary text file.

Index BTree Storage

How is a collection's B-tree index saved?
Is each index bucket saved within the data portion of a record?
Does this mean that for every collection within a database there is a dedicated set of extents covering a specific index of a specific collection of a specific database?
Every B-tree bucket is allocated as needed and thus has its own location within the data file. Buckets are not necessarily stored near the data they refer to (nor is there any reason for them to be).
A B-tree is basically a concept, or a set of algorithms, not a complete file storage specification. Everything you ask about is up to the implementor.

Non Relational Database , Key Value or flat table

My application needs configurable columns, and the titles of these columns are configured at the beginning. With a relational database I would have created generic columns in the table, like CodeA, CodeB, etc., because that makes querying on these columns easy (e.g. CodeA = 11) and also helps in displaying the values (if a column stores a code and a value). But now I am using a non-relational database (Datastore), and I am new to it. Should I follow the same old approach, or should I use a collection (key-value pair) type of structure?
There will be a lot of filters on these columns. Please suggest.
What you've just described is one of the classic scenarios for a Key-Value database. The limitation here is that you will not have many of the set-based tools you're used to.
Most of the K-V databases are really good at loading one "record" or small set thereof. However, they don't tend to be any good at loading anything that may require a join. Given that you're using AppEngine, you probably appreciate this limitation. But it's worth stating.
As an important note, not all K-V databases will allow you to "select by any column". Many K-V stores actually only allow selection by a primary key. If you take a look at MongoDB, you'll find that you can query any column, which sounds like a necessary feature for you.
I would suggest using key/value pairs where the keys act as your column names and the values are their data.
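For example, a hypothetical sketch with pymongo of that idea: the configurable "columns" simply become document fields, which can be filtered on and indexed like any other field.

```python
from pymongo import MongoClient

records = MongoClient()["app"]["records"]

# Field names come from the user's configuration at runtime.
records.insert_one({"CodeA": 11, "CodeB": "red", "title": "sample row"})

# Any field can be queried directly...
print(list(records.find({"CodeA": 11})))

# ...and indexed if it is filtered on frequently.
records.create_index("CodeA")
```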