We have a searchable list. Initially the list displayed will be empty. As soon as the user types in 3 characters, it will start hitting the database and fetching data which matches the 3 characters typed in. The list contains names of organizations. The catch is that the data may contain "Private" while the user types "Pvt" or something similar and ends up thinking the data is missing. Which is the best approach to get around this problem? One approach we are considering is to have all possible combinations of the values a user may try in another column, do the lookup against that column, and display the column which has the 'clean' data. Any other approaches? Performance is not a big concern because at most we will have 5000 records in the master table. The master data will be queried by the user once or twice for registration.
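A minimal sketch of the extra-column idea described above, assuming a PostgreSQL backend; the organizations table and its column names are made up for illustration, not a prescription:

```sql
-- Keep the clean name for display and a normalized list of variants for matching.
CREATE TABLE organizations (
    id           serial PRIMARY KEY,
    display_name text NOT NULL,   -- e.g. 'Acme Industries Private Limited'
    search_terms text NOT NULL    -- e.g. 'acme industries private limited pvt ltd'
);

-- Look up against the variant column, display the clean column.
SELECT display_name
FROM   organizations
WHERE  search_terms ILIKE '%' || 'pvt' || '%';
```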
Related
I have a Postgres table with 2 columns: username and the items they have bought. On average we can expect about 100 items per user. While thinking about this, I am wondering about the possible issues I could face if I plan to have a separate table for each user rather than a single table storing the items bought by all users.
One thing I realised is that, since Postgres stores data in 8 kB pages, I'll be consuming more disk space than required (as I will certainly not have 8 kB of data for any specific user). But will this help me reduce the time to fetch the items for a given user, since I could read all the rows directly from that user's table instead of searching a single shared table for the rows belonging to that user and then returning the items?
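For comparison, a minimal single-table sketch (names are made up): with an index on username, fetching one user's items is a cheap index lookup, which is usually what a per-user table is trying to buy.

```sql
-- Single-table design: one index on username covers the "all items of one user" case.
CREATE TABLE purchases (
    username text NOT NULL,
    item     text NOT NULL
);
CREATE INDEX purchases_username_idx ON purchases (username);

-- Fetch one user's items:
SELECT item FROM purchases WHERE username = 'alice';
```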
Use case
I need to store texts assigned to some entity. It's important to note that I only ever care about the most current texts that have been assigned to that entity. When new texts are inserted, older ones might even be deleted. And that "might" is the problem, because I can't rely on only the most current texts being available.
The only thing I'm unsure about how to design is the case where a single INSERT can provide either 1 or N texts for some entity. In the latter case, I need to know which N texts belong to the most recent INSERT done for one and the same entity. Additionally, inserting N texts instead of 1 will be pretty rare.
I know that this could be implemented using two different tables: one generating some main ID and the other mapping individual texts, with their own IDs, to that main ID. Because multiple texts should happen rarely and a one-table design already provides columns which could easily be reused for grouping multiple texts together, I prefer using only one table.
Additionally, I thought about which concept would make a good grouping key in general. I somewhat doubt that others really always implement the two-table approach, and therefore created this question to get a better understanding. Of course, I might simply be wrong and everybody avoids such "hacks" at all costs. :-)
Possible keys
Transaction-local timestamp
Postgres supports the concept of a transaction-local timestamp using current_timestamp. I need one of those anyway to store when the texts were saved, so it might be used for grouping as well?
While there's in theory a possibility of collisions, timestamps have a resolution of 1 microsecond, which in practice is enough for my needs. Texts are uploaded by human users and it is very unlikely that multiple humans upload texts for the same entity at exactly the same time.
That timestamp won't be used as a primary key of course, only to group multiple texts if necessary.
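A minimal sketch of this idea; the entity_texts table and its columns are made-up names for illustration. It relies on current_timestamp being fixed for the whole transaction, so every row inserted in one transaction shares the same value.

```sql
BEGIN;
-- All rows of one upload get the same transaction-local timestamp.
INSERT INTO entity_texts (entity_id, body, stored_at)
VALUES (42, 'first text',  current_timestamp),
       (42, 'second text', current_timestamp);
COMMIT;

-- Most current texts for the entity = rows carrying the latest stored_at.
SELECT body
FROM   entity_texts
WHERE  entity_id = 42
AND    stored_at = (SELECT max(stored_at)
                    FROM   entity_texts
                    WHERE  entity_id = 42);
```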
Transaction-ID
Postgres supports txid_current to get the ID of the current transaction, which should be ever-increasing over the lifetime of the current installation. The good thing is that this value is always available and the app doesn't need to do anything on its own. But things can easily break in case of restores, can't they? Can TXIDs, e.g., occur again in the restored cluster?
People who know these things better than I do write the following:
Do not use the transaction ID for anything at the application level. It is an internal system level field. Whatever you are trying to do, it's likely that the transaction ID is not the right way to do it.
https://stackoverflow.com/a/32644144/2055163
You shouldn't be using the transaction ID for anything except an identifier for a transaction. You can't even assume that a lower transaction ID is an older transaction, due to transaction ID wrap-around.
https://stackoverflow.com/a/20602796/2055163
Isn't my grouping a valid use case for wanting to know the ID of the current transaction?
Custom sequence
Grouping simply needs a unique key per transaction, which can be achieved using a custom sequence created for that purpose only. I don't see any downsides: its values consume less storage than e.g. UUIDs and can easily be queried.
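A sketch of the dedicated-sequence variant, again using the hypothetical entity_texts table: one nextval() per upload, reused as the group key for every text of that upload.

```sql
CREATE SEQUENCE text_group_seq;

-- The CTE contains a volatile function, so it is evaluated exactly once;
-- both texts therefore share the same group_id.
WITH grp AS (SELECT nextval('text_group_seq') AS id)
INSERT INTO entity_texts (entity_id, group_id, body)
SELECT 42, grp.id, v.body
FROM   grp,
       (VALUES ('first text'), ('second text')) AS v(body);
```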
Reusing first unique ID
The table to store the texts contains a serial-column, so each inserted text gets an individual ID already. Therefore, the ID of the first inserted text could simply always be additionally reused as the group-key for all later added texts.
In the case of inserting only one text, one could easily use currval and wouldn't even need to explicitly query the ID of the inserted row. In the case of multiple texts this no longer works, though, because currval would return the most recently generated ID rather than the first one of the transaction. So some special handling would be necessary.
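A sketch of that special handling, still against the hypothetical entity_texts table; the value 17 is just an assumed example of what the first INSERT returns.

```sql
-- Single-text case: currval() of the id sequence identifies the row just inserted.
INSERT INTO entity_texts (entity_id, body) VALUES (42, 'only text');
SELECT currval(pg_get_serial_sequence('entity_texts', 'id'));

-- Multi-text case: capture the first generated id (e.g. via RETURNING)
-- and pass it back as group_id for the remaining rows.
INSERT INTO entity_texts (entity_id, body) VALUES (42, 'first text')
RETURNING id AS group_key;
-- ...then, using the returned value (assumed here to be 17):
INSERT INTO entity_texts (entity_id, group_id, body)
VALUES (42, 17, 'second text'),
       (42, 17, 'third text');
```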
APP-generated random UUID
Each request to store multiple texts could simply generate some UUID and group by that. The mainly used database, Postgres, even supports a corresponding uuid data type.
I mainly see two downsides to this: it feels really hacky already and it consumes more space than necessary. OTOH, compared to the texts to store, the latter might simply be negligible.
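For completeness, a sketch of the UUID variant with the same hypothetical table; gen_random_uuid() is built in from Postgres 13 (earlier via the pgcrypto extension), or the application can supply the UUID itself.

```sql
-- One UUID per upload, stamped on every text of that upload.
WITH grp AS (SELECT gen_random_uuid() AS id)
INSERT INTO entity_texts (entity_id, group_id, body)
SELECT 42, grp.id, v.body
FROM   grp,
       (VALUES ('first text'), ('second text')) AS v(body);
```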
I'm creating a web app that lets users search for restaurants and cafes. Since I currently have no data other than their type to differentiate the two, I have two options for storing the list of eateries.
Use a single table for both restaurants and cafes, and have an enum (text) column stating if an entry is a restaurant or cafe.
Create two separate tables, one for restaurants, and one for cafes.
I will never need to execute a query that collects data from both, so the only thing that matters to me I guess is performance. What would you suggest as the better option for PostgreSQL?
Typical database modeling would lend itself to a single table. The main reason is maintainability. If you have two tables with the same columns and your client decides they want to add a column, say hours of operation, you now have to write two sets of code for creating the column, reading the new column, updating the new column, and so on. Also, what if your client wants you to start tracking bars? Now you need a third table with a third set of code. It gets messy quickly. It would be better to have two tables: a data table (say Establishment) with most of the columns (name, location, etc.) and a second "type" table (say EstablishmentType) with a row for Restaurant, Cafe, Bar, etc., plus a foreign key linking the two. This way you can have "X" types and only need to maintain a single set of code.
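A minimal sketch of that lookup-table design; the table and column names are illustrative.

```sql
CREATE TABLE establishment_type (
    id   serial PRIMARY KEY,
    name text NOT NULL UNIQUE          -- 'Restaurant', 'Cafe', 'Bar', ...
);

CREATE TABLE establishment (
    id       serial PRIMARY KEY,
    type_id  integer NOT NULL REFERENCES establishment_type (id),
    name     text NOT NULL,
    location text
);

-- Adding a new kind of eatery is just a row, not a new table:
INSERT INTO establishment_type (name) VALUES ('Bar');
```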
There are of course exceptions to this rule where you may want separate tables:
Performance due to a HUGE data set. (It depends on your server, but we're talking at least hundreds of thousands of rows before it should matter in Postgres.) If this is the reason, I would suggest table inheritance to keep much of the proper maintainability while speeding up performance; see the sketch after this list.
Cafes and Restaurants have two completely different sets of functionality in your website. If the entirety of your code is saying if Cafe, do this, if Restaurant, do that, then you already have two sets of code to maintain, with the added hassle of if logic in your code. If that's the case, two separate tables is a much cleaner and logical option.
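The table-inheritance sketch mentioned in the first exception above; names are illustrative. Children share the parent's columns, and queries against the parent also scan the children (note that constraints such as primary keys are not inherited).

```sql
CREATE TABLE eatery (
    id       serial PRIMARY KEY,
    name     text NOT NULL,
    location text
);

CREATE TABLE cafe       () INHERITS (eatery);
CREATE TABLE restaurant () INHERITS (eatery);

-- Scans eatery, cafe and restaurant:
SELECT name FROM eatery;
-- Scans only the parent's own rows:
SELECT name FROM ONLY eatery;
```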
In the end I chose to use 2 separate tables, as I really will never need to search for both at the same time, and this way I can expand a single table in the future if I need to add another data field specific to cafes, for example.
I want to store user-defined segments. A segment will consist of several different rules. I was thinking I would either create a separate table of "Rules" with three columns: attribute name, operator, and value. For example, if a segment is "users in the United States", the rule would be "country = US" in the respective columns. A segment can have many rules.
The other option is to store these as JSONB via Postgres in a "Rules" column in the Segment table. I'd follow a similar pattern to the above with an array of rules or something. What are the pros and cons of each method?
Maybe neither one of these is the right approach.
The choice is basically about the way you wish to read the data.
You are better off with JSON if:
you are not going to filter (with a WHERE clause) through the Rules
you do not need to get statistics (i.e. GROUP BY)
you will not impose any constraints on attributes/operators/values
you simply select the values (SELECT ..., Rules)
If you meet these requirements you can store data as JSON, thus eliminating JOINs and subselects, eliminating the overhead of primary key and indexes on Rules, etc.
But if you don't meet these you should store the data in a common relational design - your approach 1.
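To make the JSON variant concrete, a small sketch under those conditions; the segment table and the shape of the rules array are illustrative.

```sql
CREATE TABLE segment (
    id    serial PRIMARY KEY,
    name  text NOT NULL,
    rules jsonb NOT NULL               -- the whole rule set lives in one column
);

INSERT INTO segment (name, rules)
VALUES ('US users',
        '[{"attribute": "country", "operator": "=", "value": "US"}]');

-- Plain read-back, no joins or subselects:
SELECT name, rules FROM segment;
```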
I would go with the first approach of storing the pieces of data individually in a relational database. It sounds like your data (segments->rules) will always contain the same structure (which is fairly simple), so there isn't a pressing reason to store the data as JSON.
As a side note, I think you will need another column in the "Rules" table, serving as a foreign key to the "Segments" table.
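A sketch of the relational variant (approach 1) including that foreign key; table and column names are illustrative.

```sql
CREATE TABLE segment (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE rule (
    id             serial PRIMARY KEY,
    segment_id     integer NOT NULL REFERENCES segment (id),
    attribute_name text NOT NULL,
    operator       text NOT NULL,
    value          text NOT NULL
);

-- "country = US" as one rule of a segment:
INSERT INTO segment (name) VALUES ('US users');
INSERT INTO rule (segment_id, attribute_name, operator, value)
VALUES (currval(pg_get_serial_sequence('segment', 'id')), 'country', '=', 'US');
```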
Pros to approach 1:
Data is easy to search and select. Your SQL statements can directly access specific information about the rules (the specific operators, name, value, etc) without having to parse the JSON object for the desired rule.
The above will result in reduced processing time
Only need to parse the JSON once (before the insert)
Cons to approach 1:
Requires parsing of JSON before the insert
Requires multiple inserts per segment
Regarding your last sentence, it is hard to prescribe a database design without knowing more about your intended functionality. For example, if the attribute names have meaning beyond a single segment, you would want to store the attribute names separately and reference them in the Rules table.
I have a table that loads in data from a webservice that returns a bunch of JSON data. I load more data when the user scrolls down as the DB I am querying holds quite a bit of data.
The question I have is: will it be feasible to implement the right-side alphabetical index on such a table, and how could this be done? It is definitely possible if I load ALL the data, sort it locally, populate the index, and cache the data for every later use. But what if this grows to 10K rows of data or more? Loading this data on the application's first launch might be one option.
So in terms of performance and usability, does anyone have any recommendations of what is possible to do?
I don't think that you should download all the data just to build those indexes; it would slow down refreshing and might cause memory problems.
But if you think that the index could make a good difference, then you can add some features to your server API. I would either add a different API call, like get_indexes, or add a POST parameter get_indexes which adds an array of index entries to any call that has this parameter set.
And you should be ready to handle cases where the user taps an index entry without any downloaded data, or just stresses your app by scrolling the index up and down quickly.
First see how big the data download is. If the server can gzip the data, it may be surprisingly small - JSON zips very well because of the duplicated keys.
If it's too big, I would recommend modifying the server if possible to let you specify a starting letter. That way, if the user hits the "W" in the index you should be able to request all items that begin with "W".
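A sketch of the server-side query behind such a "starting letter" parameter, assuming the backend is a SQL database with a name column; the table name and page size are made up.

```sql
-- Return only the rows for the letter the user tapped in the index.
SELECT id, name
FROM   items
WHERE  name LIKE 'W%'
ORDER  BY name
LIMIT  50;          -- page size; fetch the next page as the user scrolls
```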
It would also be helpful to get a total record count from the server so you can know how many rows are in the table ahead of time. I would also return a "loading..." string for each unknown row until the actual data comes down.