DynamoDB GSI vs LSI for Numerical fields - nosql

I am going through a DynamoDB Udemy course and came across the following example
for a leaderboard design.
Now, to easily fetch the top scores (or the top 10 in the leaderboard),
the course goes on to do the following:
GSI
PK          SK    player
top_score   150   player#2
top_score   200   player#1
top_score   250   player#3
Now, this works quite well for the instructor in the course.
However, if player#2 now scores 50 more, then the primary key (PK + SK)
won't be unique: (top_score + 200) will point to two records.
I then think, that numerical fields will never make sense in a
GSI, unless they are IDs.
Even if we make the score field unique by appending player_id along
with score, that will need the score field to be string, rather than
numerical, and then lexicographical order will be followed and not the
numerical ordering.
From the AWS DynamoDB docs regarding GSI vs LSI in general:
In general, you should use global secondary indexes rather than local secondary indexes. The exception is when you need strong consistency in your query results, which a local secondary index can provide but a global secondary index cannot (global secondary index queries only support eventual consistency).
I think this is a clear case for using an LSI (even
though the problem I just described is not related to consistency):
use an LSI on the total_score attribute.
Am I right, or is there a way to actually use a GSI for this?
Edit:
The only thing I can think of is to pad the scores with enough 0s
and append the player_id at the end,
like this:
PK          SK
top_score   000150#2
top_score   000200#1
top_score   000250#3
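The padding idea from the edit can be sketched in plain Python, with no DynamoDB calls involved. The helper name make_sort_key and the width of 6 digits are assumptions for illustration; the width has to cover the largest score you expect.

```python
def make_sort_key(score: int, player_id: str, width: int = 6) -> str:
    """Build a string sort key whose lexicographic order matches numeric score order."""
    if score < 0 or score >= 10 ** width:
        raise ValueError("score does not fit in the padded width")
    return f"{score:0{width}d}#{player_id}"


scores = {"1": 200, "2": 150, "3": 250}
keys = [make_sort_key(score, player) for player, score in scores.items()]

# Sorting the strings in descending order gives the leaderboard.
leaderboard = sorted(keys, reverse=True)
print(leaderboard)  # ['000250#3', '000200#1', '000150#2']

# Two players with the same score still get distinct keys.
tied = [make_sort_key(200, "1"), make_sort_key(200, "2")]
```

Without the padding, "9" would sort after "10" as a string, which is the lexicographic problem described above.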

Related

How can I create an arbitrary ranking of records in Postgres?

The Problem
I'm looking to create a user defined ranking of records in Postgres.
That is, the order in which the records are ranked is not defined by some underlying score but rather via the choices of a collection of users.
These choices are subject to frequent changes and the ranking will be constantly changing with both new records being added and existing records being moved to new positions.
For the sake of argument, assume that these operations occur with high frequency.
Furthermore, we need to be able to determine when given an arbitrary subset of all records, how they should be ordered according to the ranking.
A Naive Solution
A very naive solution would be to track the ranking as an integer directly on the model and 'push' all the higher ranked records up by one when inserting a new record. This is obviously not ideal as we would need to modify potentially the entire table at once.
A Better Solution
Maintain a 'score' on each record in the interval [0, 1]. This can be indexed using a BTREE and used to rank the records. The first two records would have the scores 0 and 1. When inserting a new record some intermediate value would be chosen (e.g. 0.5) and the record inserted. This choice could be optimised in order to minimise the number of digits in the score.
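A minimal Python sketch of this scoring scheme, assuming the intermediate value is always the midpoint of the two neighbours. The function name score_between is hypothetical, and Fraction is used only so that repeated halving does not run out of floating-point precision.

```python
from fractions import Fraction


def score_between(lower: Fraction, upper: Fraction) -> Fraction:
    """Return a score strictly between two neighbouring scores (here: the midpoint)."""
    if not lower < upper:
        raise ValueError("lower must be strictly less than upper")
    return (lower + upper) / 2


# Scores keyed by record name; the first two records sit at 0 and 1.
ranking = {"a": Fraction(0), "b": Fraction(1)}

# Insert "c" between "a" and "b", then "d" between "a" and "c".
ranking["c"] = score_between(ranking["a"], ranking["b"])
ranking["d"] = score_between(ranking["a"], ranking["c"])

ordered = sorted(ranking, key=ranking.get)
print(ordered)  # ['a', 'd', 'c', 'b']
```

Each insert touches only the new record, at the cost of scores that grow longer as the same gap is split repeatedly.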
A Question
The above seems like a complex solution to a common problem. Furthermore, the problem is actually being solved by the underlying BTREE index with the score something of a hack to create the index.
Is there a neater way to solve the problem?

How to implement a high performing non incremental ID in postgresql? [duplicate]

I would like to replace some of the sequences I use for id's in my postgresql db with my own custom made id generator. The generator would produce a random number with a checkdigit at the end. So this:
SELECT nextval('customers')
would be replaced by something like this:
SELECT get_new_rand_id('customer')
The function would then return a numerical value such as: [1-9][0-9]{9} where the last digit is a checksum.
The concerns I have are:
How do I make the thing atomic?
How do I avoid returning the same id twice? (This would be caught by trying to insert it into a column with a unique constraint, but by then it's too late, I think.)
Is this a good idea at all?
Note1: I do not want to use uuid since it is to be communicated with customers and 10 digits is far simpler to communicate than the 36 character uuid.
Note2: The function would rarely be called with SELECT get_new_rand_id() but would be assigned as default value on the id-column instead of nextval().
EDIT: Ok, good discussion below! Here is some explanation of why:
So why would I over-complicate things this way? The purpose is to hide the primary key from the customers.
I give each new customer a unique customerId (generated serial number in the db). Since I communicate that number with the customer it is a fairly simple task for my competitors to monitor my business (there are other numbers such as invoice nr and order nr that have the same properties). It is this monitoring I would like to make a little bit harder (note: not impossible but harder).
Why the check digit?
Before there was any talk of hiding the serial nr I added a check digit to the order nr, since there were clumsy fingers at some points in production, and my thought was that this would be a good practice to keep in the future.
After reading the discussion I can certainly see that my approach is not the best way to solve my problem, but I have no other good idea of how to solve it, so please help me out here.
Should I add an extra column where I put the id I expose to the customer and keep the serial as primary key?
How can I generate the id to expose in a sane and efficient way?
Is the checkdigit necessary?
For generating unique and random-looking identifiers from a serial, using ciphers might be a good idea. Since their output is bijective (there is a one-to-one mapping between input and output values) -- you will not have any collisions, unlike hashes. Which means your identifiers don't have to be as long as hashes.
Most cryptographic ciphers work on 64-bit or larger blocks, but the PostgreSQL wiki has an example PL/pgSQL procedure for a "non-cryptographic" cipher function that works on (32-bit) int type. Disclaimer: I have not tried using this function myself.
To use it for your primary keys, run the CREATE FUNCTION call from the wiki page, and then on your empty tables do:
ALTER TABLE foo ALTER COLUMN foo_id SET DEFAULT pseudo_encrypt(nextval('foo_foo_id_seq')::int);
And voila!
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> select * from foo;
foo_id
------------
1241588087
1500453386
1755259484
(3 rows)
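To illustrate why a cipher gives collision-free ids, here is a toy Feistel network over 32-bit integers in Python. It is not the pseudo_encrypt function from the PostgreSQL wiki: the round function and the constants in ROUND_KEYS are arbitrary assumptions, and it offers no cryptographic security. It only shows that the mapping can be inverted, which is what rules out collisions.

```python
ROUND_KEYS = (150889, 714025, 1366)  # hypothetical constants, chosen arbitrarily


def round_function(half: int, round_key: int) -> int:
    """Arbitrary 16-bit mixing function; it does not need to be invertible."""
    return ((half * 1366 + round_key) % 65521) & 0xFFFF


def feistel_encrypt(value: int) -> int:
    """Map a 32-bit integer to another 32-bit integer, one-to-one."""
    left, right = (value >> 16) & 0xFFFF, value & 0xFFFF
    for key in ROUND_KEYS:
        left, right = right, left ^ round_function(right, key)
    return (left << 16) | right


def feistel_decrypt(value: int) -> int:
    """Undo feistel_encrypt by running the rounds backwards."""
    left, right = (value >> 16) & 0xFFFF, value & 0xFFFF
    for key in reversed(ROUND_KEYS):
        left, right = right ^ round_function(left, key), left
    return (left << 16) | right


serials = range(1, 6)
public_ids = [feistel_encrypt(serial) for serial in serials]
recovered = [feistel_decrypt(public_id) for public_id in public_ids]
```

Because every output can be decrypted back to exactly one input, two different serial numbers can never produce the same public id.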
I added my comment to your question and then realized that I should have explained myself better... My apologies.
You could have a second key - not the primary key - that is visible to the user. That key could use the primary key as the seed for the hash function you describe and be the one that you use to do lookups. That key would be generated by a trigger after insert (which is much simpler than trying to ensure atomicity of the operation), and that is the key that you share with your clients, never the PK. I know there is debate (albeit, I can't understand why) over whether PKs should be invisible to the user applications or not. Modern database design practices, and my personal experience, all seem to suggest that PKs should NOT be visible to users. Users tend to attach meaning to them and, over time, that is a very bad thing - regardless of whether they have a check digit in the key or not.
Your joins will still be done using the PK. This other generated key is just supposed to be used for client lookups. They are the face, the PK is the guts.
Hope that helps.
Edit: FWIW, there is little to be said about "right" or "wrong" in database design. Sometimes it boils down to a choice. I think the choice you face will be better served by leaving the PK alone and creating a secondary key - just that.
I think you are way over-complicating this. Why not let the database do what it does best and let it take care of atomicity and ensuring that the same id is not used twice? Why not use a postgresql SERIAL type and get an autogenerated surrogate primary key, just like an integer IDENTITY column in SQL Server or DB2? Use that on the column instead. Plus it will be faster than your user-defined function.
I concur regarding hiding this surrogate primary key and using an exposed secondary key (with a unique constraint on it) to lookup clients in your interface.
Are you using a sequence because you need a unique identifier across several tables? This is usually an indication that you need to rethink your table design, and those several tables should perhaps be combined into one, with an autogenerated surrogate primary key.
Also see here
How you generate the random and unique ids is a useful question - but you seem to be making a counter productive assumption about when to generate them!
My point is that you do not need to generate these id's at the time of creating your rows, because they are essentially independent of the data being inserted.
What I do is pre-generate random id's for future use, that way I can take my own sweet time and absolutely guarantee they are unique, and there's no processing to be done at the time of the insert.
For example I have an orders table with order_id in it. This id is generated on the fly when the user enters the order, incrementally 1,2,3 etc forever. The user does not need to see this internal id.
Then I have another table - random_ids with (order_id, random_id). I have a routine that runs every night which pre-loads this table with enough rows to more than cover the orders that might be inserted in the next 24 hours. (If I ever get 10000 orders in one day I'll have a problem - but that would be a good problem to have!)
This approach guarantees uniqueness and takes any processing load away from the insert transaction and into the batch routine, where it does not affect the user.
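A rough Python sketch of the nightly batch step, assuming 10-digit ids and an in-memory set of ids that are already taken. The function name pregenerate_ids and the random_ids dictionary are hypothetical stand-ins for the routine and the table described above.

```python
import random


def pregenerate_ids(count: int, already_used: set[int], rng: random.Random) -> list[int]:
    """Draw `count` distinct 10-digit ids that are not in `already_used`."""
    fresh: set[int] = set()
    while len(fresh) < count:
        candidate = rng.randrange(1_000_000_000, 10_000_000_000)
        if candidate not in already_used:
            fresh.add(candidate)
    return list(fresh)


rng = random.Random(42)  # seeded only to make the example repeatable
used = {1234567890}
pool = pregenerate_ids(10_000, used, rng)

# Pair upcoming internal order_ids with pooled ids, like rows of (order_id, random_id).
random_ids = dict(zip(range(1, len(pool) + 1), pool))
```

All uniqueness checking happens inside the batch routine, so the insert itself only needs a lookup.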
Your best bet would probably be some form of hash function, and then a checksum added to the end.
If you're not using this too often (you do not have a new customer every second, do you?) then it is feasible to just get a random number and then try to insert the record. Just be prepared to retry inserting with another number when it fails with unique constraint violation.
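The retry approach might look roughly like this in Python. SQLite stands in for PostgreSQL here so the example is self-contained, and the table layout, the function name insert_with_random_id and the id range are assumptions; with a PostgreSQL driver you would catch that driver's unique-violation error instead of sqlite3.IntegrityError.

```python
import random
import sqlite3


def insert_with_random_id(conn, name, rng, max_attempts=100, low=100_000, high=1_000_000):
    """Insert a customer under a random id, retrying when the id is already taken."""
    for _ in range(max_attempts):
        candidate = rng.randrange(low, high)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO customers (id, name) VALUES (?, ?)", (candidate, name)
                )
            return candidate
        except sqlite3.IntegrityError:
            continue  # unique constraint violated: draw another number
    raise RuntimeError("no free id found")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# A deliberately tiny id range (0..49) so that collisions and retries actually occur.
rng = random.Random(0)
ids = [insert_with_random_id(conn, f"customer {i}", rng, low=0, high=50) for i in range(20)]
stored = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

The unique constraint does the atomicity work; the application only has to be ready to try again.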
I'd use numbers 100000 to 999999 (900000 possible numbers of the same length) and a check digit using the UPC or ISBN-10 algorithm. Two check digits would be better, though, as they'll catch 99% of human errors instead of 90%.
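As an illustration of the check digit part, here is a UPC-style calculation in Python. The function names are hypothetical, and applying the UPC weights to a 6-digit number is an assumption made for this sketch; the weights alternate 3, 1, 3, ... starting from the rightmost payload digit, which matches the standard scheme for an 11-digit UPC-A payload.

```python
def upc_style_check_digit(payload: str) -> int:
    """Check digit with alternating weights 3, 1, 3, ... from the rightmost digit."""
    if not payload.isdigit():
        raise ValueError("payload must contain only digits")
    total = sum(
        int(digit) * (3 if index % 2 == 0 else 1)
        for index, digit in enumerate(reversed(payload))
    )
    return (10 - total % 10) % 10


def with_check_digit(number: int) -> str:
    """Append the check digit to the decimal form of `number`."""
    payload = str(number)
    return payload + str(upc_style_check_digit(payload))


def is_valid(code: str) -> bool:
    """True when the last digit of `code` matches the check digit of the rest."""
    return (
        code.isdigit()
        and len(code) > 1
        and upc_style_check_digit(code[:-1]) == int(code[-1])
    )


customer_code = with_check_digit(123456)
print(customer_code)            # 1234565
print(is_valid(customer_code))  # True
print(is_valid("1234575"))      # False: one digit was mistyped
```

This weighting catches every single mistyped digit, but not every swap of two adjacent digits.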

Sphinx centralize multiple tables into a single index

I do have multiple tables (MySQL) and I want to have a single index for them.
Each table has the primary key of int autoincrement type.
The structure of the collected data is the same for each table (so no conflict), but since the IDs collide it seems that I have to query each index separately (unless you can give me a hint of how to avoid ID collision).
Question is: If I query each index separately, does it mean that the weights of the returned results are comparable between indexes?
unless you can give me a hint of how to avoid ID collision
See for example
http://sphinxsearch.com/forum/view.html?id=13078
You can just arrange for the ids to be offset differently. The 'sphinx document id' doesn't have to match the real primary key, but having a simple mapping makes the application simpler.
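One simple mapping of this kind, sketched in Python: interleave the ids so that each table owns its own residue class. The table names are made up for the example, and this is just one possible offset scheme, not something prescribed by Sphinx. In the source query the same arithmetic would be applied to the id column.

```python
TABLES = ("articles", "comments", "pages")  # hypothetical table names
NUM_TABLES = len(TABLES)


def to_document_id(table: str, primary_key: int) -> int:
    """Interleave ids: table 0 gets 0, 3, 6, ...; table 1 gets 1, 4, 7, ..."""
    return primary_key * NUM_TABLES + TABLES.index(table)


def from_document_id(document_id: int) -> tuple[str, int]:
    """Recover the table name and the real primary key from a document id."""
    primary_key, table_index = divmod(document_id, NUM_TABLES)
    return TABLES[table_index], primary_key


doc_ids = [to_document_id(table, 7) for table in TABLES]
print(doc_ids)                # [21, 22, 23]
print(from_document_id(22))   # ('comments', 7)
```

The same primary key in different tables now yields different document ids, and the application can map a search hit back to its table.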
You have a choice between one-index, one-source (using a single SQL query to union all the tables together), one-index, many-source (a source per table, all making one index), or many-indexes (one index per table, each with its own source). Whichever way you choose will give the same query results.
If I query each index separately, does it mean that the weights of the returned results are comparable between indexes?
Pretty much. The difference should be negligible, so it doesn't matter which way round you do it.

What is the best order for the fields of a composite key?

Say I have a table with the following fields:
LeagueID
MatchID
SomeData
A League will host many Matches. Usually each League will use its own local database, so the LeagueID field will be the same in all records for this local database. Once a year the League uploads its data to the national authority and then the LeagueID will be necessary to discriminate between Matches that have the same MatchID.
What is the best way to implement the composite primary key (using the EF Fluent API)?
modelBuilder.Entity<Match>().HasKey(match => new { match.LeagueID, match.MatchID });
OR
modelBuilder.Entity<Match>().HasKey(match => new { match.MatchID, match.LeagueID });
To the human eye the order League - Match is logical, as it will hold the Matches of a particular League together. But I understood that when composing a composite key it is important for performance reasons to use the most discriminating field first.
I think you can have your cake and eat it too.
The Database
When implementing a key in the database generally a narrower key with more selective field(s) will yield better performance. This holds true for single & composite keys. I said generally, because a more selective index that doesn't really match your query pattern can be pretty useless. For example, in your composite key, if MatchID is first (the more selective), but you query more frequently by LeagueID (less selective), the selectivity will work against you.
The real issue, I think, is not whether index A or B is more selective, but "do you have appropriate indexes for the ways you query?" (and for enforcing data integrity, but that's a different discussion). So you need to figure out how you query this table. If you query by:
LeagueID most of the time -- index LeagueID, MatchID
MatchID most of the time -- index MatchID, LeagueID
composite LeagueID & MatchID the majority of the time -- index MatchID, LeagueID
a mixed bag -- you may want two indexes one for each order, but you'll have to figure out if the extra overhead of maintaining two indexes is worth the hit on insert/update/delete.
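The effect of column order on which queries an index can serve can be checked with a query plan. The sketch below uses SQLite through Python purely because it is self-contained; the table and index names are made up, and other engines (including SQL Server) have their own optimizers and plan output, so treat this as an illustration of the principle only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (LeagueID INTEGER, MatchID INTEGER, SomeData TEXT)")
# Composite index with LeagueID as the leading column.
conn.execute("CREATE INDEX idx_league_match ON matches (LeagueID, MatchID)")


def plan(sql: str) -> str:
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail text


by_league = plan("SELECT * FROM matches WHERE LeagueID = 1")
by_match = plan("SELECT * FROM matches WHERE MatchID = 1")
print(by_league)  # mentions idx_league_match
print(by_match)   # a scan that does not use the index
```

A filter on the leading column can seek into the index, while a filter on the second column alone cannot, which is why the query pattern should drive the column order.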
EF & The Query
For the most part, the order of the columns in your query (or the way you build a match in EF) won't make a difference. Meaning where a=@a and b=@b will yield the same query plan & performance as where b=@b and a=@a.
Assuming you're using SQL Server, the order in which you write a where clause matters very little. Books Online explains the issue succinctly, stating:
The order of evaluation of logical operators can vary depending on choices made by the query optimizer.
You can choose the order of the fields in the database by using HasColumnOrder.

I have a massive table that I need to optimize. I think I need to use indexes, but I was hoping for some more information about them

So I have a large table that I query (select only) quite frequently. The table is around 12,000 rows long. Since the advent of iOS, the time that it is taking to run these select queries has gone up 4-5x.
I was told that I need to add an index to my table. The query that I am using looks like this:
SELECT * FROM book_content WHERE book_id = ? AND chapter = ? ORDER BY verse ASC
How can I create an index for this table? Is it a command I just run once? What exactly is the index going to do? I didn't learn about these in school so they still seem like some sort of magic to me at this point, so I was hoping to get a little instruction.
Thanks!
You want an index on book_id and chapter. Without an index, a server would do a table scan and essentially load the entire table into memory to do its search. Do a quick search on the CREATE INDEX command for the RDBMS that you are using. You create the index once and every time you do an INSERT or DELETE or UPDATE, the server will update the index automatically. An index can be UNIQUE and it can be on multiple fields (in your case, book_id and chapter). If you make it UNIQUE, the database will not allow you to insert a second row with the same key (in this case, book_id and chapter). On most servers, having one index on two fields is different from having two individual indexes on single fields each.
A MySQL example would be:
CREATE INDEX id_chapter_idx ON book_content (book_id,chapter);
If you want only one record for each book_id, chapter combination, use this command:
CREATE UNIQUE INDEX id_chapter_idx ON book_content (book_id,chapter);
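To see the index being picked up, the statement above can be tried against a throwaway database. This sketch uses SQLite through Python rather than MySQL so it runs without a server; the column types and sample rows are assumptions, and MySQL's own EXPLAIN output looks different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE book_content (book_id INTEGER, chapter INTEGER, verse INTEGER, content TEXT)"
)
# Sample data: 3 books, 5 chapters each, 10 verses per chapter.
conn.executemany(
    "INSERT INTO book_content VALUES (?, ?, ?, ?)",
    [(b, c, v, "text") for b in range(1, 4) for c in range(1, 6) for v in range(1, 11)],
)
# Run once; the database keeps the index up to date on later writes.
conn.execute("CREATE INDEX id_chapter_idx ON book_content (book_id, chapter)")

query = "SELECT * FROM book_content WHERE book_id = ? AND chapter = ? ORDER BY verse ASC"
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (1, 2))]
rows = conn.execute(query, (1, 2)).fetchall()
print(plan)       # the plan steps mention id_chapter_idx
print(len(rows))  # 10
```

The query text itself does not change; the database decides on its own to use the index once it exists.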
A PRIMARY INDEX is a special index that is UNIQUE and NOT NULL. Each table can only have one primary index. In fact, each table should have one primary index to ensure table integrity, especially during joins.
You don't have to think of indexes as "magic".
An index on an SQL table is much like the index in a printed book - it lets you find what you're looking for without reading the entire book cover-to-cover.
For example, say you have a cookbook, and you're looking for recipes that involve chicken. The index in the back of the book might say something like:
chicken: 30,34,72,84
letting you know that you will find chicken recipes on those 4 pages. It's much faster to find this information in the index than by reading through the whole book, because the index is shorter, and (more importantly) it's in alphabetical order, so you can quickly find the right place in the index.
So, in general you want to create indexes on columns that you will regularly need to query (book_id and chapter, in your example).
When you declare a column as a primary key, the database automatically generates an index on that column. In your case, since you mostly run selects, an index is ideal, because indexes improve the time of selection queries and degrade the time of insertions. So you can create the indexes you think you need without worrying about the performance.
Indexes are a very sensitive subject. If you consider using them, you need to be very careful how many you make. The primary key, or id, of each table should have a clustered index. All the rest, it depends on how you plan to use them. I'm very fuzzy in the subject of indexes, and have actually never worked with them, but from a seminar I just watched actually yesterday, you don't want too many indexes - because they can actually slow things down when you don't need to use them.
Let's say you put an index on 5 out of 8 fields on a table. Each index is designated for a particular query somewhere in your software. Well, when 1 query is run, it uses that 1 index, and doesn't need the other 4. So that's unneeded weight on this 1 query. If you need an index, be sure that this is an index which could be useful in many places, not just 1 place.