Structuring a database - postgresql

In my database I need the following relations:
Tournament
Tournament Participant (Tpart): relates a User to a Tournament
Round: a single match
Hole: relates a Tpart to a Round, and also holds a score
These are my current Persistent entities:
Tournament
    name Text
    urlName Text
    location GolfCourseId
    startDate Day
    endDate Day Maybe
    UniqueTName name
    UniqueTUrlName urlName
Tpart
    tournament TournamentId
    userId UserId
    deriving Show
    deriving Eq
Round
    tourn TournamentId
    name Text
    UniqueRound tourn name
    deriving Show
Hole
    round RoundId
    part TpartId
    score Int
    deriving Show
I don't know if this is the best structure given the kind of queries I need to do.
I need to
Get the total score for a round for each Tpart
This would be done by summing up the score of all Holes related to a specific round and Tpart
Part | round 1 | round 2 | ...
p1 | 56 | 54
p2 | 60 | 57
Get all the holes and tparts that relate to a round
Part | hole 1 | hole 2| ...
p1 | 3 | 5
p2 | 5 | 6
Getting the data in the first table would require summing all the hole scores for each user. Is this an efficient method? Or would it be better to have another entity, RndScore, like this:
RndScore
    rnd RoundId
    tpart TpartId
    score Int
This entity could be updated every time a Hole entity is updated. Either of those solutions seems rather robust, though.

My advice is: you should always start with a clean, normalized logical relational database design without storing redundant data, and trust that the DBMS will derive your data (i.e., answer your queries) well enough. That is what a DBMS is there for. The next step should be to optimize your physical database design, e.g., choose your indexes, your table storage parameters, etc. Depending on your database, you can even materialize your views so that their results are stored physically. Actually adding derived values to your logical database design (such as your RndScore relation) should be the last resort, as you will have to ensure their consistency manually.
In general, you should avoid premature optimizations: ensure that you actually need to optimize your database layout (e.g., by measuring runtimes, checking query execution plans, making estimations about the number of queries you will have to answer, etc.)
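For example, the first table (total score per round for each Tpart) can be derived on the fly with a plain aggregate query. A rough sketch, assuming your entities end up as a hole table with round, part and score columns (adjust the names to whatever Persistent actually generates for your schema):

-- Hypothetical table/column names; Persistent's generated schema may differ.
-- Total score per participant per round:
SELECT part, round, SUM(score) AS round_total
FROM hole
GROUP BY part, round
ORDER BY part, round;

Pivoting those (part, round, total) rows into the wide layout shown in the question is a presentation concern that is usually easier to handle in application code (or with the crosstab function from the tablefunc extension, if you really want it in SQL).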

Related

Filter postgresql query if multiple rows exist

right now I have the following table:
students | classes |
-------------------------------------
Ally | Math |
Ally | English |
Ally | Science |
Kim | Math |
Kim | English |
I am currently building an advanced search feature where you can search by class and return students who have those classes. I would like to build a query that will return students that have Math and English and Science in the classes column, so in the case above it would only return the rows that have Ally in them, since she meets the three-class criterion.
If anyone has any advice I would greatly appreciate it, thank you.
I've renamed your tables and such slightly, partly because I'm lazy. Here's what I came up with:
select student from studentclasses where
class in ('Math', 'English', 'Science')
group by student
having count(*) = 3;
See the db-fiddle
The idea is to grab all the student-class rows that match your search (basically an OR) and group them by student so that we can filter with the HAVING clause. We could use >= here, but if the count for a particular student comes out above 3, we screwed up the IN :) If there are fewer than 3, then we're missing one class, so not all classes were found for that student.
The only caveats are:
I'm assuming you're using a student ID rather than just first name, and that the first name bit is just to make it easier for us to read, otherwise duplicates will abound.
There are no duplicates of a given class for a particular student. If there can be (say one of a student's classes is listed twice while another is missing), the plain count can still reach 3, and you'll need to use a DISTINCT in there somewhere.
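If duplicates can occur, a sketch of the same query with DISTINCT added (same table and columns as above):

select student from studentclasses where
class in ('Math', 'English', 'Science')
group by student
having count(distinct class) = 3;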

Practical storage of many (in single-row) boolean values

I wish to store many (N ~ 150) boolean values for web app "environment" variables.
What is the proper way to get them stored?
creating N columns and one (1) row of data,
creating two (2) or three (3) columns (id smallserial, name varchar(255), value boolean) with N rows of data,
by using jsonb data type,
by using the array data type,
by using bit string bit varying(n),
by another way (please advise)
Note: name may be too long.
Tia!
Could you perhaps use a bit string? https://www.postgresql.org/docs/7.3/static/datatype-bit.html. (Set the nth bit to 1 when the nth attribute would have been "true")
It depends on how you want to access them in normal usage.
Do you need to access one of these values at a time? In that case JSONB is a really good way: it is really easy and quick to find a record. Or do you need to get all of them in one call? In that case bit string types are the best, but you need to be really careful about ordering and transcription when writing and reading.
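To make the trade-off concrete, here is a small sketch of both approaches; the table names and flag names are made up:

-- Option A: one jsonb column; a single flag is easy to read by name.
CREATE TABLE settings_json (
    id       smallserial PRIMARY KEY,
    settings jsonb NOT NULL
);
INSERT INTO settings_json (settings) VALUES ('{"use_the_force": true, "dark_mode": false}');
SELECT (settings ->> 'use_the_force')::boolean FROM settings_json WHERE id = 1;

-- Option B: one bit string; compact, but the meaning of each position lives only in your code.
CREATE TABLE settings_bits (
    id    smallserial PRIMARY KEY,
    flags bit varying(150) NOT NULL
);
INSERT INTO settings_bits (flags) VALUES (B'10');
SELECT substring(flags FROM 1 FOR 1) = B'1' AS use_the_force FROM settings_bits WHERE id = 1;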
Any of the options will do, depending on your circumstances. There is little need to optimise storage if you have only 150 values. Unless, of course there can be a very large number of these sets of 150 values or you are working in a very restricted environment like an embedded system (in which case a full-blown database client is probably not what you're looking for).
There is no definite answer here, but I will give you a few guidelines to consider. As from experience:
You don't want to have an anonymous string of values that is interpreted in code. When you change anything later on, your 1101011 or 0x12f08a will turn into a fascinatingly enigmatic problem.
When the number of your fields starts to grow, you will regret it if they are all stored in a single cell on a single row, because you will either be developing some obscure SQL or transferring a larger-than-needed dataset from the server.
When you feel that boolean values are really not enough, you start to wonder if there is a possibility to store something else too.
Settings and environmental properties are seldom subject to processor or data intensive processing, so follow the easiest path.
As my recommendation based on the given information and some educated guessing, you'll probably want to store your information in a table like
key (string)   | set_idx (integer) | value (string)
---------------+-------------------+----------------
use.the.force  | 1899              | 1
home.directory | 1899              | /home/dvader
use.the.force  | 1900              | 0
home.directory | 1900              | /home/yoda
Converting a 1 to boolean true is cheap, and if you have only one set of values, you can ignore the set index.
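A minimal sketch of that table in SQL (names and types are only illustrative):

CREATE TABLE app_settings (
    key     varchar(255) NOT NULL,  -- e.g. 'use.the.force'
    set_idx integer      NOT NULL,  -- which set of values the row belongs to
    value   text         NOT NULL,  -- '1', '0', '/home/dvader', ...
    PRIMARY KEY (key, set_idx)
);

-- Read one boolean flag from one set; '1'/'0' casts cheaply to boolean:
SELECT value::boolean
FROM app_settings
WHERE key = 'use.the.force' AND set_idx = 1899;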

store list in key value database

I am searching for the best way to store lists associated with a key in a key-value database (like Berkeley DB or LevelDB).
For example:
I have users, and orders from user to user.
I want to store a list of order ids for each user, for fast access with range selects (for pagination).
How to store this structure?
I don't want to store it in a serialized format for each user:
user_1_orders = serialize(1,2,3..)
user_2_orders = serialize(1,2,3..)
because the list can be long.
I have thought about a separate db file for each user, storing order ids as keys in it, but this does not solve the range select problem. What if I want to get user ids in the range [5000:5050]?
I know about Redis, but I am interested in key-value implementations like Berkeley DB or LevelDB.
Let's start with a single list. You can work with a single hashmap:
store in row 0 the count of the user's orders
for each new order, store a new row with the count incremented
So your hashmap looks like the following:
key | value
-------------
0 | 5
1 | tomato
2 | celery
3 | apple
4 | pie
5 | meat
Steadily incrementing the key makes sure that every key is unique. Given that the db is key-ordered and that the pack function translates integers into byte arrays that sort correctly, you can fetch slices of the list. To fetch orders between 5000 and 5050 you can use bsddb's Cursor.set_range or leveldb's createReadStream (JS API).
Now let's expand to multiple users' orders. If you can open several hashmaps, you can apply the above with one hashmap per user. But you may hit some system limits (max number of open fds, or max number of files per directory), so instead you can share a single hashmap between several users.
What I explain in the following works for both leveldb and bsddb, provided you pack keys correctly using lexicographic (byte) order. So I will assume that you have a pack function. In bsddb you have to build a pack function yourself. Have a look at wiredtiger.packing or bytekey for inspiration.
The principle is to namespace the keys using the user's id. It's also called key composition.
Say your database looks like the following:
key (user, n) | value
----------------------
 1 |  0       | 2        <--- count column for user 1
 1 |  1       | tomato
 1 |  2       | orange
 ...          | ...
32 |  0       | 1        <--- count column for user 32
32 |  1       | banana
 ...          | ...
You create this database with the following (pseudo) code:
db.put(pack(1, make_uid(1)), 'tomato')
db.put(pack(1, make_uid(1)), 'orange')
...
db.put(pack(32, make_uid(32)), 'banana')
make_uid implementation looks like this:
def make_uid(user_uid):
    # retrieve the current count (0 if the counter does not exist yet)
    counter_key = pack(user_uid, 0)
    value = db.get(counter_key) or 0
    value += 1  # increment
    # save the new count
    db.put(counter_key, value)
    return value
Then you have to do the correct range lookup; it's similar to the single-list case, but with a composite key. Using the bsddb API cursor.set_range(key), we retrieve all items
between 5000 and 5050 for user 42:
def user_orders_slice(user_id, start, end):
    key, value = cursor.set_range(pack(user_id, start))
    while True:
        current_user, order_id = unpack(key)
        # stop once we leave this user's keys or pass the end of the range
        if current_user != user_id or order_id > end:
            break
        else:
            # the value is probably packed somehow...
            yield value
            key, value = cursor.next()
No error checks are done. Among other things, slicing user_orders_slice(42, 5000, 5050) is not guaranteed to return 51 items if you delete items from the list. A correct way to query, say, 50 items is to implement a user_orders_query(user_id, start, limit).
I hope you get the idea.
You can use Redis to store list in zset(sorted set), like this:
// this line is called whenever a user places an order
$redis->zadd($user_1_orders, time(), $order_id);
// list orders of the user
$redis->zrange($user_1_orders, 0, -1);
Redis is fast enough. But one thing you should know about Redis is that it stores all data in memory, so if the data eventually exceeds the physical memory, you have to shard the data on your own.
Also you can use SSDB (https://github.com/ideawu/ssdb), which is a wrapper around leveldb with APIs similar to Redis, but it stores most data on disk and uses memory only for caching. That means SSDB's capacity is 100 times that of Redis' - up to TBs.
One way you could model this in a key-value store which supports scans, like leveldb, would be to add the order id to the key for each user. So the new keys would be userId_orderId for each order. Now to get orders for a particular user, you can do a simple prefix scan: scan(userId*). This makes the userId range query slow; in that case you can maintain another table just for userIds, or use another key convention, Id_userId, for getting userIds between [5000-5050].
Recently I have seen HyperDex adding data type support on top of leveldb (e.g. http://hyperdex.org/doc/04.datatypes/#lists), so you could give that a try too.
In BerkeleyDB you can store multiple values per key, either in sorted or unsorted order. This would be the most natural solution. LevelDB has no such feature. You should look into LMDB (http://symas.com/mdb/) though; it also supports sorted multi-value keys, and is smaller, faster, and more reliable than either of the others.

Do I have to create a surrogate key if I want to save space?

Let's say I have a very large table with owners of cars like so:
OWNERSHIP
owner | car
---------------
steven | audi
bernahrd | vw
dieter | vw
eike | vw
robert | audi
... one hundred million rows ...
If I refactor it to this:
OWNERSHIP
owner | car <-foreign key TYPE.car_type
---------------
steven | audi
bernahrd | vw
dieter | vw
eike | vw
robert | audi
...
TYPE
car_type |
---------------
audi
vw
Do I gain anything space-wise or speed-wise, or do I need to create an INTEGER surrogate key on car_type for that?
The integer is going to take up 4 bytes, which is one more byte than "vw" will. As it happens, PostgreSQL enums take up 4 bytes too, so you won't gain anything storage-wise by switching to this representation (except for the difficulties it imposes on changing the enum itself). Querying will be as fast either way, because with a table that size you're going to be consulting the index anyway. Database performance, especially when tables get large, is essentially a matter of I/O, not CPU performance. I'm not convinced that an index on integers is going to be smaller or faster than an index on short strings, especially when you have a huge number of rows referencing a very small set of possible values. It's certainly not going to be the bottleneck in your applications.
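For reference, the enum route would look roughly like this (the type and table names are made up); it also shows why changing the set of values later is awkward:

-- Illustrative names only.
CREATE TYPE car_make AS ENUM ('audi', 'vw');

CREATE TABLE ownership_enum (
    owner text NOT NULL,
    car   car_make NOT NULL
);

-- Adding a new make later means altering the type itself:
ALTER TYPE car_make ADD VALUE 'bmw';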
Even if we assume that you were able to recover 4 bytes by using an artificial key, how much storage are you going to save? 4 bytes times 100 million rows would be about 400 MB ideally. Are you so pressed for storage that you need to eke out a small amount like that on your honkin' database server? And this is assuming you refactor it into its own table and use a proper foreign key.
The right way to answer this, of course, is not to argue from first principles at all. Take your 100 million row table and work it both ways. Then examine the size yourself, like so:
SELECT pg_size_pretty(pg_total_relation_size('ownership'));
SELECT pg_size_pretty(pg_total_relation_size('ownership2'));
Do your test queries, with EXPLAIN ANALYZE like so:
EXPLAIN ANALYZE SELECT * FROM ownership WHERE car = 'audi';
EXPLAIN ANALYZE SELECT * FROM ownership2 WHERE car_id = 1;
Pay more attention to the actual time taken than the cost, but do look at the cost. Do this on the same database server as your production, if possible; if not, a similar machine with the same PostgreSQL configuration. Then you'll have hard numbers to tell you what you're paying for and what you're getting. My suspicion is that you'll find the space usage to be slightly worse with the artificial key and the performance to be equivalent.
If that's what you find, do the relational thing and use the natural key, and stop worrying so much about optimizing the physical storage. Space is the cheapest commodity you have.
Using two tables and a string foreign key would of course use more space than using one table. How much more depends on how many types of cars you have.
You should use an integer car_id (a sketch follows below):
Using integer keys would save space if a significant percentage of car names were repeated.
Even more so if you need to index the car column, as an integer index is much smaller than a string index.
Also, comparing integers is faster than comparing strings, so searching by car should also be faster.
A smaller table means that a bigger part of it will fit in cache, so accessing it should also be faster.
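A rough sketch of that integer-keyed layout, reusing the ownership2 naming from the answer above (all names here are illustrative):

CREATE TABLE car_type (
    car_id serial PRIMARY KEY,
    name   text NOT NULL UNIQUE
);

CREATE TABLE ownership2 (
    owner  text    NOT NULL,
    car_id integer NOT NULL REFERENCES car_type (car_id)
);

CREATE INDEX ownership2_car_id_idx ON ownership2 (car_id);

-- Searching by car name now goes through a join:
SELECT o.owner
FROM ownership2 AS o
JOIN car_type   AS t ON t.car_id = o.car_id
WHERE t.name = 'audi';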

database design decision (NoSQL) [closed]

I'm working on an application that has the following use case:
Users upload csv files, which need to be persisted across application restarts
The data in the csv files need to be queried/sorted etc
Users specify the query-able columns in a csv file at the time of uploading the file
The currently proposed solution is:
For small files (much more common), transform the data into XML and store it either as a LOB or in the file system. For querying, slurp the whole thing into memory and use something like XQuery
For larger files, create dynamic tables in the database (MySQL), with indexes on the query-able columns
Although we have prototyped this solution and it works reasonably well, it's keeping us from supporting more complex file formats such as XML and JSON. There are also a few more niggling issues with the solution that I won't go into.
Considering the schemaless nature of NoSQL databases, I thought they might be used to solve this problem. I have no practical experience with NoSQL though. My questions are:
1. Is NoSQL well suited for this use case?
2. If so, which NoSQL database?
3. How would we store csv files in the DB (a collection of key-value pairs where the column headers make up the keys and the data fields from each row make up the values)?
4. How would we store XML/JSON files with possibly deeply hierarchical structures?
5. How about querying/indexing and other performance considerations? How does that compare to something like MySQL?
Appreciate the responses and thanks in advance!
example csv file:
employee_id,name,address
1234,XXXX,abcabc
001001,YYY,xyzxyz
...
DDL statement:
CREATE TABLE `employees`(
`id` INT(6) NOT NULL AUTO_INCREMENT,
`employee_id` VARCHAR(12) NOT NULL,
`name` VARCHAR(255),
`address` TEXT,
PRIMARY KEY (`id`),
UNIQUE INDEX `EMPLOYEE_ID` (`employee_id`)
);
for each row in csv file
INSERT INTO `employees`
(`employee_id`,
`name`,
`address`)
VALUES (...);
Not really a full answer, but I think I can help on some points.
For number 2, I can at least give this link, which helps with sorting out NoSQL implementations.
For number 3, using a SQL database (though this should work just as well in a NoSQL system), I would represent each column and each row as individual tables, and add a third table with foreign keys to the columns and rows, holding the value of the cell. You get one big table with easy filtering; see the sketch below.
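A rough sketch of that layout, in the same style as the DDL in the question (all names here are made up; the same rows/columns/cells shape carries over to most NoSQL stores):

CREATE TABLE `csv_rows` (
  `row_id`  INT NOT NULL AUTO_INCREMENT,
  `file_id` INT NOT NULL,
  PRIMARY KEY (`row_id`)
);

CREATE TABLE `csv_columns` (
  `column_id` INT NOT NULL AUTO_INCREMENT,
  `file_id`   INT NOT NULL,
  `name`      VARCHAR(255) NOT NULL,
  PRIMARY KEY (`column_id`)
);

CREATE TABLE `csv_cells` (
  `row_id`    INT NOT NULL,
  `column_id` INT NOT NULL,
  `value`     TEXT,
  PRIMARY KEY (`row_id`, `column_id`),
  FOREIGN KEY (`row_id`)    REFERENCES `csv_rows` (`row_id`),
  FOREIGN KEY (`column_id`) REFERENCES `csv_columns` (`column_id`)
);

-- Find the rows whose "address" cell equals 'abcabc':
SELECT c.`row_id`
FROM `csv_cells` c
JOIN `csv_columns` col ON col.`column_id` = c.`column_id`
WHERE col.`name` = 'address' AND c.`value` = 'abcabc';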
For number 4, you need to "represent hierarchical data in a table"
The common approach to this would be to have a table with attributes and a foreign key to the same table pointing to the parent, like this for example:
+----+------------+------------+--------+
| id | attribute1 | attribute2 | parent |
+----+------------+------------+--------+
| 0 | potato | berliner | NULL |
| 1 | hello | jack | 0 |
| 2 | hello | frank | 0 |
| 3 | die | please | 1 |
| 4 | no | thanks | 1 |
| 5 | okay | man | 4 |
| 6 | no | ideas | 2 |
| 7 | last | one | 2 |
+----+------------+------------+--------+
Now the problem is that, if you want to get, say, all the child elements of element 1, you'll have to query every item individually to obtain its children. Some other operations are hard, because they need the path to an object, traversing many other objects and making extra data queries.
One common workaround to this, and the one I use and prefer, is called modified pre-order tree traversal.
Using this technique, we need an extra layer between the data storage and the application, to fill some extra columns on each structure-altering modification. We will assign to each object three properties: left, right and depth.
The left and right properties are filled by counting each object from the top, traversing all the tree leaves recursively.
This is a rough approximation of the traversal algorithm for left and right (the part for depth can easily be guessed; it is just a few lines to add):
1. Set the left attribute of the tree root (or of the first tree root, if there are many) to 1.
2. Go to its first (or next) child. Set its left attribute to the last number plus one (here, 2).
3. Does it have any children? If yes, go back to step 2. If no, set its right attribute to the last number plus one.
4. Go to the next child and do the same as in step 2.
5. If there are no more children, go to the next child of the parent and do the same as in step 2.
(A picture illustrating the resulting left/right numbering appeared here; source: narod.ru.)
Now it is much easier to find all descendants of an object, or all of its ancestors. This can be done with only a single query, using left and right.
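For example, with a hypothetical tree table holding id, lft and rgt columns (lft/rgt being the left and right values above), all descendants of a node come back from a single range query:

-- All descendants of the node with id = 1:
SELECT child.*
FROM tree AS child
JOIN tree AS parent ON child.lft BETWEEN parent.lft AND parent.rgt
WHERE parent.id = 1
  AND child.id <> parent.id;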
What is important when using this is having a good implementation of the layer between the data and the application, handling the left, right and depth attributes. These fields have to be adjusted when:
An object is deleted
An object is added
The parent field of an object is modified
This can be done with a parallel process, using locks. It can also be implemented directly between the data and the application.
See these links for more information about trees :
Managing hierarchies in SQL: MPTT/nested sets vs adjacency lists vs storing paths
MPTT With Django lib
http://www.sitepoint.com/hierarchical-data-database-2/
I personally had great results with django-nonrel and django-mptt the few times I did NoSQL.