I have a database using PostgreSQL, which holds data on students, applications and job offers.
Is there some kind of constraint that will ensure a student can only accept one job offer, so that once they select 'yes' on the 'job accepted' attribute they can no longer do so for any other offers they may receive?
It is not exactly a "constraint". It is just a column. In the Student table, have a column called AcceptedJobOfferId. That solves the direct problem. In addition, you want the following:
AcceptedJobOfferId int references JobOffers(JobOfferid)
And, then create a unique index on Applications for StudentId, JobOfferId and include:
foreign key (StudentId, AcceptedJobOfferId) references Applications(StudentId, JobOfferId)
This ensures that the job offer is a valid job and that it references an application (assuming that an application is a requirement -- 100% of the time -- for acceptance).
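A minimal sketch of the schema described above, run against an in-memory SQLite database through Python's sqlite3 module so that it is self-contained. The table and column names (students, applications, job_offers) and the accept_offer helper are assumptions made for illustration; the same idea should carry over to PostgreSQL with minor syntax changes.

```python
import sqlite3


def build_schema(conn):
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE job_offers (
            job_offer_id INTEGER PRIMARY KEY
        );
        CREATE TABLE students (
            student_id INTEGER PRIMARY KEY,
            -- A single nullable column: at most one accepted offer per student.
            accepted_job_offer_id INTEGER REFERENCES job_offers (job_offer_id),
            -- The accepted offer must be one the student actually applied for.
            FOREIGN KEY (student_id, accepted_job_offer_id)
                REFERENCES applications (student_id, job_offer_id)
        );
        CREATE TABLE applications (
            student_id INTEGER NOT NULL REFERENCES students (student_id),
            job_offer_id INTEGER NOT NULL REFERENCES job_offers (job_offer_id),
            UNIQUE (student_id, job_offer_id)
        );
        """
    )


def accept_offer(conn, student_id, job_offer_id):
    # Overwrites any previous acceptance, so two accepted offers cannot coexist.
    conn.execute(
        "UPDATE students SET accepted_job_offer_id = ? WHERE student_id = ?",
        (job_offer_id, student_id),
    )
```

Accepting an offer the student never applied for fails the composite foreign key, which is the guarantee the answer describes.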
I imagine you've some kind of job applications table, which has a field called is_accepted in it or something to that order. You can add an exclude constraint on it. Example here.
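As a rough illustration of the is_accepted approach, the sketch below uses a partial unique index rather than an exclude constraint, since both achieve "at most one accepted row per student" and the index form also runs on SQLite. The applications table layout is an assumption; in PostgreSQL the same CREATE UNIQUE INDEX ... WHERE statement should work with a boolean is_accepted column.

```python
import sqlite3


def build_applications(conn):
    conn.executescript(
        """
        CREATE TABLE applications (
            student_id INTEGER NOT NULL,
            job_id INTEGER NOT NULL,
            is_accepted INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (student_id, job_id)
        );
        -- Only accepted rows are indexed, so a student may have many
        -- applications but at most one with is_accepted = 1.
        CREATE UNIQUE INDEX one_accepted_job_per_student
            ON applications (student_id)
            WHERE is_accepted = 1;
        """
    )
```

A second acceptance for the same student then raises an integrity error instead of silently succeeding.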
An alternative is to add an accepted_job_id column (ideally a foreign key) to the students table, as already suggested by Gordon.
Side note: If this is going to be dealing with real data, rather than theoretical data in a database course, you probably do not want to enforce the constraint at all. Sometimes, people want or need multiple jobs, so limiting the system in such a way that they cannot apply to more than one job introduces an artificial limitation which may come back and bite you down the road.
Is there a way to have my items automatically ordered by creation order in DynamoDB?
I've tried using an ISO timestamp in my sort-keys but I soon noticed that items created in the same second have no guaranteed order.
Another major issue is that my sort-key is composite, for example :
Sort_Key : someRandomUuidHere|Created:someTimeStampHere
I need to generate UUIDs to try to guarantee the key is unique, but adding the timestamp at the end of the UUID doesn't seem to order it by the timestamp, but by the UUID instead. And if I add the timestamp to the start I can't use things like begins_with, so it breaks my access patterns.
The only way I could think of was maintaining a "last-key" object and always ask it the last item index before, but that would require an extra request and some ACID logic.
Or maybe just order it on the client side, there's always that
Sort keys are sorted left to right...so yeah, the UUID would be the only component that affects the order.
Why not place the timestamp first?
Alternatively, consider just using a time-based UUID. Dynamo doesn't offer one natively, although some NoSQL DBs, such as Cassandra, do.
Here's an article that discusses creating one with .NET
Creating a Time UUID (GUID) in .NET
which includes a link to the following source:
https://github.com/fluentcassandra/fluentcassandra/blob/master/src/GuidGenerator.cs
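To illustrate the "timestamp first" suggestion, here is a small sketch of a sort key that sorts by creation time because the time component is zero-padded and sits to the left of the random part. The key layout and the make_sort_key name are assumptions, not anything DynamoDB prescribes; the fixed "Created:" prefix is kept so a begins_with condition on that prefix would still match, though whether that fits your access patterns is for you to judge.

```python
import time
import uuid


def make_sort_key(created_ns=None):
    """Return 'Created:<zero-padded nanosecond timestamp>|<uuid4>'."""
    if created_ns is None:
        created_ns = time.time_ns()
    # Zero padding makes string order agree with numeric order.
    return f"Created:{created_ns:020d}|{uuid.uuid4()}"
```

Two items created in the same nanosecond would still fall back to UUID order, so this narrows the tie window rather than removing it.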
Some of the Users in my database will also be Practitioners.
This could be represented by either:
an is_practitioner flag in the User table
a separate Practitioner table with a user_id column
It isn't clear to me which approach is better.
Advantages of flag:
fewer tables
only one id per user (hence no possibility of confusion, and also no confusion in which id to use in other tables)
flexibility (I don't have to decide whether fields are Practitioner-only or not)
possible speed advantage for finding User-level information for a practitioner (e.g. e-mail address)
Advantages of new table:
no nulls in the User table
clearer as to what information pertains to practitioners only
speed advantage for finding practitioners
In my case specifically, at the moment, practitioner-related information is generally one-to-many (such as the locations they can work in, or the shifts they can work, etc). I would not be at all surprised if it turns out I need to store simple attributes for practitioners (i.e., one-to-one).
Questions
Are there any other considerations?
Is either approach superior?
You might want to consider the fact that someone who is a practitioner today may be something else tomorrow (and by that I don't mean simply not being a practitioner): say, a consultant, an author, or whatever the variants are in your subject domain. You might want to keep track of their latest status in the Users table, so it might make sense to have a ProfType field (type of professional practice) or equivalent. This way you have all the advantages of a flag: you could keep it as a string field, leave it as a blank string, or fill it with other ProfType codes as your requirements grow.
You mention that a new table has a speed advantage for finding practitioners. No, you are better off with a WHERE clause on the Users table for that.
Your last paragraph (one-to-many), however, may tilt the whole choice in favour of a separate table. You might also want to consider the likely number of records, likely growth, the criticality of complicated queries, etc.
I tried to draw two scenarios, with some notes inside the image. It's really only a draft, just to help you "see" the various entities. Maybe you have already done something like it; in that case, please disregard my answer. As Whirl stated in his last paragraph, you should consider other things too.
Personally I would go for a separate table - as long as you can already identify some extra data that make sense only for a Practitioner (e.g.: full professional title, University, Hospital or any other Entity the Practitioner is associated with).
So in case in the future you discover more data that make sense only for the Practitioner and/or identify another distinct "subtype" of User (e.g. Intern) you can just add fields to the Practitioner subtable, or a new Table for the Intern.
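A sketch of the separate-table option, assuming hypothetical users and practitioners tables in SQLite. The practitioner row reuses the user's id as its own primary key, so there is still only one id per person, and practitioner-only columns (here a professional_title, as an example) live apart from the user columns.

```python
import sqlite3


def build_schema(conn):
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE users (
            user_id INTEGER PRIMARY KEY,
            email TEXT NOT NULL
        );
        -- Subtype table: primary key and foreign key are the same column.
        CREATE TABLE practitioners (
            user_id INTEGER PRIMARY KEY REFERENCES users (user_id),
            professional_title TEXT
        );
        """
    )


def practitioner_emails(conn):
    # User-level data for practitioners is one join away.
    rows = conn.execute(
        """
        SELECT u.email
        FROM users AS u
        JOIN practitioners AS p ON p.user_id = u.user_id
        ORDER BY u.user_id
        """
    )
    return [row[0] for row in rows]
```

Another subtype, such as an intern table, could follow the same pattern without touching the users table.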
It might be advantageous to use a User Type field as suggested by @Whirl Mind above.
I think that this is just one example of having to identify different types of objects in your DB, and for that I refer to one of my previous answers here: Designing SQL database to represent OO class hierarchy
I'm building a DynamoDB table that holds notification messages. Messages are directed from a given user (from_user) to another user (to_user). They're quite simple:
{ "to_user": "e17818ae-104e-11e3-a1d7-080027880ca6", "from_user": "e204ea36-104e-11e3-9b0b-080027880ca6", "notification_id": "e232f73c-104e-11e3-9b30-080027880ca6", "message": "Bob recommended a good read.", "type": "recommended", "isbn": "1844134016" }
These are the Hash/Range keys defined on the table:
HashKey: to_user, RangeKey: notification_id
Case 1: Users regularly phone home to ask for any available notifications.
With these keys, it's easy to fetch the notifications awaiting a given user:
notifications.query(to_user="e17818ae-104e-11e3-a1d7-080027880ca6")
Case 2: Once a user has seen a message, they will explicitly acknowledge it and it will be deleted. This is similarly simple to accomplish with the given Hash/Range keys:
notifications.delete(to_user="e17818ae-104e-11e3-a1d7-080027880ca6", notification_id="e232f73c-104e-11e3-9b30-080027880ca6")
Case 3: It may sometimes be necessary to delete items in this table identified by keys other than the to_user and notification_id. For example, user Bob decides to un-recommend a book and we would like to pull notifications with from_user=Bob, action=recommended and isbn=isbnval.
I know this can't be done with the Hash/Range keys I've chosen. Local secondary indexes also seem unhelpful here since I don't want to work within the table's chosen HashKey.
So am I stuck doing a full Scan? I can imagine creating a second table to map from_user+action+isbn back to items in the original table but that means I have to manage that additional complexity... and it seems like this hand-rolled index could get out of sync easily.
Any insights would be appreciated. I'm new to DynamoDB and trying to understand how typical data models map to it. Thanks.
Your analysis is correct. For case 3 and this schema, you must do a table scan.
There are a number of options which you've identified, but all of them will add a layer of complexity to your application.
1. Use a second table, as you state. You are effectively creating your own global index and must manage that complexity yourself. This grows in complexity as you require more indices.
2. Perform a full table scan. Look at DynamoDB's scan segmenting for a method of distributing the scan across multiple worker nodes. Depending on your latency requirements (is it OK if the recommendations don't go away until the next scan?) you may be able to combine this and other future background tasks into a constant background process. This is also simpler than option 1.
Both of these seem to be fairly common models.
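For the scan option, a sketch of reading one segment of a segmented (parallel) scan might look like the following. It assumes a boto3-style table object whose scan method accepts Segment, TotalSegments and ExclusiveStartKey and returns a dict with Items and an optional LastEvaluatedKey; the scan_segment helper itself is hypothetical, and each worker would call it with its own segment number.

```python
def scan_segment(table, segment, total_segments, **scan_kwargs):
    """Collect every item from one segment of a segmented scan."""
    kwargs = dict(scan_kwargs, Segment=segment, TotalSegments=total_segments)
    items = []
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            # No more pages in this segment.
            return items
        # Resume from where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key
```

Any filter on from_user, type and isbn would be passed through scan_kwargs; the items returned still need to be deleted one by one using their to_user and notification_id keys.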
If I were using an RDBMS (e.g. SQL Server) to store event sourcing data, what might the schema look like?
I've seen a few variations talked about in an abstract sense, but nothing concrete.
For example, say one has a "Product" entity, and changes to that product could come in the form of: Price, Cost and Description. I'm confused about whether I'd:
1. Have a "ProductEvent" table that has all the fields for a product, where each change means a new record in that table, plus "who, what, where, why, when and how" (WWWWWH) as appropriate. When cost, price or description are changed, a whole new row is added to represent the Product.
2. Store product Cost, Price and Description in separate tables joined to the Product table with a foreign key relationship. When changes to those properties occur, write new rows with WWWWWH as appropriate.
3. Store WWWWWH, plus a serialised object representing the event, in a "ProductEvent" table, meaning the event itself must be loaded, de-serialised and re-played in my application code in order to re-build the application state for a given Product.
Particularly I worry about option 2 above. Taken to the extreme, the product table would be almost one-table-per-property, where to load the Application State for a given product would require loading all events for that product from each product event table. This table-explosion smells wrong to me.
I'm sure "it depends", and while there's no single "correct answer", I'm trying to get a feel for what is acceptable, and what is totally not acceptable. I'm also aware that NoSQL can help here, where events could be stored against an aggregate root, meaning only a single request to the database to get the events to rebuild the object from, but we're not using a NoSQL db at the moment so I'm feeling around for alternatives.
The event store should not need to know about the specific fields or properties of events. Otherwise every modification of your model would result in having to migrate your database (just as in good old-fashioned state-based persistence). Therefore I wouldn't recommend options 1 and 2 at all.
Below is the schema as used in Ncqrs. As you can see, the table "Events" stores the related data as a CLOB (i.e. JSON or XML). This corresponds to your option 3 (Only that there is no "ProductEvents" table because you only need one generic "Events" table. In Ncqrs the mapping to your Aggregate Roots happens through the "EventSources" table, where each EventSource corresponds to an actual Aggregate Root.)
Table Events:
Id [uniqueidentifier] NOT NULL,
TimeStamp [datetime] NOT NULL,
Name [varchar](max) NOT NULL,
Version [varchar](max) NOT NULL,
EventSourceId [uniqueidentifier] NOT NULL,
Sequence [bigint],
Data [nvarchar](max) NOT NULL
Table EventSources:
Id [uniqueidentifier] NOT NULL,
Type [nvarchar](255) NOT NULL,
Version [int] NOT NULL
The SQL persistence mechanism of Jonathan Oliver's Event Store implementation consists basically of one table called "Commits" with a BLOB field "Payload". This is pretty much the same as in Ncqrs, only that it serializes the event's properties in binary format (which, for instance, adds encryption support).
Greg Young recommends a similar approach, as extensively documented on Greg's website.
The schema of his prototypical "Events" table reads:
Table Events
AggregateId [Guid],
Data [Blob],
SequenceNumber [Long],
Version [Int]
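As a rough sketch of the "one generic events table" idea shared by these schemas, the following stores each event as a JSON document keyed by aggregate id and sequence number. It uses SQLite via Python for self-containment; the table layout, function names and the event names in any usage are assumptions, not the Ncqrs or Event Store code.

```python
import json
import sqlite3


def create_event_store(conn):
    conn.execute(
        """
        CREATE TABLE events (
            aggregate_id TEXT NOT NULL,
            sequence_number INTEGER NOT NULL,
            name TEXT NOT NULL,
            data TEXT NOT NULL,  -- serialized event; the store never inspects it
            PRIMARY KEY (aggregate_id, sequence_number)
        )
        """
    )


def append_event(conn, aggregate_id, name, payload):
    # Next sequence number for this aggregate (1 for the first event).
    (last,) = conn.execute(
        "SELECT COALESCE(MAX(sequence_number), 0) FROM events WHERE aggregate_id = ?",
        (aggregate_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (aggregate_id, last + 1, name, json.dumps(payload)),
    )
    return last + 1


def load_events(conn, aggregate_id):
    rows = conn.execute(
        "SELECT name, data FROM events WHERE aggregate_id = ? ORDER BY sequence_number",
        (aggregate_id,),
    )
    return [(name, json.loads(data)) for name, data in rows]
```

Adding a new event type or a new property then needs no schema migration, because only the serialized payload changes.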
The GitHub project CQRS.NET has a few concrete examples of how you could do EventStores in a few different technologies. At time of writing there is an implementation in SQL using Linq2SQL and a SQL schema to go with it, there's one for MongoDB, one for DocumentDB (CosmosDB if you're in Azure) and one using EventStore (as mentioned above). There's more in Azure like Table Storage and Blob storage which is very similar to flat file storage.
I guess the main point here is that they all conform to the same principle/contract. They all store information in a single place/container/table, they use metadata to identify one event from another and 'just' store the whole event as it was - in some cases serialised, in supporting technologies, as it was. So depending on whether you pick a document database, relational database or even flat file, there are several different ways to reach the same intent of an event store (which is useful if you change your mind at any point and find you need to migrate or support more than one storage technology).
As a developer on the project I can share some insights on some of the choices we made.
Firstly, we found that (even with UUIDs/GUIDs instead of integers) an ID on its own wasn't unique enough to serve as a key, because sequential IDs still turn up for a variety of strategic reasons. So we merged our main ID key column with the data/object type to create what should be a truly unique key (in the sense of your application). I know some people say you don't need to store it, but that will depend on whether you are greenfield or having to co-exist with existing systems.
We stuck with a single container/table/collection for maintainability reasons, but we did play around with a separate table per entity/object. We found in practice that this meant either the application needed "CREATE" permissions (which, generally speaking, is not a good idea, though there are always exceptions) or new storage containers/tables/collections had to be created each time a new entity/object came into existence or was deployed. We found this was painfully slow for local development and problematic for production deployments. You may not, but that was our real-world experience.
Another thing to remember is that asking action X to happen may result in many different events occurring, so knowing all the events generated by a command/event/whatever is useful. They may also be across different object types, e.g. pushing "buy" in a shopping cart may trigger account and warehousing events to fire. A consuming application may want to know all of this, so we added a CorrelationId. This meant a consumer could ask for all events raised as a result of their request. You'll see that in the schema.
Specifically with SQL, we found that performance really became a bottleneck if indexes and partitions weren't adequately used. Remember that events will need to be streamed in reverse order if you are using snapshots. We tried a few different indexes and found that, in practice, some additional indexes were needed for debugging in-production real-world applications. Again, you'll see that in the schema.
Other in-production metadata was useful during production-based investigations: timestamps gave us insight into the order in which events were persisted vs raised. That gave us some assistance on a particularly heavily event-driven system that raised vast quantities of events, giving us information about the performance of things like networks and the system's distribution across the network.
Well you might wanna give a look at Datomic.
Datomic is a database of flexible, time-based facts, supporting queries and joins, with elastic scalability, and ACID transactions.
I wrote a detailed answer here
You can watch a talk from Stuart Halloway explaining the design of Datomic here
Since Datomic stores facts in time, you can use it for event sourcing use cases, and so much more.
I think solutions 1 and 2 can become a problem very quickly as your domain model evolves. New fields are created, some change meaning, and some are no longer used. Eventually your table will have dozens of nullable fields, and loading the events will be a mess.
Also, remember that the event store should be used only for writes, you only query it to load the events, not the properties of the aggregate. They are separate things (that is the essence of CQRS).
Solution 3 is what people usually do, and there are many ways to accomplish it.
As an example, EventFlow CQRS, when used with SQL Server, creates a table with this schema:
CREATE TABLE [dbo].[EventFlow](
[GlobalSequenceNumber] [bigint] IDENTITY(1,1) NOT NULL,
[BatchId] [uniqueidentifier] NOT NULL,
[AggregateId] [nvarchar](255) NOT NULL,
[AggregateName] [nvarchar](255) NOT NULL,
[Data] [nvarchar](max) NOT NULL,
[Metadata] [nvarchar](max) NOT NULL,
[AggregateSequenceNumber] [int] NOT NULL,
CONSTRAINT [PK_EventFlow] PRIMARY KEY CLUSTERED
(
[GlobalSequenceNumber] ASC
)
)
where:
GlobalSequenceNumber: Simple global identification, may be used for ordering or identifying the missing events when you create your projection (readmodel).
BatchId: An identification of the group of events that were inserted atomically (TBH, I have no idea why this would be useful)
AggregateId: Identification of the aggregate
Data: Serialized event
Metadata: Other useful information from the event (e.g. event type used for deserialization, timestamp, originator id from command, etc.)
AggregateSequenceNumber: Sequence number within the same aggregate (this is useful if you cannot have writes happening out of order, so you use this field for optimistic concurrency)
However, if you are creating from scratch I would recommend following the YAGNI principle and creating only the minimal required fields for your use case.
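One way the AggregateSequenceNumber can support optimistic concurrency is sketched below. It assumes a unique constraint on (aggregate id, aggregate sequence number), which is not part of the schema shown above, and uses SQLite with hypothetical snake_case names: a writer states the version it last saw, and the insert fails if another writer has already taken the next number.

```python
import sqlite3


def create_table(conn):
    conn.execute(
        """
        CREATE TABLE event_flow (
            global_sequence_number INTEGER PRIMARY KEY AUTOINCREMENT,
            aggregate_id TEXT NOT NULL,
            data TEXT NOT NULL,
            aggregate_sequence_number INTEGER NOT NULL,
            -- Assumed constraint: two writers cannot claim the same version.
            UNIQUE (aggregate_id, aggregate_sequence_number)
        )
        """
    )


def try_append(conn, aggregate_id, expected_version, data):
    """Append as version expected_version + 1; False if someone got there first."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO event_flow (aggregate_id, data, aggregate_sequence_number) "
                "VALUES (?, ?, ?)",
                (aggregate_id, data, expected_version + 1),
            )
    except sqlite3.IntegrityError:
        return False
    return True
```

On a False result the caller would typically reload the aggregate's events and retry the command against the fresh state.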
A possible hint: a design that follows the "Slowly Changing Dimension" pattern (type 2) should help you cover:
order of events occurring (via surrogate key)
durability of each state (valid from - valid to)
A left fold function should also be okay to implement, but you need to think about future query complexity.
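The left fold mentioned here can be sketched in a few lines: current state is the result of applying each event, in order, to the previous state. The event names and the (name, value) event shape are assumptions chosen to match the Product example in the question.

```python
from functools import reduce

# Hypothetical mapping from event name to the product field it changes.
FIELD_BY_EVENT = {
    "PriceChanged": "price",
    "CostChanged": "cost",
    "DescriptionChanged": "description",
}


def apply_event(state, event):
    name, value = event
    field = FIELD_BY_EVENT.get(name)
    if field is None:
        return state  # unknown event types are ignored in this sketch
    # Return a new dict so earlier states are never mutated.
    return {**state, field: value}


def rebuild(events, initial=None):
    """Left-fold the ordered events into the current product state."""
    return reduce(apply_event, events, initial or {})
```

Passing a snapshot as initial and only the later events gives the same result as replaying the full history.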
I reckon this is a late answer, but I would like to point out that using an RDBMS as event sourcing storage is entirely possible if your throughput requirement is not high. I will just show you examples of an event-sourcing ledger I built to illustrate.
https://github.com/andrewkkchan/client-ledger-service
The above is an event sourcing ledger web service.
https://github.com/andrewkkchan/client-ledger-core-db
And in the above I use an RDBMS to compute states, so you can enjoy all the advantages that come with an RDBMS, like transaction support.
https://github.com/andrewkkchan/client-ledger-core-memory
And I have another consumer that processes in memory to handle bursts.
One could argue that the actual event store above still lives in Kafka, as an RDBMS is slow for inserting, especially when the inserting is always appending.
I hope the code helps give you an illustration, apart from the very good theoretical answers already provided for this question.
Good day!
I am a newbie on creating database... I need to create a db for my recruitment web application.
My database schema is as follows:
NOTE: I included the applicant_id on other tables... e.g. exam, interview, exam type.
Am I violating any normalization rule? If I am, what do you recommend to improve my design? Thank you
Overall looks good. A few minor points to consider:
Interviewer is also a person. You will need to use program logic to prevent variant spellings and misspellings.
The longest real life email address I've seen was 62 characters.
In exam you use the reserved word date for a column name
(subjective) I would rename applicant_date to applied_at
I don't see a postal / zip code for the applicant
All result columns are VARCHAR(4). If they use the same values, can they be normalized?
Birthdate is better to store than age. You don't want to schedule someone for an interview on their birthdate (or if you're cruel by nature, you do want that :) ). Age can be derived from it and will also be correct at all times.
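Deriving age from a stored birthdate, as the last point suggests, is a short calculation. The sketch below is one way to do it in application code; the age_on name is hypothetical.

```python
from datetime import date


def age_on(birthdate, today):
    """Whole years between birthdate and today."""
    years = today.year - birthdate.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years
```

For example, age_on(date(1990, 6, 15), date(2020, 6, 14)) gives 29, and one day later it gives 30.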
EDIT:
Given that result is PASS or FAIL, simply declare the field a boolean and name it 'passed'. A lot faster.
One area where I could see a potential problem is the Interviewer being integrated into interview. I would also like to point at the source channel in applicant, which could potentially get blobbed (depending on what you're going to store in there).
You don't seem to be violating any normalization rules at first glance. It's not clear from your schema design, however, that the applicant_id is referencing the applicant table. Make sure you declare it as a foreign key that references the applicant table when actually implementing the schema.
Not to make any assumptions on your data, but can the result of a screening be stored in 4 characters?
Age and gender are generally illegal questions to ask in interviews so you may not want to record such things. You might want a separate interviewer table. You also might want a separate table that stores qualifications so you can search for people you have interviewed with C# knowledge when the next opening comes up. I'd probably do something like a Qualifications table that is the lookup for quals you want to add to the applicant qualifications table. Then you'd need the qualification id, applicantId, years, skill level in the Applicant Qualification table.
I notice results is a varchar 4 field, I assume you are planning to put Pass/fail in it. I would consider having a numeric score as well. The guy who got 80% of the questions right passed but the guy who got 100% of them right might be the better candidate. In fact for interviews I might have interview questions and results tables. Then you can record the score and any comments about each question which can help later in evaluation of a lot of candidates. We did this manually in paper spreadsheets once when we were interviewing several hundred people (we had over a hundred openings at the time and this was way before personal computers) and found it most helpful to be able to compare answers to questions. It's hard to remember 200 people you interviewed and who said what. It might help later when you have a new opening to find the people who were strong on the questions most pertinent to the new job who might not have been given a job at the time of the interview (5 excellent candidates, 1 job for instance).
I might also consider a field to mark if the candidate is unacceptable for hiring, ever, for some reason, such as having committed a felony, having lied on the resume and been caught, or having been totally clueless in the interview. This can make it easy to prevent this person from being considered repeatedly.
I think that your DB structure has a lot of limitations for future usage. For example, you cannot even store a description of the exam, because this table stores only the score and exam date. It may be that this kind of information is already stored in another system and you only have to design the result container. But even then, the exam, screening and interview are all just forms of test, which is why the information about them should be stored in one table and distinguished by some type id. If you decide on this approach, you have to create another table to store the information about results.
So the definition of that should look more like this:
TEST
TEST_ID
TEST_TYPE_ID ref TEST_TYPE - Table that defines the test type
TEST_REQUIRED_SCORE - The score that needs to be reached to pass the exam.
... - Many other properties of TEST like duration, expiry date, active/inactive etc.
APPLICANT_RESULTS
APPLICANT_ID ref APPLICANT
TEST_ID ref TEST
TESTS_DATE - The day of exam
TEST_START - The time when the test has started
TEST_FINISH - The time when the test has ended
APPLICANT_RESULT - The applicant result of taken test.
This kind of structure is more flexible and gives an easy way to specify the requirements between tests in a table like this:
TEST_REQUIREMENTS - Table that specifies the test hierarchy and limitations
TEST_ID ref TEST
REQUIRED_TEST ref TEST
ORDER - the order of exams
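A condensed sketch of the TEST and APPLICANT_RESULTS tables above, using SQLite from Python. Column types, the omission of the applicant table, and the rule that a result passes when it is at least the required score are all assumptions. Note that ORDER is a reserved word in SQL, so a real TEST_REQUIREMENTS table would need a different column name or quoting.

```python
import sqlite3


def build_schema(conn):
    conn.executescript(
        """
        CREATE TABLE test_type (
            test_type_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL  -- e.g. exam, screening, interview
        );
        CREATE TABLE test (
            test_id INTEGER PRIMARY KEY,
            test_type_id INTEGER NOT NULL REFERENCES test_type (test_type_id),
            test_required_score INTEGER NOT NULL
        );
        CREATE TABLE applicant_results (
            applicant_id INTEGER NOT NULL,
            test_id INTEGER NOT NULL REFERENCES test (test_id),
            tests_date TEXT NOT NULL,
            applicant_result INTEGER NOT NULL,
            PRIMARY KEY (applicant_id, test_id, tests_date)
        );
        """
    )


def passed_tests(conn, applicant_id):
    # Pass/fail is derived from the score instead of being stored.
    rows = conn.execute(
        """
        SELECT r.test_id
        FROM applicant_results AS r
        JOIN test AS t ON t.test_id = r.test_id
        WHERE r.applicant_id = ?
          AND r.applicant_result >= t.test_required_score
        ORDER BY r.test_id
        """,
        (applicant_id,),
    )
    return [row[0] for row in rows]
```

Keeping the numeric result and deriving the pass flag also preserves the information needed to compare candidates later.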
Another scenario is that in the future your employer will want to switch to an e-exam system. In that case the only things you will need are:
Create table that will store the question definition (one question can be used in exam, screen or interview)
Create table that will store the question answers.
Create table that will store the information about the test question.
Create table for storing the answer for each question given from applicant.
A trigger that will update the over all score of test.