Cassandra Schema Design - NoSQL

I'm continuing to explore Cassandra and I would like to create a Student <=> Course relation, similar to a many-to-many relationship in an RDBMS.
In terms of queries, I will need the following:
Retrieve all courses in which a student is enrolled.
Retrieve all students enrolled in a specific course.
Let's say that I create two column families, one for Course and another for Student.
CREATE COLUMN FAMILY student with comparator = UTF8Type AND key_validation_class=UTF8Type and column_metadata=[
{column_name:firstname,validation_class:UTF8Type},
{column_name:lastname,validation_class:UTF8Type},
{column_name:gender,validation_class:UTF8Type}];
CREATE COLUMN FAMILY course with comparator = UTF8Type AND key_validation_class=UTF8Type and column_metadata=[
{column_name:name,validation_class:UTF8Type},
{column_name:description,validation_class:UTF8Type},
{column_name:lecturer,validation_class:UTF8Type},
{column_name:assistant,validation_class:UTF8Type}];
Now, how should I move on?
Should I create a third column family with a courseId:studentId composite key? If yes, can I use Hector to query by only one (left or right) composite key component?
Please help.
Update:
Following the suggestion, I created the following schema:
For Student:
CREATE COLUMN FAMILY student with comparator = UTF8Type and key_validation_class=UTF8Type and default_validation_class=UTF8Type;
and then we will add some data:
set student['student.1']['firstName'] = 'Danny';
set student['student.1']['lastName'] = 'Lesnik';
set student['student.1']['course.1'] = '';
set student['student.1']['course.2'] = '';
Create a column family for Course:
CREATE COLUMN FAMILY course with comparator = UTF8Type and key_validation_class=UTF8Type and default_validation_class=UTF8Type;
add some data:
set course['course.1']['name'] = 'History';
set course['course.1']['description'] = 'History Course';
set course['course.2']['name'] = 'Algebra';
set course['course.2']['description'] = 'Algebra Course';
And finally, the StudentInCourse column family:
CREATE COLUMN FAMILY StudentInCourse with comparator = UTF8Type and key_validation_class=UTF8Type and default_validation_class=UTF8Type;
add data:
set StudentInCourse['studentIncourse.1']['student.1'] ='';
set StudentInCourse['studentIncourse.2']['student.1'] ='';
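For reference, here is a minimal Hector sketch of the two lookups against this schema. It is only a sketch: the keyspace handle, the 500-column page size, and the assumption that each StudentInCourse row is keyed by a course (so that its columns are the enrolled students) are assumptions, not part of the schema above.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

public class EnrollmentQueries {
    private static final StringSerializer SE = StringSerializer.get();

    // All courses a student is enrolled in: slice the course.* columns of the student row.
    public static ColumnSlice<String, String> coursesForStudent(Keyspace ks, String studentKey) {
        SliceQuery<String, String, String> q = HFactory.createSliceQuery(ks, SE, SE, SE);
        q.setColumnFamily("student");
        q.setKey(studentKey);
        // column names are UTF8-ordered, so this range returns only the course.* columns
        q.setRange("course.", "course.~", false, 500);
        return q.execute().get();
    }

    // All students enrolled in a course: read the columns of the row keyed by that course.
    public static ColumnSlice<String, String> studentsForCourse(Keyspace ks, String courseKey) {
        SliceQuery<String, String, String> q = HFactory.createSliceQuery(ks, SE, SE, SE);
        q.setColumnFamily("StudentInCourse");
        q.setKey(courseKey);
        q.setRange(null, null, false, 500); // first 500 student columns
        return q.execute().get();
    }
}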

I defined a data model below, but it is easier to describe the object model first and then dive into the row model, so from PlayOrm's perspective you would have:
public class Student {
    @NoSqlId
    private String id;
    private String firstName;
    private String lastName;
    @ManyToMany
    private List<Course> courses = new ArrayList<Course>(); // constructing avoids NullPointerExceptions
}
public class Course {
    @NoSqlId
    private String id;
    private String name;
    private String description;
    @ManyToOne
    private Lecturer lecturer;
    @ManyToMany
    private CursorToMany students = new CursorToManyImpl();
}
I could have used a List in Course, but I was concerned I might get an OutOfMemoryError if too many students take a course over the years. NOW, let's jump to what PlayOrm does, and you can do something similar if you like.
A single student row would look like so:
rowKey (the id in the above entity) = firstName='dean', lastName='hiller', courses.rowkey56=null, courses.rowkey78=null, courses.rowkey98=null, courses.rowkey101=null
This is the wide row where we have many columns whose names are 'fieldname' plus 'rowkey of the actual course'.
The Course row is a bit more interesting, because the user thinks loading all the Students for a single course could cause an OutOfMemoryError, so he uses a cursor which only loads 500 at a time as you loop over it.
There are two rows backing the Course in this case that PlayOrm will have. So, let's take our user row above: he was in course rowkey56, so let's describe that course.
rowkey56 = name='coursename', description='somedesc', lecturer='rowkey89ToLecturer'
Then, there is another row in an index table for the students (it is a very wide row, so it supports up to millions of students):
indexrowForrowkey56InCourse = student34.56, student39.56, student.23.56....
into the millions of students
If you want a course to have more than a few million students, though, then you need to think about partitioning, whether you use PlayOrm or not. PlayOrm does partitioning for you if you need it, though.
NOTE: If you don't know Hibernate or JPA: when you load the above Student, it loads a proxy list, so if you start looping over the courses, it then goes back to the NoSQL store and loads the Courses so you don't have to ;).
In the case of Course, it loads a proxy Lecturer that is not filled in until you access a property field like lecturer.getName(). If you call lecturer.getId(), it doesn't need to load the lecturer since it already has that from the Course row.
EDIT (more detail): PlayOrm has 3 index tables: Decimal (stores double, float, etc. and BigDecimal), Integer (long, short, etc., BigInteger and boolean), and String. When you use CursorToMany, it uses one of those tables depending on the type of the FK. It also uses those tables for its Scalable-SQL language. The reason it uses a separate row for CursorToMany is just so clients don't get an OutOfMemoryError when reading a row in, as the toMany could have a million FKs in it in some cases. CursorToMany then reads in batches from that index row.
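To illustrate the batching idea (this is not PlayOrm's API, just a hedged sketch of paging a wide index row 500 columns at a time with Hector; the column family and row key names here are made up):

import java.util.List;
import java.util.function.Consumer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

public class WideRowPager {
    private static final StringSerializer SE = StringSerializer.get();
    private static final int BATCH = 500;

    // Streams the column names (student FKs) of one wide index row in batches,
    // so the whole row never has to be held in memory at once.
    public static void forEachStudentFk(Keyspace ks, String indexRowKey, Consumer<String> action) {
        String start = null;          // null means "start at the beginning of the row"
        boolean skipFirst = false;    // the start column comes back again on later pages
        while (true) {
            SliceQuery<String, String, String> q = HFactory.createSliceQuery(ks, SE, SE, SE);
            q.setColumnFamily("CourseStudentIndex"); // hypothetical index column family
            q.setKey(indexRowKey);
            q.setRange(start, null, false, BATCH);
            List<HColumn<String, String>> cols = q.execute().get().getColumns();
            for (int i = skipFirst ? 1 : 0; i < cols.size(); i++) {
                action.accept(cols.get(i).getName());
            }
            if (cols.size() < BATCH) {
                return; // fewer than a full page came back, so the row is exhausted
            }
            start = cols.get(cols.size() - 1).getName(); // resume at the last column seen
            skipFirst = true;
        }
    }
}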
later,
Dean

Related

Join database query vs. handling joins in the API

I am developing an API and I am confused as to what is the efficient way to handle a join query.
I want to join data from 2 tables and return the response. Either I can query the database with a join query, fetch the result and return the response, OR I can fire two separate queries and handle the join in the API on the fly and return the response. Which is the efficient and correct way?
Databases are generally much faster at joining than joining class instances in application code. Always do the joins in the database and map the results in your code. Also look at lazy loading where possible, because in a situation like the one below:
@Entity
@Table(name = "USER")
public class UserLazy implements Serializable {

    @Id
    @GeneratedValue
    @Column(name = "USER_ID")
    private Long userId;

    @OneToMany(fetch = FetchType.LAZY, mappedBy = "user")
    private Set<OrderDetail> orderDetail = new HashSet<>();

    // standard setters and getters
    // also override equals and hashcode
}
you might not want order details when you want the initial results.
Usually it's more efficient to do the join in the database, but there are some corner cases, mostly due to the fact that application CPU time is cheaper than database CPU time. Here are a few examples that come to mind, with a query like "table A join table B":
B is a small table that rarely changes.
In this case it can be profitable to cache the contents of this table in the application, and not query it at all.
Rows in A are quite large, and many rows of B are selected for each row of A.
This will cause useless network traffic and load as rows from A are duplicated many times in each result row.
Rows in B are quite large, and there are few distinct b_id's in A.
Same as above, except this time the same few rows from B are duplicated in the result set.
In the previous two examples, it could be useful to perform the query on table A, then gather the set of unique b_id's from the result, and SELECT * FROM b WHERE b_id IN (list).
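As a plain JDBC sketch of that two-step pattern (table and column names are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class TwoStepJoin {
    // Query A, collect the distinct b_id values, then fetch only the needed B rows once.
    public static void load(Connection con) throws SQLException {
        Set<Long> bIds = new LinkedHashSet<>();
        try (PreparedStatement ps = con.prepareStatement("SELECT id, payload, b_id FROM a");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                bIds.add(rs.getLong("b_id"));
                // ... keep the A row somewhere ...
            }
        }
        if (bIds.isEmpty()) {
            return;
        }
        // Build "?, ?, ?" with one placeholder per distinct id.
        String placeholders = String.join(", ", Collections.nCopies(bIds.size(), "?"));
        String sql = "SELECT b_id, big_column FROM b WHERE b_id IN (" + placeholders + ")";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            int i = 1;
            for (Long id : bIds) {
                ps.setLong(i++, id);
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // ... map each B row once and attach it to the matching A rows ...
                }
            }
        }
    }
}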
Data structure and ORMs
If each table contains a different object type, and they have a "belongs to" relationship (like category and product) and you use an ORM which will instantiate objects for each selected row, then perhaps you only want one instance of each category, and not one per selected product. In this case, you could select the products, gather a list of unique category_ids, and select the categories from there. The ORM may even do that for you behind the scene.
Complicated aggregates
Sometimes, you want some stuff, and some aggregates of other stuff related to the first stuff, but it just won't fit in a neat GROUP BY, or you may need several of them.
So basically, the join usually works better in the database, so that should be the default. If you do it in the application, you should know why you're doing it and decide that it's a good reason. If it is, then fine. I gave a few reasons based on performance, the data model, and SQL constraints; these are only examples, of course.

JPA: how to ensure uniqueness over 2 fields, String and Boolean

I want to create an entity containing 2 fields that need to be unique together. One of the fields is a Boolean:
@Entity
public class SoldToCountry {
    private String countryId;
    private Boolean isInt;
}
For a given String there should never be more than 2 entries: one with isInt=true and the other with isInt=false.
I read the doc about @Id, but it seems that Boolean is not supported. For me it would also be OK to have a unique constraint spanning both fields and a generated id.
What is the best way to get this constraint via JPA?
If your table really has only two fields and you want them to be unique together, then they should be the composite PK of the table. Take a look at How to create and handle composite primary key in JPA.
If, instead, you have another PK, consider Sebastian's comment.
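If you go with the generated-id route instead, a hedged sketch (column names are assumed to default to the field names) could look like this:

import java.io.Serializable;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Table;
import javax.persistence.UniqueConstraint;

@Entity
@Table(name = "SOLD_TO_COUNTRY",
       uniqueConstraints = @UniqueConstraint(columnNames = { "countryId", "isInt" }))
public class SoldToCountry implements Serializable {

    @Id
    @GeneratedValue
    private Long id; // surrogate key

    private String countryId;
    private Boolean isInt;

    // getters and setters omitted
}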

Database first - EF-created objects ignore the names of foreign keys

I have created an MSSQL database and then used these instructions to create the data model. I have noticed something weird, and it is something which doesn't allow me to work with the model without confusion.
One of my objects is Book. Book contains an id, name, isbn, .. and also references to a Person table, such as the author, the editor and the publisher.
The created model contains:
string name;
int id;
string isbn;
int authorid;
int editorid;
int publisherid;
person person1;
person person2;
person person3;
Is there a way to change the names of the persons at creation? Otherwise it is hard to figure out which person is who, and I will have to change a lot manually.

How to optionally persist a secondary table in EclipseLink

I am working with EclipseLink and having an issue with using a secondary table.
I have two tables, as below:
Student, with columns student_id (primary key), student_name, etc.
Registration, with columns student_id (FK to the Student table), course_name (with a NOT NULL constraint), etc.
The requirement is that a student may or may not have a registration. If the student has a registration, the data should be persisted to the Registration table as well. Otherwise only the Student table should be persisted.
My code snippet is as below.
Student.java
------------
@Entity
@Table(name = "STUDENT")
@SecondaryTable(name = "REGISTRATION")
public class Student {

    @Id
    @Column(name = "STUDENT_ID")
    private long studentId;

    @Basic(optional = true)
    @Column(name = "COURSE_NAME", table = "REGISTRATION")
    private String courseName;
}
I tried the following scenarios:
1. Student with registration - working fine; data is added to both the Student and Registration tables.
2. Student without registration - getting an error such as 'COURSE_NAME' cannot be null.
Is there a way to prevent persisting into the secondary table?
Any help is much appreciated.
Thanks!!!
As @Eelke states, the best solution is to define two classes and a OneToOne relationship.
Potentially you could also use inheritance, having a Student and a RegisteredStudent that adds the additional table, but the relationship is a much better design.
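A hedged sketch of that two-entity design (names, the derived-id mapping, and the cascade setting are assumptions):

import javax.persistence.CascadeType;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.MapsId;
import javax.persistence.OneToOne;
import javax.persistence.Table;

@Entity
@Table(name = "STUDENT")
public class Student {

    @Id
    @Column(name = "STUDENT_ID")
    private long studentId;

    // A student without a registration simply leaves this null.
    @OneToOne(mappedBy = "student", cascade = CascadeType.ALL)
    private Registration registration;
}

// Registration.java (separate file)
@Entity
@Table(name = "REGISTRATION")
public class Registration {

    @Id
    private long studentId;

    @MapsId // the PK of REGISTRATION is derived from the FK to STUDENT
    @OneToOne
    @JoinColumn(name = "STUDENT_ID")
    private Student student;

    @Column(name = "COURSE_NAME", nullable = false)
    private String courseName;
}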
It's possible using a DescriptorEventListener. The aboutToInsert and aboutToUpdate callbacks have access to the DatabaseCalls and may even remove the statements hitting the secondary table.
Register the DescriptorEventListener with the ClassDescriptor of the entity. For the registration, use a DescriptorCustomizer specified in a @Customizer annotation on the entity.
However, you will not succeed in fetching the entities back again later on. EclipseLink uses inner joins when selecting from the secondary table, so the row of the primary table will be gone from the results.
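A skeleton of that wiring might look as follows; only the registration part is shown, since what you actually do inside aboutToInsert depends on EclipseLink internals:

import org.eclipse.persistence.config.DescriptorCustomizer;
import org.eclipse.persistence.descriptors.ClassDescriptor;
import org.eclipse.persistence.descriptors.DescriptorEvent;
import org.eclipse.persistence.descriptors.DescriptorEventAdapter;

// Attached to the entity with @Customizer(RegistrationCustomizer.class).
public class RegistrationCustomizer implements DescriptorCustomizer {

    public void customize(ClassDescriptor descriptor) {
        descriptor.getEventManager().addListener(new DescriptorEventAdapter() {
            @Override
            public void aboutToInsert(DescriptorEvent event) {
                // Inspect/modify the database calls here, e.g. drop the REGISTRATION
                // insert when courseName is null (EclipseLink-internal API).
            }
        });
    }
}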

JPA 2.0 retrieve entity by business key

I know there have been a number of similar posts about this, but I couldn't find a clear answer to my problem.
To make it as simple as possible, say I have such an entity:
@Entity
public class Person implements Serializable {

    @Id
    private Long id; // PK

    private String name; // business key

    /* getters and setters */

    /* override equals() and hashCode() to use the name field */
}
So, id is the PK and name is the business key.
Say that I get a list of names, with possible duplicates, which I want to store.
If I simply create one object per name and let JPA persist it, my final table will contain duplicate names - not acceptable.
My question is what you think is the best approach, considering the alternatives I describe below and (especially welcome) your own.
Possible solution 1: check the entity manager
Before creating a new Person object, check whether one with the same name is already managed.
Problem: the entity manager can only be queried by PK. Is there any workaround I don't know about?
Possible solution 2: find objects by query
Query query = em.createQuery("SELECT p FROM Person p WHERE p.name = ...");
List<Person> list = query.getResultList();
Questions: If the requested objects are already loaded in the EM, will this still fetch from the database? If so, I suppose it would still not be very efficient if done very frequently, due to parsing the query?
Possible solution 3: keep a separate dictionary
This is possible because equals() and hashCode() are overridden to use the field name.
Map<String, Person> personDict = new HashMap<String, Person>();
for (String n : incomingNames) {
    Person p = personDict.get(n);
    if (p == null) {
        p = new Person();
        p.setName(n);
        em.persist(p);
        personDict.put(n, p);
    }
    // do something with it
}
Problem 1: wasting memory for large collections, as this is essentially what the entity manager does (not quite, though!).
Problem 2: suppose that I have a more complex schema, and that after the initial write my application gets closed, started again, and needs to re-load the database. If all tables are loaded explicitly into the EM, then I can easily re-populate the dictionaries (one per entity), but if I use lazy fetching and/or cascading reads, then it's not so easy.
I started recently with JPA (I use EclipseLink), so perhaps I am missing something fundamental here, because this issue seems to boil down to a very common usage pattern.
Please enlighten me!
The best solution I can think of is pretty simple: use a unique constraint.
@Entity
@Table(uniqueConstraints = @UniqueConstraint(columnNames = "name"))
public class Person implements Serializable {

    @Id
    private Long id; // PK

    private String name; // business key
}
The only way to ensure that the field can be used (correctly) as a key is to create a unique constraint on it. You can do this with @UniqueConstraint(columnNames = "name") inside @Table, or with @Column(unique = true).
Upon trying to insert a duplicate key, the EntityManager (actually, the DB) will throw an exception. The same is true for a manually set primary key.
The only way to prevent the exception is to do a select on the key first and check whether it exists.
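For example, a hedged find-or-create sketch along those lines (entity and accessor names taken from the question; under concurrent inserts the unique constraint is still what ultimately guarantees correctness, so be prepared to retry on a constraint violation):

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

public class PersonDao {

    // Returns the managed Person with the given name, creating it if it does not exist yet.
    public static Person findOrCreate(EntityManager em, String name) {
        TypedQuery<Person> query = em.createQuery(
                "SELECT p FROM Person p WHERE p.name = :name", Person.class);
        query.setParameter("name", name);
        List<Person> existing = query.getResultList();
        if (!existing.isEmpty()) {
            return existing.get(0);
        }
        Person person = new Person();
        person.setName(name);
        em.persist(person);
        return person;
    }
}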