I'm trying to build a database-backed mail server, and I chose Cassandra for it. The main problem: the more mail I have in my table, the longer queries take (which is normal, but it scales too badly). Currently I have only about 20,000 mails and Cassandra already sends me a timeout (set to 5 seconds by default, apparently). The objective is for every user to be able to find their mail in a table containing more than 500k mails, with the ability to filter them.
Here is my table structure:
CREATE TABLE mail__mail (
    accountid uuid,
    date timestamp,
    id uuid,
    attachment set<uuid>,
    categories set<uuid>,
    content text,
    dateadded timestamp,
    folderid uuid,
    hash text,
    isconfidential boolean,
    isdeleted boolean,
    isimportant boolean,
    isseen boolean,
    mailcc text,
    mailfrom text,
    mailid text,
    mailto text,
    size bigint,
    subject text,
    PRIMARY KEY (accountid, date, id)
) WITH CLUSTERING ORDER BY (date DESC, id ASC);

CREATE CUSTOM INDEX mailFromIndex ON mail__mail (mailfrom)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'CONTAINS',
    'analyzed': 'true',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};

CREATE CUSTOM INDEX subjectIndex ON mail__mail (subject)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'CONTAINS',
    'analyzed': 'true',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};
I'm pretty sure my structure is bad, due to my poor Cassandra skills.
Here are the operations I want to achieve with this table:
UPDATE: isImportant, isConfidential, isDeleted, isSeen, mailId, folderId, categories
DELETE: by id, by folderId, by accountId
SELECT: by id, by folderId, by accountId
And I would like to select with these filters:
ORDER BY: date, size, mailFrom (ASC and DESC)
CONTAINS: categories (I can assign categories to a mail, and I want to filter all mails belonging to one or more categories)
LIKE '%search%': mailFrom, subject, to filter the mails containing my search
Equals: isConfidential, isImportant, isDeleted, isSeen, to get all confidential, important, deleted, or seen mails.
My table works with few rows in it (approximately 1000 ms with 7k emails), but I think it could be faster with the right structure and the right queries (without ALLOW FILTERING).
Moreover, I apparently can't use CONTAINS and LIKE '%text%' in the same query; it gives me a 1300 error code. So I do this step in Python, but in my opinion that's a performance disaster; it would be great if I could do everything in Cassandra.
To query Cassandra I use the Python 3.5 Cassandra driver, but I don't think this information is relevant.
Tell me if you need more information. Thanks in advance!
EDIT: As a solution I followed what you told me and deployed a new server with Elassandra (Elasticsearch + Cassandra). I will post the results I get as soon as possible.
I agree with @Lohfink's suggestion to take a different view of modeling the C* database, starting from the queries themselves. But given your requirements, C* may not be the perfect fit. You can redesign the schema in the following way:
no queries with ALLOW FILTERING, as it does a full table scan.
the same mail__mail schema, but with id and date merged into a single timeuuid to simplify things (see the sketch after this list).
instead of creating a ton of secondary indexes (they have problems with high-cardinality data) and materialized views (they copy the data), use either an external Elasticsearch cluster or Elasticsearch as a plugin for C* (Elassandra) to perform the actual searches.
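A minimal sketch of that merged-key variant, assuming only the key changes; the column list is trimmed and the remaining columns stay as in the original table:
-- Sketch: date + id collapsed into one timeuuid clustering column.
CREATE TABLE mail__mail (
    accountid uuid,
    mailuuid timeuuid,   -- encodes the timestamp, so no separate date column
    folderid uuid,
    mailfrom text,
    subject text,
    content text,
    isseen boolean,
    PRIMARY KEY (accountid, mailuuid)
) WITH CLUSTERING ORDER BY (mailuuid DESC);

-- Newest mail first for one account, no ALLOW FILTERING needed
-- (the UUID is just an example value):
SELECT mailuuid, mailfrom, subject
FROM mail__mail
WHERE accountid = 123e4567-e89b-12d3-a456-426614174000
LIMIT 50;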
I agree with @shutty: you can use Cassandra as your datastore along with ES for searching.
Related
Suppose I need to design a table for Spotify where I need to quickly retrieve the items (songs or albums) a user has already purchased, so they can be played for that user. The scenario is straightforward: when a user clicks to buy a song, the database needs to quickly record that this song was purchased by that user's account.
Since it requires near real-time responses and the table could grow exponentially, while on the other hand the access pattern is quite simple and fixed, a non-relational database is a good fit for this use case. That's why I am thinking about using HBase, Cassandra, or MongoDB.
I would like to use UserId as the primary key for this purchase table; would wide-column stores (HBase or Cassandra) or document databases like MongoDB work better for this scenario?
The input is just a user_id, and the database responds with all of that user's purchased items. What is the best table design strategy?
{
    user_id: int,
    purchased_items: [item1, item2, item3]
}
The second table will be used for searching for specific artists, albums, genres, and songs that are available for purchase.
I'd appreciate it if you could share examples of best practice from real-world applications, or any good articles/documents/blogs I can read.
If you need near real-time responses I would definitely consider Cassandra, especially for detailed purchase-history storage!
What I would do with Cassandra is the following:
CREATE TABLE purchases (
    user_id uuid,
    purchase_id uuid,
    item_id uuid,
    item_details text,
    item_name text,
    time_of_purchase timestamp,
    PRIMARY KEY ((user_id), purchase_id, item_id)
);
This lets you cluster the data in several ways: first by user_id, then by purchase_id, keeping all items recorded per purchase.
With a primary key made of the partition key (user_id) and the clustering keys (purchase_id, item_id), the items are grouped by purchase_id within each user_id partition.
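The original "what did this user buy" lookup then becomes a single-partition read; a sketch against the table above, with an illustrative UUID:
-- All purchased items for one user: one partition, already ordered
-- by purchase_id and then item_id.
SELECT purchase_id, item_id, item_name, time_of_purchase
FROM purchases
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;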
https://cassandra.apache.org/doc/latest/data_modeling/intro.html
https://docs.datastax.com/en/landing_page/doc/landing_page/current.html
I have a table with several fields. This table almost never changes, except for one field, "version", which changes very often.
Would it be relevant to put that single field into a separate table in order to reduce how often locks are put on the main table?
For instance I have a table tType and a table tEntry.
Whenever I add/delete/update any row of tEntry, I need to update the "version" field of tType. There might be thousands of rows in tEntry for a single referenced tType row, meaning the version number can change very often, even though the other data in tType (such as name, id, etc.) doesn't change.
Your reference to tType and tEntry sounds like you are implementing a key-value store in an RDBMS. There are several discussions you can google about this topic; the consensus on the web seems to be that the cons outweigh the pros. An option would be to look at key-value stores, NoSQL, multi-column DBs, etc. (see Wikipedia).
The next "anti-pattern" I recognize is that you are mixing transactional data with master data in the table tType. Try to avoid this, even if your selects become more awkward and need more tuning. Keep the version info out of tType if it changes extremely often. Look here to get the concept: MySQL JOIN the most recent row only?
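A minimal sketch of keeping that hot counter out of tType (table and column names are illustrative; it assumes tType has an integer id primary key):
-- Illustrative: the frequently-updated counter lives in its own table,
-- one row per tType row, so bumping it never touches tType itself.
CREATE TABLE tTypeVersion (
    type_id INT PRIMARY KEY,
    version BIGINT NOT NULL DEFAULT 0,
    FOREIGN KEY (type_id) REFERENCES tType(id)
);

-- Bump the counter whenever a tEntry row changes:
UPDATE tTypeVersion SET version = version + 1 WHERE type_id = 42;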
I am setting up a location-aware application, as mentioned here. I have since learned a lot more about GIS apps and have decided to change a few things about the setup I had originally proposed: I'm now going to use a PostgreSQL database with the PostGIS extension to allow for geometry fields, and use TIGER/Line data to fill it. The TIGER/Line data offers different data sets at different resolutions (layers): there is data for states, counties, ZIP codes, blocks, etc. I need a way to associate a post with an address using the finest-grained resolution possible.
For instance, if possible, I would like to associate a post with a particular street (finest resolution). If not a street, then a particular zip code (less specific). If not a zip code, then a particular county (less specific), and so on. Sidenote: I want to eventually show these all on a map.
This is what I propose:
Locations
id -- int
street_name -- varchar -- NULL
postal_code_id -- int -- NULL
county_id -- int -- NULL
state_id -- int
Postal Codes
id -- int
code -- varchar
geom -- geometry
Counties
id -- int
name -- varchar
geom -- geometry
The states table is similar, and so on...
As you can see, the locations table decides the level of specificity by whichever fields are set. The postal codes, counties, and states tables are not tied together by foreign keys (it's too complex to determine a proper hierarchy that is valid everywhere); however, I believe their relationships can be determined from the geometry fields (e.g., query which state a certain ZIP code is contained in, or which ZIP codes belong to a certain state), along the lines of the sketch below.
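A sketch of that kind of geometry-based lookup with PostGIS, assuming the tables are named postal_codes and states, the states table has a name column, and all geometries share the same SRID:
-- Which state contains a given postal code?
SELECT s.name
FROM states s
JOIN postal_codes p ON ST_Contains(s.geom, p.geom)
WHERE p.code = '90210';

-- All postal codes that fall (at least partly) inside a given state:
SELECT p.code
FROM postal_codes p
JOIN states s ON ST_Intersects(s.geom, p.geom)
WHERE s.name = 'California';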
I think this is a good setup because if the database grows (let's say I decide to include data for districts or blocks), then I can add another table for that data and another foreign key to the locations table (e.g., block_id).
Does anybody know of a better way to do this?
Is it possible that a street belongs to two different counties, or to two postal codes? In my country this is possible, especially in cities. If so, your schema won't work.
That said, I would store the geometry of the streets (from OpenStreetMap) without linking it to a postal code, county, or even state; then, with a simple query that intersects the street geometry with the other tables, you could derive that information and fill another table that holds those relationships.
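For example, such a relationship table could be filled purely from the geometry, along these lines (the streets table and column names are illustrative):
-- Illustrative: derive street/postal-code links by spatial intersection.
CREATE TABLE street_postal_codes AS
SELECT st.id AS street_id, p.id AS postal_code_id
FROM streets st
JOIN postal_codes p ON ST_Intersects(st.geom, p.geom);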
It seems to me that the functionality of the PostgreSQL array datatype overlaps a lot with the standard one-to-many and many-to-many relationships.
For example, a table called users could have an array field called "favorite_colors", or there could be a separate table called "favorite_colors" and a join table between "users" and "favorite_colors".
In what cases is the array datatype OK to use instead of a full-blown join?
An array should not be used like a relation. It should rather contain indexed values that relate very tightly to one row. For example, if you had a table with the results of football matches, you would not do
id team1 team2 goals1 goals2
but would do
id team[2] goals[2]
because in this example, most people would also consider normalizing this into two tables silly.
So, all in all, I would use arrays in cases where you are not interested in making relations and where you would otherwise add fields like field1, field2, field3.
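A sketch of the match example as an actual PostgreSQL table (the names are illustrative):
-- Illustrative: arrays instead of team1/team2/goals1/goals2 columns.
CREATE TABLE matches (
    id    serial PRIMARY KEY,
    teams text[],
    goals integer[]
);

INSERT INTO matches (teams, goals) VALUES ('{Arsenal,Chelsea}', '{2,1}');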
One incredibly handy use case is tagging:
CREATE TABLE posts (
    title TEXT,
    tags TEXT[]
);

-- Select all posts with tag 'kitty'
SELECT * FROM posts WHERE tags @> '{kitty}';
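If the table grows, a GIN index keeps that containment search fast; this is standard PostgreSQL, not specific to the example:
-- Index the array column for fast @> lookups
CREATE INDEX posts_tags_idx ON posts USING GIN (tags);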
I totally agree with @marc. Arrays are to be used when you are absolutely sure you don't need to create any relationship between the items in the array and any other table. They should be used for a tightly coupled one-to-many relationship.
A typical example is creating a multichoice questions system. Since other questions don't need to be aware of the options of a question, the options can be stored in an array.
e.g.:
CREATE TABLE Question (
    id integer PRIMARY KEY,
    question TEXT,
    options VARCHAR(255)[],
    answer VARCHAR(255)
);
This is much better than creating a question_options table and getting the options with a join.
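If you ever do need the options as individual rows (for reporting, say), unnest expands the array on the fly; a sketch against the table above:
-- Expand one question's options into rows, no join table required
SELECT id, unnest(options) AS option
FROM Question
WHERE id = 1;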
The PostgreSQL documentation gives good examples:
CREATE TABLE sal_emp (
    name text,
    pay_by_quarter integer[],
    schedule text[][]
);
The above command will create a table named sal_emp with a column of type text (name), a one-dimensional array of type integer (pay_by_quarter), which represents the employee's salary by quarter, and a two-dimensional array of text (schedule), which represents the employee's weekly schedule.
Or, if you prefer:
CREATE TABLE tictactoe (
    squares integer[3][3]
);
If I want to store a set of similar data, and that data has no other attributes, I prefer to use arrays.
One example is storing contact numbers for a user: usually a main number and an alternate one. In such a case I prefer to use an array.
CREATE TABLE students (
    name text,
    contacts varchar ARRAY -- or varchar[]
);
But if the data has additional attributes, say when storing cards (a card has an expiry date and other details), don't use arrays.
Storing tags as an array is also a bad idea, since a tag can be associated with multiple posts.
Can you share your thoughts on how you would implement data versioning in PostgreSQL? (I've asked a similar question regarding Cassandra and MongoDB; if you have any thoughts on which DB is better for this, please share.)
Suppose that I need to version records in a simple address book. Address book records are stored in one table without relations for simplicity. I expect that the history:
will be used infrequently
will be read all at once, to present it in a "time machine" fashion
won't contain more than a few hundred versions for a single record
won't expire.
I'm considering the following approaches:
Create a new table to store the history of records, with a copy of the address book table's schema plus a timestamp and a foreign key to the address book table.
Create a kind of schema-less table to store changes to address book records. Such a table would consist of: AddressBookId, TimeStamp, FieldName, Value. This way I would store only the changes to the records and wouldn't have to keep the history table and the address book table in sync.
Create a table to store serialized (JSON) address book records, or changes to address book records. Such a table would look as follows: AddressBookId, TimeStamp, Object (varchar). Again this is schema-less, so I wouldn't have to keep the history table and the address book table in sync.
(This is modelled after Simple Document Versioning with CouchDB.)
I do something like your second approach: have the table with the actual working set, and a history table with the changes (timestamp, record_id, property_id, property_value). This includes the creation of records. A third table describes the properties (id, property_name, property_type), which helps with data conversion higher up in the application. So you can also very easily track changes to single properties.
Instead of a timestamp you could also have an int-like column, which you increment for every change per record_id, so you have an actual version number.
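A minimal sketch of that layout in PostgreSQL (the table and column names are illustrative):
-- Working set, property catalogue, and per-property change history.
CREATE TABLE address_book (
    id    serial PRIMARY KEY,
    name  text,
    email text
);

CREATE TABLE properties (
    id            serial PRIMARY KEY,
    property_name text NOT NULL,
    property_type text NOT NULL
);

CREATE TABLE history (
    record_id      integer     NOT NULL REFERENCES address_book(id),
    property_id    integer     NOT NULL REFERENCES properties(id),
    property_value text,
    changed_at     timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (record_id, property_id, changed_at)
);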
You could also have start_date and end_date columns.
When end_date is NULL, it's the current record.
I'm versioning glossary data, and my approach was pretty successful for my needs. Basically, for the records you need to version, you split the field set into persistent fields and version-dependent fields, thus creating two tables. Some of the first set should also form the unique key for the first table.
Address
id [pk]
fullname [uk]
birthday [uk]
Version
id [pk]
address_id [uk]
timestamp [uk]
address
In this fashion, you get address subjects determined by fullname and birthday (which should not change with versioning) and versioned records containing the addresses. address_id should be related to Address:id through a foreign key. With each entry in the Version table you get a new version for the subject Address:id = address_id with a specific timestamp, and in this way you have a history reference.
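A sketch of the two tables in PostgreSQL, plus a "latest version per subject" query (the column types and the ts column name are illustrative):
-- Persistent subject fields and version-dependent address payload.
CREATE TABLE address (
    id       serial PRIMARY KEY,
    fullname text NOT NULL,
    birthday date NOT NULL,
    UNIQUE (fullname, birthday)
);

CREATE TABLE version (
    id         serial PRIMARY KEY,
    address_id integer     NOT NULL REFERENCES address(id),
    ts         timestamptz NOT NULL DEFAULT now(),
    address    text        NOT NULL,
    UNIQUE (address_id, ts)
);

-- Latest version of each subject:
SELECT DISTINCT ON (address_id) address_id, ts, address
FROM version
ORDER BY address_id, ts DESC;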