Is Azure table or NoSQL in general not so good when updating data - nosql

I have only looked in Azure table, but it may well apply for other NoSQL databases as well.
If I have an entity consisting of these following properties
First name - Last name - Hometown - Country
In Azure table there is no concept of relations therefore if I have thousands of data, and I want to change all entities that has 'Canada' in it, to some other country. Then in this scenario there is a possibility it has to go through thousands of data to find entities with 'Canada' and change it to something else.
I wonder, is the benefit of NoSQL only if you have data that is static and not changed after you have written it? Or could this problem be solved for NoSQLs?

In the case of NoSQL data stores the advantages are different from SQL databases. Things like scalability or availability can be better in a NoSQL database like Azure table, but there are tradeoffs. For example you are generally unable to efficiently query any part of a record, only the keys.
In designing your schema for Azure Table you have to consider the use cases of your data layer and let that dictate the schema. In this example, if I thought I would have to update all records in a given country, I would make that part of the partition or row key. That way your query to get all data in a given country is fast and can be updated quickly.

Related

Best approach to implement inheritance in a data warehouse based on a postgres database

I am developing a multi-step data pipeline that should optimize the following process:
1) Extract data from a NoSQL database (MongoDB).
2) Transform and load the data into a relational (PostgreSQL) database.
3) Build a data warehouse using the Postgres database
I have manually coded a script to handle steps 1) and 2), which is an intermediate ETL pipeline. Now my goal is to build the data warehouse using the Postgres database, but I came across with a few doubts regarding the DW design. Below is the dimensional model for the relational database:
There are 2 main tables, Occurrence and Canonical, from which inherit a set of others (drawn in red and blue, respectively). Note that there are 2 child data types, ObserverNodeOccurrence and CanonicalObserverNode, that have an extra many-to-many relationship with another table.
I made some research regarding how inheritance should be implemented in a data warehouse and figured the best practice would be to merge together the family data types (super and child tables) into a single table. Doing this would imply adding extra attributes and a lot of null values. My new dimensional model would look like the following:
Question 1: Do you think this is the best approach to address this problem? If not, what would be?
Question 2: Any software recommendations for on-premise data warehouses? (on-premise is a must since it contains sensitive data)
Usually having fewer tables to join and denormalizing data will improve query performance for data warehouse queries, so they are often considered a good thing.
This would suggest your second table design. NULL values don't occupy any space in a PostgreSQL table, so you need not worry about that.
As described here there are three options to implement inheritance in a relational database.
IMO the only practicable way to be used in data warehouse is the Table-Per-Hierarchy option, which merges all entities in one table.
The reason is not only the performance gain by saving the joins. In data warehouse often the historical view of the data is important. Think, how would you model a change in a subtype in some entity?
An important thing is to define a discriminator column which uniquely defines the source entity.

Asp Net Boilerplate - Setup Schema-Per-Tenant Multitenancy (EntityFrameworkCore & PostgreSQL)

We are looking into using Asp Net Boilerplate. Looks very promising. We love the framework, but we would like to be able to use a per-schema Multitenancy configuration. Instead of sharing the data in the same db & tables, each tenant would "have" a schema, in which the whole database structure would be replicated.
One of our data tables will be quite big (sometimes +1 million entries / tenant), and we were advised that for performance reasons, it's better to keep the number of entries as low as possible. Also, this particular table will be queried & inserted a lot. It would be unrealistic that this table would hold data for 40+ tenants. For that reason, and others, we would prefer to have a distinct schema per tenant.
Our DB is a single PostgreSQL server (might scale up to more in the future). We use EntityFramework & Npgsql. We already noticed that it is possible to set up a different ConnectionString for specific tenants that would have bigger data requirements.
http://www.summa.com/blog/2013/09/17/approaches-to-multi-tenancy See separate schema per tenant
Any idea on how to acheive a schema-per-tenant multitenancy? There's a lot of moving parts in this, I'm not sure where to start.

Implementing multi tenant data structure using multiple schemas or by customerId table column

I am developing multi tenant store web application (software as a service) which will be used by many customers. I would like to use just one database. I would appreciate suggestions/feedback on how to go about this in the database:
Separate schemas for each customer. Whenever new customer signs up, I create separate schema.
Single schema with all the customers. And creating a CUSTOMER table with customerId that is referenced in all other tables (eg. orders, payments, etc). Whenever new customer signs up, I create an entry in CUSTOMER table.
Incase if you want to know what technologies are being used:
Postgres, Spring Boot MVC, REST, Maven, JPA.
Thanks.
There are major tradeoffs here. With customer id's your foreign keys become more complex (the customer id should probably be a part of every foreign key) and that means additional indexes. It also means you have to have some means of enforcing this restriction. The big issue is that bugs in your application can quite easily disclose material from other customers.
With multiple schemas you have an issue that you have many more tables and this can cause performance problems for pg_dump in particular. However with appropriate search paths it is a bit harder to compromise other clients' data. However this is harder to use with a connection pool.
In general I think the schema approach is better because you can always scale out by partitioning by customer set, and the better security is important. However it means you must have a good understanding of search_path and set it to a sensible value on every database transaction.

Transforming relational data bases to graph databases

As part of my final thesis, I must transform a relational database in a graph-oriented database, specifically a PostgreSQL database into a Neo4j embedded database. Now, the way is the problem. In Rik Van Bruggen's book: Learning Neo4j, he mentions a data import process using ETL activities with Trascend and MuleSoft tools, but in their official sites, there's no documentation about how to do it, neither help documentation nor examples. Apart from these tools, what other ways can I use to transform this information without using my own code?
Some modeling advice:
A well normalized relational model, which was not yet denormalized for performance reasons can be translated into the equivalent graph model.
Graph model shapes are mostly driven by use-cases, so there will be opportunity for optimization and model evolution afterwards.
A good, normalized Entity-Relationship diagram often already represents a decent graph model.
So if you still have the orignal ER diagram available, try to use it as a guide.
Here are some tips that help you with the transformation:
Each entity table is represented by a label on nodes
Each row in a table is a node
Columns on those tables become node properties.
Remove technical primary keys, keep business primary keys
Add unique constraints for business primary keys, add indexes for frequent lookup attributes
Replace foreign keys with relationships to the other table, remove them afterwards
Remove data with default values, no need to store those
Data in tables that is denormalized and duplicated might have to be pulled out into separate nodes to get a cleaner model.
Indexed column names, might indicate an array property (like email1, email2, email3)
JOIN tables are transformed into relationships, columns on those tables become relationship properties
It is important to have an understanding of the graph model before you start to import data, then it just becomes the task of hydrating that model.
LOAD CSV might be your best option, but of course it means outputting a CSV first. Here are some great resources:
http://neo4j.com/docs/stable/query-load-csv.html
http://watch.neo4j.org/video/112447027
http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
http://jexp.de/blog/2014/10/load-cvs-with-success/
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
I've also written a ruby gem which lets you write a little ruby code to import data from various sources. It's called neo4apis. You can look at the neo4apis-twitter gem to get an idea for how it works:
https://github.com/neo4jrb/neo4apis-twitter/
https://github.com/neo4jrb/neo4apis-twitter/blob/master/lib/neo4apis/twitter.rb
I've actually been wanting to implement a neo4apis-activerecord to make it easy to import from SQL with ActiveRecord
You can not directly export data from relational and import to neo4j.
Because these are two different database structures.
Relational Database -
A relational database is a set of tables containing data fitted into predefined categories. Each table (which is sometimes called a relation) contains one or more data categories in columns. Each row contains a unique instance of data for the categories defined by the columns.
Graph-oriented database -
A graph database is essentially a collection of nodes and edges. Each node represents an entity (such as a person or business) and each edge represents a connection or relationship between two nodes.
Sollution To your Problem-
First, you need to design Neo4j Data structure. e.g What will be the nodes you required, what will be the relationships between the nodes.
After that you create Script in your application language to fetch data from relational database and insert it into neo4j.
Load CSA is a option to Import/Export (backup) functionality with graph database. you can not directly Export/Import data from Relational DB to Graph DB

Amazon DynamoDB Item Size?

I'm just exploring the whole NoSQL concept. I've been playing around with Amazons DynamoDB and I really like the concept. However that said I am not quite sure how the data should be separated. By this I mean should I create a new Table for related data features like you would in a relational database or do I use a single table to store all the applications data?
As an example, in a relational DB I might have a table called users and a table called users_details. I would then for example, create a 1:1 relationship between the two tables. With the NoSQL concept I could theoretically create two tables as well but it strikes me as more efficient to have all the data in a single table.
If that is the case then when do you stop? Is the idea to store all the application data for a given user in a single table?
First ask yourself: why did I separate the users from the user details in RDBMS in the first place.
On a more general note, when talking about NoSQL you really shouldn't look at relationships between tables. You shouldn't think about "joining" information from different tables but rather prepare your data in a way that can be retrieved optimally.
In the user/user_details scenario, put all information in a users table, and query what you want by specifying the attributes to get.