How to design the schema when the embedded documents are too big - MongoDB

Given the data structure below, you can see that every record within one file has the same values for ATT1 and ATT2.
// Store in fileD001.txt
ATT1 | ATT2 | ATT3 | ATT4 ... | ATT200
D001 | 10102011 | x13 | x14 ... | x1200
D001 | 10102011 | x23 | x24 ... | x2200
...
D001 | 10102011 | xN3 | xN4 ... | xN200
// Store in fileD002.txt
ATT1 | ATT2 | ATT3 | ATT4 ... | ATT200
D002 | 10112011 | x13 | x14 ... | x1200
D002 | 10112011 | x23 | x24 ... | x2200
...
D002 | 10112011 | xN3 | xN4 ... | xN200
// Store in fileD003.txt
ATT1 | ATT2 | ATT3 | ATT4 ... | ATT200
D003 | 10132011 | x13 | x14 ... | x1200
D003 | 10132011 | x23 | x24 ... | x2200
...
D003 | 10132011 | xN3 | xN4 ... | xN200
Method One: Assume I use the following structure to store the data.
doc = { "ATT1"   : "D001",
        "ATT2"   : "10102011",
        "ATT3"   : "x13",
        "ATT4"   : "x14",
        ...
        "ATT200" : "x1200"
      }
The problem here is that the data contains a lot of duplicated information and wastes database space. The benefit, however, is that each record has its own _id.
Method Two: Assume I use the following structure to store the data.
doc = { "ATT1" : "D001",
        "ATT2" : "10102011",
        "sub_doc" : { "ATT3" : "x13",
                      "ATT4" : "x14",
                      ...
                      "ATT200" : "x1200"
                    }
      }
The problem here is that the data size N, which is around 1 to 5,000, is too large for MongoDB to handle in a single insert operation. Of course, we can use the $push update modifier to append the data gradually. However, each record no longer has its own _id this way.
I don't mean each record has to have its own ID. I am just looking for a better design solution for a task like this.
Thank you

Option 1 is decent since it gives you the easiest data to work with. Maybe worry less about the space, since storage is cheap?
Option 2 is good for conserving space, but watch out that your documents do not get too large: the maximum document size (16 MB) may limit you, and it could also constrain you if you shard in the future.
Option 3 is being a little relational about it. Have two collections. The first one is just a lookup for ATT1 and ATT2 pairs. The other collection holds a reference to the first plus the remaining attributes.
parent = { att1: "val1", att2: "val2" }
child = { parent: parent._id, att3: "val3", ... }
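For illustration, here is a minimal sketch of Option 3 in Python with pymongo (the connection string and the "mydb", "files" and "records" names are assumptions, not from the question):

# Sketch of the two-collection design, assuming pymongo and a local mongod.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

# Parent collection: one document per (ATT1, ATT2) pair.
file_id = db.files.insert_one({"ATT1": "D001", "ATT2": "10102011"}).inserted_id

# Child collection: one document per record, referencing the parent,
# so every record keeps its own _id and ATT1/ATT2 are stored only once.
db.records.insert_many([
    {"parent": file_id, "ATT3": "x13", "ATT4": "x14"},   # ... up to ATT200
    {"parent": file_id, "ATT3": "x23", "ATT4": "x24"},
])

# Fetch all records of a file by looking up the parent first.
parent = db.files.find_one({"ATT1": "D001", "ATT2": "10102011"})
for rec in db.records.find({"parent": parent["_id"]}):
    print(rec)

An index on records.parent would keep that lookup cheap as N grows.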

Related

Check if a set of field values is mapped against a single value of another field in a dataframe

Consider the dataframe below with stores and the books available:
+-----------+------+-------+
| storename | book | price |
+-----------+------+-------+
| S1 | B11 | 10$ | <<
| S2 | B11 | 11$ |
| S1 | B15 | 29$ | <<
| S2 | B10 | 25$ |
| S2 | B16 | 30$ |
| S1 | B09 | 21$ | <
| S3 | B15 | 22$ |
+-----------+------+-------+
Suppose we need to find the stores which have two particular books, namely B11 and B15. Here, the answer is S1, as it stocks both books.
One way of doing it is to find the intersection of the stores having book B11 with the stores having book B15, using the command below:
val df_select = df.filter($"book" === "B11").select("storename")
.join(df.filter($"book" === "B15").select("storename"), Seq("storename"), "inner")
which contains the names of the stores having both.
But instead I want a table
+-----------+------+-------+
| storename | book | price |
+-----------+------+-------+
| S1 | B11 | 10$ | <<
| S1 | B15 | 29$ | <<
| S1 | B09 | 21$ | <
+-----------+------+-------+
which contains all the records related to the qualifying store. Note that B09 is not left out. (Use case: the user can explore some other books as well in the same store.)
We can do this by doing another intersection of above result with original dataframe:
df_select.join(df, Seq("storename"), "inner")
But I see scalability and readability issues with step 1, as I have to keep joining one dataframe to another if the number of books is more than 2. That is painful and error-prone. Is there a more elegant way to do the same? Something like:
val storewise = Window.partitionBy("storename")
df.filter($"book".contains{"B11", "B15"}.over(storewise))
Found a simple solution using the array_except function:
Add the required set of field values as an array in a new column, req_books.
Add a column, all_books, storing all the books available in a store, using a Window.
Using the above two columns, find whether a store misses any required book, and filter it out if it does.
Drop the extra columns created.
Code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._   // for the 'column symbol syntax

val df1 = df.withColumn("req_books", array(lit("B11"), lit("B15")))
  .withColumn("all_books", collect_set('book).over(Window.partitionBy('storename)))
df1.withColumn("missing_books", array_except('req_books, 'all_books))
  .filter(size('missing_books) === 0)
  .drop("missing_books", "all_books", "req_books")
  .show
Using window functions to create an array of all values and check whether it contains all the necessary values:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._   // for the $"..." column syntax

val bookList = List("B11", "B15") // list of books to search for
def arrayContainsMultiple(bookList: Seq[String]) =
  udf((allBooks: WrappedArray[String]) => allBooks.intersect(bookList).sorted.equals(bookList.sorted))
val filteredDF = input
  .withColumn("allBooks", collect_set($"book").over(Window.partitionBy($"storename")))
  .filter(arrayContainsMultiple(bookList)($"allBooks"))
  .drop($"allBooks")
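The same window-plus-set idea can also be written without a UDF in PySpark; a sketch, assuming Spark 2.4+ (for array_except) and a DataFrame df with the storename/book/price columns from the question:

# Keep only stores whose set of books covers all required books.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

req_books = ["B11", "B15"]
by_store = Window.partitionBy("storename")

result = (df
    .withColumn("all_books", F.collect_set("book").over(by_store))
    .withColumn("missing", F.array_except(F.array(*[F.lit(b) for b in req_books]), F.col("all_books")))
    .filter(F.size("missing") == 0)
    .drop("all_books", "missing"))

result.show()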

CTE RECURSIVE optimization, how to?

I need to optimize the performance of a common WITH RECURSIVE query... We can limit the depth of the tree and decompose it into many updates, and we can also change the representation (use an array)... I have tried some options, but perhaps there is a "classic optimization solution" that I am not seeing.
All details
There is a t_up table, to be updated, with a composite primary key (pk1,pk2), one attribute attr, and an array of references to primary keys... And an unnested representation t_scan with the references, like this:
pk1 | pk2 | attr | ref_pk1 | ref_pk2
n | 123 | 1 | |
n | 456 | 2 | |
r | 123 | 1 | w | 123
w | 123 | 5 | n | 456
r | 456 | 2 | n | 123
r | 123 | 1 | n | 111
n | 111 | 4 | |
... | ...| ... | ... | ...
There are no loops.
UPDATE t_up SET x = pairs
FROM (
  WITH RECURSIVE tree AS (
    SELECT pk1, pk2, attr, ref_pk1, ref_pk2,
           array[array[0,0]]::bigint[] AS all_refs
    FROM t_scan
    UNION ALL
    SELECT c.pk1, c.pk2, c.attr, c.ref_pk1, c.ref_pk2,
           p.all_refs || array[c.attr, c.pk2]
    FROM t_scan c JOIN tree p
      ON c.ref_pk1 = p.pk1 AND c.ref_pk2 = p.pk2 AND c.pk2 != p.pk2
         AND array_length(p.all_refs, 1) < 5  -- 5 or 6, avoiding endless loops
  )
  SELECT pk1, pk2, array_agg_cat(all_refs) AS pairs
  FROM (
    SELECT DISTINCT pk1, pk2, all_refs
    FROM tree
    WHERE array_length(all_refs, 1) > 1  -- ignores the initial array[0,0]
  ) t
  GROUP BY 1, 2
  ORDER BY 1, 2
) rec
WHERE rec.pk1 = t_up.pk1 AND rec.pk2 = t_up.pk2;
To test:
CREATE TABLE t_scan(
pk1 char,pk2 bigint, attr bigint,
ref_pk1 char, ref_pk2 bigint
);
INSERT INTO t_scan VALUES
('n',123, 1 ,NULL,NULL),
('n',456, 2 ,NULL,NULL),
('r',123, 1 ,'w' ,123),
('w',123, 5 ,'n' ,456),
('r',456, 2 ,'n' ,123),
('r',123, 1 ,'n' ,111),
('n',111, 4 ,NULL,NULL);
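Note that array_agg_cat used in the UPDATE is not a PostgreSQL built-in. A minimal sketch of one way to define it and drive the test from Python, assuming psycopg2, a local database, and that the aggregate is simply meant to concatenate arrays:

# Define an array-concatenating aggregate before running the UPDATE above.
# The connection string and the exact semantics of array_agg_cat are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=test")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE AGGREGATE array_agg_cat(anyarray) (
            SFUNC    = array_cat,
            STYPE    = anyarray,
            INITCOND = '{}'
        );
    """)
    # ...then execute the CREATE TABLE / INSERT statements above and the UPDATE.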
Running only the rec subquery, you will obtain:
pk1 | pk2 | pairs
-----+-----+-----------------
r | 123 | {{0,0},{1,123}}
r | 456 | {{0,0},{2,456}}
w | 123 | {{0,0},{5,123}}
But, unfortunately, to appreciate the "Big Data performance problem", you need to see it in a real database... I am preparing a public GitHub repository that runs against OpenStreetMap Big Data.

Drools: create a conditional rule to match an input as list for each condition with permutations and combinations

In Drools, how do I create a conditional rule to match if:
1) the input is a list,
2) each condition column has its own list,
3) the condition should match on permutations and combinations of all the condition lists?
My decision table is in the format below:
------------------------------------------------
COND. | CONDITION | CONDITION| ACTION
------------------------------------------------
Store | ProjectCode | Country | ArticleNumber
------------------------------------------------
10 | 1001 | USA | AD112
20 | 1002 | UK | AD113
30 | 1003 | USA | AD114
40 | 1004 | SWE | AD112
50 | 1005 | GER | AD114
I will have conditions in list form like below
ArticleRule{
List<String> stores = Arrays.asList("10","30","40","50");
List<String> projectCodes = Arrays.asList("1001","1002","1004","1005");
List<String> countries = Arrays.asList("USA","GER","UK");
}
My result, created from the permutations and combinations of all the lists, would be:
Output: (AD112, AD114)
In my real use case each list might have 1000 values in it, and my decision table can have a million records.
How can I achieve this using Drools?
You should insert each row as a fact Article with fields store, projectCode, country and articleNumber. Your rule would be:
rule select
when
    $article: Article(
        store in ("10","30","40","50"),
        projectCode in ("1001","1002","1004","1005"),
        country in ("USA","GER","UK") )
then
    System.out.println( $article.getArticleNumber() );
end

How to get back aggregate values across 2 dimensions using Python Cubes?

Situation
Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my data tables, shown in text format:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to use Cubes to display the data, with pagination, in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
Note the following:
Even though there are no sales records for Saucer 12 in store S3, I display 0 instead of null or none.
I want to be able to sort by store, say in descending order for S3.
The cells show the SUM total spent on that particular product in that particular store.
I also want to have pagination.
What I tried
This is the configuration I used:
"cubes": [
{
"name": "sales",
"dimensions": ["product", "store"],
"joins": [
{"master":"product_id", "detail":"product.id"},
{"master":"store_id", "detail":"store.id"}
]
}
],
"dimensions": [
{ "name": "product", "attributes": ["code", "name"] },
{ "name": "store", "attributes": ["code", "address"] }
]
This is the code I used:
result = browser.aggregate(drilldown=['Store','Product'],
order=[("Product.name","asc"), ("Store.name","desc"), ("total_products_sale", "desc")])
I didn't get what I wanted. I got this instead:
----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1 | 1 | 7.05 |
| 1 | 2 | 9 |
| 2 | 3 | 2.00 |
| and many more .... |
|---------------------------------------------|
which is the whole table with no pagination, and products that were not sold in a store do not show up as zero.
My question
How do I get what I want?
Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?
Update
I have read more and realised that what I want is called dicing, as I need to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations
Cross-posted at Cubes GitHub issues to get more attention.
This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:
PostgreSQL Crosstab Query
And this about the "extra" column in the crosstab() output:
Pivot on Multiple Columns using Tablefunc
SELECT product_id, product
, COALESCE(s1, 0) AS s1 -- 1. ... displayed 0 instead of null
, COALESCE(s2, 0) AS s2
, COALESCE(s3, 0) AS s3
, COALESCE(s4, 0) AS s4
, COALESCE(s5, 0) AS s5
FROM crosstab(
'SELECT s.product_id, p.name, s.store_id, s.sum_amount
FROM product p
JOIN (
SELECT product_id, store_id
, sum(amount) AS sum_amount -- 3. SUM total of product spent in store
FROM sales
GROUP BY product_id, store_id
) s ON p.id = s.product_id
ORDER BY s.product_id, s.store_id;'
, 'VALUES (1),(2),(3),(4),(5)' -- desired store_id's
) AS ct (product_id int, product text -- "extra" column
, s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER BY s3 DESC; -- 2. ... descending order for S3
Produces your desired result exactly (plus product_id).
To include products that have never been sold replace [INNER] JOIN with LEFT [OUTER] JOIN.
SQL Fiddle with base query.
The tablefunc module is not installed on sqlfiddle.
Major points
Read the basic explanation in the reference answer for crosstab().
I am including product_id because product.name is hardly unique. Otherwise you might get sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works, because s3 references the output column where NULL values have been replaced with COALESCE. Else we would need DESC NULLS LAST to sort NULL values last:
PostgreSQL sort by datetime asc, null first?
For building crosstab() queries dynamically consider:
Dynamic alternative to pivot with CASE and GROUP BY
I also want to have pagination.
That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:
Displaying data in grid view page by page
I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size I would add page numbers to the MV for easy and fast results.
To optimize performance for big result sets, consider:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Optimize query with OFFSET on large table
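Since the question runs inside a Django project, one way to wire the crosstab query above to simple LIMIT/OFFSET pagination from Python is sketched below (the function name and page size are assumptions; the SQL is the query from this answer):

# Run the pivot query through Django's connection and fetch one page at a time.
from django.db import connection

CROSSTAB_SQL = """
SELECT product_id, product
     , COALESCE(s1, 0) AS s1, COALESCE(s2, 0) AS s2, COALESCE(s3, 0) AS s3
     , COALESCE(s4, 0) AS s4, COALESCE(s5, 0) AS s5
FROM   crosstab(
   'SELECT s.product_id, p.name, s.store_id, s.sum_amount
    FROM   product p
    JOIN  (SELECT product_id, store_id, sum(amount) AS sum_amount
           FROM   sales GROUP BY product_id, store_id) s ON p.id = s.product_id
    ORDER  BY s.product_id, s.store_id'
 , 'VALUES (1),(2),(3),(4),(5)'
   ) AS ct (product_id int, product text
          , s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER  BY s3 DESC
LIMIT  %s OFFSET %s;
"""

def sales_page(page, page_size=20):
    """Return one page of the pivoted result as a list of tuples."""
    with connection.cursor() as cur:
        cur.execute(CROSSTAB_SQL, [page_size, page * page_size])
        return cur.fetchall()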

Optimizing MongoDB indexing for multiple fields with multiple queries

I am new to database indexing. My application has the following "find" and "update" queries, which search by single and multiple fields:
                     | reference | timestamp | phone | username | key | address
update               |     x     |           |       |          |     |
findOne              |           |     x     |   x   |          |     |
find/limit:16        |           |     x     |   x   |    x     |     |
find/limit:11        |           |     x     |       |          |  x  |    x
find/limit:1/sort:-1 |           |     x     |   x   |          |  x  |    x
find                 |           |     x     |       |          |     |
1)update({"reference":"f0d3dba-278de4a-79a6cb-1284a5a85cde"}, ……….
2)findOne({"timestamp":"1466595571", "phone":"9112345678900"})
3)find({"timestamp":"1466595571", "phone":"9112345678900", "username":"a0001a"}).limit(16)
4)find({"timestamp":"1466595571", "key":"443447644g5fff", "address":"abc road, mumbai, india"}).limit(11)
5)find({"timestamp":"1466595571", "phone":"9112345678900", "key":"443447644g5fff", "address":"abc road, mumbai, india"}).sort({"_id":-1}).limit(1)
6)find({"timestamp":"1466595571"})
I am creating the indexes:
db.coll.createIndex( { "reference": 1 } )  // for the 1st query
db.coll.createIndex( { "timestamp": 1, "phone": 1, "username": 1 } )  // for the 2nd, 3rd and 6th queries
db.coll.createIndex( { "timestamp": 1, "key": 1, "address": 1, "phone": 1 } )  // for the 4th and 5th queries
Is this the correct way?
Please help me
Thank you
I think what you have done looks fine. One way to check if your query is using an index, which index is being used, and whether the index is effective is to use the explain() function alongside your find().
For example:
db.coll.find({"timestamp":"1466595571"}).explain()
will return a JSON document which details which index (if any) was used. In addition to this, you can specify that explain return "executionStats", e.g.:
db.coll.find({"timestamp":"1466595571"}).explain("executionStats")
This will tell you how many index keys were examined to find the result set as well as the execution time and other useful metrics.
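For reference, the same check can be scripted; a short sketch with pymongo (the "mydb" and "coll" names and the connection string are assumptions):

# Create the three indexes from the question and inspect the plan for query 6.
from pymongo import MongoClient, ASCENDING

coll = MongoClient("mongodb://localhost:27017")["mydb"]["coll"]

coll.create_index([("reference", ASCENDING)])
coll.create_index([("timestamp", ASCENDING), ("phone", ASCENDING), ("username", ASCENDING)])
coll.create_index([("timestamp", ASCENDING), ("key", ASCENDING),
                   ("address", ASCENDING), ("phone", ASCENDING)])

# The winning plan should contain an IXSCAN stage on one of the timestamp-prefixed indexes.
plan = coll.find({"timestamp": "1466595571"}).explain()
print(plan["queryPlanner"]["winningPlan"])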