How to combine two separate aggregations together in the same result in MongoDB with a join

I have two different aggregations. In the first I simply retrieve data from a Mongo collection; in the second I retrieve data from 3 collections using joins ($lookup, $match) and groups.
Now I want to join the first and second results in MongoDB, e.g.:
select a1.name, a2.count
from (select * from a where xyz group by name) as a1
LEFT OUTER JOIN
(select * from b /* this contains multiple table joins */) as a2
on a1.id = a2.id

Welcome to Stack Overflow, Pravin.
As the MongoDB documentation states, MongoDB is most useful when your data model is denormalized (i.e. related data is stored within a single document).
So your first option is to design your schema in a way that helps you avoid complex join queries.
You also have the option to make two separate queries to the database and use manual references, like foreign keys in an RDBMS.
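For example, a minimal sketch of a manual reference, with hypothetical collection and field names:
// Store the related document's _id and resolve it with a second query.
var order = db.orders.findOne({ _id: orderId });
var customer = db.customers.findOne({ _id: order.customer_id });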
You also have the option to use so-called DBRefs.
Depending on your particular task, one of these options may be applicable.
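And since your question already mentions $lookup: on MongoDB 3.6+ the two aggregations can be combined in a single pipeline by nesting the second one inside a $lookup stage. A minimal sketch, where the collection and field names follow your SQL example and everything else is an assumption:
// Group collection "a" by name, then left-join the counts produced by a
// nested pipeline over collection "b" (requires MongoDB 3.6+).
db.a.aggregate([
  { $match: { /* the xyz filter */ } },
  { $group: { _id: "$name", id: { $first: "$id" } } },
  { $lookup: {
      from: "b",
      let: { joinId: "$id" },
      pipeline: [
        // the lookup/match/group stages of your second aggregation go here
        { $match: { $expr: { $eq: ["$id", "$$joinId"] } } },
        { $count: "count" }
      ],
      as: "a2"
  } },
  // like a LEFT OUTER JOIN: "a2" is simply empty when "b" has no match
  { $project: { name: "$_id", count: { $arrayElemAt: ["$a2.count", 0] } } }
])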

Related

What is the proper way to translate a complicated SQL query with custom columns and rdbms-specific functions to Doctrine

I have been a Propel user for years and only recently started switching to Doctrine. It's still quite new to me and sometimes Propel habits kick in and make it hard for me to "think in Doctrine". Below is a specific case. You don't have to know Propel to answer my question - I also present my case in raw SQL.
The simplified structure of the tables my query refers to is like this:
Application table has FK to Admin which has FK to User (fos_user in the DB)
ApplicationUser table has FK to Application
My query gets all Application records with custom columns containing additional info retrieved from related User records (through Admin) and some COUNTs of related ApplicationUser objects, one of which is additionally filtered (adminname, usercount, usercountperiod columns added to the query).
I have a Propel query like this:
ApplicationQuery::create()
    ->leftJoinApplicationUser()
    ->useAdminQuery()
        ->leftJoinUser()
    ->endUse()
    ->withColumn('fos_user.username', 'adminname')
    ->withColumn('COUNT(application_user.id)', 'usercount')
    ->withColumn('COUNT(application_user.id) FILTER '
        . '(WHERE score > 0 AND '
        . ' application_user.created_at >= to_timestamp('.strtotime($users_scored['begin']).') and '
        . ' application_user.created_at < to_timestamp('.strtotime($users_scored['end']).') )', 'usercountperiod')
    ->groupById()
    ->groupBy('User.Id')
    ->orderById('DESC')
    ->paginate( ....
This is how it translates to SQL (PostgreSQL):
SELECT application.id, application.name, ...,
       fos_user.username AS "adminname",
       COUNT(application_user.id) AS "usercount",
       COUNT(application_user.id) FILTER (
         WHERE score > 0 AND
               application_user.created_at >= to_timestamp(1491004800) AND
               application_user.created_at < to_timestamp(1498780800) ) AS "usercountperiod"
FROM application
LEFT JOIN application_user ON (application.id = application_user.application_id)
LEFT JOIN admin ON (application.admin_id = admin.id)
LEFT JOIN fos_user ON (admin.id = fos_user.id)
GROUP BY application.id, fos_user.id
ORDER BY application.id DESC
LIMIT 15
As you can see it's quite complex (in terms of translating it to Doctrine ORM, when you're a Doctrine newbie like me :) ). It uses specific features of PostgreSQL:
being able to include only the primary key in the GROUP BY statement, while other columns from the same table can be used in SELECT without an aggregate function or inclusion in GROUP BY (because they are functionally dependent on the PK);
FILTER which allows you to further filter records that are fed into aggregate functions
It also uses some joins and adds custom columns (adminname, usercount, usercountperiod) which I can access on my resulting Propel model objects (with functions like $result->getAdminname()).
My question is: what is the "Doctrine way" to achieve something as similar as possible, as simply as possible (use some PostgreSQL-specific or other RDBMS-specific features, add some custom columns which will be accessible through ORM objects, and so on)?
Thank you for your help.
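(For reference, a minimal sketch of the usual Doctrine escape hatch for RDBMS-specific SQL: a native query with a result set mapping. The entity and variable names here are assumptions, not a definitive answer:)
// Run the PostgreSQL query above as-is and map the custom columns
// as scalar results next to the hydrated Application entities.
use Doctrine\ORM\Query\ResultSetMappingBuilder;

$rsm = new ResultSetMappingBuilder($entityManager);
$rsm->addRootEntityFromClassMetadata(Application::class, 'application');
$rsm->addScalarResult('adminname', 'adminname');
$rsm->addScalarResult('usercount', 'usercount');
$rsm->addScalarResult('usercountperiod', 'usercountperiod');

$sql = 'SELECT application.*, fos_user.username AS "adminname", ...'; // the SQL above
$rows = $entityManager->createNativeQuery($sql, $rsm)->getResult();
// each row is [0 => Application entity, 'adminname' => ..., 'usercount' => ...]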

Laravel 4.2: order by another collection's field or the result of a function

I have a Mongo database and I'm trying to write Eloquent code that transforms some fields before using them in WHERE or ORDER BY clauses, something like these SQL queries:
SELECT ag.*, ht.*
FROM agency AS ag INNER JOIN hotel AS ht ON ag.hotel_id = ht.id
WHERE ht.title = 'OrangeHotel'
-- or --
SELECT ag.*, ht.*
FROM agency AS ag INNER JOIN hotel AS ht ON ag.hotel_id = ht.id
ORDER BY ht.title
Sometimes there is no other table and I just need to use a calculated field in the WHERE or ORDER BY clause:
SELECT *
FROM agency
WHERE func(agency_admin) = 'testAdmin'

SELECT *
FROM agency
ORDER BY func(agency_admin)
where func() is my custom function.
Any suggestions?
I have read Laravel 4/5, order by a foreign column, which covers half of my problem, but I don't know how I can use it.
For the first query: MongoDB supports "joins" only partially, via the aggregation pipeline, which limits your aggregation to one collection. For "joins" between different collections/tables, just select from the collections one by one: first the one containing the "where" field, then the one that should "join" with the former, and so on.
The second question puzzled me for some minutes until I saw this question and realized it's the same as your first one: sort the collection containing your sort field and retrieve some data, then go to the other collection.
For the 3rd question, this question should serve you well.
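To make the first point concrete, here is a minimal sketch of the query-one-collection-then-the-other approach, assuming Eloquent models Hotel and Agency on a MongoDB connection (all names are illustrative):
// Step 1: query the collection that holds the WHERE field.
$hotelIds = Hotel::where('title', 'OrangeHotel')->lists('_id');

// Step 2: "join" manually by filtering the other collection on those ids.
$agencies = Agency::whereIn('hotel_id', $hotelIds)->get();

// ORDER BY a field of the other collection: fetch the hotel ids in sorted
// order, then order the agencies by the position of their hotel_id.
$sortedIds = Hotel::orderBy('title')->lists('_id');
$agencies = Agency::whereIn('hotel_id', $sortedIds)->get()
    ->sortBy(function ($agency) use ($sortedIds) {
        return array_search($agency->hotel_id, $sortedIds);
    });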

Where clause versus join clause in Spark SQL

I am writing a query to get records from Table A which satisfy a condition on records in Table B. For example:
Table A is:
Name   Profession   City
John   Engineer     Palo Alto
Jack   Doctor       SF
Table B is:
Profession   City   NewJobOffer
Engineer     SF     Yes
and I'm interested in getting Table C:
Name   Profession   City   NewJobOffer
Jack   Engineer     SF     Yes
I can do this in two ways: using a WHERE clause or a JOIN. Which one is faster in Spark SQL, and why?
A WHERE clause to compare the columns and select those records, or a JOIN on the columns themselves - which is better?
It's better to put the filter in the WHERE clause. The two expressions are not equivalent.
When you provide the filtering in the JOIN clause, the two data sources are retrieved first and then joined on the specified condition. Since the join is done by first shuffling data (redistributing it between executors), you are going to shuffle a lot of data.
When you provide the filter in the WHERE clause, Spark can recognize it, and the two data sources will be filtered and then joined. This way you shuffle less data. What might be even more important is that Spark may also be able to do a filter pushdown, filtering the data at the data source level, which means even less network pressure.
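To illustrate with the example tables, here are the two forms the answer contrasts (a sketch with an extra predicate on B written both ways):
-- Predicate placed in the JOIN condition: both inputs are retrieved and
-- shuffled before the condition is applied.
SELECT a.Name, a.Profession, a.City, b.NewJobOffer
FROM A a
JOIN B b
  ON a.Profession = b.Profession
 AND a.City = b.City
 AND b.NewJobOffer = 'Yes';

-- The same predicate in WHERE: Spark can filter first (and possibly push
-- the filter down to the data source), so less data is shuffled.
SELECT a.Name, a.Profession, a.City, b.NewJobOffer
FROM A a
JOIN B b
  ON a.Profession = b.Profession
 AND a.City = b.City
WHERE b.NewJobOffer = 'Yes';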

OrientDB query and schema patterns to speed up the reading phase

I have some performance issues on a quite big data store.
To optimize the insert phase we created a document store rather than a graph; in fact, edge creation performance was too slow.
Essentially we now have a class A (with about 30M documents) with a link (say, a field fieldL) to a class B (about 500 documents).
The query structure is like:
select from A where field1='field1value' and field2='field2value' and field3>0 ... and fieldL in (select from B where ...)
The first issue I've found is this:
I created n indexes on the n properties engaged in the where condition, but the explain command showed me that OrientDB uses only one... https://github.com/orientechnologies/orientdb/issues/3626
So I created a composite index, and if I perform a query involving only the indexed fields, say
select from A where field1='field1value' and field2='field2value' and field3>0
the result is really fast.
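(For reference, the composite index was created along these lines; the index name is an assumption:)
-- Composite NOTUNIQUE index covering the three fields used above.
CREATE INDEX A.field1_field2_field3 ON A (field1, field2, field3) NOTUNIQUE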
The issue is with the second part of the query, involving fieldL and the links.
I've tried the [#rid,...] syntax, but it doesn't seem to perform well.
I've also tried to change the schema using a different approach: class B with multiple links to class A, and a different query pattern (say the field containing the links is fieldL1):
select * from (select expand(fieldL1) from B where ...) where field1='field1value' and field2='field2value' and field3>0
In this case the subquery executes a sort of partition of the data, but unfortunately we lose the indexes on the result set, so performance on the second where clause (field1='field1value' and field2='field2value' and field3>0) is really slow.
My question is: is there a better query pattern to execute these kinds of queries faster?
Thank you very much.
By the way, during performance tuning it seems really awkward to count the documents involved in a query. (https://github.com/orientechnologies/orientdb/issues/3462)
If you use the following query
select * from (select expand(fieldL1) from B where ...) where field1='field1value' and field2='field2value' and field3>0
it doesn't use the index, because it seems there are problems when combining subqueries and indexes.
For more information, you can look at this link:
https://groups.google.com/forum/#!topic/orient-database/7jWEGpkIzXQ

What's the utility of the array type?

I'm a total newbie with PostgreSQL, but I have good experience with MySQL. I was reading the documentation and discovered that PostgreSQL has an array type. I'm quite confused, since I can't understand in which context this type can be useful within an RDBMS. Why would I choose this type instead of a classical one-to-many relationship?
Thanks in advance.
I've used them to make working with trees (such as comment threads) easier. You can store the path from the tree's root to a single node in an array; each number in the array is the branch number for that node. Then you can do things like this:
SELECT id, content
FROM nodes
WHERE tree = X
ORDER BY path -- The array is here.
PostgreSQL will compare arrays element by element in the natural fashion, so ORDER BY path will dump the tree in a sensible linear display order; then you check the length of path to figure out a node's depth, and that gives you the indentation to get the rendering right.
The above approach gets you from the database to the rendered page with one pass through the data.
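For example, the depth falls straight out of the array (a sketch against the hypothetical nodes table above):
-- Depth (i.e. the indentation level) is just the length of the path array.
SELECT id, content,
       array_length(path, 1) AS depth
FROM nodes
WHERE tree = X
ORDER BY path;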
PostgreSQL also has geometric types, simple key/value types, and supports the construction of various other composite types.
Usually it is better to use traditional association tables but there's nothing wrong with having more tools in your toolbox.
One SO user is using it for what appears to be machine-aided translation. The comments to a follow-up question might be helpful in understanding his approach.
I've been using them successfully to aggregate recursive tree references using triggers.
For instance, suppose you have a tree of categories and you want to find products in any of the categories (1,2,3) or any of their subcategories.
One way to do it is to use an ugly WITH RECURSIVE statement. Doing so will output a plan stuffed with merge/hash joins on entire tables and an occasional materialize:
with recursive subcategories as (
    select id
    from categories
    where id in (1,2,3)
    union all
    ...
)
select products.*
from products
join product2category on ...
join subcategories on ...
group by products.id, ...
order by ... limit 10;
Another is to pre-aggregate the needed data:
create table categories (
    id int primary key,
    parents int[] -- (array_agg(parent_id) from parents) || id
);

create table products (
    id int primary key,
    categories int[] -- array_agg(category_id) from product2category
);

create index on categories using gin (parents);
create index on products using gin (categories);
select products.*
from products
where categories && array(
select id from categories where parents && array[1,2,3]
)
order by ... limit 10;
One issue with the above approach is that row estimates for the && operator are junk. (The selectivity is a stub function that has yet to be written, and results in something like 1/200 rows irrespective of the values in your aggregates.) Put another way, you may very well end up with an index scan where a seq scan would be correct.
To work around it, I increased the statistics target on the gin-indexed column, and I periodically look into pg_stats to extract more appropriate stats (see the sketch at the end of this answer). When a cursory look at those stats reveals that using && for the specified values will produce an incorrect plan, I rewrite the applicable occurrences of && with arrayoverlap() (the latter has a stub selectivity of 1/3), e.g.:
select products.*
from products
where arrayoverlap(categories, array(
    select id from categories where arrayoverlap(parents, array[1,2,3])
))
order by ... limit 10;
(The same goes for the <@ operator...)
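A sketch of the stats-raising workflow mentioned above (the statistics target value is illustrative; table and column names follow the schema sketch):
-- Raise the statistics target for the gin-indexed column, re-analyze,
-- then inspect the collected element statistics in pg_stats.
ALTER TABLE products ALTER COLUMN categories SET STATISTICS 1000;
ANALYZE products;

SELECT most_common_elems, most_common_elem_freqs
FROM pg_stats
WHERE tablename = 'products' AND attname = 'categories';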