Does MongoDB support join queries? - mongodb

Can MongoDB support the following SQL query? I tried MongoDB's $lookup, but $lookup only supports equality conditions like a = b:
select c.cidr,sum(c.bps),sum(c.pps) from (select a.ip,a.pps,a.bps,b.cidr from iptables a join cachetables b on a.start>=b.start and a.end <=b.end)c group by c.cidr;
Here's my test data.
mysql> select * from iptables;
+-----------+-------+------+------+------+
| ip | start | end | pps | bps |
+-----------+-------+------+------+------+
| 168.1.1.1 | 1 | 2 | 1 | 1 |
| 168.1.1.2 | 3 | 4 | 2 | 2 |
| 168.1.1.6 | 5 | 6 | 6 | 6 |
| 168.2.2.1 | 101 | 102 | 6 | 6 |
| 168.2.2.2 | 103 | 104 | 6 | 6 |
| 168.2.2.2 | 103 | 104 | 6 | 6 |
+-----------+-------+------+------+------+
6 rows in set (0.00 sec)
mysql> select * from cachetables;
+--------------+-------+------+
| cidr | start | end |
+--------------+-------+------+
| 168.1.1.0/24 | 1 | 100 |
| 168.2.2.0/24 | 101 | 200 |
+--------------+-------+------+
2 rows in set (0.00 sec)

https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/
According to the official documentation, $lookup with a sub-pipeline (the let and pipeline fields), which allows join conditions other than equality, is supported from version 3.6. My version is 3.2, so I'll have to find another workaround.
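For reference, a minimal sketch of both routes in Python (an assumption; the question does not say which driver is in use). RANGE_JOIN_PIPELINE is the 3.6+ form, using $lookup with let and pipeline; it is shown as data only and is not executed here. range_join_client_side is a hypothetical helper for 3.2 that does the range join and grouping in application code, which is only reasonable while cachetables stays small.

```python
from collections import defaultdict

# MongoDB 3.6+ only: $lookup with a sub-pipeline can express the range condition.
# Collection and field names are taken from the SQL in the question.
RANGE_JOIN_PIPELINE = [
    {"$lookup": {
        "from": "cachetables",
        "let": {"ip_start": "$start", "ip_end": "$end"},
        "pipeline": [
            {"$match": {"$expr": {"$and": [
                {"$gte": ["$$ip_start", "$start"]},
                {"$lte": ["$$ip_end", "$end"]},
            ]}}},
            {"$project": {"_id": 0, "cidr": 1}},
        ],
        "as": "matched",
    }},
    {"$unwind": "$matched"},
    {"$group": {
        "_id": "$matched.cidr",
        "bps": {"$sum": "$bps"},
        "pps": {"$sum": "$pps"},
    }},
]


def range_join_client_side(ip_rows, cache_rows):
    """Hypothetical 3.2 workaround: nested-loop range join, then sum per cidr."""
    totals = defaultdict(lambda: {"bps": 0, "pps": 0})
    for ip in ip_rows:
        for cache in cache_rows:
            if ip["start"] >= cache["start"] and ip["end"] <= cache["end"]:
                totals[cache["cidr"]]["bps"] += ip["bps"]
                totals[cache["cidr"]]["pps"] += ip["pps"]
    return dict(totals)


# Test data copied from the question.
IPTABLES = [
    {"ip": "168.1.1.1", "start": 1, "end": 2, "pps": 1, "bps": 1},
    {"ip": "168.1.1.2", "start": 3, "end": 4, "pps": 2, "bps": 2},
    {"ip": "168.1.1.6", "start": 5, "end": 6, "pps": 6, "bps": 6},
    {"ip": "168.2.2.1", "start": 101, "end": 102, "pps": 6, "bps": 6},
    {"ip": "168.2.2.2", "start": 103, "end": 104, "pps": 6, "bps": 6},
    {"ip": "168.2.2.2", "start": 103, "end": 104, "pps": 6, "bps": 6},
]
CACHETABLES = [
    {"cidr": "168.1.1.0/24", "start": 1, "end": 100},
    {"cidr": "168.2.2.0/24", "start": 101, "end": 200},
]
```

With pymongo, the two row lists would come from db.iptables.find() and db.cachetables.find(), and on 3.6 or later the pipeline could be passed to db.iptables.aggregate(RANGE_JOIN_PIPELINE) instead.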

Related

Facet a Multi-value (MVA) type field in Sphinx

I executed the query below in Sphinx:
select MVA_FIELD from mySphinxIndex facet MVA_FIELD order by count(*) desc;
The result looks like this:
+----------------------------+----------+
| MVA_FIELD | count(*) |
+----------------------------+----------+
| | 664 |
| 0 | 536 |
| 13 | 439 |
| 4,13 | 8 |
| 19,13 | 8 |
| 18,13,20 | 8 |
| 8,17,18 | 8 |
| 8,18,13 | 8 |
| 8,15,18 | 8 |
| 8,13,20 | 7 |
| 17,13 | 7 |
| 18,19,20 | 7 |
| 8,17 | 7 |
| 13,17,19 | 7 |
| 11,6 | 7 |
| 6,11,13 | 7 |
| 15,18 | 7 |
| 11,13,20 | 7 |
| 11,13,17 | 7 |
| 6,18,19 | 6 |
| 7,20 | 6 |
| 8,11,13 | 6 |
| 13,17,20 | 6 |
I want to get the count of each ID in MVA_FIELD. For example, I want a separate count for 0, 4, 13, and so on. How can I achieve this?
Honestly, I don't know how to do it with the FACET sugar, but with a normal GROUP BY query you would just use the GROUPBY() function when grouping by an MVA attribute:
SELECT GROUPBY() AS value,COUNT(*) FROM mySphinxIndex GROUP BY MVA_FIELD ORDER BY COUNT(*) DESC;
From the docs:
A special GROUPBY() function is also supported. It returns the GROUP BY key. That is particularly useful when grouping by an MVA value, in order to pick the specific value that was used to create the current group.

Faceted search for a field with multiple values

I have a table with a field that holds multiple comma-separated values:
+------+---------------+
| id | education_ids |
+------+---------------+
| 3 | 7,5 |
| 4 | 7,3 |
| 5 | 1,5 |
| 8 | 3 |
| 9 | 5,7 |
| 11 | 9 |
...
+------+---------------+
When I try to use faceted search:
select id,education_ids from jobResume facet education_ids;
I get this response:
+---------------+----------+
| education_ids | count(*) |
+---------------+----------+
| 7,5 | 3558 |
| 7,3 | 3655 |
| 1,5 | 3686 |
| 3 | 31909 |
| 5,7 | 3490 |
| 9 | 31743 |
| 9,6 | 3535 |
| 8,2 | 3547 |
| 6,2,7 | 291 |
| 7,8,1 | 291 |
| 1,2 | 3637 |
| 7 | 31986 |
| 5,9,7 | 408 |
| 1,1,5 | 365 |
| 5 | 31768 |
| 3,8,3,7 | 32 |
| 3,7,6 | 431 |
| 2 | 31617 |
| 5,5 | 3614 |
| 9,9,2,2 | 6 |
+---------------+----------+
But that's not what I wanted to see. I would like each value to have its own count, for example:
+---------------+----------+
| education_ids | count(*) |
+---------------+----------+
| 10 | 961 |
| 11 | 1653 |
| 12 | 1998 |
| 13 | 2090 |
| 14 | 1058 |
| 15 | 347 |
...
+---------------+----------+
Can I get such a result with Sphinx?
Make sure you use an MVA, not a string attribute:
index rt
{
type = rt
rt_field = f
rt_attr_multi = education_ids
path = rt
}
snikolaev@dev:$ mysql -P9306 -h0
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 3.2.2 62ea5ff@191220 release
Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> insert into rt(education_ids) values((7,5)), ((7,3)), ((7,1)), ((5,1)), ((5,3));
Query OK, 5 rows affected (0.00 sec)
mysql> select * from rt facet education_ids;
+---------------------+---------------+
| id | education_ids |
+---------------------+---------------+
| 2810610458032078849 | 5,7 |
| 2810610458032078850 | 3,7 |
| 2810610458032078851 | 1,7 |
| 2810610458032078852 | 1,5 |
| 2810610458032078853 | 3,5 |
+---------------------+---------------+
5 rows in set (0.00 sec)
+---------------+----------+
| education_ids | count(*) |
+---------------+----------+
| 7 | 3 |
| 5 | 3 |
| 3 | 2 |
| 1 | 2 |
+---------------+----------+
4 rows in set (0.00 sec)
BTW here's an interactive course about faceting in Sphinx / Manticore in case you want to learn more about that - https://play.manticoresearch.com/faceting/
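To make the difference concrete outside of Sphinx, here is a small Python sketch that counts the question's sample rows both ways. The function names are made up for illustration. Dropping duplicate values within one document is an assumption, although it is consistent with the session above, where (7,5) is stored and shown as 5,7.

```python
from collections import Counter

# Rows copied from the table in the question: (id, education_ids as stored text).
ROWS = [(3, "7,5"), (4, "7,3"), (5, "1,5"), (8, "3"), (9, "5,7"), (11, "9")]


def facet_as_string(rows):
    """Count whole strings, which is what faceting a string attribute yields."""
    return Counter(ids for _, ids in rows)


def facet_as_mva(rows):
    """Count each distinct value once per document, as in the MVA facet above."""
    counts = Counter()
    for _, ids in rows:
        # A set drops duplicates within one document (assumed MVA behaviour).
        counts.update({int(value) for value in ids.split(",")})
    return counts
```

facet_as_string(ROWS) keeps "7,5" and "5,7" as separate buckets, while facet_as_mva(ROWS) produces one bucket per individual ID, which is the shape of result the question asks for.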

Redshift Distribution By Child Columns

My Situation
I have some tables in my Redshift cluster that all break down into either an order_id, shipment_id, or shipment_item_id, depending on how granular the table is. order_id has a one-to-many relationship to shipment_id, and shipment_id has a one-to-many relationship to shipment_item_id.
My Question
I distribute on order_id, so all shipment_id and shipment_item_id records should be on the same nodes across the tables since they are grouped by order_id. My question is, when I have to join on shipment_id or shipment_item_id then will redshift know that the records are on the same nodes, or will it still broadcast the tables since they aren't joined on order_id?
Example Tables
unified_order shipment_details
+----------+-------------+------------------+ +-------------+-----------+--------------+
| order_id | shipment_id | shipment_item_id | | shipment_id | ship_day | ship_details |
+----------+-------------+------------------+ +-------------+-----------+--------------+
| 1 | 1 | 1 | | 1 | 1/1/2017 | stuff |
| 1 | 1 | 2 | | 2 | 5/1/2017 | other stuff |
| 1 | 1 | 3 | | 3 | 6/14/2017 | more stuff |
| 1 | 2 | 4 | | 4 | 5/13/2017 | less stuff |
| 1 | 2 | 5 | | 5 | 6/19/2017 | that stuff |
| 1 | 3 | 6 | | 6 | 7/31/2017 | what stuff |
| 2 | 4 | 7 | | 7 | 2/5/2017 | things |
| 2 | 4 | 8 | +-------------+-----------+--------------+
| 3 | 5 | 9 |
| 3 | 5 | 10 |
| 4 | 6 | 11 |
| 5 | 7 | 12 |
| 5 | 7 | 13 |
+----------+-------------+------------------+
Distribution
distribution_by_node
+------+----------+-------------+------------------+
| node | order_id | shipment_id | shipment_item_id |
+------+----------+-------------+------------------+
| 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 2 |
| 1 | 1 | 1 | 3 |
| 1 | 1 | 2 | 4 |
| 1 | 1 | 2 | 5 |
| 1 | 1 | 3 | 6 |
| 1 | 5 | 7 | 12 |
| 1 | 5 | 7 | 13 |
| 2 | 2 | 4 | 7 |
| 2 | 2 | 4 | 8 |
| 3 | 3 | 5 | 9 |
| 3 | 3 | 5 | 10 |
| 4 | 4 | 6 | 11 |
+------+----------+-------------+------------------+
The Amazon Redshift documentation does not go into detail how information is shared between nodes, but it is doubtful that it "broadcasts the tables".
Rather, information is probably sent between nodes based on need -- only the relevant columns would be shared, and possibly only sub-ranges of the data.
Rather than worrying too much about the internal implementation, you should test various DISTKEY and SORTKEY strategies against real queries to determine performance.
Follow the recommendations from Choose the Best Distribution Style to minimize the amount of data that needs to be sent between nodes and consult Amazon Redshift Best Practices for Designing Queries to improve queries.
You can EXPLAIN your query to see how data will be distributed (or not) during the execution. In this doc you'll see how to read the query plan:
Evaluating the Query Plan
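When reading the EXPLAIN output, the part that answers the question is the DS_* label on each join step. The sketch below is a hypothetical Python helper (not part of any Redshift tooling) that pulls those labels out of plan text and flags whether each one involves moving rows between nodes. The label names are the ones listed in the Evaluating the Query Plan documentation; how they are grouped here is my reading of that page.

```python
import re

# Join labels that mean no rows are moved between nodes.
NO_MOVEMENT = {"DS_DIST_NONE", "DS_DIST_ALL_NONE"}
# Join labels that mean one or both tables are redistributed or broadcast.
MOVEMENT = {
    "DS_DIST_INNER",
    "DS_DIST_OUTER",
    "DS_BCAST_INNER",
    "DS_DIST_ALL_INNER",
    "DS_DIST_BOTH",
}
KNOWN = NO_MOVEMENT | MOVEMENT


def join_movement(plan_text):
    """Return (label, moves_data) for each known DS_* label, in plan order."""
    labels = re.findall(r"\bDS_[A-Z_]+\b", plan_text)
    return [(label, label in MOVEMENT) for label in labels if label in KNOWN]
```

Pass it the text returned by EXPLAIN for the join on shipment_id. If every join comes back with moves_data set to False, the rows are being joined where they already live. As far as I know, the planner only treats a join as collocated when both sides are joined on their distribution keys, so check the plan rather than assuming either way.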

Are pg_stat_database and pg_stat_activity really listing the same stuff aka how do I get a list of all backends

In this answer to the question Right query to get the current number of connections in a PostgreSQL DB the poster implies that
SELECT sum(numbackends) FROM pg_stat_database;
and
SELECT count(*) FROM pg_stat_activity;
give the same results.
However, if I do this on my db the first one says 119 and the second one 30.
This is the difference as shown by summing numbackends and counting:
+------+-------------+-------+
| | numbackends | count |
+------+-------------+-------+
| db1 | 1 | 1 |
| db2 | 1 | 1 |
| db3 | 1 | 1 |
| db4 | 1 | 1 |
| db5 | 2 | 2 |
| db6 | 2 | 2 |
| db7 | 12 | 3 | <--
| db8 | 4 | 4 |
| db9 | 5 | 5 |
| db10 | 78 | 35 | <--
+------+-------------+-------+
Why does this difference exist?
How can I list each of the 119-30=89 backends not shown in pg_stat_activity?

Comparing Subqueries

I have two subqueries. Here is the output of subquery A....
id | date_lat_lng | stat_total | rnum
-------+--------------------+------------+------
16820 | 2016_10_05_10_3802 | 9 | 2
15701 | 2016_10_05_10_3802 | 9 | 3
16821 | 2016_10_05_11_3802 | 16 | 2
17861 | 2016_10_05_11_3802 | 16 | 3
16840 | 2016_10_05_12_3683 | 42 | 2
17831 | 2016_10_05_12_3767 | 0 | 2
17862 | 2016_10_05_12_3802 | 11 | 2
17888 | 2016_10_05_13_3683 | 35 | 2
17833 | 2016_10_05_13_3767 | 24 | 2
16823 | 2016_10_05_13_3802 | 24 | 2
and subquery B, in which date_lat_lng and stat_total have values in common with subquery A, but id does not.
id | date_lat_lng | stat_total | rnum
-------+--------------------+------------+------
17860 | 2016_10_05_10_3802 | 9 | 1
15702 | 2016_10_05_11_3802 | 16 | 1
17887 | 2016_10_05_12_3683 | 42 | 1
15630 | 2016_10_05_12_3767 | 20 | 1
16822 | 2016_10_05_12_3802 | 20 | 1
16841 | 2016_10_05_13_3683 | 35 | 1
15632 | 2016_10_05_13_3767 | 23 | 1
17863 | 2016_10_05_13_3802 | 3 | 1
16842 | 2016_10_05_14_3683 | 32 | 1
15633 | 2016_10_05_14_3767 | 12 | 1
Both subquery A and B pull data from the same table. I want to delete the rows in that table that share the same ID as subquery A but only where date_lat_lng and stat_total have a shared match in subquery B.
Effectively I need:
DELETE FROM table WHERE
id IN
(SELECT id FROM (subqueryA) WHERE
subqueryA.date_lat_lng=subqueryB.date_lat_lng
AND subqueryA.stat_total=subqueryB.stat_total)
Except I'm not sure where to place subquery B, or if I need an entirely different structure.
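Something like this. The join uses only date_lat_lng and stat_total, because id differs between the two subqueries:
DELETE FROM table WHERE
id IN (
SELECT DISTINCT a.id
FROM (subqueryA) a
JOIN (subqueryB) b
USING (date_lat_lng, stat_total)
)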
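The join-based delete can be tried against the question's sample rows with Python's sqlite3 module. This is a sketch under several assumptions: SQLite stands in for whatever database the question uses, the table name readings is made up, rnum is stored as a plain column, and subquery A and B are approximated as rnum > 1 and rnum = 1. The join is on date_lat_lng and stat_total only, since the question says id does not match between the two subqueries.

```python
import sqlite3

# Rows from the question's two outputs: (id, date_lat_lng, stat_total, rnum).
ROWS = [
    (16820, "2016_10_05_10_3802", 9, 2), (15701, "2016_10_05_10_3802", 9, 3),
    (16821, "2016_10_05_11_3802", 16, 2), (17861, "2016_10_05_11_3802", 16, 3),
    (16840, "2016_10_05_12_3683", 42, 2), (17831, "2016_10_05_12_3767", 0, 2),
    (17862, "2016_10_05_12_3802", 11, 2), (17888, "2016_10_05_13_3683", 35, 2),
    (17833, "2016_10_05_13_3767", 24, 2), (16823, "2016_10_05_13_3802", 24, 2),
    (17860, "2016_10_05_10_3802", 9, 1), (15702, "2016_10_05_11_3802", 16, 1),
    (17887, "2016_10_05_12_3683", 42, 1), (15630, "2016_10_05_12_3767", 20, 1),
    (16822, "2016_10_05_12_3802", 20, 1), (16841, "2016_10_05_13_3683", 35, 1),
    (15632, "2016_10_05_13_3767", 23, 1), (17863, "2016_10_05_13_3802", 3, 1),
    (16842, "2016_10_05_14_3683", 32, 1), (15633, "2016_10_05_14_3767", 12, 1),
]

DELETE_SQL = """
DELETE FROM readings WHERE id IN (
    SELECT DISTINCT a.id
    FROM (SELECT * FROM readings WHERE rnum > 1) AS a   -- stand-in for subquery A
    JOIN (SELECT * FROM readings WHERE rnum = 1) AS b   -- stand-in for subquery B
    USING (date_lat_lng, stat_total)
)
"""


def delete_matching(rows):
    """Load rows into an in-memory table, run the delete, return remaining ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings ("
        "id INTEGER PRIMARY KEY, date_lat_lng TEXT, stat_total INTEGER, rnum INTEGER)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)
    conn.execute(DELETE_SQL)
    remaining = [row[0] for row in conn.execute("SELECT id FROM readings ORDER BY id")]
    conn.close()
    return remaining
```

Calling delete_matching(ROWS) removes only the subquery A rows whose date_lat_lng and stat_total pair also appears in subquery B. Rows such as 17831, where stat_total differs, are kept, and so is every subquery B row.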