I have a database with hundreds of thousands of rows with this schema:
+----+----------+---------+
| id | duration | type    |
+----+----------+---------+
| 1  | 41       | cycling |
| 2  | 15       | walking |
| 3  | 6        | walking |
| 4  | 26       | running |
| 5  | 30       | cycling |
| 6  | 13       | running |
| 7  | 10       | running |
+----+----------+---------+
I was previously using a MongoDB aggregation to get a distribution of activities bucketed by duration, with per-type counts and a total per bucket:
{
$bucket: {
groupBy: '$duration',
boundaries: [0, 16, 31, 61, 91, 121],
default: 121,
output: {
total: { $sum: 1 },
walking: {
$sum: { $cond: [{ $eq: ['$type', 'walking'] }, 1, 0] },
},
running: {
$sum: { $cond: [{ $eq: ['$type', 'running'] }, 1, 0] },
},
cycling: {
$sum: { $cond: [{ $eq: ['$type', 'cycling'] }, 1, 0] },
},
},
},
}
I have just transitioned to using Postgres and can't figure out how to do the conditional sums there. What would the query be to get a result table like this?
+---------------+---------+---------+---------+-------+
| duration_band | walking | running | cycling | total |
+---------------+---------+---------+---------+-------+
| 0-15          | 41      | 21      | 12      | 74    |
| 15-30         | 15      | 1       | 44      | 60    |
| 30-60         | 6       | 56      | 7       | 69    |
| 60-90         | 26      | 89      | 32      | 150   |
| 90-120        | 30      | 0       | 6       | 36    |
| 120+          | 13      | 90      | 0       | 103   |
| Total         | 131     | 257     | 101     | 492   |
+---------------+---------+---------+---------+-------+
SQL is very good at retrieving data and performing calculations on it, so getting the values you want is an easy task. It is not so good at formatting results, which is why that task is typically left to the presentation layer. That said, it can be done, and in a single query. The difficulty is the pivot process: transforming rows into columns. But first, some setup. You should put the duration bands in their own table (if they are not already), with the addition of an identifier column, which then allows multiple criteria sets (more on that later). I will proceed that way.
create table bands( name text, period int4range, title text );
insert into bands(name, period, title)
values ('Standard', '[ 0, 15)'::int4range , '0 - 15')
, ('Standard', '[ 15, 30)'::int4range , '15 - 30')
, ('Standard', '[ 30, 60)'::int4range , '30 - 60')
, ('Standard', '[ 60, 90)'::int4range , '60 - 90')
, ('Standard', '[ 90,120)'::int4range , '90 - 120')
, ('Standard', '[120,)'::int4range , '120+');
This sets up your current criteria. The name column is the previously mentioned identifier, and the title column becomes the duration band in the output. The interesting column is period, defined as an integer range; in this case a [closed, open) range that includes the first number but not the second (yes, the brackets have meaning). That definition becomes the heart of the resulting query.
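If the bracket notation is unfamiliar, a quick standalone check (nothing here depends on the tables above) shows how the range containment operator <@ treats the endpoints:

select 14 <@ '[0,15)'::int4range  as in_first_band,     -- true
       15 <@ '[0,15)'::int4range  as still_first_band,  -- false: 15 is the excluded upper bound
       15 <@ '[15,30)'::int4range as in_second_band;    -- true

The same operator is what the final query uses to decide which band a duration falls into.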
The query builds as follows:
1. Retrieve the desired interval set ( [0,15) ... ) and append a "Total" entry to it.
2. Define the list of activities (cycling, ...).
3. Cross join these two sets to pair every interval with every activity. This gives the interval/activity combinations that become the matrix once pivoted.
4. Join the "test" table values to that list, calculating the total time for each activity within each interval. This is the workhorse of the query; it does ALL of the calculations. The result now contains a value for every cell of the matrix, but still in row orientation.
5. With the results calculated, pivot them from row orientation to column orientation.
6. Finally, compress the pivoted results into a single row for each interval and set the final interval ordering.
And the result is:
with buckets ( period , title, ord) as
( select period , title, row_number() over (order by lower(b.period)) ord ---- 1
from bands b
where name = 'Standard'
union all
select '[0,)','Total',count(*) + 1
from bands b
where name = 'Standard'
)
, activities (activity) as ( values ('running'),('walking'),('cycling'), ('Total')) ---- 2
, activity_buckets (period, title, ord, activity) as
(select * from buckets cross join activities) ---- 3
select s2.title "Duration Band" ---- 6
, max(cycling) "Cycling"
, max(running) "Running"
, max(walking) "Walking"
     , max(total)   "Total"
from ( select s1.title, s1.ord
, case when s1.activity = 'cycling' then duration else null end cycling ---- 5
, case when s1.activity = 'running' then duration else null end running
, case when s1.activity = 'walking' then duration else null end walking
, case when s1.activity = 'Total' then duration else null end total
from ( select ab.ord, ab.title, ab.activity
, sum(coalesce(t.duration,0)) duration ---- 4
from activity_buckets ab
left join test t
on ( (t.type = ab.activity or ab.activity = 'Total')
and t.duration <@ ab.period --** determines which time interval(s) the value belongs to
)
group by ab.ord, ab.title, ab.activity
) s1
) s2
group by s2.ord,s2.title
order by s2.ord;
See the demo. It contains each of the major steps along the way. Additionally, it shows how creating a table for the intervals can be put to use. Since I dislike long queries, I generally hide them behind a SQL function and then just call the function; the demo contains this as well.
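To illustrate that last point, here is a rough sketch of wrapping the work in a SQL function keyed by the criteria-set name. The function name, signature, and the simplified row-oriented body are my own illustration (not the demo's function); the full pivot query above can be dropped into the body in the same way.

-- simplified sketch: wrap the query in a function parameterised by criteria set
create function band_totals(p_band_set text)
  returns table (duration_band text, activity text, total_duration bigint)
  language sql
as $$
  select b.title, t.type, coalesce(sum(t.duration), 0)
    from bands b
    left join test t on t.duration <@ b.period
   where b.name = p_band_set
   group by b.title, t.type, lower(b.period)
   order by lower(b.period);
$$;

select * from band_totals('Standard');  -- or any other criteria-set name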
I have a MongoDB collection called tasks. Here T1.1 is a sub-task of T1, T1.1.1 is a sub-task of T1.1, and so on; sub-task levels can grow deeper. I am using MongoDB version 4.0. Below is the collection data:
------------------------------------------
task     | parent_task_id | progress(%)
------------------------------------------
T1       | null           | 20
T2       | null           | 30
T1.1     | T1             | 10
T1.2     | T1             | 10
T1.1.1   | T1.1           | 10
T1.1.2   | T1.1           | 10
T1.1.1.1 | T1.1.1         | 10
------------------------------------------
How do I calculate the average progress of task T1, including all of its sub-tasks (T1.1, T1.2, T1.1.1, T1.1.2, T1.1.1.1), using MongoDB aggregations?
Thanks in advance.
db.collection.aggregate([
{
$match: {
"task": {//Find all subtasks using pattern
$regex: "^T1.*"
}
}
},
{
$group: {
"_id": null,
"avg": {//Avg of the matched tasks' progress
"$avg": "$progress"
}
}
}
])
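For the sample data above, the pattern matches T1 and its five sub-tasks (20 + 10 + 10 + 10 + 10 + 10 across six documents), so the pipeline should return roughly:

[ { "_id" : null, "avg" : 11.666666666666666 } ]

Note that the regex approach leans on the naming convention; "^T1.*" would also match a task named, say, T10.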
Sample playground
I am new to database indexing. My application has the following "find" and "update" queries, searching by single and multiple fields:
                     | reference | timestamp | phone | username | key | address
update               |     x     |           |       |          |     |
findOne              |           |     x     |   x   |          |     |
find/limit:16        |           |     x     |   x   |    x     |     |
find/limit:11        |           |     x     |       |          |  x  |    x
find/limit:1/sort:-1 |           |     x     |   x   |          |  x  |    x
find                 |           |     x     |       |          |     |
1)update({"reference":"f0d3dba-278de4a-79a6cb-1284a5a85cde"}, ……….
2)findOne({"timestamp":"1466595571", "phone":"9112345678900"})
3)find({"timestamp":"1466595571", "phone":"9112345678900", "username":"a0001a"}).limit(16)
4)find({"timestamp":"1466595571", "key":"443447644g5fff", "address":"abc road, mumbai, india"}).limit(11)
5)find({"timestamp":"1466595571", "phone":"9112345678900", "key":"443447644g5fff", "address":"abc road, mumbai, india"}).sort({"_id":-1}).limit(1)
6)find({"timestamp":"1466595571"})
I am creating these indexes:
db.coll.createIndex( { "reference": 1 } ) //for 1st, 6th query
db.coll.createIndex( { "timestamp": 1, "phone": 1, "username": 1 } ) //for 2nd, 3rd query
db.coll.createIndex( { "timestamp": 1, "key": 1, "address": 1, phone: 1 } ) //for 4th, 5th query
Is this the correct way?
Please help me
Thank you
I think what you have done looks fine. One way to check if your query is using an index, which index is being used, and whether the index is effective is to use the explain() function alongside your find().
For example:
db.coll.find({"timestamp":"1466595571"}).explain()
will return a JSON document which details which index (if any) was used. In addition to this, you can specify that explain return "executionStats",
e.g.
db.coll.find({"timestamp":"1466595571"}).explain("executionStats")
This will tell you how many index keys were examined to find the result set as well as the execution time and other useful metrics.
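The executionStats document is fairly verbose; the fields usually worth checking look roughly like this (the field names are MongoDB's, the numbers are made up for illustration):

"executionStats" : {
    "nReturned" : 16,            // documents returned by the query
    "executionTimeMillis" : 2,   // total time spent executing the query
    "totalKeysExamined" : 16,    // index keys scanned
    "totalDocsExamined" : 16,    // documents scanned
    ...
}

Ideally totalKeysExamined and totalDocsExamined stay close to nReturned; a much larger totalDocsExamined, or a COLLSCAN stage instead of an IXSCAN in the winning plan, means the index is not being used effectively.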
I currently have a problem with the dynamic LINQ expression below.
My Models
public class Orders
{
public int OrderId ;
public ICollection<OrderStatuses> Statuses;
}
public class Statuses
{
public int StatusId;
public int OrderId;
public string Note;
public DateTime Created;
}
My Sample data :
Orders
| ID | Name     |
-------------------
| 1  | Order 01 |
| 2  | Order 02 |
| 3  | Order 03 |
Statuses
| ID | OrderId | Note      | Created    |
------------------------------------------
| 1  | 1       | Ordered   | 2016-03-01 |
| 2  | 1       | Pending   | 2016-04-02 |
| 3  | 1       | Completed | 2016-05-19 |
| 4  | 1       | Ordered   | 2015-05-19 |
| 5  | 2       | Ordered   | 2016-05-20 |
| 6  | 2       | Completed | 2016-05-19 |
| 7  | 3       | Completed | 2016-05-19 |
I'd like to get the orders that have a status with Note equal to 'Ordered', together with the latest Created time among those 'Ordered' statuses.
Below is the sample output that I expect from the query:
| Name     | Note    | Last Created |
--------------------------------------
| Order 01 | Ordered | 2016-03-01   |
| Order 02 | Ordered | 2016-05-20   |
Here is my idea, but it seems to be the wrong way:
var outer = PredicateBuilder.True<Order>();
var orders = _entities.Orders
.GroupBy(x => x.OrderId)
.Select(x => new { x.Key, Created = x.Max(g => g.Created) })
.ToArray();
var predicateStatuses = PredicateBuilder.False<Status>();
foreach (var item in orders)
{
predicateStatuses = predicateStatuses.Or(x => x.OrderId == item.Key && x.Created == item.Created);
}
var predicateOrders = PredicateBuilder.False<JobOrder>();
predicateOrders = predicateOrders.Or(predicateStatuses); // I don't know how to pass an expression with a different object type (Order vs. Status) here, or whether I have to write an extension method or something
outer = outer.And(predicateOrders);
Please suggest how to solve this dynamic LINQ expression in this case.
Thanks in advance.
There's nothing dynamic about your query, at least, it doesn't need to be. You can express it as a regular query.
var query =
from o in db.Orders
join s in db.Statuses on o.Id equals s.OrderId
where s.Note == "Ordered"
orderby s.Created descending
group new { o.Name, s.Note, LastCreated = s.Created } by o.Id into g
select g.First();
P.S. Your models don't seem to match the data at all, so I'm ignoring that. Adjust as necessary.
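If you prefer method syntax, a roughly equivalent form would be the following (assuming, as above, db.Orders/db.Statuses sets and an Id property on the order, which the posted models don't show):

var query = db.Orders
    .Join(db.Statuses, o => o.Id, s => s.OrderId, (o, s) => new { o, s })
    .Where(x => x.s.Note == "Ordered")
    .GroupBy(x => x.o.Id)
    .Select(g => g
        .OrderByDescending(x => x.s.Created)
        .Select(x => new { x.o.Name, x.s.Note, LastCreated = x.s.Created })
        .First());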
Thanks so much for @Jeff Mercado's answer. Finally, I customized it to solve my problem as shown below:
var predicateStatuses = PredicateBuilder.False<Order>();
predicateStatuses = predicateStatuses.Or(p => (
from j in db.Statuses
where j.OrderId == p.ID
group j by j.OrderId into g
select g.OrderByDescending(t=>t.Created)
.FirstOrDefault()
).FirstOrDefault().Note == "Ordered"
);
Given the data structure below, as you can see, each record inside one file has the same values for ATT1 and ATT2.
// Store in fileD001.txt
ATT1 | ATT2 | ATT3 | ATT4 ... | ATT200
D001 | 10102011 | x13 | x14 ... | x1200
D001 | 10102011 | x23 | x24 ... | x2200
...
D001 | 10102011 | xN3 | xN4 ... | xN200
// Store in fileD002.txt
ATT1 | ATT2 | ATT3 | ATT4 ... | ATT200
D002 | 10112011 | x13 | x14 ... | x1200
D002 | 10112011 | x23 | x24 ... | x2200
...
D002 | 10112011 | xN3 | xN4 ... | xN200
// Store in fileD003.txt
ATT1 | ATT2 | ATT3 | ATT4 ... | ATT200
D003 | 10132011 | x13 | x14 ... | x1200
D003 | 10132011 | x23 | x24 ... | x2200
...
D003 | 10132011 | xN3 | xN4 ... | xN200
Method One: Assume I use the following structure to store the data.
doc = { "ATT1" : "D001",
"ATT2" : "10102011",
"ATT3" : "x13",
"ATT4" : "x14",
...
"ATT200" : "x1200"
}
Here is the problem: the data contains a lot of duplicated information and wastes DB space. However, the benefit is that each record has its own _id.
Method Two: Assume I use the following structure to store the data.
doc = { "ATT1" : "D001",
"ATT2" : "10102011",
"sub_doc" : { "ATT3" : "x13",
"ATT4" : "x14",
...
"ATT200" : "x1200"
}
}
Here is the problem: the data size N, which is around 1~5000, is too large for MongoDB to handle in one insertion operation. Of course, we can use the $push update modifier to gradually append the data. However, each record then no longer has its own _id.
I don't mean each record has to have its own ID. I am just looking for a better design solution for a task like this.
Thank you
Option 1 is decent since it gives you the easiest data to work with. Maybe worry less about the space since it is cheap?
Option 2 is good to conserve space, though watch out that your document does not get too large: the maximum document size (16 MB) may limit you. Also, if you shard in the future this could limit you.
Option 3 is being a little relational about it: have two collections. The first one is just a lookup for the ATT1/ATT2 pairs. The other collection references the first and holds the remaining attributes.
parent = { att1: "val1", att2: "val2" }
child  = { parent: parent._id, att3: "val3", ... }
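A minimal shell sketch of option 3; the collection names, the parent field on the child, and the index are illustrative choices, not requirements:

// one lookup document per file (the shared ATT1/ATT2 pair)
var parentId = db.file_keys.insertOne({ att1: "D001", att2: "10102011" }).insertedId;

// one document per record, each referencing the lookup document
db.records.insertMany([
    { parent: parentId, att3: "x13", att4: "x14" /* ... att200: "x1200" */ },
    { parent: parentId, att3: "x23", att4: "x24" /* ... att200: "x2200" */ }
]);

// index the reference so reads for a given file stay fast
db.records.createIndex({ parent: 1 });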