I'm looking for an approach to build a cross-venue visitor report for my client. He wants an HTTP API that returns the total count of unique customers who visited more than one shop within a given day range (the API must respond within 1-2 seconds).
A sample of the raw data (millions of records in reality):
--------------------------
DAY | CUSTOMER | VENUE
--------------------------
1 | cust_1 | A
2 | cust_2 | A
3 | cust_1 | B
3 | cust_2 | A
4 | cust_1 | C
5 | cust_3 | C
6 | cust_3 | A
Now I want to calculate the cross-visitor report. IMO the steps would be as follows:
Step 1: aggregate the raw data from day 1 to day 6
--------------------------
CUSTOMER | VENUES VISITED
--------------------------
cust_1 | [A, B, C]
cust_2 | [A]
cust_3 | [A, C]
Step 2: produce the final result
Total unique cross-venue customers: 2 (cust_1 and cust_3)
I've tried some solutions:
I first used MongoDB to store the data, then wrote a Flask API that relies on MongoDB's aggregation utilities: $addToSet, $group, $count... But the API's response time was unacceptable.
Then I switched to Elasticsearch, hoping its aggregation commands would help, but it does not support a pipeline group operation on the output of the first "terms" aggregation.
After that, I read about Redis Sets, Sorted Sets, ... but they couldn't help either.
Could you please give me a clue on how to solve my problem?
Thanks in advance!
You can easily do this with Elasticsearch by leveraging a date_histogram aggregation to bucket by day, two terms aggregations (first bucketing by customer and then by venue), and then selecting only the customers who visited more than one venue on any given day using the bucket_selector pipeline aggregation. It looks like this:
POST /sales/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "date",
        "interval": "day"
      },
      "aggs": {
        "customers": {
          "terms": {
            "field": "customer.keyword"
          },
          "aggs": {
            "venues": {
              "terms": {
                "field": "venue.keyword"
              }
            },
            "cross_selector": {
              "bucket_selector": {
                "buckets_path": {
                  "venues_count": "venues._bucket_count"
                },
                "script": {
                  "source": "params.venues_count > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}
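(Side note: on Elasticsearch 7.x and later, the interval parameter of date_histogram is deprecated; use "calendar_interval": "day" instead.)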
In the result set you'll get, under each daily bucket, the customers who visited more than one venue that day. Be aware that these are per-day semantics: with your sample data no customer visits two venues on the same day, so every daily bucket comes up empty. If, as in your Step 1, you want customers who visited more than one venue over the whole day range, drop the date_histogram level and apply the terms + bucket_selector pair directly, constraining the range with a query, as sketched below.
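A minimal sketch of that whole-range variant (assumptions: the date field holds real dates, the gte/lte values and the terms size are illustrative, and a cardinality sub-aggregation replaces the second terms since only the venue count matters):
POST /sales/_search
{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "2020-01-01",
        "lte": "2020-01-06"
      }
    }
  },
  "aggs": {
    "customers": {
      "terms": {
        "field": "customer.keyword",
        "size": 10000
      },
      "aggs": {
        "venues": {
          "cardinality": {
            "field": "venue.keyword"
          }
        },
        "cross_selector": {
          "bucket_selector": {
            "buckets_path": {
              "venues_count": "venues"
            },
            "script": {
              "source": "params.venues_count > 1"
            }
          }
        }
      }
    }
  }
}
The buckets that survive the selector are the cross-venue customers; counting them happens client-side (or via the scripted_metric below, which returns the number directly).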
UPDATE:
Another approach involves using a scripted_metric aggregation in order to implement the logic yourself. It's a bit more involved and might not perform well depending on the number of documents and the hardware you have, but since it accumulates venues per customer across the whole range (exactly your Step 1), it yields the response 2 as you expect:
POST sales/_search
{
  "size": 0,
  "aggs": {
    "unique": {
      "scripted_metric": {
        "init_script": "params._agg.visits = new HashMap()",
        "map_script": "def cust = doc['customer.keyword'].value; def venue = doc['venue.keyword'].value; def venues = params._agg.visits.get(cust); if (venues == null) { venues = new HashSet(); } venues.add(venue); params._agg.visits.put(cust, venues)",
        "combine_script": "def merged = new HashMap(); for (v in params._agg.visits.entrySet()) { def cust = merged.get(v.key); if (cust == null) { merged.put(v.key, v.value) } else { cust.addAll(v.value); } } return merged",
        "reduce_script": "def merged = new HashMap(); for (agg in params._aggs) { for (v in agg.entrySet()) { def cust = merged.get(v.key); if (cust == null) { merged.put(v.key, v.value) } else { cust.addAll(v.value); } } } def unique = 0; for (m in merged.entrySet()) { if (m.value.size() > 1) unique++; } return unique"
      }
    }
  }
}
Response:
{
  "took": 1413,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 7,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "unique": {
      "value": 2
    }
  }
}
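Since the end goal is an HTTP API, here is a minimal wrapper sketch. Assumptions are loud here: Node.js 18+ with express (you mentioned Flask; the same wrapping applies there one-to-one), Elasticsearch on localhost:9200, the index and field names used above, and the shorter terms + cardinality + bucket_selector variant from earlier. Also note that the params._agg / params._aggs syntax in the scripted_metric above is the pre-7.0 form; on 7.x and later use the state / states variables instead.
// Hypothetical endpoint: GET /cross-visitors?from=2020-01-01&to=2020-01-06
// returns {"count": N} where N is the number of cross-venue customers.
const express = require('express');
const app = express();

app.get('/cross-visitors', async (req, res) => {
  const { from, to } = req.query; // the day range requested by the client
  const query = {
    size: 0,
    query: { range: { date: { gte: from, lte: to } } },
    aggs: {
      customers: {
        terms: { field: 'customer.keyword', size: 10000 }, // assumption: fits one page
        aggs: {
          venues: { cardinality: { field: 'venue.keyword' } },
          cross: {
            bucket_selector: {
              buckets_path: { v: 'venues' },
              script: { source: 'params.v > 1' }
            }
          }
        }
      }
    }
  };

  const resp = await fetch('http://localhost:9200/sales/_search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(query)
  });
  const body = await resp.json();
  // Only the buckets that passed the bucket_selector survive.
  res.json({ count: body.aggregations.customers.buckets.length });
});

app.listen(3000);
For very large customer counts the terms page size becomes the bottleneck; in that case prefer the scripted_metric query, which returns the count as a single value.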
I'm currently pulling every embedded object that matches the given filters from my MongoDB collection. I then sort these and .slice them to get the 10 objects the user actually needs, depending on which page they are on. It looks like this:
const data = await EachWay.aggregate([
  {
    $unwind: "$data"
  },
  {
    $match: filters
  },
  {
    $project: {
      "_id": 0,
      "bookmaker": 1,
      "sport": 1,
      "data": 1
    }
  }
])
const start_index = (current_page - 1) * posts_per_page // included in results
const end_index = current_page * posts_per_page // won't be included in results
const shortened = data.sort((a, b) => {
  if (Number(a.data.rating) < Number(b.data.rating)) {
    return 1
  } else if (Number(a.data.rating) > Number(b.data.rating)) {
    return -1
  } else {
    return 0
  }
}).slice(start_index, end_index).map(item => {
  let result = item.data
  result.bookmaker = item.bookmaker
  result.sport = item.sport
  return result
})
If the user is on page 3, this will get the 21st to the 30th post.
I feel like this is inefficient, as I'm having to pull every matching post from the database first; in some cases this can be 100,000 posts, and then I only end up sending the client 10 of them. Is there a way to sort the embedded objects and pull the 21st-30th items from the database before returning them?
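A sketch of the usual fix, under these assumptions: the same EachWay model, and data.rating possibly stored as a string (hence the $toDouble, available from MongoDB 4.0). Moving the sort and pagination into the pipeline with $sort, $skip and $limit means only one page of documents ever leaves the server:
const posts_per_page = 10                        // page size (assumption)
const skip = (current_page - 1) * posts_per_page // e.g. page 3 -> skip 20

const data = await EachWay.aggregate([
  { $unwind: "$data" },
  { $match: filters },
  // normalize the rating once so the sort is numeric, not lexicographic
  { $addFields: { "data.ratingNum": { $toDouble: "$data.rating" } } },
  { $sort: { "data.ratingNum": -1 } },
  { $skip: skip },                               // drop earlier pages server-side
  { $limit: posts_per_page },                    // keep just this page
  { $project: { _id: 0, bookmaker: 1, sport: 1, data: 1 } }
])
The in-memory .sort/.slice then disappears entirely; the .map reshaping can stay in application code as before.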
I am new to N1QL. I want to search all records in a bucket with "ABC" and replace it with "DEF". Can you please help me create this query and its index?
Sample records
{
  "userTypeNm": "pro",
  "userStateArray": [
    {
      "bindCd": "1591779772457",
      "name": "########",
      "state": "**ABC**",
      "ts": "1591779772457"
    }
  ],
  "vts": "1591779772457",
  "ets": "1591779772457",
  "daoObj": {
    "authDaObj": {
      "data": "eyJ0cmFuc2FjdGlvbklkIjoiVVNMT0dPTi0xN2U3YWQ5ZC0wN",
      "id": "829892839892"
    }
  }
}
CREATE INDEX ix1 ON default
(DISTINCT ARRAY v.state FOR v IN userStateArray END) WHERE userTypeNm = "pro";

UPDATE default AS d
SET usa.state = "DEF" FOR usa IN d.userStateArray WHEN usa.state = "ABC" END
WHERE ANY v IN d.userStateArray SATISFIES v.state = "ABC" END;
https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/update.html
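Briefly, how these two statements work together: the ANY ... SATISFIES predicate in the UPDATE's WHERE clause is the pattern that lets the planner use the array index ix1, so only documents actually containing a state of "ABC" are touched, while the SET ... FOR ... WHEN loop rewrites just the matching elements of userStateArray. One caveat: ix1 is a partial index (WHERE userTypeNm = "pro"), and a partial index is only eligible when the query repeats that predicate, so either add AND d.userTypeNm = "pro" to the UPDATE or create the index without the filter.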
I would like to display two time series of data as columns in the same "Rally.ui.chart.Chart". With the config below for "Rally.data.lookback.calculator.TimeSeriesCalculator", the columns are stacked on the same X column. Is there an easy way to group the data so it is shown side-by-side for the same date instead (like the "accepted" and "time remaining" columns in the iteration burn-down chart)?
Perhaps something like this?
getMetrics: function () {
  return [
    {
      "field": "TaskRemainingTotal",
      "as": "Hours Remaining",
      "f": "sum",
      "display": "column"
    },
    {
      "field": "PlanEstimate",
      "as": "Story Points Accepted",
      "f": "filteredSum",
      "filterField": "ScheduleState",
      "filterValues": ["Accepted", "Verified"],
      "display": "column",
      "group": "1" // ????? is there a specifier to separate this data?
    }
  ];
},
Here is the code for the calculator used for the burn chart:
https://github.com/RallyApps/app-catalog/blob/master/src/apps/charts/burndown/BurnDownCalculator.js
Writing calculators for generating charts from the Lookback API is probably the most difficult thing to do in the app platform, so kudos for tackling it!
I'm not an expert either, but hopefully the code above is enough to point you in the right direction. Please post back if you either solve it or run into a new issue.
I was able to get it to work by adding the following to the chartConfig:
plotOptions: {
  column: {
    stacking: null
  }
}
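(For context: Highcharts stacks column series when plotOptions.column.stacking is set to "normal" or "percent"; forcing it back to null restores the default unstacked layout, so each series gets its own column beside the others for the same date.)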
I've found some more on this subject that I believe may be helpful:
The "stack" member of the series config in highcharts allows a series to be stacked by name. We can create a much more flexible system that allow us to specify how to stack the data by using this and overriding some methods in the Rally.data.lookback.calculator.TimeSeriesCalculator to allow the series data to be modified.
prepareChartData returns series data, so we can override the output of that to add series data:
prepareChartData: function (store) {
  var snapshots = [];
  store.each(function (record) {
    snapshots.push(record.raw);
  });
  var a = this.runCalculation(snapshots);
  for (var k in a.series) {
    if (a.series[k].name.startsWith("Story")) a.series[k].stack = "Story";
  }
  return a;
}
We can also override the _buildSeriesConfig function to copy any properties listed in a seriesConfig object in the metric config into the generated series config. This lets us specify the series formatting in a nicer way and gives us much more power to modify other attributes of the chart config:
_buildSeriesConfig: function (calculatorConfig) {
  var aggregationConfig = [],
      metrics = calculatorConfig.metrics,
      derivedFieldsAfterSummary = calculatorConfig.deriveFieldsAfterSummary;
  for (var i = 0, ilength = metrics.length; i < ilength; i += 1) {
    var metric = metrics[i];
    var seriesConfig = {
      name: metric.as || metric.field,
      type: metric.display,
      dashStyle: metric.dashStyle || "Solid"
    };
    for (var k in metric.seriesConfig) {
      seriesConfig[k] = metric.seriesConfig[k];
    }
    aggregationConfig.push(seriesConfig);
  }
  for (var j = 0, jlength = derivedFieldsAfterSummary.length; j < jlength; j += 1) {
    var derivedField = derivedFieldsAfterSummary[j];
    var derivedSeriesConfig = {
      name: derivedField.as,
      type: derivedField.display,
      dashStyle: derivedField.dashStyle || "Solid"
    };
    for (var m in derivedField.seriesConfig) {
      derivedSeriesConfig[m] = derivedField.seriesConfig[m];
    }
    aggregationConfig.push(derivedSeriesConfig);
  }
  return aggregationConfig;
},
This allows us to supply a seriesConfig property in getMetrics like so:
getMetrics: function () {
  return [{
    "field": "TaskRemainingTotal",  // the field in the data to operate on
    "as": "Hours Remaining",        // the label to appear on the chart
    "f": "sum",                     // summing function to use
    "display": "column",            // how to display the point on the chart
    seriesConfig: {
      "stack": "Hours",
      "color": "#005eb8"
    }
  }, {
    "field": "PlanEstimate",        // the field in the data to operate on
    "as": "Story Points Accepted",  // the label to appear on the chart
    "f": "filteredSum",
    "filterField": "ScheduleState", // only use points in schedule state Accepted or Verified
    "filterValues": ["Accepted", "Verified"],
    "display": "column",
    seriesConfig: {
      "stack": "Points",
      "color": "#8dc63f"
    }
  }, {
    "field": "PlanEstimate",        // the field in the data to operate on
    "as": "Story Points Remaining", // the label to appear on the chart
    "f": "filteredSum",
    "filterField": "ScheduleState", // only use points not yet accepted
    "filterValues": ["Idea", "Defined", "In Progress", "Completed"],
    "display": "column",
    seriesConfig: {
      "stack": "Points",
      "color": "#c0c0c0"
    }
  }];
},
With option #2 we can control and add any series config data in the same context in which we configure the metrics, without worrying about configuration order. Option #2 is a little dangerous, though: the underscore implies the method is private, so there is no contractual guarantee it will remain compatible in future revisions. (Maybe the Rally folks will see this and extend the functionality for us.)
How can I manipulate one of the values in normalizePayload? (I need it anyway to convert result to sessions.)
I need start_time and end_time to be multiplied by 1000 in order to load smoothly into an attr('date').
"result": [
{
"end_time": 1412687629.42063,
"start_time": 1412687629.26851,
},
{
"end_time": 1412688377.15329,
"start_time": 1412688377.11507,
},
...
The current code I have is:
App.SessionSerializer = DS.ActiveModelSerializer.extend({
  normalizePayload: function (payload) {
    return {
      sessions: payload.result
    };
  }
});
Sorry I gave you bad info in the answer to your other question. The normalizePayload method should manipulate the payload hash directly:
App.SessionSerializer = DS.ActiveModelSerializer.extend({
  normalizePayload: function (payload) {
    payload.sessions = payload.result;
    delete payload.result;
    delete payload.metadata;
    return payload;
  }
});
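Since the original goal was also to scale the timestamps, the same hook can transform each record while renaming the key. A minimal sketch under the same assumptions (payload shape exactly as shown above):
App.SessionSerializer = DS.ActiveModelSerializer.extend({
  normalizePayload: function (payload) {
    // rename result -> sessions and convert epoch seconds to milliseconds
    payload.sessions = payload.result.map(function (session) {
      session.start_time = session.start_time * 1000;
      session.end_time = session.end_time * 1000;
      return session;
    });
    delete payload.result;
    delete payload.metadata;
    return payload;
  }
});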
How do I write the query below using the MongoDB C# driver?
SELECT SubSet.*
FROM ( SELECT T.ProductName,
              T.Price,
              ROW_NUMBER() OVER ( PARTITION BY T.ProductName ORDER BY T.ProductName ) AS ProductRepeat
       FROM myTable T
     ) SubSet
WHERE SubSet.ProductRepeat = 1
What I am trying to achieve is
Collection
ProductName|Price|SKU
Cap|10|AB123
Bag|5|ED567
Cap|20|CD345
Cap|5|EC123
Expected results is
ProductName|Price|SKU
Cap|10|AB123
Bag|5|ED567
Here is one attempt (please don't mind the object and field names):
public List<ProductOL> Search(ProductOL obj, bool topOneOnly)
{
    List<ProductOL> products = new List<ProductOL>();
    var database = MyMongoClient.Instance.OpenToRead(dbName: ConfigurationManager.AppSettings["MongoDBDefaultDB"]);
    var collection = database.GetCollection<RawBsonDocument>("Products");
    List<IMongoQuery> build = new List<IMongoQuery>();
    if (!string.IsNullOrEmpty(obj.ProductName))
    {
        var productNameQuery = Query.Matches("ProductName", new BsonRegularExpression(obj.ProductName, "i"));
        build.Add(productNameQuery);
    }
    if (!string.IsNullOrEmpty(obj.BrandName))
    {
        var brandNameQuery = Query.Matches("BrandName", new BsonRegularExpression(obj.BrandName, "i"));
        build.Add(brandNameQuery);
    }
    var fullQuery = Query.And(build.ToArray());
    products = collection.FindAs<ProductOL>(fullQuery).SetSortOrder(SortBy.Ascending("ProductName")).ToList();
    if (topOneOnly)
    {
        var tmpProducts = new List<ProductOL>();
        foreach (var item in products)
        {
            // keep only the first product seen per ProductName
            if (!tmpProducts.Any(x => x.ProductName == item.ProductName))
                tmpProducts.Add(item);
        }
        products = tmpProducts;
    }
    return products;
}
My mongo query works and gives me the right results, but it is not efficient when I am dealing with huge data sets, so I was wondering if MongoDB has any concepts like SQL Server's ROW_NUMBER() and partitioning.
If your query returns the expected results but isn't efficient, you should look into index usage with explain(). Given your query generation code includes conditional clauses, it seems likely you will need multiple indexes to efficiently cover common variations.
I'm not sure how the C# code you've provided relates to the original SQL query, as they seem to be entirely different. I'm also not clear how grouping is expected to help your query performance, aside from limiting the results returned.
Equivalent of the SQL query
There is no direct equivalent of ROW_NUMBER() .. PARTITION BY grouping in MongoDB, but you should be able to work out the desired result using either the Aggregation Framework (fastest) or Map/Reduce (slower but more functionality). The MongoDB manual includes an Aggregation Commands Comparison as well as usage examples.
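(One caveat worth adding: MongoDB 5.0 introduced window functions via the $setWindowFields stage, whose $documentNumber operator is a close analogue of ROW_NUMBER(). A sketch on the same test data, assuming a 5.0+ server; the rest of this answer targets older versions:)
// MongoDB 5.0+: ROW_NUMBER() OVER (PARTITION BY ProductName ORDER BY ProductName)
db.myTable.aggregate([
  { $setWindowFields: {
      partitionBy: "$ProductName",
      sortBy: { ProductName: 1 },   // tie-break with Price or SKU for stable results
      output: { ProductRepeat: { $documentNumber: {} } }
  }},
  { $match: { ProductRepeat: 1 } }, // "WHERE SubSet.ProductRepeat = 1"
  { $project: { _id: 0, ProductName: 1, Price: 1, SKU: 1 } }
])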
As an exercise in translation, I'll focus on your SQL query which is pulling out the first product match by ProductName:
SELECT SubSet.*
FROM ( SELECT T.ProductName,
              T.Price,
              ROW_NUMBER() OVER ( PARTITION BY T.ProductName ORDER BY T.ProductName ) AS ProductRepeat
       FROM myTable T
     ) SubSet
WHERE SubSet.ProductRepeat = 1
Setting up the test data you provided:
db.myTable.insert([
{ ProductName: 'Cap', Price: 10, SKU: 'AB123' },
{ ProductName: 'Bag', Price: 5, SKU: 'ED567' },
{ ProductName: 'Cap', Price: 20, SKU: 'CD345' },
{ ProductName: 'Cap', Price: 5, SKU: 'EC123' },
])
Here's an aggregation query in the mongo shell which will find the first match per group (ordered by ProductName). It should be straightforward to translate that aggregation query to the C# driver using the MongoCollection.Aggregate() method.
I've included comments with the rough equivalent SQL fragment in your original query.
db.myTable.aggregate(
  // Apply a sort order so the $first product is somewhat predictable
  // ("ORDER BY T.ProductName")
  { $sort: {
      ProductName: 1
      // Should really have an additional sort by Price or SKU (otherwise order may change)
  }},
  // Group by Product Name
  // ("PARTITION BY T.ProductName")
  { $group: {
      _id: "$ProductName",
      // Find first matching product details per group (can use $$CURRENT in MongoDB 2.6 or list specific fields)
      // "SELECT SubSet.* ... WHERE SubSet.ProductRepeat = 1"
      Price: { $first: "$Price" },
      SKU: { $first: "$SKU" }
  }},
  // Rename _id to match expected results
  { $project: {
      _id: 0,
      ProductName: "$_id",
      Price: 1,
      SKU: 1
  }}
)
Results given the test data appear to be what you were looking for:
{ "Price" : 10, "SKU" : "AB123", "ProductName" : "Cap" }
{ "Price" : 5, "SKU" : "ED567", "ProductName" : "Bag" }
Notes:
This aggregation query uses the $first operator, so if you want to find the second or third product per grouping you'd need a different approach (e.g. $group into an array and then take the subset of results needed in your application code)
If you want predictable results for finding the first item in a $group there should be more specific sort criteria than ProductName (for example, sorting by ProductName & Price or ProductName & SKU). Otherwise the order of results may change in future as documents are added or updated.
Thanks to @Stennie; with the help of his answer I could come up with the C# aggregation code:
var match = new BsonDocument
{
    { "$match", new BsonDocument
        {
            { "ProductName", new BsonRegularExpression("cap", "i") }
        }
    }
};
var group = new BsonDocument
{
    { "$group", new BsonDocument
        {
            { "_id", "$ProductName" },
            { "SKU", new BsonDocument { { "$first", "$SKU" } } }
        }
    }
};
var project = new BsonDocument
{
    { "$project", new BsonDocument
        {
            { "_id", 0 },
            { "ProductName", "$_id" },
            { "SKU", 1 }
        }
    }
};
var sort = new BsonDocument
{
    { "$sort", new BsonDocument
        {
            { "ProductName", 1 }
        }
    }
};
// Sort before $group so that $first picks a predictable document (as in the shell query above).
var pipeline = new[] { match, sort, group, project };
var aggResult = collection.Aggregate(pipeline);
var products = aggResult.ResultDocuments.Select(BsonSerializer.Deserialize<ProductOL>).ToList();
Using AggregateArgs:
AggregateArgs args = new AggregateArgs();
List<BsonDocument> pipelineStages = new List<BsonDocument>();
pipelineStages.Add(match);
pipelineStages.Add(sort);
pipelineStages.Add(group);
pipelineStages.Add(project);
args.Pipeline = pipelineStages;
var aggResult = collection.Aggregate(args);
products = aggResult.Select(BsonSerializer.Deserialize<ProductOL>).ToList();
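A closing note, hedged: the code above targets the legacy 1.x driver (MongoCollection, IMongoQuery, AggregateArgs). The 2.x C# driver replaces this with IMongoCollection&lt;T&gt;.Aggregate() and a fluent pipeline (Match/Sort/Group/Project), so on a newer driver prefer that API over hand-built BsonDocuments.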