Pig load array to mongo - mongodb

I would like to transform my output array:
I have the following code:
x = LOAD '$INPU'
USING PigStorage('\\u001')
AS (
product_id:chararray,
size:chararray
);
grouped = GROUP x BY (product_id);
sizes = FOREACH grouped {
sizes = DISTINCT $1.size;
GENERATE
$0 AS product_id,
sizes AS sizes;
}
output = foreach sizes generate
product_id as id,
sizes as sizes;
STORE output
INTO '$output'
USING com.mongodb.hadoop.pig.MongoInsertStorage('id');
This results in the following:
"product_id" :"123",
"sizes": [
{
"size": "X"
},
{
"size": "M"
},
{
"size": "L"
}
]
Is it possible to change the output to the following?
"product_id": "123",
"sizes": ["X", "M", "L"]
I have tried FLATTEN and BagToTuple but could not find a solution.
Thanks in advance.

You've probably already seen it, but this page on mongodb.hadoop explains in great detail, and with examples, how to use MongoInsertStorage (and also MongoUpdateStorage).
I have to admit, I didn't see an option there to do what you'd like; indeed, in their example they get a similar result to yours.
However, one thing that might work is to use MongoUpdateStorage to do upserts. I'm not sure it will work, but if you use a general query with no parameters on a new or empty collection, it may do the job. If you look at the bottom part of the page I linked, they explain how to get output that looks like this
{ "_id" : ObjectId("..."), "gender":"male", "age" : 19, "cars" : ["a", "b", "c"], "first" : "Daniel", "last" : "Alabi" }
instead of this
{ "_id" : ObjectId("..."), "gender":"male", "age" : 19, "cars" : [{"car": "a"}, {"car":"b"}, {"car":"c"}], "first" : "Daniel", "last" : "Alabi" }
(I'm referring to the change in the cars field).
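To make the difference between the two shapes concrete, here is a minimal sketch in plain JavaScript of the transformation being asked for. It is not part of Pig or mongo-hadoop; unwrapSingleFieldDocs is a hypothetical helper name, and the sketch assumes every array element is a document with exactly one field (which is how a bag of single-field tuples appears in the output shown in the question).

```javascript
// Hypothetical helper, not a Pig or mongo-hadoop API.
// Turns [{size: "X"}, {size: "M"}] into ["X", "M"].
function unwrapSingleFieldDocs(docs) {
  return docs.map((doc) => {
    const keys = Object.keys(doc);
    if (keys.length !== 1) {
      throw new Error("expected exactly one field per array element");
    }
    return doc[keys[0]];
  });
}

// The shape currently stored, taken from the question.
const stored = {
  product_id: "123",
  sizes: [{ size: "X" }, { size: "M" }, { size: "L" }],
};

// The shape the asker wants.
const wanted = { ...stored, sizes: unwrapSingleFieldDocs(stored.sizes) };
console.log(JSON.stringify(wanted));
// prints {"product_id":"123","sizes":["X","M","L"]}
```

The same reshaping could be done as a post-processing step on the collection if the storage function cannot be made to emit plain values.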
One last question - in your example, you change the name of product_id to id in your last foreach ... but in the output you showed, it still has the name product_id. Could it be you've been sending the wrong relation to MongoInsertStorage?
And, finally, another option is to save your collection as BSON and then use mongorestore on it - this option is also explained on that page.

Related

How to find nodes with an object that contains a string value

I'm struggling to create a find query that finds nodes that contain "Item1".
{
"_id" : ObjectId("589274f49bd4d562f0a15e07"),
"Value" : [["Item1", {
"Name" : "John",
"Age" : 45
}], ["Item2", {
"Address" : "123 Main St.",
"City" : "Hometown",
"State" : "ZZ"
}]]
}
In this example, "Item1" is not a key/value pair, but rather just a string that is part of an array that is part of a larger array. This is a legacy format so I can't adjust it unfortunately.
I've tried something like { Value: {$elemMatch: {$elemMatch: {"Item1"}}} }, but that does not return any matches. Similarly, $regex does not work, since it only seems to match string values (and the value here is not a string, but a string in an array in an array).
It seems like you should use the $in or $eq operator to match value.
So try this:
db.collection.find({'Value':{$elemMatch:{$elemMatch:{$in:['Item1']}}}})
Or run this to get the specific Item
db.collection.find({},{'Value':{$elemMatch:{$elemMatch:{$in:['Item1']}}}})
Hope this helps.
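For intuition about what the nested $elemMatch with $in is checking, here is a small plain-JavaScript model of the filter. It is only a sketch of the matching logic, not how the server evaluates it; matchesNestedIn is a hypothetical name, and the second document is made up for contrast.

```javascript
// Models {Value: {$elemMatch: {$elemMatch: {$in: wantedValues}}}}:
// some inner array of Value must contain some element equal to one of wantedValues.
function matchesNestedIn(doc, wantedValues) {
  return (
    Array.isArray(doc.Value) &&
    doc.Value.some(
      (inner) =>
        Array.isArray(inner) && inner.some((el) => wantedValues.includes(el))
    )
  );
}

const docs = [
  // Shape from the question (ids shortened for the sketch).
  { _id: 1, Value: [["Item1", { Name: "John", Age: 45 }], ["Item2", { City: "Hometown" }]] },
  // Made-up document that should not match.
  { _id: 2, Value: [["Item3", { Name: "Jane" }]] },
];

const hits = docs.filter((d) => matchesNestedIn(d, ["Item1"])).map((d) => d._id);
console.log(hits); // prints [ 1 ]
```

The original attempt fails because $elemMatch expects a query expression (such as {$in: [...]} or {$eq: ...}); a bare string inside braces is not one.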
var data = {
"_id":"ObjectId('589274f49bd4d562f0a15e07')",
"Value":[
[
"Item1",
{
"Name":"John",
"Age":45
}
],
[
"Item2",
{
"Address":"123 Main St.",
"City":"Hometown",
"State":"ZZ"
}
]
]
}
data.Value[0][0] // 'Item1'
Copy and paste this into a REPL and it works.
There was an error in the structure of your data.

Using $last on Mongo Aggregation Pipeline

I searched for similar questions but couldn't find any. Feel free to point me in their direction.
Say I have this data:
{ "_id" : ObjectId("5694c9eed4c65e923780f28e"), "name" : "foo1", "attr" : "foo" }
{ "_id" : ObjectId("5694ca3ad4c65e923780f290"), "name" : "foo2", "attr" : "foo" }
{ "_id" : ObjectId("5694ca47d4c65e923780f294"), "name" : "bar1", "attr" : "bar" }
{ "_id" : ObjectId("5694ca53d4c65e923780f296"), "name" : "bar2", "attr" : "bar" }
If I want to get the latest record for each attribute group, I can do this:
> db.content.aggregate({$group: {_id: '$attr', name: {$last: '$name'}}})
{ "_id" : "bar", "name" : "bar2" }
{ "_id" : "foo", "name" : "foo2" }
I would like to have my data grouped by attr and then sorted by _id so that only the latest record remains in each group, and that's how I can achieve this. BUT I need a way to avoid naming all the fields that I want in the result (in this example "name") because in my real use case they are not known ahead.
So, is there a way to achieve this, but without having to explicitly name each field using $last and just taking all fields instead? Of course, I would sort my data prior to grouping and I just need to somehow tell Mongo "take all values from the latest one".
See some possible options here:
Do multiple find().sort() queries for each of the attr values you want to search.
Grab the original _id of the $last doc, then do a findOne() for each of those values (this is the more extensible option).
Use the $$ROOT system variable as shown here.
This wouldn't be the quickest operation, but I assume you're using this more for analytics, not in response to a user behavior.
Edited to add slouc's example posted in comments:
db.content.aggregate({$group: {_id: '$attr', lastItem: { $last: "$$ROOT" }}})
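As a rough illustration of what sorting followed by a $group with $last: "$$ROOT" produces, here is a plain-JavaScript sketch. lastPerGroup is a hypothetical name, and the ObjectIds are represented as hex strings so they can be compared directly; that is an assumption of this sketch, not how the shell stores them.

```javascript
// Sort by _id, then keep the whole last document seen for each group key.
function lastPerGroup(docs, keyField) {
  const sorted = [...docs].sort((a, b) =>
    a._id < b._id ? -1 : a._id > b._id ? 1 : 0
  );
  const lastByKey = new Map();
  for (const doc of sorted) {
    lastByKey.set(doc[keyField], doc); // later documents overwrite earlier ones
  }
  return [...lastByKey].map(([key, doc]) => ({ _id: key, lastItem: doc }));
}

// Sample data from the question, with ObjectIds as plain hex strings.
const content = [
  { _id: "5694c9eed4c65e923780f28e", name: "foo1", attr: "foo" },
  { _id: "5694ca3ad4c65e923780f290", name: "foo2", attr: "foo" },
  { _id: "5694ca47d4c65e923780f294", name: "bar1", attr: "bar" },
  { _id: "5694ca53d4c65e923780f296", name: "bar2", attr: "bar" },
];

const latest = lastPerGroup(content, "attr");
console.log(latest.map((g) => `${g._id}: ${g.lastItem.name}`));
```

Because the whole document is kept under lastItem, no field names other than the grouping key need to be known in advance.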

Comparing documents between two MongoDB collections

I have two existing collections and need to populate a third collection based on the comparison between the two existing.
The two collections that need to be compared have the following schema:
// Settings collection:
{
"Identifier":"ABC123",
"C":"1",
"U":"V",
"Low":116,
"High":124,
"ImportLogId":1
}
// Data collection
{
"Identifier":"ABC123",
"C":"1",
"U":"V",
"Date":"11/6/2013 12AM",
"Value":128,
"ImportLogId": 1
}
I am new to MongoDB and NoSQL in general so I am having a tough time grasping how to do this. The SQL would look something like this:
SELECT s.Identifier, r.Value, r.U, r.C, r.Date
FROM Settings s
JOIN Data r
ON s.Identifier = r.Identifier
AND s.C = r.C
AND s.U = r.U
WHERE (r.Value <= s.Low OR r.Value >= s.High)
In this case using the sample data, I would want to return a record because the value from the Data collection is greater than the high value from the setting collection. Is this possible using Mongo queries or map reduce, or is this bad collection structure (i.e. maybe all of this should be in one collection)?
A few more additional notes:
The Settings collection should really only have 1 record per "Identifier". The Data collection will have many records per "Identifier". This process could potentially be scanning hundreds of thousands of documents at one time, so resource consideration is somewhat important
There is no good way to perform an operation like this in MongoDB. If you want a BAD way, you can use code like this:
db.settings.find().forEach(
function(doc) {
var data = db.data.find({
Identifier: doc.Identifier,
C: doc.C,
U: doc.U,
$or: [{Value: {$lte: doc.Low}}, {Value: {$gte: doc.High}}]
}).toArray();
// Do what you need
}
)
but don't expect it to perform even remotely as well as any decent RDBMS.
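If the comparison ends up in application code anyway, one way to avoid a query per settings document is to index the settings by their compound key and make a single pass over the data. The following is a minimal in-memory sketch under that assumption (settings small enough to hold in memory, one settings document per key); findOutOfRange is a hypothetical name and the second data document is made up for contrast.

```javascript
// Builds a lookup of settings by (Identifier, C, U), then filters the data in one pass.
function findOutOfRange(settings, data) {
  const keyOf = (d) => JSON.stringify([d.Identifier, d.C, d.U]);
  const settingsByKey = new Map(settings.map((s) => [keyOf(s), s]));
  return data.filter((d) => {
    const s = settingsByKey.get(keyOf(d));
    // Keep readings at or below Low, or at or above High, as in the SQL WHERE clause.
    return s !== undefined && (d.Value <= s.Low || d.Value >= s.High);
  });
}

const settings = [
  { Identifier: "ABC123", C: "1", U: "V", Low: 116, High: 124, ImportLogId: 1 },
];
const data = [
  { Identifier: "ABC123", C: "1", U: "V", Date: "11/6/2013 12AM", Value: 128, ImportLogId: 1 },
  // Made-up in-range reading that should be filtered out.
  { Identifier: "ABC123", C: "1", U: "V", Date: "11/6/2013 1AM", Value: 120, ImportLogId: 1 },
];

const flagged = findOutOfRange(settings, data);
console.log(flagged.map((d) => d.Value)); // prints [ 128 ]
```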
You could rebuild your schema and embed documents from data collection like this:
{
"_id" : ObjectId("527a7f4b07c17a1f8ad009d2"),
"Identifier" : "ABC123",
"C" : "1",
"U" : "V",
"Low" : 116,
"High" : 124,
"ImportLogId" : 1,
"Data" : [
{
"Date" : ISODate("2013-11-06T00:00:00Z"),
"Value" : 128
},
{
"Date" : ISODate("2013-10-09T00:00:00Z"),
"Value" : 99
}
]
}
It may work if the number of embedded documents is low, but to be honest, working with arrays of documents is far from a pleasant experience. Not to mention that you can easily hit the document size limit as the Data array grows.
If this kind of operation is typical for your application, I would consider using a different solution. As much as I like MongoDB, it works well only with certain types of data and access patterns.
Without the concept of JOIN, you must change your approach and denormalize.
In your case, it looks like you're doing data log validation. My advice is to loop over the settings collection and, for each document, use the findAndModify operator to set a validation flag on the matching records in the data collection; after that, you can just use the find operator on the data collection, filtering by the new flag.
Starting Mongo 4.4, we can achieve this type of "join" with the new $unionWith aggregation stage coupled with a classic $group stage:
// > db.settings.find()
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Low" : 116 }
// { "Identifier" : "DEF456", "C" : "1", "U" : "W", "Low" : 416 }
// { "Identifier" : "GHI789", "C" : "1", "U" : "W", "Low" : 142 }
// > db.data.find()
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Value" : 14 }
// { "Identifier" : "GHI789", "C" : "1", "U" : "W", "Value" : 43 }
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Value" : 45 }
// { "Identifier" : "DEF456", "C" : "1", "U" : "W", "Value" : 8 }
db.data.aggregate([
{ $unionWith: "settings" },
{ $group: {
_id: { Identifier: "$Identifier", C: "$C", U: "$U" },
Values: { $push: "$Value" },
Low: { $mergeObjects: { v: "$Low" } }
}},
{ $match: { "Low.v": { $lt: 150 } } },
{ $out: "result-collection" }
])
// > db.getCollection("result-collection").find()
// { _id: { Identifier: "ABC123", C: "1", U: "V" }, Values: [14, 45], Low: { v: 116 } }
// { _id: { Identifier: "GHI789", C: "1", U: "W" }, Values: [43], Low: { v: 142 } }
This:
Starts with a union of both collections into the pipeline via the new $unionWith stage.
Continues with a $group stage that:
Groups records based on Identifier, C and U
Accumulates Values into an array
Accumulates Lows via a $mergeObjects operation in order to get a value of Low that isn't null. Using $first wouldn't work, since it could pick up null first (for elements from the data collection), whereas $mergeObjects discards null values when merging an object containing a non-null value.
Then discards joined records whose Low value is 150 or more (an arbitrary threshold).
And finally outputs the resulting records to a third collection via an $out stage.
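The effect of the steps above can be modelled in plain JavaScript, which may help when reasoning about what ends up in each group. This is only a sketch of the pipeline's logic using the sample data from the comments; unionAndGroup is a hypothetical name, and the sketch treats a missing field as contributing nothing, which is how it approximates the $push and $mergeObjects behaviour described.

```javascript
// Models: $unionWith settings -> $group by (Identifier, C, U) -> $match Low.v < threshold.
function unionAndGroup(data, settings, threshold) {
  const groups = new Map();
  for (const doc of [...data, ...settings]) {
    const key = JSON.stringify([doc.Identifier, doc.C, doc.U]);
    if (!groups.has(key)) {
      groups.set(key, {
        _id: { Identifier: doc.Identifier, C: doc.C, U: doc.U },
        Values: [],
        Low: {},
      });
    }
    const group = groups.get(key);
    if (doc.Value !== undefined) group.Values.push(doc.Value); // data documents
    if (doc.Low !== undefined) group.Low = { ...group.Low, v: doc.Low }; // settings documents
  }
  return [...groups.values()].filter((g) => g.Low.v < threshold);
}

const settings = [
  { Identifier: "ABC123", C: "1", U: "V", Low: 116 },
  { Identifier: "DEF456", C: "1", U: "W", Low: 416 },
  { Identifier: "GHI789", C: "1", U: "W", Low: 142 },
];
const data = [
  { Identifier: "ABC123", C: "1", U: "V", Value: 14 },
  { Identifier: "GHI789", C: "1", U: "W", Value: 43 },
  { Identifier: "ABC123", C: "1", U: "V", Value: 45 },
  { Identifier: "DEF456", C: "1", U: "W", Value: 8 },
];

const result = unionAndGroup(data, settings, 150);
console.log(JSON.stringify(result));
```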
A feature we've developed called Data Compare & Sync might be able to help here.
It lets you compare two MongoDB collections and see the differences (e.g. spot the same, missing, or different fields).
You can then export these comparison results to a CSV file, and use that to create your new, third collection.
Disclosure: We are the creators of the MongoDB GUI, Studio 3T.

Mongo Array in Array Query

Given the below example record, how can I find all users that belong to at least one group from an arbitrary set of groups to query against? For example, find all users that belong to any one of the following groups - 1, 10, 43. I'm looking for a generalized solution. I know I can build out an or query but is there a more efficient way to handle this?
> db.users.findOne()
{
"_id" : ObjectId("508f477aca442be537000000"),
"name" : "Some Name",
"email" : "some#email.com",
"groups" : [
1,5,10
]
}
{ groups: {$in: [1, 10, 43]} }
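When $in is applied to an array field, a document matches if at least one element of the array equals at least one of the listed values, so no explicit $or is needed. A small plain-JavaScript sketch of that rule follows; inAnyGroup is a hypothetical name and the second user is made up for contrast.

```javascript
// Models {groups: {$in: wanted}} for an array-valued groups field.
function inAnyGroup(user, wanted) {
  const wantedSet = new Set(wanted);
  return user.groups.some((g) => wantedSet.has(g));
}

const users = [
  { name: "Some Name", groups: [1, 5, 10] }, // from the question
  { name: "Other Name", groups: [2, 7] },    // made-up non-matching user
];

const members = users.filter((u) => inAnyGroup(u, [1, 10, 43])).map((u) => u.name);
console.log(members); // prints [ 'Some Name' ]
```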

Suitability of MongoDB for hierarchical type queries

I have a particular data manipulation requirement that I have worked out how to do in SQL Server and PostgreSQL. However, I'm not too happy with the speed, so I am investigating MongoDB.
The best way to describe the query is as follows. Picture the hierarchical data of the USA: Country, State, County, City. Let's say a particular vendor can service the whole of California. Another can perhaps service only Los Angeles. There are potentially hundreds of thousands of vendors and they all can service from some point(s) in this hierarchy down. I am not confusing this with Geo - I am using this to illustrate the need.
Using recursive queries, it is quite simple to get a list of all vendors who could service a particular user. If he were in say Pasadena, Los Angeles, California, we would walk up the hierarchy to get the applicable IDs, then query back down to find the vendors.
I know this can be optimized. Again, this is just a simple query example.
I know MongoDB is a document store. That suits other needs I have very well. The question is how well suited is it to the query type I describe? (I know it doesn't have joins - those are simulated).
I get that this is a "how long is a piece of string" question. I just want to know if anyone has any experience with MongoDB doing this sort of thing. It could take me quite some time to go from 0 to tested, and I'm looking to save time if MongoDB is not suited to this.
EXAMPLE
A local movie store "A" can supply Blu-Rays in Springfield. A chain store "B" with state-wide distribution can supply Blu-Rays to all of IL. And a download-on-demand store "C" can supply to all of the US.
If we wanted to get all applicable movie suppliers for Springfield, IL, the answer would be [A, B, C].
In other words, there are numerous vendors attached at differing levels on the hierarchy.
I realize this question was asked nearly a year ago, but since then MongoDB has an officially supported solution for this problem, and I just used their solution. Refer to their documentation here: https://docs.mongodb.com/manual/tutorial/model-tree-structures-with-materialized-paths/
The concept relating closest to your question is named "partial path."
While it may feel a bit heavy to embed ancestor data, this approach is the most suitable way to solve your problem in MongoDB. The only pitfall I've experienced so far is that if you're storing all of this in a single document, you can hit the (as of this writing) 16MB document size limit when working with enough data (although I can only see this happening if you're using this structure to track user referrals [which could reach millions] rather than US cities [of which there are upwards of 26,000 according to the latest US Census]).
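To connect the path idea to the movie-store example from the question, here is a minimal sketch in plain JavaScript. It assumes each vendor stores the comma-separated path of the highest node it serves in a hypothetical serves field; the field name, the path format, and the function names are illustrative and differ slightly from the format in the MongoDB tutorial. Vendor D is made up for contrast.

```javascript
// "US,IL,Springfield" -> ["US", "US,IL", "US,IL,Springfield"]
function ancestorPaths(path) {
  const parts = path.split(",");
  return parts.map((_, i) => parts.slice(0, i + 1).join(","));
}

// A vendor applies if the node it serves is the location itself or one of its ancestors.
function vendorsFor(location, vendors) {
  const paths = new Set(ancestorPaths(location));
  return vendors.filter((v) => paths.has(v.serves)).map((v) => v.name);
}

const vendors = [
  { name: "A", serves: "US,IL,Springfield" }, // local store
  { name: "B", serves: "US,IL" },             // state-wide chain
  { name: "C", serves: "US" },                // nation-wide download store
  { name: "D", serves: "US,CA" },             // made-up vendor in another state
];

console.log(vendorsFor("US,IL,Springfield", vendors)); // prints [ 'A', 'B', 'C' ]
```

In a database, the set of ancestor paths would presumably be passed to an $in query on an indexed field, so the lookup is a single query rather than a recursive walk.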
References:
http://www.mongodb.org/display/DOCS/Schema+Design
http://www.census.gov/geo/www/gazetteer/places2k.html
Modifications:
Replaced link: http://www.mongodb.org/display/DOCS/Trees+in+MongoDB
Note that this question was also asked on the Google group. See http://groups.google.com/group/mongodb-user/browse_thread/thread/5cd5edd549813148 for that discussion.
One option is to use an array key. You can store the hierarchy as an array of values (for example ['US','CA','Los Angeles']). Then you can query against records based on individual elements in that array key.
For example:
First, store some documents with the array value representing the hierarchy
> db.hierarchical.save({ location: ['US','CA','LA'], name: 'foo'} )
> db.hierarchical.save({ location: ['US','CA','SF'], name: 'bar'} )
> db.hierarchical.save({ location: ['US','MA','BOS'], name: 'baz'} )
Make sure we have an index on the location field so we can perform fast queries against its values
> db.hierarchical.ensureIndex({'location':1})
Find all records in California
> db.hierarchical.find({location: 'CA'})
{ "_id" : ObjectId("4d9f69cbf88aea89d1492c55"), "location" : [ "US", "CA", "LA" ], "name" : "foo" }
{ "_id" : ObjectId("4d9f69dcf88aea89d1492c56"), "location" : [ "US", "CA", "SF" ], "name" : "bar" }
Find all records in Massachusetts
> db.hierarchical.find({location: 'MA'})
{ "_id" : ObjectId("4d9f6a21f88aea89d1492c5a"), "location" : [ "US", "MA", "BOS" ], "name" : "baz" }
Find all records in the US
> db.hierarchical.find({location: 'US'})
{ "_id" : ObjectId("4d9f69cbf88aea89d1492c55"), "location" : [ "US", "CA", "LA" ], "name" : "foo" }
{ "_id" : ObjectId("4d9f69dcf88aea89d1492c56"), "location" : [ "US", "CA", "SF" ], "name" : "bar" }
{ "_id" : ObjectId("4d9f6a21f88aea89d1492c5a"), "location" : [ "US", "MA", "BOS" ], "name" : "baz" }
Note that in this model, your values in the array would need to be unique. So for example, if you had 'springfield' in different states, then you would need to do some extra work to differentiate.
> db.hierarchical.save({location:['US','MA','Springfield'], name: 'one' })
> db.hierarchical.save({location:['US','IL','Springfield'], name: 'two' })
> db.hierarchical.find({location: 'Springfield'})
{ "_id" : ObjectId("4d9f6b7cf88aea89d1492c5b"), "location" : [ "US", "MA", "Springfield"], "name" : "one" }
{ "_id" : ObjectId("4d9f6b86f88aea89d1492c5c"), "location" : [ "US", "IL", "Springfield"], "name" : "two" }
You can overcome this by using the $all operator and specifying more levels of the hierarchy. For example:
> db.hierarchical.find({location: { $all : ['US','MA','Springfield']} })
{ "_id" : ObjectId("4d9f6b7cf88aea89d1492c5b"), "location" : [ "US", "MA", "Springfield"], "name" : "one" }
> db.hierarchical.find({location: { $all : ['US','IL','Springfield']} })
{ "_id" : ObjectId("4d9f6b86f88aea89d1492c5c"), "location" : [ "US", "IL", "Springfield"], "name" : "two" }
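The difference between matching a single element and matching with $all can be sketched in plain JavaScript. This only models the matching rule, not the index usage; matchesAll is a hypothetical name, and the documents mirror the Springfield examples above.

```javascript
// Models {location: {$all: required}}: every required value must appear in the array.
// Note that the order of the values is not checked.
function matchesAll(doc, required) {
  return required.every((value) => doc.location.includes(value));
}

const places = [
  { location: ["US", "MA", "Springfield"], name: "one" },
  { location: ["US", "IL", "Springfield"], name: "two" },
];

// A single value is ambiguous: both documents match.
const ambiguous = places.filter((d) => matchesAll(d, ["Springfield"])).map((d) => d.name);
// Adding more levels of the hierarchy narrows it down.
const illinois = places.filter((d) => matchesAll(d, ["US", "IL", "Springfield"])).map((d) => d.name);

console.log(ambiguous); // prints [ 'one', 'two' ]
console.log(illinois);  // prints [ 'two' ]
```

Because order is ignored, this approach still relies on values being unique across levels; a state and a city sharing a name would need extra disambiguation.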