Creating subtables in MongoDB using Kettle - mongodb

I have two PostgreSQL tables with the following data:
houses:
-# select * from houses;
id | address
----+----------------
1 | 123 Main Ave.
2 | 456 Elm St.
3 | 789 County Rd.
(3 rows)
and people:
-# select * from people;
id | name | house_id
----+-------+----------
1 | Fred | 1
2 | Jane | 1
3 | Bob | 1
4 | Mary | 2
5 | John | 2
6 | Susan | 2
7 | Bill | 3
8 | Nancy | 3
9 | Adam | 3
(9 rows)
In Spoon I have two Table Input steps. The first is named House Input and uses the SQL:
SELECT
id
, address
FROM houses
ORDER BY id;
The second table input is named People Input with the SQL:
SELECT
"name"
, house_id
FROM people
ORDER BY house_id;
Both table inputs feed a Merge Join that uses House Input as the first step with a key of id and People Input as the second step with a key of house_id.
The Merge Join then feeds a MongoDB Output step with the database demo, the collection houses, and the Mongo document fields address and name (I am expecting MongoDB to assign the _id).
When I run the transformation and type db.houses.find(); from a Mongo shell, I get:
{ "_id" : ObjectId("52083706b251cc4be9813153"), "address" : "123 Main Ave.", "name" : "Fred" }
{ "_id" : ObjectId("52083706b251cc4be9813154"), "address" : "123 Main Ave.", "name" : "Jane" }
{ "_id" : ObjectId("52083706b251cc4be9813155"), "address" : "123 Main Ave.", "name" : "Bob" }
{ "_id" : ObjectId("52083706b251cc4be9813156"), "address" : "456 Elm St.", "name" : "Mary" }
{ "_id" : ObjectId("52083706b251cc4be9813157"), "address" : "456 Elm St.", "name" : "John" }
{ "_id" : ObjectId("52083706b251cc4be9813158"), "address" : "456 Elm St.", "name" : "Susan" }
{ "_id" : ObjectId("52083706b251cc4be9813159"), "address" : "789 County Rd.", "name" : "Bill" }
{ "_id" : ObjectId("52083706b251cc4be981315a"), "address" : "789 County Rd.", "name" : "Nancy" }
{ "_id" : ObjectId("52083706b251cc4be981315b"), "address" : "789 County Rd.", "name" : "Adam" }
What I want to get is something like:
{ "_id" : ObjectId("52083706b251cc4be9813153"), "address" : "123 Main Ave.", "people" : [
{ "_id" : ObjectId("52083706b251cc4be9813154"), "name" : "Fred"} ,
{ "_id" : ObjectId("52083706b251cc4be9813155"), "name" : "Jane" } ,
{ "_id" : ObjectId("52083706b251cc4be9813155"), "name" : "Bob" }
]
},
{ "_id" : ObjectId("52083706b251cc4be9813156"), "address" : "456 Elm St.", "people" : [
{ "_id" : ObjectId("52083706b251cc4be9813157"), "name" : "Mary"} ,
{ "_id" : ObjectId("52083706b251cc4be9813158"), "name" : "John" } ,
{ "_id" : ObjectId("52083706b251cc4be9813159"), "name" : "Susan" }
]
},
{ "_id" : ObjectId("52083706b251cc4be981315a"), "address" : "789 County Rd.", "people" : [
{ "_id" : ObjectId("52083706b251cc4be981315b"), "name" : "Bill"} ,
{ "_id" : ObjectId("52083706b251cc4be981315c"), "name" : "Nancy" } ,
{ "_id" : ObjectId("52083706b251cc4be981315d"), "name" : "Adam" }
]
}
I know why I am getting what I am getting, but can't seem to find anything online or in the examples to get me where I want to be.
I was hoping someone could nudge me in the right direction, point to an example that is closer to what I am trying to accomplish, or tell me that this is out of scope for what Kettle is supposed to do (Hopefully not the latter).

Turns out creating subtables is all in the MongoDB Output step.
First make sure that you have the Upsert and Modifier update checked on the Configure connection tab.
Then, on the Mongo document fields tab, enter the following (the first line holds the column names):
Name    | Mongo document Path | Use field name | Match field for upsert | Modifier operation | Modifier policy
--------+---------------------+----------------+------------------------+--------------------+----------------
address |                     | Y              | N                      | N/A                | Insert
address |                     | Y              | Y                      | N/A                | Insert
name    | people[0]           | Y              | N                      | $set               | Insert
name    | people[1]           | Y              | N                      | $push              | Update
Now when I run db.houses.find(); I get:
{ "_id" : ObjectId("520ccb8978d96b204daa029d"), "address" : "123 Main Ave.", "people" : [ { "name" : "Fred" }, { "name" : "Jane" }, { "name" : "Bob" } ] }
{ "_id" : ObjectId("520ccb8978d96b204daa029e"), "address" : "456 Elm St.", "people" : [ { "name" : "Mary" }, { "name" : "John" }, { "name" : "Susan" } ] }
{ "_id" : ObjectId("520ccb8a78d96b204daa029f"), "address" : "789 County Rd.", "people" : [ { "name" : "Bill" }, { "name" : "Nancy" }, { "name" : "Adam" } ] }
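To make the effect of the upsert plus $push settings easier to picture, here is a minimal sketch in plain JavaScript. It assumes the step behaves roughly like one db.houses.updateOne({ address }, { $push: { people: { name } } }, { upsert: true }) call per joined row; that is my reading of the result above, not something taken from the Kettle documentation. The houses array and the upsertPush helper are hypothetical stand-ins, so no MongoDB server is needed to run it.

```javascript
// In-memory stand-in for the "houses" collection (hypothetical, for illustration only).
const houses = [];

// Assumed equivalent of:
//   db.houses.updateOne({ address }, { $push: { people: { name } } }, { upsert: true })
function upsertPush(collection, address, name) {
  let doc = collection.find((d) => d.address === address);
  if (!doc) {
    // Upsert: no document matched the address, so create one.
    doc = { address, people: [] };
    collection.push(doc);
  }
  // $push: append the person as an embedded document.
  doc.people.push({ name });
}

// Rows as they leave the Merge Join step in the question.
const joinedRows = [
  { address: "123 Main Ave.", name: "Fred" },
  { address: "123 Main Ave.", name: "Jane" },
  { address: "123 Main Ave.", name: "Bob" },
  { address: "456 Elm St.", name: "Mary" },
  { address: "456 Elm St.", name: "John" },
  { address: "456 Elm St.", name: "Susan" },
  { address: "789 County Rd.", name: "Bill" },
  { address: "789 County Rd.", name: "Nancy" },
  { address: "789 County Rd.", name: "Adam" },
];

for (const row of joinedRows) {
  upsertPush(houses, row.address, row.name);
}

console.log(JSON.stringify(houses, null, 2));
```

Because the match is on address alone, two houses sharing an address would be merged into one document, which is why the uniqueness caveat matters.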
Two things I would like to note:
This assumes that my addresses are unique and that names are unique within a house. If that is not the case, I would need to map the ids from my OLTP tables to id (not _id) fields in MongoDB and set Match field for upsert on the house id.
As @G Gordon Worley III pointed out above, if these two tables are in the same database, I could do the join in the Table Input step, and this would be a two-step transformation (and faster).

Related

Scala Elasticsearch query with multiple parameters

I need to delete certain entries from an Elasticsearch table. I cannot find any hints in the documentation, and I'm also an Elasticsearch noob. The rows to be deleted are identified by their type and an owner_id. Is it possible to call deleteByQuery with multiple parameters, or is there an alternative that achieves the same thing?
I'm using this library: https://github.com/sksamuel/elastic4s
What the table looks like:
| id | type | owner_id | cost |
|------------------------------|
| 1 | house | 1 | 10 |
| 2 | hut | 1 | 3 |
| 3 | house | 2 | 16 |
| 4 | house | 1 | 11 |
In the code it looks like this currently:
deleteByQuery(someIndex, matchQuery("type", "house"))
and I would need something like this:
deleteByQuery(someIndex, matchQuery("type", "house"), matchQuery("owner_id", 1))
But this won't work since deleteByQuery only accepts a single Query.
In this example it should delete the entries with id 1 and 4.
A single bool query can combine both conditions, so it fits wherever exactly one query is expected. Explaining it in JSON and REST API format to make it clearer.
Index sample documents
put myindex/_doc/1
{
"type" : "house",
"owner_id" :1
}
put myindex/_doc/2
{
"type" : "hut",
"owner_id" :1
}
put myindex/_doc/3
{
"type" : "house",
"owner_id" :2
}
put myindex/_doc/4
{
"type" : "house",
"owner_id" :1
}
Search using the boolean query
GET myindex/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "type": "house"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "owner_id": 1
          }
        }
      ]
    }
  }
}
And query result
"hits" : [
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.35667494,
"_source" : {
"type" : "house",
"owner_id" : 1
}
},
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "4",
"_score" : 0.35667494,
"_source" : {
"type" : "house",
"owner_id" : 1
}
}
]
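The search above only previews the matching documents. Elasticsearch's _delete_by_query endpoint (POST myindex/_delete_by_query) accepts the same "query" body, so the deletion can reuse it unchanged. The sketch below, in plain JavaScript, builds that body and runs a simplified local preview of which sample documents it would select. The helper names buildDeleteByQueryBody and previewMatches are hypothetical, and the local preview uses exact equality rather than real Elasticsearch analysis and scoring.

```javascript
// Builds the request body for POST <index>/_delete_by_query.
// It is the same bool query used in the search above.
function buildDeleteByQueryBody(type, ownerId) {
  return {
    query: {
      bool: {
        must: [{ match: { type: type } }],
        filter: [{ term: { owner_id: ownerId } }],
      },
    },
  };
}

// Local stand-in for the four sample documents indexed above.
const sampleDocs = [
  { id: 1, type: "house", owner_id: 1 },
  { id: 2, type: "hut", owner_id: 1 },
  { id: 3, type: "house", owner_id: 2 },
  { id: 4, type: "house", owner_id: 1 },
];

// Simplified preview: treats match and term as exact equality checks.
function previewMatches(docs, body) {
  const type = body.query.bool.must[0].match.type;
  const ownerId = body.query.bool.filter[0].term.owner_id;
  return docs
    .filter((d) => d.type === type && d.owner_id === ownerId)
    .map((d) => d.id);
}

const body = buildDeleteByQueryBody("house", 1);
console.log(JSON.stringify(body));
console.log(previewMatches(sampleDocs, body));
```

Running the search first, as the answer does, is a cheap way to confirm the selection before sending the same query to the delete endpoint.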

mongodb count value of one field based on other

I have a collection named genre_collection of following structure :
user | genres
----------------
1 | comedy
1 | action
1 | thriller
1 | comedy
1 | action
2 | war
2 | adventure
2 | war
2 | thriller
I'm trying to find the count for each genre for each user i.e. my ideal final result would be something like this :
1 | comedy |2
1 | action |2
1 | thriller |1
2 | war |2
2 | adventure |1
2 | thriller |1
Any help would be much appreciated.
You can do this with aggregation using $group.
Try this:
db.genre_collection.aggregate([
  {
    $group: {
      _id: {
        genre: "$genres",
        user: "$user"
      },
      count: {
        $sum: 1
      }
    }
  }
])
output:
{ "_id" : { "genre" : "adventure", "user" : 2 }, "count" : 1 }
{ "_id" : { "genre" : "action", "user" : 1 }, "count" : 2 }
{ "_id" : { "genre" : "thriller", "user" : 2 }, "count" : 1 }
{ "_id" : { "genre" : "war", "user" : 2 }, "count" : 2 }
{ "_id" : { "genre" : "comedy", "user" : 1 }, "count" : 2 }
{ "_id" : { "genre" : "thriller", "user" : 1 }, "count" : 1 }
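For anyone who wants to check the counts without a running server, the grouping logic can be sketched in plain JavaScript. This is only an illustration of what a $group on the compound key (user, genres) with $sum: 1 computes; the countByUserAndGenre helper is hypothetical and the array stands in for genre_collection.

```javascript
// Stand-in for genre_collection, using the rows from the question.
const genreCollection = [
  { user: 1, genres: "comedy" },
  { user: 1, genres: "action" },
  { user: 1, genres: "thriller" },
  { user: 1, genres: "comedy" },
  { user: 1, genres: "action" },
  { user: 2, genres: "war" },
  { user: 2, genres: "adventure" },
  { user: 2, genres: "war" },
  { user: 2, genres: "thriller" },
];

// Counts documents per (user, genre) pair, like $group with $sum: 1.
function countByUserAndGenre(docs) {
  const counts = new Map();
  for (const { user, genres } of docs) {
    // Serialize the compound key so it can be used as a Map key.
    const key = JSON.stringify([user, genres]);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts].map(([key, count]) => {
    const [user, genre] = JSON.parse(key);
    return { user, genre, count };
  });
}

const genreCounts = countByUserAndGenre(genreCollection);
console.log(genreCounts);
```

Unlike the aggregation output, this version returns flat rows in first-seen order, which is closer to the table layout asked for in the question.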
Try this:
db.genre_collection.aggregate([
  // Counts each genre across all users; add user: "$user" to _id for per-user counts.
  {"$group" : {_id:{genres:"$genres"}, count:{$sum:1}}}
])
Hope it helps!

Group across two columns in Mongo

Say in mongo I have a collection that looks like this:
+----+-----+-----+----------+
| id | x | y | quantity |
+----+-----+-----+----------+
| 1 | abc | jkl | 5 |
+----+-----+-----+----------+
| 2 | jkl | xyz | 10 |
+----+-----+-----+----------+
| 3 | xyz | abc | 20 |
+----+-----+-----+----------+
I want to do a $group where x equals y and sum up the quantity. So the output would look like:
+-----+-------+
| x | total |
+-----+-------+
| abc | 25 |
+-----+-------+
| jkl | 15 |
+-----+-------+
| xyz | 30 |
+-----+-------+
Is this even possible to do in mongo?
You won't be performing a $group to retrieve the results. You're performing a $lookup. This feature is new in MongoDB 3.2.
Using the sample data you provided, the aggregation would be the following:
db.join.aggregate( [
  {
    "$lookup" : {
      "from" : "join",
      "localField" : "x",
      "foreignField" : "y",
      "as" : "matching_field"
    }
  },
  {
    "$unwind" : "$matching_field"
  },
  {
    "$project" : {
      "_id" : 0,
      "x" : 1,
      "total" : { "$sum" : [ "$quantity", "$matching_field.quantity"]}
    }
  }
])
The sample data set is pretty simple, so you'll need to test the behavior when more than a single result is returned for a value, and so on.
Edit:
It gets more complicated if there can be more than a single match between X and Y.
// Add document to return more than a single match for abc
db.join.insert( { "x" : "123", "y" : "abc", "quantity" : 100 })
// Had to add $group stage to consolidate matched results
db.join.aggregate( [
  {
    "$lookup" : {
      "from" : "join",
      "localField" : "x",
      "foreignField" : "y",
      "as" : "matching_field"
    }
  },
  {
    "$unwind" : "$matching_field"
  },
  {
    "$group" : {
      "_id" : { "x" : "$x", "quantity" : "$quantity" },
      "matched_quantities" : { "$sum" : "$matching_field.quantity" }
    }
  },
  {
    "$project" : {
      "x" : "$_id.x",
      "total" : { "$sum" : [ "$_id.quantity", "$matched_quantities" ]}
    }
  }
])
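To sanity-check the expected totals, here is a rough plain-JavaScript sketch of what the self-join computes for each document: its own quantity plus the quantities of every document whose y equals its x. The totalsByX helper is hypothetical, and it works per document, so it does not reproduce the merging that the $group key { x, quantity } would do for two documents sharing both values.

```javascript
// Stand-in for the "join" collection from the question.
const joinDocs = [
  { x: "abc", y: "jkl", quantity: 5 },
  { x: "jkl", y: "xyz", quantity: 10 },
  { x: "xyz", y: "abc", quantity: 20 },
];

function totalsByX(docs) {
  const out = [];
  for (const doc of docs) {
    // Like $lookup: find documents whose y equals this document's x.
    const matches = docs.filter((other) => other.y === doc.x);
    // Like $unwind on an empty array: documents without a match are dropped.
    if (matches.length === 0) continue;
    const matched = matches.reduce((sum, m) => sum + m.quantity, 0);
    out.push({ x: doc.x, total: doc.quantity + matched });
  }
  return out;
}

const simpleTotals = totalsByX(joinDocs);
console.log(simpleTotals);

// The extra document from the edit above gives "abc" a second match.
const withExtra = totalsByX([...joinDocs, { x: "123", y: "abc", quantity: 100 }]);
console.log(withExtra);
```

Note that the extra document itself disappears from the result because nothing has y equal to "123", which mirrors how $unwind discards documents with an empty matching_field array.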

MongoDB aggregation grouping similar documents next to each other

I have a MongoDB collection to which a document with sampling data is added every day. I want to observe how the fields change.
I want to use MongoDB aggregation to collapse each run of adjacent, similar items into the first item of the run:
+--+-------------------------+
|id|field | date |
+--+-------------------------+
| 1|hello | date1|
+--+-------------------------+
| 2|foobar | date2| \_ Condense these into one row with date2
+--+-------------------------+ /
| 3|foobar | date3|
+--+-------------------------+
| 4|hello | date4|
+--+-------------------------+
| 5|world | date5| \__ Condense these into a row with date5
+--+-------------------------+ /
| 6|world | date6|
+--+-------------------------+
| 7|puppies | date7|
+--+-------------------------+
| 8|kittens | date8| \__ Condense these into a row with date8
+--+-------------------------+ /
| 9|kittens | date9|
+--+-------------------------+
Is it possible to create a mongoDB aggregation for this problem?
Here is an answer to a similar problem in MySQL:
Grouping similar rows next to each other in MySQL
Sample Data
Data are already sorted by date.
These documents:
{ "_id" : "566ee064d56d02e854df756e", "date" : "2015-12-14T15:29:40.432Z", "score" : 59 },
{ "_id" : "566a8c70520d55771f2e9871", "date" : "2015-12-11T08:42:23.880Z", "score" : 60 },
{ "_id" : "566932f5572bd1720db7a4ef", "date" : "2015-12-10T08:08:21.514Z", "score" : 60 },
{ "_id" : "5667e652c021206f34e2c9e4", "date" : "2015-12-09T08:29:06.696Z", "score" : 60 },
{ "_id" : "5666a468cc45e9d9a82b81c9", "date" : "2015-12-08T09:35:35.837Z", "score" : 61 },
{ "_id" : "56653fe099799049b66dab97", "date" : "2015-12-07T08:14:24.494Z", "score" : 60 },
{ "_id" : "5663f6b3b7d0b00b74d9fdf9", "date" : "2015-12-06T08:49:55.299Z", "score" : 60 },
{ "_id" : "56629fb56099dfe31b0c72be", "date" : "2015-12-05T08:26:29.510Z", "score" : 60 }
should group to:
{ "_id" : "566ee064d56d02e854df756e", "date" : "2015-12-14T15:29:40.432Z", "score" : 59 }
{ "_id" : "566a8c70520d55771f2e9871", "date" : "2015-12-11T08:42:23.880Z", "score" : 60 }
{ "_id" : "5666a468cc45e9d9a82b81c9", "date" : "2015-12-08T09:35:35.837Z", "score" : 61 }
{ "_id" : "56653fe099799049b66dab97", "date" : "2015-12-07T08:14:24.494Z", "score" : 60 }
If you don't insist on using the aggregation framework, this could be done by iterating over the cursor and comparing each document to the previous one:
var cursor = db.test.find().sort({date:-1}).toArray();
var result = [];
result.push(cursor[0]); //first document must be saved
for(var i = 1; i < cursor.length; i++) {
  if (cursor[i].score != cursor[i-1].score) {
    result.push(cursor[i]);
  }
}
result:
[
{
"_id" : "566ee064d56d02e854df756e",
"date" : "2015-12-14T15:29:40.432Z",
"score" : 59
},
{
"_id" : "566a8c70520d55771f2e9871",
"date" : "2015-12-11T08:42:23.880Z",
"score" : 60
},
{
"_id" : "5666a468cc45e9d9a82b81c9",
"date" : "2015-12-08T09:35:35.837Z",
"score" : 61
},
{
"_id" : "56653fe099799049b66dab97",
"date" : "2015-12-07T08:14:24.494Z",
"score" : 60
}
]
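The same idea can be wrapped in a small reusable function and tried without a database. The following is a minimal sketch in plain JavaScript; collapseRuns is a hypothetical helper name, the input is assumed to be already sorted by date, and unlike the loop above it also copes with an empty input.

```javascript
// Keeps the first document of every run of consecutive documents
// that share the same value for `key`.
function collapseRuns(docs, key) {
  const result = [];
  let previous; // undefined until the first document has been seen
  for (const doc of docs) {
    if (previous === undefined || previous[key] !== doc[key]) {
      result.push(doc);
    }
    previous = doc;
  }
  return result;
}

// Sample data from the question, already sorted by date (newest first).
const samples = [
  { _id: "566ee064d56d02e854df756e", date: "2015-12-14T15:29:40.432Z", score: 59 },
  { _id: "566a8c70520d55771f2e9871", date: "2015-12-11T08:42:23.880Z", score: 60 },
  { _id: "566932f5572bd1720db7a4ef", date: "2015-12-10T08:08:21.514Z", score: 60 },
  { _id: "5667e652c021206f34e2c9e4", date: "2015-12-09T08:29:06.696Z", score: 60 },
  { _id: "5666a468cc45e9d9a82b81c9", date: "2015-12-08T09:35:35.837Z", score: 61 },
  { _id: "56653fe099799049b66dab97", date: "2015-12-07T08:14:24.494Z", score: 60 },
  { _id: "5663f6b3b7d0b00b74d9fdf9", date: "2015-12-06T08:49:55.299Z", score: 60 },
  { _id: "56629fb56099dfe31b0c72be", date: "2015-12-05T08:26:29.510Z", score: 60 },
];

const collapsed = collapseRuns(samples, "score");
console.log(collapsed.map((d) => d._id));
```

In the mongo shell the input could come from db.test.find().sort({date:-1}).toArray(), as in the answer above.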

mongodb get distinct records

I am using mongoDB in which I have collection of following format.
{"id" : 1, "name" : "x", "ttm" : 23, "val" : 5 }
{"id" : 1, "name" : "x", "ttm" : 34, "val" : 1 }
{"id" : 1, "name" : "x", "ttm" : 24, "val" : 2 }
{"id" : 2, "name" : "x", "ttm" : 56, "val" : 3 }
{"id" : 2, "name" : "x", "ttm" : 76, "val" : 3 }
{"id" : 3, "name" : "x", "ttm" : 54, "val" : 7 }
On that collection I have queried to get records in descending order like this:
db.foo.find({"id" : {"$in" : [1,2,3]}}).sort({ttm : -1}).limit(3)
But this returns two records with id = 1, and I want exactly one record per id.
Is it possible in mongodb?
There is a distinct command in mongodb, that can be used in conjunction with a query. However, I believe this just returns a distinct list of values for a specific key you name (i.e. in your case, you'd only get the id values returned) so I'm not sure this will give you exactly what you want if you need the whole documents - you may require MapReduce instead.
Documentation on distinct:
http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Distinct
You want to use aggregation. You could do that like this:
db.test.aggregate([
  // Each object in this array is one pipeline stage.
  {
    $group: {
      originalId: {$first: '$_id'}, // Hold onto original ID.
      _id: '$id', // Set the unique identifier
      val: {$first: '$val'},
      name: {$first: '$name'},
      ttm: {$first: '$ttm'}
    }
  }, {
    // This stage receives the output of the $group stage,
    // so the (originally) non-unique 'id' field is now
    // present as the _id field. We want to rename it.
    $project: {
      _id : '$originalId', // Restore original ID.
      id : '$_id', // Expose the grouping key as 'id' again.
      val : '$val',
      name: '$name',
      ttm : '$ttm'
    }
  }
])
This will be very fast... ~90ms for my test DB of 100,000 documents.
Example:
db.test.find()
// { "_id" : ObjectId("55fb595b241fee91ac4cd881"), "id" : 1, "name" : "x", "ttm" : 23, "val" : 5 }
// { "_id" : ObjectId("55fb596d241fee91ac4cd882"), "id" : 1, "name" : "x", "ttm" : 34, "val" : 1 }
// { "_id" : ObjectId("55fb59c8241fee91ac4cd883"), "id" : 1, "name" : "x", "ttm" : 24, "val" : 2 }
// { "_id" : ObjectId("55fb59d9241fee91ac4cd884"), "id" : 2, "name" : "x", "ttm" : 56, "val" : 3 }
// { "_id" : ObjectId("55fb59e7241fee91ac4cd885"), "id" : 2, "name" : "x", "ttm" : 76, "val" : 3 }
// { "_id" : ObjectId("55fb59f9241fee91ac4cd886"), "id" : 3, "name" : "x", "ttm" : 54, "val" : 7 }
db.test.aggregate(/* from first code snippet */)
// output
{
"result" : [
{
"_id" : ObjectId("55fb59f9241fee91ac4cd886"),
"val" : 7,
"name" : "x",
"ttm" : 54,
"id" : 3
},
{
"_id" : ObjectId("55fb59d9241fee91ac4cd884"),
"val" : 3,
"name" : "x",
"ttm" : 56,
"id" : 2
},
{
"_id" : ObjectId("55fb595b241fee91ac4cd881"),
"val" : 5,
"name" : "x",
"ttm" : 23,
"id" : 1
}
],
"ok" : 1
}
PROS: Almost certainly the fastest method.
CONS: Involves use of the complicated Aggregation API. Also, it is tightly coupled to the original schema of the document. Though, it may be possible to generalize this.
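One thing worth keeping in mind: $first simply takes whichever document reaches the group first, so to get the document with the highest ttm per id (the question sorts by ttm descending), a { $sort: { ttm: -1 } } stage would have to come before $group. The logic of "sort, then keep the first per id" can be sketched in plain JavaScript as follows; latestPerId is a hypothetical helper and the array stands in for the collection.

```javascript
// Stand-in for the collection from the question.
const records = [
  { id: 1, name: "x", ttm: 23, val: 5 },
  { id: 1, name: "x", ttm: 34, val: 1 },
  { id: 1, name: "x", ttm: 24, val: 2 },
  { id: 2, name: "x", ttm: 56, val: 3 },
  { id: 2, name: "x", ttm: 76, val: 3 },
  { id: 3, name: "x", ttm: 54, val: 7 },
];

// Sorts by ttm descending, then keeps the first document seen for each id.
function latestPerId(docs) {
  const sorted = [...docs].sort((a, b) => b.ttm - a.ttm); // copy, do not mutate input
  const firstById = new Map();
  for (const doc of sorted) {
    if (!firstById.has(doc.id)) {
      firstById.set(doc.id, doc);
    }
  }
  return [...firstById.values()];
}

const onePerId = latestPerId(records);
console.log(onePerId);
```

With the sample data this keeps ttm 76 for id 2, ttm 54 for id 3 and ttm 34 for id 1, whereas the unsorted pipeline output above keeps ttm 23 for id 1.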
I believe you can use aggregate like this
collection.aggregate({
  $group : {
    "_id" : "$id",
    "docs" : {
      $first : {
        "name" : "$name",
        "ttm" : "$ttm",
        "val" : "$val",
      }
    }
  }
});
The issue is that you want to distill 3 matching records down to one without providing any logic in the query for how to choose between the matching results.
Your options are basically to specify aggregation logic of some kind (select the max or min value for each column, for example), or to run a select distinct query and only select the fields that you wish to be distinct.
querymongo.com does a good job of translating these distinct queries for you (from SQL to MongoDB).
For example, this SQL:
SELECT DISTINCT columnA FROM collection WHERE columnA > 5
Is returned as this MongoDB:
db.runCommand({
"distinct": "collection",
"query": {
"columnA": {
"$gt": 5
}
},
"key": "columnA"
});
If you want to print the distinct result using JavaScript (for example, to redirect it into a file), this is how you do it:
cursor = db.myColl.find({'fieldName':'fieldValue'})
var Arr = new Array();
var count = 0;
cursor.forEach(
  function(x) {
    var temp = x.id;
    var index = Arr.indexOf(temp);
    if (index == -1) {
      // First time this id is seen: print it and remember it.
      printjson(x.id);
      Arr[count] = temp;
      count++;
    }
  })
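For larger result sets, the indexOf lookup above rescans the array for every document. A possible variation, sketched here in plain JavaScript, tracks seen values in a Set instead. The distinctValues helper is hypothetical, and it takes a plain array, which in the shell could come from cursor.toArray().

```javascript
// Returns the distinct values of `field`, in first-seen order.
function distinctValues(docs, field) {
  const seen = new Set();
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    if (!seen.has(value)) {
      seen.add(value); // remember the value so later duplicates are skipped
      out.push(value);
    }
  }
  return out;
}

// Sample documents shaped like the ones in the question.
const sampleRecords = [
  { id: 1, ttm: 23 },
  { id: 1, ttm: 34 },
  { id: 1, ttm: 24 },
  { id: 2, ttm: 56 },
  { id: 2, ttm: 76 },
  { id: 3, ttm: 54 },
];

const distinctIds = distinctValues(sampleRecords, "id");
console.log(distinctIds);
```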
Specify Query with distinct.
The following example returns the distinct values for the field sku, embedded in the item field, from the documents whose dept is equal to "A":
db.inventory.distinct( "item.sku", { dept: "A" } )
Reference: https://docs.mongodb.com/manual/reference/method/db.collection.distinct/