how to transform a nested mongodb table into spark dataframe - mongodb

i have a nested mongodb talbe and its document structure like this:
{
"_id" : "35228334dbd1090f6117c5a0011b56b0",
"brasidas" : [
{
"key" : "buy",
"value" : 859193
}
],
"crawl_time" : NumberLong(1526296211997),
"date" : "2018-05-11",
"id" : "44874f4c8c677087bcd5f829b2843e66",
"initNumber" : 0,
"repurchase" : 0,
"source_url" : "http://query.sse.com.cn/commonQuery.do?jsonCallBack=jQuery11120015170331124618408_1526262411932&isPagination=true&sqlId=COMMON_SSE_SCSJ_CJGK_ZQZYSHG_JYSLMX_L&beginDate&endDate&securityCode&pageHelp.pageNo=1&pageHelp.beginPage=1&pageHelp.cacheSize=1&pageHelp.endPage=1&pageHelp.pageSize=25",
"stockCode" : "600020",
"stockName" : "ZYGS",
"type" : "SSE"
}
i want to transform it into spark dataframe,and extract the title "key"and "value " of "brasidas" as single column respectively.just like follows:
initNumber repurchase key value stockName type date
50000 50000 buy 286698 shgf SSE 2015/3/30
but there is a problem with the form of title "brasidas",it have three forms:
[{ "key" : "buy", "value" : 286698 }]
[{ "value" : 15311500, "key" : "buy_free" }, { "value" : 0, "key" : "buy_limited" }]
[{ "key" : ""buy_free" " }, { "key" : "buy_limited" }]
so when i use scala to define a StructType, it's not suitable for every document,i can only take "brasidas" as a single column and failed to divide it by the "key" .this is what i get:
initNumber repurchase brasidas stockName type date
50000 50000 [{ "key" : "buy", "value" : 286698 }] shgf SSE 2015/3/30
This is the code for getting mongodb document:
val readpledge =ReadConfig(Map("uri"-> (mongouri_beehive+".pledge")))
val pledge = getMongoDB.readCollection(sc, readpledge,"initNumber","repurchase","brasidas","stockName","type","date")
.selectExpr("cast(initNumber as int) initNumber", "cast(repurchase as int) repurchase","brasidas","stockName","type","date")

If you try to df.printSchema() you'll probably be able to observe that brasidas got ArrayType. Most likely (array of map).
So, I'd suggest to implement some sort of UDF function that get Array as parameter and transform it in a way you need.
def arrayProcess(arr: Seq[AnyRef]): Seq[AnyRef] = ???

Related

How can I use mongodb projections with Go and mgo?

I am currently trying to extract a single object within a document array inside of mongodb.
This is a sample dataset:
"_id" : ObjectId("564aae61e0c4e5dddb07343b"),
"name" : "The Races",
"description" : "Horse races",
"capacity" : 0,
"open" : true,
"type" : 0,
"races" : [
{
"_id" : ObjectId("564ab9097628ba2c6ec54423"),
"race" : {
"distance" : 3000,
"user" : {
"_id" : ObjectId("5648bdbe7628ba189e011b18"),
"status" : 1,
"lastName" : "Miranda",
"firstName" : "Aramys"
}
}
},
{
"_id" : ObjectId("564ab9847628ba2c81f2f34a"),
"bet" : {
"distance" : 3000,
"user" : {
"_id" : ObjectId("5648bdbe7628ba189e011b18"),
"status" : 1,
"lastName" : "Miranda",
"firstName" : "Aramys"
}
}
},{...}
]
I can successfully query using the following in mongo:
db.tracks.find({"_id": ObjectId("564aae61e0c4e5dddb07343b")}, {"races": { $elemMatch: {"_id": ObjectId("564ab9847628ba2c81f2f34a")}}}).pretty()
I am unable to do the same using mgo and have tried the following:
Using nesting (Throws: missing type in composite literal, missing key in map literal)
// Using nesting (Throws: missing type in composite literal, missing key in map literal)
c.Find(bson.M{{"_id": bson.ObjectIdHex(p.ByName("id"))}, bson.M{"races": bson.M{"$elemMatch": bson.M{"_id": bson.ObjectIdHex(p.ByName("raceId"))}}}}).One(&result)
// Using select (Returns empty)
c.Find(bson.M{"_id": bson.ObjectIdHex(p.ByName("id"))}).Select(bson.M{"races._id": bson.ObjectIdHex(p.ByName("raceId"))}).One(&result)
//As an array (Returns empty)
c.Find([]bson.M{{"_id": bson.ObjectIdHex(p.ByName("id"))}, bson.M{"races": bson.M{"$elemMatch": bson.M{"_id": bson.ObjectIdHex(p.ByName("raceId"))}}}}).One(&result)
I am using httprouter and p.ByName("...") invocations are parameters passed to the handler.
Thanks in advance.
Would go with the Select method as the doc states that this enables selecting which fields should be retrieved for the results found, thus the projection using $elemMatch operator can be applied here in conjuction with Select, with your final query looking something like:
c.Find(bson.M{
"_id": bson.ObjectIdHex(p.ByName("id"))
}).Select(bson.M{
"races": bson.M{
"$elemMatch": bson.M{
"_id": bson.ObjectIdHex(p.ByName("raceId"))
}
}
}).One(&result)

MongoDB Query double nested documents in array

I was wondering how I can query from an embedded document inside an array. I have following structure:
{ "targetId" : 2, "metaData" : [ { "key" : "id", "value" : 1 }, { "key" : "name", "value" : "Parisa" }, { "key" : "img", "value" : { "imgid" : 1, "imgName" : "img1" } } ]
I could search simple key-values like key = id and value =1, but I could not search based on the values with embedded document e.g. key="img"
I tried following query but it does not work:
db.test.find({"metaData":{$elemMatch:{"key":"img", "value":{"imgid":1}}}})
Could you please help me!
I think the "value" part of your query is a little off. You need to put the document element in the criteria:
b.test.find({"metaData":{$elemMatch:{"key":"img", "value.imgid":1}}})

MongoDB - How can I use MapReduce to merge a value from one collection into another collection on multiple keys of a second collection?

I have two MongoDB collections: The first is a collection that includes frequency information for different IDs and is shown (truncated form) below:
[
{
"_id" : "A1",
"value" : 19
},
{
"_id" : "A2",
"value" : 6
},
{
"_id" : "A3",
"value" : 12
},
{
"_id" : "A4",
"value" : 8
},
{
"_id" : "A5",
"value" : 4
},
...
]
The second collection is more complex and contains information for each _id listed in the first collection (it's called frequency_collection_id in the second collection), but frequency_collection_id may be inside two lists (info.details_one, and info.details_two) for each record:
[
{
"_id" : ObjectId("53cfc1d086763c43723abb07"),
"info" : {
"status" : "pass",
"details_one" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known"
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown"
}
],
"details_two" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known"
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown"
}
],
}
}
...
]
What I'm looking to do, is merge the frequency information (from the first collection) into the second collection, in effect creating a collection that looks like:
[
{
"_id" : ObjectId("53cfc1d086763c43723abb07"),
"info" : {
"status" : "pass",
"details_one" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known",
**"value" : 19**
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown",
**"value" : 6**
}
],
"details_two" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known",
**"value" : 19**
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown",
**"value" : 6**
}
],
}
}
...
]
I know that this should be possible with MongoDB's MapReduce functions, but all the examples I've seen are either too minimal for my collection structure, or are answering different questions than I'm looking for.
Does anyone have any pointers? How can I merge my frequency information (from my first collection) into the records (inside my two lists in each record of the second collection)?
I know this is more or less a JOIN, which MongoDB does not support, but from my reading, it looks like this is a prime example of MapReduce.
I'm learning Mongo as best I can, so please forgive me if my question is too naive.
Just like all MongoDB operations, a MapReduce always operates only on a single collection and can not obtain info from another one. So you first step needs to be to dump both collections into one. Your documents have different _id's, so it should not be a problem for them to coexist in the same collection.
Then you do a MapReduce where the map function emits both kinds of documents for their common key, which is their frequency ID.
Your reduce function will then receive an array of two documents for each key: the two documents you have received. You then just have to merge these two documents into one. Keep in mind that the reduce-function can receive these two documents in any order. It can also happen that it gets called for a partial result (only one of the two documents) or for an already completed result. You need to handle these cases gracefully! A good implementation could be to create a new object and then iterate the input-documents copying all existing relevant fields with their values to the new object, so the resulting object is an amalgamation of the input documents.

How to update particular array element in MongoDB

I am newbie in MongoDB. I have stored data inside mongoDB in below format
"_id" : ObjectId("51d5725c7be2c20819ac8a22"),
"chrom" : "chr22",
"pos" : 17060409,
"information" : [
{
"name" : "Category",
"value" : "3"
},
{
"name" : "INDEL",
"value" : "INDEL"
},
{
"name" : "DP",
"value" : "31"
},
{
"name" : "FORMAT",
"value" : "GT:PL:GQ"
},
{
"name" : "PV4",
"value" : "1,0.21,0.00096,1"
}
],
"sampleID" : "Job1373964150558382243283"
I want to update the value to 11 which has the name as Category.
I have tried below query:
db.VariantEntries.update({$and:[ { "pos" : 117199533} , { "sampleID" : "Job1373964150558382243283"},{"information.name":"Category"}]},{$set:{'information.value':'11'}})
but Mongo replies
can't append to array using string field name [value]
How one can form a query which will update the particular value?
You can use the $ positional operator to identify the first array element to match the query in the update like this:
db.VariantEntries.update({
"pos": 17060409,
"sampleID": "Job1373964150558382243283",
"information.name":"Category"
},{
$set:{'information.$.value':'11'}
})
In MongoDB you can't adress array values this way. So you should change your schema design to:
"information" : {
'category' : 3,
'INDEL' : INDEL
...
}
Then you can adress the single fields in your query:
db.VariantEntries.update(
{
{"pos" : 117199533} ,
{"sampleID" : "Job1373964150558382243283"},
{"information.category":3}
},
{
$set:{'information.category':'11'}
}
)

How can I select a number of records per a specific field using mongodb?

I have a collection of documents in mongodb, each of which have a "group" field that refers to a group that owns the document. The documents look like this:
{
group: <objectID>
name: <string>
contents: <string>
date: <Date>
}
I'd like to construct a query which returns the most recent N documents for each group. For example, suppose there are 5 groups, each of which have 20 documents. I want to write a query which will return the top 3 for each group, which would return 15 documents, 3 from each group. Each group gets 3, even if another group has a 4th that's more recent.
In the SQL world, I believe this type of query is done with "partition by" and a counter. Is there such a thing in mongodb, short of doing N+1 separate queries for N groups?
You cannot do this using the aggregation framework yet - you can get the $max or top date value for each group but aggregation framework does not yet have a way to accumulate top N plus there is no way to push the entire document into the result set (only individual fields).
So you have to fall back on MapReduce. Here is something that would work, but I'm sure there are many variants (all require somehow sorting an array of objects based on a specific attribute, I borrowed my solution from one of the answers in this question.
Map function - outputs group name as a key and the entire rest of the document as the value - but it outputs it as a document containing an array because we will try to accumulate an array of results per group:
map = function () {
emit(this.name, {a:[this]});
}
The reduce function will accumulate all the documents belonging to the same group into one array (via concat). Note that if you optimize reduce to keep only the top five array elements by checking date then you won't need the finalize function, and you will use less memory during running mapreduce (it will also be faster).
reduce = function (key, values) {
result={a:[]};
values.forEach( function(v) {
result.a = v.a.concat(result.a);
} );
return result;
}
Since I'm keeping all values for each key, I need a finalize function to pull out only latest five elements per key.
final = function (key, value) {
Array.prototype.sortByProp = function(p){
return this.sort(function(a,b){
return (a[p] < b[p]) ? 1 : (a[p] > b[p]) ? -1 : 0;
});
}
value.a.sortByProp('date');
return value.a.slice(0,5);
}
Using a template document similar to one you provided, you run this by calling mapReduce command:
> db.top5.mapReduce(map, reduce, {finalize:final, out:{inline:1}})
{
"results" : [
{
"_id" : "group1",
"value" : [
{
"_id" : ObjectId("516f011fbfd3e39f184cfe13"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.498Z"),
"contents" : 0.23778377776034176
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0e"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.467Z"),
"contents" : 0.4434165076818317
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe09"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.436Z"),
"contents" : 0.5935856597498059
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe04"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.405Z"),
"contents" : 0.3912118375301361
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfdff"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.372Z"),
"contents" : 0.221651989268139
}
]
},
{
"_id" : "group2",
"value" : [
{
"_id" : ObjectId("516f011fbfd3e39f184cfe14"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.504Z"),
"contents" : 0.019611883210018277
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0f"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.473Z"),
"contents" : 0.5670706110540777
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0a"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.442Z"),
"contents" : 0.893193120136857
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe05"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.411Z"),
"contents" : 0.9496864483226091
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe00"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.378Z"),
"contents" : 0.013748752186074853
}
]
},
{
"_id" : "group3",
...
}
]
}
],
"timeMillis" : 15,
"counts" : {
"input" : 80,
"emit" : 80,
"reduce" : 5,
"output" : 5
},
"ok" : 1,
}
Each result has _id as group name and values as array of most recent five documents from the collection for that group name.
you need aggregation framework $group stage piped in a $limit stage...
you want also to $sort the records in some ways or else the limit will have undefined behaviour, the returned documents will be pseudo-random (the order used internally by mongo)
something like that:
db.collection.aggregate([{$group:...},{$sort:...},{$limit:...}])
here there is the documentation if you want to know more