Migrating from MongoDB to HBase

Hi, I am very new to the HBase database. I downloaded some Twitter data and stored it in MongoDB. Now I need to move that data into HBase to speed up Hadoop processing, but I am not able to design its schema. Here is the Twitter data in JSON format:
{
"_id" : ObjectId("512b71e6e4b02a4322d1c0b0"),
"id" : NumberLong("306044618179506176"),
"source" : "Facebook",
"user" : {
"name" : "Dada Bhagwan",
"location" : "India",
"url" : "http://www.dadabhagwan.org",
"id" : 191724440,
"protected" : false,
"timeZone" : null,
"description" : "Founder of Akram Vignan - Practical Spiritual Science of Self Realization",
"screenName" : "dadabhagwan",
"geoEnabled" : false,
"profileImageURL" : "http://a0.twimg.com/profile_images/1647956820/M_DSC_0034_normal.jpg",
"biggerProfileImageURL" : "http://a0.twimg.com/profile_images/1647956820/M_DSC_0034_bigger.jpg",
"profileImageUrlHttps" : "https://si0.twimg.com/profile_images/1647956820/M_DSC_0034_normal.jpg",
"profileImageURLHttps" : "https://si0.twimg.com/profile_images/1647956820/M_DSC_0034_normal.jpg",
"biggerProfileImageURLHttps" : "https://si0.twimg.com/profile_images/1647956820/M_DSC_0034_bigger.jpg",
"miniProfileImageURLHttps" : "https://si0.twimg.com/profile_images/1647956820/M_DSC_0034_mini.jpg",
"originalProfileImageURLHttps" : "https://si0.twimg.com/profile_images/1647956820/M_DSC_0034.jpg",
"followersCount" : 499,
"profileBackgroundColor" : "EEE4C1",
"profileTextColor" : "333333",
"profileLinkColor" : "990000",
"lang" : "en",
"profileSidebarFillColor" : "FCF9EC",
"profileSidebarBorderColor" : "CBC09A",
"profileUseBackgroundImage" : true,
"showAllInlineMedia" : false,
"friendsCount" : 1,
"favouritesCount" : 0,
"profileBackgroundImageUrl" : "http://a0.twimg.com/profile_background_images/396759326/dadabhagwan-twitter.jpg",
"profileBackgroundImageURL" : "http://a0.twimg.com/profile_background_images/396759326/dadabhagwan-twitter.jpg",
"profileBackgroundImageUrlHttps" : "https://si0.twimg.com/profile_background_images/396759326/dadabhagwan-twitter.jpg",
"profileBannerURL" : null,
"profileBannerRetinaURL" : null,
"profileBannerIPadURL" : null,
"profileBannerIPadRetinaURL" : null,
"miniProfileImageURL" : "http://a0.twimg.com/profile_images/1647956820/M_DSC_0034_mini.jpg",
"originalProfileImageURL" : "http://a0.twimg.com/profile_images/1647956820/M_DSC_0034.jpg",
"utcOffset" : -1,
"contributorsEnabled" : false,
"status" : null,
"createdAt" : NumberLong("1284700143000"),
"profileBannerMobileURL" : null,
"profileBannerMobileRetinaURL" : null,
"profileBackgroundTiled" : false,
"statusesCount" : 1713,
"verified" : false,
"translator" : false,
"listedCount" : 6,
"followRequestSent" : false,
"descriptionURLEntities" : [ ],
"urlentity" : {
"url" : "http://www.dadabhagwan.org",
"start" : 0,
"end" : 26,
"expandedURL" : "http://www.dadabhagwan.org",
"displayURL" : "http://www.dadabhagwan.org"
},
"rateLimitStatus" : null,
"accessLevel" : 0
},
"contributors" : [ ],
"geoLocation" : null,
"place" : null,
"favorited" : false,
"retweet" : false,
"retweetedStatus" : null,
"retweetCount" : 0,
"userMentionEntities" : [ ],
"retweetedByMe" : false,
"currentUserRetweetId" : -1,
"possiblySensitive" : false,
"urlentities" : [
{
"url" : "http://t.co/gR1GohGjaj",
"start" : 113,
"end" : 135,
"expandedURL" : "http://fb.me/2j2HKHJrM",
"displayURL" : "fb.me/2j2HKHJrM"
}
],
"hashtagEntities" : [ ],
"mediaEntities" : [ ],
"truncated" : false,
"inReplyToStatusId" : -1,
"text" : "Spiritual Quote of the Day :\n\n‘I am Chandubhai’ is an illusion itself and from that are \nkarmas charged. When... http://t.co/gR1GohGjaj",
"inReplyToUserId" : -1,
"inReplyToScreenName" : null,
"createdAt" : NumberLong("1361801697000"),
"rateLimitStatus" : null,
"accessLevel" : 0
}
How should I divide this data into columns and column families? I thought of making one "twitter" column family that contains source, geoLocation, place, retweet, etc., and another "user" column family that contains name, location, etc. (the user's data), i.e. a new column family for each inner-level sub-document.
Is this approach correct? And how would I differentiate the urlentity under the "user" column family from the one at the tweet level?
Also, how should I handle keys that contain a list of sub-documents (e.g. the urlentities array)?

There are many ways to model this in HBase, ranging from storing everything in a single column to having a different table for each sub-entity, with several other tables for "indexing".
Generally speaking, you model data in HBase based on your read and write access patterns. For example, column families are stored in different files on disk, so one reason to split data into two column families is that there are many cases where you need data from one and not the other.
There's a good presentation about HBase schema design by Ian Varley from HBaseCon 2012; you can find the slides here and the video here.
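As an illustration of one read-pattern-friendly layout (a sketch only; the row key choice and the dotted qualifier naming are assumptions, not the only valid design), the nested tweet can be flattened so that each sub-document lands under its own column family. This also disambiguates the two urlentity fields (user:urlentity.url vs. twitter:urlentities.0.url) and handles lists of sub-documents with a positional index in the qualifier:

```python
def flatten(doc, prefix=""):
    """Flatten a nested dict/list into HBase-style qualifier -> value pairs."""
    cells = {}
    if isinstance(doc, dict):
        for key, value in doc.items():
            cells.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(doc, list):
        # Lists of sub-documents get a positional index in the qualifier.
        for i, value in enumerate(doc):
            cells.update(flatten(value, f"{prefix}{i}."))
    else:
        cells[prefix.rstrip(".")] = doc
    return cells

# Trimmed-down version of the tweet document above.
tweet = {
    "id": 306044618179506176,
    "source": "Facebook",
    "user": {"name": "Dada Bhagwan", "urlentity": {"url": "http://www.dadabhagwan.org"}},
    "urlentities": [{"url": "http://t.co/gR1GohGjaj"}],
}

row_key = str(tweet.pop("id"))
user = tweet.pop("user")
# Two column families: "user" for the nested user document, "twitter" for the rest.
cells = {("user", q): v for q, v in flatten(user).items()}
cells.update({("twitter", q): v for q, v in flatten(tweet).items()})

for (cf, qualifier), value in sorted(cells.items()):
    print(f"{row_key}  {cf}:{qualifier} = {value}")
```

Since HBase qualifiers are arbitrary bytes, dotted paths like urlentities.0.url are legal; whether positional indexes are acceptable depends on whether you ever need to query individual list elements.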

Related

MongoDB not using my index(arrays, boolean)

I have a collection full of the following documents, for example:
{
"_id" : ObjectId("2819738917238dgd21873"),
"mailIntid" : 10000000,
"mailCreated" : "2019-02-08",
"mailLastModified" : null,
"mailReceived" : "2019-02-08",
"mailSend" : "2019-02-08",
"hadAttachment" : false,
"subject" : "nieuwe vacature ",
"bodyPreview" : "Test Body",
"importance" : "normal",
"isDeliveryRequested" : null,
"isReadReceiptRequested" : false,
"isRead" : false,
"isDraft" : false,
"inferenceClassification" : "focused",
"bodyContentType" : "html",
"senderName" : "Jobs",
"senderEmail" : "noreply@test.nl",
"fromName" : "Jobs",
"fromEmail" : "noreply@mailing.test.nl",
"flagStatus" : "notFlagged",
"urls" : "https://test.nl/request-details?id=1337",
"insertDate" : "2019-02-09",
"modifiedDate" : "2019-05-05",
"parseComplete" : true,
"rawMailUrl" : [
{
"url" : "https://test.nl/request-details?id=1337",
"parsed" : false
}
]
}
I created an index with the following code:
db.testAB.createIndex(
{"rawMailUrl.parsed": 1},
{partialFilterExpression: { "rawMailUrl.parsed": false}}
)
But whenever I use the following query, it's not using the index from above.
db.testAB.find({"rawMailUrl.parsed": false})
Any ideas? Am I doing something wrong when creating an index on an array with a true/false expression?
You defined an index with partialFilterExpression: { "rawMailUrl.parsed": false }, i.e. all records in your index have the same value, false.
An index with a low number of distinct values is a bad index: it does not improve the access path. Thus the index is not used; it is just a waste of disk space.

How to filter by _id in MongoDB using Pig

I have a mongo document like this:
db.activity_days.findOne()
{
"_id" : ObjectId("54b4ee617acf9ce0440a3185"),
"aca" : 0,
"ca" : 0,
"cbdw" : true,
"day" : ISODate("2014-12-10T00:00:00Z"),
"dm" : 0,
"fbc" : 0,
"go" : 2500,
"gs" : [ ],
"its" : [
{
"_id" : ObjectId("551ac8d44f9f322e2b055d3a"),
"at" : 2000,
"atn" : "Running",
"cas" : 386.514909469507,
"dis" : 2.788989730832084,
"du" : 1472,
"ibr" : false,
"ide" : false,
"lcs" : false,
"pt" : 0,
"rpt" : 0,
"src" : 1001,
"stp" : 0,
"tcs" : [ ],
"ts" : 1418257729,
"u_at" : ISODate("2015-01-13T00:32:10.954Z")
}
],
"po" : 0,
"se" : 0,
"st" : 0,
"tap3c" : [ ],
"tzo" : -21600,
"u_at" : ISODate("2015-01-13T00:32:10.952Z"),
"uid" : ObjectId("545eb753ae9237b1df115649")
}
I want to use Pig to filter a specific _id range. I can write the mongo query like this:
db.activity_day.find({_id: {$gt: ObjectId("54a48e000000000000000000"), $lt: ObjectId("54cd6c800000000000000000")}})
But I don't know how to write it in Pig. Does anyone know?
You could try using the mongo-hadoop connector for Pig; see mongo-hadoop: Usage with Pig.
Once you REGISTER the JARs (core, pig, and the Java driver), e.g. REGISTER /path-to/mongo-hadoop-pig-<version>.jar;, via grunt you could run:
SET mongo.input.query '{"_id":{"\$gt":{"\$oid":"54a48e000000000000000000"},"\$lt":{"\$oid":"54cd6c800000000000000000"}}}';
rangeActivityDay = LOAD 'mongodb://localhost:27017/database.collection' USING com.mongodb.hadoop.pig.MongoLoader();
DUMP rangeActivityDay;
You may want to use LIMIT before dumping the data as well.
The above was tested using mongo-java-driver-3.0.0-rc1.jar, mongo-hadoop-pig-1.4.0.jar, mongo-hadoop-core-1.4.0.jar, and MongoDB v3.0.9.
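The boundary ObjectIds in the query above encode only a timestamp in the first 4 bytes, with the remaining bytes zero-padded (here they correspond to 2015-01-01 and 2015-02-01 UTC). If you need to derive such boundaries for a different date range, a small sketch with no driver required:

```python
from datetime import datetime, timezone

def objectid_boundary(dt):
    """Build a zero-padded ObjectId hex string whose first 4 bytes are the
    Unix timestamp of dt, usable as a $gt/$lt boundary on _id."""
    seconds = int(dt.replace(tzinfo=timezone.utc).timestamp())
    return f"{seconds:08x}" + "0" * 16  # 4-byte timestamp + 8 zero bytes

start = objectid_boundary(datetime(2015, 1, 1))
end = objectid_boundary(datetime(2015, 2, 1))
print(start)  # 54a48e000000000000000000
print(end)    # 54cd6c800000000000000000
```

This works because ObjectIds sort by their leading timestamp bytes, so a half-open range of zero-padded ids selects all documents created in that interval.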

Mongoid query embedded document and return parent

I have this document, each is a tool:
{
"_id" : ObjectId("54da43aea96ddcc40915a457"),
"checked_in" : false,
"barcode" : "PXJ-234234",
"calibrations" : [
{
"_id" : ObjectId("54da46ec546173129d810100"),
"cal_date" : null,
"cal_date_due" : ISODate("2014-08-06T00:00:00.000+0000"),
"time_in" : ISODate("2015-02-10T17:46:20.250+0000"),
"time_out" : ISODate("2015-02-10T17:46:20.250+0000"),
"updated_at" : ISODate("2015-02-10T17:59:08.796+0000"),
"created_at" : ISODate("2015-02-10T17:59:08.796+0000")
},
{
"_id" : ObjectId("5509e815686d610b70010000"),
"cal_date_due" : ISODate("2015-03-18T21:03:17.959+0000"),
"time_in" : ISODate("2015-03-18T21:03:17.959+0000"),
"time_out" : ISODate("2015-03-18T21:03:17.959+0000"),
"cal_date" : ISODate("2015-03-18T21:03:17.959+0000"),
"updated_at" : ISODate("2015-03-18T21:03:17.961+0000"),
"created_at" : ISODate("2015-03-18T21:03:17.961+0000")
},
{
"_id" : ObjectId("5509e837686d610b70020000"),
"cal_date_due" : ISODate("2015-03-18T21:03:51.189+0000"),
"time_in" : ISODate("2015-03-18T21:03:51.189+0000"),
"time_out" : ISODate("2015-03-18T21:03:51.189+0000"),
"cal_date" : ISODate("2015-03-18T21:03:51.189+0000"),
"updated_at" : ISODate("2015-03-18T21:03:51.191+0000"),
"created_at" : ISODate("2015-03-18T21:03:51.191+0000")
}
],
"group" : "Engine",
"location" : "Here or there",
"model" : "ZX101C",
"serial" : NumberInt(15449),
"tool" : "octane analyzer",
"updated_at" : ISODate("2015-09-30T20:43:55.652+0000"),
"description" : "Description...",
}
Tools are calibrated periodically. What I want to do is grab tools that are due this month.
Currently, my query is this:
scope :upcoming, -> { where(:at_ats => false).where('calibrations.0.cal_date_due' => {'$gte' => Time.now-1.day, '$lte' => Time.now+30.days}).order_by(:'calibrations.cal_date_due'.asc) }
However, this query gets the tool by the first calibration object and it needs to be the last. I've tried a myriad of things, but I'm stuck here.
How can I make sure I'm querying the most recent calibration document, not the first (which would be the oldest and therefore not relevant)?
You should look into the aggregation framework and the $unwind operator; this link may be of help.
This link may also be helpful: it contains an example of using the aggregation framework to get the last element of an array, that is, the most recent calibration in your case.

How to group already-grouped data in MongoDB

I have a MongoDB query based on objects with a "Ticket" structure. Each ticket usually contains Task objects, and each of those has an Owner. If the Ticket is Open, the associated Owner is "OpnPrps.CurAgtNme"; on the other hand, if the ticket is Closed, the associated Owner is "Nms.CloAgt".
This is what my database (JSON) looks like:
{
"result" : [
{
"_id" : NumberLong(3131032306336),
"TicId" : 1147552,
"OrgId" : 729,
"Sts" : "Closed",
"CrtDat" : ISODate("2015-04-23T18:50:46.000Z"),
"CloDat" : ISODate("2015-04-23T19:46:26.000Z"),
"ShtDes" : "Copy of Employment agreement",
"Des" : "EE wants a copy of employment agreement. address was verified.",
"DesSum" : "EE wants a copy of employment agreement. address was verified.",
"Sol" : "ISIGHT TICKET NUMBER:<br><h1>US-04-15-01109</h1>",
"CrtAgtId" : 20444,
"CloAgtId" : 20149,
"CrtApp" : null,
"IsInt" : false,
"HasPrtTsk" : null,
"PexBodId" : "",
"PayGrp" : "",
"RvwDat" : null,
"RclDat" : null,
"IsDynDueDatDef" : false,
"OlaDat" : ISODate("2015-04-25T00:50:50.000Z"),
"SlaDat" : ISODate("2015-04-25T00:50:50.000Z"),
"DynDueDatElaDat" : null,
"ReaTim" : "00:00:00",
"LstUpd" : ISODate("2015-04-23T19:46:26.000Z"),
"VrsOnl" : 2,
"VrsArc" : 1,
"OpnPrps" : null,
"Nms" : {
"Cmp" : "Organization1 US",
"Org" : "Organization1",
"Srv" : "Policies",
"SrvGrp" : "17. Workforce Administration (NGR)",
"Wkg" : "WORKGROUP1",
"Pri" : "",
"Tir" : "T1",
"Src" : "Call",
"CrtAgt" : "Arun ",
"CloAgt" : "Felicia"
},
"Olas" : {
"_id" : 2,
"EntNam" : "ENTITY",
"DueDat" : ISODate("2015-04-25T00:50:50.000Z"),
"AmbDat" : ISODate("2015-04-24T20:50:49.000Z"),
"DueDatElaDat" : null,
"AmbDatElaDat" : null,
"SlaDuration" : 18,
"TotTim" : NumberLong(0),
"TotTimPndEnt" : NumberLong(0),
"TotTimPndEmp" : NumberLong(0),
"RclInSla" : false
},
"Tsks" : {
"_id" : 1,
"Typ" : "Planned",
"CrtDat" : ISODate("2015-04-23T18:50:46.000Z"),
"CloDat" : null,
"LstUpd" : ISODate("2015-04-23T18:50:46.000Z"),
"DueDat" : ISODate("2015-04-25T00:50:50.000Z"),
"TimCplTsk" : 1080,
"DueDatEla" : false,
"AgtOwnId" : null,
"WkgSklId" : 45387,
"EntId" : 2,
"Sts" : "Open",
"PrdTskId" : 201,
"Ttl" : "Provide Navigational Assistance",
"Des" : "Provide Navigational Assistance",
"SrvSklId" : 45792,
"PriSklId" : null,
"TotTim" : 0,
"PtnId" : null,
"PtnTic" : null,
"DepTskId" : null,
"IsAct" : true,
"IsMndOnCloTic" : false,
"Nms" : {
"AgtOwn" : null,
"Wkg" : "CM_T1",
"Ent" : "ENTITY",
"SrvSkl" : "Policies",
"PriSkl" : null,
"Ptn" : null,
"Frm" : ""
},
"AscTicItm" : null,
"FrmId" : null,
"Flds" : []
},
And the query I'm using to group the data looks like:
db.tickets.aggregate([
{$match:{
'Nms.Org': 'Organization1',
'Nms.Cmp':'Company',
'Nms.Wkg':'Workgroup'
}},
{$project:{
_id:0,
'Tsks._id':1,
'Tsks.Sts':1,
'Tsks.DueDat':1,
'Sts':1,
'Nms.Org':1,
'Nms.Cmp':1,
'Nms.Wkg':1,
'Nms.CloAgt':1,
'OpnPrps.CurAgtNme':1,
'OpnPrps.CurEntNme':1,
'Olas.EntNam':1
}},
{$unwind : "$Olas" },
{$unwind : "$Tsks" },
{$match:{$and:[{
'Tsks.DueDat': {$ne: null},
'Olas.EntNam': 'Entity',
'Tsks._id':{$gt:0}
}]}},
{$group:
{_id:{
Org:'$Nms.Org',
Cmp:'$Nms.Cmp',
Wkg:'$Nms.Wkg',
CurEntNme:'$Olas.EntNam',
CurTskId: '$Tsks._id',
DueDate:'$Tsks.DueDat',
Owner1:'$OpnPrps.CurAgtNme',
Owner2:'$Nms.CloAgt'
},
All: {$sum: { $cond:
[
{ $eq: [ '$Tsks.DueDat' , null ] } ,0,1
]}}
},
}
])
With this query I'm able to get every ticket for every owner (whether open or closed) and their associated tasks. My problem is that I want to group all the tickets without taking the status into consideration, something like a second group on top of the results I have now; see below.
The table I get in Jasper Studio with this query looks like:
Owner Inventory
--------- --------------
Noemi 1 Owner1:Noemi | Owner2:null
Carl 2 Owner1:null | Owner2:Carl
Darla 2 Owner1:Darla| Owner2:null
Carl 1 Owner1:Carl| Owner2:null
Paola 2 Owner1:null| Owner2:Paola
Noemi 2 Owner1:null | Owner2:Noemi
As you can see, I'm getting repeated values due to the different fields from each ticket. The table I'm looking for should be like this:
Owner Inventory
--------- --------------
Noemi 3 Owner1:Noemi | Owner2:Noemi
Carl 3 Owner1:Carl | Owner2:Carl
Darla 2 Owner1:Darla| Owner2:null
Paola 2 Owner1:null| Owner2:Paola
So, my problem is that I can't find a way to group these results again to obtain the second table.
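There is no answer in the thread, but the shape of the desired table suggests coalescing the two owner fields into a single key and summing again. In the pipeline itself this would be a second $group keyed on something like {$ifNull: ['$_id.Owner1', '$_id.Owner2']}; the sketch below shows the same merge logic in plain Python over the rows of the first table (the row data mirrors the output above; the coalescing rule is an assumption):

```python
# Rows as produced by the first $group: one owner field set, the other null.
rows = [
    {"Owner1": "Noemi", "Owner2": None, "Inventory": 1},
    {"Owner1": None, "Owner2": "Carl", "Inventory": 2},
    {"Owner1": "Darla", "Owner2": None, "Inventory": 2},
    {"Owner1": "Carl", "Owner2": None, "Inventory": 1},
    {"Owner1": None, "Owner2": "Paola", "Inventory": 2},
    {"Owner1": None, "Owner2": "Noemi", "Inventory": 2},
]

totals = {}
for row in rows:
    owner = row["Owner1"] or row["Owner2"]  # coalesce, like $ifNull
    totals[owner] = totals.get(owner, 0) + row["Inventory"]

print(totals)  # {'Noemi': 3, 'Carl': 3, 'Darla': 2, 'Paola': 2}
```

This reproduces the second table: open and closed tickets for the same agent collapse into one row regardless of which owner field was populated.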

using 2 different result sets in mongodb

I'm using Groovy with MongoDB. I have a result set but need a value from a different grouping of documents. How do I pull that value into the result set I need?
MAIN:Network data
"resource_metadata" : {
"name" : "tapd2e75adf-71",
"parameters" : { },
"fref" : null,
"instance_id" : "9f170531-79d0-48ee-b0f7-9bd2788b1cc5"}
I need the display_name for the network data result set, which is contained in the compute data.
CPU data
"resource_id" : "9f170531-79d0-48ee-b0f7-9bd2788b1cc5",
"resource_metadata" : {
"ramdisk_id" : "",
"display_name" : "testinstance0001"}
You can see the resource_id and the instance_id are the same value. I know there is no relationship I can use, but I'm trying to see if anyone has come across this. I'm using the table model to retrieve data for reporting. A Hashtable has been suggested to me, but I'm not seeing how that would work. Somehow, in the hasNext loop, I need to include the display_name value in the networking data, so that a valid name from the compute data shows instead of only the GUID.
def docs = meter.find(query).sort(sort).limit(50)
while (docs.hasNext()) {
    def doc = docs.next()
    model.addRow([doc.get("counter_name"), doc.get("counter_volume"), doc.get("timestamp"),
                  doc.get("resource_metadata").getString("mac"),
                  doc.get("resource_metadata").getString("instance_id"),
                  doc.get("counter_unit")] as Object[])
}
Full document:
1st set, where I need the network data measure with no name, only the id {resource_metadata.instance_id}:
{
"_id" : ObjectId("528812f8be09a32281e137d0"),
"counter_name" : "network.outgoing.packets",
"user_id" : "4d4e43ec79c5497491b23b13644c2a3b",
"timestamp" : ISODate("2013-11-17T00:51:00Z"),
"resource_metadata" : {
"name" : "tap6baab24e-8f",
"parameters" : { },
"fref" : null,
"instance_id" : "a8727a1d-4661-4565-9c0a-511279024a97",
"instance_type" : "50",
"mac" : "fa:16:3e:a3:bf:fc"
},
"source" : "openstack",
"counter_unit" : "packet",
"counter_volume" : 4611911,
"project_id" : "97dc4ca962b040608e7e707dd03f2574",
"message_id" : "54039238-4f22-11e3-8e68-e4115b99a59d",
"counter_type" : "cumulative"
}
2nd set, where I want to grab the name as I get the values {resource_id}:
"_id" : ObjectId("5287bc3ebe09a32281dd2594"),
"counter_name" : "cpu",
"user_id" : "4d4e43ec79c5497491b23b13644c2a3b",
"message_signature" :
"timestamp" : ISODate("2013-11-16T18:40:58Z"),
"resource_id" : "a8727a1d-4661-4565-9c0a-511279024a97",
"resource_metadata" : {
"ramdisk_id" : "",
"display_name" : "vmsapng01",
"name" : "instance-000014d4",
"disk_gb" : "",
"availability_zone" : "",
"kernel_id" : "",
"ephemeral_gb" : "",
"host" : "3746d148a76f4e1a8203d7e2378ef48ccad8a714a47e7481ab37bcb6",
"memory_mb" : "",
"instance_type" : "50",
"vcpus" : "",
"root_gb" : "",
"image_ref" : "869be2c0-9480-4239-97ad-df383c6d09bf",
"architecture" : "",
"os_type" : "",
"reservation_id" : ""
},
"source" : "openstack",
"counter_unit" : "ns",
"counter_volume" : NumberLong("724574640000000"),
"project_id" : "97dc4ca962b040608e7e707dd03f2574",
"message_id" : "a240fa5a-4eee-11e3-8e68-e4115b99a59d",
"counter_type" : "cumulative"
}
This is another collection that contains the same value; I just thought it would be easier to grab it from the same collection:
"_id" : "a8727a1d-4661-4565-9c0a-511279024a97",
"metadata" : {
"ramdisk_id" : "",
"display_name" : "vmsapng01",
"name" : "instance-000014d4",
"disk_gb" : "",
"availability_zone" : "",
"kernel_id" : "",
"ephemeral_gb" : "",
"host" : "3746d148a76f4e1a8203d7e2378ef48ccad8a714a47e7481ab37bcb6",
"memory_mb" : "",
"instance_type" : "50",
"vcpus" : "",
"root_gb" : "",
"image_ref" : "869be2c0-9480-4239-97ad-df383c6d09bf",
"architecture" : "",
"os_type" : "",
"reservation_id" : "",
}
Mike
It looks like these data are in 2 different collections; is this correct?
Would you be able to query the CPU data for each "instance_id" ("resource_id")?
Or, if this would cause too many queries to the database (it looks like you limit to 50...), you could use $in with the list of all "instance_id"s:
http://docs.mongodb.org/manual/reference/operator/query/in/
Either way, you will need to query each collection separately.
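As a sketch of the lookup approach in plain Python (driver calls omitted; cpu_docs and network_docs stand in for the two query results, and the fallback-to-GUID rule is an assumption): query the compute collection once, build a map from resource_id to display_name, then resolve each network document's instance_id against it while filling the table rows:

```python
# Stand-ins for the two result sets described above.
cpu_docs = [
    {"resource_id": "a8727a1d-4661-4565-9c0a-511279024a97",
     "resource_metadata": {"display_name": "vmsapng01"}},
]
network_docs = [
    {"counter_name": "network.outgoing.packets",
     "resource_metadata": {"instance_id": "a8727a1d-4661-4565-9c0a-511279024a97",
                           "mac": "fa:16:3e:a3:bf:fc"}},
]

# One pass over the compute data builds the id -> name lookup (the "Hashtable").
names = {d["resource_id"]: d["resource_metadata"]["display_name"] for d in cpu_docs}

rows = []
for doc in network_docs:
    instance_id = doc["resource_metadata"]["instance_id"]
    # Fall back to the GUID if no compute document matched.
    rows.append([doc["counter_name"], names.get(instance_id, instance_id)])

print(rows)  # [['network.outgoing.packets', 'vmsapng01']]
```

The same pattern translates directly to the Groovy loop above: build the map before the while, then look up getString("instance_id") when calling model.addRow.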