apache drill select single column crashes - mongodb

I'm currently exploring Apache Drill, running in cluster mode. My data source is MongoDB, and the source collection contains 5 million documents. I can't execute a simple query:
select body from mongo.twitter.tweets limit 10;
It throws this exception:
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: 264 (expected: range(0, 256)) Fragment 1:2 [Error Id: 8903127a-e9e9-407e-8afc-2092b4c03cf0 on test01.css.org:31010] (java.lang.IndexOutOfBoundsException) index: 0, length: 264 (expected: range(0, 256)) io.netty.buffer.AbstractByteBuf.checkIndex():1134 io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes():272 io.netty.buffer.WrappedByteBuf.setBytes():390 io.netty.buffer.UnsafeDirectLittleEndian.setBytes():30 io.netty.buffer.DrillBuf.setBytes():753 io.netty.buffer.AbstractByteBuf.setBytes():510 org.apache.drill.exec.store.bson.BsonRecordReader.writeString():265 org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167 org.apache.drill.exec.store.bson.BsonRecordReader.write():75 org.apache.drill.exec.store.mongo.MongoRecordReader.next():186 org.apache.drill.exec.physical.impl.ScanBatch.next():178 org.apache.drill.exec.record.AbstractRecordBatch.next():119 org.apache.drill.exec.record.AbstractRecordBatch.next():109 org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115 org.apache.drill.exec.record.AbstractRecordBatch.next():162 org.apache.drill.exec.record.AbstractRecordBatch.next():119 org.apache.drill.exec.record.AbstractRecordBatch.next():109 org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94 org.apache.drill.exec.record.AbstractRecordBatch.next():162 org.apache.drill.exec.physical.impl.BaseRootExec.next():104 org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92 org.apache.drill.exec.physical.impl.BaseRootExec.next():94 org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232 org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226 java.security.AccessController.doPrivileged():-2 javax.security.auth.Subject.doAs():422 org.apache.hadoop.security.UserGroupInformation.doAs():1657 org.apache.drill.exec.work.fragment.FragmentExecutor.run():226 org.apache.drill.common.SelfCleaningRunnable.run():38 java.util.concurrent.ThreadPoolExecutor.runWorker():1142 java.util.concurrent.ThreadPoolExecutor$Worker.run():617 java.lang.Thread.run():745
A query that does work and fetches results:
select body from mongo.twitter.tweets where tweet_id = 'tag:search.twitter.com,2005:xxxxxxxxxx';
Sample document in the source collection:
{
"_id" : ObjectId("58402ad5757d7fede822e641"),
"rule_list" : [
"x",
"(contains:x (contains:y OR contains:y1)) OR (contains:v contains:b) OR (contains:v (contains:r OR contains:t))"
],
"actor_friends_count" : 79,
"klout_score" : 19,
"actor_favorites_count" : 0,
"actor_preferred_username" : "xxxxxxx",
"sentiment" : "neg",
"tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxx",
"object_actor_followers_count" : 1286,
"actor_posted_time" : "2016-07-16T14:08:25.000Z",
"actor_id" : "id:twitter.com:xxxxxxxx",
"actor_display_name" : "xxxxx",
"retweet_count" : 6,
"hashtag_list" : [
"myhashtag"
],
"body" : "my tweet body",
"actor_followers_count" : 25,
"actor_status_count" : 243,
"verb" : "share",
"posted_time" : "2016-08-01T07:49:00.000Z",
"object_actor_status_count" : 206,
"lang" : "ar",
"object_actor_preferred_username" : "xxxxxx",
"original_tweet_id" : "tag:search.twitter.com,2005:xxxxxx",
"gender" : "male",
"object_actor_id" : "id:twitter.com:xxxxxxx",
"favorites_count" : 0,
"object_posted_time" : "2016-06-20T04:12:02.000Z",
"object_actor_friends_count" : 2516,
"generator_display_name" : "Twitter for iPhone",
"object_actor_display_name" : "sdfsf",
"actor_listed_count" : 0
}
Any help is appreciated!

The error can be worked around by telling the Mongo storage plugin to use Drill's JSON reader instead of the BSON record reader:
set store.mongo.bson.record.reader = false;
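The same option can also be set explicitly per session (or with ALTER SYSTEM for all sessions); Drill option names are back-quoted because they contain dots:
ALTER SESSION SET `store.mongo.bson.record.reader` = false;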

Related

Groovy: Retrieve a value from a JSON based on an object

So I have a JSON that looks like this
{
"_embedded" : {
"userTaskDtoList" : [
{
"userTaskId" : 8,
"userTaskDefinitionId" : "JJG",
"userRoleId" : 8,
"workflowId" : 9,
"criticality" : "MEDIUM",
**"dueDate"** : "2021-09-29T09:04:37Z",
"dueDateFormatted" : "Tomorrow 09:04",
"acknowledge" : false,
"key" : 8,
},
{
"userTaskId" : 10,
"userTaskDefinitionId" : "JJP",
"userRoleId" : 8,
"workflowId" : 11,
"criticality" : "MEDIUM",
**"dueDate"** : "2021-09-29T09:06:44Z",
"dueDateFormatted" : "Tomorrow 09:06",
"acknowledge" : false,
"key" : 10,
},
{
"userTaskId" : 12,
"userTaskDefinitionId" : "JJD",
"userRoleId" : 8,
"workflowId" : 13,
"criticality" : "MEDIUM",
**"dueDate"** : "2021-09-29T09:59:07Z",
"dueDateFormatted" : "Tomorrow 09:59",
"acknowledge" : false,
"key" : 12,
}
]
}
}
It's a response from a REST request. What I need is to extract the value of the "dueDate" key ONLY from a specific object and run some validations on it. I'm trying to use Groovy to solve this.
The only thing I've managed to do is this:
import groovy.json.*
def response = context.expand( '${user tasks#Response}' )
def data = new JsonSlurper().parseText(response)
idValue = data._embedded.userTaskDtoList.dueDate
This returns all 3 of the "dueDate" values in the response.
I was thinking that maybe I could select a certain object based on another key; for instance, retrieve only the "dueDate" value from the object with "userTaskId" : 12.
How could I do this?
Any help would be greatly appreciated.
You can find the record of interest, then just grab the dueDate from it:
data._embedded.userTaskDtoList.find { it.userTaskId == 12 }.dueDate
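Put together with the JsonSlurper code from the question, a minimal runnable sketch (assuming response holds the JSON shown above):
import groovy.json.JsonSlurper

def response = context.expand( '${user tasks#Response}' )   // from the question (SoapUI context)
def data = new JsonSlurper().parseText(response)

// find {} returns the first element matching the closure, or null if none does,
// so the safe-navigation operator guards the dueDate access
def dueDate = data._embedded.userTaskDtoList
        .find { it.userTaskId == 12 }
        ?.dueDate

// dueDate is now "2021-09-29T09:59:07Z", ready for whatever validation you need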

How to filter by _id in MongoDB using Pig

I have mongo documents like this:
db.activity_days.findOne()
{
"_id" : ObjectId("54b4ee617acf9ce0440a3185"),
"aca" : 0,
"ca" : 0,
"cbdw" : true,
"day" : ISODate("2014-12-10T00:00:00Z"),
"dm" : 0,
"fbc" : 0,
"go" : 2500,
"gs" : [ ],
"its" : [
{
"_id" : ObjectId("551ac8d44f9f322e2b055d3a"),
"at" : 2000,
"atn" : "Running",
"cas" : 386.514909469507,
"dis" : 2.788989730832084,
"du" : 1472,
"ibr" : false,
"ide" : false,
"lcs" : false,
"pt" : 0,
"rpt" : 0,
"src" : 1001,
"stp" : 0,
"tcs" : [ ],
"ts" : 1418257729,
"u_at" : ISODate("2015-01-13T00:32:10.954Z")
}
],
"po" : 0,
"se" : 0,
"st" : 0,
"tap3c" : [ ],
"tzo" : -21600,
"u_at" : ISODate("2015-01-13T00:32:10.952Z"),
"uid" : ObjectId("545eb753ae9237b1df115649")
}
I want to use Pig to filter a specific _id range. I can write the mongo query like this:
db.activity_days.find({_id:{$gt:ObjectId("54a48e000000000000000000"),$lt:ObjectId("54cd6c800000000000000000")}})
But I don't know how to write this in Pig. Does anyone know?
You could try using the mongo-hadoop connector for Pig; see mongo-hadoop: Usage with Pig.
Once you REGISTER the JARs (core, pig, and the Java driver), e.g. REGISTER /path-to/mongo-hadoop-pig-<version>.jar;, you could run the following via grunt:
SET mongo.input.query '{"_id":{"\$gt":{"\$oid":"54a48e000000000000000000"},"\$lt":{"\$oid":"54cd6c800000000000000000"}}}';
rangeActivityDay = LOAD 'mongodb://localhost:27017/database.collection' USING com.mongodb.hadoop.pig.MongoLoader();
DUMP rangeActivityDay;
You may want to use LIMIT before dumping the data as well.
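For example (a sketch; the alias name is arbitrary):
limited = LIMIT rangeActivityDay 10;
DUMP limited;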
The above was tested using: mongo-java-driver-3.0.0-rc1.jar, mongo-hadoop-pig-1.4.0.jar, mongo-hadoop-core-1.4.0.jar and MongoDB v3.0.9
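For completeness, the JAR registration step mentioned above might look like this in grunt (the paths are placeholders, adjust them to your installation):
REGISTER /path-to/mongo-java-driver-3.0.0-rc1.jar;
REGISTER /path-to/mongo-hadoop-core-1.4.0.jar;
REGISTER /path-to/mongo-hadoop-pig-1.4.0.jar;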

Apache Spark: How to get Parquet output file size and records

I am a newbie in Apache Spark and I want to get the Parquet output file size.
My scenario is:
Read the file from CSV and save it as a text file
myRDD.saveAsTextFile("person.txt")
After saving the file, the Spark UI (localhost:4040) shows me inputBytes 15607801 and outputBytes 13551724,
but when I save it as a Parquet file
myDF.saveAsParquetFile("person.parquet")
the UI (localhost:4040), on the Stages tab, only shows inputBytes 15607801 and there is nothing in outputBytes.
Can anybody help me? Thanks in advance.
Edit
When I call the REST API it gives me the following response:
[ {
"status" : "COMPLETE",
"stageId" : 4,
"attemptId" : 0,
"numActiveTasks" : 0,
"numCompleteTasks" : 1,
"numFailedTasks" : 0,
"executorRunTime" : 10955,
"inputBytes" : 15607801,
"inputRecords" : 1440721,
**"outputBytes" : 0,**
**"outputRecords" : 0,**
"shuffleReadBytes" : 0,
"shuffleReadRecords" : 0,
"shuffleWriteBytes" : 0,
"shuffleWriteRecords" : 0,
"memoryBytesSpilled" : 0,
"diskBytesSpilled" : 0,
"name" : "saveAsParquetFile at ParquetExample.scala:82",
"details" : "org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:1494)\ncom.spark.sql.ParquetExample$.main(ParquetExample.scala:82)\ncom.spark.sql.ParquetExample.main(ParquetExample.scala)",
"schedulingPool" : "default",
"accumulatorUpdates" : [ ]
}, {
"status" : "COMPLETE",
"stageId" : 3,
"attemptId" : 0,
"numActiveTasks" : 0,
"numCompleteTasks" : 1,
"numFailedTasks" : 0,
"executorRunTime" : 2091,
"inputBytes" : 15607801,
"inputRecords" : 1440721,
**"outputBytes" : 13551724,**
**"outputRecords" : 1200540,**
"shuffleReadBytes" : 0,
"shuffleReadRecords" : 0,
"shuffleWriteBytes" : 0,
"shuffleWriteRecords" : 0,
"memoryBytesSpilled" : 0,
"diskBytesSpilled" : 0,
"name" : "saveAsTextFile at ParquetExample.scala:77",
"details" : "org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1379)\ncom.spark.sql.ParquetExample$.main(ParquetExample.scala:77)\ncom.spark.sql.ParquetExample.main(ParquetExample.scala)",
"schedulingPool" : "default",
"accumulatorUpdates" : [ ]
} ]
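If the goal is just to know how much the Parquet write produced, one possible workaround (a sketch; it assumes the output lands on a filesystem visible to the driver, and reuses sc and myDF from the question) is to ask the Hadoop FileSystem API for the size of the output directory:
import org.apache.hadoop.fs.{FileSystem, Path}

// total size in bytes of all files under the Parquet output directory
val fs = FileSystem.get(sc.hadoopConfiguration)
val outputBytes = fs.getContentSummary(new Path("person.parquet")).getLength

// the record count has to come from the data itself
val outputRecords = myDF.count()

println(s"parquet output: $outputBytes bytes, $outputRecords records")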

exception: BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB)

I'm trying to use full text search: http://docs.mongodb.org/manual/tutorial/search-for-text/
db['Item'].runCommand('text', {search: 'deep voice', language: 'english'})
It works well, but when I add conditions:
db['Item'].runCommand( 'text', { search: 'deep voice', language: 'english', filter: {"$and":[{"_extendedBy":{"$in":["Voiceover"]}},{"$and":[{"$or":[{"removed":null},{"removed":{"$exists":false}}]},{"category":ObjectId("51bc464ab012269e23278d55")},{"active":true},{"visible":true}]}]} } )
I receive an error
{
"queryDebugString" : "deep|voic||||||",
"language" : "english",
"errmsg" : "exception: BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: _extendedBy: \"Voiceover\"",
"code" : 10334,
"ok" : 0
}
If I delete the word "voice":
db['Item'].runCommand( 'text', { search: 'deep', language: 'english', filter: {"$and":[{"_extendedBy":{"$in":["Voiceover"]}},{"$and":[{"$or":[{"removed":null},{"removed":{"$exists":false}}]},{"category":ObjectId("51bc464ab012269e23278d55")},{"active":true},{"visible":true}]}]} } );
I receive a response to the request:
...... ......
],
"stats" : {
"nscanned" : 87,
"nscannedObjects" : 87,
"n" : 18,
"nfound" : 18,
"timeMicros" : 1013
},
"ok" : 1
}
I can't understand why the error occurs.
The database is not large ("storageSize" : 2793472):
db.Item.stats()
{
"ns" : "internetjock.Item",
"count" : 616,
"size" : 2035840,
"avgObjSize" : 3304.935064935065,
"storageSize" : 2793472,
"numExtents" : 5,
"nindexes" : 12,
"lastExtentSize" : 2097152,
"paddingFactor" : 1.0000000000001221,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 7440160,
"indexSizes" : {
"_id_" : 24528,
"modlrHff22a60ae822e1e68ba919bbedcb8957d5c5d10f" : 40880,
"modlrH6f786b134a46c37db715aa2c831cfbe1fadb9d1d" : 40880,
"modlrI467f6180af484be29ee9258920fc4837992c825e" : 24528,
"modlrI5cb302f507b9d0409921ac0c51f7d9fc4fd5d2ee" : 40880,
"modlrI6393f31b5b6b4b2cd9517391dabf5db6d6dd3c28" : 8176,
"modlrI1c5cbf0ce48258a5a39c1ac54a1c1a038ebe1027" : 32704,
"modlrH6e623929cc3867746630bae4572b9dbe5bd3b9f7" : 40880,
"modlrH72ea9b8456321008fd832ef9459d868800ce87cb" : 40880,
"modlrU821e16c04f9069f8d0b705d78d8f666a007c274d" : 24528,
"modlrT88fc09e54b17679b0028556344b50c9fe169bdb5" : 7080416,
"modlrIefa804b72cc346d66957110e286839a3f42793ef" : 40880
},
"ok" : 1
}
I had the same problem with mongo 3.0.0 and 3.1.9 on a relatively small database (12GB).
After wasting roughly 4 hours on this I found a workaround using the hidden parameter
mongorestore --batchSize=10
where the number varies depending on the nature of your data. Start with 1000.
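For reference, a full invocation might look like this (database name and dump path are placeholders):
mongorestore --batchSize=100 --db mydatabase /path/to/dump/mydatabase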
The result document returned by the first query is apparently greater than 16MB. MongoDB has a maximum document size of 16MB. The second query returns a document smaller than 16MB, hence no errors.
There's no way around this. Here's the link to the documentation:
http://docs.mongodb.org/manual/reference/limits/
Recreate the Text Index and everything works :-)
db.Item.dropIndex('modlrT88fc09e54b17679b0028556344b50c9fe169bdb5');
db.Item.ensureIndex({'keywords':'text'},{'name':'modlrT88fc09e54b17679b0028556344b50c9fe169bdb5'})
db.Item.stats()
...
"modlrT88fc09e54b17679b0028556344b50c9fe169bdb5" : 7080416, //before
...
"modlrT88fc09e54b17679b0028556344b50c9fe169bdb5" : 2518208 //after Recreated the Text Index

Errors while creating a collection in MongoDB

I am new to MongoDB. I am not able to create a collection. The mongo shell prints the prompt "Display all 169 possibilities? (y or n)". The code is:
db.Lead.insert(
{ LeadID: 1,
MasterAccountID: 100,
LeadName: 'Sarah',
LeadEmailID : 'sarah#hmail.com',
LeadPhoneNumber : '2132155445',
Details : [{ StateID: 1,
TaskID : 1,
Assigned By : 1001,
TimeStamp : '10:00:00',
StatusID : 1 }
]
}
)
Not sure what the issue is. Please help me out.
Regards.
Apart from the fact that there is a space in Assigned By, everything looks good.
I am able to insert it properly.
> db.Lead.find().pretty()
{
"_id" : ObjectId("517ebe75278e0557fd167eb7"),
"LeadID" : 1,
"MasterAccountID" : 100,
"LeadName" : "Sarah",
"LeadEmailID" : "sarah#hmail.com",
"LeadPhoneNumber" : "2132155445",
"Details" : [
{
"StateID" : 1,
"TaskID" : 1,
"AssignedBy" : 1001,
"TimeStamp" : "10:00:00",
"StatusID" : 1
}
]
}
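If the space in the field name is intentional, quoting the key also works in the shell, for example (a minimal sketch with made-up values):
db.Lead.insert({ LeadID: 2, Details: [ { "Assigned By": 1001 } ] })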