I am using Hadoop to run MapReduce jobs against my MongoDB database.
I was able to execute the sample in this link.
Right now I only get a single key/value pair in the output collection after the MapReduce job has executed. I wonder if it is possible to save multiple columns in a MapReduce output collection,
or an embedded document in the value column?
Thanks.
Yes - use BSONWritable as your reducer output class, and create a BSONWritable object with as many columns as you need.
See example here:
https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldReducer.java
I want to use the MongoDB to BigQuery Dataflow template, and I have two questions:
Can I somehow configure partitioning for the destination table? For example, if I want to dump my database every day?
Can I map nested fields in MongoDB to records in BQ instead of columns with string values?
I see the User option with values FLATTEN and NONE, but FLATTEN will flatten documents one level only.
Might either of these two approaches help?
Create a destination table with a schema definition before running the Dataflow job (sketched below)
Use a UDF
I tried the MongoDB to BigQuery Dataflow template with the User option set to FLATTEN.
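For the first approach, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and field names are all hypothetical, and whether a given template version preserves the pre-created table's schema and partitioning is an assumption to verify:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.mongo_dump"  # hypothetical name

schema = [
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("timestamp", "TIMESTAMP"),
    # A RECORD field so a nested MongoDB document lands as a struct,
    # not as a flattened column or a JSON string.
    bigquery.SchemaField(
        "address",
        "RECORD",
        fields=[
            bigquery.SchemaField("city", "STRING"),
            bigquery.SchemaField("zip", "STRING"),
        ],
    ),
]

table = bigquery.Table(table_id, schema=schema)
# Daily partitioning, e.g. for a once-a-day dump of the database.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="timestamp",
)
client.create_table(table)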
I am not able to get multiple metrics using agg, as below.
table.select("date_time")\
.withColumn("date",to_timestamp("date_time"))\
.agg({'date_time':'max', 'date_time':'min'}).show()
I see that the second aggregation overwrites the first.
Can someone help me get multiple aggregations on the same column?
I can't replicate this to make sure that it works, but instead of using a dict for your aggregations I would suggest trying it like this:
from pyspark.sql.functions import to_timestamp, min, max

table.select("date_time")\
.withColumn("date",to_timestamp("date_time"))\
.agg(min('date_time'), max('date_time')).show()
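The dict form fails before Spark ever sees it: a Python dict literal cannot hold duplicate keys, so the second 'date_time' entry silently replaces the first. You can check this in a plain Python shell:

>>> {'date_time': 'max', 'date_time': 'min'}
{'date_time': 'min'}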
I am a newbie to Mongo. I have a collection in my MongoDB database, and to test a feature in my project I need to populate the database with some random data. I need a script to do that: by identifying the data type of each field, the script should fill in the data automatically.
Suppose I have these fields in the collection:
id, name, first_name, last_name, current_date, user_income etc.
My questions are as follows:
1. Can we get all field names of a collection with their data types?
2. Can we generate a random value of that data type in the mongo shell?
3. How can we set the values dynamically to store random data?
At the moment I am putting this data in manually every time.
1. Can we get all field names of a collection with their data types?
MongoDB collections are schema-less, which means each document (the equivalent of a row in a relational database) can have different fields. When you fetch a document from a collection, you can read its field names and data types.
2. Can we generate a random value of that data type in mongo shell?
3. How can we set the values dynamically to store random data?
The mongo shell uses JavaScript, so you can write a JS script and run it with mongo the_js_file.js. That way you can generate random values inside the JS script.
It's useful to have a look at the mongo JavaScript API documentation and the mongo shell JavaScript Method Reference.
Other scripting languages such as Python can also do that; MongoDB has drivers for them too.
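For example, here is a minimal sketch in Python with pymongo, assuming hypothetical database and collection names: it reads one existing document as a type template, generates random values of matching types, and inserts the results.

import random
import string
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient()
coll = client["testdb"]["users"]  # hypothetical database/collection names

def random_value(sample):
    # Generate a random value of the same type as the sample value.
    if isinstance(sample, bool):  # check bool before int: bool subclasses int
        return random.choice([True, False])
    if isinstance(sample, int):
        return random.randint(0, 100000)
    if isinstance(sample, float):
        return random.uniform(0, 100000)
    if isinstance(sample, datetime):
        return datetime.utcnow() - timedelta(days=random.randint(0, 365))
    if isinstance(sample, str):
        return "".join(random.choices(string.ascii_lowercase, k=8))
    return sample  # leave unhandled types unchanged

template = coll.find_one()  # one existing document serves as the "schema"
fields = {k: v for k, v in template.items() if k != "_id"}

# Insert 100 documents with random values of matching types.
docs = [{k: random_value(v) for k, v in fields.items()} for _ in range(100)]
coll.insert_many(docs)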
I want to compare two very big collections; the main point of the operation is to know which elements have changed or been deleted.
Collections 1 and 2 have the same structure and contain more than 3 million records.
Example:
record 1: {id: '7865456465465', name: 'tototo', info: 'tototo'}
So I want to know which elements have changed, and which elements are not present in collection 2.
What is the best solution to do this?
1) Define what equality of two documents means. For me it would be: both documents should contain all fields with exactly the same values, given that their ids are unique. Note that Mongo does not guarantee field order, and if you update a field it might move to the end of the document, which is fine.
2) I would use some framework that can connect to Mongo and fetch data while converting it to a map-like data structure or even JSON. For instance, I would go with Scala + Lift Record (db.coll.findAll()) + Lift JSON. The Lift JSON library has a Diff function that will give you the diff of two JSON docs.
3) Finally, I would sort both collections by ids, open DB cursors, and iterate and compare; a sketch of this step follows below.
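As an illustration of step 3, a minimal sketch in Python with pymongo (database and collection names are hypothetical) that walks two id-sorted cursors in lockstep, flagging changed documents and documents missing from collection 2:

from pymongo import MongoClient

client = MongoClient()
db = client["mydb"]  # hypothetical database name
cur1 = iter(db["collection1"].find().sort("id", 1))
cur2 = iter(db["collection2"].find().sort("id", 1))

def next_or_none(it):
    return next(it, None)

doc1, doc2 = next_or_none(cur1), next_or_none(cur2)
while doc1 is not None:
    if doc2 is None or doc1["id"] < doc2["id"]:
        print("missing from collection 2:", doc1["id"])
        doc1 = next_or_none(cur1)
    elif doc1["id"] > doc2["id"]:
        # Present only in collection 2 (newly added); skip it.
        doc2 = next_or_none(cur2)
    else:
        # Same id: compare remaining fields; dict equality ignores
        # field order, matching the note in step 1.
        f1 = {k: v for k, v in doc1.items() if k != "_id"}
        f2 = {k: v for k, v in doc2.items() if k != "_id"}
        if f1 != f2:
            print("changed:", doc1["id"])
        doc1, doc2 = next_or_none(cur1), next_or_none(cur2)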
If the schema is flat, which in your case it is, you can use a free tool (dataq.io) to compare the data in the two tables.
Disclaimer: I am the founder of this product.
I need to perform some aggregation on an existing table and then use the aggregated table to perform the MapReduce.
The aggregation table is a sort of temporary table, created only so that it can be used in the MapReduce. The record set in the temporary table reaches around 8M.
Is there a way to avoid the temporary table?
One way could be to run a find() query inside the map() function and emit the aggregated result (initially stored in the aggregation table).
However, I have not been able to implement this.
Is there a way? Please help.
You can use the "query" parameter of MongoDB's mapReduce. With this parameter, the data sent to the map function is filtered before processing.
More info is in the MapReduce documentation.
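For illustration, a minimal sketch in Python with pymongo, issuing the mapReduce command through db.command; the database, collection, and field names are hypothetical. (Note that mapReduce is deprecated in modern MongoDB in favor of the aggregation pipeline.)

from pymongo import MongoClient
from bson.code import Code

client = MongoClient()
db = client["mydb"]  # hypothetical database name

map_fn = Code("function() { emit(this.category, this.amount); }")
reduce_fn = Code("function(key, values) { return Array.sum(values); }")

db.command(
    "mapReduce",
    "orders",                       # hypothetical source collection
    map=map_fn,
    reduce=reduce_fn,
    out="order_totals",             # output collection
    query={"status": "completed"},  # filter applied before the map phase
)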