MongoDB with Hive DW

I am planning to build a data warehouse in MongoDB for the first time. It has been suggested to me that I should use Hadoop for map-reduce in case I need more complex analyses of the datasets.
Having discovered Hive, I liked the idea of running map-reduce jobs through a language similar to SQL. My doubt is: can I run HiveQL queries directly against MongoDB without having to build a Hive DW on top of Hadoop? In all the use cases I found, it only seems to work on data stored in HDFS.

You could use the MongoDB Connector for Hadoop:
http://docs.mongodb.org/ecosystem/tools/hadoop/
MongoDB also has its own map-reduce support, without Hadoop:
http://docs.mongodb.org/manual/core/map-reduce/
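For simple cases that map-reduce can be driven straight from PyMongo, with no Hadoop involved. Below is a minimal, hedged sketch; the connection string, database, collection, and field names are hypothetical, and it assumes a MongoDB version that still accepts the mapReduce command (newer releases steer you towards the aggregation pipeline instead).
from pymongo import MongoClient
from bson.code import Code

# Hypothetical connection and collection names, for illustration only.
client = MongoClient("mongodb://localhost:27017")
db = client["warehouse"]

# JavaScript map and reduce functions executed by the MongoDB server.
mapper = Code("function () { emit(this.category, this.amount); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# Run map-reduce entirely inside MongoDB and return the results inline.
result = db.command({
    "mapReduce": "sales",
    "map": mapper,
    "reduce": reducer,
    "out": {"inline": 1},
})

for row in result["results"]:
    print(row["_id"], row["value"])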

Related

Druid SQL vs Druid Native Query

I have a question about Druid queries.
According to the official documentation, there are two query languages: Druid SQL and native queries.
In my case I feel more comfortable using Druid SQL, because I don't know native queries well and my code stays simpler.
But I don't know the difference in performance between the two query languages.
Is there a large difference in performance between them?
I saw a Druid forum post from July 2019 recommending Druid SQL for Druid 0.15.0 or later (for now, the latest Druid version is 0.23.0).
Which is better to use, Druid SQL or native queries?
But I don't know the difference in performance between the two query languages.
Is there a large difference in performance between them?
Every Druid SQL query gets translated into a native query. According to the docs, there's a slight overhead in translating the query from SQL to native, but that's the only minor performance penalty to using Druid SQL compared to native queries.
To specifically answer your question of
Which is better to use, Druid SQL or native queries?
continue using what you are comfortable with.
If you'd like to learn more about best practices and native queries, the linked doc goes into quite a bit of detail.
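To make the comparison concrete, here is a hedged Python sketch that sends the same logical query both ways over Druid's HTTP API. The broker address, datasource, and column names are placeholders, and it assumes a reachable broker; the point is only that the SQL path goes through one extra translation step before the same native machinery runs.
import requests

# Hypothetical broker address and datasource, for illustration only.
BROKER = "http://localhost:8082"

# Druid SQL: submitted to /druid/v2/sql and translated into a native
# query by the broker before execution.
sql_query = {"query": "SELECT channel, COUNT(*) AS edits FROM wikipedia GROUP BY channel"}
sql_result = requests.post(f"{BROKER}/druid/v2/sql", json=sql_query).json()

# Roughly equivalent native query: submitted directly to /druid/v2,
# skipping the SQL planning step.
native_query = {
    "queryType": "groupBy",
    "dataSource": "wikipedia",
    "granularity": "all",
    "dimensions": ["channel"],
    "aggregations": [{"type": "count", "name": "edits"}],
    "intervals": ["1000-01-01/3000-01-01"],
}
native_result = requests.post(f"{BROKER}/druid/v2", json=native_query).json()

print(sql_result)
print(native_result)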

How to use the appropriate index while querying mongo using mongospark connectors via withPipeline feature?

I am trying to load a huge amount of data from MongoDB; the data size is in the millions of documents. So it makes sense to pull this data using the appropriate indexes and also to query MongoDB in parallel. That's why I am using Mongo Spark for batch reads.
How do I use the appropriate index while querying MongoDB with the Mongo Spark connector via the withPipeline feature?
Also, I was exploring "com.mongodb.reactivestreams.client.MongoCollection". If possible, can someone throw some light on this?
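The original thread carries no accepted answer, but as a rough sketch of the idea: if the first stage the connector sends to MongoDB is a $match on an indexed field, the server can satisfy it from that index before any documents reach Spark. The PySpark snippet below is illustrative only; the URI, database, collection, and field names are placeholders, and the explicit "pipeline" option is an assumption about the connector version in use (as the answer further down this page notes, a plain DataFrame filter is pushed down to a $match as well).
from pyspark.sql import SparkSession

# Hypothetical connection details, for illustration only.
spark = (SparkSession.builder
         .appName("mongo-indexed-read")
         .config("spark.mongodb.input.uri", "mongodb://localhost:27017/mydb.events")
         .getOrCreate())

# Assumption: the connector version supports passing an aggregation pipeline
# as a read option. The $match runs server-side, so an index on customer_id
# can be used before documents are shipped to the Spark partitions.
pipeline = '[{"$match": {"customer_id": {"$gte": 1000, "$lt": 2000}}}]'
df = (spark.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("pipeline", pipeline)
      .load())

# Alternatively, an ordinary filter is pushed down to a $match as well.
df_filtered = (spark.read.format("com.mongodb.spark.sql.DefaultSource")
               .load()
               .filter("customer_id >= 1000 AND customer_id < 2000"))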

What is the work logic of the Mongo-Spark Connector?

I've been trying to understand how the mongo-spark connector works under the hood, but I'm still not getting the whole work logic behind it.
Details:
I'm trying to use Mongo-Spark to run a Spark job that mainly performs text search against a MongoDB collection.
Spark and MongoDB run on two different clusters.
So I created the following Spark-Mongo DataFrame:
entity_df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.input.database", "WikiData") \
    .option("spark.mongodb.input.collection", "entities_slim") \
    .load()
entity_df.cache()
df = entity_df.filter(entity_df['id'] == "Q2834810").show()
Does the first instruction mean that the entities_slim collection is being copied from the MongoDB cluster to the Spark cluster and represented as a Spark DataFrame?
If yes, does this mean that the connector is just a tool that only reads/writes data between MongoDB and Spark?
If yes, is there a way to create Spark jobs that run MongoDB queries through the MongoDB engine? Something like:
import pymongo
from pyspark import SparkContext
spark_rdd.map(lambda x: entities.find_one({'id': x}))
Note that executing the statement entity_df.filter(entity_df['id'] == "Q2834810").show() runs much slower than querying MongoDB directly using pymongo.
If yes, does this mean that the connector is just a tool that only reads/writes data between MongoDB and Spark?
To some extent, but it doesn't mean that the entities_slim collection is being copied from the MongoDB cluster.
Selections (filters) are converted to aggregation pipelines:
When using filters with DataFrames or Spark SQL, the underlying Mongo Connector code constructs an aggregation pipeline to filter the data in MongoDB before sending it to Spark
This type of behavior is common for data source APIs in general: projections and selections are pushed down to the source whenever possible.
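One way to see this pushdown in practice is to inspect the query plan. The snippet below is illustrative only and reuses the DataFrame from the question; the exact plan text depends on the Spark and connector versions.
# Filters on the DataFrame are translated by the connector into a $match
# stage that runs inside MongoDB, rather than a full collection scan
# followed by a Spark-side filter.
filtered = entity_df.filter(entity_df['id'] == "Q2834810")
filtered.explain(True)  # look for the pushed filter in the scan node of the physical plan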
So going back to your concern:
Note that executing the statement entity_df.filter(entity_df['id'] == "Q2834810").show() runs much slower than querying MongoDB directly using pymongo.
This is to be expected. Neither Apache Spark nor MongoDB aggregation pipelines are designed for low-latency, single-item queries. Both are intended for large-scale batch processing. If you need fast single-item access, don't use Apache Spark in the first place: that is what you have databases for.
Finally, if you run a
job that mainly performs text search against a MongoDB collection,
then the built-in MongoDB text search capabilities (imperfect as they are) might be a better choice.
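For reference, a minimal PyMongo sketch of that built-in text search. The field name and search terms are hypothetical, and it assumes a text index exists (or is created) on the field being searched; the database and collection names are taken from the question.
from pymongo import MongoClient, TEXT

# Connection details are illustrative only.
client = MongoClient("mongodb://localhost:27017")
entities = client["WikiData"]["entities_slim"]

# A text index is required before $text queries will work; "labels" is a hypothetical field.
entities.create_index([("labels", TEXT)])

# The text search runs entirely inside MongoDB and can return a relevance score.
cursor = entities.find(
    {"$text": {"$search": "solar system"}},
    {"score": {"$meta": "textScore"}},
).sort([("score", {"$meta": "textScore"})])

for doc in cursor:
    print(doc["_id"], doc.get("score"))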
Using MongoDB and the Spark connector you can load MongoDB data into Spark and leverage Spark's range of APIs (Spark SQL, Spark Streaming, machine learning, and graph APIs) to perform rich aggregations on your MongoDB data.
This enables you to offload the analysis of your data to Spark.
The connector is two-way: you can load MongoDB data into Spark and write Spark RDDs back to MongoDB.
Does the first instruction mean that the entities_slim collection is being copied from the MongoDB cluster to the Spark cluster and represented as a Spark DataFrame?
Yes
If yes, does this mean that the connector is just a tool that only reads/writes data between MongoDB and Spark?
Yes
If yes, is there a way to create Spark jobs that run MongoDB queries through the MongoDB engine?
If you need the query executed by the MongoDB engine itself, query MongoDB directly; otherwise, process your data in Spark and store the results back into MongoDB.
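As a hedged illustration of that round trip (read from MongoDB, process in Spark, write the result back), using the same DataFrame-based connector as in the question; the "type" column and the output database/collection names are hypothetical.
# Aggregate in Spark, then write the result back to MongoDB.
counts = entity_df.groupBy("type").count()

(counts.write
    .format("com.mongodb.spark.sql.DefaultSource")
    .mode("append")
    .option("spark.mongodb.output.uri", "mongodb://localhost:27017/WikiData.entity_type_counts")
    .save())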

Mongodb aggregation framework alternative for Cassandra

For external reasons we are thinking of switching from MongoDB to Cassandra. Cassandra scales well, writes fast, and reads well. But where we are really stuck is query features. We use MongoDB's query features actively, and we also use Mongo's aggregation framework very heavily. So could you please point me to an alternative technology that could compensate for MongoDB's rich queries and aggregation framework? Could it be Hadoop or Spark?
Apache Spark is the most powerful Cassandra complement. With Spark you can group, join, sort, filter, and do whatever you can imagine. There are some projects that have built an abstraction layer in Spark over Cassandra and let you apply these operations (a brief sketch follows the list below).
Two commonly used projects are:
Stratio Deep
Datastax Connector
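As a hedged sketch of what that looks like with the DataStax connector's DataFrame API; the host, keyspace, table, and column names are placeholders, and it assumes the spark-cassandra-connector package is on the Spark classpath.
from pyspark.sql import SparkSession

# Hypothetical connection details, for illustration only.
spark = (SparkSession.builder
         .appName("cassandra-aggregation")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Hypothetical keyspace and table names.
orders = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="shop", table="orders")
          .load())

# A group-by aggregation that Cassandra's own query language cannot express.
orders.groupBy("customer_id").sum("total").show()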

HiveQL in MongoDB

I have been studying NoSQL and Hadoop for data warehousing; however, I have never worked with these technologies before, and I would like to ask whether the following is possible, to check if my understanding of these technologies is right.
If I have my data stored in MongoDB, can I use Hadoop with Hive to make HiveQL queries directly against MongoDB and store the output of those queries as views back in MongoDB again, instead of in HDFS?
Also, if I understand correctly, most NoSQL databases don't support joins and aggregates, but it's possible to implement them through map-reduce. If HiveQL queries are map-reduce jobs, then when I do a join in HiveQL would it already be automatically "joining" the MongoDB data in map-reduce for me, with no need to worry about the lack of support for joins and aggregates in MongoDB?
MongoDB does have very good support for aggregation-type functions. There are no joins, of course. MongoDB schemas are usually designed in such a way that you typically do not need a join.
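For instance, a grouped sum that would be a GROUP BY in SQL can be expressed directly with MongoDB's aggregation pipeline. A small PyMongo sketch, with hypothetical connection, collection, and field names:
from pymongo import MongoClient

# Hypothetical connection and collection, for illustration only.
client = MongoClient("mongodb://localhost:27017")
orders = client["warehouse"]["orders"]

# Rough equivalent of: SELECT customer_id, SUM(total) ... GROUP BY customer_id
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer_id", "total_spent": {"$sum": "$total"}}},
    {"$sort": {"total_spent": -1}},
]

for row in orders.aggregate(pipeline):
    print(row["_id"], row["total_spent"])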
HiveQL operates on 'tables' in HDFS; that's the default behavior.
But there is a MongoDB-Hadoop Connector: http://docs.mongodb.org/ecosystem/tools/hadoop/
which will let you query MongoDB data from within Hadoop.
To use map-reduce, you can also do that with MongoDB itself (without Hadoop).
See this: http://docs.mongodb.org/manual/core/map-reduce/