I've been trying to understand how the mongo-spark connector works under the hood, but I still don't fully understand the logic behind it.
Details:
I'm trying to use Mongo-Spark to run a Spark job that mainly performs text search against a MongoDB collection.
Spark and MongoDB run on two different clusters.
So I created the following MongoDB-backed Spark DataFrame:
entity_df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource") \
.option("spark.mongodb.input.database", "WikiData") \
.option("spark.mongodb.input.collection", "entities_slim") \
.load()
entity_df.cache()
df = entity_df.filter(entity_df['id'] == "Q2834810").show()
Does the first instruction mean that the entities_slim collection is being copied from the MongoDB cluster to the Spark cluster and represented as a Spark DataFrame?
If yes, does this mean that the connector is just a tool that only reads/writes data between MongoDB and Spark?
If yes, is there a way to create Spark jobs that run MongoDB queries on the MongoDB engine? Something like:
import pymongo
from pyspark import SparkContext

# 'entities' would be a pymongo collection handle
spark_rdd.map(lambda x: entities.find_one({'id': x}))
Note that executing the statement entity_df.filter(entity_df['id'] == "Q2834810").show() runs much slower than directly querying MongoDB using pymongo.
If yes, does this mean that the connector is just a tool that only reads/writes data between MongoDB and Spark?
To some extent, but it doesn't mean that
the entities_slim collection is being copied from the MongoDB cluster.
Selections (filters) are converted to aggregation pipelines:
When using filters with DataFrames or Spark SQL, the underlying Mongo Connector code constructs an aggregation pipeline to filter the data in MongoDB before sending it to Spark.
This kind of behavior is common for Data Source APIs in general: projections and selections are pushed down to the source when possible.
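For illustration, the filter above ends up running on the MongoDB side as something close to the following pipeline. This is only a rough pymongo sketch of what the connector asks MongoDB to do, not its exact internal code, and the host name is a placeholder:

import pymongo

# Roughly the pipeline MongoDB runs when the DataFrame filter
# entity_df['id'] == "Q2834810" is pushed down as a $match stage.
client = pymongo.MongoClient("mongodb://mongo-host:27017")  # placeholder host
pipeline = [
    {"$match": {"id": "Q2834810"}},   # selection pushed down to MongoDB
    {"$project": {"id": 1}},          # projection pushdown, if only 'id' is selected
]
for doc in client["WikiData"]["entities_slim"].aggregate(pipeline):
    print(doc)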
So going back to your concern:
Note that executing the statement entity_df.filter(entity_df['id'] == "Q2834810").show() runs much slower than directly querying MongoDB using pymongo.
This is to be expected. Neither Apache Spark nor MongoDB aggregation pipelines are designed for low-latency, single-item queries. Both are intended for large-scale batch processing. If you need fast single-item access, don't use Apache Spark in the first place; that is what you have databases for.
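For a single-document lookup, going straight to MongoDB is the natural fit. A minimal pymongo sketch (the URI is a placeholder):

import pymongo

client = pymongo.MongoClient("mongodb://mongo-host:27017")  # placeholder URI
entities = client["WikiData"]["entities_slim"]

# A single indexed lookup, served entirely by MongoDB with no Spark involved.
entities.create_index("id")
doc = entities.find_one({"id": "Q2834810"})
print(doc)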
Finally, if your job mainly performs text search against a MongoDB collection, MongoDB's built-in text search capabilities (as imperfect as they are) might be a better choice.
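A minimal sketch of MongoDB text search via pymongo; the "label" field name and the search string are only assumptions about the document shape:

import pymongo

client = pymongo.MongoClient("mongodb://mongo-host:27017")  # placeholder URI
entities = client["WikiData"]["entities_slim"]

# A text index is required before $text queries work.
entities.create_index([("label", pymongo.TEXT)])

cursor = entities.find(
    {"$text": {"$search": "solar system"}},
    {"score": {"$meta": "textScore"}},
).sort([("score", {"$meta": "textScore"})]).limit(5)
for doc in cursor:
    print(doc)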
Using the MongoDB Spark connector you can load MongoDB data into Spark and leverage Spark's range of APIs (Spark SQL, Spark Streaming, machine learning, and graph APIs) to perform rich aggregations on your MongoDB data.
This lets you offload the analysis of your data to Spark.
The connector is two-way: you can load MongoDB data into Spark and write Spark RDDs or DataFrames back to MongoDB.
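As a sketch of the write path, assuming the 2.x connector and reusing the sqlContext from the question (the output option names simply mirror the input options used there, and "type" is a hypothetical field):

# Read from MongoDB, transform in Spark, and write the result back.
entity_df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.input.database", "WikiData") \
    .option("spark.mongodb.input.collection", "entities_slim") \
    .load()

result_df = entity_df.groupBy("type").count()   # "type" is a hypothetical field

result_df.write.format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.output.database", "WikiData") \
    .option("spark.mongodb.output.collection", "entity_counts") \
    .mode("append") \
    .save()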
Does the first instruction mean that the entities_slim collection is
being copied from the MongoDB cluster to the Spark cluster and
represented as a Spark DataFrame?
Yes
If yes, does this mean that the connector is just a tool that only
reads/writes data between MongoDB and Spark?
Yes
If yes, is there a way to create Spark jobs that run MongoDB queries on
the MongoDB engine?
If you need to run queries against MongoDB, run them in MongoDB itself; you can then process the data in Spark and store the results back to MongoDB.
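If you really want Spark tasks to issue queries that MongoDB itself executes (as in the pymongo snippet in the question), one common pattern is to open a client per partition rather than per record. A hedged sketch; the URI, IDs, and the existing SparkContext sc are placeholders:

import pymongo

def lookup_partition(ids):
    # One MongoClient per partition, not per record, to avoid connection churn.
    client = pymongo.MongoClient("mongodb://mongo-host:27017")  # placeholder URI
    entities = client["WikiData"]["entities_slim"]
    for entity_id in ids:
        yield entities.find_one({"id": entity_id})
    client.close()

id_rdd = sc.parallelize(["Q2834810", "Q42"])    # sc: the existing SparkContext
docs = id_rdd.mapPartitions(lookup_partition).collect()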
Related
This is a general question but I am hoping someone can answer it. I am comparing query execution times between MongoDB and Spark SQL. Specifically, I created a MongoDB collection of 1 million entries from a .csv file and ran a few queries on it using mongosh in Compass. Then, using the Spark shell and the Spark-MongoDB connector, I loaded this collection from MongoDB into Spark as an RDD. After that I converted the RDD into a DataFrame and started running Spark SQL queries on it. I ran the same queries in Spark SQL as in MongoDB while measuring the query execution times in both cases. The result was that for fairly simple queries like
SELECT ... FROM ... WHERE ... ORDER BY ...
MongoDB was significantly faster than Spark SQL. In one of those examples the execution times were around 800 ms for MongoDB and around 1800 ms for Spark SQL.
From my understanding, a Spark DataFrame automatically distributes the work and runs it in parallel, so the Spark SQL query should be faster than the MongoDB query. Can anyone explain?
I was right: Spark SQL should be faster on the DataFrame than the MongoDB queries. It appears that using the connector to import the data is what causes the execution time problem. I tried loading the data from a .csv file straight into a DataFrame, and the query time was much faster than both the query on the imported data (from the MongoDB connector) and MongoDB itself.
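A sketch of the comparison being described, with hypothetical paths, URIs, and column names; only the general shape of the two load paths is taken from the description above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the same dataset two ways and time the same query against each.
df_from_csv = spark.read.csv("/data/entries.csv", header=True, inferSchema=True)  # hypothetical path
df_from_mongo = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.input.uri", "mongodb://mongo-host:27017/testdb.entries") \
    .load()

df_from_csv.createOrReplaceTempView("entries_csv")
df_from_mongo.createOrReplaceTempView("entries_mongo")

# The same SELECT ... WHERE ... ORDER BY query against both views;
# "amount" is a hypothetical column.
spark.sql("SELECT * FROM entries_csv WHERE amount > 100 ORDER BY amount").count()
spark.sql("SELECT * FROM entries_mongo WHERE amount > 100 ORDER BY amount").count()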
I am trying to load a huge amount of data from MongoDB; the data size is in the millions of documents. It makes sense to pull this data using appropriate indexes and to query Mongo in parallel. That's why I am using mongo-spark for batch reads.
How can I make sure the appropriate index is used when querying Mongo with the mongo-spark connector's withPipeline feature?
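One way to let MongoDB use its index is to make the selection a leading $match, which the connector does for you when a DataFrame filter is pushed down (as described in the first answer above). A minimal sketch assuming an existing SparkSession, the 2.x connector, and placeholder URI/field names:

df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.input.uri", "mongodb://mongo-host:27017/mydb.events") \
    .load()

# An equality filter on an indexed field (here the hypothetical "customer_id")
# is pushed down to MongoDB as a $match stage, so the index can serve it.
subset = df.filter(df["customer_id"] == "C1001")
subset.explain()   # inspect the physical plan, including any pushed filters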
Also, I was exploring "com.mongodb.reactivestreams.client.MongoCollection". If possible, can someone shed some light on this?
For external reasons we are considering switching from MongoDB to Cassandra. Cassandra scales well, writes fast, and reads reasonably well, but where we are really stuck is query features. We actively use MongoDB's query features and its aggregation framework. Could you point me to an alternative technology that could compensate for MongoDB's rich queries and aggregation framework? Could it be Hadoop or Spark?
Apache Spark is the most powerful Cassandra complement. With Spark you can group, join, sort, filter, and whatever else you can imagine. There are several projects that build an abstraction layer in Spark over Cassandra and let you apply these operations, as sketched after the list below.
Two common projects are:
Stratio Deep
Datastax Connector
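As an illustration of the kind of grouping and filtering Spark adds on top of Cassandra, here is a hedged PySpark sketch using the DataStax spark-cassandra-connector; the host, keyspace, table, and column names are all placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .config("spark.cassandra.connection.host", "cassandra-host") \
    .getOrCreate()

orders = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace="shop", table="orders") \
    .load()

# Group, filter, and sort: operations Cassandra alone does not offer freely.
top_customers = orders.filter(orders["status"] == "PAID") \
    .groupBy("customer_id") \
    .agg(F.sum("total").alias("spent")) \
    .orderBy(F.desc("spent"))
top_customers.show(10)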
I have been studying NoSQL and Hadoop for data warehousing; however, I have never worked with these technologies before, and I would like to ask whether the following is possible, to check that I have understood them correctly.
If I have my data stored in MongoDB, can I use Hadoop with Hive to run HiveQL queries directly against MongoDB and store the output of those queries back in MongoDB as views, instead of in HDFS?
Also, if I understand correctly, most NoSQL databases don't support joins and aggregates, but it's possible to implement them through map-reduce. Since HiveQL queries are map-reduce jobs, when I do a join in HiveQL, would it automatically "join" the MongoDB data in map-reduce for me, so that I don't need to worry about MongoDB's lack of support for joins and aggregates?
MongoDB does have very good support for aggregation-style functions. There are no joins, of course; MongoDB schemas are usually designed so that you typically do not need a join.
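For example, a typical group-and-sum runs entirely inside MongoDB through the aggregation framework. A minimal pymongo sketch; the URI, collection, and field names are made up for illustration:

import pymongo

client = pymongo.MongoClient("mongodb://mongo-host:27017")  # placeholder URI
orders = client["shop"]["orders"]

# Filter, group-by with a sum, and sort, all executed by MongoDB.
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
for row in orders.aggregate(pipeline):
    print(row)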
HiveQL operates on 'Tables' in HDFS. That's the default behavior.
But there is a MongoDB-Hadoop Connector: http://docs.mongodb.org/ecosystem/tools/hadoop/
which lets you query MongoDB data from within Hadoop.
If you want to use map-reduce, you can do that with MongoDB itself (without Hadoop).
See: http://docs.mongodb.org/manual/core/map-reduce/
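A minimal pymongo sketch of running map-reduce inside MongoDB itself via the mapReduce database command (the command is deprecated in recent MongoDB versions, where the aggregation pipeline is preferred; the URI, collection, and field names are placeholders):

import pymongo
from bson.code import Code

client = pymongo.MongoClient("mongodb://mongo-host:27017")  # placeholder URI
db = client["shop"]

mapper = Code("function () { emit(this.customer_id, this.amount); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# Runs entirely inside MongoDB; results are returned inline.
result = db.command({
    "mapReduce": "orders",
    "map": mapper,
    "reduce": reducer,
    "out": {"inline": 1},
})
for row in result["results"]:
    print(row)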
I am planning to build a data warehouse in MongoDB for the first time. It has been suggested to me that I should use Hadoop for map-reduce in case I need more complex analyses of the datasets.
Having discovered Hive, I liked the idea of doing map-reduce through a language similar to SQL. But my question is: can I run HiveQL queries directly against MongoDB without needing to build a Hive data warehouse on top of Hadoop? In all the use cases I found, it seems to only work on data stored in Hadoop's HDFS.
You could use the MongoDB Connector for Hadoop:
http://docs.mongodb.org/ecosystem/tools/hadoop/
MongoDB on its own has a map-reduce facility too:
http://docs.mongodb.org/manual/core/map-reduce/