Handling huge transfers of data to a Spark cluster from MongoDB

Problem statement:
To transfer data from MongoDB to Spark optimally, with minimal latency.
Problem description:
I have my data stored in MongoDB and want to process it (on the order of ~100-500 GB) using Apache Spark.
I used the MongoDB Spark connector and was able to read/write data from/to MongoDB (https://docs.mongodb.com/spark-connector/master/python-api/).
The problem is having to create the Spark DataFrame on the fly each time.
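For context, here is roughly how I create the DataFrame on each run (a minimal sketch; the URI, database, and collection names are placeholders):

    # Minimal sketch of the per-run read; URI/database/collection are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("mongo-to-spark")
             .config("spark.mongodb.input.uri",
                     "mongodb://host:27017/mydb.mycollection")  # placeholder URI
             .getOrCreate())

    # Each job re-reads the full collection into a DataFrame.
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()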
Is there a solution to handling such huge data transfers?
I looked into:
- the Spark Streaming API
- Apache Kafka
- Amazon S3 and EMR
but couldn't decide whether any of them was the optimal way to do it.
What strategy would you recommend for handling such data transfers?
Would keeping the data on the Spark cluster and syncing just the deltas (changes in the database) to a local file be the way to go, or is reading from MongoDB each time the only (or optimal) way to go about it?
EDIT 1:
The following suggests reading the data off MongoDB directly (thanks to secondary indexes, data retrieval is faster): https://www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
EDIT 2:
On the advantages of the Parquet format, see: What are the pros and cons of parquet format compared to other formats?
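A sketch of what caching a Parquet copy on the cluster could look like, so later jobs skip the MongoDB pull (paths are placeholders; this assumes the session from the sketch above):

    # Hedged sketch: persist a Parquet copy so later jobs avoid re-reading MongoDB.
    # The output path is a placeholder.
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
    df.write.mode("overwrite").parquet("/data/cache/mycollection.parquet")

    # Later jobs read the (much faster) columnar copy instead:
    cached = spark.read.parquet("/data/cache/mycollection.parquet")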

Related

What is Data Locality in Apache Spark?

I am a newbie to apache-spark.
I am having a hard time understanding data locality in Apache Spark. I tried to read this article: https://data-flair.training/blogs/apache-spark-performance-tuning/
which mentions "PROCESS_LOCAL" and "NODE_LOCAL". Are there settings I need to configure?
Can someone take an example and explain it to me?
Thanks,
Padd
Data locality, in simple terms, means doing the computation on the node where the data resides.
Just to elaborate:
Spark is a cluster computing system, not a storage system like HDFS or a NoSQL database. Spark is used to process data stored in such distributed systems.
In a typical installation, Spark is installed on the same nodes as HDFS/NoSQL.
When a Spark application processes data stored in HDFS, Spark tries to place computation tasks alongside the HDFS blocks.
With HDFS, the Spark driver contacts the NameNode for the DataNodes (ideally local ones) containing the various blocks of a file or directory, as well as their locations (represented as InputSplits), and then schedules the work to the Spark workers.
Note: Spark's compute nodes/workers should run on the storage nodes.
This is how data locality is achieved in Spark.
The advantage is a performance gain, as less data is transferred over the network.
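To answer the configuration part of your question: Spark exposes locality wait settings, which control how long the scheduler holds a task hoping for a better locality level before downgrading. A minimal sketch (the values shown are Spark's defaults, for illustration only):

    # Locality wait settings: how long to wait for a more local slot before
    # falling back to a less local one (values here are illustrative defaults).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.locality.wait", "3s")          # base wait per level
             .config("spark.locality.wait.process", "3s")  # PROCESS_LOCAL -> NODE_LOCAL
             .config("spark.locality.wait.node", "3s")     # NODE_LOCAL -> RACK_LOCAL
             .getOrCreate())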
Try the articles below for further reading; there are plenty more documents available via a quick search:
http://www.russellspitzer.com/2017/09/01/Spark-Locality/
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html

Performance issue with writing data to Snowflake using a Spark df

I am trying to read data from an AWS RDS system and write it to Snowflake using Spark.
My Spark job makes a JDBC connection to RDS and pulls the data into a dataframe; I then write that same dataframe to Snowflake using the Snowflake connector.
Problem statement: when I try to write the data, even 30 GB takes a long time to write.
Solutions I tried:
1) repartitioning the dataframe before writing.
2) caching the dataframe.
3) taking a count of the df before writing to reduce scan time at write.
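For reference, my attempts look roughly like this (a sketch; the partition count is illustrative, and df is the dataframe pulled from RDS):

    # Sketch of the attempts above; the partition count is a guess, not a tuning claim.
    df = df.repartition(64)  # 1) repartition before writing
    df.cache()               # 2) cache the dataframe
    df.count()               # 3) materialize the cache so the write doesn't re-scan RDS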
It may have been a while since this question was asked. If you are preparing the dataframe, or using another tool to prepare your data for the move to Snowflake, the Python connector integrates very nicely.
Along with the general troubleshooting recommendations in the comments above, which are great: were you able to resolve the JDBC connection with the recent updates?
Some other troubleshooting to consider:
Save time by going directly from Spark to Snowflake with the Spark connector: https://docs.snowflake.net/manuals/user-guide/spark-connector.html
For larger data sets, increasing the warehouse size for the session you are using, and loading data in smaller 10 MB to 100 MB files, will generally increase compute speed.
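A minimal write sketch for that direct path (all connection values are placeholders for your account):

    # Minimal sketch of a direct write with the Snowflake Spark connector.
    # All connection values below are placeholders.
    sf_options = {
        "sfURL": "myaccount.snowflakecomputing.com",
        "sfUser": "user",
        "sfPassword": "password",
        "sfDatabase": "mydb",
        "sfSchema": "public",
        "sfWarehouse": "mywh",
    }

    (df.write
       .format("net.snowflake.spark.snowflake")
       .options(**sf_options)
       .option("dbtable", "target_table")  # placeholder table name
       .mode("append")
       .save())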
Let me know what you think, I would love to hear how you solved it.

How do I store old streaming data in Databricks Spark?

I am new to Spark Streaming and Azure Databricks. I have read many articles on how Spark works, processes data, etc. But what about old data? If Spark works on live, interactive data, can it also hold my data from 2 weeks or 2 months ago? Or, if I have to move the data somewhere after transformation, where should I move it, and how do I clear Spark's memory? Will it be stored on SSD only?
Azure Databricks supports several data stores (as a source and as a target for data at rest). A good practice for big data is mounting an Azure Data Lake Store. If you have a streaming data source (like Kafka or Event Hubs), you can use the Data Lake Store as a sink and, of course, reuse the stored data for further analytics.
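As a rough sketch of that pattern (mount paths are placeholders, df is a streaming DataFrame):

    # Sketch: persist the stream to a mounted Data Lake path (paths are placeholders).
    (df.writeStream
       .format("parquet")
       .option("path", "/mnt/datalake/events")                      # data at rest
       .option("checkpointLocation", "/mnt/datalake/_checkpoints")  # stream state
       .start())

    # Weeks or months later, the old data is just files: read it back as a batch DataFrame.
    old = spark.read.parquet("/mnt/datalake/events")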
See https://docs.azuredatabricks.net/spark/latest/data-sources/index.html for supported data sources.

Storing streaming data in Hive using Spark

I am creating an application that receives streaming data, which goes into Kafka and then on to Spark. I consume the data, apply some logic, and then save the processed data into Hive. The velocity of the data is very high: I am getting 50K records per minute. Spark Streaming uses a 1-minute window in which it processes the data and saves it into Hive.
My question is: from a production perspective, is this architecture fine? If yes, how can I save the streaming data into Hive? What I am doing is creating a dataframe of the 1-minute window data and saving it into Hive using
results.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("stocks")
I have not created the pipeline yet. Is this fine, or do I have to modify the architecture?
Thanks
I would give it a try!
BUT kafka->spark->hive is not the optimal pipeline for your use case.
Hive is normally backed by HDFS, which is not designed for large numbers of small inserts/updates/selects.
So your plan can end up with the following problems:
- many small files, which leads to bad performance
- your window becomes too small because the processing takes too long
Suggestion:
Option 1:
- use Kafka just as a buffer queue and design your pipeline like:
- kafka->hdfs (e.g. with Spark or Flume)->batch Spark to Hive/Impala table
Option 2:
- kafka->Flume/Spark to HBase/Kudu->batch Spark to Hive/Impala
Option 1 has no "realtime" analysis option; it depends on how often you run the batch Spark job.
Option 2 is the choice I would recommend: store, say, 30 days in HBase and all older data in Hive/Impala. With a view you will be able to join new and old data for realtime analysis.
Kudu makes the architecture even easier.
Saving data into Hive tables can be tricky if you want to partition it and use it via HiveQL.
But basically it would work like the following:
xml.write.format("parquet").mode("append").saveAsTable("test_ereignis_archiv")
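If you do need partitioning, a hedged sketch of the same write (the "ingest_date" column is an assumption, not from your code):

    # Hedged sketch: partitioned variant of the write above.
    # "ingest_date" is an assumed column, not part of the original example.
    (xml.write
        .format("parquet")
        .partitionBy("ingest_date")
        .mode("append")
        .saveAsTable("test_ereignis_archiv"))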
BR

What is an efficient way to send data from MongoDB to Hadoop?

I was discussing the usage of the MongoDB Connector for Hadoop with a coworker, and he explained that it was very inefficient. He stated that the MongoDB connector utilizes its own map/reduce, and then uses the Hadoop map/reduce, which internally slows down the entire system.
If that is the case, what is the most efficient way to transport my data to the Hadoop cluster? What purpose does the MongoDB connector serve if it is so inefficient? In my scenario, I want to get the daily inserted data from MongoDB (roughly around 10 MB) and put all of it into Hadoop. I should also add that the MongoDB nodes and Hadoop nodes all share the same servers.
The MongoDB Connector for Hadoop reads data directly from MongoDB. You can configure multiple input splits to read data from the same collection in parallel. The Mapper and Reducer jobs are run by Hadoop's Map/Reduce engine, not MongoDB's Map/Reduce.
If your data estimate is correct (only 10 MB per day?), that is a small amount to ingest, and the job may be faster if you don't have any input splits calculated.
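As a hedged illustration of turning off split calculation for a small job (property names are taken from the mongo-hadoop project; verify them against your connector version, and note this sketch assumes an existing SparkContext `sc`):

    # Hedged sketch: reading via the Hadoop connector with split calculation off.
    # Property names are from the mongo-hadoop project; verify for your version.
    conf = {
        "mongo.input.uri": "mongodb://host:27017/mydb.daily",  # placeholder URI
        "mongo.input.split.create_input_splits": "false",      # tiny daily job: skip splits
    }
    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=conf,
    )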
You should be wary of Hadoop and MongoDB competing for resources on the same server, as contention for memory or disk can affect the efficiency of your data transfer.
To transfer your data from MongoDB to Hadoop, you can also use an ETL tool like Talend or Pentaho; it's much easier and more practical! Good luck!