What is Data Locality in Apache Spark? - pyspark

I am a newbie to apache-spark.
I am having a hard time understanding data locality in Apache Spark. I tried reading this article https://data-flair.training/blogs/apache-spark-performance-tuning/ which mentions "PROCESS_LOCAL" and "NODE_LOCAL". Are there settings I need to configure?
Can someone explain it to me with an example?
Thanks,
Padd

Data locality, in simple terms, means doing the computation on the node where the data resides.
Just to elaborate:
Spark is a cluster computing system. It is not a storage system like HDFS or a NoSQL database; Spark is used to process data stored in such distributed systems.
A typical installation has Spark running on the same nodes as HDFS/NoSQL.
When a Spark application processes data stored in HDFS, Spark tries to place its computation tasks alongside the HDFS blocks.
With HDFS, the Spark driver contacts the NameNode to find the DataNodes (ideally local ones) containing the various blocks of a file or directory, along with their locations (represented as InputSplits), and then schedules the work onto the Spark workers.
Note: Spark's compute nodes / workers should be running on the storage nodes.
This is how data locality is achieved in Spark.
The advantage is a performance gain, since less data is transferred over the network.
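To answer the configuration part: usually there is nothing you must configure, because the scheduler handles locality on its own. The main knob is spark.locality.wait (with per-level variants such as spark.locality.wait.node), which controls how long Spark waits for a slot at the preferred locality level before falling back to a less local one. A minimal PySpark sketch (the HDFS path is just a placeholder):

    from pyspark.sql import SparkSession

    # Locality is decided by the scheduler; this setting only tunes how long it
    # waits for a PROCESS_LOCAL/NODE_LOCAL slot before downgrading (default 3s).
    spark = (
        SparkSession.builder
        .appName("locality-demo")
        .config("spark.locality.wait", "3s")
        .getOrCreate()
    )

    # Placeholder path - use a real file on your HDFS cluster.
    df = spark.read.text("hdfs:///data/some_file")
    df.count()

    # In the Spark UI, the "Locality Level" column on the Stages tab shows whether
    # each task ran PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL or ANY.

After running a job, check the locality level of each task in the Spark UI; mostly PROCESS_LOCAL or NODE_LOCAL is what you want to see.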
Try the articles below for further reading - there are plenty of documents available online if you search for the topic:
http://www.russellspitzer.com/2017/09/01/Spark-Locality/
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html

Related

Combine DataFrames in Pyspark

I have a vendor giving me multiple zipped data files on an S3 bucket which I need to read all together for analysis using PySpark. How do I modify the sc.textFile() command?
Also, if I am loading 10 files, how do I reference them? Or do they all go into a single RDD?
On a broader level, how would I tweak the partitions and memory on an Amazon EMR cluster? Each zipped file is 3MB in size, or 1.3GB unzipped.
Thanks
You can have a script which moves all the unzipped files into a directory, and then as part of your Spark code you can refer to that directory:
rdd = sc.textFile("s3://path/to/data/")
As you mentioned, it's 1.3 GB of data, which is not huge for Spark to process. You can leave it to Spark to create the required partitions, but you can also define them when creating the RDD.
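For example, a minimal sketch (the bucket path and partition count are placeholders): sc.textFile accepts a directory, a comma-separated list of paths, or a glob, and everything it matches ends up in a single RDD.

    from pyspark import SparkContext

    sc = SparkContext(appName="read-zipped-files")

    # A glob picks up all the zipped files at once; they all land in one RDD.
    rdd = sc.textFile("s3://path/to/data/*.gz", minPartitions=20)

    # Gzip is not splittable, so each .gz file initially becomes one partition;
    # repartition if the default parallelism is too low for your 1.3 GB of data.
    rdd = rdd.repartition(20)
    print(rdd.getNumPartitions())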
For Amazon EMR, you can spin up smaller nodes based on the type of requirement:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html
Choose the machine type based on the kind of processing (memory intensive / compute intensive).
HTH

Can I use memoization to cache data in the hbase read and write from spark job?

In a Scala project with a Spark job, I used the Spark-HBase connector (SHC) to read data from HBase.
The number of requests is very large, and I am trying to reuse cached data for a certain amount of time. I am wondering if I can do that. Maybe memoization can help?!
HBase itself provides two different kinds of cache (the BlockCache for reads and the MemStore for writes).
A way to cache data in Spark is to load it into an RDD or DataFrame (e.g. a pair RDD keyed by row key) and persist it.
You can also use broadcast variables for small, frequently used lookup data.
About memoization, remember that it is local to a single node, so you can have data memoized on one executor and still get cache misses on all the others.
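For illustration, a minimal sketch in PySpark (the original project is Scala, but the idea is identical; the table name, column and catalog below are placeholders): read the table once through SHC, then persist() the resulting DataFrame so that repeated queries hit Spark's cluster-wide cache instead of going back to HBase.

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hbase-cache-sketch").getOrCreate()

    # Placeholder SHC catalog - adjust namespace, table, column families and types.
    catalog = """{
      "table": {"namespace": "default", "name": "my_table"},
      "rowkey": "key",
      "columns": {
        "key":   {"cf": "rowkey", "col": "key",   "type": "string"},
        "value": {"cf": "cf1",    "col": "value", "type": "int"}
      }
    }"""

    # Read once from HBase through SHC, then cache cluster-wide.
    df = (spark.read
          .options(catalog=catalog)
          .format("org.apache.spark.sql.execution.datasources.hbase")
          .load())

    df.persist(StorageLevel.MEMORY_AND_DISK)  # unlike memoization, this cache is shared by all tasks
    df.count()                                # materialize the cache

    df.filter(df["value"] > 10).show()        # subsequent queries reuse the cached data

Unlike a memoized function, the persisted DataFrame is visible to every executor, so all tasks benefit from the same cached copy until you unpersist() it or it is evicted.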

Handling Huge transfers of data to spark cluster from mongoDB

Problem statement:
To transfer data from MongoDB to Spark optimally, with minimal latency.
Problem Description:
I have my data stored in MongoDB and want to process it (on the order of ~100-500GB) using Apache Spark.
I used the MongoDB Spark connector and was able to read/write data from/to MongoDB (https://docs.mongodb.com/spark-connector/master/python-api/).
The problem is that the Spark dataframe has to be created from scratch, on the fly, every time.
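For context, a read through the connector looks roughly like this (PySpark; the URI is a placeholder, and the format alias can differ between connector versions):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("mongo-read-sketch")
        # Placeholder connection string - point it at your database and collection.
        .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.mycollection")
        .getOrCreate()
    )

    # "mongo" is the format alias registered by the 2.x/3.x connector; older
    # versions use the full class name com.mongodb.spark.sql.DefaultSource.
    df = spark.read.format("mongo").load()
    df.printSchema()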
Is there a solution for handling such huge data transfers?
I looked into:
the Spark Streaming API
Apache Kafka
Amazon S3 and EMR
but couldn't decide whether any of them was the optimal way to do it.
What strategy would you recommend for handling such data transfers?
Would keeping the data on the Spark cluster and syncing just the deltas (changes in the database) to a local file be the way to go, or is reading from MongoDB every time the only way (or the optimal way) to go about it?
EDIT 1:
The following post suggests reading the data from MongoDB directly (thanks to secondary indexes, data retrieval is faster): https://www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
EDIT 2:
The advantages of using the Parquet format: What are the pros and cons of parquet format compared to other formats?
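To make the delta/Parquet idea concrete, a hedged sketch (the paths, the cutoff and the "updated_at" field are all placeholders, and the SparkSession is assumed to carry the MongoDB URI config shown earlier): dump a snapshot from MongoDB to Parquet once, then have later jobs read the cheap columnar snapshot and pull only the changed documents from MongoDB.

    # One-off (or periodic) snapshot: MongoDB -> Parquet on S3/HDFS.
    snapshot = spark.read.format("mongo").load()
    snapshot.write.mode("overwrite").parquet("s3://my-bucket/mongo_snapshot/")

    # Later jobs read the snapshot instead of hitting MongoDB for everything...
    base = spark.read.parquet("s3://my-bucket/mongo_snapshot/")

    # ...and only fetch the deltas, e.g. documents modified since the snapshot.
    deltas = (spark.read.format("mongo").load()
              .filter("updated_at >= '2018-01-01'"))

    # Deduplicate on the document key afterwards if updates overlap the snapshot.
    combined = base.unionByName(deltas)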

Apache Ignite CEP implementation with large datasets

We need a CEP engine which can run over large datasets, so I had a look at alternatives like Flink, Ignite, etc.
When I looked at Ignite, I saw that Ignite's querying API is not capable of running over that much data. The reason is that the data cannot all be stored in the cache (insufficient memory size: 2 TB is needed). I have looked at write-through and read-through, but the data payload (not the key) is not queryable with predicates (for example SQLPredicate).
My question is: am I missing something, or is it really like that?
Thx
Ignite is an in-memory system by design. Cache store (read-through/write-through) allows storing data on disk, but queries only work over in-memory data.
that much data can not be stored into cache(insufficient memory size : 2 TB is needed)
Why not? Ignite is a distributed system, so it is possible to build a cluster with more than 2TB of combined RAM.

What is an efficient way to send data from MongoDB to Hadoop?

I was discussing the usage of the MongoDB Connector for Hadoop with a coworker, and he explained that it was very inefficient. He stated that the MongoDB connector uses its own map-reduce and then uses Hadoop's map-reduce, which internally slows down the entire system.
If that is the case, what is the most efficient way to transport my data to the Hadoop cluster? What purpose does the MongoDB connector serve if it is so inefficient? In my scenario, I want to get the daily inserted data from MongoDB (roughly around 10MB) and put it all into Hadoop. I should also add that each MongoDB node and Hadoop node shares the same server.
The MongoDB Connector for Hadoop reads data directly from MongoDB. You can configure multiple input splits to read data from the same collection in parallel. The Mapper and Reducer jobs are run by Hadoop's Map/Reduce engine, not MongoDB's Map/Reduce.
If your data estimate is correct (only 10MB per day?), that is a small amount to ingest, and the job may be faster if you don't have any input splits calculated.
You should be wary of Hadoop and MongoDB competing for resources on the same server, as contention for memory or disk can affect the efficiency of your data transfer.
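For what it's worth, a hedged sketch of what a read through the connector looks like from PySpark (it assumes the mongo-hadoop jars are on the classpath, and the URI is a placeholder), including the option that skips split calculation for small daily loads:

    from pyspark import SparkContext

    sc = SparkContext(appName="mongo-hadoop-read")

    conf = {
        # Placeholder URI - point it at the collection holding the daily inserts.
        "mongo.input.uri": "mongodb://localhost:27017/mydb.daily_inserts",
        # For ~10MB a day, skipping split calculation avoids unnecessary overhead.
        "mongo.input.split.create_input_splits": "false",
    }

    # MongoInputFormat reads directly from MongoDB; Hadoop/Spark does the processing.
    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=conf,
    )
    print(rdd.take(1))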
To transfer your data from MongoDB to Hadoop you can also use an ETL tool like Talend or Pentaho; it's much easier and more practical. Good luck!