I have 3 nodes hadoop and spark installed. I would like to take data from rdbms into data frame and write this data into parquet on HDFS. "dfs.replication" value is 1 .
When i try this with following command i have seen all HDFS blocks are located on node which i executed spark-shell.
scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")
Is this the intended behaviour or should all blocks be distributed across the cluster?
Thanks
Since you are writing your data to HDFS this does not depend on spark, but on HDFS. From Hadoop : Definitive Guide
Hadoop’s default strategy is to place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy).
So yes, this is the intended behaivour.
Just as #nik says, I do my work with multi cients and it done for me:
This is the python snippet:
columns = xfact.columns
test = sqlContext.createDataFrame(xfact.rdd.map(lambda a: a),columns)
test.write.mode('overwrite').parquet('hdfs://sparknode01.localdomain:9000/xfact')
Related
I'm working with HDFS and Kafka for times, and I note that Kafka is more reliable than HDFS.
So working now with Spark-structured-streaming , I'm suprised that checkpointing is only with HDFS.
Chekckpointing with Kafka would be faster and reliable.
So is it possible to work with spark structured streaming without HDFS ?
It seems strange that we have to use HDFS only for streaming data in Kafka.
Or is it possible to tell Spark to forget the ChekpPointing and managing it in the program as well ?
Spark 2.4.7
Thank you
You are not restricted to use a HDFS path as a checkpoint location.
According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide the path has to be "an HDFS compatible file system". Therefore, also other file systems will work. However, it is mandatory that all Executors have access to that file system. For example choosing the local file system on the Edge Node in your cluster might be working in local mode, however, in cluster mode this can cause issues.
Also, it is not possible to have Kafka itself handle the offset position with Spark Structured Streaming. I have explained this in more depth in my answer on How to manually set group.id and commit kafka offsets in spark structured streaming?.
I just wanted to know the impact of storing data of apache Cassandra to any other distributed file system.
For example- let's say i am having Hadoop cluster of 5 node and replication factor of 3.
Similarly for cassandra i am having 5 node of cluster with replication factor of 3 for all keyspaces. all data will be stored at hdfs location with same Mount path.
For example- node-0 Cassandra data directory -"/data/user/cassandra-0/"
And Cassandra logs directory -
"/data/user/cassandra-0/logs/
With such kind of Architecture i need comments on following points-
As suggested in datastax documentation casaandra data and commitlog directory should be different, which is not possible in this case. With default configuration cassandra commitlog size is 8192MB. So as per my understanding if i am having a disk of 1TB and if disk got full or any disk level error will stop entire cassandra clusters??
Second question is related to underlying storage mechanism. Going with two level of data distribution by specifying replication factor 3 for hdfs and 3 for cassandra, then is it same data (sstables) will be stored at 9 location? Significant memory loss please suggest on this??
Cassandra doesn't support out of the box storage of data on the non-local file systems, like, HDFS, etc. You can theoretically hack source code to support this, but it makes no sense - Cassandra handles replication itself, and doesn't need to have additional file system layer.
I need to change the HDFS replication factor from 3 to 1 for my Spark program. While searching, I came up with the "spark.hadoop.dfs.replication" property, but by looking at https://spark.apache.org/docs/latest/configuration.html, it doesn't seem to exist anymore. So, how can I change the hdfs replication factor from my Spark program or using spark-submit?
You should use spark.hadoop.dfs.replication to set the replication factor in HDFS in your spark application. But why you cannot find it in the https://spark.apache.org/docs/latest/configuration.html? That's because that link ONLY contains spark specific configuration. As a matter of fact, any property you set started with spark.hadoop.* will be automatically translated to a Hadoop property, stripping the beginning "spark.haddoop.". You can find how it is implemented at https://github.com/apache/spark/blob/d7b1fcf8f0a267322af0592b2cb31f1c8970fb16/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
The method you should look for is appendSparkHadoopConfigs
HDFDS configuration is not specific in any way to Spark. You should be able to modify it, using standard Hadoop configuration files. In particular hdfs-site.xml:
<property>
<name>dfs.replication<name>
<value>3<value>
<property>
It is also possible to access Hadoop configuration using SparkContext instance:
val hconf: org.apache.hadoop.conf.Configuration = spark.sparkContext.hadoopConfiguration
hconf.setInt("dfs.replication", 3)
I've a two-node spark standalone cluster and I'm trying to read some parquet files that I just saved but am getting files not found exception.
Checking the location, it looks like all the parquet files got created on one of the nodes in my standalone cluster.
The problem now, reading the parquet files back, it says cannot find xasdad.part file.
The only way I manage to load it is to scale down the standalone spark cluster to one node.
My question is how can I load my parquet files while running more than one node in my standalone cluster ?
You have to put your files on a shard directory which is accessible to all spark nodes with the same path. Otherwise, use spark with Hadoop HDFS : a distributed file system.
I guess I'm not yet fully understanding how Spark works.
Here is my setup:
I'm running a Spark cluster in Standalone mode. I'm using 4 machines for this: One is the Master, the other three are Workers.
I have written an application that reads data from a Cassandra cluster (see https://github.com/journeymonitor/analyze/blob/master/spark/src/main/scala/SparkApp.scala#L118).
The 3-nodes Cassandra cluster runs on the same machines that also host the Spark Worker nodes. The Spark Master node does not run a Cassandra node:
Machine 1 Machine 2 Machine 3 Machine 4
Spark Master Spark Worker Spark Worker Spark Worker
Cassandra node Cassandra node Cassandra node
The reasoning behind this is that I want to optimize data locality - when running my Spark app on the cluster, each Worker only needs to talk to its local Cassandra node.
Now, when submitting my Spark app to the cluster by running spark-submit --deploy-mode client --master spark://machine-1 from Machine 1 (the Spark Master), I expect the following:
a Driver instance is started on the Spark Master
the Driver starts one Executor on each Spark Worker
the Driver distributes my application to each Executor
my application runs on each Executor, and from there, talks to Cassandra via 127.0.0.1:9042
However, this doesn't seem to be the case. Instead, the Spark Master tries to talk to Cassandra (and fails, because there is no Cassandra node on the Machine 1 host).
What is it that I misunderstand? Does it work differently? Does in fact the Driver read the data from Cassandra, and distribute the data to the Executors? But then I could never read data larger than memory of Machine 1, even if the total memory of my cluster is sufficient.
Or, does the Driver talk to Cassandra not to read data, but to find out how to partition the data, and instructs the Executors to read "their" part of the data?
If someone can enlight me, that would be much appreciated.
Driver program is responsible for creating SparkContext, SQLContext and scheduling tasks on the worker nodes. It includes creating logical and physical plans and applying optimizations. To be able to do that it has to have access to the data source schema and possible other informations like schema or different statistics. Implementation details vary from source to source but generally speaking it means that data should be accessible on all nodes including application master.
At the end of the day your expectations are almost correct. Chunks of the data are fetched individually on each worker without going through driver program, but driver has to be able to connect to Cassandra to fetch required metadata.