Read 1 GB file from ADLS in Databricks - pyspark

I want to load a 1 GB CSV file present in ADLS Gen2 from Databricks.
My cluster configuration is:
Databricks Runtime: 9.1 LTS (Spark 3.1.2, Scala 2.12)
Worker type: Standard, 56 GB memory, 8 cores; min workers 2, max workers 8
Driver node: same as worker type.
While loading the file into a DataFrame, I am getting the error:
The Spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
Earlier I was getting a Java heap space error, which is why I increased the cluster size to 56 GB, but it is still not working.
Is there another method to load the data into a DataFrame, or some configuration change that will load the data?

Are you loading the CSV using spark.read?
Always use Spark (spark.read) in Databricks, and apply a repartition when loading this data. Spark will not actually load the file until you apply an action on the DataFrame, so repartition before any other action. This may help Spark run operations in parallel:
df = df.repartition(64)
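For reference, a minimal PySpark sketch of this approach; the container, storage account, and file path below are placeholders, not the asker's actual locations:

# Minimal sketch: read the CSV directly from ADLS Gen2, then repartition before acting on it.
# The abfss:// path is a placeholder; authentication or mount setup is assumed to already be in place.
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/file.csv"))

df = df.repartition(64)   # spread the data across partitions before heavy transformations
df.count()                # the first action triggers the actual read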

Related

Spark data frame is not utilizing the workers

I have a Spark cluster with 3 worker nodes. When I try to load a CSV file from HDFS, it only utilizes the resources (CPU and memory) of the system where I load the CSV via spark-shell (the master node).
Load the DataFrame:
val df = spark.read.format("csv")
.option("header","true")
.load("hdfs://ipaddr:9000/user/smb_ram/2016_HDD.csv")
Do some operation on the DataFrame:
df.agg(sum("failure")).show
When I load the CSV, system memory on that node increases by 1.3 GB (the HDFS file size) and CPU usage hits 100%. The workers sit idle, with CPU near 0% and no change in memory usage. Ideally I would expect all the heavy lifting to be done by the workers, which is not happening.
Set the Spark deploy mode to cluster; that should solve your problem. It looks like your job is running in client mode.
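For illustration, a hedged sketch of submitting the same workload so it runs on the cluster rather than only on the launching node; the master, resource sizes, and jar name here are assumptions:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  --executor-memory 4g \
  --executor-cores 2 \
  my-job.jar

Note that spark-shell itself always runs its driver in client mode; the important part is pointing --master at the cluster manager (for example yarn or spark://<master-host>:7077) so that executors are allocated on the worker nodes instead of local threads.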

Memory Allocation In Spark-Scala Application

I am executing a Spark-Scala job using the spark-submit command. I have written my code in Spark SQL, where I join 2 tables and load the result into a third Hive table.
The code works fine, but sometimes I get issues such as OutOfMemoryError: Java heap space, or timeout errors.
So I want to control my job manually by passing the number of executors, cores, and memory. When I use 16 executors, 1 core, and 20 GB of executor memory, my Spark application gets stuck.
Can someone please suggest how I should control my Spark application manually by providing the correct parameters? Are there any other Hive- or Spark-specific parameters I can use for faster execution?
Below is the configuration of my cluster.
Number of Nodes: 5
Number of Cores per Node: 6
RAM per Node: 125 GB
Spark submit command:
spark-submit --class org.apache.spark.examples.sparksc \
--master yarn-client \
--num-executors 16 \
--executor-memory 20g \
--executor-cores 1 \
examples/jars/spark-examples.jar
It depends on the volume of your data; you can also set these parameters dynamically. This link has a very good explanation:
How to tune spark executor number, cores and executor memory?
You can enable spark.shuffle.service.enabled and use spark.sql.shuffle.partitions=400, hive.exec.compress.intermediate=true, hive.exec.reducers.bytes.per.reducer=536870912, hive.exec.compress.output=true, hive.output.codec=snappy, mapred.output.compression.type=BLOCK.
If your data is larger than 700 MB, you can also enable the spark.speculation property.
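As an illustration, a hedged sketch of how settings like these could be passed on the command line; the executor sizing is an assumption loosely matched to the 5-node, 6-core, 125 GB cluster above, not a definitive recommendation, and forwarding Hive properties through the spark.hadoop. prefix should be verified against your Hive/Spark versions:

spark-submit --class org.apache.spark.examples.sparksc \
  --master yarn \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 20g \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.speculation=true \
  --conf spark.hadoop.hive.exec.compress.intermediate=true \
  --conf spark.hadoop.hive.exec.compress.output=true \
  examples/jars/spark-examples.jar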

Spark Structured Streaming unable to write parquet data to HDFS

I'm trying to write data to HDFS from Spark Structured Streaming code in Scala,
but I'm unable to do so due to an error that I fail to understand.
In my use case, I'm reading data from a Kafka topic and I want to write it to HDFS in Parquet format. Everything else in my script works well, no bugs so far.
To do that I'm using a development Hadoop cluster with 1 namenode and 3 datanodes.
Whatever Hadoop configuration I try, I get the same error (2 datanodes, a single-node setup, and so on...).
Here is the error:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /test/metadata could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
Here is the code I'm using to write data:
val query = lbcAutomationCastDf
.writeStream
.outputMode("append")
.format("parquet")
.queryName("lbcautomation")
.partitionBy("date_year", "date_month", "date_day")
.option("checkpointLocation", "hdfs://NAMENODE_IP:8020/test/")
.option("path", "hdfs://NAMENODE_I:8020/test/")
.start()
.awaitTermination()
The Spark Scala code works correctly, since I can write the data to the server's local disk without any error.
I already tried reformatting the Hadoop cluster; it does not change anything.
Have you ever dealt with this case?
UPDATE:
Manually pushing a file to HDFS on the cluster works without issues.

Spark: master local[*] is a lot slower than master local

I have an EC2 instance set up with r3.8xlarge (32 cores, 244 GB RAM).
In my Spark application, I am reading two CSV files from S3 using Spark-CSV from Databricks; each CSV has about 5 million rows. I unionAll the two DataFrames and run dropDuplicates on the combined DataFrame.
But when I have:
val conf = new SparkConf()
.setMaster("local[32]")
.setAppName("Raw Ingestion On Apache Spark")
.set("spark.sql.shuffle.partitions", "32")
Spark is slower than .setMaster("local")
Wouldn't it be faster with 32 cores?
Well, Spark is not a Windows operating system that works at maximum possible capacity out of the box; you need to tune it for your usage.
Right now you have just bluntly told Spark to start up and process your data on one node with 32 cores. That is not what Spark is best at. It is a distributed system, meant to run on a multi-node cluster; that is where it works best.
The reason is simple: even if you are using 32 cores, what about the I/O bottleneck?
If it runs, say, 32 tasks, that is 32 processes reading from the same disk.
You specified 32 cores, but what about executor memory?
Did both machines you tested on have the same RAM?
You have also specified that you want 32 shuffle partitions; if the data is very small, that is a lot of overhead. Ideally you shouldn't set the partition count until you know specifically what you are doing, or you are running a repetitive task and you know the data will be roughly the same every time.
If you tune it correctly, Spark with 32 cores will indeed work faster than "local", which basically runs on one core.

Spark write to parquet on hdfs

I have a 3-node cluster with Hadoop and Spark installed. I would like to read data from an RDBMS into a DataFrame and write this data to Parquet on HDFS. The "dfs.replication" value is 1.
When I try this with the following command, I see that all HDFS blocks are located on the node where I ran spark-shell:
scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")
Is this the intended behaviour or should all blocks be distributed across the cluster?
Thanks
Since you are writing your data to HDFS, this does not depend on Spark but on HDFS. From Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).
So yes, this is the intended behaviour.
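If you want to verify where the blocks actually ended up, one way (assuming the HDFS client is available and the target path matches the example above) is to run an fsck against the output directory:

hdfs fsck /xfact -files -blocks -locations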
Just as @nik says, I do my work with multiple clients, and that did it for me:
This is the Python snippet:
columns = xfact.columns
test = sqlContext.createDataFrame(xfact.rdd.map(lambda a: a), columns)
test.write.mode('overwrite').parquet('hdfs://sparknode01.localdomain:9000/xfact')