Spark Structured Streaming unable to write parquet data to HDFS - scala

I'm trying to write data to HDFS from Spark Structured Streaming code in Scala, but I'm unable to do so because of an error I don't understand.
In my use case, I read data from a Kafka topic and want to write it to HDFS in Parquet format. Everything else in my script works fine so far.
For this I'm using a development Hadoop cluster with 1 namenode and 3 datanodes. Whatever Hadoop configuration I try, I get the same error (2 datanodes, a single-node setup, and so on).
Here is the error:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /test/metadata could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
here is the code I'm using to write data :
val query = lbcAutomationCastDf
.writeStream
.outputMode("append")
.format("parquet")
.queryName("lbcautomation")
.partitionBy("date_year", "date_month", "date_day")
.option("checkpointLocation", "hdfs://NAMENODE_IP:8020/test/")
.option("path", "hdfs://NAMENODE_I:8020/test/")
.start()
.awaitTermination()
The Spark Scala code itself works correctly, because I can write the data to the server's local disk without any error.
I have already tried reformatting the Hadoop cluster; it does not change anything.
Have you ever dealt with this case?
UPDATE:
Manually pushing a file to HDFS from a node of the cluster works without issues.
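Since the manual put works from a cluster node, one hedged way to narrow the problem down is to attempt the same write with the plain Hadoop FileSystem API from the machine that runs the Spark driver, outside of Spark. The port 8020 and the /test path below simply mirror the placeholders from the question:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hedged diagnostic sketch: a bare HDFS write from the Spark client host.
// If this fails with the same "could only be replicated to 0 nodes" error,
// the issue is between this host and the datanodes, not in the streaming code.
val hadoopConf = new Configuration()
hadoopConf.set("fs.defaultFS", "hdfs://NAMENODE_IP:8020")
val fs = FileSystem.get(hadoopConf)
val out = fs.create(new Path("/test/_connectivity_check"))
out.writeBytes("ok")
out.close()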

Related

Read 1 GB file from ADLS in Databricks

I want to load a 1 GB CSV file stored in ADLS Gen2 from Databricks.
My cluster configuration is:
Databricks Runtime: 9.1 LTS (Spark 3.1.2, Scala 2.12)
Worker type: Standard, 56 GB memory, 8 cores, min 2 workers, max 8 workers
Driver node: same as worker.
While loading the file into a dataframe, I get this error:
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
Earlier I was getting a Java heap space issue, which is why I increased the cluster size to 56 GB, but it is still not working.
Is there any other method to load the data into a dataframe, or some configuration change that will make it load?
Are you loading the CSV using spark.read?
Always use the Spark APIs in Databricks, and try applying a repartition when loading this data. Spark will not actually load the file until you apply some action on the DataFrame, so apply the repartition before any other action. This may help Spark run operations in parallel:
df = df.repartition(64)
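For reference, a minimal sketch of the same idea in Scala; the abfss path, the header option, and the partition count of 64 are assumptions for illustration, not details from the question:

// Hedged sketch: `spark` is the SparkSession provided by the Databricks notebook.
// Read the CSV lazily with spark.read, then repartition before the first action so
// the work is spread across executors. Path and options are placeholders.
val df = spark.read
  .option("header", "true")   // assumption: the file has a header row
  .csv("abfss://<container>@<account>.dfs.core.windows.net/path/to/file.csv")
  .repartition(64)

df.count()   // the first action triggers the actual read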

Spark - write to parquet never finish

I have a strange issue with Spark on EMR.
When I run a Spark job and save the dataframe to CSV, the job finishes successfully, but when I try to save to Parquet, the Spark application never finishes, even though I can see that all internal tasks are finished.
I can also see that all the Parquet files are created in the relevant partitions.
I run on EMR emr-5.13.0, with Spark 2.3.0 and Scala 2.11.
The write to Parquet is:
newDf.coalesce(partitions)
.write.option("mapreduce.fileoutputcommitter.algorithm.version", "2")
.partitionBy("key1", "key2")
.mode(SaveMode.Append)
.parquet(destination)

End of file exception while reading a file from remote hdfs cluster using spark

I am new to working with HDFS. I am trying to read a CSV file stored on a Hadoop cluster using Spark. Every time I try to access it I get the following error:
End of File Exception between local host
I have not set up Hadoop locally, since I already have access to a Hadoop cluster.
I may be missing some configuration, but I don't know which. I would appreciate the help.
I tried to debug it using this:
link
It did not work for me.
This is the code, using Spark:
val conf = new SparkConf().setAppName("Read").setMaster("local")
  .set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
val sc = new SparkContext(conf)
val data = sc.textFile("hdfs://<some-ip>/abc.csv")
I expect it to read the CSV and convert it into an RDD.
I am getting this error:
Exception in thread "main" java.io.EOFException: End of File Exception between local host is:
Run your Spark jobs on the Hadoop cluster. Use the code below:
val spark = SparkSession.builder().master("local[1]").appName("Read").getOrCreate()
val data = spark.sparkContext.textFile("<filePath>")
Or you can use spark-shell as well.
If you want to access HDFS from your local machine, follow this: link
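If the job does have to run from outside the cluster, here is a hedged sketch of the remote read; the host name and the 8020 RPC port are placeholders (one commonly reported cause of this EOFException is pointing the hdfs:// URI at the NameNode web UI port instead of its RPC port, so check fs.defaultFS in the cluster's core-site.xml):

import org.apache.spark.sql.SparkSession

// Hedged sketch: read from a remote HDFS by spelling out the NameNode RPC endpoint.
// <namenode-host> and port 8020 are placeholders taken from typical setups.
val spark = SparkSession.builder()
  .appName("ReadRemoteHdfs")
  .master("local[*]")
  .getOrCreate()

val data = spark.sparkContext.textFile("hdfs://<namenode-host>:8020/abc.csv")
data.take(5).foreach(println)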

Structured Streaming - Could not use FileContext API for managing metadata log files on AWS S3

I have a StreamingQuery in Spark (v2.2.0), i.e.:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.load()
val query = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("parquet")
.option("checkpointLocation", "s3n://bucket/checkpoint/test")
.option("path", "s3n://bucket/test")
.start()
When I run the query, the data does get saved to AWS S3 and checkpoints are created at s3n://bucket/checkpoint/test. But I am also receiving the following WARNING in the logs:
WARN [o.a.s.s.e.streaming.OffsetSeqLog] Could not use FileContext API for managing metadata log files at path s3n://bucket/checpoint/test/offsets. Using FileSystem API instead for managing log files. The log may be inconsistent under failures.
I am not able to understand why this WARNING appears. Also, will my checkpoints be inconsistent in case of a failure?
Can anyone help me resolve it?
Looking at the source code, this warning comes from the HDFSMetadataLog class. A comment in the code states that:
Note: [[HDFSMetadataLog]] doesn't support S3-like file systems as they don't guarantee listing files in a directory always shows the latest files.
So the problem is due to using AWS S3, which forces Spark to fall back to the FileSystemManager API. Checking the comment for that class, we see:
Implementation of FileManager using older FileSystem API. Note that this implementation cannot provide atomic renaming of paths, hence can lead to consistency issues. This should be used only as a backup option, when FileContextManager cannot be used.
Hence, some issues can come up when multiple writers want to perform rename operations concurrently. There is a related ticket here; however, it has been closed since the issue can't be fixed in Spark.
Some things to consider if you need to checkpoint on S3:
To avoid the warning and potential trouble, checkpoint to HDFS and then copy over the results.
Checkpoint to S3, but leave a long gap between checkpoints.
Nobody should be using s3n as the connector. It is obsolete and has been removed from Hadoop 3. If you have the Hadoop 2.7.x JARs on the classpath, use s3a.
The issue with rename() is not just consistency; the bigger the file, the longer it takes.
Really, checkpointing to object stores needs to be done differently. If you look closely, object stores have no real rename(), yet so much existing code expects it to be an O(1) atomic operation.
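Putting that advice together, a hedged sketch of the same query with the checkpoint kept on HDFS and the data written through s3a; the HDFS URI and bucket name are placeholders, not values from the original question:

// Hedged sketch: `df` is the Kafka source DataFrame from the question. Keep the
// streaming checkpoint on HDFS (which has an atomic rename) and write the Parquet
// output through the s3a connector instead of s3n. URIs are placeholders.
val query = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/test")
  .option("path", "s3a://bucket/test")
  .start()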

Spark write to parquet on hdfs

I have Hadoop and Spark installed on 3 nodes. I would like to take data from an RDBMS into a dataframe and write this data as Parquet on HDFS. The "dfs.replication" value is 1.
When I try this with the following command, I see that all HDFS blocks are located on the node where I ran spark-shell:
scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")
Is this the intended behaviour, or should all blocks be distributed across the cluster?
Thanks
Since you are writing your data to HDFS, this does not depend on Spark but on HDFS. From Hadoop: The Definitive Guide:
Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).
So yes, this is the intended behaviour.
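If you do want copies of the blocks spread across the datanodes, one hedged option is to raise the client-side replication factor before writing; the value 3 below is an assumption, and the first replica will still land on the node running spark-shell:

// Hedged sketch: `spark` is the spark-shell session and `xfact` the DataFrame from
// the question. Setting dfs.replication on the client configuration makes HDFS place
// additional replicas on other datanodes for the files this write creates.
spark.sparkContext.hadoopConfiguration.set("dfs.replication", "3")
xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")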
Just as #nik says, I do my work with multiple clients and it worked for me.
This is the Python snippet:
columns = xfact.columns
test = sqlContext.createDataFrame(xfact.rdd.map(lambda a: a),columns)
test.write.mode('overwrite').parquet('hdfs://sparknode01.localdomain:9000/xfact')