In Spark, it is possible to set some Hadoop configuration settings, e.g.
System.setProperty("spark.hadoop.dfs.replication", "1")
This works; the replication factor is set to 1.
Given that, I thought this pattern (prepending "spark.hadoop." to a regular Hadoop configuration property) would also work for textinputformat.record.delimiter:
System.setProperty("spark.hadoop.textinputformat.record.delimiter", "\n\n")
However, Spark seems to just ignore this setting.
Am I setting textinputformat.record.delimiter the correct way?
Is there a simpler way of setting textinputformat.record.delimiter? I would like to avoid writing my own InputFormat, since I really only need to obtain records delimited by two newlines.
I got this working for plain uncompressed files with the function below.
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
def nlFile(path: String) = {
  val conf = new Configuration
  conf.set("textinputformat.record.delimiter", "\n")
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)
}
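For the two-newline records asked about above, the same approach should work with just the delimiter changed; a minimal sketch (paragraphFile is a hypothetical name, and sc is assumed to be an existing SparkContext):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Same pattern, but records are delimited by a blank line ("\n\n")
def paragraphFile(path: String) = {
  val conf = new Configuration
  conf.set("textinputformat.record.delimiter", "\n\n")
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)
}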
When I try to write my final DF with append or overwrite mode, sometimes I get the following error:
Caused by: java.io.FileNotFoundException: File file:/C:/Users/xxx/ScalaSparkProjects/Date=2019-11-02/part-xxxx2x.28232x.213.c000.snappy.parquet does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
And I can't understand why. This is how I am writing the DF as a Parquet file:
df.write.mode("append")
.partitionBy("Date")
.format("parquet")
.save(/data/testing/files)
Why could this be happening?
Based on your information, consider this scenario:
Source DataFrame example under the path /tmp/sourceDF
Target path to save under /tmp/destDF
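For context, the source folder in this scenario could be created with something like the following (hypothetical one-column data matching the num: Integer schema mentioned below):
import spark.implicits._
// Hypothetical setup: a single-column (num: Int) DataFrame written out as the "source"
Seq(1, 2, 3).toDF("num").write.mode("overwrite").parquet("/tmp/sourceDF")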
val sourceDF = spark.read.parquet("/tmp/sourceDF")
At this point Spark reads the Parquet metadata in this folder to infer the schema. For simplicity, the schema I used is just num: Integer.
Now you might think that all the data is loaded at this point, but Spark works lazily until an action occurs (actions: df.show(), df.take(1), df.count()), so the following code would result in an error.
import scala.reflect.io.Directory
import java.io.File
import spark.implicits._
val sourceDF = spark.read.parquet("/tmp/sourceDF") // lazy: only the schema/metadata is read here
val directory = new Directory(new File("/tmp/sourceDF"))
directory.deleteRecursively() // the underlying files are deleted before any action has run
sourceDF.write.parquet("/tmp/destDF") // the write triggers the actual read, but the files are gone
The result will be:
java.io.FileNotFoundException: File file:/tmp/sourceDF/part-00000-1915503b-4beb-4e14-87ef-ca8b99fc4b11-c000.snappy.parquet does not exist
To fix this, you have two options I can think of.
Change the order:
import scala.reflect.io.Directory
import java.io.File
import spark.implicits._
val sourceDF = spark.read.parquet("/tmp/sourceDF")
sourceDF.write.mode("append").parquet("/tmp/destDF")
// Deletion happens now after writing
val directory = new Directory(new File("/tmp/sourceDF"))
directory.deleteRecursively()
Or you can use a checkpoint, which eagerly materializes the DataFrame at that point and persists it to the checkpoint directory:
import scala.reflect.io.Directory
import java.io.File
import spark.implicits._
// set checkpoint directory
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")
// cache df
val sourceDF = spark.read.parquet("/tmp/sourceDF").checkpoint()
// Now you can delete before writing it out
val directory = new Directory(new File("/tmp/sourceDF"))
directory.deleteRecursively()
sourceDF.write.mode("append").parquet("/tmp/destDF")
I am trying to check whether a file is present before reading it from PySpark in Databricks, to avoid exceptions. I tried the code snippet below, but I am getting an exception when the file is not present:
from pyspark.sql import *
from pyspark.conf import SparkConf
SparkSession.builder.config(conf=SparkConf())
try:
    df = sqlContext.read.format('com.databricks.spark.csv').option("delimiter", ",").options(header='true', inferschema='true').load('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except IOError:
    print("file not found")
When the file is there, it reads the file and prints "File Exists", but when the file is not there it throws "AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'"
Thanks @Dror and @Kini. I run Spark on a cluster, and I must add sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]); here s3 is the prefix of your cluster's file system.
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
The answer posted by @rosefun worked for me, but it took a lot of time to get it working. So I am giving some details about how that solution works and what you should avoid.
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
The function is the same, and it works fine to check whether a file exists in the S3 bucket path that you provide.
You will have to adjust it based on how you specify the path value you pass to it.
path = f"s3://bucket-name/import/data/"
pathexists = path_exists(path)
If the path variable you define has the s3 prefix, this works as-is.
Also, the part of the code that splits the string extracts just the bucket name:
path.split("/")[2] will give you `bucket-name`
But if the path does not have the s3 prefix, you will have to change the function slightly, as shown below:
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))
Looks like you should change except IOError: to except AnalysisException: (from pyspark.sql.utils).
Spark throws different errors/exceptions than regular Python in a lot of cases. It's not doing typical Python I/O operations when reading a file, so it makes sense for it to throw a different exception.
Nice to see you on Stack Overflow.
I second dijksterhuis's solution, with one exception:
AnalysisException is a very general exception in Spark and may result from various causes, not only a missing file.
If you want to check whether the file exists, you'll need to bypass Spark's FS abstraction and access the storage system directly (whether it is S3, POSIX, or something else). The downside of this solution is the lack of abstraction: once you change your underlying FS, you'll need to change your code as well.
You can validate the existence of a file as seen here:
import os
if os.path.isfile('/path/file.csv'):
    print("File Exists")
    my_df = spark.read.load("/path/file.csv")
    ...
else:
    print("File doesn't exist")
dbutils.fs.ls(file_location)
Do not import dbutils. It's already there when you start your cluster.
I'm using Spark Structured Streaming for a classic use case: I want to read from a Kafka topic and write the stream into HDFS in Parquet format.
Here is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{ArrayType, DataTypes, StructType}

object TestKafkaReader extends App {

  val spark = SparkSession
    .builder
    .appName("Spark-Kafka-Integration")
    .master("local")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  import spark.implicits._

  val kafkaDf = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "KAFKA_BROKER_IP:PORT")
    //.option("subscribe", "test")
    .option("subscribe", "test")
    .option("startingOffsets", "earliest")
    .load()

  val moviesJsonDf = kafkaDf.selectExpr("CAST(value AS STRING)")

  // movie struct
  val struct = new StructType()
    .add("title", DataTypes.StringType)
    .add("year", DataTypes.IntegerType)
    .add("cast", ArrayType(DataTypes.StringType))
    .add("genres", ArrayType(DataTypes.StringType))

  val moviesNestedDf = moviesJsonDf.select(from_json($"value", struct).as("movie"))

  // json flatten
  val movieFlattenedDf = moviesNestedDf.selectExpr("movie.title", "movie.year", "movie.cast", "movie.genres")

  // convert to parquet and save to hdfs
  val query = movieFlattenedDf
    .writeStream
    .outputMode("append")
    .format("parquet")
    .queryName("movies")
    .option("checkpointLocation", "src/main/resources/chkpoint_dir")
    .start("src/main/resources/output")
    .awaitTermination()
}
Context:
I'm running this directly from IntelliJ (with a local Spark installed).
I manage to read from Kafka without problem and write to the console (using console mode).
For the moment I want to write the file to my local machine (but I did try on an HDFS cluster; the problem is the same).
My problem:
While the job is running, it doesn't write anything to the folder; I have to manually stop the job to finally see the files.
I figured it might have something to do with .awaitTermination().
For information, I tried removing that call, but then I get an error and the job simply doesn't run.
Maybe I didn't set the right options, but after reading the docs many times and searching on Google I didn't find anything.
Can you please help me with that?
Thank you
EDIT:
I'm using Spark 2.4.0.
I tried the 64/128 MB block size suggestion => nothing changed, no file until I stop the job.
Yes, problem solved.
My problem was that I had too little data and Spark was waiting for more data to write the Parquet file.
To make this work I used the suggestion from @AlexandrosBiratsis (change the block size).
Once again, all credit to @AlexandrosBiratsis.
Thank you very much.
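For reference, a minimal sketch of how that kind of change might be applied, assuming the idea is to lower the Parquet block size through the Hadoop configuration before starting the query (the 1 MB value below is only an illustration, not the value from the comment):
// Assumption: a smaller Parquet row-group/block size so that small micro-batches are flushed to files sooner.
// "parquet.block.size" is the standard parquet-mr setting; 1 MB is just an illustrative value.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 1024 * 1024)
// ...then build and start the writeStream query exactly as in the code above.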
I have a Parquet file which I would like to read into my Scala program without using Spark or other Big Data Technologies.
I found the projects
https://github.com/apache/parquet-mr
https://github.com/51zero/eel-sdk
but no examples detailed enough to get them to work.
Parquet-MR
https://stackoverflow.com/a/35594368/4533188 mentions this, but the examples given are not complete. For example, it is not clear what path is supposed to be; it is supposed to implement InputFile, but how is that done? Also, from the post it seems to me that Parquet-MR does not directly turn the Parquet data into standard Scala classes.
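For what it's worth, here is a minimal sketch of reading a Parquet file with parquet-mr via its Avro binding, under the assumption that the parquet-avro artifact is on the classpath and that HadoopInputFile is an acceptable InputFile implementation (the file path is hypothetical):
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// HadoopInputFile is one concrete InputFile implementation; the path is hypothetical
val inputFile = HadoopInputFile.fromPath(new Path("/path/to/1.parquet"), new Configuration())
val reader = AvroParquetReader.builder[GenericRecord](inputFile).build()
try {
  var record = reader.read() // returns null once the file is exhausted
  while (record != null) {
    println(record) // each row is an Avro GenericRecord, not a Scala case class
    record = reader.read()
  }
} finally {
  reader.close()
}
Note that rows come back as Avro GenericRecords rather than standard Scala classes, which matches the impression described above.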
Eel
Here I tried
import io.eels.component.parquet.ParquetSource
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val parquetFilePath = new Path("file://home/raeg/Datatroniq/Projekte/14. Witzenmann/Teilprojekt Strom und Spannung/python_witzenmann/src/data/1.parquet")
implicit val hadoopConfiguration = new Configuration()
implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // This is required
ParquetSource(parquetFilePath)
  .toDataStream()
  .collect
  .foreach(row => println(row))
but I get the error
java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(ParquetReaderTesting.sc:2582)
at org.apache.hadoop.fs.FileSystem.createFileSystem(ParquetReaderTesting.sc:2589)
at org.apache.hadoop.fs.FileSystem.access$200(ParquetReaderTesting.sc:87)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(ParquetReaderTesting.sc:2628)
at org.apache.hadoop.fs.FileSystem$Cache.get(ParquetReaderTesting.sc:2610)
at org.apache.hadoop.fs.FileSystem.get(ParquetReaderTesting.sc:366)
at org.apache.hadoop.fs.FileSystem.get(ParquetReaderTesting.sc:165)
at dataReading.A$A6$A$A6.hadoopFileSystem$lzycompute(ParquetReaderTesting.sc:7)
at dataReading.A$A6$A$A6.hadoopFileSystem(ParquetReaderTesting.sc:7)
at dataReading.A$A6$A$A6.get$$instance$$hadoopFileSystem(ParquetReaderTesting.sc:7)
at #worksheet#.#worksheet#(ParquetReaderTesting.sc:30)
in my worksheet.
I have a Spark program (in Scala) and a SparkContext. I am writing some files with RDD's saveAsTextFile. On my local machine I can use a local file path and it works with the local file system. On my cluster it works with HDFS.
I also want to write other arbitrary files as the result of processing. I'm writing them as regular files on my local machine, but want them to go into HDFS on the cluster.
SparkContext seems to have a few file-related methods but they all seem to be inputs not outputs.
How do I do this?
Thanks to marios and kostya, but there are a few steps to writing a text file into HDFS from Spark.
import java.io.BufferedOutputStream
import org.apache.hadoop.fs.{FileSystem, Path}
// The Hadoop Config is accessible from the SparkContext
val fs = FileSystem.get(sparkContext.hadoopConfiguration)
// An output file can be created from the file system
val output = fs.create(new Path(filename))
// But a BufferedOutputStream must be used to output an actual text file
val os = new BufferedOutputStream(output)
os.write("Hello World".getBytes("UTF-8"))
os.close()
Note that FSDataOutputStream, which has been suggested, is a Java data output stream (it extends java.io.DataOutputStream), not a text output stream. The writeUTF method appears to write plain text, but it's actually a binary serialization format that includes extra length bytes.
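A tiny sketch that makes this concrete (plain JVM code, no HDFS involved; since FSDataOutputStream extends DataOutputStream, writeUTF behaves the same way there):
import java.io.{ByteArrayOutputStream, DataOutputStream}

val buffer = new ByteArrayOutputStream()
val dos = new DataOutputStream(buffer)
dos.writeUTF("Hi")
dos.close()
// Prints List(0, 2, 72, 105): a 2-byte length prefix followed by 'H' and 'i',
// which is why the result is not a plain text file.
println(buffer.toByteArray.toList)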
Here's what worked best for me (using Spark 2.0):
import java.io.BufferedOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
val path = new Path("hdfs://namenode:8020/some/folder/myfile.txt")
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.setInt("dfs.blocksize", 16 * 1024 * 1024) // 16 MB HDFS block size
val fs = path.getFileSystem(conf)
if (fs.exists(path))
  fs.delete(path, true)
val out = new BufferedOutputStream(fs.create(path))
val txt = "Some text to output"
out.write(txt.getBytes("UTF-8"))
out.flush()
out.close()
fs.close()
Using the HDFS API (hadoop-hdfs.jar) you can create an InputStream/OutputStream for an HDFS path and read from/write to the file using regular java.io classes. For example:
URI uri = URI.create("hdfs://host:port/path/to/file");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(uri, conf);
FSDataInputStream in = fs.open(new Path(uri));
This code will work with local files as well (change hdfs:// to file://).
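Writing works the same way through the FileSystem object; a minimal sketch in Scala (the output path is hypothetical, and file:// URIs work as well):
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical output location
val uri = URI.create("hdfs://host:port/path/to/output.txt")
val fs = FileSystem.get(uri, new Configuration())
val out = fs.create(new Path(uri)) // FSDataOutputStream
out.write("some text".getBytes("UTF-8"))
out.close()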
One simple way to write files to HDFS is to use SequenceFiles. Here you use the native Hadoop APIs and not the ones provided by Spark.
Here is a simple snippet (in Scala):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._

val conf = new Configuration() // Hadoop configuration
val sfwriter = SequenceFile.createWriter(conf,
  SequenceFile.Writer.file(new Path("hdfs://nn1.example.com/file1")),
  SequenceFile.Writer.keyClass(classOf[LongWritable]),
  SequenceFile.Writer.valueClass(classOf[Text]))
val lw = new LongWritable()
val txt = new Text()
lw.set(12)
txt.set("hello")
sfwriter.append(lw, txt)
sfwriter.close()
...
In case you don't have a key, you can use NullWritable in its place:
SequenceFile.Writer.keyClass(classOf[NullWritable])
sfwriter.append(NullWritable.get(), new Text("12345"))