Can I write a plain text HDFS (or local) file from a Spark program, not from an RDD? - scala

I have a Spark program (in Scala) and a SparkContext. I am writing some files with RDD's saveAsTextFile. On my local machine I can use a local file path and it works with the local file system. On my cluster it works with HDFS.
I also want to write other arbitrary files as the result of processing. I'm writing them as regular files on my local machine, but want them to go into HDFS on the cluster.
SparkContext seems to have a few file-related methods but they all seem to be inputs not outputs.
How do I do this?

Thanks to marios and kostya, but there are few steps to writing a text file into HDFS from Spark.
// Hadoop Config is accessible from SparkContext
val fs = FileSystem.get(sparkContext.hadoopConfiguration);
// Output file can be created from file system.
val output = fs.create(new Path(filename));
// But BufferedOutputStream must be used to output an actual text file.
val os = BufferedOutputStream(output)
os.write("Hello World".getBytes("UTF-8"))
os.close()
Note that FSDataOutputStream, which has been suggested, is a Java serialized object output stream, not a text output stream. The writeUTF method appears to write plaint text, but it's actually a binary serialization format that includes extra bytes.

Here's what worked best for me (using Spark 2.0):
val path = new Path("hdfs://namenode:8020/some/folder/myfile.txt")
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.setInt("dfs.blocksize", 16 * 1024 * 1024) // 16MB HDFS Block Size
val fs = path.getFileSystem(conf)
if (fs.exists(path))
fs.delete(path, true)
val out = new BufferedOutputStream(fs.create(path)))
val txt = "Some text to output"
out.write(txt.getBytes("UTF-8"))
out.flush()
out.close()
fs.close()

Using HDFS API (hadoop-hdfs.jar) you can create InputStream/OutputStream for an HDFS path and read from/write to a file using regular java.io classes. For example:
URI uri = URI.create (“hdfs://host:port/file path”);
Configuration conf = new Configuration();
FileSystem file = FileSystem.get(uri, conf);
FSDataInputStream in = file.open(new Path(uri));
This code will work with local files as well (change hdfs:// to file://).

One simple way to write files to HDFS is to use a SequenceFiles. Here you use the native Hadoop APIs and not the ones provided by Spark.
Here is a simple snippet (in Scala):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
val conf = new Configuration() // Hadoop configuration
val sfwriter = SequenceFile.createWriter(conf,
SequenceFile.Writer.file(new Path("hdfs://nn1.example.com/file1")),
SequenceFile.Writer.keyClass(LongWritable.class),
SequenceFile.Writer.valueClass(Text.class))
val lw = new LongWritable()
val txt = new Text()
lw.set(12)
text.set("hello")
sfwriter.append(lw, txt)
sfwriter.close()
...
In case you don't have a key you can use NullWritable.class in its place:
SequenceFile.Writer.keyClass(NullWritable.class)
sfwriter.append(NullWritable.get(), new Text("12345"));

Related

How to convert csv file in S3 bucket to RDD

I'm pretty new with this topic so any help will be much appreciated.
I trying to read a csv file which is stored in a S3 bucket and convert its data to an RDD to work directly with it without the need to create a file locally.
So far I've been able to load the file using AmazonS3ClientBuilder, but the only thing I've got is to have the file content in a S3ObjectInputStream and I'm not able to work with its content.
val bucketName = "bucket-name"
val credentials = new BasicAWSCredentials(
"acessKey",
"secretKey"
);
val s3client = AmazonS3ClientBuilder
.standard()
.withCredentials(new AWSStaticCredentialsProvider(credentials))
.withRegion(Regions.US_EAST_2)
.build();
val s3object = s3client.getObject(bucketName, "file-name.csv")
val inputStream = s3object.getObjectContent()
....
I have also tried to use a BufferedSource to work with it but once done, I don't know how to convert it to a dataframe or RDD to work with it.
val myData = Source.fromInputStream(inputStream)
....
You can do it with S3A file system provided in Hadoop-AWS module:
Add this dependency https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
Either define <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3a.S3AFileSystem</value></property> in core-site.xml or add .config("fs.s3.impl", classOf[S3AFileSystem].getName) to SparkSession builder
Access S3 using spark.read.csv("s3://bucket/key"). If you want the RDD that was asked spark.read.csv("s3://bucket/key").rdd
At the end I was able to get the results I was searching for taking a look at https://gist.github.com/snowindy/d438cb5256f9331f5eec

How to check a file/folder is present using pyspark without getting exception

I am trying to keep a check for the file whether it is present or not before reading it from my pyspark in databricks to avoid exceptions? I tried below code snippets but i am getting exception when file is not present
from pyspark.sql import *
from pyspark.conf import SparkConf
SparkSession.builder.config(conf=SparkConf())
try:
df = sqlContext.read.format('com.databricks.spark.csv').option("delimiter",",").options(header='true', inferschema='true').load('/FileStore/tables/HealthCareSample_dumm.csv')
print("File Exists")
except IOError:
print("file not found")`
When i have file, it reads file and "prints File Exists" but when the file is not there it will throw "AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'"
Thanks #Dror and #Kini. I run spark on cluster, and I must add sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]), here s3 is the prefix of the file system of your cluster.
def path_exists(path):
# spark is a SparkSession
sc = spark.sparkContext
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
sc._jsc.hadoopConfiguration(),
)
return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
The answer posted by #rosefun worked for me but it took lot of time for me to get it working. So I am giving some details about how that solution is working and what are the stuffs you should avoid.
def path_exists(path):
# spark is a SparkSession
sc = spark.sparkContext
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
sc._jsc.hadoopConfiguration(),
)
return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
The function is same and it works fine to check whether a file exists or not in the S3 bucket path that you provided.
You will have to change this function based on how you are specifying your path value to this function.
path = f"s3://bucket-name/import/data/"
pathexists = path_exists(path)
if the path variable that you are defining is having the s3 prefix in the path then it would work.
Also the portion of the code which split the string gets you just the bucket name as follows:
path.split("/")[2] will give you `bucket-name`
but if you don't have s3 prefix in the path then you will have to use the function by changing some code and which is as below:
def path_exists(path):
# spark is a SparkSession
sc = spark.sparkContext
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
sc._jvm.java.net.URI.create("s3://" + path),
sc._jsc.hadoopConfiguration(),
)
return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))
Looks like you should change except IOError: to except AnalysisException:.
Spark throws different errors/exception than regular python in a lot of cases. It’s not doing typical python io operations when reading a file, so makes sense for it to throw a different exception.
nice to see you on StackOverFlow.
I second dijksterhuis's solution, with one exception -
Analysis Exception is very general exception in Spark, and may be resulted for various reasons, not only due to missing file.
If you want to check whether the file exists or not, you'll need to bypass Spark's FS abstraction, and access the storage system directly (Whether is s3, posix, or something else). The down side of this solution is the lack of abstraction - once you'll change your underlying FS, you'll need to change your code as well.
You can validate existence of a file as seen here:
import os
if os.path.isfile('/path/file.csv'):
print("File Exists")
my_df = spark.read.load("/path/file.csv")
...
else:
print("File doesn't exists")
dbutils.fs.ls(file_location)
Do not import dbutils. It's already there when you start your cluster.

Content not saved in file while spark job running on cluster (yarn)

I am facing a weird issue and I've stuck with that for a while.
Basically in my spark application I have a method, where I create a file in MapR fs and save some content in this file.
The method is like below:
def collector(value: String): Unit = {
val conf = new Configuration()
val fs= FileSystem.get(conf)
val path = getConfig("path")
Try {fs.create(new Path(path + s"${scala.util.Random.alphanumeric take 10 mkString}" +
Calendar.getInstance.getTimeInMillis))
.write(value.getBytes);
fs.close()}.getOrElse(logger.warn("File not saved"))
}
This method is called from different object. When I run it with --master local[*], then the file is created in specific location in the FS with string that is passed in value. However, when I run it on cluster (--master yarn), then only empty file is saved. When I print value, its printing the string. But for some reason not saving it in the file.
I wonder if anybody has any idea why?
Thanks

Spark Scala file stream

I am new to Spark and Scala. I want to keep read files from folder and persist file content in Cassandra. I have written simple Scala program using file streaming to read the file content. it is not reading files from the specified folder.
Can anybody correct my below sample code ?
i am using Windows 7
Code:
val spark = SparkHelper.getOrCreateSparkSession()
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val lines = ssc.textFileStream("file:///C:/input/")
lines.foreachRDD(file=> {
file.foreach(fc=> {
println(fc)
})
})
ssc.start()
ssc.awaitTermination()
}
I think a normal spark job is needed for the scenario rather than spark streaming.Spark streaming is used in cases where your source is something like kafka or a normal port where there is continuous inflow of data.

How to save RDD data into json files, not folders

I am receiving the streaming data myDStream (DStream[String]) that I want to save in S3 (basically, for this question, it doesn't matter where exactly do I want to save the outputs, but I am mentioning it just in case).
The following code works well, but it saves folders with the names like jsonFile-19-45-46.json, and then inside the folders it saves files _SUCCESS and part-00000.
Is it possible to save each RDD[String] (these are JSON strings) data into the JSON files, not the folders? I thought that repartition(1) had to make this trick, but it didn't.
myDStream.foreachRDD { rdd =>
// datetimeString = ....
rdd.repartition(1).saveAsTextFile("s3n://mybucket/keys/jsonFile-"+datetimeString+".json")
}
AFAIK there is no option to save it as a file. Because it's a distributed processing framework and it's not a good practice write on single file rather than each partition writes it's own files in the specified path.
We can pass only output directory where we wanted to save the data. OutputWriter will create file(s)(depends on partitions) inside specified path with part- file name prefix.
As an alternative to rdd.collect.mkString("\n") you can use hadoop Filesystem library to cleanup output by moving part-00000 file into it's place. Below code works perfectly on local filesystem and HDFS, but I'm unable to test it with S3:
val outputPath = "path/to/some/file.json"
rdd.saveAsTextFile(outputPath + "-tmp")
import org.apache.hadoop.fs.Path
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
fs.delete(new Path(outputPath + "-tmp"), true)
For JAVA I implemented this one. Hope it helps:
val fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
File dir = new File(System.getProperty("user.dir") + "/my.csv/");
File[] files = dir.listFiles((d, name) -> name.endsWith(".csv"));
fs.rename(new Path(files[0].toURI()), new Path(System.getProperty("user.dir") + "/csvDirectory/newData.csv"));
fs.delete(new Path(System.getProperty("user.dir") + "/my.csv/"), true);