Adding to an existing zip file in scala - scala

I have a zip file in hdfs and i need to add a file to the zip and save in the same HDFS location. Any examples would be appreciated.
I have the following code.
val filePattern = s"${hdfsFolderPath}/${filePath}.txt"
val zipFilePath = hdfsWrapper.getFileNameFromPattern(s"${targetFilePath}/*.zip")
if (hdfsWrapper.filter(filePattern).size() > 0)
{
Try
{
val zipEntry = new ZipEntry(filePattern)
val zos: ZipOutputStream = new ZipOutputStream(new FileOutputStream(zipFilePath))
zos.putNextEntry(zipEntry)
zos.closeEntry()
zos.close()
}
}
Would like to know if above code is right?

I believe that your code would result in the zip file being replaced with a new one containing just the new file. See Appending files to a zip file with Java for an example of adding a file to a zip archive.
I'm not extremely familiar with HDFS, but I suspect that you can't write to that directly either, you probably have to create the new zip file and then replace it in HDFS

Related

How to convert csv file in S3 bucket to RDD

I'm pretty new with this topic so any help will be much appreciated.
I trying to read a csv file which is stored in a S3 bucket and convert its data to an RDD to work directly with it without the need to create a file locally.
So far I've been able to load the file using AmazonS3ClientBuilder, but the only thing I've got is to have the file content in a S3ObjectInputStream and I'm not able to work with its content.
val bucketName = "bucket-name"
val credentials = new BasicAWSCredentials(
"acessKey",
"secretKey"
);
val s3client = AmazonS3ClientBuilder
.standard()
.withCredentials(new AWSStaticCredentialsProvider(credentials))
.withRegion(Regions.US_EAST_2)
.build();
val s3object = s3client.getObject(bucketName, "file-name.csv")
val inputStream = s3object.getObjectContent()
....
I have also tried to use a BufferedSource to work with it but once done, I don't know how to convert it to a dataframe or RDD to work with it.
val myData = Source.fromInputStream(inputStream)
....
You can do it with S3A file system provided in Hadoop-AWS module:
Add this dependency https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
Either define <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3a.S3AFileSystem</value></property> in core-site.xml or add .config("fs.s3.impl", classOf[S3AFileSystem].getName) to SparkSession builder
Access S3 using spark.read.csv("s3://bucket/key"). If you want the RDD that was asked spark.read.csv("s3://bucket/key").rdd
At the end I was able to get the results I was searching for taking a look at https://gist.github.com/snowindy/d438cb5256f9331f5eec

How to read text files from sub directory in scala

i have sub directory under resources (Memory)
src
test
resources
Memory
where i put all my text files, i know how to read file from resources directly with:
val sdf = "file1.txt"
val file:File = new File("file1.txt")
....
but using sub directory i dont know how. i am trying to do something like
val path = scala.reflect.io.Path("src/test/resources/Memory/file1.txt")
but i get java.lang.NullPointerException: null

How to save RDD data into json files, not folders

I am receiving the streaming data myDStream (DStream[String]) that I want to save in S3 (basically, for this question, it doesn't matter where exactly do I want to save the outputs, but I am mentioning it just in case).
The following code works well, but it saves folders with the names like jsonFile-19-45-46.json, and then inside the folders it saves files _SUCCESS and part-00000.
Is it possible to save each RDD[String] (these are JSON strings) data into the JSON files, not the folders? I thought that repartition(1) had to make this trick, but it didn't.
myDStream.foreachRDD { rdd =>
// datetimeString = ....
rdd.repartition(1).saveAsTextFile("s3n://mybucket/keys/jsonFile-"+datetimeString+".json")
}
AFAIK there is no option to save it as a file. Because it's a distributed processing framework and it's not a good practice write on single file rather than each partition writes it's own files in the specified path.
We can pass only output directory where we wanted to save the data. OutputWriter will create file(s)(depends on partitions) inside specified path with part- file name prefix.
As an alternative to rdd.collect.mkString("\n") you can use hadoop Filesystem library to cleanup output by moving part-00000 file into it's place. Below code works perfectly on local filesystem and HDFS, but I'm unable to test it with S3:
val outputPath = "path/to/some/file.json"
rdd.saveAsTextFile(outputPath + "-tmp")
import org.apache.hadoop.fs.Path
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
fs.delete(new Path(outputPath + "-tmp"), true)
For JAVA I implemented this one. Hope it helps:
val fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
File dir = new File(System.getProperty("user.dir") + "/my.csv/");
File[] files = dir.listFiles((d, name) -> name.endsWith(".csv"));
fs.rename(new Path(files[0].toURI()), new Path(System.getProperty("user.dir") + "/csvDirectory/newData.csv"));
fs.delete(new Path(System.getProperty("user.dir") + "/my.csv/"), true);

Can I write a plain text HDFS (or local) file from a Spark program, not from an RDD?

I have a Spark program (in Scala) and a SparkContext. I am writing some files with RDD's saveAsTextFile. On my local machine I can use a local file path and it works with the local file system. On my cluster it works with HDFS.
I also want to write other arbitrary files as the result of processing. I'm writing them as regular files on my local machine, but want them to go into HDFS on the cluster.
SparkContext seems to have a few file-related methods but they all seem to be inputs not outputs.
How do I do this?
Thanks to marios and kostya, but there are few steps to writing a text file into HDFS from Spark.
// Hadoop Config is accessible from SparkContext
val fs = FileSystem.get(sparkContext.hadoopConfiguration);
// Output file can be created from file system.
val output = fs.create(new Path(filename));
// But BufferedOutputStream must be used to output an actual text file.
val os = BufferedOutputStream(output)
os.write("Hello World".getBytes("UTF-8"))
os.close()
Note that FSDataOutputStream, which has been suggested, is a Java serialized object output stream, not a text output stream. The writeUTF method appears to write plaint text, but it's actually a binary serialization format that includes extra bytes.
Here's what worked best for me (using Spark 2.0):
val path = new Path("hdfs://namenode:8020/some/folder/myfile.txt")
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.setInt("dfs.blocksize", 16 * 1024 * 1024) // 16MB HDFS Block Size
val fs = path.getFileSystem(conf)
if (fs.exists(path))
fs.delete(path, true)
val out = new BufferedOutputStream(fs.create(path)))
val txt = "Some text to output"
out.write(txt.getBytes("UTF-8"))
out.flush()
out.close()
fs.close()
Using HDFS API (hadoop-hdfs.jar) you can create InputStream/OutputStream for an HDFS path and read from/write to a file using regular java.io classes. For example:
URI uri = URI.create (“hdfs://host:port/file path”);
Configuration conf = new Configuration();
FileSystem file = FileSystem.get(uri, conf);
FSDataInputStream in = file.open(new Path(uri));
This code will work with local files as well (change hdfs:// to file://).
One simple way to write files to HDFS is to use a SequenceFiles. Here you use the native Hadoop APIs and not the ones provided by Spark.
Here is a simple snippet (in Scala):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
val conf = new Configuration() // Hadoop configuration
val sfwriter = SequenceFile.createWriter(conf,
SequenceFile.Writer.file(new Path("hdfs://nn1.example.com/file1")),
SequenceFile.Writer.keyClass(LongWritable.class),
SequenceFile.Writer.valueClass(Text.class))
val lw = new LongWritable()
val txt = new Text()
lw.set(12)
text.set("hello")
sfwriter.append(lw, txt)
sfwriter.close()
...
In case you don't have a key you can use NullWritable.class in its place:
SequenceFile.Writer.keyClass(NullWritable.class)
sfwriter.append(NullWritable.get(), new Text("12345"));

Can't write into a .txt file in netbeans

I have a file called SAVE.txt. It is in the same package as the class k. The problem is I can't write anything in the .txt file using the following code inside k:
File saveButton = new File ("SAVE.txt");
BufferedWriter output = new BufferedWriter (new FileWriter (saveButton));
output.write("something");
output.close();
Can anyone help me with this?
bw = new BufferedWriter(new FileWriter("filepath",true));
bw.write("Hello World!");
bw.write("\n");
bw.write("Hello World 2 !\n");
bw.write("Hello World 3 !" + "\n");
bw.close();
Try this?
Did you try something easy like this:
FileWriter f = new FileWriter("test.txt");
f.write("hello");
f.close();
When you write new File ("SAVE.txt"), since you specified a relative path, it refers to a file SAVE.txt in the current working directory. The current directory is in general completely separate from the directory corresponding to your Java package.
When you run code in Netbeans, it should be possible to specify the working directory (look in the project settings). Set it to some well-defined location, like the root of your project. Now specify the path relative to that working directory. For example, you could use new File ("out/SAVE.txt").