Prepend header record (or a string / a file) to large file in Scala / Java

What is the most efficient (or recommended) way to prepend a string or a file to another large file in Scala, preferably without using external libraries? The large file can be binary.
E.g.
if the prepend string is:
header_information|123.45|xyz\n
and the large file is:
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
...
I would expect to get:
header_information|123.45|xyz
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
...

I came up with the following solution:
Turn prepend string/file into InputStream
Turn large file into InputStream
"Combine" InputStreams together using java.io.SequenceInputStream
Use java.nio.file.Files.copy to write to target file
import java.io.{ByteArrayInputStream, FileInputStream, SequenceInputStream}
import java.nio.file.{Files, Paths, StandardCopyOption}

object FileAppender {
  def main(args: Array[String]): Unit = {
    // Header bytes as a stream, followed by the large file's stream
    val stringToPrepend = new ByteArrayInputStream("header_information|123.45|xyz\n".getBytes)
    val largeFile = new FileInputStream("big_file.dat")
    // SequenceInputStream reads the two streams back to back;
    // Files.copy streams the concatenation into the target file
    Files.copy(
      new SequenceInputStream(stringToPrepend, largeFile),
      Paths.get("output_file.dat"),
      StandardCopyOption.REPLACE_EXISTING
    )
  }
}
Tested on a ~30 GB file; it took ~40 seconds on a MacBook Pro (3.3 GHz / 16 GB).
This approach can also be used (if necessary) to combine multiple partitioned files created by e.g. the Spark engine, as sketched below.
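As a rough sketch of that idea (the partition file names and their ordering below are hypothetical, and scala.jdk.CollectionConverters assumes Scala 2.13), several partition files can be chained through a single SequenceInputStream built from an enumeration of streams:
import java.io.{FileInputStream, InputStream, SequenceInputStream}
import java.nio.file.{Files, Paths, StandardCopyOption}
import scala.jdk.CollectionConverters._

object PartitionCombiner {
  def main(args: Array[String]): Unit = {
    // Hypothetical partition files, in the order they should appear in the output
    val parts = Seq("part-00000", "part-00001", "part-00002")
    val streams = java.util.Collections.enumeration(
      parts.map(p => new FileInputStream(p): InputStream).asJava)
    // One logical stream over all parts, copied into a single target file
    Files.copy(
      new SequenceInputStream(streams),
      Paths.get("combined_output.dat"),
      StandardCopyOption.REPLACE_EXISTING)
  }
}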

Related

Remove all files with a given extension using scala spark

I have some .csv.crc files that are generated when I write a dataframe into a CSV file using Spark, and I want to delete all files with the .csv.crc extension.
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(existingSparkSession.sparkContext.hadoopConfiguration)
val srcPath = new Path("./src/main/resources/myDirectory/*.csv.crc")
println(fs.exists(srcPath))
println(fs.isFile(srcPath))
if (fs.exists(srcPath) && fs.isFile(srcPath)) {
  fs.delete(srcPath, true)
}
Both println lines print false, therefore it never even enters the if condition. How can I delete all .csv.crc files using Scala and Spark?
You can use the option below to avoid .crc files while writing (note: you are eliminating the checksum):
fs.setVerifyChecksum(false)
Alternatively, you can skip checksum verification while reading with the configuration below:
conf.set("dfs.client.read.shortcircuit.skip.checksum", "true")
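A minimal sketch of the write-side option in context, reusing existingSparkSession from the question (df is a placeholder for the dataframe being written, and whether dropping checksums is acceptable is a judgement call):
import org.apache.hadoop.fs.FileSystem

val fs = FileSystem.get(existingSparkSession.sparkContext.hadoopConfiguration)
// As suggested above, disable checksum verification so .crc files are not produced alongside the output
fs.setVerifyChecksum(false)
// ... then write the dataframe, e.g. df.write.csv("./src/main/resources/myDirectory")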

Creating temporary resource test files in Scala

I am currently writing tests for a function that takes file paths and loads a dataset from them. I am not able to change the function. At the moment I create files for each run of the test function. I am worried that simply creating files and then deleting them is bad practice. Is there a better way to create temporary test files in Scala?
import java.io.{File, PrintWriter}
val testFile = new File("src/main/resources/temp.txt" )
val pw = new PrintWriter(testFile)
val testLines = List("this is a text line", "this is the next text line")
testLines.foreach(pw.write)
pw.close()
// test logic here
testFile.delete()
I would generally prefer java.nio over java.io. You can create a temporary file like so:
import java.nio.file.Files
val tempFile = Files.createTempFile("test", ".txt")
You can delete it using Files.delete. To ensure that the file is deleted even in the case of an error, you should put the delete call into a finally block.
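Putting that together, a minimal sketch (the prefix, suffix and test contents are placeholders, and asJava assumes Scala 2.13's scala.jdk.CollectionConverters):
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.jdk.CollectionConverters._

val testLines = List("this is a text line", "this is the next text line")
// Created in the default temporary-file directory; prefix and suffix are arbitrary
val tempFile = Files.createTempFile("dataset-test", ".txt")
try {
  Files.write(tempFile, testLines.asJava, StandardCharsets.UTF_8)
  // ... call the function under test with tempFile.toString ...
} finally {
  // Clean up even if the test logic throws
  Files.delete(tempFile)
}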

Read large file into StringBuilder

I've got text files sitting in HDFS, ranging in size from around 300-800 MB each. They are almost valid json files. I am trying to make them valid json files so I can save them as ORC files.
I am attempting to create a StringBuilder with the needed opening characters, then read the file in line by line, stripping off the newlines, append each line to the string builder, and then add the needed closing character.
import org.apache.hadoop.fs.{FileSystem, Path, PathFilter, RemoteIterator}
import scala.collection.mutable.StringBuilder

// create the string builder with the opening characters
var sb = new scala.collection.mutable.StringBuilder("{\"data\" : ")
// read in the file (fs is an already-initialised org.apache.hadoop.fs.FileSystem)
val path = new Path("/path/to/crappy/file.json")
val stream = fs.open(path)
// read the file line by line; readLine strips the newlines so each line can be appended to the string builder
def readLines = Stream.cons(stream.readLine, Stream.continually(stream.readLine))
readLines.takeWhile(_ != null).foreach(line => sb.append(line))
That works. But as soon as I try to append the closing }:
sb.append("}")
It crashes with out of memory:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala
...
I've tried setting the initial size of the stringbuilder to be larger than the file I'm currently testing with, but that didn't help. I've also tried giving the driver more memory (spark-shell --driver-memory 3g), didn't help either.
Is there a better way to do this?
If that's all you need, you can just do it without Scala via hdfs command-line:
hadoop fs -cat /hdfs/path/prefix /hdfs/path/badjson /hdfs/path/suffix | hadoop fs -put - /hdfs/path/properjson
where the file prefix contains just {"data" : and the file suffix contains a single }.
1) Don't use Scala's Stream here; it is a broken abstraction for this job. It's extremely difficult to use an infinite/huge stream without blowing up the heap. Stick with a plain old Iterator, or use the more principled approaches from fs2 / ZIO.
In your case the readLines object accumulates all the entries it has produced, even though it only needs to hold one at a time.
2) The sb object leaks as well: it accumulates the entire file content in memory.
Consider writing the corrected content directly into some OutputStreamWriter, as sketched below.
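A minimal sketch of that suggestion, assuming fs is an initialised Hadoop FileSystem and the input/output paths are placeholders: lines are read through an Iterator and written straight to the output stream, so nothing accumulates in memory.
import java.io.{BufferedReader, InputStreamReader, OutputStreamWriter}
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.Path

val in = new BufferedReader(
  new InputStreamReader(fs.open(new Path("/path/to/crappy/file.json")), StandardCharsets.UTF_8))
val out = new OutputStreamWriter(
  fs.create(new Path("/path/to/fixed/file.json")), StandardCharsets.UTF_8)
try {
  out.write("{\"data\" : ")
  // The Iterator keeps no history, so memory use stays constant regardless of file size
  Iterator.continually(in.readLine()).takeWhile(_ != null).foreach(line => out.write(line))
  out.write("}")
} finally {
  in.close()
  out.close()
}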

How to save RDD data into json files, not folders

I am receiving the streaming data myDStream (DStream[String]) that I want to save in S3 (basically, for this question, it doesn't matter where exactly do I want to save the outputs, but I am mentioning it just in case).
The following code works well, but it saves folders with names like jsonFile-19-45-46.json, and inside those folders it saves the files _SUCCESS and part-00000.
Is it possible to save the data of each RDD[String] (these are JSON strings) into JSON files, not folders? I thought that repartition(1) would do the trick, but it didn't.
myDStream.foreachRDD { rdd =>
  // datetimeString = ....
  rdd.repartition(1).saveAsTextFile("s3n://mybucket/keys/jsonFile-" + datetimeString + ".json")
}
AFAIK there is no option to save it as a single file, because Spark is a distributed processing framework and it is not good practice to write to a single file; instead, each partition writes its own files in the specified path.
We can only pass the output directory where we want to save the data. The output writer will create one or more files (depending on the number of partitions) inside the specified path, with the part- file name prefix.
As an alternative to rdd.collect.mkString("\n"), you can use the Hadoop FileSystem library to clean up the output by moving the part-00000 file into its place. The code below works perfectly on the local filesystem and HDFS, but I'm unable to test it with S3:
val outputPath = "path/to/some/file.json"
rdd.saveAsTextFile(outputPath + "-tmp")
import org.apache.hadoop.fs.Path
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
fs.delete(new Path(outputPath + "-tmp"), true)
For Java I implemented this version. Hope it helps:
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
File dir = new File(System.getProperty("user.dir") + "/my.csv/");
File[] files = dir.listFiles((d, name) -> name.endsWith(".csv"));
fs.rename(new Path(files[0].toURI()), new Path(System.getProperty("user.dir") + "/csvDirectory/newData.csv"));
fs.delete(new Path(System.getProperty("user.dir") + "/my.csv/"), true);

Can I write a plain text HDFS (or local) file from a Spark program, not from an RDD?

I have a Spark program (in Scala) and a SparkContext. I am writing some files with RDD's saveAsTextFile. On my local machine I can use a local file path and it works with the local file system. On my cluster it works with HDFS.
I also want to write other arbitrary files as the result of processing. I'm writing them as regular files on my local machine, but want them to go into HDFS on the cluster.
SparkContext seems to have a few file-related methods but they all seem to be inputs not outputs.
How do I do this?
Thanks to marios and kostya, but there are a few steps to writing a text file into HDFS from Spark.
import java.io.BufferedOutputStream
import org.apache.hadoop.fs.{FileSystem, Path}

// The Hadoop configuration is accessible from the SparkContext
val fs = FileSystem.get(sparkContext.hadoopConfiguration)
// The output file can be created from the file system
val output = fs.create(new Path(filename))
// A BufferedOutputStream is used to write an actual text file
val os = new BufferedOutputStream(output)
os.write("Hello World".getBytes("UTF-8"))
os.close()
Note that FSDataOutputStream, which has been suggested, wraps Java's DataOutputStream; it is not a text output stream. The writeUTF method appears to write plain text, but it actually uses a binary serialization format that prepends extra length bytes.
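As a small local illustration of that point (not part of the original answer), writeUTF prepends a two-byte length before the modified-UTF-8 payload:
import java.io.{ByteArrayOutputStream, DataOutputStream}

val bytes = new ByteArrayOutputStream()
val out = new DataOutputStream(bytes)
out.writeUTF("Hello World")
out.close()
// The first two bytes encode the string length (0, 11); only then does the text payload follow
println(bytes.toByteArray.take(4).toList) // List(0, 11, 72, 101)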
Here's what worked best for me (using Spark 2.0):
import java.io.BufferedOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val path = new Path("hdfs://namenode:8020/some/folder/myfile.txt")
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.setInt("dfs.blocksize", 16 * 1024 * 1024) // 16 MB HDFS block size
val fs = path.getFileSystem(conf)
if (fs.exists(path))
  fs.delete(path, true)
val out = new BufferedOutputStream(fs.create(path))
val txt = "Some text to output"
out.write(txt.getBytes("UTF-8"))
out.flush()
out.close()
fs.close()
Using HDFS API (hadoop-hdfs.jar) you can create InputStream/OutputStream for an HDFS path and read from/write to a file using regular java.io classes. For example:
URI uri = URI.create("hdfs://host:port/file path");
Configuration conf = new Configuration();
FileSystem file = FileSystem.get(uri, conf);
FSDataInputStream in = file.open(new Path(uri));
This code will work with local files as well (change hdfs:// to file://).
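The snippet above only shows the read side; a minimal Scala sketch of the corresponding write side (the path is a placeholder) could look like this:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val uri = URI.create("hdfs://host:port/some/output/file.txt")
val fs = FileSystem.get(uri, new Configuration())
// fs.create returns an FSDataOutputStream, which is a regular java.io.OutputStream
val out = fs.create(new Path(uri))
out.write("some text\n".getBytes("UTF-8"))
out.close()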
One simple way to write files to HDFS is to use SequenceFiles. Here you use the native Hadoop APIs and not the ones provided by Spark.
Here is a simple snippet (in Scala):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._

val conf = new Configuration() // Hadoop configuration
val sfwriter = SequenceFile.createWriter(conf,
  SequenceFile.Writer.file(new Path("hdfs://nn1.example.com/file1")),
  SequenceFile.Writer.keyClass(classOf[LongWritable]),
  SequenceFile.Writer.valueClass(classOf[Text]))
val lw = new LongWritable()
val txt = new Text()
lw.set(12)
txt.set("hello")
sfwriter.append(lw, txt)
sfwriter.close()
...
In case you don't have a key, you can use NullWritable in its place:
SequenceFile.Writer.keyClass(classOf[NullWritable])
sfwriter.append(NullWritable.get(), new Text("12345"))