Akka Streams: keep the delimiter on framing stage - scala

I want to split a byte sequence by line, with a maximum line size:
val f = Source(List(ByteString("a\n")))
  .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 256))
  .runFold(ByteString())(_ ++ _)

Await.result(f, 3.seconds) should be(ByteString("a\n"))
The delimiter \n will be missing. I want to keep it. Is there a way to do this?
P.S. The issue is described here: https://github.com/akka/akka/issues/19664
P.P.S. Just adding a map stage to the flow and then concatenating each ByteString with the delimiter is not an option, since the data is split not only by the delimiter but also by the chunkSize property, because it actually comes from a file: val source = FileIO.fromPath(path, chunkSize = MAX_BYTES)
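One possible workaround, sketched below (it is not a built-in option of Framing.delimiter): the framing stage re-assembles complete frames no matter how FileIO chunked the input, so the delimiter can be appended back downstream of the framing stage rather than upstream. The file name and sizes are placeholders, and the caveat is that with allowTruncation = true a final unterminated line would also get a trailing "\n" appended.

import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.scaladsl.{FileIO, Framing}
import akka.util.ByteString

implicit val system: ActorSystem = ActorSystem("keep-delimiter")   // Akka 2.6+: also provides the materializer

val delimiter = ByteString("\n")

// Frames here are complete lines, independent of the chunkSize used by FileIO,
// so re-appending the delimiter after the framing stage is safe for terminated lines.
val lines = FileIO.fromPath(Paths.get("some-file.txt"), chunkSize = 256)
  .via(Framing.delimiter(delimiter, maximumFrameLength = 256, allowTruncation = true))
  .map(_ ++ delimiter)

val f = lines.runFold(ByteString())(_ ++ _)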

Related

Spark (Scala) modify the contents of a Dataset Column

I would like to have a Dataset, where the first column contains single words and the second column contains the filenames of the files where these words appear.
My current code looks something like this:
val path = "path/to/folder/with/files"
val tokens = spark.read.textFile(path)
  .flatMap(line => line.split(" "))
  .withColumn("filename", input_file_name())
tokens.show()
However, this returns something like:
|word1 |whole/path/to/some/file |
|word2 |whole/path/to/some/file |
|word1 |whole/path/to/some/otherfile|
(I don't need the whole path, just the last part.) My idea to fix this was to use the map function:
val tokensNoPath = tokens
  .map(r => (r(0), r(1).asInstanceOf[String].split("/").lastOption))
So basically, I am just going to every row, grabbing the second entry, and deleting everything before the last slash. However, since I'm very new to Spark and Scala, I can't figure out how to get the syntax for this right.
From the docs for substring_index: "substring_index(str, delim, count) - Returns the substring from str before count occurrences of the delimiter delim... If count is negative, everything to the right of the final delimiter (counting from the right) is returned."
.withColumn("filename", substring_index(input_file_name(), "/", -1))
You can split by slash and get the last element:
val tokens2 = tokens.withColumn("filename", element_at(split(col("filename"), "/"), -1))

split the file into multiple files based on a string in spark scala

I have a text file with the data below, which has no particular format:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
I want the output as two files, as below. I want to split the file based on the string abc.
file 1:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
file 2:
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
And the file names should come from the IT name (on the line that starts with k7), so the first file should be named IT_1234 and the second IT_8876.
There is a little dirty trick that I used for a project:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")
You can set the record delimiter your Spark context uses when reading files. So you could do something like this:
import org.apache.spark.sql.functions.col
import spark.implicits._   // assuming a SparkSession named `spark`, for .toDF

val delimit = "abc"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)

val df = sc.textFile("your_original_file.txt")
  .map(x => delimit ++ x)
  .toDF("delimit_column")
  .filter(col("delimit_column") =!= delimit)
Then you can map each element of your DataFrame (or RDD) to be written to a file.
It's a dirty method, but it might help you!
Have a good day.
PS: The filter at the end drops the first record, which is empty before the concatenation and becomes just the delimiter afterwards.
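For the "write each element to a file" step, one sketch (not part of the original answer) is to first derive the target file name from the k7*IT line as an extra column; the regex below is an assumption based on the sample data:

import org.apache.spark.sql.functions.{col, concat, lit, regexp_extract}

// Attach the file name (e.g. IT_1234) each record should be written under.
val withName = df.withColumn(
  "file_name",
  concat(lit("IT_"), regexp_extract(col("delimit_column"), """k7\*IT\s*(\d+)""", 1))
)
// Each (file_name, delimit_column) pair can then be collected and written out locally.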
You can benefit from sparkContext's wholeTextFiles function to read the file, then parse it to separate the strings (here I have used #### as a distinct combination of characters that won't appear in the text):
val rdd = sc.wholeTextFiles("path to the file")
  .flatMap(tuple => tuple._2.replace("\r\nabc", "####abc").split("####"))
  .collect()
Then loop over the array to save the texts to the output files, as sketched below:
for (str <- rdd) {
  // saving code here
}
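The saving part might look something like this; the regex used to pull the IT number out of the k7 line and the plain java.nio write are assumptions, not part of the original answer:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

for (str <- rdd) {
  // Build the file name from the "k7*IT nnnn*..." line, e.g. IT_1234.
  val name = """k7\*IT\s*(\d+)""".r
    .findFirstMatchIn(str)
    .map(m => s"IT_${m.group(1)}")
    .getOrElse("UNKNOWN")
  Files.write(Paths.get(s"$name.txt"), str.getBytes(StandardCharsets.UTF_8))
}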

Decoded Snappy compressed byte arrays have trailing zeros

I am trying to write and read Snappy compressed byte array created from a protobuf from a Hadoop Sequence File.
The array read back from Hadoop has trailing zeros. If a byte array is small and simple, removing the trailing zeros is enough to parse the protobuf back; however, for more complex objects and big sequence files, parsing fails.
Byte array example:
val data = Array(1, 2, 6, 4, 2, 1).map(_.toByte)

val distData = sparkContext.parallelize(Array.fill(5)(data))
  .map(j => (NullWritable.get(), new BytesWritable(j)))

distData.saveAsSequenceFile(file, Some(classOf[SnappyCodec]))

val original = distData.map(kv => kv._2.getBytes).collect()

val decoded = sparkContext
  .sequenceFile[NullWritable, BytesWritable](file)
  .map(kv => kv._2.getBytes.mkString)
  .collect()

decoded.foreach(println)
Output:
original := 126421
decoded := 126421000
This problem stems from BytesWritable.getBytes, which returns a backing array that may be longer than your data. Instead, call copyBytes (as in Write and read raw byte arrays in Spark - using Sequence File SequenceFile).
See HADOOP-6298: BytesWritable#getBytes is a bad name that leads to programming mistakes for more details.
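A minimal sketch of the read side with that fix applied, reusing sparkContext and file from the question (copyBytes returns only the getLength valid bytes, so no manual zero-stripping is needed):

val decoded = sparkContext
  .sequenceFile[NullWritable, BytesWritable](file)
  .map(kv => kv._2.copyBytes().mkString)
  .collect()

decoded.foreach(println)   // prints 126421 for each record, with no trailing zeros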

find line number in an unstructured file in scala

Hi guys, I am parsing an unstructured file for some keywords, but I can't seem to easily find the line numbers of the results I am getting.
val filePath: String = "myfile"
val myfile = sc.textFile(filePath)
var ora_temp = myfile.filter(line => line.contains("MyPattern")).collect
ora_temp.length
However, I not only want to find the lines that contain MyPattern, but I also want something more like a tuple of (matching line, line number).
Thanks in advance,
You can use zipWithIndex as eliasah pointed out in a comment (probably the most succinct way to do this is with the direct tuple accessor syntax), or like so, using pattern matching in the filter:
val matchingLineAndLineNumberTuples = sc.textFile("myfile")
  .zipWithIndex()
  .filter { case (line, lineNumber) => line.contains("MyPattern") }
  .collect()
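For reference, the more compact form using the tuple accessor mentioned above would be something like this:

val matches = sc.textFile("myfile")
  .zipWithIndex()
  .filter(_._1.contains("MyPattern"))
  .collect()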

Apache Spark: Building file/string in reduce function

I am working with a set of quite big txt files, a couple of hundred MB each. What I want to do is copy them, mapping every line with a function MapFunc. See my first try below, which is terribly slow. I am pretty sure the problem is the reduce function, which concatenates this huge string.
The order in which the lines are written to outputFile is not important, but they must not overlap. I already took a look at Spark's saveAsTextFile, but as far as I understand I can only specify the directory, not the filename, which is not useful for my use case. Also, what about adding a header and footer and the commas between the elements of the RDD? I would be grateful for any advice on how to tune this application for maximum performance.
import java.io.{BufferedWriter, File, FileWriter}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)

val input = sc.textFile(file)
val lines = input.filter(s => filterFunc(s)).map(s => MapFunc(s))
val output = lines.reduce((a, b) => a + ',' + b)

val outputFile = new File(outFile)
val writer = new BufferedWriter(new FileWriter(outputFile))
val buf = new StringBuilder
buf ++= "header"
buf ++= output
buf ++= "footer"
writer.append(buf)
writer.flush()
writer.close()
Edit: My files are simple CSV files. They can have comments (#). Also, I need to make sure that only files with 3 columns are processed, because the user is allowed to submit his own files for processing. This is done by filterFunc, which, to be honest, does not exclude whole files but only lines that do not match the criteria. A simple example would look like:
# File A
# generated mm/dd/yyyy
field11,field12,field13
field21,field22,field23
field31,field32,field33
And the output will look like this:
$header
map(line1),
map(line2),
map(line3)
$footer
saveAsTextFile is really close to what I am looking for, but as already said, it is important to me that I can control the filename and location of the output file.
Instead of using a temporary buffer buf, you should consider writing directly to the file.
import java.io.{File, PrintWriter}

val writer = new PrintWriter(new File(outFile))
writer.print("header")
writer.print(output)
writer.print("footer")
writer.flush()
writer.close()
This avoids the extra concatenation, as well as the memory consumed by buf.
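Going a step further (a sketch only, not part of the answer above, reusing lines and outFile from the question): the huge driver-side string produced by reduce can be avoided entirely by streaming the mapped lines to the writer with RDD.toLocalIterator, which pulls one partition at a time to the driver:

import java.io.{File, PrintWriter}

val writer = new PrintWriter(new File(outFile))
writer.print("header")

var first = true
for (line <- lines.toLocalIterator) {   // streams partition by partition to the driver
  if (!first) writer.print(',')
  writer.print(line)
  first = false
}

writer.print("footer")
writer.close()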