I have a directory with many *.zip files and each zipfile can have many *.xml files inside it.
Test1.Zip [2018-09-07 02-57-43Z-OJPRxx.xml , 2018-09-07 03-57-43Z-OJPRxx.xml ]
Test2.Zip [2018-09-17 02-57-43Z-OJPRYY.xml , 2018-09-17 03-57-43Z-OJPRYY.xml ]
Using the Scala code below I am able to print the names of the files inside each zip file, but I can't print the name of the zip file itself while using sc.binaryFiles and ZipInputStream.
I adapted code I found elsewhere so that it appends the XML filename to each line using the getName method, but I can't find any function in Scala to print the parent zip filename once the stream is built and the zip files are read one by one.
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {
  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {
    sc.binaryFiles(path, minPartitions)
      .flatMap { case (name: String, content: PortableDataStream) =>
        val zis = new ZipInputStream(content.open)
        Stream.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .map { x =>
            val filename1 = x.getName
            scala.io.Source.fromInputStream(zis, "UTF-8").getLines.mkString(s"~${filename1}\n") + s"~${filename1}"
          }
      }
  }
}
val df = sc.readFile("/landing/data/hs/tsnt/froff/bdu/acbdustore/INPUT/zipfiles", 1)
df.saveAsTextFile("/landing/data/hs/tsnt/froff/bdu/acbdustore/INPUT/outputfile")
example output:
<?xml version="1.0" encoding="utf-8"?>~2018-09-06 01-57-43Z-OJPRQL.xml
<pnr xmlns="http://gdsx.com/PnrDataPush.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">~2018-09-06 01-57-43Z-
OJPRQL.xml
<PNRid>999999999999</PNRid>~2018-09-06 01-57-43Z-OJPRQL.xml
<recordLocator>OJPRQL</recordLocator>~2018-09-06 01-57-43Z-OJPRQL.xml
<GDS>6</GDS>~2018-09-06 01-57-43Z-OJPRQL.xml
<platformID>NA</platformID>~2018-09-06 01-57-43Z-OJPRQL.xml
I expect the parent zip file name to be captured and written to the output file, but I can't find a function/method to do so. Would someone please help me out here? Thanks!
It should be in name, or in
content.getPath
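A minimal, non-Spark sketch of that idea: in the flatMap over sc.binaryFiles, the first tuple element (name) is the full path of the parent zip, so its basename can be tagged onto every line next to the entry name. Here an in-memory zip and a made-up path stand in for one (name, content) pair:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

// In-memory zip standing in for the PortableDataStream content of one zip file
val zipBytes: Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val zos = new ZipOutputStream(bos)
  zos.putNextEntry(new ZipEntry("a.xml")); zos.write("<x/>".getBytes("UTF-8")); zos.closeEntry()
  zos.putNextEntry(new ZipEntry("b.xml")); zos.write("<y/>".getBytes("UTF-8")); zos.closeEntry()
  zos.close()
  bos.toByteArray
}

// In Spark this is the `name` bound by: case (name, content) => ...
val name = "/landing/data/INPUT/zipfiles/Test1.Zip" // hypothetical path
val zipName = name.split("/").last

val zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))
val lines = Stream.continually(zis.getNextEntry)
  .takeWhile(_ != null)
  .flatMap { entry =>
    // tag each line with the entry name AND the parent zip name
    scala.io.Source.fromInputStream(zis, "UTF-8").getLines()
      .map(line => s"$line~${entry.getName}~$zipName")
      .toList
  }
  .toList
// lines == List("<x/>~a.xml~Test1.Zip", "<y/>~b.xml~Test1.Zip")
```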
Related
I am trying to download pdf file from S3 using the akka-stream-alpakka connector. I have the s3 path and try to download the pdf using a wrapper method over the alpakka s3Client.
def getSource(s3Path: String): Source[ByteString, NotUsed] = {
  val (source, _) = s3Client.download(s3Bucket, s3Path)
  source
}
From my main code, I call the above method and try to write the stream to a file:
val file = new File("certificate.pdf")
val res: Future[IOResult] = getSource(data.s3PdfPath)
  .runWith(FileIO.toFile(file))
However, instead of the content being written to a file, I am stuck with a Future[IOResult]. Can someone please guide me as to where I am going wrong?
def download(bucket: String, bucketKey: String, filePath: String) = {
  val (s3Source: Source[ByteString, _], _) = s3Client.download(bucket, bucketKey)
  val result = s3Source.toMat(FileIO.toPath(Paths.get(filePath)))(Keep.right)
    .run()
  result
}
download(s3Bucket, key, newSigFilepath).onComplete {
  case Success(_)  => println(s"downloaded to $newSigFilepath")
  case Failure(ex) => ex.printStackTrace()
}
Inspect the IOResult, and if successful you can use your file:
res.foreach {
  case IOResult(bytes, Success(_)) =>
    println(s"$bytes bytes written to $file")
    ... // do whatever you want with your file
  case _ =>
    println("some error occurred.")
}
I am extracting zip files in memory using Scala as shown below:
val rdd = sc.binaryFiles("/path")
  .flatMap {
    case (name: String, content: PortableDataStream) => {
      val zis = new ZipInputStream(content.open())
      Stream.continually(zis.getNextEntry()).takeWhile(_ != null)
        .flatMap { _ =>
          val br = new BufferedReader(new InputStreamReader(zis))
          val root = scala.xml.XML.load(br)
          val namespace = root.head.namespace
          //other stuff
        }
    }
  }
The problem with this code is that it reads only the 1st XML file inside the zip; the ZipInputStream is then closed automatically and I get the error:
java.io.IOException: Stream closed
at java.util.zip.ZipInputStream.ensureOpen(Unknown Source)
at java.util.zip.ZipInputStream.getNextEntry(Unknown Source)
I have a few zip files and each zip contains XML files; my goal is to parse each of the XML files. As I am a newbie to Scala, I am not sure whether I am doing something wrong or whether scala.xml.XML.load itself is closing the stream.
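One likely cause: scala.xml.XML.load closes the reader it is given once parsing finishes, taking the underlying ZipInputStream with it. A common workaround is to copy each entry into a byte array first, so the parser only ever closes the copy. A self-contained sketch of that workaround (using the JDK's DocumentBuilder instead of scala.xml so it runs without extra dependencies, and an in-memory zip in place of content.open()):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import javax.xml.parsers.DocumentBuilderFactory

// In-memory zip with two XML entries, standing in for content.open()
val zipBytes: Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val zos = new ZipOutputStream(bos)
  zos.putNextEntry(new ZipEntry("a.xml")); zos.write("""<pnr xmlns="urn:a"/>""".getBytes("UTF-8")); zos.closeEntry()
  zos.putNextEntry(new ZipEntry("b.xml")); zos.write("""<pnr xmlns="urn:b"/>""".getBytes("UTF-8")); zos.closeEntry()
  zos.close()
  bos.toByteArray
}

// Drain only the CURRENT entry; the parser then closes this copy, not zis
def readEntry(zis: ZipInputStream): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  val buf = new Array[Byte](4096)
  Iterator.continually(zis.read(buf)).takeWhile(_ > -1).foreach(n => out.write(buf, 0, n))
  out.toByteArray
}

val dbf = DocumentBuilderFactory.newInstance()
dbf.setNamespaceAware(true)

val zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))
val namespaces = Stream.continually(zis.getNextEntry)
  .takeWhile(_ != null)
  .map { _ =>
    val doc = dbf.newDocumentBuilder().parse(new ByteArrayInputStream(readEntry(zis)))
    doc.getDocumentElement.getNamespaceURI
  }
  .toList
// namespaces == List("urn:a", "urn:b")
```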
I'm trying to list files within a directory that match a regular expression, e.g. ".csv$". This is very similar to Scala & DataBricks: Getting a list of Files.
I've been running in circles for hours trying to figure out how Scala can list a directory of files and filter them by regex.
import java.io.File

def getListOfFiles(dir: String): List[File] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).toList
  } else {
    List[File]()
  }
}
val name : String = ".csv"
val files = getListOfFiles("/home/con/Scripts").map(_.path).filter(_.matches(name))
println(files)
gives the error
/home/con/Scripts/scala/find_files.scala:13: error: value path is not a member of java.io.File
val files = getListOfFiles("/home/con/Scripts").map(_.path).filter(_.matches(name))
I'm trying to figure out the regular Scala equivalent of dbutils.fs.ls which eludes me.
How can list files in a regular directory in Scala?
The error is reporting that path is not a member of java.io.File, which indeed it isn't.
If you want to match by name, why don't you get the file names? Also, your regex is a bit off if you want to match on file extension.
Fixing these two problems:
val name: String = ".+\\.csv"
val files = getListOfFiles("/path/to/files/location")
  .map(f => f.getName)
  .filter(_.matches(name))
will output .csv files in the /path/to/files/location folder.
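Putting both fixes together in a runnable sketch (the temp directory and file names here are invented for illustration):

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical sandbox: a temp directory with a few files to filter
val dir = Files.createTempDirectory("scripts").toFile
List("a.csv", "b.csv", "notes.txt").foreach(n => Files.createFile(new File(dir, n).toPath))

def getListOfFiles(d: File): List[File] =
  if (d.exists && d.isDirectory) d.listFiles.filter(_.isFile).toList
  else Nil

// match by file NAME against an extension regex, not by a nonexistent .path member
val pattern = ".+\\.csv"
val csvNames = getListOfFiles(dir).map(_.getName).filter(_.matches(pattern)).sorted
// csvNames == List("a.csv", "b.csv")
```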
I want to read files from a given directory, then read the contents of each file and create a Map of filename as key and its contents as value.
I have not had any success yet, but I have tried like this:
def getFileLists(): List[File] = {
  val directory = "./input"
  // print(new File(directory).listFiles().toList)
  new File(directory).listFiles().toList
}
val contents = getFileLists().map(file => Source.fromFile(file).getLines())
print(contents)
You can change the following line
val contents = getFileLists().map(file => Source.fromFile(file).getLines())
to
val contents = getFileLists().map(file => (file.getName, Source.fromFile(file).getLines()))
which would give you
contents: List[(String, Iterator[String])]
Furthermore you can add .toMap method call as
val contents = getFileLists().map(file => (file.getName, Source.fromFile(file).getLines())).toMap
which would give you
contents: scala.collection.immutable.Map[String,Iterator[String]]
What you are doing is transforming the list of files to a list of their contents. You want a Map[File, List[String]] instead. To do that, the easiest way is to map to tuples of files and content, and then call toMap on the result:
getFileLists().map(file => file -> Source.fromFile(file).getLines().toList).toMap
toMap works when the input sequence has Tuple2 as element type. file -> contents is such a tuple (File, List[String]).
Or in two steps:
val xs: Seq[(File, List[String])] = getFileLists().map(file =>
file -> Source.fromFile(file).getLines().toList)
val m: Map[File, List[String]] = xs.toMap
You can try this:
getFileLists().map(file => (file.getName, Source.fromFile(file).getLines().toList)).toMap
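A self-contained version of the tuple-then-toMap approach (the temp directory and sample files are invented; note that .toList also materializes the one-shot Iterator returned by getLines):

```scala
import java.io.{File, PrintWriter}
import java.nio.file.Files
import scala.io.Source

// Hypothetical ./input stand-in: a temp directory with two small files
val dir = Files.createTempDirectory("input").toFile
Map("a.txt" -> "hello\nworld", "b.txt" -> "scala").foreach { case (n, text) =>
  val pw = new PrintWriter(new File(dir, n)); pw.write(text); pw.close()
}

def getFileLists(directory: File): List[File] = directory.listFiles().toList

// filename -> contents
val contents: Map[String, List[String]] =
  getFileLists(dir).map(f => f.getName -> Source.fromFile(f).getLines().toList).toMap
// contents("a.txt") == List("hello", "world")
```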
I am trying to fetch and read text files inside a zip file uploaded to an AWS S3 bucket.
The code I tried:
var ZipFileList = spark.sparkContext.binaryFiles(/path/)
var unit = ZipFileList.flatMap {
  case (zipFilePath, zipContent) => {
    val zipInputStream = new ZipInputStream(zipContent.open())
    val zipEntry = zipInputStream.getNextEntry()
    println(zipEntry.getName)
  }
}
but it gives the error "type mismatch; found: Unit, required: TraversableOnce"
val files = spark.sparkContext.wholeTextFiles(/path/)
files.flatMap({ case (name, content) =>
  unzip(content) //gives error "type mismatch; found : Unit required: scala.collection.GenTraversableOnce[?]"
})
Is there any other way to read file contents inside a zip file?
The zip files contain .json files, and what I want to achieve is to read and parse all those files.
You aren't actually returning the data from the unzip() command, are you? I think that's part of the problem.
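The flatMap body ends with println, which returns Unit, hence the "Unit required TraversableOnce" mismatch; the body needs to evaluate to a collection. A sketch of the shape of the fix, with an in-memory zip standing in for zipContent.open() so it runs outside Spark:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

// In-memory stand-in for one zip file from binaryFiles
val zipBytes: Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val zos = new ZipOutputStream(bos)
  List("one.json" -> """{"a":1}""", "two.json" -> """{"b":2}""").foreach { case (n, body) =>
    zos.putNextEntry(new ZipEntry(n)); zos.write(body.getBytes("UTF-8")); zos.closeEntry()
  }
  zos.close()
  bos.toByteArray
}

val zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))

// Return a List of (entryName, contents) instead of println-ing (which yields Unit)
val entries: List[(String, String)] = Stream.continually(zis.getNextEntry)
  .takeWhile(_ != null)
  .map { entry =>
    entry.getName -> scala.io.Source.fromInputStream(zis, "UTF-8").mkString
  }
  .toList
// entries.map(_._1) == List("one.json", "two.json")
```

Inside Spark, returning such a List from the flatMap body satisfies the TraversableOnce requirement, and the JSON strings can then be parsed downstream.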