Assume the following structure in your recources folder:
resources
├─spec_A
| ├─AA
| | ├─file-aev
| | ├─file-oxa
| | ├─…
| | └─file-stl
| ├─BB
| | ├─file-hio
| | ├─file-nht
| | ├─…
| | └─file-22an
| └─…
├─spec_B
| ├─AA
| | ├─file-aev
| | ├─file-oxa
| | ├─…
| | └─file-stl
| ├─BB
| | ├─file-hio
| | ├─file-nht
| | ├─…
| | └─file-22an
| └─…
└─…
The task is to read all files for a given specification spec_X one subfolder by one. For obvious reasons we do not want to have the exact names as string literals to open with Source.fromResource("spec_A/AA/…") for hundreds of files in the code.
Additionally, this solution should of course run inside the development environment, i.e. without being packaged into a jar.
The only option to list files inside a resource folder I found is with nio’s Filesystem concept, as this can load a jar-file as a file system. But this comes with two major downsides:
java.nio uses java Stream API, which I cannot collect from inside scala code: Collectors.toList() cannot be made to compile as it cannot determine the right type.
The filesystem needs different base paths for OS-filesystems and jar-file-based filesystems. So I need to manually differentiate between the two situations testing and jar-based running.
First lazy load the jar-filesystem if needed
private static FileSystem jarFileSystem;
static synchronized private FileSystem getJarFileAsFilesystem(String drg_file_root) throws URISyntaxException, IOException {
if (jarFileSystem == null) {
jarFileSystem = FileSystems.newFileSystem(ConfigFiles.class.getResource(drg_file_root).toURI(), Collections.emptyMap());
}
return jarFileSystem;
}
next do the limbo to figure out whether we are inside the jar or not by checking the protocol of the URL and return a Path. (Protocol inside the jar file will be jar:
static Path getPathForResource(String resourceFolder, String filename) throws IOException, URISyntaxException {
URL url = ConfigFiles.class.getResource(resourceFolder + "/" + filename);
return "file".equals(url.getProtocol())
? Paths.get(url.toURI())
: getJarFileAsFilesystem(resourceFolder).getPath(resourceFolder, filename);
}
And finally list and collect into a java list
static List<Path> listPathsFromResource(String resourceFolder, String subFolder) throws IOException, URISyntaxException {
return Files.list(getPathForResource(resourceFolder, subFolder))
.filter(Files::isRegularFile)
.sorted()
.collect(toList());
}
Only then we can go back do Scala and fetch is
class SpecReader {
def readSpecMessage(spec: String): String = {
List("CN", "DO", "KF")
.flatMap(ConfigFiles.listPathsFromResource(s"/spec_$spec", _).asScala.toSeq)
.flatMap(path ⇒ Source.fromInputStream(Files.newInputStream(path), "UTF-8").getLines())
.reduce(_ + " " + _)
}
}
object Main {
def main(args: Array[String]): Unit = {
System.out.println(new SpecReader().readSpecMessage(args.head))
}
}
I put a running mini project to proof it here: https://github.com/kurellajunior/list-files-from-resource-directory
But of course this is far from optimal. I wanto to elmiminate the two downsides mentioned above so, that
scala files only
no extra testing code in my production library
Here's a function for reading all files from a resource folder. My use case is with small files. Inspired by Jan's answers, but without needing a user-defined collector or messing with Java.
// Helper for reading an individual file.
def readFile(path: Path): String =
Source.fromInputStream(Files.newInputStream(path), "UTF-8").getLines.mkString("\n")
private var jarFS: FileSystem = null; // Static variable for storing a FileSystem. Will be loaded on the first call to getPath.
/**
* Gets a Path object corresponding to an URL.
* #param url The URL could follow the `file:` (usually used in dev) or `jar:` (usually used in prod) rotocols.
* #return A Path object.
*/
def getPath(url: URL): Path = {
if (url.getProtocol == "file")
Paths.get(url.toURI)
else {
// This hacky branch is to handle reading resource files from a jar (where url is jar:...).
val strings = url.toString.split("!")
if (jarFS == null) {
jarFS = FileSystems.newFileSystem(URI.create(strings(0)), Map[String, String]().asJava)
}
jarFS.getPath(strings(1))
}
}
/**
* Given a folder (e.g. "A"), reads all files under the resource folder (e.g. "src/main/resources/A/**") as a Seq[String]. */
* #param folder Relative path to a resource folder under src/main/resources.
* #return A sequence of strings. Each element corresponds to the contents of a single file.
*/
def readFilesFromResource(folder: String): Seq[String] = {
val url = Main.getClass.getResource("/" + folder)
val path = getPath(url)
val ls = Files.list(path)
ls.collect(Collectors.toList()).asScala.map(readFile) // Magic!
}
(not catered to example in question)
Relevant imports:
import java.nio.file._
import scala.collection.JavaConverters._ // Needed for .asScala
import java.net.{URI, URL}
import java.util.stream._
import scala.io.Source
Thanks to #TrebledJ ’s answer, this could be minimized to the following:
class ConfigFiles (val basePath String) {
lazy val jarFileSystem: FileSystem = FileSystems.newFileSystem(getClass.getResource(basePath).toURI, Map[String, String]().asJava);
def listPathsFromResource(folder: String): List[Path] = {
Files.list(getPathForResource(folder))
.filter(p ⇒ Files.isRegularFile(p, Array[LinkOption](): _*))
.sorted.toList.asScala.toList // from Stream to java List to Scala Buffer to scala List
}
private def getPathForResource(filename: String) = {
val url = classOf[ConfigFiles].getResource(basePath + "/" + filename)
if ("file" == url.getProtocol) Paths.get(url.toURI)
else jarFileSystem.getPath(basePath, filename)
}
}
special attention was necessary for the empty setting maps.
checking for the URL protocol seems inevitable. Git updated, PUll requests welcome: https://github.com/kurellajunior/list-files-from-resource-directory
Related
I want to move all files under a directory in my s3 bucket to another directory within the same bucket, using scala.
Here is what I have:
def copyFromInputFilesToArchive(spark: SparkSession) : Unit = {
val sourcePath = new Path("s3a://path-to-source-directory/")
val destPath = new Path("s3a:/path-to-destination-directory/")
val fs = sourcePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.moveFromLocalFile(sourcePath,destPath)
}
I get this error:
fs.copyFromLocalFile returns Wrong FS: s3a:// expected file:///
Error explained
The error you are seeing is because the copyFromLocalFile method is really for moving files from a local filesystem to S3. You are trying to "move" files that are already both in S3.
It is important to note that directories don't really exist in Amazon S3 buckets - The folder/file hierarchy you see is really just key-value metadata attached to the file. All file objects are really sitting in the same big, single level container and that filename key is there to give the illusion of files/folders.
To "move" files in a bucket, what you really need to do is update the filename key with the new path which is really just editing object metadata.
How to do a "move" within a bucket with Scala
To accomplish this, you'd need to copy the original object, assign the new metadata to the copy, and then write it back to S3. In practice, you can copy it and save it to the same object which will overwrite the old version, which acts a lot like an update.
Try something like this (from datahackr):
/**
* Copy object from a key to another in multiparts
*
* #param sourceS3Path S3 object key
* #param targetS3Path S3 object key
* #param fromBucketName bucket name
* #param toBucketName bucket name
*/
#throws(classOf[Exception])
#throws(classOf[AmazonServiceException])
def copyMultipart(sourceS3Path: String, targetS3Path: String, fromS3BucketName: String, toS3BucketName: String) {
// Create a list of ETag objects. You retrieve ETags for each object part uploaded,
// then, after each individual part has been uploaded, pass the list of ETags to
// the request to complete the upload.
var partETags = new ArrayList[PartETag]();
// Initiate the multipart upload.
val initRequest = new InitiateMultipartUploadRequest(toS3BucketName, targetS3Path);
val initResponse = s3client.initiateMultipartUpload(initRequest);
// Get the object size to track the end of the copy operation.
var metadataResult = getS3ObjectMetadata(sourceS3Path, fromS3BucketName);
var objectSize = metadataResult.getContentLength();
// Copy the object using 50 MB parts.
val partSize = (50 * 1024 * 1024) * 1L;
var bytePosition = 0L;
var partNum = 1;
var copyResponses = new ArrayList[CopyPartResult]();
while (bytePosition < objectSize) {
// The last part might be smaller than partSize, so check to make sure
// that lastByte isn't beyond the end of the object.
val lastByte = Math.min(bytePosition + partSize - 1, objectSize - 1);
// Copy this part.
val copyRequest = new CopyPartRequest()
.withSourceBucketName(fromS3BucketName)
.withSourceKey(sourceS3Path)
.withDestinationBucketName(toS3BucketName)
.withDestinationKey(targetS3Path)
.withUploadId(initResponse.getUploadId())
.withFirstByte(bytePosition)
.withLastByte(lastByte)
.withPartNumber(partNum + 1);
partNum += 1;
copyResponses.add(s3client.copyPart(copyRequest));
bytePosition += partSize;
}
// Complete the upload request to concatenate all uploaded parts and make the copied object available.
val completeRequest = new CompleteMultipartUploadRequest(
toS3BucketName,
targetS3Path,
initResponse.getUploadId(),
getETags(copyResponses));
s3client.completeMultipartUpload(completeRequest);
logger.info("Multipart upload complete.");
}
// This is a helper function to construct a list of ETags.
def getETags(responses: java.util.List[CopyPartResult]): ArrayList[PartETag] = {
var etags = new ArrayList[PartETag]();
val it = responses.iterator();
while (it.hasNext()) {
val response = it.next();
etags.add(new PartETag(response.getPartNumber(), response.getETag()));
}
return etags;
}
def moveObject(sourceS3Path: String, targetS3Path: String, fromBucketName: String, toBucketName: String) {
logger.info(s"Moving S3 frile from $sourceS3Path ==> $targetS3Path")
// Get the object size to track the end of the copy operation.
var metadataResult = getS3ObjectMetadata(sourceS3Path, fromBucketName);
var objectSize = metadataResult.getContentLength();
if (objectSize > ALLOWED_OBJECT_SIZE) {
logger.info("Object size is greater than 1GB. Initiating multipart upload.");
copyMultipart(sourceS3Path, targetS3Path, fromBucketName, toBucketName);
} else {
s3client.copyObject(fromBucketName, sourceS3Path, toBucketName, targetS3Path);
}
// Delete source object after successful copy
s3client.deleteObject(fromS3BucketName, sourceS3Path);
}
You will need the AWS Sdk for this.
If you are using AWS Sdk Version 1,
projectDependencies ++= Seq(
"com.amazonaws" % "aws-java-sdk-s3" % "1.12.248"
)
import com.amazonaws.services.s3.transfer.{ Copy, TransferManager, TransferManagerBuilder }
val transferManager: TransferManager =
TransferManagerBuilder.standard().build()
def copyFile(): Unit = {
val copy: Copy =
transferManager.copy(
"source-bucket-name", "source-file-key",
"destination-bucket-name", "destination-file-key"
)
copy.waitForCompletion()
}
If you are using AWS Sdk Version 2
projectDependencies ++= Seq(
"software.amazon.awssdk" % "s3" % "2.17.219",
"software.amazon.awssdk" % "s3-transfer-manager" % "2.17.219-PREVIEW"
)
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.model.CopyObjectRequest
import software.amazon.awssdk.transfer.s3.{Copy, CopyRequest, S3ClientConfiguration, S3TransferManager}
// change Region.US_WEST_2 to your required region
// or it might even work without the whole `.region(Region.US_WEST_2)` thing
val s3ClientConfig: S3ClientConfiguration =
S3ClientConfiguration
.builder()
.region(Region.US_WEST_2)
.build()
val s3TransferManager: S3TransferManager =
S3TransferManager.builder().s3ClientConfiguration(s3ClientConfig).build()
def copyFile(): Unit = {
val copyObjectRequest: CopyObjectRequest =
CopyObjectRequest
.builder()
.sourceBucket("source-bucket-name")
.sourceKey("source-file-key")
.destinationBucket("destination-bucket-name")
.destinationKey("destination-file-key")
.build()
val copyRequest: CopyRequest =
CopyRequest
.builder()
.copyObjectRequest(copyObjectRequest)
.build()
val copy: Copy =
s3TransferManager.copy(copyRequest)
copy.completionFuture().get()
}
Keep in mind that you will need the AWS credentials with appropriate permissions for both the source and destination object. For this, You just need to get the credentials and make them available as following environment variables.
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
export AWS_SESSION_TOKEN=your_session_token
Also, "source-file-key" and "destination-file-key" should be the full path of the file in the bucket.
I have customer_input_data.tar.gz in HDFS, which have 10 different tables data in csv file format. so i need to unzip this file to /my/output/path using spark scala
please suggest how to unzip customer_input_data.tar.gz file using spark scala
gzip is not a splittable format in Hadoop. Consequently, the file is not really going to be distributed across the cluster and you don't get any benefit of distributed compute/processing in hadoop or Spark.
Better approach may be to,
uncompress the file on the OS and then individually send the files back to hadoop.
If you still want to uncompress in scala, you can simply resort to java class GZIPInputStream via
new GZIPInputStream(new FileInputStream("your file path"))
I developed the below code for decompress the files using scala. You need to pass input path and output path and Hadoopfile system
/*below method used for processing zip files*/
#throws[IOException]
private def processTargz(fullpath: String, houtPath: String, fs: FileSystem): Unit = {
val path = new Path(fullpath)
val gzipIn = new GzipCompressorInputStream(fs.open(path))
try {
val tarIn = new TarArchiveInputStream(gzipIn)
try {
var entry:TarArchiveEntry = null
out.println("Tar entry")
out.println("Tar Name entry :" + FilenameUtils.getName(fullpath))
val fileName1 = FilenameUtils.getName(fullpath)
val tarNamesFolder = fileName1.substring(0, fileName1.indexOf('.'))
out.println("Folder Name : " + tarNamesFolder)
while ( {
(entry = tarIn.getNextEntry.asInstanceOf[TarArchiveEntry]) != null
}) { // entity Name as tsv file name which are part of inside compressed tar file
out.println("ENTITY NAME : " + entry.getName)
/** If the entry is a directory, create the directory. **/
out.println("While")
if (entry.isDirectory) {
val f = new File(entry.getName)
val created = f.mkdir
out.println("mkdir")
if (!created) {
out.printf("Unable to create directory '%s', during extraction of archive contents.%n", f.getAbsolutePath)
out.println("Absolute path")
}
}
else {
var count = 0
val slash = "/"
val targetPath = houtPath + slash + tarNamesFolder + slash + entry.getName
val hdfswritepath = new Path(targetPath)
val fos = fs.create(hdfswritepath, true)
try {
val dest = new BufferedOutputStream(fos, BUFFER_SIZE)
try {
val data = new Array[Byte](BUFFER_SIZE)
while ( {
(count = tarIn.read(data, 0, BUFFER_SIZE)) != -1
}) dest.write(data, 0, count)
} finally if (dest != null) dest.close()
}
}
}
out.println("Untar completed successfully!")
} catch {
case e: IOException =>
out.println("catch Block")
} finally {
out.println("FINAL Block")
if (tarIn != null) tarIn.close()
}
}
}
Is it possible to decompress a single file from a zip folder and return the decompressed file without storing the data on the server?
I have a zip file with an unknown structure and I would like to develop a service that will serve the content of a given file on demand without decompressing the whole zip and also without writing on disk.
So, if I have a zip file like this
zip_folder.zip
| folder1
| file1.txt
| file2.png
| folder 2
| file3.jpg
| file4.pdf
| ...
So, I would like my service to receive the name and path of the file so I could send the file.
For example, fileName could be folder1/file1.txt
def getFileContent(fileName: String): IBinaryContent = {
val content: IBinaryContent = getBinaryContent(...)
val zipInputStream: ZipInputStream = new ZipInputStream(content.getInputStream)
val outputStream: FileOutputStream = new FileOutputStream(fileName)
var zipEntry: ZipEntry = null
var founded: Boolean = false
while ({
zipEntry = zipInputStream.getNextEntry
Option(zipEntry).isDefined && !founded
}) {
if (zipEntry.getName.equals(fileName)) {
val buffer: Array[Byte] = Array.ofDim(9000) // FIXME how to get the dimension of the array
var length = 0
while ({
length = zipInputStream.read(buffer)
length != -1
}) {
outputStream.write(buffer, 0, length)
}
outputStream.close()
founded = true
}
}
zipInputStream.close()
outputStream /* how can I return the value? */
}
How can I do it without writing the content in the disk?
You can use a ByteArrayOutputStream instead of the FileOutputStream to uncompress the zip entry into memory. Then call toByteArray() on it.
Also note that, technically, you would not even need to decompress the zip part if you can transmit it over a protocol (think: HTTP(S)) which supports the deflate encoding for its transport (which is usually the standard compression used in Zip files).
So, basically I did the same thing that #cbley recommended. I returned an array of bytes and defined the content-type so that the browser can do the magic!
def getFileContent(fileName: String): IBinaryContent = {
val content: IBinaryContent = getBinaryContent(...)
val zipInputStream: ZipInputStream = new ZipInputStream(content.getInputStream)
val outputStream: ByteArrayOutputStream = new ByteArrayOutputStream()
var zipEntry: ZipEntry = null
var founded: Boolean = false
while ({
zipEntry = zipInputStream.getNextEntry
Option(zipEntry).isDefined && !founded
}) {
if (zipEntry.getName.equals(fileName)) {
val buffer: Array[Byte] = Array.ofDim(zipEntry.getSize)
var length = 0
while ({
length = zipInputStream.read(buffer)
length != -1
}) {
outputStream.write(buffer, 0, length)
}
outputStream.close()
founded = true
}
}
zipInputStream.close()
outputStream.toByteArray
}
// in my rest service
#GET
#Path("/content/${fileName}")
def content(#PathVariable fileName): Response = {
val content = getFileContent(fileName)
Response.ok(content)
.header("Content-type", new Tika().detect(fileName)) // I'm using Tika but it's possible to use other libraries
.build()
}
I want to read a specific file from S3 bucket. In my S3 bucket I will be having so many objects(directories and Sub directories). I want traverse through all the objects and have to read only that file.
I am trying below code:
val s3Client: AmazonS3Client = getS3Client()
try {
log.info("Listing objects from S3")
var counter = 0
val listObjectsRequest = new ListObjectsRequest()
.withBucketName(bucketName)
.withMaxKeys(2)
.withPrefix("Test/"+"Client_cd" + "/"+"DM1"+"/")
.withMarker("Test/"+"Client_cd" + "/"+"DM1"+"/")
var objectListing: ObjectListing = null
do {
objectListing = s3Client.listObjects(listObjectsRequest)
import scala.collection.JavaConversions._
for (objectSummary <- objectListing.getObjectSummaries) {
println( objectSummary.getKey + "\t" + StringUtils.fromDate(objectSummary.getLastModified))
}
listObjectsRequest.setMarker(objectListing.getNextMarker())
}
while (objectListing.isTruncated())
}
catch {
case e: Exception => {
log.error("Failed listing files. ", e)
throw e
}
}
In this path I have to read only .gz files from latest month folders. File Path:
"Mybucket/Test/Client_cd/Dm1/20181010_xxxxx/*.gz"
Here, I have to pass Client_cd as parameter for particular client.
How to filter the objects and to get particular files?
If you are using EMR or your S3 configs are setup correctly, you can also use the sc.textFile("s3://bucket/Test/Client_cd/Dm1/20181010_xxxxx/*.gz")
I tried the below code to download one file successfully but unable to download all the list of files
client.getObject(
new GetObjectRequest(bucketName, "TestFolder/TestSubfolder/Psalm/P.txt"),
new File("test.txt"))
Thanks in advance
Update
I tried the below code but getting list of directories ,I want list of files rather
val listObjectsRequest = new ListObjectsRequest().
withBucketName("tivo-hadoop-dev").
withPrefix("prefix").
withDelimiter("/")
client.listObjects(listObjectsRequest).getCommonPrefixes
It's a simple thing but I struggled like any thing before concluding below mentioned answer.
I found a java code and changed to scala accordingly and it worked
val client = new AmazonS3Client(credentials)
val listObjectsRequest = new ListObjectsRequest().
withBucketName("bucket-name").
withPrefix("path/of/dir").
withDelimiter("/")
var objects = client.listObjects(listObjectsRequest);
do {
for (objectSummary <- objects.getObjectSummaries()) {
var key = objectSummary.getKey()
println(key)
var arr=key.split("/")
var file_name = arr(arr.length-1)
client.getObject(
new GetObjectRequest("bucket" , key),
new File("some/path/"+file_name))
}
objects = client.listNextBatchOfObjects(objects);
} while (objects.isTruncated())
Below code is fast and useful especially when you want to download all objects at a specific local directory. It maintains the files under the exact same s3 prefix hierarchy
val xferMgrForAws:TransferManager = TransferManagerBuilder.standard().withS3Client(awsS3Client).build();
var objectListing:ObjectListing = null;
objectListing = awsS3Client.listObjects(awsBucketName, prefix);
val summaries:java.util.List[S3ObjectSummary] = objectListing.getObjectSummaries();
if(summaries.size() > 0) {
val xfer:MultipleFileDownload = xferMgrForAws.downloadDirectory(awsBucketName, prefix, new File(localDirPath));
xfer.waitForCompletion();
println("All files downloaded successfully!")
} else {
println("No object present in the bucket !");
}