Download all the files from an S3 bucket using Scala

I tried the code below and it downloads a single file successfully, but I am unable to download the whole list of files:
client.getObject(
  new GetObjectRequest(bucketName, "TestFolder/TestSubfolder/Psalm/P.txt"),
  new File("test.txt"))
Thanks in advance
Update
I tried the code below, but it returns a list of directories; I want a list of files instead:
val listObjectsRequest = new ListObjectsRequest().
  withBucketName("tivo-hadoop-dev").
  withPrefix("prefix").
  withDelimiter("/")
client.listObjects(listObjectsRequest).getCommonPrefixes

It's a simple thing, but I struggled quite a bit before arriving at the answer below.
I found Java code for this, translated it to Scala, and it worked:
import java.io.File
import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{GetObjectRequest, ListObjectsRequest}

val client = new AmazonS3Client(credentials)
val listObjectsRequest = new ListObjectsRequest().
  withBucketName("bucket-name").
  withPrefix("path/of/dir").
  withDelimiter("/")
var objects = client.listObjects(listObjectsRequest)
var done = false
while (!done) {
  for (objectSummary <- objects.getObjectSummaries().asScala) {
    val key = objectSummary.getKey()
    println(key)
    // Use the last path segment of the key as the local file name
    val fileName = key.split("/").last
    client.getObject(
      new GetObjectRequest("bucket-name", key),
      new File("some/path/" + fileName))
  }
  // Fetch the next page of results only while the current listing is truncated
  if (objects.isTruncated()) objects = client.listNextBatchOfObjects(objects)
  else done = true
}

The code below is fast and useful, especially when you want to download all objects to a specific local directory. It keeps the files under the exact same S3 prefix hierarchy:
import java.io.File
import com.amazonaws.services.s3.model.{ObjectListing, S3ObjectSummary}
import com.amazonaws.services.s3.transfer.{MultipleFileDownload, TransferManager, TransferManagerBuilder}

val xferMgrForAws: TransferManager = TransferManagerBuilder.standard().withS3Client(awsS3Client).build()
val objectListing: ObjectListing = awsS3Client.listObjects(awsBucketName, prefix)
val summaries: java.util.List[S3ObjectSummary] = objectListing.getObjectSummaries()
if (summaries.size() > 0) {
  // downloadDirectory recreates the prefix hierarchy under the local directory
  val xfer: MultipleFileDownload = xferMgrForAws.downloadDirectory(awsBucketName, prefix, new File(localDirPath))
  xfer.waitForCompletion()
  println("All files downloaded successfully!")
} else {
  println("No objects present in the bucket!")
}

Related

How to move files from one S3 bucket directory to another directory in the same bucket? Scala/Java

I want to move all files under a directory in my S3 bucket to another directory within the same bucket, using Scala.
Here is what I have:
def copyFromInputFilesToArchive(spark: SparkSession): Unit = {
  val sourcePath = new Path("s3a://path-to-source-directory/")
  val destPath = new Path("s3a://path-to-destination-directory/")
  val fs = sourcePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.moveFromLocalFile(sourcePath, destPath)
}
I get this error:
fs.copyFromLocalFile returns Wrong FS: s3a:// expected file:///
Error explained
The error you are seeing occurs because the copyFromLocalFile method is for copying files from a local filesystem to S3. You are trying to "move" files that are already in S3.
It is important to note that directories don't really exist in Amazon S3 buckets: the folder/file hierarchy you see is just part of each object's key. All objects sit in the same big, single-level container, and the slashes in the key give the illusion of files and folders.
To "move" a file within a bucket, what you really need to do is write the object under a new key (the new path) and remove the old one; there is no in-place rename.
How to do a "move" within a bucket with Scala
To accomplish this, you copy the original object to the destination key (which carries over its data and metadata) and then delete the original. That copy-then-delete pair is what acts like a move or an update.
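For small objects the whole "move" is just a copy followed by a delete. A minimal sketch, assuming an AmazonS3 client named s3client and placeholder bucket/key names:
// Minimal "move" within a bucket: copy to the new key, then delete the old key.
// Bucket and key names below are placeholders.
val bucket = "my-bucket"
val sourceKey = "input/data.csv"
val targetKey = "archive/data.csv"
s3client.copyObject(bucket, sourceKey, bucket, targetKey)
s3client.deleteObject(bucket, sourceKey)
A single copyObject call only works for objects up to 5 GB; for anything larger (or when you want more control) a multipart copy like the one below is needed.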
Try something like this (from datahackr):
/**
 * Copy an object from one key to another in multiple parts.
 *
 * @param sourceS3Path     source S3 object key
 * @param targetS3Path     target S3 object key
 * @param fromS3BucketName source bucket name
 * @param toS3BucketName   destination bucket name
 */
@throws(classOf[Exception])
@throws(classOf[AmazonServiceException])
def copyMultipart(sourceS3Path: String, targetS3Path: String, fromS3BucketName: String, toS3BucketName: String): Unit = {
  // Initiate the multipart upload that the copied parts will belong to.
  val initRequest = new InitiateMultipartUploadRequest(toS3BucketName, targetS3Path)
  val initResponse = s3client.initiateMultipartUpload(initRequest)
  // Get the object size to track the end of the copy operation.
  val metadataResult = getS3ObjectMetadata(sourceS3Path, fromS3BucketName)
  val objectSize = metadataResult.getContentLength()
  // Copy the object using 50 MB parts.
  val partSize = 50L * 1024 * 1024
  var bytePosition = 0L
  var partNum = 1
  // Collect one CopyPartResult per part; their ETags are needed to complete the upload.
  val copyResponses = new ArrayList[CopyPartResult]()
  while (bytePosition < objectSize) {
    // The last part might be smaller than partSize, so check to make sure
    // that lastByte isn't beyond the end of the object.
    val lastByte = Math.min(bytePosition + partSize - 1, objectSize - 1)
    // Copy this part.
    val copyRequest = new CopyPartRequest()
      .withSourceBucketName(fromS3BucketName)
      .withSourceKey(sourceS3Path)
      .withDestinationBucketName(toS3BucketName)
      .withDestinationKey(targetS3Path)
      .withUploadId(initResponse.getUploadId())
      .withFirstByte(bytePosition)
      .withLastByte(lastByte)
      .withPartNumber(partNum) // part numbers start at 1
    partNum += 1
    copyResponses.add(s3client.copyPart(copyRequest))
    bytePosition += partSize
  }
  // Complete the upload request to concatenate all copied parts and make the object available.
  val completeRequest = new CompleteMultipartUploadRequest(
    toS3BucketName,
    targetS3Path,
    initResponse.getUploadId(),
    getETags(copyResponses))
  s3client.completeMultipartUpload(completeRequest)
  logger.info("Multipart copy complete.")
}
// Helper to construct the list of part ETags from the copy-part responses.
def getETags(responses: java.util.List[CopyPartResult]): ArrayList[PartETag] = {
  val etags = new ArrayList[PartETag]()
  val it = responses.iterator()
  while (it.hasNext()) {
    val response = it.next()
    etags.add(new PartETag(response.getPartNumber(), response.getETag()))
  }
  etags
}
def moveObject(sourceS3Path: String, targetS3Path: String, fromBucketName: String, toBucketName: String): Unit = {
  logger.info(s"Moving S3 file from $sourceS3Path ==> $targetS3Path")
  // Get the object size to decide between a simple copy and a multipart copy.
  val metadataResult = getS3ObjectMetadata(sourceS3Path, fromBucketName)
  val objectSize = metadataResult.getContentLength()
  if (objectSize > ALLOWED_OBJECT_SIZE) {
    logger.info("Object size is greater than 1 GB. Initiating multipart copy.")
    copyMultipart(sourceS3Path, targetS3Path, fromBucketName, toBucketName)
  } else {
    s3client.copyObject(fromBucketName, sourceS3Path, toBucketName, targetS3Path)
  }
  // Delete the source object after a successful copy.
  s3client.deleteObject(fromBucketName, sourceS3Path)
}
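A quick usage sketch (the bucket and keys are hypothetical; s3client, logger, getS3ObjectMetadata and ALLOWED_OBJECT_SIZE are assumed to be defined as in the snippet above):
// "Move" a single object to the archive prefix within the same bucket (placeholder names).
moveObject("input/customers.csv", "archive/customers.csv", "my-bucket", "my-bucket")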
You will need the AWS SDK for this.
If you are using AWS SDK version 1:
libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk-s3" % "1.12.248"
)
import com.amazonaws.services.s3.transfer.{Copy, TransferManager, TransferManagerBuilder}

val transferManager: TransferManager =
  TransferManagerBuilder.standard().build()

def copyFile(): Unit = {
  val copy: Copy =
    transferManager.copy(
      "source-bucket-name", "source-file-key",
      "destination-bucket-name", "destination-file-key"
    )
  copy.waitForCompletion()
}
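The TransferManager owns a thread pool, so once you are done copying it is worth shutting it down. A small follow-up sketch:
// Run the copy, then release the TransferManager's threads (this also shuts down the wrapped client).
copyFile()
transferManager.shutdownNow()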
If you are using AWS SDK version 2:
libraryDependencies ++= Seq(
  "software.amazon.awssdk" % "s3" % "2.17.219",
  "software.amazon.awssdk" % "s3-transfer-manager" % "2.17.219-PREVIEW"
)
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.model.CopyObjectRequest
import software.amazon.awssdk.transfer.s3.{Copy, CopyRequest, S3ClientConfiguration, S3TransferManager}

// change Region.US_WEST_2 to your required region
// (it may even work without the `.region(Region.US_WEST_2)` call)
val s3ClientConfig: S3ClientConfiguration =
  S3ClientConfiguration
    .builder()
    .region(Region.US_WEST_2)
    .build()

val s3TransferManager: S3TransferManager =
  S3TransferManager.builder().s3ClientConfiguration(s3ClientConfig).build()

def copyFile(): Unit = {
  val copyObjectRequest: CopyObjectRequest =
    CopyObjectRequest
      .builder()
      .sourceBucket("source-bucket-name")
      .sourceKey("source-file-key")
      .destinationBucket("destination-bucket-name")
      .destinationKey("destination-file-key")
      .build()
  val copyRequest: CopyRequest =
    CopyRequest
      .builder()
      .copyObjectRequest(copyObjectRequest)
      .build()
  val copy: Copy =
    s3TransferManager.copy(copyRequest)
  copy.completionFuture().get()
}
Keep in mind that you will need AWS credentials with appropriate permissions for both the source and destination objects. For this, you just need to obtain the credentials and make them available as the following environment variables:
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
export AWS_SESSION_TOKEN=your_session_token
Also, "source-file-key" and "destination-file-key" should be the full path of the file in the bucket.

Decompress (unzip/extract) util using Spark Scala

I have customer_input_data.tar.gz in HDFS, which contains data for 10 different tables in CSV file format. I need to extract this file to /my/output/path using Spark with Scala.
Please suggest how to extract the customer_input_data.tar.gz file using Spark and Scala.
gzip is not a splittable format in Hadoop. Consequently, the file is not really going to be distributed across the cluster, and you don't get any benefit of distributed compute/processing in Hadoop or Spark.
A better approach may be to uncompress the file at the OS level and then send the individual files back to Hadoop.
If you still want to decompress in Scala, you can simply resort to the Java class GZIPInputStream via:
new GZIPInputStream(new FileInputStream("your file path"))
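A minimal sketch of that approach for a local file (paths are placeholders); note that GZIPInputStream only removes the gzip layer, so the tar inside a .tar.gz still has to be unpacked separately (for example with Apache Commons Compress, as in the answer below):
import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream}
import java.util.zip.GZIPInputStream

// Decompress a local gzip file by streaming it into an output file.
def gunzip(inputPath: String, outputPath: String): Unit = {
  val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(inputPath)))
  val out = new BufferedOutputStream(new FileOutputStream(outputPath))
  try {
    val buffer = new Array[Byte](8192)
    var count = in.read(buffer)
    while (count != -1) {
      out.write(buffer, 0, count)
      count = in.read(buffer)
    }
  } finally {
    out.close()
    in.close()
  }
}

gunzip("customer_input_data.tar.gz", "customer_input_data.tar")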
I developed the code below for decompressing the files using Scala. You need to pass the input path, the output path, and the Hadoop FileSystem:
import java.io.{BufferedOutputStream, File, IOException}
import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveInputStream}
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.io.FilenameUtils
import org.apache.hadoop.fs.{FileSystem, Path}

// Buffer size used while streaming entries out of the archive (value assumed; adjust as needed).
private val BUFFER_SIZE = 8192

/* Below method is used for processing .tar.gz files */
@throws[IOException]
private def processTargz(fullpath: String, houtPath: String, fs: FileSystem): Unit = {
  val path = new Path(fullpath)
  val gzipIn = new GzipCompressorInputStream(fs.open(path))
  val tarIn = new TarArchiveInputStream(gzipIn)
  try {
    val fileName1 = FilenameUtils.getName(fullpath)
    println("Tar name entry: " + fileName1)
    // Use the archive name (without extension) as the target folder name.
    val tarNamesFolder = fileName1.substring(0, fileName1.indexOf('.'))
    println("Folder name: " + tarNamesFolder)
    var entry: TarArchiveEntry = tarIn.getNextEntry.asInstanceOf[TarArchiveEntry]
    while (entry != null) {
      // Entry name is the file (e.g. a CSV/TSV) inside the compressed tar file.
      println("ENTITY NAME: " + entry.getName)
      if (entry.isDirectory) {
        /** If the entry is a directory, create the directory. **/
        val f = new File(entry.getName)
        val created = f.mkdir
        if (!created) {
          printf("Unable to create directory '%s' during extraction of archive contents.%n", f.getAbsolutePath)
        }
      } else {
        // Stream the entry's bytes straight into a new HDFS file under the output path.
        val targetPath = houtPath + "/" + tarNamesFolder + "/" + entry.getName
        val hdfswritepath = new Path(targetPath)
        val fos = fs.create(hdfswritepath, true)
        val dest = new BufferedOutputStream(fos, BUFFER_SIZE)
        try {
          val data = new Array[Byte](BUFFER_SIZE)
          var count = tarIn.read(data, 0, BUFFER_SIZE)
          while (count != -1) {
            dest.write(data, 0, count)
            count = tarIn.read(data, 0, BUFFER_SIZE)
          }
        } finally {
          dest.close()
        }
      }
      entry = tarIn.getNextEntry.asInstanceOf[TarArchiveEntry]
    }
    println("Untar completed successfully!")
  } catch {
    case e: IOException =>
      println("Failed to extract archive: " + e.getMessage)
  } finally {
    tarIn.close()
  }
}

Get list of sub-directory names present under a directory

I am very new to Scala and trying to fetch the names of the sub-directories present at a particular path.
Directory path = "/src/test/output/"
Sub-directories present under the directory path are: 20180101, 20190302, 19990409, 20110402
I just need the sub-directory names as a List in Scala.
I have tried this
val result = new JFile(path).listFiles.map(_.getName).toList
But this is not working. Can anyone please help me?
You can start from an example like this:
def getListOfSubDirectories(dir: String): List[String] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    // Keep only directories and return their names
    d.listFiles.filter(_.isDirectory).map(_.getName).toList
  } else {
    List[String]()
  }
}
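Called against the path from the question, a quick usage sketch:
// Expected to return something like List("20180101", "20190302", "19990409", "20110402"); order is not guaranteed.
val subDirs = getListOfSubDirectories("/src/test/output/")
println(subDirs)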

Scala - Read file from S3 bucket

I want to read a specific file from an S3 bucket. My S3 bucket will contain many objects (directories and sub-directories), and I want to traverse all the objects but read only that one file.
I am trying the code below:
val s3Client: AmazonS3Client = getS3Client()
try {
  log.info("Listing objects from S3")
  var counter = 0
  val listObjectsRequest = new ListObjectsRequest()
    .withBucketName(bucketName)
    .withMaxKeys(2)
    .withPrefix("Test/" + "Client_cd" + "/" + "DM1" + "/")
    .withMarker("Test/" + "Client_cd" + "/" + "DM1" + "/")
  var objectListing: ObjectListing = null
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    import scala.collection.JavaConversions._
    for (objectSummary <- objectListing.getObjectSummaries) {
      println(objectSummary.getKey + "\t" + StringUtils.fromDate(objectSummary.getLastModified))
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())
} catch {
  case e: Exception =>
    log.error("Failed listing files. ", e)
    throw e
}
From this path I have to read only the .gz files from the latest month's folders. File path:
"Mybucket/Test/Client_cd/Dm1/20181010_xxxxx/*.gz"
Here, I have to pass Client_cd as a parameter for a particular client.
How do I filter the objects and get only the particular files?
If you are using EMR, or your S3 configs are set up correctly, you can also use sc.textFile("s3://bucket/Test/Client_cd/Dm1/20181010_xxxxx/*.gz").
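If you want to do the filtering yourself with the SDK instead of a glob path, one possible sketch is to list the keys under the client prefix, keep only the .gz objects, and pick the latest date folder. The bucket name, prefix and helper below are hypothetical; s3Client is an AmazonS3Client as in the question:
import scala.collection.JavaConverters._

// List keys under the client prefix and keep only .gz files from the latest date folder.
def latestGzKeys(s3Client: AmazonS3Client, bucketName: String, clientCd: String): Seq[String] = {
  val prefix = s"Test/$clientCd/DM1/"
  val keys = scala.collection.mutable.ListBuffer[String]()
  var listing = s3Client.listObjects(new ListObjectsRequest().withBucketName(bucketName).withPrefix(prefix))
  var done = false
  while (!done) {
    keys ++= listing.getObjectSummaries.asScala.map(_.getKey).filter(_.endsWith(".gz"))
    if (listing.isTruncated) listing = s3Client.listNextBatchOfObjects(listing)
    else done = true
  }
  if (keys.isEmpty) Seq.empty
  else {
    // Folder names like 20181010_xxxxx sort lexicographically by date, so max picks the latest one.
    val latestFolder = keys.map(_.stripPrefix(prefix).split("/").head).max
    keys.filter(_.startsWith(prefix + latestFolder)).toList
  }
}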

A better way of representing file attachments in a list (C# 3.0)

I have written
List<Attachment> lstAttachment = new List<Attachment>();

//Check if any error file is present, in which case it needs to be sent
if (new FileInfo(Path.Combine(errorFolder, errorFileName)).Exists)
{
    Attachment unprocessedFile = new Attachment(Path.Combine(errorFolder, errorFileName));
    lstAttachment.Add(unprocessedFile);
}

//Check if any processed file is present, in which case it needs to be sent
if (new FileInfo(Path.Combine(outputFolder, outputFileName)).Exists)
{
    Attachment processedFile = new Attachment(Path.Combine(outputFolder, outputFileName));
    lstAttachment.Add(processedFile);
}
It is working fine and giving the expected output.
Basically, I am adding the file to the list based on whether it is present or not.
I am looking for a more elegant solution than the one I have written.
Reason: I want to learn different ways of expressing the same program.
I am using C# 3.0.
Thanks.
Does this look better?
...
var lstAttachment = new List<Attachment>();
string errorPath = Path.Combine(errorFolder, errorFileName);
string outputPath = Path.Combine(outputFolder, outputFileName);
AddAttachmentToCollection(lstAttachment, errorPath);
AddAttachmentToCollection(lstAttachment, outputPath);
...

public static void AddAttachmentToCollection(ICollection<Attachment> collection, string filePath)
{
    if (File.Exists(filePath))
    {
        var attachment = new Attachment(filePath);
        collection.Add(attachment);
    }
}
How about a little LINQ?
var filenames = new List<string>()
{
    Path.Combine(errorFolder, errorFilename),
    Path.Combine(outputFolder, outputFilename)
};

// Note: Where/Select are evaluated lazily; append .ToList() if you need a
// materialized List<Attachment> rather than a deferred query.
var attachments = filenames.Where(f => File.Exists(f))
                           .Select(f => new Attachment(f));