How to move files from one S3 bucket directory to another directory in the same bucket? (Scala/Java)

I want to move all files under a directory in my s3 bucket to another directory within the same bucket, using scala.
Here is what I have:
def copyFromInputFilesToArchive(spark: SparkSession) : Unit = {
val sourcePath = new Path("s3a://path-to-source-directory/")
val destPath = new Path("s3a:/path-to-destination-directory/")
val fs = sourcePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.moveFromLocalFile(sourcePath,destPath)
}
I get this error:
fs.copyFromLocalFile fails with: Wrong FS: s3a://..., expected: file:///

Error explained
The error occurs because copyFromLocalFile (and moveFromLocalFile) is meant for copying files from the local filesystem up to S3, but here both the source and the destination are already in S3.
It is important to note that directories don't really exist in Amazon S3 buckets: the folder/file hierarchy you see is just part of each object's key. All objects sit in a single flat container, and the slashes in the key merely give the illusion of files and folders.
To "move" an object within a bucket, you therefore copy it to a new key and delete the original; S3 has no in-place rename.
How to do a "move" within a bucket with Scala
To accomplish this, you copy the original object to its new key and then delete the object at the old key. For objects larger than 5 GB the copy has to be done in parts (multipart copy).
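If you would rather stay with the Hadoop FileSystem API from your original snippet, S3A's rename performs the copy and delete for you. Here is a minimal sketch of the function rewritten that way; the bucket and prefix names are placeholders.

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

def copyFromInputFilesToArchive(spark: SparkSession): Unit = {
  // Both paths must use the s3a:// scheme; for a same-bucket move they share the bucket
  val sourcePath = new Path("s3a://my-bucket/path-to-source-directory/")
  val destPath   = new Path("s3a://my-bucket/path-to-destination-directory/")

  // Resolve the FileSystem from the s3a path, not from the local default filesystem
  val fs = sourcePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.mkdirs(destPath)

  // rename() on the S3A connector copies each object to the new key and deletes the original
  fs.listStatus(sourcePath).foreach { status =>
    val target = new Path(destPath, status.getPath.getName)
    if (!fs.rename(status.getPath, target)) {
      throw new RuntimeException(s"Failed to move ${status.getPath} to $target")
    }
  }
}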
If you'd rather use the AWS SDK directly, try something like this (from datahackr):
/**
* Copy an object from one key to another in multiple parts
*
* @param sourceS3Path     source S3 object key
* @param targetS3Path     target S3 object key
* @param fromS3BucketName source bucket name
* @param toS3BucketName   target bucket name
*/
@throws(classOf[Exception])
@throws(classOf[AmazonServiceException])
def copyMultipart(sourceS3Path: String, targetS3Path: String, fromS3BucketName: String, toS3BucketName: String) {
// The ETag of each copied part comes back in a CopyPartResult; the full list of
// ETags is passed to the CompleteMultipartUploadRequest at the end (see getETags below).
// Initiate the multipart upload.
val initRequest = new InitiateMultipartUploadRequest(toS3BucketName, targetS3Path);
val initResponse = s3client.initiateMultipartUpload(initRequest);
// Get the object size to track the end of the copy operation.
var metadataResult = getS3ObjectMetadata(sourceS3Path, fromS3BucketName);
var objectSize = metadataResult.getContentLength();
// Copy the object using 50 MB parts.
val partSize = (50 * 1024 * 1024) * 1L;
var bytePosition = 0L;
var partNum = 1;
var copyResponses = new ArrayList[CopyPartResult]();
while (bytePosition < objectSize) {
// The last part might be smaller than partSize, so check to make sure
// that lastByte isn't beyond the end of the object.
val lastByte = Math.min(bytePosition + partSize - 1, objectSize - 1);
// Copy this part.
val copyRequest = new CopyPartRequest()
.withSourceBucketName(fromS3BucketName)
.withSourceKey(sourceS3Path)
.withDestinationBucketName(toS3BucketName)
.withDestinationKey(targetS3Path)
.withUploadId(initResponse.getUploadId())
.withFirstByte(bytePosition)
.withLastByte(lastByte)
.withPartNumber(partNum);
partNum += 1;
copyResponses.add(s3client.copyPart(copyRequest));
bytePosition += partSize;
}
// Complete the upload request to concatenate all uploaded parts and make the copied object available.
val completeRequest = new CompleteMultipartUploadRequest(
toS3BucketName,
targetS3Path,
initResponse.getUploadId(),
getETags(copyResponses));
s3client.completeMultipartUpload(completeRequest);
logger.info("Multipart upload complete.");
}
// This is a helper function to construct a list of ETags.
def getETags(responses: java.util.List[CopyPartResult]): ArrayList[PartETag] = {
var etags = new ArrayList[PartETag]();
val it = responses.iterator();
while (it.hasNext()) {
val response = it.next();
etags.add(new PartETag(response.getPartNumber(), response.getETag()));
}
return etags;
}
def moveObject(sourceS3Path: String, targetS3Path: String, fromBucketName: String, toBucketName: String) {
logger.info(s"Moving S3 frile from $sourceS3Path ==> $targetS3Path")
// Get the object size to track the end of the copy operation.
var metadataResult = getS3ObjectMetadata(sourceS3Path, fromBucketName);
var objectSize = metadataResult.getContentLength();
if (objectSize > ALLOWED_OBJECT_SIZE) {
logger.info("Object size is greater than 1GB. Initiating multipart upload.");
copyMultipart(sourceS3Path, targetS3Path, fromBucketName, toBucketName);
} else {
s3client.copyObject(fromBucketName, sourceS3Path, toBucketName, targetS3Path);
}
// Delete source object after successful copy
s3client.deleteObject(fromBucketName, sourceS3Path);
}

You will need the AWS SDK for this.
If you are using AWS SDK version 1:
libraryDependencies ++= Seq(
"com.amazonaws" % "aws-java-sdk-s3" % "1.12.248"
)
import com.amazonaws.services.s3.transfer.{ Copy, TransferManager, TransferManagerBuilder }
val transferManager: TransferManager =
TransferManagerBuilder.standard().build()
def copyFile(): Unit = {
val copy: Copy =
transferManager.copy(
"source-bucket-name", "source-file-key",
"destination-bucket-name", "destination-file-key"
)
copy.waitForCompletion()
}
If you are using AWS SDK version 2:
libraryDependencies ++= Seq(
"software.amazon.awssdk" % "s3" % "2.17.219",
"software.amazon.awssdk" % "s3-transfer-manager" % "2.17.219-PREVIEW"
)
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.model.CopyObjectRequest
import software.amazon.awssdk.transfer.s3.{Copy, CopyRequest, S3ClientConfiguration, S3TransferManager}
// Change Region.US_WEST_2 to your region. The SDK can also resolve the region
// from the environment, in which case .region(...) can be omitted.
val s3ClientConfig: S3ClientConfiguration =
S3ClientConfiguration
.builder()
.region(Region.US_WEST_2)
.build()
val s3TransferManager: S3TransferManager =
S3TransferManager.builder().s3ClientConfiguration(s3ClientConfig).build()
def copyFile(): Unit = {
val copyObjectRequest: CopyObjectRequest =
CopyObjectRequest
.builder()
.sourceBucket("source-bucket-name")
.sourceKey("source-file-key")
.destinationBucket("destination-bucket-name")
.destinationKey("destination-file-key")
.build()
val copyRequest: CopyRequest =
CopyRequest
.builder()
.copyObjectRequest(copyObjectRequest)
.build()
val copy: Copy =
s3TransferManager.copy(copyRequest)
copy.completionFuture().get()
}
Keep in mind that you will need AWS credentials with the appropriate permissions for both the source and the destination objects. For this, you just need to obtain the credentials and expose them as the following environment variables.
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
export AWS_SESSION_TOKEN=your_session_token
Also, "source-file-key" and "destination-file-key" should be the full path of the file in the bucket.

Related

How to resolve "Wrong FS: hdfs:/..., expected: file:///" to get the list of subdirectories with a Path (String)?

I'm trying to get the name of the latest subdirectory created in my directory "DWP".
I managed to execute this code with a local path, but running it on my HDFS cluster I get the error "Wrong FS: hdfs:/..., expected: file:///"
def lastDirectoryHour(): String = {
val env = System.getenv("IP_HDFS")
val currentDate = DateTimeFormatter.ofPattern("YYYY-MM-dd").format(java.time.LocalDate.now)
val readingPath = "hdfs://".concat(env).concat(":9000/user/bronze/json/DWP/").concat(currentDate).concat("/")
val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(readingPath))
var listDir = ListBuffer[Long]()
var DirName: String = ""
for(value<-status)
{
listDir = listDir += value.getModificationTime
}
for(value<-status)
{
if(value.getModificationTime == listDir.max) {
DirName = value.getPath.getName
}
}
readingPath.concat(DirName)
}
When I add "addRessource" as some answers say, I'm unable to use "listStatus" which return the name.
Do you know how can I change my code in order to keep it returning me the latest subdirectory name ?
Thank you very much in advance.
You need to set fs.defaultFS in your Configuration object to the address of the namenode, most likely the part of your reading path up to port 9000 (hdfs://<IP_HDFS>:9000).
A second option is to get the FileSystem from the path itself:
val fs = new Path(readingPath).getFileSystem(new Configuration())
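Putting both options together, here is a sketch of the function with the FileSystem resolved from the path itself. It keeps your IP_HDFS variable and directory layout, and it assumes at least one subdirectory exists for today's date.

import java.time.format.DateTimeFormatter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def lastDirectoryHour(): String = {
  val env = System.getenv("IP_HDFS")
  // Use lowercase yyyy: uppercase YYYY is the week-based year and can misbehave around New Year
  val currentDate = DateTimeFormatter.ofPattern("yyyy-MM-dd").format(java.time.LocalDate.now)
  val readingPath = s"hdfs://$env:9000/user/bronze/json/DWP/$currentDate/"

  // Option 1 would be: conf.set("fs.defaultFS", s"hdfs://$env:9000") and FileSystem.get(conf)
  // Option 2 (used here): derive the FileSystem from the fully qualified path
  val fs = new Path(readingPath).getFileSystem(new Configuration())

  // Pick the subdirectory with the most recent modification time
  val latest = fs.listStatus(new Path(readingPath)).filter(_.isDirectory).maxBy(_.getModificationTime)
  readingPath.concat(latest.getPath.getName)
}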

Apache Spark Data Generator Function on Databricks Not working

I am trying to execute the Data Generator function provided by Microsoft to test streaming data to Event Hubs.
Unfortunately, I keep on getting the error
Processing failure: No such file or directory
When I try and execute the function:
%scala
DummyDataGenerator.start(15)
Can someone take a look at the code and help decipher why I'm getting the error:
class DummyDataGenerator:
streamDirectory = "/FileStore/tables/flight"
None # suppress output
I'm not sure how the above cell relates to the DummyDataGenerator object defined below.
%scala
import scala.util.Random
import java.io._
import java.time._
// Notebook #2 has to set this to 8, we are setting
// it to 200 to "restore" the default behavior.
spark.conf.set("spark.sql.shuffle.partitions", 200)
// Make the username available to all other languages.
// "WARNING: use of the "current" username is unpredictable
// when multiple users are collaborating and should be replaced
// with the notebook ID instead.
val username = com.databricks.logging.AttributionContext.current.tags(com.databricks.logging.BaseTagDefinitions.TAG_USER);
spark.conf.set("com.databricks.training.username", username)
object DummyDataGenerator extends Runnable {
var runner : Thread = null;
val className = getClass().getName()
val streamDirectory = s"dbfs:/tmp/$username/new-flights"
val airlines = Array( ("American", 0.17), ("Delta", 0.12), ("Frontier", 0.14), ("Hawaiian", 0.13), ("JetBlue", 0.15), ("United", 0.11), ("Southwest", 0.18) )
val reasons = Array("Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft")
val rand = new Random(System.currentTimeMillis())
var maxDuration = 3 * 60 * 1000 // default to three minutes
def clean() {
System.out.println("Removing old files for dummy data generator.")
dbutils.fs.rm(streamDirectory, true)
if (dbutils.fs.mkdirs(streamDirectory) == false) {
throw new RuntimeException("Unable to create temp directory.")
}
}
def run() {
val date = LocalDate.now()
val start = System.currentTimeMillis()
while (System.currentTimeMillis() - start < maxDuration) {
try {
val dir = s"/dbfs/tmp/$username/new-flights"
val tempFile = File.createTempFile("flights-", "", new File(dir)).getAbsolutePath()+".csv"
val writer = new PrintWriter(tempFile)
for (airline <- airlines) {
val flightNumber = rand.nextInt(1000)+1000
val deptTime = rand.nextInt(10)+10
val departureTime = LocalDateTime.now().plusHours(-deptTime)
val (name, odds) = airline
val reason = Random.shuffle(reasons.toList).head
val test = rand.nextDouble()
val delay = if (test < odds)
rand.nextInt(60)+(30*odds)
else rand.nextInt(10)-5
println(s"- Flight #$flightNumber by $name at $departureTime delayed $delay minutes due to $reason")
writer.println(s""" "$flightNumber","$departureTime","$delay","$reason","$name" """.trim)
}
writer.close()
// wait a couple of seconds
//Thread.sleep(rand.nextInt(5000))
} catch {
case e: Exception => {
printf("* Processing failure: %s%n", e.getMessage())
return;
}
}
}
println("No more flights!")
}
def start(minutes:Int = 5) {
maxDuration = minutes * 60 * 1000
if (runner != null) {
println("Stopping dummy data generator.")
runner.interrupt();
runner.join();
}
println(s"Running dummy data generator for $minutes minutes.")
runner = new Thread(this);
runner.run();
}
def stop() {
start(0)
}
}
DummyDataGenerator.clean()
displayHTML("Imported streaming logic...") // suppress output
You should be able to use the Databricks Labs Data Generator on the Databricks community edition. I'm providing the instructions below.
Running Databricks Labs Data Generator on the community edition
The Databricks Labs Data Generator is a PySpark library, so the code that generates the data needs to be Python. But you can create a view over the generated data and consume it from Scala if that's your preferred language.
You can install the framework on the Databricks community edition by creating a notebook with the cell
%pip install git+https://github.com/databrickslabs/dbldatagen
Once it's installed, you can use the library to define a data generation spec and, by calling build, produce a Spark DataFrame from it.
The following example, which should be placed in a separate notebook cell, generates batch data similar to the data set you are trying to produce.
Note: it generates 10 million rows to illustrate the ability to create larger data sets; the library can generate data sets much larger than that.
%python
import dbldatagen as dg

num_rows = 10 * 1000000 # number of rows to generate
num_partitions = 8 # number of Spark dataframe partitions
delay_reasons = ["Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft"]
# will have implied column `id` for ordinal of row
flightdata_defn = (dg.DataGenerator(spark, name="flight_delay_data", rows=num_rows, partitions=num_partitions)
.withColumn("flightNumber", "int", minValue=1000, uniqueValues=10000, random=True)
.withColumn("airline", "string", minValue=1, maxValue=500, prefix="airline", random=True, distribution="normal")
.withColumn("original_departure", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
.withColumn("delay_minutes", "int", minValue=20, maxValue=600, distribution=dg.distributions.Gamma(1.0, 2.0))
.withColumn("delayed_departure", "timestamp", expr="cast(original_departure as bigint) + (delay_minutes * 60) ", baseColumn=["original_departure", "delay_minutes"])
.withColumn("reason", "string", values=delay_reasons, random=True)
)
df_flight_data = flightdata_defn.build()
display(df_flight_data)
You can find information on how to generate streaming data in the online documentation at https://databrickslabs.github.io/dbldatagen/public_docs/using_streaming_data.html
You can create a named temporary view over the data so that you can access it from SQL or Scala, using one of two methods:
1: use createOrReplaceTempView
df_flight_data.createOrReplaceTempView("delays")
2: pass options to build. In this case the name passed to the DataGenerator initializer ("flight_delay_data" above) becomes the name of the view, i.e.
df_flight_data = flightdata_defn.build(withTempView=True)
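Either way, once the view exists you can consume it from a Scala cell. A minimal sketch (with method 1 the view is named "delays"; with method 2 it takes the name passed to the DataGenerator, here "flight_delay_data"):

%scala
// Query the temporary view registered by the Python cell above
val delays = spark.sql("SELECT * FROM delays")
display(delays.limit(10))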
This code will not work on the community edition because of this line:
val dir = s"/dbfs/tmp/$username/new-flights"
as there is no DBFS fuse mount on the Databricks community edition (it's supported only on full Databricks). It's potentially possible to make it work by:
changing that directory to a local directory, such as /tmp, and
adding code (after writer.close()) to list the flights-* files in that local directory and move them into streamDirectory with dbutils.fs.mv, as sketched below.
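A rough sketch of that change inside run(); the /tmp directory name is just an example and this is untested on the community edition, so treat it as a starting point.

// Write to a plain local directory instead of the /dbfs fuse mount
val dir = "/tmp/new-flights"
new java.io.File(dir).mkdirs()
val tempFile = java.io.File.createTempFile("flights-", ".csv", new java.io.File(dir))
val writer = new java.io.PrintWriter(tempFile)
// ... write the flight rows exactly as before ...
writer.close()

// After closing the writer, move the local file into the DBFS stream directory
dbutils.fs.mv(s"file:${tempFile.getAbsolutePath}", s"$streamDirectory/${tempFile.getName}")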

Using Spark Scala in EMR to get S3 Object size (folder, files)

I am trying to get the folder size for some S3 folders with scala from my command line EMR.
I have JSON data stored as GZ files in S3. I find I can count the number of JSON records within my files:
spark.read.json("s3://mybucket/subfolder/subsubfolder/").count
But now I need to know how much GB that data accounts for.
I am finding options to get the size for distinct files, but not for a whole folder all up.
Solution:
Option 1:
Get access to S3 through the Hadoop FileSystem API:
val fs = FileSystem.get(new URI(ipPath), spark.sparkContext.hadoopConfiguration)
Note:
1) The new URI is important; without it, FileSystem.get would connect to the default Hadoop filesystem instead of the S3 object store. Passing the URI supplies the s3:// scheme explicitly.
2) org.apache.commons.io.FileUtils.byteCountToDisplaySize converts a byte count into a human-readable size (GB, MB, etc.).
/**
* Recursively print file sizes under a path
*
* @param filePath root path to scan
* @param fs       the FileSystem resolved for that path
* @return the list of file paths found
*/
@throws[FileNotFoundException]
@throws[IOException]
def getDisplaysizesOfS3Files(filePath: org.apache.hadoop.fs.Path, fs: org.apache.hadoop.fs.FileSystem): scala.collection.mutable.ListBuffer[String] = {
val fileList = new scala.collection.mutable.ListBuffer[String]
val fileStatus = fs.listStatus(filePath)
for (fileStat <- fileStatus) {
println(s"file path Name : ${fileStat.getPath.toString} length is ${fileStat.getLen}")
if (fileStat.isDirectory) fileList ++= (getDisplaysizesOfS3Files(fileStat.getPath, fs))
else if (fileStat.getLen > 0 && !fileStat.getPath.toString.isEmpty) {
println("fileStat.getPath.toString" + fileStat.getPath.toString)
fileList += fileStat.getPath.toString
val size = fileStat.getLen
val display = org.apache.commons.io.FileUtils.byteCountToDisplaySize(size)
println(" length zero files \n " + fileStat)
println("Name = " + fileStat.getPath().getName());
println("Size = " + size);
println("Display = " + display);
} else if (fileStat.getLen == 0) {
println(" length zero files \n " + fileStat)
}
}
fileList
}
Based on your requirements you can modify the code, for example to sum the sizes of all the individual files (see the sketch below).
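A minimal sketch of that variation, reusing the same FileSystem handle; the helper name totalSizeInBytes is just for illustration.

// Hypothetical helper: recursively sum the lengths of all files under a path
def totalSizeInBytes(path: org.apache.hadoop.fs.Path, fs: org.apache.hadoop.fs.FileSystem): Long =
  fs.listStatus(path).map { status =>
    if (status.isDirectory) totalSizeInBytes(status.getPath, fs) else status.getLen
  }.sum

// e.g. println(org.apache.commons.io.FileUtils.byteCountToDisplaySize(totalSizeInBytes(filePath, fs)))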
Option 2: simple and concise, using getContentSummary
implicit val spark = SparkSession.builder().appName("ObjectSummary").getOrCreate()
/**
* getDisplaysizesOfS3Files
* @param path
* @param spark [[org.apache.spark.sql.SparkSession]]
*/
def getDisplaysizesOfS3Files(path: String)( implicit spark: org.apache.spark.sql.SparkSession): Unit = {
val filePath = new org.apache.hadoop.fs.Path(path)
val fileSystem = filePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
val size = fileSystem.getContentSummary(filePath).getLength
val display = org.apache.commons.io.FileUtils.byteCountToDisplaySize(size)
println("path = " + path);
println("Size = " + size);
println("Display = " + display);
}
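Example usage, assuming the implicit SparkSession defined above and the bucket path from the question:

// Prints the raw byte count and a human-readable size for everything under the prefix
getDisplaysizesOfS3Files("s3://mybucket/subfolder/subsubfolder/")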
Note: either option shown above works for local, HDFS, or S3 paths.

scala- Read file from S3 bucket

I want to read a specific file from an S3 bucket. The bucket contains many objects (directories and subdirectories), and I want to traverse all of them but read only that one file.
I am trying below code:
val s3Client: AmazonS3Client = getS3Client()
try {
log.info("Listing objects from S3")
var counter = 0
val listObjectsRequest = new ListObjectsRequest()
.withBucketName(bucketName)
.withMaxKeys(2)
.withPrefix("Test/"+"Client_cd" + "/"+"DM1"+"/")
.withMarker("Test/"+"Client_cd" + "/"+"DM1"+"/")
var objectListing: ObjectListing = null
do {
objectListing = s3Client.listObjects(listObjectsRequest)
import scala.collection.JavaConversions._
for (objectSummary <- objectListing.getObjectSummaries) {
println( objectSummary.getKey + "\t" + StringUtils.fromDate(objectSummary.getLastModified))
}
listObjectsRequest.setMarker(objectListing.getNextMarker())
}
while (objectListing.isTruncated())
}
catch {
case e: Exception => {
log.error("Failed listing files. ", e)
throw e
}
}
From this path I have to read only the .gz files from the latest month's folders. File path:
"Mybucket/Test/Client_cd/Dm1/20181010_xxxxx/*.gz"
Here I have to pass Client_cd as a parameter for a particular client.
How can I filter the objects to get only those particular files?
If you are running on EMR, or your S3 configuration is set up correctly, you can also simply use sc.textFile("s3://bucket/Test/Client_cd/Dm1/20181010_xxxxx/*.gz")
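If you need to stick with the SDK listing and select only the .gz objects under the most recent dated folder, a sketch along these lines may help. It reuses your Test/<Client_cd>/DM1/ prefix; the clientCd parameter and the assumption that the dated folders (20181010_xxxxx) sort lexicographically are mine.

import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ListObjectsRequest
import scala.collection.JavaConverters._

// Hypothetical helper: list the .gz keys under the latest dated folder for a client
def latestGzKeys(s3Client: AmazonS3Client, bucketName: String, clientCd: String): Seq[String] = {
  val prefix = s"Test/$clientCd/DM1/"
  val request = new ListObjectsRequest().withBucketName(bucketName).withPrefix(prefix)

  // Collect every key under the prefix, following truncated listings
  val keys = scala.collection.mutable.ListBuffer[String]()
  var listing = s3Client.listObjects(request)
  keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
  while (listing.isTruncated) {
    listing = s3Client.listNextBatchOfObjects(listing)
    keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
  }

  // Dated folders such as 20181010_xxxxx sort lexicographically, so the largest is the latest
  val latestFolder = keys.map(_.stripPrefix(prefix).takeWhile(_ != '/')).filter(_.nonEmpty).max
  keys.filter(k => k.startsWith(prefix + latestFolder + "/") && k.endsWith(".gz")).toList
}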

Download all the files from an S3 bucket using Scala

I tried the code below, which downloads one file successfully, but I am unable to download the whole list of files.
client.getObject(
new GetObjectRequest(bucketName, "TestFolder/TestSubfolder/Psalm/P.txt"),
new File("test.txt"))
Thanks in advance
Update
I tried the code below, but I am getting a list of directories; I want a list of files instead.
val listObjectsRequest = new ListObjectsRequest().
withBucketName("tivo-hadoop-dev").
withPrefix("prefix").
withDelimiter("/")
client.listObjects(listObjectsRequest).getCommonPrefixes
It's a simple thing, but I struggled quite a bit before arriving at the answer below.
I found some Java code, converted it to Scala, and it worked.
import scala.collection.JavaConversions._
val client = new AmazonS3Client(credentials)
val listObjectsRequest = new ListObjectsRequest().
withBucketName("bucket-name").
withPrefix("path/of/dir").
withDelimiter("/")
var objects = client.listObjects(listObjectsRequest);
do {
for (objectSummary <- objects.getObjectSummaries()) {
var key = objectSummary.getKey()
println(key)
var arr=key.split("/")
var file_name = arr(arr.length-1)
client.getObject(
new GetObjectRequest("bucket" , key),
new File("some/path/"+file_name))
}
objects = client.listNextBatchOfObjects(objects);
} while (objects.isTruncated())
The code below is fast and useful, especially when you want to download all objects into a specific local directory. It keeps the files under the exact same S3 prefix hierarchy.
val xferMgrForAws:TransferManager = TransferManagerBuilder.standard().withS3Client(awsS3Client).build();
var objectListing:ObjectListing = null;
objectListing = awsS3Client.listObjects(awsBucketName, prefix);
val summaries:java.util.List[S3ObjectSummary] = objectListing.getObjectSummaries();
if(summaries.size() > 0) {
val xfer:MultipleFileDownload = xferMgrForAws.downloadDirectory(awsBucketName, prefix, new File(localDirPath));
xfer.waitForCompletion();
println("All files downloaded successfully!")
} else {
println("No object present in the bucket !");
}