Using Spark Scala in EMR to get S3 Object size (folder, files) - scala

I am trying to get the size of some S3 folders with Scala from the command line on EMR.
I have JSON data stored as GZ files in S3. I can count the number of JSON records in my files:
spark.read.json("s3://mybucket/subfolder/subsubfolder/").count
But now I need to know how many GB that data takes up.
I can find ways to get the size of individual files, but not of a whole folder at once.

Solution:
Option 1:
Get access to S3 through the Hadoop FileSystem API:
val fs = FileSystem.get(new URI(ipPath), spark.sparkContext.hadoopConfiguration)
Notes:
1) Passing new URI(ipPath) is important. Otherwise FileSystem.get connects to the default Hadoop file system instead of the S3 object store. The URI carries the s3:// scheme, which selects the S3 file system. Here ipPath is assumed to hold the input path, for example "s3://mybucket/subfolder/subsubfolder/".
2) org.apache.commons.io.FileUtils.byteCountToDisplaySize returns a human-readable size (GB, MB, and so on).
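To illustrate note 1, here is a minimal sketch of how a path string breaks down into scheme and bucket. It uses only java.net.URI. The object and method names (UriSchemeDemo, schemeOf, bucketOf) are made up for this example and are not part of Hadoop.

```scala
import java.net.URI

object UriSchemeDemo {
  // Returns the URI scheme (for example "s3" or "hdfs"), or None when the
  // path has no scheme and the default file system would apply.
  def schemeOf(path: String): Option[String] =
    Option(new URI(path).getScheme)

  // For an s3:// URI the authority component is the bucket name.
  def bucketOf(path: String): Option[String] =
    Option(new URI(path).getAuthority)
}
```

A path such as "/user/data" has no scheme, which is the situation in which FileSystem.get falls back to the configured default file system.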
import java.io.{FileNotFoundException, IOException}

/**
 * Recursively prints file sizes.
 *
 * @param filePath directory (or file) to inspect
 * @param fs file system that owns filePath
 * @return paths of all non-empty files found under filePath
 */
@throws[FileNotFoundException]
@throws[IOException]
def getDisplaysizesOfS3Files(filePath: org.apache.hadoop.fs.Path, fs: org.apache.hadoop.fs.FileSystem): scala.collection.mutable.ListBuffer[String] = {
  val fileList = new scala.collection.mutable.ListBuffer[String]
  val fileStatus = fs.listStatus(filePath)
  for (fileStat <- fileStatus) {
    println(s"file path Name : ${fileStat.getPath.toString} length is ${fileStat.getLen}")
    // Descend into subdirectories and collect their files as well.
    if (fileStat.isDirectory) fileList ++= getDisplaysizesOfS3Files(fileStat.getPath, fs)
    else if (fileStat.getLen > 0 && !fileStat.getPath.toString.isEmpty) {
      fileList += fileStat.getPath.toString
      val size = fileStat.getLen
      val display = org.apache.commons.io.FileUtils.byteCountToDisplaySize(size)
      println("Name = " + fileStat.getPath.getName)
      println("Size = " + size)
      println("Display = " + display)
    } else if (fileStat.getLen == 0) {
      println(" length zero files \n " + fileStat)
    }
  }
  fileList
}
You can modify the code to suit your requirements, for example by summing the sizes of the individual files to get the total for the folder.
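Summing the individual sizes could look roughly like the sketch below. It is deliberately independent of Hadoop: it takes the byte lengths you would collect from FileStatus.getLen and turns the total into fractional GiB. The names FolderSize, totalBytes and toGiB are invented for this example. A fractional value can be useful because, as far as I know, byteCountToDisplaySize rounds down to whole units.

```scala
object FolderSize {
  private val BytesPerGiB: Long = 1024L * 1024L * 1024L

  // Sums byte lengths, e.g. the values collected from FileStatus.getLen.
  def totalBytes(fileLengths: Seq[Long]): Long = fileLengths.sum

  // Converts a byte count to fractional GiB (1 GiB = 1024^3 bytes).
  def toGiB(bytes: Long): Double = bytes.toDouble / BytesPerGiB
}
```

For printing, something like f"${FolderSize.toGiB(total)}%.2f GiB" keeps two decimal places.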
Option 2: Simple and concise, using getContentSummary
import org.apache.spark.sql.SparkSession

implicit val spark: SparkSession = SparkSession.builder().appName("ObjectSummary").getOrCreate()
/**
 * Prints the total size of everything under the given path.
 *
 * @param path folder or file to measure
 * @param spark [[org.apache.spark.sql.SparkSession]]
 */
def getDisplaysizesOfS3Files(path: String)(implicit spark: org.apache.spark.sql.SparkSession): Unit = {
  val filePath = new org.apache.hadoop.fs.Path(path)
  // Resolve the file system from the path itself so the s3:// scheme is honoured.
  val fileSystem = filePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
  // getContentSummary aggregates the length of all files below the path.
  val size = fileSystem.getContentSummary(filePath).getLength
  val display = org.apache.commons.io.FileUtils.byteCountToDisplaySize(size)
  println("path = " + path)
  println("Size = " + size)
  println("Display = " + display)
}
Note: Both options shown above work for local, HDFS, and S3 paths alike.

Related

How to resolve Wrong FS: hdfs:/..., expected: file:/// to get the list of subdirectory with a Path (String)?

I'm trying to get the name of the latest subdirectory created in my directory "DWP".
This code runs with a local path, but on my HDFS cluster it fails with the error "Wrong FS: hdfs:/..., expected: file:///".
def lastDirectoryHour(): String = {
  val env = System.getenv("IP_HDFS")
  val currentDate = DateTimeFormatter.ofPattern("yyyy-MM-dd").format(java.time.LocalDate.now)
  val readingPath = "hdfs://".concat(env).concat(":9000/user/bronze/json/DWP/").concat(currentDate).concat("/")
  val fs = FileSystem.get(new Configuration())
  val status = fs.listStatus(new Path(readingPath))
  var listDir = ListBuffer[Long]()
  var DirName: String = ""
  for (value <- status) {
    listDir = listDir += value.getModificationTime
  }
  for (value <- status) {
    if (value.getModificationTime == listDir.max) {
      DirName = value.getPath.getName
    }
  }
  readingPath.concat(DirName)
}
When I add "addResource" as some answers suggest, I can no longer use "listStatus", which returns the name.
Do you know how I can change my code so that it still returns the latest subdirectory name?
Thank you very much in advance.
You need to set fs.defaultFS in your Configuration object to the address of the NameNode, most likely the part of your reading path up to and including port 9000.
A second option is to get the FileSystem from the path itself:
val fs = new Path(readingPath).getFileSystem(new Configuration())
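Once the FileSystem is resolved correctly, picking the latest subdirectory is a maximum over modification times. Here is a minimal sketch of that step, kept free of Hadoop so it runs on its own. DirEntry is a hypothetical stand-in for the two FileStatus fields the question uses, and LatestDir and latestName are names invented here.

```scala
object LatestDir {
  // Hypothetical stand-in for the fields read from org.apache.hadoop.fs.FileStatus.
  final case class DirEntry(name: String, modificationTime: Long)

  // Returns the name of the most recently modified entry,
  // or None when the listing is empty.
  def latestName(entries: Seq[DirEntry]): Option[String] =
    if (entries.isEmpty) None
    else Some(entries.maxBy(_.modificationTime).name)
}
```

With a real listing, the entries could be built as fs.listStatus(path).map(s => LatestDir.DirEntry(s.getPath.getName, s.getModificationTime)). Returning an Option avoids the empty-string result the original code produces for an empty directory.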

How to move files from one S3 bucket directory to another directory in same bucket? Scala/Java

I want to move all files under a directory in my s3 bucket to another directory within the same bucket, using scala.
Here is what I have:
def copyFromInputFilesToArchive(spark: SparkSession): Unit = {
  val sourcePath = new Path("s3a://path-to-source-directory/")
  val destPath = new Path("s3a://path-to-destination-directory/")
  val fs = sourcePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.moveFromLocalFile(sourcePath, destPath)
}
I get this error:
fs.copyFromLocalFile returns Wrong FS: s3a:// expected file:///
Error explained
The error occurs because copyFromLocalFile (like moveFromLocalFile) is meant for transferring files from the local file system to S3. You are trying to "move" files that are already in S3.
Note that directories don't really exist in Amazon S3 buckets. The folder/file hierarchy you see is an illusion created by the object keys: all objects sit in one flat, single-level container, and a key such as folder/file.txt only looks like a path.
To "move" an object within a bucket, you therefore need to give it a new key, which means copying it to the new key and removing the original.
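Because a "folder" is only a key prefix, working out the destination of a move comes down to string manipulation on the key. A minimal sketch, with an invented helper (S3Keys.retarget) and made-up prefixes:

```scala
object S3Keys {
  // Maps a source key to its destination key by swapping the leading prefix.
  // Returns None when the key does not live under sourcePrefix.
  def retarget(key: String, sourcePrefix: String, destPrefix: String): Option[String] =
    if (key.startsWith(sourcePrefix)) Some(destPrefix + key.stripPrefix(sourcePrefix))
    else None
}
```

The resulting key is what you would pass as the target of the copy. The source key is what gets deleted afterwards.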
How to do a "move" within a bucket with Scala
To accomplish this, copy the original object to the new key and then delete the original. Copying an object onto its own key overwrites the old version, which acts a lot like an update.
Try something like this (from datahackr):
import java.util.ArrayList
import com.amazonaws.AmazonServiceException
import com.amazonaws.services.s3.model._

// s3client, logger, getS3ObjectMetadata and ALLOWED_OBJECT_SIZE are not shown in this
// snippet; they must be provided by the enclosing class.

/**
 * Copy an object from one key to another in multiple parts.
 *
 * @param sourceS3Path S3 object key
 * @param targetS3Path S3 object key
 * @param fromS3BucketName bucket name
 * @param toS3BucketName bucket name
 */
@throws(classOf[Exception])
@throws(classOf[AmazonServiceException])
def copyMultipart(sourceS3Path: String, targetS3Path: String, fromS3BucketName: String, toS3BucketName: String): Unit = {
  // Initiate the multipart upload.
  val initRequest = new InitiateMultipartUploadRequest(toS3BucketName, targetS3Path)
  val initResponse = s3client.initiateMultipartUpload(initRequest)
  // Get the object size to track the end of the copy operation.
  val metadataResult = getS3ObjectMetadata(sourceS3Path, fromS3BucketName)
  val objectSize = metadataResult.getContentLength()
  // Copy the object using 50 MB parts.
  val partSize = 50L * 1024 * 1024
  var bytePosition = 0L
  var partNum = 1
  // Collect the result of every part copy. Their ETags are passed to the
  // request that completes the upload.
  val copyResponses = new ArrayList[CopyPartResult]()
  while (bytePosition < objectSize) {
    // The last part might be smaller than partSize, so check to make sure
    // that lastByte isn't beyond the end of the object.
    val lastByte = Math.min(bytePosition + partSize - 1, objectSize - 1)
    // Copy this part. Part numbers start at 1.
    val copyRequest = new CopyPartRequest()
      .withSourceBucketName(fromS3BucketName)
      .withSourceKey(sourceS3Path)
      .withDestinationBucketName(toS3BucketName)
      .withDestinationKey(targetS3Path)
      .withUploadId(initResponse.getUploadId())
      .withFirstByte(bytePosition)
      .withLastByte(lastByte)
      .withPartNumber(partNum)
    partNum += 1
    copyResponses.add(s3client.copyPart(copyRequest))
    bytePosition += partSize
  }
  // Complete the upload request to concatenate all uploaded parts and make the copied object available.
  val completeRequest = new CompleteMultipartUploadRequest(
    toS3BucketName,
    targetS3Path,
    initResponse.getUploadId(),
    getETags(copyResponses))
  s3client.completeMultipartUpload(completeRequest)
  logger.info("Multipart upload complete.")
}

// This is a helper function to construct a list of ETags.
def getETags(responses: java.util.List[CopyPartResult]): ArrayList[PartETag] = {
  val etags = new ArrayList[PartETag]()
  val it = responses.iterator()
  while (it.hasNext()) {
    val response = it.next()
    etags.add(new PartETag(response.getPartNumber(), response.getETag()))
  }
  etags
}

def moveObject(sourceS3Path: String, targetS3Path: String, fromBucketName: String, toBucketName: String): Unit = {
  logger.info(s"Moving S3 file from $sourceS3Path ==> $targetS3Path")
  // Get the object size to decide between a single copy and a multipart copy.
  val metadataResult = getS3ObjectMetadata(sourceS3Path, fromBucketName)
  val objectSize = metadataResult.getContentLength()
  if (objectSize > ALLOWED_OBJECT_SIZE) {
    logger.info("Object size is greater than 1GB. Initiating multipart upload.")
    copyMultipart(sourceS3Path, targetS3Path, fromBucketName, toBucketName)
  } else {
    s3client.copyObject(fromBucketName, sourceS3Path, toBucketName, targetS3Path)
  }
  // Delete source object after successful copy
  s3client.deleteObject(fromBucketName, sourceS3Path)
}
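The byte-range arithmetic in the while loop above is the part that is easiest to get wrong, so here it is isolated as a small sketch that can be checked without any AWS calls. PartRanges, Part and split are names invented for this example. The logic mirrors the loop: consecutive inclusive ranges, with the last one clipped to the end of the object.

```scala
object PartRanges {
  // One part of a multipart copy: 1-based part number and inclusive byte range.
  final case class Part(number: Int, firstByte: Long, lastByte: Long)

  // Splits an object of objectSize bytes into consecutive parts
  // of at most partSize bytes each.
  def split(objectSize: Long, partSize: Long): Vector[Part] = {
    require(partSize > 0, "partSize must be positive")
    Iterator
      .iterate(0L)(_ + partSize)      // start offsets: 0, partSize, 2 * partSize, ...
      .takeWhile(_ < objectSize)      // stop once the offset passes the end
      .zipWithIndex
      .map { case (first, i) =>
        Part(i + 1, first, math.min(first + partSize - 1, objectSize - 1))
      }
      .toVector
  }
}
```

Each Part corresponds to one CopyPartRequest (withPartNumber, withFirstByte, withLastByte).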
You will need the AWS SDK to copy the object.
If you are using AWS SDK version 1:
// build.sbt
libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk-s3" % "1.12.248"
)
import com.amazonaws.services.s3.transfer.{ Copy, TransferManager, TransferManagerBuilder }
val transferManager: TransferManager =
  TransferManagerBuilder.standard().build()
def copyFile(): Unit = {
  val copy: Copy =
    transferManager.copy(
      "source-bucket-name", "source-file-key",
      "destination-bucket-name", "destination-file-key"
    )
  // Block until the copy has finished.
  copy.waitForCompletion()
}
If you are using AWS SDK version 2:
// build.sbt
libraryDependencies ++= Seq(
  "software.amazon.awssdk" % "s3" % "2.17.219",
  "software.amazon.awssdk" % "s3-transfer-manager" % "2.17.219-PREVIEW"
)
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.model.CopyObjectRequest
import software.amazon.awssdk.transfer.s3.{Copy, CopyRequest, S3ClientConfiguration, S3TransferManager}
// change Region.US_WEST_2 to your required region
// or it might even work without the whole `.region(Region.US_WEST_2)` thing
val s3ClientConfig: S3ClientConfiguration =
  S3ClientConfiguration
    .builder()
    .region(Region.US_WEST_2)
    .build()
val s3TransferManager: S3TransferManager =
  S3TransferManager.builder().s3ClientConfiguration(s3ClientConfig).build()
def copyFile(): Unit = {
  val copyObjectRequest: CopyObjectRequest =
    CopyObjectRequest
      .builder()
      .sourceBucket("source-bucket-name")
      .sourceKey("source-file-key")
      .destinationBucket("destination-bucket-name")
      .destinationKey("destination-file-key")
      .build()
  val copyRequest: CopyRequest =
    CopyRequest
      .builder()
      .copyObjectRequest(copyObjectRequest)
      .build()
  val copy: Copy =
    s3TransferManager.copy(copyRequest)
  // Block until the copy has finished.
  copy.completionFuture().get()
}
Keep in mind that you need AWS credentials with appropriate permissions for both the source and the destination object. One way to provide them is through the following environment variables:
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
export AWS_SESSION_TOKEN=your_session_token
Also, "source-file-key" and "destination-file-key" should be the full key (path) of the file within the bucket.

decompress (unzip/extract) util using spark scala

I have customer_input_data.tar.gz in HDFS, which contains data for 10 different tables in CSV format. I need to extract this file to /my/output/path using Spark Scala.
Please suggest how to extract customer_input_data.tar.gz using Spark Scala.
gzip is not a splittable format in Hadoop. Consequently, the file cannot be split across the cluster, and you get no benefit from distributed processing in Hadoop or Spark.
A better approach may be to uncompress the file on the OS and then upload the extracted files to Hadoop individually.
If you still want to uncompress in Scala, you can use the Java class GZIPInputStream:
new GZIPInputStream(new FileInputStream("your file path"))
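A minimal sketch of the GZIPInputStream approach, using only java.util.zip and in-memory byte arrays so it runs anywhere. GzipDemo and its methods are names invented for this example. Note that this only removes the gzip layer: for a .tar.gz the result is still a tar archive whose entries have to be unpacked separately.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object GzipDemo {
  // Compresses bytes with gzip (used here only to produce sample input).
  def gzip(data: Array[Byte]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(bos)
    try out.write(data) finally out.close()
    bos.toByteArray
  }

  // Reads a gzip stream to the end and returns the decompressed bytes.
  def gunzip(compressed: Array[Byte]): Array[Byte] = {
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    val out = new ByteArrayOutputStream()
    val buffer = new Array[Byte](4096)
    try {
      // An assignment is not an expression with a value in Scala,
      // so read once before the loop and again at the end of each pass.
      var count = in.read(buffer)
      while (count != -1) {
        out.write(buffer, 0, count)
        count = in.read(buffer)
      }
    } finally in.close()
    out.toByteArray
  }
}
```

For a file on disk, the ByteArrayInputStream would be replaced by the FileInputStream shown above.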
I developed the code below to decompress the files using Scala. You need to pass the input path, the output path, and the Hadoop FileSystem.
import java.io.{BufferedOutputStream, IOException}
import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveInputStream}
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.io.FilenameUtils
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumed buffer size; the original snippet does not define BUFFER_SIZE.
val BUFFER_SIZE = 4096

/* The method below extracts a .tar.gz archive into houtPath on the given file system. */
@throws[IOException]
private def processTargz(fullpath: String, houtPath: String, fs: FileSystem): Unit = {
  val fileName = FilenameUtils.getName(fullpath)
  println("Tar Name entry :" + fileName)
  // Output folder named after the archive, e.g. customer_input_data.
  val tarNamesFolder = fileName.substring(0, fileName.indexOf('.'))
  println("Folder Name : " + tarNamesFolder)
  val tarIn = new TarArchiveInputStream(new GzipCompressorInputStream(fs.open(new Path(fullpath))))
  try {
    // In Scala an assignment evaluates to Unit, so the Java idiom
    // `while ((entry = next()) != null)` never terminates. Read explicitly instead.
    var entry = tarIn.getNextEntry.asInstanceOf[TarArchiveEntry]
    while (entry != null) {
      // Entry name is the name of a file or directory inside the archive.
      println("ENTITY NAME : " + entry.getName)
      val target = new Path(houtPath + "/" + tarNamesFolder + "/" + entry.getName)
      if (entry.isDirectory) {
        // Create the directory on the target file system rather than on the local disk.
        if (!fs.mkdirs(target)) {
          println(s"Unable to create directory '$target' during extraction of archive contents.")
        }
      } else {
        val dest = new BufferedOutputStream(fs.create(target, true), BUFFER_SIZE)
        try {
          val data = new Array[Byte](BUFFER_SIZE)
          var count = tarIn.read(data, 0, BUFFER_SIZE)
          while (count != -1) {
            dest.write(data, 0, count)
            count = tarIn.read(data, 0, BUFFER_SIZE)
          }
        } finally dest.close()
      }
      entry = tarIn.getNextEntry.asInstanceOf[TarArchiveEntry]
    }
    println("Untar completed successfully!")
  } finally {
    tarIn.close()
  }
}

scala- Read file from S3 bucket

I want to read a specific file from an S3 bucket. My bucket contains many objects (directories and subdirectories). I want to traverse all the objects and read only that file.
I am trying the code below:
val s3Client: AmazonS3Client = getS3Client()
try {
  log.info("Listing objects from S3")
  var counter = 0
  val listObjectsRequest = new ListObjectsRequest()
    .withBucketName(bucketName)
    .withMaxKeys(2)
    .withPrefix("Test/" + "Client_cd" + "/" + "DM1" + "/")
    .withMarker("Test/" + "Client_cd" + "/" + "DM1" + "/")
  var objectListing: ObjectListing = null
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    import scala.collection.JavaConversions._
    for (objectSummary <- objectListing.getObjectSummaries) {
      println(objectSummary.getKey + "\t" + StringUtils.fromDate(objectSummary.getLastModified))
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())
} catch {
  case e: Exception => {
    log.error("Failed listing files. ", e)
    throw e
  }
}
In this path I have to read only the .gz files from the latest month's folders. File path:
"Mybucket/Test/Client_cd/Dm1/20181010_xxxxx/*.gz"
Here, I have to pass Client_cd as a parameter for a particular client.
How can I filter the objects and get particular files?
If you are using EMR, or your S3 configuration is set up correctly, you can also use sc.textFile("s3://bucket/Test/Client_cd/Dm1/20181010_xxxxx/*.gz").
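If you list the keys yourself, the filtering step can be done on plain strings. The sketch below assumes keys shaped like prefix + yyyyMMdd_suffix + "/" + file name, as in the path shown in the question, so that the greatest folder name is also the latest one. LatestGzKeys and select are names invented for this example.

```scala
object LatestGzKeys {
  // Returns the .gz keys under prefix that live in the folder with the
  // greatest name (for yyyyMMdd-prefixed folders, the latest one).
  def select(keys: Seq[String], prefix: String): Seq[String] = {
    val gzKeys = keys.filter(k => k.startsWith(prefix) && k.endsWith(".gz"))
    // Folder = first path segment after the prefix.
    def folderOf(key: String): String = key.stripPrefix(prefix).takeWhile(_ != '/')
    if (gzKeys.isEmpty) Seq.empty
    else {
      val latest = gzKeys.map(folderOf).max
      gzKeys.filter(k => folderOf(k) == latest)
    }
  }
}
```

The keys would come from objectSummary.getKey in the listing loop, and the prefix can be built from the Client_cd parameter.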

How to map one column with other columns in an avro file?

I'm using Spark 2.1.1 and Scala 2.11.8.
This question is an extension of one of my earlier questions:
How to identify null fields in a csv file?
The change is that rather than reading the data from a CSV file, I'm now reading it from an Avro file. These are the fields of the Avro records I'm reading:
var ttime: Long = 0;
var eTime: Long = 0;
var tids: String = "";
var tlevel: Integer = 0;
var tboot: Long = 0;
var rNo: Integer = 0;
var varType: String = "";
var uids: List[TRUEntry] = Nil;
I'm parsing the Avro file in a separate class.
I have to map the tids column to every single one of the uids, in the same way as in the accepted answer of the question linked above, except this time from an Avro file rather than a well-formatted CSV file. How can I do this?
This is the code I'm trying to do it with:
val avroRow = spark.read.avro(inputString).rdd
val avroParsed = avroRow
  .map(x => new TRParser(x))
  .map((obj: TRParser) => ((obj.tids, obj.uId), 1))
  .reduceByKey(_ + _)
  .saveAsTextFile(outputString)
After obj.tids, each of the uids has to be mapped individually to give the same final output as in the accepted answer of the question linked above.
This is how I'm parsing all the uids in the Avro parsing class:
this.uids = Nil
row.getAs[Seq[Row]]("uids")
  .foreach((objRow: Row) =>
    this.uids ::= (new TRUEntry(objRow))
  )
this.uids
  .foreach((obj: TRUEntry) => {
    uInfo += obj.uId + " , " + obj.initM.toString() + " , "
  })
P.S.: I apologise if the question seems dumb, but this is my first encounter with Avro files.
It can be done by applying the same loop over this.uids in the main code:
val avroParsed = avroRow
  .map(x => new TRParser(x))
  .map((obj: TRParser) => {
    val tId = obj.source.trim
    // Build one "tid,uid" token per uid, separated by ":".
    var retVal: String = ""
    obj.uids
      .foreach((entry: TRUEntry) => {
        retVal += tId + "," + entry.uId.trim + ":"
      })
    // Drop the trailing ":".
    retVal.dropRight(1)
  })
// Split the tokens again so that every "tid,uid" pair becomes its own record.
val flattened = avroParsed
  .flatMap(x => x.split(":"))
  .map(y => ((y), 1))
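The same pairing can also be expressed without building and re-splitting a string. Below is a sketch on plain Scala collections. Record is a hypothetical simplification of the parsed Avro row (one tid plus the uId values of its uids), and TidUidPairs and countPairs are names invented here.

```scala
object TidUidPairs {
  // Hypothetical record shape: one tid plus the uIds taken from its uids entries.
  final case class Record(tid: String, uIds: Seq[String])

  // Emits one (tid, uId) pair per uid and counts how often each pair occurs.
  def countPairs(records: Seq[Record]): Map[(String, String), Int] =
    records
      .flatMap(r => r.uIds.map(u => (r.tid.trim, u.trim)))
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }
}
```

On an RDD the equivalent shape would be a flatMap to the pairs, followed by map(pair => (pair, 1)) and reduceByKey(_ + _), as in the question's own code.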