I could not figure out a way to list all files in a directory and its subdirectories.
Here is the code I'm using. It lists the files in a specific directory, but fails if there is a subdirectory inside it:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = new Configuration()
val fs = FileSystem.get(new java.net.URI("hdfs://servername/"), conf)
val status = fs.listStatus(new Path("path/to/folder/"))
status.foreach { x => println(x.getPath.toString()) }
The code above lists all the files directly inside the directory, but I need the listing to be recursive.
You could recurse whenever you encounter a directory:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val hdfs = FileSystem.get(new Configuration())
def listFileNames(hdfsPath: String): List[String] = {
  hdfs
    .listStatus(new Path(hdfsPath))
    .flatMap { status =>
      // If it's a file, keep its full path:
      if (status.isFile)
        List(hdfsPath + "/" + status.getPath.getName)
      // If it's a directory, recurse into it:
      else
        listFileNames(hdfsPath + "/" + status.getPath.getName)
    }
    .toList
    .sorted
}
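As an alternative to hand-written recursion, Hadoop's FileSystem also offers listFiles(path, recursive), which returns a RemoteIterator over the files only (directories are skipped). Below is a minimal sketch of that approach, assuming Hadoop 2.x or later; the helper name listFilesRecursively is made up for illustration and the example path is a placeholder.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

// Hypothetical helper: collects the full paths of all files under `root`,
// descending into subdirectories.
def listFilesRecursively(fs: FileSystem, root: String): List[String] = {
  // The second argument asks Hadoop to recurse; only files are returned.
  val it = fs.listFiles(new Path(root), true)
  val paths = ListBuffer.empty[String]
  while (it.hasNext) {
    paths += it.next().getPath.toString
  }
  paths.toList.sorted
}

// Example call (placeholder path):
// listFilesRecursively(FileSystem.get(new Configuration()), "path/to/folder").foreach(println)
```

The iterator is consumed lazily, so for very large trees you may prefer to process each path inside the loop instead of building a list.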
I am facing the issues below with Hadoop DistCp; any suggestion or help is highly appreciated.
I am trying to copy data from Google Cloud Platform to Amazon S3.
1) When we have multiple files to copy from source to destination (this works fine)
val sourcefile: String = "gs://XXXX_-abc_account2621/abc_account2621_click_20170616*.csv.gz" // multiple files to copy (note the * in the file name)
Output: S3://S3bucketname/xxx/xxxx/clientid=account2621/date=2017-08-18/
Files in the above path:
abc_account2621_click_2017061612_20170617_005852_572560033.csv.gz
abc_account2621_click_2017061616_20170617_045654_572608350.csv.gz
abc_account2621_click_2017061622_20170617_103107_572684922.csv.gz
abc_account2621_click_2017061623_20170617_120235_572705834.csv.gz
2) When we have only one file to copy from source to destination (this is the issue)
val sourcefile: String = "gs://XXXX_-abc_account2621/abc_account2621_activity_20170618_20170619_034412_573362513.csv.gz"
Output: S3://S3bucketname/xxx/xxxx/clientid=account2621/
Files in the above path:
date=2017-08-18 (the directory is replaced by a file that holds the copied file's content and has no file extension)
Code:
def main(args: Array[String]): Unit = {
  val Array(environment, customer, typesoftables, clientid, filedate) = args.take(5)
  val S3Path: String = customer + "/" + typesoftables + "/" + "clientid=" + clientid + "/" + "date=" + filedate + "/"
  val sourcefile: String = "gs://XXXX_-abc_account2621//abc_account2621_activity_20170618_20170619_034412_573362513.csv.gz"
  val destination: String = "s3n://S3bucketname/" + S3Path
  println(sourcefile)
  println(destination)
  val filepaths: Array[String] = Array(sourcefile, destination)
  executeDistCp(filepaths)
}
def executeDistCp(filepaths: Array[String]): Unit = {
  val conf: Configuration = new Configuration()
  // Google Cloud Storage connector settings
  conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
  conf.set("google.cloud.auth.service.account.enable", "true")
  conf.set("fs.gs.project.id", "XXXX-XXXX")
  conf.set("google.cloud.auth.service.account.json.keyfile", "/tmp/XXXXX.json")
  // S3 credentials
  conf.set("fs.s3n.awsAccessKeyId", "XXXXXXXXXXXX")
  conf.set("fs.s3n.awsSecretAccessKey", "XXXXXXXXXXXXXX")
  conf.set("mapreduce.application.classpath", "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,/usr/lib/hadoop-lzo/lib/*,/usr/share/aws/emr/emrfs/conf,/usr/share/aws/emr/emrfs/lib/*,/usr/share/aws/emr/emrfs/auxlib/*,/usr/share/aws/emr/lib/*,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar,/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar,/usr/share/aws/emr/cloudwatch-sink/lib/*,/usr/share/aws/aws-java-sdk/*,/tmp/gcs-connector-latest-hadoop2.jar")
  conf.set("HADOOP_CLASSPATH", "$HADOOP_CLASSPATH:/tmp/gcs-connector-latest-hadoop2.jar")
  // Remove any previous output before copying
  val outputDir: Path = new Path(filepaths(1))
  outputDir.getFileSystem(conf).delete(outputDir, true)
  val distCp: DistCp = new DistCp(conf, null)
  ToolRunner.run(distCp, filepaths)
}
}
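Adding the code below after the delete and before running DistCp fixed the issue. It creates the destination directory first, so the single source file is copied into that directory instead of being written under the directory's name.
Code: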
val makeDir: Path = new Path(filepaths(1))
makeDir.getFileSystem(conf).mkdirs(makeDir)
I have a requirement where I have to pull a Parquet file from S3, process it, convert it into another object format, and store it in S3 in both JSON and Parquet format.
I have made the changes below for this problem, but the Spark job takes too much time when the collect statement is called. Please let me know how this can be optimized. Below is the complete code, which reads a Parquet file from S3, processes it, and stores the result in S3. I am very new to Spark and big data technology.
package com.expedia.www.lambda
import java.io._
import com.amazonaws.ClientConfiguration
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing}
import com.expedia.hendrix.lambda.HotelInfosite
import com.expedia.www.hendrix.signals.definition.local.HotelInfoSignal
import com.expedia.www.options.HendrixHistoricalOfflineProcessorOptions
import com.expedia.www.user.interaction.v1.UserInteraction
import com.expedia.www.util._
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.commons.lang.exception.ExceptionUtils
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.slf4j.{Logger, LoggerFactory}
import scala.collection.JavaConverters._
import scala.io.Source
import scala.util.Random
object GenericLambdaMapper{
private def currentTimeMillis: Long = System.currentTimeMillis
/** The below Generic mapper object is built for creating json similar to the Signal pushed by hendrix */
def populateSignalRecord( genericRecord: GenericRecord, uisMessage: UserInteraction, signalType: String): HotelInfoSignal ={
val objectMapper:ObjectMapper = new ObjectMapper
objectMapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
objectMapper.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true)
val hotelInfoObject = objectMapper.readValue( genericRecord.toString, classOf[com.expedia.www.hendrix.signals.definition.local.HotelInfosite])
val userKey = UserKeyUtil.createUserKey(uisMessage)
val hotelInfoSignal:HotelInfoSignal = new HotelInfoSignal
hotelInfoSignal.setSignalType(signalType)
hotelInfoSignal.setData(hotelInfoObject)
hotelInfoSignal.setUserKey(userKey)
hotelInfoSignal.setGeneratedAtTimestamp(currentTimeMillis)
return hotelInfoSignal
}
}
class GenericLambdaMapper extends Serializable{
var LOGGER:Logger = LoggerFactory.getLogger("GenericLambdaMapper")
var bw : BufferedWriter = null
var fw :FileWriter = null
val random: Random = new Random
var counter: Int = 0
var fileName: String= null
val s3Util = new S3Util
/** Object Mapper function for serializing and deserializing objects**/
def objectMapper : ObjectMapper= {
val mapper = new ObjectMapper
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
mapper.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true)
}
def process(sparkContext: SparkContext, options: HendrixHistoricalOfflineProcessorOptions ): Unit = { //ObjectListing
try {
LOGGER.info("Start Date : "+options.startDate)
LOGGER.info("END Date : "+options.endDate)
val listOfFilePath: List[String] = DateTimeUtil.getDateRangeStrFromInput(options.startDate, options.endDate)
/**Looping through each folder based on start and end date **/
listOfFilePath.map(
path => applyLambdaForGivenPathAndPushToS3Signal( sparkContext, path, options )
)
}catch {
case ex: Exception => {
LOGGER.error( "Exception in downloading data :" + options.rawBucketName + options.rawS3UploadRootFolder + options.startDate)
LOGGER.error("Stack Trace :"+ExceptionUtils.getFullStackTrace(ex))
}
}
}
// TODO: Currently the Lambda is hardcoded only to HotelInfoSite to be made generic
def prepareUisObjectAndApplyLambda(uisMessage: UserInteraction, options: HendrixHistoricalOfflineProcessorOptions): List[GenericRecord] = {
try {
val schemaDefinition = Source.fromInputStream(getClass.getResourceAsStream("/"+options.avroSchemaName)).getLines.mkString("\n")
val schemaHotelInfo = new Schema.Parser().parse(schemaDefinition)
HotelInfosite.apply(uisMessage, schemaHotelInfo).toList
}catch {
case ex: Exception => LOGGER.error("Exception while preparing UIS Object" + ex.toString)
List.empty
}
}
/** Below method is used to extract userInteraction Data from Raw file **/
private def constructUisObject(uisMessageRaw: String): UserInteraction = objectMapper.readValue( uisMessageRaw, classOf[UserInteraction])
/** Below function contains logic to apply the lambda for the given range of dates and push to signals folder in S3 **/
def applyLambdaForGivenPathAndPushToS3Signal(sparkContext: SparkContext, dateFolderPath: String, options: HendrixHistoricalOfflineProcessorOptions ): Unit ={
var awsS3Client: AmazonS3Client = null;
try {
if ("sandbox".equals(options.environment)) {
val clientConfiguration = new ClientConfiguration()
.withConnectionTimeout(options.awsConnectionTimeout)
.withSocketTimeout(options.awsSocketTimeout)
.withTcpKeepAlive(true)
awsS3Client = S3Client.getAWSConnection(options.awsS3AccessKey, options.awsS3SecretKey, clientConfiguration)
} else {
awsS3Client = S3Client.getAWSConnection
}
/** Validate if destination path has any gzip file if so then just skip that date and process next record **/
LOGGER.info("Validating if the destination folder path is empty: " + dateFolderPath)
var objectListing: ObjectListing = null
var listObjectsRequest: ListObjectsRequest = new ListObjectsRequest().withBucketName(options.destinationBucketName).withPrefix(options.s3SignalRootFolder + options.signalType + "/" + dateFolderPath.toString)
objectListing = awsS3Client.listObjects(listObjectsRequest)
if (objectListing.getObjectSummaries.size > 0) {
LOGGER.warn("Record already present at the below location, so skipping the processing of record for the folder path :" + dateFolderPath.toString)
LOGGER.warn("s3n://" + options.destinationBucketName + "/" + options.s3SignalRootFolder + options.signalType + "/" + dateFolderPath.toString)
return
}
LOGGER.info("Validated the destination folder path :" + dateFolderPath + " and found it to be empty ")
/** End of validation **/
/*Selecting all the files under the source path and iterating*/
counter = 0
listObjectsRequest = new ListObjectsRequest().withBucketName(options.rawBucketName).withPrefix(options.rawS3UploadRootFolder + dateFolderPath.toString)
objectListing = awsS3Client.listObjects(listObjectsRequest)
val rddListOfParquetFileNames = objectListing.getObjectSummaries.asScala.map(_.getKey).toList
rddListOfParquetFileNames.flatMap{key => { processIndividualParquetFileAndUploadToS3(sparkContext, awsS3Client, options, key, dateFolderPath)
"COMPLETED Processing=>"+key;
}}
}catch{
case ex: Exception =>
LOGGER.error("Exception occured while processing records for the path " + dateFolderPath)
LOGGER.error("Exception in Apply Lambda method Message :" + ex.getMessage + "\n Stack Trace :" + ExceptionUtils.getFullStackTrace(ex))
}finally {
awsS3Client.shutdown
LOGGER.info("JOB Complete ")
}
}
def processIndividualParquetFileAndUploadToS3(sparkContext: SparkContext, awsS3Client: AmazonS3Client, options: HendrixHistoricalOfflineProcessorOptions, parquetFilePath:String, dateFolderPath:String ):Unit ={
try{
LOGGER.info("Currently Processing the Parquet file: "+parquetFilePath)
LOGGER.info("Starting to reading Parquet File Start Time: "+System.currentTimeMillis)
val dataSetString: RDD[String] = ParquetHelper.readParquetData(sparkContext, options, parquetFilePath)
LOGGER.info("Data Set returned from Parquet file Successful Time: "+System.currentTimeMillis)
val lambdaSignalRecords: Array[HotelInfoSignal] = dataSetString.map(x => constructUisObject(x))
.filter(_ != null)
.map(userInteraction => processIndividualRecords(userInteraction, options))
.filter(_ != null)
.collect
LOGGER.info("Successfully Generated "+lambdaSignalRecords.length+" Signal Records")
if(lambdaSignalRecords.length > 0) {
//Write to Parquet File :Start
val parquetFileName: String = getFileNameForParquet(dateFolderPath, counter)
val parquetWriter = ParquetHelper.newParquetWriter(HotelInfoSignal.getClassSchema, dateFolderPath, parquetFileName, options)
LOGGER.info("Initialized Parquet Writer")
lambdaSignalRecords.map(signalRecord => parquetWriter.write(signalRecord))
LOGGER.info("Completed writing the data in Parquet format")
parquetWriter.close
//Parquet Write Complete
/*val avroSignalString = lambdaSignalRecords.mkString("\n")
val sparkSession = SparkSession.builder.getOrCreate
uploadProceessedDataToS3(sparkSession, awsS3Client, dateFolderPath, avroSignalString, options)
*/ }
}catch {case ex:Exception =>
LOGGER.error("Skipping processing of record :"+parquetFilePath+" because of Exception: "+ExceptionUtils.getFullStackTrace(ex))
}
LOGGER.info("Completed data processing for file :" + options.rawBucketName + options.rawS3UploadRootFolder + parquetFilePath)
}
def uploadProceessedDataToS3(sparkSession:SparkSession, awsS3Client: AmazonS3Client, filePath: String, genericSignalRecords: String, options: HendrixHistoricalOfflineProcessorOptions):Unit ={
var jsonFile: File = null
var gzFile: File = null
try {
//Building the file name based on the folder accessed
fileName = getFileName (filePath, counter)
jsonFile = IOUtil.createS3JsonFile (genericSignalRecords, fileName)
gzFile = IOUtil.gzipIt (jsonFile)
s3Util.uploadToS3(awsS3Client, options.destinationBucketName, options.s3SignalRootFolder + options.signalType + "/" + filePath, gzFile)
counter += 1 //Increment counter
} catch {
case ex: RuntimeException => LOGGER.error ("Exception while uploading file to path :" + options.s3SignalRootFolder + options.signalType + "/" + filePath + "/" + fileName)
LOGGER.error ("Stack Trace for S3 Upload :" + ExceptionUtils.getFullStackTrace(ex))
} finally {
//Cleaning the temp file created after upload to s3, we can create a temp dir if required.
jsonFile.delete
gzFile.delete
}
}
def processIndividualRecords(userInteraction: UserInteraction, options: HendrixHistoricalOfflineProcessorOptions): HotelInfoSignal ={
try {
//Applying lambda for the individual UserInteraction
val list: List[GenericRecord] = prepareUisObjectAndApplyLambda (userInteraction, options)
if (list.nonEmpty) return GenericLambdaMapper.populateSignalRecord (list.head, userInteraction, options.signalType)
} catch { case ex: Exception => LOGGER.error ("Error while creating signal record from UserInteraction for Singal Type :"+ options.signalType +" For Interaction "+userInteraction.toString)
LOGGER.error ("Stack Trace while processIndividualRecords :" + ExceptionUtils.getFullStackTrace(ex))}
null
}
/** This method is used to prepare the exact file name which has processed date and the no of files counter **/
def getFileName(filePath : String, counter : Int): String = {
filePath.replace("/","-")+"_"+counter+"_"+random.alphanumeric.take(5).mkString+".json"
}
/** This method is used to prepare the exact file name which has processed date and the no of files counter **/
def getFileNameForParquet(filePath : String, counter : Int): String = {
filePath.replace("/","-")+"_"+counter+"_"+random.alphanumeric.take(5).mkString+".parquet"
}
}
package com.expedia.www.util
import com.expedia.www.options.HendrixHistoricalOfflineProcessorOptions
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetWriter, AvroSchemaConverter}
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.hadoop.{ParquetFileWriter, ParquetWriter}
import org.apache.parquet.schema.MessageType
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.slf4j.{Logger, LoggerFactory}
/**
* Created by prasubra on 2/17/17.
*/
object ParquetHelper {
val LOGGER:Logger = LoggerFactory.getLogger("ParquetHelper")
def newParquetWriter(signalSchema: Schema, folderPath:String, fileName:String, options:HendrixHistoricalOfflineProcessorOptions): ParquetWriter[GenericRecord] = {
val blockSize: Int = 256 * 1024 * 1024
val pageSize: Int = 64 * 1024
val compressionCodec = if (options.parquetCompressionToGzip) CompressionCodecName.GZIP else CompressionCodecName.UNCOMPRESSED
val path: Path = new Path("s3n://" + options.destinationBucketName + "/" + options.parquetSignalFolderName + options.signalType + "/" + folderPath + "/" + fileName);
val parquetSchema: MessageType = new AvroSchemaConverter().convert(signalSchema);
// var writeSupport:WriteSupport = new AvroWriteSupport(parquetSchema, signalSchema);
//(path, writeSupport, compressionCodec, blockSize, pageSize)
//var parquetWriter:ParquetWriter[GenericRecord] = new ParquetWriter(path, writeSupport, compressionCodec, blockSize, pageSize);
if ("sandbox".equals(options.environment)) {
val hadoopConf = new Configuration
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", options.awsS3AccessKey)
hadoopConf.set("fs.s3n.awsSecretAccessKey", options.awsS3SecretKey)
hadoopConf.set("fs.s3n.maxRetries", options.awsFileReaderRetry)
AvroParquetWriter.builder(path)
.withSchema(signalSchema)
.withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
.withCompressionCodec(compressionCodec)
.withConf(hadoopConf)
.build()
} else {
AvroParquetWriter.builder(path)
.withSchema(signalSchema)
.withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
.withCompressionCodec(compressionCodec)
.withPageSize(pageSize)
.build()
}
}
def readParquetData(sc: SparkContext, options: HendrixHistoricalOfflineProcessorOptions, filePath: String): RDD[String] = {
val filePathOfParquet = "s3n://"+options.rawBucketName+"/"+ filePath
LOGGER.info("Reading Parquet file from path :"+filePathOfParquet)
val sparkSession = SparkSession.builder.getOrCreate
val dataFrame = sparkSession.sqlContext.read.parquet(filePathOfParquet)
//dataFrame.printSchema()
dataFrame.toJSON.rdd
}
}
First, you really should improve your question with a minimal code example. It's really hard to see what's going on in your code...
collect retrieves all elements of your RDD into a single array on the driver. If your RDD is large, this will of course take a lot of time (and may cause an OutOfMemoryError if the content does not fit into the driver's main memory).
You can write the content of a DataFrame/Dataset directly as Parquet. This will surely be much faster and more scalable.
Use s3a:// URLs. s3n:// has a bug which really kills ORC/Parquet performance, and it has been superseded by s3a now.
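To illustrate the two suggestions above, here is a minimal sketch that keeps the data distributed and lets Spark write both output formats, rather than collecting records to the driver and writing them with a hand-built ParquetWriter. The helper name convert, the transform parameter, and the bucket names in the example call are hypothetical; the per-record mapping from the question would have to be expressed as a DataFrame/Dataset transformation for this to apply.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical helper: reads Parquet, applies a transformation on the executors,
// and writes the result as both Parquet and JSON without calling collect.
def convert(spark: SparkSession, inputPath: String, outputRoot: String)
           (transform: DataFrame => DataFrame): Unit = {
  // Cache because the same result is written twice.
  val result = transform(spark.read.parquet(inputPath)).cache()
  result.write.mode(SaveMode.Overwrite).parquet(outputRoot + "/parquet")
  result.write.mode(SaveMode.Overwrite).json(outputRoot + "/json")
  result.unpersist()
}

// Example call with placeholder s3a:// paths and an identity transformation:
// convert(SparkSession.builder.getOrCreate(),
//         "s3a://raw-bucket/path/file.parquet",
//         "s3a://signal-bucket/out")(identity)
```

Each executor writes its own part files, so the driver never has to hold the full data set in memory.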
I am trying to parse a JSON log file with Spark and append the result to an external Hive table.
def main(args: Array[String]) {
val local: String = args(0)
val sparkConf: SparkConf = new SparkConf().setAppName("Proxy")
val ctx = new SparkContext(sparkConf)
val sqlContext = new HiveContext(ctx)
import sqlContext.implicits._
val df = sqlContext.read.schema(schema).json("group-8_instance-48_2016-10- 19-16-54.log")
df.registerTempTable("proxy_par_tmp")
val results_transaction =sqlContext.sql("SELECT type,time,path,protocol,protocolSrc, duration, status, serviceContexts,customMsgAtts,correlationId, legs FROM proxy_par_tmp where type='transaction'")
val two = saveFile(results_transaction).toDF()
two.write.mode(SaveMode.Append).saveAsTable("lzwpp_ushare.proxy_par")
ctx.stop
}
def saveFile(file:DataFrame): RDD[String]={
val five =file.map { t =>( t(0).toString + "~" +t(1).toString + "~" +t(2).toString + "~" +t(3).toString + "~" +t(4).toString
+ "~" +t(5).toString + "~" +t(6).toString + "~" +t(7).toString + "~" +t(8).toString + "~" +t(9).toString
+ "~"+ t(10).toString) }
five
}
Also tried with
two.saveAsTable("lzwpp_ushare.proxy_par",SaveMode.Append)
The problem is that the same thing works fine with a Hive managed table, but not with an external table.
Spark-shell as well as spark-submit throws the same error.
It writes to HDFS
Thanks.
I want to save a KMeans model on HDFS. To do this I use the save method and build the output directory name at runtime (see code). I get an exception saying that the metadata already exists. How can I solve this problem?
val lastUrbanKMeansModel = KMeansModel.load(spark, defaultPath + "UrbanRoad/201692918")
val newUrbanKMeansObject = new KMeans()
.setK(7)
.setMaxIterations(20)
.setInitialModel(lastUrbanKMeansModel)
val vectorUrbanRoad = typeStreet.filter(k => k._2 == 1).map(_._1)
if (!vectorUrbanRoad.isEmpty()) {
val newUrbanModel = newUrbanKMeansObject.run(vectorUrbanRoad)
newUrbanModel.save(spark, defaultPath + "UrbanRoad/" +
Calendar.getInstance().get(Calendar.YEAR).toString
+ (Calendar.getInstance().get(Calendar.MONTH) + 1).toString +
Calendar.getInstance().get(Calendar.DAY_OF_MONTH).toString +
Calendar.getInstance().get(Calendar.HOUR_OF_DAY).toString)
}
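One possible cause, stated here as an assumption and not a confirmed diagnosis: save refuses to write into a directory that already exists, and the name built above only changes once per hour, so two runs within the same hour target the same path. The fields are also not zero-padded, so different dates can produce the same string. A minimal sketch of a zero-padded, second-resolution path follows; the helper name modelPath is made up for illustration.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Zero-padded timestamp down to the second, e.g. "20160929180503".
val suffixFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

// Hypothetical helper: builds the directory the model is saved to.
def modelPath(basePath: String, now: LocalDateTime): String =
  basePath + "UrbanRoad/" + now.format(suffixFormat)

// Example: newUrbanModel.save(spark, modelPath(defaultPath, LocalDateTime.now()))
```

Note that code loading a previous model by name, such as the "UrbanRoad/201692918" path above, would need to follow the same naming scheme.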