Hadoop DistCp does not create the destination folder when we pass a single file - Scala

I am facing the issue below with Hadoop DistCp; any suggestion or help is highly appreciated.
I am trying to copy data from Google Cloud Platform to Amazon S3.
1) When we have multiple files to copy from source to destination (this works fine):
val sourcefile : String = "gs://XXXX_-abc_account2621/abc_account2621_click_20170616*.csv.gz" [multiple files to copy; note the * in the file name]
Output: S3://S3bucketname/xxx/xxxx/clientid=account2621/date=2017-08-18/
Files in the above path:
abc_account2621_click_2017061612_20170617_005852_572560033.csv.gz
abc_account2621_click_2017061616_20170617_045654_572608350.csv.gz
abc_account2621_click_2017061622_20170617_103107_572684922.csv.gz
abc_account2621_click_2017061623_20170617_120235_572705834.csv.gz
2) When we have only one file to copy from source to destination (issue):
val sourcefile : String = "gs://XXXX_-abc_account2621/abc_account2621_activity_20170618_20170619_034412_573362513.csv.gz"
Output: S3://S3bucketname/xxx/xxxx/clientid=account2621/
Files in the above path:
date=2017-08-18 (the directory is replaced with the file content, and it has no file extension)
Code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.tools.DistCp
import org.apache.hadoop.util.ToolRunner

object DistCpJob { // enclosing object name assumed; not shown in the original snippet

  def main(args: Array[String]): Unit = {
    val Array(environment, customer, typesoftables, clientid, filedate) = args.take(5)
    val S3Path: String = customer + "/" + typesoftables + "/" + "clientid=" + clientid + "/" + "date=" + filedate + "/"
    val sourcefile: String = "gs://XXXX_-abc_account2621//abc_account2621_activity_20170618_20170619_034412_573362513.csv.gz"
    val destination: String = "s3n://S3bucketname/" + S3Path
    println(sourcefile)
    println(destination)
    val filepaths: Array[String] = Array(sourcefile, destination)
    executeDistCp(filepaths)
  }

  def executeDistCp(filepaths: Array[String]) {
    val conf: Configuration = new Configuration()
    // GCS connector configuration
    conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    conf.set("google.cloud.auth.service.account.enable", "true")
    conf.set("fs.gs.project.id", "XXXX-XXXX")
    conf.set("google.cloud.auth.service.account.json.keyfile", "/tmp/XXXXX.json")
    // S3 credentials
    conf.set("fs.s3n.awsAccessKeyId", "XXXXXXXXXXXX")
    conf.set("fs.s3n.awsSecretAccessKey", "XXXXXXXXXXXXXX")
    conf.set("mapreduce.application.classpath", "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,/usr/lib/hadoop-lzo/lib/*,/usr/share/aws/emr/emrfs/conf,/usr/share/aws/emr/emrfs/lib/*,/usr/share/aws/emr/emrfs/auxlib/*,/usr/share/aws/emr/lib/*,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar,/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar,/usr/share/aws/emr/cloudwatch-sink/lib/*,/usr/share/aws/aws-java-sdk/*,/tmp/gcs-connector-latest-hadoop2.jar")
    conf.set("HADOOP_CLASSPATH", "$HADOOP_CLASSPATH:/tmp/gcs-connector-latest-hadoop2.jar")
    // Delete the destination before copying
    val outputDir: Path = new Path(filepaths(1))
    outputDir.getFileSystem(conf).delete(outputDir, true)
    val distCp: DistCp = new DistCp(conf, null)
    ToolRunner.run(distCp, filepaths)
  }
}

Adding the code below fixed the above issue.
Code:
val makeDir: Path = new Path(filepaths(1))
makeDir.getFileSystem(conf).mkdirs(makeDir)
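For context, a minimal sketch of how this fix can slot into executeDistCp above (the exact placement after the delete is an assumption; the configuration lines are elided):

def executeDistCp(filepaths: Array[String]) {
  val conf: Configuration = new Configuration()
  // ... same GCS / S3 / classpath settings as in the code above ...

  val outputDir: Path = new Path(filepaths(1))
  outputDir.getFileSystem(conf).delete(outputDir, true)

  // Fix: ensure the destination directory exists before DistCp runs, so a
  // single-file copy lands inside it instead of overwriting the path itself.
  val makeDir: Path = new Path(filepaths(1))
  makeDir.getFileSystem(conf).mkdirs(makeDir)

  val distCp: DistCp = new DistCp(conf, null)
  ToolRunner.run(distCp, filepaths)
}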

Related

Show the file timestamps and select the latest file from a directory using Scala in Azure Databricks

I want to select the latest file from a directory and show the timestamps of all the files using Scala code in Azure Databricks.
Can you please help me with this?
I have tried this using the code below, which is working fine.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

var basePath = "<Full_Path>"
var files = Array[String]()
var maxTS: Long = 0
var TimeFile = collection.mutable.Map[Long, String]()

val conf = new Configuration()
val hdfs = FileSystem.get(conf)
val f = new Path(basePath)

// Recursively iterate over all files under basePath
val messageFile = hdfs.listFiles(f, true)
while (messageFile.hasNext()) {
  val message = messageFile.next()
  if (message.getPath.toString().endsWith("tsv")) {
    files = files :+ message.getPath.toString()
    // Map each file's modification time to its path
    TimeFile += (message.getModificationTime -> message.getPath.toString())
  }
}
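To actually pick the latest file from the TimeFile map built above, a minimal sketch (assumes at least one matching file was found):

// The latest file is the one with the largest modification timestamp
val latestFile = TimeFile(TimeFile.keys.max)
println(s"Latest file: $latestFile")

// Show all files with their timestamps, newest first
TimeFile.toSeq.sortBy(-_._1).foreach { case (ts, path) => println(s"$ts  $path") }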

Fail to write to S3

I am trying to write a file to Amazon S3.
import java.io.ByteArrayInputStream
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectMetadata

val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
val amazonS3Client = new AmazonS3Client(creds)
val filePath = "/service2/2019/06/30/21"
val fileContent = "{\"key\":\"value\"}"
val meta = new ObjectMetadata()
amazonS3Client.putObject(bucketName, filePath, new ByteArrayInputStream(fileContent.getBytes), meta)
The program finishes with no error, but no file is written into the bucket.
The key argument seems to have a typo. Try it without the initial forward slash:
val filePath = "service2/2019/06/30/21"
instead of
val filePath = "/service2/2019/06/30/21"

How to list the paths of files inside an HDFS directory and its subdirectories?

I could not figure out a way to list all files in a directory and its subdirectories.
Here is the code that I'm using, which lists files in a specific directory but fails to list files if there is a subdirectory inside:
val conf = new Configuration()
val fs = FileSystem.get(new java.net.URI("hdfs://servername/"), conf)
val status = fs.listStatus(new Path("path/to/folder/"))
status.foreach { x => println(x.getPath.toString()) }
The above code lists all the files inside the directory, but I need it to be recursive.
You could recurse whenever you discover a new folder:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs = FileSystem.get(new Configuration())

def listFileNames(hdfsPath: String): List[String] = {
  hdfs
    .listStatus(new Path(hdfsPath))
    .flatMap { status =>
      // If it's a file, keep its full path:
      if (status.isFile)
        List(hdfsPath + "/" + status.getPath.getName)
      // If it's a directory, recurse into it:
      else
        listFileNames(hdfsPath + "/" + status.getPath.getName)
    }
    .toList
    .sorted
}
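A quick usage example (the path here is just a placeholder):

// Prints every file found under the directory, including files in subdirectories
listFileNames("/path/to/folder").foreach(println)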

How to get an element from a List based on its name?

I have a path that contains subpaths; each subpath contains files:
path="/data"
I implemented two functions to get the CSV files from each subpath:
import java.io.File

def getListOfSubDirectories(directoryName: String): Array[String] = {
  (new File(directoryName))
    .listFiles
    .filter(_.isDirectory)
    .map(_.getName)
}

def getListOfFiles(dir: String, extensions: List[String]): List[File] = {
  val d = new File(dir)
  d.listFiles.filter(_.isFile).toList.filter { file =>
    extensions.exists(file.getName.endsWith(_))
  }
}
Each subpath contains 5 CSV files: contextfile.csv, datafile.csv, datesfiles.csv, errors.csv, testfiles. My problem is that I'll work with each file in a separate DataFrame, so how can I get the name of the file for the right DataFrame? For example, I want to get the name of the file that concerns context (i.e. contextfile.csv). I worked like this, but on each iteration the logic and the ordering in the List change:
val dir = getListOfSubDirectories(path)
for (sup_path <- dir) {
  val Files = getListOfFiles(path + "//" + sup_path, List(".csv"))
  val filename_context = Files(1).toString
  val filename_datavalue = Files(0).toString
  val filename_error = Files(3).toString
  val filename_testresult = Files(4).toString
}
Any help is appreciated, thanks.
I solved it with just a simple filter:
val filename_context = Files.filter(f =>f.getName.contains("context")).last.toString
val filename_datavalue = Files.filter(f =>f.getName.contains("data")).last.toString
val filename_error = Files.filter(f =>f.getName.contains("error")).last.toString
val filename_testresult = Files.filter(f =>f.getName.contains("test")).last.toString
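A slightly more defensive variant of the same idea, in case one of the files is missing (the keyword list is an assumption based on the file names in the question):

// Look up each file by a keyword in its name; None if no file matches
val byKeyword: Map[String, Option[File]] =
  List("context", "data", "error", "test")
    .map(keyword => keyword -> Files.find(_.getName.contains(keyword)))
    .toMap

val filename_context = byKeyword("context").map(_.toString)
  .getOrElse(sys.error("no context file found in this subpath"))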

java.lang.IllegalArgumentException: Wrong FS: file:/tmp/spark-d48e6576/scratch_hive_2016-12-28_10-47-18_519_1, expected: hdfs://nameservice1

I am trying to parse a JSON log file with Spark and append it to an external Hive table.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.hive.HiveContext

def main(args: Array[String]) {
  val local: String = args(0)
  val sparkConf: SparkConf = new SparkConf().setAppName("Proxy")
  val ctx = new SparkContext(sparkConf)
  val sqlContext = new HiveContext(ctx)
  import sqlContext.implicits._

  // schema is defined elsewhere in the application
  val df = sqlContext.read.schema(schema).json("group-8_instance-48_2016-10-19-16-54.log")
  df.registerTempTable("proxy_par_tmp")

  val results_transaction = sqlContext.sql("SELECT type, time, path, protocol, protocolSrc, duration, status, serviceContexts, customMsgAtts, correlationId, legs FROM proxy_par_tmp WHERE type='transaction'")
  val two = saveFile(results_transaction).toDF()
  two.write.mode(SaveMode.Append).saveAsTable("lzwpp_ushare.proxy_par")
  ctx.stop
}

def saveFile(file: DataFrame): RDD[String] = {
  // Concatenate the first 11 columns of each row with "~" as the delimiter
  val five = file.map { t =>
    t(0).toString + "~" + t(1).toString + "~" + t(2).toString + "~" + t(3).toString + "~" + t(4).toString +
      "~" + t(5).toString + "~" + t(6).toString + "~" + t(7).toString + "~" + t(8).toString + "~" + t(9).toString +
      "~" + t(10).toString
  }
  five
}
I also tried with:
two.saveAsTable("lzwpp_ushare.proxy_par",SaveMode.Append)
The problem is that when I try the same thing with a Hive managed table it works fine, but not with an external table.
Both spark-shell and spark-submit throw the same error.
It writes to HDFS
Thanks.