Hadoop DistCp does not create the destination folder when we pass a single file - Scala

I am facing the issue below with Hadoop DistCp; any suggestion or help is highly appreciated.
I am trying to copy data from Google Cloud Platform to Amazon S3.
1) When we have multiple files to copy from source to destination (this works fine):
val sourcefile : String = "gs://XXXX_-abc_account2621/abc_account2621_click_20170616*.csv.gz" [multiple files to copy; note the * in the file name]
Output: S3://S3bucketname/xxx/xxxx/clientid=account2621/date=2017-08-18/
Files in the above path:
abc_account2621_click_2017061612_20170617_005852_572560033.csv.gz
abc_account2621_click_2017061616_20170617_045654_572608350.csv.gz
abc_account2621_click_2017061622_20170617_103107_572684922.csv.gz
abc_account2621_click_2017061623_20170617_120235_572705834.csv.gz
2) When we have only one file to copy from source to destination (issue):
val sourcefile : String = "gs://XXXX_-abc_account2621/abc_account2621_activity_20170618_20170619_034412_573362513.csv.gz"
Output: S3://S3bucketname/xxx/xxxx/clientid=account2621/
Files in the above path:
date=2017-08-18 (the directory is replaced with the file content, and it has no file extension)
Code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.tools.DistCp
import org.apache.hadoop.util.ToolRunner

object DistCpJob { // enclosing object name assumed; not shown in the original snippet

  def main(args: Array[String]): Unit = {
    val Array(environment, customer, typesoftables, clientid, filedate) = args.take(5)
    val S3Path: String = customer + "/" + typesoftables + "/" + "clientid=" + clientid + "/" + "date=" + filedate + "/"
    val sourcefile: String = "gs://XXXX_-abc_account2621//abc_account2621_activity_20170618_20170619_034412_573362513.csv.gz"
    val destination: String = "s3n://S3bucketname/" + S3Path
    println(sourcefile)
    println(destination)
    val filepaths: Array[String] = Array(sourcefile, destination)
    executeDistCp(filepaths)
  }

  def executeDistCp(filepaths: Array[String]) {
    val conf: Configuration = new Configuration()
    // GCS connector configuration
    conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    conf.set("google.cloud.auth.service.account.enable", "true")
    conf.set("fs.gs.project.id", "XXXX-XXXX")
    conf.set("google.cloud.auth.service.account.json.keyfile", "/tmp/XXXXX.json")
    // S3 credentials
    conf.set("fs.s3n.awsAccessKeyId", "XXXXXXXXXXXX")
    conf.set("fs.s3n.awsSecretAccessKey", "XXXXXXXXXXXXXX")
    conf.set("mapreduce.application.classpath", "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,/usr/lib/hadoop-lzo/lib/*,/usr/share/aws/emr/emrfs/conf,/usr/share/aws/emr/emrfs/lib/*,/usr/share/aws/emr/emrfs/auxlib/*,/usr/share/aws/emr/lib/*,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar,/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar,/usr/share/aws/emr/cloudwatch-sink/lib/*,/usr/share/aws/aws-java-sdk/*,/tmp/gcs-connector-latest-hadoop2.jar")
    conf.set("HADOOP_CLASSPATH", "$HADOOP_CLASSPATH:/tmp/gcs-connector-latest-hadoop2.jar")
    // Delete the destination before copying
    val outputDir: Path = new Path(filepaths(1))
    outputDir.getFileSystem(conf).delete(outputDir, true)
    val distCp: DistCp = new DistCp(conf, null)
    ToolRunner.run(distCp, filepaths)
  }
}

Adding the code below fixed the above issue.
Code:
val makeDir: Path = new Path(filepaths(1))
makeDir.getFileSystem(conf).mkdirs(makeDir)
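For context, a minimal sketch of how this fix can slot into executeDistCp above (the exact placement after the delete is an assumption; the configuration lines are elided):

def executeDistCp(filepaths: Array[String]) {
  val conf: Configuration = new Configuration()
  // ... same GCS / S3 / classpath settings as in the code above ...

  val outputDir: Path = new Path(filepaths(1))
  outputDir.getFileSystem(conf).delete(outputDir, true)

  // Fix: ensure the destination directory exists before DistCp runs, so a
  // single-file copy lands inside it instead of overwriting the path itself.
  val makeDir: Path = new Path(filepaths(1))
  makeDir.getFileSystem(conf).mkdirs(makeDir)

  val distCp: DistCp = new DistCp(conf, null)
  ToolRunner.run(distCp, filepaths)
}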

Related

Show the file timestamps and select the latest file from a directory using Scala in Azure Databricks

I want to select the latest file from a directory and show the timestamps of all the files using Scala code in Azure Databricks.
Can you please help me with this?
I have tried this using the code below, which is working fine.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

var basePath = "<Full_Path>"
var files = Array[String]()
var maxTS: Long = 0
var TimeFile = collection.mutable.Map[Long, String]()

val conf = new Configuration()
val hdfs = FileSystem.get(conf)
val f = new Path(basePath)

// Recursively iterate over all files under basePath
val messageFile = hdfs.listFiles(f, true)
while (messageFile.hasNext()) {
  val message = messageFile.next()
  if (message.getPath.toString().endsWith("tsv")) {
    files = files :+ message.getPath.toString()
    // Map each file's modification time to its path
    TimeFile += (message.getModificationTime -> message.getPath.toString())
  }
}
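To actually pick the latest file from the TimeFile map built above, a minimal sketch (assumes at least one matching file was found):

// The latest file is the one with the largest modification timestamp
val latestFile = TimeFile(TimeFile.keys.max)
println(s"Latest file: $latestFile")

// Show all files with their timestamps, newest first
TimeFile.toSeq.sortBy(-_._1).foreach { case (ts, path) => println(s"$ts  $path") }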

Fail to write to S3

I am trying to write a file to Amazon S3.
import java.io.ByteArrayInputStream
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectMetadata

val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
val amazonS3Client = new AmazonS3Client(creds)
val filePath = "/service2/2019/06/30/21"
val fileContent = "{\"key\":\"value\"}"
val meta = new ObjectMetadata()
amazonS3Client.putObject(bucketName, filePath, new ByteArrayInputStream(fileContent.getBytes), meta)
The program finishes with no error, but no file is written into the bucket.
The key argument seems to have a typo. Try it without the initial forward slash:
val filePath = "service2/2019/06/30/21"
instead of
val filePath = "/service2/2019/06/30/21"

How to list the paths of files inside an HDFS directory and its subdirectories?

I could not figure out a way to list all files in a directory and its subdirectories.
Here is the code that I'm using, which lists files in a specific directory but fails to list files if there is a subdirectory inside:
val conf = new Configuration()
val fs = FileSystem.get(new java.net.URI("hdfs://servername/"), conf)
val status = fs.listStatus(new Path("path/to/folder/"))
status.foreach { x => println(x.getPath.toString()) }
The above code lists all the files inside the directory, but I need it to be recursive.
You could recurse whenever you discover a new folder:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs = FileSystem.get(new Configuration())

def listFileNames(hdfsPath: String): List[String] = {
  hdfs
    .listStatus(new Path(hdfsPath))
    .flatMap { status =>
      // If it's a file, keep its full path:
      if (status.isFile)
        List(hdfsPath + "/" + status.getPath.getName)
      // If it's a directory, recurse into it:
      else
        listFileNames(hdfsPath + "/" + status.getPath.getName)
    }
    .toList
    .sorted
}
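A quick usage example (the path here is just a placeholder):

// Prints every file found under the directory, including files in subdirectories
listFileNames("/path/to/folder").foreach(println)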

How to get an element from a List based on its name?

I have a path that contains subpaths; each subpath contains files:
path="/data"
I implemented two functions to get the CSV files from each subpath:
import java.io.File

def getListOfSubDirectories(directoryName: String): Array[String] = {
  (new File(directoryName))
    .listFiles
    .filter(_.isDirectory)
    .map(_.getName)
}

def getListOfFiles(dir: String, extensions: List[String]): List[File] = {
  val d = new File(dir)
  d.listFiles.filter(_.isFile).toList.filter { file =>
    extensions.exists(file.getName.endsWith(_))
  }
}
Each subpath contains 5 CSV files: contextfile.csv, datafile.csv, datesfiles.csv, errors.csv, testfiles. My problem is that I'll work with each file in a separate DataFrame, so how can I get the name of the file for the right DataFrame? For example, I want to get the name of the file that concerns context (i.e. contextfile.csv). I worked like this, but on each iteration the logic and the ordering in the List change:
val dir = getListOfSubDirectories(path)
for (sup_path <- dir) {
  val Files = getListOfFiles(path + "//" + sup_path, List(".csv"))
  val filename_context = Files(1).toString
  val filename_datavalue = Files(0).toString
  val filename_error = Files(3).toString
  val filename_testresult = Files(4).toString
}
Any help is appreciated, thanks.
I solved it with just a simple filter:
val filename_context = Files.filter(f =>f.getName.contains("context")).last.toString
val filename_datavalue = Files.filter(f =>f.getName.contains("data")).last.toString
val filename_error = Files.filter(f =>f.getName.contains("error")).last.toString
val filename_testresult = Files.filter(f =>f.getName.contains("test")).last.toString
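A slightly more defensive variant of the same idea, in case one of the files is missing (the keyword list is an assumption based on the file names in the question):

// Look up each file by a keyword in its name; None if no file matches
val byKeyword: Map[String, Option[File]] =
  List("context", "data", "error", "test")
    .map(keyword => keyword -> Files.find(_.getName.contains(keyword)))
    .toMap

val filename_context = byKeyword("context").map(_.toString)
  .getOrElse(sys.error("no context file found in this subpath"))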

java.lang.IllegalArgumentException: Wrong FS: file:/tmp/spark-d48e6576/scratch_hive_2016-12-28_10-47-18_519_1, expected: hdfs://nameservice1

I am trying to parse a JSON log file with Spark and append it to an external Hive table.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.hive.HiveContext

def main(args: Array[String]) {
  val local: String = args(0)
  val sparkConf: SparkConf = new SparkConf().setAppName("Proxy")
  val ctx = new SparkContext(sparkConf)
  val sqlContext = new HiveContext(ctx)
  import sqlContext.implicits._

  // schema is defined elsewhere in the application
  val df = sqlContext.read.schema(schema).json("group-8_instance-48_2016-10-19-16-54.log")
  df.registerTempTable("proxy_par_tmp")

  val results_transaction = sqlContext.sql("SELECT type, time, path, protocol, protocolSrc, duration, status, serviceContexts, customMsgAtts, correlationId, legs FROM proxy_par_tmp WHERE type='transaction'")
  val two = saveFile(results_transaction).toDF()
  two.write.mode(SaveMode.Append).saveAsTable("lzwpp_ushare.proxy_par")
  ctx.stop
}

def saveFile(file: DataFrame): RDD[String] = {
  // Concatenate the first 11 columns of each row with "~" as the delimiter
  val five = file.map { t =>
    t(0).toString + "~" + t(1).toString + "~" + t(2).toString + "~" + t(3).toString + "~" + t(4).toString +
      "~" + t(5).toString + "~" + t(6).toString + "~" + t(7).toString + "~" + t(8).toString + "~" + t(9).toString +
      "~" + t(10).toString
  }
  five
}
I also tried with:
two.saveAsTable("lzwpp_ushare.proxy_par",SaveMode.Append)
The problem is that when I try the same thing with a Hive managed table it works fine, but not with an external table.
Both spark-shell and spark-submit throw the same error.
It writes to HDFS
Thanks.