I have Loaded a DataFrame into HDFS as text format using below code. finalDataFrame is the DataFrame
finalDataFrame.repartition(1).rdd.saveAsTextFile(targetFile)
After executing the above code I found that a directory created with the file name I provided and under the directory a file created but not in text format. The file name is like part-00000.
I have resolved this in HDFS using below code.
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
Now I can get the text file in the mentioned path with given file name.
But when I am trying to do the same operation in S3 it is showing some exception
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
java.lang.IllegalArgumentException: Wrong FS:
s3a://globalhadoop/data, expected:
hdfs://*********.aws.*****.com:8050
It seems that S3 path is not supporting over here. Can anyone please assist how to do this.
I have solved the problem using below code.
def createOutputTextFile(srcPath: String, dstPath: String, s3BucketPath: String): Unit = {
var fileSystem: FileSystem = null
var conf: Configuration = null
if (srcPath.toLowerCase().contains("s3a") || srcPath.toLowerCase().contains("s3n")) {
conf = sc.hadoopConfiguration
fileSystem = FileSystem.get(new URI(s3BucketPath), conf)
} else {
conf = new Configuration()
fileSystem = FileSystem.get(conf)
}
FileUtil.copyMerge(fileSystem, new Path(srcPath), fileSystem, new Path(dstPath), true, conf, null)
}
I have written the code for filesystem of S3 and HDFS and both are working fine.
You are passing in the hdfs filesystem as the destination FS in FileUtil.copyMerge. You need to get the real FS of the destination, which you can do by calling Path.getFileSystem(Configuration) on the destination path you have created.
Related
I am trying to push records to AWS S3 with Aplakka S3 library. The issue is due to security issues, I have to "assume role", and my IAM user doesn't have access to PUT. with aws cli, I could have successfully pushed to s3 with --profile parameter in aws s3 --profile <profile>. I want to know HOW TO ASSUME ROLE IN ALPAKKA S3 LIBRARY?
my application.conf file has the credentials as in https://github.com/akka/alpakka/blob/v3.0.4/s3/src/main/resources/reference.conf
:
alpakka.s3 {
# default values for AWS configuration
aws {
# to use the same configuration as if credentials.provider = default
credentials {
# static credentials
provider = default //static
access-key-id = <> //valid access key exists in original code
secret-access-key = <>
}
and my PUTing code is:
object Experiments extends App{
//AWS S3 configs
val s3BucketName = config.getString("alpakka.s3.bucket_name")
val s3BucketRegion = config.getString("alpakka.s3.bucket_region")
val bucket_path = config.getString("alpakka.s3.path_inside_bucket")
Source.single(record)
.map { Record => println("Record value: " + Record)
val K_record = //some parsing for the record into JSON
//push to S3
val bucketKey = s"$bucket_path/${K_record.session_id}.json"
try{
Source.single(ByteString(K_record.featureSet))
.runWith(S3.multipartUpload(bucket = s3BucketName, key = bucketKey))
}
catch{
case e2:Exception => println(s"S3 pushing error: ${e2.getMessage}")
}
}
.runWith(Sink.ignore)
I'm trying to setup an Alpakka S3 for files upload purpose. Here is my configs:
alpakka s3 dependency:
...
"com.lightbend.akka" %% "akka-stream-alpakka-s3" % "0.20"
...
Here is application.conf:
akka.stream.alpakka.s3 {
buffer = "memory"
proxy {
host = ""
port = 8000
secure = true
}
aws {
credentials {
provider = default
}
}
path-style-access = false
list-bucket-api-version = 2
}
File upload code example:
private val awsCredentials = new BasicAWSCredentials("my_key", "my_secret_key")
private val awsCredentialsProvider = new AWSStaticCredentialsProvider(awsCredentials)
private val regionProvider = new AwsRegionProvider { def getRegion: String = "us-east-1" }
private val settings = new S3Settings(MemoryBufferType, None, awsCredentialsProvider, regionProvider, false, None, ListBucketVersion2)
private val s3Client = new S3Client(settings)(system, materializer)
val fileSource = Source.fromFuture(ByteString("ololo blabla bla"))
val fileName = UUID.randomUUID().toString
val s3Sink: Sink[ByteString, Future[MultipartUploadResult]] = s3Client.multipartUpload("my_basket", fileName)
fileSource.runWith(s3Sink)
.map {
result => println(s"${result.location}")
} recover {
case ex: Exception => println(s"$ex")
}
When I run this code I get:
javax.net.ssl.SSLHandshakeException: General SSLEngine problem
What can be a reason?
The certificate problem arises for bucket names containing dots.
You may switch to
akka.stream.alpakka.s3.path-style-access = true to get rid of this.
We're considering making it the default: https://github.com/akka/alpakka/issues/1152
I am trying to get the directory name from hdfs location using spark. I am getting the whole path to the directory instead of just the directory name.
val fs = FileSystem.get(sc.hadoopConfiguration)
val ls = fs.listStatus(new Path("/user/rev/raw_data"))
ls.foreach(x => println(x.getPath))
This gives me
hdfs://localhost/user/rev/raw_data/191622-140
hdfs://localhost/user/rev/raw_data/201025-001
hdfs://localhost/user/rev/raw_data/201025-002
hdfs://localhost/user/rev/raw_data/2065-5
hdfs://localhost/user/rev/raw_data/223575-002
How can I just get the output as below (i.e. just the directory name)
191622-140
201025-001
201025-002
2065-5
223575-002
As you work with Path objects when using status.getPath, you can simply use the getName function on Path objects:
FileSystem
.get(sc.hadoopConfiguration)
.listStatus(new Path("/user/rev/raw_data"))
.filterNot(_.isFile)
.foreach(status => println(status.getPath.getName))
which would print:
191622-140
201025-001
201025-002
2065-5
223575-002
I am trying to read the following config file using typesafe config
common = {
jdbcDriver = "com.mysql.jdbc.Driver"
slickDriver = "slick.driver.MySQLDriver"
port = 3306
db = "foo"
user = "bar"
password = "baz"
}
source = ${common} {server = "remoteserver"}
target = ${common} {server = "localserver"}
When I try to read my config using this code
val conf = ConfigFactory.parseFile(new File("src/main/resources/application.conf"))
val username = conf.getString("source.user")
I get an error
com.typesafe.config.ConfigException$NotResolved: source.user has not been resolved, you need to call Config#resolve(), see API docs for Config#resolve()
I don't get any error if I put everything inside "source" or "target" tags. I get errors only when I try to use "common"
I solved it myself.
ConfigFactory.parseFile(new File("src/main/resources/application.conf")).resolve()
I solved it.
Config confSwitchEnv = ConfigFactory.load("env.conf");
the env.conf is in the resources dir.
reference: https://nicedoc.io/lightbend/config
i would like merging from local file from /opt/one.txt with file at my hdfs hdfs://localhost:54310/dummy/two.txt.
contains at one.txt : f,g,h
contains at two.txt : 2424244r
my code :
val cfg = new Configuration()
cfg.addResource(new Path("/usr/local/hadoop/etc/hadoop/core-site.xml"))
cfg.addResource(new Path("/usr/local/hadoop/etc/hadoop/hdfs-site.xml"))
cfg.addResource(new Path("/usr/local/hadoop/etc/hadoop/mapred-site.xml"))
try
{
val srcPath = "/opt/one.txt"
val dstPath = "/dumCBF/two.txt"
val srcFS = FileSystem.get(URI.create(srcPath), cfg)
val dstFS = FileSystem.get(URI.create(dstPath), cfg)
FileUtil.copyMerge(srcFS,
new Path(srcPath),
dstFS,
new Path(dstPath),
true,
cfg,
null)
println("end proses")
}
catch
{
case m:Exception => m.printStackTrace()
case k:Throwable => k.printStackTrace()
}
i was following tutorial from : http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/
and it's not working at all, error would below this :
java.io.FileNotFoundException: File does not exist: /opt/one.txt
i dont why, sounds of error like that? BTW the file one.txt is exist
and then, im add some code to check exist file :
if(new File(srcPath).exists()) println("file is exist")
any idea or references? thanks!
EDIT 1,2 : typo extensions