How to get just directory name from HDFS - scala

I am trying to get the directory names from an HDFS location using Spark, but I am getting the whole path to each directory instead of just the directory name.
val fs = FileSystem.get(sc.hadoopConfiguration)
val ls = fs.listStatus(new Path("/user/rev/raw_data"))
ls.foreach(x => println(x.getPath))
This gives me
hdfs://localhost/user/rev/raw_data/191622-140
hdfs://localhost/user/rev/raw_data/201025-001
hdfs://localhost/user/rev/raw_data/201025-002
hdfs://localhost/user/rev/raw_data/2065-5
hdfs://localhost/user/rev/raw_data/223575-002
How can I get just the output below (i.e. only the directory names)?
191622-140
201025-001
201025-002
2065-5
223575-002

Since status.getPath returns a Path object, you can simply call getName on it:
FileSystem
  .get(sc.hadoopConfiguration)
  .listStatus(new Path("/user/rev/raw_data"))
  .filterNot(_.isFile)
  .foreach(status => println(status.getPath.getName))
which would print:
191622-140
201025-001
201025-002
2065-5
223575-002
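If you need the names as a collection rather than printing them, the same API can be used along these lines (a small sketch; the path is the one from the question, the variable name is illustrative):
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)

// Keep only directories and take the last path component of each one,
// e.g. "191622-140" rather than the full hdfs:// URI.
val dirNames: Array[String] = fs
  .listStatus(new Path("/user/rev/raw_data"))
  .filter(_.isDirectory)
  .map(_.getPath.getName)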

Related

Output file not created when running an R command in a Nextflow file?

I am trying to run a Nextflow pipeline, but the output file is not created.
The main.nf file looks like this:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
process my_script {
    """
    Rscript script.R
    """
}

workflow {
    my_script
}
In my nextflow.config I have:
process {
    executor = 'k8s'
    container = 'rocker/r-ver:4.1.3'
}
The script.R looks like this:
FUN <- readRDS("function.rds");
input = readRDS("input.rds");
output = FUN(
singleCell_data_input = input[[1]], savePath = input[[2]], tmpDirGC = input[[3]]
);
saveRDS(output, "output.rds")
After running nextflow run main.nf, the output.rds file is not created.
Nextflow processes run independently of each other and are isolated inside their own working directories. For your script to be able to find the required input files, these must be localized inside the process working directory. This should be done by defining an input block and declaring the files using the path qualifier, for example:
params.function_rds = './function.rds'
params.input_rds = './input.rds'
process my_script {

    input:
    path my_function_rds
    path my_input_rds

    output:
    path "output.rds"

    """
    #!/usr/bin/env Rscript

    FUN <- readRDS("${my_function_rds}");
    input = readRDS("${my_input_rds}");
    output = FUN(
        singleCell_data_input=input[[1]], savePath=input[[2]], tmpDirGC=input[[3]]
    );
    saveRDS(output, "output.rds")
    """
}

workflow {
    function_rds = file( params.function_rds )
    input_rds = file( params.input_rds )
    my_script( function_rds, input_rds )
    my_script.out.view()
}
In the same way, the script itself would need to be localized inside the process working directory. To avoid specifying an absolute path to your R script (which would make your workflow not portable at all), it's possible to simply embed your code, making sure to specify the Rscript shebang. This works because process scripts are not limited to Bash.
Another way would be to make your R script executable and move it into a directory called bin in the root directory of your project repository (i.e. the same directory as your main.nf Nextflow script). Nextflow automatically adds this folder to the $PATH environment variable, and your script would become automatically accessible to each of your pipeline processes. For this to work, you'd need some way to pass in the input files as command-line arguments. For example:
params.function_rds = './function.rds'
params.input_rds = './input.rds'
process my_script {

    input:
    path my_function_rds
    path my_input_rds

    output:
    path "output.rds"

    """
    script.R "${my_function_rds}" "${my_input_rds}" output.rds
    """
}

workflow {
    function_rds = file( params.function_rds )
    input_rds = file( params.input_rds )
    my_script( function_rds, input_rds )
    my_script.out.view()
}
And your R script might look like:
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
FUN <- readRDS(args[1]);
input = readRDS(args[2]);
output = FUN(
singleCell_data_input=input[[1]], savePath=input[[2]], tmpDirGC=input[[3]]
);
saveRDS(output, args[3])

How to get the configuration file path within JAR file for KafkaProducer SSL setup?

I have a JAR file with the following structure:
example.jar
 |
 +-org
 |  +-springframework
 |     +-boot
 |        +-loader.jar
 +-BOOT-INF
    +-classes
    |  +-kafka
    |     truststore.jks   ==> I want to get the path here
    +-lib
       +-dependency1.jar
How can I get the path (only the path, as a string) of the 'kafka/truststore.jks' file?
Because I am applying SSL for the KafkaProducer, I am using the code below, and it works fine locally:
#Value("classpath:kafka/truststore.jks")
private org.springframework.core.io.Resource sslTruststoreResource;
...
String sslTruststoreLocation = sslTruststoreResource.getFile().getAbsolutePath(); // ==\> ***it throw FileNotFoundException here on deployed Server, local env run fine !***
Map\<String, Object\> config = Maps.newHashMap();
config.put("ssl.truststore.location", sslTruststoreLocation);
but when I deploy it on the server, it throws a FileNotFoundException :(
After many days of research, I found that sslTruststoreResource.getFile() will fail in the JAR file case, as mentioned here.
sslTruststoreResource.getInputStream() and sslTruststoreResource.getFilename() work in the JAR file case, but they do not give me the path I need for the Kafka configuration.
In my project, the 'truststore.jks' file is located as below:
src
 -- java
 -- resources
     -- kafka
         -- truststore.jks
So, is there any solution for my issue? Thank you.
I tried to use ClassPathResource and ResourcePatternResolver, but they are not working.
After trying many approaches I still could not get the path of the JKS file, so I copy it to another path outside the JAR file, where my code can refer to it:
final String FILE_NAME = env.getProperty("kafka.metadata.ssl.truststore.location");
String sslTruststoreLocation = "*-imf-kafka-client.truststore.jks";
try {
    InputStream is = getClass().getClassLoader().getResourceAsStream(FILE_NAME);

    // Get the destination path that will hold the JKS file
    final String HOME_DIR = System.getProperty("user.home");
    final Path destPath = Paths.get(HOME_DIR, "tmp");
    if (!Files.isDirectory(destPath)) {
        Files.createDirectories(destPath);
    }

    // Copy the JKS file to the destination path
    sslTruststoreLocation = destPath.toFile().getAbsolutePath() + "/" + FILE_NAME;
    File uploadedFile = new File(sslTruststoreLocation);
    if (!uploadedFile.exists()) {
        uploadedFile.getParentFile().mkdirs();
        uploadedFile.createNewFile();
        FileCopyUtils.copy(is, new FileOutputStream(sslTruststoreLocation));
        log.debug("Copied {} file from resources dir to {} done!", FILE_NAME, sslTruststoreLocation);
    }
    config.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, sslTruststoreLocation);
} catch (IOException e) {
    final String message = "The " + sslTruststoreLocation + " file not found to construct a KafkaProducer";
    log.error(message);
}
Looks like this is a known issue in Kafka.
Spring Boot proposes a workaround similar to yours.
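The underlying idea of that workaround is the same as your code: a classpath resource inside a JAR has no real filesystem path, so you copy it out and point Kafka at the copy. A compact sketch of the same idea (written in Scala here for brevity; the resource name and the config map come from the question, everything else is illustrative, not the exact Spring Boot code):
import java.nio.file.{Files, StandardCopyOption}

// Copy the truststore out of the JAR into a real temporary file...
val in = getClass.getClassLoader.getResourceAsStream("kafka/truststore.jks")
val tmp = Files.createTempFile("truststore", ".jks")
try Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING) finally in.close()

// ...and hand Kafka a plain filesystem path it can open.
config.put("ssl.truststore.location", tmp.toAbsolutePath.toString)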

How to execute spark scala script saved in a text file

I have written a word-count Scala script in a text file and saved it in my home directory.
How can I call and execute the script file "wordcount.txt"?
If I try the command spark-submit wordcount.txt, it does not work.
Content of the "wordcount.txt" file:
val text = sc.textFile("/data/mr/wordcount/big.txt")
val counts = text.flatMap(line => line.split(" "))
  .map(word => (word.toLowerCase(), 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, false)
counts.saveAsTextFile("count_output")
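Note that spark-submit expects a packaged application (a JAR, or a Python/R file), not a plain Scala script. One common way to run a script like this is to pass it to the interactive shell with the -i option (or use :load from inside spark-shell), assuming the file is in the directory you launch from:
spark-shell -i wordcount.txt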

FileUtil.copyMerge() in AWS S3

I have loaded a DataFrame into HDFS in text format using the code below (finalDataFrame is the DataFrame):
finalDataFrame.repartition(1).rdd.saveAsTextFile(targetFile)
After executing the above code, I found that a directory was created with the file name I provided, and a file was created under that directory, but it was not the single text file I expected; its name is like part-00000.
I have resolved this in HDFS using the code below.
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
Now I can get the text file at the mentioned path with the given file name.
But when I try to do the same operation on S3, it throws an exception:
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
java.lang.IllegalArgumentException: Wrong FS: s3a://globalhadoop/data, expected: hdfs://*********.aws.*****.com:8050
It seems that the S3 path is not supported here. Can anyone please advise how to do this?
I have solved the problem using the code below.
def createOutputTextFile(srcPath: String, dstPath: String, s3BucketPath: String): Unit = {
  var fileSystem: FileSystem = null
  var conf: Configuration = null
  if (srcPath.toLowerCase().contains("s3a") || srcPath.toLowerCase().contains("s3n")) {
    conf = sc.hadoopConfiguration
    fileSystem = FileSystem.get(new URI(s3BucketPath), conf)
  } else {
    conf = new Configuration()
    fileSystem = FileSystem.get(conf)
  }
  FileUtil.copyMerge(fileSystem, new Path(srcPath), fileSystem, new Path(dstPath), true, conf, null)
}
I have written the code to handle both the S3 and HDFS filesystems, and both are working fine.
You are passing the HDFS filesystem as the destination FS in FileUtil.copyMerge. You need to get the real FS of the destination, which you can do by calling Path.getFileSystem(Configuration) on the destination path you have created.
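A minimal sketch of that suggestion (srcPath and dstPath are the same variables used above; the comments describe the assumed layout):
import org.apache.hadoop.fs.{FileUtil, Path}

val conf = sc.hadoopConfiguration

val src = new Path(srcPath)   // e.g. the hdfs:// directory holding the part files
val dst = new Path(dstPath)   // e.g. the s3a:// file you want to end up with

// Ask each Path for the filesystem that actually owns it, instead of
// reusing the default (HDFS) filesystem for both sides of the copy.
val srcFs = src.getFileSystem(conf)
val dstFs = dst.getFileSystem(conf)

FileUtil.copyMerge(srcFs, src, dstFs, dst, true, conf, null)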

Exception while getting string from conf file in Play 2 with Scala

I am trying to get a file path from the config. This is my conf file:
uploadedFilePath.conf
file.path="public/img/"
This is how I am getting the path from the conf file in my code:
val conf = ConfigFactory.load()
var path : String = conf.getString("file.path")
I am getting an exception on the second line:
09:58:11.527 108649 [application-akka.actor.default-dispatcher-10] PlayDefaultUpstreamHandler ERROR - Cannot invoke the action
com.typesafe.config.ConfigException$WrongType: system properties: path has type OBJECT rather than STRING
    at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:159) ~[com.typesafe.config-1.3.0.jar:na]
    at com.typesafe.config.impl.SimpleConfig.findOrNull(SimpleConfig.java:170) ~[com.typesafe.config-1.3.0.jar:na]
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:184) ~[com.typesafe.config-1.3.0.jar:na]
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:189) ~[com.typesafe.config-1.3.0.jar:na]
    at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:246) ~[com.typesafe.config-1.3.0.jar:na]
I do not know what I am doing wrong.
Remove the quotes:
file.path=public/img/