How to execute a Spark Scala script saved in a text file

I have written a word-count Scala script in a text file and saved it in my home directory.
How can I call and execute the script file "wordcount.txt"?
If I try the command spark-submit wordcount.txt, it does not work.
Content of the "wordcount.txt" file:
val text = sc.textFile("/data/mr/wordcount/big.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word.toLowerCase(), 1)).reduceByKey(_ + _).sortBy(_._2, false).saveAsTextFile("count_output")

Output file not created when running an R command in a Nextflow file?

I am trying to run a nextflow pipeline but the output file is not created.
The main.nf file looks like this:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
process my_script {
"""
Rscript script.R
"""
}
workflow {
my_script
}
In my nextflow.config I have:
process {
executor = 'k8s'
container = 'rocker/r-ver:4.1.3'
}
The script.R looks like this:
FUN <- readRDS("function.rds");
input = readRDS("input.rds");
output = FUN(
singleCell_data_input = input[[1]], savePath = input[[2]], tmpDirGC = input[[3]]
);
saveRDS(output, "output.rds")
After running nextflow run main.nf, the output.rds file is not created.
Nextflow processes run independently and in isolation from each other, each inside its own working directory. For your script to be able to find the required input files, these must be localized inside the process working directory. This should be done by defining an input block and declaring the files using the path qualifier, for example:
params.function_rds = './function.rds'
params.input_rds = './input.rds'
process my_script {
input:
path my_function_rds
path my_input_rds
output:
path "output.rds"
"""
#!/usr/bin/env Rscript
FUN <- readRDS("${my_function_rds}");
input = readRDS("${my_input_rds}");
output = FUN(
singleCell_data_input=input[[1]], savePath=input[[2]], tmpDirGC=input[[3]]
);
saveRDS(output, "output.rds")
"""
}
workflow {
function_rds = file( params.function_rds )
input_rds = file( params.input_rds )
my_script( function_rds, input_rds )
my_script.out.view()
}
In the same way, the script itself would need to be localized inside the process working directory. To avoid specifying an absolute path to your R script (which would not make your workflow portable at all), it's possible to simply embed your code, making sure to specify the Rscript shebang. This works because process scripts are not limited to Bash.
Another way would be to make your R script executable and move it into a directory called bin in the root directory of your project repository (i.e. the same directory as your 'main.nf' Nextflow script). Nextflow automatically adds this folder to the $PATH environment variable, so your script becomes accessible to each of your pipeline processes. For this to work, you'd need some way to pass in the input files as command line arguments. For example:
params.function_rds = './function.rds'
params.input_rds = './input.rds'
process my_script {
input:
path my_function_rds
path my_input_rds
output:
path "output.rds"
"""
script.R "${my_function_rds}" "${my_input_rds}" output.rds
"""
}
workflow {
function_rds = file( params.function_rds )
input_rds = file( params.input_rds )
my_script( function_rds, input_rds )
my_script.out.view()
}
And your R script might look like:
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
FUN <- readRDS(args[1]);
input = readRDS(args[2]);
output = FUN(
singleCell_data_input=input[[1]], savePath=input[[2]], tmpDirGC=input[[3]]
);
saveRDS(output, args[3])

Using input function with remote files in snakemake

I want to use a function to read input file paths from a dataframe and send them to my snakemake rule. I also have a helper function to select the remote from which to pull the files.
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
from snakemake.remote.SFTP import RemoteProvider as SFTPRemoteProvider
from os.path import join
import pandas as pd
configfile: "config.yaml"
units = pd.read_csv(config["units"]).set_index(["library", "unit"], drop=False)
TMP= join('data', 'tmp')
def access_remote(local_path):
""" Connects to the remote as defined in the config file"""
provider = config['provider']
if provider == 'GS':
GS = GSRemoteProvider()
remote_path = GS.remote(join("gs://" + config['bucket'], local_path))
elif provider == 'SFTP':
SFTP = SFTPRemoteProvider(
username=config['user'],
private_key=config['ssh_key']
)
remote_path = SFTP.remote(
config['host'] + ":22" + join(base_path, local_path)
)
else:
remote_path = local_path
return remote_path
def get_fastqs(wc):
"""
Get fastq files (units) of a particular library - sample
combination from the unit sheet.
"""
fqs = units.loc[
(units.library == wc.library) &
(units.libtype == wc.libtype),
"fq1"
]
return {
"r1": list(map(access_remote, fqs.fq1.values)),
}
# Combine all fastq files from the same sample / library type combination
rule combine_units:
input: unpack(get_fastqs)
output:
r1 = join(TMP, "reads", "{library}_{libtype}.end1.fq.gz")
threads: 12
run:
shell("cat {i1} > {o1}".format(i1=input['r1'], o1=output['r1']))
My config file contains the bucket name and provider, which are passed to the function. This works as expected when running simply snakemake.
However, I would like to use the kubernetes integration, which requires passing the provider and bucket name in the command line. But when I run:
snakemake -n --kubernetes --default-remote-provider GS --default-remote-prefix bucket-name
I get this error:
ERROR :: MissingInputException in line 19 of Snakefile:
Missing input files for rule combine_units:
bucket-name/['bucket-name/lib1-unit1.end1.fastq.gz', 'bucket-name/lib1-unit2.end1.fastq.gz', 'bucket-name/lib1-unit3.end1.fastq.gz']
The bucket is applied twice: once mapped correctly to each element, and once before the whole list (which gets converted to a string). Did I miss something? Is there a good way to work around this?
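The shape of the malformed entry in the error message can be reproduced in plain Python, which may help narrow down where the second prefix comes from. This is only an illustration of the symptom, not Snakemake's actual internals: `add_prefix` is a hypothetical helper, and the file names are copied from the error above.

```python
from posixpath import join

PREFIX = "bucket-name"  # the value given to --default-remote-prefix in the question


def add_prefix(path):
    """Hypothetical helper: prepend the prefix to whatever it is given, as a string."""
    return join(PREFIX, str(path))


files = ["lib1-unit1.end1.fastq.gz", "lib1-unit2.end1.fastq.gz"]

# Prefix applied element by element: one usable path per file.
per_element = [add_prefix(f) for f in files]

# Prefix applied again to the list as a whole: the list is stringified first,
# which gives the "bucket-name/['bucket-name/...', ...]" form seen in the error.
whole_list = add_prefix(per_element)

print(per_element)
print(whole_list)
```

If this matches what is happening, the input function is probably handing back a list where a single path is expected somewhere along the way, so the returned structure is the first thing to check.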

Cake Task output log to file

I have a set of Tasks inside a build.cake file and I would like to capture the log output from the console into a log file. I know it's possible to use the OnError() function to output errors to file but I would like to output everything to a log file, not just errors.
Below is an example of the build.cake file.
#load "SomeTask.cake"
#load "SomeOtherTask.cake"
var target = Argument("target", "Default");
var someTask = Task("SomeTask")
.Does(() =>
{
SomeMethodInsideSomeTask();
});
var someOtherTask = Task("SomeOtherTask")
.Does(() =>
{
SomeOtherMethodInsideSomeOtherTask();
});
Task("Default")
.IsDependentOn(someTask)
.IsDependentOn(someOtherTask);
RunTarget(target);
N.B. The Tasks are not running any sort of MSBuild commands so it's not possible to use MSBuildFileLogger.
How about piping stdout to a file, i.e.
./build.ps1 > log.txt
Have you heard about tee?
It reads standard input and writes it to both standard output and one or more files.
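To make the behaviour of tee concrete, here is a minimal Python sketch of the same idea: every line read from one source is written to several sinks. The function name `tee` and the sample log lines are made up for illustration; in real use the source would be `sys.stdin` and the sinks `sys.stdout` plus an open log file.

```python
import io


def tee(source, *sinks):
    """Copy every line from `source` to each sink, the way tee duplicates its input."""
    for line in source:
        for sink in sinks:
            sink.write(line)


# In-memory stand-ins for the console and the log file (illustrative only).
console = io.StringIO()
log_file = io.StringIO()

build_output = io.StringIO("Task SomeTask\nTask SomeOtherTask\n")
tee(build_output, console, log_file)

# Both sinks now hold an identical copy of the build output.
print(log_file.getvalue(), end="")
```

On the command line the equivalent is typically a pipe such as ./build.ps1 | tee log.txt, which keeps the output visible in the console while also writing it to the file.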

How to get just directory name from HDFS

I am trying to get the directory name from hdfs location using spark. I am getting the whole path to the directory instead of just the directory name.
val fs = FileSystem.get(sc.hadoopConfiguration)
val ls = fs.listStatus(new Path("/user/rev/raw_data"))
ls.foreach(x => println(x.getPath))
This gives me
hdfs://localhost/user/rev/raw_data/191622-140
hdfs://localhost/user/rev/raw_data/201025-001
hdfs://localhost/user/rev/raw_data/201025-002
hdfs://localhost/user/rev/raw_data/2065-5
hdfs://localhost/user/rev/raw_data/223575-002
How can I get just the directory names as output, as shown below?
191622-140
201025-001
201025-002
2065-5
223575-002
As you work with Path objects when using status.getPath, you can simply use the getName function on Path objects:
FileSystem
.get(sc.hadoopConfiguration)
.listStatus(new Path("/user/rev/raw_data"))
.filterNot(_.isFile)
.foreach(status => println(status.getPath.getName))
which would print:
191622-140
201025-001
201025-002
2065-5
223575-002

How to read a particular line from a log file using Logstash

I have to read 3 different lines from log files based on some text and then output the fields in a csv file.
Sample log data:
20110607 095826 [.] !! Begin test. Script filename/text.txt
20110607 095826 [.] Full path: filename/test/text.txt
20110607 095828 [.] FAILED: Test Failed()..
I have to read the file name that follows "!! Begin test. Script". This is my conf file:
filter{
grok
{
match => {"message" => "%{BASE10NUM:Date}%{SPACE:pat}{BASE10NUM:Number}%
{SPACE:pat}[.]%{SPACE:pat}%{SPACE:pat}!! Begin test. Script%
{SPACE:pat}%{GREEDYDATA:file}"
}
overwrite => ["message"]
}
if "_grokparserfailure" in [tags]
{
drop{}
}
}
But it does not give me a single record; it parses the full log file into JSON format with no parsed fields.
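One way to check the matching logic independently of Logstash is to sketch the same extraction as a Python regular expression and run it against the sample lines. This is not a grok configuration, and the names `BEGIN_TEST` and `extract_script` are hypothetical. It does show one detail that also applies to grok, since grok patterns are regular expressions: the literal text "[.]" has to be escaped, because an unescaped `[.]` is a character class that matches a single dot.

```python
import re

# Date, time, the literal "[.]" marker, the fixed text, then the file name.
BEGIN_TEST = re.compile(
    r"^(?P<date>\d{8})\s+(?P<time>\d{6})\s+\[\.\]\s+"
    r"!! Begin test\. Script\s+(?P<file>.+)$"
)


def extract_script(line):
    """Return the script file name if the line is a 'Begin test' line, else None."""
    match = BEGIN_TEST.match(line)
    return match.group("file") if match else None


sample_lines = [
    "20110607 095826 [.] !! Begin test. Script filename/text.txt",
    "20110607 095826 [.] Full path: filename/test/text.txt",
    "20110607 095828 [.] FAILED: Test Failed()..",
]

# Keep only the lines that matched, mirroring the intent of dropping failed events.
script_files = [name for name in map(extract_script, sample_lines) if name is not None]
print(script_files)
```

Only the first sample line matches, so the other two are skipped, which is the behaviour the drop-on-failure branch in the conf file is meant to achieve.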