Going to a specific path in Scala using scala.sys.process

I have to change to the directory of an application in order to deploy it. I tried using scala.sys.process and ran "cd /home/path/somepath" !, but it throws an exception. Can anyone guide me on how to change to that directory? I cannot deploy using an absolute path because of the dependencies the run file has.
Thanks in advance.

Although this question is a couple of years old, it's a good question.
To use scala.sys.process to execute something from a specific working directory, pass the required directory as a parameter to ProcessBuilder, as in this working example:
import scala.sys.process._

val scriptPath = "/home/path/myShellScript.sh"
val command = Seq("/bin/bash", "-c", scriptPath)
val proc = Process(command, new java.io.File(".")) // the second argument is the working directory

var output = Vector.empty[String]
val exitValue = proc ! ProcessLogger(
  (out) => if (out.trim.nonEmpty) output :+= out.trim,
  (err) => System.err.printf("e:%s\n", err) // can be quite noisy!
)
printf("exit value: %d\n", exitValue)
printf("output[%s]\n", output.mkString("\n"))
If the goal instead is to ensure that the caller's environment defaults to a specific working directory, that can be accomplished by setting the required working directory before launching the JVM.
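As a quick check of what that default is, the JVM's working directory can be printed at startup; sys.process commands run there unless an explicit directory is passed:
// prints the directory the JVM was launched from
println(System.getProperty("user.dir"))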

Related

How to run gsutil from Scala without the .cmd suffix?

I am trying to run gsutil in Scala, but it doesn't work unless I explicitly put .cmd in the code. I don't like this approach, since others I work with use Unix systems. How do I let Scala understand that gsutil == gsutil.cmd? I could just write a custom shell script and add that to the PATH, but I'd like a solution that doesn't involve scripting.
I have already tried various environment variables (using IntelliJ, don't know if it's relevant). I have tried adding both /bin and /platform/gsutil to the PATH; neither works (without .cmd at least). I have also tried giving the full path to see if it made a difference; it didn't.
Here is the method that uses gsutil:
def readFilesInBucket(ss: SparkSession, bucket: String): DataFrame = {
  import ss.implicits._
  ss.sparkContext.parallelize((s"gsutil ls -l $bucket" !!).split("\n")
    .map(r => r.trim.split(" ")).filter(r => r.length == 3)
    .map(r => (r(0), r(1), r(2)))).toDF(Array("Size", "Date", "File"): _*)
}
This is my first ever question on SO; I apologize for any formatting errors there may be.
EDIT:
I found out that even when I write a script like this:
exec gsutil.cmd "$@"
named just gsutil and placed in the same folder, it spits out the same error message as before: java.io.IOException: Cannot run program "gsutil": CreateProcess error=2, The system cannot find the file specified.
It works if I write gsutil in git bash, which otherwise didn't work without the script.
Maybe just use a different version depending on whether you're on a Windows or *nix system?
Create some helper:
object SystemDetector {
  lazy val isWindows = System.getProperty("os.name").startsWith("Windows")
}
And then just use it like:
def readFilesInBucket(ss: SparkSession, bucket: String): DataFrame = {
  import ss.implicits._
  val gsutil = if (SystemDetector.isWindows) "gsutil.cmd" else "gsutil"
  ss.sparkContext.parallelize((s"$gsutil ls -l $bucket" !!).split("\n")
    .map(r => r.trim.split(" ")).filter(r => r.length == 3)
    .map(r => (r(0), r(1), r(2)))).toDF(Array("Size", "Date", "File"): _*)
}
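A different workaround, not part of the answer above but sketched here untested for completeness, is to go through the platform shell, which resolves gsutil to gsutil.cmd the same way a terminal does:
import scala.sys.process._

// assumes `bucket` and SystemDetector are in scope as above
val listCommand = s"gsutil ls -l $bucket"
val shellCommand =
  if (SystemDetector.isWindows) Seq("cmd", "/c", listCommand)
  else Seq("/bin/sh", "-c", listCommand)
val listing = shellCommand.!!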

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is written to the current worker node directory, and I know this because by executing "ls -a" !! through Scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the SQL context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so Spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran the command val locationPt = "pwd"!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column 'X' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS and then retrieving it using the SQL context through HDFS. I overwrite "out.las" in HDFS each time so that I don't take up too much space.
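For reference, a rough sketch of the copy-then-read steps, shown linearly here for simplicity (in the real job the decompression happens on a worker, so the copy lives inside whatever task produces out.las); the HDFS destination path is made up, and read.las comes from the LAS reader package already in use:
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.sys.process._

// decompress into the local working directory, as before
Seq(laszippath, "-i", inputFileName, "-o", "out.las").!!

// copy the local file into HDFS, overwriting the previous copy
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.copyFromLocalFile(true, true, new Path("out.las"), new Path("/user/me/out.las"))

// read it back through the SQL context from HDFS
val dataFrame = sqlContext.read.las("hdfs:///user/me/out.las")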
I have used the Hadoop API before to get at files; I dunno if it will help you here.
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

val filePath = "/user/me/dataForHDFS/"
val fs: FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
I've not tested the code below, but it should give an idea of what to do afterwards.
val status = fs.getFileStatus(new Path(filePath + "out.las"))
val readIn = new Array[Byte](status.getLen.toInt) // buffer sized to the whole file
val fileIn: FSDataInputStream = fs.open(new Path(filePath + "out.las"))
fileIn.readFully(0, readIn)
fileIn.close()

pytest implementing a logfile per test method

I would like to create a separate log file for each test method, and I would like to do this in the conftest.py file and pass the logfile instance to the test method. This way, whenever I log something in a test method it would go to a separate log file and be very easy to analyse.
I tried the following.
Inside conftest.py file i added this:
logs_dir = pkg_resources.resource_filename("test_results", "logs")

def pytest_runtest_setup(item):
    test_method_name = item.name
    testpath = item.parent.name.strip('.py')
    path = '%s/%s' % (logs_dir, testpath)
    if not os.path.exists(path):
        os.makedirs(path)
    log = logger.make_logger(test_method_name, path)  # make_logger takes care of creating the logfile and returns the Python logging object
The problem here is that pytest_runtest_setup does not have the ability to return anything to the test method. At least, I am not aware of a way to do it.
So I thought of creating a fixture method inside the conftest.py file with scope="function" and calling this fixture from the test methods. But the fixture method does not know about the pytest.Item object. In the case of the pytest_runtest_setup method, it receives the item parameter, and using that we are able to find out the test method name and test method path.
Please help!
I found this solution by researching further, building on webh's answer. I tried to use pytest-logger, but its file structure is very rigid and it was not really useful for me. I found this code working without any plugin. It is based on set_log_path, which is an experimental feature.
Pytest 6.1.1 and Python 3.8.4
# conftest.py
# Required modules
import pytest
from pathlib import Path

# Configure logging
@pytest.hookimpl(hookwrapper=True, tryfirst=True)
def pytest_runtest_setup(item):
    config = item.config
    logging_plugin = config.pluginmanager.get_plugin("logging-plugin")
    filename = Path('pytest-logs', item._request.node.name + ".log")
    logging_plugin.set_log_path(str(filename))
    yield
Notice that the use of Path can be substituted by os.path.join. Moreover, different tests can be set up in different folders and keep a record of all tests done historically by using a timestamp on the filename. One could use the following filename for example:
# conftest.py
# Required modules
import pytest
import datetime
from pathlib import Path

# Configure logging
@pytest.hookimpl(hookwrapper=True, tryfirst=True)
def pytest_runtest_setup(item):
    ...
    filename = Path(
        'pytest-logs',
        item._request.node.name,
        f"{datetime.datetime.now().strftime('%Y%m%dT%H%M%S')}.log"
    )
    ...
Additionally, if one would like to modify the log format, one can change it in the pytest configuration file as described in the documentation.
# pytest.ini
[pytest]
log_file_level = INFO
log_file_format = %(name)s [%(levelname)s]: %(message)s
My first stackoverflow answer!
I found the answer I was looking for.
I was able to achieve it using a function-scoped fixture like this:
@pytest.fixture(scope="function")
def log(request):
    test_path = request.node.parent.name.strip(".py")
    test_name = request.node.name
    node_id = request.node.nodeid
    log_file_path = '%s/%s' % (logs_dir, test_path)
    if not os.path.exists(log_file_path):
        os.makedirs(log_file_path)
    logger_obj = logger.make_logger(test_name, log_file_path, node_id)
    yield logger_obj
    handlers = logger_obj.handlers
    for handler in handlers:
        handler.close()
        logger_obj.removeHandler(handler)
In newer pytest versions this can be achieved with set_log_path.
import os
import pytest

@pytest.fixture(autouse=True)
def manage_logs(request):
    """Set log file name same as test name"""
    request.config.pluginmanager.get_plugin("logging-plugin")\
        .set_log_path(os.path.join('log', request.node.name + '.log'))

Test Spark with Tachyon

I have installed Tachyon and Spark according to instructions:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
However, as a newbie I have no idea how to put file "X" into the Tachyon File System as they describe:
$ ./spark-shell
$ val s = sc.textFile("tachyon-ft://stanbyHost:19998/X")
$ s.count()
$ s.saveAsTextFile("tachyon-ft://activeHost:19998/Y")
What I did was point to an existing file (that I found through the management UI):
scala> val s = sc.textFile("tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH")
s: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
When I run count, I get the error below:
scala> s.count()
java.lang.NullPointerException: connectionString cannot be null
I assume my path was wrong. So two questions:
How to copy a file into Tachyon?
What is the proper path for its FS?
Sorry, very very newbie !!
UPDATE 1
I am not sure if tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH is the correct path. I cannot get it either via the browser or via wget.
This is what I saw in the file system browser
I found out the issue. I didn't do this
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
After I went through this exercise http://ampcamp.berkeley.edu/5/exercises/tachyon.html#run-spark-on-tachyon, I found out the proper path is this:
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
So my setup was fine after all. The documentation here http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html was causing me a lot of confusion.
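Putting the two pieces together, the working sequence in spark-shell looks like this (paths as in the AMP Camp exercise, with the count from the original snippet):
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
file.count()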

How do you write a Scala script that will react to file changes

I would like to change the following shell script to Scala (just for fun); however, the script must keep running and listen for changes to the *.mkd files. If any file is changed, then the script should re-generate the affected doc. File IO has always been my Achilles heel...
#!/bin/sh
for file in *.mkd
do
pandoc --number-sections $file -o "${file%%.*}.pdf"
done
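(For reference, the naive one-shot translation I have in mind looks roughly like this, untested:)
import java.io.File
import scala.sys.process._

for (f <- new File(".").listFiles if f.getName.endsWith(".mkd")) {
  val pdf = f.getName.stripSuffix(".mkd") + ".pdf"
  Seq("pandoc", "--number-sections", f.getName, "-o", pdf).!
}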
Any ideas around a good approach to this will be appreciated.
The following code, taken from my answer on Watch for project files, can also watch a directory and execute a specific command:
#!/usr/bin/env scala

import java.nio.file._
import scala.collection.JavaConversions._
import scala.sys.process._

val file = Paths.get(args(0))
val cmd = args(1)
val watcher = FileSystems.getDefault.newWatchService

file.register(
  watcher,
  StandardWatchEventKinds.ENTRY_CREATE,
  StandardWatchEventKinds.ENTRY_MODIFY,
  StandardWatchEventKinds.ENTRY_DELETE
)

def exec = cmd run true

@scala.annotation.tailrec
def watch(proc: Process): Unit = {
  val key = watcher.take
  val events = key.pollEvents

  val newProc =
    if (!events.isEmpty) {
      proc.destroy()
      exec
    } else proc

  if (key.reset) watch(newProc)
  else println("aborted")
}

watch(exec)
Usage:
watchr.scala markdownFolder/ "echo \"Something changed!\""
Extensions have to be made to the script to inject file names into the command. As of now this snippet should just be regarded as a building block for the actual answer.
Modifying the script to incorporate the *.mkd wildcards would be non-trivial as you'd have to manually search for the files and register a watch on all of them. Re-using the script above and placing all files in a directory has the added advantage of picking up new files when they are created.
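For instance, pulling the changed file's name out of each event and filtering on the .mkd extension could look roughly like the untested sketch below (the helper name watchMkd is made up):
import java.nio.file._
import scala.collection.JavaConverters._
import scala.sys.process._

def watchMkd(dir: Path): Unit = {
  val watcher = FileSystems.getDefault.newWatchService
  dir.register(watcher,
    StandardWatchEventKinds.ENTRY_CREATE,
    StandardWatchEventKinds.ENTRY_MODIFY)
  while (true) {
    val key = watcher.take
    key.pollEvents.asScala
      .map(_.context)
      .collect { case p: Path => p }       // OVERFLOW events carry no path
      .filter(_.toString.endsWith(".mkd"))
      .foreach { p =>
        val pdf = p.toString.stripSuffix(".mkd") + ".pdf"
        Seq("pandoc", "--number-sections", dir.resolve(p).toString, "-o", dir.resolve(pdf).toString).!
      }
    if (!key.reset) return
  }
}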
As you can see, it gets pretty big and messy pretty quickly just relying on the Scala and Java APIs; you would be better off relying on alternative libraries, or just sticking to bash together with inotify.