Going to a specific path in Scala using scala.sys.process

I have to change to the directory of an application in order to deploy it. I tried using scala.sys.process and ran "cd /home/path/somepath" !, but it throws an exception. Can anyone guide me on how to change to that directory? I cannot deploy using an absolute path because of the dependencies the run file has.
Thanks in advance.

Although this question is a couple of years old, it's a good question.
To use scala.sys.process to execute something from a specific working directory, pass the required directory as a parameter to ProcessBuilder, as in this working example:
import scala.sys.process._

val scriptPath = "/home/path/myShellScript.sh"
val command = Seq("/bin/bash", "-c", scriptPath)
val proc = Process(command, new java.io.File(".")) // the second argument is the working directory

var output = Vector.empty[String]
val exitValue = proc ! ProcessLogger(
  (out) => if (out.trim.nonEmpty) output :+= out.trim,
  (err) => System.err.printf("e:%s\n", err) // can be quite noisy!
)
printf("exit value: %d\n", exitValue)
printf("output[%s]\n", output.mkString("\n"))
If the goal instead is to ensure that the caller's environment defaults to a specific working directory, that can be accomplished by setting the required working directory before launching the JVM.
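As a quick check of what that default is, the JVM's working directory can be printed at startup; sys.process commands run there unless an explicit directory is passed:
// prints the directory the JVM was launched from
println(System.getProperty("user.dir"))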

Related

How to run gsutil from Scala without the .cmd suffix?

I am trying to run gsutil in Scala, but it doesn't work unless I explicitly put .cmd in the code. I don't like this approach, since others I work with use Unix systems. How do I let Scala understand that gsutil == gsutil.cmd? I could just write a custom shell script and add that to the PATH, but I'd like a solution that doesn't involve scripting.
I have already tried various environment variables (using IntelliJ, don't know if it's relevant). I have tried adding both /bin and /platform/gsutil to the PATH; neither works (without .cmd at least). I have also tried giving the full path to see if it made a difference; it didn't.
Here is the method that uses gsutil:
def readFilesInBucket(ss: SparkSession, bucket: String): DataFrame = {
  import ss.implicits._
  ss.sparkContext.parallelize((s"gsutil ls -l $bucket" !!).split("\n")
    .map(r => r.trim.split(" ")).filter(r => r.length == 3)
    .map(r => (r(0), r(1), r(2)))).toDF(Array("Size", "Date", "File"): _*)
}
This is my first ever question on SO; I apologize for any formatting errors there may be.
EDIT:
I found out that even when I write a script like this:
exec gsutil.cmd "$@"
named just gsutil and placed in the same folder, it spits out the same error message as before: java.io.IOException: Cannot run program "gsutil": CreateProcess error=2, The system cannot find the file specified.
It works if I write gsutil in git bash, which otherwise didn't work without the script.
Maybe just use a different version depending on whether you're on a Windows or *nix system?
Create some helper:
object SystemDetector {
  lazy val isWindows = System.getProperty("os.name").startsWith("Windows")
}
And then just use it like:
def readFilesInBucket(ss: SparkSession, bucket: String): DataFrame = {
  import ss.implicits._
  val gsutil = if (SystemDetector.isWindows) "gsutil.cmd" else "gsutil"
  ss.sparkContext.parallelize((s"$gsutil ls -l $bucket" !!).split("\n")
    .map(r => r.trim.split(" ")).filter(r => r.length == 3)
    .map(r => (r(0), r(1), r(2)))).toDF(Array("Size", "Date", "File"): _*)
}
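A different workaround, not part of the answer above but sketched here untested for completeness, is to go through the platform shell, which resolves gsutil to gsutil.cmd the same way a terminal does:
import scala.sys.process._

// assumes `bucket` and SystemDetector are in scope as above
val listCommand = s"gsutil ls -l $bucket"
val shellCommand =
  if (SystemDetector.isWindows) Seq("cmd", "/c", listCommand)
  else Seq("/bin/sh", "-c", listCommand)
val listing = shellCommand.!!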

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is written to the current worker node directory, and I know this because by executing "ls -a" !! through Scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the SQL context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so Spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran the command val locationPt = "pwd"!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column 'X' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS and then retrieving it using the SQL context through HDFS. I overwrite "out.las" in HDFS each time so that I don't take up too much space.
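For reference, a rough sketch of the copy-then-read steps, shown linearly here for simplicity (in the real job the decompression happens on a worker, so the copy lives inside whatever task produces out.las); the HDFS destination path is made up, and read.las comes from the LAS reader package already in use:
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.sys.process._

// decompress into the local working directory, as before
Seq(laszippath, "-i", inputFileName, "-o", "out.las").!!

// copy the local file into HDFS, overwriting the previous copy
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.copyFromLocalFile(true, true, new Path("out.las"), new Path("/user/me/out.las"))

// read it back through the SQL context from HDFS
val dataFrame = sqlContext.read.las("hdfs:///user/me/out.las")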
I have used the Hadoop API before to get at files; I dunno if it will help you here.
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

val filePath = "/user/me/dataForHDFS/"
val fs: FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
I've not tested the code below, but it should give an idea of what to do afterwards.
val status = fs.getFileStatus(new Path(filePath + "out.las"))
val readIn = new Array[Byte](status.getLen.toInt) // buffer sized to the whole file
val fileIn: FSDataInputStream = fs.open(new Path(filePath + "out.las"))
fileIn.readFully(0, readIn)
fileIn.close()

pytest implementing a logfile per test method

I would like to create a separate log file for each test method, and I would like to do this in the conftest.py file and pass the logfile instance to the test method. This way, whenever I log something in a test method it would go to a separate log file and be very easy to analyse.
I tried the following.
Inside conftest.py file i added this:
logs_dir = pkg_resources.resource_filename("test_results", "logs")

def pytest_runtest_setup(item):
    test_method_name = item.name
    testpath = item.parent.name.strip('.py')
    path = '%s/%s' % (logs_dir, testpath)
    if not os.path.exists(path):
        os.makedirs(path)
    log = logger.make_logger(test_method_name, path)  # make_logger takes care of creating the logfile and returns the Python logging object
The problem here is that pytest_runtest_setup does not have the ability to return anything to the test method. At least, I am not aware of a way to do it.
So I thought of creating a fixture method inside the conftest.py file with scope="function" and calling this fixture from the test methods. But the fixture method does not know about the pytest.Item object. In the case of the pytest_runtest_setup method, it receives the item parameter, and using that we are able to find out the test method name and test method path.
Please help!
I found this solution by researching further, building on webh's answer. I tried to use pytest-logger, but its file structure is very rigid and it was not really useful for me. I found this code working without any plugin. It is based on set_log_path, which is an experimental feature.
Pytest 6.1.1 and Python 3.8.4
# conftest.py
# Required modules
import pytest
from pathlib import Path

# Configure logging
@pytest.hookimpl(hookwrapper=True, tryfirst=True)
def pytest_runtest_setup(item):
    config = item.config
    logging_plugin = config.pluginmanager.get_plugin("logging-plugin")
    filename = Path('pytest-logs', item._request.node.name + ".log")
    logging_plugin.set_log_path(str(filename))
    yield
Notice that the use of Path can be substituted by os.path.join. Moreover, different tests can be set up in different folders and keep a record of all tests done historically by using a timestamp on the filename. One could use the following filename for example:
# conftest.py
# Required modules
import pytest
import datetime
from pathlib import Path

# Configure logging
@pytest.hookimpl(hookwrapper=True, tryfirst=True)
def pytest_runtest_setup(item):
    ...
    filename = Path(
        'pytest-logs',
        item._request.node.name,
        f"{datetime.datetime.now().strftime('%Y%m%dT%H%M%S')}.log"
    )
    ...
Additionally, if one would like to modify the log format, one can change it in the pytest configuration file as described in the documentation.
# pytest.ini
[pytest]
log_file_level = INFO
log_file_format = %(name)s [%(levelname)s]: %(message)s
My first stackoverflow answer!
I found the answer I was looking for.
I was able to achieve it using a function-scoped fixture like this:
@pytest.fixture(scope="function")
def log(request):
    test_path = request.node.parent.name.strip(".py")
    test_name = request.node.name
    node_id = request.node.nodeid
    log_file_path = '%s/%s' % (logs_dir, test_path)
    if not os.path.exists(log_file_path):
        os.makedirs(log_file_path)
    logger_obj = logger.make_logger(test_name, log_file_path, node_id)
    yield logger_obj
    handlers = logger_obj.handlers
    for handler in handlers:
        handler.close()
        logger_obj.removeHandler(handler)
In newer pytest versions this can be achieved with set_log_path.
import os
import pytest

@pytest.fixture(autouse=True)
def manage_logs(request):
    """Set log file name same as test name"""
    request.config.pluginmanager.get_plugin("logging-plugin")\
        .set_log_path(os.path.join('log', request.node.name + '.log'))

Test Spark with Tachyon

I have installed Tachyon and Spark according to instructions:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
However, as a newbie I have no idea how to put file "X" into the Tachyon File System as they describe:
$ ./spark-shell
$ val s = sc.textFile("tachyon-ft://stanbyHost:19998/X")
$ s.count()
$ s.saveAsTextFile("tachyon-ft://activeHost:19998/Y")
What I did was point to an existing file (that I found through the management UI):
scala> val s = sc.textFile("tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH")
s: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
When I run count, I get the error below:
scala> s.count()
java.lang.NullPointerException: connectionString cannot be null
I assume my path was wrong. So two questions:
How to copy a file into Tachyon?
What is the proper path for its FS?
Sorry, very very newbie !!
UPDATE 1
I am not sure if tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH is the correct path. I cannot get it either via the browser or via wget.
This is what I saw in the file system browser
I found out the issue. I didn't do this
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
After I went through this exercise http://ampcamp.berkeley.edu/5/exercises/tachyon.html#run-spark-on-tachyon, I found out the proper path is this:
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
So my setup was fine after all. The documentation here http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html was causing me a lot of confusion.
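Putting the two pieces together, the working sequence in spark-shell looks like this (paths as in the AMP Camp exercise, with the count from the original snippet):
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
file.count()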

How do you write a Scala script that will react to file changes

I would like to change the following shell script to Scala (just for fun); however, the script must keep running and listen for changes to the *.mkd files. If any file is changed, then the script should re-generate the affected doc. File IO has always been my Achilles heel...
#!/bin/sh
for file in *.mkd
do
pandoc --number-sections $file -o "${file%%.*}.pdf"
done
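(For reference, the naive one-shot translation I have in mind looks roughly like this, untested:)
import java.io.File
import scala.sys.process._

for (f <- new File(".").listFiles if f.getName.endsWith(".mkd")) {
  val pdf = f.getName.stripSuffix(".mkd") + ".pdf"
  Seq("pandoc", "--number-sections", f.getName, "-o", pdf).!
}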
Any ideas around a good approach to this will be appreciated.
The following code, taken from my answer on Watch for project files, can also watch a directory and execute a specific command:
#!/usr/bin/env scala

import java.nio.file._
import scala.collection.JavaConversions._
import scala.sys.process._

val file = Paths.get(args(0))
val cmd = args(1)
val watcher = FileSystems.getDefault.newWatchService

file.register(
  watcher,
  StandardWatchEventKinds.ENTRY_CREATE,
  StandardWatchEventKinds.ENTRY_MODIFY,
  StandardWatchEventKinds.ENTRY_DELETE
)

def exec = cmd run true

@scala.annotation.tailrec
def watch(proc: Process): Unit = {
  val key = watcher.take
  val events = key.pollEvents

  val newProc =
    if (!events.isEmpty) {
      proc.destroy()
      exec
    } else proc

  if (key.reset) watch(newProc)
  else println("aborted")
}

watch(exec)
Usage:
watchr.scala markdownFolder/ "echo \"Something changed!\""
Extensions have to be made to the script to inject file names into the command. As of now this snippet should just be regarded as a building block for the actual answer.
Modifying the script to incorporate the *.mkd wildcards would be non-trivial as you'd have to manually search for the files and register a watch on all of them. Re-using the script above and placing all files in a directory has the added advantage of picking up new files when they are created.
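For instance, pulling the changed file's name out of each event and filtering on the .mkd extension could look roughly like the untested sketch below (the helper name watchMkd is made up):
import java.nio.file._
import scala.collection.JavaConverters._
import scala.sys.process._

def watchMkd(dir: Path): Unit = {
  val watcher = FileSystems.getDefault.newWatchService
  dir.register(watcher,
    StandardWatchEventKinds.ENTRY_CREATE,
    StandardWatchEventKinds.ENTRY_MODIFY)
  while (true) {
    val key = watcher.take
    key.pollEvents.asScala
      .map(_.context)
      .collect { case p: Path => p }       // OVERFLOW events carry no path
      .filter(_.toString.endsWith(".mkd"))
      .foreach { p =>
        val pdf = p.toString.stripSuffix(".mkd") + ".pdf"
        Seq("pandoc", "--number-sections", dir.resolve(p).toString, "-o", dir.resolve(pdf).toString).!
      }
    if (!key.reset) return
  }
}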
As you can see, it gets pretty big and messy pretty quickly just relying on the Scala and Java APIs; you would be better off relying on alternative libraries, or just sticking to bash together with inotify.