I am trying to run gsutil in Scala, but it doesn't work unless I explicitly put .cmd in the code. I don't like this approach, since others I work with use Unix systems. How do I let Scala understand that gsutil == gsutil.cmd? I could just write a custom shell script and add that to path, but I'd like a solution that doesn't include scripting.
I have already tried with various environment variables (using IntelliJ, don't know if it's relevant). I have tried adding both /bin and /platform/gsutil to path, neither works (without .cmd at least). I have also tried giving full path to see if it made a difference, it didn't.
Here is the method that uses gsutil:
def readFilesInBucket(ss: SparkSession, bucket: String): DataFrame = {
  import ss.implicits._
  ss.sparkContext.parallelize((s"gsutil ls -l $bucket" !!).split("\n")
    .map(r => r.trim.split(" ")).filter(r => r.length == 3)
    .map(r => (r(0), r(1), r(2)))).toDF(Array("Size", "Date", "File"): _*)
}
This is my first ever question on SO, I apologize for any formatting errors there may be.
EDIT:
Found out that even when I write a script like this:
exec gsutil.cmd "$@"
called just gsutil in the same folder, it spits out the same error message as before: java.io.IOException: Cannot run program "gsutil": CreateProcess error=2, The system cannot find the file specified.
It works if I write gsutil in git bash, which otherwise didn't work without the script.
Maybe just use a different command name depending on whether you're on a Windows or *nix system?
Create some helper:
object SystemDetector {
  lazy val isWindows = System.getProperty("os.name").startsWith("Windows")
}
And then just use it like:
def readFilesInBucket(ss: SparkSession, bucket: String): DataFrame = {
  import ss.implicits._

  val gsutil = if (SystemDetector.isWindows) "gsutil.cmd" else "gsutil"

  ss.sparkContext.parallelize((s"$gsutil ls -l $bucket" !!).split("\n")
    .map(r => r.trim.split(" ")).filter(r => r.length == 3)
    .map(r => (r(0), r(1), r(2)))).toDF(Array("Size", "Date", "File"): _*)
}
In the book, Programming in Scala 5th Edition, the author says the following for two classes:
Neither ChecksumAccumulator.scala nor Summer.scala are scripts, because they end in a definition. A script, by contrast, must end in a result expression.
The ChecksumAccumulator.scala is as follows:
import scala.collection.mutable

class CheckSumAccumulator:
  private var sum = 0
  def add(b: Byte): Unit = sum += b
  def checksum(): Int = ~(sum & 0xFF) + 1

object CheckSumAccumulator:
  private val cache = mutable.Map.empty[String, Int]

  def calculate(s: String): Int =
    if cache.contains(s) then
      cache(s)
    else
      val acc = new CheckSumAccumulator
      for c <- s do
        acc.add((c >> 8).toByte)
        acc.add(c.toByte)
      val cs = acc.checksum()
      cache += (s -> cs)
      cs
whereas the Summer.scala is as follows:
import CheckSumAccumulator.calculate

object Summer:
  def main(args: Array[String]): Unit =
    for arg <- args do
      println(arg + ": " + calculate(arg))
But when I run the Summer.scala file, I get a different error than the one mentioned by the author:
➜ learning-scala git:(main) ./scala3-3.0.0-RC3/bin/scala Summer.scala
-- [E006] Not Found Error: /Users/avirals/dev/learning-scala/Summer.scala:1:7 --
1 |import CheckSumAccumulator.calculate
| ^^^^^^^^^^^^^^^^^^^
| Not found: CheckSumAccumulator
longer explanation available when compiling with `-explain`
1 error found
Error: Errors encountered during compilation
➜ learning-scala git:(main)
The author mentioned that the error would be around not having a result expression.
I also tried to compile CheckSumAccumulator only and then run Summer.scala as a script without compiling it:
➜ learning-scala git:(main) ./scala3-3.0.0-RC3/bin/scalac CheckSumAccumulator.scala
➜ learning-scala git:(main) ✗ ./scala3-3.0.0-RC3/bin/scala Summer.scala
<No output, given no input>
➜ learning-scala git:(main) ✗ ./scala3-3.0.0-RC3/bin/scala Summer.scala Summer of love
Summer: -121
of: -213
love: -182
It works.
Obviously, when I compile both and then run Summer.scala, it works as expected. However, the distinction between Summer.scala as a script vs. a normal file is unclear to me.
Let's start top-down...
The most common way to compile Scala is to use a build tool like SBT/Maven/Mill/Gradle/etc. The build tool helps with a few things: downloading dependencies/libraries, downloading the Scala compiler (optional), setting up the CLASS_PATH and, most importantly, running the scalac compiler and passing all flags to it. Additionally, it can package compiled class files into JARs and other formats and do much more. The most relevant parts here are the class path and the compilation flags.
If you strip off the build tool, you can compile your project by manually invoking scalac with all required arguments and making sure your working directory matches the package structure, i.e. you are in the right directory. This can be tedious because you need to download all libraries manually and make sure they are on the class path.
So far, the build tool and manual compiler invocation are very similar to what you can also do in Java.
If you want an ad-hoc way of running some Scala code there are 2 options: scala lets you run scripts or a REPL by simply compiling your uncompiled code right before executing it.
However, there are some caveats. Essentially the REPL and shell scripts are the same thing - Scala wraps your code in some anonymous object and then runs it. This way you can write any expression without having to follow the convention of defining a main function or extending the App trait (which provides main). It will compile the script you are trying to run, but it will have no idea about imported classes. You can either compile them beforehand or make one large script that contains all the code. Of course, if it starts getting too large, it's time to make a proper project.
So in a sense there is no such thing as a script vs. a normal file, because both contain Scala code. The file you are running with scala is a script if it's uncompiled code (XXX.scala) and a "normal" compiled class (XXX.class) otherwise. If you ignore the object wrapping I mentioned above, the rest is the same - just different steps to compile and run them.
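To illustrate the wrapping, here is a rough sketch (not the actual code the runner generates, and the object name is invented) of what happens to a script before it is executed:

// conceptually, the uncompiled script body is pasted into a synthetic object
// with a main method, so the script itself can end in a plain result expression
// instead of a definition
object ScriptWrapper {
  def main(args: Array[String]): Unit = {
    // --- original script body goes here ---
    println("hello from a script")
  }
}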
Here is the traditional Scala 2.x runner code snippet with all possible options:
def runTarget(): Option[Throwable] = howToRun match {
  case AsObject =>
    ObjectRunner.runAndCatch(settings.classpathURLs, thingToRun, command.arguments)
  case AsScript if isE =>
    ScriptRunner(settings).runScriptText(combinedCode, thingToRun +: command.arguments)
  case AsScript =>
    ScriptRunner(settings).runScript(thingToRun, command.arguments)
  case AsJar =>
    JarRunner.runJar(settings, thingToRun, command.arguments)
  case Error =>
    None
  case _ =>
    // We start the repl when no arguments are given.
    if (settings.Wconf.isDefault && settings.lint.isDefault) {
      // If user is agnostic about -Wconf and -Xlint, enable -deprecation and -feature
      settings.deprecation.value = true
      settings.feature.value = true
    }
    val config = ShellConfig(settings)
    new ILoop(config).run(settings)
    None
}
This is what's getting invoked when you run scala.
In Dotty/Scala 3 the idea is similar, but it is split into multiple classes and the classpath logic might be different: see the REPL and the script runner (the script runner invokes the REPL).
I have installed Tachyon and Spark according to instructions:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
However, as a newbie I have no idea how to put file "X" into Tachyon File System as they said:
$ ./spark-shell
$ val s = sc.textFile("tachyon-ft://stanbyHost:19998/X")
$ s.count()
$ s.saveAsTextFile("tachyon-ft://activeHost:19998/Y")
What I did was to point to an existing file (that I find through the management UI):
scala> val s = sc.textFile("tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH")
s: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
When I run count, I got this below error:
scala> s.count()
java.lang.NullPointerException: connectionString cannot be null
I assume my path was wrong. So two questions:
How to copy a file into Tachyon?
What is the proper path for its FS?
Sorry, very very newbie !!
UPDATE 1
I am not sure if tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH is the correct path. I cannot get it either via the browser or wget.
This is what I saw in the file system browser
I found out the issue. I didn't do this:
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
After I went through this exercise http://ampcamp.berkeley.edu/5/exercises/tachyon.html#run-spark-on-tachyon, I found out the proper path is this:
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
So my setup was fine after all. The documentation here http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html was causing me a lot of confusion.
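Putting both pieces together, a minimal spark-shell sketch looks like the following; the host, port and file paths are illustrative and assume a local Tachyon master on the default port:

// tell Spark/Hadoop how to resolve the tachyon:// scheme, then read a file
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")

val file = sc.textFile("tachyon://localhost:19998/LICENSE")
println(file.count())                                            // action that actually reads the file
file.saveAsTextFile("tachyon://localhost:19998/LICENSE_copy")    // write the result back to Tachyon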
I have to go to the path of an application to deploy it. I tried using scala.sys.process and did "cd /home/path/somepath" !
It throws an exception. Can anyone guide me on how to change to that directory? I cannot deploy using an absolute path because of the dependencies the run file has.
Thanks in advance
Although this question is a couple of years old, it's a good question.
To use scala.sys.process to execute something from a specific working directory, pass the required directory as a parameter when constructing the Process (a ProcessBuilder), as in this working example:
import scala.sys.process._

val scriptPath = "/home/path/myShellScript.sh"
val command = Seq("/bin/bash", "-c", scriptPath)

// the second argument is the working directory the process is launched in;
// replace "." with e.g. new java.io.File("/home/path/somepath")
val proc = Process(command, new java.io.File("."))

var output = Vector.empty[String]
val exitValue = proc ! ProcessLogger(
  (out) => if (out.trim.length > 0)
    output :+= out.trim,               // collect stdout lines in order
  (err) =>
    System.err.printf("e:%s\n", err)   // can be quite noisy!
)
printf("exit value: %d\n", exitValue)
printf("output[%s]\n", output.mkString("\n"))
If the goal instead is to ensure that the environment of the caller defaults to a specific working directory, that can be accomplished by setting the required working directory before launching the JVM.
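For the original deploy use case, a minimal sketch of running a command with the application's own directory as the working directory; the directory and the script name are assumptions, not taken from the question:

import scala.sys.process._
import java.io.File

// run the deploy script from inside its own directory so its relative paths resolve
val workDir = new File("/home/path/somepath")        // hypothetical application path
val exitCode = Process(Seq("./run.sh"), workDir).!   // "./run.sh" is a placeholder script name

println(s"deploy exited with code $exitCode")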
I want to rename a file on the system from Scala code, the equivalent of what can be done in bash with:
mv old_file_name new_file_name
I am not asking about renaming a scala source code file, but a file residing in the system.
Consider
import java.io.File
import util.Try
def mv(oldName: String, newName: String) =
  Try(new File(oldName).renameTo(new File(newName))).getOrElse(false)
and use it with
mv("oldname", "newname")
Note that mv returns true on successful renaming and false otherwise. Note also that Try will catch any exception renameTo may throw (for example a SecurityException).
See renameTo of java.io.File. In your case this would be
new File("old_file_name").renameTo(new File("new_file_name"))
Use Guava:
import com.google.common.io.Files

Files.move(new File("<path from>"), new File("<path to>"))
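For completeness, a plain JDK alternative is java.nio.file.Files.move, which throws an exception on failure instead of returning false. This is only a sketch; the paths are placeholders:

import java.nio.file.{Files, Paths, StandardCopyOption}

Files.move(
  Paths.get("old_file_name"),
  Paths.get("new_file_name"),
  StandardCopyOption.REPLACE_EXISTING // overwrite the target if it already exists
)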
I would like to change the following batch script to Scala (just for fun), however, the script must keep running and listen for changes to the *.mkd files. If any file is changed, then the script should re-generate the affected doc. File IO has always been my Achilles heel...
#!/bin/sh

for file in *.mkd
do
  pandoc --number-sections $file -o "${file%%.*}.pdf"
done
Any ideas around a good approach to this will be appreciated.
The following code, taken from my answer on Watch for project files, can also watch a directory and execute a specific command:
#!/usr/bin/env scala

import java.nio.file._
import scala.collection.JavaConversions._
import scala.sys.process._

val file = Paths.get(args(0))
val cmd = args(1)
val watcher = FileSystems.getDefault.newWatchService

file.register(
  watcher,
  StandardWatchEventKinds.ENTRY_CREATE,
  StandardWatchEventKinds.ENTRY_MODIFY,
  StandardWatchEventKinds.ENTRY_DELETE
)

def exec = cmd run true

@scala.annotation.tailrec
def watch(proc: Process): Unit = {
  val key = watcher.take
  val events = key.pollEvents

  val newProc =
    if (!events.isEmpty) {
      proc.destroy()
      exec
    } else proc

  if (key.reset) watch(newProc)
  else println("aborted")
}

watch(exec)
Usage:
watchr.scala markdownFolder/ "echo \"Something changed!\""
Extensions have to be made to the script to inject file names into the command; as it stands, this snippet should just be regarded as a building block for the actual answer.
Modifying the script to incorporate the *.mkd wildcard would be non-trivial, as you would have to manually search for the files and register a watch on each of them. Re-using the script above and placing all files in one directory has the added advantage of picking up new files when they are created.
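As a starting point for that extension, here is a hedged sketch (the helper name is mine, and pandoc is assumed to be on the PATH) that rebuilds only the changed *.mkd file for a given watch event:

import java.nio.file.{Path, WatchEvent}
import scala.sys.process._

// given the watched directory and a watch event, run pandoc only for *.mkd files
// and write the PDF next to the source file
def handleEvent(dir: Path, event: WatchEvent[_]): Unit = {
  val changed = event.context.asInstanceOf[Path]
  val name = changed.toString
  if (name.endsWith(".mkd")) {
    val pdf = name.stripSuffix(".mkd") + ".pdf"
    Process(Seq("pandoc", "--number-sections", name, "-o", pdf), dir.toFile).!
  }
}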
As you can see, it gets pretty big and messy pretty quickly just relying on the Scala and Java APIs; you would be better off relying on alternative libraries, or just sticking to bash and using inotify.