I need to pass a very big input file to Scastie. That is, how can Scastie, which is an online code editor, read a file that lives on my local machine? For example:
val lines = sc.textFile("....mdb/u.data")
Someone asked this on the team's Gitter channel.
The Scastie team member first asked how big the file is, then recommended putting it in a Gist on GitHub and using the raw URL to read it in.
This works only for small files. The limits of files on Gist are explained in their Developer Guide.
If you need the full contents of the file, you can make a GET request to the URL specified by raw_url. Be aware that for files larger than ten megabytes, you'll need to clone the gist via the URL provided by git_pull_url.
So 10 MB is your limit. Also note that you can't use a SparkContext (denoted by sc in your question) without identifying the library to the online environment.
To do that, you'll have to add the SBT dependency.
Navigate to Build Settings on the left part of the interface.
Set the Scala Version to a version compatible with the Spark we'll choose, in our case 2.11.12.
Under Extra Sbt Configuration place the following dependencies:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.3",
"org.apache.spark" %% "spark-sql" % "2.4.3"
)
You won't be able to read URL content directly using sc.textFile; that is only for reading local/HDFS text files. You'll have to get the content first, wrangle it into shape and get a DataFrame out of it.
The answer shown here describes how to access a web url using Source from the Scala Standard Library.
At the request of the OP, here's an implementation on scastie.
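As a rough sketch of the approach described above (fetch the raw Gist content with Source, then hand the lines to Spark), something like the following could work. The helper name readLines is hypothetical, and the Gist URL is left as a placeholder that you would replace with your own raw_url.

```scala
import scala.io.Source

// Hypothetical helper: downloads the content behind a URL and returns its lines.
// Everything is held in memory, which is acceptable for Gist-sized (< 10 MB) files.
def readLines(url: String): List[String] = {
  val source = Source.fromURL(url, "UTF-8")
  try source.getLines().toList
  finally source.close()
}

// Usage, assuming a SparkContext named sc is available as in the question:
//   val lines = sc.parallelize(readLines("<raw_url of your gist>"))
```

Since the whole file is materialized on the driver before being parallelized, this only suits small inputs, which matches the Gist size limit anyway.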
What is the difference between the following libraries:
libraryDependencies += "com.typesafe.play" %% "play-ahc-ws-standalone" % "LATEST_VERSION"
and
libraryDependencies += "com.typesafe.play" %% "play-ahc-ws" % "LATEST_VERSION"
I am just trying to figure out which is the correct one to use. I created a Play module in a separate library and I want to inject it into a Play application. When I used the first library listed above, it only offered a StandaloneWSClient, and when I injected that into a Play application, Play couldn't bind an implementation to it. When I switched to the second library, it offered a WSClient, which the Play application could bind because it already has an implementation that you can enable in the build.sbt definition, i.e. ws.
Within a Play project you should use play-ahc-ws, which is usually added like so
libraryDependencies += ws
The ws value comes from Play's sbt plugin
addSbtPlugin("com.typesafe.play" % "sbt-plugin" % "2.8.1")
On the other hand, play-ahc-ws-standalone is an HTTP client in its own right that can be used outside Play projects, just as one could use, for example, the scalaj-http or requests-scala HTTP clients, which are in no way aware of Play.
The difference is documented in the Play 2.6 Migration Guide.
Spent a few hours trying to figure out how to do this. Over the course of it I have looked at a few seemingly promising questions but none of them seem to quite fit what I'm doing.
I've got three library jars, let's call them M, S, and H. Library M has things like:
case class MyModel(x: Int, s: String)
and then library S uses the play-json library, version 2.3.8, to provide implicit serializers for the classes defined by M
trait MyModelSerializer {
implicit val myModelFormat = Json.format[MyModel]
}
Which are then bundled up together into a convenience object for importing
package object implicits extends MyModelSerializer with FooSerializer // etc
That way, in Library H, when it performs HTTP calls to various services it just imports implicits from S and then I call Json.validate[MyModel] to get back the models I need from my web services. This is all well and dandy, but I'm working on an application that's running play 2.4 and when I included H into the project and tried to use it I ran up against:
java.lang.NoSuchMethodError: play.api.data.validation.ValidationError.<init>(Ljava/lang/String;Lscala/collection/Seq;)
Which I believe is being caused by play 2.4 using play-json version 2.4.6. Unfortunately, these are a minor version apart and this means that trying to just use the old library like:
// In build.sbt
"com.typesafe.play" %% "play-json" % "2.3.8" force()
results in all the code in the app failing to compile because I'm using things like JsError.toJson, which wasn't part of play-json 2.3.8. I could change the 14 or so places that use that method, but given the exception above I have a feeling that even if I do that it's not going to help.
Around this point I remembered that back in my Maven days I could shade dependencies during my build process. So I got to thinking that if I could shade the play-json 2.3.8 dependency in H, that would solve the problem, since the problem seems to be that calling Json.* in H is using the Json object from play-json 2.4.6.
Unfortunately, the only thing I can find online that indicates the ability to shade is sbt-assembly. I found a great answer on how to do that for a fat jar. But I don't think I can use sbt-assembly because H isn't executable, it's just a library jar. I read through a question like my own but the answer refers to sbt-assembly so it doesn't help me.
Another question seems somewhat promising but I really can't follow how I would use it / where I would be placing the code itself. I also looked through the sbt manual, but nothing stuck out to me as being what I need.
I can't just change S to use play-json 2.4.6 because we're using H in a play 2.3 application as well. So it needs to be able to be used in both.
Right now the only thing I can really think to do, if I can't get some kind of shading done, is to make H not use S and instead require some kind of serializer/deserializer implicitly, and then wire in the appropriate JSON (de)serializer. So here I am asking how to properly shade with sbt for something that isn't an executable jar, because I only want to do a rewrite if I absolutely have to. If I missed something (like sbt-assembly being able to shade for non-executable jars as well), I'll take that as an answer if you can point me to the docs I must have missed.
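To make the fallback idea concrete, here is a minimal sketch of H requiring a deserializer implicitly instead of depending on play-json. The Deserializer trait, H.parse, and the ToyInstances object are all hypothetical names invented for illustration, and the toy instance parses "x,s" text rather than real JSON so that the example has no library dependency.

```scala
import scala.util.Try

// Model as defined in library M.
case class MyModel(x: Int, s: String)

// Hypothetical type class: H depends on this instead of on play-json.
trait Deserializer[A] {
  def deserialize(body: String): Either[String, A]
}

object H {
  // H only requires that some Deserializer[A] is in implicit scope;
  // the calling application decides which JSON library backs it.
  def parse[A](body: String)(implicit d: Deserializer[A]): Either[String, A] =
    d.deserialize(body)
}

object ToyInstances {
  // Toy instance for illustration only: parses "x,s" rather than JSON.
  implicit val myModelDeserializer: Deserializer[MyModel] =
    new Deserializer[MyModel] {
      def deserialize(body: String): Either[String, MyModel] =
        body.split(",", 2) match {
          case Array(x, s) =>
            Try(x.trim.toInt).toOption
              .map(MyModel(_, s))
              .toRight(s"not an integer: $x")
          case _ => Left(s"cannot parse: $body")
        }
    }
}
```

A Play 2.3 application and a Play 2.4 application would each supply their own instance backed by their own play-json version, so H itself never touches the Json object.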
As indicated by Yuval Itzchakov, sbt-assembly doesn't have to be building an executable jar and can shade library code as well. In addition, packing without transitive dependencies except the ones that need to be shaded can be done too and this will keep the packaged jar's size down and let the rest of the dependencies come through as usual.
Hunting down the transitive dependencies manually is what I ended up having to do, but if anyone has a way to do that automatically, that'd be a great addition to this answer. Anyway, this is what I needed to do to the H library's build file to get it properly shading the play-json library.
Figure out what the dependencies are using show compile:dependencyClasspath at the sbt console
Grab anything play related (since I'm only using play-json and no others I can assume play = needs shading)
Also shade the S models because they rely on play-json as well; to avoid transitive dependencies bringing a non-shaded play 2.3.8 back in, I have to shade my serializers.
Add sbt-assembly to project and then update build.sbt file
build.sbt
//Shade rules for all things play:
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("play.api.**" -> "shade.play.api.@1").inAll
)
//Grabbed from the "publishing" section of the sbt-assembly readme, excluding the "assembly" classifier
addArtifact(artifact in (Compile, assembly), assembly)
// Only the play stuff and the "S" serializers need to be shaded since they use/introduce play:
assemblyExcludedJars in assembly := {
val cp = (fullClasspath in assembly).value
val toIncludeInPackage = Seq(
"play-json_2.11-2.3.8.jar",
"play-datacommons_2.11-2.3.8.jar",
"play-iteratees_2.11-2.3.8.jar",
"play-functional_2.11-2.3.8.jar",
"S_2.11-0.0.0.jar"
)
cp filter {c => !toIncludeInPackage.contains(c.data.getName)}
}
And then I don't get any exceptions anymore from trying to run it. I hope this helps other people with similar issues, and if anyone has a way to automatically grab dependencies and filter by them I'll happily update the answer with it.
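One possible way to avoid maintaining the explicit jar list is to select jars by name prefix. This is only a sketch: it does not compute transitive dependencies, the prefixes are assumptions taken from the jar names listed above, and it would replace the assemblyExcludedJars setting shown earlier rather than sit next to it.

```scala
// build.sbt (sketch): keep jars whose file name starts with one of the prefixes,
// and exclude everything else from the assembled jar.
assemblyExcludedJars in assembly := {
  val cp = (fullClasspath in assembly).value
  // Assumed prefixes: all play-* jars plus the S serializers.
  val prefixesToInclude = Seq("play-", "S_")
  cp filter { c => !prefixesToInclude.exists(p => c.data.getName.startsWith(p)) }
}
```

The trade-off is that any unrelated jar whose name happens to start with one of the prefixes would also be bundled and shaded, so the explicit list remains the safer choice when the dependency set is small.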
I am working on a legacy project where the previous developers use the specs2 from Play, added through
libraryDependencies += specs2 % Test
as well as the normal distribution
libraryDependencies ++= Seq("org.specs2" %% "specs2-core" % "3.8.5" % "test")
I am wondering which is the better option. I want to keep just one distribution since I don't see the need for both. What are the advantages of one over the other? I also want just one because the jars conflict when debugging from the IDE.
I'm not sure what version of Play! you are on, but looking at 2.5.8, specs2 from Play (play.sbt.PlayImport.specs2) refers to the play-specs2 library, which in turn depends on specs2 3.8.5 but also adds some Play!-specific testing logic. Using play.sbt.PlayImport.specs2 as a dependency will allow your module to use both specs2 and the play-specs2 library.
I'm new to Scala. I'm writing a Spark application in Scala and I need to use the axpy function from org.apache.spark.mllib.linalg.BLAS. However, it doesn't appear to be accessible to users. I tried importing com.github.fommil.netlib and accessing it directly instead, but I couldn't do that either. I need to multiply two DenseVectors.
Right now, the BLAS class within mllib is marked private[spark] in the spark source code. What this means is that it is not accessible external to spark itself as you seem to have figured out. In short, you can't use it in your code.
If you want to use netlib-java classes directly, you need to add the following dependency to your project
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
That should allow you to import the BLAS class. Note, I haven't really tried to use it, but I am able to execute BLAS.getInstance() without a problem. There might be some complexities in the installation on some Linux platforms as described here - https://github.com/fommil/netlib-java.
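For reference, here is a plain-Scala sketch of what axpy computes (y := a * x + y), which needs no extra dependency. The function below is my own illustration, not Spark's implementation. With netlib-java on the classpath, I believe the equivalent call would be BLAS.getInstance().daxpy(x.length, a, x, 1, y, 1), passing the values arrays of your DenseVectors, but treat that as an assumption to verify against the netlib-java docs.

```scala
// Plain-Scala sketch of axpy: y := a * x + y, updating y in place.
def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
  require(x.length == y.length, "vectors must have the same length")
  var i = 0
  while (i < x.length) {
    y(i) += a * x(i)
    i += 1
  }
}

val x = Array(1.0, 2.0, 3.0)
val y = Array(10.0, 20.0, 30.0)
axpy(2.0, x, y) // y is now Array(12.0, 24.0, 36.0)
```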
Add mllib to your project
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.3.0"
Set-up:
A project I am working on has a pub/sub server with an HTTP interface. Subscription works by accepting server-sent-events.
curl -X GET server:port/topics/news
which will be pushed whenever a message is published to the given topic URL
curl -X PUT server:port/topics/news -d "Politician Lies!"
Problem:
I have a Scala project which needs to subscribe to this pub/sub server. The Play! framework is able to handle this by using PlayWS with Enumeratee + Iteratee. Unfortunately, the PlayWS library requires that a Play! Application is in scope and I am not using Play. Is there a library (with minimal dependencies) I can use which allows me to accept server-sent-events? I'll need at least one working example in order to get started.
I have a preference for scala libraries but I'm willing to accept a Java solution if I have to.
I've accepted Manuel Bernhardt's answer because it led me in the right direction but I feel a full example is important for anyone else with this issue.
I updated my build.sbt file to include PlayWS 2.3 and the Iteratees library.
libraryDependencies ++= Seq(
"com.typesafe.play" %% "play-ws" % "2.3.0",
"com.typesafe.play" %% "play-iteratees" % "2.3.0"
)
The WS singleton requires an implicit Play Application to be used (something I don't have or want), so instead I need to create my own client:
val builder = new com.ning.http.client.AsyncHttpClientConfig.Builder()
val client = new play.api.libs.ws.ning.NingWSClient(builder.build())
I then create my Iteratee so that I can handle my server-sent-events.
def print = Iteratee.foreach { chunk: Array[Byte] =>
println(new String(chunk))
}
and finally subscribe to the server
client.url("http://localhost:8080/topics/news").get(_ => print)
Now, when an event is sent
curl -X PUT server:port/topics/news -d "Politician Lies!"
My Scala application will print the received event
data: Politician Lies!
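If you want the payload without the "data:" prefix, a small helper along these lines could be applied to each chunk inside the Iteratee. The name sseData is hypothetical, and this sketch assumes each chunk contains whole lines; a real client would need to buffer events that are split across chunks.

```scala
// Hypothetical helper: extracts the payloads of "data:" lines from a chunk of SSE text.
// A single space after the colon, if present, is not part of the payload.
def sseData(chunk: String): List[String] =
  chunk.split("\r?\n").toList.collect {
    case line if line.startsWith("data:") =>
      line.stripPrefix("data:").stripPrefix(" ")
  }

// Usage inside the Iteratee shown above:
//   sseData(new String(chunk, "UTF-8")).foreach(println)
```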
You have several possibilities:
In Play 2.3, the WS library is now a separate library, so that should help. RC2 is already available
Alternatively, you could depend on Play 2.x and use a StaticApplication like so:
val application = new StaticApplication(new java.io.File("."))
This will essentially bootstrap a Play application, and from there on you can use the WS library as usual
I'm not aware of other Scala libraries that implement a Server Sent Events client, but the Jersey project has a Java library for Server Sent Events clients (as well as servers). The API does not appear to be very verbose, and appears like it can be nicely wrapped in some code to fit in more idiomatically with Scala.