Reading file contents with casbah gridfs throws MalformedInputException - scala

Consider the following sample code: it writes a file to MongoDB and then tries to read it back.
import com.mongodb.casbah.Imports._
import com.mongodb.casbah.gridfs.Imports._

object TestGridFS {
  def main(args: Array[String]) {
    val mongoConn = MongoConnection()
    val mongoDB = mongoConn("gridfs_test")
    val gridfs = GridFS(mongoDB) // creates a GridFS handle on ``fs``
    val xls = new java.io.FileInputStream("ok.xls")
    val savedFile = gridfs.createFile(xls)
    savedFile.filename = "ok.xls"
    savedFile.save
    println("savedfile id: %s".format(savedFile._id.get))
    val file = gridfs.findOne(savedFile._id.get)
    val bytes = file.get.source.map(_.toByte).toArray
    println(bytes)
  }
}
This yields:
gridfs $ sbt run
[info] Loading global plugins from /Users/jean/.sbt/plugins
[info] Set current project to gridfs-test (in build file:/Users/jean/dev/sdev/src/perso/gridfs/)
[info] Running TestGridFS
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
savedfile id: 504c8cce0364a7cd145d5dc1
[error] (run-main) java.nio.charset.MalformedInputException: Input length = 1
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:260)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:319)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.read(BufferedReader.java:157)
at scala.io.BufferedSource$$anonfun$iter$1$$anonfun$apply$mcI$sp$1.apply$mcI$sp(BufferedSource.scala:38)
at scala.io.Codec.wrap(Codec.scala:64)
at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
at scala.collection.Iterator$$anon$14.next(Iterator.scala:148)
at scala.collection.Iterator$$anon$25.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:334)
at scala.io.Source.hasNext(Source.scala:238)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:334)
at scala.collection.Iterator$class.foreach(Iterator.scala:660)
at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:333)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:99)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250)
at scala.collection.Iterator$$anon$19.toBuffer(Iterator.scala:333)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237)
at scala.collection.Iterator$$anon$19.toArray(Iterator.scala:333)
at TestGridFS$.main(test.scala:15)
at TestGridFS.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
java.lang.RuntimeException: Nonzero exit code: 1
at scala.sys.package$.error(package.scala:27)
[error] {file:/Users/jean/dev/sdev/src/perso/gridfs/}default-b6ab90/compile:run: Nonzero exit code: 1
[error] Total time: 1 s, completed 9 sept. 2012 14:34:22
I don't understand what the charset problem can be; I just wrote the file to the database. When querying the database I DO see the files and chunks in there, but I can't seem to read them back.
I tried this with Mongo 2.0 and 2.2, and Casbah 2.4 and 3.0.0-M2, to no avail, and don't see what I could do to get the bytes. This is on Mac OS X Mountain Lion.
PS: To run the test, you can use the following build.sbt:
name := "gridfs-test"
version := "1.0"
scalaVersion := "2.9.1"
libraryDependencies += "org.mongodb" %% "casbah" % "2.4.1"
libraryDependencies += "org.mongodb" %% "casbah-gridfs" % "2.4.1"
resolvers ++= Seq(
  "Typesafe Releases" at "http://repo.typesafe.com/typesafe/releases/",
  "sonatype release" at "https://oss.sonatype.org/content/repositories/releases",
  "OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/")
I found a way to read the file contents back from MongoDB. The source method relies on underlying.inputStream, which is defined in GridFSDBFile.
Every test I did which used underlying.inputStream failed with the same error.
However, the API proposes another way to access the files: writeTo. writeTo does not use underlying.inputStream.
Here is the "fixed" code from the question:
import com.mongodb.casbah.Imports._
import com.mongodb.casbah.gridfs.Imports._

object TestGridFS {
  def main(args: Array[String]) {
    val mongoConn = MongoConnection()
    val mongoDB = mongoConn("gridfs_test")
    val gridfs = GridFS(mongoDB) // creates a GridFS handle on ``fs``
    val xls = new java.io.File("ok.xls")
    val savedFile = gridfs.createFile(xls)
    savedFile.filename = "ok.xls"
    savedFile.save
    println("savedfile id: %s".format(savedFile._id.get))
    val file = gridfs.findOne(savedFile._id.get)
    val byteArrayOutputStream = new java.io.ByteArrayOutputStream()
    file.map(_.writeTo(byteArrayOutputStream))
    byteArrayOutputStream.toByteArray
  }
}
The last line, byteArrayOutputStream.toByteArray, gives you an array of bytes which can then be used however you see fit.
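For what it's worth, the likely root cause is that source wraps the stream in a character decoder, and an .xls file is binary, so byte sequences that are invalid in the default charset raise MalformedInputException. If you prefer pulling raw bytes from any InputStream yourself, here is a minimal sketch using plain java.io, independent of the Casbah API (the ByteReader and readAllBytes names are mine, not from any library):

```scala
import java.io.{ByteArrayOutputStream, InputStream}

object ByteReader {
  // Copy an InputStream into a byte array without any charset decoding.
  def readAllBytes(in: InputStream): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](4096)
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n) // write exactly the n bytes just read
      n = in.read(buf)
    }
    out.toByteArray
  }
}
```

With the GridFS file above you could then call something like ByteReader.readAllBytes on the file's raw input stream (e.g. via the wrapped Java driver object, assuming it exposes getInputStream), avoiding the character-decoding path entirely.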

Related

Hazelcast server with scala client issue

I am trying to set up a Hazelcast server and client on my local machine. I am also trying to connect to the local Hazelcast server with a Scala client.
For the server I used the code below:
import com.hazelcast.config._
import com.hazelcast.Scala._

object HazelcastServer {
  def main(args: Array[String]): Unit = {
    val conf = new Config
    serialization.Defaults.register(conf.getSerializationConfig)
    serialization.DynamicExecution.register(conf.getSerializationConfig)
    val hz = conf.newInstance()
    val cmap = hz.getMap[String, String]("test")
    cmap.put("a", "A")
    cmap.put("b", "B")
  }
}
and the Hazelcast client as:
import com.hazelcast.Scala._
import client._
import com.hazelcast.client._
import com.hazelcast.config._

object Hazelcast_Client {
  def main(args: Array[String]): Unit = {
    val conf = new Config
    serialization.Defaults.register(conf.getSerializationConfig)
    serialization.DynamicExecution.register(conf.getSerializationConfig)
    val hz = conf.newClient()
    val cmap = hz.getMap("test")
    println(cmap.size())
  }
}
In my build.sbt,
libraryDependencies += "com.hazelcast" % "hazelcast" % "3.7.2"
libraryDependencies += "com.hazelcast" %% "hazelcast-scala" % "3.7.2"
I am getting the error below and am stuck on dependency issues.
Symbol 'type <none>.config.ClientConfig' is missing from the classpath.
[error] This symbol is required by 'value com.hazelcast.Scala.client.package.conf'.
[error] Make sure that type ClientConfig is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
[error] A full rebuild may help if 'package.class' was compiled against an incompatible version of <none>.config.
[error] val conf = new Config
I have referred to the Hazelcast documentation, but I am not able to find any good Hazelcast Scala examples to understand the setup and start playing with. If anybody can help solve this issue, or share really good Scala examples, that would be helpful.
I've done a Scala+Akka Hazelcast before. My build.sbt included
libraryDependencies += "com.hazelcast" % "hazelcast-all" % "3.7.2"
I seem to remember that hazelcast-all was required.
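Based on that, a sketch of the question's build.sbt with hazelcast swapped for hazelcast-all (the 3.7.2 versions are taken from the question; that hazelcast-scala pairs cleanly with hazelcast-all at this version is an assumption):

```scala
// build.sbt (sketch) -- hazelcast-all bundles the client classes
// (e.g. com.hazelcast.client.config.ClientConfig) that the error reports missing.
libraryDependencies += "com.hazelcast" % "hazelcast-all" % "3.7.2"
libraryDependencies += "com.hazelcast" %% "hazelcast-scala" % "3.7.2"
```

After changing dependencies like this, a clean rebuild (sbt clean compile) is worth doing, since the error message itself suggests stale classes compiled against an incompatible version.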

Flink ES connection Not compiling as expected

My problem is somewhat as described here.
Part of the code (actually taken from the Apache site) is below:
val httpHosts = new java.util.ArrayList[HttpHost]
httpHosts.add(new HttpHost("127.0.0.1", 9200, "http"))
httpHosts.add(new HttpHost("10.2.3.1", 9200, "http"))

val esSinkBuilder = new ElasticsearchSink.Builder[String](
  httpHosts,
  new ElasticsearchSinkFunction[String] {
    def createIndexRequest(element: String): IndexRequest = {
      val json = new java.util.HashMap[String, String]
      json.put("data", element)
      return Requests.indexRequest()
        .index("my-index")
        .`type`("my-type")
        .source(json)
If I add these three import statements:
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
I get this error:
object elasticsearch is not a member of package org.apache.flink.streaming.connectors
object elasticsearch6 is not a member of package org.apache.flink.streaming.connectors
If I do not add those import statements, I get this error instead:
Compiling 1 Scala source to E:\sar\scala\practice\readstbdata\target\scala-2.11\classes ...
[error] E:\sar\scala\practice\readstbdata\src\main\scala\example\readcsv.scala:35:25: not found: value ElasticsearchSink
[error] val esSinkBuilder = new ElasticsearchSink.Builder[String](
[error] ^
[error] E:\sar\scala\practice\readstbdata\src\main\scala\example\readcsv.scala:37:7: not found: type ElasticsearchSinkFunction
[error] new ElasticsearchSinkFunction[String] {
[error] ^
[error] two errors found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 1 s, completed 10 Feb, 2020 2:15:04 PM
In the Stack Overflow question I referred to above, some functions have been extended. My understanding is that flink.streaming.connectors.elasticsearch has to be extended with the REST libraries. 1) Is my understanding correct? 2) If yes, can I have the complete extensions? 3) If my understanding is wrong, please give me a solution.
Note: I added the following statements in build.sbt
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.5.2" ,
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "7.5.2",
The streaming connectors are not part of the Flink binary distribution; you have to package them with your application.
For elasticsearch6 you need to add flink-connector-elasticsearch6_2.11, which you can do as
libraryDependencies += "org.apache.flink" %% "flink-connector-elasticsearch6" % "1.6.0"
Once this jar is part of your build, the compiler will find the missing components. However, I don't know if this ES6 client will work with Elasticsearch 7.5.2.
Flink Elasticsearch Connector 7
Please look at the working and detailed answer which I have provided here, which is written in Scala.

value wholeTextFiles is not a member of org.apache.spark.SparkContext

I have Scala code like the below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark._

object RecipeIO {
  val sc = new SparkContext(new SparkConf().setAppName("Recipe_Extraction"))

  def read(INPUT_PATH: String): org.apache.spark.rdd.RDD[(String)] = {
    val data = sc.wholeTextFiles("INPUT_PATH")
    val files = data.map { case (filename, content) => filename }
    (files)
  }
}
When I compile this code using sbt it gives me the error:
value wholeTextFiles is not a member of org.apache.spark.SparkContext.
I am importing everything that is required, but it's still giving me this error.
However, when I replace wholeTextFiles with textFile, the code compiles.
What might be the problem here and how do I resolve it?
Thanks in advance!
Environment:
Scala compiler version 2.10.2
spark-1.2.0
Error:
[info] Set current project to RecipeIO (in build file:/home/akshat/RecipeIO/)
[info] Compiling 1 Scala source to /home/akshat/RecipeIO/target/scala-2.10.4/classes...
[error] /home/akshat/RecipeIO/src/main/scala/RecipeIO.scala:14: value wholeTexFiles is not a member of org.apache.spark.SparkContext
[error] val data = sc.wholeTexFiles(INPUT_PATH)
[error] ^
[error] one error found
[error] {file:/home/akshat/RecipeIO/}default-55aff3/compile:compile: Compilation failed
[error] Total time: 16 s, completed Jun 15, 2015 11:07:04 PM
My build.sbt file looks like this :
name := "RecipeIO"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating"
libraryDependencies += "org.eclipse.jetty" % "jetty-server" % "8.1.2.v20120308"
ivyXML :=
  <dependency org="org.eclipse.jetty.orbit" name="javax.servlet" rev="3.0.0.v201112011016">
    <artifact name="javax.servlet" type="orbit" ext="jar"/>
  </dependency>
You have a typo: it should be wholeTextFiles instead of wholeTexFiles.
As a side note, I think you want sc.wholeTextFiles(INPUT_PATH) and not sc.wholeTextFiles("INPUT_PATH") if you really want to use the INPUT_PATH variable.
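Putting both fixes together, a sketch of the corrected method (the surrounding object is assumed unchanged; note also that the build.sbt pins spark-core 0.9.0-incubating, while wholeTextFiles was only introduced around Spark 1.0, so the dependency may also need bumping to match the Spark 1.2.0 environment):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RecipeIO {
  val sc = new SparkContext(new SparkConf().setAppName("Recipe_Extraction"))

  def read(INPUT_PATH: String): org.apache.spark.rdd.RDD[String] = {
    // Pass the INPUT_PATH parameter, not the string literal "INPUT_PATH",
    // and spell the method wholeTextFiles (the stack trace shows wholeTexFiles).
    val data = sc.wholeTextFiles(INPUT_PATH)
    data.map { case (filename, content) => filename } // keep just the file names
  }
}
```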

No RowReaderFactory can be found for this type error when trying to map Cassandra row to case object using spark-cassandra-connector

I am trying to get a simple example working that maps rows from Cassandra to a Scala case class, using Apache Spark 1.1.1, Cassandra 2.0.11, and the spark-cassandra-connector (v1.1.0). I have reviewed the documentation at the spark-cassandra-connector GitHub page, planetcassandra.org, and DataStax, and generally searched around, but have not found anyone else encountering this issue. So here goes...
I am building a tiny Spark application using sbt (0.13.5) and Scala 2.10.4, with Spark 1.1.1 against Cassandra 2.0.11. Modelled on the example from the spark-cassandra-connector docs, the following two lines produce an error in my IDE and fail to compile:
case class SubHuman(id:String, firstname:String, lastname:String, isGoodPerson:Boolean)
val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id","firstname","lastname","isGoodPerson").toArray
The simple error presented by Eclipse is:
No RowReaderFactory can be found for this type
The compile error is only slightly more verbose:
> compile
[info] Compiling 1 Scala source to /home/bkarels/dev/simple-case/target/scala-2.10/classes...
[error] /home/bkarels/dev/simple-case/src/main/scala/com/bradkarels/simple/SimpleApp.scala:82: No RowReaderFactory can be found for this type
[error] val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id","firstname","lastname","isGoodPerson").toArray
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
[error] Total time: 1 s, completed Dec 10, 2014 9:01:30 AM
>
Scala source:
package com.bradkarels.simple

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd._
// Likely don't need this import - but throwing darts hits the bullseye once in a while...
import com.datastax.spark.connector.rdd.reader.RowReaderFactory

object CaseStudy {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://127.0.0.1:7077", "simple", conf)

    case class SubHuman(id:String, firstname:String, lastname:String, isGoodPerson:Boolean)
    val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id","firstname","lastname","isGoodPerson").toArray
  }
}
With the bothersome lines removed, everything compiles fine, assembly works, and I can perform other Spark operations normally. For example, if I remove the problem lines and drop in:
val rdd:CassandraRDD[CassandraRow] = sc.cassandraTable("nicecase", "human")
I get back the RDD and work with it as expected. That said, I suspect that my sbt project, assembly plugin, etc. are not contributing to the issues. The working source (less the new attempt to map to a case class as the connector as intended) can be found on github here.
But, to be more thorough, my build.sbt:
name := "Simple Case"
version := "0.0.1"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.1",
  "org.apache.spark" %% "spark-sql" % "1.1.1",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()
)
So the question is what have I missed? Hoping this is something silly, but if you have encountered this and can help me get past this puzzling little issue I would very much appreciate it. Please let me know if there are any other details that would be helpful in troubleshooting.
Thank you.
This may just be down to my newness with Scala in general, but I resolved this issue by moving the case class declaration out of the main method. The simplified source now looks like this:
package com.bradkarels.simple

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd._

object CaseStudy {
  case class SubHuman(id:String, firstname:String, lastname:String, isGoodPerson:Boolean)

  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://127.0.0.1:7077", "simple", conf)
    val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id","firstname","lastname","isGoodPerson").toArray
  }
}
The complete source (updated & fixed) can be found on github https://github.com/bradkarels/spark-cassandra-to-scala-case-class

SBT/Play2 multi-project setup does not include dependant projects in classpath in run/test

I have following SBT/Play2 multi-project setup:
import sbt._
import Keys._
import PlayProject._

object ApplicationBuild extends Build {
  val appName = "traveltime-api"
  val appVersion = "1.0"

  val appDependencies = Seq(
    // Google geocoding library
    "com.google.code.geocoder-java" % "geocoder-java" % "0.9",
    // Emailer
    "org.apache.commons" % "commons-email" % "1.2",
    // CSV generator
    "net.sf.opencsv" % "opencsv" % "2.0",
    "org.scalatest" %% "scalatest" % "1.7.2" % "test",
    "org.scalacheck" %% "scalacheck" % "1.10.0" % "test",
    "org.mockito" % "mockito-core" % "1.9.0" % "test"
  )

  val lib = RootProject(file("../lib"))
  val chiShape = RootProject(file("../chi-shape"))

  lazy val main = PlayProject(
    appName, appVersion, appDependencies, mainLang = SCALA
  ).settings(
    // Add your own project settings here
    resolvers ++= Seq(
      "Sonatype Snapshots" at
        "http://oss.sonatype.org/content/repositories/snapshots",
      "Sonatype Releases" at
        "http://oss.sonatype.org/content/repositories/releases"
    ),
    // Scalatest compatibility
    testOptions in Test := Nil
  ).aggregate(lib, chiShape).dependsOn(lib, chiShape)
}
As you can see, this project depends on two independent subprojects: lib and chiShape.
Now compile works fine - all sources are correctly compiled. However, if I try run or test, neither task has the subprojects' classes on its runtime classpath, and things go haywire with ClassNotFoundException.
For example - my application has to load serialized data from a file, and it goes like this: the test starts a FakeApplication, it tries to load the data, and boom:
[info] CsvGeneratorsTest:
[info] #markerFilterCsv
[info] - should fail on bad json *** FAILED ***
[info] java.lang.ClassNotFoundException: com.library.Node
[info] at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
[info] at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
[info] at java.security.AccessController.doPrivileged(Native Method)
[info] at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
[info] at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
[info] at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
[info] at java.lang.Class.forName0(Native Method)
[info] at java.lang.Class.forName(Class.java:264)
[info] at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:622)
[info] at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1593)
[info] ...
Strangely enough, stage creates a directory structure with chi-shapes_2.9.1-1.0.jar and lib_2.9.1-1.0.jar in staged/.
How can I get my run/test configurations to put the subprojects on the classpath?
Update:
I've added following code to Global#onStart:
override def onStart(app: Application) {
  println(app)
  ClassLoader.getSystemClassLoader.asInstanceOf[URLClassLoader].getURLs.
    foreach(println)
  throw new RuntimeException("foo!")
}
When I launch tests, the classpath is very, very ill-populated, to say the least :)
FakeApplication(.,sbt.classpath.ClasspathUtilities$$anon$1#182253a,List(),List(),Map(application.load-data -> test, mailer.smtp.test-mode -> true))
file:/home/arturas/Software/sdks/play-2.0.3/framework/sbt/sbt-launch.jar
[info] CsvGeneratorsTest:
When launching the staged app, there's a lot of stuff, as it's supposed to be :)
$ target/start
Play server process ID is 29045
play.api.Application#1c2862b
file:/home/arturas/work/traveltime-api/api/target/staged/jul-to-slf4j.jar
That's strange, because there should at least be the testing jars on the classpath, I suppose?
It seems I've solved it.
The culprit was that ObjectInputStream ignores thread local class loaders by default and only uses system class loader.
So I changed from:
def unserialize[T](file: File): T = {
  val in = new ObjectInputStream(new FileInputStream(file))
  try {
    in.readObject().asInstanceOf[T]
  } finally {
    in.close
  }
}
To:
/**
 * Object input stream which respects thread local class loader.
 *
 * TL class loader is used by SBT to avoid polluting system class loader when
 * running different tasks.
 */
class TLObjectInputStream(in: InputStream) extends ObjectInputStream(in) {
  override protected def resolveClass(desc: ObjectStreamClass): Class[_] = {
    Option(Thread.currentThread().getContextClassLoader).map { cl =>
      try { return cl.loadClass(desc.getName) }
      catch { case (e: java.lang.ClassNotFoundException) => () }
    }
    super.resolveClass(desc)
  }
}

def unserialize[T](file: File): T = {
  val in = new TLObjectInputStream(new FileInputStream(file))
  try {
    in.readObject().asInstanceOf[T]
  } finally {
    in.close
  }
}
And my class not found problems went away!
Thanks to "How to put custom ClassLoader to use?" and http://tech-tauk.blogspot.com/2010/05/thread-context-classlaoder-in.html for useful insight about deserialization and thread-local class loaders.
This sounds similar to this bug https://play.lighthouseapp.com/projects/82401/tickets/659-play-dist-broken-with-sub-projects, though that bug is about dist and not test. I think the fix has not made it into the latest stable release, so try building Play from source (and don't forget to use aggregate and dependsOn as demonstrated in that link).
Alternatively, as a workaround, inside sbt you can navigate to the sub-project with project lib and then type test. It's a bit manual, but you can script it if you'd like.
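In batch mode, that manual workaround can be scripted roughly like this (the project ids lib and chi-shape are assumptions based on the build file above; sbt derives ids from the subproject directories, so yours may differ):

```shell
# Run each subproject's tests directly, so each gets its own full classpath.
sbt "project lib" test
sbt "project chi-shape" test
```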