Filesystem provider disappearing in Spark?

Filesystem provider disappearing in Spark? - scala

Mysteriously, when I use Spark my custom filesystem provider vanishes.
The full source for my example is available on github so you can follow.
I'm using maven to depend on gcloud-java-nio, which provides a Java FileSystem for Google Cloud Storage, via "gs://" URLs. My Spark project uses maven-shade-plugin to create one big jar with all the source in it.
The big jar correctly includes a META-INF/services/java.nio.file.spi.FileSystemProvider file, containing the correct name for the class (com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider). I checked and that class is also correctly included in the jar file.
The program uses FileSystemProvider.installedProviders() to list the filesystem providers it finds. "gs" should be listed (and it is if I run the same function in a non-Spark context), but when running with Spark on Dataproc, that provider's gone.
I'd like to know: How can I use a custom filesystem in my Spark program?
edit: Dennis Huo helpfully contributed that he sees the same problem when running on a Spark cluster, so the problem isn't specific to Dataproc. In fact, it also occurs when just using Scala. Also there are workarounds for the example I'm showing here, but I'd still like to know how to use a custom filesystem with Spark.

This doesn't appear to be a Dataproc-specific issue, more of a Scala issue and Spark fundamentally depends on Scala; if I build your jarfile and then load it up with scala independently of Dataproc or Spark, I get:
scala -cp spark-repro-1.0-SNAPSHOT.jar
scala> import java.nio.file.spi.FileSystemProvider
import java.nio.file.spi.FileSystemProvider
scala> import scala.collection.JavaConversions._
import scala.collection.JavaConversions._
scala> FileSystemProvider.installedProviders().toList.foreach(l => println(l.getScheme()))
file
jar
scala> SparkRepro.listFS(1)
res3: String = Worker 1 installed filesystem providers: file jar
So it seems any bundling that's being done isn't properly registering the FileSystem provider, at least for scala. I tested the theory using the ListFilesystems example code (just removed the package at the top for convenience) on both a Dataproc node as well as a manually created VM with scala and Java 7 installed independently of Dataproc just to double-check.
$ cat ListFilesystems.java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.spi.FileSystemProvider;
/**
* ListFilesystems is a super-simple program that lists the available NIO filesystems.
*/
public class ListFilesystems {
/**
* See the class documentation.
*/
public static void main(String[] args) throws IOException {
listFilesystems();
}
private static void listFilesystems() {
System.out.println("Installed filesystem providers:");
for (FileSystemProvider p : FileSystemProvider.installedProviders()) {
System.out.println(" " + p.getScheme());
}
}
}
$ javac ListFilesystems.java
Running using java and then scala:
$ java -cp spark-repro-1.0-SNAPSHOT.jar:. ListFilesystems
Installed filesystem providers:
file
jar
gs
$ scala -cp spark-repro-1.0-SNAPSHOT.jar:. ListFilesystems
Installed filesystem providers:
file
jar
$
This was the same on both Dataproc and my non-Dataproc VM. It looks like there's still unresolved difficulties getting the FileSystemProviders to load properly in Scala, and there doesn't seem to be an easy way to dynamically register them system-wide at runtime either; the most I could find was this old thread that didn't seem to come to any useful conclusion.
Fortunately though, it looks like at least the CloudStorageFileSystemProvider has no problem making it onto the classpath, so you can at least fall back to explicitly creating an instance of the cloud storage provider to use:
new com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider()
.getFileSystem(new java.net.URI("gs://my-bucket"))
Alternatively, if you're using Spark anyways, you might want to consider just using the Hadoop FileSystem interfaces. It's very similar to Java NIO FileSystems (rather, predated the Java NIO FileSystem stuff), and it's more portable for now. You can easily do things like:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
...
Path foo = new Path("gs://my-bucket/my-data.txt");
InputStream is = foo.getFileSystem(new Configuration()).open(foo);
...
The benefit of working with the Hadoop FileSystem interfaces is that you're guaranteed the configuration/settings will be clean both in your driver program and in the distributed worker nodes. For example, sometimes you'll need to modify filesystem settings just for a single job running in a Dataproc cluster; then you can plumb through Hadoop properties which are properly scoped for a single job without interfering with other jobs running at the same time.

My comment on the linked ticket (https://github.com/scala/bug/issues/10247):
Using scala -toolcp path makes your app jar available to the system class loader.
Or, use the API where you can provide the class loader to find providers which are not "installed."
The scala runner script has a few interacting parts, like -Dscala.usejavacp and -nobootcp, which possibly behave differently on Windows. It's not always obvious which incantation to use.
The misunderstanding here is the assumption that java and scala do the same thing with respect to -cp.
This example shows loading a test provider from a build dir. The "default" provider comes first, hence the strange ordering.
$ skala -toolcp ~/bin
Welcome to Scala 2.12.2 (OpenJDK 64-Bit Server VM 1.8.0_112)
scala> import java.nio.file.spi.FileSystemProvider
import java.nio.file.spi.FileSystemProvider
scala> FileSystemProvider.installedProviders
res0: java.util.List[java.nio.file.spi.FileSystemProvider] = [sun.nio.fs.LinuxFileSystemProvider#12abdfb, com.acme.FlakeyFileSystemProvider#b0e5507, com.acme.FlakeyTPDFileSystemProvider#6bbe50c9, com.sun.nio.zipfs.ZipFileSystemProvider#3c46dcbe]
scala> :quit
Or specifying loader:
$ skala -cp ~/bin
Welcome to Scala 2.12.2 (OpenJDK 64-Bit Server VM 1.8.0_112)
scala> import java.net.URI
import java.net.URI
scala> val uri = URI.create("tpd:///?count=10000")
uri: java.net.URI = tpd:///?count=10000
scala> import collection.JavaConverters._
import collection.JavaConverters._
scala> val em = Map.empty[String, AnyRef].asJava
em: java.util.Map[String,AnyRef] = {}
scala> import java.nio.file.FileSystems
import java.nio.file.FileSystems
scala> FileSystems.
getDefault getFileSystem newFileSystem
scala> FileSystems.newFileSystem(uri, em, $intp.classLoader)
res1: java.nio.file.FileSystem = com.acme.FlakeyFileSystemProvider$FlakeyFileSystem#2553dcc0

Related

how to detect if your code is running under pyspark

For staging and production, my code will be running on PySpark. However, in my local development environment, I will not be running my code on PySpark.
This presents a problem from the standpoint of logging. Because one uses the Java library Log4J via Py4J when using PySpark, one will not be using Log4J for the local development.
Thankfully, the API for Log4J and the core Python logging module are the same: once you get a logger object, with either module you simply debug() or info() etc.
Thus, I wish to detect whether or not my code is being imported/run in PySpark or a non-PySpark environment: similar to:
class App:
def our_logger(self):
if self.running_under_spark():
sc = SparkContext(conf=conf)
log4jLogger = sc._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger(__name__)
log.warn("Hello World!")
return log
else:
from loguru import logger
return logger
How might I implement running_under_spark()
Simply trying to import pyspark and seeing if it works is not a fail-proof way of doing this because I have pyspark in my dev environment to kill warnings about non-imported modules in the code from my IDE.

Maybe you can set some environment variable in your spark environment that you check for at runtime ( in $SPARK_HOME/conf/spark-env.sh):
export SPARKY=spark
Then you check if SPARKY exists to determine if you're in your spark environment.
from os import environ
class App:
def our_logger(self):
if environ.get('SPARKY') is not None:
sc = SparkContext(conf=conf)
log4jLogger = sc._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger(__name__)
log.warn("Hello World!")
return log
else:
from loguru import logger
return logger

Scala Notebook Type Mismatch error when creating Event Hub for streaming tweets

I want to send messages from a Twitter application to an Azure event hub. However, I am getting an error that says:
notebook:20: error: type mismatch;
found : java.util.concurrent.ExecutorService
required: java.util.concurrent.ScheduledExecutorService
val eventHubClient = EventHubClient.create(connStr.toString(), pool)
I do not know how to create the EventHubClient.create now. Please help.
I am referring to code from the link
https://learn.microsoft.com/en-us/azure/azure-databricks/databricks-stream-from-eventhubs.
Also, I have tried the solution from link:
Stream data into Azure Databricks using Event Hubs and it doesn't work for me.
The version of the cluster is 5.2 (includes Apache Spark 2.4.0, Scala 2.11) which should include the Java SE 8 libraries that have the new ScheduledExecutorService member. Also, the libraries attached are com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.9 and org.twitter4j:twitter4j-core:4.0.7, so again all the prerequisites are met.
The code is:
import java._
import java.util._
import scala.collection.JavaConverters._
import com.microsoft.azure.eventhubs._
import java.util.concurrent._
import java.util.concurrent.ExecutorService
import java.util.concurrent.ScheduledExecutorService
val pool = Executors.newFixedThreadPool(1)
val eventHubClient = EventHubClient.create(connStr.toString(), pool)

How to tell PySpark where the pymongo-spark package is located? [duplicate]

I'm having difficulty getting these components to knit together properly. I have Spark installed and working successfully, I can run jobs locally, standalone, and also via YARN. I have followed the steps advised (to the best of my knowledge) here and here
I'm working on Ubuntu and the various component versions I have are
Spark spark-1.5.1-bin-hadoop2.6
Hadoop hadoop-2.6.1
Mongo 2.6.10
Mongo-Hadoop connector cloned from https://github.com/mongodb/mongo-hadoop.git
Python 2.7.10
I had some difficulty following the various steps such as which jars to add to which path, so what I have added are
in /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce I have added mongo-hadoop-core-1.5.0-SNAPSHOT.jar
the following environment variables
export HADOOP_HOME="/usr/local/share/hadoop-2.6.1"
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME="/usr/local/share/spark-1.5.1-bin-hadoop2.6"
export PYTHONPATH="/usr/local/share/mongo-hadoop/spark/src/main/python"
export PATH=$PATH:$SPARK_HOME/bin
My Python program is basic
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()
def main():
conf = SparkConf().setAppName("pyspark test")
sc = SparkContext(conf=conf)
rdd = sc.mongoRDD(
'mongodb://username:password#localhost:27017/mydb.mycollection')
if __name__ == '__main__':
main()
I am running it using the command
$SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/ --master local[4] ~/sparkPythonExample/SparkPythonExample.py
and I am getting the following output as a result
Traceback (most recent call last):
File "/home/me/sparkPythonExample/SparkPythonExample.py", line 24, in <module>
main()
File "/home/me/sparkPythonExample/SparkPythonExample.py", line 17, in main
rdd = sc.mongoRDD('mongodb://username:password#localhost:27017/mydb.mycollection')
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 161, in mongoRDD
return self.mongoPairRDD(connection_string, config).values()
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 143, in mongoPairRDD
_ensure_pickles(self)
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 80, in _ensure_pickles
orig_tb)
py4j.protocol.Py4JError
According to here
This exception is raised when an exception occurs in the Java client
code. For example, if you try to pop an element from an empty stack.
The instance of the Java exception thrown is stored in the
java_exception member.
Looking at the source code for pymongo_spark.py and the line throwing the error, it says
"Error while communicating with the JVM. Is the MongoDB Spark jar on
Spark's CLASSPATH? : "
So in response, I have tried to be sure the right jars are being passed, but I might be doing this all wrong, see below
$SPARK_HOME/bin/spark-submit --jars /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar --driver-class-path /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar --master local[4] ~/sparkPythonExample/SparkPythonExample.py
I have imported pymongo to the same python program to verify that I can at least access MongoDB using that, and I can.
I know there are quite a few moving parts here so if I can provide any more useful information please let me know.

Updates:
2016-07-04
Since the last update MongoDB Spark Connector matured quite a lot. It provides up-to-date binaries and data source based API but it is using SparkConf configuration so it is subjectively less flexible than the Stratio/Spark-MongoDB.
2016-03-30
Since the original answer I found two different ways to connect to MongoDB from Spark:
mongodb/mongo-spark
Stratio/Spark-MongoDB
While the former one seems to be relatively immature the latter one looks like a much better choice than a Mongo-Hadoop connector and provides a Spark SQL API.
# Adjust Scala and package version according to your setup
# although officially 0.11 supports only Spark 1.5
# I haven't encountered any issues on 1.6.1
bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
df = (sqlContext.read
.format("com.stratio.datasource.mongodb")
.options(host="mongo:27017", database="foo", collection="bar")
.load())
df.show()
## +---+----+--------------------+
## | x| y| _id|
## +---+----+--------------------+
## |1.0|-1.0|56fbe6f6e4120712c...|
## |0.0| 4.0|56fbe701e4120712c...|
## +---+----+--------------------+
It seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration and simply works.
The original answer:
Indeed, there are quite a few moving parts here. I tried to make it a little bit more manageable by building a simple Docker image which roughly matches described configuration (I've omitted Hadoop libraries for brevity though). You can find complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:
git clone https://github.com/zero323/docker-mongo-spark.git
cd docker-mongo-spark
docker build -t zero323/mongo-spark .
or download an image I've pushed to Docker Hub so you can simply docker pull zero323/mongo-spark):
Start images:
docker run -d --name mongo mongo:2.6
docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
Start PySpark shell passing --jars and --driver-class-path:
pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
And finally see how it works:
import pymongo
import pymongo_spark
mongo_url = 'mongodb://mongo:27017/'
client = pymongo.MongoClient(mongo_url)
client.foo.bar.insert_many([
{"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
client.close()
pymongo_spark.activate()
rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
.map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.collect()
## [(1.0, -1.0), (0.0, 4.0)]
Please note that mongo-hadoop seems to close the connection after the first action. So calling for example rdd.count() after the collect will throw an exception.
Based on different problems I've encountered creating this image I tend to believe that passing mongo-hadoop-1.5.0-SNAPSHOT.jar and mongo-hadoop-spark-1.5.0-SNAPSHOT.jar to both --jars and --driver-class-path is the only hard requirement.
Notes:
This image is loosely based on jaceklaskowski/docker-spark
so please be sure to send some good karma to #jacek-laskowski if it helps.
If don't require a development version including new API then using --packages is most likely a better option.

Can you try using --package option instead of --jars ... in your spark-submit command:
spark-submit --packages org.mongodb.mongo-hadoop:mongo-hadoop-core:1.3.1,org.mongodb:mongo-java-driver:3.1.0 [REST OF YOUR OPTIONS]
Some of these jar files are not Uber jars and need more dependencies to be downloaded before that can get to work.

I was having this same problem yesterday. Was able to fix it by placing mongo-java-driver.jar in $HADOOP_HOME/lib and mongo-hadoop-core.jar and mongo-hadoop-spark.jar in $HADOOP_HOME/spark/classpath/emr (Or any other folder that is in the $SPARK_CLASSPATH).
Let me know if that helps.

Good Luck！
#see https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
from pyspark import SparkContext, SparkConf
import pymongo_spark
# Important: activate pymongo_spark.
pymongo_spark.activate()
def main():
conf = SparkConf().setAppName("pyspark test")
sc = SparkContext(conf=conf)
# Create an RDD backed by the MongoDB collection.
# This RDD *does not* contain key/value pairs, just documents.
# If you want key/value pairs, use the mongoPairRDD method instead.
rdd = sc.mongoRDD('mongodb://localhost:27017/db.collection')
# Save this RDD back to MongoDB as a different collection.
rdd.saveToMongoDB('mongodb://localhost:27017/db.other.collection')
# You can also read and write BSON:
bson_rdd = sc.BSONFileRDD('/path/to/file.bson')
bson_rdd.saveToBSON('/path/to/bson/output')
if __name__ == '__main__':
main()

XmlRpc client using Scala

I need to consume an xmlrpc service from Scala, and so far it looks like my only option is the Apache XML-RPC library.
I added this dependency to my Build.scala:
"org.apache.xmlrpc" % "xmlrpc" % "3.1.3"
and sbt reported no problem in downloading the library. However, I don't know how to go about actually accessing the library.
val xml = org.apache.xmlrpc.XmlRpcClient("http://foo") wouldn't compile
and
import org.apache.xmlrpc._
reported that object xmlrpc was not a member of package org.apache.
What would be the correct package to import?
(Or, is there a better library for XmlRpc from Scala?)

Try
"org.apache.xmlrpc" % "xmlrpc-client" % "3.1.3"
and so :
class XmlRpc(val serverURL: String) {
import org.apache.xmlrpc.client.XmlRpcClient
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl
import org.apache.xmlrpc.client.XmlRpcSunHttpTransportFactory
import java.net.URL
val config = new XmlRpcClientConfigImpl();
config.setServerURL(new URL(serverURL));
config.setEncoding("ISO-8859-1");
val client = new XmlRpcClient();
client.setTransportFactory(new XmlRpcSunHttpTransportFactory(client));
client.setConfig(config);
client.execute(...)
}

There is a good module for this kind of tasks:
https://github.com/jvican/xmlrpc

scala cloud foundry mongodb : access to mongodb denied

I have installed eclipse, the cloudfoundry plugin, the scala plugin,the vaadin plugin(for web developments) and the mongodb libraries.
I created a class like this :
import vaadin.scala.Application
import vaadin.scala.VerticalLayout
import com.mongodb.casbah.MongoConnection
import com.mongodb.casbah.commons.MongoDBObject
import vaadin.scala.Label
import vaadin.scala.Button
class Launcher extends Application {
val label=new Label
override def main = new VerticalLayout() {
val coll=MongoConnection()("mybd")("somecollection")
val builder=MongoDBObject.newBuilder
builder+="foo1" -> "bar"
var newobj=builder.result()
coll.save(newobj)
val mongoColl=MongoConnection()("mybd")("somecollection")
val withFoo=mongoColl.findOne()
label.value=withFoo
add(label)
//bouton pour faire joli
add(new Button{
caption_=("click me!")
})
}
}
the error (the access to the mongodb database is denied) comes from the parameters, which are the default ones.
do you know how to set up the good parameters in scala or in java?

Looks like you got some help on the vcap-dev mailing list
package com.example.vaadin_1
import vaadin.scala.Application
import org.cloudfoundry.runtime.env.CloudEnvironment
import org.cloudfoundry.runtime.env.MongoServiceInfo
import com.mongodb.casbah.MongoConnection
class Launcher extends Application {
val cloudEnvironment = new CloudEnvironment()
val mongoServices = cloudEnvironment.getServiceInfos(classOf[MongoServiceInfo])
val mongo = mongoServices.get(0)
val mongodb = MongoConnection(mongo.getHost(), mongo.getPort())("abc")
mongodb.authenticate(mongo.getUserName(),mongo.getPassword())
}

I would suggest to do it using Spring Data for MongoDB there is a sample application for Cloudfoundry in particular put together by the Spring guys. With a bit of xml configuration you have ready to inject the mongoTemplate similar to the familiar Spring xxxTemplate paradigm.

when deploying to CloudFoundry, the information relative to connecting to a service (i.e mongo in your case) is made available to the app through environment variable VCAP_SERVICES. It is a json document with one entry per service. You could of course parse it yourself, but you will find the class http://cf-runtime-api.cloudfoundry.com/org/cloudfoundry/runtime/env/CloudEnvironment.html useful. You will need to add the org.cloudfoundry/cloudfoundry-runtime/0.8.1 jar to your project. You can use it without Spring.
Have a look at http://docs.cloudfoundry.com/services.html for explanation of the VCAP_SERVICES underlying var