Spark Scala error while loading BytesWritable, invalid LOC header (bad signature) - scala

Using sbt package I have the following error
Spark Scala error while loading BytesWritable, invalid LOC header (bad signature)
My code is
....
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
......
object Test{
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Test")
val sc = new SparkContext(conf) // the error is due by this
......
}
}

Pls re-load your JARs and / or library dependencies as they might be corrupted while building jar through sbt - could be issue with one of their update. Second alternative is that you have too many temp files open, check your 4040-9 ports on master if there are any jobs hanging and kill them if so, you can also check how increase open files you have on linux:/etc/security/limits.conf where hard nofile ***** and soft nofile ***** then reboot and ulimit -n ****

I was using spark-mllib_2.11 and it gave me the same error. I had to use version 2.10 of Spark MLIB to get rid of it.
Using Maven:
<artifactId>spark-mllib_2.10</artifactId>

Related

how to detect if your code is running under pyspark

For staging and production, my code will be running on PySpark. However, in my local development environment, I will not be running my code on PySpark.
This presents a problem from the standpoint of logging. Because one uses the Java library Log4J via Py4J when using PySpark, one will not be using Log4J for the local development.
Thankfully, the API for Log4J and the core Python logging module are the same: once you get a logger object, with either module you simply debug() or info() etc.
Thus, I wish to detect whether or not my code is being imported/run in PySpark or a non-PySpark environment: similar to:
class App:
def our_logger(self):
if self.running_under_spark():
sc = SparkContext(conf=conf)
log4jLogger = sc._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger(__name__)
log.warn("Hello World!")
return log
else:
from loguru import logger
return logger
How might I implement running_under_spark()
Simply trying to import pyspark and seeing if it works is not a fail-proof way of doing this because I have pyspark in my dev environment to kill warnings about non-imported modules in the code from my IDE.
Maybe you can set some environment variable in your spark environment that you check for at runtime ( in $SPARK_HOME/conf/spark-env.sh):
export SPARKY=spark
Then you check if SPARKY exists to determine if you're in your spark environment.
from os import environ
class App:
def our_logger(self):
if environ.get('SPARKY') is not None:
sc = SparkContext(conf=conf)
log4jLogger = sc._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger(__name__)
log.warn("Hello World!")
return log
else:
from loguru import logger
return logger

Spark ElasticSearch Configuration - Reading Elastic Search from Spark

I am trying to read data from ElasticSearch via Spark Scala. I see lot of post addressing this question, i have tried all the options they have mentioned in various posts but seems nothing is working for me
JAR Used - elasticsearch-hadoop-5.6.8.jar (Used elasticsearch-spark-5.6.8.jar too without any success)
Elastic Search Version - 5.6.8
Spark - 2.3.0
Scala - 2.11
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
val spark = SparkSession.builder.appName("elasticSpark").master("local[*]").getOrCreate()
val reader = spark.read.format("org.elasticsearch.spark.sql").option("es.index.auto.create", "true").option("spark.serializer", "org.apache.spark.serializer.KryoSerializer").option("es.port", "9200").option("es.nodes", "xxxxxxxxx").option("es.nodes.wan.only", "true").option("es.net.http.auth.user","xxxxxx").option("es.net.http.auth.pass", "xxxxxxxx")
val read = reader.load("index/type")
Error:
ERROR rest.NetworkClient: Node [xxxxxxxxx:9200] failed (The server xxxxxxxxxxxxx failed to respond); no other nodes left - aborting...
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:294)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMappingAndGeoFields(SchemaUtils.scala:98)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMapping(SchemaUtils.scala:91)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema$lzycompute(DefaultSource.scala:129)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema(DefaultSource.scala:129)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:133)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:133)
at scala.Option.getOrElse(Option.scala:121)
at org.elasticsearch.spark.sql.ElasticsearchRelation.schema(DefaultSource.scala:133)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:432)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 53 elided
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[xxxxxxxxxxx:9200]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:461)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:429)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:155)
at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:655)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:287)
... 65 more
Apart from this I have also tried below properties without any success:
option("es.net.ssl.cert.allow.self.signed", "true")
option("es.net.ssl.truststore.location", "<path for elasticsearch cert file>")
option("es.net.ssl.truststore.pass", "xxxxxx")
Please note elasticsearch node is within Unix edge node and is http://xxxxxx:9200 (mentioning it just in case if that makes any difference with the code)
What I am missing here? Any other properties? Please Help
Use below Jar which support spark 2+ version instead of Elastic-Hadoop or Elastic-Spark jar.
https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-20_2.11/5.6.8

Scala Notebook Type Mismatch error when creating Event Hub for streaming tweets

I want to send messages from a Twitter application to an Azure event hub. However, I am getting an error that says:
notebook:20: error: type mismatch;
found : java.util.concurrent.ExecutorService
required: java.util.concurrent.ScheduledExecutorService
val eventHubClient = EventHubClient.create(connStr.toString(), pool)
I do not know how to create the EventHubClient.create now. Please help.
I am referring to code from the link
https://learn.microsoft.com/en-us/azure/azure-databricks/databricks-stream-from-eventhubs.
Also, I have tried the solution from link:
Stream data into Azure Databricks using Event Hubs and it doesn't work for me.
The version of the cluster is 5.2 (includes Apache Spark 2.4.0, Scala 2.11) which should include the Java SE 8 libraries that have the new ScheduledExecutorService member. Also, the libraries attached are com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.9 and org.twitter4j:twitter4j-core:4.0.7, so again all the prerequisites are met.
The code is:
import java._
import java.util._
import scala.collection.JavaConverters._
import com.microsoft.azure.eventhubs._
import java.util.concurrent._
import java.util.concurrent.ExecutorService
import java.util.concurrent.ScheduledExecutorService
val pool = Executors.newFixedThreadPool(1)
val eventHubClient = EventHubClient.create(connStr.toString(), pool)

ReactiveMongo 0.12 application.conf issue and logging issue

I've read up everything I could on SO and the ReactiveMongo community list and I am stumped. I am using ReactiveMongo version 0.12 and am just trying to test it out since I have some other problems.
The code in my scala worksheet is:
import reactivemongo.api.{DefaultDB, MongoConnection, MongoDriver}
import reactivemongo.bson.{
BSONDocumentWriter, BSONDocumentReader, Macros, document
}
import com.typesafe.config.{Config, ConfigFactory}
lazy val conf = ConfigFactory.load()
val driver1 = new reactivemongo.api.MongoDriver
val connection3 = driver1.connection(List("localhost"))
and the error I get is
[NGSession 3: 127.0.0.1: compile-server] INFO reactivemongo.api.MongoDriver - No mongo-async-driver configuration found
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka'
at com.typesafe.config.impl.SimpleConfig.findKey(testMongo.sc:120)
at com.typesafe.config.impl.SimpleConfig.find(testMongo.sc:143)
at com.typesafe.config.impl.SimpleConfig.find(testMongo.sc:155)
at com.typesafe.config.impl.SimpleConfig.find(testMongo.sc:160)
at com.typesafe.config.impl.SimpleConfig.getString(testMongo.sc:202)
at akka.actor.ActorSystem$Settings.<init>(testMongo.sc:165)
at akka.actor.ActorSystemImpl.<init>(testMongo.sc:501)
at akka.actor.ActorSystem$.apply(testMongo.sc:138)
at reactivemongo.api.MongoDriver.<init>(testMongo.sc:879)
at #worksheet#.driver1$lzycompute(testMongo.sc:9)
at #worksheet#.driver1(testMongo.sc:9)
at #worksheet#.get$$instance$$driver1(testMongo.sc:9)
at #worksheet#.#worksheet#(testMongo.sc:30)
My application.conf is in src/main/resources of the sub-project which this worksheet is found and contains this:
mongo-async-driver {
akka {
loglevel = WARNING
}
}
I added the ConfigFactory precisely because I got this error and thought it might help. I looked at the code and that's what ReactiveMongo is doing at this point so I thought perhaps a call here would force it to load at this point. I have moved the application.conf file into every conceivable place including a conf directory (thinking it might require play conventions) and the src/main/resources of the top level directory. Nothing works. So my first question is what am I doing wrong? Where should application.conf file go?
This info message causes my program to crash and driver doesn't get created so I can't move on from here.
Also, I added an akka key to reference.conf just in case - that didnt help either.

Filesystem provider disappearing in Spark?

Mysteriously, when I use Spark my custom filesystem provider vanishes.
The full source for my example is available on github so you can follow.
I'm using maven to depend on gcloud-java-nio, which provides a Java FileSystem for Google Cloud Storage, via "gs://" URLs. My Spark project uses maven-shade-plugin to create one big jar with all the source in it.
The big jar correctly includes a META-INF/services/java.nio.file.spi.FileSystemProvider file, containing the correct name for the class (com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider). I checked and that class is also correctly included in the jar file.
The program uses FileSystemProvider.installedProviders() to list the filesystem providers it finds. "gs" should be listed (and it is if I run the same function in a non-Spark context), but when running with Spark on Dataproc, that provider's gone.
I'd like to know: How can I use a custom filesystem in my Spark program?
edit: Dennis Huo helpfully contributed that he sees the same problem when running on a Spark cluster, so the problem isn't specific to Dataproc. In fact, it also occurs when just using Scala. Also there are workarounds for the example I'm showing here, but I'd still like to know how to use a custom filesystem with Spark.
This doesn't appear to be a Dataproc-specific issue, more of a Scala issue and Spark fundamentally depends on Scala; if I build your jarfile and then load it up with scala independently of Dataproc or Spark, I get:
scala -cp spark-repro-1.0-SNAPSHOT.jar
scala> import java.nio.file.spi.FileSystemProvider
import java.nio.file.spi.FileSystemProvider
scala> import scala.collection.JavaConversions._
import scala.collection.JavaConversions._
scala> FileSystemProvider.installedProviders().toList.foreach(l => println(l.getScheme()))
file
jar
scala> SparkRepro.listFS(1)
res3: String = Worker 1 installed filesystem providers: file jar
So it seems any bundling that's being done isn't properly registering the FileSystem provider, at least for scala. I tested the theory using the ListFilesystems example code (just removed the package at the top for convenience) on both a Dataproc node as well as a manually created VM with scala and Java 7 installed independently of Dataproc just to double-check.
$ cat ListFilesystems.java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.spi.FileSystemProvider;
/**
* ListFilesystems is a super-simple program that lists the available NIO filesystems.
*/
public class ListFilesystems {
/**
* See the class documentation.
*/
public static void main(String[] args) throws IOException {
listFilesystems();
}
private static void listFilesystems() {
System.out.println("Installed filesystem providers:");
for (FileSystemProvider p : FileSystemProvider.installedProviders()) {
System.out.println(" " + p.getScheme());
}
}
}
$ javac ListFilesystems.java
Running using java and then scala:
$ java -cp spark-repro-1.0-SNAPSHOT.jar:. ListFilesystems
Installed filesystem providers:
file
jar
gs
$ scala -cp spark-repro-1.0-SNAPSHOT.jar:. ListFilesystems
Installed filesystem providers:
file
jar
$
This was the same on both Dataproc and my non-Dataproc VM. It looks like there's still unresolved difficulties getting the FileSystemProviders to load properly in Scala, and there doesn't seem to be an easy way to dynamically register them system-wide at runtime either; the most I could find was this old thread that didn't seem to come to any useful conclusion.
Fortunately though, it looks like at least the CloudStorageFileSystemProvider has no problem making it onto the classpath, so you can at least fall back to explicitly creating an instance of the cloud storage provider to use:
new com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider()
.getFileSystem(new java.net.URI("gs://my-bucket"))
Alternatively, if you're using Spark anyways, you might want to consider just using the Hadoop FileSystem interfaces. It's very similar to Java NIO FileSystems (rather, predated the Java NIO FileSystem stuff), and it's more portable for now. You can easily do things like:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
...
Path foo = new Path("gs://my-bucket/my-data.txt");
InputStream is = foo.getFileSystem(new Configuration()).open(foo);
...
The benefit of working with the Hadoop FileSystem interfaces is that you're guaranteed the configuration/settings will be clean both in your driver program and in the distributed worker nodes. For example, sometimes you'll need to modify filesystem settings just for a single job running in a Dataproc cluster; then you can plumb through Hadoop properties which are properly scoped for a single job without interfering with other jobs running at the same time.
My comment on the linked ticket (https://github.com/scala/bug/issues/10247):
Using scala -toolcp path makes your app jar available to the system class loader.
Or, use the API where you can provide the class loader to find providers which are not "installed."
The scala runner script has a few interacting parts, like -Dscala.usejavacp and -nobootcp, which possibly behave differently on Windows. It's not always obvious which incantation to use.
The misunderstanding here is the assumption that java and scala do the same thing with respect to -cp.
This example shows loading a test provider from a build dir. The "default" provider comes first, hence the strange ordering.
$ skala -toolcp ~/bin
Welcome to Scala 2.12.2 (OpenJDK 64-Bit Server VM 1.8.0_112)
scala> import java.nio.file.spi.FileSystemProvider
import java.nio.file.spi.FileSystemProvider
scala> FileSystemProvider.installedProviders
res0: java.util.List[java.nio.file.spi.FileSystemProvider] = [sun.nio.fs.LinuxFileSystemProvider#12abdfb, com.acme.FlakeyFileSystemProvider#b0e5507, com.acme.FlakeyTPDFileSystemProvider#6bbe50c9, com.sun.nio.zipfs.ZipFileSystemProvider#3c46dcbe]
scala> :quit
Or specifying loader:
$ skala -cp ~/bin
Welcome to Scala 2.12.2 (OpenJDK 64-Bit Server VM 1.8.0_112)
scala> import java.net.URI
import java.net.URI
scala> val uri = URI.create("tpd:///?count=10000")
uri: java.net.URI = tpd:///?count=10000
scala> import collection.JavaConverters._
import collection.JavaConverters._
scala> val em = Map.empty[String, AnyRef].asJava
em: java.util.Map[String,AnyRef] = {}
scala> import java.nio.file.FileSystems
import java.nio.file.FileSystems
scala> FileSystems.
getDefault getFileSystem newFileSystem
scala> FileSystems.newFileSystem(uri, em, $intp.classLoader)
res1: java.nio.file.FileSystem = com.acme.FlakeyFileSystemProvider$FlakeyFileSystem#2553dcc0