Apache Zeppelin 0.6.1: Run Spark 2.0 Twitter Stream App - scala

I have a cluster with Spark 2.0 and Zeppelin 0.6.1 installed. Since the class TwitterUtils.scala has been moved from the Spark project to Apache Bahir, I can't use TwitterUtils in my Zeppelin notebook anymore.
Here are the snippets from my notebook:
Dependency loading:
%dep
z.reset
z.load("org.apache.bahir:spark-streaming-twitter_2.11:2.0.0")
DepInterpreter(%dep) deprecated. Remove dependencies and repositories through GUI interpreter menu instead.
DepInterpreter(%dep) deprecated. Load dependency through GUI interpreter menu instead.
res1: org.apache.zeppelin.dep.Dependency = org.apache.zeppelin.dep.Dependency@4793109a
And the Spark part:
import org.apache.spark.streaming.twitter
import org.apache.spark.streaming._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess
import org.apache.spark.SparkConf
// ********************************* Configures the Oauth Credentials for accessing Twitter ****************************
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {...}
// ***************************************** Configure Twitter credentials ********************************************
val apiKey = ...
val apiSecret = ...
val accessToken = ...
val accessTokenSecret = ...
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)
// ************************************************* The logic itself *************************************************
val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
val twt = tweets.window(Seconds(60))
When I try to run the Spark part of the notebook after importing the dependency, I get the following exception:
<console>:44: error: object twitter is not a member of package org.apache.spark.streaming
import org.apache.spark.streaming.twitter
What am I doing wrong here? The Bahir documentation also uses the import org.apache.spark.streaming.twitter._ statement; see http://bahir.apache.org/docs/spark/2.0.0/spark-streaming-twitter/

Well, %dep is not exactly stable, and since it is deprecated anyway, why not use the supported method? If you don't want to modify either the Spark or Zeppelin configuration files, you can add the dependency through the GUI interpreter settings instead (I omitted the interpreter properties for clarity):
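A minimal sketch (an assumption on my side, not part of the original answer) of the notebook paragraph once the Bahir artifact org.apache.bahir:spark-streaming-twitter_2.11:2.0.0 has been added under the spark interpreter's Dependencies in the interpreter settings; note the wildcard import, which the snippet in the question was missing:

// Assumes org.apache.bahir:spark-streaming-twitter_2.11:2.0.0 is registered as a
// dependency of the spark interpreter, so the twitter package is on the classpath.
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._

val ssc = new StreamingContext(sc, Seconds(2))     // sc is the SparkContext provided by Zeppelin
val tweets = TwitterUtils.createStream(ssc, None)  // None: fall back to the default twitter4j authorization
val twt = tweets.window(Seconds(60))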

spark scala datastax csv load file and print schema

Spark version 2.0.2.6
Scala version 2.11.11
Using DataStax 5.0
import org.apache.log4j.{Level, Logger}
import java.util.Calendar
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._

object csvtocassandra {
  def main(args: Array[String]): Unit = {
    val key_space = scala.io.StdIn.readLine("Please enter cassandra Key Space Name: ")
    val table_name = scala.io.StdIn.readLine("Please enter cassandra Table Name: ")

    // Cassandra Part
    val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    println(Calendar.getInstance.getTime)

    // Scala Read CSV Part
    val spark1 = org.apache.spark.sql.SparkSession.builder().master("local")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .appName("Spark SQL basic example").getOrCreate()
    val csv_input = scala.io.StdIn.readLine("Please enter csv file location: ")
    val df_csv = spark1.read.format("csv").option("header", "true").option("inferschema", "true").load(csv_input)
    df_csv.printSchema()
  }
}
Why am I not able to run this program as a job when submitting it to Spark? When I run it from IntelliJ it works.
But when I create a JAR and run it, I get the following error.
Command:
> dse spark-submit --class "csvtospark" /Users/del/target/scala-2.11/csvtospark_2.11-1.0.jar
I am getting the following error:
ERROR 2017-11-02 11:46:10,245 org.apache.spark.deploy.DseSparkSubmitBootstrapper: Failed to start or submit Spark application
org.apache.spark.sql.AnalysisException: Path does not exist: dsefs://127.0.0.1/Users/Desktop/csv/example.csv;
Why is it prepending the dsefs://127.0.0.1 part even though I am giving just the path /Users/Desktop/csv/example.csv when asked?
I tried giving the --master option as well, but I am getting the same error. I am running DataStax Spark on my local machine, no cluster.
Please correct me where I am doing things wrong.
Got it. Never mind. Sorry about that.
The input should be given as file:///file_name
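For example, a sketch reusing the path from the error message above (adjust it to the real CSV location):

// Prefixing the local path with file:/// keeps DSE from resolving it against dsefs://127.0.0.1
val csv_input = "file:///Users/Desktop/csv/example.csv"
val df_csv = spark1.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(csv_input)
df_csv.printSchema()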

44: error: value read is not a member of object org.apache.spark.sql.SQLContext

I am using Spark 1.6.1 and Scala 2.10.5. I am trying to read a CSV file through com.databricks.
While launching the spark-shell, I use the line below as well:
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 --driver-class-path path to/sqljdbc4.jar
and below is the whole code:
import java.util.Properties
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true");
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = SQLContext.read().format("com.databricks.spark.csv").option("inferScheme","true").option("header","true").load("path_to/data.csv");
I am getting the error below:
error: value read is not a member of object org.apache.spark.sql.SQLContext
and the "^" is pointing toward "SQLContext.read().format" in the error message.
I did try the suggestions available on Stack Overflow, as well as other sites, but nothing seems to be working.
Writing SQLContext refers to the companion object, i.e. static-style access on the class.
You should use the sqlContext variable instead, because read is an instance method defined on the class, not on the companion object.
So the code should be:
val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema","true").option("header","true").load("path_to/data.csv");
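For illustration, a minimal sketch of the distinction, reusing the names from the snippet above:

val sqlContext = new SQLContext(sc)  // an instance of the class: read is defined here
val reader = sqlContext.read         // compiles and returns a DataFrameReader
// SQLContext.read                   // does not compile: the companion object has no read member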

Spark unable to find "spark-version-info.properties" when run from ammonite script

I have an ammonite script which creates a spark context:
#!/usr/local/bin/amm
import ammonite.ops._
import $ivy.`org.apache.spark:spark-core_2.11:2.0.1`
import org.apache.spark.{SparkConf, SparkContext}
@main
def main(): Unit = {
  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("Demo"))
}
When I run this script, it throws an error:
Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: org.apache.spark.SparkException: Error while locating file spark-version-info.properties
...
Caused by: java.lang.NullPointerException
at java.util.Properties$LineReader.readLine(Properties.java:434)
at java.util.Properties.load0(Properties.java:353)
The script isn't being run from the spark installation directory and doesn't have any knowledge of it or the resources where this version information is packaged - it only knows about the ivy dependencies. So perhaps the issue is that this resource information isn't on the classpath in the ivy dependencies. I have seen other spark "standalone scripts" so I was hoping I could do the same here.
I poked around a bit to try and understand what was happening. I was hoping I could programmatically hack some build information into the system properties at runtime.
The source of the exception comes from package.scala in the Spark library. The relevant bits of code are:
val resourceStream = Thread.currentThread().getContextClassLoader.
  getResourceAsStream("spark-version-info.properties")
try {
  val unknownProp = "<unknown>"
  val props = new Properties()
  props.load(resourceStream)   // <--- causing a NPE?
  (
    props.getProperty("version", unknownProp),
    // Load some other properties
  )
} catch {
  case npe: NullPointerException =>
    throw new SparkException("Error while locating file spark-version-info.properties", npe)
It seems that the implicit assumption is that props.load will fail with an NPE if the version information can't be found in the resources. (That's not so clear to the reader!)
The NPE itself looks like it's coming from this code in java.util.Properties.java:
class LineReader {
    public LineReader(InputStream inStream) {
        this.inStream = inStream;
        inByteBuf = new byte[8192];
    }
    ...
    InputStream inStream;
    Reader reader;

    int readLine() throws IOException {
        ...
        inLimit = (inStream==null)?reader.read(inCharBuf)
                                  :inStream.read(inByteBuf);
The LineReader is constructed with a null InputStream which the class internally interprets as meaning that the reader is non-null and should be used instead - but it's also null. (Is this kind of stuff really in the standard library? Seems very unsafe...)
From looking at the bin/spark-shell that comes with spark, it adds -Dscala.usejavacp=true when it launches spark-submit. Is this the right direction?
Thanks for your help!
The following seems to work on 2.11 with version 1.0.1, but not with the experimental version.
It could just be that this is implemented better in Spark 2.2:
#!/usr/local/bin/amm
import ammonite.ops._
import $ivy.`org.apache.spark:spark-core_2.11:2.2.0`
import $ivy.`org.apache.spark:spark-sql_2.11:2.2.0`
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.SparkSession

@main
def main(): Unit = {
  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("Demo"))
}
Or, as a more expanded answer:
@main
def main(): Unit = {
  val spark = SparkSession.builder()
    .appName("testings")
    .master("local")
    .config("configuration key", "configuration value")
    .getOrCreate
  val sqlContext = spark.sqlContext
  val tdf2 = spark.read.option("delimiter", "|").option("header", true).csv("./tst.dat")
  tdf2.show()
}

IllegalAccessException .. can not access a member of class with modifiers "protected"

This is my Scala code. I am trying to ingest a GeoTIFF file into HDFS using the GeoTrellis library.
package RasterDataIngest.RasterDataIngestIntoHadoop
import geotrellis.spark._
import geotrellis.spark.ingest._
import geotrellis.spark.io.hadoop._
import geotrellis.spark.io.index._
import geotrellis.spark.tiling._
import geotrellis.spark.utils.SparkUtils
import geotrellis.vector._
import org.apache.hadoop.fs.Path
import org.apache.spark._
import com.quantifind.sumac.ArgMain
import com.quantifind.sumac.validation.Required
class HadoopIngestArgs extends IngestArgs {
  @Required var catalog: String = _
  def catalogPath = new Path(catalog)
}

object HadoopIngest extends ArgMain[HadoopIngestArgs] with Logging {
  def main(args: HadoopIngestArgs): Unit = {
    System.setProperty("com.sun.media.jai.disableMediaLib", "true")
    implicit val sparkContext = SparkUtils.createSparkContext("Ingest")
    val conf = sparkContext.hadoopConfiguration
    conf.set("io.map.index.interval", "1")

    val catalog = HadoopRasterCatalog(args.catalogPath)
    val source = sparkContext.hadoopGeoTiffRDD(args.inPath)
    val layoutScheme = ZoomedLayoutScheme()

    Ingest[ProjectedExtent, SpatialKey](source, args.destCrs, layoutScheme, args.pyramid) { (rdd, level) =>
      catalog
        .writer[SpatialKey](RowMajorKeyIndexMethod, args.clobber)
        .write(LayerId(args.layerName, level.zoom), rdd)
    }
  }
}
When I run this code, I get the following error.
Please help me to solve it.
java.lang.IllegalAccessException: Class org.osgeo.proj4j.Registry can not access a member of class org.osgeo.proj4j.proj.Projection with modifiers "protected"
I believe the problem is related to a bad sbt cache or Java version mismatch. Try the latest stable GeoTrellis version: 0.10.3 (Scala 2.10/2.11, Java 8, Spark 1.6.x). If you plan to use GeoTrellis with Spark 2, take a look at the GeoTrellis snapshot (version 1.0.0 will support Spark 2+, Java 8, and Scala 2.11).
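If the project builds with sbt, here is a minimal sketch of the dependency lines for the stable release; the coordinates follow the GeoTrellis 0.10.x publishing scheme and should be double-checked against the GeoTrellis documentation:

// build.sbt (sketch): GeoTrellis 0.10.3 against Spark 1.6.x, Scala 2.10/2.11
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "com.azavea.geotrellis" %% "geotrellis-spark" % "0.10.3",
  "org.apache.spark"      %% "spark-core"       % "1.6.3" % "provided"
)

If a bad cache is the culprit, clearing the local ivy cache (~/.ivy2/cache) before rebuilding may also help.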

Class org.apache.spark.sql.types.SQLUserDefinedType not found - continuing with a stub

I have a basic Spark MLlib program as follows.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
class Sample {
  val conf = new SparkConf().setAppName("helloApp").setMaster("local")
  val sc = new SparkContext(conf)
  val data = sc.textFile("data/mllib/kmeans_data.txt")
  val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

  // Cluster the data into two classes using KMeans
  val numClusters = 2
  val numIterations = 20
  val clusters = KMeans.train(parsedData, numClusters, numIterations)

  // Export to PMML
  println("PMML Model:\n" + clusters.toPMML)
}
I have manually added spark-core, spark-mllib and spark-sql (all version 1.5.0) to the project classpath through IntelliJ.
I am getting the error below when I run the program. Any idea what's wrong?
Error:scalac: error while loading Vector, Missing dependency 'bad
symbolic reference. A signature in Vector.class refers to term types
in package org.apache.spark.sql which is not available. It may be
completely missing from the current classpath, or the version on the
classpath might be incompatible with the version used when compiling
Vector.class.', required by
/home/fazlann/Downloads/spark-mllib_2.10-1.5.0.jar(org/apache/spark/mllib/linalg/Vector.class
DesirePRG, I have met the same problem as yours. The solution is to add a jar which assembles Spark and Hadoop, such as spark-assembly-1.4.1-hadoop2.4.0.jar; then it works properly.
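An alternative sketch that avoids adding jars by hand: declare the three Spark modules with matching versions in the build, so that spark-sql (which transitively brings in the org.apache.spark.sql classes the compiler complains about) ends up on the compile classpath. The versions are taken from the question; treat the snippet as an assumption about the build setup:

// build.sbt (sketch): keep all Spark modules on the same version
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.5.0",
  "org.apache.spark" %% "spark-mllib" % "1.5.0",
  "org.apache.spark" %% "spark-sql"   % "1.5.0"
)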