JDBC-HiveServer:'client_protocol is unset!'-Both 1.1.1 in CS - scala

When I ask this question, I have already read many many article through google. Many answers show that is the mismatch version between client-side and server-side. So I decide to copy the jars from server-side to client-side directly, and the result is .... as you know, same exception:
org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{use:database=default})
It goes well when I connect to hiveserver2 through beeline :)
see my connection.
So, I think it will work when I use jdbc too. But, unfortunately, it throws that exception, below is my jars in my project.
hive-jdbc-1.1.1.jar
hive-jdbc-standalone.jar
hive-metastore-1.1.1.jar
hive-service-1.1.1.jar
those hive jars are copied from server-side.
def connect_hive(master:String){
val conf = new SparkConf()
.setMaster(master)
.setAppName("Hive")
.set("spark.local.dir", "./tmp");
val sc = new SparkContext(conf);
val sqlContext = new SQLContext(sc);
val url = "jdbc:hive2://192.168.40.138:10000";
val prop= new Properties();
prop.setProperty("user", "hive");
prop.setProperty("password", "hive");
prop.setProperty("driver", "org.apache.hive.jdbc.HiveDriver");
val conn = DriverManager.getConnection(url, prop);
sc.stop();
}
The configment of my server:
hadoop 2.7.3
spark 1.6.0
hive 1.1.1
Does anyone encounter the same situation when connecting hive through spark-JDBC?

Since beeline works, it is expected that your program also should execute correctly.
print current project class path
you can try some thing like this to understand your self.
import java.net.URL
import java.net.URLClassLoader
import scala.collection.JavaConversions._
object App {
def main(args: Array[String]) {
val cl = ClassLoader.getSystemClassLoader
val urls = cl.asInstanceOf[URLClassLoader].getURLs
for (url <- urls) {
println(url.getFile)
}
}
}
Also check hive.aux.jars.path=<file urls> to understand what jars are present in the classpath.

Related

How to Run Apache Tika on Apache Spark

I am trying to run Apache Tika on Apache Spark on AWS EMR to perform distributed text extraction on a large collection of documents. I have built the Tika JAR with shaded dependencies as explained in https://forums.databricks.com/questions/28378/trying-to-use-apache-tika-on-databricks.html and the job works correctly in local mode. However when running the job in clustered mode, the extracted text always comes out as an empty string. This problem is outlined in Tika's documentation (https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-NoContentExtracted), but I haven't been able to debug the issue. Since the code works for me in local mode it has to be something with the classpath or JARs, and I can't figure it out.
Here is sample Scala code for my Spark job:
/* TikaTest.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.tika.parser._
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.metadata.Metadata
import java.io.DataInputStream
// The first argument must be an S3 path to a directory with documents for text extraction.
// The second argument must be an S3 path to a directory where extracted text will be written.
object TikaTest {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Tika Test")
val sc = new SparkContext(conf)
val binRDD = sc.binaryFiles(args(0))
val textRDD = binRDD.map(file => {parseFile(file._2.open( ))})
textRDD.saveAsTextFile(args(1))
sc.stop()
}
def parseFile(stream: DataInputStream): String = {
val parser = new AutoDetectParser()
val handler = new BodyContentHandler()
val metadata = new Metadata()
val context = new ParseContext()
parser.parse(stream, handler, metadata, context)
return handler.toString()
}
}

Spark unable to find "spark-version-info.properties" when run from ammonite script

I have an ammonite script which creates a spark context:
#!/usr/local/bin/amm
import ammonite.ops._
import $ivy.`org.apache.spark:spark-core_2.11:2.0.1`
import org.apache.spark.{SparkConf, SparkContext}
#main
def main(): Unit = {
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("Demo"))
}
When I run this script, it throws an error:
Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: org.apache.spark.SparkException: Error while locating file spark-version-info.properties
...
Caused by: java.lang.NullPointerException
at java.util.Properties$LineReader.readLine(Properties.java:434)
at java.util.Properties.load0(Properties.java:353)
The script isn't being run from the spark installation directory and doesn't have any knowledge of it or the resources where this version information is packaged - it only knows about the ivy dependencies. So perhaps the issue is that this resource information isn't on the classpath in the ivy dependencies. I have seen other spark "standalone scripts" so I was hoping I could do the same here.
I poked around a bit to try and understand what was happening. I was hoping I could programmatically hack some build information into the system properties at runtime.
The source of the exception comes from package.scala in the spark library. The relevant bits of code are
val resourceStream = Thread.currentThread().getContextClassLoader.
getResourceAsStream("spark-version-info.properties")
try {
val unknownProp = "<unknown>"
val props = new Properties()
props.load(resourceStream) <--- causing a NPE?
(
props.getProperty("version", unknownProp),
// Load some other properties
)
} catch {
case npe: NullPointerException =>
throw new SparkException("Error while locating file spark-version-info.properties", npe)
It seems that the implicit assumption is that props.load will fail with a NPE if the version information can't be found in the resources. (That's not so clear to the reader!)
The NPE itself looks like it's coming from this code in java.util.Properties.java:
class LineReader {
public LineReader(InputStream inStream) {
this.inStream = inStream;
inByteBuf = new byte[8192];
}
...
InputStream inStream;
Reader reader;
int readLine() throws IOException {
...
inLimit = (inStream==null)?reader.read(inCharBuf)
:inStream.read(inByteBuf);
The LineReader is constructed with a null InputStream which the class internally interprets as meaning that the reader is non-null and should be used instead - but it's also null. (Is this kind of stuff really in the standard library? Seems very unsafe...)
From looking at the bin/spark-shell that comes with spark, it adds -Dscala.usejavacp=true when it launches spark-submit. Is this the right direction?
Thanks for your help!
Following seems to work on 2.11 with 1.0.1 version but not experimental.
Could be just better implemented on Spark 2.2
#!/usr/local/bin/amm
import ammonite.ops._
import $ivy.`org.apache.spark:spark-core_2.11:2.2.0`
import $ivy.`org.apache.spark:spark-sql_2.11:2.2.0`
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.SparkSession
#main
def main(): Unit = {
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("Demo"))
}
or more expanded answer:
#main
def main(): Unit = {
val spark = SparkSession.builder()
.appName("testings")
.master("local")
.config("configuration key", "configuration value")
.getOrCreate
val sqlContext = spark.sqlContext
val tdf2 = spark.read.option("delimiter", "|").option("header", true).csv("./tst.dat")
tdf2.show()
}

Spark and HBase Snapshots

Under the assumption that we could access data much faster if pulling directly from HDFS instead of using the HBase API, we're trying to build an RDD based on a Table Snapshot from HBase.
So, I have a snapshot called "dm_test_snap". I seem to be able to get most of the configuration stuff working, but my RDD is null (despite there being data in the Snapshot itself).
I'm having a hell of a time finding an example of anyone doing offline analysis of HBase snapshots with Spark, but I can't believe I'm alone in trying to get this working. Any help or suggestions are greatly appreciated.
Here is a snippet of my code:
object TestSnap {
def main(args: Array[String]) {
val config = ConfigFactory.load()
val hbaseRootDir = config.getString("hbase.rootdir")
val sparkConf = new SparkConf()
.setAppName("testnsnap")
.setMaster(config.getString("spark.app.master"))
.setJars(SparkContext.jarOfObject(this))
.set("spark.executor.memory", "2g")
.set("spark.default.parallelism", "160")
val sc = new SparkContext(sparkConf)
println("Creating hbase configuration")
val conf = HBaseConfiguration.create()
conf.set("hbase.rootdir", hbaseRootDir)
conf.set("hbase.zookeeper.quorum", config.getString("hbase.zookeeper.quorum"))
conf.set("zookeeper.session.timeout", config.getString("zookeeper.session.timeout"))
conf.set("hbase.TableSnapshotInputFormat.snapshot.name", "dm_test_snap")
val scan = new Scan
val job = Job.getInstance(conf)
TableSnapshotInputFormat.setInput(job, "dm_test_snap",
new Path("hdfs://nameservice1/tmp"))
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableSnapshotInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
hBaseRDD.count()
System.exit(0)
}
}
Update to include the solution
The trick was, as #Holden mentioned below, the conf wasn't getting passed through. To remedy this, I was able to get it working by changing this the call to newAPIHadoopRDD to this:
val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
There was a second issue that was also highlighted by #victor's answer, that I was not passing in a scan. To fix that, I added this line and method:
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
def convertScanToString(scan : Scan) = {
val proto = ProtobufUtil.toScan(scan);
Base64.encodeBytes(proto.toByteArray());
}
This also let me pull out this line from the conf.set commands:
conf.set("hbase.TableSnapshotInputFormat.snapshot.name", "dm_test_snap")
*NOTE: This was for HBase version 0.96.1.1 on CDH5.0
Final full code for easy reference:
object TestSnap {
def main(args: Array[String]) {
val config = ConfigFactory.load()
val hbaseRootDir = config.getString("hbase.rootdir")
val sparkConf = new SparkConf()
.setAppName("testnsnap")
.setMaster(config.getString("spark.app.master"))
.setJars(SparkContext.jarOfObject(this))
.set("spark.executor.memory", "2g")
.set("spark.default.parallelism", "160")
val sc = new SparkContext(sparkConf)
println("Creating hbase configuration")
val conf = HBaseConfiguration.create()
conf.set("hbase.rootdir", hbaseRootDir)
conf.set("hbase.zookeeper.quorum", config.getString("hbase.zookeeper.quorum"))
conf.set("zookeeper.session.timeout", config.getString("zookeeper.session.timeout"))
val scan = new Scan
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val job = Job.getInstance(conf)
TableSnapshotInputFormat.setInput(job, "dm_test_snap",
new Path("hdfs://nameservice1/tmp"))
val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
hBaseRDD.count()
System.exit(0)
}
def convertScanToString(scan : Scan) = {
val proto = ProtobufUtil.toScan(scan);
Base64.encodeBytes(proto.toByteArray());
}
}
Looking at the Job information, its making a copy of the conf object you are supplying to it (The Job makes a copy of the Configuration so that any necessary internal modifications do not reflect on the incoming parameter.) so most likely the information that you need to set on the conf object isn't getting passed down to Spark. You could instead use TableSnapshotInputFormatImpl which has a similar method that works on conf objects. There might be additional things needed but at first pass through the problem this seems like the most likely cause.
As pointed out in the comments, another option is to use job.getConfiguration to get the updated config from the job object.
You have not configured your M/R job properly:
This is example in Java on how to configure M/R over snapshots:
Job job = new Job(conf);
Scan scan = new Scan();
TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName,
scan, MyTableMapper.class, MyMapKeyOutput.class,
MyMapOutputValueWritable.class, job, true);
}
You, definitely, skipped Scan. I suggest you taking a look at TableMapReduceUtil's initTableSnapshotMapperJob implementation to get idea how to configure job in Spark/Scala.
Here is complete configuration in mapreduce Java
TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName, // Name of the snapshot
scan, // Scan instance to control CF and attribute selection
DefaultMapper.class, // mapper class
NullWritable.class, // mapper output key
Text.class, // mapper output value
job,
true,
restoreDir);

twitterStream not found

I'm trying to compile my first scala program and I'm using twitterStream to get tweets, here is a snippet of my code:
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
import TutorialHelper._
object Tutorial {
def main(args: Array[String]) {
// Location of the Spark directory
val sparkHome = "/home/shaza90/spark-1.1.0"
// URL of the Spark cluster
val sparkUrl = TutorialHelper.getSparkUrl()
// Location of the required JAR files
val jarFile = "target/scala-2.10/tutorial_2.10-0.1-SNAPSHOT.jar"
// HDFS directory for checkpointing
val checkpointDir = TutorialHelper.getHdfsUrl() + "/checkpoint/"
// Configure Twitter credentials using twitter.txt
TutorialHelper.configureTwitterCredentials()
val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
val tweets = ssc.twitterStream()
val statuses = tweets.map(status => status.getText())
statuses.print()
ssc.checkpoint(checkpointDir)
ssc.start()
}
}
When compiling I'm getting this error message:
value twitterStream is not a member of org.apache.spark.streaming.StreamingContext
Do you know if I'm missing any library or dependency?
In this case you want a stream of tweets. We all know that Sparks provides Streams. Now, lets check if Spark itself provides something for interacting with twitter specifically.
Open Spark API-docs -> http://spark.apache.org/docs/1.2.0/api/scala/index.html#package
Now search for twitter and bingo... there is something called TwitterUtils in package org.apache.spark.streaming. Now since it is called TwitterUtils and is in package org.apache.spark.streaming, I think it will provide helpers to create stream from twitter API's.
Now lets click on TwitterUtils and goto -> http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.streaming.dstream.ReceiverInputDStream
And yup... it has a method with following signature
def createStream(
ssc: StreamingContext,
twitterAuth: Option[Authorization],
filters: Seq[String] = Nil,
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[Status]
It returns a ReceiverInputDStream[ Status ] where Status is twitter4j.Status.
Parameters are further explained
ssc
StreamingContext object
twitterAuth
Twitter4J authentication, or None to use Twitter4J's default OAuth authorization; this uses the system properties twitter4j.oauth.consumerKey, twitter4j.oauth.consumerSecret, twitter4j.oauth.accessToken and twitter4j.oauth.accessTokenSecret
filters
Set of filter strings to get only those tweets that match them
storageLevel
Storage level to use for storing the received objects
See... API docs are simple. I believe, now you should be a little more motivated to read API docs.
And... This means you need to look a little( at least getting started part ) into twitter4j documentation too.
NOTE :: This answer is specifically written to explain "Why not to shy
away from API docs ?". And was written after careful thoughts. So
please, do not edit unless your edit makes some significant
contribution.

Integrerate Spark Streaming and Kafka, but get bad symbolic reference error

I am trying to integrate Spark streaming and Kafka. I wrote my source code in intellij idea IDE; the complier compiled the code without any error, but when I try to build the jar file, an error message is generated that shows:
Error:scalac: bad symbolic reference. A signature in KafkaUtils.class
refers to term kafka in package which is not available. It may
be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when
compiling KafkaUtils.class.
I did research on google, many people say this is because of different versions between Scala version and spark_streaming_kafka jar file. But I have checked the version, they are the same.
Does someone know why this error happened?
Here are more details:
scala version : 2.10
spark streaming kafka jar version : spark_streaming_kafka_2.10-1.20.jar,spark_streaming_2.10-1.20.jar
My source code:
object Kafka {
def main(args: Array[String]) {
val master = "local[*]"
val zkQuorum = "localhost:2181"
val group = ""
val topics = "test"
val numThreads = 1
val conf = new SparkConf().setAppName("Kafka")
val ssc = new StreamingContext(conf, Seconds(2))
val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
val words = lines.flatMap(_.split(" "))
words.print()
ssc.start()
ssc.awaitTermination()
}
}