How to write and update with the Kudu API in Spark 2.1 - Scala

I want to write and update data with the Kudu API.
These are the Maven dependencies:
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-client</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-spark2_2.11</artifactId>
<version>1.1.0</version>
</dependency>
In the following code, I am not sure what parameters the KuduContext takes.
My code in spark2-shell:
val kuduContext = new KuduContext("master:7051")
The same error also occurs in Spark 2.1 Streaming:
import org.apache.kudu.spark.kudu._
import org.apache.kudu.client._
val sparkConf = new SparkConf().setAppName("DirectKafka").setMaster("local[*]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val messages = KafkaUtils.createDirectStream("")
messages.foreachRDD(rdd => {
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  val bb = spark.read.options(Map("kudu.master" -> "master:7051", "kudu.table" -> "table")).kudu // good
  val kuduContext = new KuduContext("master:7051") // error
})
Then the error:
org.apache.spark.SparkException: Only one SparkContext may be running
in this JVM (see SPARK-2243). To ignore this error, set
spark.driver.allowMultipleContexts = true. The currently running
SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)

Update your version of Kudu to the latest one (currently 1.5.0). In later versions the KuduContext constructor takes the SparkContext as an input parameter, and that should prevent this problem.
Also, do the Spark initialization outside of the foreachRDD. In the code you provided, move both spark and kuduContext out of the foreach. You also do not need to create a separate SparkConf; you can use the newer SparkSession on its own.
val spark = SparkSession.builder.appName("DirectKafka").master("local[*]").getOrCreate()
import spark.implicits._
val kuduContext = new KuduContext("master:7051", spark.sparkContext)
val bb = spark.read.options(Map("kudu.master" -> "master:7051", "kudu.table" -> "table")).kudu
val messages = KafkaUtils.createDirectStream("")
messages.foreachRDD(rdd => {
  // do something with the bb table and messages
})
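For the writing and updating itself, the KuduContext created above exposes row operations that take a DataFrame. A minimal sketch, where df is a placeholder for any DataFrame whose schema matches an existing Kudu table named "table":
// df is a placeholder DataFrame whose schema matches the Kudu table
kuduContext.insertRows(df, "table")  // insert new rows
kuduContext.upsertRows(df, "table")  // insert new rows or update existing ones by primary key
kuduContext.updateRows(df, "table")  // update existing rows only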

Related

Reading file from Azure Data Lake Storage V2 with Spark 2.4

I am trying to read a simple CSV file from Azure Data Lake Storage V2 with Spark 2.4 in my IntelliJ IDE on macOS.
Code below:
package com.example
import org.apache.spark.SparkConf
import org.apache.spark.sql._
object Test extends App {
  val appName: String = "DataExtract"
  val master: String = "local[*]"
  val sparkConf: SparkConf = new SparkConf()
    .setAppName(appName)
    .setMaster(master)
    .set("spark.scheduler.mode", "FAIR")
    .set("spark.sql.session.timeZone", "UTC")
    .set("spark.sql.shuffle.partitions", "32")
    .set("fs.defaultFS", "abfs://development@xyz.dfs.core.windows.net/")
    .set("fs.azure.account.key.xyz.dfs.core.windows.net", "~~key~~")
  val spark: SparkSession = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()
  spark.time(run(spark))
  def run(spark: SparkSession): Unit = {
    val df = spark.read.csv("abfs://development@xyz.dfs.core.windows.net/development/sales.csv")
    df.show(10)
  }
}
It is not able to read the file and throws the following exception:
Exception in thread "main" java.lang.NullPointerException
at org.wildfly.openssl.CipherSuiteConverter.toJava(CipherSuiteConverter.java:284)
at org.wildfly.openssl.OpenSSLEngine.toJavaCipherSuite(OpenSSLEngine.java:1094)
at org.wildfly.openssl.OpenSSLEngine.getEnabledCipherSuites(OpenSSLEngine.java:729)
at org.wildfly.openssl.OpenSSLContextSPI.getCiphers(OpenSSLContextSPI.java:333)
at org.wildfly.openssl.OpenSSLContextSPI$1.getSupportedCipherSuites(OpenSSLContextSPI.java:365)
at org.apache.hadoop.fs.azurebfs.utils.SSLSocketFactoryEx.<init>(SSLSocketFactoryEx.java:105)
at org.apache.hadoop.fs.azurebfs.utils.SSLSocketFactoryEx.initializeDefaultFactory(SSLSocketFactoryEx.java:72)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.<init>(AbfsClient.java:79)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:817)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:149)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108)
Can anyone help me figure out what the mistake is?
Based on my research, you will receive this error message when you have a jar that is incompatible with your Hadoop version.
Please go through the issues below:
http://mail-archives.apache.org/mod_mbox/spark-issues/201907.mbox/%3CJIRA.13243325.1562321895000.591499.1562323440292#Atlassian.JIRA%3E
https://issues.apache.org/jira/browse/HADOOP-16410
I had the same issue and resolved it by adding wildfly-openssl version 1.0.7, as mentioned in the docs shared by @cheekatlapradeep-msft:
<dependency>
<groupId>org.wildfly.openssl</groupId>
<artifactId>wildfly-openssl</artifactId>
<version>1.0.7.Final</version>
</dependency>

How to submit Spark jobs to Apache Livy?

I am trying to understand how to submit a Spark job to Apache Livy.
I added the following dependencies to my POM.xml:
<dependency>
<groupId>com.cloudera.livy</groupId>
<artifactId>livy-api</artifactId>
<version>0.3.0</version>
</dependency>
<dependency>
<groupId>com.cloudera.livy</groupId>
<artifactId>livy-scala-api_2.11</artifactId>
<version>0.3.0</version>
</dependency>
Then I have the following code in Spark that I want to submit to Livy on request.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
object Test {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("Test")
      .master("local[*]")
      .getOrCreate()
    import spark.sqlContext.implicits._
    implicit val sparkContext = spark.sparkContext
    // ...
  }
}
I also have the following code that creates a LivyClient instance and uploads the application code to the Spark context:
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .build()
try {
  client.uploadJar(new File(testJarPath)).get()
  client.submit(new Test())
} finally {
  client.stop(true)
}
However, the problem is that the code of Test is not adapted to be used with Apache Livy.
How can I adjust the code of the Test object so that I can run client.submit(new Test())?
Your Test class needs to implement Livy's Job interface and override its call method, from which you get access to the JobContext/SparkContext. You can then pass the instance of Test to the submit method.
You don't have to create the SparkSession yourself; Livy will create it on the cluster, and you can access that context in your call method.
You can find more detailed information on Livy's programmatic API here: https://livy.incubator.apache.org/docs/latest/programmatic-api.html
Here's a sample implementation of the Test class:
import com.cloudera.livy.{Job, JobContext}
class Test extends Job[Int] {
  override def call(jc: JobContext): Int = {
    val spark = jc.sparkSession()
    // Do anything with the SparkSession
    1 // Return value
  }
}
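With Test implementing Job[Int], the submission code from the question can then retrieve the result through the returned JobHandle. A minimal sketch, assuming livyUrl and testJarPath are defined as in the question and the uploaded jar contains the compiled Test class:
import java.io.File
import java.net.URI
import com.cloudera.livy.LivyClientBuilder
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .build()
try {
  client.uploadJar(new File(testJarPath)).get()
  // submit returns a JobHandle[Int]; get() blocks until the job finishes
  val result: Int = client.submit(new Test()).get()
  println(s"Job returned: $result")
} finally {
  client.stop(true)
}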

Unable to create SparkContext object using Apache Spark 2.2 version

I use MS Windows 7.
Initially, I tried a program using Scala in Spark 1.6 and it worked fine (I got the SparkContext object as sc automatically).
When I tried Spark 2.2, I did not get sc automatically, so I created one with the following steps:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sc = new SparkConf().setAppName("myname").setMaster("mast")
new SparkContext(sc)
Now when I try to execute the parallelize method below, it gives me an error:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Error:
Value parallelize is not a member of org.apache.spark.SparkConf
I followed these steps using the official documentation only. Can anybody explain where I went wrong? Thanks in advance. :)
If spark-shell doesn't show this line on start:
Spark context available as 'sc' (master = local[*], app id = local-XXX).
Run
val sc = SparkContext.getOrCreate()
The issue is that you created sc as a SparkConf, not a SparkContext (the two names look similar).
To use the parallelize method in Spark 2.0 or any other version, sc should be a SparkContext and not a SparkConf. The correct code should look like this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sparkConf = new SparkConf().setAppName("myname").setMaster("mast")
val sc = new SparkContext(sparkConf)
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
This will give you the desired result.
You should prefer to use SparkSession, as it is the entry point for Spark from version 2 onwards. You could try something like:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()
val sc = spark.sparkContext
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
There is some problem with version 2.2.0 of Apache Spark. I replaced it with version 2.2.1, which is the latest one, and I am able to get the sc and spark variables automatically when I start spark-shell via cmd on Windows 7. I hope it helps someone.
I executed the code below, which creates an RDD, and it works perfectly. There is no need to import any packages.
val dataOne=sc.parallelize(1 to 10)
dataOne.collect() // Will print the numbers 1 to 10 as an array
Your code should look like this:
val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("myname")
val sc = new SparkContext(conf)
NOTE: the master URL should be local[*].

Parsing json in spark

I was using the Scala JSON library to parse a JSON file from a local drive in a Spark job:
val requestJson=JSON.parseFull(Source.fromFile("c:/data/request.json").mkString)
val mainJson=requestJson.get.asInstanceOf[Map[String,Any]].get("Request").get.asInstanceOf[Map[String,Any]]
val currency=mainJson.get("currency").get.asInstanceOf[String]
But when I try to use the same parser pointing to an HDFS file location, it doesn't work:
val requestJson=JSON.parseFull(Source.fromFile("hdfs://url/user/request.json").mkString)
and gives me an error:
java.io.FileNotFoundException: hdfs:/localhost/user/request.json (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
... 128 elided
How can I use JSON.parseFull to get data from an HDFS file location? Thanks.
Spark has built-in support for parsing JSON documents, which is available in the spark-sql_${scala.version} jar.
In Spark 2.0+:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val df = spark.read.json("json/file/location/in/hdfs")
df.show()
With the df object you can do all supported SQL operations, and its data processing will be distributed among the nodes, whereas requestJson will be computed on a single machine only.
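Once the DataFrame is loaded, nested fields can be accessed with ordinary column expressions. A rough sketch, assuming the JSON actually contains the Request.currency structure shown in the question:
// Select a nested field; dot notation works on struct columns
val currencyDf = df.select("Request.currency")
currencyDf.show()
// Or register the DataFrame as a view and use SQL
df.createOrReplaceTempView("requests")
spark.sql("SELECT Request.currency FROM requests").show()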
Maven dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.0</version>
</dependency>
Edit (as per the comment, to read a file from HDFS):
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.fs.{FileSystem, Path}
val hdfs = FileSystem.get(
  new java.net.URI("hdfs://ITS-Hadoop10:9000/"),
  new org.apache.hadoop.conf.Configuration()
)
val path = new Path("/user/zhc/" + x + "/") // x is a directory name from the original snippet
val t = hdfs.listStatus(path)               // list the files in that directory
val in = hdfs.open(t(0).getPath)            // open the first file
val reader = new BufferedReader(new InputStreamReader(in))
var l = reader.readLine()
Code credits: from another SO question.
Maven dependencies:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.2</version> <!-- you can change this as per your hadoop version -->
</dependency>
It is much easier in Spark 2.0:
val df = spark.read.json("json/file/location/in/hdfs")
df.show()
One can use the following in Spark to read the file from HDFS:
val jsonText = sc.textFile("hdfs://url/user/request.json").collect.mkString("\n")
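The collected text can then be fed to the same parser used in the question. A small sketch continuing from jsonText above, with the caveat that the whole file is pulled to the driver:
import scala.util.parsing.json.JSON
// Parse the collected text with the same library as in the question
val requestJson = JSON.parseFull(jsonText)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]]("Request").asInstanceOf[Map[String, Any]]
val currency = mainJson("currency").asInstanceOf[String]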

Spark Streaming Twitter - filter language

I know that this question has been treated in several posts, but I can't figure out how to solve my problem. I am trying to filter tweets by language. I have read in this forum that I have to use the twitter4j API. I have already added it to my dependencies:
<dependency>
<groupId>org.twitter4j</groupId>
<artifactId>twitter4j-stream</artifactId>
<version>3.0.3</version>
</dependency>
My code is:
import twitter4j.api
[....]
val sc = new SparkContext("local", "Simple", "$SPARK_HOME", List("target/streamingTwitter-1.0.jar"))
val ssc = new StreamingContext(sc, Seconds(10))
var filter = Array("filter")
val tweets = TwitterUtils.createStream(ssc, None, filter).filter(status => _.getLang == "es")
The error is:
cannot resolve symbol getlang
Why doesn't the compiler recognize the getLang method? This method is supposed to be implemented in the twitter4j API, right? Isn't it enough to import twitter4j and set the dependencies in order to use its methods?
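One thing to note, independent of the dependency question: the filter lambda in the posted code mixes a named parameter with the underscore placeholder (status => _.getLang), so the type that getLang is called on cannot be inferred. A small sketch with consistent syntax, assuming the twitter4j version on the classpath exposes Status.getLang:
// Use either a named parameter or the underscore placeholder in the lambda, not both
val tweets = TwitterUtils.createStream(ssc, None, filter)
  .filter(status => status.getLang == "es") // or: .filter(_.getLang == "es")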