How to save a JSON file in ElasticSearch using Spark? - scala

I am trying to save a JSON file in Elasticsearch, but it's not working.
This is my code:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._
import org.apache.spark.SparkConf
object HelloEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WriteToES").setMaster("local")
    conf.set("es.index.auto.create", "true")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val sen_p = sqlContext.read.json("/home/Bureau/mydoc/Orange.json")
    sen_p.registerTempTable("sensor_ptable")
    sen_p.saveToEs("sensor/metrics")
  }
}
I am also getting this error:
Exception in thread "main" java.lang.NoSuchMethodError: org.elasticsearch.spark.sql.package$.sparkDataFrameFunctions(Lorg/apache/spark/sql/Dataset;)Lorg/elasticsearch/spark/sql/package$SparkDataFrameFunctions;
at learnscala.HelloEs$.main(HelloEs.scala:20)
at learnscala.HelloEs.main(HelloEs.scala)

There are multiple ways to save an RDD / DataFrame to Elasticsearch.
A Spark DataFrame can be written to Elasticsearch using:
df.write.format("org.elasticsearch.spark.sql").mode("append").option("es.resource","<ES_RESOURCE_PATH>").option("es.nodes", "http://<ES_HOST>:9200").save()
An RDD can be written to Elasticsearch using:
import org.elasticsearch.spark.rdd.EsSpark
EsSpark.saveToEs(rdd, "<ES_RESOURCE_PATH>")
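For illustration, a minimal self-contained RDD sketch might look like the following; the Metric case class and the sensor/metrics resource path are made up for the example, and es.nodes should point at your own cluster:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark.rdd.EsSpark

// Illustrative document type; its fields become the JSON fields in Elasticsearch.
case class Metric(sensor: String, value: Double)

object RddToEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddToES").setMaster("local")
    conf.set("es.index.auto.create", "true")
    conf.set("es.nodes", "localhost:9200") // adjust to your Elasticsearch host
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(Seq(Metric("s1", 1.0), Metric("s2", 2.0)))
    EsSpark.saveToEs(rdd, "sensor/metrics") // illustrative index/type resource path
  }
}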
In your case, modify the code as below:
object HelloEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WriteToES").setMaster("local")
    conf.set("es.index.auto.create", "true")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val sen_p = sqlContext.read.json("/home/Bureau/mydoc/Orange.json")
    sen_p.write.format("org.elasticsearch.spark.sql").mode("append").option("es.resource", "<ES_RESOURCE_PATH>").option("es.nodes", "http://<ES_HOST>:9200").save()
  }
}
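The NoSuchMethodError in the question usually means the elasticsearch-spark connector on the classpath was built against a different Spark version than the one you are running. A hedged build.sbt sketch; the versions below are illustrative and should be aligned with your actual Spark and Elasticsearch releases:
// build.sbt -- illustrative versions; keep the Scala, Spark and
// elasticsearch-spark artifacts consistent with your cluster.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"             % "2.4.8" % "provided",
  "org.apache.spark"  %% "spark-sql"              % "2.4.8" % "provided",
  "org.elasticsearch" %% "elasticsearch-spark-20" % "7.10.2"
)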

Related

HBase Spark query with a filter, without loading the whole HBase table

I have to query HBase and then work with the data using Spark and Scala.
My problem is that my current solution loads ALL the data from my HBase table and then filters it, which is not efficient and takes too much memory. I would like to apply the filter directly; how can I do that?
def HbaseSparkQuery(table: String, gatewayINPUT: String, sparkContext: SparkContext): DataFrame = {
  val sqlContext = new SQLContext(sparkContext)
  import sqlContext.implicits._
  val conf = HBaseConfiguration.create()
  val tableName = table
  conf.set("hbase.zookeeper.quorum", "localhost")
  conf.set("hbase.master", "localhost:60000")
  conf.set(TableInputFormat.INPUT_TABLE, tableName)
  val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
  val DATAFRAME = hBaseRDD.map(x => {
    (Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"))),
      Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("eventTime"))),
      Bytes.toString(x._2.getValue(Bytes.toBytes("node"), Bytes.toBytes("imei"))),
      Bytes.toString(x._2.getValue(Bytes.toBytes("measure"), Bytes.toBytes("rssi"))))
  }).toDF()
    .withColumnRenamed("_1", "GatewayIMEA")
    .withColumnRenamed("_2", "EventTime")
    .withColumnRenamed("_3", "ap")
    .withColumnRenamed("_4", "RSSI")
    .filter($"GatewayIMEA" === gatewayINPUT)
  DATAFRAME
}
As you can see, I apply the filter after creating the DataFrame, i.e. after loading the HBase data.
Thank you in advance for your answers.
Here is the solution I found:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.filter._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object HbaseConnector {
  def main(args: Array[String]): Unit = {
    // System.setProperty("hadoop.home.dir", "/usr/local/hadoop")
    val sparkConf = new SparkConf().setAppName("CoverageAlgPipeline").setMaster("local[*]")
    val sparkContext = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sparkContext)
    import sqlContext.implicits._
    val spark = org.apache.spark.sql.SparkSession.builder
      .master("local")
      .appName("Coverage Algorithm")
      .getOrCreate
    val GatewayIMEA = "123"
    val TABLE_NAME = "TABLE"
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("hbase.master", "localhost:60000")
    conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME)
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf(TABLE_NAME))
    val scan = new Scan
    val GatewayIDFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(String.valueOf(GatewayIMEA)))
    scan.setFilter(GatewayIDFilter)
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
    val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    val DATAFRAME = hBaseRDD.map(x => {
      (Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"))),
        Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("eventTime"))),
        Bytes.toString(x._2.getValue(Bytes.toBytes("node"), Bytes.toBytes("imei"))),
        Bytes.toString(x._2.getValue(Bytes.toBytes("measure"), Bytes.toBytes("Measure"))))
    }).toDF()
      .withColumnRenamed("_1", "GatewayIMEA")
      .withColumnRenamed("_2", "EventTime")
      .withColumnRenamed("_3", "ap")
      .withColumnRenamed("_4", "measure")
    DATAFRAME.show()
  }
}
The idea is to set the input table, define the filter, run the scan with that filter, load the scan result into an RDD, and then (optionally) transform the RDD into a DataFrame.
To apply multiple filters:
val timestampFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("eventTime"), CompareFilter.CompareOp.GREATER, Bytes.toBytes(String.valueOf(dateOfDayTimestamp)))
val GatewayIDFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(String.valueOf(GatewayIMEA)))
val filters = new FilterList(GatewayIDFilter, timestampFilter)
scan.setFilter(filters)
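A FilterList combines its filters with MUST_PASS_ALL semantics by default (every filter has to match); if you want OR semantics instead, the operator can be passed explicitly. A small sketch reusing the two filters above:
// MUST_PASS_ONE keeps a row if any filter matches; MUST_PASS_ALL (the default) requires all of them.
val anyMatch = new FilterList(FilterList.Operator.MUST_PASS_ONE, GatewayIDFilter, timestampFilter)
scan.setFilter(anyMatch)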
You can also use a Spark-HBase connector with predicate pushdown, e.g. https://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

How to pass HiveContext as an argument from one function to another using Spark Scala

I have a scenario where I need to pass a HiveContext as an argument to another function. Below is my code and where I am stuck:
object Sample {
  def main(args: Array[String]) {
    val fileName = "SampleFile.txt"
    val conf = new SparkConf().setMaster("local").setAppName("LoadToHivePart")
    conf.set("spark.ui.port", "4041")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.setConf("hive.metastore.uris", "thrift://127.0.0.1:9083")
    test(hc, fileName)
    sc.stop()
  }
  def test(hc: String, fileName: String) {
    //code.....
  }
}
As per the above code, I am unable to pass the HiveContext variable "hc" from main to another function. I also tried:
def test(hc: HiveContext, fileName: String){}
but it shows an error in both cases.
def test(hc:HiveContext, fileName: String){
//code.....
}
Note: HiveContext lives in org.apache.spark.sql.hive, so import it with import org.apache.spark.sql.hive.HiveContext.
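A minimal sketch of the corrected version; the query inside test is illustrative:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Sample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("LoadToHivePart")
    val sc = new SparkContext(conf)
    val hc = new HiveContext(sc)
    test(hc, "SampleFile.txt")
    sc.stop()
  }

  // Typing the parameter as HiveContext (instead of String) lets the callee run Hive queries.
  def test(hc: HiveContext, fileName: String): Unit = {
    hc.sql("SHOW TABLES").show() // illustrative use of the passed-in context
  }
}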

How to declare an empty dataset in Spark?

I am new to Spark and Spark Datasets. I was trying to declare an empty Dataset using emptyDataset, but it asked for an org.apache.spark.sql.Encoder. The data type I am using for the Dataset is an object of the case class Tp(s1: String, s2: String, s3: String).
All you need is to import the implicit encoders from the SparkSession instance before you create the empty Dataset: import spark.implicits._
See the full example below:
EmptyDataFrame
package com.examples.sparksql
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object EmptyDataFrame {
  def main(args: Array[String]) {
    //Create Spark Conf
    val sparkConf = new SparkConf().setAppName("Empty-Data-Frame").setMaster("local")
    //Create Spark Context - sc
    val sc = new SparkContext(sparkConf)
    //Create Sql Context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    //Import Sql Implicit conversions
    import sqlContext.implicits._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}
    //Create Schema RDD
    val schema_string = "name,id,dept"
    val schema_rdd = StructType(schema_string.split(",").map(fieldName => StructField(fieldName, StringType, true)))
    //Create Empty DataFrame
    val empty_df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema_rdd)
    //Some Operations on Empty Data Frame
    empty_df.show()
    println(empty_df.count())
    //You can register a Table on Empty DataFrame, it's empty table though
    empty_df.registerTempTable("empty_table")
    //let's check it ;)
    val res = sqlContext.sql("select * from empty_table")
    res.show
  }
}
Alternatively, you can convert an empty list into a Dataset:
import sparkSession.implicits._
case class Employee(name: String, id: Int)
val ds: Dataset[Employee] = List.empty[Employee].toDS()
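Applied to the case class from the original question, a minimal self-contained sketch might look like this (the session settings are illustrative):
import org.apache.spark.sql.{Dataset, SparkSession}

// Case class from the question; define it outside the method so an implicit Encoder can be derived.
case class Tp(s1: String, s2: String, s3: String)

object EmptyTpDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("EmptyTpDataset").getOrCreate()
    import spark.implicits._ // brings the implicit Encoder[Tp] into scope

    val ds1: Dataset[Tp] = spark.emptyDataset[Tp] // via emptyDataset
    val ds2: Dataset[Tp] = List.empty[Tp].toDS()  // via an empty list
    ds1.printSchema()
    println(ds1.count()) // 0
  }
}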

Importing Spark libraries using Intellij IDEA

I would like to use Spark SQL in an IntelliJ IDEA SBT project.
Even though I have imported the library, the code does not seem to import it.
Spark Core seems to be working, however.
You can't create a DataFrame from a Scala List[A]. You first need to create an RDD[A] and then convert that to a DataFrame. You also need an SQLContext:
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val test = sc.parallelize(List(1, 2, 3, 4)).toDF
For reference, this is what the Spark 2.0 boilerplate with Spark SQL should look like:
import org.apache.spark.sql.SparkSession
object Test {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local")
      .appName("some name")
      .getOrCreate()
    import spark.sqlContext.implicits._
  }
}
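Since this is an SBT project, also make sure spark-sql is actually declared as a dependency and not just spark-core. A hedged build.sbt sketch with illustrative versions:
// build.sbt -- illustrative versions; keep spark-core and spark-sql on the same version
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8",
  "org.apache.spark" %% "spark-sql"  % "2.4.8"
)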

More than one spark context error

I have this spark code below:
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object Hbase {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark-Hbase").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    ...
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    val kafkaBrokers = Map("metadata.broker.list" -> "localhost:9092")
    val topics = List("test").toSet
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaBrokers, topics)
  }
}
Now the error I am getting is:
Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true.
Is there anything wrong with my code above? I do not see where I am creating the context again...
These are the two contexts you're creating, which is not allowed:
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sparkConf, Seconds(3))
You should create the streaming context from the original context.
val ssc = new StreamingContext(sc, Seconds(3))
You are initializing two Spark contexts in the same JVM (the SparkContext and the StreamingContext); that's why you are getting this exception. You can set spark.driver.allowMultipleContexts = true in the config, but running multiple Spark contexts is discouraged and can produce unexpected results.
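A minimal corrected sketch under the same setup, with the Kafka details taken from the question and an illustrative print action on the stream:
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object Hbase {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Spark-Hbase").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    // Build the StreamingContext from the existing SparkContext instead of from the conf,
    // so only one SparkContext is ever created in this JVM.
    val ssc = new StreamingContext(sc, Seconds(3))
    val kafkaBrokers = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("test")
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaBrokers, topics)
    lines.map(_._2).print() // illustrative action on the message values
    ssc.start()
    ssc.awaitTermination()
  }
}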