How to save a JSON file in ElasticSearch using Spark? - scala

I am trying to save a JSON file in Elasticsearch, but it's not working.
This is my code:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._
import org.apache.spark.SparkConf
object HelloEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WriteToES").setMaster("local")
    conf.set("es.index.auto.create", "true")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val sen_p = sqlContext.read.json("/home/Bureau/mydoc/Orange.json")
    sen_p.registerTempTable("sensor_ptable")
    sen_p.saveToEs("sensor/metrics")
  }
}
I am also getting this error:
Exception in thread "main" java.lang.NoSuchMethodError: org.elasticsearch.spark.sql.package$.sparkDataFrameFunctions(Lorg/apache/spark/sql/Dataset;)Lorg/elasticsearch/spark/sql/package$SparkDataFrameFunctions;
at learnscala.HelloEs$.main(HelloEs.scala:20)
at learnscala.HelloEs.main(HelloEs.scala)

There are multiple ways to save an RDD / DataFrame to Elasticsearch.
A Spark DataFrame can be written to Elasticsearch using:
df.write.format("org.elasticsearch.spark.sql").mode("append").option("es.resource","<ES_RESOURCE_PATH>").option("es.nodes", "http://<ES_HOST>:9200").save()
An RDD can be written to Elasticsearch using:
import org.elasticsearch.spark.rdd.EsSpark
EsSpark.saveToEs(rdd, "<ES_RESOURCE_PATH>")
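For illustration, a minimal self-contained RDD sketch might look like the following; the Metric case class and the sensor/metrics resource path are made up for the example, and es.nodes should point at your own cluster:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark.rdd.EsSpark

// Illustrative document type; its fields become the JSON fields in Elasticsearch.
case class Metric(sensor: String, value: Double)

object RddToEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddToES").setMaster("local")
    conf.set("es.index.auto.create", "true")
    conf.set("es.nodes", "localhost:9200") // adjust to your Elasticsearch host
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(Seq(Metric("s1", 1.0), Metric("s2", 2.0)))
    EsSpark.saveToEs(rdd, "sensor/metrics") // illustrative index/type resource path
  }
}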
In your case, modify the code as below:
object HelloEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WriteToES").setMaster("local")
    conf.set("es.index.auto.create", "true")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val sen_p = sqlContext.read.json("/home/Bureau/mydoc/Orange.json")
    sen_p.write.format("org.elasticsearch.spark.sql").mode("append").option("es.resource", "<ES_RESOURCE_PATH>").option("es.nodes", "http://<ES_HOST>:9200").save()
  }
}
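The NoSuchMethodError in the question usually means the elasticsearch-spark connector on the classpath was built against a different Spark version than the one you are running. A hedged build.sbt sketch; the versions below are illustrative and should be aligned with your actual Spark and Elasticsearch releases:
// build.sbt -- illustrative versions; keep the Scala, Spark and
// elasticsearch-spark artifacts consistent with your cluster.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"             % "2.4.8" % "provided",
  "org.apache.spark"  %% "spark-sql"              % "2.4.8" % "provided",
  "org.elasticsearch" %% "elasticsearch-spark-20" % "7.10.2"
)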

Related

HBase Spark query with a filter, without loading the whole HBase table

I have to query HBase and then work with the data using Spark and Scala.
My problem is that my current solution loads ALL the data from my HBase table and then filters it, which is not efficient and takes too much memory. I would like to apply the filter directly; how can I do that?
def HbaseSparkQuery(table: String, gatewayINPUT: String, sparkContext: SparkContext): DataFrame = {
  val sqlContext = new SQLContext(sparkContext)
  import sqlContext.implicits._
  val conf = HBaseConfiguration.create()
  val tableName = table
  conf.set("hbase.zookeeper.quorum", "localhost")
  conf.set("hbase.master", "localhost:60000")
  conf.set(TableInputFormat.INPUT_TABLE, tableName)
  val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
  val DATAFRAME = hBaseRDD.map(x => {
    (Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"))),
      Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("eventTime"))),
      Bytes.toString(x._2.getValue(Bytes.toBytes("node"), Bytes.toBytes("imei"))),
      Bytes.toString(x._2.getValue(Bytes.toBytes("measure"), Bytes.toBytes("rssi"))))
  }).toDF()
    .withColumnRenamed("_1", "GatewayIMEA")
    .withColumnRenamed("_2", "EventTime")
    .withColumnRenamed("_3", "ap")
    .withColumnRenamed("_4", "RSSI")
    .filter($"GatewayIMEA" === gatewayINPUT)
  DATAFRAME
}
As you can see, I apply the filter after creating the DataFrame, i.e. after loading the HBase data.
Thank you in advance for your answers.
Here is the solution I found:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.filter._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object HbaseConnector {
  def main(args: Array[String]): Unit = {
    // System.setProperty("hadoop.home.dir", "/usr/local/hadoop")
    val sparkConf = new SparkConf().setAppName("CoverageAlgPipeline").setMaster("local[*]")
    val sparkContext = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sparkContext)
    import sqlContext.implicits._
    val spark = org.apache.spark.sql.SparkSession.builder
      .master("local")
      .appName("Coverage Algorithm")
      .getOrCreate
    val GatewayIMEA = "123"
    val TABLE_NAME = "TABLE"
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("hbase.master", "localhost:60000")
    conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME)
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf(TABLE_NAME))
    val scan = new Scan
    val GatewayIDFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(String.valueOf(GatewayIMEA)))
    scan.setFilter(GatewayIDFilter)
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
    val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    val DATAFRAME = hBaseRDD.map(x => {
      (Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"))),
        Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("eventTime"))),
        Bytes.toString(x._2.getValue(Bytes.toBytes("node"), Bytes.toBytes("imei"))),
        Bytes.toString(x._2.getValue(Bytes.toBytes("measure"), Bytes.toBytes("Measure"))))
    }).toDF()
      .withColumnRenamed("_1", "GatewayIMEA")
      .withColumnRenamed("_2", "EventTime")
      .withColumnRenamed("_3", "ap")
      .withColumnRenamed("_4", "measure")
    DATAFRAME.show()
  }
}
The idea is to set the input table, define the filter, run the scan with that filter, load the scan result into an RDD, and then (optionally) transform the RDD into a DataFrame.
To apply multiple filters:
val timestampFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("eventTime"), CompareFilter.CompareOp.GREATER, Bytes.toBytes(String.valueOf(dateOfDayTimestamp)))
val GatewayIDFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(String.valueOf(GatewayIMEA)))
val filters = new FilterList(GatewayIDFilter, timestampFilter)
scan.setFilter(filters)
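A FilterList combines its filters with MUST_PASS_ALL semantics by default (every filter has to match); if you want OR semantics instead, the operator can be passed explicitly. A small sketch reusing the two filters above:
// MUST_PASS_ONE keeps a row if any filter matches; MUST_PASS_ALL (the default) requires all of them.
val anyMatch = new FilterList(FilterList.Operator.MUST_PASS_ONE, GatewayIDFilter, timestampFilter)
scan.setFilter(anyMatch)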
You can also use a Spark-HBase connector with predicate pushdown, e.g. https://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

How to pass HiveContext as an argument from one function to another using Spark Scala

I have a scenario where I need to pass a HiveContext as an argument to another function. Below is my code and where I am stuck:
object Sample {
  def main(args: Array[String]) {
    val fileName = "SampleFile.txt"
    val conf = new SparkConf().setMaster("local").setAppName("LoadToHivePart")
    conf.set("spark.ui.port", "4041")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.setConf("hive.metastore.uris", "thrift://127.0.0.1:9083")
    test(hc, fileName)
    sc.stop()
  }
  def test(hc: String, fileName: String) {
    //code.....
  }
}
As per the above code, I am unable to pass the HiveContext variable "hc" from main to another function. I also tried:
def test(hc: HiveContext, fileName: String){}
but it shows an error in both cases.
def test(hc:HiveContext, fileName: String){
//code.....
}
Note: HiveContext lives in org.apache.spark.sql.hive, so import it with import org.apache.spark.sql.hive.HiveContext.
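A minimal sketch of the corrected version; the query inside test is illustrative:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Sample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("LoadToHivePart")
    val sc = new SparkContext(conf)
    val hc = new HiveContext(sc)
    test(hc, "SampleFile.txt")
    sc.stop()
  }

  // Typing the parameter as HiveContext (instead of String) lets the callee run Hive queries.
  def test(hc: HiveContext, fileName: String): Unit = {
    hc.sql("SHOW TABLES").show() // illustrative use of the passed-in context
  }
}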

How to declare an empty dataset in Spark?

I am new to Spark and Spark Datasets. I was trying to declare an empty Dataset using emptyDataset, but it asked for an org.apache.spark.sql.Encoder. The data type I am using for the Dataset is an object of the case class Tp(s1: String, s2: String, s3: String).
All you need is to import the implicit encoders from the SparkSession instance before you create the empty Dataset: import spark.implicits._
See the full example below:
EmptyDataFrame
package com.examples.sparksql
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object EmptyDataFrame {
  def main(args: Array[String]) {
    //Create Spark Conf
    val sparkConf = new SparkConf().setAppName("Empty-Data-Frame").setMaster("local")
    //Create Spark Context - sc
    val sc = new SparkContext(sparkConf)
    //Create Sql Context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    //Import Sql Implicit conversions
    import sqlContext.implicits._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}
    //Create Schema RDD
    val schema_string = "name,id,dept"
    val schema_rdd = StructType(schema_string.split(",").map(fieldName => StructField(fieldName, StringType, true)))
    //Create Empty DataFrame
    val empty_df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema_rdd)
    //Some Operations on Empty Data Frame
    empty_df.show()
    println(empty_df.count())
    //You can register a Table on Empty DataFrame, it's empty table though
    empty_df.registerTempTable("empty_table")
    //let's check it ;)
    val res = sqlContext.sql("select * from empty_table")
    res.show
  }
}
Alternatively, you can convert an empty list into a Dataset:
import sparkSession.implicits._
case class Employee(name: String, id: Int)
val ds: Dataset[Employee] = List.empty[Employee].toDS()
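Applied to the case class from the original question, a minimal self-contained sketch might look like this (the session settings are illustrative):
import org.apache.spark.sql.{Dataset, SparkSession}

// Case class from the question; define it outside the method so an implicit Encoder can be derived.
case class Tp(s1: String, s2: String, s3: String)

object EmptyTpDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("EmptyTpDataset").getOrCreate()
    import spark.implicits._ // brings the implicit Encoder[Tp] into scope

    val ds1: Dataset[Tp] = spark.emptyDataset[Tp] // via emptyDataset
    val ds2: Dataset[Tp] = List.empty[Tp].toDS()  // via an empty list
    ds1.printSchema()
    println(ds1.count()) // 0
  }
}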

Importing Spark libraries using Intellij IDEA

I would like to use Spark SQL in an IntelliJ IDEA SBT project.
Even though I have imported the library, the code does not seem to import it.
Spark Core seems to be working, however.
You can't create a DataFrame from a Scala List[A]. You first need to create an RDD[A] and then convert that to a DataFrame. You also need an SQLContext:
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val test = sc.parallelize(List(1, 2, 3, 4)).toDF
For reference, this is what the Spark 2.0 boilerplate with Spark SQL should look like:
import org.apache.spark.sql.SparkSession
object Test {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local")
      .appName("some name")
      .getOrCreate()
    import spark.sqlContext.implicits._
  }
}
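Since this is an SBT project, also make sure spark-sql is actually declared as a dependency and not just spark-core. A hedged build.sbt sketch with illustrative versions:
// build.sbt -- illustrative versions; keep spark-core and spark-sql on the same version
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8",
  "org.apache.spark" %% "spark-sql"  % "2.4.8"
)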

More than one spark context error

I have this spark code below:
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object Hbase {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark-Hbase").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    ...
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    val kafkaBrokers = Map("metadata.broker.list" -> "localhost:9092")
    val topics = List("test").toSet
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaBrokers, topics)
  }
}
Now the error I am getting is:
Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true.
Is there anything wrong with my code above? I do not see where I am creating the context again...
These are the two contexts you're creating, which is not allowed:
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sparkConf, Seconds(3))
You should create the streaming context from the original context.
val ssc = new StreamingContext(sc, Seconds(3))
You are initializing two Spark contexts in the same JVM (the SparkContext and the StreamingContext); that's why you are getting this exception. You can set spark.driver.allowMultipleContexts = true in the config, but running multiple Spark contexts is discouraged and can produce unexpected results.
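A minimal corrected sketch under the same setup, with the Kafka details taken from the question and an illustrative print action on the stream:
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object Hbase {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Spark-Hbase").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    // Build the StreamingContext from the existing SparkContext instead of from the conf,
    // so only one SparkContext is ever created in this JVM.
    val ssc = new StreamingContext(sc, Seconds(3))
    val kafkaBrokers = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("test")
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaBrokers, topics)
    lines.map(_._2).print() // illustrative action on the message values
    ssc.start()
    ssc.awaitTermination()
  }
}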