Below is a sample code snippet used to fetch data from HBase. This worked fine with Spark 3.1.2. However, after upgrading to Spark 3.2.1 it no longer works: the returned RDD doesn't contain any values, and no exception is thrown.
def getInfo(sc: SparkContext, startDate: String, cachingValue: Int, sparkLoggerParams: SparkLoggerParams, zkIP: String, zkPort: String): RDD[(String)] = {
  val scan = new Scan
  scan.addFamily(Bytes.toBytes("family"))
  scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("time"))
  val rdd = getHbaseConfiguredRDDFromScan(sc, zkIP, zkPort, "myTable", scan, cachingValue, sparkLoggerParams)
  val output: RDD[(String)] = rdd.map { row =>
    Bytes.toString(row._2.getRow)
  }
  output
}
def getHbaseConfiguredRDDFromScan(sc: SparkContext, zkIP: String, zkPort: String, tableName: String,
                                  scan: Scan, cachingValue: Int, sparkLoggerParams: SparkLoggerParams): NewHadoopRDD[ImmutableBytesWritable, Result] = {
  scan.setCaching(cachingValue)
  val scanString = Base64.getEncoder.encodeToString(org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(scan).toByteArray)
  val hbaseContext = new SparkHBaseContext(zkIP, zkPort)
  val hbaseConfig = hbaseContext.getConfiguration()
  hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName)
  hbaseConfig.set(TableInputFormat.SCAN, scanString)
  sc.newAPIHadoopRDD(
    hbaseConfig,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result]
  ).asInstanceOf[NewHadoopRDD[ImmutableBytesWritable, Result]]
}
Also, if we fetch using a Scan directly, without going through newAPIHadoopRDD, it works (see the sketch below).
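For reference, here is a minimal sketch of that direct-Scan path using the plain HBase client API. This is only an illustration, not the original code: the ZooKeeper host, port, table, and column names are placeholders, and the snippet assumes the HBase client is on the classpath (it can be pasted into spark-shell).

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "zk-host")            // placeholder ZooKeeper host
conf.set("hbase.zookeeper.property.clientPort", "2181")  // placeholder client port

val connection = ConnectionFactory.createConnection(conf)
try {
  val table = connection.getTable(TableName.valueOf("myTable"))
  val scan = new Scan().addColumn(Bytes.toBytes("family"), Bytes.toBytes("time"))
  val scanner = table.getScanner(scan)
  try {
    // Print the row key of every result returned by the scan
    scanner.iterator().asScala.foreach(result => println(Bytes.toString(result.getRow)))
  } finally {
    scanner.close()
  }
} finally {
  connection.close()
}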
Software versions:
Spark: 3.2.1, prebuilt with user-provided Apache Hadoop
Scala: 2.12.10
HBase: 2.4.9
Hadoop: 2.10.1
I found the solution to this one.
See the migration guide from Spark 3.1.x to Spark 3.2.x:
https://spark.apache.org/docs/latest/core-migration-guide.html
Since Spark 3.2, spark.hadoopRDD.ignoreEmptySplits is set to true by default which means Spark will not create empty partitions for empty input splits. To restore the behavior before Spark 3.2, you can set spark.hadoopRDD.ignoreEmptySplits to false.
It can be set like this on spark-submit:
./spark-submit \
--class org.apache.hadoop.hbase.spark.example.hbasecontext.HBaseDistributedScanExample \
--master spark://localhost:7077 \
--conf "spark.hadoopRDD.ignoreEmptySplits=false" \
--jars ... \
/tmp/hbase-spark-1.0.1-SNAPSHOT.jar YourHBaseTable
Alternatively, you can set it globally in $SPARK_HOME/conf/spark-defaults.conf so it applies to every Spark application.
spark.hadoopRDD.ignoreEmptySplits false
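The flag can also be set programmatically when the SparkSession is built. A minimal sketch (the app name is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hbase-scan-example")
  .config("spark.hadoopRDD.ignoreEmptySplits", "false")  // restore the pre-3.2 behaviour
  .getOrCreate()
val sc = spark.sparkContext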
Related
I am trying to load a table from Salesforce using Spark. I invoked this code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._

object Sample {
  def main(arg: Array[String]) {
    val spark = SparkSession.builder().
      appName("salesforce").
      master("local[*]").
      getOrCreate()

    val tableName = "Opportunity"
    val outputPath = "output/result" + tableName

    val salesforceDf = spark.
      read.
      format("jdbc").
      option("url", "jdbc:datadirect:sforce://login.salesforce.com;").
      option("driver", "com.ddtek.jdbc.sforce.SForceDriver").
      option("dbtable", tableName).
      option("user", "").
      option("password", "xxxxxxxxx").
      option("securitytoken", "xxxxx").
      load()

    salesforceDf.createOrReplaceTempView("Opportunity")
    spark.sql("select * from Opportunity").collect.foreach(println)

    // save the result
    salesforceDf.write.save(outputPath)
  }
}
And the docs I was referring to said to start a spark shell as:
spark-shell --jars /path_to_driver/sforce.jar
This printed a lot of lines in the terminal, and this was the last one:
22/07/12 14:57:56 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a12b060d-5c82-4283-b2b9-53f9b3863b53
And then to submit the Spark job:
spark-submit --jars sforce.jar --class <Your class name> your jar file
However, I am not sure where this jar file is, whether it was actually created, or how to submit it. Any help is appreciated, thank you.
I want to put data that is in HDFS into a Neo4j graph. Taking into account the suggestion given in this question, I used the neo4j-spark-connector, initializing it this way:
/usr/local/sparks/bin/spark-shell --conf spark.neo4j.url="bolt://192.xxx.xxx.xx:7687" --conf spark.neo4j.user="xxxx" --conf spark.neo4j.password="xxxx" --packages neo4j-contrib:neo4j-spark-connector:2.4.5-M1,graphframes:graphframes:0.2.0-spark2.0-s2.11
I read the file that is in HDFS and load it into Neo4j with the Neo4jDataFrame.mergeEdgeList function.
import org.neo4j.spark.dataframe.Neo4jDataFrame

object Neo4jTeste {
  def main {
    val lines = sc.textFile("hdfs://.../testeNodes.csv")
    val filteredlines = lines.map(_.split("-")).map { x => (x(0), x(1), x(2), x(3)) }
    val newNames = Seq("name", "apelido", "city", "date")
    val df = filteredlines.toDF(newNames: _*)
    Neo4jDataFrame.mergeEdgeList(sc, df, ("Name", Seq("name")), ("HAPPENED_IN", Seq.empty), ("Age", Seq("age")))
  }
}
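For reference, here is a minimal sketch of how mergeEdgeList's tuple arguments map onto DataFrame columns. The labels, relationship type, and key columns below are illustrative assumptions, not the original data model; each Seq names DataFrame columns that become node or relationship properties, so every listed column should exist in the DataFrame built above.

// Illustrative only: assumes the DataFrame `df` from above, with columns
// "name", "apelido", "city", "date", and the SparkContext `sc`.
// Each tuple is (node label or relationship type, Seq(DataFrame columns used as properties)).
Neo4jDataFrame.mergeEdgeList(
  sc, df,
  ("Person", Seq("name")),   // source nodes, keyed on the "name" column
  ("LIVES_IN", Seq.empty),   // relationship type, with no extra properties
  ("City", Seq("city"))      // target nodes, keyed on the "city" column
)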
However, as my data is very large, Neo4j does not end up representing all the nodes. I think the problem is with the function.
I've tried it this way too:
import org.neo4j.spark._
val neo = Neo4j(sc)
val rdd = neo.cypher("MATCH (n:Person) RETURN id(n) as id ").loadRowRdd
However, this way I cannot read the HDFS file or split it into columns.
Can someone help me find another solution? With the Neo4jDataFrame.mergeEdgeList function, I only see 150 nodes instead of the 500 expected.
I'm trying to run the program specified in this IBM Developer code pattern. For now, I am only doing the local deployment: https://github.com/IBM/kafka-streaming-click-analysis?cm_sp=Developer-_-determine-trending-topics-with-clickstream-analysis-_-Get-the-Code
Since it's a little old, my versions of Kafka and Spark aren't exactly what the code pattern calls for. The versions I am using are:
Spark: 2.4.6 (built for Scala 2.11)
Kafka: 0.10.2.1
At the last step, I get the following error:
ERROR MicroBatchExecution: Query [id = f4dfe12f-1c99-427e-9f75-91a77f6e51a7,
runId = c9744709-2484-4ea1-9bab-28e7d0f6b511] terminated with error
org.apache.spark.sql.catalyst.errors.package$TreeNodeException
Along with the execution tree
The steps I am following are as follows:
1. Start Zookeeper
2. Start Kafka
3. cd kafka_2.10-0.10.2.1
4. tail -200 data/2017_01_en_clickstream.tsv | bin/kafka-console-producer.sh --broker-list ip:port --topic clicks --producer.config=config/producer.properties
I have downloaded the dataset and stored it in a directory called data inside of the kafka_2.10-0.10.2.1 directory
cd $SPARK_DIR
bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6
Since SPARK_DIR wasn't set during the Spark installation, I am navigating to the directory containing Spark to run this command.
scala> import scala.util.Try
scala> case class Click(prev: String, curr: String, link: String, n: Long)
scala> def parseVal(x: Array[Byte]): Option[Click] = {
         val split: Array[String] = new Predef.String(x).split("\\t")
         if (split.length == 4) {
           Try(Click(split(0), split(1), split(2), split(3).toLong)).toOption
         } else
           None
       }
scala> val records = spark.readStream.format("kafka")
         .option("subscribe", "clicks")
         .option("failOnDataLoss", "false")
         .option("kafka.bootstrap.servers", "localhost:9092").load()

scala> val messages = records.select("value").as[Array[Byte]]
         .flatMap(x => parseVal(x))
         .groupBy("curr")
         .agg(Map("n" -> "sum"))
         .sort($"sum(n)".desc)

scala> val query = messages.writeStream
         .outputMode("complete")
         .option("truncate", "false")
         .format("console")
         .start()
The last statement, val query = ..., is the one giving the error mentioned above. Any help would be greatly appreciated. Thanks in advance!
A required library for interacting with Apache Kafka is likely missing or incompatible with your setup: the spark-sql-kafka-0-10 artifact passed to --packages must match both your Spark version and the Scala version your Spark build uses (the _2.11/_2.12 suffix).
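As a quick sanity check, a sketch assuming you are inside spark-shell: print the Spark and Scala versions that are actually running and compare them against the coordinates passed to --packages.

// Inside spark-shell: the spark-sql-kafka-0-10 artifact's Scala suffix (_2.11 / _2.12)
// and its version should match what these print.
println(spark.version)                        // e.g. 2.4.6
println(scala.util.Properties.versionString)  // e.g. version 2.11.12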
I am trying to convert a JSON string jsonStr into a Spark DataFrame in Scala, using IntelliJ for this purpose.
val spark = SparkSession.builder().appName("SparkExample").master("local[*]").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
var df = spark.read.json(Seq(jsonStr).toDS)
df.show()
I get the following error while compiling/building the project using Maven.
Error:(243, 29) overloaded method value json with alternatives:
  (jsonRDD: org.apache.spark.rdd.RDD[String])org.apache.spark.sql.DataFrame
  (jsonRDD: org.apache.spark.api.java.JavaRDD[String])org.apache.spark.sql.DataFrame
  (paths: String*)org.apache.spark.sql.DataFrame
  (path: String)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.sql.Dataset[String])
  var df = spark.read.json(Seq(jsonStr).toDS)
Note: I was not getting the error while building with SBT.
The method below was introduced in Spark 2.2.0:
def json(jsonDataset: Dataset[String]): DataFrame
Please correct your version of Spark in Maven's pom.xml file.
Alternatively, change your code to:
val rdd = sc.parallelize(Seq(jsonStr))
val df = spark.read.json(rdd)
(spark.read.json already returns a DataFrame here, so no further conversion is needed; combine the two lines into one if you prefer.)
You're trying to pass a Dataset[String] into spark.read.json, and the Spark version your Maven build resolves does not yet have that overload, hence the error.
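Putting the two approaches side by side, a minimal sketch (the jsonStr value here is only a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkExample").master("local[*]").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

val jsonStr = """{"name":"test","value":1}"""   // placeholder JSON string

// Spark 2.2.0 and later: json(Dataset[String]) is available.
val dfFromDataset = spark.read.json(Seq(jsonStr).toDS)

// Older Spark versions: use the RDD[String] overload instead.
val dfFromRdd = spark.read.json(sc.parallelize(Seq(jsonStr)))

dfFromDataset.show()
dfFromRdd.show()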
I am following the Spark HBase Connector basic example to read an HBase table in the spark2 shell, version 2.2.0. The code appears to work, but when I run the df.show() command, I do not see any results and it seems to run forever.
import org.apache.spark.sql.{ DataFrame, Row, SQLContext }
import org.apache.spark.sql.execution.datasources.hbase._
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
def catalog = s"""{
|"table":{"namespace":"default", "name":"testmeta"},
|"rowkey":"vgil",
|"columns":{
|"id":{"cf":"rowkey", "col":"vgil", "type":"string"},
|"col1":{"cf":"pp", "col":"dtyp", "type":"string"}
|}
|}""".stripMargin
def withCatalog(cat: String): DataFrame = {
  sqlContext.read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}
val df = withCatalog(catalog)
df.show()
df.show() gives neither any output nor an error; it just keeps running forever.
Also, how can I run a query for a range of row keys?
Here is the scan of the HBase test table.
hbase(main):001:0> scan 'testmeta'
ROW COLUMN+CELL
fmix column=pp:dtyp, timestamp=1541714925380, value=ss1
fmix column=pp:lati, timestamp=1541714925371, value=41.50
fmix column=pp:long, timestamp=1541714925374, value=-81.61
fmix column=pp:modm, timestamp=1541714925377, value=ABC
vgil column=pp:dtyp, timestamp=1541714925405, value=ss2
vgil column=pp:lati, timestamp=1541714925397, value=41.50
I have followed some of the solutions on the web, but unfortunately I am not able to get the data from HBase.
Thanks in advance for help!
Posting my answer after lots of trial and error: I found that adding the --conf option when starting the Spark shell helped me connect to HBase.
spark2-shell --master yarn --deploy-mode client --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11,it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 --repositories http://repo.hortonworks.com/content/groups/public/ --conf spark.hbase.host=192.168.xxx.xxx --files /mnt/fs1/opt/cloudera/parcels/CDH-5.13.0-1.cdh5.13.0.p0.29/share/doc/hbase-solr-doc-1.5+cdh5.13.0+71/demo/hbase-site.xml
Then the following code snippet can fetch values for one column qualifier, restricted to a range of row keys via withStartRow and withStopRow:
val hBaseRDD_iacp = sc.hbaseTable[(String)]("testmeta").select("lati").inColumnFamily("pp").withStartRow("vg").withStopRow("vgz")
To convert the fetched RDD into a DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object myschema {
  val column1 = StructField("column1", StringType)
  val struct = StructType(Array(column1))
}

val rowRDD = hBaseRDD_iacp.map(x => Row(x))
val myDf = sqlContext.createDataFrame(rowRDD, myschema.struct)
myDf.show()