I am trying to read data from Hive and write it to a custom object in Salesforce using the Progress JDBC driver for Salesforce. Here is how I am trying to do this:
spark-shell --jars /usr/hdp/current/spark-client/lib/sforce.jar
import org.apache.spark.sql.hive._
val hc = new HiveContext(sc)
val results = hc.sql("select rep_name FROM schema.rpt_view")
print(results.first())
import org.apache.spark.sql.SaveMode
val url="jdbc:datadirect:sforce://login.salesforce.com"
val prop = new java.util.Properties
prop.put("user","user1")
prop.put("password","passwd")
prop.put("driver","com.ddtek.jdbc.sforce.SForceDriver")
results.write.mode(SaveMode.Append).jdbc(url, "SFORCE.test_tab1", prop)
I am getting the error:
java.sql.SQLSyntaxErrorException: [DataDirect][SForce JDBC Driver][SForce]column size is required in statement [CREATE TABLE SFORCE.test_tab1 (rep_name TEXT
Can someone help me here? If the table test_tab1 already exists, how do I configure the write? And if the table doesn't exist in Salesforce, how do I supply the column size it is asking for?
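To make the two cases concrete, here is a hedged sketch (it assumes Spark 2.2+ for the createTableColumnTypes option, and that the Progress driver accepts a sized VARCHAR; neither is confirmed above):
import org.apache.spark.sql.SaveMode
// Case 1: SFORCE.test_tab1 already exists. Append mode only issues INSERTs,
// so Spark never generates the CREATE TABLE that triggers the "column size" error.
results.write.mode(SaveMode.Append).jdbc(url, "SFORCE.test_tab1", prop)
// Case 2: the table does not exist. Supply sized column types explicitly instead of
// letting Spark map StringType to an unsized TEXT column.
results.write
  .mode(SaveMode.Append)
  .option("createTableColumnTypes", "rep_name VARCHAR(255)")
  .jdbc(url, "SFORCE.test_tab1", prop)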
Related
In HDP 3.1.0, using HWC (hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar), I cannot append (or overwrite) to an existing table, depending on the database.
I tested on one database called DSN, where it works, and on another database called CLEAN_CRYPT, where it fails.
Both databases are encrypted and use Kerberos.
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession._
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.execute("show databases").show()
hive.setDatabase("clean_crypt")
val df=hive.execute("select * from test")
df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table","test").mode("append").save
The error message is "table already exists". I tried overwrite mode without success.
If I drop the table first, it passes!
Any ideas?
This is probably related to an HWC bug that has been reported by multiple users here.
What I've found is that it only occurs if you use partitionBy when writing, like this:
df.write.partitionBy("part")
.mode(SaveMode.Overwrite)
.format(com.hortonworks.hwc.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
.option("table", "`default`.`testout`").save;
On another note, if you remove the partitionBy piece, partitioning works as expected (since the partition info is already stored in the Hive table). But if you use overwrite mode (and not, for example, append), HWC will drop and recreate your table and it won't reapply the partitioning info.
If you want to use the Hortonworks connector and append to a partitioned table, you should not use partitionBy, as it does not seem to work properly with this connector. Instead, you can use the partition option and add the Spark parameters for dynamic partitioning.
Example:
import org.apache.spark.SparkConf
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR
import org.apache.spark.sql.{SaveMode, SparkSession}
val sparkConf = new SparkConf()
.setMaster("yarn")
.setAppName("My application")
.set("hive.exec.dynamic.partition", "true")
.set("hive.exec.dynamic.partition.mode", "nonstrict")
val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()
val hive = HiveWarehouseBuilder.session(spark).build()
val hiveDatabase = "clean_crypt")
hive.setDatabase(hiveDatabase)
val df = hive.execute("select * from test")
df
.write
.format(HIVE_WAREHOUSE_CONNECTOR)
.mode(SaveMode.Append)
.option("partition", partitionColumn)
.option("table", table)
.save()
For the above, the hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar was used. If the table does not exist, the connector creates it and stores it (by default) in ORC format.
I have a stream of alerts coming from Kafka to Spark. These are alerts in JSON format from different IoT sensors.
Sample messages from the Kafka stream:
{ "id":"2093021", alert:"Malfunction
detected","sensor_id":"14-23092-AS" }
{ "id":"2093021", alert:"Malfunction
detected","sensor_id":"14-23092-AS" , "alarm_code": "Severe" }
My code: spark-client.py
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SparkSession
from pyspark.sql.context import SQLContext
import json
if __name__ == "__main__":
    spark = SparkSession.builder \
        .appName("myApp") \
        .config("spark.mongodb.input.uri", "mongodb://spark:1234@172.31.9.44/at_cloudcentral.spark_test") \
        .config("spark.mongodb.output.uri", "mongodb://spark:1234@172.31.9.44/at_cloudcentral.spark_test") \
        .getOrCreate()
    sc = spark.sparkContext
    ssc = StreamingContext(sc, 10)
    zkQuorum, topic = sys.argv[1:]
    kafka_streams = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-sql-mongodb-test-consumer", {topic: 1})
    dstream = kafka_streams.map(lambda x: json.loads(x[1]))
    dstream.pprint()
    ssc.start()
    ssc.awaitTermination()
When I run this:
ubuntu#ip-172-31-89-176:~/spark-connectors$ spark-submit spark-client.py localhost:2181 DetectionEntry
I get this output:
-------------------------------------------
Time: 2019-12-04 14:26:40
-------------------------------------------
{u'sensor_id': u'16-23092-AS', u'id': u'2093021', u'alert': u'Malfunction detected'}
I need to be able to save this alert to a remote MongoDB. I have two specific challenges:
How do I correctly parse the output so that I can create a dataframe that can be written to MongoDB? I have tried adding this to the end of the code:
d = [dstream]
df = spark.createDataFrame(d).collect()
and it gives me this error:
dataType <py4j.java_gateway.JavaMember object at 0x7f5912726750> should
be an instance of <class 'pyspark.sql.types.DataType'>
My alerts can have different JSON structures, and I'll need to dump them into a MongoDB collection, so a fixed schema won't work for me. Most of the similar questions and code I have referred to on Stack Overflow assume a fixed schema, and I'm unable to figure out how to push this to MongoDB in a way that lets each record in the collection keep its own schema (JSON structure). Any pointers in the right direction are appreciated.
You can parse the Kafka JSON messages easily through the PySpark Structured Streaming API by invoking a simple UDF. You can check the complete code in the Stack Overflow link below for reference.
Pyspark Structured streaming processing
I am trying to set up a connection from Databricks to Couchbase Server 4.5 and then run a N1QL query.
The Scala code below returns one record, but it fails when I introduce the N1QL query. Any help is appreciated.
import com.couchbase.client.java.CouchbaseCluster;
import scala.collection.JavaConversions._;
import com.couchbase.client.java.query.Select.select;
import com.couchbase.client.java.query.dsl.Expression;
import com.couchbase.client.java.query.Query
// Connect to a cluster on localhost
val cluster = CouchbaseCluster.create("http://**************")
// Open the default bucket
val bucket = cluster.openBucket("travel-sample", "password");
// Read it back out
//val streamsense = bucket.get("airline_1004546") - Works and returns one record
// Create a DataFrame with schema inference
val ev = sql.read.couchbase(schemaFilter = EqualTo("type", "airline"))
//Show the inferred schema
ev.printSchema()
//query using the data frame
ev
.select("id", "type")
.show(10)
//issue sql query for the same data (N1ql)
val query = "SELECT type, meta().id FROM `travel-sample` LIMIT 10"
sc
.couchbaseQuery(N1qlQuery.simple(query))
.collect()
.foreach(println)
In Databricks (and usually in any interactive Spark cloud environment) you do not define the cluster nodes, buckets, or the sc variable yourself. Instead, you set the configuration for Spark to use when setting up the Databricks cluster, via the advanced settings (Spark config) option.
I've only used this approach with Spark 2.0, so your mileage may vary.
You can remove your cluster and bucket variable initialisation as well.
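For illustration only, this is roughly what those settings look like when expressed in code rather than in the cluster UI; the key names are the ones the Couchbase Spark connector documents (com.couchbase.nodes and com.couchbase.bucket.<name>), and the node address is a placeholder:
import org.apache.spark.sql.SparkSession
// In Databricks, the same key/value pairs would go into the cluster's Spark config box.
val spark = SparkSession.builder()
  .appName("couchbase-n1ql")
  .config("com.couchbase.nodes", "10.0.0.1")                // Couchbase node(s), placeholder address
  .config("com.couchbase.bucket.travel-sample", "password") // bucket name -> bucket password (pre-5.0 auth)
  .getOrCreate()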
You have a syntax error in the N1QL query. You have:
val query = "SELECT type, id FROM `travel-sample` WHERE LIMIT 10"
You need to either remove the WHERE, or add a condition.
You also need to change id to META().id.
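Putting both fixes together, a sketch of the corrected query (the connector imports shown are assumptions based on the standard Couchbase Spark connector):
import com.couchbase.spark._                       // provides couchbaseQuery on the SparkContext
import com.couchbase.client.java.query.N1qlQuery

// WHERE removed and id replaced by META().id, as described above
val fixedQuery = "SELECT type, META().id FROM `travel-sample` LIMIT 10"
sc.couchbaseQuery(N1qlQuery.simple(fixedQuery))
  .collect()
  .foreach(println)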
I want to load data into a Hive external table using Spark.
Please help me with this: how do I load data into Hive using Scala or Java code?
Thanks in advance.
Assuming that the Hive external table is already created using something like:
CREATE EXTERNAL TABLE external_parquet(c1 INT, c2 STRING, c3 TIMESTAMP)
STORED AS PARQUET LOCATION '/user/etl/destination'; -- location is some directory on HDFS
And that you have an existing DataFrame / RDD in Spark that you want to write:
import sqlContext.implicits._
import org.apache.spark.sql.SaveMode
import java.sql.Timestamp // java.sql.Timestamp matches the TIMESTAMP column c3

val now = new Timestamp(System.currentTimeMillis)
val rdd = sc.parallelize(List((1, "a", now), (2, "b", now), (3, "c", now)))
val df = rdd.toDF("c1", "c2", "c3") // column names for your data frame
df.write.mode(SaveMode.Overwrite).parquet("/user/etl/destination") // If you want to overwrite the existing dataset (full reimport from some source)
If you don't want to overwrite the existing data in your dataset...
df.write.mode(SaveMode.Append).parquet("/user/etl/destination") // If you want to append to existing dataset (incremental imports)
I have tried a similar scenario and had satisfactory results. I worked with Avro data whose schema was in JSON. I streamed a Kafka topic with Spark Streaming and persisted the data into HDFS, which is the location of an external table. So every 2 seconds (the streaming interval), the data is stored into HDFS in a separate file and the Hive external table is appended as well.
Here is a simple code snippet:
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd =>
{
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val dataframe = sqlContext.read.json(rdd.map(_._2))
val myEvent = dataframe.toDF()
import org.apache.spark.sql.SaveMode
myEvent.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("maprfs:///location/of/hive/external/table")
})
Don't forget to stop the StreamingContext ('ssc') at the end of the application. Doing it gracefully is preferable.
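For example, using the standard StreamingContext API:
// Stop gracefully, letting in-flight batches finish, and stop the underlying SparkContext too.
ssc.stop(stopSparkContext = true, stopGracefully = true)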
P.S.:
Note that while creating the external table, make sure you create it with a schema identical to the DataFrame schema, because when the data gets converted into a DataFrame (which is nothing but a table), the columns are arranged in alphabetical order.
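A small sketch of one way to guard against that ordering issue, assuming the external table declares columns c1, c2, c3 (hypothetical names, matching whatever your table actually defines); this would go inside the foreachRDD block in place of the write above:
// Select the columns explicitly, in the external table's declared order,
// so the alphabetical ordering produced by read.json cannot cause a mismatch.
val ordered = myEvent.select("c1", "c2", "c3")
ordered.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("maprfs:///location/of/hive/external/table")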
This question already has an answer here:
Converting CSV to ORC with Spark
I am trying to save my RDD in orc format.
val data: RDD[MyObject] = createMyData()
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
data.toDF.write.format("orc").save(outputPath)
It compiles fine, but it doesn't work at runtime. I get the following exception:
ERROR ApplicationMaster: User class threw exception: java.lang.AssertionError: assertion failed: The ORC data source can only be used with HiveContext.
java.lang.AssertionError: assertion failed: The ORC data source can only be used with HiveContext.
I would like to avoid using Hive to do this, because my data is in HDFS and it is not related to any Hive table. Is there any workaround?
It works fine for the Parquet format.
Thanks in advance.
Persisting data in ORC format to a persistent storage area (like HDFS) is only available with HiveContext.
As an alternative (workaround), you can register it as a temporary table. Something like this:
DataFrame.write.mode("overwrite").orc("myDF.orc")
val orcDF = sqlCtx.read.orc("myDF.orc")
orcDF.registerTempTable("<Table Name>")
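Once registered, the data can be queried through the same SQLContext; my_orc_table below is just a placeholder for whatever name you register:
// Query the registered temporary table via SQL.
sqlCtx.sql("SELECT COUNT(*) FROM my_orc_table").show()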
As of now, saving as ORC can only be done with HiveContext, so the approach will be like this:
val data: RDD[MyObject] = createMyData()
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
data.toDF.write.format("orc").save(outputPath)