In HDP 3.1.0, with HWC hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar, I cannot append (or overwrite) to an existing table, depending on the database.
I tested on one database called DSN and it works; on another database called CLEAN_CRYPT it fails.
Both databases are encrypted and use Kerberos.
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession._
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.execute("show databases").show()
hive.setDatabase("clean_crypt")
val df=hive.execute("select * from test")
df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", "test").mode("append").save()
The error message is "table already exists". I tried overwrite mode without success.
If I drop the table first, it works!
Any ideas?
This is probably related to an HWC bug that has been reported by multiple users here.
What I've found is that it only occurs if you use partitionBy when writing, like:
df.write.partitionBy("part")
.mode(SaveMode.Overwrite)
.format(com.hortonworks.hwc.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
.option("table", "`default`.`testout`").save;
On another note, if you remove the partitionBy piece, partitioning works as expected (since the partition info is already stored in the Hive table). However, if you use overwrite mode (and not, for example, append), HWC will drop and recreate your table, and it won't reapply the partitioning info.
If you want to use the Hortonworks connector and append to a partitioned table, you should not use partitionBy, as it does not seem to work properly with this connector. Instead, you can use the partition option and add the Spark parameters for dynamic partitioning.
Example:
import org.apache.spark.SparkConf
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR
import org.apache.spark.sql.{SaveMode, SparkSession}
val sparkConf = new SparkConf()
.setMaster("yarn")
.setAppName("My application")
.set("hive.exec.dynamic.partition", "true")
.set("hive.exec.dynamic.partition.mode", "nonstrict")
val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()
val hive = HiveWarehouseBuilder.session(spark).build()
val hiveDatabase = "clean_crypt"
hive.setDatabase(hiveDatabase)
val df = hive.execute("select * from test")
val table = "testout"           // placeholder: target Hive table name
val partitionColumn = "part"    // placeholder: partition column of the target table
df
.write
.format(HIVE_WAREHOUSE_CONNECTOR)
.mode(SaveMode.Append)
.option("partition", partitionColumn)
.option("table", table)
.save()
For the above, the hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar was used. If the table does not exist, the connector creates it and stores it (by default) in ORC format.
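If you want the table to exist up front with explicit partitioning (so that an Append keeps writing into the right partitions), the connector also exposes a createTable builder. The sketch below is only illustrative: the table and column names are placeholders, and the exact builder methods should be checked against the HWC version you are using.
// Hypothetical pre-creation of a partitioned target table with HWC's builder API
// (table/column names are placeholders; verify the builder methods for your HWC version).
hive.createTable("testout")
  .ifNotExists()
  .column("id", "bigint")
  .column("value", "string")
  .partition("part", "string")
  .create()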
Related
I have a use case in which I need to change the schema of incoming JSON without interrupting the streaming job. I am using a conf file where all the required schemas are defined. I have already tried cache and broadcast variables, persisting and unpersisting them with a separate streaming pipeline, but still no luck. Thanks in advance for your help!
Rather than reading the data set as JSON, you can try reading it as text and then mapping it according to a schema that comes externally from a config file in HDFS or a DB.
So instead of doing something like,
val df = spark.readStream.format("json").load(.. path ..)
do,
import spark.implicits._
val df = spark.readStream
  .format("text").load( .. path .. )
  .select("value")
  .as[String]
  .mapPartitions(partStrings => {
    // look up the current schema once per partition, not once per record
    val currentSchema = readSchemaFromFile(???)
    partStrings.map(str => parseJSON(currentSchema, str))
  })
Using mapPartitions avoids looking up the schema for every single record.
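The two helpers used above are not defined in the snippet. A minimal sketch of what they could look like is below, assuming the schema file is stored on HDFS in Spark's JSON schema representation and using json4s (which ships with Spark) for the per-record parsing; the names, the Map[String, String] return type, and the path handling are illustrative assumptions.
import org.json4s._
import org.json4s.jackson.JsonMethods.{parse, render, compact}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.types.{DataType, StructType}

// Read the current schema (stored as Spark's JSON schema representation) from HDFS.
def readSchemaFromFile(schemaPath: String): StructType = {
  val fs = FileSystem.get(new Configuration())
  val in = fs.open(new Path(schemaPath))
  try DataType.fromJson(scala.io.Source.fromInputStream(in).mkString).asInstanceOf[StructType]
  finally in.close()
}

// Parse one JSON record, keeping only the top-level fields named in the schema.
// Returning Map[String, String] keeps the sketch simple; use a case class or Row
// if you need typed columns downstream.
def parseJSON(schema: StructType, record: String): Map[String, String] = {
  val json = parse(record)
  schema.fieldNames.map(name => name -> compact(render(json \ name))).toMap
}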
I need to write Cassandra partitions as Parquet files. Since I cannot share and use the sparkSession inside a foreach function, I first call the collect method to bring all the data into the driver program and then write the Parquet files to HDFS, as below.
Thanks to this link https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
I am able to get my partitioned rows. I want to write the partitioned rows into separate Parquet files whenever a partition is read from the Cassandra table. I also tried sparkSQLContext, but that method only writes task results as temporary files; I think I will only see the Parquet files after all the tasks are done.
Is there any convenient method for this?
val keyedTable : CassandraTableScanRDD[(Tuple2[Int, Date], MyCassandraTable)] = getTableAsKeyed()

keyedTable.groupByKey
  .collect
  .foreach(f => {
    import sparkSession.implicits._
    val items = f._2.toList
    val key = f._1
    val baseHDFS = "hdfs://mycluster/parquet_test/"
    val ds = sparkSession.sqlContext.createDataset(items)
    ds.write
      .option("compression", "gzip")
      .parquet(baseHDFS + key._1 + "/" + key._2)
  })
Why not use Spark SQL everywhere and use Parquet's built-in functionality to write data by partitions, instead of creating the directory hierarchy yourself?
Something like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("table", "keyspace").load()
data.write
  .option("compression", "gzip")
  .partitionBy("col1", "col2")
  .parquet(baseHDFS)
In this case, it will create a separate nested directory for every value of col1 & col2, with names like ${column}=${value}. Then, when you read the data back, you can force it to read only a specific value.
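For example, reading back a single partition could look like this (the column values are just placeholders):
import spark.implicits._

// Partition pruning on read: only the matching col1=.../col2=... directories are scanned.
val subset = spark.read.parquet(baseHDFS)
  .where($"col1" === "someValue" && $"col2" === "otherValue")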
I am trying to read data from Hive and write to a custom object in Salesforce using a JDBC driver for Salesforce from Progress. Here is how I am trying to do this:
spark-shell --jars /usr/hdp/current/spark-client/lib/sforce.jar
import org.apache.spark.sql.hive._
val hc = new HiveContext(sc)
val results = hc.sql("select rep_name FROM schema.rpt_view")
print(results.first())
import org.apache.spark.sql.SaveMode
val url="jdbc:datadirect:sforce://login.salesforce.com"
val prop = new java.util.Properties
prop.put("user","user1")
prop.put("password","passwd")
prop.put("driver","com.ddtek.jdbc.sforce.SForceDriver")
results.write.mode(SaveMode.Append).jdbc(url, "SFORCE.test_tab1", prop)
I am getting the error
java.sql.SQLSyntaxErrorException: [DataDirect][SForce JDBC Driver][SForce]column size is required in statement [CREATE TABLE SFORCE.test_tab1 (rep_name TEXT
Can someone help me here? If the table test_tab1 already exists, how do I configure the write, and if the table doesn't exist in Salesforce, how do I add a column value?
I have the following situation: I have a large Cassandra table (with a large number of columns) which I would like to process with Spark. I want only selected columns to be loaded into Spark (apply the select and filtering on the Cassandra server itself).
val eptable = sc.cassandraTable("test", "devices")
  .select("device_ccompany", "device_model", "device_type")
The above statement gives a CassandraTableScanRDD, but how do I convert this into a DataSet/DataFrame?
Is there any other way I can do server-side filtering of columns and get DataFrames?
With the DataStax Spark Cassandra Connector, you would read the Cassandra data as a Dataset and prune columns on the server side as follows:
val df = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "devices", "keyspace" -> "test" ))
.load()
val dfWithColumnPruned = df
.select("device_ccompany","device_model","device_type")
Note that the selection operation done after reading is pushed down to the server side by the Catalyst optimizer. Refer to this document for further information.
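If you want to confirm that the pruning actually reaches the data source, you can inspect the query plan; the Cassandra scan node should list only the three selected columns (exact output varies by connector version):
// Print the logical and physical plans to verify column pruning at the scan.
dfWithColumnPruned.explain(true)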
I want to try to load data into a Hive external table using Spark.
Please help me with this: how do I load data into Hive using Scala or Java code?
Thanks in advance.
Assuming that the Hive external table has already been created using something like:
CREATE EXTERNAL TABLE external_parquet(c1 INT, c2 STRING, c3 TIMESTAMP)
STORED AS PARQUET LOCATION '/user/etl/destination'; -- location is some directory on HDFS
And that you have an existing DataFrame / RDD in Spark that you want to write.
import java.sql.Timestamp
import org.apache.spark.sql.SaveMode
import sqlContext.implicits._

val now = new Timestamp(System.currentTimeMillis()) // c3 is a TIMESTAMP column, so use java.sql.Timestamp (java.util.Date is not supported by toDF)
val rdd = sc.parallelize(List((1, "a", now), (2, "b", now), (3, "c", now)))
val df = rdd.toDF("c1", "c2", "c3") // column names for your data frame
df.write.mode(SaveMode.Overwrite).parquet("/user/etl/destination") // If you want to overwrite the existing dataset (full reimport from some source)
If you don't want to overwrite existing data from your dataset...
df.write.mode(SaveMode.Append).parquet("/user/etl/destination") // If you want to append to existing dataset (incremental imports)
I have tried a similar scenario and had satisfactory results. I worked with Avro data with the schema in JSON. I streamed a Kafka topic with Spark Streaming and persisted the data into HDFS, which is the location of an external table. So every 2 seconds (the streaming interval), the data is stored into HDFS in a separate file and the Hive external table is appended as well.
Here is the simple code snippet:
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)

messages.foreachRDD(rdd => {
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._
  // each Kafka message value is a JSON record
  val myEvent = sqlContext.read.json(rdd.map(_._2))
  // append the micro-batch to the HDFS location backing the Hive external table
  myEvent.write
    .format("parquet")
    .mode(org.apache.spark.sql.SaveMode.Append)
    .save("maprfs:///location/of/hive/external/table")
})
Don't forget to stop the StreamingContext (ssc) at the end of the application; stopping it gracefully is preferable.
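A rough sketch of one way to do that is below; it assumes sparkConf is the SparkConf used to build the StreamingContext, and relies on the spark.streaming.stopGracefullyOnShutdown setting to let in-flight batches finish on JVM shutdown.
// Let Spark Streaming stop gracefully (finishing in-flight batches) when the JVM shuts down.
sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true")

ssc.start()
ssc.awaitTermination()

// Or, if you trigger the shutdown yourself from another thread or signal handler:
// ssc.stop(stopSparkContext = true, stopGracefully = true)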
P.S.:
Note that when creating the external table, make sure its schema is identical to the DataFrame schema, because when the JSON is converted into a DataFrame (which is essentially just a table), the columns end up arranged in alphabetical order.
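If the table's declared column order differs from the alphabetical order produced by the JSON read, one way to stay safe is to select the columns explicitly in the table's order before writing (the column names below are placeholders):
// Reorder the DataFrame columns to match the external table definition before appending.
val ordered = myEvent.select("c1", "c2", "c3")
ordered.write
  .format("parquet")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("maprfs:///location/of/hive/external/table")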