I want to load data into a Hive external table using Spark.
Please help me with this: how do I load data into Hive using Scala or Java code?
Thanks in advance.
Assuming the Hive external table has already been created with something like:
CREATE EXTERNAL TABLE external_parquet(c1 INT, c2 STRING, c3 TIMESTAMP)
STORED AS PARQUET LOCATION '/user/etl/destination'; -- location is some directory on HDFS
and you have an existing DataFrame / RDD in Spark that you want to write:
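import java.sql.Timestamp
import org.apache.spark.sql.SaveMode
import sqlContext.implicits._
// java.util.Date has no Spark SQL encoder; java.sql.Timestamp maps to the TIMESTAMP column c3
def now = new Timestamp(System.currentTimeMillis())
val rdd = sc.parallelize(List((1, "a", now), (2, "b", now), (3, "c", now)))
val df = rdd.toDF("c1", "c2", "c3") // column names for your DataFrame, matching the table definition
df.write.mode(SaveMode.Overwrite).parquet("/user/etl/destination") // overwrite the existing dataset (full reimport from some source)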
If you don't want to overwrite the existing data in your dataset, append instead:
df.write.mode(SaveMode.Append).parquet("/user/etl/destination") // If you want to append to existing dataset (incremental imports)
I have tried a similar scenario and had satisfactory results. I worked with Avro data whose schema is in JSON. I streamed a Kafka topic with Spark Streaming and persisted the data to HDFS, at the location of an external table. So every 2 seconds (the streaming batch duration) the data is stored in HDFS as a separate file, and the Hive external table is appended to as well.
Here is a simple code snippet:
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD { rdd =>
  // Skip empty batches so that no schema-less output is written
  if (!rdd.isEmpty()) {
    // Reuse one SQLContext across batches instead of constructing a new one every time
    val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
    // Each Kafka message value is a JSON string; Spark infers the schema from it
    val myEvent = sqlContext.read.json(rdd.map(_._2))
    // Append each batch as new Parquet files under the external table's location
    myEvent.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("maprfs:///location/of/hive/external/table")
  }
}
Don't forget to stop the StreamingContext (ssc) at the end of the application. Stopping it gracefully is preferable.
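As a minimal sketch of a graceful stop, assuming `ssc` is the `StreamingContext` used above (the helper name `stopGracefully` is made up for illustration): the two-argument `stop` lets already received data finish processing before shutdown and also stops the underlying SparkContext.

```scala
import org.apache.spark.streaming.StreamingContext

// Hypothetical helper, not part of the original answer.
// stopGracefully = true waits for received data to be processed before stopping;
// stopSparkContext = true also shuts down the underlying SparkContext.
def stopGracefully(ssc: StreamingContext): Unit =
  ssc.stop(stopSparkContext = true, stopGracefully = true)
```

You would typically call this after `ssc.awaitTermination()` returns, or from whatever shutdown path your application already has.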
P.S.:
Note that when creating the external table, make sure its schema is identical to the DataFrame schema, because when the JSON is converted into a DataFrame (which is nothing but a table), the columns are arranged in alphabetical order.
Related
How should I understand a Hudi write that uses the upsert operation but the DataFrame save mode Append? Since this will upsert the records, why Append instead of Overwrite? What's the difference?
As shown in the picture:
Example: Upsert a DataFrame, specifying the necessary field names for recordKey => _row_key, partitionPath => partition, and precombineKey => timestamp
inputDF.write()
.format("org.apache.hudi")
.options(clientOpts) //Where clientOpts is of type Map[String, String]. clientOpts can include any other options necessary.
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath);
Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi table as below.
// spark-shell
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
When you use overwrite mode, you tell Spark to delete the table and recreate it (or just the partitions that exist in your new DataFrame, if you use the dynamic partitionOverwriteMode).
With append mode, Spark appends the new data to the existing data on disk/cloud storage. With Hudi we can specify an additional operation that merges the two versions of the data: it updates old records whose key is present in the new data, keeps old records whose key is not present in the new data, and adds new records that have new keys. This is totally different from overwriting the data.
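To make the difference concrete, here is a toy model in plain Scala (no Spark or Hudi involved). The `Record` type and the three functions are made up for illustration and are not how Hudi is implemented; they only mirror the three behaviours described above.

```scala
// Toy model only: a "table" is a sequence of keyed records.
final case class Record(key: String, value: String)

// Overwrite: whatever was there before is discarded.
def overwrite(existing: Seq[Record], incoming: Seq[Record]): Seq[Record] =
  incoming

// Plain append: new rows are added next to the old ones, duplicates included.
def append(existing: Seq[Record], incoming: Seq[Record]): Seq[Record] =
  existing ++ incoming

// Upsert: incoming records replace old ones with the same key,
// old records with other keys are kept, and new keys are added.
def upsert(existing: Seq[Record], incoming: Seq[Record]): Seq[Record] = {
  val merged = existing.map(r => r.key -> r).toMap ++ incoming.map(r => r.key -> r)
  merged.values.toSeq.sortBy(_.key)
}

val old = Seq(Record("a", "a-v1"), Record("b", "b-v1"))
val batch = Seq(Record("b", "b-v2"), Record("c", "c-v1"))

val afterOverwrite = overwrite(old, batch) // only b-v2 and c-v1 remain
val afterAppend = append(old, batch)       // four rows, key "b" appears twice
val afterUpsert = upsert(old, batch)       // a-v1, b-v2, c-v1
```

In this model, Append with an upsert operation corresponds to `upsert`: the existing table is kept and merged by key, whereas Overwrite corresponds to `overwrite`.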
In spark-shell, how do I load an existing Hive table, but only one of its partitions?
val df = spark.read.format("orc").load("mytable")
I was looking for a way so it only loads one particular partition of this table.
Thanks!
There is no direct way to do this in spark.read.format, but you can use a where condition:
val df = spark.read.format("orc").load("mytable").where("<partition_col_name> = <expected_value>")
Nothing is loaded until you perform an action: load (pointing to your ORC file location) is just a function in DataFrameReader, as shown below, and it doesn't read any data until an action runs.
See DataFrameReader:
def load(paths: String*): DataFrame = {
...
}
In the code above (spark.read....where), where is just a filter condition; when you specify it, the data still isn't loaded immediately :-)
When you call df.count, the filter on your partition column is applied to the ORC data path.
There is no function in the Spark API to load only a partition directory. The way around this is that a partition directory is nothing but a column in the where clause, so you can write a simple SQL query with the partition column in the where clause, which will read data only from that partition directory. See if that works for you.
val df = spark.sql("SELECT * FROM mytable WHERE <partition_col_name> = <expected_value>")
I need to write Cassandra partitions as Parquet files. Since I cannot share and use sparkSession inside a foreach function, I first call collect to bring all the data into the driver program and then write the Parquet files to HDFS, as below.
Thanks to this link https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
I am able to get my partitioned rows. I want to write the partitioned rows into separate Parquet files whenever a partition is read from the Cassandra table. I also tried sparkSQLContext; that method writes task results as temporary files. I think I will see the Parquet files only after all the tasks are done.
Is there any convenient method for this?
val keyedTable : CassandraTableScanRDD[(Tuple2[Int, Date], MyCassandraTable)] = getTableAsKeyed()
keyedTable.groupByKey
.collect
.foreach(f => {
import sparkSession.implicits._
val items = f._2.toList
val key = f._1
val baseHDFS = "hdfs://mycluster/parquet_test/"
val ds = sparkSession.sqlContext.createDataset(items)
ds.write
.option("compression", "gzip")
.parquet(baseHDFS + key._1 + "/" + key._2)
})
Why not use Spark SQL everywhere and use the built-in Parquet functionality to write data by partitions, instead of creating the directory hierarchy yourself?
Something like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("table", "keyspace").load()
data.write
.option("compression", "gzip")
.partitionBy("col1", "col2")
.parquet(baseHDFS)
In this case, it will create a separate directory for every value of col1 & col2, as nested directories with names like ${column}=${value}. Then, when you read, you can restrict the read to a specific value.
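To illustrate the naming scheme, here is a small plain-Scala sketch of how the nested directory for one combination of partition values is put together. The `partitionDir` helper and the sample values are made up for illustration; Spark builds these paths itself and additionally escapes special characters in the values.

```scala
// Hypothetical helper: joins the base path with one column=value segment
// per partition column, in the order given to partitionBy.
def partitionDir(base: String, partitions: Seq[(String, String)]): String =
  (base.stripSuffix("/") +: partitions.map { case (col, value) => s"$col=$value" })
    .mkString("/")

val dir = partitionDir(
  "hdfs://mycluster/parquet_test/",
  Seq("col1" -> "42", "col2" -> "2017-04-18")
)
// dir == "hdfs://mycluster/parquet_test/col1=42/col2=2017-04-18"
```

Because the partition values are encoded in the directory names, a filter on col1 or col2 at read time lets Spark skip the directories that do not match.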
I have the following situation. I have a large Cassandra table (with a large number of columns) which I would like to process with Spark. I want only selected columns to be loaded into Spark (applying the select and filtering on the Cassandra server itself).
val eptable = sc.cassandraTable("test", "devices")
  .select("device_ccompany", "device_model", "device_type")
The statement above gives a CassandraTableScanRDD, but how do I convert it into a Dataset/DataFrame?
Is there any other way I can do server-side filtering of columns and get DataFrames?
In DataStax Spark Cassandra Connector, you would read Cassandra data as a Dataset, and prune columns on the server-side as follows:
val df = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "devices", "keyspace" -> "test" ))
.load()
val dfWithColumnPruned = df
.select("device_ccompany","device_model","device_type")
Note that the selection I do after reading is pushed to the server side by Catalyst optimizations. Refer to this document for further information.
I have a Cassandra table like the one below, and I want to get records from Cassandra using some conditions and put them in a Hive table.
Cassandra table (Employee) entries:
Id  Name  Amount  Time
1   abc   1000    2017041801
2   def   1000    2017041802
3   ghi   1000    2017041803
4   jkl   1000    2017041804
5   mno   1000    2017041805
6   pqr   1000    2017041806
7   stu   1000    2017041807
Assume that all columns of this table are of the string datatype.
We have the same schema in Hive as well.
Now I want to import the Cassandra records between 2017041801 and 2017041804 into Hive or HDFS. In the second run I will pull the incremental records based on the previous run.
I am able to load the Cassandra data into an RDD using the syntax below.
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("mydb", "Employee")
Now my problem is how to filter these records according to the between condition and persist the filtered records in Hive or in the Hive external table path.
Unfortunately my Time column is not a clustering key in the Cassandra table, so I am not able to use the .where() clause.
I am new to Scala and Spark, so please help me with this filter logic, or let me know of any better way of implementing it using a DataFrame.
Thanks in advance.
I recommend using the connector's DataFrame API for loading from C*: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
Use a df.filter() call for predicates.
Use the saveAsTable() method to store data in Hive.
Here is a Spark 2.0 example for your case:
val df = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "Employee", "keyspace" -> "mydb" ))
.load()
df.filter("time between 2017041801 and 2017041804")
.write.mode("overwrite").saveAsTable("hivedb.employee");