saving spark rdd in ORC format [duplicate] - scala

This question already has an answer here:
Converting CSV to ORC with Spark
(1 answer)
Closed 6 years ago.
I am trying to save my RDD in orc format.
val data: RDD[MyObject] = createMyData()
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
data.toDF.write.format("orc").save(outputPath)
It compiles fine but it doesn't work.
I get following exception:
ERROR ApplicationMaster: User class threw exception: java.lang.AssertionError: assertion failed: The ORC data source can only be used with HiveContext.
java.lang.AssertionError: assertion failed: The ORC data source can only be used with HiveContext.
I would like to avoid using hive to do this, because my data is in hdfs and it is not related to any hive table. Is there any workaround?
It works fine for Parquet format.
Thanks in advance.

Persisting ORC formats in persistent storage area (like HDFS) is only available with the HiveContext.
As an alternate (workaround) you can register it as temporary table. Something like this: -
DataFrame.write.mode("overwrite").orc("myDF.orc")
val orcDF = sqlCtx.read.orc("myDF.orc")
orcDF.registerTempTable("<Table Name>")

As for now, saving as orc can only be done with HiveContext.
so the approach will be like this :
import sqlContext.implicits._
val data: RDD[MyObject] = createMyData()
val sqlContext = new New Org.Apache.Spark.Sql.Hive.HiveContext(Sc)
data.toDF.write.format("orc").save(outputPath)

Related

spark HWC cannot write to an existing table

In HDP 3.1.0, HWC hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar, I cannot append (or overwrite) to an existing table depending on the database.
I tested on one datase called DSN, it works and on another database called CLEAN_CRYPT it fails.
Both databases are crypted + kerberos
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession._
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.execute("show databases").show()
hive.setDatabase("clean_crypt")
val df=hive.execute("select * from test")
df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table","test").mode("append").save
The error message is "table already exists". I tried overwrite mode without success.
If I drop the table, it passes !!!
Any idea ?
This is probably related to a HWC bug which is reported by multiple users here.
What I've found is that it only occurs if you try to use a partitionBy at writing, like:
df.write.partitionBy("part")
.mode(SaveMode.Overwrite)
.format(com.hortonworks.hwc.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
.option("table", "`default`.`testout`").save;
On an other note, if you remove the partitionBy piece, partitioning works as expected (as partition info is already stored in the Hive table), but if you use overwrite mode (and not, for example, append), HWC will drop and recreate your table and it won't reapply partitioning info.
If you want to use the Hortnoworks connector and append to a partitioned table, you should not use partitionBy as it does not seem to work properly with this connector. Instead, you could use the partition options and add Spark parameters for dynamic partitioning.
Example:
import org.apache.spark.SparkConf
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR
import org.apache.spark.sql.{SaveMode, SparkSession}
val sparkConf = new SparkConf()
.setMaster("yarn")
.setAppName("My application")
.set("hive.exec.dynamic.partition", "true")
.set("hive.exec.dynamic.partition.mode", "nonstrict")
val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()
val hive = HiveWarehouseBuilder.session(spark).build()
val hiveDatabase = "clean_crypt")
hive.setDatabase(hiveDatabase)
val df = hive.execute("select * from test")
df
.write
.format(HIVE_WAREHOUSE_CONNECTOR)
.mode(SaveMode.Append)
.option("partition", partitionColumn)
.option("table", table)
.save()
For the above, the hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar was used. If the table does not exist, the connector creates it and stores it (by default) in ORC format.

Difference between sparksession text and textfile methods? [duplicate]

This question already has an answer here:
Difference between sc.textFile and spark.read.text in Spark
(1 answer)
Closed 3 years ago.
I am working with Spark scala shell and trying to create dataframe and datasets from a text file.
For getting datasets from a text file, there are two options, text and textFile methods as follows:
scala> spark.read.
csv format jdbc json load option options orc parquet schema table text textFile
Here is how i am gettting datasets and dataframe from both these methods:
scala> val df = spark.read.text("/Users/karanverma/Documents/logs1.txt")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> val df = spark.read.textFile("/Users/karanverma/Documents/logs1.txt")
df: org.apache.spark.sql.Dataset[String] = [value: string]
So my question is what is the difference between the two methods for text file?
When to use which methods?
As I've noticed that they are almost having the same functionality,
It just that spark.read.text transform data to Dataset which is a distributed collection of data, while spark.read.textFile transform data to Dataset[Type] which is consist of Dataset organized into named columns.
Hope it helps.

How to write to Kafka from Spark with a changed schema without getting exceptions?

I'm loading parquet files from Databricks to Spark:
val dataset = context.session.read().parquet(parquetPath)
Then I perform some transformations like this:
val df = dataset.withColumn(
columnName, concat_ws("",
col(data.columnName), lit(textToAppend)))
When I try to save it as JSON to Kafka (not back to parquet!):
df = df.select(
lit("databricks").alias("source"),
struct("*").alias("data"))
val server = "kafka.dev.server" // some url
df = dataset.selectExpr("to_json(struct(*)) AS value")
df.write()
.format("kafka")
.option("kafka.bootstrap.servers", server)
.option("topic", topic)
.save()
I get the following exception:
org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file dbfs:/mnt/warehouse/part-00001-tid-4198727867000085490-1e0230e7-7ebc-4e79-9985-0a131bdabee2-4-c000.snappy.parquet. Column: [item_group_id], Expected: StringType, Found: INT32
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:310)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:287)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
at com.databricks.sql.io.parquet.NativeColumnReader.readBatch(NativeColumnReader.java:448)
at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.nextBatch(DatabricksVectorizedParquetRecordReader.java:330)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:167)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:299)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:287)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This only happens if I'm trying to read multiple partitions. For example in the /mnt/warehouse/ directory I have a lot of parquet files each representing data from a datestamp. If I read only one of them I don't get exceptions but if I read the whole directory this exception happens.
I get this when I do a transformation, like above where I change the data type of a column. How can I fix this? I'm not trying to write back to parquet but to transform all files from the same source schema to a new schema and write them to Kafka.
There seems to be an issue with the parquet files. The item_group_id column in the files are not all of the same data type, some files have the column stored as String and others as Integer. From the source code of the exception SchemaColumnConvertNotSupportedException we see the description:
Exception thrown when the parquet reader find column type mismatches.
A simple way to replicate the problem can be found among the tests for Spark on github:
Seq(("bcd", 2)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet(s"$path/parquet")
Seq((1, "abc")).toDF("a", "b").coalesce(1).write.mode("append").parquet(s"$path/parquet")
spark.read.parquet(s"$path/parquet").collect()
Of course, this will only happen when reading multiple files at once, or as in the test above where more data has been appended. If a single file is read then there will not be a mismatch issue between the datatypes of a column.
The easiest way to fix the problem would be to make sure that the column types of all files are correct while writing the files.
The alternative is to read all the parquet files separetly, change the schemas to match and then combine them with union. An easy way to do this is to adjust the schemas:
// Specify the files and read as separate dataframes
val files = Seq(...)
val dfs = files.map(file => spark.read.parquet(file))
// Specify the schema (here the schema of the first file is used)
val schema = dfs.head.schema
// Create new columns with the correct names and types
val newCols = schema.map(c => col(c.name).cast(c.dataType))
// Select the new columns and merge the dataframes
val df = dfs.map(_.select(newCols: _*)).reduce(_ union _)
You can find the instruction on this link
It present you the differents ways to write data to a kafka topic.

how to write a dataframe from scala to HDFS as csv

I am new to scala and HDFS. I need to dump my data into HDFS. The data is in the form of a spark dataframe but I want to write it as a CSV in HDFS.
Can someone please share the basic boilder plate code for starters.
Thanks
If your data is flat then the following would work.
val df: DataFrame = ???
val filePath: String = ???
df.map(_.mkString(",")).saveAsTextFile(filePath)
However the issue with the question is that we need to see what your data looks like. For example if its got nested Structs then saving as a CSV isn't clearly defined.

How to load data into hive external table using spark?

I want to try to load data into hive external table using spark.
please help me on this, how to load data into hive using scala code or java
Thanks in advance
Assuming that hive external table is already created using something like,
CREATE EXTERNAL TABLE external_parquet(c1 INT, c2 STRING, c3 TIMESTAMP)
STORED AS PARQUET LOCATION '/user/etl/destination'; -- location is some directory on HDFS
And you have an existing dataFrame / RDD in Spark, that you want to write.
import sqlContext.implicits._
val rdd = sc.parallelize(List((1, "a", new Date), (2, "b", new Date), (3, "c", new Date)))
val df = rdd.toDF("c1", "c2", "c3") //column names for your data frame
df.write.mode(SaveMode.Overwrite).parquet("/user/etl/destination") // If you want to overwrite existing dataset (full reimport from some source)
If you don't want to overwrite existing data from your dataset...
df.write.mode(SaveMode.Append).parquet("/user/etl/destination") // If you want to append to existing dataset (incremental imports)
**I have tried similar scenario and had satisfactory results.I have worked with avro data with schema in json.I streamed kafka topic with spark streaming and persisted the data in to hdfs which is the location of an external table.So every 2 seconds(the streaming duration the data will be stored in to hdfs in a seperate file and the hive external table will be appended as well).
Here is the simple code snippet
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd =>
{
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val dataframe = sqlContext.read.json(rdd.map(_._2))
val myEvent = dataframe.toDF()
import org.apache.spark.sql.SaveMode
myEvent.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("maprfs:///location/of/hive/external/table")
})
Don't forget to stop the 'SSC' at the end of the application.Doing it gracefully is more preferable.
P.S:
Note that while creating an external table make sure you are creating the table with schema identical to the dataframe schema. Because when getting converted in to a dataframe which is nothing but a table, the columns will be arranged in an alphabetic order.