Convert dataframe to hive table in spark scala - scala

I am trying to save a DataFrame as a Hive table in Spark Scala. I read the DataFrame from an XML file using the SQL context. When I save this DataFrame as a Hive table, I get this warning:
"WARN HiveContext$$anon$1: Could not persist database_1.test_table in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format."
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
object spark_conversion {
def main(args: Array[String]): Unit = {
if (args.length < 2) {
System.err.println("Usage: <input file> <output dir>")
System.exit(1)
}
val in_path = args(0)
val out_path_csv = args(1)
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("conversion")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
val df = hiveContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "PolicyPeriod")
.option("attributePrefix", "attr_")
.load(in_path)
df.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save(out_path_csv)
df.saveAsTable("database_1.test_table")
df.printSchema()
df.show()
}
}

saveAsTable in Spark is not compatible with Hive here (I am on CDH 5.5.2). The workaround from the Cloudera website is:
df.registerTempTable(tempName)
hsc.sql(s"""
CREATE TABLE $tableName (
// field definitions )
STORED AS $format """)
hsc.sql(s"INSERT INTO TABLE $tableName SELECT * FROM $tempName")
http://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html
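The `// field definitions` placeholder in the workaround has to be replaced with the table's actual columns. A minimal sketch of how that statement could be assembled, written in plain Scala so it runs without a cluster. The helper `createTableDdl`, the two column names, and the `PARQUET` format are hypothetical choices for illustration and are not part of the Cloudera note:

```scala
// Hypothetical helper: builds the CREATE TABLE statement used in the workaround.
// Column names and Hive types are passed in by hand here; in a real job they
// would have to match the DataFrame's schema.
object HiveDdl {
  def createTableDdl(tableName: String,
                     fields: Seq[(String, String)],
                     format: String): String = {
    // Each field becomes "`name` TYPE"; backticks guard unusual column names.
    val fieldDefs = fields
      .map { case (name, hiveType) => s"`$name` $hiveType" }
      .mkString(", ")
    s"CREATE TABLE $tableName ($fieldDefs) STORED AS $format"
  }

  def main(args: Array[String]): Unit = {
    val ddl = createTableDdl(
      "database_1.test_table",
      Seq("policy_id" -> "STRING", "premium" -> "DOUBLE"), // assumed columns
      "PARQUET")                                           // assumed format
    println(ddl)
  }
}
```

The resulting string would be passed to `hsc.sql(...)` in place of the first statement of the workaround, followed by the `INSERT INTO TABLE ... SELECT * FROM ...` statement.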

Related

.csv not a SequenceFile error on Select Hive Query

I am new to Spark and Scala.
Code summary:
Read data from CSV files --> inner join on 2 files --> write the result to a Hive table --> submit the job on the cluster
Can you please help me identify what went wrong?
The code is not complex, and the job runs successfully on the cluster.
However, when I query the data written to the Hive table, I get this error:
hive> select * from Customers limit 10;
Failed with exception java.io.IOException:java.io.IOException: hdfs://m01.itversity.com:9000/user/itv000666/warehouse/updatedcustomers.db/customers/part-00000-348a54cf-aa0c-45b4-ac49-3a881ae39702_00000.c000 .csv not a SequenceFile
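import org.apache.log4j.Logger // assumption: Logger is the log4j logger bundled with Spark
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}
object LapeyreSparkDemo extends App {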
//Getting spark ready
val sparkConf = new SparkConf()
sparkConf.set("spark.app.name","Spark for Lapeyre")
//Creating Spark Session
val spark = SparkSession.builder()
.config(sparkConf)
.enableHiveSupport()
.config("spark.sql.warehouse.dir","/user/itv000666/warehouse")
.getOrCreate()
Logger.getLogger(getClass.getName).info("Spark Session Created Successfully")
//Reading
Logger.getLogger(getClass.getName).info("Data loading in DF started")
val ordersSchema = "orderid Int, customerName String, orderDate String, custId Int, orderStatus String, age String, amount Int"
val orders2019Df = spark.read
.format("csv")
.option("header",true)
.schema(ordersSchema)
.option("path","/user/itv0006666/lapeyrePoc/orders2019.csv")
.load
val newOrder = orders2019Df.withColumnRenamed("custId", "oldCustId")
.withColumnRenamed("customername","oldCustomerName")
val orders2020Df = spark.read
.format("csv")
.option("header",true)
.schema(ordersSchema)
.option("path","/user/itv000666/lapeyrePoc/orders2020.csv")
.load
Logger.getLogger(getClass.getName).info("Data loading in DF complete")
//processing
Logger.getLogger(getClass.getName).info("Processing Started")
val joinCondition = newOrder.col("oldCustId") === orders2020Df.col("custId")
val joinType = "inner"
val joinData = newOrder.join(orders2020Df, joinCondition, joinType)
.select("custId","customername")
//Writing
spark.sql("create database if not exists updatedCustomers")
joinData.write
.format("csv")
.mode(SaveMode.Overwrite)
.bucketBy(4, "custId")
.sortBy("custId")
.saveAsTable("updatedCustomers.Customers")
//Stopping Spark Session
spark.stop()
}
Please let me know if more information is required.
Thanks in advance.
This was the culprit:
joinData.write
.format("csv")
I used this instead, and it worked:
joinData.write
.format("Hive")
Since I am writing data to a Hive table (ORC format), the format should be "Hive" and not "csv".
Also, do not forget to enable Hive support when creating the Spark session.
Also, in Spark 2, bucketBy and sortBy were not supported for this write. They may be supported in Spark 3.

I don't know how to do the same using parquet file

Link to (data.csv) and (output.csv)
import org.apache.spark.sql._
object Test {
def main(args: Array[String]) {
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
val tempDF=spark.read.csv("data.csv")
tempDF.coalesce(1).write.parquet("Parquet")
val rdd = sc.textFile("Parquet")
I converted data.csv into an optimised Parquet file and then loaded it. Now I want to apply all the transformations to the Parquet data, just as I did on the CSV file below, and then save the result as a Parquet file.
val header = rdd.first
val rdd1 = rdd.filter(_ != header)
val resultRDD = rdd1.map { r =>
val Array(country, values) = r.split(",")
country -> values
}.reduceByKey((a, b) => a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";"))
import spark.sqlContext.implicits._
val dataSet = resultRDD.map { case (country: String, values: String) => CountryAgg(country, values) }.toDS()
dataSet.coalesce(1).write.option("header","true").csv("output")
}
case class CountryAgg(country: String, values: String)
}
I reckon you are trying to add up the corresponding elements of the arrays, grouped by country. I have done this using the DataFrame API, which makes the job easier.
Code for your reference:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("path", "/path/to/input/data.csv")
.load()
val df1 = df.select(
$"Country",
(split($"Values", ";"))(0).alias("c1"),
(split($"Values", ";"))(1).alias("c2"),
(split($"Values", ";"))(2).alias("c3"),
(split($"Values", ";"))(3).alias("c4"),
(split($"Values", ";"))(4).alias("c5")
)
.groupBy($"Country")
.agg(
sum($"c1" cast "int").alias("s1"),
sum($"c2" cast "int").alias("s2"),
sum($"c3" cast "int").alias("s3"),
sum($"c4" cast "int").alias("s4"),
sum($"c5" cast "int").alias("s5")
)
.select(
$"Country",
concat(
$"s1", lit(";"),
$"s2", lit(";"),
$"s3", lit(";"),
$"s4", lit(";"),
$"s5"
).alias("Values")
)
df1.repartition(1)
.write
.format("csv")
.option("delimiter",",")
.option("header", "true")
.option("path", "/path/to/output")
.save()
Here is the output for your reference.
scala> df1.show()
+-------+-------------------+
|Country| Values|
+-------+-------------------+
|Germany| 144;166;151;172;70|
| China| 218;239;234;209;75|
| India| 246;153;148;100;90|
| Canada| 183;258;150;263;71|
|England|178;114;175;173;153|
+-------+-------------------+
P.S.:
You can change the output format to parquet/orc or anything you wish.
I have repartitioned df1 into 1 partition just so that you get a single output file. You can choose whether to repartition based on your use case.
Hope this helps.
You could just read the file as parquet and perform the same operations on the resulting dataframe:
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("data.parquet")
If you need an rdd you can then just call:
val rdd = parquetFileDF.rdd
Then you can proceed with the transformations as before and write the result as Parquet, as you do in your question.
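The per-country aggregation itself does not depend on the file format. As a rough sketch of that logic, here it is on plain Scala collections instead of an RDD, so it can be run without Spark. The object name `CountrySums` and the sample rows are made up for illustration and are not taken from the linked data.csv:

```scala
object CountrySums {
  // Adds two ";"-separated integer lists element by element: "1;2" and "3;4" give "4;6".
  // zip stops at the shorter list, so lists of unequal length are truncated.
  def addValues(a: String, b: String): String =
    a.split(";").zip(b.split(";"))
      .map { case (x, y) => x.toInt + y.toInt }
      .mkString(";")

  // Groups "country,values" rows by country and sums their value lists.
  // A row without exactly one comma throws a MatchError.
  def aggregate(rows: Seq[String]): Map[String, String] =
    rows
      .map { row =>
        row.split(",") match {
          case Array(country, values) => country -> values
        }
      }
      .groupBy(_._1)
      .map { case (country, pairs) => country -> pairs.map(_._2).reduce(addValues) }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows
    val rows = Seq("India,1;2;3", "China,4;5;6", "India,10;20;30")
    aggregate(rows).foreach { case (country, values) => println(s"$country,$values") }
  }
}
```

With an RDD obtained from `parquetFileDF.rdd`, the same `addValues` function could be passed to `reduceByKey` once each Row has been mapped to a `(country, values)` pair.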

Unable to filter CSV columns stored in dataframe in spark 2.2.0

I am reading a CSV file from my local machine using Spark and Scala and storing it in a DataFrame (called df). I need to select only a few columns from df, give them new alias names, and save the result to a new DataFrame, newDf. I have tried to do this, but I am getting the error below.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`history_temp.time`' given input columns: [history_temp.time, history_temp.poc]
Below is the code written to read the csv file from my local machine.
import org.apache.spark.sql.SparkSession
object DataLoadConversion {
def main(args: Array[String]): Unit = {
System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse")
val spark = SparkSession.builder().master("local").appName("DataConversion").getOrCreate()
val df = spark.read.format("com.databricks.spark.csv")
.option("quote", "\"")
.option("escape", "\"")
.option("delimiter", ",")
.option("header", "true")
.option("mode", "FAILFAST")
.option("inferSchema","true")
.load("file:///C:/Users/an/Desktop/ct_temp.csv")
df.show(5) // Till this code is working fine
val newDf = df.select("history_temp.time","history_temp.poc")
Below are the variants I tried, which do not work.
// val newDf = df.select($"history_temp.time",$"history_temp.poc")
// val newDf = df.select("history_temp.time","history_temp.poc")
// val newDf = df.select( df("history_temp.time").as("TIME"))
// val newDf = df.select(df.col("history_temp.time"))
// df.select(df.col("*")) // This is working
newDf.show(10)
}
}
From the looks of it, the column name format is the issue here. I am guessing the columns are just regular StringType, but when a name contains a dot, as in history_temp.time, Spark treats the dot as access to a field inside a nested column, which is not the case here. I would rename all of the columns, replacing "." with "_"; then the same select should work. You can use foldLeft to replace every "." with "_", like below.
val replacedDF = df.columns.foldLeft(df) { (newdf, colname) =>
newdf.withColumnRenamed(colname, colname.replace(".", "_"))
}
With that done, you can select from replacedDF as below:
val newDf = replacedDF.select("history_temp_time","history_temp_poc")
Let me know how it works out for you.
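To make the foldLeft step easier to follow, here is a small sketch of the same pattern without Spark. `Table` is a hypothetical stand-in for a DataFrame that only tracks column names; it is not a Spark class:

```scala
object RenameColumns {
  // Stand-in for a DataFrame: only the column names matter for this sketch.
  final case class Table(columns: Vector[String]) {
    def withColumnRenamed(existing: String, newName: String): Table =
      Table(columns.map(c => if (c == existing) newName else c))
  }

  // Same shape as the foldLeft over df.columns: the accumulator starts as the
  // original table, and each step renames one column.
  def replaceDots(table: Table): Table =
    table.columns.foldLeft(table) { (acc, name) =>
      acc.withColumnRenamed(name, name.replace(".", "_"))
    }

  def main(args: Array[String]): Unit = {
    val renamed = replaceDots(Table(Vector("history_temp.time", "history_temp.poc")))
    println(renamed.columns.mkString(", ")) // history_temp_time, history_temp_poc
  }
}
```

If renaming is not desirable, another option that may be worth trying is to wrap the dotted name in backticks inside the select, for example "`history_temp.time`", so that the dot is read as part of the column name.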

Accessing Azure Data Lake Storage gen2 from Scala

I am able to connect to ADLS gen2 from a notebook running on Azure Databricks, but I am unable to connect from a job using a jar. In the Scala code I used the same Spark conf settings as in the notebook, save for the use of dbutils.
Notebook:
spark.conf.set(
"fs.azure.account.key.xxxx.dfs.core.windows.net",
dbutils.secrets.get(scope = "kv-secrets", key = "xxxxxx"))
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
val rdd = sqlContext.read.format("csv")
.option("header", "true")
.load("abfss://catalogs@xxxx.dfs.core.windows.net/test/sample.csv")
// Convert rdd to data frame using toDF; the following import is
//required to use toDF function.
val df: DataFrame = rdd.toDF()
// Write file to parquet
df.write.parquet("abfss://catalogs@xxxx.dfs.core.windows.net/test/Sales.parquet")
Scala code:
val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder().getOrCreate()
sc.getConf.setAppName("Test")
sc.getConf.set("fs.azure.account.key.xxxx.dfs.core.windows.net",
"<actual key>")
sc.getConf.set("fs.azure.account.auth.type", "OAuth")
sc.getConf.set("fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
sc.getConf.set("fs.azure.account.oauth2.client.id", "<app id>")
sc.getConf.set("fs.azure.account.oauth2.client.secret", "<app password>")
sc.getConf.set("fs.azure.account.oauth2.client.endpoint",
"https://login.microsoftonline.com/<tenant id>/oauth2/token")
sc.getConf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
val sqlContext = spark.sqlContext
val rdd = sqlContext.read.format("csv")
.option("header", "true")
.load("abfss://catalogs@xxxx.dfs.core.windows.net/test/sample.csv")
// Convert rdd to data frame using toDF; the following import is
//required to use toDF function.
val df: DataFrame = rdd.toDF()
println(df.count())
// Write file to parquet
df.write.parquet("abfss://catalogs@xxxx.dfs.core.windows.net/test/Sales.parquet")
I expected the parquet file to get written. Instead I get the following error:
19/04/20 13:58:40 ERROR Uncaught throwable from user code: Configuration property xxxx.dfs.core.windows.net not found.
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:385)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:802)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.(AzureBlobFileSystemStore.java:133)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:103)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
Never mind, silly mistake. It should be:
val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder().getOrCreate()
sc.getConf.setAppName("Test")
spark.conf.set("fs.azure.account.key.xxxx.dfs.core.windows.net",
"<actual key>")
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<app id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<app password>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint",
"https://login.microsoftonline.com/<tenant id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
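A likely reason the switch from `sc.getConf.set` to `spark.conf.set` matters: `SparkContext.getConf` is documented as returning a copy of the configuration, so values set on that copy never reach the running session. The sketch below models only that copy-versus-live behaviour in plain Scala. `Context` is a hypothetical stand-in, not a Spark class:

```scala
import scala.collection.mutable

object ConfCopyDemo {
  // Simplified stand-in for a context whose getConf hands out a copy.
  final class Context {
    private val settings = mutable.Map.empty[String, String]

    // Returns a fresh copy; changes to it do not affect `settings`.
    def getConf: mutable.Map[String, String] =
      mutable.Map.empty[String, String] ++= settings

    // Writes to the live settings, comparable in spirit to spark.conf.set.
    def set(key: String, value: String): Unit = settings.update(key, value)

    def get(key: String): Option[String] = settings.get(key)
  }

  def main(args: Array[String]): Unit = {
    val ctx = new Context
    ctx.getConf.update("fs.azure.account.auth.type", "OAuth") // lands on the copy only
    println(ctx.get("fs.azure.account.auth.type"))            // None
    ctx.set("fs.azure.account.auth.type", "OAuth")            // lands on the live settings
    println(ctx.get("fs.azure.account.auth.type"))            // Some(OAuth)
  }
}
```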

special character in dataframe spark

I have a CSV file with the following content
id,pos_id,supplier_id
5127973,2000,"test
5704355,77,/10122
I want to load it into a DataFrame with the data exactly as it is. This DataFrame will then be loaded into PostgreSQL through JDBC.
Here is what I did:
val conf = new SparkConf().setMaster("local[2]").setAppName("my app")
val sc = new SparkContext(conf)
val sparkSession = SparkSession.builder.config(conf = conf).appName("spark session example").getOrCreate()
val df= sparkSession.sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true").option("escape", "\"").load("C:\\Users\\MHT\\Desktop\\data.csv")
df.show()
+-------+------+--------------------+
| id|pos_id| supplier_id|
+-------+------+--------------------+
|5127973| 2000|test
5704355,77,/...|
+-------+------+--------------------+
What should I do to get the same data in the DataFrame, and then the same data in PostgreSQL?
Write the CSV data to HDFS, then use Sqoop to export it to the destination database, providing the required JDBC jars in the $SQOOP_HOME/lib directory.
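The merged row in the `df.show()` output appears to come from the lone `"` before `test`: the CSV reader treats it as the start of a quoted field and keeps reading into the next line. If the file never uses real quoting, one option is to parse it without giving the quote character any special meaning. A minimal plain-Scala sketch of that idea follows; `RawCsv` is a hypothetical name, and the approach assumes no field contains a comma:

```scala
object RawCsv {
  // Splits each line on commas without any quote handling, so a lone `"`
  // stays part of the field value. The -1 limit keeps trailing empty fields.
  def parse(lines: Seq[String]): Seq[Vector[String]] =
    lines.map(_.split(",", -1).toVector)

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "id,pos_id,supplier_id",
      "5127973,2000,\"test",
      "5704355,77,/10122")
    // Drop the header row and print the remaining rows field by field.
    parse(lines).drop(1).foreach(row => println(row.mkString(" | ")))
  }
}
```

I believe the Spark CSV reader can be told to do the same by setting its `quote` option to an empty string, but that should be checked against the Spark version in use.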