I am reading a CSV file with Spark and storing the DataFrame as a table in Hive using Spark JDBC against HiveServer2.
val df = spark.read.format("csv").load("PATH_TO_CSV")

import org.apache.spark.sql.SaveMode

df.write
  .format("jdbc")
  .option("url", "jdbc:hive2://127.0.0.1:10000/default")
  .option("dbtable", "student2")
  .option("user", "hive")
  .option("password", "hive")
  .mode(SaveMode.Overwrite)
  .save()
I am getting the exception below in the spark-shell and am not sure why:
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:25 cannot recognize input near '"_c0"' 'TEXT' ',' in column name or primary key or foreign key
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:279)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:265)
at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:303)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:244)
at infoworks.tools.hive.HiveUtils.execStmts(HiveUtils.java:506)
at infoworks.tools.hive.HiveUtils.createHiveTable(HiveUtils.java:740)
Can anyone help me understand why I am getting the above exception?
Can you please also share the sample data that you are trying to insert into the Hive table?
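One likely cause: Spark's generic JDBC writer generates a CREATE TABLE statement with double-quoted column names and types such as TEXT, which the HiveQL parser rejects, and that matches the ParseException around '"_c0"' 'TEXT'. A common workaround is to skip JDBC and write through the Hive catalog instead. A minimal sketch, assuming the spark-shell/session has Hive support enabled and that default.student2 (the names from your snippet) is the target:
import org.apache.spark.sql.SaveMode

// header/inferSchema are optional, but give real column names and types
// instead of the defaults _c0, _c1, ...
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("PATH_TO_CSV")

// Write directly into the Hive metastore instead of going through the Hive JDBC driver
df.write
  .mode(SaveMode.Overwrite)
  .saveAsTable("default.student2")
saveAsTable registers the table in the metastore itself, so Spark never has to generate DDL through the JDBC dialect that Hive cannot parse.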
Related
On a Databricks cluster (Spark 2.4.5, Scala 2.11) I am trying to read a file into a Spark DataFrame using the following code.
val df = spark.read
.format("com.springml.spark.sftp")
.option("host", "*")
.option("username", "*")
.option("password", "*")
.option("fileType", "csv")
.option("delimiter", ";")
.option("inferSchema", "true")
.load("/my_file.csv")
However, I get the following error
org.apache.spark.sql.AnalysisException: Path does not exist: dbfs:/local_disk0/tmp/my_file.csv;
I think I need to specify an option to save that file temporarily, but I can't find a way to do so. How can I solve that?
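If the connector downloads the file to the driver's local temp directory while Spark then resolves that path against DBFS, pointing the connector's temporary locations at paths both sides can see may help. A sketch under the assumption that this spark-sftp version supports the tempLocation/hdfsTempLocation options described in its README; the option names and the /dbfs/tmp and /tmp paths are assumptions to adjust for your workspace:
val df = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "*")
  .option("username", "*")
  .option("password", "*")
  .option("fileType", "csv")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  .option("tempLocation", "/dbfs/tmp")   // local path the driver can write to (assumed option)
  .option("hdfsTempLocation", "/tmp")    // DBFS path Spark reads back from (assumed option)
  .load("/my_file.csv")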
I'm trying to write a DataFrame to a BigQuery table. I have set up the SparkSession with the required parameters. However, when I do the write I get an error:
Py4JJavaError: An error occurred while calling o114.save.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
The code is the following:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark2 = SparkSession.builder\
.config("spark.jars", "/Users/xyz/Downloads/gcs-connector-hadoop2-latest.jar") \
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0")\
.config("google.cloud.auth.service.account.json.keyfile", "/Users/xyz/Downloads/MyProject-cd7627f8ef9b.json") \
.getOrCreate()
spark2.conf.set("parentProject", "xyz")
data=spark2.createDataFrame(
[
("AAA", 51),
("BBB", 23),
],
['codiPuntSuministre', 'valor']
)
spark2.conf.set("temporaryGcsBucket","bqconsumptions")
data.write.format('bigquery') \
.option("credentialsFile", "/Users/xyz/Downloads/MyProject-xyz.json")\
.option('table', 'consumptions.c1') \
.mode('append') \
.save()
df=spark2.read.format("bigquery").option("credentialsFile", "/Users/xyz/Downloads/MyProject-xyz.json")\
.load("consumptions.c1")
I don't get any error if I take the write out of the code, so the error happens when writing and may be related to the auxiliary bucket used to operate with BigQuery.
The error here suggests that Spark is not able to recognize the gs filesystem. You can add support for it with the configuration below. This happens because when you write to BigQuery, the files are first staged in a Google Cloud Storage bucket and only then loaded into the BigQuery table.
spark._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
I am trying this basic command to read a CSV in Scala:
val df = spark.read
.option("header", "true")
.option("sep","|")
.option("inferSchema", "true")
.csv("path/to/_34File.csv")
And I get:
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.
What could be the solution?
The solution is to rename the file from "_34File.csv" to "34File.csv". Spark (through Hadoop's input handling) treats files whose names start with an underscore or a dot as hidden/metadata files and skips them, so the CSV reader finds nothing to infer a schema from. Renaming the file worked for me.
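If renaming by hand is not practical, the same rename can be done programmatically before reading. A minimal sketch using the Hadoop FileSystem API, with the paths taken from the question:
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = spark.sparkContext.hadoopConfiguration
val src = new Path("path/to/_34File.csv")
val dst = new Path("path/to/34File.csv")

// Rename the underscore-prefixed file so Spark no longer treats it as hidden
val fs = src.getFileSystem(hadoopConf)
fs.rename(src, dst)

val df = spark.read
  .option("header", "true")
  .option("sep", "|")
  .option("inferSchema", "true")
  .csv("path/to/34File.csv")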
I saw this example code, which overwrites a partition really nicely with Spark 2.3:
dfPartition.coalesce(coalesceNum).write.mode("overwrite").format("parquet").insertInto(tblName)
My issue is that even after adding .format("parquet"), the output does not appear to be written as Parquet; the files end in .c000 instead.
The compaction and overwriting of the partition is working, but not the writing as Parquet.
Full code here:
import org.apache.spark.sql.SparkSession
import org.joda.time.{DateTime, DateTimeZone}

val sparkSession = SparkSession.builder //.master("local[2]")
.config("spark.hadoop.parquet.enable.summary-metadata", "false")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("parquet.compression", "snappy")
.enableHiveSupport() //can just comment out hive support
.getOrCreate
sparkSession.sparkContext.setLogLevel("ERROR")
println("Created hive Context")
val currentUtcDateTime = new DateTime(DateTimeZone.UTC)
//to compact yesterdays partition
val partitionDtKey = currentUtcDateTime.minusHours(24).toString("yyyyMMdd").toLong
val dfPartition = sparkSession.sql(s"select * from $tblName where $columnPartition=$hardCodedPartition")
if (!dfPartition.take(1).isEmpty) {
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
dfPartition.coalesce(coalesceNum).write.format("parquet").mode("overwrite").insertInto(tblName)
sparkSession.sql(s"msck repair table $tblName")
Helpers.executeQuery("refresh " + tblName, "impala", resultRequired = false)
}
else {
"echo invalid partition"
}
Here is the question where I got the suggestion to use this code: Overwrite specific partitions in spark dataframe write method.
What I like about this method is not having to list the partition columns, which is really nice. I can easily use it in many cases.
Using Scala 2.11, CDH 5.12, Spark 2.3.
Any suggestions?
The extension .c000 relates to the executor that wrote the file, not to the actual file format. The file could be Parquet and end with .c000, or .snappy, or .zip... To find out the actual file format, run this command:
hadoop dfs -cat /tmp/filename.c000 | head
where /tmp/filename.c000 is the HDFS path to your file. You will see some strange symbols, and you should see PAR1 (the Parquet magic bytes) at the very beginning, plus the string parquet somewhere in the output, if it is actually a Parquet file.
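Alternatively, if you prefer checking from the spark-shell: Parquet files start (and end) with the 4-byte magic PAR1, so reading the first few bytes of one output file is enough. A small sketch, with /tmp/filename.c000 standing in for one of your .c000 files:
import org.apache.hadoop.fs.Path

val path = new Path("/tmp/filename.c000")  // replace with one of your output files
val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)

val in = fs.open(path)
val magic = new Array[Byte](4)
in.readFully(magic)
in.close()

// Parquet files begin (and end) with the ASCII bytes "PAR1"
println(if (new String(magic, "US-ASCII") == "PAR1") "Parquet" else "Not Parquet")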
I have the following Spark SQL code:
import org.apache.spark.{SparkConf, SparkContext}

object MyTest extends App {
  val conf = new SparkConf().setAppName("GTPCP KPIs")
  val sc = new SparkContext(conf)
  val hContext = new org.apache.spark.sql.hive.HiveContext(sc)

  val outputDF = hContext.sql("Select field1, field2 from prddb.cust_data")
  println("records selected: " + outputDF.count() + "\n")
  outputDF.write.mode("append").saveAsTable("devdb.vs_test")
  //outputDF.show()
}
The problem is that if I run the query
Select field1, field2 from prddb.cust_data
in Hive, it gives me around 1.5 million records.
However, through Spark SQL I am not getting any output in the devdb.vs_test table.
The println statement prints 0.
I am using Spark 1.5.0.
Any help here will be appreciated!
It looks like your Spark session doesn't have Hive connectivity.
You need to link hive-site.xml into the Spark conf directory, or copy it there. Spark is not able to find your Hive metastore (a Derby database by default), so we have to link the Hive conf to the Spark conf directory.
Finally, to connect Spark SQL to an existing Hive installation, you must copy your hive-site.xml file to Spark’s configuration directory ($SPARK_HOME/conf). If you don’t have an existing Hive installation, Spark SQL will still run.
Switch to the root user and then copy hive-site.xml to the Spark conf directory:
sudo -i
cp /etc/hive/conf/hive-site.xml /etc/spark/conf/
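After copying hive-site.xml, a quick way to confirm the metastore is actually being picked up (still on Spark 1.5, so through HiveContext) is to list the databases; prddb and devdb should appear if the connection works. A minimal sketch for the spark-shell, where sc already exists:
// In the Spark 1.5 spark-shell, sc is already defined
val hContext = new org.apache.spark.sql.hive.HiveContext(sc)

// If only "default" shows up, Spark is still using a local Derby metastore
hContext.sql("show databases").show()
hContext.sql("show tables in prddb").show()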