How to write the result of a structured query in hive format? - scala

I'm trying to insert data into a Hive table via the DataStreamWriter class using the hive format:
rdf.writeStream
  .format("hive")
  .partitionBy("date")
  .option("checkpointLocation", "/tmp/")
  .option("db", "default")
  .option("table", "Daily_summary_data")
  .queryName("socket-hive-streaming")
  .start()
It throws the following error:
org.apache.spark.sql.AnalysisException: Hive data source can only be used with tables, you can not write files of Hive data source directly.;
How can I resolve this?

It's not possible to write the result of a streaming structured query using the hive data source; the exception comes from a check in Spark's source code that explicitly disallows it.
Writing with the parquet format instead may be an option.
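For example, a minimal sketch of the same streaming write using the parquet sink instead (the output path and query name below are assumptions, not from the question):

```scala
// Sketch: write the streaming query out as parquet files instead of the
// unsupported "hive" sink. The partition column and checkpoint location
// come from the question; the output path is assumed.
rdf.writeStream
  .format("parquet")
  .partitionBy("date")
  .option("checkpointLocation", "/tmp/checkpoint")
  .option("path", "/user/hive/warehouse/daily_summary_data") // assumed path
  .queryName("socket-parquet-streaming")
  .start()
```

If the output directory sits under a Hive table's location (for example an external table defined over that path), the data stays queryable from Hive.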

Related

FileNotFoundException in Azure Synapse when trying to delete a Parquet file and replace it with a dataframe created from the original file

I am trying to delete an existing Parquet file and replace it with data in a dataframe that reads the data from the original Parquet file before deleting it. This is in Azure Synapse using PySpark.
So I created the Parquet file from a dataframe and put it in the path:
full_file_path
I am trying to update this Parquet file. From what I have read, you can't edit a Parquet file, so as a workaround I am reading the file into a new dataframe:
df = spark.read.parquet(full_file_path)
I then create a new dataframe with the update:
df.createOrReplaceTempView("temp_table")
df_variance = spark.sql("""SELECT * FROM temp_table WHERE ....""")
and the df_variance dataframe is created.
I then delete the original file with:
mssparkutils.fs.rm(full_file_path, True)
and the original file is deleted. But when I do any operation with the df_variance dataframe, like df_variance.count(), I get a FileNotFoundException error. What I am really trying to do is:
df_variance.write.parquet(full_file_path)
and that also throws a FileNotFoundException. In fact, any operation I try on the df_variance dataframe produces this error. So I am thinking it might have to do with the fact that the original full_file_path has been deleted, and that the df_variance dataframe maintains some sort of reference to the (now deleted) file path, or something like that. Please help. Thanks.
Spark dataframes aren't collections of rows; they use "deferred execution" (lazy evaluation). Only when you call
df_variance.write
does a Spark job run that reads from the source, performs your transformations, and writes to the destination.
A Spark dataframe is really just a query that you can compose with other expressions before finally running it.
You might want to move on from parquet to delta. https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-what-is-delta-lake
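A minimal sketch of the safe ordering, written in Scala (the question uses PySpark, but the pattern is identical; the scratch path is an assumption):

```scala
// Sketch: force the query to run while the source files still exist by
// writing the result to a scratch path first, then replace the original.
// "fullFilePath" comes from the question; the "_tmp" suffix is assumed.
val tmpPath = fullFilePath + "_tmp"

// The Spark job runs here, while the source files are still intact.
dfVariance.write.mode("overwrite").parquet(tmpPath)

// Re-read from the materialized copy; this dataframe no longer references
// the original files, so overwriting or deleting them is now safe.
val materialized = spark.read.parquet(tmpPath)
materialized.write.mode("overwrite").parquet(fullFilePath)
```

Alternatively, calling df_variance.cache() followed by an action such as count() before the delete can work, though caching a large result is less robust than writing it out, since evicted partitions are recomputed from the (deleted) source.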

DF insertInto is not persisting all columns for mixed structured data (JSON, string)

DataFrame saveAsTable persists all the column values properly, but the insertInto function does not store all the columns: the JSON data is truncated and the subsequent column is not stored in the Hive table.
Our Environment
Spark 2.2.0
EMR 5.10.0
Scala 2.11.8
The sample data is
a8f11f90-20c9-11e8-b93e-2fc569d27605 efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010 NonDemarcated {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13","nodeid":"N02c00056","parkingzoneid":"E657F298-2D96-4C7D-8516-E228153FE010","site-id":"a8f11f90-20c9-11e8-b93e-2fc569d27605","channel":1,"type":"Park","active":true,"tag":"","configured_date":"2017-10-23 23:29:11.20","vs":[5.0,1.7999999523162842,1.5]}
DF SaveAsTable
val spark = SparkSession.builder().appName("Spark SQL Test").
config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
enableHiveSupport().getOrCreate()
val zoneStatus = spark.table("zone_status")
zoneStatus.select(col("site-id"),col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts")).
write.mode(SaveMode.Overwrite).saveAsTable("dwh_zone_status")
The data is stored properly in the result table:
a8f11f90-20c9-11e8-b93e-2fc569d27605 efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010 NonDemarcated 0 {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13","nodeid":"N02c00056","parkingzoneid":"E657F298-2D96-4C7D-8516-E228153FE010","site-id":"a8f11f90-20c9-11e8-b93e-2fc569d27605","channel":1,"type":"Park","active":true,"tag":"","configured_date":"2017-10-23 23:29:11.20","vs":[5.0,1.7999999523162842,1.5]} 1520453589
DF insertInto
zoneStatus.
select(col("site-id"),col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts")).
write.mode(SaveMode.Overwrite).insertInto("zone_status_insert")
But insertInto does not persist all the contents: the JSON string is stored only partially, and the subsequent column is not stored.
a8f11f90-20c9-11e8-b93e-2fc569d27605 efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010 NonDemarcated 0 {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13" NULL
We use insertInto throughout our projects and recently encountered this issue when parsing the JSON data to pull out other metrics; we noticed that the config content is not stored fully. We plan to change to saveAsTable, but we could avoid the code change if there is a workaround that can be added to the Spark configuration.
You can use the following alternative ways of inserting data into the table.
val zoneStatusDF = zoneStatus.
select(col("site-id"),col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts"))
zoneStatusDF.registerTempTable("zone_status_insert")
Or
zoneStatus.sqlContext.sql("create table zone_status_insert as select * from zone_status")
The reason is that the target table was created with
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
so the comma-delimited text serde splits the JSON value, which itself contains commas, at the first comma. After removing the ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' clause, insertInto is able to save the entire contents.
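For example, a sketch of a target-table definition that avoids a text delimiter entirely (the column names and types below are assumptions based on the sample data, with hyphens replaced by underscores and the reserved word "type" renamed):

```scala
// Sketch: a target table stored as parquet, so no field delimiter can
// split the JSON column. Column names/types are assumed, not from the
// original DDL.
spark.sql("""
  CREATE TABLE IF NOT EXISTS zone_status_insert (
    site_id STRING, org_id STRING, groupid STRING, zid STRING,
    ztype STRING, flag INT, config STRING, ts BIGINT
  )
  STORED AS PARQUET
""")
```

With a binary format such as parquet, the commas inside the JSON string are just data, so insertInto preserves the full value.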

Spark Athena connector

I need to use Athena from Spark, but Spark uses PreparedStatement with JDBC drivers, and that gives me an exception:
"com.amazonaws.athena.jdbc.NotImplementedException: Method Connection.prepareStatement is not yet implemented"
Can you please let me know how I can connect to Athena from Spark?
I don't know how you'd connect to Athena from Spark, but you don't need to - you can very easily query the data that Athena contains (or, more correctly, "registers") from Spark.
There are two parts to Athena:
the Hive Metastore (now called the Glue Data Catalog), which contains mappings between database and table names and all underlying files, and
the Presto query engine, which translates your SQL into data operations against those files.
When you start an EMR cluster (v5.8.0 and later) you can instruct it to connect to your Glue Data Catalog. This is a checkbox in the 'create cluster' dialog. When you check this option your Spark SqlContext will connect to the Glue Data Catalog, and you'll be able to see the tables in Athena.
You can then query these tables as normal.
See https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html for more.
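For instance, once the cluster is attached to the Glue Data Catalog, a table registered in Athena can be queried from Spark like any other (the database and table names below are placeholders):

```scala
// Sketch: Athena/Glue-registered tables are visible to Spark SQL directly,
// no JDBC connection to Athena required.
val df = spark.sql("SELECT * FROM my_database.my_table LIMIT 10")
df.show()
```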
You can use this JDBC driver: SimbaAthenaJDBC
<dependency>
    <groupId>com.syncron.amazonaws</groupId>
    <artifactId>simba-athena-jdbc-driver</artifactId>
    <version>2.0.2</version>
</dependency>
To use it:
SparkSession spark = SparkSession
.builder()
.appName("My Spark Example")
.getOrCreate();
Class.forName("com.simba.athena.jdbc.Driver");
Properties connectionProperties = new Properties();
connectionProperties.put("User", "AWSAccessKey");
connectionProperties.put("Password", "AWSSecretAccessKey");
connectionProperties.put("S3OutputLocation", "s3://my-bucket/tmp/");
connectionProperties.put("AwsCredentialsProviderClass",
"com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider");
connectionProperties.put("AwsCredentialsProviderArguments", "/my-folder/.athenaCredentials");
connectionProperties.put("driver", "com.simba.athena.jdbc.Driver");
List<String> predicateList =
Stream
.of("id = 'foo' and date >= DATE'2018-01-01' and date < DATE'2019-01-01'")
.collect(Collectors.toList());
String[] predicates = new String[predicateList.size()];
predicates = predicateList.toArray(predicates);
Dataset<Row> data =
spark.read()
.jdbc("jdbc:awsathena://AwsRegion=us-east-1;",
"my_env.my_table", predicates, connectionProperties);
You can also use this driver in a Flink application:
TypeInformation[] fieldTypes = new TypeInformation[] {
BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO
};
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
JDBCInputFormat jdbcInputFormat = JDBCInputFormat.buildJDBCInputFormat()
.setDrivername("com.simba.athena.jdbc.Driver")
.setDBUrl("jdbc:awsathena://AwsRegion=us-east-1;UID=my_access_key;PWD=my_secret_key;S3OutputLocation=s3://my-bucket/tmp/;")
.setQuery("select id, val_col from my_env.my_table WHERE id = 'foo' and date >= DATE'2018-01-01' and date < DATE'2019-01-01'")
.setRowTypeInfo(rowTypeInfo)
.finish();
DataSet<Row> dbData = env.createInput(jdbcInputFormat, rowTypeInfo);
You can't directly connect Spark to Athena. Athena is simply an implementation of Prestodb targeting s3. Unlike Presto, Athena cannot target data on HDFS.
However, if you want to use Spark to query data in s3, then you are in luck with HUE, which will let you query data in s3 from Spark on Elastic Map Reduce (EMR).
See Also:
Developer Guide for Hadoop User Experience (HUE) on EMR.
Kirk Broadhurst's answer is correct if you want to use the data that Athena exposes.
If you want to use the Athena engine itself, there is a lib on GitHub that overcomes the preparedStatement problem.
Note that I didn't succeed in using the lib, due to my lack of experience with Maven etc.
Actually you can use B2W's Spark Athena Driver.
https://github.com/B2W-BIT/athena-spark-driver

Why does my query fail with AnalysisException?

I am new to Spark streaming. I am trying Spark Structured Streaming with local CSV files and am getting the exception below while processing.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
This is my code.
val df = spark
.readStream
.format("csv")
.option("header", "false") // Use first line of all files as header
.option("delimiter", ":") // Specifying the delimiter of the input file
.schema(inputdata_schema) // Specifying the schema for the input file
.load("file:///home/Teju/Desktop/SparkInputFiles/*.csv")
val filterop = spark.sql("select tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID,first(rssi_weightage(RSSI)) as RSSI_Weight from my_table where RSSI > -127 group by tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID order by Timestamp ASC")
val outStream = filterop.writeStream.outputMode("complete").format("console").start()
I created a cron job, so every 5 minutes I get one input CSV file, which I am trying to process with Spark streaming.
(This is not a solution but more of a comment, though given its length it ended up here. I'll turn it into a proper answer once I've collected enough information.)
My guess is that you're doing something with df that you have not included in your question.
Since the error message is about a FileSource with the path below, and a file source is a streaming dataset, it must be df that's in play:
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
Given the other lines, I guess that you register the streaming dataset as a temporary table (i.e. my_table) that you then use in spark.sql to execute SQL and writeStream to the console:
df.createOrReplaceTempView("my_table")
If that's correct, the code you've included in the question is incomplete and does not show the reason for the error.
Add .writeStream.start to your df, as the Exception is telling you.
Read the docs for more detail.
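Putting the pieces together, a sketch of the corrected flow (assuming the streaming dataframe is registered as my_table, as the question's SQL implies; the SQL is a simplified aggregation, since the "complete" output mode requires one):

```scala
// Sketch: register the streaming dataframe, query it, and start the query
// with writeStream.start, as the exception demands.
df.createOrReplaceTempView("my_table")

// Simplified stand-in for the question's query; "complete" output mode
// only works on aggregations.
val filterop = spark.sql(
  "SELECT tagShortID, Timestamp, count(*) AS cnt " +
  "FROM my_table GROUP BY tagShortID, Timestamp")

filterop.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination() // block so the streaming query keeps running
```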

How to save data in parquet format and append entries

I am trying to follow this example to save some data in parquet format and read it back. If I use write.parquet("filename"), then the iterating Spark job gives an error that
"filename" already exists.
If I use the SaveMode.Append option, then the Spark job gives the error
"org.apache.spark.sql.AnalysisException: Specifying database name or other qualifiers are not allowed for temporary tables".
Please let me know the best way to ensure new data is simply appended to the parquet file. Can I define primary keys on these parquet tables?
I am using Spark 1.6.2 on Hortonworks 2.5 system. Here is the code:
// Option 1: peopleDF.write.parquet("people.parquet")
//Option 2:
peopleDF.write.format("parquet").mode(SaveMode.Append).saveAsTable("people.parquet")
// Read in the parquet file created above
val parquetFile = spark.read.parquet("people.parquet")
//Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT * FROM people.parquet")
I believe if you use .parquet("...."), you should use .mode("append"),
not SaveMode.Append:
df.write.mode("append").parquet("....")