How can I update or delete records of a Hive table from Spark without loading the entire table into a dataframe? - scala

I have a Hive ORC table with around 2 million records. Currently, to update or delete records, I load the entire table into a dataframe, apply the change, and save the result as a new dataframe in Overwrite mode (the command is below). So to update a single record, do I need to load and process the entire table's data?
I'm unable to do objHiveContext.sql("update myTable set columnName='' ")
I'm using Spark 1.4.1, Hive 1.2.1
myData.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("myTable") where myData is the updated dataframe.
How can I avoid loading all 2-3 million records just to update a single record of the Hive table?
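For reference, the full-table approach described above looks roughly like this sketch (the id column, filter value, and new value are placeholders, and the when/otherwise expression is just one way to express the change):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, lit, when}

// Load the whole table, rewrite one column for the matching row,
// and overwrite the table. Every record is reprocessed even though
// only a single row actually changes.
val myData = objHiveContext.table("myTable")
val updated = myData.withColumn(
  "columnName",
  when(col("id") === lit(42), lit("newValue")).otherwise(col("columnName"))
)
// In practice the result may need to go to a staging table or path first,
// since Spark can refuse to overwrite a table it is currently reading from.
updated.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("myTable")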

Related

External table not pulling new data Spark Databricks GCS

I have a stream source, and using Spark Streaming I am writing the data as CSV to a GCS bucket.
I have created a table on top of that GCS location with the Hive metastore using Databricks.
As soon as new files are written to GCS I refresh the table using refresh table table_name, but the table data isn't refreshed.
Before creating the table there are 10 files on GCS with 150 records in total.
I create the table and it now has 150 records.
New files are added to the GCS path and the total file count is now, say, 20 with 300 total records.
After refreshing the table I still see 150 records, while I am expecting the table to have 300.
The workaround is to delete and recreate the table. Is there any other way to refresh the table without dropping it?
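For context, the refresh step described above boils down to something like this sketch (the table name is a placeholder; spark.catalog.refreshTable is the programmatic equivalent of the SQL command and only invalidates Spark's cached file listing and metadata for the table):

// Refresh the table's cached metadata/file listing, then re-read it.
spark.sql("REFRESH TABLE my_gcs_table")
// Equivalent catalog API call:
spark.catalog.refreshTable("my_gcs_table")
// Check whether the newly written files are picked up.
spark.table("my_gcs_table").count()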

Performance enhancement while inserting into an HBase table through a Hive external table

I have to insert a dataframe into an HBase table from PySpark. I create a Hive external table based on the HBase table. In the PySpark code, I create a temp view from the dataframe. After that, I select data from the temp view and insert it into the Hive external table, so the data ends up in the HBase table. Everything works fine; there is no problem in the process.
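In outline, the flow is something like this sketch (written in Scala here for consistency with the rest of this page; the table and view names are placeholders, and the PySpark version is analogous):

// Expose the dataframe as a temp view, then insert through the Hive
// external table that is mapped onto the HBase table.
df.createOrReplaceTempView("tmp_view")
spark.sql("INSERT INTO TABLE hbase_hive_table SELECT * FROM tmp_view")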
My problem is that the insert process is slower than we expected. I also tried the write method with a defined catalog, but the performance was much worse.
It takes 27 minutes to insert 470 MB of data (2,000 columns, 200,000 rows).
I am running the .py file via spark-submit.
Do you have any recommendations for inserting the data faster?

Create hive table with partitions from a Scala dataframe

I need a way to create a Hive table from a Scala dataframe. The Hive table should have its underlying files in ORC format in an S3 location, partitioned by date.
Here is what I have got so far:
I write the Scala dataframe to S3 in ORC format:
df.write.format("orc").partitionBy("date").save("S3Location")
I can see the ORC files in the S3 location.
I then create a Hive table on top of these ORC files:
CREATE EXTERNAL TABLE `tableName`(columnName string)
PARTITIONED BY (date string)
STORED AS ORC
LOCATION 'S3Location'
TBLPROPERTIES("orc.compress"="SNAPPY")
But the Hive table is empty, i.e.
spark.sql("select * from db.tableName") prints no results.
However, when I remove the PARTITIONED BY line:
CREATE EXTERNAL TABLE `tableName`(columnName string, date string)
STORED AS ORC
LOCATION 'S3Location'
TBLPROPERTIES("orc.compress"="SNAPPY")
I see results from the select query.
It seems that Hive does not recognize the partitions created by Spark. I am using Spark 2.2.0.
Any suggestions will be appreciated.
Update:
I am starting with a Spark dataframe and I just need a way to create a Hive table on top of it (with the underlying files in ORC format in an S3 location).
I think the partitions have not been added to the Hive metastore yet, so you only need to run this Hive command:
MSCK REPAIR TABLE table_name
If that does not work, you may need to check these points:
After writing data into S3, the folder structure should look like: s3://anypathyouwant/mytablefolder/transaction_date=2020-10-30
When creating the external table, the location should point to s3://anypathyouwant/mytablefolder
And yes, Spark writes data into S3 but does not add the partition definitions to the Hive metastore! Hive is not aware of written data unless it sits under a recognized partition.
So to check which partitions are in the Hive metastore, you can use this Hive command:
SHOW PARTITIONS tablename
In a production environment, I do not recommend using MSCK REPAIR TABLE for this purpose because it gets more and more time consuming as the table grows. The best way is to make your code add only the newly created partitions to your metastore, for example through a REST API.
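For example, registering a single new partition explicitly (the table name, partition value, and path below are placeholders following the layout above) can be done with ALTER TABLE ... ADD PARTITION, e.g. via spark.sql:

// Add one newly written partition to the metastore without scanning the table.
spark.sql(
  "ALTER TABLE mytable ADD IF NOT EXISTS " +
  "PARTITION (transaction_date='2020-10-30') " +
  "LOCATION 's3://anypathyouwant/mytablefolder/transaction_date=2020-10-30'")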

Hive create partitioned table based on Spark temporary table

I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table without writing the temp table out to a parquet location. What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I tried to search but still haven't been able to figure out how to do this from a Spark app, and how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it has to precede ...AS SELECT...; see the Spark SQL syntax.
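As a sketch, a partitioned CTAS with the clauses in that order could look like this (using a managed parquet datasource table rather than EXTERNAL is an assumption here; for datasource tables the partition column is listed by name only and must appear in the SELECT output):

spark.sql("""
  CREATE TABLE IF NOT EXISTS mydb.result
  USING parquet
  PARTITIONED BY (DATE_KEY)
  AS SELECT * FROM spark_tmp_view
""")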

PySpark - Saving Hive Table - org.apache.spark.SparkException: Cannot recognize hive type string

I am saving a Spark dataframe to a Hive table. The Spark dataframe holds a nested JSON data structure. I am able to save the dataframe as files, but it fails at the point where it creates a Hive table on top of them, saying
org.apache.spark.SparkException: Cannot recognize hive type string
I cannot create a Hive table schema first and then insert into it, since the dataframe consists of a couple hundred nested columns.
So I am saving it as:
df.write.partitionBy("dt","file_dt").saveAsTable("df")
I am not able to debug what the issue is.
The issue I was having was to do with a few columns which were named as numbers: "1", "2", "3". Removing such columns from the dataframe let me create the Hive table without any errors.
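A minimal sketch of that clean-up, assuming the offending names are top-level columns (written in Scala here for consistency with the rest of this page; the PySpark version uses df.drop the same way):

// Drop columns whose names are purely numeric, then save as before.
val numericNames = df.columns.filter(_.forall(_.isDigit))
val cleaned = df.drop(numericNames: _*)
cleaned.write.partitionBy("dt", "file_dt").saveAsTable("df")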