External table not pulling new data Spark Databricks GCS - pyspark

I have a stream source, and with Spark Structured Streaming I am writing the data as CSV to a GCS bucket.
I have created a table on top of that GCS location in the Hive metastore using Databricks.
As soon as new files are written to GCS I refresh the table with REFRESH TABLE table_name, but the table data isn't refreshed.
Before creating the table there are 10 files on GCS with 150 records in total.
I create a table, and the table now has 150 records.
New files are added to the GCS path, and the total file count is now, say, 20 with 300 records in total.
After refreshing the table I still see 150 records, while I expect the table to contain 300.
The workaround is to drop and recreate the table. Is there any other way to refresh the table without dropping it?
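For reference, a minimal sketch of the refresh statements involved (the table name is hypothetical; in a live Databricks session each statement would be passed to spark.sql):

```python
# Sketch of the refresh statements involved (table name is hypothetical).
# REFRESH TABLE invalidates Spark's cached listing of the files under the
# table's location; for a partitioned table, MSCK REPAIR TABLE would also
# be needed so that newly created partition directories get registered.
table = "events_csv"  # hypothetical table name

refresh_stmt = f"REFRESH TABLE {table}"
repair_stmt = f"MSCK REPAIR TABLE {table}"

# In a real Databricks notebook:
#   spark.sql(refresh_stmt)
#   spark.sql(repair_stmt)
print(refresh_stmt)
print(repair_stmt)
```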

Related

Performance Enhancement while inserting HBase table through hive external table

I have to insert a dataframe into an HBase table from PySpark. I created a Hive external table on top of the HBase table. In the PySpark code, I create a temp view from the dataframe, then select from that temp view and insert into the Hive external table, and the data lands in the HBase table. Everything works; there is no problem in the process.
My problem is that the insert is slower than we expected. I also tried the write method with a defined catalog, but the performance was much worse.
It takes 27 minutes to insert 470 MB of data (2,000 columns, 200,000 rows).
I am running the .py file via spark-submit.
Do you have any recommendations to insert data faster?
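The insert path described above might look like the following sketch (the view and table names are hypothetical placeholders, not from the original code):

```python
# Sketch of the insert path described above (all names hypothetical).
# The dataframe is registered as a temp view, then an INSERT ... SELECT
# pushes it through the Hive external table into HBase.
view_name = "tmp_view"           # hypothetical temp view name
hive_table = "hive_hbase_table"  # hypothetical Hive external table

insert_stmt = f"INSERT INTO TABLE {hive_table} SELECT * FROM {view_name}"

# In a real PySpark job:
#   df.createOrReplaceTempView(view_name)
#   spark.sql(insert_stmt)
print(insert_stmt)
```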

Create hive table with partitions from a Scala dataframe

I need a way to create a Hive table from a Scala dataframe. The Hive table should have underlying files in ORC format in an S3 location, partitioned by date.
Here is what I have got so far:
I write the Scala dataframe to S3 in ORC format:
df.write.format("orc").partitionBy("date").save("S3Location")
I can see the ORC files in the S3 location.
I then create a Hive table on top of these ORC files:
CREATE EXTERNAL TABLE tableName (columnName string)
PARTITIONED BY (date string)
STORED AS ORC
LOCATION 'S3Location'
TBLPROPERTIES ("orc.compress"="SNAPPY")
But the hive table is empty, i.e.
spark.sql("select * from db.tableName") prints no results.
However, when I remove the PARTITIONED BY line:
CREATE EXTERNAL TABLE tableName (columnName string, date string)
STORED AS ORC
LOCATION 'S3Location'
TBLPROPERTIES ("orc.compress"="SNAPPY")
I see results from the select query.
It seems that Hive does not recognize the partitions created by Spark. I am using Spark 2.2.0.
Any suggestions will be appreciated.
Update:
I am starting with a Spark dataframe, and I just need a way to create a Hive table on top of it (with the underlying files in ORC format in an S3 location).
I think the partitions have not been added to the Hive metastore yet, so you only need to run this Hive command:
MSCK REPAIR TABLE table_name
If that does not work, you may need to check these points:
After writing data into S3, the folder layout should look like: s3://anypathyouwant/mytablefolder/transaction_date=2020-10-30
When creating the external table, the location should point to s3://anypathyouwant/mytablefolder
And yes, Spark writes data into S3 but does not add the partition definitions to the Hive metastore! Hive is not aware of written data unless it is under a recognized partition.
So to check which partitions are in the Hive metastore, you can use this Hive command:
SHOW PARTITIONS tablename
In a production environment, I do not recommend using MSCK REPAIR TABLE for this purpose, because it becomes more and more time-consuming as partitions accumulate. The best way is to make your code add only the newly created partitions to the metastore.
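One common way to register a single new partition explicitly (rather than rescanning everything with MSCK REPAIR TABLE) is ALTER TABLE ... ADD PARTITION. A hedged sketch, with a hypothetical table name and the example path from above; a Spark-style partition directory ends in column=value:

```python
# Hedged sketch: registering one new partition explicitly instead of
# running MSCK REPAIR TABLE. Table name and path are hypothetical; a
# Spark-style partition directory ends in <column>=<value>.
base = "s3://anypathyouwant/mytablefolder"
partition_dir = f"{base}/transaction_date=2020-10-30"

# Parse "column=value" from the last path segment.
column, value = partition_dir.rsplit("/", 1)[1].split("=", 1)

add_partition_stmt = (
    f"ALTER TABLE mytable ADD IF NOT EXISTS "
    f"PARTITION ({column}='{value}') LOCATION '{partition_dir}'"
)

# In a live session: spark.sql(add_partition_stmt)
print(add_partition_stmt)
```

Running this per newly written partition keeps the metastore in sync at constant cost, instead of paying for a full directory scan on every refresh.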

Purging of transactional data in DB2

We have an existing table of more than 130 TB from which we have to delete records in DB2. Using a DELETE statement would hang the system. One option is to partition the table by month and year and then drop the partitions one by one using TRUNCATE or DROP. I am looking for a script that can create the partitions and subsequently drop them.
You can't partition the data within an existing table. You would need to move the data to a new range-partitioned table.
If you are using Db2 LUW, and depending on your specific requirements, consider using ADMIN_MOVE_TABLE to move your data to a new table while keeping the table "on-line".
ADMIN_MOVE_TABLE can add range partitioning and/or multidimensional clustering on the new table during the move.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0055069.html
Still, a 130 TB table is very large, and you would be well advised to be careful in planning and testing such a move.
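Since the question asks for a script, here is a hedged sketch of generating the monthly range-partition clauses for the new table's DDL. All table and column names are hypothetical, and the STARTING/ENDING syntax follows Db2 LUW range partitioning; each month becomes one partition that can later be detached or dropped:

```python
# Hedged sketch: generate monthly range-partition clauses for a new
# Db2 LUW range-partitioned table (table/column names are hypothetical).
# Each month becomes one partition that can later be detached/dropped.
from datetime import date

def month_partitions(start_year, start_month, count):
    """Return (name, start_date, end_exclusive) for `count` months."""
    parts = []
    y, m = start_year, start_month
    for _ in range(count):
        ny, nm = (y + 1, 1) if m == 12 else (y, m + 1)
        parts.append((f"p{y}_{m:02d}", date(y, m, 1), date(ny, nm, 1)))
        y, m = ny, nm
    return parts

clauses = [
    f"PARTITION {name} STARTING ('{start}') ENDING ('{end}') EXCLUSIVE"
    for name, start, end in month_partitions(2023, 11, 3)
]
ddl = ("CREATE TABLE txn_new (txn_date DATE, payload VARCHAR(100))\n"
       "  PARTITION BY RANGE (txn_date)\n  ("
       + ",\n   ".join(clauses) + ")")
print(ddl)
```

Dropping a month of data then becomes a cheap ALTER TABLE ... DETACH PARTITION on the corresponding partition, instead of a massive DELETE.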

Data replication from mysql to Redshift

I would like to load data from MySQL into Redshift.
My data values can change at any time, so I need to capture both the old and the new records in Redshift.
Modified records need to be archived; only the current records should be reflected in Redshift.
For example:
MySQL table:
ID NAME SAL
-- ---- -----
1 XYZ 10000
2 ABC 20000
First load into Redshift (this should be the same as the MySQL table):
ID NAME SAL
-- ---- ----
1 XYZ 10000
2 ABC 20000
For the second load (I changed the salary of employee 'XYZ' from 10000 to 30000):
ID NAME SAL
-- ---- ----
1 XYZ 30000
2 ABC 20000
The above table should be reflected in Redshift, and the modified record (1 XYZ 10000) should be archived.
Is this possible?
How many rows are you expecting?
One approach would be to add a timestamp column which gets updated to current time whenever a record is modified.
Then, with an external process doing a replication run, you could get the max timestamp from Redshift, select any records from MySQL that are newer than that timestamp, and, if you use the COPY method to load into Redshift, dump them to S3.
To load new records and archive old ones, you'll need a variation of the Redshift upsert pattern. This involves loading into a temporary table, identifying records in the original table to be archived, moving those to an archive table or UNLOADing them to an S3 archive, and then using ALTER TABLE APPEND to move the new records into the main table.
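The upsert-and-archive flow above can be simulated in plain Python to show which rows end up where. This is only an illustration of the logic; the real implementation would use a Redshift staging table, UNLOAD, and ALTER TABLE APPEND:

```python
# Pure-Python simulation of the upsert/archive pattern described above.
# In Redshift this would be a staging table plus UNLOAD / ALTER TABLE
# APPEND; here plain dicts keyed by ID stand in for the tables.
main = {1: ("XYZ", 10000), 2: ("ABC", 20000)}   # current Redshift table
staged = {1: ("XYZ", 30000)}                     # changed rows from MySQL

archive = {}
for key, new_row in staged.items():
    old_row = main.get(key)
    if old_row is not None and old_row != new_row:
        archive[key] = old_row      # old version goes to the archive
    main[key] = new_row             # new version replaces it in main

print(main)     # {1: ('XYZ', 30000), 2: ('ABC', 20000)}
print(archive)  # {1: ('XYZ', 10000)}
```

This matches the example tables above: after the second load, the main table shows the 30000 salary and the archive holds the superseded (1 XYZ 10000) row.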
See this site https://flydata.com/blog/mysql-to-amazon-redshift-replication.
A better option is Change Data Capture (CDC). CDC is a technique that captures changes made to data in MySQL and applies them to the destination Redshift table. It's similar to the technique mentioned by systemjack, except that it only imports changed data, not the entire database.
To use the CDC method with a MySQL database, you must use the binary log (binlog). The binlog lets you capture change data as a stream, enabling near-real-time replication.
The binlog captures not only data changes (INSERT, UPDATE, DELETE) but also table schema changes such as ADD/DROP COLUMN. It also ensures that rows deleted from MySQL are deleted in Redshift as well.
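A toy simulation of applying a binlog-style change stream to a replica illustrates the behavior, including delete propagation. The event format here is made up for illustration; real CDC would decode MySQL's binlog (typically in ROW format) and translate the events into Redshift statements:

```python
# Toy simulation of applying a binlog-style change stream to a replica.
# The event format is hypothetical; real CDC decodes MySQL's binlog and
# turns events into Redshift statements. A dict stands in for the replica.
events = [
    {"op": "INSERT", "id": 3, "row": ("DEF", 15000)},
    {"op": "UPDATE", "id": 1, "row": ("XYZ", 30000)},
    {"op": "DELETE", "id": 2},
]

replica = {1: ("XYZ", 10000), 2: ("ABC", 20000)}

for ev in events:
    if ev["op"] == "DELETE":
        replica.pop(ev["id"], None)   # deletes propagate to the replica
    else:                             # INSERT and UPDATE both upsert
        replica[ev["id"]] = ev["row"]

print(replica)  # {1: ('XYZ', 30000), 3: ('DEF', 15000)}
```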

How can I update or delete records of hive table from spark , with out loading entire table in to dataframe?

I have a Hive ORC table with around 2 million records. Currently, to update or delete records, I load the entire table into a dataframe, apply the update, and save the result as a new dataframe in Overwrite mode (the command is below). So to update a single record, do I need to load and process the entire table's data?
I'm unable to do objHiveContext.sql("update myTable set columnName='' ")
I'm using Spark 1.4.1, Hive 1.2.1
myData.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("myTable"), where myData is the updated dataframe.
How can I avoid loading the entire 2-3 million records just to update a single record of the Hive table?
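One hedged sketch of a common workaround (assuming the table can be partitioned, which the question does not state): rewrite only the partition that contains the record, leaving all other partitions untouched. A pure-Python toy model of that idea, with hypothetical partition keys and values:

```python
# Toy simulation of a partition-scoped rewrite: instead of reloading all
# rows, only the partition containing the record to update is rewritten.
# In Spark this corresponds to overwriting a single partition directory
# rather than the whole table. All names and values are hypothetical.
table = {
    "2015-01": [(1, "a"), (2, "b")],
    "2015-02": [(3, "c"), (4, "d")],
}

def update_record(table, part, rec_id, new_val):
    # Rewrite only the affected partition; others are left untouched.
    table[part] = [(i, new_val if i == rec_id else v)
                   for i, v in table[part]]

update_record(table, "2015-02", 3, "z")
print(table["2015-02"])  # [(3, 'z'), (4, 'd')]
print(table["2015-01"])  # unchanged: [(1, 'a'), (2, 'b')]
```

With this layout, an update touches one partition's worth of data instead of the full 2-3 million records.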