Create a table from pyspark code on top of parquet file - pyspark

I am writing data to a Parquet file using peopleDF.write.parquet("people.parquet") in PySpark. From the same code I now want to create a table on top of this Parquet file so that I can query it later. How can I do that?

You can use the saveAsTable method:
peopleDF.write.saveAsTable('people_table')
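A minimal end-to-end sketch, assuming a Hive-enabled SparkSession; the path passed to option("path", ...) is a hypothetical location, and adding it makes saveAsTable keep the Parquet files there while still registering a table you can query:

from pyspark.sql import SparkSession

# Hive support so the table is registered in a persistent metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

peopleDF = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Write Parquet files and register a queryable table in one step
(peopleDF.write
    .mode("overwrite")
    .format("parquet")
    .option("path", "/tmp/people.parquet")   # hypothetical location
    .saveAsTable("people_table"))

spark.sql("SELECT * FROM people_table").show()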

Alternatively, you can create an external table in Hive on top of the Parquet files, like this:
CREATE EXTERNAL TABLE my_table (
  col1 INT,
  col2 INT
) STORED AS PARQUET
LOCATION '/path/to/';
where /path/to/ is the absolute path to the files in HDFS.
If you want to use partitioning, you can add PARTITIONED BY (col3 INT). In that case, to see the data you have to run MSCK REPAIR TABLE so the partitions get registered in the metastore.
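The same DDL can also be issued from the PySpark code itself through spark.sql; a sketch, assuming a Hive-enabled SparkSession and reusing the placeholder HDFS location /path/to/ from above:

# Register an external Hive table over Parquet files written earlier with peopleDF.write.parquet(...)
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
        col1 INT,
        col2 INT
    )
    STORED AS PARQUET
    LOCATION '/path/to/'
""")

# Only needed if the table is declared with PARTITIONED BY:
# discovers the partition directories and registers them in the metastore
spark.sql("MSCK REPAIR TABLE my_table")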

Related

Databricks - How to change a partition of an existing Delta table via table path?

I want to use this syntax, which was posted in a similar thread, to update the partitions of a Delta table I have. The challenge is that the Delta table does not exist in Databricks as a registered table; we use Databricks to update the Delta table via Azure Data Factory.
How can I adjust the below syntax to update the partitions and overwrite the table via the table path?
Python:
input = spark.read.table("mytable")
(input.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .partitionBy("colB")  # different column
    .saveAsTable("mytable"))
SQL:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
I tried to adjust the above code but could not adjust it to use the delta table path.
If you only have the path, then you need to use the path-based functions:
for Python, you need to use .load for reading and .save for writing;
if you're using SQL, then you refer to the table by its path as follows:
delta.`path`
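A sketch of both variants against a path, assuming Databricks/Delta Lake as in the question, a hypothetical location /mnt/delta/mytable, and the colB / view_date column names reused from the snippets above:

# Python: read and write by path instead of by table name
input_df = spark.read.format("delta").load("/mnt/delta/mytable")

(input_df.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .partitionBy("colB")                 # different column, as in the question
    .save("/mnt/delta/mytable"))

# SQL: refer to the table through its path with delta.`path`
spark.sql("""
    REPLACE TABLE delta.`/mnt/delta/mytable`
    USING DELTA
    PARTITIONED BY (view_date)
    AS SELECT * FROM delta.`/mnt/delta/mytable`
""")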

Column mapping option argument is not supported for PARQUET based COPY

I have to insert Parquet file data into a Redshift table. The number of columns in the Parquet file might be less than in the Redshift table. I have used the below command.
COPY table_name
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
FORMAT AS PARQUET
But I get the below error when I run the COPY command.
Column mapping option argument is not supported for PARQUET based COPY
I tried to use the column mapping like
COPY table_name(column1, column2..)
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
But then I get a "Delimiter not found" error. If I specify FORMAT AS PARQUET in the above COPY command (which has the column list), then I again get "Column mapping option argument is not supported for PARQUET based COPY".
Could you please let me know how to resolve this.
Thanks
The number of columns in the Parquet file MUST match the table's columns; you can't use column mapping with Parquet files.
What you can do: create a staging table whose columns match the Parquet file, COPY the Parquet file content into it, and then run an insert into your final table, e.g. insert into final_table (col1, col2) select col1, col2 from stg_table
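A sketch of that staging-table flow, run here through psycopg2 as an assumed client; the table and column names (stg_table, final_table, column1, column2), the S3 path, and the IAM role are hypothetical placeholders, and the IAM_ROLE clause can be swapped for the ACCESS_KEY_ID / SECRET_ACCESS_KEY form used in the question:

import psycopg2

# Placeholder connection details -- replace with your cluster's values
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="..."
)
cur = conn.cursor()

# 1) Staging table with exactly the columns present in the Parquet file
cur.execute("CREATE TEMP TABLE stg_table (column1 INT, column2 VARCHAR(100));")

# 2) COPY the Parquet data into the staging table -- no column list allowed here
cur.execute("""
    COPY stg_table
    FROM 's3://my-bucket/path/to/parquet/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS PARQUET;
""")

# 3) Map the staged columns into the wider final table; unlisted columns get their defaults / NULL
cur.execute("INSERT INTO final_table (column1, column2) SELECT column1, column2 FROM stg_table;")

conn.commit()
cur.close()
conn.close()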

Load records to Existing table using SaveAsTable

Since the insertInto statement takes more time to run compared to saveAsTable, I want to use saveAsTable for the use case below.
I am using Spark version 2.2 and I would like to create my table with the below-mentioned structure at the beginning of my Spark code. At the end, I have my DataFrame (df_all_rec) ready, which I want to write into the test table using saveAsTable, partitioned by my_part and in text format.
create table test(a varchar(50),
b varchar(50))
partitioned by (my_part int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^'
NULL DEFINED AS ''
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Please suggest how to achieve that:
df_all_rec.write.mode("overwrite")
...?
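For completeness, a hedged sketch of what the saveAsTable call might look like, assuming Spark 2.2 with Hive support and that the Hive storage options fileFormat and fieldDelim are honored by the hive data source; the NULL DEFINED AS '' clause has no direct equivalent among these options, so pre-creating the table with the DDL above and appending to it may still be the safer route:

# Dynamic partitioning is usually required for a partitioned Hive write
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

(df_all_rec.write
    .mode("overwrite")
    .format("hive")                     # go through the Hive table provider
    .option("fileFormat", "textfile")   # store as plain text files
    .option("fieldDelim", "^")          # field delimiter from the DDL above
    .partitionBy("my_part")
    .saveAsTable("test"))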

Create hive table with partitions from a Scala dataframe

I need a way to create a Hive table from a Scala DataFrame. The Hive table should have underlying files in ORC format in an S3 location, partitioned by date.
Here is what I have got so far:
I write the scala dataframe to S3 in ORC format
df.write.format("orc").partitionBy("date").save("S3Location")
I can see the ORC files in the S3 location.
I create a hive table on the top of these ORC files now:
CREATE EXTERNAL TABLE "tableName"(columnName string)
PARTITIONED BY (date string)
STORED AS ORC
LOCATION "S3Location"
TBLPROPERTIES("orc.compress"="SNAPPY")
But the Hive table is empty, i.e.
spark.sql("select * from db.tableName") prints no results.
However, when I remove the PARTITIONED BY line:
CREATE EXTERNAL TABLE "tableName"(columnName string, date string)
STORED AS ORC
LOCATION "S3Location"
TBLPROPERTIES("orc.compress"="SNAPPY")
I see results from the select query.
It seems that Hive does not recognize the partitions created by Spark. I am using Spark 2.2.0.
Any suggestions will be appreciated.
Update:
I am starting with a Spark DataFrame and I just need a way to create a Hive table on top of it (with the underlying files in ORC format in an S3 location).
I think the partitions have not been added to the Hive metastore yet, so you only need to run this Hive command:
MSCK REPAIR TABLE table_name
If that does not work, you may need to check these points:
after writing data into S3, the folder structure should look like s3://anypathyouwant/mytablefolder/transaction_date=2020-10-30
when creating the external table, the location should point to s3://anypathyouwant/mytablefolder
And yes, Spark writes data into S3 but does not add the partition definitions to the Hive metastore! Hive is not aware of written data unless it sits under a recognized partition.
So to check which partitions are in the Hive metastore, you can use this Hive command:
SHOW PARTITIONS tablename
In a production environment, I do not recommend using MSCK REPAIR TABLE for this purpose because it becomes more and more time-consuming over time. The best way is to make your code add only the newly created partitions to your metastore through its API.
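Both repair paths can be run from the Spark session itself; a sketch, reusing the tableName and the example transaction_date folder from this thread as placeholders:

# Small tables / one-off fix: let the metastore discover every partition folder
spark.sql("MSCK REPAIR TABLE tableName")

# Production-friendly alternative: register only the partition you just wrote
spark.sql("""
    ALTER TABLE tableName
    ADD IF NOT EXISTS PARTITION (transaction_date='2020-10-30')
    LOCATION 's3://anypathyouwant/mytablefolder/transaction_date=2020-10-30'
""")

# Verify what the metastore now knows about
spark.sql("SHOW PARTITIONS tableName").show(truncate=False)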

Hive create partitioned table based on Spark temporary table

I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table from it (without writing the temp table out to a Parquet location first). What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I tried to search but still haven't been able to figure out how to do this from a Spark app, and how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it must precede ...AS SELECT...; see the Spark SQL syntax.
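A sketch of the corrected statement, assuming a Hive-enabled SparkSession; note that CREATE EXTERNAL TABLE ... AS SELECT is generally not allowed in Spark SQL, so this creates a managed table (add a LOCATION clause to the USING syntax if the files must live at a specific path), and in a CTAS the partition column is listed by name only, without a type:

# PARTITIONED BY precedes AS SELECT, and DATE_KEY must be a column of the SELECT output
spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.result
    USING PARQUET
    PARTITIONED BY (DATE_KEY)
    AS SELECT * FROM spark_tmp_view
""")

# Subsequent loads: append mode resolves columns by name, so new partitions can be added later
spark.table("spark_tmp_view").write.mode("append").saveAsTable("mydb.result")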