Since the insertInto statement takes more time to run than saveAsTable, I want to use saveAsTable for the use case below.
I am using Spark version 2.2 and I would like to create my table with the structure mentioned below at the beginning of my Spark code. At the end I have my DataFrame (df_all_rec) ready, which I want to write into the test table using saveAsTable, partitioned on my_part and stored as text format.
create table test(a varchar(50),
b varchar(50))
partitioned by (my_part int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^'
NULL DEFINED AS ''
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Please suggest how to achieve this:
df_all_rec.write.mode("overwrite")
...?
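For what it's worth, here is a minimal sketch of one way this is sometimes attempted, not a definitive answer: it assumes Spark has Hive support enabled and that the Hive-format writer options fileFormat and fieldDelim are honoured by saveAsTable in 2.2.
# write as a Hive text-format table with '^' as the field delimiter,
# partitioned on my_part (options assumed to be passed through as storage properties)
df_all_rec.write \
    .mode("overwrite") \
    .format("hive") \
    .option("fileFormat", "textfile") \
    .option("fieldDelim", "^") \
    .partitionBy("my_part") \
    .saveAsTable("test")
If those options are ignored in your build, the fallback is to create the table up front with the DDL above and use insertInto for the writes.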
Related
Working in Python 3.7 and pyspark 3.2.0, we're writing out a PySpark dataframe to a Snowflake table using the following statement, where mode usually equals 'append'.
df.write \
.format('snowflake') \
.option('dbtable', table_name) \
.options(**sf_configs) \
.mode(mode)\
.save()
We've made the surprising discovery that this write can be insensitive to underscores in column names -- specifically, a dataframe with the column "RUN_ID" is successfully written out to a table with the column "RUNID" in Snowflake, with the column mapping accordingly. We're curious why this is so (I'm in particular wondering if the pathway runs through a LIKE statement somewhere, or if there's something interesting in the Snowflake table definition) and looking for documentation of this behavior (assuming that it's a feature, not a bug.)
According to the docs, the Snowflake connector defaults to mapping by column order instead of by name; see the column_mapping parameter:
The connector must map columns from the Spark data frame to the Snowflake table. This can be done based on column names (regardless of order), or based on column order (i.e. the first column in the data frame is mapped to the first column in the table, regardless of column name).
By default, the mapping is done based on order. You can override that by setting this parameter to name, which tells the connector to map columns based on column names. (The name mapping is case-insensitive.)
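If name-based mapping is what you want, it can be requested explicitly on the write. A sketch reusing table_name, sf_configs and mode from the question:
# identical to the original write, plus an explicit column_mapping option
df.write \
    .format('snowflake') \
    .option('dbtable', table_name) \
    .options(**sf_configs) \
    .option('column_mapping', 'name') \
    .mode(mode) \
    .save()
With name mapping, a column the connector cannot match by name (such as the RUN_ID / RUNID pair above) should surface as an error rather than being silently mapped by position.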
I need to load data with a default value column into Redshift, as outlined in the AWS docs.
Unfortunately the COPY command doesn't allow loading data with default values from a parquet file, so I need to find a different way to do that.
My table requires a column with the getdate function from Redshift:
LOAD_DT TIMESTAMP DEFAULT GETDATE()
If I use the COPY command and add the column names as arguments I get the error:
Column mapping option argument is not supported for PARQUET based COPY
What is a workaround for this?
Can you post a reference for Redshift not supporting default values for a Parquet COPY? I haven't heard of this restriction.
As for workarounds, I can think of two.
Copy the file to a temp table and then insert from this temp table into your table with the default value.
Define an external table that uses the parquet file as its source and insert from that table into the table with the default value.
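A rough sketch of the first workaround (all names, paths and credentials here are hypothetical placeholders; any SQL client works, psycopg2 is just used for illustration). The staging table has only the columns present in the parquet file, and LOAD_DT is filled in by its DEFAULT on the final insert:
import psycopg2

# hypothetical connection details
conn = psycopg2.connect(host='my-cluster.example.com', port=5439,
                        dbname='mydb', user='my_user', password='my_password')
cur = conn.cursor()

# 1. stage the parquet data into a temp table that has only the file's columns
cur.execute("""
    CREATE TEMP TABLE my_table_stage (a VARCHAR(50), b VARCHAR(50));
    COPY my_table_stage FROM 's3://my-bucket/my-prefix/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-copy-role'
    FORMAT AS PARQUET;
""")

# 2. insert into the real table; LOAD_DT is omitted so DEFAULT GETDATE() applies
cur.execute("INSERT INTO my_table (a, b) SELECT a, b FROM my_table_stage;")
conn.commit()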
I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table from it (without writing the temp table out to a parquet location). What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I have searched but still haven't been able to figure out how to do this from a Spark app, or how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it should precede ...AS SELECT...; see the Spark SQL syntax.
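A sketch of the reordered statement, using a parquet data-source table since EXTERNAL would additionally require a LOCATION clause (names taken from the question):
spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.result
    USING PARQUET
    PARTITIONED BY (DATE_KEY)
    AS SELECT * FROM spark_tmp_view
""")
Because this is a CTAS, the data from spark_tmp_view is written into the new table as part of the same statement; subsequent loads can then use INSERT INTO mydb.result SELECT ... from Spark SQL.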
I have a scenario and would like to get an expert opinion on it.
I have to load a Hive table in partitions from a relational DB via Spark (Python). I cannot create the Hive table up front, as I am not sure how many columns there are in the source and they might change in the future, so I have to fetch the data using select * from tablename.
However, I am sure of the partition column and know that it will not change. This column is of the "date" datatype in the source DB.
I am using SaveAsTable with partitionBy options and I am able to properly create folders as per the partition column. The hive table is also getting created.
The issue I am facing is that the partition column is of the "date" data type, which is not supported in Hive for partitions. Because of this I am unable to read the data via Hive or Impala queries, as they complain that date is not supported as a partition column.
Please note that I cannot typecast the column at the time of issuing the select statement as I have to do a select * from tablename, and not select a,b,cast(c) as varchar from table.
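One hedged idea, not tested against this exact setup: the cast does not have to happen in the source-side SELECT. Once the frame is loaded with select *, the single known partition column can be recast on the Spark side before saveAsTable, without naming any of the other columns (jdbc_url, jdbc_props, part_col and mydb.mytable below are placeholder names):
from pyspark.sql.functions import col, date_format

# still a plain select * against the source table
df = spark.read.jdbc(url=jdbc_url, table="tablename", properties=jdbc_props)

# recast only the known partition column to a string; every other column passes through untouched
df = df.withColumn("part_col", date_format(col("part_col"), "yyyy-MM-dd"))

df.write.mode("overwrite").partitionBy("part_col").saveAsTable("mydb.mytable")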
I am writing data out in the parquet format using peopleDF.write.parquet("people.parquet") in my PySpark code. Now, from the same code, I want to create a table on top of this parquet file so that I can query it later. How can I do that?
You can use the saveAsTable method:
peopleDF.write.saveAsTable('people_table')
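saveAsTable registers the table in the metastore, so it can be queried right away from the same session (or a later one):
spark.sql("SELECT * FROM people_table").show()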
You have to create an external table in Hive like this:
CREATE EXTERNAL TABLE my_table (
col1 INT,
col2 INT
) STORED AS PARQUET
LOCATION '/path/to/';
where /path/to/ is the absolute path to the files in HDFS.
If you want to use partitioning, you can add PARTITIONED BY (col3 INT). In that case, to see the data, you have to run a repair so the partitions get registered.
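For the partitioned case, a sketch run through Spark SQL (my_table_part is just an illustrative name; it assumes the files under /path/to/ are laid out in col3=<value>/ subdirectories):
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_table_part (
        col1 INT,
        col2 INT
    )
    PARTITIONED BY (col3 INT)
    STORED AS PARQUET
    LOCATION '/path/to/'
""")

# register the existing col3=.../ directories as partitions in the metastore
spark.sql("MSCK REPAIR TABLE my_table_part")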