Write only a few columns of a PySpark DataFrame to a Snowflake table - pyspark

I have a PySpark DataFrame as below:
>>> l = [('XYZ', '324 NW', 'VA'), ('ABC', '323 NW', 'VA'), ('CVB', '314 NW', 'VA')]
>>> df = spark.createDataFrame(l, ('Name', 'Address', 'State'))
Name  Address  State
XYZ   324 NW   VA
ABC   323 NW   VA
CVB   314 NW   VA
The target Snowflake table is defined as:
CREATE OR REPLACE TABLE TEST
(
  NAME VARCHAR(50),
  ADDRESS VARCHAR(50),
  STATE VARCHAR(50)
);
When I write the entire PySpark DataFrame to the TEST table using the PySpark code below, it works perfectly fine:
df.write.format(SNOWFLAKE_SOURCE_NAME).options(**snowflake_options).option("dbtable", "TEST").mode("append").save()
How do I write only the 'Address' and 'State' columns to the Snowflake table, ignoring the 'Name' column?
I also tried the code below, but it keeps giving me an invalid syntax error:
df.write.format(SNOWFLAKE_SOURCE_NAME).options(**snowflake_options)
.option("dbtable", "TEST")
.option("columnmap", Map("Address" -> "ADDRESS", "State" -> "STATE").toString())
.mode("append").save()

Just select the two columns you want to save:
(df
.select('Address', 'State') ## add this
.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(**snowflake_options)
.option('dbtable', 'TEST')
.mode('append')
.save()
)
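If you specifically need the connector's column mapping rather than a select, note that in PySpark the columnmap option has to be passed as a plain string, since there is no Scala Map object to call toString() on. A minimal sketch, assuming the same SNOWFLAKE_SOURCE_NAME and snowflake_options as above; the "Map(... -> ...)" string format is an assumption based on the Scala Map.toString() output, so verify it against your connector version's documentation:
(df
    .select('Address', 'State')  # still write only the two columns
    .write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**snowflake_options)
    .option('dbtable', 'TEST')
    # assumed string form of the column mapping; check your connector version
    .option('columnmap', 'Map(Address -> ADDRESS, State -> STATE)')
    .mode('append')
    .save()
)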

Related

Are the bucket hash algorithms of tez and MR different?

I'm using Hive 3.1.2 and tried to create a bucketed table with bucketing_version=2.
When I created the buckets and checked the bucket files using hdfs dfs -cat, I could see that the hashing results differed between the two engines.
Are the hash algorithms of Tez and MR different? Shouldn't they be the same if bucketing_version=2?
Here's the test method and its results.
1. Create Bucket table & Data table
CREATE EXTERNAL TABLE `bucket_test`(
`id` int COMMENT ' ',
`name` string COMMENT ' ',
`age` int COMMENT ' ',
`phone` string COMMENT ' ')
CLUSTERED BY (id, name, age) SORTED BY(phone) INTO 2 Buckets
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES (
'bucketing_version'='2',
'orc.compress'='ZLIB');
CREATE TABLE data_table (id int, name string, age int, phone string)
row format delimited fields terminated by ',';
2. Insert data into DATA_TABLE
INSERT INTO TABLE data_table
select stack
( 20
,1, 'a', 11, '111'
,1,'a',11,'222'
,3,'b',14,'333'
,3,'b',13,'444'
,5,'c',18,'555'
,5,'c',18,'666'
,5,'c',21,'777'
,8,'d',23,'888'
,9,'d',24,'999'
,10,'d',26,'1110'
,11,'d',27,'1112'
,12,'e',28,'1113'
,13,'f',28,'1114'
,14,'g',30,'1115'
,15,'q',31,'1116'
,16,'w',32,'1117'
,17,'e',33,'1118'
,18,'r',34,'1119'
,19,'t',36,'1120'
,20,'y',36,'1130')
3. Create Bucket with MR
set hive.enforce.bucketing = true;
set hive.execution.engine = mr;
set mapreduce.job.queuename=root.test;
Insert overwrite table bucket_test
select * from data_table ;
4. Check Bucket contents
# bucket0 : 6 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000000_0
10d261110
11d271112
18r341119
3b13444
5c18555
5c18666
# bucket1 : 14 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000001_0
1a11111
12e281113
13f281114
14g301115
15q311116
16w321117
17e331118
19t361120
20y361130
1a11222
3b14333
5c21777
8d23888
9d24999
5. Create Bucket with Tez
set hive.enforce.bucketing = true;
set hive.execution.engine = tez;
set tez.queue.name=root.test;
Insert overwrite table bucket_test
select * from data_table ;
6. Check Bucket contents
# bucket0 : 11 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000000_0
1a11111
10d261110
11d271112
13f281114
16w321117
17e331118
18r341119
20y361130
1a11222
5c18555
5c18666
# bucket1 : 9 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000001_0
12e281113
14g301115
15q311116
19t361120
3b14333
3b13444
5c21777
8d23888
9d24999
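For context, a row's bucket is determined by hashing the clustering columns (id, name, age here) and taking that hash modulo the number of buckets, so two engines only produce identical bucket files if their hash functions behave identically. A rough conceptual sketch in Python, not Hive's actual implementation (Hive's bucketing_version=2 hashing is done in Java with a Murmur3-based function):
# Conceptual sketch only; this just illustrates bucket = hash mod num_buckets.
def bucket_id(clustering_key_hash: int, num_buckets: int) -> int:
    # Mask to a non-negative value before the modulo.
    return (clustering_key_hash & 0x7FFFFFFF) % num_buckets

# Python's hash() is only a stand-in here; the question is whether Tez and MR
# compute the same hash for the same (id, name, age) values.
print(bucket_id(hash((1, 'a', 11)), 2))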

PySpark best way to filter df based on columns from different df's

I have a DataFrame A_DF which has, among others, two columns, say COND_B and COND_C. Then I have two different DataFrames: B_DF with a COND_B column and C_DF with a COND_C column.
Now I would like to filter A_DF where the value matches in one OR the other. Something like:
df = A_DF.filter((A_DF.COND_B == B_DF.COND_B) | (A_DF.COND_C == C_DF.COND_C))
But I found out it is not possible this way.
EDIT
error: Attribute CON_B#264,COND_C#6 is missing from the schema: [... COND_B#532, COND_C#541 ]. Attribute(s) with the same name appear in the operation: COND_B,COND_C. Please check if the right attribute(s) are used.; it looks like I can filter only within the same DataFrame because of the #number added on the fly.
So I first tried to build a list from B_DF and C_DF and filter based on that, but it was too expensive to use collect() on 100M records.
So I tried:
AB_DF = A_DF.join(B_DF, 'COND_B', 'left_semi')
AC_DF = A_DF.join(C_DF, 'COND_C', 'left_semi')
df = AB_DF.unionAll(AC_DF).dropDuplicates()
I used dropDuplicates() to remove duplicate records where both conditions were true, but even with that I got some unexpected results.
Is there some other, smoother way to do this simply? Something like an EXISTS statement in SQL?
EDIT2
I tried SQL based on @mck's response:
e.createOrReplaceTempView('E')
b.createOrReplaceTempView('B')
p.createOrReplaceTempView('P')
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
my_output.write_dataframe(df)
with error:
Traceback (most recent call last):
File "/myproject/abc.py", line 45, in my_compute_function
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
TypeError: sql() missing 1 required positional argument: 'sqlQuery'
Thanks a lot!
Your idea of using exists should work. You can do:
A_DF.createOrReplaceTempView('A')
B_DF.createOrReplaceTempView('B')
C_DF.createOrReplaceTempView('C')
df = spark.sql("""
select * from A
where exists (select 1 from B where A.COND_B = B.COND_B)
or exists (select 1 from C where A.COND_C = C.COND_C)
""")

AnalysisException: cannot resolve given input columns:

I am running into this error when trying to select a couple of columns from a temporary table.
pd_df = pd.read_sql('select * from abc.cars limit 10', conn)
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
df.show()
print('***************')
print("Reading from tmp table")
data = spark.sql('select location_id from cars_tmp')
data.show()
AnalysisException: cannot resolve '`location_id`' given input columns: [cars_tmp.abc.product_id, cars_tmp.abc.location_id ...]
When I select all the columns I get the results. So this is successful:
data = spark.sql('select * from cars_tmp')
data.show()
I tried the queries below, but they fail with the same error as well:
data = spark.sql('select cars_tmp.abc.location_id from cars_tmp')
data.show()
data = spark.sql('select cars_tmp.location_id from cars_tmp')
data.show()
data = spark.sql('select abc.location_id from cars_tmp')
data.show()
I am running these in Databricks.
Databricks runtime version: 7.0
Apache Spark version: 3.0
scala: 2.12
or "spark_version": "7.0.x-scala2.12",
Any help will be highly appreciated.
Thanks
The column name does not exist in the table. select * from cars_tmp works because you do not specify the column name.
Please see this answer https://stackoverflow.com/a/64042756/8913402 with the same error handling.
I resolved the issue by adding each column in the pandas select query. So something like this:
pd_df = pd.read_sql('select id, location_id, product_id from abc.cars limit 10', conn)
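Alternatively, the error message suggests the pandas columns came through literally named with an abc. prefix (e.g. abc.location_id), so stripping that prefix before creating the Spark DataFrame should also work. A hedged sketch, assuming pd_df, conn and spark already exist as in the question:
# Assumption: the pandas columns are literally named like 'abc.location_id';
# drop everything before the last dot so Spark does not treat it as a qualifier.
pd_df.columns = [c.split('.')[-1] for c in pd_df.columns]

df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
spark.sql("select location_id from cars_tmp").show()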

Spark DataFrame - Read pipe delimited file using SQL?

Based on Spark - load CSV file as DataFrame?
Is it possible to specify options using SQL to set the delimiter, null character, and quote?
val df = spark.sql("SELECT * FROM csv.`csv/file/path/in/hdfs`")
I know it can be done using spark.read.format("csv").option("delimiter", "|"), but ideally I wouldn't have to.
Updated Information
It seems that I have to pass the path using back-ticks.
When I attempt to pass OPTIONS, I get:
== SQL ==
SELECT * FROM
csv.`csv/file/path/in/hdfs` OPTIONS (delimiter , "|" )
-----------------------------------^^^
Error in query:
mismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER',
'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL',
'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS',
'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}
Although not a one-line solution, the following might work for you:
spark.sql("CREATE TABLE some_table USING com.databricks.spark.csv OPTIONS (path \"csv/file/path/in/hdfs\", delimiter \"|\")");
val df = spark.sql("SELECT * FROM some_table");
Of course, you can skip the second step of loading into a DataFrame if you want to perform some SQL operation directly on some_table.
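On more recent Spark versions the built-in csv source can be used the same way, without the external spark-csv package. A hedged PySpark sketch reusing the path from the question; adjust header/inferSchema for your file:
# Sketch: register the pipe-delimited file as a table via the built-in csv
# source (Spark 2.x+), then query it with plain SQL.
spark.sql("""
    CREATE TABLE some_table
    USING csv
    OPTIONS (path 'csv/file/path/in/hdfs', delimiter '|', header 'false')
""")
spark.sql("SELECT * FROM some_table").show()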

How to pass where clause as variable to the query in Spark SQL?

Sample Table:
+---------------------------------------------------------++------+
| Name_Age || ID |
+---------------------------------------------------------++------+
|"SUBHAJIT SEN":28,"BINOY MONDAL":26,"SHANTANU DUTTA":35 || 15 |
|"GOBINATHAN SP":35,"HARSH GUPTA":27,"RAHUL ANAND":26 || 16 |
+---------------------------------------------------------++------+
How to pass WHERE clause as variable to the query?
My desired query is Select Name_Age from table where ID=15, so the WHERE variable is ID=15.
If the data is already registered as a table (a Hive table, or after calling registerTempTable on a DataFrame), you can use SQLContext.sql:
val whereClause: String = "ID=15"
sqlContext.sql("Select Name_Age from table where " + whereClause)
If you have a df: DataFrame object you want to query:
// using a string filter:
df.filter(whereClause).select("Name_Age")
The code below can get you your answer:
spark.sql("""Select Name_Age from table where ID='15'""")