save a pyspark dataframe inside a function - pyspark

I am trying to save a PySpark dataframe into an HDFS folder. This code works fine outside the function, but once I put it inside a function I get errors. It is probably a case of how I am referencing the function arguments. Thanks for the help.
def save_file(df):
    start_time = time.time()
    df.createOrReplaceTempView("df")
    hc.sql("create table hdfs_folder.{} as select * from {}".format(df, df))
    print("{} saved in hdfs_folder".format(df))
    print("**********************************")
    print("--- %s seconds ---" % (time.time() - start_time))

save_file(py_df)

I think what you want is to use the string 'df' instead of the variable df, as follows:
def save_file(df):
    start_time = time.time()
    df.createOrReplaceTempView("df")
    hc.sql("create table hdfs_folder.{} as select * from {}".format('df', 'df'))
    print("{} saved in hdfs_folder".format('df'))
    print("**********************************")
    print("--- %s seconds ---" % (time.time() - start_time))

save_file(py_df)
Edited - to use the actual table name, pass it in as a separate argument:
def save_file(df, name):
    start_time = time.time()
    df.createOrReplaceTempView("df")
    hc.sql("create table hdfs_folder.{} as select * from {}".format(name, 'df'))
    print("{} saved in hdfs_folder".format(name))
    print("**********************************")
    print("--- %s seconds ---" % (time.time() - start_time))

save_file(py_df, 'py_df')
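As a side note that is not part of the original answer, the temp view can be skipped entirely by writing the dataframe straight to a table with the DataFrameWriter API; this is only a sketch and assumes hdfs_folder is an existing database:
def save_file(df, name):
    start_time = time.time()
    # saveAsTable creates hdfs_folder.<name> directly from the dataframe,
    # so no temp view or CREATE TABLE ... AS SELECT is needed
    df.write.mode("overwrite").saveAsTable("hdfs_folder.{}".format(name))
    print("{} saved in hdfs_folder".format(name))
    print("--- %s seconds ---" % (time.time() - start_time))

save_file(py_df, 'py_df')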

Related

pyspark how to get the count of records which are not matching with the given date format

I have a CSV file that contains (FileName, ColumnName, Rule, RuleDetails) as headers.
As per the RuleDetails I need to get the count of values in the column INSTALLDATE which do not match the RuleDetails date format.
I have to pass ColumnName and RuleDetails dynamically.
I tried the code below:
from pyspark.sql.functions import *

DateFields = []
for rec in df_tabledef.collect():
    if rec["Rule"] == "DATEFORMAT":
        DateFields.append(rec["Columnname"])
        DateFormatValidvalues = [str(x) for x in rec["Ruledetails"].split(",") if x]
        DateFormatString = ",".join([str(elem) for elem in DateFormatValidvalues])

DateColsString = ",".join([str(elem) for elem in DateFields])

output = (
    df_tabledata.select(DateColsString)
    .where(
        DateColsString
        not in (datetime.strptime(DateColsString, DateFormatString), "DateFormatString")
    )
    .count()
)
display(output)
The expected output is the count of records which do not match the given date format.
For example, if 4 out of 10 records are not in (YYYY-MM-DD) format, then the count should be 4.
I get the error message below when I run the above code.
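No answer is recorded for this question; one common way to count the non-matching records (a sketch, not from the thread, assuming a single column and a Spark date pattern such as yyyy-MM-dd) is to parse the column with to_date and count the rows where parsing returns null:
from pyspark.sql.functions import col, to_date

# illustrative values; in the question these come from df_tabledef
date_col = "INSTALLDATE"
date_format = "yyyy-MM-dd"

# to_date returns null when the value does not match the pattern,
# so counting those nulls gives the number of non-matching records
non_matching = (
    df_tabledata
    .where(col(date_col).isNotNull() & to_date(col(date_col), date_format).isNull())
    .count()
)
print(non_matching)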

Pyspark error while running sql subquery "AnalysisException: u"Correlated column is not allowed in a non-equality predicate:\nAggregate"

I had written a SQL query which has a subquery in it. It is a valid MySQL query, but it does not run on PySpark.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql.functions import *
sc = spark.sparkContext
sqlcontext = HiveContext(sc)
select location, postal, max(spend), max(revenue)
from (select a.*,
             (select sum(r.revenue)
              from revenue r
              where r.user = a.user and
                    r.dte >= a.dt - interval 10 minute and
                    r.dte <= a.dte + interval 10 minute
             ) as revenue
      from auction a
      where a.event in ('Mid', 'End', 'Show') and
            a.cat_id in (3) and
            a.cat = 'B'
     ) a
group by location, postal;
The error I get every time is:
AnalysisException: u"Correlated column is not allowed in a non-equality predicate:\nAggregate [sum(cast(revenue#17 as double)) AS sum(CAST(revenue AS DOUBLE))#498]\n+- Filter (((user#2 = outer(user#85)) && (dt#0 >= cast(cast(outer(dt#67) - interval 10 minutes as timestamp) as string))) && ((dt#0 <= cast(cast(outer(dt#67) + interval 10 minutes as timestamp) as string))
Any insights on this will be helpful.
A correlated subquery using SQL syntax is not an option in PySpark here, so in this case I ran the queries separately, with some tweaks to the SQL, and left-joined the results using df.join to get the desired output through PySpark. That is how the issue was resolved.
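A sketch of that idea follows; the table and column names are taken from the question, but the exact join and aggregation logic is an assumption (it treats user, dte, location, postal and spend as the columns that identify an auction row):
from pyspark.sql import functions as F

# step 1: the auction rows, filtered as in the original inner query
auction = spark.sql("""
    select user, dte, location, postal, spend
    from auction
    where event in ('Mid', 'End', 'Show') and cat_id in (3) and cat = 'B'
""")
revenue = spark.sql("select user, dte, revenue from revenue")

# step 2: replace the correlated subquery with a left join on user plus the
# +/- 10 minute window, then sum the revenue per auction row
per_row = (
    auction.alias("a")
    .join(
        revenue.alias("r"),
        (F.col("r.user") == F.col("a.user"))
        & F.col("r.dte").between(
            F.col("a.dte") - F.expr("interval 10 minutes"),
            F.col("a.dte") + F.expr("interval 10 minutes"),
        ),
        "left",
    )
    .groupBy("a.user", "a.dte", "a.location", "a.postal", "a.spend")
    .agg(F.sum("r.revenue").alias("revenue"))
)

# step 3: the outer aggregation from the original query
per_row.groupBy("location", "postal").agg(F.max("spend"), F.max("revenue")).show()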

AnalysisException: cannot resolve given input columns:

I am running into this error when I am trying to select a couple of columns from the temporary table.
pd_df = pd.read_sql('select * from abc.cars limit 10', conn)
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
df.show()
print('***************')
print("Reading from tmp table")
data = spark.sql('select location_id from cars_tmp')
data.show()
AnalysisException: cannot resolve '`location_id`' given input columns: [cars_tmp.abc.product_id, cars_tmp.abc.location_id ...]
When I select all the columns I get the results. So this is successful:
data = spark.sql('select * from cars_tmp')
data.show()
I tried the queries below, but they fail as well with the same error:
data = spark.sql('select cars_tmp.abc.location_id from cars_tmp')
data.show()
data = spark.sql('select cars_tmp.location_id from cars_tmp')
data.show()
data = spark.sql('select abc.location_id from cars_tmp')
data.show()
I am running these in Databricks.
Databricks runtime version: 7.0
Apache Spark version: 3.0
scala: 2.12
or "spark_version": "7.0.x-scala2.12",
Any help will be highly appreciated.
Thanks
The column does not exist in the table under that exact name. select * from cars_tmp works because you do not specify any column names.
Please see this answer https://stackoverflow.com/a/64042756/8913402 with the same error handling.
I resolved the issue by adding each column explicitly to the pandas select query. So something like this:
pd_df = pd.read_sql('select id, location_id, product_id from abc.cars limit 10', conn)
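As an additional guess that is not from the thread: the error message lists the input columns as cars_tmp.abc.location_id and similar, which suggests the pandas columns are literally named with the abc. prefix. If so, quoting the dotted name with backticks, or stripping the prefix before creating the view, should also work:
# option 1: quote the dotted column name with backticks
spark.sql("select `abc.location_id` from cars_tmp").show()

# option 2: strip the prefix from the pandas column names, then rebuild the view
pd_df.columns = [c.split(".")[-1] for c in pd_df.columns]
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
spark.sql("select location_id from cars_tmp").show()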

How to execute multi line sql in spark sql

How can I execute lengthy, multi-line Hive queries in Spark SQL? For example, a query like the one below:
val sqlContext = new HiveContext (sc)
val result = sqlContext.sql ("
select ...
from ...
");
Use """ instead, so for example
val results = sqlContext.sql ("""
select ....
from ....
""");
or, if you want to format code, use:
val results = sqlContext.sql ("""
|select ....
|from ....
""".stripMargin);
You can use triple-quotes at the start/end of the SQL code or a backslash at the end of each line.
val results = sqlContext.sql ("""
create table enta.scd_fullfilled_entitlement as
select *
from my_table
""");
results = sqlContext.sql (" \
create table enta.scd_fullfilled_entitlement as \
select * \
from my_table \
")
val query = """(SELECT
a.AcctBranchName,
c.CustomerNum,
c.SourceCustomerId,
a.SourceAccountId,
a.AccountNum,
c.FullName,
c.LastName,
c.BirthDate,
a.Balance,
case when [RollOverStatus] = 'Y' then 'Yes' Else 'No' end as RollOverStatus
FROM
v_Account AS a left join v_Customer AS c
ON c.CustomerID = a.CustomerID AND c.Businessdate = a.Businessdate
WHERE
a.Category = 'Deposit' AND
c.Businessdate= '2018-11-28' AND
isnull(a.Classification,'N/A') IN ('Contractual Account','Non-Term Deposit','Term Deposit')
AND IsActive = 'Yes' ) tmp """
It is worth noting that the length is not the issue, just how the string is written. For this you can use """ as Gaweda suggested, or simply use a string variable, e.g. by building it with a StringBuilder. For example:
val selectElements = Seq("a", "b", "c")
val builder = StringBuilder.newBuilder
builder.append("select ")
builder.append(selectElements.mkString(","))
builder.append(" from my_table")
builder.append(" where d < 10")
val results = sqlContext.sql(builder.toString())
In addition to the above ways, you can also concatenate the lines with +:
val results = sqlContext.sql("select .... " +
  " from .... " +
  " where .... " +
  " group by ....");
Write your SQL inside triple quotes, like """ sql code """:
df = spark.sql(f""" select * from table1 """)
This is the same for Scala Spark and PySpark.
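Another option, not mentioned above (the file name here is purely illustrative), is to keep the lengthy query in its own .sql file and read it at runtime:
# a sketch: store the long query in a separate file and hand its contents to spark.sql
with open("query.sql") as f:
    query = f.read()

results = spark.sql(query)
results.show()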

How to import a CSV file through a stored function?

I have a sample CSV file which contains 10 records.
I want to upload the CSV file through a stored procedure.
Is it possible to do it that way? This is my stored function:
FOR i IN 1..v_cnt LOOP
    SELECT idx_date, file_path INTO v_idx_date, v_file_path
    FROM cloud10k.temp_idx_dates
    WHERE is_updated IS FALSE LIMIT 1;

    COPY cloud10k.temp_master_idx_new(header_section) FROM v_file_path;

    DELETE FROM cloud10k.temp_master_idx_new WHERE header_section NOT ILIKE '%.txt%';

    UPDATE cloud10k.temp_master_idx_new
    SET CIK = split_part(header_section, '|', 1),
        company_name = split_part(header_section, '|', 2),
        form_type = split_part(header_section, '|', 3),
        date_filed = split_part(header_section, '|', 4)::DATE,
        accession_number = replace(split_part(split_part(header_section, '|', 5), '/', 4), '.txt', ''),
        file_path = to_char(SUBSTRING(SPLIT_PART(v_file_path, 'master.', 2) FROM 1 FOR 8)::DATE, 'YYYY')
            || '/' || to_char(SUBSTRING(SPLIT_PART(v_file_path, 'master.', 2) FROM 1 FOR 8)::DATE, 'MM')
            || '/' || to_char(SUBSTRING(SPLIT_PART(v_file_path, 'master.', 2) FROM 1 FOR 8)::DATE, 'DD')
            || '/' || CONCAT_WS('.', 'master', SPLIT_PART(v_file_path, 'master.', 2))
    WHERE header_section ILIKE '%.txt%';
END LOOP;
But it is not executing. Can someone suggest how to do that?
Thanks,
Ramesh