I am trying to save a PySpark DataFrame into an HDFS folder. This code works fine outside the function, but once I put it inside a function I get errors. It is probably a case of how I am referencing the function arguments. Thanks for the help.
def save_file(df):
    start_time = time.time()
    df.createOrReplaceTempView("df")
    hc.sql("create table hdfs_folder.{} as select * from {}".format(df, df))
    print("{} saved in hdfs_folder".format(df))
    print("**********************************")
    print("--- %s seconds ---" % (time.time() - start_time))

save_file(py_df)
I think what you want is to use the string 'df' instead of the variable df, as follows:
def save_file(df):
    start_time = time.time()
    df.createOrReplaceTempView("df")
    hc.sql("create table hdfs_folder.{} as select * from {}".format('df', 'df'))
    print("{} saved in hdfs_folder".format('df'))
    print("**********************************")
    print("--- %s seconds ---" % (time.time() - start_time))

save_file(py_df)
Edited - Using the variable name:
def save_file(df, name):
    start_time = time.time()
    df.createOrReplaceTempView("df")
    hc.sql("create table hdfs_folder.{} as select * from {}".format(name, 'df'))
    print("{} saved in hdfs_folder".format(name))
    print("**********************************")
    print("--- %s seconds ---" % (time.time() - start_time))

save_file(py_df, 'py_df')
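Alternatively, a minimal sketch of the same save using the DataFrameWriter API instead of a temp view plus CREATE TABLE (assuming a Hive-enabled session; hdfs_folder is the database name from the question):

def save_file(df, name):
    # Write the DataFrame as a managed table in the hdfs_folder database;
    # mode("overwrite") replaces the table if it already exists.
    df.write.mode("overwrite").saveAsTable("hdfs_folder.{}".format(name))
    print("{} saved in hdfs_folder".format(name))

save_file(py_df, 'py_df')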
I have a CSV file that contains FileName, ColumnName, Rule, and RuleDetails as headers.
As per the RuleDetails, I need to get the count of values in the column (INSTALLDATE) that do not match the RuleDetails date format.
I have to pass ColumnName and RuleDetails dynamically.
I tried the below code:
from pyspark.sql.functions import *

DateFields = []
for rec in df_tabledef.collect():
    if rec["Rule"] == "DATEFORMAT":
        DateFields.append(rec["Columnname"])
        DateFormatValidvalues = [str(x) for x in rec["Ruledetails"].split(",") if x]
        DateFormatString = ",".join([str(elem) for elem in DateFormatValidvalues])

DateColsString = ",".join([str(elem) for elem in DateFields])

output = (
    df_tabledata.select(DateColsString)
    .where(
        DateColsString
        not in (datetime.strptime(DateColsString, DateFormatString), "DateFormatString")
    )
    .count()
)
display(output)
The expected output is the count of records that do not match the given date format.
For example, if 4 out of 10 records are not in (YYYY-MM-DD) format, then the count should be 4.
I get the error message below when I run the above code.
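For reference, a minimal sketch of one way such a check is usually expressed: to_date returns NULL for values that do not parse with the given pattern (this assumes the rule detail maps to a Spark date pattern such as yyyy-MM-dd; the DataFrame and column names follow the question):

from pyspark.sql import functions as F

# Count rows whose INSTALLDATE fails to parse with the expected pattern;
# to_date yields NULL for non-matching values.
bad_count = (
    df_tabledata
    .where(F.to_date(F.col("INSTALLDATE"), "yyyy-MM-dd").isNull())
    .count()
)
print(bad_count)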
I wrote a SQL query that has a subquery in it. It is a valid MySQL query, but it does not run on PySpark.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql.functions import *
sc = spark.sparkContext
sqlcontext = HiveContext(sc)
select location, postal, max(spend), max(revenue)
from (select a.*,
(select sum(r.revenue)
from revenue r
where r.user = a.user and
r.dte >= a.dt - interval 10 minute and
r.dte <= a.dte + interval 10 minute
) as revenue
from auction a
where a.event in ('Mid', 'End', 'Show') and
a.cat_id in (3) and
a.cat = 'B'
) a
group by location, postal;
The error I get every time is:
AnalysisException: u"Correlated column is not allowed in a non-equality predicate:\nAggregate [sum(cast(revenue#17 as double)) AS sum(CAST(revenue AS DOUBLE))#498]\n+- Filter (((user#2 = outer(user#85)) && (dt#0 >= cast(cast(outer(dt#67) - interval 10 minutes as timestamp) as string))) && ((dt#0 <= cast(cast(outer(dt#67) + interval 10 minutes as timestamp) as string))
Any insights on this will be helpful.
A correlated subquery using SQL syntax is not an option in PySpark, so in this case I ran the queries separately, with some tweaks to the SQL, and left-joined them using df.join to get the desired output through PySpark. This is how the issue was resolved.
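For illustration, a rough sketch of that join-based workaround (table and column names follow the question; it assumes dte is a timestamp column and spark is an existing SparkSession, so details may need adjusting):

from pyspark.sql import functions as F

# Pre-filter the auction rows and tag each with a key so the per-row
# revenue sum can be recovered after the join.
a = (
    spark.table("auction")
    .where("event IN ('Mid', 'End', 'Show') AND cat_id IN (3) AND cat = 'B'")
    .withColumn("row_id", F.monotonically_increasing_id())
    .alias("a")
)
r = spark.table("revenue").alias("r")

# Replace the correlated subquery with a ranged left join, summing the
# joined revenue per auction row.
per_row = (
    a.join(
        r,
        F.expr(
            "r.user = a.user AND "
            "r.dte >= a.dte - INTERVAL 10 MINUTES AND "
            "r.dte <= a.dte + INTERVAL 10 MINUTES"
        ),
        "left",
    )
    .groupBy("a.row_id", "a.location", "a.postal", "a.spend")
    .agg(F.sum("r.revenue").alias("revenue"))
)

# Final aggregation matches the outer query.
result = per_row.groupBy("location", "postal").agg(
    F.max("spend").alias("max_spend"),
    F.max("revenue").alias("max_revenue"),
)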
I am running into this error when I try to select a couple of columns from the temporary table.
pd_df = pd.read_sql('select * from abc.cars limit 10', conn)
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
df.show()
print('***************')
print("Reading from tmp table")
data = spark.sql('select location_id from cars_tmp')
data.show()
AnalysisException: cannot resolve '`location_id`' given input columns: [cars_tmp.abc.product_id, cars_tmp.abc.location_id ...]
When I select all the columns I get the results. So this is successful:
data = spark.sql('select * from cars_tmp')
data.show()
I tried the queries below as well, but they fail with the same error:
data = spark.sql('select cars_tmp.abc.location_id from cars_tmp')
data.show()
data = spark.sql('select cars_tmp.location_id from cars_tmp')
data.show()
data = spark.sql('select abc.location_id from cars_tmp')
data.show()
I am running these in Databricks.
Databricks runtime version: 7.0
Apache Spark version: 3.0
Scala: 2.12
or "spark_version": "7.0.x-scala2.12",
Any help will be highly appreciated.
Thanks
The column name does not exist in the table. select * from cars_tmp works because you do not specify the column name.
Please see this answer https://stackoverflow.com/a/64042756/8913402 with the same error handling.
I resolved the issue by adding each column to the pandas select query, so something like this:
pd_df = pd.read_sql('select id, location_id, product_id from abc.cars limit 10', conn)
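Put together, the working flow looks something like this (a sketch assembled from the snippets above; conn and the abc.cars table come from the question):

import pandas as pd

# Selecting columns explicitly gives the temp view plain column names.
pd_df = pd.read_sql('select id, location_id, product_id from abc.cars limit 10', conn)
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")

# location_id now resolves as a simple column name.
spark.sql('select location_id from cars_tmp').show()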
How can I execute lengthy, multiline Hive queries in Spark SQL? For example, a query like the one below:
val sqlContext = new HiveContext (sc)
val result = sqlContext.sql ("
select ...
from ...
");
Use """ instead, so for example
val results = sqlContext.sql ("""
select ....
from ....
""");
or, if you want to keep the query's indentation, use stripMargin:
val results = sqlContext.sql ("""
|select ....
|from ....
""".stripMargin);
You can use triple-quotes at the start/end of the SQL code or a backslash at the end of each line.
val results = sqlContext.sql ("""
create table enta.scd_fullfilled_entitlement as
select *
from my_table
""");
results = sqlContext.sql (" \
create table enta.scd_fullfilled_entitlement as \
select * \
from my_table \
")
val query = """(SELECT
a.AcctBranchName,
c.CustomerNum,
c.SourceCustomerId,
a.SourceAccountId,
a.AccountNum,
c.FullName,
c.LastName,
c.BirthDate,
a.Balance,
case when [RollOverStatus] = 'Y' then 'Yes' Else 'No' end as RollOverStatus
FROM
v_Account AS a left join v_Customer AS c
ON c.CustomerID = a.CustomerID AND c.Businessdate = a.Businessdate
WHERE
a.Category = 'Deposit' AND
c.Businessdate= '2018-11-28' AND
isnull(a.Classification,'N/A') IN ('Contractual Account','Non-Term Deposit','Term Deposit')
AND IsActive = 'Yes' ) tmp """
It is worth noting that the length is not the issue, just how the string is written. For this you can use """ as Gaweda suggested, or simply use a string variable, e.g. by building it with a StringBuilder. For example:
val selectElements = Seq("a","b","c")
val builder = StringBuilder.newBuilder
builder.append("select ")
builder.append(selectElements.mkString(","))
builder.append(" where d<10")
val results = sqlContext.sql(builder.toString())
In addition to the above ways, you can concatenate ordinary strings as well:
val results = sqlContext.sql("select .... " +
  " from .... " +
  " where .... " +
  " group by ....")
Write your SQL inside triple quotes, like """ sql code """:
df = spark.sql(f""" select * from table1 """)
Triple quotes work the same way in Scala Spark and PySpark.
I have a sample CSV file that contains 10 records.
I want to upload the CSV file through a stored procedure.
Is it possible to do it that way? This is my stored function:
FOR i IN 1..v_cnt LOOP
    SELECT idx_date, file_path INTO v_idx_date, v_file_path
    FROM cloud10k.temp_idx_dates
    WHERE is_updated IS FALSE LIMIT 1;

    COPY cloud10k.temp_master_idx_new(header_section) FROM v_file_path;

    DELETE FROM cloud10k.temp_master_idx_new WHERE header_section NOT ILIKE '%.txt%';

    UPDATE cloud10k.temp_master_idx_new
    SET CIK = split_part(header_section, '|', 1),
        company_name = split_part(header_section, '|', 2),
        form_type = split_part(header_section, '|', 3),
        date_filed = split_part(header_section, '|', 4)::DATE,
        accession_number = replace(split_part(split_part(header_section, '|', 5), '/', 4), '.txt', ''),
        file_path = to_char(SUBSTRING(SPLIT_PART(v_file_path, 'master.', 2) FROM 1 FOR 8)::DATE, 'YYYY')
            || '/' || to_char(SUBSTRING(SPLIT_PART(v_file_path, 'master.', 2) FROM 1 FOR 8)::DATE, 'MM')
            || '/' || to_char(SUBSTRING(SPLIT_PART(v_file_path, 'master.', 2) FROM 1 FOR 8)::DATE, 'DD')
            || '/' || CONCAT_WS('.', 'master', SPLIT_PART(v_file_path, 'master.', 2))
    WHERE header_section ILIKE '%.txt%';
END LOOP;
But it is not executing. Can someone suggest how to do that?
Thanks,
Ramesh
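One likely culprit, offered as an assumption rather than a confirmed diagnosis: plain COPY in PL/pgSQL requires a literal file name, so a variable like v_file_path has to go through dynamic SQL. A minimal sketch of that change inside the loop:

-- Hypothetical fix: build the COPY statement dynamically, since plain
-- COPY cannot take its file path from a variable. %L quotes the path
-- as a literal.
EXECUTE format(
    'COPY cloud10k.temp_master_idx_new(header_section) FROM %L',
    v_file_path
);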