dbTest not defined - pyspark

I'm going through my Spark training on Databricks, but everything seems to break when I click run. I get this error:
error: name 'dbTest' is not defined
when I run the code below:
ip1, count1 = ipCountDF.first()
cols = set(ipCountDF.columns)
dbTest("ET1-P-02-01-01", "213.152.28.bhe", ip1)
dbTest("ET1-P-02-01-02", True, count1 > 500000 and count1 < 550000)
dbTest("ET1-P-02-01-03", True, 'count' in cols)
dbTest("ET1-P-02-01-03", True, 'ip' in cols)
print("Tests passed!")
Any tips?
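For what it's worth, in the Databricks training notebooks dbTest is normally defined by a classroom-setup cell (often something like %run ./Includes/Classroom-Setup) that has to run before the test cells; the exact include path varies by course. If you just need the cell to run, here is a minimal, hypothetical stand-in for the helper (the real one is defined by the course setup and may differ):

# Hypothetical stand-in for the course's dbTest helper: compare an expected
# value against an actual result and fail loudly when they differ.
def dbTest(test_id, expected, result):
    assert str(expected) == str(result), \
        f"{test_id} failed: expected {expected!r}, got {result!r}"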

Related

PySpark: how to get the count of records not matching a given date format

I have a CSV file with the headers (FileName, ColumnName, Rule, RuleDetails).
Per the rule details, I need to get the count of values in a column (INSTALLDATE) that do not match the RuleDetails date format. ColumnName and RuleDetails have to be passed dynamically.
I tried the code below:
from pyspark.sql.functions import *

DateFields = []
for rec in df_tabledef.collect():
    if rec["Rule"] == "DATEFORMAT":
        DateFields.append(rec["Columnname"])
        DateFormatValidvalues = [str(x) for x in rec["Ruledetails"].split(",") if x]
        DateFormatString = ",".join([str(elem) for elem in DateFormatValidvalues])

DateColsString = ",".join([str(elem) for elem in DateFields])

output = (
    df_tabledata.select(DateColsString)
    .where(
        DateColsString
        not in (datetime.strptime(DateColsString, DateFormatString), "DateFormatString")
    )
    .count()
)
display(output)
Expected output is count of records which are not matching with the given dateformat.
For Example - If 4 out of 10 records are not in (YYYY-MM-DD) then the count should be 4
I get the below error message when I run the above code.
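For reference, a common way to count records that do not match a date format in PySpark is to parse the column with to_date and count the rows where parsing fails, since to_date returns null when the value does not match the pattern. A minimal sketch, assuming a single column and a Spark-style pattern (Spark uses yyyy-MM-dd, not YYYY-MM-DD); on Spark 3, some patterns also need spark.sql.legacy.timeParserPolicy set to LEGACY:

from pyspark.sql.functions import col, to_date

# Hypothetical values; in practice these would come from df_tabledef.
date_col = "INSTALLDATE"
date_fmt = "yyyy-MM-dd"

# Rows whose value cannot be parsed with the given format (nulls included).
bad_count = df_tabledata.filter(to_date(col(date_col), date_fmt).isNull()).count()
print(bad_count)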

How to get the column headers from a select query using the psycopg2 client?

I have this Python 3 code:
conn = psycopg2.connect( ... )
curr = conn.cursor()
curr.execute(code)
rows = curr.fetchall()
where 'code' holds the select query statement.
After executing this, the 'rows' list contains lists of only the selected row values. How do I run 'curr.execute' so that I also get the respective column headers?
Meaning, if I have, say,
Select col1, col2 from table Where some_condition;
I want my 'rows' list to look something like [['col1', 'col2'], [some_val_for_col1, some_val_for_col2] ...]. Other ways of getting these column headers are also fine, but the select query in 'code' shouldn't change.
You have to execute two commands:
curr.execute("Select * FROM people LIMIT 0")       # zero-row query just to expose the headers
colnames = [desc[0] for desc in curr.description]  # column names of that result
curr.execute(code)
You can follow the steps mentioned in https://kb.objectrocket.com/postgresql/get-the-column-names-from-a-postgresql-table-with-the-psycopg2-python-adapter-756
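Note that cursor.description is populated by whatever query you last executed, so a single execute of the unchanged 'code' also works. A short sketch of combining the headers with the rows:

curr.execute(code)                                 # run the original query unchanged
colnames = [desc[0] for desc in curr.description]  # headers of this result set
rows = curr.fetchall()
result = [colnames] + [list(r) for r in rows]      # [['col1', 'col2'], [val1, val2], ...]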

PySpark best way to filter df based on columns from different df's

I have a DataFrame A_DF which has, among others, two columns, say COND_B and COND_C. Then I have two different DataFrames: B_DF with a COND_B column and C_DF with a COND_C column.
Now I would like to filter A_DF where the value matches in one OR the other. Something like:
df = A_DF.filter((A_DF.COND_B == B_DF.COND_B) | (A_DF.COND_C == C_DF.COND_C))
But I found out it doesn't work like this.
EDIT
error: Attribute CON_B#264,COND_C#6 is missing from the schema: [... COND_B#532, COND_C#541 ]. Attribute(s) with the same name appear in the operation: COND_B,COND_C. Please check if the right attribute(s) are used.
It looks like I can only filter within the same DataFrame because of the #numbers added on the fly.
So I first tried to build a list from B_DF and C_DF and filter based on that, but collect() was too expensive on 100M records.
So I tried:
AB_DF = A_DF.join(B_DF, 'COND_B', 'left_semi')
AC_DF = A_DF.join(C_DF, 'COND_C', 'left_semi')
df = AB_DF.unionAll(AC_DF).dropDuplicates()
I used dropDuplicates() to remove duplicate records where both conditions were true. But even with that I got some unexpected results.
Is there some other, smoother way to do this simply? Something like an EXISTS statement in SQL?
EDIT2
I tried SQL based on @mck's response:
e.createOrReplaceTempView('E')
b.createOrReplaceTempView('B')
p.createOrReplaceTempView('P')
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
my_output.write_dataframe(df)
with error:
Traceback (most recent call last):
File "/myproject/abc.py", line 45, in my_compute_function
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
TypeError: sql() missing 1 required positional argument: 'sqlQuery'
Thanks a lot!
Your idea of using exists should work. You can do:
A_DF.createOrReplaceTempView('A')
B_DF.createOrReplaceTempView('B')
C_DF.createOrReplaceTempView('C')
df = spark.sql("""
select * from A
where exists (select 1 from B where A.COND_B = B.COND_B)
or exists (select 1 from C where A.COND_C = C.COND_C)
""")

AnalysisException: cannot resolve given input columns:

I am running into this error when trying to select a couple of columns from a temporary table.
pd_df = pd.read_sql('select * from abc.cars limit 10', conn)
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
df.show()
print('***************')
print("Reading from tmp table")
data = spark.sql('select location_id from cars_tmp')
data.show()
AnalysisException: cannot resolve '`location_id`' given input columns: [cars_tmp.abc.product_id, cars_tmp.abc.location_id ...]
When I select all the columns I get the results. So this is successful:
data = spark.sql('select * from cars_tmp')
data.show()
I tried the queries below, but they fail with the same error as well:
data = spark.sql('select cars_tmp.abc.location_id from cars_tmp')
data.show()
data = spark.sql('select cars_tmp.location_id from cars_tmp')
data.show()
data = spark.sql('select abc.location_id from cars_tmp')
data.show()
I am running these in Databricks.
Databricks runtime version: 7.0
Apache Spark version: 3.0
scala: 2.12
or "spark_version": "7.0.x-scala2.12",
Any help will be highly appreciated.
Thanks
The column name does not exist in the table. select * from cars_tmp works because you do not specify the column name.
Please see this answer, which handles the same error: https://stackoverflow.com/a/64042756/8913402
I resolved the issue by adding each column to the pandas select query. Something like this:
pd_df = pd.read_sql('select id, location_id, product_id from abc.cars limit 10', conn)
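Another option, since the temp view's columns apparently ended up with dots in their names (abc.location_id), is to quote them with backticks, which is how Spark SQL references column names that contain dots. A sketch, assuming the column really is named abc.location_id:

# Backticks tell Spark the dot is part of the column name,
# not a table or schema qualifier.
data = spark.sql('select `abc.location_id` from cars_tmp')
data.show()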

Spark SQL query with an int value is not working when the int value is passed as an argument

I have a SQL query that takes an integer value as one argument.
Working:
sqlContext.sql("select concat(MinRange, '-', MaxRange) from range where 20 >= MinRange and 20 < MaxRange")
Not working:
sqlContext.sql("select concat(MinRange, '-', MaxRange) from range where "+intval+">= MinRange and "+intval+" < MaxRange")
This one, with a string interpolator, is not working either:
sqlContext.sql(s"select concat(MinRange, '-', MaxRange) from range where $intval >= MinRange and $intval < MaxRange")
I'm sure I am missing something very basic.
First construct the query string, then execute it; this works:
val intval=10
val qry= "select concat(MinRange, '-', MaxRange) from range where "+intval+">= MinRange and "+intval+" < MaxRange"
sqlContext.sql(qry)
Did you check the following:
1) Since you are building the query at run time, can you check that the column names (MinRange, MaxRange) actually exist in the "range" table?
2) Did you check the actual value of the intval variable at run time? Make sure it is not null, by giving it a default value.