Is there any PySpark function or substitute for LIKE in SQL?

I have a python list of all the columns of the dataframe as below.
['Timestamp',
'ScheduleCode__VALUE',
'ScheduleCode__i:nil',
'ProductionCode__VALUE',
'ProductionCode__i:nil',
'ProductCode__VALUE',
'ProductCode__i:nil',
'ProductCategory__VALUE',
'ProductCategory__i:nil']
I need to drop all the columns from the above list that end with __i:nil, and rename the columns ending in __VALUE to just their prefix, e.g. ProductCode__VALUE should be renamed to ProductCode.

Try this:
column_list = ['Timestamp',
'ScheduleCode__VALUE',
'ScheduleCode__i:nil',
'ProductionCode__VALUE',
'ProductionCode__i:nil',
'ProductCode__VALUE',
'ProductCode__i:nil',
'ProductCategory__VALUE',
'ProductCategory__i:nil']
for element in column_list:
    if element.endswith('__VALUE'):
        df = df.withColumnRenamed(element, element.split('__')[0])

df = df.drop(*[element for element in column_list if element.endswith('__i:nil')])
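As a variant, both steps can be folded into a single select. This is a minimal sketch assuming df contains exactly the columns listed above:

from pyspark.sql import functions as F

# Rename the *__VALUE columns to their prefix and drop the *__i:nil columns in one pass
kept = [
    F.col(c).alias(c.split('__')[0]) if c.endswith('__VALUE') else F.col(c)
    for c in df.columns
    if not c.endswith('__i:nil')
]
df = df.select(*kept)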


PySpark best way to filter df based on columns from different df's

I have a DF A_DF which has, among others, two columns, say COND_B and COND_C. Then I have two different DFs: B_DF with a COND_B column and C_DF with a COND_C column.
Now I would like to filter A_DF where the value matches in one OR the other. Something like:
df = A_DF.filter((A_DF.COND_B == B_DF.COND_B) | (A_DF.COND_C == C_DF.COND_C))
But I found out it is not possible like this.
EDIT
error: Attribute CON_B#264,COND_C#6 is missing from the schema: [... COND_B#532, COND_C#541 ]. Attribute(s) with the same name appear in the operation: COND_B,COND_C. Please check if the right attribute(s) are used.; it looks like I can only filter within the same DF because of the #number added on the fly.
So I first tried to build lists from B_DF and C_DF and filter based on those, but collect() was too expensive on 100M records.
So I tried:
AB_DF = A_DF.join(B_DF, 'COND_B', 'left_semi')
AC_DF = A_DF.join(C_DF, 'COND_C', 'left_semi')
df = AB_DF.unionAll(AC_DF).dropDuplicates()
I used dropDuplicates() to remove duplicate records where both conditions were true. But even with that I got some unexpected results.
Is there some other, smoother way to do this? Something like an EXISTS statement in SQL?
EDIT2
I tried SQL based on @mck's response:
e.createOrReplaceTempView('E')
b.createOrReplaceTempView('B')
p.createOrReplaceTempView('P')
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
my_output.write_dataframe(df)
with error:
Traceback (most recent call last):
File "/myproject/abc.py", line 45, in my_compute_function
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
TypeError: sql() missing 1 required positional argument: 'sqlQuery'
Thanks a lot!
Your idea of using exists should work. You can do:
A_DF.createOrReplaceTempView('A')
B_DF.createOrReplaceTempView('B')
C_DF.createOrReplaceTempView('C')
df = spark.sql("""
select * from A
where exists (select 1 from B where A.COND_B = B.COND_B)
or exists (select 1 from C where A.COND_C = C.COND_C)
""")

How to create multiple temp views in spark using multiple data frames

I have 10 data frames and I want to create multiple temp views so that I can perform SQL operations on them using the createOrReplaceTempView command in PySpark.
This is probably what you're after.
source_tables = [
'sql.production.dbo.table1',
'sql.production.dbo.table2',
'sql.production.dbo.table3',
'sql.production.dbo.table4',
'sql.production.dbo.table5',
'sql.production.dbo.table6',
'sql.production.dbo.table7',
'sql.production.dbo.table8',
'sql.production.dbo.table9',
'sql.production.dbo.table10'
]
for source_table in source_tables:
    try:
        view_name = source_table.replace('.', '_')
        # Load the source table (assumes it is reachable through the metastore;
        # swap in your own read if the source is JDBC, files, etc.)
        df = spark.read.table(source_table)
        # Lowercase all column names
        df = df.toDF(*[c.lower() for c in df.columns])
        df.createOrReplaceTempView(view_name)
    except Exception as e:
        print(e)
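Once the loop has run, each view can be queried by its derived name, for example (the name follows the replace('.', '_') convention above):

spark.sql("SELECT COUNT(*) FROM sql_production_dbo_table1").show()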

replace column and get ltrim of the column value

I want to replace a column in a dataframe. I need the Scala syntax code for this.
Controlling_Area = CC2
Hierarchy_Name = CC2HIDNE
Need to write it as: HIDENE
i.e. remove the Controlling_Area value present in Hierarchy_Name.
val dfPC = ReadLatest("/Full", "parquet")
.select(
LRTIM( REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"") ),
Col(ColumnN),
Col(ColumnO)
)
notebook:3: error: not found: value REPLACE
REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"")
^
Expecting to get the LTRIM and REPLACE code in Scala.
You can use withColumnRenamed together with ltrim to achieve that:
import org.apache.spark.sql.functions._
val dfPC = ReadLatest("/Full", "parquet")
.withColumnRenamed("Hierarchy_Name","Controlling_Area")
.withColumn("Controlling_Area",ltrim(col("Controlling_Area")))

How to get strings separated by commas from a list to a query in PySpark?

I want to generate a query by using a list in PySpark
list = ["hi#gmail.com", "goodbye#gmail.com"]
query = "SELECT * FROM table WHERE email IN (" + list + ")"
This is my desired output:
query
SELECT * FROM table WHERE email IN ("hi#gmail.com", "goodbye#gmail.com")
Instead I'm getting: TypeError: cannot concatenate 'str' and 'list' objects
Can anyone help me achieve this? Thanks
If someone's having the same issue, I found that you can build the IN list like this (emails here is the list of addresses from the question):
"'" + "','".join(map(str, emails)) + "'"
and, after substituting it into the query string, you will have the following output:
SELECT * FROM table WHERE email IN ('hi#gmail.com', 'goodbye#gmail.com')
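Put together, a minimal sketch of the whole thing, using the table and column names from the question:

emails = ["hi#gmail.com", "goodbye#gmail.com"]
in_list = "'" + "','".join(map(str, emails)) + "'"
query = "SELECT * FROM table WHERE email IN (" + in_list + ")"
print(query)
# SELECT * FROM table WHERE email IN ('hi#gmail.com','goodbye#gmail.com')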
Try this:
Dataframe based approach -
from pyspark.sql.functions import col

df = spark.createDataFrame([(1,"hi#gmail.com"), (2,"goodbye#gmail.com"), (3,"abc#gmail.com"), (4,"xyz#gmail.com")], ['id','email_id'])
email_filter_list = ["hi#gmail.com", "goodbye#gmail.com"]
df.where(col('email_id').isin(email_filter_list)).show()
Spark SQL based approach -
df = spark.createDataFrame([(1,"hi#gmail.com") ,(2,"goodbye#gmail.com",),(3,"abc#gmail.com",),(4,"xyz#gmail.com")], ['id','email_id'])
df.createOrReplaceTempView('t1')
sql_filter = ','.join(["'" +i + "'" for i in email_filter_list])
spark.sql("SELECT * FROM t1 WHERE email_id IN ({})".format(sql_filter)).show()

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two dataframes and print out their column names, and they are as expected, but when I do a SQL join I get an error that the column name cannot be resolved given the inputs. I have simplified the join just to get it to work, but I will need to add more join conditions, which is why I'm using SQL (I will be adding: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the df mnvr_temp_idx_prev_temp.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!
I would recommend renaming the field 'device_id' in at least one of the data frames. I modified your query just a bit and tested it (in Scala). The query below works:
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now if you do a 'select *' in the above statement, it will work. But if you try to select 'device_id', you will get the error "Reference 'device_id' is ambiguous". As you can see in the 'test' data frame definition above, it has two fields with the same name (device_id). So to avoid this, I recommend changing the field name in one of the dataframes.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end') \
    .withColumnRenamed("device_id", "device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use either the DataFrame API or sqlContext:
//using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered,$"device" === $"device_id"
&& $"mnvr_bgn" < $"idx_trip_id","inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and b.mnvr_bgn < a.idx_trip_id")
The queries above will work for your problem. If your data set is very large, I would recommend not putting the '>' or '<' comparisons in the join condition itself, since range-only comparisons in the join condition can force a costly cross (nested-loop) join; keep the equality in the JOIN and apply the range checks in the WHERE clause instead.
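A minimal PySpark sketch of that recommendation, assuming the simplified dataframes from the question (the range filter uses idx_trip_end, since that is the column present in raw_data_filtered; swap in the real columns as needed):

from pyspark.sql import functions as F

# Rename device_id on one side so the reference is no longer ambiguous
mnvr_renamed = mnvr_temp_idx_prev.withColumnRenamed('device_id', 'device')

test = (raw_data_filtered
        .join(mnvr_renamed, raw_data_filtered.device_id == mnvr_renamed.device, 'inner')
        .where(F.col('mnvr_bgn') < F.col('idx_trip_end')))
test.show()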