pyspark filter with parameter value is not working

Below is the PySpark code that I tried to run. I am not able to substitute the parameter value into the filter. Please advise.
>>> coreWordFilter = "crawlResult.url.like('%"+IncoreWords[0]+"%')"
>>> coreWordFilter
"crawlResult.url.like('%furniture%')"
>>> preFilter = crawlResult.filter(coreWordFilter)
20/02/11 09:19:54 INFO execution.SparkSqlParser: Parsing command: crawlResult.url.like('%furniture%')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/dataframe.py", line 1078, in filter
jdf = self._jdf.filter(condition)
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/utils.py", line 73, in deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: u"\nUnsupported function name 'crawlResult.url.like'(line 1, pos 0)\n\n== SQL ==\ncrawlResult.url.like('%furniture%')\n^^^\n"
>>> preFilter = crawlResult.filter(crawlResult.url.like('%furniture%'))
>>>
I need some help with how to add more crawlResult.url.like logic:
Code from today 2/12/2020:
>>> coreWordFilter = crawlResult.url.like('%{}%'.format(IncoreWords[0]))
>>> coreWordFilter
Column<url LIKE %furniture%>
>>> InmoreWords
['couch', 'couches']
>>> for a in InmoreWords:
...     coreWordFilter=coreWordFilter+" | crawlResult.url.like('%"+a+"%')"
>>> coreWordFilter
Column<((((((url LIKE %furniture% + | crawlResult.url.like('%) + couch) + %')) + | crawlResult.url.like('%) + couches) + %'))>
preFilter = crawlResult.filter(coreWordFilter) does not work with the above coreWordFilter.
I was hoping I could do the below but was not able to; I got an error:
>>> coreWordFilter2 = "crawlResult.url.like('%"+IncoreWords[0]+"%')"
>>> coreWordFilter2
"crawlResult.url.like('%furniture%')"
>>> for a in InmoreWords:
...     coreWordFilter2=coreWordFilter2+" | crawlResult.url.like('%"+a+"%')"
...
>>> coreWordFilter2
"crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%') |
crawlResult.url.like('%couches%')"
>>> preFilter = crawlResult.filter(coreWordFilter2)
20/02/12 08:55:26 INFO execution.SparkSqlParser: Parsing command:
crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%') |
crawlResult.url.like('%couches%')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/dataframe.py", line
1078, in filter
jdf = self._jdf.filter(condition)
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-
src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/utils.py", line 73, in
deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: u"\nUnsupported function name
'crawlResult.url.like'(line 1, pos 0)\n\n== SQL
==\ncrawlResult.url.like('%furniture%') |
crawlResult.url.like('%couch%') | crawlResult.url.like('%couches%')\n^^^\n"
I think the correct syntax is:
preFilter = crawlResult.filter(crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%'))

Since you want a dynamic OR condition, I think filtering based on SQL string operators (AND, OR, NOT, etc.) would be easier compared to Column-based logical operators (&, |, ~, etc.).
Dummy dataframe and lists:
crawlResult.show()
+---+--------------+
| id| url|
+---+--------------+
| 1|test-furniture|
| 1| table|
| 1| test-test|
| 1| couch|
+---+--------------+
# IncoreWords
# ['furniture', 'office-table', 'counch', 'blah']
# InmoreWords
# ['couch', 'couches']
Now, I am just following your OP sequence for building the dynamic filter clause, but it should give you a broad idea.
coreWordFilter2 = "url like ('%"+IncoreWords[0]+"%')"
# coreWordFilter2
#"url like ('%furniture%')"
for a in InmoreWords:
    coreWordFilter2=coreWordFilter2+" or url like('%"+a+"%')"
# coreWordFilter2
# "url like ('%furniture%') or url like('%couch%') or url like('%couches%')"
crawlResult.filter(coreWordFilter2).show()
+---+--------------+
| id| url|
+---+--------------+
| 1|test-furniture|
| 1| couch|
+---+--------------+
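If you do want to stay with Column-based operators instead of the SQL string, a minimal sketch (assuming the same crawlResult DataFrame and the IncoreWords / InmoreWords lists from the question) is to build a list of like() Columns and OR them together with functools.reduce. The problem in the original attempt was concatenating a Column with a Python string.
from functools import reduce
from pyspark.sql import functions as F

words = [IncoreWords[0]] + InmoreWords                      # ['furniture', 'couch', 'couches']
conditions = [F.col('url').like('%{}%'.format(w)) for w in words]
coreWordFilter = reduce(lambda a, b: a | b, conditions)     # a single Column, OR-ed together
preFilter = crawlResult.filter(coreWordFilter)
preFilter.show()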

Related

Round off the data frame in Pyspark

I am trying to round off the "perc_of_count_total" column in PySpark, but I could not do it. Below is my script:
Auto_data1 = Auto_data.groupBy("Make", "Fuel") \
    .count() \
    .withColumnRenamed('count', 'cnt_per_group') \
    .withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100) \
    .show(10)
Auto_data1.select(round(col('cnt_per_group'),2)).show(5)
Output
+-----------+----+-------------+--------------------+
| Make|Fuel|cnt_per_group| perc_of_count|
+-----------+----+-------------+--------------------+
| C | I| 34748|0.027960585487965286|
| P | D| 489| 3.93482396213164E-4|
Error message
An error was encountered:
'NoneType' object has no attribute 'select'
Traceback (most recent call last):
AttributeError: 'NoneType' object has no attribute 'select'
Remove the last show function; it doesn't return anything, so Auto_data1 ends up as None.
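For example, a minimal sketch of the corrected pipeline (assuming Auto_data, tot, and the functions import as F from the question):
from pyspark.sql import functions as F

Auto_data1 = Auto_data.groupBy("Make", "Fuel") \
    .count() \
    .withColumnRenamed('count', 'cnt_per_group') \
    .withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100)   # no .show() here

Auto_data1.select(F.round(F.col('perc_of_count_total'), 2).alias('perc_rounded')).show(5)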

PySpark passing Dataframe as extra parameter to map

I want to parallelize a Python list, use a map on that list, and also pass a DataFrame to the mapper function:
def output_age_split(df):
    ages = [18, 19, 20, 21, 22]
    age_dfs = spark.sparkContext.parallelize(ages).map(lambda x: test(x, df))
    # Unsure of type of age_dfs, but should be able to split into the smaller dfs like this somehow
    return age_dfs[0], age_dfs[1] ...

def test(age, df):
    return df.where(col("age")==age)
This results in a pickling error
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects
How should I parallelize this operation so that I get back a collection of DataFrames?
EDIT: Sample of df
|age|name|salary|
|---|----|------|
|18 |John|40000 |
|22 |Joseph|60000 |
The issue is that age_dfs is not a DataFrame; it's an RDD. When you apply a map with the test function in it (which returns a DataFrame), you end up in a weird situation where age_dfs is actually an RDD of type PipelinedRDD, which is neither a DataFrame nor iterable.
TypeError: 'PipelinedRDD' object is not iterable
You can try the workaround below, where you just iterate over the list instead, create a collection of DataFrames, and iterate over them at will.
from pyspark.sql.functions import *

def output_age_split(df):
    ages = [18, 19, 20, 21, 22]
    result = []
    for age in ages:
        temp_df = test(age, df)
        if not len(temp_df.head(1)) == 0:
            result.append(temp_df)
    return result

def test(age, df):
    return df.where(col("age") == age)
# +---+------+------+
# |age|name |salary|
# +---+------+------+
# |18 |John |40000 |
# |22 |Joseph|60000 |
# +---+------+------+
df = spark.sparkContext.parallelize(
    [
        (18, "John", 40000),
        (22, "Jpseph", 60000)
    ]
).toDF(["age", "name", "salary"])
df.show()
result = output_age_split(df)
# Output type is: <class 'list'>
print(f"Output type is: {type(result)}")
for r in result:
    r.show()
# +---+----+------+
# |age|name|salary|
# +---+----+------+
# | 18|John| 40000|
# +---+----+------+
# +---+------+------+
# |age| name|salary|
# +---+------+------+
# | 22|Jpseph| 60000|
# +---+------+------+
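As a side note (same df as above, and assuming one DataFrame per age is not strictly required), the ages list can also be applied on the driver side in a single pass with isin(), and the split done afterwards:
from pyspark.sql.functions import col

ages = [18, 19, 20, 21, 22]
subset = df.filter(col("age").isin(ages))                          # one filter instead of an RDD map
age_dfs = {age: subset.filter(col("age") == age) for age in ages}  # split per age on demand
age_dfs[18].show()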

Split string on custom Delimiter in pyspark

I have data with column foo which can be
foo
abcdef_zh
abcdf_grtyu_zt
pqlmn#xl
From here I want to create two columns such that:
Part 1 Part 2
abcdef zh
abcdf_grtyu zt
pqlmn xl
The code I am using for this is
data = data.withColumn("Part 1",split(data["foo"],substring(data["foo"],-3,1))).get_item(0)
data = data.withColumn("Part 2",split(data["foo"],substring(data["foo"],-3,1))).get_item(1)
However, I am getting an error: column not iterable.
The following should work
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import expr
>>> df = sc.parallelize(['abcdef_zh', 'abcdfgrtyu_zt', 'pqlmn#xl']).map(lambda x: Row(x)).toDF(["col1"])
>>> df.show()
+-------------+
| col1|
+-------------+
| abcdef_zh|
|abcdfgrtyu_zt|
| pqlmn#xl|
+-------------+
>>> df.withColumn('part2',df.col1.substr(-2, 3)).withColumn('part1', expr('substr(col1, 1, length(col1)-3)')).select('part1', 'part2').show()
+----------+-----+
| part1|part2|
+----------+-----+
| abcdef| zh|
|abcdfgrtyu| zt|
| pqlmn| xl|
+----------+-----+
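If the delimiter really does vary per row (here _ or #), another sketch of the same idea uses regexp_extract, so the delimiter never has to be sliced out of the string first (same df as above; the pattern assumes the last _ or # separates the two parts):
from pyspark.sql.functions import regexp_extract

pattern = r'^(.*)[_#]([^_#]+)$'     # greedy first group, so the last _ or # is the split point
df.withColumn('part1', regexp_extract('col1', pattern, 1)) \
  .withColumn('part2', regexp_extract('col1', pattern, 2)) \
  .select('part1', 'part2').show()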

Pyspark Dataframe TypeError: expected string or buffer

I am adding a new column to an existing DataFrame in PySpark by searching one of the fields, 'script', and returning the match as the entry for the new column.
import re as re
def sw_fix(data_str):
    if re.compile(r'gaussian').search(data_str):
        cleaned_str = 'gaussian'
    elif re.compile(r'gromacs').search(data_str):
        cleaned_str = 'gromacs'
    else:
        cleaned_str = 'ns'
    return cleaned_str
sw_fix_udf = udf(sw_fix, StringType())
k=df.withColumn("software_new", sw_fix_udf(df.script))
The code runs fine and generates DataFrame k with the new column containing the correct match; however, I am unable to do any operation on the newly added column:
k.filter(k.software_new=='gaussian').show()
throws an error, TypeError: expected string or buffer.
I checked the datatype of the newly added column with
[f.dataType for f in k.schema.fields]
which shows StringType.
However, this one works, where sw_app is an existing column in the original DataFrame:
k.filter(k.sw_app=='gaussian').select('sw_app','software_new').show(5)
+--------+------------+
| sw_app|software_new|
+--------+------------+
|gaussian| gaussian|
|gaussian| gaussian|
|gaussian| gaussian|
|gaussian| gaussian|
|gaussian| gaussian|
+--------+------------+
Any hints on why I can't process the software_new field?
It is working fine for me without any issues. See the demo below in the PySpark REPL.
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> import re as re
>>> def sw_fix(data_str):
...     if re.compile(r'gaussian').search(data_str):
...         cleaned_str = 'gaussian'
...     elif re.compile(r'gromacs').search(data_str):
...         cleaned_str = 'gromacs'
...     else:
...         cleaned_str = 'ns'
...     return cleaned_str
...
>>>
>>> sw_fix_udf = udf(sw_fix, StringType())
>>> df = spark.createDataFrame(['gaussian text', 'gromacs text', 'someother text'], StringType())
>>>
>>> k=df.withColumn("software_new", sw_fix_udf(df.value))
>>> k.show()
+--------------+------------+
| value|software_new|
+--------------+------------+
| gaussian text| gaussian|
| gromacs text| gromacs|
|someother text| ns|
+--------------+------------+
>>> k.filter(k.software_new == 'ns').show()
+--------------+------------+
| value|software_new|
+--------------+------------+
|someother text| ns|
+--------------+------------+
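One thing worth checking in the original data (this is an assumption, since the questioner's DataFrame is not shown): if the script column contains nulls, re.search(None) inside the UDF raises exactly this TypeError: expected string or buffer once Spark actually evaluates the column for the filter. A defensive sketch of the same UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import re

def sw_fix(data_str):
    if data_str is None:              # guard against nulls in the column (hypothetical cause)
        return 'ns'
    if re.search(r'gaussian', data_str):
        return 'gaussian'
    elif re.search(r'gromacs', data_str):
        return 'gromacs'
    return 'ns'

sw_fix_udf = udf(sw_fix, StringType())
k = df.withColumn("software_new", sw_fix_udf(df.script))   # df.script as in the question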

Why can't I call the show method after cache in Spark SQL?

I created a dataframe called df in pyspark with HiveContext (not SQLContext).
But I find that after calling df.cache() I am not able to call df.show(). For example:
>>> df.show(2)
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
| bits| dst_ip|dst_port|flow_direction|in_iface|ip_dscp|out_iface| pkts|protocol| src_ip|src_port| tag|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
|16062594|42.120.84.166| 11291| 1| 3| 36| 2|17606406| pnni|42.120.84.115| 14166|10008|
|13914480|42.120.82.254| 13667| 0| 4| 32| 1|13953516| ax.25| 42.120.86.49| 19810|10002|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
only showing top 2 rows
>>>
>>> df.cache()
DataFrame[bits: bigint, dst_ip: string, dst_port: bigint, flow_direction: string, in_iface: bigint, ip_dscp: string, out_iface: bigint, pkts: bigint, protocol: string, src_ip: string, src_port: bigint, tag: string]
>>> df.show(2)
16/05/16 15:59:32 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<stdin>", line 1, in <lambda>
IndexError: list index out of range
But after calling df.unpersist(), df.show() works again.
I do not understand. I thought df.cache() just caches the RDD for later use. Why does df.show() not work after calling cache()?
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
Caching Data In Memory
Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
Configuration of in-memory caching can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.
https://forums.databricks.com/questions/6834/cache-table-advanced-before-executing-the-spark-sq.html#answer-6900
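For reference, a minimal sketch of the caching calls described in the documentation above (Spark 1.6-era API; sqlContext and df are assumed to exist as in the question, and "flows" is just a placeholder table name):
df.registerTempTable("flows")        # expose the DataFrame as a temporary table
sqlContext.cacheTable("flows")       # cache it in the in-memory columnar format
sqlContext.sql("select src_ip, dst_ip from flows").show(2)
sqlContext.uncacheTable("flows")     # remove the table from memory

# or directly on the DataFrame
df.cache()
df.count()                           # materializes the cache
df.unpersist()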