Disable some functions in PySpark SQL

Is there a way to prevent users from using certain functions in PySpark SQL?
For example, let's say I want to prevent users from using the functions log or rand. How could I disable them?

You can register a UDF with the same name as the function you want to disable and throw an exception inside the UDF:
def log(s):
    raise Exception("log does not work anymore")

spark.udf.register("log", log)
spark.sql("select *, log(value) from table").show()
Result:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
[...]
Exception: log does not work anymore
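
If you need to block more than one function, the same trick generalizes. A minimal sketch along those lines (the blocker factory and the loop are illustrative, not part of the original answer):

# Register a failing UDF for each SQL function name you want to disable.
# A small factory is used so each blocker reports its own function name.
def make_blocker(name):
    def blocker(*args):
        raise Exception("{} is disabled".format(name))
    return blocker

for fn in ["log", "rand"]:          # function names to disable, per the question
    spark.udf.register(fn, make_blocker(fn))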

Related

Scala Spark - Cannot resolve a column name

This should be pretty straightforward, but I'm having an issue with the following code:
val test = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .csv("sample.csv")

test.select("Type").show()
test.select("Provider Id").show()
test is a dataframe like so:
Type | Provider Id
-----+------------
A    | asd
A    | bsd
A    | csd
B    | rrr
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve '`Provider Id`' given input columns: [Type, Provider Id];;
'Project ['Provider Id]
It selects and shows the Type column just fine, but I couldn't get it to work for Provider Id. I wondered if it was because the column name has a space, so I tried using backticks and removing or replacing the space, but nothing seemed to work. Also, it runs fine with the Spark 3.x libraries but not with Spark 2.1.x (and I need to use 2.1.x).
Additional: I tried changing the CSV column order from Type, Provider Id to Provider Id, Type. The error was the opposite: Provider Id shows, but now Type throws the exception.
Any suggestions?
You can use printSchema() to see exactly how Spark read your columns in, then use those exact names in your code:

test.printSchema()
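
Since most of this page is PySpark, here is the same idea sketched in PySpark for comparison (stripping stray whitespace from the header names is just one possible culprit, so treat this as a check, not a guaranteed fix):

test = (spark.read
        .option("header", "true")
        .option("delimiter", ",")
        .csv("sample.csv"))

test.printSchema()  # shows the exact column names Spark read in

# Normalize the header names (e.g. trailing spaces) before selecting.
test = test.toDF(*[c.strip() for c in test.columns])
test.select("Provider Id").show()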

MongoDB find operation throws OperationFailure: Cannot update value

I have an application that uses MongoDB (on AWS DocumentDB) to store documents with a large string in one of its fields, which we call field X.
A few notes to start:
I'm using pymongo, so the method names you see here are taken from there.
Due to the nature of field X, it is not indexed.
On field X we use MongoDB's find method with a regex condition, limited by both maxTimeMS and limit to a small number of results.
When we get the results, we iterate the cursor to fetch all of them into a list (a list comprehension).
Most of the time the query works properly, but I'm starting to get more and more of the following error:
pymongo.errors.OperationFailure: Cannot update value (error code 14)
This is thrown after the query returns a cursor and we are iterating the results. It occurs when the cursor tries to _refresh its connection by calling the next method, and it is raised by the last line of _check_command_response, meaning it falls through to the default exception(?).
The query:
cursor = collection.find(condition).max_time_ms(MAX_QUERY_TIME_MS).sort(sort_order) \
    .limit(RESULT_LIMIT)
results = [document for document in cursor]  # <--- here we get the error
Stack trace:
pymongo/helpers.py in _check_command_response at line 155
pymongo/cursor.py in __send_message at line 982
pymongo/cursor.py in _refresh at line 1104
pymongo/cursor.py in next at line 1189
common/my_code.py in <listcomp> at line xxx
I'm trying to understand the origin of the exception so I can handle it correctly or use a different approach for handling the cursor.
What is being updated in the cursor's _refresh method that might throw the above exception?
Thanks in advance.
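
For reference, one defensive pattern is to materialize the cursor inside a try/except so the failure cannot escape the list comprehension. This is only a sketch reusing the names from the question; whether a retry is appropriate depends on why the server fails:

from pymongo.errors import OperationFailure

def run_query():
    cursor = (collection.find(condition)
              .max_time_ms(MAX_QUERY_TIME_MS)
              .sort(sort_order)
              .limit(RESULT_LIMIT))
    return list(cursor)          # getMore calls happen here, inside the try

try:
    results = run_query()
except OperationFailure:
    results = run_query()        # simple one-shot retry; adjust to your needs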

Pyspark, EOFError - Memory issue or broken data?

I have a dataframe with about 2 million rows of URLs and 2 columns: id and url. I need to parse the domain from the url. I used a lambda with urlparse or a simple split, but I keep getting an EOFError either way. If I create a random "sample" of 400,000 rows, it works.
What is also interesting is that pyspark shows me the top 20 rows with the new domain column, but I cannot do anything else with it without getting the error again.
Is it a memory issue or is something wrong with the data? Can somebody please advise me or give me a hint?
I have searched several questions regarding this; none of them helped me.
The code:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

parse_domain = udf(lambda x: x.split("//")[-1].split("/")[0].split('?')[0],
                   returnType=StringType())
df = df.withColumn("domain", parse_domain(col("url")))
df.show()
Example urls:
"https://www.dataquest.io/blog/loading-data-into-postgres/"
"https://github.com/geekmoss/WrappyDatabase"
"https://www.google.cz/search?q=pyspark&rlz=1C1GCEA_enCZ786CZ786&oq=pyspark&aqs=chrome..69i64j69i60l3j35i39l2.3847j0j7&sourceid=chrome&ie=UTF-8"
"https://search.seznam.cz/?q=google"
And the error I keep getting:
Traceback (most recent call last):
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 278, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 692, in read_int
raise EOFError
EOFError
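
One thing worth trying, sketched below, is to drop the Python UDF entirely and use Spark's built-in parse_url so the parsing stays in the JVM; whether that helps depends on what is actually killing the Python workers:

from pyspark.sql import functions as F

# parse_url(url, 'HOST') extracts the domain without shipping rows to Python workers.
df = df.withColumn("domain", F.expr("parse_url(url, 'HOST')"))
df.show(truncate=False)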

Table not found error after submitting a Spark script containing Spark SQL, even after enabling Hive support

I want to run a simple Spark script which has some Spark SQL queries, basically HiveQL. The corresponding tables are saved in the spark-warehouse folder.
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", "file:///C:/tmp") \
    .appName("TestApp") \
    .enableHiveSupport() \
    .getOrCreate()

sqlstring = "SELECT lflow1.LeaseType as LeaseType, lflow1.Status as Status, lflow1.Property as property, lflow1.City as City, lesflow2.DealType as DealType, lesflow2.Area as Area, lflow1.Did as DID, lesflow2.MID as MID from lflow1, lesflow2 WHERE lflow1.Did = lesflow2.MID"

def queryBuilder(sqlval):
    df = spark.sql(sqlval)
    df.show()
    return df

result = queryBuilder(sqlstring)
print(result.collect())
print("Type of", type(result))
After performing the spark-submit operation, I am facing the below error:
py4j.protocol.Py4JJavaError: An error occurred while calling o27.sql.
: org.apache.spark.sql.AnalysisException: Table or view not found: lflow1; line 1 pos 211
I could not figure out why it is happening. I have seen some posts on Stack Overflow suggesting that I have to enable Hive support, which I have already done in my script by calling enableHiveSupport(), but I am still getting this error. I am running PySpark 2.2 on Windows 10. Kindly help me figure it out.
I have created and saved lflow1 and lesflow2 as permanent tables in the pyspark shell from dataframes. Here is my code:
df = spark.read.json("C:/Users/codemen/Desktop/test for sparkreport engine/LeaseFlow1.json")
df2 = spark.read.json("C:/Users/codemen/Desktop/test for sparkreport engine/LeaseFlow2.json")
df.write.saveAsTable("lflow1")
df2.write.saveAsTable("lesflow2")
In the pyspark shell I have run the query
spark.sql("SELECT lflow1.LeaseType as LeaseType, lflow1.Status as Status, lflow1.Property as property, lflow1.City as City, lesflow2.DealType as DealType, lesflow2.Area as Area, lflow1.Did as DID, lesflow2.MID as MID from lflow1, lesflow2 WHERE lflow1.Did = lesflow2.MID").show()
and the pyspark console shows the joined result as expected.
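
As a rough sketch of the kind of check that usually narrows this down (the warehouse path here is hypothetical; the point is that the spark-submit session has to see the same warehouse directory and metastore_db that the shell session wrote to):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("TestApp")
         .config("spark.sql.warehouse.dir", "C:/path/to/spark-warehouse")  # hypothetical path
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show tables").show()   # do lflow1 and lesflow2 appear here?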

PyDev: How to avoid "assignment to reserved built-in symbol: id"?

I have PyDev 5.6 (for Eclipse). I'm getting the "assignment to reserved built-in symbol: id" warning for id:
class bla(object):
    def myfn(self, task):
        id = task['id']
I found #1457 Bogus "Assignment to reserved built-in symbol" warnings (https://sourceforge.net/p/pydev/bugs/1457/), but it's for PyDev 2.3 and the issue should have been fixed in 2.6.0.
I don't want to disable all "Redefinition of builtin symbols" warnings in Code Analysis (which is possible in Preferences). Someone suggested using Id or _id instead of id, but for me id is a variable and I want to keep it in lower case.
Is it possible to set Eclipse/PyDev to ignore this symbol?
Currently it's in a class and used locally inside the function myfn (or others).
But I would like to ignore it at the "main" level too. If you work with databases, 'id' is everywhere :)
I'm new to Eclipse and PyDev. Maybe I overlooked some setting.
Thanks.
What #S.Ahmad said is correct: you can ignore it with comments in the code (and use PyDev itself to help you there).
Another option would be disabling that check altogether (for all variables) in PyDev > Editor > Code Analysis > Others > Redefinition of builtin symbols (there's no option to disable it just for id).
Personally, I try to steer clear of redefining builtins such as you're doing (even id) and give the variable a more meaningful name when assigning to a local (i.e., in your example I'd call it task_id, not just id). It's usually straightforward to pick one, as the id is usually the id of something, and IMHO it also makes the code clearer to follow later on.
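
Following that suggestion, the snippet from the question would become something like:

class bla(object):
    def myfn(self, task):
        task_id = task['id']  # no builtin is shadowed, and the name says whose id it is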
Select/Highlight the variable and then press Ctrl+1.
Choose #ReservedAssignment from the dropdown menu. This will suppress the warning.
Or you can simply paste # #ReservedAssignment after each variable to suppress the message.
I don't think we can suppress such messages for a variable globally but I could be wrong.
Now I found out that id(var) is a built-in function returning an object's identity (in CPython, its memory address). Hmmm... maybe that's the reason for this warning.
I did a little research:
>>> a= 2
>>> a
2
>>> type(a)
<class 'int'>
>>> id(a)
1650106848
>>> a.__str__
<method-wrapper '__str__' of int object at 0x00000000625AA1E0>
>>> "{:02x}".format( id(a) ) ### hex( id(a) )
'625aa1e0' ### so it's address
>>> type( id )
<class 'builtin_function_or_method'>
>>> id = 1
>>> type( id )
<class 'int'>
>>> "{:02x}".format( id(a) ) ### id() doesn't work now
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'int' object is not callable
>>>
How dangerous can it be to do id = something and destroy the id() functionality?
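
For what it's worth, a quick check in plain CPython (nothing PyDev-specific) suggests the damage is limited to the scope where the name is rebound; the builtin keeps working everywhere else:

def myfn(task):
    id = task['id']       # the assignment shadows the builtin only inside this function
    return id

print(myfn({'id': 42}))   # 42
print(id("still works"))  # the builtin id() is untouched at module level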