pyspark java heap space error when just retrieving data frame columns

pyspark java heap space error when just retrieving data frame columns - pyspark

Is spark's lazy evaluation really executing anything for the following simple example of pointing to a partition of a hive table and getting columns but nothing really heavy:
>>> spark.sql('select * from default.test_table where day="2021-01-01"').columns
[Stage 0:===============================> (1547 + 164) / 2477]#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 28049"...
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o61.sql
I don't see why just pointing to a hive table takes much memory from PySpark (Version 2.4.3). Adding memory to driver and executor (driver-memory, executor-memory) only makes the query stuck forever without outputting any useful message. Is there a way to suppress PySpark from executing when just defining a data frame?

You can put a limit on the query to avoid memory errors:
spark.sql('select * from default.test_table where day="2021-01-01" limit 1').columns

Related

ImportError: cannot import name 'ranked_blast_output_schema' from 'param'

I am just getting started on Python, though I know a bit of R. I want to replicate something someone has already done. I am receiving this error on one of my kernels on Jupyter and I don't immediately know what to do about it. Does anyone have any input or experience with it?
Traceback (most recent call last):
File "parse.py", line 8, in <module>
from param import ranked_blast_output_schema, blast_outfmt6_schema
ImportError: cannot import name 'ranked_blast_output_schema' from 'param' (/Users/myaccount/miniconda3/lib/python3.8/site-packages/param/__init__.py)
Traceback (most recent call last):
File "lca_analysis.py", line 52, in <module>
if ("~" in blast_results["query"].iloc[0]):
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 894, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 1500, in _getitem_axis
self._validate_integer(key, axis)
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 1443, in _validate_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

Ah okay. I am replicating a Jupyter notebook and I was told to run this link on bash.
lca_analysis.py --blast_type nt --fpath s3://org_name/contigs/CMS001_002_Ra_S1/blast_nt.m9 --filtered_blast_path s3://bucket_name/contig_quality/CMS001_002_Ra_S1/blast_nt_filtered.m9 --excluded_contigs_path s3://bucket_name/contig_quality/CMS001_002_Ra_S1/exclude_contigs_nt.txt --outpath s3://bucket_name/contig_quality/CMS001_002_Ra_S1/lca_nt.m9 --read_count_path s3://bucket_name/contigs/CMS001_002_Ra_S1/contig_stats.json --verbose True
But then, I had this error
Read counts have been loaded: s3://bucket_name/contigs/CMS001_002_Ra_S1/contig_stats.json| elapsed time: 0.73 seconds
/var/folders/ns/gdtc2hvx1g13_29wct4qkhd80000gq/T/tmp3m3k6w9o blast file downloaded to this tempfile
Traceback (most recent call last):
File "parse.py", line 8, in <module>
from param import ranked_blast_output_schema, blast_outfmt6_schema
ImportError: cannot import name 'ranked_blast_output_schema' from 'param' (/Users/myaccount/miniconda3/lib/python3.8/site-packages/param/__init__.py)
Traceback (most recent call last):
File "lca_analysis.py", line 52, in <module>
if ("~" in blast_results["query"].iloc[0]):
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 894, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 1500, in _getitem_axis
self._validate_integer(key, axis)
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 1443, in _validate_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
I had just used pip install param to get param, and the install went just fine.

Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'

I am unable to run Hive queries from Pyspark.
I tried copying hive-site.xml into spark's conf but inspite of doing that it is throwing the same error
Full rror
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/spark-2.4.0/python/pyspark/sql/context.py", line 358, in sql
return self.sparkSession.sql(sqlQuery)
File "/usr/local/spark-2.4.0/python/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/local/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark-2.4.0/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':"

In my test with oozie, I had to add the Hive-related jars that Spark needs. Try adding the same in spark's conf maybe

regarding a py4j exception

I have installed Java 11 and Python 3 on CentOS. Trying to run a code that worked perfectly fine on a Windows environment. Getting this exception:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/lib/python3.4/site-packages/py4j/java_gateway.py", line 1188, in
send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.4/site-packages/py4j/java_gateway.py", line 1014, in send_command
response = connection.send_command(command)
File "/usr/lib/python3.4/site-packages/py4j/java_gateway.py", line 1193, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "WordInformation.py", line 493, in
status = read_from_source("../Corpora/Bhandarkar Oriental Research Books")
File "WordInformation.py", line 473, in read_from_source
author, year)
File "WordInformation.py", line 381, in fetch_from_hwn
return read_store_properties(word, file, sentence, source, category, author,
year);
File "WordInformation.py", line 79, in read_store_properties
properties["synsets"] = get_other_props(word)
File "WordInformation.py", line 226, in get_other_props
output = gateway.jvm.Properties.getProperties(word)
File "/usr/lib/python3.4/site-packages/py4j/java_gateway.py", line 1286, in
call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/python3.4/site-packages/py4j/protocol.py", line 336, in
get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling
z:in.ac.iitb.cfilt.jhwnl.examples.Properties.getProperties
Initialised the gateway as follows:
gateway = JavaGateway.launch_gateway(classpath="/home/gayatri/Code/hindiwn.jar")
Is this because of some dependency? I have set the JAVA_HOME and updated the PATH variable.

I don't have the reputation to comment, but
Answer from Java side is empty
But this error indicates that the Java code is not reachable.
Just to verify basic steps
1) Be sure that the java program is running
2) Be sure that you run the Python script after the java code is running
3) The java program is running the entire time.
If you are doing those two things, then an issue may be that the operating system may be already using a port.
You can try
Java
'GatewayServerBuilder server = new GatewayServerBuilder().javaPort(1001).build()'
'server.start()'
Python
java = JavaGateway(gateway_parameters=GatewayParameters(port=1001))

Mongo connector with neo4j doc manager crashing

I'm using mongo-connector and neo4j_doc_manager for syncing the mongodb's data to neo4j, it used to work perfectly but today it started giving following error.
2016-07-29 17:18:59,558 [CRITICAL] mongo_connector.oplog_manager:549 - Exception during collection dump
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/mongo_connector/oplog_manager.py", line 501, in do_dump
upsert_all(dm)
File "/usr/local/lib/python2.7/site-packages/mongo_connector/oplog_manager.py", line 485, in upsert_all
dm.bulk_upsert(docs_to_dump(namespace), mapped_ns, long_ts)
File "/usr/local/lib/python2.7/site-packages/mongo_connector/util.py", line 38, in wrapped
reraise(new_type, exc_value, exc_tb)
File "/usr/local/lib/python2.7/site-packages/mongo_connector/util.py", line 32, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/mongo_connector/doc_managers/neo4j_doc_manager.py", line 89, in bulk_upsert
tx.commit()
File "/usr/local/lib/python2.7/site-packages/py2neo/cypher/core.py", line 306, in commit
return self.post(self.__commit or self.__begin_commit)
File "/usr/local/lib/python2.7/site-packages/py2neo/cypher/core.py", line 261, in post
raise self.error_class.hydrate(error)
File "/usr/local/lib/python2.7/site-packages/py2neo/cypher/error/core.py", line 54, in hydrate
error_cls = getattr(error_module, title)
Neo4jOperationFailed: 'module' object has no attribute 'ConstraintValidationFailed'
2016-07-29 17:18:59,563 [ERROR] mongo_connector.oplog_manager:557 - OplogThread: Failed during dump collection cannot recover!

You're trying to insert data which doesn't match the constraints of your Neo4j schema (unicity or existence), and apparently the code doesn't know how to handle that type of error, though it does give its name:
ConstraintValidationFailed
You should maybe activate some log to see the data which it is trying to insert, or the Cypher query it's trying to execute.

Error while retrieving result in pymongo

I have an python application which creates number of threads for a job. each thread connects to mongodb and retrieve data. Number of allowed connection to mongodb is 200 which I'm taking care using semaphore. And once mongo querying job is done each thread closes mongodb connection. But while executing this application I'm getting same error for all threads. Error is:
Traceback (most recent call last):
File "C:\Python34\lib\threading.py", line 921, in _bootstrap_inner
self.run()
File "C:\Python34\lib\threading.py", line 869, in run
self._target(*self._args, **self._kwargs)
File "C:/path/pytest/under_construction/testAlgo.py", line 95, in sample_thread
status=monObj.process_status(list_value1,list_value2,5,120,120)
File "C:\path\pytest\under_construction\mongo_lib.py", line 153, in process_status
result=self.mongo_result('Submission','find',q={})
File "C:\path\pytest\under_construction\mongo_lib.py", line 53, in mongo_result
result=list(_query[query_type.lower()](query_string[keys]))
File "C:\Python34\lib\site-packages\pymongo\cursor.py", line 1076, in __next__
if len(self.__data) or self._refresh():
File "C:\Python34\lib\site-packages\pymongo\cursor.py", line 1037, in _refresh
limit, self.__id))
File "C:\Python34\lib\site-packages\pymongo\cursor.py", line 933, in __send_message
res = client._send_message_with_response(message, **kwargs)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1205, in _send_message_with_response
response = self.__send_and_receive(message, sock_info)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1182, in __send_and_receive
return self.__receive_message_on_socket(1, request_id, sock_info)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1174, in __receive_message_on_socket
return self.__receive_data_on_socket(length - 16, sock_info)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1153, in __receive_data_on_socket
chunk = sock_info.sock.recv(length)
MemoryError
Code for creating mongo connection
client=MongoClient(mc_name,port)
I was thinking, is this error due to results of all threads accumulating at one port of machine running my application?

MongoClient is a thread-safe connection pool, so you should be creating a single instance that's shared by all the worker threads rather than having each thread create its own.
The connection pool size defaults to 100, but if you want to make it even larger you can use the maxPoolSize parameter to do that (e.g. maxPoolSize=200).

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

pyspark java heap space error when just retrieving data frame columns - pyspark

You can put a limit on the query to avoid memory errors: spark.sql('select * from default.test_table where day="2021-01-01" limit 1').columns

Related

ImportError: cannot import name 'ranked_blast_output_schema' from 'param'

Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'

regarding a py4j exception

Mongo connector with neo4j doc manager crashing

Error while retrieving result in pymongo

Categories

Resources