How to fix an error loading data from Azure blob? - pyspark

I am trying to read data from an Azure blob:
df = spark.read.load('abfs[s]://folder/a_p_c_w.csv')
df.show(5)
But I'm getting the error below. Can someone help me here?
IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 4: abfs[s]://folder/a_p_c_w.csv
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 204, in load
return self._df(self._jreader.load(path))
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-
packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 4: abfs[s]://folder/a_p_c_w.csv

abfs[s] isn't a valid URI scheme, hence the error message "Illegal character in scheme name at index 4". The square brackets in the documentation just mean the s is optional; the scheme you want is abfs (or abfss for a TLS-secured connection).
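For example, a minimal sketch of a valid path (the account name mystorageacct, container name mycontainer, and CSV options are placeholders; substitute your own):
# Full ABFS URI format: abfss://<container>@<account>.dfs.core.windows.net/<path>
df = spark.read.load(
    'abfss://mycontainer@mystorageacct.dfs.core.windows.net/folder/a_p_c_w.csv',
    format='csv',    # assumption: the file is a plain CSV
    header=True      # assumption: the file has a header row
)
df.show(5)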

Related

pyspark java heap space error when just retrieving data frame columns

Is Spark's lazy evaluation really executing anything for the following simple example, which just points to a partition of a Hive table and gets its columns, nothing really heavy:
>>> spark.sql('select * from default.test_table where day="2021-01-01"').columns
[Stage 0:===============================> (1547 + 164) / 2477]#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 28049"...
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o61.sql
I don't see why just pointing to a Hive table takes so much memory in PySpark (version 2.4.3). Adding memory to the driver and executors (driver-memory, executor-memory) only makes the query hang forever without outputting any useful message. Is there a way to stop PySpark from executing anything when I am just defining a data frame?
You can put a limit on the query to avoid memory errors:
spark.sql('select * from default.test_table where day="2021-01-01" limit 1').columns
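If all you need is the column names, a lighter alternative (not from the original answer; a sketch assuming the table is registered in the Hive metastore) is to read the schema directly instead of issuing a query:
# Fetches the schema from the metastore; no partitions are scanned.
cols = spark.table('default.test_table').columns
print(cols)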

regarding a py4j exception

I have installed Java 11 and Python 3 on CentOS. I'm trying to run code that worked perfectly fine in a Windows environment, but I'm getting this exception:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/lib/python3.4/site-packages/py4j/java_gateway.py", line 1188, in
send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.4/site-packages/py4j/java_gateway.py", line 1014, in send_command
response = connection.send_command(command)
File "/usr/lib/python3.4/site-packages/py4j/java_gateway.py", line 1193, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "WordInformation.py", line 493, in
status = read_from_source("../Corpora/Bhandarkar Oriental Research Books")
File "WordInformation.py", line 473, in read_from_source
author, year)
File "WordInformation.py", line 381, in fetch_from_hwn
return read_store_properties(word, file, sentence, source, category, author,
year);
File "WordInformation.py", line 79, in read_store_properties
properties["synsets"] = get_other_props(word)
File "WordInformation.py", line 226, in get_other_props
output = gateway.jvm.Properties.getProperties(word)
File "/usr/lib/python3.4/site-packages/py4j/java_gateway.py", line 1286, in
call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/python3.4/site-packages/py4j/protocol.py", line 336, in
get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling
z:in.ac.iitb.cfilt.jhwnl.examples.Properties.getProperties
Initialised the gateway as follows:
gateway = JavaGateway.launch_gateway(classpath="/home/gayatri/Code/hindiwn.jar")
Is this because of some dependency? I have set the JAVA_HOME and updated the PATH variable.
I don't have the reputation to comment, but the error
Answer from Java side is empty
indicates that the Java side is not reachable.
Just to verify the basic steps:
1) Be sure that the Java program is running
2) Be sure that you run the Python script only after the Java code is running
3) Be sure the Java program keeps running the entire time
If you are doing those three things, then the issue may be that the operating system is already using the port.
You can try binding an explicit port on both sides.
Java:
GatewayServer server = new GatewayServerBuilder().javaPort(1001).build();
server.start();
Python:
from py4j.java_gateway import JavaGateway, GatewayParameters

java = JavaGateway(gateway_parameters=GatewayParameters(port=1001))
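With both sides pinned to the same port (1001 here is just an example value), the Python gateway connects to the running Java server directly instead of assuming py4j's default port, 25333.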

pyspark on windows (Upgrade from 1.6 to 2.0.2): sqlContext.read.format fails

The following line works very well in 1.6 but fails in 2.0.2. Any idea what the issue could be?
file_name = "D:/ProgramFiles/spark-2.0.2-bin-hadoop2.3/data/mllib/sample_linear_regression_data.txt"
df_train = sqlContext.read.format("libsvm").load(file_name)
The error is
File "<ipython-input-4-e5510d6d3d6a>", line 1, in <module>
df_train = sqlContext.read.format("libsvm").load("../data/mllib/sample_linear_regression_data.txt")
File "D:\ProgramFiles\spark-2.0.2-bin-hadoop2.3\python\lib\pyspark.zip\pyspark\sql\readwriter.py", line 147, in load
return self._df(self._jreader.load(path))
File "D:\ProgramFiles\spark-2.0.2-bin-hadoop2.3\python\lib\py4j-0.10.3-src.zip\py4j\java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "D:\ProgramFiles\spark-2.0.2-bin-hadoop2.3\python\lib\pyspark.zip\pyspark\sql\utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
IllegalArgumentException: 'Can not create a Path from an empty string'
It may be due to this bug, which has since been corrected: https://github.com/apache/spark/pull/11775
The bug made Spark raise this 'empty string' error instead of reporting an invalid path.
You're using a relative path, which will be resolved against a default directory that might have changed in your Spark 2 installation. Try setting the environment variable HADOOP_CONF_DIR or specifying an absolute path instead of a relative one. If it's a local path, use file:///.
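For example, a sketch applying that advice to the path from the question (everything else unchanged):
# file:/// forces Spark to read from the local filesystem
file_name = "file:///D:/ProgramFiles/spark-2.0.2-bin-hadoop2.3/data/mllib/sample_linear_regression_data.txt"
df_train = sqlContext.read.format("libsvm").load(file_name)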

Mongo connector with neo4j doc manager crashing

I'm using mongo-connector with neo4j_doc_manager to sync MongoDB data to Neo4j. It used to work perfectly, but today it started giving the following error.
2016-07-29 17:18:59,558 [CRITICAL] mongo_connector.oplog_manager:549 - Exception during collection dump
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/mongo_connector/oplog_manager.py", line 501, in do_dump
upsert_all(dm)
File "/usr/local/lib/python2.7/site-packages/mongo_connector/oplog_manager.py", line 485, in upsert_all
dm.bulk_upsert(docs_to_dump(namespace), mapped_ns, long_ts)
File "/usr/local/lib/python2.7/site-packages/mongo_connector/util.py", line 38, in wrapped
reraise(new_type, exc_value, exc_tb)
File "/usr/local/lib/python2.7/site-packages/mongo_connector/util.py", line 32, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/mongo_connector/doc_managers/neo4j_doc_manager.py", line 89, in bulk_upsert
tx.commit()
File "/usr/local/lib/python2.7/site-packages/py2neo/cypher/core.py", line 306, in commit
return self.post(self.__commit or self.__begin_commit)
File "/usr/local/lib/python2.7/site-packages/py2neo/cypher/core.py", line 261, in post
raise self.error_class.hydrate(error)
File "/usr/local/lib/python2.7/site-packages/py2neo/cypher/error/core.py", line 54, in hydrate
error_cls = getattr(error_module, title)
Neo4jOperationFailed: 'module' object has no attribute 'ConstraintValidationFailed'
2016-07-29 17:18:59,563 [ERROR] mongo_connector.oplog_manager:557 - OplogThread: Failed during dump collection cannot recover!
You're trying to insert data that doesn't match the constraints of your Neo4j schema (uniqueness or existence), and apparently the code doesn't know how to handle that type of error, though it does give its name:
ConstraintValidationFailed
You should enable some logging to see the data it is trying to insert, or the Cypher query it is trying to execute.
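A minimal sketch of one way to turn on that logging, assuming you drive the connector from Python rather than the command line (mongo_connector uses the standard logging module, as the [CRITICAL] line in the output shows; the exact flags differ for the CLI entry point):
import logging
# Emit DEBUG-level records from mongo-connector and the doc manager so the
# documents and Cypher statements being sent to Neo4j become visible.
logging.basicConfig(level=logging.DEBUG)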

How to fix the following certificate error (ipython)?

I'm getting the following error when I try to run ipython notebook on my MacBook. Does anyone know how to fix this?
ERROR:root:Exception in I/O handler for fd 6
Traceback (most recent call last):
File "//anaconda/lib/python2.7/site-packages/zmq/eventloop/ioloop.py", line 346, in start
self._handlers[fd](fd, events)
File "//anaconda/lib/python2.7/site-packages/tornado/netutil.py", line 167, in accept_handler
callback(connection, address)
File "//anaconda/lib/python2.7/site-packages/tornado/tcpserver.py", line 217, in _handle_connection
do_handshake_on_connect=False)
File "//anaconda/lib/python2.7/site-packages/tornado/netutil.py", line 407, in ssl_wrap_socket
return ssl.wrap_socket(socket, **dict(context, **kwargs))
File "//anaconda/python.app/Contents/lib/python2.7/ssl.py", line 387, in wrap_socket
ciphers=ciphers)
File "//anaconda/python.app/Contents/lib/python2.7/ssl.py", line 141, in __init__
ciphers)
SSLError: [Errno 336445449] _ssl.c:368: error:140DC009:SSL routines:SSL_CTX_use_certificate_chain_file:PEM lib
It appears that your browser is attempting to access the notebook without SSL. Make sure to access the site with HTTPS. For example, when you access the notebook, type in https://127.0.0.1:9999 in your browser. (Or whatever the address of the server is.)
It doesn't recognise the files you're passing it. Either:
pass a .pem file (your private key) to --NotebookApp.keyfile= and the .crt file (your certificate) to --NotebookApp.certfile=, or
create a new file by appending your certificate to your key and pass this new file to --certfile.
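A sketch of the second option, with placeholder filenames mykey.key and mycert.crt:
# Append the certificate to the key, then point --certfile at the combined file
cat mykey.key mycert.crt > combined.pem
ipython notebook --certfile=combined.pem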