I have a dataframe with about 2 million rows of URLs and two columns: id and url. I need to parse the domain from the url. I used a lambda with urlparse or a simple split, but I keep getting EOFError with both approaches. If I create a random sample of 400,000 rows, it works.
What is also interesting is that pyspark shows me the top 20 rows with the new column domain, but I cannot do anything further with it or I get the error again.
Is it a memory issue, or is something wrong with the data? Can somebody please advise me or give me a hint?
I searched several questions regarding this, but none of them helped me.
The code:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Strip the scheme, path and query string, keeping only the domain part.
parse_domain = udf(lambda x: x.split("//")[-1].split("/")[0].split('?')[0],
                   returnType=StringType())
df = df.withColumn("domain", parse_domain(col("url")))
df.show()
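For reference, a urlparse-based variant (the other approach mentioned above) would be roughly the following; this is only a sketch, since the original urlparse code is not shown:

from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Same idea, but letting urlparse extract the network location (domain) part.
parse_domain = udf(lambda x: urlparse(x).netloc if x else None,
                   returnType=StringType())
df = df.withColumn("domain", parse_domain(col("url")))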
Example URLs:
"https://www.dataquest.io/blog/loading-data-into-postgres/"
"https://github.com/geekmoss/WrappyDatabase"
"https://www.google.cz/search?q=pyspark&rlz=1C1GCEA_enCZ786CZ786&oq=pyspark&aqs=chrome..69i64j69i60l3j35i39l2.3847j0j7&sourceid=chrome&ie=UTF-8"
"https://search.seznam.cz/?q=google"
And the error I keep getting:
Traceback (most recent call last):
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 278, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 692, in read_int
raise EOFError
EOFError
Related
Is there a way to prevent users from using certain functions in PySpark SQL?
E.g., let's say I want to prevent users from using the functions log or rand; how could I disable them?
You can register a udf with the same name as the function you want to disable and throw an exception in the udf:
def log(s):
    raise Exception("log does not work anymore")

spark.udf.register("log", log)
spark.sql("select *, log(value) from table").show()
Result:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
[...]
Exception: log does not work anymore
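The same trick extends to rand, the other function mentioned in the question; a minimal sketch, assuming you want to block both names:

def blocked(*args):
    raise Exception("this function is disabled")

# Register the blocking udf under each SQL function name to disable
# ("log" and "rand" here, as in the question).
for name in ["log", "rand"]:
    spark.udf.register(name, blocked)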
I have a dataframe (df) with 1 million rows and two columns: ID (long int) and description (string). After transforming the descriptions into tf-idf features (using Tokenizer, HashingTF, and IDF), the dataframe df has two columns: ID and features (a sparse vector).
I computed the item-item similarity matrix using a udf and the dot function.
Computing the similarities completes successfully.
However, when I call the show() function, I get
"raise EOFError"
I read so many questions on this issue but have not found the right answer yet.
Note that if I apply my solution to a small dataset (like 100 rows), everything works successfully.
Is it related to an out-of-memory issue?
I checked my dataset and the description column; I don't see any records with null or unsupported text.
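The dot_udf in the snippet below is a dot-product udf over the feature vectors; its definition is not shown here, so the following is only a hypothetical sketch of what it could look like, assuming Spark ML vectors in the features column:

import pyspark.sql.functions as psf
from pyspark.sql.types import DoubleType

# Hypothetical dot_udf: dot product of two sparse feature vectors.
dot_udf = psf.udf(lambda a, b: float(a.dot(b)), DoubleType())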
import pyspark.sql.functions as psf

# Join every ID with every larger ID, then score each pair with the dot-product udf.
dist_mat = data.alias("i").join(data.alias("j"), psf.col("i.ID") < psf.col("j.ID")) \
    .select(psf.col("i.ID").alias("i"), psf.col("j.ID").alias("j"),
            dot_udf("i.features", "j.features").alias("score"))
dist_mat = dist_mat.filter(psf.col('score') > 0.05)
dist_mat.show(1)
If I remove the last line, dist_mat.show(1), it works without error. However, when I include this line, I get an error like
.......
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
...
Here is part of the error message:
[Stage 6:=======================================================> (38 + 1) / 39]
Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 397, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
raise EOFError
EOFError
I increased the cluster size and ran it again, and it worked without errors. So the error message was accurate:
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
However, for computing the pairwise similarities of such a large-scale matrix, I found an alternative solution: Large scale matrix multiplication with pyspark
In fact, it is very efficient and much faster, even better than using BlockMatrix.
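For comparison, the BlockMatrix route mentioned above would look roughly like this; a sketch only, assuming data holds (ID, features) rows as in the question:

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Sketch of the BlockMatrix baseline: build a distributed row matrix from the
# feature vectors, then multiply it by its transpose to get all pairwise dot products.
mat = IndexedRowMatrix(
    data.rdd.map(lambda row: IndexedRow(row.ID, row.features.toArray()))
).toBlockMatrix()
similarities = mat.multiply(mat.transpose())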
I'm currently implementing a program classifier for my coursework.
My lecturer asked me to use an "Evolving ANN" algorithm.
So I found a package called NEAT (NeuroEvolution of Augmenting Topologies).
I have 10 inputs and 7 outputs, so I just modified the example source from its documentation.
def eval_fitness(genomes):
    for g in genomes:
        net = nn.create_feed_forward_phenotype(g)
        mse = 0
        for inputs, expected in zip(alldata, label):
            output = net.serial_activate(inputs)
            output = np.clip(output, -1, 1)
            mse += (output - expected) ** 2
        g.fitness = 1 - (mse / 44000)  # 44000 is the number of samples
        print(g.fitness)
I changed the config file too, so the program has 10 inputs and 7 outputs.
But when I try to run the code, it gives me this error:
Traceback (most recent call last):
File "/home/ilhammaziz/PycharmProjects/tuproSC2/eANN.py", line 40, in <module>
pop.run(eval_fitness, 10)
File "/home/ilhammaziz/.local/lib/python3.5/site-packages/neat/population.py", line 190, in run
best = max(population)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
What am I supposed to do?
Thanks
As far as I can tell, the error is not in your code but in the library itself. Just use a different one.
This one looks promising to me.
from bson.json_util import dumps

def json_response(response):
    return {"response": dumps(response, ensure_ascii=False).encode("utf8"),
            "headers": {"Content-type": "text/json"}}
This problem is driving me crazy. It returns an error randomly, and I can't find the solution.
/core/handlers/wsgi.py", line 38, in __call__
output = lookup_view(req)
File "auth/decorator.py", line 8, in wrap
return fn(req,*args,**kwargs)
File "auth/decorator.py", line 21, in wrap
return fn(req,*args,**kwargs)
File "contrib/admin/views.py", line 67, in submit_base_premission
return json_response({"baseperm":baseperm,"Meta":{"gmsg":u"...","type":201}})
File "render/render_response.py", line 85, in json_response
return {"response":dumps(response,ensure_ascii=False).encode("utf8"),
File "/usr/local/lib/python2.7/dist-packages/bson/json_util.py", line 116, in dumps
return json.dumps(_json_convert(obj), *args, **kwargs)
File "/usr/lib/python2.7/json/__init__.py", line 238, in dumps
**kw).encode(obj)
File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
return _iterencode(o, 0)
File "/usr/lib/python2.7/json/encoder.py", line 178, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: ObjectId('51f7dcee95113b7a48e974fe') is not JSON serializable
baseperm is a pymongo Cursor; it returns this error randomly, and that is where I have the problem.
It seems that it sometimes doesn't detect the ObjectId and doesn't convert it to str, so json raises an error on dumps.
Check the version of the pymongo driver; if it is older than version 2.4.2, you may need to update it. Before that version, the __str__ method of ObjectId was handled incorrectly on Python 2.x; check the repo: github, ObjectId.__str__ should return str in 2.x.
To check the pymongo driver version, type in the python shell:
import pymongo
print(pymongo.version)
UPDATE
I assume you have tested both environments with the same dataset, so try upgrading Python 2.7.3 to 2.7.5.
Otherwise, try iterating through the cursor and constructing a list before giving it to json_response(), i.e.:
baseperm = list(baseperm)  # now baseperm is a list of the documents
...
my_response['baseperm'] = baseperm
my_response['Meta'] = ...
...
return json_response(my_response)
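If the random failures persist, a defensive variant (just a sketch, not part of the answer above) is to stringify the _id yourself before serializing, so the result no longer depends on json_util recognizing ObjectId:

# Assumes each document carries an _id ObjectId field.
baseperm = [dict(doc, _id=str(doc["_id"])) for doc in baseperm]
my_response['baseperm'] = baseperm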
I reported this problem on the MongoDB issue tracker:
https://jira.mongodb.org/browse/PYTHON-548
answer:
You said this only happens occasionally? The only thing I can think of that might be related is mod_wsgi spawning sub-interpreters. In PyMongo that tends to cause problems with the C extensions encoding Python dicts to BSON. In your case this seems to be happening after the BSON documents are decoded to Python dicts. It looks like isinstance is failing to match ObjectId in json_util.default(). PYTHON-539 seemed to be a similar problem related to some package misconfiguration in the user's environment.
There could be a fairly large performance hit, but could you try running PyMongo without the C extensions to see if that solves the problem?
You can read about the mod_wsgi issue here:
http://api.mongodb.org/python/current/faq.html#does-pymongo-work-with-mod-wsgi
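To verify whether the C extensions are actually in use, here is a small check (not part of the tracker answer, just a sketch):

import bson
import pymongo

# Both should print False if PyMongo was installed without its C extensions.
print(pymongo.has_c())
print(bson.has_c())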
Background:
I've got a Python script using pymongo that pulls some XML data and parses it into an array of dictionaries called all_orders. I then try to insert it into the collection "orders", and I invariably get this exception. I am reasonably certain that my array of dictionaries is correct, because when the list is small it tends to work (I think). I've also found that 8 out of the ~1300 documents I tried to insert into the collection worked.
Question:
Do you know what causes this AutoReconnect(str(e)) exception? Do you know how to work around or avoid this issue?
Error Trace:
File "mongovol.py", line 152, in get_reports
orders.insert(all_orders)
File "/Users/ashutosh/hrksandbox/lumoback-garden2/venv/lib/python2.7/site-packages/pymongo/collection.py", line 359, in insert
continue_on_error, self.__uuid_subtype), safe)
File "/Users/ashutosh/hrksandbox/lumoback-garden2/venv/lib/python2.7/site-packages/pymongo/mongo_client.py", line 853, in _send_message
raise AutoReconnect(str(e))
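Since small lists tended to work, one workaround worth trying (a sketch only, assuming orders and all_orders as above) is to insert in smaller batches instead of one large list:

# Hypothetical batching workaround: insert all_orders in small chunks so one
# oversized insert message doesn't drop the connection.
BATCH_SIZE = 100
for start in range(0, len(all_orders), BATCH_SIZE):
    orders.insert(all_orders[start:start + BATCH_SIZE])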