Error when writing dataframe with PySpark

I am unable to save a table to any of several different destinations.
I have tried the following:
# Attempt 1: convert to pandas and write a local CSV
dataset.toPandas().to_csv("local_path")

# Attempt 2: register a temp view and create the table from SQL
dataset.createOrReplaceTempView("tempTable")
spark.sql("DROP TABLE IF EXISTS impala_table")
spark.sql("CREATE TABLE IF NOT EXISTS impala_table AS "
          "SELECT * FROM tempTable")

# Attempt 3: save directly as a managed table
dataset.write.mode("overwrite").saveAsTable("impala_table")

# Attempt 4: write out as CSV
dataset.write.csv(file, header=True, mode="overwrite")
So my deduction is that the job never even reaches the actual write, but I can't figure out how to learn more about the cause.
The error logs are, if not identical, very similar. The one I found oddest concerns a module named "src" that is not found. This is the part I found most repetitive and pertinent:
/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326             raise Py4JJavaError(
    327                 "An error occurred while calling {0}{1}{2}.\n".
--> 328                 format(target_id, ".", name), value)
    329         else:
    330             raise Py4JError(

Py4JJavaError: An error occurred while calling o877.saveAsTable. :
org.apache.spark.SparkException: Job aborted. at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
...
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/pyspark/serializers.py", line 566, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'src'
Thanks for checking it out.
Cheers.

I found out the problem behind this dataframe.
It was not in the writer, but in the intermediate table calculations.
As @kfkhalili pointed out, it's a good idea to run sporadic .show()s along the way to verify each step is running smoothly.
Thanks.
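To illustrate the suggestion: Spark evaluates lazily, so an error in an upstream transformation only surfaces at the final action (here, the write). A minimal sketch of the debugging approach, using made-up intermediate steps (the filter and groupBy are placeholders, not from the original question):

# Hypothetical pipeline; each .show() forces evaluation of the steps so far,
# so a failure points at the transformation that caused it, not at the write.
step1 = dataset.filter("amount > 0")           # placeholder transformation
step1.show(5)                                  # fails here if step1 is broken

step2 = step1.groupBy("customer_id").count()   # placeholder aggregation
step2.show(5)                                  # fails here if step2 is broken

step2.write.mode("overwrite").saveAsTable("impala_table")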

Related

PySpark, EOFError - memory issue or broken data?

I have a dataframe of about 2 million rows with URLs and two columns: id and url. I need to parse the domain from the url. I used a lambda with urlparse or a simple split, but I keep getting EOFError either way. If I create a random sample of 400,000 rows, it works.
What is also interesting is that PySpark shows me the top 20 rows with the new domain column, but as soon as I try to do anything else with it, I get the error again.
Is it a memory issue, or is something wrong with the data? Can somebody please advise me or give me a hint?
I searched several questions regarding this; none of them helped me.
The code:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# drop the scheme, then keep everything before the first "/" or "?"
parse_domain = udf(lambda x: x.split("//")[-1].split("/")[0].split("?")[0],
                   returnType=StringType())
df = df.withColumn("domain", parse_domain(col("url")))
df.show()
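For reference, the urlparse variant mentioned above would look roughly like this (a sketch only; the question names urlparse but doesn't show that code, and this assumes Python 3):

from urllib.parse import urlparse  # Python 3; on Python 2 it is `from urlparse import urlparse`
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# urlparse extracts the network location (domain, possibly with port) explicitly
parse_domain = udf(lambda x: urlparse(x).netloc, returnType=StringType())
df = df.withColumn("domain", parse_domain(col("url")))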
Example URLs:
"https://www.dataquest.io/blog/loading-data-into-postgres/"
"https://github.com/geekmoss/WrappyDatabase"
"https://www.google.cz/search?q=pyspark&rlz=1C1GCEA_enCZ786CZ786&oq=pyspark&aqs=chrome..69i64j69i60l3j35i39l2.3847j0j7&sourceid=chrome&ie=UTF-8"
"https://search.seznam.cz/?q=google"
And the error I keep getting:
Traceback (most recent call last):
  File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
  File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
  File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 278, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 692, in read_int
    raise EOFError
EOFError

Fatal Error: Added field has duplicate identifier(): APT_TRinput0Rec99 (ALR_DATIBAS3.FilterFieldError)

I have a job with 181 columns, and I'm getting this error while compiling a transformer that sits before a funnel:
Fatal Error: Added field has duplicate identifier(): APT_TRinput0Rec99 (ALR_DATIBAS3.FilterFieldError)
The transformer has 181 constraints and nothing special. What can I try to solve it?

Error in Perl with Cassandra

I am writing my Cassandra cache class in Perl 5.18.2 with Net::Async::CassandraCQL.
This is just my test example:
# in the first subroutine
$self->_loop( IO::Async::Loop->new );
$self->_cacheIO( Net::Async::CassandraCQL->new(
    host                => $self->server->{ ip },
    service             => $self->server->{ port },
    keyspace            => $self->_keyspace,
    default_consistency => CONSISTENCY_QUORUM,
) );
$self->_loop()->add( $self->_cacheIO );
$self->_cacheIO()->connect->get;

# in the second subroutine
$self->_cacheIO()->query( "INSERT INTO cacheTable (key, value) VALUES ('keeeey1', 'you will pay');" )->get();
And I am getting this error on the insert query:
IO::Async::Future=HASH(0x2e8a4b8) IO::Async::Future=HASH(0x2e8a4b8) lost a sequence Future at /usr/local/share/perl/5.18.2/Net/Async/CassandraCQL/Connection.pm line 231.
I have already read https://rt.cpan.org/Public/Bug/Display.html?id=97260
so it could be a bug. But I think it might be overcome with IO::Async::Notifier's adopt_future method. Do you have any experience with notifiers and futures? Any examples? Any ideas about the error and how to solve it?
Maybe it would be better to ask how to do this synchronously?
PERL_FUTURE_DEBUG=1 perl ./test.pl
(in cleanup) ERROR CODE #5
IO::Async::Future=HASH(0x5607cd8) was constructed at /usr/local/share/perl/5.18.2/Net/Async/CassandraCQL/Connection.pm line 595 and was lost near /usr/local/share/perl/5.18.2/Future.pm line 346 before it was ready.
IO::Async::Future=HASH(0x560f1a0) (constructed at /usr/local/share/perl/5.18.2/IO/Async/Loop.pm line 553) lost a sequence Future at /usr/local/share/perl/5.18.2/Net/Async/CassandraCQL/Connection.pm line 231.
(in cleanup)
IO::Async::Future=HASH(0x56030b8) was constructed at /usr/local/share/perl/5.18.2/Net/Async/CassandraCQL/Connection.pm line 504 and was lost near /usr/local/share/perl/5.18.2/Carp.pm line 168 before it was ready.
It's very strange: when I get this error and then select from the database, I can see that the inserted data is in the table...
Update: it's not an error but a warning.
This was reported as a bug some time ago. It has since been fixed; see
https://rt.cpan.org/Ticket/Display.html?id=97260

Can't insert into MongoDB due to AutoReconnect

Background:
I've got a Python script using pymongo that pulls some XML data and parses it into an array of dictionaries called all_orders. I then try to insert it into the collection "orders", and I invariably get this exception. I am reasonably certain that my array of dictionaries is correct, because when the list is small it tends to work (I think). I've also found that 8 of the ~1300 documents I tried to insert actually made it into the collection.
Question:
Do you know what causes this AutoReconnect(str(e)) exception? Do you know how to work around or avoid this issue?
Error Trace:
File "mongovol.py", line 152, in get_reports
orders.insert(all_orders)
File "/Users/ashutosh/hrksandbox/lumoback-garden2/venv/lib/python2.7/site-packages/pymongo/collection.py", line 359, in insert
continue_on_error, self.__uuid_subtype), safe)
File "/Users/ashutosh/hrksandbox/lumoback-garden2/venv/lib/python2.7/site-packages/pymongo/mongo_client.py", line 853, in _send_message
raise AutoReconnect(str(e))
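No accepted answer is recorded here, but as general background: in pymongo, AutoReconnect means the connection was lost mid-operation and the operation may or may not have succeeded (which would explain the 8 stray documents). One common generic workaround, sketched below assuming the pymongo 2.x insert API that the trace shows, is to insert in small batches and retry each batch:

import time
from pymongo.errors import AutoReconnect

def insert_with_retry(collection, docs, batch_size=100, max_attempts=5):
    """Insert docs in small batches, retrying a batch when AutoReconnect is raised."""
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        for attempt in range(max_attempts):
            try:
                collection.insert(batch, continue_on_error=True)
                break
            except AutoReconnect:
                time.sleep(2 ** attempt)  # back off before retrying this batch
        else:
            raise RuntimeError("batch at offset %d failed after %d attempts"
                               % (start, max_attempts))

Note that retrying a batch that partially succeeded can insert duplicates unless the documents carry a unique index.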

Symfony2 embedded collection edit form

I created an embedded collection of another entity in a form. The idea is that when you edit or delete a 'demand', the 'products' that belong to it are edited as well. My create form is fine, but editing gives this error:
Catchable Fatal Error: Argument 1 passed to MaisAlimentos\DemandaBundle\Entity\Demanda::setProdutosDemanda() must be an instance of Doctrine\Common\Collections\ArrayCollection, instance of Doctrine\ORM\PersistentCollection given, called in /var/www/maa/vendor/symfony/src/Symfony/Component/Form/Util/PropertyPath.php on line 347 and defined in /var/www/maa/src/MaisAlimentos/DemandaBundle/Entity/Demanda.php line 130
I read on some forums that the solution is to remove the type hint from the setter, but then I got another error:
Catchable Fatal Error: Object of class Doctrine\ORM\PersistentCollection could not be converted to string in /var/www/maa/src/MaisAlimentos/DemandaBundle/Entity/Demanda.php line 136
My code: http://pastebin.com/WeGcHyYL
OK, so you've found the solution to your original problem.
The second error comes from a typo/copy-paste error.
Line 162 of your pastebin code:
$this->$produtosDemanda = $produtosDemanda;
should be
$this->produtosDemanda = $produtosDemanda;
That is, no $ sign after $this-> (otherwise PHP reads $produtosDemanda as a variable property name).