Join two tables using pyspark hive context - pyspark

I am seeing the error below when joining two Hive tables using the PySpark HiveContext.
Error:
File "/usr/hdp/2.3.4.7-4/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 552, in sql
File "/usr/hdp/2.3.4.7-4/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/hdp/2.3.4.7-4/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
File "/usr/hdp/2.3.4.7-4/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o41.sql.
: org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
Example:
lsf.registerTempTable('temp_table')
out = hc.sql(
"""INSERT OVERWRITE TABLE AAAAAA PARTITION (day ='2017-09-20')
SELECT tt.*,ht.id
FROM temp_table tt
JOIN hive_table ht
ON tt.id = ht.id
""")
Also, how do I parameterize day?
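One common way to parameterize the partition day is to build the SQL string in Python, for example with str.format. A minimal sketch, assuming the same HiveContext (hc) and table names as above; the day value is illustrative:
# Substitute the day into the statement; note the quotes around the partition value.
day = '2017-09-20'
out = hc.sql("""
    INSERT OVERWRITE TABLE AAAAAA PARTITION (day = '{d}')
    SELECT tt.*, ht.id
    FROM temp_table tt
    JOIN hive_table ht
      ON tt.id = ht.id
""".format(d=day))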

Related

Ambiguous column error using orange3 - POSTGRES

When executing the query inside Orange 3, I got an ambiguous column error
QUERY
SELECT CAST(REPLACE(memoria_em_uso, 'G','') AS FLOAT),
CAST(REPLACE(memoria_total, 'G','') as FLOAT),
CAST(REPLACE(memoria_livre, 'M','') as FLOAT)
FROM tbl_memoria_hosts
OUTPUT
Error encountered in widget SQL Table:
Traceback (most recent call last):
File "/home/luis/anaconda3/lib/python3.7/site-packages/Orange/data/sql/backend/postgres.py", line 81, in execute_sql_query
cur.execute(query, params)
psycopg2.errors.AmbiguousColumn: column reference "replace" is ambiguous
LINHA 1: SELECT (("replace")::double precision) AS "replace", (("repl...
Note:
All columns in the table are VARCHAR, and the values are written like 7.7G.
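The three converted expressions have no aliases, so they all get the default output name replace, which is why the query Orange wraps around them becomes ambiguous. A minimal sketch of the fix is to alias each column; the alias names and connection details below are illustrative, and the same aliased query can be pasted into the SQL Table widget:
import psycopg2

# Placeholder connection details; adjust to your environment.
conn = psycopg2.connect(dbname="mydb", user="user", password="password", host="localhost")
cur = conn.cursor()

# Alias every expression so each result column gets a distinct name.
cur.execute("""
    SELECT CAST(REPLACE(memoria_em_uso, 'G', '') AS FLOAT) AS memoria_em_uso_gb,
           CAST(REPLACE(memoria_total, 'G', '') AS FLOAT)  AS memoria_total_gb,
           CAST(REPLACE(memoria_livre, 'M', '') AS FLOAT)  AS memoria_livre_mb
    FROM tbl_memoria_hosts
""")
print(cur.fetchone())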

pyspark DataFrame selectExpr is not working for more than one column

We are trying Spark DataFrame selectExpr; it works for one column, but when I add more than one column it throws an error.
The first call works; the second one throws an error.
Code sample:
df1.selectExpr("coalesce(gtr_pd_am,0 )").show(2)
df1.selectExpr("coalesce(gtr_pd_am,0),coalesce(prev_gtr_pd_am,0)").show()
Error log:
>>> df1.selectExpr("coalesce(gtr_pd_am,0),coalesce(prev_gtr_pd_am,0)").show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/2.6.5.0-292/spark2/python/pyspark/sql/dataframe.py", line 1216, in selectExpr
jdf = self._jdf.selectExpr(self._jseq(expr))
File "/usr/hdp/2.6.5.0-292/spark2/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/2.6.5.0-292/spark2/python/pyspark/sql/utils.py", line 73, in deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: u"\nmismatched input ',' expecting <EOF>(line 1, pos 21)\n\n== SQL ==\ncoalesce(gtr_pd_am,0),coalesce(prev_gtr_pd_am,0)\n---------------------^^^\n"
Check this:
df1.selectExpr("coalesce(gtr_pd_am,0)", "coalesce(prev_gtr_pd_am,0)").show()
You need to pass each expression as a separate argument; selectExpr parses every string as a single SQL expression, so a comma inside one string causes the parse error above.
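If you prefer to avoid expression strings entirely, here is a minimal equivalent sketch with the DataFrame functions API (column names taken from the question):
from pyspark.sql import functions as F

# Each coalesce is its own column expression, so nothing has to be parsed as SQL.
df1.select(
    F.coalesce(F.col("gtr_pd_am"), F.lit(0)),
    F.coalesce(F.col("prev_gtr_pd_am"), F.lit(0)),
).show()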

Load pandas dataframe into Spark cluster

I have a Postgres database and I want to run a query and load a table into a Spark DataFrame. Some of the columns in my database are arrays. For example:
=> select id, f_2 from raw limit 1;
will return
id | f_2
---------+-----------
1 | {{140,130},{NULL,NULL},{NULL,NULL}}
What I want is to access 140 (the first element of the inner array), which is easy in Postgres with this query:
=> select id, f_2[1][1] from raw limit 1;
id | f_2
---------+-----------
1 | 140
but I want to do this in a Spark DataFrame. Here is my code to load the data:
df = sqlContext.sql("""
select id as id,
f_2 as A
from raw
""")
which returns this error:
Py4JJavaError: An error occurred while calling o560.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to java.lang.Integer
and then I tried this one:
df = sqlContext.sql("""
select id as id,
f_2[0] as A
from raw
""")
which gave the same error, so I then tried this:
df = sqlContext.sql("""
select id as id,
f_2[0][0] as A
from raw
""")
which returns this error:
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
AnalysisException: u"Can't extract value from f_2#32685[0];"
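A hedged sketch of one way to get at the inner element, assuming the table is read over JDBC and that Spark maps f_2 to array<array<int>>; the URL and credentials are placeholders. The ClassCastException above suggests the two-dimensional Postgres array may not map cleanly, in which case extracting f_2[1][1] on the Postgres side (for example inside the dbtable subquery) may be the safer route:
# Read the table over JDBC (placeholder URL and credentials).
df = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://host:5432/dbname",
    dbtable="raw",
    user="user",
    password="password",
    driver="org.postgresql.Driver",
).load()

# If f_2 really arrives as array<array<int>>, element access uses 0-based getItem
# (unlike the 1-based Postgres indexing shown earlier).
df.select("id", df["f_2"].getItem(0).getItem(0).alias("A")).show(1)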

Relstorage zodbpack error

Whenever I try to run zodbpack I get the error:
psycopg2.IntegrityError: null value in column "zoid" violates not-null constraint
Any ideas what is causing this and how to fix it?
relstorage 1.5.1, postgres 8, plone 4.2.1.1
2012-12-03 13:18:03,485 [zodbpack] INFO Opening storage (RelStorageFactory)...
2012-12-03 13:18:03,525 [zodbpack] INFO Packing storage (RelStorageFactory).
2012-12-03 13:18:03,533 [relstorage] INFO pack: beginning pre-pack
2012-12-03 13:18:03,533 [relstorage] INFO pack: analyzing transactions committed Mon Nov 26 12:31:54 2012 or before
2012-12-03 13:18:03,536 [relstorage.adapters.packundo] INFO pre_pack: start with gc enabled
2012-12-03 13:18:03,759 [relstorage.adapters.packundo] INFO analyzing references from objects in 97907 new transaction(s)
2012-12-03 13:18:03,761 [relstorage.adapters.scriptrunner] WARNING script statement failed: '\n INSERT INTO object_refs_added (tid)\n VALUES (%(tid)s)\n '; parameters: {'tid': 0L}
2012-12-03 13:18:03,761 [relstorage.adapters.packundo] ERROR pre_pack: failed
Traceback (most recent call last):
File "/usr/local/lib64/python2.6/site-packages/RelStorage-1.5.1-py2.6.egg/relstorage/adapters/packundo.py", line 486, in pre_pack
conn, cursor, pack_tid, get_references)
File "/usr/local/lib64/python2.6/site-packages/RelStorage-1.5.1-py2.6.egg/relstorage/adapters/packundo.py", line 580, in _pre_pack_with_gc
self.fill_object_refs(conn, cursor, get_references)
File "/usr/local/lib64/python2.6/site-packages/RelStorage-1.5.1-py2.6.egg/relstorage/adapters/packundo.py", line 387, in fill_object_refs
self._add_refs_for_tid(cursor, tid, get_references)
File "/usr/local/lib64/python2.6/site-packages/RelStorage-1.5.1-py2.6.egg/relstorage/adapters/packundo.py", line 459, in _add_refs_for_tid
self.runner.run_script_stmt(cursor, stmt, {'tid': tid})
File "/usr/local/lib64/python2.6/site-packages/RelStorage-1.5.1-py2.6.egg/relstorage/adapters/scriptrunner.py", line 52, in run_script_stmt
cursor.execute(stmt, generic_params)
IntegrityError: null value in column "zoid" violates not-null constraint
Traceback (most recent call last):
File "zodbpack.py", line 86, in <module>
main()
File "zodbpack.py", line 78, in main
skip_prepack=options.reuse_prepack)
File "/usr/local/lib64/python2.6/site-packages/RelStorage-1.5.1-py2.6.egg/relstorage/storage.py", line 1114, in pack
adapter.packundo.pre_pack(tid_int, get_references)
File "/usr/local/lib64/python2.6/site-packages/RelStorage-1.5.1-py2.6.egg/relstorage/adapters/packundo.py", line 486, in pre_pack
conn, cursor, pack_tid, get_references)
File "/usr/local/lib64/python2.6/site-packages/RelStorage-1.5.1-py2.6.egg/relstorage/adapters/packundo.py", line 580, in _pre_pack_with_gc
self.fill_object_refs(conn, cursor, get_references)
File "/usr/local/lib64/python2.6/site-packages/RelStorage-1.5.1-py2.6.egg/relstorage/adapters/packundo.py", line 387, in fill_object_refs
self._add_refs_for_tid(cursor, tid, get_references)
File "/usr/local/lib64/python2.6/site-packages/RelStorage-1.5.1-py2.6.egg/relstorage/adapters/packundo.py", line 459, in _add_refs_for_tid
self.runner.run_script_stmt(cursor, stmt, {'tid': tid})
File "/usr/local/lib64/python2.6/site-packages/RelStorage-1.5.1-py2.6.egg/relstorage/adapters/scriptrunner.py", line 52, in run_script_stmt
cursor.execute(stmt, generic_params)
psycopg2.IntegrityError: null value in column "zoid" violates not-null constraint
Running zodbpack.py without GC enabled (thanks Martijn) worked and allowed the database to pack down from 3.8 GB to 887 MB.
So my zodbpack-conf.xml looks like:
<relstorage>
  pack-gc false
  <postgresql>
    dsn dbname='zodb' user='user' host='host' password='password'
  </postgresql>
</relstorage>
and I run it with:
python zodbpack.py -d 7 zodbpack-conf.xml
Note: after the pack completes you still need to vacuum the database to get the space back. I run this from the command line as:
psql zodb postgresuser
zodb=# SELECT pg_database_size('zodb');
zodb=# vacuum full;
zodb=# SELECT pg_database_size('zodb');
Interestingly, when I run the pack command from the ZMI Control Panel, I still get the same error as before:
Site Error
An error was encountered while publishing this resource.
Error Type: IntegrityError
Error Value: null value in column "zoid" violates not-null constraint
So I am assuming the ZMI pack runs with GC enabled, and that there is still an issue with null values in my site database...
Should I try running some SQL to clean it out? If an object has no 'zoid' value, is it effectively inaccessible junk? An SQL example for this would be great too :)

Cassandra errors when trying the new CQL 3

I downloaded Cassandra 1.1.1 and launched cqlsh with CQL version 3.
I tried to create a new column family:
CREATE TABLE stats (
pid blob,
period int,
targetid blob,
sum counter,
PRIMARY KEY (pid, period, targetid)
);
But I got this:
Traceback (most recent call last):
File "./cqlsh", line 908, in perform_statement
self.cursor.execute(statement, decoder=decoder)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py", line 117, in execute
response = self.handle_cql_execution_errors(doquery, prepared_q, compress)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py", line 132, in handle_cql_execution_errors
return executor(*args, **kwargs)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py", line 1583, in execute_cql_query
self.send_execute_cql_query(query, compression)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py", line 1593, in send_execute_cql_query
self.oprot.trans.flush()
File "./../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TTransport.py", line 293, in flush
self._trans.write(buf)
File "./../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TSocket.py", line 117, in write
plus = self.handle.send(buff)
error: [Errno 32] Broken pipe
And on the server console:
Error occurred during processing of message.
java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:247)
at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:51)
at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
at org.apache.cassandra.db.marshal.AbstractCompositeType.getString(AbstractCompositeType.java:140)
at org.apache.cassandra.config.CFMetaData.validate(CFMetaData.java:929)
at org.apache.cassandra.service.MigrationManager.announceNewColumnFamily(MigrationManager.java:131)
at org.apache.cassandra.cql3.statements.CreateColumnFamilyStatement.announceMigration(CreateColumnFamilyStatement.java:83)
at org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:99)
at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:108)
at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:121)
at org.apache.cassandra.thrift.CassandraServer.execute_cql_query(CassandraServer.java:1237)
at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:3542)
at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:3530)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:186)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
I'd suggest reporting bugs at https://issues.apache.org/jira/browse/CASSANDRA.