Load pandas dataframe into Spark cluster - postgresql

I have a Postgres database and I want to run a query and load a table into a Spark DataFrame. Some columns of my database are arrays. For example:
=> select id, f_2 from raw limit 1;
will return
id | f_2
---------+-----------
1 | {{140,130},{NULL,NULL},{NULL,NULL}}
What I want is to access 140 (the first element of the inner array), which is easy in Postgres using this query:
=> select id, f_2[1][1] from raw limit 1;
id | f_2
---------+-----------
1 | 140
But I want to load it into a Spark DataFrame, and here is my code to load the data:
df = sqlContext.sql("""
select id as id,
f_2 as A
from raw
""")
and it returns this error:
Py4JJavaError: An error occurred while calling o560.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to java.lang.Integer
Then I tried this one:
df = sqlContext.sql("""
select id as id,
f_2[0] as A
from raw
""")
and got the same error, then tried this one:
df = sqlContext.sql("""
select id as id,
f_2[0][0] as A
from raw
""")
and it returns this error:
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
AnalysisException: u"Can't extract value from f_2#32685[0];"
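One possible workaround (a minimal sketch, assuming the raw table is reachable over JDBC; the URL, driver and credentials below are placeholders): since f_2[1][1] already works on the Postgres side, push the array indexing down through the JDBC reader so that Spark only receives a plain integer column.
# Sketch: let Postgres do the nested-array indexing before Spark sees the data.
# Connection options are placeholders; adjust them to your environment.
df = (sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/mydb")
      .option("driver", "org.postgresql.Driver")
      .option("dbtable", "(select id, f_2[1][1] as a from raw) as t")
      .load())
df.show()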

Related

Getting error when I try to create an iceberg table using dataFrame.write() in spark and store it in a cloud Filesystem source

Following is the script I wrote:
var df2 = spark.read.parquet("<file_path>")
df2.write.format("iceberg").save(<destination_path>)
When I ran the script I got the following error:
RuntimeException: Failed to get table info from metastore gs://dremio-qa/flatten.listofstructwithnulls30_iceberg
Caused by: MetaException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.OWNER,A0.RETENTION,A0.IS_REWRITE_ENABLED,A0.TBL_NAME,A0.TBL_TYPE,A0.TBL_ID FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID WHERE A0.TBL_NAME = ? AND B0.`NAME` = ?
Caused by: JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.OWNER,A0.RETENTION,A0.IS_REWRITE_ENABLED,A0.TBL_NAME,A0.TBL_TYPE,A0.TBL_ID FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID WHERE A0.TBL_NAME = ? AND B0.`NAME` = ?
Caused by: SQLSyntaxErrorException: (conn=65340) Unknown column 'A0.IS_REWRITE_ENABLED' in 'field list'
Caused by: MariaDbSqlException: Unknown column 'A0.IS_REWRITE_ENABLED' in 'field list'
Caused by: SQLException: Unknown column 'A0.IS_REWRITE_ENABLED' in 'field list'
You should use db.table instead of '<destination_path>':
data.write
  .format("iceberg")
  .mode("append")
  .save("db.table")
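For completeness, a rough PySpark sketch of a catalog-based write (assuming Spark 3.1+ and the Iceberg Spark runtime jar on the classpath; the catalog name, warehouse location and table identifier are made up):
from pyspark.sql import SparkSession

# Sketch: register an Iceberg catalog backed by a Hadoop warehouse, then write
# by table identifier instead of a raw filesystem path. All names are hypothetical.
spark = (SparkSession.builder
         .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.my_catalog.type", "hadoop")
         .config("spark.sql.catalog.my_catalog.warehouse", "gs://my-bucket/warehouse")
         .getOrCreate())

df2 = spark.read.parquet("<file_path>")
df2.writeTo("my_catalog.db.table").createOrReplace()  # or .append() if the table already exists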

Spark doesn't recognize the column name in SQL query while can output it to a dataset

I'm applying the following SQL query:
s"SELECT * FROM my_table_joined WHERE (timestamp > '2022-01-23' and writetime is not null and acceptTimestamp is not null)"
and I'm getting the following error message:
warning: there was one deprecation warning (since 2.0.0); for details, enable `:setting -deprecation' or `:replay -deprecation'
org.postgresql.util.PSQLException: ERROR: column "accepttimestamp" does not exist
Hint: Perhaps you meant to reference the column "mf_joined.acceptTimestamp".
Position: 103
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2497)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2233)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:310)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:446)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:370)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:149)
at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:108)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:221)
at $$$e76229fa87b6865de321c5274e52c2f9$$$$w$getDFFromJdbcSource(<console>:1133)
... 326 elided
If I omit acceptTimestamp, like this:
s"SELECT * FROM my_table_joined WHERE (timestamp > '2022-01-23' and writetime is not null)"
I'm getting the data as below:
+-------------------+----------+----+------------------+-----------------+---+-----+------+----------+---------------+-------+-----------------------+----------+---------+-------------+------------+---------------+---------+-----+-------------------+-----------------------+---------------+--------------+-------------+-------------------+-------------------+---+---+------------------+-----+----+----+------------------+---+
|timestamp |flags |type|lon |lat |alt|speed|course|satellites|digital_twin_id|unit_id|unit_ts |name |unit_type|measure_units|access_level|uid |placement|stale|start |writetime |acceptTimestamp|delayWindowEnd|DiffInSeconds|time |hour |max|min|mean |count|max2|min2|mean2 |rnb|
+-------------------+----------+----+------------------+-----------------+---+-----+------+----------+---------------+-------+-----------------------+----------+---------+-------------+------------+---------------+---------+-----+-------------------+-----------------------+---------------+--------------+-------------+-------------------+-------------------+---+---+------------------+-----+----+----+------------------+---+
Please note that acceptTimestamp is there!
So how should I handle this column in my query to make it taken into account?
From the exception, it seems this is related to Postgres, not Spark. If you look at the error message, the column name is folded to lowercase (accepttimestamp), whereas in your query the T is uppercase (acceptTimestamp).
To make the column name case-sensitive for Postgres, you need to use double-quotes. Try this:
val query = s"""SELECT * FROM my_table_joined
WHERE timestamp > '2022-01-23'
and writetime is not null
and "acceptTimestamp" is not null"""

PySpark best way to filter df based on columns from different df's

I have a DataFrame A_DF which has, among others, two columns, say COND_B and COND_C. Then I have two different DataFrames: B_DF with a COND_B column and C_DF with a COND_C column.
Now I would like to filter A_DF where the value matches in one OR the other. Something like:
df = A_DF.filter((A_DF.COND_B == B_DF.COND_B) | (A_DF.COND_C == C_DF.COND_C))
But I found out it is not possible like this.
EDIT
error: Attribute CON_B#264,COND_C#6 is missing from the schema: [... COND_B#532, COND_C#541 ]. Attribute(s) with the same name appear in the operation: COND_B,COND_C. Please check if the right attribute(s) are used.;
It looks like I can filter only within the same DF because of the #number added on the fly.
So I first tried to build a list from B_DF and C_DF and filter based on that, but calling collect() on 100M records was too expensive.
So I tried:
AB_DF = A_DF.join(B_DF, 'COND_B', 'left_semi')
AC_DF = A_DF.join(C_DF, 'COND_C', 'left_semi')
df = AB_DF.unionAll(AC_DF).dropDuplicates()
I used dropDuplicates() to remove duplicate records where both conditions were true. But even with that I got some unexpected results.
Is there some other, smoother solution to do this simply? Something like an EXISTS statement in SQL?
EDIT2
I tried SQL based on @mck's response:
e.createOrReplaceTempView('E')
b.createOrReplaceTempView('B')
p.createOrReplaceTempView('P')
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
my_output.write_dataframe(df)
with error:
Traceback (most recent call last):
File "/myproject/abc.py", line 45, in my_compute_function
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
TypeError: sql() missing 1 required positional argument: 'sqlQuery'
Thanks a lot!
Your idea of using exists should work. You can do:
A_DF.createOrReplaceTempView('A')
B_DF.createOrReplaceTempView('B')
C_DF.createOrReplaceTempView('C')
df = spark.sql("""
select * from A
where exists (select 1 from B where A.COND_B = B.COND_B)
or exists (select 1 from C where A.COND_C = C.COND_C)
""")

AnalysisException: cannot resolve given input columns:

I am running into this error when I try to select a couple of columns from the temporary table.
pd_df = pd.read_sql('select * from abc.cars limit 10', conn)
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
df.show()
print('***************')
print("Reading from tmp table")
data = spark.sql('select location_id from cars_tmp')
data.show()
AnalysisException: cannot resolve '`location_id`' given input columns: [cars_tmp.abc.product_id, cars_tmp.abc.location_id ...]
When I select all the columns I get the results. So this is successful:
data = spark.sql('select * from cars_tmp')
data.show()
I tried the queries below, but they fail as well with the same error:
data = spark.sql('select cars_tmp.abc.location_id from cars_tmp')
data.show()
data = spark.sql('select cars_tmp.location_id from cars_tmp')
data.show()
data = spark.sql('select abc.location_id from cars_tmp')
data.show()
I am running these in Databricks.
Databricks runtime version: 7.0
Apache Spark version: 3.0
scala: 2.12
or "spark_version": "7.0.x-scala2.12",
Any help will be highly appreciated.
Thanks
The column name does not exist in the table. select * from cars_tmp works because you do not specify the column name.
Please see this answer https://stackoverflow.com/a/64042756/8913402 with the same error handling.
I resolved the issue by adding each column in the pandas select query, something like this:
pd_df = pd.read_sql('select id, location_id, product_id from abc.cars limit 10', conn)
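An alternative sketch, based on the column names shown in the exception: the view's columns appear to contain a literal dot (for example abc.location_id), so they can be selected by escaping the whole name with backticks, or renamed right after createDataFrame.
# Sketch: backticks quote the full dotted column name in Spark SQL.
data = spark.sql('select `abc.location_id` from cars_tmp')

# Or strip the prefix once so later queries can use plain names.
df_clean = df.toDF(*[c.split('.')[-1] for c in df.columns])
df_clean.createOrReplaceTempView("cars_tmp")
data = spark.sql('select location_id from cars_tmp')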

Spark: how to read chunk of a table from Cassandra

I have a large table that grows vertically. I want to read rows in small batches, so that I can process each batch and save the results.
Table definition
CREATE TABLE foo (
uid timeuuid,
events blob,
PRIMARY KEY ((uid))
)
Code attempt 1 - using CassandraSQLContext
// Step 1. Get uuid of the last row in a batch
val max = 10
val rdd = sc.cassandraTable("foo", "bar")
var cassandraRows = rdd.take(max)
var lastUUID = cassandraRows.last.getUUID("uid");
// lastUUID = 131ea620-2e4e-11e4-a2fc-8d5aad979e84
// Step 2. Use last row as a pointer to the start of the next batch
val cc = new CassandraSQLContext(sc)
val cql = s"SELECT events from foo.bar where token(uid) > token($lastUUID) limit $max"
// which is at runtime
// SELECT events from foo.bar WHERE
// token(uid) > token(131ea620-2e4e-11e4-a2fc-8d5aad979e84) limit 10
cc.sql(cql).collect()
The last line throws:
Exception in thread "main" java.lang.RuntimeException: [1.79] failure:
``)'' expected but identifier ea620 found
SELECT events from foo.bar where token(uid) >
token(131ea620-2e4e-11e4-a2fc-8d5aad979e84) limit 10
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
But it returns the correct 10 records if I run my CQL in cqlsh.
Code attempt 2 - using DataStax Cassandra connector
// Step 1. Get uuid of the last row in a batch
val max = 10
val rdd = sc.cassandraTable("foo", "bar")
var cassandraRows = rdd.take(max)
var lastUUID = cassandraRows.last.getUUID("uid");
// lastUUID = 131ea620-2e4e-11e4-a2fc-8d5aad979e84
// Step 2. Execute query
rdd.where(s"token(uid) > token($lastUUID)").take(max)
This throws:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0
in stage 1.0 (TID 1, localhost): java.io.IOException: Exception during
preparation of SELECT "uid", "events" FROM "foo"."bar" WHERE
token("uid") > ? AND token("uid") <= ? AND uid > $lastUUID ALLOW
FILTERING: line 1:118 no viable alternative at character '$'
How can I use where token(...) queries with Spark and Cassandra?
I would use the DataStax Cassandra Java Driver. Similar to your CassandraSQLContext, you would select chunks like this:
val query = QueryBuilder.select("events")
  .from("foo", "bar")                          // keyspace and table, as in sc.cassandraTable("foo", "bar")
  .where(gt(token("uid"), token(lastUUID)))    // rows after the token of the last fetched uid
  .limit(10)
val rows = session.execute(query).all()
If you want to query asynchronously, session also has executeAsync, which returns a RichListenableFuture that can be wrapped in a Scala Future by adding a callback.