hive read mongodb bson file

At first, when I ran select count(*) from the table, it raised:
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.bson.BasicBSONDecoder.readFully(BasicBSONDecoder.java:65)
at org.bson.BasicBSONDecoder.decode(BasicBSONDecoder.java:57)
at com.mongodb.hadoop.splitter.BSONSplitter.getStartingPositionForSplit(BSONSplitter.java:427)
at com.mongodb.hadoop.mapred.BSONFileInputFormat.getRecordReader(BSONFileInputFormat.java:127)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
... 16 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
After I set:
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=2048;
it raised:
Caused by: org.bson.BsonSerializationException: Expected size to be 7169396, not 109.
at org.bson.BsonBinaryReader$Context.popContext(BsonBinaryReader.java:445)
at org.bson.BsonBinaryReader.doReadEndDocument(BsonBinaryReader.java:285)
at org.bson.AbstractBsonReader.readEndDocument(AbstractBsonReader.java:360)
at org.bson.AbstractBsonWriter.pipeDocument(AbstractBsonWriter.java:763)
at org.bson.AbstractBsonWriter.pipe(AbstractBsonWriter.java:753)
at org.bson.BasicBSONDecoder.decode(BasicBSONDecoder.java:48)
at org.bson.BasicBSONDecoder.decode(BasicBSONDecoder.java:57)
at com.mongodb.hadoop.splitter.BSONSplitter.getStartingPositionForSplit(BSONSplitter.java:427)
at com.mongodb.hadoop.mapred.BSONFileInputFormat.getRecordReader(BSONFileInputFormat.java:127)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
... 16 more
What is the problem?
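A note on the first error: mapreduce.map.memory.mb only raises the YARN container size; the mapper's JVM heap, which is what overflows here, is controlled separately by mapreduce.map.java.opts. A sketch of session settings that would actually enlarge the heap (the -Xmx values are illustrative and kept below the container sizes):
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276m;
set mapreduce.reduce.memory.mb=2048;
set mapreduce.reduce.java.opts=-Xmx1638m;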

Related

Getting error when I try to create an iceberg table using dataFrame.write() in spark and store it in a cloud Filesystem source

Following is the script I wrote:
var df2 = spark.read.parquet("<file_path>")
df2.write.format("iceberg").save("<destination_path>")
When I ran the script I am getting the following error:
RuntimeException: Failed to get table info from metastore gs://dremio-qa/flatten.listofstructwithnulls30_iceberg
Caused by: MetaException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.OWNER,A0.RETENTION,A0.IS_REWRITE_ENABLED,A0.TBL_NAME,A0.TBL_TYPE,A0.TBL_ID FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID WHERE A0.TBL_NAME = ? AND B0.`NAME` = ?
Caused by: JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.OWNER,A0.RETENTION,A0.IS_REWRITE_ENABLED,A0.TBL_NAME,A0.TBL_TYPE,A0.TBL_ID FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID WHERE A0.TBL_NAME = ? AND B0.`NAME` = ?
Caused by: SQLSyntaxErrorException: (conn=65340) Unknown column 'A0.IS_REWRITE_ENABLED' in 'field list'
Caused by: MariaDbSqlException: Unknown column 'A0.IS_REWRITE_ENABLED' in 'field list'
Caused by: SQLException: Unknown column 'A0.IS_REWRITE_ENABLED' in 'field list'
data.write
  .format("iceberg")
  .mode("append")
  .save("db.table")
You should use db.table instead of '<destination_path>'.
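A minimal end-to-end sketch of the suggested fix, assuming the target database db already exists in the catalog (the path placeholder is from the question):
val df2 = spark.read.parquet("<file_path>")
df2.write
  .format("iceberg")
  .mode("append")
  .save("db.table")  // a catalog identifier, not a gs:// filesystem path
Separately, the underlying Unknown column 'A0.IS_REWRITE_ENABLED' error usually indicates that the Hive metastore database schema is older than the Hive client expects (that column was added in Hive 2.x), so upgrading the metastore schema may also be needed.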

Beam DirectRunner Calcite can't specify name

I'm running a simplified version of this Beam tutorial, but with the DirectRunner on my local machine.
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform
import os

with beam.Pipeline() as p:
    rows = (p |
            beam.Create([
                beam.Row(col1="val1", col2="col2_val1"),
                beam.Row(col1="val2", col2="col2_val2"),
            ]
            ))
    ({"my_table": rows} | SqlTransform("""SELECT * FROM my_table"""))
If I change my_table to PCOLLECTION it works (although for it to really work I need to pass in rows directly instead of the dict).
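For reference, the variant that does work passes the PCollection directly, and Beam exposes it to SQL under the implicit PCOLLECTION name:
rows | SqlTransform("""SELECT * FROM PCOLLECTION""")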
The error message I get:
Traceback (most recent call last):
File "./lib/scratch/test_join.py", line 12, in <module>
({"my_table": rows} | SqlTransform("""SELECT * FROM my_table"""))
File "/Users/steeling/src/versions/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 606, in __ror__
result = p.apply(self, pvalueish, label)
File "/Users/steeling/src/versions/lib/python3.7/site-packages/apache_beam/pipeline.py", line 694, in apply
pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
File "/Users/steeling/src/versions/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 185, in apply
return m(transform, input, options)
File "/Users/steeling/src/versions/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 215, in apply_PTransform
return transform.expand(input)
File "/Users/steeling/src/versions/lib/python3.7/site-packages/apache_beam/transforms/external.py", line 305, in expand
raise RuntimeError(response.error)
RuntimeError: org.apache.beam.sdk.extensions.sql.impl.ParseException: Unable to parse query SELECT * FROM my_table
at org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner.convertToBeamRel(CalciteQueryPlanner.java:214)
at org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:111)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:171)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:109)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:548)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:499)
at org.apache.beam.sdk.expansion.service.ExpansionService$TransformProvider.apply(ExpansionService.java:367)
at org.apache.beam.sdk.expansion.service.ExpansionService.expand(ExpansionService.java:470)
at org.apache.beam.sdk.expansion.service.ExpansionService.expand(ExpansionService.java:546)
at org.apache.beam.model.expansion.v1.ExpansionServiceGrpc$MethodHandlers.invoke(ExpansionServiceGrpc.java:219)
at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:797)
at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.tools.ValidationException: org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.runtime.CalciteContextException: From line 1, column 15 to line 1, column 22: Object 'my_table' not found
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.prepare.PlannerImpl.validate(PlannerImpl.java:217)
at org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner.convertToBeamRel(CalciteQueryPlanner.java:183)
... 17 more
Caused by: org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.runtime.CalciteContextException: From line 1, column 15 to line 1, column 22: Object 'my_table' not found
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.runtime.Resources$ExInstWithCause.ex(Resources.java:463)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.SqlUtil.newContextException(SqlUtil.java:824)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.SqlUtil.newContextException(SqlUtil.java:809)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorImpl.newValidationError(SqlValidatorImpl.java:4805)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.IdentifierNamespace.resolveImpl(IdentifierNamespace.java:172)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.IdentifierNamespace.validateImpl(IdentifierNamespace.java:177)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.AbstractNamespace.validate(AbstractNamespace.java:84)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace(SqlValidatorImpl.java:995)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery(SqlValidatorImpl.java:955)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom(SqlValidatorImpl.java:3109)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom(SqlValidatorImpl.java:3091)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect(SqlValidatorImpl.java:3363)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SelectNamespace.validateImpl(SelectNamespace.java:60)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.AbstractNamespace.validate(AbstractNamespace.java:84)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace(SqlValidatorImpl.java:995)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery(SqlValidatorImpl.java:955)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.SqlSelect.validate(SqlSelect.java:216)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression(SqlValidatorImpl.java:930)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorImpl.validate(SqlValidatorImpl.java:637)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.prepare.PlannerImpl.validate(PlannerImpl.java:215)
... 18 more
Caused by: org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorException: Object 'my_table' not found
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.runtime.Resources$ExInstWithCause.ex(Resources.java:463)
at org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.runtime.Resources$ExInst.ex(Resources.java:572)
... 37 more
You encounter this error because you only have one PCollection. It seems that Apache Beam uses PCOLLECTION as the table source even if you create a dictionary with the key-value pair {"my_table": rows}.
As a workaround, if you would like to explicitly reference a table name in your SQL statement, you can create a second, temporary PCollection that contains dummy values, then pass a dictionary holding both PCollections:
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform
import os

with beam.Pipeline() as p:
    rows = (p |
            "create rows" >> beam.Create([
                beam.Row(col1="val1", col2="col2_val1"),
                beam.Row(col1="val2", col2="col2_val2"),
            ]
            ))
    rows_2 = (p |
              "create rows_2" >> beam.Create([
                  beam.Row(col1_1="val1", col2_1="123"),
              ]
              ))
    ({"my_table": rows, "my_table2": rows_2} | SqlTransform("""SELECT * FROM my_table""")
     | beam.Map(lambda row: "col1: %s, col2: %s" % (row.col1, row.col2))
     | beam.Map(print))
The output is (element order may vary):
col1: val1, col2: col2_val1
col1: val2, col2: col2_val2
You can open an Apache Beam JIRA issue to request support for this use case in a future release.

Why is my query failing with SQLCODE=-204?

I'm writing some SQL for a report (this is my first time actually writing SQL, so don't slam my syntax too much, lol) and the subquery seems to be failing. I'm not sure why, because the report compiles just fine and I can get it into Jaspersoft Studio; it only errors out when I try to preview it. Am I even doing the subquery correctly?
Thanks in advance.
SELECT
    PERSON.LAST_NAME,
    ACCOUNT.ACCOUNT_NUMBER,
    SHARE_ESCROW.ID,
    LOAN.PAYMENT_DUE_DATE,
    LN_IMPOUND_ANALYSIS.CUSHION_AMOUNT,
    LN_IMPOUND_ANALYSIS.NEXT_P_AND_I_AMOUNT,
    LOAN.IMPOUND_AMOUNT,
    LN_IMPOUND_ANALYSIS.NEXT_IMPOUND_AMOUNT,
    SHARE_ESCROW.BALANCE,
    SHARE_ESCROW.DESCRIPTION,
    SHARE_ESCROW.CLOSE_DATE,
    i.LTV,
    CASE LOAN.PAYMENT_METHOD
        WHEN 'T' THEN 'Automated Transfer'
        WHEN 'A' THEN 'ACH'
        WHEN 'C' THEN 'Cash'
    END AS PAYMENT_METHOD
FROM
    CORE.ACCOUNT AS ACCOUNT
    INNER JOIN CORE.LOAN AS LOAN ON ACCOUNT.SERIAL = LOAN.PARENT_SERIAL
    INNER JOIN CORE.LN_IMPOUND_ANALYSIS AS LN_IMPOUND_ANALYSIS ON LOAN.SERIAL = LN_IMPOUND_ANALYSIS.PARENT_SERIAL
    INNER JOIN CORE.SHARE AS SHARE_ESCROW ON ACCOUNT.SERIAL = SHARE_ESCROW.PARENT_SERIAL
    INNER JOIN CORE.PERSON AS PERSON ON PERSON.SERIAL = ACCOUNT.PRIMARY_PERSON_SERIAL,
    (SELECT
        CASE COLLATERAL.AMOUNT
            WHEN 0 THEN 0
            WHEN NULL THEN 0
            ELSE LOAN.BALANCE / COLLATERAL.AMOUNT
        END AS LTV
    FROM
        CORE.COLLATERAL AS COLLATERAL INNER JOIN LOAN ON LOAN.SERIAL = COLLATERAL.PARENT_SERIAL
    ) i
WHERE
    SHARE_ESCROW.DESCRIPTION = 'Escrow Share' AND
    SHARE_ESCROW.CLOSE_DATE IS NULL AND
    i.LTV != 0
Edit:
Here's the error:
net.sf.jasperreports.engine.JRException: net.sf.jasperreports.engine.JRRuntimeException: net.sf.jasperreports.engine.JRException: Error executing SQL statement for : Escrow32Analysis324532WIP_TableDataset_1579697577915_289752
at com.jaspersoft.studio.editor.preview.view.control.ReportControler.fillReport(ReportControler.java:466)
at com.jaspersoft.studio.editor.preview.view.control.ReportControler.access$18(ReportControler.java:441)
at com.jaspersoft.studio.editor.preview.view.control.ReportControler$4.run(ReportControler.java:333)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:54)
Caused by: net.sf.jasperreports.engine.JRRuntimeException: net.sf.jasperreports.engine.JRException: Error executing SQL statement for : Escrow32Analysis324532WIP_TableDataset_1579697577915_289752
at net.sf.jasperreports.engine.fill.JRFillSubreport.prepare(JRFillSubreport.java:809)
at net.sf.jasperreports.components.table.fill.FillTableSubreport.prepareSubreport(FillTableSubreport.java:156)
at net.sf.jasperreports.components.table.fill.FillTable.prepare(FillTable.java:400)
at net.sf.jasperreports.engine.fill.JRFillComponentElement.prepare(JRFillComponentElement.java:151)
at net.sf.jasperreports.engine.fill.JRFillElementContainer.prepareElements(JRFillElementContainer.java:332)
at net.sf.jasperreports.engine.fill.JRFillBand.fill(JRFillBand.java:384)
at net.sf.jasperreports.engine.fill.JRFillBand.fill(JRFillBand.java:358)
at net.sf.jasperreports.engine.fill.JRVerticalFiller.fillSummaryNoLastFooterSamePage(JRVerticalFiller.java:1102)
at net.sf.jasperreports.engine.fill.JRVerticalFiller.fillSummary(JRVerticalFiller.java:1065)
at net.sf.jasperreports.engine.fill.JRVerticalFiller.fillReportEnd(JRVerticalFiller.java:329)
at net.sf.jasperreports.engine.fill.JRVerticalFiller.fillReport(JRVerticalFiller.java:159)
at net.sf.jasperreports.engine.fill.JRBaseFiller.fill(JRBaseFiller.java:963)
at net.sf.jasperreports.engine.fill.BaseFillHandle$ReportFiller.run(BaseFillHandle.java:120)
at java.lang.Thread.run(Unknown Source)
Caused by: net.sf.jasperreports.engine.JRException: Error executing SQL statement for : Escrow32Analysis324532WIP_TableDataset_1579697577915_289752
at net.sf.jasperreports.engine.query.JRJdbcQueryExecuter.createDatasource(JRJdbcQueryExecuter.java:240)
at net.sf.jasperreports.engine.fill.JRFillDataset.createQueryDatasource(JRFillDataset.java:1114)
at net.sf.jasperreports.engine.fill.JRFillDataset.initDatasource(JRFillDataset.java:691)
at net.sf.jasperreports.engine.fill.JRBaseFiller.setParameters(JRBaseFiller.java:1314)
at net.sf.jasperreports.engine.fill.JRBaseFiller.fill(JRBaseFiller.java:931)
at net.sf.jasperreports.engine.fill.JRBaseFiller.fill(JRBaseFiller.java:873)
at net.sf.jasperreports.engine.fill.JRFillSubreport.fillSubreport(JRFillSubreport.java:665)
at net.sf.jasperreports.engine.fill.JRSubreportRunnable.run(JRSubreportRunnable.java:59)
at net.sf.jasperreports.engine.fill.AbstractThreadSubreportRunner.run(AbstractThreadSubreportRunner.java:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
... 1 more
Caused by: com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-204, SQLSTATE=42704, SQLERRMC=CMURPHY.LOAN, DRIVER=4.13.127
at com.ibm.db2.jcc.am.id.a(id.java:677)
at com.ibm.db2.jcc.am.id.a(id.java:60)
at com.ibm.db2.jcc.am.id.a(id.java:127)
at com.ibm.db2.jcc.am.no.c(no.java:2653)
at com.ibm.db2.jcc.am.no.d(no.java:2641)
at com.ibm.db2.jcc.am.no.a(no.java:2090)
at com.ibm.db2.jcc.am.oo.a(oo.java:7671)
at com.ibm.db2.jcc.t4.cb.h(cb.java:141)
at com.ibm.db2.jcc.t4.cb.b(cb.java:41)
at com.ibm.db2.jcc.t4.q.a(q.java:32)
at com.ibm.db2.jcc.t4.sb.i(sb.java:135)
at com.ibm.db2.jcc.am.no.ib(no.java:2059)
at com.ibm.db2.jcc.am.oo.sc(oo.java:3559)
at com.ibm.db2.jcc.am.oo.b(oo.java:4348)
at com.ibm.db2.jcc.am.oo.fc(oo.java:743)
at com.ibm.db2.jcc.am.oo.executeQuery(oo.java:713)
at net.sf.jasperreports.engine.query.JRJdbcQueryExecuter.createDatasource(JRJdbcQueryExecuter.java:233)
... 11 more
The message you see in the log:
DB2 SQL Error: SQLCODE=-204, SQLSTATE=42704, SQLERRMC=CMURPHY.LOAN
means that the object (in your case, a table) LOAN in the schema CMURPHY cannot be found (SQLCODE -204). As others have mentioned in the comments, the only table reference in your query that is missing an explicit schema is this one:
FROM
    CORE.COLLATERAL AS COLLATERAL INNER JOIN LOAN ON LOAN.SERIAL ...
---------------------------------------------^^^^
so the database server by default looks for that table in the default schema, the one matching your user ID (CMURPHY). Apparently it's not there.
Add the explicit schema name to the table reference:
FROM
    CORE.COLLATERAL AS COLLATERAL INNER JOIN CORE.LOAN AS LOAN ON LOAN.SERIAL ...
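Alternatively, if every unqualified table name in the report should resolve to CORE, you can set the session's default schema instead (a sketch; whether you can issue this statement depends on how your JasperReports data adapter is configured):
SET CURRENT SCHEMA CORE;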

Load pandas dataframe into Spark cluster

I have a Postgres database, and I want to run a query and load a table into a Spark dataframe. Some columns of my database are arrays. For example:
=> select id, f_2 from raw limit 1;
will return
 id | f_2
----+--------------------------------------
  1 | {{140,130},{NULL,NULL},{NULL,NULL}}
What I want is to access 140 (the first element of the inner array), which is easy in Postgres with this query:
=> select id, f_2[1][1] from raw limit 1;
 id | f_2
----+-----
  1 | 140
But I want to load it into a Spark dataframe; here is my code to load the data:
df = sqlContext.sql("""
select id as id,
f_2 as A
from raw
""")
and it returns this error:
Py4JJavaError: An error occurred while calling o560.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to java.lang.Integer
and then I tried this one:
df = sqlContext.sql("""
select id as id,
f_2[0] as A
from raw
""")
and got the same error, and then tried this one:
df = sqlContext.sql("""
select id as id,
f_2[0][0] as A
from raw
""")
and it returns this error:
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
AnalysisException: u"Can't extract value from f_2#32685[0];"
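One way to sidestep the cast error is to let Postgres do the indexing and hand Spark's JDBC reader a subquery as its table. A sketch, where the connection options are hypothetical placeholders:
df = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://host:5432/mydb",                 # hypothetical URL
    driver="org.postgresql.Driver",
    dbtable="(select id, f_2[1][1] as a from raw) as sub",  # indexing runs in Postgres
    user="user",
    password="password",
).load()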

Spark: how to read chunk of a table from Cassandra

I have a large table that grows vertically. I want to read the rows in small batches, so that I can process each batch and save the results.
Table definition
CREATE TABLE foo (
uid timeuuid,
events blob,
PRIMARY KEY ((uid))
)
Code attempt 1 - using CassandraSQLContext
// Step 1. Get uuid of the last row in a batch
val max = 10
val rdd = sc.cassandraTable("foo", "bar")
var cassandraRows = rdd.take(max)
var lastUUID = cassandraRows.last.getUUID("uid");
// lastUUID = 131ea620-2e4e-11e4-a2fc-8d5aad979e84
// Step 2. Use last row as a pointer to the start of the next batch
val cc = new CassandraSQLContext(sc)
val cql = s"SELECT events from foo.bar where token(uid) > token($lastUUID) limit $max"
// which is at runtime
// SELECT events from foo.bar WHERE
// token(uid) > token(131ea620-2e4e-11e4-a2fc-8d5aad979e84) limit 10
cc.sql(cql).collect()
Last line throws
Exception in thread "main" java.lang.RuntimeException: [1.79] failure:
``)'' expected but identifier ea620 found
SELECT events from foo.bar where token(uid) >
token(131ea620-2e4e-11e4-a2fc-8d5aad979e84) limit 10
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
But it returns the correct 10 records if I run my CQL in cqlsh.
Code attempt 2 - using DataStax Cassandra connector
// Step 1. Get uuid of the last row in a batch
val max = 10
val rdd = sc.cassandraTable("foo", "bar")
var cassandraRows = rdd.take(max)
var lastUUID = cassandraRows.last.getUUID("uid");
// lastUUID = 131ea620-2e4e-11e4-a2fc-8d5aad979e84
// Step 2. Execute query
rdd.where(s"token(uid) > token($lastUUID)").take(max)
This throws
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0
in stage 1.0 (TID 1, localhost): java.io.IOException: Exception during
preparation of SELECT "uid", "events" FROM "foo"."bar" WHERE
token("uid") > ? AND token("uid") <= ? AND uid > $lastUUID ALLOW
FILTERING: line 1:118 no viable alternative at character '$'
How can I use where token(...) queries with Spark and Cassandra?
I would use the DataStax Cassandra Java Driver. Similar to your CassandraSQLContext attempt, you would select chunks like this:
import com.datastax.driver.core.querybuilder.QueryBuilder
import com.datastax.driver.core.querybuilder.QueryBuilder.{fcall, gt, token}

val query = QueryBuilder.select("events")
  .from("foo", "bar")
  .where(gt(token("uid"), fcall("token", lastUUID)))  // token() takes column names; fcall wraps the value side
  .limit(10)
val rows = session.execute(query).all()
If you want to query asynchronously, session also has executeAsync, which returns a ResultSetFuture (a Guava ListenableFuture) that can be wrapped in a Scala Future by adding a callback.
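On the connector side (attempt 2), note that the UUID ends up unquoted in the generated CQL. The Spark Cassandra Connector's where also accepts bind parameters, which avoids the string interpolation entirely (a sketch; I have not verified that placeholders bind inside token()):
rdd.where("token(uid) > token(?)", lastUUID).take(max)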