Use consistency level in Phantom-dsl and Cassandra - scala

Currently using --
cqlsh> show version
[cqlsh 4.1.1 | Cassandra 2.0.17 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Using phantom-dsl 1.12.2, Scala 2.10.
I can't figure out how to set consistency levels on queries.
There are predefined functions insert(), select() as part of CassandraTable. How can I pass the consistency level to them?
insert.value(....).consistencyLevel_=(ConsistencyLevel.QUORUM)
does not work and fails with an error (probably because this appends a "USING CONSISTENCY QUORUM" at the end of the query). Here's the actual exception I get:
com.datastax.driver.core.exceptions.SyntaxError: line 1:424 no viable alternative at input 'CONSISTENCY'
at com.datastax.driver.core.Responses$Error.asException(Responses.java:122) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:120) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:186) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler.access$2300(RequestHandler.java:45) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:754) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:576) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
I see from the documentation and discussion on this pull request that I could do a setConsistencyLevel(ConsistencyLevel.QUORUM) on a SimpleStatement, but I would prefer not to rewrite all the different insert statements.
UPDATE
Just to close the loop on this issue: I worked around it by creating a custom InsertQuery and then using that instead of the one provided by final def insert in CassandraTable:
def qinsert()(implicit keySpace: KeySpace) = {
  val table = this.asInstanceOf[T]
  new InsertQuery[T, M, Unspecified](table, CQLQuery("INSERT into keyspace.tablename"), consistencyLevel = ConsistencyLevel.QUORUM)
}

First of all, there is no setValue method inside phantom, and the API method you are using is missing an = at the end.
The correct structure is:
Table.insert
  .value(_.name, "test")
  .consistencyLevel_=(ConsistencyLevel.Quorum)
As you are on Stack Overflow, an error stack trace and specific details of what doesn't work are generally preferable to "does not work".

I have finally figured out how to properly set the consistency level using phantom-dsl.
Using a statement you can do the following:
statement.setConsistencyLevel(ConsistencyLevel.QUORUM)
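For reference, here is a minimal sketch of the same call on a plain DataStax Java driver statement from Scala; the contact point, keyspace, table, and columns below are placeholders, not taken from the question:
import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

// the consistency level is set on the statement object, not appended to the CQL text
val stmt = new SimpleStatement("INSERT INTO tablename (id, name) VALUES (?, ?)", "1", "test")
stmt.setConsistencyLevel(ConsistencyLevel.QUORUM)
session.execute(stmt)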
Also, take a look at the test project I've been working on to help people like you with phantom-dsl examples:
https://github.com/iamthiago/cassandra-phantom

Related

Schema capitalization(uppercase) problem when reading with Spark

Using Scala here:
val df = spark.read.format("jdbc")
  .option("url", "<host url>")
  .option("dbtable", "UPPERCASE_SCHEMA.table_name")
  .option("user", "postgres")
  .option("password", "<password>")
  .option("numPartitions", 50)
  .option("fetchsize", 20)
  .load()
The database I'm calling with the above code has many schemas, and they are all in uppercase letters (UPPERCASE_SCHEMA).
No matter how I try to denote that the schema is in all caps, Spark converts it to lowercase, which then fails to match the actual DB.
I've tried making it a variable and explicitly denoting it is all uppercase, etc. in multiple languages, but no luck.
Would anyone know a workaround?
When I went into the actual DB (Postgres) and temporarily changed the schema to all lowercase, it worked absolutely fine.
Try setting spark.sql.caseSensitive to true (it is false by default):
spark.conf.set("spark.sql.caseSensitive", true)
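A minimal sketch of where the setting goes relative to the read from the question (assuming an existing SparkSession named spark; the URL and credentials stay as the question's placeholders):
// enable case-sensitive identifier resolution before the JDBC read runs
spark.conf.set("spark.sql.caseSensitive", true)
val df = spark.read.format("jdbc")
  .option("url", "<host url>")
  .option("dbtable", "UPPERCASE_SCHEMA.table_name")
  .option("user", "postgres")
  .option("password", "<password>")
  .load()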
You can see in the source code its definition:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L833
In addition, you can see in the JDBCWriteSuite how it affects the JDBC connector:
https://github.com/apache/spark/blob/ee95ec35b4f711fada4b62bc27281252850bb475/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala

Cloud Dataflow GlobalWindow trigger ignored

The AfterPane.elementCountAtLeast trigger does not work when using the Dataflow runner, but it works correctly when run locally. When run on Dataflow, it produces only a single pane.
The goal is to extract data from Cloud SQL, transform it, and write it to Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this would do.
The complete code is:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  // produce one global window with one pane per ~500 records
  .withGlobalWindow(WindowOptions(
    trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
    accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
  ))
val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)
  .withShardNameTemplate("-P-S")
  .withWindowedWrites() // gets us one file per window & pane
pipe.saveAsCustomOutput("writer", out)
I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin,PCollection> and a single call to processElement outputs the entire SQL query result:
public void processElement(ProcessContext context) throws Exception {
  try (PreparedStatement statement =
      connection.prepareStatement(
          query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
    statement.setFetchSize(fetchSize);
    parameterSetter.setParameters(context.element(), statement);
    try (ResultSet resultSet = statement.executeQuery()) {
      while (resultSet.next()) {
        context.output(rowMapper.mapRow(resultSet));
      }
    }
  }
}
In the end, I had two problems to resolve:
1. The process would run out of memory, and 2. the data was written to a single file.
There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver. The driver loads the entire result set within a single call to executeQuery. There is a way to get the driver to stream results, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
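For reference, a sketch of the streaming mode alluded to above with plain JDBC and MySQL Connector/J (the URL, credentials, and query are placeholders): with TYPE_FORWARD_ONLY, CONCUR_READ_ONLY, and a fetch size of Integer.MIN_VALUE the driver streams rows one at a time instead of materializing the whole result set.
import java.sql.{DriverManager, ResultSet}

val conn = DriverManager.getConnection("jdbc:mysql://<host>/<db>", "<user>", "<password>")
val stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
stmt.setFetchSize(Integer.MIN_VALUE) // MySQL-specific signal to stream the result set row by row
val rs = stmt.executeQuery("SELECT ...") // rows can now be consumed incrementally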
For the second problem, I used the row number to set the timestamp of each element. That allows me to explicitly control how many rows are in each window using FixedWindows.
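A minimal sketch of that idea, assuming a hypothetical SCollection of (rowNumber, line) pairs coming out of the custom source:
import org.joda.time.{Duration, Instant}

// rows: SCollection[(Long, String)] of (row number, formatted line) -- hypothetical
val windowed = rows
  .timestampBy { case (rowNum, _) => new Instant(rowNum * 1000L) } // one synthetic second per row
  .withFixedWindows(Duration.standardSeconds(500))                 // ~500 rows per window
  .map { case (_, line) => line }
Combined with the windowed TextIO write above, each fixed window then ends up in its own file.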
elementCountAtLeast is a lower bound, so producing only one pane is a valid option for a runner.
You have a couple of options when doing this for a batch pipeline:
Allow the runner to decide how big the files are and how many shards are written:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
pipe.saveAsCustomOutput("writer", out)
This is typically the fastest option when the TextIO is preceded by a GroupByKey or a source that supports splitting. To my knowledge JDBC doesn't support splitting, so your best option is to add a Reshuffle after the jdbcSelect, which will enable parallelization of processing after reading the data from the database.
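For illustration, a sketch of one possible placement of the Reshuffle; it is shown here after the rows have been mapped to strings purely so the element type is concrete, but the important part is that it sits after the database read:
import org.apache.beam.sdk.transforms.Reshuffle

val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  // redistribute elements across workers, breaking fusion with the JDBC read
  .applyTransform(Reshuffle.viaRandomKey[String]())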
Manually group into batches using the GroupIntoBatches transform.
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  .apply(GroupIntoBatches.ofSize(500))
val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)
pipe.saveAsCustomOutput("writer", out)
In general, this will be slower than option #1, but it does allow you to choose how many records are written per file.
There are a few other ways to do this, each with their pros and cons, but the above two are likely the closest to what you want. If you add more details to your question, I may revise this answer further.

How to insert similar value into multiple locations of a psycopg2 query statement using dict? [duplicate]

I have a Python script that runs a pgSQL file through SQLAlchemy's connection.execute function. Here's the block of code in Python:
results = pg_conn.execute(sql_cmd, beg_date = datetime.date(2015,4,1), end_date = datetime.date(2015,4,30))
And here's one of the areas where the variable gets inputted in my SQL:
WHERE
  (dv.date >= %(beg_date)s AND
   dv.date <= %(end_date)s)
When I run this, I get a cryptic python error:
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) argument formats can't be mixed
…followed by a huge dump of the offending SQL query. I've run this exact code with the same variable convention before. Why isn't it working this time?
I encountered a similar issue to Nikhil's. I had a query with LIKE clauses that worked until I modified it to include a bind variable, at which point I received the following error:
DatabaseError: Execution failed on sql '...': argument formats can't be mixed
The solution is not to give up on the LIKE clause; it would be pretty surprising if psycopg2 simply didn't permit LIKE clauses. Rather, we can escape the literal % with %%. For example, the following query:
SELECT *
FROM people
WHERE start_date > %(beg_date)s
AND name LIKE 'John%';
would need to be modified to:
SELECT *
FROM people
WHERE start_date > %(beg_date)s
AND name LIKE 'John%%';
More details in the pscopg2 docs: http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries
As it turned out, I had used a SQL LIKE operator in the new SQL query, and the % wildcard was interfering with the parameter escaping. For instance:
dv.device LIKE 'iPhone%' or
dv.device LIKE '%Phone'
Another answer offered a way to un-escape and re-escape, which I felt would add unnecessary complexity to otherwise simple code. Instead, I used pgSQL's ability to handle regex to modify the SQL query itself. This changed the above portion of the query to:
dv.device ~ E'iPhone.*' or
dv.device ~ E'.*Phone$'
So for others: you may need to change your LIKE operators to regex '~' to get it to work. Just remember that it'll be WAY slower for large queries. (More info here.)
For me, it turned out I had a % in a SQL comment:
/* Any future change in the testing size will not require
a change here... even if we do a 100% test
*/
This works fine:
/* Any future change in the testing size will not require
a change here... even if we do a 100pct test
*/

Job executed with no data in Spark Streaming

My code:
// messages is JavaPairDStream<K, V>
Fun01(messages)
Fun02(messages)
Fun03(messages)
Fun01, Fun02, and Fun03 all contain transformations and output operations (foreachRDD).
Fun01 and Fun03 both execute as expected, which proves that "messages" is neither null nor empty.
On the Spark application UI, I can find Fun02's output stage under "Spark stages", which proves that it was executed.
The first step of Fun02 is a map function, and I added logging to it. I also added logging for every step in Fun02, and they all show that no data comes through.
Does anybody know possible reasons? Thanks very much.
@maasg Fun02's logic is:
msg_02 = messages.mapToPair(...)
msg_03 = msg_02.reduceByKeyAndWindow(...)
msg_04 = msg_03.mapValues(...)
msg_05 = msg_04.reduceByKeyAndWindow(...)
msg_06 = msg_05.filter(...)
msg_07 = msg_06.filter(...)
msg_07.cache()
msg_07.foreachRDD(...)
I have tested on Spark 1.1 and Spark 1.2, which are the versions supported by my company's Spark cluster.
It seems that this is a bug in Spark 1.1 and Spark 1.2, fixed in Spark 1.3.
I posted my test results here: http://secfree.github.io/blog/2015/05/08/spark-streaming-reducebykeyandwindow-data-lost.html .
When two reduceByKeyAndWindow operations are used back to back, data loss may appear depending on the window and slide values.
I cannot find the bug in Spark's issue list, so I cannot get the patch.

PostgreSQL jsonb, `?` and JDBC

I am using PostgreSQL 9.4 and the awesome JSONB field type. I am trying to query against a field in a document. The following works in the psql CLI
SELECT id FROM program WHERE document -> 'dept' ? 'CS'
When I try to run the same query via my Scala app, I'm getting the error below. I'm using Play framework and Anorm, so the query looks like this
SQL(s"SELECT id FROM program WHERE document -> 'dept' ? {dept}")
.on('dept -> "CS")
....
SQLException: : No value specified for parameter 5.
(SimpleParameterList.java:223)
(in my actual queries there are more parameters)
I can get around this by casting my parameter to type jsonb and using the #> operator to check containment.
SQL(s"SELECT id FROM program WHERE document -> 'dept' #> {dept}::jsonb")
.on('dept -> "CS")
....
I'm not too keen on the workaround. I don't know if there are performance penalties for the cast, but it's extra typing, and non-obvious.
Is there anything else I can do?
As a workaround to avoid the ? operator, you could create a new operator that does exactly the same thing.
This is the code of the original operator:
CREATE OPERATOR ?(
  PROCEDURE = jsonb_exists,
  LEFTARG = jsonb,
  RIGHTARG = text,
  RESTRICT = contsel,
  JOIN = contjoinsel);
SELECT '{"a":1, "b":2}'::jsonb ? 'b'; -- true
Use a different name, without any conflicts, like #-# and create a new one:
CREATE OPERATOR #-#(
  PROCEDURE = jsonb_exists,
  LEFTARG = jsonb,
  RIGHTARG = text,
  RESTRICT = contsel,
  JOIN = contjoinsel);
SELECT '{"a":1, "b":2}'::jsonb #-# 'b'; -- true
Use this new operator in your code and it should work.
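For example, the Anorm query from the question could then look like this (a hypothetical adaptation, reusing the question's parameter):
SQL(s"SELECT id FROM program WHERE document -> 'dept' #-# {dept}")
  .on('dept -> "CS")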
Check pgAdmin -> pg_catalog -> Operators for all the operators that use a ? in the name.
In JDBC (and standard SQL) the question mark is reserved as a parameter placeholder. Other uses are not allowed.
See Does the JDBC spec prevent '?' from being used as an operator (outside of quotes)? and the discussion on jdbc-spec-discuss.
The current PostgreSQL JDBC driver will transform all occurrences (outside text or comments) of a question mark to a PostgreSQL specific parameter placeholder. I am not sure if the PostgreSQL JDBC project has done anything (like introducing an escape as discussed in the links above) to address this yet. A quick look at the code and documentation suggests they didn't, but I didn't dig too deep.
Addendum: As shown in the answer by bobmarksie, current versions of the PostgreSQL JDBC driver now support escaping the question mark by doubling it (ie: use ?? instead of ?).
I had the same issue a couple of days ago and after some investigation I found this.
https://jdbc.postgresql.org/documentation/head/statement.html
In JDBC, the question mark (?) is the placeholder for the positional parameters of a PreparedStatement. There are, however, a number of PostgreSQL operators that contain a question mark. To keep such question marks in a SQL statement from being interpreted as positional parameters, use two question marks (??) as escape sequence. You can also use this escape sequence in a Statement, but that is not required. Specifically only in a Statement a single (?) can be used as an operator.
Using 2 question marks seemed to work well for me - I was using the following driver (illustrated using maven dependency) ...
<dependency>
  <groupId>org.postgresql</groupId>
  <artifactId>postgresql</artifactId>
  <version>9.4-1201-jdbc41</version>
</dependency>
... and MyBatis for creating the SQL queries, and it seemed to work well. It seemed easier / cleaner than creating a PostgreSQL operator.
SQL went from e.g.
select * from user_docs where userTags ?| array['sport','property']
... to ...
select * from user_docs where userTags ??| array['sport','property']
Hopefully this works with your scenario!
As bob said, just use ?? instead of ?:
SQL(s"SELECT id FROM program WHERE document -> 'dept' ?? {dept}")
.on('dept -> "CS")