Cassandra-Hector-Scala: How can I get all row composite keys in a column family?

My data storage format is:
Family name: Test
Rowkey: comkey1:comkey2
=>(name=name,value='xyz',timestamp=1554515485)
-------------------------------------------------------
Rowkey: comkey1:comkey3
=>(name=name,value='abc',timestamp=1554515485)
-------------------------------------------------------
Rowkey: comkey1:comkey4
=>(name=name,value='pqr',timestamp=1554515485)
-------------------------------------------------------
Now I want to fetch all composite keys from the "Test" family,
and I am trying:
def test = Action {
  val cluster = HFactory.getOrCreateCluster("Test Cluster", "127.0.0.1:9160");
  val keyspace = HFactory.createKeyspace("winoriatest", cluster)

  var startKey = new Composite();
  var endKey = new Composite();
  startKey.addComponent("comkey1", StringSerializer.get());
  startKey.addComponent("comkey2", StringSerializer.get());
  endKey.addComponent("comkey1", StringSerializer.get());
  endKey.addComponent("comkey4", StringSerializer.get());

  val rangeSlicesQuery = HFactory.createRangeSlicesQuery(keyspace, CompositeSerializer.get(), StringSerializer.get(), StringSerializer.get())
  rangeSlicesQuery.setColumnFamily("test");
  // CompositeSerializer.get() is not working.
  rangeSlicesQuery.setKeys(startKey, endKey)
  rangeSlicesQuery.setRange(null, null, false, Integer.MAX_VALUE);
  rangeSlicesQuery.setReturnKeysOnly()

  val result = rangeSlicesQuery.execute()
  val orderedRows = result.get();

  import scala.collection.JavaConversions._
  for (sc <- orderedRows) {
    println(sc.getKey())
  }

  Ok(views.html.index("Your new application is ready."))
}
Error: [NullPointerException: null] on the line
val result = rangeSlicesQuery.execute()
Cassandra 2.0, Scala 2.10.2.
Thank you in advance for your help.
It gives me a NullPointerException, yet the same code works in Java.
My Java code is:
Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "127.0.0.1:9160");
Keyspace keyspace = HFactory.createKeyspace("winoriatest", cluster);
Serializer<String> se = StringSerializer.get();
Serializer<Long> le = LongSerializer.get();
Serializer<Integer> ie = IntegerSerializer.get();
CompositeSerializer ce = new CompositeSerializer();
RangeSlicesQuery<Composite, String, byte[]> rangeSliceQuery = HFactory.createRangeSlicesQuery(keyspace, ce, se, BytesArraySerializer.get());
rangeSliceQuery.setColumnFamily("test");
rangeSliceQuery.setRange(null, null, false, Integer.MAX_VALUE);
QueryResult<OrderedRows<Composite, String, byte[]>> result = rangeSliceQuery.execute();
OrderedRows<Composite, String, byte[]> orderedRows = result.get();
for (Row<Composite, String, byte[]> r : orderedRows) {
    System.out.println("Compositekey=" + r.getKey().get(0, se) + ":" + r.getKey().get(1, se));
}

I'm not quite sure what "I want to fetch all composite keys from the Test family" means. If you mean you want to get just the partition [row] key components, then you can do this in CQL as simply as:
SELECT DISTINCT a, b FROM test
(Assigning a and b to be the column names.)
This is a good example of how much simpler CQL makes Cassandra development, which is why we're pushing people to use the native CQL driver over legacy clients like Hector.
For more on how CQL makes sense of a Thrift data model like this, see http://www.datastax.com/dev/blog/cql3-for-cassandra-experts.
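For completeness, here is a minimal sketch (not from the original post) of running that DISTINCT query from Scala with the DataStax Java driver, assuming the driver 2.x API, the winoriatest keyspace from the question, and placeholder partition key columns a and b:
import com.datastax.driver.core.Cluster
import scala.collection.JavaConversions._

// Assumptions: Cassandra's native transport is reachable on the default port,
// and the partition key columns are really named a and b (adjust to your schema).
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
try {
  val session = cluster.connect("winoriatest")
  val rows = session.execute("SELECT DISTINCT a, b FROM test")
  for (row <- rows) {
    println(row.getString("a") + ":" + row.getString("b"))
  }
} finally {
  cluster.close()
}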

Related

H2 database content is not persisting on insert and update

I am using an H2 database to test my Postgres Slick functionality.
I created the H2DBComponent below:
trait H2DBComponent extends DbComponent {
  val driver = slick.jdbc.H2Profile
  import driver.api._

  val h2Url = "jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=false;INIT=runscript from './test/resources/schema.sql'\\;runscript from './test/resources/schemadata.sql'"

  val logger = LoggerFactory.getLogger(this.getClass)

  val db: Database = {
    logger.info("Creating test connection ..................................")
    Database.forURL(url = h2Url, driver = "org.h2.Driver")
  }
}
In the above snippet I am creating my tables with schema.sql and inserting a single row (record) with schemadata.sql.
Then I try to insert a record into the table from my test case, as below:
class RequestRepoTest extends FunSuite with RequestRepo with H2DBComponent {
  test("Add new Request") {
    val response = insertRequest(Request("XYZ", "tk", "DM", "RUNNING", "0.1", "l1", "file1",
      Timestamp.valueOf("2016-06-22 19:10:25"), Some(Timestamp.valueOf("2016-06-22 19:10:25")), Some("scienceType")))
    val actualResult = Await.result(response, 10 seconds)
    assert(actualResult === 1)

    val response2 = getAllRequest()
    assert(Await.result(response2, 5 seconds).size === 2)
  }
}
The insert assert passes, indicating that the record was inserted. But the getAllRequest() assert fails: the output still contains only the single row inserted by schemadata.sql, which means the insertRequest change is not persisted. Yet the statements below report that the record was inserted, since the insert returned 1 (one record inserted).
val response = insertRequest(Request("CMP_XYZ","tesco_uk", "DM", "RUNNING", "0.1", "l1", "file1",
Timestamp.valueOf("2016-06-22 19:10:25"), Some(Timestamp.valueOf("2016-06-22 19:10:25")),
Some("scienceType")))
val actualResult=Await.result(response,10 seconds)
Below is my definition of insertRequest:
def insertRequest(request: Request): Future[Int] = {
  db.run { requestTableQuery += request }
}
I am unable to figure out how I can see the inserted record. Is there any property/config I need to add?
But the getAllRequest() assert fails as the output still contains the single row (as inserted by schemadata.sql), which means the insertRequest change is not persisted.
I would double-check that the assert(Await.result(response2, 5 seconds).size === 2) line is failing because of a size difference. Could it be failing for some other, more general reason?
For example, since INIT is run on each connection, it could be that you are re-creating the database for each connection. Unless you're careful with the SQL, that could produce an error such as "table already exists". Adding TRACE_LEVEL_SYSTEM_OUT=2; to your H2 URL can be helpful in tracking what H2 is doing, as shown below.
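A minimal sketch of the question's h2Url with tracing enabled (the only change is the added TRACE_LEVEL_SYSTEM_OUT parameter):
// Same in-memory URL as in H2DBComponent, with H2 tracing to stdout turned on.
val h2UrlWithTrace = "jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=false;TRACE_LEVEL_SYSTEM_OUT=2;INIT=runscript from './test/resources/schema.sql'\\;runscript from './test/resources/schemadata.sql'"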
A couple of suggestions.
First, you could ensure your SQL only runs as needed. For example, your schema.sql could add checks to avoid trying to create the table twice:
CREATE TABLE IF NOT EXISTS my_table( my_column VARCHAR NULL );
And likewise for your schemadata.sql:
MERGE INTO my_table KEY(my_column) VALUES ('a') ;
Alternatively, you could establish the schema and test data around your tests (e.g., in Scala code, using Slick). Your test framework probably has a way to ensure something is run before and after a test or test suite; see the sketch after this paragraph.
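A minimal sketch of that approach, assuming ScalaTest's BeforeAndAfterAll, the requestTableQuery and H2DBComponent names from the question, and that the INIT scripts have been removed from h2Url so the schema is only created here (seededRequest is a hypothetical stand-in for the row previously inserted by schemadata.sql):
import org.scalatest.{BeforeAndAfterAll, FunSuite}
import scala.concurrent.Await
import scala.concurrent.duration._

class RequestRepoTest extends FunSuite with RequestRepo with H2DBComponent with BeforeAndAfterAll {

  import driver.api._

  override def beforeAll(): Unit = {
    // Build the schema and seed the single starting row in code instead of INIT scripts.
    val setup = DBIO.seq(
      requestTableQuery.schema.create,
      requestTableQuery += seededRequest // hypothetical Request value
    )
    Await.result(db.run(setup), 10.seconds)
  }

  override def afterAll(): Unit = {
    Await.result(db.run(requestTableQuery.schema.drop), 10.seconds)
  }

  // ... the existing "Add new Request" test stays unchanged ...
}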

Problems joining 2 kafka streams (using custom timestampextractor)

I'm having problems joining two Kafka streams while extracting the date from the fields of my events. The join works fine when I do not define a custom TimestampExtractor, but once I do, the join does not work anymore. My topology is quite simple:
val builder = new StreamsBuilder()

val couponConsumedWith = Consumed.`with`(Serdes.String(),
  getAvroCouponSerde(schemaRegistryHost, schemaRegistryPort))
val couponStream: KStream[String, Coupon] = builder.stream(couponInputTopic, couponConsumedWith)

val purchaseConsumedWith = Consumed.`with`(Serdes.String(),
  getAvroPurchaseSerde(schemaRegistryHost, schemaRegistryPort))
val purchaseStream: KStream[String, Purchase] = builder.stream(purchaseInputTopic, purchaseConsumedWith)

val couponStreamKeyedByProductId: KStream[String, Coupon] = couponStream.selectKey(couponProductIdValueMapper)
val purchaseStreamKeyedByProductId: KStream[String, Purchase] = purchaseStream.selectKey(purchaseProductIdValueMapper)

val couponPurchaseValueJoiner = new ValueJoiner[Coupon, Purchase, Purchase]() {
  override def apply(coupon: Coupon, purchase: Purchase): Purchase = {
    val discount = (purchase.getAmount * coupon.getDiscount) / 100
    new Purchase(purchase.getTimestamp, purchase.getProductid, purchase.getProductdescription, purchase.getAmount - discount)
  }
}

val fiveMinuteWindow = JoinWindows.of(TimeUnit.MINUTES.toMillis(10))

val outputStream: KStream[String, Purchase] = couponStreamKeyedByProductId.join(purchaseStreamKeyedByProductId,
  couponPurchaseValueJoiner,
  fiveMinuteWindow
)
outputStream.to(outputTopic)
builder.build()
As I said, this code works like a charm when I do not use a custom TimestampExtractor, but when I do, by setting StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG to my custom extractor class (I've double-checked that the class extracts the date properly), the join does not work anymore.
I'm testing the topology by running a unit test and passing the following events to it:
val coupon1 = new Coupon("Dec 05 2018 09:10:00.000 UTC", "1234", 10F)

// Purchase within the five minutes after the coupon - the discount should be applied
val purchase1 = new Purchase("Dec 05 2018 09:12:00.000 UTC", "1234", "Green Glass", 25.00F)
val purchase1WithDiscount = new Purchase("Dec 05 2018 09:12:00.000 UTC", "1234", "Green Glass", 22.50F)

val couponRecordFactory1 = couponRecordFactory.create(couponInputTopic, "c1", coupon1)
val purchaseRecordFactory1 = purchaseRecordFactory.create(purchaseInputTopic, "p1", purchase1)

testDriver.pipeInput(couponRecordFactory1)
testDriver.pipeInput(purchaseRecordFactory1)

val outputRecord1 = testDriver.readOutput(outputTopic,
  new StringDeserializer(),
  JoinTopologyBuilder.getAvroPurchaseSerde(
    schemaRegistryHost,
    schemaRegistryPort).deserializer())

OutputVerifier.compareKeyValue(outputRecord1, "1234", purchase1WithDiscount)
Not sure if the step of selecting a new key is getting rid of the proper date. I have tested a lot of combinations with no luck :(
Any help would be really appreciated!
I can't be sure, since I don't know how thoroughly you have tested your code, but my guess is that:
1) your code works with the default timestamp extractor because it uses the time at which you send records into the pipe as the record timestamps, so it basically works because in your test you send the data one record after another without a pause.
2) you are using the TopologyTestDriver to run your tests!
Note that the TopologyTestDriver is very useful for testing your business code and the topology as a unit (given these inputs, are the outputs correct), but there isn't a Kafka Streams app actually running in those tests.
In your case you can play with the advanceWallClockTime(long) method of the TopologyTestDriver class to simulate the system time advancing, as in the sketch below.
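A minimal sketch of that idea, reusing testDriver and the record factories from the test above (the two-minute advance is just an illustrative value):
import java.util.concurrent.TimeUnit

testDriver.pipeInput(couponRecordFactory1)

// Pretend two minutes of wall-clock time pass before the purchase arrives.
testDriver.advanceWallClockTime(TimeUnit.MINUTES.toMillis(2))

testDriver.pipeInput(purchaseRecordFactory1)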
If you want to run the real topology, you will have to write an integration test with an embedded Kafka cluster (there is one in the Kafka libraries that works just fine!).
Let me know if that helps :-)
Thank you for replying. I was working on this yesterday and I think I found the problem. As you said, I am using the TopologyTestDriver to run my tests, and when you initialize the TopologyTestDriver class it uses an initialWallClockTime; if you do not provide a value, the TopologyTestDriver picks up currentTimeMillis:
public TopologyTestDriver(Topology topology, Properties config) {
    this(topology, config, System.currentTimeMillis());
}
There is another constructor that allows you to pass in an initialWallClockTime. I've been testing that constructor, but for some reason it does not work for me.
So, to sum up, my solution has been to create the Purchase and Coupon objects with the current timestamp. I'm still using my custom timestamp extractor, but instead of hardcoding a date I always use the current timestamp, and this way the join works fine.
I'm not fully happy with my end solution, because I don't know why initialWallClockTime does not work for me, but at least the tests are working fine now.
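For reference, a minimal sketch of that other constructor (an assumption about wiring, reusing the topology and config that the test passes to the driver; the start time is just an example value):
// Hypothetical: start the driver's wall clock at the coupon's own timestamp
// rather than at System.currentTimeMillis().
val initialWallClockTimeMs = java.time.Instant.parse("2018-12-05T09:10:00Z").toEpochMilli
val testDriver = new TopologyTestDriver(topology, config, initialWallClockTimeMs)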

Read AVRO structures saved in HBase columns

I am new to Spark and HBase. I am working with backups of an HBase table. These backups are in an S3 bucket. I am reading them via Spark (Scala) using newAPIHadoopFile like this:
conf.set("io.serializations", "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization")
val data = sc.newAPIHadoopFile(path,classOf[SequenceFileInputFormat[ImmutableBytesWritable, Result]], classOf[ImmutableBytesWritable], classOf[Result], conf)
The table in question is called Emps. The schema of Emps is:
key: empid {COMPRESSION => 'gz' }
family: data
dob - date of birth of this employee.
e_info - avro structure for storing emp info.
e_dept- avro structure for storing info about dept.
family: extra - Extra Metadata {NAME => 'extra', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
e_region - emp region
e_status - some data about his achievements
... some more metadata
The table has some columns that contain simple string data, and some columns that contain AVRO structures.
I am trying to read this data directly from the HBase backup files in S3. I do not want to re-create this HBase table on my local machine, as the table is very, very large.
This is how I am trying to read this:
data.keys.map{k=>(new String(k.get()))}.take(1)
res1: Array[String] = Array(111111111100011010102462)
data.values.map { v =>
  for (cell <- v.rawCells()) yield {
    val family = CellUtil.cloneFamily(cell);
    val column = CellUtil.cloneQualifier(cell);
    val value = CellUtil.cloneValue(cell);
    new String(family) + "->" + new String(column) + "->" + new String(value)
  }
}.take(1)
res2: Array[Array[String]] = Array(Array(info->dob->01/01/1996, info->e_info->?ж�?�ո� ?�� ???̶�?�ո� ?�� ????, info->e_dept->?ж�??�ո� ?̶�??�ո� �ո� ??, extra->e_region-> CA, extra->e_status->, .....))
As expected I can see the simple string data correctly, but the AVRO data is garbage.
I tried reading the AVRO structures using GenericDatumReader:
data.values.map { v =>
  for (cell <- v.rawCells()) yield {
    val family = new String(CellUtil.cloneFamily(cell));
    val column = new String(CellUtil.cloneQualifier(cell));
    val value = CellUtil.cloneValue(cell);
    if (column == "e_info") {
      var schema_obj = new Schema.Parser
      // schema_e_info contains the AVRO schema for e_info
      var schema = schema_obj.parse(schema_e_info)
      var READER2 = new GenericDatumReader[GenericRecord](schema)
      var datum = READER2.read(null, DecoderFactory.defaultFactory.createBinaryDecoder(value, null))
      var result = datum.get("type").toString()
      family + "->" + column + "->" + new String(result) + "\n"
    } else {
      family + "->" + column + "->" + new String(value) + "\n"
    }
  }
}
But this is giving me the following error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
... 74 elided
Caused by: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema
Serialization stack:
- object not serializable (class: org.apache.avro.Schema$RecordSchema, value: .....
So I want to ask:
Is there any way to make the non-serializable class RecordSchema work with the map function?
Is my approach right up to this point? I would be glad to know about better approaches to handle this kind of data.
I read that handling this in a DataFrame would be a lot easier. I tried to convert the resulting Hadoop RDD into a DataFrame, but again I am running blindly there.
As the exception says, the schema is non-serializable. Can you initialize it inside the mapper function, so that it doesn't need to be shipped from the driver to the executors?
Alternatively, you can also create a Scala singleton object that contains the schema. One instance of the singleton is initialized on each executor, so when you access any member of the singleton it doesn't need to be serialized and sent across the network. This also avoids the unnecessary overhead of re-creating the schema for each and every row in the data; a sketch follows.
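A minimal sketch of the singleton approach, assuming the schema_e_info JSON string from the question is visible as a constant where this object is defined (all names here are illustrative):
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Initialized lazily, once per executor JVM, the first time it is touched.
object EInfoAvro {
  // Assumption: schema_e_info is a constant schema-JSON string accessible here.
  lazy val schema: Schema = new Schema.Parser().parse(schema_e_info)

  def parse(bytes: Array[Byte]): GenericRecord = {
    val reader = new GenericDatumReader[GenericRecord](schema)
    reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null))
  }
}

// Inside the map over cells, only the raw bytes travel; the schema lives on each executor:
// if (column == "e_info") EInfoAvro.parse(value).get("type").toString else new String(value)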
Just for the purpose of checking that you can read the data fine, you can also keep the values as plain byte arrays on the executors, collect a sample to the driver, and do the deserialization (parsing the AVRO data) in the driver code. This obviously won't scale; it's just to make sure your data looks good and to avoid Spark-related complications while you write your prototype code to extract the data. For example:
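Reusing the hypothetical EInfoAvro helper from the previous sketch, a quick driver-side check might look like this (the 10-record sample size is arbitrary):
// Collect a small sample of raw e_info byte arrays to the driver and parse them there.
val eInfoSample: Array[Array[Byte]] = data.values
  .flatMap(v => v.rawCells())
  .filter(cell => new String(CellUtil.cloneQualifier(cell)) == "e_info")
  .map(cell => CellUtil.cloneValue(cell))
  .take(10)

// Deserialization happens only on the driver, so nothing Avro-related is shipped to executors.
eInfoSample.foreach(bytes => println(EInfoAvro.parse(bytes)))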

create table in phoenix from spark

Hi, I need to create a table in Phoenix from a Spark job. I have tried the two ways below, but neither of them works; it seems this is still not supported.
1) DataFrame.write still requires that the table already exists:
df.write.format("org.apache.phoenix.spark").mode("overwrite").option("table", schemaName.toUpperCase + "." + tableName.toUpperCase ).option("zkUrl", hbaseQuorum).save()
2) If we connect to Phoenix through JDBC and try to execute the CREATE statement, we get a parsing error (the same CREATE works in Phoenix):
var ddlCode="create table test (mykey integer not null primary key, mycolumn varchar) "
val driver = "org.apache.phoenix.jdbc.PhoenixDriver"
val jdbcConnProps = new Properties()
jdbcConnProps.setProperty("driver", driver);
val jdbcConnString = "jdbc:phoenix:hostname:2181/hbase-unsecure"
sqlContext.read.jdbc(jdbcConnString, ddlCode, jdbcConnProps)
error:
org.apache.phoenix.exception.PhoenixParserException: ERROR 601 (42P00): Syntax error. Encountered "create" at line 1, column 15.
Anyone with similar challenges that managed to do it differently?
I have finally worked out a solution for this. Basically, I think I was wrong in trying to use the SQLContext read method for this; I think that method is designed just to "read" data sources. The way to work around it has been to open a standard JDBC connection against Phoenix:
var ddlCode="create table test (mykey integer not null primary key, mycolumn varchar) "
val jdbcConnString = "jdbc:hostname:2181/hbase-unsecure"
val user="USER"
val pass="PASS"
var connection:Connection = null
Class.forName(driver)
connection = DriverManager.getConnection(jdbcConnString, user, pass)
val statement = connection.createStatement()
statement.executeUpdate(ddlCode)
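A possible follow-up (a sketch, assuming the same df, hbaseQuorum and Phoenix Spark connector from attempt 1): once the table exists, the original DataFrame write should be able to find it.
// The table created above is named TEST (no schema prefix), so target it directly.
df.write
  .format("org.apache.phoenix.spark")
  .mode("overwrite")
  .option("table", "TEST")
  .option("zkUrl", hbaseQuorum)
  .save()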

How to save design document / view in CouchDB database from Scala program?

I have written a Scala program that creates a new database and adds documents/views to it.
object CouchDBTest extends App {
  val dbSession = new Session("localhost", 5984)
  val db = dbSession.createDatabase("couchschooltest")

  val newC1 = new Document
  newC1.put("Type", "Class")
  newC1.put("ClassId", "C1")
  newC1.put("ClassName", "C-2A")
  newC1.put("ClassTeacher", "T1")
  newC1.accumulate("Students", "S1")
  newC1.accumulate("Students", "S2")
  newC1.accumulate("Students", "S3")
  db.saveDocument(newC1)

  val viewDocClass = new Document
  viewDocClass.addView("Class", "function(doc) {if(doc.Type == 'Class') { emit([doc.ClassId, doc.ClassName, doc.ClassTeacher, doc.Students], doc);}}")
  db.saveDocument(viewDocClass)
}
When I run this code, it creates the new database in CouchDB and adds the Class document to it. However, it doesn't add the view. It gives a runtime error while adding viewDocClass:
Error adding document - null null
For this I used the couchdb4j API.