unable to convert a String array into blob in Cassandra - scala

I am sending an image in base64 format in a JSON message. I want to store the image in Cassandra. The corresponding column in Cassandra is image list<blob>.
The image data in json is
image:["...","..."]
This JSON maps to an Array[String] in my model. The model class is
case class PracticeQuestion (id: Option[UUID],
d: String,
h: List[String],
image: Array[String],//images maps here
s: String,
f: String,
t: Set[String],
title: String,
a:String,
r:List[String])
I have read that to store images in Cassandra I need a ByteBuffer, so I use ByteBuffer.wrap() to convert the elements of the array into ByteBuffers. But my insert statement is failing.
The way I am trying to convert the Array[String] to list<blob> is
.value("image",seqAsJavaList(Seq[ByteBuffer](ByteBuffer.wrap(model.image(0).getBytes()))))
//for the moment, I am taking only 1 image
I take the first image, get its bytes, and create a ByteBuffer. Then I wrap the ByteBuffer in a Seq and convert the Seq into a Java List.
The error I see is com.datastax.driver.core.exceptions.InvalidTypeException: Value 6 of type class scala.collection.immutable.$colon$colon does not correspond to any CQL3 type
The complete insert command is
QueryBuilder.insertInto(tableName)
.value("id",model.id.get)
.value("answer",model.a)
.value("d",model.d)
.value("f",model.f)
.value("h",seqAsJavaList(model.h))
.value("image",seqAsJavaList(Seq[ByteBuffer](ByteBuffer.wrap(model.image(0).getBytes()))))
.value("r",model.r)
.value("s",model.s)
.value("t",setAsJavaSet(model.t))
.value("title",model.title)
.ifNotExists();
The database schema is
id uuid PRIMARY KEY,
a text,
d text,
f text,
h list<text>,
image list<blob>,
r list<text>,
s text,
t set<text>,
title text
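Two things stand out in the insert above, offered as a hedged reading rather than a confirmed diagnosis. First, getBytes() on a base64 string yields the bytes of the base64 text, not the decoded image. Second, if the driver counts bound values from zero, "Value 6" would be r, which is passed as a Scala List without seqAsJavaList. The sketch below uses only the JDK and the Scala standard library. ImageBlobs and toBlobList are hypothetical names, and it assumes the strings are plain base64 with no data-URI prefix.

```scala
import java.nio.ByteBuffer
import java.util.Base64
import scala.collection.JavaConverters._

object ImageBlobs {
  // Decode each base64 string to raw bytes and wrap it in a ByteBuffer.
  // Assumption: the input is plain base64 (no "data:image/...;base64," prefix).
  // The result is a java.util.List, the collection type a list<blob> value expects.
  def toBlobList(images: Seq[String]): java.util.List[ByteBuffer] =
    images.map(s => ByteBuffer.wrap(Base64.getDecoder.decode(s))).asJava
}
```

Under these assumptions the image value could be built as ImageBlobs.toBlobList(model.image), and r could be converted with seqAsJavaList(model.r) in the same way h already is.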

Row Filter for Table is invalid pyspark

I have a dataframe in pyspark coming from a view in BigQuery that I import after configuring the Spark session:
config = pyspark.SparkConf().setAll([('spark.executor.memory', '10g'),('spark.driver.memory', '30G'),\
('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0')])
sc = pyspark.SparkContext(conf=config)
spark = SparkSession.builder.master('yarn').appName('base_analitica_entidades').config(conf=config).getOrCreate()
I read this dataset through:
recomendaveis = spark.read.format("bigquery").option("viewsEnabled", "true").load("resource_group:some_group.someView")
Then I filter a specific column with isNotNull:
recomendaveis_mid = recomendaveis.filter(recomendaveis["entities_mid"].isNotNull())
This recomendaveis_mid dataset is:
DataFrame[uid: string, revision: bigint, title: string, subtitle: string, access: string, branded_content: boolean, image: string, published_in: date, changed_in: date, entities_extracted_in: string, translation_extracted_in: string, categories_extracted_in: string, bigquery_inserted_in: string, public_url: string, private_url: string, text: string, translation_en: string, authors_name: string, categories_name: string, categories_confidence: double, entities_name: string, entities_type: string, entities_salience: double, entities_mid: string, entities_wikipedia_url: string, named_entities: string, publications: string, body: string, Editoria: string, idmateria: string]
When I try to get minimum date of column published_in with:
recomendaveis_mid.select(F.min("published_in")).collect()
It throws this error:
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Row filter for table resource_group:some_group.table is invalid. Filter is '(`entities_mid` IS NOT NULL)'at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)
... 14 more
The field published_in has nothing to do with my filter on entities_mid, and when I run the date query without the entities_mid isNotNull filter, my code works fine. Any suggestions?
P.S.: There is a similar error here, but I couldn't get any other ideas from it. Thanks in advance.
We faced a similar issue in Scala Spark while reading from a view.
On analysis, we observed that when we run
df.printSchema()
df.show(1,false)
all fields are printed, even before the join operation takes place. But when loading or writing the dataframe to external storage or a table, it throws this error:
INVALID_ARGUMENT: request failed: Row filter for table
After some experimenting, we observed that if we persist the dataframe with
df.persist()
it works fine.
It looks like, after the join, the column used in the filter also needs to be present in the select. Since we don't want that column in our final dataframe, we persisted the dataframe in the cluster instead.
You can either unpersist it with
df.unpersist()
once the data operation completes, or leave it as is if you are using an ephemeral cluster, since the persisted data goes away when the cluster is deleted.

Read hive struct type and modify value

I am reading a Hive table as a dataframe and mapping it to a new dataset. I am reading specific string values from a struct type, and I want to format the values before I store them in the case class.
For example, I read the struct field "listelements.sneaker.colors", which returns an array because there are several colors. Before storing them in the new dataset, I want the colors formatted like this:
"red","blue","yellow" (quoted and comma separated)
and stored as a single string.
concat_ws joins the array elements with a comma, but I also need to enclose each one in double quotes.
session.read
.table(footWear)
.select(
$"id",
$"footWearCategory".as("category"),
concat_ws(",", $"listelements".getField("sneaker").getField("colors")).as("availableColors"))
.where($"date" === runDate)
.as[FootWearInformation]
case class FootWearInformation(id: String, category: String, availableColors: String)
One way is a UDF that quotes each element and joins the results, used in place of concat_ws:
def formatArray = udf((arr: collection.mutable.WrappedArray[String]) =>
arr.map(x => s""""$x\"""").mkString(","))
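The quoting logic inside the UDF can be checked on its own, without Spark. The sketch below is a plain-Scala version of the same idea. QuoteJoin and quoteAndJoin are hypothetical names, and it assumes the color values contain no double quotes themselves.

```scala
object QuoteJoin {
  // Wrap every element in double quotes, then join with commas.
  // Assumption: elements do not contain double quotes that would need escaping.
  def quoteAndJoin(values: Seq[String]): String =
    values.map(v => "\"" + v + "\"").mkString(",")
}
```

In the query, the UDF would then be applied to the colors column in place of concat_ws, for example formatArray($"listelements".getField("sneaker").getField("colors")).as("availableColors").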

Create nested case class instance from a DataFrame

I have these two case classes:
case class Inline_response_200(
nodeid: Option[String],
data: Option[List[ReadingsByEpoch_data]]
)
and
case class ReadingsByEpoch_data(
timestamp: Option[Int],
value: Option[String]
)
And I have a Cassandra table that has data like nodeid|timestamp|value. Basically, each nodeid has multiple timestamp-value pairs.
All I want to do is create instances of Inline_response_200 with their proper List of ReadingsByEpoch_data so that Jackson can serialize them properly to JSON.
I've tried
val res = sc.cassandraTable[Inline_response_200]("test", "taghistory").limit(100).collect()
But I get this error
java.lang.IllegalArgumentException: Failed to map constructor parameter data in com.wordnik.client.model.Inline_response_200 to a column of test.taghistory
Makes total sense because there is no column data in my Cassandra table. But then how can I create the instances correctly?
Cassandra table looks like this:
CREATE TABLE test.taghistory (
nodeid text,
timestamp text,
value text,
PRIMARY KEY (nodeid, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
EDIT
As per Alex Ott's suggestion:
val grouped = data.groupByKey.map {
case (k, v) =>
Inline_response_200(k.getString(0), v.map(x => ReadingsByEpoch_data(x.getInt(1), x.getString(2))).toList)
}
grouped.collect().toList
I'm close but not there yet. This gives me the format I expect, but it's creating one instance of Inline_response_200 per record:
[{"nodeid":"Tag3","data":[{"timestamp":1519411780,"value":"80.0"}]},{"nodeid":"Tag3","data":[{"timestamp":1519411776,"value":"76.0"}]}]
In this example I need to have one nodeid key, and an array of two timestamp-value pairs, like this:
[{"nodeid":"Tag3","data":[{"timestamp":1519411780,"value":"80.0"},{"timestamp":1519411776,"value":"76.0"}]}]
Maybe I'm grouping the wrong way?
If you have data like nodeid|timestamp|value in your DB (and according to the schema, you do), you can't map it directly into the structure you created. Read the data from the table as a pair RDD:
val data = sc.cassandraTable[(String,String,Option[String])]("test", "taghistory")
.select("nodeid","timestamp","value").keyBy[String]("nodeid")
and then transform it into the structure you need by calling groupByKey on that pair RDD and mapping each group to an Inline_response_200, like this:
val grouped = data.groupByKey.map { case (k, v) =>
// nodeid, data and timestamp are Options in the model, so wrap them;
// timestamp is text in the table, hence the toInt
Inline_response_200(Some(k), Some(v.map(x => ReadingsByEpoch_data(Some(x._2.toInt), x._3)).toList))
}
grouped.collect
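The grouping step can be tried out with ordinary Scala collections before running it on a cluster. The sketch below is a plain-collections analogue of groupByKey, not connector code. Reading, NodeReadings and GroupRows are hypothetical stand-ins for the model classes, and the rows are assumed to be (nodeid, timestamp, value) tuples with a numeric timestamp stored as text.

```scala
// Hypothetical stand-ins for ReadingsByEpoch_data and Inline_response_200.
case class Reading(timestamp: Option[Int], value: Option[String])
case class NodeReadings(nodeid: Option[String], data: Option[List[Reading]])

object GroupRows {
  // Each row is (nodeid, timestamp, value), mirroring the table's columns.
  // groupBy collects all rows sharing a nodeid, so each nodeid yields one instance.
  def group(rows: Seq[(String, String, Option[String])]): List[NodeReadings] =
    rows.groupBy(_._1).toList.sortBy(_._1).map { case (nodeid, rs) =>
      val readings = rs.map(r => Reading(Some(r._2.toInt), r._3)).toList
      NodeReadings(Some(nodeid), Some(readings))
    }
}
```

Two rows for Tag3 produce a single NodeReadings whose data list holds both timestamp-value pairs, which is the shape the question asks for.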

no viable alternative at input : Siddhi Query

I am trying to write a simple Siddhi query that imports a custom mapped stream. But as soon as I import the stream and validate the query, it gives an error.
My complete query is
#Import('bro.in.ssh.log:1.0.0')
define stream inStream (ts string, uid string, id.orig_h string, id.orig_p int, id.resp_h string, id.resp_p int, version int, client string, server string, cipher_alg string, mac_alg string, compression_alg string, kex_alg string, host_key_alg string, host_key string);
#Export('bro.out.ssh.log:1.0.0')
define stream outStream (ts string, ssh_logins int);
from inStream
select dateFormat (ts,'yyyy-MM-dd HH:mm') as formatedTs, count
group by formatedTs
insert into outStream;
All I want is to count the number of records in a log for a single minute and export the time and count to an output stream. But I am getting errors even at the very first line.
My input is a Bro IDS log file, ssh.log. A sample record looks like this:
{"ts":"2016-05-08T08:59:47.363764Z","uid":"CLuCgz3HHzG7LpLwH9","id.orig_h":"172.30.26.119","id.orig_p":51976,"id.resp_h":"172.30.26.160","id.resp_p":22,"version":2,"client":"SSH-2.0-OpenSSH_5.0","server":"SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6","cipher_alg":"arcfour256","mac_alg":"hmac-md5","compression_alg":"none","kex_alg":"diffie-hellman-group-exchange-sha1","host_key_alg":"ssh rsa","host_key":"8d:df:71:ac:29:1f:67:6f:f3:dd:c3:e5:2e:5f:3e:b4"}
Siddhi does not allow an attribute name to contain the dot ('.') character. Please edit the event stream so that attribute names such as id.orig_h no longer contain a dot.

Assertion on retrieving data from cassandra

I defined a class to map rows of a Cassandra table:
case class Log(
val time: Long,
val date: String,
val appId: String,
val instanceId: String,
val appName: String,
val channel: String,
val originCode: String,
val message: String) {
}
I created an RDD to hold all my tuples:
val logEntries = sc.cassandraTable[Log]("keyspace", "log")
To see if everything works, I printed this:
println(logEntries.count()) -> works, prints the number of tuples retrieved.
println(logEntries.first()) -> exception on this line
java.lang.AssertionError: assertion failed: Missing columns needed by
com.model.Log: app_name, app_id, origin_code, instance_id
The columns of my table log in Cassandra are:
time bigint, date text, appid text, instanceid text, appname text, channel text, origincode text, message text
What's wrong?
As mentioned in the spark-cassandra-connector docs, the column name mapper has its own logic for converting case class parameters to column names:
For multi-word column identifiers, separate each word by an underscore in Cassandra, and use the camel case convention on the Scala side.
So if you use case class Log(appId:String, instanceId:String) with camel-cased parameters, they are automatically mapped to underscore-separated names: app_id text, instance_id text. They cannot be automatically mapped to appid text, instanceid text, because those column names are missing the underscore.
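To see which column names the camel-case convention implies, the rule can be approximated in a few lines of plain Scala. This is only a rough illustration of the convention quoted above, not the connector's actual implementation. ColumnNames and toUnderscore are hypothetical names.

```scala
object ColumnNames {
  // Approximate the camelCase -> underscore convention:
  // insert "_" before each upper-case letter that follows a lower-case letter or digit,
  // then lower-case the whole name.
  def toUnderscore(fieldName: String): String =
    fieldName.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase
}
```

With this rule appId becomes app_id and originCode becomes origin_code, matching the columns listed in the assertion error. One option, under that reading, is to rename either the table columns (app_id, instance_id, ...) or the case class fields (appid, instanceid, ...) so that the two sides agree.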