I am trying to write a simple Siddhi query that imports a custom mapped stream. But as soon as I import the stream and validate the query, it gives an error.
My complete query is:
#Import('bro.in.ssh.log:1.0.0')
define stream inStream (ts string, uid string, id.orig_h string, id.orig_p int, id.resp_h string, id.resp_p int, version int, client string, server string, cipher_alg string, mac_alg string, compression_alg string, kex_alg string, host_key_alg string, host_key string);
#Export('bro.out.ssh.log:1.0.0')
define stream outStream (ts string, ssh_logins int);
from inStream
select dateFormat (ts,'yyyy-MM-dd HH:mm') as formatedTs, count
group by formatedTs
insert into outStream;
All I want is to count the number of records in the log for each minute and export the time and count to an output stream. But I am getting errors even at the very first line.
My input is a Bro IDS log file, ssh.log. A sample record looks like this:
{"ts":"2016-05-08T08:59:47.363764Z","uid":"CLuCgz3HHzG7LpLwH9","id.orig_h":"172.30.26.119","id.orig_p":51976,"id.resp_h":"172.30.26.160","id.resp_p":22,"version":2,"client":"SSH-2.0-OpenSSH_5.0","server":"SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6","cipher_alg":"arcfour256","mac_alg":"hmac-md5","compression_alg":"none","kex_alg":"diffie-hellman-group-exchange-sha1","host_key_alg":"ssh rsa","host_key":"8d:df:71:ac:29:1f:67:6f:f3:dd:c3:e5:2e:5f:3e:b4"}
Siddhi does not allow an attribute name to contain the dot ('.') character. So please edit the event stream so that the attribute names (such as id.orig_h) do not contain a dot.
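As a minimal sketch of one way to apply this, the dotted keys could be renamed before the events reach Siddhi. The Python below is only an illustration: it assumes you are able to preprocess the JSON records yourself, and the helper name sanitize_keys and the choice of an underscore as the replacement character are hypothetical. The stream definition would then need to declare the renamed attributes (for example id_orig_h) instead of the dotted ones.

```python
import json


def sanitize_keys(record: dict) -> dict:
    """Return a copy of the record with '.' in each key replaced by '_'."""
    return {key.replace(".", "_"): value for key, value in record.items()}


# A shortened version of the sample ssh.log record from the question.
raw = '{"ts":"2016-05-08T08:59:47.363764Z","id.orig_h":"172.30.26.119","id.orig_p":51976}'

clean = sanitize_keys(json.loads(raw))
print(json.dumps(clean))
```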
I have a DataFrame in PySpark coming from a view in BigQuery that I import after configuring the Spark session:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
config = pyspark.SparkConf().setAll([('spark.executor.memory', '10g'),('spark.driver.memory', '30G'),\
('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0')])
sc = pyspark.SparkContext(conf=config)
# Reuse the SparkConf built above
spark = SparkSession.builder.master('yarn').appName('base_analitica_entidades').config(conf=config).getOrCreate()
I read this dataset through:
recomendaveis = spark.read.format("bigquery").option("viewsEnabled", "true").load("resource_group:some_group.someView")
Then I filter a specific column with isNotNull:
recomendaveis_mid = recomendaveis.filter(recomendaveis["entities_mid"].isNotNull())
This recomendaveis_mid dataset is:
DataFrame[uid: string, revision: bigint, title: string, subtitle: string, access: string, branded_content: boolean, image: string, published_in: date, changed_in: date, entities_extracted_in: string, translation_extracted_in: string, categories_extracted_in: string, bigquery_inserted_in: string, public_url: string, private_url: string, text: string, translation_en: string, authors_name: string, categories_name: string, categories_confidence: double, entities_name: string, entities_type: string, entities_salience: double, entities_mid: string, entities_wikipedia_url: string, named_entities: string, publications: string, body: string, Editoria: string, idmateria: string]
When I try to get the minimum date of the published_in column with:
recomendaveis_mid.select(F.min("published_in")).collect()
It throws this error:
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Row filter for table resource_group:some_group.table is invalid. Filter is '(`entities_mid` IS NOT NULL)'
at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)
... 14 more
The published_in field has nothing to do with my filter on entities_mid, and when I run the date query without applying the entities_mid isNotNull filter, my code works fine. Any suggestions?
By the way, there is a similar error here, but I couldn't get any other ideas from it. Thanks in advance.
We faced a similar issue in Scala Spark while reading from a view.
On analysis, we observed that when we run
df.printSchema()
df.show(1,false)
it prints all fields even before the join operation takes place. But when loading or writing the DataFrame to external storage or a table, it throws the error:
INVALID_ARGUMENT: request failed: Row filter for table
After some experimentation, we observed that if we persist the DataFrame
df.persist()
it worked fine.
It looks like, after the join, the column used in the filter also needs to be included in the select. Since we did not want that column in our final DataFrame, we persisted the DataFrame in the cluster instead.
You can either unpersist it with
df.unpersist()
once the data operation completes, or leave it as is if you are using an ephemeral cluster, since the persisted data is removed when the cluster is deleted.
I am sending an image in base64 format in a JSON message. I want to store the image in Cassandra. The corresponding column in Cassandra is `image list<blob>`.
The image data in the JSON is
image:["...","..."]
This JSON maps to my model as an Array[String]. The model class that maps to the JSON is
case class PracticeQuestion (id: Option[UUID],
d: String,
h: List[String],
image: Array[String],//images maps here
s: String,
f: String,
t: Set[String],
title: String,
a:String,
r:List[String])
I have read that to store images in Cassandra, I need a ByteBuffer. So I use ByteBuffer.wrap() to convert the elements of the array into ByteBuffers. But my insert statement is failing with a DataStax driver error, shown further down.
The way I am trying to convert the Array[String] to list<blob> is
.value("image",seqAsJavaList(Seq[ByteBuffer](ByteBuffer.wrap(model.image(0).getBytes()))))
//for the moment, I am taking only 1 image
I take the first image, get its bytes, and create a ByteBuffer. Then I convert the ByteBuffer into a Seq and then the Seq into a Java List.
The error I see is com.datastax.driver.core.exceptions.InvalidTypeException: Value 6 of type class scala.collection.immutable.$colon$colon does not correspond to any CQL3 type
The complete insert command is
QueryBuilder.insertInto(tableName).value("id",model.id.get) .value("answer",model.a)
.value("d",model.d)
.value("f",model.f)
.value("h",seqAsJavaList(model.h))
.value("image",seqAsJavaList(Seq[ByteBuffer](ByteBuffer.wrap(model.image(0).getBytes()))))
.value("r",model.r)
.value("s",model.s)
.value("t",setAsJavaSet(model.t))
.value("title",model.title)
.ifNotExists();
The database schema is
id uuid PRIMARY KEY,
a text,
d text,
f text,
h list<text>,
image list<blob>,
r list<text>,
s text,
t set<text>,
title text
The data below is the result of joining two tables.
joinDataRdd.take(5).foreach(println)
(41234,((102921,249,2,109.94,54.97),(2014-04-04 00:00:00.0,3182,PENDING_PAYMENT)))
(65722,((164249,365,2,119.98,59.99),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164250,730,5,400.0,80.0),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164251,1004,1,399.98,399.98),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164252,627,5,199.95,39.99),(2014-05-23 00:00:00.0,4077,COMPLETE)))
When I try to run the following
val data = joinDataRdd.map(x=>(x._1,x._2._1.split(",")(3)))
it throws an error:
value split is not a member of (String, String, String, String, String)
val data = joinDataRdd.map(x=>(x._1,x._2._1._1.split(",")(3)))
You are trying to split a tuple, which is why you get the error message. For a record such as
(41234,((102921,249,2,109.94,54.97),(2014-04-04 00:00:00.0,3182,PENDING_PAYMENT))), the position x._2._1 gives the inner tuple (102921,249,2,109.94,54.97), not a string. So if you want to dig inside the tuple, you need to advance one more position.
It looks like the values are already in a tuple, so you don't need to split the string. Is
val data = joinDataRdd.map(x=>(x._1,x._2._1._4))
what you are looking for?
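To make the nesting concrete, here is a rough Python analogue of the same access, using the first sample record. It is only an illustration of how the positions line up: Python tuples are indexed from 0, so Scala's x._2._1._4 corresponds to record[1][0][3]. The names order_item and order are hypothetical labels for the two inner tuples, and the values are written as strings because the error message reports a tuple of five Strings.

```python
# First sample record from the joined data: (key, (left tuple, right tuple)).
record = (
    "41234",
    (
        ("102921", "249", "2", "109.94", "54.97"),
        ("2014-04-04 00:00:00.0", "3182", "PENDING_PAYMENT"),
    ),
)

# Unpack the nesting instead of splitting a string.
key, (order_item, order) = record

# The fourth field of the first inner tuple (Scala: x._2._1._4).
fourth_field = order_item[3]
pair = (key, fourth_field)
print(pair)
```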
Here are my case classes:
case class User(id: String, location: Option[Location] = None, age: Option[String] = None) {
override def toString: String = s"User: $id from ${location.toString} age of $age"
}
case class UserBookRating(user: User, bookISBN: String, rating: String)
As you can see, UserBookRating depends on User.
I'm streaming data from two separate CSV files: Users.csv and BookRatings.csv.
Example rows from each file:
id, location, age
"10";"chicago, illinois, usa";"26"
id, isbn, rating
"10";"10240528340","5"
I'm using Akka Streams with GraphDSL, and I have two sources, one streaming from each file. My problem is how to get the data from both streams in order to form the UserBookRating object, since it depends on User. Right now I have two separate streams: one forms User objects from Users.csv and the other forms UserBookRating objects, but only the id field of User is populated in each UserBookRating, since that is the only user information available in BookRatings.csv. How can I combine the data coming from the two streams so that I can form complete UserBookRating objects?
Create a ZipWith stage in the graph. In the constructor for ZipWith, you pass in a combiner function that takes one element from each of the two CSV streams and emits a UserBookRating, so the stage outputs a stream of UserBookRating objects.
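The idea of zipping with a combiner can be sketched in plain Python. This is not the Akka Streams API, only an analogy: map with two iterables pairs their elements positionally, in the same way a zip stage pairs one element from each input. The function name combine is hypothetical, and the sketch assumes both inputs are already parsed and arrive in matching order, which is the main thing to verify before relying on a zip.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class User:
    id: str
    location: Optional[str] = None
    age: Optional[str] = None


@dataclass
class UserBookRating:
    user: User
    book_isbn: str
    rating: str


def combine(user: User, raw_rating: Tuple[str, str, str]) -> UserBookRating:
    """Combiner: build one UserBookRating from one element of each input."""
    user_id, isbn, rating = raw_rating
    if user.id != user_id:
        # Zipping is positional, so misaligned inputs would pair the wrong rows.
        raise ValueError(f"user {user.id} paired with rating for user {user_id}")
    return UserBookRating(user=user, book_isbn=isbn, rating=rating)


# Already-parsed example rows from the two files.
users = [User("10", "chicago, illinois, usa", "26")]
raw_ratings = [("10", "10240528340", "5")]

ratings = list(map(combine, users, raw_ratings))
print(ratings[0])
```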
I defined a class to map rows of a Cassandra table:
case class Log(
val time: Long,
val date: String,
val appId: String,
val instanceId: String,
val appName: String,
val channel: String,
val originCode: String,
val message: String) {
}
I created an RDD to hold all my rows:
val logEntries = sc.cassandraTable[Log]("keyspace", "log")
To see if everything works, I printed this:
println(logEntries.count()) -> works, prints the number of rows retrieved.
println(logEntries.first()) -> exception on this line
java.lang.AssertionError: assertion failed: Missing columns needed by
com.model.Log: app_name, app_id, origin_code, instance_id
The columns of the log table in Cassandra are:
time bigint, date text, appid text, instanceid text, appname text, channel text, origincode text, message text
What's wrong?
As mentioned in the spark-cassandra-connector docs, the column name mapper has its own logic for converting case class parameters to column names:
For multi-word column identifiers, separate each word by an underscore in Cassandra, and use the camel case convention on the Scala side.
So if you use case class Log(appId:String, instanceId:String) with camel-cased parameters, they will be automatically mapped to underscore-separated names: app_id text, instance_id text. They cannot be automatically mapped to appid text, instanceid text, because those column names are missing the underscore.
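The naming rule can be illustrated with a small Python sketch. This is not the connector's actual implementation, just a rough model of the camel case to underscore convention quoted above; the function name camel_to_snake is hypothetical. It shows why the parameters in the question are looked up under the column names listed in the error message.

```python
import re


def camel_to_snake(name: str) -> str:
    """Insert '_' before each uppercase letter (except at the start), then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# Parameter names from the Log case class in the question.
params = ["time", "date", "appId", "instanceId", "appName", "channel", "originCode", "message"]
expected_columns = [camel_to_snake(p) for p in params]
print(expected_columns)

# Column names that actually exist in the Cassandra table.
actual_columns = {"time", "date", "appid", "instanceid", "appname", "channel", "origincode", "message"}
missing = sorted(c for c in expected_columns if c not in actual_columns)
print(missing)
```

The missing list matches the columns named in the assertion error, which is consistent with the explanation above.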