I defined a case class to map rows of a Cassandra table:
case class Log(
  time: Long,
  date: String,
  appId: String,
  instanceId: String,
  appName: String,
  channel: String,
  originCode: String,
  message: String)
I created an RDD to load all my rows:
val logEntries = sc.cassandraTable[Log]("keyspace", "log")
To check that everything works, I printed the following:
println(logEntries.count()) -> works; it prints the number of rows retrieved.
println(logEntries.first()) -> throws an exception on this line:
java.lang.AssertionError: assertion failed: Missing columns needed by
com.model.Log: app_name, app_id, origin_code, instance_id
The columns of my log table in Cassandra are:
time bigint, date text, appid text, instanceid text, appname text, channel text, origincode text, message text
What's wrong?
As mentioned in the Cassandra Spark Connector docs, the column name mapper has its own logic for converting case class parameters to column names:
For multi-word column identifiers, separate each word by an underscore in Cassandra, and use the camel case convention on the Scala side.
So if you use case class Log(appId: String, instanceId: String) with camel-cased parameters, it will automatically be mapped to underscore-separated notation: app_id text, instance_id text. It cannot be automatically mapped to appid text, instanceid text: your column names are missing the underscores.
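If renaming the Cassandra columns to app_id, instance_id, app_name and origin_code is not an option, the connector also lets you alias columns explicitly when building the RDD. A minimal sketch, assuming the keyspace and table names from the question:
import com.datastax.spark.connector._

// Map each existing Cassandra column to the corresponding case class field.
val logEntries = sc.cassandraTable[Log]("keyspace", "log")
  .select(
    "time", "date",
    "appid" as "appId",
    "instanceid" as "instanceId",
    "appname" as "appName",
    "channel",
    "origincode" as "originCode",
    "message")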
Related
I have a DataFrame in PySpark coming from a view in BigQuery, which I import after configuring the Spark session:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
config = pyspark.SparkConf().setAll([('spark.executor.memory', '10g'), ('spark.driver.memory', '30G'),
    ('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0')])
sc = pyspark.SparkContext(conf=config)
spark = SparkSession.builder.master('yarn').appName('base_analitica_entidades').config(conf=config).getOrCreate()
I read this dataset through:
recomendaveis = spark.read.format("bigquery").option("viewsEnabled", "true").load("resource_group:some_group.someView")
Then I filter a specific column with isNotNull:
recomendaveis_mid = recomendaveis.filter(recomendaveis["entities_mid"].isNotNull())
The resulting recomendaveis_mid DataFrame is:
DataFrame[uid: string, revision: bigint, title: string, subtitle: string, access: string, branded_content: boolean, image: string, published_in: date, changed_in: date, entities_extracted_in: string, translation_extracted_in: string, categories_extracted_in: string, bigquery_inserted_in: string, public_url: string, private_url: string, text: string, translation_en: string, authors_name: string, categories_name: string, categories_confidence: double, entities_name: string, entities_type: string, entities_salience: double, entities_mid: string, entities_wikipedia_url: string, named_entities: string, publications: string, body: string, Editoria: string, idmateria: string]
When I try to get the minimum date of the published_in column with:
recomendaveis_mid.select(F.min("published_in")).collect()
It throws this error:
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Row filter for table resource_group:some_group.table is invalid. Filter is '(`entities_mid` IS NOT NULL)'
at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)
... 14 more
The field published_in has nothing to do with my filter on entities_mid, and when I run the date query without the entities_mid isNotNull filter, my code works fine. Any suggestions?
P.S.: there is a similar error here, but I couldn't get any other ideas from it. Thanks in advance.
We faced a similar issue in Scala Spark while reading from a view.
Upon analysis, we observed that when we do
df.printSchema()
df.show(1,false)
it prints all the fields even before the join operation takes place, but while loading/writing the DataFrame to external storage or a table it throws the error:
INVALID_ARGUMENT: request failed: Row filter for table
After some experimentation, we observed that if we persist the DataFrame with
df.persist()
it worked fine.
It looks like after joining we would also need to include the column used in the filter in the select; since we didn't want that column in our final DataFrame, we persisted the DataFrame in the cluster instead.
You can either unpersist with
df.unpersist()
once the data operation completes, or leave it as is if you are using an ephemeral cluster, since the cached data will be removed when the cluster is deleted.
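A minimal sketch of this workaround in Scala Spark, assuming the view and column names from the question:
import org.apache.spark.sql.functions.min
import org.apache.spark.storage.StorageLevel

val recomendaveis = spark.read
  .format("bigquery")
  .option("viewsEnabled", "true")
  .load("resource_group:some_group.someView")

// Persisting after the filter lets later aggregations read the cached data
// instead of pushing the row filter back down to the BigQuery view.
val recomendaveisMid = recomendaveis
  .filter(recomendaveis("entities_mid").isNotNull)
  .persist(StorageLevel.MEMORY_AND_DISK)

recomendaveisMid.select(min("published_in")).collect()

recomendaveisMid.unpersist() // once the downstream work is done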
I am sending an image in base64 format in a JSON message. I want to store the image in Cassandra. The corresponding column in Cassandra is image list<blob>.
The image data in the JSON is:
image:["data:image/png;base64,iVBORw...","data:image/png;base64,idfsd..."]
This JSON field maps to an Array[String] in my model. The model class which maps to the JSON is:
case class PracticeQuestion(id: Option[UUID],
                            d: String,
                            h: List[String],
                            image: Array[String], // the images map here
                            s: String,
                            f: String,
                            t: Set[String],
                            title: String,
                            a: String,
                            r: List[String])
I have read that to store images in Cassandra I need a ByteBuffer, so I use ByteBuffer.wrap() to convert the individual entries of the array into ByteBuffers. But my insert statement is failing with a DataStax error.
The way I am trying to convert the Array[String] to list<blob> is:
.value("image",seqAsJavaList(Seq[ByteBuffer](ByteBuffer.wrap(model.image(0).getBytes()))))
//for the moment, I am taking only 1 image
I take the first image, get its bytes, and create a ByteBuffer. Then I wrap the ByteBuffer in a Seq and convert the Seq into a Java List.
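For reference, a sketch of this conversion chain applied to the whole array rather than only the first image (names taken from the question; as in the original code, the base64 strings are stored as their raw bytes):
import java.nio.ByteBuffer
import scala.collection.JavaConverters._

// Wrap every base64 string from the model in a ByteBuffer and expose the
// result as a java.util.List, which is what the driver expects for list<blob>.
val imageBlobs: java.util.List[ByteBuffer] =
  model.image.map(img => ByteBuffer.wrap(img.getBytes("UTF-8"))).toSeq.asJava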
The error I see is com.datastax.driver.core.exceptions.InvalidTypeException: Value 6 of type class scala.collection.immutable.$colon$colon does not correspond to any CQL3 type
The complete insert command is:
QueryBuilder.insertInto(tableName)
  .value("id", model.id.get)
  .value("answer", model.a)
  .value("d", model.d)
  .value("f", model.f)
  .value("h", seqAsJavaList(model.h))
  .value("image", seqAsJavaList(Seq[ByteBuffer](ByteBuffer.wrap(model.image(0).getBytes()))))
  .value("r", model.r)
  .value("s", model.s)
  .value("t", setAsJavaSet(model.t))
  .value("title", model.title)
  .ifNotExists();
The database schema is
id uuid PRIMARY KEY,
a text,
d text,
f text,
h list<text>,
image list<blob>,
r list<text>,
s text,
t set<text>,
title text
I am writing a Scala/Spark program to find the maximum salary of an employee. The employee data is available in a CSV file, and the salary column uses a comma as the thousands separator and is prefixed with a $, e.g. $74,628.00.
To handle the comma and dollar sign, I have written a parser function in Scala which splits each line on "," and then maps each column to individual variables to be assigned to a case class.
My parser function looks like below. To eliminate the commas and dollar signs, I use the replace function to replace them with an empty string, and then finally cast the string to Int.
def ParseEmployee(line: String): Classes.Employee = {
val fields = line.split(",")
val Name = fields(0)
val JOBTITLE = fields(2)
val DEPARTMENT = fields(3)
val temp = fields(4)
temp.replace(",","")//To eliminate the ,
temp.replace("$","")//To remove the $
val EMPLOYEEANNUALSALARY = temp.toInt //Type cast the string to Int
Classes.Employee(Name, JOBTITLE, DEPARTMENT, EMPLOYEEANNUALSALARY)
}
My case class looks like below:
case class Employee (Name: String,
                     JOBTITLE: String,
                     DEPARTMENT: String,
                     EMPLOYEEANNUALSALARY: Number)
My Spark DataFrame SQL query looks like below:
val empMaxSalaryValue = sc.sqlContext.sql("Select Max(EMPLOYEEANNUALSALARY) From EMP")
empMaxSalaryValue.show
When I run this program, I get the exception below:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for Number
- field (class: "java.lang.Number", name: "EMPLOYEEANNUALSALARY")
- root class: "Classes.Employee"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:625)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:282)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:272)
at CalculateMaximumSalary$.main(CalculateMaximumSalary.scala:27)
at CalculateMaximumSalary.main(CalculateMaximumSalary.scala)
Any idea why I am getting this error? What mistake am I making here, and why can't it cast the value to a number?
Is there a better approach to this problem of getting the maximum salary of an employee?
Spark SQL provides only a limited set of Encoders, which target concrete classes. Abstract classes like Number are not supported (they can be used only with the limited binary Encoders).
Since you convert to Int anyway, just redefine the class:
case class Employee (
Name: String,
JOBTITLE: String,
DEPARTMENT: String,
EMPLOYEEANNUALSALARY: Int
)
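A minimal usage sketch with the redefined class, assuming a SparkSession named spark, a hypothetical employees.csv path, and the ParseEmployee function from the question adjusted to build this Employee:
import spark.implicits._

// Hypothetical usage: with EMPLOYEEANNUALSALARY declared as Int, Spark can
// derive an Encoder for Employee and the aggregation works as expected.
val employees = sc.textFile("employees.csv").map(ParseEmployee).toDF()
employees.createOrReplaceTempView("EMP")
spark.sql("SELECT MAX(EMPLOYEEANNUALSALARY) FROM EMP").show()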
I am trying to write a simple Siddhi query by importing a custom mapped stream, but as soon as I import the stream and validate the query, it gives an error.
My complete query is:
#Import('bro.in.ssh.log:1.0.0')
define stream inStream (ts string, uid string, id.orig_h string, id.orig_p int, id.resp_h string, id.resp_p int, version int, client string, server string, cipher_alg string, mac_alg string, compression_alg string, kex_alg string, host_key_alg string, host_key string);
#Export('bro.out.ssh.log:1.0.0')
define stream outStream (ts string, ssh_logins int);
from inStream
select dateFormat (ts,'yyyy-MM-dd HH:mm') as formatedTs, count
group by formatedTs
insert into outStream;
All I want is to count the number of records in the log for a single minute and export the time and the count to an output stream, but I am getting errors even on the very first line.
My input is a Bro IDS log file, ssh.log. A sample record looks something like this:
{"ts":"2016-05-08T08:59:47.363764Z","uid":"CLuCgz3HHzG7LpLwH9","id.orig_h":"172.30.26.119","id.orig_p":51976,"id.resp_h":"172.30.26.160","id.resp_p":22,"version":2,"client":"SSH-2.0-OpenSSH_5.0","server":"SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6","cipher_alg":"arcfour256","mac_alg":"hmac-md5","compression_alg":"none","kex_alg":"diffie-hellman-group-exchange-sha1","host_key_alg":"ssh rsa","host_key":"8d:df:71:ac:29:1f:67:6f:f3:dd:c3:e5:2e:5f:3e:b4"}
Siddhi does not allow an attribute name to contain the dot ('.') character, so please edit the event stream so that the attribute names do not contain dots (for example, rename id.orig_h to id_orig_h, id.orig_p to id_orig_p, and so on).
How do I write optional case class fields with the Cassandra Spark Connector?
Example:
case class User(name : String, address : Option[Address])
case class Address(street : String, city : String)
When I try to save a user to Cassandra with rdd.saveToCassandra, it raises an error:
Failed to get converter for field "address" of type scala.Option[Address] in User mapped to column "address" of "testspark.logs_raw"
I have tried to implement a TypeConverter but that has not worked.
However, nested case classes are correctly converted to Cassandra UDTs, and optional fields are accepted.
Is there a good way to deal with this without changing the data model?
Just for visibility: everything works fine in modern versions - there were a lot of changes to UDT handling around SCC 1.4.0-1.6.0, plus many performance optimizations in SCC 2.0.8. With SCC 2.5.1, the RDD API maps everything correctly - for example, if we have the following UDT and table:
cqlsh> create type test.address (street text, city text);
cqlsh> create table test.user(name text primary key, address test.address);
cqlsh> insert into test.user(name, address) values
('with address', {street: 'street 1', city: 'city1'});
cqlsh> insert into test.user(name) values ('without address');
cqlsh> select * from test.user;
name | address
-----------------+-------------------------------------
with address | {street: 'street 1', city: 'city1'}
without address | null
(2 rows)
Then the RDD API is able to pull everything correctly when reading the data:
scala> import com.datastax.spark.connector._
import com.datastax.spark.connector._
scala> case class Address(street : String, city : String)
defined class Address
scala> case class User(name : String, address : Option[Address])
defined class User
scala> val data = sc.cassandraTable[User]("test", "user")
data: com.datastax.spark.connector.rdd.CassandraTableScanRDD[User] = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:18
scala> data.collect
res0: Array[User] = Array(User(without address,None),
User(with address,Some(Address(street 1,city1))))
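For completeness, a hedged sketch of the write direction under the same assumptions (SCC 2.5.1 and the test.user table above); a None address should simply end up as a null address column:
// Sketch only: saving the same case classes back with the RDD API.
val newUsers = sc.parallelize(Seq(
  User("saved with address", Some(Address("street 2", "city2"))),
  User("saved without address", None)))

newUsers.saveToCassandra("test", "user", SomeColumns("name", "address"))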