How to implement LEFT or RIGHT JOIN using spark-cassandra-connector - scala

I have a Spark Streaming job that uses Cassandra as its datastore.
I have a stream that needs to be joined with a Cassandra table.
I am using spark-cassandra-connector; it has the great method joinWithCassandraTable, which, as far as I can tell, implements an inner join with a Cassandra table:
val source: DStream[...] = ...
source.foreachRDD { rdd =>
  rdd.joinWithCassandraTable("keyspace", "table").map { ...
  }
}
So the question is: how can I implement a left outer join with a Cassandra table?
Thanks in advance.

This is currently not supported, but there is a ticket to introduce the functionality. Please vote on it if you would like it introduced in the future.
https://datastax-oss.atlassian.net/browse/SPARKC-181
A workaround is suggested in the ticket.

As RussS mentioned, this feature is not available in the spark-cassandra-connector driver yet. As a workaround, I propose the following code snippet:
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector.rdd.reader.PrefetchingResultSetIterator

// mapPartitions, so that the joined records are returned as a new RDD instead of being discarded.
val leftJoined = rdd.mapPartitions { partition =>
  CassandraConnector(rdd.context.getConf).withSessionDo { session =>
    (for (
      leftSide <- partition;
      rightSide <- {
        // CQL string literals use single quotes.
        val rs = session.execute(s"""SELECT * FROM "keyspace".table WHERE id = '${leftSide._2}'""")
        val iterator = new PrefetchingResultSetIterator(rs, 100)
        // Emit a single None when there is no match, so the left side is kept (left outer join semantics).
        if (iterator.isEmpty) Seq(None)
        else iterator.map(r => Some(r.getString(1)))
      }
    ) yield (leftSide, rightSide)).toVector.iterator // materialise before the session is released
  }
}
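Newer releases of the connector resolved SPARKC-181 by adding leftJoinWithCassandraTable, so if you can upgrade, the join becomes a one-liner. A minimal sketch (the column name "name" is illustrative):

import com.datastax.spark.connector._

source.foreachRDD { rdd =>
  // Each result pair is (stream record, Option[CassandraRow]); None means no matching Cassandra row.
  rdd.leftJoinWithCassandraTable("keyspace", "table")
    .map { case (left, maybeRow) => (left, maybeRow.map(_.getString("name"))) }
    .foreach(println)
}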

Related

How to process multiple parquet files individually in a for loop?

I have multiple parquet files (around 1000). I need to load each one of them, process it, and save the result to a Hive table. I have a for loop, but it only seems to work with 2 or 5 files, not with 1000; it seems Spark tries to load them all at the same time, and I need it to do them individually in the same Spark session.
I tried using a for loop, then a foreach, and I used unpersist(), but it fails anyway.
val ids = get_files_IDs()
ids.foreach(id => {
  println("Starting file " + id)
  val df = load_file(id)
  val values_df = calculate_values(df)
  values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  df.unpersist()
})
def get_files_IDs(): List[String] = {
  val ids = sqlContext.sql("SELECT CAST(id AS varchar(10)) FROM table.ids WHERE id IS NOT NULL")
  val ids_list = ids.select("id").map(r => r.getString(0)).collect().toList
  ids_list
}

def calculate_values(df: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame = {
  val values_id = df.groupBy($"id", $"date", $"hr_time")
    .agg(avg($"value_a") as "avg_val_a", avg($"value_b") as "avg_value_b")
  values_id
}

def load_file(id: String): org.apache.spark.sql.DataFrame = {
  sqlContext.read.parquet("/user/hive/wh/table.db/parquet/values_for_" + id + ".parquet")
}
What I would expect is for Spark to load file ID 1, process the data, save it to the Hive table, then discard that data and continue with the second ID, and so on until it finishes the 1000 files, instead of trying to load everything at the same time.
Any help would be much appreciated! I've been stuck on this for days. I'm using Spark 1.6 with Scala. Thank you!!
EDIT: Added the definitions. Hope it helps to get a better view. Thank you!
OK, so after a lot of inspection I realised that the process was working fine. It processed each file individually and saved the results. The issue was that in some very specific cases the process was taking far too long.
So I can say that with a for loop or foreach you can process multiple files and save the results without problems. Unpersisting and clearing the cache does help with performance.
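For reference, a minimal sketch of that loop with the explicit cache cleanup (load_file, calculate_values and get_files_IDs are the question's own helpers; sqlContext.clearCache() is the standard Spark 1.6 SQLContext call):

import org.apache.spark.sql.SaveMode

get_files_IDs().foreach { id =>
  println("Starting file " + id)
  val df = load_file(id)
  calculate_values(df).write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  // Release anything cached for this file before moving on to the next one.
  df.unpersist()
  sqlContext.clearCache()
}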

Reading different Schema in Parquet Partitioned Dir structure

I have the following partitioned Parquet data on HDFS, written using Spark:
year
|---Month
|----monthlydata.parquet
|----Day
|---dailydata.parquet
Now when I read a DataFrame from the year path, Spark reads dailydata.parquet. How can I read monthlydata from all partitions? I tried setting the option mergeSchema = true, which gives an error.
I would urge you to stop doing the following:
year
|---Month
|----monthlydata.parquet
|----Day
|---dailydata.parquet
When you read from year/month/ or even just year/, you won't just get monthlydata.parquet, you'll also be getting dailydata.parquet. I can't speak much to the error you're getting (please post it), but my humble suggestion would be to separate the paths in HDFS since you're already duplicating the data:
dailies
|---year
|---Month
|----Day
|---dailydata.parquet
monthlies
|---year
|---Month
|----monthlydata.parquet
Is there a reason you were keeping them in the same directories?
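With the paths separated as above, each dataset reads cleanly on its own, and if the directories use key=value names (an assumption about your layout), Spark's partition discovery exposes them as columns. A rough sketch with hypothetical root paths:

val monthlies = sqlContext.read.parquet("/hdfs/monthlies") // only monthlydata files
val dailies = sqlContext.read.parquet("/hdfs/dailies")     // only dailydata files

// With year=.../month=... directory names, the partition values become columns.
val january2015 = monthlies.filter($"year" === 2015 && $"month" === 1)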
However, if you insist on this structure, use something like this:
schema = "dailydata1"
val dfList = dates.map { case (month, day) =>
Try(sqlContext.read.parquet(s"/hdfs/table/month=$month/day=$day/$schema.parquet"))
}
val dfUnion = dfList.collect { case Success(v) => v }.reduce { (a, b) =>
a.unionAll(b)
}
Where you can toggle the schema between dailydata1, dailydata2, etc.

How EXACTLY is Slick's SimpleLiteral used?

I want to use some extra features of PostgreSQL in my code but I don't want to fill the place with SQL string interpolations.
Currently I have:
/** Use 'now()' through Slick. */
val psqlNow = SimpleFunction.nullary[java.sql.Date]("now")

// Not really my code, but we only care about 2 lines.
def aQuery(limiter: Column[Int]) = {
  myTable
    .filter(_.validFrom >= psqlNow)
    .filter(_.validUntil <= psqlNow)
    .filter(_.fakeId === limiter)
    .map(e => (e.fakeId, e.name))
}
But I want to use 'CURRENT_DATE', which is a literal (using it in place of "now" throws an exception). Can someone provide an actual example? I can't get this to compile:
/** Use 'CURRENT_DATE' through Slick. */
val psqlNow = SimpleLiteral("CURRENT_DATE")(...WHAT GOES HERE?...)

// Not really my code, but we only care about 2 lines.
def aQuery(limiter: Column[Int]) = {
  myTable
    .filter(_.validFrom >= psqlNow)
    .filter(_.validUntil <= psqlNow)
    .filter(_.fakeId === limiter)
    .map(e => (e.fakeId, e.name))
}
I also want to change the following to lifted Slick; can I do it with SimpleLiteral (to somehow put 'count(*) OVER() recordsFiltered' into the generated query)?
SELECT *, count(*) OVER() recordsFiltered FROM example
WHERE id = $1
The examples are trivial; the actual code is a series of folds over filtering criteria.
import scala.slick.ast.TypedType
val current_date = Column.forNode[java.sql.Date](new SimpleLiteral("CURRENT_DATE")(implicitly[TypedType[java.sql.Date]]))
does the trick. Better support is missing at the moment.
I added a PR, so in Slick 2.2 it will be supported like this:
val current_date = SimpleLiteral[java.sql.Date]("CURRENT_DATE")
See https://github.com/slick/slick/pull/981
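A minimal usage sketch, reusing the query from the question (myTable, validFrom and validUntil are the question's hypothetical columns):

import scala.slick.ast.TypedType

val current_date =
  Column.forNode[java.sql.Date](new SimpleLiteral("CURRENT_DATE")(implicitly[TypedType[java.sql.Date]]))

// CURRENT_DATE is emitted literally into the generated SQL.
val aQuery = myTable
  .filter(_.validFrom >= current_date)
  .filter(_.validUntil <= current_date)
  .map(e => (e.fakeId, e.name))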

Cassandra-Hector-Scala: How can I get all row composite keys in a column family?

My data storage format is:
Family name :Test
Rowkey: comkey1:comkey2
=>(name=name,value='xyz',timestamp=1554515485)
-------------------------------------------------------
Rowkey: comkey1:comkey3
=>(name=name,value='abc',timestamp=1554515485)
-------------------------------------------------------
Rowkey: comkey1:comkey4
=>(name=name,value='pqr',timestamp=1554515485)
-------------------------------------------------------
Now I want to fetch all composite keys from the "test" family, and I am trying:
def test = Action {
  val cluster = HFactory.getOrCreateCluster("Test Cluster", "127.0.0.1:9160")
  val keyspace = HFactory.createKeyspace("winoriatest", cluster)

  val startKey = new Composite()
  val endKey = new Composite()
  startKey.addComponent("comkey1", StringSerializer.get())
  startKey.addComponent("comkey2", StringSerializer.get())
  endKey.addComponent("comkey1", StringSerializer.get())
  endKey.addComponent("comkey4", StringSerializer.get())

  val rangeSlicesQuery = HFactory.createRangeSlicesQuery(keyspace, CompositeSerializer.get(), StringSerializer.get(), StringSerializer.get())
  rangeSlicesQuery.setColumnFamily("test")
  // CompositeSerializer.get() is not working.
  rangeSlicesQuery.setKeys(startKey, endKey)
  rangeSlicesQuery.setRange(null, null, false, Integer.MAX_VALUE)
  rangeSlicesQuery.setReturnKeysOnly()

  val result = rangeSlicesQuery.execute()
  val orderedRows = result.get()

  import scala.collection.JavaConversions._
  for (sc <- orderedRows) {
    println(sc.getKey())
  }

  Ok(views.html.index("Your new application is ready."))
}
Error: [NullPointerException: null] on the line
val result = rangeSlicesQuery.execute()
Cassandra 2.0, Scala 2.10.2.
Thank you in advance for your help in resolving this.
It gives me a NullPointerException, while the same code works in Java.
My Java code is:
Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "127.0.0.1:9160");
Keyspace keyspace = HFactory.createKeyspace("winoriatest", cluster);

Serializer<String> se = StringSerializer.get();
Serializer<Long> le = LongSerializer.get();
Serializer<Integer> ie = IntegerSerializer.get();
CompositeSerializer ce = new CompositeSerializer();

RangeSlicesQuery<Composite, String, byte[]> rangeSliceQuery =
    HFactory.createRangeSlicesQuery(keyspace, ce, se, BytesArraySerializer.get());
rangeSliceQuery.setColumnFamily("test");
rangeSliceQuery.setRange(null, null, false, Integer.MAX_VALUE);

QueryResult<OrderedRows<Composite, String, byte[]>> result = rangeSliceQuery.execute();
OrderedRows<Composite, String, byte[]> orderedRows = result.get();

for (Row<Composite, String, byte[]> r : orderedRows) {
    System.out.println("Compositekey=" + r.getKey().get(0, se) + ":" + r.getKey().get(1, se));
}
I'm not quite sure what "fetch all composite keys from the test family" means. If you mean you want to get just the partition (row) key components, then you can do this in CQL as simply as:
SELECT DISTINCT a, b FROM test
(Assigning a and b to be the column names.)
This is a good example of how much simpler CQL makes Cassandra development, which is why we're pushing people to use the native CQL driver over legacy clients like Hector.
For more on how CQL makes sense of a Thrift data model like this, see http://www.datastax.com/dev/blog/cql3-for-cassandra-experts.
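For illustration, a minimal sketch of running that query with the native driver from Scala (the contact point, keyspace name and the column names a and b are assumptions carried over from the answer):

import com.datastax.driver.core.Cluster
import scala.collection.JavaConversions._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("winoriatest")

// DISTINCT returns each partition key exactly once.
for (row <- session.execute("SELECT DISTINCT a, b FROM test")) {
  println(row.getString("a") + ":" + row.getString("b"))
}

cluster.close()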

Left outer join on Slick/Mysql

I am having trouble using a left outer join in Slick. I'll start with some code:
val articles = (for {
  (article, lecture) <- ArticleDAO leftJoin LectureDAO on (_.id === _.idArticle)
  if article.flux === idFlux
} yield (article, lecture.isStarred.?)).groupBy(_._1.guid).map {
  case (guid, rows) => rows.first
}
PS: ArticleDAO and LectureDAO are the objects extending Table, as opposed to Article and Lecture, which are simple case classes.
This is the error I am getting when compiling the code above:
Don't know how to unpack (models.Article, Option[Boolean]) to T and pack to G
I don't really understand this error. I know it has something to do with the transformation and composition of queries, but I have no idea how to change/fix it. Could someone shed some light on this?
The fix is explained here: https://groups.google.com/forum/#!topic/scalaquery/bIFH6be99B0. .first is not a query operation at the moment; use .min instead.
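A sketch of what that looks like for the query above (inside a groupBy you can only return the grouping key and aggregates; ArticleDAO's id column is used here as an example of what to aggregate):

val articles = (for {
  (article, lecture) <- ArticleDAO leftJoin LectureDAO on (_.id === _.idArticle)
  if article.flux === idFlux
} yield (article, lecture.isStarred.?))
  .groupBy(_._1.guid)
  .map { case (guid, rows) =>
    // Aggregate the columns you need per group instead of asking for a whole first row.
    (guid, rows.map(_._1.id).min, rows.length)
  }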