How to preserve order of columns in cassandra - scala

I have two tables in Cassandra:
CREATE TABLE table1 (
name text PRIMARY KEY,
grade text,
labid List<int>);
CREATE TABLE table2 (
name text PRIMARY KEY,
deptid List<int>,
grade text);
for example:
val result: RDD[(String, String, List[Int])] = myFunction()
result.saveToCassandra(keyspace, table1)
It is working fine.
but when using the line below:
result.saveToCassandra(keyspace, table2)
I am getting this type of error: com.datastax.spark.connector.types.TypeConversionException: Cannot convert object test_data of type class java.lang.String to List[AnyRef]
Is there any solution using SomeColumns that satisfies both tables (we don't know which table will be executed)? e.g.:
result.saveToCassandra(keyspace, table, SomeColumns(....))?

By default the save maps values by position, not by column name, so if your c* table has a different column order you will get incorrect writes. The solution is, as you said, to use SomeColumns.
import com.datastax.spark.connector._
// one ColumnRef per DataFrame column, in schema order
val columns = dataFrame.schema.map(_.name: ColumnRef)
dataFrame.rdd.saveToCassandra(keyspaceName, tableName, SomeColumns(columns: _*))
Now the dataframe columns will be written to c* using their name, not position.
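If you start from the tuple RDD in the question, one way to reuse this recipe (a sketch, assuming spark.implicits._ is available and the tuple order is name, grade, labid) is to name the columns while converting to a DataFrame:
import spark.implicits._
import com.datastax.spark.connector._
// give the tuple fields the Cassandra column names up front
val df = result.toDF("name", "grade", "labid")
val cols = df.schema.map(_.name: ColumnRef)
df.rdd.saveToCassandra(keyspace, table1, SomeColumns(cols: _*))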

Your arguments should be in a different order because the tables declare their columns in a different order:
val result: RDD[(String, String, List[Int])] = myFunction()
val reorder: RDD[(String, List[Int], String)] = result.map(r => (r._1, r._3, r._2))
reorder.saveToCassandra(keyspace, table2)
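Alternatively, a minimal sketch that keeps one code path for both tables by naming the columns in the tuple's order (assuming the connector's positional mapping of tuple elements to the listed columns):
import com.datastax.spark.connector._
// tuple elements are matched to the listed columns in order,
// so the same RDD can target either table without reordering
result.saveToCassandra(keyspace, table1, SomeColumns("name", "grade", "labid"))
result.saveToCassandra(keyspace, table2, SomeColumns("name", "grade", "deptid"))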

Related

DolphinDB - add columns before streaming engine and after subscribing data

If I want to add columns while filtering, what should I do if the table structure changes?
share streamTable(1000:0, `time`a, [TIMESTAMP, DOUBLE]) as table
outputTable = table(10000:0, `time`a, [TIMESTAMP, DOUBLE, DOUBLE])
def append_after_filtering(mutable inputTable, msg){
t = select * from msg
insert into inputTable values(t.time,t.a)
}
Aggregator = createTimeSeriesEngine(name="Aggregator", windowSize=6, step=3, metrics=<[avg(a)]>, dummyTable=table, outputTable=outputTable, timeColumn=`time)
subscribeTable(tableName="table", actionName="test", offset=0, handler=append_after_filtering{Aggregator}, msgAsTable=true)
I want to add a column to calculate a+100 as column b in append_after_filtering.
You can create another shared stream table (table2) and use the handler to add a column to the data subscribed from table1 by subscribeTable in DolphinDB. The structure of the calculated data matches table2, so set the dummyTable of createTimeSeriesEngine to table2.
share streamTable(1000:0, `time`a, [TIMESTAMP, DOUBLE]) as table1
share streamTable(1000:0, `time`a`b, [TIMESTAMP, DOUBLE, DOUBLE]) as table2
outputTable = table(10000:0, `interval_time`a`b, [TIMESTAMP, DOUBLE, DOUBLE])
def append_after_filtering(mutable inputTable, msg){
t = select * from msg
t[`b] = t.a+100
insert into inputTable values(t.time,t.a,t.b)
}
Aggregator = createTimeSeriesEngine(name="Aggregator", windowSize=6, step=3, metrics=<[avg(a), avg(b)]>, dummyTable=table2, outputTable=outputTable, timeColumn=`time)
subscribeTable(tableName="table1", actionName="test", offset=0, handler=append_after_filtering{Aggregator}, msgAsTable=true)

Transform rows to multiple rows in Spark Scala

I have a problem where I need to transform one row to multiple rows. This is based on a different mapping that I have. I have tried to provide an example below.
Suppose I have a parquet file with the below schema
ColA, ColB, ColC, Size, User
I need to aggregate the above data into multiple rows based on a lookup map. Suppose I have a static map
ColA, ColB, Sum(Size)
ColB, ColC, Distinct (User)
ColA, ColC, Sum(Size)
This means that one row in the input RDD needs to be transformed into 3 aggregates. I believe RDD with flatMapToPair is the way to go, but I am not sure how to go about this.
I am also OK with concatenating the columns into one key, something like ColA_ColB etc.
For creating multiple aggregates from the same data, I have started with something like this
val keyData: PairFunction[Row, String, Long] = new PairFunction[Row, String, Long]() {
override def call(x: Row) = {
(x.getString(1),x.getLong(5))
}
}
val ip15M = spark.read.parquet("a.parquet").toJavaRDD
val pairs = ip15M.mapToPair(keyData)
java.util.List[(String, Long)] = [(ios,22), (ios,23), (ios,10), (ios,37), (ios,26), (web,52), (web,1)]
I believe I need to use flatMapToPair instead of mapToPair. Along similar lines, I tried
val FlatMapData: PairFlatMapFunction[Row, String, Long] = new PairFlatMapFunction[Row, String, Long]() {
override def call(x: Row) = {
(x.getString(1),x.getLong(5))
}
}
but it is giving this error:
Expression of type (String, Long) doesn't conform to expected type util.Iterator[(String, Long)]
Any help is appreciated. Please let me know if I need to add any more details.
the outcome should only have 3 columns? I mean col1, col2, col3 (the agg outcome).
The second aggregate is a distinct count of users? (I assume yes).
If so you can basically create 3 data frames and then union them.
Something in the way of:
// assumes the parquet data has been registered as a temp view, e.g.
// spark.read.parquet("a.parquet").createOrReplaceTempView("data")
val df1 = spark.sql("select colA as col1, colB as col2, sum(Size) as colAgg from data group by colA, colB")
val df2 = spark.sql("select colB as col1, colC as col2, count(distinct User) as colAgg from data group by colB, colC")
val df3 = spark.sql("select colA as col1, colC as col2, sum(Size) as colAgg from data group by colA, colC")
df1.union(df2).union(df3)
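The same three aggregations can also be written with the DataFrame API and then unioned; a minimal sketch, assuming the parquet file from the question and compatible column types across the three legs:
import org.apache.spark.sql.functions.{sum, countDistinct}
val df = spark.read.parquet("a.parquet")
val agg1 = df.groupBy("ColA", "ColB").agg(sum("Size").as("colAgg")).toDF("col1", "col2", "colAgg")
val agg2 = df.groupBy("ColB", "ColC").agg(countDistinct("User").as("colAgg")).toDF("col1", "col2", "colAgg")
val agg3 = df.groupBy("ColA", "ColC").agg(sum("Size").as("colAgg")).toDF("col1", "col2", "colAgg")
val allAggs = agg1.union(agg2).union(agg3)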

Spark Enhance Join between Terabytes of Datasets

I have five Hive tables; assume the names are A, B, C, D, and E. Each table has a customer_id as the key for joining between them. Also, each table contains between 100 and 600 columns, and all of them are stored in Parquet format.
Example of one table below:
CREATE TABLE table_a
(
customer_id Long,
col_1 STRING,
col_2 STRING,
col_3 STRING,
.
.
col_600 STRING
)
STORED AS PARQUET;
I need to achieve two things:
Join all of them together in the most optimal way using Spark Scala. I tried sortByKey before the join but there is still a performance bottleneck. I tried to repartition by key before the join but the performance is still not good. I tried to increase the parallelism for Spark to 6000 with many executors but was not able to achieve good results.
After the join I need to apply a separate function to some of these columns.
A sample of the join I tried is below:
val dsA = spark.table(table_a)
val dsB = spark.table(table_b)
val dsC = spark.table(table_c)
val dsD = spark.table(table_d)
val dsE = spark.table(table_e)
val dsAJoineddsB = dsA.join(dsB, Seq("customer_id"), "inner")
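The full chain I am aiming for looks roughly like this (a sketch, assuming every table exposes a customer_id column):
val joined = Seq(dsA, dsB, dsC, dsD, dsE)
  .reduce((left, right) => left.join(right, Seq("customer_id"), "inner"))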
I think in this case the direct join is not the optimal approach. You can achieve this task in the simpler way below.
First, create a case class, for example FeatureData, with two fields: case class FeatureData(customer_id: Long, featureValue: Map[String, String])
Second, map each table to that case class: the key plus a Map of [feature_name, feature_value].
Third, union all the datasets and groupByKey, merging the rows that share the same key.
In the above way the union will be faster than the join, but it needs more work.
After that, you will have a dataset of (key, map). You then apply the transformation per key and Map(feature_name).
A simple example of the implementation follows.
You first map each dataset to the case class, then union all of them. After that you groupByKey, then map and reduce within each group.
case class FeatureMappedData(customer_id: Long, feature: Map[String, String])
import spark.implicits._  // needed for the Dataset encoders below
// Map each table's rows into the case class (repeat for dsB ... dsE);
// "featureA"/"featureB" stand in for the real column names.
val dsAMapped = dsA.map(row =>
  FeatureMappedData(row.getAs[Long]("customer_id"),
    Map("featureA" -> row.getAs[String]("featureA"),
        "featureB" -> row.getAs[String]("featureB"))))
val unionDataSet = dsAMapped union dsBMapped
val grouped = unionDataSet.groupByKey(_.customer_id)
  .mapGroups { case (id, featureIter) =>
    val featuresMapped: Map[String, String] =
      featureIter.map(_.feature).reduce(_ ++ _).withDefaultValue("0")
    FeatureMappedData(id, featuresMapped)
  }
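For the second requirement (applying a separate function to some of the columns after the merge), a hedged sketch on top of grouped; myColumnFunction is a placeholder for the real transformation:
def myColumnFunction(v: String): String = v.toUpperCase  // placeholder transformation
val transformed = grouped.map { fd =>
  fd.copy(feature = fd.feature.updated("featureA", myColumnFunction(fd.feature("featureA"))))
}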

variable not binding value in Spark

I am passing a variable, but its value is not being passed through.
I populate the variable value here.
val Temp = sqlContext.read.parquet("Tabl1.parquet")
Temp.registerTempTable("temp")
val year = sqlContext.sql("""select value from Temp where name="YEAR"""")
year.show()
Here year.show() shows the proper value.
I am passing the parameter here in the code below.
val data = sqlContext.sql("""select count(*) from Table where Year='$year' limit 10""")
data.show()
The value year is a Dataframe, not a specific value (Int or Long). So when you use it inside a string interpolation, you get the result of Dataframe.toString, which isn't something you can use to compare values to (the toString returns a string representation of the Dataframe's schema).
If you can assume the year Dataframe has a single Row with a single column of type Int, and you want to get the value of that column - you can use first().getAs[Int](0) to get that value and then use it to construct your next query:
val year: DataFrame = sqlContext.sql("""select value from Temp where name="YEAR"""")
// get the first column of the first row:
val actualYear: Int = year.first().getAs[Int](0)
val data = sqlContext.sql(s"select count(*) from Table where Year='$actualYear' limit 10")
If the value column in the Temp table has a different type (String, Long) - just replace the Int with that type.
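For example, if value is stored as text (an assumption about the schema), the same pattern becomes:
val actualYear: String = year.first().getAs[String](0)
val data = sqlContext.sql(s"select count(*) from Table where Year='$actualYear' limit 10")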

Join two RDD in spark

I have two RDDs; one RDD has just one column and the other has two columns. To join the two RDDs on keys I have added a dummy value, which is 0. Is there any other efficient way of doing this using join?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("Cml-100k/u.item")
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1)))
val shit = movienames.join(moviesid).distinct()
Edit:
Let me convert this question to SQL. Say for example I have table1 (movieid) and table2 (movieid, moviename). In SQL we write something like:
select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ....
Here in SQL table1 has only one column whereas table2 has two columns, and the join still works; in the same way, can Spark join on keys from both RDDs?
Join operation is defined only on PairwiseRDDs, which are quite different from a relation / table in SQL. Each element of a PairwiseRDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects as long as the key provides a meaningful hashCode.
If you want to think about this in a SQL-ish way, you can consider the key as everything that goes into the ON clause and the value as the selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
While these approaches look similar at first glance and you can express one using the other, there is one fundamental difference. When you look at a SQL table and ignore constraints, all columns belong to the same class of objects, while the key and value in a PairwiseRDD have a clear meaning.
Going back to your problem: to use join you need both a key and a value. Arguably much cleaner than using 0 as a placeholder would be to use the null singleton, but there is really no way around it.
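For instance, a minimal variant of the question's pairing that uses null instead of 0:
val moviesid = lines.map(_.split("\t")).map(x => (x(1), null))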
For small data you can use filter in a similar way to broadcast join:
val moviesidBD = sc.broadcast(
  lines.map(x => x.split("\t")).map(x => x(1)).collect.toSet)  // x(1) is the movie id, as in the question
movienames.filter { case (id, _) => moviesidBD.value contains id }
but if you really want SQL-ish joins then you should simply use SparkSQL.
// requires import spark.implicits._ (or sqlContext.implicits._ on older Spark) for toDF
val movieIdsDf = lines
  .map(x => x.split("\t"))
  .map(a => Tuple1(a(1)))  // a(1) is the movie id
  .toDF("id")
val movienamesDf = movienames.toDF("id", "name")
// Add optional join type qualifier
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
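To reproduce the counts from the SQL in the question, a short sketch on top of the joined DataFrames:
// count(1) per movie after the join, grouped by id and name
movienamesDf.join(movieIdsDf, Seq("id"))
  .groupBy("id", "name")
  .count()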
On RDDs the join operation is only defined for PairwiseRDDs, so you need to convert the values to a paired RDD. Below is a sample:
val rdd1=sc.textFile("/data-001/part/")
val rdd_1=rdd1.map(x=>x.split('|')).map(x=>(x(0),x(1)))
val rdd2=sc.textFile("/data-001/partsupp/")
val rdd_2=rdd2.map(x=>x.split('|')).map(x=>(x(0),x(1)))
rdd_1.join(rdd_2).take(2).foreach(println)