DolphinDB - add columns after subscribing data and before the streaming engine - streaming

If I want to add columns while filtering, what should I do if the table structure changes?
share streamTable(1000:0, `time`a, [TIMESTAMP, DOUBLE]) as table
outputTable = table(10000:0, `time`a, [TIMESTAMP, DOUBLE, DOUBLE])
def append_after_filtering(mutable inputTable, msg){
t = select * from msg
insert into inputTable values(t.time,t.a)
}
Aggregator = createTimeSeriesEngine(name="Aggregator", windowSize=6, step=3, metrics=<[avg(a)]>, dummyTable=table, outputTable=outputTable, timeColumn=`time)
subscribeTable(tableName="table", actionName="test", offset=0, handler=append_after_filtering{Aggregator}, msgAsTable=true)
I want to add a column b, computed as a+100, inside append_after_filtering.

In DolphinDB, you can create another shared stream table (table2) that includes the extra column, and use the handler to add that column to the data subscribed from table1 via subscribeTable. The calculated data then has the structure of table2, so set table2 as the dummyTable of createTimeSeriesEngine.
share streamTable(1000:0, `time`a, [TIMESTAMP, DOUBLE]) as table1
share streamTable(1000:0, `time`a`b, [TIMESTAMP, DOUBLE, DOUBLE]) as table2
outputTable = table(10000:0, `interval_time`a`b, [TIMESTAMP, DOUBLE, DOUBLE])
def append_after_filtering(mutable inputTable, msg){
t = select * from msg
t[`b] = t.a+100
insert into inputTable values(t.time,t.a,t.b)
}
Aggregator = createTimeSeriesEngine(name="Aggregator", windowSize=6, step=3, metrics=<[avg(a), avg(b)]>, dummyTable=table2, outputTable=outputTable, timeColumn=`time)
subscribeTable(tableName="table1", actionName="test", offset=0, handler=append_after_filtering{Aggregator}, msgAsTable=true)

Related

How to add a new column to a Delta Lake table?

I'm trying to add a new column to data stored as a Delta Table in Azure Blob Storage. Most of the actions being done on the data are upserts, with many updates and few new inserts. My code to write data currently looks like this:
DeltaTable.forPath(spark, deltaPath)
.as("dest_table")
.merge(myDF.as("source_table"),
"dest_table.id = source_table.id")
.whenNotMatched()
.insertAll()
.whenMatched(upsertCond)
.updateExpr(upsertStat)
.execute()
From these docs, it looks like Delta Lake supports adding new columns on insertAll() and updateAll() calls only. However, I'm updating only when certain conditions are met and want the new column added to all the existing data (with a default value of null).
I've come up with a solution that seems extremely clunky and am wondering if there's a more elegant approach. Here's my current proposed solution:
// Read in existing data
val myData = spark.read.format("delta").load(deltaPath)
// Register table with Hive metastore
myData.write.format("delta").saveAsTable("input_data")
// Add new column
spark.sql("ALTER TABLE input_data ADD COLUMNS (new_col string)")
// Save as DataFrame and overwrite data on disk
val sqlDF = spark.sql("SELECT * FROM input_data")
sqlDF.write.format("delta").option("mergeSchema", "true").mode("overwrite").save(deltaPath)
Alter your Delta table first, and then do your merge operation:
from pyspark.sql.functions import lit
spark.read.format("delta").load('/mnt/delta/cov')\
.withColumn("Recovered", lit(''))\
.write\
.format("delta")\
.mode("overwrite")\
.option("overwriteSchema", "true")\
.save('/mnt/delta/cov')
New columns can also be added with SQL commands as follows:
ALTER TABLE dbName.TableName ADD COLUMNS (newColumnName dataType)
UPDATE dbName.TableName SET newColumnName = val;
This is the approach that worked for me using Scala.
Given a Delta table named original_table, whose path is:
val path_to_delta = "/mnt/my/path"
This table currently has 1M records with the following schema: pk, field1, field2, field3, field4
I want to add a new field, named newfield, to the existing schema without losing the data already stored in original_table.
So I first created a dummy record with a simple schema containing just pk and newfield:
case class new_schema(
pk: String,
newfield: String
)
I created a dummy record using that schema:
import spark.implicits._
val dummy_record = Seq(new_schema("delete_later", null)).toDF
I inserted this new record (the existing 1M records will have newfield populated as null). I also removed this dummy record from the original table:
dummy_record
.write
.format("delta")
.option("mergeSchema", "true")
.mode("append")
.save(path_to_delta)
val original_dt: DeltaTable = DeltaTable.forPath(spark, path_to_delta)
original_dt.delete("pk = 'delete_later'")
Now the original table will have 6 fields: pk, field1, field2, field3, field4 and newfield
Finally I upsert the newfield values in the corresponding 1M records using pk as join key
val df_with_new_field = // You bring new data from somewhere...
original_dt
.as("original")
.merge(
df_with_new_field.as("new"),
"original.pk = new.pk")
.whenMatched
.update( Map(
"newfield" -> col("new.newfield")
))
.execute()
https://www.databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
Have you tried using the merge statement?
https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html
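If you go the merge route, note that newer Delta Lake releases (0.6.0 and above) can evolve the schema during updateAll()/insertAll() when automatic schema evolution is enabled. A minimal sketch, assuming the same deltaPath, myDF and id join key as in the question, and a Delta version that supports the autoMerge setting:
import io.delta.tables.DeltaTable

// assumption: automatic schema evolution is available (Delta Lake >= 0.6.0)
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

// myDF carries the extra column; schema evolution only kicks in for updateAll/insertAll,
// so existing rows that are never touched keep the new column as null
DeltaTable.forPath(spark, deltaPath)
  .as("dest_table")
  .merge(myDF.as("source_table"), "dest_table.id = source_table.id")
  .whenMatched()
  .updateAll()
  .whenNotMatched()
  .insertAll()
  .execute()
This differs from the question's updateExpr in that updateAll()/insertAll() are required for the new column to be picked up automatically.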

Joining two clustered tables in spark dataset seems to end up with full shuffle

I have two Hive clustered tables, t1 and t2:
CREATE EXTERNAL TABLE `t1`(
`t1_req_id` string,
...
)
PARTITIONED BY (`t1_stats_date` string)
CLUSTERED BY (t1_req_id) INTO 1000 BUCKETS
// t2 looks similar with the same number of buckets
The insert part happens in Hive:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table `t1` partition(t1_stats_date,t1_stats_hour)
select *
from t1_raw
where t1_stats_date='2020-05-10' and t1_stats_hour='12' AND
t1_req_id is not null
The code looks like the following:
val t1 = spark.table("t1").as[T1]
val t2 = spark.table("t2").as[T2]
val outDS = t1.joinWith(t2, t1("t1_req_id") === t2("t2_req_id"), "fullouter")
.map { case (t1Obj, t2Obj) =>
val t3:T3 = // do some logic
t3
}
outDS.toDF.write....
I see the projection in the DAG, but it seems that the job still does a full data shuffle.
Also, while looking into the executor logs, I don't see it reading the same bucket of the two tables in one chunk, which is what I would expect to find.
There are the spark.sql.sources.bucketing.enabled, spark.sessionState.conf.bucketingEnabled and spark.sql.join.preferSortMergeJoin flags.
What am I missing? And why is there still a full shuffle if the tables are bucketed?
The current Spark version is 2.3.1.
One thing to check here is whether you have a type mismatch, e.g. if the type of the join column is STRING in t1 and BIGINT in t2. Even if the types are both integral (e.g. one is INT, the other BIGINT), Spark will still add a shuffle here, because different types use different hash functions for bucketing.
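A quick way to verify this is to compare the join-key types of both tables before joining; a small sketch using the table and column names from the question:
// Print the data type of the join key in each table; if the two differ
// (e.g. StringType vs LongType), the bucket layout cannot be reused and Spark shuffles.
val t1KeyType = spark.table("t1").schema("t1_req_id").dataType
val t2KeyType = spark.table("t2").schema("t2_req_id").dataType
println(s"t1_req_id: $t1KeyType, t2_req_id: $t2KeyType")
If the types do differ, casting inside the query will generally not bring the bucketed join back; the table has to be rewritten with matching key types.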

Transform rows to multiple rows in Spark Scala

I have a problem where I need to transform one row to multiple rows. This is based on a different mapping that I have. I have tried to provide an example below.
Suppose I have a parquet file with the below schema
ColA, ColB, ColC, Size, User
I need to aggregate the above data into multiple rows based on a lookup map. Suppose I have a static map
ColA, ColB, Sum(Size)
ColB, ColC, Distinct (User)
ColA, ColC, Sum(Size)
This means that one row in the input RDD needs to be transformed into 3 aggregates. I believe RDD with flatMapToPair is the way to go, but I am not sure how to go about this.
I am also OK with concatenating the columns into one key, something like ColA_ColB etc.
For creating multiple aggregates from the same data, I have started with something like this:
val keyData: PairFunction[Row, String, Long] = new PairFunction[Row, String, Long]() {
override def call(x: Row) = {
(x.getString(1),x.getLong(5))
}
}
val ip15M = spark.read.parquet("a.parquet").toJavaRDD
val pairs = ip15M.mapToPair(keyData)
java.util.List[(String, Long)] = [(ios,22), (ios,23), (ios,10), (ios,37), (ios,26), (web,52), (web,1)]
I believe I need to do flatMapToPair instead of mapToPair. Along similar lines, I tried:
val FlatMapData: PairFlatMapFunction[Row, String, Long] = new PairFlatMapFunction[Row, String, Long]() {
override def call(x: Row) = {
(x.getString(1),x.getLong(5))
}
}
but it gives this error:
Expression of type (String, Long) doesn't conform to expected type util.Iterator[(String, Long)]
Any help is appreciated. Please let me know if I need to add any more details.
Should the outcome only have 3 columns? I mean col1, col2, col3 (the aggregation outcome).
The second aggregate is a distinct count of users? (I assume yes.)
If so, you can basically create 3 data frames and then union them.
Something along the lines of:
// assumes the parquet data is registered as a temp view, e.g. spark.read.parquet("a.parquet").createOrReplaceTempView("data")
val df1 = spark.sql("select colA as col1, colB as col2, sum(Size) as colAgg from data group by colA, colB")
val df2 = spark.sql("select colB as col1, colC as col2, count(distinct User) as colAgg from data group by colB, colC")
val df3 = spark.sql("select colA as col1, colC as col2, sum(Size) as colAgg from data group by colA, colC")
df1.union(df2).union(df3)
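If you would rather stay with the pair-RDD API from the question, the compile error happens because PairFlatMapFunction.call must return a java.util.Iterator of tuples rather than a single tuple. A rough sketch, where the column positions and the ColA_ColB key concatenation are assumptions to adapt to your schema (the distinct-User aggregate needs a different value type and is left out):
// assumes: import org.apache.spark.api.java.function.PairFlatMapFunction and import org.apache.spark.sql.Row
val flatKeyData: PairFlatMapFunction[Row, String, Long] = new PairFlatMapFunction[Row, String, Long]() {
  override def call(x: Row): java.util.Iterator[(String, Long)] = {
    // emit one (key, Size) pair per aggregation the row participates in
    java.util.Arrays.asList(
      (x.getString(0) + "_" + x.getString(1), x.getLong(3)),  // ColA_ColB -> Size
      (x.getString(1) + "_" + x.getString(2), x.getLong(3)),  // ColB_ColC -> Size
      (x.getString(0) + "_" + x.getString(2), x.getLong(3))   // ColA_ColC -> Size
    ).iterator()
  }
}
val flatPairs = ip15M.flatMapToPair(flatKeyData)
You can then reduceByKey on flatPairs to get the sums per concatenated key.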

Spark Enhance Join between Terabytes of Datasets

I have five Hive tables; assume the names are A, B, C, D, and E. Each table has a customer_id as the key for joining between them. Also, each table contains between 100 and 600 columns, and all of them are stored in Parquet format.
Example of one table below:
CREATE TABLE table_a
(
customer_id Long,
col_1 STRING,
col_2 STRING,
col_3 STRING,
.
.
col_600 STRING
)
STORED AS PARQUET;
I need to achieve two points:
Join all of them together in the most optimal way using Spark Scala. I tried sortByKey before the join but there is still a performance bottleneck. I tried to repartition by key before the join but the performance is still not good. I tried to increase the parallelism for Spark to 6000 with many executors but was not able to achieve good results.
After the join I need to apply a separate function to some of these columns.
A sample of the join I tried is below:
val dsA = spark.table(table_a)
val dsB = spark.table(table_b)
val dsC = spark.table(table_c)
val dsD = spark.table(table_d)
val dsE = spark.table(table_e)
val dsAJoineddsB = dsA.join(dsB, Seq("customer_id"), "inner")
I think in this case the direct join is not optimal. You can achieve this task in the following simple way.
First, create a case class, for example FeatureMappedData, with two fields: case class FeatureMappedData(customer_id: Long, feature: Map[String, String]).
Second, map each table to the FeatureMappedData case class: key, [feature_name, feature_value].
Third, union all the datasets and then groupByKey on the same key.
Done this way, the union will be faster than the join, but it needs more work.
After that, you will have a dataset of (key, map), and you can apply the transformation per key on Map(feature_name).
A simple example of the implementation is as follows:
First map each dataset to the case class, then union all of them. After that, groupByKey, then map and reduce.
case class FeatureMappedData(customer_id: Long, feature: Map[String, String])
// map each table's rows into the case class; the feature column names here are just examples
val dsAMapped = dsA.map(row =>
FeatureMappedData(row.getAs[Long]("customer_id"),
Map("featureA" -> row.getAs[String]("featureA"),
"featureB" -> row.getAs[String]("featureB"))))
// dsBMapped (and the mappings for the other tables) are built the same way from dsB, dsC, ...
val unionDataSet = dsAMapped union dsBMapped
val grouped = unionDataSet.groupByKey(_.customer_id)
.mapGroups {
case (eid, featureIter) =>
val featuresMapped: Map[String, String] = featureIter.map(_.feature).reduce(_ ++ _).withDefaultValue("0")
FeatureMappedData(eid, featuresMapped)
}
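For the second requirement (applying a separate function to some of these columns after the data is combined), you can do it on the grouped dataset produced above; transformFeature below is a hypothetical placeholder for your own function, and "featureA" is just an example key:
// hypothetical per-feature function; replace the body with your real logic
def transformFeature(value: String): String = value.trim

val transformed = grouped.map { fd =>
  fd.copy(feature = fd.feature.updated("featureA", transformFeature(fd.feature("featureA"))))
}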

How to preserve order of columns in cassandra

I have two tables in Cassandra:
CREATE TABLE table1 (
name text PRIMARY KEY,
grade text,
labid list<int>);
CREATE TABLE table2 (
name text PRIMARY KEY,
deptid list<int>,
grade text);
for example:
val result: RDD[(String, String, List[Int])] = myFunction()
result.saveToCassandra(keyspace, table1)
It is working fine.
but when using the line below:
result.saveToCassandra(keyspace, table2)
I'm getting this type of error: com.datastax.spark.connector.types.TypeConversionException: Cannot convert object test_data of type class java.lang.String to List[AnyRef]
Is there any solution using SomeColumns which satisfies both tables (we don't know in advance which table will be written)? E.g.:
result.saveToCassandra(keyspace, table, SomeColumns(....))?
By default the write only cares about position, not column name, so if your C* table has a different column order, you will get incorrect writes. The solution, like you said, is to use SomeColumns.
val columns = dataFrame.schema.map(_.name: ColumnRef)
dataFrame.rdd.saveToCassandra(keyspaceName, tableName, SomeColumns(columns: _*))
Now the dataframe columns will be written to C* using their names, not positions.
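The same idea also works directly for the tuple RDD from the question: with SomeColumns, the tuple fields are matched to the listed columns in order and then written by name, so one RDD layout can serve both tables. A small sketch reusing the keyspace and table-name variables from the question:
// r._1 -> name, r._2 -> grade, r._3 -> labid/deptid, regardless of the table's own column order
result.saveToCassandra(keyspace, table1, SomeColumns("name", "grade", "labid"))
result.saveToCassandra(keyspace, table2, SomeColumns("name", "grade", "deptid"))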
Your arguments should be in a different order because the tables have different column types:
val result: RDD[(String, String, List[Int])] = myFunction()
val reorder: RDD[(String, List[Int], String)] = result.map(r => (r._1, r._3, r._2))
reorder.saveToCassandra(keyspace, table2)