Spark - move data from tables to new table with extra column - scala

So we have a Cassandra project that requires us to migrate a large number of tables, merging each group of 3 separate tables into one.
e.g. table_d_abc, table_m_abc, table_w_abc to table_t_abc
Essentially data needs to be moved to this new table with an extra column with a value that was in the table's name.
There are hundreds of tables like this, so you can imagine how big a job it would be to hand-write a migration script. Naturally, I thought Spark should be able to do the job.
e.g.:
var tables = List("table_*_abc", "table_*_def") // etc
var periods = List('d','w','m')
for (table <- tables) {
  for (period <- periods) {
    var rTable = table.replace('*', period)
    var nTable = table.replace('*', 't')
    try {
      var t = sc.cassandraTable("data", rTable)
      var fr = t.first
      var columns = fr.toMap.keys.toArray :+ "period"
      var data = t.map(_.iterator.toArray :+ period)
      // This line does not work as data is a RDD of Array[Any] and not RDD of tuple[...]
      // How to ???
      data.saveToCassandra("data", nTable, SomeColumns(columns.map(ColumnName(_)):_*))
    } //catch {}
  }
}
versus:
var periods = List('d','w','m')
for (period <- periods) {
  sc.cassandraTable("data","table_" + period + "_abc")
    .map(v => (v.getString("a"), v.getInt("b"), v.getInt("c"), period))
    .saveToCassandra("data", "table_t_abc", SomeColumns("a","b","c","period"))
  // ... 100s of other scripts like this
}
Is what I'm trying to do possible?
Is there a way to programmatically save an extra column from a source with an unknown number of columns and datatypes?

The issue here is the RDD objects must be of a type which has a "RowWriter" defined. This maps the data in the object to C* insertable buffers.
RDD World
Using "CassandraRow" objects this is possible. These objects allow for generic contents and can be constructed on the fly. They are also the default output, so making a new one from an old one should be relatively cheap.
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/CassandraRow.scala
You would make a single RowMetadata (basically schema info) for each table with the additional column, then populate the row with the values of the input row + the new period variable.
Dataframe World
If you wanted to switch to DataFrames this would be easier, as you could just use the DataFrame withColumn method to add the column before saving.
cassandraDF.withColumn("period",lit("Value based on first row"))
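Whichever API does the writing, the list of source tables, target tables, and period values can be derived from the name patterns up front. The following is a minimal sketch in plain Python (not Spark or connector code); the helper names migration_plan and with_period are hypothetical, and a generic row is assumed to be a simple column-name-to-value mapping.

```python
def migration_plan(patterns, periods=("d", "w", "m"), target_marker="t"):
    """Expand each 'table_*_abc' pattern into (source, target, period) triples."""
    plan = []
    for pattern in patterns:
        # Every period-specific table of a pattern lands in the same target table.
        target = pattern.replace("*", target_marker)
        for period in periods:
            plan.append((pattern.replace("*", period), target, period))
    return plan


def with_period(row, period):
    """Return a copy of a generic row (column name -> value) plus the extra column."""
    new_row = dict(row)
    new_row["period"] = period
    return new_row
```

Each triple would then drive one read from the source table and one write to the target table, with the period value appended to every row regardless of the other columns.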

Related

Pyspark - iterate on a big dataframe

I'm using the following code
events_df = []
for i in df.collect():
    v = generate_event(i)
    events_df.append(v)
events_df = spark.createDataFrame(events_df, schema)
to go over each dataframe item and add an event header calculated in the generate_event function
def generate_event(delta_row):
    header = {
        "id": 1,
        ...
    }
    row = Row(Data=delta_row)
    return EntityEvent(header, row)

class EntityEvent:
    def __init__(self, _header, _payload):
        self.header = _header
        self.payload = _payload
It works fine locally for a DataFrame with few items (even with 1 000 000 items), but when we have more than 6 million the AWS Glue job fails.
Note: using an RDD seems to work better, but I can't use it because I have a problem with dates < 1900-01-01 (issue)
Is there a way to chunk the DataFrame and consolidate at the end?
The best solution we have found is to use Spark's built-in column functions, such as adding new columns with the struct and create_map functions...
events_df = (
    df
    .withColumn(
        "header",
        f.create_map(
            f.lit("id"),
            f.lit(1)
        )
    )
    ...
This way we can create as many columns as we need and apply transformations to get the required header structure.
PS: this solution (adding new columns to the DataFrame rather than iterating over it) avoids using an RDD and brings a big advantage in terms of performance!

How to Compare columns of two tables using Spark?

I am trying to compare two tables by reading them as DataFrames. For each common column in those tables, I compare the concatenation of a primary key, say order_id, with that column (for example order_date, order_name, order_event).
The Scala Code I am using
val primaryKey = "order_id"
for (i <- commonColumnsList){
val column_name = i
val tempDataFrameForNew = newDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
val tempDataFrameOld = oldDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
//Get those records which are common in both old/new tables
matchCountCalculated = tempDataFrameForNew.intersect(tempDataFrameOld)
//Get those records which aren't common in both old/new tables
nonMatchCountCalculated = tempDataFrameOld.unionAll(tempDataFrameForNew).except(matchCountCalculated)
//Total Null/Non-Null Counts in both old and new tables.
nullsCountInNewDataFrame = newDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nullsCountInOldDataFrame = oldDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nonNullsCountInNewDataFrame = newDFCount - nullsCountInNewDataFrame
nonNullsCountInOldDataFrame = oldDFCount - nullsCountInOldDataFrame
//Put the result for a given column in a Seq variable, later convert it to Dataframe.
tempSeq = tempSeq :+ Row(column_name, matchCountCalculated.toString, nonMatchCountCalculated.toString, (nullsCountInNewDataFrame - nullsCountInOldDataFrame).toString,
(nonNullsCountInNewDataFrame - nonNullsCountInOldDataFrame).toString)
}
// Final Step: Create DataFrame using Seq and some Schema.
spark.createDataFrame(spark.sparkContext.parallelize(tempSeq), schema)
The above code works fine for a medium-sized data set, but as the number of columns and records in my new and old tables increases, the execution time grows. Any sort of advice is appreciated.
Thank you in Advance.
You can do the following:
1. Outer join the old and new DataFrames on the primary key
val joined_df = df_old.join(df_new, Seq(primary_key), "outer")
2. Cache it if you possibly can. This will save you a lot of time
3. Now you can iterate over the columns and compare them using Spark functions (.isNull for not matched, === for matched, etc.)
for (col <- df_new.columns) {
  val matchCount = joined_df.filter(df_new(col).isNotNull && df_old(col).isNotNull).count()
  val nonMatchCount = ...
}
This should be considerably faster, especially when you can cache your DataFrame. If you can't, it might be a good idea to save the joined DataFrame to disk in order to avoid a shuffle each time.
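The per-column comparison over an outer join can be illustrated without Spark. This is a minimal sketch in plain Python, assuming each table is held as a mapping from primary key to a dict of column values; the function name compare_column is hypothetical, and treating a missing or null value on either side as a non-match is an assumption rather than something the answer specifies.

```python
def compare_column(old_rows, new_rows, column):
    """Count matching and non-matching values of one column across two tables.

    old_rows / new_rows map primary key -> {column name: value}.
    """
    match = 0
    non_match = 0
    # The union of keys plays the role of the outer join on the primary key.
    for key in old_rows.keys() | new_rows.keys():
        old_val = old_rows.get(key, {}).get(column)
        new_val = new_rows.get(key, {}).get(column)
        # Assumption: a missing or null value on either side is a non-match.
        if old_val is not None and old_val == new_val:
            match += 1
        else:
            non_match += 1
    return match, non_match
```

The point of the design is that the join happens once and every column is then compared row by row, instead of building and intersecting a new concatenated dataset per column.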

IndexOutOfBoundsException when writing dataframe into CSV

I'm trying to read an existing file into a DataFrame and then union that DataFrame with a new one I have already created. Both have the same columns and share the same schema.
Note: I cannot give meaningful names to the variables or share any more data because of restrictions.
val dfExist = spark.read.format("csv").option("header", "true").option("delimiter", ",").schema(schema).load(filePathAggregated3)
val df5 = df4.union(dfExist)
Once that's done, I get the "start_ts" (a timestamp in epoch format) that is duplicated in the union of the two DataFrames above (df4 and dfExist), and I also strip some characters I don't want
val df6 = df5.select($"start_ts").collect()
val df7 = df6.diff(df6.distinct).distinct.mkString.replace("[", "").replace("]", "")
Now I use this duplicate "start_ts" to filter the DataFrame and create 2 new DataFrames: one with the rows that have this duplicate timestamp, and one with the rows that do not
val itemsNotDup = df5.filter(!$"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
val items = df5.filter($"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
Then I save the avg_value and Number_of_val values in 2 different lists
items.map(t => t.getAs[Double]("avg_value")).collect().foreach(saveList => listDataDF += saveList.toString)
items.map(t => t.getAs[Long]("Number_of_val")).collect().foreach(saveList => listDataDF2 += saveList.toString)
Now I do some math with the values in the lists (THIS IS WHERE I'M GETTING ISSUES)
val newAvg = ((listDataDF(0).toDouble*listDataDF2(0).toDouble) - (listDataDF(1).toDouble*listDataDF2(1).toDouble)) / (listDataDF2(0) + listDataDF2(1)).toInt
val newNumberOfValues = listDataDF2(0).toDouble + listDataDF2(1).toDouble
Then I save the duplicate timestamp (df7), the average, and the number of values into a list as a single item. This list is converted into a DataFrame, which I then transform into a new DataFrame with the columns as they are supposed to be.
listDataDF3 += df7 + ',' + newAvg.toString + ',' + newNumberOfValues.toString + ','
val listDF = listDataDF3.toDF("value")
val listDF2 = listDF.withColumn("_tmp", split($"value", "\\,")).select(
$"_tmp".getItem(0).as("start_ts"),
$"_tmp".getItem(1).as("avg_value"),
$"_tmp".getItem(2).as("Number_of_val")
).drop("_tmp")
Finally, I union the DataFrame without duplicates with the new DataFrame, which holds the duplicate timestamp, the average of the duplicate avg values, and the sum of the numbers of values.
val finalDF = itemsNotDup.union(listDF2)
finalDF.coalesce(1).write.mode(SaveMode.Overwrite).format("csv").option("header","true").save(filePathAggregated3)
When I run this code in Spark it gives me the error. I assumed it was related to empty lists, since the error appears when doing the math with the list values. However, if I delete the line where I write to CSV, the code runs perfectly. I also saved the lists and the results of the calculations into files, and they are not empty.
My guess is that Spark is deleting the file before reading it (because of how it distributes tasks between workers), which is why the list is empty and why I get this error when doing the math with those values.
I'm trying to be as clear as possible but I cannot give much more details, nor show any of the output.
So, how can I avoid this error? I have only been using Scala/Spark for 1 month, so any code recommendations would be welcome as well.
Thanks beforehand.
This error is caused by the data. One of your lists does not contain as many elements as expected, so when you access that index the list throws this error.
It was a problem related to reading files. I added a check (df.rdd.isEmpty), and whenever the DF was empty I was getting this error. I turned this into an if/else statement that checks whether the DF is empty, and now it works fine.
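The emptiness check can be combined with the merge step itself. This is a minimal sketch in plain Python, assuming the intent is to merge partial averages into one count-weighted mean; note that the formula in the question subtracts the two products, so the weighted mean here is an assumption about what was meant. The function name merge_duplicates is hypothetical.

```python
def merge_duplicates(rows):
    """Merge (avg_value, number_of_val) pairs that share one timestamp.

    Returns (combined_average, total_count), or None when there is
    nothing to merge, instead of indexing into an empty list.
    """
    if not rows:
        return None
    total = sum(count for _, count in rows)
    if total == 0:
        return None
    # Assumption: each average is weighted by its number of values.
    weighted_sum = sum(avg * count for avg, count in rows)
    return weighted_sum / total, total
```

Returning None for the empty case lets the caller skip the merge and write the unchanged data, which mirrors the if/else check described above.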

Best way to gain performance when doing a join count using spark and scala

I have a requirement to validate an ingest operation. Basically, I have two big files in HDFS: one is Avro formatted (the ingested files), the other is Parquet formatted (the consolidated file).
Avro file has this schema:
filename, date, count, afield1,afield2,afield3,afield4,afield5,afield6,...afieldN
Parquet file has this schema:
fileName,anotherField1,anotherField1,anotherField2,anotherFiel3,anotherField14,...,anotherFieldN
If I load both files into DataFrames and then use a naive join-where, the job takes more than 24 hours on my local machine, which is unacceptable.
ingestedDF.join(consolidatedDF).where($"filename" === $"fileName").count()
What is the best way to achieve this? Dropping columns from the DataFrame before doing the join-where-count? Calculating the counts per DataFrame and then joining and summing?
PS
I was reading about the map-side join technique, but it looks like it would only work for me if one of the files were small enough to fit in RAM, and I can't guarantee that. So I would like to know which approach the community prefers for this.
http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
I would approach this problem by stripping down the data to only the field I'm interested in (filename), making a unique set of the filename with the source it comes from (the origin dataset).
At this point, both intermediate datasets have the same schema, so we can union them and just count. This should be orders of magnitude faster than using a join on the complete data.
// prepare some random dataset
val data1 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.8).map(i => (s"file$i", i, "rubbish"))
val data2 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.7).map(i => (s"file$i", i, "crap"))
val df1 = sparkSession.createDataFrame(data1).toDF("filename", "index", "data")
val df2 = sparkSession.createDataFrame(data2).toDF("filename", "index", "data")
// select only the column we are interested in and tag it with the source.
// Lets make it distinct as we are only interested in the unique file count
val df1Filenames = df1.select("filename").withColumn("df", lit("df1")).distinct
val df2Filenames = df2.select("filename").withColumn("df", lit("df2")).distinct
// union both dataframes
val union = df1Filenames.union(df2Filenames).toDF("filename","source")
// let's count the occurrences of filename, by using a groupby operation
val occurrenceCount = union.groupBy("filename").count
// we're interested in the count of those files that appear in both datasets (with a count of 2)
occurrenceCount.filter($"count"===2).count

Tagging a HBase Table using Spark RDD in Scala

I am trying to add an extra "tag" column to an HBase table. Tagging is done on the basis of words present in the rows of the table. For example, if "Dark" appears in a certain row, then its tag will be added as "Horror". I have read all the rows from the table into a Spark RDD and matched them against the words we tag on. A snippet of the code looks like this:
var hBaseRDD2=sc.newAPIHadoopRDD(conf,classOf[TableInputFormat],classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val transformedRDD = hBaseRDD2.map(tuple => {
(Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieName"))),
Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieSummary"))),
Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieActor")))
)
})
Here, "Moviesdata" is the column family of the HBase table, and "MovieName", "MovieSummary" and "MovieActor" are column names. "transformedRDD" in the above snippet is of type RDD[(String, String, String)]. It has been converted into type RDD[String] by:
val arrayRDD: RDD[String] = transformedRDD.map(x => (x._1 + " " + x._2 + " " + x._3))
From this, all words have been extracted by doing this:
val words = arrayRDD.map(x => x.split(" "))
The words we are looking for in the HBase table rows are in a csv file. One column of the csv, let's say the "synonyms" column, has the words we look for. Another column in the csv is a "target_tag" column, which has the tags to apply to a row when there is a match.
Read the csv by:
val csv = sc.textFile("/tag/moviestagdata.csv")
reading the synonyms column: (synonyms column is the second column, therefore "p(1)" in the below snippet)
val synonyms = csv.map(_.split(",")).map( p=>p(1))
reading the target_tag column: (target_tag is the 3rd column)
val targettag = csv.map(_.split(",")).map(p=>p(2))
Some rows in synonyms and targettag contain more than one string, separated by "###". The snippet to separate them is this:
val splitsyno = synonyms.map(x => x.split("###"))
val splittarget = targettag.map(x=>x.split("###"))
Now, to match each string from "splitsyno", we need to traverse every row, and a row might have many strings. To collect every string into a set, I did this (an empty set was created beforehand):
splitsyno.map(x=>x.foreach(y=>set += y))
To match every string with those in "words" created up above, I did this:
val check = words.exists(set contains _)
Now, the problem I am facing is that I don't know which rows in the csv match which rows in the HBase table. I need this to find the corresponding target string and the HBase row to add it to. How should I get this done? Any help would be highly appreciated.
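One way to keep track of which row matched which tag is to carry the row key alongside the text and to look words up in a synonym-to-tag mapping instead of a flat set. The following is a minimal sketch in plain Python, not HBase or Spark code. The function names are hypothetical, and it assumes the csv layout described above (synonyms in the second column, tags in the third, "###" as separator), that every tag on a csv line applies to every synonym on that line, and that matching is case-sensitive.

```python
def build_synonym_index(csv_lines):
    """Map each synonym to the set of tags listed on the same csv line."""
    index = {}
    for line in csv_lines:
        fields = line.split(",")
        synonyms = fields[1].split("###")
        tags = fields[2].split("###")
        for synonym in synonyms:
            index.setdefault(synonym, set()).update(tags)
    return index


def tag_rows(rows, index):
    """rows: iterable of (row_key, text). Returns {row_key: set of tags}."""
    tagged = {}
    for row_key, text in rows:
        tags = set()
        for word in text.split(" "):
            # Unknown words contribute no tags.
            tags |= index.get(word, set())
        if tags:
            tagged[row_key] = tags
    return tagged
```

Because the result is keyed by row, it records both which table row matched and which target tags to write back to it.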