Tagging an HBase Table using Spark RDD in Scala - scala

I am trying to add an extra "tag" column to an HBase table. Tagging is done on the basis of words present in the rows of the table. Say, for example, "Dark" appears in a certain row; then its tag will be added as "Horror". I have read all the rows from the table into a Spark RDD and have matched them with the words on which we would base the tags. A snippet of the code looks like this:
var hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

val transformedRDD = hBaseRDD2.map(tuple => {
  (Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"), Bytes.toBytes("MovieName"))),
   Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"), Bytes.toBytes("MovieSummary"))),
   Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"), Bytes.toBytes("MovieActor"))))
})
Here, "moviesdata" is the columnfamily of the HBase table and "MovieName"&"MovieSummary" & "MovieActor" are column names. "transformedRDD" in the above snippet is of type RDD[String,String,String]. It has been converted into type RDD[String] by:
val arrayRDD: RDD[String] = transformedRDD.map(x => (x._1 + " " + x._2 + " " + x._3))
From this, all words have been extracted by doing this:
val words = arrayRDD.map(x => x.split(" "))
The words we are looking for in the HBase table rows are in a CSV file. One of the columns of the CSV, let's say the "synonyms" column, has the words we would look for. Another column in the CSV is a "target_tag" column, which has the words that would be tagged onto the row for which there is a match.
Read the csv by:
val csv = sc.textFile("/tag/moviestagdata.csv")
Reading the synonyms column (the synonyms column is the second column, therefore "p(1)" in the snippet below):
val synonyms = csv.map(_.split(",")).map( p=>p(1))
Reading the target_tag column (target_tag is the 3rd column):
val targettag = csv.map(_.split(",")).map(p=>p(2))
Some rows in synonyms and targettag have more than one string, separated by "###". The snippet to separate them is this:
val splitsyno = synonyms.map(x => x.split("###"))
val splittarget = targettag.map(x=>x.split("###"))
Now, to match each string from "splitsyno", we need to traverse every row, and a row might contain several strings; hence, to collect every string into a set, I did this (an empty set was created beforehand):
splitsyno.map(x => x.foreach(y => set += y))
To match every string with those in "words" created up above, I did this:
val check = words.exists(set contains _)
Now, the problem I am facing is that I don't know exactly which strings from which rows of the CSV match which strings from which rows of the HBase table. This is needed because I would have to find the corresponding target string and know which row in the HBase table to add it to. How should I get this done? Any help would be highly appreciated.
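One way to keep that correspondence, as a rough sketch rather than a definitive fix: carry the HBase row key through the transformation and broadcast a synonym-to-tag lookup built from the CSV. Names such as keyedText, synonymToTag, bcTags and rowKeyToTags are illustrative, and each target_tag cell is kept whole here even if it contains several "###"-separated tags:
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Keep the HBase row key next to the row text so matches can be traced back.
val keyedText: RDD[(String, String)] = hBaseRDD2.map { case (rowKey, result) =>
  val name    = Bytes.toString(result.getValue(Bytes.toBytes("Moviesdata"), Bytes.toBytes("MovieName")))
  val summary = Bytes.toString(result.getValue(Bytes.toBytes("Moviesdata"), Bytes.toBytes("MovieSummary")))
  val actor   = Bytes.toString(result.getValue(Bytes.toBytes("Moviesdata"), Bytes.toBytes("MovieActor")))
  (Bytes.toString(rowKey.copyBytes()), s"$name $summary $actor")
}

// Build a synonym -> target_tag lookup on the driver and broadcast it to the executors.
val synonymToTag: Map[String, String] = csv
  .map(_.split(","))
  .flatMap(p => p(1).split("###").map(syn => (syn, p(2))))
  .collect()
  .toMap
val bcTags = sc.broadcast(synonymToTag)

// For every HBase row key, collect the tags whose synonyms appear in the row's text.
val rowKeyToTags: RDD[(String, Set[String])] = keyedText.map { case (key, text) =>
  val wordsInRow = text.split(" ").toSet
  val tags = bcTags.value.collect { case (syn, tag) if wordsInRow.contains(syn) => tag }.toSet
  (key, tags)
}
From rowKeyToTags you know which tags belong to which HBase row key, so the new "tag" column can then be written back with one Put per key.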

Related

How to get column values from list which contains column names in spark scala dataframe

I have a config defined which contains a list of columns for each table, to be used as a dedup key
for example:
config 1:
val lst = List(section_xid, learner_xid)
These are the columns that need to be used as dedup keys. The list is dynamic: some tables will have 1 value in it, some will have 2 or 3.
What I am trying to do is build a single key column from this list:
df.withColumn("dedup_key_sk", uuid(md5(concat($"lst(0)", $"lst(1)"))))
How do I make this dynamic so that it works for any number of columns in the list?
I tried doing this:
df.withColumn("dedup_key_sk", concat(Seq($"col1", $"col2"):_*))
For this to work I had to convert the list to a DF, and each value in the list needs to be in a separate column; I was not able to figure that out.
I tried doing this but it didn't work:
val res = sc.parallelize(List((lst))).toDF
Any input here will be appreciated. Thank you.
The list of strings can be mapped to a list of columns (using functions.col). This list of columns can then be used with concat:
val lst: List[String] = List("section_xid", "learner_xid")
df.withColumn("dedup_key_sk", concat(lst.map(col):_*)).show()
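If you also need the md5/uuid wrapping from the question, md5 and concat_ws are built-in functions in org.apache.spark.sql.functions and can be applied on top of the same dynamic list; a short sketch (uuid in the question looks like a custom UDF, so it is left out here):
import org.apache.spark.sql.functions.{col, concat_ws, md5}

// concat_ws keeps a separator between values, so ("ab", "c") and ("a", "bc") produce different keys
val dedupCols = lst.map(col)
df.withColumn("dedup_key_sk", md5(concat_ws("||", dedupCols: _*))).show()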

Iterate Through Rows of a Dataframe

Since I am a bit new to Spark Scala, I am finding it difficult to iterate through a Dataframe.
My dataframe contains 2 columns, one is path and the other is ingestiontime.
Example -
Now I want to iterate through this dataframe and use the data in the path and ingestiontime columns to prepare a Hive query and run it, such that the queries that are run look like -
ALTER TABLE <hiveTableName> ADD PARTITION (ingestiontime=<Ingestiontime_From_the_DataFrame_ingestiontime_column>) LOCATION (<Path_From_the_dataFrames_path_column>)
To achieve this, I used -
allOtherIngestionTime.collect().foreach { row =>
  var prepareHiveQuery = "ALTER TABLE myhiveTable ADD PARTITION (ingestiontime = " + row.mkString("<SomeCustomDelimiter>").split("<SomeCustomDelimiter>")(1) + " LOCATION ( " + row.mkString("<SomeCustomDelimiter>").split("<SomeCustomDelimiter>")(0) + ")"
  spark.sql(prepareHiveQuery)
}
But I feel this can be very dangerous, i.e. when my data contains the same delimiter. I am very interested in finding out other ways of iterating through the rows/columns of a Dataframe.
Check below code.
df
  .withColumn("query", concat_ws("",
    lit("ALTER TABLE myhiveTable ADD PARTITION (ingestiontime="), col("ingestiontime"),
    lit(") LOCATION (\""), col("path"), lit("\")")))
  .select("query")
  .as[String]
  .collect
  .foreach(q => spark.sql(q))
In order to access your columns path and ingestiontime you can use row.getString(0) and row.getString(1).
DataFrames
val allOtherIngestionTime: DataFrame = ???
allOtherIngestionTime.foreach { row =>
  val prepareHiveQuery = "ALTER TABLE myhiveTable ADD PARTITION (ingestiontime = " + row.getString(1) + ") LOCATION ( " + row.getString(0) + ")"
  spark.sql(prepareHiveQuery)
}
Datasets
If you use Datasets instead of Dataframes you will be able to use row.path and row.ingestiontime in an easier way.
case class myCaseClass(path: String, ingestionTime: String)

val ds: Dataset[myCaseClass] = ???
ds.foreach { row =>
  val prepareHiveQuery = "ALTER TABLE myhiveTable ADD PARTITION (ingestiontime = " + row.ingestionTime + ") LOCATION ( " + row.path + ")"
  spark.sql(prepareHiveQuery)
}
In any case, to iterate over a Dataframe or a Dataset you can use foreach, or map if you want to convert the content into something else.
Also, using collect() brings all the data to the driver, and that is not recommended; you could use foreach or map without collect().
If what you want is to iterate over the row fields, you can make it a Seq and iterate:
row.toSeq.foreach{column => ...}

How can I create a new DataFrame from a list?

Hello guys, I have this function that gets the row values from a DataFrame, converts them into a list and then makes a DataFrame from it.
//Gets the row content from the "content column"
val dfList = df.select("content").rdd.map(r => r(0).toString).collect.toList
val dataSet = sparkSession.createDataset(dfList)
//Makes a new DataFrame
sparkSession.read.json(dataSet)
What do I need to do to make a list with the other column values, so that I can have another DataFrame with those columns' values?
val dfList = df.select("content","collection", "h").rdd.map(r => {
println("******ROW********")
println(r(0).toString)
println(r(1).toString)
println(r(2).toString) //These have the row values from the other
//columns in the select
}).collect.toList
thanks
The approach doesn't look right; you don't need to collect the dataframe just to add new columns. Try adding columns directly to the dataframe using withColumn() / withColumnRenamed(): https://docs.azuredatabricks.net/spark/1.6/sparkr/functions/withColumn.html.
If you want to bring in columns from another dataframe, try joining. In any case it's not a good idea to use collect, as it will bring all your data to the driver.
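A rough sketch of both suggestions; otherDf, the id join key and the content_upper column are made-up names for illustration:
import org.apache.spark.sql.functions.{col, upper}

// Derive a new column from an existing one without collecting anything to the driver
val withExtra = df.withColumn("content_upper", upper(col("content")))

// Bring columns in from another dataframe by joining on a shared key
val joined = df.join(otherDf, Seq("id"), "left")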

IndexOutOfBoundsException when writing dataframe into CSV

So, I'm trying to read an existing file and save it into a DataFrame; once that's done I make a "union" between that existing DataFrame and a new one I have already created. Both have the same columns and share the same schema.
ALSO I CANNOT GIVE SIGNIFICANT NAME TO VARS NOR GIVE ANYMORE DATA BECAUSE OF RESTRICTIONS
val dfExist = spark.read.format("csv").option("header", "true").option("delimiter", ",").schema(schema).load(filePathAggregated3)
val df5 = df4.union(dfExist)
Once that's done, I get the "start_ts" (a timestamp in epoch format) that is duplicated in the union of the above dataframes (df4 and dfExist), and I also get rid of some characters I don't want:
val df6 = df5.select($"start_ts").collect()
val df7 = df6.diff(df6.distinct).distinct.mkString.replace("[", "").replace("]", "")
Now I use this "start_ts" duplicate to filter the DataFrame and create 2 new DataFrames selecting the items of this duplicate timestamp, and the items that are not like this duplicate timestamp
val itemsNotDup = df5.filter(!$"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
val items = df5.filter($"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
And then I save the avg_value and the Number_of_val into 2 different lists:
items.map(t => t.getAs[Double]("avg_value")).collect().foreach(saveList => listDataDF += saveList.toString)
items.map(t => t.getAs[Long]("Number_of_val")).collect().foreach(saveList => listDataDF2 += saveList.toString)
Now I do some maths with the values in the lists (THIS IS WHERE I'M GETTING ISSUES):
val newAvg = ((listDataDF(0).toDouble*listDataDF2(0).toDouble) - (listDataDF(1).toDouble*listDataDF2(1).toDouble)) / (listDataDF2(0) + listDataDF2(1)).toInt
val newNumberOfValues = listDataDF2(0).toDouble + listDataDF2(1).toDouble
Then I save the duplicate timestamp (df7), the average and the number of values into a list as a single item; this list is transformed into a DataFrame, and from it I get a new DataFrame with the columns as they are supposed to be:
listDataDF3 += df7 + ',' + newAvg.toString + ',' + newNumberOfValues.toString + ','
val listDF = listDataDF3.toDF("value")
val listDF2 = listDF.withColumn("_tmp", split($"value", "\\,")).select(
$"_tmp".getItem(0).as("start_ts"),
$"_tmp".getItem(1).as("avg_value"),
$"_tmp".getItem(2).as("Number_of_val")
).drop("_tmp")
Finally I union the DataFrame without duplicates with the new DataFrame, which has the duplicate timestamp, the average of the duplicate avg values and the sum of the numbers of values:
val finalDF = itemsNotDup.union(listDF2)
finalDF.coalesce(1).write.mode(SaveMode.Overwrite).format("csv").option("header","true").save(filePathAggregated3)
When I run this code in Spark it gives me the error. I supposed it was related to empty lists (since it gives me the error when doing the maths with the values of the lists), but if I delete the line where I write to CSV the code runs perfectly; also, I saved the lists and the values of the math calcs into files and they are not empty.
My supposition is that it is deleting the file before reading it (because of how Spark distributes tasks between workers), and that's why the list is empty; therefore I'm getting this error when trying to do maths with those values.
I'm trying to be as clear as possible but I cannot give much more details, nor show any of the output.
So, how can I avoid this error? Also, I've only been working with Scala/Spark for 1 month, so any code recommendations would be nice as well.
Thanks beforehand.
This error comes because of the data: one of your lists does not contain the elements expected. When you refer to that index, the list gives you this error.
It was a problem related to reading files. I added a check (df.rdd.isEmpty), and whenever the DF was empty I was getting this error. I made this an if/else statement that checks whether the DF is empty, and now it works fine.
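A minimal sketch of that guard, with df5 standing in for whichever dataframe can come back empty:
// df.rdd.isEmpty triggers a small job, but it avoids indexing into empty lists later on
if (!df5.rdd.isEmpty) {
  // duplicate handling, the maths on the lists and the CSV write go here
} else {
  // nothing to deduplicate: write (or skip) the dataframe as-is
  df5.coalesce(1).write.mode(SaveMode.Overwrite).format("csv").option("header", "true").save(filePathAggregated3)
}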

How to find out the keywords in a text table with Spark?

I am new to Spark. I have two tables in HDFS. One table (table 1) is a tag table, composed of some text, which could be some words or a sentence. Another table (table 2) has a text column. Every row of table 2 could contain more than one keyword from table 1. My task is to find all the keywords from table 1 that match the text column in table 2, and to output the keyword list for every row of table 2.
The problem is that I have to iterate over every row of table 2 and table 1. If I produce a big list for table 1 and use a map function for table 2, I still have to use a loop to iterate over the list inside the map function. And the driver shows a JVM memory limit error, even if the loop is not large (10 thousand iterations).
myTag is the tag list of table 1.
def ourMap(line: String, myTag: List[String]): String = {
  var ret = line
  val length = myTag.length
  for (i <- 0 to length - 1) {
    if (line.contains(myTag(i)))
      ret = ret.replaceAll(myTag(i), "_")
  }
  ret
}
val matched = result.map(b => ourMap(b, tagList))
Any suggestion on how to finish this task? With or without Spark.
Many thanks!
An example is as follows:
table1
row1|Spark
row2|RDD
table2
row1| Spark is a fast and general engine. RDD supports two types of operations.
row2| All transformations in Spark are lazy.
row3| It is for test. I am a sentence.
Expected result :
row1| Spark,RDD
row2| Spark
MAJOR EDIT:
The first table may actually contain sentences and not just simple keywords:
row1| Spark
row2| RDD
row3| two words
row4| I am a sentence
Here you go, considering the data sample that you have provided:
val table1: Seq[(String, String)] = Seq(("row1", "Spark"), ("row2", "RDD"), ("row3", "Hashmap"))
val table2: Seq[String] = Seq("row1##Spark is a fast and general engine. RDD supports two types of operations.", "row2##All transformations in Spark are lazy.")
val rdd1: RDD[(String, String)] = sc.parallelize(table1)
val rdd2: RDD[(String, String)] = sc.parallelize(table2).map(_.split("##").toList).map(l => (l.head, l.tail(0))).cache
We'll build an inverted index of the second data table, which we will join to the first table:
val df1: DataFrame = rdd1.toDF("key", "value")
val df2: DataFrame = rdd2.toDF("key", "text")
val df3: DataFrame = rdd2.flatMap { case (row, text) => text.trim.split( """[^\p{IsAlphabetic}]+""")
.map(word => (word, row))
}.groupByKey.mapValues(_.toSet.toSeq).toDF("word", "index")
import org.apache.spark.sql.functions.explode
val results: RDD[(String, String)] = df3.join(df1, df1("value") === df3("word")).drop("key").drop("value").withColumn("index", explode($"index")).rdd.map {
case r: Row => (r.getAs[String]("index"), r.getAs[String]("word"))
}.groupByKey.mapValues(i => i.toList.mkString(","))
results.take(2).foreach(println)
// (row1,Spark,RDD)
// (row2,Spark)
MAJOR EDIT:
As mentioned in the comment: the specifications of the issue changed. Keywords are no longer simple keywords; they might be sentences. In that case, this approach wouldn't work, it's a different kind of problem. One way to do it is to use the Locality-Sensitive Hashing (LSH) algorithm for nearest-neighbour search.
An implementation of this algorithm is available here.
The algorithm and its implementation are unfortunately too long to discuss on SO.
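For reference, Spark ML (2.1+) also ships a MinHash LSH implementation. A rough sketch of an approximate Jaccard-similarity join between the two tables, reusing the df1/df2 names from above; the 0.8 distance threshold and the column renames are only illustrative:
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, RegexTokenizer}

// Tokenise the text on both sides and turn the word sets into sparse binary vectors
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words").setPattern("""[^\p{IsAlphabetic}]+""")
val tf        = new HashingTF().setInputCol("words").setOutputCol("features").setBinary(true)

val tagsFeat = tf.transform(tokenizer.transform(df1.withColumnRenamed("value", "text")))
val textFeat = tf.transform(tokenizer.transform(df2))

// Fit MinHash on the text side and run an approximate similarity join on Jaccard distance
val lsh   = new MinHashLSH().setInputCol("features").setOutputCol("hashes").setNumHashTables(5)
val model = lsh.fit(textFeat)

model.approxSimilarityJoin(tagsFeat, textFeat, 0.8, "jaccardDistance")
  .select($"datasetA.key".as("tag_row"), $"datasetB.key".as("text_row"), $"jaccardDistance")
  .show()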
From what I could gather from your problem statement, you are essentially trying to tag the data in table 2 with the keywords present in table 1. For this, instead of loading table 1 as a list and then pattern matching each keyword against each row of table 2, do this:
Load table 1 as a HashSet.
Traverse table 2 and, for each word in the phrase, do a lookup in the above HashSet. I assume the number of words you have to look up is smaller than the number of pattern matches you would otherwise do for each keyword. Remember, a lookup is now an O(1) operation, whereas pattern matching is not.
Also, in this process you can filter out words like "is", "are", "when", "if", etc., as they will never be used for tagging. That reduces the number of words you need to look up in the HashSet.
The HashSet can be loaded into memory (I think 10K keywords should not take more than a few MBs). This variable can be shared across executors through a broadcast variable.
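A rough sketch of that approach on the rdd1/rdd2 pairs from the earlier answer, assuming single-word keywords; the stop-word list is illustrative:
// Broadcast the keyword set once instead of capturing a List in every task
val keywordSet: Set[String] = rdd1.map(_._2).collect().toSet
val bcKeywords = sc.broadcast(keywordSet)
val stopWords  = Set("is", "are", "when", "if", "a", "the")

// For each row of table 2, look every remaining word up in the broadcast set: O(1) per word
val tagged = rdd2.mapValues { text =>
  text.split("""[^\p{IsAlphabetic}]+""")
    .filterNot(w => stopWords.contains(w.toLowerCase))
    .filter(bcKeywords.value.contains)
    .distinct
    .mkString(",")
}

tagged.collect().foreach(println)
// e.g. (row1,Spark,RDD)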