hash join in spark scala on pair rdd - scala

I am trying to perform a partition + broadcast join in Spark Scala. I have a dictionary that I am broadcasting to all the nodes. The structure of the dictionary is as follows:
{ key: Option[List[String]] } // I created this dictionary using a groupByKey first and then called collectAsMap before broadcasting.
The dictionary was built from a table whose structure is similar to the one mentioned below.
I have a table that is a pair RDD whose structure is as follows:
Col A | Col B
I am trying to perform a join as follows:
val join_output = table.flatMap {
  case (key, value) => custom_dictionary.value.get(key).map(
    otherValue => otherValue.foreach((value, _))
  )
}
My goal is to get a pair RDD as output whose contents are (value from the table, element from the list stored in the dictionary).
The code compiles and runs successfully, but when I check the output, all I see being saved is "()". Where am I going wrong?
I have looked at some other posts that touch on this to some extent, but none of the suggestions worked. I would appreciate some guidance on this issue. Also, if there is a post that addresses exactly this, please point me to it.
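For comparison, here is a minimal sketch of a flatMap that does emit pairs, assuming the broadcast value is a plain Map[K, List[String]] (i.e. the Option in the description comes from calling get, not from the stored values):
val join_output = table.flatMap { case (key, value) =>
  // look the key up in the broadcast map; emit one (table value, list element) pair per element
  custom_dictionary.value.getOrElse(key, List.empty[String])
    .map(other => (value, other))
}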

Related

Array manipulation in Spark, Scala

I'm new to Scala and Spark, and I ran into a problem while trying to learn from some toy DataFrames.
I have a DataFrame with the following two columns:
Name_Description | Grade
Name_Description is an array, and Grade is just a letter. It's Name_Description that I'm having a problem with; I'm trying to change this column using Scala on Spark.
Name_Description is not a fixed-size array. It could be something like
['asdf_ Brandon', 'Ca%abc%rd']
['fthhhhChris', 'Rock', 'is the %abc%man']
The only problems are the following:
1. the first element of the array ALWAYS has 6 garbage characters, so the real content starts at the 7th character.
2. %abc% randomly pops up in elements, so I want to erase it.
Is there any way to achieve those two things in Scala? For instance, I just want
['asdf_ Brandon', 'Ca%abc%rd'], ['fthhhhChris', 'Rock', 'is the %abc%man']
to change to
['Brandon', 'Card'], ['Chris', 'Rock', 'is the man']
What you're trying to do might be hard to achieve using standard Spark functions, but you could define a UDF for that:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

val removeGarbage = udf { arr: WrappedArray[String] =>
  // in case the array is empty we need to map over Option
  arr.headOption
    // drop the first 6 characters from the first element, then remove %abc% from the rest
    .map(head => head.drop(6) +: arr.tail.map(_.replace("%abc%", "")))
    .getOrElse(arr)
}
Then you just need to use this UDF on your Name_Description column:
import spark.implicits._

val df = List(
  (1, Array("asdf_ Brandon", "Ca%abc%rd")),
  (2, Array("fthhhhChris", "Rock", "is the %abc%man"))
).toDF("Grade", "Name_Description")

df.withColumn("Name_Description", removeGarbage($"Name_Description")).show(false)
Show prints:
+-----+-------------------------+
|Grade|Name_Description |
+-----+-------------------------+
|1 |[Brandon, Card] |
|2 |[Chris, Rock, is the man]|
+-----+-------------------------+
We are generally encouraged to use Spark SQL functions and to avoid UDFs where we can. I have a simplified solution for this which makes use of the Spark SQL functions.
Please find my approach below.
val d = Array((1,Array("asdf_ Brandon","Ca%abc%rd")),(2,Array("fthhhhChris", "Rock", "is the %abc%man")))
val df = spark.sparkContext.parallelize(d).toDF("Grade","Name_Description")
This is how I created the input dataframe.
df.select('Grade,posexplode('Name_Description)).registerTempTable("data")
We explode the array along with the position of each element. I register the DataFrame as a temp table so that a SQL query can generate the required output.
spark.sql("""select Grade, collect_list(Names) from (select Grade,case when pos=0 then substring(col,7) else replace(col,"%abc%","") end as Names from data) a group by Grade""").show
This query gives the required output. Hope this helps.
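As a side note (assuming Spark 2.x), registerTempTable is deprecated there; createOrReplaceTempView is the current equivalent:
df.select('Grade, posexplode('Name_Description)).createOrReplaceTempView("data")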

Remove records from mutable.MutableList in Scala

I have a mutable.MutableList[emp] with the following structure.
case class emp(name: String,id:String,sal: Long,dept: String)
I am generating records based on the above case class in the mutable.MutableList[emp] below.
val list1: mutable.MutableList[emp] = ((mike, 1, 123, HR),(mike,2,123,sys),(Lind,1,2323,sys))
If the same name appears with both id 1 and id 2, I need to take only the id 2 record and drop the id 1 record. If id 2 is not present, I have to take id 1.
How do I achieve this? I tried the following approach, but the results are not accurate:
1. converted the mutable.MutableList to a DataFrame
2. filtered records with id 1 (id1s_DF)
3. filtered records with id 2 (other_rec_DF)
4. joined the records on name, using leftsemi as the join type
val join_info_DF = other_rec_DF.join(id1s_DF, id1s_DF("name") =!= other_rec_DF("name"),"leftsemi")
The above join will give all the names which are present in other_rec_DF and not present in id1s_DF.
It looks like I am doing something wrong with the join and not getting the expected results.
Could someone please help me achieve this, either on the MutableList or by converting it into a DataFrame.
Thanks,
Babu
If the size of your data is small enough, you don't need something like Apache Spark for this task.
In plain Scala, the code would look something like this:
import scala.collection.mutable

case class Emp(name: String, id: Int, sal: Long, dept: String)

val list1: mutable.MutableList[Emp] = mutable.MutableList(
  Emp("mike", 1, 123, "HR"),
  Emp("mike", 2, 123, "sys"),
  Emp("Lind", 1, 2323, "sys")
)

val result = list1
  .groupBy(_.name)                                        // group records by name
  .mapValues(_.sortBy(_.id)(Ordering[Int].reverse).head)  // keep the record with the highest id
  .values

result.foreach(println)
The output of the above code would be
Emp(Lind,1,2323,sys)
Emp(mike,2,123,sys)
The idea is to group by the key on which you want to de-duplicate the items, sort each group, and pick the record with the highest id. We then drop the keys and keep only the values.
The same approach would work on Spark as well.
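For reference, a minimal sketch of the same de-duplication on a Spark DataFrame, assuming a SparkSession named spark and the list1 defined above (the window function is my choice here, not part of the original answer):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

// Sketch: per name, keep only the row with the highest id.
val empDF = list1.toList.toDF()
val byIdDesc = Window.partitionBy("name").orderBy($"id".desc)
empDF
  .withColumn("rn", row_number().over(byIdDesc))
  .where($"rn" === 1)
  .drop("rn")
  .show()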

Is there a Scala collection that maintains the order of insert?

I have a List: hdtList which contains the columns of a Hive table:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string
I have a List: partition_columns which contains two elements: source_system_name and period_year.
Using partition_columns, I am trying to match these and move the corresponding columns in hdtList to the end of it, as below:
val (pc, notPc) = hdtList.partition(c => partition_columns.contains(c.takeWhile(x => x != ' ')))
But when I print them as: println(notPc.mkString(",") + "," + pc.mkString(","))
the output comes out in an order I don't want, as below:
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,period_year bigint,source_system_name string
The column period_year comes first and source_system_name last. Is there any way I can produce the data as below, so that the order of the columns in partition_columns is maintained?
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,source_system_name string,period_year bigint
I know there is an option to reverse a List, but I'd like to learn whether I can use a collection that maintains insertion order.
It doesn't matter which collection you use; you only use partition_columns to call contains, which doesn't depend on its order, so how could that order be maintained?
But your code does maintain an order: it's just hdtList's.
Something like
// get is ugly, but safe here
val pc1 = partition_columns.map(x => pc.find(y => y.startsWith(x)).get)
after your code will give you the desired order, though there is probably a more efficient way to do it.
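For completeness, printing the combined list with the partition columns moved to the end, in the desired order (a sketch reusing the names above):
println((notPc ++ pc1).mkString(","))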

matching keys of a hashmap to entries of a spark RDD in scala, adding the value if a match is found, and writing the rdd back to hbase

I am trying to read an HBase table using Scala and then add a new column of tags based on the content of the rows in the HBase table. I have read the table as a Spark RDD. I also have a hashmap; its keys are to be matched with the entries of the Spark RDD (generated from the HBase table), and if a match is found, the value from the hashmap is to be added in a new column.
The function to write to the HBase table in a new column is this:
def convert(a: Int, s: String): Tuple2[ImmutableBytesWritable, Put] = {
  val p = new Put(a.toString.getBytes())
  p.add(Bytes.toBytes("columnfamily"), Bytes.toBytes("col_2"), s.toString.getBytes()) // a.toString.getBytes())
  println("the value of a is: " + a)
  new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(Bytes.toBytes(a)), p)
}
new PairRDDFunctions(newrddtohbaseLambda.map(x=>convert(x, ranjan))).saveAsHadoopDataset(jobConfig)
Then the code to read a string from the hashmap, compare it, and add the value back is this:
csvhashmap.keys.foreach{i=> if (arrayRDD.zipWithIndex.foreach{case(a,j) => a.split(" ").exists(i contains _); p = j.toInt}==true){new PairRDDFunctions(convert(p,csvhashmap(i))).saveAsHadoopDataset(jobConfig)}}
Here csvhashmap is the hashmap described above, and "words" is the RDD where we are trying to match the strings. When the above command is run, I get the following error:
error: type mismatch;
found : (org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Put)
required: org.apache.spark.rdd.RDD[(?, ?)]
How do I get rid of it? I have tried many ways to change the data types, but each time I get some error. I have also checked the individual functions in the above snippet and they are fine on their own; when I integrate them, I get the above error. Any help would be appreciated.
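The mismatch is that a single (ImmutableBytesWritable, Put) tuple is being passed where an RDD of such pairs is required. A sketch of the usual pattern, assuming the names arrayRDD, csvhashmap, convert and jobConfig from the snippets above:
// Sketch only: build one RDD of (row key, Put) pairs, then save the whole RDD in a single call.
val putsRDD = arrayRDD.zipWithIndex.flatMap { case (a, j) =>
  csvhashmap.keys
    .find(k => a.split(" ").exists(w => k.contains(w))) // first hashmap key matching any word in the row
    .map(k => convert(j.toInt, csvhashmap(k)))          // build the (ImmutableBytesWritable, Put) pair
}
putsRDD.saveAsHadoopDataset(jobConfig)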

ScalaSpark - Create a pair RDD with a key and a list of values

I have a log file with data like the following:
1,2008-10-23 16:05:05.0,\N,Donald,Becton,2275 Washburn Street,Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
2,2008-11-12 03:00:01.0,\N,Donna,Jones,3885 Elliott Street,San Francisco,CA,94171,4150835799,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
I need to create a pair RDD with the postal code as the key and a list of names (Last Name,First Name) in that postal code as the value.
I need to use mapValues, and I did the following:
val namesByPCode = accountsdata.keyBy(line => line.split(',')(8)).mapValues(fields => (fields(0), (fields(4), fields(5)))).collect()
but I'm getting an error. Can someone tell me what is wrong with my statement?
keyBy doesn't change the value, so the value stays a single "unsplit" string. You want to first use map to perform the split (to get an RDD[Array[String]]), and then use keyBy and mapValues as you did on the split result:
val namesByPCode = accountsdata.map(_.split(","))
.keyBy(_(8))
.mapValues(fields => (fields(0), (fields(4), fields(5))))
.collect()
BTW - per your description, it sounds like you'd also want to call groupByKey on this result (before calling collect), if you want each zipcode to evaluate into a single record with a list of names. keyBy doesn't perform the grouping; it just turns an RDD[V] into an RDD[(K, V)], leaving each record a single record (with potentially many records sharing the same key).
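A sketch of that grouped variant (the field indices for last and first name are an assumption based on the sample rows shown above):
val namesByPCode = accountsdata.map(_.split(","))
  .keyBy(_(8))                                   // postal code
  .mapValues(fields => (fields(4), fields(3)))   // (last name, first name) -- assumed indices
  .groupByKey()
  .collect()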