Filtering RDD by substring values - scala

I want to filter out some entries from an RDD[(String, List[(String, String, String, String)])] based on the values in its substrings (the nested tuples).
This is my sample data:
(600,List((600,111,1,1), (615,111,1,5)))
(600,List((638,111,2,null), (649,222,3,1)))
(600,List((638,111,2,3), (649,null,3,1)))
In particular, I want to check the 4th field (counting from 1) of each substring. If it is equal to null, the whole entry should be deleted. The result should be the following:
(600,List((600,111,1,1), (615,111,1,5)))
(600,List((638,111,2,3), (649,null,3,1)))
So, in this particular example the second entry should be deleted.
This is my attempt to solve this task:
val filtered = separated.map(l => (l._1,l._2.filter(!_._4.equals("null"))))
The problem is that it deletes only the offending substring, not the whole entry. The result is the following (instead of the one shown above):
(600,List((600,111,1,1), (615,111,1,5)))
(600,List((649,222,3,1)))
(600,List((638,111,2,3), (649,null,3,1)))

Filter your RDD by checking that the list of tuples does not contain a tuple whose 4th field is "null":
yourRdd.filter({
case (id, list) => !list.exists(t => t._4.equals("null"))
})
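As a minimal sketch, here is the same predicate on a plain Scala List standing in for the RDD, so it runs without a Spark context. The name separated comes from the question; FilterEntries and Rec are made up for illustration. The function passed to filter has the same shape on an RDD.

```scala
object FilterEntries {
  // Hypothetical alias for the nested 4-tuples.
  type Rec = (String, String, String, String)

  // Sample data from the question; "null" is the string "null", not a null reference.
  val separated: List[(String, List[Rec])] = List(
    ("600", List(("600", "111", "1", "1"), ("615", "111", "1", "5"))),
    ("600", List(("638", "111", "2", "null"), ("649", "222", "3", "1"))),
    ("600", List(("638", "111", "2", "3"), ("649", "null", "3", "1")))
  )

  // Keep an entry only if none of its tuples has "null" in the 4th field.
  val filtered: List[(String, List[Rec])] =
    separated.filter { case (_, list) => !list.exists(_._4 == "null") }

  def main(args: Array[String]): Unit = filtered.foreach(println)
}
```

The difference from the attempt in the question is that filter is applied to the outer collection, so the decision is made per entry rather than per tuple.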

Related

Make a Spark code more efficient and cleaner

The following code cleans a corpus of documents with pipelineClean(corpus), which returns a DataFrame with two columns:
"id": Long
"tokens": Array[String].
After that, the code produces a DataFrame with the following columns:
"term": String
"postingList": List[Array[Long, Long]] (the first long is the document ID, the other is the term frequency inside that document)
pipelineClean(corpus)
.select($"id" as "documentId", explode($"tokens") as "term") // explode creates a new row for each element in the given array column
.groupBy("term", "documentId").count //group by and then count number of rows per group, returning a df with groupings and the counting
.where($"term" =!= "") // seems like there are some tokens that are empty, even though Tokenizer should remove them
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")
.groupBy("term").agg(collect_list($"posting") as "postingList") // we do another grouping in order to collect the postings into a list
.orderBy("term")
.persist(StorageLevel.MEMORY_ONLY_SER)
My question is: would it be possible to make this code shorter and/or more efficient? For example, is it possible to do the grouping within a single groupBy?
It doesn't look like you can do much more than what you've got, apart from skipping the withColumn call and using a straight select:
.select(col("term"), struct(col("documentId"), col("count")) as "posting")
instead of
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")
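To see why two groupings show up, here is a rough analogue of the pipeline on plain Scala collections (not Spark, and not the DataFrame API). The object name PostingLists and the tiny docs corpus are made up for illustration. The first grouping produces the per-(term, document) counts, and the second collects those counts per term.

```scala
object PostingLists {
  // Hypothetical cleaned corpus: (document id, tokens).
  val docs: Seq[(Long, Seq[String])] = Seq(
    (1L, Seq("spark", "scala", "spark", "")),
    (2L, Seq("scala", "rdd"))
  )

  val postingLists: Seq[(String, Seq[(Long, Long)])] =
    docs
      .flatMap { case (id, tokens) => tokens.map(t => (t, id)) } // one row per token, like explode
      .filter { case (term, _) => term.nonEmpty }                // drop empty tokens
      .groupBy(identity)                                          // group by (term, documentId)
      .toSeq
      .map { case ((term, id), hits) => (term, (id, hits.size.toLong)) } // (term, (docId, termFreq))
      .groupBy(_._1)                                              // group by term
      .toSeq
      .map { case (term, ps) => (term, ps.map(_._2).sortBy(_._1)) }
      .sortBy(_._1)                                               // order by term

  def main(args: Array[String]): Unit = postingLists.foreach(println)
}
```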

Is there a Scala collection that maintains the order of insert?

I have a List, hdtList, which contains entries that represent the columns of a Hive table:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string
I have a List, partition_columns, which contains two elements: source_system_name, period_year
Using partition_columns, I am trying to find the corresponding columns in hdtList and move them to the end of it, as below:
val (pc, notPc) = hdtList.partition(c => partition_columns.contains(c.takeWhile(x => x != ' ')))
But when I print them with println(notPc.mkString(",") + "," + pc.mkString(",")),
the output is not in the order I want:
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,period_year bigint,source_system_name string
The column period_year comes first and source_system_name last. Is there any way to get the output below, so that the order of the columns in partition_columns is maintained?
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,source_system_name string,period_year bigint
I know there is an option to reverse a List but I'd like to learn if I can implement a collection that maintains that order of insert.
It doesn't matter which collections you use; you only use partition_columns to call contains which doesn't depend on its order, so how could it be maintained?
But your code does maintain order: it's just hdtList's.
Something like
// get is ugly, but safe here
val pc1 = partition_columns.map(x => pc.find(y => y.startsWith(x)).get)
after your code will give you the desired order, though there's probably a more efficient way to do it.
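Putting it together, a minimal sketch with a shortened hdtList (a subset of the columns in the question) might look like this. The names PartitionOrder, partitionColumns, name and pcOrdered are made up for illustration. It compares the exact column name instead of using startsWith, which is a small variation that avoids a prefix such as period matching period_year, and it uses flatMap so that a partition column with no match is skipped instead of failing on get.

```scala
object PartitionOrder {
  // Shortened version of the column list from the question.
  val hdtList: List[String] = List(
    "forecast_id bigint", "period_year bigint", "period_num bigint",
    "source_system_name string", "year string"
  )
  val partitionColumns: List[String] = List("source_system_name", "period_year")

  // Column name = text before the first space.
  def name(col: String): String = col.takeWhile(_ != ' ')

  // partition keeps the order of hdtList on both sides.
  val (pc, notPc) = hdtList.partition(c => partitionColumns.contains(name(c)))

  // Re-order the matches by walking partitionColumns in its own order.
  val pcOrdered: List[String] = partitionColumns.flatMap(p => pc.find(c => name(c) == p))

  val result: String = (notPc ++ pcOrdered).mkString(",")

  def main(args: Array[String]): Unit = println(result)
}
```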

.contains giving empty string in rdd

I have an array of ids called id. I have an RDD called r which has a field called idval, which might contain some of the ids in the id array. I want to get only the rows whose idval is in this array. I am using
val new_r = r.filter(x => id.contains(x.idval))
But, when I go to do
new_r.take(10).foreach(println)
I get a NumberFormatException: empty String
Does contains include empty strings?
Here is an example of lines in the RDD:
idval,part,date,sign
1,'leg',2011-01-01,1.0
18,'arm',2013-01-01,1.0
6, 'nose', 2011-01-01,1.0
I have a separate array with id's such as [1,3,4,5,18,...] and I want to extract the rows of the RDD above which have the idval in ids
So filtering this should give me
idval,part,date,sign
1,'leg',2011-01-01,1.0
18,'arm',2013-01-01,1.0
as idval 1 and 18 are in the array above.
The problem is that I am getting this empty string error when I go to foreach(println) the rows in the new filtered array.
The RDD is loaded from a CSV file (loadFromUrl) and then it's mapped:
val r1 = rdd.map(s=>s.split(","))
val r2 = r1.map(p=>Event(p(0), p(1),dateFormat.parse(p(2).asInstanceOf[String]), p(3).toDouble))
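One possible explanation, offered as a guess rather than a diagnosis: contains itself does not throw this exception, but "".toDouble does (with the message "empty String"), and since RDD transformations are lazy, a failure inside the map step only surfaces when an action such as take runs. The sketch below uses a plain List instead of an RDD and parses defensively. The Event case class here is a hypothetical stand-in (the date is kept as a string), the last input line with a blank sign field is invented to trigger the failure, and ParseRows, parse and kept are illustrative names.

```scala
import scala.util.Try

object ParseRows {
  // Hypothetical stand-in for the Event class in the question.
  final case class Event(idval: Int, part: String, date: String, sign: Double)

  val lines: List[String] = List(
    "idval,part,date,sign",        // header line
    "1,'leg',2011-01-01,1.0",
    "18,'arm',2013-01-01,1.0",
    "6, 'nose', 2011-01-01,1.0",
    "7,'ear',2012-01-01, "         // invented row with a blank sign field
  )
  val ids: Set[Int] = Set(1, 3, 4, 5, 18)

  // Try(...).toOption turns a failed conversion into None instead of an exception.
  def parse(line: String): Option[Event] = {
    val p = line.split(",", -1).map(_.trim)
    if (p.length < 4) None
    else
      for {
        id   <- Try(p(0).toInt).toOption
        sign <- Try(p(3).toDouble).toOption
      } yield Event(id, p(1), p(2), sign)
  }

  // Unparseable lines (header, blank fields) are dropped before filtering by id.
  val kept: List[Event] = lines.flatMap(parse).filter(e => ids.contains(e.idval))

  def main(args: Array[String]): Unit = kept.foreach(println)
}
```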

ScalaSpark - Create a pair RDD with a key and a list of values

I have a log file with a data as the following:
1,2008-10-23 16:05:05.0,\N,Donald,Becton,2275 Washburn Street,Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
2,2008-11-12 03:00:01.0,\N,Donna,Jones,3885 Elliott Street,San Francisco,CA,94171,4150835799,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
I need to create a pair RDD with the postal code as the key and a list of names (Last Name,First Name) in that postal code as the value.
I need to use mapValues and I did the following:
val namesByPCode = accountsdata.keyBy(line => line.split(',')(8)).mapValues(fields => (fields(0), (fields(4), fields(5)))).collect()
but I'm getting an error. Can someone tell me what is wrong with my statement?
keyBy doesn't change the value, so the value stays a single "unsplit" string. You want to first use map to perform the split (to get an RDD[Array[String]]), and then use keyBy and mapValues as you did on the split result:
val namesByPCode = accountsdata.map(_.split(","))
.keyBy(_(8))
.mapValues(fields => (fields(0), (fields(4), fields(5))))
.collect()
BTW - per your description, sounds like you'd also want to call groupByKey on this result (before calling collect), if you want each zipcode to evaluate into a single record with a list of names. keyBy doesn't perform the grouping, it just turns an RDD[V] into an RDD[(K, V)] leaving each record a single record (with potentially many records with same "key").
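A rough sketch of the whole chain, including the grouping step, on a plain Scala List standing in for the RDD (groupBy plays the role of groupByKey here). The first input line is from the question; the second is a hypothetical record added so that one postal code has two names. Note that the sketch picks fields(4) and fields(3), which is an assumption based on the sample data, where index 3 holds the first name and index 4 the last name.

```scala
object NamesByPostalCode {
  val accountsdata: List[String] = List(
    "1,2008-10-23 16:05:05.0,\\N,Donald,Becton,2275 Washburn Street,Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0",
    // Hypothetical record sharing the postal code above.
    "3,2009-01-01 00:00:00.0,\\N,Jane,Doe,1 Example Road,Oakland,CA,94660"
  )

  val namesByPCode: Map[String, List[(String, String)]] =
    accountsdata
      .map(_.split(","))                                   // split first
      .map(fields => (fields(8), (fields(4), fields(3))))  // key by postal code, value (last, first)
      .groupBy(_._1)                                       // one record per postal code
      .map { case (pcode, pairs) => (pcode, pairs.map(_._2)) }

  def main(args: Array[String]): Unit = namesByPCode.foreach(println)
}
```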

Order by Value in Spark pairRDD from (Key,Value) where the value is from spark-sql

I have created a map like this -
val b = a.map(x => (x(0), x) )
Here b is of the type
org.apache.spark.rdd.RDD[(Any, org.apache.spark.sql.Row)]
How can I sort the PairRDD within each key using a field from the value row?
After that I want to run a function which processes all the values for each Key in isolation in the previously sorted order. Is that possible? If yes can you please give an example.
Is there any consideration needed for Partitioning the Pair RDD?
Answering only your first question:
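val indexToSelect: Int = ??? // index of a field with a sortable type (one that has an Ordering or is Ordered)
val sorted = rdd.sortBy(pair => pair._2.getAs[Int](indexToSelect)) // Int is an assumption; use the field's actual type
This selects the second element of the pair (pair._2) and, from that row, the value at indexToSelect. A plain pair._2(indexToSelect), or more verbosely pair._2.apply(indexToSelect), returns Any, for which no Ordering is available, hence the typed accessor.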
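For the "sorted within each key, then processed per key" part, here is a sketch on plain Scala collections rather than Spark. Vector[Int] is a hypothetical stand-in for Row, and SortWithinKey, sortedWithinKey and scoresPerKey are illustrative names. On a pair RDD, groupByKey followed by mapValues with a sort would have a similar shape, assuming each key's values fit in memory.

```scala
object SortWithinKey {
  // Hypothetical rows of the form Vector(key, score).
  val a: List[Vector[Int]] =
    List(Vector(1, 30), Vector(2, 5), Vector(1, 10), Vector(2, 7), Vector(1, 20))
  val indexToSelect = 1

  // Same shape as in the question: (first field, whole row).
  val b: List[(Int, Vector[Int])] = a.map(x => (x(0), x))

  // Group by key, then sort each group's rows by the selected field.
  val sortedWithinKey: Map[Int, List[Vector[Int]]] =
    b.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2).sortBy(_(indexToSelect))) }

  // Example of processing each key's rows in sorted order, independently of other keys.
  val scoresPerKey: Map[Int, List[Int]] =
    sortedWithinKey.map { case (k, rows) => (k, rows.map(_(indexToSelect))) }

  def main(args: Array[String]): Unit = scoresPerKey.foreach(println)
}
```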