.contains giving empty string in rdd - scala

I have an array of ids called id. I have an RDD called r which has a field called idval that might match some ids in the id array. I want to get only the rows whose idval is in this array. I am using
val new_r = r.filter(x => id.contains(x.idval))
But, when I go to do
new_r.take(10).foreach(println)
I get a NumberFormatException: empty String
Does contains include empty strings?
Here is an example of lines in the RDD:
idval,part,date,sign
1,'leg',2011-01-01,1.0
18,'arm',2013-01-01,1.0
6, 'nose', 2011-01-01,1.0
I have a separate array with ids such as [1,3,4,5,18,...] and I want to extract the rows of the RDD above whose idval is in ids.
So filtering this should give me
idval,part,date,sign
1,'leg',2011-01-01,1.0
18,'arm',2013-01-01,1.0
as idval 1 and 18 are in the array above.
The problem is that I get this empty string error when I go to foreach(println) the rows in the new filtered RDD.
The RDD is loaded from a csv file (loadFromUrl) and then it is mapped:
val r1 = rdd.map(s => s.split(","))
val r2 = r1.map(p => Event(p(0), p(1), dateFormat.parse(p(2).asInstanceOf[String]), p(3).toDouble))
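Given the error message, one likely cause is that the header line (idval,part,date,sign) or a blank/short line goes through the same split and parse, so p(3).toDouble ends up seeing an empty string. Below is a minimal sketch that guards against that before filtering on the id array; Event, dateFormat and the id array are assumed from the question, and id must have the same element type as idval for contains to match:
// Sketch only: names follow the question; adjust types as needed
val header = rdd.first()
val parsed = rdd
  .filter(line => line != header && line.trim.nonEmpty)   // drop the header and blank lines
  .map(_.split(","))
  .filter(p => p.length >= 4 && p(3).trim.nonEmpty)        // guard against short rows and empty fields
  .map(p => Event(p(0).trim, p(1).trim, dateFormat.parse(p(2).trim), p(3).trim.toDouble))

val new_r = parsed.filter(e => id.contains(e.idval))       // keep rows whose idval is in the id array
new_r.take(10).foreach(println)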

Related

How to get column values from list which contains column names in spark scala dataframe

I have a config defined which contains a list of columns for each table to be used as a dedup key,
for example:
config 1:
val lst = List("section_xid", "learner_xid")
These are the columns that need to be used as dedup keys. This list is dynamic: some tables will have 1 value in it, some will have 2 or 3.
What I am trying to do is build a single key column from this list:
df.withColumn("dedup_key_sk", uuid(md5(concat($"${lst(0)}", $"${lst(1)}"))))
How do I make this dynamic so that it works for any number of columns in the list?
I tried doing this:
df.withColumn("dedup_key_sk", concat(Seq($"col1", $"col2"): _*))
For this to work I had to convert the list to a DataFrame, with each value of the list in a separate column, and I was not able to figure that out.
I also tried this, but it didn't work:
val res = sc.parallelize(List((lst))).toDF
Any input here will be appreciated. Thank you.
The list of strings can be mapped to a list of columns (using functions.col). This list of columns can then be used with concat:
import org.apache.spark.sql.functions.{col, concat}

val lst: List[String] = List("section_xid", "learner_xid")
df.withColumn("dedup_key_sk", concat(lst.map(col): _*)).show()

Spark 2.3: Flatten Array of Array of Structs, and create new columns

I have a dataframe with a column ids that looks like
ids
WrappedArray(WrappedArray([item1,micro], [item3, mini]), WrappedArray([item2,macro]))
WrappedArray(WrappedArray([item1,micro]), WrappedArray([item5,micro], [item6,macro]))
where the exact type of the column is
StructField(ids,ArrayType(ArrayType(StructType(StructField(identifier,StringType,true), StructField(identifierType,StringType,true)),true),true),true)
I want to create two new columns: one holding the values of all of the distinct identifiers in the structs, and another holding the most frequently observed identifierType for that row (if there are ties, then return all the ties).
So in our example, I would like the output to be
list_of_identifiers, most_frequent_type
Array(item1, item2, item3), [micro, mini, macro]
Array(item1, item5, item6), [micro]
To achieve this, the first step I need to do is flatten the ids column to something like
ids
WrappedArray([item1,micro], [item3, mini], [item2,macro])
WrappedArray([item1,micro], [item5,micro], [item6,macro])
but I cannot figure out the way to do this.
Here is a sample input table:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val arrayStructData = Seq(
  Row(List(List(Row("item1", "micro"), Row("item3", "mini")), List(Row("item2", "macro")))),
  Row(List(List(Row("item1", "micro")), List(Row("item5", "micro"), Row("item6", "macro"))))
)
val arrayStructSchema = new StructType()
  .add("ids", ArrayType(ArrayType(new StructType()
    .add("identifier", StringType)
    .add("identifierType", StringType))))
val df = spark.createDataFrame(spark.sparkContext
  .parallelize(arrayStructData), arrayStructSchema)
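There is no built-in flatten function before Spark 2.4, so one hedged sketch of the flattening step is a double explode followed by a collect_list per row; the row_id column is a helper added here so rows can be regrouped, and the order of elements inside the rebuilt array is not guaranteed:
import org.apache.spark.sql.functions._
import spark.implicits._

val flattened = df
  .withColumn("row_id", monotonically_increasing_id())   // helper id so rows can be regrouped
  .withColumn("inner", explode($"ids"))                   // Array[Array[Struct]] -> Array[Struct]
  .withColumn("s", explode($"inner"))                     // Array[Struct] -> one struct per row
  .groupBy("row_id")
  .agg(collect_list($"s").as("ids"))                      // rebuild a single flat array of structs

flattened.show(false)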

How to pick rows from a dataframe by comparing with a hashmap

I have two dataframes,
df1
id slt sln elt eln start end
df2
id evt slt sln speed detector
Hashmap
Map(351608084643945 -> List(1544497916,1544497916), 351608084643944 -> List(1544498103,1544498093))
I want to compare the values in the list, and if the two values in the list match, then I want the full row from dataframe df1 for that id;
otherwise, the full row from df2 for that id.
Both dataframes and the map have distinct, unique ids.
If I understand correctly, you want to traverse your hash map and, for each entry, check whether all the values in the list are the same. If they are, you want the data from df1 for that key, otherwise from df2. If that is what you want, then the code below does that.
hashMap.foreach(x => {
  val key = x._1.toString
  val valueElements = x._2.toList
  if (valueElements.forall(_ == valueElements.head)) {
    // all values in the list match -> take the row from df1
    df1.filter($"id".equalTo(key))
  } else {
    // otherwise take the row from df2
    df2.filter($"id".equalTo(key))
  }
  // note: filter is lazy; collect or otherwise use each filtered DataFrame here
})
Two steps:
Step one: split the hashmap into two hashmaps, one with the matched entries and the other with the unmatched entries.
Step two: join the matched hashmap with df1 on id to get the matched rows from df1, and join the unmatched hashmap with df2 on id to get the remaining rows from df2.
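A minimal sketch of that two-step approach (hashMap, df1 and df2 follow the question; toDF needs spark.implicits._, and the id types may need a cast before the join):
import spark.implicits._

// Step one: split the map by whether all values in the list are equal
val (matched, unmatched) = hashMap.partition { case (_, values) => values.forall(_ == values.head) }

// Step two: turn each key set into a one-column DataFrame and join on id
val matchedIds = matched.keys.toSeq.toDF("id")
val unmatchedIds = unmatched.keys.toSeq.toDF("id")

val rowsFromDf1 = df1.join(matchedIds, Seq("id"))
val rowsFromDf2 = df2.join(unmatchedIds, Seq("id"))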

Iterate through all rows returned from a Scala Anorm query

I have a small Anorm query which is returning all the rows in the Service Messages table in my database. I would eventually like to turn each of these rows into JSON.
However, currently all I am doing is iterating through the elements of the first row with the .map function. How could I iterate through all rows so I can manipulate them and turn them into a JSON object?
val result = DB.withConnection("my-db") { implicit connection =>
  val messagesRaw = SQL("""
    SELECT *
    FROM ServiceMessages
  """).apply;
  messagesRaw.map(row =>
    println(row[String]("title"))
  )
}
Actually, what you do IS iterate over all the rows (not only the first one), taking the contents of the title column from each row.
In order to collect all titles you need the following trivial modification:
val titles = messagesRaw.map(row =>
  row[String]("title")
)
Converting them to JSON (an array) is simple as well:
import play.api.libs.json._
...
Ok(Json.toJson(titles))
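If each row should become a JSON object rather than just a title string, the same map can build one JsObject per row. A sketch, where the body column is only an assumed example of a second column in ServiceMessages:
import play.api.libs.json._

val jsonRows = DB.withConnection("my-db") { implicit connection =>
  SQL("SELECT * FROM ServiceMessages")().map { row =>
    Json.obj(
      "title" -> row[String]("title"),
      "body" -> row[String]("body") // hypothetical column, adjust to the real schema
    )
  }.toList
}
Ok(Json.toJson(jsonRows))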

Filter columns having count equal to the input file rdd Spark

I'm filtering integer columns from the input parquet file with the logic below, and I have been trying to modify this logic to add an additional validation: check whether any of the input columns has a count equal to the input parquet file RDD count. I want to filter out any such column.
Update
The number of columns and their names in the input file will not be static; they will change every time we get the file.
The objective is to also filter out any column whose count is equal to the input file RDD count. Filtering integer columns is already achieved with the logic below.
e.g. input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
// Get the array of integer StructFields
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer"))
// Select only those columns
val z = df.select(columns.map(x => col(x.name)): _*)
// Get the array of column names
val m = z.columns
The new logic would be something like
val cnt = spark.read.parquet("inputfile").count()
val d = z.columns where the column count is not equal to cnt // pseudocode
I do not want to pass the column name explicitly to the new condition, since the column whose count equals the input file count will change (val d = .. above).
How do we write the logic for this?
According to my understanding of your question, you are trying to keep the columns with integer as dataType whose distinct count is not equal to the row count of another input parquet file. If my understanding is correct, you can add the column count check to your existing filter as
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code should follow as it is.
I hope the answer is helpful.
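As a side note, df.select(x.name).distinct().count() launches one Spark job per column; a hedged alternative sketch computes every distinct count in a single aggregation with countDistinct and then keeps the columns whose count differs from cnt:
import org.apache.spark.sql.functions.{col, countDistinct}

val intCols = df.schema.fields
  .filter(_.dataType.typeName.contains("integer"))
  .map(_.name)

// One pass over df: a distinct count for every integer column at once
val distinctCounts = df
  .select(intCols.map(c => countDistinct(col(c)).as(c)): _*)
  .first()

val keep = intCols.filter(c => distinctCounts.getAs[Long](c) != cnt)
val z = df.select(keep.map(col): _*)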
Jeanr and Ramesh suggested the right approach; here is what I did to get the desired output, and it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME") < cnt)