I have two dataframes:
df1
id slt sln elt eln start end
df2
id evt slt sln speed detector
Hashmap
Map(351608084643945 -> List(1544497916,1544497916), 351608084643944 -> List(1544498103,1544498093))
I want to compare the values in the list; if the two values in the list match, then I want the full row from df1 for that id.
Otherwise, I want the full row from df2 for that id.
Both dataframes and the map have distinct, unique ids.
If I understand correctly, you want to traverse your hash map and, for each entry, check whether all the values in the list are the same. If they are, you want the data from df1 for that key, otherwise from df2. If that is what you want, the code below does it.
hashMap.foreach { case (key, values) =>
  // filter is lazy and returns a new DataFrame; act on the result
  // (show, collect, ...) or it is silently discarded
  if (values.forall(_ == values.head)) {
    df1.filter($"id" === key).show()
  } else {
    df2.filter($"id" === key).show()
  }
}
Two steps:
Step One: Split the hashmap into two hashmaps: one holding the matched entries, the other the unmatched entries.
Step Two: Join the matched hashmap with df1 on id to get the matched rows, and join the unmatched hashmap with df2 on id to get the unmatched rows.
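A minimal sketch of those two steps (spark, hashMap, df1, df2 and the id column are all taken from the question):

import spark.implicits._

// Step one: split the map by whether all values in the list are equal
val (matched, unmatched) = hashMap.partition { case (_, values) =>
  values.forall(_ == values.head)
}

// Step two: turn each key set into a one-column dataframe and join on id
val matchedRows   = df1.join(matched.keys.toSeq.toDF("id"), Seq("id"))
val unmatchedRows = df2.join(unmatched.keys.toSeq.toDF("id"), Seq("id"))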
I have a dataframe with a column ids that looks like
ids
WrappedArray(WrappedArray([item1,micro], [item3, mini]), WrappedArray([item2,macro]))
WrappedArray(WrappedArray([item1,micro]), WrappedArray([item5,micro], [item6,macro]))
where the exact type of the column is
StructField(ids,ArrayType(ArrayType(StructType(StructField(identifier,StringType,true), StructField(identifierType,StringType,true)),true),true),true)
I want to create two new columns: one holding all of the distinct identifier values from the structs, and another holding the most frequently observed identifierType for that row (if there are ties, then return all the ties).
So in our example, I would like the output to be
list_of_identifiers, most_frequent_type
Array(item1, item2, item3), [micro, mini, macro]
Array(item1, item5, item6), [micro]
To achieve this, the first step I need to do is flatten the ids column to something like
ids
WrappedArray([item1,micro], [item3, mini], [item2,macro])
WrappedArray([item1,micro], [item5,micro], [item6,macro])
but I cannot figure out how to do this.
Here is a sample input table:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val arrayStructData = Seq(
  Row(List(List(Row("item1", "micro"), Row("item3", "mini")), List(Row("item2", "macro")))),
  Row(List(List(Row("item1", "micro")), List(Row("item5", "micro"), Row("item6", "macro"))))
)
val arrayStructSchema = new StructType()
  .add("ids", ArrayType(ArrayType(new StructType()
    .add("identifier", StringType)
    .add("identifierType", StringType))))
val df = spark.createDataFrame(spark.sparkContext
  .parallelize(arrayStructData), arrayStructSchema)
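From there, a sketch of the flattening step, assuming Spark 2.4+ (flatten, transform and array_distinct were added in that version):

import org.apache.spark.sql.functions._

// Collapse the outer array so each row holds a single array of structs
val flatDf = df.withColumn("ids", flatten(col("ids")))

// The distinct identifiers can then be pulled out of the structs
val withIds = flatDf.withColumn("list_of_identifiers",
  array_distinct(expr("transform(ids, x -> x.identifier)")))
withIds.show(false)

The most_frequent_type column would still need a separate group-count, for example after exploding the flattened array.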
I have a dataframe with a column whose values look like "COR//xxxxxx-xx-xxxx" or "xxxxxx-xx-xxxx".
I need to compare this column with a column in a different dataframe, based on the column value.
If the column value has the form "COR//xxxxx-xx-xxxx", I need to use substring($"column", 4, length($"column")).
If the column value has the form "xxxxx-xx-xxxx", I can compare directly without using substring.
For example:
val DF1 = DF2.join(DF3, upper(trim($"column1".substr(4, length($"column1")))) === upper(trim(DF3("column1"))))
I am not sure how to add the condition while joining. Could anyone please let me know how we can achieve this with Spark dataframes?
You can try adding a new column based on the conditions and join on the new column. Something like this.
import org.apache.spark.sql.functions._
import spark.implicits._

val data = List("COR//xxxxx-xx-xxxx", "xxxxx-xx-xxxx")
val DF2 = spark.sparkContext.parallelize(data).toDF("column1")

// Strip the leading "COR//" (5 characters) when present
val DF4 = DF2.withColumn("joinCol", when(col("column1").like("%COR%"),
  expr("substring(column1, 6, length(column1))")).otherwise(col("column1")))
DF4.show(false)
The new column will have values like this.
+------------------+-------------+
|column1 |joinCol |
+------------------+-------------+
|COR//xxxxx-xx-xxxx|xxxxx-xx-xxxx|
|xxxxx-xx-xxxx |xxxxx-xx-xxxx|
+------------------+-------------+
You can now join based on the new column added.
val DF1 = DF4.join(DF3, upper(trim(DF4("joinCol"))) === upper(trim(DF3("column1"))))
Hope this helps.
Simply create a new column to use in the join:
val DF2b = DF2.withColumn("column2",
  when($"column1" rlike "COR//.*",
    $"column1".substr(lit(4), length($"column1"))
  ).otherwise($"column1"))
Then use column2 in the join. It is also possible to add the whole when clause directly in the join but it would look very messy.
Note that to use a constant value in substr you need to use lit. And if you want to remove the whole "COR//" part, use 6 instead of 4.
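For completeness, the join then uses column2 (DF3 and its column1 are from the question):

val DF1 = DF2b.join(DF3, upper(trim(DF2b("column2"))) === upper(trim(DF3("column1"))))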
The element in clickRdd is (h5id,[query]), where h5id is a long number and query is a string; the element in revealRdd is (h5id, [0:id, 1:query, 2:q0, 3:q1, 4:q2, 5:q3, 6:s0, 7:s1, 8:s2, 9:s3] ).
What is the result of clickJoin = clickRdd.join(revealRdd)? I guess the join key is h5id.
Can anyone tell me what the content looks like after joining?
The joined RDD keeps h5id as the key, with the values from both RDDs paired in a tuple:
clickJoin.take(1)
[(h5id,([query],[0:id, 1:query, 2:q0, 3:q1, 4:q2, 5:q3, 6:s0, 7:s1, 8:s2, 9:s3]))]
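A minimal runnable sketch with made-up values to show the shape of the result:

val clickRdd  = spark.sparkContext.parallelize(Seq((42L, List("query"))))
val revealRdd = spark.sparkContext.parallelize(Seq((42L, List("id", "query", "q0", "q1"))))

// join pairs up both RDDs' values per matching key (the h5id)
val clickJoin = clickRdd.join(revealRdd)  // RDD[(Long, (List[String], List[String]))]
clickJoin.take(1)
// Array((42,(List(query),List(id, query, q0, q1))))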
I have an array of ids called ids. I have an RDD called r which has a field called idval that may match some of the ids in the array. I want to get only the rows whose idval is in the array. I am using
val new_r = r.filter(x => ids.contains(x.idval))
But, when I go to do
new_r.take(10).foreach(println)
I get a NumberFormatException: empty String
Does contains include empty strings?
Here is an example of lines in the RDD:
idval,part,date,sign
1,'leg',2011-01-01,1.0
18,'arm',2013-01-01,1.0
6, 'nose', 2011-01-01,1.0
I have a separate array with ids such as [1,3,4,5,18,...], and I want to extract the rows of the RDD above whose idval is in ids.
So filtering this should give me
idval,part,date,sign
1,'leg',2011-01-01,1.0
18,'arm',2013-01-01,1.0
as idval 1 and 18 are in the array above.
The problem is that I am getting this empty-string error when I foreach(println) the rows of the new filtered RDD.
The RDD is loaded from a CSV file (loadFromUrl) and then mapped:
val r1 = rdd.map(s => s.split(","))
val r2 = r1.map(p => Event(p(0), p(1), dateFormat.parse(p(2)), p(3).toDouble))
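The empty-string failure most likely comes from this parsing step rather than from contains itself: a blank or short line leaves p(3) empty, and .toDouble then throws. A defensive sketch (Event, dateFormat and the ids array are from the question):

// Drop blank or short rows before parsing
val r1 = rdd.map(_.split(",")).filter(p => p.length >= 4 && p(3).trim.nonEmpty)
val r2 = r1.map(p => Event(p(0), p(1), dateFormat.parse(p(2)), p(3).trim.toDouble))

// idval is parsed as a String, so compare against String ids
val idStrings = ids.map(_.toString)
val new_r = r2.filter(x => idStrings.contains(x.idval))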
I have a dataframe (df1) with 50 columns; the first is cust_id and the rest are features. I also have another dataframe (df2) which contains only cust_id. I'd like to add one record per customer in df2 to df1, with all the features set to 0. But since the two dataframes have different schemas, I cannot do a union. What is the best way to do that?
I tried a full outer join, but it generates two cust_id columns and I need only one. I would have to merge these two cust_id columns somehow, but I don't know how.
You can try to achieve something like that by doing a full outer join like the following:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
import org.apache.spark.sql.functions.{col, lit}

val features = df1.columns.filter(_ != "cust_id")  // keep df1's column order
val newDF = features.foldLeft(df2)(
  (df, colName) => df.withColumn(colName, lit(0))
).select(df1.columns.map(col): _*)  // align columns positionally with df1

df1.union(newDF)  // unionAll is deprecated since Spark 2.0
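If you instead keep the explicit full outer join from the question and end up with two cust_id columns, coalesce can merge them into one (a sketch using the question's names):

import org.apache.spark.sql.functions.coalesce

val joined = df1.join(df2, df1("cust_id") === df2("cust_id"), "full_outer")
val merged = joined
  .withColumn("cid", coalesce(df1("cust_id"), df2("cust_id")))  // whichever side is non-null
  .drop(df1("cust_id"))
  .drop(df2("cust_id"))
  .withColumnRenamed("cid", "cust_id")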