I have a dataframe with a map column. I want to collect the non-null keys into a new column.
You can use map_filter to keep the non-null entries and then map_keys, which returns the remaining keys as an array.
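The map_filter + map_keys combination can be mimicked on a plain Python dict to show what it computes; this is a sketch of the logic, not the Spark API itself (the sample map is made up, and the Spark expression in the comment assumes PySpark):

```python
# Plain-Python analogue of Spark's map_filter followed by map_keys.
# In PySpark this would look roughly like:
#   df.withColumn("keys", map_keys(map_filter(col("m"), lambda k, v: v.isNotNull())))
def non_null_keys(m):
    filtered = {k: v for k, v in m.items() if v is not None}  # map_filter
    return sorted(filtered.keys())                            # map_keys

row = {"a": 1, "b": None, "c": 3}
print(non_null_keys(row))  # ['a', 'c']
```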
I have a config defined which contains a list of columns for each table to be used as a dedup key,
for example:
config 1:
val lst = List(section_xid, learner_xid)
These are the columns that need to be used as the dedup key. The list is dynamic: some tables will have 1 value in it, others 2 or 3.
What I am trying to do is build a single key column from this list:
df.withColumn("dedup_key_sk", uuid(md5(concat($"lst(0)", $"lst(1)"))))
How do I make this dynamic so that it works for any number of columns in the list?
I tried doing this:
df.withColumn("dedup_key_sk", concat(Seq($"col1", $"col2"):_*))
For this to work I had to convert the list to a DataFrame, with each value of the list in a separate column, which I was not able to figure out.
I also tried this, but it didn't work:
val res = sc.parallelize(List((lst))).toDF
Any input here will be appreciated. Thank you.
The list of strings can be mapped to a list of columns (using functions.col). This list of columns can then be used with concat:
val lst: List[String] = List("section_xid", "learner_xid")
df.withColumn("dedup_key_sk", concat(lst.map(col):_*)).show()
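The same idea, a dynamic list of names mapped to values and then concatenated, can be sketched in plain Python (the sample row is made up):

```python
# Plain-Python analogue of concat(lst.map(col): _*): map a dynamic list
# of column names to their values, then concatenate the results.
def dedup_key(row, key_cols):
    return "".join(str(row[c]) for c in key_cols)

lst = ["section_xid", "learner_xid"]  # dynamic; may hold 1..n names
row = {"section_xid": "S1", "learner_xid": "L7", "other": 42}
print(dedup_key(row, lst))  # S1L7
```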
I have the following list:
columns = [('url','string'),('count','bigint'),('isindex','boolean')]
I want to add these columns to my df with empty values:
for column in columns:
    df = df.withColumn(column[0], f.lit(?).cast(?))
I am not sure what I need to put in the lit function and in the cast in order to get a suitable empty value for each type.
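The thread does not include an answer, but the usual PySpark idiom (an assumption here, not something stated above) is f.lit(None).cast(dtype), since a null literal can be cast to any type. The column-loop pattern can be mimicked on plain row dicts:

```python
# Sketch of adding typed "empty" columns. In real PySpark the loop body
# would be: df = df.withColumn(name, f.lit(None).cast(dtype))
# Here the same loop is mimicked on a list of row dicts; the dtype is
# unused because plain Python has no typed nulls.
columns = [("url", "string"), ("count", "bigint"), ("isindex", "boolean")]

def add_empty_columns(rows, columns):
    for name, _dtype in columns:        # dtype only matters in real Spark
        for row in rows:
            row.setdefault(name, None)  # "empty" value for every type
    return rows

rows = [{"id": 1}, {"id": 2}]
print(add_empty_columns(rows, columns))
```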
Thank you!
I have two dataframes,
df1
id slt sln elt eln start end
df2
id evt slt sln speed detector
Hashmap
Map(351608084643945 -> List(1544497916,1544497916), 351608084643944 -> List(1544498103,1544498093))
I want to compare the values in the list: if the two values in the list match, I want the full row from df1 for that id;
otherwise, the full row from df2 for that id.
Both the dataframes and maps will have distinct and unique id.
If I understand correctly, you want to traverse your hash map and, for each entry, check whether all values in the list are the same. If they are, you want the data from df1 for that key, otherwise from df2. If so, the code below does that.
// Partition the ids by whether all values in the list are equal,
// then select the matching rows from df1 / df2 in one filter each.
// (A foreach that only calls df.filter would discard the results.)
val (matched, unmatched) = hashMap.partition { case (_, values) =>
  values.forall(_ == values.head)
}
val rowsFromDf1 = df1.filter($"id".isin(matched.keys.toSeq.map(_.toString): _*))
val rowsFromDf2 = df2.filter($"id".isin(unmatched.keys.toSeq.map(_.toString): _*))
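The all-values-equal check and the per-id row selection can be mimicked without Spark (the row payloads here are made up placeholders):

```python
# Plain-Python analogue: for each id, if all values in its list match,
# take the row from df1, otherwise the row from df2.
hash_map = {
    351608084643945: [1544497916, 1544497916],  # values match  -> df1
    351608084643944: [1544498103, 1544498093],  # values differ -> df2
}
df1 = {351608084643945: "row-from-df1", 351608084643944: "row-from-df1b"}
df2 = {351608084643945: "row-from-df2", 351608084643944: "row-from-df2b"}

result = {}
for key, values in hash_map.items():
    source = df1 if all(v == values[0] for v in values) else df2
    result[key] = source[key]

print(result[351608084643945])  # row-from-df1
print(result[351608084643944])  # row-from-df2b
```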
Two steps:
Step one: split the hashmap into two hashmaps: one with the entries whose list values match, the other with the entries whose values do not.
Step two: join the matched hashmap with df1 on id to get the matched rows from df1, and join the unmatched hashmap with df2 on id to get the rows from df2.
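The two-step split-then-join can be sketched the same way (again with made-up row payloads; the dict lookups stand in for the joins on id):

```python
# Step one: partition the hashmap into matched / unmatched ids.
hash_map = {
    351608084643945: [1544497916, 1544497916],
    351608084643944: [1544498103, 1544498093],
}
matched   = {k: v for k, v in hash_map.items() if len(set(v)) == 1}
unmatched = {k: v for k, v in hash_map.items() if len(set(v)) > 1}

# Step two: "join" each half with the corresponding dataframe on id.
df1 = {351608084643945: {"slt": 1, "start": 10}}
df2 = {351608084643944: {"evt": 2, "speed": 30}}
rows = [df1[k] for k in matched] + [df2[k] for k in unmatched]
print(rows)
```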
The elements in clickRdd are (h5id, [query]), where h5id is a long and query is a string; the elements in revealRdd are (h5id, [0:id, 1:query, 2:q0, 3:q1, 4:q2, 5:q3, 6:s0, 7:s1, 8:s2, 9:s3]).
What is the result of clickJoin = clickRdd.join(revealRdd)? I guess the join key is h5id.
Can anyone give me the content after joining?
The joined RDD is keyed by h5id, with the values from both RDDs paired in a tuple:
clickJoin.take(1)
[(h5id, ([query], [0:id, 1:query, 2:q0, 3:q1, 4:q2, 5:q3, 6:s0, 7:s1, 8:s2, 9:s3]))]
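The inner-join semantics of RDD.join can be reproduced with plain dicts (the key 1001 and the value lists are placeholders):

```python
# Simulating clickRdd.join(revealRdd): an inner join on h5id that pairs
# the two values into a tuple, keyed by h5id. Keys present in only one
# side (1002 below) are dropped, as in an inner join.
click_rdd  = {1001: ["query"]}
reveal_rdd = {1001: ["id", "query", "q0", "q1"], 1002: ["other"]}

click_join = [(k, (click_rdd[k], reveal_rdd[k]))
              for k in click_rdd if k in reveal_rdd]
print(click_join)  # [(1001, (['query'], ['id', 'query', 'q0', 'q1']))]
```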
I have a log file with a data as the following:
1,2008-10-23 16:05:05.0,\N,Donald,Becton,2275 Washburn Street,Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
2,2008-11-12 03:00:01.0,\N,Donna,Jones,3885 Elliott Street,San Francisco,CA,94171,4150835799,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
I need to create a pair RDD with the postal code as the key and a list of names (Last Name,First Name) in that postal code as the value.
I need to use mapValues and I did the following:
val namesByPCode = accountsdata.keyBy(line => line.split(',')(8)).mapValues(fields => (fields(0), (fields(4), fields(5)))).collect()
but I'm getting an error. Can someone tell me what is wrong with my statement?
keyBy doesn't change the value, so the value stays a single "unsplit" string. You want to first use map to perform the split (to get an RDD[Array[String]]), and then use keyBy and mapValues as you did on the split result:
val namesByPCode = accountsdata.map(_.split(","))
.keyBy(_(8))
.mapValues(fields => (fields(0), (fields(4), fields(5))))
.collect()
BTW, per your description it sounds like you'd also want to call groupByKey on this result (before calling collect), so that each zipcode becomes a single record with a list of names. keyBy doesn't perform any grouping; it just turns an RDD[V] into an RDD[(K, V)], leaving each record separate (potentially with many records sharing the same key).
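The whole split / keyBy / mapValues / groupByKey pipeline can be mimicked on the two sample log lines. Note that the thread's code keeps the asker's (fields(4), fields(5)); for the stated goal of (Last Name, First Name), the indices on this sample data would be fields 4 and 3, which is what this sketch uses:

```python
# Plain-Python analogue of: map(_.split(",")) -> keyBy(_(8))
#                           -> mapValues(...) -> groupByKey
lines = [
    "1,2008-10-23 16:05:05.0,\\N,Donald,Becton,2275 Washburn Street,Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0",
    "2,2008-11-12 03:00:01.0,\\N,Donna,Jones,3885 Elliott Street,San Francisco,CA,94171,4150835799,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0",
]
pairs = []
for line in lines:
    fields = line.split(",")                     # map(_.split(","))
    key = fields[8]                              # keyBy(_(8)): postal code
    pairs.append((key, (fields[4], fields[3])))  # (last name, first name)

grouped = {}                                     # groupByKey
for k, v in pairs:
    grouped.setdefault(k, []).append(v)

print(grouped["94660"])  # [('Becton', 'Donald')]
```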