PySpark dataframe column to list - pyspark

I am trying to extract the list of column values from a dataframe into a list
+------+----------+------------+
|sno_id|updt_dt |process_flag|
+------+----------+------------+
| 123 |01-01-2020| Y |
+------+----------+------------+
| 234 |01-01-2020| Y |
+------+----------+------------+
| 512 |01-01-2020| Y |
+------+----------+------------+
| 111 |01-01-2020| Y |
+------+----------+------------+
Output should be the list of sno_id ['123','234','512','111']
Then I need to iterate over the list to run some logic on each of the list values. I am currently using HiveWarehouseSession to fetch data from a Hive table into a DataFrame with hive.executeQuery(query).

This is fairly easy: first collect the DataFrame, which returns a list of Row objects:
row_list = df.select('sno_id').collect()
Then iterate over the rows to turn the column into a list:
sno_id_array = [row.sno_id for row in row_list]
sno_id_array
['123','234','512','111']
A more compact alternative uses flatMap on the underlying RDD:
sno_id_array = df.select("sno_id").rdd.flatMap(lambda x: x).collect()

You could use toLocalIterator() to create a generator over the column.
Since you want to loop over the results afterwards, this may be more efficient in your case.
With a generator you don't build and store the whole list first; you apply your logic to each row as you iterate:
sno_ids = df.select('sno_id').toLocalIterator()
for row in sno_ids:
    sno_id = row.sno_id
    # continue with your logic
    ...
Alternative one-liner using a generator expression:
sno_ids = (row.sno_id for row in df.select('sno_id').toLocalIterator())
for sno_id in sno_ids:
    ...
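One way to keep the per-value logic independent of how the rows are fetched is to write it against any iterable of Row-like objects. The sketch below is only an illustration: it uses a namedtuple as a stand-in for pyspark.sql.Row, and fake_local_iterator is a hypothetical substitute for df.select('sno_id').toLocalIterator(), so that it runs without a Spark session.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the sketch runs without Spark (assumption).
Row = namedtuple("Row", ["sno_id"])


def iter_sno_ids(rows):
    """Yield sno_id values lazily from any iterable of Row-like objects."""
    for row in rows:
        yield row.sno_id


def fake_local_iterator():
    """Hypothetical substitute for df.select('sno_id').toLocalIterator()."""
    for value in ["123", "234", "512", "111"]:
        yield Row(sno_id=value)


processed = []
for sno_id in iter_sno_ids(fake_local_iterator()):
    processed.append(sno_id)  # replace with your per-value logic

print(processed)
```

With a real DataFrame, the same iter_sno_ids function should accept either the list returned by collect() or the iterator returned by toLocalIterator().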

Related

Pyspark create multiple columns from dictionary column

from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType, MapType, StringType
@udf(returnType=MapType(StringType(), FloatType()))
def postprocess(data):
    ret = dict()
    ...
    # insert key and values to dictionary from data
    ...
    return ret
ret = postprocess(col('data'))
print(ret) # Column<'postprocess(data)'>
I would like to create multiple columns from dictionary column.
If ret has {"key1": 0.1, "key2": 0.3}, the result should be
| key1 | key2 |
| 0.1 | 0.3 |
How can I create it?
To achieve your goal, you can start from explode(): applied to a map column, it emits one row per entry, with key and value columns, which you can then pivot into one column per key. If the keys are known in advance, you can also select them directly, e.g. ret['key1'] and ret['key2']. Details: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.explode.html
For performance, and depending on how complicated your UDF is, I think you should build the columns with built-in Spark SQL functions instead of a Python UDF if that is possible. You can check this post: https://stackoverflow.com/a/38297050/10445333

PySpark map multiple columns to 1 'dict' column containing all values without using df.collect()

I currently have multiple columns (at least 500) in my DataFrame starting with any of the following prefixes ['a_', 'b_', 'c_'].
I want to have a DataFrame with only 3 columns
# +++++++++++++++++++++
# a | b | c |
# +++++++++++++++++++++
# {'a_1': 'a_1_value', 'a_2': 'a_2_value'} | {} | {'c_1': 'c_1_value', 'c_2': 'c_2_value'}|
Calling df.collect() causes StackOverflowErrors in the framework I'm using because the DataFrame is pretty large. I'm trying to leverage the map functions to avoid loading the DataFrame in the driver (hence the constraint)
Something like this?
Use struct to combine any columns with a certain prefix into 1 column, then use to_json to form the struct into key-value pair shape.
from pyspark.sql import functions as F
cols = ['a', 'b', 'c']
df.select([
    F.to_json(F.struct(*[x for x in df.columns if x.startswith(f'{col}_')])).alias(col)
    for col in cols
])
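The column selection inside that comprehension can be checked on its own before running it on the cluster. Below is a minimal plain-Python sketch of the prefix grouping; group_by_prefix and the sample column names are hypothetical and only for illustration.

```python
def group_by_prefix(columns, prefixes):
    """Map each prefix to the column names that start with '<prefix>_'."""
    return {p: [c for c in columns if c.startswith(f"{p}_")] for p in prefixes}


# Sample column names (assumed); with a real DataFrame this would be df.columns.
columns = ["a_1", "a_2", "c_1", "c_2", "id"]
groups = group_by_prefix(columns, ["a", "b", "c"])
print(groups)
```

Note that a prefix with no matching columns ('b' here) yields an empty list, so it may be worth deciding how such an empty group should be represented before passing it to F.struct.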

Reshaping RDD from an array of array to unique columns in pySpark

I want to use PySpark to restructure my data so that I can use it for MLlib models. Currently, for each user I have an array of arrays in one column, and I want to convert it to unique columns with the count.
Users | column1 |
user1 | [[name1, 4], [name2, 5]] |
user2 | [[name1, 2], [name3, 1]] |
should get converted to:
Users | name1 | name2 | name3 |
user1 | 4.0 | 5.0 | 0.0 |
user2 | 2.0 | 0.0 | 1.0 |
I came up with a method that uses for loops but I am looking for a way that can utilize spark because the data is huge. Could you give me any hints? Thanks.
Edit:
All of the unique names should come as individual columns with the score corresponding to each user. Basically, a sparse matrix.
I am working with pandas right now and the code I'm using to do this is
data = data.applymap(lambda x: dict(x)) # To convert the array of arrays into a dictionary
columns = list(data)
for i in columns:
    # For each column, use the dictionary to make a new Series and append it to the current dataframe
    data = pd.concat([data.drop([i], axis=1), data[i].apply(pd.Series)], axis=1)
Figured out the answer:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
# First explode column1, so that each [name, count] pair becomes a separate row
df = df.withColumn('column1', F.explode_outer(F.col('column1')))
# Then separate the exploded column1 into two columns: the name and its count
df = df.withColumn("column1_seperated", F.col('column1')[0])
df = df.withColumn("count", F.col('column1')[1].cast(IntegerType()))
# Then pivot the df
df = df.groupby('Users').pivot("column1_seperated").sum('count')
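To sanity-check the pivot on a small sample, the same explode-then-pivot reshaping can be written in plain Python. This is only a sketch: pivot_counts is a hypothetical helper, and the list of (user, pairs) tuples is a stand-in for the DataFrame. It fills missing cells with 0 to match the desired output in the question; a Spark pivot leaves them null, which fillna(0) should cover.

```python
def pivot_counts(rows):
    """rows: list of (user, [[name, count], ...]) pairs, a stand-in for the DataFrame."""
    # "explode": one (user, name, count) triple per inner pair
    exploded = [(user, name, int(count)) for user, pairs in rows for name, count in pairs]
    # pivot columns: every distinct name, sorted for a stable order
    names = sorted({name for _, name, _ in exploded})
    # "pivot + sum": start from zeros, then add each count into its cell
    table = {user: dict.fromkeys(names, 0) for user, _ in rows}
    for user, name, count in exploded:
        table[user][name] += count
    return names, table


rows = [
    ("user1", [["name1", 4], ["name2", 5]]),
    ("user2", [["name1", 2], ["name3", 1]]),
]
names, table = pivot_counts(rows)
for user, cells in table.items():
    print(user, [cells[n] for n in names])
```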

Reading a csv file into PySpark that contains the key:value pairing, such that key becomes the column and value is the data of it

I am a beginner of Spark. Please help me out with a solution.
The csv file contains text in the form of key:value pairs delimited by commas. In some lines, some keys (or columns) may be missing.
I have loaded this file into a single column of a dataframe. I want to split these keys out as columns, with the associated values as the data in those columns. When a column is missing from a line, I want to add it and fill it with dummy data.
Dataframe
+----------------------------------------------------------------+
| _c0 |
+----------------------------------------------------------------+
|name:Pradnya,IP:100.0.0.4, college: SDM, year:2018 |
|name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018 |
+----------------------------------------------------------------+
I want the output in this form
+---------+-------------+---------+----------+------+
| name    | IP          | College | Semester | year |
+---------+-------------+---------+----------+------+
| Pradnya | 100.0.0.4   | SDM     | null     | 2018 |
+---------+-------------+---------+----------+------+
| Ram     | 100.10.10.5 | BVB     | IV       | 2018 |
+---------+-------------+---------+----------+------+
Thanks.
PySpark won't recognize the key:value pairing on its own. One workaround is to convert the file into JSON format and then read the JSON file.
content of raw.txt:
name:Pradnya,IP:100.0.0.4, college: SDM, year:2018
name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018
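The following code will create the JSON file:
import json
with open('raw.json', 'w') as outfile:
    # strip keys and values so that ' college' and 'college' map to the same field
    json.dump([{k.strip(): v.strip() for k, v in (p.split(':', 1) for p in l.split(','))} for l in open('raw.txt')], outfile)
Now you can create the PySpark dataframe using the following code: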
df = spark.read.format('json').load('raw.json')
If you know all the field names, and the keys/values do not contain embedded delimiters, then you can probably convert the key/value lines into Row objects through the RDD's map function.
from pyspark.sql import Row
# assumed you already defined SparkSession named `spark`
sc = spark.sparkContext
# initialize the RDD
rdd = sc.textFile("key-value-file")
# define a list of all field names
columns = ['name', 'IP', 'College', 'Semester', 'year']
# set Row object
def setRow(x):
    # convert line into key/value tuples. strip spaces and lowercase the `k`
    # (str.lower() is used because `from string import lower` only exists in Python 2)
    z = dict((k.strip().lower(), v.strip()) for e in x.split(',') for k, v in [e.split(':')])
    # make sure all columns are present in the Row object, with None for missing keys
    return Row(**dict((c.lower(), z.get(c.lower())) for c in columns))
# map lines to Row objects and then convert the result to dataframe
rdd.map(setRow).toDF().show()
#+-------+-----------+-------+--------+----+
#|college| ip| name|semester|year|
#+-------+-----------+-------+--------+----+
#| SDM| 100.0.0.4|Pradnya| null|2018|
#| BVB|100.10.10.5| Ram| IV|2018|
#+-------+-----------+-------+--------+----+
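The per-line parsing can be tested without Spark. Here is a minimal plain-Python sketch of the same idea; parse_line is a hypothetical helper, and like the code above it assumes that keys and values contain no embedded ',' and that every item contains a ':'.

```python
def parse_line(line, columns):
    """Parse 'k1:v1, k2:v2' into a dict holding every name in `columns` (lowercased)."""
    # split each item on the first ':' only, so values may contain further colons
    pairs = (item.split(":", 1) for item in line.split(",") if item.strip())
    found = {key.strip().lower(): value.strip() for key, value in pairs}
    # keys missing from the line are filled with None
    return {c.lower(): found.get(c.lower()) for c in columns}


columns = ["name", "IP", "College", "Semester", "year"]
line = "name:Pradnya,IP:100.0.0.4, college: SDM, year:2018"
record = parse_line(line, columns)
print(record)
```

The resulting dict could then be passed to Row(**record) inside the map function in place of the inline dictionary construction.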

Dynamically select column content based on other column from the same row

I am using Spark 1.6.1. Let's say my data frame looks like:
+------------+-----+----+
|categoryName|catA |catB|
+------------+-----+----+
| catA |0.25 |0.75|
| catB |0.5 |0.7 |
+------------+-----+----+
Where categoryName has String type, and cat* are Double. I would like to add a column that contains the value from the column whose name is in the categoryName column:
+------------+-----+----+-------+
|categoryName|catA |catB| score |
+------------+-----+----+-------+
| catA |0.25 |0.75| 0.25 | ('score' has value from column name 'catA')
| catB |0.5 |0.7 | 0.7 | ('score' value from column name 'catB')
+------------+-----+----+-------+
I need this extraction for some later calculations. Any ideas?
Important: I don't know names of category columns. Solution needs to be dynamic.
Spark 2.0:
You can do this (for any number of category columns) by creating a temporary column which holds a map of categoryName -> categoryValue, and then selecting from it:
import org.apache.spark.sql.functions._
// the $"..." syntax also needs the session's implicits in scope
// sequence of any number of category columns
val catCols = input.columns.filterNot(_ == "categoryName")
// create a map of category -> value, and then select from that map using categoryName:
input
  .withColumn("asMap", map(catCols.flatMap(c => Seq(lit(c), col(c))): _*))
  .withColumn("score", $"asMap".apply($"categoryName"))
  .drop("asMap")
Spark 1.6: Similar idea, but using an array and a UDF to select from it:
import org.apache.spark.sql.functions._
import scala.collection.mutable
// sequence of any number of category columns
val catCols = input.columns.filterNot(_ == "categoryName")
// UDF to select from array by index of colName in catCols
val getByColName = udf[Double, String, mutable.WrappedArray[Double]] {
  case (colName, colValues) =>
    val index = catCols.zipWithIndex.find(_._1 == colName).map(_._2)
    index.map(colValues.apply).getOrElse(0.0)
}
// create an array of category values and select from it using UDF:
input
  .withColumn("asArray", array(catCols.map(col): _*))
  .withColumn("score", getByColName($"categoryName", $"asArray"))
  .drop("asArray")
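The lookup that both versions perform can be illustrated on plain Python dictionaries. This is only a sketch of the idea, not Spark code: add_score is a hypothetical helper, each dict stands in for one row, and the 0.0 default mirrors the getOrElse(0.0) in the UDF above.

```python
def add_score(rows, key="categoryName", default=0.0):
    """Return copies of `rows` with a 'score' taken from the column named in row[key]."""
    scored = []
    for row in rows:
        # every column except the key column acts as the name -> value map
        as_map = {name: value for name, value in row.items() if name != key}
        scored.append({**row, "score": as_map.get(row[key], default)})
    return scored


rows = [
    {"categoryName": "catA", "catA": 0.25, "catB": 0.75},
    {"categoryName": "catB", "catA": 0.5, "catB": 0.7},
]
scored = add_score(rows)
for row in scored:
    print(row["categoryName"], row["score"])
```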
You have several options:
If you are using Scala, you can use the Dataset API, in which case you would simply write a map that does the calculation.
You can convert the DataFrame to an RDD and use a map.
You can create a UDF that receives all relevant columns as input and does the calculation inside.
You can use a chain of when/otherwise clauses to do the lookup (e.g. when(col1 == CatA, col(CatA)).otherwise(col(CatB))).