I have a column that looks like this:
------------------------------
| barcodes                    |
------------------------------
| ["12345678", "91011121313"] |
------------------------------
It can contain more than 2 items.
I tried converting it to a list so that I can iterate over every barcode in barcodes, but I got a type error: TypeError: 'Column' object is not callable.
I am converting it with
barcodes = df_sixty60["orderItems"][0]["barcodes"].collect()
but it's not working.
Generally you cannot collect a single column, only whole dataframes. So you have to select your column first and then call collect.
E.g.
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=1, b=2),
    Row(a=3, b=4),
])
rows = df.select('a').collect()
# rows: [Row(a=1), Row(a=3)]
as_list = [r['a'] for r in rows]
# as_list: [1, 3]
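Applied to your case, a minimal sketch (assuming orderItems is an array of structs that each contain a barcodes array):
rows = df_sixty60.select(df_sixty60["orderItems"][0]["barcodes"].alias("barcodes")).collect()
barcodes = [b for row in rows for b in (row["barcodes"] or [])]
for barcode in barcodes:
    ...  # your per-barcode logic here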
My dataframe looks like this:
The values that belong to one entity sit at the same list index, consistently across all of the columns shown.
column_1 | [2022-08-05 03:38...
column_2 | [inside, inside, ...
column_3 | [269344c6-c01c-45...
column_4 | [ff870660-57ce-11...
column_5 | [Mannheim, Mannhe...
column_6 | [26, 21, 2, 8]
column_7 | [fa8103a0-57ce-11...
column_8 | [ATG1, ATG3, Variable1...
My Approach:
# Get columns
df_column_names = list(df.schema.names)
# Set condition with an expression
filter_func = "filter(geofenceeventtype, spatial_wi_df -> df.column_8 == 'Variable1')"
geofence_expr = f"transform(sort_array({filter_func}), x -> x."
geofence_prefix = "geofence_sorted"
# Extract to new columns
for col in df_column_names:
    df = df.withColumn(
        geofence_prefix + col,
        F.element_at(F.expr(geofence_expr + col.replace("_", ".") + ")"), 1),
    )
In this way I want to create columns that contain only the values belonging to the entity 'Variable1', and then drop all rows without data from this entity.
The error message:
Can't extract value from lambda df#2345: need struct type but got string
So there are rows where the value of the column is just a single String and not a StructType. How do I deal with this problem?
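One way to see which columns actually hold arrays of structs (and therefore support the x.<field> access used inside the lambda) is to inspect the schema first; a minimal diagnostic sketch, assuming the dataframe above:
from pyspark.sql.types import ArrayType, StructType

for field in df.schema.fields:
    dt = field.dataType
    is_struct_array = isinstance(dt, ArrayType) and isinstance(dt.elementType, StructType)
    # Only array-of-struct columns support the x.<field> access used in the lambda
    print(field.name, dt.simpleString(), "transform" if is_struct_array else "skip")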
I have some JSON data that I'm loading into a pyspark dataframe that originally looks something like this:
`{"timestamp": "2022-07-02T23:59:22.393458Z", "version": 2, "payload": {"links": [{"down" : {"db":"0:0", "maker":"gg", "dev":"51"}, "up" : {"db":"0:1", "maker":"03", "dev":"52"}, "max_w" : 3, "max_s" : 3, "curr_w" : 8, "curr_s" : 2},{"down" : {"db":"0:0", "maker":"tr", "dev":"20"}, "up" : {"db":"0:2:1", "maker":"pr", "dev":"1022"}, "max_w" : 8, "max_s" : 2, "curr_w" : 7, "curr_s" : 4}]}}}`
The file I am ingesting into my dataframe contains multiple JSON records like the one above. The array contained in payload.links varies in size between records. I need to get the data from the arrays into columns so that the dataframe looks like:
timestamp | version | payload_links_0_down_db | payload_links_0_down_maker | payload_links_0_down_dev | payload_links_0_up_db | payload_links_0_up_maker | payload_links_0_up_dev | payload_links_0_max_w | payload_links_0_max_s | payload_links_0_curr_w | payload_links_0_curr_s | payload_links_1_down_db | payload_links_1_down_maker | payload_links_1_down_dev | payload_links_1_up_db | payload_links_1_up_maker | payload_links_1_up_dev | payload_links_1_max_w | payload_links_1_max_s | payload_links_1_curr_w | payload_links_1_curr_s
etc...
I understand how to bring the data into a dataframe using spark_df = spark.read.option("multiline", "true").json(source_s3_path)
and I also know how to access each element of the array like this: spark_df = spark_df.withColumn("max_width", spark_df.payload.links[0].max_w)
But because the array size changes from JSON to JSON, I'm struggling with how to create the correct number of columns and how to populate them without index-out-of-bounds errors when, for instance, the maximum array length is 20 but some arrays only have length 3.
How can I get the data down to the final schema that I need?
You could get the maximum length of the array and loop over the indices:
from pyspark.sql import SparkSession, functions as F
spark_df = spark.read.option("multiline", "true").json(source_s3_path)
spark_df = spark_df.select("timestamp", "version", "payload.*")
# Get the maximum length of the links array across all rows
no_of_payload_links = spark_df.select(F.max(F.size("links"))).collect()[0][0]
for i in range(no_of_payload_links):
    spark_df = spark_df.withColumn(f'payload_links_{i}_down_db', spark_df.links[i].down.db)
    spark_df = spark_df.withColumn(f'payload_links_{i}_max_w', spark_df.links[i].max_w)
    # ... (do the same for all the columns)

# Drop the original column as the last step
spark_df = spark_df.drop("links")
But this looks like bad design, since you end up with a dataframe that has a dynamic (and potentially very large) number of columns, which makes subsequent processing harder. A better approach could be to keep the links as separate rows with a fixed set of columns.
spark_df = spark.read.option("multiline", "true").json(source_s3_path)
spark_df = spark_df.select("timestamp", "version", "payload.*")
spark_df = spark_df.withColumn("payload_links", F.explode_outer("links"))
spark_df = spark_df.withColumn('payload_links_down_db', spark_df.payload_links.down.db)
spark_df = spark_df.withColumn('payload_links_max_w', spark_df.payload_links.max_w)
# ... (do the same for all the columns)
# drop the original columns
spark_df = spark_df.drop("links", "payload_links")
I currently have multiple columns (at least 500) in my DataFrame starting with any of the following prefixes ['a_', 'b_', 'c_'].
I want to have a DataFrame with only 3 columns:
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# a                                        | b  | c                                        |
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# {'a_1': 'a_1_value', 'a_2': 'a_2_value'} | {} | {'c_1': 'c_1_value', 'c_2': 'c_2_value'} |
Calling df.collect() causes StackOverflowErrors in the framework I'm using because the DataFrame is pretty large. I'm trying to leverage the map functions to avoid loading the DataFrame into the driver (hence the constraint).
Something like this?
Use struct to combine all columns with a given prefix into one column, then use to_json to turn that struct into a JSON string of key-value pairs.
import pyspark.sql.functions as F

cols = ['a', 'b', 'c']
df = df.select([
    F.to_json(F.struct(*[x for x in df.columns if x.startswith(f'{col}_')])).alias(col)
    for col in cols
])
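A follow-up sketch (not from the original answer): if real MapType columns are preferred over JSON strings, the JSON can be parsed back with from_json, e.g. for column 'a':
from pyspark.sql.types import MapType, StringType

# Parse the JSON string in column 'a' back into a map<string,string> column
df = df.withColumn('a', F.from_json('a', MapType(StringType(), StringType())))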
I am trying to extract the values of a dataframe column into a list.
+------+----------+------------+
|sno_id|updt_dt   |process_flag|
+------+----------+------------+
|   123|01-01-2020|           Y|
|   234|01-01-2020|           Y|
|   512|01-01-2020|           Y|
|   111|01-01-2020|           Y|
+------+----------+------------+
The output should be the list of sno_id values: ['123', '234', '512', '111'].
Then I need to iterate over the list to run some logic on each of its values. I am currently using HiveWarehouseSession to fetch data from a Hive table into a DataFrame with hive.executeQuery(query).
It is pretty easy: first collect the selected column, which returns a list of Row objects:
row_list = df.select('sno_id').collect()
Then you can iterate over the Row objects to convert the column into a list:
sno_id_array = [ row.sno_id for row in row_list]
sno_id_array
# ['123', '234', '512', '111']
Using flatMap, a one-liner alternative:
sno_id_array = df.select("sno_id").rdd.flatMap(lambda x: x).collect()
You could use toLocalIterator() to create a generator over the column.
Since you wanted to loop over the results afterwards, this may be more efficient in your case.
With a generator you don't create and store the whole list first; you apply your logic to each value immediately while iterating over the column:
sno_ids = df.select('sno_id').toLocalIterator()
for row in sno_ids:
    sno_id = row.sno_id
    # continue with your logic
    ...
Alternative one-liner using a generator expression:
sno_ids = (row.sno_id for row in df.select('sno_id').toLocalIterator())
for sno_id in sno_ids:
    ...
I want to use PySpark to restructure my data so that I can use it for MLlib models. Currently, for each user I have an array of arrays in one column, and I want to convert it into one column per unique name with the count as the value.
Users | column1 |
user1 | [[name1, 4], [name2, 5]] |
user2 | [[name1, 2], [name3, 1]] |
should get converted to:
Users | name1 | name2 | name3 |
user1 | 4.0 | 5.0 | 0.0 |
user2 | 2.0 | 0.0 | 1.0 |
I came up with a method that uses for loops, but I am looking for a way that utilizes Spark because the data is huge. Could you give me any hints? Thanks.
Edit:
All of the unique names should come as individual columns with the score corresponding to each user. Basically, a sparse matrix.
I am working with pandas right now and the code I'm using to do this is:
import pandas as pd

# Convert each array of arrays into a dictionary
data = data.applymap(lambda x: dict(x))
columns = list(data)
for i in columns:
    # For each column, use the dictionary to build new Series and append them to the current dataframe
    data = pd.concat([data.drop([i], axis=1), data[i].apply(pd.Series)], axis=1)
Figured out the answer:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# First explode column1; this makes each element a separate row
df = df.withColumn('column1', F.explode_outer(F.col('column1')))
# Then separate the exploded column1 into a name column and a count column
df = df.withColumn('column1_separated', F.col('column1')[0])
df = df.withColumn('count', F.col('column1')[1].cast(IntegerType()))
# Then pivot the df
df = df.groupby('Users').pivot('column1_separated').sum('count')
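The pivot leaves nulls where a user has no entry for a given name; a small follow-up (not part of the original answer) to match the desired output of 0.0 instead of null:
# Replace the nulls produced by the pivot with 0
df = df.na.fill(0)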