I want to use pySpark to restructure my data so that I can use it for MLLib models, currently for each user I have an array of array in one column and I want to convert it unique columns with the count.
Users | column1 |
user1 | [[name1, 4], [name2, 5]] |
user2 | [[name1, 2], [name3, 1]] |
should get converted to:
Users | name1 | name2 | name3 |
user1 | 4.0 | 5.0 | 0.0 |
user2 | 2.0 | 0.0 | 1.0 |
I came up with a method that uses for loops but I am looking for a way that can utilize spark because the data is huge. Could you give me any hints? Thanks.
Edit:
All of the unique names should come as individual columns with the score corresponding to each user. Basically, a sparse matrix.
I am working with pandas right now and the code I'm using to do this is
data = data.applymap(lambda x: dict(x)) # To convert the array of array into a dictionary
columns = list(data)
for i in columns:
# For each columns using the dictionary to make a new Series and appending it to the current dataframe
data = pd.concat([data.drop([i], axis=1), data[i].apply(pd.Series)], axis=1)
Figured out the answer,
import pyspark.sql.functions as F
# First we explode column`, this makes each element as a separate row
df= df.withColumn('column1', F.explode_outer(F.col('column1')))
# Then, seperate out the new column1 into two columns
df = df.withColumn(("column1_seperated"), F.col('column1')[0])
df= df.withColumn("count", F.col(i)['column1'].cast(IntegerType()))
# Then pivot the df
df= df.groupby('Users').pivot("column1_seperated").sum('count')
Related
I have a dataframe where I want to create pivot table from 2 columns, i'm using the question header column which will have its value pivoted like below : age , age_numeric
and the answer header is the value , my problem is I want to put the value of the answer header in a list which I'm doing using collect_list function, but the problem is i want the new column like age_numeric to be list of int, while column age to be list of strings, based on question type column, but when i try the code it always gives me a list of strings, any idea how to solve this problem?
this is the code
y=output.groupby("sessionId").pivot("questionHeader").
agg(collect_list(when(col("questionType")=="numericAnswer",
col("answerHeader")
.cast("float")).when(col("questionType")!="numericAnswer",col("answerHeader"))))
this is what i get
| session id | Age | Age_numeric
| 1 | ["20-25 years"] | ["20"]
| 3 | ["20-25 years"] | ["20"]
This is what i want
| session id | Age | Age_numeric
| 1 | ["20-25 years"] | [20]
| 3 | ["20-25 years"] | [20]
If you want the output as in the last two rows, then you do not require a pivot, just groupby and collect_list on each of the two columns To get the list of integers for Age_numeric, apply .cast("array< int>"), or change the type of Age_numeric column before collect_list().
Replicate the data
import pyspark.sql.functions as F
data = [(1, "20-25 years", "20"), (3, "20-25 years", "20")]
df = spark.createDataFrame(data, schema=["session_id", "Age", "Age_numeric"])
Replicate the output
df_out = (df.groupBy("session_id")
.agg(F.collect_list("Age").alias("Age"),
F.collect_list("Age_numeric")
.cast("array<int>")
.alias("Age_numeric"))
I have a spark dataframe containing the following schema:
StructField(email_address,StringType,true),StructField(subject_line,StringType,true)))
I want randomly sample 50% of the population into control and test. Currently I am doing it the following way:
df_segment_ctl = df_segment.sample(False, 0.5, seed=0)
df_segment_tmt = df_segment.join(df_segment_ctl, ["email_address"], "leftanti")
But I am certain there must be a better way to create a column instead like the following
| email_address| segment_id|group |
+--------------------+---------------+---------+
|xxxxxxxxxx#gmail.com| 1.1|treatment|
| xxxxxxx#gmail.com| 1.6|control |
Any help is appreciated. I am new to this world
UPDATE:
I dont want to split the dataframe into two. Just want to add an indicator column
UPDATE:
Is it possible to have multiple splits elegantly. Suppose instead of two groups I want a single control and two treatment
| email_address| segment_id|group |
+--------------------+---------------+---------+
|xxxxxxxxxx#gmail.com| 1.1|treat_1. |
| xxxxxxx#gmail.com| 1.6|control |
| xxxxx#gmail.com | 1.6|treat_2 |
You can split the spark dataframe using random split like below
df_segment_ctl, df_segment_tmt = df_segment.randomSplit(weights=[0.5,0.5], seed=0)
I am trying to extract the list of column values from a dataframe into a list
+------+----------+------------+
|sno_id|updt_dt |process_flag|
+------+----------+------------+
| 123 |01-01-2020| Y |
+------+----------+------------+
| 234 |01-01-2020| Y |
+------+----------+------------+
| 512 |01-01-2020| Y |
+------+----------+------------+
| 111 |01-01-2020| Y |
+------+----------+------------+
Output should be the list of sno_id ['123','234','512','111']
Then I need to iterate the list to run some logic on each on the list values. I am currently using HiveWarehouseSession to fetch data from hive table into Dataframe by using hive.executeQuery(query)
it is pretty easy as you can first collect the df with will return list of Row type then
row_list = df.select('sno_id').collect()
then you can iterate on row type to convert column into list
sno_id_array = [ row.sno_id for row in row_list]
sno_id_array
['123','234','512','111']
Using Flat map and more optimized solution
sno_id_array = df.select("sno_id ").rdd.flatMap(lambda x: x).collect()
You could use toLocalIterator() to create a generator over the column.
Since you wanted to loop over the results afterwards, this may be more efficient in your case.
Using a generator you don't create and store the list first, but when iterating over the columns you apply your logic immediately:
sno_ids = df.select('sno_id').toLocalIterator()
for row in sno_ids:
sno_id = row.sno_id
# continue with your logic
...
Alternative one-liner using a generator expression:
sno_ids = (row.sno_id for row in df.select('sno_id').toLocalIterator())
for sno_id in sno_ids:
...
My goal is to merge two dataframes on the column id, and perform a somewhat complex merge on another column that contains JSON we can call data.
Suppose I have the DataFrame df1 that looks like this:
id | data
---------------------------------
42 | {'a_list':['foo'],'count':1}
43 | {'a_list':['scrog'],'count':0}
And I'm interested in merging with a similar, but different DataFrame df2:
id | data
---------------------------------
42 | {'a_list':['bar'],'count':2}
44 | {'a_list':['baz'],'count':4}
And I would like the following DataFrame, joining and merging properties from the JSON data where id matches, but retaining rows where id does not match and keeping the data column as-is:
id | data
---------------------------------------
42 | {'a_list':['foo','bar'],'count':3} <-- where 'bar' is added to 'foo', and count is summed
43 | {'a_list':['scrog'],'count':1}
44 | {'a_list':['baz'],'count':4}
As can be seen where id is 42, there is a some logic I will have to apply to how the JSON is merged.
My knee jerk thought is that I'd like to provide a lambda / udf to merge the data column, but not sure how to think about that with during a join.
Alternatively, I could break the properties from the JSON into columns, something like this, that might be a better approach?
df1:
id | a_list | count
----------------------
42 | ['foo'] | 1
43 | ['scrog'] | 0
df2:
id | a_list | count
---------------------
42 | ['bar'] | 2
44 | ['baz'] | 4
Resulting:
id | a_list | count
---------------------------
42 | ['foo', 'bar'] | 3
43 | ['scrog'] | 0
44 | ['baz'] | 4
If I went this route, I would then have to merge the columns a_list and count into JSON again under a single column data, but this I can wrap my head around as a relatively simple map function.
Update: Expanding on Question
More realistically, I will have n number of DataFrames in a list, e.g. df_list = [df1, df2, df3], all shaped the same. What is an efficient way to perform these same actions on n number of DataFrames?
Update to Update
Not sure how efficient this is, or if there is a more spark-esque way to do this, but incorporating accepted answer, this appears to work for question update:
for i in range(0, (len(validations) - 1)):
# set dfs
df1 = validations[i]['df']
df2 = validations[(i+1)]['df']
# joins here...
# update new_df
new_df = df2
Here's one way to accomplish your second approach:
Explode the list column and then unionAll the two DataFrames. Next groupBy the "id" column and use pyspark.sql.functions.collect_list() and pyspark.sql.functions.sum():
import pyspark.sql.functions as f
new_df = df1.select("id", f.explode("a_list").alias("a_values"), "count")\
.unionAll(df2.select("id", f.explode("a_list").alias("a_values"), "count"))\
.groupBy("id")\
.agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))
new_df.show(truncate=False)
#+---+----------+-----+
#|id |a_list |count|
#+---+----------+-----+
#|43 |[scrog] |0 |
#|44 |[baz] |4 |
#|42 |[foo, bar]|3 |
#+---+----------+-----+
Finally you can use pyspark.sql.functions.struct() and pyspark.sql.functions.to_json() to convert this intermediate DataFrame into your desired structure:
new_df = new_df.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
new_df.show()
#+---+----------------------------------+
#|id |data |
#+---+----------------------------------+
#|43 |{"a_list":["scrog"],"count":0} |
#|44 |{"a_list":["baz"],"count":4} |
#|42 |{"a_list":["foo","bar"],"count":3}|
#+---+----------------------------------+
Update
If you had a list of dataframes in df_list, you could do the following:
from functools import reduce # for python3
df_list = [df1, df2]
new_df = reduce(lambda a, b: a.unionAll(b), df_list)\
.select("id", f.explode("a_list").alias("a_values"), "count")\
.groupBy("id")\
.agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))\
.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
I am a beginner of Spark. Please help me out with a solution.
The csv file contains the text in the form of key:value paring delimited by a comma. And in some lines, the keys(or columns) may be missing.
I have loaded this file into a single column of a dataframe. I want to segregate these keys as columns and values associated to it as data into that column. And when there are some columns missing i want to add a new column and a dummy data to that.
Dataframe
+----------------------------------------------------------------+
| _c0 |
+----------------------------------------------------------------+
|name:Pradnya,IP:100.0.0.4, college: SDM, year:2018 |
|name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018 |
+----------------------------------------------------------------+
I want the output in this form
+----------- ----------------------------------------------
| name | IP | College | Semester | year |
+-----------+-------------------------+-----------+-------+
| Pradnya |100.0.0.4 | SDM | null | 2018 |
+-----------+-------------+-----------+-----------+-------+
| Ram | 100.10.10.5 | BVB | IV |2018 |
+-----------+-------------+-----------+-----------+-------+
Thanks.
Pyspark won't recognize the key:value pairing. One workaround is convert the file int json format and then read the json file.
content of raw.txt:
name:Pradnya,IP:100.0.0.4, college: SDM, year:2018
name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018
Following code will create the json file :
with open('raw.json', 'w') as outfile:
json.dump([dict([p.split(':') for p in l.split(',')]) for l in open('raw.txt')], outfile)
Now you can create the pyspark dataframe using following code :
df = spark.read.format('json').load('raw.json')
If you know all field names and keys/values do not contain embedded delimiters. then you can probably convert the key/value lines into Row object through RDD's map function.
from pyspark.sql import Row
from string import lower
# assumed you already defined SparkSession named `spark`
sc = spark.sparkContext
# initialize the RDD
rdd = sc.textFile("key-value-file")
# define a list of all field names
columns = ['name', 'IP', 'College', 'Semester', 'year']
# set Row object
def setRow(x):
# convert line into key/value tuples. strip spaces and lowercase the `k`
z = dict((lower(k.strip()), v.strip()) for e in x.split(',') for k,v in [ e.split(':') ])
# make sure all columns shown in the Row object
return Row(**dict((c, z[c] if c in z else None) for c in map(lower, columns)))
# map lines to Row objects and then convert the result to dataframe
rdd.map(setRow).toDF().show()
#+-------+-----------+-------+--------+----+
#|college| ip| name|semester|year|
#+-------+-----------+-------+--------+----+
#| SDM| 100.0.0.4|Pradnya| null|2018|
#| BVB|100.10.10.5| Ram| IV|2018|
#+-------+-----------+-------+--------+----+