Convert distinct values in a Dataframe in Pyspark to a list - pyspark

I'm trying to get the distinct values of a column in a PySpark dataframe and then save them in a list. At the moment the list contains "Row(no_children=0)",
but I need only the value, as I will use it in another part of my code.
So, ideally only all_values=[0,1,2,3,4]
all_values=sorted(list(df1.select('no_children').distinct().collect()))
all_values
[Row(no_children=0),
Row(no_children=1),
Row(no_children=2),
Row(no_children=3),
Row(no_children=4)]
This takes around 15 seconds to run. Is that normal?
Thank you very much!

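You can use collect_set from the functions module to get a column's distinct values. Note that collect_set does not guarantee the order of the result, so wrap it in sorted() if the order matters. Here: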
from pyspark.sql import functions as F
>>> df1.show()
+-----------+
|no_children|
+-----------+
| 0|
| 3|
| 2|
| 4|
| 1|
| 4|
+-----------+
>>> df1.select(F.collect_set('no_children').alias('no_children')).first()['no_children']
[0, 1, 2, 3, 4]

You could do something like this to get only the values (using a name other than list, so the built-in is not shadowed):
values = [r.no_children for r in all_values]
values
[0, 1, 2, 3, 4]

Try this:
all_values = df1.select('no_children').distinct().rdd.flatMap(list).collect()

Related

Pyspark: Take n samples with balanced classes

I have a fairly large dataset with roughly 5 billion records. I would like to take 1 million random samples from it. The issue is that the labels are not balanced:
+---------+----------+
| label| count|
+---------+----------+
|A | 768866802|
|B |4241039902|
|C | 584150833|
+---------+----------+
Label B has a lot more data than the other labels. I know there is a concept of down- and up-sampling, but given the large quantity of data I probably don't need that technique, since I can easily find 1 million records for each of the labels.
I was wondering how I could efficiently take about 1 million random samples (without replacement) using PySpark, so that I have an even amount across all labels, about 333k for each label.
One idea I have is to split the dataset into 3 different dataframes, take about 333k random samples out of each, and stitch them together. But maybe there are more efficient ways of doing it.
You can create a column with random values and use row_number to keep n random samples for each label (about 1M in total):
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.functions import *
from pyspark.sql.window import Window
n = 333333 # number of samples
df = df.withColumn('rand_col', F.rand())
sample_df1 = df.withColumn("row_num",row_number().over(Window.partitionBy("label")\
.orderBy("rand_col"))).filter(col("row_num")<=n)\
.drop("rand_col", "row_num")
sample_df1.groupBy("label").count().show()
This will always give you exactly n samples for each label (about 1M in total), provided every label has at least n rows.
Another way of doing this is stratified sampling with Spark's stat.sampleBy:
n = 333333
seed = 12345
# Creating a dictionary of fractions for each label
fractions = df.groupBy("label").count().withColumn("required_n", n/col("count"))\
.drop("count").rdd.collectAsMap()
sample_df2 = df.stat.sampleBy("label", fractions, seed)
sample_df2.groupBy("label").count().show()
sampleBy, however, gives an approximate result that varies from run to run and does not guarantee an exact number of records for each label.
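To make the fractions concrete, here is a small plain-Python sketch (no Spark needed) that computes the per-label fraction for the counts quoted in the question. The names counts, fractions and expected are illustrative only, and the cap at 1.0 is an assumption added for labels that have fewer than n rows.

```python
# Label counts copied from the table in the question.
counts = {"A": 768_866_802, "B": 4_241_039_902, "C": 584_150_833}
n = 333_333  # target number of rows per label

# One sampling fraction per label; capped at 1.0 (assumption) in case a label has fewer than n rows.
fractions = {label: min(1.0, n / count) for label, count in counts.items()}

# Expected (not guaranteed) sample size per label is fraction * count.
expected = {label: fractions[label] * counts[label] for label in counts}

for label in counts:
    print(label, f"{fractions[label]:.6e}", round(expected[label]))
```

The expected size is n for every label, but the actual size of each stratum will fluctuate around that value from run to run.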
Example dataframe:
schema = StructType([StructField("id", IntegerType()), StructField("label", IntegerType())])
data = [[1, 2], [1, 2], [1, 3], [2, 3], [1, 2],[1, 1], [1, 2], [1, 3], [2, 2], [1, 1],[1, 2], [1, 2], [1, 3], [2, 3], [1, 1]]
df = spark.createDataFrame(data,schema=schema)
df.groupBy("label").count().show()
+-----+-----+
|label|count|
+-----+-----+
| 1| 3|
| 2| 7|
| 3| 5|
+-----+-----+
Method 1:
# Sampling 3 records from each label
n = 3
# Assign a column with random values
df = df.withColumn('rand_col', F.rand())
sample_df1 = df.withColumn("row_num",row_number().over(Window.partitionBy("label")\
.orderBy("rand_col"))).filter(col("row_num")<=n)\
.drop("rand_col", "row_num")
sample_df1.groupBy("label").count().show()
+-----+-----+
|label|count|
+-----+-----+
| 1| 3|
| 2| 3|
| 3| 3|
+-----+-----+
Method 2:
# Sampling 3 records from each label
n = 3
seed = 12
fractions = df.groupBy("label").count().withColumn("required_n", n/col("count"))\
.drop("count").rdd.collectAsMap()
sample_df2 = df.stat.sampleBy("label", fractions, seed)
sample_df2.groupBy("label").count().show()
+-----+-----+
|label|count|
+-----+-----+
| 1| 3|
| 2| 3|
| 3| 4|
+-----+-----+
As you can see, sampleBy tends to give you an approximately equal distribution, but not an exact one. I'd prefer Method 1 for your problem.

substring function returns a Column type instead of a value. Is there a way to fetch a value out of a Column type in PySpark?

I am comparing a condition with pyspark join in my application by using substring function. This function is returning a column type instead of a value.
substring(trim(coalesce(df.col1)), 13, 3) returns
Column<b'substring(trim(coalesce(col1), 13, 3)'>
Tried with expr but still getting the same column type result
expr("substring(trim(coalesce(df.col1)),length(trim(coalesce(df.col1))) - 2, 3)")
I want to compare the values coming from substring to the value of another dataframe column. Both are of string type
pyspark:
substring(trim(coalesce(df.col1)), length(trim(coalesce(df.col1))) -2, 3) == df2["col2"]
Let's say col1 = 'abcdefghijklmno'
The expected output of the substring function should be mno based on the above definition.
Create sample dataframes to join:
list1 = [('ABC','abcdefghijklmno'),('XYZ','abcdefghijklmno'),('DEF','abcdefghijklabc')]
df1=spark.createDataFrame(list1, ['col1', 'col2'])
list2 = [(1,'mno'),(2,'mno'),(3,'abc')]
df2=spark.createDataFrame(list2, ['col1', 'col2'])
import pyspark.sql.functions as f
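Build the join condition from a substring that reads the last three characters. substring() returns a Column expression rather than a value, and that is expected: the comparison is evaluated row by row when the join runs.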
cond=f.substring(df1['col2'], -3, 3)==df2['col2']
newdf=df1.join(df2,cond)
>>> newdf.show()
+----+---------------+----+----+
|col1| col2|col1|col2|
+----+---------------+----+----+
| ABC|abcdefghijklmno| 1| mno|
| ABC|abcdefghijklmno| 2| mno|
| XYZ|abcdefghijklmno| 1| mno|
| XYZ|abcdefghijklmno| 2| mno|
| DEF|abcdefghijklabc| 3| abc|
+----+---------------+----+----+
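To see why the join returns five rows, the same condition can be replayed in plain Python without Spark. This is only a sketch of the matching logic: the slice s[-3:] stands in for substring(col2, -3, 3), and the names rows1, rows2 and joined are made up for the illustration.

```python
# The same sample data as in the two dataframes above.
rows1 = [("ABC", "abcdefghijklmno"), ("XYZ", "abcdefghijklmno"), ("DEF", "abcdefghijklabc")]
rows2 = [(1, "mno"), (2, "mno"), (3, "abc")]

# Keep every pair whose last three characters on the left equal col2 on the right.
joined = [
    (left_col1, left_col2, right_col1, right_col2)
    for (left_col1, left_col2) in rows1
    for (right_col1, right_col2) in rows2
    if left_col2[-3:] == right_col2
]

for row in joined:
    print(row)
```

Each of the two strings ending in mno matches both mno rows, and the string ending in abc matches once, which gives 2 + 2 + 1 = 5 rows.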

How to convert numerical values to a categorical variable using pyspark

I have a PySpark dataframe with a range of numerical values.
For example,
my dataframe has a column with values from 1 to 100.
1-10 - group1 <== rows with a column value from 1 to 10 should contain group1 as the value
11-20 - group2
.
.
.
91-100 group10
How can I achieve this using a PySpark dataframe?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to take the integral part of a number; for example, floor(15.5) is 15. Since the groups cover 1-10, 11-20, and so on, we subtract 1 from Var before dividing by 10, take the floor, and then add 1, because the group index starts from 1 rather than 0. Without the subtraction, 10 would land in group2 and 100 in group11. Finally, we need to prepend group to the value. Concatenation can be achieved with the concat() function, but since the prepended word group is not a column, we need to wrap it in lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var',concat(lit('group'),(1+floor((col('Var')-1)/10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
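As a sanity check on the range boundaries, the plain-Python sketch below compares two candidate formulas: 1 + floor(v / 10) and 1 + floor((v - 1) / 10). Only the second one keeps 10 in group1 and 100 in group10, as the question asks. The function names group_naive and group_shifted are hypothetical and exist only for this comparison.

```python
import math

def group_naive(v):
    # 1 + floor(v / 10): puts 10 into group2 and 100 into group11.
    return "group" + str(1 + math.floor(v / 10))

def group_shifted(v):
    # 1 + floor((v - 1) / 10): maps 1-10 to group1, 11-20 to group2, ..., 91-100 to group10.
    return "group" + str(1 + math.floor((v - 1) / 10))

for v in (1, 7, 10, 11, 54, 99, 100):
    print(v, group_naive(v), group_shifted(v))
```

Both formulas agree on the sample values 54, 7, 72 and 99; they differ only on exact multiples of 10.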

pyspark/dataframe - creating a nested structure

I'm using PySpark with dataframes and would like to create a nested structure as below:
Before:
Column 1 | Column 2 | Column 3
--------------------------------
A | B | 1
A | B | 2
A | C | 1
After:
Column 1 | Column 4
--------------------------------
A | [B : [1,2]]
A | [C : [1]]
Is this doable?
I don't think you can get that exact output, but you can come close. The problem is your key names for the column 4. In Spark, structs need to have a fixed set of columns known in advance. But let's leave that for later, first, the aggregation:
import pyspark
from pyspark.sql import functions as F
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
data = [('A', 'B', 1), ('A', 'B', 2), ('A', 'C', 1)]
columns = ['Column1', 'Column2', 'Column3']
data = spark.createDataFrame(data, columns)
data.createOrReplaceTempView("data")
data.show()
# Result
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+-------+-------+-------+
nested = spark.sql("SELECT Column1, Column2, STRUCT(COLLECT_LIST(Column3) AS data) AS Column4 FROM data GROUP BY Column1, Column2")
nested.toJSON().collect()
# Result
['{"Column1":"A","Column2":"C","Column4":{"data":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"data":[1,2]}}']
Which is almost what you want, right? The problem is that if you do not know your key names in advance (that is, the values in Column 2), Spark cannot determine the structure of your data. Also, I am not entirely sure how you can use the value of a column as key for a structure unless you use a UDF (maybe with a PIVOT?):
datatype = 'struct<B:array<bigint>,C:array<bigint>>' # Add any other potential keys here.
@F.udf(datatype)
def replace_struct_name(column2_value, column4_value):
    # Fields that are missing from the returned dict come back as null in the struct.
    return {column2_value: column4_value['data']}
nested.withColumn('Column4', replace_struct_name(F.col("Column2"), F.col("Column4"))).toJSON().collect()
# Output
['{"Column1":"A","Column2":"C","Column4":{"C":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"B":[1,2]}}']
This of course has the drawback that the set of keys must be finite and known in advance; any other key values will be silently ignored.
First, a reproducible example of your dataframe (sc is an existing SparkContext).
from pyspark.sql import SQLContext, functions as F
js = [{"col1": "A", "col2":"B", "col3":1},{"col1": "A", "col2":"B", "col3":2},{"col1": "A", "col2":"C", "col3":1}]
jsrdd = sc.parallelize(js)
sqlContext = SQLContext(sc)
jsdf = sqlContext.read.json(jsrdd)
jsdf.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+----+----+----+
Now, lists are not stored as key-value pairs. You can either use a dictionary or a simple collect_list() after grouping by col1 and col2.
jsdf.groupby(['col1', 'col2']).agg(F.collect_list('col3')).show()
+----+----+------------------+
|col1|col2|collect_list(col3)|
+----+----+------------------+
| A| C| [1]|
| A| B| [1, 2]|
+----+----+------------------+
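For comparison, the dictionary shape mentioned above can be sketched in plain Python on the driver. This is only an illustration of the target structure, not a Spark operation; the names rows, nested and result are made up for the example, and it assumes the data is small enough to hold in memory.

```python
from collections import defaultdict

# The same three rows as in the example dataframe: (col1, col2, col3).
rows = [("A", "B", 1), ("A", "B", 2), ("A", "C", 1)]

# Outer key is col1, inner key is col2, and the values are lists of col3.
nested = defaultdict(lambda: defaultdict(list))
for col1, col2, col3 in rows:
    nested[col1][col2].append(col3)

# Convert the defaultdicts to plain dicts for readable output.
result = {key: dict(inner) for key, inner in nested.items()}
print(result)  # {'A': {'B': [1, 2], 'C': [1]}}
```

Unlike a Spark struct, a Python dict does not need its keys declared in advance, which is why this shape is easy to build here but awkward to express as a fixed schema.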

pyspark equivalent of pandas groupby('col1').col2.head()

I have a Spark Dataframe where, for each set of rows with a given column value (col1), I want to grab a sample of the values in col2. The number of rows for each possible value of col1 may vary widely, so I'm just looking for a set number, say 10, of each type.
There may be a better way to do this, but the natural approach seemed to be a df.groupby('col1').
In pandas, I could do df.groupby('col1').col2.head()
I understand that Spark dataframes are not pandas dataframes, but this is a good analogy.
I suppose I could loop over all of the col1 values as a filter, but that seems terribly icky.
Any thoughts on how to do this? Thanks.
Let me create a sample Spark dataframe with two columns.
df = SparkSQLContext.createDataFrame([[1, 'r1'],
[1, 'r2'],
[1, 'r2'],
[2, 'r1'],
[3, 'r1'],
[3, 'r2'],
[4, 'r1'],
[5, 'r1'],
[5, 'r2'],
[5, 'r1']], schema=['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1| r1|
| 1| r2|
| 1| r2|
| 2| r1|
| 3| r1|
| 3| r2|
| 4| r1|
| 5| r1|
| 5| r2|
| 5| r1|
+----+----+
After grouping by col1, we get a GroupedData object (instead of a Spark Dataframe). You can use aggregate functions like min, max, and avg, but getting a head() is a little bit tricky. We need to convert the GroupedData object back to a Spark Dataframe. This can be done using the PySpark collect_list() aggregation function.
from pyspark.sql import functions
# show() only prints and returns None, so its result is not assigned to a variable
df.groupBy(['col1']).agg(functions.collect_list("col2")).show(n=3)
Output is:
+----+------------------+
|col1|collect_list(col2)|
+----+------------------+
| 5| [r1, r2, r1]|
| 1| [r1, r2, r2]|
| 3| [r1, r2]|
+----+------------------+
only showing top 3 rows
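Note that collect_list() gathers every value in each group, while the question asks for at most a fixed number per group. The plain-Python sketch below shows the intended "first n per key" semantics on the same sample rows. It is an illustration of the logic only, with hypothetical names (rows, n, head); in Spark, row order inside a group is not guaranteed unless you order explicitly, so the result there is better read as "some n values per group".

```python
from collections import defaultdict

# The same (col1, col2) rows as in the sample dataframe.
rows = [(1, "r1"), (1, "r2"), (1, "r2"), (2, "r1"), (3, "r1"),
        (3, "r2"), (4, "r1"), (5, "r1"), (5, "r2"), (5, "r1")]

n = 2  # maximum number of values to keep per col1 value

# Append a value only while the group still has fewer than n entries.
head = defaultdict(list)
for col1, col2 in rows:
    if len(head[col1]) < n:
        head[col1].append(col2)

print(dict(head))
```

With n = 2, groups 1 and 5 are cut from three values to two, while the smaller groups are kept whole.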