I have a fairly large dataset with roughly 5 billion records. I would like to take 1 million random samples out of it. The issue is that the labels are not balanced:
+---------+----------+
|    label|     count|
+---------+----------+
|        A| 768866802|
|        B|4241039902|
|        C| 584150833|
+---------+----------+
Label B has a lot more data than the other labels. I know there is a concept of down- and up-sampling, but given the large quantity of data I probably don't need that technique, since I can easily find 1 million records for each of the labels.
I was wondering how I could efficiently take ~1 million random samples (without replacement) so that I have an even amount across all labels, ~333k per label, using PySpark.
One idea I have is to split the dataset into 3 different dataframes, take ~333k random samples out of each, and stitch them back together, as in the rough sketch below. But maybe there are more efficient ways of doing it.
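Something like this (assuming the full dataset is a dataframe df with a label column; sample() fractions are approximate, so each label would only yield roughly 333k rows):
from pyspark.sql import functions as F
# Approximate per-label sampling fraction targeting ~333k rows per label
counts = {r["label"]: r["count"] for r in df.groupBy("label").count().collect()}
samples = [df.filter(F.col("label") == l).sample(False, min(1.0, 333333 / counts[l]), seed=42)
           for l in ["A", "B", "C"]]
sampled_df = samples[0]
for s in samples[1:]:
    sampled_df = sampled_df.unionByName(s)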
You can create a column with random values and use row_number to filter exactly n random samples for each label:
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

n = 333333  # number of samples per label

# Assign a random value to every row, then keep the first n rows per label
# when ordered by that random value
df = df.withColumn('rand_col', F.rand())
sample_df1 = df.withColumn("row_num", row_number().over(Window.partitionBy("label")
                                                              .orderBy("rand_col"))) \
               .filter(col("row_num") <= n) \
               .drop("rand_col", "row_num")
sample_df1.groupBy("label").count().show()
This will always give you exactly n samples for each label (333,333 here, so ~1M in total).
Another way of doing this is stratified sampling using Spark's stat.sampleBy:
n = 333333
seed = 12345
# Creating a dictionary of fractions for each label
fractions = df.groupBy("label").count().withColumn("required_n", n/col("count"))\
.drop("count").rdd.collectAsMap()
sample_df2 = df.stat.sampleBy("label", fractions, seed)
sample_df2.groupBy("label").count().show()
sampleBy, however, only gives an approximate result and does not guarantee an exact number of records for each label.
Example dataframe:
schema = StructType([StructField("id", IntegerType()), StructField("label", IntegerType())])
data = [[1, 2], [1, 2], [1, 3], [2, 3], [1, 2],[1, 1], [1, 2], [1, 3], [2, 2], [1, 1],[1, 2], [1, 2], [1, 3], [2, 3], [1, 1]]
df = spark.createDataFrame(data,schema=schema)
df.groupBy("label").count().show()
+-----+-----+
|label|count|
+-----+-----+
| 1| 3|
| 2| 7|
| 3| 5|
+-----+-----+
Method 1:
# Sampling 3 records from each label
n = 3
# Assign a column with random values
df = df.withColumn('rand_col', F.rand())
sample_df1 = df.withColumn("row_num", row_number().over(Window.partitionBy("label")
                                                              .orderBy("rand_col"))) \
               .filter(col("row_num") <= n) \
               .drop("rand_col", "row_num")
sample_df1.groupBy("label").count().show()
+-----+-----+
|label|count|
+-----+-----+
| 1| 3|
| 2| 3|
| 3| 3|
+-----+-----+
Method 2:
# Sampling 3 records from each label
n = 3
seed = 12
fractions = df.groupBy("label").count().withColumn("required_n", n/col("count"))\
.drop("count").rdd.collectAsMap()
sample_df2 = df.stat.sampleBy("label", fractions, seed)
sample_df2.groupBy("label").count().show()
+-----+-----+
|label|count|
+-----+-----+
| 1| 3|
| 2| 3|
| 3| 4|
+-----+-----+
As you can see, sampleBy tends to give you an approximately equal distribution, but not exactly. I'd prefer Method 1 for your problem.
I'm using Data Expectations to validate whether a specific column satisfies some required condition. I was able to write the code for checking whether a column is unique, but I'm not able to write the code for the case where the dataframe is filtered on one column first and then another column is checked for uniqueness on the resulting dataframe.
For instance, please find the below 2 scenarios; in both scenarios we need to check whether department_id = "CSE" has unique roll_no values:
Scenario 1:
+------+-------------+-------+
|reg_no|department_id|roll_no|
+------+-------------+-------+
|     1|          CSE|      1|
|     2|          ECE|      1|
|     3|          ECE|      2|
|     4|          CSE|      2|
|     5|           ME|      1|
|     6|          EEE|      1|
|     7|          CSE|      2|
+------+-------------+-------+
In this case, it should fail since CSE has duplicate roll_no values.
Scenario 2:
+------+-------------+-------+
|reg_no|department_id|roll_no|
+------+-------------+-------+
|     1|          CSE|      8|
|     2|          ECE|      2|
|     3|          ECE|      5|
|     4|          CSE|      4|
|     5|           ME|      3|
|     6|          EEE|      2|
|     7|          CSE|      1|
+------+-------------+-------+
In this case, the job should pass since department_id = "CSE" has unique roll_no values.
Please let me know how to handle the above 2 scenarios, where the dataframe should be filtered first and then a column checked for uniqueness, using Foundry data expectations.
You can simply build two dataframes and check if they have the same size (see the sketch below):
- for the first one, just filter department_id = 'CSE' and select roll_no
- for the second one, filter department_id = 'CSE', select roll_no and call distinct()
- if they are the same size, your dataframe was unique with respect to department_id
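A minimal PySpark sketch of that size comparison (not Foundry-specific; the dataframe name df is assumed):
from pyspark.sql import functions as F
# Keep only the CSE rows, then compare the row count with the distinct count
cse_roll_no = df.filter(F.col("department_id") == "CSE").select("roll_no")
is_unique = cse_roll_no.count() == cse_roll_no.distinct().count()
# is_unique is True for Scenario 2 and False for Scenario 1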
IIUC, there are two ways to answer this problem.
A generic piece of code that will show only the duplicate values:
df = spark.createDataFrame([(1, "CSE", 1),(2, "ECE", 1),(3, "ECE", 2),(4, "CSE", 2),(5, "ME", 1),(6, "EEE", 1),(7, "CSE", 2)],["reg_no","department_id","roll_no"])
df.show()
df \
.groupby(['department_id', 'roll_no']) \
.count() \
.where('count > 1') \
.sort('count', ascending=False) \
.show()
This will help you identify whether a department_id is unique or not:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# Compare the full list of roll_no with its distinct set per department
_w = W.partitionBy("department_id").orderBy("department_id")
df = df.withColumn("roll_no_list", F.collect_list("roll_no").over(_w)).withColumn("roll_no_set", F.collect_set("roll_no").over(_w))
df = df.withColumn("cond_col", F.when(F.size(F.col("roll_no_list")) == F.size(F.col("roll_no_set")), "Unique").otherwise("Not Unique"))
df.show()
+------+-------------+-------+------------+-----------+----------+
|reg_no|department_id|roll_no|roll_no_list|roll_no_set| cond_col|
+------+-------------+-------+------------+-----------+----------+
| 1| CSE| 1| [1, 2, 2]| [1, 2]|Not Unique|
| 4| CSE| 2| [1, 2, 2]| [1, 2]|Not Unique|
| 7| CSE| 2| [1, 2, 2]| [1, 2]|Not Unique|
| 2| ECE| 1| [1, 2]| [1, 2]| Unique|
| 3| ECE| 2| [1, 2]| [1, 2]| Unique|
| 6| EEE| 1| [1]| [1]| Unique|
| 5| ME| 1| [1]| [1]| Unique|
+------+-------------+-------+------------+-----------+----------+
I have a Spark SQL dataframe:
+---+-----+-------+
| id|Value|Weights|
+---+-----+-------+
|  1|    2|      4|
|  1|    5|      2|
|  2|    1|      4|
|  2|    6|      2|
|  2|    9|      4|
|  3|    2|      4|
+---+-----+-------+
I need to group by 'id' and aggregate to get the weighted mean, median, and quartiles of the values for each 'id'. What is the best way to do this?
Before the calculation you should do a small transformation to your Value column:
F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))
array_repeat creates an array out of your number: the number inside the array is repeated as many times as specified in the column 'Weights' (casting to int is necessary because array_repeat expects this column to be of integer type). After this step, the first Value of 2 will be transformed into [2,2,2,2].
Then, explode will create a row for every element in the array. So, the array [2,2,2,2] will be transformed into 4 rows, each containing the integer 2.
Then you can calculate statistics, the results will have weights applied, as your dataframe is now transformed according to the weights.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[(1, 2, 4),
(1, 5, 2),
(2, 1, 4),
(2, 6, 2),
(2, 9, 4),
(3, 2, 4)],
['id', 'Value', 'Weights']
)
df = df.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int'))))
df = (df
.groupBy('id')
.agg(F.mean('col').alias('weighted_mean'),
F.expr('percentile(col, 0.5)').alias('weighted_median'),
F.expr('percentile(col, 0.25)').alias('weighted_lower_quartile'),
F.expr('percentile(col, 0.75)').alias('weighted_upper_quartile')))
df.show()
#+---+-------------+---------------+-----------------------+-----------------------+
#| id|weighted_mean|weighted_median|weighted_lower_quartile|weighted_upper_quartile|
#+---+-------------+---------------+-----------------------+-----------------------+
#| 1| 3.0| 2.0| 2.0| 4.25|
#| 2| 5.2| 6.0| 1.0| 9.0|
#| 3| 2.0| 2.0| 2.0| 2.0|
#+---+-------------+---------------+-----------------------+-----------------------+
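As a side note, if you only need the weighted mean (not the weighted median or quartiles), you can compute it directly on the original dataframe without exploding, which avoids multiplying the row count by the weights. A small sketch reusing the example data from above:
# Weighted mean computed from the original (pre-explode) columns
df_orig = spark.createDataFrame(
    [(1, 2, 4), (1, 5, 2), (2, 1, 4), (2, 6, 2), (2, 9, 4), (3, 2, 4)],
    ['id', 'Value', 'Weights'])
(df_orig
 .groupBy('id')
 .agg((F.sum(F.col('Value') * F.col('Weights')) / F.sum('Weights')).alias('weighted_mean'))
 .show())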
I have a Spark dataframe where, for each set of rows with a given column value (col1), I want to grab a sample of the values in (col2). The number of rows for each possible value of col1 may vary widely, so I'm just looking for a set number, say 10, of each type.
There may be a better way to do this, but the natural approach seemed to be df.groupby('col1').
In pandas, I could do df.groupby('col1').col2.head().
I understand that Spark dataframes are not pandas dataframes, but this is a good analogy.
I suppose I could loop over all of the col1 types as a filter, but that seems terribly icky.
Any thoughts on how to do this? Thanks.
Let me create a sample Spark dataframe with two columns.
df = spark.createDataFrame([[1, 'r1'],
                            [1, 'r2'],
                            [1, 'r2'],
                            [2, 'r1'],
                            [3, 'r1'],
                            [3, 'r2'],
                            [4, 'r1'],
                            [5, 'r1'],
                            [5, 'r2'],
                            [5, 'r1']], schema=['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1| r1|
| 1| r2|
| 1| r2|
| 2| r1|
| 3| r1|
| 3| r2|
| 4| r1|
| 5| r1|
| 5| r2|
| 5| r1|
+----+----+
After grouping by col1, we get a GroupedData object (instead of a Spark dataframe). You can use aggregate functions like min, max, and average, but getting a head() is a little bit tricky. We need to convert the GroupedData object back into a Spark dataframe, which can be done using PySpark's collect_list() aggregation function.
from pyspark.sql import functions
df1 = df.groupBy(['col1']).agg(functions.collect_list("col2"))
df1.show(n=3)
Output is:
+----+------------------+
|col1|collect_list(col2)|
+----+------------------+
| 5| [r1, r2, r1]|
| 1| [r1, r2, r2]|
| 3| [r1, r2]|
+----+------------------+
only showing top 3 rows
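If you only need a fixed number of values per group (say 10) rather than the full list, one option on Spark 2.4+ is to shuffle each collected list and slice it; a rough sketch (the column name col2_sample is just illustrative):
# Randomly shuffle each collected list and keep at most 10 elements per group
df.groupBy('col1') \
  .agg(functions.slice(functions.shuffle(functions.collect_list('col2')), 1, 10)
       .alias('col2_sample')) \
  .show()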
I'd like to understand how the k-means method works in PySpark.
For this, I've done this small example:
In [120]: entry = [ [1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5],[5,5,5],[5,5,5],[1,1,1],[5,5,5]]
In [121]: rdd_entry = sc.parallelize(entry)
In [122]: clusters = KMeans.train(rdd_entry, k=5, maxIterations=10, initializationMode="random")
In [123]: rdd_labels = clusters.predict(rdd_entry)
In [125]: rdd_labels.collect()
Out[125]: [3, 1, 0, 0, 2, 2, 2, 3, 2]
In [126]: entry
Out[126]:
[[1, 1, 1],
[2, 2, 2],
[3, 3, 3],
[4, 4, 4],
[5, 5, 5],
[5, 5, 5],
[5, 5, 5],
[1, 1, 1],
[5, 5, 5]]
At first glance it seems that rdd_labels returns the cluster to which each observation belongs, respecting the order of the original rdd. Although in this example it is evident, how can I be sure in a case where I will work with 8 million observations?
Also, I'd like to know how to join rdd_entry and rdd_labels, respecting that order, so that each observation of rdd_entry is correctly labeled with its cluster.
I tried to do a .join(), but it throws an error:
In [127]: rdd_total = rdd_entry.join(rdd_labels)
In [128]: rdd_total.collect()
TypeError: 'int' object has no attribute '__getitem__'
Hope it helps! (This solution is based on pyspark.ml.)
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
#sample data
df = sc.parallelize([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5],[5,5,5],[5,5,5],[1,1,1],[5,5,5]]).\
toDF(('col1','col2','col3'))
vecAssembler = VectorAssembler(inputCols=df.columns, outputCol="features")
vector_df = vecAssembler.transform(df)
#kmeans clustering
kmeans=KMeans(k=3, seed=1)
model=kmeans.fit(vector_df)
predictions=model.transform(vector_df)
predictions.show()
Output is:
+----+----+----+-------------+----------+
|col1|col2|col3| features|prediction|
+----+----+----+-------------+----------+
| 1| 1| 1|[1.0,1.0,1.0]| 0|
| 2| 2| 2|[2.0,2.0,2.0]| 0|
| 3| 3| 3|[3.0,3.0,3.0]| 2|
| 4| 4| 4|[4.0,4.0,4.0]| 1|
| 5| 5| 5|[5.0,5.0,5.0]| 1|
| 5| 5| 5|[5.0,5.0,5.0]| 1|
| 5| 5| 5|[5.0,5.0,5.0]| 1|
| 1| 1| 1|[1.0,1.0,1.0]| 0|
| 5| 5| 5|[5.0,5.0,5.0]| 1|
+----+----+----+-------------+----------+
Although pyspark.ml has the better approach, I thought of writing code to achieve the same result using pyspark.mllib (the trigger was the comment from @Muhammad). So here goes the solution based on pyspark.mllib...
from pyspark.mllib.clustering import KMeans
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType
#sample data
rdd = sc.parallelize([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5],[5,5,5],[5,5,5],[1,1,1],[5,5,5]])
#K-Means example
model = KMeans.train(rdd, k=3, seed=1)
labels = model.predict(rdd)
#add cluster label to the original data
df1 = rdd.toDF(('col1','col2','col3')) \
.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
df2 = spark.createDataFrame(labels, IntegerType()).toDF(('label')) \
.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
df = df1.join(df2, on=["row_index"]).drop("row_index")
df.show()
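As a side note on the original .join() attempt: RDD.join expects (key, value) pair RDDs, which is why it fails here. Since predict is just a map over the input RDD, a simpler option is to zip the two RDDs, which pairs elements positionally; a sketch reusing the names from the question:
# zip requires both RDDs to have the same partitioning and element counts,
# which holds here because predict maps over rdd_entry one-to-one
rdd_total = rdd_entry.zip(rdd_labels)
rdd_total.collect()  # list of (observation, cluster_label) pairs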
I'm trying to get the distinct values of a column in a dataframe in PySpark and then save them in a list. At the moment the list contains entries like "Row(no_children=0)", but I need only the values, as I will use them in another part of my code.
So, ideally only all_values=[0,1,2,3,4]
all_values=sorted(list(df1.select('no_children').distinct().collect()))
all_values
[Row(no_children=0),
Row(no_children=1),
Row(no_children=2),
Row(no_children=3),
Row(no_children=4)]
This takes around 15 seconds to run; is that normal?
Thank you very much!
You can use collect_set from the functions module to get a column's distinct values. Here,
from pyspark.sql import functions as F
>>> df1.show()
+-----------+
|no_children|
+-----------+
| 0|
| 3|
| 2|
| 4|
| 1|
| 4|
+-----------+
>>> df1.select(F.collect_set('no_children').alias('no_children')).first()['no_children']
[0, 1, 2, 3, 4]
You could do something like this to get only the values (using a name other than list, so the built-in isn't shadowed):
values = [r.no_children for r in all_values]
values
[0, 1, 2, 3, 4]
Try this:
all_values = df1.select('no_children').distinct().rdd.flatMap(list).collect()