PySpark: getting features for clustering from list

I don't have much experience with Spark and I'm having some issues parsing data in order to apply clustering in PySpark. Basically I have some users with some items associated with them.
The DataFrame (df) is composed of rows (user, item). A user may be associated with multiple items (and an item may be associated with many users), so there may be multiple rows with the same user and different items.
Example:
user | item
1 | 45
2 | 58
3 | 2
1 | 97
1 | 2
2 | 3
...
I would like to cluster the users using the associated items information.
My approach was the following:
I grouped on users, transforming the items into arrays. So my example df became:
user | items
1 | [2, 45, 97]
2 | [3, 58]
3 | [2]
...
I transformed each items array into a new list of n elements, where n is the maximum item number, with a 0 or 1 at each position depending on whether that item is present. So after this my df became:
user | items
1 | [0,0,1,0,0,0,0, ...]
2 | [0,0,0,1,0,0,0, ...]
3 | [0,0,1,0,0,0,0, ...]
...
I would then use this to build the input lists for KMeans, like this:
from pyspark.mllib.clustering import KMeans, KMeansModel

def getRow(r):
    # Extract the 0/1 list from each Row
    res = r['items']
    return res

# RDD of plain Python lists, one per user
kmtrain = df.rdd.map(lambda r: getRow(r))
model = KMeans.train(kmtrain, n_clusters, maxIterations=10, initializationMode="k-means||")
The problem is that this approach takes ages. It's clear that I'm doing something wrong here. Can you help me understand where my mistakes are?
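For reference, here is one way to express the same pipeline with sparse features instead of dense Python lists, using the DataFrame-based pyspark.ml API (a minimal sketch assuming the df and n_clusters from above, not the original mllib code):
import pyspark.sql.functions as F
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import KMeans

# Collect each user's items into an array of string tokens
grouped = df.groupBy("user").agg(
    F.collect_list(F.col("item").cast("string")).alias("item_tokens")
)

# binary=True yields 0/1 presence features, stored as sparse vectors
cv = CountVectorizer(inputCol="item_tokens", outputCol="features", binary=True)
vectorized = cv.fit(grouped).transform(grouped)

# The DataFrame-based KMeans consumes the sparse 'features' column directly
kmeans = KMeans(k=n_clusters, maxIter=10)
model = kmeans.fit(vectorized)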

Related

How do I identify the value of a skewed task of my Foundry job?

I've looked into my job and have identified that I do indeed have a skewed task. How do I determine what the actual value is inside this task that is causing the skew?
My Python Transforms code looks like this:
from transforms.api import Input, Output, transform
@transform(
    ...
)
def my_compute_function(...):
    ...
    df = df.join(df_2, ["joint_col"])
    ...
Theory
Skew problems originate from anything that causes an exchange in your job. Things that cause exchanges include but are not limited to: joins, windows, groupBys.
These operations result in data movement across your Executors, based on the values found in the DataFrames involved. This means that when a DataFrame has many repeated values in the column dictating the exchange, those rows all end up in the same task, thus increasing its size.
Example
Let's consider the following example distribution of data for your join:
DataFrame 1 (df1)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1 |
| key_1 | 2 |
| key_1 | 3 |
| key_1 | 1 |
| key_1 | 2 |
| key_2 | 1 |
DataFrame 2 (df2)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1 |
| key_1 | 2 |
| key_1 | 3 |
| key_1 | 1 |
| key_2 | 2 |
| key_3 | 1 |
These DataFrames when joined together on col_1 will have the following data distributed across the executors:
Task 1:
Receives: 5 rows of key_1 from df1
Receives: 4 rows of key_1 from df2
Total Input: 9 rows of data sent to task_1
Result: 5 * 4 = 20 rows of output data
Task 2:
Receives: 1 row of key_2 from df1
Receives: 1 row of key_2 from df2
Total Input: 2 rows of data sent to task_2
Result: 1 * 1 = 1 row of output data
Task 3:
Receives: 1 row of key_3 from df2
Total Input: 1 row of data sent to task_3
Result: 1 * 0 = 0 rows of output data (missed key; no matching key found in df1)
If you therefore look at the counts of input and output rows per task, you'll see that Task 1 has far more data than the others. This task is skewed.
Identification
The question now becomes how we identify that key_1 is the culprit of the skew since this isn't visible in Spark (the underlying engine powering the join).
If we look at the above example, we see that all we need to know is the actual counts per key of the joint column. This means we can:
Aggregate each side of the join on the joint key and count the rows per key
Multiply the counts of each side of the join to determine the output row counts
The easiest way to do this is by opening the Analysis (Contour) tool in Foundry and performing the following analysis:
Add df1 as input to a first path
Add Pivot Table board, using col_1 as the rows, and Row count as the aggregate
Click the ⇄ Switch to pivoted data button
Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df1_, resulting in an output from the path which is only df1_col_1 and df1_COUNT.
Add df2 as input to a second path
Add Pivot Table board, again using col_1 as the rows, and Row count as the aggregate
Click the ⇄ Switch to pivoted data button
Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df2_, resulting in an output from the path which is only df2_col_1 and df2_COUNT.
Create a third path, using the result of the first path (df1_col_1 and df1_COUNT)
Add a Join board, making the right side of the join the result of the second path (df2_col_1 and df2_COUNT). Ensure the join type is Full join
Add all columns from the right side (you don't need to add a prefix; all the columns are unique)
Configure the join board to join on df1_col_1 equals df2_col_1
Add an Expression board to create a new column, output_row_count which multiplies the two COUNT columns together
Add a Sort board that sorts on output_row_count descending
If you now preview the resultant data, you will have a sorted list of keys from both sides of the join that are causing the skew
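If you would rather do this check in code than in Contour, a minimal PySpark sketch of the same per-key count-and-multiply analysis could look like this (assuming df1 and df2 from the example above; names are illustrative):
import pyspark.sql.functions as F

# Count rows per join key on each side of the join
df1_counts = df1.groupBy("col_1").agg(F.count(F.lit(1)).alias("df1_count"))
df2_counts = df2.groupBy("col_1").agg(F.count(F.lit(1)).alias("df2_count"))

# Full outer join the per-key counts and estimate the output rows per key
skew_report = (
    df1_counts.join(df2_counts, on="col_1", how="full_outer")
    .withColumn(
        "output_row_count",
        F.coalesce(F.col("df1_count"), F.lit(0)) * F.coalesce(F.col("df2_count"), F.lit(0)),
    )
    .orderBy(F.col("output_row_count").desc())
)

The keys at the top of skew_report are the ones driving the skewed task.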

scala fast range lookup on 2 columns

I have a spark dataframe that I am broadcasting as Array[Array[String]].
My requirement is to do a range lookup on 2 columns.
Right now I have something like ->
val cols = data.filter(_(0).toLong <= ip).filter(_(1).toLong >= ip).take(1) match {
  case Array(t) => t
  case _ => Array()
}
The following data file is stored as an Array[Array[String]] (except for the header row, which is shown below only for reference) and passed to the filter function shown above.
sample data file ->
startIPInt | endIPInt | lat | lon
676211200 | 676211455 | 37.33053 | -121.83823
16777216 | 16777342 | -34.9210644736842 | 138.598709868421
17081712 | 17081712 | 0 | 0
sample value to search ->
ip = 676211325
Based on the startIPInt and endIPInt range that the ip falls into, I want the rest of the matching row.
This lookup takes 1-2 seconds per value, and I am not even sure the second filter condition is getting executed (in debug mode it only ever seems to execute the first condition). Can someone suggest a faster and more reliable lookup here?
Thanks!
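Since the broadcast data is just a local collection, one common speed-up is to sort it once by startIPInt and binary-search instead of scanning with two filters per lookup. The question is in Scala, but here is an illustrative sketch in Python (the language used elsewhere on this page), assuming the ranges do not overlap:
from bisect import bisect_right

# rows: the broadcast data as lists of strings [startIPInt, endIPInt, lat, lon]
rows = [
    ["676211200", "676211455", "37.33053", "-121.83823"],
    ["16777216", "16777342", "-34.9210644736842", "138.598709868421"],
    ["17081712", "17081712", "0", "0"],
]
rows.sort(key=lambda r: int(r[0]))          # sort once by startIPInt
starts = [int(r[0]) for r in rows]

def lookup(ip):
    # Last range whose start is <= ip; then verify the end of that range
    i = bisect_right(starts, ip) - 1
    if i >= 0 and int(rows[i][1]) >= ip:
        return rows[i]
    return []

print(lookup(676211325))  # matches the 676211200-676211455 row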

Reshaping RDD from an array of array to unique columns in pySpark

I want to use pySpark to restructure my data so that I can use it for MLlib models. Currently, for each user I have an array of arrays in one column, and I want to convert it into unique columns with the counts.
Users | column1 |
user1 | [[name1, 4], [name2, 5]] |
user2 | [[name1, 2], [name3, 1]] |
should get converted to:
Users | name1 | name2 | name3 |
user1 | 4.0 | 5.0 | 0.0 |
user2 | 2.0 | 0.0 | 1.0 |
I came up with a method that uses for loops but I am looking for a way that can utilize spark because the data is huge. Could you give me any hints? Thanks.
Edit:
All of the unique names should come as individual columns with the score corresponding to each user. Basically, a sparse matrix.
I am working with pandas right now and the code I'm using to do this is
import pandas as pd

data = data.applymap(lambda x: dict(x))  # Convert the array of arrays into a dictionary
columns = list(data)
for i in columns:
    # For each column, use the dictionary to make a new Series and append it to the current dataframe
    data = pd.concat([data.drop([i], axis=1), data[i].apply(pd.Series)], axis=1)
Figured out the answer,
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# First we explode column1; this makes each element a separate row
df = df.withColumn('column1', F.explode_outer(F.col('column1')))
# Then, separate the new column1 into two columns: the name and the count
df = df.withColumn("column1_separated", F.col('column1')[0])
df = df.withColumn("count", F.col('column1')[1].cast(IntegerType()))
# Then pivot the df
df = df.groupby('Users').pivot("column1_separated").sum('count')
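One small follow-up: the pivot leaves nulls for names a user never had, so a fill is needed to get the 0.0 values shown in the expected output (a sketch, same assumptions as above):
# Replace the nulls produced by the pivot with zeros
df = df.na.fill(0)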

How to preserve order of a DataFrame when writing it as CSV with partitioning by columns?

I sort the rows of a DataFrame and write it out to disk like so:
df.
orderBy("foo").
write.
partitionBy("bar", "moo").
option("compression", "gzip").
csv(outDir)
When I look into the generated .csv.gz files, their order is not preserved. Is this the way Spark does this? Is there a way to preserve order when writing a DF to disk with a partitioning?
Edit: To be more precise: it is not the order of the CSV files that is off, but the order inside them. Let's say I have the following after df.orderBy (for simplicity, I now only partition by one column):
foo | bar | baz
===============
1 | 1 | 1
1 | 2 | 2
1 | 1 | 3
2 | 3 | 4
2 | 1 | 5
3 | 2 | 6
3 | 3 | 7
3 | 1 | 8
4 | 2 | 9
4 | 1 | 10
I expect it to be like this, e.g. for files in folder bar=1:
part-00000-NNN.csv.gz:
1,1
1,3
2,5
part-00001-NNN.csv.gz:
3,8
4,10
But what it is like:
part-00000-NNN.csv.gz:
1,1
2,5
1,3
part-00001-NNN.csv.gz:
4,10
3,8
It's been a while but I witnessed this again. I finally came across a workaround.
Suppose, your schema is like:
time: bigint
channel: string
value: double
If you do:
df.sortBy("time").write.partitionBy("channel").csv("hdfs:///foo")
the timestamps in the individual part-* files get tossed around.
If you do:
df.sortBy("channel", "time").write.partitionBy("channel").csv("hdfs:///foo")
the order is correct.
I think it has to do with shuffling. So, as a workaround, I am now sorting first by the columns I want the data partitioned by, and then by the column I want it sorted by inside the individual files.
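For what it's worth, the same intent can be stated more explicitly with sortWithinPartitions (a sketch in PySpark, assuming the time/channel/value schema above, not tested against the original job):
# Co-locate each channel's rows, then sort by time within each partition
# before writing one folder per channel
(
    df.repartition("channel")
    .sortWithinPartitions("time")
    .write.partitionBy("channel")
    .csv("hdfs:///foo")
)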

Tag unique rows?

I have some data from different systems which can be joined only in a certain case because of different granularity between the data sets.
Given three columns:
call_date, login_id, customer_id
How can I efficiently 'flag' each row which has a unique value across those three columns? I didn't want to SELECT DISTINCT, because I do not know which of the rows actually matches up with the other. I want to know which records (combinations of columns) exist only once on a single date.
For example, if a customer called in 5 times on a single date and ordered a product, I do not know which of those specific call records ties back to the product order (lack of timestamps in the raw data). However, if a customer only called in once on a specific date and had a product order, I know for sure that the order ties back to that call record. (This is just an example - I am doing something similar across about 7 different tables from different source data).
timestamp | customer_id | login_name | score | unique
01/24/2017 18:58:11 | 441987 | abc123 | .25 | TRUE
03/31/2017 15:01:20 | 783356 | abc123 | 1 | FALSE
03/31/2017 16:51:32 | 783356 | abc123 | 0 | FALSE
call_date | customer_id | login_name | order | unique
01/24/2017 | 441987 | abc123 | 0 | TRUE
03/31/2017 | 783356 | abc123 | 1 | TRUE
In the above example, I would only want to join rows where the 'uniqueness' is True for both tables. So on 1/24, I know that there was no order for the call which had a score of 0.25.
To find whether the row (or some set of columns) is unique within the list of rows, you need to make use of PostgreSQL window functions.
SELECT *,
       (count(*) OVER (PARTITION BY b, c, d) = 1) AS unique_within_b_c_d_columns
FROM unnest(ARRAY[
    row(1, 2, 3, 1),
    row(2, 2, 3, 2),
    row(3, 2, 3, 2),
    row(4, 2, 3, 4)
]) AS t(a int, b int, c int, d int)
Output:
| a | b | c | d | unique_within_b_c_d_columns |
-----------------------------------------------
| 1 | 2 | 3 | 1 | true |
| 2 | 2 | 3 | 2 | false |
| 3 | 2 | 3 | 2 | false |
| 4 | 2 | 3 | 4 | true |
In the PARTITION BY clause you specify the list of columns you want to compare on. Note that in the example above, column a does not take part in the comparison.
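For completeness, the same flag can be computed with a window function in PySpark (a sketch, assuming a DataFrame df with the call_date, customer_id and login_name columns from the example):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# A count of 1 over the (call_date, customer_id, login_name) window
# means the combination is unique on that date
w = Window.partitionBy("call_date", "customer_id", "login_name")
df = df.withColumn("unique", F.count(F.lit(1)).over(w) == 1)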