Limiting number of collect_list items in spark scala - scala

I have code that states
db.groupBy("ids")
.agg(collect_list("names") as "alias")
.select("ids", "alias")
However, there are some rows where the alias would be only 1 entry large, while others could be 200.
How can I limit the ones with >30 entries to a maximum of 30 entries?

You can use the slice function to subset the array.
db.groupBy("ids")
.agg(slice(collect_list("names"), 1, 30) as "alias")
.select("ids", "alias")

Related

Cast a list column into dummy columns in Python Polars?

I have a very large data frame where there is a column that is a list of numbers representing category membership.
Here is a dummy version
import pandas as pd
import numpy as np
segments = [str(i) for i in range(1_000)]
# My real data is ~500m rows
nums = np.random.choice(segments, (100_000,10))
df = pd.DataFrame({'segments': [','.join(n) for n in nums]})
userId
segments
0
885,106,49,138,295,254,26,460,0,844
1
908,709,454,966,151,922,666,886,65,708
2
664,713,272,241,301,498,630,834,702,289
3
60,880,906,471,437,383,878,369,556,876
4
817,183,365,171,23,484,934,476,273,230
...
...
Note that there is a known list of segments (0-999 in the example)
I want to cast this into dummy columns indicating membership to each segment.
I found a few ways of doing this:
In pandas:
df_one_hot_encoded = (df['segments']
.str.split(',')
.explode()
.reset_index()
.assign(__one__=1)
.pivot_table(index='index', columns='segments', values='__one__', fill_value=0)
)
(takes 8 seconds on a 100k row sample)
And polars
df2 = pl.from_pandas(df[['segments']])
df_ans = (df2
.with_columns([
pl.arange(0, len(df2)).alias('row_index'),
pl.col('segments').str.split(','),
pl.lit(1).alias('__one__')
])
.explode('segments')
.pivot(index='row_index', columns='segments', values='__one__')
.fill_null(0)
)
df_one_hot_encoded = df_ans.to_pandas()
(takes 1.5 seconds inclusive of the conversion to and from pandas, .9s without)
However, I hear .pivot is not efficient, and that it does not work well with lazy frames.
I tried other solutions in polars, but they were much slower:
_ = df2.lazy().with_columns(**{segment: pl.col('segments').str.contains(segment) for segment in segments}).collect()
(2 seconds)
(df2
.with_columns([
pl.arange(0, len(df2)).alias('row_index'),
pl.col('segments').str.split(',')
])
.explode('segments')
.to_dummies(columns=['segments'])
.groupby('row_index')
.sum()
)
(4 seconds)
Does anyone know a better solution than the .9s pivot?
This approach ends up being slower than the pivot but it's a got a different trick so I'll include it.
df2=pl.from_pandas(df)
df2_ans=(df2.with_row_count('userId').with_column(pl.col('segments').str.split(',')).explode('segments') \
.with_columns([pl.when(pl.col('segments')==pl.lit(str(i))).then(pl.lit(1,pl.Int32)).otherwise(pl.lit(0,pl.Int32)).alias(str(i)) for i in range(1000)]) \
.groupby('userId')).agg(pl.exclude('segments').sum())
df_one_hot_encoded = df2_ans.to_pandas()
A couple of other observations. I'm not sure if you checked the output of your str.contains method but I would think that wouldn't work because, for example, 15 is contained within 154 when looking at strings.
The other thing, which I guess is just a preference, is the with_row_count syntax vs the pl.arrange. I don't think the performance of either is better (at least not significantly so) but you don't have to reference the df name to get the len of it which is nice.
I tried a couple other things that were also worse including not doing the explode and just doing is_in but that was slower. I tried using bools instead of 1s and 0s and then aggregating with any but that was slower.

Improvement on Complex LoD Calculation

I'm trying to improve performance on this calculation which has multiple IF statements.
{Fixed [Name],[Region],[Adj Payor Code],[Test Category]:
IF ISNULL(SUM([Test Count YTD])) OR ISNULL(SUM([Test Count LYTD]))
OR SUM([NRR YTD])<=0 OR SUM([NRR LYTD])<=0 OR SUM([Test Count YTD])<1 OR SUM([Test Count LYTD])<1
THEN 0 ELSE
[Delta AWR]*SUM([Test Count YTD])END
}
I believe that some of the value comparisons because they are doubles could be improved.
All values only need up to two decimal places and are currently formatted as doubles.

PySpark - Split the combined key

I have a RDD that is structured in this format:
(MAC_address, dst_ip_address, 1)
Here, 1 means the machine with the MAC_address has accessed the dst_ip_address once. I need to count how many times a specific machine with MAC_address has reached a specific dst_ip_address.
I created a rdd with a combined MAC_address and dst_ip_address as key, and applied reduceByKey to count the times.
def processJson(data):
return ((MAC_address, dst_ip_address), 1)
def countreducer(a,b):
return a+b
tt = df.map(processJson).reduceByKey(countreducer)
I am able to get a RDD ((MAC_address, dst_ip_address), 52)
I need to write the RDD into a Json format like this:
MAC_address_1:
[dst_ip_1: 52],
[dst_ip_2: 38]
MAC_address_2:
[dst_ip_1: 12]
My intuition is to split the combined key first but there is no function to flat a combined key. Thus, I wonder whether the above approach is on the right track.

Spark - How to take logarithm of every value minus smallest element in a column

I want to take logarithm of every value subtracted by the smallest element in a column. For example, if a have a column like:
score: 1000, 500, 1200, 300
Then I want:
logged_score: log(700), log(200), log(900), log(0)
I tried with this in Spark data frame:
.select(log($"score" - min($"score")).alias("logged_score"))
But I got this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
grouping expressions sequence is empty, and 'score' is not an
aggregate function. Wrap '(LOG((score - min(score))) AS
logged_score)' in windowing function(s) or wrap 'score' in
first() (or first_value) if you don't care which value you get.;;
The easiest way to overcome this is by getting the min($"score") by collecting it before taking the log value. However, I am trying to avoid doing collect here if there is any better solution.
You can simply do the following
import org.apache.spark.sql.functions._
val minimumValue = df.select(min("score")).first()(0)
df.withColumn("logged_score", log($"score" - lit(minimumValue))).na.fill(0).show()

Using SUM and UNIQUE to count occurrences of value within subset of a matrix

So, presume a matrix like so:
20 2
20 2
30 2
30 1
40 1
40 1
I want to count the number of times 1 occurs for each unique value of column 1. I could do this the long way by [sum(x(1:2,2)==1)] for each value, but I think this would be the perfect use for the UNIQUE function. How could I fix it so that I could get an output like this:
20 0
30 1
40 2
Sorry if the solution seems obvious, my grasp of loops is very poor.
Indeed unique is a good option:
u=unique(x(:,1))
res=arrayfun(#(y)length(x(x(:,1)==y & x(:,2)==1)),u)
Taking apart that last line:
arrayfun(fun,array) applies fun to each element in the array, and puts it in a new array, which it returns.
This function is the function #(y)length(x(x(:,1)==y & x(:,2)==1)) which finds the length of the portion of x where the condition x(:,1)==y & x(:,2)==1) holds (called logical indexing). So for each of the unique elements, it finds the row in X where the first is the unique element, and the second is one.
Try this (as specified in this answer):
>>> [c,~,d] = unique(a(a(:,2)==1))
c =
30
40
d =
1
3
>>> counts = accumarray(d(:),1,[],#sum)
counts =
1
2
>>> res = [c,counts]
Consider you have an array of various integers in 'array'
the tabulate function will sort the unique values and count the occurances.
table = tabulate(array)
look for your unique counts in col 2 of table.