PySpark higher-order SQL functions to create histograms from arrays?
I have a dataframe with a column of type array<sometype>.
More specifically, sometype = array<int>, but I suspect this is irrelevant to the challenge at hand.
I'd like to assign to each unique element in the array the number of times it occurs there, using only higher-order functions and in linear time. The resulting column could be a map or an array of structs; it doesn't matter.
For instance:
["val1", "val2", "val1", "val1", "val3", "val2", "val1"] --> { "val1": 4, "val3": 1, "val2": 2 }
I tried aggregate with map_concat (adding a single-element map with an incremented counter), but the latter turned out to produce a multimap rather than overwrite an existing element with a new value, which foiled the plan.
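Roughly, that attempt would look like the following sketch (a reconstruction, not the exact code; the column name arr and the array<string> element type are placeholders matching the example above):
# Reconstruction of the failed aggregate/map_concat attempt (illustrative only).
# map_concat does not replace an existing key, so repeated elements end up as
# duplicate keys (or, on newer Spark versions, a duplicate-key error) instead
# of an updated count.
from pyspark.sql import functions as F

df.withColumn(
    "histo",
    F.expr("""
        aggregate(
            arr,
            cast(map() as map<string, int>),
            (acc, x) -> map_concat(acc, map(x, coalesce(acc[x], 0) + 1))
        )
    """),
).show(truncate=False)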
Any other suggestions for how this can be done?
Thanks
Try something like this.
df.show() #sample dataframe
#+---------------+
#| array|
#+---------------+
#| [1, 9, 1]|
#|[2, 2, 2, 1, 2]|
#|[3, 4, 4, 1, 4]|
#| [1, 4]|
#| [99, 99, 100]|
#| [92, 11, 92]|
#| [0, 0, 1]|
#+---------------+
Transform with filter:
from pyspark.sql import functions as F

df.withColumn("count", F.expr("""
    map_from_arrays(
        array_distinct(array),
        transform(array_distinct(array), x -> size(filter(array, y -> y = x)))
    )""")).show(truncate=False)
#+---------------+------------------------+
#|array |count |
#+---------------+------------------------+
#|[1, 9, 1] |[1 -> 2, 9 -> 1] |
#|[2, 2, 2, 1, 2]|[2 -> 4, 1 -> 1] |
#|[3, 4, 4, 1, 4]|[3 -> 1, 4 -> 3, 1 -> 1]|
#|[1, 4] |[1 -> 1, 4 -> 1] |
#|[99, 99, 100] |[99 -> 2, 100 -> 1] |
#|[92, 11, 92] |[92 -> 2, 11 -> 1] |
#|[0, 0, 1] |[0 -> 2, 1 -> 1] |
#+---------------+------------------------+
Or, transform with aggregate:
from pyspark.sql import functions as F

df.withColumn("count", F.expr("""
    map_from_arrays(
        array_distinct(array),
        transform(array_distinct(array),
                  x -> aggregate(array, 0, (acc, t) -> acc + IF(t = x, 1, 0)))
    )""")).show(truncate=False)
#+---------------+------------------------+
#|array |count |
#+---------------+------------------------+
#|[1, 9, 1] |[1 -> 2, 9 -> 1] |
#|[2, 2, 2, 1, 2]|[2 -> 4, 1 -> 1] |
#|[3, 4, 4, 1, 4]|[3 -> 1, 4 -> 3, 1 -> 1]|
#|[1, 4] |[1 -> 1, 4 -> 1] |
#|[99, 99, 100] |[99 -> 2, 100 -> 1] |
#|[92, 11, 92] |[92 -> 2, 11 -> 1] |
#|[0, 0, 1] |[0 -> 2, 1 -> 1] |
#+---------------+------------------------+
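For completeness, the same filter-plus-transform idea can also be written with the higher-order column functions added to pyspark.sql.functions in Spark 3.1+, instead of a SQL expr (a sketch, not part of the original answer; it assumes the same sample column named array):
# Same logic as the expr() version above, using the PySpark 3.1+ higher-order helpers.
from pyspark.sql import functions as F

distinct = F.array_distinct("array")
counts = F.transform(distinct, lambda x: F.size(F.filter("array", lambda y: y == x)))
df.withColumn("count", F.map_from_arrays(distinct, counts)).show(truncate=False)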
UPDATE:
A different approach is possible if we know in advance exactly which elements we will be counting. This can parallelize better, since a struct behaves like a column and a struct of structs is essentially a dataframe nested inside another.
elements=[1,9,2,3,4,99,100,92,11,0]
df.show() #sample dataframe
#+---------------+
#| array|
#+---------------+
#| [1, 9, 1]|
#|[2, 2, 2, 1, 2]|
#|[3, 4, 4, 1, 4]|
#| [1, 4]|
#| [99, 99, 100]|
#| [92, 11, 92]|
#| [0, 0, 1]|
#+---------------+
from pyspark.sql import functions as F

(df.withColumn("struct", F.struct(*[F.struct(F.expr("size(filter(array, x -> x = {}))".format(y))).alias(str(y))
                                    for y in elements]))
   .select("array", F.map_from_arrays(F.array(*[F.lit(x) for x in elements]),
                                      F.array(*[F.col("struct.{}.col1".format(x)) for x in elements])).alias("count"))
   .show(truncate=False))
#+---------------+-------------------------------------------------------------------------------------+
#|array |count |
#+---------------+-------------------------------------------------------------------------------------+
#|[1, 9, 1] |[1 -> 2, 9 -> 1, 2 -> 0, 3 -> 0, 4 -> 0, 99 -> 0, 100 -> 0, 92 -> 0, 11 -> 0, 0 -> 0]|
#|[2, 2, 2, 1, 2]|[1 -> 1, 9 -> 0, 2 -> 4, 3 -> 0, 4 -> 0, 99 -> 0, 100 -> 0, 92 -> 0, 11 -> 0, 0 -> 0]|
#|[3, 4, 4, 1, 4]|[1 -> 1, 9 -> 0, 2 -> 0, 3 -> 1, 4 -> 3, 99 -> 0, 100 -> 0, 92 -> 0, 11 -> 0, 0 -> 0]|
#|[1, 4] |[1 -> 1, 9 -> 0, 2 -> 0, 3 -> 0, 4 -> 1, 99 -> 0, 100 -> 0, 92 -> 0, 11 -> 0, 0 -> 0]|
#|[99, 99, 100] |[1 -> 0, 9 -> 0, 2 -> 0, 3 -> 0, 4 -> 0, 99 -> 2, 100 -> 1, 92 -> 0, 11 -> 0, 0 -> 0]|
#|[92, 11, 92] |[1 -> 0, 9 -> 0, 2 -> 0, 3 -> 0, 4 -> 0, 99 -> 0, 100 -> 0, 92 -> 2, 11 -> 1, 0 -> 0]|
#|[0, 0, 1] |[1 -> 1, 9 -> 0, 2 -> 0, 3 -> 0, 4 -> 0, 99 -> 0, 100 -> 0, 92 -> 0, 11 -> 0, 0 -> 2]|
#+---------------+-------------------------------------------------------------------------------------+
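As an aside (my own assumption, not part of the original answer): on Spark 3.0+, the zero entries in that map could also be dropped directly with map_filter instead of the arrays_zip/filter combination used below. A minimal sketch, assigning the previous select to a variable first:
# Sketch: same struct-of-structs map as above, then drop the zero counts with
# map_filter (map_filter requires Spark 3.0+).
from pyspark.sql import functions as F

counted = (df.withColumn("struct", F.struct(*[F.struct(F.expr("size(filter(array, x -> x = {}))".format(y))).alias(str(y))
                                              for y in elements]))
             .select("array", F.map_from_arrays(F.array(*[F.lit(x) for x in elements]),
                                                F.array(*[F.col("struct.{}.col1".format(x)) for x in elements])).alias("count")))
counted.withColumn("count", F.expr("map_filter(count, (k, v) -> v != 0)")).show(truncate=False)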
You could also try this (to get, per row, only the elements that actually occur, as an array of structs):
elements=[1,9,2,3,4,99,100,92,11,0]
from pyspark.sql import functions as F

(df.withColumn("struct", F.struct(*[F.struct(F.expr("size(filter(array, x -> x = {}))".format(y))).alias(str(y))
                                    for y in elements]))
   .withColumn("vals", F.array(*[F.col("struct.{}.col1".format(x)) for x in elements]))
   .select("array", F.arrays_zip(F.array(*[F.lit(x) for x in elements]), F.col("vals")).alias("count"))
   .withColumn("count", F.expr("filter(count, x -> x.vals != 0)"))
   .show(truncate=False))
#+---------------+------------------------+
#|array |count |
#+---------------+------------------------+
#|[1, 9, 1] |[[1, 2], [9, 1]] |
#|[2, 2, 2, 1, 2]|[[1, 1], [2, 4]] |
#|[3, 4, 4, 1, 4]|[[1, 1], [3, 1], [4, 3]]|
#|[1, 4] |[[1, 1], [4, 1]] |
#|[99, 99, 100] |[[99, 2], [100, 1]] |
#|[92, 11, 92] |[[92, 2], [11, 1]] |
#|[0, 0, 1] |[[1, 1], [0, 2]] |
#+---------------+------------------------+
Or, you can use map_from_entries with the same structs logic:
elements=[1,9,2,3,4,99,100,92,11,0]
from pyspark.sql import functions as F

(df.withColumn("struct", F.struct(*[F.struct(F.expr("size(filter(array, x -> x = {}))".format(y))).alias(str(y))
                                    for y in elements]))
   .withColumn("vals", F.array(*[F.col("struct.{}.col1".format(x)) for x in elements]))
   .withColumn("elems", F.array(*[F.lit(x) for x in elements]))
   .withColumn("count", F.map_from_entries(F.expr("filter(arrays_zip(elems, vals), x -> x.vals != 0)")))
   .select("array", "count")
   .show(truncate=False))
#+---------------+------------------------+
#|array |count |
#+---------------+------------------------+
#|[1, 9, 1] |[1 -> 2, 9 -> 1] |
#|[2, 2, 2, 1, 2]|[1 -> 1, 2 -> 4] |
#|[3, 4, 4, 1, 4]|[1 -> 1, 3 -> 1, 4 -> 3]|
#|[1, 4] |[1 -> 1, 4 -> 1] |
#|[99, 99, 100] |[99 -> 2, 100 -> 1] |
#|[92, 11, 92] |[92 -> 2, 11 -> 1] |
#|[0, 0, 1] |[1 -> 1, 0 -> 2] |
#+---------------+------------------------+
While still hoping to see suggestions for linear-time solutions, I'll post here the (pretty ugly) version I'm currently using, which addresses the problem in O(N·log N):
1. Decide whether there's a need for non-trivial aggregation at all.
2. For the cases where aggregation does need to happen:
2.1. sort the list of elements (this is where most of the time is spent);
2.2. assign a special null value to elements that are about to be repeated, while recording their positions in the sorted list;
2.3. remove the entries with nulls and shrink the remaining lists;
2.4. from the differences in the position values of the remaining entries, derive how many repetitions of each n-gram there had been.
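For reference, the sample dataframe used below can be built like this (the column name mylist and the string element type are taken from the output shown; the exact original construction is an assumption):
# Assumed construction of the sample input shown further down.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
in_df = spark.createDataFrame(
    [(["B", "A", "A", "C", "B", "A"],), (["C", "D", "E"],), (["B"],), (["C", "C", "C"],)],
    ["mylist"],
)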
from pyspark.sql import functions as func

h_df = (
    in_df
    # 1. rows whose list has no repeats need no aggregation at all
    .withColumn('__isunique__', func.expr("size(mylist) == size(array_distinct(mylist))"))
    # 2.1. sort the list only where aggregation is actually needed
    .withColumn('__sorted__', func.when(func.col('__isunique__'), func.col('mylist'))
                                  .otherwise(func.expr("array_sort(mylist)")))
    # 2.2 / 2.3. keep, for each run of equal elements, the position just past its end
    .withColumn('__coded__', func.when(func.col('__isunique__'), func.lit(None))
                                 .otherwise(func.expr("""
        filter(transform(sequence(0, size(__sorted__)),
                         n -> named_struct('p', n, 'e',
                              IF((n > 0 and __sorted__[n] != __sorted__[n-1]) or n == size(__sorted__),
                                 __sorted__[n-1],
                                 null))),
               x -> isnotnull(x['e']))""")))
    .withColumn('__clen__', func.when(func.col('__isunique__'), 0).otherwise(func.size('__coded__')))
    # 2.4. run lengths are the differences between consecutive recorded positions
    .withColumn('histo', func.when(func.col('__isunique__'),
                                   func.expr("map_from_arrays(__sorted__, array_repeat(1, size(__sorted__)))"))
                             .otherwise(func.expr("""
        map_from_entries(transform(sequence(0, __clen__ - 1),
                         p -> struct(__coded__[p]['e'],
                                     IF(p == 0,
                                        __coded__[p]['p'],
                                        __coded__[p]['p'] - __coded__[p-1]['p']))))""")))
    .drop('__sorted__', '__coded__', '__clen__', '__isunique__')
)
h_df.show(10,False)
==> (the input dataframe, followed by the result)
+------------------+
| mylist|
+------------------+
|[B, A, A, C, B, A]|
| [C, D, E]|
| [B]|
| [C, C, C]|
+------------------+
+------------------+------------------------+
|mylist |histo |
+------------------+------------------------+
|[B, A, A, C, B, A]|[A -> 3, B -> 2, C -> 1]|
|[C, D, E] |[C -> 1, D -> 1, E -> 1]|
|[B] |[B -> 1] |
|[C, C, C] |[C -> 3] |
+------------------+------------------------+
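Not part of the approach itself, but a quick way to sanity-check the histograms is to compare them with a plain explode-and-count aggregation (a sketch; the row id and column names here are illustrative):
# Cross-check: per-row element counts via explode + groupBy.
from pyspark.sql import functions as func

check = (
    in_df.withColumn("rid", func.monotonically_increasing_id())
         .select("rid", func.explode("mylist").alias("e"))
         .groupBy("rid", "e")
         .count()
)
check.show(truncate=False)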