Using a reduction ufunc in agg - python-polars

How do I use a ufunc that reduces to a scalar in the context of aggregation? For example, summarizing a table using numpy.trapz:
import polars as pl
import numpy as np
df = pl.DataFrame(dict(id=[0, 0, 0, 1, 1, 1], t=[2, 4, 5, 10, 11, 14], y=[0, 1, 1, 2, 3, 4]))
df.groupby('id').agg(pl.map(['t', 'y'], np.trapz))
# Segmentation fault (core dumped)

Edit: as of Polars 0.13.18, the apply method converts Numpy datatypes to Polars datatypes without requiring the Numpy item method.
Use apply in a groupby context (rather than map).
In this case, the numpy trapz function takes only one positional parameter (y):
numpy.trapz(y, x=None, dx=1.0, axis=-1)
So, we'll need to specify the x keyword parameter explicitly in our call. (I also assumed that you meant for your y column to be mapped as the y parameter, and your t column to be mapped as the x parameter in the call to numpy.)
The Series 'y' and 't' will be passed as a list of Series to the lambda function, so we'll use indices to indicate which column maps to which numpy parameter.
One additional wrinkle: numpy returns a value of type numpy.float64, rather than a Python float.
type(np.trapz([0, 1, 1], x=[2, 4, 5]))
<class 'numpy.float64'>
Presently, the apply function in Polars will not automatically convert a numpy.float64 to polars.Float64. To remedy this, we'll use the numpy item method to have numpy return a Python float, rather than a numpy.float64.
type(np.trapz([0, 1, 1], x=[2, 4, 5]).item())
<class 'float'>
With this in hand, we can now write our apply statement.
df.groupby("id").agg(
    pl.apply(
        ["y", "t"],
        lambda lst: np.trapz(y=lst[0], x=lst[1]).item()
    )
)
shape: (2, 2)
┌─────┬──────┐
│ id ┆ y │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════╡
│ 1 ┆ 13.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0 ┆ 2.0 │
└─────┴──────┘
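Per the Edit above, on Polars 0.13.18 or later the .item() call should no longer be needed, since apply converts the numpy.float64 automatically. A minimal sketch under that assumption:
df.groupby("id").agg(
    pl.apply(
        ["y", "t"],
        lambda lst: np.trapz(y=lst[0], x=lst[1])  # numpy.float64 converted automatically on >= 0.13.18
    )
)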

Related

TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector

I have a dataframe with multiple rows that look like this:
df.head() gives:
Row(features=DenseVector([1.02, 4.23, 4.534, 0.342]))
Now I want to compute the columnSimilarities() on my dataframe, and I do the following:
rdd2 = df.rdd
mat = RowMatrix(rdd2)
sims = mat.columnSimilarities()
However, I get the following error:
File "/opt/apache-spark/spark-3.2.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py", line 67, in _convert_to_vector
raise TypeError("Cannot convert type %s into Vector" % type(l))
TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
Can someone help me with this?
Thanks!
The present rdd form is:
Row(features=DenseVector([1.02, 4.23, 4.534, 0.342]))
As per the example in the official documentation, it will work if we get it in the form:
[DenseVector([1.02, 4.23, 4.534, 0.342])]
Construct the RowMatrix as:
RowMatrix(df.rdd.map(list))
Here is a full example which reproduces and fixes your problem:
df = spark.createDataFrame(data=[([1.02, 4.23, 4.534, 0.342],)], schema=["features"])
from pyspark.sql.functions import udf
from pyspark.ml.linalg import VectorUDT
from pyspark.mllib.linalg.distributed import RowMatrix

@udf(returnType=VectorUDT())
def arrayToVector(arrCol):
    from pyspark.ml.linalg import Vectors
    return Vectors.dense(arrCol)

df = df.withColumn("features", arrayToVector("features"))
# print(df.head())
# df.printSchema()
# mat = RowMatrix(df.rdd) # Causes TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
mat = RowMatrix(df.rdd.map(list))
sims = mat.columnSimilarities()
print(sims.entries.collect())
[Out]:
[MatrixEntry(2, 3, 1.0), MatrixEntry(0, 1, 1.0), MatrixEntry(1, 2, 1.0), MatrixEntry(0, 3, 1.0), MatrixEntry(1, 3, 1.0), MatrixEntry(0, 2, 1.0)]

Transform maptype in Pyspark

I have a pyspark dataframe with 500k rows, each row has a maptype with 10k (key, value) items. The keys are the same for each row, e.g., k0, k1, ..., k9999.
What I want is to run some interpolation on the 10k values for each row and get a percentile (e.g., 50%). It seems there are two ways to do this:
First explode the maptype to columns, then do the interpolation
Run the interpolation on the maptype, then explode to columns to get the statistics
I have used pandas for some time but am quite new to Pyspark. I'd very much appreciate it if you could shed some light on:
Whether I should explode the maptype first
How to do the interpolation (either on the maptype or the columns). This seems to be an easy task with numpy, but I am not sure how to do the comprehension of the maptype/columns with pyspark.
The following is a simple example
What I have
from pyspark.sql.functions import map_values
df = spark.sql("SELECT map('a', 1, 'b', 3, 'c', 2) as data")
df.show(20, False)
+------------------------+
|data |
+------------------------+
|[a -> 1, b -> 3, c -> 2]|
+------------------------+
What I want is to call the interp1d function to get result/median (see below) for the maptype values [1, 3, 2].
import numpy as np
from scipy.interpolate import interp1d
x = (np.linspace(0, 5, 11), np.linspace(0, 5, 11)**2)
f = interp1d(x[0], x[1], kind = 'linear', fill_value ='extrapolate', assume_sorted = False )
result = f([1,3,2])
median = np.percentile(result, 50)
print(f'result: {result}\nmedian: {median}')
result: [1. 9. 4.]
median: 4.0
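One possible approach (a minimal sketch of my own, not from the original post): keep the maptype, extract its values with map_values, and run the scipy interpolation inside a Python UDF. The function name interp_median and the hard-coded interpolation grid are illustrative assumptions.
import numpy as np
from scipy.interpolate import interp1d
from pyspark.sql.functions import map_values, udf
from pyspark.sql.types import DoubleType

@udf(returnType=DoubleType())
def interp_median(vals):
    # Rebuild the interpolator on the worker and take the 50th percentile of the interpolated values
    x = np.linspace(0, 5, 11)
    f = interp1d(x, x**2, kind='linear', fill_value='extrapolate', assume_sorted=False)
    return float(np.percentile(f(vals), 50))

df = spark.sql("SELECT map('a', 1, 'b', 3, 'c', 2) as data")
df.withColumn("median", interp_median(map_values("data"))).show()
# Expected: median = 4.0 for the values [1, 3, 2]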

Calculating kurtosis of an array[Double] field in Spark Scala

How do I calculate the kurtosis of an array field in Spark?
The Spark built-in function fails on an array field:
due to data type mismatch: argument 1 requires double type, however, 'SERIES' is of array<double> type.;;
Example in python
from scipy.stats import kurtosis
kurtosis([1, 2, 3, 4, 5])
-1.3
I used the Spark built-in function:
df.withColumn("newcolumn", when(col("SERIES").isNotNull, kurtosis(columnName)))
Using the Twitter Algebird package, I can get the kurtosis value:
import com.twitter.algebird._

val y = List(1, 2, 3, 4, 5)

def getMoments(xs: List[Int]): Moments =
  xs.foldLeft(MomentsGroup.zero) { (m, x) =>
    MomentsGroup.plus(m, Moments(x))
  }

println(getMoments(y).kurtosis) // -1.3
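If a PySpark route is acceptable, here is a minimal sketch of my own (an alternative, not the Algebird solution above) that applies scipy.stats.kurtosis through a UDF; the DataFrame construction and the column name SERIES are assumptions for illustration.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType
from scipy.stats import kurtosis

@udf(returnType=DoubleType())
def kurtosis_udf(xs):
    # Return None for null/empty arrays, otherwise the (Fisher) kurtosis of the array
    return float(kurtosis(xs)) if xs else None

df = spark.createDataFrame([([1.0, 2.0, 3.0, 4.0, 5.0],)], ["SERIES"])
df.withColumn("newcolumn", kurtosis_udf(col("SERIES"))).show()
# newcolumn should come out as -1.3 for this array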

How to compute edges between nodes v, w that are pointed to by the same node x

This question is about Spark GraphX. Given an arbitrary graph, I want to compute a new graph that adds edges between any two nodes v, w that are both pointed to by some node x. The new edges should contain the pointing node as an attribute.
That is, given edges (x, v, nil) and (x, w, nil) compute edges (v, w, x) and (w, v, x).
It should work for any graph and not require me to know anything about the graph beforehand, such as vertex ids.
Example
[Task] Add two directed edges between nodes (e.g. A, C) when they are pointed to by the same node (e.g. B).
Input graph:
┌────┐
┌─────│ B │──────┐
│ └────┘ │
v v
┌────┐ ┌────┐
│ A │ │ C │
└────┘ └────┘
^ ^
│ ┌────┐ │
└─────│ D │──────┘
└────┘
Output graph (bi-directional edges = two directed edges):
┌────┐
┌─────│ B │──────┐
│ └────┘ │
v v
┌────┐<───by B───>┌────┐
│ A │ │ C │
└────┘<───by D───>└────┘
^ ^
│ ┌────┐ │
└─────│ D │──────┘
└────┘
How to elegantly write a GraphX query that returns the output graph?
Here is a solution that uses pregel and aggregateMessages:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Step 0: Create an input graph.
val nodes =
  sc.parallelize(Array(
    (101L, "A"), (102L, "A"), (201L, "B"), (202L, "B")
  ))
val edges =
  sc.parallelize(Array(
    Edge(201L, 101L, ("B-to-A", 0L)), Edge(201L, 102L, ("B-to-A", 0L)),
    Edge(202L, 101L, ("B-to-A", 0L)), Edge(202L, 102L, ("B-to-A", 0L))
  ))
val graph = Graph(nodes, edges, "default")

// Step 1: Transform input graph before running pregel.
val initialGraph = graph.mapVertices((id, _) => Set[(VertexId, VertexId)]())

// Step 2: Send downstream vertex IDs (A's) to upstream vertices (B's)
val round1 = initialGraph.pregel(
  initialMsg = Set[(VertexId, VertexId)](),
  maxIterations = 1,
  activeDirection = EdgeDirection.In)(
  (id, old, msg) => old.union(msg),
  triplet => Iterator((triplet.srcId, Set((triplet.dstId, triplet.srcId)))),
  (a, b) => a.union(b)
)

// Step 3: Send (gathered) IDs back to downstream vertices
val round2 = round1.aggregateMessages[Set[(VertexId, VertexId)]](
  triplet => {
    triplet.sendToDst(triplet.srcAttr)
  },
  (a, b) => a.union(b)
)

// Step 4: Transform vertices to edges
val newEdges = round2.flatMap { v =>
  v._2.filter(w => w._1 != v._1).map(w => Edge(v._1, w._1, ("shares-with", w._2)))
}

// Step 5: Create a new graph that contains new edges
val newGraph = Graph(graph.vertices, graph.edges.union(newEdges))

// Step 6: Print graph to verify result
newGraph.triplets foreach println
This solution uses three main steps to compute a graph with the new edges: 1) a round of pregel, 2) a round of aggregateMessages, and 3) a round of mapping nodes to edges.

Computing F-measure for clustering

Can anyone help me calculate the F-measure collectively? I know how to calculate recall and precision, but I don't know how to calculate one overall F-measure value for a given algorithm.
As an example, suppose my algorithm creates m clusters, but I know there are n clusters for the same data (as created by another benchmark algorithm).
I found one pdf ("F Measure explained"), but it is not useful since the collective value I got is greater than 1. Specifically, I have read research papers in which the authors compare two algorithms on the basis of F-measure and get collective values between 0 and 1.
If you read the pdf mentioned above carefully, the formula is F(C, K) = Σ_i (|c_i| / N) * max_j F(c_i, k_j),
where c_i is a reference cluster and k_j is a cluster created by the other algorithm, with i running from 1 to n and j from 1 to m. Say |c_1| = 218, and, as per the pdf, N = m * n with m = 12 and n = 10, and the maximum F(c_1, k_j) is attained at j = 2. Certainly F(c_1, k_2) is between 0 and 1, but the value calculated by the above formula comes out above 1.
The term f-measure itself is underspecified. It's the harmonic mean, usually of precision and recall. Actually you should even say F1-score if you mean the unweighted version, because you can put different weight on the two input values. But without saying which two values are averaged (not in the sense of the arithmetic mean!) this doesn't say much.
https://en.wikipedia.org/wiki/F1_score
Note that the values must be in the 0-1 value range. Otherwise, you have an error earlier on.
In cluster analysis, the common approach is to apply the F1-Measure to the precision and recall of pairs, often referred to as "pair counting f-measure". But you could compute the same mean on other values, too.
Pair-counting has the nice property that it doesn't directly compare clusters, so the result is well defined when one result has m clusters and the other has n clusters. However, pair counting needs strict partitions. When elements are not clustered or are assigned to more than one cluster, the pair-counting measures can easily go out of the range 0-1.
E. Achtert, S. Goldhofer, H.-P. Kriegel, E. Schubert, A. Zimek: Evaluation of Clusterings - Metrics and Visual Support. Int. Conf. Data Engineering (ICDE 2012). http://www.computer.org/portal/web/csdl/doi/10.1109/ICDE.2012.128
Discusses some of these metrics (including the Rand index and such) and gives a simple explanation of the "pair counting F-measure".
The paper Characterization and evaluation of similarity measures for pairs of clusterings by Darius Pfitzner, Richard Leibbrandt and David Powers contains a lot of useful information regarding this subject, including the following example:
Given the set,
D = {1, 2, 3, 4, 5, 6}
and the partitions,
P = {1, 2, 3}, {4, 5}, {6}, and
Q = {1, 2, 4}, {3, 5, 6}
where P is the set created by our algorithm and Q is the set created by the standard algorithm we know,
PairsP = {(1, 2), (1, 3), (2, 3), (4, 5)},
PairsQ = {(1, 2), (1, 4), (2, 4), (3, 5), (3, 6), (5, 6)}, and
PairsD = {(1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 3), (2, 4),
(2, 5), (2, 6), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)}
so,
a = | PairsP intersection PairsQ | = |(1, 2)| = 1
b = | PairsP- PairsQ | = |(1, 3)(2, 3)(4, 5)| = 3
c = | PairsQ- PairsP | = |(1, 4)(2, 4)(3, 5)(3, 6)(5, 6)| = 5
F-measure = 2a/(2a+b+c) = 2/(2+3+5) = 0.2
Note: There is an error in the publication on page 364 where a, b, c, and d are computed and the result of b and c are actually switched incorrectly. This switch would throw off the results of some other measures. Obviously, the F-measure is unaffected.
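For concreteness, here is a small sketch of my own that reproduces the a, b, c and F-measure numbers above from the partitions P and Q:
from itertools import combinations

def pairs(partition):
    # All unordered same-cluster pairs of a clustering given as a list of sets
    return {frozenset(p) for cluster in partition for p in combinations(sorted(cluster), 2)}

P = [{1, 2, 3}, {4, 5}, {6}]
Q = [{1, 2, 4}, {3, 5, 6}]

pairs_p, pairs_q = pairs(P), pairs(Q)
a = len(pairs_p & pairs_q)       # 1
b = len(pairs_p - pairs_q)       # 3
c = len(pairs_q - pairs_p)       # 5
print(2 * a / (2 * a + b + c))   # 0.2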
The N in your formula, F(C, K) = Σ_i (|c_i| / N) * max_j F(c_i, k_j), is the sum of |c_i| over all i, i.e. it is the total number of elements. You are perhaps mistaking it for the number of clusters and therefore getting an answer greater than one. If you make that change, your answer will be between 0 and 1.