Generating an HTML table from a polars DataFrame (row iterator) - python-polars

I want to generate an HTML table report (using a template engine) from a polars DataFrame (something like streamlit does).
But the iterator on a polars DataFrame is column-wise, and I haven't found any API to iterate over the rows of a polars DataFrame (other than using something like .to_dicts()).
With pandas I usually do for row in df.itertuples(): .... This works well because I can access the column values by name using row.col.
I can think of two options:
Use pl.DataFrame(...).to_pandas().itertuples()
Write a custom iterator class
Is there a way to iterate over a polars DataFrame row-wise without having to convert it or write a custom iterator class?

Getting an HTML table
You can utilize the HTML table that polars creates for rendering in Jupyter notebooks.
import polars as pl

df = pl.DataFrame({
    "a": [1, 2, None],
    "b": [1, None, None],
})
df._repr_html_()
<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n\n .dataframe td {\n white-space: pre;\n }\n\n .dataframe td {\n padding-top: 0;\n }\n\n .dataframe td {\n padding-bottom: 0;\n }\n\n .dataframe td {\n line-height: 95%;\n }\n</style>\n<table border="1" class="dataframe">\n<small>shape: (3, 2)</small>\n<thead>\n<tr>\n<th>\na\n</th>\n<th>\nb\n</th>\n</tr>\n<tr>\n<td>\ni64\n</td>\n<td>\ni64\n</td>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>\n1\n</td>\n<td>\n1\n</td>\n</tr>\n<tr>\n<td>\n2\n</td>\n<td>\nnull\n</td>\n</tr>\n<tr>\n<td>\nnull\n</td>\n<td>\nnull\n</td>\n</tr>\n</tbody>\n</table>\n</div>
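If all you need is the table markup, that fragment can be dropped straight into a report page. A minimal sketch, assuming a plain f-string wrapper is enough (the page skeleton and the file name report.html are just illustrative, not part of polars):

import polars as pl

df = pl.DataFrame({
    "a": [1, 2, None],
    "b": [1, None, None],
})

# _repr_html_() returns the same table fragment polars renders in a notebook;
# wrap it in a bare-bones page and write it to disk.
table_html = df._repr_html_()
page = f"<html><body><h1>Report</h1>{table_html}</body></html>"

with open("report.html", "w") as fh:
    fh.write(page)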
Iterating rows
If you want an iterator over rows, it is very easy to create one in Python.
import polars as pl
from typing import Any, Iterator, Tuple

df = pl.DataFrame({
    "a": [1, 2, None],
    "b": [1, None, None],
})

def iterrows(df: pl.DataFrame) -> Iterator[Tuple[Any, ...]]:
    for i in range(df.height):
        yield df.row(i)
gen = iterrows(df)
next(gen) # returns (1, 1)
next(gen) # returns (2, None)

Adding to ritchie46's answer:
import polars as pl
from collections import namedtuple

def itertuples(df):
    tup = namedtuple('PolarsRow', df.columns)
    for i in range(df.height):
        yield tup(*df.row(i))

pl.DataFrame.itertuples = itertuples

df = pl.DataFrame(...)
for tup in df.itertuples():
    print(tup)
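And to connect this back to the original goal, those namedtuple rows drop straight into a template engine. A minimal sketch, assuming jinja2 as the template engine and reusing the example columns a and b (the template text itself is just illustrative):

import polars as pl
from jinja2 import Template

template = Template("""
<table>
  <tr><th>a</th><th>b</th></tr>
  {% for row in rows %}
  <tr><td>{{ row.a }}</td><td>{{ row.b }}</td></tr>
  {% endfor %}
</table>
""")

df = pl.DataFrame({"a": [1, 2, None], "b": [1, None, None]})
# itertuples here is the monkey-patched helper defined just above
html = template.render(rows=df.itertuples())
print(html)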

Related

Spark groupBy X then sortBy Y then get topK

case class Tomato(name:String, rank:Int)
case class Potato(..)
I have Spark 2.4 and a Dataset[(Tomato, Potato)] that I want to groupBy name and get the topK ranks from.
The issue is that groupBy produces an iterator, which is not sortable, and iterator.toList explodes on large datasets.
Iterator solution:
data.groupByKey{ case (tomato, _) => tomato.name }
  .flatMapGroups((k, it) => it.toList.sortBy(_.rank).take(topK))
I've also tried aggregation functions, but I could not find a topK or firstK, only first and last.
Another thing I hate about aggregation functions is that they convert the dataset to a dataframe (yuck) so all the types are gone.
Aggregation Fn solution syntax made up by me:
data.agg(row_number.over(Window.partitionBy("_1.name").orderBy("_1.rank").take(topK))
There are already several questions on SO that ask for groupBy then some other operation, but none want to sort by a key different from the groupBy key and then get the topK.
You could go the iterator route without having to create a full list, which indeed explodes with big datasets. Something like:
import spark.implicits._
import scala.util.Sorting

case class Tomato(name: String, rank: Int)
case class Potato(taste: String)
case class MyClass(tomato: Tomato, potato: Potato)

val ordering = Ordering.by[MyClass, Int](_.tomato.rank)

val ds = Seq(
  MyClass(Tomato("tomato1", 1), Potato("tasty")),
  MyClass(Tomato("tomato1", 2), Potato("tastier")),
  MyClass(Tomato("tomato2", 2), Potato("tastiest")),
  MyClass(Tomato("tomato3", 2), Potato("yum")),
  MyClass(Tomato("tomato3", 4), Potato("yummier")),
  MyClass(Tomato("tomato3", 50), Potato("yummiest")),
  MyClass(Tomato("tomato7", 50), Potato("yam"))
).toDS

val k = 2

val output = ds
  .groupByKey {
    case MyClass(tomato, potato) => tomato.name
  }
  .mapGroups(
    (name, iterator) => {
      val topK = iterator.foldLeft(Seq.empty[MyClass]) {
        (accumulator, element) => {
          val newAccumulator = accumulator :+ element
          if (newAccumulator.length > k)
            newAccumulator.sorted(ordering).drop(1)
          else
            newAccumulator
        }
      }
      (name, topK)
    }
  )
output.show(false)
+-------+--------------------------------------------------------+
|_1 |_2 |
+-------+--------------------------------------------------------+
|tomato7|[[[tomato7, 50], [yam]]] |
|tomato2|[[[tomato2, 2], [tastiest]]] |
|tomato1|[[[tomato1, 1], [tasty]], [[tomato1, 2], [tastier]]] |
|tomato3|[[[tomato3, 4], [yummier]], [[tomato3, 50], [yummiest]]]|
+-------+--------------------------------------------------------+
So as you see, for each Tomato.name key, we're keeping the k elements with the largest Tomato.rank values. You get a Dataset[(String, Seq[MyClass])] as a result.
This is not really optimized for performance: for each group, we're iterating over all of its elements and sorting the sequence, which could become quite computationally intensive. But this all depends on the size of your actual case classes, the size of your data, your requirements, ...
Hope this helps!
Issue is that groupBy produces an iterator which is not sortable and iterator.toList explodes on large datasets.
What you could do is to come up with a topK() method that takes parameters k, Iterator[A] and an A => B mapping, and returns an Iterator[A] of the top k elements (sorted by value of type B) -- all without having to sort the entire iterator:
def topK[A, B : Ordering](k: Int, iter: Iterator[A], f: A => B): Iterator[A] = {
  val orderer = implicitly[Ordering[B]]
  import orderer._
  val listK = iter.take(k).toList
  iter.foldLeft(listK.sortWith(f(_) > f(_))){ (lsK, x) =>
    if (f(x) < f(lsK.head))
      (x :: lsK.tail).sortWith(f(_) > f(_))
    else
      lsK
  }.reverse.iterator
}
Note that topK() only involves iterative sorting of lists of size k, with the assumption that k is small compared with the size of the input iterator. If necessary, it could be further optimized to eliminate the sorting of the k-element lists by only making the first element the largest while leaving the rest of the list unsorted.
Using your groupByKey approach, method topK() can be plugged into flatMapGroups as shown below:
case class T(name: String, rank: Int)
case class P(name: String, rank: Int)
val ds = Seq(
  (T("t1", 4), P("p1", 1)),
  (T("t1", 5), P("p2", 2)),
  (T("t1", 1), P("p3", 3)),
  (T("t1", 3), P("p4", 4)),
  (T("t1", 2), P("p5", 5)),
  (T("t2", 4), P("p6", 6)),
  (T("t2", 2), P("p7", 7)),
  (T("t2", 6), P("p8", 8))
).toDF("tomato", "potato").as[(T, P)]

val k = 3

ds.
  groupByKey{ case (tomato, _) => tomato.name }.
  flatMapGroups((_, it) => topK[(T, P), Int](k, it, { case (t, p) => t.rank })).
  show
/*
+-------+-------+
| _1| _2|
+-------+-------+
|{t1, 1}|{p3, 3}|
|{t1, 2}|{p5, 5}|
|{t1, 3}|{p4, 4}|
|{t2, 2}|{p7, 7}|
|{t2, 4}|{p6, 6}|
|{t2, 6}|{p8, 8}|
+-------+-------+
*/

Transform MapType in PySpark

I have a PySpark DataFrame with 500k rows; each row has a MapType column with 10k (key, value) items. The keys are the same for each row, e.g., k0, k1, ..., k9999.
What I want is to run some interpolation on the 10k values for each row and get a percentile (e.g., 50%). It seems there are two ways to do this:
First explode the MapType to columns, then do the interpolation
Run the interpolation on the MapType, then explode to columns to get the statistics
I have used pandas for some time but am quite new to PySpark. I'd very much appreciate it if you could shed some light on:
Whether I should explode the MapType first
How to do the interpolation (either on the MapType or the columns). This seems to be an easy task with numpy, but I am not sure how to do the comprehension of the MapType/columns with PySpark
The following is a simple example.
What I have:
from pyspark.sql.functions import map_values
df = spark.sql("SELECT map('a', 1, 'b', 3, 'c', 2) as data")
df.show(20, False)
+------------------------+
|data |
+------------------------+
|[a -> 1, b -> 3, c -> 2]|
+------------------------+
What I want is to call the interp1d function to get the result/median (see below) for the MapType values [1, 3, 2].
import numpy as np
from scipy.interpolate import interp1d
x = (np.linspace(0, 5, 11), np.linspace(0, 5, 11)**2)
f = interp1d(x[0], x[1], kind = 'linear', fill_value ='extrapolate', assume_sorted = False )
result = f([1,3,2])
median = np.percentile(result, 50)
print(f'result: {result}\nmedian: {median}')
result: [1. 9. 4.]
median: 4.0
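One way to avoid exploding first is a plain Python UDF over the map column: a MapType value arrives in the UDF as a Python dict, so the interp1d code above can be reused per row. A minimal sketch, assuming the interpolation grid is fixed up front, a SparkSession named spark is in scope (as in the question), and reusing the example column data; the UDF name is illustrative, and for 500k rows a pandas UDF would likely be faster:

import numpy as np
from scipy.interpolate import interp1d
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

grid_x = np.linspace(0, 5, 11)
grid_y = grid_x ** 2

@F.udf(DoubleType())
def interp_median(data):
    # data is the MapType value, handed to the UDF as a Python dict
    f = interp1d(grid_x, grid_y, kind='linear', fill_value='extrapolate', assume_sorted=False)
    result = f(list(data.values()))
    return float(np.percentile(result, 50))

df = spark.sql("SELECT map('a', 1, 'b', 3, 'c', 2) as data")
df.withColumn("median", interp_median("data")).show(truncate=False)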

PySpark -- Remove all-alphabetic elements from an array

I have an array column like [abc, 123, ab12] in my df and would like to remove the elements that are purely alphabetic, so the output will be [123, ab12] for this example. Are there any built-ins to avoid using a udf?
Thank you guys!
You can filter with an appropriate regex:
import pyspark.sql.functions as F
df2 = df.withColumn('arr', F.expr("filter(arr, x -> x not rlike '^[a-z]*$')"))
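A quick check of that expression against the example in the question (the column name arr is assumed, a SparkSession named spark is in scope, and the filter higher-order function needs Spark 2.4+; if mixed case is possible, use '^[A-Za-z]*$' instead):

import pyspark.sql.functions as F

df = spark.createDataFrame([(["abc", "123", "ab12"],)], ["arr"])
df2 = df.withColumn('arr', F.expr("filter(arr, x -> x not rlike '^[a-z]*$')"))
df2.show()
# the purely alphabetic 'abc' is dropped, leaving [123, ab12]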

How to split a Sequence in Scala

I'm using the Play framework with Scala and I want to split a Seq[String] into subsequences.
I return a Seq[String] from a SQL query which contains colors and seasons; it looks like this:
spring; summer; autumn; winter, red; green; blu
The seasons and colours are separated by a comma, and I want to split that sequence to get 2 subsequences, one with the seasons and the other with the colors.
I've tried with:
val subsequence = sequence.split(",")
But it doesn't work and returns this error: value split is not a member of Seq[String]
So what can I do?
Assuming your sequence is a sequence containing one string:
val sequence = Seq("spring; summer; autumn; winter, red; green; blu")
val split = sequence.flatMap(_.split(","))
// => split: Seq[String] = List(spring; summer; autumn; winter, " red; green; blu")
Try grouping,
val xs = Seq("spring; summer; autumn; winter, red; green; blu")
val groups = xs.head.split(",|;").map(_.trim).grouped(4)
This delivers an iterator of arrays of up to 4 items. The last array contains only 3, the colours.
To see the contents in the iterator,
groups.toArray
Array(Array(spring, summer, autumn, winter),
Array(red, green, blu))
In addition to lloydmeta:
sequence.flatMap(_.split(",")).map(_.split(";"))
This should give you what you seem to want, based on a single element in the sequence, and gives you a way to handle the case where the string result from the SQL query doesn't contain the expected data. You may need to do some string trimming if that is a requirement.
val xs = Seq("spring; summer; autumn; winter, red; green; blu")
val ys = xs.head.split(",") match {
  case Array(seasons, colours) => Array(seasons.split(";"), colours.split(";"))
  case _ => ??? // unexpected case - handle appropriately
}
println(ys.toList.map(_.toList))
// List(List(spring, summer, autumn, winter), List( red, green, blu))

Why does splitting strings give ArrayOutOfBoundsException in Spark 1.1.0 (works fine in 1.4.0)?

I'm using Spark 1.1.0 and Scala 2.10.4.
I have an input as follows:
100,aviral,Delhi,200,desh
200,ashu,hyd,300,desh
While executing:
sc.textFile(inputFile).keyBy(line => line.split(',')(2))
Spark gives me ArrayOutOfBoundsException. Why?
Please note that the same code works fine in Spark 1.4.0. Can anyone explain the reason for different behaviour?
It works fine here in Spark 1.4.1 / spark-shell.
Define an RDD with some data:
val rdd = sc.parallelize(Array("1,abc,2,xyz,3","4,qwerty,5,abc,4","9,fee,11,fie,13"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:21
Run it through .keyBy()
rdd.keyBy( line => (line.split(','))(2) ).collect()
res4: Array[(String, String)] = Array((2,1,abc,2,xyz,3), (5,4,qwerty,5,abc,4), (11,9,fee,11,fie,13))
Notice it makes the key from the 3rd element after splitting, but the printing seems odd. At first it doesn't look correctly tupled, but this turns out to be a printing artifact from the lack of quotes around the strings. We can test this by picking off the values and seeing if we get the lines back:
rdd.keyBy(line => line.split(',')(2) ).values.collect()
res12: Array[String] = Array(1,abc,2,xyz,3, 4,qwerty,5,abc,4, 9,fee,11,fie,13)
and this looks as expected. Note that there are only 3 elements in the array, and the commas here are within the element strings.
We can also use .map() to make pairs, like so:
rdd.map( line => (line.split(',')(2), line.split(',')) ).collect()
res7: Array[(String, Array[String])] = Array((2,Array(1, abc, 2, xyz, 3)), (5,Array(4, qwerty, 5, abc, 4)), (11,Array(9, fee, 11, fie, 13)))
which is printed as Tuples...
Or to avoid duplicating effort, maybe:
def splitter(s: String): (String, Array[String]) = {
  val parsed = s.split(',')
  (parsed(2), parsed)
}
rdd.map(splitter).collect()
res8: Array[(String, Array[String])] = Array((2,Array(1, abc, 2, xyz, 3)), (5,Array(4, qwerty, 5, abc, 4)), (11,Array(9, fee, 11, fie, 13)))
which is a bit easier to read. It also does slightly more of the parsing, because here we have split the line into its separate values.
The problem is that you have a blank line after the 1st row; splitting it does not return an Array containing the necessary number of columns.
1,abc,2,xyz,3
<empty line - here lies the problem>
4,qwerty,5,abc,4
Remove the empty line.
Another possibility is that one of the rows does not have enough columns. You can filter out all rows that do not have the required number of columns (be aware of possible data loss, though).
sc.textFile(inputFile)
  .map(_.split(","))
  .filter(_.size == EXPECTED_COLS_NUMBER)
  .keyBy(line => line(2))