How to set display precision in PySpark DataFrame show - pyspark

How do you set the display precision in PySpark when calling .show()?
Consider the following example:
from math import sqrt
import pyspark.sql.functions as f
data = zip(
    map(lambda x: sqrt(x), range(100, 105)),
    map(lambda x: sqrt(x), range(200, 205))
)
df = sqlCtx.createDataFrame(data, ["col1", "col2"])
df.select([f.avg(c).alias(c) for c in df.columns]).show()
Which outputs:
#+------------------+------------------+
#| col1| col2|
#+------------------+------------------+
#|10.099262230352151|14.212583322380274|
#+------------------+------------------+
How can I change it so that it only displays 3 digits after the decimal point?
Desired output:
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
This is a PySpark version of this Scala question. I'm posting it here because I could not find an answer when searching for PySpark solutions, and I think it can be helpful to others in the future.

Round
The easiest option is to use pyspark.sql.functions.round():
from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
This will maintain the values as numeric types.
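A quick way to confirm that the types stay numeric (a small sketch, reusing the aggregation above):
df.select([round(avg(c), 3).alias(c) for c in df.columns]).dtypes
# [('col1', 'double'), ('col2', 'double')]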
Format Number
The functions are the same for Scala and Python. The only difference is the import.
You can use format_number to format a number to desired decimal places as stated in the official api document:
Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
from pyspark.sql.functions import avg, format_number
df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
The transformed columns will be of StringType, and a comma is used as a thousands separator (shown here with larger values):
#+-----------+--------------+
#| col1| col2|
#+-----------+--------------+
#|500,100.000|50,489,590.000|
#+-----------+--------------+
As noted in the Scala version of this answer, you can use regexp_replace to replace the , with any string you want:
Replace all substrings of the specified string value that match regexp with rep.
from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    [regexp_replace(format_number(avg(c), 3), ",", "").alias(c) for c in df.columns]
).show()
#+----------+------------+
#| col1| col2|
#+----------+------------+
#|500100.000|50489590.000|
#+----------+------------+
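If you need numeric values again after the string formatting, you can cast the cleaned column back to a numeric type; a minimal sketch, assuming the same DataFrame as above:
from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    [regexp_replace(format_number(avg(c), 3), ",", "").cast("double").alias(c) for c in df.columns]
).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+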

Just wrap the answer above in a function that only handles float and double columns:
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
def dataframe_format_float(df: DataFrame, num_decimals=4) -> DataFrame:
    r = []
    for name, dtype in df.dtypes:
        if dtype in ['float', 'double']:
            # round float/double columns to the requested number of decimals
            r.append(F.round(name, num_decimals).alias(name))
        else:
            # leave all other columns untouched
            r.append(name)
    return df.select(r)
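Example usage with the aggregation from the question (agg_df is just an assumed name for the aggregated DataFrame); it should produce the same rounded output as the round() approach above:
agg_df = df.select([F.avg(c).alias(c) for c in df.columns])
dataframe_format_float(agg_df, num_decimals=3).show()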

Related

Sum vector columns in spark

I have a dataframe where I have multiple columns that contain vectors (the number of vector columns is dynamic). I need to create a new column taking the sum of all the vector columns. I'm having a hard time getting this done. Here is code to generate a sample dataset that I'm testing on:
import org.apache.spark.ml.feature.VectorAssembler
val temp1 = spark.createDataFrame(Seq(
    (1,1.0,0.0,4.7,6,0.0),
    (2,1.0,0.0,6.8,6,0.0),
    (3,1.0,1.0,7.8,5,0.0),
    (4,0.0,1.0,4.1,7,0.0),
    (5,1.0,0.0,2.8,6,1.0),
    (6,1.0,1.0,6.1,5,0.0),
    (7,0.0,1.0,4.9,7,1.0),
    (8,1.0,0.0,7.3,6,0.0)))
  .toDF("id", "f1","f2","f3","f4","label")
val assembler1 = new VectorAssembler()
  .setInputCols(Array("f1","f2","f3"))
  .setOutputCol("vec1")
val temp2 = assembler1.setHandleInvalid("skip").transform(temp1)
val assembler2 = new VectorAssembler()
  .setInputCols(Array("f2","f3", "f4"))
  .setOutputCol("vec2")
val df = assembler2.setHandleInvalid("skip").transform(temp2)
This gives me the following dataset
+---+---+---+---+---+-----+-------------+-------------+
| id| f1| f2| f3| f4|label| vec1| vec2|
+---+---+---+---+---+-----+-------------+-------------+
| 1|1.0|0.0|4.7| 6| 0.0|[1.0,0.0,4.7]|[0.0,4.7,6.0]|
| 2|1.0|0.0|6.8| 6| 0.0|[1.0,0.0,6.8]|[0.0,6.8,6.0]|
| 3|1.0|1.0|7.8| 5| 0.0|[1.0,1.0,7.8]|[1.0,7.8,5.0]|
| 4|0.0|1.0|4.1| 7| 0.0|[0.0,1.0,4.1]|[1.0,4.1,7.0]|
| 5|1.0|0.0|2.8| 6| 1.0|[1.0,0.0,2.8]|[0.0,2.8,6.0]|
| 6|1.0|1.0|6.1| 5| 0.0|[1.0,1.0,6.1]|[1.0,6.1,5.0]|
| 7|0.0|1.0|4.9| 7| 1.0|[0.0,1.0,4.9]|[1.0,4.9,7.0]|
| 8|1.0|0.0|7.3| 6| 0.0|[1.0,0.0,7.3]|[0.0,7.3,6.0]|
+---+---+---+---+---+-----+-------------+-------------+
If I needed to take the sum of regular columns, I could do it using something like:
import org.apache.spark.sql.functions.col
df.withColumn("sum", namesOfColumnsToSum.map(col).reduce((c1, c2)=>c1+c2))
I know I can use Breeze to sum DenseVectors just using the "+" operator:
import breeze.linalg._
val v1 = DenseVector(1,2,3)
val v2 = DenseVector(5,6,7)
v1+v2
So, the above code gives me the expected vector, but I'm not sure how to apply that to sum the vec1 and vec2 vector columns.
I did try the suggestions mentioned here, but had no luck
Here's my take but coded in PySpark. Someone can probably help in translating this to Scala:
from pyspark.ml.linalg import Vectors, VectorUDT
import numpy as np
from pyspark.sql.functions import udf, array
def vector_sum(arr):
    return Vectors.dense(np.sum(arr, axis=0))

vector_sum_udf = udf(vector_sum, VectorUDT())
df = df.withColumn('sum', vector_sum_udf(array(['vec1', 'vec2'])))
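To sanity-check the result, you could look at the new column alongside the inputs (a sketch, assuming the DataFrame from the question loaded in PySpark):
df.select("id", "vec1", "vec2", "sum").show(truncate=False)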

Convert int column to list type pyspark

My DataFrame has a column num_of_items. It is a count field. Now, I want to convert it to list type from int type.
I tried using array(col) and even creating a function to return a list by taking int value as input. Didn't work
from pyspark.sql.types import ArrayType
from array import array
from pyspark.sql.functions import monotonically_increasing_id

def to_array(x):
    return [x]

df = df.withColumn("num_of_items", monotonically_increasing_id())
df
col_1 | num_of_items
A | 1
B | 2
Expected output
col_1 | num_of_items
A | [23]
B | [43]
I tried using array(col)
Using pyspark.sql.functions.array seems to work for me.
from pyspark.sql.functions import array
df.withColumn("num_of_items", array("num_of_items")).show()
#+-----+------------+
#|col_1|num_of_items|
#+-----+------------+
#| A| [1]|
#| B| [2]|
#+-----+------------+
and even creating a function to return a list by taking int value as input.
If you want to use the function you created, you have to make it a udf and specify the return type:
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql.functions import udf, col
to_array_udf = udf(to_array, ArrayType(IntegerType()))
df.withColumn("num_of_items", to_array_udf(col("num_of_items"))).show()
#+-----+------------+
#|col_1|num_of_items|
#+-----+------------+
#| A| [1]|
#| B| [2]|
#+-----+------------+
But it's preferable to avoid using udfs when possible: See Spark functions vs UDF performance?

spark scala cartesian product of each element in a column

I have a dataframe which looks like this:
df:
col1 col2
a [p1,p2,p3]
b [p1,p4]
The desired output is:
df_out:
col1 col2 col3
p1 p2 a
p1 p3 a
p2 p3 a
p1 p4 b
I did some research and I think that converting the df to an RDD and then using flatMap with a cartesian product is ideal for the problem. However, I could not combine them together.
Thanks,
It looks like you are trying to do combinations rather than a cartesian product. Please check my understanding.
This is in PySpark, but the only Python-specific part is the UDF; the rest is just DataFrame operations.
The process is:
1. Create the DataFrame.
2. Define a UDF that returns all pairs of combinations, ignoring order.
3. Use the UDF to convert the array into an array of two-element structs, one per combination.
4. Explode the result to get one row per pair of structs.
5. Select each struct field and the original column 1 into the desired result columns.
from itertools import combinations
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [
        ("a", ["p1", "p2", "p3"]),
        ("b", ["p1", "p4"])
    ],
    ["col1", "col2"]
)
# define a udf that takes an array and returns an array of structs of two strings
@udf("array<struct<_1: string, _2: string>>")
def combinations_list(x):
    return list(combinations(x, 2))
resultDf = df.select("col1", F.explode(combinations_list(df.col2)).alias("combos"))
resultDf.selectExpr("combos._1 as col1", "combos._2 as col2", "col1 as col3").show()
Result:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| p1| p2| a|
| p1| p3| a|
| p2| p3| a|
| p1| p4| b|
+----+----+----+

PySpark - Add map function as column

I have a pyspark DataFrame
a = [
    ('Bob', 562),
    ('Bob', 880),
    ('Bob', 380),
    ('Sue', 85),
    ('Sue', 963)
]
df = spark.createDataFrame(a, ["Person", "Amount"])
I need to create a column that hashes the Amount column. The problem is that I can't use a UDF, so I have used a mapping function.
df.rdd.map(lambda x: hash(x["Amount"]))
If you can't use a udf, you can use the map function, but as you've currently written it, there will be only one column. To keep all the columns, do the following:
df = df.rdd\
    .map(lambda x: (x["Person"], x["Amount"], hash(str(x["Amount"]))))\
    .toDF(["Person", "Amount", "Hash"])
df.show()
#+------+------+--------------------+
#|Person|Amount| Hash|
#+------+------+--------------------+
#| Bob| 562|-4340709941618811062|
#| Bob| 880|-7718876479167384701|
#| Bob| 380|-2088598916611095344|
#| Sue| 85| 7168043064064671|
#| Sue| 963|-8844931991662242457|
#+------+------+--------------------+
Note: In this case, hash(x["Amount"]) is not very interesting so I changed it to hash Amount converted to a string.
Essentially you have to map the row to a tuple containing all of the existing columns and add in the new column(s).
If your columns are too many to enumerate, you could also just add a tuple to the existing row.
df = df.rdd\
    .map(lambda x: x + (hash(str(x["Amount"])),))\
    .toDF(df.columns + ["Hash"])
I should also point out that if hashing the values is your end goal, there is also a pyspark function pyspark.sql.functions.hash that can be used to avoid the serialization to rdd:
import pyspark.sql.functions as f
df.withColumn("Hash", f.hash("Amount")).show()
#+------+------+----------+
#|Person|Amount| Hash|
#+------+------+----------+
#| Bob| 562| 51343841|
#| Bob| 880|1241753636|
#| Bob| 380| 514174926|
#| Sue| 85|1944150283|
#| Sue| 963|1665082423|
#+------+------+----------+
This appears to use a different hashing algorithm than the Python built-in hash.
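If you are on Spark 3.0+ and want a 64-bit hash instead, pyspark.sql.functions.xxhash64 is another built-in option (a sketch; the values will differ from both of the above):
df.withColumn("Hash", f.xxhash64("Amount")).show()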

Read fixed length file with implicit decimal point?

Suppose I have a data file like this:
foo12345
bar45612
I want to parse this into:
+----+-------+
| id| amt|
+----+-------+
| foo| 123.45|
| bar| 456.12|
+----+-------+
Which is to say, I need to select df.value.substr(4,5).alias('amt'), but I want the value to be interpreted as a five digit number where the last two digits are after the decimal point.
Surely there's a better way to do this than "divide by 100"?
from pyspark.sql.functions import substring, concat, lit
from pyspark.sql.types import DoubleType
# sample data
df = sc.parallelize([
    ['foo12345'],
    ['bar45612']]).toDF(["value"])

df = df.withColumn('id', substring('value', 1, 3)).\
    withColumn('amt', concat(substring('value', 4, 3), lit('.'), substring('value', 7, 2)).cast(DoubleType()))
df.show()
Output is:
+--------+---+------+
| value| id| amt|
+--------+---+------+
|foo12345|foo|123.45|
|bar45612|bar|456.12|
+--------+---+------+
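To match the desired output exactly (just id and amt), select only those two columns afterwards:
df.select('id', 'amt').show()
+---+------+
| id|   amt|
+---+------+
|foo|123.45|
|bar|456.12|
+---+------+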