Why won't the exp function work in pyspark?

I'm trying to calculate odds ratios from the coefficients of a logistic regression but I'm encountering a problem best summed up by this code:
import pyspark.sql.functions as F
F.exp(1.2)
This fails with
py4j.Py4JException: Method exp([class java.lang.Double]) does not exist
An integer fails similarly. I don't get how a Double can cause a problem for the exp function?

If you have a look at the documentation for pyspark.sql.functions.exp(), it takes a Column object as input, so it will not work on a plain float value such as 1.2.
Create a DataFrame or a Column object which you can then use in F.exp().
An example would be:
df = df.withColumn("exp_x", F.exp(F.col("some_col_named_x")))

As #pissall mentioned, pyspark.sql.functions.exp takes a Column object as its parameter, but you can use pyspark.sql.functions.lit (introduced in version 1.3.0) to create a Column from a literal value.
from pyspark.sql.functions import exp, lit
df = df.withColumn("exp_1", exp(lit(1)))

Related

Passing Argument to a Generator to build a tf.data.Dataset

I am trying to build a tensorflow dataset from a generator. I have a list of tuples called some_list, where each tuple has an integer and some text.
When I do not pass some_list as an argument to the generator, the code works fine
import tensorflow as tf
import random
import numpy as np
some_list = [(1,'One'),[2,'Two'],[3,'Three'],[4,'Four'],
             (5,'Five'),[6,'Six'],[7,'Seven'],[8,'Eight']]

def text_gen1():
    random.shuffle(some_list)
    size = len(some_list)
    i = 0
    while True:
        yield some_list[i][0], some_list[i][1]
        i += 1
        if i > size:
            i = 0
            random.shuffle(some_list)

# Not passing any argument
tf_dataset1 = tf.data.Dataset.from_generator(text_gen1, output_types=(tf.int32, tf.string),
                                              output_shapes=((), ()))

for count_batch in tf_dataset1.repeat().batch(3).take(2):
    print(count_batch)
(<tf.Tensor: shape=(3,), dtype=int32, numpy=array([7, 1, 2])>, <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Seven', b'One', b'Two'], dtype=object)>)
(<tf.Tensor: shape=(3,), dtype=int32, numpy=array([3, 5, 4])>, <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Three', b'Five', b'Four'], dtype=object)>)
However, when I try to pass some_list as an argument, the code fails
def text_gen2(file_list):
    random.shuffle(file_list)
    size = len(file_list)
    i = 0
    while True:
        yield file_list[i][0], file_list[i][1]
        i += 1
        if i > size:
            i = 0
            random.shuffle(file_list)

tf_dataset2 = tf.data.Dataset.from_generator(text_gen2, args=[some_list],
                                              output_types=(tf.int32, tf.string),
                                              output_shapes=((), ()))

for count_batch in tf_dataset1.repeat().batch(3).take(2):
    print(count_batch)
ValueError: Can't convert Python sequence with mixed types to Tensor.
I noticed that when I try to pass a list of integers as an argument, the code works. However, a list of tuples seems to make it crash. Can someone shed some light on it?
The problem is exactly what the error says: you cannot have heterogeneous data types (int and str) in the same tf.Tensor. I made a few changes and came up with the code below.
Separate your some_list into two lists using zip(), i.e. int_list and str_list, and make your generator function accept two lists.
I don't understand why you're manually shuffling stuff within the generator. You can do it in a cleaner way using tf.data.Dataset.shuffle()
import tensorflow as tf
import random
import numpy as np

some_list = [(1,'One'),[2,'Two'],[3,'Three'],[4,'Four'],
             (5,'Five'),[6,'Six'],[7,'Seven'],[8,'Eight']]

def text_gen2(int_list, str_list):
    for x, y in zip(int_list, str_list):
        yield x, y

tf_dataset2 = tf.data.Dataset.from_generator(
    text_gen2,
    args=list(zip(*some_list)),
    output_types=(tf.int32, tf.string), output_shapes=((), ())
)

i = 0
for count_batch in tf_dataset2.repeat().batch(4).shuffle(buffer_size=6):
    print(count_batch)
    i += 1
    if i > 10:
        break
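As a side note, shuffle() placed after batch() shuffles whole batches rather than individual elements; a sketch that shuffles elements instead, reusing the tf_dataset2 defined above:
for count_batch in tf_dataset2.repeat().shuffle(buffer_size=6).batch(4).take(3):
    print(count_batch)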

Fill columns independently

I have a Python class with two data members: the first one is a Polars time series, the second one a list of strings.
A dictionary provides a mapping from string to function: each string in the list is associated with a function that returns a Polars frame (of one column).
Then there is a class method that creates a Polars data frame whose first column is the time series and whose other columns are created with these functions.
The columns are all independent.
Is there a way to create this data frame in parallel?
Here I try to define a minimal example:
class data_frame_constr():
    function_list: List[str]
    time_series: pl.DataFrame

    def compute_indicator_matrix(self) -> pl.DataFrame:
        for element in self.function_list:
            self.time_series.with_column(
                [
                    mapping[element]  # here is where we construct columns with the loop; mapping[element] is a custom function that returns a pl column
                ]
            )
        return self.time_series
For example, function_list = ["square", "square_root"].
The time frame is a one-column time series; I need to create square and square-root columns (or other custom, more complex functions, identified by name), but I know the list of functions only at runtime, as specified in the constructor.
You can use the with_columns context to provide a list of expressions, as long as the expressions are independent. (Note the plural: with_columns.) Polars will attempt to run all expressions in the list in parallel, even if the list of expressions is generated dynamically at run-time.
def mapping(func_str: str) -> pl.Expr:
    '''Generate Expression from function string'''
    ...

def compute_indicator_matrix(self) -> pl.DataFrame:
    expr_list = [mapping(next_funct_str)
                 for next_funct_str in self.function_list]
    self.time_series = self.time_series.with_columns(expr_list)
    return self.time_series
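For concreteness, a minimal sketch of what the mapping function could look like for the example names "square" and "square_root"; the column name "value" and the exact expressions are assumptions made for illustration:
import polars as pl

def mapping(func_str: str) -> pl.Expr:
    '''Generate an Expression from a function string (hypothetical mapping).'''
    exprs = {
        # assumes the time series column is named "value"
        "square": pl.col("value").pow(2).alias("square"),
        "square_root": pl.col("value").sqrt().alias("square_root"),
    }
    return exprs[func_str]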
One note: it is a common misconception that Polars is a generic thread pool that will run any/all code in parallel. This is not true.
If any of your expressions call external libraries or custom Python bytecode functions (e.g., using a lambda function, map, apply, etc.), then your code will be subject to the Python GIL and will run single-threaded, no matter how you code it. Thus, try to use only Polars expressions to achieve your objectives (rather than calling external libraries or Python functions).
For example, try the following. (Choose a value of nbr_rows that will stress your computing platform.) If we run the code below, it will run in parallel because everything is expressed using Polars expressions, without calling external libraries or custom Python code. The result is embarrassingly parallel performance.
nbr_rows = 100_000_000
df = pl.DataFrame({
    'col1': pl.repeat(2, nbr_rows, eager=True),
})

df.with_columns([
    pl.col('col1').pow(1.1).alias('exp_1.1'),
    pl.col('col1').pow(1.2).alias('exp_1.2'),
    pl.col('col1').pow(1.3).alias('exp_1.3'),
    pl.col('col1').pow(1.4).alias('exp_1.4'),
    pl.col('col1').pow(1.5).alias('exp_1.5'),
])
However, if we instead write the code using lambda functions that call Python bytecode, then it will run very slowly.
import math

df.with_columns([
    pl.col('col1').apply(lambda x: math.pow(x, 1.1)).alias('exp_1.1'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.2)).alias('exp_1.2'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.3)).alias('exp_1.3'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.4)).alias('exp_1.4'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.5)).alias('exp_1.5'),
])

Create multiple rows of fixed length from a data frame column in Pyspark

My input is a PySpark dataframe and it has only one column, DETAIL_REC.
detail_df.show()
DETAIL_REC
================================
ABC12345678ABC98765543ABC98762345
detail_df.printSchema()
root
|-- DETAIL_REC: string (nullable = true)
Every 11 characters of this string should become the next row of the dataframe for a downstream process to consume.
The expected output should be multiple rows in the dataframe:
DETAIL_REC (no blank lines after each record)
==============
ABC12345678
ABC98765543
ABC98762345
If you have Spark 2.4+, we can make use of higher-order functions to do it like below:
from pyspark.sql import functions as F

n = 11
output = df.withColumn(
    "SubstrCol",
    F.explode(F.expr(f"""filter(
                           transform(
                             sequence(0, length(DETAIL_REC), {n}),
                             x -> substring(DETAIL_REC, x+1, {n})),
                           y -> y <> '')""")),
)
output.show(truncate=False)
+---------------------------------+-----------+
|DETAIL_REC |SubstrCol |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+
Logic used:
First generate a sequence of integers starting from 0 up to the length of the string, in steps of 11 (n).
Using transform, iterate through this sequence and keep getting substrings from the original string (this keeps changing the start position).
Filter out any blank strings from the resulting array and explode this array.
For lower versions of Spark, use a UDF with textwrap or any other function, as addressed here:
from pyspark.sql import functions as F, types as T
from textwrap import wrap
n = 11
myudf = F.udf(lambda x: wrap(x,n),T.ArrayType(T.StringType()))
output = df.withColumn("SubstrCol",F.explode(myudf("DETAIL_REC")))
output.show(truncate=False)
+---------------------------------+-----------+
|DETAIL_REC |SubstrCol |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+

I have an issue with regex extract with multiple matches

I am trying to extract 60 ML and 0.5 ML from the string "60 ML of paracetomol and 0.5 ML of XYZ". This string is part of a column X in a Spark dataframe. Though I am able to test my regex code to extract 60 ML and 0.5 ML in a regex validator, I am not able to extract them using regexp_extract, as it targets only the first match. Hence I am getting only 60 ML.
Can you suggest the best way of doing it using a UDF?
Here is how you can do it with a python UDF:
from pyspark.sql.types import *
from pyspark.sql.functions import *
import re
data = [('60 ML of paracetomol and 0.5 ML of XYZ',)]
df = sc.parallelize(data).toDF('str:string')

# Define the function you want to return
def extract(s):
    all_matches = re.findall(r'\d+(?:\.\d+)? ML', s)
    return all_matches

# Create the UDF, note that you need to declare the return schema matching the returned type
extract_udf = udf(extract, ArrayType(StringType()))

# Apply it
df2 = df.withColumn('extracted', extract_udf('str'))
Python UDFs take a significant performance hit over native DataFrame operations. After thinking about it a little more, here is another way to do it without using a UDF. The general idea is to replace all the text that isn't what you want with commas, then split on commas to create your array of final values. If you only want the numbers, you can update the regexes to take 'ML' out of the capture group.
pattern = r'\d+(?:\.\d+)? ML'
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)
df2 = df.withColumn('a', regexp_replace('str', split_pattern, '$1,'))
df3 = df2.withColumn('a', regexp_replace('a', end_pattern, '$1'))
df4 = df3.withColumn('a', split('a', r','))
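As a further alternative on newer Spark versions, a minimal sketch assuming Spark 3.1+, where the SQL function regexp_extract_all is available and can be called through expr():
from pyspark.sql.functions import expr

df5 = df.withColumn(
    'extracted',
    expr(r"regexp_extract_all(str, '\\d+(\\.\\d+)? ML', 0)")
)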

Matrix Multiplication A^T * A in PySpark

I asked a similar question yesterday - Matrix Multiplication between two RDD[Array[Double]] in Spark - however I've decided to shift to pyspark to do this. I've made some progress loading and reformatting the data - Pyspark map from RDD of strings to RDD of list of doubles - however the matrix multiplcation is difficult. Let me share my progress first:
matrix1.txt
1.2 3.4 2.3
2.3 1.1 1.5
3.3 1.8 4.5
5.3 2.2 4.5
9.3 8.1 0.3
4.5 4.3 2.1
It's difficult to share files, but this is what my matrix1.txt file looks like. It is a space-delimited text file containing the values of a matrix. Next is the code:
# do the imports for pyspark and numpy
from pyspark import SparkConf, SparkContext
import numpy as np

# loadmatrix is a helper function used to read matrix1.txt and format
# from RDD of strings to RDD of list of floats
def loadmatrix(sc):
    data = sc.textFile("matrix1.txt").map(lambda line: line.split(' ')).map(lambda line: [float(x) for x in line])
    return(data)

# this is the function I am struggling with, it should take a line of the
# matrix (formatted as list of floats), compute an outer product with itself
def AtransposeA(line):
    # pseudocode for this would be...
    # outerprod = compute line * line^transpose
    # return(outerprod)

# here is the main body of my file
if __name__ == "__main__":
    # create the conf, sc objects, then use loadmatrix to read data
    conf = SparkConf().setAppName('SVD').setMaster('local')
    sc = SparkContext(conf = conf)
    mymatrix = loadmatrix(sc)

    # this is pseudocode for calling AtransposeA
    ATA = mymatrix.map(lambda line: AtransposeA(line)).reduce(elementwise add all the outerproducts)

    # the SVD of ATA is computed below
    U, S, V = np.linalg.svd(ATA)
    # ...
My approach is as follows: to do the matrix multiplication A^T * A, I create a function that computes outer products of rows of A. The elementwise sum of all of the outer products is the product I want. I then call AtransposeA() in a map function, so that it is performed on each row of the matrix, and finally I use a reduce() to add the resulting matrices.
I'm struggling to think about how the AtransposeA function should look. How can I do an outer product in pyspark like this? Thanks in advance for the help!
First, consider why you want to use Spark for this. It sounds like all your data fits in memory, in which case you can use numpy and pandas in a very straight-forward way.
If your data isn't structured so that rows are independent, then it probably can't be parallelized by sending groups of rows to different nodes, which is the whole point of using Spark.
Having said that... here is some pyspark (2.1.1) code that I think does what you want.
# read the matrix file
df = spark.read.csv("matrix1.txt",sep=" ",inferSchema=True)
df.show()
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|1.2|3.4|2.3|
|2.3|1.1|1.5|
|3.3|1.8|4.5|
|5.3|2.2|4.5|
|9.3|8.1|0.3|
|4.5|4.3|2.1|
+---+---+---+
from pyspark.sql import functions as F
from functools import reduce

# do the sum of the multiplication that we want, and get
# one data frame for each column
colDFs = []
for c2 in df.columns:
    colDFs.append(df.select([F.sum(df[c1] * df[c2]).alias("op_{0}".format(i)) for i, c1 in enumerate(df.columns)]))

# now union those separate data frames to build the "matrix"
mtxDF = reduce(lambda a, b: a.select(a.columns).union(b.select(a.columns)), colDFs)
mtxDF.show()
+------------------+------------------+------------------+
| op_0| op_1| op_2|
+------------------+------------------+------------------+
| 152.45|118.88999999999999| 57.15|
|118.88999999999999|104.94999999999999| 38.93|
| 57.15| 38.93|52.540000000000006|
+------------------+------------------+------------------+
This seems to be the same result that you get from numpy.
import numpy

a = numpy.genfromtxt("matrix1.txt")
numpy.dot(a.T, a)

array([[ 152.45, 118.89, 57.15],
       [ 118.89, 104.95, 38.93],
       [ 57.15, 38.93, 52.54]])
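For completeness, here is a minimal sketch of the outer-product approach described in the question itself, assuming mymatrix is the RDD of lists of floats produced by loadmatrix(): each row contributes np.outer(row, row), and the elementwise sum of those outer products is A^T * A.
import numpy as np

def AtransposeA(line):
    # outer product of one row with itself
    row = np.array(line)
    return np.outer(row, row)

# sum the per-row outer products elementwise to get A^T * A
ATA = mymatrix.map(AtransposeA).reduce(lambda m1, m2: m1 + m2)
U, S, V = np.linalg.svd(ATA)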