pyspark udf return values

I created a UDF that returns a list of lists (the built-in list object). I saved the returned values to a new column, but found that they were converted to a string. I need the values as a list of lists in order to apply posexplode. What is the correct way to do this?
def conc(hashes, band_width):
    ...
    return combined_chunks  # its type: list[list[float]]

concat = udf(conc)
# the "bands" column becomes a string
mh2 = mh1.withColumn("bands", concat(col('hash'), lit(bandwidth)))

I solved it:
concat = udf(conc, ArrayType(VectorUDT()))
And in conc: return a list of dense vectors built with Vectors.dense.
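For completeness, here is a minimal sketch of that fix, assuming the usual imports; the body of conc is elided in the original, so the chunking logic below is only a placeholder:

from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import ArrayType
from pyspark.ml.linalg import Vectors, VectorUDT

def conc(hashes, band_width):
    # Placeholder logic: split the hash list into bands of band_width values.
    chunks = [hashes[i:i + band_width] for i in range(0, len(hashes), band_width)]
    # Return dense vectors so the values match the declared ArrayType(VectorUDT()).
    return [Vectors.dense(chunk) for chunk in chunks]

concat = udf(conc, ArrayType(VectorUDT()))
mh2 = mh1.withColumn("bands", concat(col('hash'), lit(bandwidth)))

If plain nested lists are enough, declaring the return type as ArrayType(ArrayType(DoubleType())) and returning the lists directly also satisfies posexplode, since the outer column is still an array.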

Related

How to explode a struct column with a prefix?

My goal is to explode a Spark struct column (i.e., take the fields from inside the struct and expose them as top-level columns of the dataset - already done), but changing the inner field names by prepending an arbitrary string. One motivation is that my struct can contain fields that have the same name as columns outside of it, so I need a way to differentiate them easily. Of course, I do not know beforehand what the columns inside my struct are.
Here is what I have so far:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
}
This does the job alright - I use it like this:
df.explodeStruct("myColumn")
It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.
As for prepending the prefix, my idea is to take the column and find out what its inner columns are. I browsed the documentation and could not find any method on the Column class that does that. I then changed my approach to taking the schema of the DataFrame, filtering the result by the name of the column, and extracting the matching column from the resulting array. The problem is that the element I find has the type StructField - which, again, offers no way to extract its inner fields - whereas what I would really like is to get hold of a StructType element, which has the .getFields method that does exactly what I want (that is, showing me the names of the inner columns, so I can iterate over them and use them in my select, prepending the prefix I want). I know of no way to convert a StructField to a StructType.
My last resort would be to parse the output of StructField.toString - which contains all the names and types of the inner columns - although that feels really dirty, and I'd rather avoid such a lowly approach.
Any elegant solution to this problem?
Well, after reading my own question again, I figured out an elegant solution to the problem - I just needed to select all the columns the way I was doing, and then compare the result back to the original dataframe to figure out which columns were new. Here is the final result - I also arranged it so that the exploded columns show up in the same place as the original struct column, so as not to break the flow of information:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = {
    val prefix = column + "_"
    val originalPosition = df.columns.indexOf(column)
    // Selecting "*" plus "column.*" appends the struct's inner fields at the end
    val dfWithAllColumns = df.select("*", column + ".*")
    // The new columns are exactly those not present in the original dataframe
    val explodedColumns = dfWithAllColumns.columns diff df.columns
    val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)
    // Replace the struct column, in place, with its prefixed inner columns
    val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)
    df.select(finalColumnsList: _*)
  }
}
Of course, you can customize the prefix, the separator, and so on - but that is simple; anyone could tweak the parameters. The usage remains the same.
In case anyone is interested, here is something similar for PySpark:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def explode_struct(df: DataFrame, column: str) -> DataFrame:
    original_position = df.columns.index(column)
    # The inner field names of the struct column
    new_columns = df.select(column + ".*").columns
    exploded_columns = [F.col(column + "." + c).alias(column + "_" + c) for c in new_columns]
    # Replace the struct column, in place, with its prefixed inner columns
    col_list = [F.col(c) for c in df.columns]
    col_list.pop(original_position)
    col_list[original_position:original_position] = exploded_columns
    return df.select(col_list)
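A quick usage sketch, assuming a dataframe df with a struct column named "myColumn" (the names are illustrative):

df2 = explode_struct(df, "myColumn")
df2.printSchema()  # the inner fields now appear as myColumn_<field>, in the struct's original position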

UDF function to check whether my input dataframe has duplicate columns or not using pyspark

I need to return boolean False if my input dataframe has duplicate columns with the same name. I wrote the code below. It identifies the duplicate columns in the input dataframe and returns them as a list. But when I call this function it must return a boolean value, i.e., if my input dataframe has duplicate columns with the same name it must return False.
@udf('string')
def get_duplicates_cols(df, df_cols):
    duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))
    for i in duplicate_col_index:
        df_cols[i] = df_cols[i] + '_duplicated'
    df2 = df.toDF(*df_cols)
    cols_to_remove = [c for c in df_cols if '_duplicated' in c]
    return cols_to_remove

duplicate_cols = udf(get_duplicates_cols, BooleanType())
You don't need a UDF; you simply need a Python function. The check will run in Python, not in the JVM. So, as @Santiago P said, you can use checkDuplicate only:
def checkDuplicate(df):
    return len(set(df.columns)) == len(df.columns)
Assuming that you pass the data frame to the function:

@udf(returnType=BooleanType())
def checkDuplicate(df):
    return len(set(df.columns)) == len(df.columns)
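A minimal sketch of the plain-function approach (the dataframe and its column names are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Two columns deliberately share the name "a".
df = spark.createDataFrame([(1, 2, 3)], ["a", "a", "b"])

def checkDuplicate(df):
    # True when every column name is unique, False when any name repeats.
    return len(set(df.columns)) == len(df.columns)

print(checkDuplicate(df))  # False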

Spark - Create a DataFrame from a list of Rows generated in a loop

I have a loop which generates rows in each iteration. My goal is to create a dataframe, with a given schema, that contains just those rows. I have a set of steps in mind, but I am not able to add a new Row to a List[Row] in each loop iteration.
I am trying the following approach:
var listOfRows = List[Row]()
val dfToExtractValues: DataFrame = ???

dfToExtractValues.foreach { x =>
  // Not really important how the variables are generated here,
  // so to simplify, all the rows will have the same values
  var col1 = "firstCol"
  var col2 = "secondCol"
  var col3 = "thirdCol"
  val newRow = RowFactory.create(col1, col2, col3)

  // This step I am not able to do:
  // listOfRows += newRow -> just for strings
  // listOfRows.add(newRow) -> this add doesn't exist, it is addString
  // listOfRows.aggregate(1)(newRow) -> this is not how aggregate works...
}

val rdd = sc.makeRDD[RDD](listOfRows)
val dfWithNewRows = sqlContext.createDataFrame(rdd, myOriginalDF.schema)
Can someone tell me what I am doing wrong, or what I could change in my approach to generate a dataframe from the rows I'm generating?
Maybe there is a better way to collect the rows than List[Row], but then I need to convert that other type of collection into a dataframe.
Can someone tell me what I am doing wrong
Closures:
First of all, it looks like you skipped over Understanding Closures in the Programming Guide. Any attempt to modify variables passed via a closure is futile. All you can do is modify a copy, and the changes won't be reflected globally.
A variable doesn't make the object mutable:
The following
var listOfRows = List[Row]()
creates a variable. The assigned List is as immutable as it was. If it weren't in the Spark context, you could create a new List and reassign:
listOfRows = newRow :: listOfRows
Note that we prepend, not append - you don't want to append to a list in a loop.
Variables holding immutable objects are useful when you want to share data (it is a common pattern in Akka, for example), but they don't have many applications in Spark.
Keep things distributed:
Finally, never fetch data to the driver just to distribute it again. You should also avoid unnecessary conversions between RDDs and DataFrames. It is best to use DataFrame operators all the way:
dfToExtractValues.select(...)
but if you need something more complex, use map:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
dfToExtractValues.map(x => ...)(RowEncoder(schema))
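In the same spirit, here is a hedged PySpark sketch of the keep-it-distributed advice; dfToExtractValues is assumed to be a PySpark DataFrame here, and the literal values mirror the simplified example above:

from pyspark.sql import functions as F

# Derive the new rows with dataframe operators instead of
# building a List[Row] on the driver.
df_with_new_rows = dfToExtractValues.select(
    F.lit("firstCol").alias("col1"),
    F.lit("secondCol").alias("col2"),
    F.lit("thirdCol").alias("col3"),
)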

Scala create array of empty arrays

I am trying to create an array where each element is an empty array.
I have tried this:
var result = Array.fill[Array[Int]](Array.empty[Int])
After looking here How to create and use a multi-dimensional array in Scala?, I also tried this:
var result = Array.ofDim[Array[Int]](Array.empty[Int])
However, none of these work.
How can I create an array of empty arrays?
You are misunderstanding Array.ofDim here. It creates a multidimensional array given the dimensions and the type of value to hold.
To create an array of 100 arrays, each of which is empty (0 elements) and would hold Ints, you need only to specify those dimensions as parameters to the ofDim function.
val result = Array.ofDim[Int](100, 0)
Array.fill takes two parameter lists: the first is the length, the second is the value to fill the array with - more precisely, the second parameter is an element computation that will be invoked repeatedly to obtain the array elements (thanks to @alexey-romanov for pointing this out). In your case, however, it always results in the same value: the empty array.
Array.fill[Array[Int]](length)(Array.empty)
Consider also Array.tabulate as follows,
val result = Array.tabulate(100)(_ => Array[Int]())
where the lambda function is applied 100 times, each time delivering an empty array.

Python use of class to create and manipulate a grid

Still trying to understand how to use classes. I have now written the following:
import random

class Grid():
    def __init__(self, grid_row, grid_column):
        self.__row = grid_row
        self.__col = grid_column
        self.__board = []

    def make_board(self):
        for row in range(self.__row):
            self.__board.append([])
            for col in range(self.__col):
                self.__board[row].append('0')
        return self.__board

    def change_tile(self):
        choices = (0, 1, 2)
        x = random.choice(choices)
        y = random.choice(choices)
        self.__board[x][y] = str(2)

    def __repr__(self):
        for row in self.__board:
            print(" ".join(row))

g = Grid(3, 3)
g.make_board()
g.change_tile()
print(g)
Firstly, when I run this I get a grid printed, followed by:
TypeError: __str__ returned non-string (type NoneType)
I don't understand why this happens. Second question: if I want to return self.__board, __str__ only returns the last row (0,0,0), while with print all three rows and columns are printed. Is there a way around the issue with return? Is it an issue (apart from the fact that I want to 'see' what I am doing)?
How would one call Grid(3,3) and get a grid with a randomly placed 2, without having to call each function separately as I have done in my example? Lastly, why can I not use the integers 0 or 2, but have to convert everything to a string? I hope that I have not exceeded the goodwill that exists on this forum by asking so many dumb questions!
The special methods __repr__ and __str__ are required to return a string. If no __str__ implementation is given, __repr__ will be used for string conversion too.
Now in your case, __repr__ prints something instead of returning a string. It actually returns nothing, so None is implicitly returned. You have to change it to return a string, for example like this:
def __repr__(self):
    return '\n'.join([' '.join(row) for row in self.__board])
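As for calling Grid(3,3) and getting a grid with a randomly placed 2 in one go, one natural way (a sketch, not the only option) is to have __init__ build the board and place the tile itself:

import random

class Grid:
    def __init__(self, grid_row, grid_column):
        self.__row = grid_row
        self.__col = grid_column
        # Build the board and place the random tile right away,
        # so Grid(3, 3) alone yields a finished grid.
        self.__board = [['0'] * grid_column for _ in range(grid_row)]
        self.change_tile()

    def change_tile(self):
        # randrange keeps the indices valid for any board size
        x = random.randrange(self.__row)
        y = random.randrange(self.__col)
        self.__board[x][y] = '2'

    def __repr__(self):
        return '\n'.join(' '.join(row) for row in self.__board)

g = Grid(3, 3)
print(g)

The cells are kept as strings because str.join only accepts strings; you could just as well store the integers 0 and 2 and convert at display time with ' '.join(str(cell) for cell in row).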