Convert a Pyspark Dataframe into a List with actual values - pyspark

I am trying to convert a Pyspark dataframe column to a list of values, NOT Row objects.
My ultimate goal is to use it as a filter for filtering another dataframe.
I have tried the following:
X = df.select("columnname").collect()
But when I use it to filter, I am unable to:
Y = dtaframe.filter(~dtaframe.columnname.isin(X))
I also tried converting to a numpy array and aggregating with collect_list():
df.groupby('columnname').agg(collect_list(df["columnname"]))
Please advise.

The collect function returns a list of Row objects by collecting the data from the executors. If you need the values in native datatypes, you have to fetch the column from each Row object explicitly.
This code creates a DF with a column named number of LongType.
df = spark.range(0,10,2).toDF("number")
Convert this into a Python list.
num_list = [row.number for row in df.collect()]
Now this list can be used with any dataframe to filter values using the isin function.
from pyspark.sql.functions import col

df1 = spark.range(10).toDF("number")
df1.filter(~col("number").isin(num_list)).show()
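Tying this back to the original question, the same pattern would look roughly like the sketch below (reusing the asker's dataframe and column names):

from pyspark.sql.functions import col

# extract plain values (not Row objects) from the collected column
X = [row.columnname for row in df.select("columnname").collect()]

# isin now receives native values, so the filter works as expected
Y = dtaframe.filter(~col("columnname").isin(X))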

Related

pyspark - getting error 'list' object has no attribute 'write' when attempting to write to a delta table

I am attempting to read the first X number of rows of a delta table into a dataframe, and then write (overwrite) that back to the delta table. Here is the code:
# read from entire delta table into dataframe
revEnrichRef = spark.read.format("delta").load("/mnt/tables/myTable")
# retrieve first 5 rows
dfSubset = revEnrichRef.head(5)
dfSubset.write.format("delta").mode("overwrite").save("/mnt/tables/myTable")
at this point I get the error: 'list' object has no attribute 'write'
I guess that means head returns a list rather than a new dataframe. What I really want is a solution that returns x rows as a dataframe. Alternatively, a way to do this without an intermediary dataframe would be just as good. Any help is appreciated. Thanks
You can do so with the limit method. This returns a dataframe limited to the number of rows passed as the argument.
dfSubset = revEnrichRef.limit(5)
The head method is an action which collects 5 rows from your dataframe as a list of Row objects (or a single Row object if n = 1).
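Putting it together for the original task, a minimal sketch (reusing the path and variable names from the question) could look like this:

# read the delta table and keep only the first 5 rows as a dataframe
revEnrichRef = spark.read.format("delta").load("/mnt/tables/myTable")
dfSubset = revEnrichRef.limit(5)

# dfSubset is a dataframe, so it has a write attribute
dfSubset.write.format("delta").mode("overwrite").save("/mnt/tables/myTable")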

Spark agg to collect a single list for multiple columns

Here is my current code:
pipe_exec_df_final_grouped = pipe_exec_df_final.groupBy("application_id").agg(collect_list("table_name").alias("tables"))
However, in my collected list, I would like multiple column values, so the aggregated column would be an array of arrays. Currently the result looks like this:
1|[a,b,c,d]
2|[e,f,g,h]
However, I would also like to keep another column attached to the aggregation (let's call it the 'status' column). So my new output would be:
1|[[a,pass],[b,fail],[c,fail],[d,pass]]
...
I tried collect_list("table_name, status"), however collect_list only takes one column name. How can I accomplish what I am trying to do?
Use array to collect columns into an array column first, then apply collect_list:
df.groupBy(...).agg(collect_list(array("table_name", "status")))
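A fuller sketch using the column names from the question (the grouping key application_id is taken from the original code):

from pyspark.sql.functions import array, collect_list

pipe_exec_df_final_grouped = (
    pipe_exec_df_final
    .groupBy("application_id")
    .agg(collect_list(array("table_name", "status")).alias("tables"))
)
# each row now holds an array of [table_name, status] pairs,
# e.g. 1 | [[a, pass], [b, fail], [c, fail], [d, pass]]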

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/Dataframe with an mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this dataset (so I will have both types of Vectors). The reason is that I am evaluating a few algorithms, some of which expect an mllib vector and some an ml vector. Also, I have to feed the output of one algorithm to another, and each uses a different type.
Can someone please help me convert an mllib.linalg.Vector to an ml.linalg.Vector and append it as a new column to the dataset in hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions but was not able to get it working. I am trying to avoid creating a new dataset and then doing an inner join and dropping the columns, as the dataset will eventually be huge and joins are expensive.
You can use the method toML to convert from an mllib to an ml vector. A UDF and a usage example can look like this:
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
  mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
Assuming df to be the original dataframe and the column with the mllib vector to be named mllibVector.

Add list as column to Dataframe in pyspark

I have a list of integers and a sqlcontext dataframe with the number of rows equal to the length of the list. I want to add the list as a column to this dataframe maintaining the order. I feel like this should be really simple but I can't find an elegant solution.
You cannot simply add a list as a dataframe column, since a list is a local object and a dataframe is distributed. You can try one of the following approaches:
convert the dataframe to a local collection with collect() or toLocalIterator() and, for each row, add the corresponding value from the list, OR
convert the list to a dataframe with an extra index column (matching keys from the original dataframe) and then join the two (a sketch of this approach follows below).
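A minimal sketch of the second approach, joining on a positional index (the names df, values and new_col, and the sample data, are purely illustrative):

from pyspark.sql import Row

values = [10, 20, 30]  # the local list to attach

# attach a positional index to every dataframe row via the RDD API
df_indexed = df.rdd.zipWithIndex().map(
    lambda pair: Row(idx=pair[1], **pair[0].asDict())
).toDF()

# build a dataframe from the list with the same positional index
values_df = spark.createDataFrame(
    [(i, v) for i, v in enumerate(values)], ["idx", "new_col"]
)

# join on the index and drop it to get the original rows plus the new column
result = df_indexed.join(values_df, on="idx").drop("idx")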

Column having list datatype : Spark HiveContext

The following code does aggregation and creates a column with list datatype:
groupBy(
"column_name_1"
).agg(
expr("collect_list(column_name_2) "
"AS column_name_3")
)
So it seems it is possible to have 'list' as a column datatype in a dataframe.
I was wondering if I can write a udf that returns custom datatype, for example a python dict?
The list is a representation of Spark's Array datatype. You can try using the Map datatype (pyspark.sql.types.MapType).
An example of something which creates it is pyspark.sql.functions.create_map, which creates a map column from several columns.
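For illustration, a small sketch (the map keys "name" and "value" are made up here; the column names are reused from the code above):

from pyspark.sql import functions as F

# build a single map column from alternating literal keys and value columns
df_with_map = df.withColumn(
    "props",
    F.create_map(
        F.lit("name"), F.col("column_name_1"),
        F.lit("value"), F.col("column_name_2"),
    )
)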
That said, if you want to create a custom aggregation function to do anything not already available in pyspark.sql.functions, you will need to use Scala.