Explode Array Element into a unique column - pyspark

I'm new to Pyspark and trying to solve an ETL step.
I have the schema below. I would like to take the variable field that is inside the array and turn it into a column, but when I do this with explode I get duplicate rows, because the array element appears at positions [0], [1], and [2].
My goal is to take everything that is inside variable across all of the array elements and turn it into a single string column, with the values from each element separated by commas.
root
|-- id: string (nullable = true)
|-- info: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- variable: string (nullable = true)
Output:
id             | new column
123435e5x-9a9z | A, B, D
555585a4Z-0B1Y | A
Thank you for the help

As mentioned by David Markovitz, you can use the concat_ws function as below:
from pyspark.sql import functions as F

# 'info.variable' pulls the variable field out of every struct in the array,
# giving an array<string> that concat_ws joins into one comma-separated string
df = df.withColumn('new column', F.concat_ws(', ', F.col('info.variable')))
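For reference, a minimal self-contained sketch of the same idea on data shaped like the question's schema (the sample rows, the SparkSession, and the trailing select are assumptions added for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# sample rows shaped like the schema above: info is an array of structs with a 'variable' field
df = spark.createDataFrame(
    [("123435e5x-9a9z", [("A",), ("B",), ("D",)]),
     ("555585a4Z-0B1Y", [("A",)])],
    "id string, info array<struct<variable:string>>",
)

# extract the variable field from every struct (array<string>), then join into one string
df.withColumn("new column", F.concat_ws(", ", F.col("info.variable"))).select("id", "new column").show()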

Related

scala: read csv of documents, create cosine similarity

I'm reading in dozens of documents. They seem to be read in, both as RDDs and as DataFrames, as rows of string columns:
This is the schema:
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)...
|-- _c58: string (nullable = true)
|-- _c59: string (nullable = true)
This is the head of the df:
_c1          | _c2          | ...
V1           | V2           | ...
This text ...| This is an...| ...
I'm trying to create a cosine similarity matrix using this:
val contentRDD = spark.sparkContext.textFile("...documents_vector.csv").toDF()
val Row(coeff0: Matrix) = Correlation.corr(contentRDD, "features").head
println(s"Pearson correlation matrix:\n $coeff0")
This is another way I was doing it:
val df_4 = spark.read.csv("/document_vector.csv")
Here, features would be the name of the column created by converting each row of 59 columns into a single column of 59 rows.
Is there a way to map each new element in the csv to a new row to complete the cosine similarity matrix? Is there another way I should be doing this?
Thank you to any who consider this.

PySpark DataFrame change column of string to array before using explode

I have a column called event_data in JSON format in my Spark DataFrame; after reading it using from_json, I get this schema:
root
|-- user_id: string (nullable = true)
|-- event_data: struct (nullable = true)
| |-- af_content_id: string (nullable = true)
| |-- af_currency: string (nullable = true)
| |-- af_order_id: long (nullable = true)
I only need af_content_id from this column. This attribute can be of different formats:
a String
an Integer
a List of Int and Str, e.g. ['ghhjj23','123546',12356]
None (sometimes event_data doesn't contain af_content_id)
I want to use the explode function in order to return a new row for each element in af_content_id when it is of the List format. But when I apply it, I get an error:
from pyspark.sql.functions import explode

def get_content_id(column):
    return column.af_content_id

df_transf_1 = df_transf_1.withColumn(
    "products_basket",
    get_content_id(df_transf_1.event_data)
)
df_transf_1 = df_transf_1.withColumn(
    "product_id",
    explode(df_transf_1.products_basket)
)
cannot resolve 'explode(products_basket)' due to data type mismatch: input to function explode should be array or map type, not StringType;
I know the reason: it's because of the different types that the field af_content_id may contain, but I don't know how to resolve it. Using pyspark.sql.functions.array() directly on the column doesn't work, because it becomes an array of arrays and explode will not produce the expected result.
A sample code to reproduce the step that I'm stuck on:
import pandas as pd

arr = [
    ['b5ad805c-f295-4852-82fc-961a88', 12732936],
    ['0FD6955D-484C-4FC8-8C3F-DA7D28', ['Gklb38', '123655']],
    ['0E3D17EA-BEEF-4931-8104', '12909841'],
    ['CC2877D0-A15C-4C0A-AD65-762A35C1', [12645715, 12909837, 12909837]]
]
df = pd.DataFrame(arr, columns=['user_id', 'products_basket'])
df = df[['user_id', 'products_basket']].astype(str)
df_transf_1 = spark.createDataFrame(df)
I'm looking for a way to convert products_basket to one single format, an array, so that when I apply explode I get one id per row.
If you are starting with a DataFrame like:
df_transf_1.show(truncate=False)
#+--------------------------------+------------------------------+
#|user_id |products_basket |
#+--------------------------------+------------------------------+
#|b5ad805c-f295-4852-82fc-961a88 |12732936 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |['Gklb38', '123655'] |
#|0E3D17EA-BEEF-4931-8104 |12909841 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|
#+--------------------------------+------------------------------+
where the products_basket column is a StringType:
df.printSchema()
#root
# |-- user_id: string (nullable = true)
# |-- products_basket: string (nullable = true)
You can't call explode on products_basket because it's not an array or map.
One workaround is to remove any leading/trailing square brackets and then split the string on ", " (comma followed by a space). This will convert the string into an array of strings.
from pyspark.sql.functions import col, regexp_replace, split

df_transf_new = df_transf_1.withColumn(
    "products_basket",
    split(regexp_replace(col("products_basket"), r"(^\[)|(\]$)|(')", ""), ", ")
)
df_transf_new.show(truncate=False)
#+--------------------------------+------------------------------+
#|user_id |products_basket |
#+--------------------------------+------------------------------+
#|b5ad805c-f295-4852-82fc-961a88 |[12732936] |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |
#|0E3D17EA-BEEF-4931-8104 |[12909841] |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|
#+--------------------------------+------------------------------+
The regular expression pattern matches any of the following:
(^\[): An opening square bracket at the start of the string
(\]$): A closing square bracket at the end of the string
('): Any single quote (because your strings are quoted)
and replaces these with an empty string.
This assumes that your data does not contain any single quotes or square brackets that you need to keep inside products_basket.
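As a quick sanity check of the pattern on one sample value (plain Python re here, purely for illustration; the sample string s is an assumption):

import re

s = "['Gklb38', '123655']"
# strip the outer brackets and the single quotes, then split on ", "
cleaned = re.sub(r"(^\[)|(\]$)|(')", "", s)
print(cleaned)              # Gklb38, 123655
print(cleaned.split(", "))  # ['Gklb38', '123655']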
After the split, the schema of the new DataFrame is:
df_transf_new.printSchema()
#root
# |-- user_id: string (nullable = true)
# |-- products_basket: array (nullable = true)
# | |-- element: string (containsNull = true)
Now you can call explode:
from pyspark.sql.functions import explode
df_transf_new.withColumn("product_id", explode("products_basket")).show(truncate=False)
#+--------------------------------+------------------------------+----------+
#|user_id |products_basket |product_id|
#+--------------------------------+------------------------------+----------+
#|b5ad805c-f295-4852-82fc-961a88 |[12732936] |12732936 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |Gklb38 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |123655 |
#|0E3D17EA-BEEF-4931-8104 |[12909841] |12909841 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12645715 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12909837 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12909837 |
#+--------------------------------+------------------------------+----------+

Filter a DataFrame based on an element from an array column

I'm working with a DataFrame with this schema:
root
|-- c: long (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I'm trying to filter this DataFrame based on an element ["value1", "key1"] in the array column data, i.e. if this element exists in data for a row, keep it, otherwise delete it. I tried
df.filter(col("data").contain("["value1", "key1"])
but it didn't work. I also tried val f = Array("value1", "key1") and then df.filter(col("data").contain(f)), but that didn't work either.
Any help please?
A straightforward approach would be to use a udf function, since a udf lets you apply the logic row by row on primitive datatypes (which is what your requirement suggests: checking every key and value of the struct elements in the array column data).
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

//udf to check for key1 in key and value1 in value of every struct in the array field
def containsUdf = udf((data: Seq[Row]) => data.exists(row => row.getAs[String]("key") == "key1" && row.getAs[String]("value") == "value1"))

//calling the udf function in the filter
val filteredDF = df.filter(containsUdf(col("data")))
The resulting filteredDF should be your desired output.

Pyspark RDD to DataFrame with Enforced Schema: Value Error

I am working with pyspark with a schema commensurate with that shown at the end of this post (note the nested lists and unordered fields), initially imported from Parquet as a DataFrame. Fundamentally, the issue that I am running into is the inability to process this data as an RDD and then convert it back to a DataFrame. (I have reviewed several related posts, but I still cannot tell where I am going wrong.)
Trivially, the following code works fine (as one would expect):
schema = deepcopy(tripDF.schema)
tripRDD = tripDF.rdd
tripDFNew = sqlContext.createDataFrame(tripRDD, schema)
tripDFNew.take(1)
Things do not work when I need to map the RDD (as would be the case to add a field, for instance).
schema = deepcopy(tripDF.schema)
tripRDD = tripDF.rdd

def trivial_map(row):
    rowDict = row.asDict()
    return pyspark.Row(**rowDict)

tripRDDNew = tripRDD.map(lambda row: trivial_map(row))
tripDFNew = sqlContext.createDataFrame(tripRDDNew, schema)
tripDFNew.take(1)
The code above gives the following exception where XXX is a stand-in for an integer, which changes from run to run (e.g., I've seen 1, 16, 23, etc.):
File "/opt/cloudera/parcels/CDH-5.8.3-
1.cdh5.8.3.p1967.2057/lib/spark/python/pyspark/sql/types.py", line 546, in
toInternal
raise ValueError("Unexpected tuple %r with StructType" % obj)
ValueError: Unexpected tuple XXX with StructType`
Given this information, is there a clear error in the second block of code? (I note that tripRDD is of class rdd.RDD while tripRDDNew is of class rdd.PipelinedRDD, but I don't think this should be a problem.) (I also note that the schema for tripRDD is not sorted by field name, while the schema for tripRDDNew is sorted by field name. Again, I don't see why this would be a problem.)
Schema:
root
|-- foo: struct (nullable = true)
| |-- bar_1: integer (nullable = true)
| |-- bar_2: integer (nullable = true)
| |-- bar_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- baz_1: integer (nullable = true)
| | | |-- baz_2: string (nullable = true)
| | | |-- baz_3: double (nullable = true)
| |-- bar_4: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- baz_1: integer (nullable = true)
| | | |-- baz_2: string (nullable = true)
| | | |-- baz_3: double (nullable = true)
|-- qux: integer (nullable = true)
|-- corge: integer (nullable = true)
|-- uier: integer (nullable = true)
As noted in the post, the original schema has fields that are not alphabetically ordered. Therein lies the problem. The use of .asDict() in the mapping function orders the fields of the resulting RDD. The field order of tripRDDNew is in conflict with schema at the call to createDataFrame. The ValueError results from an attempt to parse one of the integer fields (i.e., qux, corge, or uier in the example) as a StructType.
(As an aside: It is a little surprising that createDataFrame requires the schema fields to have the same order as the RDD fields. You should either need consistent field names OR consistent field ordering, but requiring both seems like overkill.)
(As a second aside: The existence of non-alphabetical fields in the DataFrame is somewhat abnormal. For instance, sc.parallelize() will automatically order fields alphabetically when distributing the data structure. It seems like the fields should be ordered when importing the DataFrame from the parquet file. It might be interesting to investigate why this is not the case.)
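To make the explanation concrete, here is a minimal sketch of one way around the mismatch, assuming the only goal is to preserve the original field order: emit plain tuples in schema order instead of rebuilding a Row from the dict, which loses that order. (The names field_order and ordered_map are illustrative, not from the original post.)

from copy import deepcopy

schema = deepcopy(tripDF.schema)
field_order = [f.name for f in schema.fields]  # the original, non-alphabetical order

def ordered_map(row):
    rowDict = row.asDict()
    # return a plain tuple in the same order as the schema,
    # so createDataFrame can match fields by position
    return tuple(rowDict[name] for name in field_order)

tripRDDNew = tripDF.rdd.map(ordered_map)
tripDFNew = sqlContext.createDataFrame(tripRDDNew, schema)
tripDFNew.take(1)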

Accessing a Nested Map column in Spark Dataframes without using explode

I have a column in a Spark dataframe where the schema looks something like this:
|-- seg: map (nullable = false)
| |-- key: string
| |-- value: array (valueContainsNull = false)
| | |-- element: struct (containsNull = false)
| | | |-- id: integer (nullable = false)
| | | |-- expiry: long (nullable = false)
The value in the column looks something like this:
Map(10000124 -> WrappedArray([20185255,1561507200], [20185256,1561507200]))
What I want to do is create a column from this Map column which only contains an array of [20185255, 20185256] (the elements of that array are the first element, id, of each struct in the WrappedArray). How do I do this?
I am trying not to use "explode".
Also, is there a way I can use a UDF which takes in the Map and gets those values?
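Since the question explicitly asks about a UDF over the map, here is a minimal hedged sketch of that idea in PySpark (assuming a PySpark context; the DataFrame name df, the output column name ids, and the helper collect_ids are assumptions, not from the original post):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

# the map column arrives in the UDF as a Python dict of
# {string -> list of Row(id, expiry)}; collect every id field
def collect_ids(seg):
    if seg is None:
        return None
    return [item.id for values in seg.values() for item in values]

collect_ids_udf = F.udf(collect_ids, ArrayType(IntegerType()))

df = df.withColumn("ids", collect_ids_udf(F.col("seg")))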