Creating pyspark dataframe from same entries in the list - pyspark

I have two lists: col_name=['col1','col2','col3'] and col_value=['val1', 'val2', 'val3']. I am trying to create a dataframe from them, with col_name supplying the column names. I need the output to have 3 columns and 1 row (with col_name as the header).
I am finding it difficult to work out a solution. Please help.

Construct the data directly and pass it to createDataFrame. The method expects a list of rows, so wrapping col_value in an outer list makes it a single row:
col_name = ['col1', 'col2', 'col3']
col_value = ['val1', 'val2', 'val3']
data = [col_value]
df = spark.createDataFrame(data, col_name)
df.printSchema()
df.show(truncate=False)
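If there are several value lists, each one becomes a row. Below is a minimal sketch of that idea in plain Python; build_rows is a hypothetical helper (not part of PySpark) that only checks each value list against the column names and returns the row list. Passing the result to spark.createDataFrame assumes an active SparkSession named spark, so that call is left as a comment.

```python
def build_rows(col_name, *value_lists):
    """Return a list of row tuples, one per value list (hypothetical helper)."""
    rows = []
    for values in value_lists:
        # Each row must supply exactly one value per column name.
        if len(values) != len(col_name):
            raise ValueError(
                f"expected {len(col_name)} values, got {len(values)}"
            )
        rows.append(tuple(values))
    return rows


col_name = ['col1', 'col2', 'col3']
col_value = ['val1', 'val2', 'val3']

rows = build_rows(col_name, col_value)

# With an active SparkSession named `spark` (an assumption here):
# df = spark.createDataFrame(rows, col_name)
```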

Related

Pyspark: Identify the arrayType column from the Struct and call udf to convert array to string

I am creating an accelerator that migrates data from a source to a destination. For example, I pick up data from an API and migrate it to CSV. I ran into issues handling ArrayType columns when the data is converted to CSV. I used withColumn and concat_ws for this conversion (i.e., df1=df.withColumn('films',F.concat_ws(':',F.col("films"))), where films is the ArrayType column), and it worked. Now I want this to happen dynamically: without specifying the column name, is there a way to pick out the columns in the struct that have ArrayType and then call the udf?
Thank you for your time!
You can get the type of the columns using df.schema. Depending on the type of the column you can apply concat_ws or not:
from pyspark.sql import functions as F, types as T
data = [["test1", "test2", [1,2,3], ["a","b","c"]]]
schema = ["col1", "col2", "arr1", "arr2"]
df = spark.createDataFrame(data, schema)
# Join every ArrayType column into a ':'-separated string
array_cols = [F.concat_ws(":", c.name).alias(c.name)
              for c in df.schema if isinstance(c.dataType, T.ArrayType)]
# Pass all other columns through unchanged
other_cols = [F.col(c.name)
              for c in df.schema if not isinstance(c.dataType, T.ArrayType)]
df = df.select(other_cols + array_cols)
Result:
+-----+-----+-----+-----+
| col1| col2| arr1| arr2|
+-----+-----+-----+-----+
|test1|test2|1:2:3|a:b:c|
+-----+-----+-----+-----+
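Note that the select above places the array columns after all other columns. If the original column order matters, one option is to classify the columns by name first. The sketch below does this from the (name, type string) pairs that DataFrame.dtypes returns; split_array_columns is a hypothetical helper, and sample_dtypes is a hand-written stand-in for df.dtypes on the example dataframe, so the block runs without Spark.

```python
def split_array_columns(dtypes):
    """Split (name, type_string) pairs, as returned by DataFrame.dtypes,
    into array column names and all other column names (hypothetical helper)."""
    array_cols = [name for name, dtype in dtypes if dtype.startswith("array<")]
    other_cols = [name for name, dtype in dtypes if not dtype.startswith("array<")]
    return array_cols, other_cols


# Hand-written stand-in for df.dtypes on the example dataframe (an assumption).
sample_dtypes = [
    ("col1", "string"),
    ("col2", "string"),
    ("arr1", "array<bigint>"),
    ("arr2", "array<string>"),
]

array_cols, other_cols = split_array_columns(sample_dtypes)

# With a real dataframe, the names could drive a select that keeps column order:
# df.select([F.concat_ws(":", n).alias(n) if n in array_cols else F.col(n)
#            for n in df.columns])
```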

Difference in SparkSQL Dataframe columns

How do I find the differences between the columns of 2 dataframes?
They are causing issues when I join the 2 dataframes.
df1_cols = df1.columns
df2_cols = df2.columns
This returns the columns of the 2 dataframes in 2 list variables.
Thanks
df.columns returns a list, so you can use any Python tool to compare it with another list such as df2_cols. For example, you can use sets to find the common columns and the columns unique to each DataFrame:
df1_cols = df1.columns
df2_cols = df2.columns
set(df1_cols).intersection(set(df2_cols)) # check common columns
set(df1_cols) - set(df2_cols) # check columns in df1 but not in df2
set(df2_cols) - set(df1_cols) # check columns in df2 but not in df1
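Sets do not preserve column order. If the order matters (for example, when reporting the mismatches), a small variant that keeps the original ordering might look like the sketch below. compare_columns is a hypothetical helper, and the column lists are made-up stand-ins for df1.columns and df2.columns.

```python
def compare_columns(left_cols, right_cols):
    """Compare two column-name lists, keeping each list's original order."""
    left, right = set(left_cols), set(right_cols)
    return {
        "common": [c for c in left_cols if c in right],
        "only_left": [c for c in left_cols if c not in right],
        "only_right": [c for c in right_cols if c not in left],
    }


# Made-up stand-ins for df1.columns and df2.columns.
df1_cols = ["id", "name", "dept"]
df2_cols = ["id", "descr", "dept"]

diff = compare_columns(df1_cols, df2_cols)
```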

Spark filter out columns and create dataFrame with remaining columns and create dataFrame with filtered columns

I am new to Spark.
I have loaded a CSV file into a Spark DataFrame, say OriginalDF
Now I want to
1. filter out some columns and create a new dataframe from the remaining columns of OriginalDF
2. create a dataframe out of the extracted columns
How can these 2 dataframes be created in spark scala?
Using select, you can choose which columns you want.
val df2 = OriginalDF.select($"col1",$"col2",$"col3")
Using filter, you can filter the rows.
val df3 = OriginalDF.filter($"col1" < 10)
Another way to filter rows is where. filter and where are synonyms, so you can use them interchangeably.
val df3 = OriginalDF.where($"col1" < 10)
Note that select and filter each return a new dataframe.

Rename column names when select from dataframe

I have 2 dataframes, df1 and df2. I left join them on the id column and save the result to another dataframe named df3. Below is the code that I am using, which works as expected.
val df3 = df1.alias("tab1").join(df2.alias("tab2"),Seq("id"),"left_outer").select("tab1.*","tab2.name","tab2.dept","tab2.descr");
I would like to rename the tab2.descr column to dept_full_description within the above statement.
I am aware that I could create a seq val like below and use toDF method
val columnsRenamed = Seq("id", "empl_name", "name","dept","dept_full_description") ;
val df4 = df3.toDF(columnsRenamed: _*);
Is there any other way to do the aliasing in the first statement itself? My end goal is to avoid listing about 30-40 columns explicitly.
I'd rename before the join:
df1.alias("tab1").join(
  df2.withColumnRenamed("descr", "dept_full_description").alias("tab2"),
  Seq("id"), "left_outer")

Creating a new column by applying a function in an existing column in PySpark?

Say I have a dataframe
product_id customers
1 [1,2,4]
2 [1,2]
I want to create a new column, say nb_customer by applying the function len on the column customers.
I tried
df = df.select('*', (map(len, df.customers)).alias('nb_customer'))
but it does not work.
What is the correct way to do that?
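Python's map and len work on local Python objects, not on Spark columns. Use the built-in size function from pyspark.sql.functions, which returns the number of elements in an array column:
import pyspark.sql.functions as f
df = sc.parallelize([
    [1,[1,2,4]],
    [2,[1,2]]
]).toDF(('product_id', 'customers'))
df.withColumn('nb_customer',f.size(df.customers)).show()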