Pyspark - passing list/tuple to toDF function - pyspark

I have a dataframe and want to rename its columns using toDF, passing the column names from a list. The column list is dynamic, and when I do the following I get an error. How can I achieve this?
>>> df.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- dept: string (nullable = true)
columns = ['NAME_FIRST', 'DEPT_NAME']
df2 = df.toDF('ID', 'NAME_FIRST', 'DEPT_NAME')
(or)
df2 = df.toDF('id', columns[0], columns[1])
This does not work if we don't know how many columns the input dataframe has, so I want to pass the list itself. I tried:
df2 = df.toDF('id', columns)
pyspark.sql.utils.IllegalArgumentException: u"requirement failed: The number of columns doesn't match.\nOld column names (3): id, name, dept\nNew column names (2): id, name_first, dept_name"
Here it treats the list as a single item. How can I pass the column names from the list?

df2 = df.toDF(columns) does not work; add a * like below:
columns = ['NAME_FIRST', 'DEPT_NAME']
df2 = df.toDF(*columns)
"*" is the "splat" operator: It takes a list as input, and expands it into actual positional arguments in the function call

What you tried is almost correct, except that you did not add all the columns to your "columns" list; the number of new names must match the number of existing columns:
columns = ['ID', 'NAME_FIRST', 'DEPT_NAME']
Note, however, that toDF accepts a bare list only when called on an RDD; on an existing DataFrame you still need to unpack it with toDF(*columns). Here are all the steps I followed in pyspark, starting from an RDD:
data = [(1, 'a', 'b'), (2, 'c', 'd'), (3, 'e', 'f')]
rdd = sc.parallelize(data)
columns = ['ID', 'NAME_FIRST', 'DEPT_NAME']
df2 = rdd.toDF(columns)

Related

Not able to write loop expression in withcolumn in pyspark

I have a dataframe where the DealKeys column has data like:
[{"Charge_Type": "DET", "Country": "VN", "Tariff_Loc": "VNSGN"}]
The expected output would be:
[{"keyname": "Charge_Type", "value": "DET", "description": "..."}, {"keyname": "Country", "value": "VN", "description": "..."}, {"keyname": "Tariff_Loc", "value": "VNSGN", "description": "..."}]
When I try to create the new column I get the error below:
df = df2.withColumn('new_column',({'keyname' : i, 'value' : dictionary[i],'description' : "..."} for i in col("Dealkeys")))
Error: Column is not iterable
DF2 schema:
root
|-- Charge_No: string (nullable = true)
|-- Status: string (nullable = true)
|-- DealKeys: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Charge_Type: string (nullable = true)
| | |-- Country: string (nullable = true)
| | |-- Tariff_Loc: string (nullable = true)
We cannot iterate through a dataframe column in Pyspark, hence the error occurred.
To get the expected output, you need to follow the approach specified below.
Create new_column column with value as an empty string beforehand so that we can update its value as we iterate through each row.
Since a column cannot be iterated, we can use collect() method to get DealKeys values so that we can insert them into the corresponding new_column value.
df.collect() returns a list of Row objects, which can be iterated through. As per the schema, each element of DealKeys is itself a Row. Using dealkey_row (the first DealKeys element) together with asDict(), a list comprehension builds the list of dictionaries to be inserted for the corresponding Charge_No value.
# df is the initial dataframe
from pyspark.sql.functions import col, lit, when

df = df.withColumn("new_column", lit(''))
rows = df.collect()
for row in rows:
    key = row[0]             # Charge_No column value (string type)
    dealkey_row = row[2][0]  # first element of the DealKeys array (Row type)
    lst = [{'keyname': i, 'value': dealkey_row[i], 'description': "..."} for i in dealkey_row.asDict()]  # row.asDict() to get a dictionary
    df = df.withColumn('new_column', when(col('Charge_No') == key, str(lst)).otherwise(col('new_column')))
df.show(truncate=False)
Row.asDict() converts a row into a dictionary so that the list comprehension can iterate over its keys. Using withColumn() together with when(<condition>, <update_value>) in pyspark, insert the output of the list comprehension into the new_column column (otherwise() retains the previous value when the Charge_No value does not match).
The above code produced the expected output when I ran it.

How to find whether df column name contains substring in Scala

My df has multiple columns. I want to check whether a column name contains a substring, similar to % in SQL.
I tried the following, but it does not seem to work. I don't want to give the full name just to find out whether that column exists.
If I can find this column, I also want to rename it using .withColumnRenamed.
Something like:
if (df.columns.contains("%ABC%" or "%BCD%")) df.withColumnrename("%ABC%" or "%BCD%","ABC123") else println(0)
Maybe you can try this.
The filter can help you select the columns that need to be updated, and you write your update logic inside foldLeft.
foldLeft is a useful method in Scala; if you want to learn more about it, search for "scala foldLeft example".
Good luck.
df.schema.fieldNames
  .filter(name => name.toUpperCase.contains("ABC") || name.toUpperCase.contains("BCD")) // keep only the columns you want to rename
  .foldLeft(df)((acc, name) => acc.withColumnRenamed(name, ("abc_" + name).toLowerCase))
First, find a column that matches your criteria:
df.columns
.filter(c => c.contains("ABC") || c.contains("BCD"))
.take(1)
This will either return an empty Array[String] if no such column exists or an array with a single element if the column does exist. take(1) is there to make sure that you won't be renaming more than one column using the same new name.
Continuing the previous expression, renaming the column boils down to calling foldLeft, which iterates over the collection chaining its second argument to the "zero" (df in this case):
.foldLeft(df)((ds, c) => ds.withColumnRenamed(c, "ABC123"))
If the array was empty, nothing will get called and the result will be the original df.
Here it is in action:
df.printSchema
// root
// |-- AB: integer (nullable = false)
// |-- ABCD: string (nullable = true)
df.columns
.filter(c => c.contains("ABC") || c.contains("BCD"))
.take(1)
.foldLeft(df)(_.withColumnRenamed(_, "ABC123"))
.printSchema
// root
// |-- AB: integer (nullable = false)
// |-- ABC123: string (nullable = true)

Multiple Spark DataFrame mutations in a single pipe

Consider a Spark DataFrame df with the following schema:
root
|-- date: timestamp (nullable = true)
|-- customerID: string (nullable = true)
|-- orderID: string (nullable = true)
|-- productID: string (nullable = true)
One column should be cast to a different type, while the other columns should just have their whitespace trimmed.
df.select(
$"date",
df("customerID").cast(IntegerType),
$"orderID",
$"productId")
.withColumn("orderID", trim(col("orderID")))
.withColumn("productID", trim(col("productID")))
The operations seem to require different syntax; casting is done via select, while trim is done via withColumn.
I'm used to R and dplyr where all the above would be handled in a single mutate function, so mixing select and withColumn feels a bit cumbersome.
Is there a cleaner way to do this in a single pipe?
You can use either one. The difference is that withColumn will add (or replace if the same name is used) a new column to the dataframe while select will only keep the columns you specified. Depending on the situation, choose one to use.
The cast can be done using withColumn as follows:
df.withColumn("customerID", $"customerID".cast(IntegerType))
.withColumn("orderID", trim($"orderID"))
.withColumn("productID", trim($"productID"))
Note that you do not need to use withColumn on the date column above.
The trim functions can be done in a select as follows; here the column names are kept the same:
df.select(
$"date",
$"customerID".cast(IntegerType),
trim($"orderID").as("orderID"),
trim($"productID").as("productID"))
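If there are many columns, a single select can also be built programmatically. Below is a minimal sketch (my own illustration, not from the original answer, and assuming the schema shown in the question): cast customerID, trim every other string column, and pass the rest through unchanged.
import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.{IntegerType, StringType}

// One Column expression per field: cast customerID, trim the remaining
// string columns, and keep all other columns as they are.
val cleaned = df.select(df.schema.fields.map { f =>
  f.name match {
    case "customerID"                     => col(f.name).cast(IntegerType).as(f.name)
    case name if f.dataType == StringType => trim(col(name)).as(name)
    case name                             => col(name)
  }
}: _*)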

How to join datasets with same columns and select one?

I have two Spark dataframes which I am joining and selecting afterwards. I want to select a specific column of one of the Dataframes. But the same column name exists in the other one. Therefore I am getting an Exception for ambiguous column.
I have tried this:
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left").select($"d1.columnName")
and this:
d1.join(d2, d1("id") === d2("id"), "left").select($"d1.columnName")
but it does not work.
Which Spark version are you using? Can you put a sample of your dataframes?
Try this:
val d2prim = d2.withColumnRenamed("columnName", "d2_columnName")
d1.join(d2prim, Seq("id"), "left_outer").select("columnName")
I have two dataframes
val d1 = spark.range(3).withColumn("columnName", lit("d1"))
scala> d1.printSchema
root
|-- id: long (nullable = false)
|-- columnName: string (nullable = false)
val d2 = spark.range(3).withColumn("columnName", lit("d2"))
scala> d2.printSchema
root
|-- id: long (nullable = false)
|-- columnName: string (nullable = false)
which I am joining and selecting afterwards.
I want to select a specific column of one of the Dataframes. But the same column name exists in the other one.
val q1 = d1.as("d1")
.join(d2.as("d2"), Seq("id"), "left")
.select("d1.columnName")
scala> q1.show
+----------+
|columnName|
+----------+
| d1|
| d1|
| d1|
+----------+
As you can see it just works.
So, why did it not work for you? Let's analyze each.
// you started very well
d1.as("d1")
// but here you used $ to reference a column to join on
// with column references by their aliases
// that won't work
.join(d2.as("d2"), $"d1.id" === $"d2.id", "left")
// same here
// $ + aliased columns won't work
.select($"d1.columnName")
PROTIP: Use d1("columnName") to reference a specific column in a dataframe.
The other query was very close to being fine, but...
d1.join(d2, d1("id") === d2("id"), "left") // <-- so far so good!
.select($"d1.columnName") // <-- that's the issue, i.e. $ + aliased column
This happens because when Spark combines the columns from the two DataFrames it doesn't do any automatic renaming for you. You just need to rename one of the columns before joining; Spark provides a method for this. After the join you can drop the renamed column.
val df2join = df2.withColumnRenamed("id", "join_id")
val joined = df1.join(df2join, $"id" === $"join_id", "left").drop("join_id")
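As an aside (my own sketch, not part of the original answers): on Spark 2.x or later you can also leave both frames untouched and drop the duplicates by column reference after the join, roughly like this:
val deduped = df1.join(df2, df1("id") === df2("id"), "left")
  .drop(df2("id"))          // remove the duplicate join key that comes from df2
  .drop(df2("columnName"))  // only df1's columnName remains, so select("columnName") is unambiguous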

Matching two dataframes in scala

I have two RDDs in Scala and converted them to dataframes.
Now I have two dataframes. The first, prodUniqueDF, has two columns named prodid and uid and holds the product master data:
scala> prodUniqueDF.printSchema
root
|-- prodid: string (nullable = true)
|-- uid: long (nullable = false)
The second, ratingsDF, has columns named prodid, custid, and ratings:
scala> ratingsDF.printSchema
root
|-- prodid: string (nullable = true)
|-- custid: string (nullable = true)
|-- ratings: integer (nullable = false)
I want to join the two and replace ratingsDF.prodid with prodUniqueDF.uid in ratingsDF.
To do this, I first registered them as 'tempTables'
prodUniqueDF.registerTempTable("prodUniqueDF")
ratingsDF.registerTempTable("ratingsDF")
And I run the code
val testSql = sql("SELECT prodUniqueDF.uid, ratingsDF.custid, ratingsDF.ratings FROM prodUniqueDF, ratingsDF WHERE prodUniqueDF.prodid = ratingsDF.prodid")
But I get this error:
org.apache.spark.sql.AnalysisException: Table not found: prodUniqueDF; line 1 pos 66
Please help! How can I achieve the join? Is there another method to map RDDs instead?
Joining the DataFrames can easily be achieved.
The format is
DataFrameA.join(DataFrameB)
By default this performs an inner join, but you can also specify the type of join you want, and there are APIs for that.
You can look here for more information.
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrame
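For example, with the dataframes from the question (a sketch of the join signatures, not code from the original answer):
prodUniqueDF.join(ratingsDF, Seq("prodid"))               // inner join by default
prodUniqueDF.join(ratingsDF, Seq("prodid"), "left_outer")  // same join with an explicit join type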
For replacing the values in an existing column you can use the withColumn method from the API.
It would be something like this:
val newDF = dfA.withColumn("newColumnName", dfB("columnName")).drop("columnName").withColumnRenamed("newColumnName", "columnName")
I think this might do the trick!
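For concreteness, here is a minimal end-to-end sketch of that idea (my own illustration, assuming the schemas shown in the question): join on prodid, keep uid in place of prodid, and keep the remaining ratings columns.
val joined = ratingsDF
  .join(prodUniqueDF, Seq("prodid"), "inner")  // match rows on the shared prodid key
  .select("uid", "custid", "ratings")          // uid effectively replaces prodid
This uses the DataFrame API directly, so no temp tables need to be registered at all.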