Column renaming in a PySpark dataframe

I have column names with special characters. I renamed the columns and tried to save, but the save failed saying the columns have special characters. When I run printSchema on the dataframe, I see the column names without any special characters. Here is the code I tried:
for c in df_source.columns:
    df_source = df_source.withColumnRenamed(c, c.replace("(", ""))
    df_source = df_source.withColumnRenamed(c, c.replace(")", ""))
    df_source = df_source.withColumnRenamed(c, c.replace(".", ""))

df_source.coalesce(1).write.format("parquet").mode("overwrite").option("header", "true").save(stg_location)
and I get the following error:
Caused by: org.apache.spark.sql.AnalysisException: Attribute name "Number_of_data_samples_(input)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
One more thing I noticed: when I do a df_source.show() or display(df_source), both show the same error, while printSchema shows that there are no special characters.
Can someone help me find a solution for this?

Try it as below.
Input dataframe:
from pyspark.sql.types import *
from pyspark.sql.functions import *
data = [("xyz", 1)]
schema = StructType([StructField("Number_of_data_samples_(input)", StringType(), True), StructField("id", IntegerType())])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
+------------------------------+---+
|Number_of_data_samples_(input)| id|
+------------------------------+---+
| xyz| 1|
+------------------------------+---+
Method 1
Use regular expressions to replace the special characters, then rename all the columns at once with toDF():
import re
cols = [re.sub(r"\.|\)|\(", "", i) for i in df.columns]
df.toDF(*cols).show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
| xyz| 1|
+----------------------------+---+
Method 2
Using .withColumnRenamed()
for i, j in zip(df.columns, cols):
    df = df.withColumnRenamed(i, j)
df.show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
| xyz| 1|
+----------------------------+---+
Method 3
Using .withColumn to create a new column and drop the existing column
df = df.withColumn("Number_of_data_samples_input", col("Number_of_data_samples_(input)")).drop(col("Number_of_data_samples_(input)"))
df.show()
+---+----------------------------+
| id|Number_of_data_samples_input|
+---+----------------------------+
| 1| xyz|
+---+----------------------------+

Related

Using rlike with a list to create a new df in Scala

I just started with Scala two days ago.
Here's the thing: I have a df and a list. The df contains two columns, paragraphs and authors; the list contains words (strings). I need to get, for each word in the list and each author, the count of paragraphs in which that word appears.
So far my idea was to loop over the list, query the df with rlike and build a new df, but even if that works I wouldn't know how to do it. Any help is appreciated!
Edit: Adding example data and expected output
// Example df and list
val df = Seq(("auth1", "some text word1"), ("auth2", "some text word2"), ("auth3", "more text word1")).toDF("a", "t")
df.show
+-------+---------------+
| a| t|
+-------+---------------+
|auth1 |some text word1|
|auth2 |some text word2|
|auth1 |more text word1|
+-------+---------------+
val list = List("word1", "word2")
// Expected output
newDF.show
+-------+-----+----------+
| word| a|text count|
+-------+-----+----------+
|word1 |auth1| 2|
|word2 |auth2| 1|
+-------+-----+----------+
You can do a filter and aggregation for each word in the list, and combine all the resulting dataframes using unionAll:
val result = list.map(word =>
  df.filter(df("t").rlike(s"\\b${word}\\b"))
    .groupBy("a")
    .agg(lit(word).as("word"), count(lit(1)).as("text count"))
).reduce(_ unionAll _)
result.show
+-----+-----+----------+
| a| word|text count|
+-----+-----+----------+
|auth3|word1| 1|
|auth1|word1| 1|
|auth2|word2| 1|
+-----+-----+----------+
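If you are on Spark 2.x or later, note that unionAll is deprecated in favour of union. You can also avoid launching one aggregation per word by crossing the word list with the dataframe and aggregating in a single pass. A rough sketch of that idea, assuming the words occur as space-delimited tokens in column t (wordsDf and result2 are made-up names):
import org.apache.spark.sql.functions.{array_contains, col, count, lit, split}
import spark.implicits._

// Turn the word list into a one-column dataframe, pair every word with every row,
// keep the rows whose text contains that word, then count per (word, author).
val wordsDf = list.toDF("word")
val result2 = df.crossJoin(wordsDf)
  .where(array_contains(split(col("t"), " "), col("word")))
  .groupBy("word", "a")
  .agg(count(lit(1)).as("text count"))
result2.show()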

How to read a csv file into a dataframe when the header is delimited with "," and the rest of the rows are delimited with "|"

I have a csv file whose header is comma separated, while the rest of the rows are separated with another delimiter, "|". How do I handle this mixed-delimiter scenario? Please advise.
import org.apache.spark.sql.{DataFrame, SparkSession}

var df1: DataFrame = null
df1 = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .csv("/testing.csv")
df1.show(10)
This command splits the header correctly, but all the data ends up in the first column and the remaining columns are filled with null values.
Read the csv first, then split the first column and create a new dataframe:
df.show
+---------+----+-----+
| Id|Date|Value|
+---------+----+-----+
|1|2020|30|null| null|
|1|2020|40|null| null|
|2|2020|50|null| null|
|2|2020|40|null| null|
+---------+----+-----+
val cols = df.columns
val index = 0 to cols.size - 1
val expr = index.map(i => col("array")(i))
df.withColumn("array", split($"Id", "\\|"))
.select(expr: _*).toDF(cols: _*).show
+---+----+-----+
| Id|Date|Value|
+---+----+-----+
| 1|2020| 30|
| 1|2020| 40|
| 2|2020| 50|
| 2|2020| 40|
+---+----+-----+
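For completeness, a minimal end-to-end sketch that ties the two steps together under the question's setup (the path /testing.csv is the question's placeholder): read with the comma-delimited header so the column names come out right, then split the pipe-delimited first column back into those columns.
import org.apache.spark.sql.functions.{col, split}

val raw = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .csv("/testing.csv")

// The header supplied the column names, but every data row landed in the first column.
val cols = raw.columns
val parts = split(col(cols.head), "\\|")
val fixed = raw.select(cols.indices.map(i => parts(i).as(cols(i))): _*)
fixed.show()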

Spark SQL DataFrame API - build filter condition dynamically

I have two Spark dataframes, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns specified in another file.
For example, the column lookup file is something like this:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build the where condition dynamically for the above scenario, because the lookup file is configurable and might have 1 to n fields.
You can use the except dataframe method. For simplicity, I'm assuming that the columns to use are in two lists. The order of both lists must be correct: the columns at the same position in each list will be compared (regardless of column name). After the except, use a join to get the remaining columns from the first dataframe.
import org.apache.spark.sql.functions.broadcast

val df1 = Seq(("shankar", "12121", 28), ("ramesh", "1212", 29), ("suresh", "1111", 30), ("aarush", "0707", 15))
  .toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"), ("ramesh", "1212", 29, "malmo"))
  .toDF("eName", "eNo", "age", "city")

val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")

val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
  .except(df2.select(df2Cols.head, df2Cols.tail: _*))

val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
If you're doing this from a SQL query, I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replacement in the query to normalize them to the df1 or df2 column names.
Once you have that, you can diff using something like How to obtain the difference between two DataFrames?
If you need more columns that aren't used in the diff (e.g. age), you can reselect the data based on your diff results. This may not be the optimal way of doing it, but it would probably work.
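If the goal is simply "rows of df1 with no match in df2 on the configured column pairs", a left_anti join is another option worth sketching: it keeps all of df1's columns, so no second join is needed. This is only a sketch; the pairs value is an assumption standing in for whatever the lookup file parses to.
// Each pair maps a df1 column to the df2 column it should be compared with.
val pairs = Seq(("name", "eName"), ("empNo", "eNo"))

// Build a single join condition requiring every configured pair to match.
val cond = pairs.map { case (c1, c2) => df1(c1) === df2(c2) }.reduce(_ && _)

// left_anti keeps the df1 rows that have no matching row in df2.
val nonMatching = df1.join(df2, cond, "left_anti")
nonMatching.show()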

How to use regexp_replace in Spark

I am pretty new to Spark and would like to perform an operation on a column of a dataframe so as to replace all the "," in the column with ".".
Assume there is a dataframe x with a column x4:
x4
1,3435
1,6566
-0,34435
I want the output to be:
x4
1.3435
1.6566
-0.34435
The code I am using is
import org.apache.spark.sql.Column
def replace = regexp_replace((x.x4,1,6566:String,1.6566:String)x.x4)
But I get the following error
import org.apache.spark.sql.Column
<console>:1: error: ')' expected but '.' found.
def replace = regexp_replace((train_df.x37,0,160430299:String,0.160430299:String)train_df.x37)
Any help on the syntax, logic or any other suitable way would be much appreciated
Here's a reproducible example, assuming x4 is a string column.
import org.apache.spark.sql.functions.regexp_replace

val df = spark.createDataFrame(Seq(
  (1, "1,3435"),
  (2, "1,6566"),
  (3, "-0,34435"))).toDF("Id", "x4")
The syntax is regexp_replace(str, pattern, replacement), which translates to:
df.withColumn("x4New", regexp_replace(df("x4"), "\\,", ".")).show
+---+--------+--------+
| Id| x4| x4New|
+---+--------+--------+
| 1| 1,3435| 1.3435|
| 2| 1,6566| 1.6566|
| 3|-0,34435|-0.34435|
+---+--------+--------+
We could use the map method to do this transformation:
df.map(each => (each.getInt(0), each.getString(1).replaceAll(",", ".")))
  .toDF("Id", "x4")
  .show
Output:
+---+--------+
| Id| x4|
+---+--------+
| 1| 1.3435|
| 2| 1.6566|
| 3|-0.34435|
+---+--------+
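Since only a single character is being swapped, translate is a lighter-weight alternative to regexp_replace; a quick sketch against the same df as above:
import org.apache.spark.sql.functions.translate

// translate maps each character in "," to the character at the same position in ".",
// i.e. every comma becomes a dot, with no regex involved.
df.withColumn("x4New", translate(df("x4"), ",", ".")).show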

How to handle null/empty values in a dataframe (Spark/Scala)

I have a CSV file and I am processing its data.
I am working with data frames, and I calculate average, min, max, mean, sum of each column based on some conditions. The data of each column could be empty or null.
I have noticed that in some cases I get a null value as the max or the sum instead of a number, or max() returns a number which is less than what min() returns.
I do not want to replace the null/empty values with other.
The only thing I have done is to use these 2 options in CSV:
.option("nullValue", "null")
.option("treatEmptyValuesAsNulls", "true")
Is there any way to handle this issue? Has anyone faced this problem before? Is it a problem with data types?
I run something like this:
data.agg(mean("col_name"), stddev("col_name"), count("col_name"),
  min("col_name"), max("col_name"))
Otherwise I can consider that it is a problem in my code.
I have done some research on this question, and the results show that the mean, max and min functions ignore null values. Below is the experiment code and results.
Environment: Scala, Spark 1.6.1, Hadoop 2.6.0
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val row1 = Row("1", 2.4, "2016-12-21")
val row2 = Row("1", None, "2016-12-22")
val row3 = Row("2", None, "2016-12-23")
val row4 = Row("2", None, "2016-12-23")
val row5 = Row("3", 3.0, "2016-12-22")
val row6 = Row("3", 2.0, "2016-12-22")

val theRdd = sc.makeRDD(Array(row1, row2, row3, row4, row5, row6))

val schema = StructType(
  StructField("key", StringType, false) ::
  StructField("value", DoubleType, true) ::
  StructField("d", StringType, false) :: Nil)

val df = sqlContext.createDataFrame(theRdd, schema)
df.show()
df.agg(mean($"value"), max($"value"), min($"value")).show()
df.groupBy("key").agg(mean($"value"), max($"value"), min($"value")).show()
Output:
+---+-----+----------+
|key|value| d|
+---+-----+----------+
| 1| 2.4|2016-12-21|
| 1| null|2016-12-22|
| 2| null|2016-12-23|
| 2| null|2016-12-23|
| 3| 3.0|2016-12-22|
| 3| 2.0|2016-12-22|
+---+-----+----------+
+-----------------+----------+----------+
| avg(value)|max(value)|min(value)|
+-----------------+----------+----------+
|2.466666666666667| 3.0| 2.0|
+-----------------+----------+----------+
+---+----------+----------+----------+
|key|avg(value)|max(value)|min(value)|
+---+----------+----------+----------+
| 1| 2.4| 2.4| 2.4|
| 2| null| null| null|
| 3| 2.5| 3.0| 2.0|
+---+----------+----------+----------+
From the output you can see that the mean, max and min functions on column 'value' for group key='1' return '2.4' instead of null, which shows that null values were ignored by these functions. However, if a group contains only null values, these functions return null.
Contrary to one of the comments, it is not true that nulls are ignored. Here is an approach:
max(coalesce(col_name,Integer.MinValue))
min(coalesce(col_name,Integer.MaxValue))
This will still have an issue if there are only null values: you will need to convert Min/MaxValue back to null, or to whatever you want to use to represent "no valid/non-null entries".
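As a minimal sketch of that approach, assuming a Double column named "value" as in the experiment above (the sentinel Min/Max values stand in for null and would need to be mapped back afterwards):
import org.apache.spark.sql.functions.{coalesce, col, lit, max, min}

df.agg(
  // Nulls become the smallest/largest possible value, so they can never win the aggregate.
  max(coalesce(col("value"), lit(Double.MinValue))).as("max_value"),
  min(coalesce(col("value"), lit(Double.MaxValue))).as("min_value")
).show()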
To add to the other answers:
Remember that null and NaN are different things to Spark:
NaN is "not a number", and numeric aggregations on a column containing NaN result in NaN.
null is a missing value, and numeric aggregations on a column containing null ignore it, as if the row weren't there at all.
import numpy as np
from pyspark.sql import functions as F

df_ = spark.createDataFrame([(1, np.nan), (None, 2.0), (3, 4.0)], ("a", "b"))
df_.show()
+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|2.0|
|   3|4.0|
+----+---+
df_.agg(F.mean("a"), F.mean("b")).collect()
[Row(avg(a)=2.0, avg(b)=nan)]