Splitting the file name - scala

Good morning guys,
I have the following DataFrame:
+--------------+--------------------------------------------------------+-----------------+
|co_tipo_arquiv|filename                                                |count_tipo_arquiv|
+--------------+--------------------------------------------------------+-----------------+
|05            |hdfs://spbrhdpdev1.br.experian.local:8020/files/files_01|2                |
|01            |hdfs://spbrhdpdev1.br.experian.local:8020/files/files_02|2                |
+--------------+--------------------------------------------------------+-----------------+
I would like to keep only the file name in the filename column,
like this:
+--------------+--------+-----------------+
|co_tipo_arquiv|filename|count_tipo_arquiv|
+--------------+--------+-----------------+
|05            |files_01|2                |
|01            |files_02|2                |
+--------------+--------+-----------------+
I thought about doing a split, but I don't know how to get the last value:
split(col("filename"), "/")
but .last doesn't work:
+--------------+--------------------------------------------------------------+
|co_tipo_arquiv|filename                                                      |
+--------------+--------------------------------------------------------------+
|05            |[hdfs:, , spbrhdpdev1.br.experian.local:8020, files, files_01]|
|01            |[hdfs:, , spbrhdpdev1.br.experian.local:8020, files, files_02]|
+--------------+--------------------------------------------------------------+

From Spark 2.4+:
We can use the element_at function to get the last element of the array.
1. Using element_at function:
import org.apache.spark.sql.functions._
df.withColumn("filename", element_at(split(col("filename"), "/"), -1)).show()
//+--------------+--------+-----------------+
//|co_tipo_arquiv|filename|count_tipo_arquiv|
//+--------------+--------+-----------------+
//| 05|files_01| 2|
//| 01|files_02| 2|
//+--------------+--------+-----------------+
For Spark < 2.4:
2. Using substring_index function:
df.withColumn("filename",substring_index(col("filename"),"/",-1)).show()
//+--------------+--------+-----------------+
//|co_tipo_arquiv|filename|count_tipo_arquiv|
//+--------------+--------+-----------------+
//| 05|files_01| 2|
//| 01|files_02| 2|
//+--------------+--------+-----------------+
3. Using regexp_extract function:
df.withColumn("filename",regexp_extract(col("filename"),"([^\\/]+$)",1)).show()
//+--------------+--------+-----------------+
//|co_tipo_arquiv|filename|count_tipo_arquiv|
//+--------------+--------+-----------------+
//| 05|files_01| 2|
//| 01|files_02| 2|
//+--------------+--------+-----------------+
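
For reference, here is a minimal, self-contained sketch that builds the sample DataFrame from the question and applies the element_at approach (it assumes a spark-shell style SparkSession named spark for the implicits import):
import org.apache.spark.sql.functions.{col, element_at, split}
import spark.implicits._ //for .toDF on a Seq

//sample rows copied from the question
val df = Seq(
  ("05", "hdfs://spbrhdpdev1.br.experian.local:8020/files/files_01", 2),
  ("01", "hdfs://spbrhdpdev1.br.experian.local:8020/files/files_02", 2)
).toDF("co_tipo_arquiv", "filename", "count_tipo_arquiv")

//keep only the last path segment
df.withColumn("filename", element_at(split(col("filename"), "/"), -1)).show()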

Related

spark change DF schema column rename from dot to underscore

I have a dataframe with column names that contain dots.
Example : df.printSchema
user.id_number
user.name.last
user.phone.mobile
etc., and I want to rename the schema by replacing the dots with _:
user_id_number
user_name_last
user_phone_mobile
Note: the input data for this DF is in JSON format (non-relational, NoSQL-like)
Use .map with toDF, selectExpr, or .withColumnRenamed to replace . with _.
1. Using toDF:
val df=Seq(("1","2","3")).toDF("user.id_number","user.name.last","user.phone.mobile")
df.toDF(df.columns.map(x =>x.replace(".","_")):_*).show()
//using replaceAll
df.toDF(df.columns.map(x =>x.replaceAll("\\.","_")):_*).show()
//+--------------+--------------+-----------------+
//|user_id_number|user_name_last|user_phone_mobile|
//+--------------+--------------+-----------------+
//| 1| 2| 3|
//+--------------+--------------+-----------------+
2. Using selectExpr:
val expr=df.columns.map(x =>col(s"`${x}`").alias(s"${x}".replace(".","_")).toString)
df.selectExpr(expr:_*).show()
//+--------------+--------------+-----------------+
//|user_id_number|user_name_last|user_phone_mobile|
//+--------------+--------------+-----------------+
//| 1| 2| 3|
//+--------------+--------------+-----------------+
3. Using .withColumnRenamed:
df.columns.foldLeft(df){(tmpdf,col) =>tmpdf.withColumnRenamed(col,col.replace(".","_"))}.show()
//+--------------+--------------+-----------------+
//|user_id_number|user_name_last|user_phone_mobile|
//+--------------+--------------+-----------------+
//| 1| 2| 3|
//+--------------+--------------+-----------------+

Infinity value in a dataframe spark / scala

I have a dataframe with Infinity values. How can I replace them with 0.0?
I tried this but it doesn't work:
val Nan= dataframe_final.withColumn("Vitesse",when(col("Vitesse").isin(Double.NaN,Double.PositiveInfinity,Double.NegativeInfinity),0.0))
Example of dataframe
--------------------
| Vitesse          |
--------------------
| 8.171069002316942|
| Infinity         |
| 4.290418664272539|
|16.19811830014666 |
|                  |
How can I replace "Infinity" with 0.0?
Thank you.
scala> df.withColumn("Vitesse", when(col("Vitesse").equalTo(Double.PositiveInfinity),0.0).otherwise(col("Vitesse")))
res1: org.apache.spark.sql.DataFrame = [Vitesse: double]
scala> res1.show
+-----------------+
| Vitesse|
+-----------------+
|8.171069002316942|
| 0.0|
|4.290418664272539|
+-----------------+
You can try like above.
Your approach is correct: use when().otherwise().
Add the missing otherwise clause so that Vitesse keeps its value as-is when it is not in Infinity, -Infinity, NaN.
Example:
val df=Seq(("8.171".toDouble),("4.2904".toDouble),("16.19".toDouble),(Double.PositiveInfinity),(Double.NegativeInfinity),(Double.NaN)).toDF("Vitesse")
df.show()
//+---------+
//| Vitesse|
//+---------+
//| 8.171|
//| 4.2904|
//| 16.19|
//| Infinity|
//|-Infinity|
//| NaN|
//+---------+
df.withColumn("Vitesse", when(col("Vitesse").isin(Double.PositiveInfinity,Double.NegativeInfinity,Double.NaN),0.0).
otherwise(col("Vitesse"))).
show()
//+-------+
//|Vitesse|
//+-------+
//| 8.171|
//| 4.2904|
//| 16.19|
//| 0.0|
//| 0.0|
//| 0.0|
//+-------+

spark scala transform a dataframe/rdd

I have a CSV file like below.
PK,key,Value
100,col1,val11
100,col2,val12
100,idx,1
100,icol1,ival11
100,icol3,ival13
100,idx,2
100,icol1,ival21
100,icol2,ival22
101,col1,val21
101,col2,val22
101,idx,1
101,icol1,ival11
101,icol3,ival13
101,idx,3
101,icol1,ival31
101,icol2,ival32
I want to transform this into the following.
PK,idx,key,Value
100,,col1,val11
100,,col2,val12
100,1,idx,1
100,1,icol1,ival11
100,1,icol3,ival13
100,2,idx,2
100,2,icol1,ival21
100,2,icol2,ival22
101,,col1,val21
101,,col2,val22
101,1,idx,1
101,1,icol1,ival11
101,1,icol3,ival13
101,3,idx,3
101,3,icol1,ival31
101,3,icol2,ival32
Basically, I want to create a new column called idx in the output dataframe, populated with the value "n" of the most recent row where key=idx and Value="n" (that idx row included); rows before the first idx row stay empty.
Here is one way using last window function with Spark >= 2.0.0:
import org.apache.spark.sql.functions.{last, when, lit}
import org.apache.spark.sql.expressions.Window
import spark.implicits._ //for the $"colName" syntax
val w = Window.partitionBy("PK").rowsBetween(Window.unboundedPreceding, 0)
df.withColumn("idx", when($"key" === lit("idx"), $"Value"))
.withColumn("idx", last($"idx", true).over(w))
.orderBy($"PK")
.show
Output:
+---+-----+------+----+
| PK| key| Value| idx|
+---+-----+------+----+
|100| col1| val11|null|
|100| col2| val12|null|
|100| idx| 1| 1|
|100|icol1|ival11| 1|
|100|icol3|ival13| 1|
|100| idx| 2| 2|
|100|icol1|ival21| 2|
|100|icol2|ival22| 2|
|101| col1| val21|null|
|101| col2| val22|null|
|101| idx| 1| 1|
|101|icol1|ival11| 1|
|101|icol3|ival13| 1|
|101| idx| 3| 3|
|101|icol1|ival31| 3|
|101|icol2|ival32| 3|
+---+-----+------+----+
The code first creates a new column called idx that contains Value when key == idx and null otherwise. It then takes the last non-null idx (the second argument true means ignore nulls) over the defined window.
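
If you want to reproduce this without reading a file, here is a minimal sketch that builds the sample rows from the question as an in-memory DataFrame (string-typed columns assumed; spark.implicits._ is needed for .toDF):
import spark.implicits._

val df = Seq(
  ("100","col1","val11"), ("100","col2","val12"),
  ("100","idx","1"), ("100","icol1","ival11"), ("100","icol3","ival13"),
  ("100","idx","2"), ("100","icol1","ival21"), ("100","icol2","ival22"),
  ("101","col1","val21"), ("101","col2","val22"),
  ("101","idx","1"), ("101","icol1","ival11"), ("101","icol3","ival13"),
  ("101","idx","3"), ("101","icol1","ival31"), ("101","icol2","ival32")
).toDF("PK", "key", "Value")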

Spark withColumn working for modifying column but not adding a new one

Scala 2.12 and Spark 2.2.1 here. I have the following code:
myDf.show(5)
myDf.withColumn("rank", myDf("rank") * 10)
myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
When I run this, in the logs I see:
+----+----+-----------+----+
|fizz|buzz|rizzrankrid|rank|
+----+----+-----------+----+
|   2|   5| 1440370637| 128|
|   2|   5| 2114144780|1352|
|   2|   8|  199559784|3233|
|   2|   5| 1522258372| 895|
|   2|   9|  918480276| 882|
+----+----+-----------+----+
And now:
+----+----+-----------+-----+
|fizz|buzz|rizzrankrid| rank|
+----+----+-----------+-----+
|   2|   5| 1440370637| 1280|
|   2|   5| 2114144780|13520|
|   2|   8|  199559784|32330|
|   2|   5| 1522258372| 8950|
|   2|   9|  918480276| 8820|
+----+----+-----------+-----+
So, interesting:
The first withColumn works, transforming each row's rank value by multiplying it by 10
However, the second withColumn fails; it should simply add the current date/time to all rows as a new lastRanOn column
What do I need to do to get the lastRanOn column addition working?
Your example is probably too simple, because modifying rank should not work either.
withColumn does not update the DataFrame; it creates a new DataFrame.
So you must do:
// if myDf is a var
myDf.show(5)
myDf = myDf.withColumn("rank", myDf("rank") * 10)
myDf = myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
or for example:
myDf.withColumn("rank", myDf("rank") * 10).withColumn("lastRanOn", current_date()).show(5)
Only after reassigning the new DataFrame reference will you have the new column added.
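
To make the immutability point concrete, here is a small sketch with made-up toy data (names are illustrative only):
import org.apache.spark.sql.functions.current_date
import spark.implicits._

val base = Seq((1, 100), (2, 200)).toDF("id", "rank")

//withColumn returns a NEW DataFrame; base itself is untouched
val updated = base.withColumn("rank", base("rank") * 10)
                  .withColumn("lastRanOn", current_date())

base.printSchema()    //still only: id, rank
updated.printSchema() //id, rank, lastRanOn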

Find and replace not working - dataframe spark scala

I have the following dataframe:
df.show
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28|    1|
|2017-06-17|    2|
|2017-05-20|    1|
|2017-06-23|    2|
|2017-06-16|    3|
|2017-06-30|    1|
+----------+-----+
I want to replace the count values with 0 where they are greater than 1, i.e., the resultant dataframe should be:
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28|    1|
|2017-06-17|    0|
|2017-05-20|    1|
|2017-06-23|    0|
|2017-06-16|    0|
|2017-06-30|    1|
+----------+-----+
I tried the following expression:
df.withColumn("count", when(($"count" > 1), 0)).show
but the output was
+----------+--------+
| createdon|   count|
+----------+--------+
|2017-06-28|    null|
|2017-06-17|       0|
|2017-05-20|    null|
|2017-06-23|       0|
|2017-06-16|       0|
|2017-06-30|    null|
+----------+--------+
I am not able to understand why null is displayed for the value 1, and how to overcome that. Can anyone help me?
You need to chain otherwise after when to specify the value for rows where the condition doesn't hold; in your case, that would be the count column itself:
df.withColumn("count", when(($"count" > 1), 0).otherwise($"count"))
This can be done using a udf function too:
def replaceWithZero = udf((col: Int) => if(col > 1) 0 else col) //udf function
df.withColumn("count", replaceWithZero($"count")).show(false) //calling udf function
Note: udf functions should be the choice only when there is no built-in function, as they require serialization and deserialization of the column data.
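
For reference, a minimal, self-contained sketch (rows taken from the question) that shows the built-in when/otherwise approach end to end:
import org.apache.spark.sql.functions.when
import spark.implicits._

val df = Seq(
  ("2017-06-28", 1), ("2017-06-17", 2), ("2017-05-20", 1),
  ("2017-06-23", 2), ("2017-06-16", 3), ("2017-06-30", 1)
).toDF("createdon", "count")

//zero out counts greater than 1, keep the rest as-is
df.withColumn("count", when($"count" > 1, 0).otherwise($"count")).show()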