I have a DataFrame:
| subcategory | subcategory_label | category |
| ----------- | ----------------- | -------- |
| 00EEE       | 00EEE FFF         | Drink    |
| 0000EEE     | 00EEE FFF         | Fruit    |
| 0EEE        | 000EEE FFF        | Meat     |
from which I need to remove the leading 0's from the columns and get a result like this:
| subcategory | subcategory_label | category |
| ----------- | ----------------- | -------- |
| EEE         | EEE FFF           | Drink    |
| EEE         | EEE FFF           | Fruit    |
| EEE         | EEE FFF           | Meat     |
So far, I am able to remove the leading 0's from one column using:
df.withColumn("subcategory", regexp_replace(df("subcategory"), "^0*", "")).show
How do I remove the leading 0's from the whole DataFrame in one go?
With this as the provided DataFrame:
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|0000FFFF   |0000EE 000FF     |ABC     |
+-----------+-----------------+--------+
You can create a regexp_replace expression for all the columns. Something like:
val regex_all = df.columns.map( c => regexp_replace(col(c), "^0*", "" ).as(c) )
And then use select, since it takes a varargs of type Column:
df.select(regex_all :_* ).show(false)
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|FFFF       |EE 000FF         |ABC     |
+-----------+-----------------+--------+
EDIT:
Defining a function that returns a Seq of regexp_replace expressions is straightforward:
/**
 * @param origCols total cols in the DF, pass `df.columns`
 * @param replacedCols `Seq` of columns for which expressions are to be generated
 * @return `Seq[org.apache.spark.sql.Column]` Spark SQL expressions
 */
def createRegexReplaceZeroes(origCols: Seq[String], replacedCols: Seq[String]) = {
  origCols.map { c =>
    if (replacedCols.contains(c)) regexp_replace(col(c), "^0*", "").as(c)
    else col(c)
  }
}
This function will return a Seq[org.apache.spark.sql.Column].
Now, store the columns you want to replace in an Array:
val removeZeroes = Array( "subcategory", "subcategory_label" )
Then call the function with removeZeroes as the argument. This will return the regexp_replace expressions for the columns present in removeZeroes:
df.select( createRegexReplaceZeroes(df.columns, removeZeroes) :_* )
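For completeness, a minimal end-to-end sketch, assuming a spark-shell session (so spark.implicits._ is in scope) and the createRegexReplaceZeroes function defined above; the output shown is approximate:
import org.apache.spark.sql.functions.{col, regexp_replace}

val df = Seq(
  ("00EEE",   "00EEE FFF",  "Drink"),
  ("0000EEE", "00EEE FFF",  "Fruit"),
  ("0EEE",    "000EEE FFF", "Meat")
).toDF("subcategory", "subcategory_label", "category")

val removeZeroes = Array("subcategory", "subcategory_label")

// Only the columns listed in removeZeroes get their leading zeroes stripped.
df.select(createRegexReplaceZeroes(df.columns, removeZeroes): _*).show(false)
// +-----------+-----------------+--------+
// |subcategory|subcategory_label|category|
// +-----------+-----------------+--------+
// |EEE        |EEE FFF          |Drink   |
// |EEE        |EEE FFF          |Fruit   |
// |EEE        |EEE FFF          |Meat    |
// +-----------+-----------------+--------+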
You can use a UDF to do the same. I feel it looks more elegant.
scala> val removeLeadingZerosUDF = udf({ x: String => x.replaceAll("^0*", "") })
removeLeadingZerosUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df = Seq( "000012340023", "000123400023", "001234000230", "012340002300", "123400002300" ).toDF("cols")
df: org.apache.spark.sql.DataFrame = [cols: string]
scala> df.show()
+------------+
| cols|
+------------+
|000012340023|
|000123400023|
|001234000230|
|012340002300|
|123400002300|
+------------+
scala> df.withColumn("newCols", removeLeadingZerosUDF($"cols")).show()
+------------+------------+
| cols| newCols|
+------------+------------+
|000012340023| 12340023|
|000123400023| 123400023|
|001234000230| 1234000230|
|012340002300| 12340002300|
|123400002300|123400002300|
+------------+------------+
Let's say I have a List[String] and I want to merge it with an RDD so that each object in the RDD gets each value in the List added to it:
val myBands: List[String] = List("Band1", "Band2")
Table: BandMembers
|name | instrument |
| ----- | ---------- |
| slash | guitar |
| axl | vocals |
case class BandMembers(name: String, instrument: String)
var myRDD = BandMembersTable.map(a => BandMembers(a.name, a.instrument))
//join the myRDD to myBands
// how do I do this?
//var result = myRdd.join/merge/union(myBands);
Desired result:
|name | instrument | band |
| ----- | ---------- |------|
| slash | guitar | band1|
| slash | guitar | band2|
| axl | vocals | band1|
| axl | vocals | band2|
I'm not quite sure how to go about this in the best way for Spark/Scala. I know I can convert to DF and then use spark sql to do the joins, but there has to be a better way with the RDD and List, or so I think.
The style is a bit off here, but assuming you really need RDDs instead of Datasets.
With RDDs:
case class BandMembers(name: String, instrument: String)
import org.apache.spark.sql.Row

val myRDD = spark.sparkContext.parallelize(BandMembersTable.map(a => BandMembers(a.name, a.instrument)))
val myBands = spark.sparkContext.parallelize(Seq("Band1", "Band2"))
val res = myRDD.cartesian(myBands).map { case (a, b) => Row(a.name, a.instrument, b) }
With Dataset:
case class BandMembers(name: String, instrument: String)
import spark.implicits._

val myRDD = BandMembersTable.map(a => BandMembers(a.name, a.instrument)).toDS
val myBands = Seq("Band1", "Band2").toDS
val res = myRDD.crossJoin(myBands)
Input data:
val BandMembersTable = Seq(BandMembers("a", "b"), BandMembers("c", "d"))
val myBands = Seq("Band1","Band2")
Output with Dataset:
+----+----------+-----+
|name|instrument|value|
+----+----------+-----+
|a |b |Band1|
|a |b |Band2|
|c |d |Band1|
|c |d |Band2|
+----+----------+-----+
println output with the RDD version (these are Rows):
[a,b,Band1]
[c,d,Band2]
[c,d,Band1]
[a,b,Band2]
Consider using RDD zip for this. From the official docs:
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).
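A minimal usage sketch of zip, assuming a SparkContext named sc; note that zip pairs elements positionally and, as the docs say, requires both RDDs to have the same number of partitions and the same number of elements per partition:
// Pair band members with bands positionally via RDD.zip.
// Both RDDs are built with one partition and two elements each.
val members = sc.parallelize(Seq(("slash", "guitar"), ("axl", "vocals")), 1)
val bands   = sc.parallelize(Seq("Band1", "Band2"), 1)

members.zip(bands).collect().foreach(println)
// ((slash,guitar),Band1)
// ((axl,vocals),Band2)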
I am creating a dataframe using
val snDump = table_raw
.applyMapping(mappings = Seq(
("event_id", "string", "eventid", "string"),
("lot-number", "string", "lotnumber", "string"),
("serial-number", "string", "serialnumber", "string"),
("event-time", "bigint", "eventtime", "bigint"),
("companyid", "string", "companyid", "string")),
caseSensitive = false, transformationContext = "sn")
.toDF()
.groupBy(col("eventid"), col("lotnumber"), col("companyid"))
.agg(collect_list(struct("serialnumber", "eventtime")).alias("snetlist"))
.createOrReplaceTempView("sn")
I have data like this in the df
eventid | lotnumber | companyid | snetlist
123 | 4q22 | tu56ff | [[12345,67438]]
456 | 4q22 | tu56ff | [[12346,67434]]
258 | 4q22 | tu56ff | [[12347,67455], [12333,67455]]
999 | 4q22 | tu56ff | [[12348,67459]]
I want to explode it and put the data into 2 columns in my table. For that, what I am doing is:
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("serialN"), explode(col("snetlist")).alias("eventT"), col("companyid"))
I also tried:
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), col($"snetlist.serialnumber").alias("serialN"), col($"snetlist.eventtime").alias("eventT"), col("companyid"))
but it turns out that explode can only be used once per select, and I get an error, so how do I use explode (or something else) to get the result below?
eventid | lotnumber | companyid | serialN | eventT |
123 | 4q22 | tu56ff | 12345 | 67438 |
456 | 4q22 | tu56ff | 12346 | 67434 |
258 | 4q22 | tu56ff | 12347 | 67455 |
258 | 4q22 | tu56ff | 12333 | 67455 |
999 | 4q22 | tu56ff | 12348 | 67459 |
I have looked at a lot of Stack Overflow threads but none of them helped me. It is possible that such a question has already been answered, but my understanding of Scala is limited, which may have kept me from understanding the answer. If this is a duplicate, someone could direct me to the correct answer. Any help is appreciated.
First, explode the array into a temporary struct column, then unpack it:
val serialNumberEvents = snDump
.withColumn("tmp",explode((col("snetlist"))))
.select(
col("eventid"),
col("lotnumber"),
col("companyid"),
// unpack struct
col("tmp.serialnumber").as("serialN"),
col("tmp.eventtime").as("serialT")
)
The trick is to pack the columns you want to explode in an array (or struct), use explode on the array and then unpack them.
val col_names = Seq("eventid", "lotnumber", "companyid", "snetlist")
val data = Seq(
(123, "4q22", "tu56ff", Seq(Seq(12345,67438))),
(456, "4q22", "tu56ff", Seq(Seq(12346,67434))),
(258, "4q22", "tu56ff", Seq(Seq(12347,67455), Seq(12333,67455))),
(999, "4q22", "tu56ff", Seq(Seq(12348,67459)))
)
val snDump = spark.createDataFrame(data).toDF(col_names: _*)
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("snetlist"), col("companyid"))
val exploded = serialNumberEvents.select($"eventid", $"lotnumber", $"snetlist".getItem(0).alias("serialN"), $"snetlist".getItem(1).alias("eventT"), $"companyid")
exploded.show()
Note that my snetlist has the schema Array(Array) rather than Array(Struct). You can get this by creating an array instead of a struct out of your columns, as sketched below.
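A hypothetical sketch of that aggregation, using array(...) in place of struct(...); here mappedDF stands for the DataFrame obtained after applyMapping(...).toDF() in the question, and note that array(...) coerces its elements to a common type, so the string serial numbers and bigint event times would both end up as strings:
import org.apache.spark.sql.functions.{array, col, collect_list}

// Build Array(Array) instead of Array(Struct), so that getItem(0) / getItem(1)
// can be used after the explode as shown above.
val snDumpArrays = mappedDF
  .groupBy(col("eventid"), col("lotnumber"), col("companyid"))
  .agg(collect_list(array(col("serialnumber"), col("eventtime"))).alias("snetlist"))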
Another approach, if you need to explode twice, is as follows (a different example, but it demonstrates the point):
val flattened2 = df.select($"director", explode($"films.actors").as("actors_flat"))
val flattened3 = flattened2.select($"director", explode($"actors_flat").as("actors_flattened"))
See Is there an efficient way to join two large Datasets with (deeper) nested array field? for a slightly different context, but same approach applies.
This answer is in response to your assertion that you can only explode once.
In my requirement, I come across a situation where I have to pass 2 strings from 2 columns of my DataFrame to a function, get the result back as a string, and store it back in a DataFrame.
Now, while passing the values as strings, the function always returns the same value, so the same value is populated in all the rows. (In my case PPPPP is populated in all rows.)
Is there a way to pass the elements of those 2 columns from every row and get the result in separate rows?
I am ready to modify my function to accept a DataFrame and return a DataFrame, or accept an Array[String] and return an Array[String], but I don't know how to do that, as I am new to programming. Can someone please help me?
Thanks.
def myFunction(key: String , value :String ) : String = {
//Do my functions and get back a string value2 and return this value2 string
value2
}
val DF2 = DF1.select (
DF1("col1")
,DF1("col2")
,DF1("col5") )
.withColumn("anyName", lit(myFunction ( DF1("col3").toString() , DF1("col4").toString() )))
/* DF1:
+-----+----+--------+----+----+
|col1 |col2|col3    |col4|col5|
+-----+----+--------+----+----+
|Hello|5   |valueAAA|XXX |123 |
|How  |3   |valueCCC|YYY |111 |
|World|5   |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+

DF2:
+-----+----+----+-------+
|col1 |col2|col5|anyName|
+-----+----+----+-------+
|Hello|5   |123 |PPPPP  |
|How  |3   |111 |PPPPP  |
|World|5   |222 |PPPPP  |
+-----+----+----+-------+
*/
After you define the function, you need to register it as a udf(). The udf() function is available in org.apache.spark.sql.functions. Check this out:
scala> val DF1 = Seq(("Hello",5,"valueAAA","XXX",123),
| ("How",3,"valueCCC","YYY",111),
| ("World",5,"valueDDD","ZZZ",222)
| ).toDF("col1","col2","col3","col4","col5")
DF1: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 3 more fields]
scala> val DF2 = DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5") )
DF2: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> DF2.show(false)
+-----+----+----+
|col1 |col2|col5|
+-----+----+----+
|Hello|5 |123 |
|How |3 |111 |
|World|5 |222 |
+-----+----+----+
scala> DF1.select("*").show(false)
+-----+----+--------+----+----+
|col1 |col2|col3 |col4|col5|
+-----+----+--------+----+----+
|Hello|5 |valueAAA|XXX |123 |
|How |3 |valueCCC|YYY |111 |
|World|5 |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+
scala> def myConcat(a:String,b:String):String=
| return a + "--" + b
myConcat: (a: String, b: String)String
scala>
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val myConcatUDF = udf(myConcat(_:String,_:String):String)
myConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))
scala> DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5"), myConcatUDF( DF1("col3"), DF1("col4"))).show()
+-----+----+----+---------------+
| col1|col2|col5|UDF(col3, col4)|
+-----+----+----+---------------+
|Hello| 5| 123| valueAAA--XXX|
| How| 3| 111| valueCCC--YYY|
|World| 5| 222| valueDDD--ZZZ|
+-----+----+----+---------------+
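To get the column name the question asked for ("anyName"), you can alias the UDF result or use withColumn; a small sketch, assuming DF1 and myConcatUDF from the session above:
// Alias the UDF output instead of keeping the generated "UDF(col3, col4)" name.
val DF2 = DF1.select(DF1("col1"), DF1("col2"), DF1("col5"),
  myConcatUDF(DF1("col3"), DF1("col4")).alias("anyName"))

// Equivalent with withColumn, then selecting the columns you want to keep.
val DF2b = DF1.withColumn("anyName", myConcatUDF(DF1("col3"), DF1("col4")))
  .select("col1", "col2", "col5", "anyName")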
I have some tables in which I need to mask some of their columns. The columns to be masked vary from table to table, and I am reading those columns from an application.conf file.
For example, for the employee table shown below:
+----+------+-----+---------+
| id | name | age | address |
+----+------+-----+---------+
| 1 | abcd | 21 | India |
+----+------+-----+---------+
| 2 | qazx | 42 | Germany |
+----+------+-----+---------+
If we want to mask the name and age columns, then I get these columns in a sequence:
val mask = Seq("name", "age")
Expected values after masking are:
+----+----------------+----------------+---------+
| id | name | age | address |
+----+----------------+----------------+---------+
| 1 | *** Masked *** | *** Masked *** | India |
+----+----------------+----------------+---------+
| 2 | *** Masked *** | *** Masked *** | Germany |
+----+----------------+----------------+---------+
If I have the employee table as a data frame, then what is the way to mask these columns?
Similarly, if I have the payment table shown below and want to mask the name and salary columns, I again get the mask columns in a sequence:
+----+------+--------+----------+
| id | name | salary | tax_code |
+----+------+--------+----------+
| 1 | abcd | 12345 | KT10 |
+----+------+--------+----------+
| 2 | qazx | 98765 | AD12d |
+----+------+--------+----------+
val mask = Seq("name", "salary")
I tried something like mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***"))) but it did not return anything.
Thanks to @philantrovert, I found out the solution. Here is the solution I used:
def maskData(base: DataFrame, maskColumns: Seq[String]) = {
val maskExpr = base.columns.map { col => if(maskColumns.contains(col)) s"'*** Masked ***' as ${col}" else col }
base.selectExpr(maskExpr: _*)
}
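A usage sketch, assuming the employee table from the question is loaded as a DataFrame named base; the output shown is approximate:
maskData(base, Seq("name", "age")).show()
// +---+--------------+--------------+-------+
// | id|          name|           age|address|
// +---+--------------+--------------+-------+
// |  1|*** Masked ***|*** Masked ***|  India|
// |  2|*** Masked ***|*** Masked ***|Germany|
// +---+--------------+--------------+-------+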
The simplest and fastest way would be to use withColumn and simply overwrite the values in the columns with "*** Masked ***". Using your small example DataFrame:
val df = spark.sparkContext.parallelize( Seq (
(1, "abcd", 12345, "KT10" ),
(2, "qazx", 98765, "AD12d")
)).toDF("id", "name", "salary", "tax_code")
If you have a small number of columns to be masked, with known names, then you can simply do:
val mask = Seq("name", "salary")
df.withColumn("name", lit("*** Masked ***"))
.withColumn("salary", lit("*** Masked ***"))
Otherwise, you need to create a loop:
var df2 = df
for (col <- mask){
df2 = df2.withColumn(col, lit("*** Masked ***"))
}
Both these approaches will give you a result like this:
+---+--------------+--------------+--------+
| id| name| salary|tax_code|
+---+--------------+--------------+--------+
| 1|*** Masked ***|*** Masked ***| KT10|
| 2|*** Masked ***|*** Masked ***| AD12d|
+---+--------------+--------------+--------+
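If you prefer to avoid the var, a foldLeft over the mask sequence gives the same result; a small sketch, assuming the df, mask, and lit import from above:
// Fold over the columns to mask, threading the DataFrame through each step.
val masked = mask.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, lit("*** Masked ***"))
}
masked.show()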
Please check the code below. The key is the udf function.
val df = ss.sparkContext.parallelize( Seq (
  ("c1", "JAN-2017", 49),
  ("c1", "MAR-2017", 83)
)).toDF("city", "month", "sales")
df.show()
val mask = udf( (s: String) => {
  "*** Masked ***"
})
df.withColumn("city", mask($"city")).show
Your statement
mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***" ) ) )
will not do what you want: foreach returns Unit and the new DataFrame produced by each withColumn call is discarded, so base is never changed.
You can use selectExpr and generate your regexp_replace expressions like this:
base.show
+---+----+---+-------+
| id|name|age|address|
+---+----+---+-------+
|  1|abcd| 21|  India|
|  2|qazx| 42|Germany|
+---+----+---+-------+
val mask = Seq("name", "age")
val expr = base.columns.map { col =>
  if (mask.contains(col)) s"""regexp_replace(${col}, "^.*", "** Masked **") as ${col}"""
  else col
}
This will generate expressions with regexp_replace for the columns that are present in the sequence mask:
Array[String] = Array(id, regexp_replace(name, "^.*", "** Masked **" ) as name, regexp_replace(age, "^.*", "** Masked **" ) as age, address)
Now you can use selectExpr on the generated Sequence
base.selectExpr(expr: _*).show
+---+------------+------------+-------+
| id|        name|         age|address|
+---+------------+------------+-------+
|  1|** Masked **|** Masked **|  India|
|  2|** Masked **|** Masked **|Germany|
+---+------------+------------+-------+
After a series of validations over a DataFrame,
I obtain a List of String with certain values like this:
List[String] = List("lvalue1", "lvalue2", "lvalue3", ...)
And I have a Dataframe with n values:
dfield 1 | dfield 2 | dfield 3
___________________________
dvalue1 | dvalue2 | dvalue3
dvalue1 | dvalue2 | dvalue3
I want to append the values of the List at the beginning of my DataFrame, in order to get a new DF with something like this:
dfield 1 | dfield 2 | dfield 3 | dfield4 | dfield5 | dfield6
__________________________________________________________
lvalue1 | lvalue2 | lvalue3 | dvalue1 | dvalue2 | dvalue3
lvalue1 | lvalue2 | lvalue3 | dvalue1 | dvalue2 | dvalue3
I have found something using a UDF. Could this be correct for my purpose?
Regards.
TL;DR Use select or withColumn with the lit function.
I'd use the lit function with the select operator (or withColumn).
lit(literal: Any): Column Creates a Column of literal value.
A solution could be as follows.
val values = List("lvalue1", "lvalue2", "lvalue3")
val dfields = values.indices.map(idx => s"dfield ${idx + 1}")
val dataset = Seq(
("dvalue1", "dvalue2", "dvalue3"),
("dvalue1", "dvalue2", "dvalue3")
).toDF("dfield 1", "dfield 2", "dfield 3")
val offsets = dataset.
columns.
indices.
map { idx => idx + values.size + 1 }
val offsetDF = offsets.zip(dataset.columns).
foldLeft(dataset) { case (df, (off, col)) => df.withColumnRenamed(col, s"dfield $off") }
val newcols = values.zip(dfields).
map { case (v, dfield) => lit(v) as dfield } :+ col("*")
scala> offsetDF.select(newcols: _*).show
+--------+--------+--------+--------+--------+--------+
|dfield 1|dfield 2|dfield 3|dfield 4|dfield 5|dfield 6|
+--------+--------+--------+--------+--------+--------+
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
+--------+--------+--------+--------+--------+--------+
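For the withColumn variant mentioned in the TL;DR, a rough sketch, assuming the values list and the dataset defined above; withColumn appends columns at the end, so a final select is still needed to put the literal columns first, and the lit_N names are just placeholders:
import org.apache.spark.sql.functions.{col, lit}

// Add one literal column per value, then reorder so the literals come first.
val withLits = values.zipWithIndex.foldLeft(dataset) { case (df, (v, idx)) =>
  df.withColumn(s"lit_${idx + 1}", lit(v))
}
val reordered = withLits.select(
  (values.indices.map(i => col(s"lit_${i + 1}")) ++ dataset.columns.map(col)): _*
)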