How to correlate rlike and regexp_extract - Scala

STATEMENT:1
spark.sql("select case when length(pop)>0 then regexp_extract(pop, '^[^#]+', 0) else '' end as pop from input").show(false)
STATEMENT:2
spark.sql("select case when length(oik)>0 and pop rlike '^[0-9]*$' then pop else '' end as pop from input").show(false)
How can I combine the regexp_extract and rlike from the two statements above into a single case when in Spark SQL, so that the output matches these samples?
sample input: 1234#gamil.com      output: 1234
sample input: 1234abc#gmail.com   output: ''
The first statement drops everything after the #; the second statement should reject the result of the first statement if it contains any non-numeric characters.

This should work for you.
List("1234#gamil.com","1234abc#gmail.com")
.toDF("pop")
.createOrReplaceTempView("input")
spark.sql(
"""
|select
| case
| when length(pop)>0 and pop rlike '^[0-9]+[a-z-A-Z]+#.*'
| then ''
| else
| case
| when pop rlike '^[0-9]+#.*'
| then regexp_extract(pop, '^[^#]+', 0)
| end
| end as pop from input
|""".stripMargin)
.show()
/*
+----+
| pop|
+----+
|1234|
| |
+----+*/
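
A simpler equivalent, if you prefer a single branch: since the only values you want to keep are all digits followed by #, the two checks can be collapsed (a minimal sketch against the same input view, giving the same result for the sample inputs):
spark.sql(
  """
    |select
    |  case
    |    when pop rlike '^[0-9]+#' then regexp_extract(pop, '^[0-9]+', 0)
    |    else ''
    |  end as pop from input
    |""".stripMargin)
  .show()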

You can try this code:
import spark.implicits._  // for .as[String] and the String encoder used by map

val pattern = """([0-9]+)""".r   // the whole value must be digits

def parseId(id: String): String = id match {
  case pattern(digits) => digits
  case _ => ""
}

spark.sql("select case when length(pop)>0 then regexp_extract(pop, '^[^#]+', 0) else '' end as pop from input")
  .as[String]      // DataFrame -> Dataset[String], so we can map with a plain Scala function
  .map(parseId)
  .show()
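
With the parseId above, the two values produced by the regexp_extract step behave as the question requires:
parseId("1234")      // "1234" -> all digits, kept
parseId("1234abc")   // ""     -> contains non-numeric characters, rejected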

Related

How to apply an empty condition to sql select by using "and" in Spark?

I have a UuidConditionSet; when the if condition is not met, I want to apply an empty string to my select statement (or just ignore this UuidConditionSet), but I get this error. How can I solve this problem?
mismatched input 'FROM' expecting <EOF>(line 10, pos 3)
This is the select
(SELECT
item,
amount,
date
from my_table
where record_type = 'myType'
and ( date_format(date, "yyyy-MM-dd") >= '2020-02-27'
and date_format(date, "yyyy-MM-dd") <= '2020-02-28' )
and ()

This is the Scala code that builds the UUID condition:
var UuidConditionSet = ""
var UuidCondition = Seq.empty[String]
if(!UuidList.mkString.isEmpty) {
UuidCondition = for {
Uuid <- UuidList
UuidConditionSet = s"${SQLColumnHelper.EVENT_INFO_STRUCT_NAME}.${SQLColumnHelper.UUID} = '".concat(eventUuid).concat("'")
} yield UuidConditionSet
UuidConditionSet = UuidCondition.reduce(_.concat(" or ").concat(_))
}
s"""SELECT
| ${SQLColumnHelper.STRUCT_NAME_ITEM},
| ${SQLColumnHelper.STRUCT_NAME_AMOUNT},
| ${SQLColumnHelper.DATE}
| from ${sqlTableHelper.TABLE}
| where ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME} = '${RECORD_TYPE}'
| and ( date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") >= '${stayDateRangeTuple._1}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") <= '${stayDateRangeTuple._2}' )
| and ($UuidConditionSet)
You can use pattern matching on the list UuidList to check the size and return an empty string if the list is empty. Also, you can use IN instead of multiple ORs here.
Try this:
val UuidCondition = UuidList match {
  case l if l.nonEmpty =>
    l.map(u => s"'$u'").mkString(
      s"and ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME}.${SQLColumnHelper.UUID} in (",
      ",",
      ")"
    )
  case _ => ""
}
s"""SELECT
| ${SQLColumnHelper.STRUCT_NAME_ITEM},
| ${SQLColumnHelper.STRUCT_NAME_AMOUNT},
| ${SQLColumnHelper.DATE}
| from ${sqlTableHelper.TABLE}
| where ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME} = '${RECORD_TYPE}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") >= '${stayDateRangeTuple._1}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") <= '${stayDateRangeTuple._2}'
| $UuidCondition
"""

Are these values empty or null and how do I drop these columns?

So I have this dataframe which looks like below:
+----------------+----------+-------------+-----------+---------+-------------+
|_manufacturerRef|_masterRef|_nomenclature|_partNumber|_revision|_serialNumber|
+----------------+----------+-------------+-----------+---------+-------------+
| #id2| #id19| | zaa01948| | JTJHA31U2400|
| #id2| #id29| | zaa22408| | null|
| #id2| #id45| | zaa24981| | null|
+----------------+----------+-------------+-----------+---------+-------------+
I want to drop the empty columns, which are _nomenclature and _revision in the dataframe above. I have tried various methods, but none of them drops the columns; nothing detects them as empty. Also, some of the columns might be of type Struct. This is what I am trying:
val cols = xmldf.columns
cols.foreach(c => {
var currDF = xmldf.select("`" + c + "`")
currDF.show()
val df1 = currDF.filter(currDF("`" + c + "`").isNotNull)
if(df1.count() == 0 || df1.rdd.isEmpty()){
xmldf = xmldf.drop(c)
}
})
The problem with your code is that the columns _nomenclature and _revision aren't really empty: they contain empty strings, not nulls. Because of that, you can't use isNotNull to check whether a cell is empty; you need the =!= operator instead.
You can also use filter and foldLeft instead of foreach if you want to avoid a mutable var.
val df = List(("#id2","#id19", "", "zaa01947", "", "JTJHA31U2400"), ("#id2", "#id29", "", "zaa22408", "", null)).toDF("_manufacturerRef", "_masterRef", "_nomenclature", "_partNumber", "_revision", "_serialNumber")
val newDf = df.columns
.filter(c => df.where(df(c) =!= "").isEmpty) //find column containing only empty strings
.foldLeft(df)(_.drop(_)) //drop all found columns from dataframe
newDf.show()
And as expected, _nomenclature and _revision are dropped in the result:
+----------------+----------+-----------+-------------+
|_manufacturerRef|_masterRef|_partNumber|_serialNumber|
+----------------+----------+-----------+-------------+
| #id2| #id19| zaa01947| JTJHA31U2400|
| #id2| #id29| zaa22408| null|
+----------------+----------+-----------+-------------+
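
If you also want to treat columns that contain only nulls (or a mix of nulls and empty strings) as empty, you can tighten the filter; a small sketch over the same df, under that assumption:
val emptyCols = df.columns.filter { c =>
  // a column counts as "empty" if no row has a non-null, non-empty value in it
  df.where(df(c).isNotNull && df(c) =!= "").isEmpty
}
val cleaned = df.drop(emptyCols: _*)
cleaned.show()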

HiveQL: remove duplicates, including the records that had a duplicate

I have a select statement that I am storing in a dataframe:
val df = spark.sqlContext.sql("select prty_tax_govt_issu_id from CST_EQUIFAX.eqfx_prty_emp_incm_info where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A'");
I then want to take this dataframe and select ONLY the unique records: find all duplicates on the prty_tax_govt_issu_id field and, if there are duplicates, remove not just the duplicate rows but every record with that prty_tax_govt_issu_id.
So original data frame may look like...
+---------------------+
|prty_tax_govt_issu_id|
+---------------------+
| 000000005|
| 000000012|
| 000000012|
| 000000028|
| 000000038|
+---------------------+
The new dataframe should look like:
+---------------------+
|prty_tax_govt_issu_id|
+---------------------+
| 000000005|
| 000000028|
| 000000038|
+---------------------+
I am not sure if I need to do this after storing it in the dataframe, or if I can get that result directly in my select statement. Thanks :)
Count the number of rows per id and select those ones with count=1.
val df = spark.sql("select prty_tax_govt_issu_id from CST_EQUIFAX.eqfx_prty_emp_incm_info where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A'")
// Get counts per id
val counts = df.groupBy("prty_tax_govt_issu_id").count()
// Filter for id's having only one row
counts.filter($"count" == 1).select($"prty_tax_govt_issu_id").show()
In SQL, you could do
val df = spark.sql("""
select prty_tax_govt_issu_id
from CST_EQUIFAX.eqfx_prty_emp_incm_info
where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A'
group by prty_tax_govt_issu_id
having count(*)=1
""")
df.show()
A group by with a having clause would do it:
select prty_tax_govt_issu_id
from CST_EQUIFAX.eqfx_prty_emp_incm_info
where emp_mtch_cd = 'Y'
and emp_mtch_actv_rcrd_in = 'Y'
and emp_sts_in = 'A'
group by prty_tax_govt_issu_id
having count(*) = 1

Spark Scala Dataframe - replace/join column values with values from another dataframe (but is transposed)

I have a table with ~300 columns filled with characters (stored as String):
valuesDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| U | C | ...
| U | E | ...
| I | B | ...
| C | U | ...
| ... | ... | ...
I have a Data Summary, which maps the characters onto their actual meaning. It is in this form:
summaryDF:
| Field | Value | ValueDesc |
|------------------|-------|---------------|
| FavouriteBeer | U | Unknown |
| FavouriteBeer | C | Carlsberg |
| FavouriteBeer | I | InnisAndGunn |
| FavouriteBeer | D | DoomBar |
| FavouriteCheese | C | Cheddar |
| FavouriteCheese | E | Emmental |
| FavouriteCheese | B | Brie |
| FavouriteCheese | U | Unknown |
| ... | ... | ... |
I want to programmatically replace the character values of each column in valuesDF with the Value Descriptions from summaryDF. This is the result I'm looking for:
finalDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| Unknown | Cheddar | ...
| Unknown | Emmental | ...
| InnisAndGunn | Brie | ...
| Carlsberg | Unknown | ...
| ... | ... | ...
As there are ~300 columns, I'm not keen to type out withColumn methods for each one.
Unfortunately I'm a bit of a novice when it comes to programming for Spark, although I've picked up enough to get by over the last 2 months.
What I'm pretty sure I need to do is something along the lines of:
valuesDF.columns.foreach { col => ...... } to iterate over each column
Filter summaryDF on Field using col String value
Left join summaryDF onto valuesDF based on current column
withColumn to replace the original character code column from valuesDF with new description column
Assign new DF as a var
Continue loop
However, trying this gave me a Cartesian product error (even though I made sure to define the join as "left").
I also tried and failed to pivot summaryDF (there are no aggregations to do??) and then join the two dataframes together.
This is the sort of thing I've tried, and I always get a NullPointerException. I know this is really not the right way to do it, and I can see why I'm getting the Null Pointer... but I'm really stuck and reverting to old, silly and bad Python habits in desperation.
var valuesDF = sourceDF
// I converted summaryDF to a broadcasted RDD
// because its small and a "constant" lookup table
summaryBroadcast
.value
.foreach{ x =>
// searchValue = Value (e.g. `U`),
// replaceValue = ValueDescription (e.g. `Unknown`),
val field = x(0).toString
val searchValue = x(1).toString
val replaceValue = x(2).toString
// error catching, as the summary data does not map exactly onto the field names
// the joys of business people working in Excel...
try {
// I'm using regexp_replace because I'm lazy
valuesDF = valuesDF
.withColumn(field, regexp_replace(col(field), searchValue, replaceValue))
}
catch {case _: Exception =>
null
}
}
Any ideas? Advice? Thanks.
First, we need a function that joins valuesDF with summaryDF on Value and on the matching pair of Favourite* column name and Field:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import spark.implicits._  // for the $ column syntax

private def joinByColumn(colName: String, sourceDf: DataFrame): DataFrame = {
  sourceDf.as("src") // alias it to help selecting the appropriate columns in the result
    // the join: match on both the coded value and the field name
    .join(summaryDF, $"Value" === col(colName) && $"Field" === colName, "left")
    // we do not need the original `Favourite*` column, so drop it
    .drop(colName)
    // select all previous columns, plus the one that contains the match
    .select("src.*", "ValueDesc")
    // rename the resulting column to have the name of the source one
    .withColumnRenamed("ValueDesc", colName)
}
Now, to produce the target result we can iterate on the names of the columns to match:
val result = Seq("FavouriteBeer",
"FavouriteCheese").foldLeft(valuesDF) {
case(df, colName) => joinByColumn(colName, df)
}
result.show()
+-------------+---------------+
|FavouriteBeer|FavouriteCheese|
+-------------+---------------+
| Unknown| Cheddar|
| Unknown| Emmental|
| InnisAndGunn| Brie|
| Carlsberg| Unknown|
+-------------+---------------+
If a value from valuesDF does not match anything in summaryDF, the resulting cell in this solution will contain null. If you want to replace it with the Unknown value instead, replace the .select and .withColumnRenamed lines above with:
.withColumn(colName, when($"ValueDesc".isNotNull, $"ValueDesc").otherwise(lit("Unknown")))
.select("src.*", colName)

How to append List[String] to every row of DataFrame?

After a series of validations over a DataFrame,
I obtain a List of String with certain values like this:
List[String]=(lvalue1, lvalue2, lvalue3, ...)
And I have a Dataframe with n values:
dfield 1 | dfield 2 | dfield 3
___________________________
dvalue1 | dvalue2 | dvalue3
dvalue1 | dvalue2 | dvalue3
I want to append the values of the List at the beginning of my Dataframe, in order to get a new DF with something like this:
dfield 1 | dfield 2 | dfield 3 | dfield4 | dfield5 | dfield6
__________________________________________________________
lvalue1 | lvalue2 | lvalue3 | dvalue1 | dvalue2 | dvalue3
lvalue1 | lvalue2 | lvalue3 | dvalue1 | dvalue2 | dvalue3
I have found something using a UDF. Could this be correct for my purpose?
Regards.
TL;DR Use select or withColumn with lit function.
I'd use lit function with select operator (or withColumn).
lit(literal: Any): Column Creates a Column of literal value.
A solution could be as follows.
val values = List("lvalue1", "lvalue2", "lvalue3")
val dfields = values.indices.map(idx => s"dfield ${idx + 1}")
val dataset = Seq(
("dvalue1", "dvalue2", "dvalue3"),
("dvalue1", "dvalue2", "dvalue3")
).toDF("dfield 1", "dfield 2", "dfield 3")
val offsets = dataset.
columns.
indices.
map { idx => idx + colNames.size + 1 }
val offsetDF = offsets.zip(dataset.columns).
foldLeft(dataset) { case (df, (off, col)) => df.withColumnRenamed(col, s"dfield $off") }
val newcols = colNames.zip(dfields).
map { case (v, dfield) => lit(v) as dfield } :+ col("*")
scala> offsetDF.select(newcols: _*).show
+--------+--------+--------+--------+--------+--------+
|dfield 1|dfield 2|dfield 3|dfield 4|dfield 5|dfield 6|
+--------+--------+--------+--------+--------+--------+
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
+--------+--------+--------+--------+--------+--------+
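
If keeping the original column names is acceptable (i.e. you don't need them renamed to dfield 4..6), a shorter sketch just prepends the literal columns in a single select; the lfield names below are only placeholders:
val litCols = values.zipWithIndex.map { case (v, i) => lit(v).as(s"lfield ${i + 1}") }
dataset.select((litCols :+ col("*")): _*).show()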