Check that a Spark DataFrame column matches a regex for all rows using Scala - scala

I am using Scala.
I have a dataframe with a column date which looks like this:
| date |
|2017-09-24T11:05:52.647+02:00|
|2018-09-24T11:05:52.647+02:00|
|2018-10-24T11:05:52.647+02:00|
I have a regex to check the date format:
val pattern = """[12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])T\d{2}:\d{2}:\d{2}\.\d{3}\+\d{2}:\d{2}""".r
I want to check whether every row in the dataframe matches the regex: return true if they all do and false otherwise. I need a single true or false, not a list.
Any help is welcome and many thanks for your help.

This should work, by turning the problem around and looking for the first non-match:
import scala.util.Try
val result = Try(Option(df.filter($"cityid" rlike "[^0-9]").first)).toOption.flatten
if (result.isEmpty) { println("Empty")}
The outcome is a DataFrame, so you can just check whether it is empty or not. Please tailor this to your own situation, e.g. your own column and regex. Without the Try, .first throws an error when the filtered DataFrame is empty; with it, None is returned and you can simply do the emptiness check.
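Applied to the question itself, a minimal sketch (assuming a DataFrame df with a string column date, and Spark 2.4+ for Dataset.isEmpty) that returns a single Boolean could look like this:
import org.apache.spark.sql.functions.col

// Anchor the pattern so the whole value has to match, not just a substring.
val datePattern = """^[12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])T\d{2}:\d{2}:\d{2}\.\d{3}\+\d{2}:\d{2}$"""

// true if every row matches the pattern, false otherwise
val allMatch: Boolean = df.filter(!col("date").rlike(datePattern)).isEmpty
// On Spark versions before 2.4, use .count == 0 instead of .isEmpty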

Related

Pyspark when statement

Hi, I'm starting to use PySpark and want to add a when/otherwise condition:
df_1 = df.withColumn("test", when(df.first_name == df2.firstname & df.last_namne == df2.lastname, "1. Match on First and Last Name").otherwise ("No Match"))
I get the error below and wanted some assistance in understanding why the above is not working.
Both df.first_name and df.last_name are strings, as are df2.firstname and df2.lastname.
Error:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Thanks in advance
There are several issues in your statement:
For df.withColumn(), you cannot use df and df2 columns in one statement. First join the two dataframes using df.join(df2, on="some_key", how="left/right/full").
Enclose each side of the "when" condition in round brackets: (df.first_name == df2.firstname) & (df.last_name == df2.lastname)
The string literals of "when" and "otherwise" can be wrapped in lit(), e.g. lit("1. Match on First and Last Name") and lit("No Match").
There is probably a typo in your field name df.last_namne (presumably last_name).
A sketch combining these fixes is shown below.
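Since the rest of this thread uses Scala, here is a minimal Scala sketch of the combined fixes; the join key "id" is an assumption, the column names come from the question (with the typo corrected), and the same principle applies in PySpark:
import org.apache.spark.sql.functions.{col, lit, when}

// Join the two dataframes first ("id" is an assumed key),
// then build the condition with each comparison parenthesized before combining.
val joined = df.join(df2, Seq("id"), "left")

val df_1 = joined.withColumn(
  "test",
  when((col("first_name") === col("firstname")) && (col("last_name") === col("lastname")),
    lit("1. Match on First and Last Name"))
    .otherwise(lit("No Match"))
)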

Spark: dropping multiple columns using regex

In a Spark (2.3.0) project using Scala, I would like to drop multiple columns using a regex. I tried using colRegex, but without success:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
// Hoping to get columns Array(id, a, b)
df_in.columns
// Getting Array(id, a, a_out, b, b_out)
On the other hand, the mechanism seems to work with select:
df.select(df.colRegex("`.*_(in|out)`")).columns
// Getting Array(a_in, a_out, b_in, b_out)
Several things are not clear to me:
what is this backquote syntax in the regex?
colRegex returns a Column: how can it actually represent several columns in the 2nd example?
can I combine drop and colRegex or do I need some workaround?
If you check the Spark source code of the colRegex method, it expects the regex to be passed in the below format:
/** the column name pattern in quoted regex without qualifier */
val escapedIdentifier = "`(.+)`".r
/** the column name pattern in quoted regex with qualifier */
val qualifiedEscapedIdentifier = ("(.+)" + """.""" + "`(.+)`").r
Backticks (`) are necessary to enclose your regex; otherwise the above patterns will not recognize your input as a regex.
As a workaround, you can filter out the unwanted column names yourself and drop them, as mentioned below:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
val columnsToDrop = df_in.columns.filter(_.matches(".*_(in|out)$")).toSeq // collect the unwanted columns
val final_df_in = df_in.drop(columnsToDrop:_*) // drop every column matching your criteria
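Checking the resulting columns should now give what the question hoped for:
final_df_in.columns
// Array(id, a, b)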
In addition to the workaround proposed by Waqar Ahmed and kavetiraviteja (accepted answer), here is another possibility based on select with some negative-regex magic. It is more concise, but harder to read for non-regex-gurus...
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.select(df.colRegex("`^(?!.*_(in|out)).*$`")) // negative lookahead: keep only columns whose names do not contain _in or _out

Replace Empty values with nulls in Spark Dataframe

I have a data frame with n number of columns and I want to replace empty strings in all these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from being able to replace values with nulls.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression is evaluated once on the driver (and not per record). You'd want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, not Scala's ==, which just compares the driver-side Column objects:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
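To cover all n columns mentioned in the question, one possible sketch (my own extension, assuming all the columns are string-typed as in the question) folds the same when expression over every column:
import org.apache.spark.sql.functions.{col, lit, when}

// Replace "" with null in every column of rawDF.
val cleanedDF = rawDF.columns.foldLeft(rawDF) { (df, c) =>
  df.withColumn(c, when(col(c) === "", lit(null)).otherwise(col(c)))
}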

Removing Blank Strings from a Spark Dataframe

Attempting to remove rows in which a Spark dataframe column contains blank strings. I originally did val df2 = df1.na.drop(), but it turns out many of these values are being encoded as "".
I'm stuck using Spark 1.3.1 and also cannot rely on the DSL (importing spark.implicits._ isn't working).
Removing things from a dataframe requires filter().
newDF = oldDF.filter("colName != ''")
or am I misunderstanding your question?
In case someone doesn't want to drop the records with blank strings, but just convert the blank strings to some constant value:
val newdf = df.na.replace(df.columns,Map("" -> "0")) // to convert blank strings to zero
newdf.show()
You can use this:
df.filter(!($"col_name"===""))
It filters out the rows where the value of "col_name" is "", i.e. nothing/a blank string. I build the match filter and then invert it with "!".
I am also new to Spark, so I don't know if the code below is more complex than necessary, but it works.
Here we create a UDF which converts blank values to null.
sqlContext.udf().register("convertToNull", (String abc) -> (abc.trim().length() > 0 ? abc : null), DataTypes.StringType);
After the above code you can use "convertToNull" (works on strings) in a select clause to make all blank fields null, and then use .na().drop().
crimeDataFrame.selectExpr("C0","convertToNull(C1)","C2","C3").na().drop()
Note: you can use the same approach in Scala, as sketched below.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html
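For reference, a rough Scala sketch of the same idea (my own, using the functions.udf API available in newer Spark versions rather than the registration call shown above):
import org.apache.spark.sql.functions.{col, udf}

// UDF that turns blank (or whitespace-only) strings into nulls.
val convertToNull = udf((s: String) => if (s != null && s.trim.nonEmpty) s else null)

val cleaned = crimeDataFrame
  .withColumn("C1", convertToNull(col("C1")))
  .na.drop()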

How do I use "not rlike" in spark-sql?

rlike works fine but not rlike throws an error:
scala> sqlContext.sql("select * from T where columnB rlike '^[0-9]*$'").collect()
res42: Array[org.apache.spark.sql.Row] = Array([412,0], [0,25], [412,25], [0,25])
scala> sqlContext.sql("select * from T where columnB not rlike '^[0-9]*$'").collect()
java.lang.RuntimeException: [1.35] failure: ``in'' expected but `rlike' found
val df = sc.parallelize(Seq(
(412, 0),
(0, 25),
(412, 25),
(0, 25)
)).toDF("columnA", "columnB")
Or is it a continuation of issue https://issues.apache.org/jira/browse/SPARK-4207?
A concise way to do it in PySpark is:
df.filter(~df.column.rlike(pattern))
There is no such thing as not rlike, but in regex you have something called a negative lookahead, which matches the strings that do not contain a pattern.
For the above query, you can use a regex like the one below. Let's say you want the rows where columnB does not contain any digit from 1 to 9.
Then you can do it like this:
sqlContext.sql("select * from T where columnB rlike '^(?!.*[1-9]).*$'").collect()
Result: Array[org.apache.spark.sql.Row] = Array([412,0])
The overall point is that you have to negate the match within the regex itself, not with rlike. rlike simply matches the regex you give it: if your regex expresses a negative match, that is what gets applied; if it expresses a positive match, it does that.
The above answers suggest using a negative lookahead. That can work in some cases; however, regexes were not designed for efficient negative matching.
Such regexes are error-prone and hard to read.
Spark has supported "not rlike" since version 2.0.
# given 'url' is column on a dataframe
df.filter("""url not rlike "stackoverflow.com"""")
The only usage known to me is in a SQL string expression (as above). I could not find a "not rlike" DSL function in the Python API; there might be one in Scala.
I know your question is getting a bit old, but just in case: have you simply tried Scala's unary "!" operator?
In Java you would do something like this:
DataFrame df = sqlContext.table("T");
DataFrame notLikeDf = df.filter(
    df.col("columnB").rlike("^[0-9]*$").unary_$bang()
);
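In Scala itself, the same filter is simply (a sketch reusing the question's sqlContext and table):
val df = sqlContext.table("T")
val notLikeDf = df.filter(!df("columnB").rlike("^[0-9]*$"))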
In PySpark, I did this as follows:
df = load_your_df()
matching_regex = "yourRegexString"
matching_df = df.filter(df.fieldName.rlike(matching_regex))
non_matching_df = df.subtract(matching_df)
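For completeness, a rough Scala equivalent of the same subtract approach (except is the DataFrame counterpart of PySpark's subtract; the column name and pattern are the same placeholders as above):
val matchingRegex = "yourRegexString"
val matchingDf = df.filter(df("fieldName").rlike(matchingRegex))
val nonMatchingDf = df.except(matchingDf)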