How do I use "not rlike" in spark-sql? - scala

rlike works fine but not rlike throws an error:
scala> sqlContext.sql("select * from T where columnB rlike '^[0-9]*$'").collect()
res42: Array[org.apache.spark.sql.Row] = Array([412,0], [0,25], [412,25], [0,25])
scala> sqlContext.sql("select * from T where columnB not rlike '^[0-9]*$'").collect()
java.lang.RuntimeException: [1.35] failure: ``in'' expected but `rlike' found
The test DataFrame was created like this:
val df = sc.parallelize(Seq(
  (412, 0),
  (0, 25),
  (412, 25),
  (0, 25)
)).toDF("columnA", "columnB")
Or is this a continuation of issue https://issues.apache.org/jira/browse/SPARK-4207?

A concise way to do it in PySpark is:
df.filter(~df.column.rlike(pattern))

There is no such thing as not rlike, but regular expressions have negative lookahead, which lets you match strings that do not contain a given pattern.
For the query above, you can use a regex like the one below. Say you want only the rows where columnB contains no digit from 1 to 9; then you can do it like this:
sqlContext.sql("select * from T where columnB rlike '^(?!.*[1-9]).*$'").collect()
Result: Array[org.apache.spark.sql.Row] = Array([412,0])
The overall point is that you have to express the negation in the regex itself, not through rlike. rlike simply tests the value against the regex you give it: if your regex encodes a negative condition, that is what gets applied; if it encodes a positive match, that is what gets applied.
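For the specific pattern in the question there is also a simpler complement, with no lookahead needed: a value fails to match '^[0-9]*$' exactly when it contains at least one non-digit character. A minimal sketch against the same table:
// select the rows whose columnB contains a character that is not a digit
sqlContext.sql("select * from T where columnB rlike '[^0-9]'").collect()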

The answers above suggest using a negative lookahead, and that can work in some cases. However, regexes were not designed for efficient negative matching; such patterns tend to be error-prone and hard to read.
Spark has supported "not rlike" since version 2.0.
# given 'url' is a column on the DataFrame df
df.filter('url not rlike "stackoverflow.com"')
The only usage known to me is a SQL string expression (as above). I could not find a "not" SQL DSL function in the Python API; there might be one in Scala.
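In Scala there is in fact a DSL-level "not": org.apache.spark.sql.functions.not, which can wrap the rlike condition. A minimal sketch, assuming the same df with a url column as above:
import org.apache.spark.sql.functions.{col, not}

// keep rows whose url does NOT match the pattern
val withoutStackoverflow = df.filter(not(col("url").rlike("stackoverflow.com")))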

I know your question is getting a bit old, but just in case: have you simply tried Scala's unary "!" operator?
In Java it would go something like this:
DataFrame df = sqlContext.table("T");
DataFrame notLikeDf = df.filter(
df.col("columnB").rlike("^[0-9]*$").unary_$bang()
);
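For completeness, the Scala version is shorter, since ! can be applied to a Column directly. A minimal sketch against the same T table:
val df = sqlContext.table("T")
// negate the rlike condition with Scala's unary ! operator
val notLikeDf = df.filter(!df("columnB").rlike("^[0-9]*$"))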

In PySpark, I did it like this:
df = load_your_df()
matching_regex = "yourRegexString"
matching_df = df.filter(df.fieldName.rlike(matching_regex))
non_matching_df = df.subtract(matching_df)
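For reference, the equivalent subtract-style approach in Scala uses except, which (like subtract) is distinct-based. A sketch, with DataFrame and column names mirroring the snippet above:
import org.apache.spark.sql.functions.col

val matchingRegex = "yourRegexString"
val matchingDf = df.filter(col("fieldName").rlike(matchingRegex))
val nonMatchingDf = df.except(matchingDf) // distinct rows of df that are not in matchingDf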

Related

Pyspark when statement

Hi, I'm starting to use PySpark and want to apply a when/otherwise condition:
df_1 = df.withColumn("test", when(df.first_name == df2.firstname & df.last_namne == df2.lastname, "1. Match on First and Last Name").otherwise ("No Match"))
I get the below error and wanted some assistance to understand why the above is not working.
Both df.first_name and df.last_name are strings, and df2.firstname and df2.lastname are strings too.
Error:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Thanks in advance
There are several issues in your statement (a sketch of the corrected pattern follows this list):
For df.withColumn(), you cannot use df and df2 columns in one statement. First join the two dataframes using df.join(df2, on="some_key", how="left/right/full").
Enclose each side of the and condition in the when clause in parentheses: (df.first_name == df2.firstname) & (df.last_name == df2.lastname)
The string literals of "when" and "otherwise" can be wrapped in lit(), like lit("1. Match on First and Last Name") and lit("No Match").
There is possibly a typo in your field name df.last_namne.
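Putting the first two fixes together, here is a minimal sketch of the corrected pattern, written in Scala (the join key "id" is hypothetical; in PySpark the shape is the same, with & instead of &&):
import org.apache.spark.sql.functions.{col, lit, when}

// join first, then build the conditional column with parenthesized conditions
val joined = df.join(df2, Seq("id"), "left")
val df_1 = joined.withColumn(
  "test",
  when((col("first_name") === col("firstname")) && (col("last_name") === col("lastname")),
       lit("1. Match on First and Last Name"))
    .otherwise(lit("No Match"))
)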

Spark: dropping multiple columns using regex

In a Spark (2.3.0) project using Scala, I would like to drop multiple columns using a regex. I tried using colRegex, but without success:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
// Hoping to get columns Array(id, a, b)
df_in.columns
// Getting Array(id, a, a_out, b, b_out)
On the other hand, the mechanism seems to work with select:
df.select(df.colRegex("`.*_(in|out)`")).columns
// Getting Array(a_in, a_out, b_in, b_out)
Several things are not clear to me:
what is this backquote syntax in the regex?
colRegex returns a Column: how can it actually represent several columns in the 2nd example?
can I combine drop and colRegex or do I need some workaround?
If you check the Spark source code of the colRegex method, it expects the regex to be passed in the format below:
/** the column name pattern in quoted regex without qualifier */
val escapedIdentifier = "`(.+)`".r
/** the column name pattern in quoted regex with qualifier */
val qualifiedEscapedIdentifier = ("(.+)" + """.""" + "`(.+)`").r
Backticks (`) are required around your regex; otherwise the patterns above will not recognize your input as a regex.
As a workaround, you can compute the unwanted column names yourself and drop them, as shown below:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
val junkColumns = df_in.columns.filter(_.matches(".*_(in|out)$")).toSeq // collect the *_in / *_out columns that should go
val final_df_in = df_in.drop(junkColumns: _*) // drop them, keeping only the columns that are valid per your criteria
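To confirm the workaround, the columns of the final frame can be checked (expected output given the columns above):
final_df_in.columns
// Array(id, a, b)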
In addition to the workaround proposed by Waqar Ahmed and kavetiraviteja (accepted answer), here is another possibility based on select with some negative regex magic. More concise, but harder to read for non-regex-gurus...
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.select(df.colRegex("`^(?!.*_(in|out)).*$`")) // negative lookahead: keep only the columns that do not contain _in or _out

Array manipulation in Spark, Scala

I'm new to Scala and Spark, and I have a problem while trying to learn from some toy dataframes.
I have a dataframe having the following two columns:
Name_Description Grade
Name_Description is an array, and Grade is just a letter. It's Name_Description that I'm having a problem with; I'm trying to transform this column using Scala on Spark.
Name_Description is not a fixed-size array. It could be something like
['asdf_ Brandon', 'Ca%abc%rd']
['fthhhhChris', 'Rock', 'is the %abc%man']
The only problems are the following:
1. The first element of the array ALWAYS has 6 garbage characters, so the real content starts at the 7th character.
2. %abc% randomly pops up in elements, so I want to erase it.
Is there any way to achieve those two things in Scala? For instance, I just want
['asdf_ Brandon', 'Ca%abc%rd'], ['fthhhhChris', 'Rock', 'is the %abc%man']
to change to
['Brandon', 'Card'], ['Chris', 'Rock', 'is the man']
What you're trying to do might be hard to achieve using standard Spark functions, but you could define a UDF for that:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

val removeGarbage = udf { arr: WrappedArray[String] =>
  // in case the array is empty we need to map over the Option
  arr.headOption
    // drop the first 6 characters from the first element, then remove %abc% from the rest
    .map(head => head.drop(6) +: arr.tail.map(_.replace("%abc%", "")))
    .getOrElse(arr)
}
Then you just need to use this UDF on your Name_Description column:
val df = List(
  (1, Array("asdf_ Brandon", "Ca%abc%rd")),
  (2, Array("fthhhhChris", "Rock", "is the %abc%man"))
).toDF("Grade", "Name_Description")
df.withColumn("Name_Description", removeGarbage($"Name_Description")).show(false)
Show prints:
+-----+-------------------------+
|Grade|Name_Description |
+-----+-------------------------+
|1 |[Brandon, Card] |
|2 |[Chris, Rock, is the man]|
+-----+-------------------------+
We are generally encouraged to use Spark SQL functions and avoid UDFs where we can. I have a simpler solution that makes use of the Spark SQL functions.
Please find my approach below. Hope it helps.
val d = Array((1,Array("asdf_ Brandon","Ca%abc%rd")),(2,Array("fthhhhChris", "Rock", "is the %abc%man")))
val df = spark.sparkContext.parallelize(d).toDF("Grade","Name_Description")
This is how I created the input dataframe.
df.select('Grade,posexplode('Name_Description)).registerTempTable("data")
We explode the array along with the position of each element. I register the dataframe as a temp table in order to generate the required output with a query.
spark.sql("""select Grade, collect_list(Names) from (select Grade,case when pos=0 then substring(col,7) else replace(col,"%abc%","") end as Names from data) a group by Grade""").show
This query produces the required output. Hope this helps.
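If you are on Spark 2.4 or later, the same logic can also be written without a temp table, using the transform higher-order function with an index-aware lambda. A sketch against the df built above (this also preserves the original element order):
df.selectExpr(
  "Grade",
  // position 0: drop the 6 garbage characters; other positions: strip %abc%
  "transform(Name_Description, (x, i) -> case when i = 0 then substring(x, 7) else replace(x, '%abc%', '') end) as Name_Description"
).show(false)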

Check that a SPARK Dataframe column matches a Regex for all occurrences using Scala

I am using Scala.
I have a dataframe with a date column which looks like this:
| date |
|2017-09-24T11:05:52.647+02:00|
|2018-09-24T11:05:52.647+02:00|
|2018-10-24T11:05:52.647+02:00|
I have a regex to check the date format:
val pattern = """([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])T\d{2}:\d{2}:\d{2}\.\d{3}\+\d{2}:\d{2})""".r
I want to check that every row in the dataframe matches the regex: if they all do, return true, otherwise return false. I need a single true or false, not a list.
Any help is welcome and many thanks for your help.
This should work, but turning the problem around: find the first non-match.
import scala.util.Try
val result = Try(Option(df.filter($"cityid" rlike "[^0-9]").first)).toOption.flatten
if (result.isEmpty) { println("Empty")}
I use a DataFrame as the intermediate outcome, and you can just check whether the result is empty or not.
Please tailor this to your own situation, e.g. check for non-empty instead, or plug in your own regex.
Without the Try and Option wrapping, .first throws an error when the filter returns nothing. With them, None is returned in that case and you can do the empty check.
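For the question as asked (a single Boolean for the whole column), a more direct sketch using the date pattern from the question is below. Note that rlike matches substrings, so the pattern is anchored with ^ and $, NULL dates are ignored by this check, and Dataset.isEmpty needs Spark 2.4+ (use .count == 0 on older versions):
import org.apache.spark.sql.functions.col

val datePattern = """^[12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])T\d{2}:\d{2}:\d{2}\.\d{3}\+\d{2}:\d{2}$"""
// true exactly when no row fails to match the anchored pattern
val allMatch = df.filter(!col("date").rlike(datePattern)).isEmpty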

Possible to filter Spark dataframe by ISNUMERIC function?

I have a DataFrame for a SQL table. I want to filter this DataFrame depending on whether the value of a certain column is numeric or not.
val df = sqlContext.sql("select * from myTable");
val filter = df.filter("ISNUMERIC('col_a')");
I want filter to be a dataframe of df where the values in col_a are numeric.
My current solution doesn't work. How can I achieve this?
You can filter it like this:
df.filter(row => row.getAs[String]("col_a").matches("""\d+"""))
Hope this helps!
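The same whole-value check can also be written against the Column with rlike instead of a typed row filter. A minimal sketch (col_a as in the question; like the \d+ version, this accepts non-negative integers only):
import org.apache.spark.sql.functions.col

// keep rows whose col_a consists only of digits
val numericRows = df.filter(col("col_a").rlike("^[0-9]+$"))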
You can cast the field in question to DECIMAL and inspect the result:
filter("CAST(col_a AS DECIMAL) IS NOT NULL")
Optionally, you can specify precision and/or scale to narrow the valid numbers down to a specific range:
filter("CAST(col_a AS DECIMAL(18,8)) IS NOT NULL")
Shankar Koirala's answer covers integers effectively. The regex below also covers doubles, with an optional negative sign, plus null handling (note that this is a Java variation):
df.filter(df.col("col_a").isNotNull())
  .filter((FilterFunction<Row>) row ->
      row.getString(row.fieldIndex("col_a")).matches("-?\\d+\\.?\\d*"));
spark.sql("select phone_number, (CASE WHEN LENGTH(REGEXP_REPLACE(phone_number),'[^0-9]', '')) = LENGTH(TRIM(phone_number)) THEN true ELSE false END) as phone_number_isNumeric from table").show()
This is really an old post, but in case anybody is looking for an alternative solution:
REGEXP_REPLACE(phone_number, '[^0-9]', '')
removes every character that is not a digit, so the length comparison above is true only when the trimmed value is entirely numeric.