How to find a DataFrame column with only whitespace - pyspark

How can I replace a string which contains only whitespace with None?
Example strings:
Input:
" ",
" Hello There ",
"Hi World"
Output:
null,
" Hello There ",
"Hi World"
I have used the line below, but it replaces everything with null.
df=df.withColumn('TITLE_LINE_3',F.regexp_replace(F.trim(df.TITLE_LINE_3),"^\s+$",None))

You can trim the column and check if it's empty to replace it with null.
df.withColumn('TITLE_LINE_3', when(trim('TITLE_LINE_3') == '', None).otherwise(col('TITLE_LINE_3')))
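For example, a minimal sketch assuming the TITLE_LINE_3 column from the question (the trimmed value is only used for the comparison, so non-blank strings keep their original spacing):
from pyspark.sql import functions as F

df = df.withColumn(
    'TITLE_LINE_3',
    F.when(F.trim(F.col('TITLE_LINE_3')) == '', None)
     .otherwise(F.col('TITLE_LINE_3'))
)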

You can use a UDF to strip the whitespace from your column, and then use when-otherwise to identify the empty values and replace them with None.
Data Preparation
sparkDF = sql.createDataFrame(
    [
        (" ",),
        (" Hi There ",),
        ("Hello World",),
    ],
    ("text",)
)
sparkDF.show()
+-----------+
| text|
+-----------+
| |
| Hi There |
|Hello World|
+-----------+
Strip UDF
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

strip_udf = F.udf(lambda x: x.strip(), StringType())
sparkDF = sparkDF.withColumn('preprocessed_text', strip_udf(F.col('text')))
sparkDF = sparkDF.withColumn(
    'preprocessed_text',
    F.when(F.col('preprocessed_text') == '', None)
     .otherwise(F.col('preprocessed_text'))
)
sparkDF.show()
+-----------+-----------------+
| text|preprocessed_text|
+-----------+-----------------+
| | null|
| Hi There | Hi There|
|Hello World| Hello World|
+-----------+-----------------+

Related

Blank spaces in string | Spark Scala

I have a non-breaking trailing space in a string column. I have tried the solutions below but cannot get rid of the space.
df.select(
  col("city"),
  regexp_replace(col("city"), " ", ""),
  regexp_replace(col("city"), "[\\r\\n]", ""),
  regexp_replace(col("city"), "\\s+$", ""),
  rtrim(col("city"))
).show()
Is there any other possible solution I can try to remove the blank space?
You can use the ltrim, rtrim or trim functions from org.apache.spark.sql.functions:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq(
  ("Bengaluru "),
  (" Bengaluru"),
  (" Bengaluru ")
).toDF("city")
df.show
+----------------+
| city|
+----------------+
| Bengaluru |
| Bengaluru|
| Bengaluru |
+----------------+
df.withColumn("city", ltrim(col("city"))).show
+-------------+
| city|
+-------------+
| Bengaluru |
| Bengaluru|
|Bengaluru |
+-------------+
df.withColumn("city", rtrim(col("city"))).show
+------------+
| city|
+------------+
| Bengaluru|
| Bengaluru|
| Bengaluru|
+------------+
df.withColumn("city", trim(col("city"))).show
+---------+
| city|
+---------+
|Bengaluru|
|Bengaluru|
|Bengaluru|
+---------+
Choose between them depending on whether you want to remove leading spaces, trailing spaces, or both.
Hope this helps!
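Note that trim, ltrim and rtrim only strip the ordinary ASCII space, and Java's \s does not match a non-breaking space (U+00A0) by default, which would explain why a non-breaking trailing space survives all of the attempts above. A minimal sketch of one workaround, shown in PySpark for illustration (the same regexp_replace and pattern exist in org.apache.spark.sql.functions):
from pyspark.sql import functions as F

# strip trailing whitespace, explicitly including the non-breaking space U+00A0
df = df.withColumn("city", F.regexp_replace(F.col("city"), "[\\s\\u00A0]+$", ""))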

PySpark Return Exact Match from list of strings

I have a dataset as follows:
| id | text |
--------------
| 01 | hello world |
| 02 | this place is hell |
I also have a list of keywords I'm searching for:
Keywords = ['hell', 'horrible', 'sucks']
When using the following solution with .rlike() or .contains(), sentences with either partial or exact matches to the list of words are returned as true. I would like only exact matches to be returned.
Current code:
KEYWORDS = 'hell|horrible|sucks'
df = (
    df
    .select(
        F.col('id'),
        F.col('text'),
        F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
    )
)
Current output:
| id | text | keyword_found |
-------------------------------
| 01 | hello world | 1 |
| 02 | this place is hell | 1 |
Expected output:
| id | text | keyword_found |
--------------------------------
| 01 | hello world | 0 |
| 02 | this place is hell | 1 |
Try the code below; I have only changed the keyword pattern:
from pyspark.sql.functions import col,when
data = [["01","hello world"],["02","this place is hell"]]
schema =["id","text"]
df2 = spark.createDataFrame(data, schema)
df2.show()
+---+------------------+
| id| text|
+---+------------------+
| 01| hello world|
| 02|this place is hell|
+---+------------------+
KEYWORDS = '(hell|horrible|sucks)$'
df = (
    df2
    .select(
        col('id'),
        col('text'),
        when(col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
    )
)
df.show()
+---+------------------+-------------+
| id| text|keyword_found|
+---+------------------+-------------+
| 01| hello world| 0|
| 02|this place is hell| 1|
+---+------------------+-------------+
Let me know if you need more help on this.
This should work:
Keywords = 'hell|horrible|sucks'
df = (
    df.select(
        F.col('id'),
        F.col('text'),
        F.when(F.col('text').rlike('(' + Keywords + ')(\\s|$)'), 1).otherwise(0).alias('keyword_found')
    )
)
| id | text | keyword_found |
--------------------------------
| 01 | hello world | 0 |
| 02 | this place is hell | 1 |
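If whole-word matches should count anywhere in the sentence (not only before a space or at the end of the string), regex word boundaries are another option; a minimal sketch assuming the same data:
from pyspark.sql import functions as F

KEYWORDS = r'\b(hell|horrible|sucks)\b'
df = (
    df.select(
        F.col('id'),
        F.col('text'),
        F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
    )
)
# "hello world" -> 0, "this place is hell" -> 1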

Slice Alphanumeric word from column sentence using pyspark

I want to slice only alphanumeric word from column sentence using pyspark.
For Example,
Original text:
Expected results:
You can extract the text between the whitespace:
df.withColumn('newtext', F.regexp_extract('text','\s(.*?)\s',0)).show()
+---+----------------+-------+
| id| text|newtext|
+---+----------------+-------+
| 1|ABCD AB12C BCDEF| AB12C |
+---+----------------+-------+
Following your revised question, extract as ordered:
df.withColumn('newtext', F.regexp_extract('text','([A-Za-z]+\d+[A-Za-z]+|[A-Za-z]+\d+|\d+[A-Za-z]+)',0)).show()
+---+------------------+-------+
| id| text|newtext|
+---+------------------+-------+
| 1| ABCD AB12C BCDEF| AB12C|
| 2|SE2DC WERDF EWSQSA| SE2DC|
| 3| REDC SEDX WSDR12 | WSDR12|
+---+------------------+-------+
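For reference, a single pattern with lookaheads can also grab a whole whitespace-delimited token that contains both a letter and a digit, whatever their order; a sketch assuming the same 'text' column:
from pyspark.sql import functions as F

df.withColumn(
    'newtext',
    F.regexp_extract('text', r'\b(?=[A-Za-z0-9]*\d)(?=[A-Za-z0-9]*[A-Za-z])[A-Za-z0-9]+\b', 0)
).show()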

Replace words in Data frame using List of words in another Data frame in Spark Scala

I have two dataframes, lets say df1 and df2 in Spark Scala
df1 has two fields, 'ID' and 'Text' where 'Text' has some description (Multiple words). I have already removed all special characters and numeric characters from field 'Text' leaving only alphabets and spaces.
df1 Sample
+--------------++--------------------+
|ID ||Text |
+--------------++--------------------+
| 1 ||helo how are you |
| 2 ||hai haiden |
| 3 ||hw are u uma |
--------------------------------------
df2 contains a list of words and corresponding replacement words
df2 Sample
+--------------++--------------------+
|Word ||Replace |
+--------------++--------------------+
| helo ||hello |
| hai ||hi |
| hw ||how |
| u ||you |
--------------------------------------
I would need to find all occurrence of words in df2("Word") from df1("Text") and replace it with df2("Replace")
With the sample dataframes above, I would expect a resulting dataframe, DF3 as given below
df3 Sample
+--------------++--------------------+
|ID ||Text |
+--------------++--------------------+
| 1 ||hello how are you |
| 2 ||hi haiden |
| 3 ||how are you uma |
--------------------------------------
Your help is greatly appreciated in doing the same in Spark using Scala.
It'd be easier to accomplish this if you convert your df2 to a Map. Assuming it's not a huge table, you can do the following:
val keyVal = df2.map(r => (r(0).toString, r(1).toString)).collect.toMap
This will give you a Map to refer to:
scala.collection.immutable.Map[String,String] = Map(helo -> hello, hai -> hi, hw -> how, u -> you)
Now you can use a UDF that uses the keyVal Map to replace values:
val getVal = udf[String, String](x => x.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" "))
Now, you can call the udf getVal on your dataframe to get the desired result.
df1.withColumn("text" , getVal(df1("text")) ).show
+---+-----------------+
| id| text|
+---+-----------------+
| 1|hello how are you|
| 2| hi haiden|
| 3| how are you uma|
+---+-----------------+
I will demonstrate only for the first id, and assume that you cannot do a collect action on your df2. First you need to make sure that the text column of your df1 has an array schema:
+---+--------------------+
| id| text|
+---+--------------------+
| 1|[helo, how, are, ...|
+---+--------------------+
with schema like this:
|-- id: integer (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
After that you can do an explode on the text column
res1.withColumn("text", explode(res1("text")))
+---+----+
| id|text|
+---+----+
| 1|helo|
| 1| how|
| 1| are|
| 1| you|
+---+----+
Assuming your replace dataframe looks like this:
+----+-------+
|word|replace|
+----+-------+
|helo| hello|
| hai| hi|
+----+-------+
Joining the two dataframes will look like this:
res6.join(res8, res6("text") === res8("word"), "left_outer")
+---+----+----+-------+
| id|text|word|replace|
+---+----+----+-------+
| 1| you|null| null|
| 1| how|null| null|
| 1|helo|helo| hello|
| 1| are|null| null|
+---+----+----+-------+
Do a select, coalescing the null values:
res26.select(res26("id"), coalesce(res26("replace"), res26("text")).as("replaced_text"))
+---+-------------+
| id|replaced_text|
+---+-------------+
| 1| you|
| 1| how|
| 1| hello|
| 1| are|
+---+-------------+
and then group by id and aggregate with collect_list:
res33.groupBy("id").agg(collect_list("replaced_text"))
+---+---------------------------+
| id|collect_list(replaced_text)|
+---+---------------------------+
| 1| [you, how, hello,...|
+---+---------------------------+
Keep in mind that you should preserve the initial order of the text elements.
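One way to preserve that order (a hypothetical sketch, written in PySpark; posexplode, coalesce and the other functions used here also exist in the Scala API) is to tag each word with its position before the join and sort on it when rebuilding the sentence:
from pyspark.sql import functions as F

# explode with the position of each word so the order can be restored later
exploded = df1.select('ID', F.posexplode(F.split('Text', ' ')).alias('pos', 'word'))
replaced = (exploded.join(df2, exploded.word == df2.Word, 'left_outer')
            .select('ID', 'pos', F.coalesce(df2.Replace, exploded.word).alias('word')))
# sort the (pos, word) structs by position, then rebuild the sentence
result = (replaced.groupBy('ID')
          .agg(F.sort_array(F.collect_list(F.struct('pos', 'word'))).alias('words'))
          .select('ID', F.concat_ws(' ', F.col('words.word')).alias('Text')))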
I suppose the code below should solve your problem. I have solved this by using RDDs:
val wordRdd = df1.rdd.flatMap { row =>
  val wordList = row.getAs[String]("Text").split(" ").toList
  wordList.map(word => Row.fromTuple(row.getAs[Int]("id"), word))
}.zipWithIndex()
val wordDf = sqlContext.createDataFrame(
  wordRdd.map(x => Row.fromSeq(x._1.toSeq ++ Seq(x._2))),
  StructType(List(StructField("id", IntegerType), StructField("word", StringType), StructField("index", LongType))))
val opRdd = wordDf.join(df2, wordDf("word") === df2("word"), "left_outer").drop(df2("word")).rdd
  .groupBy(_.getAs[Int]("id"))
  .map(x => Row.fromTuple(x._1, x._2.toList.sortBy(_.getAs[Long]("index"))
    .map(row => if (row.getAs[String]("Replace") != null) row.getAs[String]("Replace") else row.getAs[String]("word"))
    .mkString(" ")))
val opDF = sqlContext.createDataFrame(opRdd,
  StructType(List(StructField("id", IntegerType), StructField("Text", StringType))))

Transforming one column into multiple ones in a Spark Dataframe

I have a big dataframe (1.2GB more or less) with this structure:
+---------+--------------+------------------------------------------------------------------------------------------------------+
| country | date_data | text |
+---------+--------------+------------------------------------------------------------------------------------------------------+
| "EEUU" | "2016-10-03" | "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 45ee" |
| "EEUU" | "2016-10-03" | "T_D: QQAA\nT_NAME: name_2\nT_IN: ind_2\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 46ee" |
| . | . | . |
| . | . | . |
| "EEUU" | "2016-10-03" | "T_D: QQWE\nT_NAME: name_300000\nT_IN: ind_65\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 47aa" |
+---------+--------------+------------------------------------------------------------------------------------------------------+
The number of rows is 300,000 and the "text" field is a string of approximately 5,000 characters.
I would like to separate the field "text" into these new fields:
+---------+------------+------+-------------+--------+--------+---------+--------+------+
| country | date_data | t_d | t_name | t_in | t_c | t_add | ...... | t_r |
+---------+------------+------+-------------+--------+--------+---------+--------+------+
| EEUU | 2016-10-03 | QQWE | name_1 | ind_1 | c1ws12 | Sec_1_P | ...... | 45ee |
| EEUU | 2016-10-03 | QQAA | name_2 | ind_2 | c1ws12 | Sec_1_P | ...... | 45ee |
| . | . | . | . | . | . | . | . | |
| . | . | . | . | . | . | . | . | |
| . | . | . | . | . | . | . | . | |
| EEUU | 2016-10-03 | QQWE | name_300000 | ind_65 | c1ws12 | Sec_1_P | ...... | 47aa |
+---------+------------+------+-------------+--------+--------+---------+--------+------+
Currently, I'm using regular expressions to solve this problem. First, I write the regular expressions and create a function to extract the individual fields from text (90 regular expressions in total):
val D_text = "((?<=T_D: ).*?(?=\\\\n))".r
val NAME_text = "((?<=nT_NAME: ).*?(?=\\\\n))".r
val IN_text = "((?<=T_IN: ).*?(?=\\\\n))".r
val C_text = "((?<=T_C: ).*?(?=\\\\n))".r
val ADD_text = "((?<=T_ADD: ).*?(?=\\\\n))".r
.
.
.
.
val R_text = "((?<=T_R: ).*?(?=\\\\n))".r
//UDF function:
def getFirst(pattern2: scala.util.matching.Regex) = udf(
  (url: String) => pattern2.findFirstIn(url) match {
    case Some(texst_new) => texst_new
    case None => "NULL"
    case null => "NULL"
  }
)
Then I create a new DataFrame (tbl_separate_fields) as the result of applying the function with a regular expression to extract each new field from the text.
val tbl_separate_fields = hiveDF.select(
  hiveDF("country"),
  hiveDF("date_data"),
  getFirst(D_text)(hiveDF("texst")).alias("t_d"),
  getFirst(NAME_text)(hiveDF("texst")).alias("t_name"),
  getFirst(IN_text)(hiveDF("texst")).alias("t_in"),
  getFirst(C_text)(hiveDF("texst")).alias("t_c"),
  getFirst(ADD_text)(hiveDF("texst")).alias("t_add"),
  .
  .
  .
  .
  getFirst(R_text)(hiveDF("texst")).alias("t_r")
)
Finally, I insert this dataframe into a Hive table:
tbl_separate_fields.registerTempTable("tbl_separate_fields")
hiveContext.sql("INSERT INTO TABLE TABLE_INSERT PARTITION (date_data) SELECT * FROM tbl_separate_fields")
This solution takes about 1 hour for the entire dataframe, so I would like to optimize it and reduce the execution time. Is there a better solution?
We are using Hadoop 2.7.1 and Apache-Spark 1.5.1. The configuration for Spark is:
val conf = new SparkConf().set("spark.storage.memoryFraction", "0.1")
val sc = new SparkContext(conf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Thanks in advance.
EDIT DATA:
+---------+--------------+------------------------------------------------------------------------------------------------------+
| country | date_data | text |
+---------+--------------+------------------------------------------------------------------------------------------------------+
| "EEUU" | "2016-10-03" | "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 45ee" |
| "EEUU" | "2016-10-03" | "T_NAME: name_2\nT_D: QQAA\nT_IN: ind_2\nT_C: c1ws12 ...........\nT_R: 46ee" |
| . | . | . |
| . | . | . |
| "EEUU" | "2016-10-03" | "T_NAME: name_300000\nT_ADD: Sec_1_P\nT_IN: ind_65\nT_C: c1ws12\n ...........\nT_R: 47aa" |
+---------+--------------+------------------------------------------------------------------------------------------------------+
Using regular expressions in this case is slow and also fragile.
If you know that all records have the same structure, i.e. that all "text" values have the same number and order of "parts", the following code would work (for any number of columns), mainly taking advantage of the split function in org.apache.spark.sql.functions:
import scala.collection.mutable
import org.apache.spark.sql.functions._

// first - split "text" column values into Arrays
val textAsArray: DataFrame = inputDF
  .withColumn("as_array", split(col("text"), "\n"))
  .drop("text")
  .cache()

// get a sample (first row) to get column names, can be skipped if you want to hard-code them:
val sampleText = textAsArray.first().getAs[mutable.WrappedArray[String]]("as_array").toArray
val columnNames: Array[(String, Int)] = sampleText.map(_.split(": ")(0)).zipWithIndex

// add a Column per columnName with the right value and drop the no-longer-needed as_array column
val withValueColumns: DataFrame = columnNames.foldLeft(textAsArray) {
  case (df, (colName, index)) => df.withColumn(colName, split(col("as_array").getItem(index), ": ").getItem(1))
}.drop("as_array")

withValueColumns.show()
// for the sample data I created,
// with just 4 "parts" in "text" column, this prints:
// +-------+----------+----+------+-----+------+
// |country| date_data| T_D|T_NAME| T_IN| T_C|
// +-------+----------+----+------+-----+------+
// | EEUU|2016-10-03|QQWE|name_1|ind_1|c1ws12|
// | EEUU|2016-10-03|QQAA|name_2|ind_2|c1ws12|
// +-------+----------+----+------+-----+------+
Alternatively, if the assumption above is not true, you can use a UDF that converts the text column into a Map, and then perform a similar foldLeft operation over the hard-coded list of desired columns:
import sqlContext.implicits._

// sample data: not the same order, not all records have all columns:
val inputDF: DataFrame = sc.parallelize(Seq(
  ("EEUU", "2016-10-03", "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12"),
  ("EEUU", "2016-10-03", "T_D: QQAA\nT_IN: ind_2\nT_NAME: name_2")
)).toDF("country", "date_data", "text")

// hard-coded list of expected column names:
val columnNames: Seq[String] = Seq("T_D", "T_NAME", "T_IN", "T_C")

// UDF to convert text into key-value map
val asMap = udf[Map[String, String], String] { s =>
  s.split("\n").map(_.split(": ")).map { case Array(k, v) => k -> v }.toMap
}

val textAsMap = inputDF.withColumn("textAsMap", asMap(col("text"))).drop("text")

// for each column name - lookup the value in the map
val withValueColumns: DataFrame = columnNames.foldLeft(textAsMap) {
  case (df, colName) => df.withColumn(colName, col("textAsMap").getItem(colName))
}.drop("textAsMap")

withValueColumns.show()
// prints:
// +-------+----------+----+------+-----+------+
// |country| date_data| T_D|T_NAME| T_IN| T_C|
// +-------+----------+----+------+-----+------+
// | EEUU|2016-10-03|QQWE|name_1|ind_1|c1ws12|
// | EEUU|2016-10-03|QQAA|name_2|ind_2| null|
// +-------+----------+----+------+-----+------+