How to extract the numeric part from a string column in spark? - scala

I am new to Spark and trying to play with data to get practice. I am using Databricks with Scala, and for the dataset I am using the FIFA 19 complete player dataset from Kaggle. One of the columns, named "Weight", contains data that looks like
+------+
|Weight|
+------+
|136lbs|
|156lbs|
|136lbs|
|... |
|... |
+------+
I want to change the column so that it looks like this
+------+
|Weight|
+------+
|136 |
|156 |
|136 |
|... |
|... |
+------+
Can anyone help me with how to change the column values in Spark SQL?

Here is another way, using a regex with the regexp_extract built-in function:
import org.apache.spark.sql.functions.regexp_extract
val df = Seq(
  "136lbs",
  "150lbs",
  "12lbs",
  "30kg",
  "500kg"
).toDF("weight")
df.withColumn("weight_num", regexp_extract($"weight", "\\d+", 0))
.withColumn("weight_unit", regexp_extract($"weight", "[a-z]+", 0))
.show
//Output
+------+----------+-----------+
|weight|weight_num|weight_unit|
+------+----------+-----------+
|136lbs| 136| lbs|
|150lbs| 150| lbs|
| 12lbs| 12| lbs|
| 30kg| 30| kg|
| 500kg| 500| kg|
+------+----------+-----------+

You can create a new column and use regexp_replace:
dataFrame.withColumn("Weight2", regexp_replace($"Weight", "lbs", ""))

Related

Split single String column to multiple columns in Spark-Scala

I have a dataframe as:
+----+--------------------------+
|city|Types |
+----+--------------------------+
|BNG |school |
|HYD |school,restaurant |
|MUM |school,restaurant,hospital|
+----+--------------------------+
I want to split the Types column into multiple columns on ','.
The problem is that the number of values per row is not fixed, so I am not sure how to do it.
I saw a related question for PySpark, but I want to do it in Spark with Scala, not PySpark.
Any help is appreciated.
Thanks in advance.
One way to address the irregular number of values in the column is to tweak the representation.
For example:
val data = Seq(("BNG", "school"),("HYD", "school,res"),("MUM", "school,res,hos")).toDF("city","types")
+----+--------------+
|city| types|
+----+--------------+
| BNG| school|
| HYD| school,res|
| MUM|school,res,hos|
+----+--------------+
data.withColumn("isSchool", array_contains(split(col("types"),","), "school")).withColumn("isRes", array_contains(split(col("types"),","), "res")).withColumn("isHos", array_contains(split(col("types"),","), "hos"))
+----+--------------+--------+-----+-----+
|city| types|isSchool|isRes|isHos|
+----+--------------+--------+-----+-----+
| BNG| school| true|false|false|
| HYD| school,res| true| true|false|
| MUM|school,res,hos| true| true| true|
+----+--------------+--------+-----+-----+
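If you really need one column per position rather than boolean flags, you can also split once into an array and pick elements by index with getItem; rows with fewer values simply get null in the extra columns. A sketch under that assumption (it needs an extra pass over the data to find the widest row, and the type_N column names are just an example):
import org.apache.spark.sql.functions.{col, max, size, split}

val withArr = data.withColumn("arr", split(col("types"), ","))
// the widest row decides how many columns to create
val maxCols = withArr.agg(max(size(col("arr")))).head.getInt(0)
val result = (0 until maxCols).foldLeft(withArr) { (df, i) =>
  df.withColumn(s"type_$i", col("arr").getItem(i))
}.drop("arr")
result.show(false)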

How to check whether a whole column in PySpark contains a value using expr

In PySpark, how can I use expr to check whether a whole column contains the value in columnA of that row?
Pseudo-code below:
df = df.withColumn("Result", expr(if any of the rows in column1 contains the value of colA (for this row) then 1 else 0))
Take an arbitrary example:
valuesCol = [('rose','rose is red'),('jasmine','I never saw Jasmine'),('lily','Lili dont be silly'),('daffodil','what a flower')]
df = sqlContext.createDataFrame(valuesCol,['columnA','columnB'])
df.show()
+--------+-------------------+
| columnA| columnB|
+--------+-------------------+
| rose| rose is red|
| jasmine|I never saw Jasmine|
| lily| Lili dont be silly|
|daffodil| what a flower|
+--------+-------------------+
Here is an application of expr(). To use expr(), just look up the corresponding SQL syntax; most of it works with expr() as-is.
df = df.withColumn('columnA_exists',expr("(case when instr(lower(columnB), lower(columnA))>=1 then 1 else 0 end)"))
df.show()
+--------+-------------------+--------------+
| columnA| columnB|columnA_exists|
+--------+-------------------+--------------+
| rose| rose is red| 1|
| jasmine|I never saw Jasmine| 1|
| lily| Lili dont be silly| 0|
|daffodil| what a flower| 0|
+--------+-------------------+--------------+
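The same check can be written in Scala (the language used elsewhere in this thread) without expr, by lower-casing both columns and using Column.contains. A minimal sketch, assuming a DataFrame df with the same columnA/columnB layout:
import org.apache.spark.sql.functions.{col, lower, when}

// 1 when columnB contains the value of columnA (case-insensitive), else 0
val result = df.withColumn("columnA_exists",
  when(lower(col("columnB")).contains(lower(col("columnA"))), 1).otherwise(0))
result.show()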

split a spark dataframe into multiple columns in Spark 1.6

I have a spark dataframe of the below format:
+--------------------+
|value |
+--------------------+
|Id,date |
|000027,2017-11-14 |
|000045,2017-11-15 |
|000056,2018-09-09 |
|C000056,2018-07-01 |
+--------------------+
I need to loop through each row, split it by comma (,) and then place the values in different columns (Id and date as two separate columns).
I am new to Spark and not sure whether this can be done through a lambda function. Any suggestions would be appreciated.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Demo").getOrCreate()
import spark.implicits._ // needed for toDF when not running in the shell
val df = Seq("a,b,c,f", "d,f,g,h").toDF("value")
df.show //show the dataFrame
+-------+
| value|
+-------+
|a,b,c,f|
|d,f,g,h|
+-------+
// split the value column on the "," delimiter and create an RDD[Row]
val rdd = df.rdd.map(x => Row(x.getString(0).split(","): _*))
val schema = StructType(Array("name", "class", "rank", "grade").map(x => StructField(x, StringType, true)))
spark.createDataFrame(rdd,schema).show
+----+-----+----+-----+
|name|class|rank|grade|
+----+-----+----+-----+
| a| b| c| f|
| d| f| g| h|
+----+-----+----+-----+
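For the Id/date layout in the question you can also stay in the DataFrame API (split and getItem are available in Spark 1.6), without going through an RDD. A sketch along those lines, assuming the question's single-column DataFrame is called df and filtering out the embedded header row by its value:
import org.apache.spark.sql.functions.{col, split}

val parts = split(col("value"), ",")
val result = df
  .filter("value <> 'Id,date'")        // drop the embedded header row
  .withColumn("Id", parts.getItem(0))
  .withColumn("date", parts.getItem(1))
  .drop("value")
result.show()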

replacing strings inside df using dictionary scala

I'm new to Scala. I'm trying to replace parts of strings using a dictionary.
My dictionary would be:
val dict = Seq(("fruits", "apples"), ("color", "red"), ("city", "paris")).toDF("old", "new")
+------+------+
| old| new|
+------+------+
|fruits|apples|
| color| red|
| city| paris|
+------+------+
I would then translate fields from a column in another df which is:
+--------------------------+
|oldCol |
+--------------------------+
|I really like fruits |
|they are colored brightly |
|i live in city!! |
+--------------------------+
the desired output:
+------------------------+
|newCol |
+------------------------+
|I really like apples |
|they are reded brightly |
|i live in paris!! |
+------------------------+
Please help! I've tried to convert dict to a Map and then use the replaceAllIn() function, but I really can't solve this one.
I've also tried foldLeft following this answer: Scala replace an String with a List of Key/Values.
Thanks
Create a Map from the dict dataframe, and then you can easily do this using a udf like below:
import org.apache.spark.sql.functions._
//Create a Map from the dict dataframe
val oldNewMap = dict.collect.map(row => row.getString(0) -> row.getString(1)).toMap
//Create a udf that applies every old -> new replacement to the input string
val replaceUdf = udf((str: String) => oldNewMap.foldLeft(str) { case (acc, (key, value)) => acc.replaceAll(key, value) })
//Select old column from oldDf and apply udf
oldDf.withColumn("newCol",replaceUdf(oldDf.col("oldCol"))).drop("oldCol").show
//Output:
+--------------------+
| newCol|
+--------------------+
|I really like apples|
|they are reded br...|
| i live in paris!!|
+--------------------+
I hope this helps.
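An alternative that stays in the DataFrame API (no UDF) is to fold the collected dictionary into a chain of regexp_replace calls over the column. A minimal sketch, assuming the same dict and oldDf names as above; note the old values are treated as regex patterns:
import org.apache.spark.sql.functions.{col, regexp_replace}

// nest one regexp_replace per dictionary entry: fruits -> apples, color -> red, city -> paris
val replaced = dict.collect().foldLeft(col("oldCol")) { (c, row) =>
  regexp_replace(c, row.getString(0), row.getString(1))
}
oldDf.select(replaced.as("newCol")).show(false)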

Appending a new column to existing dataframe Spark scala [duplicate]

I am using the Apache Spark 2.0 Dataframe/Dataset API.
I want to add a new column to my dataframe from a List of values. The list has the same number of values as the given dataframe.
val list = List(4,5,10,7,2)
val df = List("a","b","c","d","e").toDF("row1")
I would like to do something like:
val appendedDF = df.withColumn("row2",somefunc(list))
appendedDF.show()
// +----+------+
// |row1 |row2 |
// +----+------+
// |a |4 |
// |b |5 |
// |c |10 |
// |d |7 |
// |e |2 |
// +----+------+
I would be grateful for any ideas; my dataframe in reality contains more columns.
You could do it like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// create rdd from the list
val rdd = sc.parallelize(List(4,5,10,7,2))
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:28
// zip the data frame with rdd
val rdd_new = df.rdd.zip(rdd).map(r => Row.fromSeq(r._1.toSeq ++ Seq(r._2)))
// rdd_new: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[33] at map at <console>:32
// create a new data frame from the rdd_new with modified schema
spark.createDataFrame(rdd_new, df.schema.add("new_col", IntegerType)).show
+----+-------+
|row1|new_col|
+----+-------+
| a| 4|
| b| 5|
| c| 10|
| d| 7|
| e| 2|
+----+-------+
Adding for completeness: the fact that the input list (which exists in driver memory) has the same size as the DataFrame suggests that this is a small DataFrame to begin with - so you might consider collect()-ing it, zipping with list, and converting back into a DataFrame if needed:
df.collect()
  .map(_.getAs[String]("row1"))
  .zip(list).toList
  .toDF("row1", "row2")
That won't be faster, but if the data is really small it might be negligible and the code is (arguably) clearer.
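Another option, if collecting is not desirable, is to give both sides an index and join: the list via zipWithIndex, the DataFrame via row_number over a window. A sketch under that assumption; a window without partitionBy funnels all rows through one partition, so this also only suits modest sizes:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// 0-based index over the DataFrame rows in their current order
val dfIdx = df.withColumn("idx",
  row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)

// index the list the same way and join the two on idx
// (spark.implicits._ assumed in scope for toDF, as elsewhere in this thread)
val listDf = list.zipWithIndex.map { case (v, i) => (i, v) }.toDF("idx", "row2")
dfIdx.join(listDf, "idx").drop("idx").show()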