How to extract the numeric part from a string column in spark? - scala

I am new to Spark and trying to play with data to get practice. I am using Databricks with Scala, and for the dataset I am using the FIFA 19 complete player dataset from Kaggle. One of the columns, named "Weight", contains data that looks like
+------+
|Weight|
+------+
|136lbs|
|156lbs|
|136lbs|
|... |
|... |
+------+
I want to change the column so that it looks like this
+------+
|Weight|
+------+
|136 |
|156 |
|136 |
|... |
|... |
+------+
Can anyone help me with how to change the column values in Spark SQL?

Here is another way, using a regex with the regexp_extract built-in function:
import org.apache.spark.sql.functions.regexp_extract
val df = Seq(
  "136lbs",
  "150lbs",
  "12lbs",
  "30kg",
  "500kg"
).toDF("weight")
df.withColumn("weight_num", regexp_extract($"weight", "\\d+", 0))
.withColumn("weight_unit", regexp_extract($"weight", "[a-z]+", 0))
.show
//Output
+------+----------+-----------+
|weight|weight_num|weight_unit|
+------+----------+-----------+
|136lbs| 136| lbs|
|150lbs| 150| lbs|
| 12lbs| 12| lbs|
| 30kg| 30| kg|
| 500kg| 500| kg|
+------+----------+-----------+

You can create a new column and use regexp_replace:
dataFrame.withColumn("Weight2", regexp_replace($"Weight", "lbs", ""))

Related

Split single String column to multiple columns in Spark-Scala

I have a dataframe as:
+----+--------------------------+
|city|Types |
+----+--------------------------+
|BNG |school |
|HYD |school,restaurant |
|MUM |school,restaurant,hospital|
+----+--------------------------+
I want to split the Types column into multiple columns on ','.
The problem is that the number of values per row is not fixed, so I am not sure how to do it.
I saw a related question for PySpark, but I want to do it in Spark with Scala, not PySpark.
Any help is appreciated.
Thanks in advance.
One way to address the irregular number of values in the column is to tweak the representation.
For example:
val data = Seq(("BNG", "school"),("HYD", "school,res"),("MUM", "school,res,hos")).toDF("city","types")
+----+--------------+
|city| types|
+----+--------------+
| BNG| school|
| HYD| school,res|
| MUM|school,res,hos|
+----+--------------+
data.withColumn("isSchool", array_contains(split(col("types"),","), "school")).withColumn("isRes", array_contains(split(col("types"),","), "res")).withColumn("isHos", array_contains(split(col("types"),","), "hos"))
+----+--------------+--------+-----+-----+
|city| types|isSchool|isRes|isHos|
+----+--------------+--------+-----+-----+
| BNG| school| true|false|false|
| HYD| school,res| true| true|false|
| MUM|school,res,hos| true| true| true|
+----+--------------+--------+-----+-----+
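If you really need one column per position rather than boolean flags, you can also split once into an array and pick elements by index with getItem; rows with fewer values simply get null in the extra columns. A sketch under that assumption (it needs an extra pass over the data to find the widest row, and the type_N column names are just an example):
import org.apache.spark.sql.functions.{col, max, size, split}

val withArr = data.withColumn("arr", split(col("types"), ","))
// the widest row decides how many columns to create
val maxCols = withArr.agg(max(size(col("arr")))).head.getInt(0)
val result = (0 until maxCols).foldLeft(withArr) { (df, i) =>
  df.withColumn(s"type_$i", col("arr").getItem(i))
}.drop("arr")
result.show(false)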

How to check whether a whole column in PySpark contains a value using expr

In PySpark, how can I use expr to check whether a whole column contains the value in columnA of that row?
Pseudo-code below:
df = df.withColumn("Result", expr(if any of the rows in column1 contains the value of colA (for this row) then 1 else 0))
Take an arbitrary example:
valuesCol = [('rose','rose is red'),('jasmine','I never saw Jasmine'),('lily','Lili dont be silly'),('daffodil','what a flower')]
df = sqlContext.createDataFrame(valuesCol,['columnA','columnB'])
df.show()
+--------+-------------------+
| columnA| columnB|
+--------+-------------------+
| rose| rose is red|
| jasmine|I never saw Jasmine|
| lily| Lili dont be silly|
|daffodil| what a flower|
+--------+-------------------+
Here is an application of expr(). To use expr(), just look up the corresponding SQL syntax; most of it works with expr() as-is.
df = df.withColumn('columnA_exists',expr("(case when instr(lower(columnB), lower(columnA))>=1 then 1 else 0 end)"))
df.show()
+--------+-------------------+--------------+
| columnA| columnB|columnA_exists|
+--------+-------------------+--------------+
| rose| rose is red| 1|
| jasmine|I never saw Jasmine| 1|
| lily| Lili dont be silly| 0|
|daffodil| what a flower| 0|
+--------+-------------------+--------------+
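The same check can be written in Scala (the language used elsewhere in this thread) without expr, by lower-casing both columns and using Column.contains. A minimal sketch, assuming a DataFrame df with the same columnA/columnB layout:
import org.apache.spark.sql.functions.{col, lower, when}

// 1 when columnB contains the value of columnA (case-insensitive), else 0
val result = df.withColumn("columnA_exists",
  when(lower(col("columnB")).contains(lower(col("columnA"))), 1).otherwise(0))
result.show()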

split a spark dataframe into multiple columns in Spark 1.6

I have a spark dataframe of the below format:
+--------------------+
|value |
+--------------------+
|Id,date |
|000027,2017-11-14 |
|000045,2017-11-15 |
|000056,2018-09-09 |
|C000056,2018-07-01 |
+--------------------+
I need to loop through each row, split it by comma (,) and then place the values in different columns (Id and date as two separate columns).
I am new to Spark and not sure whether this can be done through a lambda function. Any suggestions would be appreciated.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Demo").getOrCreate()
import spark.implicits._ // needed for toDF when not running in the shell
val df = Seq("a,b,c,f", "d,f,g,h").toDF("value")
df.show //show the dataFrame
+-------+
| value|
+-------+
|a,b,c,f|
|d,f,g,h|
+-------+
// split the value column on the "," delimiter and create an RDD[Row]
val rdd = df.rdd.map(x => Row(x.getString(0).split(","): _*))
val schema = StructType(Array("name", "class", "rank", "grade").map(x => StructField(x, StringType, true)))
spark.createDataFrame(rdd,schema).show
+----+-----+----+-----+
|name|class|rank|grade|
+----+-----+----+-----+
| a| b| c| f|
| d| f| g| h|
+----+-----+----+-----+
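For the Id/date layout in the question you can also stay in the DataFrame API (split and getItem are available in Spark 1.6), without going through an RDD. A sketch along those lines, assuming the question's single-column DataFrame is called df and filtering out the embedded header row by its value:
import org.apache.spark.sql.functions.{col, split}

val parts = split(col("value"), ",")
val result = df
  .filter("value <> 'Id,date'")        // drop the embedded header row
  .withColumn("Id", parts.getItem(0))
  .withColumn("date", parts.getItem(1))
  .drop("value")
result.show()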

replacing strings inside df using dictionary scala

I'm new to Scala. I'm trying to replace parts of strings using a dictionary.
My dictionary would be:
val dict = Seq(("fruits", "apples"), ("color", "red"), ("city", "paris")).toDF("old", "new")
+------+------+
| old| new|
+------+------+
|fruits|apples|
| color| red|
| city| paris|
+------+------+
I would then translate fields from a column in another df which is:
+--------------------------+
|oldCol |
+--------------------------+
|I really like fruits |
|they are colored brightly |
|i live in city!! |
+--------------------------+
the desired output:
+------------------------+
|newCol |
+------------------------+
|I really like apples |
|they are reded brightly |
|i live in paris!! |
+------------------------+
Please help! I've tried to convert dict to a Map and then use the replaceAllIn() function, but I really can't solve this one.
I've also tried foldLeft following this answer: Scala replace an String with a List of Key/Values.
Thanks
Create a Map from the dict dataframe, and then you can easily do this using a udf like below:
import org.apache.spark.sql.functions._
//Create a Map from the dict dataframe
val oldNewMap = dict.collect.map(row => row.getString(0) -> row.getString(1)).toMap
//Create a udf that applies every old -> new replacement to the input string
val replaceUdf = udf((str: String) => oldNewMap.foldLeft(str) { case (acc, (key, value)) => acc.replaceAll(key, value) })
//Select old column from oldDf and apply udf
oldDf.withColumn("newCol",replaceUdf(oldDf.col("oldCol"))).drop("oldCol").show
//Output:
+--------------------+
| newCol|
+--------------------+
|I really like apples|
|they are reded br...|
| i live in paris!!|
+--------------------+
I hope this helps.
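An alternative that stays in the DataFrame API (no UDF) is to fold the collected dictionary into a chain of regexp_replace calls over the column. A minimal sketch, assuming the same dict and oldDf names as above; note the old values are treated as regex patterns:
import org.apache.spark.sql.functions.{col, regexp_replace}

// nest one regexp_replace per dictionary entry: fruits -> apples, color -> red, city -> paris
val replaced = dict.collect().foldLeft(col("oldCol")) { (c, row) =>
  regexp_replace(c, row.getString(0), row.getString(1))
}
oldDf.select(replaced.as("newCol")).show(false)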

Appending a new column to existing dataframe Spark scala [duplicate]

I am using the Apache Spark 2.0 Dataframe/Dataset API.
I want to add a new column to my dataframe from a List of values. The list has the same number of values as the given dataframe.
val list = List(4,5,10,7,2)
val df = List("a","b","c","d","e").toDF("row1")
I would like to do something like:
val appendedDF = df.withColumn("row2",somefunc(list))
appendedDF.show()
// +----+------+
// |row1 |row2 |
// +----+------+
// |a |4 |
// |b |5 |
// |c |10 |
// |d |7 |
// |e |2 |
// +----+------+
I would be grateful for any ideas; my dataframe in reality contains more columns.
You could do it like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// create rdd from the list
val rdd = sc.parallelize(List(4,5,10,7,2))
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:28
// zip the data frame with rdd
val rdd_new = df.rdd.zip(rdd).map(r => Row.fromSeq(r._1.toSeq ++ Seq(r._2)))
// rdd_new: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[33] at map at <console>:32
// create a new data frame from the rdd_new with modified schema
spark.createDataFrame(rdd_new, df.schema.add("new_col", IntegerType)).show
+----+-------+
|row1|new_col|
+----+-------+
| a| 4|
| b| 5|
| c| 10|
| d| 7|
| e| 2|
+----+-------+
Adding for completeness: the fact that the input list (which exists in driver memory) has the same size as the DataFrame suggests that this is a small DataFrame to begin with - so you might consider collect()-ing it, zipping with list, and converting back into a DataFrame if needed:
df.collect()
  .map(_.getAs[String]("row1"))
  .zip(list).toList
  .toDF("row1", "row2")
That won't be faster, but if the data is really small it might be negligible and the code is (arguably) clearer.
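Another option, if collecting is not desirable, is to give both sides an index and join: the list via zipWithIndex, the DataFrame via row_number over a window. A sketch under that assumption; a window without partitionBy funnels all rows through one partition, so this also only suits modest sizes:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// 0-based index over the DataFrame rows in their current order
val dfIdx = df.withColumn("idx",
  row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)

// index the list the same way and join the two on idx
// (spark.implicits._ assumed in scope for toDF, as elsewhere in this thread)
val listDf = list.zipWithIndex.map { case (v, i) => (i, v) }.toDF("idx", "row2")
dfIdx.join(listDf, "idx").drop("idx").show()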