Add leading zeros to Columns in a Spark Data Frame [duplicate] - scala

This question already has answers here: Prepend zeros to a value in PySpark (2 answers). Closed 4 years ago.
In short, I'm using spark-xml to parse some XML files. However, the parsing strips the leading zeros from all the values I'm interested in, and I need the final output, which is a DataFrame, to include those leading zeros. I can't figure out a way to add leading zeros back to the columns I'm interested in.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "output")
.option("excludeAttribute", true)
.option("allowNumericLeadingZeros", true) //including this does not solve the problem
.load("pathToXmlFile")
Example output that I'm getting
+------+---+--------------------+
|iD |val|Code |
+------+---+--------------------+
|1 |44 |9022070536692784476 |
|2 |66 |-5138930048185086175|
|3 |25 |805582856291361761 |
|4 |17 |-9107885086776983000|
|5 |18 |1993794295881733178 |
|6 |31 |-2867434050463300064|
|7 |88 |-4692317993930338046|
|8 |44 |-4039776869915039812|
|9 |20 |-5786627276152563542|
|10 |12 |7614363703260494022 |
+------+---+--------------------+
Desired output
+--------+----+--------------------+
|iD |val |Code |
+--------+----+--------------------+
|001 |044 |9022070536692784476 |
|002 |066 |-5138930048185086175|
|003 |025 |805582856291361761 |
|004 |017 |-9107885086776983000|
|005 |018 |1993794295881733178 |
|006 |031 |-2867434050463300064|
|007 |088 |-4692317993930338046|
|008 |044 |-4039776869915039812|
|009 |020 |-5786627276152563542|
|0010 |012 |7614363703260494022 |
+--------+----+--------------------+

This solved it for me; thank you all for the help:
import org.apache.spark.sql.functions.format_string
import spark.implicits._ // needed for the $"iD" syntax
val df2 = df.withColumn("idLong", format_string("%03d", $"iD"))

You can simply do that by using the inbuilt concat function:
import org.apache.spark.sql.functions.{concat, lit, col}
df.withColumn("iD", concat(lit("00"), col("iD")))
  .withColumn("val", concat(lit("0"), col("val")))

Related

Scala -- apply a custom if-then on a dataframe

I have this kind of dataset:
val cols = Seq("col_1", "col_2")
val data = List(
  ("a", 1),
  ("b", 1),
  ("a", 2),
  ("c", 3),
  ("a", 3))
val df = spark.createDataFrame(data).toDF(cols: _*)
+-----+-----+
|col_1|col_2|
+-----+-----+
|a |1 |
|b |1 |
|a |2 |
|c |3 |
|a |3 |
+-----+-----+
I want to add an if-then column based on the existing columns.
df
.withColumn("col_new",
when(col("col_2").isin(2, 5), "str_1")
.when(col("col_2").isin(4, 6), "str_2")
.when(col("col_2").isin(1) && col("col_1").contains("a"), "str_3")
.when(col("col_2").isin(3) && col("col_1").contains("b"), "str_1")
.when(col("col_2").isin(1,2,3), "str_4")
.otherwise(lit("other")))
Instead of the list of when-then statements, I would prefer to apply a custom function. In Python I would run a lambda & map.
thank you!
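For reference, one way to do that in Scala (a sketch of my own, not from the thread) is to move the branching into a plain function and wrap it in a UDF. Keep in mind that a UDF is opaque to the Catalyst optimizer, so the built-in when/otherwise chain is usually preferable for performance:

import org.apache.spark.sql.functions.{col, udf}

// plain Scala function holding the if-then rules from the question
val classify: (String, Int) => String = (c1, c2) =>
  if (Seq(2, 5).contains(c2)) "str_1"
  else if (Seq(4, 6).contains(c2)) "str_2"
  else if (c2 == 1 && c1.contains("a")) "str_3"
  else if (c2 == 3 && c1.contains("b")) "str_1"
  else if (Seq(1, 2, 3).contains(c2)) "str_4"
  else "other"

val classifyUdf = udf(classify)

val result = df.withColumn("col_new", classifyUdf(col("col_1"), col("col_2")))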

How to get the two nearest values in spark scala DataFrame

Hi everyone, I'm new to Spark Scala. I want to find the nearest values by partition using Spark Scala. My input is something like this:
For the first row, for example: value1 (= 3) lies between 2 and 7 in the value2 column.
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |1 |
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |2 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
|3 |5 |7 |
|3 |5 |8 |
My output should look like this:
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
Can someone guide me on how to resolve this, please?
Since you appear to want to learn, instead of providing a code answer I've provided pseudo-code and references to allow you to find the answer for yourself.
1. Group the elements (select id, value1) and aggregate on value2 with collect_list, so that all the value2 values for a group are collected into an array.
2. Append (concat) value1 to the collected array and sort the array.
3. Find (array_position) value1 in the sorted array.
4. Slice the array, retrieving the value before and the value after the position returned by array_position.
5. If the array has fewer than 3 elements, do error handling.
6. The last value and the first value of that slice are your 'closest numbers'.
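For comparison, here is a minimal sketch of my own (not from the thread) that gets the same "nearest value below and nearest value above value1" result with plain filters and aggregations instead of the array recipe; the column names are taken from the question:

import org.apache.spark.sql.functions.{col, max, min}

// nearest value2 at or below value1, and nearest at or above, per (id, value1) group
val below = df.filter(col("value2") <= col("value1"))
  .groupBy("id", "value1").agg(max("value2").as("value2"))
val above = df.filter(col("value2") >= col("value1"))
  .groupBy("id", "value1").agg(min("value2").as("value2"))

val nearest = below.unionByName(above).orderBy("id", "value2")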
You will need window functions for this.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window
  .partitionBy("id", "value1")
  .orderBy(asc("value2"))

// the aggregate needs an unordered window, otherwise min() is only a running minimum
val groupWindow = Window.partitionBy("id", "value1")

val result = df
  .withColumn("prev", lag("value2", 1).over(window))
  .withColumn("next", lead("value2", 1).over(window))
  .withColumn("dist_prev", col("value2").minus(col("prev")))
  .withColumn("dist_next", col("next").minus(col("value2")))
  .withColumn("min", min(col("dist_prev")).over(groupWindow))
  .filter(col("dist_prev") === col("min") || col("dist_next") === col("min"))
  .drop("prev", "next", "dist_prev", "dist_next", "min")
I haven't tested it, so think about it more as an illustration of the concept than a working ready-to-use example.
Here is what's going on here:
First, create a window that describes your grouping rule: we want the rows grouped by the first two columns, and sorted by the third one within each group.
Next, add prev and next columns to the dataframe that contain the value of value2 column from previous and next row within the group respectively. (prev will be null for the first row in the group, and next will be null for the last row – that is ok).
Add dist_prev and dist_next to contain the distance between value2 and prev and next value respectively. (Note that dist_prev for each row will have the same value as dist_next for the previous row).
Find the minimum value for dist_prev within each group, and add it as min column (note, that the minimum value for dist_next is the same by construction, so we only need one column here).
Filter the rows, selecting those that have the minimum value in either dist_next or dist_prev. This finds the tightest pair unless there are multiple rows with the same distance from each other – this case was not accounted for in your question, so we don't know what kind of behavior you want in this case. This implementation will simply return all of these rows.
Finally, drop all extra columns that were added to the dataframe to return it to its original shape.

Join with uneven columns

I have two dataframes structured the following way:
|Source|#Users|#Clicks|Hour|Type
and
Type|Total # Users|Hour
I'd like to join these dataframes based on Hour; however, the first dataframe is at a deeper granularity than the second and therefore has more rows. Basically I want a dataframe where I have
|Source|#Users|#Clicks|Hour|Type|Total # Users
where the total # users is from the second dataframe. Any suggestions? I think I maybe want to use map?
Edit:
Here's an example
DF1
|Source|#Users|#Clicks|Hour|Type
|Prod1 |50 |3 |01 |Internet
|Prod2 |10 |2 |07 |iOS
|Prod3 |1 |50 |07 |Internet
|Prod2 |3 |2 |07 |Internet
|Prod3 |8 |2 |05 |Internet
DF2
|Type |Total #Users|Hour
|Internet|100 |01
|iOS |500 |01
|Internet|300 |07
|Internet|15 |05
|iOS |20 |07
Result
|Source|#Users|#Clicks|Hour|Type |Total #Users
|Prod1 |50 |3 |01 |Internet|100
|Prod2 |10 |2 |07 |iOS |20
|Prod3 |1 |50 |07 |Internet|300
|Prod2 |3 |2 |07 |Internet|300
|Prod3 |8 |2 |05 |Internet|15
That's a left join you're trying to do:
df1.join(df2, df1("Hour") === df2("Hour") && df1("Type") === df2("Type"), "left_outer")
Short version: a left join keeps all the rows from df1 and joins them with the matching rows of df2 according to the condition (nulls if there is no match, duplicates if there are multiple matches).
More info on Pyspark join
More info on SQL Joins types
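As a side note (my addition, not part of the answer above), joining on a Seq of the shared column names keeps only one copy of Hour and Type in the result, which avoids ambiguous column references later:

// left join on the shared column names; Hour and Type appear only once in the output
val result = df1.join(df2, Seq("Hour", "Type"), "left_outer")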

Spark scala join RDD between 2 datasets

Suppose I have two datasets as follows:
Dataset 1:
id, name, score
1, Bill, 200
2, Bew, 23
3, Amy, 44
4, Ramond, 68
Dataset 2:
id,message
1, i love Bill
2, i hate Bill
3, Bew go go !
4, Amy is the best
5, Ramond is the wrost
6, Bill go go
7, Bill i love ya
8, Ramond is Bad
9, Amy is great
I want to join the two datasets above and count how many times each person's name from dataset 1 appears in dataset 2. The result should be:
Bill, 4
Ramond, 2
..
..
I managed to join both of them together, but I'm not sure how to count how many times each name appears.
Any suggestion would be appreciated.
Edited:
my join code:
val rdd = sc.textFile("dataset1")
val rdd2 = sc.textFile("dataset2")
val rddPair1 = rdd.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
val rddPair2 = rdd2.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
rddPair1.join(rddPair2).collect().foreach(f =>{
println(f._1+" "+f._2._1+" "+f._2._2)
})
Using RDDs, achieving the solution you desire would be complex; not so much with DataFrames. The first step would be to read your two files into dataframes, as below:
val df1 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset1")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset2")
so that you end up with
df1
+---+------+-----+
|id |name |score|
+---+------+-----+
|1 |Bill |200 |
|2 |Bew |23 |
|3 |Amy |44 |
|4 |Ramond|68 |
+---+------+-----+
df2
+---+-------------------+
|id |message |
+---+-------------------+
|1 |i love Bill |
|2 |i hate Bill |
|3 |Bew go go ! |
|4 |Amy is the best |
|5 |Ramond is the wrost|
|6 |Bill go go |
|7 |Bill i love ya |
|8 |Ramond is Bad |
|9 |Amy is great |
+---+-------------------+
join, groupBy and count should give your desired output:
df1.join(df2, df2("message").contains(df1("name")), "left")
  .groupBy("name")
  .count()
  .show(false)
Final output would be
+------+-----+
|name |count|
+------+-----+
|Ramond|2 |
|Bill |4 |
|Amy |2 |
|Bew |1 |
+------+-----+
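If you also want the names ordered by how often they appear (the "top" names mentioned in the question), a small follow-up of my own is to sort the grouped result before showing it:

import org.apache.spark.sql.functions.desc

// same join and aggregation as above, sorted by descending count
df1.join(df2, df2("message").contains(df1("name")), "left")
  .groupBy("name")
  .count()
  .orderBy(desc("count"))
  .show(false)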

Spark Dataframe Random UUID changes after every transformation/action

I have a Spark dataframe with a column that includes a generated UUID.
However, each time I do an action or transformation on the dataframe, it changes the UUID at each stage.
How do I generate the UUID only once and have it remain static thereafter?
Some sample code to reproduce my issue is below:
import java.util.UUID

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

def process(spark: SparkSession): Unit = {
  import spark.implicits._
  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext
  sc.setLogLevel("OFF")

  // create dataframe
  val df = spark.createDataset(Array(("a", "1"), ("b", "2"), ("c", "3"))).toDF("col1", "col2")
  df.createOrReplaceTempView("df")
  df.show(false)

  // register a UDF that creates a random UUID
  val generateUUID = udf(() => UUID.randomUUID().toString)

  // generate UUID for new column
  val dfWithUuid = df.withColumn("new_uuid", generateUUID())
  dfWithUuid.show(false)
  dfWithUuid.show(false) // uuid is different

  // new transformations also change the uuid
  val dfWithUuidWithNewCol = dfWithUuid.withColumn("col3", df.col("col2") + 1)
  dfWithUuidWithNewCol.show(false)
}
The output is:
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |2 |
|c |3 |
+----+----+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |a414e73b-24b8-4f64-8d21-f0bc56d3d290|
|b |2 |f37935e5-0bfc-4863-b6dc-897662307e0a|
|c |3 |e3aaf655-5a48-45fb-8ab5-22f78cdeaf26|
+----+----+------------------------------------+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |1c6597bf-f257-4e5f-be81-34a0efa0f6be|
|b |2 |6efe4453-29a8-4b7f-9fa1-7982d2670bd6|
|c |3 |2f7ddc1c-3e8c-4118-8e2c-8a6f526bee7e|
+----+----+------------------------------------+
+----+----+------------------------------------+----+
|col1|col2|new_uuid |col3|
+----+----+------------------------------------+----+
|a |1 |00b85af8-711e-4b59-82e1-8d8e59d4c512|2.0 |
|b |2 |94c3f2c6-9234-4fb3-b1c4-273a37171131|3.0 |
|c |3 |1059fff2-b8f9-4cec-907d-ea181d5003a2|4.0 |
+----+----+------------------------------------+----+
Note that the UUID is different at each step.
This is expected behavior. User-defined functions have to be deterministic:
The user-defined functions must be deterministic. Due to optimization,
duplicate invocations may be eliminated or the function may even be
invoked more times than it is present in the query.
If you want to include a non-deterministic function and preserve the output, you should write intermediate data to persistent storage and read it back. Checkpointing or caching may work in some simple cases, but it won't be reliable in general.
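As a minimal sketch of that write-and-read-back idea (my addition; the parquet path is hypothetical):

// materialise the generated UUIDs once, then continue from the re-read copy
dfWithUuid.write.mode("overwrite").parquet("/tmp/df_with_uuid") // hypothetical path
val dfStable = spark.read.parquet("/tmp/df_with_uuid")
dfStable.show(false) // the UUID values no longer change between actions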
If the upstream process is deterministic (shuffles, for starters, can break that) you could try to use the rand function with a seed, convert the value to a byte array, and pass it to UUID.nameUUIDFromBytes.
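A rough sketch of that idea (my own, and only as stable as the seeded rand is under your particular plan):

import java.util.UUID
import org.apache.spark.sql.functions.{rand, udf}

// derive a name-based UUID from a seeded random value
val toUuid = udf((x: Double) => UUID.nameUUIDFromBytes(x.toString.getBytes("UTF-8")).toString)
val dfSeeded = df.withColumn("new_uuid", toUuid(rand(42L)))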
See also: About how to add a new column to an existing DataFrame with random values in Scala
Note: SPARK-20586 introduced a deterministic flag, which can disable certain optimizations, but it is not clear how it behaves when data is persisted and an executor is subsequently lost.
This is a very old question, but I'm letting people know what worked for me; it might help someone.
You can use the expr function as below to generate unique GUIDs that do not change on transformations.
import org.apache.spark.sql.functions._
// create dataframe
val df = spark.createDataset(Array(("a", "1"), ("b", "2"), ("c", "3"))).toDF("col1", "col2")
df.createOrReplaceTempView("df")
df.show(false)
// generate UUID for new column
val dfWithUuid = df.withColumn("new_uuid", expr("uuid()"))
dfWithUuid.show(false)
dfWithUuid.show(false)
// new transformations
val dfWithUuidWithNewCol = dfWithUuid.withColumn("col3", df.col("col2")+1)
dfWithUuidWithNewCol.show(false)
Output is as below:
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |2 |
|c |3 |
+----+----+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|
+----+----+------------------------------------+
+----+----+------------------------------------+
|col1|col2|new_uuid |
+----+----+------------------------------------+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|
+----+----+------------------------------------+
+----+----+------------------------------------+----+
|col1|col2|new_uuid |col3|
+----+----+------------------------------------+----+
|a |1 |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|2.0 |
|b |2 |43882a79-8e7f-4002-9740-f22bc6b20db5|3.0 |
|c |3 |64bc741a-0d7c-430d-bfe2-a4838f10acd0|4.0 |
+----+----+------------------------------------+----+
I have a pyspark version:
from pyspark.sql import functions as f

pdataDF = dataDF.withColumn("uuid_column", f.expr("uuid()"))
display(pdataDF)  # display() is available in Databricks notebooks
pdataDF.write.mode("overwrite").saveAsTable("tempUuidCheck")
Try this one:
df.withColumn("XXXID", lit(java.util.UUID.randomUUID().toString))
It works differently from:
val generateUUID = udf(() => java.util.UUID.randomUUID().toString)
df.withColumn("XXXCID", generateUUID())
because lit(...) evaluates randomUUID() once on the driver, so every row gets the same constant value (which therefore stays stable across actions), while the UDF produces a new UUID per row on every evaluation.
I hope this helps.
Pawel