Convert every value of a dataframe - scala

I need to modify the values of every column of a dataframe so that they are all enclosed in double quotes after mapping, while the dataframe still retains its original structure and headers.
I tried mapping the values by converting each row to a sequence, but the output dataframe loses its headers.
With this read in as the input dataframe:
+------+-------+----+
|prodid|name   |city|
+------+-------+----+
|1     |Harshit|VNS |
|2     |Mohit  |BLR |
|2     |Mohit  |RAO |
|2     |Mohit  |BTR |
|3     |Rohit  |BOM |
|4     |Shobhit|KLK |
+------+-------+----+
I tried the following code:
val columns = df.columns
df.map { row =>
  row.toSeq.map { col => "\"" + col + "\"" }
}.toDF(columns: _*)
But it throws an error stating that the mapped dataframe has only one column, named value.
This is the actual result (if I remove .toDF(columns: _*)):
+--------------------+
|               value|
+--------------------+
|["1", "Harshit", ...|
|["2", "Mohit", "B...|
|["2", "Mohit", "R...|
|["2", "Mohit", "B...|
|["3", "Rohit", "B...|
|["4", "Shobhit", ...|
+--------------------+
And my expected result is something like:
+------+---------+------+
|prodid|name     |city  |
+------+---------+------+
|"1"   |"Harshit"|"VNS" |
|"2"   |"Mohit"  |"BLR" |
|"2"   |"Mohit"  |"RAO" |
|"2"   |"Mohit"  |"BTR" |
|"3"   |"Rohit"  |"BOM" |
|"4"   |"Shobhit"|"KLK" |
+------+---------+------+
Note: There are only 3 headers in this example, but my original data has many headers, so manually typing each one is not an option, especially if the file headers change. How do I get this modified dataframe with the quoted values?
Edit: I need the quotes on all values except the integers, so the output should be something like:
+------+---------+------+
|prodid|name     |city  |
+------+---------+------+
|1     |"Harshit"|"VNS" |
|2     |"Mohit"  |"BLR" |
|2     |"Mohit"  |"RAO" |
|2     |"Mohit"  |"BTR" |
|3     |"Rohit"  |"BOM" |
|4     |"Shobhit"|"KLK" |
+------+---------+------+

Might be easier to use select instead:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = Seq((1, "Harshit", "VNS"), (2, "Mohit", "BLR"))
  .toDF("prodid", "name", "city")

df.select(df.schema.fields.map {
  case StructField(name, IntegerType, _, _) => col(name)
  case StructField(name, _, _, _) => format_string("\"%s\"", col(name)) as name
}: _*).show()
Output:
+------+---------+-----+
|prodid| name| city|
+------+---------+-----+
| 1|"Harshit"|"VNS"|
| 2| "Mohit"|"BLR"|
+------+---------+-----+
Note that there are other numeric types as well, such as LongType and DoubleType, so you might need to handle those too, or alternatively only quote the StringType columns.
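For the latter, a minimal sketch (assuming the same df as above, quoting only string columns and passing every other column through untouched) could look like:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Quote only StringType columns; leave columns of every other type unchanged.
val quoted = df.select(df.schema.fields.map {
  case StructField(name, StringType, _, _) => format_string("\"%s\"", col(name)) as name
  case StructField(name, _, _, _) => col(name)
}: _*)

quoted.show()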

Related

how to count field with condition by spark

I have a dataframe with an enum field named A (values are 0 or 1) and another field B. I would like to implement the scenario below:
if `B` is null:
  count(when `A` is 0) and set a column name `xx`
  count(when `A` is 1) and set a column name `yy`
if `B` is not null:
  count(when `A` is 0) and set a column name `zz`
  count(when `A` is 1) and set a column name `mm`
How can I do this with Spark in Scala?
It's possible to conditionally populate columns in this way; however, the final output DataFrame requires an expected schema.
Assuming all of the scenarios you detailed are possible in one DataFrame, I would suggest creating each of the four columns: "xx", "yy", "zz" and "mm" and conditionally populating them.
In the below example I've populated the values with either "found" or "", primarily to make it easy to see where the values are populated. Using true and false here, or another enum, would likely make more sense in the real world.
Starting with a DataFrame (since you didn't specify the type of "B", I have gone for an Option[String] (nullable) for this example):
val df = List(
  (0, None),
  (1, None),
  (0, Some("hello")),
  (1, Some("world"))
).toDF("A", "B")
df.show(false)
gives:
+---+-----+
|A |B |
+---+-----+
|0 |null |
|1 |null |
|0 |hello|
|1 |world|
+---+-----+
and to create the columns:
df
  .withColumn("xx", when(col("B").isNull && col("A") === 0, "found").otherwise(""))
  .withColumn("yy", when(col("B").isNull && col("A") === 1, "found").otherwise(""))
  .withColumn("zz", when(col("B").isNotNull && col("A") === 0, "found").otherwise(""))
  .withColumn("mm", when(col("B").isNotNull && col("A") === 1, "found").otherwise(""))
  .show(false)
gives:
+---+-----+-----+-----+-----+-----+
|A |B |xx |yy |zz |mm |
+---+-----+-----+-----+-----+-----+
|0 |null |found| | | |
|1 |null | |found| | |
|0 |hello| | |found| |
|1 |world| | | |found|
+---+-----+-----+-----+-----+-----+
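If what you ultimately need is the counts themselves rather than per-row flags, a minimal follow-up sketch (an assumption about your end goal, using the same df and conditions) is to aggregate with count over a when that is null whenever the condition fails:
import org.apache.spark.sql.functions._

// One row with four counts, matching the xx/yy/zz/mm conditions above.
df.agg(
  count(when(col("B").isNull && col("A") === 0, true)).as("xx"),
  count(when(col("B").isNull && col("A") === 1, true)).as("yy"),
  count(when(col("B").isNotNull && col("A") === 0, true)).as("zz"),
  count(when(col("B").isNotNull && col("A") === 1, true)).as("mm")
).show(false)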

Counting distinct values for a given column partitioned by a window function, without using approx_count_distinct()

I have the following dataframe:
val df1 = Seq(("Roger","Rabbit", "ABC123"), ("Roger","Rabit", "ABC123"),("Roger","Rabbit", "ABC123"), ("Trevor","Philips","XYZ987"), ("Trevor","Philips","XYZ987")).toDF("first_name", "last_name", "record")
+----------+---------+------+
|first_name|last_name|record|
+----------+---------+------+
|Roger |Rabbit |ABC123|
|Roger |Rabit |ABC123|
|Roger |Rabbit |ABC123|
|Trevor |Philips |XYZ987|
|Trevor |Philips |XYZ987|
+----------+---------+------+
I want to group the records in this dataframe by the column record, and then look for anomalies in the fields first_name and last_name, which should remain constant for all records with the same record value.
The best approach I found so far is using approx_count_distinct:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val wind_person = Window.partitionBy("record")
df1.withColumn("unique_fields", concat($"first_name", $"last_name"))
  .withColumn("anomaly", approx_count_distinct($"unique_fields") over wind_person)
  .show(false)
+----------+---------+------+-------------+-------+
|first_name|last_name|record|unique_fields|anomaly|
+----------+---------+------+-------------+-------+
|Roger |Rabbit |ABC123|RogerRabbit |2 |
|Roger |Rabbit |ABC123|RogerRabbit |2 |
|Roger |Rabit |ABC123|RogerRabit |2 |
|Trevor |Philips |XYZ987|TrevorPhilips|1 |
|Trevor |Philips |XYZ987|TrevorPhilips|1 |
+----------+---------+------+-------------+-------+
An anomaly is detected when the anomaly column is greater than 1.
The problem is that with approx_count_distinct we get just an approximation, and I am not sure how confident we can be that it will always return an accurate count.
Some extra information:
The Dataframe may contain over 500M records
The Dataframe is previously repartitioned based on record column
For each different value of record, no more than 15 rows will be there
Is it safe to use approx_count_distinct in this scenario with 100% accuracy, or are there better window functions in Spark to achieve this?
You can use collect_set of unique_fields over the window wind_person and take its size, which is equivalent to the distinct count of that field:
df1.withColumn("unique_fields", concat($"first_name", $"last_name"))
.withColumn("anomaly", size(collect_set($"unique_fields").over(wind_person)))
.show
//+----------+---------+------+-------------+-------+
//|first_name|last_name|record|unique_fields|anomaly|
//+----------+---------+------+-------------+-------+
//|Roger |Rabbit |ABC123|RogerRabbit |2 |
//|Roger |Rabit |ABC123|RogerRabit |2 |
//|Roger |Rabbit |ABC123|RogerRabbit |2 |
//|Trevor |Philips |XYZ987|TrevorPhilips|1 |
//|Trevor |Philips |XYZ987|TrevorPhilips|1 |
//+----------+---------+------+-------------+-------+
You can get the exact countDistinct over a Window using some dense_rank operations:
val df2 = df1.withColumn(
  "unique_fields",
  concat($"first_name", $"last_name")
).withColumn(
  "anomaly",
  dense_rank().over(Window.partitionBy("record").orderBy("unique_fields")) +
    dense_rank().over(Window.partitionBy("record").orderBy(desc("unique_fields"))) - 1
)

df2.show
+----------+---------+------+-------------+-------+
|first_name|last_name|record|unique_fields|anomaly|
+----------+---------+------+-------------+-------+
| Roger| Rabit|ABC123| RogerRabit| 2|
| Roger| Rabbit|ABC123| RogerRabbit| 2|
| Roger| Rabbit|ABC123| RogerRabbit| 2|
| Trevor| Philips|XYZ987|TrevorPhilips| 1|
| Trevor| Philips|XYZ987|TrevorPhilips| 1|
+----------+---------+------+-------------+-------+
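Either way, to pull out only the anomalous records, a small follow-up sketch (using the df2 above) is simply a filter on the computed column:
// Keep only rows whose record group has more than one distinct first_name/last_name combination.
df2.filter($"anomaly" > 1).show(false)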

Skip first line of a csv file in scala

I want to read and write a CSV file, ignoring the first line, as the header starts from the second line.
val df =df1.withColumn("index", monotonicallyIncreasingId()).filter(col("index") > 1).drop("index")
This does not resolve my issue.
Left anti join maybe?
header = df.limit(1)
df.join(header, df.columns, 'left_anti').show()
or use rdd filter.
header = df.first()
df.rdd.filter(lambda x: x != header).toDF(df.columns).show()
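The two snippets above are PySpark-flavoured; since the question is in Scala, rough Scala equivalents (a sketch, not tested against your data) would be:
// Left anti join against the first row.
val header = df.limit(1)
df.join(header, df.columns.toSeq, "left_anti").show()

// Or filter the first row out at the RDD level and rebuild the DataFrame.
val first = df.first()
spark.createDataFrame(df.rdd.filter(_ != first), df.schema).show()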
Let me show you my attempt.
This works if the header is present in the second line of the CSV file and we need to ignore the first line.
val df = spark.read.text("/yourFilePath/my.csv")
  .withColumn("row_id", monotonically_increasing_id)

val cols = df.select("value").filter('row_id === 1).first.mkString.split(",")

val df2 = df.filter('row_id > 1)
  .withColumn("temp", split(col("value"), ","))
  .select((0 until cols.length).map(i => col("temp").getItem(i).as(cols(i))): _*)
Before
+------------+------+
| value|row_id|
+------------+------+
| aasadasd| 0|
|name,age,des| 1|
| a,2,dd| 2|
| b,5,ff| 3|
+------------+------+
After
+----+---+---+
|name|age|des|
+----+---+---+
|a |2 |dd |
|b |5 |ff |
+----+---+---+
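If the physical line order matters (monotonically_increasing_id is increasing and unique but not guaranteed to be consecutive across partitions), another sketch (an alternative not in the original answer, assuming Spark 2.2+ so that spark.read.csv accepts a Dataset[String], and the same illustrative path) is to drop the first line at the RDD level and let the CSV reader parse the rest:
import spark.implicits._

// zipWithIndex assigns consecutive indices in line order; drop index 0 (the first physical line).
val lines = spark.sparkContext.textFile("/yourFilePath/my.csv")
  .zipWithIndex()
  .filter { case (_, idx) => idx > 0 }
  .map { case (line, _) => line }

// The remaining first line ("name,age,des") is now the header.
spark.read.option("header", "true").csv(lines.toDS()).show()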

Spark Scala - Need to iterate over column in dataframe

Got the following dataframe:
+---+----------------+
|id |job_title |
+---+----------------+
|1 |ceo |
|2 |product manager |
|3 |surfer |
+---+----------------+
I want to take a column from the dataframe and create another indicator column called 'rank':
+---+----------------+-------+
|id |job_title | rank |
+---+----------------+-------+
|1 |ceo |c-level|
|2 |product manager |manager|
|3 |surfer |other |
+---+----------------+-------+
--- UPDATED ---
What I have tried so far is:
def func(col: Column): Column = {
  val cLevel = List("ceo", "cfo")
  val managerLevel = List("manager", "team leader")
  when(col.contains(cLevel), "C-level")
    .otherwise(when(col.contains(managerLevel), "manager").otherwise("other"))
}
Currently I get this error:
type mismatch;
found : Boolean
required: org.apache.spark.sql.Column
and I think I also have other problems in the code. Sorry, but I'm at a beginner level with Scala on Spark.
You can use the built-in when/otherwise functions for that case:
import org.apache.spark.sql.functions._

def func = when(col("job_title").contains("chief") || col("job_title").contains("ceo"), "c-level")
  .otherwise(when(col("job_title").contains("manager"), "manager")
  .otherwise("other"))
and you can call the function by using withColumn as
df.withColumn("rank", func).show(false)
which should give you
+---+---------------+-------+
|id |job_title |rank |
+---+---------------+-------+
|1 |ceo |c-level|
|2 |product manager|manager|
|3 |surfer |other |
+---+---------------+-------+
I hope the answer is helpful
Updated
I see that you have updated your post with what you tried: you created lists of levels and want to validate against them. For that case you will have to write a udf function, as below:
val cLevel = List("ceo","cfo")
val managerLevel = List("manager","team leader")
import org.apache.spark.sql.functions._
def rankUdf = udf((jobTitle: String) => jobTitle match {
case x if(cLevel.exists(_.contains(x)) || cLevel.exists(x.contains(_))) => "C-Level"
case x if(managerLevel.exists(_.contains(x)) || managerLevel.exists(x.contains(_))) => "manager"
case _ => "other"
})
df.withColumn("rank", rankUdf(col("job_title"))).show(false)
which should give you your desired output
val df = sc.parallelize(Seq(
  (1, "ceo"),
  (2, "product manager"),
  (3, "surfer"),
  (4, "Vaquar khan")
)).toDF("id", "job_title")

df.show()

// option 2
df.createOrReplaceTempView("user_details")
sqlContext.sql("SELECT job_title, RANK() OVER (ORDER BY id) AS rank FROM user_details").show

val df1 = sc.parallelize(Seq(
  ("ceo", "c-level"),
  ("product manager", "manager"),
  ("surfer", "other"),
  ("Vaquar khan", "Problem solver")
)).toDF("job_title", "ranks")

df1.show()

df1.createOrReplaceTempView("user_rank")
sqlContext.sql("SELECT user_details.id, user_details.job_title, user_rank.ranks FROM user_rank JOIN user_details ON user_rank.job_title = user_details.job_title ORDER BY user_details.id").show
Results :
+---+---------------+
| id| job_title|
+---+---------------+
| 1| ceo|
| 2|product manager|
| 3| surfer|
| 4| Vaquar khan|
+---+---------------+
+---------------+----+
| job_title|rank|
+---------------+----+
| ceo| 1|
|product manager| 2|
| surfer| 3|
| Vaquar khan| 4|
+---------------+----+
+---------------+--------------+
| job_title| ranks|
+---------------+--------------+
| ceo| c-level|
|product manager| manager|
| surfer| other|
| Vaquar khan|Problem solver|
+---------------+--------------+
+---+---------------+--------------+
| id| job_title| ranks|
+---+---------------+--------------+
| 1| ceo| c-level|
| 2|product manager| manager|
| 3| surfer| other|
| 4| Vaquar khan|Problem solver|
+---+---------------+--------------+
df: org.apache.spark.sql.DataFrame = [id: int, job_title: string]
df1: org.apache.spark.sql.DataFrame = [job_title: string, ranks: string]
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Replace the value of one column from another column in spark dataframe

I have a dataframe like below
+---+------------+----------------------------------------------------------------------+
|id |indexes |arrayString |
+---+------------+----------------------------------------------------------------------+
|2 |1,3 |[WrappedArray(3, Str3), WrappedArray(1, Str1)] |
|1 |2,4,3 |[WrappedArray(2, Str2), WrappedArray(3, Str3), WrappedArray(4, Str4)] |
|0 |1,2,3 |[WrappedArray(1, Str1), WrappedArray(2, Str2), WrappedArray(3, Str3)] |
+---+------------+----------------------------------------------------------------------+
I want to loop through arrayString, taking the first element of each inner array as an index and the second as a string, and then replace each index in indexes with the string corresponding to that index in arrayString. I want an output like below.
+---+---------------+
|id |replacedString |
+---+---------------+
|2 |Str1,Str3 |
|1 |Str2,Str4,Str3 |
|0 |Str1,Str2,Str3 |
+---+---------------+
I tried using the below udf function.
val replaceIndex = udf((itemIndex: String, arrayString: Seq[Seq[String]]) => {
  val itemIndexArray = itemIndex.split("\\,")
  arrayString.map(i => {
    itemIndexArray.updated(i(0).toInt, i(1))
  })
  itemIndexArray
})
This is giving me an error and I am not getting my desired output. Is there any other way to achieve this? I can't use explode and join, as I want the indexes replaced with strings without losing the order.
You can create a udf as below to get the required result: convert the array of arrays to a map and look up each index as a key in that map.
val replaceIndex = udf((itemIndex: String, arrayString: Seq[Seq[String]]) => {
  val indexList = itemIndex.split("\\,")
  val array = arrayString.map(x => x(0) -> x(1)).toMap
  indexList map array mkString ","
})

dataframe.withColumn("arrayString", replaceIndex($"indexes", $"arrayString"))
  .show(false)
Output:
+---+-------+--------------+
|id |indexes|arrayString |
+---+-------+--------------+
|2 |1,3 |Str1,Str3 |
|1 |2,4,3 |Str2,Str4,Str3|
|0 |1,2,3 |Str1,Str2,Str3|
+---+-------+--------------+
Hope this helps!
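If you want the exact column name from your expected output (replacedString) instead of overwriting arrayString, a small follow-up sketch on top of the udf above is:
dataframe.withColumn("replacedString", replaceIndex($"indexes", $"arrayString"))
  .select("id", "replacedString")
  .show(false)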