How can I nullify spark dataframe column [duplicate] - scala

This question already has answers here:
Create new Dataframe with empty/null field values
(2 answers)
Closed 3 years ago.
I am working in the Scala programming language. I want to nullify an entire column of a DataFrame.
If that is not possible, then I would at least like to set it to an empty string.
What is an efficient way to do either of the above?
Note: I don't want to add a new column; I want to manipulate an existing column.
Thanks

You can directly use .withColumn with the same column name and Spark replaces the column.
import org.apache.spark.sql.functions._
val df = Seq(("1","a"),("2","b")).toDF("id","name")
df.show()
//+---+----+
//|id |name|
//+---+----+
//|1  |a   |
//|2  |b   |
//+---+----+
val df1 = df.withColumn("id", lit(null)) //to keep null values in the id column
df1.show()
//+----+----+
//|id  |name|
//+----+----+
//|null|a   |
//|null|b   |
//+----+----+
val df2 = df.withColumn("id", lit("")) //to keep empty string "" values in the id column
df2.show()
//+---+----+
//|id |name|
//+---+----+
//|   |a   |
//|   |b   |
//+---+----+
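Note that lit(null) on its own gives the replaced column the NullType data type. If you want to nullify the column but keep its original type (for example, to write the result out against a fixed schema), you can cast the literal. A small sketch, assuming the id column should stay a string:
import org.apache.spark.sql.types.StringType

// Replace every value with null while keeping the column typed as string
val df3 = df.withColumn("id", lit(null).cast(StringType))
df3.printSchema()
//root
// |-- id: string (nullable = true)
// |-- name: string (nullable = true)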

Related

not able to get nested json value as column

I'm trying to create schema for json, and see it as columns in dataframe
Input json
{"place":{"place_name":"NYC","lon":0,"lat":0,"place_id":1009}, "region":{"region_issues":[{"key":"health","issue_name":"Cancer"},{"key":"sports","issue_name":"swimming"}}}
code
import org.apache.spark.sql.types._

val schemaRsvp = new StructType()
  .add("place", StructType(Array(
    StructField("place_name", DataTypes.StringType),
    StructField("lon", DataTypes.IntegerType),
    StructField("lat", DataTypes.IntegerType),
    StructField("place_id", DataTypes.IntegerType))))
val ip = spark.read.schema(schemaRsvp).json("D:\\Data\\rsvp\\inputrsvp.json")
ip.show()
It's showing all the fields in a single column, place; I want the values column-wise:
place_name,lon,lat,place_id
NYC,0,0,1009
Any suggestions on how to fix this?
You can expand a struct column into individual columns using ".*":
ip.select("place.*").show()
+----------+---+---+--------+
|place_name|lon|lat|place_id|
+----------+---+---+--------+
| NYC| 0| 0| 1009|
+----------+---+---+--------+
UPDATE:
With the new array column you can explode your data and then use the same ".*" to turn the struct fields into columns:
import org.apache.spark.sql.functions._

ip.select(col("place"), explode(col("region.region_issues")).as("region_issues"))
  .select("place.*", "region_issues.*").show(false)
+---+---+--------+----------+----------+------+
|lat|lon|place_id|place_name|issue_name|key |
+---+---+--------+----------+----------+------+
|0 |0 |1009 |NYC |Cancer |health|
|0 |0 |1009 |NYC |swimming |sports|
+---+---+--------+----------+----------+------+
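Note that the exploded example assumes the region field is also available, either by letting Spark infer the schema or by extending the explicit one. A sketch of the extended schema, with the field names taken from the sample JSON above:
import org.apache.spark.sql.types._

val schemaFull = new StructType()
  .add("place", new StructType()
    .add("place_name", StringType)
    .add("lon", IntegerType)
    .add("lat", IntegerType)
    .add("place_id", IntegerType))
  .add("region", new StructType()
    .add("region_issues", ArrayType(new StructType()
      .add("key", StringType)
      .add("issue_name", StringType))))
val ip = spark.read.schema(schemaFull).json("D:\\Data\\rsvp\\inputrsvp.json")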

Using rlike with list to create new df scala

Just started with Scala two days ago.
Here's the thing: I have a df and a list. The df contains two columns, paragraphs and authors; the list contains words (strings). For each word on the list, I need the count of paragraphs in which it appears, per author.
So far my idea was to loop over the list, query the df using rlike for each word, and build a new df from that, but even if this works I wouldn't know how to do it. Any help is appreciated!
Edit: Adding example data and expected output
// Example df and list
val df = Seq(("auth1", "some text word1"), ("auth2","some text word2"),("auth3", "more text word1").toDF("a","t")
df.show
+-------+---------------+
| a| t|
+-------+---------------+
|auth1 |some text word1|
|auth2 |some text word2|
|auth1 |more text word1|
+-------+---------------+
val list = List("word1", "word2")
// Expected output
newDF.show
+-------+-----+----------+
| word| a|text count|
+-------+-----+----------+
|word1 |auth1| 2|
|word2 |auth2| 1|
+-------+-----+----------+
You can do a filter and aggregation for each word in the list, and combine all the resulting dataframes using unionAll:
import org.apache.spark.sql.functions._

val result = list.map(word =>
    df.filter(df("t").rlike(s"\\b${word}\\b"))
      .groupBy("a")
      .agg(lit(word).as("word"), count(lit(1)).as("text count"))
  ).reduce(_ unionAll _)
result.show
+-----+-----+----------+
|    a| word|text count|
+-----+-----+----------+
|auth1|word1|         2|
|auth2|word2|         1|
+-----+-----+----------+
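As a side note, unionAll is deprecated since Spark 2.0 in favour of union (same behaviour), and the map/reduce approach scans the dataframe once per word. A one-pass alternative, as a sketch: attach every word in the list to every row with explode, keep the rows whose text matches that word, and aggregate once.
import org.apache.spark.sql.functions._

val result2 = df
  .withColumn("word", explode(array(list.map(lit): _*))) // one row per (row, word) pair
  .where(expr("t rlike concat('\\\\b', word, '\\\\b')"))  // keep rows whose text contains the word
  .groupBy("word", "a")
  .agg(count(lit(1)).as("text count"))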

How to convert multiple rows of a Dataframe into a single row in Scala (Using Dataframe APIs) without using a SQL? [duplicate]

This question already has answers here:
Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function
(10 answers)
Closed 3 years ago.
I have a dataframe nameDF as below:
scala> val nameDF = Seq(("John","A"), ("John","B"), ("John","C"), ("John","D"), ("Bravo","E"), ("Bravo","F"), ("Bravo","G")).toDF("Name","Init")
nameDF: org.apache.spark.sql.DataFrame = [Name: string, Init: string]
scala> nameDF.show
+------+----+
|Name |Init|
+------+----+
|Johnny| A|
|Johnny| B|
|Johnny| C|
|Johnny| D|
|Bravo | E|
|Bravo | F|
|Bravo | G|
+------+----+
Without using SQL, I am trying to group the names and convert the multiple rows of each "Name" into a single row as given below:
+------+-------+
|Name |Init |
+------+-------+
|Johnny|A,B,C,D|
|Bravo |E,F,G |
+------+-------+
I see that the available pivot options are not suitable for String operations.
Is pivot the correct option in this case? If not, could anyone let me know how I can achieve this?
Try this:
import org.apache.spark.sql.functions._
df.groupBy($"Name")
.agg(concat_ws(",", sort_array(collect_list($"Init"))).as("Init"))
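Appending .show() gives, on the sample data (row order of a groupBy result is not guaranteed):
nameDF.groupBy($"Name")
  .agg(concat_ws(",", sort_array(collect_list($"Init"))).as("Init"))
  .show()
//+------+-------+
//|  Name|   Init|
//+------+-------+
//|Johnny|A,B,C,D|
//| Bravo|  E,F,G|
//+------+-------+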

How can I make a Dataframe in Spark from a String instead of a file? [duplicate]

This question already has answers here:
Can I read a CSV represented as a string into Apache Spark using spark-csv
(3 answers)
Closed 3 years ago.
At the moment, I am making a dataframe from a tab separated file with a header, like this.
val df = sqlContext.read.format("csv")
.option("header", "true")
.option("delimiter", "\t")
.option("inferSchema","true").load(pathToFile)
I want to do exactly the same thing but with a String instead of a file. How can I do that?
To the best of my knowledge, there is no built-in way to build a DataFrame from a string. Yet, for prototyping purposes, you can create a DataFrame from a Seq of tuples.
You can use that to your advantage to create a DataFrame from a string.
scala> val s ="x,y,z\n1,2,3\n4,5,6\n7,8,9"
s: String =
x,y,z
1,2,3
4,5,6
7,8,9
scala> val data = s.split('\n')
// Then we extract the first element to use it as a header.
scala> val header = data.head.split(',')
scala> val df = data.tail.toSeq
         // converting the seq of strings to a DF with only one column
         .toDF("X")
         // splitting the string
         .select(split('X, ",") as "X")
         // extracting each column from the array and renaming them
         .select(header.indices.map(i => 'X.getItem(i).as(header(i))): _*)
scala> df.show
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
PS: if you are not in the Spark REPL, make sure to add import spark.implicits._ in order to use toDF().
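On Spark 2.2 and later there is also a more direct route: DataFrameReader.csv accepts a Dataset[String], so you can feed it the lines of your string and keep the same header/delimiter options. A sketch, assuming a tab-separated string with a header line:
import spark.implicits._

val content = "x\ty\tz\n1\t2\t3\n4\t5\t6" // sample tab-separated content with a header
val lines = spark.createDataset(content.split('\n').toSeq)
val df = spark.read
  .option("header", "true")
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .csv(lines)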

remove duplicate column from dataframe using scala

I need to remove one of two columns in a DataFrame that have the same name. I need to remove only one of them and keep the other for further use.
For example, given this input DF:
sno | age | psk | psk
---------------------
1 | 12 | a4 | a4
I would like to obtain this output DF:
sno | age | psk
----------------
1 | 12 | a4
Going through an RDD is one way (but you need to know the column index of the duplicate column so you can drop it before converting back to a DataFrame).
If you have dataframe with duplicate columns as
+---+---+---+---+
|sno|age|psk|psk|
+---+---+---+---+
|1 |12 |a4 |a4 |
+---+---+---+---+
You know that the last two columns are duplicates.
The next step is to take the column names with the duplicates removed and build the schema from them:
import org.apache.spark.sql.types._

val columns = df.columns.distinct // distinct keeps the original column order (toSet would not guarantee it)
val schema = StructType(columns.map(name => StructField(name, StringType, true)))
The vital part is to convert the DataFrame to an RDD and drop the duplicated column index (here it is the 4th, index 3):
import org.apache.spark.sql.Row

val rdd = df.rdd.map(row => Row.fromSeq(Seq(row(0).toString, row(1).toString, row(2).toString)))
The final step is to convert the RDD back to a DataFrame using the schema:
sqlContext.createDataFrame(rdd, schema).show(false)
which should give you
+---+---+---+
|sno|age|psk|
+---+---+---+
|1 |12 |a4 |
+---+---+---+
I hope the answer is helpful
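If you do not want to go through an RDD at all, a simpler alternative (a sketch, with a made-up name for the duplicate) is to rename all columns positionally with toDF, which makes the duplicate addressable, and then drop it:
val deduped = df
  .toDF("sno", "age", "psk", "psk_dup") // positional rename; "psk_dup" is just a placeholder name
  .drop("psk_dup")

deduped.show(false)
//+---+---+---+
//|sno|age|psk|
//+---+---+---+
//|1  |12 |a4 |
//+---+---+---+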