Spark is reading from Cosmos DB, which contains records like:
{
  "answers": [
    {
      "answer": "2005-01-01 00:00",
      "answerDt": "2022-07-01CEST08:07",
      ...,
      "id": {uuid}
    },
    ...
  ]
}
and code that takes those answers and creates a DataFrame where each row is a record from that array:
dataDF
.select(
col("id").as("recordId"),
explode($"answers").as("qa")
)
.select(
col("recordId"),
$"qa.questionText",
col("qa.question").as("q-id"),
$"qa.answerText",
$"qa.answerDt"
)
.withColumn("id", concat_ws("-", col("q-id"), col("recordId")))
.drop(col("q-id"))
At the end I save the result to another collection.
What I need is to add a position number to those records, so that each answer row also has an int value that is unique per recordId, e.g. from 1 to 20:
+---+--------------------+--------------------+----------+-------------------+--------------------+
| lp|            recordId|        questionText|answerText|           answerDt|                  id|
+---+--------------------+--------------------+----------+-------------------+--------------------+
|  1|951a508c-d970-4d2...|Please give me th...|    197...|2022-06-28CEST16:52|123abcde_VB_GEN_Q...|
|  2|951a508c-d970-4d2...|What X should I N...|    female|2022-06-28CEST16:52|123abcde_VB_GEN_Q...|
|  3|951a508c-d970-4d2...|Please Share me t...|     72 kg|2022-06-28CEST16:53|123abcde_VB_GEN_Q...|
|  1|12345678-0987-4d2...|Give me the smth ...|     10 kg|2022-06-28CEST16:53|123abcde_VB_GEN_Q...|
+---+--------------------+--------------------+----------+-------------------+--------------------+
Is this possible? Thanks.
Use a window partitioned by recordId with row_number:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.partitionBy("recordId").orderBy("your col")
val resDF = sourceDF.withColumn("row_num", row_number.over(w))
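Applied to the pipeline above, a minimal sketch; the ordering column is an assumption (answerDt is used here, swap in whatever actually defines the position):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat_ws, explode, row_number}

// Assumption: answerDt defines the position of an answer within a record.
val w = Window.partitionBy("recordId").orderBy(col("answerDt"))

val withPosition = dataDF
  .select(col("id").as("recordId"), explode(col("answers")).as("qa"))
  .select(
    col("recordId"),
    col("qa.questionText"),
    col("qa.question").as("q-id"),
    col("qa.answerText"),
    col("qa.answerDt")
  )
  .withColumn("id", concat_ws("-", col("q-id"), col("recordId")))
  .drop("q-id")
  .withColumn("lp", row_number.over(w)) // 1, 2, 3, ... per recordId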
Hi how's it going? Here are my two dataframes:
val id_df = Seq(("1","gender"),("2","city"),("3","state"),("4","age")).toDF("id","type")
val main_df = Seq(("male","los angeles","null"),("female","new york","new york")).toDF("1","2","3")
I want to check, for every id in id_df that also exists as a column in main_df, whether all of those columns are non-null for a given row. If they are all non-null, we put "true" in the meets-condition column for that row, otherwise we put "false". Notice how id number 4 (age) is not one of main_df's columns, so we just ignore it.
How would I do this?
Thanks so much and have a great day.
Allow me to start with two short observations:
I believe that it would be safer to avoid naming columns with bare numbers. Think of the case where we need to evaluate the expression 1 is not null: it is ambiguous whether we mean the column 1 or the literal value 1.
As far as I am aware, it is not performant to store and process the target column names through a dataframe. That creates overhead which can easily be avoided by using a plain Scala collection, e.g. a Seq, Array, or Set.
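To make the first observation concrete, a small hypothetical illustration (assuming a SparkSession in scope as spark):

import org.apache.spark.sql.functions.expr
import spark.implicits._

// With a column literally named "1", the SQL expression `1 is not null`
// resolves to the literal 1, so the column has to be back-quoted to be reached.
val demo = Seq((null.asInstanceOf[String], "x")).toDF("1", "2")
demo.select(expr("1 is not null")).show()   // true:  the literal 1 is never null
demo.select(expr("`1` is not null")).show() // false: the column named "1" is null here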
And here is the solution to your problem:
import org.apache.spark.sql.functions.col
val id_df = Seq(
("c1","gender"),
("c2","city"),
("c3","state"),
("c4","age")
).toDF("id","type")
val main_df = Seq(
("male", "los angeles", null),
("female", "new york", "new york"),
("trans", null, "new york")
).toDF("c1","c2","c3")
val targetCols = id_df.collect()
.map{_.getString(0)} //get id
.toSet //convert current sequence to a set (required for the intersection)
.intersect(main_df.columns.toSet) //get common columns with main_df
.map(col(_).isNotNull) //convert c1,..cN to col(c[i]).isNotNull
.reduce(_ && _) // apply the AND operator between items
// (((c1 IS NOT NULL) AND (c2 IS NOT NULL)) AND (c3 IS NOT NULL))
main_df.withColumn("meets_conditions", targetCols).show(false)
// +------+-----------+--------+----------------+
// |c1 |c2 |c3 |meets_conditions|
// +------+-----------+--------+----------------+
// |male |los angeles|null |false |
// |female|new york |new york|true |
// |trans |null |new york|false |
// +------+-----------+--------+----------------+
I have a dataframe with two columns:
id (string), date (timestamp)
I would like to loop through the dataframe and add a new column with a URL, which includes the id. The algorithm should look something like this:
add one new column with the following value:
for each id
"some url" + the value of the dataframe's id column
I tried to make this work in Scala, but I have problems getting the specific id at index a:
val k = df2.count().asInstanceOf[Int]
// for loop execution with a range
for( a <- 1 to k){
// println( "Value of a: " + a );
val dfWithFileURL = dataframe.withColumn("fileUrl", "https://someURL/" + dataframe("id")[a])
}
But this
dataframe("id")[a]
is not working in Scala. I could not find a solution yet, so any suggestions are welcome!
You can simply use the withColumn function in Scala, something like this:
import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._

val df = Seq(
  (1, "1 Jan 2000"),
  (2, "2 Feb 2014"),
  (3, "3 Apr 2017")
).toDF("id", "date")

// Add the fileUrl column and show the result
val dfNew = df.withColumn("fileUrl", concat(lit("https://someURL/"), $"id"))
dfNew.show()
My results:
Not sure if this is what you require but you can use zipWithIndex for indexing.
data.show()
+---+------------------+
| Id|               Url|
+---+------------------+
|111|http://abc.go.org/|
|222|http://xyz.go.net/|
+---+------------------+
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df = sqlContext.createDataFrame(
  data.rdd.zipWithIndex
    .map { case (r, i) => Row.fromSeq(r.toSeq :+ s"${r.getString(1)}${i + 1}") },
  StructType(data.schema.fields :+ StructField("fileUrl", StringType, false))
)
Output:
df.show(false)
+---+------------------+-------------------+
|Id |Url               |fileUrl            |
+---+------------------+-------------------+
|111|http://abc.go.org/|http://abc.go.org/1|
|222|http://xyz.go.net/|http://xyz.go.net/2|
+---+------------------+-------------------+
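If staying in the DataFrame API is preferred, a similar column can be sketched with monotonically_increasing_id; note this is a different technique than zipWithIndex, since the generated numbers are unique but not guaranteed to be consecutive:

import org.apache.spark.sql.functions.{col, concat, monotonically_increasing_id}

// Sketch only: the index is unique per row but not necessarily 1, 2, 3, ...
val dfUrl = data
  .withColumn("idx", monotonically_increasing_id() + 1)
  .withColumn("fileUrl", concat(col("Url"), col("idx").cast("string")))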
I am a newbie to Apache Spark and recently started coding in Scala.
I have a RDD with 4 columns that looks like this:
(Columns: 1 - name, 2 - title, 3 - views, 4 - size)
aa File:Sleeping_lion.jpg 1 8030
aa Main_Page 1 78261
aa Special:Statistics 1 20493
aa.b User:5.34.97.97 1 4749
aa.b User:80.63.79.2 1 4751
af Blowback 2 16896
af Bluff 2 21442
en Huntingtown,_Maryland 1 0
I want to group based on Column Name and get the sum of Column views.
It should be like this:
aa 3
aa.b 2
af 2
en 1
I have tried to use groupByKey and reduceByKey but I am stuck and unable to proceed further.
This should work: read the text file, split each line by the separator, map to key-value pairs with the appropriate fields, and use countByKey:
sc.textFile("path to the text file")
.map(x => x.split(" ",-1))
.map(x => (x(0),x(3)))
.countByKey
To complete my answer, you can also approach the problem using the DataFrame API (if your Spark version allows it), for example:
val result = df.groupBy("column to Group on").agg(count("column to count on"))
Another possibility is to use the SQL approach:
val df = spark.read.csv("csv path")
df.createOrReplaceTempView("temp_table")
val result = spark.sql("select <col to group on>, count(<col to count on>) from temp_table group by <col to group on>")
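With the column names from the question, a minimal sketch (assuming the data is already in a DataFrame df with columns name, title, views, size); count gives the number of rows per name, while sum gives the actual sum of the views column:

import org.apache.spark.sql.functions.{count, sum}

// Number of rows per name (the aa=3, aa.b=2, ... style of output)
val rowsPerName = df.groupBy("name").agg(count("name") as "count")

// Sum of the views column per name, if a real sum is what is needed
val viewsPerName = df.groupBy("name").agg(sum("views") as "total_views")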
I assume that you already have your RDD populated.
//For simplicity, I build RDD this way
val data = Seq(("aa", "File:Sleeping_lion.jpg", 1, 8030),
("aa", "Main_Page", 1, 78261),
("aa", "Special:Statistics", 1, 20493),
("aa.b", "User:5.34.97.97", 1, 4749),
("aa.b", "User:80.63.79.2", 1, 4751),
("af", "Blowback", 2, 16896),
("af", "Bluff", 2, 21442),
("en", "Huntingtown,_Maryland", 1, 0))
DataFrame approach
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

val sql = new SQLContext(sc)
import sql.implicits._

val df = data.toDF("name", "title", "views", "size")
df.groupBy($"name").agg(count($"name") as "count").show
Result
+----+-----+
|name|count|
+----+-----+
|  aa|    3|
|  af|    2|
|aa.b|    2|
|  en|    1|
+----+-----+
RDD Approach (countByKey(...))
rdd.keyBy(f => f._1).countByKey().foreach(println(_))
RDD Approach (reduceByKey(...))
rdd.map(f => (f._1, 1)).reduceByKey((accum, curr) => accum + curr).foreach(println(_))
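If the goal is literally the sum of the views column rather than a row count, the same reduceByKey pattern can carry the views value; a sketch using the tuples above (views is the third field):

// Sum of the views column per name
rdd.map(f => (f._1, f._3)).reduceByKey(_ + _).foreach(println(_))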
If any of this does not solve your problem, please share where exactly you are stuck.
This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have the following dataframe df in Spark Scala:
id project start_date Change_date designation
1 P1 08/10/2018 01/09/2017 2
1 P1 08/10/2018 02/11/2018 3
1 P1 08/10/2018 01/08/2016 1
Then I need to get the designation whose Change_date is closest to start_date and earlier than it.
Expected output:
id project start_date designation
1 P1 08/10/2018 2
This is because change date 01/09/2017 is the closest date before start_date.
Can somebody advise how to achieve this?
This is not selecting the first row, but selecting the designation corresponding to the change date closest to the start date.
Parse dates:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = ???
import spark.implicits._
val df = Seq(
(1, "P1", "08/10/2018", "01/09/2017", 2),
(1, "P1", "08/10/2018", "02/11/2018", 3),
(1, "P1", "08/10/2018", "01/08/2016", 1)
).toDF("id", "project_id", "start_date", "changed_date", "designation")
val parsed = df
.withColumn("start_date", to_date($"start_date", "dd/MM/yyyy"))
.withColumn("changed_date", to_date($"changed_date", "dd/MM/yyyy"))
Find the difference:
val diff = parsed
.withColumn("diff", datediff($"start_date", $"changed_date"))
.where($"diff" > 0)
Apply a solution of your choice from How to select the first row of each group?, for example window functions. If you group by id:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"id").orderBy($"diff")
diff.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn").show
// +---+----------+----------+------------+-----------+----+
// | id|project_id|start_date|changed_date|designation|diff|
// +---+----------+----------+------------+-----------+----+
// |  1|        P1|2018-10-08|  2017-09-01|          2| 402|
// +---+----------+----------+------------+-----------+----+
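A window is not the only option; the linked question also covers an aggregation-based pattern. A minimal sketch of that variant (putting diff first in the struct so min picks the closest earlier change date per id):

import org.apache.spark.sql.functions.{min, struct}

// min over a struct compares field by field, so diff drives the choice of row.
val closest = diff
  .groupBy($"id")
  .agg(min(struct($"diff", $"project_id", $"start_date", $"changed_date", $"designation")) as "best")
  .select($"id", $"best.*")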
Reference:
How to select the first row of each group?
I have a table which contains id, offset, text. Suppose input:
id offset text
1 1 hello
1 7 world
2 1 foo
I want output like:
id text
1 hello world
2 foo
I'm using:
df.groupby("id").agg(concat_ws("", collect_list("text")))
But I don't know how to ensure the order of the text. I did sort the data before the groupBy, but I've heard that groupBy might shuffle the data. Is there a way to sort within each group after the groupBy?
This will create the required df:
df1 = sqlContext.createDataFrame([("1", "1","hello"), ("1", "7","world"), ("2", "1","foo")], ("id", "offset" ,"text" ))
display(df1)
Then you can use the following code (it could be optimized further):
from pyspark.sql.functions import udf, col, lit, concat, concat_ws, collect_list

@udf
def sort_by_offset(col):
    # Split the "-"-joined "offset text" entries, sort them numerically by offset,
    # then glue the text parts back together.
    result = ""
    text_list = col.split("-")
    for i in range(len(text_list)):
        text_list[i] = text_list[i].split(" ")
        text_list[i][0] = int(text_list[i][0])
    text_list = sorted(text_list, key=lambda x: x[0], reverse=False)
    for i in range(len(text_list)):
        result = result + " " + text_list[i][1]
    return result.lstrip()
df2 = df1.withColumn("offset_text",concat(col("offset"),lit(" "),col("text")))
df3 = df2.groupby(col("id")).agg(concat_ws("-",collect_list(col("offset_text"))).alias("offset_text"))
df4 = df3.withColumn("text",sort_by_offset(col("offset_text")))
display(df4)
Final Output:
Add sort_array:
from pyspark.sql.functions import sort_array
df.groupby("id").agg(concat_ws("", sort_array(collect_list("text"))))
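Note that sort_array over the bare text values sorts alphabetically, which only coincidentally matches the offset order in this sample. A sketch in Scala of a variant that sorts explicitly by offset: collect (offset, text) structs, sort the array, then extract the text field:

import org.apache.spark.sql.functions.{col, collect_list, concat_ws, sort_array, struct}

// Structs compare field by field, so putting offset first drives the ordering.
// Cast offset to int in case it is stored as a string ("10" would otherwise sort before "7").
val result = df
  .groupBy(col("id"))
  .agg(sort_array(collect_list(struct(col("offset").cast("int") as "offset", col("text")))) as "pairs")
  .select(col("id"), concat_ws(" ", col("pairs.text")) as "text")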