I have to translate PySpark code to Scala.
I have a function like this: lambda x: ';'.join(map(str, x))
Then we use
our_dataframe.groupBy('column').agg(our_lambda_function(F.collect_set('other_col')))
So we group by some column first and collect the values of the other column as an array of strings. Our lambda function then makes a single string out of this array.
Example. Instead of
|column|
|[a,b,c]|
|[d,e,f]|
we want to have
|column|
|a;b;c|
|d;e;f|
How can I do that in Scala?
Try using the concat_ws function to concatenate the elements with ;:
Example:
import org.apache.spark.sql.functions.{col, collect_set, concat_ws, lit}

df.groupBy(lit(1))
  .agg(concat_ws(";", collect_set(col("id"))).alias("column"))
  .select("column")
  .show(10, false)
//+------+
//|column|
//+------+
//|c;b;a |
//+------+
I am trying to convert a nested JSON document to a flattened DataFrame.
I have read in the JSON as follows:
df = spark.read.json("/mnt/ins/duedil/combined.json")
The resulting dataframe includes columns such as companyId and countryCode, plus a nested financials array.
I have made a start on flattening the dataframe as follows
display(df.select("companyId", "countryCode"))
The above displays the companyId and countryCode columns.
I would like to select "fiveYearCAGR" under the following path: "financials:element:amortisationOfIntangibles:fiveYearCAGR".
Can someone let me know how to add to the select statement to retrieve the fiveYearCAGR?
Your financials column is an array, so if you want to extract something from within it, you need some array transformations.
One example is to use transform.
from pyspark.sql import functions as F

df.select(
    "companyId",
    "countryCode",
    F.transform('financials', lambda x: x['amortisationOfIntangibles']['fiveYearCAGR']).alias('fiveYearCAGR')
)
This will return the fiveYearCAGR in an array. If you need to flatten it further, you can use explode/explode_outer.
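For instance, here is a minimal sketch of the explode_outer route, assuming the same df and column names as above; explode_outer keeps rows whose financials array is null or empty (yielding a null fiveYearCAGR), while explode would drop them:
from pyspark.sql import functions as F

flattened = df.select(
    "companyId",
    "countryCode",
    F.explode_outer(
        F.transform('financials', lambda x: x['amortisationOfIntangibles']['fiveYearCAGR'])
    ).alias('fiveYearCAGR')
)
flattened.show(truncate=False)
This gives one row per financials element instead of one array per company.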
I have data with id, year and amount columns, and I want one row per id with the amounts pivoted into one column per year plus a column holding their total.
How do I achieve this?
One way of doing this is to pivot, create an array, and sum the values within the array:
from pyspark.sql.functions import *

s = df.groupby('id').pivot('year').agg(sum('amount'))  # pivot
(s.withColumn('x', array(*[x for x in s.columns if x != 'id']))  # create an array of the year columns
 .withColumn('x', expr("reduce(x, cast(0 as bigint), (c,i) -> c+i)"))  # sum the array
).show()
Or use the built-in aggregate SQL function:
s = df.groupby('id').pivot('year').agg(sum('amount'))  # pivot
(s.withColumn('x', array(*[x for x in s.columns if x != 'id']))  # create an array of the year columns
 .withColumn('x', expr("aggregate(x, cast(0 as bigint), (c,i) -> c+i)"))  # sum the array
).show()
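On Spark 3.1+, the same fold can also be written with pyspark.sql.functions.aggregate instead of going through expr(). A minimal sketch, assuming the pivoted frame s from above:
from pyspark.sql import functions as F

s = df.groupby('id').pivot('year').agg(F.sum('amount'))
year_cols = [c for c in s.columns if c != 'id']
s.withColumn(
    'x',
    F.aggregate(F.array(*year_cols), F.lit(0).cast('bigint'), lambda acc, v: acc + v)
).show()
As with the expr() versions, a null year column makes the sum null for that row; wrap the values in coalesce if that matters.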
I was wondering if it is possible to create a UDF that receives two arguments, a Column and another variable (an object, dictionary, or any other type), then does some operations and returns the result.
I attempted to do this but got an exception, so I was wondering whether there is any way to avoid this problem.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

df = sqlContext.createDataFrame(
    [("Bonsanto", 20, 2000.00),
     ("Hayek", 60, 3000.00),
     ("Mises", 60, 1000.0)],
    ["name", "age", "balance"])

comparatorUDF = udf(lambda c, n: c == n, BooleanType())
df.where(comparatorUDF(col("name"), "Bonsanto")).show()
And I get the following error:
AnalysisException: u"cannot resolve 'Bonsanto' given input columns
name, age, balance;"
So it's obvious that the UDF "sees" the string "Bonsanto" as a column name, when I'm actually trying to compare a record value with the second argument.
On the other hand, I know that it's possible to use operators inside a where clause (but I want to know whether it is achievable using a UDF), as follows:
df.where(col("name") == "Bonsanto").show()
#+--------+---+-------+
#| name|age|balance|
#+--------+---+-------+
#|Bonsanto| 20| 2000.0|
#+--------+---+-------+
Everything that is passed to a UDF is interpreted as a column / column name. If you want to pass a literal you have two options:
Pass argument using currying:
def comparatorUDF(n):
    return udf(lambda c: c == n, BooleanType())

df.where(comparatorUDF("Bonsanto")(col("name")))
This can be used with an argument of any type as long as it is serializable.
Use a SQL literal and the current implementation:
from pyspark.sql.functions import lit
df.where(comparatorUDF(col("name"), lit("Bonsanto")))
This works only with supported types (strings, numerics, booleans). For non-atomic types see How to add a constant column in a Spark DataFrame?
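To illustrate the currying option with a non-atomic argument, here is a sketch that closes over a plain Python dict; thresholds and above_threshold_udf are made-up names for illustration, and the object only needs to be picklable:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# the dict travels to the executors inside the UDF's closure, so no lit() is needed
def above_threshold_udf(thresholds):
    return udf(lambda name, balance: balance > thresholds.get(name, 0.0), BooleanType())

thresholds = {"Bonsanto": 1500.0, "Hayek": 5000.0}
df.where(above_threshold_udf(thresholds)(col("name"), col("balance"))).show()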
I'm new to Scala and Spark, and I have a problem while trying to learn from some toy dataframes.
I have a dataframe with the following two columns: Name_Description and Grade.
Name_Description is an array, and Grade is just a letter. It's Name_Description that I'm having a problem with; I'm trying to change this column using Scala on Spark.
Name_Description is not a fixed-size array. It could be something like
['asdf_ Brandon', 'Ca%abc%rd']
['fthhhhChris', 'Rock', 'is the %abc%man']
The only problems are the following:
1. The first element of the array ALWAYS has 6 garbage characters, so the real content starts at the 7th character.
2. %abc% randomly pops up in elements, so I want to erase it.
Is there any way to achieve those two things in Scala? For instance, I just want
['asdf_ Brandon', 'Ca%abc%rd'], ['fthhhhChris', 'Rock', 'is the %abc%man']
to change to
['Brandon', 'Card'], ['Chris', 'Rock', 'is the man']
What you're trying to do might be hard to achieve using standard Spark functions, but you could define a UDF for it:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

val removeGarbage = udf { arr: WrappedArray[String] =>
  // in case the array is empty we need to map over Option
  arr.headOption
    // drop the first 6 characters from the first element, then remove %abc% from the rest
    .map(head => head.drop(6) +: arr.tail.map(_.replace("%abc%", "")))
    .getOrElse(arr)
}
Then you just need to use this UDF on your Name_Description column:
val df = List(
  (1, Array("asdf_ Brandon", "Ca%abc%rd")),
  (2, Array("fthhhhChris", "Rock", "is the %abc%man"))
).toDF("Grade", "Name_Description")

df.withColumn("Name_Description", removeGarbage($"Name_Description")).show(false)
Show prints:
+-----+-------------------------+
|Grade|Name_Description |
+-----+-------------------------+
|1 |[Brandon, Card] |
|2 |[Chris, Rock, is the man]|
+-----+-------------------------+
We are always encouraged to use Spark SQL functions and to avoid UDFs whenever we can. I have a simplified solution for this which makes use of Spark SQL functions.
Please find my approach below. Hope it helps.
val d = Array((1,Array("asdf_ Brandon","Ca%abc%rd")),(2,Array("fthhhhChris", "Rock", "is the %abc%man")))
val df = spark.sparkContext.parallelize(d).toDF("Grade","Name_Description")
This is how I created the input dataframe.
df.select('Grade, posexplode('Name_Description)).createOrReplaceTempView("data")
We explode the array along with the position of each element in the array. I register the dataframe as a temporary view in order to use a SQL query to generate the required output.
spark.sql("""select Grade, collect_list(Names) from (select Grade,case when pos=0 then substring(col,7) else replace(col,"%abc%","") end as Names from data) a group by Grade""").show
This query produces the required output.
After the show command Spark prints the following:
+-----------------------+---------------------------+
|NameColumn |NumberColumn |
+-----------------------+---------------------------+
|name |4.3E-5 |
+-----------------------+---------------------------+
Is there a way to change NumberColumn format to something like 0.000043?
You can use the format_number function as follows:
import org.apache.spark.sql.functions.format_number
df.withColumn("NumberColumn", format_number($"NumberColumn", 5))
Here 5 is the number of decimal places you want to show.
As the API documentation notes, the format_number function returns a string column:
format_number(Column x, int d)
Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
If you don't want the comma (,), you can call the regexp_replace function, which is defined as
regexp_replace(Column e, String pattern, String replacement)
Replace all substrings of the specified string value that match regexp with rep.
and use it as
import org.apache.spark.sql.functions.regexp_replace
df.withColumn("NumberColumn", regexp_replace(format_number($"NumberColumn", 5), ",", ""))
This removes the commas (,) that format_number inserts for large numbers.
You can use a cast operation as below:
val df = sc.parallelize(Seq(0.000043)).toDF("num")
df.createOrReplaceTempView("data")
spark.sql("select CAST (num as DECIMAL(8,6)) from data")
adjust the precision and scale accordingly.
In newer versions of PySpark you can use the round() or bround() functions.
These functions return a numeric column and so avoid the problem with ",".
It would look like this:
from pyspark.sql.functions import bround

df.withColumn("NumberColumn", bround("NumberColumn", 5))