How to find the length of a JSON array stored as a string column in PySpark/Scala? - scala

I have one column in a DataFrame whose values are strings of the form
'[{jsonobject},{jsonobject}]'. Here the length would be 2.
I need to find the length of this array and store it in another column.

I've only worked with pySpark, but the Scala solution would be similar. Assuming the column name is input:
from pyspark.sql import functions as f, types as t
json_schema = t.ArrayType(t.MapType(t.StringType(), t.StringType()))
df.select(f.size(f.from_json(df.input, json_schema)).alias("num_objects"))
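To sanity-check what num_objects should be for a given cell, the same count can be computed on a single string in plain Python. This is only a sketch outside Spark; the helper name count_json_objects and the sample string are made up for illustration.

```python
import json

def count_json_objects(cell):
    """Return the number of elements in a JSON array string, or None if the input is not a JSON array."""
    try:
        parsed = json.loads(cell)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON
        return None
    return len(parsed) if isinstance(parsed, list) else None

# Hypothetical cell value with two JSON objects
print(count_json_objects('[{"a": "1"}, {"b": "2"}]'))
```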

Related

cast big number in human readable format

I'm working with databricks on a notebook.
I have a column with numbers like this 103503119090884718216391506040
They are in string format. I can print them and read them easily.
For debugging purposes I need to be able to read them. However, I also need to be able to apply the .sort() method to them. Casting them to IntegerType() returns null, and casting them to double makes them unreadable.
How can I convert them to a human-readable format on which .sort() still works? Do I need to create two separate columns?
To make the column sortable, you could cast it to DecimalType(precision, scale) (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DecimalType.html#pyspark.sql.types.DecimalType). The two arguments set the range of values the type can hold: precision is the total number of digits (at most 38) and scale is the number of digits after the decimal point.
from pyspark.sql import SparkSession, Row, types as T, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(string_column='103503119090884718216391506040'),
    Row(string_column='103503119090884718216391506039'),
    Row(string_column='90'),
])
(
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30,0)))
    .sort('decimal_column')
    .show(truncate=False)
)
# Output
+------------------------------+------------------------------+
|string_column                 |decimal_column                |
+------------------------------+------------------------------+
|90                            |90                            |
|103503119090884718216391506039|103503119090884718216391506039|
|103503119090884718216391506040|103503119090884718216391506040|
+------------------------------+------------------------------+
Concerning "human readability" I'm not sure whether that helps, though.
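As a rough illustration of why a decimal type helps here, the following plain-Python sketch (not Spark code) compares sorting the same three strings lexicographically and numerically, and shows that a double cannot tell the two large values apart. The variable names are chosen only for this example.

```python
from decimal import Decimal

values = [
    '103503119090884718216391506040',
    '103503119090884718216391506039',
    '90',
]

# Sorting the raw strings compares character by character, so '90' ends up last
lexicographic = sorted(values)

# Sorting by exact decimal value gives the numeric order, with '90' first
numeric = sorted(values, key=Decimal)

# A double keeps only about 15 to 17 significant digits,
# so both 30-digit values round to the same float
collide_as_double = float(values[0]) == float(values[1])

print(lexicographic)
print(numeric)
print(collide_as_double)
```

A decimal value prints with all of its digits, so one decimal column can serve for both reading and sorting.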

Reverse contents of a field within a dataframe using scala

I'm using scala.
I have a dataframe with millions of rows and multiple fields. One of the fields is a string field containing things like this:
"Snow_KC Bingfamilies Conference_610507"
How do I reverse the contents of just this field for all the rows in the dataframe?
Thanks.
Doing a quick search on the Scaladoc, I found this reverse function which does exactly that.
import org.apache.spark.sql.{functions => sqlfun}
val df1 = ...
val df2 = df1.withColumn("columnName", sqlfun.reverse($"columnName"))
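For reference, this is the per-value result to expect from reversing the sample string in the question, sketched in plain Python rather than Spark (the variable names are only for illustration):

```python
original = "Snow_KC Bingfamilies Conference_610507"

# Slicing with a step of -1 walks the string from the last character to the first
reversed_value = original[::-1]

print(reversed_value)
```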

Filter Scala dataframe by column of arrays

My Scala dataframe has a column with the datatype array(element: String). I want to display the rows of the dataframe that have the word "hello" in that column.
I have this:
display(df.filter($"my_column".contains("hello")))
I get an error because of a data type mismatch. It says that argument 1 requires string type, however, 'my_column' is of array<string> type.
You can use the array_contains function:
import org.apache.spark.sql.functions._
df.filter(array_contains(df.col("my_column"), "hello")).show
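Note that array_contains tests whether some element of the array equals the given value; it is not a substring search. A small plain-Python sketch of the same membership test, using made-up rows, shows the difference:

```python
# Hypothetical rows standing in for the dataframe
rows = [
    {"id": 1, "my_column": ["hello", "world"]},
    {"id": 2, "my_column": ["hello there"]},  # contains "hello" only as a substring of an element
    {"id": 3, "my_column": []},
]

# Keep rows where some element is exactly "hello"
matching_ids = [row["id"] for row in rows if "hello" in row["my_column"]]

print(matching_ids)
```

Only the first row matches, because the element "hello there" is not equal to "hello".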

Converting Column of Dataframe to Seq[Columns] Scala

I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the arguments passed to agg should come from a Seq[Column].
I also have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type, and I want to use the value of each row as input to agg, but I have not been able to access them.
Any idea of how to give it a try?
If you can change the strings in the sequences column to valid SQL expressions, then this can be solved. Spark provides a function expr that takes a SQL string and converts it into a column. Example dataframe with working commands:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to Seq[Column]s do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)

Convert epochmilli to DDMMYYYY - Spark Scala

I have a dataframe in which one column contains timestamps represented in epoch milliseconds (column type is long), and I need to convert them to a column in ddMMyy format using withColumn.
Something like:
1528102439 ---> 040618
How do I achieve this?
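// Convert epoch milliseconds to whole seconds, then format as ddMMyy (from_unixtime uses the session time zone)
val df_DateConverted = df.withColumn("Date", from_unixtime((df.col("timestamp") / 1000).cast("long"), "ddMMyy"))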
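The conversion can be checked on single values with a plain-Python sketch (not Spark code). It assumes UTC, whereas Spark formats in the session time zone, and the helper name epoch_millis_to_ddmmyy is made up for this example. One thing it makes visible: the sample value 1528102439 in the question only maps to 040618 when read as epoch seconds, so if the real column holds values of that size, the division by 1000 should be dropped.

```python
from datetime import datetime, timezone

def epoch_millis_to_ddmmyy(millis):
    """Format an epoch-milliseconds value as ddMMyy in UTC."""
    seconds = millis // 1000  # whole seconds since the epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%d%m%y")

# The question's sample value scaled to milliseconds
print(epoch_millis_to_ddmmyy(1528102439000))

# The same digits treated as milliseconds land in January 1970
print(epoch_millis_to_ddmmyy(1528102439))
```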