How to convert spark dataframe array to tuple - scala

How can I convert spark dataframe to a tuple of 2 in scala?
I tried to explode the array and create a new column with help of lead function, so that I can use two columns to create tuple.
In order to use lead function, I need a column to sort by, I don't have any.
Please suggest which is best way to solve this?
Note: I need to retain the same order in the array.
For example:
Input
For example, input looks something like this,
id1 | [text1, text2, text3, text4]
id2 | [txt, txt2, txt4, txt5, txt6, txt7, txt8, txt9]
expected o/p:
I need to get output of tuple of length 2
id1 | [(text1, text2), (text2, text3), (text3,text4)]
id2 | [(txt, txt2), (txt2, txt4), (txt4, txt5), (txt5, txt6), (txt6, txt7), (txt7, txt8), (txt8, txt9)]

You can create an udf to create list of tuple using sliding window function
val df = Seq(
("id1", List("text1", "text2", "text3", "text4")),
("id2", List("txt", "txt2", "txt4", "txt5", "txt6", "txt7", "txt8", "txt9"))
).toDF("id", "text")
val sliding = udf((value: Seq[String]) => {
value.toList.sliding(2).map { case List(a, b) => (a, b) }.toList
})
val result = df.withColumn("text", sliding($"text"))
Output:
+---+-------------------------------------------------------------------------------------------------+
|id |text |
+---+-------------------------------------------------------------------------------------------------+
|id1|[[text1, text2], [text2, text3], [text3, text4]] |
|id2|[[txt, txt2], [txt2, txt4], [txt4, txt5], [txt5, txt6], [txt6, txt7], [txt7, txt8], [txt8, txt9]]|
+---+-------------------------------------------------------------------------------------------------+
Hope this helps!

Related

Convert Dataset of array into DataFrame

Given Dataset[Array[String]].
In fact, this structure has a single field of array type.
Is there any possibility to convert it into a DataFrame with each array item placed into a separate column?
If I have RDD[Array[String]] I can achieve it in this way:
val rdd: RDD[Array[String]] = ???
rdd.map(arr => Row.fromSeq(arr))
But surprisingly I cannot do the same with Dataset[Array[String]] – it says that there's no encoder for Row.
And I cannot replace an array with Tuple or case class because the size of the array is unknown at compile time.
If arrays have the same size, "select" can be used:
val original: Dataset[Array[String]] = Seq(Array("One", "Two"), Array("Three", "Four")).toDS()
val arraySize = original.head.size
val result = original.select(
(0 until arraySize).map(r => original.col("value").getItem(r)): _*)
result.show(false)
Output:
+--------+--------+
|value[0]|value[1]|
+--------+--------+
|One |Two |
|Three |Four |
+--------+--------+
Here you can do a foldLeft to create all your columns manually.
val df = Seq(Array("Hello", "world"), Array("another", "row")).toDS()
Then you calculate the size of your array.
val size_array = df.first.length
Then you add the columns to your dataframe with a foldLeft :
0.until(size_array).foldLeft(df){(acc, number) => df.withColumn(s"col$number", $"value".getItem(number))}.show
Here our accumulator is our df, and we just add the columns one by one.

Spark generate a list of column names that contains(SQL LIKE) a string

This one below is a simple syntax to search for a string in a particular column uisng SQL Like functionality.
val dfx = df.filter($"name".like(s"%${productName}%"))
The questions is How do I grab each and every column NAME that contained the particular string in its VALUES and generate a new column with a list of those "column names" for every row.
So far this is the approach I took but stuck as I cant use spark-sql "Like" function inside a UDF.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import spark.implicits._
val df1 = Seq(
(0, "mango", "man", "dit"),
(1, "i-man", "man2", "mane"),
(2, "iman", "mango", "ho"),
(3, "dim", "kim", "sim")
).toDF("id", "col1", "col2", "col3")
val df2 = df1.columns.foldLeft(df1) {
(acc: DataFrame, colName: String) =>
acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}
val df3 = df2.withColumn("merged_cols", split(concat_ws("X", df2.columns.map(c=> col(c)):_*), "X"))
Here is a sample output. Note that here there are only 3 columns but in the real job I'll be reading multiple tables which can contain dynamic number of columns.
+--------------------------------------------+
|id | col1| col2| col3| merged_cols
+--------------------------------------------+
0 | mango| man | dit | col1, col2
1 | i-man| man2 | mane | col1, col2, col3
2 | iman | mango| ho | col1, col2
3 | dim | kim | sim|
+--------------------------------------------+
This can be done using a foldLeft over the columns together with when and otherwise:
val e = "%man%"
val df2 = df1.columns.foldLeft(df.withColumn("merged_cols", lit(""))){(df, c) =>
df.withColumn("merged_cols", when(col(c).like(e), concat($"merged_cols", lit(s"$c,"))).otherwise($"merged_cols"))}
.withColumn("merged_cols", expr("substring(merged_cols, 1, length(merged_cols)-1)"))
All columns that satisfies the condition e will be appended to the string in the merged_cols column. Note that the column must exist for the first append to work so it is added (containing an empty string) to the dataframe when sent into the foldLeft.
The last row in the code simply removes the extra , that is added in the end. If you want the result as an array instead, simply adding .withColumn("merged_cols", split($"merged_cols", ",")) would work.
An alternative appraoch is to instead use an UDF. This could be preferred when dealing with many columns since foldLeft will create multiple dataframe copies. Here regex is used (not the SQL like since that operates on whole columns).
val e = ".*man.*"
val concat_cols = udf((vals: Seq[String], names: Seq[String]) => {
vals.zip(names).filter{case (v, n) => v.matches(e)}.map(_._2)
})
val df2 = df.withColumn("merged_cols", concat_cols(array(df.columns.map(col(_)): _*), typedLit(df.columns.toSeq)))
Note: typedLit can be used in Spark versions 2.2+, when using older versions use array(df.columns.map(lit(_)): _*) instead.

split an apache-spark dataframe string column into multiple columns by slicing/splitting on field width values stored in a list

I have a dataframe that looks like this
+--------------------
| unparsed_data|
+--------------------
|02020sometext5002...
|02020sometext6682...
I need to get it split it up into something like this
+--------------------
|fips | Name | Id ...
+--------------------
|02020 | sometext | 5002...
|02020 | sometext | 6682...
I have a list like this
val fields = List(
("fips", 5),
(“Name”, 8),
(“Id”, 27)
....more fields
)
I need the spit to take the first 5 characters in unparsed_data and map it to fips, take the next 8 characters in unparsed_data and map it to Name, then the next 27 characters and map them to Id and so on. I need the split to use/reference the filed lengths supplied in the list to do the splitting/slicing as there are allot of fields and the unparsed_data field is very long.
My scala is still pretty week and I assume the answer would look something like this
df.withColumn("temp_field", split("unparsed_data", //some regex created from the list values?)).map(i => //some mapping to the field names in the list)
any suggestions/ideas much appreciated
You can use foldLeft to traverse your fields list to iteratively create columns from the original DataFrame using
substring. It applies regardless of the size of the fields list:
import org.apache.spark.sql.functions._
val df = Seq(
("02020sometext5002"),
("03030othrtext6003"),
("04040moretext7004")
).toDF("unparsed_data")
val fields = List(
("fips", 5),
("name", 8),
("id", 4)
)
val resultDF = fields.foldLeft( (df, 1) ){ (acc, field) =>
val newDF = acc._1.withColumn(
field._1, substring($"unparsed_data", acc._2, field._2)
)
(newDF, acc._2 + field._2)
}._1.
drop("unparsed_data")
resultDF.show
// +-----+--------+----+
// | fips| name| id|
// +-----+--------+----+
// |02020|sometext|5002|
// |03030|othrtext|6003|
// |04040|moretext|7004|
// +-----+--------+----+
Note that a Tuple2[DataFrame, Int] is used as the accumulator for foldLeft to carry both the iteratively transformed DataFrame and next offset position for substring.
This can get you going. Depending on your needs it can get more and more complicated with variable lengths etc. which you do not state. But you can I think use column list.
import org.apache.spark.sql.functions._
val df = Seq(
("12334sometext999")
).toDF("X")
val df2 = df.selectExpr("substring(X, 0, 5)", "substring(X, 6,8)", "substring(X, 14,3)")
df2.show
Gives in this case (you can rename cols again):
+------------------+------------------+-------------------+
|substring(X, 0, 5)|substring(X, 6, 8)|substring(X, 14, 3)|
+------------------+------------------+-------------------+
| 12334| sometext| 999|
+------------------+------------------+-------------------+

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame where I have a column with Vector values. The vector values are all n-dimensional, aka with the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponds to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join with the old one, but it requires spark context to use createDataFrame. I only want to transform the existing data frame. I also know .withColumn("fi", value) but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then using getItem to extract individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+

Create new DataFrame with new rows depending in number of a column - Spark Scala

I have a DataFrame with the following data:
num_cta | n_lines
110000000000| 2
110100000000| 3
110200000000| 1
With that information, I need to create a new DF with different number of rows depending the value that comes over the n_lines column.
For example, for the first row of my DF (110000000000), the value of the n_lines column is 2. The result would have to be something like the following:
num_cta
110000000000
110000000000
For all the Dataframe example that I show, the result to get would have to be something like this:
num_cta
110000000000
110000000000
110100000000
110100000000
110100000000
110200000000
Is there a way to do that? And multiply a row n times, depending on the value of a column value?
Regards.
One approach would be to expand n_lines into an array with an UDF and explode it:
val df = Seq(
("110000000000", 2),
("110100000000", 3),
("110200000000", 1)
)toDF("num_cta", "n_lines")
def fillArr = udf(
(n: Int) => Array.fill(n)(1)
)
val df2 = df.withColumn("arr", fillArr($"n_lines")).
withColumn("a", explode($"arr")).
select($"num_cta")
df2.show
+------------+
| num_cta|
+------------+
|110000000000|
|110000000000|
|110100000000|
|110100000000|
|110100000000|
|110200000000|
+------------+
There is no off the shelve way to doing this. However you can try iterate over the dataframe and return a list of num_cta where the number of elements are equal to the corresponding n_lines.
Something like
import spark.implicits._
case class (num_cta:String) // output dataframe schema
case class (num_cta:String, n_lines:Integer) // input dataframe 'df' schema
val result = df.flatmap(x => {
List.fill(x.n_lines)(x.num_cta)
}).toDF