Spark - extracting single value from DataFrame - scala

I have a Spark DataFrame query that is guaranteed to return a single column with a single Int value. What is the best way to extract this value as an Int from the resulting DataFrame?

You can use head
df.head().getInt(0)
or first
df.first().getInt(0)
Check the DataFrame Scala docs for more details.
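For context, a minimal self-contained sketch of both options; the toy aggregation query here is made up for illustration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// toy query guaranteed to return one row with one Int column
val df = Seq(1, 2, 3).toDF("value").agg(max($"value"))

val result: Int = df.head().getInt(0)   // or: df.first().getInt(0)
println(result)                         // 3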

This could solve your problem.
// requires import spark.implicits._ for the Int encoder
df.map(row => row.getInt(0)).first()

In PySpark, if the DataFrame is a single entity with one column, you can simply get the first element; otherwise a whole row is returned and you have to index into it dimension-wise, like a 2-dimensional list, e.g. df.head()[0][0].
df.head()[0]

If we have the Spark DataFrame:
+----------+
|_c0 |
+----------+
|2021-08-31|
+----------+
x = df.first()[0]
print(x)
2021-08-31

Related

How to parallelize operations on partitions of a dataframe

I have a dataframe df =
+--------------------+
| id|
+-------------------+
|113331567dc042f...|
|5ffbbd1adb4c413...|
|08c782a4ae854e8...|
|24cee418e04a461...|
|a65f47c2aecc455...|
|a74355ef35d442d...|
|86f1a9b7ffc843b...|
|25c8abd6895e445...|
|b89ce33788f4484...|
.....................
with a million elements.
I want to repartition the dataframe into multiple partitions and pass each partition's elements as a list to a database API call that returns a Spark dataset.
Something like this.
df2 = df.repartition(10)
df2.foreach-partition { partition =>
val result = spark.read
.format("custom.databse")
.where(__key in partition.toList)
.load
}
And at the end I would like to do a union of all the result datasets returned for each of the partitions.
The expected output will be a final dataset of strings.
+--------------------+
| customer names|
+-------------------+
|eddy |
|jaman |
|cally |
|sunny |
|adam |
.....................
Can anyone help me convert this into real code in Spark/Scala?
From what I see in the documentation, it could be possible to do something like this. You'll have to use the RDD API and SparkContext, so you could use parallelize to partition your data into n partitions. After that you can call foreachPartition, which should already give you an iterator over your data directly, with no need to collect the data.
Conceptually what you are asking is not really possible in Spark.
Your API call is a SparkContext-dependent function (i.e. spark.read), and one cannot use a SparkContext inside a partition function. In simpler words, you cannot pass the spark object to executors.
For an even simpler picture: think of a Dataset in which each row is itself a Dataset. Is that even possible? No.
In your case there are two ways to solve this:
Case 1: One by one, then union
Convert the keys to a list and split them evenly.
For each split, call the spark.read API and keep unioning.
//split the collected keys into 10000-sized lists
val listOfListOfKeys: List[List[String]] =
  df.collect().map(_.getString(0)).grouped(10000).map(_.toList).toList

//bring the Dataset for the first 10000 keys (1st list)
//NOTE: the .where(__key in ...) part is pseudocode for the custom data source's filter
var resultDf = spark.read.format("custom.databse")
  .where(__key in listOfListOfKeys.head).load

//bring the rest of them, unioning as we go
//(drop(1) returns a new list rather than mutating, and union returns a new Dataset,
// so both results must actually be used)
for (listOfKeys <- listOfListOfKeys.drop(1)) {
  val tempDf = spark.read.format("custom.databse")
    .where(__key in listOfKeys).load
  resultDf = resultDf.union(tempDf)
}
There will be scaling issues with this approach because of the data collected on the driver. But if you want to use the spark.read API, this might be the only easy way.
Case 2: foreachPartition + a normal DB call which returns an iterator
If you can find another way to get the data from your DB that returns an iterator or any single-threaded, Spark-independent object, then you can achieve exactly what you want by applying what Filip has answered, i.e. df.repartition(n).rdd.foreachPartition(yourDbCallFunction()).
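To illustrate case 2, a rough sketch, assuming a hypothetical single-threaded fetchCustomers client that talks to the database without Spark; mapPartitions is used instead of foreachPartition so the fetched rows come back as a Dataset:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// hypothetical plain (non-Spark) client; stands in for your real database driver call
def fetchCustomers(keys: Seq[String]): Iterator[String] =
  keys.iterator.map(k => s"customer-for-$k")   // placeholder logic

val names = df.repartition(10)
  .as[String]                                         // the single key column as a Dataset[String]
  .mapPartitions(keys => fetchCustomers(keys.toSeq))  // one plain DB call per partition, on the executor
  .toDF("customer names")

names.show(false)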

check data size spark dataframes

I have the following question. I am working with the following CSV file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a spark dataframe as follows:
My aim is to check the length and type of each field in the dataframe against the set of rules below:
col type
job char10
marital char7
I started implementing the check of the length of each field, but I am getting a compilation error:
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
val fields = line.toString.split(";")
fields(0).size
fields(1).size
})
The expected output should be:
List(10,10)
As for checking the types, I don't have any idea how to implement it since we are using dataframes. Any idea about a function for verifying the data format?
Thanks a lot in advance for your replies.
I see you are trying to use a DataFrame, but if there are multiple double quotes then you can read the file as text, remove them, and convert it to a DataFrame as below:
import org.apache.spark.sql.functions._
import spark.implicits._
val raw = spark.read.textFile("path to file ")
.map(_.replaceAll("\"", ""))
val header = raw.first
val data = raw.filter(row => row != header)
.map { r => val x = r.split(";"); (x(0), x(1)) }
.toDF(header.split(";"): _*)
With data.show(false) you get:
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the size you can use withColumn and the length function, and play around as you need.
data.withColumn("jobSize", length($"job"))
.withColumn("martialSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job |marital|jobSize|martialSize|
+----------+-------+-------+-----------+
|management|married|10 |7 |
|technician|single |10 |6 |
+----------+-------+-------+-----------+
All the column types are String.
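If you also want to verify the column types programmatically (the second part of the question), you can read them straight off the DataFrame's schema; a small sketch assuming the data DataFrame built above:
// inspect the inferred column types
data.printSchema()
// root
//  |-- job: string (nullable = true)
//  |-- marital: string (nullable = true)

// or programmatically, as (name, type) pairs
data.dtypes.foreach { case (name, dt) => println(s"$name -> $dt") }
// job -> StringType
// marital -> StringType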
Hope this helps!
You are using a DataFrame, so when you use the map method you are processing a Row in your lambda.
So line is a Row.
Row.toString returns a string representing the Row, in your case two StructFields typed as String.
If you want to use map and process your Row, you have to get the values inside the fields manually, with getString or getAs[String].
Usually when you use DataFrames, you work with column logic as in SQL, using select, where, etc., or directly the SQL syntax.
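As a small sketch of that Row-based approach, assuming a cleaned two-column DataFrame like the data built in the previous answer and the encoders from spark.implicits._:
import spark.implicits._

// pull each field out explicitly instead of relying on Row.toString
val sizes = data.map { row =>
  (row.getString(0).length, row.getString(1).length)
}
sizes.show(false)
// +---+---+
// |_1 |_2 |
// +---+---+
// |10 |7  |
// |10 |6  |
// +---+---+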

Converting Column of Dataframe to Seq[Columns] Scala

I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside agg should be a Seq[Column].
I then have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type, and I want to use the values of each row as input to agg, but I am not able to get at those values.
Any idea of how to go about it?
If you can change the strings in the sequences column to be SQL commands, then it is possible to solve this. Spark provides a function expr that takes a SQL string and converts it into a Column. Example dataframe with working commands:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to Seq[Column]s do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
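For reference, a small self-contained sketch of the whole flow; the key column and the aggregation strings here are made up for illustration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// data to aggregate (hypothetical)
val df = Seq(("x", 2, 10), ("x", 3, 20), ("y", 2, 30)).toDF("key", "A", "B")

// aggregation expressions stored as SQL strings in a dataframe
val df2 = Seq("sum(A) as sum_A", "count(B) as count_B").toDF("sequences")

// convert the strings into Columns
val seqs = df2.as[String].collect().map(expr(_))

// agg needs a head Column plus the rest as varargs
df.groupBy("key").agg(seqs.head, seqs.tail: _*).show()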

Spark dataframe explode column

Every row in the dataframe contains a CSV-formatted string (line) plus another simple string (category). What I'm trying to get at the end is a dataframe composed of the fields extracted from the line string together with the category.
So I proceeded as follows to explode the line string:
val df = stream.toDF("line","category")
.map(x => x.getString(0))......
In the end I manage to get a new dataframe composed of the line fields, but I can't carry the category over to the new dataframe.
I can't join the new dataframe with the initial one, since the common field id was not a separate column at first.
Sample of input :
line | category
"'1';'daniel';'dan#gmail.com'" | "premium"
Sample of output:
id | name | email | category
1 | "daniel"| "dan#gmail.com"| "premium"
Any suggestions, thanks in advance.
If the structure of the strings in the line column is fixed as mentioned in the question, then the following simple solution should work: the split built-in function splits the string into an array, and then the elements of the array are selected and aliased to get the final dataframe.
import org.apache.spark.sql.functions._
df.withColumn("line", split(col("line"), ";"))
.select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
.show(false)
which should give you
+---+--------+---------------+--------+
|id |name |email |category|
+---+--------+---------------+--------+
|'1'|'daniel'|'dan#gmail.com'|premium |
+---+--------+---------------+--------+
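If you also need to drop the surrounding single quotes to match the expected output, one possible approach (a sketch, not guaranteed against every input) is to clean the column with regexp_replace before splitting:
import org.apache.spark.sql.functions.{col, regexp_replace, split}

df.withColumn("line", split(regexp_replace(col("line"), "'", ""), ";"))
  .select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
  .show(false)
// +---+------+-------------+--------+
// |id |name  |email        |category|
// +---+------+-------------+--------+
// |1  |daniel|dan#gmail.com|premium |
// +---+------+-------------+--------+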
I hope the answer is helpful

Extract a column value from a spark dataframe and add it to another dataframe

I have a Spark dataframe called "df_array"; it will always return a single array as output, like below.
arr_value
[M,J,K]
I want to extract its value and add it to another dataframe.
Below is the code I was executing:
val new_df = old_df.withColumn("new_array_value", df_array.col("UNCP_ORIG_BPR"))
but my code always fails saying "org.apache.spark.sql.AnalysisException: resolved attribute(s)"
Can someone help me with this?
The operation needed here is a join.
You'll need to have a common column in both dataframes, which will be used as the "key".
After the join you can select which columns to be included in the new dataframe.
More details can be found here:
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
join(other, on=None, how=None)
Joins with another DataFrame, using the given join expression.
Parameters:
other – Right side of the join
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
how – str, default ‘inner’. One of inner, outer, left_outer, right_outer, leftsemi.
The following performs a full outer join between df1 and df2.
>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)]
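In Scala the same idea would look roughly like this, assuming (hypothetically) that old_df and df_array share an id key column to join on:
// hypothetical: old_df and df_array both carry a shared "id" column
val new_df = old_df
  .join(df_array, Seq("id"), "left")
  .withColumnRenamed("UNCP_ORIG_BPR", "new_array_value")

new_df.show()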
If you know that df_array has only one record, you can collect it to the driver using first() and then use it as an array of literal values to create a column in any DataFrame:
import scala.collection.mutable
import org.apache.spark.sql.functions._
// first - collect that single array to the driver (assuming an array of strings):
val arrValue = df_array.first().getAs[mutable.WrappedArray[String]](0)
// now use lit() function to create a "constant" value column:
val new_df = old_df.withColumn("new_array_value", array(arrValue.map(lit): _*))
new_df.show()
// +--------+--------+---------------+
// |old_col1|old_col2|new_array_value|
// +--------+--------+---------------+
// | 1| a| [M, J, K]|
// | 2| b| [M, J, K]|
// +--------+--------+---------------+