locate function usage on dataframe without using UDF Spark Scala

I am curious as to why this will not work in Spark Scala on a dataframe:
df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))
It works with a UDF, but not as above. It seems to be a Column vs. String issue, and it feels awkward: how do I convert a column to a string so it can be passed to locate, which needs a String?
My understanding was that df("search_string") would give me a String.
But the error I get is:
command-679436134936072:15: error: type mismatch;
found : org.apache.spark.sql.Column
required: String
df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))

Understanding what's going wrong
I'm not sure which version of Spark you're on, but the locate method has the same function signature on both Spark 3.3.1 (the latest version at the time of writing) and Spark 2.4.5 (the version running in my local Spark shell):
def locate(substr: String, str: Column, pos: Int): Column
So substr can't be a Column; it needs to be a String. In your case, you were using df("search_string"), which actually calls the apply method with the following signature:
def apply(colName: String): Column
So it makes sense that you're having a problem since the locate function needs a String.
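For example, a call with a literal substring type-checks fine (just to illustrate the signature; the "with" literal here is purely illustrative and doesn't pull the value from your search_string column):
import org.apache.spark.sql.functions.{col, locate}

// Compiles, because the first argument is a plain String:
df.withColumn("answer", locate("with", col("hit_songs"), 1))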
Trying to fix your issue
If I understood correctly, you want to locate a substring from one column inside a string in another column, without UDFs. You can use a map on a Dataset to do that. Something like this:
import spark.implicits._

case class MyTest(A: String, B: String)

val df = Seq(
  MyTest("with", "potatoes with meat"),
  MyTest("with", "pasta with cream"),
  MyTest("food", "tasty food"),
  MyTest("notInThere", "don't forget some nice drinks")
).toDF("A", "B").as[MyTest]

val output = df.map {
  case MyTest(a, b) => (a, b, b.indexOf(a))
}

output.show(false)
+----------+-----------------------------+---+
|_1 |_2 |_3 |
+----------+-----------------------------+---+
|with |potatoes with meat |9 |
|with |pasta with cream |6 |
|food |tasty food |6 |
|notInThere|don't forget some nice drinks|-1 |
+----------+-----------------------------+---+
Once you're inside of a map operation of a strongly typed Dataset, you have the Scala language at your disposal.
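As a small optional follow-up, you can give the tuple columns proper names with toDF (the names here are just illustrative):

// Rename the _1/_2/_3 tuple columns of the result; pick whatever names you like.
output.toDF("search_string", "hit_songs", "answer").show(false)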
Hope this helps!

Related

Convert dataframe String column to Array[Int]

I am new to Scala and Spark and I am trying to read a csv file locally (for testing):
val spark = org.apache.spark.sql.SparkSession.builder.master("local").appName("Spark CSV Reader").getOrCreate;
val topics_df = spark.read.format("csv").option("header", "true").load("path-to-file.csv")
topics_df.show(10)
Here's what the file looks like:
+-----+--------------------+--------------------+
|topic| termindices| termweights|
+-----+--------------------+--------------------+
| 15|[21,31,51,108,101...|[0.0987100701,0.0...|
| 16|[42,25,121,132,55...|[0.0405490884,0.0...|
| 7|[1,23,38,7,63,0,1...|[0.1793091892,0.0...|
| 8|[13,40,35,104,153...|[0.0737646511,0.0...|
| 9|[2,10,93,9,158,18...|[0.1639456608,0.1...|
| 0|[28,39,71,46,123,...|[0.0867449145,0.0...|
| 1|[11,34,36,110,112...|[0.0729913664,0.0...|
| 17|[6,4,14,82,157,61...|[0.1583892199,0.1...|
| 18|[9,27,74,103,166,...|[0.0633899386,0.0...|
| 19|[15,81,289,218,34...|[0.1348582482,0.0...|
+-----+--------------------+--------------------+
with
ReadSchema: struct<topic:string,termindices:string,termweights:string>
The termindices column is supposed to be of type Array[Int], but when saved to CSV it is a String (This usually would not be a problem if I pull from databases).
How do I convert the type and eventually cast the DataFrame to a:
case class TopicDFRow(topic: Int, termIndices: Array[Int], termWeights: Array[Double])
I have the function ready to perform the conversion:
termIndices.substring(1, termIndices.length - 1).split(",").map(_.toInt)
I have looked at udf and a few other solutions but I am convinced that there should be a much cleaner and faster way to perform said conversion. Any help is greatly appreciated!
UDFs should be avoided when it's possible to use the more efficient in-built Spark functions. To my knowledge there is no better way than the one proposed; remove the first and last characters of the string, split and convert.
Using the in-built functions, this can be done as follows:
df.withColumn("termindices", split($"termindices".substr(lit(2), length($"termindices")-2), ",").cast("array<int>"))
.withColumn("termweights", split($"termweights".substr(lit(2), length($"termweights")-2), ",").cast("array<double>"))
.as[TopicDFRow]
substr is 1-indexed, so to remove the first character we start from 2. The second argument is the length to take (not the end point), hence the -2.
The last command will cast the dataframe to a dataset of type TopicDFRow.
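One possible gotcha, assuming the schema shown in the question: topic is also read as a string, while the case class expects an Int, so it may need an explicit cast before .as[TopicDFRow]. A minimal sketch:

// Sketch only: the same conversion as above, plus a cast of topic to int.
import org.apache.spark.sql.functions.{length, lit, split}
import spark.implicits._

val typed = df
  .withColumn("topic", $"topic".cast("int"))
  .withColumn("termindices", split($"termindices".substr(lit(2), length($"termindices") - 2), ",").cast("array<int>"))
  .withColumn("termweights", split($"termweights".substr(lit(2), length($"termweights") - 2), ",").cast("array<double>"))
  .as[TopicDFRow]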

check data size spark dataframes

I have the following question. I am working with the following csv file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a spark dataframe as follows:
My aim is to check the length and type of each field in the dataframe, following the set of rules below:
col type
job char10
marital char7
I started implementing the check of the length of each field, but I am getting a compilation error:
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")

data.map(line => {
  val fields = line.toString.split(";")
  fields(0).size
  fields(1).size
})
The expected output should be:
List(10,10)
As for the check of the types, I don't have any idea how to implement it as we are using dataframes. Any idea about a function for verifying the data format?
Thanks a lot in advance for your replies.
I see you are trying to use a DataFrame, but if there are multiple double quotes you can read the file as a textFile, remove them, and convert it to a DataFrame as below:
import org.apache.spark.sql.functions._
import spark.implicits._

val raw = spark.read.textFile("path to file ")
  .map(_.replaceAll("\"", ""))

val header = raw.first

val data = raw.filter(row => row != header)
  .map { r => val x = r.split(";"); (x(0), x(1)) }
  .toDF(header.split(";"): _*)
With data.show(false) you get:
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the sizes you can use withColumn and the length function, and play around as you need.
data.withColumn("jobSize", length($"job"))
.withColumn("martialSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job |marital|jobSize|martialSize|
+----------+-------+-------+-----------+
|management|married|10 |7 |
|technician|single |10 |6 |
+----------+-------+-------+-----------+
All the column types are String.
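If you also want to enforce the char10/char7 rules from the question, a minimal sketch along those lines (the limits come from the question's table; adjust as needed):

// Sketch: flag rows whose fields exceed the expected maximum lengths.
import org.apache.spark.sql.functions.length
import spark.implicits._

data
  .withColumn("jobOk", length($"job") <= 10)
  .withColumn("maritalOk", length($"marital") <= 7)
  .show(false)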
Hope this helps!
You are using a dataframe, so when you use the map method you are processing a Row in your lambda.
So line is a Row.
Row.toString returns a string representation of the Row, in your case of two struct fields typed as String.
If you want to use map and process your Row, you have to get the values out of the fields manually, e.g. with getString(i) or getAs[String](i).
Usually when you use Dataframes, you have to work in column logic as in SQL, using select, where, etc., or directly the SQL syntax.
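For example, a minimal sketch of that Row-based approach, assuming the quotes have already been cleaned up so the columns are plain job and marital:

// Sketch: extract the field values from each Row explicitly, then compute the lengths.
import spark.implicits._

val sizes = data.map { row =>
  (row.getAs[String]("job").length, row.getAs[String]("marital").length)
}
sizes.show(false)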

Transforming a Spark Dataframe Column into a Dataframe with just one line (ArrayType)

I have a dataframe that contains a column with complex objects in it:
+--------+
|col1 |
+--------+
|object1 |
|object2 |
|object3 |
+--------+
The schema of this object is pretty complex, something that looks like:
root:struct
field1:string
field2:decimal(38,18)
object1:struct
field3:string
object2:struct
field4:string
field5:decimal(38,18)
What is the best way to group everything and transform it into an array?
eg:
+-----------------------------+
|col1 |
+-----------------------------+
| [object1, object2, object3] |
+-----------------------------+
I tried to generate an array from a column then create a dataframe from it:
final case class A(b: Array[Any])
val c = df.select("col1").collect().map(_(0)).toArray
df.sparkSession.createDataset(Seq(A(b = c)))
However, Spark doesn't like my Array[Any] trick:
java.lang.ClassNotFoundException: scala.Any
Any ideas?
What is the best way to group everything and transform it into an array?
There is not even a good way to do it. Please remember that Spark cannot distribute individual rows. The result will be:
Processed sequentially.
Possibly too large to be stored in memory.
Other than that, you can just use collect_list:
import org.apache.spark.sql.functions.{col, collect_list}
df.select(collect_list(col("col1")))
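If you then need the array on the driver, a small sketch (the alias is just illustrative):

// Sketch: the aggregated array lands in a single row with a single column.
val grouped = df.select(collect_list(col("col1")).as("col1"))
grouped.show(false)
// Pulling it to the driver brings everything onto one machine, as noted above:
val rows = grouped.first.getSeq[org.apache.spark.sql.Row](0)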
Spark uses encoders for datatypes; this is the reason Any doesn't work.
If the schema of the complex object is fixed, you can define a case class with that schema and do the following:
case class C(... object1: A, object2: B ...)
val df = ???
val mappedDF = df.as[C] // this will map each complex object to case class
Next, you can use a UDF to change each C object to a Seq(...) at row level. It'll look something like this:
import org.apache.spark.sql.expressions.{UserDefinedFunction => UDF}
import org.apache.spark.sql.functions.{col, udf}

def convert: UDF =
  udf((complexObj: C) => Seq(complexObj.object1, complexObj.object2, complexObj.object3))
To use this UDF:
mappedDF.withColumn("resultColumn", convert(col("col1")))
Note: Since not much info was provided about the schema, I've used generics like A and B. You will have to define all of these.

Meaning of the Symbol of single apostrophe(') in Scala using Anonymous function under withColumn function? [duplicate]

This question already has an answer here:
What does a single apostrophe mean in Scala? [duplicate]
(1 answer)
Closed 5 years ago.
My question is:
In line number 5: what is the single apostrophe (') doing? I cannot understand very clearly how the withColumn function is working here. Please also elaborate on how it ends up displaying the columns in this order: |id |text |upper |.
Code:
1. val dataset = Seq((0, "hello"),(1, "world")).toDF("id","text")
2. val upper: String => String = _.toUpperCase
3. import org.apache.spark.sql.functions.udf
4. val upperUDF = udf(upper)
5. dataset.withColumn("upper", upperUDF('text)).show
Output:
+---------+---------+---------+
|id |text |upper |
+---------+---------+---------+
| 0 | hello |HELLO |
| 1 | world |WORLD |
+---------+---------+---------+
The ' symbol in Scala is syntactic sugar for creating an instance of the Symbol class. From the documentation:
For instance, the Scala term 'mysym will invoke the constructor of the Symbol class in the following way: Symbol("mysym").
So when you write 'text, the compiler expands it into Symbol("text").
There is additional magic here, since the upperUDF function requires a Column, not a Symbol. But there is an implicit in scope, defined in Spark's SQLImplicits and called symbolToColumn, which converts a symbol to a column:
implicit def symbolToColumn(s: Symbol): ColumnName = new ColumnName(s.name)
If we extract away all the implicits and syntax sugar, the equivalent would be:
dataset.withColumn("upper", upperUDF(new Column(new Symbol("text").name))).show

groupByKey in spark

New to spark here and I'm trying to read a pipe delimited file in spark. My file looks like this:
user1|acct01|A|Fairfax|VA
user1|acct02|B|Gettysburg|PA
user1|acct03|C|York|PA
user2|acct21|A|Reston|VA
user2|acct42|C|Fairfax|VA
user3|acct66|A|Reston|VA
and I do the following in scala:
scala> case class Accounts (usr: String, acct: String, prodCd: String, city: String, state: String)
defined class Accounts
scala> val accts = sc.textFile("accts.csv").map(_.split("|")).map(
| a => (a(0), Accounts(a(0), a(1), a(2), a(3), a(4)))
| )
I then try to group the key value pair by the key, and this is not sure if I'm doing this right...is this how I do it?
scala> accts.groupByKey(2)
res0: org.apache.spark.rdd.RDD[(String, Iterable[Accounts])] = ShuffledRDD[4] at groupByKey at <console>:26
I thought the (2) would give me the first two results back, but I don't seem to get anything back at the console...
If I run a distinct... I get this too:
scala> accts.distinct(1).collect(1)
<console>:26: error: type mismatch;
found : Int(1)
required: PartialFunction[(String, Accounts),?]
accts.distinct(1).collect(1)
EDIT:
Essentially I'm trying to get to a key-value pair nested mapping. For example, user1 would look like this:
user1 | {'acct01': {prdCd: 'A', city: 'Fairfax', state: 'VA'}, 'acct02': {prdCd: 'B', city: 'Gettysburg', state: 'PA'}, 'acct03': {prdCd: 'C', city: 'York', state: 'PA'}}
trying to learn this step by step so thought I'd break it down into chunks to understand...
I think you might have better luck if you put your data into a DataFrame, since you've already gone through the process of defining a schema. First off, you need to modify the split call to use single quotes (see this question). Also, you can get rid of the a(0) at the beginning. Then converting to a DataFrame is trivial. (Note that DataFrames are available in Spark 1.3+.)
val accts = sc.textFile("/tmp/accts.csv").map(_.split('|')).map(a => Accounts(a(0), a(1), a(2), a(3), a(4)))
val df = accts.toDF()
Now df.show produces:
+-----+------+------+----------+-----+
| usr| acct|prodCd| city|state|
+-----+------+------+----------+-----+
|user1|acct01| A| Fairfax| VA|
|user1|acct02| B|Gettysburg| PA|
|user1|acct03| C| York| PA|
|user2|acct21| A| Reston| VA|
|user2|acct42| C| Fairfax| VA|
|user3|acct66| A| Reston| VA|
+-----+------+------+----------+-----+
It should be easier for you to work with the data. For example, to get a list of the unique users:
df.select("usr").distinct.collect()
produces
res42: Array[org.apache.spark.sql.Row] = Array([user1], [user2], [user3])
For more details, check out the docs.
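And to get something close to the nested per-user structure from your EDIT, one possible sketch on a recent Spark version (using only built-in functions; the column names follow the case class):

// Sketch: one row per user, holding an array of (acct, prodCd, city, state) structs.
import org.apache.spark.sql.functions.{collect_list, struct}

val perUser = df.groupBy("usr")
  .agg(collect_list(struct("acct", "prodCd", "city", "state")).as("accounts"))
perUser.show(false)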
3 observations that may help you understand the problem:
1) groupByKey(2) does not return the first 2 results; the parameter 2 is used as the number of partitions for the resulting RDD. See the docs.
2) collect does not take an Int parameter; the overload that takes an argument expects a PartialFunction, which is why you see that error. See the docs.
3) split takes 2 types of parameters, Char or String. The String version uses a regex, so "|" needs escaping if intended as a literal.
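A quick illustration of observation 3 (plain Scala, no Spark needed):

// The String overload is a regex; "|" matches the empty string at every position,
// so the line gets split into single characters:
"user1|acct01|A".split("|")   // Array(u, s, e, r, 1, |, a, c, c, t, 0, 1, |, A)
// The Char overload and an escaped regex both split on the literal pipe:
"user1|acct01|A".split('|')   // Array(user1, acct01, A)
"user1|acct01|A".split("\\|") // Array(user1, acct01, A)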