Convert dataframe String column to Array[Int] - scala

I am new to Scala and Spark and I am trying to read a csv file locally (for testing):
val spark = org.apache.spark.sql.SparkSession.builder.master("local").appName("Spark CSV Reader").getOrCreate;
val topics_df = spark.read.format("csv").option("header", "true").load("path-to-file.csv")
topics_df.show(10)
Here's what the file looks like:
+-----+--------------------+--------------------+
|topic| termindices| termweights|
+-----+--------------------+--------------------+
| 15|[21,31,51,108,101...|[0.0987100701,0.0...|
| 16|[42,25,121,132,55...|[0.0405490884,0.0...|
| 7|[1,23,38,7,63,0,1...|[0.1793091892,0.0...|
| 8|[13,40,35,104,153...|[0.0737646511,0.0...|
| 9|[2,10,93,9,158,18...|[0.1639456608,0.1...|
| 0|[28,39,71,46,123,...|[0.0867449145,0.0...|
| 1|[11,34,36,110,112...|[0.0729913664,0.0...|
| 17|[6,4,14,82,157,61...|[0.1583892199,0.1...|
| 18|[9,27,74,103,166,...|[0.0633899386,0.0...|
| 19|[15,81,289,218,34...|[0.1348582482,0.0...|
+-----+--------------------+--------------------+
with
ReadSchema: struct<topic:string,termindices:string,termweights:string>
The termindices column is supposed to be of type Array[Int], but when saved to CSV it is a String (this usually wouldn't be a problem if I pulled from a database).
How do I convert the type and eventually cast the DataFrame to a:
case class TopicDFRow(topic: Int, termIndices: Array[Int], termWeights: Array[Double])
I have the function ready to perform the conversion:
termIndices.substring(1, termIndices.length - 1).split(",").map(_.toInt)
I have looked at udf and a few other solutions but I am convinced that there should be a much cleaner and faster way to perform said conversion. Any help is greatly appreciated!

UDFs should be avoided when it's possible to use the more efficient built-in Spark functions. To my knowledge there is no better way than the one proposed: remove the first and last characters of the string, split and convert.
Using the built-in functions, this can be done as follows:
import org.apache.spark.sql.functions.{lit, length, split}
import spark.implicits._
df.withColumn("termindices", split($"termindices".substr(lit(2), length($"termindices") - 2), ",").cast("array<int>"))
  .withColumn("termweights", split($"termweights".substr(lit(2), length($"termweights") - 2), ",").cast("array<double>"))
  .as[TopicDFRow]
substr is 1-indexed, so to remove the first character we start from 2. The second argument is the length to take (not the end position), hence the -2.
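The same bracket-stripping arithmetic can be checked with a plain Scala string (a small sketch; note that String.substring is 0-based and end-exclusive, unlike Spark's 1-based substr):

```scala
val s = "[21,31,51]"
// Drop the surrounding brackets: 0-based start 1, end-exclusive end at length - 1
val inner = s.substring(1, s.length - 1)      // "21,31,51"
// Split on commas and parse each element to Int
val ints = inner.split(",").map(_.trim.toInt) // Array(21, 31, 51)
```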
The last command will cast the dataframe to a dataset of type TopicDFRow.

Related

locate function usage on dataframe without using UDF Spark Scala

I am curious as to why this will not work in Spark Scala on a dataframe:
df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))
It works with a UDF, but not as written above. The problem seems to be a Column vs. String mismatch: locate needs a String, and I don't see how to convert a column to a String to pass to it.
My understanding was that df("search_string") would produce a String.
But error gotten is:
command-679436134936072:15: error: type mismatch;
found : org.apache.spark.sql.Column
required: String
df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))
Understanding what's going wrong
I'm not sure which version of Spark you're on, but the locate method has the same signature on both Spark 3.3.1 (the latest version at the time of writing) and Spark 2.4.5 (the version my local Spark shell runs):
def locate(substr: String, str: Column, pos: Int): Column
So substr can't be a Column, it needs to be a String. In your case, you were using df("search_string"). This actually calls the apply method with the following function signature:
def apply(colName: String): Column
So it makes sense that you're having a problem since the locate function needs a String.
Trying to fix your issue
If I correctly understood, you want to be able to locate a substring from one column inside of a string in another column without UDFs. You can use a map on a Dataset to do that. Something like this:
import spark.implicits._
case class MyTest (A:String, B: String)
val df = Seq(
  MyTest("with", "potatoes with meat"),
  MyTest("with", "pasta with cream"),
  MyTest("food", "tasty food"),
  MyTest("notInThere", "don't forget some nice drinks")
).toDF("A", "B").as[MyTest]
val output = df.map {
  case MyTest(a, b) => (a, b, b indexOf a)
}
output.show(false)
+----------+-----------------------------+---+
|_1 |_2 |_3 |
+----------+-----------------------------+---+
|with |potatoes with meat |9 |
|with |pasta with cream |6 |
|food |tasty food |6 |
|notInThere|don't forget some nice drinks|-1 |
+----------+-----------------------------+---+
Once you're inside of a map operation of a strongly typed Dataset, you have the Scala language at your disposal.
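The map above relies on plain String.indexOf; Spark's locate differs only in being 1-based and returning 0 when the substring is absent, which can be mimicked directly (a small sketch with hypothetical data):

```scala
// Pairs of (search_string, hit_songs) — hypothetical sample rows
val rows = Seq(("with", "potatoes with meat"), ("food", "tasty food"), ("xyz", "no match here"))
// locate is 1-based and returns 0 when absent; indexOf is 0-based and returns -1,
// so adding 1 converts one convention to the other
val located = rows.map { case (sub, str) => str.indexOf(sub) + 1 }
// located == Seq(10, 7, 0)
```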
Hope this helps!

How to remove all characters that start with "_" from a spark string column

I'm trying to modify a column from my dataFrame by removing the suffix from all the rows under that column and I need it in Scala.
The values from the column have different lengths and also the suffix is different.
For example, I have the following values:
09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0
0978C74C69E8D559A62F860EA36ADF5E-28_3_1
0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1
0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1
22AEA8C8D403643111B781FE31B047E3-0_1_0_0
I need to remove everything after the "_" so that I can get the following values:
09E9894DB868B70EC3B55AFB49975390-0
0978C74C69E8D559A62F860EA36ADF5E-28
0C12FA1DAFA8BCD95E34EE70E0D71D10-0
0D075AA40CFC244E4B0846FA53681B4D
22AEA8C8D403643111B781FE31B047E3-0
As @werner pointed out in his comment, substring_index provides a simple solution to this. It is not necessary to wrap it in a call to selectExpr.
Whereas @AminMal has provided a working solution using a UDF, if a native Spark function can be used then this is preferable for performance.[1]
import spark.implicits._
val df = List(
"09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0",
"0978C74C69E8D559A62F860EA36ADF5E-28_3_1",
"0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1",
"0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1",
"22AEA8C8D403643111B781FE31B047E3-0_1_0_0"
).toDF("col0")
import org.apache.spark.sql.functions.{col, substring_index}
df
.withColumn("col0", substring_index(col("col0"), "_", 1))
.show(false)
gives:
+-----------------------------------+
|col0 |
+-----------------------------------+
|09E9894DB868B70EC3B55AFB49975390-0 |
|0978C74C69E8D559A62F860EA36ADF5E-28|
|0C12FA1DAFA8BCD95E34EE70E0D71D10-0 |
|0D075AA40CFC244E4B0846FA53681B4D |
|22AEA8C8D403643111B781FE31B047E3-0 |
+-----------------------------------+
[1] Is there a performance penalty when composing spark UDFs
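For reference, substring_index(str, delim, count) keeps everything before the count-th occurrence of the delimiter; for positive counts its behaviour can be sketched in plain Scala as follows (a simplified emulation, not Spark's implementation):

```scala
import java.util.regex.Pattern

// Simplified emulation of substring_index for positive counts only:
// split on the (literal) delimiter, keep the first `count` pieces, re-join
def substringIndex(s: String, delim: String, count: Int): String =
  s.split(Pattern.quote(delim)).take(count).mkString(delim)

substringIndex("09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0", "_", 1)
// "09E9894DB868B70EC3B55AFB49975390-0"
```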

cast big number in human readable format

I'm working with databricks on a notebook.
I have a column with numbers like this 103503119090884718216391506040
They are in string format. I can print them and read them easily.
For debugging purposes I need to be able to read them. However, I also need to be able to apply the .sort() method to them. Casting them to IntegerType() returns null values, and casting them to double makes them unreadable.
How can I convert them in a human readable format but at the same time where .sort() would work? Do I need to create two separate columns?
To make the column sortable, you could convert your column to DecimalType(precision, scale) (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DecimalType.html#pyspark.sql.types.DecimalType). For this data type you can choose the possible value range via its two arguments:
from pyspark.sql import SparkSession, Row, types as T, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(string_column='103503119090884718216391506040'),
    Row(string_column='103503119090884718216391506039'),
    Row(string_column='90'),
])
(
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .sort('decimal_column')
    .show(truncate=False)
)
# Output
+------------------------------+------------------------------+
|string_column |decimal_column |
+------------------------------+------------------------------+
|90 |90 |
|103503119090884718216391506039|103503119090884718216391506039|
|103503119090884718216391506040|103503119090884718216391506040|
+------------------------------+------------------------------+
Concerning "human readability" I'm not sure whether that helps, though.
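The underlying issue can be reproduced outside Spark: lexicographic (string) ordering misplaces numbers of different lengths, while arbitrary-precision numeric ordering (what the DecimalType cast restores) does not. A small sketch in plain Scala, using BigInt as the arbitrary-precision stand-in:

```scala
val xs = Seq("90", "103503119090884718216391506040", "103503119090884718216391506039")
// String sort compares character by character, so "90" sorts after "1035..."
val asStrings = xs.sorted
// Numeric sort via BigInt puts 90 first, as expected
val asNumbers = xs.map(BigInt(_)).sorted
```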

check data size spark dataframes

I have the following question:
Actually I am working with the following csv file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a Spark DataFrame (shown below).
My aim is to check the length and type of each field in the DataFrame against the set of rules below:
col type
job char10
marital char7
I started implementing the check of the length of each field but I am getting a compilation error :
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
  val fields = line.toString.split(";")
  fields(0).size
  fields(1).size
})
The expected output should be:
List(10,10)
As for the check of the types I don't have any idea about how to implement it as we are using dataframes. Any idea about a function verifying the data format ?
Thanks a lot in advance for your replies.
I see you are trying to use a DataFrame, but if there are multiple double quotes, you can read the file as text, remove them and convert to a DataFrame, as below:
import org.apache.spark.sql.functions._
import spark.implicits._
val raw = spark.read.textFile("path to file ")
  .map(_.replaceAll("\"", ""))
val header = raw.first
val data = raw.filter(row => row != header)
  .map { r => val x = r.split(";"); (x(0), x(1)) }
  .toDF(header.split(";"): _*)
With data.show(false) you get:
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the size you can use withColumn and length function and play around as you need.
data.withColumn("jobSize", length($"job"))
  .withColumn("maritalSize", length($"marital"))
  .show(false)
Output:
+----------+-------+-------+-----------+
|job |marital|jobSize|maritalSize|
+----------+-------+-------+-----------+
|management|married|10 |7 |
|technician|single |10 |6 |
+----------+-------+-------+-----------+
All the column types are String.
Hope this helps!
You are using a DataFrame, so when you use the map method you are processing a Row in your lambda:
line is a Row.
Row.toString returns a string representation of the Row, in your case two StructFields typed as String.
If you want to use map and process your Row, you have to get the values inside the fields manually, with getString(index) or getAs[String](index).
Usually when you use DataFrames, you work in column logic as in SQL, using select, where, etc., or the SQL syntax directly.
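The length check itself can be sketched with plain Scala collections standing in for the DataFrame (the rules map and the rows here are hypothetical, mirroring the question's data):

```scala
// Maximum allowed length per column, per the rules in the question
val rules = Map("job" -> 10, "marital" -> 7)
// Rows simulated as (job, marital) tuples
val rows = Seq(("management", "married"), ("technician", "single"))
// For each row, check both fields against their rule
val checks = rows.map { case (job, marital) =>
  job.length <= rules("job") && marital.length <= rules("marital")
}
// checks == Seq(true, true)
```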

Spark dataframe get column value into a string variable

I am trying to extract a column value into a variable so that I can use the value somewhere else in the code. I am trying the following:
val name= test.filter(test("id").equalTo("200")).select("name").col("name")
It returns
name org.apache.spark.sql.Column = name
how to get the value?
The col("name") gives you a column expression. If you want to extract data from column "name" just do the same thing without col("name"):
val names = test.filter(test("id").equalTo("200"))
.select("name")
.collectAsList() // returns a java.util.List[Row]
Then for a row you could get name in String by:
val name = row.getString(0)
val maxDate = spark.sql("select max(export_time) as export_time from tier1_spend.cost_gcp_raw").first()
val rowValue = maxDate.get(0)
By this snippet, you can extract all the values in a column into a string.
Modify the snippet with where clauses to get your desired value.
val df = Seq((5, 2), (10, 1)).toDF("A", "B")
val col_val_df = df.select($"A").collect()
val col_val_str = col_val_df.map(x => x.get(0)).mkString(",")
/*
df: org.apache.spark.sql.DataFrame = [A: int, B: int]
col_val_df: Array[org.apache.spark.sql.Row] = Array([5], [10])
col_val_str: String = 5,10
*/
The value of the entire column is stored in col_val_str:
col_val_str: String = 5,10
Let us assume you need to pick the name from the below table for a particular Id and store that value in a variable.
+-----+-------+
| id | name |
+-----+-------+
| 100 | Alex |
| 200 | Bidan |
| 300 | Cary |
+-----+-------+
SCALA
-----------
Irrelevant data is filtered out first, then the name column is selected and finally stored into the name variable:
var name = df.filter($"id" === "100").select("name").collect().map(_.getString(0)).mkString("")
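The filter-then-extract pattern in the one-liner above can be sketched with plain Scala collections (rows simulated as (id, name) tuples matching the table):

```scala
val rows = Seq((100, "Alex"), (200, "Bidan"), (300, "Cary"))
// Keep the row whose id matches, project the name, and flatten to a single String
val name = rows.filter(_._1 == 100).map(_._2).mkString("")
// name == "Alex"
```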
PYTHON (PYSPARK)
-----------------------------
For simpler usage, I have created a function that returns the value when you pass it the DataFrame and the desired column name (this is a Spark DataFrame, not a Pandas DataFrame). Before passing the DataFrame to this function, filter is applied to exclude other records:
def GetValueFromDataframe(_df, columnName):
    for row in _df.rdd.collect():
        return row[columnName].strip()

name = GetValueFromDataframe(df.filter(df.id == "100"), "name")
There might be a simpler approach using a 3.x version of Python. The code I showed above was tested with version 2.7.
Note:
It is likely you will encounter an out-of-memory error (driver memory) since we use the collect function. Hence it is always recommended to apply transformations (like filter, where, etc.) before you call collect. If you still encounter a driver out-of-memory issue, you could pass --conf spark.driver.maxResultSize=0 as a command line argument to make use of unlimited driver memory.
For anyone interested, below is a way to turn a column into an Array; in the case below we just take the first value.
val names= test.filter(test("id").equalTo("200")).selectExpr("name").rdd.map(x=>x.mkString).collect
val name = names(0)
s is the string of column values.
.collect() converts the rows to an array of Rows; temp is an array of such Rows.
x(n-1) retrieves the n-th column value for row x, which is by default of type Any, so it needs to be cast to String before appending it to the existing string.
var s = ""
// say the n-th column is the target column
val temp = test.collect() // converts the rows to an Array[Row]
temp.foreach { x =>
  s += x(n - 1).asInstanceOf[String]
}
println(s)