How to parse url in spark sql(Scala) - scala

I am using following function to parse url but it throws error,
val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
.withColumn("host",parse_url($"url_col","HOST"))
.withColumn("query",parse_url($"url_col","QUERY"))
.show(false)
Error:
<console>:285: error: not found: value parse_url
.withColumn("host",parse_url($"url_col","HOST"))
^
<console>:286: error: not found: value parse_url
.withColumn("query",parse_url($"url_col","QUERY"))
^
Kindly Guide how to parse url into its different parts.

Answer by #Ramesh is correct, but you also might want some hacky way to use this function without SQL queries :)
Hack is in the fact, that "callUDF" function calls not only UDFs, but any available function.
So you can write:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
b.withColumn("host", callUDF("parse_url", $"url_col", lit("HOST"))).
withColumn("query", callUDF("parse_url", $"url_col", lit("QUERY"))).
show(false)
Edit: after this Pull Request is merged, you can just use parse_url like a normal function. PR made after this question :)

parse_url is available as only sql and not as api . refer to parse_url
so you should be using it as a sql query and not as a function call through api
You should register the dataframe and use query as below
val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
b.createOrReplaceTempView("temp")
spark.sql("SELECT url_col, parse_url(`url_col`, 'HOST') as HOST, parse_url(`url_col`,'QUERY') as QUERY from temp").show(false)
which should give you output as
+--------------------------------------------------------------------------------------------+-----------------+-------+
|url_col |HOST |QUERY |
+--------------------------------------------------------------------------------------------+-----------------+-------+
|http://spark.apache.org/path?query=1 |spark.apache.org |query=1|
|https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative|people.apache.org|null |
+--------------------------------------------------------------------------------------------+-----------------+-------+
I hope the answer is helpful

As mentioned before, when you register a UDF you don't get a Java function, rather you introduce it to Spark, so you must call it in the "Spark-way".
I want to suggest another method I find convenient, especially when there are several columns you want to add, by using selectExpr
val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
val c = b.selectExpr("*", "parse_url(url_col, 'HOST') as host", "parse_url(url_col, 'QUERY') as query")
c.show(false)

I created a library called bebe that exposes the parse_url functionality via the Scala API.
Suppose you have the following DataFrame:
+------------------------------------+---------------+
|some_string |part_to_extract|
+------------------------------------+---------------+
|http://spark.apache.org/path?query=1|HOST |
|http://spark.apache.org/path?query=1|QUERY |
|null |null |
+------------------------------------+---------------+
Calculate the different parts of the URL:
df.withColumn("actual", bebe_parse_url(col("some_string"), col("part_to_extract")))
+------------------------------------+---------------+----------------+
|some_string |part_to_extract|actual |
+------------------------------------+---------------+----------------+
|http://spark.apache.org/path?query=1|HOST |spark.apache.org|
|http://spark.apache.org/path?query=1|QUERY |query=1 |
|null |null |null |
+------------------------------------+---------------+----------------+

Related

locate function usage on dataframe without using UDF Spark Scala

I am curious as to why this will not work in Spark Scala on a dataframe:
df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))
It works with a UDF, but not as per above. Col vs. String aspects. Seems awkward and lacking aspect. I.e. how to convert a column to a string for passing to locate that needs String.
df("search_string") allows a String to be generated is my understanding.
But error gotten is:
command-679436134936072:15: error: type mismatch;
found : org.apache.spark.sql.Column
required: String
df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))
Understanding what's going wrong
I'm not sure which version of Spark you're on, but the locate method has the following function signature on both Spark 3.3.1 (the current latest version) and Spark 2.4.5 (the version running on my local running Spark shell).
This function signature is the following:
def locate(substr: String, str: Column, pos: Int): Column
So substr can't be a Column, it needs to be a String. In your case, you were using df("search_string"). This actually calls the apply method with the following function signature:
def apply(colName: String): Column
So it makes sense that you're having a problem since the locate function needs a String.
Trying to fix your issue
If I correctly understood, you want to be able to locate a substring from one column inside of a string in another column without UDFs. You can use a map on a Dataset to do that. Something like this:
import spark.implicits._
case class MyTest (A:String, B: String)
val df = Seq(
MyTest("with", "potatoes with meat"),
MyTest("with", "pasta with cream"),
MyTest("food", "tasty food"),
MyTest("notInThere", "don't forget some nice drinks")
).toDF("A", "B").as[MyTest]
val output = df.map{
case MyTest(a,b) => (a, b, b indexOf a)
}
output.show(false)
+----------+-----------------------------+---+
|_1 |_2 |_3 |
+----------+-----------------------------+---+
|with |potatoes with meat |9 |
|with |pasta with cream |6 |
|food |tasty food |6 |
|notInThere|don't forget some nice drinks|-1 |
+----------+-----------------------------+---+
Once you're inside of a map operation of a strongly typed Dataset, you have the Scala language at your disposal.
Hope this helps!

How to remove all characters that start with "_" from a spark string column

I'm trying to modify a column from my dataFrame by removing the suffix from all the rows under that column and I need it in Scala.
The values from the column have different lengths and also the suffix is different.
For example, I have the following values:
09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0
0978C74C69E8D559A62F860EA36ADF5E-28_3_1
0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1
0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1
22AEA8C8D403643111B781FE31B047E3-0_1_0_0
I need to remove everything after the "_" so that I can get the following values:
09E9894DB868B70EC3B55AFB49975390-0
0978C74C69E8D559A62F860EA36ADF5E-28
0C12FA1DAFA8BCD95E34EE70E0D71D10-0
0D075AA40CFC244E4B0846FA53681B4D
22AEA8C8D403643111B781FE31B047E3-0
As #werner pointed out in his comment, substring_index provides a simple solution to this. It is not necessary to wrap this in a call to selectExpr.
Whereas #AminMal has provided a working solution using a UDF, if a native Spark function can be used then this is preferable for performance.[1]
val df = List(
"09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0",
"0978C74C69E8D559A62F860EA36ADF5E-28_3_1",
"0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1",
"0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1",
"22AEA8C8D403643111B781FE31B047E3-0_1_0_0"
).toDF("col0")
import org.apache.spark.sql.functions.{col, substring_index}
df
.withColumn("col0", substring_index(col("col0"), "_", 1))
.show(false)
gives:
+-----------------------------------+
|col0 |
+-----------------------------------+
|09E9894DB868B70EC3B55AFB49975390-0 |
|0978C74C69E8D559A62F860EA36ADF5E-28|
|0C12FA1DAFA8BCD95E34EE70E0D71D10-0 |
|0D075AA40CFC244E4B0846FA53681B4D |
|22AEA8C8D403643111B781FE31B047E3-0 |
+-----------------------------------+
[1] Is there a performance penalty when composing spark UDFs

check data size spark dataframes

I have the following question :
Actually I am working with the following csv file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a spark dataframe as follows:
My aim is to check the length and type of each field in the dataframe following the set od rules below :
col type
job char10
marital char7
I started implementing the check of the length of each field but I am getting a compilation error :
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
val fields = line.toString.split(";")
fields(0).size
fields(1).size
})
The expected output should be:
List(10,10)
As for the check of the types I don't have any idea about how to implement it as we are using dataframes. Any idea about a function verifying the data format ?
Thanks a lot in advance for your replies.
ata
I see you are trying to use Dataframe, But if there are multiple double quotes then you can read as a textFile and remove them and convert to Dataframe as below
import org.apache.spark.sql.functions._
import spark.implicits._
val raw = spark.read.textFile("path to file ")
.map(_.replaceAll("\"", ""))
val header = raw.first
val data = raw.filter(row => row != header)
.map { r => val x = r.split(";"); (x(0), x(1)) }
.toDF(header.split(";"): _ *)
You get with data.show(false)
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the size you can use withColumn and length function and play around as you need.
data.withColumn("jobSize", length($"job"))
.withColumn("martialSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job |marital|jobSize|martialSize|
+----------+-------+-------+-----------+
|management|married|10 |7 |
|technician|single |10 |6 |
+----------+-------+-------+-----------+
All the column type are String.
Hope this helps!
You are using a dataframe. So when you use the map method, you are processing Row in your lambda.
so line is a Row.
Row.toString will return a string representing the Row, so in your case 2 structfields typed as String.
If you want to use map and process your Row, you have to get the vlaue inside the fields manually. with getAsString and getAsString.
Usually when you use Dataframes, you have to work in column's logic as in SQL using select, where... or directly the SQL syntax.

How to split column into multiple columns in Spark 2?

I am reading the data from HDFS into DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data as I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
F.regexp_extract($"value", r, 1).as("id"),
F.regexp_extract($"value", r, 2).as("community")
).show()
A bunch of regular expressions should give you required result.
df.select(
regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use split and regex_replace inbuilt functions to get your desired output dataframe as
import org.apache.spark.sql.functions._
df.select(regexp_replace((split(col("value"), ",")(0)), "\\(", "").as("id"), regexp_replace((split(col("value"), ",")(1)), "\\{community:", "").as("value") ).show()
I hope the answer is helpful

How to call an UDF using Scala

How would I express the following code in Scala via the DataFrame API?
sqlContext.read.parquet("/input").registerTempTable("data")
sqlContext.udf.register("median", new Median)
sqlContext.sql(
"""
|SELECT
| param,
| median(value) as median
|FROM data
|GROUP BY param
""".stripMargin).registerTempTable("medians")
I've started via
val data = sqlContext.read.parquet("/input")
sqlContext.udf.register("median", new Median)
data.groupBy("param")
But them I'm not sure how to call the median function.
You can either use callUDF
data.groupBy("param").agg(callUDF("median", $"value"))
or call it directly:
val median = new Median
data.groupBy("param").agg(median($"value"))
// Equivalent to
data.groupBy("param").agg(new Median()($"value"))
Still, I think it would make more sense to use an object not a class.