I have two data frames as below:
df1:
language (column name)
tamil
telugu
hindi
df2:
id keywords
101 ["tamildiary", "tamilkeyboard", "telugumovie"]
102 ["tamilmovie"]
103 ["hindirhymes", "hindimovie"]
Using these two data frames, I need to create something that looks like the below. (Basically, I need to find how many of the language-specific keywords appear against each id and store the counts in a map.)
id keywords
101 {"tamil":2, "telugu":1}
102 {"tamil":1}
103 {"hindi":2}
Could anyone please help me with this? Thanks a lot!
Consider joining the two DataFrames using regexp_replace to match keywords that begin with a language prefix, followed by a couple of groupBys that aggregate an array of (language, count) structs, which then gets converted to a map via map_from_entries:
val df1 = Seq("tamil", "telugu", "hindi").toDF("language")
val df2 = Seq(
(101, Seq("tamildiary", "tamilkeyboard", "telugumovie")),
(102, Seq("tamilmovie")),
(103, Seq("hindirhymes", "hindimovie"))
).toDF("id", "keywords")
val pattern = concat(lit("^"), df1("language"), lit(".*"))
df2.
withColumn("keyword", explode($"keywords")).as("df2").
join(df1.as("df1"), regexp_replace($"df2.keyword", pattern, lit("")) =!= $"df2.keyword").
groupBy("id", "language").agg(size(collect_list($"language")).as("count")).
groupBy("id").agg(map_from_entries(collect_list(struct($"language", $"count"))).as("keywords")).
show(false)
// +---+-------------------------+
// |id |keywords |
// +---+-------------------------+
// |101|[tamil -> 2, telugu -> 1]|
// |103|[hindi -> 2] |
// |102|[tamil -> 1] |
// +---+-------------------------+
For Spark 2.0 - 2.3, create a UDF to mimic the functionality of map_from_entries, which is available only in Spark 2.4+:
import org.apache.spark.sql.Row
val arrayToMap = udf{ (arr: Seq[Row]) =>
arr.map{ case Row(k: String, v: Int) => (k, v) }.toMap
}
df2.
withColumn("keyword", explode($"keywords")).as("df2").
join(df1.as("df1"), regexp_replace($"df2.keyword", pattern, lit("")) =!= $"df2.keyword").
groupBy("id", "language").agg(size(collect_list($"language")).as("count")).
groupBy("id").agg(arrayToMap(collect_list(struct($"language", $"count"))).as("keywords")).
show(false)
As a side note, below are a couple of alternatives for the regex-matching condition via a SQL expression:
Using regexp_extract:
expr("regexp_extract(df2.keyword, concat('^(', df1.language, ').*'), 1) = df1.language")
Using rlike:
expr("df2.keyword rlike concat('^', df1.language, '.*')")
Schema of input dataframe
- employeeKey (int)
- employeeTypeId (string)
- loginDate (string)
- employeeDetailsJson (string)
{"Grade":"100","ValidTill":"2021-12-01","Supervisor":"Alex","Vendor":"technicia","HourlyRate":29}
For permanent employees, some attributes are available and some are not; the same goes for contracting employees.
So I'm looking for an efficient way to build the dataframe based on only the selected columns, as opposed to transforming all columns and then selecting the ones I need.
Also, please advise whether this is the best way to extract values from a JSON string based on a key. As the attributes in the string are dynamic, I cannot build a StructType schema for it, so I'm using good old get_json_object.
(Spark 2.4.5 now; will use Spark 3 in the future)
val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","cont.Vendor-Name","cont.Hourly-Rate" )
//val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","perm.Level","perm-Validity","perm.Supervisor" )
val resultDF = inputDF.get
.withColumn("Employee-Key", col("employeeKey"))
.withColumn("Employee-Type", when(col("employeeTypeId") === 1, "Permanent")
.when(col("employeeTypeId") === 2, "Contractor")
.otherwise("unknown"))
.withColumn("Login-Date", to_utc_timestamp(to_timestamp(col("loginDate"), "yyyy-MM-dd'T'HH:mm:ss"), ""America/Chicago""))
.withColumn("perm.Level", get_json_object(col("employeeDetailsJson"), "$.Grade"))
.withColumn("perm.Validity", get_json_object(col("employeeDetailsJson"), "$.ValidTill"))
.withColumn("perm.SuperVisor", get_json_object(col("employeeDetailsJson"), "$.Supervisor"))
.withColumn("cont.Vendor-Name", get_json_object(col("employeeDetailsJson"), "$.Vendor"))
.withColumn("cont.Hourly-Rate", get_json_object(col("employeeDetailsJson"), "$.HourlyRate"))
.select(dfSelectColumns.head, dfSelectColumns.tail: _*)
I see that you have two kinds of employees, Permanent and Contractor, so you can define two schemas.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val schemaBase = new StructType().add("Employee-Key", IntegerType).add("Employee-Type", StringType).add("Login-Date", DateType)
val schemaPerm = schemaBase.add("Level", IntegerType).add("Validity", StringType)// Permanent attributes
val schemaCont = schemaBase.add("Vendor", StringType).add("HourlyRate", DoubleType) // Contractor attributes
Then you can use the two schemas to load the data into DataFrames.
For Permanent Employee:
val jsonPermDf = Seq( // Construct sample dataframe
(2, """{"Employee-Key":2, "Employee-Type":"Permanent", "Login-Date":"2021-11-01", "Level":3, "Validity":"ok"}""")
, (3, """{"Employee-Key":3, "Employee-Type":"Permanent", "Login-Date":"2020-10-01", "Level":2, "Validity":"ok-yes"}""")
).toDF("key", "raw_json")
val permDf = jsonPermDf.withColumn("data", from_json(col("raw_json"),schemaPerm)).select($"data.*")
permDf.show()
For Contractor:
val jsonContDf = Seq( // Construct sample dataframe
(1, """{"Employee-Key":1, "Employee-Type":"Contractor", "Login-Date":"2021-12-01", "Vendor":"technicia", "HourlyRate":29}""")
, (4, """{"Employee-Key":4, "Employee-Type":"Contractor", "Login-Date":"2019-09-01", "Vendor":"Minis", "HourlyRate":35}""")
).toDF("key", "raw_json")
val contDf = jsonContDf.withColumn("data", from_json(col("raw_json"),schemaCont)).select($"data.*")
contDf.show()
This is the result dataframe for Permanent:
+------------+-------------+----------+-----+--------+
|Employee-Key|Employee-Type|Login-Date|Level|Validity|
+------------+-------------+----------+-----+--------+
| 2| Permanent|2021-11-01| 3| ok|
| 3| Permanent|2020-10-01| 2| ok-yes|
+------------+-------------+----------+-----+--------+
This is the result dataframe for Contractor:
+------------+-------------+----------+---------+----------+
|Employee-Key|Employee-Type|Login-Date| Vendor|HourlyRate|
+------------+-------------+----------+---------+----------+
| 1| Contractor|2021-12-01|technicia| 29.0|
| 4| Contractor|2019-09-01| Minis| 35.0|
+------------+-------------+----------+---------+----------+
If the schema of the JSON in employeeDetailsJson is unstable, you can still parse it into a map<string,string> column using the from_json function with the schema "map<string,string>". Then you can explode the map column and pivot to get the keys as columns.
Example:
import org.apache.spark.sql.functions._

val df1 = df.withColumn(
  "employeeDetails",
  from_json(col("employeeDetailsJson"), "map<string,string>")
).select(
  col("employeeKey"),
  col("employeeTypeId"),
  col("loginDate"),
  explode(col("employeeDetails"))  // explode takes a Column; yields `key` and `value` columns
).groupBy("employeeKey", "employeeTypeId", "loginDate")
 .pivot("key")
 .agg(first("value"))
df1.show()
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|employeeKey|employeeTypeId|loginDate |Grade|HourlyRate|Supervisor|ValidTill |Vendor |
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|1 |1 |2021-02-05'T'21:28:06|100 |29 |Alex |2021-12-01|technicia|
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
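If you only need a handful of specific keys rather than all of them, you can also skip the pivot and index the parsed map directly. A minimal sketch, assuming the same employeeDetailsJson column; the output column names are just illustrative:
// Sketch: pull selected keys straight out of the parsed map<string,string> column
val dfSelected = df.withColumn(
    "employeeDetails",
    from_json(col("employeeDetailsJson"), "map<string,string>")
  ).select(
    col("employeeKey"),
    col("employeeDetails")("Vendor").as("Vendor-Name"),      // map lookup by key
    col("employeeDetails")("HourlyRate").as("Hourly-Rate")   // missing keys simply yield null
  )
dfSelected.show()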
Below is a simple syntax for searching for a string in a particular column using SQL LIKE functionality.
val dfx = df.filter($"name".like(s"%${productName}%"))
The question is: how do I grab each and every column NAME that contains the particular string in its VALUES and generate a new column with a list of those "column names" for every row?
So far this is the approach I took, but I'm stuck as I can't use the Spark SQL "like" function inside a UDF.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import spark.implicits._
val df1 = Seq(
(0, "mango", "man", "dit"),
(1, "i-man", "man2", "mane"),
(2, "iman", "mango", "ho"),
(3, "dim", "kim", "sim")
).toDF("id", "col1", "col2", "col3")
val df2 = df1.columns.foldLeft(df1) {
(acc: DataFrame, colName: String) =>
acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}
val df3 = df2.withColumn("merged_cols", split(concat_ws("X", df2.columns.map(c=> col(c)):_*), "X"))
Here is a sample output. Note that here there are only 3 columns, but in the real job I'll be reading multiple tables which can contain a dynamic number of columns.
+---+------+------+------+------------------+
|id | col1 | col2 | col3 | merged_cols      |
+---+------+------+------+------------------+
| 0 | mango| man  | dit  | col1, col2       |
| 1 | i-man| man2 | mane | col1, col2, col3 |
| 2 | iman | mango| ho   | col1, col2       |
| 3 | dim  | kim  | sim  |                  |
+---+------+------+------+------------------+
This can be done using a foldLeft over the columns together with when and otherwise:
val e = "%man%"
val df2 = df1.columns.foldLeft(df1.withColumn("merged_cols", lit(""))){(df, c) =>
df.withColumn("merged_cols", when(col(c).like(e), concat($"merged_cols", lit(s"$c,"))).otherwise($"merged_cols"))}
.withColumn("merged_cols", expr("substring(merged_cols, 1, length(merged_cols)-1)"))
All columns that satisfy the condition e will be appended to the string in the merged_cols column. Note that the column must exist for the first append to work, so it is added (containing an empty string) to the dataframe before it is sent into the foldLeft.
The last line in the code simply removes the extra , that is added at the end. If you want the result as an array instead, simply adding .withColumn("merged_cols", split($"merged_cols", ",")) would work.
An alternative approach is to instead use a UDF. This could be preferred when dealing with many columns, since foldLeft will create multiple dataframe copies. Here a regex is used (not the SQL like, since that operates on whole columns).
val e = ".*man.*"
val concat_cols = udf((vals: Seq[String], names: Seq[String]) => {
vals.zip(names).filter{case (v, n) => v.matches(e)}.map(_._2)
})
val df2 = df.withColumn("merged_cols", concat_cols(array(df.columns.map(col(_)): _*), typedLit(df.columns.toSeq)))
Note: typedLit can be used in Spark versions 2.2+, when using older versions use array(df.columns.map(lit(_)): _*) instead.
I have a dataframe that looks like this
+--------------------
| unparsed_data|
+--------------------
|02020sometext5002...
|02020sometext6682...
I need to get it split it up into something like this
+--------------------
|fips | Name | Id ...
+--------------------
|02020 | sometext | 5002...
|02020 | sometext | 6682...
I have a list like this
val fields = List(
("fips", 5),
("Name", 8),
("Id", 27)
....more fields
)
I need the split to take the first 5 characters in unparsed_data and map them to fips, take the next 8 characters and map them to Name, then the next 27 characters and map them to Id, and so on. I need the split to use/reference the field lengths supplied in the list to do the splitting/slicing, as there are a lot of fields and the unparsed_data field is very long.
My Scala is still pretty weak, and I assume the answer would look something like this:
df.withColumn("temp_field", split("unparsed_data", //some regex created from the list values?)).map(i => //some mapping to the field names in the list)
any suggestions/ideas much appreciated
You can use foldLeft to traverse your fields list and iteratively create columns from the original DataFrame using
substring. This works regardless of the size of the fields list:
import org.apache.spark.sql.functions._
val df = Seq(
("02020sometext5002"),
("03030othrtext6003"),
("04040moretext7004")
).toDF("unparsed_data")
val fields = List(
("fips", 5),
("name", 8),
("id", 4)
)
val resultDF = fields.foldLeft( (df, 1) ){ (acc, field) =>
val newDF = acc._1.withColumn(
field._1, substring($"unparsed_data", acc._2, field._2)
)
(newDF, acc._2 + field._2)
}._1.
drop("unparsed_data")
resultDF.show
// +-----+--------+----+
// | fips| name| id|
// +-----+--------+----+
// |02020|sometext|5002|
// |03030|othrtext|6003|
// |04040|moretext|7004|
// +-----+--------+----+
Note that a Tuple2[DataFrame, Int] is used as the accumulator for foldLeft to carry both the iteratively transformed DataFrame and next offset position for substring.
This can get you going. Depending on your needs it can get more and more complicated, with variable lengths etc., which you do not state. But I think you can use a column list.
import org.apache.spark.sql.functions._
val df = Seq(
("12334sometext999")
).toDF("X")
val df2 = df.selectExpr("substring(X, 0, 5)", "substring(X, 6,8)", "substring(X, 14,3)")
df2.show
Gives in this case (you can rename cols again):
+------------------+------------------+-------------------+
|substring(X, 0, 5)|substring(X, 6, 8)|substring(X, 14, 3)|
+------------------+------------------+-------------------+
| 12334| sometext| 999|
+------------------+------------------+-------------------+
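If the field names and lengths live in a list (as in the question), the selectExpr can also be generated from that list rather than hard-coded. A rough sketch, assuming the same df with column X and a fields list of (name, length) pairs like the one in the foldLeft answer above:
// Sketch: derive the substring expressions from the (name, length) list
val exprs = fields.foldLeft((List.empty[String], 1)) { case ((acc, pos), (name, len)) =>
  (acc :+ s"substring(X, $pos, $len) as $name", pos + len)   // carry the running offset
}._1
val df3 = df.selectExpr(exprs: _*)
df3.show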
I have a Spark DataFrame with a column of Vector values. The vector values are all n-dimensional, i.e. all of the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each of which corresponds to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way, which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join it with the old one, but that requires a Spark context to use createDataFrame. I only want to transform the existing data frame. I also know .withColumn("fi", value), but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then use getItem to extract the individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
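As a side note, if you are on Spark 3.0+, the built-in vector_to_array function can replace the hand-rolled UDF above. A sketch reusing the same df and sqlExpr:
// Sketch for Spark 3.0+: convert the Vector column to array<double> without a UDF
import org.apache.spark.ml.functions.vector_to_array
val dfArr3 = df.withColumn("featuresArr", vector_to_array($"features"))
dfArr3.select(sqlExpr : _*).show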
I am a newbie in Apache Spark and recently started coding in Scala.
I have a RDD with 4 columns that looks like this:
(Columns 1 - name, 2- title, 3- views, 4 - size)
aa File:Sleeping_lion.jpg 1 8030
aa Main_Page 1 78261
aa Special:Statistics 1 20493
aa.b User:5.34.97.97 1 4749
aa.b User:80.63.79.2 1 4751
af Blowback 2 16896
af Bluff 2 21442
en Huntingtown,_Maryland 1 0
I want to group based on Column Name and get the sum of Column views.
It should be like this:
aa 3
aa.b 2
af 2
en 1
I have tried to use groupByKey and reduceByKey but I am stuck and unable to proceed further.
This should work: read the text file, split each line by the separator, map to key-value pairs with the appropriate fields, and use countByKey:
sc.textFile("path to the text file")
.map(x => x.split(" ",-1))
.map(x => (x(0),x(3)))
.countByKey
To complete my answer, you can approach the problem using the dataframe API (if this is possible for you, depending on your Spark version), for example:
val result = df.groupBy("column to Group on").agg(count("column to count on"))
another possibility is to use the sql approach:
val df = spark.read.csv("csv path")
df.createOrReplaceTempView("temp_table")
val result = spark.sql("select <col to group on>, count(<col to count on>) from temp_table group by <col to group on>")
I assume that you already have your RDD populated.
//For simplicity, I build the sample data this way
val data = Seq(("aa", "File:Sleeping_lion.jpg", 1, 8030),
("aa", "Main_Page", 1, 78261),
("aa", "Special:Statistics", 1, 20493),
("aa.b", "User:5.34.97.97", 1, 4749),
("aa.b", "User:80.63.79.2", 1, 4751),
("af", "Blowback", 2, 16896),
("af", "Bluff", 2, 21442),
("en", "Huntingtown,_Maryland", 1, 0))
Dataframe approach
val sql = new SQLContext(sc)
import sql.implicits._
import org.apache.spark.sql.functions._
val df = data.toDF("name", "title", "views", "size")
df.groupBy($"name").agg(count($"name") as "") show
**Result**
+----+-----+
|name|count|
+----+-----+
| aa| 3|
| af| 2|
|aa.b| 2|
| en| 1|
+----+-----+
RDD Approach (countByKey(...))
val rdd = sc.parallelize(data)
rdd.keyBy(f => f._1).countByKey().foreach(println(_))
RDD Approach (reduceByKey(...))
rdd.map(f => (f._1, 1)).reduceByKey((accum, curr) => accum + curr).foreach(println(_))
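Since the question mentions the sum of the views column, here is a small variant of the reduceByKey approach that sums views (the third field) instead of counting rows, assuming the same rdd as above:
// Sketch: sum of views per name instead of a row count
rdd.map(f => (f._1, f._3)).reduceByKey(_ + _).foreach(println(_))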
If any of this does not solve your problem, please share where exactly you are stuck.