I have a big dataframe (roughly 1.2 GB) with this structure:
+---------+--------------+------------------------------------------------------------------------------------------------------+
| country | date_data | text |
+---------+--------------+------------------------------------------------------------------------------------------------------+
| "EEUU" | "2016-10-03" | "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 45ee" |
| "EEUU" | "2016-10-03" | "T_D: QQAA\nT_NAME: name_2\nT_IN: ind_2\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 46ee" |
| . | . | . |
| . | . | . |
| "EEUU" | "2016-10-03" | "T_D: QQWE\nT_NAME: name_300000\nT_IN: ind_65\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 47aa" |
+---------+--------------+------------------------------------------------------------------------------------------------------+
There are about 300,000 rows, and the "text" field is a string of roughly 5,000 characters.
I would like to split the "text" field into these new fields:
+---------+------------+------+-------------+--------+--------+---------+--------+------+
| country | date_data | t_d | t_name | t_in | t_c | t_add | ...... | t_r |
+---------+------------+------+-------------+--------+--------+---------+--------+------+
| EEUU | 2016-10-03 | QQWE | name_1 | ind_1 | c1ws12 | Sec_1_P | ...... | 45ee |
| EEUU | 2016-10-03 | QQAA | name_2 | ind_2 | c1ws12 | Sec_1_P | ...... | 45ee |
| . | . | . | . | . | . | . | . | |
| . | . | . | . | . | . | . | . | |
| . | . | . | . | . | . | . | . | |
| EEUU | 2016-10-03 | QQWE | name_300000 | ind_65 | c1ws12 | Sec_1_P | ...... | 47aa |
+---------+------------+------+-------------+--------+--------+---------+--------+------+
Currently, I'm using regular expressions to solve this problem. First, I write the regular expressions and create a function to extract the individual fields from "text" (90 regular expressions in total):
val D_text = "((?<=T_D: ).*?(?=\\\\n))".r
val NAME_text = "((?<=nT_NAME: ).*?(?=\\\\n))".r
val IN_text = "((?<=T_IN: ).*?(?=\\\\n))".r
val C_text = "((?<=T_C: ).*?(?=\\\\n))".r
val ADD_text = "((?<=T_ADD: ).*?(?=\\\\n))".r
// ...
val R_text = "((?<=T_R: ).*?(?=\\\\n))".r
//UDF function:
def getFirst(pattern2: scala.util.matching.Regex) = udf(
(url: String) => pattern2.findFirstIn(url) match {
case Some(texst_new) => texst_new
case None => "NULL"
case null => "NULL"
}
)
Then, I create a new DataFrame (tbl_separate_fields) by applying the function with each regular expression to extract every new field from "text".
val tbl_separate_fields = hiveDF.select(
hiveDF("country"),
hiveDF("date_data"),
  getFirst(D_text)(hiveDF("text")).alias("t_d"),
  getFirst(NAME_text)(hiveDF("text")).alias("t_name"),
  getFirst(IN_text)(hiveDF("text")).alias("t_in"),
  getFirst(C_text)(hiveDF("text")).alias("t_c"),
  getFirst(ADD_text)(hiveDF("text")).alias("t_add"),
  // ...
  getFirst(R_text)(hiveDF("text")).alias("t_r")
)
Finally, I insert this dataframe into a Hive table:
tbl_separate_fields.registerTempTable("tbl_separate_fields")
hiveContext.sql("INSERT INTO TABLE TABLE_INSERT PARTITION (date_data) SELECT * FROM tbl_separate_fields")
This solution takes about 1 hour for the entire dataframe, so I would like to optimize it and reduce the execution time. Is there a better approach?
We are using Hadoop 2.7.1 and Apache Spark 1.5.1. The Spark configuration is:
val conf = new SparkConf().set("spark.storage.memoryFraction", "0.1")
val sc = new SparkContext(conf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Thanks in advance.
EDIT: additional sample data (note that the fields do not always appear in the same order, and some records are missing fields):
+---------+--------------+------------------------------------------------------------------------------------------------------+
| country | date_data | text |
+---------+--------------+------------------------------------------------------------------------------------------------------+
| "EEUU" | "2016-10-03" | "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 45ee" |
| "EEUU" | "2016-10-03" | "T_NAME: name_2\nT_D: QQAA\nT_IN: ind_2\nT_C: c1ws12 ...........\nT_R: 46ee" |
| . | . | . |
| . | . | . |
| "EEUU" | "2016-10-03" | "T_NAME: name_300000\nT_ADD: Sec_1_P\nT_IN: ind_65\nT_C: c1ws12\n ...........\nT_R: 47aa" |
+---------+--------------+------------------------------------------------------------------------------------------------------+
Using regular expressions in this case is slow and also fragile.
If you know that all records have the same structure, i.e. that all "text" values have the same number and order of "parts", the following code would work (for any number of columns), mainly taking advantage of the split function in org.apache.spark.sql.functions:
import org.apache.spark.sql.functions._
import scala.collection.mutable
// first - split "text" column values into Arrays
val textAsArray: DataFrame = inputDF
.withColumn("as_array", split(col("text"), "\n"))
.drop("text")
.cache()
// get a sample (first row) to get column names, can be skipped if you want to hard-code them:
val sampleText = textAsArray.first().getAs[mutable.WrappedArray[String]]("as_array").toArray
val columnNames: Array[(String, Int)] = sampleText.map(_.split(": ")(0)).zipWithIndex
// add Column per columnName with the right value and drop the no-longer-needed as_array column
val withValueColumns: DataFrame = columnNames.foldLeft(textAsArray) {
case (df, (colName, index)) => df.withColumn(colName, split(col("as_array").getItem(index), ": ").getItem(1))
}.drop("as_array")
withValueColumns.show()
// for the sample data I created,
// with just 4 "parts" in "text" column, this prints:
// +-------+----------+----+------+-----+------+
// |country| date_data| T_D|T_NAME| T_IN| T_C|
// +-------+----------+----+------+-----+------+
// | EEUU|2016-10-03|QQWE|name_1|ind_1|c1ws12|
// | EEUU|2016-10-03|QQAA|name_2|ind_2|c1ws12|
// +-------+----------+----+------+-----+------+
Alternatively, if the assumption above is not true, you can use a UDF that converts the text column into a Map, and then perform a similar foldLeft operation over the hard-coded list of desired columns:
import sqlContext.implicits._
// sample data: not the same order, not all records have all columns:
val inputDF: DataFrame = sc.parallelize(Seq(
("EEUU", "2016-10-03", "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12"),
("EEUU", "2016-10-03", "T_D: QQAA\nT_IN: ind_2\nT_NAME: name_2")
)).toDF("country", "date_data", "text")
// hard-coded list of expected column names:
val columnNames: Seq[String] = Seq("T_D", "T_NAME", "T_IN", "T_C")
// UDF to convert text into key-value map
val asMap = udf[Map[String, String], String] { s =>
s.split("\n").map(_.split(": ")).map { case Array(k, v) => k -> v }.toMap
}
val textAsMap = inputDF.withColumn("textAsMap", asMap(col("text"))).drop("text")
// for each column name - lookup the value in the map
val withValueColumns: DataFrame = columnNames.foldLeft(textAsMap) {
case (df, colName) => df.withColumn(colName, col("textAsMap").getItem(colName))
}.drop("textAsMap")
withValueColumns.show()
// prints:
// +-------+----------+----+------+-----+------+
// |country| date_data| T_D|T_NAME| T_IN| T_C|
// +-------+----------+----+------+-----+------+
// | EEUU|2016-10-03|QQWE|name_1|ind_1|c1ws12|
// | EEUU|2016-10-03|QQAA|name_2|ind_2| null|
// +-------+----------+----+------+-----+------+
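Tying this back to the original question, here is a rough, untested sketch of how the map-based variant could replace the 90 regex UDFs (the key names and lowercase aliases are taken from the question; the remaining keys stay elided):
import org.apache.spark.sql.functions._

// illustrative sketch: one UDF builds a key -> value map per record,
// then each expected key becomes its own lowercase column
val asMap = udf[Map[String, String], String] { s =>
  s.split("\n").map(_.split(": ", 2)).collect { case Array(k, v) => k -> v }.toMap
}

// hard-coded list of the expected keys (the remaining keys from the question elided)
val expectedKeys = Seq("T_D", "T_NAME", "T_IN", "T_C", "T_ADD", /* ... */ "T_R")

val tbl_separate_fields = expectedKeys.foldLeft(hiveDF.withColumn("textAsMap", asMap(col("text")))) {
  case (df, key) => df.withColumn(key.toLowerCase, col("textAsMap").getItem(key))
}.drop("text").drop("textAsMap")
The resulting DataFrame keeps the country, date_data, t_d, ..., t_r layout and can be registered and inserted into the partitioned Hive table exactly as in the original code; the single UDF pass per row replaces the 90 separate regex scans over each ~5,000-character string.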
Related
I have a dataset as follows:
| id | text |
--------------
| 01 | hello world |
| 02 | this place is hell |
I also have a list of keywords I'm searching for:
Keywords = ['hell', 'horrible', 'sucks']
When using the following solution with .rlike() or .contains(), sentences with either partial or exact matches to the list of words are flagged as true. I would like only exact (whole-word) matches to be returned.
Current code:
KEYWORDS = 'hell|horrible|sucks'
df = (
df
.select(
F.col('id'),
F.col('text'),
F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
)
)
Current output:
| id | text | keyword_found |
-------------------------------
| 01 | hello world | 1 |
| 02 | this place is hell | 1 |
Expected output:
| id | text | keyword_found |
--------------------------------
| 01 | hello world | 0 |
| 02 | this place is hell | 1 |
Try the code below; I have only changed the KEYWORDS pattern:
from pyspark.sql.functions import col,when
data = [["01","hello world"],["02","this place is hell"]]
schema =["id","text"]
df2 = spark.createDataFrame(data, schema)
df2.show()
+---+------------------+
| id| text|
+---+------------------+
| 01| hello world|
| 02|this place is hell|
+---+------------------+
KEYWORDS = '(hell|horrible|sucks)$'
df = (
df2
.select(
col('id'),
col('text'),
when(col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
)
)
df.show()
+---+------------------+-------------+
| id| text|keyword_found|
+---+------------------+-------------+
| 01| hello world| 0|
| 02|this place is hell| 1|
+---+------------------+-------------+
Let me know if you need more help on this.
This should work:
Keywords = 'hell|horrible|sucks'
df = df.select(F.col('id'), F.col('text'),
               F.when(F.col('text').rlike('(' + Keywords + r')(\s|$)'), 1).otherwise(0).alias('keyword_found'))
| id | text               | keyword_found |
--------------------------------------------
| 01 | hello world        | 0             |
| 02 | this place is hell | 1             |
I have a dataframe in the form
+-----+--------+-------+
| id | label | count |
+-----+--------+-------+
| id1 | label1 | 5 |
| id1 | label1 | 2 |
| id2 | label2 | 3 |
+-----+--------+-------+
and I would like the resulting output to look like
+-----+--------+----------+----------+-------+
| id | label | col_name | agg_func | value |
+-----+--------+----------+----------+-------+
| id1 | label1 | count | avg | 3.5 |
| id1 | label1 | count | sum | 7 |
| id2 | label2 | count | avg | 3 |
| id2 | label2 | count | sum | 3 |
+-----+--------+----------+----------+-------+
First, I created a list of aggregate functions using the code below. I then apply these functions to the original dataframe to get the aggregation results in separate columns.
import org.apache.spark.sql.Column
val f = org.apache.spark.sql.functions
val aggCols = Seq("col_name")
val aggFuncs = Seq("avg", "sum")
val aggOp = for (func <- aggFuncs) yield {
aggCols.map(x => f.getClass.getMethod(func, x.getClass).invoke(f, x).asInstanceOf[Column])
}
val aggOpFlat = aggOp.flatten
df.groupBy("id", "label").agg(aggOpFlat.head, aggOpFlat.tail: _*).na.fill(0)
I get to the format
+-----+--------+---------------+----------------+
| id | label | avg(col_name) | sum(col_name) |
+-----+--------+---------------+----------------+
| id1 | label1 | 3.5 | 7 |
| id2 | label2 | 3 | 3 |
+-----+--------+---------------+----------------+
but I cannot think of the logic to get to what I want.
A possible solution is to wrap all the aggregate values inside a map and then use the explode function.
Something like this (it shouldn't be an issue to make it dynamic; a sketch of that follows the code):
// assuming a spark-shell session (SparkSession available as `spark`)
import org.apache.spark.sql.functions._
import spark.implicits._

val df = List(("id1", "label1", 5), ("id1", "label1", 2), ("id2", "label2", 3)).toDF("id", "label", "count")
df
.groupBy("id", "label")
.agg(avg("count").as("avg"), sum("count").as("sum"))
.withColumn("map", map( lit("avg"), col("avg"), lit("sum"), col("sum")))
.select(col("id"), col("label"), explode(col("map")))
.show
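As a hedged illustration of the "make it dynamic" remark above, the map entries can be built from a list of aggregate-function names, and the exploded key/value pair renamed to match the agg_func/value columns of the desired layout (the Seq of names and the multi-alias on explode are assumptions for this sketch):
import org.apache.spark.sql.functions._

// illustrative: build the map entries from a list of aggregate names, then
// rename the exploded key/value pair to the agg_func/value columns
val aggNames = Seq("avg", "sum")
val mapEntries = aggNames.flatMap(name => Seq(lit(name), col(name)))

df.groupBy("id", "label")
  .agg(avg("count").as("avg"), sum("count").as("sum"))
  .withColumn("map", map(mapEntries: _*))
  .select(col("id"), col("label"), lit("count").as("col_name"),
          explode(col("map")).as(Seq("agg_func", "value")))
  .show()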
DF1 is what I have now, and I want to make DF1 look like DF2.
Desired Output:
DF1 DF2
+---------+-------------------+ +---------+------------------------------+
| ID | Category | | ID | Category |
+---------+-------------------+ +---------+------------------------------+
| 31898 | Transfer | | 31898 | Transfer (e-Transfer) |
| 31898 | e-Transfer | =====> | 32614 | Transfer (e-Transfer + IMT) |
| 32614 | Transfer | =====> | 33987 | Transfer (IMT) |
| 32614 | e-Transfer + IMT | +---------+------------------------------+
| 33987 | Transfer |
| 33987 | IMT |
+---------+-------------------+
Code:
val df = DF1.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
val DF2 = df.withColumn("Category", $"CategorySet"(0) ($"CategorySet"(1)))
The code is not working; how can I fix it? And if there is a better way to do the same thing, I am open to it. Thanks in advance.
You can try this:
val sliceRight = udf((array : Seq[String], from : Int) => " (" + array.takeRight(from).mkString(",") +")")
val df2 = df.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
df2.withColumn("Category", concat($"CategorySet"(0),sliceRight($"CategorySet",lit(1))))
.show(false)
Output:
+-----+----------------------------+---------------------------+
|ID |CategorySet |Category |
+-----+----------------------------+---------------------------+
|33987|[Transfer, IMT] |Transfer (IMT) |
|32614|[Transfer, e-Transfer + IMT]|Transfer (e-Transfer + IMT)|
|31898|[Transfer, e-Transfer] |Transfer (e-Transfer) |
+-----+----------------------------+---------------------------+
The same answer with a slight modification:
df.groupBy("ID")
  .agg(collect_set(col("Category")).as("Category"))
  .withColumn("Category", concat(col("Category")(0), lit(" ("), col("Category")(1), lit(")")))
  .show
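One caveat for both snippets: collect_set does not guarantee element order, so indexing the resulting array by position can be fragile. A minimal sketch that picks the non-"Transfer" element explicitly instead (assuming, as in the sample data, that each ID has the base category "Transfer" plus exactly one other):
import org.apache.spark.sql.functions._

// illustrative: select the non-"Transfer" element with a small UDF instead of
// relying on the position of elements returned by collect_set
val otherCategory = udf((cats: Seq[String]) => cats.filter(_ != "Transfer").mkString(" + "))

DF1.groupBy("ID")
  .agg(collect_set(col("Category")).as("CategorySet"))
  .withColumn("Category", concat(lit("Transfer ("), otherCategory(col("CategorySet")), lit(")")))
  .drop("CategorySet")
  .show(false)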
I have a number of columns in a spark dataframe that I want to combine into one column and add a separating character between each column. I don't want to combine all the columns together with the character separating them, just some of them. In this example, I would like to add a pipe between the values of everything besides the first two columns.
Here is an example input:
+---+--------+----------+----------+---------+
|id | detail | context | col3 | col4|
+---+--------+----------+----------+---------+
| 1 | {blah} | service | null | null |
| 2 | { blah | """ blah | """blah} | service |
| 3 | { blah | """blah} | service | null |
+---+--------+----------+----------+---------+
The expected output would be something like this:
+---+--------+----------+----------+---------+--------------------------+
|id | detail | context  | col3     | col4    | data                     |
+---+--------+----------+----------+---------+--------------------------+
| 1 | {blah} | service  | null     | null    | service||                |
| 2 | { blah | """ blah | """blah} | service | """blah|"""blah}|service |
| 3 | { blah | """blah} | service  | null    | """blah}|service|        |
+---+--------+----------+----------+---------+--------------------------+
Currently, I have something like the following:
val columns = df.columns.filterNot(_ == "id").filterNot(_ =="detail")
val nonulls = df.na.fill("")
val combined = nonulls.select($"id", concat(columns.map(col):_*) as "data")
The above combines the columns together but doesn't add the separating character. I tried these possibilities, but I'm obviously not doing it right:
scala> val combined = nonulls.select($"id", concat(columns.map(col):_|*) as "data")
scala> val combined = nonulls.select($"id", concat(columns.map(col):_*, lit('|')) as "data")
scala> val combined = nonulls.select($"id", concat(columns.map(col):_*|) as "data")
Any suggestions would be much appreciated! :) Thanks!
This should do the trick:
val columns = df.columns.filterNot(_ == "id").filterNot(_ =="detail")
val columnsWithPipe = columns.flatMap(colname => Seq(col(colname),lit("|"))).dropRight(1)
val combined = nonulls.select($"id",concat(columnsWithPipe:_*) as "data")
Just use the concat_ws function ... it concatenates columns with a separator of your choice.
It's imported as
import org.apache.spark.sql.functions.concat_ws
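For example, a minimal sketch reusing the columns and nonulls values defined in the question:
import org.apache.spark.sql.functions.{col, concat_ws}

// concat_ws joins the given columns with the separator; it ignores nulls,
// so keeping the na.fill("") step from the question preserves the empty
// slots (and their pipes) shown in the expected output
val combined = nonulls.select(col("id"), concat_ws("|", columns.map(col): _*) as "data")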
I am using spark 2.0.1 and python-2.7 to modify and flattening some nested JSON data.
Raw data (json format)
{
  "created": "28-12-2001T12:02:01.143",
  "class": "Class_A",
  "sub_class": "SubClass_B",
  "properties": {
    "meta": "some-info",
    ...,
    "interests": {"key1": "value1", "key2": "value2", ..., "keyN": "valueN"}
  }
}
Using withColumn and a UDF, I was able to flatten the raw data into a dataframe that looks as follows:
---------------------------------------------------------------------
| created | class | sub_class | meta | interests |
---------------------------------------------------------------------
|28-12-2001T12:02:01.143 | Class_A | SubClass_B |'some-info' | "{key1: 'value1', 'key2':'value2', ..., 'keyN':'valueN'}" |
---------------------------------------------------------------------
Now I want to convert/split this one row into multiple rows based on the interests column. How can I do that?
Desired Output
---------------------------------------------------------------------
| created | class | sub_class | meta | key | value |
---------------------------------------------------------------------
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | key1 | value1 |
---------------------------------------------------------------------
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | key2 | value2 |
---------------------------------------------------------------------
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | keyN | valueN |
---------------------------------------------------------------------
Thank you
Use explode
Here's the full example (mostly getting the data):
import pyspark.sql.functions as sql
import pandas as pd
from pyspark.sql import SQLContext

#sc = SparkContext()
sqlContext = SQLContext(sc)
s = "28-12-2001T12:02:01.143 | Class_A | SubClass_B |some-info| {'key1': 'value1', 'key2':'value2', 'keyN':'valueN'}"
data = s.split('|')
data = data[:-1]+[eval(data[-1])]
p_df = pd.DataFrame(data).T
s_df = sqlContext.createDataFrame(p_df,schema= ['created','class','sub_class','meta','intrests'])
s_df.select(s_df.columns[:-1]+[sql.explode(s_df.intrests).alias("key", "value")]).show()