I have the following string name/date/country variable:
"John Peters|2016-01-19|England"
How can I get the variable to read as:
"John Peters|january|England"
The following works for me:
clear
input str30 loandate
"John Peters|2016-01-19|England"
end
split loandate, parse("|")
generate loandate2b = month(date(loandate2, "YMD"))
egen loandate_new = concat(loandate1 loandate2b loandate3), format(%tdMonth) punct("|")
list loandate_new
+-----------------------------+
| loandate_new |
|-----------------------------|
1. | John Peters|January|England |
+-----------------------------+
Related
I have a spark code to read some data from a database.
One of the columns (of type string) named "title" contains the following data.
+-------------------------------------------------+
|title |
+-------------------------------------------------+
|Example sentence |
|Read the ‘Book’ |
|‘LOTR’ Is A Great Book |
+-------------------------------------------------+
I'd like to remove the HTML entities and decode it to look as given below.
+-------------------------------------------+
|title |
+-------------------------------------------+
|Example sentence |
|Read the ‘Book’ |
|‘LOTR’ Is A Great Book |
+-------------------------------------------+
There is a library "html-enitites" for node.js that does exactly what I am looking for,
but i am unable to find something similar for spark-scala.
What would be good approach to do this?
You can use org.apache.commons.lang.StringEscapeUtils with a help of UDF to achieve this.
import org.apache.commons.lang.StringEscapeUtils;
val decodeHtml = (html:String) => {
StringEscapeUtils.unescapeHtml(html);
}
val decodeHtmlUDF = udf(decodeHtml)
df.withColumn("title", decodeHtmlUDF($"title")).show()
/*
+--------------------+
| title|
+--------------------+
| Example sentence |
| Read the ‘Book’ |
|‘LOTR’ Is A Great...|
+--------------------+
*/
I am creating a dataframe using
val snDump = table_raw
.applyMapping(mappings = Seq(
("event_id", "string", "eventid", "string"),
("lot-number", "string", "lotnumber", "string"),
("serial-number", "string", "serialnumber", "string"),
("event-time", "bigint", "eventtime", "bigint"),
("companyid", "string", "companyid", "string")),
caseSensitive = false, transformationContext = "sn")
.toDF()
.groupBy(col("eventid"), col("lotnumber"), col("companyid"))
.agg(collect_list(struct("serialnumber", "eventtime")).alias("snetlist"))
.createOrReplaceTempView("sn")
I have data like this in the df
eventid | lotnumber | companyid | snetlist
123 | 4q22 | tu56ff | [[12345,67438]]
456 | 4q22 | tu56ff | [[12346,67434]]
258 | 4q22 | tu56ff | [[12347,67455], [12333,67455]]
999 | 4q22 | tu56ff | [[12348,67459]]
I want to explode it put data in 2 columns in my table for that what I am doing is
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("serialN"), explode(col("snetlist")).alias("eventT"), col("companyid"))
Also tried
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), col($"snetlist.serialnumber").alias("serialN"), col($"snetlist.eventtime").alias("eventT"), col("companyid"))
but it turns out that explode can be only used once and I get error in the select so how do I use explode/or something else to achieve what I am trying to.
eventid | lotnumber | companyid | serialN | eventT |
123 | 4q22 | tu56ff | 12345 | 67438 |
456 | 4q22 | tu56ff | 12346 | 67434 |
258 | 4q22 | tu56ff | 12347 | 67455 |
258 | 4q22 | tu56ff | 12333 | 67455 |
999 | 4q22 | tu56ff | 12348 | 67459 |
I have looked at a lot of stackoverflow threads but none of it helped me. It is possible that such question is already answered but my understanding of scala is very less which might have made me not understand the answer. If this is a duplicate then someone could direct me to the correct answer. Any help is appreciated.
First, explode the array in a temporary struct-column, then unpack it:
val serialNumberEvents = snDump
.withColumn("tmp",explode((col("snetlist"))))
.select(
col("eventid"),
col("lotnumber"),
col("companyid"),
// unpack struct
col("tmp.serialnumber").as("serialN"),
col("tmp.eventtime").as("serialT")
)
The trick is to pack the columns you want to explode in an array (or struct), use explode on the array and then unpack them.
val col_names = Seq("eventid", "lotnumber", "companyid", "snetlist")
val data = Seq(
(123, "4q22", "tu56ff", Seq(Seq(12345,67438))),
(456, "4q22", "tu56ff", Seq(Seq(12346,67434))),
(258, "4q22", "tu56ff", Seq(Seq(12347,67455), Seq(12333,67455))),
(999, "4q22", "tu56ff", Seq(Seq(12348,67459)))
)
val snDump = spark.createDataFrame(data).toDF(col_names: _*)
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("snetlist"), col("companyid"))
val exploded = serialNumberEvents.select($"eventid", $"lotnumber", $"snetlist".getItem(0).alias("serialN"), $"snetlist".getItem(1).alias("eventT"), $"companyid")
exploded.show()
Note that my snetlist has the schema Array(Array) rather then Array(Struct). You can simply get this by also creating an array instead of a struct out of your columns
Another approach, if needing to explode twice, is as follows - for another example, but to demonstrate the point:
val flattened2 = df.select($"director", explode($"films.actors").as("actors_flat"))
val flattened3 = flattened2.select($"director", explode($"actors_flat").as("actors_flattened"))
See Is there an efficient way to join two large Datasets with (deeper) nested array field? for a slightly different context, but same approach applies.
This answer in response to your assertion you can only explode once.
I have a dataframe looks like this:
datetime | ID |
======================
20180201000000 | 275 |
20171231113024 | 534 |
20180201220000 | 275 |
20170205000000 | 28 |
what I want to do is to count by ID, monthly.
this way was perfactly worked :
add column of month by extracting from datetime column :
new_df = df.withColumn('month', df.datetime.substr(0,6))
count by ID & month :
count_df = new_df.groupBy('ID','month').count()
but is there a way to use substring of certain column values as an argument of groupBy() function? like :
`count_df = df.groupBy('ID', df.datetime.substr(0,6)).count()`
at least, this code didn't work.
if there exist the way to use substring of values, don't need to add new column and save much of resources(in case of big data).
but even if this approach is wrong, do you have a better idea to get same result?
Try this
>>> df.show()
+--------------+---+
| datetime| id|
+--------------+---+
|20180201000000|275|
|20171231113024|534|
|20180201220000|275|
|20170205000000| 28|
+--------------+---+
>>> df.groupBy('id',df.datetime.substr(0,6)).agg(count('id')).show()
+---+-----------------------+---------+
| id|substring(datetime,0,6)|count(id)|
+---+-----------------------+---------+
|275| 201802| 2|
|534| 201712| 1|
| 28| 201702| 1|
+---+-----------------------+---------+
I have a json string as below in a dataframe
aaa | bbb | ccc |ddd | eee
--------------------------------------
100 | xxxx | 123 |yyy|2017
100 | yyyy | 345 |zzz|2017
200 | rrrr | 500 |qqq|2017
300 | uuuu | 200 |ttt|2017
200 | iiii | 500 |ooo|2017
I want to get the result as
{100,[{xxxx:{123,yyy}},{yyyy:{345,zzz}}],2017}
{200,[{rrrr:{500,qqq}},{iiii:{500,ooo}}],2017}
{300,[{uuuu:{200,ttt}}],2017}
Kindly help
This works:
val df = data
.withColumn("cd", array('ccc, 'ddd)) // create arrays of c and d
.withColumn("valuesMap", map('bbb, 'cd)) // create mapping
.withColumn("values", collect_list('valuesMap) // collect mappings
.over(Window.partitionBy('aaa)))
.withColumn("eee", first('eee) // e is constant, just get first value of Window
.over(Window.partitionBy('aaa)))
.select("aaa", "values", "eee") // select only columns that are in the question selected
.select(to_json(struct("aaa", "values", "eee")).as("value")) // create JSON
Make sure you do
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._`
You can create a map defining the values as constants with lit() or taking them from other columns in the dataframe with $"col_name", like this:
val new_df = df.withColumn("map_feature", map(lit("key1"), lit("value1"), lit("key2"), $"col2"))
I need to return only a portion of the value in a given field.
Example:
A given field returns something like 'AB-1X3.4567' but the desired value is only the '1X3.4567'portion. So for this example I need to remove anything that precedes the pattern of
[0-9,A-Z][0-9,A-Z][0-9,A-Z][.][0-9,A-Z][0-9,A-Z][0-9,A-Z][0-9,A-Z].
What query could I write to do this?
using stuff() and patindex():
create table t (val varchar(32))
insert into t values
('AB-1X3.4567') -- given example
,('1X3.4567AB-1X3.4567') --extra junk on the end
,('1X3.4567') -- goldy locks
,('X3.4567') -- too short
,('AB-1X#.4567') -- # is not [0-9A-Z]
select
val
, str = stuff(val,1,patindex('%[0-9A-Z][0-9A-Z][0-9A-Z][.][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',val)-1,'')
from t
rextester demo: http://rextester.com/ITUJ68634
returns:
+---------------------+---------------------+
| val | str |
+---------------------+---------------------+
| AB-1X3.4567 | 1X3.4567 |
| 1X3.4567AB-1X3.4567 | 1X3.4567AB-1X3.4567 |
| 1X3.4567 | 1X3.4567 |
| X3.4567 | NULL |
| AB-1X#.4567 | NULL |
+---------------------+---------------------+
Your pattern alludes to anything which is XXX.XXXX where X = any single digit or letter. In that case we can use RIGHT() and LEN()
DECLARE #value VARCHAR(4000)='AB-1X3.4567'
SELECT RIGHT(#value,LEN(#value) - 3)