Regex special sequence - regex-group

I altered some code from the SoloLearn app but got confused:
import re
pattern = r'(.+)(.+) \2'
match = re.match(pattern, 'ABC bca cab ABC')
if match:
    print('Match 1', match.group())
match = re.match(pattern, 'abc BCA cab BCA')
if match:
    print('Match 2', match.group())
match = re.match(pattern, 'abc bca CAB CAB')
if match:
    print('Match 3', match.group())
I am getting this output:
Match 1 ABC bca ca
Match 3 abc bca CAB CAB
Any help?!
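Not an answer from the original thread, but printing the captured groups makes the behaviour easier to see: re.match anchors at the start of the string, both (.+) groups are greedy, and \2 must repeat the exact text captured by group 2 (case-sensitively) right after a space. A quick check of what the engine actually captures:

import re

pattern = r'(.+)(.+) \2'

m = re.match(pattern, 'ABC bca cab ABC')
print(m.groups())  # ('ABC b', 'ca') -> ' \2' matches the ' ca' inside 'cab', hence 'ABC bca ca'

m = re.match(pattern, 'abc BCA cab BCA')
print(m)           # None: no text is immediately repeated after a space
                   # (the backreference is case-sensitive, 'bca' != 'BCA'),
                   # so 'Match 2' never prints

m = re.match(pattern, 'abc bca CAB CAB')
print(m.groups())  # ('abc bca ', 'CAB') -> the whole string matches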

Related

How to execute Spark SQL using withColumn for a streaming dataframe?

There is a scenario in which the SCHOOL_GROUP column of the streaming data needs to be updated based on a mapping table (a static dataframe).
The matching logic needs to be applied on the AREA and SCHOOL_GROUP columns of the streaming DF (teachersInfoDf) against the SPLIT_CRITERIA and SCHOOL_CODE columns of the static DF (mappingDf) to fetch SCHOOL.
teachersInfoDf (Streaming Data):
FNAME    | LNAME | DOB  | GENDER | SALARY | SCHOOL_GROUP | AREA
Williams | Kylie | 1996 | M      | 2000   | ABCD         | CNTRL-1
Maria    | Brown | 1992 | F      | 2000   | ABCD         | CNTRL-5
John     | Snow  | 1997 | M      | 5000   | XYZA         | MM-RLBH1
Merry    | Ely   | 1993 | F      | 1000   | PQRS         | J-20
Michael  | Rose  | 1998 | M      | 1000   | XYZA         | DAY-20
Andrew   | Simen | 1990 | M      | 1000   | STUV         | LVL-20
John     | Dear  | 1997 | M      | 5000   | PQRS         | P-RLBH1
mappingDf (Mapping Table data - Static):
SCHOOL_CODE | SPLIT_CRITERIA                                                                                | SCHOOL
ABCD        | (AREA LIKE 'CNTRL-%')                                                                         | GROUP-1
XYZA        | (AREA IN ('KK-DSK','DAY-20','MM-RLBH1','KM-RED1','NN-RLBH2'))                                 | MULTI
PQRS        | (AREA LIKE 'P-%' OR AREA LIKE 'V-%' OR AREA LIKE 'J-%')                                       | WEST
STUV        | (AREA NOT IN ('SS_ENDO2','SS_GRTGED','SS_GRTMMU','PC_ENDO1','PC_ENDO2','GRTENDO','GRTENDO1')) | CORE
Required Dataframe:
FNAME    | LNAME | DOB  | GENDER | SALARY | SCHOOL_GROUP | AREA
Williams | Kylie | 2006 | M      | 2000   | GROUP-1      | CNTRL-1
Maria    | Brown | 2002 | F      | 2000   | GROUP-1      | CNTRL-5
John     | Snow  | 2007 | M      | 5000   | MULTI        | MM-RLBH1
Merry    | Ely   | 2003 | F      | 1000   | WEST         | J-20
Michael  | Rose  | 2002 | M      | 1000   | MULTI        | DAY-20
Andrew   | Simen | 2008 | M      | 1000   | CORE         | LVL-20
John     | Dear  | 2007 | M      | 5000   | WEST         | P-RLBH1
How can I achieve that using Spark SQL?
(I know that in streaming we can't show data like this; the streaming DF examples are for reference only.)
(For now, I created a static DF to apply the logic.)
I am using the approach below but getting an error:
def deriveSchoolOnArea: UserDefinedFunction = udf((area: String, SPLIT_CRITERIA: String, SCHOOL: String) => {
  if (area == null || SPLIT_CRITERIA == null || SCHOOL == null) {
    return null
  }
  val splitCriteria = SPLIT_CRITERIA.replace("AREA", area)
  val query = """select """" + SCHOOL + """" AS SCHOOL from dual where """ + splitCriteria
  print(query)
  val dualDf = spark.sparkContext.parallelize(Seq("dual")).toDF()
  dualDf.createOrReplaceGlobalTempView("dual")
  print("View Created")
  val finalHosDf = spark.sql(query)
  print("Query Executed")
  var finalSchool = ""
  if (finalHosDf.isEmpty) {
    return null
  } else {
    finalSchool = finalHosDf.select(col("SCHOOL")).first.getString(0)
  }
  print(finalSchool)
  finalSchool
})

val dfJoin = teachersInfoDf.join(mappingDf, mappingDf("SCHOOL_CODE") === teachersInfoDf("SCHOOL_GROUP"), "left")
val dfJoin2 = dfJoin.withColumn("SCHOOL_GROUP", coalesce(deriveSchoolOnArea(col("area"), col("SPLIT_CRITERIA"), col("SCHOOL")), col("SCHOOL_GROUP")))
dfJoin2.show(false)
But I am getting the error below:
dfJoin2.show(false)
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2459)
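Not the original answer, but a note on why this fails and one possible alternative: "Task not serializable" typically comes from touching the SparkSession inside the UDF (spark.sql, toDF, createOrReplaceGlobalTempView) - the UDF body is serialized and shipped to the executors, where no session exists. Because mappingDf is static and small, one hedged workaround is to collect it on the driver and turn each SPLIT_CRITERIA string into a Column via expr(), avoiding the UDF entirely. This is only a sketch, reusing the column names from the tables above:

import org.apache.spark.sql.functions._

// Collect the small static mapping table on the driver.
val mappings = mappingDf.select("SCHOOL_CODE", "SPLIT_CRITERIA", "SCHOOL").collect()

// Build one when/otherwise chain; expr() parses criteria such as
// "(AREA LIKE 'CNTRL-%')" against the streaming dataframe's own columns.
val derivedSchool = mappings.foldLeft(lit(null).cast("string")) { (acc, row) =>
  val code     = row.getAs[String]("SCHOOL_CODE")
  val criteria = row.getAs[String]("SPLIT_CRITERIA")
  val school   = row.getAs[String]("SCHOOL")
  when(col("SCHOOL_GROUP") === code && expr(criteria), lit(school)).otherwise(acc)
}

// Fall back to the existing SCHOOL_GROUP when no rule matches.
val result = teachersInfoDf.withColumn("SCHOOL_GROUP", coalesce(derivedSchool, col("SCHOOL_GROUP")))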

Filter data in a text file and load it into PostgreSQL

I have a text file with the below format:
Text: htpps:/xxx
Expiry: ddmm/yyyy
object_id: 00
object: ABC
auth: 333
RequestID: 1234
Text: htpps:/yyy
Expiry: ddmm/yyyy
object_id: 01
object: NNN
auth: 222
RequestID: 3456
and so on
...
I want to delete all lines except those with the prefixes "Expiry:", "object:" and "object_id:",
then load the result into a table in PostgreSQL.
I would really appreciate your help on the above two.
Thanks,
Nick
I'm sure there will be other methods, but I found an iterative approach if every object has the same format of
Text: htpps:/xxx
Expiry: ddmm/yyyy
object_id: 00
object: ABC
auth: 333
RequestID: 1234
Then you can transform the above with
more test.txt | awk '{ printf "%s\n", $2 }' | tr '\n' ',' | sed 's/,,/\n/' | sed '$ s/.$//'
and, for your example, it will generate the entries in CSV format:
htpps:/xxx,ddmm/yyyy,00,ABC,333,1234
htpps:/yyy,ddmm/yyyy,01,NNN,222,3456
The above code does:
awk '{ printf "%s\n", $2 }': prints only the second field of each line
tr '\n' ',': replaces newlines with ,
sed 's/,,/\n/': turns the ,, left by each blank line into a newline, splitting the records
sed '$ s/.$//': removes the trailing ,
Of course this is probably an oversimplified example, but you could use it as a basis. Once the file is in CSV format you can load it with psql.
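For the loading step, psql's \copy meta-command can import the generated CSV from the client side. The database, table, column and file names below are only placeholders for illustration; the six columns mirror the six CSV fields:

psql -d mydb -c "\copy requests(text_url, expiry, object_id, object_name, auth, request_id) FROM 'test.csv' WITH (FORMAT csv)"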

Replace newline (\n) except last of each line

My input is split across multiple lines. I want each record to be output on a single line.
For example, the input is:
1|23|ABC
DEF
GHI
newline
newline
2|24|PQR
STU
LMN
XYZ
newline
Output:
1|23|ABC DEF GHI
2|24|PQR STU LMN XYZ
Well, here is one for awk:
$ awk -v RS="" -F"\n" '{$1=$1}1' file
Output:
1|23|ABC DEF GHI
2|24|PQR STU LMN XYZ
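For anyone not familiar with the options used there: -v RS="" puts awk in paragraph mode (records are separated by blank lines), -F"\n" makes each line of a record a separate field, $1=$1 forces the record to be rebuilt with the default output field separator (a single space), and the trailing 1 is an always-true pattern that prints the rebuilt record. A spelled-out equivalent of the same one-liner:

awk -v RS="" -F"\n" '{ $1 = $1; print }' file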

SPARQL Grouping

In SPARQL, can I list the members of each group, possibly by using GROUP BY?
Right now my query returns:
?p ?p2
----------------
abc zza
abc zba
abc zdf
bcd zbc
bcd zef
bcd zhr
bcd zfe
cde zop
cde zzz
The query I used is:
PREFIX bo: <https://webfiles.uci.edu/jenniyk2/businessontology#>
PREFIX v: <http://www.w3.org/2006/vcard/ns#>
SELECT DISTINCT ?p ?p2
WHERE
{
?p v:hasAddress ?ad .
?p2 v:hasAddress ?ad .
FILTER( ?p != ?p2 )
}
Is there any way I can make it return something like:
?p ?p2
---------------
abc zza
zba
zdf
bcd zbc
zef
zhr
zfe
cde zop
zzz
or
?p
-------------------
abc zza zba zdf
bcd zbc zef zhr zfe
cde zop zzz
Something like this should do the trick:
PREFIX bo:<https://webfiles.uci.edu/jenniyk2/businessontology#>
PREFIX v: <http://www.w3.org/2006/vcard/ns#>
SELECT DISTINCT (GROUP_CONCAT(?p2; SEPARATOR=" ") AS ?p)
WHERE {
?p1 v:hasAddress ?ad.
?p2 v:hasAddress ?ad.
} GROUP BY ?p1
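If you also want the grouping key in the result and want to keep the original FILTER so a subject is not concatenated with itself, a variant along these lines should work (an untested sketch; some engines prefer STR(?p2) inside the GROUP_CONCAT):

PREFIX v: <http://www.w3.org/2006/vcard/ns#>
SELECT ?p1 (GROUP_CONCAT(?p2; SEPARATOR=" ") AS ?others)
WHERE {
  ?p1 v:hasAddress ?ad .
  ?p2 v:hasAddress ?ad .
  FILTER(?p1 != ?p2)
}
GROUP BY ?p1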

Convert CSV string into list of strings

val lines : String = ("a1 , test1 , test2 , a2 , test3 , test4")
I'd like to convert this to a list of Strings where each string in the list contains 3 elements, so the above is converted to a 2-element list of strings containing "a1 , test1 , test2" and "a2 , test3 , test4".
One option I have considered is to iterate over each CSV element in the string and, on every third element, add the previous elements to a new string. Is there a more functional approach?
grouped partitions the elements into fixed-size groups of n elements each.
scala> lines.split(",").grouped(3).toList
res0: List[Array[String]] = List(Array("a1 ", " test1 ", " test2 "), Array(" a2 ", " test3 ", " test4"))
The answer by @Brian suffices; for output formatted as
"a1 , test1 , test2" and "a2 , test3 , test4"
consider for instance
scala> val groups = lines.split(",").grouped(3).map { _.mkString(",").trim }.toList
groups: List[String] = List(a1 , test1 , test2, a2 , test3 , test4)
Then
scala> groups(0)
res1: String = a1 , test1 , test2
and
scala> groups(1)
res2: String = a2 , test3 , test4
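If the stray spaces around the commas are unwanted, a small variant of the same grouped idea that normalises the whitespace first:

lines.split(",").map(_.trim).grouped(3).map(_.mkString(" , ")).toList
// List(a1 , test1 , test2, a2 , test3 , test4)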