Extracting Specific Field from String in Scala - scala

My dataframe returns the result below as a String.
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], signature={"cbcnt":"String"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}
I just want the value part of this:
"cbcnt":0
Expected Output
col
----
0
2021-07-30
Tried:
.withColumn("CbRes",regexp_extract($"Col", """"cbcnt":(\S*\d+)""", 1))
Output
col
----
0
"2021-07-30 00:00:00 --<--additional " is coming

Using the Pyspark function regexp_extract:
from pyspark.sql import functions as F
df = <dataframe with a column "text" that contains the input data>
df.withColumn("col", F.regexp_extract("text", """"cbcnt":(\d+)""", 1)).show()

Extract via regex:
val value = "QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{\"cbcnt\":0}], signature={\"cbcnt\":\"number\"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |"
val regex = """"cbcnt":(\d+)""".r.unanchored
val s"${regex(result)}" = value
println(result)
Output:
0
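Both snippets above capture only the digit case, while the expected output also keeps the date part of the string value. A minimal Spark sketch covering both cases (the DataFrame df, its column Col, and a SparkSession named spark are assumed from the question's setup):
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._   // for the $"..." column syntax, assuming a SparkSession named spark

// Capture either a bare number or the yyyy-MM-dd prefix of a quoted timestamp after "cbcnt":
val extracted = df.withColumn(
  "CbRes",
  regexp_extract($"Col", """"cbcnt":"?(\d{4}-\d{2}-\d{2}|\d+)""", 1)
)
extracted.select("CbRes").show(false)   // 0 and 2021-07-30 for the two rows above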

Related

Pyspark / Databricks. Kolmogorov - Smirnov over time. Efficiently. In parallel

Hello StackOverflowers.
I have a pyspark dataframe that consists of a time_column and a column with values.
E.g.
+----------+--------------------+
| snapshot| values|
+----------+--------------------+
|2005-01-31| 0.19120256617637743|
|2005-01-31| 0.7972692479278891|
|2005-02-28|0.005236883665445502|
|2005-02-28| 0.5474099672222935|
|2005-02-28| 0.13077227571485905|
+----------+--------------------+
I would like to perform a KS test of each snapshot value with the previous one.
I tried to do it with a for loop.
import numpy as np
from scipy.stats import ks_2samp
import pyspark.sql.functions as F

def KS_for_one_snapshot(temp_df, snapshots_list, j, var="values"):
    sample1 = temp_df.filter(F.col("snapshot") == snapshots_list[j])
    sample2 = temp_df.filter(F.col("snapshot") == snapshots_list[j - 1])  # pick the last snapshot as the one to compare with
    if sample1.count() == 0 or sample2.count() == 0:
        ks_value = -1  # previously "0 observations" which gave type error
    else:
        ks_value, p_value = ks_2samp(np.array(sample1.select(var).collect()).reshape(-1),
                                     np.array(sample2.select(var).collect()).reshape(-1),
                                     alternative="two-sided",
                                     mode="auto")
    return ks_value

results = []
snapshots_list = df.select('snapshot').dropDuplicates().sort('snapshot').rdd.flatMap(lambda x: x).collect()
for j in range(len(snapshots_list) - 1):
    results.append(KS_for_one_snapshot(df, snapshots_list, j + 1))
results
But the data in reality is huge so it takes forever. I am using databricks and pyspark, so I wonder what would be a more efficient way to run it by avoiding the for loop and utilizing the available workers.
I tried to do it using a UDF, but in vain.
Any ideas?
PS. you can generate the data with the following code.
from random import randint
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = (spark.createDataFrame(range(1, 1000), T.IntegerType())
      .withColumn('snapshot', F.array(F.lit("2005-01-31"), F.lit("2005-02-28"), F.lit("2005-03-30")).getItem((F.rand() * 3).cast("int")))
      .withColumn('values', F.rand())
      .drop('value'))
Update:
I tried the following using a UDF.
from pyspark.sql.window import Window

var_used = 'values'
data_input_1 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias('value_list'))
data_input_2 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias("value_list_2"))
windowSpec = Window.orderBy("snapshot")
data_input_2 = data_input_2.withColumn('snapshot_2', F.lag("snapshot", 1).over(Window.orderBy('snapshot'))).filter('snapshot_2 is not NULL')
data_input_final = data_input_1.join(data_input_2, data_input_1.snapshot == data_input_2.snapshot_2)
def KS_one_snapshot_general(sample_in_list_1, sample_in_list_2):
    if len(sample_in_list_1) == 0 or len(sample_in_list_2) == 0:
        ks_value = -1  # previously "0 observations" which gave type error
    else:
        print('something')
        ks_value, p_value = ks_2samp(sample_in_list_1,
                                     sample_in_list_2,
                                     alternative="two-sided",
                                     mode="auto")
    return ks_value

import pyspark.sql.types as T
from pyspark.sql.functions import udf

KS_one_snapshot_general_udf = udf(KS_one_snapshot_general, T.FloatType())
data_input_final.select(KS_one_snapshot_general_udf('value_list', 'value_list_2')).display()
This works fine if the dataset (per snapshot) is small, but if I increase the number of rows then I end up with an error:
PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)

How to get Min, Max and Length between dates for each year?

I have an RDD of type RDD[String]; as an example, here is a part of it:
1990,1990-07-08
1994,1994-06-18
1994,1994-06-18
1994,1994-06-22
1994,1994-06-22
1994,1994-06-26
1994,1994-06-26
1954,1954-06-20
2002,2002-06-26
1954,1954-06-23
2002,2002-06-29
1954,1954-06-16
2002,2002-06-30
...
result:
(1982,52)
(2006,64)
(1962,32)
(1966,32)
(1986,52)
(2002,64)
(1994,52)
(1974,38)
(1990,52)
(2010,64)
(1978,38)
(1954,26)
(2014,64)
(1958,35)
(1998,64)
(1970,32)
I group it nicely, but my problem is the v.size part; I do not know how to calculate that length.
Just to put it in perspective, here are expected results:
It is not a mistake that 2002 appears twice, but ignore that.
Define the date format:
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
and an ordering:
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
Create a function that receives v and returns MAX(date_of_matching_year) - MIN(date_of_matching_year) = LENGTH (in days):
def f(v: Iterable[Array[String]]): Int = {
  val parsedDates = v.map(arr => LocalDate.parse(arr(1), formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}
Then replace the v.size with f(v), as in the sketch below.
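Putting the pieces together, a minimal end-to-end sketch (the input RDD name lines is assumed; its rows look like the sample above):
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.rdd.RDD

implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)

def f(v: Iterable[Array[String]]): Int = {
  // Building the formatter inside the function keeps the Spark closure free of non-serializable captured state.
  val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
  val parsedDates = v.map(arr => LocalDate.parse(arr(1), formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}

// lines: RDD[String] with rows like "1994,1994-06-18"
val result: RDD[(String, Int)] = lines
  .map(_.split(","))   // Array(year, date)
  .groupBy(_(0))       // group the rows by year
  .mapValues(f)        // days between the earliest and latest date of that year

result.collect().foreach(println)   // (year, length) pairs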

Pyspark automatically rename repeated columns

I want to automatically rename repeated columns of a df. For example:
df
Out[4]: DataFrame[norep1: string, num1: string, num1: bigint, norep2: bigint, num1: bigint, norep3: bigint]
Apply some function to end with a df like:
f_rename_repcol(df)
Out[4]: DataFrame[norep1: string, num1_1: string, num1_2: bigint, norep2: bigint, num1_3: bigint, norep3: bigint]
I've already created my own function, and it works, but I'm sure there is a shorter and better way of doing it:
def f_df_col_renombra_rep(df):
    from collections import Counter
    from itertools import chain
    import numpy as np
    import pandas as pd
    columnas_original = np.array(df.columns)
    d1 = Counter(df.columns)
    i_corrige = [a > 1 for a in dict(d1.items()).values()]
    var_corrige = np.array(list(dict(d1.items()).keys()))[i_corrige]
    var_corrige_2 = [a for a in columnas_original if a in var_corrige]
    columnas_nuevas = []
    for var in var_corrige:
        aux_corr = [a for a in var_corrige_2 if a in var]
        i = 0
        columnas_nuevas_aux = []
        for valor in aux_corr:
            i += 1
            nombre_nuevo = valor + "_" + str(i)
            columnas_nuevas_aux.append(nombre_nuevo)
        columnas_nuevas.append(columnas_nuevas_aux)
    columnas_nuevas = list(chain.from_iterable(columnas_nuevas))
    indice_cambio = pd.Series(columnas_original).isin(var_corrige)
    i = 0
    j = 0
    colsalida = [None] * len(df.columns)
    for col in df.columns:
        if indice_cambio[i] == True:
            colsalida[i] = columnas_nuevas[j]
            j += 1
        else:
            colsalida[i] = col
            # name is not changed
        i += 1
    df_out = df.toDF(*colsalida)
    return df_out
You can modify the renaming function here to suit your needs, but broadly I find this the best way to rename all the duplicate columns:
old_col = df.schema.names
running_list = []
new_col = []
i = 0
for column in old_col:
    if column in running_list:
        new_col.append(column + "_" + str(i))
        i = i + 1
    else:
        new_col.append(column)
        running_list.append(column)
print(new_col)
This is the conversion I do; the exact suffix assigned to the duplicate columns does not matter much, as long as the name (prefix) stays the same and I can save the file.
To update the columns you can simply run:
df = df.toDF(*new_col)
This should update the column names and resolve all the duplicate names.
If you want to keep the numbering as _1, _2, _3:
You can use a dictionary and a try/except block:
dict = {}
new_col = []
for column in old_col:
    try:
        i = dict[column] + 1
        new_col.append(column + "_" + str(i))
        dict[column] = i
    except KeyError:
        dict[column] = 1
        new_col.append(column + "_" + str(1))
print(new_col)
the easy way I am doing it is:
def col_duplicates(self):
    '''rename dataframe with dups'''
    columnas = self.columns.copy()
    for i in range(len(columnas) - 1):
        for j in range(i + 1, len(columnas), 1):
            if columnas[i] == columnas[j]:
                columnas[j] = columnas[i] + '_dup_' + str(j)  # this line controls how to rename
    return self.toDF(*columnas)
use as:
new_df_without_duplicates = col_duplicates(df_with_duplicates)

selecting max value in the column

I have data like this:
TagID,ListnerID,Timestamp,Sum_RSSI
2,101,1496745906,90
3,102,1496745907,70
3,104,1496745906,80
2,101,1496745909,60
4,106,1496745908,60
My expected output would be
2,101,1496745906,90
3,104,1496745906,80
4,106,1496745908,60
I tried this:
val high_window = Window.partitionBy($"tagShortID")
val prox = averageDF
.withColumn("rank", row_number().over(window.orderBy($"Sum_RSSI".desc)))
.filter($"rank" === 1)
But it prints all the rows. Any help would be appreciated.
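The likely issue is that the ordered window passed to row_number is not the partitioned high_window defined above, so the ranking is not done per tag. A minimal sketch of the fix (column names taken from the attempt; they must match the actual DataFrame schema):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._   // for the $"..." column syntax, assuming a SparkSession named spark

// Rank rows within each tag partition by Sum_RSSI descending and keep only the top row.
val high_window = Window.partitionBy($"tagShortID").orderBy($"Sum_RSSI".desc)

val prox = averageDF
  .withColumn("rank", row_number().over(high_window))
  .filter($"rank" === 1)
  .drop("rank")

prox.show(false)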

XML parsing using Scala

I have the following XML file which I want to parse using Scala:
<infoFile xmlns="http://latest/nmc-omc/cmNrm.doc#info" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://latest/nmc-omc/cmNrm.doc#info schema\pmResultSchedule.xsd">
<fileHeader fileFormatVersion="123456" operator="ABCD">
<fileSender elementType="MSC UTLI"/>
<infoCollec beginTime="2011-05-15T00:00:00-05:00"/>
</fileHeader>
<infoCollecData>
<infoMes infoMesID="551727">
<mesPeriod duration="TT1234" endTime="2011-05-15T00:30:00-05:00"/>
<mesrePeriod duration="TT1235"/>
<mesTypes>5517271 5517272 5517273 5517274 </measTypes>
<mesValue mesObj="RPC12/LMI_ANY:Label=BCR-1232_1111, ANY=1111">
<mesResults>149 149 3 3 </mesResults>
</mesValue>
</infoMes>
<infoMes infoMesID="551728">
<mesTypes>6132413 6132414 6132415</mesTypes>
<mesValue measObjLdn="RPC12/LMI_ANY:Label=BCR-1232_64446, CllID=64446">
<mesResults>0 0 6</mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64447, CllID=64447">
<mesResults>0 1 6</mesResults>
</mesValue>
</infoMes>
<infoMes infoMesID="551729">
<mesTypes>6132416 6132417 6132418 6132419</mesTypes>
<mesValue measObjLdn="RPC12/LMI_ANY:Label=BCR-1232_64448, CllID=64448">
<mesResults>1 4 6 8</mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64449, CllID=64449">
<mesResults>1 2 4 5 </mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64450, CllID=64450">
<mesResults>1 7 8 5 </mesResults>
</mesValue>
</infoMes>
</infoCollecData>
I want the file to be parsed as follows:
From the fileHeader I want to be able to extract the operator name and then the beginTime.
Next scenario: extract the information which contains CllID, then get its mesTypes and mesResults respectively.
As the file contains a number of entries with different CllID values, I want the final result like this:
CllID date time mesTypes mesResults
64446 2011-05-15 00:00:00 6132413 0
64446 2011-05-15 00:00:00 6132414 0
64446 2011-05-15 00:00:00 6132415 6
64447 2011-05-15 00:00:00 6132413 0
64447 2011-05-15 00:00:00 6132414 1
64447 2011-05-15 00:00:00 6132415 6
How could I achieve this? Here is what I have tried so far:
import java.io._
import scala.xml.Node

object xml_parser {
  def main(args: Array[String]) = {
    val input_xmlFile = scala.xml.XML.loadFile("C:/Users/ss.xml")
    val fileHeader = input_xmlFile \ "fileHeader"
    val vendorName = input_xmlFile \ "fileHeader" \ "#operator"
    val dateTime = input_xmlFile \ "fileHeader" \ "infoCollec" \ "#beginTime"
    val date = dateTime.text.split("T")(0)
    val time = dateTime.text.split("T")(1).split("-")(0)
    val CcIds = (input_xmlFile \ "infoCollecData" \ "infoMes" \\ "mesTypes")
    val cids = CcIds.text.split("\\s+").toList
    val CounterValues = (input_xmlFile \ "infoCollecData" \\ "infoMes" \\ "mesValue" \\ "#meaObj")
    println(date); println(time); print(cids)
  }
}
May I suggest kantan.xpath? It seems like it should sort your problem rather easily.
Assuming your XML data is available in file data, you can write:
import kantan.xpath.implicits._
val xml = data.asUnsafeNode
// Date format to parse dates. Put in the right format.
// Note that this uses java.util.Date, you could also use the joda time module.
implicit val format = ???
// Extract the header data
xml.evalXPath[java.util.Date](xp"//fileheader/infocollec/#begintime")
xml.evalXPath[String](xp"//fileheader/#operator")
// Get the required infoMes nodes as a list, turn each one into whatever data type you need.
xml.evalXPath[List[Node]](xp"//infomes/mesvalue[contains(#measobjldn, 'CllID')]/..").map { node =>
...
}
Extracting the CllID bit is not terribly complicated with the right regular expression, you could either use the standard Scala Regex class or kantan.regex for something a bit more type safe but that might be overkill here.
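For example, a small sketch with the standard Scala Regex class (the sample attribute value is taken from the XML above):
// Pull the numeric CllID out of a measObjLdn attribute value.
val CllIdPattern = """CllID=(\d+)""".r.unanchored

"RPC12/LMI_ANY:Label=BCR-1232_64446, CllID=64446" match {
  case CllIdPattern(id) => println(id)   // prints 64446
  case _                => println("no CllID found")
}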
The following code can implement what you want, according to your XML format:
def main(args: Array[String]): Unit = {
  val inputFile = xml.XML.loadFile("C:/Users/ss.xml")
  val fileHeader = inputFile \ "fileHeader"
  val beginTime = fileHeader \ "infoCollec"
  val res = beginTime.map(_.attribute("beginTime")).apply(0).get.text
  val dateTime = res.split("T")
  val date = dateTime(0)
  val time = dateTime(1).split("-").apply(0)

  val title = ("CllID", "date", "time", "mesTypes", "mesResults")
  println(s"${title._1}\t${title._2}\t\t${title._3}\t\t${title._4}\t${title._5}")

  val infoMesNodeList = (inputFile \\ "infoMes").filter { node =>
    (node \ "mesValue").exists(_.attribute("measObjLdn").nonEmpty)
  }
  infoMesNodeList.foreach { node =>
    val mesTypesList = (node \ "mesTypes").text.split(" ").map(_.toInt)
    (node \ "mesValue").foreach { node =>
      val mesResultsList = (node \ "mesResults").text.split(" ").map(_.toInt)
      val CllID = node.attribute("measObjLdn").get.text.split(",").apply(1).split("=").apply(1).toInt
      val res = (mesTypesList zip mesResultsList).map(item => (CllID, date, time, item._1, item._2))
      res.foreach(item => println(s"${item._1}\t${item._2}\t${item._3}\t${item._4}\t\t${item._5}"))
    }
  }
}
Notes: your XML file does not have the right format:
1) the closing </infoFile> tag is missing at the end of the file
2) line 11 closes the <mesTypes> element with a mismatched </measTypes> tag; it should be </mesTypes>