Extracting Specific Field from String in Scala - scala

My dataframe returns the result below as a String.
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], signature={"cbcnt":"String"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}
I just want the value part of this:
"cbcnt":0
Expected Output
col
----
0
2021-07-30
Tried:
.withColumn("CbRes",regexp_extract($"Col", """"cbcnt":(\S*\d+)""", 1))
Output
col
----
0
"2021-07-30 00:00:00 --<--additional " is coming

Using the Pyspark function regexp_extract:
from pyspark.sql import functions as F
df = <dataframe with a column "text" that contains the input data>
df.withColumn("col", F.regexp_extract("text", """"cbcnt":(\d+)""", 1)).show()

Extract via regex:
val value = "QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{\"cbcnt\":0}], signature={\"cbcnt\":\"number\"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |"
val regex = """"cbcnt":(\d+)""".r.unanchored
val s"${regex(result)}" = value
println(result)
Output:
0
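Both snippets above capture only the digit case, while the expected output also keeps the date part of the string value. A minimal Spark sketch covering both cases (the DataFrame df, its column Col, and a SparkSession named spark are assumed from the question's setup):
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._   // for the $"..." column syntax, assuming a SparkSession named spark

// Capture either a bare number or the yyyy-MM-dd prefix of a quoted timestamp after "cbcnt":
val extracted = df.withColumn(
  "CbRes",
  regexp_extract($"Col", """"cbcnt":"?(\d{4}-\d{2}-\d{2}|\d+)""", 1)
)
extracted.select("CbRes").show(false)   // 0 and 2021-07-30 for the two rows above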

Related

Pyspark / Databricks. Kolmogorov - Smirnov over time. Efficiently. In parallel

Hello StackOverflowers.
I have a pyspark dataframe that consists of a time_column and a column with values.
E.g.
+----------+--------------------+
| snapshot| values|
+----------+--------------------+
|2005-01-31| 0.19120256617637743|
|2005-01-31| 0.7972692479278891|
|2005-02-28|0.005236883665445502|
|2005-02-28| 0.5474099672222935|
|2005-02-28| 0.13077227571485905|
+----------+--------------------+
I would like to perform a KS test of each snapshot value with the previous one.
I tried to do it with a for loop.
import numpy as np
from scipy.stats import ks_2samp
import pyspark.sql.functions as F

def KS_for_one_snapshot(temp_df, snapshots_list, j, var="values"):
    sample1 = temp_df.filter(F.col("snapshot") == snapshots_list[j])
    sample2 = temp_df.filter(F.col("snapshot") == snapshots_list[j - 1])  # pick the last snapshot as the one to compare with
    if sample1.count() == 0 or sample2.count() == 0:
        ks_value = -1  # previously "0 observations" which gave type error
    else:
        ks_value, p_value = ks_2samp(np.array(sample1.select(var).collect()).reshape(-1),
                                     np.array(sample2.select(var).collect()).reshape(-1),
                                     alternative="two-sided",
                                     mode="auto")
    return ks_value

results = []
snapshots_list = df.select('snapshot').dropDuplicates().sort('snapshot').rdd.flatMap(lambda x: x).collect()
for j in range(len(snapshots_list) - 1):
    results.append(KS_for_one_snapshot(df, snapshots_list, j + 1))
results
But the data in reality is huge so it takes forever. I am using databricks and pyspark, so I wonder what would be a more efficient way to run it by avoiding the for loop and utilizing the available workers.
I tried to do it using a UDF, but in vain.
Any ideas?
PS. you can generate the data with the following code.
from random import randint
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = (spark.createDataFrame(range(1, 1000), T.IntegerType())
      .withColumn('snapshot', F.array(F.lit("2005-01-31"), F.lit("2005-02-28"), F.lit("2005-03-30")).getItem((F.rand() * 3).cast("int")))
      .withColumn('values', F.rand())
      .drop('value'))
Update:
I tried the following using a UDF.
from pyspark.sql.window import Window

var_used = 'values'
data_input_1 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias('value_list'))
data_input_2 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias("value_list_2"))
windowSpec = Window.orderBy("snapshot")
data_input_2 = data_input_2.withColumn('snapshot_2', F.lag("snapshot", 1).over(Window.orderBy('snapshot'))).filter('snapshot_2 is not NULL')
data_input_final = data_input_1.join(data_input_2, data_input_1.snapshot == data_input_2.snapshot_2)
def KS_one_snapshot_general(sample_in_list_1, sample_in_list_2):
    if len(sample_in_list_1) == 0 or len(sample_in_list_2) == 0:
        ks_value = -1  # previously "0 observations" which gave type error
    else:
        print('something')
        ks_value, p_value = ks_2samp(sample_in_list_1,
                                     sample_in_list_2,
                                     alternative="two-sided",
                                     mode="auto")
    return ks_value

import pyspark.sql.types as T
from pyspark.sql.functions import udf

KS_one_snapshot_general_udf = udf(KS_one_snapshot_general, T.FloatType())
data_input_final.select(KS_one_snapshot_general_udf('value_list', 'value_list_2')).display()
This works fine if the dataset (per snapshot) is small, but if I increase the number of rows then I end up with an error:
PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)

How to get Min, Max and Length between dates for each year?

I have an RDD of type RDD[String]; as an example, here is a part of it:
1990,1990-07-08
1994,1994-06-18
1994,1994-06-18
1994,1994-06-22
1994,1994-06-22
1994,1994-06-26
1994,1994-06-26
1954,1954-06-20
2002,2002-06-26
1954,1954-06-23
2002,2002-06-29
1954,1954-06-16
2002,2002-06-30
...
result:
(1982,52)
(2006,64)
(1962,32)
(1966,32)
(1986,52)
(2002,64)
(1994,52)
(1974,38)
(1990,52)
(2010,64)
(1978,38)
(1954,26)
(2014,64)
(1958,35)
(1998,64)
(1970,32)
I group it nicely, but my problem is the v.size part; I do not know how to calculate that length.
Just to put it in perspective, here are expected results:
It is not a mistake that 2002 appears twice, but ignore that.
Define the date format:
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
and an ordering:
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
Create a function that receives v and returns MAX(date_of_matching_year) - MIN(date_of_matching_year) = LENGTH (in days):
def f(v: Iterable[Array[String]]): Int = {
  val parsedDates = v.map(arr => LocalDate.parse(arr(1), formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}
Then replace the v.size with f(v), as in the sketch below.
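Putting the pieces together, a minimal end-to-end sketch (the input RDD name lines is assumed; its rows look like the sample above):
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.rdd.RDD

implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)

def f(v: Iterable[Array[String]]): Int = {
  // Building the formatter inside the function keeps the Spark closure free of non-serializable captured state.
  val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
  val parsedDates = v.map(arr => LocalDate.parse(arr(1), formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}

// lines: RDD[String] with rows like "1994,1994-06-18"
val result: RDD[(String, Int)] = lines
  .map(_.split(","))   // Array(year, date)
  .groupBy(_(0))       // group the rows by year
  .mapValues(f)        // days between the earliest and latest date of that year

result.collect().foreach(println)   // (year, length) pairs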

Pyspark automatically rename repeated columns

I want to automatically rename repeated columns of a df. For example:
df
Out[4]: DataFrame[norep1: string, num1: string, num1: bigint, norep2: bigint, num1: bigint, norep3: bigint]
Apply some function to end with a df like:
f_rename_repcol(df)
Out[4]: DataFrame[norep1: string, num1_1: string, num1_2: bigint, norep2: bigint, num1_3: bigint, norep3: bigint]
I've already created my own function, and it works, but I'm sure there is a shorter and better way of doing it:
def f_df_col_renombra_rep(df):
    from collections import Counter
    from itertools import chain
    import numpy as np
    import pandas as pd
    columnas_original = np.array(df.columns)
    d1 = Counter(df.columns)
    i_corrige = [a > 1 for a in dict(d1.items()).values()]
    var_corrige = np.array(list(dict(d1.items()).keys()))[i_corrige]
    var_corrige_2 = [a for a in columnas_original if a in var_corrige]
    columnas_nuevas = []
    for var in var_corrige:
        aux_corr = [a for a in var_corrige_2 if a in var]
        i = 0
        columnas_nuevas_aux = []
        for valor in aux_corr:
            i += 1
            nombre_nuevo = valor + "_" + str(i)
            columnas_nuevas_aux.append(nombre_nuevo)
        columnas_nuevas.append(columnas_nuevas_aux)
    columnas_nuevas = list(chain.from_iterable(columnas_nuevas))
    indice_cambio = pd.Series(columnas_original).isin(var_corrige)
    i = 0
    j = 0
    colsalida = [None] * len(df.columns)
    for col in df.columns:
        if indice_cambio[i] == True:
            colsalida[i] = columnas_nuevas[j]
            j += 1
        else:
            colsalida[i] = col
            # name is not changed
        i += 1
    df_out = df.toDF(*colsalida)
    return df_out
You can modify the renaming function here to suit your needs, but broadly I find this the best way to rename all the duplicate columns:
old_col = df.schema.names
running_list = []
new_col = []
i = 0
for column in old_col:
    if column in running_list:
        new_col.append(column + "_" + str(i))
        i = i + 1
    else:
        new_col.append(column)
        running_list.append(column)
print(new_col)
This is the conversion I do; the exact suffix assigned to the duplicate columns does not matter much, as long as the name (prefix) stays the same and I can save the file.
To update the columns you can simply run:
df = df.toDF(*new_col)
This should update the column names and resolve all the duplicate names.
If you want to keep the numbering as _1, _2, _3:
You can use a dictionary and a try/except block:
dict = {}
new_col = []
for column in old_col:
    try:
        i = dict[column] + 1
        new_col.append(column + "_" + str(i))
        dict[column] = i
    except KeyError:
        dict[column] = 1
        new_col.append(column + "_" + str(1))
print(new_col)
the easy way I am doing it is:
def col_duplicates(self):
    '''rename dataframe with dups'''
    columnas = self.columns.copy()
    for i in range(len(columnas) - 1):
        for j in range(i + 1, len(columnas), 1):
            if columnas[i] == columnas[j]:
                columnas[j] = columnas[i] + '_dup_' + str(j)  # this line controls how to rename
    return self.toDF(*columnas)
use as:
new_df_without_duplicates = col_duplicates(df_with_duplicates)

selecting max value in the column

I have data like this:
TagID,ListnerID,Timestamp,Sum_RSSI
2,101,1496745906,90
3,102,1496745907,70
3,104,1496745906,80
2,101,1496745909,60
4,106,1496745908,60
My expected output would be
2,101,1496745906,90
3,104,1496745906,80
4,106,1496745908,60
I tried this:
val high_window = Window.partitionBy($"tagShortID")
val prox = averageDF
.withColumn("rank", row_number().over(window.orderBy($"Sum_RSSI".desc)))
.filter($"rank" === 1)
But it prints all the rows. Any help would be appreciated.
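The likely issue is that the ordered window passed to row_number is not the partitioned high_window defined above, so the ranking is not done per tag. A minimal sketch of the fix (column names taken from the attempt; they must match the actual DataFrame schema):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._   // for the $"..." column syntax, assuming a SparkSession named spark

// Rank rows within each tag partition by Sum_RSSI descending and keep only the top row.
val high_window = Window.partitionBy($"tagShortID").orderBy($"Sum_RSSI".desc)

val prox = averageDF
  .withColumn("rank", row_number().over(high_window))
  .filter($"rank" === 1)
  .drop("rank")

prox.show(false)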

XML parsing using Scala

I have the following XML file which I want to parse using Scala:
<infoFile xmlns="http://latest/nmc-omc/cmNrm.doc#info" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://latest/nmc-omc/cmNrm.doc#info schema\pmResultSchedule.xsd">
<fileHeader fileFormatVersion="123456" operator="ABCD">
<fileSender elementType="MSC UTLI"/>
<infoCollec beginTime="2011-05-15T00:00:00-05:00"/>
</fileHeader>
<infoCollecData>
<infoMes infoMesID="551727">
<mesPeriod duration="TT1234" endTime="2011-05-15T00:30:00-05:00"/>
<mesrePeriod duration="TT1235"/>
<mesTypes>5517271 5517272 5517273 5517274 </measTypes>
<mesValue mesObj="RPC12/LMI_ANY:Label=BCR-1232_1111, ANY=1111">
<mesResults>149 149 3 3 </mesResults>
</mesValue>
</infoMes>
<infoMes infoMesID="551728">
<mesTypes>6132413 6132414 6132415</mesTypes>
<mesValue measObjLdn="RPC12/LMI_ANY:Label=BCR-1232_64446, CllID=64446">
<mesResults>0 0 6</mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64447, CllID=64447">
<mesResults>0 1 6</mesResults>
</mesValue>
</infoMes>
<infoMes infoMesID="551729">
<mesTypes>6132416 6132417 6132418 6132419</mesTypes>
<mesValue measObjLdn="RPC12/LMI_ANY:Label=BCR-1232_64448, CllID=64448">
<mesResults>1 4 6 8</mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64449, CllID=64449">
<mesResults>1 2 4 5 </mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64450, CllID=64450">
<mesResults>1 7 8 5 </mesResults>
</mesValue>
</infoMes>
</infoCollecData>
I want the file to be parsed as follows:
From the fileHeader I want to be able to extract the operator name and then the beginTime.
Next scenario: extract the information which contains CllID, then get its mesTypes and mesResults respectively.
As the file contains a number of entries with different CllID values, I want the final result like this:
CllID date time mesTypes mesResults
64446 2011-05-15 00:00:00 6132413 0
64446 2011-05-15 00:00:00 6132414 0
64446 2011-05-15 00:00:00 6132415 6
64447 2011-05-15 00:00:00 6132413 0
64447 2011-05-15 00:00:00 6132414 1
64447 2011-05-15 00:00:00 6132415 6
How could I achieve this? Here is what I have tried so far:
import java.io._
import scala.xml.Node

object xml_parser {
  def main(args: Array[String]) = {
    val input_xmlFile = scala.xml.XML.loadFile("C:/Users/ss.xml")
    val fileHeader = input_xmlFile \ "fileHeader"
    val vendorName = input_xmlFile \ "fileHeader" \ "#operator"
    val dateTime = input_xmlFile \ "fileHeader" \ "infoCollec" \ "#beginTime"
    val date = dateTime.text.split("T")(0)
    val time = dateTime.text.split("T")(1).split("-")(0)
    val CcIds = (input_xmlFile \ "infoCollecData" \ "infoMes" \\ "mesTypes")
    val cids = CcIds.text.split("\\s+").toList
    val CounterValues = (input_xmlFile \ "infoCollecData" \\ "infoMes" \\ "mesValue" \\ "#meaObj")
    println(date); println(time); print(cids)
  }
}
May I suggest kantan.xpath? It seems like it should sort your problem rather easily.
Assuming your XML data is available in file data, you can write:
import kantan.xpath.implicits._
val xml = data.asUnsafeNode
// Date format to parse dates. Put in the right format.
// Note that this uses java.util.Date, you could also use the joda time module.
implicit val format = ???
// Extract the header data
xml.evalXPath[java.util.Date](xp"//fileheader/infocollec/#begintime")
xml.evalXPath[String](xp"//fileheader/#operator")
// Get the required infoMes nodes as a list, turn each one into whatever data type you need.
xml.evalXPath[List[Node]](xp"//infomes/mesvalue[contains(#measobjldn, 'CllID')]/..").map { node =>
...
}
Extracting the CllID bit is not terribly complicated with the right regular expression, you could either use the standard Scala Regex class or kantan.regex for something a bit more type safe but that might be overkill here.
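For example, a small sketch with the standard Scala Regex class (the sample attribute value is taken from the XML above):
// Pull the numeric CllID out of a measObjLdn attribute value.
val CllIdPattern = """CllID=(\d+)""".r.unanchored

"RPC12/LMI_ANY:Label=BCR-1232_64446, CllID=64446" match {
  case CllIdPattern(id) => println(id)   // prints 64446
  case _                => println("no CllID found")
}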
The following code can implement what you want, according to your XML format:
def main(args: Array[String]): Unit = {
  val inputFile = xml.XML.loadFile("C:/Users/ss.xml")
  val fileHeader = inputFile \ "fileHeader"
  val beginTime = fileHeader \ "infoCollec"
  val res = beginTime.map(_.attribute("beginTime")).apply(0).get.text
  val dateTime = res.split("T")
  val date = dateTime(0)
  val time = dateTime(1).split("-").apply(0)

  val title = ("CllID", "date", "time", "mesTypes", "mesResults")
  println(s"${title._1}\t${title._2}\t\t${title._3}\t\t${title._4}\t${title._5}")

  val infoMesNodeList = (inputFile \\ "infoMes").filter { node =>
    (node \ "mesValue").exists(_.attribute("measObjLdn").nonEmpty)
  }
  infoMesNodeList.foreach { node =>
    val mesTypesList = (node \ "mesTypes").text.split(" ").map(_.toInt)
    (node \ "mesValue").foreach { node =>
      val mesResultsList = (node \ "mesResults").text.split(" ").map(_.toInt)
      val CllID = node.attribute("measObjLdn").get.text.split(",").apply(1).split("=").apply(1).toInt
      val res = (mesTypesList zip mesResultsList).map(item => (CllID, date, time, item._1, item._2))
      res.foreach(item => println(s"${item._1}\t${item._2}\t${item._3}\t${item._4}\t\t${item._5}"))
    }
  }
}
Notes: your XML file does not have the right format:
1) the closing </infoFile> tag is missing at the end of the file
2) line 11 closes the <mesTypes> element with a mismatched </measTypes> tag; it should be </mesTypes>