I have the following XML file which I want to parse using Scala:
<infoFile xmlns="http://latest/nmc-omc/cmNrm.doc#info" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://latest/nmc-omc/cmNrm.doc#info schema\pmResultSchedule.xsd">
<fileHeader fileFormatVersion="123456" operator="ABCD">
<fileSender elementType="MSC UTLI"/>
<infoCollec beginTime="2011-05-15T00:00:00-05:00"/>
</fileHeader>
<infoCollecData>
<infoMes infoMesID="551727">
<mesPeriod duration="TT1234" endTime="2011-05-15T00:30:00-05:00"/>
<mesrePeriod duration="TT1235"/>
<mesTypes>5517271 5517272 5517273 5517274 </measTypes>
<mesValue mesObj="RPC12/LMI_ANY:Label=BCR-1232_1111, ANY=1111">
<mesResults>149 149 3 3 </mesResults>
</mesValue>
</infoMes>
<infoMes infoMesID="551728">
<mesTypes>6132413 6132414 6132415</mesTypes>
<mesValue measObjLdn="RPC12/LMI_ANY:Label=BCR-1232_64446, CllID=64446">
<mesResults>0 0 6</mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64447, CllID=64447">
<mesResults>0 1 6</mesResults>
</mesValue>
</infoMes>
<infoMes infoMesID="551729">
<mesTypes>6132416 6132417 6132418 6132419</mesTypes>
<mesValue measObjLdn="RPC12/LMI_ANY:Label=BCR-1232_64448, CllID=64448">
<mesResults>1 4 6 8</mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64449, CllID=64449">
<mesResults>1 2 4 5 </mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64450, CllID=64450">
<mesResults>1 7 8 5 </mesResults>
</mesValue>
</infoMes>
</infoCollecData>
I want the file to be parsed as follows:
From the fileHeader I want to extract the operator name, and then the beginTime.
Next scenario: extract the information which contains CllID, then get its mesTypes and mesResults respectively.
As the file contains a number of entries with different CllIDs, I want the final result like this:
CllID date time mesTypes mesResults
64446 2011-05-15 00:00:00 6132413 0
64446 2011-05-15 00:00:00 6132414 0
64446 2011-05-15 00:00:00 6132415 6
64447 2011-05-15 00:00:00 6132413 0
64447 2011-05-15 00:00:00 6132414 1
64447 2011-05-15 00:00:00 6132415 6
How could I achieve this? Here is what I have tried so far:
import java.io._
import scala.xml.Node

object xml_parser {
  def main(args: Array[String]): Unit = {
    val input_xmlFile = scala.xml.XML.loadFile("C:/Users/ss.xml")
    val fileHeader = input_xmlFile \ "fileHeader"
    val vendorName = input_xmlFile \ "fileHeader" \ "@operator"
    val dateTime = input_xmlFile \ "fileHeader" \ "infoCollec" \ "@beginTime"
    val date = dateTime.text.split("T")(0)
    val time = dateTime.text.split("T")(1).split("-")(0)
    val CcIds = input_xmlFile \ "infoCollecData" \ "infoMes" \\ "mesTypes"
    val cids = CcIds.text.split("\\s+").toList
    val CounterValues = input_xmlFile \ "infoCollecData" \\ "infoMes" \\ "mesValue" \\ "@measObjLdn"
    println(date); println(time); print(cids)
  }
}
May I suggest kantan.xpath? It seems like it should sort your problem rather easily.
Assuming your XML data is available in file data, you can write:
import kantan.xpath.implicits._
val xml = data.asUnsafeNode
// Date format to parse dates. Put in the right format.
// Note that this uses java.util.Date, you could also use the joda time module.
implicit val format = ???
// Extract the header data
xml.evalXPath[java.util.Date](xp"//fileHeader/infoCollec/@beginTime")
xml.evalXPath[String](xp"//fileHeader/@operator")
// Get the required infoMes nodes as a list, turn each one into whatever data type you need.
xml.evalXPath[List[Node]](xp"//infoMes/mesValue[contains(@measObjLdn, 'CllID')]/..").map { node =>
...
}
Extracting the CllID bit is not terribly complicated with the right regular expression, you could either use the standard Scala Regex class or kantan.regex for something a bit more type safe but that might be overkill here.
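For illustration, here is a minimal sketch of that regex approach using the standard Scala Regex class, with a sample measObjLdn value copied from the question's XML:

```scala
// Hypothetical sketch: pull the numeric CllID out of a measObjLdn attribute value
val measObjLdn = "RPC12/LMI_ANY:Label=BCR-1232_64446, CllID=64446"
val cllIdPattern = """CllID=(\d+)""".r

// findFirstMatchIn returns an Option, so values without a CllID are handled safely
val cllId: Option[Int] =
  cllIdPattern.findFirstMatchIn(measObjLdn).map(_.group(1).toInt)
// cllId == Some(64446)
```

Values like the first infoMes (whose mesObj carries no CllID) simply yield None instead of throwing.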
The following code implements what you want, given your XML format:
def main(args: Array[String]): Unit = {
val inputFile = xml.XML.loadFile("C:/Users/ss.xml")
val fileHeader = inputFile \ "fileHeader"
val beginTime = fileHeader \"infoCollec"
val res = beginTime.map(_.attribute("beginTime")).apply(0).get.text
val dateTime = res.split("T")
val date = dateTime(0)
val time = dateTime(1).split("-").apply(0)
val title = ("CllID", "date", "time", "mesTypes", "mesResults")
println(s"${title._1}\t${title._2}\t\t${title._3}\t\t${title._4}\t${title._5}")
val infoMesNodeList = (inputFile \\ "infoMes").filter{node => (node \ "mesValue").exists(_.attribute("measObjLdn").nonEmpty)}
infoMesNodeList.foreach{ node =>
val mesTypesList = (node \ "mesTypes").text.split(" ").map(_.toInt)
(node \ "mesValue").foreach { node =>
val mesResultsList = (node \ "mesResults").text.split(" ").map(_.toInt)
val CllID = node.attribute("measObjLdn").get.text.split(",").apply(1).split("=").apply(1).toInt
val res = (mesTypesList zip mesResultsList).map(item => (CllID, date, time, item._1, item._2))
res.foreach(item => println(s"${item._1}\t${item._2}\t${item._3}\t${item._4}\t\t${item._5}"))
}
}
}
Notes: your XML file is not well-formed:
1) the closing </infoFile> tag is missing on the last line of the file
2) line 11 has a mismatched tag: the closing </measTypes> should be </mesTypes> to match its opening tag
My dataframe returns the below result as a String.
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], signature={"cbcnt":"String"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}
I just want
"cbcnt":0 <-- Numeric part of this
Expected Output
col
----
0
2021-07-30
Tried:
.withColumn("CbRes",regexp_extract($"Col", """"cbcnt":(\S*\d+)""", 1))
Output
col
----
0
"2021-07-30 00:00:00    <-- an additional " is coming
Using the Pyspark function regexp_extract:
from pyspark.sql import functions as F
df = <dataframe with a column "text" that contains the input data>
df.withColumn("col", F.regexp_extract("text", """"cbcnt":(\d+)""", 1)).show()
Extract via regex:
val value = "QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{\"cbcnt\":0}], signature={\"cbcnt\":\"number\"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |"
val regex = """"cbcnt":(\d+)""".r.unanchored
val s"${regex(result)}" = value
println(result)
Output:
0
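If you also need the date variant from the second QueryResult line (where the value is a quoted string rather than a number), one option is to try a date alternative before the plain-integer one in the capture group. A sketch, assuming the date always arrives in ISO yyyy-MM-dd form:

```scala
// Try a yyyy-MM-dd date first, then fall back to a plain integer;
// the optional quote ("?) covers both the numeric and the string form
val cbRegex = """"cbcnt":"?(\d{4}-\d{2}-\d{2}|\d+)""".r.unanchored

val numeric = """allRows=[{"cbcnt":0}]"""
val dated   = """allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}]"""

val results = Seq(numeric, dated).map { case cbRegex(v) => v }
// results == Seq("0", "2021-07-30")
```

The alternation order matters: with `\d+` first, the date case would stop matching at "2021".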
I had a use-case to fetch the last 12 months, with the end date of each month, from a given date.
For example, if I give the input ('2021-04-23'), the output should be:
output1 = ('2021-04-30', '2021-03-31', '2021-02-28', '2021-01-31', '2020-12-31', '2020-11-30', '2020-10-31', '2020-09-30', '2020-08-31', '2020-07-31', '2020-06-30', '2020-05-31', '2020-04-30')
output2=('2021-04-01','2021-03-01','2021-02-01','2021-01-01','2020-12-01','2020-11-01','2020-10-01','2020-09-01', '2020-08-01','2020-07-01','2020-06-01','2020-05-01','2020-04-01')
I have this code snippet:
import java.time.YearMonth
import java.time.format.DateTimeFormatter

val monthDate = DateTimeFormatter.ofPattern("yyyy-MM")
val start = YearMonth.parse("2021-04", monthDate)
val lastTwelveMonths = (0 to 12).map(x => start.minusMonths(x).format(monthDate)).toList

which returns the last 12 months from the current month. Can anyone please provide a solution which includes the end date for each of the previous 12 months too? Thanks
You can use java.time.LocalDate's withDayOfMonth() for what you need:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val dateFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val inputDate = LocalDate.parse("2021-04-23")
(0 to 12).map{ n =>
inputDate.minusMonths(n).withDayOfMonth(1).format(dateFormat)
}
// Vector(2021-04-01, 2021-03-01, 2021-02-01, 2021-01-01, 2020-12-01, 2020-11-01, 2020-10-01, 2020-09-01, 2020-08-01, 2020-07-01, 2020-06-01, 2020-05-01, 2020-04-01)
(0 to 12).map{ n =>
val prevDate = inputDate.minusMonths(n)
prevDate.withDayOfMonth(prevDate.lengthOfMonth).format(dateFormat)
}
// Vector(2021-04-30, 2021-03-31, 2021-02-28, 2021-01-31, 2020-12-31, 2020-11-30, 2020-10-31, 2020-09-30, 2020-08-31, 2020-07-31, 2020-06-30, 2020-05-31, 2020-04-30)
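If you want both outputs side by side, the two loops can be merged into a single pass that yields a (month start, month end) pair per month. A small sketch building on the same inputDate and dateFormat as above:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val dateFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val inputDate = LocalDate.parse("2021-04-23")

// One pass: (first day of month, last day of month) for each of the 13 months
val monthBounds = (0 to 12).map { n =>
  val d = inputDate.minusMonths(n)
  (d.withDayOfMonth(1).format(dateFormat),
   d.withDayOfMonth(d.lengthOfMonth).format(dateFormat))
}
// monthBounds.head == ("2021-04-01", "2021-04-30")
```

lengthOfMonth handles February and leap years for you, so no special-casing is needed.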
I have a Spark script that establishes a connection to Hive, reads data from different databases, and then writes the union into a CSV file. I tested it with two databases and it took 20 minutes. Now I am trying it with 11 databases and it has been running since yesterday evening (18 hours!). The script is supposed to get between 400,000 and 800,000 rows per database.
My question is: is 18 hours normal for such jobs? If not, how can I optimize it? This is what my main does:
// This is a list of the ten first databases used:
var use_database_sigma = List( Parametre_vigiliste.sourceDbSigmaGca, Parametre_vigiliste.sourceDbSigmaGcm
,Parametre_vigiliste.sourceDbSigmaGge, Parametre_vigiliste.sourceDbSigmaGne
,Parametre_vigiliste.sourceDbSigmaGoc, Parametre_vigiliste.sourceDbSigmaGoi
,Parametre_vigiliste.sourceDbSigmaGra, Parametre_vigiliste.sourceDbSigmaGsu
,Parametre_vigiliste.sourceDbSigmaPvl, Parametre_vigiliste.sourceDbSigmaLbr)
val grc = Tables.getGRC(spark) // This creates the first dataframe
var sigma = Tables.getSIGMA(spark, use_database_sigma(0)) // This creates other dataframe which is the union of ten dataframes (one database each)
for(i <- 1 until use_database_sigma.length)
{
if (use_database_sigma(i) != "")
{
sigma = sigma.union(Tables.getSIGMA(spark, use_database_sigma(i)))
}
}
// writing into csv file
val grc_sigma=sigma.union(grc) // union of the 2 dataframes
grc_sigma.cache
LogDev.ecrireligne("total : " + grc_sigma.count())
grc_sigma.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", true).option("delimiter", "|").save(Parametre_vigiliste.cible)
val conf = new Configuration()
val fs = FileSystem.get(conf)
val file = fs.globStatus(new Path(Parametre_vigiliste.cible + "/part*"))(0).getPath().getName();
fs.rename(new Path(Parametre_vigiliste.cible + "/" + file), new Path(Parametre_vigiliste.cible + "/" + "FIC_PER_DATALAKE_.csv"));
grc_sigma.unpersist()
Not written in an IDE so it might be off somewhere, but you get the general idea. Caching each frame up front avoids re-reading the sources, but be aware that `repartition(1)` (needed to get a single output file) funnels the entire union through one task, so the final write cannot parallelize and is often the main bottleneck in jobs like this.
val frames = Seq("table1", "table2").map{ table =>
spark.read.table(table).cache()
}
frames
.reduce(_.union(_)) //or unionByName() if the columns aren't in the same order
.repartition(1)
.write
.mode(SaveMode.Overwrite)
.format("csv")
.options(Map("header" -> "true", "delimiter" -> "|"))
.save("filePathName")
I have an RDD of type RDD[String]; as an example, here is a part of it:
1990,1990-07-08
1994,1994-06-18
1994,1994-06-18
1994,1994-06-22
1994,1994-06-22
1994,1994-06-26
1994,1994-06-26
1954,1954-06-20
2002,2002-06-26
1954,1954-06-23
2002,2002-06-29
1954,1954-06-16
2002,2002-06-30
...
result:
(1982,52)
(2006,64)
(1962,32)
(1966,32)
(1986,52)
(2002,64)
(1994,52)
(1974,38)
(1990,52)
(2010,64)
(1978,38)
(1954,26)
(2014,64)
(1958,35)
(1998,64)
(1970,32)
I group it nicely, but my problem is the v.size part: I do not know how to calculate that length.
Just to put it in perspective, the expected results are shown above.
It is not a mistake that 2002 appears twice, but ignore that.
define the date format:
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
and the ordering:
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
create a function that receives "v" and returns MAX(date_of_matching_year) - MIN(date_of_matching_year) = LENGTH (in days):
def f(v: Iterable[Array[String]]): Int = {
  val parsedDates = v.map(LocalDate.parse(_(1), formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}
then replace v.size with f(v)
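Put together as a runnable sketch, with the implicit ordering in scope for max/min and ChronoUnit.DAYS instead of getDayOfYear (the subtraction then also stays correct if a group ever spans a year boundary), using a few rows from the question's sample data:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)

// MAX(date) - MIN(date) in days for one year's group of rows
def f(v: Iterable[Array[String]]): Int = {
  val parsedDates = v.map(LocalDate.parse(_(1), formatter))
  ChronoUnit.DAYS.between(parsedDates.min, parsedDates.max).toInt
}

// A hypothetical group for key 1994, as produced by the groupBy
val group1994 = Seq(
  Array("1994", "1994-06-18"),
  Array("1994", "1994-06-22"),
  Array("1994", "1994-06-26")
)
f(group1994) // 8
```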
I want to convert the 12-hour time format into the 24-hour time format. I attached my code here, and I checked link1 and link2, but it returns the same time format.
Code
val inTime = "12:15 PM"
val newTimeFormat = new SimpleDateFormat("hh:mm a")
val timeWithDateFormat = newTimeFormat.parse(inTime)
val outputTime = newTimeFormat.format(timeWithDateFormat)
println("Output===========>",outputTime)
Output is:
(Output===========>,12:15 PM)
How can I resolve it?
As you want your output to be in a different format than your input, you will need to use different formatters for the input and the output.
Also... 12:15 PM in the 12-hour format is 12:15 in the 24-hour format, so you should use a different time for this example (let's use 03:15 PM, which is 15:15):
val inTime = "03:15 PM"
val inputTimeFormat = new SimpleDateFormat("hh:mm a")
val timeWithDateFormat = inputTimeFormat.parse(inTime)
val outputTimeFormat = new SimpleDateFormat("HH:mm")
val outputTime = outputTimeFormat.format(timeWithDateFormat)
println("Output===========>", outputTime)
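SimpleDateFormat works but is legacy (and not thread-safe); the same conversion with java.time, which is the usual choice on Java 8+, would look roughly like this:

```scala
import java.time.LocalTime
import java.time.format.DateTimeFormatter
import java.util.Locale

// Locale.ENGLISH so the "PM" marker parses regardless of the default locale
val inputFormat = DateTimeFormatter.ofPattern("hh:mm a", Locale.ENGLISH)
val outputFormat = DateTimeFormatter.ofPattern("HH:mm")

val outputTime = LocalTime.parse("03:15 PM", inputFormat).format(outputFormat)
// outputTime == "15:15"
```

As with SimpleDateFormat, lowercase hh is the 12-hour field and uppercase HH the 24-hour one.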
Just create a new SimpleDateFormat and use it to format your date:
val format = new SimpleDateFormat("HH:mm")
For details you can check the documentation:
Customizing Formats