Extract text from a large file using PowerShell

We have an application that produces many large log files, which I want to parse using PowerShell and output as CSV or text with the delimiter '|'. I tried to use Select-String, but couldn't get the expected result. Below I have posted the log format and the expected result.
Log File data:
How to achieve the above result using PowerShell?
Thanks

As mentioned in my comment, you'll need to split the log into separate records and match each record against a fairly complex regular expression.
See the RegEx live on regex101
Study the explanation of each element in the upper right corner of that link.
This script:
## Q:\Test\2018\11\29\SO_53541952.ps1
$LogFile = '.\SO_53541952.log'
$CsvFile = '.\SO_53541952.csv'
$ExcelFile='.\SO_53541952.xlsx'
## see the regex live <https://regex101.com/r/1TWm7i/1>
$RE = [RegEx]"(?sm)^Submitter Id +=> (?<SubmitterID>.*?$).*?^Start Time +=> (?<StartTime>[0-9:]{8}) +Start Date +=> (?<StartDate>[0-9\/]{10}).*?^Message Text +=> (?<MessageText>.*?$).*?^Src File +=> (?<SrcFile>.*?$).*?^Dest File +=> (?<DestFile>.*?$)"
$Data = (Get-Content $LogFile -raw) -split "(?sm)(?=^Record Id)" | ForEach-Object {
    If ($_ -match $RE){
        [PSCustomObject]@{
            'Submitter Id' = $Matches.SubmitterId
            'Start Time'   = $Matches.StartTime
            'Start Date'   = $Matches.StartDate
            'Message Text' = $Matches.MessageText
            'Src File'     = $Matches.SrcFile
            'Dest File'    = $Matches.DestFile
        }
    }
}
$Data | Format-Table -Auto
$Data | Export-Csv $CsvFile -NoTypeInformation -Delimiter '|'
#$Data | Out-Gridview
## with the ImportExcel module you can directly generate an excel file
$Data | Export-Excel $ExcelFile -AutoSize # -Show
has this sample output on screen (I modified the samples to be distinguishable):
> .\SO_53541952.ps1
Submitter Id Start Time Start Date Message Text           Src File Dest File
------------ ---------- ---------- ------------           -------- ---------
STMDA#432... 00:02:51   11/29/2018 Copy step successfu... File1... c\temp...
STMDA#432... 00:02:52   11/29/2018 Copy step successfu... File2... c\temp...
STMDA#432... 00:02:53   11/29/2018 Copy step successfu... File3... c\temp...
STMDA#432... 00:02:54   11/29/2018 Copy step successfu... File4... c\temp...
and, with Doug Finke's ImportExcel module installed, it also writes an .xlsx file directly.

As LotPings suggested, you need to break up the log file content into separate blocks.
Then using regex, you can capture the required values and store them in objects which you can then export to a CSV file.
Something like this:
$log = @"
------------------------------------------------------------------------------
Record Id => STM
Process Name => STMDA Stat Log Time => 00:02:59
Process Number => 51657 Stat Log Date => 11/29/2018
Submitter Id => STMDA#4322
SNode User Id => de34fc5
Start Time => 00:02:59 Start Date => 11/29/2018
Stop Time => 00:02:59 Stop Date => 11/29/2018
SNODE => dfdvrvbsdfgg
Completion Code => 0
Message Id => ncpa
Message Text => Copy step successful.
Ckpt=> Y Lkfl=> N Rstr=> N XLat=> Y
FASP=> N
From Node => P
Src File => File2
Dest File => c\temp2
Src CCode => 0 Dest CCode => 0
Src Msgid => ncpa Dest Msgid => ncpa
Bytes Read => 4000 Bytes Written => 4010
Records Read => 5 Records Written => 5
Bytes Sent => 4010 Bytes Received => 4010
RUs Sent => 0 RUs Received => 1
------------------------------------------------------------------------------
Record Id => STM
Process Name => STMDA Stat Log Time => 00:02:59
Process Number => 51657 Stat Log Date => 11/29/2018
Submitter Id => STMDA#4321
SNode User Id => de34fc5
Start Time => 00:02:59 Start Date => 11/29/2018
Stop Time => 00:02:59 Stop Date => 11/29/2018
SNODE => dfdvrvbsdfgg
Completion Code => 0
Message Id => ncpa
Message Text => Copy step successful.
Ckpt=> Y Lkfl=> N Rstr=> N XLat=> Y
FASP=> N
From Node => P
Src File => File1
Dest File => c\temp1
Src CCode => 0 Dest CCode => 0
Src Msgid => ncpa Dest Msgid => ncpa
Bytes Read => 4000 Bytes Written => 4010
Records Read => 5 Records Written => 5
Bytes Sent => 4010 Bytes Received => 4010
RUs Sent => 0 RUs Received => 1
------------------------------------------------------------------------------
Record Id => STM
Process Name => STMDA Stat Log Time => 00:02:59
Process Number => 51657 Stat Log Date => 11/29/2018
Submitter Id => STMDA#4323
SNode User Id => de34fc5
Start Time => 00:02:59 Start Date => 11/29/2018
Stop Time => 00:02:59 Stop Date => 11/29/2018
SNODE => dfdvrvbsdfgg
Completion Code => 0
Message Id => ncpa
Message Text => Copy step successful.
Ckpt=> Y Lkfl=> N Rstr=> N XLat=> Y
FASP=> N
From Node => P
Src File => File3
Dest File => c\temp3
Src CCode => 0 Dest CCode => 0
Src Msgid => ncpa Dest Msgid => ncpa
Bytes Read => 4000 Bytes Written => 4010
Records Read => 5 Records Written => 5
Bytes Sent => 4010 Bytes Received => 4010
RUs Sent => 0 RUs Received => 1
------------------------------------------------------------------------------
Record Id => STM
Process Name => STMDA Stat Log Time => 00:02:59
Process Number => 51657 Stat Log Date => 11/29/2018
Submitter Id => STMDA#4324
SNode User Id => de34fc5
Start Time => 00:02:59 Start Date => 11/29/2018
Stop Time => 00:02:59 Stop Date => 11/29/2018
SNODE => dfdvrvbsdfgg
Completion Code => 0
Message Id => ncpa
Message Text => Copy step successful.
Ckpt=> Y Lkfl=> N Rstr=> N XLat=> Y
FASP=> N
From Node => P
Src File => File4
Dest File => c\temp4
Src CCode => 0 Dest CCode => 0
Src Msgid => ncpa Dest Msgid => ncpa
Bytes Read => 4000 Bytes Written => 4010
Records Read => 5 Records Written => 5
Bytes Sent => 4010 Bytes Received => 4010
RUs Sent => 0 RUs Received => 1
------------------------------------------------------------------------------
"@
# first break the log into 'Record Id' blocks
$blocks = @()
$regex = [regex] '(?m)(Record Id[^-]+)'
$match = $regex.Match($log)
while ($match.Success) {
    $blocks += $match.Value
    $match = $match.NextMatch()
}
# next, parse out the required values for each block and create objects to export
$blocks | ForEach-Object {
    if ($_ -match '(?s)Submitter Id\s+=>\s+(?<submitter>[^\s]+).+Start Time\s+=>\s+(?<starttime>[^\s]+)\s+Start Date\s+=>\s+(?<startdate>[^\s]+).+Message Text\s+=>\s+(?<messagetext>[\w ,.;-_]+).+Src File\s+=>\s+(?<sourcefile>[\w ,.;-_]+).+Dest File\s+=>\s+(?<destinationfile>[\w ,.;-_]+)') {
        [PSCustomObject]@{
            'Submitter Id' = $matches['submitter']
            'Start Time'   = $matches['starttime']
            'Start Date'   = $matches['startdate']
            'Message Text' = $matches['messagetext']
            'Src File'     = $matches['sourcefile']
            'Dest File'    = $matches['destinationfile']
        }
    }
} | Export-Csv -Path '<PATH_TO_YOUR_OUTPUT_CSV>' -Delimiter '|' -NoTypeInformation
This will result in a csv file with the following content:
"Submitter Id"|"Start Time"|"Start Date"|"Message Text"|"Src File"|"Dest File"
"STMDA#4322"|"00:02:59"|"11/29/2018"|"Copy step successful."|"File2"|"c\temp2"
"STMDA#4321"|"00:02:59"|"11/29/2018"|"Copy step successful."|"File1"|"c\temp1"
"STMDA#4323"|"00:02:59"|"11/29/2018"|"Copy step successful."|"File3"|"c\temp3"
"STMDA#4324"|"00:02:59"|"11/29/2018"|"Copy step successful."|"File4"|"c\temp4"

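For comparison, here is a minimal sketch of the same split-into-records-then-match idea in Python, reading the log line by line so that a large file never has to sit in memory as one string. The names `iter_records`, `parse_record` and `PATTERNS` are made up for this illustration, and the sample log is a shortened version of the records shown above; treat it as a sketch under those assumptions rather than a drop-in replacement for the PowerShell scripts.

```python
import csv
import io
import re

# Shortened sample: only the lines needed for the six wanted fields.
SAMPLE_LOG = """\
------------------------------------------------------------------------------
Record Id => STM
Submitter Id => STMDA#4322
Start Time => 00:02:59 Start Date => 11/29/2018
Message Text => Copy step successful.
Src File => File2
Dest File => c\\temp2
------------------------------------------------------------------------------
Record Id => STM
Submitter Id => STMDA#4321
Start Time => 00:02:59 Start Date => 11/29/2018
Message Text => Copy step successful.
Src File => File1
Dest File => c\\temp1
------------------------------------------------------------------------------
"""

# One pattern per output column (hypothetical names, chosen for this sketch).
PATTERNS = {
    "Submitter Id": re.compile(r"^Submitter Id\s*=>\s*(\S+)", re.M),
    "Start Time": re.compile(r"\bStart Time\s*=>\s*(\S+)"),
    "Start Date": re.compile(r"\bStart Date\s*=>\s*(\S+)"),
    "Message Text": re.compile(r"^Message Text\s*=>\s*(.+?)\s*$", re.M),
    "Src File": re.compile(r"^Src File\s*=>\s*(.+?)\s*$", re.M),
    "Dest File": re.compile(r"^Dest File\s*=>\s*(.+?)\s*$", re.M),
}


def iter_records(lines):
    """Yield one text block per record; a block starts at each 'Record Id' line."""
    block = []
    for line in lines:
        if line.startswith("Record Id") and block:
            yield "".join(block)
            block = []
        block.append(line)
    if block:
        yield "".join(block)


def parse_record(block):
    """Return a dict of the wanted fields, or None if any field is missing."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(block)
        if match is None:
            return None
        row[name] = match.group(1)
    return row


# Any iterable of lines works here, e.g. an open file object.
rows = [r for r in map(parse_record, iter_records(io.StringIO(SAMPLE_LOG))) if r]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(PATTERNS), delimiter="|", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Blocks that do not contain all six fields (such as the separator line before the first record) are skipped rather than reported, which may or may not be what you want for malformed records.
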
Related

scala fast range lookup on 2 columns

I have a spark dataframe that I am broadcasting as Array[Array[String]].
My requirement is to do a range lookup on 2 columns.
Right now I have something like ->
val cols = data.filter(_(0).toLong <= ip).filter(_(1).toLong >= ip).take(1) match {
  case Array(t) => t
  case _ => Array()
}
The following data file is stored as Array[Array[String]] (except for the header row that I have shown below only as reference.) and passed to the filter function shown above.
sample data file ->
startIPInt | endIPInt | lat | lon
676211200 | 676211455 | 37.33053 | -121.83823
16777216 | 16777342 | -34.9210644736842 | 138.598709868421
17081712 | 17081712 | 0 | 0
sample value to search ->
ip = 676211325
based on the range of the startIPInt and endIPInt values, I want the rest of the mapping rows.
This lookup takes 1-2 seconds per value, and I am not even sure the second filter condition is being executed (in debug mode it only ever seems to execute the first condition). Can someone suggest a faster and more reliable lookup?
Thanks!
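One common way to speed up this kind of lookup is to parse the bounds once, sort the rows by range start, and binary-search for each value instead of scanning and re-parsing every row. The sketch below shows the idea in Python using the sample rows from the question; it assumes the ranges do not overlap, and the names `table`, `starts` and `lookup` are invented for the illustration. The same structure carries over to a sorted Scala array.

```python
from bisect import bisect_right

# Rows mirror the sample data: startIPInt, endIPInt, lat, lon (all strings).
rows = [
    ["676211200", "676211455", "37.33053", "-121.83823"],
    ["16777216", "16777342", "-34.9210644736842", "138.598709868421"],
    ["17081712", "17081712", "0", "0"],
]

# One-time preparation: convert the bounds to integers and sort by range start.
table = sorted((int(r[0]), int(r[1]), r[2:]) for r in rows)
starts = [start for start, _end, _rest in table]


def lookup(ip):
    """Return the remaining columns of the range containing ip, or [] if none.

    Assumes ranges do not overlap, so only the last range starting at or
    before ip can contain it.
    """
    i = bisect_right(starts, ip) - 1
    if i < 0:
        return []
    _start, end, rest = table[i]
    return rest if ip <= end else []


print(lookup(676211325))
```

Each lookup is then O(log n) comparisons on integers, with no string-to-long conversion on the hot path.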

Scala Slick: Getting number of fields with a specific value in a group by query

I have a table like this:
imgId | pageId | isAnnotated
1     | 1      | true
2     | 1      | false
3     | 2      | true
4     | 1      | false
5     | 3      | false
6     | 2      | true
7     | 3      | true
8     | 3      | true
I want the result as:
pageId | imageCount | noOfAnotatedImage
1      | 3          | 1
2      | 2          | 2
3      | 3          | 2
For each page I want the number of annotated images, i.e. the rows where isAnnotated is true.
This is the Slick code I tried, which threw an exception:
def get = {
  val q = (for {
    c <- WebImage.webimage
  } yield (c.pageUrl, c.lastAccess, c.isAnnotated)).groupBy(a => (a._1, a._3)).map {
    case (a, b) => (a._1, b.map(_._2).max, b.filter(_._3 === true).length, b.length)
  }
  db.run(q.result)
}
Exception:
[SlickTreeException: Cannot convert node to SQL Comprehension
| Path s6._2 : Vector[t2<{s3: String', s4: Long', s5: Boolean'}>]
]
Note: the thread "Count the total records containing specific values" clearly shows that what I need is possible in plain SQL:
SELECT
Type
,sum(case Authorization when 'Accepted' then 1 else 0 end) Accepted
,sum(case Authorization when 'Denied' then 1 else 0 end) Denied
from MyTable
where Type = 'RAID'
group by Type
Changed the code but still getting exception:
Execution exception
[SlickException: No type for symbol s2 found for Ref s2]
In /home/ravinder/IdeaProjects/structurer/app/scrapper/Datastore.scala:60
56 def get = {
57 val q = (for {
58 c <- WebImage.webimage
59 } yield (c.pageUrl, c.lastAccess, c.isAnnotated)).groupBy(a => (a._1, a._3)).map{
[60] case(a,b) => (a._1, b.map(_._2).max, b.map(a => if (a._3.result == true) 1 else 0 ).sum, b.length)
61 }
62 db.run(q.result)
63 }
64
Given your requirement, you should group by only pageUrl so as to perform aggregation over all rows for the same page. You can aggregate lastAccess using max and isAnnotated using sum over a conditional Case.If-Then-Else. The Slick query should look something like the following:
val q = (for {
  c <- WebImage.webimage
} yield (c.pageUrl, c.lastAccess, c.isAnnotated)).
  groupBy( _._1 ).map{ case (url, grp) =>
    val lastAcc = grp.map( _._2 ).max
    val annoCnt = grp.map( _._3 ).map(
      anno => Case.If(anno === true).Then(1).Else(0)
    ).sum
    (url, lastAcc, annoCnt, grp.length)
  }
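To make the aggregation itself concrete, here is a small Python sketch of "group by page, count all rows, and sum 1 for each annotated row", run over the sample table from the question. It only illustrates the logic that the conditional sum expresses; the function name `annotated_counts` is made up, and it says nothing about how Slick compiles the query.

```python
from collections import defaultdict

# (imgId, pageId, isAnnotated) rows from the question's sample table.
images = [
    (1, 1, True),
    (2, 1, False),
    (3, 2, True),
    (4, 1, False),
    (5, 3, False),
    (6, 2, True),
    (7, 3, True),
    (8, 3, True),
]


def annotated_counts(rows):
    """Return (pageId, imageCount, annotatedCount) per page, sorted by pageId."""
    totals = defaultdict(lambda: [0, 0])
    for _img_id, page_id, is_annotated in rows:
        totals[page_id][0] += 1                         # COUNT(*)
        totals[page_id][1] += 1 if is_annotated else 0  # SUM(CASE WHEN ... THEN 1 ELSE 0 END)
    return [(page, count, annotated) for page, (count, annotated) in sorted(totals.items())]


print(annotated_counts(images))
```

The output matches the result table in the question: page 1 has 3 images with 1 annotated, page 2 has 2 with 2 annotated, and page 3 has 3 with 2 annotated.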

Which variables does PayPal send when subscription payments are processed periodically

I signed up for my own subscription and received these messages from PayPal. I displayed txn_type and subscr_date:
| txn_type | subscr_date |
| subscr_payment | NULL |
| subscr_signup | 05:04:37 May 15, 2017 PDT |
| subscr_cancel | 05:05:57 May 15, 2017 PDT |
Messages with txn_type subscr_payment only have subscr_id and nothing else.
I am interested in which messages will be sent when the recurring payment is executed next month, next year, and so on.
I suspect there will be just
| txn_type | subscr_date | subscr_id
| subscr_payment | NULL | SOME ID HERE
Can anyone tell me what kind of txn_type will be sent? I am having a hard time simulating this process.
When a payment happens, subscr_payment is the IPN type you will get. I'm not sure why you are seeing only minimal parameters. Here is a sample of a subscr_payment IPN:
Array
(
[mc_gross] => 5.00
[protection_eligibility] => Eligible
[address_status] => unconfirmed
[payer_id] => 8FXTJQD6PGD5N
[address_street] => 123 Test Ave.
[payment_date] => 22:30:42 Jan 04, 2017 PST
[payment_status] => Completed
[charset] => windows-1252
[address_zip] => 11000
[first_name] => Tester
[mc_fee] => 0.50
[address_country_code] => MA
[address_name] => HAMMA Omar
[notify_version] => 3.8
[subscr_id] => I-WCECH3SA87PT
[payer_status] => unverified
[business] => receiver@email.com
[address_country] => Morocco
[address_city] => Rabat
[verify_sign] => A3Y1IabViDnLM.hMAUvK-kr83JP5AaoMlP3UYuHFIfHdL4P5lBuXYBoQ
[payer_email] => payer@email.com
[txn_id] => 0RF86855U2529745U
[payment_type] => instant
[payer_business_name] => Testerson Tester
[last_name] => Testerson
[address_state] => Rabat
[receiver_email] => receiver@email.com
[payment_fee] => 0.50
[receiver_id] => G3LWKY98MHFFC
[txn_type] => subscr_payment
[item_name] => Times News
[mc_currency] => USD
[residence_country] => MA
[test_ipn] => 1
[transaction_subject] => Times News
[payment_gross] => 5.00
[ipn_track_id] => d95263f949f25
[ipn_url_name] => AE Sandbox
)
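Since IPN messages arrive as form-encoded POST bodies, a listener usually ends up branching on txn_type. The following Python sketch shows one way that dispatch could look for the three types mentioned in this thread. The function `handle_ipn` and its return values are hypothetical, and the sketch deliberately leaves out the verification round trip back to PayPal that a real listener needs before trusting a message.

```python
from urllib.parse import parse_qsl


def handle_ipn(body):
    """Classify a form-encoded IPN body by txn_type (illustrative only).

    Returns a tuple (action, subscr_id, amount); amount is None unless the
    message is a payment.
    """
    fields = dict(parse_qsl(body, keep_blank_values=True))
    txn_type = fields.get("txn_type")
    subscr_id = fields.get("subscr_id")
    if txn_type == "subscr_payment":
        # Recurring charges arrive with this type, including mc_gross.
        return ("payment", subscr_id, fields.get("mc_gross"))
    if txn_type == "subscr_signup":
        return ("signup", subscr_id, None)
    if txn_type == "subscr_cancel":
        return ("cancel", subscr_id, None)
    return ("ignored", subscr_id, None)


sample = "txn_type=subscr_payment&subscr_id=I-WCECH3SA87PT&mc_gross=5.00&payment_status=Completed"
print(handle_ipn(sample))
```

Keying your own records on subscr_id lets the signup, payment and cancel messages for one subscription be tied together.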

Joining two datasets spark scala

I have two csv files (datasets) file1 and file2.
File1 consists of following columns:
Orders | Requests | Book1 | Book2
Varchar| Integer | Integer| Integer
File2 consists of following columns:
Book3 | Book4 | Book5 | Orders
String| String| Varchar| Varchar
How can I combine the data from the two CSV files in Scala to find out how many Orders, Book1 (ignoring rows where Book1 = 0), Book3 and Book4 values are present in both files for each Orders value?
Note: the Orders column is common to both files.
You can join the two CSV files by turning each one into a pair RDD keyed on the join column:
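// Note: job, LineParser and leftFile are defined elsewhere and are not shown here.
val rightFile = job.patch.get.file
// Key each line of the right-hand file by its join column
val rightFileByKeys = sc.textFile(rightFile).map { line =>
  new LineParser(line, job.patch.get.patchKeyIndex, job.delimRegex, Some(job.patch.get.patchValueIndex))
}.keyBy(_.getKey())
// Key each line of the left-hand file the same way
val leftFileByKeys = sc.textFile(leftFile).map { line =>
  new LineParser(line, job.patch.get.fileKeyIndex, job.delimRegex)
}.keyBy(_.getKey())
// Inner join on the key and append the right-hand value to the left-hand line
leftFileByKeys.join(rightFileByKeys).map { case (key, (left, right)) =>
  (job, left.line + job.delim + right.getValue())
}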
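As a plain illustration of the keyed join, here is a Python sketch that indexes one file by Orders, walks the other, skips rows where Book1 is 0, and emits the combined columns. The file contents below are invented purely for the example (the question does not show any rows), and `join_on_orders` is a hypothetical name; the column layout follows the question.

```python
import csv
import io

# Made-up sample contents; only the column names come from the question.
FILE1 = "Orders,Requests,Book1,Book2\nA1,3,5,1\nA2,1,0,2\nA3,2,7,0\n"
FILE2 = "Book3,Book4,Book5,Orders\nx,y,z,A1\nx2,y2,z2,A2\nx3,y3,z3,A9\n"


def join_on_orders(left_text, right_text):
    """Inner-join two CSV texts on Orders, ignoring left rows with Book1 == 0."""
    # Index the right-hand file by its key, like keyBy on a pair RDD.
    right_by_order = {}
    for row in csv.DictReader(io.StringIO(right_text)):
        right_by_order.setdefault(row["Orders"], []).append(row)

    joined = []
    for row in csv.DictReader(io.StringIO(left_text)):
        if int(row["Book1"]) == 0:
            continue  # the question asks to ignore Book1 = 0
        for match in right_by_order.get(row["Orders"], []):
            joined.append((row["Orders"], int(row["Book1"]), match["Book3"], match["Book4"]))
    return joined


print(join_on_orders(FILE1, FILE2))
```

Only order A1 survives here: A2 is dropped because its Book1 is 0, and A3 and A9 each appear in just one file.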

Linq to Get distinct List of rows in descending order

Consider the following records:
Id F1 F2 f3 Date
-------------------------------------------------
1 1800 1990 19 2016-06-27 09:24:25.550
2 1181 1991 19 2016-06-27 09:25:15.243
3 1919 2000 19 2016-06-27 11:04:27.807
4 1920 2000 19 2016-06-27 13:04:27.807
5 1800 2001 19 2016-06-28 09:24:25.550
6 1181 2002 19 2016-06-28 09:25:15.243
7 1919 2010 19 2016-06-28 11:04:27.807
I want to group by F1 and get the latest row of each group, sorted by Date descending.
Desired output:
Id F1 F2 f3 Date
-------------------------------------------------
7 1919 2010 19 2016-06-28 11:04:27.807
6 1181 2002 19 2016-06-28 09:25:15.243
5 1800 2001 19 2016-06-28 09:24:25.550
4 1920 2000 19 2016-06-27 13:04:27.807
I have tried:
DateTime EndDate = DateTime.Now.AddDays(-1);
var result = (from opt in db.Output
              where opt.f3 == 19 && opt.Date > EndDate
              orderby opt.Date descending
              select new
              {
                  Id = opt.Id,
                  F1 = opt.F1,
                  F2 = opt.F2,
                  F3 = opt.F3,
                  Date = opt.Date
              })
             .GroupBy(x => x.F1)
             .Select(s => s.OrderBy(o => o.F2).FirstOrDefault())
             .OrderByDescending(x => x.Date)
             .ToList();
I'm getting this output:
Id F1 F2 f3 Date
-------------------------------------------------
1 1800 1990 19 2016-06-27 09:24:25.550
2 1181 1991 19 2016-06-27 09:25:15.243
3 1919 2000 19 2016-06-27 11:04:27.807
4 1920 2000 19 2016-06-27 13:04:27.807
What is wrong with my code?
If I understand correctly, you want the most recent item of each group:
db.Output.GroupBy(opt => opt.F1).
    Select(group => group.OrderByDescending(opt => opt.Date).First()).
    OrderByDescending(opt => opt.Date);
I'm not sure the translation to SQL will be efficient, though, because of the inner ordering.
Since GroupBy preserves the source order within each group (at least in LINQ to Objects), you might avoid that with:
db.Output.OrderByDescending(opt => opt.Date).
    GroupBy(opt => opt.F1).
    Select(group => group.First()).
    OrderByDescending(opt => opt.Date);
The problem is in s.OrderBy(o => o.F2).FirstOrDefault(): the ordering there should be on Date.
Why your code doesn't work:
// creates one group per F1 value
.GroupBy(x => x.F1)
// orders each group by F2 and takes the first row, so the row with the latest date is discarded here
.Select(s => s.OrderBy(o => o.F2).FirstOrDefault())
// this descending sort is of little use, as only one record per group is left
.OrderByDescending(x => x.Date).ToList();
Instead, order by Date before grouping:
var result = db.Output
    .Where(opt => opt.f3 == 19 && opt.Date > EndDate)
    .OrderByDescending(o => o.Date)
    .GroupBy(x => x.F1)
    .Select(s => s.FirstOrDefault())
    .ToList();
or
var result = db.Output
    .Where(opt => opt.f3 == 19 && opt.Date > EndDate)
    // Date decides which row is first; F2 only breaks ties between equal dates
    .OrderByDescending(o => o.Date)
    .ThenBy(o1 => o1.F2)
    .GroupBy(x => x.F1)
    .Select(s => s.FirstOrDefault())
    .ToList();
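To check the "latest row per group" logic against the sample data from the question, here is a short Python sketch. It keeps the row with the greatest Date for each F1 value and then sorts the survivors by Date descending; the names `latest_per_f1` and `parse` are invented for this illustration.

```python
from datetime import datetime

# (Id, F1, F2, f3, Date) rows from the question.
ROWS = [
    (1, 1800, 1990, 19, "2016-06-27 09:24:25.550"),
    (2, 1181, 1991, 19, "2016-06-27 09:25:15.243"),
    (3, 1919, 2000, 19, "2016-06-27 11:04:27.807"),
    (4, 1920, 2000, 19, "2016-06-27 13:04:27.807"),
    (5, 1800, 2001, 19, "2016-06-28 09:24:25.550"),
    (6, 1181, 2002, 19, "2016-06-28 09:25:15.243"),
    (7, 1919, 2010, 19, "2016-06-28 11:04:27.807"),
]


def parse(stamp):
    """Parse the timestamp format used in the sample table."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


def latest_per_f1(records):
    """Keep the most recent record per F1, newest first."""
    latest = {}
    for rec in records:
        key = rec[1]
        if key not in latest or parse(rec[4]) > parse(latest[key][4]):
            latest[key] = rec
    return sorted(latest.values(), key=lambda r: parse(r[4]), reverse=True)


for rec in latest_per_f1(ROWS):
    print(rec)
```

On the sample rows this yields Ids 7, 6, 5, 4 in that order, which is the desired output listed in the question.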
Use a multiple-column group:
DateTime EndDate = DateTime.Now.AddDays(-1);
var result = (from opt in db.Output
              where opt.f3 == 19 && opt.Date > EndDate
              orderby opt.Date descending
              select new
              {
                  Id = opt.Id,
                  F1 = opt.F1,
                  F2 = opt.F2,
                  F3 = opt.F3,
                  Date = opt.Date
              })
             .GroupBy(x => new { x.Date, x.F1 })
             .Select(s => s.OrderBy(o => o.F2).FirstOrDefault())
             .OrderByDescending(x => x.Date)
             .ToList();