My goal is to parse an html file retrieved with Invoke-WebRequest. If possible I'd like to avoid any external libraries.
The problem I am facing is that, since PowerShell 6, Invoke-WebRequest returns a BasicHtmlWebResponseObject instead of an HtmlWebResponseObject. The basic version lacks the ParsedHtml property. Is there a good alternative for parsing HTML in PowerShell Core 6?
I've tried to use Select-Xml, but my HTML is not entirely valid (e.g. a missing closing tag), so it fails to parse the result.
Another alternative I've found is to use New-Object -ComObject "HTMLFile" but from my understanding this relies on Internet Explorer for parsing which I'd like to avoid.
There is a very similar question here, but sadly it has had no answers or activity for 8 months.
As mentioned in the comments, it is not really possible without a library. One very good library you could use is AngleSharp for .NET. It has great HTML parsing capabilities, and .NET code interoperates well with PowerShell; have a look at this link.
Here is an example from their website:
var config = Configuration.Default.WithDefaultLoader();
var address = "https://en.wikipedia.org/wiki/List_of_The_Big_Bang_Theory_episodes";
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(address);
var cellSelector = "tr.vevent td:nth-child(3)";
var cells = document.QuerySelectorAll(cellSelector);
var titles = cells.Select(m => m.TextContent);
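The same idea can be sketched from PowerShell itself. This is only a rough sketch under some assumptions: AngleSharp.dll (0.10 or later, where the parser lives in the AngleSharp.Html.Parser namespace) has already been downloaded, the path below is a hypothetical placeholder, and depending on the AngleSharp version further dependency DLLs may have to be loaded as well. It fetches the page with Invoke-WebRequest and lets AngleSharp parse only the returned markup:

```powershell
# Assumption: hypothetical location of a downloaded AngleSharp.dll; adjust to your setup.
Add-Type -Path 'C:\libs\AngleSharp.dll'

# Fetch the raw HTML; .Content is available on BasicHtmlWebResponseObject too.
$address  = 'https://en.wikipedia.org/wiki/List_of_The_Big_Bang_Theory_episodes'
$response = Invoke-WebRequest -Uri $address

# Parse the markup; AngleSharp tolerates invalid HTML such as missing closing tags.
$parser   = [AngleSharp.Html.Parser.HtmlParser]::new()
$document = $parser.ParseDocument($response.Content)

# Same CSS selector as in the C# example above.
$cells  = $document.QuerySelectorAll('tr.vevent td:nth-child(3)')
$titles = $cells | ForEach-Object { $_.TextContent }
$titles
```

Parsing the string with HtmlParser instead of using the BrowsingContext loader avoids dealing with the awaitable OpenAsync call from PowerShell.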
I'm in the process of updating our scripts to ensure they remain functional, and discovered iText7 has replaced iTextSharp. My needs are simple; read form fields. Rather, I know how to read a form field, I'm just checking to see if there's a more streamlined way to do it, as it seems like it was easier in iTextSharp.
Here's the old code we're using with iTextSharp (the $form is being fed to the $reader via a foreach loop):
#create pdf reader object and load form
$reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $form.PSPath.Replace("Microsoft.PowerShell.Core\FileSystem::","")
#Get the data I need
$First = $reader.AcroFields.GetField("FirstName")
Simple. When playing with iText7 though, it seems to lose its simplicity. Here's what I have for iText7:
#Create pdf reader and load form
$Reader = [iText.Kernel.Pdf.PdfReader]::new("C:\temp\TestForm.pdf")
#Create PDFDoc object?
$PdfDoc = [iText.Kernel.Pdf.PdfDocument]::new($Reader)
#What? Why?
$Form = [iText.Forms.PdfAcroForm]::getAcroForm($PdfDoc, $True)
#Get the data I need. Oh wait, I am unable to read it.
$fName = $Form.GetField("FirstName")
#Finally...
$First = $fName.GetValue()
I'm afraid I don't have any luck researching simple code; everyone seems to be creating web forms on the fly, or parsing thousands of PDFs for data analytics. I'm also just a lowly SysAdmin, not a dev. Please tell me there's an easier way to read a single form field in iText7. Thanks in advance!
Simplicity is not necessarily measured by the number of lines of code. Your way of reading form fields in iText 7 is correct. The reason you need a couple more lines is that iText 7 has a much clearer separation of the different parts of the code across modules. This has big advantages compared to iText 5 and gives more room for flexibility in user code.
The inability to call $Form.GetField("FirstName").GetValue() is a PowerShell limitation, by the way, and has nothing to do with iText; you can use that kind of chaining in C# or Java.
I'm new to lsp4e and LSP technologies, and as far as I have seen, the framework provides almost everything needed for working with Eclipse. However, is there a way to use these features at will? For example, I would like to use the language server to get all the functions in a file. I think this can be done with textDocument/documentSymbol, but how can I do this using the lsp4e framework?
NOTE:
I checked SymbolKind and it seems it was not what I was looking for. However, that input helped me find a sample of DocumentSymbol:
DocumentSymbolParams params = new DocumentSymbolParams(
new TextDocumentIdentifier(documentUri.toString()));
CompletableFuture<List<Either<SymbolInformation, DocumentSymbol>>> symbols =
languageServer.getTextDocumentService().documentSymbol(params);
I have to parse an HTML file (find a div with class xyz and use its inner text).
That would be quite simple by filtering the WebResponse.AllElements collection. This member does not seem to be available in PowerShell Core 6 (on a Mac).
On this platform, I get a BasicHtmlWebResponseObject as the return value of the Invoke-WebRequest $Url call, which lacks this property.
I don't want to use any C# code or external libraries. What would be a maintainable way to get the same functionality as in the Windows PowerShell environment?
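For what it's worth, here is a minimal library-free sketch of the "find div with class xyz" case using a regular expression on the raw markup. It is not a real HTML parser: it assumes the target div contains no nested div elements, and the helper name Get-DivText and the sample markup are made up for illustration. With a live page, the input would be (Invoke-WebRequest $Url).Content.

```powershell
# Sample markup standing in for (Invoke-WebRequest $Url).Content; note the unclosed <p>.
$html = @'
<html><body>
<div class="header">Ignore me</div>
<div class="xyz highlight">First <b>match</b><p>unclosed paragraph</div>
<div id="a" class='xyz'>Second match</div>
</body></html>
'@

# Hypothetical helper: returns the inner text of every <div> whose class list contains $ClassName.
function Get-DivText {
    param(
        [Parameter(Mandatory)] [string] $Html,
        [Parameter(Mandatory)] [string] $ClassName
    )
    $name = [regex]::Escape($ClassName)
    # (?is) = ignore case and let '.' span newlines; the lazy (.*?) stops at the first </div>.
    $pattern = "(?is)<div\b[^>]*\bclass\s*=\s*[""'](?:[^""']*\s)?${name}(?:\s[^""']*)?[""'][^>]*>(.*?)</div>"
    foreach ($match in [regex]::Matches($Html, $pattern)) {
        $text = $match.Groups[1].Value -replace '<[^>]+>', ' '   # replace nested tags with a space
        $text = [System.Net.WebUtility]::HtmlDecode($text)       # decode entities such as &amp;
        ($text -replace '\s+', ' ').Trim()                       # collapse runs of whitespace
    }
}

Get-DivText -Html $html -ClassName 'xyz'
```

Because the match ends at the first closing div tag, a target div that itself contains other divs would be cut short; at that point a real parser is the safer choice.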
As the title suggests, I have a .Net application which uses interop to open documents in Word. I have set
app.AutomationSecurity = Microsoft.Office.Core.MsoAutomationSecurity.msoAutomationSecurityForceDisable
before opening the document. According to the documentation, this "Disables all macros in all files opened programmatically, without showing any security alerts".
However, when I attempt to open one specific document I get a dialog box on the screen that says "could not load an object because it is not available on this machine". It's a customer document but I believe it contains a macro with references to a COM object which I don't have installed.
Am I doing something stupid? Is there any way to actually disable macros when opening a Word document?
Try:
WordBasic.DisableAutoMacros 1
Bizarrely, this relies on a throwback to pre-VBA days, but it still seems to be the most reliable way to ensure that no auto macros are triggered (in any document; you may want to turn them back on afterwards by passing the parameter "0").
I recently had a project where I had to process 6,000 Word templates (yes, templates, not documents) many of which had oddball stuff like macros, etc. I was able to process all but 6 using this technique. (I never did figure out what the problem was with those 6).
EDIT: for a discussion of how to call this from C#, see: http://www.dotnet247.com/247reference/msgs/56/281785.aspx
For C# you can use:
(_wordApp.WordBasic as dynamic).DisableAutoMacros();
The whole code I'm using is:
using Microsoft.Office.Core; // MsoFileValidationMode, MsoAutomationSecurity
using Word = Microsoft.Office.Interop.Word;
private Word.Application _wordApp;
...
_wordApp = new Word.Application
{
Visible = false,
ScreenUpdating = false,
DisplayAlerts = Word.WdAlertLevel.wdAlertsNone,
FileValidation = MsoFileValidationMode.msoFileValidationSkip
};
_wordApp.Application.AutomationSecurity = MsoAutomationSecurity.msoAutomationSecurityForceDisable;
(_wordApp.WordBasic as dynamic).DisableAutoMacros();
Hello, I'm looking for a way to search for a word in a Word document and add an endnote (a special type of footnote) with a definition of the word as the endnote text. This would allow me to hover over that word and have the definition pop up like a tooltip.
I know I need to use reflection, but I'm new to the whole reflection thing and all my attempts have fallen flat.
I've found the reference for endnotes here: http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.endnotes.add%28office.11%29.aspx
I've tried loading C:\WINDOWS\Assembly\Gac\Microsoft.Office.Interop.Word\11.0.0.0__71e9bce111e9429c\Microsoft.Office.Interop.Word.dll using reflection, but I don't know what to do once I've loaded it. When I try to create an object with New-Object, it still asks me whether I've loaded the appropriate DLL.
Additionally, I tried a different approach by loading the MS Word application as a COM object, but I wasn't able to figure out how to select the text I wanted and then set an endnote.
Any suggestions for this would be greatly appreciated!
-Skyler
I am not too familiar with the Word object model, but if you can handle that part I can tell you how to get an instance of Word running and automated. It's quite simple actually.
$Application = New-Object -ComObject Word.Application
$Application.Visible = $true
$Document = $Application.Documents.Add()
The key is Visible = $true; otherwise Word will be running but hidden. You can now use all the methods of the Word Application object to create a new document and automate it. If you're using Word 2007's docx format, you can also investigate ZIP file extraction cmdlets and access the XML inside the Word document directly. But dealing with namespaces in XML is a hassle, and that route may not be as straightforward.
Word Object Model Stuff
ScriptingGuy recently posted a solution to this: http://blogs.technet.com/heyscriptingguy/archive/2009/10/14/hey-scripting-guy-october-14-2009.aspx