Cannot successfully extract paragraphs using Perl

I am trying to use Perl to extract paragraphs from a text, but the code does not generate the results I expect. I benefited a lot from Zaid's answers to this post: extracting paragraphs from text with perl. Here is the code I wrote:
use strict;
use warnings;

my $string = <<'TEXT';
Assembly and Manufacturing
The Company's assembly and manufacturing operations include PCB assembly
and the manufacture of subsystems and complete products. Its PCB assembly
activities primarily consist of the placement and attachment of electronic and
mechanical components on printed circuit boards using both SMT and traditional
pin-through-hole ("PTH") technology. The Company also assembles subsystems and
systems incorporating PCBs and complex electromechanical components, and,
increasingly, manufactures and packages final products for shipment directly to
the customer or its distribution channels. The Company employs just-in-time,
ship-to-stock and ship-to-line programs, continuous flow manufacturing, demand
flow processes and statistical process control. The Company has expanded the
number of production lines for finished product assembly, burn-in and test to
meet growing demand and increased customer requirements. In addition, the
Company has invested in FICO, a producer of injection molded plastic for Asia
electronics companies with facilities in Shenzhen, China.

As OEMs seek to provide greater functionality in smaller products, they
increasingly require advanced manufacturing technologies and processes. Most of
the Company's PCB assembly involves the use of SMT, which is the leading
electronics assembly technique for more sophisticated products. SMT is a
computer-automated process which permits attachment of components directly on
both sides of a PCB. As a result, it allows higher integration of electronic
components, offering smaller size, lower cost and higher reliability than
traditional manufacturing processes. By allowing increasingly complex circuits
to be packaged with the components placed in closer proximity to each other, SMT
greatly enhances circuit processing speed, and therefore board and system
performance. The Company also provides traditional PTH electronics assembly
using PCBs and leaded components for lower cost products.;
TEXT
local $/ = "";
open my ($str_fh), '<', \$string;
while ( <$str_fh> ) {
print "New Paragraph: $_\n","*" x 40, "\n" ;
}
close $str_fh;
The text is from the annual report of this company: https://www.sec.gov/Archives/edgar/data/32272/0000950147-97-000151.txt.
I expected the code to return the individual paragraphs; however, I got the whole text back as a single block.
Would anyone help me with this issue? I am quite confused by this behavior.
Thanks so much!!!
Best Regards

When I run the code you posted here, it works fine. It prints each paragraph separately.
Most likely, the lines between paragraphs in your real input are not completely blank. In paragraph mode ($/ = ""), records are separated by truly empty lines, so a "blank" line that contains spaces or tabs does not count as a paragraph separator.
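If that is what is happening with the SEC filing, one workaround is to normalize whitespace-only lines to truly empty lines before reading in paragraph mode. A minimal sketch, assuming the document has been saved locally as filing.txt (the filename is a placeholder):

use strict;
use warnings;

# Slurp the whole document into a string (the path is a placeholder).
my $string = do {
    local $/;    # slurp mode
    open my $fh, '<', 'filing.txt' or die "Cannot open filing.txt: $!";
    <$fh>;
};

# Turn lines containing only spaces/tabs into truly empty lines so that
# paragraph mode recognizes them as separators.
$string =~ s/^[ \t]+$//mg;

local $/ = "";    # paragraph mode
open my $str_fh, '<', \$string or die "Cannot open in-memory filehandle: $!";
while ( my $para = <$str_fh> ) {
    print "New Paragraph: $para\n", "*" x 40, "\n";
}
close $str_fh;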

Related

Attention Text Generation in Character-by-Character fashion

I have been searching the web for a couple of days for any text generation model that uses only attention mechanisms.
The Transformer architecture that made waves in the context of seq-to-seq models is actually based solely on attention mechanisms, but it is mainly designed for and used in translation or chatbot tasks, so it doesn't fit the purpose exactly, though the principle does.
My question is:
Does anyone know of, or has anyone heard of, a text generation model based solely on attention, without any recurrence?
Thanks a lot!
P.S. I'm familiar with PyTorch.
Building a character-level self-attentive model is a challenging task. Character-level models are usually based on RNNs. Whereas in a word/subword model it is clear from the beginning which units carry meaning (and therefore which units the attention mechanism can attend to), a character-level model needs to learn word meaning in the following layers, which makes it quite difficult for the model to learn.
Text generation models are nothing more than conditional language models. Google AI recently published a paper on a Transformer character language model, but it is the only such work I know of.
Anyway, you should consider either using subword units (such as BPE or SentencePiece) or, if you really need to go to the character level, using RNNs instead.

Can tags do some calculations in an RFID system?

Can tags and readers do some calculations in an RFID system? I have found many papers that design security protocols to enhance the security of RFID systems. In those protocols, tags and readers are required to do the necessary calculations, such as exclusive OR (XOR), random number generation, and even hash operations.
However, to my understanding, tags and readers do not have this calculation ability, so how can these protocols be implemented? Do we need to design special tags and readers? Thank you very much!
This heavily depends on the type of RFID system (frequency, communication standard, etc) and the type of tag (passive or active/semi-passive).
For instance, with UHF backscatter systems, passive tags usually contain only a small memory (and logic to process commands).
In the HF range, there is ISO/IEC 14443. For that standard, there exist lots of passive tags (contactless smartcards) that contain a processing unit and can even execute complex program code. ISO/IEC 15693 (same frequency range, different standard) passive tags usually contain only memory and some additional control logic (e.g. password-based locking/unlocking). The same applies to tags in the LF range.
Active tags, however (regardless of the standard, as long as the standard contains some form of command-response protocol if you want to interact with that functionality), could do pretty much any calculation if they contain an appropriate processing unit.
According to Wikipedia (I read the German version), one can build complex microchips into any RFID tag. It is also possible to add sensors, for example GPS. RFID chips do not need to be passive: they can have a battery that powers the whole chip (active) or just the microprocessor (semi-passive).

machine learning and code generator from strings

The problem: Given a set of hand-categorized strings (or a set of ordered vectors of strings), generate a categorization function to categorize more input. In my case, that data (or most of it) is not natural language.
The question: are there any tools out there that will do that? I'm thinking of some kind of reasonably polished, download-install-and-go kind of thing, as opposed to some library or a brittle academic program.
(Please don't get stuck on details as the real details would restrict answers to less generally useful responses AND are under NDA.)
As an example of what I'm looking at: the input I want to filter is computer-generated status strings pulled from logs, e.g. error messages being filtered based on who needs to be informed or what action needs to be taken.
Doing Things Manually
If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.
This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.
Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
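As a rough illustration of that approach, here is a minimal Perl sketch of such a routing table; the message types and addresses are hypothetical:

use strict;
use warnings;

# Hypothetical routing table: error-message type => who gets notified.
my %route = (
    'DB_CONNECTION_LOST' => ['dba-team@example.com'],
    'DISK_FULL'          => ['ops@example.com'],
    'PAYMENT_DECLINED'   => ['billing@example.com', 'support@example.com'],
);

sub recipients_for {
    my ($error_type) = @_;
    # Anything not yet in the table falls back to manual triage;
    # once triaged, the new type can be added to %route.
    return @{ $route{$error_type} // ['triage@example.com'] };
}

print join(', ', recipients_for('DISK_FULL')),     "\n";   # ops@example.com
print join(', ', recipients_for('UNKNOWN_ERROR')), "\n";   # triage@example.com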
Document Classification
Unless the error messages are being editorialized by the people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document classification, organized by programming language:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J, Weka, Lucene Mahout, and, as adi92 mentioned, Mallet.
Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.
Mallet has a bunch of classifiers which you can train and deploy entirely from the commandline
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with
Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs as 'spam' and 'not spam', you could use other categories.
You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.
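If you want to prototype the spam-filter-style approach in Perl, one option is the Algorithm::NaiveBayes module from CPAN. A minimal sketch, with made-up categories and token counts (treat the module choice and the feature scheme as assumptions, not a production recipe):

use strict;
use warnings;
use Algorithm::NaiveBayes;

my $nb = Algorithm::NaiveBayes->new;

# Train on hand-categorized strings; attributes here are simple token counts.
$nb->add_instance(
    attributes => { disk => 2, full => 1, '/var' => 1 },
    label      => 'notify_ops',
);
$nb->add_instance(
    attributes => { payment => 1, declined => 1, card => 1 },
    label      => 'notify_billing',
);
$nb->train;

# Classify a new status string by its tokens and pick the best-scoring label.
my $scores = $nb->predict( attributes => { disk => 1, full => 1 } );
my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
print "Route to: $best\n";    # expected: notify_ops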

Is it possible to get the index of an exchange using Finance::Quote?

I need to get the index of an exchange like NASDAQ rather than the price of a specific stock on that exchange. I suppose that Finance::Quote will come to the rescue, but after a quick read-through of the documentation, I find that the way one uses the module for a query is like:
%info = $q->fetch("australia","CML")
which means both the exchange and the stock must be specified in the query. So the question is: can the index itself be treated as a stock, with a symbol name that can be used in the query?
Of course, if you have another way to meet my needs, rather than using Finance::Quote, please feel free to write down your solution.
The problem with your question is that you are assuming that there is just one index for a particular exchange. Whilst there may well be a particular index that is dominant (eg. for stocks primarily traded on the London Stock Exchange, the FTSE 100 might be considered the main index; similarly for the NYSE it would be the Dow Jones Industrial Average), other exchanges may have a less clear leader in their collection of associated indices (eg. for the Australian Stock Exchange, the S&P/ASX 200 and the All Ordinaries index are both frequently quoted side-by-side in the evening broadcast news).
Symbology of stocks, indices, option chains, futures, etc. is quite a complicated field in financial IT. Many of the symbology standards are backed by a data vendor (eg. Reuters, Bloomberg) and use of their standards requires a commercial license. On the other hand, there are other efforts aiming to make symbology more open (Bloomberg themselves are behind one of these efforts).
I'm not familiar with the data sources of the Finance::Quote package you reference, but if you are serious about accessing market data (ie. prepared to pay for it) but don't need the cost/complexity/speed of a solution from Reuters, Bloomberg, etc., you could do a lot worse than check out what Xignite offers in the way of market data accessible via web services.
The symbol for the NASDAQ Composite is "^IXIC"; for the NYSE Composite it's "^NYA".
Each quote provider might use a different symbol syntax, though.
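A minimal sketch of querying such an index symbol with Finance::Quote; it assumes your installed version provides a "usa" source and that this source accepts Yahoo-style index symbols like ^IXIC, so check your provider's symbol conventions first:

use strict;
use warnings;
use Finance::Quote;

my $q = Finance::Quote->new;

# The "usa" source and the "^IXIC" symbol are assumptions; adjust to your data source.
my %info = $q->fetch('usa', '^IXIC');

if ( $info{'^IXIC', 'success'} ) {
    printf "NASDAQ Composite last: %s\n", $info{'^IXIC', 'last'};
} else {
    warn 'Lookup failed: ' . ( $info{'^IXIC', 'errormsg'} // 'unknown error' ) . "\n";
}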

Analysing and generating statistics on your code

I was wondering if anyone had any ideas or procedures for generating general statistics on your source code.
Off the top of my head, I would love to know how many functions in my project's code are called only once or very few times, or which classes are only instantiated once.
I'm sure there are a ton of other interesting things to be found out.
I could do something like the above using grep magic, but has anyone come across any tools or tips?
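As a rough illustration of the "grep magic" route, here is a small Perl sketch that counts how often each identifier appears at a call site across a source tree; the 'lib' path and the call-site regex are simplistic assumptions, not a real parser:

use strict;
use warnings;
use File::Find;

my %calls;

# Walk a source tree ('lib' is a placeholder path) and count "name(" call sites.
find( sub {
    return unless /\.(pl|pm)$/;    # crude filter for Perl sources
    open my $fh, '<', $_ or return;
    while ( my $line = <$fh> ) {
        $calls{$1}++ while $line =~ /\b([A-Za-z_]\w*)\s*\(/g;
    }
    close $fh;
}, 'lib' );

# Report identifiers that are called only once or twice.
for my $name ( sort { $calls{$a} <=> $calls{$b} } keys %calls ) {
    last if $calls{$name} > 2;
    printf "%-30s called %d time(s)\n", $name, $calls{$name};
}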
Coverity is the first thing that comes to mind. It currently offers (in one of their products):
Software DNA Map™ analysis system: Generates a comprehensive representation of the entire build system including a semantically correct parsing of every line of code.
Defect Manager: Intuitive interface makes it easy to establish ownership of defects and resolve them via a customized workflow that mirrors your existing development process.
Local Analysis: Enables code to be analyzed locally on developers’ desktops to ensure quality before sharing with other developers.
Boolean Satisfiability: Translates the code into questions based on Boolean values, then applies SAT solvers for the most accurate defect detection and the lowest false positive rate available. Only Prevent offers the added precision of this proprietary method.
Race Conditions Checker: Features an industry-first race conditions checker built specifically for today’s complex multi-threaded applications.
Path Simulation: Simulates 100% of all values and data paths, enabling detection of the most critical defects.
Statistical & Interprocedural Analysis: Ensures a comprehensive analysis of your entire build system by inferring correct behavior based on previously observed behavior and performing whole-program analysis similar to the executing Bin.
False Path Pruning: Efficiently removes false positives to give Prevent an average FP rate of about 15%, with some users reporting FP rates of as low as 5%.
Incremental Analysis: Analyzes source code wholly or incrementally, allowing you to save time by checking only those components that are affected by a change.
Reporting: Measures software quality trends over time via customizable reporting so you can show defects grouped by checker, classification, component, and other defect information.
There are lots of tools that do this, but AFAIK none of them are language-independent (which would be mostly impossible anyway; e.g. some languages might not even have functions).
Generally you will find those tools under the categories of "code coverage tools" or "profilers".
For .NET you can use Visual Studio or CLR Profiler.