Mallet TokenSequenceRemoveStopwords trouble reading file

I'm trying to use Mallet for topic modelling. Here's my code:
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
// Lowercase everything
pipeList.add(new CharSequenceLowercase());
// Unicode letters, underscore, and hashtag
Pattern pat = Pattern.compile("[\\p{L}_#]+");
pipeList.add(new CharSequence2TokenSequence(pat));
// Remove stop words
pipeList.add( new TokenSequenceRemoveStopwords(new File("C:\\mallet\\stoplists\\en.txt"), "UTF-8", false, false, false) );
// Convert the token sequence to a feature sequence.
pipeList.add(new TokenSequence2FeatureSequence());
return pipeList;
If I run the program, it says:
Exception in thread "main" java.lang.IllegalArgumentException: Trouble reading file C:\mallet\stoplists\en.txt
Could someone please help me solve this problem?
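One way to narrow this down, as a sketch rather than a confirmed diagnosis: the message only says that the stoplist could not be read, so it may help to check the path with plain java.io.File calls before building the pipe list. The names StoplistCheck and describe are made up for illustration and are not part of Mallet.

```java
import java.io.File;

// Hypothetical helper, not part of Mallet: reports why a path might be unreadable.
class StoplistCheck {
    static String describe(File f) {
        if (!f.exists()) return "missing";
        if (!f.isFile()) return "not a regular file";
        if (!f.canRead()) return "not readable";
        return "ok";
    }

    public static void main(String[] args) {
        // Path taken from the question; pass another path as the first argument to check it instead.
        File stoplist = new File(args.length > 0 ? args[0] : "C:\\mallet\\stoplists\\en.txt");
        System.out.println(stoplist.getAbsolutePath() + " -> " + describe(stoplist));
    }
}
```

If this reports "ok" and the exception still occurs, the path itself is probably not the problem, and the encoding argument would be the next thing to look at.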


CsvReader not loading lines starting with #

I'm trying to load a text file (.csv) into a SQL Server database table. Each line in the file is supposed to be loaded into a single column in the table. I find that lines starting with "#" are skipped, with no error. For example, the first two of the following four lines are loaded fine, but the last two are not. Does anybody know why?
This one as well
#This is another test line
Here's the segment of my code:
var sqlConn = connection.StoreConnection as SqlConnection;
CsvReader reader = new CsvReader(new StreamReader(f), false);
using (var bulkCopy = new SqlBulkCopy(sqlConn))
{
    bulkCopy.DestinationTableName = "dbo.TestTable";
    try
    {
        reader.SkipEmptyLines = true;
        bulkCopy.BulkCopyTimeout = 300;
        bulkCopy.WriteToServer(reader);
        reader = null;
    }
    catch (Exception ex)
    {
    }
}
# is the default comment character for CsvReader. You can change the comment character by changing the Comment property of the Configuration object. You can disable comment processing altogether by setting the AllowComment property to false.
SqlBulkCopy doesn't deal with CSV files at all; it sends any data that's passed to WriteToServer to the database. It doesn't care where the data came from or what it contains, as long as the column mappings match.
Assuming LumenWorks.Framework.IO.Csv refers to this project, the comment character can be specified in the constructor. One could set it to something that wouldn't appear in a normal file, perhaps even the NUL character, the default char value:
CsvReader reader = new CsvReader(new StreamReader(f), false, comment: default);
CsvReader reader = new CsvReader(new StreamReader(f), false, comment: '\0');

Getting NPE on simple Regex Replacing (Scala on Spark)

I wrote some simple code to parse a large XML file (extract lines, clean text, and remove any HTML tags from it) using Apache Spark.
I'm seeing a NullPointerException when calling .replaceAllIn on a string, which is non-null.
The funny thing is that I have no errors when I run the code locally, using input from disk, but I get a NullPointerException when I run the same code on AWS EMR, loading the input file from S3.
Here is the relevant code:
val HTML_TAGS_PATTERN = """<[^>]+>""".r
// other code here...
.textFile(pathToInputFile, numPartitions)
.filter { str => str.startsWith(" <row ") }
.map { str =>
Locale.setDefault(new Locale("en", "US"))
val parts = str.split(""""""")
var title: String = ""
var body: String = ""
// some code omitted here
title = StringEscapeUtils.unescapeXml(title).toLowerCase.trim
body = StringEscapeUtils.unescapeXml(body).toLowerCase // decode xml entities
println("before replacing, body is: "+body)
body = HTML_TAGS_PATTERN.replaceAllIn(body, " ") // take out htmltags
Things I've tried:
printing the string just before calling replaceAllIn to make sure it's not null.
making sure the Locale is not null
printing out the exception message, and stacktrace: it just tells me that that line is where the NullPointerException occurs. Nothing more
Things that are different between my local setup and AWS EMR:
in my local setup, I load the input file from disk, on EMR I load it from s3.
in my local setup, I run Spark in standalone mode, on EMR it's run in cluster mode.
Everything else is the same on my machine and on AWS EMR: Scala version, Spark version, Java version, Cluster configs...
I have been trying to figure this out for some hours and I can't think of anything else to try.
I've moved the call to r() to within the map{} body, like this:
val HTML_TAGS_PATTERN = """<[^>]+>"""
// code omitted
body = HTML_TAGS_PATTERN.r.replaceAllIn(body, " ")
This also produces an NPE, with the following stack trace:
at java.util.regex.Pattern.<init>(
at java.util.regex.Pattern.compile(
at scala.util.matching.Regex.<init>(Regex.scala:191)
at scala.collection.immutable.StringLike$class.r(StringLike.scala:255)
at scala.collection.immutable.StringOps.r(StringOps.scala:29)
at scala.collection.immutable.StringLike$class.r(StringLike.scala:244)
at scala.collection.immutable.StringOps.r(StringOps.scala:29)
at ReadSOStanfordTokenize$$anonfun$2.apply(ReadSOStanfordTokenize.scala:102)
at ReadSOStanfordTokenize$$anonfun$2.apply(ReadSOStanfordTokenize.scala:72)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spar
I think you should try putting the regex inline, like below:
{ _ =>
body = """<[^>]+>""".r.replaceAllIn(body, " ")
This is a bit of a lame solution; you should be able to define a constant, maybe put it in a global object or something. I'm not sure where you are defining it such that it would be a problem. But remember that Spark serialises the code and runs it on distributed workers, so something could be going wrong with that.
I get a very similar error when I run .r on a null String:
val x: String = null
x.r // throws java.lang.NullPointerException
That error has slightly different line numbers, I think because of the Scala version. I'm on 2.12.2.
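The stack trace in the question ends in java.util.regex.Pattern, which is what Scala's .r calls underneath. Here is a minimal Java sketch of the same failure, assuming the pattern string (rather than the input text) is what arrives as null on the worker. The names NullRegexDemo, compileThrowsNpe and stripTags are illustrative only.

```java
import java.util.regex.Pattern;

class NullRegexDemo {
    // Returns true if compiling the given pattern string throws a NullPointerException.
    static boolean compileThrowsNpe(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // Compiles the pattern inside the method, so it does not rely on a field
    // that might be uninitialised where the code actually runs.
    static String stripTags(String body) {
        if (body == null) {
            return "";
        }
        return Pattern.compile("<[^>]+>").matcher(body).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(compileThrowsNpe(null));
        System.out.println(compileThrowsNpe("<[^>]+>"));
        System.out.println(stripTags("<p>hello</p>"));
    }
}
```

If the real cause is a constant that is not initialised on the executors, building the pattern where it is used (as stripTags does) avoids the problem, at the cost of recompiling the regex on every call.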
Thanks to Stephen's answer I found out why I was getting an NPE in my UDF. I went this way (finding a match in my case):
def findMatch(word: String): String => Boolean = { s =>
  Option(s) match {
    // Option(null) is None, so a null input never reaches the regex
    case Some(validText) => word.toLowerCase.r.findAllIn(validText.toLowerCase).nonEmpty
    case None => false
  }
}
"<[^>]+>" was great, but I have one type of things in my HTML. it consists of a name of style and then parameters in between curly braces:
p { margin-top: 0px;margin-bottom: 0px;line-height: 1.15; }
body { font-family: 'Arial';font-style: Normal;font-weight: normal;font-size: 14.6666666666667px; }.Normal { telerik-style-type: paragraph;telerik-style-name: Normal;border-collapse: collapse; }.TableNormal { telerik-style-type: table;telerik-style-name: TableNormal;border-collapse: collapse; }.s_4C87DD5E { telerik-style-type: local;font-family: 'Arial';font-size: 14.6666666666667px;color: #000000; }.s_8D20FCAB { telerik-style-type: local;font-family: 'Arial';font-size: 14.6666666666667px;color: #000000;text-decoration: underline; }.p_53E06EE5 { telerik-style-type: local;margin-left: 0px; }
I tried to extract them using the following, but it didn't work:

HAPI v2 after terser: get entire changed message

I have an HL7 message whose content I'm manipulating slightly with the terser.set() method. Once I've done that, I see in the debugger that it's been changed just how I want it, but I can't seem to get the whole message back intact. I've tried (for example):
HapiContext context = new DefaultHapiContext();
Parser parser = context.getGenericParser();
Message message = parser.parse( MESSAGE );
Terser terser = new Terser( message );
terser.set( "/PID-2", "XXX XX XXXX" );
String fixedMessage = message.encode();
...which gets me close, however, lines (segment lines) that ended in just vertical bars (pipes) with no values in their fields come back trimmed (the vertical bars are simply dropped). I want the message to remain identical to what I put in (if also modified where I did it on purpose).
I think you need to use addForcedEncode in the ParserConfiguration.
public void testSetManualRepetitions() {
    try {
        String m = "MSH|^~\\&|hl7Integration|hl7Integration|||||ADT^A01|||2.3|\r" +
                "EVN|A01|20130617154644\r" +
                "PID|1|465 306 5961||407623|Wood^Patrick^^^MR||19700101|1||||||||||\r";
        HapiContext hc = new DefaultHapiContext();
        ExecutorService es = hc.getExecutorService(); // to avoid npe when closing context should be fixed
        ParserConfiguration pc = hc.getParserConfiguration();
        PipeParser pipeParser = hc.getPipeParser();
        Message message = pipeParser.parse(m);
        Terser terser = new Terser(message);
        // Add first address
        terser.set("/.PID-11(0)-1", "13 Oxford Road");
        terser.set("/.PID-11(0)-3", "Oxford");
        // Add second address
        terser.set("/.PID-11(1)-1", "16 London Road");
        terser.set("/.PID-11(1)-3", "London");
        pc.addForcedEncode("PID-26-1"); // make sure PID has 26 fields
        System.out.println(message.encode().replaceAll("\r", "\r\n"));
        hc.close(); // closing the context is what can throw IOException
    } catch (HL7Exception e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
/Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home/bin/java -ea -Didea.launcher.port=7540 "-Didea.launcher.bin.path=/Applications/IntelliJ IDEA" -Dfile.encoding=UTF-8 -classpath "/Applications/IntelliJ IDEA IDEA" com.intellij.rt.execution.application.AppMain com.intellij.rt.execution.junit.JUnitStarter -ideVersion5 com.hl7integration.hapi.tests.SetRepetitionsTerserTest,testSetManualRepetitions
68 [main] INFO ca.uhn.hl7v2.util.Home - hapi.home is set to /Users/thomas/git/Hapi-HL7-Terser/.
170 [main] INFO ca.uhn.hl7v2.VersionLogger - HAPI version is: 2.2
197 [main] INFO ca.uhn.hl7v2.VersionLogger - Default Structure libraries found for HL7 versions 2.1, 2.3, 2.4, 2.5,
PID|1|465 306 5961||407623|Wood^Patrick^^^MR||19700101|1|||13 Oxford Road^^Oxford~16 London Road^^London|||||||||||||||
Technically you are not changing the message: you parse it into a Java object and encode it back to a string. Forced encoding is what makes the output message look like your input message. There could still be subtle differences afterwards (e.g. if your input message is dynamic).
From the docs (
Forced Encoding
By default, when encoding a message HAPI will not encode any segments or fields that have no content and therefore have no semantic meaning in the message.
This can cause problems if you need to transmit a message to a system that expects certain empty content to be present in order to get "hints" about where in the message it is.
The addForcedEncode method may be used to add Terser paths which should be forced to be encoded:
// ORC-4 will still exist (but be empty) even if ORC has no content
parser.getParserConfiguration().addForcedEncode("ORC-4");
String encoded = parser.encode(message);
See the JavaDoc for examples.

Protovis - dealing with a text source

Let's say I have a text file with lines such as these:
[4/20/11 17:07:12:875 CEST] 00000059 FfdcProvider W logIncident FFDC1003I: FFDC Incident emitted on D:/Prgs/testing/WebSphere/AppServer/profiles/ProcCtr01/logs/ffdc/server1_3d203d20_11.04.20_17.07.12.8755227341908890183253.txt 134
[4/20/11 17:07:27:609 CEST] 0000005d wle E CWLLG2229E: An exception occurred in an EJB call. Error: Snapshot with ID Snapshot.8fdaaf3f-ce3f-426e-9347-3ac7e8a3863e not found.
com.lombardisoftware.core.TeamWorksException: Snapshot with ID Snapshot.8fdaaf3f-ce3f-426e-9347-3ac7e8a3863e not found.
at com.lombardisoftware.server.ejb.persistence.CommonDAO.assertNotNull(
Is there any way to easily import a data source such as this into Protovis? If not, what would be the easiest way to parse it into a JSON format? For example, the first entry might be parsed like so:
"Date": "4/20/11 17:07:12:875 CEST",
"Status": "00000059",
"Msg": "FfdcProvider W logIncident FFDC1003I",
Thanks, David
Protovis itself doesn't offer any utilities for parsing text files, so your options are:
Use Javascript to parse the text into an object, most likely using regex.
Pre-process the text using the text-parsing language or utility of your choice, exporting a JSON file.
Which you choose depends on several factors:
Is the data somewhat static, or are you going to be running this on a new or dynamic file each time you look at it? With static data, it might be easiest to pre-process; with dynamic data, this may add an annoying extra step.
How much data do you have? Parsing a 20K text file in Javascript is totally fine; parsing a 2MB file will be really slow, and will cause the browser to hang while it's working (unless you use Workers).
If there's a lot of processing involved, would you rather put that load on the server (by using a server-side script for pre-processing) or on the client (by doing it in the browser)?
If you wanted to do this in Javascript, based on the sample you provided, you might do something like this:
// Assumes var text = 'your text';
// use the utility of your choice to load your text file into the
// variable (e.g. jQuery.get()), or just paste it in.
var lines = text.split(/[\r\n\f]+/),
    // regex to match your log entry beginning
    // (the 8-character ID is hexadecimal in the sample, e.g. 0000005d)
    patt = /^\[(\d\d?\/\d\d?\/\d\d? \d\d:\d\d:\d\d:\d{3} [A-Z]+)\] ([0-9a-f]{8})/,
    items = [],
    currentItem;

// loop through the lines in the file
lines.forEach(function(line) {
    // look for the beginning of a log entry
    var initialData = line.match(patt);
    if (initialData) {
        // start a new item, using the captured matches
        currentItem = {
            Date: initialData[1],
            Status: initialData[2],
            Msg: line.substr(initialData[0].length + 1)
        };
        items.push(currentItem);
    } else if (currentItem) {
        // this is a continuation of the last item
        currentItem.Msg += "\n" + line;
    }
});
// items now contains an array of objects with your data
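For the pre-processing route mentioned in the options above, the same idea can be sketched in Java as a small program that turns the log text into a JSON array. This is only an illustration under assumptions: the class and method names (LogToJson, parse, quote, toJson) are made up, the entry pattern is inferred from the two sample lines, and the JSON is built by hand rather than with a library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processor: turns the log text into a JSON array string.
class LogToJson {
    // Entry start: "[date time zone] 8-hex-digit id rest-of-line"
    private static final Pattern ENTRY = Pattern.compile(
        "^\\[(\\d\\d?/\\d\\d?/\\d\\d? \\d\\d:\\d\\d:\\d\\d:\\d{3} [A-Z]+)\\] ([0-9a-fA-F]{8}) ?(.*)$");

    // Each item is {Date, Status, Msg}; lines that do not start an entry
    // are appended to the Msg of the previous entry.
    static List<String[]> parse(String text) {
        List<String[]> items = new ArrayList<>();
        for (String line : text.split("[\\r\\n\\f]+")) {
            Matcher m = ENTRY.matcher(line);
            if (m.matches()) {
                items.add(new String[] { m.group(1), m.group(2), m.group(3) });
            } else if (!items.isEmpty()) {
                items.get(items.size() - 1)[2] += "\n" + line;
            }
        }
        return items;
    }

    // Minimal JSON string escaping (backslash, quote, control characters).
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    static String toJson(String text) {
        StringBuilder sb = new StringBuilder("[");
        List<String[]> items = parse(text);
        for (int i = 0; i < items.size(); i++) {
            String[] it = items.get(i);
            if (i > 0) sb.append(",");
            sb.append("{\"Date\":").append(quote(it[0]))
              .append(",\"Status\":").append(quote(it[1]))
              .append(",\"Msg\":").append(quote(it[2])).append("}");
        }
        return sb.append("]").toString();
    }
}
```

The string returned by toJson could be written to a file and loaded by the page, which keeps the parsing work out of the browser.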

How to further improve error messages in Scala parser-combinator based parsers?

I've coded a parser based on Scala parser combinators:
class SxmlParser extends RegexParsers with ImplicitConversions with PackratParsers {
lazy val document: PackratParser[AstNodeDocument] =
((procinst | element | comment | cdata | whitespace | text)*) ^^ {
object SxmlParser {
def parse(text: String): AstNodeDocument = {
var ast = AstNodeDocument()
val parser = new SxmlParser()
val result = parser.parseAll(parser.document, new CharArrayReader(text.toArray))
result match {
case parser.Success(x, _) => ast = x
case parser.NoSuccess(err, next) => {
tool.die("failed to parse SXML input " +
"(line " + next.pos.line + ", column " + next.pos.column + "):\n" +
err + "\n" +
Usually the resulting parsing error messages are rather nice. But sometimes the message is just:
sxml: ERROR: failed to parse SXML input (line 32, column 1):
`"' expected but `' found
This happens if a quote character is not closed and the parser reaches the EOT. What I would like to see here is (1) which production the parser was in when it expected the '"' (I have several) and (2) where in the input this production started parsing (which indicates where the opening quote is in the input). Does anybody know how I can improve the error messages and include more information about the actual internal parsing state when the error happens (perhaps something like a production rule stack trace, or whatever can reasonably be given here to better identify the error location)? BTW, the above "line 32, column 1" is actually the EOT position and hence of no use here, of course.
I don't know yet how to deal with (1), but I was also looking for (2) when I found this webpage:
I'm just copying the information:
A useful enhancement is to record the input position (line number and column number) of the significant tokens. To do this, you must do three things:
Make each output type extend scala.util.parsing.input.Positional.
Invoke the Parsers.positioned() combinator.
Use a text source that records line and column positions.
Finally, ensure that the source tracks positions. For streams, you can simply use scala.util.parsing.input.StreamReader; for Strings, use scala.util.parsing.input.CharArrayReader.
I'm currently playing with it, so I'll try to add a simple example later.
In such cases you may use err, failure and ~! with production rules designed specifically to match the error.