How can I force spark/hadoop to ignore the .gz extension on a file and read it as uncompressed plain text? - scala

I have code along the lines of:
val lines: RDD[String] = sparkSession.sparkContext.textFile("s3://mybucket/file.gz")
The URL ends in .gz, but this is a result of legacy code. The file is actually plain text, with no compression involved. However, Spark insists on reading it as a gzip file, which obviously fails. How can I make it ignore the extension and simply read the file as text?
Based on this article, I've tried setting a codec configuration that doesn't include the gzip codec in various places, e.g.:
sparkContext.getConf.set("spark.hadoop.io.compression.codecs", classOf[DefaultCodec].getCanonicalName)
This doesn't seem to have any effect.
Since the files are on S3, I can't simply rename them without copying the entire file.

First solution: Shading GzipCodec
The idea is to shadow/shade the GzipCodec as defined in the package org.apache.hadoop.io.compress by including this Java file in your own sources and replacing this method:
public String getDefaultExtension() {
return ".gz";
}
with:
public String getDefaultExtension() {
return ".whatever";
}
When building your project, this has the effect of using your definition of GzipCodec instead of the one provided by the dependencies (this is the shadowing of GzipCodec).
This way, when parsing your file, textFile() is forced to fall back to the default codec, since the gzip codec no longer matches your file's name.
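Note that instead of copying the full upstream source file, a minimal stand-in with the same fully qualified name can be enough. A sketch of the idea (it inherits DefaultCodec's plain zlib behaviour, so it is no longer a working gzip codec):
// Shadows org.apache.hadoop.io.compress.GzipCodec from the Hadoop jar.
package org.apache.hadoop.io.compress;

public class GzipCodec extends DefaultCodec {
    @Override
    public String getDefaultExtension() {
        // No real file ends with this, so .gz files no longer match any codec
        // and are read as plain text:
        return ".whatever";
    }
}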
The downside of this solution is that you won't be able to process real gzip files within the same app.
Second solution: Using newAPIHadoopFile with a custom/modified TextInputFormat
You can use newAPIHadoopFile (instead of textFile) with a custom/modified TextInputFormat which forces the use of the DefaultCodec (plain text).
We'll write our own line reader based on the default one (TextInputFormat). The idea is to remove the part of TextInputFormat which detects the .gz extension and tries to uncompress the file before reading it.
Instead of calling sparkContext.textFile,
// plain text file with a .gz extension:
sparkContext.textFile("s3://mybucket/file.gz")
we can use the underlying sparkContext.newAPIHadoopFile which allows us to specify how to read the input:
import org.apache.hadoop.mapreduce.lib.input.FakeGzInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
sparkContext
.newAPIHadoopFile(
"s3://mybucket/file.gz",
classOf[FakeGzInputFormat], // This is our custom reader
classOf[LongWritable],
classOf[Text],
new Configuration(sparkContext.hadoopConfiguration)
)
.map { case (_, text) => text.toString }
The usual way of calling newAPIHadoopFile would be with TextInputFormat. This is the part which wraps how the file is read and where the compression codec is chosen based on the file extension.
Let's call it FakeGzInputFormat and implement it as follows, as an extension of TextInputFormat (this is a Java file; put it in src/main/java/org/apache/hadoop/mapreduce/lib/input so it ends up in the org.apache.hadoop.mapreduce.lib.input package):
package org.apache.hadoop.mapreduce.lib.input;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import com.google.common.base.Charsets;
public class FakeGzInputFormat extends TextInputFormat {
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split,
TaskAttemptContext context
) {
String delimiter =
context.getConfiguration().get("textinputformat.record.delimiter");
byte[] recordDelimiterBytes = null;
if (null != delimiter)
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
// Here we use our custom `FakeGzLineRecordReader` instead of
// `LineRecordReader`:
return new FakeGzLineRecordReader(recordDelimiterBytes);
}
@Override
protected boolean isSplitable(JobContext context, Path file) {
return true; // plain text is splittable (as opposed to gzip)
}
}
In fact we have to go one level deeper and also replace the default LineRecordReader (Java) with our own (let's call it FakeGzLineRecordReader).
As it's quite difficult to inherit from LineRecordReader, we can copy LineRecordReader (in src/main/java/org/apache/hadoop/mapreduce/lib/input) and slightly modify (and simplify) the initialize(InputSplit genericSplit, TaskAttemptContext context) method by forcing the usage of the default codec (plain text):
(the only changes compared to the original LineRecordReader have been given a comment explaining what's happening)
package org.apache.hadoop.mapreduce.lib.input;
import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@InterfaceAudience.LimitedPrivate({"MapReduce", "Pig"})
@InterfaceStability.Evolving
public class FakeGzLineRecordReader extends RecordReader<LongWritable, Text> {
private static final Logger LOG =
LoggerFactory.getLogger(FakeGzLineRecordReader.class);
public static final String MAX_LINE_LENGTH =
"mapreduce.input.linerecordreader.line.maxlength";
private long start;
private long pos;
private long end;
private SplitLineReader in;
private FSDataInputStream fileIn;
private Seekable filePosition;
private int maxLineLength;
private LongWritable key;
private Text value;
private byte[] recordDelimiterBytes;
public FakeGzLineRecordReader(byte[] recordDelimiter) {
this.recordDelimiterBytes = recordDelimiter;
}
// This has been simplified a lot since we don't need to handle compression
// codecs.
public void initialize(
InputSplit genericSplit,
TaskAttemptContext context
) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
final FileSystem fs = file.getFileSystem(job);
fileIn = fs.open(file);
fileIn.seek(start);
in = new UncompressedSplitLineReader(
fileIn, job, this.recordDelimiterBytes, split.getLength()
);
filePosition = fileIn;
if (start != 0) {
start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
}
// Simplified as input is not compressed:
private int maxBytesToConsume(long pos) {
return (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos), maxLineLength);
}
// Simplified as input is not compressed:
private long getFilePosition() {
return pos;
}
private int skipUtfByteOrderMark() throws IOException {
int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
Integer.MAX_VALUE);
int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
pos += newSize;
int textLength = value.getLength();
byte[] textBytes = value.getBytes();
if ((textLength >= 3) && (textBytes[0] == (byte)0xEF) &&
(textBytes[1] == (byte)0xBB) && (textBytes[2] == (byte)0xBF)) {
LOG.info("Found UTF-8 BOM and skipped it");
textLength -= 3;
newSize -= 3;
if (textLength > 0) {
textBytes = value.copyBytes();
value.set(textBytes, 3, textLength);
} else {
value.clear();
}
}
return newSize;
}
public boolean nextKeyValue() throws IOException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
int newSize = 0;
while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
if (pos == 0) {
newSize = skipUtfByteOrderMark();
} else {
newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
pos += newSize;
}
if ((newSize == 0) || (newSize < maxLineLength)) {
break;
}
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
@Override
public LongWritable getCurrentKey() {
return key;
}
@Override
public Text getCurrentValue() {
return value;
}
public float getProgress() {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (getFilePosition() - start) / (float)(end - start));
}
}
public synchronized void close() throws IOException {
if (in != null) {
in.close();
}
}
}


Could someone give me an example of how to extract coordinates for a 'word' with PDFBox
I am using this link to extract positions of individual characters:
https://www.tutorialkart.com/pdfbox/how-to-extract-coordinates-or-position-of-characters-in-pdf/
I am using this link to extract words:
https://www.tutorialkart.com/pdfbox/extract-words-from-pdf-document/
I am stuck getting coordinates for whole words.
You can extract the coordinates of words by collecting all the TextPosition objects building a word and combining their bounding boxes.
Implementing this along the lines of the two tutorials you referenced, you can extend PDFTextStripper like this:
public class GetWordLocationAndSize extends PDFTextStripper {
public GetWordLocationAndSize() throws IOException {
}
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
String wordSeparator = getWordSeparator();
List<TextPosition> word = new ArrayList<>();
for (TextPosition text : textPositions) {
String thisChar = text.getUnicode();
if (thisChar != null) {
if (thisChar.length() >= 1) {
if (!thisChar.equals(wordSeparator)) {
word.add(text);
} else if (!word.isEmpty()) {
printWord(word);
word.clear();
}
}
}
}
if (!word.isEmpty()) {
printWord(word);
word.clear();
}
}
void printWord(List<TextPosition> word) {
Rectangle2D boundingBox = null;
StringBuilder builder = new StringBuilder();
for (TextPosition text : word) {
Rectangle2D box = new Rectangle2D.Float(text.getXDirAdj(), text.getYDirAdj(), text.getWidthDirAdj(), text.getHeightDir());
if (boundingBox == null)
boundingBox = box;
else
boundingBox.add(box);
builder.append(text.getUnicode());
}
System.out.println(builder.toString() + " [(X=" + boundingBox.getX() + ",Y=" + boundingBox.getY()
+ ") height=" + boundingBox.getHeight() + " width=" + boundingBox.getWidth() + "]");
}
}
(ExtractWordCoordinates inner class)
and run it like this:
PDDocument document = PDDocument.load(resource);
PDFTextStripper stripper = new GetWordLocationAndSize();
stripper.setSortByPosition( true );
stripper.setStartPage( 0 );
stripper.setEndPage( document.getNumberOfPages() );
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
(ExtractWordCoordinates test testExtractWordsForGoodJuJu)
Applied to the apache.pdf example the tutorials use, you get:
2017-8-6 [(X=26.004425048828125,Y=22.00372314453125) height=5.833024024963379 width=36.31868362426758]
Welcome [(X=226.44479370117188,Y=22.00372314453125) height=5.833024024963379 width=36.5999755859375]
to [(X=265.5881652832031,Y=22.00372314453125) height=5.833024024963379 width=8.032623291015625]
The [(X=276.1641845703125,Y=22.00372314453125) height=5.833024024963379 width=14.881439208984375]
Apache [(X=293.5890197753906,Y=22.00372314453125) height=5.833024024963379 width=29.848846435546875]
Software [(X=325.98126220703125,Y=22.00372314453125) height=5.833024024963379 width=35.271636962890625]
Foundation! [(X=363.7962951660156,Y=22.00372314453125) height=5.833024024963379 width=47.871429443359375]
Custom [(X=334.0334777832031,Y=157.6195068359375) height=4.546705722808838 width=25.03936767578125]
Search [(X=360.8929138183594,Y=157.6195068359375) height=4.546705722808838 width=22.702728271484375]
You can create a CustomPDFTextStripper which extends PDFTextStripper and overrides protected void writeString(String text, List<TextPosition> textPositions). In this overridden method you need to split textPositions by the word separator to get a List<TextPosition> for each word. After that you can join the characters of each word and compute its bounding box.
The full example below also draws the resulting bounding boxes.
package com.example;
import lombok.Value;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.junit.Ignore;
import org.junit.Test;
import javax.imageio.ImageIO;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
public class PdfBoxTest {
private static final String BASE_DIR_PATH = "C:\\Users\\Milan\\50330484";
private static final String INPUT_FILE_PATH = "input.pdf";
private static final String OUTPUT_IMAGE_PATH = "output.jpg";
private static final String OUTPUT_BBOX_IMAGE_PATH = "output-bbox.jpg";
private static final float FROM_72_TO_300_DPI = 300.0f / 72.0f;
@Test
public void run() throws Exception {
pdfToImage();
drawBoundingBoxes();
}
@Ignore
@Test
public void pdfToImage() throws IOException {
PDDocument document = PDDocument.load(new File(BASE_DIR_PATH, INPUT_FILE_PATH));
PDFRenderer renderer = new PDFRenderer(document);
BufferedImage image = renderer.renderImageWithDPI(0, 300);
ImageIO.write(image, "JPEG", new File(BASE_DIR_PATH, OUTPUT_IMAGE_PATH));
}
@Ignore
@Test
public void drawBoundingBoxes() throws IOException {
PDDocument document = PDDocument.load(new File(BASE_DIR_PATH, INPUT_FILE_PATH));
List<WordWithBBox> words = getWords(document);
draw(words);
}
private List<WordWithBBox> getWords(PDDocument document) throws IOException {
CustomPDFTextStripper customPDFTextStripper = new CustomPDFTextStripper();
customPDFTextStripper.setSortByPosition(true);
customPDFTextStripper.setStartPage(0);
customPDFTextStripper.setEndPage(1);
Writer writer = new OutputStreamWriter(new ByteArrayOutputStream());
customPDFTextStripper.writeText(document, writer);
List<WordWithBBox> words = customPDFTextStripper.getWords();
return words;
}
private void draw(List<WordWithBBox> words) throws IOException {
BufferedImage bufferedImage = ImageIO.read(new File(BASE_DIR_PATH, OUTPUT_IMAGE_PATH));
Graphics2D graphics = bufferedImage.createGraphics();
graphics.setColor(Color.GREEN);
List<Rectangle> rectangles = words.stream()
.map(word -> new Rectangle(word.getX(), word.getY(), word.getWidth(), word.getHeight()))
.collect(Collectors.toList());
rectangles.forEach(graphics::draw);
graphics.dispose();
ImageIO.write(bufferedImage, "JPEG", new File(BASE_DIR_PATH, OUTPUT_BBOX_IMAGE_PATH));
}
private class CustomPDFTextStripper extends PDFTextStripper {
private final List<WordWithBBox> words;
public CustomPDFTextStripper() throws IOException {
this.words = new ArrayList<>();
}
public List<WordWithBBox> getWords() {
return new ArrayList<>(words);
}
@Override
protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
String wordSeparator = getWordSeparator();
List<TextPosition> wordTextPositions = new ArrayList<>();
for (TextPosition textPosition : textPositions) {
String str = textPosition.getUnicode();
if (wordSeparator.equals(str)) {
if (!wordTextPositions.isEmpty()) {
this.words.add(createWord(wordTextPositions));
wordTextPositions.clear();
}
} else {
wordTextPositions.add(textPosition);
}
}
super.writeString(text, textPositions);
}
private WordWithBBox createWord(List<TextPosition> wordTextPositions) {
String word = wordTextPositions.stream()
.map(TextPosition::getUnicode)
.collect(Collectors.joining());
int minX = Integer.MAX_VALUE;
int minY = Integer.MAX_VALUE;
int maxX = Integer.MIN_VALUE;
int maxY = Integer.MIN_VALUE;
for (TextPosition wordTextPosition : wordTextPositions) {
minX = Math.min(minX, from72To300Dpi(wordTextPosition.getXDirAdj()));
minY = Math.min(minY, from72To300Dpi(wordTextPosition.getYDirAdj() - wordTextPosition.getHeightDir()));
maxX = Math.max(maxX, from72To300Dpi(wordTextPosition.getXDirAdj() + wordTextPosition.getWidthDirAdj()));
maxY = Math.max(maxY, from72To300Dpi(wordTextPosition.getYDirAdj()));
}
return new WordWithBBox(word, minX, minY, maxX - minX, maxY - minY);
}
}
private int from72To300Dpi(float f) {
return Math.round(f * FROM_72_TO_300_DPI);
}
@Value
private class WordWithBBox {
private final String word;
private final int x;
private final int y;
private final int width;
private final int height;
}
}
Note:
If you are interested in other options, you can also check Poppler:
PDF to image
pdftoppm -r 300 -jpeg input.pdf output
Generate an XHTML file containing bounding box information for each word in the file.
pdftotext -r 300 -bbox input.pdf

Streaming in datamapper in mule esb

I need to take data (input.xml) from a file which is 100MB-200MB in size, and write it into four different files based on some logic.
Input XML:
<?xml version="1.0"?>
<Orders>
<Order><OrderId>1</OrderId><Total>10</Total><Name>jon1</Name></Order>
<Order><OrderId>2</OrderId><Total>20</Total><Name>jon2</Name></Order>
<Order><OrderId>3</OrderId><Total>30</Total><Name>jon3</Name></Order>
<Order><OrderId>4</OrderId><Total>40</Total><Name>jon4</Name></Order>
</Orders>
The logic: if Total is 1-10, write to file1; if Total is 11-20, write to file2; and so on.
expected output:
1 10 jon1 -->write into file1
2 20 jon2 -->write into file2
3 30 jon3 -->write into file3
4 40 jon4 -->write into file4
Here I have enabled streaming in the DataMapper (it is under its configuration), but I'm not getting proper output. The problem is that I get only some of the records, and only into one file (the records that should arrive in that file after satisfying the condition).
But if I disable the streaming button in the DataMapper, it works fine. As there are lakhs of records, I must use the streaming option.
Is there any other way to configure the DataMapper to enable the streaming option?
Please advise. Thanks.
It is difficult to see the problem without a little more detail on what you are doing.
Nevertheless, I think this will probably help you try another approach.
The DataMapper will load the full XML document into memory even though you activate streaming; it has to, in order to support XPath (it loads the full XML input into a DOM).
So if you cannot afford to load a 200MB document into memory, you will need a workaround.
What I have done before is create a Java component that transforms the input stream into an iterator with the help of a StAX parser. With a very simple implementation you can code an iterator that pulls from the stream to create the next element (a POJO, a map, a string...). In the Mule flow, after the Java component, you should be able to use a for-each with a choice inside and apply your logic (see the sketch after the example flow).
A quick example for your data:
package tests;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
public class OrdersStreamIterator implements Iterator<Map<String,String>> {
final static Log LOGGER = LogFactory.getLog(OrdersStreamIterator.class);
final InputStream is;
final XMLStreamReader xmlReader;
boolean end = false;
HashMap<String,String> next;
public OrdersStreamIterator(InputStream is)
throws XMLStreamException, FactoryConfigurationError {
this.is = is;
xmlReader = XMLInputFactory.newInstance().createXMLStreamReader(is);
}
protected HashMap<String,String> _next() throws XMLStreamException {
int event;
HashMap<String,String> order = null;
String orderChild = null;
String orderChildValue = null;
while (xmlReader.hasNext()) {
event = xmlReader.getEventType();
if (event == XMLStreamConstants.START_ELEMENT) {
if (order==null) {
if (checkOrder()) {
order = new HashMap<String,String>();
}
}
else {
orderChild = xmlReader.getLocalName();
}
}
else if (event == XMLStreamConstants.END_ELEMENT) {
if (checkOrders()) {
end = true;
return null;
}
else if (checkOrder()) {
xmlReader.next();
return order;
}
else if (order!=null) {
order.put(orderChild, orderChildValue);
orderChild = null;
orderChildValue = null;
}
}
else if (order!=null && orderChild!=null){
switch (event) {
case XMLStreamConstants.SPACE:
case XMLStreamConstants.CHARACTERS:
case XMLStreamConstants.CDATA:
int start = xmlReader.getTextStart();
int length = xmlReader.getTextLength();
if (orderChildValue==null) {
orderChildValue = new String(xmlReader.getTextCharacters(), start, length);
}
else {
orderChildValue += new String(xmlReader.getTextCharacters(), start, length);
}
break;
}
}
xmlReader.next();
}
end = true;
return null;
}
protected boolean checkOrder() {
return "Order".equals(xmlReader.getLocalName());
}
protected boolean checkOrders() {
return "Orders".equals(xmlReader.getLocalName());
}
@Override
public boolean hasNext() {
if (end) {
return false;
}
else if (next==null) {
try {
next = _next();
} catch (XMLStreamException e) {
LOGGER.error(e.getMessage(), e);
end = true;
}
return !end;
}
else {
return true;
}
}
@Override
public Map<String,String> next() {
if (hasNext()) {
final HashMap<String,String> n = next;
next = null;
return n;
}
else {
return null;
}
}
@Override
public void remove() {
throw new RuntimeException("ReadOnly!");
}
// Test
public static String dump(Map<String,String> o) {
String s = "{";
for (Entry<String,String> e : o.entrySet()) {
if (s.length()>1) {
s+=", ";
}
s+= "\"" + e.getKey() + "\" : \"" + e.getValue() + "\"";
}
return s + "}";
}
public static void main(String[] argv) throws XMLStreamException, FactoryConfigurationError {
final InputStream is = OrdersStreamIterator.class.getClassLoader().getResourceAsStream("orders.xml");
final OrdersStreamIterator i = new OrdersStreamIterator(is);
while (i.hasNext()) {
System.out.println(dump(i.next()));
}
}
}
An example flow:
<flow name="testsFlow">
<http:listener config-ref="HTTP_Listener_Configuration" path="/" doc:name="HTTP"/>
<scripting:component doc:name="Groovy">
<scripting:script engine="Groovy"><![CDATA[return tests.OrdersStreamIterator.class.getClassLoader().getResourceAsStream("orders.xml");]]></scripting:script>
</scripting:component>
<set-payload value="#[new tests.OrdersStreamIterator(payload)]" doc:name="Iterator"/>
<foreach doc:name="For Each">
<logger message="#[tests.OrdersStreamIterator.dump(payload)]" level="INFO" doc:name="Logger"/>
</foreach>
</flow>
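For the routing part of your logic, the for-each body can wrap a choice router. A sketch (untested; it assumes a configured file transport, that the payload inside the for-each is the Map produced by the iterator, and the paths and output patterns are placeholders):
<foreach doc:name="For Each">
    <choice doc:name="Choice">
        <when expression="#[Integer.parseInt(payload['Total']) &lt;= 10]">
            <file:outbound-endpoint path="out" outputPattern="file1.txt" doc:name="file1"/>
        </when>
        <when expression="#[Integer.parseInt(payload['Total']) &lt;= 20]">
            <file:outbound-endpoint path="out" outputPattern="file2.txt" doc:name="file2"/>
        </when>
        <otherwise>
            <file:outbound-endpoint path="out" outputPattern="other.txt" doc:name="other"/>
        </otherwise>
    </choice>
</foreach>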

How to use antlr4 in Eclipse?

Antlr4 is a new version of Antlr, and this is the first time I have used it. I have downloaded the Antlr4 plugin from the Eclipse Marketplace.
I made a new ANTLR 4 project and got Hello.g4.
Afterward I saw this small grammar:
/**
* Define a grammar called Hello
*/
grammar Hello;
r : 'hello' ID ; // match keyword hello followed by an identifier
ID : [a-z]+ ; // match lower-case identifiers
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Once it was saved, it was built and I saw this in the Antlr console. I wanted to test the program, but I didn't know how, nor how to make a new file that can be compiled against the new grammar.
Thanks in advance for any help.
You need to create an instance of the generated parser in order to run it.
Create a Java project J next to your ANTLR project A.
Create a linked folder in project J referencing the
generated-sources/antlr4 folder from A, and make this linked folder a source
folder. Compile errors should appear.
Add the antlr4 jar to the build path of project J. This should remove the compile errors.
Write a main program in J which creates an instance of the generated
parser and feeds it with text. You can take the TestRig from the
ANTLR documentation, pasted below for convenience.
The TestRig must be invoked by passing the grammar name (Hello, in your example) and the starting rule (r).
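Alternatively, if you don't need the full TestRig, a minimal main can drive the generated parser directly. A sketch, assuming the Hello grammar above (which generates HelloLexer and HelloParser, with start rule r):
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;

public class HelloMain {
    public static void main(String[] args) throws Exception {
        // Feed some text matching rule r ('hello' followed by an identifier):
        ANTLRInputStream input = new ANTLRInputStream("hello world");
        HelloLexer lexer = new HelloLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        HelloParser parser = new HelloParser(tokens);
        // Invoke the start rule and print the LISP-style parse tree:
        System.out.println(parser.r().toStringTree(parser));
    }
}
The full TestRig: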
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.DefaultErrorStrategy;
import org.antlr.v4.runtime.DiagnosticErrorListener;
import org.antlr.v4.runtime.InputMismatchException;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.TokenStream;
import org.antlr.v4.runtime.atn.PredictionMode;
import javax.print.PrintException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
/** Run a lexer/parser combo, optionally printing tree string or generating
* postscript file. Optionally taking input file.
*
* $ java org.antlr.v4.runtime.misc.TestRig GrammarName startRuleName
* [-tree]
* [-tokens] [-gui] [-ps file.ps]
* [-trace]
* [-diagnostics]
* [-SLL]
* [input-filename(s)]
*/
public class Test {
public static final String LEXER_START_RULE_NAME = "tokens";
protected String grammarName;
protected String startRuleName;
protected final List<String> inputFiles = new ArrayList<String>();
protected boolean printTree = false;
protected boolean gui = false;
protected String psFile = null;
protected boolean showTokens = false;
protected boolean trace = false;
protected boolean diagnostics = false;
protected String encoding = null;
protected boolean SLL = false;
public Test(String[] args) throws Exception {
if ( args.length < 2 ) {
System.err.println("java org.antlr.v4.runtime.misc.TestRig GrammarName startRuleName\n" +
" [-tokens] [-tree] [-gui] [-ps file.ps] [-encoding encodingname]\n" +
" [-trace] [-diagnostics] [-SLL]\n"+
" [input-filename(s)]");
System.err.println("Use startRuleName='tokens' if GrammarName is a lexer grammar.");
System.err.println("Omitting input-filename makes rig read from stdin.");
return;
}
int i=0;
grammarName = args[i];
i++;
startRuleName = args[i];
i++;
while ( i<args.length ) {
String arg = args[i];
i++;
if ( arg.charAt(0)!='-' ) { // input file name
inputFiles.add(arg);
continue;
}
if ( arg.equals("-tree") ) {
printTree = true;
}
if ( arg.equals("-gui") ) {
gui = true;
}
if ( arg.equals("-tokens") ) {
showTokens = true;
}
else if ( arg.equals("-trace") ) {
trace = true;
}
else if ( arg.equals("-SLL") ) {
SLL = true;
}
else if ( arg.equals("-diagnostics") ) {
diagnostics = true;
}
else if ( arg.equals("-encoding") ) {
if ( i>=args.length ) {
System.err.println("missing encoding on -encoding");
return;
}
encoding = args[i];
i++;
}
else if ( arg.equals("-ps") ) {
if ( i>=args.length ) {
System.err.println("missing filename on -ps");
return;
}
psFile = args[i];
i++;
}
}
}
public static void main(String[] args) throws Exception {
Test test = new Test(args);
if(args.length >= 2) {
test.process();
}
}
public void process() throws Exception {
//System.out.println("exec "+grammarName+" "+startRuleName);
String lexerName = grammarName+"Lexer";
ClassLoader cl = Thread.currentThread().getContextClassLoader();
Class<? extends Lexer> lexerClass = null;
try {
lexerClass = cl.loadClass(lexerName).asSubclass(Lexer.class);
}
catch (java.lang.ClassNotFoundException cnfe) {
System.err.println("1: Can't load "+lexerName+" as lexer or parser");
return;
}
Constructor<? extends Lexer> lexerCtor = lexerClass.getConstructor(CharStream.class);
Lexer lexer = lexerCtor.newInstance((CharStream)null);
Class<? extends Parser> parserClass = null;
Parser parser = null;
if ( !startRuleName.equals(LEXER_START_RULE_NAME) ) {
String parserName = grammarName+"Parser";
parserClass = cl.loadClass(parserName).asSubclass(Parser.class);
if ( parserClass==null ) {
System.err.println("Can't load "+parserName);
}
Constructor<? extends Parser> parserCtor = parserClass.getConstructor(TokenStream.class);
parser = parserCtor.newInstance((TokenStream)null);
}
if ( inputFiles.size()==0 ) {
InputStream is = System.in;
Reader r;
if ( encoding!=null ) {
r = new InputStreamReader(is, encoding);
}
else {
r = new InputStreamReader(is);
}
process(lexer, parserClass, parser, is, r);
return;
}
for (String inputFile : inputFiles) {
InputStream is = System.in;
if ( inputFile!=null ) {
is = new FileInputStream(inputFile);
}
Reader r;
if ( encoding!=null ) {
r = new InputStreamReader(is, encoding);
}
else {
r = new InputStreamReader(is);
}
if ( inputFiles.size()>1 ) {
System.err.println(inputFile);
}
process(lexer, parserClass, parser, is, r);
}
}
protected void process(Lexer lexer, Class<? extends Parser> parserClass, Parser parser, InputStream is, Reader r) throws IOException, IllegalAccessException, InvocationTargetException, PrintException {
try {
ANTLRInputStream input = new ANTLRInputStream(r);
lexer.setInputStream(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
if ( showTokens ) {
for (Object tok : tokens.getTokens()) {
System.out.println(tok);
}
}
if ( startRuleName.equals(LEXER_START_RULE_NAME) ) return;
if ( diagnostics ) {
parser.addErrorListener(new DiagnosticErrorListener());
parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
}
if ( printTree || gui || psFile!=null ) {
parser.setBuildParseTree(true);
}
if ( SLL ) { // overrides diagnostics
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
}
parser.setTokenStream(tokens);
parser.setTrace(trace);
//parser.setErrorHandler(new BailErrorStrategy());
try {
Method startRule = parserClass.getMethod(startRuleName);
ParserRuleContext tree = (ParserRuleContext)startRule.invoke(parser, (Object[])null);
if ( printTree ) {
System.out.println(tree.toStringTree(parser));
}
if ( gui ) {
tree.inspect(parser);
}
if ( psFile!=null ) {
tree.save(parser, psFile); // Generate postscript
}
}
catch (NoSuchMethodException nsme) {
System.err.println("No method for rule "+startRuleName+" or it has arguments");
}
}
finally {
if ( r!=null ) r.close();
if ( is!=null ) is.close();
}
}
#SuppressWarnings("unused")
private static class BailErrorStrategy extends DefaultErrorStrategy {
/** Instead of recovering from exception e, rethrow it wrapped
* in a generic RuntimeException so it is not caught by the
* rule function catches. Exception e is the "cause" of the
* RuntimeException.
*/
@Override
public void recover(Parser recognizer, RecognitionException e) {
throw new RuntimeException(e);
}
/** Make sure we don't attempt to recover inline; if the parser
* successfully recovers, it won't throw an exception.
*/
@Override
public Token recoverInline(Parser recognizer)
throws RecognitionException
{
throw new RuntimeException(new InputMismatchException(recognizer));
}
/** Make sure we don't attempt to recover from problems in subrules. */
@Override
public void sync(Parser recognizer) { }
}
}
Hope this helps!
Create a new ANTLR 4 Project
Convert the project to faceted form
Add Java project facet
Optionally you can add package names to be generated in the header of the grammar file
Use target/generated-sources/antlr4 as source folder
Edit and save the grammar file to re-generate everything
The antlr4 generated sources should now be packaged and imported as usual inside your project.

JMapframe displays only a single shapefile

I used NetBeans and GeoTools to program a graphical interface to display multiple shapefiles in the same JMapFrame. I used the following code, but when it executes it displays only one shapefile, and I do not know why. Please, can someone help me? I await your answers.
import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Geometry;
import java.io.File;
import org.geotools.data.FeatureSource;
import org.geotools.data.FileDataStore;
import org.geotools.data.FileDataStoreFinder;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.data.simple.SimpleFeatureIterator;
import org.geotools.map.DefaultMapContext;
import org.geotools.map.MapContext;
import org.geotools.swing.JMapFrame;
import org.geotools.swing.data.JFileDataStoreChooser;
import org.opengis.feature.simple.SimpleFeature;
/**
*
* @author Brahim
*/
class ImportVecteur2
{
private JMapFrame fenMap;
private MapContext mapContext;
ImportVecteur2(JMapFrame fenMap)
{
//this.mapContext = mapContext;
this.fenMap = fenMap;
}
#SuppressWarnings("static-access")
public void chercheAfficheVecteur() //throws Exception
{
try
{
File file = JFileDataStoreChooser.showOpenFile("shp", null);
if (file == null)
{
return;
}
FileDataStore store = FileDataStoreFinder.getDataStore(file);
FeatureSource featureSource = store.getFeatureSource();
//get vertices of file
// Create a map context and add our shapefile to it
mapContext = new DefaultMapContext();
mapContext.addLayer(featureSource, null);
// Now display the map
fenMap.enableLayerTable(true);
fenMap.setMapContext(mapContext);
fenMap.setVisible(true);
}
catch (Exception e)
{
e.printStackTrace();
}
}
}
Each time you call chercheAfficheVecteur you create a new MapContext, so the previous one is thrown away, and with it your previous shapefile. If you change the method to:
public void chercheAfficheVecteur() {
try {
File file = JFileDataStoreChooser.showOpenFile("shp", null);
if (file == null) {
return;
}
FileDataStore store = FileDataStoreFinder.getDataStore(file);
FeatureSource featureSource = store.getFeatureSource();
//get vertices of file
// Create a map context and add our shapefile to it
if(mapContext == null){
mapContext = new DefaultMapContext();
fenMap.setMapContext(mapContext);
}
//make it look prettier
Style style = SLD.createSimpleStyle(featureSource.getSchema());
mapContext.addLayer(featureSource, style);
} catch (IOException e) {
e.printStackTrace();
}
}
and
ImportVecteur2(JMapFrame fenMap)
{
//this.mapContext = mapContext;
this.fenMap = fenMap;
fenMap.enableLayerTable(true);
fenMap.setVisible(true);
}
It should work better.
After further testing (i.e. I actually compiled some code): MapContext is deprecated (and has been for some time), so please use MapContent.
package org.geotools.tutorial.quickstart;
import java.awt.Color;
import java.awt.Dimension;
import java.io.File;
import java.io.IOException;
import org.geotools.data.FeatureSource;
import org.geotools.data.FileDataStore;
import org.geotools.data.FileDataStoreFinder;
import org.geotools.map.FeatureLayer;
import org.geotools.map.Layer;
import org.geotools.map.MapContent;
import org.geotools.styling.SLD;
import org.geotools.styling.Style;
import org.geotools.swing.JMapFrame;
import org.geotools.swing.data.JFileDataStoreChooser;
public class Test {
private static final Color[] color = { Color.red, Color.blue, Color.green,
Color.MAGENTA };
private static MapContent mapContext;
private static JMapFrame fenMap;
public static void main(String args[]) throws IOException {
Test me = new Test();
me.run();
}
public void run() throws IOException {
fenMap = new JMapFrame();
mapContext = new MapContent();
fenMap.setMapContent(mapContext);
fenMap.enableToolBar(true);
fenMap.setMinimumSize(new Dimension(300, 300));
fenMap.setVisible(true);
int i = 0;
while (chercheAfficheVecteur(i)) {
i++;
i = i % color.length;
}
}
public boolean chercheAfficheVecteur(int next) throws IOException {
File file = JFileDataStoreChooser.showOpenFile("shp", null);
if (file == null) {
return false;
}
FileDataStore store = FileDataStoreFinder.getDataStore(file);
FeatureSource featureSource = store.getFeatureSource();
// get vertices of file
// Add our shapefile to the map content created in run()
// make it look prettier
Style style = SLD.createSimpleStyle(featureSource.getSchema(), color[next]);
Layer layer = new FeatureLayer(featureSource, style);
mapContext.addLayer(layer);
return true;
}
}

How to change GWT Place URL from the default ":" to "/"?

By default, a GWT Place URL consists of the Place's simple class name (like "HelloPlace") followed by a colon (:) and the token returned by the PlaceTokenizer.
My question is how can I change ":" to be "/"?
I just made my own PlaceHistoryMapper that directly implements the interface instead of using AbstractPlaceHistoryMapper:
public class AppPlaceHistoryMapper implements PlaceHistoryMapper
{
String delimiter = "/";
@Override
public Place getPlace(String token)
{
String[] tokens = token.split(delimiter, 2);
if (tokens[0].equals("HelloPlace"))
...
}
@Override
public String getToken(Place place)
{
if (place instanceof HelloPlace)
{
return "HelloPlace" + delimiter + whatever;
}
else ...
}
}
It's certainly extra code to write, but you can control your url structure all in one place, and use slashes instead of colons!
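For completeness, wiring the hand-written mapper in is the same as with a generated one. A sketch, where placeController, eventBus and defaultPlace are the usual objects from a standard activities-and-places setup:
// Use the hand-written mapper instead of GWT.create(...):
PlaceHistoryMapper historyMapper = new AppPlaceHistoryMapper();
PlaceHistoryHandler historyHandler = new PlaceHistoryHandler(historyMapper);
// Hook the handler up to the PlaceController and define the fallback place:
historyHandler.register(placeController, eventBus, defaultPlace);
historyHandler.handleCurrentHistory();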
Here is how to customize the delimiter, while using the standard GWT places. (PlaceHistoryMapper)
Nothing else needs to be changed; it works with the standard way of using Places and Tokenizers.
Insert into gwt.xml
<generate-with
class="com.google.gwt.place.rebind.CustomPlaceHistoryMapperGenerator">
<when-type-assignable class="com.google.gwt.place.shared.PlaceHistoryMapper" />
</generate-with>
Add CustomAbstractPlaceHistoryMapper
package com.siderakis.client.mvp;
import com.google.gwt.place.impl.AbstractPlaceHistoryMapper;
import com.google.gwt.place.shared.Place;
import com.google.gwt.place.shared.PlaceTokenizer;
public abstract class CustomAbstractPlaceHistoryMapper<F> extends AbstractPlaceHistoryMapper<F> {
public final static String DELIMITER = "/";
public static class CustomPrefixAndToken extends PrefixAndToken {
public CustomPrefixAndToken(String prefix, String token) {
super(prefix, token);
assert prefix != null && !prefix.contains(DELIMITER);
}
@Override
public String toString() {
return (prefix.length() == 0) ? token : prefix + DELIMITER + token;
}
}
@Override
public Place getPlace(String token) {
int colonAt = token.indexOf(DELIMITER);
String initial;
String rest;
if (colonAt >= 0) {
initial = token.substring(0, colonAt);
rest = token.substring(colonAt + 1);
} else {
initial = "";
rest = token;
}
PlaceTokenizer tokenizer = getTokenizer(initial);
if (tokenizer != null) {
return tokenizer.getPlace(rest);
}
return null;
}
}
Add CustomPlaceHistoryMapperGenerator
/*
* Copyright 2010 Google Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/
package com.google.gwt.place.rebind;
import java.io.PrintWriter;
import com.siderakis.client.mvp.CustomAbstractPlaceHistoryMapper;
import com.siderakis.client.mvp.CustomAbstractPlaceHistoryMapper.CustomPrefixAndToken;
import com.google.gwt.core.client.GWT;
import com.google.gwt.core.ext.Generator;
import com.google.gwt.core.ext.GeneratorContext;
import com.google.gwt.core.ext.TreeLogger;
import com.google.gwt.core.ext.UnableToCompleteException;
import com.google.gwt.core.ext.typeinfo.JClassType;
import com.google.gwt.core.ext.typeinfo.JMethod;
import com.google.gwt.place.shared.Place;
import com.google.gwt.place.shared.PlaceTokenizer;
import com.google.gwt.user.rebind.ClassSourceFileComposerFactory;
import com.google.gwt.user.rebind.SourceWriter;
/**
* Generates implementations of
* {@link com.google.gwt.place.shared.PlaceHistoryMapper PlaceHistoryMapper}.
*/
public class CustomPlaceHistoryMapperGenerator extends Generator {
private PlaceHistoryGeneratorContext context;
@Override
public String generate(TreeLogger logger, GeneratorContext generatorContext,
String interfaceName) throws UnableToCompleteException {
context = PlaceHistoryGeneratorContext.create(logger,
generatorContext.getTypeOracle(), interfaceName);
if (context == null) {
return null;
}
PrintWriter out = generatorContext.tryCreate(logger, context.packageName,
context.implName);
if (out != null) {
generateOnce(generatorContext, context, out);
}
return context.packageName + "." + context.implName;
}
private void generateOnce(GeneratorContext generatorContext, PlaceHistoryGeneratorContext context,
PrintWriter out) throws UnableToCompleteException {
TreeLogger logger = context.logger.branch(TreeLogger.DEBUG, String.format(
"Generating implementation of %s", context.interfaceType.getName()));
ClassSourceFileComposerFactory f = new ClassSourceFileComposerFactory(
context.packageName, context.implName);
String superClassName = String.format("%s<%s>",
CustomAbstractPlaceHistoryMapper.class.getSimpleName(),
context.factoryType == null ? "Void" : context.factoryType.getName());
f.setSuperclass(superClassName);
f.addImplementedInterface(context.interfaceType.getName());
f.addImport(CustomAbstractPlaceHistoryMapper.class.getName());
f.addImport(context.interfaceType.getQualifiedSourceName());
f.addImport(CustomAbstractPlaceHistoryMapper.class.getCanonicalName());
if (context.factoryType != null) {
f.addImport(context.factoryType.getQualifiedSourceName());
}
f.addImport(Place.class.getCanonicalName());
f.addImport(PlaceTokenizer.class.getCanonicalName());
f.addImport(CustomPrefixAndToken.class.getCanonicalName());
f.addImport(GWT.class.getCanonicalName());
SourceWriter sw = f.createSourceWriter(generatorContext, out);
sw.println();
writeGetPrefixAndToken(context, sw);
sw.println();
writeGetTokenizer(context, sw);
sw.println();
sw.outdent();
sw.println("}");
generatorContext.commit(logger, out);
}
private void writeGetPrefixAndToken(PlaceHistoryGeneratorContext context,
SourceWriter sw) throws UnableToCompleteException {
sw.println("protected CustomPrefixAndToken getPrefixAndToken(Place newPlace) {");
sw.indent();
for (JClassType placeType : context.getPlaceTypes()) {
String placeTypeName = placeType.getQualifiedSourceName();
String prefix = context.getPrefix(placeType);
sw.println("if (newPlace instanceof " + placeTypeName + ") {");
sw.indent();
sw.println(placeTypeName + " place = (" + placeTypeName + ") newPlace;");
JMethod getter = context.getTokenizerGetter(prefix);
if (getter != null) {
sw.println(String.format("return new CustomPrefixAndToken(\"%s\", "
+ "factory.%s().getToken(place));", escape(prefix),
getter.getName()));
} else {
sw.println(String.format(
"PlaceTokenizer<%s> t = GWT.create(%s.class);", placeTypeName,
context.getTokenizerType(prefix).getQualifiedSourceName()));
sw.println(String.format("return new CustomPrefixAndToken(\"%s\", "
+ "t.getToken((%s) place));", escape(prefix), placeTypeName));
}
sw.outdent();
sw.println("}");
}
sw.println("return null;");
sw.outdent();
sw.println("}");
}
private void writeGetTokenizer(PlaceHistoryGeneratorContext context,
SourceWriter sw) throws UnableToCompleteException {
sw.println("protected PlaceTokenizer getTokenizer(String prefix) {");
sw.indent();
for (String prefix : context.getPrefixes()) {
JMethod getter = context.getTokenizerGetter(prefix);
sw.println("if (\"" + escape(prefix) + "\".equals(prefix)) {");
sw.indent();
if (getter != null) {
sw.println("return factory." + getter.getName() + "();");
} else {
sw.println(String.format("return GWT.create(%s.class);",
context.getTokenizerType(prefix).getQualifiedSourceName()));
}
sw.outdent();
sw.println("}");
}
sw.println("return null;");
sw.outdent();
sw.println("}");
sw.outdent();
}
}
Good question. The problem is that this is hard-coded into AbstractPlaceHistoryMapper:
AbstractPlaceHistoryMapper.PrefixAndToken.toString():
return (prefix.length() == 0) ? token : prefix + ":" + token;
AbstractPlaceHistoryMapper.getPlace(String token):
int colonAt = token.indexOf(':');
...
And AbstractPlaceHistoryMapper is hard-coded into PlaceHistoryMapperGenerator.
It would probably be possible to exchange the generator by supplying your own module xml file, and reconfiguring the binding, but overall, I would consider this as "basically not configurable". (But see Riley's answer for a good alternative without declarative Tokenizer configuration!)
I really appreciated the answers; thanks SO for making me optimize my time (I was following in the debugger where my getToken() method was called, and it goes through listeners, handlers, magicians and funny stuff like that :-)
So the HistoryMapper idea is perfect, but if what you want is to add crawler support, the solution is a lot simpler, since you just need to add an extra '!' after the bookmark hash (#!).
So it is enough to simply decorate the original result with an extra character.
public class PlaceHistoryMapperDecorator implements PlaceHistoryMapper {
private static final String CRAWLER_PREFIX = "!";
protected PlaceHistoryMapper delegateHistoryMapper;
public PlaceHistoryMapperDecorator(PlaceHistoryMapper delegateHistoryMapper) {
this.delegateHistoryMapper = delegateHistoryMapper;
}
@Override
public Place getPlace(String token) {
String cleanToken = token;
if (token.startsWith(CRAWLER_PREFIX))
cleanToken = token.substring(CRAWLER_PREFIX.length());
else {
if (token.length() > 0)
System.err.println("there might be an error: can't find crawler prefix in " + token);
}
return delegateHistoryMapper.getPlace(cleanToken);
}
@Override
public String getToken(Place place) {
return CRAWLER_PREFIX + delegateHistoryMapper.getToken(place);
}
}
Then you pass that new instance to your PlaceHistoryHandler and that's it
PlaceHistoryMapperDecorator historyMapperDecorator = new PlaceHistoryMapperDecorator((PlaceHistoryMapper) GWT.create(AppPlaceHistoryMapper.class));
PlaceHistoryHandler historyHandler = new PlaceHistoryHandler(historyMapperDecorator);
I tested it before posting this message, it works fine :-)