How to process multi line input records in Spark - scala

I have each record spread across multiple lines in the input file (a very large file).
Example:
Id: 2
ASIN: 0738700123
title: Test tile for this product
group: Book
salesrank: 168501
similar: 5 0738700811 1567184912 1567182813 0738700514 0738700915
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
reviews: total: 12 downloaded: 12 avg rating: 4.5
2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
2002-1-7 cutomer: A9CQ3PLRNIR83 rating: 4 votes: 5 helpful: 5
How to identify and process each multi line record in spark?

If the multi-line data has a well-defined record separator, you can use Hadoop's support for multi-line records and provide the separator through a Hadoop Configuration object. Something like this should do:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration
// The delimiter is matched literally and is case-sensitive; the sample data uses "Id:"
conf.set("textinputformat.record.delimiter", "Id:")
val dataset = sc.newAPIHadoopFile("/path/to/data", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val data = dataset.map(x=>x._2.toString)
This will provide you with an RDD[String] where each element corresponds to a record. Afterwards you need to parse each record following your application requirements.
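Parsing each record is ordinary string handling once Spark hands over one string per record. Below is a minimal sketch in plain Java (it could also be called from a Scala map), assuming the record text is what remains after splitting on the Id: separator, so its first line holds the bare id. The class name RecordParserSketch and the decision to keep only single-word keys are illustrative assumptions, not part of the answer above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns one multi-line record into a field map.
class RecordParserSketch {

    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        String[] lines = record.trim().split("\\r?\\n");
        // Assumption: the "Id:" delimiter was consumed, so line 0 is the bare id.
        fields.put("Id", lines[0].trim());
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // category paths such as |Books[283155]|... carry no key
            }
            String key = line.substring(0, colon);
            // Keep simple one-word keys; review lines start with a date and are skipped.
            if (key.matches("[A-Za-z]+")) {
                fields.putIfAbsent(key, line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = " 2\nASIN: 0738700123\ntitle: Test tile for this product\n"
                + "group: Book\nsalesrank: 168501\n";
        System.out.println(parse(record));
    }
}
```

Nested parts such as the category paths and the individual reviews are deliberately ignored here; they would need their own parsing rules depending on what the application needs.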

I have done this by implementing a custom input format and record reader.
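import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
public class ParagraphInputFormat extends TextInputFormat {
@Override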
public RecordReader<LongWritable, Text> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) {
return new ParagraphRecordReader();
}
}
public class ParagraphRecordReader extends RecordReader<LongWritable, Text> {
private long end;
private boolean stillInChunk = true;
private LongWritable key = new LongWritable();
private Text value = new Text();
private FSDataInputStream fsin;
private DataOutputBuffer buffer = new DataOutputBuffer();
private byte[] endTag = "\n\r\n".getBytes();
public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
FileSplit split = (FileSplit) inputSplit;
Configuration conf = taskAttemptContext.getConfiguration();
Path path = split.getPath();
FileSystem fs = path.getFileSystem(conf);
fsin = fs.open(path);
long start = split.getStart();
end = split.getStart() + split.getLength();
fsin.seek(start);
if (start != 0) {
readUntilMatch(endTag, false);
}
}
public boolean nextKeyValue() throws IOException {
if (!stillInChunk) return false;
boolean status = readUntilMatch(endTag, true);
value = new Text();
value.set(buffer.getData(), 0, buffer.getLength());
key = new LongWritable(fsin.getPos());
buffer.reset();
if (!status) {
stillInChunk = false;
}
return true;
}
public LongWritable getCurrentKey() throws IOException, InterruptedException {
return key;
}
public Text getCurrentValue() throws IOException, InterruptedException {
return value;
}
public float getProgress() throws IOException, InterruptedException {
return 0;
}
public void close() throws IOException {
fsin.close();
}
private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
int i = 0;
while (true) {
int b = fsin.read();
if (b == -1) return false;
if (withinBlock) buffer.write(b);
if (b == match[i]) {
i++;
if (i >= match.length) {
return fsin.getPos() < end;
}
} else i = (b == match[0]) ? 1 : 0; // a mismatched byte may itself start a new match
}
}
}
endTag identifies the end of each record.
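The delimiter scan can be tried without Hadoop. The following is a minimal sketch of the same byte-by-byte matching idea over an in-memory byte array instead of an FSDataInputStream; the class name DelimiterScanSketch is hypothetical. Unlike the reader above, it strips the delimiter from each record. Its restart rule after a mismatch is enough for delimiters such as "\n\n" or "\n\r\n", but it is not a general string-matching algorithm:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone version of the delimiter scan.
class DelimiterScanSketch {

    static List<String> split(byte[] data, byte[] delimiter) {
        List<String> records = new ArrayList<>();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int matched = 0; // number of delimiter bytes matched so far
        for (byte b : data) {
            buffer.write(b);
            if (b == delimiter[matched]) {
                matched++;
                if (matched == delimiter.length) {
                    // Full delimiter seen: emit the record without the delimiter bytes.
                    byte[] bytes = buffer.toByteArray();
                    records.add(new String(bytes, 0, bytes.length - delimiter.length,
                            StandardCharsets.UTF_8));
                    buffer.reset();
                    matched = 0;
                }
            } else {
                // The mismatched byte may itself be the start of a delimiter.
                matched = (b == delimiter[0]) ? 1 : 0;
            }
        }
        if (buffer.size() > 0) {
            // Trailing record that is not followed by a delimiter.
            records.add(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "a\nb\n\nc\nd\n\ne".getBytes(StandardCharsets.UTF_8);
        byte[] blankLine = "\n\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(split(data, blankLine).size() + " records");
    }
}
```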

TableView, TableColumns vanish off the right edge when resizing

Using some custom resizing behaviour I'm losing columns off the right side of the TableView. I have to use UNCONSTRAINED_RESIZE_POLICY (or maybe write a custom POLICY) so that I can size some of the columns to their content.
I have some custom behaviour for the resizing of columns in the TableViews I use in my application.
I use reflection to autosize some columns to their content when the data first populates. The width of each remaining column is set to an equal share of the remaining width (if 3 columns are not being autosized, then column width = remaining width / 3).
I also have a column width listener that fires when a user drags a column divider or double-clicks the header divider to size the column to its content. I also listen to the width of the table itself and assign any extra width to the last column.
The above works OK, but when a user widens one or more columns to the point where the last column is as small as it can go, columns start getting pushed off the right side of the TableView. It makes sense that it would do this, as I have my policy set to UNCONSTRAINED. I obviously can't use CONSTRAINED_RESIZE_POLICY or the above logic won't work.
Is there a custom policy out there that will shrink the columns to the right one by one as the user increases a column's width: the rightmost column first until it is as small as it can be, then the next rightmost, and so on? Or do I need to write this behaviour myself? I did come across a Kotlin-based policy in TornadoFX that looked interesting, but I'd rather stay pure Java.
Basically the outcome I want is what I have now but any user resizing just reduces the right-most column to a minimum size, then the next right-most and so on until all the columns to the right of the column the user is resizing are at minimum size but are still visible on the TableView. If there are no columns to the right that can be resized then the user shouldn't be able to resize their column without first resizing a column to the left.
Columns should never disappear off the right side of the TableView.
I've written a test class that mimics this behaviour. It's slightly verbose in places and would be refactored in the real application.
package application;
import java.io.Serializable;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;
import java.util.ResourceBundle;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javafx.application.Application;
import javafx.application.Platform;
import javafx.beans.property.ReadOnlyObjectWrapper;
import javafx.beans.value.ChangeListener;
import javafx.beans.value.ObservableValue;
import javafx.collections.FXCollections;
import javafx.collections.ListChangeListener;
import javafx.collections.ObservableList;
import javafx.concurrent.Task;
import javafx.scene.Group;
import javafx.scene.Scene;
import javafx.scene.control.Skin;
import javafx.scene.control.TableColumn;
import javafx.scene.control.TableColumn.CellDataFeatures;
import javafx.scene.control.TableColumnBase;
import javafx.scene.control.TableView;
import javafx.scene.control.cell.PropertyValueFactory;
import javafx.scene.layout.HBox;
import javafx.scene.layout.Priority;
import javafx.stage.Stage;
import javafx.util.Callback;
public class TableViewSample extends Application {
private TableView<TableData> table = new TableView<TableData>();
private ObservableList<TableData> data = FXCollections.observableArrayList();
private boolean columnResizeOperationPerformed = false;
private String resizeThreeColumn = "";
private String resizeFourColumn = "";
private String resizeSixColumn = "";
private final ExecutorService executorService = Executors.newFixedThreadPool(1);
public static void main(String[] args) {
launch(args);
}
@Override
public void start(Stage stage) {
Scene scene = new Scene(new Group());
stage.setWidth(1300);
stage.setHeight(600);
TableColumn<TableData, String> oneColumn = new TableColumn<>("One");
TableColumn<TableData, String> twoColumn = new TableColumn<>("Two");
TableColumn<TableData, String> threeColumn = new TableColumn<>("Three");
TableColumn<TableData, String> fourColumn = new TableColumn<>("Four");
TableColumn<TableData, String> fiveColumn = new TableColumn<>("Five");
TableColumn<TableData, String> sixColumn = new TableColumn<>("Six");
TableColumn<TableData, String> sevenColumn = new TableColumn<>("");
TableColumn<TableData, String> eightColumn = new TableColumn<>("");
TableColumn<TableData, String> nineColumn = new TableColumn<>("Nine");
TableColumn<TableData, String> tenColumn = new TableColumn<>("Ten");
TableColumn<TableData, String> elevenColumn = new TableColumn<>("Eleven");
TableColumn<TableData, String> twelveColumn = new TableColumn<>("Twelve");
TableColumn<TableData, String> thirteenColumn = new TableColumn<>("Thirteen");
TableColumn<TableData, String> lastColumn = new TableColumn<>("Last");
table.setEditable(false);
table.setPrefWidth(1100);
table.setMaxWidth(1100);
table.setItems(data);
table.getColumns().addAll(oneColumn, twoColumn, threeColumn, fourColumn, fiveColumn, sixColumn, sevenColumn, eightColumn, nineColumn, tenColumn, elevenColumn, twelveColumn, thirteenColumn, lastColumn);
table.setFixedCellSize(25.0);
// This cellValueFactory code could be refactored in the real application
oneColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("oneColumn"));
oneColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getOneColumn());
}
});
twoColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("twoColumn"));
twoColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getTwoColumn());
}
});
threeColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("threeColumn"));
threeColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getThreeColumn());
}
});
fourColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("fourColumn"));
fourColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getFourColumn());
}
});
fiveColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("fiveColumn"));
fiveColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getFiveColumn());
}
});
sixColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("sixColumn"));
sixColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getSixColumn());
}
});
sevenColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("sevenColumn"));
sevenColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getSevenColumn());
}
});
eightColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("eightColumn"));
eightColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getEightColumn());
}
});
nineColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("nineColumn"));
nineColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getNineColumn());
}
});
tenColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("tenColumn"));
tenColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getTenColumn());
}
});
elevenColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("elevenColumn"));
elevenColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getElevenColumn());
}
});
twelveColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("twelveColumn"));
twelveColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getTwelveColumn());
}
});
thirteenColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("thirteenColumn"));
thirteenColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getThirteenColumn());
}
});
lastColumn.setCellValueFactory(new PropertyValueFactory<TableData, String>("lastColumn"));
lastColumn.setCellValueFactory(new Callback<CellDataFeatures<TableData, String>, ObservableValue<String>>() {
public ObservableValue<String> call(CellDataFeatures<TableData, String> p) {
return new ReadOnlyObjectWrapper(p.getValue().getLastColumn());
}
});
// using CONSTRAINED_RESIZE_POLICY will cause all kinds of odd behaviour because of the autoresize and then the columnWidthListener below.
table.setColumnResizePolicy(TableView.UNCONSTRAINED_RESIZE_POLICY);
table.getItems().addListener(new ListChangeListener<TableData>() {
@Override
public void onChanged(Change<? extends TableData> c) {
// check to see if any of the data coming in has column 3 or 4 values that columns can be resized with
if (!columnResizeOperationPerformed) {
boolean outerBreak = false;
while (c.next() && !outerBreak) {
List<? extends TableData> addedSubList = c.getAddedSubList();
if (!addedSubList.isEmpty()) {
for (TableData data : addedSubList) {
outerBreak = checkForColThreeOrFourData(data);
}
}
}
}
// resize some columns to fit contents, other columns to take up remaining space
if (!columnResizeOperationPerformed && !table.getItems().isEmpty()) {
// only prevent future column resizing if the threeColumn has some valid data to size on
if (resizeThreeColumn.length() > 0) {
columnResizeOperationPerformed = true;
}
double totalWidth = 0;
totalWidth = autosizeColumn(oneColumn);
totalWidth += autosizeColumn(threeColumn);
totalWidth += autosizeColumn(fourColumn);
totalWidth += autosizeColumn(sixColumn);
totalWidth += autosizeColumn(sevenColumn);
totalWidth += autosizeColumn(eightColumn);
totalWidth += autosizeColumn(nineColumn);
totalWidth += autosizeColumn(tenColumn);
totalWidth += autosizeColumn(elevenColumn);
totalWidth += autosizeColumn(twelveColumn);
totalWidth += autosizeColumn(lastColumn);
double remainingWidth = table.getWidth() - totalWidth;
sizeColumn(twoColumn, remainingWidth / 4.0);
sizeColumn(fiveColumn, remainingWidth / 4.0);
sizeColumn(thirteenColumn, remainingWidth / 4.0);
table.requestLayout();
}
}
});
ChangeListener<? super Number> columnWidthListener = (obs, ov, nv) -> {
double totalWidth = table.getColumns().stream()
.filter(tc -> !tc.equals(lastColumn))
.mapToDouble(TableColumnBase::getWidth)
.sum();
sizeColumn(lastColumn, table.getWidth() - totalWidth);
};
// listen for any column resizing or table width changes and assign extra width to the lastColumn above
table.getColumns().stream()
.filter(tc -> !tc.equals(lastColumn)).forEach(tc -> {
tc.widthProperty().addListener(columnWidthListener);
});
table.widthProperty().addListener(columnWidthListener);
HBox hBox = new HBox(table);
HBox.setHgrow(table, Priority.ALWAYS);
((Group) scene.getRoot()).getChildren().addAll(hBox);
stage.setScene(scene);
stage.show();
// create Task to update the table data after the UI is constructed so that the column autoresizing code above is applied as data is populated.
Task task = new Task() {
@Override
protected Object call() {
try {
Thread.sleep(100);
Platform.runLater(() -> {
updateTableData();
});
}
catch (Exception ex) {}
return null;
}
};
executorService.submit(task);
}
/**
* A test version of a check from the real application to make sure resizing of columns happens when column data of specific columns is valid
*
* @param tableData
* @return
*/
private boolean checkForColThreeOrFourData(TableData tableData) {
if (resizeThreeColumn.length() == 0) {
resizeThreeColumn = tableData.getThreeColumn();
}
if (resizeFourColumn.length() == 0) {
resizeFourColumn = tableData.getFourColumn();
}
if (resizeSixColumn.length() == 0) {
resizeSixColumn = tableData.getSixColumn();
}
if ((resizeThreeColumn.length() > 0) && resizeFourColumn.length() > 0 && resizeSixColumn.length() > 0) { return true; }
return false;
}
public void sizeColumn(TableColumn<?, ?> column, double width) {
column.setPrefWidth(width);
}
public static double autosizeColumn(TableColumn<?, ?> column) {
final TableView<?> tableView = column.getTableView();
final Skin<?> skin = tableView.getSkin();
final int rowsToMeasure = -1;
try {
Method method = skin.getClass().getDeclaredMethod("resizeColumnToFitContent", TableColumn.class, int.class);
method.setAccessible(true);
method.invoke(skin, column, rowsToMeasure);
}
catch (Exception e) {
e.printStackTrace();
}
return column.getWidth();
}
private void updateTableData() {
data.setAll(Arrays.asList(new TableData("Manufacturer1", "User 1", "value12345", "desc12345", "defaultName", "17:04:49 15/05/19", "200", "0", "0", "3", "12", "2", "16-15-14", "80"),
new TableData("Manufacturer2", "User 2", "value67890", "desc67890", "", "17:06:38 15/05/19", "100", "0", "0", "3", "11", "2", "16-15-14", "82")));
}
class TableData implements Serializable {
private static final long serialVersionUID = 1L;
private String oneColumn;
private String twoColumn;
private String threeColumn;
private String fourColumn;
private String fiveColumn;
private String sixColumn;
private String sevenColumn;
private String eightColumn;
private String nineColumn;
private String tenColumn;
private String elevenColumn;
private String twelveColumn;
private String thirteenColumn;
private String lastColumn;
public TableData(String oneColumn, String twoColumn, String threeColumn, String fourColumn, String fiveColumn, String sixColumn, String sevenColumn, String eightColumn, String nineColumn, String tenColumn, String elevenColumn,
String twelveColumn, String thirteenColumn, String lastColumn) {
this.oneColumn = oneColumn;
this.twoColumn = twoColumn;
this.threeColumn = threeColumn;
this.fourColumn = fourColumn;
this.fiveColumn = fiveColumn;
this.sixColumn = sixColumn;
this.sevenColumn = sevenColumn;
this.eightColumn = eightColumn;
this.nineColumn = nineColumn;
this.tenColumn = tenColumn;
this.elevenColumn = elevenColumn;
this.twelveColumn = twelveColumn;
this.thirteenColumn = thirteenColumn;
this.lastColumn = lastColumn;
}
public String getOneColumn() {
return oneColumn;
}
public String getTwoColumn() {
return twoColumn;
}
public String getThreeColumn() {
return threeColumn;
}
public String getFourColumn() {
return fourColumn;
}
public String getFiveColumn() {
return fiveColumn;
}
public String getSixColumn() {
return sixColumn;
}
public String getSevenColumn() {
return sevenColumn;
}
public String getEightColumn() {
return eightColumn;
}
public String getNineColumn() {
return nineColumn;
}
public String getTenColumn() {
return tenColumn;
}
public String getElevenColumn() {
return elevenColumn;
}
public String getTwelveColumn() {
return twelveColumn;
}
public String getThirteenColumn() {
return thirteenColumn;
}
public String getLastColumn() {
return lastColumn;
}
}
}
As mentioned above this code will auto resize the selected columns and assign the remaining width equally to the other columns.
It will listen to user column width adjustments correctly.
What it won't do is prevent the columns at the right edge from vanishing off the view. I would like the right-hand columns to be reduced in width to accommodate the user's column width increase in the order described above: right-most first, continuing in from the right as columns reach their minimum.
Thanks for any help.
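The width redistribution described in the question can be worked out separately from JavaFX. Here is a minimal sketch of just the arithmetic, under the assumption that the current and minimum widths are available as arrays; the class name RightmostShrinkSketch and the method grow are hypothetical, and this is not an existing resize policy:

```java
import java.util.Arrays;

// Hypothetical helper, independent of JavaFX: computes new column widths only.
class RightmostShrinkSketch {

    /**
     * @param widths    current column widths, left to right
     * @param minWidths minimum width of each column
     * @param resized   index of the column the user is widening
     * @param requested extra width the user asked for (positive)
     * @return new widths; their sum equals the sum of the input widths
     */
    static double[] grow(double[] widths, double[] minWidths, int resized, double requested) {
        double[] result = widths.clone();
        double remaining = requested;
        // Walk in from the right edge, taking slack until the request is met.
        for (int i = result.length - 1; i > resized && remaining > 0; i--) {
            double slack = Math.max(result[i] - minWidths[i], 0);
            double taken = Math.min(slack, remaining);
            result[i] -= taken;
            remaining -= taken;
        }
        // Grant only as much as the columns to the right could give up.
        result[resized] += requested - remaining;
        return result;
    }

    public static void main(String[] args) {
        double[] widths = {100, 100, 100, 100};
        double[] mins = {10, 10, 50, 20};
        System.out.println(Arrays.toString(grow(widths, mins, 1, 100)));
    }
}
```

Because the total width never changes, no column can be pushed past the right edge, and a column with nothing shrinkable to its right simply cannot grow. In a real table the result would presumably be applied from a custom policy passed to setColumnResizePolicy (a Callback taking TableView.ResizeFeatures); that wiring is an untested assumption here.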

Could someone give me an example of how to extract coordinates for a 'word' using PDFBox

Could someone give me an example of how to extract coordinates for a 'word' with PDFBox
I am using this link to extract positions of individual characters:
https://www.tutorialkart.com/pdfbox/how-to-extract-coordinates-or-position-of-characters-in-pdf/
I am using this link to extract words:
https://www.tutorialkart.com/pdfbox/extract-words-from-pdf-document/
I am stuck getting coordinates for whole words.
You can extract the coordinates of words by collecting all the TextPosition objects building a word and combining their bounding boxes.
Implementing this along the lines of the two tutorials you referenced, you can extend PDFTextStripper like this:
public class GetWordLocationAndSize extends PDFTextStripper {
public GetWordLocationAndSize() throws IOException {
}
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
String wordSeparator = getWordSeparator();
List<TextPosition> word = new ArrayList<>();
for (TextPosition text : textPositions) {
String thisChar = text.getUnicode();
if (thisChar != null) {
if (thisChar.length() >= 1) {
if (!thisChar.equals(wordSeparator)) {
word.add(text);
} else if (!word.isEmpty()) {
printWord(word);
word.clear();
}
}
}
}
if (!word.isEmpty()) {
printWord(word);
word.clear();
}
}
void printWord(List<TextPosition> word) {
Rectangle2D boundingBox = null;
StringBuilder builder = new StringBuilder();
for (TextPosition text : word) {
Rectangle2D box = new Rectangle2D.Float(text.getXDirAdj(), text.getYDirAdj(), text.getWidthDirAdj(), text.getHeightDir());
if (boundingBox == null)
boundingBox = box;
else
boundingBox.add(box);
builder.append(text.getUnicode());
}
System.out.println(builder.toString() + " [(X=" + boundingBox.getX() + ",Y=" + boundingBox.getY()
+ ") height=" + boundingBox.getHeight() + " width=" + boundingBox.getWidth() + "]");
}
}
(ExtractWordCoordinates inner class)
and run it like this:
PDDocument document = PDDocument.load(resource);
PDFTextStripper stripper = new GetWordLocationAndSize();
stripper.setSortByPosition( true );
stripper.setStartPage( 0 );
stripper.setEndPage( document.getNumberOfPages() );
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
(ExtractWordCoordinates test testExtractWordsForGoodJuJu)
Applied to the apache.pdf example the tutorials use you get:
2017-8-6 [(X=26.004425048828125,Y=22.00372314453125) height=5.833024024963379 width=36.31868362426758]
Welcome [(X=226.44479370117188,Y=22.00372314453125) height=5.833024024963379 width=36.5999755859375]
to [(X=265.5881652832031,Y=22.00372314453125) height=5.833024024963379 width=8.032623291015625]
The [(X=276.1641845703125,Y=22.00372314453125) height=5.833024024963379 width=14.881439208984375]
Apache [(X=293.5890197753906,Y=22.00372314453125) height=5.833024024963379 width=29.848846435546875]
Software [(X=325.98126220703125,Y=22.00372314453125) height=5.833024024963379 width=35.271636962890625]
Foundation! [(X=363.7962951660156,Y=22.00372314453125) height=5.833024024963379 width=47.871429443359375]
Custom [(X=334.0334777832031,Y=157.6195068359375) height=4.546705722808838 width=25.03936767578125]
Search [(X=360.8929138183594,Y=157.6195068359375) height=4.546705722808838 width=22.702728271484375]
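The box-combining step in printWord relies on Rectangle2D.add, which grows a rectangle to the union of itself and its argument. Below is a minimal standalone sketch with made-up glyph boxes; no PDFBox is involved, and the class name BoundingBoxUnionSketch and the numbers are illustrative assumptions:

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: unions per-character boxes into one word box.
class BoundingBoxUnionSketch {

    static Rectangle2D union(List<Rectangle2D> glyphBoxes) {
        Rectangle2D result = null;
        for (Rectangle2D box : glyphBoxes) {
            if (result == null) {
                // Copy the first box so the caller's rectangle is not modified.
                result = (Rectangle2D) box.clone();
            } else {
                result.add(box); // expands result to cover box as well
            }
        }
        return result; // null when the list is empty
    }

    public static void main(String[] args) {
        List<Rectangle2D> glyphs = Arrays.asList(
                new Rectangle2D.Double(10, 20, 5, 8),
                new Rectangle2D.Double(15, 20, 6, 8),
                new Rectangle2D.Double(21, 21, 4, 7));
        System.out.println(union(glyphs));
    }
}
```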
You can create a CustomPDFTextStripper that extends PDFTextStripper and overrides protected void writeString(String text, List<TextPosition> textPositions). In this overridden method you need to split textPositions by the word separator to get a List<TextPosition> for each word. After that you can join the characters of each word and compute its bounding box.
The full example below also draws the resulting bounding boxes.
package com.example;
import lombok.Value;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.junit.Ignore;
import org.junit.Test;
import javax.imageio.ImageIO;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
public class PdfBoxTest {
private static final String BASE_DIR_PATH = "C:\\Users\\Milan\\50330484";
private static final String INPUT_FILE_PATH = "input.pdf";
private static final String OUTPUT_IMAGE_PATH = "output.jpg";
private static final String OUTPUT_BBOX_IMAGE_PATH = "output-bbox.jpg";
private static final float FROM_72_TO_300_DPI = 300.0f / 72.0f;
@Test
public void run() throws Exception {
pdfToImage();
drawBoundingBoxes();
}
@Ignore
@Test
public void pdfToImage() throws IOException {
PDDocument document = PDDocument.load(new File(BASE_DIR_PATH, INPUT_FILE_PATH));
PDFRenderer renderer = new PDFRenderer(document);
BufferedImage image = renderer.renderImageWithDPI(0, 300);
ImageIO.write(image, "JPEG", new File(BASE_DIR_PATH, OUTPUT_IMAGE_PATH));
}
@Ignore
@Test
public void drawBoundingBoxes() throws IOException {
PDDocument document = PDDocument.load(new File(BASE_DIR_PATH, INPUT_FILE_PATH));
List<WordWithBBox> words = getWords(document);
draw(words);
}
private List<WordWithBBox> getWords(PDDocument document) throws IOException {
CustomPDFTextStripper customPDFTextStripper = new CustomPDFTextStripper();
customPDFTextStripper.setSortByPosition(true);
customPDFTextStripper.setStartPage(0);
customPDFTextStripper.setEndPage(1);
Writer writer = new OutputStreamWriter(new ByteArrayOutputStream());
customPDFTextStripper.writeText(document, writer);
List<WordWithBBox> words = customPDFTextStripper.getWords();
return words;
}
private void draw(List<WordWithBBox> words) throws IOException {
BufferedImage bufferedImage = ImageIO.read(new File(BASE_DIR_PATH, OUTPUT_IMAGE_PATH));
Graphics2D graphics = bufferedImage.createGraphics();
graphics.setColor(Color.GREEN);
List<Rectangle> rectangles = words.stream()
.map(word -> new Rectangle(word.getX(), word.getY(), word.getWidth(), word.getHeight()))
.collect(Collectors.toList());
rectangles.forEach(graphics::draw);
graphics.dispose();
ImageIO.write(bufferedImage, "JPEG", new File(BASE_DIR_PATH, OUTPUT_BBOX_IMAGE_PATH));
}
private class CustomPDFTextStripper extends PDFTextStripper {
private final List<WordWithBBox> words;
public CustomPDFTextStripper() throws IOException {
this.words = new ArrayList<>();
}
public List<WordWithBBox> getWords() {
return new ArrayList<>(words);
}
@Override
protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
String wordSeparator = getWordSeparator();
List<TextPosition> wordTextPositions = new ArrayList<>();
for (TextPosition textPosition : textPositions) {
String str = textPosition.getUnicode();
if (wordSeparator.equals(str)) {
if (!wordTextPositions.isEmpty()) {
this.words.add(createWord(wordTextPositions));
wordTextPositions.clear();
}
} else {
wordTextPositions.add(textPosition);
}
}
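// Flush the trailing word: the positions usually do not end with a separator
if (!wordTextPositions.isEmpty()) {
this.words.add(createWord(wordTextPositions));
}
super.writeString(text, textPositions);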
}
private WordWithBBox createWord(List<TextPosition> wordTextPositions) {
String word = wordTextPositions.stream()
.map(TextPosition::getUnicode)
.collect(Collectors.joining());
int minX = Integer.MAX_VALUE;
int minY = Integer.MAX_VALUE;
int maxX = Integer.MIN_VALUE;
int maxY = Integer.MIN_VALUE;
for (TextPosition wordTextPosition : wordTextPositions) {
minX = Math.min(minX, from72To300Dpi(wordTextPosition.getXDirAdj()));
minY = Math.min(minY, from72To300Dpi(wordTextPosition.getYDirAdj() - wordTextPosition.getHeightDir()));
maxX = Math.max(maxX, from72To300Dpi(wordTextPosition.getXDirAdj() + wordTextPosition.getWidthDirAdj()));
maxY = Math.max(maxY, from72To300Dpi(wordTextPosition.getYDirAdj()));
}
return new WordWithBBox(word, minX, minY, maxX - minX, maxY - minY);
}
}
private int from72To300Dpi(float f) {
return Math.round(f * FROM_72_TO_300_DPI);
}
@Value
private class WordWithBBox {
private final String word;
private final int x;
private final int y;
private final int width;
private final int height;
}
}
Note:
If you are interested in other options, you can also check Poppler.
Convert a PDF to an image:
pdftoppm -r 300 -jpeg input.pdf output
Generate an XHTML file containing bounding box information for each word in the file:
pdftotext -r 300 -bbox input.pdf

Curator ServiceCacheListener is triggered three times when a service is added

I am learning ZooKeeper and trying out the Curator framework for service discovery. However, I am facing a weird issue that I am having difficulty figuring out. When I register an instance via serviceDiscovery, the cacheChanged event of the serviceCache is triggered three times. When I remove an instance, it is triggered only once, which is the expected behavior. Please see the code below:
public class DiscoveryExample {
private static String PATH = "/base";
static ServiceDiscovery<InstanceDetails> serviceDiscovery = null;
public static void main(String[] args) throws Exception {
CuratorFramework client = null;
try {
// this is the ip address of my VM
client = CuratorFrameworkFactory.newClient("192.168.149.129:2181", new ExponentialBackoffRetry(1000, 3));
client.start();
JsonInstanceSerializer<InstanceDetails> serializer = new JsonInstanceSerializer<InstanceDetails>(
InstanceDetails.class);
serviceDiscovery = ServiceDiscoveryBuilder.builder(InstanceDetails.class)
.client(client)
.basePath(PATH)
.serializer(serializer)
.build();
serviceDiscovery.start();
ServiceCache<InstanceDetails> serviceCache = serviceDiscovery.serviceCacheBuilder()
.name("product")
.build();
serviceCache.addListener(new ServiceCacheListener() {
@Override
public void stateChanged(CuratorFramework curator, ConnectionState state) {
// TODO Auto-generated method stub
System.out.println("State Changed to " + state.name());
}
// THIS IS THE PART THAT GETS TRIGGERED MULTIPLE TIMES
@Override
public void cacheChanged() {
System.out.println("Cached Changed ");
List<ServiceInstance<InstanceDetails>> list = serviceCache.getInstances();
Iterator<ServiceInstance<InstanceDetails>> it = list.iterator();
while(it.hasNext()) {
System.out.println(it.next().getAddress());
}
}
});
serviceCache.start();
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
System.out.print("> ");
String line = in.readLine();
} finally {
CloseableUtils.closeQuietly(serviceDiscovery);
CloseableUtils.closeQuietly(client);
}
}
}
AND
public class RegisterApplicationServer {
final static String PATH = "/base";
static ServiceDiscovery<InstanceDetails> serviceDiscovery = null;
public static void main(String[] args) throws Exception {
CuratorFramework client = null;
try {
client = CuratorFrameworkFactory.newClient("192.168.149.129:2181", new ExponentialBackoffRetry(1000, 3));
client.start();
JsonInstanceSerializer<InstanceDetails> serializer = new JsonInstanceSerializer<InstanceDetails>(
InstanceDetails.class);
serviceDiscovery = ServiceDiscoveryBuilder.builder(InstanceDetails.class).client(client).basePath(PATH)
.serializer(serializer).build();
serviceDiscovery.start();
// SOME OTHER CODE THAT TAKES CARE OF USER INPUT...
} finally {
CloseableUtils.closeQuietly(serviceDiscovery);
CloseableUtils.closeQuietly(client);
}
}
private static void addInstance(String[] args, CuratorFramework client, String command,
ServiceDiscovery<InstanceDetails> serviceDiscovery) throws Exception {
// simulate a new instance coming up
// in a real application, this would be a separate process
if (args.length < 2) {
System.err.println("syntax error (expected add <name> <description>): " + command);
return;
}
StringBuilder description = new StringBuilder();
for (int i = 1; i < args.length; ++i) {
if (i > 1) {
description.append(' ');
}
description.append(args[i]);
}
String serviceName = args[0];
ApplicationServer server = new ApplicationServer(client, PATH, serviceName, description.toString());
server.start();
serviceDiscovery.registerService(server.getThisInstance());
System.out.println(serviceName + " added");
}
private static void deleteInstance(String[] args, String command, ServiceDiscovery<InstanceDetails> serviceDiscovery) throws Exception {
// in a real application, this would occur due to normal operation, a
// crash, maintenance, etc.
if (args.length != 2) {
System.err.println("syntax error (expected delete <name>): " + command);
return;
}
final String serviceName = args[0];
Collection<ServiceInstance<InstanceDetails>> set = serviceDiscovery.queryForInstances(serviceName);
Iterator<ServiceInstance<InstanceDetails>> it = set.iterator();
while (it.hasNext()) {
ServiceInstance<InstanceDetails> si = it.next();
if (si.getPayload().getDescription().indexOf(args[1]) != -1) {
serviceDiscovery.unregisterService(si);
}
}
System.out.println("Removed an instance of: " + serviceName);
}
}
I would appreciate it if anyone could point out what I am doing wrong and perhaps share some good materials or examples I can refer to. The official website and the examples on GitHub do not help much.

netty SimpleChannelInboundHandler<String> channelRead0 only occasionally invoked

I know that there are several similar questions that have either been answered or are still outstanding; however, for the life of me...
Later Edit 2016-08-25 10:05 CST - Actually, I asked the wrong question.
The question is the following: given that I have both a netty server (taken from the DiscardServer example) and a netty client (see below), what must I do to force the DiscardServer to immediately send the client a request?
I have added an OutboundHandler to the server and to the client.
After looking at both the DiscardServer and PingPongServer examples, I see that in each case an external event kicks off all the action. In the case of DiscardServer, it initially waits for a telnet connection, then transmits whatever was in the telnet message to the client.
In the case of PingPongServer, the server waits for the client to initiate the action.
What I want is for the server to start transmitting immediately after the client connects. None of the netty examples seem to do this.
If I have missed something, and someone can point it out, much good karma.
My client:
public final class P4Listener {
static final Logger LOG;
static final String HOST;
static final int PORT;
static final Boolean SSL = Boolean.FALSE;
public static Dto DTO;
static {
LOG = LoggerFactory.getLogger(P4Listener.class);
HOST = P4ListenerProperties.getP4ServerAddress();
PORT = Integer.valueOf(P4ListenerProperties.getListenerPort());
DTO = new Dto();
}
public static String getId() { return DTO.getId(); }
public static void main(String[] args) throws Exception {
final SslContext sslCtx;
if (SSL) {
LOG.info("{} creating SslContext", getId());
sslCtx = SslContextBuilder.forClient().trustManager(InsecureTrustManagerFactory.INSTANCE).build();
} else {
sslCtx = null;
}
EventLoopGroup group = new NioEventLoopGroup();
try {
Bootstrap b = new Bootstrap();
b.group(group).channel(NioSocketChannel.class)
.handler(new LoggingHandler(LogLevel.INFO))
.handler(new P4ListenerInitializer(sslCtx));
// Start the connection attempt.
LOG.debug(" {} starting connection attempt...", getId());
Channel ch = b.connect(HOST, PORT).sync().channel();
// ChannelFuture localWriteFuture = ch.writeAndFlush("ready\n");
// localWriteFuture.sync();
} finally {
group.shutdownGracefully();
}
}
}
public class P4ListenerHandler extends SimpleChannelInboundHandler<String> {
static final Logger LOG = LoggerFactory.getLogger(P4ListenerHandler.class);
// "mm" is minute-of-hour; "MM" would print the month a second time
static final DateTimeFormatter DTFormatter = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss.SSS");
static final String EndSOT;
static final String StartSOT;
static final String EOL = "\n";
static final ClassPathXmlApplicationContext AppContext;
static {
EndSOT = P4ListenerProperties.getEndSOT();
StartSOT = P4ListenerProperties.getStartSOT();
AppContext = new ClassPathXmlApplicationContext(new String[] { "applicationContext.xml" });
}
private final RequestValidator rv = new RequestValidator();
private JAXBContext jaxbContext = null;
private Unmarshaller jaxbUnmarshaller = null;
private boolean initialized = false;
private Dto dto;
public P4ListenerHandler() {
dto = new Dto();
}
public Dto getDto() { return dto; }
public String getId() { return getDto().getId(); }
Message convertXmlToMessage(String xml) {
if (xml == null)
throw new IllegalArgumentException("xml message is null!");
try {
jaxbContext = JAXBContext.newInstance(p4.model.xml.request.Message.class, p4.model.xml.request.Header.class,
p4.model.xml.request.Claims.class, p4.model.xml.request.Insurance.class,
p4.model.xml.request.Body.class, p4.model.xml.request.Prescriber.class,
p4.model.xml.request.PriorAuthorization.class,
p4.model.xml.request.PriorAuthorizationSupportingDocumentation.class);
jaxbUnmarshaller = jaxbContext.createUnmarshaller();
StringReader strReader = new StringReader(xml);
Message m = (Message) jaxbUnmarshaller.unmarshal(strReader);
return m;
} catch (JAXBException jaxbe) {
String error = StacktraceUtil.getCustomStackTrace(jaxbe);
LOG.error(error);
throw new P4XMLUnmarshalException("Problems when attempting to unmarshal transmission string: \n" + xml,
jaxbe);
}
}
@Override
public void channelActive(ChannelHandlerContext ctx) {
LOG.debug("{} let server know we are ready", getId());
ctx.writeAndFlush("Ready...\n");
}
/**
* Important - this method will be renamed to
* <code><b>messageReceived(ChannelHandlerContext, I)</b></code> in netty 5.0
*
* @param ctx
* @param msg
*/
@Override
protected void channelRead0(ChannelHandlerContext ctx, String msg) throws Exception {
ChannelFuture lastWriteFuture = null;
LOG.debug("{} -- received message: {}", getId(), msg);
Channel channel = ctx.channel();
Message m = null;
try {
if (msg instanceof String && msg.length() > 0) {
m = convertXmlToMessage(msg);
m.setMessageStr(msg);
dto.setRequestMsg(m);
LOG.info("{}: received TIMESTAMP: {}", dto.getId(), LocalDateTime.now().format(DTFormatter));
LOG.debug("{}: received from server: {}", dto.getId(), msg);
/*
* theoretically we have a complete P4(XML) request
*/
final List<RequestFieldError> errorList = rv.validateMessage(m);
if (!errorList.isEmpty()) {
for (RequestFieldError fe : errorList) {
lastWriteFuture = channel.writeAndFlush(fe.toString().concat(EOL));
}
}
/*
* Create DBHandler with message, messageStr, clientIp to get
* dbResponse
*/
InetSocketAddress socketAddress = (InetSocketAddress) channel.remoteAddress();
InetAddress inetaddress = socketAddress.getAddress();
String clientIp = inetaddress.getHostAddress();
/*
* I know - bad form to ask the ApplicationContext for the
* bean... BUT ...lack of time turns angels into demons
*/
final P4DbRequestHandler dbHandler = (P4DbRequestHandler) AppContext.getBean("dbRequestHandler");
// must set the requestDTO for the dbHandler!
dbHandler.setClientIp(clientIp);
dbHandler.setRequestDTO(dto);
//
// build database request and receive response (string)
String dbResponse = dbHandler.submitDbRequest();
/*
* create ResponseHandler and get back response string
*/
P4ResponseHandler responseHandler = new P4ResponseHandler(dto, dbHandler);
String responseStr = responseHandler.decodeDbServiceResponse(dbResponse);
/*
* write response string to output and repeat exercise
*/
LOG.debug("{} -- response to be written back to server:\n {}", dto.getId(), responseStr);
lastWriteFuture = channel.writeAndFlush(responseStr.concat(EOL));
//
LOG.info("{}: response sent TIMESTAMP: {}", dto.getId(), LocalDateTime.now().format(DTFormatter));
} else {
throw new P4EventException(dto.getId() + " -- Message received is not a String");
}
processWriteFutures(lastWriteFuture);
} catch (Throwable t) {
String tError = StacktraceUtil.getCustomStackTrace(t);
LOG.error(tError);
} finally {
if (lastWriteFuture != null) {
lastWriteFuture.sync();
}
}
}
private void processWriteFutures(ChannelFuture writeFuture) throws InterruptedException {
// Wait until all messages are flushed before closing the channel.
if (writeFuture != null) {
writeFuture.sync();
}
}
@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
cause.printStackTrace();
ctx.close();
}
}
/**
* Creates a newly configured {@link ChannelPipeline} for a new channel.
*/
public class P4ListenerInitializer extends ChannelInitializer<SocketChannel> {
private static final StringDecoder DECODER = new StringDecoder();
private static final StringEncoder ENCODER = new StringEncoder();
private final SslContext sslCtx;
public P4ListenerInitializer(SslContext sslCtx) {
this.sslCtx = sslCtx;
}
@Override
public void initChannel(SocketChannel ch) {
P4ListenerHandler lh = null;
ChannelPipeline pipeline = ch.pipeline();
if (sslCtx != null) {
P4Listener.LOG.info("{} -- constructing SslContext new handler ", P4Listener.getId());
pipeline.addLast(sslCtx.newHandler(ch.alloc(), P4Listener.HOST, P4Listener.PORT));
} else {
P4Listener.LOG.info("{} -- SslContext null; bypassing adding sslCtx.newHandler(ch.alloc(), P4Listener.HOST, P4Listener.PORT) ", P4Listener.getId());
}
// Add the text line codec combination first,
pipeline.addLast(new DelimiterBasedFrameDecoder(8192, Delimiters.lineDelimiter()));
pipeline.addLast(DECODER);
P4Listener.LOG.debug("{} -- added Decoder ", P4Listener.getId());
pipeline.addLast(ENCODER);
P4Listener.LOG.debug("{} -- added Encoder ", P4Listener.getId());
// and then business logic.
pipeline.addLast(lh = new P4ListenerHandler());
P4Listener.LOG.debug("{} -- added P4ListenerHandler: {} ", P4Listener.getId(), lh.getClass().getSimpleName());
}
}
@Sharable
public class P4ListenerOutboundHandler extends ChannelOutboundHandlerAdapter {
static final Logger LOG = LoggerFactory.getLogger(P4ListenerOutboundHandler.class);
private Dto outBoundDTO = new Dto();
public String getId() {return this.outBoundDTO.getId(); }
@Override
public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
try {
ChannelFuture lastWrite = ctx.write(Unpooled.copiedBuffer((String) msg, CharsetUtil.UTF_8));
try {
if (lastWrite != null) {
lastWrite.sync();
promise.setSuccess();
}
} catch (InterruptedException e) {
promise.setFailure(e);
e.printStackTrace();
}
} finally {
ReferenceCountUtil.release(msg);
}
}
}
output from client
Just override channelActive(...) on the handler of the server and trigger a write there.
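A minimal sketch of that suggestion, assuming Netty 4.x and a server pipeline that already has a StringEncoder ahead of this handler (as the client pipeline in the question does). The class name GreetingServerHandler and the FIRST_MESSAGE payload are hypothetical placeholders, not part of the DiscardServer example:

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;

/**
 * Hypothetical server-side handler: it writes to the client as soon as the
 * connection becomes active instead of waiting for the client to speak first.
 */
class GreetingServerHandler extends SimpleChannelInboundHandler<String> {

    // Placeholder payload; newline-terminated so that a line-based frame
    // decoder on the client side can emit it as one complete message.
    static final String FIRST_MESSAGE = "<request/>\n";

    @Override
    public void channelActive(ChannelHandlerContext ctx) throws Exception {
        // Called once per connection, right after the channel is connected.
        ctx.writeAndFlush(FIRST_MESSAGE);
        // Forward the event to any handlers further down the pipeline.
        super.channelActive(ctx);
    }

    @Override
    protected void channelRead0(ChannelHandlerContext ctx, String msg) {
        // Whatever the client answers arrives here.
        System.out.println("client replied: " + msg);
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        cause.printStackTrace();
        ctx.close();
    }
}
```

The handler would be added last in the server's ChannelInitializer, after the frame decoder, StringDecoder and StringEncoder. With that in place the client does not need to send "Ready..." first; its channelRead0 should be invoked as soon as the server's first line arrives.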

Creating custom plugin for chinese tokenization

I'm working towards properly integrating the stanford segmenter within SOLR for chinese tokenization.
This plugin involves loading other jar files and model files. I've got it working in a crude manner by hardcoding the complete path for the files.
I'm looking for methods to create the plugin where the paths need not be hardcoded and also to have the plugin in conformance with the SOLR plugin architecture. Please let me know if there are any recommended sites or tutorials for this.
I've added my code below :
public class ChineseTokenizerFactory extends TokenizerFactory {
/** Creates a new ChineseTokenizerFactory */
public ChineseTokenizerFactory(Map<String,String> args) {
super(args);
assureMatchVersion();
if (!args.isEmpty()) {
throw new IllegalArgumentException("Unknown parameters: " + args);
}
}
@Override
public ChineseTokenizer create(AttributeFactory factory, Reader input) {
Reader processedStringReader = new ProcessedStringReader(input);
return new ChineseTokenizer(luceneMatchVersion, factory, processedStringReader);
}
}
public class ProcessedStringReader extends java.io.Reader {
private static final int BUFFER_SIZE = 1024 * 8;
//private static TextProcess m_textProcess = null;
private static final String basedir = "/home/praveen/PDS_Meetup/solr-4.9.0/custom_plugins/";
static Properties props = null;
static CRFClassifier<CoreLabel> segmenter = null;
private char[] m_inputData = null;
private int m_offset = 0;
private int m_length = 0;
public ProcessedStringReader(Reader input){
char[] arr = new char[BUFFER_SIZE];
StringBuffer buf = new StringBuffer();
int numChars;
if(segmenter == null)
{
segmenter = new CRFClassifier<CoreLabel>(getProperties());
segmenter.loadClassifierNoExceptions(basedir + "ctb.gz", getProperties());
}
try {
while ((numChars = input.read(arr, 0, arr.length)) > 0) {
buf.append(arr, 0, numChars);
}
} catch (IOException e) {
e.printStackTrace();
}
m_inputData = processText(buf.toString()).toCharArray();
m_offset = 0;
m_length = m_inputData.length;
}
@Override
public int read(char[] cbuf, int off, int len) throws IOException {
int charNumber = 0;
// 'off' is an offset into the destination buffer, not into the source data
for(int i = m_offset;i<m_length && charNumber< len; i++){
cbuf[off + charNumber] = m_inputData[i];
m_offset ++;
charNumber++;
}
if(charNumber == 0){
return -1;
}
return charNumber;
}
@Override
public void close() throws IOException {
m_inputData = null;
m_offset = 0;
m_length = 0;
}
public String processText(String inputText)
{
List<String> segmented = segmenter.segmentString(inputText);
String output = "";
if(segmented.size() > 0)
{
output = segmented.get(0);
for(int i=1;i<segmented.size();i++)
{
output = output + " " +segmented.get(i);
}
}
System.out.println(output);
return output;
}
static Properties getProperties()
{
if (props == null) {
props = new Properties();
props.setProperty("sighanCorporaDict", basedir);
// props.setProperty("NormalizationTable", "data/norm.simp.utf8");
// props.setProperty("normTableEncoding", "UTF-8");
// below is needed because CTBSegDocumentIteratorFactory accesses it
props.setProperty("serDictionary",basedir+"dict-chris6.ser.gz");
props.setProperty("inputEncoding", "UTF-8");
props.setProperty("sighanPostProcessing", "true");
}
return props;
}
}
public final class ChineseTokenizer extends CharTokenizer {
public ChineseTokenizer(Version matchVersion, Reader in) {
super(matchVersion, in);
}
public ChineseTokenizer(Version matchVersion, AttributeFactory factory, Reader in) {
super(matchVersion, factory, in);
}
/** Collects only characters which do not satisfy
* {@link Character#isWhitespace(int)}.*/
@Override
protected boolean isTokenChar(int c) {
return !Character.isWhitespace(c);
}
}
You can pass the argument through the Factory's args parameter.
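In Solr, the attributes written on the tokenizer element of a field type are handed to the factory constructor as the Map<String,String> args. The factory removes the keys it understands and rejects whatever is left, which is what the `if (!args.isEmpty())` check in the question's constructor is for. The following is a dependency-free sketch of that consume-then-check pattern, not Lucene code: the class SegmenterConfig and the parameter names baseDir, model and dictionary are assumptions made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical holder for the segmenter file locations, built from the
 * factory's args map instead of a hardcoded base directory.
 */
class SegmenterConfig {
    final String modelPath;      // classifier file, e.g. ctb.gz
    final String dictionaryPath; // serialized dictionary, e.g. dict-chris6.ser.gz

    SegmenterConfig(Map<String, String> args) {
        // Work on a copy so that consumed keys can be removed safely.
        Map<String, String> remaining = new HashMap<>(args);
        String baseDir = require(remaining, "baseDir");
        if (!baseDir.endsWith("/")) {
            baseDir = baseDir + "/";
        }
        this.modelPath = baseDir + get(remaining, "model", "ctb.gz");
        this.dictionaryPath = baseDir + get(remaining, "dictionary", "dict-chris6.ser.gz");
        // Anything still present is a parameter this factory does not know.
        if (!remaining.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + remaining);
        }
    }

    /** Removes and returns a mandatory parameter, failing if it is absent. */
    static String require(Map<String, String> args, String name) {
        String value = args.remove(name);
        if (value == null) {
            throw new IllegalArgumentException("Missing required parameter: " + name);
        }
        return value;
    }

    /** Removes and returns an optional parameter, or the default if absent. */
    static String get(Map<String, String> args, String name, String defaultValue) {
        String value = args.remove(name);
        return value == null ? defaultValue : value;
    }
}
```

In the real plugin the same idea would live in the ChineseTokenizerFactory constructor: read the paths from args before the isEmpty check, keep them in fields, and pass them to ProcessedStringReader in create(...) in place of the static basedir constant.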