BeamSql tramsformation issues - apache-beam

I have my below code in which I'm reading a csv file and defining its schema, after that I'm converting it into BeamRecords. and then applying BeamSql to implement PTransforms.
Code:
class Clo {
public String Outlet;
public String CatLib;
public String ProdKey;
public Date Week;
public String SalesComponent;
public String DuetoValue;
public String PrimaryCausalKey;
public Float CausalValue;
public Integer ModelIteration;
public Integer Published;
}
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<java.lang.String> lines= p.apply(TextIO.read().from("gs://gcpbucket/input/WeeklyDueto.csv"));
PCollection<Clorox> pojos = lines.apply(ParDo.of(new ExtractObjectsFn()));
List<java.lang.String> fieldNames = Arrays.asList("Outlet", "CatLib", "ProdKey", "Week", "SalesComponent", "DuetoValue", "PrimaryCausalKey", "CausalValue", "ModelIteration", "Published");
List<java.lang.Integer> fieldTypes = Arrays.asList(Types.VARCHAR, Types.VARCHAR, Types.VARCHAR, Types.DATE, Types.VARCHAR,Types.VARCHAR,Types.VARCHAR, Types.FLOAT, Types.INTEGER, Types.INTEGER);
BeamRecordSqlType appType = BeamRecordSqlType.create(fieldNames, fieldTypes);
PCollection<BeamRecord> apps = pojos.apply(
ParDo.of(new DoFn<Clo, BeamRecord>() {
#ProcessElement
public void processElement(ProcessContext c) {
BeamRecord br = new BeamRecord(
appType,
c.element().Outlet,
c.element().CatLib,
c.element().ProdKey,
c.element().Week,
c.element().SalesComponent,
c.element().DuetoValue,
c.element().PrimaryCausalKey,
c.element().CausalValue,
c.element().ModelIteration,
c.element().Published
);
c.output(br);
}
})).setCoder(appType, getRecordCoder());
PCollection<BeamRecord> out = apps.apply(BeamSql.query("select Outlet from PCOLLECTION"));
out.apply("WriteMyFile", TextIO.write().to("gs://gcpbucket/output/sbc.txt"));
}
My questions are:
what shall I implement in ExtractObjectsFn() so that the records gets converted into BeamRecords ?
How to write the final output to a csv file ?
I have implemented ExtractObjectsFn() as :
public void processElement(ProcessContext c) {
ArrayList<Clo> clx = new ArrayList<Clo>();
java.lang.String[] strArr = c.element().split("\n");
for(int i = 0; i < strArr.length; i++) {
Clo clo = new Clo();
java.lang.String[] temp = strArr[i].split(",");
clo.setCatLib(temp[1]);
clo.setCausalValue(temp[7]);
clo.setDuetoValue(temp[5]);
clo.setModelIteration(temp[8]);
clo.setOutlet(temp[0]);
clo.setPrimaryCausalKey(temp[6]);
clo.setProdKey(temp[2]);
clo.setPublished(temp[9]);
clo.setSalesComponent(temp[4]);
clo.setWeek(temp[3]);
c.output(clo);
clx.add(clo);
}
}
Let me know if its done correctly because while executing the code and getting error as No Coder has been manually specified; you may do so using .setCoder().

1> what shall I implement in ExtractObjectsFn() so that the records gets converted into BeamRecords ?
In the processElement() method of ExtractObjectsFn, you simply need to convert a CSV line from the input (String) to a Clorox type. Split the string by the comma delimiter (,), which returns an array. Iterate over the array to retrieve the CSV values and construct the Clorox object.
2> How to write the final output to a csv file ?
Similar process as above. You simply need to apply a new transform that will convert a BeamRecord to a CSV line (String). Members of the BeamRecord can be concatenated into a string (CSV line). After this transform is applied, the TextIO.Write transform can be applied to write the CSV line to file.

Related

How to optimize SQL query in Anylogic

I am generating Agents with parameter values coming from SQL table in Anylogic. when agent is generated at source I am doing a v look up in table and extracting corresponding values from table. For now it is working perfectly but it is slowing down the performance.
Structure of Table looks like this
I am querying the data from this table with below code
double value_1 = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.avg_value)).get(0);
double value_min = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.min_value)).get(0);
double value_max = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.max_value)).get(0);
// Fetch the cluster number from account table
int cluster_num = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.cluster)).get(0);
int act_no = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.actno)).get(0);
String pay_term = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.pay_term)).get(0);
String pay_term_prob = (selectFrom(account_details)
.where(account_details.act_code.eq(z))
.list(account_details.pay_term_prob)).get(0);
But this is very slow and wants to improve the performance. someone mentioned that we can create a Java class and then add the table into collection . Is there any example where I can refer. I am finding it difficult to put entire code.
I have created a class using below code:
public class Customer {
private String act_code;
private int actno;
private double avg_value;
private String pay_term;
private String pay_term_prob;
private int cluster;
private double min_value;
private double max_value;
public String getact_code() {
return act_code;
}
public void setact_code(String act_code) {
this.act_code = act_code;
}
public int getactno() {
return actno;
}
public void setactno(int actno) {
this.actno = actno;
}
public double getavg_value() {
return avg_value;
}
public void setavg_value(double avg_value) {
this.avg_value = avg_value;
}
public String getpay_term() {
return pay_term;
}
public void setpay_term(String pay_term) {
this.pay_term = pay_term;
}
public String getpay_term_prob() {
return pay_term_prob;
}
public void setpay_term_prob(String pay_term_prob) {
this.pay_term_prob = pay_term_prob;
}
public int cluster() {
return cluster;
}
public void setcluster(int cluster) {
this.cluster = cluster;
}
public double getmin_value() {
return min_value;
}
public void setmin_value(double min_value) {
this.min_value = min_value;
}
public double getmax_value() {
return max_value;
}
public void setmax_value(double max_value) {
this.max_value = max_value;
}
}
Created collection object like this:
Pls provide an reference to add this database table into collection as a next step. then I want to query the collection based on the condition
You are on the right track here!
Every time you access the database to read data there is a computational overhead. So the best option is to access the database only once, at the start of the model. Create all the objects you need, store other data you will need later into Java classes, and then use the Java classes.
My suggestion is to create a Java class for each row in your table, like you have done. And then create a map object - like you have done, but with the key as String and the value as this new object.
Then on model start you can populate this map as follows:
List<Tuple> rows = selectFrom(customer).list();
for (Tuple row : rows) {
Customer customerData = new Customer(
row.get( customer.act_code ),
row.get( customer.actno ),
row.get( customer.avg_value )
);
mapOfCustomerData.put(customerData.act_code, customerData);
}
Where mapOfCustomerData is a linkedHashMap and customer is the name of the table
See the model created in this blog post for more details and an example on using a scenario object to store all the data from the Database in a separate object
Note: The code above is just an example - read this blog post for more details on using the AnyLogic INternal Database
Before using Java classes, try this first: click the "index" tickbox for all columns that you query with a WHERE clause.

Writable Classes in mapreduce

How can i use the values from hashset (the docid and offset) to the reduce writable so as to connect map writable with reduce writable?
The mapper (LineIndexMapper) works fine but in the reducer (LineIndexReducer) i get the error that it can't get string as argument when i type this:
context.write(key, new IndexRecordWritable("some string");
although i have the public String toString() in the ReduceWritable too.
I believe the hashset in reducer's writable (IndexRecordWritable.java) maybe isn't taking the values correctly?
I have the below code.
IndexMapRecordWritable.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
public class IndexMapRecordWritable implements Writable {
private LongWritable offset;
private Text docid;
public LongWritable getOffsetWritable() {
return offset;
}
public Text getDocidWritable() {
return docid;
}
public long getOffset() {
return offset.get();
}
public String getDocid() {
return docid.toString();
}
public IndexMapRecordWritable() {
this.offset = new LongWritable();
this.docid = new Text();
}
public IndexMapRecordWritable(long offset, String docid) {
this.offset = new LongWritable(offset);
this.docid = new Text(docid);
}
public IndexMapRecordWritable(IndexMapRecordWritable indexMapRecordWritable) {
this.offset = indexMapRecordWritable.getOffsetWritable();
this.docid = indexMapRecordWritable.getDocidWritable();
}
#Override
public String toString() {
StringBuilder output = new StringBuilder()
output.append(docid);
output.append(offset);
return output.toString();
}
#Override
public void write(DataOutput out) throws IOException {
}
#Override
public void readFields(DataInput in) throws IOException {
}
}
IndexRecordWritable.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashSet;
import org.apache.hadoop.io.Writable;
public class IndexRecordWritable implements Writable {
// Save each index record from maps
private HashSet<IndexMapRecordWritable> tokens = new HashSet<IndexMapRecordWritable>();
public IndexRecordWritable() {
}
public IndexRecordWritable(
Iterable<IndexMapRecordWritable> indexMapRecordWritables) {
}
#Override
public String toString() {
StringBuilder output = new StringBuilder();
return output.toString();
}
#Override
public void write(DataOutput out) throws IOException {
}
#Override
public void readFields(DataInput in) throws IOException {
}
}
Alright, here is my answer based on a few assumptions. The final output is a text file containing the key and the file names separated by a comma based on the information in the reducer class's comments on the pre-condition and post-condition.
In this case, you really don't need IndexRecordWritable class. You can simply write to your context using
context.write(key, new Text(valueBuilder.substring(0, valueBuilder.length() - 1)));
with the class declaration line as
public class LineIndexReducer extends Reducer<Text, IndexMapRecordWritable, Text, Text>
Don't forget to set the correct output class in the driver.
That must serve the purpose according to the post-condition in your reducer class. But, if you really want to write a Text-IndexRecordWritable pair to your context, there are two ways approach it -
with string as an argument (based on your attempt passing a string when you IndexRecordWritable class constructor is not designed to accept strings) and
with HashSet as an argument (based on the HashSet initialised in IndexRecordWritable class).
Since your constructor of IndexRecordWritable class is not designed to accept String as an input, you cannot pass a string. Hence the error you are getting that you can't use string as an argument. Ps: if you want your constructor to accept Strings, you must have another constructor in your IndexRecordWritable class as below:
// Save each index record from maps
private HashSet<IndexMapRecordWritable> tokens = new HashSet<IndexMapRecordWritable>();
// to save the string
private String value;
public IndexRecordWritable() {
}
public IndexRecordWritable(
HashSet<IndexMapRecordWritable> indexMapRecordWritables) {
/***/
}
// to accpet string
public IndexRecordWritable (String value) {
this.value = value;
}
but that won't be valid if you want to use the HashSet. So, approach #1 can't be used. You can't pass a string.
That leaves us with approach #2. Passing a HashSet as an argument since you want to make use of the HashSet. In this case, you must create a HashSet in your reducer before passing it as an argument to IndexRecordWritable in context.write.
To do this, your reducer must look like this.
#Override
protected void reduce(Text key, Iterable<IndexMapRecordWritable> values, Context context) throws IOException, InterruptedException {
//StringBuilder valueBuilder = new StringBuilder();
HashSet<IndexMapRecordWritable> set = new HashSet<>();
for (IndexMapRecordWritable val : values) {
set.add(val);
//valueBuilder.append(val);
//valueBuilder.append(",");
}
//write the key and the adjusted value (removing the last comma)
//context.write(key, new IndexRecordWritable(valueBuilder.substring(0, valueBuilder.length() - 1)));
context.write(key, new IndexRecordWritable(set));
//valueBuilder.setLength(0);
}
and your IndexRecordWritable.java must have this.
// Save each index record from maps
private HashSet<IndexMapRecordWritable> tokens = new HashSet<IndexMapRecordWritable>();
// to save the string
//private String value;
public IndexRecordWritable() {
}
public IndexRecordWritable(
HashSet<IndexMapRecordWritable> indexMapRecordWritables) {
/***/
tokens.addAll(indexMapRecordWritables);
}
Remember, this is not the requirement according to the description of your reducer where it says.
POST-CONDITION: emit the output a single key-value where all the file names are separated by a comma ",". <"marcello", "a.txt#3345,b.txt#344,c.txt#785">
If you still choose to emit (Text, IndexRecordWritable), remember to process the HashSet in IndexRecordWritable to get it in the desired format.

ITextSharp / PDFBox text extract fails for certain pdfs

The code below extracts the text from a PDF correctly via ITextSharp in many instances.
using (var pdfReader = new PdfReader(filename))
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
var currentText = PdfTextExtractor.GetTextFromPage(
pdfReader,
1,
strategy);
currentText =
Encoding.UTF8.GetString(Encoding.Convert(
Encoding.Default,
Encoding.UTF8,
Encoding.Default.GetBytes(currentText)));
Console.WriteLine(currentText);
}
However, in the case of this PDF I get the following instead of text: "\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\a\u0001\u0002\u0003\u0004\u0005\u0006\u0003"
I have tried different encodings and even PDFBox but still failed to decode the PDF correctly. Any ideas on how to solve the issue?
Extracting the text nonetheless
#Bruno's answer is the answer one should give here, the PDF clearly does not provide the information required to allow proper text extraction according to section 9.10 Extraction of Text Content of the PDF specification ISO 32000-1...
But there actually is a slightly evil way to extract the text from the PDF at hand nonetheless!
Wrapping one's text extraction strategy in an instance of the following class, the garbled text is replaced by the correct text:
public class RemappingExtractionFilter : ITextExtractionStrategy
{
ITextExtractionStrategy strategy;
System.Reflection.FieldInfo stringField;
public RemappingExtractionFilter(ITextExtractionStrategy strategy)
{
this.strategy = strategy;
this.stringField = typeof(TextRenderInfo).GetField("text", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
}
public void RenderText(TextRenderInfo renderInfo)
{
DocumentFont font =renderInfo.GetFont();
PdfDictionary dict = font.FontDictionary;
PdfDictionary encoding = dict.GetAsDict(PdfName.ENCODING);
PdfArray diffs = encoding.GetAsArray(PdfName.DIFFERENCES);
;
StringBuilder builder = new StringBuilder();
foreach (byte b in renderInfo.PdfString.GetBytes())
{
PdfName name = diffs.GetAsName((char)b);
String s = name.ToString().Substring(2);
int i = Convert.ToInt32(s, 16);
builder.Append((char)i);
}
stringField.SetValue(renderInfo, builder.ToString());
strategy.RenderText(renderInfo);
}
public void BeginTextBlock()
{
strategy.BeginTextBlock();
}
public void EndTextBlock()
{
strategy.EndTextBlock();
}
public void RenderImage(ImageRenderInfo renderInfo)
{
strategy.RenderImage(renderInfo);
}
public String GetResultantText()
{
return strategy.GetResultantText();
}
}
It can be used like this:
ITextExtractionStrategy strategy = new RemappingExtractionFilter(new LocationTextExtractionStrategy());
string text = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
Beware, I had to use System.Reflection to access private members. Some environments may forbid this.
The same in Java
I initially coded this in Java for iText because that's my primary development environment. Thus, here the initial Java version:
public class RemappingExtractionFilter implements TextExtractionStrategy
{
public RemappingExtractionFilter(TextExtractionStrategy strategy) throws NoSuchFieldException, SecurityException
{
this.strategy = strategy;
this.stringField = TextRenderInfo.class.getDeclaredField("text");
this.stringField.setAccessible(true);
}
#Override
public void renderText(TextRenderInfo renderInfo)
{
DocumentFont font =renderInfo.getFont();
PdfDictionary dict = font.getFontDictionary();
PdfDictionary encoding = dict.getAsDict(PdfName.ENCODING);
PdfArray diffs = encoding.getAsArray(PdfName.DIFFERENCES);
;
StringBuilder builder = new StringBuilder();
for (byte b : renderInfo.getPdfString().getBytes())
{
PdfName name = diffs.getAsName((char)b);
String s = name.toString().substring(2);
int i = Integer.parseUnsignedInt(s, 16);
builder.append((char)i);
}
try
{
stringField.set(renderInfo, builder.toString());
}
catch (IllegalArgumentException | IllegalAccessException e)
{
e.printStackTrace();
}
strategy.renderText(renderInfo);
}
#Override
public void beginTextBlock()
{
strategy.beginTextBlock();
}
#Override
public void endTextBlock()
{
strategy.endTextBlock();
}
#Override
public void renderImage(ImageRenderInfo renderInfo)
{
strategy.renderImage(renderInfo);
}
#Override
public String getResultantText()
{
return strategy.getResultantText();
}
final TextExtractionStrategy strategy;
final Field stringField;
}
(RemappingExtractionFilter.java)
It can be used like this:
String extractRemapped(PdfReader reader, int pageNo) throws IOException, NoSuchFieldException, SecurityException
{
TextExtractionStrategy strategy = new RemappingExtractionFilter(new LocationTextExtractionStrategy());
return PdfTextExtractor.getTextFromPage(reader, pageNo, strategy);
}
(from RemappedExtraction.java)
Why does this work?
First of all, this is not the solution to all extraction problems, merely for extracting text from PDFs like the OP has presented.
This method works because the names the PDF uses in its fonts' encoding differences arrays can be interpreted even though they are not standard. These names are built as /Gxx where xx is the hexadecimal representation of the ASCII code of the character this name represents.
A good test to find out whether or not a PDF allows text to be extracted correctly, is by opening it in Adobe Reader and to copy and paste the text.
For instance: I copied the word ABSTRACT and I pasted it in Notepad++:
Do you see the word ABSTRACT in Notepad++? No, you see %&SOH'"%GS. The A is represented as %, the B is represented as &, and so on.
This is a clear indication that the content of the PDF isn't accessible: there is no mapping between the encoding that was use (% = A, & = B,...) and the actual characters that humans can understand.
In short: the PDF doesn't allow you to extract text, not with iText, not with iTextSharp, not with PDFBox. You'll have to find an OCR tool instead and OCR the complete document.
For more info, you may want to watch the following videos:
https://www.youtube.com/watch?v=4ur9WRWVrbM (~5 minutes)
https://www.youtube.com/watch?v=wxGEEv7ibHE (~15 minutes)
https://www.youtube.com/watch?v=g-QcU9B4qMc (~45 minutes)

Creating Objects from input file

I'm trying to somehow create objects by reading from a text file. However, I seem to be doing something wrong, and I can't put my finger on it.
Main:
import java.util.*;
import java.io.*;
public class Project2 {
public static void main(String[] args) throws IOException {
Scanner sc = new Scanner((new File("Project2DataFile.txt")));
sc.useDelimiter(",");
ArrayList<BaseballPlayer> myplayer = new ArrayList<BaseballPlayer>();
while (sc.hasNext()) {
String str = sc.nextLine();
for(int cnt = 0; cnt < 4; cnt++){
BaseballPlayer player = new BaseballPlayer();
if( player.batavg < 0 || player.batavg > 100 ){throw new IllegalArgumentException ("Illegal Batting Avg");}
player.pnumber = sc.nextInt();
player.lastname = sc.next();
player.firstname = sc.next();
player.batavg = sc.nextFloat();
}
continue;
}
System.out.println(myplayer);
}
Class:
public class BaseballPlayer {
public static int pnumber; // player number
public static String lastname; // player's last name
public static String firstname; // player's first name
public static float batavg; // player's batting average
}
And I might as well put the text file in there too:
48,deGrom,Jacob,.120
58,Mejia,Jenry,.140
49,Niese,Jon,.091
7,d'Arnaud,Travis,.324
21,Duda,Lucas,.237
4,Flores,Wilmner,.268
11,Tejada,Ruben,.345
5,Wright,David,.289
3,Granderson,Curtis,.327
12,Lagares,Juan,.298
It seems you are using static incorrectly. static is for members shared by all instances of a class, so you should not use it in BaseballPlayer. To fix the problem, you need to remove all statics in BaseballPlayer.

tostring method giving wrong output

Hi I'm writing a test program to reverse a string. When I convert the character array to a string using the toString() method, I get the wrong output. When I try to print the array manually using a for loop without converting it to a string the answer is correct. The code I've written is shown below:
import java.util.*;
public class stringManip {
/**
* #param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
String str = "This is a string";
System.out.println("String=" +str);
//reverse(s);
char[] c = str.toCharArray();
int left = 0;
int right = str.length() - 1;
for (int i = 0; i < (str.length())/2; i++)
{
char temp = c[left];
c[left++] = c[right];
c[right--] = temp;
}
System.out.print("Reverse="+c.toString());
}
}
I should get the reverse of the string I entered, instead the output am getting is:
String=This is a string
Reverse=[C#45a1472d
Is there something am doing wrong when using the toString() method? Any help is appreciated. Thank you.
Arrays don't override the toString() method. What you're seeing is thus the output of the default Object.toString() implementation, which contains the type of the object ([C means array of chars) followed by its hashCode.
To construct a String from a char array, use
new String(c)