Apache Beam: executing SQL on Kafka data fails with "Cannot call getSchema when there is no schema"

I am sending data from multiple tables to Kafka, and Beam should execute SQL after reading the data, but now I get the following error:
Exception in thread "main" java.lang.IllegalStateException: Cannot call getSchema when there is no schema
    at org.apache.beam.sdk.values.PCollection.getSchema(PCollection.java:328)
    at org.apache.beam.sdk.extensions.sql.impl.schema.BeamPCollectionTable.<init>(BeamPCollectionTable.java:34)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.toTableMap(SqlTransform.java:141)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:102)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:82)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:539)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:473)
    at org.apache.beam.sdk.values.PCollectionTuple.apply(PCollectionTuple.java:248)
    at BeamSqlTest.main(BeamSqlTest.java:65)
Is there a feasible solution? Please help me!

I think you need to set a schema for your input PCollection<Row> (the collection named apply) with setRowSchema() or setSchema(). The problem is that your schema is dynamic and defined at runtime (I'm not sure Beam supports this). Could you use a static schema and define it before you start processing the input data? (A minimal sketch follows the code below.)
Also, since your input source is unbounded, you need to define windows before applying SqlTransform.

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.beam.repackaged.sql.com.google.common.collect.ImmutableMap;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.util.ArrayList;
import java.util.List;
class BeamSqlTest {
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).as(PipelineOptions.class);
options.setRunner(DirectRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<KafkaRecord<String, String>> lines = p.apply(KafkaIO.<String, String>read()
.withBootstrapServers("192.168.8.16")
.withTopic("tmp_table.reuslt")
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializer(StringDeserializer.class)
.withConsumerConfigUpdates(ImmutableMap.of("group.id", "beam_app"))
.withReadCommitted()
.commitOffsetsInFinalize());
PCollection<Row> apply = lines.apply(ParDo.of(new DoFn<KafkaRecord<String, String>,Row>(){
@ProcessElement
public void processElement(ProcessContext c) {
String jsonData = c.element().getKV().getValue(); //data: {id:0001#int,name:test01#string,age:29#int,score:99#int}
if(!"data_increment_heartbeat".equals(jsonData)){ //Filter out heartbeat information
JSONObject jsonObject = JSON.parseObject(jsonData);
Schema.Builder builder = Schema.builder();
//A data pipeline may have data from multiple tables so the Schema is obtained dynamically
//This assumes data from a single table
List<Object> list = new ArrayList<Object>();
for(String s : jsonObject.keySet()) {
String[] dataType = jsonObject.get(s).toString().split("#"); //data#field type
if(dataType[1].equals("int")){
builder.addInt32Field(s);
}else if(dataType[1].equals("string")){
builder.addStringField(s);
}
list.add(dataType[0]);
}
Schema schema = builder.build();
Row row = Row.withSchema(schema).addValues(list).build();
System.out.println(row);
c.output(row);
}
}
}));
PCollection<Row> result = PCollectionTuple.of(new TupleTag<>("USER_TABLE"), apply)
.apply(SqlTransform.query("SELECT COUNT(id) total_count, SUM(score) total_score FROM USER_TABLE GROUP BY id"));
result.apply( "log_result", MapElements.via( new SimpleFunction<Row, Row>() {
@Override
public Row apply(Row input) {
System.out.println("USER_TABLE result: " + input.getValues());
return input;
}
}));
}
}
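For reference, here is a minimal sketch of the suggested approach, assuming a fixed, known schema for a single table (the field names are taken from the example payload and are assumptions), a setRowSchema() call on the parsed rows, and a fixed window applied before the SqlTransform. It also assumes the extra imports for Window, FixedWindows and org.joda.time.Duration:

// Sketch only: the schema is static and known up front, so SqlTransform can resolve it.
Schema userSchema = Schema.builder()
        .addInt32Field("id")
        .addStringField("name")
        .addInt32Field("age")
        .addInt32Field("score")
        .build();

PCollection<Row> rows = lines
        .apply(ParDo.of(new DoFn<KafkaRecord<String, String>, Row>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String jsonData = c.element().getKV().getValue();
                if (!"data_increment_heartbeat".equals(jsonData)) { // skip heartbeats
                    JSONObject json = JSON.parseObject(jsonData);
                    // Values arrive as "data#type"; keep only the data part.
                    c.output(Row.withSchema(userSchema).addValues(
                            Integer.valueOf(json.getString("id").split("#")[0]),
                            json.getString("name").split("#")[0],
                            Integer.valueOf(json.getString("age").split("#")[0]),
                            Integer.valueOf(json.getString("score").split("#")[0])).build());
                }
            }
        }))
        .setRowSchema(userSchema) // attach the schema to the PCollection<Row>
        // Unbounded input: window before applying SqlTransform.
        .apply(Window.<Row>into(FixedWindows.of(Duration.standardMinutes(1))));

PCollection<Row> result = PCollectionTuple.of(new TupleTag<>("USER_TABLE"), rows)
        .apply(SqlTransform.query(
                "SELECT COUNT(id) total_count, SUM(score) total_score FROM USER_TABLE GROUP BY id"));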

Related

Is it possible to deserialize an Avro message (consumed from Kafka) without giving a reader schema in ConfluentRegistryAvroDeserializationSchema?

I am using the Kafka connector in Apache Flink for access to streams served by Confluent Kafka.
Apart from the schema registry URL, ConfluentRegistryAvroDeserializationSchema.forGeneric(...) also expects a 'reader' schema.
Instead of providing a reader schema I want to use the writer's schema (looked up in the registry) for reading the message too, because the consumer will not have the latest schema.
FlinkKafkaConsumer010<GenericRecord> myConsumer =
new FlinkKafkaConsumer010<>("topic-name", ConfluentRegistryAvroDeserializationSchema.forGeneric(<reader schema goes here>, "http://host:port"), properties);
myConsumer.setStartFromLatest();
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/connectors/kafka.html
"Using these deserialization schema record will be read with the schema that was retrieved from Schema Registry and transformed to a statically provided"
Since I do not want to keep the schema definition on the consumer side, how do I deserialize the Avro message from Kafka using the writer's schema?
Appreciate your help!
I don't think it is possible to use ConfluentRegistryAvroDeserializationSchema.forGeneric directly. It is intended to be used with a reader schema, and there are preconditions checking for this.
You have to implement your own. Two important things:
Set specific.avro.reader to false (otherwise you'll get specific records)
The KafkaAvroDeserializer has to be lazily initialized (because it isn't serializable itself, as it holds a reference to the schema registry client)
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;
public class KafkaGenericAvroDeserializationSchema
implements KeyedDeserializationSchema<GenericRecord> {
private final String registryUrl;
private transient KafkaAvroDeserializer inner;
public KafkaGenericAvroDeserializationSchema(String registryUrl) {
this.registryUrl = registryUrl;
}
@Override
public GenericRecord deserialize(
byte[] messageKey, byte[] message, String topic, int partition, long offset) {
checkInitialized();
return (GenericRecord) inner.deserialize(topic, message);
}
@Override
public boolean isEndOfStream(GenericRecord nextElement) {
return false;
}
@Override
public TypeInformation<GenericRecord> getProducedType() {
return TypeExtractor.getForClass(GenericRecord.class);
}
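// Lazy initialization: the KafkaAvroDeserializer holds a schema registry client and is not
// serializable, so it is created here on first use rather than in the constructor.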
private void checkInitialized() {
if (inner == null) {
Map<String, Object> props = new HashMap<>();
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, registryUrl);
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, false);
SchemaRegistryClient client =
new CachedSchemaRegistryClient(
registryUrl, AbstractKafkaAvroSerDeConfig.MAX_SCHEMAS_PER_SUBJECT_DEFAULT);
inner = new KafkaAvroDeserializer(client, props);
}
}
}
env.addSource(
new FlinkKafkaConsumer<>(
topic,
new KafkaGenericAvroDeserializationSchema(schemaRegistryUrl),
kafkaProperties));

How do I get the list of all assets present in the AEM DAM?

We have the assets API to fetch the list, but for that we need to provide AEM user credentials.
Is there any interface to fetch the list of all assets from the DAM, the same way we get all pages using PageManager?
For this, you can use the JCR QueryManager API together with an appropriate query.
Below is a sample servlet which lists all the assets below the path /content/dam/we-retail/en/features.
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;
import javax.jcr.query.Row;
import javax.jcr.query.RowIterator;
import javax.servlet.Servlet;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.servlets.HttpConstants;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.osgi.service.component.annotations.Component;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@Component(immediate = true, service = Servlet.class, property = { "sling.servlet.methods=" + HttpConstants.METHOD_GET,
"sling.servlet.paths=" + "/bin/learning/assetlister" })
public class AssetListerServlet extends SlingSafeMethodsServlet {
// Generated serialVersionUID
private static final long serialVersionUID = 7762806638577908286L;
// Default logger
private final Logger log = LoggerFactory.getLogger(this.getClass());
// Instance of ResourceResolver
private ResourceResolver resourceResolver;
// JCR Session instance
private Session session;
@Override
protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) {
try {
// Getting the ResourceResolver from the current request
resourceResolver = request.getResourceResolver();
// Getting the session instance by adapting ResourceResolver
session = resourceResolver.adaptTo(Session.class);
QueryManager queryManager = session.getWorkspace().getQueryManager();
String queryString = "SELECT * FROM [dam:Asset] AS asset WHERE ISDESCENDANTNODE(asset ,'/content/dam/we-retail/en/features')";
Query query = queryManager.createQuery(queryString, "JCR-SQL2");
QueryResult queryResult = query.execute();
response.getWriter().println("--------------Result-------------");
RowIterator rowIterator = queryResult.getRows();
while (rowIterator.hasNext()) {
Row row = rowIterator.nextRow();
response.getWriter().println(row.toString());
}
} catch (Exception e) {
log.error(e.getMessage(), e);
} finally {
if (resourceResolver != null) {
resourceResolver.close();
}
}
}
}
Similarly, depending on your specific requirement, you can use this logic in a component, a service, etc. I hope this helps.
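If you would rather stay at the Sling API level, roughly the same listing can be done with ResourceResolver#findResources and adapting each hit to an Asset. This is only a sketch; it assumes a ResourceResolver obtained from the current request or a service user, and the same we-retail path as above:

// Sketch: list all dam:Asset paths below a folder via the Sling resource API.
// Assumes 'resolver' comes from the current request or a service user.
private java.util.List<String> listAssetPaths(org.apache.sling.api.resource.ResourceResolver resolver) {
    String query = "SELECT * FROM [dam:Asset] AS asset "
            + "WHERE ISDESCENDANTNODE(asset, '/content/dam/we-retail/en/features')";
    java.util.List<String> paths = new java.util.ArrayList<>();
    java.util.Iterator<org.apache.sling.api.resource.Resource> it =
            resolver.findResources(query, "JCR-SQL2");
    while (it.hasNext()) {
        com.day.cq.dam.api.Asset asset = it.next().adaptTo(com.day.cq.dam.api.Asset.class);
        if (asset != null) {
            paths.add(asset.getPath()); // or collect any other asset metadata you need
        }
    }
    return paths;
}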
If you are looking for a python solution, here is a python tool that I created to connect to AEM DAM and perform most DAM operations, including listing of all DAM assets or assets under a given path

Error setting database config property for IDatabaseConnection (HSQLDB)

I've included fully testable code below, which generates the following error when supplied with a dataset xml containing empty fields. A sample dataset.xml is also below.
java.lang.IllegalArgumentException: table.column=places.CITY value is
empty but must contain a value (to disable this feature check, set
DatabaseConfig.FEATURE_ALLOW_EMPTY_FIELDS to true)
The thread here is similar but different, since it uses multiple dbTester.getConnection() calls whereas my code only uses one, yet it has the same error. The main problem relates to this line: databaseConfig.setProperty(DatabaseConfig.FEATURE_ALLOW_EMPTY_FIELDS, Boolean.TRUE);
It seems to be ignored entirely. I've tried putting the init code inside the @Test method but the error remains.
dataset.xml
<?xml version='1.0' encoding='UTF-8'?>
<dataset>
<places address="123 Up Street" city="Chicago" id="001"/>
<places address="456 Down Street" city="" id="002"/>
<places address="789 Right Street" city="Boston" id="003"/>
</dataset>
Code:
import org.dbunit.IDatabaseTester;
import org.dbunit.JdbcDatabaseTester;
import org.dbunit.database.DatabaseConfig;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
import org.dbunit.operation.DatabaseOperation;
import org.junit.Before;
import org.junit.Test;
import java.io.File;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
public class DBConnectionIT {
IDatabaseTester databaseTester = null;
IDatabaseConnection iConn = null;
Connection connection = null;
@Before
public void init() throws Exception {
databaseTester = new JdbcDatabaseTester(org.hsqldb.jdbcDriver.class.getName(), "jdbc:hsqldb:mem:testdb;sql.syntax_pgs=true", "sa", "");
iConn = databaseTester.getConnection();
DatabaseConfig databaseConfig = iConn.getConfig();
databaseConfig.setProperty(DatabaseConfig.FEATURE_ALLOW_EMPTY_FIELDS, Boolean.TRUE);
connection = iConn.getConnection();
createTable(connection);
IDataSet dataSet = new FlatXmlDataSetBuilder().build(new File("dataset.xml"));
databaseTester.setDataSet(dataSet);
databaseTester.setSetUpOperation(DatabaseOperation.CLEAN_INSERT);
databaseTester.setTearDownOperation(DatabaseOperation.DELETE_ALL);
databaseTester.onSetup();
}
@Test
public void testDBUnit() {
try {
PreparedStatement pst = connection.prepareStatement("select * from places");
ResultSet rs = pst.executeQuery();
while (rs.next()) {
System.out.println(rs.getString(1));
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
private void createTable(Connection conn) throws Exception {
PreparedStatement pp = conn.prepareStatement(
"CREATE TABLE PLACES" +
"(address VARCHAR(255), " +
"city TEXT, " +
"id VARCHAR(255) NOT NULL primary key)");
pp.executeUpdate();
pp.close();
}
}
EDIT (based on César Rodríguez's answer):
I've now refactored out this method in the parent class:
protected void setUpDatabaseConfig(DatabaseConfig databaseConfig) {
databaseConfig.setProperty(DatabaseConfig.FEATURE_ALLOW_EMPTY_FIELDS, Boolean.TRUE);
}
and created a sub-class which @Overrides this method, but it's saying this sub-class is not being used. How do I reference this class (DBConnectionOverride) from the parent class to solve my problem?
class DBConnectionOverride extends DBConnectionIT {
@Override
protected void setUpDatabaseConfig(DatabaseConfig databaseConfig) {
databaseConfig.setProperty(DatabaseConfig.FEATURE_ALLOW_EMPTY_FIELDS, true);
}
}
I've stumbled upon the correct answer, at least the one which solves my problem. It was related to this line all along: databaseTester.onSetup(), which can simply be replaced with DatabaseOperation.CLEAN_INSERT.execute(iConn, dataSet);. Feel free to comment on why this seemed to fix the error.
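For what it's worth, the likely explanation is that onSetup() asks the tester for a fresh connection internally, and that new connection does not carry the FEATURE_ALLOW_EMPTY_FIELDS property you set on iConn, whereas executing the operation directly reuses the configured connection. A sketch of the adjusted init() (same setup as above, only the last lines changed):

@Before
public void init() throws Exception {
    databaseTester = new JdbcDatabaseTester(org.hsqldb.jdbcDriver.class.getName(),
            "jdbc:hsqldb:mem:testdb;sql.syntax_pgs=true", "sa", "");
    iConn = databaseTester.getConnection();
    iConn.getConfig().setProperty(DatabaseConfig.FEATURE_ALLOW_EMPTY_FIELDS, Boolean.TRUE);
    connection = iConn.getConnection();
    createTable(connection);
    IDataSet dataSet = new FlatXmlDataSetBuilder().build(new File("dataset.xml"));
    // Run the setup operation against the already-configured connection instead of
    // calling databaseTester.onSetup(), which obtains a new, unconfigured connection.
    DatabaseOperation.CLEAN_INSERT.execute(iConn, dataSet);
}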
You must override method setUpDatabaseConfig(DatabaseConfig config) as follows:
@Override
protected void setUpDatabaseConfig(DatabaseConfig config) {
config.setProperty(DatabaseConfig.FEATURE_ALLOW_EMPTY_FIELDS, true);
}
Hope it helps
This works for me:
IDatabaseConnection dbConn = new DatabaseDataSourceConnection(getDataSource());
dbConn.getConfig().setProperty(DatabaseConfig.FEATURE_ALLOW_EMPTY_FIELDS, true);
DatabaseOperation.CLEAN_INSERT.execute(dbConn, getiDataSet(loadDBData.source()));

Writing to Google Cloud Storage from PubSub using Cloud Dataflow with a DoFn

I am trying to write Google PubSub messages to Google Cloud Storage using Google Cloud Dataflow. I know that TextIO/AvroIO do not support streaming pipelines. However, I read in [1], in a comment by the author, that it is possible to write to GCS in a streaming pipeline from a ParDo/DoFn. I constructed a pipeline by following their article as closely as I could.
I was aiming for this behaviour:
Messages written out in batches of up to 100 to objects in GCS (one per window pane), under a path that corresponds to the time the message was published: dataflow-requests/[isodate-time]/[paneIndex].
I get different results:
There is only a single pane in every hourly window. I therefore only get one file in every hourly 'bucket' (it's really an object path in GCS). Reducing MAX_EVENTS_IN_FILE to 10 made no difference, still only one pane/file.
There is only a single message in every GCS object that is written out
The pipeline occasionally raises a CRC error when writing to GCS.
How do I fix these problems and get the behaviour I'm expecting?
Sample log output:
21:30:06.977 writing pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:06.977 writing pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:07.773 sucessfully write pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:07.846 sucessfully write pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:07.847 writing pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
Here is my code:
package com.example.dataflow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.windowing.*;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.gcloud.storage.BlobId;
import com.google.gcloud.storage.BlobInfo;
import com.google.gcloud.storage.Storage;
import com.google.gcloud.storage.StorageOptions;
import org.joda.time.Duration;
import org.joda.time.format.ISODateTimeFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
public class PubSubGcsSSCCEPipepline {
private static final Logger LOG = LoggerFactory.getLogger(PubSubGcsSSCCEPipepline.class);
public static final String BUCKET_PATH = "dataflow-requests";
public static final String BUCKET_NAME = "myBucketName";
public static final Duration ONE_DAY = Duration.standardDays(1);
public static final Duration ONE_HOUR = Duration.standardHours(1);
public static final Duration TEN_SECONDS = Duration.standardSeconds(10);
public static final int MAX_EVENTS_IN_FILE = 100;
public static final String PUBSUB_SUBSCRIPTION = "projects/myProjectId/subscriptions/requests-dataflow";
private static class DoGCSWrite extends DoFn<String, Void>
implements DoFn.RequiresWindowAccess {
public transient Storage storage;
{ init(); }
public void init() { storage = StorageOptions.defaultInstance().service(); }
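// Recreate the non-serializable Storage client when this DoFn is deserialized on a worker: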
private void readObject(java.io.ObjectInputStream in)
throws IOException, ClassNotFoundException {
init();
}
@Override
public void processElement(ProcessContext c) throws Exception {
String isoDate = ISODateTimeFormat.dateTime().print(c.window().maxTimestamp());
String blobName = String.format("%s/%s/%s", BUCKET_PATH, isoDate, c.pane().getIndex());
BlobId blobId = BlobId.of(BUCKET_NAME, blobName);
LOG.info("writing pane {} to blob {}", c.pane().getIndex(), blobName);
storage.create(BlobInfo.builder(blobId).contentType("text/plain").build(), c.element().getBytes());
LOG.info("sucessfully write pane {} to blob {}", c.pane().getIndex(), blobName);
}
}
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
options.as(DataflowPipelineOptions.class).setStreaming(true);
Pipeline p = Pipeline.create(options);
PubsubIO.Read.Bound<String> readFromPubsub = PubsubIO.Read.named("ReadFromPubsub")
.subscription(PUBSUB_SUBSCRIPTION);
PCollection<String> streamData = p.apply(readFromPubsub);
PCollection<String> windows = streamData.apply(Window.<String>into(FixedWindows.of(ONE_HOUR))
.withAllowedLateness(ONE_DAY)
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE))
.withLateFirings(AfterFirst.of(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE),
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(TEN_SECONDS))))
.discardingFiredPanes());
windows.apply(ParDo.of(new DoGCSWrite()));
p.run();
}
}
[1] https://labs.spotify.com/2016/03/10/spotifys-event-delivery-the-road-to-the-cloud-part-iii/
Thanks to Sam McVeety for the solution. Here is the corrected code for anyone reading:
package com.example.dataflow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.*;
import com.google.cloud.dataflow.sdk.transforms.windowing.*;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.gcloud.WriteChannel;
import com.google.gcloud.storage.BlobId;
import com.google.gcloud.storage.BlobInfo;
import com.google.gcloud.storage.Storage;
import com.google.gcloud.storage.StorageOptions;
import org.joda.time.Duration;
import org.joda.time.format.ISODateTimeFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Iterator;
public class PubSubGcsSSCCEPipepline {
private static final Logger LOG = LoggerFactory.getLogger(PubSubGcsSSCCEPipepline.class);
public static final String BUCKET_PATH = "dataflow-requests";
public static final String BUCKET_NAME = "myBucketName";
public static final Duration ONE_DAY = Duration.standardDays(1);
public static final Duration ONE_HOUR = Duration.standardHours(1);
public static final Duration TEN_SECONDS = Duration.standardSeconds(10);
public static final int MAX_EVENTS_IN_FILE = 100;
public static final String PUBSUB_SUBSCRIPTION = "projects/myProjectId/subscriptions/requests-dataflow";
private static class DoGCSWrite extends DoFn<Iterable<String>, Void>
implements DoFn.RequiresWindowAccess {
public transient Storage storage;
{ init(); }
public void init() { storage = StorageOptions.defaultInstance().service(); }
private void readObject(java.io.ObjectInputStream in)
throws IOException, ClassNotFoundException {
init();
}
@Override
public void processElement(ProcessContext c) throws Exception {
String isoDate = ISODateTimeFormat.dateTime().print(c.window().maxTimestamp());
long paneIndex = c.pane().getIndex();
String blobName = String.format("%s/%s/%s", BUCKET_PATH, isoDate, paneIndex);
BlobId blobId = BlobId.of(BUCKET_NAME, blobName);
LOG.info("writing pane {} to blob {}", paneIndex, blobName);
WriteChannel writer = storage.writer(BlobInfo.builder(blobId).contentType("text/plain").build());
LOG.info("blob stream opened for pane {} to blob {} ", paneIndex, blobName);
int i=0;
for (Iterator<String> it = c.element().iterator(); it.hasNext();) {
i++;
writer.write(ByteBuffer.wrap(it.next().getBytes()));
LOG.info("wrote {} elements to blob {}", i, blobName);
}
writer.close();
LOG.info("sucessfully write pane {} to blob {}", paneIndex, blobName);
}
}
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
options.as(DataflowPipelineOptions.class).setStreaming(true);
Pipeline p = Pipeline.create(options);
PubsubIO.Read.Bound<String> readFromPubsub = PubsubIO.Read.named("ReadFromPubsub")
.subscription(PUBSUB_SUBSCRIPTION);
PCollection<String> streamData = p.apply(readFromPubsub);
PCollection<KV<String, String>> keyedStream =
streamData.apply(WithKeys.of(new SerializableFunction<String, String>() {
public String apply(String s) { return "constant"; } }));
PCollection<KV<String, Iterable<String>>> keyedWindows = keyedStream
.apply(Window.<KV<String, String>>into(FixedWindows.of(ONE_HOUR))
.withAllowedLateness(ONE_DAY)
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE))
.withLateFirings(AfterFirst.of(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE),
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(TEN_SECONDS))))
.discardingFiredPanes())
.apply(GroupByKey.create());
PCollection<Iterable<String>> windows = keyedWindows
.apply(Values.<Iterable<String>>create());
windows.apply(ParDo.of(new DoGCSWrite()));
p.run();
}
}
There's a gotcha here, which is that you'll need a GroupByKey in order for the panes to be aggregated appropriately. The Spotify example references this as "Materialization of panes is done in “Aggregate Events” transform which is nothing else than a GroupByKey transform", but it's a subtle point. You'll need to provide a key in order to do this, and in your case, it appears a constant value will work.
PCollection<String> streamData = p.apply(readFromPubsub);
PCollection<KV<String, String>> keyedStream =
streamData.apply(WithKeys.of(new SerializableFunction<String, String>() {
public String apply(String s) { return "constant"; } }));
At this point, you can apply your windowing function, and then a final GroupByKey to get the desired behavior:
PCollection<KV<String, Iterable<String>>> keyedWindows = keyedStream.apply(...)
.apply(GroupByKey.create());
PCollection<Iterable<String>> windows = keyedWindows
.apply(Values.<Iterable<String>>create());
Now the elements in processElement will be Iterable<String>, with size 100 or more.
We've filed https://issues.apache.org/jira/browse/BEAM-184 to make this behavior clearer.
As of Beam 2.0, TextIO/AvroIO do support writing unbounded collections; see the documentation. In particular, you have to specify withWindowedWrites().
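A minimal sketch of that newer approach (note this uses the Beam 2.x TextIO API rather than the Dataflow 1.x SDK used above; the output prefix and shard count are placeholders):

// Sketch: windowed writes with TextIO on an unbounded PCollection<String>.
PCollection<String> windowed = streamData
        .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))));

windowed.apply(TextIO.write()
        .to("gs://myBucketName/dataflow-requests/output") // placeholder output prefix
        .withWindowedWrites()
        .withNumShards(1)); // an explicit shard count is required for unbounded input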

How to use iText to span 2 columns of a page?

I am trying to do the following using iText and Java through Eclipse. I need to create a PDF which will consist of a number of multiple-choice questions retrieved from a database. The data retrieved is in the form of HTML tags, so I am using XML Worker to parse it. I am able to retrieve the questions one by one from the database and add them to the PDF. But the problem is that they occupy only one side of a page, while I need the questions to cover 2 columns of a page.
When the end of the first PDF page is reached it should use the right-hand side of the first PDF page to add questions. Only when both left and right sides of a page are fully used should it move on to the next PDF page.
Now I have managed to get the HTML data into 2 columns using ColumnText. But the problem I face now is that the questions retrieved from the database do not appear in the intended format. Each question is displayed on one line.
I entered the questions is this format:
1)What is 2+2=?
a)2
b)4
c)8
d)15
I want the output in the PDF to be as above.
However I get the following as output:
1)What is2+2=?a)2b)4c)8d)15
How do I preserve the HTML formatting?
This is my code so far:
import java.io.ByteArrayInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Collection;
import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.List;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.pdf.ColumnText;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.draw.LineSeparator;
import com.itextpdf.tool.xml.ElementHandler;
import com.itextpdf.tool.xml.ElementList;
import com.itextpdf.tool.xml.Pipeline;
import com.itextpdf.tool.xml.Writable;
import com.itextpdf.tool.xml.XMLWorker;
import com.itextpdf.tool.xml.XMLWorkerHelper;
import com.itextpdf.tool.xml.html.Tags;
import com.itextpdf.tool.xml.parser.XMLParser;
import com.itextpdf.tool.xml.pipeline.WritableElement;
import com.itextpdf.tool.xml.pipeline.css.CSSResolver;
import com.itextpdf.tool.xml.pipeline.css.CssResolverPipeline;
import com.itextpdf.tool.xml.pipeline.end.PdfWriterPipeline;
import com.itextpdf.tool.xml.pipeline.html.HtmlPipeline;
import com.itextpdf.tool.xml.pipeline.html.HtmlPipelineContext;
public class ColumnTextExample {
public static final float[][] COLUMNS = {
{ 36, 36, 224, 579 } , { 230, 36, 418, 579 }
};
public static void main(String[] args)throws IOException, DocumentException, ClassNotFoundException, SQLException {
// TODO Auto-generated method stub
Document document = new Document(PageSize.A4.rotate());
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("C:\\columns.pdf"));
document.open();
Class.forName("com.mysql.jdbc.Driver");
Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/test3", "root", "root");
Statement st=con.createStatement();
ResultSet rs=st.executeQuery("select * from exam2");
int size=0;
while (rs.next()){
size++;
};
ResultSet rs1=st.executeQuery("select * from exam2");
String[] myStringArray = new String[size];
int i=0;
while (rs1.next()){
myStringArray[i]=rs1.getString("paper");
i++;
}
ColumnText ct = new ColumnText(writer.getDirectContent());
for (String article : myStringArray) {
ct.addElement(createPhrase(article,writer,document));
ct.addElement(Chunk.NEWLINE);
document.add(Chunk.NEWLINE);
}
ct.setAlignment(Element.ALIGN_CENTER);
ct.setExtraParagraphSpace(55);
ct.setLeading(0, 1.2f);
ct.setFollowingIndent(27);
int linesWritten = 0;
int column = 0;
int status = ColumnText.START_COLUMN;
while (ColumnText.hasMoreText(status)) {
ct.setSimpleColumn(
COLUMNS[column][0], COLUMNS[column][1],
COLUMNS[column][2], COLUMNS[column][3]);
ct.setYLine(COLUMNS[column][3]);
status = ct.go();
linesWritten += ct.getLinesWritten();
column = Math.abs(column - 1);
if (column == 0)
document.newPage();
}
ct.addElement(new Phrase("Lines written: " + linesWritten));
ct.go();
document.close();
}
public static Phrase createPhrase(String myString, PdfWriter writer, Document document) throws IOException, DocumentException {
Phrase p = new Phrase();
String myString2=myString+"<html><body><br></br></body></html>";
String k="<br></br>";
XMLWorkerHelper xwh = XMLWorkerHelper.getInstance();
InputStream is = new ByteArrayInputStream(myString.getBytes());
ElementList myList = new ElementList();
ElementList myList1 = new ElementList();
xwh.parseXHtml(myList,new StringReader(myString2));
p.addAll(myList);
return p;
}
}
When I use XMLWorkerHelper.getInstance().parseXHtml(writer, document, new StringReader(name1)); I am able to preserve the HTML formatting, such as new lines.
The code below helps me retrieve HTML data from the DB and parse it while preserving the formatting. However, it prints to the PDF in a single column.
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.InputStream;
import java.io.StringReader;
import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;
import java.io.ByteArrayInputStream;
import java.sql.*;
public class GeneratePDF {
public static void main(String[] args) {
try {
OutputStream file = new FileOutputStream(new File("C:\\Test.pdf"));
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, file);
document.open();
Class.forName("com.mysql.jdbc.Driver");
Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/test3", "root", "root");
Statement st=con.createStatement();
ResultSet rs=st.executeQuery("select paper from exam2 ");
String name1="";
while (rs.next()){
String name = rs.getString("paper");
//out.println(name);
name1=name;
/*String k="<h1 style='text-align: center;'><strong>Maths Question2 Paper</strong></h1>"+
"<pre><strong>1)What is the sum of 2+2??<br /></strong><strong>a)3<br /></strong><strong>b)5<br /></strong><strong>c)4<br /></strong><strong>d)1</strong></pre>"+
"<pre><strong>2)What is the sum of 5+2??<br /></strong><strong>a)3<br />b)5<br />c)7<br />d)1</strong></pre>"+
"<pre> </pre>";*/
System.out.println(name1);
//InputStream is = new ByteArrayInputStream(name1.getBytes());
XMLWorkerHelper.getInstance().parseXHtml(writer, document, new StringReader(name1));
}
document.close();
file.close();
} catch (Exception e) {
e.printStackTrace();
}}}
You're parsing the HTML to a Document. This means you want iText to organize all content on one page, using one column defined by the page size (in your case A4) and the page margins (in your case 36pt on each side).
If you want to organize the content differently, you should use the parseXHtml() method that takes an ElementHandler as a parameter and pass an ElementList object. This list will then contain the Element objects that you can feed to a ColumnText object. With the ColumnText class, you can define multiple rectangles on a page, and use the go() method to fill these rectangles with the content from the ElementList.
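A minimal sketch of that flow, reusing the writer, document, COLUMNS and name1 variables from the code above and assuming the HTML coming from the database is well-formed XHTML:

// Sketch: parse the XHTML into an ElementList, then flow the elements into two column rectangles.
ElementList elements = new ElementList();
XMLWorkerHelper.getInstance().parseXHtml(elements, new StringReader(name1));

ColumnText ct = new ColumnText(writer.getDirectContent());
for (Element e : elements) {
    ct.addElement(e);
}

int column = 0;
int status = ColumnText.START_COLUMN;
while (ColumnText.hasMoreText(status)) {
    ct.setSimpleColumn(COLUMNS[column][0], COLUMNS[column][1],
            COLUMNS[column][2], COLUMNS[column][3]);
    status = ct.go();
    column = Math.abs(column - 1); // alternate between the left and right column
    if (column == 0) {
        document.newPage(); // both columns used, move to the next page
    }
}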