I'm new to working with real-time applications. Currently, I'm using AWS Kinesis/Flink and Scala, and I have the following architecture:
(diagram: old architecture)
As you can see, I consume a CSV file using CSVTableSource. Unfortunately, the CSV file became too big for the Flink job; it is updated daily as new rows are added.
So now I am working on a new architecture, where I want to replace the CSV file with a DynamoDB table.
(diagram: new architecture)
My question is: what do you recommend for consuming the DynamoDB table?
PS: I need to do a left outer join between the DynamoDB table and the Kinesis Data Stream data.
You could use a RichFlatMapFunction to open a DynamoDB client and look up data from DynamoDB. Sample code is given below.
public static class DynamoDBMapper<IN, OUT> extends RichFlatMapFunction<IN, OUT> {
    // DynamoDB table handle, initialized once per parallel task in open()
    private transient Table table;
    private final String tableName = ""; // set your DynamoDB table name here

    @Override
    public void open(Configuration parameters) throws Exception {
        // Initialize the DynamoDB client
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
                .withRegion(Regions.US_EAST_1)
                .build();
        DynamoDB dynamoDB = new DynamoDB(client);
        this.table = dynamoDB.getTable(tableName);
    }

    @Override
    public void flatMap(IN value, Collector<OUT> out) throws Exception {
        // Execute a GetItem against the table with a key derived from the input record,
        // merge the result into the output record, and emit it, e.g.:
        // Item item = table.getItem("id", keyOf(value));
        // out.collect(enrich(value, item));
    }
}
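For the left outer join itself, here is a minimal sketch of a concrete version of the same idea. The Event/EnrichedEvent types, the "events" table, the "id" partition key, and the kinesisStream variable are assumptions for illustration; always emitting the incoming Kinesis record, with or without a matching DynamoDB item, gives you left-outer-join semantics.

public static class DynamoDbLeftJoin extends RichFlatMapFunction<Event, EnrichedEvent> {
    private transient Table table;

    @Override
    public void open(Configuration parameters) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
                .withRegion(Regions.US_EAST_1)
                .build();
        table = new DynamoDB(client).getTable("events"); // hypothetical table name
    }

    @Override
    public void flatMap(Event value, Collector<EnrichedEvent> out) {
        // GetItem by the stream record's key; returns null when there is no match
        Item item = table.getItem("id", value.getId());
        // Always emit the left side, so missing lookups behave like a left outer join
        out.collect(new EnrichedEvent(value, item));
    }
}

// Wiring it into the job:
DataStream<EnrichedEvent> joined = kinesisStream.flatMap(new DynamoDbLeftJoin());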
I'm new to Kafka, Flink, and TiDB.
Assume I have three source MySQL tables, s_a, s_b, and s_c, and want to replicate records to the target TiDB tables t_a and t_b in real time.
The mapping rules are:
`s_a` --> `t_a`
`s_b` union `s_c` ---> `t_b` with some transformation (e.g., field remapping).
The solution I adopted is Kafka + Flink with a TiDB sink, where binlog changes are published to a Kafka topic; Flink consumes the topic and writes the transformed results to TiDB. My problems in the Flink code are:
How can I easily convert the JSON string polled from Kafka (which carries the operation and table information) into different kinds of DTO operations (e.g., insert/create for t_a or t_b)? I have found a tool called Debezium as a Kafka & Flink connector, but it looks like it requires equality between the source table and the target table.
How do I write the transformation VKDataMapper if I have multiple target tables? I have difficulty defining T, as it can be the t_a DTO (Data Transfer Object) or the t_b DTO.
My existing sample code looks like this:
// The main routine.
StreamExecutionEnvironment environment =
        StreamExecutionEnvironment.getExecutionEnvironment();

// consumer is a FlinkKafkaConsumer; TopicFilter returns true.
environment.addSource(consumer)
        .filter(new TopicFilter())
        .map(new VKDataMapper())
        .addSink(new TidbSink());

try {
    environment.execute();
} catch (Exception e) {
    log.error("exception {}", e);
}
public class VKDataMapper implements MapFunction<String, T> {

    @Override
    public T map(String value) throws Exception {
        // How can T represent both the t_a DTO and the t_b DTO?
        return null;
    }
}
Why not try Flink SQL? That way, you only need to create some tables in Flink and then define your tasks through SQL, like:
insert into t_a select * from s_a;
insert into t_b select * from s_b union select * from s_c;
See some examples in https://github.com/LittleFall/flink-tidb-rdw, and feel free to ask about anything that confuses you.
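A minimal sketch of how this might look in Java (the connector options, topic names, CDC format, and TiDB JDBC URL below are assumptions, not taken from your setup):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class TidbSyncJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Source table backed by the Kafka topic that carries the binlog changes
        // (options are placeholders; adjust the format to your CDC tool, e.g. canal-json).
        tEnv.executeSql(
            "CREATE TABLE s_a (id BIGINT, name STRING) " +
            "WITH ('connector' = 'kafka', 'topic' = 's_a_binlog', " +
            "'properties.bootstrap.servers' = 'kafka:9092', " +
            "'scan.startup.mode' = 'earliest-offset', 'format' = 'canal-json')");

        // Sink table pointing at TiDB through the JDBC connector.
        tEnv.executeSql(
            "CREATE TABLE t_a (id BIGINT, name STRING, PRIMARY KEY (id) NOT ENFORCED) " +
            "WITH ('connector' = 'jdbc', 'url' = 'jdbc:mysql://tidb:4000/test', " +
            "'table-name' = 't_a', 'username' = 'root', 'password' = '')");

        // The actual task is just the SQL from the answer above.
        tEnv.executeSql("INSERT INTO t_a SELECT * FROM s_a");
    }
}

With this approach the per-table DTO question goes away: each target table is just another INSERT INTO ... SELECT statement, and the union case becomes INSERT INTO t_b SELECT ... FROM s_b UNION SELECT ... FROM s_c.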
I'm using a chunk step with a reader and writer. I am reading data from Hive with a chunk size of 50000 and inserting into MySQL with the same 50000-record commit interval.
@Bean
public JdbcBatchItemWriter<Identity> writer(DataSource mysqlDataSource) {
    return new JdbcBatchItemWriterBuilder<Identity>()
            .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
            .sql(insertSql)
            .dataSource(mysqlDataSource)
            .build();
}
When I start the data load and insert into MySQL, it commits very slowly: 100000 records take more than an hour to load, while the same loader with Gemfire loads 5 million records in 30 minutes.
It seems like it inserts one row at a time instead of in a batch, as the count grows 1500, then 4000, and so on. Has anyone faced the same issue?
Since you are using BeanPropertyItemSqlParameterSourceProvider, a lot of reflection is involved in setting the variables on the prepared statement, which increases the time.
If speed is your top priority, try implementing your own ItemWriter as given below and use a prepared statement batch to execute the updates.
@Component
public class CustomWriter implements ItemWriter<Identity> {

    // your sql statement here
    private static final String SQL = "INSERT INTO table_name (column1, column2, column3, ...) VALUES (?,?,?,?);";

    @Autowired
    private DataSource dataSource;

    @Override
    public void write(List<? extends Identity> list) throws Exception {
        PreparedStatement preparedStatement = dataSource.getConnection().prepareStatement(SQL);
        for (Identity identity : list) {
            // Set the variables
            preparedStatement.setInt(1, identity.getMxx());
            preparedStatement.setString(2, identity.getMyx());
            preparedStatement.setString(3, identity.getMxt());
            preparedStatement.setInt(4, identity.getMxt());
            // Add it to the batch
            preparedStatement.addBatch();
        }
        int[] count = preparedStatement.executeBatch();
    }
}
Note: this is rough code, so exception handling and resource handling are not done properly; you can work on that yourself. I think this will improve your write speed considerably.
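For instance, a minimal sketch of the same write method with the resource handling added, using try-with-resources (the getters and SQL are the same placeholders as in the writer above):

@Override
public void write(List<? extends Identity> list) throws Exception {
    // try-with-resources closes the connection and statement even if the batch fails
    try (Connection connection = dataSource.getConnection();
         PreparedStatement preparedStatement = connection.prepareStatement(SQL)) {
        for (Identity identity : list) {
            preparedStatement.setInt(1, identity.getMxx());
            preparedStatement.setString(2, identity.getMyx());
            preparedStatement.setString(3, identity.getMxt());
            // set any remaining parameters exactly as in the writer above
            preparedStatement.addBatch();
        }
        preparedStatement.executeBatch();
    }
}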
Try Adding ";useBulkCopyForBatchInsert=true" to your connection url.
Connection con = DriverManager.getConnection(connectionUrl + ";useBulkCopyForBatchInsert=true")
Source : https://learn.microsoft.com/en-us/sql/connect/jdbc/use-bulk-copy-api-batch-insert-operation?view=sql-server-ver15
I'm writing a pipeline to replicate data from one source to another. Info about the data sources is stored in a database (BigQuery). How can I use this data to build read/write endpoints dynamically?
I tried to pass the Pipeline object to my custom DoFn, but it can't be serialized. Later I tried to call getPipeline() on a passed view, but that doesn't work either, which is actually expected.
I can't know all the tables I need to replicate in advance, so I have to read them from the database (or any other source).
// builds some random view
PCollectionView<IdWrapper> idView = ...;

// reads table meta and replicates data for each table
pipeline.apply(getTableMetaEndpont().read())
        .apply(ParDo.of(new MyCustomReplicator(idView)).withSideInputs(idView));

private static class MyCustomReplicator extends DoFn<TableMeta, TableMeta> {
    private final PCollectionView<IdWrapper> idView;

    private MyCustomReplicator(PCollectionView<IdWrapper> idView) {
        this.idView = idView;
    }

    // TableMeta {string: sourceTable, string: destTable}
    @ProcessElement
    public void processElement(@Element TableMeta tableMeta, ProcessContext ctx) {
        long id = ctx.sideInput(idView).getValue();
        // builds a read endpoint which depends on the table meta
        // updates entities
        // stores entities using another endpoint
        idView
            .getPipeline()
            .apply(createReadEndpoint(tableMeta).read())
            .apply(ParDo.of(new SomeFunction(tableMeta, id)))
            .apply(createWriteEndpoint(tableMeta).insert());

        ctx.output(tableMeta);
    }
}
I expect it to replicate the data specified by each TableMeta, but I can't use the pipeline inside a DoFn because it can't be serialized/deserialized.
Is there any way to implement the intended behavior?
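One common workaround is to expand the graph at pipeline-construction time instead of inside a DoFn. A minimal sketch, under the assumption that the table metadata can be fetched with an ordinary (non-Beam) client in the driver program, and using the hypothetical fetchTableMetas/fetchCurrentId helpers together with your createReadEndpoint/createWriteEndpoint/SomeFunction:

// The graph is built in the driver program, so no Pipeline object ever needs to be
// serialized into a DoFn: one read/transform/write branch is created per table.
Pipeline pipeline = Pipeline.create(options);

long id = fetchCurrentId(); // assumed helper, replaces the side-input lookup

// fetchTableMetas() is an assumed helper that reads the table metadata from BigQuery
// with a plain client while the pipeline graph is being constructed.
for (TableMeta tableMeta : fetchTableMetas()) {
    pipeline
        .apply("Read-" + tableMeta.getSourceTable(), createReadEndpoint(tableMeta).read())
        .apply("Transform-" + tableMeta.getSourceTable(), ParDo.of(new SomeFunction(tableMeta, id)))
        .apply("Write-" + tableMeta.getDestTable(), createWriteEndpoint(tableMeta).insert());
}

pipeline.run();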
I have a question regarding joins in Amazon DynamoDB. Since Amazon DynamoDB is a NoSQL database, it doesn't support joins. I am looking for an alternative to a join command for DynamoDB tables. I am using DynamoDB with the Android SDK.
There is no way to do joins in DynamoDB.
A DynamoDB table's primary key is a composite of a partition key and a sort key, and you need the partition key to query the table. It is not like a relational database: a DynamoDB table has no fixed schema, so joins are quite complicated to emulate.
Query each table individually and use the values from the result to query the other table, as in the sketch below.
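A minimal sketch of that application-level join using the low-level client from the AWS SDK (the Orders/Customers table names, the customerId key attribute, and the dynamoDbClient/customerId variables are assumptions for illustration):

// Query one table, then look up the related item in the other table for each result
// and merge the two client-side; this is effectively a hand-rolled join.
Map<String, AttributeValue> values = new HashMap<>();
values.put(":cid", new AttributeValue().withS(customerId));

QueryRequest ordersQuery = new QueryRequest()
        .withTableName("Orders")                              // hypothetical table
        .withKeyConditionExpression("customerId = :cid")
        .withExpressionAttributeValues(values);

for (Map<String, AttributeValue> order : dynamoDbClient.query(ordersQuery).getItems()) {
    Map<String, AttributeValue> key = new HashMap<>();
    key.put("customerId", order.get("customerId"));

    GetItemRequest customerLookup = new GetItemRequest()
            .withTableName("Customers")                       // hypothetical table
            .withKey(key);
    Map<String, AttributeValue> customer = dynamoDbClient.getItem(customerLookup).getItem();

    // customer may be null if there is no matching item; merge order + customer as needed
}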
Since DynamoDB is a NoSQL database, it doesn't offer relational features, so you cannot join tables in DynamoDB. AWS has other databases that are relational, such as AWS Aurora.
Disclaimer: I work at Rockset, but I definitely think this can help you solve the issue easily. You can't do joins on DynamoDB directly, but you can do them indirectly by integrating DynamoDB with Rockset.
Create an integration with DynamoDB, giving Rockset read permissions.
Write your SQL query with the JOIN in the editor.
Save the SQL query as a RESTful endpoint via a Query Lambda in the Rockset console.
In your Android app, make an HTTP request to that endpoint and get your query results.
Assuming you've imported all the proper libraries:
public class MainActivity extends AppCompatActivity {

    private String url = "https://api.rs2.usw2.rockset.com/v1/orgs/self/ws/commons/lambdas/LambdaName/versions/versionNumber";
    private String APIKEY = "ApiKey YOUR APIKEY";

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        new JSONAsyncTask().execute(url);
    }

    class JSONAsyncTask extends AsyncTask<String, Void, Boolean> {

        @Override
        protected void onPreExecute() {
            super.onPreExecute();
        }

        @Override
        protected Boolean doInBackground(String... urls) {
            try {
                HttpPost httpPost = new HttpPost(url);
                HttpClient httpclient = new DefaultHttpClient();
                httpPost.addHeader("Authorization", APIKEY);
                httpPost.addHeader("Content-Type", "application/json");
                HttpResponse response = httpclient.execute(httpPost);
                int status = response.getStatusLine().getStatusCode();
                if (status == 200) {
                    HttpEntity entity = response.getEntity();
                    String data = EntityUtils.toString(entity);
                    Log.e("foo", data);
                    JSONObject jsono = new JSONObject(data);
                    return true;
                } else {
                    Log.e("foo", "error" + String.valueOf(status));
                }
            } catch (IOException e) {
                e.printStackTrace();
            } catch (JSONException e) {
                e.printStackTrace();
            }
            return false;
        }

        protected void onPostExecute(Boolean result) {
        }
    }
}
From there, you'll get your results as a log and then you can do what you want with that data.
While you can't do JOINs directly on DynamoDB, you can do them indirectly with Rockset if you're building data-driven applications.
DynamoDB is a NoSQL database, and as such you can't do joins.
However, from my experience there isn't anything you can't do if you design your database correctly. Use a single table and a combination of primary and secondary keys.
Here are the docs: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html
You cannot use joins in DynamoDB, but you can structure your data in a single table using GSIs (global secondary indexes), so that you can query the data in most of the ways you will need.
So before designing the database structure, make a list of all the queries you will need and design the table, mainly its indexes, according to that. A sketch of such a query against a GSI follows.
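For example, a minimal sketch of querying a single-table design through a GSI (the table name, index name, key attributes, prefixes, and the dynamoDbClient variable are all assumptions from a hypothetical design):

// Query items by an access pattern served by a GSI rather than the base table's key.
Map<String, AttributeValue> values = new HashMap<>();
values.put(":pk", new AttributeValue().withS("CUSTOMER#123"));
values.put(":prefix", new AttributeValue().withS("ORDER#"));

QueryRequest request = new QueryRequest()
        .withTableName("AppTable")                 // single table holding all entities
        .withIndexName("GSI1")                     // hypothetical global secondary index
        .withKeyConditionExpression("GSI1PK = :pk AND begins_with(GSI1SK, :prefix)")
        .withExpressionAttributeValues(values);

List<Map<String, AttributeValue>> orders = dynamoDbClient.query(request).getItems();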
My development environment is Eclipse Oxygen with Google Cloud Tools for Eclipse 1.7.0 installed.
I created a Google Cloud Dataflow Java project.
There was a problem testing the WordCount example.
When reading a file from the bucket, the contents show up normally in the log.
The problem occurs when the WordCount data is processed and stored back in the bucket.
If you check the saved file, you can see the issue shown in the screenshot: the Korean text comes out garbled.
Does Dataflow not support Korean?
Here is my TextIO.write code:
static class WriteData extends PTransform<PCollection<KV<URI, String>>, PDone>
{
    private String output;

    public WriteData(String output)
    {
        this.output = output;
    }

    @Override
    public Coder<?> getDefaultOutputCoder()
    {
        return KvCoder.of(StringDelegateCoder.of(URI.class), StringUtf8Coder.of());
    }

    @Override
    public PDone expand(PCollection<KV<URI, String>> outputfile) {
        return outputfile
            .apply(ParDo.of(new DoFn<KV<URI, String>, String>(){
                @ProcessElement
                public void processElement(ProcessContext c)
                {
                    output = c.element().getKey().toString();
                    LOG.info("WRITE DATA : " + c.element().getValue());
                    c.output(c.element().getValue());
                }
            }))
            .apply(TextIO.write().to(output).withSuffix(".txt"));
    }
}
Most of the time the correct coder can be inferred automatically, but if it isn't, make sure you're specifying a coder when reading data.
When you do need to specify a coder, it's typically when reading data into your pipeline from an external source (or creating pipeline data from local data), and when outputting pipeline data to an external sink.
For example, you can decode the bytes you read with:
StringUtf8Coder.of().decode(inStream)
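A minimal sketch of pinning the coder explicitly on the PCollection in a pipeline like yours (the bucket paths are placeholders), so the Korean text is treated as UTF-8 end to end:

// Read the input file and explicitly set the UTF-8 string coder on the result.
PCollection<String> lines = pipeline
        .apply(TextIO.read().from("gs://your-bucket/input.txt"))  // hypothetical path
        .setCoder(StringUtf8Coder.of());

// ... WordCount transforms here ...

// Write the results; TextIO writes the strings it is given as UTF-8 bytes.
lines.apply(TextIO.write().to("gs://your-bucket/output").withSuffix(".txt"));

If the file still looks garbled, also check that whatever tool you use to open it (editor, terminal, browser) is itself displaying the bytes as UTF-8.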