I'm struggling to use JdbcIO with Apache Beam 2.0 (Java) to connect to a Cloud SQL instance from Dataflow within the same project.
I'm getting the following error:
java.sql.SQLException: Cannot create PoolableConnectionFactory (Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.)
According to the documentation, the Dataflow service account *@dataflow-service-producer-prod.iam.gserviceaccount.com should have access to all resources within the same project if it has the "Editor" role.
When I run the same Dataflow job with DirectRunner everything works fine.
This is the code I'm using:
private static String JDBC_URL = "jdbc:mysql://myip:3306/mydb?verifyServerCertificate=false&useSSL=true";

PCollection<KV<String, Double>> exchangeRates = p.apply(JdbcIO.<KV<String, Double>>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create("com.mysql.jdbc.Driver", JDBC_URL)
        .withUsername(JDBC_USER)
        .withPassword(JDBC_PW))
    .withQuery("SELECT CurrencyCode, ExchangeRate FROM mydb.mytable")
    .withCoder(KvCoder.of(StringUtf8Coder.of(), DoubleCoder.of()))
    .withRowMapper(new JdbcIO.RowMapper<KV<String, Double>>() {
        public KV<String, Double> mapRow(ResultSet resultSet) throws Exception {
            return KV.of(resultSet.getString(1), resultSet.getDouble(2));
        }
    }));
EDIT:
Using the following approach outside of Beam, within another Dataflow job, works fine with DataflowRunner, which tells me that the database itself is probably not the problem.
java.sql.Connection connection = DriverManager.getConnection(JDBC_URL, JDBC_USER, JDBC_PW);
Following these instructions on how to connect to Cloud SQL from Java:
https://cloud.google.com/sql/docs/mysql/connect-external-app#java
I managed to make it work.
This is what the code looks like (you must replace MYDBNAME, MYSQLINSTANCE, USER and PASSWORD with your values).
Heads up: the MYSQLINSTANCE format is project:region:instancename.
And I'm using a custom class (Customer) to store the values for each row, instead of key-value pairs.
p.apply(JdbcIO.<Customer>read()
    .withDataSourceConfiguration(
        JdbcIO.DataSourceConfiguration.create(
            "com.mysql.jdbc.Driver",
            "jdbc:mysql://google/MYDBNAME?cloudSqlInstance=MYSQLINSTANCE&socketFactory=com.google.cloud.sql.mysql.SocketFactory&user=USER&password=PASSWORD&useUnicode=true&characterEncoding=UTF-8"
        )
    )
    .withQuery("SELECT CustomerId, Name, Location, Email FROM Customers")
    .withCoder(AvroCoder.of(Customer.class))
    .withRowMapper(new JdbcIO.RowMapper<Customer>() {
        @Override
        public Customer mapRow(java.sql.ResultSet resultSet) throws Exception {
            final Logger LOG = LoggerFactory.getLogger(CloudSqlToBq.class);
            LOG.info(resultSet.getString(2));
            // Columns 1-4 map to CustomerId, Name, Location, Email from the query above.
            return new Customer(resultSet.getInt(1), resultSet.getString(2), resultSet.getString(3), resultSet.getString(4));
        }
    })
);
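A minimal sketch of what a Customer class compatible with AvroCoder could look like (field names here are illustrative, inferred from the query, and not taken from the original code); AvroCoder's reflection-based schema typically expects a public no-arg constructor:

// Hypothetical POJO matching the SELECT above; field names are assumptions.
public class Customer {
    private int customerId;
    private String name;
    private String location;
    private String email;

    public Customer() {} // typically needed by AvroCoder's reflection-based schema

    public Customer(int customerId, String name, String location, String email) {
        this.customerId = customerId;
        this.name = name;
        this.location = location;
        this.email = email;
    }

    public int getCustomerId() { return customerId; }
    public String getName() { return name; }
    public String getLocation() { return location; }
    public String getEmail() { return email; }
}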
I hope this helps.
Hi, it worked for me the way you did it. Additionally, I removed the withUsername and withPassword methods from the data source configuration, and my pipeline configuration looks like below:
PCollection<KV<Double, Double>> exchangeRates = p.apply(JdbcIO.<KV<Double, Double>>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create("com.mysql.jdbc.Driver",
        "jdbc:mysql://ip:3306/dbname?user=root&password=root&useUnicode=true&characterEncoding=UTF-8"))
    .withQuery("SELECT PERIOD_YEAR, PERIOD_YEAR FROM SALE")
    .withCoder(KvCoder.of(DoubleCoder.of(), DoubleCoder.of()))
    .withRowMapper(new JdbcIO.RowMapper<KV<Double, Double>>() {
        @Override
        public KV<Double, Double> mapRow(java.sql.ResultSet resultSet) throws Exception {
            LOG.info(resultSet.getDouble(1) + " Came");
            return KV.of(resultSet.getDouble(1), resultSet.getDouble(2));
        }
    }));
Hope this will help
I think this approach may work better: please try com.mysql.jdbc.GoogleDriver and use the Maven dependencies listed here.
https://cloud.google.com/appengine/docs/standard/java/cloud-sql/#Java_Connect_to_your_database
Related question:
Where do I find and download the jar file for com.mysql.jdbc.GoogleDriver?
I'm trying to write a load test for a service.
I want to build DeliveryObjects and publish them; each delivery must have a unique id.
The problem I encounter is that I can't pass variables from the session to a function that I wrote (I know the documentation says I can't), and I also can't "catch" the value at run time as I saw in several examples. This is one thing I have tried:
object AdminClient extends FireClient {

  def getDeliveryStateByDeliveryId(name: String = "Get delivery state by ID",
                                   @Nullable deliveryId: Expression[String] = "${delivery_uuid}")
  : HttpClientRequest = {
    // The deliveryId resolves to something like: io.gatling.core.session.el.ElCompiler$$$Lambda$372/1144897090@473b3b7a
    println("delivery id in adminclient is: " + deliveryId)
    get(name)
      .uri(s"/url/${deliveryId}")
      .requestModeAdmin
      .allowOkStatus
  }
}
and the scenario looks like this (to make things simpler):
object LoadTest extends FireScenarios {

  val csvFeeder = csv("deliveries.csv")

  fireScenario("Load test starts")(_
    .feed(csvFeeder)
    .exec { session =>
      // Here delivery_uuid gets a real value, something like "b6070d6b-ce10-5fd3-b81d-ed356665f0e1"
      println("delivery id is: " + session.get("delivery_uuid").as[String])
      session
    }
    .exec(AdminClient.getDeliveryStateByDeliveryId())
  )
}
So I guess my question is: how can I pass a value to the "${delivery_uuid}" variable in the getDeliveryStateByDeliveryId method?
Note that I also can't just call the getDeliveryStateByDeliveryId method from within an
exec { session =>
  AdminClient.getDeliveryStateByDeliveryId(deliveryId = session.get("delivery_uuid"))
  session
}
block: although the method receives the variable as I want, Gatling throws an error that no request was sent and no report will be produced.
I'm very confused after reading the docs too many times; any help will be much appreciated.
Let's sum up what you can find in the official documentation:
Expression[String] is a Scala type alias for (simplifying a bit) a function Session => String. Similarly, in Gatling's Java DSL, you directly pass a java.util.Function<Session, String>.
Gatling's Scala DSL has implicit conversions that transform String values passed to methods expecting a function parameter into a proper function.
So if we make explicit what you've written, you actually have (doesn't compile, but you'll get the idea):
def getDeliveryStateByDeliveryId(name: String = "Get delivery state by ID",
                                 @Nullable deliveryId: Expression[String] = session => session("delivery_uuid").as[String])
: HttpClientRequest = {
  println("delivery id in adminclient is: " + deliveryId)
  get(name)
    .uri("/url/" + deliveryId)
    .requestModeAdmin
    .allowOkStatus
}
This cannot possibly work: you're concatenating a String and a function, which, just as in Java, falls back to the toString inherited from Object.
Now, as you're a beginner, why do you need deliveryId to be a function? Can't it just be a String with the name of the desired attribute?
def getDeliveryStateByDeliveryId(name: String = "Get delivery state by ID",
                                 deliveryId: String = "delivery_uuid")
: HttpClientRequest =
  get(name)
    .uri(s"/url/#{$deliveryId}")
    .requestModeAdmin
    .allowOkStatus
object LoadTest extends FireScenarios {

  val csvFeeder = csv("deliveries.csv")

  fireScenario("Load test starts")(_
    .feed(csvFeeder)
    .exec(AdminClient.getDeliveryStateByDeliveryId())
  )
}
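For reference, here is a minimal sketch of the same two options in Gatling's plain Java DSL (not the FireClient/FireScenarios wrapper used above; the feeder file name comes from the question and the base URL is hypothetical): reference the attribute with Gatling EL, or pass an explicit java.util.Function<Session, String>, the Java counterpart of Expression[String].

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;

public class DeliverySimulation extends Simulation {

    FeederBuilder<String> csvFeeder = csv("deliveries.csv").circular();

    // Option 1: Gatling EL string, resolved against the session at request time.
    HttpRequestActionBuilder byEl =
            http("Get delivery state by ID").get("/url/#{delivery_uuid}");

    // Option 2: an explicit function of the session, equivalent to Option 1.
    HttpRequestActionBuilder byFunction =
            http("Get delivery state by ID").get(session -> "/url/" + session.getString("delivery_uuid"));

    ScenarioBuilder scn = scenario("Load test starts")
            .feed(csvFeeder)
            .exec(byEl); // or .exec(byFunction)

    {
        setUp(scn.injectOpen(atOnceUsers(1)))
                .protocols(http.baseUrl("http://localhost:8080")); // hypothetical base URL
    }
}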
For performance optimisation we are trying to read data from a Mongo secondary server for selected scenarios. I am using an inline query with withReadPreference(ReadPreference.secondaryPreferred()) to read the data; please find the code snippet below.
What I want to confirm is that the data returned after executing the highlighted inline query really comes from a secondary server. Is there any method available to check this from Java or Spring Boot?
public User read(final String userId) {
final ObjectId objectId = new ObjectId(userId);
final User user = collection.withReadPreference(ReadPreference.secondaryPreferred()).findOne(objectId).as(User.class);
return user;
}
Pretty much the same way in Java. Note we use secondary(), not secondaryPreferred(); this guarantees reads from a secondary ONLY:
import com.mongodb.ReadPreference;

{
    // This is your "regular" primary-preferred collection:
    MongoCollection<BsonDocument> tcoll = db.getCollection("myCollection", BsonDocument.class);

    // ... various operations on tcoll, then create a new
    // handle that FORCES reads from secondary and will time out and
    // fail if no secondary can be found:
    MongoCollection<BsonDocument> xcoll = tcoll.withReadPreference(ReadPreference.secondary());
    BsonDocument f7 = xcoll.find(queryExpr).first();
}
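If you also want to verify at runtime which server actually served the read, one option is to register a CommandListener and log the server address each command is sent to. A minimal sketch, assuming the mongodb-driver-sync 4.x client (adapt the wiring to your Spring Boot setup):

import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.event.CommandFailedEvent;
import com.mongodb.event.CommandListener;
import com.mongodb.event.CommandStartedEvent;
import com.mongodb.event.CommandSucceededEvent;

public class ReadTargetLogger implements CommandListener {

    @Override
    public void commandStarted(CommandStartedEvent event) {
        // Prints e.g. "find sent to my-secondary-host:27017"
        System.out.println(event.getCommandName() + " sent to "
                + event.getConnectionDescription().getServerAddress());
    }

    @Override
    public void commandSucceeded(CommandSucceededEvent event) { }

    @Override
    public void commandFailed(CommandFailedEvent event) { }

    // Hypothetical factory method showing where the listener is registered.
    public static MongoClient buildClient(String connectionString) {
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(new ConnectionString(connectionString))
                .addCommandListener(new ReadTargetLogger())
                .build();
        return MongoClients.create(settings);
    }
}

With this in place, reads issued through the handle returned by withReadPreference(ReadPreference.secondary()) should log a secondary's address.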
I have a Flink application for click-stream collection and processing. The application consists of Kafka as the event source, a map function, and a sink, as shown in the image below:
I want to enrich the incoming click-stream data with the user's IP location, based on the userIp field in the raw event ingested from Kafka.
A simplified slice of the CSV file is shown below:
start_ip,end_ip,country
"1.1.1.1","100.100.100.100","United States of America"
"100.100.100.101","200.200.200.200","China"
I have done some research and found a couple of potential solutions:
1. Solution: Broadcast the enrichment data and connect it with the event stream, with some IP-matching logic.
1. Result: It worked well for a couple of sample IP location rows, but not with the whole CSV data. The JVM heap reached 3.5 GB, and because it is broadcast state there is no way to spill it to disk (e.g. to RocksDB).
2. Solution: Load the CSV data into state (ValueState) in the open() method of a RichFlatMapFunction before event processing starts, and enrich the event data in the flatMap method.
2. Result: Because the enrichment data is too big to store in the JVM heap, it's impossible to load it into ValueState. Also, de/serializing through ValueState is bad practice for data that is key-value in nature.
3. Solution: To avoid the JVM heap constraint, I tried to put the enrichment data into RocksDB (which uses disk) as state, with MapState.
3. Result: Trying to load the CSV file into MapState in the open() method gave me an error saying that you cannot put into MapState in open(), because I was not in a keyed context there, as in this question: Flink keyed stream key is null
4. Solution: Because of the need for a keyed context for MapState (to put it in RocksDB), I tried to load the whole CSV file into a local RocksDB instance (disk) in a process function, after turning the DataStream into a KeyedStream:
class KeyedIpProcess extends KeyedProcessFunction[Long, Event, Event] {

  var ipMapState: MapState[String, String] = _
  var csvFinishedFlag: ValueState[Boolean] = _

  override def processElement(event: Event,
                              ctx: KeyedProcessFunction[Long, Event, Event]#Context,
                              out: Collector[Event]): Unit = {

    val ipDescriptor = new MapStateDescriptor[String, String]("ipMapState", classOf[String], classOf[String])
    val csvFinishedDescriptor = new ValueStateDescriptor[Boolean]("csvFinished", classOf[Boolean])

    ipMapState = getRuntimeContext.getMapState(ipDescriptor)
    csvFinishedFlag = getRuntimeContext.getState(csvFinishedDescriptor)

    if (!csvFinishedFlag.value()) {
      val csv = new CSVParser(defaultCSVFormat)
      val fileSource = Source.fromFile("/tmp/ip.csv", "UTF-8")
      for (row <- fileSource.getLines()) {
        val Some(List(start, end, country)) = csv.parseLine(row)
        ipMapState.put(start, country)
      }
      fileSource.close()
      csvFinishedFlag.update(true)
    }

    out.collect {
      if (ipMapState.contains(event.userIp)) {
        // MapState values are plain country strings
        val details = ipMapState.get(event.userIp)
        event.copy(data =
          event.data.copy(
            ipLocation = Some(details)
          ))
      } else {
        event
      }
    }
  }
}
4. Result: It's too hacky, and the blocking file read prevents event processing.
Could you tell me what I can do in this situation?
Thanks
What you can do is to implement a custom partitioner, and load a slice of the enrichment data into each partition. There's an example of this approach here; I'll excerpt some key portions:
The job is organized like this:
DataStream<SensorMeasurement> measurements = env.addSource(new SensorMeasurementSource(100_000));

DataStream<EnrichedMeasurements> enrichedMeasurements = measurements
        .partitionCustom(new SensorIdPartitioner(), measurement -> measurement.getSensorId())
        .flatMap(new EnrichmentFunctionWithPartitionedPreloading());
The custom partitioner needs to know how many partitions there are, and deterministically assigns each event to a specific partition:
private static class SensorIdPartitioner implements Partitioner<Long> {
    @Override
    public int partition(final Long sensorMeasurement, final int numPartitions) {
        return Math.toIntExact(sensorMeasurement % numPartitions);
    }
}
And then the enrichment function takes advantage of knowing how the partitioning was done to load only the relevant slice into each instance:
public class EnrichmentFunctionWithPartitionedPreloading extends RichFlatMapFunction<SensorMeasurement, EnrichedMeasurements> {

    private Map<Long, SensorReferenceData> referenceData;

    @Override
    public void open(final Configuration parameters) throws Exception {
        super.open(parameters);
        referenceData = loadReferenceData(getRuntimeContext().getIndexOfThisSubtask(), getRuntimeContext().getNumberOfParallelSubtasks());
    }

    @Override
    public void flatMap(
            final SensorMeasurement sensorMeasurement,
            final Collector<EnrichedMeasurements> collector) throws Exception {
        SensorReferenceData sensorReferenceData = referenceData.get(sensorMeasurement.getSensorId());
        collector.collect(new EnrichedMeasurements(sensorMeasurement, sensorReferenceData));
    }

    private Map<Long, SensorReferenceData> loadReferenceData(
            final int partition,
            final int numPartitions) {
        SensorReferenceDataClient client = new SensorReferenceDataClient();
        return client.getSensorReferenceDataForPartition(partition, numPartitions);
    }
}
Note that the enrichment is not being done on a keyed stream, so you can not use keyed state or timers in the enrichment function.
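SensorReferenceDataClient is not shown in that excerpt. As a rough, hypothetical sketch of the idea (the file path, CSV layout, and String value type are all assumptions, not part of the original example), loading only this subtask's slice from a local file with the same modulo rule as the partitioner could look like this:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class PartitionedCsvLoader {

    // Hypothetical: each line of the file is "sensorId,referenceValue".
    // Keep only the rows whose key lands in this subtask's partition, using the
    // same "key % numPartitions" rule as SensorIdPartitioner above.
    public static Map<Long, String> loadSlice(String path, int partition, int numPartitions) throws IOException {
        Map<Long, String> slice = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            String[] cols = line.split(",", 2);
            long sensorId = Long.parseLong(cols[0].trim());
            if (Math.toIntExact(sensorId % numPartitions) == partition) {
                slice.put(sensorId, cols[1].trim());
            }
        }
        return slice;
    }
}

Each parallel instance then holds roughly 1/numPartitions of the reference data, which is what keeps the per-task heap usage bounded.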
I have a stored function in PostgreSQL, in which I have a PL/pgSQL statement like this:
raise notice 'Message text';
I also have a Groovy application, which uses the standard Sql class to call this function. I want the messages (raise notice) from the function to be displayed on stdout or logged by my Groovy application.
I created PoC project to test this: https://github.com/lospejos/groovy-jdbc-get-server-messages
Please find comment in Groovy file: https://github.com/lospejos/groovy-jdbc-get-server-messages/blob/master/call_db_function.groovy
I also found this: https://stackoverflow.com/a/23087861/1828296
But I can't figure out how to get the Statement object from the Sql instance.
For the benefit of others: to get server messages from a stored function, call the SQL like this:
import groovy.sql.Sql
import java.time.LocalDateTime

def sql = Sql.newInstance('jdbc:postgresql://localhost/postgres', 'postgres', 'postgres')
final String paramValue = "Param value"

sql.query("select * from testme(param => :paramValue)", [paramValue: paramValue]) { resultSet ->
    def rsRows = []
    while (resultSet.next()) {
        rsRows << resultSet.toRowResult()
    }
    // RAISE NOTICE messages arrive as SQLWarnings on the underlying Statement
    def warning = resultSet.getStatement().getWarnings()
    while (warning) {
        println "[${LocalDateTime.now()}] [${warning.getSQLState()}] ${warning.message}"
        warning = warning.nextWarning
    }
    println rsRows
}
I also updated the repository code.
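For comparison, here is the same idea in plain JDBC, since the PostgreSQL driver delivers RAISE NOTICE messages as a chain of SQLWarnings on the Statement. This is only a sketch; testme and its param are the same hypothetical function and parameter as in the Groovy example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLWarning;

public class NoticeDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/postgres", "postgres", "postgres");
             PreparedStatement ps = conn.prepareStatement("select * from testme(param => ?)")) {

            ps.setString(1, "Param value");

            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getObject(1)); // consume the result rows
                }
            }

            // RAISE NOTICE / RAISE WARNING messages show up here as a chain of SQLWarnings
            SQLWarning warning = ps.getWarnings();
            while (warning != null) {
                System.out.println("[" + warning.getSQLState() + "] " + warning.getMessage());
                warning = warning.getNextWarning();
            }
        }
    }
}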
I'm having a sudden issue with this error when I run my AWS EMR client on our deployed server. It does not happen locally, where everything runs fine. Basically, I have an EMR client that I use to build and execute steps, like so:
class EMRClient(emrClusterId: String) extends LazyLogging {

  val accessKey = ... // access key
  val secretKey = ... // secret key
  val credentials = new BasicAWSCredentials(accessKey, secretKey)
  val REGION = <my region>

  println(">>>>>>>>>>>>>>>>>>>>Initializing EMR client for clusterId " + emrClusterId + " . The region is " + REGION)

  val emr = AmazonElasticMapReduceClientBuilder
    .standard()
    .withCredentials(new AWSStaticCredentialsProvider(credentials))
    .withRegion(REGION)
    .build()

  val stepFactory = new StepFactory() // assumed; not shown in the original snippet

  def executeHQLStep(s3ScriptPath: String, stepName: String, args: String = ""): AddJobFlowStepsResult = {
    val hqlScriptStep = buildHQLStep(s3ScriptPath, stepName, args)
    val stepSet = new java.util.HashSet[StepConfig]()
    //stepSet.add(enableDebugging)
    stepSet.add(hqlScriptStep)
    executeJobFlowSteps(stepSet)
  }

  /**
   * Builds a StepConfig to be executed in a job flow for a given .hql file from S3
   * @param hqlScriptPath the location in S3 of the script file containing the script to run
   * @param args optional field for arguments for the hive script
   * @param stepName an identifier to give to EMR to name your Step
   * @return
   */
  private def buildHQLStep(hqlScriptPath: String, stepName: String, args: String = ""): StepConfig = {
    new StepConfig()
      .withName(stepName)
      .withActionOnFailure(ActionOnFailure.CANCEL_AND_WAIT)
      .withHadoopJarStep(stepFactory.newRunHiveScriptStep(hqlScriptPath, args))
  }

  private def executeJobFlowSteps(steps: java.util.Set[StepConfig]): AddJobFlowStepsResult = {
    emr.addJobFlowSteps(new AddJobFlowStepsRequest()
      .withJobFlowId(emrClusterId)
      .withSteps(steps)) // where the error is thrown
  }
}
However, when this class is instantiated on the server, none of the println statements at the top show up; my executeJobFlowSteps method is called and throws this error:
java.lang.NoSuchFieldError: SIGNING_REGION
at com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient.executeAddJobFlowSteps(AmazonElasticMapReduceClient.java:439)
at com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient.addJobFlowSteps(AmazonElasticMapReduceClient.java:421)
at emrservices.EMRClient.executeJobFlowSteps(EMRClient.scala:64)
at emrservices.EMRClient.executeHQLStep(EMRClient.scala:44)
This project is composed of several sub-projects. Similar issues to mine have been attributed to mismatched AWS dependencies, but across the board all of the projects have this in their build.sbt library dependencies: "com.amazonaws" % "aws-java-sdk" % "1.11.286"
Any idea on what the issue is?
It looks like you are mixing the 1.11.286 version of aws-java-sdk with an older version (1.11.149) of aws-java-sdk-core. The newer client uses a field that was added to the core module, but since your core module is out of date you are seeing the NoSuchFieldError. Can you ensure all of your dependencies are in sync with one another?
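One quick way to confirm what is actually on the server's classpath is to log the SDK core version, and the jar it was loaded from, at startup. This is just a diagnostic sketch using standard aws-java-sdk-core classes:

import com.amazonaws.AmazonWebServiceClient;
import com.amazonaws.util.VersionInfoUtils;

public class AwsSdkVersionCheck {
    public static void main(String[] args) {
        // Version string baked into aws-java-sdk-core; should print 1.11.286 if the core module matches
        System.out.println("aws-java-sdk-core version: " + VersionInfoUtils.getVersion());

        // Which jar the core classes were actually loaded from (may be null for some classloaders)
        java.security.CodeSource source =
                AmazonWebServiceClient.class.getProtectionDomain().getCodeSource();
        System.out.println("core loaded from: " + (source != null ? source.getLocation() : "unknown"));
    }
}

If this prints a core version older than 1.11.286 on the deployed server, something in that deployment (an uber-jar, a provided dependency, or an sbt eviction) is pulling in the stale aws-java-sdk-core.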