How do you write to HDFS from beam? - apache-beam

I'm trying to write a Beam pipeline that runs using the SparkRunner, reads from a local file, and writes to HDFS.
Here's a minimal example:
Options class -
package com.mycompany.beam.hdfsIOIssue;
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.Validation;
public interface WritingToHDFSOptions extends PipelineOptions, SparkPipelineOptions, HadoopFileSystemOptions {
@Validation.Required
@Description("Path of the local file to read from")
String getInputFile();
void setInputFile(String value);
@Validation.Required
@Description("Path of the HDFS to write to")
String getOutputFile();
void setOutputFile(String value);
}
Beam main class -
package com.mycompany.beam.hdfsIOIssue;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;
public class WritingToHDFS {
public static void main(String[] args) {
PipelineOptionsFactory.register(WritingToHDFSOptions.class);
WritingToHDFSOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(WritingToHDFSOptions.class);
Pipeline p = Pipeline.create(options);
buildPipeline(p, options);
p.run();
}
static void buildPipeline(Pipeline p, WritingToHDFSOptions options) {
PCollection<String> input = p.apply("ReadLines", TextIO.read().from(options.getInputFile()));
ResourceId resource = FileBasedSink.convertToFileResourceIfPossible(options.getOutputFile());
TextIO.Write write = TextIO.write().to(resource);
input.apply("WriteLines", write);
}
}
Running it like:
spark-submit test --master yarn --deploy-mode cluster --class com.mycompany.beam.hdfsIOIssue.WritingToHDFS my-project-bundled-0.1-SNAPSHOT.jar --runner=SparkRunner --inputFile=testInput --outputFile=hdfs://testOutput
What I expect to happen: It reads the lines in the local testInput file and writes them to a new file named testOutput in my hdfs home directory.
What actually happens: Nothing, as far as I can tell. Spark says the job completes successfully and I see the Beam steps in the logs, but there isn't a file or directory named testOutput written to hdfs or to my local directory. Maybe it's getting written locally on the spark executor nodes but I don't have access to those to check.
I'm guessing either that I'm using the TextIO interface wrong or that I need to do more to configure the Filesystem, not just add it to my PipelineOptions interface. But I can't find documentation that explains how to do that.

I think your options should look something like the following:
--inputFile=hdfs:///testInput --outputFile=hdfs:///testOutput
You might also want to wait until the pipeline is finished so you can see the result:
p.run().waitUntilFinish();
You can find a simple, complete working example of an HDFS write (Avro files) here.
Please be aware of BEAM-2277, which might also apply depending on the version of Beam you are running (it would throw an error). You can work around it using:
TextIO.Write write = TextIO.write().to(resource)
.withTempDirectory(FileSystems.matchNewResource("hdfs:///tmp/beam-test", true));
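Putting those pieces together, buildPipeline would look roughly like this (just a sketch: hdfs:///tmp/beam-test is an assumed scratch directory, and the input/output paths come from the --inputFile/--outputFile options above):
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

static void buildPipeline(Pipeline p, WritingToHDFSOptions options) {
    // e.g. --inputFile=hdfs:///testInput
    PCollection<String> input =
        p.apply("ReadLines", TextIO.read().from(options.getInputFile()));

    // e.g. --outputFile=hdfs:///testOutput; keeping Beam's temp files on HDFS
    // as well works around the BEAM-2277 issue on affected versions
    TextIO.Write write = TextIO.write()
        .to(FileBasedSink.convertToFileResourceIfPossible(options.getOutputFile()))
        .withTempDirectory(FileSystems.matchNewResource("hdfs:///tmp/beam-test", true));

    input.apply("WriteLines", write);
}
In main() you would then call p.run().waitUntilFinish() as mentioned above, so the driver blocks until the Spark job has actually written the output.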
If you package up your project on a public GitHub repo I will test it and help you get running.

Related

How to deploy a Google dataflow worker with a file loaded into memory?

I am trying to deploy Google Dataflow streaming for use in my machine learning streaming pipeline, but cannot seem to deploy the worker with a file already loaded into memory. Currently, I have setup the job to pull a pickle file from a GCS bucket, load it into memory, and use it for model prediction. But this is executed on every cycle of the job, i.e. pull from GCS every time a new object enters the dataflow pipeline - meaning that the current execution of the pipeline is much slower than it needs to be.
What I really need is a way to allocate a variable within the worker nodes on setup of each worker, and then use that variable within the pipeline without having to re-load it on every execution of the pipeline.
Is there a way to do this step before the job is deployed, something like
with open('model.pkl', 'rb') as file:
    pickle_model = pickle.load(file)
But within my setup.py file?
##### based on - https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py
"""Setup.py module for the workflow's worker utilities.
All the workflow related code is gathered in a package that will be built as a
source distribution, staged in the staging area for the workflow being run and
then installed in the workers when they start running.
This behavior is triggered by specifying the --setup_file command line option
when running the workflow for remote execution.
"""
# pytype: skip-file
from __future__ import absolute_import
from __future__ import print_function
import subprocess
from distutils.command.build import build as _build # type: ignore
import setuptools
# This class handles the pip install mechanism.
class build(_build):  # pylint: disable=invalid-name
    """A build command class that will be invoked during package install.
    The package built using the current setup.py will be staged and later
    installed in the worker using `pip install package'. This class will be
    instantiated during install for this specific scenario and will trigger
    running the custom commands specified.
    """
    sub_commands = _build.sub_commands + [('CustomCommands', None)]
CUSTOM_COMMANDS = [
    ['pip', 'install', 'scikit-learn==0.23.1'],
    ['pip', 'install', 'google-cloud-storage'],
    ['pip', 'install', 'mlxtend'],
]
class CustomCommands(setuptools.Command):
    """A setuptools Command class able to run arbitrary commands."""

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def RunCustomCommand(self, command_list):
        print('Running command: %s' % command_list)
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        # Can use communicate(input='y\n'.encode()) if the command run requires
        # some confirmation.
        stdout_data, _ = p.communicate()
        print('Command output: %s' % stdout_data)
        if p.returncode != 0:
            raise RuntimeError(
                'Command %s failed: exit code: %s' % (command_list, p.returncode))

    def run(self):
        for command in CUSTOM_COMMANDS:
            self.RunCustomCommand(command)
REQUIRED_PACKAGES = [
    'google-cloud-storage',
    'mlxtend',
    'scikit-learn==0.23.1',
]

setuptools.setup(
    name='ML pipeline',
    version='0.0.1',
    description='ML set workflow package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    })
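For reference, this setup.py only takes effect when the job is submitted with the --setup_file pipeline option, along these lines (the script name and GCP values below are placeholders):
python ml_pipeline.py --runner DataflowRunner --project my-gcp-project --temp_location gs://my-bucket/tmp --setup_file ./setup.py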
Snippet of current ML load mechanism:
class MlModel(beam.DoFn):
    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd

    def process(self, element):
        if self._model is None:
            bucket = self._storage.Client().get_bucket(myBucket)
            blob = bucket.get_blob(myBlob)
            self._model = self._pkl.loads(blob.download_as_string())
        new_df = self._pd.read_json(element, orient='records').iloc[:, 3:-1]
        predict = self._model.predict(new_df)
        df = self._pd.DataFrame(data=predict, columns=["A", "B"])
        A = df.iloc[0]['A']
        B = df.iloc[0]['B']
        d = {'A': A, 'B': B}
        return [d]
You can use the setup() method in your MlModel DoFn, where you can load your model, and then use it in your process() method. The setup() method is called once per worker initialization.
I had written a similar answer here
HTH
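For illustration, here is a minimal sketch of the MlModel DoFn above reworked to load the model in setup() (it reuses the myBucket/myBlob names from the question):
import apache_beam as beam

class MlModel(beam.DoFn):
    def setup(self):
        # Called once when the worker initializes this DoFn instance,
        # so the model is pulled from GCS once rather than per element.
        from google.cloud import storage
        import pickle as pkl
        bucket = storage.Client().get_bucket(myBucket)
        blob = bucket.get_blob(myBlob)
        self._model = pkl.loads(blob.download_as_string())

    def process(self, element):
        import pandas as pd
        new_df = pd.read_json(element, orient='records').iloc[:, 3:-1]
        predict = self._model.predict(new_df)
        df = pd.DataFrame(data=predict, columns=["A", "B"])
        return [{'A': df.iloc[0]['A'], 'B': df.iloc[0]['B']}]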

Loading JDBC connector in Eclipse Jython plugin

I'm working with an Eclipse-based tool which provides an interactive Jython shell for scripting and data analysis against an internal data model.
I'm trying to write a script which exports results to some form of database, so I'm trying to use the built-in com.ziclix.python.sql package in Jython to provide the interface and the Xerial JDBC connector for SQLite (https://github.com/xerial/sqlite-jdbc) to provide the backend.
The script below works perfectly when run outside of the third-party tool using a standard command line Jython interpreter, including relying on the importJar() hack which is commonly used to work around Jython not always using the user CLASSPATH when run using java -jar <blah>:
from com.ziclix.python.sql import zxJDBC
from java.net import URL, URLClassLoader
from java.lang import ClassLoader
from java.io import File
JDBC_URL = "jdbc:sqlite:test.db"
JDBC_DRIVER = "org.sqlite.JDBC"
JDBC_JAR = "E:/sqlite-jdbc-3.21.0.jar"
# Import Jar file into local class path
def importJar(jarFile):
    m = URLClassLoader.getDeclaredMethod("addURL", [URL])
    m.accessible = 1
    m.invoke(ClassLoader.getSystemClassLoader(), [File(jarFile).toURL()])

def main():
    try:
        importJar(JDBC_JAR)
        dbConn = zxJDBC.connect(JDBC_URL, None, None, JDBC_DRIVER)
        cursor = dbConn.cursor()
        # Do something useful
        cursor.close()
        dbConn.close()
    except zxJDBC.DatabaseError, msg:
        print msg

if __name__ == '__main__':
    main()
... but fails when run from the plugin inside Eclipse, where the zxJDBC.connect() call errors with:
driver [org.sqlite.JDBC] not found
If I add the Jar file to the JYTHONPATH environment variable I can do import org.sqlite.JDBC successfully in the Python script, but the connect call still fails in the Java side of the JDBC driver manager.
For the sake of completeness, the full path to the Jar file is on the CLASSPATH, PYTHONPATH, and JYTHONPATH environment variables ...
Any ideas?

spark - hadoop argument

I am running both Hadoop and Spark, and I want to use files from HDFS as an argument on spark-submit, so I made a folder in HDFS with the files,
e.g. /user/hduser/test/input
and I want to run spark-submit like this:
$SPARK_HOME/bin/spark-submit --master spark://admin:7077 ./target/scala-2.10/test_2.10-1.0.jar hdfs://user/hduser/test/input
but I can't make it work. What's the right way to do it?
The error I am getting is:
WARN FileInputDStream: Error finding new files
java.lang.NullPointerException
First check whether you are able to access HDFS from your Spark code. If yes, then you need to add the following imports to your Scala code:
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkFiles
Then add the following lines in your code:
// Build a Hadoop configuration and get a handle on the file system
var hadoopConf = new org.apache.hadoop.conf.Configuration()
var fileSystem = FileSystem.get(hadoopConf)
// The HDFS path passed as the first program argument
var path = new Path(args(0))
Actually the problem was the path: I had to use hdfs://localhost:9000/user/hduser/...
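For completeness, the working invocation then looks something like this (assuming the NameNode runs on localhost:9000, as in the comment above; the jar and master URL are taken from the question):
$SPARK_HOME/bin/spark-submit --master spark://admin:7077 ./target/scala-2.10/test_2.10-1.0.jar hdfs://localhost:9000/user/hduser/test/input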

Postgres PL/JAVA: java.lang.ClassNotFoundException error after loading JAR file in database

I am getting a java.lang.ClassNotFoundException error inside Postgres when running a function that calls a JAR file I have loaded. I have installed and configured PL/JAVA (including the delivered examples) in my database and can run the examples successfully. I am now attempting to load/install my first JAR, but I am doing something wrong.
My host controls the OS version: CentOS 6.8. Postgres is version 8.4.
I am attempting to install my own very simple java class, which is a derivative of the delivered example Parameters.addOne class. All my code is in /tmp. Here are the steps I've followed:
Doug.java:
package com.msmetric;
import java.math.BigDecimal;
import java.sql.Date;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Time;
import java.sql.Timestamp;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.TimeZone;
import java.util.logging.Logger;
public class Doug {
public static int addOne(int value) {
return value + 1;
}
}
Compiling Doug.java using 'javac Doug.java' succeeds.
Creating the JAR file with the Doug.class file in it using 'jar -cvf Doug.jar Doug.class' works fine.
Now I load the JAR file into Postgres (public schema), change the classpath, create the function that calls the JAR, then attempt to run at psql prompt.
Run sqlj.install_jar from psql:
select sqlj.install_jar('file:/tmp/Doug.jar','Doug',false);
Set the classpath inside Postgres (from psql prompt postgres=#):
select sqlj.set_classpath('public','Doug');
Create the function that calls the JAR. This create function code is taken directly from the examples.ddr file that came with PL/JAVA. I simply changed org.postgres to com.msmetric.
create or replace function addone(int) returns int as 'com.msmetric.Doug.addOne(java.lang.Integer)' language java;
Now with the JAR loaded and function created, I attempt to run it. This function should simply add 1 to the number provided.
select addone(3);
Results:
ERROR: java.lang.ClassNotFoundException: com.msmetric.Doug
Thoughts?
I'm very sorry I didn't see your question sooner. Underneath all the exotic details (PostgreSQL, PL/Java, schemas, classpaths...), there's just a bit of basic Java going on here: if a jar file contains a class Doug.class in package com.msmetric, its path within the jar has to reflect that: it has to be com/msmetric/Doug.class. Otherwise, it won't be found.
You can set up that whole structure step by step:
javac Doug.java
mkdir com
mkdir com/msmetric
mv Doug.class com/msmetric/
jar -cvf Doug.jar com/msmetric/Doug.class
Or, you can let javac do more of the work for you:
mkdir classes
javac -d classes Doug.java
jar -cvf Doug.jar -C classes .
When you give javac a -d directory option, instead of just writing class files next to their .java sources, it will put them all in their proper places under the directory you named, and then you can just tell jar to change into that directory and slurp them all up (don't overlook the . at the end of that jar command).
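To double-check the layout, listing the jar contents with jar -tf Doug.jar should show the entry as com/msmetric/Doug.class.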
Once you fix that, if you retry your original steps, you'll see that you now get a different error:
ERROR: Unable to find static method com.msmetric.Doug.addOne with signature (Ljava/lang/Integer;)I
That happens because you declared the function in Doug.java with int addOne(int value) (that is, taking a primitive int argument), but you declared it in SQL with returns int as 'com.msmetric.Doug.addOne(java.lang.Integer)' taking an Integer object.
Once you correct that:
create or replace function addone(int) returns int as 'com.msmetric.Doug.addOne(int)' language java;
you'll be able to see:
# select addone(3);
addone
--------
4
(1 row)
If you happen to see this belated answer, may I ask what version of PL/Java you are using? That's one detail you didn't mention. If it is older than 1.5.0, there are newer features that can help you out. For one, you can just annotate that function:
@Function
public static int addOne(int value) {
return value + 1;
}
and have javac spit out not only the Doug.class file but also a pljava.ddr file with your SQL function declaration already written correctly (no mixing up argument types!). There is a way to include that .ddr file into the jar you create so that you can just call sqlj.install_jar with the last parameter true so it runs the commands in the .ddr and your functions are ready to use. There's a Hello, world example in the docs that shows more of how it's done.
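For reference, the fully annotated class might look roughly like this (a sketch assuming PL/Java 1.5.0 or newer, where the annotation lives in org.postgresql.pljava.annotation):
package com.msmetric;

import org.postgresql.pljava.annotation.Function;

public class Doug {
    // PL/Java's annotation processing emits the matching SQL declaration
    // into a pljava.ddr file at compile time.
    @Function
    public static int addOne(int value) {
        return value + 1;
    }
}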
Cheers,
-Chap

How to execute JMeter test case from Java code

How do I run a JMeter test case from Java code?
I have followed the example Here from Blazemeter.com
My code is as follows:
public class BasicSampler {
public static void main(String[] argv) throws Exception {
// JMeter Engine
StandardJMeterEngine jmeter = new StandardJMeterEngine();
// Initialize Properties, logging, locale, etc.
JMeterUtils.loadJMeterProperties("/home/stone/Workbench/automated-testing/apache-jmeter-2.11/bin/jmeter.properties");
JMeterUtils.setJMeterHome("/home/stone/Workbench/automated-testing/apache-jmeter-2.11");
JMeterUtils.initLogging(); // you can comment this line out to see extra log messages, e.g. at DEBUG level
JMeterUtils.initLocale();
// Initialize JMeter SaveService
SaveService.loadProperties();
// Load existing .jmx Test Plan
FileInputStream in = new FileInputStream("/home/stone/Workbench/automated-testing/apache-jmeter-2.11/bin/examples/CSVSample.jmx");
HashTree testPlanTree = SaveService.loadTree(in);
in.close();
// Run JMeter Test
jmeter.configure(testPlanTree);
jmeter.run();
}
}
but I keep getting the following messages in the console and my test never executes.
INFO 2014-09-23 12:04:40.492 [jmeter.e] (): Listeners will be started after enabling running version
INFO 2014-09-23 12:04:40.511 [jmeter.e] (): To revert to the earlier behaviour, define jmeterengine.startlistenerslater=false
I have also tried uncommenting jmeterengine.startlistenerslater=false in the jmeter.properties file.
How do you know that your "test never executes"?
What is in the jmeter.log file (it should be in the root of your project)? Alternatively, comment out the JMeterUtils.initLogging() line to see the full output on STDOUT.
Have you changed the relative path to CSVSample_user.csv in the "Get user details" CSV Data Set Config? It may resolve to a different location, as recommended in Using CSV DATA SET CONFIG.
Is a CSVSample.jtl file generated anywhere (again, it should be in the root of your project by default)? What is in it?
The code looks good, and I'm pretty sure the problem is the path to the CSVSample_user.csv file and that you have something like java.io.FileNotFoundException in your log. Please double-check that the CSVSample.jmx file contains a valid full path to CSVSample_user.csv.
UPDATE TO ANSWER QUESTIONS IN COMMENTS
The jmeter.log file should be under your Eclipse workspace folder by default.
Looking into CSVSample.jmx, there is a View Results in Table listener which is configured to store results under ~/CSVSample.jtl.
If you want to see summariser messages and "classic" .jtl reporting, add the next few lines before the jmeter.configure(testPlanTree); call:
Summariser summer = null;
String summariserName = JMeterUtils.getPropDefault("summariser.name", "summary");
if (summariserName.length() > 0) {
summer = new Summariser(summariserName);
}
String logFile = "/path/to/jtl/results/file.jtl";
ResultCollector logger = new ResultCollector(summer);
logger.setFilename(logFile);
testPlanTree.add(testPlanTree.getArray()[0], logger);
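With that in place, the existing jmeter.configure(testPlanTree); and jmeter.run(); calls stay as they are; the Summariser prints periodic summary lines (to the console and/or log, depending on your summariser.* properties) and the per-sample results end up in logFile.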
Try using the jmeter-java-dsl library: https://github.com/abstracta/jmeter-java-dsl.
It supports implementing JMeter tests as Java code.
The example below shows how to implement and execute a test for a REST API; the same approach can be applied to other types of tests as well.
@Test
public void testPerformance() throws IOException {
TestPlanStats stats = testPlan(
threadGroup(2, 10,
httpSampler("http://my.service")
.post("{\"name\": \"test\"}", Type.APPLICATION_JSON)
),
//this is just to log details of each request stats
jtlWriter("test" + Instant.now().toString().replace(":", "-") + ".jtl")
).run();
assertThat(stats.overall().elapsedTimePercentile99()).isLessThan(Duration.ofSeconds(5));
}
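Note that the snippet assumes the jmeter-java-dsl dependency and its static imports (e.g. import static us.abstracta.jmeter.javadsl.JmeterDsl.*;), plus JUnit and AssertJ for @Test and assertThat; check the project README for the exact artifact coordinates and current version.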