Dataflow uploaded file encoding error

My development environment is Eclipse Oxygen with Google Cloud Tools for Eclipse 1.7.0 installed.
I created a Google Cloud Dataflow Java project and ran into a problem while testing the WordCount example.
When reading a file from the bucket, the log output looks correct.
The problem occurs when the WordCount data is processed and stored back in the bucket: if I check the saved file, the text is garbled as in the picture above.
Does Dataflow not support Korean?
Here is my TextIO.write code:
static class WriteData extends PTransform<PCollection<KV<URI, String>>, PDone> {

    private String output;

    public WriteData(String output) {
        this.output = output;
    }

    @Override
    public Coder<?> getDefaultOutputCoder() {
        return KvCoder.of(StringDelegateCoder.of(URI.class), StringUtf8Coder.of());
    }

    @Override
    public PDone expand(PCollection<KV<URI, String>> outputfile) {
        return outputfile
            .apply(ParDo.of(new DoFn<KV<URI, String>, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    output = c.element().getKey().toString();
                    LOG.info("WRITE DATA : " + c.element().getValue());
                    c.output(c.element().getValue());
                }
            }))
            .apply(TextIO.write().to(output).withSuffix(".txt"));
    }
}

Most of the time the correct coder can be inferred automatically, but when it can't, make sure you specify a coder explicitly.
You typically need to specify the coder when reading data into your pipeline from an external source (or creating pipeline data from local data), and also when you output pipeline data to an external sink.
For example, you can decode the bytes you read with:
StringUtf8Coder.of().decode(inStream)
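As a concrete illustration, here is a minimal sketch (the bucket paths and transform names are placeholders, not taken from the question) of making the UTF-8 string coder explicit on the PCollection before writing, so the pipeline does not fall back to an inferred coder with a different charset:

PCollection<String> lines = pipeline
    .apply("ReadLines", TextIO.read().from("gs://my-bucket/input.txt"));
// TextIO reads and writes UTF-8 text; setting the coder explicitly rules out
// an unexpected inferred coder on the intermediate PCollection.
lines.setCoder(StringUtf8Coder.of());
lines.apply("WriteLines", TextIO.write().to("gs://my-bucket/output").withSuffix(".txt"));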

Related

How to run BigQueryIO.read().fromQuery with parameters

I need to run multiple queries from a single .SQL file but with different params.
I've tried something like this, but it does not work because BigQueryIO.Read consumes only a PBegin.
public PCollection<KV<String, TestDitoDto>> expand(PCollection<QueryParamsBatch> input) {
    PCollection<KV<String, Section1Dto>> section1 = input
        .apply("Read Section1 from BQ",
            BigQueryIO
                .readTableRows()
                .fromQuery(ResourceRetriever.getResourceFile("query/test/section1.sql"))
                .usingStandardSql()
                .withoutValidation())
        .apply("Convert section1 to Dto", ParDo.of(new TableRowToSection1DtoFunction()));
}
Are there any other ways to put params from existing PCollection inside my BigQueryIO.read() invocation?
Are the different queries/parameters available at pipeline construction time? If so, you could just create multiple read transforms and combine the results, for example using a Flatten transform.
The Beam Java BigQuery source does not currently support reading a PCollection of queries; the Python BigQuery source does.
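A minimal sketch of that suggestion, assuming the queries are known when the pipeline is built (the query strings and transform names below are placeholders):

PCollection<TableRow> rows1 = pipeline.apply("Read query 1",
    BigQueryIO.readTableRows().fromQuery("SELECT /* query 1 */ ...").usingStandardSql());
PCollection<TableRow> rows2 = pipeline.apply("Read query 2",
    BigQueryIO.readTableRows().fromQuery("SELECT /* query 2 */ ...").usingStandardSql());
// Merge the per-query results into a single PCollection.
PCollection<TableRow> all = PCollectionList.of(rows1).and(rows2)
    .apply("Flatten results", Flatten.pCollections());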
I've come up with the following solution: instead of BigQueryIO, use the regular GCP client library for BigQuery, mark the client field as transient (since it is not Serializable), and initialize it in a method annotated with @Setup:
public class DenormalizedCase1Fn extends DoFn<*> {

    private transient BigQuery bigQuery;

    @Setup
    public void initialize() {
        this.bigQuery = BigQueryOptions.newBuilder()
            .setProjectId(bqProjectId.get())
            .setLocation(LOCATION)
            .setRetrySettings(RetrySettings.newBuilder()
                .setRpcTimeoutMultiplier(1.5)
                .setInitialRpcTimeout(Duration.ofSeconds(5))
                .setMaxRpcTimeout(Duration.ofSeconds(30))
                .setMaxAttempts(3).build())
            .build().getService();
    }

    @ProcessElement
    ...
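For completeness, here is a hedged sketch of what the omitted @ProcessElement body could look like, issuing one parameterized query per element with the plain BigQuery client. The element type, query text, parameter name, and column name are all made up for illustration and are not from the original answer:

@ProcessElement
public void processElement(ProcessContext c) throws InterruptedException {
    // Hypothetical query, parameterized with the current element.
    QueryJobConfiguration queryConfig = QueryJobConfiguration
        .newBuilder("SELECT name FROM `my_dataset.my_table` WHERE id = @id")
        .addNamedParameter("id", QueryParameterValue.string(c.element().toString()))
        .setUseLegacySql(false)
        .build();
    // Run the query and emit one output per result row (hypothetical column "name").
    for (FieldValueList row : bigQuery.query(queryConfig).iterateAll()) {
        c.output(row.get("name").getStringValue());
    }
}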

How should I define Flink's Schema to read Protocol Buffer data from Pulsar

I am using Pulsar-Flink to read data from Pulsar in Flink, and I am having difficulty when the data's format is Protocol Buffers.
On its GitHub top page, Pulsar-Flink uses SimpleStringSchema, but that does not appear to handle Protocol Buffers. Does anyone know how to deal with this data format? How should I define the schema?
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("topic", "test-source-topic");

FlinkPulsarSource<String> source =
    new FlinkPulsarSource<>(serviceUrl, adminUrl, new SimpleStringSchema(), props);
DataStream<String> stream = see.addSource(source);

// chain operations on dataStream of String and sink the output
// end method chaining

see.execute();
FYI, I am writing Scala code, so an explanation for Scala (rather than Java) would be especially helpful, but any kind of advice is welcome, including Java!
You should implement your own DeserializationSchema. Let's assume that you have a protobuf message Address and have generated the respective Java class. Then the schema should look like the following:
public class ProtoDeserializer implements DeserializationSchema<Address> {

    @Override
    public TypeInformation<Address> getProducedType() {
        return TypeInformation.of(Address.class);
    }

    @Override
    public Address deserialize(byte[] message) throws IOException {
        return Address.parseFrom(message);
    }

    @Override
    public boolean isEndOfStream(Address nextElement) {
        return false;
    }
}
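Wiring it into the source from the question would then look roughly like this, a sketch under the assumption that the connector accepts a DeserializationSchema the same way it accepts SimpleStringSchema in the snippet above:

// Replace SimpleStringSchema with the custom protobuf schema; serviceUrl,
// adminUrl and props are the same as in the question's snippet.
FlinkPulsarSource<Address> source =
    new FlinkPulsarSource<>(serviceUrl, adminUrl, new ProtoDeserializer(), props);
DataStream<Address> stream = see.addSource(source);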

Drilling down model objects on Swagger

My application is a RESTful API integrated with Swagger and OpenAPI.
I have generated all the Java stubs from an OpenAPI YAML file and everything is working fine.
But when I try to drill down into model objects on Swagger, it cannot locate some of the objects, even though they are part of the project and the project compiles fine.
As shown in the screenshot below, the drill-down fails to locate the Configuration object.
Any ideas on how to resolve this?
Edit:
I have a RESTful web service and I generate all the Java stubs [data transfer objects] from a YAML file using the openapi-generator plugin. This plugin automatically generates a class OpenAPIDocumentationConfig; the details of the class are below. After this setup, models are automatically generated in Swagger UI.
I also want to add that I am using OpenAPI 3.0, but I need to split object definitions into multiple files. So I am referring to them using definitions, as I don't believe component schemas can be split across multiple files.
@Configuration
@EnableSwagger2
public class OpenAPIDocumentationConfig {

    ApiInfo apiInfo() {
        return new ApiInfoBuilder()
            .title("ABC Service")
            .description("ABC Service")
            .license("")
            .licenseUrl("http://unlicense.org")
            .termsOfServiceUrl("")
            .version("1.0.0")
            .contact(new Contact("", "", "xyz@abc.com"))
            .build();
    }

    @Bean
    public Docket customImplementation(ServletContext servletContext,
            @Value("${openapi.studioVALService.base-path:}") String basePath) {
        return new Docket(DocumentationType.SWAGGER_2)
            .select()
            .apis(RequestHandlerSelectors.basePackage("com.x.y.z"))
            .build()
            .pathProvider(new BasePathAwareRelativePathProvider(servletContext, basePath))
            .directModelSubstitute(java.time.LocalDate.class, java.sql.Date.class)
            .directModelSubstitute(java.time.OffsetDateTime.class, java.util.Date.class)
            .apiInfo(apiInfo());
    }

    class BasePathAwareRelativePathProvider extends RelativePathProvider {

        private String basePath;

        public BasePathAwareRelativePathProvider(ServletContext servletContext, String basePath) {
            super(servletContext);
            this.basePath = basePath;
        }

        @Override
        protected String applicationPath() {
            return Paths.removeAdjacentForwardSlashes(
                UriComponentsBuilder.fromPath(super.applicationPath()).path(basePath).build().toString());
        }

        @Override
        public String getOperationPath(String operationPath) {
            UriComponentsBuilder uriComponentsBuilder = UriComponentsBuilder.fromPath("/");
            return Paths.removeAdjacentForwardSlashes(
                uriComponentsBuilder.path(operationPath.replaceFirst("^" + basePath, "")).build().toString());
        }
    }
}
EDIT 2:
I moved all definitions to components/schemas, but they are still split across multiple files and refer to components in other files, and I still get the same error.
If you are using OpenAPI 3, you should put schemas that you want to reuse inside components. To refer to one, you must use a $ref like:
$ref: "#/components/schemas/EquityOptionConfigurationDO"

where to put images uploaded to be viewed in browser [duplicate]

I read here that one should not save the file in the server anyway, as it is not portable or transactional and requires external parameters. However, given that I need a temporary solution for Tomcat (7) and that I have (relative) control over the server machine, I want to know:
What is the best place to save the file? Should I save it in /WEB-INF/uploads (advised against here), someplace under $CATALINA_BASE (see here), or ...? The Java EE 6 tutorial gets the path from the user (:wtf:). NB: The file should not be downloadable by any means.
Should I set up a config parameter as detailed here? I'd appreciate some code (I'd rather give it a relative path, so it is at least Tomcat portable). Part.write() looks promising, but apparently needs an absolute path.
I'd be interested in an exposition of the disadvantages of this approach vs a database/JCR repository one.
Unfortunately the FileServlet by @BalusC concentrates on downloading files, while his answer on uploading files skips the part about where to save the file.
A solution easily convertible to use a DB or a JCR implementation (like jackrabbit) would be preferable.
Store it anywhere in an accessible location except the IDE's project folder, aka the server's deploy folder, for the reasons mentioned in the answer to Uploaded image only available after refreshing the page:
Changes in the IDE's project folder do not immediately get reflected in the server's work folder. There's a kind of background job in the IDE which takes care that the server's work folder gets synced with the latest updates (this is in IDE terms called "publishing"). This is the main cause of the problem you're seeing.
In real world code there are circumstances where storing uploaded files in the webapp's deploy folder will not work at all. Some servers do not (either by default or by configuration) expand the deployed WAR file into the local disk file system, but instead keep it fully in memory. You can't create new files in memory without basically editing the deployed WAR file and redeploying it.
Even when the server expands the deployed WAR file into the local disk file system, all newly created files will get lost on a redeploy or even a simple restart, simply because those new files are not part of the original WAR file.
It really doesn't matter to me or anyone else where exactly on the local disk file system it will be saved, as long as you do not ever use the getRealPath() method. Using that method is in any case alarming.
The path to the storage location can in turn be defined in many ways. You have to do it all by yourself. Perhaps this is where your confusion is caused, because you somehow expected that the server does it all automagically. Please note that @MultipartConfig(location) does not specify the final upload destination, but the temporary storage location for the case the file size exceeds the memory storage threshold.
So, the path to the final storage location can be defined in any of the following ways:
Hardcoded:
File uploads = new File("/path/to/uploads");
Environment variable via SET UPLOAD_LOCATION=/path/to/uploads:
File uploads = new File(System.getenv("UPLOAD_LOCATION"));
VM argument during server startup via -Dupload.location="/path/to/uploads":
File uploads = new File(System.getProperty("upload.location"));
*.properties file entry as upload.location=/path/to/uploads:
File uploads = new File(properties.getProperty("upload.location"));
web.xml <context-param> with name upload.location and value /path/to/uploads:
File uploads = new File(getServletContext().getInitParameter("upload.location"));
If any, use the server-provided location, e.g. in JBoss AS/WildFly:
File uploads = new File(System.getProperty("jboss.server.data.dir"), "uploads");
Either way, you can easily reference and save the file as follows:
File file = new File(uploads, "somefilename.ext");
try (InputStream input = part.getInputStream()) {
Files.copy(input, file.toPath());
}
Or, when you want to autogenerate a unique file name to prevent users from overwriting existing files with coincidentally the same name:
File file = File.createTempFile("somefilename-", ".ext", uploads);
try (InputStream input = part.getInputStream()) {
Files.copy(input, file.toPath(), StandardCopyOption.REPLACE_EXISTING);
}
How to obtain part in JSP/Servlet is answered in How to upload files to server using JSP/Servlet? and how to obtain part in JSF is answered in How to upload file using JSF 2.2 <h:inputFile>? Where is the saved File?
Note: do not use Part#write(), as it interprets the path relative to the temporary storage location defined in @MultipartConfig(location). Also make absolutely sure that you aren't corrupting binary files such as PDF or image files by converting bytes to characters during reading/writing through incorrectly using a Reader/Writer instead of an InputStream/OutputStream.
See also:
How to save uploaded file in JSF (JSF-targeted, but the principle is pretty much the same)
Simplest way to serve static data from outside the application server in a Java web application (in case you want to serve it back)
How to save generated file temporarily in servlet based web application
I'm posting my final way of doing it, based on the accepted answer:
@SuppressWarnings("serial")
@WebServlet("/")
@MultipartConfig
public final class DataCollectionServlet extends Controller {

    private static final String UPLOAD_LOCATION_PROPERTY_KEY = "upload.location";
    private String uploadsDirName;

    @Override
    public void init() throws ServletException {
        super.init();
        uploadsDirName = property(UPLOAD_LOCATION_PROPERTY_KEY);
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // ...
    }

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        Collection<Part> parts = req.getParts();
        for (Part part : parts) {
            File save = new File(uploadsDirName, getFilename(part) + "_"
                    + System.currentTimeMillis());
            final String absolutePath = save.getAbsolutePath();
            log.debug(absolutePath);
            part.write(absolutePath);
            sc.getRequestDispatcher(DATA_COLLECTION_JSP).forward(req, resp);
        }
    }

    // helpers
    private static String getFilename(Part part) {
        // courtesy of BalusC : http://stackoverflow.com/a/2424824/281545
        for (String cd : part.getHeader("content-disposition").split(";")) {
            if (cd.trim().startsWith("filename")) {
                String filename = cd.substring(cd.indexOf('=') + 1).trim()
                        .replace("\"", "");
                return filename.substring(filename.lastIndexOf('/') + 1)
                        .substring(filename.lastIndexOf('\\') + 1); // MSIE fix.
            }
        }
        return null;
    }
}
where:
@SuppressWarnings("serial")
class Controller extends HttpServlet {

    static final String DATA_COLLECTION_JSP = "/WEB-INF/jsp/data_collection.jsp";
    static ServletContext sc;
    Logger log;

    // private
    // "/WEB-INF/app.properties" also works...
    private static final String PROPERTIES_PATH = "WEB-INF/app.properties";
    private Properties properties;

    @Override
    public void init() throws ServletException {
        super.init();
        // synchronize !
        if (sc == null) sc = getServletContext();
        log = LoggerFactory.getLogger(this.getClass());
        try {
            loadProperties();
        } catch (IOException e) {
            throw new RuntimeException("Can't load properties file", e);
        }
    }

    private void loadProperties() throws IOException {
        try (InputStream is = sc.getResourceAsStream(PROPERTIES_PATH)) {
            if (is == null)
                throw new RuntimeException("Can't locate properties file");
            properties = new Properties();
            properties.load(is);
        }
    }

    String property(final String key) {
        return properties.getProperty(key);
    }
}
and the /WEB-INF/app.properties:
upload.location=C:/_/
HTH and if you find a bug let me know

Reading from streams instead of files in spring batch itemReader

I am getting a CSV file as a web service call, and it needs to be loaded. Right now I am saving it to a temp directory so I can pass it via setResource to the reader.
Is there a way to provide the stream (byte[]) as is, instead of saving the file first?
The method setResource of the ItemReader takes an org.springframework.core.io.Resource as a parameter. This class has a few out-of-the-box implementations, among which you can find org.springframework.core.io.InputStreamResource. This class' constructor takes a java.io.InputStream, which can be backed by a java.io.ByteArrayInputStream.
So technically, yes, you can consume a byte[] parameter in an ItemReader.
Now, for how to actually do that, here are a few ideas:
1) Create your own FlatFileItemReader (since CSV is a flat file) and make it implement StepExecutionListener
public class CustomFlatFileItemReader<T> extends FlatFileItemReader<T> implements StepExecutionListener {
}
2) Override the beforeStep method, do your web service call within it, and save the result in a variable:
private byte[] stream;

@Override
public void beforeStep(StepExecution stepExecution) {
    // your webservice logic
    stream = yourWebservice.results();
}
3) Override the setResource method to pass this stream as the actual resource:
@Override
public void setResource(Resource resource) {
    // Convert the byte array to an input stream
    InputStream is = new ByteArrayInputStream(stream);
    // Wrap it in a Spring InputStreamResource
    InputStreamResource res = new InputStreamResource(is);
    // Set the resource
    super.setResource(res);
}
Also, if you don't want to call your web service within the ItemReader, you can simply store the byte array in the JobExecutionContext and read it back in the beforeStep method with stepExecution.getJobExecution().getExecutionContext().get("key"), as sketched below.
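A minimal sketch of that variant, assuming an earlier step (or listener) has already put the CSV bytes into the job ExecutionContext under a hypothetical key "csvBytes", and that the reader does not also override setResource as in step 3:

@Override
public void beforeStep(StepExecution stepExecution) {
    // Read the bytes stored by a previous step under the hypothetical key "csvBytes".
    byte[] csvBytes = (byte[]) stepExecution.getJobExecution()
            .getExecutionContext().get("csvBytes");
    // Hand them to the reader as an in-memory resource instead of a file on disk.
    super.setResource(new InputStreamResource(new ByteArrayInputStream(csvBytes)));
}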
I am doing this right now with a FlatFileItemReader, reading a file from Google Storage. There is no need to extend it:
@Bean
@StepScope
public FlatFileItemReader<MyDTO> itemReader(@Value("#{jobParameters['filename']}") String filename) {
    InputStream stream = googleStorageService.getInputStream(GoogleStorage.UPLOADS, filename);
    return new FlatFileItemReaderBuilder<MyDTO>()
        .name("myItemReader")
        .resource(new InputStreamResource(stream)) // InputStream here
        .delimited()
        .names(FIELDS)
        .lineMapper(lineMapper()) // Here it is mapped like a normal file
        .fieldSetMapper(new BeanWrapperFieldSetMapper<MyDTO>() {{
            setTargetType(MyDTO.class);
        }})
        .build();
}