How to convert a Parquet file to Protobuf and save it to HDFS/AWS S3 - Scala

I have a file in Parquet format. I want to read it and save it to HDFS or AWS S3 in Protobuf format using Spark with Scala. I am not sure how to do this; I searched many blogs but could not find a clear answer. Can anyone help?

You can use ProtoParquetReader, which is a ParquetReader with ProtoReadSupport.
Something like:
try (ParquetReader<MessageOrBuilder> reader = ProtoParquetReader.<MessageOrBuilder>builder(path).build()) {
    MessageOrBuilder model;
    while ((model = reader.read()) != null) {
        System.out.println("check model -- " + model);
        // ...
    }
} catch (IOException e) {
    e.printStackTrace();
}

To read from Parquet, you can use code like the following (Record here is org.apache.avro.generic.GenericData.Record):
public List<Record> read(Path path) throws IOException {
    List<Record> records = new ArrayList<>();
    try (ParquetReader<Record> reader =
            AvroParquetReader.<Record>builder(path).withConf(new Configuration()).build()) {
        for (Record value = reader.read(); value != null; value = reader.read()) {
            records.add(value);
        }
    }
    return records;
}
Writing records to a Parquet file would look something like this. Although this is not the Protobuf file you asked for, it might help you get started. Keep in mind that you will have issues if you end up using Spark Streaming with Protobuf v2.6 and greater.
public void write(List<Record> records, String location) throws IOException {
    Path filePath = new Path(location);
    try (ParquetWriter<Record> writer = AvroParquetWriter.<Record>builder(filePath)
            .withSchema(getSchema())
            .withConf(getConf())
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .withWriteMode(Mode.CREATE)
            .build()) {
        for (Record record : records) {
            writer.write(record);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
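To actually end up with Protobuf on HDFS or S3, one hedged option is to map each record to a generated Protobuf message and write the messages length-delimited to the target file system. This is only a sketch: MyProto stands in for your generated Protobuf class, and the exact field mapping depends on your schema.
// needs org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
// org.apache.hadoop.fs.Path and java.net.URI
public void writeAsProtobuf(List<Record> records, String location) throws IOException {
    Configuration conf = new Configuration();
    // works for hdfs:// as well as s3a:// URIs, provided the matching FileSystem is on the classpath
    FileSystem fs = FileSystem.get(URI.create(location), conf);
    try (OutputStream out = fs.create(new Path(location))) {
        for (Record record : records) {
            // MyProto is a placeholder for your generated Protobuf class; map fields as needed
            MyProto message = MyProto.newBuilder()
                    .setId(record.get("id").toString())
                    .setName(record.get("name").toString())
                    .build();
            // length-delimited framing so the messages can be read back one by one
            message.writeDelimitedTo(out);
        }
    }
}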

Related

HttpMessageConverter - AVRO to JSON to AVRO

I'm looking for an easy-to-use solution that allows sending all sorts of AVRO objects I read from a Kafka stream to synchronous recipients via REST. This can be single objects as well as collections or arrays of the same object type. The same applies to a binary format (between instances, e.g. for the interactive query feature where the record may live on another node). Finally, I need to support compression.
There are several sources discussing Spring solutions for HttpMessageConverter that simplify the handling of domain objects in microservices using Kafka and AVRO, e.g.:
Apache Avro Serialization with Spring MVC
Avro Converter. Serializing Apache Avro Objects via REST API and Other Transformations
The solutions proposed above work fine for scenarios where a single AVRO object needs to be sent or received. What is missing, however, is a solution for sending/receiving collections or arrays of the same AVRO objects.
For serializing a single AVRO object the code may look like:
public byte[] serialize(T data) throws SerializationException {
    try {
        byte[] result = null;
        if (data != null) {
            ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
            Encoder encoder = EncoderFactory.get().jsonEncoder(data.getSchema(), byteArrayOutputStream);
            DatumWriter<T> datumWriter = new SpecificDatumWriter<>(data.getSchema());
            datumWriter.write(data, encoder);
            encoder.flush();
            byteArrayOutputStream.close();
            result = byteArrayOutputStream.toByteArray();
        }
        return result;
    } catch (IOException e) {
        throw new SerializationException("Can't serialize data='" + data + "'", e);
    }
}
Similarly, for deserialization:
public T deserialize(Class<? extends T> clazz, byte[] data) throws SerializationException {
    try {
        T result = null;
        if (data != null) {
            Class<? extends SpecificRecordBase> specificRecordClass =
                    (Class<? extends SpecificRecordBase>) clazz;
            Schema schema = specificRecordClass.newInstance().getSchema();
            DatumReader<T> datumReader = new SpecificDatumReader<>(schema);
            Decoder decoder = DecoderFactory.get().jsonDecoder(schema, new ByteArrayInputStream(data));
            result = datumReader.read(null, decoder);
        }
        return result;
    } catch (InstantiationException | IllegalAccessException | IOException e) {
        throw new SerializationException("Can't deserialize data '" + Arrays.toString(data) + "'", e);
    }
}
The examples focus on JSON, but the same principle applies to the binary format. What is still missing is a way to send/receive collections or arrays of the same AVRO objects, so I introduced two methods:
public byte[] serialize(final Iterator<T> iterator) throws SerializationException {
    Encoder encoder = null;
    DatumWriter<T> datumWriter = null;
    try (final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
        while (iterator.hasNext()) {
            T data = iterator.next();
            if (encoder == null) {
                // now that we have our first object we can get the schema
                encoder = EncoderFactory.get().jsonEncoder(data.getSchema(), byteArrayOutputStream);
                datumWriter = new SpecificDatumWriter<>(data.getSchema());
                byteArrayOutputStream.write('[');
            }
            datumWriter.write(data, encoder);
            if (iterator.hasNext()) {
                encoder.flush();
                byteArrayOutputStream.write(',');
            }
        }
        if (encoder != null) {
            encoder.flush();
            byteArrayOutputStream.write(']');
            return byteArrayOutputStream.toByteArray();
        } else {
            return null;
        }
    } catch (IOException e) {
        throw new SerializationException("Can't serialize the data = '" + iterator + "'", e);
    }
}
Deserialization is even more of a hack:
public Collection<T> deserializeCollection(final Class<? extends T> clazz, final byte[] data) throws SerializationException {
    try {
        if (data != null) {
            final Schema schema = clazz.getDeclaredConstructor().newInstance().getSchema();
            final SpecificDatumReader<T> datumReader = new SpecificDatumReader<>(schema);
            final ArrayList<T> resultList = new ArrayList<>();
            int i = 0;
            int startRecord = 0;
            int openCount = 0;
            ParserStatus parserStatus = ParserStatus.NA;
            while (i < data.length) {
                if (parserStatus == ParserStatus.NA) {
                    if (data[i] == '[') {
                        parserStatus = ParserStatus.ARRAY;
                    }
                } else if (parserStatus == ParserStatus.ARRAY) {
                    if (data[i] == '{') {
                        parserStatus = ParserStatus.RECORD;
                        openCount = 1;
                        startRecord = i;
                    // } else if (data[i] == ',') {
                    //     ignore
                    } else if (data[i] == ']') {
                        parserStatus = ParserStatus.NA;
                    }
                } else { // parserStatus == ParserStatus.RECORD
                    if (data[i] == '}') {
                        openCount--;
                        if (openCount == 0) {
                            // now carve out the part start - i+1 and use a datumReader to create avro object
                            try (final ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(data, startRecord, i - startRecord + 1)) {
                                final Decoder decoder = DecoderFactory.get().jsonDecoder(schema, byteArrayInputStream);
                                final SpecificRecordBase avroRecord = datumReader.read(null, decoder);
                                resultList.add((T) avroRecord);
                            }
                            parserStatus = ParserStatus.ARRAY;
                        }
                    } else if (data[i] == '{') {
                        openCount++;
                    }
                }
                i++;
            }
            if (parserStatus != ParserStatus.NA) {
                log.warn("Malformed json input '{}'", new String(data));
            }
            return resultList;
        }
        return null;
    } catch (InstantiationException | InvocationTargetException | IllegalAccessException | NoSuchMethodException | IOException e) {
        throw new SerializationException("Can't deserialize data '" + new String(data) + "'", e);
    }
}
Doing the same with the binary format is far more straightforward: for serialization, one record after another can be written using datumWriter.write(data, encoder) with encoder = EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null); no additional syntax is needed. Deserialization is similar: SpecificRecordBase avroRecord = datumReader.read(null, decoder) in a loop, adding each avroRecord to the collection.
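For illustration, a minimal sketch of that binary variant (assuming the same T extends SpecificRecordBase and the same SerializationException as in the methods above) could look like:
public byte[] serializeBinary(final Iterator<T> iterator) throws SerializationException {
    BinaryEncoder encoder = null;
    DatumWriter<T> datumWriter = null;
    try (final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
        while (iterator.hasNext()) {
            T data = iterator.next();
            if (encoder == null) {
                // the first record determines the schema, exactly as in the JSON variant
                encoder = EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null);
                datumWriter = new SpecificDatumWriter<>(data.getSchema());
            }
            // records are written back to back; no separators or brackets needed
            datumWriter.write(data, encoder);
        }
        if (encoder == null) {
            return null;
        }
        encoder.flush();
        return byteArrayOutputStream.toByteArray();
    } catch (IOException e) {
        throw new SerializationException("Can't serialize the data", e);
    }
}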
For JSON the additional syntax is needed because the recipient might use its own deserialization, e.g. to create plain POJOs, or, the other way round, might have produced the input from a POJO and serialized it.
My current solution looks quite hacky to me. One option I thought of would be to create an intermediate AVRO object that only contains an array of the enclosed objects; I could then use this intermediate object for serialization and deserialization, and the out-of-the-box encoder and decoder would take over the extra logic. But introducing an extra AVRO object only for this purpose seems like unnecessary overhead.
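As a hedged illustration of that intermediate-object option, the wrapper could be built at runtime as a generic record (the names here are made up; in practice one would rather define the wrapper in the .avsc so that a specific class is generated):
// MyAvroType is a placeholder for a generated SpecificRecord class
Schema itemSchema = MyAvroType.getClassSchema();
Schema wrapperSchema = SchemaBuilder.record("MyAvroTypeList").fields()
        .name("items").type().array().items(itemSchema).noDefault()
        .endRecord();

// wrap the collection and serialize it with the standard generic writer
GenericRecord wrapper = new GenericRecordBuilder(wrapperSchema)
        .set("items", records)   // records: List<MyAvroType>
        .build();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(wrapperSchema);
// 'wrapper' can then go through the same single-object serialize/deserialize path as above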
As an alternative solution I've started looking into org.apache.avro.io.JsonDecoder, but I didn't see an easy way to extend it so that it abstracts away the home-grown solution above.
I'm currently extending the above to also support compression, using a decorator for compression and decompression (Deflater, GZIP and LZ4).
Any help or better solution is appreciated.

Log file watching with citrus-framework

Is there a way and/or what are best practices to watch log files from the System Under Test?
My requirement is to validate the presence/absence of log entries according to known patterns produced by the SUT.
Thank you very much!
Well, I don't think there is a Citrus tool specifically designed for that, but I think it is a really good idea. You could open an issue and ask for this feature.
Meanwhile, here is a solution we have used in one of our projects to check whether the application log contained specific strings generated by our test:
sleep(2000),
echo("Searching the log..."),
new AbstractTestAction() {
    @Override
    public void doExecute(TestContext context) {
        try {
            String logfile = FileUtils.getFileContentAsString(Paths.get("target", "my-super-service.log").toAbsolutePath().normalize());
            if (!logfile.contains("ExpectedException: ... | Details: BOOM!.")) {
                throw new RuntimeException("Missing exceptions in log");
            }
        } catch (IOException e) {
            throw new RuntimeException("Unable to get log");
        }
    }
}
Or you can replace that simple contains() check with a more elegant solution:
String grepResult = grepForLine(LOGFILE_PATH, ".*: SupermanMissingException.*");
if (grepResult == null) {
throw new RuntimeException("Expected error log entry not found");
}
The function goes over each line searching for a match to the regex supplied.
public String grepForLine(Path path, String regex) {
    Pattern regexp = Pattern.compile(regex);
    Matcher matcher = regexp.matcher("");
    String msg = null;
    try (BufferedReader reader = Files.newBufferedReader(path, Charset.defaultCharset());
         LineNumberReader lineReader = new LineNumberReader(reader)) {
        String line;
        while ((line = lineReader.readLine()) != null) {
            matcher.reset(line); // reset the input
            if (matcher.find()) {
                msg = "Line " + lineReader.getLineNumber() + " contains the error log: " + line;
            }
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    return msg;
}

How do I return a PowerPoint (.pptx) file from a REST response in Spring MVC

I am generating a PowerPoint file (.pptx) and I would like to return this file when a REST call happens. But right now I only get a generic file type, not the .pptx extension.
@RequestMapping(value = "/ImageManagerPpt/{accessionId}", method = RequestMethod.GET, produces = "application/ppt")
public ResponseEntity<InputStreamResource> createPptforAccessionId(@PathVariable("accessionId") String accessionId, HttpServletResponse response) throws IOException {
    System.out.println("Creating PPT for Patient Details with id " + accessionId);
    File pptFile = imageManagerService.getPptForAccessionId(accessionId);
    if (pptFile == null) {
        System.out.println("Patient Id with id " + accessionId + " not found");
        return new ResponseEntity<InputStreamResource>(HttpStatus.NOT_FOUND);
    }
    InputStream stream = null;
    try {
        stream = new FileInputStream(pptFile);
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    ClassPathResource classpathfile = new ClassPathResource("Titlelayout3.pptx");
    InputStreamResource inputStreamResource = new InputStreamResource(stream);
    return ResponseEntity.ok().contentLength(classpathfile.contentLength())
            .contentType(MediaType.parseMediaType("application/octet-stream"))
            .body(new InputStreamResource(classpathfile.getInputStream()));
}
-Bharat
Have you tried this?
InputStream stream = new FileInputStream(pptFile);
org.apache.commons.io.IOUtils.copy(stream, response.getOutputStream());
response.flushBuffer();
You will get the file exactly as you put it into the InputStream.
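Alternatively, a hedged sketch of returning the generated file itself (rather than the classpath template) with a proper content type and filename, reusing imageManagerService from the question:
@RequestMapping(value = "/ImageManagerPpt/{accessionId}", method = RequestMethod.GET)
public ResponseEntity<InputStreamResource> downloadPpt(@PathVariable("accessionId") String accessionId) throws IOException {
    File pptFile = imageManagerService.getPptForAccessionId(accessionId);
    if (pptFile == null) {
        return new ResponseEntity<>(HttpStatus.NOT_FOUND);
    }
    HttpHeaders headers = new HttpHeaders();
    // suggest a filename so the browser saves it with the .pptx extension
    headers.setContentDispositionFormData("attachment", pptFile.getName());
    return ResponseEntity.ok()
            .headers(headers)
            .contentLength(pptFile.length())
            .contentType(MediaType.parseMediaType(
                    "application/vnd.openxmlformats-officedocument.presentationml.presentation"))
            .body(new InputStreamResource(new FileInputStream(pptFile)));
}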

I have JSON values, how do I save them into MongoDB using Spring MVC?

I have a class called Location.
The Location class has four fields: id, city, state, and country.
Country is a separate class; it contains two fields, countryCode and countryName, which have to be read from the Location class.
If I write locationMongoRepository.save(), it shows a "bound mismatch" error. Please suggest how to save this in MongoDB.
public void insertLocation() throws InvalidFormatException, IOException, JSONException {
    FileInputStream inp;
    Workbook workbook;
    try {
        inp = new FileInputStream("/home/Downloads/eclipse/Workspace/Samplboot-master latest/cityListForIndia1.xlsx");
        workbook = WorkbookFactory.create(inp);
        Sheet sheet = workbook.getSheetAt(0);
        JSONArray json = new JSONArray();
        boolean isFirstRow = true;
        ArrayList<String> rowName = new ArrayList<String>();
        for (Iterator<Row> rowsIT = sheet.rowIterator(); rowsIT.hasNext();) {
            Row row = rowsIT.next();
            JSONObject jRow = new JSONObject();
            if (isFirstRow) {
                for (Iterator<Cell> cellsIT = row.cellIterator(); cellsIT.hasNext();) {
                    Cell cell = cellsIT.next();
                    rowName.add(cell.getStringCellValue());
                }
                isFirstRow = false;
            } else {
                JSONObject jRowCountry = new JSONObject();
                JSONObject jRowLocation = new JSONObject();
                jRowLocation.put("city", row.getCell(0));
                jRowLocation.put("state", row.getCell(1));
                jRowCountry.put("country", row.getCell(2));
                jRowCountry.put("countryCode", row.getCell(3));
                jRowLocation.put("country", jRowCountry);
                System.out.println("Location" + jRowLocation.toString());
            }
        }
    } catch (InvalidFormatException e) {
        // TODO Auto-generated catch block
        System.out.println("Invalid Format, Only Excel files are supported");
        e.printStackTrace();
    } catch (IOException e) {
        System.out.println("Check if the input file exists and the path is correct");
        e.printStackTrace();
    } catch (JSONException e) {
        // TODO Auto-generated catch block
        System.out.println("Unable to generate Json");
        e.printStackTrace();
    }
}
I'm using Spring Data to work with MongoDB and it's really helpful. You should read this article to get the idea and apply it to your case: https://dzone.com/articles/spring-data-mongodb-hello
P.S.: In case you can't use Spring Data to work with MongoDB, please provide more detail on your code/your exception so we can investigate it further.
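For illustration, a hedged sketch of what the Spring Data mapping could look like for the classes in the question (annotations come from spring-data-mongodb; getters/setters omitted):
@Document(collection = "locations")
public class Location {
    @Id
    private String id;
    private String city;
    private String state;
    private Country country;   // stored as an embedded sub-document
    // getters/setters omitted
}

public class Country {
    private String countryCode;
    private String countryName;
    // getters/setters omitted
}

public interface LocationMongoRepository extends MongoRepository<Location, String> {
}
With that in place, locationMongoRepository.save(location) persists the document; a "bound mismatch" compile error usually points at the repository's generic parameters (entity type and id type) not matching the entity class.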

File corrupted after uploading on server in GWT

I'm using gwtupload.client.MultiUploader to upload zip files to the server in GWT. On the server I then transform the zip file into an array of bytes to insert into the database. As a result, 50% of the files in the database are corrupted. Here is a bit of my code.
@UiField(provided = true)
MultiUploader muplDef;

public MyClass() {
    muplDef = new MultiUploader();
    muplDef.setValidExtensions("zip");
    muplDef.addOnFinishUploadHandler(onFinishUploaderHandler);
    muplDef.addOnCancelUploadHandler(onCancelUploaderHander);
}

private final IUploader.OnFinishUploaderHandler onFinishUploaderHandler = new IUploader.OnFinishUploaderHandler() {
    @SuppressWarnings("incomplete-switch")
    @Override
    public void onFinish(IUploader uploader) {
        switch (uploader.getStatus()) {
            case SUCCESS:
                attachZip = true;
        }
    }
};

private final IUploader.OnCancelUploaderHandler onCancelUploaderHander = new IUploader.OnCancelUploaderHandler() {
    @Override
    public void onCancel(IUploader uploader) {
        attachZip = false;
    }
};
Byte Array
String fileName = "D:\\1.zip";
File f = new File(fileName);
byte[] edocBinary = new byte[(int) f.length()];
RandomAccessFile ff;
try {
    ff = new RandomAccessFile(f, "r");
    ff.readFully(edocBinary);
    ff.close();
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
My questions are: can files be corrupted even though I have an OnFinishUploaderHandler and the status is SUCCESS? There are other statuses like ERROR; would that catch a bad file? Or is the problem in the transformation to a byte array? Can you give me some advice? Thanks.
As you said, you have two steps:
1- Uploading the zip file
2- Inserting the zip file into the database
If step 1 is done correctly, you get a success status showing that the file was transferred correctly from client to server; what you do after that is not managed by GwtUpload.
So I guess the corruption happens when you try to insert the file into the database. If you are using MySQL, try this: http://www.codejava.net/java-se/jdbc/insert-file-data-into-mysql-database-using-jdbc
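For reference, a minimal JDBC sketch of that insert step (the table and column names, jdbcUrl, user and password are made up; the point is to stream the bytes into a BLOB column instead of hand-rolling the byte array):
// Assumed table: CREATE TABLE documents (name VARCHAR(255), content LONGBLOB)
public void insertZip(File zipFile, String jdbcUrl, String user, String password) throws Exception {
    String sql = "INSERT INTO documents (name, content) VALUES (?, ?)";
    try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
         PreparedStatement ps = conn.prepareStatement(sql);
         InputStream in = new FileInputStream(zipFile)) {
        ps.setString(1, zipFile.getName());
        // let the driver stream the file content into the BLOB column
        ps.setBinaryStream(2, in, zipFile.length());
        ps.executeUpdate();
    }
}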