How to convert InfluxDB Line Protocol to Parquet in NiFi

I have InfluxDB Line Protocol records coming into NiFi via a ConsumeKafka processor, which are then merged into flowfiles containing 10,000 records each. Now I'd like to convert them to Parquet and store them in HDFS, with the end goal of building Impala tables for the end user. Is there a way to convert Line Protocol to something consumable by the PutParquet processor, or another way to produce Parquet files?
I did find a custom influxlineprotocolreader processor; however, there's very little information and no examples (that I've found) on how to use it, so I'm not sure whether it fits this use case.
Alternatively, I can use Spark to do the conversion and write Parquet files, but I was hoping to do everything in NiFi if at all possible, especially since I haven't found many resources on doing such a conversion in Spark either (I'm new to both Spark and NiFi).

There is nothing out of the box in NiFi that understands InfluxDB line protocol. You would have to implement something that converts it to a known format like JSON or Avro, and then you could go to Parquet; alternatively, if you implemented an InfluxDbRecordReader, you could use ConvertRecord with that reader and a Parquet record writer to go directly between the two.
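If you do go the Spark route mentioned in the question instead, here is a minimal sketch of what that conversion could look like. It assumes every line carries a trailing timestamp and uses none of the escaped commas/spaces or quoted string fields that the full line-protocol syntax allows; the HDFS paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Simplified line-protocol record:
// "cpu,host=a,region=us usage=0.64,idle=99.1 1556813561098000000"
case class Point(measurement: String, tags: Map[String, String],
                 fields: Map[String, String], timestamp: Long)

object LineProtocolToParquet {
  // Naive parser: no escape handling, timestamp assumed present.
  def parse(line: String): Point = {
    val Array(head, fieldPart, ts) = line.trim.split(" ", 3)
    val headParts = head.split(",")
    val tags = headParts.drop(1).map { kv =>
      val Array(k, v) = kv.split("=", 2); k -> v
    }.toMap
    val fields = fieldPart.split(",").map { kv =>
      val Array(k, v) = kv.split("=", 2); k -> v
    }.toMap
    Point(headParts.head, tags, fields, ts.trim.toLong)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lp-to-parquet").getOrCreate()
    import spark.implicits._

    spark.read.textFile("hdfs:///landing/line-protocol/*")   // placeholder path
      .map(parse)
      .toDF()
      .write.mode("append").parquet("hdfs:///warehouse/metrics_parquet")
  }
}
```

The same parsing logic could also be ported into a custom NiFi record reader if you want to keep everything in NiFi.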

Related

AWS Glue output to stream

I'm just starting to get familiar with AWS and its tools and have been researching Glue/DataBrew. I'm trying to understand whether it would fit a streaming use case I have in mind. I can see plenty of documentation around consuming streaming data into Glue, but I can't find anything related to publishing streaming data from a Glue job.
What I would like to do is pick up a file from some source, rip it apart into component records using Glue, and then publish each individual record onto a stream (Kinesis, SNS, Kafka, etc.). Is this possible with Glue yet, or am I barking up the wrong tree here?
Is there a better, more appropriate AWS solution for this type of use case?
"pick up a file from some source"
Use S3. Hook an AWS Lambda trigger to S3 upload events.
Write a Lambda that downloads the file's contents and parses it.
Then, as you parse, you can send events to SNS, MSK, or Kinesis, or write to Athena, RDS, other S3 files, etc.
Sure, Glue might piece some of these together, but you don't "need" it for simple ETL workloads.
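For illustration, here is a minimal Scala sketch of the Lambda described above, using the AWS SDK v1 and the Lambda Java events library. The bucket and key come from the trigger event; the Kinesis stream name "records-stream" and the one-record-per-line assumption are placeholders to adapt.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

import scala.collection.JavaConverters._
import scala.io.Source

import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.S3Event
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Triggered by an S3 upload event: download the object, split it into
// records (here: one per line), and put each record on a Kinesis stream.
class SplitToKinesis extends RequestHandler[S3Event, String] {
  private val s3 = AmazonS3ClientBuilder.defaultClient()
  private val kinesis = AmazonKinesisClientBuilder.defaultClient()

  override def handleRequest(event: S3Event, context: Context): String = {
    event.getRecords.asScala.foreach { rec =>
      val bucket = rec.getS3.getBucket.getName
      val key    = rec.getS3.getObject.getKey
      val obj    = s3.getObject(bucket, key)

      Source.fromInputStream(obj.getObjectContent, "UTF-8").getLines().foreach { line =>
        kinesis.putRecord(
          "records-stream",                                      // placeholder stream
          ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8)),
          key)                                                   // partition key: source object
      }
      obj.close()
    }
    "ok"
  }
}
```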

How to read and write BSON files with Spark?

I have many MongoDB dumps in gzip-compressed BSON files, each with multiple documents. I would like to read them directly into Spark, ideally partitioning at the individual document level.
Previous discussions (1, 2) are old and use the deprecated Hadoop Mongo connector. The new, actively maintained Spark Mongo connector seems to implement a DefaultSource interface, a couple of custom partitioners, and a connection layer.
I would like to extract (or contribute) a way to read a multi-document BSON file from disk into a DataFrame, such that different documents can be loaded into different partitions. Writing would also be great to have for completeness, but I'm not sure how robust writing to a single file from multiple writers can be. I am new to Spark and unsure where to start.
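One way to get started without touching the connector internals, sketched below under the assumption that these are plain mongodump .bson.gz files: decode the documents yourself with the BSON classes that ship with the MongoDB Java driver and let Spark infer a schema from their JSON form. binaryFiles yields one partition per input file, so document-level parallelism only appears after the explicit repartition; the path and partition count are placeholders.

```scala
import java.io.ByteArrayOutputStream
import java.nio.{ByteBuffer, ByteOrder}
import java.util.zip.GZIPInputStream

import org.apache.spark.sql.SparkSession
import org.bson.BasicBSONDecoder   // from the mongo-java-driver / bson artifact

object BsonDumpToDataFrame {
  // Fully read and decompress one .bson.gz stream into memory.
  private def gunzip(in: java.io.InputStream): Array[Byte] = {
    val gz = new GZIPInputStream(in)
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](64 * 1024)
    var n = gz.read(buf)
    while (n >= 0) { out.write(buf, 0, n); n = gz.read(buf) }
    gz.close(); out.toByteArray
  }

  // Walk the dump: every BSON document starts with its total length
  // as a little-endian int32, so we can slice document by document.
  private def splitDocs(bytes: Array[Byte]): Seq[String] = {
    val decoder = new BasicBSONDecoder()
    val docs = scala.collection.mutable.ArrayBuffer.empty[String]
    var offset = 0
    while (offset < bytes.length) {
      val len = ByteBuffer.wrap(bytes, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt
      val doc = decoder.readObject(java.util.Arrays.copyOfRange(bytes, offset, offset + len))
      docs += doc.toString   // JSON-like rendering of the document
      offset += len
    }
    docs
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bson-dump").getOrCreate()
    import spark.implicits._

    val json = spark.sparkContext
      .binaryFiles("hdfs:///dumps/*.bson.gz")                  // placeholder path
      .flatMap { case (_, stream) => splitDocs(gunzip(stream.open())) }
      .repartition(200)                                        // spread documents out

    spark.read.json(json.toDS()).printSchema()
  }
}
```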

How can I ingest text files that were created for Splunk into Kafka?

I'm evaluating the use of Apache Kafka to ingest existing text files, and after reading articles, connector documentation, etc., I still don't know whether there is an easy way to ingest the data or whether it would require transformation or custom programming.
The background:
We have a legacy Java application (website/ecommerce). In the past, there was a Splunk server used for several kinds of analytics.
The Splunk server is gone, but we still generate the log files that were used to ingest the data into Splunk.
The data was ingested into Splunk using Splunk forwarders; the forwarders read log files with the following format:
date="long date/time format" type="[:digit:]" data1="value 1" data2="value 2" ...
Each event is a single line. The key "type" defines the event type and the remaining key=value pairs vary with the event type.
Question:
What are my options to use these same files to send data to Apache Kafka?
Edit (2021-06-10): Found out this log format is called Logfmt (https://brandur.org/logfmt).
The events are single lines of plain text, so all you need is a StringSerializer; no transforms are needed.
If you're looking to replace the Splunk forwarder, then Filebeat or Fluentd/Fluentbit are commonly used options for shipping data to Kafka and/or Elasticsearch rather than Splunk
If you want to pre-parse/filter the data and write JSON or other formats to Kafka, Fluentd or Logstash can handle that
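For the first option, a minimal producer is enough. The sketch below replays a log file line by line with a StringSerializer; the broker address, topic name, and file path are placeholders.

```scala
import java.util.Properties

import scala.io.Source

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Send each logfmt line to Kafka unchanged; parsing is left to the consumers.
object LogFileToKafka {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      Source.fromFile("/var/log/app/events.log").getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String]("ecommerce-events", line))
      }
    } finally {
      producer.flush()
      producer.close()
    }
  }
}
```

In practice, though, a shipper such as Filebeat or Fluent Bit (as mentioned above) saves you from maintaining this code and already handles file rotation and restarts.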

How to parse EDIFACT file data using Apache Spark?

Can someone suggest how to parse EDIFACT-format data using Apache Spark?
I have a requirement where EDIFACT data is written to an AWS S3 bucket every day. I am trying to find the best way to convert this data to a structured format using Apache Spark.
If you have your invoices in EDIFACT format, you can read each of them as one String per invoice using RDDs. You will then have an RDD[String] that represents the distributed invoice collection. Take a look at https://github.com/CenPC434/java-tools; with this you can convert the EDIFACT strings to XML. The repo https://github.com/databricks/spark-xml shows how to use the XML format as an input source to create DataFrames and perform multiple queries, aggregations, etc.
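A rough sketch of that flow is below. ediToXml stands in for whatever call the EDIFACT-to-XML library actually exposes (its real API is not shown here), and the S3 paths and the "Invoice" row tag are assumptions about your data.

```scala
import org.apache.spark.sql.SparkSession

object EdifactToDataFrame {
  // Placeholder: plug in the EDIFACT-to-XML converter here.
  def ediToXml(edifact: String): String = ???

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("edifact").getOrCreate()

    // One EDIFACT interchange (invoice) per file -> one String per record.
    val xml = spark.sparkContext
      .wholeTextFiles("s3a://my-bucket/edifact/2023-01-01/*")   // placeholder path
      .map { case (_, content) => ediToXml(content) }

    // spark-xml reads files, so stage the converted XML and load it back.
    xml.saveAsTextFile("s3a://my-bucket/edifact-xml/2023-01-01")

    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "Invoice")                              // depends on the generated XML
      .load("s3a://my-bucket/edifact-xml/2023-01-01")

    df.printSchema()
  }
}
```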

Storing & reading custom metadata in parquet files using Spark / Scala

I know Parquet files store metadata, but is it possible to add custom metadata to a Parquet file using Scala (preferably) with Spark?
The idea is that I store many similarly structured Parquet files in Hadoop storage, but each has a uniquely named source (a String field, also present as a column in the Parquet file). I'd like to access this information without the overhead of actually reading the Parquet data, and possibly even remove this redundant column from the file.
I really don't want to put this info in a filename, so my best option right now is just to read the first row of each Parquet file and use the source column as a String field.
It works, but I was just wondering if there is a better way.
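One possibility, sketched below rather than offered as a definitive answer: attach the source to a column's metadata. Spark serializes its full schema, including per-field metadata, into the Parquet footer and restores it on read, so recovering the value only needs the footer, not the data. The paths, field names, and the "source" key are examples.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

object ParquetCustomMetadata {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-meta").getOrCreate()
    import spark.implicits._

    // Write: tag the "id" column with the source of this particular file.
    val meta = new MetadataBuilder().putString("source", "sensor-A").build()
    val df = Seq((1, "x"), (2, "y")).toDF("id", "value")
    df.withColumn("id", col("id").as("id", meta))
      .write.mode("overwrite").parquet("hdfs:///data/sensor-A.parquet")

    // Read back: only the schema (taken from the footer) is needed.
    val source = spark.read.parquet("hdfs:///data/sensor-A.parquet")
      .schema("id").metadata.getString("source")
    println(source)   // sensor-A
  }
}
```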