Reading a CSV file with Flink, Scala, addSource and readCsvFile

I want to read a CSV file with Flink and Scala, using the addSource and readCsvFile functions. I have not found any simple examples of that. I have only found https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/scala/com/dataartisans/flinktraining/exercises/datastream_scala/cep/LongRides.scala, and this is too complex for my purpose.
In the definition StreamExecutionEnvironment.addSource(sourceFunction), should I just use readCsvFile as the sourceFunction?
After reading the file, I want to use CEP (Complex Event Processing).

readCsvFile() is only available as part of Flink's DataSet (batch) API, and cannot be used with the DataStream (streaming) API. Here's a pretty good example of readCsvFile(), though it's probably not relevant to what you're trying to do.
readTextFile() and readFile() are methods on StreamExecutionEnvironment, and do not implement the SourceFunction interface -- they are not meant to be used with addSource(), but rather instead of it. Here's an example of using readTextFile() to load a CSV using the DataStream API.
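To make that concrete, here is a minimal sketch (not from the linked example; the file path, field names, and record type are assumptions) of loading a CSV with readTextFile() and parsing each line by hand into a case class, which can then feed CEP:

```scala
import org.apache.flink.streaming.api.scala._

case class Event(id: String, timestamp: Long, value: Int)

object CsvStreamJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // readTextFile() emits one String per line; parse each line manually.
    val events: DataStream[Event] = env
      .readTextFile("file:///path/to/input.csv")
      .filter(line => !line.startsWith("id,")) // skip the header row, if present
      .map { line =>
        val fields = line.split(",")
        Event(fields(0), fields(1).toLong, fields(2).toInt)
      }

    events.print()
    env.execute("CSV via readTextFile")
  }
}
```

Note that split(",") is a naive parser; it will break on quoted fields containing commas.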
Another option is to use the Table API, and a CsvTableSource. Here's an example and some discussion of what it does and doesn't do. If you go this route, you'll need to use StreamTableEnvironment.toAppendStream() to convert your table stream to a DataStream before using CEP.
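A rough sketch of the Table API route (based on the pre-1.10 Flink Table API; method names may differ in your version, and the path and field names are made up):

```scala
import org.apache.flink.api.common.typeinfo.Types
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.table.sources.CsvTableSource
import org.apache.flink.types.Row

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = StreamTableEnvironment.create(env)

// Declare the CSV's schema up front; the source handles parsing.
val source = CsvTableSource.builder()
  .path("/path/to/input.csv")
  .field("id", Types.STRING)
  .field("ts", Types.LONG)
  .build()

tableEnv.registerTableSource("events", source)
val table = tableEnv.scan("events")

// Convert to a DataStream[Row] before applying CEP.
val stream: DataStream[Row] = tableEnv.toAppendStream[Row](table)
```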
Keep in mind that all of these approaches will simply read the file once and create a bounded stream from its contents. If you want a source that reads in an unbounded CSV stream, and waits for new rows to be appended, you'll need a different approach. You could use a custom source, or a socketTextStream, or something like Kafka.

If you have a CSV file with 3 fields (String, Long, Integer), then do the following:
val input=benv.readCsvFile[(String,Long,Integer)]("hdfs:///path/to/your_csv_file")
PS: I am using the Flink shell, which is why the environment is called benv.

Related

How do you send the record matching certain condition/s to specific/multiple output topics in Spring apache-kafka?

I have referred to this post, but it is old, so I'm looking for a better solution if one exists.
I have an input topic that contains 'userActivity' data. Now I wish to gather different analytics based on userInterest, userSubscribedGroup, userChoice, etc., produced to distinct output topics from the same Kafka Streams application.
Could you help me achieve this? PS: this is my first time using Kafka Streams, so I'm unaware of any other alternatives.
Edit:
It's possible that one record matches multiple criteria, in which case the same record should go to each matching output topic:
if (record1 matches criteria1) then output to topic1;
if (record1 matches criteria2) then output to topic2;
and so on.
Note: I'm not looking for an if/else-if kind of solution; the checks are independent, not mutually exclusive.
To dynamically choose which topic to send each record to at runtime, based on its key and value, Kafka Streams (since Apache Kafka 2.0) supports a feature called dynamic routing.
And this is an example of it: https://kafka-tutorials.confluent.io/dynamic-output-topic/confluent.html
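Not from the linked tutorial, but to sketch both ideas in the Kafka Streams Scala DSL (topic names and predicates here are invented, and the Serdes import path varies between Kafka versions): independent filters send a record matching several criteria to every matching topic, while a TopicNameExtractor lambda picks a topic per record at runtime.

```scala
import java.util.Properties
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object ActivityRouter extends App {
  val builder = new StreamsBuilder
  val activity = builder.stream[String, String]("userActivity")

  // Independent (non else-if) filters: a record matching both
  // predicates is written to both topics.
  activity.filter((_, v) => v.contains("interest")).to("user-interest-topic")
  activity.filter((_, v) => v.contains("group")).to("user-group-topic")

  // Dynamic routing: choose the destination topic per record
  // via a TopicNameExtractor lambda.
  activity.to((_, value, _) =>
    if (value.contains("premium")) "premium-activity" else "standard-activity")

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "activity-router")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  new KafkaStreams(builder.build(), props).start()
}
```

The filter approach fits the "one record may match many criteria" requirement directly; the extractor fits cases where each record goes to exactly one, data-dependent topic.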

Create Parquet file in Scala without Spark

I am trying to write streaming JSON messages directly to Parquet using Scala (no Spark). I have seen only a couple of posts online, plus this post, but the ParquetWriter API appears to be deprecated and the answer doesn't actually provide an example to follow. I read some other posts too but didn't find any descriptive explanation.
I understand I have to use the ParquetFileWriter API, but the lack of documentation is making it difficult for me to use. Can someone please provide an example of it, along with all the constructor parameters and how to create them, especially the schema?
You may want to try Eel, a toolkit for manipulating data in the Hadoop ecosystem.
I recommend reading the README to gain a better understanding of the library, but to give you a sense of how it works, what you are trying to do would look somewhat like the following:
// Imports assumed from the eel-sdk and Hadoop artifacts; package
// names follow the Eel README and may differ between versions.
import java.io.FileInputStream
import io.eels.component.json.JsonSource
import io.eels.component.parquet.ParquetSink
import org.apache.hadoop.fs.Path

// Read JSON records from a file and write them out as Parquet.
val source = JsonSource(() => new FileInputStream("input.json"))
val sink = ParquetSink(new Path("output.parquet"))
source.toDataStream().to(sink)

How can one use spark Catalyst?

According to this,
Spark Catalyst is "an implementation-agnostic framework for manipulating trees of relational operators and expressions."
I want to use Spark Catalyst to parse SQL DMLs and DDLs in order to write and generate custom Scala code for them. However, it is not clear to me from reading the code whether there is any wrapper class around Catalyst that I can use. The ideal wrapper would receive a SQL statement and produce the equivalent Scala code. For my use case it would look like this:
def generate("select substring(s, 1, 3) as from t1") = {
  // custom code
  return custom_scala_code_which is executable given s as List[String]
}
This is a simple example, but the idea is that I don't want to write another parser and I need to parse many SQL functionality from a legacy system that I have to write a custom Scala implementation for them.
In a more general question, with a lack of class level design documentation, how can someone learn the code base and make contributions?
Spark accepts SQL queries via spark.sql. For example, you can feed the string SELECT * FROM table as an argument, as in spark.sql("SELECT * FROM table"), after having registered your DataFrame as "table". To register your DataFrame as "table" for use in SQL queries, create a temporary view using
DataFrame.createOrReplaceTempView("table")
You can see examples here:
https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#running-sql-queries-programmatically
DataFrame code is automatically translated into RDD operations and optimized along the way, and this optimization is done by Catalyst. When a programmer writes DataFrame code, it is optimized internally before execution. For more detail, visit
Catalyst optimisation in Spark
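If the goal is just to obtain Catalyst's parsed tree for a SQL string, one sketch is below. Caveat: sessionState and its sqlParser are Spark internals, not a stable public API, so this may change between versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object CatalystParseDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("catalyst-parse")
      .getOrCreate()

    // sessionState.sqlParser is Catalyst's SQL parser; parsePlan returns
    // an unresolved LogicalPlan tree you can walk with transform/collect.
    val plan: LogicalPlan =
      spark.sessionState.sqlParser.parsePlan("SELECT substring(s, 1, 3) AS c FROM t1")

    // Prints the operator tree (a Project over an UnresolvedRelation).
    println(plan.treeString)
    spark.stop()
  }
}
```

From there, generating custom Scala code would mean pattern matching on the LogicalPlan and Expression nodes yourself; Catalyst gives you the tree, not a code generator for arbitrary targets.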

Processing Kafka message byte array in Scala

I am trying to use Spark Streaming and Kafka to ingest and process messages received from a web server.
I am testing the consumer mentioned in https://github.com/dibbhatt/kafka-spark-consumer/blob/master/README.md to take advantage of the extra features it offers.
As a first step, I am trying to use the example provided just to see how it plays out. However, I am having difficulties actually seeing the data in the payload.
Looking at the result of the following function:
ReceiverLauncher.launch
I can see it returns a collection of RDDs, each of type:
MessageAndMetadata[Array[Byte]]
I am stuck at this point and don't know how to parse this and see the actual data. All the examples on the web that use the consumer that ships with Spark create an iterator object, go through it, and process the data. However, the returned object from this custom consumer doesn't give me any iterator interfaces to start with.
There is a getPayload() method in the RDD, but I don't know how to get to the data from it.
The questions I have are:
Is this consumer actually a good choice for a production environment? From the looks of it, the features it offers and the abstraction it provides seem very promising.
Has anybody ever tried it? Does anybody know how to get to the data?
Thanks in advance,
Moe
The byte array returned by getPayload() needs to be converted to a String, e.g.
new String(line.getPayload(), StandardCharsets.UTF_8)
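As a sketch of where that conversion fits (the variable name stream is an assumption; it stands for the DStream of MessageAndMetadata[Array[Byte]] that ReceiverLauncher.launch produces in the linked example):

```scala
import java.nio.charset.StandardCharsets

// Decode each message's payload bytes into a String, then process it.
stream.foreachRDD { rdd =>
  rdd
    .map(record => new String(record.getPayload, StandardCharsets.UTF_8))
    .foreach(println) // runs on the executors; collect() first to print on the driver
}
```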

Efficient way to load csv file in spark/scala

I am trying to load a CSV file in Scala with Spark. I see that we can do it using the two syntaxes below:
sqlContext.read.format("csv").options(option).load(path)
sqlContext.read.options(option).csv(path)
What is the difference between these two and which gives the better performance?
Thanks
There's no difference.
So why do both exist?
The .format(fmt).load(path) method is a flexible, pluggable API that allows adding more formats without recompiling Spark: you can register aliases for custom Data Source implementations and have Spark use them. "csv" used to be such a custom implementation (outside the packaged Spark binaries), but it is now part of the project.
There are shorthand methods for "built-in" data sources (like csv, parquet, json, ...) which make the code a bit simpler (and are verified at compile time).
Eventually, they both create a CSV Data Source and use it to load the data.
Bottom line, for any supported format, you should opt for the "shorthand" method, e.g. csv(path).
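For illustration (the path and options here are made up, and this uses a modern SparkSession rather than the question's sqlContext), the two calls below load the same data through the same CSV data source:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-load").getOrCreate()

val opts = Map("header" -> "true", "inferSchema" -> "true")

// Generic, pluggable form:
val df1 = spark.read.format("csv").options(opts).load("data/people.csv")

// Built-in shorthand -- same data source underneath:
val df2 = spark.read.options(opts).csv("data/people.csv")
```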