I am trying to write streaming JSON messages directly to Parquet using Scala (no Spark). I have found only a couple of posts online, and this post, but the ParquetWriter API used there is deprecated and the solution doesn't actually provide an example to follow. I read some other posts too but didn't find any descriptive explanation.
I know I have to use the ParquetFileWriter API, but the lack of documentation makes it difficult for me to use. Can someone please provide an example of it, along with all the constructor parameters and how to create those parameters, especially the schema?
You may want to try Eel, a toolkit for manipulating data in the Hadoop ecosystem.
I recommend reading the README to gain a better understanding of the library, but to give you a sense of how it works, what you are trying to do would look somewhat like the following:
import io.eels.component.json.JsonSource     // adjust package names to your eel-sdk version
import io.eels.component.parquet.ParquetSink

val source = JsonSource(() => new FileInputStream("input.json"))
val sink = ParquetSink(new Path("output.parquet"))
source.toDataStream().to(sink)
I am new to Scala and Amazon Deequ. I have been asked to write Scala code that computes metrics (e.g. Completeness, CountDistinct, etc.) on constraints using Deequ on source CSV files stored in S3, and loads the generated metrics into a Glue table, which will then be used for reporting.
Can anyone please point me in the right direction towards online resources that would help me achieve this? I am new to both Scala and Deequ, so can anyone give me some sample Scala code and explain how the Deequ libraries can be used?
Please let me know if additional information is required to explain my question better.
Thank you for your interest in Deequ. The github page of deequ has information on how to get started with using it: https://github.com/awslabs/deequ
Additionally, there is a blogpost at the AWS blog with some examples as well: https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/
Best,
Sebastian
You can check the examples available here: https://github.com/awslabs/deequ/tree/master/src/main/scala/com/amazon/deequ/examples
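To give you a concrete starting point, here is a minimal sketch of computing Completeness and CountDistinct metrics with Deequ's AnalysisRunner; the S3 paths, column name, and Spark session setup are placeholders you would replace with your own:

import com.amazon.deequ.analyzers.{Completeness, CountDistinct, Size}
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import org.apache.spark.sql.SparkSession

object DeequMetricsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("deequ-metrics").getOrCreate()

    // Read the source CSV from S3 (path and column name are placeholders)
    val data = spark.read.option("header", "true").csv("s3://your-bucket/path/to/source.csv")

    val result = AnalysisRunner
      .onData(data)
      .addAnalyzer(Size())
      .addAnalyzer(Completeness("customer_id"))
      .addAnalyzer(CountDistinct(Seq("customer_id")))
      .run()

    // Turn the computed metrics into a DataFrame that can be persisted
    val metrics = AnalyzerContext.successMetricsAsDataFrame(spark, result)
    metrics.show()

    // One option for reporting: write to an S3 location backed by a Glue table
    metrics.write.mode("overwrite").parquet("s3://your-bucket/deequ-metrics/")
  }
}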
Hope that helps.
Take some time to read the documentation too.
I'm writing a Scala tool that encodes ~300 JSON Schema files into files of a different format and saves them to disk. I later need these schemas again to instantiate JSON data files, or rather, I don't need all of the schemas but only a few fields of each.
I was thinking that the best solution could be to populate a Map object (while the tool encodes the schemas) containing only the info that I need, and later reuse that Map (in another run of the tool) as an already compiled and populated map.
I've got two questions:
1. Is this really the most performant solution?
2. How can I save the Map object, created at runtime, to disk as a file that can later be built/executed with the rest of my code?
I've read several posts about serialization and storing objects, but I'm not entirely sure whether those are the same as what I need. Also, I'm not sure this is the best solution, and I would like to hear opinions from people with more experience than me.
What I would like to achieve is an elegant solution that allows me to look up values from a map generated by another tool.
The whole process of compiling/building/executing sometimes is still confusing to me, so apologies if the question is trivial.
To answer your question: I think using an embedded KV store would be more efficient, considering the number of files and the amount of traversal.
Here is a short wiki on how to use RocksJava, which you can use as an embedded KV store: https://github.com/facebook/rocksdb/wiki/RocksJava-Basics
You can use the reference below to serialize and deserialize an object in Scala and store it as a key-value pair in RocksDB, as I mentioned in the comment.
Convert Any type in scala to Array[Byte] and back
To use RocksDB, the following dependency in your build will suffice:
"org.rocksdb" % "rocksdbjni" % "5.17.2"
Thanks.
I want to read a CSV file with Flink and Scala, using the addSource and readCsvFile functions. I have not found any simple examples of that. I have only found https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/scala/com/dataartisans/flinktraining/exercises/datastream_scala/cep/LongRides.scala, and this is too complex for my purpose.
Given the definition StreamExecutionEnvironment.addSource(sourceFunction), should I just use readCsvFile as the sourceFunction?
After reading the file, I want to use CEP (Complex Event Processing).
readCsvFile() is only available as part of Flink's DataSet (batch) API, and cannot be used with the DataStream (streaming) API. Here's a pretty good example of readCsvFile(), though it's probably not relevant to what you're trying to do.
readTextFile() and readFile() are methods on StreamExecutionEnvironment, and do not implement the SourceFunction interface -- they are not meant to be used with addSource(), but rather instead of it. Here's an example of using readTextFile() to load a CSV using the DataStream API.
Another option is to use the Table API, and a CsvTableSource. Here's an example and some discussion of what it does and doesn't do. If you go this route, you'll need to use StreamTableEnvironment.toAppendStream() to convert your table stream to a DataStream before using CEP.
Keep in mind that all of these approaches will simply read the file once and create a bounded stream from its contents. If you want a source that reads in an unbounded CSV stream, and waits for new rows to be appended, you'll need a different approach. You could use a custom source, or a socketTextStream, or something like Kafka.
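For illustration, here is a minimal sketch of the readTextFile() approach with the DataStream API; the file path, the Event case class, and the assumed column layout (String, Long, Int) are placeholders for the example:

import org.apache.flink.streaming.api.scala._

// Assumed record layout for the example
case class Event(id: String, timestamp: Long, value: Int)

object CsvToDataStream {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // readTextFile gives a DataStream[String]; parse each CSV line yourself
    val events: DataStream[Event] = env
      .readTextFile("file:///path/to/input.csv") // placeholder path
      .filter(line => line.nonEmpty && !line.startsWith("id,")) // skip a header row
      .map { line =>
        val cols = line.split(",")
        Event(cols(0), cols(1).toLong, cols(2).toInt)
      }

    // `events` is now a DataStream[Event] that CEP patterns can consume
    events.print()
    env.execute("csv-to-datastream")
  }
}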
If you have a CSV file with 3 fields (String, Long, Integer), then do the following:
val input=benv.readCsvFile[(String,Long,Integer)]("hdfs:///path/to/your_csv_file")
PS: I am using the Flink shell, which is why I have benv (the batch environment).
I am trying to use Spark Streaming and Kafka to ingest and process messages received from a web server.
I am testing the consumer mentioned in https://github.com/dibbhatt/kafka-spark-consumer/blob/master/README.md to take advantage of the extra features it offers.
As a first step, I am trying to use the example provided just to see how it plays out. However, I am having difficulties actually seeing the data in the payload.
Looking at the result of the following function:
ReceiverLauncher.launch
I can see it returns a collection of RDDs, each of type:
MessageAndMetadata[Array[Byte]]
I am stuck at this point and don't know how to parse this and see the actual data. All the examples on the web that use the consumer that ships with Spark create an iterator object, go through it, and process the data. However, the returned object from this custom consumer doesn't give me any iterator interfaces to start with.
There is a getPayload() method on the records in the RDD, but I don't know how to get the data out of it.
The questions I have are:
Is this consumer actually a good choice for a production environment? From the looks of it, the features it offers and the abstraction it provides seem very promising.
Has anybody ever tried it? Does anybody know how to get to the data?
Thanks in advance,
Moe
The byte array returned by getPayload() needs to be converted to a String, e.g.
new String(line.getPayload())
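For example, assuming stream is the DStream of MessageAndMetadata[Array[Byte]] records returned by ReceiverLauncher.launch (the names here are only for illustration), the payload of each record can be decoded like this:

import java.nio.charset.StandardCharsets

// `stream` is assumed to be the DStream[MessageAndMetadata[Array[Byte]]]
// produced by ReceiverLauncher.launch, as described in the question
stream.foreachRDD { rdd =>
  rdd.map(record => new String(record.getPayload, StandardCharsets.UTF_8))
     .foreach(json => println(json)) // replace println with your actual processing
}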
I want to use a FileEntry component to save an image to a database (Postgres, with a JPA connection) using JSF. From my searching, I think I have to use an InputStream, but I don't know how to use it to convert the image to a bytea type so it can be saved in a database column, for example a book cover column (bytea). What code should be in the bean? Please help me. I have something like this:
Icefaces 3.0.1 FileEntry: FileEntryListener is never called
You're trying to do it all at once. Break it down step-by-step into pieces you can understand on their own.
First make sure that receiving the image and storing it to a local file works. There are lots of existing examples for this part.
Then figure out how to map bytea in your JPA implementation and store/retrieve files from disk. You didn't mention which JPA provider you're using, so it's hard to be specific about this.
Finally, connect the two up by writing directly from the stream, but only once you know both parts work correctly by themselves. At this point you will find that you can't actually send a stream directly to the database if you want to use a bytea field - you must fetch the whole file into memory as a byte buffer.
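As a rough sketch of that point (plain JDBC rather than JPA, with placeholder connection details, table, and column names), storing the whole image in a bytea column comes down to a byte array and setBytes:

import java.nio.file.{Files, Paths}
import java.sql.DriverManager

object StoreCoverBytea {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details, table and column names
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/mydb", "user", "password")
    try {
      // The whole file is read into memory, as described above
      val bytes: Array[Byte] = Files.readAllBytes(Paths.get("/tmp/cover.png"))

      val ps = conn.prepareStatement("INSERT INTO book (title, cover) VALUES (?, ?)")
      ps.setString(1, "My Book")
      ps.setBytes(2, bytes) // bytea columns map to setBytes / Array[Byte]
      ps.executeUpdate()
      ps.close()
    } finally {
      conn.close()
    }
  }
}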
If you want to be able to do stream I/O with files in the database you can do that using PostgreSQL's support for "large objects". You're very unlikely to be able to work with this via JPA, though; you'll have to manage the files in the DB directly with JDBC using the PgJDBC extensions for large object support. You'll have to unwrap the Connection object from the connection pooler to get the underlying Connection that you can cast to PgConnection in order to access this API.
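A hedged sketch of that large-object route (again plain JDBC, with placeholder file and connection details; the returned oid would be stored in an oid column of your table):

import java.io.FileInputStream
import java.sql.DriverManager
import org.postgresql.PGConnection
import org.postgresql.largeobject.LargeObjectManager

object StoreCoverLargeObject {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/mydb", "user", "password")
    conn.setAutoCommit(false) // large-object calls must run inside a transaction
    try {
      // Unwrap to the driver-level connection to reach the large object API
      val loManager: LargeObjectManager = conn.unwrap(classOf[PGConnection]).getLargeObjectAPI

      val oid = loManager.createLO(LargeObjectManager.READWRITE)
      val lo = loManager.open(oid, LargeObjectManager.WRITE)

      // Stream the upload into the large object in chunks
      val in = new FileInputStream("/tmp/cover.png") // placeholder path
      val buf = new Array[Byte](8192)
      var n = in.read(buf)
      while (n > 0) {
        lo.write(buf, 0, n)
        n = in.read(buf)
      }
      in.close()
      lo.close()

      conn.commit()
      println(s"stored large object with oid $oid") // keep this oid in your table
    } finally {
      conn.close()
    }
  }
}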
If you're stuck on any individual part, feel free to post appropriately detailed and specific questions about that part; if you link back to this question that'll help people understand the context. As it stands, this is a really broad "show me how to use technology X and Y to do Z" question that would require writing (or finding) a full tutorial to answer.