Spark standalone cluster read parquet files after saving - scala

I have a two-node Spark standalone cluster and I'm trying to read some Parquet files that I just saved, but I'm getting a file-not-found exception.
Checking the location, it looks like all the Parquet files were created on only one of the nodes in my standalone cluster.
The problem is that when I read the Parquet files back, it says it cannot find the xasdad.part file.
The only way I've managed to load them is to scale the standalone Spark cluster down to one node.
My question is: how can I load my Parquet files while running more than one node in my standalone cluster?

You have to put your files in a shared directory that is accessible to all Spark nodes under the same path. Otherwise, use Spark with Hadoop HDFS: a distributed file system.
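As a minimal sketch in the spark-shell (the namenode host, port, path, and the df DataFrame are hypothetical), saving to and reading back from an HDFS URI instead of a local path looks like this:
// write the DataFrame to a path that every node resolves the same way
df.write.parquet("hdfs://namenode:9000/output/xfact")
// the read now works from any worker, since the files are not scattered over local disks
val dfBack = spark.read.parquet("hdfs://namenode:9000/output/xfact")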

Related

Is it possible to work with Spark Structured Streaming without HDFS?

I've been working with HDFS and Kafka for some time, and I've noticed that Kafka is more reliable than HDFS.
Now, working with Spark Structured Streaming, I'm surprised that checkpointing only works with HDFS.
Checkpointing to Kafka would be faster and more reliable.
So is it possible to work with Spark Structured Streaming without HDFS?
It seems strange that we have to use HDFS only for streaming data from Kafka.
Or is it possible to tell Spark to forget about checkpointing and manage it in the program instead?
Spark 2.4.7
Thank you
You are not restricted to using an HDFS path as a checkpoint location.
According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide, the path has to be "an HDFS compatible file system". Therefore, other file systems will also work. However, it is mandatory that all executors have access to that file system. For example, choosing the local file system on the edge node of your cluster might work in local mode, but in cluster mode this can cause issues.
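As a minimal sketch (the eventsDf streaming DataFrame and both paths are hypothetical), pointing the checkpoint at any HDFS-compatible file system that all executors can reach looks like this:
val query = eventsDf.writeStream
  .format("parquet")
  .option("path", "hdfs://namenode:9000/sink/events")
  // could also be an s3a:// or other HDFS-compatible URI, as long as every executor can reach it
  .option("checkpointLocation", "hdfs://namenode:9000/checkpoints/events")
  .start()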
Also, it is not possible to have Kafka itself handle the offset position with Spark Structured Streaming. I have explained this in more depth in my answer on How to manually set group.id and commit kafka offsets in spark structured streaming?.

Kafka Connect: can multiple standalone connectors write to the same HDFS directory?

For our pipeline, we have about 40 topics (10-25 partitions each) that we want to write into the same HDFS directory using HDFS 3 Sink Connectors in standalone mode (distributed doesn't work for our current setup). We have tried running all the topics on one connector but encounter problems recovering offsets if it needs to be restarted.
If we divide the topics among different standalone connectors, can they all write into the same HDFS directory? Since the connectors then organize all files in HDFS by topic, I don't think this should be an issue but I'm wondering if anyone has experience with this setup.
Basic example:
Connector-1 config
name=connect-1
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic1
hdfs.url=hdfs://kafkaOutput
Connector-2 config
name=connect-2
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic2
hdfs.url=hdfs://kafkaOutput
distributed doesn't work for our current setup
You should be able to run connect-distributed on the exact same nodes where connect-standalone is run.
We have tried running all the topics on one connector but encounter problems recovering offsets if it needs to be restarted
Yeah, I would suggest not bundling all topics into one connector.
If we divide the topics among different standalone connectors, can they all write into the same HDFS directory?
That is my personal recommendation, and yes, they can, because the HDFS path is named after the topic and then further split by the partitioning scheme.
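As an illustration (the directory names follow the connector defaults, the file names are purely illustrative), two standalone connectors covering disjoint topic sets end up in separate subtrees of the same HDFS directory:
/topics/topic1/partition=0/topic1+0+0000000000+0000000999.avro
/topics/topic2/partition=0/topic2+0+0000000000+0000000999.avro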
Note: the above also applies to the other storage connectors (S3 & GCS).

Some data is not consumed by the Kafka engine

We are using the Kafka engine to connect to a Kafka topic, and then a MATERIALIZED VIEW to store the data.
But from time to time, some data is not consumed by the Kafka engine (we also use Flume to put the data into HDFS files, and the missing data can be found in those HDFS files).
Is there any other way to find the relevant logs and locate the problem, other than upgrading the ClickHouse server version (we are already in the process of upgrading)?
You can enable rdkafka logs by adding the <kafka><debug>all</debug></kafka> fragment to the ClickHouse config.xml file.
Logs will be written to the /var/log/clickhouse-server/stderr file (in Docker, to the docker logs).
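For reference, a minimal sketch of where that fragment sits in config.xml (the root element is <yandex> on older ClickHouse releases and <clickhouse> on newer ones):
<yandex>
    <kafka>
        <debug>all</debug>
    </kafka>
</yandex>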

Apache Kafka topic data in HDFS format

I have installed Kafka on Linux, created a topic, and published messages to it; the data is saved under /tmp/kafka-logs/topicname-0. As far as I can tell, the local file system type is xfs. Is there any way Kafka can save data on an HDFS file system instead? If yes, please help me with the configuration or steps.
Kafka runs on top of a local filesystem. It cannot be run on HDFS. If you want to move data from Kafka into HDFS, one option is to use a connector to push the data to HDFS: https://docs.confluent.io/current/connect/connect-hdfs/docs/index.html
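As a minimal sketch of that option (connector name, topic, namenode URL, and flush.size are illustrative), a standalone HDFS sink connector configuration looks like this:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=topicname
hdfs.url=hdfs://namenode:9000
flush.size=1000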

Spark write to parquet on hdfs

I have a 3-node cluster with Hadoop and Spark installed. I would like to load data from an RDBMS into a DataFrame and write that data to Parquet on HDFS. The "dfs.replication" value is 1.
When I try this with the following command, I see that all the HDFS blocks are located on the node where I executed spark-shell.
scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")
Is this the intended behaviour or should all blocks be distributed across the cluster?
Thanks
Since you are writing your data to HDFS, this does not depend on Spark but on HDFS. From Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).
So yes, this is the intended behaviour.
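If you want to confirm where the blocks actually landed, one way (assuming the /xfact path from the question) is to run hdfs fsck on the output directory:
hdfs fsck /xfact -files -blocks -locations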
Just as #nik says, I do my work with multiple clients and it worked for me:
This is the Python snippet:
# grab the column names, then rebuild the DataFrame from its RDD
columns = xfact.columns
test = sqlContext.createDataFrame(xfact.rdd.map(lambda a: a), columns)
# write the result to HDFS as Parquet, overwriting any previous output
test.write.mode('overwrite').parquet('hdfs://sparknode01.localdomain:9000/xfact')