Fastest way to upload text files into HDFS (Hadoop) - Eclipse

I am trying to upload 1 million text files into HDFS.
Uploading those files through Eclipse takes around 2 hours.
Can anyone suggest a faster way to do this?
What I am thinking of is: zip all the text files into a single archive, upload that to HDFS, and then extract the files on HDFS using some unzipping technique.
Any help will be appreciated.

DistCp is a good way to upload files to HDFS, but for your particular use case (you want to upload local files to a single-node cluster running on the same computer) the best thing is not to upload the files to HDFS at all. You can use the local filesystem (file://a_file_in_your_local_disk) instead of HDFS, so there is no need to upload the files.
See this other SO question for examples on how to do this.
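If you do go that route, a minimal sketch of pointing a job directly at the local filesystem might look like this. PySpark is used purely for illustration and the paths are hypothetical; the same file:// URI works as an input path for a plain MapReduce job on a single-node setup:

    # Minimal sketch: read the small text files straight from local disk,
    # no upload step. Local and HDFS paths below are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-local-files").getOrCreate()

    # file:// makes Hadoop/Spark read from the local filesystem directly.
    local_df = spark.read.text("file:///home/user/million_text_files/*.txt")

    # ... process as usual, then write results wherever you need them.
    local_df.write.text("hdfs://localhost:9000/output/processed")

    spark.stop()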

Try DistCp. DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. You can use it to copy data from your local FS to HDFS as well.
Example: bin/hadoop distcp file:///Users/miqbal1/dir1 hdfs://localhost:9000/

Related

Load COPY (COBOL) file in the Talend tool

I would like to load a file in Talend that is supposed to have compressed data inside. I don't know how to do that; I don't know how to load either a plain COPY file or a COPY file with compressed data. Can someone help me, please?
These are sample files (one of them is the schema): https://www.dropbox.com/sh/bqvcw0dk56hqhh2/AABbs1GRKjo7rycQrcUM_dgta?dl=0
P.S.: I know how to load CSV, Excel, data from SQL databases, among others. However, I don't know how to handle this kind of file.
Thanks in advance.

Save PostgreSQL data in Parquet format

I'm working on a project that needs to generate Parquet files from a huge PostgreSQL database. The data size can be gigantic (e.g. 10 TB). I'm very new to this topic and have done some research online, but did not find a direct way to convert the data to Parquet files. Here are my questions:
The only feasible solution I saw is to load the Postgres table into Apache Spark via JDBC and save it as a Parquet file. But I assume it will be very slow while transferring 10 TB of data.
Is it possible to generate a single huge Parquet file of 10 TB, or is it better to create multiple Parquet files?
I hope my question is clear, and I really appreciate any helpful feedback. Thanks in advance!
Use the ORC format instead of the Parquet format for this volume.
I assume the data is partitioned, so I think it's a good idea to extract in parallel, taking advantage of the data partitioning.
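If Spark is an option, a rough sketch of that parallel JDBC extraction might look like the following. The connection URL, credentials, table and partition column are placeholders, and swapping .parquet() for .orc() switches to the format suggested above:

    # Rough sketch: parallel JDBC extract from Postgres, written out as many files.
    # URL, credentials, table and partition column are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pg-to-columnar").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://pg-host:5432/mydb")
          .option("dbtable", "big_table")
          .option("user", "etl_user")
          .option("password", "secret")
          # Parallel extraction: Spark issues one query per partition range.
          .option("partitionColumn", "id")
          .option("lowerBound", "1")
          .option("upperBound", "1000000000")
          .option("numPartitions", "200")
          .load())

    # Don't aim for one 10 TB file; let Spark write many smaller files.
    df.write.mode("overwrite").parquet("hdfs:///warehouse/big_table_parquet")
    # For ORC instead: df.write.mode("overwrite").orc("hdfs:///warehouse/big_table_orc")

    spark.stop()

Note that partitionColumn needs to be a numeric, date or timestamp column with known bounds for this kind of range splitting to work.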

Data streaming to Google Cloud ML Engine

I found that Google ML Engine expects data in Cloud Storage, BigQuery, etc. Is there any way to stream data to ML Engine? For example, imagine that I need to use data from a WordPress or Drupal site to create a TensorFlow model, say a spam detector. One way is to export all the data as CSV and upload it to Cloud Storage using the google-cloud-php library. The problem here is that, for every minor change, we have to upload the whole dataset again. Is there a better way?
By "minor change", do you mean that when you get new data, you have to upload everything (the old and the new data) to GCS again? One idea is to export just the new data to GCS on some schedule, producing many CSV files over time. You can write your trainer to take a file pattern and expand it using get_matching_files/Glob, or to accept multiple file paths.
You can also modify your training code to start from an old checkpoint and train over just the new data (which is in its own file) for a few steps.
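A small sketch of the file-pattern idea in TensorFlow Python (the bucket path and pattern are hypothetical):

    # Sketch: expand a GCS file pattern so the trainer picks up every export,
    # old and new, without re-uploading anything. Paths are placeholders.
    import tensorflow as tf

    pattern = "gs://my-bucket/exports/*.csv"

    # Option 1: expand the pattern yourself (this is what get_matching_files/Glob does).
    files = tf.io.gfile.glob(pattern)

    # Option 2: let tf.data handle the pattern directly.
    dataset = (tf.data.Dataset.list_files(pattern)
               .interleave(tf.data.TextLineDataset, cycle_length=4))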

How to improve the read/write speed for a lot of small files?

My job is to improve the speed of reading a lot of small files (1 KB) from disk and writing them into our database.
The database is open source to me, and I can change all the code, from the client to the server.
The database architecture is a simple master-slave, distributed, HDFS-based database like HBase. Small files from disk can be inserted into our database and automatically combined into bigger blocks, which are then written to HDFS. (Big files can also be split into smaller blocks by the database and then written to HDFS.)
One way to change the client is to increase the number of threads.
I don't have any other ideas. Alternatively, can you suggest how to approach the performance analysis?
One way to process such small files could be to pack them into a SequenceFile and store it in HDFS, then use that file as the input of a MapReduce job that puts the data into HBase or a similar database.
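The answer above describes a plain SequenceFile + MapReduce approach; if Spark happens to be available, one quick way to do just the packing step is something like this (local and HDFS paths are placeholders):

    # Sketch: pack many 1 KB files into one SequenceFile on HDFS
    # as (filename, content) pairs. Paths are placeholders.
    from pyspark import SparkContext

    sc = SparkContext(appName="pack-small-files")

    # One (path, content) pair per small file.
    pairs = sc.wholeTextFiles("file:///data/small_files/*")

    # Stored as a SequenceFile of Text keys and Text values.
    pairs.saveAsSequenceFile("hdfs:///packed/small_files_seq")

    sc.stop()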
This uses AWS as an example, but it could be any storage/queue setup:
If the files can live on shared storage such as S3, you could add one queue entry per file and then just keep throwing servers at the queue to add the files to the DB. At that point the bottleneck becomes the DB instead of the client.
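A very rough worker sketch of that queue idea with boto3; the queue URL, bucket, and the insert_into_db call are all hypothetical, not a real API:

    # Rough worker sketch: each queue entry names one small file in S3.
    # Queue URL, bucket and insert_into_db() are placeholders.
    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/small-files"
    BUCKET = "my-small-files-bucket"

    def insert_into_db(key, data):
        """Placeholder for whatever write path the database client exposes."""
        pass

    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            key = msg["Body"]                      # message body holds the S3 key
            obj = s3.get_object(Bucket=BUCKET, Key=key)
            insert_into_db(key, obj["Body"].read())
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])

Scaling out is then just a matter of running more copies of this worker until the database, not the client, is the limit.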

How to copy HBase data

I need to copy data from one cluster to another. I did some research and found that I could use CopyTable, which basically scans data from one cluster and puts it into the other.
I also know that I can copy over the whole HDFS volume for HBase. I am wondering whether this works, and whether it performs better than CopyTable? (I believe it should perform better, since it copies files without any logical operations.)
Take a look at HBase replication; you can find a short how-to here.
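For completeness, the scan-and-put idea behind CopyTable looks roughly like this in Python with happybase (it requires the HBase Thrift server on both clusters; hosts and the table name are hypothetical). For real volumes, prefer CopyTable, snapshots/ExportSnapshot, or replication over a client-side loop like this:

    # Rough illustration of what CopyTable does: scan the source table,
    # put each row into the destination table. Hosts/table are placeholders.
    import happybase

    src = happybase.Connection("source-cluster-thrift-host")
    dst = happybase.Connection("dest-cluster-thrift-host")

    src_table = src.table("my_table")
    dst_table = dst.table("my_table")  # assumed to exist with the same column families

    with dst_table.batch(batch_size=1000) as batch:
        for row_key, columns in src_table.scan():
            batch.put(row_key, columns)

    src.close()
    dst.close()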