Is there a way to quickly copy files from a remote location to local in pyspark - pyspark

I'm copying files from a remote location with lftp, using its mget command. The task takes approximately 2 minutes to copy 50 XML files from an SFTP machine to my local Unix machine, and I'd like to be able to copy 20k files. An XML file is approximately 15 kB. The dataframe df_files contains the list of all the XML files that I'd like to copy.
I've tried the code below with 20 thousand files; it seems to take a few hours to create a dataframe from those files.
import subprocess

for row in df_files.tolist():
    print(row)
    cmd_p1 = """lftp sftp://username:password!#remotelocation -e "lcd /var/projects/u_admin/folder/;mget /var/projects/storage/folder/""" + row
    cmd_p2 = """;bye " """
    cmd_get_xml = cmd_p1 + cmd_p2
    s = subprocess.call(cmd_get_xml, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
j = 0
for row in df_file.itertuples(index=True, name='Pandas'):
    print(getattr(row, 'filename'))
    if j == 0:
        acq = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "Message").load("file:///var/projects/u_admin/folder/" + df_file['filename'].iloc[j])
        schema = acq.schema
    else:
        acq2 = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "Message").load("file:///var/projects/u_admin/folder/" + df_file['filename'].iloc[j], schema=schema)
        acq = acq.union(acq2)
    j += 1  # advance to the next filename
I'd like to be able to copy those files in the least amount of time possible.

First, get all your .xml files into one directory using the SCP module for Paramiko. Assuming your .xml files share the same schema (they must, since you are able to union them), once you have all these XML files in one directory you can read the entire directory directly rather than reading the files individually.
This will save a lot of the time you currently spend in the for loop.
import paramiko
from scp import SCPClient

def createSSHClient(server, port, user, password):
    client = paramiko.SSHClient()
    client.load_system_host_keys()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(server, port, user, password)
    return client

ssh = createSSHClient(server, port, user, password)
scp = SCPClient(ssh.get_transport())
Then call scp.get() or scp.put() to do SCP operations.
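For example, to pull every file listed in the dataframe in a single SSH session (a rough sketch; the remote and local directories are the paths from the question, and df_files is assumed to be a pandas DataFrame with a filename column, as in the question's second loop):
remote_dir = "/var/projects/storage/folder/"
local_dir = "/var/projects/u_admin/folder/"

# one connection, many transfers - far cheaper than spawning lftp once per file
for filename in df_files['filename'].tolist():
    scp.get(remote_dir + filename, local_path=local_dir)

scp.close()
ssh.close()
Once the files are local, the whole directory can be read in one go: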
acq_all = sqlContext.read.format("com.databricks.spark.xml").option("rowTag","Message").load("file:///var/projects/u_admin/folder/", schema = schema)
I understand your use case might be a little different, since you also have an if-else block, but the schema is the same, so the rest can be done once the files are read.
You can read one file to get the right schema, or you can define it yourself before the read.
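For example (a sketch reusing the question's paths and the Message rowTag), infer the schema once from a single representative file and pass it to the directory-wide load shown above:
# read one representative file just to capture the schema
sample = sqlContext.read.format("com.databricks.spark.xml") \
    .option("rowTag", "Message") \
    .load("file:///var/projects/u_admin/folder/" + df_file['filename'].iloc[0])
schema = sample.schema  # reuse this in the directory-wide read shown above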

Writing large spark data frame as parquet to s3 bucket

My Scenario
I have a Spark data frame in an AWS Glue job with 4 million records
I need to write it as a SINGLE parquet file in AWS S3
Current code
file_spark_df.write.parquet("s3://"+target_bucket_name)
Issue
The above code creates 100+ files, each 17.8 to 18.1 MB in size; I guess this comes from some default partition size.
Ques 1: How do I create just one file for one Spark data frame?
I checked https://spark.apache.org/docs/latest/sql-data-sources-parquet.html and didn't find any parameter to set.
Ques 2: How do I specify the name of the file?
I tried ...
file_df.write.parquet("s3://"+target_bucket_name+"/"+target_file_name)
It created 100+ files inside "s3://"+target_bucket_name+"/"+target_file_name
Ques 3: What is the right way to create sub-folders?
I need to create sub-folders inside the base S3 bucket, and the following code can do the job:
file_df.write.parquet("s3://"+target_bucket_name+"/"+today_date+"/"+target_file_name)
I'm not sure if this is the best way ... or if there is a better way?
Use .repartition(1) or, as @blackbishop says, coalesce(1) to say "I only want one partition in the output".
Use a subdirectory, since things don't like writing to the root path of a bucket; it's not a normal directory.
Filenames are chosen by the partition code, so it's best to list the directory for the single file and rename it.
It should look something like this:
val dest = "s3://" + target_bucket_name + "/subdir"
val destPath = new Path(dest)
val fs = FileSystem.get(destPath.toUri, conf) // where conf is the hadoop conf from your spark conf
fs.delete(destPath, true)
file_spark_df.repartition(1).write.parquet(dest)
// at this point there should be only one file in the dest dir
val files = fs.listStatus(destPath) // array of FileStatus of size == 1
if (files.length != 1) throw new IOException("Wrong number of files in " + destPath)
fs.rename(files(0).getPath, new Path(destPath, "final-filename.parquet"))
(Note: code written at the console, not compiled or tested, etc. You should get the idea though.)
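For reference, roughly the same idea in PySpark (a sketch, not tested; it reaches the Hadoop FileSystem API through Spark's internal JVM gateway and assumes a SparkSession named spark, with the placeholder names from the question):
dest = "s3://" + target_bucket_name + "/subdir"

# write everything as a single part file
file_spark_df.repartition(1).write.mode("overwrite").parquet(dest)

# rename that part file using the Hadoop FileSystem API
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()
dest_path = jvm.org.apache.hadoop.fs.Path(dest)
fs = dest_path.getFileSystem(hadoop_conf)

part_files = [s.getPath() for s in fs.listStatus(dest_path)
              if s.getPath().getName().startswith("part-")]
if len(part_files) != 1:
    raise IOError("Wrong number of part files in " + dest)
fs.rename(part_files[0], jvm.org.apache.hadoop.fs.Path(dest_path, "final-filename.parquet"))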

Save variables as mat files on S3

I would like to save variables as .mat files on S3. The example on the official site shows a "tall table" only. Maybe I can use the "system" command to step outside MATLAB, but I am looking for a straightforward solution.
Any suggestion?
It does look like save does not support saving to remote filesystems.
You can, however, write matrices, cells, tables and timetables.
An example which uses writetable:
LastName = {'Smith';'Johnson';'Williams';'Jones';'Brown'};
Age = [38;43;38;40;49];
T = table(Age,LastName)
writetable(T,'s3://.../table.txt')
Note:
To write to a remote location, filename must contain the full path of
the file specified as a uniform resource locator (URL) of the form:
scheme_name://path_to_file/my_file.ext
To obtain the right URL of the bucket, you can navigate to the contents of the S3 bucket, select a file in there, choose Copy path and remove the name of the file (e.g. table.txt).
The alternative is, as you mentioned, a system call:
a = rand(5);
save('matExample','a');
system('aws s3api put-object --bucket mybucket --key=s3mat.mat --body=matExample.mat')
The mat file matExample.mat is saved as s3mat.mat in the bucket.

Searching all file names recursively in hdfs using Spark

I’ve been looking for a while now for a way to get all filenames in a directory and its sub-directories in Hadoop file system (hdfs).
I found out I can use these commands to get it:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
sc.wholeTextFiles(path).map(_._1)
Here is the wholeTextFiles documentation:
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Parameters:
path - Directory to the input data files, the path can be
comma separated paths as the list of inputs.
minPartitions - A
suggestion value of the minimal splitting number for input data.
Returns:
RDD representing tuples of file path and the corresponding
file content
Note: Small files are preferred; large files are also allowable, but may
cause bad performance. On some filesystems, .../path/* can be a more
efficient way to read all files in a directory than .../path/ or
.../path. Partitioning is determined by data locality, which may result
in too few partitions by default.
As you can see "wholeTextFiles" returns a pair RDD with both the filenames and their content. So I tried mapping it and taking only the file names, but I suspect it still reads the files.
The reason I suspect so: if I try to count the result (for example), I get the Spark equivalent of "out of memory" (losing executors and not being able to complete the tasks).
I would rather use Spark to achieve this goal the fastest way possible, however, if there are other ways with a reasonable performance I would be happy to give them a try.
EDIT:
To be clear - I want to do this using Spark. I know I can do it with HDFS commands and the like, but I would like to know how to do it with the existing tools provided with Spark, and perhaps get an explanation of how I can make wholeTextFiles not read the text itself (similar to how transformations only happen after an action, and some of the "commands" never really happen).
Thank you very much!
This is a way to list all the files down to the deepest subdirectory, without using wholeTextFiles. It makes a recursive call for each level of subdirectories.
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
import org.apache.spark.SparkContext
import scala.collection.mutable.ListBuffer

val lb = new ListBuffer[String] // variable to hold the final list of paths

def getAllFiles(path: String, sc: SparkContext): ListBuffer[String] = {
  val conf = sc.hadoopConfiguration
  val fs = FileSystem.get(conf)
  val files: RemoteIterator[LocatedFileStatus] = fs.listLocatedStatus(new Path(path))
  while (files.hasNext) {
    val status = files.next
    val filepath = status.getPath.toString
    lb += filepath
    if (status.isDirectory) {
      getAllFiles(filepath, sc) // recursive call into subdirectories only
    }
  }
  println(lb)
  lb
}
That's it. It was tested successfully; you can use it as is.
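If you are working from PySpark rather than Scala, a similar listing can be done through the JVM gateway (a sketch, untested; FileSystem.listFiles already walks subdirectories when its recursive flag is true, so no explicit recursion is needed, and the root path is a placeholder):
jvm = sc._jvm
hadoop_conf = sc._jsc.hadoopConfiguration()
root = jvm.org.apache.hadoop.fs.Path("hdfs:///some/dir")  # placeholder path
fs = root.getFileSystem(hadoop_conf)

file_paths = []
it = fs.listFiles(root, True)  # recursive iterator over regular files only
while it.hasNext():
    file_paths.append(it.next().getPath().toString())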

Spark - Get from a directory with nested folders all filenames of a particular data type

I have a directory with some subfolders which contain different parquet files. Something like this:
2017-09-05
    10-00
        part00000.parquet
        part00001.parquet
    11-00
        part00000.parquet
        part00001.parquet
    12-00
        part00000.parquet
        part00001.parquet
What I want is, by passing the path to the 2017-09-05 directory, to get a list of the names of all the parquet files.
I was able to achieve it, but in a very inefficient way:
val allParquetFiles = sc.wholeTextFiles("C:/MyDocs/2017-09-05/*/*.parquet")
allParquetFiles.keys.foreach((k) => println("The path to the file is: "+k))
So each key is a name I am looking for, but this process also requires me to load all the files, which I then can't use, since I get them in binary (and I don't know how to convert them into a dataframe).
Once I have the keys (that is, the list of file paths) I am planning to invoke:
val myParquetDF = sqlContext.read.parquet(filePath);
As you may have already understood, I am quite new to Spark. So if there is a faster or easier approach to reading a list of parquet files located in different folders, please let me know.
My partial solution: I wasn't able to get the paths of all the filenames in a folder, but I was able to get the content of all files of that type into the same dataframe, which was my ultimate goal. In case someone needs it in the future, I used the following line:
val df = sqlContext.read.parquet("C:/MyDocs/2017-05-09/*/*.parquet")
Thanks for your time
You can do it using the HDFS API, like this:
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
val fs = FileSystem.get(new Configuration())
val files = fs.globStatus(new Path("C:/MyDocs/2017-09-05/*/*.parquet")).map(_.getPath.toString)
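The same listing from PySpark, via the JVM gateway (a sketch; the glob pattern is the one from the question):
jvm = sc._jvm
hadoop_conf = sc._jsc.hadoopConfiguration()
pattern = jvm.org.apache.hadoop.fs.Path("C:/MyDocs/2017-09-05/*/*.parquet")
fs = pattern.getFileSystem(hadoop_conf)

parquet_files = [s.getPath().toString() for s in fs.globStatus(pattern)]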
First, it is better to avoid using wholeTextFiles: this method reads each whole file at once. Try the textFile method instead.
Second, if you need to get all files recursively in one directory, you can achieve it with the textFile method:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
This configuration enables recursive search (it works for Spark jobs as well as MapReduce jobs). Then just invoke sc.textFile(path).

How to read and write DataFrame from Spark

I need to save a DataFrame in CSV or parquet format (as a single file) and then open it again. The amount of data will not exceed 60 MB, so a single file is a reasonable solution. This simple task is giving me a lot of headache... This is what I tried:
To read the file if it exists:
df = sqlContext
.read.parquet("s3n://bucket/myTest.parquet")
.toDF("key", "value", "date", "qty")
To write the file:
df.write.parquet("s3n://bucket/myTest.parquet")
This does not work because:
1) write creates the folder myTest.parquet with "hadoopish" part files that I later cannot read with .read.parquet("s3n://bucket/myTest.parquet"). In fact I don't care about having multiple "hadoopish" files, as long as I can later read them easily into a DataFrame. Is that possible?
2) I am always working with the same file myTest.parquet, which I update and overwrite in S3. It tells me that the file cannot be saved because it already exists.
So, can someone show me the right way to do the read/write loop? The file format doesn't matter to me (CSV, parquet, "hadoopish" files) as long as I can make the read and write loop work.
You can save your DataFrame with saveAsTable("TableName") and read it back with table("TableName"). The location can be set with spark.sql.warehouse.dir, and you can overwrite an existing table with mode(SaveMode.Overwrite). You can read more in the official documentation.
In Java it would look like this:
SparkSession spark = ...
spark.conf().set("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables");
Dataset<Row> data = ...
data.write().mode(SaveMode.Overwrite).saveAsTable("TableName");
Now you can read the data back with:
spark.read().table("TableName");
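If you would rather stay with plain parquet paths instead of tables, the same loop works in PySpark once overwrite mode is used (a sketch, assuming a SparkSession named spark and the bucket path from the question):
path = "s3n://bucket/myTest.parquet"  # Spark writes this as a folder of part files

# overwrite the previous run instead of failing with "path already exists"
df.write.mode("overwrite").parquet(path)

# reading the folder back merges all the part files into one DataFrame
df2 = spark.read.parquet(path).toDF("key", "value", "date", "qty")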