Spark partitionBy() subdirectory naming - scala

I was reading this article, and section 4.2 explains how you can use the partitionBy() function to create subdirectories for all values of the column you are partitioning by. In that example, we see a list of subdirectories of the form "state=some_state_name".
My question is: is there any way to use the partitionBy() function but rename the subdirectories to "some_state_name", i.e. removing the "state=" part?
In other words, how could I modify this code snippet to achieve this naming?
df.write.option("header",True) \
.partitionBy("state") \
.mode("overwrite") \
.csv("/tmp/zipcodes-state")

Please refer to this:
"Spark can't discover partitions that aren't encoded as partition_name=value in the path, so you'll have to create them."

Related

Reading files into a pyspark dataframe from directories and subdirectories

I have the code below to read all files within a directory, but I am struggling to also pick up the subdirectories. I won't always know what the subdirectories are, so I cannot define them explicitly.
Can anyone advise me please?
df = my_spark.read.format("csv").option("header", "true").load(yesterday+"/*.csv")
Use wildcards after the directory location from which you wish to read all the subdirectories:
"path/*/*"
Thanks to Joby, who suggested this in a comment: "can you try giving wildcards in this way and see 'path/*/*'"
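Putting that together with the snippet from the question (a sketch, assuming the same my_spark session and yesterday path variable), reading one level of subdirectories would look like:
# Sketch: read CSVs sitting in the immediate subdirectories of `yesterday`.
# Add another "/*" per extra level of nesting; on Spark 3+ you could instead
# pass .option("recursiveFileLookup", "true") with a plain directory path.
df = my_spark.read.format("csv") \
    .option("header", "true") \
    .load(yesterday + "/*/*.csv")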

How to rename partly the downloaded file using wget?

I'd like to download many files (about 10000) from an FTP server. The file names are too long, and I'd like to save them with only the date in the name. For example, I'd prefer ABCDE201604120000-abcde.nc to be saved as 20160412.nc.
Is it possible?
I am not sure whether wget provides similar functionality; with curl, however, one can take advantage of the relatively rich syntax it provides for specifying the URL of interest. For example:
curl \
"https://ftp5.gwdg.de/pub/misc/openstreetmap/SOTMEU2014/[53-54].{mp3,mp4}" \
-o "file_#1.#2"
will download files 53.mp3, 53.mp4, 54.mp3, 54.mp4. The output file is specified as file_#1.#2 - here, #1 is replaced by curl with the value of the sequence [53-54] corresponding to the file being downloaded. Similarly, #2 is replaced with either mp3 or mp4. Thus, e.g., 53.mp3 will be saved as file_53.mp3.
ewcz's answer works fine if you can enumerate the file names as shown in the post. However, if the filenames are difficult to enumerate, for example, because the integers are sparsely populated, this solution would result in a lot of 404 Not Found requests.
If this is the case, then it is probably better to download all the files recursively, as you have shown, and rename them afterwards. If the file names follow a fixed pattern, you can select the substring from the original name and use it as the new name. In the given example, the new file names start at position 5 and are 8 characters long. The following bash command renames all *.nc files in the current directory.
for f in *.nc; do mv "$f" "${f:5:8}.nc" ; done
If the file names do not follow a fixed pattern and vary in length, you can use more complex pattern substitution with sed; see this SO post for an example.
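As an illustration only (a sketch, assuming the date is the first run of eight digits in each name, as in the example above), such a sed-based rename could look like this:
# Sketch: rename every *.nc file to "<first 8-digit run>.nc"
for f in *.nc; do
  mv "$f" "$(echo "$f" | sed -E 's/[^0-9]*([0-9]{8}).*/\1.nc/')"
done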

TextIO.Read multiple files from GCS using pattern {}

I tried using the following
TextIO.Read.from("gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv")
That pattern didn't work, as I get
java.lang.IllegalStateException: Unable to find any files matching StaticValueProvider{value=gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv}
Even though those two files do exist. I also tried with a local file, using a similar expression:
TextIO.Read.from("somefolder/xxx_{2017-06-06,2017-06-06}.csv")
And that did work just fine.
I would've thought there would be support for all kinds of globs for files in GCS, but apparently not. Why is that? Is there a way to accomplish what I'm looking for?
This may be another option, in addition to Scott's suggestion and your comment on his answer:
You can define a list with the paths you want to read and then iterate over it, creating a number of PCollections in the usual way:
PCollection<String> events1 = p.apply(TextIO.Read.from(path1));
PCollection<String> events2 = p.apply(TextIO.Read.from(path2));
Then create a PCollectionList:
PCollectionList<String> eventsList = PCollectionList.of(events1).and(events2);
And then flatten this list into your PCollection for your main input:
PCollection<String> events = eventsList.apply(Flatten.pCollections());
Glob patterns work slightly differently in Google Cloud Storage vs. the local filesystem. Apache Beam's TextIO.Read transform will defer to the underlying filesystem to interpret the glob.
GCS glob wildcard patterns are documented here (Wildcard Names).
In the case above, you could use:
TextIO.Read.from("gs://xyz.abc/xxx_2017-06-*.csv")
Note however that this will also include any other matching files.
Did you try Apache Beam's TextIO.Read.from function? Here, the documentation says that it works with GCS as well:
public TextIO.Read from(java.lang.String filepattern)
Reads text files that reads from the file(s) with the given filename or filename pattern. This can be a local path (if running locally), or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>" (if running locally or using remote execution service).
Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.

Why does tup need one Tupfile per directory?

I've read a lot about the tup build system.
In many places, it is said that tup "does not support recursive rules" and that you need to have one Tupfile per directory. Yet I have not seen an official statement or explanation.
Is the above claim correct?
If so, why, and for what kind of task is this problematic? An example would be nice.
It is worth noting that currently a Tupfile can create files in a different directory. You could always read files from a different directory, so currently you can have a single Tupfile for the whole project.
Some more info here: https://groups.google.com/d/msg/tup-users/h7B1YzdgCag/qsOpBs4DIQ8J (a little outdated) + https://groups.google.com/d/msg/tup-users/-w932WkPBkw/7ckmHJ9WUCEJ (new syntax to use the group as input)
If you use the new Lua parser, you can also have a "default" Tupfile - see http://gittup.org/tup/lua_parser.html and the information about Tupdefault.lua.
Some of the answers have already mentioned that the limitation is really one Tupfile per directory in which you want output files, not one Tupfile per directory of the project. In a recent commit, this limitation was relaxed further: tup now also allows you to place output files in subdirectories of the Tupfile's directory.
In addition, with variants, it is possible to generate output files anywhere in the build tree.
The official statement can be found in tup manual: http://gittup.org/tup/manual.html
You must create a file called "Tupfile" anywhere in the tup hierarchy
that you want to create an output file based on the input files. The
input files can be anywhere else in the tup hierarchy, but the output
file(s) must be written in the same directory as the Tupfile.
(Quotation is the 1st paragraph from the section TUPFILES in the manual)
AFAIK, the limitation has something to do with the way tup stores dependencies in the .tup subdirectory, but I don't know the details.
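To make the rule concrete, here is a minimal, hypothetical Tupfile (the file name and compiler invocation are made up for illustration); the output hello.o has to live in the Tupfile's own directory, while the input may come from elsewhere in the hierarchy:
# Tupfile in src/ -- the output lands next to this Tupfile
: hello.c |> gcc -c %f -o %o |> hello.o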

Writing a script for reading many .csv files with similar filenames

I have several .csv files whose filenames are identical except for a numeric month prefix (i.e. 03_data.csv, 04_data.csv, 05_data.csv, etc.), and I'd like to read them into R.
I have two questions:
1. Is there a function in R similar to MATLAB's varname and assignin that will let me create/declare a variable name within a function or loop, so that I can read the respective .csv file into its own data.frame (i.e. 03_data.csv into a 03_data data.frame, etc.)? I want to write a quick loop to do this because the filenames are similar.
2. As an alternative, is it better to create one data.frame from the first file and then append the rest using a for loop? How would I do that?
You could look at this related question. You can create the file names easily with a paste command:
file.names <- paste(sprintf("%02d",1:10), "_data.csv", sep="")
Once you have your file names (whether by creating them or by reading them from the directory as in the other question), you can import them quickly with an lapply:
import.list <- lapply(file.names, read.csv)
Lastly, to combine the list into one data.frame, the easiest approach is to use merge_recurse from the reshape package:
library(reshape)
data <- merge_recurse(import.list)
It is also very easy to read the contents of a directory, including using a regular expression to focus only on certain names, e.g.
filestoread <- list.files(someDir, pattern="\\.csv$", full.names=TRUE)
returns all files (with their full paths) in the given directory someDir whose names end in ".csv". You can get fancier with regular expressions, which are documented in many places.
Once you have your list of files, it is straightforward to read them all using apply or lapply or a loop.
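For instance (a sketch, reusing the filestoread vector from above and assuming the files share the same column layout), you could read and stack them like this:
# Read every matching file, then stack the pieces into a single data.frame
import.list <- lapply(filestoread, read.csv)
alldata <- do.call(rbind, import.list)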