I have a text file I'd like to read into a DataFrame, and I prefer to read it into a single column. This was working until I came across a file with ^ in it:
raw = spark.read.option("delimiter", "^").csv(data_dir + pair[0])
But alas, alack-a-day, the very next file broke the pattern. I don't see an option for a delimiter of None. Is there an efficient way to do this?
Have you looked at using spark.read.textFile instead? It may do what you want it to.
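For example, a minimal sketch using the Scala API (the snippet in the question is PySpark, where the closest equivalent is spark.read.text; data_dir and pair are assumed from the question):

val raw = spark.read.textFile(data_dir + pair(0))  // Dataset[String]: one row per line, no delimiter parsing at all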
I have a file with a pattern like this:
30,402.660064697,196.744171143,18.5563354492,15.3047790527,2.16090902686,aeroplane
30,177.246170044,113.594314575,18.9164428711,16.5203704834,1.71010773629,aeroplane
30,392.224212646,226.437973022,26.2086791992,27.4663391113,0.782454758883,aeroplane
30,241.633453369,169.349304199,628.560913086,540.041259766,0.530623318627,aeroplane
30,529.454589844,322.412719727,24.9837646484,21.5563354492,0.503144180402,aeroplane
30,365.581298828,148.842697144,21.3596801758,16.3081970215,0.490551069379,sheep
30,436.230773926,272.073303223,17.6417236328,19.9946289062,0.483223423362,aeroplane
30,438.188201904,286.455200195,20.164855957,23.1041870117,0.224495329894,dog
30,511.185546875,289.902099609,19.7315673828,19.3796386719,0.203064805828,aeroplane
30,365.777252197,177.576202393,21.8588256836,15.1581115723,0.181338354014,cat
30,380.210266113,150.396713257,19.6742553711,15.7977600098,0.171210919507,aeroplane
and another file containing the metadata:
dog
aeroplane
cat
sheep
Now I want to map each string to an int (e.g. aeroplane):
dog -->1
aeroplane -->2
cat -->3
sheep -->4
I know I can open a new file and convert it line by line with fgets,
but in my case the file has tens of thousands of such lines, which would be very slow.
Is there a smarter way to solve this problem that doesn't require creating a new file, and instead just updates the same file in place?
Assuming the file is not too large to read into memory, I suggest using textscan, strcmp, and fprintf. You can overwrite the existing file if you really want, but I would recommend writing to a different file (especially when debugging).
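The answer above names MATLAB functions; purely as a language-neutral illustration of the same read-everything, map-labels, write-back idea, here is a sketch in Scala (the file names meta.txt and data.txt are hypothetical):

import scala.io.Source
import java.io.PrintWriter

// build a label -> id map from the metadata file (1-based, matching the mapping above)
val labelToId = Source.fromFile("meta.txt").getLines.zipWithIndex
  .map { case (label, i) => label -> (i + 1) }.toMap

// read the whole data file into memory, replacing the trailing label on each line with its id
val src = Source.fromFile("data.txt")
val rewritten = src.getLines.map { line =>
  val cols = line.split(",")
  (cols.dropRight(1) :+ labelToId(cols.last).toString).mkString(",")
}.toList  // force the read to finish before overwriting the same file
src.close()

// overwrite the original file in place, as the question asks
new PrintWriter("data.txt") { write(rewritten.mkString("\n")); close() }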
I have the following type of data:
`org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[((String, String),Int)]] = MapPartitionsRDD[29] at map at <console>:38`
I'd like to write that data to a txt file so I get something like
((like,chicken),2) ((like,dog),3) etc.
I store the data in a variable called res.
For the moment I have tried this:
res.coalesce(1).saveAsTextFile("newfile.txt")
But it doesn't seem to work...
If my assumption is correct, you expect the output to be a single .txt file because it was coalesced down to one worker. That is not how Spark is built: it is meant for distributed work and should not be shoe-horned into a form where the output is not distributed. If you need a single file, use a more generic command-line tool to merge the parts afterwards.
All that said, you should see a folder named newfile.txt which contains part files with your expected output.
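For instance, a minimal sketch, assuming res is in fact a flat RDD[((String, String), Int)] (the nested RDD[RDD[...]] type shown above cannot be saved directly):

res.map { case ((a, b), n) => s"(($a,$b),$n)" }  // format each record the way it should appear
  .coalesce(1)                                   // one partition, hence one part file
  .saveAsTextFile("newfile.txt")                 // creates a directory newfile.txt/ containing part-00000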
I am reading files on HDFS via Scalding, aggregating on some fields, and writing to a tab-delimited file via Tsv. How can I write out a file that contains the schema of my output file? For example,
UnpackedAvroSource(args("input"))
  .project('key, 'var1)
  .groupBy('key) { _.sum[Long]('var1 -> 'var1sum) }
  .write(Tsv(args("output")))
I want to write an output text file that contains "key, var1sum" so that someone who picks up my output file later knows what the columns are. I'm assuming Scalding doesn't embed this in the file somewhere?
Thanks.
Just found the option writeHeader = true, which will write the column names to the output file, negating the need to write the schema out to a separate file.
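Applied to the snippet above, that looks like this (a sketch; writeHeader takes its names from the pipe's fields):

UnpackedAvroSource(args("input"))
  .project('key, 'var1)
  .groupBy('key) { _.sum[Long]('var1 -> 'var1sum) }
  .write(Tsv(args("output"), writeHeader = true))  // first line of the output file is the column names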
I have a plain text file that I am reading from my local system and uploading to HDFS. I have Spark/Scala code that reads the file in, converts it to a string, and then uses the saveAsTextFile function to specify the HDFS path where I want the file saved. Note I am using the coalesce function because I want one file saved, rather than the file getting split.
import scala.io.Source
val fields = Source.fromFile("MyFile.txt").getLines
val lines = fields.mkString
sc.makeRDD(lines).coalesce(1, true).saveAsTextFile("hdfs://myhdfs:8520/user/alan/mysavedfile")
The code saves my text to HDFS successfully; unfortunately, for some reason, each character in my string has a line feed character after it.
Is there a way around this?
I wasn't able to get this working exactly as I wanted, but I did come up with a workaround: I saved the file locally and then called a shell command through Scala to upload the completed file to HDFS. Pretty straightforward.
Would still appreciate if anyone could tell me how to copy a string directly to a file in HDFS though.
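For reference, one way to write a string straight to HDFS is Hadoop's FileSystem API rather than an RDD; a sketch, assuming the same sc, lines, and path as in the question (the stray line feeds in the original attempt come from makeRDD treating the String as a Seq[Char], one record per character):

import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("hdfs://myhdfs:8520/user/alan/mysavedfile")
val fs = path.getFileSystem(sc.hadoopConfiguration)  // resolve the FileSystem for this URI
val out = fs.create(path)
out.write(lines.getBytes("UTF-8"))  // `lines` is the single String built in the question
out.close()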