Is there any way I can convert a pair RDD back to a regular RDD?
Suppose I get a local csv file, and I first load it as a regular rdd
rdd = sc.textFile("$path/$csv")
Then I create a pair rdd (i.e. key is the string before "," and value is the string after ",")
pairRDD = rdd.map(lambda x : (x.split(",")[0], x.split(",")[1]))
I store the pairRDD by using the saveAsTextFile()
pairRDD.saveAsTextFile("$savePath")
However, as I investigated, the stored file contains some unnecessary characters, such as "u'", "(" and ")" (as PySpark simply calls toString() to store the key-value pairs).
I was wondering if I can convert it back to a regular RDD, so that the saved file won't contain "u'", "(" or ")".
Or are there any other storage methods I can use to get rid of the unnecessary characters?
Those characters are the Python string representation of your data (tuples and Unicode strings). Since you use saveAsTextFile, you should convert your data to text, i.e. a single string per record. You can use map to turn the key/value tuple back into a single string, e.g.:
pairRDD.map(lambda kv: "Value %s for key %s" % (kv[1], kv[0])).saveAsTextFile(savePath)
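Outside Spark, the effect can be sketched in plain Python (the sample pairs here are made up for illustration):

```python
# Hypothetical (key, value) pairs, standing in for the records of pairRDD.
pairs = [("alice", "30"), ("bob", "25")]

# saveAsTextFile effectively writes str(record), artifacts included
# (under Python 2, unicode strings would also show the u'' prefix).
raw = [str(p) for p in pairs]        # "('alice', '30')", ...

# Formatting each tuple into a single string gives clean output lines.
formatted = ["%s,%s" % (k, v) for k, v in pairs]
```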
Below is the value of a string in a text column.
select col1 from tt_d_tab;
'A:10000000,B:50000000,C:1000000,D:10000000,E:10000000'
I'm trying to convert it into JSON of the format below.
'{"A": 10000000,"B": 50000000,"C": 1000000,"D": 10000000,"E": 10000000}'
Can someone help on this?
If you know that neither the keys nor values will have : or , characters in them, you can write
select json_object(regexp_split_to_array(col1,'[:,]')) from tt_d_tab;
This splits the string on every colon and comma, then interprets the result as key/value pairs.
If the string manipulation gets any more complicated, SQL may not be the ideal tool for the job, but it's still doable, either by this method or by converting the string into the form you need directly and then casting it to json with ::json.
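For illustration only, the same split-and-pair idea in Python (note that json_object keeps the values as text; here they are cast to integers to match the desired output):

```python
import json
import re

s = 'A:10000000,B:50000000,C:1000000,D:10000000,E:10000000'

# Split on every colon and comma, then pair up the tokens as key/value,
# mirroring regexp_split_to_array(col1, '[:,]') + json_object.
tokens = re.split(r'[:,]', s)
obj = dict(zip(tokens[::2], (int(t) for t in tokens[1::2])))
as_json = json.dumps(obj)
```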
If your key is a single capital letter as in your example
select concat('{',regexp_replace('A:10000000,B:50000000,C:1000000,D:10000000,E:10000000','([A-Z])','"\1"','g'),'}')::json json_field;
A more general case, with any number of letters, capitalized or not:
select concat('{',regexp_replace('Ac:10000000,BT:50000000,Cs:1000000,D:10000000,E:10000000','([a-zA-Z]+)','"\1"','g'),'}')::json json_field;
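The same quoting trick, sketched in Python for illustration:

```python
import json
import re

s = 'Ac:10000000,BT:50000000,Cs:1000000,D:10000000,E:10000000'

# Wrap every run of letters in double quotes, then add the braces,
# mirroring the regexp_replace + concat approach above.
quoted = re.sub(r'([a-zA-Z]+)', r'"\1"', s)
parsed = json.loads('{' + quoted + '}')
```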
I am trying to check for incomplete records and identify the bad records in Spark.
E.g. a sample test.txt file, in record format, with columns separated by \t:
L1C1 L1C2 L1C3 L1C4
L2C1 L2C2 L2C3
L3C1 L3C2 L3C3 L3C4
scala> sc.textFile("test.txt").filter(_.split("\t").length < 4).collect.foreach(println)
L2C1 L2C2 L2C3
The second line is printed as having fewer columns.
How should I parse it without ignoring the empty column at the end of the second line?
Scala's split (which delegates to Java's String.split) removes trailing empty substrings by default. To keep all the substrings, pass a negative limit:
"L2C1\tL2C2\tL2C3\t".split("\t", -1)
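For comparison, Python's str.split keeps trailing empty fields by default, so an analogous completeness check needs no extra flag (assuming the second line ends with a trailing tab):

```python
# Unlike Java/Scala's String.split, Python keeps trailing empty substrings.
line = "L2C1\tL2C2\tL2C3\t"
fields = line.split("\t")
```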
I want to read 100 numbers from a file in which they are stored one per line. I am not sure which data structure should be used here, because later I will need to sum all these numbers together and extract the first 10 digits of the sum.
I only managed to simply read the file, but I want to split all the text by newline separators and get each number as a list or array element:
val source = Source.fromFile("pathtothefile")
val lines = source.getLines.mkString
I would be grateful for any advice on a data structure to be used here!
Update on approach:
val lines = Source.fromFile("path").getLines.toList
You almost have it there; just map to BigInt, and then you have a list of BigInt:
val lines = Source.fromFile("path").getLines.map(BigInt(_)).toList
(and then you can use .sum to sum them all up, etc)
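A Python sketch of the same approach (Python ints are arbitrary-precision, so no BigInt equivalent is needed; the file name below is a placeholder):

```python
def first_digits_of_sum(lines, n=10):
    # Parse one integer per non-empty line, sum them, and return the
    # first n digits of the sum as a string.
    numbers = [int(line) for line in lines if line.strip()]
    return str(sum(numbers))[:n]

# Usage (hypothetical path): first_digits_of_sum(open("numbers.txt"))
```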
I am facing an issue when reading and parsing a CSV file. Some records have a newline "escaped" by a \, and those records are not quoted. The file might look like this:
Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;
I've tried to read it using sc.textFile("file.csv") and using sqlContext.read.format("..databricks..").option("escape/delimiter/...").load("file.csv")
However, no matter how I read it, a new record/line/row is created whenever "\ \n" is reached. So, instead of getting 2 records from the previous file, I get three:
[Line1field1,Line1field2.1,null] (3 fields)
[Line1field2.2,Line1field3,null] (3 fields)
[Line2FIeld1,Line2field2,Line2field3;] (3 fields)
The expected result is:
[Line1field1,Line1field2.1 Line1field2.2,Line1field3] (3 fields)
[Line2FIeld1,Line2field2,Line2field3] (3 fields)
(How the newline symbol is saved in the record is not that important, main issue is having the correct set of records/lines)
Any ideas of how to achieve that? Without modifying the original file, and preferably without any post/re-processing (for example, reading the file, filtering out any lines with fewer fields than expected, and then concatenating them could be a solution, but it is not at all optimal).
My hope was to use Databricks' CSV parser and set the escape character to \ (which is supposed to be the default), but that didn't work; I got an error saying java.io.IOException: EOF whilst processing escape sequence.
Should I somehow extend the parser, or write my own? What would be the best solution?
Thanks!
EDIT: Forgot to mention, I'm using Spark 1.6.
The wholeTextFiles API should come to the rescue in your case. It reads files as key-value pairs: the key is the path of the file and the value is the entire text of the file. You will have to do some replacements and splitting to get the desired output, though:
val rdd = sparkSession.sparkContext.wholeTextFiles("path to the file")
.flatMap(x => x._2.replace("\\\n", "").replace(";\n", "\n").split("\n"))
.map(x => x.split(";"))
The resulting RDD output is:
[Line1field1,Line1field2.1 Line1field2.2,Line1field3]
[Line2FIeld1,Line2field2,Line2field3]
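The replace-and-split steps can be illustrated in plain Python on the sample file's content:

```python
# The sample file's content, with a backslash-escaped line break in record 1.
text = ("Line1field1;Line1field2.1 \\\n"
        "Line1field2.2;Line1field3;\n"
        "Line2FIeld1;Line2field2;Line2field3;\n")

# Join escaped line breaks, drop the trailing ';' before each newline,
# then split into records and fields, as the Scala snippet above does.
joined = text.replace("\\\n", "").replace(";\n", "\n")
records = [line.split(";") for line in joined.splitlines() if line]
```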
Sample Dataset:
$, Claw "OnCreativity" (2012) [Himself]
$, Homo Nykytaiteen museo (1986) [Himself] <25>
Suuri illusioni (1985) [Guests] <22>
$, Steve E.R. Sluts (2003) (V) <12>
$hort, Too 2012 AVN Awards Show (2012) (TV) [Himself - Musical Guest]
2012 AVN Red Carpet Show (2012) (TV) [Himself]
5th Annual VH1 Hip Hop Honors (2008) (TV) [Himself]
American Pimp (1999) [Too $hort]
I have created a Key-Value Pair RDD using the following code:
To split data: val actorTuple = actor.map(l => l.split("\t"))
To make KV pair: val actorKV = actorTuple.map(l => (l(0), l(l.length-1))).filter{case(x,y) => y != "" }
The Key-Value RDD output on console:
Array(($, Claw,"OnCreativity" (2012) [Himself]), ($, Homo,Nykytaiteen museo (1986) [Himself] <25>), ("",Suuri illusioni (1985) [Guests] <22>), ($, Steve,E.R. Sluts (2003) (V) <12>).......
But a lot of lines have "" as the key, i.e. blank (see the RDD output above), because of the nature of the dataset. So I want a function that copies the actor from the previous line into the current line whenever the key is empty.
How can this be done?
New to Spark and Scala. But perhaps it would be simpler to change how you parse the lines, and first create a pair RDD whose values are lists, e.g.:
($, Homo, (Nykytaiteen museo (1986) [Himself] <25>,Suuri illusioni (1985) [Guests] <22>) )
I don't know your data, but perhaps if a line doesn't begin with "$" you append onto the value list.
Then depending on what you want to do, perhaps you could use flatMapValues(func) on the pair RDD described above. This applies a function which returns an iterator to each value of a pair RDD, and for each element returned, produces a key-value entry with the old key.
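In plain Python terms (no Spark), flatMapValues behaves roughly like this sketch:

```python
def flat_map_values(pairs, func):
    # For each (key, value) pair, apply func to the value and emit one
    # (key, element) pair per element of the iterable func returns.
    return [(k, elem) for k, v in pairs for elem in func(v)]
```

For example, flat_map_values([("$, Homo", ["t1", "t2"])], lambda v: v) yields [("$, Homo", "t1"), ("$, Homo", "t2")].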
ADDED:
What format is your input data ("Sample Dataset") in? Is it a text file or .tsv?
You probably want to load the whole file at once. That is, use .wholeTextFiles() rather than .textFile() to load your data. This is because your records are stored across more than one line in the file.
ADDED
I'm not going to download the file, but it seems to me that each record you are interested in begins with "$".
Spark can work with any of the Hadoop Input formats, so check those to see if there is one which will work for your sample data.
If not, you could write your own Hadoop InputFormat implementation that parses files into records split on this character instead of the default for TextFiles, which is the '\n' character.
Continuing from the idea xyzzy gave, how about you try this after loading in the file as a string:
val actorFileSplit = actorsFile.split("\n\n")
val actorData = sc.parallelize(actorFileSplit)
val actorDataSplit = actorData.map(x => x.split("\t+", 2).toList).map(line => (line(0), line(1).split("\n\t+").toList))
To explain what I'm doing: I start by splitting the string every time we find a double line break. Next, I parallelize this into a SparkContext for the mapping functions. Then I split every entry into two parts at the first occurrence of a run of tabs (one or more). The first part should now be the actor, and the second part should still be the string with the movie titles. The second part may in turn be split at every newline followed by a run of tabs. This creates a list of all the titles for every actor. The final result is of the form:
actorDataSplit = [(String, [String])]
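The per-record parsing can be sketched in plain Python, using a made-up record in the dataset's layout (a run of tabs separates the actor from the first title, and each continuation line starts with tabs):

```python
import re

record = ("$, Homo\tNykytaiteen museo (1986) [Himself] <25>\n"
          "\t\tSuuri illusioni (1985) [Guests] <22>")

# Split off the actor at the first run of tabs, then split the remaining
# text into one title per continuation line.
actor, titles_blob = re.split(r"\t+", record, maxsplit=1)
titles = re.split(r"\n\t+", titles_blob)
```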
Good luck