Spark - read text file, strip off first X and last Y rows using monotonically_increasing_id

I have to read in files from vendors that can potentially get pretty big (multiple GB). These files may have multiple header and footer rows that I want to strip off.
Reading the file in is easy:
val rawData = spark.read
.format("csv")
.option("delimiter","|")
.option("mode","PERMISSIVE")
.schema(schema)
.load("/path/to/file.csv")
I can add a simple row number using monotonically_increasing_id:
val withRN = rawData.withColumn("aIndex",monotonically_increasing_id())
That seems to work fine.
I can easily use that to strip off header rows:
val noHeader = withRN.filter($"aIndex".geq(2))
but how can I strip off footer rows?
I was thinking about getting the max of the index column, and using that as a filter, but I can't make that work.
val MaxRN = withRN.agg(max($"aIndex")).first.toString
val noFooter = noHeader.filter($"aIndex".leq(MaxRN))
That returns no rows, because MaxRN is a string.
If I try to convert it to a long, that fails:
noHeader.filter($"aIndex".leq(MaxRN.toLong))
java.lang.NumberFormatException: For input string: "[100000]"
How can I use that max value in a filter?
Is trying to use monotonically_increasing_id like this even a viable approach? Is it really deterministic?

This happens because first returns a Row, not a plain value. To read the first field of that Row as a Long, use:
val MaxRN = withRN.agg(max($"aIndex")).first.getLong(0)
Converting the Row to a string yields [100000], which is not a valid Long, so toLong throws a NumberFormatException.
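Note that filtering with leq(MaxRN) keeps every row, so dropping the last Y rows needs an upper bound of MaxRN - Y. That arithmetic is only safe when the index is consecutive, and Spark documents monotonically_increasing_id as increasing and unique but not consecutive. The following is a minimal sketch of the index-and-filter idea on a local collection, where zipWithIndex yields consecutive positions; RDD.zipWithIndex has the same shape and could be used in its place. The helper name stripHeaderFooter and the sample rows are hypothetical.

```scala
// Hypothetical helper: drop the first x and the last y rows by position.
def stripHeaderFooter[A](rows: Seq[A], x: Int, y: Int): Seq[A] = {
  val lastKept = rows.length - y // exclusive upper bound on the kept indices
  rows.zipWithIndex.collect {
    case (row, i) if i >= x && i < lastKept => row
  }
}

// Made-up sample: two header rows, three data rows, one footer row.
val sample = Seq("H1", "H2", "a|1", "b|2", "c|3", "F1")
val body = stripHeaderFooter(sample, x = 2, y = 1)
```

Because the bounds are computed from positions rather than generated IDs, the result does not depend on how the data is partitioned.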

Related

Convert csv file to map

I have a csv file containing a list of abbreviations and their full values such that the file looks like the below
original,mappedValue
bbc,britishBroadcastingCorporation
ch4,channel4
I want to convert this csv file into a Map such that it is of the form
val x:Map[String,String] = Map("bbc"->"britishBroadcastingCorporation", "ch4"->"channel4")
I have tried using the below:
Source.fromFile("pathToFile.csv").getLines().drop(1).map(_.split(","))
but this leaves me with an Iterator[Array[String]]

You are close: split returns an array. You have to convert it into a tuple and then into a map.
Source.fromFile("/home/agr/file.csv").getLines().drop(1).map(csv=> (csv.split(",")(0),csv.split(",")(1))).toMap
res4: scala.collection.immutable.Map[String,String] = Map(bbc -> britishBroadcastingCorporation, ch4 -> channel4)
In real code you would check for bad rows, for example by filtering out splits with fewer than 2 elements, or by putting them into a separate bin as bad data.
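The bad-row handling described above could look roughly like this. It is a sketch on in-memory lines rather than a file, and the names good, bad, mapping and rejected are illustrative only.

```scala
// Made-up input: a header line, two valid rows, and one malformed row.
val lines = Seq(
  "original,mappedValue",
  "bbc,britishBroadcastingCorporation",
  "broken-row",
  "ch4,channel4"
)

// Split each data line once, then separate rows with exactly two fields.
val (good, bad) = lines.drop(1).map(_.split(",", -1)).partition(_.length == 2)

// Good rows become key -> value pairs; bad rows are kept for inspection.
val mapping: Map[String, String] = good.collect { case Array(k, v) => k -> v }.toMap
val rejected: Seq[String] = bad.map(_.mkString(","))
```

Splitting once per line also avoids calling split twice, as the one-liner above does.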

To split data into good and bad rows and write to output file using Spark program

I am trying to separate good and bad rows by counting the number of delimiters in a TSV.gz file, and to write them to separate files in HDFS.
I ran the below commands in spark-shell
Spark Version: 1.6.3
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(line => line.split("\t"))
var good = data.filter(a => a.size == 995)
val bad = data.filter(a => a.size < 995)
When I check the first record, the values show up correctly in the spark-shell:
good.first()
But when I write to an output file, I see the records below:
good.saveAsTextFile("good.tsv")
Output in HDFS (top 2 rows):
[Ljava.lang.String;@1287b635
[Ljava.lang.String;@2ef89922
Could you please let me know how to get the required output file in HDFS?
Thanks.!

Your final RDD has type org.apache.spark.rdd.RDD[Array[String]], so the write operation saves each array's default toString (an object reference) instead of its contents.
Convert each array of strings back into a tab-separated string before saving:
good.map(item => item.mkString("\t")).saveAsTextFile("goodFile.tsv")
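The difference between the two string forms can be seen without Spark. This small sketch uses a made-up three-field array:

```scala
val fields = Array("a", "b", "c")

// Default JVM toString for an array: a type tag followed by an identity hash.
val asObject = fields.toString

// mkString joins the elements, giving back a tab-separated line.
val asLine = fields.mkString("\t")
```

Here asObject has the same shape as the lines seen in the HDFS output, while asLine is the text that should be written.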

PySpark list() in withColumn() only works once, then AssertionError: col should be Column

I have a DataFrame with 6 string columns named 'Spclty1'...'Spclty6' and another 6 named 'StartDt1'...'StartDt6'. I want to zip them and collapse them into a single column that looks like this:
[[Spclty1, StartDt1]...[Spclty6, StartDt6]]
I first tried collapsing just the 'Spclty' columns into a list like this:
DF = DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6')))
This worked the first time I executed it, giving me a new column called 'Spclty' containing rows such as ['014', '124', '547', '000', '000', '000'], as expected.
Then, I added a line to my script to do the same thing on a different set of 6 string columns, named 'StartDt1'...'StartDt6':
DF = DF.withColumn('StartDt', list(DF.select('StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6')))
This caused AssertionError: col should be Column.
After I ran out of things to try, I tried the original operation again (as a sanity check):
DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6'))).collect()
and got the assertion error as above.
It would be good to understand why it worked only the first time, but the main question is: what is the correct way to zip columns into a collection of dict-like elements in Spark?

.withColumn() expects a Column object as its second parameter, and you are supplying a list.

Thanks. After reading a number of SO posts I figured out the syntax for passing a set of columns to the col parameter, using struct to build an output column that holds a list of values:
from pyspark.sql.functions import array, col, struct
# One struct per (Spclty, StartDt) pair, collected into a single array column
DF_tmp = DF_tmp.withColumn('specialties', array([
    struct(
        *(col("Spclty{}".format(i)).alias("spclty_code"),
          col("StartDt{}".format(i)).alias("start_date"))
    )
    for i in range(1, 7)
]))
So, the col() and *col() constructs are what I was looking for, while the array([struct(...)]) approach lets me combine the 'Spclty' and 'StartDt' entries into a list of dict-like elements.

Reading and processing data in Spark: output is not delimited correctly

My stored output looks like this; it is a single column containing
\N|\N|\N|8931|\N|1
where | is supposed to be the column delimiter. So it should have 6 columns, but it only has one.
My code to generate this is
val distData = sc.textFile(inputFileAdl).repartition(partitions.toInt)
val x = new UdfWrapper(inputTempProp, "local")
val wrapper = sc.broadcast(x)
distData.map({s =>
wrapper.value.exec(s.toString)
}).toDF().write.parquet(outFolder)
Nothing inside the map can be changed. wrapper.value.exec(s.toString) returns a delimited string (this cannot be changed). I want to write this delimited string to a parquet file, but have it split correctly on a given delimiter. How can I accomplish this?
Current output: one column containing a delimited string
Expected output: six columns from the single delimited string
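One possible approach, sketched here without Spark, is to parse each delimited string into a six-field record in a separate step after the existing map. The Record type, its field names c1..c6 and the parse helper are all hypothetical. Assuming the usual Spark implicits are in scope, an RDD of such case classes should convert with toDF() to one column per field, but that wiring is not shown or tested here.

```scala
// Hypothetical record type; the field names are placeholders.
final case class Record(c1: String, c2: String, c3: String,
                        c4: String, c5: String, c6: String)

// String.split takes a regex, so the pipe must be escaped.
// The -1 limit keeps trailing empty fields instead of dropping them.
def parse(line: String): Option[Record] =
  line.split("\\|", -1) match {
    case Array(a, b, c, d, e, f) => Some(Record(a, b, c, d, e, f))
    case _                       => None // wrong field count: treat as a bad row
  }

val parsed = parse("\\N|\\N|\\N|8931|\\N|1")
```

Returning an Option lets the caller drop or log rows that do not have exactly six fields.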

Reading a csv file and selecting three columns in Scala

I need to read a CSV file and then write a new file containing 3 specified columns.
I know how to read a text file, but not a CSV file.
import scala.io.Source._
val lines = fromFile("file.txt").getLines

Or if you just want the first three columns, try this
val lines = fromFile("file.txt").
getLines.
map(_.split(",",4).take(3)).
toList

Assuming a collection of indices idx that refer to columns in the csv file, consider first
val idx = Array(1,3,4)
val xs = (1 to 10).toArray
and so we can fetch the 2nd, 4th and 5th columns (index 0 refers to the first column),
idx.map(xs)
Array(2, 4, 5)
We can apply this idea to the array obtained by splitting each line, as follows:
Source.fromFile("file.csv").getLines.map(_.split(",")).map(cols => idx.map(cols))
This approach allows for defining the indices of interest at runtime (non hard-coding).
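Putting the pieces together, a minimal sketch that selects the columns and writes them to a new file might look like this. The sample lines and the temporary output file are made up for illustration, and the simple split on commas assumes no quoted fields.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

val idx = Array(1, 3, 4) // zero-based indices of the columns to keep

// Made-up input standing in for the lines read from the CSV file.
val lines = Seq("a,b,c,d,e", "1,2,3,4,5")

// Split each line once, pick the wanted columns, and re-join them.
val selected = lines.map { line =>
  val cols = line.split(",", -1)
  idx.map(i => cols(i)).mkString(",")
}

// Write the result to a temporary file (placeholder for the real target).
val out = Files.createTempFile("selected", ".csv")
Files.write(out, selected.mkString("\n").getBytes(StandardCharsets.UTF_8))
```

A line with fewer columns than the largest index would throw an ArrayIndexOutOfBoundsException, so real input may need a length check first.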
