Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I'm doing some pre-processing on a bunch of data. Each line has the following schema
<row Att1="...." Att2="..." Attn="...." />
However, not all the attributes exist in all the rows. That is, some rows might have only three attributes while others have five, etc. Besides, there is no attribute indicating how many attributes exist within each row.
I would like to form an RDD or DataFrame (preferably) and run some queries on the data. However, I can't find a good way of splitting each row. For example, splitting by space does not work. I only need a few attributes in my processing. I tried to use pattern matching to extract 4 attributes that exist in all the rows as follows, but it fails.
val pattern = "Att1=(.*) Att3=(.*) Att10=(.*) Att11=(.*)".r
val rdd1 = sc.textFile("file.xml")
val rdd2 = rdd1.map { line =>
  line match {
    case pattern(att1, att2, att3, att4) => Post(att1, att2, att3, att4)
  }
}
case class Post(Att1: String, Att3: String, Att10: String, Att11: String)
P.S. I'm using Scala.
This is less of a Spark problem than it is a Scala problem. Is the data stored across multiple files?
If so, I would recommend parallelizing by file and then parsing row by row.
For the parsing I would:
Create a case class describing what you want the rows to look like (this allows the schema to be inferred via reflection when creating the DataFrame)
Create a list of name/regex tuples for the parsing, like: ("Attribute", regex)
Map over the list of regexes and convert each row to a map: (Attribute -> Option[Value])
Create the case class objects
This should lead to a data structure of List[CaseClass] or RDD[CaseClass] which can be converted to a DataFrame. You may need to do additional processing to filter out unneeded rows and to remove the Options.
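A minimal sketch of that approach, assuming the Post case class and the four attributes from the question (the parseLine helper and the regexes are illustrative, not tested against your data):

import org.apache.spark.rdd.RDD

case class Post(att1: String, att3: String, att10: String, att11: String)

// One (name, regex) pair per attribute of interest; each regex captures the quoted value.
val attrPatterns: List[(String, scala.util.matching.Regex)] =
  List("Att1", "Att3", "Att10", "Att11").map(name => name -> (name + "=\"([^\"]*)\"").r)

def parseLine(line: String): Option[Post] = {
  // Attribute -> Option[Value], built by running every regex against the line.
  val values: Map[String, Option[String]] =
    attrPatterns.map { case (name, rx) => name -> rx.findFirstMatchIn(line).map(_.group(1)) }.toMap
  // Keep only the rows where all four attributes are present.
  for {
    a1  <- values("Att1")
    a3  <- values("Att3")
    a10 <- values("Att10")
    a11 <- values("Att11")
  } yield Post(a1, a3, a10, a11)
}

val posts: RDD[Post] = sc.textFile("file.xml").flatMap(line => parseLine(line))
// With import spark.implicits._ in scope, posts.toDF() gives the DataFrame.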
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I'm able to write pytest functions by manually supplying column names and values to create a data frame and passing it to the production code to check all the transformed field values in a Palantir Foundry code repository.
Instead of manually passing column names and their respective values, I want to store all the required data in a dataset and import that dataset into the pytest function to fetch all the required values and pass them on to the production code to check all the transformed field values.
Is there any way to accept a dataset as an input to the test function in a Palantir code repository?
You can probably do something like this:
Let's say you have your CSV inside a fixtures/ folder next to your test:
test_yourtest.py
fixtures/yourfilename.csv
You can just read it directly and use it to create a new dataframe. I didn't test this code, but it should be something similar to this:
import os
from pathlib import Path

def load_file(spark_context):
    filename = "yourfilename.csv"
    file_path = os.path.join(Path(__file__).parent, "fixtures", filename)
    return open(file_path).read()
Now that you can load your CSV, it's just a matter of turning it into a dataframe and passing it into the PySpark logic you want to test. See: Get CSV to Spark dataframe.
This question already has an answer here:
Spark Implicit $ for DataFrame
(1 answer)
Closed 3 years ago.
Let us say I have a DF created as follows:
val posts = spark.read
.option("rowTag","row")
.option("attributePrefix","")
.schema(Schemas.postSchema)
.xml("src/main/resources/Posts.xml")
What is the advantage of converting the name to a Column, i.e. posts.select($"Id"), over posts.select("Id")?
df.select("col") operates on the column name directly, while $"col" creates a Column instance. You can also create Column instances using the col function. Columns can be composed to form complex expressions, which can then be passed to any of the DataFrame functions.
You can also find examples and more usage in the Scaladoc of the Column class.
Ref - https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Column
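For illustration, a rough sketch using the posts DataFrame from the question (the Score column here is a hypothetical example, not something from the question):

import org.apache.spark.sql.functions.col
import spark.implicits._                  // brings the $"..." syntax into scope

posts.select("Id")                        // select by column name
posts.select($"Id")                       // $"Id" builds a Column
posts.select(col("Score") + 1)            // Columns compose into expressions
posts.filter($"Score" > 10 && col("Id").isNotNull)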
There is no particular advantage; it's an automatic conversion anyway. But not all methods in Spark SQL perform this conversion, so sometimes you have to supply the Column object explicitly with $.
There is not much difference, but some functionality can be used only with $ before the column name.
Example: when we want to sort by the values in this column in descending order, it will not work without $ before the column name:
Window.orderBy("Id".desc)
But if you use $ before the column name, it works:
Window.orderBy($"Id".desc)
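For completeness, a hedged note: the same descending order can also be built without $ by constructing the Column through the functions API (an alternative, not something from the answers above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc}

Window.orderBy(col("Id").desc)   // same as Window.orderBy($"Id".desc)
Window.orderBy(desc("Id"))       // desc("Id") also yields a descending Column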
This question already has an answer here:
Spark groupByKey alternative
(1 answer)
Closed 5 years ago.
My data is a file containing over 2 million rows of employee records. Each row has 15 fields of employee features, including name, DOB, SSN, etc. Example:
ID|name|DOB|address|SSN|...
1|James Bond|10/01/1990|1000 Stanford Ave|123456789|...
2|Jason Bourne|05/17/1987|2000 Yale Rd|987654321|...
3|James Bond|10/01/1990|5000 Berkeley Dr|123456789|...
I need to group the data by a number of columns and aggregate the employee's ID (first column) with the same key. The number and name of the key columns are passed into the function as parameters.
For example, if the key columns include "name, DOB, SSN", the data will be grouped as
(James Bond, 10/01/1990, 123456789), List(1,3)
(Jason Bourne, 05/17/1987, 987654321), List(2)
And the final output is
List(1,3)
List(2)
I am new to Scala and Spark. To solve this problem, I read the data as an RDD and tried groupBy, reduceByKey, and foldByKey to implement the function, based on my research on StackOverflow. Among them, I found groupBy to be the slowest and foldByKey the fastest. My implementation with foldByKey is:
val buckets = data.map(row => (idx.map(i => row(i)) -> (row(0) :: Nil)))
  .foldByKey(List[String]())((acc, e) => acc ::: e)
  .values
My question is: is there a faster implementation than mine using foldByKey on an RDD?
Update: I've read posts on StackOverflow and understand that groupByKey may be very slow on a large dataset. This is why I avoided groupByKey and ended up with foldByKey. However, this is not the question I asked. I am looking for an even faster implementation, or the optimal implementation in terms of processing time on a fixed hardware setup. (Processing 2 million records currently takes ~15 minutes.) I was told that converting the RDD to a DataFrame and calling groupBy can be faster.
First, here are some details on each of these, to understand how they work.
groupByKey runs slowly because all the key-value pairs are shuffled around. That is a lot of unnecessary data being transferred over the network.
reduceByKey works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
combineByKey can be used when you are combining elements but your return type differs from your input value type.
foldByKey merges the values for each key using an associative function and a neutral "zero value".
So avoid groupByKey. Hope this helps.
Cheers!
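As a rough sketch of the DataFrame route mentioned in the question's update (the file name, delimiter option, and column names are assumptions):

import org.apache.spark.sql.functions.collect_list

// Read the pipe-delimited file into a DataFrame; header/column names are assumed.
val df = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv("employees.txt")

// Key columns are passed in as a parameter, as in the question.
val keyCols = Seq("name", "DOB", "SSN")

// Group by the key columns and collect the IDs that share the same key.
val grouped = df
  .groupBy(keyCols.map(df(_)): _*)
  .agg(collect_list("ID").as("ids"))

grouped.select("ids").show(false)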
Closed. This question needs debugging details. It is not currently accepting answers.
Closed 6 years ago.
I have created an unlabeled Dataset which has some columns. The values in one of the columns are France, Germany, France and UK.
I know how to filter and count using the code below.
val b =data.filter(_.contains("France")).count
However, I am not sure how to count values other than France.
I tried the code below, but it is giving me the wrong result:
val a =data.filter(x=>x!="France").count
PS: My question is a bit similar to Is there a way to filter a field not containing something in a spark dataframe using scala?, but I am looking for a simpler answer.
You are filtering on exact equality with "France"; negate contains instead.
Try this:
val a=data.filter(!_.contains("France")).count
To cricket_007's point, it should be something like this:
val myDSCount = data.filter(row => row._1 != "France").count()
I am not sure which column your data is in, so row._1 would change to the correct field number. You can run the following to see all of your columns:
data.printSchema
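For reference, a minimal sketch assuming the data ends up as a DataFrame with a named country column (the column name "country" is an assumption):

import org.apache.spark.sql.functions.col

// Count the rows whose country is not "France".
val notFranceCount = data.filter(col("country") =!= "France").count()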
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I am a bit disappointed with Slick and its TableQuery: the model of an application can be, for example, a class Persons(tag: Tag) extends Table[Person] (where Person is a case class with some fields like name, age, address, ...).
The weird point is that the val persons = TableQuery[Persons] contains all the records.
To get, for example, all the adults, we can use:
adults = persons.filter(p => p.age >= 18).list()
Is the content of the database loaded in the variable persons?
Is there, on the contrary, a mechanism that allows evaluating not "persons" but "adults" (a sort of lazy variable)?
Can we say something like 'at any time, "persons" contains the entire database'?
Are there good practices, some important ideas that can help the developer?
Thanks.
You are mistaken in your assumption that persons contains all of the records. The Table and TableQuery classes are representations of a SQL table, and the whole point of the library is to ease interaction with SQL databases by providing a convenient, Scala-like syntax.
When you say
val adults = persons.filter{ p => p.age >= 18 }
You've essentially created a SQL query that you can think of as
SELECT * FROM PERSONS WHERE AGE >= 18
Then when you call .list() it executes that query, transforming the result rows from the database back into instances of your Person case class. Most of the methods that have anything to do with Slick's Table or Query classes are focused on generating queries (i.e. "select" statements). They don't actually load any data until you invoke them (e.g. by calling .list() or .foreach).
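As a rough sketch of that build-then-run flow, in the Slick 2.x style the question uses (the database URL and driver are placeholders):

import scala.slick.driver.H2Driver.simple._   // Slick 2.x; use the driver matching your database

val persons = TableQuery[Persons]              // just a description of the PERSONS table
val adults  = persons.filter(_.age >= 18)      // still just a query: SELECT ... WHERE AGE >= 18

val db = Database.forURL("jdbc:h2:mem:test", driver = "org.h2.Driver")
db.withSession { implicit session =>
  val result: List[Person] = adults.list       // the SQL is executed only here
}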
As for good practices and important ideas, I'd suggest you read through their documentation, as well as take a look at the scaladocs for any of the classes you are curious about.
http://slick.typesafe.com/docs/