I am a bit disappointed with Slick and its TableQuery: the model of an application can be, for example, a class Persons(tag: Tag) extends Table[Person] (where Person is a case class with some fields like name, age, address...).
The weird point is that val persons = TableQuery[Persons] appears to contain all the records.
To get, for example, all the adults, we can use:
val adults = persons.filter(p => p.age >= 18).list()
Is the content of the database loaded into the variable persons?
Or is there, on the contrary, a mechanism that evaluates not persons but adults (a sort of lazy variable)?
Can we say that, at any time, persons contains the entire database?
Are there good practices or important ideas that can help the developer?
Thanks.
You are mistaken in your assumption that persons contains all of the records. The Table and TableQuery classes are representations of a SQL table, and the whole point of the library is to ease interaction with SQL databases by providing a convenient, Scala-like syntax.
When you say
val adults = persons.filter{ p => p.age >= 18 }
You've essentially created a SQL query that you can think of as
SELECT * FROM PERSONS WHERE AGE >= 18
Then when you call .list(), it executes that query, transforming the result rows from the database back into instances of your Person case class. Most of the methods that have anything to do with Slick's Table or Query classes are focused on generating queries (i.e. "select" statements). They don't actually load any data until you invoke them (e.g. by calling .list() or .foreach).
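For illustration, here is a minimal sketch of that separation, assuming a Slick 2.x-style setup against an in-memory H2 database; the table name, column names and connection details below are invented, and only the placement of .list matters:

import scala.slick.driver.H2Driver.simple._

case class Person(name: String, age: Int, address: String)

class Persons(tag: Tag) extends Table[Person](tag, "PERSONS") {
  def name    = column[String]("NAME")
  def age     = column[Int]("AGE")
  def address = column[String]("ADDRESS")
  def * = (name, age, address) <> (Person.tupled, Person.unapply)
}

val persons = TableQuery[Persons]          // a query description, no rows loaded
val adults  = persons.filter(_.age >= 18)  // still only a query description

Database.forURL("jdbc:h2:mem:test", driver = "org.h2.Driver") withSession { implicit session =>
  val loaded: List[Person] = adults.list   // the SELECT runs here, against the database
}

Nothing touches the database until adults.list runs inside the session.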
As for good practices and important ideas, I'd suggest you read through their documentation, as well as take a look at the scaladocs for any of the classes you are curious about.
http://slick.typesafe.com/docs/
I am looking into Amazon QuickSight as a reporting tool and I am using data from a Postgres database, which includes some columns in a few tables in jsonb format. Unfortunately these columns are skipped by QuickSight, because it only supports primitive types, as mentioned here: https://docs.aws.amazon.com/quicksight/latest/user/data-source-limits.html
I am looking for a solution that lets me include this data together with the rest of the relational data in the same tables.
So far I cannot find anything better than creating a view in my own application that exposes this data in a relational format that QuickSight can use. Is there anything else that does not pollute my original database with reporting artifacts? I also thought of having these views only in the read-only replica of my database, but this is not possible with Postgres on RDS. Athena is not an option, and neither is choosing JSON as the data set, because I want both the relational data and the JSON in my analysis.
Any better ideas?
Created a test Postgres table with the following columns:
id integer
info jsonb
Added data to the table, with a sample value:
{ "customer": "John Doe", "items": {"product": "Beer","qty": 6}}
In QuickSight, created a data set using custom SQL, with a SQL statement (based on [1]) similar to:
select id, (info#>>'{}') as jsonb_value from "orders"
With the above data set I was able to import both columns into QuickSight SPICE as well as query the data directly. The JSONB column gets imported as a 'String' type field in QuickSight.
In the functional programming world, when I want to design an API, I keep encountering the term "algebra API".
Could someone please describe what an algebra is in FP in the context of designing an API?
Which components make up an algebra API? Laws, operations, etc.?
There is also the word "primitive"; what exactly is a primitive? Please show me an example.
I think what you are referring to is algebraic data types.
Product Type
A common class of ADT is the product type. As an example, a "user" can be described as a combination of "name", "email address", and "age":
case class User(name : String, email : String, age : Int)
This is called a "product" type because we can count the number of possible distinct Users using multiplication:
distinct user count = (distinct name count) x (distinct email count) x (distinct age count)
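As a concrete illustration with made-up types: a product of a Boolean field and a three-valued enumeration has 2 x 3 = 6 distinct inhabitants.

sealed trait Size
case object Small  extends Size
case object Medium extends Size
case object Large  extends Size

// Product type: one Boolean field and one Size field.
case class Shirt(longSleeve: Boolean, size: Size)

// distinct Shirt count = 2 (Boolean) x 3 (Size) = 6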
Sum Type
The other common ADT class is the sum type. As an example, a user can either be a common user or an administrator:
sealed trait User
case class CommonUser(name : String) extends User
case class AdminUser(name : String, powers : Set[AdminPowers]) extends User
This is called a "sum" type because we can count the number of possible distinct Users using addition:
distinct user count = (distinct common user count) + (distinct admin user count)
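Tying this back to the question about operations: once the alternatives of a sum type are fixed with sealed, functions over the type can pattern match exhaustively and the compiler will warn about missing cases. A minimal sketch (repeating the definitions above, with AdminPowers stubbed out so the snippet stands alone):

sealed trait AdminPowers
case object DeletePosts extends AdminPowers
case object BanUsers    extends AdminPowers

sealed trait User
case class CommonUser(name: String) extends User
case class AdminUser(name: String, powers: Set[AdminPowers]) extends User

// An operation over the sum type: an exhaustive pattern match.
def describe(user: User): String = user match {
  case CommonUser(name)        => s"$name (regular user)"
  case AdminUser(name, powers) => s"$name (admin with ${powers.size} powers)"
}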
I have a massive delimited file and many normalized tables to load the data into. Is there a best practice for bringing in the data and inserting it into the proper fields and tables?
For instance, right now I've created a temp table that holds all of the arbitrary data. Some logic runs against each row to determine what values will go into which table. Without going into too many specifics, the part that concerns me looks something like:
INSERT INTO table VALUES (
(SELECT TOP 1 field1 FROM #tmpTable),
(SELECT TOP 1 field30 FROM #tmpTable),
(SELECT TOP 1 field2 FROM #tmpTable),
...
(SELECT TOP 1 field4 FROM #tmpTable))
With that, my questions are: Is it reasonable to be using a temp table for this purpose? And is it poor practice to use these SELECT statements so liberally? It feels sort of hacky; are there better ways to handle mass data importing and separation like this?
You should try SSIS.
SSIS How to Create an ETL Package
I have created an unlabeled Dataset which has some columns. The values in one of the columns are France, Germany, France and UK.
I know how to filter and count using the code below.
val b = data.filter(_.contains("France")).count
However, I am not sure how to count values other than France.
I tried the code below but it is giving me the wrong result:
val a = data.filter(x => x != "France").count
PS: My question is a bit similar to Is there a way to filter a field not containing something in a spark dataframe using scala? but I am looking for a simpler answer.
Your filter compares each element for equality with "France", but your working count used contains, so negate that check instead.
Try this:
val a = data.filter(!_.contains("France")).count
To cricket_007's point, it should be something like this:
val myDSCount = data.filter(row => row._1 != "France").count()
I am not sure which column your data is in, so the row._1 would change to the correct index. You can run the following to see all of your columns:
data.printSchema
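For completeness, a minimal runnable sketch of the contains-based approach, assuming the values live in a Dataset[String]; the sample data and application name below are invented for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("count-example").master("local[*]").getOrCreate()
import spark.implicits._

// Invented sample data standing in for the unlabeled column.
val data = Seq("France", "Germany", "France", "UK").toDS()

val franceCount    = data.filter(_.contains("France")).count()   // 2
val notFranceCount = data.filter(!_.contains("France")).count()  // 2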
I'm doing some pre-processing on a bunch of data. Each line has the following schema
<row Att1="...." Att2="..." Attn="...." />
However, not all the attributes exist in all the rows. That is, some rows might have only three attributes while others have five, etc. Besides, there is no attribute indicating how many attributes exist within each row.
I would like to form an RDD or DataFrame (preferably) and run some queries on the data. However, I can't find a good way of splitting each row. For example, splitting by space does not work. I only need a few attributes in my processing. I tried to use pattern matching to extract the 4 attributes that exist in all the rows, as follows, but it fails.
val pattern = "Att1=(.*) Att3=(.*) Att10=(.*) Att11=(.*)".r
val rdd1 = sc.textFile("file.xml")
val rdd2 = rdd1.map {line => line match {
case pattern(att1,att2,att3,att4) => Post(att1,att2,att3,att4)
}
}
case class Post(Att1: String, Att3: String, Att10: String, Att11: String)
P.S. I'm using Scala.
This is less of a Spark problem than it is a Scala problem. Is the data stored across multiple files?
I would recommend parallelizing by file and then parsing row by row.
For the parsing I would:
Create a case class of what you want the rows to look like (this will allow the schema to be inferred using reflection when creating the DataFrame)
Create a list of name/regex tuples for the parsing, like: ("Attribute", regex)
Map over the list of regexes and convert to a map: (Attribute -> Option[Value])
Create the case class objects
This should lead to a data structure of List[CaseClass] or RDD[CaseClass] which can be converted to a DataFrame. You may need to do additional processing to filter out unneeded rows and to remove the Options. A rough sketch of this approach follows below.
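A heavily hedged sketch of those steps, assuming a SparkSession-based setup and that attribute values are double-quoted with no escaped quotes; the attribute names and file path come from the question, while the application name and helper names are invented:

import org.apache.spark.sql.SparkSession

// Target shape of a parsed row; missing attributes stay as Options for now.
case class Post(att1: Option[String], att3: Option[String],
                att10: Option[String], att11: Option[String])

val spark = SparkSession.builder().appName("attr-parse").master("local[*]").getOrCreate()
import spark.implicits._

// One (name, regex) pair per attribute we care about.
val attrNames = Seq("Att1", "Att3", "Att10", "Att11")
val patterns  = attrNames.map(name => name -> (name + "=\"([^\"]*)\"").r)

// Extract each attribute independently so that a missing attribute becomes None.
def parseLine(line: String): Map[String, Option[String]] =
  patterns.map { case (name, regex) =>
    name -> regex.findFirstMatchIn(line).map(_.group(1))
  }.toMap

val rdd = spark.sparkContext.textFile("file.xml")

val posts = rdd.map { line =>
  val attrs = parseLine(line)
  Post(attrs("Att1"), attrs("Att3"), attrs("Att10"), attrs("Att11"))
}

// Keep only rows where all four attributes were present, then convert to a DataFrame.
val df = posts
  .filter(p => p.att1.isDefined && p.att3.isDefined && p.att10.isDefined && p.att11.isDefined)
  .toDF()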