Postgres DB Schema - multiple columns vs one json column

I have a database that stores a username along with 3 different phone numbers and 3 different IDs. We will also have 3 different types of notes for each username.
I am using Postgres, and the data is expected to grow to millions of rows. Fast querying and fast inserting of new data are really important.
Which schema would be better for that:
username | no1(string) | no2(string) | no3(string) | id1(string) | id2(string) | id3(string) | note1(string) | note2(string) | note3(string)
OR
username | no(JSON) | id(JSON) | note(JSON)
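For illustration, here is a rough sketch of the two layouts as DDL (the table names and the example index are placeholders, not part of the question):
-- Option 1: one fixed column per value
CREATE TABLE users_wide (
    username text PRIMARY KEY,
    no1 text, no2 text, no3 text,
    id1 text, id2 text, id3 text,
    note1 text, note2 text, note3 text
);

-- Option 2: one jsonb column per group (jsonb rather than json, so it can be indexed)
CREATE TABLE users_json (
    username text PRIMARY KEY,
    no   jsonb,   -- e.g. '["111", "222", "333"]'
    id   jsonb,
    note jsonb
);
-- Querying inside the jsonb values efficiently would need a GIN index, e.g.:
CREATE INDEX users_json_no_idx ON users_json USING gin (no);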

PostgreSQL Arabic case insensitive

I am looking for a way to search a database using Arabic text. In Arabic, some letters can be written in different ways, but all of the variants should show up in the results if any one of them is used in the WHERE clause.
The famous example for this would be:
SELECT * FROM persons WHERE name = "اسامة";
+----+--------------+
| id | name |
+----+--------------+
| 3 | أسامه |
| 4 | أسامة |
| 5 | اسامه |
| 6 | اسَامه |
+----+--------------+
4 rows in set (0.00 sec)
I found a good and probably the most performant way to do this by creating a custom collation in MySQL in this article, but I have no idea how that is done, or whether it is possible at all, in PostgreSQL.
Other ways that include changing the query itself to use Regex are not useful for my use case.
Can someone please guide me on how to do the same?
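As a sketch of one possible PostgreSQL-side approach (assuming PostgreSQL 12 or later built with ICU support; the collation name and locale string below are only illustrative), a nondeterministic ICU collation can compare at primary strength. That ignores the diacritics and most hamza variants; whether it also folds ة and ه should be verified against your ICU version:
CREATE COLLATION arabic_ci (provider = icu, locale = 'ar-u-ks-level1', deterministic = false);

-- The collation can then be used per query (or attached to the column type):
SELECT * FROM persons WHERE name COLLATE "arabic_ci" = 'اسامة';
Note that nondeterministic collations do not support LIKE or other pattern matching, so this only helps plain equality comparisons.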

Know which tables are affected by a connection

I want to know if there is a way to retrieve which tables are affected by the requests made from a connection in PostgreSQL 9.5 or higher.
The purpose is to have the information in a way that lets me know which tables were affected, in which order, and in what way.
More precisely, something like this would suffice:
id | datetime | id_conn | id_query | table | action
---+----------+---------+----------+---------+-------
1 | ... | 2256 | 125 | user | select
2 | ... | 2256 | 125 | order | select
3 | ... | 2256 | 125 | product | select
(this would be the result of a select query over user join order join product).
I know I can retrieve id_conn through "pg_stat_activity", and I can see if there is a running query, but I can't find a "history" of past queries.
The final purpose is to debug the database when inconsistent data is inserted into a table (due to a missing constraint). Knowing which connection did the insert will lead me to the faulty script (as I already have the script name linked to the connection id).
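PostgreSQL does not keep a per-table history of this kind out of the box, but as a rough sketch (the settings and prefix format below are just an example), statement logging can at least tie every data-modifying query, and therefore the tables it names, to a session:
-- Log every INSERT/UPDATE/DELETE/DDL statement together with the session id,
-- so the server log shows which connection touched which tables and in what order.
-- (Use 'all' instead of 'mod' to also capture SELECTs.)
ALTER SYSTEM SET log_statement = 'mod';
ALTER SYSTEM SET log_line_prefix = '%m [%p] %c %u@%d ';  -- %c is the session id
SELECT pg_reload_conf();
Reconstructing the table/action pairs would still mean parsing the statements out of the log afterwards.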

What is the column limit for Spark Data Frames?

Our team is having a lot of issues with the Spark API, particularly with large-schema tables. We currently have a program written in Scala that uses the Apache Spark API to create two Hive tables from raw files. One particularly large raw data file is giving us trouble: it contains roughly 4,700 columns and about 200,000 rows.
Every week we get a new file that shows the updates, inserts and deletes that happened in the last week. Our program creates two tables: a master table and a history table. The master table is the most up-to-date version of the data, while the history table records every insert and update that happened to the table and shows what changed. For example, if we have the following schema, where A and B are the primary keys:
Week 1 Week 2
|-----|-----|-----| |-----|-----|-----|
| A | B | C | | A | B | C |
|-----|-----|-----| |-----|-----|-----|
| 1 | 2 | 3 | | 1 | 2 | 4 |
|-----|-----|-----| |-----|-----|-----|
Then the master table will now be
|-----|-----|-----|
| A | B | C |
|-----|-----|-----|
| 1 | 2 | 4 |
|-----|-----|-----|
And The history table will be
|-----|-----|-------------------|----------------|-------------|-------------|
| A | B | changed_column | change_type | old_value | new_value |
|-----|-----|-------------------|----------------|-------------|-------------|
| 1 | 2 | C | Update | 3 | 4 |
|-----|-----|-------------------|----------------|-------------|-------------|
This process works flawlessly for tables with narrower schemas. We have a table with 300 columns but over 100,000,000 rows, and that code still runs as expected. For the wide-schema table, the process above runs for around 15 hours and then crashes with the following error:
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.generic.Growable$class.loop$1(Growable.scala:52)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
Here is a code example that takes around 4 hours to run for this larger table, but runs in 20 seconds for other tables:
var dataframe_result = dataframe1.join(broadcast(dataframe2), Seq(listOfUniqueIds:_*)).repartition(100).cache()
We have tried all of the following with no success:
Using hash broadcast joins (dataframe2 is smaller, dataframe1 is huge)
Repartitioning on different numbers, as well as not repartitioning at all
Caching the result of the dataframe (we originally did not do this)
What is causing this error and how can we fix it? The only difference with this problem table is that it has so many columns. Is there an upper limit to how many columns Spark can handle?
Note: We are running this code on a very large MapR cluster, and we tried giving the code 500 GB of RAM and it is still failing.

Getting duplicate rows when querying Cloud SQL in AppMaker

I migrated from Drive tables to a 2nd gen MySQL Google Cloud SQL data model. I was able to insert 19 rows into the following Question table in AppMaker:
+-------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| SurveyType | varchar(64) | NO | PRI | NULL | |
| QuestionNumber | int(11) | NO | PRI | NULL | |
| QuestionType | varchar(64) | NO | | NULL | |
| Question | varchar(512) | NO | | NULL | |
| SecondaryQuestion | varchar(512) | YES | | NULL | |
+-------------------+--------------+------+-----+---------+-------+
I queried the data from the command line and know it is good. However, when I query the data in AppMaker like this:
var newQuery = app.models.Question.newQuery();
newQuery.filters.SurveyType._equals = surveyType;
newQuery.sorting.QuestionNumber._ascending();
var allRecs = newQuery.run();
I get 19 rows with the same data (the first row) instead of the 19 different rows. Any idea what is wrong? Additionally (and possibly related) my list rows in AppMaker are not showing any data. I did notice that _key is not being set correctly in the records.
(Edit: I thought maybe having two columns as the primary key was the problem, but I tried having the PK be a single identity column, same result.)
Thanks for any tips or pointers.
You have two primary key fields in your table, which is problematic according to the App Maker Cloud SQL documentation: https://developers.google.com/appmaker/models/cloudsql
App Maker can only write to tables that have a single primary key field. If you have an existing Google Cloud SQL table with zero or multiple primary keys, you can still query it in App Maker, but you can't write to it.
This may account for the inability of the view to be able to properly display each row and to properly set the _key.
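If you want to keep using the existing Cloud SQL table, one possible rework is to replace the composite key with a single surrogate key and keep the old pair as a unique constraint. This is only a sketch; the new column name and constraint name are made up:
-- Replace the composite primary key with a single auto-increment key (names are illustrative).
ALTER TABLE Question DROP PRIMARY KEY;
ALTER TABLE Question ADD COLUMN Id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
ALTER TABLE Question ADD UNIQUE KEY uq_survey_question (SurveyType, QuestionNumber);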
I was able to get this to work by creating the table inside AppMaker rather than using a table created directly in the Cloud Shell. Not sure if existing tables are not supported or if there is a bug in AppMaker, but since it is working I am closing this.

Is it possible to get 3 select query results by executing only one stored procedure?

I have to display data in the following format
-----------------------------------------------------------
| Group Name | Description | Assigned Users | Super Groups|
-----------------------------------------------------------
|Group1 | Blah Blah | User1 | SPG1 |
| | | User2 | SPG3 |
| | | User3 | |
-----------------------------------------------------------
| Group2 | More Blah | User1 | SPG5 |
| | | User13 | |
-----------------------------------------------------------
The assigned users and super groups data come from unrelated tables. Now I wonder whether it is possible to get 3 select query results in one shot (i.e. the same procedure returns 3 result sets). Otherwise I'm going to query the groups and users first, get the group IDs, and then query the super groups.
So again, Is it possible to get 3 select query results by executing only one stored procedure?
Yes, just include 3 select statements.
If you're consuming these in .NET and storing them in a DataSet, you'll have 3 tables in the DataSet.
Example:
create procedure test
as
select 1 as res1;
select 2 as res2;
select 3 as res3;
go

exec test
Yes. You'll have to include the three statements in your stored procedure. Take a look at this post.