split DataFrame into training and testing DataFrame using pyspark [closed] - pyspark

I have a DataFrame of length 2038 that I want to split into training and testing DataFrames using PySpark.
My DataFrame looks like this (DataFrame sample):
Item | Value
---- | -----
1    | 10
2    | 2
2    | 35
1    | 12
1    | 16
3    | 26
I want to split my DataFrame like this:
Training DataFrame length: 2008
Testing DataFrame length: 30

Since you require an exact number of samples in the training and testing datasets, functions like randomSplit or sampleBy, which work with fractions or weights, are not suitable for your case. If you don't have any performance concerns, one solution is to use a join:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(str(i),) for i in range(2038)], ['row_index'])
df.show(5, False)
+---------+
|row_index|
+---------+
|0 |
|1 |
|2 |
|3 |
|4 |
+---------+
Then you can take the first 2008 rows as the training set and use a left anti join to get the remaining 30 rows as the test set:
train_df = df.limit(2008)
test_df = df.join(train_df, on='row_index', how='leftanti')
If you want the split to be randomized, just change the training split to:
train_df = df.orderBy(func.rand()).limit(2008)
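One caveat (my addition, not part of the original answer): rand() is non-deterministic, so when the randomized train_df is reused in the anti-join it can be re-evaluated with a different ordering. A minimal sketch, assuming the df and row_index column above, that pins the split down with a seed and a cache and then checks the resulting sizes:
from pyspark.sql import functions as func

# Seed the random ordering and cache the training split so the anti-join
# below sees the same 2008 rows rather than recomputing a new random order.
train_df = df.orderBy(func.rand(seed=42)).limit(2008).cache()
test_df = df.join(train_df, on='row_index', how='leftanti')

# Sanity check: 2008 + 30 == 2038
print(train_df.count(), test_df.count())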

Related

Transpose a group of repeating columns in large horizontal dataframe into a new vertical dataframe using Scala or PySpark in databricks

Although this question may seem to have been answered previously, it has not. All the transposing answers I can find relate to one column and pivoting the data in that column. I want to make a vertical table from a horizontal set of columns, for example:
Take this example:
MyPrimaryKey | Insurer_Factor_1_Name | Insurer_Factor_1_Code | Insurer_Factor_1_Value | Insurer_Factor_2_Name | Insurer_Factor_2_Code | Insurer_Factor_2_Value | Insurer_Factor_[n]_Name | Insurer_Factor_[n]_Code | Insurer_Factor_[n]_Value
XX-ABCDEF-1234-ABCDEF123 | Special | SP1 | 2500 | Awesome | AW2 | 3500 | ecetera | etc | 999999
[n] being any number of iterations
transforming it into a new vertical representation dataframe:
MyPrimaryKey | Insurer_Factor_ID | Insurer_Factor_Name | Insurer_Factor_Code | Insurer_Factor_Value
XX-ABCDEF-1234-ABCDEF123 | 1 | Special | SP1 | 2500
XX-ABCDEF-1234-ABCDEF123 | 2 | Awesome | AW2 | 3500
XX-ABCDEF-1234-ABCDEF123 | [n] | ecetera | etc | 999999
There is also the possibility that the "Code" column may be missing and we only receive the name and value, therefore requiring null to be added to the code column.
I've searched high and low for this, but there just doesn't seem to be anything out there.
Also there could be many rows in the first example...
The reason you haven't found it is that there is no magic trick to move an 'interestingly' designed table into a well-designed table. You are going to have to hand-code a query that either unions the rows into your table, or selects arrays that you then explode.
Sure, you could probably write some code to generate the SQL you want, but there really isn't a built-in feature to magically translate this wide format into a row-based format.
In order of preference:
Revisit the decision to send the data in this shape:
It sounds like it would save a lot of work if you were simply sent multiple files.
Change the column schema:
Put a row delimiter into the schema (after every 4th column) so the rows become visible; the file can then be read in as rows using that delimiter.
Write your own custom data source:
You can use the existing text data source as an example of how to write your own, one that could interpret every 3 columns as a row.
Write a custom UDF that takes all columns as a parameter and returns an array of rows, which you then call explode on to turn into rows. This will be slow, so I give it to you as the final option.
*** WARNING ***
This is going to use up a lot of memory. With 6000 rows it will be slow and may run out of memory. If it works, great, but I suggest you code your own data source, as that is likely the better/faster strategy.
If you want to do this with a UDF and you are only doing it for a couple of rows, you can do it like this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

/* spark.sql("select * from info").show();
+----+-------+----+
|type|db_type|info|
+----+-------+----+
| bot|  x_bot|   x|
| bot|  x_bnt|   x|
| per|   xper|   b|
+----+-------+----+ */

// Schema of the array elements produced by the UDF: each element becomes one output row
val schema = ArrayType(new StructType().add("name", "string").add("info", "string"))

// UDF that receives the whole input row as a struct and returns an array of rows
val myUDF = udf((s: Row) => {
  Seq(Row(s.get(0).toString, s.get(1).toString), Row(s.get(2).toString, s.get(2).toString))
}, schema)

val records = spark.sql("select * from info")

// Wrap all columns into a struct, apply the UDF, then explode the resulting array into rows
val arrayRecords = records.select(myUDF(struct(records.columns.map(records(_)): _*)).alias("Arrays"))

arrayRecords.select(explode(arrayRecords("Arrays")).alias("myCol"))
  .select(col("myCol.*"))
  .show()
+----+-----+
|name| info|
+----+-----+
| bot|x_bot|
| x| x|
| bot|x_bnt|
| x| x|
| per| xper|
| b| b|
+----+-----+
Pseudo code:
Create a schema for the rows.
Create the UDF with that schema (here I only show a small manipulation, but you can obviously use more complicated logic in your case).
Select the data, apply the UDF, and explode the array.
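Since the question also allows PySpark, here is a minimal sketch (my addition, using a hypothetical wide_df and a fixed number of factor groups) of the same column-groups-to-rows idea using Spark SQL's stack function instead of a hand-written UDF:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical wide DataFrame with two factor groups, mirroring the example above
wide_df = spark.createDataFrame(
    [("XX-ABCDEF-1234-ABCDEF123", "Special", "SP1", 2500, "Awesome", "AW2", 3500)],
    ["MyPrimaryKey",
     "Insurer_Factor_1_Name", "Insurer_Factor_1_Code", "Insurer_Factor_1_Value",
     "Insurer_Factor_2_Name", "Insurer_Factor_2_Code", "Insurer_Factor_2_Value"])

# stack(n, ...) emits n output rows per input row; here each output row carries one
# (ID, Name, Code, Value) group. Adjust the group count and column list for [n] groups;
# if a Code column is missing for a group, put a null literal in its place.
vertical_df = wide_df.selectExpr(
    "MyPrimaryKey",
    """stack(2,
         1, Insurer_Factor_1_Name, Insurer_Factor_1_Code, Insurer_Factor_1_Value,
         2, Insurer_Factor_2_Name, Insurer_Factor_2_Code, Insurer_Factor_2_Value
       ) as (Insurer_Factor_ID, Insurer_Factor_Name, Insurer_Factor_Code, Insurer_Factor_Value)""")

vertical_df.show(truncate=False)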

Disable scientific notation in Spark Scala [closed]

I have a table to consume from Spark and do data transformations on.
There is a column that has values with 8 digits after the decimal point, something like this: 0.00000001.
The column holds absolute values like the number above.
I need to take the absolute value of that column without letting Spark turn it into scientific notation. When Spark reads the table, it renders the column in scientific notation.
I have tried converting to String, Double, or Float, but nothing works. I need this field as a Decimal; the field is already natively Decimal in the column.
Is there some way to do this?
Code to simulate:
val df = spark.sparkContext.parallelize(Seq((0.00000001))).toDF("Value")
Spark shows me this: 0E-8
And I need it in Decimal type: 0.00000001
Thanks!
You can use the format_number function to get the desired result.
spark.sparkContext.parallelize(Seq(0.00000001)).toDF("Value")
  .selectExpr("format_number(Value, '#.########') as Value")
  .show(false)
/*
+----------+
|Value |
+----------+
|0.00000001|
+----------+*/

PostgreSQL column to keep track of number of appearances of a certain kind [duplicate]

I have to add a new column to an already existing database table that keeps track of the number of occurrences of a certain kind. Here's an example to be more clear:
id | eqid | etcd | this would be my new column
-------------------------
1 | 4 | abc | 1
2 | 3 | def | 1
3 | 1 | ghi | 1
4 | 4 | jkl | 2
5 | 3 | mno | 2
6 | 4 | pqr | 3
I am trying to make this column so that I can get all the EQIDs of a certain value along with their positions. For example, in the table above, say that at the press of a button I want the first EQID equal to 4 (the one whose ETCD is "abc") to become the second EQID equal to 4, and the second one to become the first: I would increment the counter of the row with ETCD "abc" and decrement the counter of the row with ETCD "jkl".
I have tried finding answers everywhere but haven't found one that could help me out. Thanks for any help. By the way, this column must be persisted; it cannot be a temporary column or anything like that.
Use the window function row_number():
select id, eqid, etcd,
       row_number() over (partition by eqid order by id) as "This is new column"
from sample_data
order by id;
You should look into and become familiar with window functions and the documentation on them.

How to map values with a key to a column in a Spark DataFrame

I am doing some feature engineering in Spark 2.3 with Scala.
I have IP addresses in a column of a Spark DataFrame.
I then used data.groupBy("ip").count() to get the frequency of each IP address.
Now I want to map each of those frequencies back to the original dataframe, where I would have:
ip  | freq
--- | ----
123 | 3
567 | 7
857 | 10
123 | 3
What would be an efficient way of solving such a problem?
I develop pipelines with 1 billion+ rows, and this is how I'd go about it:
from pyspark.sql import functions as F, Window

w = Window.partitionBy('ip')
df.withColumn('freq', F.count('ip').over(w)).show()
This is much simpler, reads well, and most importantly is efficient. It doesn't aggregate the data, so there is no need to create two DataFrame objects and join them.
An aggregate-then-join approach doesn't scale well with big data, primarily because joins are expensive due to the extra shuffles.
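For completeness, a minimal end-to-end sketch of the same idea (my addition, using toy data that mirrors the question's ip column):
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real DataFrame
df = spark.createDataFrame([("123",), ("567",), ("857",), ("123",)], ["ip"])

# Each row keeps its own copy of the per-ip count
w = Window.partitionBy("ip")
df.withColumn("freq", F.count("ip").over(w)).show()
# In this toy data the two "123" rows get freq=2 and the others get freq=1.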

Merge multiple google spreadsheet columns into one column [closed]

In a Google spreadsheet, I have 2 columns of data that I'd like to merge into another column. The data in column 1 is not the same as in column 2 (some values are present in column 1 but not in column 2). I would like to merge them into a single column, with no duplicates, containing the complete list of values from columns 1 and 2 in column 3.
Expected Result:
Column 1
Apple
Banana
Cashew
Watermelon
Column 2
Apple
Banana
Strawberry
Mango
Column 3
Apple
Banana
Cashew
Watermelon
Strawberry
Mango
The lists of fruits in columns 1 and 2 are not exactly the same, and everything should be listed in column 3 without duplicate fruit names.
Does this formula work as you want in column C:
=UNIQUE({A:A;B:B})
You can use the cell function CONCATENATE.
For example, if cell A1 contains "Name" and cell B1 contains "Suffix", then in C1 you can use =CONCATENATE(A1,B1), which will give NameSuffix in cell C1.