aggregate all columns and shapiro.test
I would like to aggregate all my columns (here only two, but I have 25 in reality) by my first column, which contains the different groups, and in addition I would like to use shapiro.test as the FUN argument.
Here is my data, with my modalities and 2 variables holding values for each modality (I did n = 9-10 replicates for this experiment).
structure(list(moda = structure(c(20L, 20L, 20L, 20L, 20L, 20L,
20L, 20L, 20L, 20L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 17L, 17L, 17L, 17L, 17L,
17L, 17L, 17L, 17L, 17L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L,
18L, 18L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 8L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 12L, 12L, 12L, 12L, 12L,
12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 15L,
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L,
16L, 16L, 16L, 16L, 16L, 16L, 21L, 21L, 21L, 21L, 21L, 21L, 21L,
21L, 21L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 23L,
23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L), .Label = c("ACN1", "ACN2",
"BA", "BM", "BS1", "BS2", "CN", "EK5", "HW1", "HW2", "HW3", "L27",
"L5K", "LC", "M2K", "M630", "PB1", "PB2", "PB3", "PG", "RMB",
"RMC", "RMM"), class = "factor"), epicotyle = c(1.5, 1.5, 2,
1, 1.5, 1.2, 1, 2.4, 1.3, 1.4, 1.7, 2, 1.8, 2.3, 2.5, 2.5, 1.5,
1.5, 2, 1.3, 1.5, 1.8, 1.3, 1.8, 1.7, 1.5, 2.3, 1.8, 2.2, 1.5,
1.5, 1.5, 1.3, 1.5, 1.5, 1.5, 1.5, 1.8, 1.5, 2.1, 1.8, 1.3, 2,
1.5, 2, 3.5, 1.5, 1.7, 1.7, 2, 1.7, 2, 1.5, 2, 1.5, 2, 2, 1.5,
2, 1.5, 1.8, 1, 2, 3, 1.6, 1.5, 1.5, 1.3, 1.5, 1.5, 1.2, 1.5,
1.5, 1, 1.2, 1.5, 1.5, 1.5, 1.5, 2, 1.1, 1.5, 1.5, 1.7, 1.8,
1.5, 1.3, 1.5, 1.5, 2.5, 1.2, 1.4, 1, 1.5, 2, 1.5, 1.2, 1.5,
2, 2.3, 2.1, 2, 2.4, 1.5, 1.7, 1.4, 2.4, 1, 1, 2, 1.5, 1.2, 2.4,
1.2, 1, 0.8, 1.8, 1.5, 1.5, 1.5, 2.1, 1.5, 1.4, 1.5, 1.3, 1.5,
3, 2.6, 1.5, 2.2, 1.9, 1.5, 1.4, 1.4, 2.5, 2.1, 2, 1.5, 2, 2,
2, 1.5, 2.1, 2, 1.5, 2.5, 2.5, 3, 3, 3.5, 3.5, 3, 2, 2.5, 3.5,
1, 1.2, 1.5, 2.5, 1.5, 1.5, 1.5, 1.5, 1.5, 2.4, 1.5, 2, 3, 1.7,
3, 2.5, 2, 2.5, 2.5, 2.5, 1.5, 1.5, 1.5, 1, 1.5, 2, 1.4, 1.2,
1.7, 2.1, 1.5, 2, 1.5, 1.5, 2, 1.4, 2, 3, 2, 2, 2, 2.5, 3, 3,
1.7, 3, 1.8, 2, 1.8, 2.2, 2.3, 1.5, 2, 1.8, 1.8, 1.3, 2, 1.8,
1.8, 2, 1.8, 1.5, 1.7, 2, 1.4, 1.5, 1.7, 1.5), hypocotyle = c(3.8,
4, 7, 5, 6, 4, 5.4, 3.5, 3.6, 5, 5, 7, 2.5, 6.5, 5.4, 5, 6, 5.7,
7, 5.5, 5.7, 5.5, 7, 6.5, 5.5, 5.5, 6.7, 4.9, 5.3, 6.7, 5.8,
6.5, 6, 5.6, 5, 5.5, 6, 6, 6, 3.5, 4.7, 4.5, 5.9, 5, 6, 7, 6,
5.5, 5, 5.8, 5.5, 5.5, 4.8, 5.7, 6, 7, 5.2, 5, 5.2, 5.3, 5.6,
5, 5.3, 6, 5, 5.5, 4.5, 5.7, 6, 4.5, 4.4, 5.2, 5.2, 4.1, 5.2,
5.2, 5.4, 6, 5.5, 6.5, 5, 6, 5.5, 7.5, 5.2, 5.6, 5.4, 5.5, 5,
5, 6, 5.2, 6, 6.3, 6.3, 4.2, 5.1, 3.5, 6, 6, 6, 6, 5, 5, 6, 5,
5.6, 5.5, 5, 5, 6, 5.2, 6, 6.3, 6.3, 4.2, 5.1, 3.8, 4, 7, 5,
6, 4, 5.4, 3.5, 3.6, 5, 6, 4.8, 4.7, 4.4, 5.5, 3.5, 5.3, 4.3,
5.5, 4.5, 5.5, 4.2, 6, 4.3, 4, 4.7, 3.5, 3.7, 4.2, 5, 5, 5.1,
5.7, 5, 3.5, 4, 5.6, 3.9, 3.5, 7, 6, 6, 6, 6.5, 5.5, 4.5, 6.5,
6.5, 3, 5, 5.5, 5.3, 4, 5.5, 6, 4, 5.5, 6, 5, 4, 4.5, 4.5, 4,
3.5, 4.5, 5, 4, 4.5, 5, 4.7, 6, 3.8, 4.5, 4.1, 4, 3.7, 4, 4.5,
5, 6, 4.5, 6, 5.7, 3.7, 5.8, 6.2, 5.5, 5, 3.8, 4, 7, 5, 6, 4,
5.4, 3.5, 3.6, 5, 7, 6.5, 8, 6.5, 5.7, 7.5, 7.3, 7.4)), class = "data.frame", row.names = c(NA,
-223L))
It works pretty well when I select only one column, as in this example:

data <- aggregate(formula = data1[, 2] ~ data1[, 1],
                  data = data1,
                  FUN = function(e) { b <- shapiro.test(e); c(b$statistic, b$p.value) })
But when I used a dot to select all the other columns except my first column:

data <- aggregate(formula = . ~ data1[, 1],
                  data = data1,
                  FUN = function(e) { b <- shapiro.test(e); c(b$statistic, b$p.value) })

I only got this result:

Error in shapiro.test(e) : all 'x' values are identical
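The dot in . ~ data1[,1] is the likely culprit: data1[,1] is not a column name, so the dot expands to every column of data1, including the grouping column moda itself. Within each group moda is constant (the formula method's internal cbind turns the factor into its integer codes), which is exactly what shapiro.test rejects with "all 'x' values are identical". A minimal sketch of a fix, assuming the data frame is named data1 as in the question: name the grouping column on the right-hand side so the dot excludes it.

res <- aggregate(. ~ moda,
                 data = data1,
                 FUN = function(e) {
                   b <- shapiro.test(e)
                   c(W = unname(b$statistic), p.value = b$p.value)
                 })

This applies shapiro.test to every remaining column per modality; each value column comes back as a two-column matrix holding the W statistic and the p-value.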
Related
Google Combo chart horizontal axis not showing all labels
We implemented the Google Combo chart with some horizontal labels in place, but somehow it's not showing the first label. Does anybody have any insight into why it's not working? Example: https://www.cdfund.com/track-record/rendement/nac.html

Code example:

var data = new google.visualization.DataTable();
data.addColumn('date', 'Time of measurement');
data.addColumn('number', 'Benchmark (50%/50% TSX-V/HUI) ');
data.addColumn('number', 'CDF NAC ');
data.addRows([
  [new Date(2018, 0, 1),42.09,82.47,],[new Date(2018, 1, 1),42.88,82.47,],[new Date(2018, 2, 1),39.33,78.26,],[new Date(2018, 3, 1),38.96,72.98,],[new Date(2018, 4, 1),38.98,77.62,],[new Date(2018, 5, 1),38.64,79.53,],[new Date(2018, 6, 1),37.46,75.12,],[new Date(2018, 7, 1),35.75,72.28,],[new Date(2018, 8, 1),33.72,69.29,],[new Date(2018, 9, 1),33.10,71.27,],[new Date(2018, 10, 1),31.72,68.62,],[new Date(2018, 11, 1),30.54,65.53,],
  [new Date(2019, 0, 1),31.49,61.23,],[new Date(2019, 1, 1),34.30,64.15,],[new Date(2019, 2, 1),34.11,64.13,],[new Date(2019, 3, 1),34.37,63.52,],[new Date(2019, 4, 1),32.61,58.88,],[new Date(2019, 5, 1),32.38,56.60,],[new Date(2019, 6, 1),35.77,59.77,],[new Date(2019, 7, 1),36.44,62.15,],[new Date(2019, 8, 1),39.01,65.34,],[new Date(2019, 9, 1),35.86,61.54,],[new Date(2019, 10, 1),36.70,60.51,],[new Date(2019, 11, 1),36.03,59.00,],
  [new Date(2020, 0, 1),39.85,67.53,],[new Date(2020, 1, 1),39.15,66.76,],[new Date(2020, 2, 1),34.93,59.35,],[new Date(2020, 3, 1),28.78,50.16,],[new Date(2020, 4, 1),38.07,69.69,],[new Date(2020, 5, 1),41.80,79.14,],[new Date(2020, 6, 1),45.95,91.51,],[new Date(2020, 7, 1),54.05,104.16,],[new Date(2020, 8, 1),55.26,116.85,],[new Date(2020, 9, 1),51.67,115.98,],[new Date(2020, 10, 1),49.87,111.20,],[new Date(2020, 11, 1),49.84,113.11,],
  [new Date(2021, 0, 1),55.39,125.83,],[new Date(2021, 1, 1),55.39,117.29,],[new Date(2021, 2, 1),56.02,116.46,],[new Date(2021, 3, 1),54.85,113.09,],[new Date(2021, 4, 1),55.98,123.36,],[new Date(2021, 5, 1),60.81,133.58,],[new Date(2021, 6, 1),55.63,120.68,],[new Date(2021, 7, 1),55.32,118.26,],[new Date(2021, 8, 1),52.44,111.19,],[new Date(2021, 9, 1),48.82,102.59,],[new Date(2021, 10, 1),53.49,113.06,],[new Date(2021, 11, 1),53.79,109.98,],
  [new Date(2022, 0, 1),54.24,114.31,],[new Date(2022, 1, 1),50.69,106.74,],[new Date(2022, 2, 1),53.79,112.16,],[new Date(2022, 3, 1),58.19,118.96,],[new Date(2022, 4, 1),52.91,113.69,],[new Date(2022, 5, 1),47.26,102.92,],[new Date(2022, 6, 1),40.73,86.32,],[new Date(2022, 7, 1),40.44,95.37,],[new Date(2022, 8, 1),38.20,92.43,],[new Date(2022, 9, 1),37.64,81.94,],[new Date(2022, 10, 1),37.82,81.27,],[new Date(2022, 11, 1),,,]
]);

var options = {
  hAxis: {
    format: 'yyyy',
    gridlines: { count: 5, color: 'transparent' },
    ticks: [new Date(2018, 3, 1), new Date(2019, 1, 1), new Date(2020, 1, 1), new Date(2021, 1, 1), new Date(2022, 1, 1)],
    minorGridlines: { color: 'transparent' },
    textStyle: { color: '#000', fontSize: 8 }
  },
  vAxis: {
    minorGridlines: { color: 'transparent' },
    gridlines: { count: 4 },
    textStyle: { color: '#706345', italic: true, fontSize: 8 },
    textPosition: 'in',
  },
  height: '360',
  colors: ['#CB9B01','#AA9870','#C2AE81','#706345','#E2D7BD'],
  backgroundColor: '#F4F3F0',
  chartArea: { 'width': '90%', 'height': '65%' },
  legend: { 'position': 'bottom', padding: 30 },
  seriesType: 'area',
  series: { 1: { type: 'line' }, 2: { type: 'line' }, 3: { type: 'line' }, 4: { type: 'line' }, 5: { type: 'line' } }
};

Thanks
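No answer is attached to this one here. A hedged guess, by analogy with the chartArea fix in the last-tick answer further down this page, is that the first tick label is being clipped at the left edge of the plot area, so reserving explicit room for it is worth a try (the left value below is illustrative, not from the original post):

chartArea: { left: 64, 'width': '90%', 'height': '65%' },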
How to custom animate a text in Flutter with a duration
I have a list of texts with durations for animation:

List<RhymeModel> rhymePhrases = [
  RhymeModel(lyricsPhrase: 'Baa, baa', startAt: 0.0, endAt: 0.2),
  RhymeModel(lyricsPhrase: 'black sheep', startAt: 0.3, endAt: 0.4),
  RhymeModel(lyricsPhrase: 'Have you', startAt: 0.5, endAt: 0.6),
  RhymeModel(lyricsPhrase: 'any wool?', startAt: 0.7, endAt: 0.8),
  RhymeModel(lyricsPhrase: 'Yes, sir,', startAt: 0.9, endAt: 1.0),
  RhymeModel(lyricsPhrase: 'yes, sir,', startAt: 1.1, endAt: 1.2),
  RhymeModel(lyricsPhrase: 'Three bags full.', startAt: 1.3, endAt: 1.4),
  RhymeModel(lyricsPhrase: 'One for the master,', startAt: 1.5, endAt: 1.6),
  RhymeModel(lyricsPhrase: 'And one for', startAt: 1.7, endAt: 1.8),
  RhymeModel(lyricsPhrase: 'dame,And one', startAt: 1.9, endAt: 2.0),
  RhymeModel(lyricsPhrase: 'for the little', startAt: 2.1, endAt: 2.2),
  RhymeModel(lyricsPhrase: 'boy Who lives ', startAt: 2.3, endAt: 2.4),
  RhymeModel(lyricsPhrase: 'down the lane.', startAt: 2.5, endAt: 2.6),
];

My objective is to animate each text with a reveal animation using its duration (similar to what you see when lyrics are matched to the audio). How can I animate each text?
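A minimal sketch of one way to do this, assuming the startAt/endAt values are seconds on one shared timeline and that a fade-in counts as the reveal (the LyricsReveal widget name is illustrative, not from any package): drive a single AnimationController for the whole rhyme and give each phrase an Interval covering its own window.

import 'package:flutter/material.dart';

// Mirrors the model shape assumed by the question's list above.
class RhymeModel {
  final String lyricsPhrase;
  final double startAt;
  final double endAt;
  RhymeModel({required this.lyricsPhrase, required this.startAt, required this.endAt});
}

class LyricsReveal extends StatefulWidget {
  final List<RhymeModel> phrases;
  const LyricsReveal({super.key, required this.phrases});

  @override
  State<LyricsReveal> createState() => _LyricsRevealState();
}

class _LyricsRevealState extends State<LyricsReveal>
    with SingleTickerProviderStateMixin {
  late final AnimationController _controller;
  late final double _total;

  @override
  void initState() {
    super.initState();
    // Assumes the list is ordered, so the last endAt is the full timeline length.
    _total = widget.phrases.last.endAt;
    _controller = AnimationController(
      vsync: this,
      // Assumes startAt/endAt are seconds; rescale here if they are something else.
      duration: Duration(milliseconds: (_total * 1000).round()),
    )..forward();
  }

  @override
  void dispose() {
    _controller.dispose();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) {
    return Wrap(
      children: [
        for (final p in widget.phrases)
          FadeTransition(
            // Interval expects 0..1, so normalise each window by the total length.
            opacity: CurvedAnimation(
              parent: _controller,
              curve: Interval(p.startAt / _total, p.endAt / _total),
            ),
            child: Text('${p.lyricsPhrase} '),
          ),
      ],
    );
  }
}

For a word-by-word karaoke effect you could swap FadeTransition for a ShaderMask or a custom painter, but the single-controller-plus-Interval structure stays the same.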
The google chart is not showing last tick on chart
The problem is that the last date is not shown as a tick, even though it has a value and a tick entry.

google.charts.load('current', {'packages': ['corechart']});
google.charts.setOnLoadCallback(drawChart);

function drawChart() {
  var data = google.visualization.arrayToDataTable([
    ["Month Day", "New User"],
    [new Date(2020, 9, 1), 4064],
    [new Date(2020, 9, 2), 3415],
    [new Date(2020, 9, 3), 2071],
    [new Date(2020, 9, 4), 397],
    [new Date(2020, 9, 5), 1425],
    [new Date(2020, 9, 6), 4848],
    [new Date(2020, 9, 7), 667]
  ]);

  var options = {
    vAxis: {
      gridlines: { color: "transparent" },
      format: "#,###",
      baseline: 0,
    },
    hAxis: {
      format: "dd MMM",
      gridlines: { color: "transparent" },
      "ticks": [
        new Date(2020, 9, 1), new Date(2020, 9, 2), new Date(2020, 9, 3),
        new Date(2020, 9, 4), new Date(2020, 9, 5), new Date(2020, 9, 6),
        new Date(2020, 9, 7)
      ]
    },
    height: 300,
    legend: "none",
    chartArea: { height: "85%", width: "92%", bottom: "11%", left: "10%" },
    colors: ["#85C1E9"],
  };

  var chart = new google.visualization.AreaChart(document.getElementById('chart_div'));
  chart.draw(data, options);
}

If I add an extra date just for the tick, it looks odd on the chart. Is there any way to show the last tick date on the chart's x-axis? https://jsfiddle.net/hu3wm0jn/
You just need to allow enough room on the right side of the chart for the label to appear; see the updated chartArea options...

chartArea: { left: 64, top: 48, right: 48, bottom: 64, height: '100%', width: '100%' },
height: '100%',
width: '100%',

See the following working snippet...

google.charts.load('current', { packages: ['corechart'] }).then(function () {
  var data = google.visualization.arrayToDataTable([
    ["Month Day", "New User"],
    [new Date(2020, 9, 1), 4064],
    [new Date(2020, 9, 2), 3415],
    [new Date(2020, 9, 3), 2071],
    [new Date(2020, 9, 4), 397],
    [new Date(2020, 9, 5), 1425],
    [new Date(2020, 9, 6), 4848],
    [new Date(2020, 9, 7), 667]
  ]);

  var options = {
    vAxis: {
      gridlines: { color: "transparent" },
      format: "#,###",
      baseline: 0,
    },
    hAxis: {
      format: "dd MMM",
      gridlines: { color: "transparent" },
      ticks: [
        new Date(2020, 9, 1), new Date(2020, 9, 2), new Date(2020, 9, 3),
        new Date(2020, 9, 4), new Date(2020, 9, 5), new Date(2020, 9, 6),
        new Date(2020, 9, 7)
      ]
    },
    legend: "none",
    chartArea: { left: 64, top: 48, right: 48, bottom: 64, height: '100%', width: '100%' },
    height: '100%',
    width: '100%',
    colors: ["#85C1E9"]
  };

  var chart = new google.visualization.AreaChart(document.getElementById('chart_div'));
  chart.draw(data, options);

  window.addEventListener('resize', function () {
    chart.draw(data, options);
  });
});

html, body {
  height: 100%;
  margin: 0px 0px 0px 0px;
  padding: 0px 0px 0px 0px;
}
#chart_div {
  min-height: 500px;
  height: 100%;
}

<script src="https://www.gstatic.com/charts/loader.js"></script>
<div id="chart_div"></div>
groupby and join vs window in pyspark
I have a data frame in pyspark which has hundreds of millions of rows (here is a dummy sample of it):

import datetime
import pyspark.sql.functions as F
from pyspark.sql import Window, Row
from pyspark.sql.functions import col
from pyspark.sql.functions import month, mean, sum, year, avg
from pyspark.sql.functions import concat_ws, to_date, unix_timestamp, datediff, lit
from pyspark.sql.functions import when, min, max, desc, row_number, col

dg = sqlContext.createDataFrame(sc.parallelize([
    Row(cycle_dt=datetime.datetime(1984, 5, 2, 0, 0), network_id=4, norm_strength=0.5, spend_active_ind=1, net_spending_amt=0, cust_xref_id=10),
    Row(cycle_dt=datetime.datetime(1984, 6, 2, 0, 0), network_id=4, norm_strength=0.5, spend_active_ind=1, net_spending_amt=2, cust_xref_id=11),
    Row(cycle_dt=datetime.datetime(1984, 7, 2, 0, 0), network_id=4, norm_strength=0.5, spend_active_ind=1, net_spending_amt=2, cust_xref_id=12),
    Row(cycle_dt=datetime.datetime(1984, 4, 2, 0, 0), network_id=4, norm_strength=0.5, spend_active_ind=1, net_spending_amt=2, cust_xref_id=13),
    Row(cycle_dt=datetime.datetime(1983, 11, 5, 0, 0), network_id=1, norm_strength=0.5, spend_active_ind=0, net_spending_amt=8, cust_xref_id=1),
    Row(cycle_dt=datetime.datetime(1983, 12, 2, 0, 0), network_id=1, norm_strength=0.5, spend_active_ind=0, net_spending_amt=2, cust_xref_id=1),
    Row(cycle_dt=datetime.datetime(1984, 1, 3, 0, 0), network_id=1, norm_strength=0.5, spend_active_ind=1, net_spending_amt=15, cust_xref_id=1),
    Row(cycle_dt=datetime.datetime(1984, 3, 2, 0, 0), network_id=1, norm_strength=0.5, spend_active_ind=0, net_spending_amt=7, cust_xref_id=1),
    Row(cycle_dt=datetime.datetime(1984, 4, 3, 0, 0), network_id=1, norm_strength=0.5, spend_active_ind=0, net_spending_amt=1, cust_xref_id=1),
    Row(cycle_dt=datetime.datetime(1984, 5, 2, 0, 0), network_id=1, norm_strength=0.5, spend_active_ind=0, net_spending_amt=1, cust_xref_id=1),
    Row(cycle_dt=datetime.datetime(1984, 10, 6, 0, 0), network_id=1, norm_strength=0.5, spend_active_ind=1, net_spending_amt=10, cust_xref_id=1),
    Row(cycle_dt=datetime.datetime(1984, 1, 7, 0, 0), network_id=1, norm_strength=0.4, spend_active_ind=0, net_spending_amt=8, cust_xref_id=2),
    Row(cycle_dt=datetime.datetime(1984, 1, 2, 0, 0), network_id=1, norm_strength=0.4, spend_active_ind=0, net_spending_amt=3, cust_xref_id=2),
    Row(cycle_dt=datetime.datetime(1984, 2, 7, 0, 0), network_id=1, norm_strength=0.4, spend_active_ind=1, net_spending_amt=5, cust_xref_id=2),
    Row(cycle_dt=datetime.datetime(1985, 2, 7, 0, 0), network_id=1, norm_strength=0.3, spend_active_ind=1, net_spending_amt=8, cust_xref_id=3),
    Row(cycle_dt=datetime.datetime(1985, 3, 7, 0, 0), network_id=1, norm_strength=0.3, spend_active_ind=0, net_spending_amt=2, cust_xref_id=3),
    Row(cycle_dt=datetime.datetime(1985, 4, 7, 0, 0), network_id=1, norm_strength=0.3, spend_active_ind=1, net_spending_amt=1, cust_xref_id=3),
    Row(cycle_dt=datetime.datetime(1985, 4, 8, 0, 0), network_id=1, norm_strength=0.3, spend_active_ind=1, net_spending_amt=9, cust_xref_id=3),
    Row(cycle_dt=datetime.datetime(1984, 4, 2, 0, 0), network_id=2, norm_strength=0.5, spend_active_ind=0, net_spending_amt=3, cust_xref_id=4),
    Row(cycle_dt=datetime.datetime(1984, 4, 3, 0, 0), network_id=2, norm_strength=0.5, spend_active_ind=0, net_spending_amt=2, cust_xref_id=4),
    Row(cycle_dt=datetime.datetime(1984, 1, 2, 0, 0), network_id=2, norm_strength=0.5, spend_active_ind=0, net_spending_amt=5, cust_xref_id=4),
    Row(cycle_dt=datetime.datetime(1984, 1, 3, 0, 0), network_id=2, norm_strength=0.5, spend_active_ind=1, net_spending_amt=6, cust_xref_id=4),
    Row(cycle_dt=datetime.datetime(1984, 3, 2, 0, 0), network_id=2, norm_strength=0.5, spend_active_ind=0, net_spending_amt=2, cust_xref_id=4),
    Row(cycle_dt=datetime.datetime(1984, 1, 5, 0, 0), network_id=2, norm_strength=0.5, spend_active_ind=0, net_spending_amt=9, cust_xref_id=4),
    Row(cycle_dt=datetime.datetime(1984, 1, 6, 0, 0), network_id=2, norm_strength=0.5, spend_active_ind=1, net_spending_amt=1, cust_xref_id=4),
    Row(cycle_dt=datetime.datetime(1984, 1, 7, 0, 0), network_id=2, norm_strength=0.4, spend_active_ind=0, net_spending_amt=7, cust_xref_id=5),
    Row(cycle_dt=datetime.datetime(1984, 1, 2, 0, 0), network_id=2, norm_strength=0.4, spend_active_ind=0, net_spending_amt=8, cust_xref_id=5),
    Row(cycle_dt=datetime.datetime(1984, 2, 7, 0, 0), network_id=2, norm_strength=0.4, spend_active_ind=1, net_spending_amt=3, cust_xref_id=5),
    Row(cycle_dt=datetime.datetime(1985, 2, 7, 0, 0), network_id=2, norm_strength=0.6, spend_active_ind=1, net_spending_amt=6, cust_xref_id=6),
    Row(cycle_dt=datetime.datetime(1985, 3, 7, 0, 0), network_id=2, norm_strength=0.6, spend_active_ind=0, net_spending_amt=9, cust_xref_id=6),
    Row(cycle_dt=datetime.datetime(1985, 4, 7, 0, 0), network_id=2, norm_strength=0.6, spend_active_ind=1, net_spending_amt=4, cust_xref_id=6),
    Row(cycle_dt=datetime.datetime(1985, 4, 8, 0, 0), network_id=2, norm_strength=0.6, spend_active_ind=1, net_spending_amt=6, cust_xref_id=6),
    Row(cycle_dt=datetime.datetime(1984, 4, 2, 0, 0), network_id=3, norm_strength=0.5, spend_active_ind=0, net_spending_amt=0, cust_xref_id=7),
    Row(cycle_dt=datetime.datetime(1984, 4, 3, 0, 0), network_id=3, norm_strength=0.5, spend_active_ind=0, net_spending_amt=0, cust_xref_id=7),
    Row(cycle_dt=datetime.datetime(1984, 1, 2, 0, 0), network_id=3, norm_strength=0.5, spend_active_ind=0, net_spending_amt=0, cust_xref_id=7),
    Row(cycle_dt=datetime.datetime(1984, 1, 3, 0, 0), network_id=3, norm_strength=0.5, spend_active_ind=0, net_spending_amt=0, cust_xref_id=7),
    Row(cycle_dt=datetime.datetime(1984, 3, 2, 0, 0), network_id=3, norm_strength=0.5, spend_active_ind=0, net_spending_amt=0, cust_xref_id=7),
    Row(cycle_dt=datetime.datetime(1984, 1, 5, 0, 0), network_id=3, norm_strength=0.5, spend_active_ind=0, net_spending_amt=0, cust_xref_id=7),
    Row(cycle_dt=datetime.datetime(1984, 1, 6, 0, 0), network_id=3, norm_strength=0.5, spend_active_ind=0, net_spending_amt=0, cust_xref_id=7),
    Row(cycle_dt=datetime.datetime(1984, 1, 7, 0, 0), network_id=3, norm_strength=0.4, spend_active_ind=0, net_spending_amt=3, cust_xref_id=8),
    Row(cycle_dt=datetime.datetime(1984, 1, 2, 0, 0), network_id=3, norm_strength=0.4, spend_active_ind=0, net_spending_amt=2, cust_xref_id=8),
    Row(cycle_dt=datetime.datetime(1984, 2, 7, 0, 0), network_id=3, norm_strength=0.4, spend_active_ind=1, net_spending_amt=8, cust_xref_id=8),
    Row(cycle_dt=datetime.datetime(1985, 2, 7, 0, 0), network_id=3, norm_strength=0.6, spend_active_ind=1, net_spending_amt=4, cust_xref_id=9),
    Row(cycle_dt=datetime.datetime(1985, 3, 7, 0, 0), network_id=3, norm_strength=0.6, spend_active_ind=0, net_spending_amt=1, cust_xref_id=9),
    Row(cycle_dt=datetime.datetime(1985, 4, 7, 0, 0), network_id=3, norm_strength=0.6, spend_active_ind=1, net_spending_amt=9, cust_xref_id=9),
    Row(cycle_dt=datetime.datetime(1985, 4, 8, 0, 0), network_id=3, norm_strength=0.6, spend_active_ind=0, net_spending_amt=3, cust_xref_id=9)
]))

I am trying to sum spend_active_ind for each cust_xref_id and keep those whose sum is more than zero.
One way to do this is using groupby and join:

dg1 = dg.groupby("cust_xref_id").agg(sum("spend_active_ind").alias("sum_spend_active_ind"))
dg1 = dg1.filter(dg1.sum_spend_active_ind != 0).select("cust_xref_id")
dg = (dg.alias("t1")
        .join(dg1.alias("t2"), col("t1.cust_xref_id") == col("t2.cust_xref_id"))
        .select(col("t1.*")))

The other way I can think of is using a window:

w = Window.partitionBy('cust_xref_id')
dg = dg.withColumn('sum_spend_active_ind', sum(dg.spend_active_ind).over(w))
dg = dg.filter(dg.sum_spend_active_ind != 0)

Which of these methods (or any other method) is more efficient for what I am trying to do? Thanks
You could try to open your Spark UI at localhost:4040, or see the query plan using the explain method:

(
    dg
    .groupby('cust_xref_id')
    .agg(F.sum('spend_active_ind').alias('sum_spend_active_ind'))
    .filter(F.col('sum_spend_active_ind') > 0)
).explain()
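For a side-by-side comparison, a hedged sketch that prints the window variant's plan too (it reuses only the imports and the dg frame from the question). As a rule of thumb, the window version computes the per-key sum while keeping the original rows in a single pass over dg, so its plan tends to show fewer Exchange (shuffle) operators than the aggregate-plus-join version, but the printed plans are the authoritative answer:

w = Window.partitionBy('cust_xref_id')
(
    dg
    .withColumn('sum_spend_active_ind', F.sum('spend_active_ind').over(w))
    .filter(F.col('sum_spend_active_ind') > 0)
).explain()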
Scala collection of dates and group by week
import java.time.LocalDate

case class Day(date: LocalDate, other: String)

val list = Seq(
  Day(LocalDate.of(2016, 2, 1), "text"),
  Day(LocalDate.of(2016, 2, 2), "text"), // Tuesday
  Day(LocalDate.of(2016, 2, 3), "text"),
  Day(LocalDate.of(2016, 2, 4), "text"),
  Day(LocalDate.of(2016, 2, 5), "text"),
  Day(LocalDate.of(2016, 2, 6), "text"),
  Day(LocalDate.of(2016, 2, 7), "text"),
  Day(LocalDate.of(2016, 2, 8), "text"),
  Day(LocalDate.of(2016, 2, 9), "text"),
  Day(LocalDate.of(2016, 2, 10), "text"),
  Day(LocalDate.of(2016, 2, 11), "text"),
  Day(LocalDate.of(2016, 2, 12), "text"),
  Day(LocalDate.of(2016, 2, 13), "text"),
  Day(LocalDate.of(2016, 2, 14), "text"),
  Day(LocalDate.of(2016, 2, 15), "text"),
  Day(LocalDate.of(2016, 2, 16), "text"),
  Day(LocalDate.of(2016, 2, 17), "text")
)

// hard code, for example Tuesday
def groupDaysBy(list: Seq[Day]): List[List[Day]] = {
  ???
}

val result = Seq(
  Seq(Day(LocalDate.of(2016, 2, 1), "text")), // separate
  Seq(Day(LocalDate.of(2016, 2, 2), "text"), // Tuesday
      Day(LocalDate.of(2016, 2, 3), "text"),
      Day(LocalDate.of(2016, 2, 4), "text"),
      Day(LocalDate.of(2016, 2, 5), "text"),
      Day(LocalDate.of(2016, 2, 6), "text"),
      Day(LocalDate.of(2016, 2, 7), "text"),
      Day(LocalDate.of(2016, 2, 8), "text")),
  Seq(Day(LocalDate.of(2016, 2, 9), "text"), // Tuesday
      Day(LocalDate.of(2016, 2, 10), "text"),
      Day(LocalDate.of(2016, 2, 11), "text"),
      Day(LocalDate.of(2016, 2, 12), "text"),
      Day(LocalDate.of(2016, 2, 13), "text"),
      Day(LocalDate.of(2016, 2, 14), "text"),
      Day(LocalDate.of(2016, 2, 15), "text")),
  Seq(Day(LocalDate.of(2016, 2, 16), "text"), // Tuesday
      Day(LocalDate.of(2016, 2, 17), "text"))
)

assert(groupDaysBy(list) == result)

I have a list of Day objects, and I want to group every 7 days together; the start day can be any day of the week (Monday to Sunday; I use Tuesday as the example). Above are the function stub and the expected result for my requirement. How can I take advantage of the Scala collections API to achieve this without tail recursion?
Here's what you can do (adding the java.time.DayOfWeek import the snippet needs):

import java.time.DayOfWeek

// hard-coded, for example Tuesday
def groupDaysBy(list: Seq[Day]): Seq[Seq[Day]] = {
  val (list1, list2) = list.span(_.date.getDayOfWeek != DayOfWeek.TUESDAY)
  Seq(list1) ++ list2.grouped(7)
}

I would recommend taking the day as a parameter instead of hardcoding it, though, so it becomes:

def groupDaysBy(list: Seq[Day], dayOfWeek: DayOfWeek): Seq[Seq[Day]] = {
  val (list1, list2) = list.span(_.date.getDayOfWeek != dayOfWeek)
  Seq(list1) ++ list2.grouped(7)
}

...

assert(groupDaysBy(list, DayOfWeek.TUESDAY) == result)
Map your list to create a tuple (GroupKey, value), with GroupKey a value representing a unique week (year * 53 + week_of_the_year, for example). Then you can group on GroupKey.
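A hedged sketch of that idea (assuming ISO week numbering; note that calendar-week keys will not honour an arbitrary start day the way the span/grouped answer above does, and groupBy does not preserve order, so the groups are re-sorted by key):

import java.time.temporal.WeekFields

def groupDaysByWeek(list: Seq[Day]): Seq[Seq[Day]] = {
  val weekOfYear = WeekFields.ISO.weekOfYear()
  list
    // year * 53 + week-of-year: a key that is unique per calendar week
    .groupBy(d => d.date.getYear * 53 + d.date.get(weekOfYear))
    .toSeq
    .sortBy(_._1)
    .map(_._2)
}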