How to increase the number of mappers when querying in Hive - hiveql

I want to run my Hive query with 1500 mappers. I have set reducers to 500. To what value should I set the input split size in order to achieve the above?

Set the values below, choosing (1) or (2) based on your Hadoop version:
1)
set mapred.map.tasks = XX;
set mapred.reduce.tasks = XX;
2)
set mapreduce.job.maps = XX;
set mapreduce.job.reduces = XX;
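Note that the number of mappers is ultimately driven by the number of input splits, so the split size to aim for is roughly the total input size divided by the desired mapper count. A minimal sketch of that arithmetic in Python, assuming a hypothetical 150 GB of input (the split-size property names in the comments are the standard Hadoop/Hive ones, but verify them against your version):

# Rough arithmetic: mappers ~= total input size / split size,
# so to target ~1500 mappers pick split size ~= total input size / 1500.
total_input_bytes = 150 * 1024**3   # hypothetical: ~150 GB of table data
desired_mappers = 1500

split_size = total_input_bytes // desired_mappers
print(split_size)                   # 107374182 bytes, roughly 100 MB per split

# Then, in Hive (property names vary by Hadoop version):
#   set mapreduce.input.fileinputformat.split.maxsize=<split_size>;
#   set mapred.max.split.size=<split_size>;   -- older releases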

Related

Setting column values as column names in the Flink SQL query result

I would like to read a table that has values that will be the column names of the Flink SQL query result. For example, I have t1 as
name value
----------
sp_1 100
sp_2 200
sp_3 300
... ...
Now I want the result of the query to look like this (t2):
sp_1 sp_2 sp_3 ...
100 200 300
Assume all the sp_* have been created in t2.
Is it possible to achieve it through Flink SQL?
Flink version: 1.13.6
I believe this would be possible using something like the PIVOT and UNPIVOT functions, which are, at the time of writing, not yet supported. You can track https://issues.apache.org/jira/browse/FLINK-23179 for updates.
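Until that lands, a common workaround is to pivot manually with conditional aggregation, which requires knowing the sp_* names up front. A minimal pyflink sketch, assuming Flink 1.13 with the Python Table API available (the in-memory rows are only placeholders standing in for t1):

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(
    EnvironmentSettings.new_instance().in_batch_mode().build())

# Placeholder rows standing in for t1 from the question.
t1 = t_env.from_elements(
    [("sp_1", 100), ("sp_2", 200), ("sp_3", 300)],
    ["name", "value"])
t_env.create_temporary_view("t1", t1)

# Each known sp_* name becomes one conditional aggregate column.
t_env.execute_sql("""
    SELECT
        MAX(CASE WHEN name = 'sp_1' THEN `value` END) AS sp_1,
        MAX(CASE WHEN name = 'sp_2' THEN `value` END) AS sp_2,
        MAX(CASE WHEN name = 'sp_3' THEN `value` END) AS sp_3
    FROM t1
""").print()

Because the output columns have to be enumerated in the query, this only works when the set of sp_* names is fixed, which matches the assumption that all sp_* columns already exist in t2.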

Skewed Window Function & Hive Source Partitions?

The data I am reading via Spark is a highly skewed Hive table with the following stats (partition size / record count) from the Spark UI:
MIN: 1506.0 B / 0
25TH: 232.4 KB / 27288
MEDIAN: 247.3 KB / 29025
75TH: 371.0 KB / 42669
MAX: 269.0 MB / 27197137
I believe it is causing problems downstream in the job when I perform some Window Funcs, and Pivots.
I tried exploring this parameter to limit the partition size; however, nothing changed and the partitions are still skewed upon read.
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728)  # value shown is illustrative (128 MB)
Also, when I cache this DF with the Hive table as the source it takes a few minutes and even causes some GC in the Spark UI, most likely because of the skew as well.
Does this spark.sql.files.maxPartitionBytes work on Hive tables or only files?
What is the best course of action for handling this skewed Hive source?
Would something like a stage-barrier write to Parquet, or salting, be suitable for this problem?
I would like to avoid .repartition() on read as it adds another layer to what is already a data roller-coaster of a job.
Thank you
==================================================
After further research it appears the Window function is causing skewed data too, and this is where the Spark job hangs.
I am performing some time-series filling via a double Window function (a forward fill then a backward fill to impute all the null sensor readings) and am trying to follow this article's salt method to distribute the data evenly ... however the following code produces all null values, so the salt method is not working.
I'm not sure why I am getting skew after the Window, since each measure item I partition by has roughly the same number of records (checked via .groupBy()), so why would salt be needed?
+--------------------+-------+
| measure | count|
+--------------------+-------+
| v1 |5030265|
| v2 |5009780|
| v3 |5030526|
| v4 |5030504|
...
salt post => https://medium.com/appsflyer/salting-your-spark-to-scale-e6f1c87dd18
from pyspark.sql import Window, functions as F

nSaltBins = 300  # based off number of "measure" values
df_fill = df_fill.withColumn("salt", (F.rand() * nSaltBins).cast("int"))

# FILLS [FORWARD + BACKWARD]
window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)

# FORWARD FILLING IMPUTER
ffill_imputer = F.last(df_fill['new_value'], ignorenulls=True)\
    .over(window)
df_fill = df_fill.withColumn('value_impute_temp', ffill_imputer)\
    .drop("value", "new_value")

window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(0, Window.unboundedFollowing)

# BACKWARD FILLING IMPUTER
bfill_imputer = F.first(df_fill['value_impute_temp'], ignorenulls=True)\
    .over(window)
df_fill = df_fill.withColumn('value_impute_final', bfill_imputer)\
    .drop("value_impute_temp")
Salting might be helpful in the case where a single partition is big enough not to fit in memory on a single executor. This might happen even if all the keys are equally distributed (as in your case).
You have to include the salt column in the partitionBy clause which you are using to create the window:
window = Window.partitionBy('measure', 'salt')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
Then you have to create another window, without the salt column, which will operate on the intermediate result:
window1 = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
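For completeness, a minimal sketch of wiring the two passes together, reusing df_fill, F, window, and window1 from above (the intermediate column name value_ffill_salted is made up for this sketch):

# Pass 1: forward fill within each (measure, salt) bucket using the salted window.
df_intermediate = df_fill.withColumn(
    'value_ffill_salted',
    F.last('new_value', ignorenulls=True).over(window))

# Pass 2: forward fill across salt-bucket boundaries, partitioned by measure only.
df_result = df_intermediate.withColumn(
    'value_impute_temp',
    F.last('value_ffill_salted', ignorenulls=True).over(window1))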
Hive-based solution:
You can enable skew-join optimization using Hive configuration. The applicable settings are:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
See the Databricks tips for this:
skew hints may work in this case
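If you go the Databricks route, the skew hint is applied through the standard DataFrame hint API; a minimal sketch follows, with the caveat that the "skew" hint name is Databricks Runtime specific (open-source Spark will simply warn on and ignore an unrecognized hint), so check your runtime's documentation for the exact form:

# Hypothetical usage: hint the skewed key before the heavy join/aggregation.
df_hinted = df_fill.hint("skew", "measure")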

How to track rows (by id) with a specific column value using Kafka JDBC Connector?

I have a table containing a large number of records. There's a column defining the type of the record. I'd like to collect records with a specific value in that column. Kind of:
SELECT * FROM myVeryOwnTable WHERE type = 'VERY_IMPORTANT_TYPE'
What I've noticed is that I can't use a WHERE clause in a custom query when I choose incrementing (+ timestamp) mode; otherwise I'd need to take care of the filtering on my own.
The background of what I'd like to achieve: I use Logstash to transfer some types of data from MySQL to ES. That's easily achievable there by using a query that can contain a WHERE clause. However, with Kafka I can transfer my data much more quickly (almost instantly) after inserting new rows in the DB.
Thank you for any hints or advice.
Thanks to @wardziniak I was able to set it up.
query=select * from (select * from myVeryOwnTable p where type = 'VERY_IMPORTANT_TYPE') p
topic.prefix=test-mysql-jdbc-
incrementing.column.name=id
However, I was expecting a topic named test-mysql-jdbc-myVeryOwnTable, so I registered my consumer to that. With the query shown above, though, the table name is skipped, so my topic was named exactly as the prefix defined above. I've since updated my property to topic.prefix=test-mysql-jdbc-myVeryOwnTable and it seems to be working just fine.
You can use a subquery in your JDBC Source Connector query property.
Sample JDBC Source Connector configuration:
{
  ...
  "query": "select * from (select * from myVeryOwnTable p where type = 'VERY_IMPORTANT_TYPE') p",
  "incrementing.column.name": "id",
  ...
}
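For completeness, a minimal sketch of registering such a connector through the Kafka Connect REST API from Python, assuming Connect listens on localhost:8083 and that the connector name, JDBC URL, and credentials below are placeholders to adjust:

import json
import requests  # assumes the requests package is installed

connector = {
    "name": "mysql-very-important-type-source",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://localhost:3306/mydb",  # placeholder
        "connection.user": "user",                             # placeholder
        "connection.password": "password",                     # placeholder
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "query": "select * from (select * from myVeryOwnTable p where type = 'VERY_IMPORTANT_TYPE') p",
        "topic.prefix": "test-mysql-jdbc-myVeryOwnTable",
    },
}

# POST the connector definition to the Connect REST endpoint.
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())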

apache cassandra - Inconsistency between number of records returned and count(*) result

I am importing some data into a table in Apache Cassandra using the COPY command. I have 7 rows in my CSV file, but after importing I have just 1 row instead of 7. What would cause this inconsistency?
Attached is an image of my cqlsh screen.
Possible issue:
the rows share the same primary key (the same partition and clustering columns), so each imported row overwrites the previous one.
Solution
try adding another column to the clustering key (domain specific) that gives the rows uniqueness.
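A minimal sketch of the idea with the DataStax Python driver (the keyspace, table, and column names here are hypothetical): with PRIMARY KEY ((id), category), CSV rows that share the same id and category collapse into a single row because Cassandra writes are upserts, while adding a distinguishing column such as seq to the clustering key keeps every row.

from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical keyspace

# Rows sharing (id, category) overwrite each other, so COPY keeps only the last one.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_collapsing (
        id int,
        category text,
        value double,
        PRIMARY KEY ((id), category)
    )
""")

# Adding a distinguishing clustering column (seq) preserves every CSV row.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_unique (
        id int,
        category text,
        seq int,
        value double,
        PRIMARY KEY ((id), category, seq)
    )
""")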

OnComponentOrder flow and tMap connections in Talend

I have the following flow:
1 component that needs to be executed to extract a certain timestamp from MySQL
3 MySQL inputs that need to use that timestamp
1 tMap which needs to get the 3 MySQL inputs
However, I am not allowed to connect the 3 MySQL inputs to the single tMap because they depend on the first component (through OnComponentOk) but with a different order. How do I orchestrate this sort of situation?
You could execute a query and set a global variable using the tSetGlobalVar component (referencing row1.mydate, for example), then in each of your queries going into tMap, reference the global variable like:
SELECT ...
FROM ...
WHERE mydate >= '" + (String) globalMap.get("myDate") + "';"
Two subjobs, one for getting the variable and storing it, and another for doing your three queries into tMap, etc.