Member since
02-25-2016
72
Posts
34
Kudos Received
5
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3608 | 07-28-2017 10:51 AM
 | 2753 | 05-08-2017 03:11 PM
 | 1164 | 04-03-2017 07:38 PM
 | 2834 | 03-21-2017 06:56 PM
 | 1140 | 02-09-2017 08:28 PM
10-10-2017
01:13 PM
1 Kudo
Hi Team, I was trying to write a DataFrame to a Hive table, bucketed by one of the columns, and I am facing an error:

File "<stdin>", line 1, in <module>
AttributeError: 'DataFrameWriter' object has no attribute 'bucketBy'

Here is the statement I am trying to run:

rs.write.bucketBy(4, "Column1").sortBy("column2").saveAsTable("database.table")

Can you please help me out with this?
Labels:
- Apache Spark
08-30-2017
07:01 PM
1 Kudo
Hello Team, can someone please help me understand the comparison/differences between the Hive CLI and Beeline?
Labels:
- Apache Hive
07-28-2017
11:04 AM
1 Kudo
We generally encounter such errors when the delimiter specified in the command doesn't match the delimiter in the input file. Also make sure you are giving the complete and correct path to the file. Please try the syntax below:

A = LOAD '/path_to_file' USING PigStorage('|') AS (aa, bb, cc, dd, ee);
07-28-2017
10:51 AM
1 Kudo
Time taken for query execution depends on multiple factors:
1. Mainly the Hive query design, the joins, and the columns being pulled.
2. The YARN/Tez container size allocated, which depends on where you are running.
3. The queue you are running your job in; check whether the queue is free.

To answer your question on why one of the reducers is taking 1000 tasks, please check the hive.exec.reducers.max value defined. If you want to experiment with the number of reducers, try changing the value of hive.exec.reducers.bytes.per.reducer (preferably assign a smaller value, as it is inversely proportional to the number of reducers); see the sketch below.
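A minimal sketch of the two settings named above, assuming the 256 MB default for bytes-per-reducer mentioned later in this thread; the query and table names are hypothetical:

-- Upper bound on how many reducers Hive may request.
SET hive.exec.reducers.max=1099;
-- Target input bytes per reducer; halving it roughly doubles the reducer count.
SET hive.exec.reducers.bytes.per.reducer=134217728; -- 128 MB instead of the 256 MB default
-- Re-run the query and compare the reducer count in the Tez DAG.
SELECT col1, COUNT(*) FROM example_db.example_table GROUP BY col1;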
07-21-2017
09:19 PM
1 Kudo
@Varun Please see the settings below to control the number of reducers.

mapred.reduce.tasks = -1; -- this property lets Tez determine the number of reducers to initiate

hive.tez.auto.reducer.parallelism = true; -- when this property is set to TRUE, Hive will estimate data sizes and set parallelism estimates. Tez will sample the source vertices' output sizes and adjust the estimates at run time. This is the first property that determines the initial number of reducers once Tez starts the query.

hive.tez.min.partition.factor = 0.25; -- when auto reducer parallelism is enabled, this property puts a lower limit on the number of reducers that Tez specifies

1. hive.tez.max.partition.factor = 2.0; -- this property tells Tez to over-partition the data in shuffle edges
2. hive.exec.reducers.max -- the maximum number of reducers, 1099 by default
3. hive.exec.reducers.bytes.per.reducer = 256 MB, which is 268435456 bytes

Now, to calculate the number of reducers, we put it all together with this formula. From the Explain plan we also need the estimated output size of the reducer stage; let's assume 200,000 bytes:

Max(1, Min(hive.exec.reducers.max [1099], Reducer stage estimate / hive.exec.reducers.bytes.per.reducer)) × hive.tez.max.partition.factor [2]
= Max(1, Min(1099, 200000 / 268435456)) × 2
= Max(1, Min(1099, 0.00074505805)) × 2
= Max(1, 0.0007) × 2
= 1 × 2
= 2

Tez will spawn 2 reducers. In this case we can legitimately make Tez initiate a higher number of reducers by lowering the value of hive.exec.reducers.bytes.per.reducer, for example to roughly 10 KB (10432 bytes):

Max(1, Min(1099, 200000 / 10432)) × 2 = Max(1, 19) × 2 = 38

Please note that a higher number of reducers doesn't necessarily mean better performance.
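Collected as runnable statements, a minimal sketch of the settings walked through above (the bytes-per-reducer value is the lowered one from the worked example; tune it for your own data sizes):

SET mapred.reduce.tasks=-1; -- let Tez determine the reducer count
SET hive.tez.auto.reducer.parallelism=true;
SET hive.tez.min.partition.factor=0.25;
SET hive.tez.max.partition.factor=2.0;
SET hive.exec.reducers.max=1099;
SET hive.exec.reducers.bytes.per.reducer=10432; -- lowered from the 256 MB default to force more reducers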
07-21-2017
08:44 PM
1 Kudo
@Varun R Optimization varies in every case; it depends on the incoming data and file sizes. In general, please use these settings for fine tuning.

Enable predicate pushdown (PPD) to filter at the storage layer:
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;

Vectorized query execution processes data in batches of 1024 rows instead of one row at a time:
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

Enable the cost-based optimizer (CBO) for efficient query execution based on cost, and fetch table statistics:
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

Partition and column statistics are fetched from the metastore. Use this with caution: if you have too many partitions and/or columns, it could degrade performance.

Control reducer output:
SET hive.tez.auto.reducer.parallelism=true;

Partition the table on the necessary column, and also bucket the tables (identify the bucketing column wisely); a sketch is shown below. Tuning also depends on the Explain plan of your query; please check the number of mappers and reducers spawned.
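As an illustration of the partitioning and bucketing advice above, a minimal sketch with hypothetical table and column names:

CREATE TABLE sales_bucketed (
  order_id BIGINT,
  customer_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (order_date STRING) -- partition on a column you commonly filter by
CLUSTERED BY (customer_id) INTO 16 BUCKETS -- bucket on a high-cardinality join key
STORED AS ORC;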
07-21-2017
06:33 PM
2 Kudos
@Simran Kaur I see in stdout that the Oozie launcher failed. Are you trying to run a Hive action in Oozie? If that's the case, please use the command yarn logs -applicationId application_1499692338187_45811 to get the logs, or follow the KB article below to trace the logs and debug further: https://community.hortonworks.com/articles/9148/troubleshooting-an-oozie-flow.html
07-21-2017
06:14 PM
2 Kudos
@Helmi Khalifa Please use the syntax below to load data from HDFS into Hive tables:

LOAD DATA INPATH '/hdfs/path' OVERWRITE INTO TABLE TABLE_NAME;

In case you are trying to load into a specific partition of the table:

LOAD DATA INPATH '/hdfs/path' OVERWRITE INTO TABLE TABLE_NAME PARTITION (ds='2008-08-15');
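Note that the partitioned variant requires the target table to be declared with that partition column; a minimal sketch, with hypothetical data columns:

CREATE TABLE TABLE_NAME (
  col1 STRING,
  col2 INT
)
PARTITIONED BY (ds STRING); -- 'ds' must match the PARTITION clause in the LOAD statement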
07-14-2017
06:19 PM
2 Kudos
For every reducer, a certain number of tasks are created. Can someone explain what factor decides the number of tasks to be created for each reducer?
Labels:
- Apache Hadoop
- Apache YARN
06-23-2017
02:05 AM
Get the HDFS path where the Hive table files are stored, then use hdfs dfs -du -s -h /hdfs_path to get the size in a human-readable format; see the sketch below for finding the path.
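A minimal sketch of locating that path, assuming a hypothetical database and table name; the Location field in the output is the HDFS directory to pass to hdfs dfs -du:

DESCRIBE FORMATTED example_db.example_table; -- the "Location:" row holds the table's HDFS path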