Member since: 01-03-2017
Posts: 181
Kudos Received: 44
Solutions: 24
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1849 | 12-02-2018 11:49 PM
 | 2474 | 04-13-2018 06:41 AM
 | 2043 | 04-06-2018 01:52 AM
 | 2348 | 01-07-2018 09:04 PM
 | 5695 | 12-20-2017 10:58 PM
10-03-2017
01:41 AM
Hi @Mamta Chawla, getting the table schema is easily done with the respective command-line utility (BTEQ, sqlplus, the mysql client, etc.); Sqoop is handy for pulling larger volumes of data from an RDBMS into HDFS. However, if you want to store the schema information in Hadoop as files using Sqoop, you can achieve this by querying the database metadata tables, for instance:
select * from all_tab_columns where owner = '<OWNER_NAME>' and table_name = 'YOUR TABLE NAME HERE';  -- for Oracle
select * from DBC.columns where databasename = '<Database_Name>' and tablename = '<Table_name>';  -- for Teradata
SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'tbl_name' AND table_schema = 'db_name';  -- for MySQL
Hope this helps !!
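For example, a Sqoop free-form query import can land that metadata in HDFS as files. This is only a minimal sketch assuming a MySQL source; the host, credentials, database/table names and target directory are all placeholders:
sqoop import \
  --connect jdbc:mysql://<db_host>/<db_name> \
  --username <mysql_user> --password <mysql_pwd> \
  --query "SELECT column_name, data_type, is_nullable FROM INFORMATION_SCHEMA.COLUMNS WHERE table_schema = 'db_name' AND table_name = 'tbl_name' AND \$CONDITIONS" \
  --target-dir /user/<hdfs_user>/schemas/tbl_name \
  -m 1   # single mapper, so --split-by is not needed for the free-form query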
10-02-2017
08:38 AM
Hi @ilia kheifets, can you please verify that the UID in /etc/passwd and the GIDs for the user "usertest1" are consistent and available across the cluster (in the case of local authentication). Apart from that, you can try yarn-client mode and copy core-site.xml from /etc/hadoop/conf into the Spark conf directory on all the nodes in the cluster. I presume something is wrong with how the libraries are shared between the nodes in the cluster. Hope this helps!!
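For a quick check, something along these lines, assuming the usual HDP layout where the Spark configuration lives in /etc/spark/conf:
# run on every node in the cluster; the UID and GIDs should match everywhere
id usertest1
getent passwd usertest1
# copy the Hadoop client configuration into the Spark conf directory
cp /etc/hadoop/conf/core-site.xml /etc/spark/conf/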
10-02-2017
08:21 AM
1 Kudo
Hi @Joe Harvy, YARN and the other tenant applications are not aware of each other's resource usage, and this becomes a much bigger problem when swap is defined, as the OS terminates (technically "sacrifices") one of the processes based on its age and the amount of resources freed up by the sacrifice. So it becomes critical to organize the applications in a multi-tenant environment. There are multiple things that need to be considered while managing these kinds of environments, namely memory, CPU and disk bottlenecks.

Memory usage: In terms of memory, add up each component's maximum heap allocation (-Xmx) plus additional reservations such as 2 GB for the OS, 2 GB for the DataNode, 2 GB for Ambari Metrics, etc., and for HBase the BucketCache (off-heap) plus the RegionServer heap, and similarly for Accumulo, Storm, etc. After all of that is subtracted from the total memory, the remainder can be allocated to YARN; an example of this is well documented in the HBase cache configuration guidance (see the illustrative calculation below).

CPU usage: This is a bit tricky, as configuring this value upfront may not be straightforward. You need to review the SAR / Ambari Metrics information on CPU usage and allocate the remaining CPU to YARN. At the same time, verify that the load average on the host does not climb too high; in such cases the amount of parallel work coming from the applications/YARN should be controlled according to priority, and this is where the YARN scheduler comes in handy.

Disk usage: Keep a keen eye on CPU IO wait; an increase in that value is caused by disk latency. The better option is not to share the disks for multiple purposes (e.g. DataNode data directories plus other application activity), as that results in requests queuing up. Hope this helps!!
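As a purely illustrative calculation (all the numbers below are assumptions for a 128 GB worker node, not recommendations):
#  total RAM                         128 GB
#  OS reservation                     -2 GB
#  DataNode heap                      -2 GB
#  Ambari Metrics                     -2 GB
#  HBase RegionServer heap           -16 GB
#  HBase BucketCache (off-heap)      -16 GB
#  left for YARN containers           90 GB  ->  yarn.nodemanager.resource.memory-mb = 92160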
10-02-2017
05:46 AM
Hi @Venkatesh Innanji, for the given scenario (and code), I believe the following amendments will help optimize the code; please have a look and revert with any concerns.

Remove the unwanted columns from the initial selection (specify the columns instead of *); you will end up carrying far fewer columns, which makes the shuffle quicker. In general, wherever there is a huge shuffle, that is where most of the time and resources go, and in this case the joined tables get shuffled per EMP_ID, LOC_ID and EMPDET_ID.

Repartitioning the bigger tables on the join columns gives quicker execution times, as it reduces the shuffle data moved between the executors (rows with the same key end up with the same hash, and therefore in the same partition).

Please don't cache a DataFrame unless you reuse it multiple times against the same dimension you are joining; otherwise it only causes extra overhead from the additional reads and writes, and that excessive IO is the primary reason for the long execution duration.

Instead of one big hop, go for multiple: an aggregate function in between ensures the steps execute one after another and effectively reduces the data carried into the next SQL. In this case, join EMP and DEPT, compute the distinct rows (if needed), then join with LOC, and later with SALES (see the sketch below).

The order should compute the smaller data sets first and leave the larger ones to last, as the DataFrame may need to be shuffled every time it is joined with another DataFrame. The main reason for the long execution looks to be the huge IO reads (from the data volumes you specified in the query), so it is better to take one hop at a time; from an execution perspective, the lazy evaluation can be streamlined by keeping aggregate functions between the steps (not recommended for small tables / operations). Hope this helps !!
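To make the ordering concrete, here is a minimal sketch in Spark SQL; the table names (EMP, DEPT, LOC, SALES) come from the thread, but every column name is an assumption, and the equivalent DataFrame calls (select only the needed columns, repartition on the join key, aggregate between hops) follow the same shape:
-- step 1: join the smaller tables first and keep only the columns actually needed downstream
CREATE TEMPORARY VIEW emp_dept AS
SELECT DISTINCT e.EMP_ID, e.LOC_ID, d.DEPT_NAME
FROM EMP e JOIN DEPT d ON e.DEPT_ID = d.DEPT_ID;

-- step 2: add LOC, still carrying only the required columns
CREATE TEMPORARY VIEW emp_dept_loc AS
SELECT ed.EMP_ID, ed.DEPT_NAME, l.LOC_NAME
FROM emp_dept ed JOIN LOC l ON ed.LOC_ID = l.LOC_ID;

-- step 3: join the largest table (SALES) last and aggregate straight away,
-- so only the reduced result is carried forward
SELECT edl.DEPT_NAME, edl.LOC_NAME, SUM(s.SALE_AMOUNT) AS total_sales
FROM SALES s JOIN emp_dept_loc edl ON s.EMP_ID = edl.EMP_ID
GROUP BY edl.DEPT_NAME, edl.LOC_NAME;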
10-02-2017
02:14 AM
Hi @Rohit Khose, this usually occurs when the Spark code is not formatted properly (braces or quotes are not closed, or, in the case of Python, the indentation is broken), which results in this exception. Could you please reformat your code and run it again (best is to use a text editor with syntax highlighting, or to submit the code block by block). On the other hand, this may also occur due to network failures during a data shuffle between executors (I doubt this is the case here); in such cases resubmitting the job should lead to successful completion (which I presume is not happening in your case).
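If the job happens to be PySpark, one cheap check before resubmitting is to compile the script locally; the file name below is just a placeholder:
python -m py_compile your_spark_job.py   # reports a SyntaxError with the line number if a quote, bracket or indent is left open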
10-02-2017
02:04 AM
Hi @raouia, can you please look at the YARN application log (application_XXX_XXX), which will have detailed information about the root cause of the failure. On the other hand, from the symptom it looks like something to do with the library configuration; can you please ensure that, when you add the additional node, all the Spark libraries are available on that node too. Hope this helps !!
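For reference, once you know the application id, the aggregated log can be pulled in one go (the id below is just the placeholder from above):
yarn logs -applicationId application_XXX_XXX > app.log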
09-28-2017
08:33 AM
Hi @Mateusz Grabowski, the quick solution is to use the Apache web server to redirect the requests to https (documentation with examples can be found here); this approach is external to Ambari, so no configuration changes are needed at the Ambari level. However, there may be an inherent way to handle this with Jetty by modifying the Ambari configuration; a similar discussion for NiFi can be found in the following thread: https://community.hortonworks.com/questions/63171/nifi-admin-question-redirect-http-to-https.html Hope this helps !!
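A minimal sketch of the Apache approach, assuming httpd with mod_alias on the Ambari host, Ambari listening for HTTPS on 8443, and the RHEL/CentOS configuration path; adjust the names to your environment:
cat > /etc/httpd/conf.d/ambari-redirect.conf <<'EOF'
<VirtualHost *:80>
    ServerName ambari.example.com
    Redirect permanent / https://ambari.example.com:8443/
</VirtualHost>
EOF
systemctl restart httpd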
09-28-2017
02:51 AM
1 Kudo
Hi @Will Dailey, it is possible to configure a secondary LDAP server to handle the AD failure scenario. --ldap-secondary-url is the (optional) parameter that takes the additional AD host in the ldap-setup command-line arguments:
--ldap-url=LDAP_URL                          Primary URL for LDAP
--ldap-secondary-url=LDAP_SECONDARY_URL      Secondary URL for LDAP
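For example (the host names and ports below are placeholders), both URLs can be supplied in the same setup run:
ambari-server setup-ldap \
  --ldap-url=ad1.example.com:389 \
  --ldap-secondary-url=ad2.example.com:389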
09-27-2017
05:41 AM
Hi @Br Hmedna, a couple of things. From the screenshot, your data is in ORC format; to import and export it, use the HCatalog integration. The syntax for the export would be:
sqoop export --connect jdbc:mysql://127.0.0.1/mooc2015 --username <mysql_User> --password <mysql_pwd> --table Act_Grade --hcatalog-table <hive_table> --hcatalog-database <hive_database_name>
Please note that the --export-dir option is not supported with the HCatalog integration, so it is better to use the above syntax. On a side note, for debugging: the error you have provided is not the application log, only the high-level Sqoop log. The YARN app (task) log can be found for the job id (job_1506422992590_0008), and that is where you can find the reason for the failure. Hope this helps !!
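For instance, the aggregated container log can usually be fetched with the application id that carries the same numeric suffix as the job id:
yarn logs -applicationId application_1506422992590_0008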
09-26-2017
03:10 AM
Hi @Alvin Jin, it looks like you have not substituted the token in place of $token; can you please place the token as a string and test?
curl -k -X GET 'https://<nifi-server>:9091/nifi-api/cluster/summary' -H 'Authorization: Bearer <token which is generated from the above command>' --compressed
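As an illustration (host, port and credentials are placeholders, and this assumes username/password login is enabled on the NiFi instance), capturing the token into a shell variable avoids the substitution mistake:
# request a token from NiFi; the response body is the bearer token itself
TOKEN=$(curl -k -X POST 'https://<nifi-server>:9091/nifi-api/access/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'username=<user>&password=<password>')
# reuse it in the Authorization header
curl -k -X GET 'https://<nifi-server>:9091/nifi-api/cluster/summary' \
  -H "Authorization: Bearer $TOKEN" --compressed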