Member since: 01-03-2017
Posts: 181
Kudos Received: 44
Solutions: 24
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1849 | 12-02-2018 11:49 PM
 | 2474 | 04-13-2018 06:41 AM
 | 2043 | 04-06-2018 01:52 AM
 | 2348 | 01-07-2018 09:04 PM
 | 5695 | 12-20-2017 10:58 PM
10-03-2017
01:41 AM
Hi @Mamta Chawla, getting the table schema is easily done with the respective command-line utility (BTEQ, sqlplus, the mysql client, etc.); Sqoop is handy for pulling larger volumes of data from an RDBMS into HDFS. However, if you want to store the schema information in Hadoop as files using Sqoop, you can achieve this by querying the database metadata tables, for instance:
select * from all_tab_columns where owner = '<OWNER_NAME>' and table_name = 'YOUR TABLE NAME HERE';  -- for Oracle
select * from DBC.columns where databasename = '<Database_Name>' and tablename = '<Table_name>';  -- for Teradata
SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'tbl_name' AND table_schema = 'db_name';  -- for MySQL
Hope this helps !!
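For example, a Sqoop free-form query import can land that metadata in HDFS as files. This is only a minimal sketch assuming a MySQL source; the host, credentials, database/table names and target directory are all placeholders:
sqoop import \
  --connect jdbc:mysql://<db_host>/<db_name> \
  --username <mysql_user> --password <mysql_pwd> \
  --query "SELECT column_name, data_type, is_nullable FROM INFORMATION_SCHEMA.COLUMNS WHERE table_schema = 'db_name' AND table_name = 'tbl_name' AND \$CONDITIONS" \
  --target-dir /user/<hdfs_user>/schemas/tbl_name \
  -m 1   # single mapper, so --split-by is not needed for the free-form query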
10-02-2017
08:38 AM
Hi @ilia kheifets, can you please verify that the UID in /etc/passwd and the GIDs for the user "usertest1" are consistent and available across the cluster (in the case of local authentication). Apart from that, you can try yarn-client mode and copy core-site.xml from /etc/hadoop/conf into the Spark conf directory on all the nodes in the cluster. I presume something is wrong with how the libraries are shared between the nodes in the cluster. Hope this helps!!
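For a quick check, something along these lines, assuming the usual HDP layout where the Spark configuration lives in /etc/spark/conf:
# run on every node in the cluster; the UID and GIDs should match everywhere
id usertest1
getent passwd usertest1
# copy the Hadoop client configuration into the Spark conf directory
cp /etc/hadoop/conf/core-site.xml /etc/spark/conf/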
10-02-2017
08:21 AM
1 Kudo
Hi @Joe Harvy, YARN and the other tenant applications are not aware of each other's resource usage, and this becomes a much bigger problem when swap is defined, as the OS terminates (technically "sacrifices") one of the processes based on its age and the amount of resources freed up by the sacrifice. So it becomes critical to organize the applications in a multi-tenant environment. There are multiple things that need to be considered while managing these kinds of environments, namely memory, CPU and disk bottlenecks.

Memory usage: In terms of memory, add up each component's maximum heap allocation (-Xmx) plus additional reservations such as 2 GB for the OS, 2 GB for the DataNode, 2 GB for Ambari Metrics, etc., and for HBase the BucketCache (off-heap) plus the RegionServer heap, and similarly for Accumulo, Storm, etc. After all of that is subtracted from the total memory, the remainder can be allocated to YARN; an example of this is well documented in the HBase cache configuration guidance (see the illustrative calculation below).

CPU usage: This is a bit tricky, as configuring this value upfront may not be straightforward. You need to review the SAR / Ambari Metrics information on CPU usage and allocate the remaining CPU to YARN. At the same time, verify that the load average on the host does not climb too high; in such cases the amount of parallel work coming from the applications/YARN should be controlled according to priority, and this is where the YARN scheduler comes in handy.

Disk usage: Keep a keen eye on CPU IO wait; an increase in that value is caused by disk latency. The better option is not to share the disks for multiple purposes (e.g. DataNode data directories plus other application activity), as that results in requests queuing up. Hope this helps!!
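As a purely illustrative calculation (all the numbers below are assumptions for a 128 GB worker node, not recommendations):
#  total RAM                         128 GB
#  OS reservation                     -2 GB
#  DataNode heap                      -2 GB
#  Ambari Metrics                     -2 GB
#  HBase RegionServer heap           -16 GB
#  HBase BucketCache (off-heap)      -16 GB
#  left for YARN containers           90 GB  ->  yarn.nodemanager.resource.memory-mb = 92160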
10-02-2017
05:46 AM
Hi @Venkatesh Innanji, for the given scenario (and code), I believe the following amendments will help optimize the code; please have a look and revert with any concerns.

Remove the unwanted columns from the initial selection (specify the columns instead of *); you will end up carrying far fewer columns, which makes the shuffle quicker. In general, wherever there is a huge shuffle, that is where most of the time and resources go, and in this case the joined tables get shuffled per EMP_ID, LOC_ID and EMPDET_ID.

Repartitioning the bigger tables on the join columns gives quicker execution times, as it reduces the shuffle data moved between the executors (rows with the same key end up with the same hash, and therefore in the same partition).

Please don't cache a DataFrame unless you reuse it multiple times against the same dimension you are joining; otherwise it only causes extra overhead from the additional reads and writes, and that excessive IO is the primary reason for the long execution duration.

Instead of one big hop, go for multiple: an aggregate function in between ensures the steps execute one after another and effectively reduces the data carried into the next SQL. In this case, join EMP and DEPT, compute the distinct rows (if needed), then join with LOC, and later with SALES (see the sketch below).

The order should compute the smaller data sets first and leave the larger ones to last, as the DataFrame may need to be shuffled every time it is joined with another DataFrame. The main reason for the long execution looks to be the huge IO reads (from the data volumes you specified in the query), so it is better to take one hop at a time; from an execution perspective, the lazy evaluation can be streamlined by keeping aggregate functions between the steps (not recommended for small tables / operations). Hope this helps !!
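To make the ordering concrete, here is a minimal sketch in Spark SQL; the table names (EMP, DEPT, LOC, SALES) come from the thread, but every column name is an assumption, and the equivalent DataFrame calls (select only the needed columns, repartition on the join key, aggregate between hops) follow the same shape:
-- step 1: join the smaller tables first and keep only the columns actually needed downstream
CREATE TEMPORARY VIEW emp_dept AS
SELECT DISTINCT e.EMP_ID, e.LOC_ID, d.DEPT_NAME
FROM EMP e JOIN DEPT d ON e.DEPT_ID = d.DEPT_ID;

-- step 2: add LOC, still carrying only the required columns
CREATE TEMPORARY VIEW emp_dept_loc AS
SELECT ed.EMP_ID, ed.DEPT_NAME, l.LOC_NAME
FROM emp_dept ed JOIN LOC l ON ed.LOC_ID = l.LOC_ID;

-- step 3: join the largest table (SALES) last and aggregate straight away,
-- so only the reduced result is carried forward
SELECT edl.DEPT_NAME, edl.LOC_NAME, SUM(s.SALE_AMOUNT) AS total_sales
FROM SALES s JOIN emp_dept_loc edl ON s.EMP_ID = edl.EMP_ID
GROUP BY edl.DEPT_NAME, edl.LOC_NAME;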
10-02-2017
02:14 AM
Hi @Rohit Khose, this usually occurs when the Spark code is not formatted properly (braces or quotes are not closed, or, in the case of Python, the indentation is broken), which results in this exception. Could you please reformat your code and run it again (best is to use a text editor with syntax highlighting, or to submit the code block by block). On the other hand, this may also occur due to network failures during a data shuffle between executors (I doubt this is the case here); in such cases resubmitting the job should lead to successful completion (which I presume is not happening in your case).
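If the job happens to be PySpark, one cheap check before resubmitting is to compile the script locally; the file name below is just a placeholder:
python -m py_compile your_spark_job.py   # reports a SyntaxError with the line number if a quote, bracket or indent is left open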
10-02-2017
02:04 AM
Hi @raouia, can you please look at the YARN application log (application_XXX_XXX), which will have detailed information about the root cause of the failure. On the other hand, from the symptom it looks like something to do with the library configuration; can you please ensure that, when you add the additional node, all the Spark libraries are available on that node too. Hope this helps !!
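For reference, once you know the application id, the aggregated log can be pulled in one go (the id below is just the placeholder from above):
yarn logs -applicationId application_XXX_XXX > app.log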
09-28-2017
08:33 AM
Hi @Mateusz Grabowski, the quick solution is to use the Apache web server to redirect the requests to https (documentation with examples can be found here); this approach is external to Ambari, so no configuration changes are needed at the Ambari level. However, there may be an inherent way to handle this with Jetty by modifying the Ambari configuration; a similar discussion for NiFi can be found in the following thread: https://community.hortonworks.com/questions/63171/nifi-admin-question-redirect-http-to-https.html Hope this helps !!
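A minimal sketch of the Apache approach, assuming httpd with mod_alias on the Ambari host, Ambari listening for HTTPS on 8443, and the RHEL/CentOS configuration path; adjust the names to your environment:
cat > /etc/httpd/conf.d/ambari-redirect.conf <<'EOF'
<VirtualHost *:80>
    ServerName ambari.example.com
    Redirect permanent / https://ambari.example.com:8443/
</VirtualHost>
EOF
systemctl restart httpd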
09-28-2017
02:51 AM
1 Kudo
Hi @Will Dailey, it is possible to configure a secondary LDAP server to handle the AD failure scenario. --ldap-secondary-url is the (optional) parameter that takes the additional AD host in the ldap-setup command-line arguments:
--ldap-url=LDAP_URL                          Primary URL for LDAP
--ldap-secondary-url=LDAP_SECONDARY_URL      Secondary URL for LDAP
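For example (the host names and ports below are placeholders), both URLs can be supplied in the same setup run:
ambari-server setup-ldap \
  --ldap-url=ad1.example.com:389 \
  --ldap-secondary-url=ad2.example.com:389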
09-27-2017
05:41 AM
Hi @Br Hmedna, a couple of things. From the screenshot, your data is in ORC format; to import and export it, use the HCatalog integration. The syntax for the export would be:
sqoop export --connect jdbc:mysql://127.0.0.1/mooc2015 --username <mysql_User> --password <mysql_pwd> --table Act_Grade --hcatalog-table <hive_table> --hcatalog-database <hive_database_name>
Please note that the --export-dir option is not supported with the HCatalog integration, so it is better to use the above syntax. On a side note, for debugging: the error you have provided is not the application log, only the high-level Sqoop log. The YARN app (task) log can be found for the job id (job_1506422992590_0008), and that is where you can find the reason for the failure. Hope this helps !!
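For instance, the aggregated container log can usually be fetched with the application id that carries the same numeric suffix as the job id:
yarn logs -applicationId application_1506422992590_0008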
09-26-2017
03:10 AM
Hi @Alvin Jin, it looks like you have not substituted the token in place of $token; can you please place the token as a string and test?
curl -k -X GET 'https://<nifi-server>:9091/nifi-api/cluster/summary' -H 'Authorization: Bearer <token which is generated from the above command>' --compressed
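As an illustration (host, port and credentials are placeholders, and this assumes username/password login is enabled on the NiFi instance), capturing the token into a shell variable avoids the substitution mistake:
# request a token from NiFi; the response body is the bearer token itself
TOKEN=$(curl -k -X POST 'https://<nifi-server>:9091/nifi-api/access/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'username=<user>&password=<password>')
# reuse it in the Authorization header
curl -k -X GET 'https://<nifi-server>:9091/nifi-api/cluster/summary' \
  -H "Authorization: Bearer $TOKEN" --compressed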