Member since: 09-24-2015
Posts: 178
Kudos Received: 113
Solutions: 28
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3376 | 05-25-2016 02:39 AM |
| | 3591 | 05-03-2016 01:27 PM |
| | 839 | 04-26-2016 07:59 PM |
| | 14395 | 03-24-2016 04:10 PM |
| | 2020 | 02-02-2016 11:50 PM |
12-11-2015
11:21 PM
I see 'Connection Refused', which means either a service is down or you are connecting to the wrong port. As Deepesh said, it appears to be the former and the History Server is down.
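If you want to confirm which case it is, here is a quick check (a minimal sketch; it assumes the default MapReduce JobHistory Server ports, 10020 for IPC and 19888 for the web UI, and a hypothetical host name - adjust both to your configuration):

```bash
# On the host that should run the JobHistory Server: is anything listening on its ports?
netstat -tlnp | grep -E ':10020|:19888'

# From a client machine: probe the History Server REST endpoint.
# "Connection refused" here confirms the service is down or bound to a different port.
curl -sf http://historyserver.example.com:19888/ws/v1/history/info && echo "History Server is up"
```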
12-11-2015
07:42 PM
3 Kudos
@Matthew bird You need a home directory for the user in HDFS, so here is what is needed (log in as root to the sandbox, then switch to the hdfs superuser):
su - hdfs
hdfs dfs -mkdir /user/root
hdfs dfs -chown root:hadoop /user/root
hdfs dfs -chmod 755 /user/root
Try to run the Pig script after you've done the above steps.
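To double-check the result before re-running the job, something like this works (a minimal sketch; the Pig script name is a placeholder):

```bash
# Confirm the home directory exists with the expected owner and permissions
hdfs dfs -ls /user | grep root

# Back as the root user, the job now has a home directory to work in
# ("your_script.pig" is a hypothetical script name)
pig -x mapreduce your_script.pig
```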
12-11-2015
06:45 PM
1 Kudo
@Amit Jain Atlas has a ton of exciting features on the roadmap, and there are definitely plans for two-way metadata exchange with other metadata management tools. As of right now (and this may change), the plan is to exchange lineage information with other tools as well, to provide end-to-end lineage of data from the source system all the way to the final destination.

With that said, it seems very unlikely that in a large enterprise setting you would replace all other metadata tools with one magical tool. Typically, governance tools are expected to tap into data processes automatically and non-intrusively to gather lineage information, and this requires native hooks into those processes. Atlas has, and will continue to expand, native hooks for processing that takes place in a Hadoop cluster, but I doubt there is any interest in tapping natively into processes running in other systems such as data warehousing, transactional, operational, and reporting systems. For those pieces (metadata and lineage) from external systems, Atlas will continue to rely on and integrate with other metadata tools.

Just like Hadoop, the other components in an overall data architecture have their roles and place, so they will continue to exist, and so will the governance tools for those components. Vendors need to, and most likely will, work together to provide a seamless experience to customers. If you haven't watched this presentation from Andrew Ahn, PM for Governance Tools at HWX, I would highly recommend it to better understand where Atlas is going - https://www.youtube.com/watch?time_continue=3&v=LZ...

Hope this helps. Let me know if you have any follow-up questions.
12-11-2015
02:34 PM
1 Kudo
There are a few solutions (a sketch of both is below):
1. The easy solution - grant permission on the files to the root user. In this case, the file itself appears to have wide-open permissions, but because it sits under another user's home directory, the root user may not have access to the guest home directory. So check the permissions on /user/guest and adjust if needed.
2. Use the correct user for the job - I like to create a service ID for data processing rather than use local superusers (root) or HDFS superusers (hdfs). You can use users like guest or the built-in test user ambari-qa. The user is identified based on their local OS identity, so switch to guest before running the process.
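A minimal sketch of both options (assumes the file lives under /user/guest and that you have root access on the node; adjust paths, groups, and permissions to your environment):

```bash
# Option 1: open up the guest home directory so other users can traverse and read it
su - hdfs -c "hdfs dfs -ls -d /user/guest"       # check the current owner and permissions
su - hdfs -c "hdfs dfs -chmod 755 /user/guest"   # let other users list and read the directory

# Option 2: run the job as the owning user instead of root or hdfs
su - guest
hdfs dfs -ls /user/guest                         # verify access as guest, then launch the job as this user
```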
12-10-2015
02:39 AM
@Hajime - The best way to find the NodeManager heap size and other memory settings is to calculate them specifically for your cluster size and hardware spec. Here is the utility that you can use - http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-...

Usage: hdp-configuration-utils.sh options, where the options are as follows:

Table 1.1. hdp-configuration-utils.sh Options

| Option | Description |
|---|---|
| -c CORES | The number of cores on each host. |
| -m MEMORY | The amount of memory on each host, in GB. |
| -d DISKS | The number of disks on each host. |
| -k HBASE | "True" if HBase is installed, "False" if not. |

The output recommendation is in this format -

Using cores=16 memory=64GB disks=4 hbase=True
Profile: cores=16 memory=49152MB reserved=16GB usableMem=48GB disks=4
Num Container=8
Container Ram=6144MB
Used Ram=48GB
Unused Ram=16GB
yarn.scheduler.minimum-allocation-mb=6144
yarn.scheduler.maximum-allocation-mb=49152
yarn.nodemanager.resource.memory-mb=49152
mapreduce.map.memory.mb=6144
mapreduce.map.java.opts=-Xmx4096m
mapreduce.reduce.memory.mb=6144
mapreduce.reduce.java.opts=-Xmx4096m
yarn.app.mapreduce.am.resource.mb=6144
yarn.app.mapreduce.am.command-opts=-Xmx4096m
mapreduce.task.io.sort.mb=1792
tez.am.resource.memory.mb=6144
tez.am.launch.cmd-opts=-Xmx4096m
hive.tez.container.size=6144
hive.tez.java.opts=-Xmx4096m
hive.auto.convert.join.noconditionaltask.size=1342177000
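For reference, the invocation that would produce the recommendation above looks like the following (options per Table 1.1; the script name comes from the snippet above, so verify it against the linked documentation for your HDP version):

```bash
# 16 cores, 64 GB of memory, 4 data disks, HBase installed
hdp-configuration-utils.sh -c 16 -m 64 -d 4 -k True
```

The values it prints map directly onto the YARN, MapReduce, Tez, and Hive properties listed above, which you can then set through Ambari.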
12-04-2015
08:16 PM
@Neeraj Sabharwal It's not the same error. The exception stack trace pasted by the OP originates in Atlas (org.apache.atlas.web.filters.AuditFilter.doFilter), whereas the one in the JIRA is within Hadoop. Same exception class, different applications.
12-04-2015
06:23 PM
Looking at the ExecuteSQL code here, the capability description reads:

@CapabilityDescription("Execute provided SQL select query. Query result will be converted to Avro format." + " Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on " + "a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. " + "If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the " + "select query. " + "FlowFile attribute 'executesql.row.count' indicates how many rows were selected.")

Even though the description says "Streaming is used so arbitrarily large result sets are supported", it appears this is not referring to JDBC streaming, but to the fact that the ResultSet is broken down into smaller tuples and sent to the next processor as a stream. The code backing that assessment: the query execution in ExecuteSQL calls JDBCCommon.convertToAvroStream, and the convertToAvroStream method reads data using the getObject method. That getObject path does not use streaming alternatives (getAsciiStream, etc.) as described here - https://docs.oracle.com/cd/B28359_01/java.111/b312...
12-04-2015
06:08 PM
Can you help me understand the scenario where this is needed? So the Hive shell is started, but you want it to wait until a query is executed before creating the AM.. does this mean there are situations where the Hive shell is started and then exited without ever executing a query? Wouldn't that be an exceptional scenario, or is it so frequent / regular in your case that a workaround is required? I am sorry, just trying to understand when such a configuration would be needed.
12-04-2015
04:57 AM
This should be updated / corrected then?

Partitioning Recommendations for Slave Nodes

Hadoop Slave node partitions: Hadoop should have its own partitions for Hadoop files and logs. Drives should be partitioned using ext3, ext4, or XFS, in that order of preference. HDFS on ext3 has been publicly tested on the Yahoo cluster, which makes it the safest choice for the underlying file system. The ext4 file system may have potential data loss issues with default options because of the "delayed writes" feature. XFS reportedly also has some data loss issues upon power failure. Do not use LVM; it adds latency and causes a bottleneck.

Source: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_cluster-planning-guide/content/ch_partitioning_chapter.html

A lot of this conflicts with reality (Paul's SmartSense statistics) and with what we are all discussing here.
12-04-2015
01:09 AM
My response to your comment was longer than what's allowed for comments, so I'm adding it as a new answer.