Member since: 09-18-2015
Posts: 3274
Kudos Received: 1159
Solutions: 426
02-09-2016
10:11 PM
H100 Unable to submit statement show databases like '*': org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe. The Ambari view throws the following error: H100 Unable to submit statement show databases like '*': org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
And I can no longer view databases or run queries in the Ambari view. The problem appears to resolve itself after some time, but then reappears after running a sequence of queries in the Hive view. 1) This should be fixed in Ambari 2.2. 2) A browser refresh fixed it: when the connection had been open for too long, there was a broken pipe when you tried to access the Hive view. Refreshing opened a fresh connection and the view was able to connect again.
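Before restarting anything, it can help to confirm that HiveServer2 itself is reachable. A minimal sketch, assuming a hypothetical host name and the default HiveServer2 port (10000); adjust both for your cluster:

```shell
# Hypothetical host; the default HiveServer2 port is 10000.
HS2_HOST="sandbox.hortonworks.com"
HS2_PORT=10000

# Guard: only probe if nc is available, so the sketch is safe to run anywhere.
if command -v nc >/dev/null 2>&1; then
  if nc -z -w 5 "$HS2_HOST" "$HS2_PORT"; then
    echo "HiveServer2 port is reachable"
  else
    echo "HiveServer2 port is NOT reachable - check HS2 or the network"
  fi
else
  echo "nc not installed; skipping port probe"
fi

# With beeline available, a one-line smoke test of the same statement the view runs:
# beeline -u "jdbc:hive2://$HS2_HOST:$HS2_PORT" -e "show databases like '*';"
```

If the port probe fails while the Hive view reports Broken pipe, the problem is the HiveServer2 connection itself, not the view.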
02-08-2016
02:25 AM
Original thread https://community.hortonworks.com/questions/4024/how-many-files-is-too-many-on-a-modern-hdp-cluster.html
02-08-2016
02:25 AM
3 Kudos
I've seen several systems with 400+ million objects represented in the Namenode without issues. In my opinion, that's not the "right" question though. Certainly, the classic answer to small files has been the pressure it puts on the Namenode, but that's only part of the equation. And with hardware/CPU improvements and increased memory thresholds, that number has certainly climbed over the years since the small file problem was first documented. The better question is: how do small files "impact" cluster performance? Everything is a trade-off when dealing with data at scale. The impact of small files, beyond the Namenode pressure, is more specifically related to "job" performance. Under classic MR, the number of small files controls the number of mappers required to perform a job. Of course, there are tricks to "combine" inputs and reduce this, but that leads to a lot of data moving across the backplane and increased cluster I/O chatter. A mapper, in the classic sense, is a costly resource to allocate. If the actual task done by the mapper is rather mundane, most of the time spent accomplishing your job can be "administrative" in nature, with the construction and management of all those resources. Consider the impact on a cluster when this happens. For example, I once had a client that was trying to get more from their cluster, but there was a job processing 80,000 files. Which led to the creation of 80,000 mappers. Which led to consuming ALL the cluster resources, several times over. Follow that path a bit further and you'll find that the impact on the Namenode is exacerbated by all of the intermediate files generated by the mappers for the shuffle/sort phases. That's the real impact on a cluster. A little work in the beginning can have a dramatic effect on the downstream performance of your jobs. Take the time to "refine" your data and consolidate your files. Here's another way to approach it, which is even more evident when dealing with ORC files.
Processing a 1 MB file has an overhead to it. So processing 128 1 MB files will cost you 128 times more "administrative" overhead, versus processing one 128 MB file. In plain text, that 1 MB file may contain 1,000 records; the 128 MB file might contain 128,000 records. And I've typically seen an 85-92% compression ratio with ORC files, so you could safely say that a 128 MB ORC file contains over 1 million records. Sidebar: which may have been why the default stripe size in ORC was changed to 64 MB, instead of 128 MB, a few versions back. The impact is multi-fold. With data locality, you move less data, process larger chunks of data at a time, generate fewer intermediate files, reduce the impact on the Namenode and increase throughput overall, EVERYWHERE. The system moves away from being I/O bound to being CPU bound. Now you have the opportunity to tune container sizes to match "what" you're doing, because the container is actually "doing" a lot of work processing your data, not "managing" the job. Sometimes small files can't be avoided, but deal with them early, to limit the repetitive impact on your cluster. Here's a list of general patterns to reduce the number of small files:
NiFi - Use a combine processor to consolidate flows and aggregate data before it even gets to your cluster.
Flume - Use a tiered Flume architecture to combine events from multiple inputs, producing "right"-sized HDFS files for further refinement.
Hive - Process the small files regularly and often to produce larger files for "repetitive" processing. And in a classic pattern that incrementally "appends" to a dataset, creating a LOT of files over time, don't be afraid to go back and "reprocess" the file set again to streamline the impact on downstream tasks.
Sqoop - Manage the number of mappers to generate appropriately sized files.
Oh, and if you NEED to keep those small files as "sources"...
Archive them using Hadoop Archives ('har') and save your Namenode the cost of managing those resource objects. Credit: @David Streever. Original thread https://community.hortonworks.com/questions/4024/how-many-files-is-too-many-on-a-modern-hdp-cluster.html
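The "administrative" overhead argument above can be sketched with some back-of-the-envelope arithmetic. The per-mapper cost below is an assumed illustrative number, not a measurement, and the hadoop archive command at the end shows the HAR approach mentioned above with hypothetical paths:

```shell
# Back-of-the-envelope cost of the small-file layout described above.
# TASK_OVERHEAD_MS is an assumed per-mapper startup/teardown cost.
FILE_COUNT=128              # 128 x 1 MB files
TASK_OVERHEAD_MS=2000

# Classic MR: one mapper per small file, versus one mapper for one 128 MB file.
SMALL_FILE_OVERHEAD_MS=$((FILE_COUNT * TASK_OVERHEAD_MS))
LARGE_FILE_OVERHEAD_MS=$TASK_OVERHEAD_MS

echo "small-file overhead: ${SMALL_FILE_OVERHEAD_MS} ms"
echo "large-file overhead: ${LARGE_FILE_OVERHEAD_MS} ms"
echo "ratio: $((SMALL_FILE_OVERHEAD_MS / LARGE_FILE_OVERHEAD_MS))x"

# Archiving small files you must keep, as suggested above (requires a cluster;
# paths are hypothetical):
# hadoop archive -archiveName logs.har -p /data/raw/logs /data/archive
```

However you tune the assumed cost, the ratio stays at the file count: the overhead scales with the number of mappers, not the volume of data.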
02-07-2016
01:54 PM
2 Kudos
Customer: Is Hadoop enterprise ready?
Me: Standing next to the whiteboard: Yes, and that's why we use the term "Enterprise Ready Data Lake". Imagine that there are 3 points.
Point 1 -> You need to prove your identity to get access to the lake, and then need permissions or authority to access data.
Point 2 -> Once you have proved your authenticity, the demand comes to manage the lifecycle of data from its requirement to retirement as an "automated process".
Point 3 -> The lifecycle management process needs to be integrated with a governance solution to manage data about data ("metadata"), data lineage, auditing and more, to fulfill security and compliance requirements.
Point 1 --> Entry point: You must have strong authentication in place to get into the system, and more users will be coming in to access data as we move away from silos of data to a centralized repository. Access management must be easy to manage, i.e. the security solution should have a centralized place to admin (create, define and manage) security policies. Once users get in and have access, we need to track their actions, and that's auditing. Last, data encryption in motion and at rest.
Point 2 --> Security is in place and now we know that data ingestion is occurring with full security. Now the business wants to manage the lifecycle of data in one common place: data replication, retention, handling late-data-arrival rules, data mirroring, and visualizing the complete data pipeline.
Point 3 --> Once data lifecycle management is in place, we will be generating more data about data ("metadata"), and there is existing legacy metadata that needs to be exchanged with the Hadoop system. This generates the requirement for a data governance solution, which should provide complete data lineage, exchange, and search functionality.
Customer: Yes, this is exactly what we are looking for. All this must be well integrated, and please provide this as a 100% open source but enterprise-ready solution.
Solution: Security + Data Lifecycle Management + Data Governance. Happy Hadooping!!! Kerberos is a must in production.
02-06-2016
08:59 PM
Problem:
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
    result = _call(command, **kwargs_copy)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
    raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'yarn resourcemanager -format-state-store' returned 255.
15/10/26 16:11:16 INFO resourcemanager.ResourceManager: STARTUP_MSG:
15/10/26 16:11:17 INFO recovery.ZKRMStateStore: org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread thread interrupted! Exiting!
15/10/26 16:11:17 INFO zookeeper.ZooKeeper: Session: 0x150a4b3429b0002 closed
15/10/26 16:11:17 FATAL resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /rmstore/ZKRMStateRoot/RMAppRoot
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.recursiveDeleteWithRetriesHelper(ZKRMStateStore.java:1049)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.recursiveDeleteWithRetriesHelper(ZKRMStateStore.java:1045)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.access$500(ZKRMStateStore.java:89)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$10.run(ZKRMStateStore.java:1032)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$10.run(ZKRMStateStore.java:1029)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1104)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1125)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.deleteWithRetries(ZKRMStateStore.java:1029)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.deleteStore(ZKRMStateStore.java:825)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.deleteRMStateStore(ResourceManager.java:1267)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1190)
15/10/26 16:11:17 INFO zookeeper.ClientCnxn: EventThread shut down
15/10/26 16:11:17 INFO resourcemanager.ResourceManager: SHUTDOWN_MSG:
Solution: Error details:
FATAL resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /rmstore/ZKRMStateRoot/RMAppRoot
In my case, I had all the application data sitting under that particular location:
[zk: localhost:2181(CONNECTED) 2] ls /rmstore/ZKRMStateRoot/RMAppRoot
[application_1445593412630_0002, application_1445593412630_0001, application_1445366030467_0002, application_1445366030467_0001, application_1445366030467_0004, application_1445366030467_0003, application_1445593412630_0006, application_1445366030467_0005, application_1445593412630_0005, application_1445593412630_0004, application_1445593412630_0003, application_1445173693339_0006, application_1445173693339_0005, application_1445173693339_0004, application_1445173693339_0003, application_1445173693339_0002, application_1445173693339_0001, application_1445394313024_0004, application_1445394313024_0003, application_1445394313024_0002, application_1445394313024_0001, application_1445394313024_0008, application_1445394313024_0007, application_1445394313024_0006, application_1445394313024_0005]
[zk: localhost:2181(CONNECTED) 3] rmr /rmstore/ZKRMStateRoot/RMAppRoot
[zk: localhost:2181(CONNECTED) 4] ls /rmstore/ZKRMStateRoot/RMAppRoot
Node does not exist: /rmstore/ZKRMStateRoot/RMAppRoot
Restart YARN and I got the location back:
[zk: localhost:2181(CONNECTED) 6] ls /rmstore/ZKRMStateRoot/RMAppRoot
[]
[zk: localhost:2181(CONNECTED) 7] ls /rmstore/ZKRMStateRoot
[AMRMTokenSecretManagerRoot, RMAppRoot, EpochNode, RMDTSecretManagerRoot, RMVersionNode]
You can try this, but if you are not sure, or it's production, open a support ticket. "Consult support before doing this in production."
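The zkCli session above can be condensed into a small guarded script. This is a sketch only: the znode path comes from the error above, the ensemble address is assumed to be localhost:2181, and, as the post says, consult support before doing this in production.

```shell
# Znode path taken from the NotEmptyException above.
ZNODE="/rmstore/ZKRMStateRoot/RMAppRoot"
ZK_SERVER="localhost:2181"   # assumed ensemble address; adjust for your cluster

# Guard: only act if the ZooKeeper CLI is on the PATH.
if command -v zkCli.sh >/dev/null 2>&1; then
  # Inspect the stale application znodes first, then remove them so the
  # ResourceManager can format its state store and start cleanly.
  zkCli.sh -server "$ZK_SERVER" ls "$ZNODE"
  zkCli.sh -server "$ZK_SERVER" rmr "$ZNODE"
else
  echo "zkCli.sh not on PATH; run this on a ZooKeeper client host"
fi
```

After the removal, restarting YARN recreates an empty RMAppRoot, as shown in the transcript above.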
02-06-2016
08:46 PM
More information https://code.facebook.com/posts/938595492830104/osquery-introducing-query-packs/
02-06-2016
06:11 PM
10 Kudos
OLAP (Online Analytical Processing) is the technology behind many Business Intelligence (BI) applications. OLAP is a powerful technology for data discovery, including capabilities for limitless report viewing, complex analytical calculations, and predictive "what if" scenario (budget, forecast) planning. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling. It is the foundation for many kinds of business applications for Business Performance Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables end-users to perform ad hoc analysis of data in multiple dimensions, thereby providing the insight and understanding they need for better decision making. Source
OLAP solutions
Open source
Apache Kylin http://kylin.apache.org/ Apache Kylin™ is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets; originally contributed by eBay Inc.
- Extremely Fast OLAP Engine at Scale: Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data
- ANSI SQL Interface on Hadoop: Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
- Interactive Query Capability: Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset
- MOLAP Cube: Users can define a data model and pre-build it in Kylin with more than 10 billion raw data records
- Seamless Integration with BI Tools: Kylin currently offers integration with BI tools like Tableau. Integration with MicroStrategy and Excel is coming soon
- Other Highlights:
- Job Management and Monitoring
- Compression and Encoding Support
- Incremental Refresh of Cubes
- Leverage HBase Coprocessors to reduce query latency
- Approximate Query Capability for distinct Count (HyperLogLog)
- Easy Web interface to manage, build, monitor and query cubes
- Security capability to set ACL at Cube/Project Level
- Support LDAP Integration
Druid http://druid.io/druid.html Druid is an open source data store designed for OLAP queries on event data. This page is meant to provide readers with a high-level overview of how Druid stores data, and the architecture of a Druid cluster. This data set is composed of three distinct components. If you are acquainted with OLAP terminology, the following concepts should be familiar.
Timestamp column: We treat timestamp separately because all of our queries center around the time axis.
Dimension columns: Dimensions are string attributes of an event, and the columns most commonly used in filtering the data. We have four dimensions in our example data set: publisher, advertiser, gender, and country. They each represent an axis of the data that we've chosen to slice across.
Metric columns: Metrics are columns used in aggregations and computations. In our example, the metrics are clicks and price. Metrics are usually numeric values, and computations include operations such as count, sum, and mean. Also known as measures in standard OLAP terminology.
Commercial
AtScale http://www.atscale.com/ AtScale turns your Hadoop cluster into a scale-out OLAP server. Now you can use your BI tool of choice, from Tableau to MicroStrategy to Microsoft Excel, to connect to and query data in Hadoop, with no extra layers in between.
- Dynamic, virtual cubes present complex data as simple measures and dimensions
- Support for virtually any BI tool that can talk SQL or MDX
- Analyze billions of rows of data directly on your Hadoop cluster
- Eliminate the need for costly data marts, extracts, and custom cubes
- Consistent metric definitions across all users, regardless of BI tool
Kyvos Insights http://www.kyvosinsights.com/solution The cubes Kyvos can build and run on Hadoop are orders of magnitude bigger than what could be built on traditional OLAP gear. Instead of getting rid of the granular level of detail that would ordinarily be summarized or aggregated in a traditional OLAP setup, Kyvos can build a specific dimension for each column or field, whether it's an individual customer or an individual SKU (stock keeping unit). Source
Cloud option
With Altiscale Data Cloud, the AtScale Intelligence Platform runs on top of enterprise-grade Hadoop in the cloud, reducing time to value, lowering costs and eliminating implementation risk. Since Altiscale runs a complete Hadoop ecosystem for its customers, it also eliminates one of Hadoop's greatest challenges: ongoing operational risk. This allows customers to focus on their business goals without losing time and effort to the ongoing burden of Hadoop management. Source
02-06-2016
04:04 AM
1 Kudo
osquery allows you to easily ask questions about your Linux and OSX infrastructure. Whether your goal is intrusion detection, infrastructure reliability, or compliance, osquery gives you the ability to empower and inform a broad set of organizations within your company.
Download https://osquery.readthedocs.org/en/latest/installation/install-linux/
[root@phdns02 ~]# sudo rpm -ivh https://osquery-packages.s3.amazonaws.com/centos6/noarch/osquery-s3-centos6-repo-1-0.0.noarch.rpm
Retrieving https://osquery-packages.s3.amazonaws.com/centos6/noarch/osquery-s3-centos6-repo-1-0.0.noarch.rpm
warning: /var/tmp/rpm-tmp.rCrgXh: Header V4 RSA/SHA1 Signature, key ID c9d8b80b: NOKEY
Preparing... ########################################### [100%]
1:osquery-s3-centos6-repo########################################### [100%]
[root@phdns02 ~]# yum install osquery
Loaded plugins: fastestmirror
Setting up Install Process
Loading mirror speeds from cached hostfile
epel/metalink | 12 kB 00:00
* base: mirrors.abcd.net
* epel: mirror.sfo12.us.xyz.net
* extras: repos.lmnopq.com
* updates: mirror.xxxx.org
HDP-2.3 | 2.9 kB 00:00
HDP-UTILS-1.1.0.20 | 2.9 kB 00:00
Updates-ambari-2.2.0.0 | 2.9 kB 00:00
base | 3.7 kB 00:00
dockerrepo | 2.9 kB 00:00
epel | 4.3 kB 00:00
epel/primary_db | 5.8 MB 00:00
epel-apache-maven | 2.4 kB 00:00
extras | 3.4 kB 00:00
osquery-s3-centos6-repo | 3.3 kB 00:00
osquery-s3-centos6-repo/primary_db | 11 kB 00:00
updates | 3.4 kB 00:00
updates/primary_db | 3.3 MB 00:00
Resolving Dependencies
--> Running transaction check
---> Package osquery.x86_64 0:1.7.0_4_g08ca034-1.el6 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
==========================================================================================================================================================================================================
Package Arch Version Repository Size
==========================================================================================================================================================================================================
Installing:
osquery x86_64 1.7.0_4_g08ca034-1.el6 osquery-s3-centos6-repo 5.5 M
Transaction Summary
==========================================================================================================================================================================================================
Install 1 Package(s)
Total download size: 5.5 M
Installed size: 16 M
Is this ok [y/N]: y
Downloading Packages:
osquery-1.7.0-4-g08ca034.rpm | 5.5 MB 00:01
warning: rpmts_HdrFromFdno: Header V4 RSA/SHA1 Signature, key ID c9d8b80b: NOKEY
Retrieving key from file:///etc/pki/rpm-gpg/OSQUERY-S3-RPM-REPO-GPGKEY
Importing GPG key 0xC9D8B80B:
Userid : osquery (osquery) <osquery@fb.com>
Package: osquery-s3-centos6-repo-1-0.0.noarch (installed)
From : /etc/pki/rpm-gpg/OSQUERY-S3-RPM-REPO-GPGKEY
Is this ok [y/N]: y
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Warning: RPMDB altered outside of yum.
Installing : osquery-1.7.0_4_g08ca034-1.el6.x86_64 1/1
Verifying : osquery-1.7.0_4_g08ca034-1.el6.x86_64 1/1
Installed:
osquery.x86_64 0:1.7.0_4_g08ca034-1.el6
Complete!
Launch osquery shell
[root@phdns02 ~]# osqueryi
osquery> SELECT name, path, pid FROM processes where name= "java";
osquery> .help
Welcome to the osquery shell. Please explore your OS!
You are connected to a transient 'in-memory' virtual database.
.all [TABLE] Select all from a table
.bail ON|OFF Stop after hitting an error; default OFF
.echo ON|OFF Turn command echo on or off
.exit Exit this program
.header(s) ON|OFF Turn display of headers on or off
.help Show this message
.mode MODE Set output mode where MODE is one of:
csv Comma-separated values
column Left-aligned columns. (See .width)
line One value per line
list Values delimited by .separator string
pretty Pretty printed SQL results
.nullvalue STR Use STRING in place of NULL values
.print STR... Print literal STRING
.quit Exit this program
.schema [TABLE] Show the CREATE statements
.separator STR Change separator used by output mode and .import
.show Show the current values for various settings
.tables [TABLE] List names of tables
.trace FILE|off Output each SQL statement as it is run
.width [NUM1]+ Set column widths for "column" mode
.timer ON|OFF Turn the CPU timer measurement on or off
osquery>
osquery> .tables
=> acpi_tables
=> arp_cache
=> authorized_keys
=> block_devices
=> chrome_extensions
=> cpuid
=> crontab
=> device_file
=> device_hash
=> device_partitions
=> disk_encryption
=> etc_hosts
=> etc_protocols
=> etc_services
=> file
=> file_events
=> firefox_addons
=> groups
=> hardware_events
=> hash
=> interface_addresses
=> interface_details
=> iptables
=> kernel_info
=> kernel_integrity
=> kernel_modules
=> known_hosts
=> last
=> listening_ports
=> logged_in_users
=> magic
=> memory_map
=> mounts
=> msr
=> opera_extensions
=> os_version
=> osquery_events
=> osquery_extensions
=> osquery_flags
=> osquery_info
=> osquery_packs
=> osquery_registry
=> osquery_schedule
=> pci_devices
=> platform_info
=> process_envs
=> process_events
=> process_memory_map
=> process_open_files
=> process_open_sockets
=> processes
=> routes
=> rpm_package_files
=> rpm_packages
=> shared_memory
=> shell_history
=> smbios_tables
=> socket_events
=> suid_bin
=> system_controls
=> system_info
=> time
=> uptime
=> usb_devices
=> user_events
=> user_groups
=> users
=> yara
=> yara_events
osquery>
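Beyond the interactive shell, osqueryi can also run a single query non-interactively, which is handy for scripting. A small guarded sketch of the same processes query shown above (the --json flag prints results as JSON):

```shell
# Same query as in the interactive session above.
QUERY="SELECT name, path, pid FROM processes WHERE name = 'java';"

# Guard: only run where osquery is actually installed.
if command -v osqueryi >/dev/null 2>&1; then
  # --json emits the result set as a JSON array, easy to pipe into other tools.
  osqueryi --json "$QUERY"
else
  echo "osqueryi not installed; see the install steps above"
fi
```

This one-shot mode is how you would feed osquery results into monitoring or alerting scripts rather than reading them at the shell.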
01-31-2016
04:30 PM
2 Kudos
Original Article
Can I authorize access to Kafka over a non-secure channel via Ranger? Yes, you can control access by IP address.
Can I authorize access to Kafka over a non-secure channel by user/user-groups? No, one can't use user/group-based access to authorize Kafka access over a non-secure channel. This is because it isn't possible to assert the client's identity over the non-secure channel.
Why do we have to specify the public user group on all policy items created for authorizing Kafka access over a non-secure channel?
Kafka can't assert the identity of the client user over a non-secure channel. Thus, Kafka treats all users for such access as an anonymous user (a special user literally named ANONYMOUS). Ranger's public user group is a means to model all users, which, of course, includes this anonymous user (ANONYMOUS). What are the specific things to watch out for when setting up authorization for accessing Kafka over a non-secure channel?
Make sure that all broker IPs have Kafka admin access to all topics, i.e. *.
Make sure no publishers or consumers are running on broker nodes that need access control. Since broker IPs have open access, it isn't possible to control access on those nodes. Please take the time to read the original article.
01-31-2016
02:12 PM
1 Kudo
Tools in use: HBase shell and Zeppelin
User demouser needs access to HBase table called PRICES.
User zeppelin needs the same access to run a few queries.
You can run this demo by using Hortonworks Sandbox