Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6986 | 09-21-2018 09:54 PM |
| | 8744 | 03-31-2018 03:59 AM |
| | 2624 | 03-31-2018 03:55 AM |
| | 2758 | 03-31-2018 03:31 AM |
| | 6185 | 03-27-2018 03:46 PM |
12-20-2016
08:27 PM
@k v vittal hebbar Either one is fine.
12-20-2016
08:10 PM
@Praveen PentaReddy Thanks for submitting the JIRA ticket. I reviewed Artem's response, and until the enhancement is implemented, his approach remains the best option.
12-20-2016
06:42 PM
3 Kudos
@Suresh Bonam Not out of the box; you would need to build something custom, and CSV is still an option. If your source streams data in real time, Flume is a reasonable option; an alternative is Apache NiFi. Assuming the data is streaming in real time and you are willing to use Flume, the target files stored in HDFS will have a structure similar to the source (no transformation in flight). Apache NiFi can perform some transformation in flight so that the file at the target is easier to consume, e.g. via Hive external tables. You could achieve something similar with Flume, but with coding and pain involved. If your Excel file is static, you should use something else, such as a MapReduce or Spark job.
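For the static-Excel case, here is a minimal sketch of that last suggestion in its simplest form: convert the workbook to CSV and push it to HDFS for a Hive external table to pick up. The file paths, sheet index and the pandas dependency are assumptions for illustration, not part of the original question.

```python
# Hypothetical sketch: convert a static Excel workbook to CSV, then load it into HDFS.
# Paths, sheet index and the pandas dependency are assumptions for illustration.
import subprocess
import pandas as pd

# Read the first sheet of the workbook (requires an Excel engine such as xlrd/openpyxl).
df = pd.read_excel("/tmp/source.xlsx", sheet_name=0)
df.to_csv("/tmp/source.csv", index=False)

# Put the CSV into HDFS so it can back a Hive external table.
subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", "/data/excel_feed"])
subprocess.check_call(["hdfs", "dfs", "-put", "-f", "/tmp/source.csv", "/data/excel_feed/"])
```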
12-20-2016
06:16 PM
@Ritesh jain Does sudo -u hdfs hadoop fs -ls work for you? If it does, then create a home directory for your user in HDFS and make sure that your user is a member of the hadoop or hdfs group. *** If this helped, please vote/accept the answer. For the second question, please create a new question and remove it from the current one. We are trying to build a body of knowledge that is easy to follow and to avoid open-ended questions.
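To make the home-directory step concrete, here is a hedged sketch of the commands involved; the user name and group are placeholders and should be replaced with your own.

```python
# Hypothetical sketch: create an HDFS home directory for a user and hand over ownership.
# The user name and group below are placeholders.
import subprocess

user = "ritesh"
group = "hdfs"

def run(cmd):
    # Echo and run a command, raising if it fails.
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

run(["sudo", "-u", "hdfs", "hadoop", "fs", "-mkdir", "-p", "/user/" + user])
run(["sudo", "-u", "hdfs", "hadoop", "fs", "-chown", user + ":" + group, "/user/" + user])
```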
12-20-2016
06:10 PM
@Ritesh jain Please provide the exact command you executed.
12-20-2016
03:19 PM
1 Kudo
@Sampat Budankayala Of course, you can use a custom UDF, but those are not part of Hive core and performance is not guaranteed, especially for such an expensive operation on a big data set. There is a reason this is not part of core Hive: iterative and recursive problems are not well suited for MapReduce because tasks do not share state or coordinate with each other. If you still want to go down this path, in a few words, you would build the jar and deploy it to the Hive auxiliary libraries folder or to HDFS, then create a permanent or temporary function that you can invoke in your SQL. Follow the steps described here: https://dzone.com/articles/writing-custom-hive-udf-andudaf. Also look at this: https://community.hortonworks.com/articles/39980/creating-a-hive-udf-in-java.html. You would follow similar steps with the code you found, even if it is Scala. I am not aware of a similar implementation in Java, but one probably exists.
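To illustrate the registration step, here is a minimal sketch of adding a UDF jar and creating a temporary function from the Hive CLI; the jar path, class name, function name and table are made-up placeholders.

```python
# Hypothetical sketch: register a custom UDF jar as a temporary Hive function.
# The jar path, class name, function name and table are placeholders.
import subprocess

hql = """
ADD JAR hdfs:///user/hive/udfs/my-recursive-udf.jar;
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyRecursiveUDF';
SELECT my_udf(col1) FROM my_table LIMIT 10;
"""

# Run the statements with the Hive CLI; beeline -u <jdbc-url> -e works similarly.
subprocess.check_call(["hive", "-e", hql])
```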
12-20-2016
03:14 AM
13 Kudos
Pre-requisites

Hortonworks Data Platform 2.5 on CentOS 7.2
Python distribution that comes with HDP 2.5 (Python 2.7.5)

Download and install pip:
#wget https://bootstrap.pypa.io/get-pip.py

Install the add-on package:
#pip install requests

Start the Python CLI (default version):
#python

Import pre-reqs:
>>>import requests
>>>import json
>>>import sys

Environment Variables

Set the Ambari domain variable to the IP address or FQDN of your Ambari node:
>>>AMBARI_DOMAIN = '127.0.0.1'

Set the Ambari port, Ambari user and password variables to match your specifics:
>>>AMBARI_PORT = '8080'
>>>AMBARI_USER_ID = 'admin'
>>>AMBARI_USER_PW = 'admin'

Set the following variable to the IP address or FQDN of your ResourceManager node:
>>>RM_DOMAIN = '127.0.0.1'

Set the Resource Manager port variable:
>>>RM_PORT = '8088'

Ambari REST API Call Examples

Let's find the cluster name, cluster version, stack and stack version:
>>>restAPI = '/api/v1/clusters'
>>>url = "http://"+AMBARI_DOMAIN+":"+AMBARI_PORT+restAPI
>>>r = requests.get(url, auth=(AMBARI_USER_ID, AMBARI_USER_PW))
>>>json_data = json.loads(r.text)
>>>CLUSTER_NAME = json_data["items"][0]["Clusters"]["cluster_name"]
>>>print(CLUSTER_NAME)
>>>CLUSTER_VERSION = json_data["items"][0]["Clusters"]["version"]
>>>print(CLUSTER_VERSION)
>>>STACK = CLUSTER_VERSION.split('-')[0]
>>>print(STACK)
>>>STACK_VERSION = CLUSTER_VERSION.split('-')[1]
>>>print(STACK_VERSION)
>>>CLUSTER_INFO = json_data
>>>print(CLUSTER_INFO)

Let's find the HDP stack repository:
>>>restAPI = "/api/v1/stacks/"+STACK+"/versions/"+STACK_VERSION+"/operating_systems/redhat7/repositories/"+CLUSTER_VERSION
>>>url = "http://"+AMBARI_DOMAIN+":"+AMBARI_PORT+restAPI
>>>r = requests.get(url, auth=(AMBARI_USER_ID, AMBARI_USER_PW))
>>>json_data = json.loads(r.text)
>>>print(json_data)
>>>REPOSITORY_NAME = json_data["Repositories"]["latest_base_url"]
>>>print(REPOSITORY_NAME)

A more elegant approach is to create utility functions. See my repo: https://github.com/cstanca1/HDP-restAPI/. The restAPIFunctions.py script in that repo defines a number of useful functions that I have collected over time. Run restAPIFunctions.py, and the example presented above can be implemented with a single call that returns CLUSTER_NAME, CLUSTER_VERSION and CLUSTER_INFO using the getClusterVersionAndName() function:
>>>CLUSTER_NAME,CLUSTER_VERSION,CLUSTER_INFO = getClusterVersionAndName()
>>>print(CLUSTER_NAME)
>>>print(CLUSTER_VERSION)
>>>print(CLUSTER_INFO)

Resource Manager REST API Call Examples

>>>RM_INFO=getResourceManagerInfo()
>>>RM_SCHEDULER_INFO=getRMschedulerInfo()
>>>print(RM_INFO)
>>>print(RM_SCHEDULER_INFO)

Other Functions

These are other functions included in the restAPIFunctions.py script:
getServiceActualConfigurations()
getClusterRepository()
getAmbariHosts()
getResourceManagerInfo()
getRMschedulerInfo()
getAppsSummary()
getNodesSummary()
getServiceConfigTypes()
getResourceManagerMetrics()
getCheckClusterForRollingUpgrades()
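If you prefer not to pull the repo, here is a minimal sketch of what a helper like getClusterVersionAndName() could look like, assembled from the raw requests calls shown above; the actual implementation in restAPIFunctions.py may differ.

```python
# Hypothetical sketch of a helper similar to getClusterVersionAndName();
# the real implementation lives in restAPIFunctions.py and may differ.
import json
import requests

# Same connection variables as in the walkthrough above.
AMBARI_DOMAIN = '127.0.0.1'
AMBARI_PORT = '8080'
AMBARI_USER_ID = 'admin'
AMBARI_USER_PW = 'admin'

def getClusterVersionAndName():
    # Query the Ambari clusters endpoint and pull out the name and version.
    url = "http://" + AMBARI_DOMAIN + ":" + AMBARI_PORT + "/api/v1/clusters"
    r = requests.get(url, auth=(AMBARI_USER_ID, AMBARI_USER_PW))
    json_data = json.loads(r.text)
    cluster_name = json_data["items"][0]["Clusters"]["cluster_name"]
    cluster_version = json_data["items"][0]["Clusters"]["version"]
    return cluster_name, cluster_version, json_data
```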
12-18-2016
10:17 PM
13 Kudos
Background

Tungsten became the default in Spark 1.5 and can be enabled in earlier versions by setting spark.sql.tungsten.enabled to true (or disabled in later versions by setting it to false). Even without Tungsten, Spark SQL uses a columnar storage format with Kryo serialization to minimize storage cost.

Goal

The goal of Project Tungsten is to improve Spark execution by optimizing Spark jobs for CPU and memory efficiency (as opposed to network and disk I/O, which are considered fast enough).

Scope

Tungsten focuses on the hardware architecture of the platform Spark runs on, including but not limited to the JVM, LLVM, GPUs, NVRAM, etc.

Optimization Features

Off-Heap Memory Management: a binary in-memory data representation (aka the Tungsten row format) with explicitly managed memory.
Cache Locality: cache-aware computations with cache-aware layouts for high cache hit rates.
Whole-Stage Code Generation (aka CodeGen).

Design Improvements

Tungsten includes specialized in-memory data structures tuned for the types of operations required by Spark, improved code generation, and a specialized wire protocol. Tungsten's representation is substantially smaller than objects serialized using the Java or even Kryo serializers. As Tungsten does not depend on Java objects, both on-heap and off-heap allocations are supported. Not only is the format more compact, serialization times can be substantially faster than with native serialization. Since Tungsten no longer depends on working with Java objects, you can use either on-heap (in the JVM) or off-heap storage. If you use off-heap storage, it is important to leave enough room in your containers for the off-heap allocations, which you can estimate approximately from the web UI. Tungsten's data structures are also designed with the kind of processing they are used for in mind. The classic example is sorting, a common and expensive operation: the on-wire representation is implemented so that sorting can be done without having to deserialize the data again. By avoiding the memory and GC overhead of regular Java objects, Tungsten is able to process larger data sets than the same hand-written aggregations.

Benefits

The following Spark jobs benefit from Tungsten:
DataFrames: Java, Scala, Python, R
Spark SQL queries
Some RDD API programs, via general serialization and compression optimizations

Next Steps

In the future, Tungsten may make it more feasible to use certain non-JVM libraries. For many simple operations, the cost of using BLAS or similar linear algebra packages from the JVM is dominated by the cost of copying the data off-heap.

References

Project Tungsten: Bringing Apache Spark Closer to Bare Metal
High Performance Spark by Holden Karau and Rachel Warren
Slides: Deep Dive Into Project Tungsten - Josh Rosen
Video: Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal - Josh Rosen (Databricks)
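As a practical footnote to the Background and Design Improvements sections, here is a minimal PySpark sketch of toggling these settings. spark.sql.tungsten.enabled is the switch named above (a Spark 1.x property); the spark.memory.offHeap.* properties are standard Spark settings added here only to illustrate the container-sizing concern and should be tuned to your own workload.

```python
# Minimal sketch: toggling the Tungsten and off-heap settings discussed above.
# spark.sql.tungsten.enabled is the switch named in the article (Spark 1.x);
# the spark.memory.offHeap.* values are illustrative and workload-dependent.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tungsten-demo")
        .set("spark.sql.tungsten.enabled", "true")    # default since Spark 1.5
        .set("spark.memory.offHeap.enabled", "true")  # opt in to off-heap storage
        .set("spark.memory.offHeap.size", "512m"))    # leave room for this in the container

sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.sql.tungsten.enabled"))
sc.stop()
```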
12-18-2016
09:51 PM
6 Kudos
@Andi Sonde You will use the Kafka clients when you are a developer, you want to connect an application to Kafka, you can modify the application's code, and you want to push data into Kafka or pull data from Kafka. You will use Connect to connect Kafka to datastores that you did not write and whose code you cannot or will not modify. For data stores where a connector already exists, Connect can be used by non-developers, who only need to configure the connectors. If you need to connect Kafka to a data store and a connector does not exist yet, you can choose between writing an app using the Kafka clients or using the Connect API. Connect is recommended because it provides out-of-the-box features such as configuration management, offset storage, parallelization, error handling, support for different data types, and a standard management REST API. Writing a small app that connects Kafka to a data store sounds simple, but there are many little details around data types and configuration that make the task non-trivial; Kafka Connect handles most of this for you, allowing you to focus on transporting data to and from the external stores. I realize this is a bit out of scope, but I'd like to recommend Apache NiFi as an alternative to Kafka Connect. See: https://nifi.apache.org/ NiFi is a good alternative to Flume too. NiFi provides advanced data management capabilities that easily cover Kafka producer or consumer needs with no coding; it is visual, requiring at most some regular expressions.
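Since the standard management REST API is one of the reasons to prefer Connect, here is a hedged sketch of registering a stock file source connector through that API, using the same requests library as the Ambari examples elsewhere on this page. The host, port, file path and topic name are placeholders.

```python
# Hypothetical sketch: register a stock FileStreamSource connector through the
# Kafka Connect REST API. Host, port, file path and topic name are placeholders.
import json
import requests

connect_url = "http://connect-host:8083/connectors"
connector = {
    "name": "file-source-demo",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",
        "topic": "demo-topic"
    }
}

# POST the connector definition; Connect replies with the created configuration.
r = requests.post(connect_url,
                  headers={"Content-Type": "application/json"},
                  data=json.dumps(connector))
print(r.status_code, r.text)
```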
12-18-2016
09:15 PM
@Kaliyug Antagonist I like Tom's suggestion and will try it myself. Otherwise, if you wish to create your local cluster with Vagrant: https://community.hortonworks.com/articles/39156/setup-hortonworks-data-platform-using-vagrant-virt.html I use Eclipse and a Vagrant cluster. I share a folder between my local machine and the cluster, where I place the output jars and then submit them for execution. I followed the instructions published here: https://community.hortonworks.com/articles/43269/intellij-eclipse-usage-against-hdp-25-sandbox.html I am not sure why you are against the idea of using the sandbox. The code you develop can, at most, be tested functionally on a local setup. I get that you want more debugging capabilities locally, but true load testing still needs to happen in a full-scale environment.