Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6986 | 09-21-2018 09:54 PM |
| | 8744 | 03-31-2018 03:59 AM |
| | 2624 | 03-31-2018 03:55 AM |
| | 2758 | 03-31-2018 03:31 AM |
| | 6185 | 03-27-2018 03:46 PM |
12-20-2016
08:27 PM
@k v vittal hebbar Either one is fine.
12-20-2016
08:10 PM
@Praveen PentaReddy Thanks for submitting the JIRA ticket. I reviewed Artem's response, and until the enhancement is implemented, his approach remains the best option.
12-20-2016
06:42 PM
3 Kudos
@Suresh Bonam Not out of the box; you would need to build something custom, and CSV is still an option. If your source streams data in real time, Flume is a reasonable option; an alternative is Apache NiFi. Assuming the data is streaming in real time and you are willing to use Flume, the target files stored in HDFS will have a structure similar to the source (no transformation in flight). Apache NiFi can perform some transformation in flight so that the file at the target is easier to consume, e.g. via Hive external tables. You could achieve something similar with Flume, but with coding and pain involved. If your Excel file is static, you should use something else, such as a MapReduce or Spark job.
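For the static-Excel case, here is a minimal sketch of that last suggestion in its simplest form: convert the workbook to CSV and push it to HDFS for a Hive external table to pick up. The file paths, sheet index and the pandas dependency are assumptions for illustration, not part of the original question.

```python
# Hypothetical sketch: convert a static Excel workbook to CSV, then load it into HDFS.
# Paths, sheet index and the pandas dependency are assumptions for illustration.
import subprocess
import pandas as pd

# Read the first sheet of the workbook (requires an Excel engine such as xlrd/openpyxl).
df = pd.read_excel("/tmp/source.xlsx", sheet_name=0)
df.to_csv("/tmp/source.csv", index=False)

# Put the CSV into HDFS so it can back a Hive external table.
subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", "/data/excel_feed"])
subprocess.check_call(["hdfs", "dfs", "-put", "-f", "/tmp/source.csv", "/data/excel_feed/"])
```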
12-20-2016
06:16 PM
@Ritesh jain Does sudo -u hdfs hadoop fs -ls work for you? If it does, then create a home directory for your user in HDFS and make sure that your user is a member of the hadoop or hdfs group. *** If this helped, please vote/accept the answer. For the second question, please create a new question and remove it from the current one. We are trying to build a body of knowledge that is easy to follow and to avoid open-ended questions.
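To make the home-directory step concrete, here is a hedged sketch of the commands involved; the user name and group are placeholders and should be replaced with your own.

```python
# Hypothetical sketch: create an HDFS home directory for a user and hand over ownership.
# The user name and group below are placeholders.
import subprocess

user = "ritesh"
group = "hdfs"

def run(cmd):
    # Echo and run a command, raising if it fails.
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

run(["sudo", "-u", "hdfs", "hadoop", "fs", "-mkdir", "-p", "/user/" + user])
run(["sudo", "-u", "hdfs", "hadoop", "fs", "-chown", user + ":" + group, "/user/" + user])
```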
12-20-2016
06:10 PM
@Ritesh jain Please provide the exact command you executed.
12-20-2016
03:19 PM
1 Kudo
@Sampat Budankayala Of course, you can use a custom UDF, but those are not part of Hive core and performance is not guaranteed, especially for such an expensive operation on a big data set. There is a reason this is not part of core Hive: iterative and recursive problems are not well suited for MapReduce because tasks do not share state or coordinate with each other. If you still want to go down this path, in a few words, you would build the jar and deploy it to the Hive auxiliary libraries folder or to HDFS, then create a permanent or temporary function that you can invoke in your SQL. Follow the steps described here: https://dzone.com/articles/writing-custom-hive-udf-andudaf. Also look at this: https://community.hortonworks.com/articles/39980/creating-a-hive-udf-in-java.html. You would follow similar steps with the code you found, even if it is Scala. I am not aware of a similar implementation in Java, but one probably exists.
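To illustrate the registration step, here is a minimal sketch of adding a UDF jar and creating a temporary function from the Hive CLI; the jar path, class name, function name and table are made-up placeholders.

```python
# Hypothetical sketch: register a custom UDF jar as a temporary Hive function.
# The jar path, class name, function name and table are placeholders.
import subprocess

hql = """
ADD JAR hdfs:///user/hive/udfs/my-recursive-udf.jar;
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyRecursiveUDF';
SELECT my_udf(col1) FROM my_table LIMIT 10;
"""

# Run the statements with the Hive CLI; beeline -u <jdbc-url> -e works similarly.
subprocess.check_call(["hive", "-e", hql])
```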
12-20-2016
03:14 AM
13 Kudos
Pre-requisites

Hortonworks Data Platform 2.5 on CentOS 7.2
Python distribution that comes with HDP 2.5 (Python 2.7.5)

Download and install pip:
#wget https://bootstrap.pypa.io/get-pip.py

Install the add-on package:
#pip install requests

Start the Python CLI (default version):
#python

Import pre-reqs:
>>>import requests
>>>import json
>>>import sys

Environment Variables

Set the Ambari domain variable to the IP address or FQDN of your Ambari node:
>>>AMBARI_DOMAIN = '127.0.0.1'

Set the Ambari port, Ambari user and password variables to match your specifics:
>>>AMBARI_PORT = '8080'
>>>AMBARI_USER_ID = 'admin'
>>>AMBARI_USER_PW = 'admin'

Set the following variable to the IP address or FQDN of your ResourceManager node:
>>>RM_DOMAIN = '127.0.0.1'

Set the Resource Manager port variable:
>>>RM_PORT = '8088'

Ambari REST API Call Examples

Let's find the cluster name, cluster version, stack and stack version:
>>>restAPI = '/api/v1/clusters'
>>>url = "http://"+AMBARI_DOMAIN+":"+AMBARI_PORT+restAPI
>>>r = requests.get(url, auth=(AMBARI_USER_ID, AMBARI_USER_PW))
>>>json_data = json.loads(r.text)
>>>CLUSTER_NAME = json_data["items"][0]["Clusters"]["cluster_name"]
>>>print(CLUSTER_NAME)
>>>CLUSTER_VERSION = json_data["items"][0]["Clusters"]["version"]
>>>print(CLUSTER_VERSION)
>>>STACK = CLUSTER_VERSION.split('-')[0]
>>>print(STACK)
>>>STACK_VERSION = CLUSTER_VERSION.split('-')[1]
>>>print(STACK_VERSION)
>>>CLUSTER_INFO = json_data
>>>print(CLUSTER_INFO)

Let's find the HDP stack repository:
>>>restAPI = "/api/v1/stacks/"+STACK+"/versions/"+STACK_VERSION+"/operating_systems/redhat7/repositories/"+CLUSTER_VERSION
>>>url = "http://"+AMBARI_DOMAIN+":"+AMBARI_PORT+restAPI
>>>r = requests.get(url, auth=(AMBARI_USER_ID, AMBARI_USER_PW))
>>>json_data = json.loads(r.text)
>>>print(json_data)
>>>REPOSITORY_NAME = json_data["Repositories"]["latest_base_url"]
>>>print(REPOSITORY_NAME)

A more elegant approach is to create utility functions. See my repo: https://github.com/cstanca1/HDP-restAPI/. The restAPIFunctions.py script in that repo defines a number of useful functions that I have collected over time. Run restAPIFunctions.py, and the example presented above can be implemented with a single call that returns CLUSTER_NAME, CLUSTER_VERSION and CLUSTER_INFO using the getClusterVersionAndName() function:
>>>CLUSTER_NAME,CLUSTER_VERSION,CLUSTER_INFO = getClusterVersionAndName()
>>>print(CLUSTER_NAME)
>>>print(CLUSTER_VERSION)
>>>print(CLUSTER_INFO)

Resource Manager REST API Call Examples

>>>RM_INFO=getResourceManagerInfo()
>>>RM_SCHEDULER_INFO=getRMschedulerInfo()
>>>print(RM_INFO)
>>>print(RM_SCHEDULER_INFO)

Other Functions

These are other functions included in the restAPIFunctions.py script:
getServiceActualConfigurations()
getClusterRepository()
getAmbariHosts()
getResourceManagerInfo()
getRMschedulerInfo()
getAppsSummary()
getNodesSummary()
getServiceConfigTypes()
getResourceManagerMetrics()
getCheckClusterForRollingUpgrades()
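If you prefer not to pull the repo, here is a minimal sketch of what a helper like getClusterVersionAndName() could look like, assembled from the raw requests calls shown above; the actual implementation in restAPIFunctions.py may differ.

```python
# Hypothetical sketch of a helper similar to getClusterVersionAndName();
# the real implementation lives in restAPIFunctions.py and may differ.
import json
import requests

# Same connection variables as in the walkthrough above.
AMBARI_DOMAIN = '127.0.0.1'
AMBARI_PORT = '8080'
AMBARI_USER_ID = 'admin'
AMBARI_USER_PW = 'admin'

def getClusterVersionAndName():
    # Query the Ambari clusters endpoint and pull out the name and version.
    url = "http://" + AMBARI_DOMAIN + ":" + AMBARI_PORT + "/api/v1/clusters"
    r = requests.get(url, auth=(AMBARI_USER_ID, AMBARI_USER_PW))
    json_data = json.loads(r.text)
    cluster_name = json_data["items"][0]["Clusters"]["cluster_name"]
    cluster_version = json_data["items"][0]["Clusters"]["version"]
    return cluster_name, cluster_version, json_data
```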
12-18-2016
10:17 PM
13 Kudos
Background

Tungsten became the default in Spark 1.5 and can be enabled in earlier versions by setting spark.sql.tungsten.enabled to true (or disabled in later versions by setting it to false). Even without Tungsten, Spark SQL uses a columnar storage format with Kryo serialization to minimize storage cost.

Goal

The goal of Project Tungsten is to improve Spark execution by optimizing Spark jobs for CPU and memory efficiency (as opposed to network and disk I/O, which are considered fast enough).

Scope

Tungsten focuses on the hardware architecture of the platform Spark runs on, including but not limited to the JVM, LLVM, GPUs, NVRAM, etc.

Optimization Features

Off-Heap Memory Management: a binary in-memory data representation (aka the Tungsten row format) with explicitly managed memory.
Cache Locality: cache-aware computations with cache-aware layouts for high cache hit rates.
Whole-Stage Code Generation (aka CodeGen).

Design Improvements

Tungsten includes specialized in-memory data structures tuned for the types of operations required by Spark, improved code generation, and a specialized wire protocol. Tungsten's representation is substantially smaller than objects serialized using the Java or even Kryo serializers. As Tungsten does not depend on Java objects, both on-heap and off-heap allocations are supported. Not only is the format more compact, serialization times can be substantially faster than with native serialization. Since Tungsten no longer depends on working with Java objects, you can use either on-heap (in the JVM) or off-heap storage. If you use off-heap storage, it is important to leave enough room in your containers for the off-heap allocations, which you can estimate approximately from the web UI. Tungsten's data structures are also designed with the kind of processing they are used for in mind. The classic example is sorting, a common and expensive operation: the on-wire representation is implemented so that sorting can be done without having to deserialize the data again. By avoiding the memory and GC overhead of regular Java objects, Tungsten is able to process larger data sets than the same hand-written aggregations.

Benefits

The following Spark jobs benefit from Tungsten:
DataFrames: Java, Scala, Python, R
Spark SQL queries
Some RDD API programs, via general serialization and compression optimizations

Next Steps

In the future, Tungsten may make it more feasible to use certain non-JVM libraries. For many simple operations, the cost of using BLAS or similar linear algebra packages from the JVM is dominated by the cost of copying the data off-heap.

References

Project Tungsten: Bringing Apache Spark Closer to Bare Metal
High Performance Spark by Holden Karau and Rachel Warren
Slides: Deep Dive Into Project Tungsten - Josh Rosen
Video: Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal - Josh Rosen (Databricks)
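As a practical footnote to the Background and Design Improvements sections, here is a minimal PySpark sketch of toggling these settings. spark.sql.tungsten.enabled is the switch named above (a Spark 1.x property); the spark.memory.offHeap.* properties are standard Spark settings added here only to illustrate the container-sizing concern and should be tuned to your own workload.

```python
# Minimal sketch: toggling the Tungsten and off-heap settings discussed above.
# spark.sql.tungsten.enabled is the switch named in the article (Spark 1.x);
# the spark.memory.offHeap.* values are illustrative and workload-dependent.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tungsten-demo")
        .set("spark.sql.tungsten.enabled", "true")    # default since Spark 1.5
        .set("spark.memory.offHeap.enabled", "true")  # opt in to off-heap storage
        .set("spark.memory.offHeap.size", "512m"))    # leave room for this in the container

sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.sql.tungsten.enabled"))
sc.stop()
```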
12-18-2016
09:51 PM
6 Kudos
@Andi Sonde You will use the Kafka clients when you are a developer, you want to connect an application to Kafka, you can modify the application's code, and you want to push data into Kafka or pull data from Kafka. You will use Connect to connect Kafka to datastores that you did not write and whose code you cannot or will not modify. For data stores where a connector already exists, Connect can be used by non-developers, who only need to configure the connectors. If you need to connect Kafka to a data store and a connector does not exist yet, you can choose between writing an app using the Kafka clients or using the Connect API. Connect is recommended because it provides out-of-the-box features such as configuration management, offset storage, parallelization, error handling, support for different data types, and a standard management REST API. Writing a small app that connects Kafka to a data store sounds simple, but there are many little details around data types and configuration that make the task non-trivial; Kafka Connect handles most of this for you, allowing you to focus on transporting data to and from the external stores. I realize this is a bit out of scope, but I'd like to recommend Apache NiFi as an alternative to Kafka Connect. See: https://nifi.apache.org/ NiFi is a good alternative to Flume too. NiFi provides advanced data management capabilities that easily cover Kafka producer or consumer needs with no coding; it is visual, requiring at most some regular expressions.
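Since the standard management REST API is one of the reasons to prefer Connect, here is a hedged sketch of registering a stock file source connector through that API, using the same requests library as the Ambari examples elsewhere on this page. The host, port, file path and topic name are placeholders.

```python
# Hypothetical sketch: register a stock FileStreamSource connector through the
# Kafka Connect REST API. Host, port, file path and topic name are placeholders.
import json
import requests

connect_url = "http://connect-host:8083/connectors"
connector = {
    "name": "file-source-demo",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",
        "topic": "demo-topic"
    }
}

# POST the connector definition; Connect replies with the created configuration.
r = requests.post(connect_url,
                  headers={"Content-Type": "application/json"},
                  data=json.dumps(connector))
print(r.status_code, r.text)
```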
12-18-2016
09:15 PM
@Kaliyug Antagonist I like Tom's suggestion and will try it myself. Otherwise, if you wish to create your local cluster with Vagrant: https://community.hortonworks.com/articles/39156/setup-hortonworks-data-platform-using-vagrant-virt.html I use Eclipse and a Vagrant cluster. I share a folder between my local machine and the cluster, where I place the output jars and then submit them for execution. I followed the instructions published here: https://community.hortonworks.com/articles/43269/intellij-eclipse-usage-against-hdp-25-sandbox.html I am not sure why you are against the idea of using the sandbox. The code you develop can, at most, be tested functionally on a local setup. I get that you want more debugging capabilities locally, but true load testing still needs to happen in a full-scale environment.