Member since: 06-07-2016
Posts: 923
Kudos Received: 319
Solutions: 115
09-19-2017
03:08 AM
1 Kudo
@Gobi Subramani I would solve this with an HBase tall-and-narrow table. I have worked on an application that stored ticker data in HBase and recorded every change. Our HBase row key was the stock symbol plus a timestamp plus a few more fields we needed to search on, which gave us keys like AAPL<epoch time>, AAPL<epoch time - 1>, AAPL<epoch time - 2>, and so on. That was a trillion-plus-row table. Now, given a symbol, AAPL in this case, you run a prefix scan and limit it to 100 rows (see the sketch below). Alternatively, you can build a short-and-wide table where all data for AAPL sits in one row and then do a get that reads only the first hundred columns. Either way is easy to implement in HBase.
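For illustration, here is a minimal sketch of that prefix scan using the HBase client API in Scala; the table name "ticks" and the plain symbol+epoch key layout are assumptions, not your actual schema:

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical table holding rows keyed as "<symbol><epoch>".
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("ticks"))

// A prefix scan returns every AAPL row in key order; many designs store an
// inverted timestamp (Long.MaxValue - epoch) so the newest tick sorts first.
val scan = new Scan().setRowPrefixFilter(Bytes.toBytes("AAPL"))
val scanner = table.getScanner(scan)
val first100 = scanner.iterator().asScala.take(100).toList  // keep only 100 rows

scanner.close()
connection.close()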
... View more
09-19-2017
03:01 AM
@Jon Page It depends on the environment the cluster is in. Since you are asking here and the drive will take a couple of days to arrive, we can reasonably assume this is a dev/sandbox-type environment. Here is what is going to happen: because of the lost node, you have lost some data blocks. Once a DataNode is marked dead, Hadoop starts re-replicating the lost blocks onto the remaining nodes. That creates network traffic which can be unnecessary in some cases (as it seems to be here). To avoid it, you can increase dfs.namenode.heartbeat.recheck-interval. Together with dfs.heartbeat.interval, this setting (in milliseconds) determines how long the NameNode waits before declaring a DataNode dead (the timeout is 2 x recheck-interval + 10 x heartbeat interval, roughly 10.5 minutes with the defaults). Increasing it buys you more time to replace a drive before re-replication kicks in. The problem is that the setting requires a restart, so it is a little late for that now. For now, just run with under-replicated blocks. Unless you lose more disks you should not lose data, but there is of course a risk of data loss if the two disks holding the remaining replicas of a lost block also fail.
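For reference, this is the sort of hdfs-site.xml change that would buy more time before a DataNode is declared dead; the 30-minute value is purely illustrative:

<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <!-- default is 300000 ms (5 minutes); raising it delays the dead-node verdict -->
  <value>1800000</value>
</property>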
... View more
09-19-2017
02:13 AM
@Sami Ahmad
As the error message says, you cannot do incremental imports with the HCatalog integration. See the example on this page: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_dataintegration/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html Use a JDBC connection instead (see the example command below). Also, read the accepted answer on the following page to clear up any remaining confusion (there is a lot of redundancy on that page; focus on the accepted answer and the comments on it). https://community.hortonworks.com/questions/10710/sqoop-incremental-import-working-fine-now-i-want-k.html
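As a rough sketch only (the connection string, table, column, and target directory below are made up, not taken from your job), an incremental import over plain JDBC looks something like this:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/sami/orders_incr \
  --incremental lastmodified \
  --check-column last_update_ts \
  --last-value "2017-09-18 00:00:00"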
... View more
09-16-2017
03:54 PM
@adam chui Cassandra runs on EXT4 or a similar Linux-compatible filesystem. Unlike HBase, it does not run on Hadoop, and it is not part of the HDP 2.6 stack, so you cannot install Cassandra through HDP 2.6. What is your use case? Why not just use HBase?
... View more
09-11-2017
03:03 AM
@Ashish Arora Can you please try starting a PuTTY session and logging into the Docker container with the following command: ssh -p 2222 root@127.0.0.1
... View more
09-11-2017
02:54 AM
@Bhaskar Das So you want to know how many copies occur when the mappers have completed and data is being transferred to the reducers, right? After the mappers complete, data is sent to the reducers based on keys. Data for a given key lands on one particular reducer and only that reducer, no matter which mapper it comes from. One reducer may handle more than one key, but one key always lives on exactly one reducer. So imagine mapper output on node 1, node 2, and node 3, and assume there is a key "a" that appears in the mapper output on all three nodes. Imagine one reducer running on each of the three nodes (three reducers total), and suppose the data for key "a" goes to node 3. Then the data for "a" from node 1 and node 2 is copied to node 3 as reducer input. In fact, the data from node 3 is also copied into a folder where the reducer can pick it up (a local copy, unlike the over-the-network copies from node 1 and node 2). So three copies really occurred for 3 mappers and 1 reducer. If you follow that logic for how the copy is driven by keys, you arrive at m*n copies for m mappers and n reducers (for example, 10 mappers and 4 reducers means 40 copy operations). Please see the "MapReduce data flow" picture at the following link; it should visually answer what I have described above. Hope this helps. https://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow
... View more
09-06-2017
02:57 PM
@Edgar Daeds I want to make sure I understand this correctly. Please let me know if I am wrong. 1. You have configured LDAP group mapping. 2. Your HBase Region server cannot reach the LDAP server due to security reasons. 3. Once LDAP timeout expires, your query works. If my understanding is correct, then you need to disable LDAP integration until you can actually query the LDAP server for group mappings. What's the point in configuring LDAP when you cannot actually reach out to it?
... View more
09-06-2017
02:52 PM
2 Kudos
@Bin Ye I can only guess here, but Phoenix clients cache 100 sequence values by default. So, if you run "sqlline.py" first, it caches values 1 through 100, and a new client then starts from the next value, which is 101. That is most likely what is going on. The only other explanation is that your CREATE SEQUENCE statement contains "START WITH 101", which I assume it does not. Try changing the sequence cache size described at the following link to 50 and see whether your JDBC client then starts at 51. https://phoenix.apache.org/sequences.html
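For illustration, the per-client cache can also be set explicitly when the sequence is declared (the sequence name here is hypothetical):

CREATE SEQUENCE my_schema.my_sequence START WITH 1 INCREMENT BY 1 CACHE 50;
SELECT NEXT VALUE FOR my_schema.my_sequence;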
... View more
09-06-2017
03:54 AM
@Karan Alang One thing that jumps out here is that you are using "-alias localhost". That cannot be right when you are communicating between two physically different servers. Use the proper DNS name for node04 and node05; this should be the same name with which you can successfully run "ping <node04/05>" and get a reply.
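Assuming you are generating the keystores with keytool (the hostnames, keystore name, and -dname fields below are placeholders, not your values), the alias and CN should carry the real host name, something like:

keytool -genkeypair -alias node04.example.com -keyalg RSA -keysize 2048 \
  -keystore kafka.server.keystore.jks -validity 365 \
  -dname "CN=node04.example.com, OU=IT, O=Example, L=City, ST=State, C=US"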
... View more
08-30-2017
12:30 PM
@heta desai You already have a good idea of how to implement this, but I will suggest an easier design.
1. Download the latest HDF 3.0 and HDP 2.6.1 from the Hortonworks website. After installation, create the Kafka topics that will store the data ingested from Twitter.
2. Use NiFi to ingest data from Twitter. Here is a link to the NiFi GetTwitter processor: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-social-media-nar/1.3.0/org.apache.nifi.processors.twitter.GetTwitter/index.html
3. Use the NiFi PublishKafka processor to push the data ingested from Twitter into a Kafka topic: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-0-10-nar/1.3.0/org.apache.nifi.processors.kafka.pubsub.PublishKafka_0_10/additionalDetails.html
4. Use Streaming Analytics Manager to build a flow with simple drag and drop that reads from the Kafka topic, performs sentiment analysis using the processors Streaming Analytics Manager already provides, and then pushes the results to HBase. All of this is done without writing a single line of code. Streaming Analytics Manager uses Apache Storm rather than Spark Streaming under the hood, but do you care which tool is used as long as your problem is solved?
If you cannot use Streaming Analytics Manager, then you will have to write Spark Streaming code that reads from Kafka and pushes to HBase (see the sketch below). Here is the doc for integrating Spark Streaming with Kafka: https://spark.apache.org/docs/latest/streaming-kafka-integration.html The following link has an example of the Java HBaseContext being used to write to HBase: https://github.com/tmalaska/SparkOnHBase/blob/master/src/main/scala/org/apache/hadoop/hbase/spark/JavaHBaseContext.scala
If you follow my suggestion to use Streaming Analytics Manager, you are done at step 4 without writing any code.
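If you do end up on the Spark Streaming path, a minimal sketch of the Kafka direct stream (spark-streaming-kafka-0-10) looks like the following; the broker address, topic name, and group id are assumptions, and the sentiment scoring plus the HBaseContext write are left as comments:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Hypothetical broker list, topic, and consumer group.
val conf = new SparkConf().setAppName("TwitterSentiment")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafkabroker1:6667",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "tweet-sentiment",
  "auto.offset.reset" -> "latest")

val tweets = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("tweets"), kafkaParams))

// Score each tweet here and write the result to HBase (e.g. via HBaseContext.bulkPut).
tweets.map(record => record.value).print()

ssc.start()
ssc.awaitTermination()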
... View more
08-18-2017
04:58 AM
H. L. Will you run an Ubuntu VM on your Windows 7 box? If yes, you should be able to install a cluster without issues (meaning a mix of VMs and physical nodes). You cannot, however, have Windows machines as cluster nodes. Finally, you shouldn't use the sandbox for this. Maybe it would work, but I cannot say because I have personally never tried it. Installing HDP with Ambari on four machines is literally a matter of an hour; since you are doing it for the first time it might take 2-3 hours, but it is much easier and better to install an HDP cluster with Ambari than to try to stretch the sandbox, which is designed to run as a single dedicated VM.
... View more
08-17-2017
10:01 PM
@Qi Wang Have you set up a truststore and then trusted SAM as an application that can connect to Ambari? I have not set this up myself, but not setting up a truststore and "trusting" SAM could be the reason for your error. Check the troubleshooting section in the following link: https://community.hortonworks.com/articles/39865/enabling-https-for-ambariserver-and-troubleshootin.html
... View more
08-04-2017
07:28 PM
@Kent Brodie I am assuming you run major compactions on a regular schedule, probably once a week, so that is not the issue. Do you have a lot of snapshots? Here is how snapshots work: when you create a snapshot, it only captures metadata at that point in time, so if you ever have to restore to that point, the snapshot knows through that metadata which data files to bring back. Now, as HBase runs, you delete data, and normally once a major compaction runs the deleted data is gone for good and the disk space is recovered. However, if a snapshot still references data that is being deleted, HBase cannot simply remove it, because you might want to restore to that point in time later. Instead, that data is moved to the archive folder, and the more snapshots you have, the more the archive folder grows to satisfy them. I can only guess, but a reasonable guess for what you are seeing is that you have too many snapshots.
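A quick way to check this theory (the /apps/hbase/data path is the usual HDP default; adjust it to your hbase.rootdir):

# From the HBase shell: see how many snapshots exist
list_snapshots

# From the command line: see how much space the archive directory holds
hdfs dfs -du -s -h /apps/hbase/data/archive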
... View more
08-04-2017
02:00 PM
@Mohammedfahim Pathan You assign queues when you submit your jobs; in Spark, for example, you can pass the --queue parameter (see the spark-submit sketch after the XML below). In the YARN configuration you set ACLs that control who can submit to a queue, and those can be the users running the "tools" in question. So you cannot say that a queue is reserved for "Hive", but when you limit a queue to a group that only uses Hive, you achieve the same effect in practice. You can also map users and groups to queues automatically:
<property>
<name>yarn.scheduler.capacity.queue-mappings</name>
<value>u:user1:queue1,g:group1:queue2,u:%user:%user,u:user2:%primary_group</value>
<description>
Here, <user1> is mapped to <queue1>, <group1> is mapped to <queue2>,
maps users to queues with the same name as user, <user2> is mapped
to queue name same as <primary group> respectively. The mappings will be
evaluated from left to right, and the first valid mapping will be used.
</description>
</property>
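As a quick illustration of the job-side half (the queue name, class, and jar below are placeholders):

spark-submit \
  --master yarn \
  --queue hive_users \
  --class com.example.MyApp \
  my-app.jar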
... View more
07-31-2017
03:02 PM
2 Kudos
@younes kafi Please see replies inline below.

1/ Should Kafka brokers be located on the same data nodes, or should they be on separate nodes? Which way is better in terms of performance? Is it possible to have Kafka on a DataNode when Kafka is installed using HDF?
Is this for production? Before answering, I would suggest you engage someone from your local Hortonworks account team to help with these questions. Depending on your ingest volume you might need dedicated Kafka servers; in other cases you may co-locate Kafka on data nodes (this rarely happens in production unless the deployment is very small). Even when you co-locate Kafka on data nodes, make sure you give it dedicated disks and its own Zookeeper. Kafka must have its own Zookeeper, and that Zookeeper should have its own disks; they do not need to be large-capacity disks, but they should be dedicated.

2/ Can Kafka and NiFi share the same Zookeeper, or should Kafka have its own ZK used exclusively by Kafka?
Ideally you do not want Zookeeper to be shared; Kafka should get its own. That said, in my personal opinion sharing Zookeeper with NiFi will be okay. Just do not add any component beyond those two to the Zookeeper dedicated to Kafka.

3/ Does installing NiFi through HDF (Ambari) apply the needed system requirements such as max file handles and max forked processes, or should these requirements be handled before installing through Ambari?
No. When Ambari manages NiFi it lets you configure NiFi, but it is not going to make OS-level changes. Imagine making OS-level changes from NiFi that affect everything else on that server; you do not want that.

4/ Is it possible to have a node that belongs to both an HDF and an HDP cluster at the same time, with the same Ambari agent running on the node?
Two things here. The new version of Ambari manages both HDP and HDF, and yes, you can install HDF services on an HDP cluster. Please see the following link. https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.0/bk_installing-hdf-and-hdp/content/install-hdp.html
... View more
07-30-2017
10:10 PM
@Himanshu Mishra Go to the HBase shell and run "describe 'test'". That will show you how the table was created in HBase. A better way to create tables in Phoenix is to follow this convention: CREATE TABLE TEST (MYKEY VARCHAR NOT NULL PRIMARY KEY, A.COL1 VARCHAR, A.COL2 VARCHAR, B.COL3 VARCHAR) The statement above creates two column families, "A" and "B", with qualifiers "COL1" and "COL2" in column family A and "COL3" in column family B. When you create tables from Phoenix, it also adds an empty key-value for each row so queries work in a SQL-like fashion without requiring you to list all projections in your query. The following link describes how columns are mapped from Phoenix to HBase. https://phoenix.apache.org/faq.html#How_I_map_Phoenix_table_to_an_existing_HBase_table
... View more
07-30-2017
07:15 PM
2 Kudos
@Bala Vignesh N V I am looking into something very similar, and what I have found is that using Hive LLAP/ACID (the merge feature) is the right way to go. Here is what I know so far, from talking to a coworker who has done a couple of successful POCs for SCD Type 2. Before reading on, please see the following link (the video starts at 20:20, which is where the presenter walks through an SCD Type 2 example): https://www.youtube.com/watch?v=EjkOIhdOqek#t=20m20s Here is the approach for implementing SCD Type 2: the initial load was done by exporting the existing data to a landing zone and then doing a CTAS to create optimized ORC tables with table and column stats. Another option is to use the new MERGE statement to load incremental data (see the YouTube link above and the sketch below). One approach is to use a CDC tool (Attunity) plus HDF to stream changes into a diff table and then, at a regular interval, use MERGE to update the SCD Type 2 tables. Hive ACID is different from what people are used to with RDBMS ACID. The transaction scope is only per table or partition, and there are no begin/.../commit statements; everything is essentially auto-commit. Under the covers, delta files are created for the table with the changes, and Hive then runs minor and major compactions to merge the data. This is no doubt slower than a traditional RDBMS, but performance can be improved by increasing the number of compaction threads and running the updates across different tables or partitions. If you have a scenario where new data is only being added (no deletes or updates, just inserts), you can insert multiple rows per statement, which speeds up the ingest. Once the transaction commits, the new data is immediately available to consumers. There is also the concept of batch transactions, which can be used to increase performance for high-velocity transactions, but you would have to code for it; the Hive streaming Storm bolt and NiFi processors use it, but your SQL GUI tools won't.
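To make the merge idea concrete, here is a simplified sketch of the Hive MERGE syntax against hypothetical dim_customer (ACID/ORC) and customer_updates staging tables. It only expires the current row for changed keys and inserts brand-new keys; a real SCD Type 2 flow also inserts the new version of each changed row, typically via a staged source or a second statement:

MERGE INTO dim_customer AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id AND t.is_current = true
WHEN MATCHED AND t.address <> s.address THEN
  UPDATE SET is_current = false, end_date = current_date
WHEN NOT MATCHED THEN
  INSERT VALUES (s.customer_id, s.address, true, current_date, null);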
... View more
07-25-2017
10:12 PM
@PJ These directories exist on the JournalNodes, if that is what you are using, or on whatever disks you specify for the NameNode in Ambari when you do your install. I think you will find the following link helpful. https://hortonworks.com/blog/hdfs-metadata-directories-explained/
... View more
07-25-2017
08:18 PM
2 Kudos
@PJ
If you are just looking for redundancy, it is achieved by writing the NameNode metadata (the edit log) to the JournalNodes (typically three), with both the standby and the active NameNode pointing at the same JournalNodes. When the active NameNode goes down, ZooKeeper simply has to make the standby active, and it is already pointing at the same metadata, which is replicated on the three JournalNodes. If you do not have JournalNodes and you have only one NameNode, then your NameNode metadata is written only once; in that case it is recommended that you use a RAID 10 array so a single disk failure does not result in data loss. To answer your question about whether two copies of the metadata are present: it depends. If you are using RAID 10, the disk array mirrors blocks, but that is not really a copy in the sense you are asking about. If High Availability is enabled and you are using JournalNodes, then you do have three copies of the metadata available on three different nodes.
... View more
07-24-2017
01:59 PM
@Ashis Panigrahi Can you please elaborate on your question? NiFi is event driven. What are you looking for? Each flowfile can represent an individual ingested event.
... View more
07-20-2017
09:13 PM
1 Kudo
@Brad Penelli This looks like a Schema Registry issue. Is the schema name specified in the registry correct (i.e. no typos)? I would also avoid dashes or special characters in the schema name. If everything else is right, simply restart Schema Registry; that seemed to solve my problem.
... View more
07-20-2017
08:11 PM
@Ir Mar You definitely need to use port 2222. I am not sure about your IP, but when I run the sandbox I am able to do "ssh -p 2222 127.0.0.1".
... View more
07-20-2017
08:07 PM
1 Kudo
@Dhiraj What is your question? There is plenty of material available online if you just want to know the differences between the two. The following article summarizes the two approaches and helps guide which one to use when: https://community.hortonworks.com/articles/2473/rolling-upgrade-express-upgrade-in-ambari.html
... View more
07-19-2017
06:38 PM
@Bala Vignesh N V Have you tried groupByKey(), reduceByKey(), or aggregate()? See the sketch below.
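A generic illustration (a hypothetical pair RDD, not your actual data) of the idea:

// reduceByKey combines values per key on the map side before the shuffle,
// which is usually cheaper than groupByKey for simple aggregations.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val counts = pairs.reduceByKey(_ + _)
counts.collect()  // e.g. Array((a,2), (b,1))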
... View more
07-19-2017
06:11 PM
@Jobin George Can you shut down the cluster, delete the flow file from node 4, add the node in Ambari, verify the new flow file is not there before starting, and then start the cluster? I know this is not ideal, because we should be able to add a node without bringing the cluster down, but I just want to see what might make it work.
... View more
07-19-2017
05:48 PM
@Manikandan Jeyabal Are you able to ping the destination host from source?
... View more
07-19-2017
05:43 PM
@Bala Vignesh N V It may be your first line and not the subtract function. Try fixing the HDFS URI. Basically, rely on the configured default filesystem: sc.textFile("hdfs:///data/spark/genome-tags.csv") or, if you haven't provided the Hadoop config, spell out the NameNode URI: sc.textFile("hdfs://<namenode uri>:8020/data/spark/genome-tags.csv")
... View more
07-19-2017
05:38 PM
@Suhel How many users connect to your HiveServer2 concurrently? That determines your memory. From the Hortonworks recommendations, 20 concurrent users need a mere 6 GB; for 10 concurrent connections 4 GB is enough, and for a single connection 2 GB, so you definitely don't want to go below that. When you allocate too much memory, you run into what are called "stop-the-world" garbage collection pauses. You can google the details, but basically the JVM needs to move objects and update the references to them. If it moved an object before updating the references and the running application accessed it through the old reference, there would be trouble; if it updated the reference first, the reference would be wrong until the object was actually moved. For both the CMS and Parallel collectors the young-generation collection is stop-the-world, meaning the application is paused while collection happens. When you allocate a very large heap, like 24 GB, those stop-the-world pauses take longer, and that is why your application fails. Your metastore does not need the same memory as HiveServer2; they are two different processes. If the metastore is running into similar issues, you can set it to 8 GB or less, which is still a lot of memory for just the metastore.
... View more
07-19-2017
05:21 PM
1 Kudo
@Bala Vignesh N V Why not use filter like the following?
val header = data.first
val rows = data.filter(line => line != header)
... View more
07-19-2017
05:16 PM
@Jobin George On your new node, do you have flow.xml.gz? If yes, can you delete it and try adding the node again.
... View more