Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7357 | 08-12-2016 01:02 PM |
| | 2708 | 08-08-2016 10:00 AM |
| | 3672 | 08-03-2016 04:44 PM |
| | 7211 | 08-03-2016 02:53 PM |
| | 1863 | 08-01-2016 02:38 PM |
04-07-2016
03:56 PM
@hoda moradi This is slightly unrelated, but the following might be useful: I wrote a sample application that does some parsing and processing in Spark: https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html
04-07-2016
01:38 PM
Is that really true? You can use the LDAP interface for all frontends, and I think you can even do Linux->AD integration using LDAP as well: https://technet.microsoft.com/en-us/magazine/2008.12.linux.aspx#id0060006 I totally agree with you that this doesn't make much sense, but theoretically it should work, I would think.
04-07-2016
11:53 AM
1 Kudo
There are lots of resources out there, for example: http://info.hortonworks.com/rs/h2source/images/Hadoop-Data-Lake-white-paper.pdf

The main question is what you want to do consumption-wise. Hive has progressed a lot with Tez, ORC, predicate pushdown, a cost-based optimizer, etc., and should easily be able to outdo an Oracle DW for heavy analytical queries, unless you have a really expensive Exadata system (and even then it depends on the queries). It would be great, for example, for running daily/weekly aggregations. However, it will not yet be able to compete with Oracle for smaller interactive queries (LLAP is coming), and it will be hard to connect a reporting tool directly to it for interactive queries, for example.

So the question is what you plan to do with the warehouse. Do you connect to it directly with reporting software that runs in interactive mode? (Hive is not good for that yet, although Phoenix might be.) Or do you mostly use it for aggregation queries, cube generation, etc. (for example, creating a daily Tableau report that is then provided to users)? In that case you might be able to get rid of your warehouse entirely. (Some tools like Platfora provide a neat combination.)

Of course an Oracle warehouse provides some other advantages, like UPDATE/DELETE capabilities (Hive has ACID now, but it's pretty new), stored procedures, foreign-key relationships (though these are often disabled for big loads), integration with most common backup tools, etc. So there may be other reasons to keep a warehouse around as the main store. In that case you would go with one of the following scenarios:

- Data warehouse offloading: use Hadoop for workloads like heavy aggregations that take up too much of the warehouse's capacity and are better done in Hadoop.
- Landing zone: use Hadoop as the initial data landing and transformation zone before data goes into the warehouse. Utilize Hadoop's ability to store data in any format and to keep large amounts of data around.
- Hot archiving: archive old data into Hadoop so it is still accessible.
- Advanced analytics: run things like Spark ML or R on Hadoop that may not be possible in the warehouse, either technically or because of cost.
- Unstructured analytics: augment your warehouse with unstructured data (emails, social media data, ...) in Hadoop.
- Real-time ingestion: using Flume, Kafka, Spark Streaming, Storm.

So, summarizing, all of these approaches are valid:

- Source -> Hadoop (if you can completely replace the warehouse)
- Source -> Hadoop -> Warehouse (with Hadoop as an ETL landing zone, which gives you advantages like keeping source data around if desired, and reducing the need for massive ETL installations; you might still have them, but many can push heavy computations into Hadoop and so could be smaller)
- Source -> Warehouse -> Hadoop (warehouse offloading, archival), which has the advantage that you can keep the existing environment while reducing pressure on your warehouse.

Hope that helps.
04-06-2016
11:42 AM
No idea about the empty lines, sorry; I don't get those lines, as you can see. Just delete them with sed? (Search for "delete empty lines linux".) To add a string to the top of a file, you could also just do it from the outside with plain Linux, i.e.:

echo "The count is:" > out.txt
beeline .... >> out.txt

The double ">>" appends to the file.
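As a minimal sketch of the redirect-and-append pattern above (using printf to stand in for the actual beeline invocation, since beeline and the query are specific to your environment):

```shell
# Write the header line; a single ">" creates or truncates out.txt
echo "The count is:" > out.txt

# In the real case this step would be something like:
#   beeline -u jdbc:hive2://localhost:10000/default -e "select count(*) from t" >> out.txt
# Here printf stands in for the query output:
printf '42\n' >> out.txt

# ">>" appended, so the file now has the header followed by the output
cat out.txt
# prints:
# The count is:
# 42
```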
04-05-2016
06:22 PM
That article might help too: https://community.hortonworks.com/content/kbentry/25726/spark-streaming-explained-kafka-to-phoenix.html The GitHub link has a complete explanation of how to create the Kafka topic, start a producer for testing, etc.
04-05-2016
10:25 AM
How about this?

beeline -u jdbc:hive2://localhost:10000/default --silent=true --outputformat=csv2 -e "select * from sample_07 limit 10" > out.txt

[root@sandbox ~]# cat out.txt
sample_07.code,sample_07.description,sample_07.total_emp,sample_07.salary
00-0000,All Occupations,134354250,40690
11-0000,Management occupations,6003930,96150
11-1011,Chief executives,299160,151370
11-1021,General and operations managers,1655410,103780
11-1031,Legislators,61110,33880
11-2011,Advertising and promotions managers,36300,91100
11-2021,Marketing managers,165240,113400
11-2022,Sales managers,322170,106790
11-2031,Public relations managers,47210,97170
11-3011,Administrative services managers,239360,76370
04-05-2016
10:18 AM
So CASCADE works because it forces the deletion of all objects belonging to that object (similar to DELETE ... CASCADE for row deletes). Now the question is why your DROP FUNCTION did not work, and I don't know; we might have to look into the logs to figure that out. I have seen flakiness with functions in Hive before on an older version, so it might just be a bug, or a restart might be required, or something like that. But again, without logs it's hard to say.
04-05-2016
07:36 AM
1 Kudo
What does it say when you try to drop these functions? Just silent? Anything in the logs? You can also try the DROP ... CASCADE command:

DROP DATABASE IF EXISTS userdb CASCADE;
04-04-2016
04:37 PM
Or distcp/Falcon. It also depends on how you load the data; if that is well defined, they could just duplicate the load. NiFi would come in, in my opinion, in very specific scenarios where you have control over the data source. There is also WANdisco, but that would be a big change.
04-04-2016
03:07 PM
1 Kudo
Aaah, really the exact same datanodes? So not even two different data folders and configs, but the very same datanode with the same blocks pointing to two different namenodes? How would you expect that to work with file changes? Would they be merged? Or do you want the same HDFS, period, monitored by two Ambaris? I fail to see how any of that could even logically be possible. How about just implementing two queues instead: a research queue with a percentage of the cluster, and perhaps a folder with a quota as well? That sounds like about the same result.
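As a rough sketch of the two-queue idea, assuming the YARN Capacity Scheduler and hypothetical queue names `default` and `research` (the 70/30 split is just an example), capacity-scheduler.xml could contain something like:

```xml
<!-- Hypothetical example: split the cluster between a default and a research queue -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,research</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.research.capacity</name>
  <value>30</value>
</property>
```

For the folder side, an HDFS space quota on the research directory (e.g. via `hdfs dfsadmin -setSpaceQuota`) would give a similar storage cap.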