About avijeetd

avijeetd · ‎02-13-2017

Hi, I have a file that can be used for lookups during a data flow. How do I read the file and put it in the DistributedCache? Thanks, Avijeet

avijeetd · ‎02-03-2017

Thanks @Tibor Kiss. What is the industry practice for writing streaming data to both HDFS and another real-time store such as HBASE or Cassandra? Should we write to HDFS from the stream-processing layer (STORM, Spark Streaming), or should we write it separately using a separate consumer (KAFKA) or sink (Flume)? For some reason, writing from the stream-processing layer to HDFS doesn't sound right to me. Thanks, Avijeet

avijeetd · ‎02-03-2017

Hi All, I understand SOLR creates an index file and makes searches faster. However, I have a fundamental question: does SOLR store the data as well as the index? For example, if I have a table with 100 columns and I want an index on a few columns, will SOLR store all the table data so that it can show the full row on a search match, or can the full file stay in HDFS/HBASE, with SOLR looking it up to show the full row? So can there be an approach where the data is in HDFS and the primary/secondary indexes are in SOLR, and a search can find the full data in HDFS? And not only find it, but also update/delete it. Thanks, Avijeet

avijeetd · ‎02-03-2017

That's great @Tibor Kiss - I am trying to run a Spark Streaming job. How do I tell it to run in standalone cluster mode?

avijeetd · ‎02-02-2017

Thanks @Tibor Kiss - I am looking for more information about distributed mode. Is there a name for the cluster managers in Storm or Spark Streaming?

avijeetd · ‎02-02-2017

Hi All, most of the batch processing frameworks (MR, Spark) support a local mode and a distributed mode (standalone, YARN, Mesos) of deployment and execution. What about stream processing frameworks such as STORM and Spark Streaming? Do they manage the distributed mode on their own? Is it even realistic to expect them to work on YARN? How do we monitor a distributed Spark Streaming job? And do we need to specify the master as yarn to make it distributed? Thanks, Avijeet

avijeetd · ‎01-27-2017

Hi, I have been seeing stream processing use cases where HDFS is shown as part of streaming ingest, along with HBASE, Cassandra etc. Isn't HDFS meant to be written only with big files (64 MB/128 MB or more)? In Flume this is achieved with the hdfs.rollSize configuration, so Flume manages the buffer until it becomes big and then writes/flushes it out. How is this taken care of when writing from Spark Streaming or STORM? Thanks, Avijeet

avijeetd · ‎01-27-2017

Thanks @mqureshi - that answers my question. However, a number of components, such as Atlas and Falcon, have started using HBASE as a metadata store. How should these use cases be viewed?

avijeetd · ‎01-25-2017

Hi All, HIVE has been established as an analytics engine (SQL query processing) for large file-based data. How do the new features added to HIVE, such as ACID, streaming, updates etc., fit into the overall HIVE positioning? Is the idea to create an all-in-one DB on HIVE? Thanks, Avijeet

avijeetd · ‎01-23-2017

Thanks @mqureshi. Can you please confirm, for a cluster deployed without a VPC, whether there is any way to secure Hadoop with all these ports open? I am thinking of KNOX as one way. Is there anything else that can be done quickly? Also, will KNOX work without LDAP/AD? Regards, Avijeet

Last Visited	‎01-18-2021 12:06 AM

Member Since	‎06-09-2016 06:30 AM
Posts	185
Kudos received	22

Cloudera Community

Re: storm supervisor error

Re: Storm-hdfs - java.lang.RuntimeException: Error...

Re: zeppelin architecture

Re: Falcon server - cannot create cluster

Re: Hive Streaming

how to put data in PutDistributedMapCache

Re: streaming ingest to hdfs

SOLR - how to use it

Re: stream processing runtimes

Re: stream processing runtimes

stream processing runtimes

streaming ingest to hdfs

Re: HIVE and HBASE clusters

HIVE positioning

Re: ports required to be open