Member since: 09-01-2016
Posts: 44
Kudos Received: 3
Solutions: 0
04-20-2022
03:04 AM
Hello all, please reply to this ASAP. I am trying to install the VM on my PC, but the screen is stuck at the same point: "Extracting and loading the Hortonworks Sandbox". I have assigned 8 GB of RAM, and my laptop has 8 GB in total with a 7th-gen i5.
07-28-2018
06:28 AM
Refer to this thread: https://community.hortonworks.com/questions/1786/how-to-clean-up-purge-kafka-queues.html
09-19-2017
06:13 PM
@Gobi Subramani It sounds like what you are trying to do is auditing. If that is the case, create an audit table with columns such as source table name, target table name, records loaded, counts, and sums. Get the values from a SELECT statement and use them in an INSERT into the audit table; it should work fine, as sketched below. If it is not auditing, then you either have to append the data to a file, as I mentioned in the previous comment, or create a table and insert the data into the corresponding tables. Happy Hadooping!
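A minimal Hive sketch of that idea; audit_log, the staging/warehouse table names, and the amount column are hypothetical, so adjust them to your own schema:
-- Hypothetical audit table; add whatever metrics you need to track
CREATE TABLE IF NOT EXISTS audit_log (
  source_table   STRING,
  target_table   STRING,
  records_loaded BIGINT,
  total_amount   DOUBLE,
  load_time      TIMESTAMP
);
-- After each load, capture the counts and sums with a SELECT and insert them
INSERT INTO TABLE audit_log
SELECT 'staging.orders', 'warehouse.orders', COUNT(*), SUM(amount), current_timestamp()
FROM warehouse.orders;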
04-05-2017
01:14 PM
@Namit Maheshwari Yes, there is a pattern for creating partitions (yyyy-mm-dd). OK, so your idea is to run the command, store the result, and check for the existence of the partition? Is there any other simple way to check?
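For context, the check being discussed would look something like this (a minimal sketch only, assuming a hypothetical table named events partitioned by dt):
SHOW PARTITIONS events PARTITION (dt='2017-04-05');
This returns the partition if it exists and nothing otherwise.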
11-22-2016
03:15 PM
1 Kudo
@Gobi Subramani You are looking at it wrong. The Spark context is the main entry point into Spark: it is the connection to a Spark cluster and can be used to create RDDs, accumulators, etc. on that cluster. You can run in both cluster and local mode, and you define which one in the Spark context. The workers don't get the Spark context per se, but if you were to package your program into a jar, the cluster manager would be responsible for copying the jar file to the workers before it allocates tasks.
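A minimal Java sketch of that setup (the app name and master URL below are placeholders, not anything from your job):
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ContextExample {
    public static void main(String[] args) {
        // "local[*]" runs everything inside this JVM; on a cluster you would
        // pass e.g. "yarn" or a spark://host:7077 URL as the master instead.
        SparkConf conf = new SparkConf()
                .setAppName("context-example")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The driver holds the context; the tasks built from this RDD run on the workers.
        long count = sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
        System.out.println("count = " + count);

        sc.stop();
    }
}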
11-16-2016
07:23 AM
1 Kudo
This should be easy enough for you to test:
1. Insert values 1 to 40 for column user_id into table user_info_bucketed.
2. Insert around 400 more rows, with user_id values from 41 to 440.
3. Ideally, each bucket should then have about 19 rows, or around that.
4. You can then check which bucket file each row landed in with something like:
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 5;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 50;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 101;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 160;
Or you can check the physical location of the files on HDFS to determine the line counts. A minimal setup for this kind of test is sketched below.
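For reference, a sketch of such a test, assuming a hypothetical definition of user_info_bucketed with, say, 24 buckets (your real table definition may differ):
-- Hypothetical bucketed table
CREATE TABLE user_info_bucketed (user_id INT, name STRING)
CLUSTERED BY (user_id) INTO 24 BUCKETS
STORED AS ORC;
-- Older Hive versions need bucketing enforced explicitly on insert
SET hive.enforce.bucketing = true;
INSERT INTO TABLE user_info_bucketed VALUES (1, 'user_1'), (2, 'user_2'), (3, 'user_3');
-- Each user_id is hashed into one of the 24 bucket files, which is exactly
-- what INPUT__FILE__NAME exposes in the SELECT statements above.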
11-14-2016
10:59 AM
@Gobi Subramani In case of memory pressure, Spark will automatically evict RDD partitions from the workers in an LRU manner, unless you have explicitly persisted them with a disk-backed storage level. Depending on the memory available on each worker, LRU eviction happens independently on each worker node.
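A small Java sketch of how to influence that behaviour, continuing from a JavaSparkContext named sc as in the earlier sketch (the data is a placeholder; the classes used are org.apache.spark.api.java.JavaRDD and org.apache.spark.storage.StorageLevel):
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4));
// With the default MEMORY_ONLY level, cached partitions evicted under memory
// pressure are recomputed from lineage when needed again; MEMORY_AND_DISK
// spills them to local disk on the worker instead of dropping them outright.
rdd.persist(StorageLevel.MEMORY_AND_DISK());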
11-08-2016
01:23 PM
1 Kudo
HDP 2.3+ packages Sqoop 1.4.6, which allows direct import to HDFS as Parquet files by using --as-parquetfile. See: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html. If you import directly into a Hive table (rather than HDFS), you may need to do this as a two-step process (https://community.hortonworks.com/questions/56847/parquet-files-sqoop-import.html).
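A minimal sketch of such an import (the JDBC URL, credentials, table name, and target directory below are placeholders, not values from the original question):
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username sqoop_user -P \
  --table orders \
  --target-dir /data/orders_parquet \
  --as-parquetfile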
11-07-2016
02:44 PM
Hi Gobi, In your KafkaProducer constructor, you instantiate the class with a set of Properties, which should include a list of brokers. This gives the Producer knowledge of more than one server. If you only have one server listed then, yes, if that server goes down, your Producer will be unable to send any more messages. However, this scenario is unlikely if you follow the best practice of running more than one broker in your cluster. One benefit of configuring your Producer with a list of servers is that you can send messages without having to worry about the IP address of the particular server that will receive them. The topic to which you send your messages is defined in the ProducerRecord, which can be achieved with something like this:
Properties props = new Properties();
props.put("bootstrap.servers", "192.168.86.10:9092,host2:port,host3:port");
// The producer also needs serializers for the record key and value
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<String, String>("test-topic", "hello distributed commit log"));
Have a great day, Brian
10-05-2016
07:38 PM
@Gobi Subramani I would suggest that you download and install HDF (NiFi). It can handle creating the data flow for you. Here's an example of it collecting logs. Instead of writing to an event bus, you could use the PutHDFS processor and it would write the data to HDFS for you. There isn't a lot of trickery to get the date folders to work: you just use ${now()} in place of the folder name to get the directory scheme you are looking for. If you look around, there are lots of walkthroughs and templates. I have included a pic of a simple flow that would likely solve your issue.
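As a rough example of that, the Directory property of PutHDFS could be set to an expression like the one below (the /data/logs base path is just a placeholder), so each flow file lands in a per-date folder without any extra processors:
/data/logs/${now():format('yyyy-MM-dd')}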