Member since: 05-16-2016
Posts: 785
Kudos Received: 114
Solutions: 39

My Accepted Solutions
Title | Views | Posted
---|---|---
| 1811 | 06-12-2019 09:27 AM
| 3006 | 05-27-2019 08:29 AM
| 5027 | 05-27-2018 08:49 AM
| 4406 | 05-05-2018 10:47 PM
| 2747 | 05-05-2018 07:32 AM
02-24-2017
05:21 AM
Use an event deserializer. You can use BlobDeserializer if you want to parse the whole file as a single event, or LINE for one event per line of text input. Refer to https://flume.apache.org/FlumeUserGuide.html#event-deserializers
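A minimal sketch of how that choice might look on a spooling directory source; the agent name "a1" and source name "src" are illustrative:
# One event per line of text input (the default):
a1.sources.src.deserializer = LINE
# Or wrap the whole file in a single event instead:
# a1.sources.src.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# a1.sources.src.deserializer.maxBlobLength = 100000000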
02-13-2017
08:34 PM
2 Kudos
In HDFS, you tell it which disks to use and it will fill up those disks. There is the ability to set how much space on those disks is reserved for non-DFS data, but it doesn't actually prevent the disks from being filled up. The issue at hand is that the smaller disks will fill up faster, so at some point they will not allow any more write operations and the cluster will have no way to balance itself out. This causes issues with HDFS replication and placement, along with hotspotting in MR, Spark, and any other jobs. Say, for instance, you primarily operate on the last day's worth of data for 80% of your jobs. At some point you will hit critical mass where those jobs are running mostly on the same set of nodes.

You could set the reserved non-DFS space to different values using Host Templates in CM. This would then at least give you a warning when you are approaching filling up the smaller disks, but at that point the larger disks would have free space that isn't getting used. This is why it is strongly encouraged not to have different hardware. If possible, upgrade the smaller set.

A possible option would be to use heterogeneous storage. With it you can designate pools, so the larger nodes would be in one pool and the smaller in the other. Each ingestion point would need to set which pool it would use, and you can set how many replicas go to each. This is a big architectural change though, and should be carefully reviewed to see if it benefits your use case(s) in any way.

So, simply: use the same hardware or you will more than likely run into issues.
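A hedged illustration of the heterogeneous storage idea using HDFS storage policies; the path /data/hot is hypothetical and the policy names are the standard built-in ones:
hdfs storagepolicies -listPolicies                                   # see which policies the cluster offers
hdfs storagepolicies -setStoragePolicy -path /data/hot -policy ALL_SSD   # pin this (hypothetical) path to SSD-backed volumes
hdfs storagepolicies -getStoragePolicy -path /data/hot               # confirm the policy took effect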
02-12-2017
04:22 PM
Thanks for your reply. Please find the Hive ORC-related config below:
hive.exec.orc.skip.corrupt.data=false
hive.exec.orc.default.row.index.stride=10000
02-05-2017
03:17 AM
1 Kudo
1. Check the user permissions on the jar file.
2. When you add the driver manually, do this: --driver org.postgresql.Driver
3. Please try the JDBC driver version below. Optionally, you can extract the jar file and check whether org.postgresql.Driver.class is present.
$ curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o postgresql-9.2-1002.jdbc4.jar
$ sudo cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop/
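One hedged way to do that check (these commands are an illustration, not from the original post, and assume a JDK is on the path for the jar tool):
$ jar tf postgresql-9.2-1002.jdbc4.jar | grep -i 'org/postgresql/Driver.class'   # confirm the driver class is inside the jar
$ ls -l /var/lib/sqoop/postgresql-9.2-1002.jdbc4.jar                             # confirm the copied jar is present and readable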
02-04-2017
09:22 PM
Hi Csguna, doing this for one or two columns is fine, but doing it for more than 200 columns is where I am stuck:
sqoop import ... --map-column-java id=String,value=Integer
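One hedged way to avoid typing 200 mappings by hand, assuming a hypothetical columns_to_map.txt file with one column name per line and that all of them map to String (adjust per column as needed); the remaining sqoop options are elided as in the snippet above:
$ MAPPING=$(sed 's/$/=String/' columns_to_map.txt | paste -sd, -)   # builds "col1=String,col2=String,..."
$ sqoop import ... --map-column-java "$MAPPING"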
01-30-2017
12:12 PM
@csguna It is authorized_keys, nothing to do with HDFS here, so it is user:linux group (instead of the hdfs group).
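A minimal sketch, assuming the Linux user is "myuser" (a hypothetical name); the file belongs to the login user and their own Linux group, with restrictive permissions:
$ chown myuser:myuser /home/myuser/.ssh/authorized_keys   # owned by the login user, not the hdfs group
$ chmod 700 /home/myuser/.ssh                             # sshd ignores the key if these are too open
$ chmod 600 /home/myuser/.ssh/authorized_keys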
01-17-2017
11:50 AM
On the setting changes: stats, as stated, will help with counts, since that info is precalculated and stored in the metadata. The CBO and stats also help a lot with joins. It is possible that the OS cache has more to do with the improvement if this was a subsequent run with little activity. You could look at Hive on Spark for better, more consistent performance: set hive.execution.engine = spark;

On the times, the big impact between job submission and start is the scheduler. That is a deep topic. It is best if you read up on the schedulers, review your settings, and ask any specific questions that come up, preferably in a new topic.

The other factor, not captured in the job stats, is the time it takes to return the results to the client. This will vary depending on the client and there isn't much to do about it. In general, small result sets can be handled by the Hive CLI. You can increase the client heap if needed. Otherwise use HS2 connections like Beeline or Hue.
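A hedged sketch of the stats and engine pieces from the Hive CLI; "my_table" is a hypothetical table name:
$ hive -e "SET hive.execution.engine=spark; \
    ANALYZE TABLE my_table COMPUTE STATISTICS; \
    ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;"
# Table-level stats feed fast counts; column stats feed the CBO's join planning.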
01-02-2017
01:19 PM
1 Kudo
Yes, go through with your process. It is granting more access, which is generally less risky. Also, it is the correct way to install Hadoop/CDH. https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_sg_cm_users_principals.html
11-21-2016
11:34 PM
Check the status of all the Impala and Hive daemons using the commands below. If any one of them is not running, please start it and then run INVALIDATE METADATA to refresh.
sudo service impala-state-store status
sudo service impala-catalog status
sudo service impala-server status
sudo service hive-metastore status
Note: if a service is not started, replace status with start.
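A hedged example of the refresh step, assuming impala-shell is installed and an Impala daemon is running on the local host:
$ impala-shell -q "INVALIDATE METADATA;"   # reload the catalog after the services are back up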