Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 6996 | 06-03-2019 09:31 PM
 | 1671 | 05-22-2019 02:38 AM
 | 2122 | 05-22-2019 02:21 AM
 | 1321 | 05-04-2019 08:17 PM
 | 1626 | 04-14-2019 12:06 AM
08-16-2016
12:20 PM
I added a comment to Lester's response above yesterday; it looks like you can't see it unless you click a link to show more details. At any rate, to answer your question: in this scenario the Pig scripts work fine in the Grunt shell. It appears something may have gotten corrupted in the Ambari view, as recreating it with the same settings (which were set up following the Ambari Views documentation) gave a view that is now working fine, so this issue is resolved for me. It may be worth noting, just for interest's sake, that this is the first time this Pig view has worked on this cluster. When it was originally set up with Ambari 2.1.0 there was an error that, on research, appeared to be related to a bug fixed in Ambari 2.2.0. After upgrading Ambari I got the error above. Now that the view has been recreated, it is working fine.
05-29-2016
05:20 PM
Hi @Lester Martin Take a look at this blog post, which describes the internal workings of textFile: http://www.bigsynapse.com/spark-input-output This PR discussion gives you the rationale for why the default values are what they are: https://github.com/mesos/spark/pull/718 Hope this helps.
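For reference, here is a minimal PySpark sketch of the effect (the path and partition counts are placeholders, and the exact split count still depends on the Hadoop InputFormat and the file's HDFS block layout):

```python
from pyspark import SparkContext

sc = SparkContext(appName="textFile-partitions-demo")

# With no second argument, textFile uses sc.defaultMinPartitions,
# which is min(defaultParallelism, 2).
rdd_default = sc.textFile("hdfs:///tmp/example.txt")
print(rdd_default.getNumPartitions())

# Passing minPartitions asks the InputFormat for at least that many
# splits (for splittable files), so you can end up with more partitions
# than HDFS blocks.
rdd_hinted = sc.textFile("hdfs:///tmp/example.txt", minPartitions=16)
print(rdd_hinted.getNumPartitions())
```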
05-28-2016
12:22 AM
3 Kudos
Answers by @Sagar Shimpi and @Lester Martin look pretty good to me. Some further explanations:
How do snapshots help with Disaster Recovery? What are the best practices around using snapshots for DR purposes? Especially trying to understand when data is stored directly on HDFS, Hive data, and HBase data.

If you're using distcp for DR (i.e., using distcp to copy data from one cluster to your backup cluster), you have the option to use snapshots for incremental backups and so improve distcp performance/efficiency. More specifically, you can take snapshots in both the source and the backup cluster and use the -diff option of the distcp command. Instead of blindly copying all the data, distcp will first compute the difference between the given snapshots and only copy that difference to the backup cluster.

As I understand it, no data is copied for snapshots; only metadata is maintained for the blocks added/modified/deleted. If that's the case, what happens when the command hdfs dfs -rm /data/snapshot-dir/file1 is run? Will the file be moved to the trash? If so, will the snapshot maintain the reference to the entry in the trash? Will trash eviction have any impact in this case?

Yes, if you have not skipped the trash, the file will be moved to the trash, and in the meantime you can still access the file using the corresponding snapshot path.

How do snapshots work along with HDFS quotas? For example, assume a directory with a quota of 1 GB and snapshotting enabled. Assume the directory is close to its full quota and a user deletes a large file to store some other dataset. Will the new data be allowed to be saved to the directory, or will the operation be stopped because the quota limits have been exceeded?

No. If the file belongs to a snapshot (i.e., the file was created before a snapshot was taken), deleting it will not release quota. You may have to delete some old snapshots or increase your quota limit. Also, in some older Hadoop versions the snapshots affect namespace quota usage in a surprising way, i.e., sometimes deleting a file can increase the quota usage. This has been fixed in the latest version of HDP.
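As a rough sketch of how that incremental flow can be scripted, assuming both directories are already snapshottable and both hold an identical snapshot s1 from the previous run (namenode addresses, paths, and snapshot names below are placeholders):

```python
import subprocess

SRC = "hdfs://source-nn:8020/data/warehouse"
DST = "hdfs://backup-nn:8020/data/warehouse"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Take a new snapshot on the source to capture its current state.
run(["hdfs", "dfs", "-createSnapshot", SRC, "s2"])

# 2. Copy only the difference between s1 and s2 instead of everything.
run(["hadoop", "distcp", "-update", "-diff", "s1", "s2", SRC, DST])

# 3. Snapshot the target too, so the next run has a matching s2 baseline.
run(["hdfs", "dfs", "-createSnapshot", DST, "s2"])
```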
01-20-2017
05:41 PM
We built a data warehouse around much the same idea you talk about in your article, integrating Salesforce and Google Analytics into a data warehouse @infocaptor http://www.infocaptor.com The benefit is that you can also correlate it with your financial data.
When you design against the GA API, you need to load the initial historical data for a certain date range. This has its own complications, as you might run into segmentation issues, loss of data, etc., and you need to handle pagination. Once the initial data load is complete, you then run in incremental mode, where you bring in only new data. This data gets appended to the same data warehouse tables and does not create duplicates with overlapping dates.
At a minimum you would need to design some kind of background daemon that runs every day or at some other frequency. You will need job tables to monitor the success and failure of the extracts so that a run can resume from where the error occurred. Some of the other considerations:
1. What happens if you run the extract for the same date range?
2. What if a job fails for certain dates?
It is important to set primary keys on your DW target tables. The extracted data is stored as CSV files, and these can be easily pushed to the Hadoop file system.
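As a rough illustration of the job-table bookkeeping described above, here is a hypothetical Python sketch; the sqlite job table and the fetch_and_load_day() placeholder stand in for whatever GA API client and warehouse loader you actually use:

```python
import sqlite3
from datetime import date, timedelta

db = sqlite3.connect("etl_jobs.db")
db.execute("""CREATE TABLE IF NOT EXISTS job_runs (
                  run_date TEXT PRIMARY KEY,  -- one row per extracted day
                  status   TEXT NOT NULL      -- 'ok' or 'failed'
              )""")

def last_loaded_date():
    row = db.execute(
        "SELECT MAX(run_date) FROM job_runs WHERE status = 'ok'").fetchone()
    return date.fromisoformat(row[0]) if row[0] else date(2016, 1, 1)

def fetch_and_load_day(day):
    # Placeholder: call the GA API for a single day and write a CSV keyed on
    # (date + dimension keys) so re-running a day upserts instead of duplicating.
    pass

def run_incremental():
    day = last_loaded_date() + timedelta(days=1)
    while day < date.today():
        try:
            fetch_and_load_day(day)
            db.execute("INSERT OR REPLACE INTO job_runs VALUES (?, 'ok')",
                       (day.isoformat(),))
        except Exception:
            db.execute("INSERT OR REPLACE INTO job_runs VALUES (?, 'failed')",
                       (day.isoformat(),))
            raise  # stop here; the next run resumes after the last 'ok' day
        finally:
            db.commit()
        day += timedelta(days=1)

run_incremental()
```

This pattern covers both considerations above: re-running the same date range upserts rather than duplicates, and a failed date is retried automatically because the job resumes from the last day marked 'ok'.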
05-19-2016
07:09 AM
Other very good ways to load data into HDFS are Flume and NiFi. "hadoop fs -put" is good, but it has some limitations and a lack of flexibility that might make it difficult to use in a production environment. If you look at the documentation of the Flume HDFS sink, for instance (http://flume.apache.org/FlumeUserGuide.html#hdfs-sink), you'll see that Flume lets you define how to rotate the files, how to name the files, etc. Other options can be defined for the source (your local text files) or for the channel. "hadoop fs -put" is more basic and doesn't offer those possibilities.
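For instance, a minimal agent configuration sketch along those lines (agent, channel, directory, and path names are placeholders; the property names come from the Flume user guide linked above):

```
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type     = spooldir
a1.sources.r1.spoolDir = /var/log/incoming
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type    = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path              = /data/landing/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix        = events
a1.sinks.k1.hdfs.fileType          = DataStream
# Roll files by time (seconds), size (bytes), or event count; 0 disables a trigger.
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize     = 134217728
a1.sinks.k1.hdfs.rollCount    = 0
```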
03-30-2018
01:04 PM
Is this the correct method to specify the path of the HDFS file? Is it the same for everyone?
03-09-2016
03:08 PM
3 Kudos
Looks like others have reported the same problem before. See http://grokbase.com/t/cloudera/hue-user/137axc7mpm/upload-file-over-64mb-via-hue and https://issues.cloudera.org/browse/HUE-2782. I do agree with HUE-2782's "we need to intentionally limit upload file size", as this is a web app and probably isn't the right tool once files reach a certain size. Glad to hear "hdfs dfs -put" is working fine. On the flip side, I did test this out a bit with the HDFS Files Ambari View that is available with the 2.4 Sandbox, and as you can see from the screenshot, user maria_dev was able to load an 80 MB file to her home directory via the web interface, as well as a 500+ MB file. I'm sure this Ambari View also has some upper limit. Maybe it is time to start thinking about moving from Hue to Ambari Views?
02-21-2018
01:24 PM
Can someone please help me with the below error while moving a file from my local directory (Mac) to the HDP sandbox?

[root@sandbox-hdp ~]# ssh root@127.0.0.1 -p 22
The authenticity of host '127.0.0.1 (127.0.0.1)' can't be established.
RSA key fingerprint is 9c:3a:83:80:2c:d5:1f:a9:41:48:68:96:4d:0f:bb:ed.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts.
root@127.0.0.1's password:
Last login: Fri Dec 15 17:45:44 2017 from 127.0.0.1
[root@sandbox-hdp ~]# scp -P 22 /Users/me/Hadoop/Input/movies_initial.csv.gz root@sandbox-hdp.hortonworks.com:root/Input
root@sandbox-hdp.hortonworks.com's password:
/Users/me/Hadoop/Input/movies_initial.csv.gz: No such file or directory
11-24-2015
06:20 AM
1 Kudo
Well, it seems 600 characters wasn't enough to supply the back story, @Peter Coates, so I added some high-level details to my original answer. I'd be glad to share more fine-grained info via email if you'd like.
11-14-2017
09:06 PM
If you keep using `globals()` this way, it keeps adding itself to itself, and you will eventually get one of the following errors: RuntimeError: maximum recursion depth exceeded while getting the repr of a list, or RuntimeError: dictionary changed size during iteration.
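A minimal reproduction of both failure modes under Python 3, assuming the pattern is something like re-snapshotting globals() on every run (the names here are purely illustrative):

```python
import sys

def snapshot():
    # Each dict(globals()) copy contains the previous "snap", one level deeper.
    globals()["snap"] = dict(globals())

for _ in range(sys.getrecursionlimit() + 10):
    snapshot()

try:
    repr(globals()["snap"])  # maximum recursion depth exceeded while getting the repr ...
except RecursionError as err:
    print(err)

try:
    for name in globals():             # iterating the live globals() dict...
        globals()[name + "_copy"] = 0  # ...while adding keys to it
except RuntimeError as err:
    print(err)  # dictionary changed size during iteration
```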