Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 6996 | 06-03-2019 09:31 PM
 | 1671 | 05-22-2019 02:38 AM
 | 2122 | 05-22-2019 02:21 AM
 | 1321 | 05-04-2019 08:17 PM
 | 1626 | 04-14-2019 12:06 AM
08-16-2016
12:20 PM
I added a comment to Lester's response above yesterday; it looks like you can't see it unless you click a link to show more details. At any rate, to answer your question: in this scenario the Pig scripts work fine in the Grunt shell. It appears something may have gotten corrupted in the Ambari view, as recreating it with the same settings (which were set up following the Ambari Views documentation) gave a view that is now working fine, so this issue is resolved for me. It may be worth noting, just for interest's sake, that this is the first time this Pig view has worked on this cluster. When it was originally set up with Ambari 2.1.0 there was an error that, on research, appeared to be related to a bug fixed in Ambari 2.2.0. After upgrading Ambari I got the error above. Now that the view has been recreated, it is working fine.
05-29-2016
05:20 PM
Hi @Lester Martin Take a look at this blog post, which describes the internal workings of textFile: http://www.bigsynapse.com/spark-input-output This PR discussion gives you the rationale for why the default values are what they are: https://github.com/mesos/spark/pull/718 Hope this helps.
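For reference, here is a minimal PySpark sketch of the effect (the path and partition counts are placeholders, and the exact split count still depends on the Hadoop InputFormat and the file's HDFS block layout):

```python
from pyspark import SparkContext

sc = SparkContext(appName="textFile-partitions-demo")

# With no second argument, textFile uses sc.defaultMinPartitions,
# which is min(defaultParallelism, 2).
rdd_default = sc.textFile("hdfs:///tmp/example.txt")
print(rdd_default.getNumPartitions())

# Passing minPartitions asks the InputFormat for at least that many
# splits (for splittable files), so you can end up with more partitions
# than HDFS blocks.
rdd_hinted = sc.textFile("hdfs:///tmp/example.txt", minPartitions=16)
print(rdd_hinted.getNumPartitions())
```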
05-28-2016
12:22 AM
3 Kudos
Answers by @Sagar Shimpi and @Lester Martin look pretty good to me. Some further explanations:
How do snapshots help with Disaster Recovery? What are the best practices around using snapshots for DR purposes? Especially trying to understand when data is stored directly on HDFS, Hive data, and HBase data.

If you're using distcp for DR (i.e., using distcp to copy data from one cluster to your backup cluster), you have the option to use snapshots for incremental backups and so improve distcp performance/efficiency. More specifically, you can take snapshots in both the source and the backup cluster and use the -diff option of the distcp command. Instead of blindly copying all the data, distcp will first compute the difference between the given snapshots and only copy that difference to the backup cluster.

As I understand it, no data is copied for snapshots; only metadata is maintained for the blocks added/modified/deleted. If that's the case, what happens when the command hdfs dfs -rm /data/snapshot-dir/file1 is run? Will the file be moved to the trash? If so, will the snapshot maintain the reference to the entry in the trash? Will trash eviction have any impact in this case?

Yes, if you have not skipped the trash, the file will be moved to the trash, and in the meantime you can still access the file using the corresponding snapshot path.

How do snapshots work along with HDFS quotas? For example, assume a directory with a quota of 1 GB and snapshotting enabled. Assume the directory is close to its full quota and a user deletes a large file to store some other dataset. Will the new data be allowed to be saved to the directory, or will the operation be stopped because the quota limits have been exceeded?

No. If the file belongs to a snapshot (i.e., the file was created before a snapshot was taken), deleting it will not release quota. You may have to delete some old snapshots or increase your quota limit. Also, in some older Hadoop versions the snapshots affect namespace quota usage in a surprising way, i.e., sometimes deleting a file can increase the quota usage. This has been fixed in the latest version of HDP.
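As a rough sketch of how that incremental flow can be scripted, assuming both directories are already snapshottable and both hold an identical snapshot s1 from the previous run (namenode addresses, paths, and snapshot names below are placeholders):

```python
import subprocess

SRC = "hdfs://source-nn:8020/data/warehouse"
DST = "hdfs://backup-nn:8020/data/warehouse"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Take a new snapshot on the source to capture its current state.
run(["hdfs", "dfs", "-createSnapshot", SRC, "s2"])

# 2. Copy only the difference between s1 and s2 instead of everything.
run(["hadoop", "distcp", "-update", "-diff", "s1", "s2", SRC, DST])

# 3. Snapshot the target too, so the next run has a matching s2 baseline.
run(["hdfs", "dfs", "-createSnapshot", DST, "s2"])
```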
01-20-2017
05:41 PM
We built a data warehouse around much the same idea you talk about in your article, integrating Salesforce and Google Analytics into a data warehouse @infocaptor http://www.infocaptor.com The benefit is that you can also correlate it with your financial data.
When you design against the GA API, you need to load the initial historical data for a certain date range. This has its own complications, as you might run into segmentation issues, loss of data, etc., and you need to handle pagination. Once the initial data load is complete, you then run in incremental mode, where you bring in only new data. This data gets appended to the same data warehouse tables and does not create duplicates with overlapping dates.
At a minimum you would need to design some kind of background daemon that runs every day or at some other frequency. You will need job tables to monitor the success and failure of the extracts so that a run can resume from where the error occurred. Some of the other considerations:
1. What happens if you run the extract for the same date range?
2. What if a job fails for certain dates?
It is important to set primary keys on your DW target tables. The extracted data is stored as CSV files, and these can be easily pushed to the Hadoop file system.
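As a rough illustration of the job-table bookkeeping described above, here is a hypothetical Python sketch; the sqlite job table and the fetch_and_load_day() placeholder stand in for whatever GA API client and warehouse loader you actually use:

```python
import sqlite3
from datetime import date, timedelta

db = sqlite3.connect("etl_jobs.db")
db.execute("""CREATE TABLE IF NOT EXISTS job_runs (
                  run_date TEXT PRIMARY KEY,  -- one row per extracted day
                  status   TEXT NOT NULL      -- 'ok' or 'failed'
              )""")

def last_loaded_date():
    row = db.execute(
        "SELECT MAX(run_date) FROM job_runs WHERE status = 'ok'").fetchone()
    return date.fromisoformat(row[0]) if row[0] else date(2016, 1, 1)

def fetch_and_load_day(day):
    # Placeholder: call the GA API for a single day and write a CSV keyed on
    # (date + dimension keys) so re-running a day upserts instead of duplicating.
    pass

def run_incremental():
    day = last_loaded_date() + timedelta(days=1)
    while day < date.today():
        try:
            fetch_and_load_day(day)
            db.execute("INSERT OR REPLACE INTO job_runs VALUES (?, 'ok')",
                       (day.isoformat(),))
        except Exception:
            db.execute("INSERT OR REPLACE INTO job_runs VALUES (?, 'failed')",
                       (day.isoformat(),))
            raise  # stop here; the next run resumes after the last 'ok' day
        finally:
            db.commit()
        day += timedelta(days=1)

run_incremental()
```

This pattern covers both considerations above: re-running the same date range upserts rather than duplicates, and a failed date is retried automatically because the job resumes from the last day marked 'ok'.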
05-19-2016
07:09 AM
Other very good ways to load data into HDFS are Flume and NiFi. "hadoop fs -put" is good, but it has some limitations and a lack of flexibility that might make it difficult to use in a production environment. If you look at the documentation of the Flume HDFS sink, for instance (http://flume.apache.org/FlumeUserGuide.html#hdfs-sink), you'll see that Flume lets you define how to rotate the files, how to name the files, etc. Other options can be defined for the source (your local text files) or for the channel. "hadoop fs -put" is more basic and doesn't offer those possibilities.
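For instance, a minimal agent configuration sketch along those lines (agent, channel, directory, and path names are placeholders; the property names come from the Flume user guide linked above):

```
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type     = spooldir
a1.sources.r1.spoolDir = /var/log/incoming
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type    = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path              = /data/landing/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix        = events
a1.sinks.k1.hdfs.fileType          = DataStream
# Roll files by time (seconds), size (bytes), or event count; 0 disables a trigger.
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize     = 134217728
a1.sinks.k1.hdfs.rollCount    = 0
```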
03-30-2018
01:04 PM
Is this the correct method to specify the path of the HDFS file? Is it the same for everyone?
03-09-2016
03:08 PM
3 Kudos
Looks like others have reported the same problem before. See http://grokbase.com/t/cloudera/hue-user/137axc7mpm/upload-file-over-64mb-via-hue and https://issues.cloudera.org/browse/HUE-2782. I do agree with HUE-2782's "we need to intentionally limit upload file size", as this is a web app and probably isn't the right tool once files reach a certain size. Glad to hear "hdfs dfs -put" is working fine. On the flip side, I did test this out a bit with the HDFS Files Ambari View that is available with the 2.4 Sandbox, and as you can see from the screenshot, user maria_dev was able to load an 80 MB file to her home directory via the web interface, as well as a 500+ MB file. I'm sure this Ambari View also has some upper limit. Maybe it is time to start thinking about moving from Hue to Ambari Views?
02-21-2018
01:24 PM
Can someone please help me with the below error while moving a file from my local directory (Mac) to the HDP sandbox?

[root@sandbox-hdp ~]# ssh root@127.0.0.1 -p 22
The authenticity of host '127.0.0.1 (127.0.0.1)' can't be established.
RSA key fingerprint is 9c:3a:83:80:2c:d5:1f:a9:41:48:68:96:4d:0f:bb:ed.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts.
root@127.0.0.1's password:
Last login: Fri Dec 15 17:45:44 2017 from 127.0.0.1
[root@sandbox-hdp ~]# scp -P 22 /Users/me/Hadoop/Input/movies_initial.csv.gz root@sandbox-hdp.hortonworks.com:root/Input
root@sandbox-hdp.hortonworks.com's password:
/Users/me/Hadoop/Input/movies_initial.csv.gz: No such file or directory
11-24-2015
06:20 AM
1 Kudo
Well, it seems 600 characters wasn't enough to supply the back story, @Peter Coates, so I added some high-level details to my original answer. I'd be glad to share more fine-grained info via email if you'd like.
11-14-2017
09:06 PM
If you keep using `globals()` this way, it keeps adding itself to itself, and you will eventually get one of the following errors: RuntimeError: maximum recursion depth exceeded while getting the repr of a list, or RuntimeError: dictionary changed size during iteration.
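A minimal reproduction of both failure modes under Python 3, assuming the pattern is something like re-snapshotting globals() on every run (the names here are purely illustrative):

```python
import sys

def snapshot():
    # Each dict(globals()) copy contains the previous "snap", one level deeper.
    globals()["snap"] = dict(globals())

for _ in range(sys.getrecursionlimit() + 10):
    snapshot()

try:
    repr(globals()["snap"])  # maximum recursion depth exceeded while getting the repr ...
except RecursionError as err:
    print(err)

try:
    for name in globals():             # iterating the live globals() dict...
        globals()[name + "_copy"] = 0  # ...while adding keys to it
except RuntimeError as err:
    print(err)  # dictionary changed size during iteration
```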