Member since: 06-28-2017
Posts: 279
Kudos Received: 43
Solutions: 24

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1988 | 12-24-2018 08:34 AM
 | 5362 | 12-24-2018 08:21 AM
 | 2217 | 08-23-2018 07:09 AM
 | 9710 | 08-21-2018 05:50 PM
 | 5144 | 08-20-2018 10:59 AM
02-27-2018
08:51 AM
1 Kudo
@Ravikiran Dasari: do you have any experience with shell scripts? Do you know how to write a bash script, or maybe a Python script? Do you have a special scheduler that you use in your environment, or will you use cron? If you don't know, I guess it will be cron. Check whether you are able to edit the crontab by entering this command on the shell:

crontab -e

Either you get a list of cron jobs, or an error message like 'You (<<userid>>) are not allowed to use this program (crontab)'.

Now, when you want to write a shell script, the starting point is a simple text file containing the commands you would otherwise enter on the shell. The script file should start with a line (aka 'shebang') naming the script interpreter to be used, i.e. on RedHat:

bash: #!/bin/bash
python: #!/usr/bin/python
perl: #!/usr/bin/perl

If you decide to go for a bash script, just create a file like this (you can use a different editor if you like):

vi ~/mycopyscript

Enter in that script all your commands:

#!/bin/bash
dir=`date +%Y-%m-%d`
sftp ayosftpuser@IPaddress << __MY_FTP_COMMANDS__
password
cd /sourcedir
get -Pr ${dir}
bye
__MY_FTP_COMMANDS__
#at this point the files should already be locally copied
hadoop fs -put -f ${dir} /destination
Save the script (by entering <ESC>:wq in vi). Next, make the script executable and only allow access by the owner (you):

chmod 700 ~/mycopyscript

You should be able to execute it now:

~/mycopyscript

This script is just a starting point and kept deliberately simple: there is no error handling, no security (whoever can read the script also sees the password), and no parameters (you must execute it on the date that matches the directory name). Still, it should give you the basic idea of a shell script.
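If it ends up being cron, the scheduling part could look like the sketch below. This is only a minimal example; the script path /home/youruser/mycopyscript and the daily 02:00 schedule are assumptions to adapt to your environment.

# open your crontab for editing
crontab -e

# add a line like this: run the copy script every day at 02:00,
# appending stdout and stderr to a log file
0 2 * * * /home/youruser/mycopyscript >> /home/youruser/mycopyscript.log 2>&1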
02-26-2018
01:33 PM
Perhaps you can provide some context on why you think an hdfs dfs -du is needed at the start of each job? In any case, I am sure that Spark will not run hdfs dfs -du automatically at job start: a Spark job doesn't necessarily access hdfs at all, and Spark can also be operated without hdfs.
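If you do need the directory size before each job, you can run the command yourself, e.g. in the wrapper script that submits the Spark job. A minimal sketch; the path /data/mydataset is just a placeholder:

# print the summarized, human-readable size of an hdfs directory
hdfs dfs -du -s -h /data/mydataset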
02-20-2018
12:23 PM
yes, that's correct. The only interesting point is how your client works. If it is a proper Hadoop client, it will run directly and in parallel on the nodes storing the file blocks. If you have a non-Hadoop client, it will really retrieve the full file from hdfs, process it, and write it back to hdfs. In a Spark application, each of the steps a-d will be executed in parallel on different nodes, while the Hadoop framework takes care of bringing the execution to the data. And if the stripes are unluckily distributed over the blocks (and therefore the hdfs nodes), the data transfer between the nodes is much higher than if the stripes are well distributed. But this is exactly because the stripes are created independently of the blocks. It is also the key to optimizing the stripe size (together with your usage pattern).
02-20-2018
11:53 AM
I posted my answer in parallel, without noticing yours. Sorry for the redundant info.
02-20-2018
11:51 AM
Are you looking for the scheduling or for how to script it? Should the files be copied to hdfs as soon as they arrive, or at a fixed frequency, i.e. daily, hourly, etc.? The best way depends on the tools and knowledge you have. It could be done with a plain shell script, but also with NiFi. Spark also has an FTP connector. Here is a post on how to solve it with NiFi: https://community.hortonworks.com/questions/70261/how-to-read-data-from-a-file-from-remote-ftp-serve.html
02-20-2018
11:37 AM
On the functional level, the logic of the ORC file (stripes and indexes) is independent of the distribution of the blocks in hdfs. So on the application level you simply read the ORC file and write the data to the next ORC file. The application handling the ORC format knows at which position/byte in the file the data is stored, while hdfs knows which bytes of the file are stored on which node. If you now read the complete file, all nodes will deliver their blocks. If this data is then simply stored again in an hdfs file, hdfs will typically decide how many blocks to use and how to distribute them across the nodes, so the data gets transferred from the source nodes to the target nodes. This is transparent to the user. The application only has to take care to write the data correctly into the new ORC file if the stripes are defined differently.
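Just as an illustration (the table names source_orc and target_orc are hypothetical): rewriting ORC data through Hive lets the ORC writer lay out new stripes, while hdfs independently decides how the resulting file is split into blocks and where they are placed.

# copy the data into a new ORC table; the ORC writer handles the stripes,
# hdfs handles block size and placement of the files it writes
hive -e "CREATE TABLE target_orc STORED AS ORC AS SELECT * FROM source_orc;"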
02-20-2018
09:08 AM
Perhaps some more details could help. Do you want to query the data after you have created a table in Hive?
02-19-2018
03:03 PM
1 Kudo
A block will not be divided into stripes. The HDFS block is the lowest level and is split independently of the file format, purely based on the number of bytes. Binary files, text files, and ORC files alike are split into blocks. With the ORC file format you will have several stripes within the file. HDFS splits the file without considering the stripe layout, so one HDFS block may contain part of a stripe, a complete stripe, or even multiple stripes. Of course, by setting proper values you can optimize this, as described here: https://community.hortonworks.com/articles/75501/orc-creation-best-practices.html
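For example, the stripe size can be influenced per table when the ORC table is created. A minimal sketch, assuming a Hive table named mytable; the 64 MB stripe size ('orc.stripe.size' in bytes) and ZLIB compression are just illustrative values:

# create an ORC table with an explicit stripe size and compression codec
hive -e "CREATE TABLE mytable (id INT, payload STRING)
         STORED AS ORC
         TBLPROPERTIES ('orc.stripe.size'='67108864', 'orc.compress'='ZLIB');"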
02-19-2018
07:54 AM
The log.dirs directory is the persistence layer for the Kafka broker (acting as a server process). So when you create a topic (no matter with how many partitions), it is first created in memory, then on disk, and only after the structure is created on disk is the topic creation confirmed to the client. You can roughly consider the memory as the broker's cache and the disk as the persistent storage.

Consumers do not directly access the data stored on disk. Consumers (and producers as well) always communicate with the broker; all the disk I/O to log.dirs is done by the broker. And consumers do not share memory with the broker or the producer.

No, as I mentioned, no client (consumer or producer) accesses the broker's file system directly. But if the broker purges a message, it purges it from memory and from disk. As mentioned, the memory is a kind of cache for the data stored on disk.

This last point is a bit more tricky: in almost any Linux, when you create a file or write data to disk, it is first cached and only physically written when a flush occurs (which also happens when the file is closed) or when a defined number of changed bytes have been cached. So even if an application has written data to disk, it might physically still be in memory, because the filesystem does the caching. If your application crashes, all data will still be written to disk, but if your OS crashes, you might lose data that is in the filesystem cache.
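To see this persistence layer in action, you can create a topic and then look into the broker's log.dirs on disk. A minimal sketch for an older Kafka release that still takes the --zookeeper flag; the host, the topic name mytopic, and the /kafka-logs path are assumptions:

# create a topic with 3 partitions on a single-broker test setup
kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 3 --topic mytopic

# the broker persists one directory per partition under log.dirs,
# e.g. mytopic-0, mytopic-1, mytopic-2
ls -d /kafka-logs/mytopic-*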
02-15-2018
07:10 PM
1 Kudo
Try searching in the dataset repos mentioned here: https://www.datasciencecentral.com/profiles/blogs/great-sensor-datasets-to-prepare-your-next-career-move-in-iot-int