Member since: 05-02-2017
360 Posts
65 Kudos Received
22 Solutions

My Accepted Solutions

Title | Views | Posted
---|---|---
 | 8286 | 02-20-2018 12:33 PM
 | 657 | 02-19-2018 05:12 AM
 | 950 | 12-28-2017 06:13 AM
 | 4678 | 09-28-2017 09:25 AM
 | 8288 | 09-25-2017 11:19 AM
12-12-2018
04:29 PM
It's a good approach, but the one disadvantage I can see is the number of hops needed to reach the desired result. Instead of performing joins, we can apply a windowing function to achieve the same thing in a single hop, assuming you have a unique key column and a last-modified date in your scenario; a sketch is shown below.
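As an illustration only (the table name `events` and the columns `id` and `last_modified` are assumptions, not taken from the original thread), a single-pass dedup with a window function in Spark could look like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object LatestPerKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("latest-per-key").enableHiveSupport().getOrCreate()

    // Hypothetical source table: one row per change, keyed by `id`,
    // with a `last_modified` timestamp column.
    val events = spark.table("events")

    // Rank the rows within each key by recency and keep only the newest one,
    // which avoids the extra join/aggregation hop.
    val byKeyNewestFirst = Window.partitionBy(col("id")).orderBy(col("last_modified").desc)
    val latest = events
      .withColumn("rn", row_number().over(byKeyNewestFirst))
      .filter(col("rn") === 1)
      .drop("rn")

    latest.show()
    spark.stop()
  }
}
```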
12-11-2018
05:18 PM
Hi Joe, one option is to increase the heap size and verify. But you have already mentioned that the heap size provided is more than enough, so try clearing anything unnecessary from the NameNode, as that is one possible cause of this issue. Hope it helps!!
12-11-2018
07:45 AM
Hi @SP Lots of small files in the cluster hurts cluster health at the block size you have. Before looking at changing the block size, check whether there is a possibility of combining the files. If there are similar sets of files that can be combined, do that first so the file sizes become reasonable (see the sketch below). Also, does the cluster hold only these small files of less than 1 MB? If so, it is worth thinking about changing the block size. But if you also have big files that span multiple splits, then instead of changing the block size you should think about combining the small files, as mentioned earlier. Alternatively, if you have separate clusters for hot/warm/cold data and these files belong to cold data, you could very well reduce the block size, but that defeats the aim of HDFS, which works best as a distributed system. Also, if the block size is reduced, you may need to revisit other configuration parameters such as mapper size, reducer size, input split size, etc.
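A minimal compaction sketch, assuming the small files are plain text and live under a hypothetical /data/raw/small directory (the paths and the target file count are illustrative, not from the original question):

```scala
import org.apache.spark.sql.SparkSession

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

    // Read the many small files as a single dataset...
    val raw = spark.read.text("/data/raw/small/*")

    // ...and rewrite them as a handful of larger files, so each output file
    // ends up closer to the HDFS block size.
    raw.coalesce(4)
      .write
      .mode("overwrite")
      .text("/data/compacted")

    spark.stop()
  }
}
```

The same idea works for ORC/Parquet data by swapping the reader and writer formats.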
12-11-2018
07:26 AM
Hi @harsha vardhan Could you explain a bit more? Yes, you can override the queue whenever you want, but it also depends on the user/group access. If the user is assigned to specific groups and those groups have not been given privileges on any other queue, it will not be possible until the proper access is granted to the user's groups. But if you do have access to multiple queues, you can pass the queue name as a parameter to the Sqoop job, and if the queue name has to change you can handle that with a combination of shell + Sqoop, as sketched below.
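A rough sketch of parameterising the queue name, driven here from Scala via scala.sys.process in the same way a shell wrapper would call it (the JDBC URL, table and queue names are placeholders, not values from this thread):

```scala
import scala.sys.process._

object SqoopWithQueue {
  def main(args: Array[String]): Unit = {
    // Queue name passed in at run time, so the same job can target
    // different YARN queues without editing the Sqoop command itself.
    val queue = if (args.nonEmpty) args(0) else "default"

    val cmd = Seq(
      "sqoop", "import",
      s"-Dmapreduce.job.queuename=$queue",      // generic -D options go right after the tool name
      "--connect", "jdbc:mysql://dbhost/sales", // placeholder JDBC URL
      "--table", "orders",                      // placeholder table
      "--target-dir", "/data/sqoop/orders",
      "--num-mappers", "4"
    )

    // Run Sqoop and fail loudly on a non-zero exit code.
    val exitCode = cmd.!
    require(exitCode == 0, s"sqoop import failed with exit code $exitCode")
  }
}
```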
04-24-2018
12:52 PM
Hi @vivekananda chagam Once the file has been loaded from the directory, use: cnt = FOREACH (GROUP data ALL) GENERATE COUNT(data);
04-23-2018
06:35 AM
Hi @Swaapnika Guntaka When you delete data from HDFS, it is moved to the Trash. However, the Trash is flushed at a regular interval; once it has been flushed, there is no way to recover the data unless you have DR in place, which is usually only the case in a production environment. Hope it helps!!
04-12-2018
04:13 PM
@johny gate Yes, col1 etc. are the column names; in your case name, date & amount. Yes, it should work for that scenario.
04-12-2018
07:18 AM
Hi @David Sandoval This isn't enough to work out the problem. Please paste your query and the complete logs so the error can be understood; only then will people be able to answer your question.
04-12-2018
07:15 AM
Hi @johny gate The query below works, but it's kind of dirty. Hope it helps!
select a.*, tblb.col3
from a
left join (select *, lag(col3) over (partition by col1 order by col2) as lag_val from a) tblb
  on tblb.col1 = a.col1 and a.col2 = tblb.lag_val
04-12-2018
07:11 AM
Hi @johny gate The query below works, but it's kind of dirty. Hope it helps!!
select *
from a
left join (select *, lag(col3) over (partition by col1 order by col2) as lag_val from a) tblb
  on tblb.col1 = a.col1 and a.col2 = tblb.lag_val
04-06-2018
12:56 PM
Hi @Simran Kaur I don't think that is possible in the email action beyond triggering an email through Oozie. I would suggest going with a shell script, where you can perform whatever you need, and triggering it from Oozie.
04-06-2018
12:48 PM
Hi @Subramaniam Ramasubramanian You will have to start by looking into the executor failures. You said this job was working fine earlier and that you only recently started facing this issue; in that case I believe the maximum number of executor failures was set to 10 and the job used to stay under it, but now the number of executor failures has started exceeding 10. Executor failures can also be due to resource unavailability, so consider the cluster resource and memory availability at the time your job runs as well. A sketch of raising the threshold is shown below. Hope it helps!
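Purely as an illustration (the value 20 is made up, and on YARN cluster mode this property is normally passed with spark-submit --conf before the application starts rather than set in code), raising the threshold might look like:

```scala
import org.apache.spark.sql.SparkSession

object TolerantJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tolerant-job")
      // Allow more executor failures before YARN gives up on the application.
      // Prefer fixing the underlying resource/memory problem first.
      .config("spark.yarn.max.executor.failures", "20")
      .getOrCreate()

    // ... the actual job logic would go here ...

    spark.stop()
  }
}
```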
04-06-2018
12:43 PM
1 Kudo
Hi @Geir Fredheim Does the process write a huge number of files into the HDFS directory? What is the HDFS block size, and what is the size of the files being created in the target directory? If the files are well below the block size and there is a huge number of them, you need to look into that: when a huge number of files is created, it ends up being a bottleneck for the process. I'm not sure how DataStage is handling the inserts, but do check how many MapReduce jobs are created and tune them based on the size of the files. A quick way to inspect the target directory is sketched below. Hope it helps!!
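A small sketch for checking the file sizes against the HDFS block size (the /data/target path is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object InspectTargetDir {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val dir = new Path("/data/target") // placeholder target directory

    // List every file, compare its size with its HDFS block size,
    // and count how many fall well below the block size.
    val statuses = fs.listStatus(dir).filter(_.isFile)
    val tiny = statuses.count(s => s.getLen < s.getBlockSize / 10)

    statuses.foreach { s =>
      println(f"${s.getPath.getName}%-40s ${s.getLen}%12d bytes (block size ${s.getBlockSize})")
    }
    println(s"${statuses.length} files total, $tiny of them smaller than a tenth of the block size")
  }
}
```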
03-16-2018
06:16 AM
1 Kudo
@Timothy Spann If open source is the priority, then I would go with Hive using MERGE. Though I haven't tried MERGE with huge volumes, I believe it would perform decently.
03-15-2018
05:36 AM
@Timothy Spann I would go with either Attunity or some utility/framework that can be modified depending on the use case. These kinds of frameworks reduce time and effort, and multiple tables can be processed in parallel with little extra work.
02-28-2018
05:11 AM
1 Kudo
@Elena Lauren Happy Hadooping!!
02-23-2018
12:12 PM
These links will help you gain knowledge at a high level; however, you will have to dive deeper if you want to know more. https://datajobs.com/what-is-hadoop-and-nosql https://it.toolbox.com/blogs/maryannrichardson/hadoop-or-nosql-what-is-the-difference-113016
02-23-2018
12:10 PM
1 Kudo
@Elena Lauren Let me put it in the shortest and simplest way. Hadoop is storage where you can keep structured, semi-structured and unstructured data. Its usage ranges from batch to streaming, handling huge amounts of data, and it has different services for specific use cases. For example, Hive is somewhat similar to an RDBMS but sits on top of Hadoop: we can create structured tables, and even flat files, CSVs and a few other semi-structured formats can be handled. So what do you do if you have to store documents that should still be easily accessible without involving the data people? And what happens when your data keeps changing frequently? In such cases you will not be able to handle it in Hive; instead you would choose HBase if you are staying within Hadoop, or another NoSQL platform such as MongoDB or CouchDB, which is what uniquely defines them as NoSQL. Hope it helps!!
02-22-2018
06:36 AM
Hi @yassine sihi One way I can think of is to import the database from one cluster to the other using Sqoop, which is entirely possible. Converting to CSV and then performing the workaround again is a time-consuming process. Hope it helps!!
02-21-2018
05:20 AM
Hi @hippagun It won't work. Even though it's ORC, Hive can only differentiate the columns based on the delimiter you specified during table creation, so no matter how you re-create it, it won't work. There are two options now: 1) Create another external table with the additional columns, write a simple query to load the records from the old table into the new one, supplying NULL for the newly added columns (a sketch is shown below), and once that is done drop the old table and use the new table going forward. This works well for ORC. 2) Alternatively, if the schema of the table changes frequently, it is better to go with an Avro table, as schema changes can be handled easily there. You have to follow the step above just the first time; whenever the schema changes again in the future, you only need to alter the schema file and nothing else. You can refer to this Link to understand how schema changes are handled with Avro files. Hope it helps!!
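As an illustration of option 1 only (the table and column names old_table, new_table, col1, col2 and new_col are placeholders, and the statements are run here through Spark SQL rather than the Hive CLI):

```scala
import org.apache.spark.sql.SparkSession

object MigrateOrcTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("migrate-orc-table")
      .enableHiveSupport()
      .getOrCreate()

    // New ORC table with the extra column added at the end.
    spark.sql("""
      CREATE TABLE new_table (
        col1 STRING,
        col2 INT,
        new_col STRING
      )
      STORED AS ORC
    """)

    // Copy the existing rows across, filling the new column with NULL.
    spark.sql("""
      INSERT INTO new_table
      SELECT col1, col2, CAST(NULL AS STRING) AS new_col
      FROM old_table
    """)

    // Once the data is verified, the old table can be dropped:
    // spark.sql("DROP TABLE old_table")

    spark.stop()
  }
}
```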
02-20-2018
12:33 PM
Hi @Ravikiran Dasari If this is for knowledge purposes, then what I'm going to say adds no more information than the previous answers; but if you are looking for something work-related, this answer might help a bit. Have a file watcher that looks for a file with the particular pattern that has to be FTP'ed to the desired location. Once the file arrives, you can move it to the HDFS server. This can be accomplished with a simple shell script that requires only basic shell knowledge (a rough sketch of the idea is below). It can also be done as either push or pull: if you have other downstream jobs that have to execute once the file arrives in HDFS, I would recommend the pull approach, so that you can kick off any other Hadoop/Hive/Pig/Spark jobs on the HDFS server. Hope it helps!!
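The original suggestion is a plain shell script; purely as an illustration of the same polling idea, here is a sketch using the Hadoop FileSystem API (the /landing directory, the file-name pattern and the HDFS target path are all placeholders):

```scala
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FileWatcher {
  def main(args: Array[String]): Unit = {
    val landingDir = new File("/landing")       // placeholder local directory the file is FTP'ed to
    val pattern = "sales_.*\\.csv".r            // placeholder file-name pattern
    val hdfsTarget = new Path("/data/incoming") // placeholder HDFS directory
    val fs = FileSystem.get(new Configuration())

    // Poll the landing directory; when a matching file shows up,
    // push it to HDFS and remove the local copy.
    while (true) {
      val matches = Option(landingDir.listFiles())
        .getOrElse(Array.empty[File])
        .filter(f => f.isFile && pattern.pattern.matcher(f.getName).matches())

      matches.foreach { f =>
        fs.copyFromLocalFile(new Path(f.getAbsolutePath), new Path(hdfsTarget, f.getName))
        f.delete()
        println(s"moved ${f.getName} to $hdfsTarget")
      }

      Thread.sleep(60000) // check once a minute
    }
  }
}
```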
02-19-2018
05:12 AM
Hi @Lanic When you submit a job, it is YARN that provides the information about resources. The driver gets the HDFS data-location information needed to execute the job from the NameNode; then the nearest available resources, those closest to the data, are taken into consideration when deciding where the tasks will execute. It is the NameNode that gives YARN the information about the HDFS data locations. Once all the jobs are completed, the status of all the jobs is updated and the corresponding metastore is brought back in sync. Hope it helps!!
01-30-2018
09:52 AM
Apart from specifying the number of partitions when creating a DataFrame, or using coalesce/repartition, is there any configuration or parameter that can be changed so the default number of partitions (200) is reduced? @Dinesh Chitlangia could you help me with this? (The options I mean are sketched below.)
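For reference, a sketch of the options mentioned above together with the configuration properties that control the defaults (the values are illustrative; spark.sql.shuffle.partitions affects DataFrame/SQL shuffles, while spark.default.parallelism affects RDD operations):

```scala
import org.apache.spark.sql.SparkSession

object PartitionKnobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-knobs")
      // Default number of partitions after a DataFrame/SQL shuffle (200 out of the box).
      .config("spark.sql.shuffle.partitions", "50")
      // Default parallelism for RDD operations when no partition count is given.
      .config("spark.default.parallelism", "50")
      .getOrCreate()
    val sc = spark.sparkContext

    // Explicit partition count at creation time.
    val rdd = sc.parallelize(1 to 1000, numSlices = 10)

    // Shrinking or growing an existing dataset.
    val fewer = rdd.coalesce(5)
    val more  = rdd.repartition(20)

    println(s"${rdd.getNumPartitions} -> ${fewer.getNumPartitions} / ${more.getNumPartitions}")
    spark.stop()
  }
}
```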
Labels:
- Apache Hadoop
- Apache Spark
01-30-2018
05:13 AM
Hi, I have a set of questions about Spark that I'm trying to understand:
1. What is the best compression codec to use in Spark? In Hadoop we should not use gzip compression unless it is cold data, where input splits are of very little use. But if we were to choose another compression (LZO/bzip2/Snappy, etc.), based on what parameters do we need to choose?
2. Does Spark make use of input splits if the files are compressed?
3. How does Spark handle compression compared with MapReduce?
4. Does compression increase the amount of data being shuffled?
Thanks in advance!!
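Not an answer to the questions above, just a sketch of where the codec choice is made when writing output from Spark (the paths and the snappy codec are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object CodecDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("codec-demo").getOrCreate()

    val df = spark.read.text("/data/input") // placeholder input path

    // Splittability matters at read time; at write time the codec is simply
    // an option on the writer. "snappy" here is illustrative.
    df.write
      .option("compression", "snappy")
      .parquet("/data/output_parquet")

    // Shuffle and spill compression is controlled separately via
    // spark.io.compression.codec (lz4 by default in recent versions).
    spark.stop()
  }
}
```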
Labels:
- Apache Hadoop
- Apache Spark
01-24-2018
05:10 AM
Hi @buihuuhieu buihuuhieu To access HDFS, type 'hadoop fs -ls'. To access Hive, type 'hive'; you will be logged into the Hive shell, where you can query the sample databases and the files that already exist in HDP. Hope it helps!!
01-10-2018
05:58 AM
In Spark we have RDDs, and there are options to persist an RDD if we use it in multiple steps of the code. In general an RDD holds a lineage graph and, with lazy evaluation, is computed only when it is needed. Now, if I want to persist an RDD, I can choose the persist option to store the RDD's data. I believe the persisted RDD data is stored on the node where it is computed; if that is the case, then all the RDD data resides on one single node. If I then make use of the persisted RDD in other lines of the code, does it really use distributed computing (assuming all the data is stored on a single node)? Is my understanding right? If it is wrong, could someone help me understand?
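For reference only, this is the persist call being asked about (the path and storage level are illustrative); the open question is about where the cached partitions end up living:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-demo").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.textFile("/data/input")     // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .persist(StorageLevel.MEMORY_AND_DISK) // cache the partitions after the first computation

    // Both actions reuse the cached partitions instead of re-reading the input.
    println(rdd.count())
    println(rdd.reduceByKey(_ + _).count())

    spark.stop()
  }
}
```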
Labels:
- Apache Spark
01-05-2018
05:26 AM
This would also work:
import java.io.File
def getListOfFiles(dir: File): List[File] = dir.listFiles.filter(_.isFile).toList
val files = getListOfFiles(new File("/tmp"))
01-05-2018
05:23 AM
@Chaitanya D It is possible with a Unix and Spark combination.
hadoop fs -ls /filedirectory/*txt_processed
The command above will return the file you need; then pass the result to Spark and process the file as required. Alternatively, from within Spark you can select the desired file using the command below (after import scala.sys.process._):
val lsResult = Seq("hadoop", "fs", "-ls", "hdfs://filedirectory/*txt_processed").!!
Hope it helps!
01-04-2018
01:28 PM
@rahul gulati I assume you mean Hive jobs when you mention hive.cli. When jobs are stuck, it does not necessarily mean it is because of resource availability; in many cases it is related to the data being handled by the Hive/Spark jobs. Are you facing this issue only when running the same set of queries in Hive and Spark SQL? If that is the case, then it is definitely related to the data. When the Hive jobs run, do you see a few reducers running for a very long time? In that case a few reducers are being loaded with a huge amount of data; check the reason for that accumulation and distribute the data more evenly. Hope it helps!!
01-03-2018
12:05 PM
@Alexandros Biratsis I believe you are not using INSERT OVERWRITE when inserting the incremental records into the target; assuming that, it is odd that the data is being overwritten. Regarding the union part: if you want to avoid the union, you may have to perform a left join between the incremental data and the target to apply the transformations (assuming you are performing SCD type 1); a sketch is shown below. If you just want to append the data, you can insert the incremental data into the target through multiple queries, one by one, but if you insert multiple times the number of jobs will be higher, which is more or less equivalent to performing the union. Sorry for the late reply.
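A rough sketch of the left-join idea under SCD type 1 assumptions (the table names target and incremental and the columns id, name, amount are placeholders, not taken from the original thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col}

object Scd1LeftJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scd1-left-join")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder tables: target(id, name, amount) and incremental(id, name, amount).
    val target = spark.table("target").alias("t")
    val inc    = spark.table("incremental").alias("i")

    // Take the incremental value where one exists, otherwise keep the current one.
    // This assumes the feed only carries updates to keys that already exist in the
    // target; brand-new keys would need a full outer join instead.
    val merged = target
      .join(inc, col("t.id") === col("i.id"), "left")
      .select(
        col("t.id").as("id"),
        coalesce(col("i.name"), col("t.name")).as("name"),
        coalesce(col("i.amount"), col("t.amount")).as("amount")
      )

    // Write to a staging table rather than overwriting `target` in the same job.
    merged.write.mode("overwrite").saveAsTable("target_staged")

    spark.stop()
  }
}
```

If the incremental feed can also contain brand-new keys, a full outer join (or Hive MERGE) would be needed instead of the plain left join.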