Member since: 03-08-2016
Posts: 23
Kudos Received: 8
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1391 | 03-24-2016 07:54 PM |
08-22-2016
05:01 PM
Here's the fix for the timestamp value truncation that I attempted to explain above:

    public static ColumnDescription from(final ResultSet resultSet) throws SQLException {
        final ResultSetMetaData md = resultSet.getMetaData();
        final List<String> columns = new ArrayList<>();
        // NEW - used to store each column's size (precision), as reported by the database service
        final Map<String, Integer> columnCache = new HashMap<>();
        for (int i = 1; i <= md.getColumnCount(); i++) {
            columns.add(md.getColumnName(i));
            columnCache.put(md.getColumnName(i), md.getPrecision(i)); // NEW - get the physical column size
        }

        final String columnName = resultSet.getString("COLUMN_NAME");
        final int dataType = resultSet.getInt("DATA_TYPE");
        //final int colSize = resultSet.getInt("COLUMN_SIZE");
        final int colSize = columnCache.get(columnName); // NEW - use the physical column size instead

        // the rest of the method's code has been omitted
    }
08-22-2016
03:27 PM
Matt - thanks for fixing the AbstractDatabaseFetchProcessor timestamp bug that caused the timestamp truncation problem with the QueryDatabaseTable processor. I've encountered this same timestamp truncation phenomenon in both the ConvertJSONToSQL and the QueryDatabaseTable processors.

The ColumnDescription method of ConvertJSONToSQL does not appear to use the lower-level information available through ResultSetMetaData's getPrecision() method to obtain the real physical size of each column, as reported by the database service, from the result set that contains the table's schema metadata. When the generateInsert and generateUpdate methods execute and build the sql.args.N.value FlowFile attributes, they only have the data type size from the generic ResultSet with which to establish a column's size, and that is the size used to truncate the result set's timestamp values held in the JSON string. Perhaps, if the ColumnDescription method were extended to exploit ResultSetMetaData.getPrecision(), the real physical column size could be applied when building the sql.args.N.value FlowFile attributes, thereby solving the timestamp data type value truncation issue.
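For illustration only, here is a minimal, self-contained sketch of the truncation behavior I am describing; it is not NiFi's actual code. The helper name truncateToColumnSize and the example sizes (19 versus 29) are hypothetical stand-ins for the size logic inside generateInsert/generateUpdate. The point is simply that a column size taken from ResultSetMetaData.getPrecision() is large enough to keep a timestamp's fractional seconds, while an undersized generic value cuts them off.

    public class TruncationSketch {

        // Hypothetical helper: mimics cutting a value down to a column's size
        // before it is written to a sql.args.N.value attribute.
        static String truncateToColumnSize(final String value, final int colSize) {
            if (value == null || colSize <= 0 || value.length() <= colSize) {
                return value; // nothing to truncate
            }
            return value.substring(0, colSize);
        }

        public static void main(String[] args) {
            final String timestamp = "2016-08-22 15:27:00.123"; // 23 characters

            // Size taken from an undersized generic source, e.g. 19 (yyyy-MM-dd HH:mm:ss)
            final String truncated = truncateToColumnSize(timestamp, 19);

            // Size taken from ResultSetMetaData.getPrecision(), e.g. 29 for a TIMESTAMP column
            final String intact = truncateToColumnSize(timestamp, 29);

            System.out.println("undersized column size: " + truncated); // fractional seconds lost
            System.out.println("precision-based size:   " + intact);    // full value preserved
        }
    }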
08-12-2016
04:19 PM
1 Kudo
The following feedback is based on using NiFi for Change Data Capture (CDC) use cases with source data tables managed by ORACLE, MS SQL, PostgreSQL, and MySQL RDBMSs.

When the RDBMS that manages the source data table supports it, turn on the table's CDC feature. This automatically creates, in the background, a dedicated CDC table that contains all of the columns of the source data table, as well as additional metadata columns that can be used to support downstream ETL processing. The RDBMS automatically detects the new and changed records within the source data table for you, and duplicates those new and changed records into the dedicated CDC table. Against that dedicated CDC table, execute a QueryDatabaseTable processor that uses an SQL SELECT query to fetch the latest records written to the CDC table since the last time the QueryDatabaseTable processor executed successfully.

If the source data table has columns that hold the timestamps of when each record was first created or most recently updated, and you do not have access to a Hadoop environment that supports Sqoop, you can still use NiFi to bulk extract the records of the source data table in parallel, using streams. First, logically fragment the source data table into windows of time, such as a given month of a given year. For each window of time, create a corresponding QueryDatabaseTable processor. In this way, you can easily execute the extract across N threads of a NiFi node (or on N NiFi nodes). Essentially, you create the first QueryDatabaseTable processor, clone it N-1 times, and simply edit the predicate of each SQL SELECT so that it fetches the source data table records that were created within the desired window of time (a sketch of such a windowed query appears at the end of this post).

If the source data table also carries the timestamp of update events, clones of these bulk-extract QueryDatabaseTable processors can be slightly modified and used to capture ongoing updates to records, as well as new records, created within those windows of time. These ongoing CDC-style QueryDatabaseTable processors can be scheduled to execute based on the probability of update events for a given window of time. The update timestamp can then be used by NiFi to hand off individual CDC records to specific NiFi processors for routing, mediation, data transformation, data aggregation, and data egress (e.g., PutKafka).
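To make the time-window idea concrete, here is a minimal JDBC sketch of the kind of SELECT each cloned QueryDatabaseTable processor would issue for its own window. The table name (orders), timestamp column (created_ts), JDBC URL, and credentials are hypothetical placeholders, and running it would require the matching JDBC driver on the classpath; in NiFi you would configure the equivalent query (or table name, maximum-value column, and filter) on each processor rather than writing Java.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    public class WindowedExtractSketch {

        // Hypothetical source table "orders" with a creation-timestamp column "created_ts".
        private static final String WINDOWED_SELECT =
                "SELECT * FROM orders WHERE created_ts >= ? AND created_ts < ?";

        public static void main(String[] args) throws SQLException {
            // Placeholder connection details - replace with your own database service.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/sourcedb", "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(WINDOWED_SELECT)) {

                // One window of time: March 2016. Each cloned processor covers a different window.
                stmt.setTimestamp(1, Timestamp.valueOf("2016-03-01 00:00:00"));
                stmt.setTimestamp(2, Timestamp.valueOf("2016-04-01 00:00:00"));

                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        // In NiFi each fetched row would flow onward as FlowFile content;
                        // here we just print the timestamp to show the windowed fetch.
                        System.out.println(rs.getTimestamp("created_ts"));
                    }
                }
            }
        }
    }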
03-24-2016
07:54 PM
The root cause of the problem has been remedied. Please consider this matter closed.
03-22-2016
07:05 PM
@smanjee Thanks for your recommendation; unfortunately, when browsing to that URL a 'page not found' error is returned.
03-22-2016
01:57 PM
1 Kudo
I've built an HDP 2.3 cluster on AWS, and using Ambari I created a view for Hive and granted the Ambari admin login permission to that view. However, when logged in as the Ambari admin account, whenever I attempt to access that view this error message is thrown:

H060 Unable to open Hive session: org.apache.thrift.protocol.TProtocolException: Required field 'serverProtocolVersion' is unset!

In the HDFS advanced configuration, custom core-site section, these properties have been set:

hadoop.proxyuser.root.hosts=*
hadoop.proxyuser.root.groups=*
hadoop.proxyuser.ec2-user.hosts=*
hadoop.proxyuser.ec2-user.groups=*

Any recommendations to remedy this error are welcome.
Labels:
- Apache Ambari
- Apache Hive
03-18-2016
03:30 PM
@manish Thanks for your feedback. I've downloaded the guide you recommended and I've downloaded the correct VM. The ssh node1 command completes successfully using the correct VM. BTW - the HDP Admin course I am taking is the only version of the course provided on the Hortonworks partner portal, so I do not have discretion in choosing which version of the Admin course to take. If the latest HDP Admin course were available, I'd certainly adopt your recommendation. Perhaps the disparity between the current HDP course content and the HDP course content available through the partner portal will be rectified sometime soon.
03-16-2016
10:25 PM
All links to all self-paced lectures and labs for the HDP Admin training are found at this URL: https://ilearning.seertechsolutions.com/lmt/clmslearningpathdetails.prmain?in_sessionId=50225380985J3855&in_learningPathId=47269964&in_from_module=CLMSLEARNINGPATHS.PRMAIN

This is the URL to the 'Download VM' lecture, which contains the installation setup instructions: https://content.seertechsolutions.com/ilearn/lmsapi/Oracle_CMI_Adapter.html?starting_url=https%3A%2F%2Fcontent.seertechsolutions.com%2Fsecure_content%2Fhw%2Fonline_courses%2FSandboxPlus%2FLabs%2FHDPMigration%2FVM%2FindexAPI.html&lms_debug=off&LMS_URL=https%3A%2F%2Filearning.seertechsolutions.com%2Filearn%2Fen%2Flearner%2Fjsp%2Flms.jsp&sessionId=-19681281391458166101852

It is not possible to download the lecture, but the installation instructions are reproduced below for review. My progress stalls at step 9.2.
Installation Steps

8. Start the VM
   1. Click the Play virtual machine link to start the VM.
   2. Wait for the VM to start - it may take several minutes. When the VM has started successfully, you should see the following login screen:
   3. Log in as the root user (by clicking on train). The password is hadoop.
9. Check whether you can log in to 4 different CentOS machines named node1, node2, node3 and node4.
   1. Click on the Terminal icon on the left-hand-side taskbar (3rd icon from top):
   2. Connect to the first machine using the following command; the password is 'hadoop':
      root@ubuntu:~# ssh node1
   3. Type 'exit' to close the connection to node1 and repeat the above step for node2, node3 and node4.
Thanks, Charles
03-16-2016
10:11 PM
Rafeal, all links to all self-paced lectures and labs for the HDP Admin training are found at this URL: https://ilearning.seertechsolutions.com/lmt/clmslearningpathdetails.prmain?in_sessionId=50225380985J3855&in_learningPathId=47269964&in_from_module=CLMSLEARNINGPATHS.PRMAIN

This is the URL to the 'Download VM' lecture, which contains the installation setup instructions: https://content.seertechsolutions.com/ilearn/lmsapi/Oracle_CMI_Adapter.html?starting_url=https%3A%2F%2Fcontent.seertechsolutions.com%2Fsecure_content%2Fhw%2Fonline_courses%2FSandboxPlus%2FLabs%2FHDPMigration%2FVM%2FindexAPI.html&lms_debug=off&LMS_URL=https%3A%2F%2Filearning.seertechsolutions.com%2Filearn%2Fen%2Flearner%2Fjsp%2Flms.jsp&sessionId=-19681281391458166101852

The installation instructions are the preliminary tasks which must be completed successfully prior to starting the HDP Migration Lab. The URL for downloading the VM is http://tinyurl.com/basewestrev2
03-16-2016
10:03 PM
@manish There is no such file /root/.sys/fix_network.sh, nor /root/.sys/admin_course.sh, on the VM, nor does /root/.sys exist. The /root/ directory does contain an install_course.sh <course_id> script, which requires a value for the course_id input parameter in order to execute successfully. The value of course_id is likely the identity of the GitHub repository that contains the commands to build the classroom lab. The 'Download VM' guide makes no mention of this script nor of the course_id.

There is a script /root/scripts/admin_course.sh, and when it is executed it fails after attempting to start node1. The error messages are:

Unable to find image 'hwx/ambari_server' locally ....
Invalid namespace name (hwx), ....
Error: No such image or container: node1

Apparently, the script cannot find the Docker image/container files it is looking for, and so is unable to start nodes 1-5. The /root/dockerfiles directory contains sub-directories specific to various types of nodes (ambari_server_node, hdp_node, etc.). The /root/hwx directory exists, but it is empty.