Member since: 01-14-2019
Posts: 144
Kudos Received: 48
Solutions: 17
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1247 | 10-05-2018 01:28 PM |
| | 1056 | 07-23-2018 12:16 PM |
| | 1392 | 07-23-2018 12:13 PM |
| | 7111 | 06-25-2018 03:01 PM |
| | 4533 | 06-20-2018 12:15 PM |
03-27-2018
02:34 PM
@swathi thukkaraju The pipe is a regex metacharacter, and String.split(String) treats its argument as a regular expression, so splitting on "|" does not behave as expected. Use single quotes (the Char overload) to split pipe-delimited strings: val df1 = sc.textFile("testfile.txt").map(_.split('|')).map(x => schema(x(0).toString, x(1).toInt, x(2).toString)).toDF() Alternatively, you can escape the pipe as "\\|", or use commas or another separator. See the following StackOverflow post for more detail: https://stackoverflow.com/questions/11284771/scala-string-split-does-not-work
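For completeness, here is a minimal self-contained sketch of both approaches, assuming a hypothetical three-column case class (standing in for the original post's schema class) and a plain pipe-delimited test file:

// Hypothetical stand-in for the original poster's schema class.
case class Record(name: String, age: Int, city: String)

val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()
import spark.implicits._

// Option 1: split on a Char literal - the pipe is taken verbatim.
val df1 = spark.sparkContext
  .textFile("testfile.txt")
  .map(_.split('|'))
  .map(x => Record(x(0), x(1).toInt, x(2)))
  .toDF()

// Option 2: keep the String overload but escape the regex metacharacter.
val df2 = spark.sparkContext
  .textFile("testfile.txt")
  .map(_.split("\\|"))
  .map(x => Record(x(0), x(1).toInt, x(2)))
  .toDF()

df1.show()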
02-26-2018
09:32 PM
2 Kudos
If you'd like to generate some data to test out the HDP/HDF platforms at a larger scale, you can use the following GitHub repository: https://github.com/anarasimham/data-gen

This will allow you to generate two types of data:

1. Point-of-sale (POS) transactions, containing data such as transaction amount, timestamp, store ID, employee ID, part SKU, and quantity of product. These are the transactions you make at a store when you check out. For simplicity's sake, this assumes each shopper buys only one product (potentially greater than 1 in quantity).

2. Automotive manufacturing parts production records, simulating the completion of parts on an assembly line. Imagine a warehouse completing different components of a car, such as the hood, front bumper, etc., at different points in time, and those parts being tested against heat and vibration thresholds. This data contains a timestamp of when the part was produced, thresholds for heat & vibration, tested values for heat & vibration, quantity of the produced part, a "short name" identifier for the part, a notes field, and a part location.

Full details of both schemas are documented in the code in the file datagen/datagen.py in the repository above.

The application can generate data and insert it into one of two supported locations:

- Hive
- MySQL

You will need to configure the table by running one of the scripts in the mysql folder after connecting to the desired server and database as the desired user. Once that is done, copy the inserter/mysql.passwd.template file to inserter/mysql.passwd and edit it to provide the correct details. If you'd like to insert into Hive, do the same with the hive.passwd.template file. After editing, you can execute the generator using the following command:

python main_manf.py 10 mysql

This will insert 10 rows of manufacturing data into the configured MySQL database table.

At this point, you're ready to explore your data in greater detail. Possible next steps include using NiFi to pull the data out of MySQL and push it into Druid for a dashboard-style data lookup workflow, or pushing it into Hive for ad-hoc analysis; a Spark sketch along those lines follows below. These activities are out of scope for this article, but are suggestions to think about.
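As a hedged illustration of the "explore your data" step, here is a minimal Spark sketch for pulling the generated rows back out of MySQL for ad-hoc analysis. The host, database, table name, and credentials below are placeholders - use the table created by the script you ran from the mysql folder, and make sure the MySQL JDBC driver jar is available to Spark.

// Placeholder connection details; substitute your own.
val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()

val manufacturingDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://<MYSQL_HOST>:3306/<DATABASE>")
  .option("dbtable", "<TABLE_NAME>")
  .option("user", "<USERNAME>")
  .option("password", "<PASSWORD>")
  .option("driver", "com.mysql.jdbc.Driver")
  .load()

// Quick sanity checks on the generated rows.
manufacturingDF.printSchema()
manufacturingDF.show(10)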
02-26-2018
07:48 PM
Serialization is the process of converting in-memory data into a stream of bytes so it can be written to disk or transmitted somewhere. Different applications serialize data in different ways to optimize for a specific outcome, whether that is read performance or write performance. As it says in the Hive language manual, integers and strings are encoded to disk and compressed in different ways, and it lists out the rules it uses to do so. For example, variable-width encoding optimizes the space usage of the data, because it uses fewer bytes to encode smaller values. See the following Wikipedia article for more detail: https://en.wikipedia.org/wiki/Serialization
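As a hedged illustration of the variable-width idea (a protobuf-style varint, shown for illustration rather than as Hive's exact on-disk format): each byte carries 7 bits of the value plus a continuation bit, so small numbers occupy fewer bytes.

// Encode a non-negative Long as a protobuf-style varint:
// 7 value bits per byte, high bit set while more bytes follow.
def encodeVarint(value: Long): Array[Byte] = {
  val out = scala.collection.mutable.ArrayBuffer[Byte]()
  var v = value
  do {
    val bits = (v & 0x7f).toInt
    v >>>= 7
    out += (if (v != 0) (bits | 0x80) else bits).toByte
  } while (v != 0)
  out.toArray
}

// Smaller values take fewer bytes than a fixed 8-byte long would.
println(encodeVarint(1L).length)          // 1 byte
println(encodeVarint(300L).length)        // 2 bytes
println(encodeVarint(1000000000L).length) // 5 bytes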
02-06-2018
07:23 PM
No, there shouldn't be any special HDP policies. Perhaps you are running up against an upper quota on cores, RAM, or disk set by either AWS or your organization?
02-05-2018
09:21 PM
@Malay Sharma If you are writing a new file to HDFS and trying to read from it at the same time, your read operation will fail with a 'File does not exist' error message until the file write is complete. If you are writing to a file via the 'appendToFile' command and try to read it mid-write, the read command will wait until the file is updated and then read the new version of it. In the case of 'tail', it will stream out the entire contents that you are appending instead of only the last few lines.
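To make the write/append/read sequence concrete, here is a minimal sketch using the Hadoop FileSystem Java API from Scala instead of the hdfs dfs shell commands. The path /tmp/demo.txt is hypothetical, the cluster's client configuration is assumed to be on the classpath, and appends must be enabled on the cluster.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.nio.charset.StandardCharsets
import scala.io.Source

val fs = FileSystem.get(new Configuration())
val path = new Path("/tmp/demo.txt")   // hypothetical test path

// Write the initial file and close the stream so readers see the content.
val create = fs.create(path, true)
create.write("first line\n".getBytes(StandardCharsets.UTF_8))
create.close()

// Append more data; readers see the update once this stream closes.
val append = fs.append(path)
append.write("second line\n".getBytes(StandardCharsets.UTF_8))
append.close()

// Read the file back - both lines are visible after the append completes.
val in = fs.open(path)
println(Source.fromInputStream(in, "UTF-8").mkString)
in.close()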
02-05-2018
08:36 PM
@Bob Thorman
According to your stack trace, you may not have the requisite permissions to perform this operation. Please check your AWS user permissions and make sure you have enough capacity to allocate the cluster you are trying to create.
cloudbreak_1 | 2018-02-01 21:00:13,056 [reactorDispatcher-9] accept:140
DEBUG c.s.c.c.f.Flow2Handler -
[owner:a290539e-7056-4492-8831-23d497654084] [type:STACK] [id:7]
[name:storm] flow control event arrived: key: SETUPRESULT_ERROR, flowid:
384a1d99-4eba-4e16-ba29-5e71534c852a, payload:
CloudPlatformResult{status=FAILED, statusReason='You are not authorized
to perform this operation.',
errorDetails=com.sequenceiq.cloudbreak.cloud.exception.CloudConnectorException:
You are not authorized to perform this operation.,
request=CloudStackRequest{,
cloudStack=CloudStack{groups=[com.sequenceiq.cloudbreak.cloud.model.Group@c94202e,
com.sequenceiq.cloudbreak.cloud.model.Group@6491d02d,
com.sequenceiq.cloudbreak.cloud.model.Group@43807e4d,
com.sequenceiq.cloudbreak.cloud.model.Group@4f5952db,
com.sequenceiq.cloudbreak.cloud.model.Group@62e19e4e],
12-06-2017
05:12 PM
If you're using Ambari 2.5.2, you should be able to install NiFi on the same cluster using the HDF management pack: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_installing-hdf-and-hdp/content/ch_install-mpack.html Yes, you can automate the job with NiFi - you'll have to create a way to query your SFTP endpoint for incremental changes and then fetch those new files.
12-06-2017
03:20 PM
3 Kudos
You will need to use the GetHDFS processor to retrieve the file and then the InvokeHTTP processor to send the data to an HTTP endpoint. The data format shouldn't matter - a binary sequence is being transmitted, so unless you need to parse the data before transmission it can be anything. If you are dealing with a large file, you could run into memory limitations, so split it into manageable chunks before transmission and join it back together afterwards.
12-06-2017
03:04 PM
1 Kudo
If you have an HDF cluster running, you can create a NiFi flow to accomplish this. Otherwise you will need a client to download the file first before importing into HDFS.
11-02-2017
05:31 PM
3 Kudos
Assumptions:
- You have a running HDP cluster with Sqoop installed
- Basic knowledge of Sqoop and its parameters

Ingesting SAP HANA data with Sqoop

To ingest SAP HANA data, all you need is a JDBC driver. To the HDP platform, HANA is just another database - drop the JDBC driver in and you can plug & play.

1. Download the JDBC driver. This driver is not publicly available - it is only available to customers using the SAP HANA product. Find it on SAP's members-only website and download it.

2. Drop the JDBC driver into Sqoop's lib directory. For me, this is located at /usr/hdp/current/sqoop-client/lib

3. Execute a Sqoop import. This command has many variations and many command-line parameters, but the following is one such example:

sqoop import --connect "jdbc:sap://<HANA_SERVER>:30015" --driver com.sap.db.jdbc.Driver --username <YOUR_USERNAME> --password <PASSWORD> --table "<TABLE_NAME>" --target-dir=/path/to/hdfs/dir -m 1 -- --schema "<YOUR_SCHEMA_NAME>"

The '-m 1' argument limits Sqoop to a single mapper, so don't use it if you want parallelism. To parallelize the import work, use the --split-by argument and give it a column name, as shown in the variant below. If all goes well, Sqoop should start importing the data into your target directory. Happy Sqooping!
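For reference, here is a hedged variant of the same import that parallelizes across four mappers by splitting on a numeric key column (the column name ID below is a placeholder - pick an evenly distributed column from your own table):

sqoop import --connect "jdbc:sap://<HANA_SERVER>:30015" --driver com.sap.db.jdbc.Driver --username <YOUR_USERNAME> --password <PASSWORD> --table "<TABLE_NAME>" --target-dir=/path/to/hdfs/dir --split-by ID -m 4 -- --schema "<YOUR_SCHEMA_NAME>"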