1973 Posts
1225 Kudos Received
124 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1927 | 04-03-2024 06:39 AM |
| | 3018 | 01-12-2024 08:19 AM |
| | 1655 | 12-07-2023 01:49 PM |
| | 2425 | 08-02-2023 07:30 AM |
| | 3373 | 03-29-2023 01:22 PM |
01-15-2017
03:23 AM
Bigger files are better than millions of little ones.
01-12-2017
10:04 AM
4 Kudos
Some people say I must have a bot that reads and replies to email at all hours of the day. An awesome email assistant? Well, I decided to prototype one.
This is the first piece. After this I will add some Spark machine learning to intelligently reply to emails from a list of pretrained responses. With supervised learning it will learn which emails to send to whom, based on Subject, From, body content, attachments, time of day, sender domain, and many other variables.
For now, it just reads some emails and checks for a hard-coded subject.
I could use this to trigger other processes, such as running a batch Spark job.
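As a quick sketch, the hard-coded check could look like this (the trigger subject and the spark-submit job path are hypothetical placeholders, not part of the actual flow):

import email
import subprocess
import sys

# Read the raw message from stdin and parse the headers
msg = email.message_from_string(sys.stdin.read())

# Hard-coded subject check; 'RUN BATCH' is just a placeholder trigger phrase
if msg['subject'] == 'RUN BATCH':
    # Kick off a batch Spark job (this path is hypothetical)
    subprocess.call(['spark-submit', '/opt/demo/jobs/batch_job.py'])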
Since most people send and read HTML email (that's what Outlook, Outlook.com, and Gmail produce), I will send and receive HTML emails to make the assistant look more legitimate.
I could also run my fortune script and return that as the email content, making me sound wise, or pull in a random selection of tweets about Hadoop or even recent news, keeping the email current and fresh.
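Here is a minimal sketch of composing and sending such an HTML reply with the standard library (the SMTP host and both addresses are placeholders):

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

reply = MIMEMultipart('alternative')
reply['Subject'] = 'Re: test'
reply['From'] = 'nifi@example.com'
reply['To'] = 'x@example.com'

# Attach a plain-text fallback first, then the HTML part, so clients
# like Outlook and Gmail render the HTML version when they can
reply.attach(MIMEText('Thanks, I will reply shortly.', 'plain'))
reply.attach(MIMEText('<html><body><p>Thanks, I will reply <b>shortly</b>.</p></body></html>', 'html'))

server = smtplib.SMTP('smtp.example.com')  # placeholder SMTP host
server.sendmail(reply['From'], [reply['To']], reply.as_string())
server.quit()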
Snippet Example of a Mixed Content Email Message (Attachments Removed to Save Space)
Return-Path: <x@example.com>
Delivered-To: nifi@example.com
Received: from x.x.net
by x.x.net (Dovecot) with LMTP id +5RhOfCcB1jpZQAAf6S19A
for <nifi@example.com>; Wed, 19 Oct 2016 12:19:13 -0400
Return-path: <x@example.com>
Envelope-to: nifi@example.com
Delivery-date: Wed, 19 Oct 2016 12:19:13 -0400
Received: from [x.x.x.x] (helo=smtp.example.com)
by x.example.com with esmtp (Exim)
id 1bwtaC-0006dd-VQ
for nifi@example.com; Wed, 19 Oct 2016 12:19:12 -0400
Received: from x.x.net ([x.x.x.x])
by x with bizsmtp
id xUKB1t0063zlEh401UKCnK; Wed, 19 Oct 2016 12:19:12 -0400
X-EN-OrigIP: 64.78.52.185
X-EN-IMPSID: xUKB1t0063zlEh401UKCnK
Received: from x.x.net (localhost [127.0.0.1])
(using TLSv1 with cipher AES256-SHA (256/256 bits))
(No client certificate requested)
by emg-ca-1-1.localdomain (Postfix) with ESMTPS id BEE9453F81
for <nifi@example.com>; Wed, 19 Oct 2016 09:19:10 -0700 (PDT)
Subject: test
MIME-Version: 1.0
x-echoworx-msg-id: e50ca00a-edc5-4030-a127-f5474adf4802
x-echoworx-emg-received: Wed, 19 Oct 2016 09:19:10.713 -0700
x-echoworx-message-code-hashed: 5841d9083d16bded28a3c4d33bc505206b431f7f383f0eb3dbf1bd1917f763e8
x-echoworx-action: delivered
Received: from 10.254.155.15 ([10.254.155.15])
by emg-ca-1-1 (JAMES SMTP Server 2.3.2) with SMTP ID 503
for <nifi@example.com>;
Wed, 19 Oct 2016 09:19:10 -0700 (PDT)
Received: from x.x.net (unknown [x.x.x.x])
(using TLSv1 with cipher AES256-SHA (256/256 bits))
(No client certificate requested)
by emg-ca-1-1.localdomain (Postfix) with ESMTPS id 6693053F86
for <nifi@example.com>; Wed, 19 Oct 2016 09:19:10 -0700 (PDT)
Received: from x.x.net (x.x.x.x) by
x.x.net (x.x.x.x) with Microsoft SMTP
Server (TLS) id 15.0.1178.4; Wed, 19 Oct 2016 09:19:09 -0700
Received: from x.x.x.net ([x.x.x.x]) by
x.x.x.net ([x.x.x.x]) with mapi id
15.00.1178.000; Wed, 19 Oct 2016 09:19:09 -0700
From: x x<x@example.com>
To: "nifi@example.com" <nifi@example.com>
Thread-Topic: test
Thread-Index: AQHSKiSFTVqN9ugyLEirSGxkMiBNFg==
Date: Wed, 19 Oct 2016 16:19:09 +0000
Message-ID: <D49AD137-3765-4F9A-BF98-C4E36D11FFD8@hortonworks.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: yes
X-MS-TNEF-Correlator:
x-ms-exchange-messagesentrepresentingtype: 1
x-ms-exchange-transport-fromentityheader: Hosted
x-originating-ip: [71.168.178.39]
x-source-routing-agent: Processed
Content-Type: multipart/related;
boundary="_004_D49AD13737654F9ABF98C4E36D11FFD8hortonworkscom_";
type="multipart/alternative"
--_004_D49AD13737654F9ABF98C4E36D11FFD8hortonworkscom_
Content-Type: multipart/alternative;
boundary="_000_D49AD13737654F9ABF98C4E36D11FFD8hortonworkscom_"
--_000_D49AD13737654F9ABF98C4E36D11FFD8hortonworkscom_
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Python Script to Parse Email Messages
#!/usr/bin/env python
"""Unpack a MIME message into a directory of files."""
import json
import os
import sys
import email
import errno
import mimetypes
from optparse import OptionParser
from email.parser import Parser

def main():
    parser = OptionParser(usage="""Unpack a MIME message into a directory of files.
Usage: %prog [options] msgfile
""")
    parser.add_option('-d', '--directory',
                      type='string', action='store',
                      help="""Unpack the MIME message into the named
directory, which will be created if it doesn't already
exist.""")
    opts, args = parser.parse_args()
    if not opts.directory:
        parser.print_help()
        sys.exit(1)
    try:
        os.mkdir(opts.directory)
    except OSError as e:
        # Ignore directory exists error
        if e.errno != errno.EEXIST:
            raise
    # The raw message arrives on stdin (see mailnifi.sh below)
    msgstring = ''.join(sys.stdin.readlines())
    msg = email.message_from_string(msgstring)
    headers = Parser().parsestr(msgstring)
    # Emit the key headers as JSON so NiFi can route on them downstream
    response = {'To': headers['to'], 'From': headers['from'],
                'Subject': headers['subject'], 'Received': headers['Received']}
    print json.dumps(response)
    counter = 1
    for part in msg.walk():
        # multipart/* are just containers
        if part.get_content_maintype() == 'multipart':
            continue
        # Applications should really sanitize the given filename so that an
        # email message can't be used to overwrite important files
        filename = part.get_filename()
        if not filename:
            ext = mimetypes.guess_extension(part.get_content_type())
            if not ext:
                # Use a generic bag-of-bits extension
                ext = '.bin'
            filename = 'part-%03d%s' % (counter, ext)
        counter += 1
        fp = open(os.path.join(opts.directory, filename), 'wb')
        fp.write(part.get_payload(decode=True))
        fp.close()

if __name__ == '__main__':
    main()
mailnifi.sh
python mailnifi.py -d /opt/demo/email/"$@"
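A hypothetical invocation, with the raw message on stdin and the argument naming the output subdirectory under /opt/demo/email (both names made up for illustration):

sh mailnifi.sh msg-0001 < message.eml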
Python's email module does the message parsing; it ships with the standard library, so no separate install is needed.
I am using Python 2.7; you could use a newer Python 3.x with minor changes (for example, print becomes a function).
Here is the flow:
For the final part of the flow, I read the files created by the parser, load them into HDFS, and delete them from the local file system using the standard GetFile processor.
Reference:
https://docs.python.org/2/library/email-examples.html
https://jsonpath.curiousconcept.com/
Files:
email-assistant-12-jan-2017.xml
01-12-2017
09:01 AM
Protect Your Cloud Big Data Assets

Step 1: Do not put anything into the cloud unless you have a CISO, a Chief Security Architect, a certified cloud administrator, a full understanding of your PII and private data, a lawyer to defend you against the coming lawsuits, a full understanding of Hadoop, Hadoop-certified administrators, a Hadoop premier support contract, a security plan, and a full understanding of your Hadoop architecture and layout.
Step 2: Study all running services in Ambari.
Step 3: Confirm and check all of your TCP/IP ports. Hadoop has a lot of them!
Step 4: If you are not using a service, do not run it.
Step 5: By default, disable all access to everything, always. Only open a port or grant access when something or someone critical genuinely needs it.
Step 6: SSL, SSH, VPN and encryption everywhere.
Step 7: Run Knox! Set it up correctly.
Step 8: Run Kali and audit all your IPs and ports.
Step 9: Use Kali hacking tools to attempt to access all your web ports, shells and other access points. (See the example audit after the port list below.)
Step 10: Run in a VPC.
Step 11: Set up security groups. Never open to 0.0.0.0, all ports, or all IPs!
Step 12: If this seems too hard, don't run in the cloud.
Step 14: Step 13 is unlucky, skip that one.
Step 15: Read all the recommended security documentation and use it.
Step 16: Kerberize everything.
Step 17: Run Metron.

My recommendation is to get a professional services contract with an experienced Hadoop organization, or use something managed like Microsoft HDInsight or HDC.

TCP/IP Ports
50070: NameNode Web UI
50470: NameNode HTTPS Web UI
8020, 8022, 9000: NameNode via HDFS
50075: DataNode(s) Web UI
50475: DataNode(s) HTTPS Web UI
50090: Secondary NameNode
60000: HBase Master
8080: HBase REST
9090: Thrift Server
50111: WebHCat
8005: Sqoop2
2181: ZooKeeper
9010: ZooKeeper JMX

Plus many more: 50020, 50010, 50030, 8021, 50060, 51111, 9083, 10000, 60010, 60020, 60030, 2888, 3888, 8660, 8661, 8662, 8663, 8651, 3306, 80, 8085, 1004, 1006, 8485, 8480, 2049, 4242, 14000, 14001, 9290, 8032, 8030, 8031, 8033, 8088, 8040, 8042, 8041, 10020, 13562, 19888, 9095, 16000, 12000, 12001, 3181, 4181, 8019, 8888, 11000, 11001, 7077, 7078, 18080, 18081, 50100. There are more of these if you are also running your own visualization tools, other data websites, other tools, Oracle, SQL Server, mail, NiFi, Druid, etc.

Reference
http://www.slideshare.net/bunkertor/hadoop-security-54483815
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/content/set_up_validate_knox_gateway_installation.html
https://aws.amazon.com/articles/1233/
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html
https://www.quora.com/What-are-the-best-practices-in-hardening-Amazon-EC2-instance
https://stratumsecurity.com/2012/12/03/practical-tactical-cloud-security-ec2/
http://hortonworks.com/solutions/security-and-governance/
http://metron.incubator.apache.org/
01-12-2017
05:19 AM
2 Kudos
See the comments above. The main issue is to increase your JVM memory; adding 12-16 GB of heap should serve you well. If it's a VM environment, give the node 16-32 or more cores. If that's not enough, go to multiple nodes in a cluster. One node should scale to 10k events/sec easily. How big are these files? Is anything failing? Are there errors in the logs? https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
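As a sketch, the heap is raised in NiFi's conf/bootstrap.conf via the java.arg entries (the sizes below are illustrative, not a recommendation for every workload):

# conf/bootstrap.conf -- example heap settings only
java.arg.2=-Xms8g
java.arg.3=-Xmx16g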
01-12-2017
05:15 AM
Add more nodes to your NiFi cluster, or add RAM and move to a bigger box (more RAM, CPU, cores). 1,500 messages is not a lot for NiFi; you should be able to process 10k easily. What are your JVM settings?
https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html
This one is important: https://community.hortonworks.com/articles/30424/optimizing-performance-of-apache-nifis-network-lis.html
See:
https://community.hortonworks.com/articles/9782/nifihdf-dataflow-optimization-part-1-of-2.html
https://community.hortonworks.com/content/kbentry/9785/nifihdf-dataflow-optimization-part-2-of-2.html
https://dzone.com/articles/apache-nifi-10-cheatsheet
https://community.hortonworks.com/articles/68375/nifi-cluster-and-load-balancer.html
Check to see if something is failing or where it is slow: https://dzone.com/articles/finding-nifi-errors
01-11-2017
08:49 PM
That did it, thanks a lot!
01-11-2017
08:46 PM
YARN was designed for Hadoop and is very mature and stable. Mesos is much newer, written in C++, and offers fine-grained CPU scheduling. This presentation is pretty good: http://www.slideshare.net/mKrishnaKumar1/mesos-vs-yarn-an-overview