Member since: 10-02-2015
Posts: 76
Kudos Received: 80
Solutions: 8
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1930 | 11-15-2016 03:28 PM
 | 3351 | 11-15-2016 03:15 PM
 | 2064 | 07-25-2016 08:03 PM
 | 1698 | 05-11-2016 04:10 PM
 | 3465 | 02-02-2016 08:09 PM
11-21-2016
05:55 AM
Yes, I did. I had to change every eth0 entry in the Vora Manager UI as well. Now Vora is running fine.
04-06-2017
12:07 PM
1 Kudo
This problem has been happening on our side for many months as well, both with Spark 1 and Spark 2, and both while running jobs in the shell and in Python notebooks. It is very easy to reproduce: just open a notebook and let it run for a couple of hours, or do some simple DataFrame operations in an infinite loop (a minimal reproduction sketch follows below). There seems to be something fundamentally wrong with the timeout configurations in the core of Spark. We will open a case for this, because no matter what configurations we have tried, the problem persists.
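For illustration, a minimal reproduction sketch along those lines, assuming the Spark 2 SparkSession API; the DataFrame contents and the sleep interval are hypothetical, meant only to keep a session busy long enough to hit the timeouts:

# Minimal sketch: simple DataFrame operations in an infinite loop to
# reproduce the timeout; the values here are hypothetical.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timeout-repro").getOrCreate()

while True:
    df = spark.range(0, 1000000)   # small synthetic DataFrame
    df.selectExpr("id % 10 AS bucket").groupBy("bucket").count().collect()
    time.sleep(5)                  # brief pause between iterations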
09-22-2016
01:49 PM
4 Kudos
1. Challenges managing LAS files

- Siloed datasets: working with disparate, complex datasets under a traditional analysis model limits innovation and does not allow for the speed required for unconventionals.
- LAS file volume: a single well could have 10s or 100s of LAS files, making it difficult to provide a consolidated view for analysis. Extrapolating this volume out across 1000s of wells requires an automated approach.
- Manual QC process: identifying out-of-range data is time consuming and challenging, even for experienced geoscientists and petrophysicists.
- Management and storage are expensive: what if cost could be reduced from $23/GB to $0.19/GB? 55 GB could then cost $1,200 or $10; the delta is 1-2 orders of magnitude.

Download Sample Data Set

The wellbook concept is about a single view of an oil well and its history, something akin to a "Facebook Wall" for oil wells. This repo is built from data collected and made available by the North Dakota Industrial Commission. I used the wellindex.csv file to obtain a list of well file numbers (file_no), scraped their respective Production, Injection, and Scout Ticket web pages along with any available LAS-format well log files, and loaded them into HDFS (/user/dev/wellbook/) for analysis. To avoid the HDFS small-files problem I used the Apache Mahout seqdirectory tool to combine my text files into SequenceFiles: the keys are the filenames and the values are the contents of each text file. Then I used a combination of Hive queries and the pyquery Python library to parse relevant fields out of the raw HTML pages (a minimal parsing sketch follows below).
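As an illustration of that parsing step, here is a minimal pyquery sketch; the file name and the label/value table layout are hypothetical, since the actual structure of the scraped NDIC pages is not shown here:

# Minimal sketch (hypothetical page layout): extract labeled fields from a
# scraped HTML page with pyquery by pairing adjacent table cells.
from pyquery import PyQuery as pq

html = open("scout_ticket_12345.html").read()   # hypothetical local copy
doc = pq(html)

# Assume each row holds a label cell followed by a value cell.
cells = [pq(td).text() for td in doc("table tr td")]
fields = dict(zip(cells[::2], cells[1::2]))
print(fields.get("Well Name"))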
List of tables:

- wellbook.wells -- well metadata, including geolocation and owner
- wellbook.well_surveys -- borehole curve
- wellbook.production -- how much oil, gas, and water was produced by each well on a monthly basis
- wellbook.auctions -- how much was paid for each parcel of land at auction
- wellbook.injections -- how much fluid and gas was injected into each well (for enhanced oil recovery and disposal purposes)
- wellbook.log_metadata -- metadata for each LAS well log file
- wellbook.log_readings -- sensor readings for each depth step in all LAS well log files
- wellbook.log_key -- map of log mnemonics to their descriptions
- wellbook.formations -- manually annotated map of well depths to rock formations
- wellbook.formations_key -- descriptions of rock formations
- wellbook.water_sites -- metadata for water quality monitoring stations in North Dakota

2. Watch the video to get started: Automated Analysis of LAS Files

3. Join with Production / EOR / Auction data (Power BI) to get a 360-degree view of the well (Hive tables - Master):
a. Predictive analytics (linear regression)
b. Visualize the data using YARN Ready applications

4. Dynamic well logs: query for multiple mnemonic readings per well, or for multiple wells in a given region, and normalize and graph the data for specific depth steps on the fly.

5. Dynamic time warping: run the algorithm per well, or for all wells and all mnemonics, and visualize the results to learn which readings belong to the same curve class. Using supervised machine learning, enable automatic bucketing of mnemonics that belong to the same curve class (a minimal sketch appears at the end of this post).

Build your own: clone the git repo below and follow the steps in the Readme to create your own demo.

$ git clone https://github.com/vedantja/wellbook.git

For more questions, please contact Vedant Jain. Special thanks to Randy Gelhausen and Ofer Mendelevitch for the work and help they put into this.
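For reference, a minimal pure-Python dynamic time warping sketch; the input series are hypothetical stand-ins for two wells' readings of one mnemonic, and the repo may implement this step differently:

# Minimal sketch: classic O(n*m) dynamic time warping distance between two
# hypothetical log-reading series; lower distance suggests the same curve class.
def dtw_distance(a, b):
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] and b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Hypothetical readings: same curve shape, shifted in depth.
print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 0, 1, 2, 3, 2, 1]))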
02-13-2017
08:14 PM
"If you have long keys (compared to the values) or many columns, use a prefix encoder. FAST_DIFF is recommended."
Sorry, this post is a few months old, but does the above sentence mean it is recommended to use FAST_DIFF over PREFIX (not PREFIX_TREE)?
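For context, the data block encoding is chosen per column family at table creation (or alteration); a minimal HBase shell sketch with hypothetical table and family names:

# Hypothetical names; DATA_BLOCK_ENCODING selects FAST_DIFF for this family.
create 'my_table', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF'}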
05-01-2019
08:25 PM
Hi Vedant! You state: "num.io.threads should be greater than the number of disks dedicated for Kafka. I strongly recommend to start with the same number of disks first." Should num.io.threads be calculated as the number of disks per node allocated to Kafka, or as the total number of disks for Kafka across the entire cluster? I'm guessing disks per node dedicated to Kafka, but I wanted to confirm. Thanks, Jeff G.
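For reference, num.io.threads is a broker-level setting in each node's server.properties, which points toward the per-node reading; a sketch assuming a node with 4 disks dedicated to Kafka (paths and value are hypothetical):

# Hypothetical per-broker server.properties excerpt: start num.io.threads
# at the number of Kafka-dedicated disks on this node, per the quoted advice.
num.io.threads=4
log.dirs=/kafka/disk1,/kafka/disk2,/kafka/disk3,/kafka/disk4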
02-02-2016
04:58 PM
@lmccay Thanks for the info Larry. We are in the process of working with the prospect. I'll let you know when we cross that bridge.
02-02-2016
02:13 AM
@Vedant Jain can you accept the best answer to close this thread?
12-23-2015
10:29 PM
2 Kudos
Some basic charts are already included in Zeppelin. Visualizations are not limited to SparkSQL queries; output from any language backend can be recognized and visualized. This is sufficient for a data analyst, but if you are a data scientist, the built-in tools in Zeppelin just don't cut it. If you are a Python programmer and have been working with data in IPython, you are most likely well versed in the matplotlib library. You can install matplotlib in your Python environment as long as Zeppelin is using the same version of Python. Once you do that, the PySpark interpreter will be able to import matplotlib and you will be able to create graphs and charts in the Zeppelin interface itself. Make sure to add the following code at the end of your script and modify it as needed:

import StringIO

def show(graph):
    # Render the matplotlib figure to an in-memory SVG buffer.
    img = StringIO.StringIO()
    graph.savefig(img, format='svg')
    img.seek(0)
    # Zeppelin's %html display hook renders the SVG inline.
    print "%html <div style='width:1000px'>" + img.getvalue() + "</div>"

show(plt)
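For example, a minimal PySpark paragraph that builds a chart with hypothetical data and renders it through the helper above:

import matplotlib
matplotlib.use('Agg')   # headless backend, since Zeppelin runs server-side
import matplotlib.pyplot as plt

# Hypothetical data: draw a simple line chart, then render it inline.
plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.title('Sample chart in Zeppelin')
show(plt)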
09-23-2016
10:58 AM
@cnormile I don't see the Oozie workflow designer (either standalone or as part of an Ambari view) in the HDP 2.5 stack. Is this still in the pipeline?