New Contributor
Posts: 7
Registered: ‎08-30-2016

Re: Ask me anything!


Thanks, Mike, for your reply! Very detailed indeed, and I think we all remember life-changing events. I am an author (with a few things on Pluralsight, SyncFusion and MS) who specialized in Agile, then search (Solr), and now I have become really interested in Cloudera.

 

If I may share my own aha moment, the one that got me deeply interested in Cloudera: it came when the CEO of the company I work for (one of your partners) sat down with you at an event in New York and you told him how search fits into big data, and he then told us the story at a company event.

 

From there I watched some of your videos (the Stanford one being my favorite) and got started with Cloudera stuff. I just took the Hadoop Admin training, am taking the test in 2 weeks, and am hopefully going to the Germany bootcamp in Nov. I am also working my way toward becoming a Cloudera expert (despite being a Microsoft MVP, which probably should point me toward Azure/HDInsight and a few other MS-related things).

 

Thanks for what you have built. I hope I can soon become a Cloudera expert as planned. What do you see for Cloudera in the future? Where are you going? (if I may ask)

Posts: 9
Topics: 1
Kudos: 20
Solutions: 0
Registered: ‎08-26-2016

Re: Cloudera & Apache Kafka


BQuinart wrote:

 

First, I would like to congratulate you on what you and the team have built at Cloudera.

Even now that the company is becoming mature, there is a DNA of deep expertise & vision with regard to the products and technologies; and most importantly, you keep customers and their use cases at the center.

 

Cloudera quickly understood the importance of (interactive) SQL and invested heavily in the area. You were also early to understand and discuss the importance of Apache Spark and showed the ability to move beyond the core technologies. Finally, you show with Apache Kudu, that if needed you will invest in technology to address specific needs.

 

I am interested to understand what you think about the importance of Apache Kafka?

What do you think of the newer components like Connect and Streams?

What should Cloudera’s role be in this community? What does Cloudera want to achieve in it?

Are you happy with your current position and progress against your plans?

How will you benchmark yourself in this area (against established Apache Hadoop players or against the specialized start-ups)?

 

I might be biased here, because I truly believe in the power of Apache Kafka. I believe it can grow the big data market and enable many new use cases and at the same time drastically simplify some of the classic architectural patterns.

I am a bit concerned that Cloudera does not share (or act on) this view. Apache Kafka does not get that much ‘airtime’ in keynotes, blogs or roadmap sessions.

There is no such thing as a One Platform Initiative for Apache Kafka with clear ambitions and roadmap.

 

Looking forward to getting your insights on this!

 


Thanks for the kind words in your first two paragraphs. Big team working hard for a long time can do amazing things, it turns out. I think we've had some good strategic insights -- you highlight SQL and Apache Spark, and I strongly agree on both -- but the real secret weapon has been the people we've hired to turn those ideas into shipping product.

 

I am also bullish on streaming data ingest in general, and, like you, on Apache Kafka in particular. I agree that we don't call it out as often specifically in our public presentations. That's not for lack of interest, but rather because we see it as an outstanding solution to a much narrower problem than is Spark. Let me lay out some thoughts here.

 

First of all, data ingest and preparation are absolutely crucial to our success. A big data platform isn't of much use if you can't get data into it. We created some early tools to help -- Apache Sqoop and Apache Flume were both Cloudera projects, released to (and developed with) the open source community. When LinkedIn released the Kafka code, we recognized it as a well-designed and useful addition to the arsenal.

 

Over time, it's taken a major role in the ecosystem of projects that collect and handle data in flight. It's part of our supported product, and we participate in development of the project. We promote it to our customers, and our technical field folks are good at integrating customer-specific data sources into it.

 

I should say that the primary other streaming engine we see in real use is the streaming support in Apache Spark. I don't want to start a battle here on micro-batching, latency and so on; what I can tell you is that the vast majority of our customers use one or both of Kafka and Spark Streaming for data collection and analysis in flight. Use of alternatives like Storm is, at least in our sample, seriously on the wane. I think this combination is likely to dominate for the next several years, though other open source alternatives may emerge.

 

We do invest pretty significantly -- integration with security, governance and lineage, support for various storage substrates and so on is critical to our users who want to use Kafka.

 

We've consciously kept ourselves out of the battle for dominance among these projects. We like robust communities that develop to their own taste, and we trust the user base generally to spot the best technology and adopt it. Our embrace of Kafka was driven by our technical taste -- it's well-made -- and by our observation of adoption in our installed base. We'll continue to exercise both kinds of judgment in bringing new projects into our platform.


Re: Ask me anything!


DataBrian wrote:

Mike,
Kudos for building an amazing company by recognizing an industry trend and its potential way before anybody else did!

My understanding is that Cloudera has built its market dominance primarily on greenfield opportunities, i.e., new applications and new domains. In order to tap into the $40+B monster of the data management market, however, greenfield won’t get you very far? (All database startups of the past 15 years are testament to the limits of greenfield – and ended up getting acquired)

What is Cloudera doing to migrate applications off of incumbent systems in order to take market share from, say, Teradata or Oracle? Are you developing tools like Amazon’s Database Migration Services or Datometry’s Hyper-Q to make re-platforming to Cloudera not only attractive but feasible for enterprises?

 

Thanks!

 


Pretty good user name, DataBrian!

 

Our platform adoption is actually driven by both new applications in greenfield opportunities, and by migration of existing workloads to the new infrastructure. Let me touch on those in reverse order.

 

One of the most common reasons we get pulled into a commercial discussion, or even see people deploy our free open source code, is because they're running into cost and scale problems with their existing systems. The canonical story is, "It takes us 23 hours to process one day's worth of data, and data volume is growing at ten percent per month." You don't have to draw that curve out more than a week or so before you see some badness happen, right before your eyes.
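To put a number on that curve, here is a back-of-the-envelope sketch. It assumes that processing time scales linearly with data volume and that the ten percent monthly growth compounds smoothly; both are simplifying assumptions, and the function name and the 30-day month are illustrative only.

```python
import math


def months_until_overrun(hours_now: float, monthly_growth: float,
                         window_hours: float = 24.0) -> float:
    """Months until one day's data takes longer than the window to process.

    Assumes processing time grows in proportion to data volume and that
    growth compounds continuously at the given monthly rate.
    """
    # Solve hours_now * (1 + g)^t = window_hours for t.
    return math.log(window_hours / hours_now) / math.log(1.0 + monthly_growth)


months = months_until_overrun(23.0, 0.10)
days = months * 30  # rough conversion, treating a month as 30 days
print(f"{months:.2f} months, about {days:.0f} days")
```

Under these assumptions the 23-hour job crosses the 24-hour mark in roughly two weeks, which is the same order of magnitude as the "week or so" in the story: there is almost no headroom left.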

 

The historical way to deal with that problem was to go buy hardware twice as beefy as current infra, and move your ETL workloads there. That's okay if you are seeing one or two percent growth, but ten percent a month means you might as well just sign your revenue stream over to a white-box foundry in Taiwan. Sooner or later, you're going to run out of growth in centralized systems.

 

There's another legacy problem that lots of our customers want to solve: They've been archiving data to tape for years, and they want to stop. It's expensive, and it's nearly impossible to get the data back. No one reloads data from tape without an act of God or government -- your data center floods or the FBI comes knocking with a subpoena, and you need to restore. Else, those tapes just sit there in the repurposed mine under the mountain, incurring rent.

 

In that case, customers often choose to archive from expensive app infrastructure, like traditional RDBMS, to Hadoop instead of to tape. The data's on spinning disk, available for analysis or processing.

 

So, those legacy use cases are very common. They're often the first workload we take on.

 

You're right, though, that greenfield is a big driver, too. We can use scale-out storage and powerful new analytics, training up models under Spark, for example, to do magic with data that simply wasn't possible before. Those use cases can help customers interact with their customers better, develop better products and services, and manage risks and threats in their business using analytics over the data they collect in new ways.

 

In fact, most of our customers, and all of our big ones, do both things. They handle older workloads cheaper at greater scale, and do new analyses to create new value from old and new data.

 

You ask specifically about capture of traditional RDBMS workloads. Over the last several years, Hive for data processing jobs and Impala for analytic database workloads have gotten steadily better. We do see more overlap between the Hadoop ecosystem and traditional database workloads than we did eight or five or three years ago.

 

That'll continue, but it's not our central focus. Our secret plan is to help companies profit from the vast quantities of new data they can collect. We're happy to let them optimize infrastructure, moving workloads among Cloudera, traditional databases, real-time systems and so on as business needs demand. We'll continue to invest in making that easy. But the $40Bn relational market you cite is less interesting to me than the future market for a thousand times more data in heretofore-unknown formats. I think there's money to be made, there.


Re: Ask me anything!


francescodevere wrote:

Hello Mike,

 

I would also like to congratulate you and the Cloudera team on some wonderful achievements.

 

I always look forward to seeing Amr Awadallah, and sometimes Sean Owen, at the Cloudera London Sessions.

 

I do have a simple technical question; maybe you can get some of the Cloudera engineers to look at it:

 

My mahout arff.vector command produces NaN output for real and double values but works with integer values,

i.e., it works with input data like 2,1,3,1,15, ... but not with input data like 2.5,1.6,5.00,2.8,1.11, ... Is there a simple solution?

I am using Cloudera CDH5 Version 5.6.0-1.cdh5.6.0.p0.45 and Mahout Version 0.9+cdh5.6.0+26.

 

With regard to Cloudera and Mahout, I am sad to see the Mahout MapReduce implementations deprecated. But I am happy to hear you are moving on to new pastures with Apache Spark.

 

I am wondering how long one will be able to use the Mahout MapReduce algorithms with Cloudera's CDH distribution.


Your confidence in my technical depth is gratifying. In this instance, I am afraid I am of no help; I haven't used the Mahout package myself at all, and am unfamiliar with the interfaces or failure behavior there. I encourage you to post this question to the community at large, though.
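For anyone hitting the same symptom, one generic sanity check before posting is to confirm that the ARFF input itself is well-formed. The sketch below is not a diagnosis of the Mahout behavior and uses no Mahout API; it assumes a simple dense ARFF file with unquoted attribute names and comma-separated rows, and `check_arff` is a hypothetical helper name. It reports every value in a column declared numeric, real or integer that does not parse as a number.

```python
NUMERIC_TYPES = {"numeric", "real", "integer"}


def check_arff(lines):
    """Return (line_no, attribute_name, raw_value) for each value in a
    numeric column that cannot be parsed as a float.

    Assumes dense ARFF, unquoted attribute names, no commas inside values.
    """
    attrs = []        # (name, declared_type) in declaration order
    in_data = False
    problems = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                parts = line.split(None, 2)
                if len(parts) == 3:
                    attrs.append((parts[1], parts[2].strip().lower()))
            elif lower.startswith("@data"):
                in_data = True
            continue
        values = [v.strip() for v in line.split(",")]
        for (name, declared), value in zip(attrs, values):
            if declared in NUMERIC_TYPES and value != "?":  # "?" = missing
                try:
                    float(value)
                except ValueError:
                    problems.append((line_no, name, value))
    return problems


# Made-up sample data, for illustration only.
sample = """@relation demo
@attribute a real
@attribute b numeric
@data
2.5,1.6
5.00,abc
"""
problems = check_arff(sample.splitlines())
print(problems)
```

If the check comes back clean, the input is at least syntactically consistent with its header, and including that fact (plus the header itself) in a community post should make the question easier to answer.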

 

I will say that, as fundamental as Mahout was to Hadoop's early success, we see most of the energy directed elsewhere these days. There are plenty of commercial ML companies with proprietary implementations running on the Hadoop platform under Spark.

 

There are some good open source choices as well. MLlib in particular has strong traction among the users we work with, and we and Intel have worked hard to optimize it using MKL (math kernel library) calls to take advantage of IA optimizations in silicon.

 

I'm pretty convinced that current-gen ML implementations are just a good start for what we'll see emerge in the next decade. Many of the algorithms we use date back to the 1970s; they didn't work, back then, and the only reason they do now is the ridiculous hardware we can throw at them. Silicon is evolving fast -- connectivity, storage, compute -- and commercial interest (and investment) in ML is intense. That combination seems like a rich field for innovation. My bet is that Mahout is kind of the Pascal of machine learning: It was interesting back when I was a kid, and had a lot of good ideas, but smart people took those ideas in lots of new directions and came up with new stuff that is much more interesting.


Re: Ask me anything!

Nearly at the top of the hour, here, folks, so into the final few minutes of the AMA.

 

Thanks so much for the great questions. I hope you've enjoyed the discussion!

 

I'm going to close with an answer to a question no one has asked me, but that I think about a lot:

 

What does Cloudera look like ten years from now?

 

That seems like a crazy long time, but it's been more than eight years since we started the company, so we're close to the midpoint. We had a long-term vision when we started the company, and extensive practice has taught us a lot about our present and likely future.

 

We've always intended to build a long-lived, standalone company. That's not an automatic for an entrepreneurial startup, by the way: Lots of folks aim expressly at acquisition. But the four of us in 2008 honestly believed that a new generation of enterprise infrastructure was about to happen, and someone had to be the leader. We wanted to create that company.

 

When we started, no one knew about Hadoop and the meme "big data" hadn't happened, yet. Fast forward to today, and we're in a very dynamic, growing, big market. I like our position! Lots of hard work in front of us, but the opportunity is crazy big.

 

I'm watching closely the sudden and rapid shift in hardware (see Intel's announcements of 3D XPoint and Silicon Photonics over the last year). Original Hadoop was built for racks of pizza boxes; hardware of the future will have much faster storage with wildly different latency characteristics, and CPUs are going to be networked together across the data center at the speed of light. The software absolutely must change to take advantage of that.

 

I think our unique and strategic relationship with Intel is a huge advantage to us, in this regard.

 

And we're beginning to see the advent of a new class of competitor, and a new ecosystem in which we must live. Notwithstanding the "Cloud" in "Cloudera," for our first eight years, the vast majority of our customers deployed our systems on premises. These days, we're seeing much more enthusiasm for the public cloud. Google, Amazon and Microsoft are all innovating furiously, doing great work, creating platforms on which our products can run and changing, in some cases, the borders on "data management" and "analytics."

 

Five years ago, when you said Hadoop, you meant Cloudera or Hortonworks or MapR. These days, our eyes are aimed at a more distant horizon.

 

The first eight years have been wonderful. Tough, and tiring, but wonderful! With history as our guide, but with the opportunity before us, I am confident that we've got another eight or ten years of hard work and good fun in front of us.

 

Ten years from now, I believe that we'll be part of a much larger company, and that that company will be called Cloudera.

Posts: 416
Topics: 51
Kudos: 75
Solutions: 49
Registered: ‎06-26-2013

Re: Ask me anything!

Thanks @Mike Olson, for all the great interaction here!  I hope folks have found this interesting and fun!

 

I just want to clarify that we are leaving this AMA topic open for questions for the remainder of the day today.  Please feel free to post your questions!

 

Mike will not be here live to answer you, but he will do his best to come back and post replies as time allows today.

 

 

New Contributor
Posts: 1
Registered: ‎08-30-2016

Re: Ask me anything!

Hello Mike, 

 

  I have recently begun taking courses that focus on Big Data solutions using Hadoop.  I am interested in knowing a few basic steps that I should follow that could lead to employment for a novice?


Re: Ask me anything!


txgoldbear wrote:

Hello Mike, 

 

  I have recently begun taking courses that focus on Big Data solutions using Hadoop.  I am interested in knowing a few basic steps that I should follow that could lead to employment for a novice?


This is a great question.

 

You're doing one sensible thing already; taking courses, including on-line courses, is a good way to learn the fundamentals and master new skills. Those skills only really get locked down, though, with practice.

 

So, my advice is to actually build something that uses the big data platform. If you've got a project you're interested in already -- I know folks who analyze census data, make good visual presentations from stock market data, collect and summarize health care info -- rolling up your sleeves and building something is a fantastic idea. There are tons of publicly available data sets (https://www.data.gov/ is a great place to start), and you can dream up interesting questions to ask of the data on your own. Then, you just need to turn that question into code against the big data platform.

 

Another way to build skills and reputation is to join a project that's already up and running. If you're interested in contributing to one of the open source projects, you'll find an enthusiastic group of folks happy to welcome you and glad of your assistance. You'll need to establish your bona fides, of course; take on some small tasks to get started, and show the group what you can do. At Cloudera, we know who the meaningful contributors are to the projects we include in CDH, and that's one of the best places for us to recruit. You'll find all the other vendors in the big data market think in exactly the same way.

 

Of course http://apache.org is a great place to browse for projects. There are tons of git repositories out there as well; some of the ones we're tracking are at http://community.cloudera.com/t5/Cloudera-Labs/bd-p/ClouderaLabs.

 

Hope this helps! And good luck!
