University of Washington Open Course: Introduction to Data Science 057_random_forests

University of Washington Open Course: Introduction to Data Science 1 - 4 - Dimensions (10-24)
technique, and applying it and running it
is a fairly small fraction.
What you'll be spending much more time on
is the preparation of the data, the
manipulation of the data, the cleaning of
very, very large data sets.
Okay.
However, some of the methods, some of the
models you'll build are, are the same in
both cases.
So database experts, database programmers
new mathematics to squeeze as much
information as you can out of the 20 you
already have.
But that's not always the problem
anymore, right?
So as we shift from a data-poor regime to
business intelligence is that, the BI
engineers are not typically expected to
consume their own data products, and
perform their own analysis, and make, and

University of Washington Open Course: Introduction to Data Science 003_this_course_1
Bill Howe, UW 2
4/28/13
tools
abstr.
What are the abstractions of data science?
“Data Jujitsu” “Data Wrangling” “Data Munging”
Translation: “We have no idea what this is all about”
Data Access Hitting a Wall
Current practice based on data download (FTP/GREP) Will not scale to the datasets of tomorrow
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years
• Oh!, and 1 PB ~5,000 disks
desk
cloud
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (~1$)
• … 2 days and 1K$
• … 3 years and 1M$
• At some point you need indices to limit search, parallel data search and analysis
• This is where databases can help
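To make the point about indices concrete, here is a minimal Python sketch that contrasts a full scan (what GREP does) with a lookup through an index built once up front. The record layout and the helper names `scan` and `build_index` are assumptions made for illustration; they do not come from the slides.

```python
from collections import defaultdict

# Hypothetical records: (record_id, keyword) pairs standing in for lines of a file.
records = [(i, f"key{i % 1000}") for i in range(100_000)]

def scan(records, keyword):
    """Full scan: touches every record, like GREP over a flat file."""
    return [rid for rid, key in records if key == keyword]

def build_index(records):
    """Build a keyword -> record ids index once, up front."""
    index = defaultdict(list)
    for rid, key in records:
        index[key].append(rid)
    return index

index = build_index(records)

# Both approaches return the same answer; the index lookup avoids
# re-reading all records for each query.
scan_hits = scan(records, "key42")
index_hits = index["key42"]
```

The scan cost grows with the size of the data for every query, while the index pays that cost once and then answers each lookup directly.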
abstr.
Pre-2004: commercial RDBMS, some open source
2004: Dean et al. MapReduce
2008: Hadoop 0.17 release
2008: Olston et al. Pig: Relational Algebra on Hadoop
2008: DryadLINQ: Relational Algebra in a Hadoop-like system
2009: Thusoo et al. HIVE: SQL on Hadoop
2009: HBase: Indexing for Hadoop
2010: Dietrich et al. Schemas and Indexing for Hadoop
2012: Transactions in HBase (plus VoltDB, other NewSQL systems)
But also some permanent contributions:
– Fault tolerance
– Schema-on-Read
– User-defined functions that don’t suck

University of Washington Open Course: Introduction to Data Science 6 - 12 - 12 k Nearest Neighbors (11-43)
derive estimates of important statistics.
Okay.
And we put some of that together and
talked about random forests, which is an
ensemble method for decision trees that
to a pure machine learning, pure
statistical you know background.
the category is quite common, and so
defining nearest neighbor is sometimes
possible, but it's not, it's, it's not
says you work with especially when you're
coming from kind of a, you know maybe
data processing background.
Data science sort of scenario, as opposed
represent another.
You know, maybe survived or not survived
in the Titanic example.
and when you have a new point that you
want to classify, just drop it into the
challenge of decision trees is that they

University of Washington Open Course: Introduction to Data Science 1 - 3 - Context (9-30)
to.
So another perspective on this that you
should be familiar with is this Venn
diagram that made the rounds several
years ago by Drew Conway.
And what he, his point was that Data
This, you know, DJ has an applied math
background so he's maybe coming from that
perspective.
So Mike Driscoll also talks about the
three skills of data geeks which are
dangerous but you don't know how to
ground your analysis in
proper theory.
And you know, some of my colleagues like
to joke, the computer scientists don't
[MUSIC].
Welcome back.
So I want to talk about in this segment
what this term data science actually
means.
So, you know, you'll see these quotes
blend of Red-Bull-fueled hacking and

University of Washington Open Course: Introduction to Data Science 006_big_data
Big Data Now
“…the necessity of grappling with Big Data, and the desirability of unlocking the information hidden within it, is now a key theme in all the sciences – arguably the key scientific theme of our times.”
AUVs gliders cruises, CTDs flow cytometry satellites
ADCP
# of data sources
Big Data
“Big Data is any data that is expensive to manage and hard to extract value from.”
• Volume
– the size of the data
• Velocity
– the latency of data processing relative to the growing demand for interactivity
• Variety
– the diversity of sources, formats, quality, structures
telescopes n-body sims
# of bytes
spectra
Ocean Sciences
models stations
OOI (~50TB/year; sims, RSN) IOOS (~50TB/year; sims, satellite, gliders, AUVs, vessels, more) CMOP (~10TB/year; sims, stations, gliders, AUVs, vessels, more)

University of Washington Open Course: Introduction to Data Science 2 - 5 - Relational Algebra Details- Project, Cross
probably maybe should have put a slide in
here just about join in general first.
Well, generating all possible pairs and
applying the function is, is always a, a
reasonable way to do this.
Okay.
Alright, so these kinds of operations are
And so, the question here is which one is
more efficient?
Well, removing duplicates is expensive.
So, just leaving them in place is, is
more efficient.
called distinct.
Okay.
Now, let's see an example of this.
All right.
So, fine.
So, as an example here is project social
security number and names.
We only want two columns out of all the
applications.
And that's this notion of cross product.
And so a cross product here is, for every
tuple in r one.
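The operations discussed in this segment (projection that leaves duplicates in place, a separate distinct step, and cross product) can be sketched in a few lines of Python. This is only an illustration under assumptions: the `employees` relation, its columns, and the function names are made up here and are not from the lecture.

```python
from itertools import product

# Hypothetical relation: rows of (ssn, name, city) as tuples.
employees = [
    ("111", "Ann", "Seattle"),
    ("222", "Bob", "Tacoma"),
    ("333", "Ann", "Seattle"),
]

def project(rows, columns):
    """Bag-semantics projection: keep the chosen columns, leave duplicates in place."""
    return [tuple(row[c] for c in columns) for row in rows]

def distinct(rows):
    """Remove duplicates, preserving first-seen order (the extra, more expensive step)."""
    return list(dict.fromkeys(rows))

def cross_product(r1, r2):
    """Pair every tuple in r1 with every tuple in r2."""
    return [a + b for a, b in product(r1, r2)]

ssn_name = project(employees, [0, 1])        # two columns out of three
name_city = project(employees, [1, 2])       # contains ("Ann", "Seattle") twice
unique_name_city = distinct(name_city)       # duplicates removed
pairs = cross_product(employees, [("x",), ("y",)])
```

Projecting onto name and city shows why duplicate removal is a separate decision: the projection itself is a single pass, while `distinct` has to remember every row it has already seen.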

University of Washington Open Course: Introduction to Data Science 2 - 1 - From Data Models to Databases (10-35)
So, you have here sort of an embedded
table within a spreadsheet and so on.
So, you need to sort of, the idea is to
think about what data model is being
What constraints there are the over the
structure and so on and this gives you an
idea of the data model.
Alright, what does it, so what is a
database?
that they're not they persist, even when
the power goes out.
The data is safe, even when the power is
not on, okay, so non volatile storage,
right?
And so, a lot of the work in data base is
It's essentially get the next n bytes,
move to another position within the file
and then you can open and close the file,
and that's about that's not all the
more of the logical way we store, the

University of Washington Open Course: Introduction to Data Science 1 - 5 - This Course Part 1 (14-02)
to worry about this because the
assumption, first of all they weren't
running on thousands of computers at
once.
And second of all they were sort of under
want to focus on fundamental concepts as
opposed to specific tools.
But I can appreciate the people that are
taking this course and many other courses
more sophisticated types of indexing
which is two other things the databases
have.
And then you start now to see transaction
processing being a very hot, very
environment or Hadoop.
And if you haven't heard of relational
algebra, don't worry, we'll talk about it.
But let me convince you that rela-,
notice our relational, relational algebra
but a few years later you had an open
source implementation of the ideas in

University of Washington Open Course: Introduction to Data Science 027_eventual_consistency
5/18/2013 Bill Howe, UW 5
Two-Phase Commit
[Diagram: two-phase commit with subordinates 1 to 3]
1) user updates their status
2) Prepare (sent to each subordinate)
4) ready (returned by each subordinate)
5) commit (sent to each subordinate)
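The prepare / ready / commit exchange on this slide can be sketched as a small Python simulation. This is a minimal sketch under assumptions, not a real protocol implementation: the `Subordinate` class, the `can_commit` flag, and the `two_phase_commit` function are hypothetical names, and logging, timeouts, and failure recovery are left out entirely.

```python
class Subordinate:
    """Hypothetical participant that votes on, then applies, an update."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.pending = None
        self.committed = None

    def prepare(self, update):
        # Phase 1: stage the update and vote "ready" (True) or refuse (False).
        if self.can_commit:
            self.pending = update
        return self.can_commit

    def commit(self):
        # Phase 2: make the staged update visible.
        self.committed, self.pending = self.pending, None

    def abort(self):
        # Discard anything that was staged.
        self.pending = None


def two_phase_commit(subordinates, update):
    """Coordinator: commit only if every subordinate votes ready."""
    votes = [s.prepare(update) for s in subordinates]
    if all(votes):
        for s in subordinates:
            s.commit()
        return True
    for s in subordinates:
        s.abort()
    return False


replicas = [Subordinate("s1"), Subordinate("s2"), Subordinate("s3")]
ok = two_phase_commit(replicas, "status: hello")

# One subordinate that cannot prepare forces the whole update to abort.
flaky = [Subordinate("s1"), Subordinate("s2", can_commit=False)]
failed = two_phase_commit(flaky, "status: again")
```

The second run shows the cost of the protocol: a single unavailable subordinate blocks the update for everyone, which is part of the motivation for the weaker consistency models discussed next.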
– D. Terry et al., “Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System”, SOSP 1995
“We believe that applications must be aware that they may read weakly consistent data and also that their write operations may conflict with those of other users and applications.” “Moreover, applications must be involved in the detection and resolution of conflicts since these naturally depend on the semantics of the application.”
Transactions O record record EC, record O EC, record EC, record EC, record entity groups record ✔
Joins/ Integrity Analytics Constraints Views O O O MR O ✔ compat. w/MR / O O O O O O O O ✔ ✔ O O O MR O O / O compat. w/MR / O ? ✔ ✔

University of Washington Open Course: Introduction to Data Science 005_escience

“eScience” = “Data Science”
Empirical Theoretical Computational
public domain
Empirical Theoretical Computational
The cost of data acquisition has dropped precipitously thanks to advances in technology
– Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
– Life Sciences: lab automation, high-throughput sequencing
– Oceanography: high-resolution models, cheap sensors, satellites
Empirical Theoretical Computational
Empirical Theoretical Computational eScience
Science is about asking questions
Traditionally: “Query the world”
Data acquisition activities coupled to a specific hypothesis
eScience: “Download the world”
Data acquired en masse in support of many hypotheses

University of Washington Open Course: Introduction to Data Science 013_declarative_languages

SQL is the “WHAT” not the “HOW”
Product(pid, name, price) Purchase(pid, cid, store) Customer(cid, name, city)
SELECT DISTINCT ,
FROM Product x, Purchase y, Customer z
WHERE x.pid = y.pid AND y.cid = z.cid
  AND x.price > 100 AND z.city = 'Seattle'
Equivalent logical expressions; different costs
σp=knows(R)
o=s
(σp=holdsAccount(R)
σp=holdsAccount(R))
o=s
σp=accountHomepage(R))
right associative
(σp=knows(R)
cid=cid
But a lot of physical details are still left open!
pid=pid
Customer Product Purchase
4
Another Example
R(subject, predicate, object)
SELECT r1.subject
FROM R r1, R r2, R r3
WHERE r1.predicate = 'knows'
  AND r2.predicate = 'holdsAccount'
  AND r3.predicate = 'accountHomepage'
  AND r1.object = r2.subject
  AND r2.object = r3.subject
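To see the "WHAT not the HOW" idea in practice, the three-way self-join above can be run against a tiny in-memory SQLite database from Python. The sample triples are made up for illustration; only the table shape R(subject, predicate, object) and the query come from the slide. The query states what is wanted, and the engine picks the join order, which `EXPLAIN QUERY PLAN` lets you inspect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [
        # Hypothetical triples, invented for this example.
        ("alice", "knows", "bob"),
        ("bob", "holdsAccount", "acct1"),
        ("acct1", "accountHomepage", "http://example.org/bob"),
        ("carol", "knows", "dave"),  # dave holds no account, so carol is not returned
    ],
)

query = """
SELECT r1.subject
FROM R r1, R r2, R r3
WHERE r1.predicate = 'knows'
  AND r2.predicate = 'holdsAccount'
  AND r3.predicate = 'accountHomepage'
  AND r1.object = r2.subject
  AND r2.object = r3.subject
"""

rows = conn.execute(query).fetchall()

# The physical plan is chosen by the engine, not written in the query.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
```

The plan rows differ between SQLite versions and change when indexes are added, while the query text and its answer stay the same.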

University of Washington Open Course: Introduction to Data Science 2 - 4 - Relational Algebra Details- Union, Diff, Selec
Okay, and it's a practical you know it's
something that is practical for
applications.
You'll be able to define what kind of
order the tuples come back in.
[MUSIC].
Okay, so where are we now?
So juce itself.
And one of the things we talked about was
that there's this important aspect of
Include this set operations that are
lifted to support relations, and we'll
see examples of that.
And then the big three are selection,
projection, and join.
But it's not part of the formalism.
It was added in afterward.
So that's just extension between the pure
relational algebra and the extended
relational algebra.
the physical representation okay.
So this idea will come up over and over

University of Washington Open Course: Introduction to Data Science 001_context
Introduction to Data Science
Bill Howe, PhD
Director of Research, Scalable Data Analytics University of Washington eScience Institute
What is Data Science?
Mike Driscoll’s three sexy skills of data geeks
• Statistics
– traditional analysis
• Data Munging
– parsing, scraping, and formatting data
Drew Conway’s Data Science Venn Diagram
What do data scientists do?
“They need to find nuggets of truth in data and then explain it to the business leaders” -- Richard Snee, EMC
An Introduction to Data Science Jeffrey Stanton Syracuse University School of Information Studies
“A data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. Data scientists not only are adept at working with data, but appreciate data itself as a first-class product.”

University of Washington Open Course: Introduction to Data Science 1 - 10 - Logistics (7-42)
And two assignments will be required, or
will involve writing Python.
One optional, sorry, one optional
assignment will involve sort of
processing big data using Amazon Web
So, for example machine learning.
This is not a machine learning course,
but you can dive deeper into machine
learning by taking this course.
This is not a database course, but you
with a deep dive into specific topics.
And then there's a set of hands-on
assignments that are intended to deliver
specific skills and experiences.
And that's perhaps the most important
beginner in a variety of data science
topics.
And as I've said, you know, the tough,
the tough part here is sort of how to do
something more than just superficial

University of Washington Open Course: Introduction to Data Science 032_other_google_systems

“Although Spanner is scalable in the number of nodes, the node-local data structures have relatively poor performance on complex SQL queries, because they were designed for simple key-value accesses. Algorithms and data structures from DB literature could improve single-node performance a great deal.”
2004 2005
BigTable
2006 2007
Hadoop
HBase
2008 2009
Pregel Tenzing
Dremel Megastore Spanner
2010 2011 2012
HBase
• Implementation of Google BigTable • Compatible with Hadoop
– TableInputFormat allows r map phase – One mapper per tablet – Aside: Speculative Execution?
my label sql-like nosql batch nosql nosql nosql nosql sql-like sql-like nosql nosql nosql sql-like nosql sql-like sql-like sql-like nosql sql-like

University of Washington Open Course: Introduction to Data Science 016_parallel_thinking

f is a function to trim a read; apply it to every item
Now we have a big distributed set of trimmed reads
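The pattern on this slide, applying one function f to every item independently, is sketched below in Python. The `trim_read` function is a hypothetical stand-in for f: it just strips trailing `#` characters, whereas real read trimming works on quality scores. The thread pool stands in for a cluster of workers and is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def trim_read(read, low_quality_char="#"):
    """Hypothetical f: strip trailing low-quality placeholder characters from a read."""
    return read.rstrip(low_quality_char)

reads = ["ACGT##", "TTGA", "GGC#", "A#C##"]

# Sequential version: apply f to every item.
trimmed = [trim_read(r) for r in reads]

# Parallel version: same function, same result. The items are independent,
# so each one can be handled by any worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    trimmed_parallel = list(pool.map(trim_read, reads))
```

Because f never looks at more than one item, the work can be split across as many workers as there are items without changing the answer.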
New Task: Convert 405k TIFF images to PNG
Consider a slightly more general program to compute the word frequency of every word in a single document
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. 
the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.
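A word-frequency program for a single document like the one above can be sketched in Python as a map step that emits (word, 1) pairs and a reduce step that sums them per word. The function names `map_words` and `reduce_counts` and the simple tokenizer are assumptions made for this illustration; the slides do not prescribe an implementation.

```python
import re
from collections import defaultdict

def map_words(document):
    """Map step: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in re.findall(r"[a-z']+", document.lower())]

def reduce_counts(pairs):
    """Reduce step: sum the counts for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

sample = (
    "that all men are created equal and independent; "
    "that from that equal creation they derive rights"
)
frequencies = reduce_counts(map_words(sample))
```

Splitting the work this way matters once there are many documents: the map step can run on each document separately, and only the per-word sums need to be combined.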
Breiman 2001
Random Forest Algorithm
Repeat k times:
– Draw a bootstrap sample from the dataset
– Train a decision tree
    Until the tree is maximum size
        Choose next leaf node
        Select m attributes at random from the p available
        Pick the best attribute/split as usual
– Measure out-of-bag error
– Ex: Medical applications can’t typically rely on black box solutions
Gini Coefficient
• Entropy captured an intuition for “impurity”
– Trees are built independently
• Handles “small n big p” problems naturally
– A subset of attributes are selected by importance
Summary: Decision Trees and Forests
– Information Gain or Gini Index to measure impurity and select best attributes
• Representation
– Decision Trees
– Sets of decision trees with majority vote
• Evaluation
– Accuracy
– Random forests: out-of-bag error
• Optimization
Make a prediction by majority vote among the k trees
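The random forest steps from the slides (bootstrap sample, m attributes chosen at random from p, out-of-bag error, majority vote among k trees) can be sketched in plain Python. This is a deliberately simplified sketch, not Breiman's full algorithm: each "tree" is a depth-one stump instead of being grown to maximum size, the toy dataset is made up, and all function names are assumptions of this example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most common label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels, m, rng):
    """Depth-one tree: best split among m attributes drawn at random from p."""
    p = len(rows[0])
    best = None
    for attr in rng.sample(range(p), m):
        values = sorted({row[attr] for row in rows})
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[attr] <= t]
            right = [y for row, y in zip(rows, labels) if row[attr] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, attr, t, majority(left), majority(right))
    if best is None:  # every sampled attribute was constant: predict one class
        label = majority(labels)
        return lambda row: label
    _, attr, t, left_label, right_label = best
    return lambda row: left_label if row[attr] <= t else right_label

def train_forest(rows, labels, k, m, seed=0):
    """Repeat k times: bootstrap sample, train a tree, remember its out-of-bag rows."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        tree = train_stump([rows[i] for i in idx], [labels[i] for i in idx], m, rng)
        forest.append((tree, set(range(n)) - set(idx)))
    return forest

def predict(forest, row):
    """Majority vote among the k trees."""
    return majority([tree(row) for tree, _ in forest])

def oob_error(forest, rows, labels):
    """Error rate when each row is judged only by trees that never saw it."""
    wrong = total = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = [tree(row) for tree, oob in forest if i in oob]
        if votes:
            total += 1
            wrong += majority(votes) != y
    return wrong / total if total else 0.0

# Toy dataset (made up): both attributes separate the two classes.
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50),
        (11, 110), (12, 120), (13, 130), (14, 140), (15, 150)]
labels = [0] * 5 + [1] * 5

forest = train_forest(rows, labels, k=25, m=1)
err = oob_error(forest, rows, labels)
```

Each iteration of the loop in `train_forest` depends only on the data and its own random draws, which is why the trees can be built independently of one another.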
Breiman 2001
Random Forests: Variable Importance
• Key Idea: If you scramble the values of a variable and the accuracy of your tree doesn’t change much, then the variable isn’t very important
• Measure the error increase
• Random Forests are more difficult to interpret than single trees; understanding variable importance helps
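The scrambling idea can be sketched in Python as follows. The sketch assumes a model is just a function from a row to a label, and uses a made-up model that only looks at attribute 0; `permutation_importance` and `accuracy` are hypothetical helper names, and a real random forest would compute this on out-of-bag samples per tree.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows the model labels correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, attr, seed=0):
    """Drop in accuracy after scrambling one attribute's values across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    column = [r[attr] for r in rows]
    rng.shuffle(column)
    scrambled = [r[:attr] + (v,) + r[attr + 1:] for r, v in zip(rows, column)]
    return baseline - accuracy(model, scrambled, labels)

# Hypothetical model that only looks at attribute 0.
def model(row):
    return int(row[0] > 5)

rows = [(1, 9), (2, 3), (3, 7), (7, 1), (8, 8), (9, 2)]
labels = [0, 0, 0, 1, 1, 1]

importance_unused = permutation_importance(model, rows, labels, attr=1)
importance_used = permutation_importance(model, rows, labels, attr=0)
```

Scrambling attribute 1 cannot change any prediction here, so its importance is exactly zero; scrambling attribute 0 can only keep or lower the accuracy of this model.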
• Evaluate against the samples that were not selected in the bootstrap
• Provides measures of strength (inverse error rate), correlation between trees (which increases the forest error rate), and variable importance
– We want to choose attributes that split records into pure classes
• The gini coefficient measures inequality
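As a small worked example of the two impurity measures mentioned in these slides, here is a Python sketch of entropy and of the Gini impurity in the form commonly used for decision trees (one minus the sum of squared class proportions). The slide does not give the formula, so that choice is an assumption of this sketch; the label lists are made up, borrowing the survived / not survived classes from the Titanic example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini_impurity(labels):
    """One minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

pure = ["survived"] * 4
mixed = ["survived", "died", "survived", "died"]

pure_scores = (entropy(pure), gini_impurity(pure))
mixed_scores = (entropy(mixed), gini_impurity(mixed))
```

Both measures are zero for a pure set of records and largest for an even mix, which is why either one can be used to pick the attribute that splits records into the purest classes.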
Random Forests on Big Data
• Easy to parallelize