Big Data: An Introduction to Big Data (in English)

new data types with new styles of applications
• Bigger than terabytes: volume, variety, velocity, variability
Why ‘Big Data’ Is a Big Deal
Big data differs from traditional information in mind-bending ways:
• Not knowing why, but only what.
• The challenge with leadership is that it is driven by gut instinct in most cases.
• Air travelers can now figure out which flights are likeliest to be on time, thanks to data scientists who tracked a decade of flight history correlated with weather patterns.
• Publishers use data from text analysis and social networks to give readers personalized news.
• Health care is one of the biggest opportunities: if we had electronic records of Americans going back generations, we'd know more about genetic propensities, correlations among symptoms, and how to individualize treatments.
Hadoop
• Hadoop is designed to abstract away much of the complexity of distributed processing
• Different from grid computing
• Widely used in: social media (e.g., Facebook, Twitter), life sciences, financial services, retail, government
pretty broad topic)
• (I) if other things are going on at the same time, they shouldn't be able to see things mid-update
• (D) if the system blows up (hardware or software) the
What Is Big Data?
• Capturing and managing lots of information
• Working with many new types of data: structured/unstructured
• Exploiting these masses of information and
Recent IT Trends
• Commodity hardware
• Distributed file systems
• Open source operating systems, databases, and other infrastructure
• Significantly cheaper storage
• Service-oriented architecture
• HDFS (Hadoop Distributed File System): data is stored on local disk, and processing is done locally on the computer that holds the data
• Can work with raw data stored in file system
development
• Strengthen customer service
Main Steps in Adopting an Analytical System
• What will we analyze?
• Do we buy or build?
• Are we ready to invest?
• Do we understand the impact?
Design of HDFS
• NameNodes (the master): manage metadata/file trees
or database
• Two steps: Map and Reduce
Map
• MapReduce uses key/value pairs (traditional databases use rows and columns).
Example:
last name / Chen
withdraw amount / 20
transaction date / 06-23-2013
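
As a minimal sketch of the key/value idea, assuming a simplified in-memory record, a map step might turn one transaction into a key/value pair. The field names and the `map_withdrawal` helper are hypothetical and are not part of Hadoop's API.

```python
def map_withdrawal(record):
    """Emit one (key, value) pair per input record.

    Hypothetical map function: the key is the customer's last name,
    the value is the amount withdrawn in this transaction.
    """
    yield (record["last_name"], record["withdraw_amount"])


# One input record, modeled on the example on this slide.
row = {
    "last_name": "Chen",
    "withdraw_amount": 20,
    "transaction_date": "06-23-2013",
}

pairs = list(map_withdrawal(row))
print(pairs)
```

The map function looks at one record at a time and keeps no state between records, which is what lets the framework run many copies of it in parallel.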
What This Means for You
Big Data can help a company do many things:
• Profile customers
• Determine pricing strategies
• Identify competitive advantages
• Better target advertising
• Inform internal research and product
ACID
• ACID (Atomicity, Consistency, Isolation, Durability)
• (A) when you do something to change a database, the change should work or fail as a whole
• (C) the database should remain consistent (this is a
Reduce
• All the intermediate values for a given output key are combined together into a list.
• The reduce() function then combines the intermediate values into one or more final values for the same key.
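
The two bullets above can be sketched in a few lines of plain Python. This is an in-memory illustration under stated assumptions, not Hadoop code: the `shuffle` and `reduce_sum` helpers and the sample pairs are hypothetical, and summing is just one possible way to combine values.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group intermediate values by key: every key maps to a list of its values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_sum(key, values):
    """Combine the list of values for one key into a single final value."""
    return (key, sum(values))


# Hypothetical intermediate output of a map phase: (last name, amount withdrawn).
intermediate = [("chen", 20), ("lee", 5), ("chen", 30)]

grouped = shuffle(intermediate)  # {'chen': [20, 30], 'lee': [5]}
totals = dict(reduce_sum(key, values) for key, values in grouped.items())
print(totals)  # {'chen': 50, 'lee': 5}
```

Each call to `reduce_sum` sees only the values for its own key, so different keys can be reduced on different machines.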
• The emphasis is on strong consistency, referential integrity, abstraction from the physical layer, and complex queries through the SQL language.
• You can easily create secondary indexes, perform complex inner and outer joins, and count, sum, sort, group, and page your data across a number of tables, rows, and columns.
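
To make the RDBMS features listed above concrete, here is a small sketch using SQLite through Python's standard `sqlite3` module. The table names, column names, and sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE withdrawals (
        customer_id INTEGER REFERENCES customers(id),
        amount INTEGER
    );
    -- A secondary index on the join column.
    CREATE INDEX idx_withdrawals_customer ON withdrawals(customer_id);
    INSERT INTO customers VALUES (1, 'chen'), (2, 'lee');
    INSERT INTO withdrawals VALUES (1, 20), (1, 30), (2, 5);
    """
)

# Inner join, then count, sum, group, and sort in a single declarative query.
rows = conn.execute(
    """
    SELECT c.last_name, COUNT(*), SUM(w.amount)
    FROM customers AS c
    JOIN withdrawals AS w ON w.customer_id = c.id
    GROUP BY c.last_name
    ORDER BY c.last_name
    """
).fetchall()
conn.close()

print(rows)  # [('chen', 2, 50), ('lee', 1, 5)]
```

The query states what result is wanted and leaves the access path to the engine, which is the "abstraction from the physical layer" mentioned above.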
RDBMS vs MapReduce
RDBMS                          MapReduce
Mostly structured data         Unstructured data
Data has internal structure    None (structure is applied during processing)
Normalized                     Needs non-normalized data
database needs to be able to pick itself back up; and if it says it finished applying an update, it needs to be certain
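
The atomicity property (A) from the ACID slide can be illustrated with a short sketch using Python's standard `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are hypothetical; the point is only that a failed change leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


ok = transfer(conn, "a", "b", 500)  # would overdraw 'a', so it must fail
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit to `b` runs first and succeeds, yet it is undone when the debit from `a` violates the constraint, so the database never shows a half-finished transfer.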
MapReduce
• Dividing and conquering
• Highly fault tolerant
on MapReduce (such as Pig and Hive) make MapReduce systems more approachable for traditional database programmers.
Architectures
How Does MapReduce Work?
Challenges
• Information growth
• Processing power
• Physical storage
  Disk capacity has increased dramatically
  Reading from disk at about 100 MB/s is the bottleneck
  Data seeking is slower than data transfer
• Data issues
• Costs
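
A back-of-the-envelope calculation shows why the 100 MB/s read rate is a bottleneck. The 1 TB data set and the 100-disk cluster below are assumptions chosen for round numbers; only the 100 MB/s figure comes from the slide.

```python
def read_seconds(data_bytes, mb_per_second, disks=1):
    """Time to scan a data set when it is split evenly across `disks` drives."""
    bytes_per_second = mb_per_second * 1_000_000 * disks
    return data_bytes / bytes_per_second


ONE_TB = 10**12  # assumed data set size in bytes

single = read_seconds(ONE_TB, 100)               # one disk: 10,000 seconds
parallel = read_seconds(ONE_TB, 100, disks=100)  # 100 disks read in parallel

print(f"1 disk: {single / 3600:.1f} h; 100 disks: {parallel:.0f} s")
```

Under these assumptions a single disk needs close to three hours for one full scan, while 100 disks reading in parallel need under two minutes, which is the motivation for spreading data across many machines.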
Big Data Chain
• Collect data
• Ingest/clean data (originally ETL; existing schema)
• Human exploration/infrastructure/data mining
• Store/archive
• Share (decision making, other systems)
• Measure/feedback
nodes are expected to fail
• Every data block is (by default) replicated on 3 nodes (placement is also rack aware)
• Difficult to implement
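
The rack-aware, three-way replication mentioned above can be sketched as follows. This is a simplified illustration of the idea (first copy on the writing node, the other two on different nodes of one remote rack), not HDFS's actual placement code; the `place_replicas` function, the topology, and the node names are all hypothetical.

```python
import random


def place_replicas(topology, writer, rng=None):
    """Pick 3 nodes for one block.

    Replica 1 goes on the writer's own node; replicas 2 and 3 go on two
    different nodes of a single remote rack, so losing either rack still
    leaves at least one copy. Simplification: the remote rack must have
    at least two nodes.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    remote_racks = [
        rack
        for rack, nodes in topology.items()
        if rack != rack_of[writer] and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")
    second, third = rng.sample(topology[rng.choice(remote_racks)], 2)
    return [writer, second, third]


# Hypothetical cluster: three racks with two nodes each.
topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5", "n6"],
}
replicas = place_replicas(topology, "n1", random.Random(0))
print(replicas)
```

Keeping two copies on the same remote rack limits cross-rack traffic while still surviving the loss of a whole rack.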
RDBMS
• Fixed-schema, row-oriented databases with ACID properties and a sophisticated SQL query engine.
Hadoop Implementation
• Hadoop is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects
Hadoop Architecture
• Application layer/end-user access layer
a. Job Tracker (workload management layer)
b. Distributed parallel file systems/data layer
Big Data
Weiping Chen
Topics
• What is Big Data?
• Why is ‘Big Data’ a big deal?
• NoSQL vs SQL
• How to deal with Big Data?
• What’s Hadoop/MapReduce?
• RDBMS vs Hadoop/MapReduce
• Big data players/software tools/platforms

1. Relational databases start incorporating some of the ideas from MapReduce (such as Aster Data’s and Greenplum’s databases)
2. the other direction, as higher-level query languages built