lec 3 File and Indexing 数据库课件

合集下载

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

Sorted File BD
Clustered File
(log2 B) * D
[(log2 B) + #match pg]*D
Cost of Operations
Scan all records
Heap File BD
Equality Search
0.5 BD
Range
BD
Search
Insert
based on whether index is clustered or not! Alternative 1 implies clustered, but not vice-versa.
At most one Alternative 1 index per relation
No “pointer lookups” to get data records
Alternatives for Data Entries (Contd.)
Alternative 2 <k, rid of matching data record>
even if search keys are of fixed length.
For large rid lists, data entry spans multiple blocks!
Index Classification
Clustered vs. Unclustered: Clustered index:
Overview of File Organizations and Indexing
courtesy of Joe Hellerstein for some slides
Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY
Context
Query Optimization and Execution
Sorted File BD
Clustered File
Cost of Operations
Scan all records
Heap File BD
Equality Search
0.5 BD
Range Search
Insert
Delete
B: The number of data pages R: Number of records per page D: (Average) time to read or write disk page
Choice is orthogonal to the indexing technique.
B+ trees, hash-based structures, R trees, GiSTs, …
Can have multiple (different) indexes per file.
E.g. file sorted by age, with a hash index on salary, and a B+tree index on name.
2-dimensional ranges (“east of Berkeley and west of Truckee and North of Fresno and South of Eureka”)
Or n-dimensional
2-dimensional radii (“within 2 miles of Soda Hall”)
Alternatives for Data Entries (Contd.)
Alternative 1: Actual data record (with key value k)
Index as a file organization for records
Alongside Heap files or sorted files
2D
Delete
B: The number of data pages R: Number of records per page D: (Average) time to read or write disk page
Sorted File BD
Clustered File
(log2 B) * D
Three alternatives:
1. Actual data record (with key value k) 2. <k, rid of matching data record>, rid: record id 3. <k, list of rids of matching data records>
and Alternative 3 <k, list of rids of matching data records>
Must use Alternatives 2 or 3 to support >1 index per relation.
Alternative 3 more compact than Alternative 2, but variable sized data entries
More Assumptions
Single record insert and delete. Equality selection
exactly one match
For Heap Files:
Insert always appends to end of file.
For Sorted Files:
Find all students in the “CS” department Find all students with a gpa > 3
Index : disk-based data structure for fast lookup by value
Search key: any subset of columns in the relation. Search key need not be a key of the relation
[(log2 B) + #match pg]*D ((log2B)+B)D
Cost of Operations
Heap File
Scan all BD records
Equality Search
0.5 BD
Range
BD
Search
Insert
2D
Delete
0.5BD + D
B: The number of data pages R: Number of records per page D: (Average) time to read or write disk page
Clearly identify assumptions Then estimate cost in a principled way
Foundation for query optimization
Can’t choose the fastest scheme without an estimate of speed!
1st Question to Ask About Indexes
What kinds of selections (lookups) do they support? Selection: <key> <op> <constant> Equality selections (op is =)? Range selections (op is one of <, >, <=, >=, BETWEEN)? More exotic selections?
Relational Operators Files and Access Methods
Buffer Management Disk Space Management
DB
Goal for Today
Big picture of overheads for data access
We’ll simplify things to get focused Still, a bit of discipline:
order of data records the same as, or `close to’, order of index data entries
A file can be clustered on at most one search key. Cost of retrieving data records through index varies greatly
Index Breakdown
What selections does the index support Representation of data entries in index
i.e., what kind of info is the index actually storing?
3 alternatives here
Clustered vs. Unclustered Indexes Single Key vs. Composite Indexes Tree-based, hash-based, other
Alternatives for Data Entry k* in Index
Sorted File
Clustered File
BD
(log2 B) * D
[(log2 B) + #match pg]*D ((log2B)+B)D ((log2B)+B)D
Indexes
Allow record retrieval by value in one or more fields
Can have multiple items matching a lookup
Index contains a collection of data entries
Items associated with each search key value k Data entries come in various forms, as we’ll see
Sorted File
Clustered File
Cost of Operations
Scan all records
来自百度文库
Heap File BD
Equality Search
Range Search
Insert
Delete
B: The number of data pages R: Number of records per page D: (Average) time to read or write disk page
Sorted File BD
Clustered File
(log2 B) * D
Cost of Operations
Heap File
Scan all BD records
Equality Search
0.5 BD
Range
BD
Search
Insert Delete
B: The number of data pages R: Number of records per page D: (Average) time to read or write disk page
Or n-dimensional
Ranking queries (“10 restaurants closest to Berkeley”) Regular expression matches, genome string matches, etc. One common n-dimensional index: R-tree
Files compacted after deletions. Selections on search key.
Cost of Operations
Scan all records
Heap File
Equality Search
Range Search
Insert
Delete
B: The number of data pages R: Number of records per page D: (Average) time to read or write disk page