探讨下Tag标签的数据库设计（千万级数据量）

合集下载

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

探讨下Tag标签的数据库设计（千万级数据量）探讨下Tag标签的数据库设计（千万级数据量）
现在博客都有tag标签的功能，如何设计千万级数据量Tag数据库呢？现在把功能限制⼀下
tag需求：
（0）假定作品来⾃⼀个表（Poesy）,每个作品tag不超过5个
（1）根据tag可以找到所有包括该tag的作品,可以查看tag包含作品的数量。

(2)⽤户可以根据⾃⼰添加的tag来查看⾃⼰的作品,查看tag包括的作品的数量
(3)⽤户可以查看所有tag,可以查看所有tag包括的作品
(3)排序问题：按照作品访问量和时间排序
个⼈想了两套设计⽅案
⽅案⼀：在作品表⾥增加⼀个varchar(100)的tag列,各个tag以 , 分割，详见下：
--建表（作品表）：
if object_id('Poesy','u') is not null
drop table Poesy
go
create table Poesy
(
id int identity(1,1),
Title varchar(100),
Tag varchar(100)
)
go
alter table Poesy add constraint pk_Posy primary key(Id)
go
if object_id('Tags','u') is not null
drop table Tags
go
create table Tags
(
Id int identity(1,1),
TagName Varchar(30),
TagCount int, --作品数量
TagCreatDate datetime,
TagTips int--访问量
)
go
alter table Tags add constraint pk_Tags primary key(Id)
go
--⾯向⽤户的tag
if object_id('UserTags','u') is not null
drop table UserTags
go
create table UserTags
(
Id int identity(1,1),
TagUserId int,
TagName Varchar(30),
TagCount int, --作品数量
TagCreatDate datetime,
TagTips int--访问量
go
alter table UserTags add constraint pk_UserTags primary key(Id)
go
--⽤户在增加、修改、删除作品的时候，都要对 tags和UserTags表进⾏操作，更新两个表⾥的tag包括的作品数量（如果没有则增加tag) --造数据到作品表
declare@Number int,@NumberTop int,@TagId int
Set@TagId=1
Set@number=1
Set@numberTop=10000000
while@number<=@numberTop
begin
if@TagId=100
Set@TagId=1
INSERT Poesy(Title,Tag)
select'title'+cast(@number as varchar), '历史'++cast(@TagId as varchar)+',铁⽊真,元朝历史,蒙古历史'+cast(@number as varchar) Set@number=@number+1
Set@TagId=@TagId+1
end
--建⽴索引
create index Ix_poesyTag on Poesy(Tag)
--查询很慢要⼀分钟以上（1）
with Orderlist as (
select row_number() over(order by id desc) as rownumber,id,title
from poesy
where charindex(',蒙古历史10000,',','+tag+',')>0
) select RowNumber,Id,Title
from Orderlist
where RowNumber between1and50
--查询很快（2）
with Orderlist as (
select row_number() over(order by id desc) as rownumber,id,title
from poesy
where charindex(',元朝历史,',','+tag+',')>0
) select RowNumber,Id,Title
from Orderlist
where RowNumber between1and50
--建⽴索引
create index Ix_poesyTag on Poesy(Tag)
查询1很慢，因为 tag⾥包含蒙古历史10000的记录只有⼀条。

查询2很快，因为tag⾥包括元朝历史的记录⾮常多
建⽴了索引，也没有什么变化。

⽅案⼆：增加⼀个⽂章TAG关联表。

详见下：
--tag表
if object_id('TestTagName','u') is not null
drop table TestTagName
go
create table TestTagName
id int identity(1,1),
TagName varchar(30),
)
go
alter table TestTagName add constraint pk_TestTagName primary key(Id) go
--插⼊tag
insert TestTagName(Title)
select'铁⽊真'union all
select'元朝历史'union all
select'蒙古历史10000'union all
select'蒙古历史'
--⽂章tag表
if object_id('TestTagPoesy','u') is not null
drop table TestTagPoesy
go
create table TestTagPoesy
(
id int identity(1,1),
TagId int,
poesyid int
)
go
alter table TestTagPoesy add constraint pk_TestTagPoesy primary key(Id) go
--增加测试数据到⽂章tag表
insert TestTagPoesy(TagId,poesyid)
select1,id
from poesy
insert TestTagPoesy(TagId,poesyid)
select2,id
from poesy
insert TestTagPoesy(TagId,poesyid)
select3,id
from poesy
where id=10000
--索引：
create index ix_TestTagPoesy_poesyid on TestTagPoesy(poesyid) create index ix_TestTagPoesy_tagid on TestTagPoesy(tagid)
create index ix_TestTagName_title on TestTagName(TagName)
--查询tag是蒙古历史10000的⽂章
with Orderlist as (
select row_number() over(order by a.id desc) as rownumber,a.id,a.title from poesy a inner join TestTagPoesy as b on a.id = b.poesyid
inner join TestTagName as c on b.tagid = c.id
where c.TagName ='蒙古历史10000'
) select RowNumber,Id,Title
from Orderlist
where RowNumber between1and50
--查询tag是蒙古历史10000的⽂章
with Orderlist as (
select row_number() over(order by a.id desc) as rownumber,a.id,a.title from poesy a inner join TestTagPoesy as b on a.id = b.poesyid
inner join TestTagName as c on b.tagid = c.id
where c.TagName ='元朝历史'
) select RowNumber,Id,Title
from Orderlist
where RowNumber between1and50
查询1很慢，因为 tag⾥包含蒙古历史10000的记录只有⼀条。

查询2很快，因为tag⾥包括元朝历史的记录⾮常多
三系统测试
机器配置：
内存2G
CPU core 6420 双CPU 2.13G
⽅案1 和⽅案⼆的查询结果是⼀样，速度都很慢。

要⼀分钟以上。

（这⾥的测试重点是按照tag去找⽂章）
问题：
（1）这两个⽅案，打击⽀持哪⼀种？或者说这两个都不可取，可以有更好的⽅案。

（2）关于性能问题，这两个⽅案，除了提⾼硬件外还有别的办法提⾼性能吗？。