Computer Science / Software Engineering: foreign-language translation, foreign-language literature, English literature
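I. Translation of the foreign-language text:
Java development 2.0: Sharding with Hibernate Shards
Horizontal scalability for relational databases
Andrew Glover, Author and developer, Beacon50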
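Summary: Sharding isn't for every site, but it is one way for relational systems to meet the demands of big data. For some shops, sharding means keeping a trusted RDBMS in place without sacrificing data scalability or system performance. In this installment of the Java development 2.0 series, find out when sharding works and when it doesn't, then start sharding a simple application capable of handling terabytes of data.
Date: 31 Aug 2010
Level: Intermediate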
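When a relational database tries to store terabytes of data in a single table, overall performance typically degrades. Indexing all that data is expensive for reads and for writes alike. NoSQL datastores are particularly well suited to storing big data, but NoSQL is a non-relational approach. For developers who prefer the ACID-ity and solid structure of a relational database, and for projects that require it, sharding can be an exciting alternative.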
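Sharding, an offshoot of database partitioning, is not a native database technique; it happens at the application level. Among the various sharding implementations, Hibernate Shards is possibly the most popular in the world of Java™ technology. This nifty project lets you work almost seamlessly with sharded datasets using POJOs that are mapped to a logical database. When you use Hibernate Shards, you don't have to map your POJOs to shards specifically. You map them just as you would for any ordinary relational database in the Hibernate way. Hibernate Shards manages the low-level sharding tasks for you.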
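So far in this series, I have used a simple domain based on the analogy of races and runners to demonstrate various data storage technologies. This month, I'll use that familiar example to introduce a practical sharding strategy and then implement it in Hibernate Shards. Note that the brunt of the work in sharding is not necessarily related to Hibernate; in fact, coding for Hibernate Shards is the easy part. The real work is figuring out what to shard and how.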
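About this series
The Java development landscape has changed radically since Java technology first emerged. Thanks to mature open source frameworks and reliable for-rent deployment infrastructures, Java applications can now be assembled, tested, run, and maintained quickly and inexpensively. In this series, Andrew Glover explores the technologies and tools that make this new Java development paradigm possible.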
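Sharding at a glance
Database partitioning is the process of dividing a table's rows into smaller groups by some logical piece of data. For example, if you were partitioning a gigantic table named foo by timestamp, all the data for August 2010 would go into Partition A and anything since then into Partition B. Partitioning speeds up reads and writes because they target the smaller datasets in individual partitions.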
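Partitioning isn't always available (MySQL didn't support it until version 5.1), and the cost of doing it with a commercial system can be prohibitive. What's more, most partitioning implementations store data on the same physical machine, so you are still bound by the limits of your hardware. Partitioning also does nothing for the reliability, or lack thereof, of that hardware. So smart people started looking for new ways to scale.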
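Sharding is essentially partitioning at the database level: instead of dividing a table's rows by some piece of data, the database itself is split (usually across different machines) by some logical data element. In other words, rather than splitting a table into smaller chunks, sharding splits an entire database into smaller chunks.
The canonical example of sharding is a large database of worldwide customer data divided by region: Shard A for customers in the United States, Shard B for customers in Asia, Shard C for Europe, and so on. The shards live on different machines, and each shard holds all related data, such as customer preferences or order history.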
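The benefit of sharding (like partitioning) is that it compacts big data: the individual tables in each shard are smaller, which allows faster reads and writes and so improves performance. Sharding can also conceivably improve reliability, because even if one shard fails unexpectedly, the others can still serve data. And because sharding is done at the application layer, you can apply it to databases that don't support regular partitioning. The monetary cost is also potentially lower.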
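Primary keys
Sharding uses multiple databases, all of which function autonomously, unaware of their peers. As a result, if you rely on database sequences (for automatic primary key generation, for example), an identical primary key is likely to show up in more than one database. It is possible to coordinate sequences across a distributed database, but doing so increases system complexity. The safest way to prevent duplicate primary keys is to have your application (which will be managing a sharded system anyway) generate the keys.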
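Cross-shard queries
Most sharding implementations (including Hibernate Shards) don't permit cross-shard queries, which means you have to go to extra lengths if you want to use two sets of data from different shards. (Interestingly, Amazon's SimpleDB also prohibits cross-domain queries.) If you store United States customers in Shard 1, you also need to store all of their related data there. If you try to store that data in Shard 2, things get complicated, and system performance may suffer. This situation also ties back to the earlier point: if you somehow end up needing cross-shard joins, you had better be managing keys in a way that rules out duplicates!
Clearly, you need to think your sharding strategy through before you set up your database. Once you have chosen a particular direction, you are more or less tied to it, because data is hard to move around after it has been sharded.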
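Avoid premature sharding
Sharding is best employed late in the game. Like premature optimization, sharding on the basis of expected data growth can be a recipe for disaster. Successful sharding implementations rest on a measured understanding of an application's data growth over time, extrapolated into the future. Once you have sharded your data, it can be extraordinarily hard to move around.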
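A strategy example
Because sharding binds you to a linear data model (that is, you can't easily join data in different shards), you should start with a clear picture of how your data will be logically organized per shard. This is usually easiest if you focus on the primary node of the domain. In an e-commerce system, the primary node could be either an order or a customer. So if you choose "customer" as the basis for your sharding strategy, all customer-related data moves into the respective shards, although you still have to choose which shard each piece of data goes to.
For customers, you could shard by location (Europe, Asia, Africa, and so on) or by something else. It's up to you. Your sharding strategy should, however, include some means of distributing data evenly among all of your shards. The whole idea of sharding is to break big data sets into smaller ones; so if a particular e-commerce domain had a large set of European customers and relatively few in the United States, sharding by customer location probably wouldn't make sense.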
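Off to the races, with sharding!
Getting back to the racing application I use so often, I can shard by race or by runner. In this example I'll shard by race, because I see the domain as organized around runners who belong to races. So the race is the root of the domain. I'll also shard by race distance, because the racing application holds many races of different lengths, along with many runners.
Note that in making these decisions I have already accepted a trade-off: what if a runner takes part in more than one race, and those races live in different shards? Hibernate Shards (like most sharding implementations) doesn't support cross-shard joins. I'll have to live with this slight inconvenience and allow runners to exist in multiple shards; that is, I will recreate the runner in each shard where one of his or her races lives.
To keep things simple, I'll create two shards: one for races of less than 10 miles and another for races of more than 10 miles.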
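Implementing Hibernate Shards
Hibernate Shards works almost seamlessly with existing Hibernate projects. The only catch is that Hibernate Shards needs some specific information and behavior from you. Namely, it needs a shard-access strategy, a shard-selection strategy, and a shard-resolution strategy. These are interfaces you must implement, although in some cases you can use the defaults. We'll look at each interface in turn in the following sections.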
ShardAccessStrategy
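When a query is executed, Hibernate Shards needs a mechanism for deciding which shard to hit first, second, and so on. Hibernate Shards doesn't need to work out what the query is looking for (that's the job of Hibernate Core and the underlying database), but it does recognize that a query may have to run against several shards before an answer is obtained. So Hibernate Shards provides two implementations out of the box: one queries the shards sequentially (one at a time) until an answer is returned; the other is a parallel-access strategy that uses a threading model to query all of the shards at once.
I'll keep things simple and use the sequential strategy, named SequentialShardAccessStrategy. We'll configure it shortly.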
ShardSelectionStrategy
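When a new object is created (that is, when a new Race or Runner is created via Hibernate), Hibernate Shards needs to know which shard the corresponding data should be written to. You must therefore implement this interface and code the sharding logic. If you want a default implementation, there is one called RoundRobinShardSelectionStrategy, which places data into shards in round-robin fashion.
For the racing application, I need to provide behavior that shards by race distance. So I need to implement the ShardSelectionStrategy interface and, in its selectShardIdForNewObject method, provide some simple logic that shards on a Race object's distance. (I'll show the Race object shortly.)
At runtime, when a save-like method is called on one of my domain objects, this interface's behavior is used deep inside Hibernate's core.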
Listing 1. A simple shard-selection strategy
import org.hibernate.shards.ShardId;
import org.hibernate.shards.strategy.selection.ShardSelectionStrategy;

// Race and Runner are the article's domain classes, assumed to be in the same package.
public class RacerShardSelectionStrategy implements ShardSelectionStrategy {

    // Called when a new object is saved; returns the shard it should be written to.
    public ShardId selectShardIdForNewObject(Object obj) {
        if (obj instanceof Race) {
            Race rce = (Race) obj;
            return this.determineShardId(rce.getDistance());
        } else if (obj instanceof Runner) {
            Runner runnr = (Runner) obj;
            if (runnr.getRaces().isEmpty()) {
                throw new IllegalArgumentException("runners must have at least one race");
            } else {
                // Use the first race found to pick the runner's shard.
                double dist = 0.0;
                for (Race rce : runnr.getRaces()) {
                    dist = rce.getDistance();
                    break;
                }
                return this.determineShardId(dist);
            }
        } else {
            throw new IllegalArgumentException("a non-shardable object is being created");
        }
    }

    // Shard 1 holds races longer than 10 miles; shard 0 holds everything else.
    private ShardId determineShardId(double distance) {
        if (distance > 10.0) {
            return new ShardId(1);
        } else {
            return new ShardId(0);
        }
    }
}
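As you can see in Listing 1, if the object being persisted is a Race, its distance is determined and a shard is picked accordingly. In this case there are two shards, 0 and 1: Shard 1 holds races longer than 10 miles, and Shard 0 holds all other races.
If a Runner or some other object is being persisted, things get a little more involved. I've coded a logical rule with three stipulations:
• A Runner can't exist without a corresponding Race.
• If a Runner is created with multiple Races, the Runner is persisted in the shard of the first Race found. (This rule has negative implications for the future, by the way.)
• If some other domain object is being saved, for now an exception is thrown.
With that, you can wipe the sweat from your brow, because most of the hard work is done. The logic I've used may not be flexible enough as the racing application grows, but it works for the purposes of this demonstration!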
ShardResolutionStrategy
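When searching for an object by its key, Hibernate Shards needs a way to decide which shard to hit first. You use the ShardResolutionStrategy interface to guide it.
As I said earlier, sharding forces you to be keenly aware of primary keys, because you manage them yourself. Luckily, Hibernate is already good at key and UUID generation. Accordingly, Hibernate Shards provides an ID generator out of the box, ShardedUUIDGenerator, which is smart enough to embed shard ID information in the UUID itself.
If you end up using ShardedUUIDGenerator for key generation (as I do in this article), you can also use the ShardResolutionStrategy implementation that ships with Hibernate Shards, AllShardsShardResolutionStrategy, which can determine which shard to search from a particular object's ID.
With the three interfaces that Hibernate Shards requires configured, we're ready for the next step in sharding the example application. It's time to launch Hibernate's SessionFactory.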
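Source information for the original text
[1] Author of the original text:
[2] Book or paper title of the original text:
[3] Source of the original text:
Publisher or journal name, publication date or issue number, pages of the translated portion: Web address:
II. Original foreign-language text: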
Java development 2.0: Sharding with Hibernate Shards
Horizontal scalability for relational databases
Andrew Glover, Author and developer, Beacon50
Summary: Sharding isn't for everyone, but it's one way that relational systems can meet the demands of big data. For some shops, sharding means being able to keep a trusted RDBMS in place without sacrificing data scalability or system performance. In this installment of the Java development 2.0 series, find out when sharding works, and when it doesn't, and then get your hands busy sharding a simple application capable of handling terabytes of data.
Date: 31 Aug 2010
Level: Intermediate
When relational databases attempt to store terabytes of data in single tables, overall performance typically degrades. Indexing all that data is obviously expensive for reads, but also for writes. While NoSQL datastores are particularly suited to storing big data (think Google's Bigtable), NoSQL is a patently non-relational approach. For the developer who prefers the ACID-ity and solid structure of a relational database, or the project that requires it, sharding could be an exciting alternative.
Sharding, an offshoot of database partitioning, isn't a native database technique — it happens at the level of the application. Among various sharding implementations, Hibernate Shards is possibly the most popular in the world of Java™ technology. This nifty project lets you work more or less seamlessly with sharded datasets (I will explain the "more or less" part shortly) using POJOs that are mapped to a logical database. When you use Hibernate Shards, you don't have to specifically map your POJOs to shards — you map them as you would any normal relational database in the Hibernate way. Hibernate Shards manages the low-level sharding stuff for you.
So far in this series, I've used a simple domain based on the analogy of races and runners to demonstrate various data storage technologies. This month, I'll use this familiar example to introduce a practical sharding strategy, then implement it in Hibernate Shards. Note that the brunt of the work related to sharding isn't necessarily related to Hibernate; in fact, coding for Hibernate Shards is the easy part. The real work is figuring out how and what you'll shard.
About this series
The Java development landscape has changed radically since Java technology first emerged. Thanks to mature open source frameworks and reliable for-rent deployment infrastructures, it's now possible to assemble, test, run, and maintain Java applications quickly and inexpensively. In this series, Andrew Glover explores the spectrum of technologies and tools that make this new Java development paradigm possible.
Sharding at a glance
Database partitioning is an inherently relational process of dividing a table's rows by some logical piece of data into smaller groups. If you were partitioning a gigantic table named foo based on timestamps, for instance, all the data for August 2010 would go in Partition A, while anything since then would be in Partition B, and so on. Partitioning has the effect of making reads and writes faster because they target smaller datasets in individual partitions.
Partitioning isn't always available (MySQL didn't support it until version 5.1), and the cost of doing it with a commercial system can be prohibitive. What's more, most partitioning implementations store data on the same physical machine, so you're still bound to the limits of your hardware. Partitioning also doesn't resolve the reliability, or lack thereof, of your hardware. Thus, various smart people started looking for new ways to scale.
Sharding is essentially partitioning at the database level: rather than divide a table's rows by pieces of data, the database itself is split up (usually across different machines) by some logical data element. That is, rather than splitting up a table into smaller chunks, sharding splits up an entire database into smaller chunks.
The canonical example for sharding is based on dividing a large database storing worldwide customer data by region: Shard A for customers in the United States, Shard B for Asia, Shard C for Europe, and so on. The shards themselves would live on different machines and each shard would hold all related data, such as customer preferences or order history.
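The region-based layout described above can be sketched as a small lookup in application code. This is only an illustration, not part of Hibernate Shards: the class name RegionShardRouter, the region codes, and the shard names are all hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: routes a customer's region to the shard that stores it.
// Region codes and shard names are made up for illustration.
class RegionShardRouter {
    private static final Map<String, String> SHARD_BY_REGION = Map.of(
            "US", "shard-a",
            "ASIA", "shard-b",
            "EU", "shard-c");

    // Returns the shard for a region, or fails loudly for an unmapped region.
    static String shardFor(String region) {
        String shard = SHARD_BY_REGION.get(region);
        if (shard == null) {
            throw new IllegalArgumentException("no shard configured for region " + region);
        }
        return shard;
    }

    public static void main(String[] args) {
        System.out.println("US customers live in " + shardFor("US"));
        System.out.println("EU customers live in " + shardFor("EU"));
    }
}
```

Because all of a customer's related data has to live in the same shard, every write for that customer would go through the same lookup.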
The benefit of sharding (like partitioning) is that it compacts big data: individual tables are smaller in each shard, which allows for faster reads and writes, which increases performance. Sharding also conceivably improves reliability, because even if one shard unexpectedly fails, others are still able to serve data. And because sharding is done at the application layer, you can do it for databases that don't support regular partitioning. The monetary cost is also potentially lower.
Sharding and strategy
Like most technologies, sharding does entail some trade-offs. Because sharding isn't a native database technique — that is, you must implement it in your application — you'll need to map out your sharding strategy before you begin. Both primary keys and cross-shard queries play a major role when sharding, mainly by defining what you can't do.
Primary keys
Sharding leverages multiple databases, all of which function autonomously, without awareness of their peers. As a result, if you rely on database sequences (such as for automatic primary key generation), it's likely that an identical primary key will show up across a set of databases. It's possible to coordinate sequences across a distributed database but doing so increases system complexity. The safest way to prohibit duplicate primary keys is to have your application (which will be managing a sharded system anyway) generate keys.
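To make the duplicate-key risk concrete, here is a minimal plain-Java sketch. The ShardSequence class is a hypothetical stand-in for a per-database sequence, and using java.util.UUID for application-generated keys is one possible choice, not something the article prescribes at this point.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of why independent sequences collide across shards.
class KeyGenerationSketch {
    // Stand-in for one shard's private database sequence, starting at 1.
    static final class ShardSequence {
        private final AtomicLong next = new AtomicLong(1);
        long nextKey() { return next.getAndIncrement(); }
    }

    // Application-generated key; a random UUID does not depend on any shard.
    static String newApplicationKey() {
        return UUID.randomUUID().toString();
    }

    // Counts keys issued by shard 1 that shard 0 has already issued.
    static int collisions(int rowsPerShard) {
        ShardSequence shard0 = new ShardSequence();
        ShardSequence shard1 = new ShardSequence();
        Set<Long> seen = new HashSet<>();
        for (int i = 0; i < rowsPerShard; i++) {
            seen.add(shard0.nextKey());
        }
        int duplicates = 0;
        for (int i = 0; i < rowsPerShard; i++) {
            if (!seen.add(shard1.nextKey())) {
                duplicates++;
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println("duplicate sequence keys: " + collisions(5));
        System.out.println("application key: " + newApplicationKey());
    }
}
```

With two sequences that both start at 1, every key issued by the second shard duplicates one from the first, which is exactly the situation application-side key generation avoids.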
Cross-shard queries
Most sharding implementations (including Hibernate Shards) don't permit cross-shard querying, which means you have to go to extra lengths if you want to leverage two sets of data from different shards. (Interestingly, Amazon's SimpleDB also prohibits cross-domain queries.) For instance, if you're storing United States customers in Shard 1, you also need to store all of their related data there. If you try to store that data in Shard 2, things will get complicated, and system performance will probably suffer. This situation is also related to the point made earlier — if you somehow end up needing to do cross-shard joins, you had better be managing keys in a way that eliminates the possibility of duplicates!
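The "extra lengths" can be pictured as an application-side join: one lookup per shard, merged in memory. The sketch below uses in-memory collections as stand-ins for two shards; the class, the Order type, and the sample data are hypothetical and do not use any Hibernate Shards API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: customers live in one "shard", their orders in another,
// so the application has to query both and stitch the results together itself.
class CrossShardLookupSketch {
    static final class Order {
        final String customerId;
        final String item;
        Order(String customerId, String item) {
            this.customerId = customerId;
            this.item = item;
        }
    }

    static List<String> describeOrders(Map<String, String> customerShard,
                                       List<Order> orderShard,
                                       String customerId) {
        List<String> result = new ArrayList<>();
        String name = customerShard.get(customerId);   // first query, first shard
        if (name == null) {
            return result;
        }
        for (Order order : orderShard) {               // second query, second shard
            if (order.customerId.equals(customerId)) {
                result.add(name + " ordered " + order.item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> customers = Map.of("c1", "Ada");
        List<Order> orders = List.of(new Order("c1", "shoes"), new Order("c2", "hat"));
        System.out.println(describeOrders(customers, orders, "c1"));
    }
}
```

A single SQL join would do this in one round trip when both tables sit in the same shard, which is why keeping related data together matters.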
Clearly, you'll need to fully consider a sharding strategy before you set up your database. And once you've chosen a particular direction, you're more or less tied to it — it's hard to move data around after it's been sharded.
Avoid premature sharding
Sharding is best employed late in the game. Like premature optimization, sharding based on expected data growth could be a recipe for disaster. Successful sharding implementations are based on measurably understanding an application's data growth over time, and then extrapolating to the future. Once you've sharded your data it can be extraordinarily hard to move around.
A strategy example
Because sharding binds you to a linear data model (that is, you can't easily join data in different shards), you should start with a clear picture of how your data will be logically organized per shard. This is usually easiest by focusing on the primary node of a domain. In the case of an e-commerce system, the primary node could be either an order or a customer. Thus, if you choose "customer" as the basis for your sharding strategy, then all data related to customers will be moved into the respective shards, though you'll still have to choose to which shard to move that data.
For customers, you could shard based on location (Europe, Asia, Africa, etc.), or you could shard based on something else. It's up to you. Your shard strategy should, however, incorporate some means of distributing data evenly among all of your shards. The whole idea of sharding is to break up big data sets into smaller ones; thus, if a particular e-commerce domain had a large set of European customers and relatively few in the United States, it probably wouldn't make sense to shard based on customer location.
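One way to check a candidate strategy before committing to it is to count how many rows it would send to each shard. The sketch below is a hypothetical illustration: the Customer class, the nine-to-one sample data, and the ID-modulo alternative are assumptions made for the example, not recommendations from the article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

// Hypothetical sketch: compares how evenly two shard functions spread customers.
class ShardBalanceSketch {
    static final class Customer {
        final int id;
        final String region;
        Customer(int id, String region) {
            this.id = id;
            this.region = region;
        }
    }

    // Counts customers per shard for a given shard function.
    static Map<Integer, Integer> countPerShard(List<Customer> customers,
                                               ToIntFunction<Customer> shardOf) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Customer c : customers) {
            counts.merge(shardOf.applyAsInt(c), 1, Integer::sum);
        }
        return counts;
    }

    // Location-based: shard 0 for Europe, shard 1 for everyone else.
    static int byRegion(Customer c) {
        return "EU".equals(c.region) ? 0 : 1;
    }

    // Alternative: spread customers by ID across two shards.
    static int byId(Customer c) {
        return Math.floorMod(c.id, 2);
    }

    // Sample data: nine European customers and one from the United States.
    static List<Customer> sample() {
        List<Customer> customers = new ArrayList<>();
        for (int id = 1; id <= 9; id++) {
            customers.add(new Customer(id, "EU"));
        }
        customers.add(new Customer(10, "US"));
        return customers;
    }

    public static void main(String[] args) {
        System.out.println("by region: " + countPerShard(sample(), ShardBalanceSketch::byRegion));
        System.out.println("by id:     " + countPerShard(sample(), ShardBalanceSketch::byId));
    }
}
```

For this sample the region rule puts nine customers in one shard and one in the other, while the ID rule splits them five and five.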
Off to the races — with sharding!
Getting back to the familiar example of my racing application, I can shard by race or by runner. In this case, I'm going to shard by race, because I see the domain being organized by runners who belong to races. So the race is the root of my domain. I'm also going to shard based on race distance, because my racing application holds myriad races of different lengths, along with myriad runners.
Note that in making these decisions, I have already accepted a trade-off: what if a runner participates in more than one race, each of them living in different shards? Hibernate Shards (like most sharding implementations) doesn't support cross-shard joins. I'm going to have to live with this slight inconvenience and allow runners to live in multiple shards — that is, I will recreate each runner in the shards where his or her various races live.
To keep things simple, I'm going to create two shards: one for races less than 10 miles and another for anything greater than 10 miles.
Implementing Hibernate Shards
Hibernate Shards is made to work almost seamlessly with existing Hibernate projects. The only catch is that Hibernate Shards needs some specific information and behavior from you. Namely, it needs a shard-access strategy, a shard-selection strategy, and a shard-resolution strategy. These are interfaces you must implement, though in some cases you can use default ones. We'll look at each interface separately in the following sections.
ShardAccessStrategy
When a query is executed, Hibernate Shards needs a mechanism for determining which shard to hit first, second, and so on. Hibernate Shards doesn't necessarily figure out what a query is looking for (that's for the Hibernate Core and underlying database to do), but it does recognize that a query might need to execute against multiple shards before an answer is obtained. So, Hibernate Shards provides two logical implementations out of the box: one executes a query in a sequential mechanism (one at a time) against shards until an answer is returned, or until all of the shards have been queried. The other implementation is a parallel-access strategy, which uses a threading model to hit all of the shards at once.
I'm going to keep things simple and utilize the sequential strategy, aptly named SequentialShardAccessStrategy. We'll configure it shortly.
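The difference between the two access styles can be sketched in plain Java, with in-memory maps standing in for shards. This is not how SequentialShardAccessStrategy or the parallel strategy are implemented; the class and method names below are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: each Map stands in for one shard.
class ShardAccessSketch {
    // Sequential: ask one shard at a time and stop at the first answer.
    static Optional<String> findSequential(List<Map<String, String>> shards, String key) {
        for (Map<String, String> shard : shards) {
            String value = shard.get(key);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    // Parallel: submit one lookup per shard, then collect the results.
    static Optional<String> findParallel(List<Map<String, String>> shards, String key) {
        if (shards.isEmpty()) {
            return Optional.empty();
        }
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (Map<String, String> shard : shards) {
                pending.add(pool.submit(() -> shard.get(key)));
            }
            for (Future<String> future : pending) {
                String value = future.get();
                if (value != null) {
                    return Optional.of(value);
                }
            }
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Map<String, String>> shards = List.of(Map.of("a", "1"), Map.of("b", "2"));
        System.out.println(findSequential(shards, "b"));
        System.out.println(findParallel(shards, "b"));
    }
}
```

The sequential version does the least work when the answer sits in an early shard; the parallel version trades extra threads for lower latency when it does not.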
ShardSelectionStrategy
When a new object is created (that is, when a new Race or Runner is created via Hibernate), Hibernate Shards needs to know what shard the corresponding data should be written to. Accordingly, you must implement this interface and code the sharding logic. If you want a default implementation, there's one dubbed RoundRobinShardSelectionStrategy, which uses a round-robin strategy for putting data into shards.
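Round-robin selection simply cycles through the shard numbers as objects arrive. The class below is a hypothetical plain-Java illustration of that idea and is not the source of RoundRobinShardSelectionStrategy.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of round-robin shard selection.
class RoundRobinSketch {
    private final int shardCount;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinSketch(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // Returns 0, 1, ..., shardCount - 1 and then wraps around.
    // floorMod keeps the result non-negative even if the counter overflows.
    int nextShard() {
        return Math.floorMod(counter.getAndIncrement(), shardCount);
    }

    public static void main(String[] args) {
        RoundRobinSketch selector = new RoundRobinSketch(3);
        for (int i = 0; i < 4; i++) {
            System.out.println("new object goes to shard " + selector.nextShard());
        }
    }
}
```

Round robin balances shard sizes well, but it ignores what the object is, so related objects can end up in different shards.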
For the racing application, I need to provide behavior that shards by race distance. Accordingly, I'll need to implement the ShardSelectionStrategy interface and provide some simple logic that shards based on a Race object's distance in the selectShardIdForNewObject method. (I'll show the Race object shortly.)
At runtime, when a call is made to some save-like method on my domain objects, this interface's behavior is leveraged deep down in Hibernate's core.
Listing 1. A simple shard-selection strategy
import org.hibernate.shards.ShardId;
import org.hibernate.shards.strategy.selection.ShardSelectionStrategy;

// Race and Runner are the article's domain classes, assumed to be in the same package.
public class RacerShardSelectionStrategy implements ShardSelectionStrategy {

    // Called when a new object is saved; returns the shard it should be written to.
    public ShardId selectShardIdForNewObject(Object obj) {
        if (obj instanceof Race) {
            Race rce = (Race) obj;
            return this.determineShardId(rce.getDistance());
        } else if (obj instanceof Runner) {
            Runner runnr = (Runner) obj;
            if (runnr.getRaces().isEmpty()) {
                throw new IllegalArgumentException("runners must have at least one race");
            } else {
                // Use the first race found to pick the runner's shard.
                double dist = 0.0;
                for (Race rce : runnr.getRaces()) {
                    dist = rce.getDistance();
                    break;
                }
                return this.determineShardId(dist);
            }
        } else {
            throw new IllegalArgumentException("a non-shardable object is being created");
        }
    }

    // Shard 1 holds races longer than 10 miles; shard 0 holds everything else.
    private ShardId determineShardId(double distance) {
        if (distance > 10.0) {
            return new ShardId(1);
        } else {
            return new ShardId(0);
        }
    }
}
As you can see in Listing 1, if the object being persisted is a Race, then its distance is determined and, accordingly, a shard is picked. In this case, there are two shards: 0 and 1, where Shard 1 holds races with a distance greater than 10 miles and Shard 0 holds all others.
If a Runner or some other object is being persisted, things get a bit more involved. I've coded a logical rule that has three stipulations:
• A Runner can't exist without a corresponding Race.
• If a Runner has been created with multiple Races, the Runner will be persisted in the shard for the first Race found. (This rule has negative implications for the future, by the way.)
• If some other domain object is being saved, for now, an exception will be thrown.
With that, you can wipe the sweat from your brow, because most of the hard work is done. The logic I've captured might not be flexible enough as the racing application grows, but it'll work for the purpose of this demonstration!
ShardResolutionStrategy
When searching for an object by its key, Hibernate Shards needs a way of determining which shard to hit first. You'll use the ShardResolutionStrategy interface to guide it.
As I mentioned earlier, sharding forces you to be keenly aware of primary keys, as you'll manage them yourself. Luckily, Hibernate is already good at providing key or UUID generation. Consequently, out of the box, Hibernate Shards provides an ID generator dubbed ShardedUUIDGenerator, which has the smarts to embed shard ID information in the UUID itself.
If you end up using ShardedUUIDGenerator for key generation (as I will for this article), then you can also use the Hibernate Shards out-of-the-box ShardResolutionStrategy implementation dubbed AllShardsShardResolutionStrategy, which can determine what shard to search based on a particular object's ID.
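The idea of a key that carries its own shard ID can be sketched as follows. The format used here (four hex digits of shard ID followed by a random UUID without dashes) is an assumption made purely for illustration; it is not the encoding that ShardedUUIDGenerator actually uses, and the class and method names are hypothetical.

```java
import java.util.UUID;

// Hypothetical sketch: a key whose first four hex digits name the owning shard.
class ShardAwareIdSketch {
    // Builds a 36-character key: 4 hex digits of shard ID + 32 hex digits of UUID.
    static String newId(int shardId) {
        if (shardId < 0 || shardId > 0xFFFF) {
            throw new IllegalArgumentException("shardId must fit in 16 bits");
        }
        String uuidHex = UUID.randomUUID().toString().replace("-", "");
        return String.format("%04x", shardId) + uuidHex;
    }

    // Recovers the shard ID from a key, so a lookup can go straight to one shard.
    static int shardIdOf(String id) {
        return Integer.parseInt(id.substring(0, 4), 16);
    }

    public static void main(String[] args) {
        String id = newId(1);
        System.out.println(id + " belongs to shard " + shardIdOf(id));
    }
}
```

With keys like this, a lookup by ID never has to guess: the key itself says which shard to query.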
Having configured the three interfaces required for Hibernate Shards to work properly, we're ready for the next step in sharding the example application. It's time to launch Hibernate's SessionFactory.