Sentence Extraction System Assembling Multiple Evidence
Chikashi NOBATA, Satoshi SEKINE, Masaki MURATA, Kiyotaka UCHIMOTO, Masao UTIYAMA, Hitoshi ISAHARA

Keihanna Human Info-Communication Research Center, Communications Research Laboratory
2-2-2 Hikari-dai, Seika-cho, Soraku-gun, Kyoto 619-0238 JAPAN
nova, murata, uchimoto, mutiyama,

Computer Science Department, New York University
715 Broadway, 7th floor, New York, NY 10003 USA
sekine@

Abstract
We have developed a sentence extraction system that estimates the significance of sentences by integrating four scoring functions, which use sentence location, sentence length, TF/IDF values of words, and similarity to the title as evidence. For the summarization task for Information Retrieval, similarity to a given query is also added to the system. Parameters for the scoring functions were optimized by an experiment using the dry run data of the TSC. Results from the TSC formal run showed that our method was effective in the sentence extraction task.

1 Introduction
In recent years, a vast amount of machine-readable documents and text has become available on the web and elsewhere.
To obtain useful information from these documents efficiently, several research fields in natural language processing are being actively pursued, such as information retrieval, information extraction, and automatic summarization.

Sentence extraction is one of the useful methods for automatic text summarization [5]. Various clues have been used for sentence extraction. The lead-based method, which is simple but still effective, uses the sentence location in a given document. Statistical information such as word frequency and document frequency has also been used to estimate the significance of sentences. Linguistic clues that show the structure of a document are also useful for extracting important sentences.

Edmundson [2] proposed a method that integrates several clues for extracting sentences. He manually assigned parameter values to integrate evidence for estimating the significance score of sentences. On the other hand, Aone et al. [1] and Nomoto and Matsumoto [4] generated a decision tree [6] for sentence extraction from training data.

Our system follows the above direction: it uses several pieces of evidence to estimate sentence significance in a uniform way. The pieces of evidence are integrated using parameters, which are estimated from training data. A suitable parameter set can be selected for each combination of section information and compression ratio.

In the following sections, we explain the methods used in our system, and then show and discuss the evaluation results on the TSC (Text Summarization Challenge), which was held by the National Institute of Informatics.

2 Methods
In this section, we introduce our sentence extraction system. First, we explain the scoring functions used in the system, and then describe the other parts, such as the threshold types, patterns, and parameters that the system uses.

2.1 Score function
Our system uses four types of metrics to estimate the importance of sentences: sentence location, sentence length, TF/IDF values of words, and similarity to the title.
In task B of the TSC, the summarization task for information retrieval, the system also uses similarity to a given query. The significance of a sentence is given by the sum of the values of the above metrics, each weighted by a parameter. Each metric is explained in the following subsections.

2.1.1 Lead-based
Our system has a function that uses sentence location as one source of information for setting the significance of sentences. This function offers three different methods for handling sentence location. The first method gives a score of 1 to the first N sentences and 0 to the others, where N is a given threshold for the number of sentences. That is, the score of the i-th sentence, Score(i), is:

$$\mathrm{Score}(i) = \begin{cases} 1 & \text{if } i \le N \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

The second method gives the reciprocal of the sentence location; the score of the i-th sentence, Score(i), is:

$$\mathrm{Score}(i) = \frac{1}{i} \qquad (2)$$

[...] used in task A1 when the compression ratio was 10%. We tried to utilize the headline information for this task: if a headline shares some words with the query, the parameter for the query scoring function is doubled. Moreover, we set a threshold on the total score of sentences; a sentence was not extracted when its score was lower than the threshold.

The other summary set, Sum2, was intended to supply sufficient information for the IR task and to improve recall. Therefore the compression ratio was set to 50%, and at least three sentences were extracted.

2.2 Threshold
Our system calculates a significance score for every sentence and ranks the sentences in descending order of score. It can use three types of thresholds: the number of sentences, the number of characters, and the score of the sentence. Regardless of the threshold type, the extracted sentences are output in the same order as in the original article. When the number of sentences is given as a threshold, the system outputs that many top-ranked sentences. When the number of characters is given, the system converts it to a number of sentences; the maximum number of sentences within a given number of characters is
calculated by accumulating the number of characters of sentences in ascending order of rank. After the number of sentences is calculated, the system uses it as the threshold. When a threshold score is given, the system outputs the sentences whose scores are higher than the threshold score.

2.3 Patterns
Our system applies patterns to shorten sentences in task A2 of the TSC. By applying transformation patterns, we intended the generated summaries to include as many sentences as possible. There are several studies on using transformation patterns or rules for summarization. Wakao et al. [8] manually created patterns for subtitles of TV news programs. Katoh and Uratani [3] proposed a method to acquire transformation rules automatically from TV news text and teletext.

We created 20 rules manually by looking at the dry run data. We will try to acquire such rules automatically in future work.

2.4 Parameters
Our system applies parameters to the results of each scoring function in order to calculate the total score of sentences.
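The scoring and thresholding steps described in Sections 2.1, 2.2 and 2.4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature names (`location`, `tfidf`), the weight values, the example scores, and the tie-breaking rule are all assumptions made for illustration, and the total score is assumed to be a weighted sum of the individual scores.

```python
from typing import Dict, List, Optional, Sequence


def lead_score(i: int, n: int) -> float:
    """First location method: 1 for the first n sentences (1-based i), else 0."""
    return 1.0 if i <= n else 0.0


def reciprocal_score(i: int) -> float:
    """Second location method: reciprocal of the 1-based sentence position."""
    return 1.0 / i


def total_scores(features: Dict[str, Sequence[float]],
                 weights: Dict[str, float]) -> List[float]:
    """Assumed combination rule: weighted sum of the per-feature scores."""
    length = len(next(iter(features.values())))
    return [sum(weights[name] * scores[k] for name, scores in features.items())
            for k in range(length)]


def ranking(scores: Sequence[float]) -> List[int]:
    """Sentence indices by descending score (assumed tie-break: earlier first)."""
    return sorted(range(len(scores)), key=lambda k: (-scores[k], k))


def chars_to_sentence_count(sentences: Sequence[str], ranked: Sequence[int],
                            max_chars: int) -> int:
    """Accumulate lengths from the best rank down and count sentences that fit."""
    used = 0
    count = 0
    for k in ranked:
        used += len(sentences[k])
        if used > max_chars:
            break
        count += 1
    return count


def extract(sentences: Sequence[str], scores: Sequence[float],
            max_sentences: Optional[int] = None,
            max_chars: Optional[int] = None,
            min_score: Optional[float] = None) -> List[str]:
    """Select sentences by one threshold type and return them in article order."""
    ranked = ranking(scores)
    if max_chars is not None:
        # A character budget is first converted to a number of sentences.
        max_sentences = chars_to_sentence_count(sentences, ranked, max_chars)
    if max_sentences is not None:
        chosen = ranked[:max_sentences]
    elif min_score is not None:
        chosen = [k for k in ranked if scores[k] > min_score]
    else:
        chosen = list(ranked)
    return [sentences[k] for k in sorted(chosen)]


sentences = ["aaaa", "bb", "cccccc", "d"]
features = {
    "location": [lead_score(i, 2) for i in range(1, 5)],
    "tfidf": [0.2, 0.9, 0.8, 0.1],  # made-up values, not from the paper
}
weights = {"location": 1.0, "tfidf": 2.0}  # hypothetical parameters
scores = total_scores(features, weights)
print(extract(sentences, scores, max_sentences=2))  # ['bb', 'cccccc']
```

Sorting the chosen indices before output is what keeps the extracted sentences in their original article order, whichever threshold type is used.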
We estimated the optimal values of these parameters with the data used in task A1 of the TSC dry run. The dry run data we used consist of 30 newspaper articles and their manually created summaries. Summaries were created at every compression ratio (10%, 30%, and 50%), so 30 summaries are available at each compression ratio. We split the summaries into two sets, i.e. those of 15 editorials and those of the other 15 articles, on the assumption that the characteristics of editorials differ from those of other articles. We therefore finally divided the summaries into six classes by compression ratio and section information, and estimated the parameter values for each summary class. The types of location and length functions were also selected for each class.

On the other hand, two compression ratios were set in task A2: 20% and 40%. We applied the parameter sets of task A1 to task A2; the parameter set for 10% in task A1 was applied to 20% in task A2, and that for 30% to 40%.

3 Results and discussion
In this section, we show the evaluation results of our system in each task of the TSC formal run, and also discuss the actual failures of the generated summaries.

3.1 Task A1: Sentence extraction
Table 1 shows the evaluation results of our system and baseline systems in task A1, the sentence extraction task. Each row shows the section in which the articles appeared, and each column corresponds to the result at each compression ratio. The figures in Table 1 are F-measure values.
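For reference, precision, recall, and F-measure over sets of extracted sentences can be computed as in the following sketch. It assumes the balanced F-measure (the harmonic mean of precision and recall), which the text does not spell out; the function name and the example sentence indices are hypothetical. When a system outputs exactly as many sentences as the key summary contains, the two denominators coincide, so precision, recall, and F-measure take the same value.

```python
from typing import Set, Tuple


def precision_recall_f(system: Set[int], key: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-measure of extracted sentence indices."""
    correct = len(system & key)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: both sets contain three sentences, two in common.
p, r, f = precision_recall_f({1, 3, 5}, {1, 2, 5})
print(p == r, abs(f - p) < 1e-12)  # True True
```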
Actually, since all of the participants output as many sentences as the upper limit, the values of recall, precision, and F-measure are the same. Our system obtained better results than the baseline systems, especially when the compression ratio was 10%. Its average performance was second among the 9 participants.

We analyzed the causes of the errors our system made. One type of missing sentence is the short sentence. Since the length scoring function makes our system consider shorter sentences less significant, short sentences that receive no contribution from the other scoring functions do not appear in the summary. For example, in the 30% summaries of other sections, 42% (29/69) of missing sentences are shorter than 25 characters, and in the 50% summaries of other sections, 64% (33/85) of missing sentences are shorter than 20 characters. In addition, the TF/IDF and headline functions give higher scores to sentences that describe specific facts than to abstract expressions. On the other hand, the key summaries, which human annotators generated, include more abstract and shorter expressions.

Table 2 shows the system's performance when one of the features is ignored. From the table, we can see the contribution of each feature in task A1. The length, TF/IDF, and headline functions showed a negative or zero contribution at each compression ratio; [...]

Table 1. Evaluation results of the task A1
10% 30% 50%
Our system 0.463
0.284 0.432 0.586
0.276 0.367 0.530

Table 3. Evaluation results of the types of location function
Ratio Ave.
0.158 0.256 0.474
0.394 0.478 0.586
All 0.391
Ratio Ave.
0.323 0.360 0.557
0.356 0.436 0.544
All 0.429
All 0.450

Comparison with FREE summary
20% 40%
Our system 0.509
TF-based 0.516
Lead-based 0.481
Our system 0.559
TF-based 0.549
Lead-based 0.513

Table 2. Evaluation results of the task A1 when one feature is ignored
10% 30% 50%
All features 0.463
0.326 (.037) 0.394 (.041) 0.575 (.014)
0.372 (.009) 0.472 (.037) 0.600 (.011)
0.372 (.009) 0.439 (.004) 0.582 (.007)
0.403 (.040) 0.449 (.014) 0.589 (.000)
0.381 (.018) 0.438 (.003) 0.589 (.000)

Table 5. Evaluation Results of the Task A2 by Human Experts
20%R 20%C 40%R 40%C
3.07 3.33 2.60 3.07
TF-Based

Table 6. Average number of sentences in the task A2
20% 40%
4.53 (136/30) 8.93 (268/30)
with Pattern

[...] compression ratio is basically 50%. Table 7 shows the evaluation results of the summaries. "Answer level" means the correct answer type with respect to relevance to a given query in the IR task. When the answer level is A, only the articles judged A are the correct answers of the IR task. When the answer level is B, the articles judged A or B are the correct answers of the IR task. Both summaries have better evaluation results on answer level B than on answer level A, compared with the summaries of other participants. From these results, we can say that our summaries do not have enough information to distinguish articles of level A from those of level B, but have enough information to distinguish articles of level A or B from non-relevant articles.

Figures 1 and 2 show the categorization F-measure versus time for the participants. Figure 1 shows the evaluation result when the answer level is A, and Figure 2 shows the result when the level is B. While the evaluation of Sum1 is lower than that of Sum2 in both figures, its average time is shorter than that of Sum2. Considering that the difference in F-measure is small, Sum1 is more suitable than Sum2 for summarization for the IR task.

The average time for Sum2 was greater than that for the summaries of any other participant; this is a drawback for the IR task. The reason is probably that [...]

Table 7. Evaluation Results of the TSC Task B
Measurement
Sum1 0.899 0.717 0.785
0.843 0.711 0.751
TF-based 0.740 0.766 0.731
Sum1 0.793 0.904 0.828
0.736 0.888 0.773
TF-based 0.625 0.921 0.712

Figure 1. Evaluation Results between Time and F-measure (Answer level A). Axes: Average time, Average F-measure.

References
[1] Chinatsu Aone, Mary Ellen Okurowski, James Gorlinsky, and Bjornar Larsen. A Scalable Summarization System Using Robust NLP. In Proceedings of the ACL Workshop on Intelligent
Scalable Text Summarization, pages 66–73, 1997.
[2] H. Edmundson. New methods in automatic abstracting. Journal of ACM, 16(2):264–285, 1969.
[3] Naoto Katoh and Noriyoshi Uratani. A new approach to acquiring linguistic knowledge for locally summarizing Japanese news sentences (in Japanese). Journal of Natural Language Processing, 6(7):73–92, 1999.
[4] Tadashi Nomoto and Yuji Matsumoto. The Reliability of Human Coding and Effects on Automatic Abstracting (in Japanese). In IPSJ-NL 120-11, pages 71–76, July 1997.
[5] Manabu Okumura and Hidetsugu Nanba. Automated text summarization: A survey (in Japanese). Journal of Natural Language Processing, 6(6):1–26, 1999.
[6] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc., San Mateo, California, 1993.
[7] Kiyotaka Uchimoto, Qing Ma, Masaki Murata, Hiromi Ozaku, and Hitoshi Isahara. Named Entity Extraction Based on A Maximum Entropy Model and Transformation Rules. In Proceedings of the 38th Annual Meeting of Association for Computational Linguistics (ACL2000), pages 326–335, October 2000.
[8] Takahiro Wakao, Terumasa Ehara, and Katsuhiko Shirai. Summarization Methods Used for Captions in TV News Programs (in Japanese). In IPSJ-NL 122-13, pages 83–89, July 1997.

Figure 2. Evaluation Results between Time and F-measure (Answer level B). Axes: Average time, Average F-measure.