Linguistic Features and Newsworthiness: An Analysis of News Style

Maria Pia di Buono, Jan Šnajder
University of Zagreb, Faculty of Electrical Engineering and Computing
Text Analysis and Knowledge Engineering Lab
Unska 3, 10000 Zagreb, Croatia

Abstract

English. In this paper, we present a preliminary study on the style of headlines in order to evaluate the correlation between linguistic features and newsworthiness. Our hypothesis is that each particular linguistic form or stylistic variation can be motivated by the purpose of encoding a certain newsworthiness value. To discover the correlations between newsworthiness and linguistic features, we perform an analysis on the basis of characteristics considered indicative of a shared communicative function and of discriminating factors for headlines.

Italiano. Questo contributo descrive uno studio preliminare sullo stile dei titoli nelle notizie, al fine di valutare la correlazione tra gli aspetti linguistici e il valore delle notizie. La nostra ipotesi è che ogni particolare forma linguistica o variazione stilistica possa essere motivata dall’obiettivo di codificare un certo valore di notiziabilità. Al fine di analizzare la correlazione tra il valore delle notizie e gli aspetti linguistici, effettuiamo un’analisi sulla base delle caratteristiche considerate indicative di una funzione comunicativa condivisa e di fattori discriminanti per i titoli.

1 Introduction

Newsworthiness refers to a set of criteria by means of which quantity and type of events are selected in order to produce news (Wolf and de Figueiredo, 1987). That is to say, ‘news is not simply that which happens, but that which can be regarded and presented as newsworthy’ (Fowler, 2013).
Galtung and Ruge (1965) identify a list of factors that an event should satisfy to become news; in other words, the likelihood of an event being considered newsworthy increases with the number of factors it complies with.

The newsworthiness factors reflect a set of values and provide a certain representation of the world (Fowler, 2013). This representation and the corresponding values are constructed and encoded in the language used in the news. For this reason, each particular linguistic form or stylistic variation can be motivated by the purpose of representing a certain value. According to Labov’s axiom (1972), style ranges along a single dimension, namely the attention paid to speech. Bell (1984) refutes this axiom, stating that style can also be considered a response to other factors. These factors constitute a new dimension of stylistic variation that, in headlines, might be related to the need to reflect newsworthiness factors and to meet two goals: attracting users’ attention and summarizing content (Ifantidou, 2009).

This paper aims to provide a preliminary analysis of the linguistic features in news headlines and how these relate to specific newsworthiness categories. The analysis rests on the hypothesis that each particular linguistic form or stylistic variation can be motivated by the purpose of encoding a certain newsworthiness value. The remainder of the paper is structured as follows. In Section 2, we describe the related work on stylistic analysis of news and headlines. In Section 3, we describe the data set and the classification scheme we use. In Section 4, we introduce our methodology together with the analysis we perform, while in Section 5 we discuss the results. Section 6 concludes the paper.

2 Related Work

Several works, based on sociolinguistic and discourse analysis frameworks, have investigated stylistic features and linguistic variations in both newspapers and headlines, on the basis of different parameters and aspects (Develotte and Rechniewski, 2001; Pajunen, 2008).
The large number of existing contributions to the field is explained by the social implications of news media communication and its language.

A considerable amount of research has analyzed the language of news media from a broader perspective (Bell, 1991; Matheson, 2000; Cotter, 2010; Conboy, 2013; Fowler, 2013; Van Dijk, 2013). Generally speaking, these works emphasize the influence of news language on our perception of the world, since news media select events and narratives and use language to project them.

Another strand of research focuses on specific linguistic aspects of journalistic style. For instance, Tannenbaum and Brewer (1965) analyze the syntactic structure across different news content areas, while Schneider (2000) analyzes the textual structures in British headlines, revising the traditional distinction between verbal and nominal headlines.

3 Data Description

In our work, we adopt the data set proposed for SemEval-2007 task 14 (Strapparava and Mihalcea, 2007), a corpus of 1250 headlines extracted from major newspapers and news web sites such as New York Times, CNN, BBC News, and the Google News search engine. The SemEval-2007 task 14 data set was originally developed for emotion classification and annotated with emotion labels. Relevant for the purpose of the present work is the annotation of this dataset by di Buono et al. (2017), who provided additional newsworthiness labels (“news values”), using the scheme proposed by Harcup and O’Neill (2016). Harcup and O’Neill proposed a set of 15 values, corresponding to a set of requirements that news stories have to satisfy to be selected for publishing. They claimed that these criteria are also related to practical considerations, e.g., the availability of resources and time, and to a mix of other influences, e.g., who is selecting news, for whom, in what medium and by what means (and available resources), that can cause fluctuations within the suggested hierarchy. Di Buono et al.
report that two out of 15 news value labels (Audio-visuals, News organization’s agenda) were difficult to annotate out of context even for trained annotators, while two (Exclusivity, Relevance) were not well represented in the data. Their final dataset thus contains 11 labels.

Table 1 lists the news value labels, their counts in the data set, and the inter-annotator agreement measured in terms of (adjudicated) kappa-score, as reported by Di Buono et al.

Table 1: News value labels, their counts, and the inter-annotator agreement in terms of kappa-score.

4 Linguistic Features

Our methodology for defining the stylistic variations related to newsworthiness categories relies on a descriptive analysis of different features, i.e., syntactic, lexical, and compositional features.

We extracted these using Coh-Metrix, a computational tool that provides a wide range of language and discourse metrics (Graesser et al., 2004; McNamara et al., 2014). Coh-Metrix was developed on the basis of cognitive models in discourse psychology to detect both coherence and cohesion in texts. According to Louwerse (2004), “coherence refers to the representational relationships of a text in the mind of a reader whereas cohesion refers to the textual indications that coherent texts are built upon.” Coh-Metrix describes coherence and cohesion by means of more than one hundred linguistic features, based on a multilevel framework, i.e., words, syntax, the situation model, the discourse genre, and rhetorical structure (Dowell et al., 2016).

We ran the Coh-Metrix analysis on headlines from our dataset, grouped according to the 11 newsworthiness labels. We then analyzed these results manually and decided to adopt a subset of Coh-Metrix indices which, according to our initial hypothesis, we consider to be discriminating factors for newsworthiness, i.e., the text easability principal component scores and the word information indices.
Because they capture linguistic characteristics and syntactic context, these features are suitable for representing stylistic variations and, therefore, the underlying news value.
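To make the word information indices more concrete: Section 5.2 describes them as incidence scores, i.e., the number of words of a specific part of speech per 1000 words. The following minimal Python sketch computes such scores for headlines pooled by news value label. It is an illustration only and not the Coh-Metrix implementation: the function names, the coarse tag set (NOUN, VERB, ...), the hand-assigned tags, and the choice to pool all tokens of a label before counting are assumptions made for this example. The three headlines and their labels are examples quoted in this paper.

```python
from collections import Counter, defaultdict

# Hypothetical coarse tag set; Coh-Metrix uses its own tagger and categories.
CATEGORIES = ["NOUN", "VERB", "ADJ", "ADV", "PRON"]


def incidence_scores(tagged_tokens, categories):
    """Occurrences per 1000 words for each part-of-speech category."""
    total = len(tagged_tokens)
    if total == 0:
        return {c: 0.0 for c in categories}
    counts = Counter(tag for _, tag in tagged_tokens)
    return {c: 1000.0 * counts[c] / total for c in categories}


def group_by_label(annotated_headlines):
    """Pool the tagged tokens of all headlines that carry the same label."""
    grouped = defaultdict(list)
    for tokens, labels in annotated_headlines:
        for label in labels:
            grouped[label].extend(tokens)
    return dict(grouped)


# Headlines quoted in this paper; the tags are assigned by hand (assumption).
DATA = [
    ([("Action", "NOUN"), ("games", "NOUN"), ("improve", "VERB"),
      ("eyesight", "NOUN")], ["Entertainment"]),
    ([("Merkel", "NOUN"), ("Stop", "VERB"), ("Iran", "NOUN")], ["Conflict"]),
    ([("Bomb", "NOUN"), ("kills", "VERB"), ("18", "NUM"), ("on", "ADP"),
      ("military", "ADJ"), ("bus", "NOUN"), ("in", "ADP"),
      ("Iran", "NOUN")], ["Bad news"]),
]

scores = {
    label: incidence_scores(tokens, CATEGORIES)
    for label, tokens in group_by_label(DATA).items()
}

for label, vector in scores.items():
    print(label, vector)
```

With only a handful of words per headline, a single token shifts a score by more than a hundred points, which is one reason to compute the scores over all headlines of a category rather than per headline.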

Table 2: Z-scores for PC Syntactic simplicity (PCSYNz) and PC Word concreteness (PCCNCz).

5 Analysis and Results

In our preliminary analysis, we consider two main types of linguistic features: text easability and word information scores.

5.1 Text Easability Features

Coh-Metrix text easability indices (“Text easability principal component scores”) are designed to measure text ease in a way that goes beyond traditional readability metrics. We focused specifically on two indices, syntactic simplicity (PCSYNz) and word concreteness (PCCNCz) (Table 2).

Syntactic simplicity is evaluated on the basis of the number of words and the complexity of the syntactic structures of sentences. The variability among the categories on this index is not high; nevertheless, we can distinguish two groups. The first group, with a higher PCSYNz, consists of headlines labeled with the ‘Power elite’, ‘Bad news’, ‘Shareability’, ‘Drama’, ‘Magnitude’, and ‘Celebrity’ news values. Higher scores here indicate that the sentence contains more words and uses complex syntactic structures, as exemplified by the following headlines from this group:

(1a) China says rich countries should take lead on global warming (Power elite)
(1b) Iraqi suicide attack kills two US troops as militants fight purge (Bad news)
(1c) Second opinion: girl or boy? as fertility technology advances, so does an ethical debate (Shareability)
(1d) Damaged Japanese whaling ship may resume hunting off Antarctica (Drama)
(1e) Ready to eat chicken breasts recalled due to suspected listeria (Magnitude)
(1f) Jackass’ star marries childhood friend The secrets people reveal (Celebrity)

The second group consists of headlines labeled with ‘Entertainment’, ‘Surprise’, ‘Follow up’, ‘Good news’, and ‘Conflict’, which received lower PCSYNz scores and are thus of lower syntactic complexity.
Examples of headlines from this group are as follows:

(2a) Action games improve eyesight (Entertainment)
(2b) Breast cancer drug promises hope (Good news)
(2c) Merkel: Stop Iran (Conflict)

The second index, word concreteness, differentiates three groups of headlines: (i) ‘Power elite’, ‘Entertainment’, ‘Shareability’, ‘Celebrity’, and ‘Conflict’, all with a low z-score; (ii) ‘Follow up’, ‘Drama’, and ‘Magnitude’, with a medium z-score; and (iii) ‘Bad news’, ‘Surprise’, and ‘Good news’, with a high z-score. The following headlines exemplify each of the three groups:

(1a) Action intensity boosts vision (Shareability)
(2a) Ex-suspect slams anti-terror laws (Drama)
(3a) Ancient coin shows Cleopatra was no beauty (Surprise)

The word concreteness index measures the concreteness level of content words. Thus, news values with lower scores are characterized by a higher number of abstract words and, for this reason, may be harder to understand without appropriate context. Our analysis thus suggests that ‘Bad news’, ‘Surprise’, and ‘Good news’ headlines typically refer to more concrete events and entities than the other categories of news values.

5.2 Word Information

This Coh-Metrix index refers to information about syntactic categories and function words, evaluated in the sentence context. To visualize the relations between newsworthiness and word information, we performed a hierarchical cluster analysis. We first represent each headline as a vector of ten word incidence scores (the number of words of a specific part of speech per 1000 words): incidence scores for nouns, verbs, adjectives, adverbs, personal pronouns, pronouns in first, second, and third person, separately for singular and plural. We then use hierarchical agglomerative clustering with complete linkage and one minus Pearson’s correlation coefficient as the distance measure to obtain the clusters. Fig. 1 shows the resulting dendrogram. We can identify three groups of news values on the basis of their syntactic structures.

Figure 1: Dendrogram of the 11 newsworthiness categories based on the headline word information features.

The first group consists only of news values that can be characterized as conveying positive content or sentiment, namely ‘Good news’, ‘Entertainment’, and ‘Shareability’. This group is characterized by a fairly high incidence of adjectives and a low incidence of first person singular and third person plural pronouns. Furthermore, this group has the highest incidence of second person pronouns. Examples include:

(1a) Feeding your brain: new benefits found in chocolate (Good news)
(1b) Free Will: Now you have it, now you don’t (Entertainment)
(1c) Nap your way to a successful career (Shareability)

The second group consists of ‘Celebrity’, ‘Power elite’, and ‘Drama’. This group has a low incidence of adjectives and adverbs. The most frequent pronouns are the first person plural and the third person singular.

(2a) Beyonce new SI bikini cover girl (Celebrity)
(2b) Bush vows cooperation on health care (Power elite)
(2c) Collision on icy road kills 7 (Drama)

The third group consists of two subsets, the first formed by ‘Surprise’ and ‘Magnitude’, and the second formed by ‘Follow up’, ‘Bad news’, and ‘Conflict’. ‘Surprise’ and ‘Magnitude’ form a separate subset because they have the highest adjective incidence of all categories and a low incidence of pronouns. For instance:

(3a) In the world of life-saving drugs, a growing epidemic of deadly fakes (Surprise)
(3b) Flu Vaccine Appears Safe for Young Children (Magnitude)

The second subset is formed by negative content or sentiment, characterized by the lowest incidence of adverbs and pronouns:

(3c) Eight years for Damilola killers (Follow up)
(3d) Bomb kills 18 on military bus in Iran (Bad news)
(3e) Venezuela, Iran fight U.S. dominance (Conflict)

6 Conclusions and Future Work

We described a preliminary study on the style of headlines in order to evaluate the correlation between syntactic features and newsworthiness. Our hypothesis is that each particular linguistic form or stylistic variation can be motivated by the purpose of encoding a certain newsworthiness value. We performed a linguistic analysis to discover the correlations between newsworthiness and some stylistic features, on the basis of characteristics considered indicative of a shared communicative function and of discriminating factors for headlines.

This preliminary analysis opens up a number of interesting research directions. One is the study of other stylistic variations of headlines, besides the ones examined in this paper. Another research direction is the comparison between style in headlines and in full-text stories. It would also be interesting to analyze how communicative functions in headlines correlate with the events described in the corresponding text. We intend to pursue some of this work in the near future.
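As an illustration of the clustering procedure described in Section 5.2, the following Python sketch implements hierarchical agglomerative clustering with complete linkage, using one minus Pearson's correlation coefficient as the distance. It is a minimal, unoptimized sketch under stated assumptions and not the code used for the analysis: the function names are hypothetical, and the four toy vectors (A to D) are invented for demonstration and are not data from this study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (norm_x * norm_y)


def complete_linkage(vectors):
    """Agglomerative clustering; returns the merges in the order performed.

    `vectors` maps an item name to its feature vector. Each merge is a
    tuple (cluster_a, cluster_b, distance), where clusters are tuples of
    item names and distance is the complete-linkage distance.
    """
    names = list(vectors)
    # Pairwise distance between items: one minus Pearson's correlation.
    dist = {
        frozenset((a, b)): 1.0 - pearson(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

    def cluster_distance(c1, c2):
        # Complete linkage: distance of the two farthest members.
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented toy vectors (not from the paper), one per hypothetical category.
toy = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 9],
    "C": [4, 3, 2, 1],
    "D": [8, 6, 5, 1],
}

merges = complete_linkage(toy)
for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 4))
```

The sequence of merges is the information a dendrogram such as Fig. 1 displays. For real data, an established library routine is preferable; for example, SciPy's `scipy.cluster.hierarchy.linkage` supports `method="complete"` with `metric="correlation"`.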

Acknowledgments

This work has been funded by the Unity Through Knowledge Fund of the Croatian Science Foundation, under the grant 19/15: EVEnt Retrieval Based on semantically Enriched Structures for Interactive user Tasks (EVERBEST).

References

Allan Bell. 1984. Language style as audience design. Language in Society, 13(02):145–204.

Allan Bell. 1991. The Language of News Media. Language in Society. Blackwell.

Martin Conboy. 2013. The Language of the News. Routledge.

C. Cotter. 2010. News Talk: Investigating the Language of Journalism. Cambridge University Press.

Christine Develotte and Elizabeth Rechniewski. 2001. Discourse analysis of newspaper headlines: a methodological framework for research into national representations. Web Journal of French Media Studies, 4(1).

Maria Pia di Buono, Jan Šnajder, Bojana Dalbelo Bašić, Goran Glavaš, Martin Tutek, and Natasa Milic-Frayling. 2017. Predicting news values from headline text and emotions. In Proceedings of Natural Language Processing Meets Journalism Workshop (EMNLP 2017), page to appear.

Nia M. Dowell, Arthur C. Graesser, and Zhiqiang Cai. 2016. Language and discourse analysis with Coh-Metrix: Applications from educational material to learning environments at scale. Journal of Learning Analytics, 3(3):72–95.

Roger Fowler. 2013. Language in the News: Discourse and Ideology in the Press. Routledge.

Johan Galtung and Mari Holmboe Ruge. 1965. The structure of foreign news: The presentation of the Congo, Cuba and Cyprus crises in four Norwegian newspapers. Journal of Peace Research, 2(1):64–90.

Arthur C. Graesser, Danielle S. McNamara, Max M. Louwerse, and Zhiqiang Cai. 2004. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, 36(2):193–202.

Tony Harcup and Deirdre O’Neill. 2016. What is news? News values revisited (again). Journalism Studies, pages 1–19.

Elly Ifantidou. 2009. Newspaper headlines and relevance: Ad hoc concepts in ad hoc contexts. Journal of Pragmatics, 41(4):699–720.

William Labov. 1972. Sociolinguistic Patterns. Number 4. University of Pennsylvania Press.

Max M. Louwerse. 2004. Un modelo conciso de cohesión en el texto y coherencia en la comprensión. Revista Signos, 37(56):41–58.

Donald Matheson. 2000. The birth of news discourse: Changes in news language in British newspapers, 1880-1930. Media, Culture & Society, 22(5):557–573.

Danielle S. McNamara, Arthur C. Graesser, Philip M. McCarthy, and Zhiqiang Cai. 2014. Automated Evaluation of Text and Discourse with Coh-Metrix. Cambridge University Press.

Juhani Pajunen. 2008. Linguistic analysis of newspaper discourse in theory and practice.

Kristina Schneider. 2000. The emergence and development of headlines in British newspapers. English Media Texts, Past and Present: Language and Textual Structure, 80:45.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 Task 14: Affective text. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 70–74. Association for Computational Linguistics.

Percy H. Tannenbaum and Richard K. Brewer. 1965. Consistency of syntactic structure as a factor in journalistic style. Journalism Quarterly, 42(2):273–275.

Teun A. Van Dijk. 2013. News as Discourse. Routledge.

Mauro Wolf and Maria Jorge Vilar de Figueiredo. 1987. Teorias da comunicação. Presença.
