
MEDHA - 2012
Proceedings published by International Journal of Computer Applications (IJCA)

Spam Detection and Filtering using Different Methods

Bhawana S. Dakhare
ME Computers SEM III
Terna Engg. College, Nerul, Navi Mumbai
Mumbai University, Mumbai

Ujwala V. Gaikwad
Assistant Professor
Terna Engg. College, Nerul, Navi Mumbai
Mumbai University, Mumbai

ABSTRACT
Spam is unsolicited bulk mail, or junk email. Email is regarded as the fastest medium for communicating over long distances in a short time, and in recent years spam has become a big problem for the Internet and for electronic communication. Several techniques have been developed to fight it. This survey paper gives an overview of existing e-mail spam filtering methods, focusing on the classification, evaluation, and comparison of traditional methods. The methods discussed are Collaborative Spam Filtering Using E-Mail Networks, Support Vector Machines, and Spam Filtering with Dynamically Updated URL Statistics. The methods are compared and their performance is evaluated.

Keywords
Collaborative spam filtering, Spam, Support vector machine.

1. INTRODUCTION
1.1 What is Spam?
Spam emails are emails that the receiver does not wish to receive. Because email is so widely used for communication, it is also regarded as one of the best channels for advertising, and spam is the result. Today, large volumes of spam email cause serious problems for users, Internet Service Providers, and the whole Internet backbone. Spam emails waste not only resources such as bandwidth, storage, and computation power, but also the time and energy of email receivers, who must search for legitimate emails among the spam and take action to dispose of the spam. Different filtering methods are available. One of them, SpamAssassin, is a widely used host-level filter. It is a rule-based filter whose rules must be changed constantly to remain effective [2]. However, some attackers figure out the rules being employed and bypass these filters by constructing their emails accordingly.

The rest of the paper is organized as follows: Section 1.2 discusses what features can be extracted from email, Section 1.3 classifies filtering methods by scope, Section 2 describes different filtering methods, Section 3 compares the methods, and Section 4 gives the conclusion, followed by the references.

1.2 Feature Extraction from Email Message
Mail messages can be filtered separately, for example by checking for certain words (keyword filtering), or in groups; that is, a filter may consider the arrival of a dozen substantially identical messages within 5 minutes more suspicious than the arrival of one message with the same content. A filter that involves user collaboration also receives multiple user judgments about some of the new messages for analysis, as shown in Figure 1 (a) and (b) [3].

An email message consists of two parts, a body and a header. The message body consists of text in a natural language, possibly with HTML markup and graphical elements. The header is a structured set of fields, each having a name, a value, and a specific meaning. Some of these fields, like From, To, or Subject, are standard; others may depend on the software involved in message transmission, such as spam filters installed on mail servers. The Subject field contains what the user sees as the subject of the message and is often treated as part of the message body. The body is sometimes referred to as the content of the message. The non-content features are not limited to the features of the header. The designer of a message analysis method must choose how to perform feature extraction, that is, decide which parts of the message are used for analysis.

The simplest way is to represent the message as an unstructured set of tokens, namely sequences of characters separated by spaces and punctuation marks. This model can be used to characterize any part of a message, or the message as a whole. In this case, the presence of a certain word in the message is considered a binary feature of the message. A somewhat more sophisticated approach is to treat occurrences of the same word in different parts of the message as different features, for example 'John' in the message body and 'John' in the 'From' field. For message header analysis, more sophisticated ways of selecting features take the header structure into account and extract only some special kinds of information. Some methods are based on non-content features, including features extracted from the header, such as sender and recipient email names, domain names and zones, and general characteristics of the message, such as the message size and the number of attachments. Some methods use graphics or images for analysis instead of text. The analysis can also be performed by checking for the presence of certain predefined tokens in the message body (keyword filtering) or in the information about the sender (blacklist/whitelist filtering).
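The token-based representation described above can be illustrated with a minimal sketch. It assumes a message is given as a dictionary mapping a part name (such as "from" or "body") to its text; the function names `tokenize` and `binary_features` and the sample message are hypothetical and serve only as an illustration, not as part of any surveyed system.

```python
import re


def tokenize(text):
    """Split text into lowercase tokens separated by spaces and punctuation."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def binary_features(message):
    """Return the set of binary features present in a message.

    `message` maps a part name (e.g. "from", "body") to its text.
    Each feature is a (part, token) pair, so the same word in two
    different parts of the message yields two different features.
    """
    features = set()
    for part, text in message.items():
        for token in tokenize(text):
            features.add((part, token))
    return features


# Hypothetical example message, used only for illustration.
message = {
    "from": "John <john@example.com>",
    "body": "Dear Mike! John would like to congratulate you.",
}
features = binary_features(message)
print(("from", "john") in features, ("body", "john") in features)
```

Keeping the part name inside each feature is one simple way to tell 'John' in the body apart from 'John' in the From field; dropping the part name gives the plain unstructured set of tokens.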

Figure 1(a): Feature Extraction. [Diagram: a message can be analyzed as a whole message (as an unstructured set of tokens, or by general characteristics such as size), by its header (as an unstructured set of tokens, or by selected fields), or by its body (as an unstructured set of tokens, as a text in a natural language, or by its graphical elements).]

1.3 Classification of Spam Filtering Methods Depending on Filtration Scope
Depending on filtration scope, spam filtration methods are divided into the following categories [4].

1.3.1 Client Side/Personal Filters:
Client-side filters work directly on the user's computer: email is loaded onto the user's local computer and filtered there. Client-side filters use the user's personal information, whereas in server-side filters the filtration model is defined once for all users. Although it is obvious to most users what spam is, each user's notion of spam is quite personal: an email message marked as spam by one person may be important information for another. On the other hand, a personal model of email classification involves an inevitable overhead cost. First, users must construct their personal filtration model themselves, since only they can define what is legitimate email and what is spam for them. Second, the construction, storage, and use of a personal model demand additional computing resources.

1.3.2 Server Side/General Filters:
Server-side filters work at the mail server level. Server-side filtration systems generally apply the traditional methods of filtration. Server-side filtration also has its own advantages. The centralized solution reduces expenses and simplifies support and control of the system. Users become more mobile, since it is easier to store mail centrally on the server and to access it from different places using different devices.

1.3.3 Spam Filtering In Public Mail Servers
This solution is sometimes better than the client or server solution. In this case users are mobile, as with server-side filtration, and filtering is personalized, as with the client-side solution. The disadvantage of public mail servers is that users depend on the filtration product installed there. For example, gmail.com, the mail server of Google Inc., uses its own products against spam. This system considers personal information about the user to minimize false positives. The public mail provider Mail.ru uses the Kaspersky Anti-Spam product, which is based on "Spamtest" technology and relies entirely on traditional filtration methods.

The different methods are listed here. There are several popular content filters, such as Bayesian filters, rule-based filters, Support Vector Machines (SVM), and Artificial Neural Networks (ANN). Many machine learning approaches have been explored for this task, for example rule-based methods such as Ripper, PART, decision trees, and rough sets. However, pure rule-based methods have not achieved high performance, because spam emails cannot easily be covered by rules and rules do not provide any sense of degree of evidence. Besides Bayesian methods, other machine learning methods, including Support Vector Machine (SVM), Rocchio, kNN, and Boosting, have also been applied to anti-spam filtering. Spam filtering is required not only for technical reasons, such as the waste of network bandwidth and email storage, but also because of social issues such as child safety and phishing email. Spam makes users look through and sort out additional email, not only wasting their time and causing loss of work productivity, but also irritating them and, as many claim, violating their privacy rights. Spam advertising also causes legal problems.

Figure 1(b): Message Structure from Point of Feature Extraction. [Diagram: an example message with From, To, and Received header fields and a body beginning "Dear Mike! I would like to congratulate you with", shown under each representation: header as an unstructured set of tokens (From, mary, example, com, to, mike, org, received); selected header fields (the IP addresses in the Received field); the whole message as an unstructured set of tokens; general characteristics (size 2,411, number of attachments 0); body as a text in natural language; body as an unstructured set of tokens (Dear, mike, I, would, like, to, congratulate); graphical elements.]

2. DIFFERENT METHODS USED FOR FILTRATION
Some filtration methods are discussed here.

2.1 Spam Filtering With Dynamically Updated URL Statistics:
Many URL-based spam filters rely on "white" and "black" lists to classify email [6]. The proposed URL-based spam filter instead analyzes URL statistics to dynamically calculate the probabilities that email containing specific URLs is spam or legitimate, and then classifies the email accordingly. The filter is based on observing the statistics of URLs in email and uses the naïve Bayesian algorithm to decide whether an email is spam or not. When a new email, $E$, reaches an email system, the filter extracts the email's URLs and host names ($h_1, h_2, \ldots, h_n$). Multiple appearances of an identical $h_i$ are treated as a single appearance of $h_i$. The filter then calculates two probabilities: that the email is spam, $P(\mathrm{Spam} \mid E)$, and that it is legitimate, $P(\mathrm{Legitimate} \mid E)$. These probabilities are calculated using a frequency table and the naïve Bayesian algorithm. If $P(\mathrm{Spam} \mid E)$ is greater than $P(\mathrm{Legitimate} \mid E)$, the filter classifies the email as spam and pushes it into its spam pool. Otherwise, it considers the email legitimate and sends it to the client.
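The classification rule described so far can be sketched as follows. This is only an illustration under stated assumptions, not the implementation from [6]: the class name `UrlBayesFilter`, the URL regular expression, the add-one smoothing of the frequency table, and the choice to score only the hosts that occur in the email are all assumptions. The source's rule that an email whose probabilities cannot be calculated is treated as legitimate is interpreted here as "none of its hosts appear in the frequency table".

```python
import math
import re
from collections import Counter

# Assumed pattern: captures the host part of http/https URLs.
HOST_RE = re.compile(r"https?://([^/\s:?#]+)", re.IGNORECASE)


def extract_hosts(email_text):
    """Return the distinct host names found in the email's URLs.

    Using a set treats multiple appearances of a host as one appearance.
    """
    return {host.lower() for host in HOST_RE.findall(email_text)}


class UrlBayesFilter:
    """Naive Bayes over URL host names, backed by a frequency table."""

    def __init__(self):
        self.host_counts = {"spam": Counter(), "legit": Counter()}
        self.email_counts = {"spam": 0, "legit": 0}

    def train(self, email_text, label):
        """Update the frequency table with one labeled email."""
        self.email_counts[label] += 1
        self.host_counts[label].update(extract_hosts(email_text))

    def _log_score(self, hosts, label):
        total = sum(self.email_counts.values())
        n = self.email_counts[label]
        # Add-one smoothing (an assumption) avoids log(0).
        score = math.log((n + 1) / (total + 2))
        for host in hosts:
            score += math.log((self.host_counts[label][host] + 1) / (n + 2))
        return score

    def classify(self, email_text):
        hosts = extract_hosts(email_text)
        known = [h for h in hosts
                 if self.host_counts["spam"][h] or self.host_counts["legit"][h]]
        if not known:
            # Probabilities cannot be calculated: treat as legitimate.
            return "legit"
        spam = self._log_score(known, "spam")
        legit = self._log_score(known, "legit")
        return "spam" if spam > legit else "legit"
```

Because `train` can be called whenever a newly labeled email arrives, the frequency table, and with it the computed probabilities, changes over time, which is the sense in which the URL statistics are dynamically updated.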
Periodically, the filter sends the list of spam in the pool to the email clients so they can recover any misclassified email. If the filter can't calculate an email's probabilities, it classifies it as legitimate.

Comparison with other filters: The authors compared their filter with SpamAssassin on the same email set. SpamAssassin is a collaborative filter that combines more than 20 filters, including keyword-based, Bayesian, and URL-based filters, to classify email. In SpamAssassin, each filter assigns a message a credit; if the message's accumulated credit is greater than a threshold, SpamAssassin classifies it as spam.

2.2 Support Vector Machines:
Support vector machines (SVMs) classify objects by projecting them into an n-dimensional space [7]. The dimensionality is determined by the number of characteristics of the training or query vector. The actual classification is done by filling the vector space with labeled elements from the training set and creating a hyperplane that separates the points according to their labels. A query can then be categorized by simply projecting it into the same space and determining on which side of the plane it resides. Execution with this method is very fast, but the disadvantage is that training takes longer when there is a large number of examples [10].

The key concepts are the following. There are two classes, $y_i \in \{-1, 1\}$, and $N$ labeled training examples $(x_1, y_1), \ldots, (x_N, y_N)$, $x_i \in \mathbb{R}^d$, where $d$ is the dimensionality of the vector. If the two classes are linearly separable, one can find an optimal weight vector $w^*$ such that $\|w^*\|^2$ is minimum and

$$w^* \cdot x_i - b \ge 1 \ \text{ if } y_i = 1, \qquad w^* \cdot x_i - b \le -1 \ \text{ if } y_i = -1,$$

or equivalently

$$y_i (w^* \cdot x_i - b) \ge 1.$$

Training examples that satisfy the equality are termed support vectors. The support vectors define two hyperplanes, one that goes through the support vectors of one class and one that goes through the support vectors of the other class. The distance between the two hyperplanes defines a margin, and this margin is maximized when the norm of the weight vector $w^*$ is minimum. The authors show that this minimization can be carried out by maximizing the following function with respect to the variables $\alpha_j$:

$$W(\alpha) = \sum_{i=1}^{N} \alpha_i - 0.5 \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j (x_i \cdot x_j) y_i y_j$$

subject to the constraint $0 \le \alpha_j$, where it is assumed that there are $N$ training examples, $x_i$ is one of the training vectors, and $\cdot$ represents the dot product. The advantage of the linear representation is that $w^*$ can be calculated after training, and classification amounts to computing the dot product of this optimum weight vector with the input vector.

2.3 Collaborative Spam Filtering Using E-Mail Networks:
Collaborative spam filters use the collective memory of, and feedback from, users to reliably identify spam [8]. That is, for every new spam sent out, some user must first identify it as spam, for example via locally generated blacklists or human inspection; any subsequent user who receives a suspect e-mail can then query the user community to determine whether the message is already tagged as spam. In this method the spam filtering system uses two key mechanisms to exploit the topological properties of social e-mail networks: the novel percolation search algorithm, which reliably retrieves content in an unstructured network by looking through only a fraction of the network, and the well-known digest-based indexing scheme.

Percolation search algorithm: This algorithm passes messages on direct links only and includes three key steps:

Cache or content implantation: Each node performs a short random walk in the network and caches its content list on each visited node.
The length of this short random walk is referred to as the time to live (TTL).

Query implantation: A node making a query executes a short random walk of the same length as the TTL used in the content implantation process and implants its query requests on the nodes visited.

Bond percolation: The algorithm propagates all implanted query requests through the network in a probabilistic manner; upon receiving the query, a node relays it to each neighboring node with percolation probability $p$, which is a constant multiple of the percolation threshold, $p_c$, of the underlying network.

The filtering system consists of the following functions:

Digest publication: If the client program determines that the e-mail is definitely spam, it calls the digest function to generate a digest, $D_e$, for the message and caches the digest on a short random walk of length $l$, which is the TTL.

Query implantation: If the client program suspects that the e-mail is spam, it can query the system to determine whether any other user in the network already has $D_e$ on its spam list. It implants each query message for this digest via a random walk of length $l$; in the source's example, a node that receives a suspected message implants a query via a random walk with a TTL equal to 2.

Bond percolation: Nodes with an implanted query request percolate the query message containing $D_e$ through the e-mail contact network. Each node that the query visits declares a hit if the digest matches any messages cached on that node.

Hit routeback: The client program routes all hits back to the node that originated the query through the same path by which the query message arrived at the hit node. In the source's example, the system routes the hits found at other nodes back to the first node through the same path.

Hit processing: After routing all hits back, the client program calculates the number of hits received. If this HitScore exceeds a constant threshold value, the program declares the message in question as spam; otherwise, it determines the message not to be spam.
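The steps above can be sketched in a small simulation. This is a simplified illustration rather than the system from [8]: the contact network is a plain adjacency dictionary, SHA-256 stands in for the unspecified digest function, the starting node is counted as visited by the random walk, and hits are counted directly instead of being routed back along the query path. All function names and parameter values are hypothetical.

```python
import hashlib
import random


def digest(message):
    """Stand-in digest function (assumption: the source does not name one)."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


def random_walk(graph, start, ttl, rng):
    """Return the nodes visited by a random walk of `ttl` steps from `start`."""
    visited, node = [start], start
    for _ in range(ttl):
        node = rng.choice(graph[node])
        visited.append(node)
    return visited


def publish(graph, caches, node, message, ttl, rng):
    """Digest publication: cache the spam digest along a short random walk."""
    d = digest(message)
    for visited in random_walk(graph, node, ttl, rng):
        caches[visited].add(d)


def query(graph, caches, node, message, ttl, p, rng):
    """Query implantation and bond percolation; returns the HitScore."""
    d = digest(message)
    # Query implantation: the query is placed on every node of a short walk.
    frontier = list(dict.fromkeys(random_walk(graph, node, ttl, rng)))
    reached = set(frontier)
    # Bond percolation: relay over each direct link with probability p.
    while frontier:
        current = frontier.pop()
        for neighbor in graph[current]:
            if neighbor not in reached and rng.random() < p:
                reached.add(neighbor)
                frontier.append(neighbor)
    # A reached node declares a hit if it has the digest cached.
    return sum(1 for n in reached if d in caches[n])


def is_spam(hit_score, threshold):
    """Hit processing: spam if the HitScore exceeds the threshold."""
    return hit_score > threshold


# Tiny demo on a hypothetical 6-node ring of e-mail contacts.
rng = random.Random(0)
graph = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
caches = {i: set() for i in graph}
publish(graph, caches, 0, "cheap pills", ttl=2, rng=rng)
hits = query(graph, caches, 3, "cheap pills", ttl=2, p=0.9, rng=rng)
print("HitScore:", hits, "spam:", is_spam(hits, threshold=0))
```

In a real deployment the percolation probability would be tied to the percolation threshold of the actual contact network; the fixed value in the demo is chosen only to make the small ring well connected.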
The client program places all e-mail messages declared as spam in the user's spam folder. It then calls the function that generates the digest of the spam message, $D_e$, and caches this on a short random walk, taking the process back to the digest publication step. Note that the system exchanges all messages via background e-mails. Users are not required to click and open any system message or file. Moreover, the system can program clients to reject all messages that do not match a predefined format and thus are potentially malicious. Finally, the authors of [8] recommend adding a personalization feature that lets the user blacklist only spam addressed to the public.

3. COMPARISON OF METHODS
Email spam consists of bulk, promotional, and unsolicited messages, and it seriously wastes time and resources. In this paper we surveyed existing techniques and algorithms created to fight against web spam. We discussed how spam affects users and search engine companies and motivates academic research. We then turned to algorithms for web spam detection and analyzed their characteristics and underlying ideas. Finally, we summarize the key principles behind the anti-spam algorithms.

Support vector machine is a powerful and popular machine learning tool for solving supervised classification problems, owing to its good generalization performance. The AT&T staff applied SVM to 3000 email messages and found that 850 messages were considered spam; the experiments considered the body of the message. The results indicate that the method performs better when it uses binary features. SVM requires more training time. Accuracy can be improved by generating a list of acceptable senders whose messages are considered non-spam no matter what the subject and body contain.

A major drawback of collaborative filtering schemes is that they ignore the already present and pervasive social communities in cyberspace and instead try to create new ones of their own to facilitate information sharing.
Because the anti-spam system is social-network-based, it is important to protect users' privacy by preventing anybody from using the network to map out social links. The system must also provide enough benefits to users to encourage their participation, which is relatively easy when it comes to spam management. If users become accustomed to a spam-filtering system, queries for other information will follow.

The Spam Filtering With Dynamically Updated URL Statistics study extracted URL information from all URLs in the email and, where relevant, extracted HTML tags. The study used 39,965 email messages; about 95 percent were spam. About 80 percent of legitimate email contained one or more URLs, and 99 percent of spam contained URLs. In addition to advertising, spam might have such a high rate of URLs because many spam messages include only figures (with URLs) to avoid content filters. The authors compared legitimate and spam email based on the number of URLs that pointed to figures or linked to Web pages. If a URL was included in an HTML IMG or AREA tag, it was classified as representing an image. If a URL accompanied an A, FORM, INPUT, LINK, BUTTON, IFRAME, OPTION, or NIL tag, it was classified as linking to a Web page. Although spam includes many URLs with image and linking tags, the statistics still don't offer a clear-cut case for using tags to distinguish between spam and legitimate email. When the method was compared with the SpamAssassin filter on the same emails, it showed better performance.

4. CONCLUSION
The rapid growth in the number of Internet users and the abuse of email by unsolicited senders cause an exponential increase of emails in user mailboxes. Spam messages are a nuisance and a huge problem for most users, since they clutter mailboxes and users waste time deleting all the junk mail before reading the legitimate messages. They also cost money for users with dial-up connections and waste network bandwidth and disk space.

In this paper we discussed the problem of spam and how spam can be detected [11][12]. We surveyed different existing techniques and algorithms to fight spam and compared the methods. The SVM-based method, which is content-based, requires training, which takes time, and its accuracy is lower. In the collaborative system, all systems have to participate in the communication. Because of these disadvantages, the URL method can be considered better, as URLs are among the most frequently occurring elements in email.
So, using this method, spam filtering can achieve better performance.

REFERENCES
[1] Bin Wang, Gareth J. F. Jones, and Wenfeng Pan, "Using Online Linear Classifiers to Filter Spam Emails", Springer-Verlag London Limited 2006, published online 3 October 2006.
[2] Venkatesh Ramanathan and Harry Wechsler, "phishGILLNET: Phishing Detection Methodology Using Probabilistic Latent Semantic Analysis, AdaBoost, and Co-training", 2012.
[3] Enrico Blanzieri and Anton Bryl, "A Survey of Learning-Based Techniques of Email Spam Filtering", Springer, published online 10 July 2009, Springer Science+Business Media B.V. 2009.
[4] Saadat Nazirova, Institute of Information Technology of Azerbaijan National Academy of Sciences, Baku, Azerbaijan, Communications and Network, 2011, 3, 153-160.
[5] Wuying Liu and Ting Wang, "Online Active Multi-Field Learning for Efficient Email Spam Filtering", Springer, received 5 August 2010, revised 10 October 2011, accepted 15 November 2011, Springer-Verlag London Limited 2011.
[6] Jangbok Kim, Kihyun Chung, and Kyunghee Choi, Ajou University, "Spam Filtering With Dynamically Updated URL Statistics", IEEE Security & Privacy, published by the IEEE Computer Society, 1540-7993/07, 2007 IEEE.
[7] Manuel Egele, Clemens Kolbitsch, and Christian Platzer, "Removing Web Spam Links From Search Engine Results", Springer, received 22 December 2008, accepted 3 August 2009, published online 22 August 2009, Springer-Verlag France 2009.
[8] Joseph S. Kong, Behnam A. Rezaei, Nima Sarshar, and Vwani P. Roychowdhury (University of California, Los Angeles), and P. Oscar Boykin (University of Florida), "Collaborative Spam Filtering Using E-Mail Networks", published by the IEEE Computer Society, 0018-9162/06, 2006 IEEE.
[9] Rui Zhang, Wenjian Wang, Yichen Ma, and Changqian Men, "Least Square Transduction Support Vector Machine", Springer, published online 28 February 2009, Springer Science+Business Media, LLC 2009.
[10] Harris Drucker, Donghui Wu, and Vladimir N. Vapnik, "Support Vector Machines for Spam Categorization", IEEE Transactions on Neural Networks, Vol. 10, No. 5, September 1999.
[11] Taiiki Takashita, Tsuyoshi Itokawa, Teruaki Kitasuka, and Masayoshi Aritsugi, "A Spam Filtering Method Learning From Web Browsing Behavior", Springer-Verlag Berlin Heidelberg 2008.
[12] Chi-Yao Tseng, Pin-Chieh Sung, and Ming-Syan Chen, "A Collaborative Spam Detection System with a Novel E-mail Abstraction Scheme", IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 5, May 2011.
