International Journal of the Physical Sciences Vol. 5(12), pp. 1869-1882, 4 October, 2010
Available online at http://www.academicjournals.org/IJPS
ISSN 1992 - 1950 © 2010 Academic Journals

Full Length Research Paper

Overview of textual anti-spam filtering techniques

Thamarai Subramaniam, Hamid A. Jalab and Alaa Y. Taqa*

Computer System and Technology, Faculty of Computer Science and Information Technology, University Malaya, Malaysia.

Accepted 31 August, 2010

Electronic mail (E-mail) is an essential communication tool that has been greatly abused by spammers to disseminate unwanted information (messages) and spread malicious contents to Internet users. Current Internet technologies have further accelerated the distribution of spam. Effective controls need to be deployed to counter the ever-growing spam problem. Machine learning provides better protective mechanisms that are able to control spam. This paper summarizes the most common techniques used for anti-spam filtering by analyzing the e-mail content, and also looks into machine learning algorithms such as Naïve Bayesian, support vector machine and neural network that have been adopted to detect and control spam. Each machine learning algorithm has its own strengths and limitations; as such, appropriate preprocessing needs to be carefully considered to increase the effectiveness of any given algorithm.

Key words: Anti-spam filters, text categorization, electronic mail (E-mail), machine learning.

INTRODUCTION

E-mail or electronic mail is an electronic messaging system that transmits messages across computer networks.
Users simply type in the message, add the recipient's e-mail address(es) and click the send button. Users can access any free e-mail service such as Yahoo Mail, Gmail or Hotmail, or register with ISPs (Internet Service Providers), in order to obtain an e-mail account at no cost except for the Internet connection charges. Besides that, e-mail is received almost immediately by the recipient once it is sent out.

E-mail allows users to communicate with each other at a low cost and provides an efficient mail delivery system. The reliability, user-friendliness and availability of a wide range of free e-mail services make it a popular and preferred communication tool. As such, businesses and individual users alike rely heavily on this communication tool to share information and knowledge. Businesses can drastically cut down on communication cost since e-mail is extremely fast and inexpensive; furthermore, it is a very powerful marketing and advertising tool. However, the simplicity of sending e-mail and its almost non-existent cost pose another problem: spam. Spam refers to bulk unsolicited commercial e-mail sent indiscriminately to users. Several definitions of spam have been proposed; Table 1 enumerates some of them.

Based on Ferris Research (2009), spam can be categorized into the following:

1. Health; such as fake pharmaceuticals;
2. Promotional products; such as fake fashion items (for example, watches);
3. Adult content; such as pornography and prostitution;
4. Financial and refinancing; such as stock kiting, tax solutions, loan packages;
5. Phishing and other fraud; such as "Nigerian 419" and "Spanish Prisoner";
6. Malware and viruses; Trojan horses attempting to infect your PC with malware;
7. Education; such as online diploma;
8. Marketing; such as direct marketing material, sexual enhancement products;
9. Political; US president votes.

*Corresponding author. E-mail: alaa [email protected]

E-MAIL STRUCTURE

E-mail messages are divided into 2 parts: header
information and message body. Header information, or the header field, consists of information about the message's transportation and generally shows the following:

1. From: displays the sender's details such as e-mail address;
2. To: displays the receiver's details such as e-mail address;
3. Date: displays the date the e-mail was sent to the recipient;
4. Received: the intermediary servers' information and the date the e-mail message was processed;
5. Reply to: reply address;
6. Subject: the subject of the message specified by the sender;
7. Message Id: unique id of the message, and others.

The message body contains the message of the e-mail. E-mail messages are presented in plain text or HTML. An e-mail may also have attachments such as graphics, video or other format types; to facilitate these attachments, MIME (multipurpose internet mail extension) is used.

Table 1. Different spam definitions.

| Author(s)/(year) | Definition |
|---|---|
| Vapnik et al. (1999) | An e-mail message that is unwanted: basically it is the electronic version of junk mail that is delivered by the postal service. |
| Oda and White (2003) | The electronic equivalent of junk e-mail, which typically covers a range of unsolicited and undesired advertisements and bulk e-mail messages. |
| Lazzari et al. (2005) | Electronic messages posted blindly to thousands of recipients, which represent one of the most serious and urgent information overload problems. |
| Zhao and Zhang (2005) | Spam, or junk mail, is an unauthorized intrusion into a virtual space: the e-mail box. |
| Youn and McLeod (2007) | Spam as bulk e-mail: e-mail that was not asked for and is sent to multiple recipients. |
| Wu and Deng (2008) | Spam e-mails, also known as 'junk e-mails', are unsolicited ones sent in bulk (unsolicited bulk e-mail) with hidden or forged identity of the sender, address, and header information. |
| Amayri and Bouguil (2009) | Spam e-mails can be recognized either by content or delivery manner; spam e-mails were recognized according to the volume of dissemination and permissible delivery. |
| Spamhaus (2010) | An electronic message is "spam" if (A) the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients; AND (B) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent. |

SPAMMER TRICKS

In order to send spam, spammers first obtain e-mail addresses by harvesting addresses through the Internet using specialized software. This software systematically gathers e-mail addresses from discussion groups or websites (Schaub, 2002). Spammers are also able to purchase or rent collections of e-mail addresses from other spammers or service providers. Table 2 lists the many tricks used by spammers to avoid detection by spam filters.

SPAM's IMPACTS

The MessageLabs Intelligence report for 2009 highlights spam levels reaching 87.7%, with compromised computers issuing 83.4% of the 107 billion spam messages distributed globally per day on average (MessageLabs Intelligence Annual Security Report, 2009). Spam reaching users' inboxes has been gradually increasing since 2004, as shown in Figure 1 (data is
compiled from MessageLabs Intelligence reports for 2005, 2006, 2007, 2008, 2009, 2010). The decline in the year 2005 is attributed to an awareness campaign launched in 2004, aimed at pressuring internet service providers and other internet bodies to take a responsible role in helping to clamp down on e-mail attacks (MessageLabs Intelligence Annual Security Report, 2005).

Figure 1. Spam average for 2004 to 2010 (vertical axis: spam received (%); horizontal axis: 2004 to Jun-10).

Table 2. Tricks used by spammers to send spam.

| Tricks | Descriptions |
|---|---|
| Zombies or Botnets | Compromised PCs on the Internet that send vast amounts of spam, viruses, and malware. |
| Bayesian sneaking and poisoning | Writing the spam message so it does not contain any words that are normally used in spam messages, or "poisoning" the Bayesian filter's database. |
| IP address | Borrowing or using an IP address that has a good or neutral reputation. |
| Offshore ISPs | Usage of offshore ISPs that lack security measures. |
| Open proxies / open-relay servers | Compromised servers used to re-direct spam to unsuspecting users. |
| Third-party mailback software | Use of improperly-secured mailback applications on innocent websites. |
| Falsified header information | Adding bogus header information to the spam message. |
| Obfuscation | Obscuring the words in spam messages by splitting words or messages using nonsense HTML tags or other 'creative' symbols. |
| Vertical slicing | Writing the spam messages vertically. |
| HTML manipulation | Manipulation of HTML format to avoid detection. |
| HTML encoding | Usage of an encoding scheme such as Base64 to turn a binary attachment into plain text characters. |
| JavaScript messages | Placing the entire contents of the spam message inside a JavaScript snippet that is activated when the message is opened. |
| ASCII art | Usage of letter glyphs of standard letters to write spam messages. |
| Image based | Using an image to send textual information. |
| URL address or redirect URL | Only adding a URL address to bypass detection / using expendable "portals" to point to their actual websites. |
| Encrypted messages | Encrypting the message so that it is only decrypted once it reaches the mail box. |

Table 3. Anti-spam legislation environment (Moustakas et al., 2010).

| Country | Legislation (anti-spam statutes) |
|---|---|
| Australia | Spam Act of 2003; Telecommunications Act of 1997; Parts IVA, V, and VC of the Trade Practices Act of 1974 |
| Canada | Personal Information Protection and Electronic Documents Act (PIPEDA); Competition Act; Charter of Rights and Freedoms; The Criminal Code and the Competition Act; Canadian Code of Practice for Consumer Protection in E-Commerce |
| EU | Privacy and Electronic Communication Regulations 2003 (UK); Data Protection Act of 1998 (UK); Electronic Commerce Regulations of 2002 (all adapted from EC Directives, e.g. Directive on Privacy and Electronic Communications 2002/58/EC) |
| Japan | Law on Regulation of Transmission of Specified Electronic Mail, July 2002; Specific Commercial Transactions Law, 2002 |
| USA | CAN-SPAM Act of 2003; law enforced by Federal Trade Commission; Section 5 of the Federal Trade Commission Act |

A 2009 study by Ferris Research estimated that the cost of spam had increased to a total of $130 billion worldwide. This represents a 30% increase from 2007 (Ferris Research, 2009). The study indicated that the main costs are due to:

1. Productivity loss from inspecting and deleting spam that gets missed by spam control products (false negatives);
2. Productivity loss from searching for legitimate e-mail deleted in error by spam control products (false positives);
3. Operations and helpdesk running costs (Ferris Research, 2009).

The impacts of spam are becoming far more serious than mere annoyances. Spam floods users' inboxes, making users spend unproductive hours deleting these unwanted e-mails and causing displacement of critical or legitimate e-mails.
Besides that, spam also causes the loss of internet performance and bandwidth due to increased payload on the network (Ferris Research, 2010), and it clogs up e-mail servers to the point where they sometimes crash.

Spam increases the spread of malware and viruses, which pose bigger threats to network security and personal privacy (Lai et al., 2009). Based on a MessageLabs research report, spam containing viruses for 2009 was 1 in 286.4 e-mails, and more than 73.1 million malware-infected e-mails, containing over 2,500 different malware strains, were blocked (Wood et al., 2010).

Spammers also deploy spam to gain personal information about the user for fraudulent purposes. Phishing activity related to identity theft and other internet-related frauds (e.g. Nigeria 419) is becoming one of the major concerns for the Internet community. MessageLabs researchers indicated that the proportion of phishing attacks in e-mail traffic was 1 in 325.2 (0.31%) e-mails and estimated that 161 billion e-mail phishing attacks were in circulation in 2009. The growing threats of spam definitely require drastic control measures.

EXISTING SOLUTIONS FOR SPAM

Traditionally, there are many approaches available to control spam, such as sender domain checks, content checks, open relay prohibition and checking the IP address or domain names (Hideo, 2009). However, spammers easily overcome these simple measures with more sophisticated variants of spam that evade detection. The measures engaged to control spam are discussed below.

Legislation approaches

Guzella (2009) cited that the economic impacts of spam have led some countries to adopt legislation. Many countries (Table 3) have enacted different laws and
legislations to protect businesses and individuals alike against spam. Denmark enacted the Danish Marketing Practices Act, Data Protection Act and Danish Act on Internet domains (Frost and Udsen, 2006), which prohibit spammers from harvesting addresses and sending spam e-mails.

In the USA, the CAN-SPAM Act of 2003 was enacted in December 2003. CAN-SPAM stands for Controlling the Assault of Non-Solicited Pornography and Marketing. The Act places restrictions and regulations on spammers' activities. For example, it prohibits spammers from harvesting e-mail addresses and creating botnets. Failure to comply with the CAN-SPAM Act can result in a monetary penalty of $16,000 per incident.

However, the CAN-SPAM Act does allow spammers to send unsolicited e-mail. McAfee Research reported in 2009 that, despite the six-year-old CAN-SPAM Act, spammers routinely abuse the law and continue to deliver spam (Wosotowsky and Winkler, 2009).

Black-list and white-list

Besides legislation, technological spam detection approaches have also been employed over the years. The earliest techniques used to block spam were whitelists and blacklists. These content-based techniques recognize words or patterns in a message that have been defined as indicating either legitimate mail or spam.

Legitimate mails are listed in a whitelist and spam is listed in a blacklist. The e-mail message is then analyzed against the lists; legitimate e-mails are allowed while spam mails are blocked. Unfortunately, since the context of the e-mail is not taken into consideration, some legitimate e-mail may be blocked or blacklisted (Dalkilic et al., 2009; Heron, 2009).

Messages from previously known sources of spam are blocked using a real-time IP blacklist. A real-time IP blacklist typically checks the source of the spam.
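
The list-based checks described in this subsection can be sketched in a few lines. The following is only an illustration under stated assumptions: the pattern lists, the network range and the function name `check_lists` are hypothetical, the order in which the lists are consulted is a design choice made here rather than something the surveyed systems prescribe, and real deployments query maintained blacklist services instead of a hard-coded list.

```python
import ipaddress

# Hypothetical example lists (assumptions, not taken from any real filter).
WHITELIST_PATTERNS = ["project status report"]
BLACKLIST_PATTERNS = ["cheap watches", "online diploma"]
BLACKLISTED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def check_lists(body: str, source_ip: str) -> str:
    """Return 'allow', 'block' or 'unknown' for a single message."""
    # 1. IP blacklist: block mail arriving from a listed network.
    ip = ipaddress.ip_address(source_ip)
    if any(ip in network for network in BLACKLISTED_NETWORKS):
        return "block"

    # 2. Content lists: plain substring matching, ignoring letter case.
    text = body.lower()
    if any(pattern in text for pattern in WHITELIST_PATTERNS):
        return "allow"
    if any(pattern in text for pattern in BLACKLIST_PATTERNS):
        return "block"

    # 3. Nothing matched: leave the decision to another filter.
    would_need_other_filter = "unknown"
    return would_need_other_filter
```

Because the match ignores context entirely, a legitimate message that happens to contain a blacklisted phrase is blocked, which is the weakness noted above.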
The header information from the messages, which contains IP or domain sources, is compared against the real-time blacklist, and matched IP addresses are blocked.

On the other hand, spammers are using large botnets to send spam, thus creating an extremely large number of IP addresses to be blacklisted. A real-time IP blacklist typically blocks only 80 - 90% of spam (Green, 2005). Sometimes the filter application blocks legitimate users (false positives) who have unknowingly been used to generate spam or have been erroneously reported (Heron, 2009). The time and effort it takes to remove these false positives can be overwhelming.

Heuristic approaches

Another approach used to control spam is heuristics. The heuristic approach examines the e-mail's content and compares it against thousands of pre-defined rules. These rules are assigned a numerical score that weights the probability of the message being spam. Each received message is verified against the heuristic filtering rules.

The verification result is compared with a pre-defined threshold to decide whether the message is spam or not (Xie et al., 2006). The rule weights are then shared among users to filter e-mails. However, spammers use obfuscation to fool the rules and avoid detection, and modifying heuristic tests to cope with new attack vectors devised by spammers can be complicated, leaving a period of time when there is no protection (Heron, 2009).

Machine learning approaches

Machine learning (ML) is a scientific discipline that is concerned with the design and development of algorithms that allow computers to adapt their behavior based on data. ML automatically learns to recognize complex patterns and makes intelligent decisions based on data.

ML is capable of automatically building a classifier for a category by observing the characteristics of a set of documents, or corpus, manually classified under that category by a domain expert.
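
As a concrete illustration of building a classifier from a manually labelled corpus, the sketch below learns a small word-presence model of the Naïve Bayesian kind discussed later in this paper. It is a minimal sketch under several assumptions: the four-message corpus, the label names, whitespace tokenization and Laplace (add-one) smoothing are all illustrative choices, not details taken from the surveyed studies.

```python
import math
from collections import Counter

# Hypothetical, hand-labelled training corpus (an assumption for illustration).
CORPUS = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda attached", "legitimate"),
    ("lunch meeting tomorrow", "legitimate"),
]


def train(labelled_docs):
    """Count, per class, how many documents contain each word."""
    doc_counts = Counter()
    word_doc_counts = {}
    vocabulary = set()
    for text, label in labelled_docs:
        words = set(text.lower().split())  # binary feature: word present or not
        doc_counts[label] += 1
        word_doc_counts.setdefault(label, Counter()).update(words)
        vocabulary |= words
    return doc_counts, word_doc_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability score."""
    doc_counts, word_doc_counts, vocabulary = model
    words = set(text.lower().split())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        for word in vocabulary:
            # Laplace-smoothed estimate of P(word present | class).
            p_present = (word_doc_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p_present if word in words else 1 - p_present)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(CORPUS)
```

Calling `classify("buy cheap pills", model)` then scores an unseen message against what was learned; no rule was written by hand, which is the point of the learner.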
From these characteristics, the inductive process gleans the characteristics that a new, unseen document should have in order to be classified under that category (Sebastiani, 2002).

The automatic builder of classifiers (the learner) is deemed to be the main advantage of ML when it comes to spam classification. It is more convenient and easier to automatically classify a set of documents than to build and tune a set of rules.

NAÏVE BAYESIAN CLASSIFICATION

Naïve Bayesian (NB) is a fundamental statistical approach based on probability, initially proposed for spam filtering by Sahami et al. (1998). The Bayesian algorithm predicts the classification of a new e-mail by identifying it as spam or legitimate. This is achieved by learning the features from a 'training set' which has already been pre-classified correctly and then checking whether a particular word appears in the e-mail. A high probability indicates that the new e-mail is spam.

Lai (2007) describes the naïve Bayesian algorithm as follows: Given a feature vector $\vec{x} = (x_1, x_2, \ldots, x_n)$ of an e-mail, where $x_1, \ldots, x_n$ are the values of attributes $X_1, \ldots, X_n$ and $n$ is the number of attributes in the corpus. Each attribute is a particular word occurring or not in an e-mail. Let $c$ denote the category to be predicted, that is, $c \in \{\text{spam}, \text{legitimate}\}$;
by Bayes' law, the probability that $\vec{x}$ belongs to $c$ is given by

$$P(c \mid \vec{x}) = \frac{P(c)\, P(\vec{x} \mid c)}{P(\vec{x})} \qquad (1)$$

where $P(\vec{x})$ denotes the a priori probability that a randomly picked e-mail has vector $\vec{x}$ as its representation, $P(c)$ is the a priori probability of class $c$ (that is, the probability that a randomly picked e-mail is from that class), and $P(\vec{x} \mid c)$ denotes the probability that a randomly picked e-mail of class $c$ has $\vec{x}$ as its representation. Androutsopoulos et al. noted that the probability $P(\vec{x} \mid c)$ is almost impossible to calculate because the number of possible vectors $\vec{x}$ is too high. In order to alleviate this problem, it is common to make the assumption that the components of the vector $\vec{x}$ are independent within the class. Thus, $P(\vec{x} \mid c)$ can be decomposed to

$$P(\vec{x} \mid c) = \prod_{i=1}^{n} P(x_i \mid c) \qquad (2)$$

So, using the NB classifier for spam filtering, the predicted class can be computed as

$$c_{NB} = \arg\max_{c \in \{\text{spam},\, \text{legitimate}\}} P(c) \prod_{i=1}^{n} P(x_i \mid c) \qquad (3)$$

The Naïve Bayesian approach is very stable and fast, which makes it a very popular algorithm (Dong, 2004) in various classification fields. NB performs reasonably consistently and well in different experimental settings (Lai, 2007). It is simple to implement, and the independence assumption allows parameters to be estimated on different data sets. Besides that, NB also has a very short learning curve (Ko et al., 2009). The main shortcoming of the NB classifier is that it can only learn linear discriminant functions, and thus it is always suboptimal for non-linearly separable concepts (Rish, 2001). The Naïve Bayesian approach has been successfully incorporated into other machine learning approaches to increase the effectiveness of text classification.

SUPPORT VECTOR MACHINE CLASSIFICATION

Support vector machine (SVM) is a framework of structural risk minimization and statistical learning theory developed by Vapnik and his coworkers. SVM is based on the optimal classification hyperplane of the linear classification case. SVM finds a maximum margin separating hyperplane between two classes of data (Figure 2).

Figure 2. Support vector machine.
SVM is a non-linear function and density estimation based algorithm.

Sun et al. (2002) indicated that classifiers built on SVM have shown promising results in both efficiency and effectiveness. SVM can be realized with non-linear mappings such as polynomial functions and sigmoidal functions. SVM bypasses the curse of dimensionality by employing kernel functions, which enables the straightforward analysis of high-dimensional data, the ability to determine the margin completely, and the capability of handling high-dimensionality and small-sample problems (Yu et al., 2008). SVM has a great generalization capability too (Sebastiani, 2002).

In SVMs, the hyperplane that separates the training data (spam or legitimate e-mail) is determined by the maximum margin; therefore all vectors that lie on one side of the hyperplane are labeled as -1 ($w \cdot x - b \le -1$) and those on the other side as 1 ($w \cdot x - b \ge 1$). Thus, when new data is introduced, it is labeled according to the side of the maximum-margin hyperplane on which it falls. To find the maximum margin, the following formulation is used. Given linearly separable vectors $x_i \in \mathbb{R}^n$ with labels $y_i \in \{-1, 1\}$, the decision surface is a hyperplane which can be written as:

$$w \cdot x - b = 0 \qquad (4)$$

The optimization objective is

$$\min_{w,\, b} \; \frac{1}{2} \lVert w \rVert^2 \qquad (5)$$

with the constraint, for every training vector $x_i$,

$$y_i \,(w \cdot x_i - b) \ge 1 \qquad (6)$$
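
To make the margin condition concrete, the sketch below checks whether a given hyperplane satisfies $y\,(w \cdot x - b) \ge 1$ for every labelled sample and reports the margin width $2 / \lVert w \rVert$, which is the quantity that minimizing $\lVert w \rVert$ maximizes. This is a hedged illustration only: it does not train an SVM, and the function names and the two-dimensional sample points are hypothetical.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x - b for one feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def predict(w, b, x):
    """Label a vector by the side of the hyperplane it falls on."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def satisfies_margin(w, b, samples):
    """True if every (x, y) pair obeys y * (w.x - b) >= 1."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in samples)


def margin_width(w):
    """Distance between the hyperplanes w.x - b = 1 and w.x - b = -1."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical 2-D data: label 1 on the right, label -1 on the left.
SAMPLES = [([2.0, 1.0], 1), ([-1.5, 0.3], -1)]
W, B = [1.0, 0.0], 0.0
```

With these values `satisfies_margin(W, B, SAMPLES)` holds; adding a sample such as `([0.5, 0.0], 1)` would violate the constraint, because that point lies inside the margin even though `predict` still labels it correctly.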
The optimal hyperplane calculation is as follows:

[Equation (7) is not recoverable from the source text.]

NEURAL NETWORK CLASSIFICATION

The neural network (NN) was first introduced by McCulloch and Pitts in 1943; since its introduction it has been increasingly used in text classification. A neural network emulates the functionality of the human brain, in which neurons (nerve cells) communicate with each other by sending messages between them. An artificial neural network (ANN) represents the mathematical model of these biological neurons.

It is a parallel distributed information processing structure consisting of a number of nonlinear processing units (neurons) (Ko et al., 2009) which can be trained to recognize features and to identify incomplete features/data. A neural network has great mapping or pattern association capabilities, thus exhibiting generalization, robustness, high fault tolerance, and high-speed parallel information processing.

NN's capability of self-learning from examples allows researchers to train an NN with features from e-mail messages to acquire the knowledge for classifying e-mail into spam or legitimate mail. Neural network architectures can generally be categorized into single layer feedforward networks, multilayer feedforward networks and recurrent networks. However, over the years many other types have emerged, such as the perceptron, back-propagation network, self-organizing map, adaptive resonance theory and radial basis function.

Figure 3 indicates the input, hidden and output layers that make up the neural network.

Figure 3. Neural network: indicates the input, hidden and output layers that make up the neural network.

The network functions as follows (Goyal, 2007). Each node $i$ in the input layer receives a signal $x_i$ as the network's input, multiplied by a weight value $w_{ij}$ between the input layer and the hidden layer. Each node $j$ in the hidden layer receives the signal $In(j)$ according to the following equation:

$$In(j) = \theta_j + \sum_{i} x_i\, w_{ij} \qquad (8)$$

This is then passed through the bipolar sigmoid activation function

$$f(x) = \frac{2}{1 + e^{-x}} - 1 \qquad (9)$$

The output of the activation function, $f(In(j))$, is then broadcast to all of the nodes on the output layer:

$$y_k = \theta_k + \sum_{j} w_{jk}\, f(In(j)) \qquad (10)$$

where $\theta_j$ and $\theta_k$ are the biases in the hidden layer and the output layer, respectively. The output value will be compared with the target by the mean absolute error as the error function

$$E = \frac{1}{N_p} \sum_{p=1}^{N_p} \lvert T_p - y_p \rvert \qquad (11)$$

where $N_p$ is the number of training patterns, and $y_p$ and $T_p$ are the output value and the target value. The weight is adjusted according to the following expression:

[Equation (12) is not recoverable from the source text; its parameters are the number of epochs and the learning rate.]

The learning methods of NN algorithms can be broadly divided into supervised, unsupervised and reinforced learning methods.

PREVIOUS STUDIES ON MACHINE LEARNING

The exponential growth of spam e-mails in recent years has resulted in the necessity for more accurate and efficient spam filtering. Machine learning (ML) is a very effective approach that has been successfully used in text classification. This approach is increasingly being applied to combat spam.

Allowing machines to classify e-mail into spam and non-spam messages reduces the need for human intervention, thus reducing the cost of monitoring spam.

Support vector machine (SVM) is one of the popular ML approaches being applied in anti-spam classification. Vapnik and his co-workers initially applied this ML technique to spam classification in 1999. They tested it against three other techniques: Ripper, boosting decision
trees and Rocchio. Both boosting trees and SVMs provided "acceptable" performance, with SVMs preferable given their lesser training requirements (Vapnik et al., 1999). The best results were obtained using a binary representation for SVM and a frequency-based representation for boosting.

The Naïve Bayesian (NB) approach was initially proposed by Sahami (1998) for automatic e-mail classification using a decision theoretic framework, and since that work researchers have conducted many studies focusing on defeating spam with Naïve Bayes. Androutsopoulos et al. (2000) investigated the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the Naïve Bayesian filter's performance. They concluded that, after introducing cost-sensitive evaluation, additional safety nets are needed for the Naïve Bayesian anti-spam filter to be viable in practice. Graham (2002, 2003) later implemented a Bayesian filter that caught 99.5% of spam with 0.03% false positives.

Kun-Kan Li et al. (2002) classified spam using a Simplified Support Vector Machine with pool-based active learning, which involves selecting a training set of examples from a pool of unlabeled examples. Soonthornphisaj et al. (2002) investigated spam classification using a centroid-based approach, in which the data items are represented using a vector space model, alongside Naïve Bayesian and k-nearest neighbor (kNN) classifiers. The centroid-based approach outperformed Naïve Bayesian and kNN.

Clark et al. (2003) classified spam using LINGER, a neural network-based system which uses a multi-layer perceptron. LINGER includes 2 feature selectors: information gain (IG) and variance (V). Their results show that neural network-based filters achieve better accuracy in the training phase but have unstable portability across different corpora (Clark et al., 2003). Woitaszek et al. (2003) used a simple SVM along with a personalized dictionary for model training.
They subsequently implemented the classifier as an add-in for Microsoft Outlook XP, providing sorting and grouping capabilities through Outlook's interface to the typical desktop e-mail user.

Matsumoto et al. (2004) described the results of an empirical study on two spam detection methods: Support Vector Machines (SVMs) and the Naïve Bayesian Classifier (NBC). They used both term frequency (TF) and term frequency with inverse document frequency (TF-IDF) for feature vector construction. Their results indicate that Naïve Bayesian has a consistent performance across all the data sets.

Zhao and Zhang (2005) implemented a rough set based model to classify e-mails into three categories (spam, non-spam and suspicious) and compared it with the Naïve Bayesian Classifier. The result shows that the rough set based method has a better accuracy rate than that of the Naïve Bayesian. Chuan et al. (2005) proposed the use of a learning vector quantization (LVQ)-based neural network for spam e-mail classification. E-mails are classified into several subclasses for easy identification by the LVQ-based NN. Their experiments show that LVQ-NN has better precision and recall rates than NN-BP and Naïve Bayesian, with Naïve Bayesian showing the lowest rates.

Wang et al. (2006) used the integration of two linear classifiers, Perceptron and Winnow. They concluded that Winnow produces slightly better results than Perceptron; however, both classifiers performed very well and considerably outperformed the Naïve Bayesian classifier.

Ichimura et al. (2007) proposed the self-organizing map (SOM) for spam classification and automatically defined groups (ADG) to extract correct judgment rules. They used 3007 e-mails classified as spam from SpamAssassin; SOM was used to classify these spam e-mails to obtain an intuitive visual distribution, and ADG extracted classification rules to judge spam correctly. Their experiment concluded that SOM improves the classification process and ADG tremendously reduces false negatives.
Yang and Elfayoumy (2007) evaluated the effectiveness of feedforward backpropagation neural network and Bayesian classifiers for spam detection. Their results showed that the feedforward backpropagation NN provides relatively high accuracy compared to the Bayesian classifier.

Lobato and Lobato (2008) used binary classification based on an extension of Bayes point machines. By using the Bayesian approach with expectation propagation (EP) inference, they produced a result that outperforms SVM. Ye et al. (2008) proposed a spam discrimination model based on SVM and the D-S theory. They used SVM with probability to sort mail according to the features of mail headers and mail body textual content, and D-S theory to identify spam, which improves the accuracy of the spam filter. Yu and Xu (2008) compared four ML algorithms: Naïve Bayes (NB), neural network (NN), support vector machine (SVM) and relevance vector machine (RVM). Their experimental results show that the NN classifier is more sensitive to the training set size and unsuitable for use alone as a spam rejection tool, that SVM and RVM are superior to NB, and that RVM has a much faster testing time.

Wu (2009) used a hybrid method of rule-based processing and back-propagation neural network (BPNN) for spam filtering. A rule-based process is first employed to identify and digitize the spamming behaviors observed from the headers and syslogs of e-mails. The spamming behaviors are then used as features for describing e-mails, and this information is used to train the BPNN. The system produced very low false positive and false negative rates, with better results in comparison to content-based classification (Guzella, 2009).

Table 4. Summaries of previous studies on ML: algorithms used and accuracy (English language).

| Researcher(s) | Algorithm used | Accuracy (%) | False positive |
|---|---|---|---|
| Soonthornphisaj et al. (2002) | Centroid-based approach | 83 | NA |
| Graham (2002, 2003) | Bayesian filter | 99.5 | 0.03% FP |
| Woitaszek et al. (2003) | Simple support vector machine with personalized dictionary | 95.26 | 6.80% FP |
| Zhao and Zhang (2005) | Rough set based | 97.37 | NA |
| Chuan et al. (2005) | LVQ-based neural network | 98.97 | NA |
| Wang et al. (2006) | Perceptron | 98.89 | NA |
| Wang et al. (2006) | Winnow | 99.31 | NA |
| Yang and Elfayoumy (2007) | Feedforward backpropagation neural network | 90.24 | 0.81% FP, 0.84% FN |
| Lobato and Lobato (2008) | SVM and D-S theory | 98.35 | NA |
| Sun et al. (2009) | LPP and LS-SVM | 94 | NA |
| Meizhen et al. (2009) | Behavior recognition based on fuzzy decision tree (FDT) | 97 | NA |

Wang et al. (2009) developed and experimented with an anti-spam filtering system that combines Naïve Bayesian with distributed checksum clearinghouse (DCC) to avoid excessive false positives. This combination achieved very high recall and accuracy rates and exhibits excellent reliability an
