This is a repository copy of "A HADOOP-Based Framework for Parallel and Distributed Feature Selection".

White Rose Research Online URL for this paper:

Version: Published Version

Other: Hodge, Victoria Jane, Jackson, Tom and Austin, Jim (2013) A HADOOP-Based Framework for Parallel and Distributed Feature Selection. UNSPECIFIED, Department of Computer Science, University of York, UK.

Reuse
Items deposited in White Rose Research Online are protected by copyright, with all rights reserved unless indicated otherwise. They may be downloaded and/or printed for private study, or other acts as permitted by national copyright laws. The publisher or other rights holders may allow further reproduction and re-use of the full text version. This is indicated by the licence information on the White Rose Research Online record for the item.

Takedown
If you consider content in White Rose Research Online to be in breach of UK law, please notify us by emailing [email protected] including the URL of the record and the reason for the withdrawal request.

A HADOOP-BASED FRAMEWORK FOR PARALLEL AND DISTRIBUTED FEATURE SELECTION

Victoria J. Hodge, Tom Jackson & Jim Austin
Advanced Computer Architecture Group, Department of Computer Science, University of York, York, YO10 5GH, UK

Abstract
In this paper, we introduce a theoretical basis for a Hadoop-based framework for parallel and distributed feature selection. It is underpinned by an associative memory (binary) neural network which is highly amenable to parallel and distributed processing and fits with the Hadoop paradigm. There are many feature selectors described in the literature which all have various strengths and weaknesses. We present the implementation details of four feature selection algorithms constructed using our artificial neural network framework embedded in Hadoop MapReduce. Hadoop allows parallel and distributed processing, so each feature selector can be processed in parallel and multiple feature selectors can be processed together in parallel, allowing multiple feature selectors to be compared. We identify commonalities among the four feature selectors. All can be processed in the framework using a single representation, and the overall processing can also be greatly reduced by only processing the common aspects of the feature selectors once and propagating these aspects across all four feature selectors as necessary. This allows the best feature selector, and the actual features to select, to be identified for large and high dimensional data sets through exploiting the efficiency and flexibility of embedding the binary associative-memory neural network in Hadoop.

Keywords
Hadoop; Distributed; Parallel; Data Fusion; Feature Selection; Binary Neural Network

1 Introduction
Frequently, data contain too many features or too much noise [38] for accurate classification [10][22], prediction [10][15] or outlier detection [24][27] as only a subset of the features is related to the target concept (classification label or predicted value).
Data from distributed systems may be intermittent and may contain duplicates as distributed systems communicate the data across a wide geographical area. Many machine learning algorithms are adversely affected by noise, omissions and superfluous features which can prevent accurate classification or prediction. Consequently, the data must be pre-processed by the classification or prediction algorithm itself or by a separate feature selection algorithm to prune these superfluous features [36][50]. For distributed systems, this consists of pruning superfluous features either locally (at data source) or globally on an aggregated data set. The benefits of feature selection include reducing the data size when superfluous features are discarded, improving the classification/prediction accuracy of the underlying algorithm where the algorithm is adversely affected by noise, producing a more compact and easily understood data representation, and reducing the execution time of the underlying algorithm due to the smaller data size. For distributed systems, feature selection also translates into savings of power, hardware, and transmission as the data size is reduced. Feature selection can also minimise false positives by

improving the data quality and thus the accuracy of the underlying classification or prediction algorithm.

In this paper, we focus on feature selection across all data for parallel and distributed classification systems. We aim to remove noise and reduce redundancy from the distributed network to improve classification accuracy. There is a wide variety of techniques proposed in the machine learning literature for feature selection including Correlation-based Feature Selection [17][18][20][21], Principal Component Analysis (PCA) [35], Information Gain [42], Gain Ratio [43], Mutual Information Selection [48], Chi-square Selection [39], Probabilistic Las Vegas Selection [40] and Support Vector Machine Feature Elimination [16].

It is frequently not clear to the user which feature selector to use for their data and application. In their analysis of feature selection, Guyon and Elisseeff [15] recommend evaluating a variety of feature selectors. Hence, allowing the user to run a variety of feature selectors and then evaluate the feature sets chosen using their classification or prediction algorithm is highly desirable. Having multiple feature selectors available also provides the opportunity for ensemble feature selection where the results from a range of feature selectors are merged to generate the best set of features to use. Feature selection is a combinatorial problem so needs to run as efficiently as possible. We have previously developed a k-NN classification and prediction algorithm [26][28] using an associative memory (binary) neural network called the Advanced Uncertain Reasoning Architecture (AURA) [3]. This multi-faceted k-NN motivated a unified feature selection framework exploiting the speed and storage efficiency of the associative memory neural network. The framework lends itself to parallel and distributed processing across multiple nodes.
This could be processing the data at the same geographical location using a single machine with multiple processing cores [47], or at the same geographical location using multiple compute nodes (parallel search) [47], or even processing the data at multiple geographical locations and assimilating the results (distributed processing) [5].

The main contributions of this paper are:
- To augment the AURA framework for parallel and distributed processing of data in Hadoop [1][45],
- To describe four feature selectors in terms of the AURA framework. Two of the feature selectors have been implemented in AURA previously (but not using Hadoop) and two have not been implemented in AURA before,
- To analyse the resulting framework to show how the four feature selectors have common requirements,
- To demonstrate distributed processing in the framework.

The feature selectors fit into one common data index representation. If we process any common elements only once and propagate these common elements to all feature selectors that require them, then we can rapidly and efficiently determine the best feature selector and the best set of features to use for each data set under investigation. In section 2, we discuss AURA and related neural networks and how to store and retrieve data from AURA; section 3 demonstrates how to implement four feature selection algorithms in the AURA unified framework; section 4 describes parallel and distributed feature selection using AURA; we then analyse the unified framework in section 5 to identify common aspects of the four feature selectors and how they can be implemented in the unified framework in the most efficient way; and finally, section 6 provides the overall conclusions from our implementations and analyses.

2 Binary Neural Networks
AURA [3] is an associative memory (binary) neural network. It is based on binary matrices, called Correlation Matrix Memories (CMMs) [6]. CMMs store associations between input and output vectors.

Input vectors address the CMM rows and output vectors address the CMM columns. Binary neural networks have a number of advantages compared to standard neural networks including rapid one-pass training, high levels of data compression, computational simplicity, network transparency, a partial match capability and a scalable architecture that can be easily mapped onto high performance computing platforms including parallel [47] and distributed platforms [5].

Previous parallel and distributed applications of AURA have included distributed text retrieval [23], distributed time-series signal searching [13] and condition monitoring [4]. We have previously developed a k-NN algorithm [26][28] using AURA called the AURA k-NN. The AURA k-NN allows classification [31][32][37] and prediction [33] with feature selectors for both classification and prediction [30]. Thus, coupling feature selection, classification and prediction with the speed and storage efficiency of a binary neural network in the AURA framework allows parallel and distributed data mining. This makes AURA ideal to use as the basis of an efficient distributed machine learning framework. A more formal definition of AURA, its components and methods now follows.

2.1 AURA
The AURA methods use binary input I and output O vectors to store records in a CMM M as in equation 1:

M = \bigvee_{j=1}^{n} I_j O_j^T    (1)

Training (construction of a CMM) is a single-epoch process with one training step for each input-output association (each I_j O_j^T in equation 1) which equates to one step for each record in the data set. I_j O_j^T is an estimation of the weight matrix W(j) of the neural network as a linear associator with binary weights. Every synapse (matrix element) can update its weight independently and in parallel. This learning process is illustrated in Figure 1.

Figure 1 Showing a CMM learning input vector In associated with output vector On on the left. The CMM on the right shows the CMM after five associations Ij OjT. Each column of the CMM represents a record. Each row represents a feature value for symbolic features or a quantisation of feature values for continuous features, and each set of rows (shown by the horizontal lines) represents the set of values or set of quantisations for a particular feature.

For feature selection, the data are stored in the CMM which forms an index of all features in all records. The input vector and CMM rows represent the features and feature values; and the output vector and the CMM columns represent the data records. Note: for feature selection, the class values and the associated records that take those class values are also trained into the CMM as extra rows.
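To make the one-pass training step concrete, the following sketch builds a CMM as a dense numpy matrix. This is an illustration only: the AURA library uses a compact binary storage scheme rather than a dense matrix, and all function and variable names here are assumptions, not the AURA API.

```python
# Sketch of CMM training: each record's binary input vector I_j is
# associated with a one-hot output vector O_j by OR-ing the outer
# product I_j . O_j^T into the matrix M (equation 1).
import numpy as np

def train_cmm(input_vectors, n_records):
    """Train a CMM from binary input vectors; record j sets column j."""
    n_rows = len(input_vectors[0])
    M = np.zeros((n_rows, n_records), dtype=np.uint8)
    for j, I in enumerate(input_vectors):
        O = np.zeros(n_records, dtype=np.uint8)
        O[j] = 1                    # single set bit marks record j
        M |= np.outer(I, O)         # one training step per record
    return M

# Two toy records over four feature-value rows.
records = [np.array([1, 0, 1, 0], dtype=np.uint8),
           np.array([0, 1, 1, 0], dtype=np.uint8)]
M = train_cmm(records, len(records))
```

Because each association is a single OR of an outer product, training is one pass over the records and every matrix element can update independently, which is what makes the scheme amenable to parallel and MapReduce-style partitioning.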

This process is identical to training the other feature values; the class is treated as an extra feature [30]. Figure 1 shows a trained CMM where each row is a feature or class value and each column represents a record. In this paper, we set only one bit in the vector Oj indicating the location of the record in the data set: the first record has the first bit set, the second record has the second bit set, etc. Using a single set bit makes the length of Oj potentially large. However, exploiting a compact list representation [25] (more detail is provided in section 5) means we can compress the storage representation for single-bit-set vectors to a single index (the set bit location), thus allowing AURA to be used for distributed processing with data sets of millions of records yet using a relatively small amount of memory.

2.2 Data
The AURA feature selector, classifier and predictor framework can handle symbolic, discrete numeric and continuous numeric features.

The raw data sets need pre-processing to allow them to be used in the binary AURA framework. Symbolic and numerical unordered features are enumerated and each separate token maps onto an integer (Token → Integer) which identifies the bit to set within the vector. We refer to these features as symbolic henceforth. For example, a SEX TYPE feature would map as (F → 0) and (M → 1). Any real-valued or ordered numeric features are quantised (mapped to discrete bins) [29], and each individual bin maps onto an integer which identifies the bit to set in the input vector. We refer to these features as continuous henceforth. Next, we describe the simple equi-width quantisation. We note that the Correlation-Based Feature Selector described in section 3.2 uses a different quantisation technique to determine the bin boundaries. However, once the boundaries are determined, the mapping to CMM rows is the same as described here.

To quantise continuous features, a range of input values for feature Fj maps onto each bin. Each bin maps to a unique integer as in equation 2 to index the correct location for the feature in Ij. In this paper, the range of feature values mapping to each bin is equal, subdividing the feature range into b equi-width bins for each feature.

index(f_i) = offset(F_j) + bin(f_i)    (2)

In equation 2, offset(Fj) is a cumulative integer offset within the binary vector for each feature Fj and offset(F_{j+1}) = offset(F_j) + nBins(F_j), where nBins(Fj) is the number of bins for feature Fj, bin is a many-to-one mapping and offset is a one-to-one mapping.

For each record in the data set
    For each feature
        Calculate bin for feature value;
        Set bit in vector as in equation 2;

2.3 AURA Recall
To recall the matches for a query record, we firstly produce a recall input vector Rk by quantising the target values for each feature to identify the bins (CMM rows) to activate as in equation 3. During recall, the presentation of recall input vector Rk elicits the recall of output vector Ok, as vector Rk contains all of the addressing information necessary to access and retrieve vector Ok. Recall is effectively the dot product of the recall input vector Rk and CMM M, as in equation 3 and Figure 2:

S_k = R_k \cdot M    (3)
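The bin mapping of equation 2 and the recall-then-threshold step of section 2.3 (the Willshaw and L-Max thresholds are described below) can be sketched as follows, again assuming a dense numpy CMM; the function names are illustrative, not the AURA API.

```python
# Sketch of equi-width quantisation (equation 2) and CMM recall with
# Willshaw (exact match) and L-Max (partial match) thresholding.
import numpy as np

def bin_index(value, lo, hi, n_bins, offset):
    """Map a continuous value to a CMM row: offset(Fj) + bin(fi)."""
    b = min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)
    return offset + b

def recall(M, active_rows, willshaw=None, l_max=None):
    """Sum the activated CMM rows into S, then threshold."""
    S = M[active_rows].sum(axis=0)        # summed output vector S
    if willshaw is not None:              # exact match: S >= threshold
        return (S >= willshaw).astype(np.uint8)
    t = np.sort(S)[-l_max]                # highest value keeping >= L matches
    return (S >= t).astype(np.uint8)

M = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=np.uint8)
T = recall(M, [0, 2], willshaw=2)   # columns matching both rows 0 and 2
```

Note that the L-Max branch may return more than L columns when scores tie, matching the "at least L top matches" behaviour described in section 2.3.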

Figure 2 Showing a CMM recall. Applying the recall input vector Rk to the CMM M retrieves a summed integer vector S with the match score for each CMM column. S is then thresholded to retrieve the matches. The threshold here is either Willshaw with value 3, retrieving all columns that sum to 3 or more, or L-Max with value 2 to retrieve the 2 highest scoring columns.

If Rk appeared in the training set, we get an integer-valued vector S (the summed output vector), composed of the required output vector multiplied by a weight based on the dot product of the input vector with itself. If the recall input Rk is not from the original training set, then the system will recall the output Ok associated with the closest stored input to Rk, based on the dot product between the test and training inputs.

The AURA technique thresholds the summed output S to produce a binary output vector. For exact match, we use the Willshaw threshold [49]. This sets a bit in the thresholded output vector for every location in the summed output vector that has a value higher than or equal to a threshold value. The threshold varies according to the task. For partial matching, we use the L-Max threshold [9]. L-Max thresholding essentially retrieves at least L top matches. Our AURA software library automatically sets the threshold value to the highest integer value that will retrieve at least L matches.

Feature selection described in section 3 requires both exact matching using Willshaw thresholding and partial matching using L-Max thresholding.

3 Feature Selection
There are two fundamental approaches to feature selection [36][50]: (1) filters select the optimal set of features independently of the classifier/predictor algorithm, while (2) wrappers select features which optimise classification/prediction using the algorithm. We examine the mapping of four filter approaches to the binary AURA architecture. Filter approaches are more flexible than wrapper approaches as they are not directly coupled to the algorithm and are thus applicable to a wide variety of classification and prediction algorithms. They can then be used as stand-alone feature selectors for other classification or prediction algorithms, exploiting the high speed and efficiency of the AURA techniques to perform feature selection, which is a combinatorial problem. We also intend to integrate them with the AURA k-NN for classification and prediction in the unified AURA framework.

We examine a mutual information based approach (Mutual Information Feature Selection (MI), detailed in section 3.1) that examines features on an individual basis; a correlation-based multivariate filter approach (Correlation-based Feature Subset Selection (CFS), detailed in section 3.2) that examines greedily selected subsets of features; a revised Information Gain approach, Gain Ratio (GR), detailed in section 3.3; and a feature dependence approach, Chi-Square Feature Selection (CS), detailed in section 3.4, which is univariate. Univariate filter approaches such as Mutual Information or Chi-square are quicker than multivariate approaches as they do not need to evaluate all combinations of subsets of features. The advantage of a multivariate filter compared to a univariate filter lies in the fact that a univariate approach does not account for interactions between features. Multivariate techniques evaluate the worth of feature subsets by considering both the individual predictive ability of each feature and the degree of redundancy between the features in the set.

During training for the MI, CFS, GR and CS algorithms, the input vectors Ij represent the feature and class values in the data records and are associated with a unique output vector Oj. They all use an identical CMM and CMM training is an n-iteration process where n is the number of data records. This single CMM means that we can calculate and compare all four feature selectors using a single data representation. We note that the CFS as implemented by Hall [17] uses an entropy-based quantisation whereas we have used equi-width quantisation for the other feature selectors (MI, GR and CS). We plan to investigate unifying the quantisation as a next step. For the purpose of our analysis in section 5, we assume that all feature selectors are using identical quantisation.

We assume that all records are to be used during feature selection.

3.1 Mutual Information Feature Selection
Wettscherek [48] described a mutual information feature selection algorithm. The mutual information between two features is 'the reduction in uncertainty concerning the possible values of one feature that is obtained when the value of the other feature is determined' [48].
We introduced an AURA version of the MI feature selector in [34] and just provide a brief overview here.

For our feature selection, AURA excites the row in the CMM corresponding to feature value fi of feature Fj. This row is a binary vector (BV) and is represented by BVfi. A count of the bits set on the row gives n(BVfi) from equation 4 and is achieved by thresholding the output vector Sk from equation 3 at Willshaw value 1. AURA also excites the row in the CMM corresponding to class value c, where the binary vector is denoted BVc. Again, counting the number of bits set on the row as above gives n(BVc) from equation 4.

Figure 3 Diagram showing the feature value row and the class values row excited to determine co-occurrences.

If both the feature value row and the class values row are excited then the summed output vector will have a two in the column of every record with a co-occurrence of fi with cj, as shown in Figure 3. By thresholding the summed output vector at a threshold of two, we can find all co-occurrences. We represent this number of bits set in the vector by n(BVfi ∧ BVc), which is a count of the set bits when BVc is logically ANDed with BVfi. The mutual information is given by equation 4, where rows(Fj) is the number of CMM rows for feature Fj and nClass is the number of classes:

MI(F_j) = \sum_{i=1}^{rows(F_j)} \sum_{c=1}^{nClass} \frac{n(BV_{f_i} \wedge BV_c)}{N} \log_2 \frac{N \cdot n(BV_{f_i} \wedge BV_c)}{n(BV_{f_i}) \, n(BV_c)}    (4)

We can follow the same process for real/discrete ordered numeric features in AURA. In this case, the mutual information is given by equation 5:

MI(F_j) = \sum_{i=1}^{bins(F_j)} \sum_{c=1}^{nClass} \frac{n(BV_{b_i} \wedge BV_c)}{N} \log_2 \frac{N \cdot n(BV_{b_i} \wedge BV_c)}{n(BV_{b_i}) \, n(BV_c)}    (5)

where bins(Fj) is the number of bins (effectively the number of rows) in the CMM for feature Fj, N is the number of records in the data set, BVbi is a binary vector (CMM row) for the quantisation bin mapped to by feature value fi, BVc is a binary vector with one bit set for each record in class c, n(BVbi ∧ BVc) is a count of the set bits when BVc is logically ANDed with BVbi (as shown in Figure 3) and n(BVc) is the number of records in class c.

The MI feature selector assumes independence of features and scores each feature separately, so it is the user's prerogative to determine the number of features to select. The major drawback of the MI feature selector, along with similar information theoretic approaches, for example Information Gain, is that they are biased toward features with the largest number of distinct values as this splits the training records into nearly pure classes. Thus, a feature with a distinct value for each record has a maximal information score.
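As a concrete illustration, the sketch below turns the bit counts that the CMM recall supplies — n(BVfi), n(BVc) and the co-occurrence counts n(BVfi ∧ BVc) — into an MI score of the form of equation 4. The function name and argument layout are assumptions for illustration, not the AURA API.

```python
# Mutual information score from CMM bit counts (equation 4).
import math

def mutual_information(co_counts, feature_counts, class_counts, N):
    """co_counts[i][c] = n(BVfi & BVc); N = number of records."""
    mi = 0.0
    for i, n_fi in enumerate(feature_counts):
        for c, n_c in enumerate(class_counts):
            n_fc = co_counts[i][c]
            if n_fc == 0:
                continue                       # zero count contributes 0
            mi += n_fc / N * math.log2(N * n_fc / (n_fi * n_c))
    return mi

# A perfectly class-predictive binary feature over 4 records, 2 classes.
score = mutual_information([[2, 0], [0, 2]], [2, 2], [2, 2], 4)
```

A feature value distribution that is independent of the class (all four cells equal) scores zero, while the perfectly predictive feature above scores the maximum for two balanced classes.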
The CFS and GR feature selectors described next make adaptations of information theoretic approaches to prevent this biasing.

3.2 Correlation-based Feature Subset Selection
Hall [17] proposed Correlation-based Feature Subset Selection (CFS), which measures the strength of the correlation between pairs of features. CFS favours feature subsets that contain features that are highly correlated to the class but uncorrelated to each other to minimise feature redundancy. CFS is thus based on information theory measured using Information Gain. However, as noted in the previous section, Information Gain is biased toward features with the largest number of distinct values. Hence, Hall and Smith [20] used a modified Information Gain measure (Symmetrical Uncertainty) to estimate the correlation between features, given in equation 6. Symmetrical Uncertainty effectively normalises the value in the range [0,1] where two features are completely independent if SU = 0 and completely dependent if SU = 1.

SU(F_j, G_l) = 2 \left[ \frac{Ent(F_j) - Ent(F_j \mid G_l)}{Ent(F_j) + Ent(G_l)} \right]    (6)

where the entropy of a feature Fj for all feature values fi is given as equation 7:

Ent(F_j) = -\sum_{i} p(f_i) \log_2 p(f_i)    (7)

and the entropy of feature Fj after observing values of feature Gl is given as equation 8:

Ent(F_j \mid G_l) = -\sum_{k} p(g_k) \sum_{i} p(f_i \mid g_k) \log_2 p(f_i \mid g_k)    (8)

Any continuous features are discretised using Fayyad and Irani's entropy quantisation [12]. The bin boundaries are determined using Information Gain and these quantisation bins map the data into the AURA CMM as previously.

As noted previously, for feature selection, the class values and the associated records that take those class values are also trained into the CMM as extra rows (extra features) as shown in Figure 3. CFS has many similarities to MI through calculating the values in equations 6, 7 and 8 and through using the CMM as noted below.

In the AURA CFS, for each pair of features (Fj, Gl) to be examined, the CMM is used to calculate Ent(Fj), Ent(Gl) and Ent(Fj | Gl) from equation 6. There are three parts to the calculation.

1. Ent(Fj) requires the count of data records for the particular value fi of feature Fj, which is n(BVfi) in equation 4 for symbolic and class features and n(BVbi) in equation 5 for continuous features. Similarly, Ent(Gl) counts the number of records where feature Gl has value gk.
2. Ent(Fj | Gl) requires the number of co-occurrences of a particular value fi of feature Fj with a particular value gk of feature Gl as in equations 4 and 5 and Figure 3, except that CFS calculates Ent(Fj | Gl) between a feature and the class (n(BVfi ∧ BVc) and n(BVbi ∧ BVc)) as well as between pairs of features (n(BVfi ∧ BVgk) and n(BVbi ∧ BVbk)).

CFS determines the feature subsets to evaluate using forward search. Forward search works by greedily adding features to a subset of selected features until some termination condition is met, whereby adding new features to the subset does not increase the discriminatory power of the subset above a pre-specified threshold value. The major drawback of CFS is that it cannot handle strongly interacting features [19].

3.3 Gain Ratio Feature Selection
Gain Ratio (GR) [43] is a new feature selector for the AURA framework.
GR is a modified Information Gain technique and is used in the popular machine learning decision tree classifier C4.5 [43]. Information Gain is given in equation 9 for feature Fj and the class C. Hall and Smith [20] used a modified Information Gain measure in their CFS feature selector to prevent biasing toward features with the most values. GR is an alternative adaptation which considers the number of splits (number of values) of each feature when calculating the score for each feature using normalisation.

Gain(F_j, C) = Ent(F_j) - Ent(F_j \mid C)    (9)

where Ent(Fj) is defined in equation 7 and Ent(Fj | C) is defined by equation 8. Then Gain Ratio is defined as equation 10:

GR(F_j) = \frac{Gain(F_j, C)}{IntrinsicValue(F_j)}    (10)

where IntrinsicValue is given by equation 11:

IntrinsicValue(F_j) = -\sum_{p=1}^{V} \frac{|S_p|}{N} \log_2 \frac{|S_p|}{N}    (11)

and V is the number of feature values (n(Fj)) for symbolic features and the number of quantisation bins n(bi) for continuous features, and Sp is the subset of the records that have Fj = fi for symbolic features or that map to the quantisation bin bin(fi) for continuous features.
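A hedged sketch of the Gain Ratio calculation from the per-(feature value, class) record counts that the CMM supplies. The helper names are illustrative; the gain below is computed in the standard C4.5 form Ent(C) − Ent(C | Fj), which equals the form of equation 9 because mutual information is symmetric in its two arguments.

```python
# Gain Ratio (equations 9-11) from a contingency table of record counts.
import math

def entropy(counts, N):
    """Entropy of a count distribution (equation 7)."""
    return -sum(n / N * math.log2(n / N) for n in counts if n)

def gain_ratio(cell_counts, N):
    """cell_counts[i][c] = records with feature value i and class c."""
    value_counts = [sum(row) for row in cell_counts]
    class_counts = [sum(col) for col in zip(*cell_counts)]
    # Gain = Ent(C) - sum_i p(fi) * Ent(C | Fj = fi)
    cond = sum(value_counts[i] / N * entropy(cell_counts[i], value_counts[i])
               for i in range(len(cell_counts)) if value_counts[i])
    gain = entropy(class_counts, N) - cond
    iv = entropy(value_counts, N)     # IntrinsicValue (equation 11)
    return gain / iv if iv else 0.0

gr = gain_ratio([[3, 0], [0, 3]], 6)   # feature splits the classes perfectly
```

The IntrinsicValue denominator is what penalises a feature that fragments the records into many small subsets, counteracting the Information Gain bias described above.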

To implement GR using AURA, we train the CMM as described in section 2.1 using a suitable quantisation for continuous features. This could be Fayyad and Irani's quantisation used in CFS or could be equi-width binning as described in section 2.2. We can then calculate Ent(Fj) and Ent(Fj | C) as per the CFS feature selector described in section 3.2 to allow us to calculate Gain(Fj, C). To calculate IntrinsicValue(Fj) we need to calculate the number of records that have particular feature values. This is achieved by counting the number of set bits n(BVfi) in the binary vector (CMM row) for fi for symbolic features or n(BVbi) in the binary vector for the quantisation bin bi for continuous features. We can store counts for the various feature values and classes as we proceed so there is no need to calculate any count more than once. The main disadvantage of GR is that it tends to favour features with low Intrinsic Value rather than high gain by overcompensating toward a feature just because its intrinsic information is very low.

3.4 Chi-Square Algorithm
We now demonstrate how to implement a second new feature selector in the AURA framework. The Chi-Square (CS) [39] algorithm is a feature ranker like MI and GR rather than a feature selector; it scores the features but it is the user's prerogative to select which features to use. CS assesses the independence between a feature (Fj) and a class (C) and is sensitive to feature interactions with the class. Features are independent if CS is close to zero. Yang and Pedersen [51] and Forman [14] conducted evaluations of filter feature selectors and found that CS is among the most effective methods of feature selection for classification.

Chi-Square is defined as:

\chi^2(F_j) = \sum_{i=1}^{b(F_j)} \sum_{c=1}^{nClass} \frac{N (wz - xy)^2}{(w+x)(w+y)(x+z)(y+z)}    (12)

where b(Fj) is the number of bins (CMM rows) representing feature Fj, nClass is the number of classes, w is the number of times fi and c co-occur, x is the number of times fi occurs without c, y is the number of times c occurs without fi, and z is the number of times neither c nor fi occur.
Thus, CS is predicated on counting occurrences and co-occurrences and, hence, has many commonalities with MI, CFS and GR.

- Figure 3 shows how to produce a binary output vector (BVfi ∧ BVc) for symbolic features or (BVbi ∧ BVc) for continuous features listing the co-occurrences of a feature value and a class value. It is then simply a case of counting the number of set bits (1s) in the thresholded binary vector T in Figure 3 to count w.
- To count x for symbolic features, we logically subtract (BVfi ∧ BVc) from the binary vector (BVfi) to produce a binary vector and count the set bits in the resulting vector. For continuous features, we subtract (BVbi ∧ BVc) from (BVbi) and count the set bits in the resulting binary vector.
- To count y for symbolic features, we can logically subtract (BVfi ∧ BVc) from (BVc) and count the set bits; likewise, for continuous features we can subtract (BVbi ∧ BVc) from BVc and count the set bits.
- If we logically OR (BVfi) with (BVc), we get a binary vector representing (Fj = fi) ∨ (C = c) for symbolic features. For continuous features, we can logically OR (BVbi) with (BVc) to produce (Fj = bin(fi)) ∨ (C = c). If we then logically invert this new binary vector, we retrieve a binary vector representing z and it is simply a case of counting the set bits to get the count for z.

As with MI, CS is univariate and assesses features on an individual basis, selecting the features with the highest scores, namely the features that interact most with the class. The main issue with CS is that it does not take into account inter-feature interactions.
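The four counts can be assembled into the chi-square statistic of equation 12 for a single (feature value, class value) cell as below. The bitwise numpy operations stand in for the CMM vector operations described above; the helper name and vector layout are illustrative assumptions, not the AURA API.

```python
# Chi-square cell statistic (equation 12) from the counts w, x, y, z
# derived by AND, logical subtraction, OR and inversion of CMM rows.
import numpy as np

def chi_square_cell(bv_f, bv_c):
    """bv_f, bv_c: binary vectors over the N record columns."""
    w = int(np.sum(bv_f & bv_c))          # fi and c co-occur
    x = int(np.sum(bv_f & ~bv_c & 1))     # fi without c
    y = int(np.sum(~bv_f & bv_c & 1))     # c without fi
    z = int(np.sum(~(bv_f | bv_c) & 1))   # neither fi nor c
    N = w + x + y + z
    denom = (w + x) * (w + y) * (x + z) * (y + z)
    return N * (w * z - x * y) ** 2 / denom if denom else 0.0

bv_f = np.array([1, 1, 0, 0], dtype=np.uint8)   # feature value row
bv_c = np.array([1, 1, 0, 0], dtype=np.uint8)   # class value row
score = chi_square_cell(bv_f, bv_c)             # perfect dependence
```

Summing this cell statistic over all bins of a feature and all classes gives the feature's overall CS score; an independent feature scores close to zero, a perfectly dependent one scores N.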

4 Parallel and Distributed AURA
In distributed systems, data may be processed locally at each parallel compute node, minimising communication overhead, or globally across all distributed nodes, providing an "overall" data view [46].

4.1 Parallel
In Weeks et al. [47], we demonstrated a parallel search implementation of AURA. AURA can be subdivided across multiple processor cores within a single machine or spread across multiple connected compute nodes. This parallel processing entails "striping" the data across several parallel CMM subsections. The CMM is effectively subdivided vertically across the output vector as shown in Figure 4. In the data, the number of features m is usually much less than the number of records N, hence m ≪ N. Therefore, we subdivide the data along the number of records N (column stripes) as shown in Figure 4.

Figure 4 If a CMM contains large data it can be subdivided (striped) across a number of CMM stripes. In the left hand figure, the CMM is striped vertically (by time) and in the right hand figure the CMM is striped horizontally (by feature subsets). On the left, each CMM stripe produces a thresholded output vector Tn containing the top k matches (and their respective scores) for that stripe. All {Tn} are aggregated to form a single output vector T which is thresholded to list the top matches overall. On the right, each stripe outputs a summed output vector Sn. All Sn are summed to produce a
