Transcription

AFRL-IF-RS-TM-2006-4In-House Technical MemorandumMay 2006A MATLAB-BASED OZTURK ALGORITHMIMPLEMENTATIONAPPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED.AIR FORCE RESEARCH LABORATORYINFORMATION DIRECTORATEROME RESEARCH SITEROME, NEW YORK

STINFO FINAL REPORTThis report has been reviewed by the Air Force Research Laboratory, InformationDirectorate, Public Affairs Office (IFOIPA) and is releasable to the National TechnicalInformation Service (NTIS). At NTIS it will be releasable to the general public,including foreign nations.AFRL-IF-RS-TM-2006-4 has been reviewed and is approved for publication.APPROVED:/s/GERALD C. NETHERCOTTChief, Multi-Sensor Exploitation BranchFOR THE DIRECTOR:/s/JOSEPH CAMERAChief, Information & Intelligence Exploitation DivisionInformation Directorate

Form ApprovedOMB No. 074-0188REPORT DOCUMENTATION PAGEPublic reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering andmaintaining the data needed, and completing and reviewing this collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, includingsuggestions for reducing this burden to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302,and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 205031. AGENCY USE ONLY (Leave blank)2. REPORT DATE3. REPORT TYPE AND DATES COVEREDMAY 2006In-House Final Technical Memorandum May 05- Aug 054. TITLE AND SUBTITLE5. FUNDING NUMBERSA MATLAB-BASED OZTURK ALGORITHM IMPLEMENTATIONCPEPRTAWU6. AUTHOR(S)Steven Salerno7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)- N/A- 62702F- 459E- PR- OJ8. PERFORMING ORGANIZATIONREPORT NUMBERAir Force Research Laboratory/IFEC525 Brooks RoadRome New York 13441-4505N/A9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES)10. SPONSORING / MONITORINGAGENCY REPORT NUMBERAir Force Research Laboratory/IFEC525 Brooks RoadRome New York 13441-4505AFRL-IF-RS-TM-2006-411. SUPPLEMENTARY NOTESAFRL Project Engineer:Andrew Noga/IFEC/(315) 330-2270/[email protected] DISTRIBUTION / AVAILABILITY STATEMENT12b. DISTRIBUTION CODEAPPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED.13. ABSTRACT (Maximum 200 Words)A method of determining the underlying statistical distribution from small data sample sets is reviewed. In particular, aMatlab graphical interface is used to allow for testing and comparison to simple histogram techniques. The rankordered Ozturk method is seen to perform well for uncorrelated samples14. SUBJECT TERMS15. NUMBER OF PAGES37Ozturk, rank-ordered statistics16. PRICE CODE17. SECURITY CLASSIFICATIONOF REPORT18. SECURITY CLASSIFICATIONOF THIS PAGEUNCLASSIFIEDUNCLASSIFIEDNSN 7540-01-280-550019. SECURITY CLASSIFICATIONOF ABSTRACTUNCLASSIFIED20. LIMITATION OF ABSTRACTULStandard Form 298 (Rev. 2-89)Prescribed by ANSI Std. Z39-18298-102

Table of Contents1.0 INTRODUCTION 11.1 OVERVIEW OF OZTURK ALGORITHM 11.2 SUMMER OVERVIEW 22.0 PDF IDENTIFICATION 22.1 OZTURK ALGORITHM 52.2 ALGORITHM PROCEDURE 62.3 PROBLEMS ENCOUNTERED 93.0 GUI IMPLEMENTATION 103.1 MAIN WINDOW – DISTRIBUTION 113.2.0 MAIN WINDOW – ANALYSIS 143.2.1 USER DATA WINDOW 173.2.2 IDENTIFICATION CHART 183.2.3 DISTRIBUTION TYPE WINDOW 203.2.4 ID ANALYSIS 203.2.5 PARAMETER ESTIMATION 223.3.0 MAIN WINDOW – 3D 233.3.1 DISTRIBUTION 3D 233.3.2 IDENTIFICATION CHART 3D 243.3.3 ID ANALYSIS 3D 254.0 CONCLUSIONS 264.1 APPLICATIONS OF THE OZTURK ALGORITHM 274.2 CONSTRAINTS OF THE OZTURK ALGORITHM 274.3 PROBLEMS & BUGS 274.4 FUTURE WORK 284.5 ACKNOWLEDGEMENTS 294.6 TABULAR VALUES 304.7 REFERENCES 32i

List of FiguresFig 2-1 Histogram with low sample count. 2Fig 2-2 Histogram with high bin count. 2Fig 2-3 Normal CDF for 100 random samples[7] . 3Fig 2-4 Sample Q-Q plot[7] . 5Fig 2-5 Vector Chart for Normal (n 6) . 7Fig 2-6 Sample Data with Confidence Ellipses . 8Fig 3-1 GUI Main Window. 11Fig 3-2 Distribution List . 11Fig 3-3 GUI Parameters Panel . 12Fig 3-4 GUI Distribution Panel. 12Fig 3-5 GUI Distribution Plot . 13Fig 3-6 GUI Histogram Plot . 13Fig 3-7 GUI Dist/Hist Plot. 14Fig 3-8 GUI Analysis Panel. 15Fig 3-9 Sample 1 Histogram . 16Fig 3-10 Sample 2 Histogram . 16Fig 3-11 Sample 3 Histogram . 16Fig 3-12 GUI Data Selection Window. 16Fig 3-13 GUI Normality Test Plot . 17Fig 3-14 User Data Window . 18Fig 3-15 GUI ID Chart Plot for n 100 . 19Fig 3-16 Distribution Type Window. 20Fig 3-17 GUI ID Analysis Test Plot . 21Fig 3-18 GUI Parameter Estimation Plot. 22Fig 3-19 Parameter Estimation Window. 22Fig 3-20 GUI 3D Panel . 23Fig 3-21 GUI Dist 3D Normal n 100 Plot . 24Fig 3-22 GUI ID Dist 3D n 100 Plot . 24Fig 3-23 GUI Analysis 3D Plot . 25Fig 3-24 ID Analysis Window. 26ii

1.0 IntroductionThis report covers work performed over the summer of 2005 as a summerengineering aide. Work in the applications of ranked-ordered statistics as a method ofsignal data analysis was conducted. More specifically work was conducted on the Ozturkalgorithm, which is considered to have demonstrated superior performance overhistogram and simple parameter-based techniques [1-4].The work includes theconstruction of a graphical user interface that utilizes the algorithm’s power to bothidentify and analyze probability density functions.1.1 Overview of Ozturk AlgorithmThe Ozturk algorithm is an identification algorithm that allows for a variety oftests that not only identify a PDF, but also analyzes the sample data. The core algorithmwas developed by Aydin Ozturk [4], and further worked upon in concurrence with E.J.Dudewicz [2]. Prior work was carried out with the help of Syracuse University and theonly documented implementation of the algorithm for real world use was associated withresearch conducted at the AFRL Sensors Directorate at the Rome Research Site.The basis of the algorithm deals with constructing a null-linked graph of vectorsfrom the ordered statistics of the sample data. The algorithm is scale and locationinvariant, which makes it particularly attractive. From this graph one can easily identifya distribution based on its endpoint, Qn, and surrounding null points. From the orderedstatistics other tests can be carried out, including but not limiting to, testing forNormality, and parameter estimation. The power of the test comes from the ability to usesmall sample sizes while still being able to correctly identify the best fit distribution. Theentirety of the algorithm will be discussed in Section 2.1

1.2 Summer OverviewDuring the 2005 summer a Matlab program was developed to be able to evaluateand demonstrate the Ozturk algorithm. Using the computing technology we have now wewere able to recreate the two null data charts using Monte Carlo methods that wereoriginally calculated using numerical integration. With these charts and the equationsderived for the algorithm, the program could match all the results that were provided bysamples in [2], and [3]. Along with the actual implementation of the algorithm andassociated GUI, one further extension of the algorithm was created in order to extend thefeatures of the code. The GUI prototype will be further discussed in Section 3.2.0 PDF IdentificationCurrently there exists an abundance of tests in order to determine/analyze aPDF. Histogram techniques depend on the number of bins, which is specified by theuser, and the number of samples, given by user. It can be shown however that thesetechniques can cause great amounts of error on both sides of the extrema. A low numberof samples can and usually will result in a histogram with no known shape as in Figure 21.Fig 2-1 Histogram with low sample countFig 2-2 Histogram with high bin count2

Changing the bin count has does not correct this problem because from this figure wecannot easily define a known PDF from the histogram. On the other hand if the samplesize and bin count are high, then the histogram becomes jagged in shape, as in Figure 2-2,instead of being smooth. Even though the shape isn’t perfect, the distribution can still bedetermined with a high confidence. Note that the histogram does not use a rank orderingof the data samples.Other well known techniques to identify PDFs include the Chi-Square goodnessof-fit test, Kolmogorov-Smirnov goodness-of-fit test, Anderson-Darling test, ShapiroWilk test, and the quantile-quantile plot, [7]. The Chi-Square goodness-of-fit test isapplied to data that has been binned, such as a histogram. From this the test can decidefrom what specific univariate distribution the sample came from. However the test issensitive to the number of bins, which are specified by the user. Also it is dependent onhow the data has been put into its bins, and requires a sufficient sample size to providefor a satisfactory solution.The K-S test (Kolmogorov-Smirnov test) also decides whether a sample comesfrom a specific distribution. Using rank-ordered data and the empirical distributionfunction a graph can be constructed as shown in Figure 2-3, from which we can estimatethe test statistic. The maximum distance between the two curves is calculated and theFig 2-3 Normal CDF for 100 random samples[7]3

solution is found as the closest fit to a set of known values. Advantages to the K-S testinclude that it can be done with any sample size, and it does not rely on the underlyingCDF being tested. However the K-S test cannot be applied to discontinuous functions, itis more sensitive towards the middle of the distribution, and the parameters of the samplemust be known, or the test is no longer valid. A solution to the last disadvantage comesfrom Lilliefors test for Normality, an extension of the K-S test. In this test the samplemean and standard deviation are calculated and the Z-score is found. With the Z-scorethe empirical CDF is calculated, plotted and the maximum distance is found as the teststatistic.The Anderson-Darling test is a modified version of the K-S test. It improvesupon the sensitivity of the center of the K-S test by weighting the tails. Also theAnderson-Darling test makes use of the specific distribution being tested to gain thecritical values, which in turn makes a more sensitive test overall. However calculatingthe critical values is a disadvantage, and for some distributions can be undefined.Unlike the other tests described thus far, the Shapiro-Wilk test only tests asample for Normality. However the power and sensitivity is much higher because of this.The test calculates a test statistic from which the value can identify whether the sample isNormal or not.The Q-Q plot (quantile-quantile plot) is a graphical means of interpretingwhether two distributions come from the same family. This can both identify a specificdistribution if one is known and the other is unknown, or can determine if bothunknown/known samples come from the same distribution. The Q-Q plot takes thevalues of the quantiles of each distribution and plots them against each other with a 45degree reference line as shown in Figure 2-4. The sample sizes do not have to be large,4

Fig 2-4 Sample Q-Q plot[7]but the bigger the sample size the more accurately the correlation can be portrayed. Thetwo distributions do not have to be equal in size. Also many properties of a distributioncan be tested all at once, including shifts in location and scale, changes in symmetry, thepresence of outliers, etc.2.1 Ozturk AlgorithmAll of the aforementioned tests are reliable in the sense that you are testing for aspecific distribution. Often the test is for normality (Gaussian PDF is assumed).However when the solution does not match the hypothesis no alternative possibility isgiven. Also a majority of the tests start to lose their power when the sample sizedecreases. Both these issues are addressed through the Ozturk algorithm [1-4]. Thealgorithm uses the sample statistics to create “linked vectors’ to create a goodness-of-fittest for univariate and multivariate distributions. The only foreseen disadvantage thealgorithm has is that for larger sample sizes the computational time is sloweddramatically. In this case, the other tests previously mentioned can identify and analyzedistributions with a high confidence level at high sample sizes.5

2.2 Algorithm ProcedureGeneral reference is made to [2]. Let X1:n X2:n Xn:n be the sampleordered observations from a distribution with the CDF F(x;μ, σ) F{(x- μ)/σ} where μand σ are the location and scale parameters, respectively. For our case we use theNormal distribution as the null hypothesis; however any distribution could be used as thenull.For illustration we will assume that X1:n X2:n Xn:n is from a Normaldistribution with an unknown mean μ and unknown variance σ2. To make the samplelinked vectors we need two main components, the length of the ιth vector, and the angle tothe horizontal axis θ. LetYi ( X i X ) / S(2-1)where X X i / n and S { ( X i X ) 2 /( n 1)}1/ 2. Yi:n is now considered the lengthof the vectors. Now let mi:n, m2:n, , mn:n be equivalent to the expected order statistics ofthe standard Normal distribution, mi:n E((Xi:n- μ)/σ). The angle horizontal to the x-axisis θι πФ(mi:n) where π 3.14159 andΦ (mi:n ) mi : n (2π ) 1 / 2 ( t 2 / 2 )edt .(2-2) Since m1:n m2:n mn:n and 0 Ф(mi:n) 1 then 0 θ1 θ2 θn π. θi:n isnow considered the angle between the x-axis and the ιth vector.To obtain the sample linked vectors we use the points Qi (Ui,Vi), whereiij 1j 1U i ( Y j :n Cosθ j ) / n and Vi ( Y j:n Sinθ j ) / n are the coordinates of Qι (ι 1, ,n) and Q0 (0,0). Ui and Vi can then be defined as the test statistics, as followingnU n 1 / n Yi:n Cosθ i ,(2-3)i 1nVn 1 / n Yi:n Sinθ i .(2-4)i 1To obtain the null linked vectors to construct the “known” of the null hypothesis m i:n E( Yi:n ), and E(mi:n) are used. To obtain the expected values numerical integration mustbe used.6

E ( x k n ) n!11x[ Φ ( x)]k 1 [ Φ ( x)]n k φ ( x)dx , (n k )!(k 1)! 22where φ ( x) (2π ) 1 / 2 e x(2-5)x2/2and Φ ( x) φ ( x)dx .0Using the expected values of mi:n and Yi:n we obtain the point Qn (Un,Vn) whichbecomes the null-linked vectors endpoint that is the “known”, other extensions use thisinformation for analysis.In a sample that we assume to be Normal, we would expect that the samplelinked vectors would closely follow the trajectory of the null-linked vectors, as in Figure2-5.Fig 2-5 Vector Chart for Normal (n 6)However if the sample is not Normal the two paths should differ; how much they differby depends on the sample distribution. This use of the algorithm allows for an ad hocversion (visual inspection) to be used and allow for distribution identification. Althoughthis can be done, it is rarely used because we have the test statistics that are provided bythe algorithm to more powerfully identify what a sample distribution’s true distributionis.A group of extensions have been developed for the algorithm to gather moreinformation about both the distribution and the test statistics. Using m i:n E( Yi:n ), wedefine the expected value for Qn for the Normal distribution to be E(Qn) 7

(0, (1 / n) m' i : n Sinθ i ) .Symmetric identities for the Normal distribution can beemployed to see that E(U) 0, and a model can be fitted to V to obtainE (Vn ) 0.326601 0.412921 / n .(2-6)Also using more properties of the test statistics and the linked vectors we can obtainmodels for the variances of U and VVar(U n ) σ u2 0.02123 / n 0.01765 / n 2 ,(2-7)Var(Vn ) σ v2 0.04427 / n 0.0951 / n 2 ,(2-8)which will help in further analysis tests.Both U and V are distributions, morespecifically Normal for large values of n. Since Qn is the joint distribution of U and V, atest statistic for a bivariate Normal distribution can be found and simplified asu2σ u2 (v u v ) 2σ v2 2lnα ,(2-9)where α is the confidence level. Using α we can get the 100(1- α)% confidence ellipses,which can be used to help test for Normality, as shown in Figure 2-6. We can then useFig 2-6 Sample Data with Confidence Ellipsesα as a measure of whether or not to accept or reject the answer given by the test statistic.8

One of the most powerful aspects of the test is that it provides within itself the ability todo parameter estimation based on the test statistics.Using the statisticsT1 X i:n Cos{π F0 (mi:n )} and T2 X i:n Sin{π F 0 (mi:n )} we can estimate the locationand scale. For better estimates we can use nμ (Sinθ i ) X i:n / c ,(2-10)i 1 nσ (Cosθ i ) X i:n / b ,(2-11)i 1where c Sinθ i and b μ i:n Cosθ i .The shape parameter of a distribution can also be calculated but was not done herebecause no distributions with shape parameters were analyzed in this time frame of thisresearch.2.3 Problems EncounteredWhile working with the algorithm there were a few situations in whichinterpretations of wording came up and had to be addressed. Also some information wasobscure and had to be found by other means. The first major problem uncovered was thatthe only identification chart that could be found was in [1]. However this chart, usingtheir own values calculated by their models, shows this chart to be incorrect value wise.Although the values for each of the points, and lines may be wrong, the locations seemedto be preserved. In this case we were able to use their chart as a known to test againstours, by comparing only relative locations and not values.Next we were given the E( Yi:n ) in [2], but were not given any indication ofwhat the mi:n values were, or the means to regenerate either set. The first option was tofind the equation (2-5) for the numerical integration and evaluate it to obtain the correctvalues. However due to the complexity that numerical integration sometimes presents wechose a Monte Carlo method. 1,000,000 runs of sample sizes n 6, 10, 20, 50, and 100were used to generate the values that the numerical integration would. As a result theerror in our answers compared to the chart answers in [5] was less then 0.0001. This9

error is acceptable because through an experiment the values of U and V do not varygreatly unless the error is above 0.01.A problem also occurred with the wording regarding α and the “confidenceellipses.” It is understood that the 100(1- α)% level is the level at which the user is thatconfident. For example, in Figure 2-6, the middle ellipse is the “95% confidence ellipse.”However it is actually saying there is a 95% probability of it not being Normal instead ofbeing 95% confident it is Normal. As you get closer to the null endpoint the confidencein the answer should continue to go up, until the null endpoint is hit and the confidenceshould be 100%. If we used this interpretation, it says that at the null endpoint theconfidence ellipse is 0%. To resolve this wording issue, confidence ellipse could eitherbe considered as more of an error ellipse, or the equation could be changed to 100(α)%.Another problem with maximizing performance deals with the fact that thealgorithm actually has 9 different ways to standardize the sample order statistics. In thispaper and the experiments presented we use what is called the Type 1 version or Qn(1)[3]. This version is used because it presents the best power to identify and analyze overthe most distributions. However if we were to use each of the versions to identify andanalyze we would be able to increase the accuracy of the algorithm even further. This isbecause each of the versions is sensitive to a certain distribution, and thus more powerfulfor those distributions.3.0 GUI ImplementationThe entire GUI and code was built from the ground up for the sole purpose ofPDF identification without specific application in mind. All equations for the algorithm,PDF information, and other likely equations were gathered and stored in multiplefunctions. Matlab was to be the primary focus, while using Octave, a Matlab look alike,to provide sample data. However through the Mathworks community file sharing aMatlab function was found to generate data for all needed distributions. The GUI isconstructed of the main window where all functions takes place, along with 4 smallerwindows that only display data.10

3.1 Main Window – DistributionThe main window is featured in Figure 3-1. The main window is divided into 3Fig 3-1 GUI Main Windowsubsections, Distribution, Analysis, and 3D. In the top left corner is a pull down menuthat lets the user select which distribution they would like to work on, as seen in Figure3.-2.Fig 3-2 Distribution ListThere are currently 11 different distributions that can be chosen for the first part of thetool. The user simple chooses which of the distributions they would like to begin with.11

After the selection of the distribution by the user, the Std. Values pushbuttonneeds to be pressed. This simply puts the values of the location, scale, skewness andkurtosis parameters to 0, 1, 1, and 1, respectively. These values are the standard valuesfor each of the distributions except for the Pareto distribution which needs to have alocation greater than one to be defined, so its standard location value is 1. This is done toreset the data from any previous run of the tool.Next is the Parameters panel, as in Figure 3-3, which displays to the user thecurrent values of the parameters of the distribution they have chosen. For this scenariowe will be using the Normal distribution throughout the discussion of the GUI.If thedistribution is not dependent upon a parameter it becomes grayed out and unable to bechange. Also if a user decides that they would like to see the distribution with parametersother than the standard values, they can simply change them.Fig 3-3 GUI Parameters PanelThe first set of tools that can be used are contained in the Distribution panel, asin Figure 3-4. Within the panel are 3 pushbuttons and an edit text area. The Distributionpushbutton takes the type of distribution selected by the user, and the parameters selectedFig 3-4 GUI Distribution Panel12

Fig 3-5 GUI Distribution Plotby the user, and plots the PDF on the graph on the right hand side of the GUI, as seen inFigure 3-5.The Histogram pushbutton is similar in that it takes the selected distribution andthe set parameters. It plots a histogram of 1,000,000 samples, as shown in Figure 3-6.Fig 3-6 GUI Histogram PlotThe Histogram is also ruled by another parameter which is the number of bins to sort thedata into. This parameter can be defined by the user as was the parameters of thedistribution. The edit text box labeled N is where the user is allowed to make changes tothe number of bins. For Figure 3-6 the current N is equal to 100, however we can change13

that to any number we choose, as long as it is a positive integer. Changing the number ofbins allows the user to maximize the power of the histogram by choosing the optimalnumber of bins.The last tool in the group is the Dist/Hist pushbutton.This first plots thehistogram of the distribution with the selected parameters and bins, and then plots thedistribution over it, as in Figure 3-7. However this tool can only be used after theFig 3-7 GUI Dist/Hist PlotHistogram pushbutton has been used because it reuses the data from the Histogram toolinstead of creating another different set. For larger bin sizes there would be no largedifference if two data sets were constructed, though as bin sizes decrease the appearanceof the Histogram begins to vary more.This tool is not a necessary part of the GUI since its function has nothing to dowith the Ozturk algorithm. However this can be used more so as a learning feature to beable to identify the different distributions. Also if a user has a known shape and knownparameters but does not know the distribution it is, this could become a potentialidentification tool.3.2.0 Main Window – AnalysisThe Analysis panel contains the main tools that have been defined in the Ozturkalgorithm, as shown in Figure 3-8. It contains four different pushbuttons that are allfunctions of the algorithm, and an edit text box for the sample size of the data. There are14

a few restrictions in place as to how to operate within this panel. First when testing asample, the data must be given in a .mat file and the variable of the data must be callx userdata. Also within the data, if there is more than one data set that is given in thevariable they must be arranged by having the first data set in the first column, the secondset in the second column, etc. With regard to the direction of the pushbuttons, theNormality pushbutton must be first because it loads the sample data from the source thensaves it to another file from which the other pushbuttons can use it from.Fig 3-8 GUI Analysis PanelFor our purposes we will use a data set that has been given in [2], which contains 3 datasets, two generated by a Normal distribution and the third by an Exponential.Sample 1: 5.62 6.76 7.90 5.77 7.51 4.92 4.43 6.51 5.30 8.91Sample 2: 7.43 5.55 2.91 4.17 1.27 5.44 8.89 8.75 6.20 7.14Sample 3: 8.51 7.12 21.49 7.15 6.94 12.97 7.30 0.19 7.20 4.18Preliminary tests were run on the data sets to see if their true distributions could be found.All tests proved inconsistent and could not tell us the true distribution, Figure 3-9 – 3-11.15

Fig 3-9 Sample 1 HistogramFig 3-10 Sample 2 HistogramFig 3-11 Sample 3 HistogramThe Normality pushbutton runs the Ozturk algorithm.After pressing thepushbutton the user is prompted to pick which file they would like to load into the GUIfor testing, as in Figure 3-12. After the selection has been chosen the data is put throughFig 3-12 GUI Data Selection Window16

the algorithm one data set at a time. The null-linked vector is plott

The Ozturk algorithm is an identification algorithm that allows for a variety of tests that not only identify a PDF, but also analyzes the sample data. The core algorithm was developed by Aydin Ozturk [4], and further worked upon in concurrence with E.J. Dudewicz [2]. Prior work was carried out with the help of Syracuse University and the