Journal of Information Security, 2022, 13, 23-42
ISSN Online: 2153-1242, ISSN Print: 2153-1234

Hadoop Distributed File System Security Challenges and Examination of Unauthorized Access Issue

Wahid Rajeh

Department of Information Technology, University of Tabuk, Tabuk, Saudi Arabia

How to cite this paper: Rajeh, W. (2022) Hadoop Distributed File System Security Challenges and Examination of Unauthorized Access Issue. Journal of Information Security, 13, 23-42.

Received: January 14, 2022; Accepted: February 13, 2022; Published: February 16, 2022

Copyright © 2022 by author(s) and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). Open Access.

Abstract

Hadoop technology is accompanied by a number of security issues. In its early days, developers mostly paid attention to the development of basic functionality, and the design of security components was not of prime interest. Because of that, the technology remained vulnerable to the malicious activities of unauthorized users whose purpose is to endanger system functionality or to compromise private user data. Researchers and developers are continuously trying to solve these issues by upgrading Hadoop's security mechanisms and preventing undesirable malicious activities. In this paper, the most common HDFS security problems and a review of unauthorized access issues are presented. First, the Hadoop mechanism and its main components are described as an introduction to the leading research problem. Then, the HDFS architecture is presented, and all of its components and functionalities are introduced. Further, all possible types of users are listed, with an accent on unauthorized users, who are of great importance for this paper. One part of the research is dedicated to the consideration of Hadoop security levels, environment and user assessments. The review also includes an explanation of the Log Monitoring and Audit features, and a detailed consideration of authorization and authentication issues.
Possible consequences of unauthorized access to a system are covered, and a few recommendations for solving the problem of unauthorized access are offered. Honeypot nodes, security mechanisms for collecting valuable information about malicious parties, are presented in the last part of the paper. Finally, the idea of developing a new type of intrusion detector, based on an artificial neural network, is presented. The detector will be an integral part of a new kind of virtual honeypot mechanism and represents the initial basis for the author's future scientific work.

Keywords

Hadoop Security Issue, Unauthorized Access, Honeypot Node, Intrusion Detector

DOI: 10.4236/jis.2022.132002

1. Introduction

The Big Data concept is based on the storing, processing and transferring of vast amounts of unstructured, semi-structured and structured data [1]. It can be a collection of large data sets with petabytes of raw data (user and enterprise data, sensor information, medical and transaction data). Generally, it is a real challenge to store and adequately process enormous quantities of data using traditional processing tools. Because of that, Big Data technology is gaining global importance and is expected to grow exponentially in the future [2]. This technology provides new opportunities for all industry sectors, companies, and institutions that depend on quality processing of large amounts of raw data. It can be described by three main properties (the "3V" properties): volume, velocity, and variety [3]. Volume represents the quantity of data that can be transferred from a source of information to a system of interest. The variety feature is determined by the data types present within a data set, while velocity represents the speed at which the data is stored and processed. Besides these three essential characteristics, Big Data can also be described by variability (inconsistency with periodic peaks during the flow of data) and complexity (various types of data that come from multiple sources) [4]. However, along with all the benefits of using Big Data technology come many challenges and potential issues [5] [6]. The challenges are a consequence of Big Data's complexity and the difficulty of performing data operations such as storing, sharing, searching, analyzing and transferring large amounts of information. On the other side, one of the leading Big Data issues is the potential vulnerability of a system to malicious parties. Large quantities of valuable and private information can easily be exposed to malicious clients who want to steal or use data without the required permissions. In that way, the privacy and integrity of data can be strongly jeopardized.
With the purpose of improving the effectiveness and increasing the robustness of existing Big Data systems, the Hadoop mechanism was proposed.

1.1. The New Era of Distributed File System

Hadoop is a master-slave open source platform for storing, managing and distributing data across a significant number of servers [7]. It is a Java-based solution to the majority of Big Data issues and is distributed under the Apache License. It is a highly accessible technology that operates on large volumes of data and can be used for high-speed distribution and processing of information. Hadoop efficiently resolves the "3V" challenge by providing the following features to a system: a framework for horizontal scaling of large data sets, handling of furious transfer velocity rates, and an efficient framework for processing a variety of unstructured data. Also, it can handle the failure of a single machine by re-executing all of its tasks. However, in a large-scale system such as Hadoop, the occurrence of failures is unavoidable.

On a basic level, Hadoop is built from two main components [8]: MapReduce and the Hadoop Distributed File System (HDFS). The MapReduce component is used for the computational implementation of Hadoop in the form of distributed processing of data. It organizes multiple processors in a cluster to perform the required calculations. MapReduce distributes the computation assignments between computers and puts together the final computation results in one place. Additionally, this component takes care of network failures so that they do not disturb or disable active computation processes. On the other side, HDFS is used for information management and distributed storage of data. It is the file system component that provides reliable and scalable storage features and a global file access option. The HDFS component is of main interest in this paper, so it will be additionally explained in the next two subsections.

1.2. HDFS Architecture

The main goals of HDFS are storing large amounts of data in clusters and providing a high throughput of information within a system. Data is stored in the form of equally sized blocks, where the typical size of each block is 64 MB or 128 MB. Depending on its size, each file is stored in one or a few blocks. The size of a block is configurable, and each file can have only one writer at a moment. Within the HDFS component, a client can create new directories; create, save or delete files; rename and change the path of a file; etc.

The HDFS architecture is based on the master-slave principle, and it is built from a single NameNode and a group of DataNodes [9]. The NameNode (the master node), as the core part of the system, manages the HDFS directory tree and stores all metadata of the file system. Clients communicate directly with the NameNode in order to perform standard file operations.
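To make the NameNode's bookkeeping role concrete, the following toy sketch splits a file into 64 MB blocks and records which DataNodes hold each replica. It is purely illustrative: the class, constants, and placement strategy are invented for this example and are not Hadoop's actual API.

```python
# Toy sketch of HDFS-style metadata bookkeeping (illustrative only,
# not Hadoop's API; real HDFS placement is rack-aware).
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS default
REPLICATION = 3                 # each block is stored on three DataNodes

class ToyNameNode:
    """Keeps the file -> blocks -> DataNodes mapping in memory."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}          # file name -> list of block IDs
        self.block_locations = {}    # block ID -> list of DataNodes
        self._ids = itertools.count()

    def create_file(self, name, size_bytes):
        # Split the file into fixed-size blocks (the last one may be smaller).
        n_blocks = max(1, -(-size_bytes // BLOCK_SIZE))  # ceiling division
        blocks = []
        for _ in range(n_blocks):
            block_id = next(self._ids)
            # Simple round-robin placement, standing in for the real policy.
            start = block_id % len(self.datanodes)
            replicas = [self.datanodes[(start + i) % len(self.datanodes)]
                        for i in range(REPLICATION)]
            self.block_locations[block_id] = replicas
            blocks.append(block_id)
        self.block_map[name] = blocks
        return blocks
```

Under these assumptions, a 150 MB file would occupy three blocks (64 + 64 + 22 MB), each recorded on three DataNodes.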
Further, the NameNode performs a mapping between the files stored at DataNodes and the proper file names. Another of its functions is monitoring the possible failure of a DataNode and resolving this issue by creating a block replica [10]. The NameNode can have two other roles in the system: it can act as a CheckpointNode and as a BackupNode. A periodical checkpoint is an excellent way to protect the system metadata. On the other hand, the BackupNode maintains a file image that is synchronized with the NameNode state. It handles potential failures and rolls back or restarts using the last good checkpoint. Additionally, in enterprise versions of Hadoop, there is a practice of introducing a Secondary NameNode. It is a useful system addition in case the original NameNode crashes. In that case, the Secondary NameNode uses a saved HDFS checkpoint and restarts the crashed NameNode. DataNodes are proposed to store all file blocks and to perform the tasks that are delegated by the NameNode. Each file on a DataNode can be split into a few blocks, each labelled with an identification timestamp. These nodes are used to provide the service of writing and reading desired files. By default, each data block is replicated three times: two copies are stored on two different DataNodes in a single rack, and a third copy is saved on a DataNode which belongs to another rack.

2. Key Security Challenges

Many security challenges characterize Hadoop technology [11] [12]. That comes from the fact that it operates by using a variety of different technologies such as databases, operating systems, networks, communication protocols, memory resources, processors, etc. The occurrence of a security problem in one of the mentioned components can endanger the work of the entire system. That is why it is necessary to seriously consider the security challenges and operational issues of Hadoop from all perspectives. It is a complex system that can be significantly affected by network security problems, authentication and authorization issues, data availability and integrity concerns, and additional security requirements. The parallel computation capability of Hadoop results in a complex environment that is at high risk of attack. Users share physical resources with each other, and therefore a user does not have complete control over data. This is a consequence of parallelism, which distributes data storage across many machines. In that case, a client and a malicious party can easily share the same physical devices. If an adequate security system is not implemented, a malicious party can get full access to data and compromise honest clients. Compromised clients propagate malicious data through the network, by which the whole system can be affected. Three such challenges are discussed in the following sections.

2.1. Utilizing Remote Procedure Call Protocol

Hadoop technology is based on Remote Procedure Calls over TCP/IP for the transfer of data between different nodes [13]. Default communication is not secure, and malicious parties can easily modify internode communication to hack the system.
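One standard remedy for such tampering is to attach a message authentication code to every request so that any in-transit modification is detectable. The sketch below illustrates the principle only; it is not Hadoop's actual RPC wire format, and in a real cluster the shared key would come from the authentication layer (e.g. Kerberos session keys) rather than being hard-coded.

```python
# Illustrative integrity protection for node-to-node messages.
# Not Hadoop's real RPC format; key management is out of scope here.
import hmac
import hashlib

TAG_LEN = 32  # HMAC-SHA256 digest size in bytes

def seal(key: bytes, payload: bytes) -> bytes:
    """Append an HMAC-SHA256 tag computed over the payload."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag

def open_sealed(key: bytes, message: bytes) -> bytes:
    """Verify the tag; raise if the message was modified in transit."""
    payload, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message tampered with or wrong key")
    return payload
```

If a malicious party flips even a single byte of a sealed request, verification on the receiving node fails and the request can be rejected.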
Computations can be performed anywhere within the clusters, so it is a complex task to find the precise location of a single computation. Because of that, it is also challenging to ensure the security of each computation location. Apart from that, insecure communication can lead to data leakage during the transfer of data between a DataNode and a client. Also, it is not a rare case that undesirable nodes are added to a system with the task of stealing data or compromising computations.

2.2. Replication Storage Model

Hadoop technology is based on storing large amounts of information in multiple clusters in a distributed way. Every cluster can be built from thousands of DataNodes, which makes the entire system structure complex. A simple example of the HDFS storage working principle is presented in Figure 1. A large file is first split into three data blocks. Each of them is replicated three times and saved onto different DataNodes for security reasons. The Distributed File System (DFS) organizes the transfer of metadata between the NameNode and a client.

Figure 1. Replication storage model.

This operational algorithm is designed for all honest users within a system. However, a malicious party can access HDFS as any user, which can result in a variety of severe security issues. For that reason, the implementation of an access control algorithm and the denial of access to unauthorized clients are of the most prominent importance for reliable and secure distributed systems. An entry point for a malicious client can be any DataNode itself. When an unauthorized party accesses a DataNode, private user data is easily accessible. This is a severe and unacceptable risk for all potential Hadoop users who want to use this framework. Additionally, a malicious intruder can run malicious executable code against Hadoop services and interrupt the operation of HDFS, the NameNode, the DataNodes, and all other network components.

Generally speaking, there are four types of Hadoop users. The first one is a regular user who exploits Hadoop capacities to process, transfer, and store data. The second type of user is a business user, for whom Hadoop technology is an ideal solution for improving existing business solutions. Scientific researchers and developers are the third type of user. The final, fourth group are malicious users who try to steal or misuse the data by accessing it in an unauthorized way. These users cannot be stopped from attacking the system in advance, but learning about the potential threats and possibilities for unauthorized attack can result in improving the security performance of Hadoop. The platform is continuously evolving, and its security mechanisms are gradually maturing.

2.3. Cluster Security Levels

The default Hadoop setup includes no security features within clusters. On Level 0 of security, the system assumes that the participating parties are honest and that a trust level exists. However, authorization features may exist inside Level 0, in which case they are assigned to objects. An attacker could quickly overcome this security level by performing a bypass authentication attack. Level 1 is the first stronger line of defense against unauthorized attacks. It includes an EdgeNode which limits and restricts access to a cluster. It is a type of intermediary onto which users log in first; the EdgeNode then connects directly with the cluster and performs the transmission of information. Direct connectivity between users and clusters can be disabled by using an EdgeNode. On the other side, Level 2 provides authentication and access control, based on an authentication mechanism that guarantees that all active services and participating users are authenticated. One of the most popular authentication mechanisms is Kerberos [14]. The network encryption feature represents Level 3 of the Hadoop security mechanism. The final security line, Level 4, is HDFS encryption. If a malicious party somehow overcomes the previous security levels, there is a good chance that it will get access to the compromised node, from which it can exploit data blocks and get access to files at the level of the operating system. HDFS encryption is one successful way to prevent this threat.

The distributed data processing and parallelism of Hadoop are the main reasons why the mentioned issues have not been entirely resolved until now. In other words, no perfect security algorithm for the Hadoop environment has been proposed. As can be concluded from the previous analysis, one of the biggest problems is the potential attack of malicious clients and the compromising of user data. In the next section, the HDFS issue of unauthorized access to a system will be analyzed in detail.

2.4. User Access Monitoring

First, it is important to explain concretely what an unauthorized client is.
Such a client does not have permission to access data or perform any operation within a cluster. A malicious user can attack Hadoop by accessing a file via HTTP protocols or via remote procedure calls (RPC). Further, it can execute malicious code against a system and read or write arbitrary data blocks of a file by using the pipelined streaming data-transfer protocol. Also, it can acquire privileges which provide it with the capability to change the priority of assigned jobs inside Hadoop, to delete jobs, or to submit its own malicious tasks. When an unauthorized user performs an operation on a data block, he bypasses the access control mechanism. Another way for a malicious party to get unauthorized access is to intercept communications to Hadoop consoles. It could be any communication process between a NameNode and DataNodes. When communication is intercepted, credentials or data can be stolen.

In order to make the Hadoop system secure from unauthorized users, it is of vital importance to check all system changes. These changes could be any addition or deletion of data, modification of information, node management, etc. To provide a suitable security mechanism, a log monitoring system must be deployed, and the complete Hadoop system must be audited at all times [15].

Log monitoring has become an essential Hadoop component for acquiring frequent information about the entire system. The problem is that the Hadoop framework does not have built-in monitoring features for detecting malicious queries or misuse of data. Further, ongoing research still cannot precisely and theoretically formulate a malicious query and its characteristics. Therefore, a universal monitoring solution has not been proposed yet, and every individual usage case of HDFS and Hadoop has its own monitoring functions.

The purpose of the other feature, the audit [16], is to comply with all security requirements, and it is used by the MapReduce and HDFS components. An audit can be used to detect when a party accesses the system in an unauthorized way by exploiting an event log and checking an activity record. Event logs are proposed to register all user actions, from correct ones to activities which are wrongly or maliciously performed. But it is not enough to have records of user IDs and IP addresses when logs are created; it is also strongly recommended to snapshot information about issued queries. However, another problem can occur here, because there is a possibility that an attacker can delete or modify event log entries in the system. That threat is a tremendous challenge for the already proposed monitoring features, which continuously need to be upgraded.

The usage of a log monitoring feature and a proper audit mechanism can significantly lower the possibility of security violations. Still, to build an adequate security mechanism, it is also essential to know the environment in which a system operates and to perform a detailed user assessment.

2.5. User and Environment Assessment

Direct access to HDFS is required by two types of clients: developers/analysts and indirect access users.
If a developer wants access permissions, it is expected that he will need access to different nodes, developer tools, log files, etc. If a data analyst wants to use HDFS resources, it is logical that he will not need the same tools as the developer, but will require analytical tools instead. On the contrary, an indirect access user has no reason to use developer tools, exploit and analyze data, or use the computational resources of a system. Indirect access users do not require exclusive access, and they should be covered as part of the general security model.

It is often not enough to understand the types of users that utilize the system. In order to fully assess the risk to the security of the system, it is necessary to know the environment in which the system operates [17]. First, it is essential to determine whether a system is available on a global network and connected to the internet. If that is the case, the system is open to many different threats, viruses, and potential attacks by unauthorized clients who try to exploit its vulnerabilities. HDFS components with internet access must have an appropriate monitoring mechanism and a continuous alert platform for the appearance of unauthorized parties. A lot of time and programming effort is required to continually investigate possible threats, to develop new security patches, to upgrade existing ones, and to research the latest developments in techniques for endangering system security.

The environment should also be evaluated by its physical characteristics. The physical location of the machines that are part of a distributed system can be essential from the perspective of determining who has direct access to individual computers. Generally, machines can be stored inside a company data center, a third-party data center, or in a cloud. Based on these expectations, the security of the system must be adapted to sufficiently defend the complete structure. A big problem can occur if the servers are hosted in a public cloud, because it is challenging to tell for sure who has access to the system. In that case, the security mechanism will be much more complex and demanding, in comparison with the requirements when the servers are located on private property under the absolute control of a few people. Communication problems and security issues are significantly lower if cloud services are entirely avoided.

The overall security of Hadoop has improved significantly in the last few years. Most of the issues which occur today are examined and researched by experts in the field. However, absolute protection from unauthorized access does not exist, and it is an everyday struggle for developers to improve the performance of distributed systems and the HDFS component. Enterprise Hadoop distributions are highly capable of dealing with access and identity management tasks. The majority of organizations which utilize the Hadoop distributed storage system have two different types of administrators: Hadoop administrators and platform administrators. Both groups have authorized access to the files of a cluster. A regular issue is the requirement to separate the duties and access restrictions of the two groups.
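This separation of duties can be modeled as a simple role-based check. The sketch below is illustrative only; the role and permission names are invented for the example and do not correspond to any actual Hadoop authorization API.

```python
# Hypothetical role-based separation of the two administrator groups.
# Role and permission names are invented for illustration.
ROLE_PERMISSIONS = {
    "hadoop_admin":   {"manage_jobs", "manage_hdfs_quotas", "read_service_logs"},
    "platform_admin": {"manage_hosts", "patch_os", "read_service_logs"},
}

def is_allowed(role: str, permission: str) -> bool:
    """Grant only the permissions explicitly assigned to the role."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

Note that neither role is given a hypothetical "read_user_data" permission: under such a model, even administrators cannot inspect private file contents unless that right is explicitly assigned.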
Sometimes it is necessary that an administrator does not see private information and sensitive data. In that case, a feature with the capability of segregating administrative roles and limiting access to the desired level is required. Newer versions of Hadoop protection include authorization of different roles, file permissions, management of access lists, etc. However, these features cannot stop malicious users from gaining unauthorized access, acquiring private information, and snapshotting the content of files. Better security performance is possible if features such as different key management services are implemented within a system. One popular and highly used key management service is HDFS encryption, which provides a unique key for every application [18].

2.6. Authorization Labeling

In the beginning, the developers of Hadoop technology mostly devoted their attention to preventing data loss accidents, while unauthorized access to HDFS and data was not properly considered. The HDFS system possesses a file permission feature that prevents a user from accidentally deleting a file system. But there was a lack of protection from an unauthorized party that wants to assume root's identity with the intention of disclosing or deleting some cluster data. Also, there is a possibility that an honest party tries to access data in some inappropriate way or by mistake, which can also be labeled, as illustrated in Figure 2, where an attempt at unauthorized access is detected. The HDFS file permission feature is based on an authorization mechanism [19], which is not enough for complete security and protection from unauthorized users.

Figure 2. Detection of unauthorized named dataset.

Hadoop is still vulnerable to attacks, and therefore the integrity and confidentiality of data are endangered. The system is open to abuse via manipulations of data by a malicious party which can access a network cluster. File permissions manage which operations are allowed or not allowed for each user and what a specific user can do with a single file. For instance, a file can be writable by only one user and readable by a group of users, while all other users have no rights on the file and it is completely locked to them.

It is essential to provide the system with a mechanism which will adequately implement all data access policies within HDFS. Every authorization decision that is made on the NameNode must also be consistently enforced on the DataNodes. That way, unauthorized access can be stopped more easily. However, the authorization mechanism is only one of many security steps which are required for full protection of data.

2.7. Node Based Authentication Issues

The process of authentication is the activity of proving the identity of a party to someone else. Authentication is an essential step for the secure and reliable work of distributed data systems [20]. Before the communication between two parties starts, an authentication protocol runs. The task of the protocol is to establish the identities of the two parties, after which they can cooperate securely. The existence of such a protocol is an essential feature for building a secured and protected environment.

A lack of authentication features characterizes the elemental Hadoop framework. The first problem here is that Hadoop does not authenticate a user or a service. However, it is vital to provide Hadoop with a mechanism that allows a user to perform a file operation only if it is verified that the user is who it claims to be.
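Such identity verification can be sketched as a minimal challenge-response exchange over a shared secret. This is an illustration of the principle only; production Hadoop clusters rely on Kerberos [14] rather than a raw shared secret, and the function names here are invented.

```python
# Minimal challenge-response identity check (illustrative; real
# deployments use Kerberos rather than a raw shared secret).
import hmac
import hashlib
import os

def make_challenge() -> bytes:
    """Server side: issue an unpredictable one-time nonce."""
    return os.urandom(16)

def respond(secret: bytes, challenge: bytes) -> bytes:
    """Client side: prove knowledge of the secret without sending it."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()

def verify(secret: bytes, challenge: bytes, response: bytes) -> bool:
    """Server side: accept only if the response matches the expected HMAC."""
    expected = hmac.new(secret, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

A party that does not know the secret cannot produce a valid response, so its identity claim is rejected.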
In that case, the user becomes an honest or trusted party of the system. Further, personal information of users, such as names, IP addresses, credit card numbers, etc., should be protected by the framework, and the access of a random user to this data should be restricted. Generally, the authentication mechanism must include all regulatory requirements which are essential for the protection of the data. A typical case is building such a mechanism so that it allows different security levels for different sets of data. For example, data with a low level of security can be public or anonymized data which can be shared between all network parties without any restriction. On the other hand, the highest level of protection should be provided for personal and sensitive data, allowing only chosen parties to have access rights. If a proper authentication mechanism is built, Hadoop will effectively determine whether a user has the required permissions to perform an action. Every action of a user is followed by the mechanism, which requires user credentials whenever the user tries to access the data.

Three main components are required for the authentication process. The first one is a string in the form of a username or a service name. Additionally, a password can be added for each user. If a password is added, its protection can be achieved by applying a salted password hashing algorithm [21]. The second, optional component is the instance, which defines the specific role of a user or the name of a host on which a service is running. The third one is the realm itself, and it can be described as a type of DNS domain. In the default setup, DataNodes do not execute any access checks on the entry points of data blocks. If an unauthorized client wants to perform an operation on some data block, it is enough for him only to know the block ID. Also, it is possible for all users (honest and malicious) to write a new data block to DataNodes. In that case, any malicious party can overload the system with a large amount of garbage data, and the computational power of Hadoop will be used for processing spam files. This activity is known as a denial of resources: a malicious party submits multiple jobs to the system, which tries to perform all the required tasks and spends the majority of the available cluster resources on them. The system is occupied with undesirable activities, and other users are prevented from processing real jobs optimally.

The lack of authentication can also result in allowing a random user to start a random service on any machine. When the NameNode registers a malicious user, it will automatically begin to receive data blocks from a cluster. This is a consequence of the replication feature of Hadoop, where data is replicated three times within the system for safety reasons. One way of solving this problem is the implementation of a feature which can restrict machine registration as DataNodes. It is shown in practice that the main element of this feature is the dfs.hosts property, which contains the names of hosts that are allowed to connect and register with a NameNode. The property is physically stored inside hdfs-site.xml. Initially, this feature is turned off and needs to be activated individually.
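Assuming a standard Hadoop deployment, activating the host include list could look like the following hdfs-site.xml fragment. The include-file path shown here is an example, not a required location:

```xml
<!-- hdfs-site.xml: only hosts listed in the include file may register
     as DataNodes with the NameNode. The path below is an example. -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/dfs.hosts.allow</value>
</property>
```

The referenced file then contains one allowed hostname per line; hosts not listed there are refused when they attempt to register.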
While it is off, a malicious client can communicate with any DataNode, read existing data blocks, or add new malicious data blocks.

Secure communication in newer Hadoop versions is provided by using the Transport Layer Security (TLS) protocol [22]. This protocol ensures communication privacy between all nodes and name servers. That way, HDFS can protect the transfer of data through a system and guarantee the safety of all honest parties.

2.8. Threats and Possible Attacks

In the previous subsection, unauthorized access was described from the perspective of Hadoop. The most important part of developing a secure and reliable system is to have good insight into the possible threats of unauthorized access [23] and what an unauthorized party can do to a network. If the intruder is known, the defense mechanism and security features can be developed more easily.

A Port Scanning Attack is a malicious method for collecting data that discloses information about the openness of services and ports. The technique also records the ways in which services respond to individual queries. On the other hand, a Dictionary
