How Machine Learning protects big data: ML in Cybersecurity
Why Machine Learning is needed in Cybersecurity: 5 areas of automation
In 2018, the CIA, the FBI, the US Department of Defense, the United Kingdom, the International Olympic Committee, the People’s Bank of China, the Marriott hotel chain, and users of BitTorrent, GitHub, Skype, Tinder, WhatsApp and YouTube all suffered information leaks. To prevent such incidents and reduce the damage they cause, cybersecurity experts add Machine Learning to their existing tools.
Thanks to its success in clustering and classification tasks, Machine Learning is well suited to anomaly detection, and many cybersecurity methods are built on this principle. In particular, Machine Learning is used for:
• recognition of forged documents, biometric data and other identifiers;
• identification of fraudulent transactions (anti-fraud), for example, when a bank card is used in a way that differs from the usual pattern;
• detection of leaks caused by illegal actions of privileged users, for example, administrators who steal or delete sensitive data. Machine Learning algorithms correlate several features (volume and type of data, time, protocol, recipient address) to distinguish a planned export of a new database version or software distributions for remote offices from information theft.
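The idea of correlating several features of a data transfer can be illustrated with a minimal sketch. It is a simple statistical baseline rather than a trained ML model, and the history records, feature choice and 3-sigma threshold are all assumptions made for illustration:

```python
from statistics import mean, stdev

# Hypothetical history of one administrator's outbound transfers:
# (volume in MB, hour of day, recipient address)
HISTORY = [
    (120, 2, "10.0.0.5"),
    (130, 2, "10.0.0.5"),
    (110, 3, "10.0.0.5"),
    (125, 2, "10.0.0.7"),
    (115, 3, "10.0.0.7"),
]

def anomaly_score(transfer, history):
    """Count how many features of a transfer deviate from the user's history."""
    volume, hour, recipient = transfer
    volumes = [v for v, _, _ in history]
    known_hours = {h for _, h, _ in history}
    known_recipients = {r for _, _, r in history}

    score = 0
    # Volume more than 3 standard deviations away from the historical mean.
    if abs(volume - mean(volumes)) > 3 * stdev(volumes):
        score += 1
    # Transfer at an hour never seen before for this user.
    if hour not in known_hours:
        score += 1
    # Recipient address never seen before for this user.
    if recipient not in known_recipients:
        score += 1
    return score

planned_export = (118, 2, "10.0.0.5")    # resembles the usual nightly export
suspicious = (900, 14, "203.0.113.9")    # large, daytime, unknown recipient

print(anomaly_score(planned_export, HISTORY))  # 0
print(anomaly_score(suspicious, HISTORY))      # 3
```

A transfer that deviates on several features at once is a stronger candidate for review than one that deviates on a single feature, which is the point of correlating them.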
In addition, Machine Learning has been used successfully in antivirus software, where it enables automatic detection of new malicious programs based on retrospective analysis of the accumulated virus signature database. Having learned from a large number of samples, an ML model can generalize and detect future threats.
Another useful application of Machine Learning in cybersecurity is automatic monitoring of the behavior of integrated Big Data systems and corporate IT infrastructure. At Home Credit Bank, for example, Machine Learning specialists working in banking services operations help identify abnormal activity of individual components or users in a timely manner.
How Machine Learning protects Big Data and other information
Consider how Machine Learning works in data protection, dividing ML methods into two categories: supervised learning and unsupervised learning.
In supervised learning, the input dataset contains a set of object properties X and the corresponding labels Y. The goal is to build a model that gives correct labels Y′ for previously unseen test objects X′. X may be properties of the content or behavior of a file or request (statistics, a list of API functions used, and so on). The output Y is a class such as “harmless” or “malicious” (a virus, a Trojan downloader, spam advertising, etc.). Supervised learning thus makes it possible to classify new data and flag what is anomalous in it, detecting downloads of previously unknown malicious code, spam and phishing attacks, DGA domains (malicious domains produced by domain generation algorithms), and communications with command-and-control servers and botnets. Classification algorithms (decision trees, random forests, support vector machines) help predict the category of a threat or vulnerability; this is how SQL injection attacks or suspicious traffic can be detected, for example. Regression models are needed to predict specific values, such as when a rise in attacks is most likely. Because supervised learning is also called learning from examples (cases), it can be loosely related to the Case-Based Reasoning (CBR) approach, in which a new task is solved by analogy, drawing conclusions from similar past cases.
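As an illustration of classifying requests from labeled examples, here is a minimal sketch. It uses a k-nearest-neighbors vote, chosen because it mirrors reasoning by analogy with similar cases; it is not one of the algorithms listed above. The feature set, keyword list and training requests are hypothetical and far too small for real use:

```python
import math
from collections import Counter

# Hypothetical keyword list used to build features.
SQL_KEYWORDS = ("select", "union", "or", "drop", "insert")

def features(request):
    """Map a request string to a small numeric feature vector X."""
    text = request.lower()
    tokens = text.replace("'", " ").replace("=", " ").split()
    return (
        text.count("'"),                             # number of single quotes
        sum(tokens.count(k) for k in SQL_KEYWORDS),  # SQL keywords as whole tokens
        text.count("--"),                            # SQL comment markers
    )

# Hypothetical labeled cases: (request X, label Y).
TRAINING = [
    ("name=alice", "harmless"),
    ("id=42", "harmless"),
    ("q=red shoes", "harmless"),
    ("name=o'brien", "harmless"),
    ("q=salt or pepper", "harmless"),
    ("id=1' or '1'='1", "malicious"),
    ("name=x'; drop table users--", "malicious"),
    ("id=1 union select password from users", "malicious"),
]

def classify(request, training=TRAINING, k=3):
    """Label a new request by majority vote among its k most similar cases."""
    x_new = features(request)
    nearest = sorted(
        training,
        key=lambda case: math.dist(features(case[0]), x_new),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify("id=7' or 'a'='a"))  # malicious
print(classify("name=bob"))         # harmless
```

In practice the features would come from much richer statistics of the request or file, and a decision tree, random forest or support vector machine could take the place of the nearest-neighbor vote.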
Unsupervised machine learning methods aim to reveal hidden structure in data, making it possible to detect groups of similar objects or related properties. Clustering algorithms are used for this: they efficiently split large volumes of incoming unknown files or requests into clusters, and a cluster can be processed automatically when it contains an already known object. In this way it is possible to detect, for example, information leaks caused by illegal user actions by analyzing behavior logs and the state of the data. Neural network models are now considered among the most popular methods for unsupervised learning. Many Big Data frameworks designed for processing streaming information (Apache Spark, Flink, Storm) let you apply Machine Learning to user actions in real time by connecting the appropriate libraries.
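The way a cluster can be processed automatically once it contains a known object can be sketched as follows. This is a minimal example using greedy "leader" clustering with a fixed radius, picked for brevity; the file names, two-dimensional feature vectors and radius are assumptions for illustration:

```python
import math

def cluster(objects, radius):
    """Greedy leader clustering: join the first cluster whose leader is within radius."""
    clusters = []  # each cluster is a list of names; the first name is the leader
    for name, vec in objects.items():
        for members in clusters:
            if math.dist(objects[members[0]], vec) <= radius:
                members.append(name)
                break
        else:
            # No existing leader is close enough, so start a new cluster.
            clusters.append([name])
    return clusters

def propagate(clusters, known):
    """Give every member of a cluster the verdict of its known object, if unambiguous."""
    verdicts = {}
    for members in clusters:
        labels = {known[m] for m in members if m in known}
        verdict = labels.pop() if len(labels) == 1 else "needs analysis"
        for m in members:
            verdicts[m] = verdict
    return verdicts

# Hypothetical feature vectors for incoming files.
objects = {
    "file_a": (1.0, 1.2),
    "file_b": (1.1, 1.0),
    "file_c": (5.0, 5.1),
    "file_d": (5.2, 4.9),
    "file_e": (9.0, 0.5),
}
# Objects whose verdict is already known.
known = {"file_a": "harmless", "file_c": "malicious"}

clusters = cluster(objects, radius=1.0)
verdicts = propagate(clusters, known)
print(clusters)  # [['file_a', 'file_b'], ['file_c', 'file_d'], ['file_e']]
print(verdicts["file_d"], "|", verdicts["file_e"])  # malicious | needs analysis
```

Only clusters with no known object, or with conflicting known verdicts, are left for manual analysis.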
Combining different machine learning methods increases the effectiveness of malware detection and attack prevention. This is how behavioral analytics is implemented: for example, the sequence of events that occurs while a process runs is logged and then analyzed. Each event is classified and reduced to a set of binary vectors, and a deep neural network is trained to distinguish dangerous activity from logs of legitimate events.
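The step of reducing logged events to binary vectors might look like the sketch below. It covers only the encoding; training the neural network is out of scope. The event vocabulary and the sample log are hypothetical, and one-hot encoding is just one possible choice:

```python
# Hypothetical vocabulary of event classes observed during process execution.
EVENT_TYPES = ["file_open", "registry_write", "network_connect", "process_spawn"]

def encode_event(event):
    """One-hot encode a single event; unknown events become an all-zero vector."""
    return [1 if event == event_type else 0 for event_type in EVENT_TYPES]

def encode_log(events):
    """Turn a logged sequence of events into a sequence of binary vectors."""
    return [encode_event(event) for event in events]

log = ["file_open", "network_connect", "process_spawn"]
vectors = encode_log(log)
print(vectors)  # [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

Sequences of such vectors, labeled as dangerous or legitimate, would then serve as training input for the network.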