shows the state transition diagram for a sample DFA. A non-deterministic finite automaton (NFA) is defined just like the DFA except that the transition function defines a mapping from $Q \times \Sigma$ to $2^Q$. In general, a finite state automaton (FSA) refers to either a DFA or an NFA.
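
The difference between the two transition functions can be sketched as follows. Both automata are hypothetical illustrations (they are not the sample DFA from the figure): the DFA maps each (state, symbol) pair to exactly one state, while the NFA maps it to a set of states.

```python
# Hypothetical DFA over {a, b}: accepts strings with an even number of 'a's.
# delta maps (state, symbol) -> a single state.
DFA_DELTA = {
    ("even", "a"): "odd", ("even", "b"): "even",
    ("odd", "a"): "even", ("odd", "b"): "odd",
}


def dfa_accepts(word, start="even", finals=frozenset({"even"})):
    state = start
    for symbol in word:
        state = DFA_DELTA[(state, symbol)]
    return state in finals


# Hypothetical NFA over {a, b}: accepts strings whose second-to-last symbol is 'a'.
# delta maps (state, symbol) -> a set of states (an element of 2^Q).
NFA_DELTA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "a"): {"q2"},
    ("q1", "b"): {"q2"},
}


def nfa_accepts(word, start="q0", finals=frozenset({"q2"})):
    current = {start}  # track every state the NFA could be in
    for symbol in word:
        nxt = set()
        for state in current:
            nxt |= NFA_DELTA.get((state, symbol), set())
        current = nxt
    return bool(current & finals)
```

Simulating the NFA by tracking the set of reachable states is the same idea that underlies the subset construction for converting an NFA into an equivalent DFA.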

Source publication

We present a framework for learning DFA from simple examples. We show that efficient PAC learning of DFA is possible if the class of distributions is restricted to simple distributions where a teacher might choose examples based on the knowledge of the target concept. This answers an open research question posed in Pitt's seminal paper: Are DFA's P...

... We converted the formulas into negative normal form following the precedent set by Camacho and McIlraith (2019). Then, we adopted an approach from Camacho and McIlraith (2019) and generated a characteristic sample of traces for each formula's corresponding minimal deterministic finite-state automaton (DFA) (Parekh and Honavar 2001). A set of labeled traces is considered characteristic if the set uniquely defines a minimal DFA over a fixed number of states, N. ...

  • Homer Walke
  • Daniel Ritter
  • Carl Trimbach
  • Michael Littman

Finite linear temporal logic ($\mathsf{LTL}_f$) is a powerful formal representation for modeling temporal sequences. We address the problem of learning a compact $\mathsf{LTL}_f$ formula from labeled traces of system behavior. We propose a novel neural network operator and evaluate the resulting architecture, Neural$\mathsf{LTL}_f$. Our approach includes a specialized recurrent filter, designed to subsume $\mathsf{LTL}_f$ temporal operators, to learn a highly accurate classifier for traces. Then, it discretizes the activations and extracts the truth table represented by the learned weights. This truth table is converted to symbolic form and returned as the learned formula. Experiments on randomly generated $\mathsf{LTL}_f$ formulas show Neural$\mathsf{LTL}_f$ scales to larger formula sizes than existing approaches and maintains high accuracy even in the presence of noise.

... S-PDFA model quality. Quantifying S-PDFA model quality is a difficult problem [5,19]. A common option is to measure its prediction power using Perplexity [2,25]. ...
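
As a rough illustration of the perplexity measure mentioned above, the sketch below uses one common definition, $2^{-\frac{1}{N}\sum_i \log_2 p_i}$, where $p_i$ is the probability the model assigns to the $i$-th observed item. The function name and the input format are assumptions; the cited works may use a different variant.

```python
import math


def perplexity(probabilities):
    """Perplexity of a model given the probabilities it assigned to N observations.

    Computes 2 ** (-(1/N) * sum(log2(p_i))). Lower values mean better prediction.
    """
    if not probabilities:
        raise ValueError("need at least one probability")
    if any(p <= 0.0 or p > 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in (0, 1]")
    avg_log = sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2.0 ** (-avg_log)
```

A model that assigns probability 0.25 to every observation has perplexity 4, i.e. it is as uncertain as a uniform choice among four outcomes.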

Attack graphs (AG) are used to assess pathways availed by cyber adversaries to penetrate a network. State-of-the-art approaches for AG generation focus mostly on deriving dependencies between system vulnerabilities based on network scans and expert knowledge. In real-world operations, however, it is costly and ineffective to rely on constant vulnerability scanning and expert-crafted AGs. We propose to automatically learn AGs based on actions observed through intrusion alerts, without prior expert knowledge. Specifically, we develop an unsupervised sequence learning system, SAGE, that leverages the temporal and probabilistic dependence between alerts in a suffix-based probabilistic deterministic finite automaton (S-PDFA), a model that accentuates infrequent severe alerts and summarizes paths leading to them. AGs are then derived from the S-PDFA. Tested with intrusion alerts collected through the Collegiate Penetration Testing Competition, SAGE produces AGs that reflect the strategies used by participating teams. The resulting AGs are succinct, interpretable, and enable analysts to derive actionable insights, e.g., attackers tend to follow shorter paths after they have discovered a longer one.

... The number of observations can differ between the samples. One possible algorithm for passive automata learning is the RPNI algorithm [16]. For the localization system, samples are generated from several measurements along random trajectories. ...
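
RPNI-style passive learners typically start from a prefix tree acceptor (PTA) built from the positive samples and then merge states while staying consistent with the negative samples. The sketch below covers only the PTA step; the function names are hypothetical and the state-merging phase is omitted.

```python
def build_pta(positive_samples):
    """Build a prefix tree acceptor: one state per distinct prefix, root is state 0."""
    transitions = {}   # (state, symbol) -> state
    accepting = set()
    next_state = 1
    for word in positive_samples:
        state = 0
        for symbol in word:
            key = (state, symbol)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)  # the state reached by a positive sample accepts
    return transitions, accepting


def pta_accepts(transitions, accepting, word):
    state = 0
    for symbol in word:
        key = (state, symbol)
        if key not in transitions:
            return False  # missing transition: the word is rejected
        state = transitions[key]
    return state in accepting
```

The PTA accepts exactly the positive samples; generalization comes only from the later merging of states.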

Cyber Physical Systems (CPSs) are often black box systems for which no exact model exists. Automata learning makes it possible to build abstract models of CPSs and is used in several scenarios, e.g. simulation, monitoring, and test case generation. Real time localization systems (RTLSs) are an example of particularly complex and often safety critical CPSs. We present a procedure for automatic test case generation with automata learning and apply this approach in a case study to a localization system.

... However, unless P = NP, from a set of labelled training samples, no polynomial time algorithm can be guaranteed to produce a DFA with the number of states polynomial in that of the source DFA [17]. On the positive side, the class of DFA whose canonical representations have logarithmic Kolmogorov complexity has been shown to be efficiently PAC-learnable if samples are given from the Solomonoff-Levin universal distribution [16]. ...

  • Chenyi Zhang

We review the grammatical inference problem for regular languages which aims to generate a deterministic finite automaton from a representative set of training sample strings known to be in or not in the language. Although the general problem of producing a minimal DFA consistent with a given sample is known to be NP-hard, it is possible to generate minimal consistent DFA in polynomial time if certain constraints are satisfied by the given samples. In this work we propose a new algorithm which generates minimal DFA if the given training samples satisfy a certain sufficient condition. On the negative side, we also show that this problem is indeed hard, such that even for a more restricted class of training sets, the problem of generating minimal consistent DFA is already intractable.

... $q_0$ = initial state of a machine, F = set of final states of a machine [4]. An example of a DFA for all strings over Σ = {a, b} having suffix 'ab' is shown in Figure 1 and the corresponding transition table is shown in Figure 2. Here, ...
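
Since Figures 1 and 2 are not reproduced here, the following is a sketch of the standard three-state DFA for strings over {a, b} with suffix 'ab'. The state names are assumptions and may differ from those used in the cited figures.

```python
# Transition table as a dict: (state, symbol) -> state.
# q0: no useful progress, q1: last symbol was 'a', q2: last two symbols were 'ab'.
SUFFIX_AB_DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}
START = "q0"
FINALS = {"q2"}


def accepts_suffix_ab(word):
    state = START
    for symbol in word:
        state = SUFFIX_AB_DELTA[(state, symbol)]
    return state in FINALS
```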

... In [11], recurrent neural networks are trained to behave like DFA. A learning-from-examples approach [12] is used for construction, and minimization of DFA is discussed in [13]. Hill climbing with a heuristically guided approach is used for construction, minimization, and to implement a regular set recognizer (RR). ...

In real-world applications such as network software design, pattern recognition, and compiler construction, as well as in applications of formal and natural languages such as acceptors, spell checkers and advisors, and language dictionaries, Deterministic Finite Automata (DFA) play an important role. For such applications, the specification rules used to construct a DFA are complex. More generally, the rules contain intricate connectives such as 'and', 'or', 'not having', and 'followed by'. Constructing a DFA for such rules is a time-consuming and tedious process. To make the construction process easier, a simple DFA construction algorithm is proposed. The proposed algorithm is based on the divide-and-conquer (D&C) algorithm design strategy. To apply D&C, the first step is to divide the given language specification rule into a number of manageable sub-language specification rules; the second is to construct a DFA for each sub-language by any efficient method; and the last is to combine the DFAs using closure properties of regular sets to obtain the resultant DFA. To verify the correctness of the proposed algorithm, the resultant DFA is compared with the DFA constructed by a conventional method.
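
The combining step can be illustrated with the textbook product construction, which realizes closure under intersection ('and') and union ('or'). This is a minimal sketch under that assumption, not the algorithm proposed in the abstract; the `DFA` tuple, `product_dfa`, and the two sub-language automata are hypothetical.

```python
from collections import namedtuple

# delta must be total: (state, symbol) -> state for every state and symbol.
DFA = namedtuple("DFA", ["alphabet", "delta", "start", "finals"])


def product_dfa(d1, d2, combine):
    """Product construction over reachable state pairs.

    combine(in_f1, in_f2) decides acceptance, e.g. 'and' for intersection,
    'or' for union.
    """
    start = (d1.start, d2.start)
    delta, finals = {}, set()
    todo, seen = [start], {start}
    while todo:
        pair = todo.pop()
        p, q = pair
        if combine(p in d1.finals, q in d2.finals):
            finals.add(pair)
        for symbol in d1.alphabet:
            nxt = (d1.delta[(p, symbol)], d2.delta[(q, symbol)])
            delta[(pair, symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return DFA(d1.alphabet, delta, start, frozenset(finals))


def accepts(dfa, word):
    state = dfa.start
    for symbol in word:
        state = dfa.delta[(state, symbol)]
    return state in dfa.finals


AB = frozenset("ab")
# Sub-language 1: strings that start with 'a' ("dead" is a trap state).
starts_with_a = DFA(AB, {
    ("s", "a"): "yes", ("s", "b"): "dead",
    ("yes", "a"): "yes", ("yes", "b"): "yes",
    ("dead", "a"): "dead", ("dead", "b"): "dead",
}, "s", frozenset({"yes"}))
# Sub-language 2: strings that end with 'b'.
ends_with_b = DFA(AB, {
    ("e0", "a"): "e0", ("e0", "b"): "e1",
    ("e1", "a"): "e0", ("e1", "b"): "e1",
}, "e0", frozenset({"e1"}))

# Rule "starts with a AND ends with b", and the corresponding OR rule.
both = product_dfa(starts_with_a, ends_with_b, lambda x, y: x and y)
either = product_dfa(starts_with_a, ends_with_b, lambda x, y: x or y)
```

Rules of the form 'not having' correspond to complementation, which for a total DFA amounts to swapping final and non-final states.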

... Several modifications of RPNI have been studied, e.g. an incremental version [27], versions with faster convergence for some language subfamilies [28], and versions suitable for PAC learning [29]. ...

... This problem is different from ours in that it requires users to annotate the parts of examples to be extracted, where examples are typically a text file composed of tens or hundreds of lines. [2,9,23] and we can easily convert DFAs to regular expressions with standard algorithms. However, there are crucial disadvantages to applying this approach to our system. ...

... Since regular expression minimization is computationally hard, namely PSPACE-complete [21], it introduces another difficult problem to solve. In addition, many methods for learning DFA require training data to meet certain properties [22,23]. For example, the work in [22] requires a structurally complete training set which covers every state transition in the target DFA. ...
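
The coverage requirement described above can be checked mechanically when the target DFA is known. The sketch below follows only the snippet's wording (every state transition is exercised by some sample); the function name and the example DFA are hypothetical, and the formal definition in [22] may impose further conditions.

```python
def uncovered_transitions(delta, start, samples):
    """Return the transitions of a total DFA that no sample string exercises."""
    used = set()
    for word in samples:
        state = start
        for symbol in word:
            used.add((state, symbol))
            state = delta[(state, symbol)]
    return set(delta) - used


# Hypothetical target DFA: even number of 'a's over {a, b}.
EVEN_A_DELTA = {
    ("even", "a"): "odd", ("even", "b"): "even",
    ("odd", "a"): "even", ("odd", "b"): "odd",
}
```

An empty result means the sample set exercises every transition of the target DFA.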

  • Mina Lee
  • Sunbeom So
  • Hakjoo Oh

We present a method for synthesizing regular expressions for introductory automata assignments. Given a set of positive and negative examples, the method automatically synthesizes the simplest possible regular expression that accepts all the positive examples while rejecting all the negative examples. The key novelty is the search-based synthesis algorithm that leverages ideas from over- and under-approximations to effectively prune a large search space. We have implemented our technique in a tool and evaluated it with non-trivial benchmark problems that students often struggle with. The results show that our system can synthesize desired regular expressions in 6.7 seconds on average, so that it can be interactively used by students to enhance their understanding of regular expressions.

... Most work on identification of regular languages focuses on learning automata (Denis, 2001; Parekh and Honavar, 2001; Clark and Thollard, 2004). Since regular languages are accepted by finite automata, the problems of learning regular languages and learning finite automata are tightly coupled. ...

... Existing work to identify regular languages ranges from learning automata that accept a regular language (Denis, 2001; Clark and Thollard, 2004; Parekh and Honavar, 2001) or restricted classes of deterministic finite automata (Angluin, 1980). Other papers address the task of inferring regular expressions in which each symbol occurs at most k times, disjunction-free expressions (Brāzma, 1993), and disjunctions of left-aligned disjunction-free expressions (Fernau, 2009). ...

  • Paul Prasse

Computer security deals with the detection and mitigation of threats to computer networks, data, and computing hardware. This thesis addresses the following two computer security problems: email spam campaign detection and malware detection. Email spam campaigns can easily be generated using popular dissemination tools by specifying simple grammars that serve as message templates. A grammar is disseminated to nodes of a botnet; the nodes create messages by instantiating the grammar at random. Email spam campaigns can encompass huge data volumes and therefore pose a threat to the stability of the infrastructure of email service providers that have to store them. Malware (software that serves a malicious purpose) affects web servers, client computers via active content, and client computers through executable files. Without the help of malware detection systems it would be easy for malware creators to collect sensitive information or to infiltrate computers. The detection of threats, such as email-spam messages, phishing messages, or malware, is an adversarial and therefore intrinsically difficult problem. Threats vary greatly and evolve over time. The detection of threats based on manually designed rules is therefore difficult and requires a constant engineering effort. Machine learning is a research area that revolves around the analysis of data and the discovery of patterns that describe aspects of the data. Discriminative learning methods extract prediction models from data that are optimized to predict a target attribute as accurately as possible. Machine-learning methods hold the promise of automatically identifying patterns that robustly and accurately detect threats. This thesis focuses on the design and analysis of discriminative learning methods for the two computer-security problems under investigation: email-campaign and malware detection. The first part of this thesis addresses email-campaign detection. 
We focus on regular expressions as a syntactic framework, because regular expressions are intuitively comprehensible by security engineers and administrators, and they can be applied as a detection mechanism in an extremely efficient manner. In this setting, a prediction model is provided with exemplary messages from an email-spam campaign. The prediction model has to generate a regular expression that reveals the syntactic pattern that underlies the entire campaign, and that a security engineer finds comprehensible and feels confident enough to use to blacklist further messages at the email server. We model this problem as a two-stage learning problem with structured input and output spaces which can be solved using standard cutting plane methods. Therefore, we develop an appropriate loss function and derive a decoder for the resulting optimization problem. The second part of this thesis deals with the problem of predicting whether a given JavaScript or PHP file is malicious or benign. Recent malware analysis techniques use static or dynamic features, or both. In fully dynamic analysis, the software or script is executed and observed for malicious behavior in a sandbox environment. By contrast, static analysis is based on features that can be extracted directly from the program file. In order to bypass static detection mechanisms, code obfuscation techniques are used to spread a malicious program file in many different syntactic variants. Code that is deobfuscated before a static classifier is applied can be subjected to mostly static code analysis, which can overcome the problem of obfuscated malicious code but increases the computational cost of malware detection by an order of magnitude. In this thesis we present a cascaded architecture in which a classifier first performs a static analysis of the original code and, based on the outcome of this first classification step, the code may be deobfuscated and classified again. 
We explore several types of features including token $n$-grams, orthogonal sparse bigrams, subroutine-hashings, and syntax-tree features and study the robustness of detection methods and feature types against the evolution of malware over time. The developed tool scans very large file collections quickly and accurately. Each model is evaluated on real-world data and compared to reference methods. Our approach of inferring regular expressions to filter emails belonging to an email spam campaign leads to models with a high true-positive rate at a very low false-positive rate that is an order of magnitude lower than that of a commercial content-based filter. Our presented system, REx-SVMshort, is being used by a commercial email service provider and complements content-based and IP-address based filtering. Our cascaded malware detection system is evaluated on a high-quality data set of almost 400,000 conspicuous PHP files and a collection of more than 1,00,000 JavaScript files. From our case study we can conclude that our system can quickly and accurately process large data collections at a low false-positive rate.

... There are many algorithms for learning DFAs, the most well-known being the algorithm due to Dana Angluin [4] [5]. There are many approaches for regular inference [6] [7] [11] [16] [17] [19]. For more information, the book [14] presents an overview on learning automata and grammar inference. ...

  • Catherine Combes
  • Jean Azéma

We investigate the contribution of unsupervised learning and regular grammatical inference to identifying, respectively, profiles of elderly people and their development over time, in order to evaluate care needs (human, financial and physical resources). Grammatical Inference (also known as automata induction, grammar induction and automatic language acquisition) allows grammar and language learning from data. Machine learning using grammars has a variety of applications: pattern recognition, adaptive intelligent agents, diagnosis, biology, systems modelling, prediction, natural language acquisition, data mining, and others. The proposed approach is based on regular grammars. An adaptation of the k-Testable Languages in the Strict Sense inference algorithm is proposed in order to infer a probabilistic automaton, from which a Markovian model with a discrete (finite or countable) state space is deduced. By simulating the corresponding Markov chain model, it is possible to obtain information on population ageing. We verified whether our observed system conforms to a unique long-term state vector, called the stationary distribution or steady state.
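
The stationary distribution mentioned above is a vector $\pi$ with $\pi P = \pi$ for the chain's transition matrix $P$. As a minimal sketch, assuming a finite chain that is irreducible and aperiodic (so that a unique stationary distribution exists and power iteration converges), it can be approximated as follows. The function name and the two-state matrix are hypothetical and unrelated to the study's data.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate pi with pi @ P == pi by power iteration.

    P is a row-stochastic matrix given as a list of lists.
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical two-state chain; its exact stationary distribution is (5/6, 1/6).
EXAMPLE_P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
```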