Top 10 Software Engineer Research Topics for 2024

Software engineering is a dynamic, rapidly changing field that demands a thorough understanding of programming, computer science, and mathematics. As software systems grow more complex, software developers must keep up with industry innovations and the latest trends. Working on software engineering research topics is an important part of staying relevant in the field.

Through research, software engineers learn about new technologies, approaches, and strategies for developing and maintaining complex software systems, and the range of topics they can investigate is wide. Software engineering research is also vital for improving the functionality, security, and dependability of software systems. It advances the field's state of the art and helps ensure that software engineers can continue to build high-quality, effective software systems.

What are Software Engineer Research Topics?

Software engineer research topics are areas of exploration and study in the rapidly evolving field of software engineering. They include software development approaches, software quality, software testing, software maintenance, software security, machine learning models in software engineering, DevOps, and software architecture. Each topic presents distinct problems and opportunities for software engineers to investigate and contribute to the field. In short, these topics let software engineers explore new technologies, approaches, and strategies for developing and managing complex software systems.

For example, research on agile software development could identify the benefits and drawbacks of agile methodology and develop new techniques for implementing agile practices effectively. Software testing research may explore new testing procedures and tools and assess the efficacy of existing ones. Software quality research may investigate the factors that influence software quality and develop approaches for improving it and reducing faults and errors.

Software metrics are quantitative measures used to assess the quality, maintainability, and performance of software. Research in this area could identify novel measures for evaluating software systems or techniques for using metrics to improve software quality.

Continuous integration and deployment (CI/CD) is the practice of integrating code changes into a common repository and pushing them to production in small, frequent batches. Research here could investigate best practices for establishing CI/CD or develop tools and approaches for automating the entire CI/CD process.
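To make the idea of a software metric concrete, the following is a minimal sketch in Python that approximates cyclomatic complexity, a widely used measure of how many independent paths run through a piece of code. The function name `cyclomatic_complexity`, the set of syntax nodes it counts, and the sample function are assumptions chosen for illustration. Real metric tools count more constructs and may report different numbers.

```python
import ast

# Syntax nodes treated as decision points in this simplified count.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 plus the number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra branches, one per extra operand.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical code under measurement.
SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

if __name__ == "__main__":
    print(cyclomatic_complexity(SAMPLE))
```

A metric like this is only a proxy: a high value suggests code that may be harder to test and maintain, which is the kind of relationship metrics research tries to validate empirically.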

Top Software Engineer Research Topics

1. Artificial Intelligence and Software Engineering

Intersections between AI and SE

The creation of AI-powered software engineering tools is one potential research area at the intersection of artificial intelligence (AI) and software engineering. These technologies use AI techniques that include machine learning, natural language processing, and computer vision to help software engineers with a variety of tasks throughout the software development lifecycle. An AI-powered code review tool, for example, may automatically discover potential flaws or security vulnerabilities in code, saving developers a lot of time and lowering the chance of human error. Similarly, an AI-powered testing tool might build test cases and analyze test results automatically to discover areas for improvement. 

Furthermore, AI-powered project management tools may aid in project planning and scheduling, resource allocation, and risk management. AI can also be applied to software maintenance tasks such as automatically discovering and correcting defects or suggesting code refactorings. However, the development of such tools presents significant technical and ethical challenges, such as the need for large amounts of high-quality data, the risk of bias in AI algorithms, and the possibility of AI replacing human jobs. Continued study in this area is therefore required to ensure that AI-powered software engineering tools are effective, fair, and responsible.

Knowledge-based Software Engineering

Another study area that overlaps with AI and software engineering is knowledge-based software engineering (KBSE). KBSE entails creating software systems capable of reasoning about knowledge and applying that knowledge to enhance software development processes. The development of knowledge-based systems that can help software engineers in detecting and addressing complicated problems is one example of KBSE in action. To capture domain-specific knowledge, these systems use knowledge representation techniques such as ontologies, and reasoning algorithms such as logic programming or rule-based systems to derive new knowledge from already existing data. 
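As a rough illustration of how a rule-based system derives new knowledge from existing facts, here is a minimal forward-chaining sketch in Python. The rules and fact names (`touches_payment_code`, `block_merge`, and so on) are hypothetical and not taken from any real KBSE tool. The point is only the mechanism: rules fire repeatedly until no new facts appear.

```python
def forward_chain(facts, rules):
    """Apply rules until a fixed point is reached; return all known facts.

    Each rule is a pair (premises, conclusion): when every premise is
    already known, the conclusion is added to the known facts.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical domain knowledge about code changes.
RULES = [
    (frozenset({"touches_payment_code"}), "needs_security_review"),
    (frozenset({"no_unit_tests", "large_change"}), "high_defect_risk"),
    (frozenset({"high_defect_risk", "needs_security_review"}), "block_merge"),
]

if __name__ == "__main__":
    observed = {"touches_payment_code", "no_unit_tests", "large_change"}
    print(sorted(forward_chain(observed, RULES) - observed))
```

The third rule fires only because the first two have already added their conclusions, which is what "deriving new knowledge" means in this setting.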

In the context of AI and software engineering, KBSE can be used to create intelligent systems that learn from past experience and apply that information to improve future software development processes. A KBSE system, for example, may generate code based on previous code samples or recommend code snippets that fit the requirements of a project. KBSE systems could also improve the precision and efficiency of software testing and debugging by identifying and prioritizing bugs using knowledge-based techniques. Continued research in this area is needed to make such systems accurate and practical.

2. Natural Language Processing

Multimodality

Multimodality in Natural Language Processing (NLP) is one of the appealing research ideas for software engineering at the nexus of computer vision, speech recognition, and NLP. The ability of machines to comprehend and generate language from many modalities, such as text, speech, pictures, and video, is referred to as multimodal NLP. The goal of multimodal NLP is to develop systems that can learn from and interpret human communication across several modalities, allowing them to engage with humans in more organic and intuitive ways. 

The building of conversational agents or chatbots that can understand and create responses using several modalities is one example of multimodal NLP in action. These agents can analyze text input, voice input, and visual clues to provide more precise and relevant responses, allowing users to have a more natural and seamless conversational experience. Furthermore, multimodal NLP can be used to enhance language translation systems, allowing them to more accurately and effectively translate text, speech, and visual content.

The development of multimodal NLP systems must take efficiency into account. Because these systems require significant computing power to process and integrate information from multiple modalities, optimizing their efficiency is critical to ensuring that they can operate in real time and give users accurate and timely responses. One way to improve efficiency is to develop algorithms that can evaluate and integrate input from several modalities with less computation.

Overall, efficiency is a critical factor in the design of multimodal NLP systems. Researchers can increase the speed, precision, and scalability of these systems by devising efficient algorithms, pre-processing approaches, and hardware architectures, allowing them to respond to users in real time.

3. Applications of Data Mining in Software Engineering

Mining Software Engineering Data

Mining software engineering data is a significant research topic that applies data mining techniques to extract insights from the large datasets generated during software development. The purpose is to uncover patterns, trends, and relationships that can inform software development practices, increase software product quality, and improve the efficiency of the development process.

Despite its potential benefits, mining software engineering data faces several obstacles, including data quality, scalability, and data privacy. Continued research is required to develop more effective data mining techniques and tools, as well as methods for ensuring data privacy and security. By tackling these issues, mining software engineering data can continue to improve software development practices and overall product quality.

Clustering and Text Mining

Clustering is a data mining approach that is used to group comparable items or data points based on their features or characteristics. Clustering can be used to detect patterns and correlations between different components of software, such as classes, methods, and modules, in the context of software engineering data. 

Text mining, on the other hand, is a data mining method used to extract valuable information from unstructured text such as software manuals, code comments, and bug reports. Applied to software engineering data, text mining can reveal patterns and trends in software development processes.
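The two techniques above can be combined. The sketch below, in Python, groups bug report titles by the words they share, using Jaccard similarity and a simple greedy grouping pass. The function names, the 0.3 threshold, and the sample reports are assumptions for illustration. Production text mining would normally add stemming, stop-word removal, and a more robust clustering algorithm.

```python
import re


def tokens(text):
    """Lower-case word set of a text (no stemming or stop-word removal)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of words two token sets have in common, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_reports(reports, threshold=0.3):
    """Greedy grouping: a report joins the first cluster that already
    contains a sufficiently similar report, else it starts a new cluster."""
    clusters = []  # each cluster is a list of (report, token_set) pairs
    for report in reports:
        toks = tokens(report)
        for cluster in clusters:
            if any(jaccard(toks, other) >= threshold for _, other in cluster):
                cluster.append((report, toks))
                break
        else:
            clusters.append([(report, toks)])
    return [[text for text, _ in cluster] for cluster in clusters]


# Hypothetical bug report titles.
REPORTS = [
    "Login page crashes on submit",
    "Login page crashes after submit",
    "Export to CSV is slow",
    "CSV export is slow for large files",
]

if __name__ == "__main__":
    for group in cluster_reports(REPORTS):
        print(group)
```

Grouping reports this way is one route to spotting duplicate or related defects, although the result depends on the order of the input and on the chosen threshold.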

4. Data Modeling

Data modeling is an important research topic in software engineering, especially in the context of database design and management. It involves developing a conceptual model of the data that a system will need to store, organize, and manage, and establishing the relationships between data elements. One important goal of data modeling research is to make sure that the database schema precisely matches the requirements of the system and its users. This requires working closely with stakeholders to understand their needs and identify the data items that matter most to them.
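As a small sketch of how a data model captures entities, relationships, and rules, the following Python example defines a two-table schema in an in-memory SQLite database. The `customer` and `purchase` tables and their columns are hypothetical and chosen only for illustration. The relationship is expressed as a foreign key, and a requirement ("totals cannot be negative") is expressed as a constraint.

```python
import sqlite3

# Hypothetical schema: each purchase must belong to an existing customer.
SCHEMA = """
CREATE TABLE customer (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE purchase (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
);
"""


def open_database():
    """Create an in-memory database with foreign key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def try_insert_purchase(conn, customer_id, total_cents):
    """Return True if the row satisfies the model's constraints."""
    try:
        conn.execute(
            "INSERT INTO purchase (customer_id, total_cents) VALUES (?, ?)",
            (customer_id, total_cents),
        )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    db = open_database()
    db.execute("INSERT INTO customer (id, email) VALUES (1, 'a@example.com')")
    print(try_insert_purchase(db, 1, 1999))   # valid purchase
    print(try_insert_purchase(db, 42, 500))   # unknown customer
```

Encoding such rules in the schema means the database itself rejects data that contradicts the model, rather than relying on every application to check.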

5. Verification and Validation

Verification and validation are significant research topics in software engineering because they help ensure that software systems are built correctly and meet the needs of their users. Although the two terms are often used interchangeably, they refer to distinct activities in the software development process. Verification is the process of ensuring that a software system conforms to its specifications and requirements. This involves testing the system to confirm that it behaves as planned and satisfies the functional and performance specifications. In contrast, validation is the process of ensuring that a software system fulfils the needs of its users and stakeholders.

This includes ensuring that the system serves its intended function and meets the requirements of its users. Verification and validation are key components of the software development process in software engineering research. Researchers can help to improve the functionality and dependability of software systems, minimize the chance of faults and mistakes, and ultimately develop better software products for their consumers by verifying that software systems are designed correctly and that they satisfy the needs of their users.

6. Software Project Management

Software project management is an important component of software engineering research because it covers the planning, organization, and control of resources and activities needed to finish software projects on time, within budget, and to the required quality standards. One of its key purposes is to ensure that the needs of the project's stakeholders, such as users, clients, and sponsors, are met. This includes defining the project's requirements, scope, and goals, as well as identifying potential risks and constraints that could affect the project's success.

7. Software Quality

The quality of a software product is defined by how well it conforms to its requirements, how well it performs its intended functions, and how well it meets the needs of its users. It includes attributes such as dependability, usability, maintainability, efficiency, and security. Software quality is a prominent and essential research topic in software engineering. Researchers are working on methodologies, strategies, and tools for evaluating and improving software quality, as well as for predicting and preventing software faults and defects. Overall, software quality research is a large, interdisciplinary field that combines computer science, engineering, and statistics. Its aim is to increase the reliability, accessibility, and overall quality of software products and systems, benefiting both software developers and end users.

8. Ontology

In computer science, an ontology is a formal specification of a conceptualization of a domain, used to enable knowledge sharing and reuse. Ontology is a popular and essential area of study in software engineering research. One research topic is the construction of ontologies for specific domains or application areas. For example, a researcher may create an ontology for e-commerce to give software developers and other stakeholders in that domain a common body of knowledge and terminology. The integration of several ontologies is another intriguing topic. As the number of ontologies created for different domains and applications grows, there is an increasing need to integrate them to enable interoperability and reuse.
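To give a feel for what an ontology encodes, here is a minimal Python sketch of a tiny e-commerce ontology stored as subject-predicate-object triples, with a helper that infers class membership through the subclass hierarchy. The class names, the `thinkpad_x1` instance, and the two predicates are hypothetical. Real ontologies are usually written in dedicated languages such as OWL and processed with dedicated reasoners.

```python
# Hypothetical e-commerce ontology as (subject, predicate, object) triples.
TRIPLES = {
    ("Laptop", "subclass_of", "Electronics"),
    ("Electronics", "subclass_of", "Product"),
    ("Book", "subclass_of", "Product"),
    ("thinkpad_x1", "instance_of", "Laptop"),
}


def superclasses(cls, triples):
    """All classes reachable from cls by following subclass_of links."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == "subclass_of" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def is_instance_of(entity, cls, triples):
    """True if entity is declared in cls or in any subclass of cls."""
    direct = {o for s, p, o in triples if s == entity and p == "instance_of"}
    return any(cls == d or cls in superclasses(d, triples) for d in direct)


if __name__ == "__main__":
    print(is_instance_of("thinkpad_x1", "Product", TRIPLES))
    print(is_instance_of("thinkpad_x1", "Book", TRIPLES))
```

Nothing states directly that `thinkpad_x1` is a `Product`. That fact follows from the shared vocabulary, which is the kind of reuse an ontology is meant to enable.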

9. Software Models

In general, a software model acts as an abstract representation of a software system or its components. Software models can be used to help software developers, different stakeholders, and users communicate more effectively, as well as to properly evaluate, design, test, and maintain software systems. The development and evaluation of modeling languages and notations is one research example connected to software models. Researchers, for example, may evaluate the usefulness and efficiency of various modeling languages, such as UML or BPMN, for various software development activities or domains. 

Researchers could also look into using software models for software testing and verification. They may investigate how models can be used to generate test cases or to perform model checking, a formal technique for verifying that a model of a system satisfies specified properties. They may also examine the use of models for runtime monitoring and adaptation of software systems.
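The idea behind model checking can be sketched in a few lines of Python: enumerate every state a model can reach and confirm that a property holds in each one. The order-handling model below, with its `pay` and `ship` actions and the "nothing ships unpaid" property, is a hypothetical example. Practical model checkers handle far larger state spaces and richer temporal properties.

```python
from collections import deque


def check_invariant(initial, actions, invariant):
    """Breadth-first search over reachable states.

    Returns None if the invariant holds everywhere, otherwise the list of
    action names leading to a violating state (a counterexample).
    """
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, path = queue.popleft()
        if not invariant(state):
            return path
        for name, step in actions.items():
            nxt = step(state)  # None means the action is not enabled
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None


# Hypothetical model: a state is the pair (paid, shipped).
def pay(state):
    paid, shipped = state
    return (True, shipped) if not paid else None


def ship_buggy(state):
    paid, shipped = state
    return (paid, True) if not shipped else None  # forgets to require payment


def ship_fixed(state):
    paid, shipped = state
    return (paid, True) if paid and not shipped else None


def shipped_implies_paid(state):
    paid, shipped = state
    return paid or not shipped


if __name__ == "__main__":
    start = (False, False)
    print(check_invariant(start, {"pay": pay, "ship": ship_buggy}, shipped_implies_paid))
    print(check_invariant(start, {"pay": pay, "ship": ship_fixed}, shipped_implies_paid))
```

When the property fails, the returned action sequence doubles as a test case that reproduces the fault, which links model checking to model-based test generation.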

10. Software Development Life Cycle

The Software Development Life Cycle (SDLC) is a software engineering process for planning, designing, developing, testing, and deploying software systems. SDLC is an important research topic in software engineering because software developers and project managers use it to manage software projects and ensure the quality of the resulting software products. One SDLC-related research topic is the development and evaluation of novel software development processes. SDLC research also includes the creation and evaluation of software project management tools and practices.

Researchers may also examine how SDLC is applied in specific sectors or applications. They may, for example, investigate the use of SDLC in the development of safety-critical systems, such as medical equipment or aviation systems, and develop new processes or tools to ensure the safety and reliability of these systems. They may also look into using SDLC to design software systems in newer areas such as the Internet of Things or blockchain technology.

Why is Software Engineering Required?

Software engineering is necessary because it provides a systematic approach to designing, developing, and maintaining reliable, efficient, and scalable software. As software systems have become more complex over time, software engineering has become a vital discipline for ensuring that software meets end-user needs, is reliable, and remains maintainable in the long term.

1. Cost

Software engineering becomes even more important when the cost of software development is considered. Without a disciplined strategy, developing software can result in inflated costs, delays, and a higher probability of errors that require costly fixes later. Furthermore, software engineering can help reduce long-term maintenance costs by ensuring that software is designed to be easy to maintain and modify. This saves money in the long run by reducing the time and resources needed to change the software.

2. Scalability

Scalability is an essential factor in software development, especially for programs that have to manage enormous amounts of data or an increasing number of users. Software engineering provides a foundation for creating scalable software that can evolve over time. The capacity to deploy software to diverse contexts, such as cloud-based platforms or distributed systems, is another facet of scalability. Software engineering can assist in ensuring that software is built to be readily deployed and adjusted for various environments, resulting in increased flexibility and scalability.

3. Large Software

Using software engineering concepts, developers can break down huge software systems into smaller, simpler parts. This reduces the software's complexity and makes the whole system easier to maintain over time. Furthermore, software engineering supports building large software systems in a modular fashion, with each module performing a specific function or set of functions. This makes it easier to add new features or functionality to the product without disrupting the existing codebase.

4. Dynamic Nature

Developers can use software engineering techniques to create dynamic content that is modular and easy to modify when user requirements change. This makes it easier to add new features or functionality to dynamic content without disturbing the existing codebase. Security is another factor to consider for dynamic content. Software engineering can help ensure that dynamic content is generated in a secure manner that protects user data and information.

5. Better Quality Management

Software engineering provides an organized approach to quality management in software development. By adhering to software engineering principles, developers can ensure that software is designed, built, and maintained in a way that fulfills quality requirements and provides value to users. Requirements management is one component of quality management in software engineering. Testing and validation are another. With an organized approach to testing, developers can verify that their software satisfies its requirements and reduce the number of defects that reach users.

In conclusion, software engineering offers a diverse set of research topics with the potential to advance the discipline while improving software development and maintenance practices. This article has explored research topics suited to master's and other software engineering students, including artificial intelligence, natural language processing, data mining, data modeling, verification and validation, software project management, software quality, ontology, software models, and the software development life cycle. Software engineering researchers have an opportunity to explore these and other subjects and contribute to solutions that improve software quality, dependability, security, and scalability.

By staying up to date with the latest research trends and technologies, researchers can make important contributions to software engineering and help tackle some of the most serious difficulties in software development and maintenance. As software grows more important in business and daily life, demand increases for research into new software engineering processes and techniques. Through their research, software engineering researchers can help shape the future of software creation and maintenance, ensuring that software stays dependable, safe, and efficient in an ever-changing technological context.

Frequently Asked Questions (FAQs)

Ans: To find a research topic in software engineering, you can review recent papers and conference proceedings, talk to different experts in the field, and evaluate your own interests and experience. You can use a combination of these approaches. 

Ans: You should study software development processes, various programming languages and their frameworks, software testing and quality assurance, software architecture, various design patterns that are currently being used, and software project management as a software engineering student. 

Ans: Empirical research, experimental research, surveys, case studies, and literature reviews are all types of research in software engineering. Each sort of study has advantages and disadvantages, and the research method chosen is determined by the research objective, resources, and available data. 

Eshaan Pandey

Eshaan is a Full Stack web developer skilled in MERN stack. He is a quick learner and has the ability to adapt quickly with respect to projects and technologies assigned to him. He has also worked previously on UI/UX web projects and delivered successfully. Eshaan has worked as an SDE Intern at Frazor for a span of 2 months. He has also worked as a Technical Blog Writer at KnowledgeHut upGrad writing articles on various technical topics.

Software Engineering: Recently Published Documents

Identifying Non-Technical Skill Gaps in Software Engineering Education: What Experts Expect But Students Don’t Learn

As the importance of non-technical skills in the software engineering industry increases, the skill sets of graduates match less and less with industry expectations. A growing body of research exists that attempts to identify this skill gap. However, only few so far explicitly compare opinions of the industry with what is currently being taught in academia. By aggregating data from three previous works, we identify the three biggest non-technical skill gaps between industry and academia for the field of software engineering: devoting oneself to continuous learning, being creative by approaching a problem from different angles, and thinking in a solution-oriented way by favoring outcome over ego. Eight follow-up interviews were conducted to further explore how the industry perceives these skill gaps, yielding 26 sub-themes grouped into six bigger themes: stimulating continuous learning, stimulating creativity, creative techniques, addressing the gap in education, skill requirements in industry, and the industry selection process. With this work, we hope to inspire educators to give the necessary attention to the uncovered skills, further mitigating the gap between the industry and the academic world.

Opportunities and Challenges in Code Search Tools

Code search is a core software engineering task. Effective code search tools can help developers substantially improve their software development efficiency and effectiveness. In recent years, many code search studies have leveraged different techniques, such as deep learning and information retrieval approaches, to retrieve expected code from a large-scale codebase. However, there is a lack of a comprehensive comparative summary of existing code search approaches. To understand the research trends in existing code search studies, we systematically reviewed 81 relevant studies. We investigated the publication trends of code search studies, analyzed key components, such as codebase, query, and modeling technique used to build code search tools, and classified existing tools into focusing on supporting seven different search tasks. Based on our findings, we identified a set of outstanding challenges in existing studies and a research roadmap for future code search research.

Psychometrics in Behavioral Software Engineering: A Methodological Introduction with Guidelines

A meaningful and deep understanding of the human aspects of software engineering (SE) requires psychological constructs to be considered. Psychology theory can facilitate the systematic and sound development as well as the adoption of instruments (e.g., psychological tests, questionnaires) to assess these constructs. In particular, to ensure high quality, the psychometric properties of instruments need evaluation. In this article, we provide an introduction to psychometric theory for the evaluation of measurement instruments for SE researchers. We present guidelines that enable using existing instruments and developing new ones adequately. We conducted a comprehensive review of the psychology literature framed by the Standards for Educational and Psychological Testing. We detail activities used when operationalizing new psychological constructs, such as item pooling, item review, pilot testing, item analysis, factor analysis, statistical property of items, reliability, validity, and fairness in testing and test bias. We provide an openly available example of a psychometric evaluation based on our guideline. We hope to encourage a culture change in SE research towards the adoption of established methods from psychology. To improve the quality of behavioral research in SE, studies focusing on introducing, validating, and then using psychometric instruments need to be more common.

Towards an Anatomy of Software Craftsmanship

Context: The concept of software craftsmanship has early roots in computing, and in 2009, the Manifesto for Software Craftsmanship was formulated as a reaction to how the Agile methods were practiced and taught. But software craftsmanship has seldom been studied from a software engineering perspective. Objective: The objective of this article is to systematize an anatomy of software craftsmanship through literature studies and a longitudinal case study. Method: We performed a snowballing literature review based on an initial set of nine papers, resulting in 18 papers and 11 books. We also performed a case study following seven years of software development of a product for the financial market, eliciting qualitative, and quantitative results. We used thematic coding to synthesize the results into categories. Results: The resulting anatomy is centered around four themes, containing 17 principles and 47 hierarchical practices connected to the principles. We present the identified practices based on the experiences gathered from the case study, triangulating with the literature results. Conclusion: We provide our systematically derived anatomy of software craftsmanship with the goal of inspiring more research into the principles and practices of software craftsmanship and how these relate to other principles within software engineering in general.

On the Reproducibility and Replicability of Deep Learning in Software Engineering

Context: Deep learning (DL) techniques have gained significant popularity among software engineering (SE) researchers in recent years. This is because they can often solve many SE challenges without enormous manual feature engineering effort and complex domain knowledge. Objective: Although many DL studies have reported substantial advantages over other state-of-the-art models on effectiveness, they often ignore two factors: (1) reproducibility —whether the reported experimental results can be obtained by other researchers using authors’ artifacts (i.e., source code and datasets) with the same experimental setup; and (2) replicability —whether the reported experimental result can be obtained by other researchers using their re-implemented artifacts with a different experimental setup. We observed that DL studies commonly overlook these two factors and declare them as minor threats or leave them for future work. This is mainly due to high model complexity with many manually set parameters and the time-consuming optimization process, unlike classical supervised machine learning (ML) methods (e.g., random forest). This study aims to investigate the urgency and importance of reproducibility and replicability for DL studies on SE tasks. Method: In this study, we conducted a literature review on 147 DL studies recently published in 20 SE venues and 20 AI (Artificial Intelligence) venues to investigate these issues. We also re-ran four representative DL models in SE to investigate important factors that may strongly affect the reproducibility and replicability of a study. Results: Our statistics show the urgency of investigating these two factors in SE, where only 10.2% of the studies investigate any research question to show that their models can address at least one issue of replicability and/or reproducibility. More than 62.6% of the studies do not even share high-quality source code or complete data to support the reproducibility of their complex models. 
Meanwhile, our experimental results show the importance of reproducibility and replicability, where the reported performance of a DL model could not be reproduced for an unstable optimization process. Replicability could be substantially compromised if the model training is not convergent, or if performance is sensitive to the size of vocabulary and testing data. Conclusion: It is urgent for the SE community to provide a long-lasting link to a high-quality reproduction package, enhance DL-based solution stability and convergence, and avoid performance sensitivity on different sampled data.

Predictive Software Engineering: Transform Custom Software Development into Effective Business Solutions

The paper examines the principles of the Predictive Software Engineering (PSE) framework. The authors examine how PSE enables custom software development companies to offer transparent services and products while staying within the intended budget and a guaranteed budget. The paper will cover all 7 principles of PSE: (1) Meaningful Customer Care, (2) Transparent End-to-End Control, (3) Proven Productivity, (4) Efficient Distributed Teams, (5) Disciplined Agile Delivery Process, (6) Measurable Quality Management and Technical Debt Reduction, and (7) Sound Human Development.

Software—A New Open Access Journal on Software Engineering

Software (ISSN: 2674-113X) [...]

Improving bioinformatics software quality through incorporation of software engineering practices

Background Bioinformatics software is developed for collecting, analyzing, integrating, and interpreting life science datasets that are often enormous. Bioinformatics engineers often lack the software engineering skills necessary for developing robust, maintainable, reusable software. This study presents review and discussion of the findings and efforts made to improve the quality of bioinformatics software. Methodology A systematic review was conducted of related literature that identifies core software engineering concepts for improving bioinformatics software development: requirements gathering, documentation, testing, and integration. The findings are presented with the aim of illuminating trends within the research that could lead to viable solutions to the struggles faced by bioinformatics engineers when developing scientific software. Results The findings suggest that bioinformatics engineers could significantly benefit from the incorporation of software engineering principles into their development efforts. This leads to suggestion of both cultural changes within bioinformatics research communities as well as adoption of software engineering disciplines into the formal education of bioinformatics engineers. Open management of scientific bioinformatics development projects can result in improved software quality through collaboration amongst both bioinformatics engineers and software engineers. Conclusions While strides have been made both in identification and solution of issues of particular import to bioinformatics software development, there is still room for improvement in terms of shifts in both the formal education of bioinformatics engineers as well as the culture and approaches of managing scientific bioinformatics research and development efforts.

Inter-team communication in large-scale co-located software engineering: a case study

Large-scale software engineering is a collaborative effort where teams need to communicate to develop software products. Managers face the challenge of how to organise work to facilitate necessary communication between teams and individuals. This includes a range of decisions from distributing work over teams located in multiple buildings and sites, through work processes and tools for coordinating work, to softer issues including ensuring well-functioning teams. In this case study, we focus on inter-team communication by considering geographical, cognitive and psychological distances between teams, and factors and strategies that can affect this communication. Data was collected for ten test teams within a large development organisation, in two main phases: (1) measuring cognitive and psychological distance between teams using interactive posters, and (2) five focus group sessions where the obtained distance measurements were discussed. We present ten factors and five strategies, and how these relate to inter-team communication. We see three types of arenas that facilitate inter-team communication, namely physical, virtual and organisational arenas. Our findings can support managers in assessing and improving communication within large development organisations. In addition, the findings can provide insights into factors that may explain the challenges of scaling development organisations, in particular agile organisations that place a large emphasis on direct communication over written documentation.

Aligning Software Engineering and Artificial Intelligence With Transdisciplinary

This study examined AI and SE transdisciplinarity to find ways of aligning the two disciplines and so enable development of an AI-SE transdisciplinary theory. A literature review and analysis method was used. The findings are that AI and SE transdisciplinarity is tacit, with islands within and between the disciplines that can be linked to accelerate their transdisciplinary orientation through codification and through internally developing, and externally borrowing and adapting, transdisciplinary theories. Lack of theory has been identified as the major barrier to maturing the two disciplines as engineering disciplines. Creating an AI and SE transdisciplinary theory would contribute to maturing AI and SE as engineering disciplines. The implications of the study are that transdisciplinary theory can support mode 2 and 3 AI and SE innovations and provide an alternative path for maturing the two disciplines as engineering disciplines. The study's originality is that it is the first of its kind in SE, AI, or their intersection.

Software Engineering

At Google, we pride ourselves on our ability to develop and launch new products and features at a very fast pace. This is made possible in part by our world-class engineers, but our approach to software development enables us to balance speed and quality, and is integral to our success. Our obsession with speed and scale is evident in our developer infrastructure and tools. Developers across the world continually write, build, test and release code in multiple programming languages like C++, Java, Python, JavaScript and others, and the Engineering Tools team, for example, is challenged to keep this development ecosystem running smoothly. Our engineers leverage these tools and infrastructure to produce clean code and keep software development running at an ever-increasing scale. In our publications, we share associated technical challenges and lessons learned along the way.

Software Engineering Institute

Technical papers.

The SEI Digital Library houses thousands of technical papers and other documents, ranging from SEI Technical Reports on groundbreaking research to conference proceedings, survey results, and source code.

Considerations for Evaluating Large Language Models for Cybersecurity Tasks

February 20, 2024 • white paper, by Jeff Gennari, Shing-hon Lau, Samuel J. Perl, Joel Parish (OpenAI), Girish Sastry (OpenAI).

In this paper, researchers from SEI and OpenAI explore the opportunities and risks associated with using Large Language Models (LLMs) for cybersecurity tasks.

Navigating Capability-Based Planning: The Benefits, Challenges, and Implementation Essentials

February 7, 2024 • white paper, by Anandi Hira, William Nichols.

Based on industry and government sources, this paper summarizes the benefits and challenges of implementing Capability-Based Planning (CBP).

Encoding Verification Arguments to Analyze High-Level Design Certification Claims: Experiment Zero (E0)

January 18, 2024 • white paper, by Bjorn Andersson, Mark H. Klein, Dionisio de Niz, Douglas Schmidt (Vanderbilt University), Ronald Koontz (Boeing Company), John Lehoczky (Carnegie Mellon University), George Romanski (Federal Aviation Administration), Jonathan Preston (Lockheed Martin Corporation), Daniel Shapiro (Institute of Defense Analysis), Floyd Fazi (Lockheed Martin Corporation), David Tate (Institute of Defense Analysis), Gordon Putsche (The Boeing Company), Hyoseung Kim (University of California, Riverside).

This paper discusses whether automation of certification arguments can identify problems that occur in real systems.

The Measurement Challenges in Software Assurance and Supply Chain Risk Management

December 22, 2023 • white paper, by Nancy R. Mead, Carol Woody, Scott Hissam.

This paper recommends an approach for developing and evaluating cybersecurity metrics for open source and other software in the supply chain.

Report to the Congressional Defense Committees on National Defense Authorization Act (NDAA) for Fiscal Year 2022 Section 835 Independent Study on Technical Debt in Software-Intensive Systems

December 7, 2023 • technical report, by Ipek Ozkaya, Brigid O'Hearn, Julie B. Cohen, Forrest Shull.

This independent study of technical debt in software-intensive systems was sent to Congress in December 2023 to satisfy the requirements of NDAA Section 835.

Assessing Opportunities for LLMs in Software Engineering and Acquisition

November 1, 2023 • white paper, by Julie B. Cohen, James Ivers, Ipek Ozkaya, Stephany Bellomo, Shen Zhang.

This white paper examines how decision makers, such as technical leads and program managers, can assess the fitness of large language models (LLMs) to address software engineering and acquisition needs.

Acquisition Security Framework (ASF): Managing Systems Cybersecurity Risk (Expanded Set of Practices)

October 2, 2023 • technical note, by Michael S. Bandor, Charles M. Wallen, Carol Woody, Christopher J. Alberts.

This framework of practices helps programs coordinate their management of engineering and supply chain risks across the systems lifecycle.

Simulating Realistic Human Activity Using Large Language Model Directives

October 2, 2023 • technical report, by Sean Huff, Thomas G. Podnar, Dustin D. Updyke.

The authors explore how activities generated from the GHOSTS Framework’s NPC client compare to activities produced by GHOSTS’ default behavior and LLMs.

Why Your Software Cost Estimates Change Over Time and How DevSecOps Data Can Help Reduce Cost Risk

September 29, 2023 • white paper, by Julie B. Cohen.

Early software cost estimates are often off by over 40%; this paper discusses how programs must continually update estimates as more information becomes available.

A Retrospective in Engineering Large Language Models for National Security

By Andrew O. Mellinger, Tyler Brooks, Shannon Gallagher, Bryan Brown, Eric Heim, Hollen Barmer, William Nichols, Nick Winski, Nathan M. VanHoudnos, Jasmine Ratchford, Angelique McDowell, Swati Rallapalli.

This document discusses the findings, recommendations, and lessons learned from engineering a large language model for national security use cases.

U.S. Leadership in Software Engineering & AI Engineering: Critical Needs & Priorities Workshop - Executive Summary

August 25, 2023 • white paper, by Ipek Ozkaya, Douglas Schmidt (Vanderbilt University), Forrest Shull, John E. Robert, Erin Harper, Anita Carleton.

A joint SEI/NITRD workshop will advance U.S. national interests through software and AI engineering and accelerate progress across virtually all scientific domains.

A Holistic View of Architecture Definition, Evolution, and Analysis

August 24, 2023 • technical report, by James Ivers, Sebastián Echeverría, Rick Kazman.

This report focuses on performing architectural decisions and architectural analysis, spanning multiple quality attributes, in a sustainable and ongoing way.

Emerging Technologies: Seven Themes Changing the Future of Software in the DoD

August 24, 2023 • white paper, by Scott Hissam, Shen Zhang, Michael Abad-Santos.

This report summarizes the SEI's Emerging Technologies Study (ETS) and identifies seven emerging technologies to watch in software engineering practices and technology.

Demonstrating the Practical Utility and Limitations of ChatGPT Through Case Studies

August 23, 2023 • white paper, by Matthew Walsh, Clarence Worrell, Alejandro Gomez, Dominic A. Ross.

In this study, SEI researchers conducted four case studies using GPT-3.5 to assess the practical utility of large language models such as ChatGPT.

Software Excellence Through the Agile High Velocity Development℠ Process

July 17, 2023 • technical report, by Barti K. Perini (Ishpi Information Technologies, Inc.), Stephen Shook (Ishpi Information Technologies, Inc.).

The High Velocity Development℠ process earned Ishpi Information Technologies, Inc. the 2023 Watts Humphrey Software Quality Award.

Coding the Future: Recommendations for Defense Software R&D

July 13, 2023 • white paper, by Software Engineering Institute.

This report outlines the key recommendations from the November 2022 workshop "Software as a Modernization Priority."

Engineering of Edge Software Systems: A Report from the November 2022 SEI Workshop on Software Systems at the Edge

June 30, 2023 • white paper, by Ipek Ozkaya, Grace Lewis, Kevin A. Pitstick.

Based on a workshop with thought leaders in the field, this report identifies recommended areas of focus for engineering software systems at the edge.

Software Bill of Materials Framework: Leveraging SBOMs for Risk Reduction

June 14, 2023 • white paper, by Carol Woody, Christopher J. Alberts, Michael S. Bandor, Charles M. Wallen.

This paper presents a Software Bill of Materials (SBOM) Framework as a starting point for expanding the use of SBOMs for managing software and systems risk.

Generative AI: Key Opportunities and Research Challenges

June 9, 2023 • white paper.

This 2023 workshop report identifies DoD use cases for generative AI and discusses meeting challenges and needs such as investing in guardrails and responsible AI amid a race to capability.

Securing UEFI: An Underpinning Technology for Computing

May 30, 2023 • white paper, by Vijay S. Sarvepalli.

This paper highlights the technical efforts to secure the UEFI-based firmware that serves as a foundational piece of modern computing environments.

Using Model-Based Systems Engineering (MBSE) to Assure a DevSecOps Pipeline is Sufficiently Secure

May 23, 2023 • technical report, by Timothy A. Chick, Nataliya Shevchenko, Scott Pavetti.

This report describes how analysts can use a model-based systems engineering (MBSE) approach to detect and mitigate cybersecurity risks to a DevSecOps pipeline.

A Strategy for Component Product Lines: Report 2: Specification Modeling for Components in a Component Product Line

May 17, 2023 • special report, by John McGregor, John J. Hudak, Sholom G. Cohen.

This report introduces the “model chain” concept for specifying a component product line and realizing architecture requirements through the creation–evolution process.

A Strategy for Component Product Lines: Report 3: Component Product Line Governance

May 4, 2023 • special report, by Sholom G. Cohen, Alfred Schenker.

This report provides guidance for the community involved with developing and sustaining product lines of components used by the U.S. government.

Program Managers—The DevSecOps Pipeline Can Provide Actionable Data

April 24, 2023 • white paper, by Julie B. Cohen, Bill Nichols.

This paper describes the Automated Continuous Estimation for a Pipeline of Pipelines research project, which automates data collection to track program progress.

Zero Trust Industry Day 2022: Areas of Future Research

January 25, 2023 • white paper, by Timothy Morrow, Trista Polaski, Matthew Nicolai.

This paper describes the future research discussed at the 2022 Zero Trust Industry Day event.

Industry Best Practices for Zero Trust Architecture

December 13, 2022 • white paper, by Timothy Morrow, Nathaniel Richmond, Matthew Nicolai.

This paper describes best practices identified during the SEI’s Zero Trust Industry Day 2022, and provides ways to help organizations shift to zero trust.

A Strategy for Component Product Lines: Report 1: Scoping, Objectives, and Rationale

December 8, 2022 • special report, by Gabriel Moreno, John J. Hudak, Sholom G. Cohen, Alfred Schenker, John McGregor.

This report establishes a Component Product Line Strategy to address problems in systematically reusing and integrating components built to conform to component specification models.

Acquisition Security Framework (ASF): Managing Systems Cybersecurity Risk

November 11, 2022 • technical note.

This report provides an overview of the Acquisition Security Framework (ASF), a description of the practices developed thus far, and a plan for completing the ASF body of work.

Zero Trust Industry Day Experience Paper

October 31, 2022 • white paper, by Rhonda Brown, Mary Popeck, Timothy Morrow.

This paper describes the results of the 2022 Zero Trust Industry Day event.

Challenge Development Guidelines for Cybersecurity Competitions

October 27, 2022 • technical report, by Dennis M. Allen, Leena Arora, Joseph Vessella, Josh Hammerstein, Matt Kaar, Jarrett Booz.

This paper draws on the SEI’s experience to provide general-purpose guidelines and best practices for developing effective cybersecurity challenges.

Acquisition Security Framework (ASF): An Acquisition and Supplier Perspective on Managing Software-Intensive Systems’ Cybersecurity Risk

October 4, 2022 • white paper, by Carol Woody, Christopher J. Alberts, Charles M. Wallen, Michael S. Bandor.

The Acquisition Security Framework (ASF) contains practices that support programs acquiring/building a secure, resilient software-reliant system to manage risks.

Designing Vultron: A Protocol for Multi-Party Coordinated Vulnerability Disclosure (MPCVD)

September 15, 2022 • special report, by Allen D. Householder.

This report proposes a formal protocol specification for MPCVD to improve the interoperability of both CVD and MPCVD processes.

Common Sense Guide to Mitigating Insider Threats, Seventh Edition

September 7, 2022 • technical report.

The guide describes 22 best practices for mitigating insider threat based on the CERT Division's continued research and analysis of more than 3,000 insider threat cases.

Coordinated Vulnerability Disclosure User Stories

August 25, 2022 • white paper, by Art Manion, Timur D. Snoke, Vijay S. Sarvepalli, Jonathan Spring, Allen D. Householder, Laurie Tyzenhaus, Brad Runyon, Eric Hatleback, Charles G. Yarbrough.

This paper provides user stories to guide the development of a technical protocol and application programming interface for Coordinated Vulnerability Disclosure.

LLVM Intermediate Representation for Code Weakness Identification

July 8, 2022 • white paper, by Shannon Gallagher, William Klieber, David Svoboda.

This paper examines whether the LLVM intermediate representation (IR) can be useful to indicate the presence of software vulnerabilities.

Digital Engineering Effectiveness

May 19, 2022 • white paper, by Alfred Schenker, Bill Nichols, Tyler Smith (Adventium Labs, Inc.).

This paper explores the reluctance of developers of cyber-physical systems to embrace digital engineering (DE), how DE methods should be tailored to achieve their stakeholders' goals, and how to measure …

A Brief Introduction to the Evaluation of Learned Models for Aerial Object Detection

May 2, 2022 • white paper, by Eric Heim.

The SEI AI Division assembled guidance on the design, production, and evaluation of machine-learning models for aerial object detection.

Guidance for Tailoring DoD Request for Proposals (RFPs) to Include Modeling

April 27, 2022 • special report, by Tom Merendino, Robert Wojcik, Julie B. Cohen.

This report provides guidance for government program offices that are including digital engineering/modeling requirements into a request for proposal.

Modeling to Support DoD Acquisition Lifecycle Events (Version 1.4)

April 26, 2022 • white paper, by Robert Wojcik, Tom Merendino, Julie B. Cohen.

This document provides suggestions for producing requirement, system, and software models that will be used to support various DoD system acquisition lifecycle events.

Experiences with Deploying Mothra in Amazon Web Services (AWS)

April 26, 2022 • technical report, by Daniel Ruef, John Stogoski, Brad Powell.

The authors describe development of an at-scale prototype of an on-premises system to test the performance of Mothra in the cloud and provide recommendations for similar deployments.

Extensibility

April 6, 2022 • technical report.

This report summarizes how to systematically analyze a software architecture with respect to a quality attribute requirement for extensibility.

TwinOps: Digital Twins Meets DevOps

March 24, 2022 • technical report, by Joe Yankel, Jerome Hugues, Anton Hristozov, John J. Hudak.

This report describes ModDevOps, an approach that bridges model-based engineering and software engineering using DevOps concepts and code generation from models, and TwinOps, a specific ModDevOps pipeline.

March 16, 2022 • technical report, by Philip Bianco, James Ivers, Sebastián Echeverría, Rick Kazman.

This report summarizes how to systematically analyze a software architecture with respect to a quality attribute requirement for robustness.

An Analysis of How Many Undiscovered Vulnerabilities Remain in Information Systems

March 9, 2022 • white paper, by Jonathan Spring.

This paper examines the paradigm that the number of undiscovered vulnerabilities is manageably small through the lens of mathematical concepts from the theory of computing.

Using XML to Exchange Floating Point Data

February 10, 2022 • white paper, by John Klein.

This paper explains issues of using XML to exchange floating point values, how to address them, and the limits of technology to enforce a correct implementation.

Using Machine Learning to Increase NPC Fidelity

December 1, 2021 • technical report, by Dustin D. Updyke, Geoffrey B. Dobson, John Yarger, Thomas G. Podnar.

The authors describe how they used machine learning (ML) modeling to create decision-making preferences for non-player characters (NPCs).

A Prototype Set of Cloud Adoption Risk Factors

October 27, 2021 • white paper, by Christopher J. Alberts.

Alberts discusses the results of a study to identify a prototype set of risk factors for adopting cloud technologies.

Cloud Security Best Practices Derived from Mission Thread Analysis

September 2, 2021 • technical report, by Timothy Morrow, Donald Faatz, Nathaniel Richmond, Angel Luis Hueca, Vincent LaPiana.

This report presents practices for secure, effective use of cloud computing and risk reduction in transitioning applications and data to the cloud, and considers the needs of limited-resource businesses.

Accenture: An Automation Maturity Journey

July 29, 2021 • technical report, by Rajendra T. Prasad (Accenture).

This paper describes work in the area of automation that netted Accenture the 2020 Watts Humphrey Software Process Achievement Award.

Planning and Design Considerations for Data Centers

July 19, 2021 • technical note, by Lyndsi A. Hughes, David Sweeney, Mark Kasunic.

This report shares important lessons learned from establishing small- to mid-size data centers.

Integrating Zero Trust and DevSecOps

July 5, 2021 • white paper, by Timothy Morrow, Geoff Sanders, Nathaniel Richmond, Carol Woody.

This paper discusses the interdependent strategies of zero trust and DevSecOps in the context of application development.

A State-Based Model for Multi-Party Coordinated Vulnerability Disclosure (MPCVD)

July 1, 2021 • special report, by Allen D. Householder, Jonathan Spring.

This report discusses performance indicators that stakeholders in Coordinated Vulnerability Disclosure (CVD) can use to measure its effectiveness.

Human-Centered AI

June 25, 2021 • white paper, by Jay Palat, Matt Gaston, Carol J. Smith, Tanisha Smith, Frank Redner, Rachel Dzombak, Hollen Barmer.

This white paper discusses Human-Centered AI: systems that are designed to work with, and for, people.

Robust and Secure AI

By Eric Heim, Hollen Barmer, Rachel Dzombak, Tanisha Smith, Nathan M. VanHoudnos, Frank Redner, Matt Gaston, Jay Palat.

This white paper discusses Robust and Secure AI systems: AI systems that reliably operate at expected levels of performance, even when faced with uncertainty and in the presence of danger …

Scalable AI

By Jay Palat, Matt Gaston, Tanisha Smith, Frank Redner, John Wohlbier, Rachel Dzombak, Hollen Barmer.

This white paper discusses Scalable AI: the ability of AI algorithms, data, models, and infrastructure to operate at the size, speed, and complexity required for the mission.

The Sector CSIRT Framework: Developing Sector-Based Incident Response Capabilities

June 8, 2021 • technical report, by Tracy Bills, Sharon Mudd, Justin Novak, Brittany Manley, Angel Luis Hueca, David McIntire.

This framework guides the development and implementation of a sector CSIRT.

Foundation of Cyber Ranges

May 19, 2021 • technical report, by Bill Reed, Dustin D. Updyke, Geoffrey B. Dobson, Thomas G. Podnar.

This report details the design considerations and execution plan for building high-fidelity, realistic virtual cyber ranges that deliver maximum training and exercise value for cyberwarfare participants.

Software Assurance Guidance and Evaluation (SAGE) Tool

May 3, 2021 • white paper, by Robert Schiela, Ebonie McNeil, Luiz Antunes, Hasan Yasar.

The Software Assurance Guidance and Evaluation (SAGE) tool helps an organization assess the security of its systems development and operations practices.

Prioritizing Vulnerability Response: A Stakeholder-Specific Vulnerability Categorization (Version 2.0)

April 30, 2021 • white paper, by Jonathan Spring, Allen D. Householder, Art Manion, Vijay S. Sarvepalli, Eric Hatleback, Laurie Tyzenhaus, Madison Oliver, Charles G. Yarbrough.

This paper presents version 2.0 of a testable Stakeholder-Specific Vulnerability Categorization (SSVC) that takes the form of decision trees and that avoids some problems with the Common Vulnerability Scoring System …

Modeling and Validating Security and Confidentiality in System Architectures

March 19, 2021 • technical report, by Aaron Greenhouse, Lutz Wrage, Jörgen Hansson (University of Skovde).

This report presents an approach for modeling and validating confidentiality using the Bell–LaPadula security model and the Architecture Analysis & Design Language.

Overview of Practices and Processes of the CMMC 1.0 Assessment Guides (CMMC 1.0)

March 3, 2021 • white paper, by Douglas Gardner.

This document is intended to help anyone unfamiliar with cybersecurity standards get started with the Department of Defense (DoD) Cybersecurity Maturity Model Certification (CMMC).

Zero Trust: Risks and Research Opportunities

March 1, 2021 • white paper, by Geoff Sanders, Timothy Morrow.

This paper describes a zero trust vignette and three mission threads that highlight risks and research areas to consider for zero trust environments.

Artificial Intelligence (AI) and Machine Learning (ML) Acquisition and Policy Implications

February 26, 2021 • white paper, by William E. Novak.

This paper reports on a high-level survey of a set of both actual and potential acquisition and policy implications of the use of Artificial Intelligence (AI) and Machine Learning (ML) …

Security Engineering Risk Analysis (SERA) Threat Archetypes

December 16, 2020 • white paper, by Carol Woody, Christopher J. Alberts.

This report examines the concept of threat archetypes and how analysts can use them during scenario development.

Loss Magnitude Estimation in Support of Business Impact Analysis

December 15, 2020 • technical report, by Brett Tucker, Daniel J. Kambic, David Tobar, Andrew P. Moore.

The authors describe a project to develop an estimation method that yields greater confidence in and improved ranges for estimates of potential cyber loss magnitude.

Emerging Technologies 2020: Six Areas of Opportunity

December 14, 2020 • white paper.

This study seeks to understand what the software engineering community perceives to be key emerging technologies. The six technologies described hold great promise and, in some cases, have already attracted …

Maintainability

December 1, 2020 • technical report, by Rick Kazman, John Klein, James Ivers, Philip Bianco.

This report summarizes how to systematically analyze a software architecture with respect to a quality attribute requirement for maintainability.

Advancing Risk Management Capability Using the OCTAVE FORTE Process

November 17, 2020 • technical note, by Brett Tucker.

OCTAVE FORTE is a process model that helps organizations evaluate their security risks and use ERM principles to bridge the gap between executives and practitioners.

Analytic Capabilities for Improved Software Program Management

November 2, 2020 • white paper, by Christopher Miller, David Zubrow.

This white paper describes an update to the SEI Quantifying Uncertainty in Early Lifecycle Cost Estimation approach.

AI Engineering for Defense and National Security: A Report from the October 2019 Community of Interest Workshop

October 29, 2020 • special report.

Based on a workshop with thought leaders in the field, this report identifies recommended areas of focus for AI Engineering for Defense and National Security.

NICE Framework Cybersecurity Evaluator

August 20, 2020 • white paper, by Christopher Herr.

This cybersecurity evaluator is designed to assess members of the cyber workforce within the scope of the NICE Cybersecurity Workforce Framework.

Current Ransomware Threats

August 19, 2020 • white paper, by Marisa Midler, Kyle O'Meara.

This report by Marisa Midler, Kyle O'Meara, and Alexandra Parisi discusses ransomware, including an explanation of its design, distribution, execution, and business model.

An Updated Framework of Defenses Against Ransomware

August 18, 2020 • white paper, by Timur D. Snoke, Timothy J. Shimeall.

This report, loosely structured around the NIST Cybersecurity Framework, seeks to frame an approach for defending against Ransomware-as-a-Service (RaaS) as well as direct ransomware attacks.

Historical Analysis of Exploit Availability Timelines

August 13, 2020 • white paper, by David Warren, Jeff Chrabaszcz (Govini), Trent Novelly, Allen D. Householder, Jonathan Spring.

This paper analyzes when and how known exploits become associated with the vulnerabilities that made them possible.

Architecture Evaluation for Universal Command and Control

August 3, 2020 • white paper, by John Klein, Harry L. Levinson, Reed Little, Jason Popowski, Philip Bianco, Patrick Donohoe.

The SEI developed an analysis method to assess function allocations in existing C2 systems and reason about design choices and tradeoffs during the design of new C2 systems.

A Risk Management Perspective for AI Engineering

June 10, 2020 • white paper.

This paper describes several steps of OCTAVE FORTE in the context of adopting AI technology.

Attack Surface Analysis - Reduce System and Organizational Risk

June 8, 2020 • white paper, by Robert J. Ellison, Carol Woody.

This paper offers system defenders an overview of how threat modeling can provide a systematic way to identify potential threats and prioritize mitigations.

Guide to Implementing DevSecOps for a System of Systems in Highly Regulated Environments

April 8, 2020 • technical report, by Jose A. Morales, Peter Capell, David James Shepard, Richard Turner, Patrick R. Place, Suzanne Miller.

This Technical Report provides guidance to projects interested in implementing DevSecOps (DSO) in defense or other highly regulated environments, including those involving systems of systems.

Integrability

February 7, 2020 • technical report, by Rick Kazman, John Klein, Philip Bianco, James Ivers.

This report summarizes how to systematically analyze a software architecture with respect to a quality attribute requirement for integrability.

Comments on NISTIR 8269 (A Taxonomy and Terminology of Adversarial Machine Learning)

February 4, 2020 • white paper, by Jonathan Spring, April Galyardt, Nathan M. VanHoudnos.

Feedback to the U.S. National Institute of Standards and Technology (NIST) about NIST IR 8269, a draft report detailing the proposed taxonomy and terminology of Adversarial Machine Learning (AML).

Penetration Tests Are The Check Engine Light On Your Security Operations

January 7, 2020 • white paper, by Dan J. Klinedinst, Allen D. Householder.

A penetration test serves as a lagging indicator of a network security operations problem. Organizations should implement and document several security controls before a penetration test can be useful.

Prioritizing Vulnerability Response: A Stakeholder-Specific Vulnerability Categorization

December 4, 2019 • white paper, by Allen D. Householder, Jonathan Spring, Art Manion, Deana Shick, Eric Hatleback.

This paper presents a testable Stakeholder-Specific Vulnerability Categorization (SSVC) that takes the form of decision trees and that avoids some problems with the Common Vulnerability Scoring System (CVSS).

AI Engineering: 11 Foundational Practices

September 12, 2019 • white paper.

This initial set of recommendations can help organizations that are beginning to build, acquire, and integrate artificial intelligence capabilities into business and mission systems.

Machine Learning in Cybersecurity: A Guide

September 5, 2019 • technical report, by Ed Stoner, Joshua Fallon, April Galyardt, Jonathan Spring, Leigh B. Metcalf, Angela Horneman.

This report suggests seven key questions that managers and decision makers should ask about machine learning tools to effectively use those tools to solve cybersecurity problems.

Operational Test & Evaluation (OT&E) Roadmap for Cloud-Based Systems

September 2, 2019 • white paper, by John Klein, Christopher J. Alberts, Carol Woody, Charles M. Wallen.

This paper provides an overview of the preparation and work that the AEC needs to perform to successfully transition the Army to cloud computing.

IEEE Computer Society/Software Engineering Institute Watts S. Humphrey Software Process Achievement Award 2018: U.S. Army Combat Capabilities Development Command Armaments Center, Fire Control Systems and Technology Directorate

August 1, 2019 • technical report, by Victor A. Elias (U.S. Army CCDC Armaments Center, Fire Control Systems and Technology Directorate).

This report presents a systemic approach to software development process improvement and its impact for the U.S. Army Combat Capabilities Development Command Armaments Center, Fire Control Systems and Technology Directorate …

Overview of Risks, Threats, and Vulnerabilities Faced in Moving to the Cloud

July 11, 2019 • technical report, by Kelwyn Pender, Carrie Lee (U.S. Department of Veterans Affairs), Donald Faatz, Timothy Morrow.

This report, updated in October 2020, examines the changes to risks, threats, and vulnerabilities when applications are deployed to cloud services.

Automatically Detecting Technical Debt Discussions

June 24, 2019 • white paper, by robert nord, ipek ozkaya, zachary kurtz, raghvinder sangwan.

This study introduces (1) a dataset of expert labels of technical debt in developer comments and (2) a classifier trained on those labels.

Multi-Method Modeling and Analysis of the Cybersecurity Vulnerability Management Ecosystem

By allen d. householder, andrew p. moore.

This paper presents modeling and analysis of two critical foundational processes of the cybersecurity vulnerability management ecosystem using a combination of system dynamics and agent-based modeling techniques.

SCAIFE API Definition Beta Version 0.0.2 for Developers

June 14, 2019 • white paper, by ebonie mcneil, lori flynn.

This paper provides the SCAIFE API definition for beta version 0.0.2. SCAIFE is an architecture that supports static analysis alert classification and prioritization.

Creating xBD: A Dataset for Assessing Building Damage from Satellite Imagery

May 21, 2019 • white paper.

We present a preliminary report for xBD, a new large-scale dataset for the advancement of change detection and building damage assessment for humanitarian assistance and disaster recovery research.

Integration of Automated Static Analysis Alert Classification and Prioritization with Auditing Tools: Special Focus on SCALe

May 13, 2019 • technical report, by lori flynn, david svoboda, ebonie mcneil, zachary kurtz, derek leung, jiyeon lee (carnegie mellon university).

This report summarizes progress and plans for developing a system to perform automated classification and advanced prioritization of static analysis alerts.

Cybersecurity Career Paths and Progression

May 7, 2019 • white paper, by nicholas giruzzi, marie baker, dennis m. allen, melissa burns.

This paper explores the current state of cybersecurity careers, from the importance of early exposure, to methods of entry into the field, to career progression.

Cybersecurity Talent Identification and Assessment

By dennis m. allen, marie baker, christopher herr.

To help fill cybersecurity roles, this paper explores how organizations identify talent, discusses assessment capabilities, and provides recommendations on recruitment and talent evaluations.

Cybersecurity Careers of the Future

By dennis m. allen.

Using workforce data analysis, this paper identifies key cybersecurity skills the workforce needs to close the cybersecurity workforce gap.

A Targeted Improvement Plan for Service Continuity

April 8, 2019 • technical note, by philip a. scolieri, jeffrey pinckard, robert a. vrtis, andrew f. hoover, gavin jurecko.

Describes how an organization can leverage the results of a Cyber Resilience Review to create a Targeted Improvement Plan for its service continuity management.

Exploring the Use of Metrics for Software Assurance

March 7, 2019 • technical note, by carol woody, robert j. ellison, charlie ryan.

This report proposes measurements for each Software Assurance Framework (SAF) practice that a program can select to monitor and manage the progress it's making toward software assurance.

Common Sense Guide to Mitigating Insider Threats, Sixth Edition

February 27, 2019 • technical report, by sarah miller, tracy cassidy, michael c. theis, daniel l. costa, william r. claycomb, andrew p. moore, randall f. trzeciak.

The guide presents recommendations for mitigating insider threat based on the CERT Division's continued research and analysis of more than 1,500 insider threat cases.

An Approach for Integrating the Security Engineering Risk Analysis (SERA) Method with Threat Modeling

February 6, 2019 • white paper.

This report examines how cybersecurity data generated by a threat modeling method can be integrated into a mission assurance context using the SERA Method.

Infrastructure as Code: Final Report

January 28, 2019 • white paper, by doug reynolds, john klein.

This project explored the feasibility of infrastructure as code, developed prototype tools, populated a model of the deployment architecture, and automatically generated IaC scripts from the model.

Incident Management Capability Assessment

December 19, 2018 • technical report, by samuel j. perl, mark zajicek, robin ruefle, christopher j. alberts, pennie walters, carly l. huth, audrey j. dorofee, david mcintire.

The capabilities presented in this report provide a benchmark of incident management practices.

Program Manager's Guidebook for Software Assurance

December 14, 2018 • special report, by carol woody, timothy a. chick, kenneth nidiffer.

This guidebook helps program managers address the software assurance responsibilities critical in defending software-intensive systems, including mission threads and cybersecurity.

DoD Developer’s Guidebook for Software Assurance

By bill nichols, tom scanlon.

This guidebook helps software developers for DoD programs understand expectations for software assurance and standards and requirements that affect assurance.

Towards Improving CVSS

December 4, 2018 • white paper, by allen d. householder, jonathan spring, deana shick, art manion, eric hatleback.

This paper outlines challenges with the Common Vulnerability Scoring System (CVSS).

GHOSTS in the Machine: A Framework for Cyber-Warfare Exercise NPC Simulation

December 3, 2018 • technical report, by adam d. cerini, benjamin l. earl, thomas g. podnar, geoffrey b. dobson, luke j. osterritter, dustin d. updyke.

This report outlines how the GHOSTS (General HOSTS) framework helps create realism in cyber-warfare simulations and discusses how it was used in a case study.

Composing Effective Software Security Assurance Workflows

October 18, 2018 • technical report, by bill nichols, jim mchale, aaron volkmann, david sweeney, william snavely.

In an effort to determine how to make secure software development more cost effective, the SEI conducted a research study to empirically measure the effects that security tools—primarily automated static …

FedCLASS: A Case Study of Agile and Lean Practices in the Federal Government

October 5, 2018 • special report, by jeff davenport, tamara marshall-keim, linda parker gates, nanette brown.

This study reports the successes and challenges of using Agile and Lean methods and cloud-based technologies in a government software development environment.

Threat Modeling for Cyber-Physical System-of-Systems: Methods Evaluation

September 25, 2018 • white paper, by nataliya shevchenko, carol woody, brent frye.

This paper compares threat modeling methods for cyber-physical systems and recommends which methods (and combinations of methods) to use.

Software Architecture Publications

September 17, 2018 • white paper.

The SEI compiled this bibliography of publications about software architecture as a resource for information about system architecture throughout its lifecycle.

Practical Precise Taint-flow Static Analysis for Android App Sets

August 27, 2018 • white paper, by william klieber, lori flynn, william snavely, michael zheng.

This paper describes how to detect taint flow in Android app sets with a static analysis method that is fast and uses little disk and memory space.

Threat Modeling: A Summary of Available Methods

August 9, 2018 • white paper, by carol woody, nataliya shevchenko, tom scanlon, timothy a. chick, paige o'riordan.

This paper discusses twelve threat modeling methods from a variety of sources that target different parts of the development process.

Navigating the Insider Threat Tool Landscape: Low-Cost Technical Solutions to Jump-Start an Insider Threat Program

July 3, 2018 • white paper, by michael j. albrethsen, derrick spooner, daniel l. costa, george silowash.

This paper explores low-cost technical solutions that can help organizations prevent, detect, and respond to insider incidents.

Blacklist Ecosystem Analysis: July - December 2017

April 19, 2018 • white paper, by leigh b. metcalf, eric hatleback.

This short report provides a summary of the various analyses of the blacklist ecosystem performed from July 1, 2017, through December 31, 2017.

ROI Analysis of the System Architecture Virtual Integration Initiative

April 12, 2018 • technical report, by jörgen hansson (university of skovde), steve helton (the boeing company), peter h. feiler.

This report presents an analysis of the economic effects of the System Architecture Virtual Integration approach on the development of software-reliant systems for aircraft compared to existing development paradigms.

Implementing DevOps Practices in Highly Regulated Environments

April 2, 2018 • white paper, by jose a. morales, aaron volkmann, hasan yasar.

In this paper, the authors lay out the process with insights on performing a DevOps assessment in a highly regulated environment.

A Mapping of the Health Insurance Portability and Accountability Act (HIPAA) Security Rule to the Cyber Resilience Review (CRR)

March 29, 2018 • technical note, by robert a. vrtis, matthew trevors, greg porter (heinz college at carnegie mellon university).

This technical note describes mapping of HIPAA Security Rule requirements to practice questions found in the CERT Cyber Resilience Review for organizations' use in HIPAA compliance.

A Hybrid Threat Modeling Method

March 27, 2018 • technical note, by krishnamurthy vemuru (university of virginia), ole villadsen (carnegie mellon university), nancy r. mead, forrest shull.

Presents a hybrid method of threat modeling that attempts to meld the desirable features of three methods: Security Cards, Persona non Grata, and STRIDE.

Cyber Mutual Assistance Workshop Report

February 13, 2018 • special report, by katie c. stewart, jonathon monken (pjm interconnection), fernando maymi, phd (army cyber institute), dan bennett, phd (army cyber institute), dan huynh (army cyber institute), blake rhoades (army cyber institute), matt hutchison (army cyber institute), judy esquibel (army cyber institute), bill lawrence (north american electric reliability corporation).

The Army Cyber Institute hosted a Cyber Mutual Assistance Workshop to identify challenges in defining cyber requirements for Regional Mutual Assistance Groups.

Embedded Device Vulnerability Analysis Case Study Using Trommel

December 6, 2017 • white paper, by kyle o'meara, madison oliver.

This document provides security researchers with a repeatable methodology to produce more thorough and actionable results when analyzing embedded devices for vulnerabilities.

2017 Emerging Technology Domains Risk Survey

October 5, 2017 • technical report, by kyle o'meara, dan j. klinedinst, joel land.

This report describes our understanding of future technologies and helps US-CERT identify vulnerabilities, promote security practices, and understand vulnerability risk.

R-EACTR: A Framework for Designing Realistic Cyber Warfare Exercises

September 29, 2017 • technical report, by adam d. cerini, thomas g. podnar, geoffrey b. dobson, luke j. osterritter.

R-EACTR is a design framework for cyber warfare exercises. It ensures that designs of team-based exercises factor realism into all aspects of the participant experience.

Architecture Practices for Complex Contexts

September 26, 2017 • white paper.

This doctoral thesis, completed at Vrije Universiteit Amsterdam, focuses on software architecture practices for systems of systems, including data-intensive systems.

Defining a Progress Metric for CERT-RMM Improvement

September 8, 2017 • technical note, by david tobar, nader mehravari, gregory crabb (united states postal service).

Describes the Cybersecurity Program Progress Metric and how its implementation in a large, diverse U.S. national organization can serve to indicate progress toward improving cybersecurity and resilience capabilities.

Blacklist Ecosystem Analysis: January - June, 2017

August 22, 2017 • white paper.

This short report provides a summary of the various analyses of the blacklist ecosystem performed to date. It also appends the latest additional data to those analyses; the added data …

The CERT Guide to Coordinated Vulnerability Disclosure

August 15, 2017 • special report, by allen d. householder, art manion, christopher king, garret wassermann.

This guide provides an introduction to the key concepts, principles, and roles necessary to establish a successful Coordinated Vulnerability Disclosure process. It also provides insights into how CVD can go …

Systemic Vulnerabilities in Customer-Premises Equipment (CPE) Routers

July 11, 2017 • special report, by joel land.

This report describes a test framework that the CERT/CC developed to identify systemic and other vulnerabilities in CPE routers.

Department of Defense Software Factbook

July 11, 2017 • technical report, by david zubrow, christopher miller, rhonda brown, james mccurley, brad clark, mike zuccher (no affiliation).

In this report, the Software Engineering Institute has analyzed data related to DoD software projects and translated it into information that is frequently sought-after across the DoD.

DidFail: Coverage and Precision Enhancement

July 6, 2017 • technical report, by karan dwivedi (no affiliation), hongli yin (no affiliation), pranav bagree (no affiliation), xiaoxiao tang (no affiliation), william snavely, william klieber, lori flynn.

This report describes recent enhancements to Droid Intent Data Flow Analysis for Information Leakage (DidFail), the CERT static taint analyzer for sets of Android apps.

The Hard Choices Game Explained

June 26, 2017 • white paper, by erin lim, philippe kruchten, robert nord, nanette brown, ipek ozkaya.

The Hard Choices game is a simulation of the software development cycle meant to communicate the concepts of uncertainty, risk, and technical debt.

Federal Virtual Training Environment (FedVTE)

June 5, 2017 • white paper, by april galyardt, dominic a. ross, marie baker.

The Federal Virtual Training Environment (FedVTE) is an online, on-demand training system containing cybersecurity and certification prep courses, at no cost to federal, state, and local government employees.

Blacklist Ecosystem Analysis: July – December 2016

June 1, 2017 • white paper.

This report provides a summary of various analyses of the blacklist ecosystem performed to date. It also appends the latest additional data to those analyses; the added data in this …

Guide to Software Architecture Tools

May 22, 2017 • white paper.

This document discusses tools and methods for analyzing the architecture, establishing requirements, evaluating the architecture, and defining the architecture.

System-of-Systems Software Architecture Evaluation

May 15, 2017 • white paper.

The SoS Architecture Evaluation Method provides an initial identification of SoS architectural risks and quality attribute inconsistencies across the constituent systems.

IEEE Computer Society/Software Engineering Institute Watts S. Humphrey Software Process Achievement Award

SEI-Certified PSP Developer Examination: Sample Questions

This page contains sample questions similar to those found on the PSP Developer examination.

IEEE Computer Society/Software Engineering Institute Watts S. Humphrey Software Process Achievement Award 2016: Raytheon Integrated Defense Systems

April 28, 2017 • technical report, by neal mackertich (raytheon), peter kraus (raytheon), kurt mittelstaedt (raytheon), brian foley (raytheon), dan bardsley (raytheon), kelli grimes (raytheon), mike nolan (raytheon).

The Raytheon Integrated Defense Systems DFSS team has been recognized with the 2016 Watts Humphrey Software Process Achievement Award.

IEEE Computer Society/Software Engineering Institute Watts S. Humphrey Software Process Achievement (SPA) Award 2016: Nationwide

April 13, 2017 • technical report, by will j.m. pohlman (nationwide it).

This report describes the 10-year history of Nationwide's software process improvement journey. Nationwide received the 2016 Watts Humphrey Software Process Achievement Award from the SEI and IEEE.

Prototype Software Assurance Framework (SAF): Introduction and Overview

April 6, 2017 • technical note.

In this report, the authors discuss the Software Assurance Framework (SAF), a collection of cybersecurity practices that programs can apply across the acquisition lifecycle and supply chain.

15 Tips for Preparing and Delivering a Great Presentation at SATURN

March 14, 2017 • white paper.

You submitted a proposal to SATURN, and it got accepted. Congratulations! Here are 15 tips for creating and giving a great presentation at SATURN.

The CISO Academy

February 23, 2017 • white paper, by pamela d. curtis, summer c. fowler, david tobar, david ulicne.

In this paper, the authors describe the project that led to the creation of the U.S. Postal Service's CISO Academy.

Agile Acquisition and Milestone Reviews

February 15, 2017 • white paper.

Acquisition & Management Concerns for Agile Use in Government Series - 4

Management and Contracting Practices for Agile Programs

Acquisition & Management Concerns for Agile Use in Government Series - 3

Estimating in Agile Acquisition

Acquisition & Management Concerns for Agile Use in Government Series - 5

Agile Development and DoD Acquisitions

Acquisition & Management Concerns for Agile Use in Government Series - 1

Agile Culture in the DoD

Acquisition & Management Concerns for Agile Use in Government Series - 2

Adopting Agile in DoD IT Acquisitions

Acquisition & Management Concerns for Agile Use in Government Series - 6

Supply Chain and Commercial-off-the-Shelf (COTS) Assurance

January 24, 2017 • white paper.

The Software Engineering Institute can help your organization apply techniques to reduce software supply chain risk.

COTS-Based Systems

This paper presents a summary of SEI commercial off-the-shelf (COTS) software documents and COTS tools.

Create a CSIRT

January 18, 2017 • white paper.

This white paper discusses the issues and decisions organizations should address when planning, implementing, and building a CSIRT.

Skills Needed When Staffing Your CSIRT

This white paper describes a set of skills that CSIRT staff members should have to provide basic incident-handling services.

CSIRT Frequently Asked Questions (FAQ)

This FAQ addresses CSIRTs, organizations responsible for receiving, reviewing, and responding to computer security incident reports and activity.

CERT-RMM Capability Appraisals

January 17, 2017 • white paper.

The white paper describes CERT-RMM appraisals and the benefits they offer organizations.

A Technical History of the SEI

January 6, 2017 • special report, by larry druffel.

This report chronicles the technical accomplishments of the Software Engineering Institute and its impact on the Department of Defense software community, as well as on the broader software engineering community.

SQUARE Frequently Asked Questions (FAQ)

January 5, 2017 • white paper.

This paper contains information about SQUARE, a process that helps organizations build security into the early stages of the software production lifecycle.

Common Sense Guide to Mitigating Insider Threats, Fifth Edition

December 21, 2016 • technical report, by tracy cassidy, michael j. albrethsen, michael c. theis, daniel l. costa, jason w. clark, andrew p. moore, randall f. trzeciak, matthew l. collins, jeremy r. strozer.

Presents recommendations for mitigating insider threat based on CERT's continued research and analysis of over 1,000 cases.

Architecture-Led Safety Process

By david p. gluch, julien delange, peter h. feiler, john mcgregor.

Architecture-Led Safety Analysis (ALSA) is a safety analysis method that uses early architecture knowledge to supplement traditional safety analysis techniques to identify faults as early as possible.

The Critical Role of Positive Incentives for Reducing Insider Threats

December 15, 2016 • technical report, by palma buttles-valdez, nathan m. vanhoudnos, samuel j. perl, tracy cassidy, andrew p. moore, daniel bauer, jennifer cowley, jeff savinda, allison parshall, matthew l. collins, elizabeth a. monaco, jamie l. moyes, denise m. rousseau (carnegie mellon university).

This report describes how positive incentives complement traditional practices to provide a better balance for organizations' insider threat programs.

Update 2016: Considerations for Using Agile in DoD Acquisition

December 14, 2016 • technical note, by alfred schenker, mary ann lapham, suzanne miller, ray c. williams, charles (bud) hammons, dan ward (dan ward consulting), daniel burton.

This report updates a 2010 technical note, addressing developments in commercial Agile practices as well as the Department of Defense (DoD) acquisition environment.

Scaling Agile Methods for Department of Defense Programs

December 13, 2016 • technical note, by suzanne miller, mary ann lapham, peter capell, eileen wrubel, will hayes.

This report discusses methods for scaling Agile processes to larger software development programs in the Department of Defense.

Low Cost Technical Solutions to Jump Start an Insider Threat Program

December 12, 2016 • technical note.

This technical note explores free and low-cost technical solutions to help organizations prevent, detect, and respond to malicious insiders.

RFP Patterns and Techniques for Successful Agile Contracting

December 2, 2016 • special report, by larri ann rosser (raytheon intelligence information and services), steven martin (space and missile systems center), thomas e. friend (agile on target), greg howard (mitre), michael ryan (btas), john h. norton iii (raytheon integrated defense systems), keith korzec, peter capell, mary ann lapham.

This report discusses request-for-proposal patterns and techniques for successfully contracting a federal Agile project.

Ultra-Large-Scale Systems: Socio-adaptive Systems

December 1, 2016 • white paper, by mark h. klein, gabriel moreno, linda m. northrop, scott hissam, lutz wrage.

Ultra-large-scale systems are interdependent webs of software, people, policies, and economics. In socio-adaptive systems, humans and software interact as peers.

Cyber-Physical Systems

By david kyle, scott hissam, gabriel moreno, jeffrey hansen, john j. hudak, bjorn andersson, mark h. klein, dionisio de niz, sagar chaki.

Cyber-physical systems (CPS) integrate computational algorithms and physical components. SEI promotes the efficient development of high-confidence, distributed CPS.

Pervasive Mobile Computing

By edwin j. morris, grace lewis, james edmondson, william anderson, marc novakouski, jeff boleng, ben w. bradshaw, james root.

Pervasive mobile computing focuses on how soldiers and first responders can use smartphones, tablets, and other mobile/wearable devices at the tactical edge.

Predictability by Construction

By scott hissam, gabriel moreno, linda m. northrop, kurt c. wallnau, sagar chaki.

Predictability by construction (PBC) makes the behavior of a component-based system predictable before implementation, based on known properties of components.

Blacklist Ecosystem Analysis: January – June, 2016

FAA Research Project on System Complexity Effects on Aircraft Safety: Testing the Identified Metrics

November 30, 2016 • white paper, by bill nichols, sarah sheard, michael d. konrad, charles weinstock.

This report describes a test of an algorithm for estimating the complexity of a safety argument.

FAA Research Project on System Complexity Effects on Aircraft Safety: Estimating Complexity of a Safety Argument

By charles weinstock, michael d. konrad, sarah sheard, bill nichols.

This report presents a formula for estimating the complexity of an avionics system and directly connects that complexity to the size of its safety argument.

FAA Research Project on System Complexity Effects on Aircraft Safety: Identifying the Impact of Complexity on Safety

By donald firesmith, sarah sheard, michael d. konrad, charles weinstock.

This report organizes our work on the impact of software complexity on aircraft safety by asking, “How can complexity complicate safety and, thus, certification?”

FAA Research Project on System Complexity Effects on Aircraft Safety: Candidate Complexity Metrics

By sarah sheard, bill nichols.

This special report identifies candidate measures of complexity for systems with embedded software that relate to safety, assurance, or both.

FAA Research Project on System Complexity Effects on Aircraft Safety: Literature Search to Define Complexity for Avionics Systems

By sarah sheard, michael d. konrad.

This special report describes the results of a literature review sampling what is known about complexity for application in the context of safety and assurance.

Seven Proposal-Writing Tips That Make Conference Program Committees Smile

By mike petock, bill pollak.

Writing a great session proposal for a conference is difficult. Here are seven tips for writing a session proposal that will make reviewers go from frown to smile.

Definition and Measurement of Complexity in the Context of Safety Assurance

October 27, 2016 • technical report, by bill nichols, charles weinstock, michael d. konrad, sarah sheard.

This report describes research to define complexity measures for avionics systems to help the FAA identify when systems are too complex to assure their safety.

Establishing Trusted Identities in Disconnected Edge Environments

October 27, 2016 • white paper, by dan j. klinedinst, sebastián echeverría, keegan m. williams.

The goal of this paper is to present a solution for establishing trusted identities in disconnected environments based on secure key generation and exchange in the field.

A Mapping of the Federal Financial Institutions Examination Council (FFIEC) Cybersecurity Assessment Tool (CAT) to the Cyber Resilience Review (CRR)

October 25, 2016 • technical note, by jeffrey pinckard, robert a. vrtis, michael rattigan.

To help financial organizations assess cyber resilience, we map FFIEC Cybersecurity Assessment Tool (CAT) statements to Cyber Resilience Review (CRR) questions.

Managing Third Party Risk in Financial Services Organizations: A Resilience-Based Approach

September 27, 2016 • white paper, by john haller, charles m. wallen.

A resilience-based approach can help financial services organizations to manage cybersecurity risks from outsourcing and comply with federal regulations.

Agile Development in Government: Myths, Monsters, and Fables

September 21, 2016 • white paper, by mary ann lapham, suzanne miller, david j. carney.

This volume is a reflection on attitudes toward Agile software development now current in the government workplace.

Striving for Effective Cyber Workforce Development

September 12, 2016 • white paper, by marie baker.

This paper reviews the issue of cyber awareness, identifies efforts to combat this deficiency, and concludes with strategies for moving forward.

Segment-Fixed Priority Scheduling for Self-Suspending Real-Time Tasks

August 18, 2016 • technical report, by ragunathan (raj) rajkumar, junsung kim, jian-jia chen, wen-hung huang, geoffrey nelissen, bjorn andersson, dionisio de niz.

This report describes schedulability analyses and proposes segment-fixed priority scheduling for self-suspending tasks.

Creating Centralized Reporting for Microsoft Host Protection Technologies: The Enhanced Mitigation Experience Toolkit (EMET)

August 18, 2016 • technical note, by joseph tammariello, craig lewis.

This report describes how to set up a centralized reporting console for the Windows Enhanced Mitigation Experience Toolkit.

The QUELCE Method: Using Change Drivers to Estimate Program Costs

August 17, 2016 • technical note, by sarah sheard.

This technical note introduces Quantifying Uncertainty in Early Lifecycle Cost Estimation (QUELCE), a method for estimating program costs early in development.

Blacklist Ecosystem Analysis: 2016 Update

August 15, 2016 • white paper, by eric hatleback, leigh b. metcalf, jonathan spring.

This white paper, which is the latest in a series of regular updates, builds upon the analysis of blacklists presented in our 2013 and 2014 reports.

Architecture Fault Modeling and Analysis with the Error Model Annex, Version 2

June 22, 2016 • technical report, by peter h. feiler, julien delange, john j. hudak, david p. gluch.

This report describes the Error Model Annex, Version 2 (EMV2), notation for architecture fault modeling, which supports safety, reliability, and security analyses.

A Requirement Specification Language for AADL

By lutz wrage, julien delange, peter h. feiler.

This report describes a textual requirement specification language, called ReqSpec, for the Architecture Analysis & Design Language (AADL) and demonstrates its use.

DMPL: Programming and Verifying Distributed Mixed-Synchrony and Mixed-Critical Software

June 21, 2016 • technical report, by sagar chaki, david kyle.

DMPL is a language for programming distributed real-time, mixed-criticality software. It supports distributed systems in which each node executes a set of periodic real-time threads that are scheduled by priority …

Wireless Emergency Alerts Commercial Mobile Service Provider (CMSP) Cybersecurity Guidelines

June 9, 2016 • special report, by christopher j. alberts, carol woody, audrey j. dorofee.

This report provides members of the Commercial Mobile Service Provider (CMSP) community with practical guidance for better managing cybersecurity risk exposure, based on an SEI study of the CMSP element …

Report Writer and Security Requirements Finder: User and Admin Manuals

June 7, 2016 • special report, by anand sankalp (carnegie mellon university), gupta anurag (carnegie mellon), priyam swati (carnegie mellon university), yaobin wen (carnegie mellon university), walid el baroni (carnegie mellon university), nancy r. mead.

This report presents instructions for using the Malware-driven Overlooked Requirements (MORE) website applications.

Applying the Goal-Question-Indicator-Metric (GQIM) Method to Perform Military Situational Analysis

May 23, 2016 • technical note, by douglas gray.

This report describes how to use the goal-question-indicator-metric method in tandem with the military METT-TC method (mission, enemy, time, terrain, troops available, and civil-military considerations).

An Insider Threat Indicator Ontology

May 10, 2016 • technical report, by matthew l. collins, samuel j. perl, michael j. albrethsen, derrick spooner, daniel l. costa, george silowash.

This report presents an ontology for insider threat indicators, describes how the ontology was developed, and outlines the process by which it was validated.

Using Honeynets and the Diamond Model for ICS Threat Analysis

May 6, 2016 • technical report, by deana shick, kyle o'meara, john kotheimer.

This report presents an approach to analyzing approximately 16 gigabytes of full packet capture data collected from an industrial control system honeynet—a network of seemingly vulnerable machines designed to lure …

2016 State of Cybercrime Survey

May 2, 2016 • white paper.

This paper examines the current state of cybercrime and explores how organizations and individuals respond to cybercrime threats.

April 19, 2016 • white paper.

This report introduces the Quantifying Uncertainty in Early Lifecycle Cost Estimation (QUELCE) method for estimating program costs early in a development lifecycle.

A Unique Approach to Threat Analysis Mapping: A Malware-Centric Methodology

April 19, 2016 • technical report, by kyle o'meara, deana shick.

As they constantly change network infrastructure, adversaries consistently use and update their tools. This report presents a way for researchers to begin threat analysis with those tools rather than with …

On Board Diagnostics: Risks and Vulnerabilities of the Connected Vehicle

April 13, 2016 • white paper, by christopher king, dan j. klinedinst.

This report describes cybersecurity risks and vulnerabilities in modern connected vehicles.

2016 Emerging Technology Domains Risk Survey

April 8, 2016 • technical report, by todd lewellen, dan j. klinedinst, christopher king, garret wassermann.

This 2016 report provides a snapshot of our current understanding of future technologies.

Malware Capability Development Patterns Respond to Defenses: Two Case Studies

March 7, 2016 • white paper, by ed stoner, deana shick, jonathan spring, kyle o'meara.

In this paper, the authors describe their analysis of two case studies to outline the relationship between adversaries and network defenders.

Cyber-Foraging for Improving Survivability of Mobile Systems

February 18, 2016 • technical report, by sebastián echeverría, grace lewis, james root, ben w. bradshaw.

This report presents an architecture and experimental results that demonstrate that cyber-foraging using tactical cloudlets increases the survivability of mobile systems.

CERT-RMM Version 1.2 Release Notes

February 14, 2016 • white paper.

This document contains the release notes for CERT-RMM Version 1.2, released February 2016.

DoD Software Factbook

December 31, 2015 • white paper, by david zubrow, james mccurley, brad clark.

This DoD Factbook is an initial analysis of software engineering data from the perspective of policy and management questions about software projects.

Architecture-Led Safety Analysis of the Joint Multi-Role (JMR) Joint Common Architecture (JCA) Demonstration System

December 31, 2015 • special report, by peter h. feiler.

This report summarizes an architecture-led safety analysis of the aircraft-survivability situation-awareness system for the Joint Multi-Role vertical lift program.

Requirements and Architecture Specification of the Joint Multi-Role (JMR) Joint Common Architecture (JCA) Demonstration System

This report describes a method for capturing information from requirements documents in AADL and the draft Requirement Definition & Analysis Language Annex.

Potential System Integration Issues in the Joint Multi-Role (JMR) Joint Common Architecture (JCA) Demonstration System

By john j. hudak, peter h. feiler.

This report describes a method for capturing information from requirements documents in AADL to identify potential integration problems early in system development.

Extending AADL for Security Design Assurance of Cyber-Physical Systems

December 16, 2015 • technical report, by allen d. householder, rick kazman, john j. hudak, robert j. ellison, carol woody.

This report demonstrates the viability and limitations of using the Architecture Analysis and Design Language (AADL) through an extended example that allows for specifying and analyzing the security properties of …

Cybersecurity Considerations for Vehicles

December 10, 2015 • white paper, by mark sherman, jens palluch (method park).

In this paper, the authors discuss the number of ECUs and the amount of software in modern vehicles and the need for cybersecurity to cover vehicles.

Analytic Approaches to Detect Insider Threats

December 9, 2015 • white paper.

This paper identifies steps that organizations can use to enhance their security posture to detect potential insider threats.

Intelligence Preparation for Operational Resilience (IPOR)

December 7, 2015 • special report.

The author describes Intelligence Preparation for Operational Resilience (IPOR), a framework for preparing intelligence that complements commonly used intelligence frameworks such as Intelligence Preparation of the Battlefield (IPB).

Evaluating and Mitigating the Impact of Complexity in Software Models

December 3, 2015 • technical report, by min-young nam, john j. hudak, julien delange, jim mchale, bill nichols.

This report defines software complexity, metrics for complexity, and the effects of complexity on cost and presents an analysis tool to measure complexity in models.

Cyber + Culture Early Warning Study

November 30, 2015 • special report, by char sample.

This study was designed to profile cyber actors, and to examine the time interval between cyber and kinetic events in order to gain greater insights into nation-state cyber responses to …

Effective Insider Threat Programs: Understanding and Avoiding Potential Pitfalls

October 16, 2015 • white paper, by matthew l. collins, randall f. trzeciak, andrew p. moore, william e. novak, michael c. theis.

In this paper, the authors describe the potential ways an insider threat program (InTP) could go wrong and engage the community to discuss its concerns.

Structuring the Chief Information Security Officer Organization

October 6, 2015 • technical note, by pamela d. curtis, gregory crabb (united states postal service), brendan fitzpatrick, david tobar, nader mehravari, julia h. allen.

The authors describe how they defined a CISO team structure and functions for a national organization using sources such as CISOs, policies, and lessons learned from cybersecurity incidents.

Improving Federal Cybersecurity Governance Through Data-Driven Decision Making and Execution

September 16, 2015 • technical report, by robert w. stoddard, julia h. allen, anne connell, c. aaron cois, douglas gray, michael riley (veris group), brian d. wisniewski, erik ebel (veris group), william gulley (veris group), marie vaughn (veris group).

This technical report focuses on cybersecurity at the indirect, strategic level. It discusses how cybersecurity decision makers at the tactical or implementation level can establish a supportive contextual environment to …

Secure Coding Analysis of an AADL Code Generator's Runtime System

September 12, 2015 • white paper, by david keaton.

This paper describes a secure coding analysis of the PolyORB-HI-C runtime system used by C language code output from the Ocarina AADL code generator.

Contracting for Agile Software Development in the Department of Defense: An Introduction

August 18, 2015 • technical note, by eileen wrubel, jon gross.

This technical note addresses effective contracting for Agile software development and offers a primer on Agile based on a contracting officer's goals.

CND Equities Strategy

July 22, 2015 • white paper, by jonathan spring, ed stoner.

In this paper, the authors discuss strategies for successful computer network defense (CND) based on considering the adversaries' responses.

Comments on Bureau of Industry and Security (BIS) Proposed Rule Regarding Wassenaar Arrangement 2013 Plenary Agreements Implementation for Intrusion and Surveillance Items

By art manion, allen d. householder.

In this paper, CERT researchers comment on the proposed rule, Wassenaar Arrangement 2013 Plenary Agreements Implementation: Intrusion and Surveillance Items.

Enabling Incremental Iterative Development at Scale: Quality Attribute Refinement and Allocation in Practice

June 4, 2015 • technical report, by neil ernst, robert nord, stephany bellomo, ipek ozkaya.

This report describes industry practices used to develop business capabilities and suggests approaches to enable large-scale iterative development, or agile at scale.

State of Practice Report: Essential Technical and Nontechnical Issues Related to Designing SoS Platform Architectures

May 13, 2015 • technical report, by john klein, sholom g. cohen.

This report analyzes the state of the practice in system-of-systems (SoS) development, based on 12 interviews of leading SoS developers in the DoD and industry.

Emerging Technology Domains Risk Survey

April 30, 2015 • technical note, by andrew o. mellinger, christopher king, jonathan chu.

This report provides a snapshot in time of our current understanding of future technologies.

SCALe Analysis of JasPer Codebase

April 1, 2015 • white paper, by david svoboda.

In this paper, David Svoboda provides the findings of a SCALe audit on a codebase.

Model-Driven Engineering: Automatic Code Generation and Beyond

March 25, 2015 • technical note, by harry l. levinson, john klein, jay marchetti.

This report offers guidance on selecting, analyzing, and evaluating model-driven engineering tools for automatic code generation in acquired systems.

Defining a Maturity Scale for Governing Operational Resilience

March 19, 2015 • technical note, by julia h. allen, katie c. stewart, lisa r. young, michelle a. valdez, audrey j. dorofee.

Governing operational resilience requires the appropriate level of sponsorship, a commitment to strategic planning that includes resilience objectives, and proper oversight of operational resilience activities.

SEI SPRUCE Project: Curating Recommended Practices for Software Producibility

March 16, 2015 • white paper, by bill pollak, michael d. konrad, mike petock, tamara marshall-keim, b. craig meyers, gerald w. miller.

This paper describes the Systems and Software Producibility Collaboration Environment (SPRUCE) project and the resulting recommended practices on five software topics.

Improving Quality Using Architecture Fault Analysis with Confidence Arguments

March 10, 2015 • technical report, by peter h. feiler, julien delange, john b. goodenough, charles weinstock, neil ernst, ari z. klein.

The case study shows that by combining an analytical approach with confidence maps, we can present a structured argument that system requirements have been met and problems in the design …

Making DidFail Succeed: Enhancing the CERT Static Taint Analyzer for Android App Sets

March 4, 2015 • technical report, by william snavely, jonathan burket, jonathan lim, wei shen, lori flynn, william klieber.

In this report, the authors describe how the DidFail tool was enhanced to improve its effectiveness.

Eliminative Argumentation: A Basis for Arguing Confidence in System Properties

February 25, 2015 • technical report, by charles weinstock, john b. goodenough, ari z. klein.

This report defines the concept of eliminative argumentation and provides a basis for assessing how much confidence one should have in an assurance case argument.

A Proven Method for Meeting Export Control Objectives in Postal and Shipping Sectors

February 10, 2015 • technical note, by gregory crabb (united states postal service), pamela d. curtis, julia h. allen, nader mehravari.

This report describes how the CERT-RMM enabled the USPIS to implement an innovative approach for achieving complex international mail export control objectives.

Measuring What Matters Workshop Report

February 9, 2015 • technical note, by katie c. stewart, julia h. allen, lisa r. young, michelle a. valdez.

This report describes the inaugural Measuring What Matters Workshop conducted in November 2014, and the team's experiences in planning and executing the workshop and identifying improvements for future offerings.

A Dynamic Model of Sustainment Investment

February 5, 2015 • technical report, by sarah sheard, mike phillips, andrew p. moore, robert ferguson.

This paper describes a dynamic sustainment model that shows how budgeting, allocation of resources, mission performance, and strategic planning are interrelated and how they affect each other over time.

Cybersecurity Assurance

January 15, 2015 • white paper.

This paper describes the SEI research and solutions that help organizations gain justified confidence in their cybersecurity posture.

Blacklist Ecosystem Analysis Update: 2014

January 7, 2015 • white paper, by leigh b. metcalf, jonathan spring.

This white paper compares the contents of 85 different Internet blacklists to discover patterns in shared entries.

Predicting Software Assurance Using Quality and Reliability Measures

December 22, 2014 • technical note, by bill nichols, carol woody, robert j. ellison.

In this report, the authors discuss how a combination of software development and quality techniques can improve software security.

Regional Use of Social Networking Tools

December 17, 2014 • technical report, by kate meeuf.

This paper explores the regional use of social networking services (SNSs) to determine if participation with a subset of SNSs can be applied to identify a user's country of origin.

Domain Parking: Not as Malicious as Expected

December 10, 2014 • white paper, by jonathan spring, leigh b. metcalf.

In this paper, we discuss scalable detection methods for domain names parking on reserved IP address space and then, using this data set, evaluate whether this behavior appears to be …

Pattern-Based Design of Insider Threat Programs

December 9, 2014 • technical note, by robin ruefle, dave mundie, andrew p. moore, david mcintire, matthew l. collins.

In this report, the authors describe a pattern-based approach to designing insider threat programs that could provide a better defense against insider threats.

Introduction to the Security Engineering Risk Analysis (SERA) Framework

December 4, 2014 • technical note, by audrey j. dorofee, christopher j. alberts, carol woody.

This report introduces the SERA Framework, a model-based approach for analyzing complex security risks in software-reliant systems and systems of systems early in the lifecycle.

Using Malware Analysis to Tailor SQUARE for Mobile Platforms

November 18, 2014 • technical note, by nancy r. mead, gregory paul alice.

This technical note explores the development of security requirements for the K-9 Mail application, an open source email client for the Android operating system.

A Method for Aligning Acquisition Strategies and Software Architectures

October 29, 2014 • technical note, by david j. carney, cecilia albert, patrick r. place, lisa brownsword.

This report describes the third year of the SEI's research into aligning acquisition strategies and software architecture.

Agile Methods in Air Force Sustainment: Status and Outlook

October 23, 2014 • technical note, by mary ann lapham, eileen wrubel, stephen beck, michael s. bandor, colleen regan.

This paper examines using Agile techniques in the software sustainment arena—specifically Air Force programs. The intended audience is the staff of DoD programs and related personnel who intend to use …

Development of an Intellectual Property Strategy: Research Notes to Support Department of Defense Programs

October 14, 2014 • special report, by charlene gross.

This report is intended to help program managers understand categories of intellectual property, various intellectual property challenges, and approaches to assessing the license rights that the program needs for long-term …

AADL Fault Modeling and Analysis Within an ARP4761 Safety Assessment

October 10, 2014 • technical report, by david p. gluch, peter h. feiler, julien delange, john j. hudak.

This report describes how the Architecture Analysis and Design Language (AADL) Error Model Annex supports the safety-assessment methods in SAE Standard ARP4761.

CERT Resilience Management Model—Mail-Specific Process Areas: International Mail Transportation (Version 1.0)

September 18, 2014 • technical note, by pamela d. curtis, gregory crabb (united states postal service), sam lin, dawn wilkes, nader mehravari, julia h. allen.

This report describes a new process area that ensures that international mail is transported according to Universal Postal Union standards.

CERT Resilience Management Model—Mail-Specific Process Areas: Mail Revenue Assurance (Version 1.0)

By julia h. allen, nader mehravari, david w. white, gregory crabb (united states postal service), pamela d. curtis.

This report describes a new process area that ensures that the USPS is compensated for mail that is accepted, transported, and delivered.

CERT Resilience Management Model—Mail-Specific Process Areas: Mail Induction (Version 1.0)

By pamela d. curtis, gregory crabb (united states postal service), david w. white, nader mehravari, julia h. allen.

This report describes a new process area that ensures that mail is inducted into the U.S. domestic mail stream according to USPS standards and requirements.

Smart Collection and Storage Method for Network Traffic Data

September 15, 2014 • technical report, by angela horneman, nathan dell.

This report discusses considerations and decisions to be made when designing a tiered network data storage solution.

A Systematic Approach for Assessing Workforce Readiness

August 18, 2014 • technical report, by david mcintire, christopher j. alberts.

In this report, the authors present the Competency Lifecycle Roadmap and the readiness test development method, both used to maintain workforce readiness.

Assuring Software Reliability

August 15, 2014 • special report, by robert j. ellison.

This report describes ways to incorporate the analysis of the potential impact of software failures, regardless of their cause, into development and acquisition practices through the use of software assurance.

Patterns and Practices for Future Architectures

August 15, 2014 • technical note, by eric werner, scott mcmillan, jonathan chu.

This report discusses best practices and patterns that will make high-performance graph analytics on new and emerging architectures more accessible to users.

Abuse of Customer Premise Equipment and Recommended Actions

August 7, 2014 • white paper, by jonathan spring, paul vixie, chris hallenbeck.

In this paper, the authors provide recommendations for addressing problems related to poor management of Customer Premise Equipment (CPE).

Performance of Compiler-Assisted Memory Safety Checking

July 31, 2014 • technical note, by david keaton, robert c. seacord.

This technical note describes the criteria for deploying a compiler-based memory safety checking tool and the performance that can be achieved with two such tools whose source code is freely …

Unintentional Insider Threats: A Review of Phishing and Malware Incidents by Economic Sector

July 18, 2014 • technical note, by cert insider threat team.

This report analyzes unintentional insider threat cases of phishing and other social engineering attacks involving malware.

Evaluation of the Applicability of HTML5 for Mobile Applications in Resource-Constrained Edge Environments

July 2, 2014 • technical note, by grace lewis, bryan yan (carnegie mellon university – institute for software research).

This technical note presents an analysis of the feasibility of using HTML5 for developing mobile applications, for "edge" environments where resources and connectivity are uncertain, such as in battlefield or …

Agile Software Teams: How They Engage with Systems Engineering on DoD Acquisition Programs

July 1, 2014 • technical note, by mary ann lapham, suzanne miller, timothy a. chick, eileen wrubel.

This technical note addresses issues with Agile software teams engaging systems engineering functions in developing and acquiring software-reliant systems.

Improving the Automated Detection and Analysis of Secure Coding Violations

June 27, 2014 • technical note, by daniel plakosh, robert c. seacord, robert w. stoddard, david svoboda, david zubrow.

This technical note describes the accuracy analysis of the Source Code Analysis Laboratory (SCALe) tools and the characteristics of flagged coding violations.

CERT® Resilience Management Model (CERT®-RMM) V1.1: NIST Special Publication Crosswalk Version 2

June 11, 2014 • technical note, by lisa r. young, kevin g. partridge, mary popeck.

This update to Version 1 of the report of the same title (CMU/SEI-2011-TN-028) maps CERT-RMM process areas to certain NIST 800-series special publications.

The Business Case for Systems Engineering: Comparison of Defense Domain and Non-defense Projects

June 10, 2014 • special report, by dennis goldenson, joseph p. elm.

This report analyzes differences in systems-engineering activities for defense and non-defense projects and finds differences in both deployment and effectiveness.

Job Analysis Results for Malicious-Code Reverse Engineers: A Case Study

June 3, 2014 • technical report, by jennifer cowley.

This report describes individual and team factors that enable, encumber, or halt the development of malicious-code reverse engineering expertise.

An Introduction to the Mission Risk Diagnostic for Incident Management Capabilities (MRD-IMC)

May 30, 2014 • technical note, by christopher j. alberts, robin ruefle, mark zajicek, audrey j. dorofee.

The Mission Risk Diagnostic for Incident Management Capabilities revises the Incident Management Mission Diagnostic Method with updated and expanded drivers.

A Taxonomy of Operational Cyber Security Risks Version 2

May 21, 2014 • technical note, by lisa r. young, mary popeck, james j. cebula.

This second version of the 2010 report presents a taxonomy of operational cyber security risks and harmonizes it with other risk and security activities.

An Evaluation of A-SQUARE for COTS Acquisition

May 13, 2014 • technical note, by nancy r. mead, sidhartha mani.

This technical note evaluates the effectiveness of Security Quality Requirements Engineering for Acquisition (A-SQUARE) in a project to select a COTS product for the advanced metering infrastructure of a smart grid.

Investigating Advanced Persistent Threat 1 (APT1)

May 12, 2014 • technical report, by deana shick, angela horneman.

This report analyzes unclassified data sets in an attempt to understand APT1's middle infrastructure.

Precise Static Analysis of Taint Flow for Android Application Sets

May 9, 2014 • white paper, by amar s. bhosale (no affiliation).

This thesis describes a static taint analysis for Android that combines the FlowDroid and Epicc analyses to track inter- and intra-component data flow.

Data-Driven Software Assurance: A Research Study

May 9, 2014 • technical report, by julia l. mullaney, michael f. orlando, erin harper, michael d. konrad, art manion, bill nichols, andrew p. moore.

In 2012, Software Engineering Institute (SEI) researchers began investigating vulnerabilities reported to the SEI's CERT Division. A research project was launched to investigate design-related vulnerabilities and quantify their effects.

ALTernatives to Signatures (ALTS)

April 30, 2014 • white paper, by george jones, john stogoski.

This paper presents the results of a study of non-signature-based approaches to detecting malicious activity in computer network traffic.

Potential Use of Agile Methods in Selected DoD Acquisitions: Requirements Development and Management

April 29, 2014 • technical note, by david j. carney, kenneth nidiffer, suzanne miller.

This report explores issues in defining and managing requirements that practitioners actively adopting Agile methods identified in our interviews.

The Readiness & Fit Analysis: Is Your Organization Ready for Agile?

April 28, 2014 • white paper, by suzanne miller.

This paper summarizes the Readiness & Fit Analysis and describes its extension to support risk identification for organizations that are adopting agile methods.

International Implementation of Best Practices for Mitigating Insider Threat: Analyses for India and Germany

April 16, 2014 • technical report, by randall f. trzeciak, george silowash, lori flynn, michael c. theis, tracy cassidy, palma buttles-valdez, carly l. huth, travis wright (carnegie mellon university, master of science in information security policy and management program).

This report analyzes insider threat mitigation in India and Germany, using the new framework for international cybersecurity analysis described in the paper titled “Best Practices Against Insider Threats in All …

Wireless Emergency Alerts (WEA) Cybersecurity Risk Management Strategy for Alert Originators

March 31, 2014 • special report, by the wea project team.

In this report, the authors describe a cybersecurity risk management (CSRM) strategy that alert originators can use throughout WEA adoption, operations, and sustainment, as well as a set of governance …

Maximizing Trust in the Wireless Emergency Alerts (WEA) Service

February 28, 2014 • special report, by carol woody, robert j. ellison.

This 2014 report presents recommendations for stakeholders of the Wireless Emergency Alerts (WEA) service that resulted from the development of two trust models, focusing on how to increase both alert …

Wireless Emergency Alerts: Trust Model Simulations

February 26, 2014 • special report, by timothy morrow, joseph p. elm, robert w. stoddard.

This report presents four types of simulations run on the public trust model and the alert originator trust model developed for the Wireless Emergency Alerts (WEA) service, focusing on how …

Commercial Mobile Alert Service (CMAS) Alerting Pipeline Taxonomy

February 24, 2014 • technical report.

This report presents the Commercial Mobile Alert Service (CMAS) Alerting Pipeline Taxonomy, a hierarchical classification that encompasses four elements of the alerting pipeline, to help stakeholders understand and reason about …

Best Practices in Wireless Emergency Alerts

February 19, 2014 • special report, by elizabeth trocki stark (sra international, inc.), jennifer lavan (sra international, inc.), robert j. ellison, john mcgregor, tamara marshall-keim, rita c. creel, carol woody, christopher j. alberts, joseph p. elm.

This report presents four best practices for the Wireless Emergency Alerts (WEA) service, including implementing WEA in a local jurisdiction, training emergency staff in using WEA, cross-jurisdictional governance of WEA, …

Study of Integration Strategy Considerations for Wireless Emergency Alerts

This report identifies key challenges and offers recommendations for alert originators navigating the process of adopting and integrating the Wireless Emergency Alerts (WEA) service into their emergency management systems.

Results in Relating Quality Attributes to Acquisition Strategies

February 4, 2014 • technical note, by lisa brownsword, cecilia albert, patrick r. place, david j. carney.

This technical note describes the second phase of a study that focuses on the relationships between software architecture and acquisition strategy, and more specifically on their alignment or misalignment.

Agile Metrics: Progress Monitoring of Agile Contractors

January 27, 2014 • technical note, by timothy a. chick, eileen wrubel, will hayes, mary ann lapham, suzanne miller.

This technical note offers a reference for those working to oversee software development on the acquisition of major systems from developers using Agile methods.

Agile Methods and Request for Change (RFC): Observations from DoD Acquisition Programs

January 24, 2014 • technical note, by mary ann lapham, eileen wrubel, michael s. bandor.

This technical note looks at the evaluation and negotiation of technical proposals that reflect iterative development approaches that in turn leverage Agile methods.

Unintentional Insider Threats: Social Engineering

January 21, 2014 • technical note, by cert insider threat center.

In this report, the authors explore the unintentional insider threat (UIT) that derives from social engineering.

Improving the Security and Resilience of U.S. Postal Service Mail Products and Services Using the CERT® Resilience Management Model

January 17, 2014 • technical note.

In this report, the authors describe how to improve the resilience of U.S. Postal Service products and services.

A Proven Method for Identifying Security Gaps in International Postal and Transportation Critical Infrastructure

By nader mehravari, julia h. allen, pamela d. curtis, gregory crabb (united states postal service).

In this report, the authors describe a method of identifying physical security gaps in international mail processing centers and similar facilities.

Cloud Service Provider Methods for Managing Insider Threats: Analysis Phase II, Expanded Analysis and Recommendations

January 8, 2014 • technical note, by chas difatta (no affiliation), greg porter (heinz college at carnegie mellon university), lori flynn.

In this report, the authors discuss the countermeasures that cloud service providers use and how they understand the risks posed by insiders.

TSP Symposium 2013 Proceedings

January 8, 2014 • special report, by sergio cardona (universidad del quindío), leticia pérez (universidad de la república), rafael rincón (universidad eafit), joão pascoal faria (university of porto), mushtaq raza (university of porto), pedro c. henriques (strongstep – innovation in software quality), diego vallespir (universidad de la república), fernanda grazioli (universidad de la república), silvana moreno (universidad de la república), bill nichols, jim mchale.

This special report contains proceedings of the 2013 TSP Symposium. The conference theme was “When Software Really Matters,” which explored the idea that when product quality is critical, high-quality practices …

Understanding Patterns for System-of-Systems Integration

December 17, 2013 • technical report, by klaus schmid, claus nielsen (no affiliation), rick kazman.

This report discusses how a software architect can address the system-of-systems integration challenge from an architectural perspective.

Foundations for Software Assurance

December 16, 2013 • white paper, by carol woody, nancy r. mead, dan shoemaker (university of detroit mercy).

In this paper, the authors highlight efforts to address the principles of software assurance and its educational curriculum.

The Topological Properties of the Local Clustering Coefficient

December 9, 2013 • white paper, by leigh b. metcalf.

In this paper, Leigh Metcalf examines the local clustering coefficient and provides a new formula to generate it.

Using Software Development Tools and Practices in Acquisition

December 3, 2013 • technical note, by harry l. levinson, richard librizzi.

This technical note provides an introduction to key automation and analysis techniques.

Spotlight On: Programmers as Malicious Insiders–Updated and Revised

December 2, 2013 • white paper, by andrew p. moore, randall f. trzeciak, dawn cappelli, matthew l. collins, thomas c. caron (john heinz iii college, school of information systems management, carnegie mellon university).

In this paper, the authors describe the who, what, when, where, and how of attacks by insiders using programming techniques and include case examples.

Software Assurance Measurement – State of the Practice

November 29, 2013 • technical note, by dan shoemaker (university of detroit mercy), nancy r. mead.

In this report, the authors describe the current state of the practice and emerging trends in software assurance measurement.

A Defect Prioritization Method Based on the Risk Priority Number

November 26, 2013 • white paper, by will hayes, julie b. cohen, robert ferguson.

This paper describes a technique that helps organizations address and resolve conflicting views and create a better value system for defining releases.

Agile Security - Review of Current Research and Pilot Usage

November 21, 2013 • white paper, by carol woody.

This white paper was produced to focus attention on the opportunities and challenges for embedding information assurance considerations into Agile development and acquisition.

Cloud Service Provider Methods for Managing Insider Threats: Analysis Phase I

November 15, 2013 • technical note, by greg porter (heinz college at carnegie mellon university).

In this report, Greg Porter documents preliminary findings from interviews with cloud service providers on their insider threat controls.

Mobile SCALe: Rules and Analysis for Secure Java and Android Coding

November 8, 2013 • technical report, by david svoboda, dean sutherland, william klieber, lori flynn, limin jia (carnegie mellon university, department of electrical and computer engineering), lujo bauer (carnegie mellon university, department of electrical and computer engineering), fred long.

In this report, the authors describe Android secure coding rules, guidelines, and static analysis developed as part of the Mobile SCALe project.

Advancing Cybersecurity Capability Measurement Using the CERT-RMM Maturity Indicator Level Scale

November 7, 2013 • technical note, by richard a. caralli, matthew j. butkovic.

In this report, the authors review the specific and generic goals and practices in CERT-RMM to determine if a better scale could be developed.

CERT® Resilience Management Model (CERT®-RMM) V1.1: NIST Special Publication 800-66 Crosswalk

October 28, 2013 • technical note, by ma-nyahn kromah (sungard availability services), lisa r. young.

In this report, the authors map CERT-RMM process areas to key activities in NIST Special Publication 800-66 Revision 1.

Passive Detection of Misbehaving Name Servers

October 4, 2013 • technical report.

In this report, the authors explore name-server flux and two types of data that can reveal it.

Insider Threat Control: Using Plagiarism Detection Algorithms to Prevent Data Exfiltration in Near Real Time

October 3, 2013 • technical note, by todd lewellen, daniel l. costa, george silowash.

In this report, the authors describe how an insider threat control can monitor an organization's web request traffic for text-based data exfiltration.

Introduction to the Mission Thread Workshop

October 1, 2013 • technical report, by william wood, michael j. gagliardi, timothy morrow.

This report introduces the Mission Thread Workshop, a method for understanding architectural and engineering considerations for developing and sustaining systems of systems. It describes the three phases of the workshop …

Parallel Worlds: Agile and Waterfall Differences and Similarities

October 1, 2013 • technical note, by ipek ozkaya, suzanne miller, mary ann lapham, timothy a. chick, steve palmquist.

This report helps readers understand Agile. The report assembles terms and concepts from both the traditional world of waterfall-based development and the Agile environment to show the many similarities and …

Everything You Wanted to Know About Blacklists But Were Afraid to Ask

September 30, 2013 • white paper.

This document compares the contents of 25 different common public-internet blacklists in order to discover any patterns in the shared entries.

Roadmap to Software Assurance Competency

September 23, 2013 • white paper.

This white paper describes the Software Assurance (SwA) Core Body of Knowledge and SwA competency levels.

TSP Performance and Capability Evaluation (PACE): Customer Guide

September 1, 2013 • special report, by mark kasunic, bill nichols, timothy a. chick.

This guide describes the evaluation process and lists the steps organizations and programs must complete to earn a TSP-PACE certification.

TSP Performance and Capability Evaluation (PACE): Team Preparedness Guide

By timothy a. chick, bill nichols, mark kasunic.

This document describes the TSP team data that teams normally produce and that are required as input to the TSP-PACE process.

Best Practices Against Insider Threats in All Nations

August 27, 2013 • technical note, by carly l. huth, palma buttles-valdez, lori flynn, randall f. trzeciak.

In this report, the authors summarize best practices for mitigating insider threats in international contexts.

The Role of Computer Security Incident Response Teams in the Software Development Life Cycle

August 20, 2013 • white paper, by robin ruefle.

In this paper, Robin Ruefle describes how incident management can provide input to the software development process.

State of Cyber Workforce Development

August 15, 2013 • white paper.

This paper summarizes the current posture of the cyber workforce and several initiatives designed to strengthen, grow, and retain cybersecurity professionals.

Training and Awareness

August 7, 2013 • white paper, by carol sledge, ken van wyk (no affiliation).

In this paper, the authors provide guidance on training and awareness opportunities in the field of software security.

Evidence of Assurance: Laying the Foundation for a Credible Security Case

By howard f. lipson, charles weinstock.

In this paper, the authors provide examples of several of the kinds of evidence that can contribute to a security case.

Security and Project Management

August 6, 2013 • white paper.

In this paper, Robert Ellison explains the issues project managers should consider that relate to security needs.

An Evaluation of Cost-Benefit Using Security Requirements Prioritization Methods

August 5, 2013 • white paper, by travis christian, nancy r. mead.

In this paper, the authors provide background information on penetration testing processes and practices.

Unintentional Insider Threats: A Foundational Study

August 1, 2013 • technical note.

In this report, the CERT Insider Threat team examines unintentional insider threat (UIT), a largely unrecognized problem.

Teaching Security Requirements Engineering Using SQUARE

July 31, 2013 • white paper, by nancy r. mead, dan shoemaker (university of detroit mercy), jeff ingalsbe (university of detroit mercy).

In this paper, the authors detail the validation of a teaching model for security requirements engineering that ensures that security is built into software.

Trustworthy Composition: The System Is Not Always the Sum of Its Parts

In this paper, Robert Ellison surveys several profound technical problems faced by practitioners assembling and integrating secure and survivable systems.

Development of a Master of Software Assurance Reference Curriculum - 2013 IJSSE

By julia h. allen, nancy r. mead, mark a. ardis (stevens institute of technology), thomas b. hilburn (embry-riddle aeronautical university), andrew j. kornecki (embry-riddle aeronautical university), richard c. linger (oak ridge national laboratory), james mcdonald (monmouth university).

In this paper, the authors present an overview of the Master of Software Assurance curriculum, including its history, student prerequisites, and outcomes.

Strengthening Ties Between Process and Security

In this paper, Carol Woody summarizes recent key accomplishments, including harmonizing security practices with CMMI and using assurance cases.

Estimating Benefits from Investing in Secure Software Development

By ashish arora, rahul telang, steven frank.

In this paper, the authors discuss the costs and benefits of incorporating security in software development and present formulas for calculating security costs and security benefits.

What Measures Do Vendors Use for Software Assurance?

By jeremy epstein.

In this paper, Jeremy Epstein examines what real vendors do to ensure that their products are reasonably secure.

The Development of a Graduate Curriculum for Software Assurance

By nancy r. mead, mark a. ardis (stevens institute of technology).

In this paper, the authors describe the work of the Master of Software Assurance curriculum project, including sources, process, products, and more.

Secure Software Development Life Cycle Processes

By noopur davis.

In this paper, Noopur Davis presents information about processes, standards, and more that support or could support secure software development.

Applicability of Cultural Markers in Computer Network Attack Attribution

July 11, 2013 • white paper.

In this 2013 white paper, Char Sample discusses whether cultural influences leave traces in computer network attack (CNA) choices and behaviors.

Improving Software Assurance

July 5, 2013 • white paper.

In this paper, the authors discuss what practitioners should know about software assurance, where to look, what to look for, and how to demonstrate improvement.

Scale: System Development Challenges

In this paper, the authors describe software assurance challenges inherent in networked systems development and propose a solution.

Requirements Prioritization Case Study Using AHP

By nancy r. mead.

In this paper, Nancy Mead describes a tradeoff analysis that can select a suitable requirements prioritization method and the results of trying one method.

Arguing Security - Creating Security Assurance Cases

By john b. goodenough, charles weinstock, howard f. lipson.

In this paper, the authors explain an approach to documenting an assurance case for system security.

SQUARE Process

In this paper, Nancy Mead describes the SQUARE process as a means for eliciting, categorizing, and prioritizing security requirements for IT systems.

Requirements Elicitation Case Studies Using IBIS, JAD, and ARM

In this paper, Nancy Mead describes a tradeoff analysis that can be used to select a suitable requirements elicitation method.

The Common Criteria

In this paper, Nancy Mead discusses how Common Criteria is evaluated and presents a standard that is related to developing security requirements.

Measures and Measurement for Secure Software Development

July 3, 2013 • white paper, by david zubrow, james mccurley, carol dekkers.

In this paper, the authors discuss how measurement can be applied to improve the security characteristics of the software being developed.

Predictive Models for Identifying Software Components Prone to Failure During Security Attacks

By laurie williams, michael gegick, mladan vouk.

In this paper, the authors describe how the presence of security faults correlates strongly with the presence of a more general category of reliability faults.

Measuring the Software Security Requirements Engineering Process

In this paper, Nancy Mead describes a measurement approach to security requirements engineering to analyze projects that were developed with and without SQUARE.

System-of-Systems Influences on Acquisition Strategy Development

July 2, 2013 • white paper, by rita c. creel, robert j. ellison.

In this paper, the authors discuss significant new sources of risk and recommend ways to address them.

Risk-Centered Practices

By julia h. allen.

In this paper, Julia Allen discusses the role that risk management and risk assessment play in choosing which security practices to implement.

Supply-Chain Risk Management: Incorporating Security into Software Development

In this paper, the authors describe practices that address defects and mechanisms for introducing these practices into the acquisition lifecycle.

Prioritizing IT Controls for Effective, Measurable Security

By daniel phelps, kurt milne, gene kim (ip services and itpi).

In this paper, the authors summarize results from the IT Controls Performance Study conducted by the IT Process Institute.

Building Security into the Business Acquisition Process

By dan shoemaker (university of detroit mercy).

In this paper, Dan Shoemaker presents the standard process for acquiring software products and services in business.

Navigating the Security Practice Landscape

In this paper, Julia Allen presents a summary of ten leading sources of security practice definition and implementation guidance.

Assuring Software Systems Security: Life Cycle Considerations for Government Acquisitions

By rita c. creel.

In this paper, Rita Creel identifies acquirer activities and resources necessary to support contractor efforts to build secure software-intensive systems.

Plan, Do, Check, Act

In this paper, Ken van Wyk provides a primer on the most commonly used tools for traditional penetration testing.

Finding a Vendor You Can Trust in the Global Marketplace

By dan shoemaker (university of detroit mercy), art conklin.

In this paper, the authors introduce the concept of standardized third-party certification of supplier process capability.

Results of SEI Line-Funded Exploratory New Starts Projects: FY 2012

July 1, 2013 • technical report, by robert nord, robert w. stoddard, lisa brownsword, dennis goldenson, mary ann lapham, david zubrow, william r. claycomb, lori flynn, peter h. feiler, rick kazman, robert ferguson, stephany bellomo, ipek ozkaya, sagar chaki, arie gurfinkel, julie b. cohen, jeff havrilla, john j. hudak, bjorn andersson, john mcgregor, james mccurley, carly l. huth, david mcintire, david p. gluch, wesley jin, chuck hines, brittany phillips, yuanfang cai (drexel university).

This report describes line-funded exploratory new starts (LENS) projects that were conducted during fiscal year 2012 (October 2011 through September 2012).

Insider Threat Attributes and Mitigation Strategies

July 1, 2013 • technical note, by george silowash.

In this report, George Silowash maps common attributes of insider threat cases to characteristics important for detecting, preventing, or mitigating the threat.

Pointer Ownership Model

June 10, 2013 • white paper.

In this paper, David Svoboda describes the Pointer Ownership Model, which can statically identify classes of errors involving dynamic memory in C/C++ programs.

Common Software Platforms in System-of-Systems Architectures: The State of the Practice

June 6, 2013 • white paper, by rick kazman, sholom g. cohen, john klein.

System-of-systems (SoS) architectures based on common software platforms have been commercially successful, but progress on creating and adopting them has been slow. This study aimed to understand technical issues for …

Software Assurance for Executives: Mapping of Common Topics to Specific Materials

June 3, 2013 • white paper.

In this paper, the authors present common topics, course materials, and resources related to the Software Assurance for Executives course held in June 2013.

Software Assurance for Executives

This legal form was used in the Software Assurance for Executives course that was held in June 2013.

Isolating Patterns of Failure in Department of Defense Acquisition

June 1, 2013 • technical note, by lisa brownsword, patrick r. place, cecilia albert, john j. hudak, charles (bud) hammons, david j. carney.

This report documents an investigation into issues related to aligning acquisition strategies with business and mission goals.

Socio-Adaptive Systems Challenge Problems Workshop Report

June 1, 2013 • special report, by mark h. klein, timothy morrow, scott hissam.

This report presents a summary of the findings of the Socio-Adaptive Systems Challenge Problem Workshop, held in Pittsburgh, PA, on April 12-13, 2012.

Strengths in Security Solutions

May 31, 2013 • white paper, by carol woody, allen d. householder, robert c. seacord, arjuna shunn (microsoft).

In this white paper, the authors map eight CERT tools, services, and processes to Microsoft's Simplified Security Development Lifecycle.

Integrating Software Assurance Knowledge into Conventional Curricula

May 23, 2013 • white paper.

In this paper, the authors discuss the results of comparing the Common Body of Knowledge for Secure Software Assurance with traditional computing disciplines.

Maturity of Practice

In this paper, Julia Allen identifies indicators that organizations are addressing security as a governance and management concern, at the enterprise level.

Integrating Security and IT

May 21, 2013 • white paper.

In this paper, Julia Allen describes the key relationship between IT processes and security controls.

Individual Certification of Security Proficiency for Software Professionals: Where Are We? Where Are We Going?

In this paper, Dan Shoemaker describes existing professional certifications in information assurance and emerging certifications for secure software assurance.

How Much Security Is Enough?

In this paper, Julia Allen provides guidelines for answering this question, including means for determining adequate security based on risk.

Models for Assessing the Cost and Value of Software Assurance

By john bailey, dan shoemaker (university of detroit mercy), antonio drommi, jeff ingalsbe (university of detroit mercy), nancy r. mead.

In this paper, the authors present IT valuation models that represent the most commonly accepted approaches to the valuation of IT and IT processes.

Adapting Penetration Testing for Software Development Purposes

By ken van wyk (no affiliation).

In this paper, Ken van Wyk provides background information on penetration testing processes and practices.

Requirements Engineering Annotated Bibliography

In this paper, Nancy Mead provides a bibliography of sources related to requirements engineering.

Defining the Discipline of Secure Software Assurance: Initial Findings from the National Software Assurance Repository

By nancy r. mead, jeff ingalsbe (university of detroit mercy), dan shoemaker (university of detroit mercy), rita barrios.

In this paper, the authors characterize the current state of secure software assurance work and suggest future directions.

Making the Business Case for Software Assurance

In this paper, Nancy Mead provides an overview of the Business Case content area.

Spotlight On: Insider Theft of Intellectual Property Inside the United States Involving Foreign Governments or Organizations (2013)

May 20, 2013 • technical note, by andrew p. moore, randall f. trzeciak, derrick spooner, dawn cappelli, matthew l. collins.

In this report, the authors provide a snapshot of individuals involved in insider threat cases and recommend how to mitigate the risk of similar incidents.

The Software Assurance Competency Model: A Roadmap to Enhance Individual Professional Capability

May 16, 2013 • white paper, by nancy r. mead, dan shoemaker (university of detroit mercy).

In this paper, the authors describe a software assurance competency model that can be used by professionals to improve their software assurance skills.

Building a Body of Knowledge for ICT Supply Chain Risk Management

In this paper, the authors propose a set of Supply Chain Risk Management (SCRM) activities and practices for Information and Communication Technologies (ICT).

Modeling Tools References

May 15, 2013 • white paper, by samuel t. redwine.

In this paper, Samuel Redwine provides references related to modeling tools.

Software Assurance Education Overview

In this paper, Nancy Mead discusses the growing demand for skilled professionals who can build security and correct functionality into software.

Governance and Management References

May 14, 2013 • white paper.

In this paper, Julia Allen provides references related to governance and management.

Getting Secure Software Assurance Knowledge into Conventional Practice

By linda laird, nancy r. mead, dan shoemaker (university of detroit mercy).

In this paper, the authors describe three educational initiatives in support of software assurance education.

General Modeling Concepts

In this paper, Samuel Redwine introduces several concepts related to the Introduction to Modeling Tools for Software Security article and modeling in general.

A Systemic Approach for Assessing Software Supply-Chain Risk

By robert j. ellison, carol woody, christopher j. alberts, rita c. creel, audrey j. dorofee.

In this paper, the authors highlight the approach being implemented by SEI researchers for assessing and managing software supply-chain risks and provide a summary of the status of this work.

Framing Security as a Governance and Management Concern: Risks and Opportunities

In this paper, Julia Allen describes six "assets" or requirements of being in business that can be compromised by insufficient security investment.

Assembly, Integration, and Evolution Overview

By howard f. lipson.

In this paper, Howard Lipson describes the objective of the Assembly, Integration & Evolution content area.

A Common Sense Way to Make the Business Case for Software Assurance

By dan shoemaker (university of detroit mercy), antonio drommi, jeff ingalsbe (university of detroit mercy), nancy r. mead, john bailey.

In this article, the authors demonstrate how a true cost/benefit for secure software can be derived.

Deployment and Operations References

In this paper, Julia Allen provides a list of references related to deployment and operations.

Deploying and Operating Secure Systems

In this paper, Julia Allen provides a brief overview of deployment and operations security issues and advice for using related practices.

Two Nationally Sponsored Initiatives for Disseminating Assurance Knowledge

In this paper, the authors describe two efforts that support national cybersecurity education goals.

By Dan Shoemaker (University of Detroit Mercy), Nancy R. Mead, Carol Woody

In this paper, the authors highlight efforts underway to address our society's growing dependence on software and the need for effective software assurance.

Assurance Cases Overview

In this paper, Howard Lipson introduces the concepts and benefits of developing and maintaining assurance cases for security.

It’s a Nice Idea but How Do We Get Anyone to Practice It? A Staged Model for Increasing Organizational Capability in Software Assurance

May 13, 2013 • white paper.

In this paper, Dan Shoemaker presents a standard approach to increasing the security capability of a typical IT function.

Software Security Engineering: A Guide for Project Managers (white paper)

By sean barnum, gary mcgraw, julia h. allen, nancy r. mead, robert j. ellison.

In this guide, the authors discuss our reliance on software and systems that use the internet or internet-exposed private networks.

Requirements Elicitation Introduction

In this paper, Nancy Mead discusses elicitation methods and the kind of tradeoff analysis that can be done to select a suitable one.

Requirements Prioritization Introduction

In this paper, Nancy Mead discusses using a systematic prioritization approach to prioritize security requirements.

Optimizing Investments in Security Countermeasures: A Practical Tool for Fixed Budgets

By jonathan caulkins, eric hough, hassan osman, nancy r. mead.

In this paper, the authors introduce a novel method of optimizing using integer programming (IP).
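
As a rough illustration of what budget-constrained selection with integer programming can look like, the sketch below solves a small 0/1 integer program: choose the subset of countermeasures with the highest total benefit whose total cost fits a fixed budget. It is not the method from the paper. The function name `select_countermeasures`, the option names, and all costs and benefits are hypothetical, and the sketch assumes integer costs so that a simple dynamic program can stand in for a general IP solver.

```python
from typing import List, Tuple


def select_countermeasures(
    options: List[Tuple[str, int, float]], budget: int
) -> Tuple[float, List[str]]:
    """Pick (name, cost, benefit) options that maximize benefit within budget.

    Solves the 0/1 integer program
        maximize   sum(benefit_i * x_i)
        subject to sum(cost_i * x_i) <= budget,  x_i in {0, 1}
    by dynamic programming over integer budget levels.
    """
    # best[b] holds (total benefit, chosen names) achievable with budget b.
    best: List[Tuple[float, List[str]]] = [(0.0, [])] * (budget + 1)
    for name, cost, benefit in options:
        # Walk budgets downward so each option is selected at most once.
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + benefit
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + [name])
    return best[budget]


# Hypothetical data: (countermeasure, cost, estimated benefit).
options = [
    ("firewall", 4, 10.0),
    ("training", 3, 7.0),
    ("patching", 2, 6.0),
    ("ids", 5, 12.0),
]
print(select_countermeasures(options, budget=7))  # (18.0, ['patching', 'ids'])
```

With these made-up numbers, the pair "patching" plus "ids" uses the full budget of 7 and beats "firewall" plus "training" (17.0), which is why a greedy pick of the single best item first would not be enough here.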

Security Is Not Just a Technical Issue

In this paper, Julia Allen defines the scope of governance concerns as they apply to security.

PSP-VDC: An Adaptation of the PSP that Incorporates Verified Design by Contract

May 7, 2013 • technical report, by diego vallespir (universidad de la república), silvana moreno (universidad de la república), álvaro tasistro (universidad ort uruguay), bill nichols.

This paper describes a proposal for integrating Verified Design by Contract into PSP in order to reduce the number of defects present at the Unit Testing phase, while preserving or …

How You Can Help Your Utility Clients with a Critical Aspect of Smart Grid Transformation They Might be Overlooking

May 1, 2013 • white paper, by the sgmm communications team.

This paper discusses how you can use the Smart Grid Maturity Model (SGMM) to benefit your utility clients.

Five Smart Grid Questions Every Utility Executive Should Ask

This paper recommends the Smart Grid Maturity Model (SGMM), a tool utilities can use to plan and measure smart grid progress.

Application Virtualization as a Strategy for Cyber Foraging in Resource-Constrained Environments

May 1, 2013 • technical note, by dominik messinger, grace lewis.

This technical note explores application virtualization as a more lightweight alternative to VM synthesis for cloudlet provisioning.

The Perils of Treating Software as a Specialty Engineering Discipline

April 30, 2013 • white paper, by keith korzec, tom merendino.

This paper reviews the perils of insufficiently engaging key software domain experts during program development.

Four Pillars for Improving the Quality of Safety-Critical Software-Reliant Systems

April 29, 2013 • white paper, by lutz wrage, charles weinstock, john b. goodenough, arie gurfinkel, peter h. feiler.

This white paper presents an improvement strategy comprising four pillars of an integrate-then-build practice that lead to improved quality through early defect discovery and incremental end-to-end validation and verification.

MERIT Interactive Insider Threat Training Simulator

April 16, 2013 • white paper.

In this paper, the authors describe how state-of-the-art multi-media technologies were used to develop the MERIT InterActive training simulator.

Software Assurance Competency Model

March 11, 2013 • technical note, by thomas b. hilburn (embry-riddle aeronautical university), andrew j. kornecki (embry-riddle aeronautical university), mark a. ardis (stevens institute of technology), glenn johnson ((isc)2), nancy r. mead.

In this report, the authors describe a model that helps create a foundation for assessing and advancing the capability of software assurance professionals.

Detecting and Preventing Data Exfiltration Through Encrypted Web Sessions via Traffic Inspection

March 1, 2013 • technical note, by todd lewellen, daniel l. costa, george silowash, joshua w. burns.

In this report, the authors present methods for detecting and preventing data exfiltration using a Linux-based proxy server in a Microsoft Windows environment.

Justification of a Pattern for Detecting Intellectual Property Theft by Departing Insiders

By dave mundie, david zubrow, andrew p. moore, david mcintire.

In this report, the authors justify applying the pattern “Increased Review for Intellectual Property (IP) Theft by Departing Insiders.”

Quantifying Uncertainty in Expert Judgment: Initial Results

March 1, 2013 • technical report, by robert w. stoddard, dennis goldenson.

The work described in this report, part of a larger SEI research effort on Quantifying Uncertainty in Early Lifecycle Cost Estimation (QUELCE), aims to develop and validate methods for calibrating …

History of CERT-RMM

February 15, 2013 • white paper.

This paper explains the history of how the CERT-RMM came to be.

The MAL: A Malware Analysis Lexicon

February 1, 2013 • technical note, by david mcintire, dave mundie.

In this report, the authors present results of the Malware Analysis Lexicon (MAL) initiative, which developed the first common vocabulary for malware analysis.

Tunisia Case Study

January 24, 2013 • white paper.

This case study describes the experiences of the Tunisia CSIRT in getting its organization up and running.

Columbia CSIRT Case Study

This case study describes the experiences of the Columbia CSIRT in getting its organization up and running.

Insider Threat Control: Using Universal Serial Bus (USB) Device Auditing to Detect Possible Data Exfiltration by Malicious Insiders

January 1, 2013 • technical note, by george silowash, todd lewellen.

In this report, the authors present methods for auditing USB device use in a Microsoft Windows environment.

Cyber Intelligence Tradecraft Project: Summary of Key Findings

January 1, 2013 • white paper, by kate ambrose, troy townsend, andrew o. mellinger, jay mcallister, melissa ludwick.

This study, known as the Cyber Intelligence Tradecraft Project (CITP), seeks to advance the capabilities of organizations performing cyber intelligence by elaborating on best practices and prototyping solutions to shared …

Insider Threat Control: Understanding Data Loss Prevention (DLP) and Detection by Correlating Events from Multiple Sources

By christopher king, george silowash.

In this report, the authors present methods for controlling removable media devices in a Microsoft Windows environment.

SEI Product Line Bibliography

December 31, 2012 • white paper.

This bibliography lists SEI and non-SEI resources that have informed the SEI Product Lines efforts. Examples cover diverse domains and show the kind of improvements you can achieve using a …

A Framework for Software Product Line Practice, Version 5.0

By sholom g. cohen, linda m. northrop, reed little, john mcgregor, paul c. clements, felix bachmann, john k. bergey, gary chastek, patrick donohoe, liam o'brien, lawrence g. jones, robert w. krut, jr.

This document describes the activities and practices in which an organization must be competent before it can benefit from fielding a product line of software systems.

Chronological Examination of Insider Threat Sabotage: Preliminary Observations

December 1, 2012 • white paper, by carly l. huth, david mcintire, william r. claycomb, lori flynn, todd lewellen.

In this paper, the authors examine 15 cases of insider threat sabotage of IT systems to identify points in the attack timeline.

The Business Case for Systems Engineering Study: Assessing Project Performance from Sparse Data

December 1, 2012 • special report, by joseph p. elm.

This report describes the data collection and analysis process used to support the assessment of project performance for the systems engineering (SE) effectiveness study.

Analyzing Cases of Resilience Success and Failure - A Research Study

December 1, 2012 • technical note, by andrew p. moore, randall f. trzeciak, robert w. stoddard, julia h. allen, nader mehravari, pamela d. curtis, kevin g. partridge.

In this report, the authors describe research aimed at helping organizations to know the business value of implementing resilience processes and practices.

Common Sense Guide to Mitigating Insider Threats, Fourth Edition

December 1, 2012 • technical report, by dawn cappelli, timothy j. shimeall, lori flynn, george silowash, andrew p. moore, randall f. trzeciak.

In this report, the authors define insider threats and outline current insider threat patterns and trends.

Arabic Language Translation of CMMI for Services V1.3

November 1, 2012 • white paper, by the cmmi product team.

Arabic translation of CMMI-SVC V1.3.

TSP Symposium 2012 Proceedings

November 1, 2012 • special report, by shigeru kusakabe (kyushu university), yoichi omori (kyushu university), keijiro araki (kyushu university), fernanda grazioli (universidad de la república), álvaro tasistro (universidad ort uruguay), diego vallespir (universidad de la república), silvana moreno (universidad de la república), joão pascoal faria (university of porto), mushtaq raza (university of porto), pedro c. henriques (strongstep – innovation in software quality), césar duarte (strongstep – innovation in software quality), elias fallon (cadence design systems, inc.), lee gazlay (cadence design systems, inc.), bill nichols.

The 2012 TSP Symposium was organized by the Software Engineering Institute (SEI) and took place September 18-20 in St. Petersburg, FL. The goal of the TSP Symposium is to bring …

DoD Information Assurance and Agile: Challenges and Recommendations Gathered Through Interviews with Agile Program Managers and DoD Accreditation Reviewers

November 1, 2012 • technical note, by stephany bellomo, carol woody.

This paper discusses the natural tension between rapid fielding and response to change (characterized as agility) and DoD information assurance policy. Data for the paper was gathered through interviews with …

Reliability Improvement and Validation Framework

By peter h. feiler, arie gurfinkel, charles weinstock, john b. goodenough, lutz wrage.

This report discusses the reliability validation and improvement framework developed by the SEI. The purpose of this framework is to provide a foundation for addressing the challenges of qualifying increasingly …

The Business Case for Systems Engineering Study: Results of the Systems Engineering Effectiveness Survey

By joseph p. elm, dennis goldenson.

This report summarizes the results of a survey that had the goal of quantifying the connection between the application of systems engineering (SE) best practices to projects and programs and …

Maturity Models 101: A Primer for Applying Maturity Models to Smart Grid Security, Resilience, and Interoperability

By richard a. caralli, austin montgomery, mark knight (cgi group).

In this paper, the authors explain the history and evolution of and applications for maturity models.

Technical Debt: From Metaphor to Theory and Practice

By robert nord, ipek ozkaya, philippe kruchten.

This article discusses the technical debt metaphor and considers it beyond a "rhetorical concept." The article explores the role of decision making about developmental activities and future changes and the …

Architecture-Driven Semantic Analysis of Embedded Systems (Dagstuhl Seminar 12272)

October 10, 2012 • special report, by peter h. feiler, jerome hugues.

This report documents the program and outcomes of presentations and working groups from Dagstuhl Seminar 12272, "Architecture-Driven Semantic Analysis of Embedded Systems."

Spotlight On: Insider Threat from Trusted Business Partners Version 2: Updated and Revised

October 1, 2012 • white paper, by andrew p. moore, randall f. trzeciak, derrick spooner, todd lewellen, robert weiland (carnegie mellon university), dawn cappelli.

In this article, the authors focus on cases in which the malicious insider was employed by a trusted business partner of the victim organization.

The Role of Standards in Cloud-Computing Interoperability

October 1, 2012 • technical note, by grace lewis.

This report explores the role of standards in cloud-computing interoperability. It covers cloud-computing basics and standard-related efforts, discusses several use cases, and provides recommendations for cloud-computing adoption.

Cloud Computing at the Tactical Edge

By grace lewis, edwin j. morris, soumya simanta, mahadev satyanarayanan (carnegie mellon university), kiryong ha (carnegie mellon school of computer science).

This technical note presents a strategy to overcome the challenges of obtaining sufficient computation power to run applications needed for warfighting and disaster relief missions. It discusses the use of …

Well There’s Your Problem: Isolating the Crash-Inducing Bits in a Fuzzed File

In this 2012 report, Allen Householder describes an algorithm for reverting bits from a fuzzed file to those found in the original seed file to recreate the crash.

Resource Allocation in Dynamic Environments

October 1, 2012 • technical report, by daniel plakosh, joe seibel, jeffrey hansen, gabriel moreno, scott hissam, b. craig meyers, lutz wrage.

When warfighting missions are conducted in a dynamic environment, the allocation of resources needed for mission operation can change from moment to moment. This report addresses two challenges of resource …

Building an Incident Management Body of Knowledge

September 7, 2012 • white paper, by dave mundie, robin ruefle.

In this paper, the authors describe the components of the CERT Incident Management Body of Knowledge (CIMBOK) and how they were constructed.

SEPG Europe 2012 Conference Proceedings

September 1, 2012 • special report, by jose maria garcia (software quality assurance), ana m. moreno (universidad politecnica de madrid), radouane oudrhiri (systonomy), fabrizio pellizzetti (systonomy), alejandro ruiz-robles (university of piura), maria-isabel sanchez-segura (carlos iii university of madrid), prasad m. shrasti (tata consultancy services), aman kumar singhal (infosys), javier garcia-guzman (carlos iii university of madrid), javier garzas (kybele research and kybele consulting), amit arun javadekar (infosys), patrick kirwan, joaquin lasheras (centic), fuensanta medina-dominguez (carlos iii university of madrid), erich meier (method park), arturo mora-soto (carlos iii university of madrid).

This report compiles seven papers based on presentations given at SEPG Europe 2012.

Competency Lifecycle Roadmap: Toward Performance Readiness

September 1, 2012 • technical note, by robin ruefle, christopher j. alberts, sandra behrens.

In this report, the authors describe the Competency Lifecycle Roadmap (CLR), a preliminary roadmap for understanding and building workforce readiness.

Communication Among Incident Responders – A Study

By robert floodeen, brett tjaden.

In this report, the authors describe three factors that help or hinder cooperation among incident responders.

Toward a Theory of Assurance Case Confidence

September 1, 2012 • technical report, by ari z. klein, charles weinstock, john b. goodenough.

In this report, the authors present a framework for thinking about confidence in assurance case arguments.

Insider Fraud in Financial Services

August 3, 2012 • white paper.

In this brochure, the authors present the findings of a study that analyzed computer criminal activity in the financial services sector.

Probability-Based Parameter Selection for Black-Box Fuzz Testing

August 1, 2012 • technical note, by allen d. householder, jonathan foote.

In this report, the authors describe an algorithm for automating the selection of seed files and other parameters used in black-box fuzz testing.

Results of SEI Line-Funded Exploratory New Starts Projects

August 1, 2012 • technical report, by bill nichols, robert nord, cory cohen, soumya simanta, rick kazman, nanette brown, william casey, david french, edwin j. morris, arie gurfinkel, sagar chaki, dionisio de niz, ipek ozkaya, gene cahill, ofer strichman, brad myers, raghvinder sangwan, len bass, peppo valetto.

This report describes the line-funded exploratory new starts (LENS) projects that were undertaken during fiscal year 2011. For each project, the report presents a brief description and a recounting of …

Network Profiling Using Flow

By sid faber, austin whisnant.

In this report, the authors provide a step-by-step guide for profiling and discovering public-facing assets on a network using netflow data.

Insider Threats to Cloud Computing: Directions for New Research Challenges

July 16, 2012 • white paper, by william r. claycomb, alex nicoll.

In this paper, the authors explain that insider threats related to cloud computing are a serious concern that has not been thoroughly explored.

Insider Threat Study: Illicit Cyber Activity Involving Fraud in the U.S. Financial Services Sector

July 1, 2012 • special report, by david mcintire, adam cummings, andrew p. moore, randall f. trzeciak, todd lewellen.

In this report, the authors describe insights and risk indicators of malicious insider activity in the banking and finance sector.

Supporting the Use of CERT Secure Coding Standards in DoD Acquisitions

July 1, 2012 • technical note, by john k. bergey, philip miller, robert c. seacord, timothy morrow.

In this report, the authors provide guidance for helping DoD acquisition programs address software security in acquisitions.

The Evolution of a Science Project: A Preliminary System Dynamics Model of a Recurring Software-Reliant Acquisition Behavior

July 1, 2012 • technical report, by william e. novak, andrew p. moore, christopher j. alberts.

This report uses a preliminary system dynamics model to analyze a specific adverse acquisition dynamic concerning the poorly controlled evolution of small prototype efforts into full-scale systems.

Introduction to System Strategies

June 27, 2012 • white paper.

In this paper, the authors discuss the effects of the changing operational environment on the development of secure systems.

Introduction to Modeling Tools for Software Security

June 24, 2012 • white paper.

In this paper, Samuel Redwine introduces security concepts and tools useful for modeling security properties.

Security-Specific Bibliography

June 22, 2012 • white paper.

In this paper, the authors provide a bibliography of sources related to security.

A Virtual Upgrade Validation Method for Software-Reliant Systems

June 1, 2012 • technical report, by dionisio de niz, peter h. feiler, david p. gluch, lutz wrage.

This report presents the Virtual Upgrade Validation (VUV) method, an approach that uses architecture-centric, model-based analysis to identify system-level problems early in the upgrade process to complement established test qualification …

Report from the First CERT-RMM Users Group Workshop Series

April 1, 2012 • technical note, by lisa r. young, julia h. allen.

In this report, the authors describe the first CERT RMM Users Group (RUG) Workshop Series and the experiences of participating members and CERT staff.

Source Code Analysis Laboratory (SCALe)

By david svoboda, robert w. stoddard, robert c. seacord, will dormann, james mccurley, philip miller, jefferson welch.

In this report, the authors describe the CERT Program's Source Code Analysis Laboratory (SCALe), a conformance test against secure coding standards.

Insider Threat Security Reference Architecture

April 1, 2012 • technical report, by joji montelibano, andrew p. moore.

In this report, the authors describe the Insider Threat Security Reference Architecture (ITSRA), an enterprise-wide solution to the insider threat.

A Pattern for Increased Monitoring for Intellectual Property Theft by Departing Insiders

By andrew p. moore, dave mundie, michael hanley.

In this report, the authors present techniques for helping organizations plan, prepare, and implement means to mitigate insider theft of intellectual property.

The Impact of Passive DNS Collection on End-User Privacy

March 22, 2012 • white paper, by jonathan spring, carly l. huth.

In this paper, the authors discuss whether pDNS allows reconstruction of an end user's DNS behavior and whether DNS behavior is personally identifiable information.

Approaches for Edge-Enabled Tactical Systems

March 19, 2012 • white paper.

This booklet contains brief articles about using mobile devices in the areas of edge-enabled systems and cloud computing and a report on cloud offload in hostile environments.

Digital Investigation Workforce Development

March 1, 2012 • white paper.

In this paper, the authors describe an approach for deriving measures of software security from well-established and commonly used standard practices.

What’s New in V2 of the Architecture Analysis & Design Language Standard?

March 1, 2012 • special report, by peter h. feiler, joe seibel, lutz wrage.

This report provides an overview of changes and improvements to the Architecture Analysis & Design Language (AADL) standard for describing both the software architecture and the execution platform architectures of …

Principles of Trust for Embedded Systems

March 1, 2012 • technical note, by david fisher.

In this report, David Fisher provides substance and explicit meaning to the terms trust and trustworthy as they relate to automated systems.

Deriving Software Security Measures from Information Security Standards of Practice

February 16, 2012 • white paper, by robert w. stoddard, julia h. allen, christopher j. alberts.

In this paper, the authors describe an approach for deriving measures of software security from common standard practices for information security.

Risk-Based Measurement and Analysis: Application to Software Security

February 1, 2012 • technical note, by christopher j. alberts, julia h. allen, robert w. stoddard.

In this report, the authors present the concepts of a risk-based approach to software security measurement and analysis and describe the IMAF and MRD.

Mission Risk Diagnostic (MRD) Method Description

By christopher j. alberts, audrey j. dorofee.

In this report, the authors describe the Mission Risk Diagnostic (MRD) method, which is used to assess risk in systems across the lifecycle and supply chain.

Proceedings of the Smart Grid Maturity Model Leadership Workshop

January 31, 2012 • special report.

In January 2012, leaders in the electric power industry collaborated with the SEI to build the future of the Smart Grid Maturity Model at the SGMM Leadership Workshop.

Modifying Lanchester's Equations for Modeling and Evaluating Malicious Domain Name Take-Down

January 6, 2012 • white paper.

In this paper, Jonathan Spring models internet competition on large, decentralized networks using a modification of Lanchester's equations for combat.

January 2, 2012 • white paper.

In this paper, the authors demonstrate that there are name servers that exhibit IP address flux, a behavior that falls outside the prescribed parameters.

Discerning the Intent of Maturity Models from Characterizations of Security Posture

January 1, 2012 • white paper.

In this paper, Rich Caralli discusses how using maturity models and characterizing security posture are activities with different intents, outcomes, and uses.

Communication Among Incident Responders - A Study

In this paper, the authors describe preliminary results of a study of how effective nine autonomous incident response organizations are.

Best Practices for Artifact Versioning in Service-Oriented Systems

January 1, 2012 • technical note, by william anderson, marc novakouski, grace lewis, jeff davenport.

This report describes some of the challenges of software versioning in an SOA environment and provides guidance on how to meet these challenges by following industry guidelines and recommended practices.

Interoperability in the e-Government Context

By marc novakouski, grace lewis.

This report describes a proposed model through which to understand interoperability in the e-government context.

Spotlight On: Malicious Insiders and Organized Crime Activity

By christopher king.

In this report, Christopher King provides a snapshot of who malicious insiders are, what and how they strike, and why.

A Closer Look at 804: A Summary of Considerations for DoD Program Managers

December 1, 2011 • special report, by stephany bellomo.

The information in this report is intended to help program managers reason about actions they may need to take to adapt and comply with the Section 804 NDAA for 2010 …

Standards-Based Automated Remediation: A Remediation Manager Reference Implementation, 2011 Update

By sagar chaki, mary popeck, rita c. creel, benjamin mccormick, mike kinney (national security agency), jeff davenport.

In this report, the authors describe work to develop standards for automated remediation of vulnerabilities and compliance issues on DoD networked systems.

Using Defined Processes as a Context for Resilience Measures

December 1, 2011 • technical note, by pamela d. curtis, linda parker gates, julia h. allen.

In this report, the authors describe how implementation-level processes can provide context for identifying and defining measures of operational resilience.

Quantifying Uncertainty in Early Lifecycle Cost Estimation (QUELCE)

December 1, 2011 • technical report, by debra anderson, james mccurley, robert w. stoddard, dennis goldenson, david zubrow, robert ferguson.

The method of quantifying uncertainty described in this report synthesizes scenario building, Bayesian Belief Network (BBN) modeling and Monte Carlo simulation into an estimation method that quantifies uncertainties, allows subjective …

An Investigation of Techniques for Detecting Data Anomalies in Earned Value Management Data

By mark kasunic, david zubrow, dennis goldenson, james mccurley.

This research demonstrated the effectiveness of various statistical techniques for discovering quantitative data anomalies.

German language translation of CMMI for Development, V1.3

November 1, 2011 • white paper.

This PDF contains a German language translation of CMMI for Development, V1.3.

Japanese Language Translation of CMMI for Development, V1.3

CERT® Resilience Management Model (CERT®-RMM) V1.1: NIST Special Publication Crosswalk Version 1

November 1, 2011 • technical note, by lisa r. young, kevin g. partridge.

In this report, the authors map CERT-RMM process areas to selected NIST special publications in the 800 series.

Agile Methods: Selected DoD Management and Acquisition Concerns

October 1, 2011 • technical note, by mary ann lapham, suzanne miller, nanette brown, alfred schenker, bart hackemack, linda levine, lorraine adams, charles (bud) hammons.

This technical note addresses some of the key issues that either must be understood to ease the adoption of Agile or are seen as potential barriers to adoption of Agile …

An Acquisition Perspective on Product Evaluation

By harry l. levinson, richard librizzi, grady campbell.

This technical note focuses on software acquisition and development practices related to the evaluation of products before, during, and after implementation.

CERT® Resilience Management Model (RMM) v1.1: Code of Practice Crosswalk Commercial Version 1.1

By kevin g. partridge, lisa r. young.

In this report, the authors explain how CERT-RMM process areas, industry standards, and codes of practice are used by organizations in an operational setting.

Insider Threat Control: Using Centralized Logging to Detect Data Exfiltration Near Insider Termination

By joji montelibano, michael hanley.

In this report, the authors present an insider threat pattern that shows how organizations can combat insider theft of intellectual property.

CERT® Resilience Management Model Capability Appraisal Method (CAM) Version 1.1

October 1, 2011 • technical report, by resilient enterprise management team.

In this report, the authors demonstrate that the SCAMPI method can be adapted and applied to CERT-RMM V1.1 as the reference model for a process appraisal.

Smart Grid Maturity Model: Matrix, Version 1.2

September 1, 2011 • white paper.

This document shows a matrix related to Smart Grid Maturity Model levels.

Proceedings of the Fourth International Workshop on a Research Agenda for Maintenance and Evolution of Service-Oriented Systems (MESOA 2010)

September 1, 2011 • special report, by dennis b. smith, kostas kontogiannis, grace lewis.

This report summarizes the proceedings from the 2010 MESOA workshop and includes the accepted papers that were the basis for the presentations given during the workshop.

Software Assurance Curriculum Project Volume IV: Community College Education

September 1, 2011 • technical report, by nancy r. mead, mark a. ardis (stevens institute of technology), elizabeth k. hawthorne (union county college).

In this report, the authors focus on community college courses for software assurance.

Understanding and Leveraging a Supplier’s CMMI Efforts: A Guidebook for Acquirers (Revised for V1.3)

By john scibilia, lawrence t. osiecki, mike phillips.

This guidebook helps acquisition organizations formulate questions for their suppliers related to CMMI. It also helps organizations interpret responses to identify and evaluate risks for a given supplier.

Smart Grid Maturity Model, Version 1.2: Model Definition

By the sgmm team.

The Smart Grid Maturity Model (SGMM) is a business tool that provides a framework for electric power utilities to help modernize their operations and practices for delivering electricity.

Keeping Your Family Safe in a Highly Connected World

August 10, 2011 • white paper, by jonathan frederick, marie baker.

In this paper, the authors describe the risks of being victims of theft, including becoming involved unknowingly in illegal activities over a networked device.

Which CMMI Model Is for You?

August 1, 2011 • white paper, by mike phillips, sandra shrum.

A short white paper that provides guidance on selecting the best CMMI model for process improvement.

Architecting Service-Oriented Systems

August 1, 2011 • technical note, by philip bianco, grace lewis, paulo merson, soumya simanta.

This report presents guidelines for architecting service-oriented systems and the effect of architectural principles on system quality attributes.

Standards-Based Automated Remediation: A Remediation Manager Reference Implementation

July 1, 2011 • special report, by sagar chaki, mary popeck, rita c. creel, jeff davenport, mike kinney (national security agency), benjamin mccormick.

In this report, the authors describe work to develop standards for vulnerability and compliance remediation on DoD networked systems.

A Decision Framework for Selecting Licensing Rights for Noncommercial Computer Software in the DoD Environment

July 1, 2011 • technical report.

This report describes standard noncommercial software licensing alternatives as defined by U.S. Government and DoD regulations. It suggests an approach for identifying agency needs for license rights and the license …

Measures for Managing Operational Resilience

By pamela d. curtis, julia h. allen.

In this report, the Resilient Enterprise Management (REM) team suggests a set of top ten strategic measures for managing operational resilience.

An Online Learning Approach to Information Systems Security Education

June 13, 2011 • white paper, by robert c. seacord, norman bier (carnegie mellon university), marsha lovett (carnegie mellon university).

In this paper, the authors describe the development of a secure coding module that shows how to capture content, ensure learning, and scale to meet demand.

Monitoring Cloud Computing by Layer, Part 2

June 1, 2011 • white paper.

In this paper, Jonathan Spring presents a set of recommended restrictions and audits to facilitate cloud security.

A Preliminary Model of Insider Theft of Intellectual Property

June 1, 2011 • technical note, by dawn cappelli, thomas c. caron (john heinz iii college, school of information systems management, carnegie mellon university), eric d. shaw, andrew p. moore, randall f. trzeciak, derrick spooner.

In this report, the authors describe general observations about and a preliminary system dynamics model of insider crime based on their empirical data.

Software Assurance for System of Systems

May 1, 2011 • white paper, by john b. goodenough, linda m. northrop.

In this paper, the authors discuss confidence in system and SoS behavior and how theories can be used to make the assurance process more effective.

Architecture Evaluation without an Architecture: Experience with the Smart Grid

April 30, 2011 • white paper, by rick kazman, gabriel moreno, james ivers, len bass.

This paper describes an analysis of some of the challenges facing one portion of the Electrical Smart Grid in the United States: residential Demand Response (DR) systems.

Correlating Domain Registrations and DNS First Activity in General and for Malware

April 11, 2011 • white paper, by ed stoner, jonathan spring, leigh b. metcalf.

In this paper, the authors describe a pattern in the amount of time it takes for a registered domain to be actively resolved on the Internet.

Architectures for the Cloud: Best Practices for Navy Adoption of Cloud Computing

April 5, 2011 • white paper.

The goal of SEI research is to create best practices for architecture and design of systems that take advantage of the cloud, leading to greater system quality from both a …

Monitoring Cloud Computing by Layer, Part 1

April 1, 2011 • white paper, principles of survivability and information assurance.

In this paper, the authors describe a Security Information and Event Management signature for detecting possible malicious insider activity.

Employing SOA to Achieve Information Dominance

SEI research will enable the Navy to develop service-oriented systems that address information dominance priority requirements.

Managing Technical Debt in Software-Reliant Systems

By nanette brown.

This whitepaper argues that there is an opportunity to study and improve the “technical debt” metaphor concept and offers software engineers a foundation for managing such trade-offs based on models …

Appraisal Requirements for CMMI Version 1.3 (ARC, V1.3)

April 1, 2011 • technical report, by scampi upgrade team.

The Appraisal Requirements for CMMI, Version 1.3 (ARC, V1.3), defines the requirements for appraisal methods intended for use with Capability Maturity Model Integration (CMMI) and with the People CMM.

Best Practices for National Cyber Security: Building a National Computer Security Incident Management Capability, Version 2.0

By bradford j. willke, samuel a. merrell, john haller, matthew j. butkovic.

In this 2011 report, an update to its 2010 counterpart, the authors provide insight that interested organizations and governments can use to develop a national incident management capability.

Trusted Computing in Embedded Systems Workshop

March 1, 2011 • special report, by archie d. andrews, jonathan m. mccune.

In this report, the authors describe the November 2010 Trusted Computing in Embedded Systems Workshop held at Carnegie Mellon University.

Issues and Opportunities for Improving the Quality and Use of Data in the Department of Defense

By erin harper, mark kasunic, david zubrow.

This report contains the recommendations of an SEI-led, joint-sponsored workshop by the OSD (AT&L) and DDR&E on the topics of data quality, data analysis, and data use.

IEEE Computer Society/Software Engineering Institute Software Process Achievement (SPA) Award 2009

March 1, 2011 • technical report, by satyendra kumar, ramakrishnan m.

This report describes the work of the 2009 recipient of the IEEE Computer Society Software Process Achievement Award, jointly established by the SEI and IEEE to recognize outstanding achievements in …

CMMI for Acquisition (CMMI-ACQ) Primer, Version 1.3

By mike phillips.

Acquisition practices for the project level that help you get started with CMMI for Acquisition practices without using the whole model.

Software Assurance Curriculum Project Volume III: Master of Software Assurance Course Syllabi

By julia h. allen, nancy r. mead, richard c. linger (oak ridge national laboratory), thomas b. hilburn (embry-riddle aeronautical university), andrew j. kornecki (embry-riddle aeronautical university), mark a. ardis (stevens institute of technology).

In this report, the authors provide sample syllabi for the nine core courses in the Master of Software Assurance Reference Curriculum.

Delivering Software-Reliant Products Faster: Take Action to Help Your Organization Gain Speed Without Sacrificing Quality

February 14, 2011 • white paper.

Learn how to deliver software-reliant products faster and explore ways to use software architecture more effectively.

Delivering Software-Reliant Products Faster: Help Your Organization Gain Speed Without Sacrificing Quality

Learn about the initial steps suggested for delivering software-reliant products faster.

A Framework for Evaluating Common Operating Environments: Piloting, Lessons Learned, and Opportunities

February 1, 2011 • special report, by steve rosemergy, cecilia albert.

This report explores the interdependencies among common language, business goals, and software architecture as the basis for a common framework for conducting evaluations of software technical solutions.

Integrating the Master of Software Assurance Reference Curriculum into the Model Curriculum and Guidelines for Graduate Degree Programs in Information Systems

February 1, 2011 • technical note, by dan shoemaker (university of detroit mercy), jeff ingalsbe (university of detroit mercy), nancy r. mead.

In this report, the authors examine how the Master of Software Assurance Reference Curriculum can be used for a Master of Science in Information Systems.

An Analysis of Technical Observations in Insider Theft of Intellectual Property Cases

By michael hanley, joji montelibano, tyler dean, will schroeder, matt houy, randall f. trzeciak.

In this report, the authors provide an overview of techniques used by malicious insiders to steal intellectual property.

Results of SEI Independent Research and Development Projects (FY 2010)

February 1, 2011 • technical report, by gabriel moreno, jeffrey hansen, john j. hudak, daniel plakosh, joe seibel, charles weinstock, cory cohen, william anderson, soumya simanta, peter h. feiler, robert nord, dionisio de niz, ipek ozkaya, edwin j. morris, nanette brown, lutz wrage, david p. gluch, richard c. linger (oak ridge national laboratory), jörgen hansson (university of skovde), howard f. lipson, david fisher, onur mutlu, christopher craig, tim daly, andres diaz-pace, ragunathan rajkumar, karthik lakshmanan, mark pleszkoch, archie d. andrews.

This report describes results of independent research and development (IRAD) projects undertaken in fiscal year 2010.

Network Monitoring for Web-Based Threats

By matthew heckathorn.

In this report, Matthew Heckathorn models the approach an attacker would take and provides detection or prevention methods to counter that approach.

Function Extraction (FX) Research for Computation of Software Behavior: 2010 Development and Application of Semantic Reduction Theorems for Behavior Analysis

By tim daly, mark pleszkoch, richard c. linger (oak ridge national laboratory).

In this report, the authors present research on computing the behavior of software with mathematical precision and describe how this research has been implemented.

FloCon 2011 Proceedings

January 10, 2011 • white paper.

These papers were presented at FloCon 2011, where participants discussed dark space, web servers, spam, and the susceptibility of DNS servers to cache poisoning.

Deriving Candidate Technical Controls and Indicators of Insider Attack from Socio-Technical Models and Data

January 1, 2011 • technical note, by michael hanley.

In this 2011 report, Michael Hanley demonstrates how a method for modeling insider crimes can create candidate technical controls and indicators.

Trust and Trusted Computing Platforms

By archie d. andrews, jonathan m. mccune, david fisher.

This technical note examines the Trusted Platform Module, which arose from work related to the Independent Research and Development project "Trusted Computing in Extreme Adversarial Environments: Using Trusted Hardware as …

Enabling Agility Through Architecture

December 16, 2010 • white paper, by nanette brown, ipek ozkaya, robert nord.

Enabling Agility Through Architecture: A Crosstalk article by Nanette Brown, Rod Nord, and Ipek Ozkaya.

Software Supply Chain Risk Management: From Products to Systems of Systems

December 1, 2010 • technical note, by christopher j. alberts, carol woody, rita c. creel, robert j. ellison, audrey j. dorofee.

In this report, the authors consider current practices in software supply chain analysis and suggest some foundational practices.

A Taxonomy of Operational Cyber Security Risks

By james j. cebula, lisa r. young.

In this report, the authors present a taxonomy of operational cyber security risks and its harmonization with other risk and security activities.

Source Code Analysis Laboratory (SCALe) for Energy Delivery Systems

December 1, 2010 • technical report, by philip miller, jefferson welch, james mccurley, david svoboda, robert w. stoddard, robert c. seacord, will dormann.

In this report, the authors describe the Source Code Analysis Laboratory (SCALe), which tests software for conformance to CERT secure coding standards.

Adaptive Flow Control for Enabling Quality of Service in Tactical Ad Hoc Wireless Networks

By edwin j. morris, soumya simanta, scott hissam, jeffrey hansen, daniel plakosh, b. craig meyers, lutz wrage.

The network infrastructure for users such as emergency responders or warfighters is wireless, ad hoc, mobile, and lacking in sufficient bandwidth. This report documents the results from 18 experiments to …

Combining Architecture-Centric Engineering with the Team Software Process

By robert nord, felix bachmann, jim mchale.

ACE methods and the TSP provide an iterative approach for delivering high-quality systems on time and within budget. The combined approach helps organizations that must set an architecture/developer team …

Beyond Technology Readiness Levels for Software: U.S. Army Workshop Report

By suzanne miller, cecilia albert, stephen blanchette, jr.

This report synthesizes presentations, discussions, and outcomes from the "Beyond Technology Readiness Levels for Software" workshop from August 2010.

The CERT Approach to Cybersecurity Workforce Development

By christopher may, josh hammerstein.

This report describes a model commonly used for developing and maintaining a competent cybersecurity workforce, explains some operational limitations associated with that model, and presents a new approach to cybersecurity …

Guide for SCAMPI Appraisals: Accelerated Improvement Method (AIM)

December 1, 2010 • special report.

This document provides guidance to lead appraisers and appraisal teams unfamiliar with TSP+ when conducting Standard CMMI Appraisal Method for Process Improvement (SCAMPI) appraisals within organizations that use the TSP+ …

Implementation Guidance for the Accelerated Improvement Method (AIM)

This 2010 report describes the Accelerated Improvement Method (AIM), which helps an organization implement high-performance, high-quality CMMI practices much more quickly than industry norms.

Executive Overview: Best Practices for Adoption of Cloud Computing

November 24, 2010 • white paper.

This paper describes the SEI approach to cloud computing research for the DoD.

Executive Overview: Employing SOA to Achieve Information Dominance

The current ability to implement systems in the DoD based on SOA technologies falls short of the DoD's goals. To close the gaps in these areas, research is needed in …

French language translation of CMMI for Development, V1.3

November 1, 2010 • white paper.

This is the French language translation of CMMI for Development, V1.3.

Dutch language translation of CMMI for Development V1.3

This document is the Dutch language translation of CMMI-DEV V1.3.

Spanish Language Translation of CMMI for Development, v1.3

Spanish language translation of CMMI for Development, v1.3

Traditional Chinese Language Translation of CMMI for Development V1.3

CMMI-DEV V1.3 Traditional Chinese Translation

A Workshop on Analysis and Evaluation of Enterprise Architectures

November 1, 2010 • technical note, by john klein, michael j. gagliardi.

This report summarizes a workshop on the analysis and evaluation of enterprise architectures that was held at the SEI in April of 2010.

Performance Analysis of WS-Security Mechanisms in SOAP-Based Web Services

November 1, 2010 • technical report, by gunnar peterson, marc novakouski, soumya simanta, edwin j. morris, grace lewis.

This paper presents the results of a series of experiments targeted at analyzing the performance impact of adding WS-Security, a common security standard used in IdM frameworks, to SOAP-based web …

CMMI for Acquisition, Version 1.3

The CMMI-ACQ model provides guidance for applying CMMI best practices in an acquiring organization. Best practices in the model focus on activities for initiating and managing the acquisition of products …

CMMI for Development, Version 1.3

This 2010 report details CMMI for Development (CMMI-DEV) V.1.3, which provides a comprehensive integrated set of guidelines for developing products and services.

CMMI for Services, Version 1.3

This 2010 report details CMMI for Services (CMMI-SVC) V.1.3, which provides a comprehensive integrated set of guidelines for providing superior services.

Strategic Planning with Critical Success Factors and Future Scenarios: An Integrated Strategic Planning Framework

By linda parker gates.

This report explores the value of enhancing typical strategic planning techniques with the CSF method and scenario planning.

Designing for Incentives: Better Information Sharing for Better Software Engineering

October 31, 2010 • white paper.

This paper outlines a research agenda in bridging to the economic theory of mechanism design, which seeks to align incentives in multi-agent systems with private information and conflicting goals.

Cloud Computing Basics Explained

September 30, 2010 • white paper.

This paper seeks to help organizations understand cloud computing essentials, including drivers for and barriers to adoption, in support of making decisions about adopting the approach.

Primer on SOA Terms

September 1, 2010 • white paper.

This white paper presents basic terminology related to Service-Oriented Architecture (SOA). The goal of the paper is to establish a baseline of terms for service-oriented systems.

T-Check in System-of-Systems Technologies: Cloud Computing

September 1, 2010 • technical note, by grace lewis, harrison d. strowd.

The purpose of this report is to examine a set of claims about cloud computing adoption.

Emerging Technologies for Software-Reliant Systems of Systems

The purpose of this report is to present an informal survey of technologies that are, or are likely to become, important for software-reliant systems of systems in response to current …

Integrated Measurement and Analysis Framework for Software Security

By christopher j. alberts, robert w. stoddard, julia h. allen.

In this report, the authors address how to measure software security in complex environments using the Integrated Measurement and Analysis Framework (IMAF).

Security Requirements Reusability and the SQUARE Methodology

In this report, the authors discuss how security requirements engineering can incorporate reusable requirements.

Measuring Operational Resilience Using the CERT® Resilience Management Model

By noopur davis, julia h. allen.

In this 2010 report, the authors begin a dialogue and establish a foundation for measuring and analyzing operational resilience.

Program Executive Officer Aviation, Major Milestone Reviews: Lessons Learned Report

September 1, 2010 • technical report, by kate ambrose, scott reed.

This report documents ideas and recommendations for improving the overall acquisition process and presents the actions taken by project managers in several programs to develop, staff, and obtain approval for …

Smart Grid Maturity Model, Version 1.1: Model Definition

Success in Acquisition: Using Archetypes to Beat the Odds

By william e. novak, linda levine.

This report describes key elements in systems thinking, provides an introduction to general systems archetypes, and applies these concepts to the software acquisition domain.

Building Assured Systems Framework

By julia h. allen, nancy r. mead.

This report presents the Building Assured Systems Framework (BASF) that addresses the customer and researcher challenges of selecting security methods and research approaches for building assured systems.

Using TSP Data to Evaluate Your Project Performance

By bill nichols, james mccurley, shigeru sasao.

This report discusses the application of a set of measures to a data set of 41 TSP projects from an organization …

Suggestions for Documenting SOA-Based Systems

This report provides suggestions for documenting service-oriented architecture-based systems based on the Views & Beyond (V&B) software documentation approach.

Exploring Acquisition Strategies for Adopting a Software Product Line

August 25, 2010 • white paper, by john k. bergey, lawrence g. jones.

Some basics of software product line practice, the challenges that make product line acquisition unique, and three basic acquisition strategies are all part of this white paper.

YAF: Yet Another Flowmeter

August 23, 2010 • white paper, by chris inacio, brian trammell.

In this paper, the authors describe issues encountered in designing and implementing YAF.

A Continuous Time List Capture Model for Internet Threats

August 4, 2010 • white paper, by rhiannon weaver.

In this paper, Rhiannon Weaver describes a population study of malware files under the CTLC framework and presents a simulation study as well as future work.

Software Assurance Curriculum Project Volume I: Master of Software Assurance Reference Curriculum

August 1, 2010 • technical report, by richard c. linger (oak ridge national laboratory), james mcdonald (monmouth university), thomas b. hilburn (embry-riddle aeronautical university), andrew j. kornecki (embry-riddle aeronautical university), mark a. ardis (stevens institute of technology), julia h. allen, nancy r. mead.

In this report, the authors present a master of software assurance curriculum that educational institutions can use to create a degree program or track.

Risk Management Framework

By audrey j. dorofee, christopher j. alberts.

In this report, the authors specify (1) a framework that documents best practice for risk management and (2) an approach for evaluating a program's risk management practice in relation to …

Software Assurance Curriculum Project Volume II: Undergraduate Course Outlines

Topic modeling in software engineering research

  • Open access
  • Published: 06 September 2021
  • Volume 26, article number 120 (2021)

  • Camila Costa Silva (ORCID: orcid.org/0000-0002-3690-1711),
  • Matthias Galster (ORCID: orcid.org/0000-0003-3491-1833) &
  • Fabian Gilson (ORCID: orcid.org/0000-0002-1465-3315)

Topic modeling using models such as Latent Dirichlet Allocation (LDA) is a text mining technique to extract human-readable semantic “topics” (i.e., word clusters) from a corpus of textual documents. In software engineering, topic modeling has been used to analyze textual data in empirical studies (e.g., to find out what developers talk about online), but also to build new techniques to support software engineering tasks (e.g., to support source code comprehension). Topic modeling needs to be applied carefully (e.g., depending on the type of textual data analyzed and modeling parameters). Our study aims at describing how topic modeling has been applied in software engineering research with a focus on four aspects: (1) which topic models and modeling techniques have been applied, (2) which textual inputs have been used for topic modeling, (3) how textual data was “prepared” (i.e., pre-processed) for topic modeling, and (4) how generated topics (i.e., word clusters) were named to give them a human-understandable meaning. We analyzed topic modeling as applied in 111 papers from ten highly-ranked software engineering venues (five journals and five conferences) published between 2009 and 2020. We found that (1) LDA and LDA-based techniques are the most frequent topic modeling techniques, (2) developer communication and bug reports have been modeled most, (3) data pre-processing and modeling parameters vary quite a bit and are often vaguely reported, and (4) manual topic naming (such as deducing names based on frequent words in a topic) is common.

1 Introduction

Text mining is about searching, extracting and processing text to provide meaningful insights from the text based on a certain goal. Techniques for text mining include natural language processing (NLP) to process, search and understand the structure of text (e.g., part-of-speech tagging), web mining to discover information resources on the web (e.g., web crawling), and information extraction to extract structured information from unstructured text and relationships between pieces of information (e.g., co-reference, entity extraction) (Miner et al. 2012 ). Text mining has been widely used in software engineering research (Bi et al. 2018 ), for example, to uncover architectural design decisions in developer communication (Soliman et al. 2016 ) or to link software artifacts to source code (Asuncion et al. 2010 ).

Topic modeling is a text mining and concept extraction method that extracts topics (i.e., coherent word clusters) from large corpora of textual documents to discover hidden semantic structures in text (Miner et al. 2012). An advantage of topic modeling over other techniques is that it helps analyze long texts (Treude and Wagner 2019; Miner et al. 2012), creates clusters as “topics” (rather than individual words) and is unsupervised (Miner et al. 2012).

Topic modeling has become popular in software engineering research (Sun et al. 2016 ; Chen et al. 2016 ). For example, Sun et al. ( 2016 ) found that topic modeling had been used to support source code comprehension, feature location and defect prediction. Additionally, Chen et al. ( 2016 ) found that many repository mining studies apply topic modeling to textual data such as source code and log messages to recommend code refactoring (Bavota et al. 2014b ) or to localize bugs (Lukins et al. 2010 ).

Probabilistic topic models such as Latent Semantic Indexing (LSI) (Deerwester et al. 1990 ) and Latent Dirichlet Allocation (LDA) (Blei et al. 2003b ) discover topics in a corpus of textual documents, using the statistical properties of word frequencies and co-occurrences (Lin et al. 2014 ). However, Agrawal et al. ( 2018 ) warn about systematic errors in the analysis of LDA topic models that limit the validity of topics. Lin et al. ( 2014 ) also advise that classical topic models usually generate sub-optimal topics when applied “as is” to small amounts or short text documents.

Considering the limitations of topic modeling techniques and topic models on the one hand and their potential usefulness in software engineering on the other hand, our goal is to describe how topic modeling has been applied in software engineering research. In detail, we explore the following research questions:

RQ1. Which topic modeling techniques have been used and for what purpose? There are different topic modeling techniques (see Section  2 ), each with their own limitations and constraints (Chen et al. 2016 ). This RQ aims at understanding which topic modeling techniques have been used (e.g., LDA, LSI) and for what purpose studies applied such techniques (e.g., to support software maintenance tasks). Furthermore, we analyze the types of contributions in studies that used topic modeling (e.g., a new approach as a solution proposal, or an exploratory study).

RQ2. What are the inputs into topic modeling? Topic modeling techniques accept different types of textual documents and require the configuration of parameters (see Section  2.1 ). Carefully choosing parameters (such as the number of topics to be generated) is essential for obtaining valuable and reliable topics (Agrawal et al. 2018 ; Treude and Wagner 2019 ). This RQ aims at analysing types of textual data (e.g., source code), actual documents (e.g., a Java class or an individual Java method) and configured parameters used for topic modeling to address software engineering problems.

RQ3: How are data pre-processed for topic modeling? Topic modeling requires that the analyzed text is pre-processed (e.g., by removing stop words) to improve the quality of the produced output (Aggarwal and Zhai 2012 ; Bi et al. 2018 ). This RQ aims at analysing how previous studies pre-processed textual data for topic modeling, including the steps for cleaning and transforming text. This will help us understand if there are specific pre-processing steps for a certain topic modeling technique or types of textual data.

RQ4. How are generated topics named? This RQ aims at analyzing if and how topics (word clusters) were named in studies. Giving meaningful names to topics may be difficult but may be required to help humans comprehend topics. For example, naming topics can provide a high-level view on topics discussed by developers in Stack Overflow (a Q&A website) (Barua et al. 2014 ) or by end mobile app users in tweets (Mezouar et al. 2018 ). Analysts (e.g., developers interested in what topics are discussed on Stack Overflow or app reviews) can then look at the name of the topic (i.e., its “label”) rather than the cluster of words. These labels or names must capture the overarching meaning of all words in a topic. We describe different approaches to naming topics generated by a topic model, such as manual or automated labeling of clusters with names based on the most frequent words of a topic (Hindle et al. 2013 ).

In this paper, we provide an overview of the use of topic modeling in 111 papers published between 2009 and 2020 in highly ranked venues of software engineering (five journals and five conferences). We identify characteristics and limitations in the use of topic models and discuss (a) the appropriateness of topic modeling techniques, (b) the importance of pre-processing, (c) challenges related to defining meaningful topics, and (d) the importance of context when manually naming topics.

The rest of the paper is organized as follows. In Section  2 we provide an overview of topic modeling. In Section  3 we describe other literature reviews on the topic as well as “meta-studies” that discuss topic modeling more generally. We describe the research method in Section  4 and present the results in Section  5 . In Section  6 , we summarize our findings and discuss implications and threats to validity. Finally, in Section  7 we present concluding remarks and future work.

2 Topic Modeling

Topic modeling aims at automatically finding topics, typically represented as clusters of words, in a given textual document (Bi et al. 2018 ). Unlike (supervised) machine learning-based techniques that solve classification problems, topic modeling does not use tags, training data or predefined taxonomies of concepts (Bi et al. 2018 ). Based on the frequencies of words and frequencies of co-occurrence of words within one or more documents, topic modeling clusters words that are often used together (Barua et al. 2014 ; Treude and Wagner 2019 ). Figure  1 illustrates the general process of topic modeling, from a raw corpus of documents (“Data input”) to topics generated for these documents (“Output”). Below we briefly introduce the basic concepts and terminology of topic modeling (based on Chen et al. ( 2016 )):

Word w : a string of one or more alphanumeric characters (e.g., “software” or “management”);

Document d : a set of n words (e.g., a text snippet with five words: \(w_{1}\) to \(w_{5}\));

Corpus C : a set of t documents (e.g., nine text snippets: \(d_{1}\) to \(d_{9}\));

Vocabulary V : a set of m unique words that appear in a corpus (e.g., m = 80 unique words across nine documents);

Term-document matrix A : an m by t matrix whose entry \(A_{i,j}\) is the weight (according to some weighting function, such as term-frequency) of word \(w_{i}\) in document \(d_{j}\). For example, in a matrix A with three words and three documents, \(A_{1,1} = 5\) indicates that “code” appears five times in \(d_{1}\), etc.;

Topic z : a collection of terms that co-occur frequently in the documents of a corpus. Considering probabilistic topic models (e.g., LDA), z refers to an m-length vector of probabilities over the vocabulary of a corpus. For example, in a vector \(z_{1} = (\text{code}: 0.35;\ \text{test}: 0.17;\ \text{bug}: 0.08)\), 0.35 indicates that when a word is picked from topic \(z_{1}\), there is a 35% chance of drawing the word “code”, etc.;

Topic-term matrix ϕ (or T ): a k by m matrix with k as the number of topics and \(\phi_{i,j}\) the probability of word \(w_{j}\) in topic \(z_{i}\). Row i of ϕ corresponds to \(z_{i}\). For example, in a matrix ϕ whose first column corresponds to the word “code”, an entry of 0.05 in the first column of row 3 indicates that the word “code” appears with a probability of 5% in topic \(z_{3}\), etc.;

Topic membership vector \(\theta_{d}\): for document \(d_{i}\), a k-length vector of probabilities of the k topics. For example, given a vector \(\theta _{d_{i}} = (z_{1}: 0.25; z_{2}: 0.10; z_{3}: 0.08)\), 0.25 indicates that there is a 25% chance of selecting topic \(z_{1}\) in \(d_{i}\);

Document-topic matrix 𝜃 (or D ): a t by k matrix with \(\theta_{i,j}\) as the probability of topic \(z_{j}\) in document \(d_{i}\). Row i of 𝜃 corresponds to \(\theta _{d_{i}}\). For example, in a matrix 𝜃, an entry of 0.10 in the first column of row 2 indicates that document \(d_{2}\) contains topic \(z_{1}\) with a probability of 10%, etc.
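As a minimal sketch of the terminology above, the following Python snippet builds a term-document matrix A from raw term frequencies and checks that the rows of a topic-term matrix form probability distributions. The toy corpus, the values in `phi`, and the helper names are hypothetical and chosen only for illustration; indices in the code are zero-based, whereas the definitions above count from one.

```python
from collections import Counter

# Hypothetical toy corpus: t = 3 documents, already split into words.
corpus = [
    ["code", "code", "test", "bug", "code", "code", "code"],
    ["test", "test", "bug"],
    ["bug", "code"],
]

# Vocabulary V: the m unique words of the corpus, in a fixed (sorted) order.
vocabulary = sorted({word for doc in corpus for word in doc})


def term_document_matrix(docs, vocab):
    """Return an m-by-t matrix A where A[i][j] is the raw count of word i in document j."""
    counts = [Counter(doc) for doc in docs]
    return [[c[word] for c in counts] for word in vocab]


def is_row_stochastic(matrix, tol=1e-9):
    """Check that every row sums to 1, i.e., is a probability distribution."""
    return all(abs(sum(row) - 1.0) <= tol for row in matrix)


A = term_document_matrix(corpus, vocabulary)

# Hypothetical topic-term matrix (k = 2 topics, m = 3 words); row i is topic z_i.
phi = [
    [0.10, 0.70, 0.20],
    [0.60, 0.15, 0.25],
]

code_row = vocabulary.index("code")
print(vocabulary)          # column order of phi and row order of A
print(A[code_row][0])      # term frequency of "code" in the first document
print(is_row_stochastic(phi))
```

Term frequency is only one possible weighting function; other weightings would change the entries of A but not its m by t shape.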

Fig. 1: General topic modeling process

2.1 Data Input

Data used as input into topic modeling can take many forms. This requires decisions on what exactly are documents and what the scope of individual documents is (Miner et al. 2012 ). Therefore, we need to determine which unit of text shall be analyzed (e.g., subject lines of e-mails from a mailing list or the body of e-mails).

To model topics from raw text in a corpus C (see Fig.  1 ), the data needs to be converted into a structured vector-space model, such as the term-document matrix A . This typically also requires some pre-processing. Although each text mining approach (including topic modeling) may require specific pre-processing steps, there are some common steps, such as tokenization, stemming and removing stop words (Miner et al. 2012 ). We discuss pre-processing for topic modeling in more detail when presenting the results for RQ3 in Section  5.4 .
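The common pre-processing steps named above (tokenization, stop word removal, stemming) can be sketched as follows. This is an illustrative simplification, not the pipeline of any surveyed study: the stop-word list is a tiny hypothetical one, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real studies use much larger ones.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left untouched.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("The parser is crashing and tests failed"))
```

For this input the sketch yields the tokens `parser`, `crash`, `test` and `fail`. The order of steps matters: removing stop words before stemming avoids having to list stemmed forms in the stop-word list.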

2.2 Modeling

Different models can be used for topic modeling. Models typically differ in how they model topics and underlying assumptions. For example, besides LDA and LSI mentioned before, other examples of topic modeling techniques include Probabilistic Latent Semantic Indexing (pLSI) (Hofmann 1999 ). LSI and pLSI reduce the dimensionality of A using Singular Value Decomposition (SVD) (Hofmann 1999 ). Furthermore, variants of LDA have been proposed, such as Relational Topic Models (RTM) (Chang and Blei 2010 ) and Hierarchical Topic Models (HLDA) (Blei et al. 2003a ). RTM finds relationships between documents based on the generated topics (e.g., if document d 1 contains the topic “microservices”, document d 2 contains the topic “containers” and document d n contains the topic “user interface”, RTM will find a link between documents d 1 and d 2 (Chang and Blei 2010 )). HLDA discovers a hierarchy of topics within a corpus, where each lower level in the hierarchy is more specific than the previous one (e.g., a higher topic “web development” may have subtopics such as “front-end” and “back-end”).

Topic modeling techniques need to be configured for a specific problem, objectives and characteristics of the analyzed text (Treude and Wagner 2019 ; Agrawal et al. 2018 ). For example, Treude and Wagner ( 2019 ) studied parameters, characteristics of text corpora and how the characteristics of a corpus impact the development of a topic modeling technique using LDA. Treude and Wagner ( 2019 ) found that textual data from Stack Overflow (e.g., threads of questions and answers) and GitHub (e.g., README files) require different configurations for the number of generated topics ( k ). Similarly, Barua et al. ( 2014 ) argued that the number of topics depends on the characteristics of the analyzed corpora. Furthermore, the values of modeling parameters (e.g., LDA’s hyperparameters α and β which control an initial topic distribution) can also be adjusted depending on the corpus to improve the quality of topics (Agrawal et al. 2018 ).

By finding words that are often used together in documents in a corpus, a topic modeling technique creates clusters of words or topics \(z_{k}\). Words in such a cluster are usually related in some way, therefore giving the topic a meaning. For example, we can use a topic modeling technique to extract five topics from unstructured documents such as a collection of Stack Overflow posts. One of the clusters generated could include the co-occurring words “error”, “debug” and “warn”. We can then manually inspect this cluster and by inference suggest the label “Exceptions” to name this topic (Barua et al. 2014).
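To make the role of the number of topics k and the hyperparameters α and β more concrete, the sketch below fits LDA with collapsed Gibbs sampling, one common inference method. It is an assumption-laden illustration rather than the procedure of any surveyed paper: `alpha` and `beta` are treated as symmetric Dirichlet priors, the default values are arbitrary, and the function names are hypothetical.

```python
import random


def lda_gibbs(docs, k, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return (vocab, phi, theta)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    m = len(vocab)

    doc_topic = [[0] * k for _ in docs]        # tokens in document d assigned to topic j
    topic_word = [[0] * m for _ in range(k)]   # occurrences of word i assigned to topic j
    topic_total = [0] * k                      # all tokens assigned to topic j
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(k)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wi = word_id[w]
                z = assignments[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wi] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][wi] + beta) / (topic_total[j] + m * beta)
                    for j in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wi] += 1
                topic_total[z] += 1

    # Smoothed estimates of the topic-term and document-topic matrices.
    phi = [[(topic_word[j][i] + beta) / (topic_total[j] + m * beta) for i in range(m)]
           for j in range(k)]
    theta = [[(doc_topic[d][j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
             for d, doc in enumerate(docs)]
    return vocab, phi, theta


def top_words(vocab, topic_row, n=3):
    """Return the n most probable words of a topic, e.g., as a basis for manual naming."""
    ranked = sorted(range(len(vocab)), key=lambda i: topic_row[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


# Hypothetical toy corpus of pre-processed posts.
docs = [
    ["error", "debug", "warn", "error"],
    ["debug", "error", "warn"],
    ["button", "layout", "screen"],
    ["layout", "screen", "button", "button"],
]
vocab, phi, theta = lda_gibbs(docs, k=2)
for row in phi:
    print(top_words(vocab, row))
```

Because the sampler is stochastic, results depend on the seed, k, α and β, which is one reason the choice of these settings deserves care. An analyst would then inspect the top words of each topic to suggest a label.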

3 Related Work

3.1 Previous Literature Reviews

Sun et al. ( 2016 ) and Chen et al. ( 2016 ), similar to our study, surveyed software engineering papers that applied topic modeling. Table  1 shows a comparison between our study and prior reviews. As shown in the table, Sun et al. ( 2016 ) focused on finding which software engineering tasks have been supported by topic models (e.g., support source code comprehension, feature location, traceability link recovery, refactoring, software testing, developer recommendations, software defects prediction and software history comprehension), and Chen et al. ( 2016 ) focused on characterizing how studies used topic modeling to mine software repositories.

Furthermore, as shown in Table  1 , in comparison to Sun et al. ( 2016 ) and Chen et al. ( 2016 ), our study surveys the literature considering other aspects of topic modeling such as data inputs (RQ2), data pre-processing (RQ3), and topic naming (RQ4). Additionally, we searched for papers that applied topic models to any type of data (e.g., Q&A websites) rather than to data in software repositories. We also applied a different search process to identify relevant papers.

Although some of the search venues of these two previous studies and our study overlap, our search focused on specific venues. We also searched papers published between 2009 and 2020, a period which only partially overlaps with the searches presented by Sun et al. ( 2016 ) and Chen et al. ( 2016 ).

Regarding the data analysed in previous studies, Chen et al. ( 2016 ) analyzed two aspects not covered in our study: (a) tools to implement topic models in papers, and (b) how papers evaluated topic models (note that even though we did not cover this aspect explicitly, we checked whether papers compared different topic models, and if so, what metrics they used to compare topic models). However, different to Chen et al. ( 2016 ) we analyzed (a) the types of contribution of papers (e.g., a new approach); (b) details about the types of data and documents used in topic modeling techniques, and (c) whether and how topics were named. Additionally, we extend the survey of Chen et al. ( 2016 ) by investigating hyperparameters (see Section  2.1 ) of topic models and data pre-processing in more detail. We provide more details and a justification of our research method in Section  4 .

3.2 Meta-studies on Topic Modeling

In addition to literature surveys, there are “meta-studies” on topic modeling that address and reflect on different aspects of topic modeling more generally (and are not considered primary studies for the purpose of our review, see our inclusion and exclusion criteria in Section  4 ). In the following paragraphs we organized their discussion into three parts: (1) studies about parameters for topic modeling, (2) studies on topic models based on the type of analyzed data, and (3) studies about metrics and procedures to evaluate the performance of topic models. We refer to these studies throughout this manuscript when reflecting on the findings of our study.

Regarding parameters used for topic modeling, Treude and Wagner ( 2019 ) performed a broad study on LDA parameters to find optimal settings when analyzing GitHub and Stack Overflow text corpora. The authors found that popular rules of thumb for topic modeling parameter configuration were not applicable to their corpora, which required different configurations to achieve good model fit. They also found that it is possible to predict good configurations for unseen corpora reliably. Agrawal et al. ( 2018 ) also performed experiments on LDA parameter configurations and proposed LDADE, a tool to tune the LDA parameters. The authors found that due to LDA topic model instability, using standard LDA with “off-the-shelf” settings is not advisable. We also discuss parameters for topic modeling in Section  2.2 .

For studies on topic models based on the analyzed data, researchers have investigated topic modeling involving short texts (e.g., a tweet) and how to improve the performance of topic models that work well with longer text (e.g., a book chapter) (Lin et al. 2014 ). For example, the study of Jipeng et al. ( 2020 ) compared short-text topic modeling techniques and developed an open-source library of the short-text models. Another example is the work of Mahmoud and Bradshaw ( 2017 ) who discussed topic modeling techniques specific for source code.

Finally, regarding metrics and procedures to evaluate the performance of topic models, some works have explored how semantically meaningful topics are for humans (Chang et al. 2009 ). For example, Poursabzi-Sangdeh et al. ( 2021 ) discuss the importance of interpretability of models in general (also considering other text mining techniques). Another example is the work of Chang et al. ( 2009 ) who presented a method for measuring the interpretability of a topic model based on how well words within topics are related and how different topics are between each other. On the other hand, as an effort to quantify the interpretability of topics without human evaluation, some studies developed topic coherence metrics . These metrics score the probability of a pair of words from topics being found together in (a) external data sources (e.g., Wikipedia pages) or (b) in the documents used by the model that generated those topics (Röder et al. 2015 ). Röder et al. ( 2015 ) combined different implementations of coherence metrics in a framework. Perplexity is another measure of performance for statistical models in natural language processing, which indicates the uncertainty in predicting a single word (Blei et al. 2003b ). This metric is often applied to compare the configurations of a topic modeling technique (e.g., Zhao et al. ( 2020 )). Other studies use perplexity as an indicator of model quality (such as Chen et al. 2019 and Yan et al. 2016b ).
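As an illustration of a coherence metric computed on the modeled documents themselves (option (b) above), the sketch below scores a topic's top words by document co-occurrence, in the style of the UMass measure. It is a simplified assumption, not the framework of Röder et al.: the function name and toy documents are hypothetical, and pairs whose reference word never occurs are simply skipped.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Sum log((D(w_later, w_earlier) + 1) / D(w_earlier)) over ordered pairs of top words.

    top_words is ordered from most to least probable; D counts the documents
    that contain the given word(s).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        d_earlier = doc_freq(earlier)
        if d_earlier == 0:
            continue  # word never occurs in the reference documents
        score += math.log((doc_freq(earlier, later) + 1) / d_earlier)
    return score


# Hypothetical toy documents.
documents = [
    ["error", "debug", "warn"],
    ["error", "debug"],
    ["button", "layout"],
    ["layout", "button", "error"],
]
print(umass_coherence(["error", "debug"], documents))
print(umass_coherence(["error", "layout"], documents))
```

Higher (less negative) scores indicate that the top words of a topic tend to appear in the same documents; in the toy data, "error" and "debug" co-occur more often than "error" and "layout", so the first topic scores higher.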

4 Research Method

We conducted a literature survey to describe how topic modeling has been applied in software engineering research. To answer the research questions introduced in Section  1 , we followed general guidelines for systematic literature review (Kitchenham 2004 ) and mapping study methods (Petersen et al. 2015 ). This was to systematically identify relevant works, and to ensure traceability of our findings as well as the repeatability of our study. However, we do not claim to present a fully-fledged systematic literature review (e.g., we did not assess the quality of primary studies) or a mapping study (e.g., we only analyzed papers from carefully selected venues). Furthermore, we used parts of the procedures from other literature surveys on similar topics (Bi et al. 2018 ; Chen et al. 2016 ; Sun et al. 2016 ) as discussed throughout this section.

4.1 Search Procedure

To identify relevant research, we selected high-quality software engineering publication venues. This was to ensure that our literature survey includes studies of high quality and described at a sufficient level of detail. We identified venues rated as A and A* for Computer Science and Information Systems research in the Excellence Research for Australia (CORE) ranking (ARC 2012). Only one journal was rated B (IST), but we included it due to its relevance for software engineering research. These venues are a subset of venues also searched by related previous literature surveys (Chen et al. 2016; Sun et al. 2016), see Section 3. The list of searched venues includes five journals: (1) Empirical Software Engineering (EMSE); (2) Information and Software Technology (IST); (3) Journal of Systems and Software (JSS); (4) ACM Transactions on Software Engineering & Methodology (TOSEM); (5) IEEE Transactions on Software Engineering (TSE). Furthermore, we included five conferences: (1) International Conference on Automated Software Engineering (ASE); (2) ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM); (3) International Symposium on the Foundations of Software Engineering / European Software Engineering Conference (ESEC/FSE); (4) International Conference on Software Engineering (ICSE); (5) International Workshop/Working Conference on Mining Software Repositories (MSR).

We performed a generic search on SpringerLink (EMSE), Science Direct (IST, JSS), ACM DL (TOSEM, ESEC/FSE, ASE, ESEM, ICSE, MSR) and IEEE Xplore (TSE, ASE, ESEM, ICSE, MSR) using the venue (journal or conference) as a high-level filtering criterion. Considering that the proceedings of ASE, ESEM, ICSE, and MSR are published by ACM and IEEE, we searched these venues on ACM DL and IEEE Xplore to avoid missing relevant papers. We used a generic search string (“topic model[l]ing” and “topic model”). Furthermore, in order to find studies that apply specific topic models but do not mention the term “topic model”, we used a second search string with topic model names (“lsi” or “lda” or “plsi” or “latent dirichlet allocation” or “latent semantic”). This second string was based on the search string used by Chen et al. (2016), who also present a review and analysis of topic modeling techniques in software engineering (see Section 3). We applied both strings to the full text and metadata of papers. We considered works published between 2009 and 2020. The search was performed in March 2021. Limiting the search to the last twelve years allowed us to focus on more mature and recent works.
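The two search strings and the publication period can be approximated locally with regular expressions, as in the sketch below. This is only an illustration under stated assumptions: the digital libraries have their own query syntax, the sketch treats a match on either string as sufficient, and `is_candidate` is a hypothetical helper applied to a paper's text and publication year.

```python
import re

# First string: "topic model[l]ing" and "topic model" (the plural "topic models" also matches).
GENERIC = re.compile(r"\btopic\s+model(?:s|l?ing)?\b", re.IGNORECASE)

# Second string: names of specific topic models.
SPECIFIC = re.compile(
    r"\b(?:lsi|lda|plsi|latent\s+dirichlet\s+allocation|latent\s+semantic)\b",
    re.IGNORECASE,
)


def is_candidate(text, year, first_year=2009, last_year=2020):
    """Return True if the text matches either search string and the year is in range."""
    in_period = first_year <= year <= last_year
    return in_period and bool(GENERIC.search(text) or SPECIFIC.search(text))


print(is_candidate("We apply topic modelling to bug reports.", 2015))
print(is_candidate("We run LDA on commit messages.", 2008))
```

A match only marks a paper as a candidate; whether it actually applies topic modeling still has to be decided by reading it.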

4.2 Study Selection Criteria

We only considered full research papers since full papers typically report (a) mature and complete research, and (b) more details about how topic modeling was applied. Furthermore, to be included, a paper should either apply, experiment with, or propose a topic modeling technique (e.g., develop a topic modeling technique that analyzes source code to recommend refactorings (Bavota et al. 2014b)), and meet none of the exclusion criteria: (a) the paper does not apply topic models (e.g., it applies other text mining techniques and only cites topic modeling in related or future work, such as the paper by Lian et al. (2020)); (b) the paper focuses on theoretical foundation and configurations for topic models (e.g., it discusses how to tune and stabilize topic models, such as Agrawal et al. (2018) and other meta-studies listed in Section 3.2); and (c) the paper is a secondary study (e.g., a literature review like the studies discussed in Section 3.1). We evaluated inclusion and exclusion criteria by first reading the abstracts and then reading full texts.

The search with the first search string (see Section  4.1 ) resulted in 215 papers and the search with the second search string resulted in an additional 324 papers. Applying the filtering outlined above resulted in 114 papers. Furthermore, we excluded three papers from the final set of papers: (a) Hindle et al. ( 2011 ), (b) Chen et al. ( 2012 ), and (c) Alipour et al. ( 2013 ). These papers were earlier and shorter versions of follow-up publications; we considered only the latest publications of these papers (Hindle et al. 2013 ; Chen et al. 2017 ; Hindle et al. 2016 ). This resulted in a total of 111 papers for analysis.

4.3 Data Extraction and Synthesis

We defined data items to answer the research questions and characterize the selected papers (see Table 2). The extracted data was recorded in a spreadsheet for analysis (raw data are available online; see Footnote 1). One of the authors extracted the data and the other authors reviewed it. In case of ambiguous data, all authors discussed the data until they reached agreement. To synthesize the data, we applied descriptive statistics and qualitatively analyzed the data as follows:

RQ1: Regarding the data item “Technique”, we identified the topic modeling techniques applied in papers. For the data item “Supported tasks”, we assigned to each paper one software engineering task. Tasks emerged during the analysis of papers (see more details in Section  5.2.2 ). We also identified the general study outcome in relation to its goal (data item “Type of contribution”). When analyzing the type of contribution, we also checked whether papers included a comparison of topic modeling techniques (e.g., to select the best technique to be included in a newly proposed approach). Based on these data items we checked which techniques were the most popular, whether techniques were based on other techniques or used together, and for what purpose topic modeling was used.

RQ2: We identified types of data (data item “Type of data”) in selected papers as listed in Section  5.3.1 . Considering that some papers addressed one, two or three different types of data, we counted the frequency of types of data and related them with the document. Regarding “Document”, we identified the textual document and (if reported in the paper) its length. For the data item “Parameters”, we identified whether papers described modeling parameters and if so, which values were assigned to them.

RQ3: Considering that some papers may have not mentioned any pre-processing, we first checked which papers described data pre-processing. Then, we listed all pre-processing steps found and counted their frequencies.

RQ4: Considering the papers that described topic naming, we analyzed how generated topics were named (see Section 5.5). We used three types of approaches to describe how topics were named: (a) Manual - manual analysis and labeling of topics; (b) Automated - use of automated approaches to assign names to topics; and (c) Manual & Automated - a mix of manual and automated approaches to analyze and name topics. We also described the procedures performed to name topics.

5.1 Overview

As mentioned in Section  4.1 , we analyzed 111 papers published between 2009 and 2020 (see Appendix  A.1 - Papers Reviewed). Most papers were published after 2013. Furthermore, most papers were published in journals (68 papers in total, 32 in EMSE alone), while the remaining 43 papers appeared in conferences (mostly MSR with sixteen papers). Table  3 shows the number of papers by venue and year.

5.2 RQ1: Topic Models Used

In this Section we first discuss which topic modeling techniques are used (Section  5.2.1 ). Then, we explore why or for what purpose these techniques were used (Section  5.2.2 ). Finally, we describe the general contributions of papers in relation to their goals (Section  5.2.3 ).

5.2.1 Topic Modeling Techniques

The majority of the papers used LDA (80 out of 111) or an LDA-based technique (30 out of 111), such as Twitter-LDA (Zhao et al. 2011). The other topic modeling technique used is LSI. Figure 2 shows the number of papers per topic modeling technique. The total number (125) exceeds the number of papers reviewed (111) because ten papers experimented with more than one technique: Thomas et al. (2013), De Lucia et al. (2014), Binkley et al. (2015), Tantithamthavorn et al. (2018), Abdellatif et al. (2019) and Liu et al. (2020) experimented with LDA and LSI; Chen et al. (2014) experimented with LDA and Aspect and Sentiment Unification Model (ASUM); Chen et al. (2019) experimented with Labeled Latent Dirichlet Allocation (LLDA) and Label-to-Hierarchy Model (L2H); Rao and Kak (2011) experimented with LDA and MLE-LDA; and Hindle et al. (2016) experimented with LDA and LLDA. ASUM, LLDA, MLE-LDA and L2H are techniques based on LDA.

Figure 2: Number of papers per topic modeling technique

The popularity of LDA in software engineering has also been discussed by others, e.g., Treude and Wagner (2019). LDA is a three-level hierarchical Bayesian model (Blei et al. 2003b). LDA defines several hyperparameters, such as α (probability of topic z_i in document d_i), β (probability of word w_i in topic z_i) and k (number of topics to be generated) (Agrawal et al. 2018).
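To make the role of these hyperparameters more concrete, the following Python sketch generates a small synthetic corpus following the generative view of LDA. It is a minimal illustration, not the implementation used in any reviewed paper: it assumes that α and β act as symmetric Dirichlet concentration parameters over the document-topic and topic-word distributions, and the names `sample_dirichlet` and `generate_corpus`, as well as the toy vocabulary, are hypothetical.

```python
import random
from typing import List


def sample_dirichlet(concentration: float, size: int, rng: random.Random) -> List[float]:
    """Draw from a symmetric Dirichlet by normalising Gamma(concentration, 1) samples."""
    # The tiny floor guards against numerical underflow for very small concentrations.
    draws = [max(rng.gammavariate(concentration, 1.0), 1e-300) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab: List[str], k: int, alpha: float, beta: float,
                    n_docs: int, doc_len: int, seed: int = 0) -> List[List[str]]:
    """Generate n_docs documents of doc_len words each from k latent topics."""
    rng = random.Random(seed)
    # One word distribution per topic, drawn with concentration beta.
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(k)]
    corpus = []
    for _ in range(n_docs):
        # One topic distribution per document, drawn with concentration alpha.
        theta = sample_dirichlet(alpha, k, rng)
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(k), weights=theta)[0]   # choose a topic for this word
            w = rng.choices(vocab, weights=phi[z])[0]     # choose a word from that topic
            doc.append(w)
        corpus.append(doc)
    return corpus


if __name__ == "__main__":
    vocab = ["bug", "fix", "test", "ui", "api", "crash"]
    for doc in generate_corpus(vocab, k=2, alpha=0.5, beta=0.1, n_docs=3, doc_len=8):
        print(" ".join(doc))
```

Smaller values of α tend to concentrate each document on fewer topics, and smaller values of β tend to concentrate each topic on fewer words. Fitting LDA to real data is the inverse problem: inferring the topic and word distributions from observed documents.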

Thirty-seven (out of 75) papers applied LDA with Gibbs Sampling (GS). Gibbs sampling is a Markov Chain Monte Carlo algorithm that samples from conditional distributions of a target distribution. Used with LDA, it is an approximate stochastic process for computing α and β (Griffiths and Steyvers 2004 ). According to experiments conducted by Layman et al. ( 2016 ), Gibbs sampling in LDA parameter estimation ( α and β ) resulted in lower perplexity than the Variational Expectation-Maximization (VEM) estimations. Perplexity is a standard measure of performance for statistical models of natural language, which indicates the uncertainty in predicting a single word. Therefore, lower values of perplexity mean better model performance (Griffiths and Steyvers 2004 ).
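The following Python sketch shows one common way to compute perplexity for a fitted topic model: the exponential of the negative average log-likelihood per word. It assumes that the document-topic probabilities (`theta`) and topic-word probabilities (`phi`) are already available as nested lists and that words are encoded as integer ids; these names and data layouts are assumptions made for illustration and do not correspond to a specific tool used in the reviewed papers.

```python
import math
from typing import List


def perplexity(docs: List[List[int]], theta: List[List[float]], phi: List[List[float]]) -> float:
    """Perplexity of a corpus under a topic model.

    docs[d]     -- word ids of document d
    theta[d][z] -- probability of topic z in document d
    phi[z][w]   -- probability of word w in topic z
    """
    log_likelihood = 0.0
    n_words = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Probability of the word: mixture over all topics of the document.
            p_w = sum(theta[d][z] * phi[z][w] for z in range(len(phi)))
            log_likelihood += math.log(p_w)
            n_words += 1
    if n_words == 0:
        raise ValueError("corpus contains no words")
    return math.exp(-log_likelihood / n_words)
```

Under this definition, a model that assigns uniform probability to a vocabulary of V words has a perplexity of V, and a model that predicts the observed words with higher probability yields a lower value.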

Thirty papers applied modified or extended versions of LDA (“LDA-based” in Fig. 2). Table 4 shows a comparison between these LDA-based techniques. Eleven papers proposed a new extension of LDA to adapt LDA to software engineering problems (hence the same reference in the third and fourth column of Table 4). One example is the Multi-feature Topic Model (MTM) technique by Xia et al. (2017b), which implements a supervised version of LDA to create a bug triaging approach. The other 19 papers applied existing modifications of LDA proposed by others (third column in Table 4). For example, Hu and Wong (2013) used the Citation Influence Topic Model (CITM), developed by Dietz et al. (2007), which models the influence of citations in a collection of publications.

The other topic modeling technique, LSI (Deerwester et al. 1990), was published in 1990, before LDA, which was published in 2003. LSI is an information extraction technique that reduces the dimensionality of a term-document matrix using a reduction factor k (number of topics) (Deerwester et al. 1990). Compared to LSI, LDA follows a generative process that is statistically more rigorous (Blei et al. 2003b; Griffiths and Steyvers 2004). Of the 16 papers that used LSI, seven papers compared this technique to others:

One paper (Rosenberg and Moonen 2018) compared LSI with two other dimensionality reduction techniques: Principal Component Analysis (PCA) (Wold et al. 1987) and Non-Negative Matrix Factorization (NMF) (Lee and Seung 1999). The authors applied these models to automatically group log messages of continuous deployment runs that failed for the same reasons.

Four papers applied LDA and LSI at the same time to compare the performance of these models to Vector Space Model (VSM) (Salton et al. 1975), an algebraic model for information extraction. These studies supported documentation (De Lucia et al. 2014); bug handling (Thomas et al. 2013; Tantithamthavorn et al. 2018); and maintenance tasks (Abdellatif et al. 2019).

Regarding the other two papers, Binkley et al. (2015) compared LSI to Query likelihood LDA (QL-LDA) and other information extraction techniques to determine the best model for locating features in source code; and Liu et al. (2020) compared LSI and LDA to Generative Vector Space Model (GVSM), a deep learning technique, to select the best-performing model for documentation traceability to source code in multilingual projects.
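Since VSM serves as the baseline in several of these comparisons, the following Python sketch illustrates its basic idea: documents and a query are represented as term-weight vectors and ranked by cosine similarity. The sketch assumes raw term frequencies as weights and whitespace tokenization; the reviewed studies may use other weighting schemes (e.g., TF-IDF) and pre-processing, and the function names and file names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query: str, documents: Dict[str, str]) -> List[str]:
    """Return document names ordered from most to least similar to the query."""
    query_vector = Counter(query.lower().split())
    scores = {
        name: cosine_similarity(query_vector, Counter(text.lower().split()))
        for name, text in documents.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    files = {
        "Parser.java": "parse token syntax error token",
        "Login.java": "user password login session",
    }
    print(rank_documents("login password fails", files))
```

In a bug localization setting, the query would be the text of a bug report and the documents would be source code files or methods.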

5.2.2 Supported Tasks

As mentioned before, we aimed to understand why topic modeling was used in papers, e.g., if topic modeling was used to develop techniques to support specific software engineering tasks, or if it was used as a data analysis technique in exploratory studies to understand the content of large amounts of textual data. We found that the majority of papers aimed at supporting a particular task, but 21 papers (see Table  5 ) used topic modeling in empirical exploratory and descriptive studies as a data analysis technique.

We extracted the software engineering tasks described in each study (e.g., bug localization, bug assignment, bug triaging) and then grouped them into eight more generic tasks (e.g., bug handling) considering typical software development activities such as requirements, documentation and maintenance (Leach 2016). The specific tasks collected from papers are available online (see Footnote 1). Note that we kept “Bug handling” and “Refactoring” separate rather than merging them into maintenance because of the number of papers (bug handling) and the cross-cutting nature (refactoring) in these categories. Each paper was related to one of these tasks:

Architecting: tasks related to architecture decision making, such as selection of cloud or mash-up services (e.g., Belle et al. ( 2016 ));

Bug handling: bug-related tasks, such as assigning bugs to developers, prediction of defects, finding duplicate bugs, or characterizing bugs (e.g., Naguib et al. ( 2013 ));

Coding: tasks related to coding, e.g., detection of similar functionalities in code, reuse of code artifacts, prediction of developer behaviour (e.g., Damevski et al. ( 2018 ));

Documentation: support software documentation, e.g., by localizing features in documentation, automatic documentation generation (e.g., Souza et al. ( 2019 ));

Maintenance: software maintenance-related activities, such as checking the consistency of versions of a software system, or investigating changes to or use of a system (e.g., Silva et al. (2019));

Refactoring: support refactoring, such as identifying refactoring opportunities and removing bad smells from source code (e.g., Bavota et al. (2014b));

Requirements: related to software requirements evolution or recommendation of new features (e.g., Galvis Carreno and Winbladh ( 2012 ));

Testing: related to identification or prioritization of test cases (e.g., Thomas et al. ( 2014 )).

Table  5 groups papers based on the topic modeling technique and the purpose. Few papers applied topic modeling to support Testing (three papers) and Refactoring (three papers). Bug handling is the most frequent supported task (33 papers). From the 21 exploratory studies, 13 modeled topics from developer communication to identify developers’ information needs: 12 analyzed posts on Stack Overflow, a Q&A website for developers (Chatterjee et al. 2019 ; Bajaj et al. 2014 ; Ye et al. 2017 ; Bagherzadeh and Khatchadourian 2019 ; Ahmed and Bagherzadeh 2018 ; Barua et al. 2014 ; Rosen and Shihab 2016 ; Zou et al. 2017 ; Chen et al. 2019 ; Han et al. 2020 ; Abdellatif et al. 2020 ; Haque and Ali Babar 2020 ) and one paper analyzed blog posts (Pagano and Maalej 2013 ). Regarding the other eight exploratory studies, three papers investigated web search queries to also identify developers’ information needs (Xia et al. 2017a ; Bajracharya and Lopes 2009 ; 2012 ); four papers investigated end user documentation to analyse users’ feedback on mobile apps (Tiarks and Maalej 2014 ; El Zarif et al. 2020 ; Noei et al. 2018 ; Hu et al. 2018 ); and one paper investigated historical “bug” reports of NASA systems to extract trends in testing and operational failures (Layman et al. 2016 ).

5.2.3 Types of Contribution

For each study, we identified what type of contribution it presents based on the study goal. We used three types of contributions (“Approach”, “Exploration” and “Comparison”, as described below) by analyzing the research questions and main results of each study. A study could contribute either an “Approach” or an “Exploration”, while “Comparison” is orthogonal, i.e., a study that presents a new approach could present a comparison of topic models as part of this contribution. Similarly, a comparison of topic models can also be part of an exploratory study.

Approach: a study develops an approach (e.g., technique, tool, or framework) to support software engineering activities based on or with the support of topic models. For example, Murali et al. ( 2017 ) developed a framework that applies LDA to Android API methods to discover types of API usage errors, while Le et al. ( 2017 ) developed a technique (APRILE+) for bug localization which combines LDA with a classifier and an artificial neural network.

Exploration: a study applies topic modeling as the technique to analyze textual data collected in an empirical study (in contrast to for example open coding). Studies that contributed an exploration did not propose an approach as described in the previous item, but focused on getting insights from data. For example, Barua et al. ( 2014 ) applied LDA to Stack Overflow posts to discover what software engineering topics were frequently discussed by developers; Noei et al. ( 2018 ) explored the evolution of mobile applications by applying LDA to app descriptions, release notes, and user reviews.

Comparison: the study (that can also contribute with an “Approach” or an “Exploration”) compares topic models to other approaches. For example, Xia et al. (2017b) compared their bug triaging approach (based on the so-called Multi-feature Topic Model - MTM) with similar approaches that apply machine learning (Bugzie (Tamrawi et al. 2011)) and SVM-LDA (combining a classifier with LDA (Somasundaram and Murphy 2012)). On the other hand, De Lucia et al. (2014) compared LDA and LSI to define guidelines on how to build effective automatic text labeling techniques for program comprehension.

Of the papers that contributed an approach, twenty-two combined a topic modeling technique with one or more other techniques applied for text mining:

Information extraction (e.g., VSM) (Nguyen et al. 2012 ; Zhang et al. 2018 ; Chen et al. 2020 ; Thomas et al. 2013 ; Fowkes et al. 2016 );

Classification (e.g., Support Vector Machine - SVM) (Hindle et al. 2013 ; Le et al. 2017 ; Liu et al. 2017 ; Demissie et al. 2020 ; Zhao et al. 2020 ; Shimagaki et al. 2018 ; Gopalakrishnan et al. 2017 ; Thomas et al. 2013 );

Clustering (e.g., K-means) (Jiang et al. 2019 ; Cao et al. 2017 ; Liu et al. 2017 ; Zhang et al. 2016 ; Altarawy et al. 2018 ; Demissie et al. 2020 ; Gorla et al. 2014 );

Structured prediction (e.g., Conditional Random Field - CRF) (Ahasanuzzaman et al. 2019 );

Artificial neural networks (e.g., Recurrent Neural Network - RNN) (Murali et al. 2017 ; Le et al. 2017 );

Evolutionary algorithms (e.g., Multi-Objective Evolutionary Algorithm - MOEA) (Blasco et al. 2020 ; Pérez et al. 2018 );

Web crawling (Nabli et al. 2018 ).

Pagano and Maalej ( 2013 ) was the only study that contributed an exploration that combined LDA with another text mining technique. To analyze how developer communities use blogs to share information, the authors applied LDA to extract keywords from blog posts and then analyzed related “streams of events” (commit messages and releases by time in relation to blog posts), which were created with Sequential pattern mining.

Regarding comparisons, we found that (1) 13 out of the 63 papers that contribute an approach also include some form of comparison, and (2) ten out of the 48 papers that contribute an exploration also include some form of comparison. We discuss comparisons in more detail below in Section 6.1.2.

5.3 RQ2: Topic Model Inputs

In this section we first discuss the type of data (Section  5.3.1 ). Then we discuss the actual textual documents used for topic modeling (Section  5.3.2 ). Finally, we describe which model parameters were used (Section  5.3.3 ) to configure models.

5.3.1 Types of Data

Types of data help us describe the textual software engineering content that has been analyzed with topic modeling. We identified 12 types of data in selected papers as shown in Table  6 . In some papers we identified two or three of these types of data; for example, the study of Tantithamthavorn et al. ( 2018 ) dealt with issue reports, log information and source code.

Source code (37 occurrences), issue/bug reports (22 occurrences) and developer communication (20 occurrences) were the most frequent types of data used. Seventeen papers used two to four types of data in their topic modeling technique; twelve of these papers used a combination of source code with another type of data. For example, Sun et al. ( 2015 ) generated topics from source code and developer communication to support software maintenance tasks, and in another study, Sun et al. ( 2017 ) used topics found in source code and commit messages to assign bug-fixing tasks to developers.

5.3.2 Documents

A document refers to a piece of textual data that can be longer or shorter, such as a requirements document or a single e-mail subject. Documents are concrete instances of the types of data discussed above. Figure  3 shows documents (per type of data) and how often we found them in papers. The most frequent documents are bug reports (12 occurrences), methods from source code (9 occurrences), Q&A posts (9 occurrences) and user reviews (8 occurrences).

Figure 3: Documents (leaves in the figure) by type of data (nodes in the figure)

We also analyzed document length and found the following:

In general, papers described the length of documents in number of words (see Table 7 and Footnote 2). In contrast, two papers (Moslehi et al. 2016, 2020) described document length in minutes of screencast transcriptions (videos of one to ten minutes, with no information about the size of the transcripts). Sixteen papers mentioned the actual length of the documents (see Table 7). Ten papers that described the actual document length did so when describing the data used for topic modeling; four papers discussed document length while describing results; and one mentioned document length as a metric for comparing different data sources.

Most papers (80 out of 111) did not mention document length and also did not acknowledge any limitations or the impact of document length on topics.

Fifteen papers did not mention the actual document length, but at some point acknowledged the influence of document length on topic modeling. For example, Abdellatif et al. (2019) mentioned that the documents in their data set were “not long”. Similarly, Yan et al. (2016b) did not mention the length of the bug reports used but discussed the impact of the vocabulary size of their corpus on results. Moslehi et al. (2018) mentioned document length as a limitation and acknowledged that using LDA on short documents was a threat to construct validity. According to these authors, using techniques specific for short documents could have improved the outcomes of their topic modeling.

5.3.3 Model Parameters

Topic models can be configured with parameters that impact how topics are generated. For example, LDA has typically been used with symmetric Dirichlet priors over 𝜃 (document-topic distributions) and ϕ (topic-word distributions) with fixed values for α and β (Wallach et al. 2009). Wallach et al. (2009) explored the robustness of a topic model with asymmetric priors over 𝜃 (i.e., varying values for α) and a symmetric prior (fixed value for β) over ϕ. Their study found that such a topic model can capture more distinct and semantically related topics, i.e., the words in clusters are more distinct. Therefore, we checked which parameters and values were used in papers. Overall, we found the following:

Eighteen of the 111 papers do not mention parameters (e.g., number of topics k, hyperparameters α and β). Thirteen of these papers use LDA or an LDA-based technique, four papers use LSI, while Liu et al. (2020) use LDA and LSI.

The remaining 93 papers mention at least one parameter. The most frequent parameters discussed were k , α and β :

Fifty-eight papers mentioned actual values for k , α and β ;

Two papers mentioned actual values for α and β , but no values for k ;

Twenty-nine papers included actual values for k but not for α and β ;

Thirty-two (out of 58) papers mentioned other parameters in addition to k , α and β . For example, Chen et al. ( 2019 ) applied L2H (in comparison to LLDA), which uses the hyperparameters γ 1 and γ 2 ;

One paper (Rosenberg and Moonen 2018 ) that applied LSI, mentioned the parameter “similarity threshold” rather than k , α and β .

We then had a closer look at the 60 papers that mentioned actual values for hyperparameters α and β :

α based on k: The most frequent setting (29 papers) was α = 50/k and β = 0.01 (i.e., α depended on the number of topics, a strategy suggested by Steyvers and Griffiths (2010) and Wallach et al. (2009)). These values are a default setting in Gibbs Sampling implementations for LDA such as Mallet (see Footnote 3).

Fixed α and β : Five papers fixed 0.01 for both hyperparameters, as suggested by Hoffman et al. ( 2010 ). Another eight papers fixed 0.1 for both hyperparameters, a default setting in Stanford Topic Modeling Toolbox (TMT); Footnote 4 and three other papers fixed α = 0.1 and β = 1 (these three studies applied RTM).

Varying α or β : Four papers tested different values for α , where two of these papers also tested different values for β ; and one paper varied β but fixed a value for α .

Optimized parameters: Four papers obtained optimized values for hyperparameters (Sun et al. 2015; Catolino et al. 2019; Yang et al. 2017; Zhang et al. 2018). These papers applied LDA-GA (as proposed by Panichella et al. (2013)), which uses genetic algorithms to find the best values for LDA hyperparameters. Regarding the actual values chosen for optimized hyperparameters, Catolino et al. (2019) did not mention the values for hyperparameters; Sun et al. (2015) and Yang et al. (2017) mentioned only the values used for k; and Zhang et al. (2018) described the values for k, α and β.
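The most frequent setting reported above (α = 50/k and β = 0.01) can be written as a small helper. The following Python sketch only encodes this heuristic; the function name `heuristic_hyperparameters` is hypothetical and does not correspond to an API of Mallet or any other toolkit.

```python
from typing import Tuple


def heuristic_hyperparameters(k: int) -> Tuple[float, float]:
    """Return (alpha, beta) for k topics using alpha = 50/k and beta = 0.01."""
    if k <= 0:
        raise ValueError("the number of topics k must be positive")
    alpha = 50.0 / k   # alpha shrinks as the number of topics grows
    beta = 0.01        # fixed, independent of k
    return alpha, beta


if __name__ == "__main__":
    for k in (10, 50, 100, 500):
        print(k, heuristic_hyperparameters(k))
```

For example, with k = 100 the heuristic gives α = 0.5, while with k = 500 it gives α = 0.1.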

Regarding the values for k we observed the following:

The 90 papers that mentioned values for k modeled three (Cao et al. 2017 ) to 500 (Li et al. 2018 ; Lukins et al. 2010 ; Chen et al. 2017 ) topics;

Twenty-four (out of 90) papers mentioned that a range of values for k was tested in order to check the performance of the technique (e.g., Xia et al. ( 2017b )) or as a strategy to select the best number of topics (e.g., Layman et al. ( 2016 ));

Although the remaining 66 (out of 90) papers mentioned a single value used for k, most of them acknowledged that they had tried several numbers of topics or used the number of topics suggested by other studies.

As can be seen in Table 7, there is no common trend in the values chosen for hyperparameters or k depending on the document or document length.

5.4 RQ3: Pre-processing Steps

Thirteen of the papers did not mention what pre-processing steps were applied to the data before topic modeling. Seven papers only described how the data analyzed were selected, but not how they were pre-processed. Table  8 shows the pre-processing steps found in the remaining 91 papers. Each of these papers mentioned at least one of these steps.

Removing noisy content (76 occurrences), Stemming terms (61 occurrences) and Splitting terms (33 occurrences) were the most used pre-processing steps. The least frequent pre-processing step (Resolving negations) was found only in the studies of Noei et al. ( 2019 ) and Noei et al. ( 2018 ). Resolving synonyms and Expanding contractions were also less frequent, with three occurrences each.

Table  9 shows the types of noise removal in papers and their frequency. Most of the papers that described pre-processing steps removed stop words (76 occurrences). Stop words are the most common words in a language, such as “a/an” and “the” in English. Removing stop words allows topic modeling techniques to focus on more meaningful words in the corpus (Miner et al. 2012 ). Eight papers mentioned the stop words list used: Layman et al. ( 2016 ) and Pettinato et al. ( 2019 ) used the SMART stop words list; Footnote 5 Martin et al. ( 2015 ) and Hindle et al. ( 2013 ) used the Natural Language Toolkit English stop words list; Footnote 6 Bagherzadeh and Khatchadourian ( 2019 ), Ahmed and Bagherzadeh ( 2018 ) and Yan et al. ( 2016b ) used the Mallet stop words list; Footnote 7 and Mezouar et al. ( 2018 ) used the Moby stop words list. Footnote 8

As can be seen in Table  9 , some papers removed words based on the frequency of their occurrence (most or least frequent terms) or length (words shorter than four, three or two letters or long terms). Other papers removed long paragraphs. For example, Henß et al. ( 2012 ) removed paragraphs longer than 800 characters because most paragraphs in their data set were shorter than that. We also found two papers that removed short documents: Gorla et al. ( 2014 ) removed documents with fewer than ten words, and Palomba et al. ( 2017 ) removed documents with fewer than three words. The concept of non-informative content depends on the context of each paper. In general, it refers to any data considered not relevant for the objective of the study. For example, Choetkiertikul et al. ( 2017 ), which aimed at predicting bugs in issue reports, removed issues that took too much time to be resolved. Noei et al. ( 2019 ) and Fu et al. ( 2015 ) removed content (end user reviews and commit messages) that did not describe feedback or cause of change.
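The most frequent pre-processing steps described above (splitting terms, removing stop words, removing short words and stemming) can be chained as in the following Python sketch. It is a simplified illustration under several assumptions: the stop word list is a tiny hypothetical sample rather than one of the lists cited above, `naive_stem` is a toy suffix stripper standing in for a real stemmer, and term splitting only handles simple camelCase and snake_case identifiers.

```python
import re
from typing import List

# Tiny illustrative stop word list (hypothetical; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "when"}


def split_identifier(token: str) -> List[str]:
    """Split camelCase and snake_case identifiers into lower-case terms."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", token).replace("_", " ")
    return spaced.lower().split()


def naive_stem(term: str) -> str:
    """Strip a few common suffixes; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def preprocess(text: str, min_length: int = 3) -> List[str]:
    """Tokenize, split identifiers, drop stop words and short terms, then stem."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)
    terms = [term for token in tokens for term in split_identifier(token)]
    terms = [t for t in terms if t not in STOP_WORDS and len(t) >= min_length]
    return [naive_stem(t) for t in terms]


if __name__ == "__main__":
    print(preprocess("The parseConfig method crashed when loading user_settings"))
```

The order of the steps matters: identifiers are split before stop words are removed so that stop words embedded in identifiers are also filtered out.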

5.5 RQ4: Topic Naming

Topic naming is about assigning labels (names) to topics (word clusters) to give the clusters a human-understandable meaning. Seventy-five papers (out of 111) did not mention whether or how topics were named. These papers only used the word clusters for analysis, but did not require a name. For example, Xia et al. ( 2017a ) and Canfora et al. ( 2014 ) did not name topics, but mapped the word clusters to the documents (search queries and source code comments) used as input for topic modeling. These papers used the probability of a document to belong to a topic ( 𝜃 ) to associate a document to the topic with the highest probability.
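Associating each document with its most probable topic amounts to taking the index of the largest value in the document's topic distribution. The following Python sketch shows this step, assuming that 𝜃 is available as a mapping from a document identifier to a list of topic probabilities; the function names and document identifiers are hypothetical.

```python
from typing import Dict, Hashable, List


def dominant_topic(topic_probabilities: List[float]) -> int:
    """Index of the topic with the highest probability (first one on ties)."""
    return max(range(len(topic_probabilities)), key=topic_probabilities.__getitem__)


def assign_documents(theta: Dict[Hashable, List[float]]) -> Dict[Hashable, int]:
    """Map each document to the index of its most probable topic."""
    return {doc_id: dominant_topic(row) for doc_id, row in theta.items()}


if __name__ == "__main__":
    theta = {"query-1": [0.1, 0.7, 0.2], "comment-9": [0.6, 0.3, 0.1]}
    print(assign_documents(theta))
```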

From the 36 papers (out of 111) that mentioned topic naming (see Table  10 ), we identified three ways of how they named topics:

Automated: Assigning names to word clusters without human intervention;

Manual: Manually checking the meaning and the combination of words in a cluster to “deduce” a name, sometimes validated with expert judgment;

Manual & Automated: Mix of manual and automated; e.g., topics are manually labeled for one set of clusters to then train a classifier for naming another set of clusters.

Most of the papers (30 out of 36) assigned one name to one topic. However, we identified six papers that used one name for multiple topics (Hindle et al. 2013; Pagano and Maalej 2013; Bajracharya and Lopes 2012; Rosen and Shihab 2016) or labeled a topic with multiple names (Zou et al. 2017; Gao et al. 2018). Two of the papers (Hindle et al. 2013; Bajracharya and Lopes 2012) that assigned one name to multiple topics used predefined labels, and in the other two papers (Pagano and Maalej 2013; Rosen and Shihab 2016) authors interpreted words in the clusters to deduce names.

Regarding the papers that assigned multiple names to a topic, Zou et al. (2017) assigned no, one or more names, depending on how many words in the predefined word list matched words in clusters. Gao et al. (2018) used an automated approach to label topics with the three most relevant phrases and sentences from the end user reviews used as input to their topic model. The relevance of phrases and sentences was obtained with the metrics Semantic and Sentiment scores proposed by these authors.
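Naming topics by matching cluster words against predefined word lists can be sketched as follows. This Python example is only an illustration of the general idea: the word lists, the labels and the `min_matches` threshold are hypothetical and are not taken from Zou et al. (2017), whose exact matching rule is not described here.

```python
from typing import Dict, List


def name_topic(topic_words: List[str],
               label_word_lists: Dict[str, List[str]],
               min_matches: int = 2) -> List[str]:
    """Return every label whose word list shares enough words with the topic.

    The result can be empty (no name), or contain one or several names.
    """
    topic = set(topic_words)
    return [
        label
        for label, words in label_word_lists.items()
        if len(topic & set(words)) >= min_matches
    ]


if __name__ == "__main__":
    labels = {
        "security": ["password", "encrypt", "auth", "token"],
        "performance": ["slow", "memory", "latency", "cpu"],
    }
    print(name_topic(["password", "auth", "login", "memory"], labels))
```

Depending on the threshold, a topic whose words overlap with several word lists receives several names, while a topic without sufficient overlap remains unnamed.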

6 Discussion

6.1 RQ1: Topic Modeling Techniques

6.1.1 Summary of Findings

LDA is the most frequently used topic model. Almost all papers (95 out of 111) applied LDA or an LDA-based technique, while nine papers applied LSI to identify topics and seven papers used LDA and LSI. Regarding the papers that used LDA-based techniques, eleven (out of 30) proposed their own LDA-based technique (Fu et al. 2015; Nguyen et al. 2011; Liu et al. 2017; Cao et al. 2017; Panichella et al. 2013; Yan et al. 2016a; Xia et al. 2017b; Nguyen et al. 2012; Damevski et al. 2018; Gao et al. 2018; Rao and Kak 2011). This may indicate that the default LDA implementation is not always adequate to support specific software engineering tasks or to extract meaningful topics from all types of data. We discuss more about topic modeling techniques and their inputs in Section 6.2.2. Furthermore, we found that topic modeling is used to develop tools and methods to support software engineers and concrete tasks (the most frequently supported task we found was bug handling), but also as a data analysis technique for textual data to explore empirical questions (see for example the “oldest” paper in our sample, published in 2009 (Bajracharya and Lopes 2009)).

One aspect that we did not specifically address in this review, but which impacts the applicability of topic models, is their computational overhead. Computational overhead refers to the processing time and computational resources (e.g., memory, CPU) required for topic modeling. As discussed by others, topic modeling can be computationally intensive (Hoffman et al. 2010; Treude and Wagner 2019; Agrawal et al. 2018). However, we found that only a few papers (seven out of 111) mentioned computational overhead at all. Of these seven papers, five mentioned processing time (Bavota et al. 2014b; Zhao et al. 2020; Luo et al. 2016; Moslehi et al. 2016; Chen et al. 2020), one paper mentioned computational requirements and some processing times (e.g., processor, data pre-processing time, LDA processing time and clustering processing time), and one paper only mentioned that their technique was processed in “few seconds” (Murali et al. 2017). Hence, based on the reviewed studies we cannot provide broader insights into the practical applicability and potential constraints of topic modeling based on the computational overhead.

6.1.2 Comparative Studies

As mentioned in Sections 5.2.1 and 5.2.3, we identified studies that used more than one topic modeling technique and compared their performance. In detail, we found studies that (1) compared topic modeling techniques to information extraction techniques, such as Vector Space Model (VSM), an algebraic model (Salton et al. 1975) (see Table 11), (2) proposed an approach that uses a topic modeling technique and compared it to other approaches (which may or may not use topic models) with similar goals (see Table 12), and (3) compared the performance of different settings for a topic modeling technique or a newly proposed approach that utilizes topic models (see Table 13). The column “Metric” of Tables 11, 12 and 13 shows the metrics used in the comparisons to decide which techniques performed “better” (based on the metrics’ interpretation). Metrics in bold were proposed for or adapted to a specific context (e.g., SCORE and Effort reduction), while the other metrics are standard NLP metrics (e.g., Precision, Recall and Perplexity). Details about the metrics used to compare the techniques are provided in Appendix A.2 - Metrics Used in Comparative Studies.

As shown in Table  11 , ten papers compared topic modeling techniques to information extraction techniques. For example, Rosenberg and Moonen ( 2018 ) compared LSI with two other dimensionality reduction techniques (PCA and NMF) to group log messages of failing continuous deployment runs. Nine out of these ten papers presented explorations, i.e., studies experimented with different models to discuss their application to specific software engineering tasks, such as bug handling, software documentation and maintenance. Thomas et al. ( 2013 ) on the other hand experimented with multiple models to propose a framework for bug localization in source code that applies the best performing model.

Four papers in Table 11 (De Lucia et al. 2014; Tantithamthavorn et al. 2018; Abdellatif et al. 2019; Thomas et al. 2013) compared the performance of LDA, LSI and VSM with source code and issue/bug reports. Except for De Lucia et al. (2014), these studies applied Top-k accuracy (see Appendix A.2 - Metrics Used in Comparative Studies) to measure the performance of models, and the best performing model was VSM. Tantithamthavorn et al. (2018) found that VSM achieves both the best Top-k performance and the least required effort for method-level bug localization. Additionally, according to De Lucia et al. (2014), VSM possibly performed better than LSI and LDA due to the nature of the corpus used in their study: LDA and LSI are ideal for heterogeneous collections of documents (e.g., user manuals from different systems), but in the study by De Lucia et al. (2014) each corpus was a collection of code classes from a single software system.
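To make the Top-k accuracy metric concrete, the following minimal Python sketch computes it under a common reading of the metric: the fraction of queries (e.g., bug reports) for which at least one relevant entity appears among the first k ranked results. The function name, the data layout and the toy rankings are assumptions for illustration and are not taken from the reviewed studies.

```python
def top_k_accuracy(rankings, relevant, k):
    """Fraction of queries with at least one relevant item in the top k.

    rankings: dict mapping a query id to a list of items, best match first.
    relevant: dict mapping a query id to the set of truly relevant items.
    """
    hits = 0
    for query_id, ranked_items in rankings.items():
        if any(item in relevant[query_id] for item in ranked_items[:k]):
            hits += 1
    return hits / len(rankings)


# Hypothetical rankings produced by some retrieval model (e.g., VSM or LDA).
rankings = {
    "bug-1": ["Parser.parse", "Lexer.next", "Cache.get"],
    "bug-2": ["Cache.get", "Cache.put", "Parser.parse"],
    "bug-3": ["Lexer.next", "Cache.put", "Parser.parse"],
}
# Hypothetical ground truth: methods actually changed to fix each bug.
relevant = {
    "bug-1": {"Parser.parse"},
    "bug-2": {"Parser.parse"},
    "bug-3": {"Cache.evict"},
}

print(top_k_accuracy(rankings, relevant, k=1))
print(top_k_accuracy(rankings, relevant, k=3))
```

Comparing two models then amounts to computing this value for each model's rankings over the same queries and the same k.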

Ten studies proposed an approach that uses a topic modeling technique and compared it to similar approaches (shown in Table  12 ). In column “Approaches compared” of Table  12 , the approach in bold is the one proposed by the study (e.g., Cao et al. 2017 ) or the topic modeling technique used in their approach (e.g., Thomas et al. 2014 ). All newly proposed approaches were the best performing ones according to the metrics used.

In addition to the papers mentioned in Tables  11 and  12 , four papers compared the performance of different settings for a topic modeling technique or tested which topic modeling technique works best in their newly proposed approach (see Table  13 ). Biggers et al. ( 2014 ) offered specific recommendations for configuring LDA when localizing features in Java source code, and observed that certain configurations outperform others. For example, they found that commonly used heuristics for selecting LDA hyperparameter values ( beta = 0.01 or beta = 0.1) in source code topic modeling are not optimal (similar to what has been found by others, see Section  3.2 ). The other three papers (Chen et al. 2014 ; Fowkes et al. 2016 ; Poshyvanyk et al. 2012 ) developed approaches which were tested with different settings (e.g., the approach applying LDA or ASUM (Chen et al. 2014 )).

Regarding the datasets used by comparative studies, only Rao and Kak (2011) used a benchmarking dataset (iBUGS). Most of the comparative studies (13 out of 24) used source code or issue/bug reports from open source software, which are subject to evolution. The advantage of using benchmarking datasets rather than “living” datasets (e.g., an open source Java system) is that their data are static and the same across studies. Additionally, data in benchmarking datasets are usually curated. This means that the results of replicating studies can be compared to the original study when both used the same benchmarking dataset.

Finally, we highlight that each of the above mentioned comparisons has a specific context. This means that, for example, the type of data analyzed (e.g., Java classes), the parameter setting (e.g., k = 50), the goal of the comparison (e.g., to select the best model for bug localization or for tracing documentation in source code) and pre-processing (e.g., stemming and stop word removal) were different. Therefore, it is not possible to “synthesize” the results from the comparisons across studies by aggregating the different comparisons in different papers, even for studies that appear to have similar goals or use the same topic modeling techniques, such as comparing the same models with similar types of data (such as Tantithamthavorn et al. 2018 and Abdellatif et al. 2019 ).

6.2 RQ2: Inputs to Topic Models

6.2.1 Summary of Findings

Source code, developer communication and issue/bug reports were the most frequent types of data used for topic modeling in the reviewed papers. Consequently, most of the documents referred to individual or groups of functions or methods, individual Q&A posts, or individual bug reports; another frequent document was an individual user review (more discussions are in Section  6.2.3 ). We also found that few papers (16 out of 111) mentioned the actual length of documents used for topic modeling (we discuss this more in Section  6.2.2 ).

Regarding modeling parameters, most of the papers (93 out of 111) explicitly mentioned the configuration of at least one parameter, e.g., k, α or β for LDA. We observed that the setting α = 50/k and β = 0.01 (asymmetric α and symmetric β) as suggested by Steyvers and Griffiths (2010) and Wallach et al. (2009) was frequently used (28 out of 93 papers). Additionally, papers that applied LDA mostly used the default parameters of the tools used to implement LDA (e.g., Mallet (footnote 3) with α = 50/k and β = 0.01 as default). This finding is similar to what has been reported by others, e.g., according to another review by Agrawal et al. (2018), LDA is frequently applied “as is out-of-the-box” or with little tuning. This means that studies may rely on the default settings of the tools used with their topic modeling technique, such as Mallet and TMT, rather than try to optimize parameters.
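As a small worked example of the frequently used heuristic α = 50/k and β = 0.01, the sketch below computes the resulting priors for a few values of k. The helper name is hypothetical, and the code only reproduces the arithmetic of the heuristic as stated above; it does not call or mirror any particular tool.

```python
def heuristic_lda_priors(k, beta=0.01):
    """Return (alpha, beta) for the heuristic alpha = 50 / k, beta = 0.01."""
    if k <= 0:
        raise ValueError("the number of topics k must be positive")
    return 50.0 / k, beta


for k in (10, 50, 100):
    alpha, beta = heuristic_lda_priors(k)
    print(f"k={k:3d}  alpha={alpha:.2f}  beta={beta}")
```

The example shows that under this heuristic α shrinks as the number of topics grows, so the choice of k and the choice of α are coupled rather than tuned independently.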

6.2.2 Documents and Parameters for Topic Models

Short texts: According to Lin et al. (2014), topic models such as LDA have been widely adopted and successfully used with traditional media like edited magazine articles. However, applying LDA to informal communication text such as tweets, comments on blog posts, instant messages and Q&A posts may be less successful. Such user-generated content is characterized by very short document length, a large vocabulary and a potentially broad range of topics. As a consequence, there are not enough words in a document to create meaningful clusters, which compromises the performance of topic modeling. This means that probabilistic topic models such as LDA perform sub-optimally when applied “as is” to short documents, even when hyperparameters (α and β in LDA) are optimized (Lin et al. 2014). In our sample there were only two papers that mentioned the use of an LDA-based technique specifically for short documents (Hu et al. 2019; Hu et al. 2018). Both applied Twitter-LDA to end user reviews. Furthermore, Moslehi et al. (2018) used a weighting algorithm in documents to generate topics with more relevant words; they also acknowledged that the use of a short text technique could have improved their topic model.

As shown in Table  7 , few papers mentioned the actual length of documents. Considering a single document from a corpus, we observed that most papers potentially used short texts (all documents found in papers are shown in Fig.  3 ). For example, papers used an individual search query (Xia et al. 2017a ), an individual Q&A post (Barua et al. 2014 ), an individual user review (Nayebi et al. 2018 ), or an individual commit message (Canfora et al. 2014 ) as a document. Among the papers that mentioned document length, the shortest documents were an individual commit message (9 to 20 words) (Canfora et al. 2014 ) and an individual method (14 words) (Tantithamthavorn et al. 2018 ). Both studies applied LDA.

Two approaches to improve the performance of LDA when analyzing short documents are pooling and contextualization (Lin et al. 2014 ). Pooling refers to aggregating similar (e.g., semantically or temporally) documents into a single document (Mehrotra et al. 2013 ). For example, among the papers analysed, Pettinato et al. ( 2019 ) used temporal pooling and combined short log messages into a single document based on a temporal order. Contextualization refers to creating subsets of documents according to a type of context; considering tweets as documents, the type of context can refer to time, user and hashtags associated with tweets (Tang et al. 2013 ). For example, Weng et al. ( 2010 ) combined all the individual tweets of an author into one pseudo-document (rather than treating each tweet as a document). Therefore, with the contextualization approach, the topic model uses word co-occurrences at a context level instead of at the document level to discover topics.
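The idea of merging short texts into longer pseudo-documents can be sketched in a few lines of Python. The record fields ("author", "hour", "text") and the function name are assumptions made for this illustration; the reviewed papers do not prescribe a data format.

```python
from collections import defaultdict


def pool_documents(records, key):
    """Concatenate short texts that share the same value for `key`."""
    pools = defaultdict(list)
    for record in records:
        pools[record[key]].append(record["text"])
    # Each pool becomes one pseudo-document for the topic model.
    return {value: " ".join(texts) for value, texts in pools.items()}


# Hypothetical short messages (e.g., tweets or log lines).
records = [
    {"author": "alice", "hour": "10", "text": "build failed on ci"},
    {"author": "bob", "hour": "10", "text": "null pointer in parser"},
    {"author": "alice", "hour": "11", "text": "ci cache is stale"},
]

by_author = pool_documents(records, "author")  # one pseudo-document per author
by_hour = pool_documents(records, "hour")      # temporal pooling by hour

print(by_author)
print(by_hour)
```

Grouping by author corresponds to the author-level aggregation described above, while grouping by a time bucket corresponds to temporal pooling; in both cases the topic model sees word co-occurrences across the merged texts rather than within a single short message.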

Hyperparameters: Table 14 shows the hyperparameter settings and types of data of the papers that mentioned the value of at least one model parameter. In Table 14 we also highlight the topic modeling techniques used. Note that some topic modeling techniques (e.g., RTM) can receive more parameters than the ones mentioned in Table 14 (e.g., number of documents, similarity thresholds); all parameters mentioned in papers are available online in the raw data of our study (footnote 1). When comparing hyperparameter settings, topic modeling techniques and types of data, we observed the following:

Papers that used LDA-GA, an LDA-based technique that optimizes hyperparameters with Genetic algorithms, applied it to data from developer documentation or source code;

LDA was used with all three types of hyperparameter settings across studies. The most common setting was α based on k for developer communication and source code;

Most of the LDA-based techniques applied fixed values for α and β .

Most of the papers that applied only LSI as the topic modeling technique did not mention hyperparameters. Because LSI is a simpler model than LDA, it generally requires only the number of topics k. For example, a paper that applied LSI to source code mentioned α and k (Poshyvanyk et al. 2012).

Number of topics: By relating the type of data to the number of topics, we aimed at finding whether the choice of the number of topics is related to the data used in the topic modeling techniques (see also Table 7). However, the number of topics used and the data in the studies are rather diverse. Therefore, the potential for synthesizing practices and offering insights from previous studies on how to choose the number of topics is rather limited.

Of the 90 papers that mentioned the number of topics (k), we found that 66 papers selected a specific number of topics (e.g., based on previous works with similar data or addressing the same task), while 24 papers used several numbers of topics (e.g., Yan et al. (2016b) used 10 to 120 topics in steps of 10). To provide an example of how the number of topics differed even when the same type of data was analyzed with the same topic modeling technique, we looked at studies that applied LDA to textual data from developer communication (mostly Q&A posts) to propose an approach to support documentation. For these papers we found one paper that did not mention k (Henß et al. 2012), one paper that modeled different numbers of topics (k = 10, 20, 30) (Asuncion et al. 2010), one paper that modeled k = 15 (Souza et al. 2019) and another paper that modeled k = 40 (Wang et al. 2015). This illustrates that there is no common or recommended practice that can be derived from the papers.

Some papers mentioned that they tested several numbers of topics before selecting the most appropriate value for k (with regard to the studies’ goals) but did not mention the range of values tested. Among the papers that mentioned such a range, we identified four studies (Nayebi et al. 2018; Chen et al. 2014; Layman et al. 2016; Nabli et al. 2018) that tested several values for k and used the perplexity of models (see details in Appendix A.2 - Metrics Used in Comparative Studies) to evaluate which value of k generated the best performing model; three studies (Zhao et al. 2020; Han et al. 2020; El Zarif et al. 2020) also selected the number of topics after testing several values for k; however, they used topic coherence (Röder et al. 2015) to evaluate models. One paper (Haque and Ali Babar 2020) used both perplexity and topic coherence to select a value for k. Metrics of topic coherence score the probability of a pair of words from the resulting word clusters being found together in (a) external data sources (e.g., Wikipedia pages) or (b) the documents used by the topic model that generated those word clusters (Röder et al. 2015).
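As an illustration of variant (b), the sketch below scores a topic by how often pairs of its top words co-occur in the modeled documents, using a UMass-style formulation (log of smoothed co-occurrence count over the document frequency of the higher-ranked word). This particular formula is one of several coherence measures and is chosen here as an assumption; the reviewed papers may have used other variants. The function name and the toy corpus are hypothetical.

```python
import math


def umass_coherence(top_words, documents, eps=1.0):
    """UMass-style coherence of a topic over the modeled documents.

    top_words: the topic's words, most probable first.
    documents: iterable of token lists (the corpus given to the topic model).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            base = doc_freq(top_words[l])
            if base == 0:
                continue  # word never occurs in the corpus; skip the pair
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + eps) / base)
    return score


documents = [
    ["bug", "fix"],
    ["bug", "fix", "crash"],
    ["menu", "color"],
]

coherent = umass_coherence(["bug", "fix"], documents)
incoherent = umass_coherence(["bug", "menu"], documents)
print(coherent, incoherent)
```

To select k with such a metric, one would train one model per candidate k, average the coherence over each model's topics, and prefer the k with the highest average; higher (less negative) scores indicate word clusters whose words tend to appear together.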

6.2.3 Supported Tasks, Types of Data and Types of Contribution

We looked into the relationship between the tasks supported by papers, the type of data used and the types of contributions (see Table  15 ). We observed the following:

Source code was a frequent type of data in papers; consequently it appeared for almost all supported tasks, except for exploratory studies;

Considering exploratory studies, most papers used developer communication (13 out of 21), followed by search queries and end user communication (three papers each);

Papers that supported bug handling mostly used issue/bug reports, source code and end user communication;

Log information was used by papers that supported maintenance, bug handling, and coding;

Considering the papers that supported documentation, three used transcript texts from speech;

Of the four papers that used developer documentation as the type of data, two supported architecting tasks and the other two supported documentation tasks.

Regarding the type of data, URLs and transcripts were only used in studies that contributed an approach.

We found that most of the exploratory studies used data that is less structured. For example, developer communication, such as Q&A posts and conversation threads, generally does not follow a standardized template. On the other hand, issue reports are typically submitted through forms, which enforce a certain structure.

6.3 RQ3: Data Pre-processing

6.3.1 Summary of Findings

Most of the papers (91 out of 111) pre-processed the textual data before topic modeling. Removing noisy content was the most frequent pre-processing step (as is typical for natural language processing), followed by stemming and splitting words. Miner et al. (2012) consider tokenizing as one of the basic data pre-processing steps in text mining. However, in comparison to other basic pre-processing steps such as stemming, splitting words and removing noise, tokenizing was not frequently found in papers (or at least was not mentioned).

Eight papers (Henß et al. 2012 ; Xia et al. 2017b ; Ahasanuzzaman et al. 2019 ; Abdellatif et al. 2019 ; Lukins et al. 2010 ; Tantithamthavorn et al. 2018 ; Poshyvanyk et al. 2012 ; Binkley et al. 2015 ) tested how pre-processing steps affected the performance of topic modeling or topic model-based approaches. For example, Henß et al. ( 2012 ) tested several pre-processing steps (e.g., removing stop words, long paragraphs and punctuation) in e-mail conversations analyzed with LDA. They found that removing such content increased LDA’s capability to grasp the actual semantics of software mailing lists. Ahasanuzzaman et al. ( 2019 ) proposed an approach which applies LDA and Conditional Random Field (CRF) to localize concerns in Stack Overflow posts. The authors did not incorporate stemming and stop words removal in their approach because in preliminary tests these pre-processing steps decreased the performance of the approach.

6.3.2 Pre-processing Different Types of Data

Table  16 shows how different types of data were pre-processed. We observed that stemming, removing noise, lowercasing, and splitting words were commonly used for all types of data. Regarding the differences, we observed the following:

For developer communication there were specific types of noisy content that were removed: URLs, HTML tags and code snippets. This might have happened because most of the papers used Q&A posts as documents, which frequently contain hyperlinks and code examples;

Removing non-informative content was frequently applied to end user communication and end user documentation;

Expanding contracted terms (e.g., “didn’t” to “did not”) was applied to end user communication and issue/bug reports;

Removing empty documents and eliminating extra white spaces were applied only in end user communication. Empty documents occurred in this type of data because after the removal of stop words no content was left (Chen et al. 2014 );

For source code there was a specific type of noise to be removed: programming-language-specific keywords (e.g., “public”, “class”, “extends”, “if”, and “while”).
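The noise-removal steps listed above for developer and end user communication can be sketched with regular expressions from the Python standard library. The patterns, the small contraction table and the function name are simplifying assumptions for illustration; real Q&A posts may need an HTML parser and a fuller contraction list.

```python
import re

# Simplified patterns (assumptions): code is wrapped in <pre> or <code> tags.
CODE_BLOCK = re.compile(r"<pre>.*?</pre>|<code>.*?</code>", re.DOTALL | re.IGNORECASE)
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")
CONTRACTIONS = {"didn't": "did not", "doesn't": "does not", "can't": "cannot"}


def clean_post(html):
    """Remove code snippets, HTML tags and URLs, then normalize the text."""
    text = CODE_BLOCK.sub(" ", html)   # drop code snippets first
    text = HTML_TAG.sub(" ", text)     # then the remaining HTML tags
    text = URL.sub(" ", text)          # then hyperlinks left in the text
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)  # expand contracted terms
    return " ".join(text.split())      # eliminate extra white space


post = (
    "<p>My parser didn't work, see https://example.com/docs</p>"
    "<pre><code>int x = 1;</code></pre>"
    "<p>Any <b>ideas</b> welcome</p>"
)
cleaned = clean_post(post)
print(cleaned)
```

The order matters in this sketch: code blocks are removed before generic tags so that their contents disappear with them, and tags are removed before URLs so that a URL pattern cannot swallow an adjacent closing tag.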

Table 16 shows that splitting words, stop words removal and stemming were frequently applied to source code, and most of these studies (15) applied these three steps at the same time. Studies that applied these pre-processing steps to source code mostly used methods, classes, or comments in classes/methods as documents. For example, Silva et al. (2016), who applied LDA, performed these three pre-processing steps on classes from two open source systems using TopicXP (Savage et al. 2010). TopicXP is an Eclipse plug-in that extracts source code, pre-processes it and executes LDA. This plug-in implements splitting words, stop words removal and stemming.

Splitting words was the most frequent pre-processing step in source code. Studies used this step to separate camel-case identifiers of methods and classes (e.g., the class constructor InvalidRequestTest produces the terms “invalid”, “request” and “test”). For example, Tantithamthavorn et al. (2018) compared LDA, LSI and VSM, testing different combinations of pre-processing steps applied to the method identifiers used as input to these techniques. The best performing approach was VSM with splitting words, stop words removal and stemming.

Removing stop words in source code refers to the exclusion of the most common words in a language (e.g., “a/an” and “the” in English), as in studies that used other types of data. Removing stop words in source code is different from removing programming language keywords, and studies mentioned these as separate steps. Lukins et al. (2010), for example, tested how removing stop words from their documents (comments and identifiers of methods) affected the topics generated by their LDA-based approach. They found that this step did not improve the results substantially.
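The source code steps discussed above (splitting identifiers, removing programming language keywords, and removing stop words as a separate step) can be sketched as follows. The keyword and stop word sets are deliberately tiny placeholders taken from the examples in the text, and the function names are hypothetical; stemming is omitted here.

```python
import re

# Placeholder lists (assumptions); real studies use full keyword and stop word lists.
JAVA_KEYWORDS = {"public", "class", "extends", "if", "while"}
STOP_WORDS = {"a", "an", "the"}


def split_identifier(identifier):
    """Split an identifier on underscores and camel-case boundaries."""
    tokens = []
    for part in re.split(r"[_\W]+", identifier):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [token.lower() for token in tokens]


def preprocess_source(text):
    """Tokenize source text, split identifiers, drop keywords and stop words."""
    tokens = []
    for raw in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        tokens.extend(split_identifier(raw))
    # Keyword removal and stop word removal are kept as two separate filters.
    tokens = [t for t in tokens if t not in JAVA_KEYWORDS]
    return [t for t in tokens if t not in STOP_WORDS]


print(split_identifier("InvalidRequestTest"))
print(preprocess_source("public class InvalidRequestTest extends BaseTest { /* the test */ }"))
```

Keeping the two filters separate mirrors the distinction made in the text: keywords depend on the programming language, whereas stop words depend on the natural language used in comments and identifiers.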

As mentioned in Section 5.4, stemming is the process of normalizing words into their single forms by identifying and removing prefixes, suffixes and pluralisation (e.g., “development”, “developer”, “developing” become “develop”). Regarding stemming in source code, papers normalized identifiers of classes and methods, comments related to classes and methods, test cases or a source code file. Three papers tested the effect of this pre-processing step on the performance of their techniques (Tantithamthavorn et al. 2018; Poshyvanyk et al. 2012; Binkley et al. 2015), and one of these papers also tested removing stop words and splitting words (Tantithamthavorn et al. 2018). Poshyvanyk et al. (2012) tested the effect of stemming classes on the performance of their LSI-based approach. The authors concluded that stemming can positively impact feature localization by producing topics (“concept lattices” in their study) that effectively organize the results of searches in source code. Binkley et al. (2015) compared the performance of LSI, QL-LDA and other techniques. They also tested the effects of stemming (with two different stemmers: Porter (footnote 9) and Krovetz (footnote 10)) and of not stemming on methods from five open source systems. These authors found that they obtained better performance in terms of the models’ Mean Reciprocal Rank (MRR, details in Appendix A.2 - Metrics Used in Comparative Studies) without stemming.
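For readers unfamiliar with MRR, the sketch below computes it using the standard definition: for each query, take the reciprocal of the rank of the first relevant result (0 if none is found), then average over all queries. The function name and the toy data are assumptions for illustration and do not reproduce the setup of Binkley et al. (2015).

```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1 / rank of the first relevant item per query (0 if absent)."""
    total = 0.0
    for query_id, ranked_items in rankings.items():
        for position, item in enumerate(ranked_items, start=1):
            if item in relevant[query_id]:
                total += 1.0 / position
                break  # only the first relevant item counts
    return total / len(rankings)


# Hypothetical rankings for three queries, best match first.
rankings = {
    "q1": ["m1", "m2", "m3"],
    "q2": ["m4", "m5", "m6"],
    "q3": ["m7", "m8", "m9"],
}
relevant = {"q1": {"m1"}, "q2": {"m6"}, "q3": {"m0"}}

print(mean_reciprocal_rank(rankings, relevant))
```

In this toy example the first relevant items sit at ranks 1 and 3, and one query has no relevant result, so the score is (1 + 1/3 + 0) / 3. Comparing stemmed and unstemmed configurations would mean computing this value once per configuration on the same queries.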

Additionally, we found that even though some papers used the same type of data, they pre-processed data differently since they had different goals and applied different techniques. For example, Ye et al. (2017), Barua et al. (2014) and Chen et al. (2019) used developer communication (Q&A posts as documents). Ye et al. (2017) and Barua et al. (2014) removed stop words, code snippets and HTML tags, while Barua et al. (2014) also stemmed words. On the other hand, Chen et al. (2019) removed stop words and the least and the most frequent words, and identified bi-grams. Some studies considered the advice on data pre-processing from previous studies (e.g., Chen et al. 2017; Li et al. 2018), while others adopted steps that are commonly used in NLP, such as noise removal and stemming (Miner et al. 2012) (e.g., Demissie et al. 2020). This means that the choice of pre-processing steps does not depend only on the characteristics of the type of data used as input to topic modeling techniques.

6.4 RQ4: Assigning Names to Topics

Most papers did not mention if or how they named topics. The majority of papers that explicitly assigned names to topics (27 out of 36) used a manual approach and relied on human judgment (researchers’ interpretation) of words in clusters. One paper (Rosen and Shihab 2016 ) justified their use of a manual approach by arguing that there was no tool that could give human readable topics based on word clusters. Thus, authors checked every word cluster generated and the documents used (an individual question of a Q&A website) to make sure they would label topics appropriately.

Table  17 shows how topics were named and the type of data analyzed. Table  18 shows how topics were named and the type of contributions they make. We observed the following:

Studies that modeled topics from developer documentation, transcripts and URLs did not mention topic naming. Studies that contributed with both exploration and comparison also did not mention topic naming;

Topics were mostly named in studies that used data from developer communication (ten occurrences) and in exploratory studies (22 occurrences).

From studies that compared topic models or topic modeling-based approaches (see Section  6.1.2 ), only one study (Yan et al. 2016b ) named topics (automatically with predefined labels).

Fourteen papers acknowledged limitations of manual topic naming:

Twelve papers (Bagherzadeh and Khatchadourian 2019 ; Ahmed and Bagherzadeh 2018 ; Martin et al. 2015 ; Hindle et al. 2013 ; Pagano and Maalej 2013 ; Zou et al. 2017 ; Pettinato et al. 2019 ; Layman et al. 2016 ; Ray et al. 2014 ; Tiarks and Maalej 2014 ; Mezouar et al. 2018 ; Abdellatif et al. 2020 ) acknowledged that how topics were named could be a threat to validity. For example, Layman et al. ( 2016 ) mentioned that they did not evaluate the accuracy of the manual topic naming, which was based on their expertise.

Three papers (Hindle et al. 2015; Bajracharya and Lopes 2012; Li et al. 2018) mentioned difficulties in assigning names to topics. Hindle et al. (2015), for example, explained that labeling topics was difficult due to many project-specific and unclear terms in clusters.

One paper (Pettinato et al. 2019 ) acknowledged that there is another topic naming approach that could be applied to their data: authors acknowledged that an automated extraction of topic names could replace manual labeling.

Hindle et al. ( 2015 ) provided some recommendations on topic analysis in software engineering based on their experiences. Below are some of their recommendations related to topic naming:

Some of the generated topics will not be relevant (e.g., clusters filled with common terms may not address any particular subject) and topics may be duplicated. This means that not all topics have to be named and used for analysis;

Domain experts can label topics better than non-experts, because they are more familiar with domain-specific keywords that may appear in word clusters;

It is important to rely on the relationship between topics generated and the original data. Hindle et al. ( 2015 ) argued that “the content of the topic can be interpreted in many different ways and LDA does not look for the same patterns that people do”.

6.5 Implications

The goal of this study was to describe how topic modeling is applied in software engineering research. We found studies that experimented, explored data, or proposed solutions to support different software engineering tasks with topic models. Our findings help researchers and practitioners as follows:

Understand which topic modeling techniques to use for what purpose. Researchers and practitioners who are going to select and apply a topic modeling technique, for example, to refactor legacy systems, may consider the experiences of other studies with similar objectives.

Pre-processing based on the type of data to be modeled. Pre-processing steps depend on the type of data analyzed (e.g., removing HTML tags in developer communication, mainly Q&A posts). Researchers and practitioners who, for example, intend to model topics from source code may consider the same pre-processing steps that other studies applied to source code.

Understand how to name topics. Researchers and practitioners may check how other studies named topics to get insights on how to give meaning to their own topics.

We present some additional insights:

Appropriateness of topic modeling. Although we found that most papers applied LDA “as is”, it may not be the best approach for other studies or for practical application. LDA is popular because it is an unsupervised model, i.e., it does not require previous knowledge about the data (e.g., pre-defined classes for model training), it is statistically more rigorous than other techniques (e.g., LSI), and it discovers latent relationships (i.e., topics) between documents in a large textual corpus (Griffiths and Steyvers 2004). However, LDA is an unstable and non-deterministic model. This means that generated topics cannot be replicated by others, even if the same model inputs (data pre-processing and configuration of parameters) are used. Furthermore, LDA performs poorly with short documents (Lin et al. 2014).

Meaningful topics. Topic models should discover semantically meaningful topics. Chang et al. (2009) argue about the importance of the interpretability of topics generated by probabilistic topic modeling techniques such as LDA. To create meaningful and replicable topics with LDA, Mantyla et al. (2018) highlight the importance of stabilizing the topic model (e.g., through tuning (Agrawal et al. 2018)) and advocate the use of stability metrics (e.g., rank-biased overlap - RBO (Mantyla et al. 2018)).

Research opportunities. Researchers interested in investigating topic modeling in software engineering may consider developing guidelines for researchers on how to use topic modeling, depending on the type of data, goals, etc. Further studies may also explore issues related to approaches for naming topics (e.g., based on domain experts), the evaluation of the semantic accuracy of the topics generated (e.g., how meaningful the topics are and whether the context of documents has to be considered), and metrics to measure the performance of topic models supporting different software engineering tasks.
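The stability metric named under “Meaningful topics” above, rank-biased overlap (RBO), can be illustrated by comparing the ranked top words of a topic across two LDA runs. The sketch below implements the truncated form of RBO, (1 - p) times the sum over depths d of p^(d-1) times the overlap of the two prefixes of length d divided by d. It does not include the extrapolation used in some RBO variants, and it is an assumption that this form is adequate for illustration; the exact procedure of Mantyla et al. (2018) is not described in this review. The function name and the word lists are hypothetical.

```python
def rbo_truncated(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap of two ranked lists without duplicates.

    Returns a value in [0, 1 - p**depth]; higher means more similar rankings,
    with agreement near the top of the lists weighted more heavily.
    """
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        agreement = len(seen_a & seen_b) / d  # overlap of the two prefixes
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score


# Hypothetical top words of the "same" topic from two runs with different seeds.
run_1 = ["bug", "crash", "fix", "error"]
run_2 = ["bug", "fix", "crash", "patch"]

print(rbo_truncated(run_1, run_1))
print(rbo_truncated(run_1, run_2))
```

A topic whose top words keep a high overlap score across repeated runs can be considered more stable than one whose ranking changes substantially from run to run.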

6.6 Threats to Validity

We analysed the validity threats to our study considering four types of threats to validity in systematic literature mapping studies (Petersen et al. 2015 ):

Theoretical validity: This threat to validity refers to concerns related to capturing the data as intended, i.e., bias and limitations in the data selection and extraction. As we focused on the practice of topic modeling in software engineering, we restricted the search to highly ranked software engineering venues, which generally publish more mature studies. We used “topic model”, “topic model[l]ing”, “lsi”, “lda”, “plsi”, “latent dirichlet allocation”, “latent semantic” as search keywords to find all papers related to topic modeling. To select papers for the survey, we established inclusion and exclusion criteria. One author selected the papers and the others checked whether the selection criteria were applied appropriately. Furthermore, to minimize this threat in relation to data extraction, we first defined the data items (details are in Table 2) to be extracted from papers and the relevance of the data for each research question. Then, one author extracted the data and the others reviewed the results. Controversial data results were discussed to reach agreement.

Descriptive validity: In the context of a literature survey, descriptive validity refers to bias and limitations in data synthesis and the accurate and objective description of the data. To mitigate this threat, we described in detail how the data was synthesized (see Section 4.3); furthermore, one of the authors synthesized the data and the others reviewed the results. Still, data and results depend on what is reported in papers, which was sometimes incomplete, inconsistent or inaccurate (see for example information about document length).

Interpretive validity: This threat to validity refers to bias and limitations in the results of the data analysis. We frequently reviewed the synthesized data during the data analysis and the authors with more experience in this type of study checked the occurrence of inconsistencies in results. Still, we recognize that interpretation bias may not have been removed completely.

Repeatability: This threat to validity concerns whether the study and its results can be replicated. To reduce this threat, we described in detail our search procedures (Section 4) and the processes of data selection, extraction and synthesis. We also followed general guidelines for systematic literature reviews as suggested by Kitchenham (2004) and the mapping study method as suggested by Petersen et al. (2015). Furthermore, the raw data of our study are available online (footnote 1).

7 Conclusions

We analyzed 111 papers that applied topic modeling. These papers were published in the last twelve years (2009-2020) in ten highly ranked software engineering venues (five conferences and five journals). Below we summarize our findings:

LDA and LDA-based techniques are the most frequently used topic modeling techniques;

Topic modeling was mostly used to develop techniques for handling bugs (e.g., to predict defects). Exploratory studies that use topic modeling as a data analysis technique were also frequent;

Most papers modeled topics from source code (using methods as documents);

Most papers used LDA “as is” and without adapting values of hyperparameters ( α and β );

Most papers describe pre-processing. Some pre-processing steps depend on the type of textual data used (e.g., removal of URL and HTML tags), while others are commonly used in NLP techniques (e.g., stop words removal or stemming);

Only 36 (out of 111) papers named the topics. When naming topics, papers mostly adopted manual topic naming approaches such as deducing names (or assigning pre-defined labels) based on the meaning of frequent words in that topic.

By analysing topic modeling techniques, data inputs, data pre-processing, and how topics were named, we identified characteristics and limitations in the use of topic models. Our study can provide insights and references to researchers and practitioners to make the best use of topic modeling, considering the experiences from previous studies.

Our study did not investigate all potential characteristics of topic modeling in software engineering, nor did it compare topic models to other text mining techniques. To answer our research questions, we analyzed the data items shown in Table 2. Future studies may investigate other characteristics of the use of topic modeling in software engineering, for example, the topic modeling tools or libraries (e.g., Mallet) used or the context of a specific supported software engineering task, or compare topic modeling techniques to other text mining techniques, such as clustering and summarization (e.g., sentence or document embeddings). Furthermore, future work can reflect on other fields or uses of topic modeling to contrast how topic modeling is applied in software engineering. Further studies may also investigate how papers evaluate the performance of their topic modeling techniques, how papers evaluate the quality of the generated topics, and how exactly word clusters were used when topics were not named.

https://doi.org/10.5281/zenodo.5280890

This table also shows hyperparameters and the number of topics which are discussed in the following subsection.

http://mallet.cs.umass.edu/topics.php

https://nlp.stanford.edu/software/tmt/tmt-0.4/

http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/english.stop

https://gist.github.com/sebleier/554280

https://github.com/mengjunxie/ae-lda/blob/master/misc/mallet-stopwords-en.txt

http://icon.shef.ac.uk/Moby/mwords.html

https://tartarus.org/martin/PorterStemmer/

https://pypi.org/project/krovetz/

Abdellatif A, Costa D, Badran K, Abdalkareem R, Shihab E (2020) Challenges in Chatbot Development: A Study of Stack Overflow Posts. In: Proceedings of the 17th international conference on mining software repositories. https://doi.org/10.1145/3379597.3387472 , vol 12. IEEE/ACM, Seoul, pp 174–185

Abdellatif TM, Capretz LF, Ho D (2019) Automatic recall of software lessons learned for software project managers. Inf Softw Technol 115:44–57. https://doi.org/10.1016/j.infsof.2019.07.006

Aggarwal CC, Zhai C (2012) Mining text data. Springer, New York. https://doi.org/10.1007/978-1-4614-3223-4

Agrawal A, Fu W, Menzies T (2018) What is wrong with topic modeling? And how to fix it using search-based software engineering. Inf Softw Technol 98(January 2017):74–88. https://doi.org/10.1016/j.infsof.2018.02.005

Ahasanuzzaman M, Asaduzzaman M, Roy CK, Schneider KA (2019) CAPS: a supervised technique for classifying Stack Overflow posts concerning API issues. Empir Softw Eng 25:1493–1532. https://doi.org/10.1007/s10664-019-09743-4

Ahmed S, Bagherzadeh M (2018) What do concurrency developers ask about?: A large-scale study using Stack Overflow. In: Proceedings of the international symposium on empirical software engineering and measurement. https://doi.org/10.1145/3239235.3239524 . ACM, Oulu, pp 1–10

Ali N, Sharafi Z, Guéhéneuc Y G, Antoniol G (2015) An empirical study on the importance of source code entities for requirements traceability. Empir Softw Eng 20(2):442–478. https://doi.org/10.1007/s10664-014-9315-y

Alipour A, Hindle A, Stroulia E (2013) A contextual approach towards more accurate duplicate bug report detection. In: IEEE international working conference on mining software repositories. pp 183–192. https://doi.org/10.1109/MSR.2013.662402

Altarawy D, Shahin H, Mohammed A, Meng N (2018) LASCAD: Language-agnostic software categorization and similar application detection. J Syst Softw 142:21–34. https://doi.org/10.1016/j.jss.2018.04.018

ARC (2012) Excellence in Research for Australia (ERA). https://www.arc.gov.au/excellence-research-australia , http://www.arc.gov.au/pdf/era12/ERAFactsheet_Jan2012_1.pdf

Asuncion HU, Asuncion AU, Taylor RN (2010) Software traceability with topic modeling. In: Proceedings of the international conference on software engineering. IEEE/ACM, Cape Town, pp 95–104

Bagherzadeh M, Khatchadourian R (2019) Going big: a large-scale study on what big data developers ask. In: Proceedings of the 27th joint european software engineering conference and symposium on the foundations of software engineering. https://doi.org/10.1145/3338906.3338939 . ACM, Tallinn, pp 432–442

Bajaj K, Pattabiraman K, Mesbah A (2014) Mining questions asked by web developers. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597083 . ACM, Hyderabad, pp 112–121

Bajracharya S, Lopes C (2009) Mining search topics from a code search engine usage log. In: Proceedings of the 6th international working conference on mining software repositories. https://doi.org/10.1109/MSR.2009.5069489 . IEEE, Vancouver, pp 111–120

Bajracharya SK, Lopes CV (2012) Analyzing and mining a code search engine usage log. Empir Softw Eng 17:424–466. https://doi.org/10.1007/s10664-010-9144-6

Barua A, Thomas SW, Hassan AE (2014) What are developers talking about? An analysis of topics and trends in Stack Overflow. Empir Softw Eng 19(3):619–654. https://doi.org/10.1007/s10664-012-9231-y

Bavota G, Gethers M, Oliveto R, Poshyvanyk D, De Lucia A (2014a) Improving software modularization via automated analysis of latent. ACM Trans Softw Eng Methodol 23(1):1–33. https://doi.org/10.1145/2559935

Bavota G, Oliveto R, Gethers M, Poshyvanyk D, De Lucia A (2014b) Methodbook: Recommending move method refactorings via relational topic models. IEEE Trans Softw Eng 40(7):671–694. https://doi.org/10.1109/TSE.2013.60

Beitzel SM, Jensen EC, Frieder O (2009) MAP. In: Encyclopedia of database systems. https://doi.org/10.1007/978-0-387-39940-9_492 . Springer US, Boston, pp 1691–1692

Belle AB, Boussaidi GE, Kpodjedo S (2016) Combining lexical and structural information to reconstruct software layers. Inf Softw Technol 74:1–16. https://doi.org/10.1016/j.infsof.2016.01.008

Bi T, Liang P, Tang A, Yang C (2018) A systematic mapping study on text analysis techniques in software architecture. J Syst Softw 144:533–558. https://doi.org/10.1016/j.jss.2018.07.055

Biggers LR, Bocovich C, Capshaw R, Eddy BP, Etzkorn LH, Kraft NA (2014) Configuring latent Dirichlet allocation based feature location. Empir Softw Eng 19(3):465–500. https://doi.org/10.1007/s10664-012-9224-x

Binkley D, Lawrie D, Uehlinger C, Heinz D (2015) Enabling improved IR-based feature location. J Syst Softw 101:30–42. https://doi.org/10.1016/j.jss.2014.11.013

Blasco D, Cetina C, Pastor O (2020) A fine-grained requirement traceability evolutionary algorithm: Kromaia, a commercial video game case study. Inf Softw Technol 119:1–12. https://doi.org/10.1016/j.infsof.2019.106235

Blei DM, Jordan MI, Griffiths TL, Tenenbaum JB (2003a) Hierarchical topic models and the nested chinese restaurant process. In: Proceedings of the 16th international conference on neural information processing systems. Neural Information Processing Systems Foundation, Vancouver, pp 17–24

Blei DM, Ng AY, Jordan MI (2003b) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022. https://doi.org/10.1162/jmlr.2003.3.4-5.993

Brank J, Mladenić D, Grobelnik M, Liu H, Mladenić D, Flach PA, Garriga GC, Toivonen H, Toivonen H (2011) F1-measure. In: Encyclopedia of machine learning. https://doi.org/10.1007/978-0-387-30164-8_298 . Springer US, pp 397–397

Canfora G, Cerulo L, Cimitile M, Di Penta M (2014) How changes affect software entropy: An empirical study. Empir Softw Eng 19:1–38. https://doi.org/10.1007/s10664-012-9214-z

Cao B, Frank Liu X, Liu J, Tang M (2017) Domain-aware Mashup service clustering based on LDA topic model from multiple data sources. Inf Softw Technol 90:40–54. https://doi.org/10.1016/j.infsof.2017.05.001

Capiluppi A, Ruscio DD, Rocco JD, Nguyen PT, Ajienka N (2020) Detecting Java software similarities by using different clustering techniques. Inf Softw Technol 122. https://doi.org/10.1016/j.infsof.2020.106279

Catolino G, Palomba F, Zaidman A, Ferrucci F (2019) Not all bugs are the same: Understanding, characterizing, and classifying bug types. J Syst Softw 152:165–181. https://doi.org/10.1016/j.jss.2019.03.002

Chang J, Blei DM (2009) Relational topic models for document networks. In: Proceedings of the 12th international conference on artificial intelligence and statistics. Society for Artificial Intelligence and Statistics, Clearwater Beach, pp 81–88

Chang J, Blei DM (2010) Hierarchical relational models for document networks. Ann Appl Stat 4(1):124–150. https://doi.org/10.1214/09-AOAS309

Chang J, Boyd-Graber J, Gerrish S, Wang C, Blei DM (2009) Reading tea leaves: How humans interpret topic models. In: Proceedings of the 2009 conference advances in neural information. Neural Information Processing Systems Foundation, Vancouver, pp 288–296

Chatterjee P, Damevski K, Pollock L (2019) Exploratory study of slack q&a chats as a mining source for software engineering tools. In: Proceedings of the 16th international conference on mining software repositories. IEEE, Montreal, pp 1–12

Chen H, Coogle J, Damevski K (2019) Modeling stack overflow tags and topics as a hierarchy of concepts. J Syst Softw 156:283–299. https://doi.org/10.1016/j.jss.2019.07.033

Chen L, Hassan F, Wang X, Zhang L (2020) Taming behavioral backward incompatibilities via cross-project testing and analysis. In: Proceedings of the 42nd international conference on software engineering. https://doi.org/10.1145/3377811.3380436 . IEEE/ACM, Seoul, pp 112–124

Chen N, Lin J, Hoi SC, Xiao X, Zhang B (2014) AR-miner: Mining informative reviews for developers from mobile app marketplace. In: Proceedings of the international conference on software engineering. https://doi.org/10.1145/2568225.2568263 , vol 1. IEEE/ACM, Hyderabad, pp 767–778

Chen TH, Thomas SW, Nagappan M, Hassan AE (2012) Explaining software defects using topic models. In: Proceedings of the international working conference on mining software repositories. https://doi.org/10.1109/MSR.2012.6224280 . IEEE, Zurich, pp 189–198

Chen TH, Thomas SW, Hassan AE (2016) A survey on the use of topic models when mining software repositories. Empir Softw Eng 21(5):1843–1919. https://doi.org/10.1007/s10664-015-9402-8

Chen TH, Shang W, Nagappan M, Hassan AE, Thomas SW (2017) Topic-based software defect explanation. J Syst Softw 129:79–106. https://doi.org/10.1016/j.jss.2016.05.015

Choetkiertikul M, Dam HK, Tran T, Ghose A (2017) Predicting the delay of issues with due dates in software projects. Empir Softw Eng 22:1223–1263. https://doi.org/10.1007/s10664-016-9496-7

Craswell N (2009) Mean reciprocal rank. In: Encyclopedia of database systems. https://doi.org/10.1007/978-0-387-39940-9_488 . Springer US, pp 1703–1703

Croft WB, Metzler D (2010) Search engines: Information retrieval in practice. Addison-Wesley, Reading

Cui D, Liu T, Cai Y, Zheng Q, Feng Q, Jin W, Guo J, Qu Y (2019) Investigating the impact of multiple dependency structures on software defects, IEEE/ACM, Montreal. https://doi.org/10.1109/ICSE.2019.00069

Damevski K, Chen H, Shepherd DC, Kraft NA, Pollock L (2018) Predicting future developer behavior in the IDE using topic models. IEEE Trans Softw Eng 44(11):1100–1111. https://doi.org/10.1109/TSE.2017.2748134

De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2014) Labeling source code with information retrieval methods: An empirical study. Empir Softw Eng 19(5):1383–1420. https://doi.org/10.1007/s10664-013-9285-5

Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6): 391-407 https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

Demissie BF, Ceccato M, Shar LK (2020) Security analysis of permission re-delegation vulnerabilities in Android apps. Empir Softw Eng 25:5084–5136. https://doi.org/10.1007/s10664-020-09879-8

Dietz L, Bickel S, Scheffer T (2007) Unsupervised prediction of citation influences. In: Proceedings of the 24th international conference on machine learning. https://doi.org/10.1145/1273496.1273526 . ACM, Corvallis, pp 233–240

Dit B, Revelle M, Poshyvanyk D (2013) Integrating information retrieval, execution and link analysis algorithms to improve feature location in software. Empir Softw Eng 18(2):277–309. https://doi.org/10.1007/s10664-011-9194-4

El Zarif O, Da Costa DA, Hassan S, Zou Y (2020) On the relationship between user churn and software issues. In: Proceedings of the 17th international conference on mining software repositories. https://doi.org/10.1145/3379597.3387456 . ACM, New York, pp 339–349

Fowkes J, Chanthirasegaran P, Ranca R, Allamanis M, Lapata M, Sutton C (2016) Autofolding for source code summarization. Proc Int Conf Softw Eng 43(12):649–652. https://doi.org/10.1145/2889160.2889171

Fu Y, Yan M, Zhang X, Xu L, Yang D, Kymer JD (2015) Automated classification of software change messages by semi-supervised Latent Dirichlet Allocation. Inf Softw Technol 57:369–377. https://doi.org/10.1016/j.infsof.2014.05.017

Galvis Carreno LV, Winbladh K (2012) Analysis of user comments: an approach for software requirements evolution. In: Proceedings of the international conference on software engineering. IEEE/ACM, San Francisco, pp 582–591

Gao C, Zeng J, Lyu MR, King I (2018) Online app review analysis for identifying emerging issues. In: Proceedings of the 40th international conference on software engineering. https://doi.org/10.1145/3180155.3180218 . IEEE/ACM, Gothenburg, pp 48–58

Gopalakrishnan R, Sharma P, Mirakhorli M, Galster M (2017) Can latent topics in source code predict missing architectural tactics? In: Proceedings of the 39th international conference on software engineering, IEEE/ACM, pp 15–26. https://doi.org/10.1109/ICSE.2017.10 . http://ghtorrent.org/

Gorla A, Tavecchia I, Gross F, Zeller A (2014) Checking app behavior against app descriptions. In: Proceedings of the international conference on software engineering. https://doi.org/10.1145/2568225.2568276 . IEEE/ACM, Hyderabad, pp 1025–1035

Griffiths TL, Steyvers M (2004) Finding scientific topics. In: Proceedings of the national academy of sciences. https://doi.org/10.1073/pnas.0307752101 , vol 101. Neural Information Processing Systems Foundation, Irvine, pp 5228–5235

Haghighi A, Vanderwende L (2009) Exploring content models for multi-document summarization. In: Proceedings of the conference on human language technologies: the 2009 annual conference of the north american chapter of the association for computational linguistics. https://doi.org/10.3115/1620754.1620807 , http://www-nlpir.nist.gov/projects/duc/data.html . Association for Computational Linguistics, Boulder, pp 362–370

Han J, Shihab E, Wan Z, Deng S, Xia X (2020) What do programmers discuss about deep learning frameworks. Empir Softw Eng 25:2694–2747. https://doi.org/10.1007/s10664-020-09819-6

Haque MU, Ali Babar M (2020) Challenges in docker development: a large-scale study using stack overflow. In: Proceedings of the 14th international symposium on empirical software engineering and measurement. https://doi.org/10.1145/3382494.3410693 . IEEE/ACM, Bari, pp 1–11

Hariri N, Castro-Herrera C, Mirakhorli M, Cleland-Huang J, Mobasher B (2013) Supporting domain analysis through mining and recommending features from online product listings. IEEE Trans Softw Eng 39(12):1736–1752. https://doi.org/10.1109/TSE.2013.39

Henß S, Monperrus M, Mezini M (2012) Semi-automatically extracting FAQs to improve accessibility of software development knowledge. In: Proceedings of the international conference on software engineering. https://doi.org/10.1109/ICSE.2012.6227139 . IEEE/ACM, Zurich, pp 793–803

Hindle A, Godfrey MW, Ernst NA, Mylopoulos J (2011) Automated topic naming to support cross-project analysis of software maintenance activities. In: Proceedings of the 33rd international conference on software engineering. ACM, Waikiki, pp 163–172

Hindle A, Ernst NA, Godfrey MW, Mylopoulos J (2013) Automated topic naming: Supporting cross-project analysis of software maintenance activities. Empir Softw Eng 18(6):1125–1155. https://doi.org/10.1007/s10664-012-9209-9

Hindle A, Bird C, Zimmermann T, Nagappan N (2015) Do topics make sense to managers and developers? Empir Softw Eng 20:479–515. https://doi.org/10.1007/s10664-014-9312-1

Hindle A, Alipour A, Stroulia E (2016) A contextual approach towards more accurate duplicate bug report detection and ranking. Empir Softw Eng 21(2):368–410. https://doi.org/10.1007/s10664-015-9387-3

Hoffman M, Blei D, Bach F (2010) Online learning for latent dirichlet allocation. In: Proceedings of the neural information processing systems conference. https://doi.org/10.1.1.187.1883. Neural Information Processing Systems Foundation, Vancouver, pp 1–9

Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international conference on research and development in information retrieval. ACM, Berkeley, pp 50–57

Hu H, Bezemer CP, Hassan AE (2018) Studying the consistency of star ratings and the complaints in 1 & 2-star user reviews for top free cross-platform Android and iOS apps. Empir Softw Eng 23(6):3442–3475. https://doi.org/10.1007/s10664-018-9604-y

Hu H, Wang S, Bezemer CP, Hassan AE (2019) Studying the consistency of star ratings and reviews of popular free hybrid Android and iOS apps. Empir Softw Eng 24:7–32. https://doi.org/10.1007/s10664-018-9617-6

Hu W, Wong K (2013) Using citation influence to predict software defects. In: Proceedings of the international working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624058 . IEEE, San Francisco, pp 419–428

Jiang H, Zhang J, Ren Z, Zhang T (2017) An unsupervised approach for discovering relevant tutorial fragments for APIs. In: Proceedings of the 39th international conference on software engineering. https://doi.org/10.1109/ICSE.2017.12 . IEEE/ACM, Buenos Aires, pp 38–48

Jiang HE, Zhang J, Li X, Ren Z, Lo D, Wu X, Luo Z (2019) Recommending new features from mobile app descriptions. ACM Trans Softw Eng Methodol 28(4):1–29. https://doi.org/10.1145/3344158

Jipeng Q, Zhenyu Q, Yun L, Yunhao Y, Xindong W (2020) Short text topic modeling techniques, applications, and performance: a survey. https://doi.org/10.1109/TKDE.2020.2992485

Jo Y, Oh A (2011) Aspect and sentiment unification model for online review analysis. In: Proceedings of the fourth ACM international conference on Web search and data mining. https://doi.org/10.1145/1935826 . ACM, New York, pp 815–824

Jones JA, Harrold MJ (2005) Empirical evaluation of the tarantula automatic fault-localization technique. In: Proceedings of the 20th international conference on automated software engineering. https://doi.org/10.1145/1101908.1101949 , http://portal.acm.org/citation.cfm?doid=1101908.1101949 . IEEE/ACM, New York, pp 273–282

Kakas AC, Cohn D, Dasgupta S, Barto AG, Carpenter GA, Grossberg S, Webb GI, Dorigo M, Birattari M, Toivonen H, Timmis J, Branke J, Toivonen H, Strehl AL, Drummond C, Coates A, Abbeel P, Ng AY, Zheng F, Webb GI, Tadepalli P (2011) Area under curve. In: Encyclopedia of machine learning. https://doi.org/10.1007/978-0-387-30164-8_28 . Springer US, pp 40–40

Kitchenham BA (2004) Procedures for performing systematic reviews. Keele, UK, Keele University 33(TR/SE-0401):28. https://doi.org/10.1.1.122.3308

Layman L, Nikora AP, Meek J, Menzies T (2016) Topic modeling of NASA space system problem reports research in practice. In: Proceedings of the 13th working conference on mining software repositories. https://doi.org/10.1145/2901739.2901760 . ACM, Austin, pp 303–314

Le TDB, Thung F, Lo D (2017) Will this localization tool be effective for this bug? Mitigating the impact of unreliability of information retrieval based bug localization tools. Empir Softw Eng 22:2237–2279. https://doi.org/10.1007/s10664-016-9484-y

Leach RJ (2016) Introduction to software engineering, 2nd edn. CRC Press LLC, Boca Raton. https://ebookcentral.proquest.com/lib/canterbury/detail.action?docID=4711469&query=Software+Engineering

Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791

Li H, Chen THP, Shang W, Hassan AE (2018) Studying software logging using topic models. Empir Softw Eng 23:2655–2694. https://doi.org/10.1007/s10664-018-9595-8

Lian X, Liu W, Zhang L (2020) Assisting engineers extracting requirements on components from domain documents. Inf Softw Technol 118(September 2019):106196. https://doi.org/10.1016/j.infsof.2019.106196

Lin T, Tian W, Mei Q, Cheng H (2014) The dual-sparse topic model: Mining focused topics and focused terms in short text. In: Proceedings of the 23rd international conference on world wide web. https://doi.org/10.1145/2566486.2567980 . ACM, Seoul, pp 539–549

Liu Y, Liu L, Liu H, Wang X, Yang H (2017) Mining domain knowledge from app descriptions. J Syst Softw 133:126–144. https://doi.org/10.1016/j.jss.2017.08.024

Liu Y, Lin J, Cleland-Huang J (2020) Traceability support for multi-lingual software projects. In: Proceedings of the 17th international conference on mining software repositories. https://doi.org/10.1145/3379597.3387440 . ACM, Seoul, pp 443–454

Lukins SK, Kraft NA, Etzkorn LH (2010) Bug localization using latent Dirichlet allocation. Inf Softw Technol 52:972–990. https://doi.org/10.1016/j.infsof.2010.04.002

Luo Q, Moran K, Poshyvanyk D (2016) A large-scale empirical comparison of static and dynamic test case prioritization techniques. In: Proceedings of the 24th international symposium on foundations of software engineering. https://doi.org/10.1145/2950290.2950344 . ACM, Seattle, pp 559–570

Mahmoud A, Bradshaw G (2017) Semantic topic models for source code analysis. Empir Softw Eng 22(4):1965–2000. https://doi.org/10.1007/s10664-016-9473-1

Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18(1):50–60. https://doi.org/10.1214/aoms/1177730491 , http://projecteuclid.org/euclid.aoms/1177730491

Manning CD, Raghavan P, Schütze H (2008) Evaluation of clustering. In: Introduction to information retrieval. chap 16, https://doi.org/10.33899/csmj.2008.163987 . https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html . Cambridge University Press

Mantyla MV, Claes M, Farooq U (2018) Measuring LDA topic stability from clusters of replicated runs, ACM, Oulu. https://doi.org/10.1145/3239235.3267435

Martin W, Harman M, Jia Y, Sarro F, Zhang Y (2015) The app sampling problem for app store mining. In: Proceedings of the 12th international working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.19 . IEEE, Florence, pp 123–133

Martin W, Sarro F, Harman M (2016) Causal impact analysis for app releases in google play. In: Proceedings of the 24th international symposium on foundations of software engineering. https://doi.org/10.1145/2950290.2950320 . ACM, Seattle, pp 435–446

McIlroy S, Ali N, Khalid H, Hassan AE (2016) Analyzing and automatically labelling the types of user issues that are raised in mobile app reviews. Empir Softw Eng 21:1067–1106. https://doi.org/10.1007/s10664-015-9375-7

Mehrotra R, Sanner S, Buntine W, Xie L (2013) Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling. In: Proceedings of the 36th International Conference on Research and Development in Information Retrieval. ACM, Dublin, pp 889–892

Mezouar ME, Zhang F, Zou Y (2018) Are tweets useful in the bug fixing process? An empirical study on Firefox and Chrome. Empir Softw Eng 23(3):1704–1742. https://doi.org/10.1007/s10664-017-9559-4

Miner G, Elder J, Fast A, Hill T, Nisbet R, Delen D (2012) Practical text mining and statistical analysis for non-structured text data applications. Elsevier Science & Technology, Waltham. https://doi.org/10.1016/C2010-0-66188-8

Moslehi P, Adams B, Rilling J (2016) On mining crowd-based speech documentation. In: Proceedings of the 13th working conference on mining software repositories. https://doi.org/10.1145/2901739.2901771 . ACM, Austin, pp 259–268

Moslehi P, Adams B, Rilling J (2018) Feature location using crowd-based screencasts. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196439 . ACM, New York, pp 192–202

Moslehi P, Adams B, Rilling J (2020) A feature location approach for mapping application features extracted from crowd-based screencasts to source code. Empir Softw Eng 25:4873–4926. https://doi.org/10.1007/s10664-020-09874-z

Murali V, Chaudhuri S, Jermaine C (2017) Bayesian specification learning for finding API usage errors. In: Proceedings of the Joint european software engineering conference and symposium on the foundations of software engineering. https://doi.org/10.1145/3106237.3106284 . ACM, Paderborn, pp 151–162

Nabli H, Ben Djemaa R, Ben Amor IA (2018) Efficient cloud service discovery approach based on LDA topic modeling. J Syst Softw 146:233–248. https://doi.org/10.1016/j.jss.2018.09.069

Naguib H, Narayan N, Brügge B, Helal D (2013) Bug report assignee recommendation using activity profiles. In: Proceedings of the international working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6623999 . IEEE, San Francisco, pp 22–30

Nayebi M, Cho H, Ruhe G (2018) App store mining is not enough for app improvement. Empir Softw Eng 23:2764–2794. https://doi.org/10.1007/s10664-018-9601-1

Nguyen AT, Nguyen TT, Al-Kofahi J, Nguyen HV, Nguyen TN (2011) A topic-based approach for narrowing the search space of buggy files from a bug report. In: Proceedings of the 26th international conference on automated software engineering. https://doi.org/10.1109/ASE.2011.6100062 . IEEE/ACM, Lawrence, pp 263–272

Nguyen AT, Nguyen TT, Nguyen TN, Lo D, Sun C (2012) Duplicate bug report detection with a combination of information retrieval and topic modeling. In: Proceedings of the 27th international conference on automated software engineering. https://doi.org/10.1145/2351676.2351687 . IEEE/ACM, Essen, pp 70–79

Nguyen VA, Boyd-Graber J, Resnik P, Chang J, Graber JB (2014) Learning a concept hierarchy from multi-labeled documents. In: Proceedings of the neural information processing systems conference. Neural Information Processing Systems Foundation, Montreal, pp 1–9

Noei E, Heydarnoori A (2016) EXAF: A search engine for sample applications of object-oriented framework-provided concepts. Inf Softw Technol 75:135–147. https://doi.org/10.1016/j.infsof.2016.03.007

Noei E, Da Costa DA, Zou Y (2018) Winning the app production rally. In: Proceedings of the 26th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. https://doi.org/10.1145/3236024.3236044 . ACM, Lake Buena Vista, pp 283–294

Noei E, Zhang F, Wang S, Zou Y (2019) Towards prioritizing user-related issue reports of mobile applications. Empir Softw Eng 24:1964–1996. https://doi.org/10.1007/s10664-019-09684-y

Pagano D, Maalej W (2013) How do open source communities blog? Empir Softw Eng 18(6):1090–1124. https://doi.org/10.1007/s10664-012-9211-2

Palomba F, Salza P, Ciurumelea A, Panichella S, Gall H, Ferrucci F, De Lucia A (2017) Recommending and localizing change requests for mobile apps based on user reviews. In: Proceedings of the 39th international conference on software engineering. https://doi.org/10.1109/ICSE.2017.18 . IEEE/ACM, Buenos Aires, pp 106–117

Panichella A, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013) How to effectively use topic models for software engineering tasks? An approach based on Genetic Algorithms. In: Proceedings of the international conference on software engineering. https://doi.org/10.1109/ICSE.2013.6606598 . IEEE/ACM, San Francisco, pp 522–531

Pérez F, Lapeña R, Font J, Cetina C (2018) Fragment retrieval on models for model maintenance: Applying a multi-objective perspective to an industrial case study. Inf Softw Technol 103:188–201. https://doi.org/10.1016/j.infsof.2018.06.017

Petersen K, Vakkalanka S, Kuzniarz L (2015) Guidelines for conducting systematic mapping studies in software engineering: An update. Inf Softw Technol 64(1):1–18. https://doi.org/10.1016/j.infsof.2015.03.007

Pettinato M, Gil JP, Galeas P, Russo B (2019) Log mining to re-construct system behavior: An exploratory study on a large telescope system. Inf Softw Technol 114:121–136. https://doi.org/10.1016/j.infsof.2019.06.011

Poshyvanyk D, Gueheneuc YG, Marcus A, Antoniol G, Rajlich V (2007) Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. https://doi.org/10.1109/TSE.2007.1016 . https://www.researchgate.net/publication/3189749 , vol 33, pp 420–431

Poshyvanyk D, Marcus A, Ferenc R, Gyimóthy T (2009) Using information retrieval based coupling measures for impact analysis. Empir Softw Eng 14(1):5–32. https://doi.org/10.1007/s10664-008-9088-2 , http://www.mozilla.org/

Poshyvanyk D, Gethers M, Marcus A (2012) Concept location using formal concept analysis and information retrieval. ACM Trans Softw Eng Methodol 21(4):1–34. https://doi.org/10.1145/2377656.2377660

Poursabzi-Sangdeh F, Goldstein DG, Hofman JM, Vaughan JW, Wallach H (2021) Manipulating and measuring model interpretability. In: Proceedings of the conference on human factors in computing systems. https://doi.org/10.1145/3411764.3445315 . ACM, Yokohama

Ramage D, Hall D, Nallapati R, Manning CD (2009) Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the conference on empirical methods in natural language processing. https://doi.org/10.5555/1699510.1699543 . ACL/AFNLP, Singapore, pp 248–256

Rao S, Kak A (2011) Retrieval from software libraries for bug localization: A comparative study of generic and composite text models. In: Proceedings of the international conference on software engineering. https://doi.org/10.1145/1985441.1985451 . IEEE/ACM, Waikiki, pp 43–52

Ray B, Posnett D, Filkov V, Devanbu P (2014) A large scale study of programming languages and code quality in GitHub. In: Proceedings of the symposium on the foundations of software engineering, pp 155–165. https://doi.org/10.1145/2635868.2635922

Revelle M, Gethers M, Poshyvanyk D (2011) Using structural and textual information to capture feature coupling in object-oriented software. Empir Softw Eng 16(6):773–811. https://doi.org/10.1007/s10664-011-9159-7

Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: Proceedings of the eighth ACM international conference on web search and data mining - WSDM ’15. https://doi.org/10.1145/2684822.2685324 . ACM, Shanghai, pp 399–408

Rosen C, Shihab E (2016) What are mobile developers asking about? A large scale study using Stack Overflow. Empir Softw Eng 21:1192–1223. https://doi.org/10.1007/s10664-015-9379-3

Rosenberg CM, Moonen L (2018) Improving problem identification via automated log clustering using dimensionality reduction. In: Proceedings of the international symposium on empirical software engineering and measurement. https://doi.org/10.1145/3239235.3239248 . ACM, Oulu, pp 1–10

Rothermel G, Untch RH, Chu C, Harrold MJ (2001) Prioritizing test cases for regression testing. IEEE Trans Softw Eng 27(10):929–948. https://doi.org/10.1109/32.962562

Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620. https://doi.org/10.1145/361219.361220

Savage T, Dit B, Gethers M, Poshyvanyk D (2010) TopicXP: exploring topics in source code using latent Dirichlet allocation. IEEE, Timisoara. https://doi.org/10.1109/ICSM.2010.5609654

Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

Shimagaki J, Kamei Y, Ubayashi N, Hindle A (2018) Automatic topic classification of test cases using text mining at an android smartphone vendor. In: Proceedings of the 12th international symposium on empirical software engineering and measurement. https://doi.org/10.1145/3239235.3268927 . IEEE/ACM, Oulu, pp 1–10

Silva B, Sant’anna C, Rocha N, Chavez C (2016) The effect of automatic concern mapping strategies on conceptual cohesion measurement. Inf Softw Technol 75:56–70. https://doi.org/10.1016/j.infsof.2016.03.006

Silva LL, Valente MT, Maia MA (2019) Co-change patterns: A large scale empirical study. J Syst Softw 152:196–214. https://doi.org/10.1016/j.jss.2019.03.014

Soliman M, Galster M, Salama AR, Riebisch M (2016) Architectural knowledge for technology decisions in developer communities: An exploratory study with Stack Overflow. In: Proceedings of the 13th working conference on software architecture. https://doi.org/10.1109/WICSA.2016.13 . IEEE, Venice, pp 128–133

Somasundaram K, Murphy GC (2012) Automatic categorization of bug reports using latent Dirichlet allocation. In: Proceedings of the 5th India software engineering conference. https://doi.org/10.1145/2134254.2134276 , vol 12. ACM, pp 125–130

Souza LB, Campos EC, Madeiral F, Paixão K, Rocha AM, Maia M d A (2019) Bootstrapping cookbooks for APIs from crowd knowledge on Stack Overflow. Inf Softw Technol 111(March 2018):37–49. https://doi.org/10.1016/j.infsof.2019.03.009

Steyvers M, Griffiths T (2010) Probalistic Topic Models. In: Landauer T, McNamara D, Dennis S, Kintsch W (eds) Latent semantic analysis: a road to meaning. https://doi.org/10.1016/s0364-0213(01)00040-4 . University of California, Irvine, pp 993–1022

Sun X, Li B, Leung H, Li B, Li Y (2015) MSR4SM: Using topic models to effectively mining software repositories for software maintenance tasks. Inf Softw Technol 66:1–12. https://doi.org/10.1016/j.infsof.2015.05.003

Sun X, Liu X, Li B, Duan Y, Yang H, Hu J (2016) Exploring topic models in software engineering data analysis: A survey, IEEE, Shangai. https://doi.org/10.1109/SNPD.2016.7515925

Sun X, Yang H, Xia X, Li B (2017) Enhancing developer recommendation with supplementary information via mining historical commits. J Syst Softw 134:355–368. https://doi.org/10.1016/j.jss.2017.09.021

Taba SES, Keivanloo I, Zou Y, Wang S (2017) An exploratory study on the usage of common interface elements in android applications. J Syst Softw 131:491–504. https://doi.org/10.1016/j.jss.2016.07.010

Tairas R, Gray J (2009) An information retrieval process to aid in the analysis of code clones. https://doi.org/10.1007/s10664-008-9089-1 , http://www.cis.uab.edu/tairasr/clones/literature , vol 14, pp 33–56

Tamrawi A, Nguyen TT, Al-Kofahi JM, Nguyen TN (2011) Fuzzy set and cache-based approach for bug triaging. In: Proceedings of the 19th ACM symposium on foundations of software engineering. https://doi.org/10.1145/2025113.202516 . ACM, pp 365–375

Tang J, Zhang M, Mei Q (2013) One theme in all views: modeling consensus topics in multiple contexts. In: Proceedings of the 19th international conference on knowledge discovery and data mining. ACM, New York, pp 5–13

Tantithamthavorn C, Lemma Abebe S, Hassan AE, Ihara A, Matsumoto K (2018) The impact of IR-based classifier configuration on the performance and the effort of method-level bug localization. Inf Softw Technol 102(June):160–174. https://doi.org/10.1016/j.infsof.2018.06.001

Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101(476):1566–1581. https://doi.org/10.1198/016214506000000302

Thomas SW, Nagappan M, Blostein D, Hassan AE (2013) The impact of classifier configuration and classifier combination on bug localization. IEEE Trans Softw Eng 39(10):1427–1443. https://doi.org/10.1109/TSE.2013.27

Thomas SW, Hemmati H, Hassan AE, Blostein D (2014) Static test case prioritization using topic models. Empir Softw Eng 19:182–212. https://doi.org/10.1007/s10664-012-9219-7

Tiarks R, Maalej W (2014) How does a typical tutorial for mobile development look like?. In: Proceedings of the 11th international conference on mining software repositories. https://doi.org/10.1145/2597073.2597106 . IEEE/ACM, Hyderabad, pp 272–281

Treude C, Wagner M (2019) Predicting good configurations for GitHub and stack overflow topic models. In: Proceedings of the 16th international conference on mining software repositories. https://doi.org/10.1109/MSR.2019.00022 . IEEE, Montreal, pp 84–95

Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2):101–132. https://doi.org/10.3102/10769986025002101

Wallach HM, Mimno D, McCallum A (2009) Rethinking LDA: Why priors matter. In: Proceedings of the conference on advances in neural information processing systems. Curran Associates Inc., Vancouver, pp 1973–1981. http://rexa.info/

Wang C, Blei DM (2011) Collaborative topic modeling for recommending scientific articles. In: Proceedings of the international conference on knowledge discovery and data mining. https://doi.org/10.1145/2020408.2020480 . ACM, New York, pp 448–456

Wang W, Malik H, Godfrey MW (2015) Recommending posts concerning API issues in developer Q&A sites. In: Proceedings of the international working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.28 . http://stackoverflow.com/questions/5358219/ . IEEE/ACM, pp 224–234

Wei X, Croft WB (2006) LDA-based document models for ad-hoc retrieval. In: Proceedings of the 29th annual international conference on research and development in information retrieval. https://doi.org/10.1145/1148170.1148204 . ACM, Seattle, pp 178–185

Weng J, Lim EP, Jiang J, He Q (2010) TwitterRank: Finding topic-sensitive influential twitterers. In: Proceedings of the 3rd international conference on web search and data mining. https://doi.org/10.1145/1718487.1718520 . ACM, New York, pp 261–270

Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemom Intell Lab Syst 2:37–52. https://doi.org/10.1016/0169-7439(87)80084-9

Xia X, Bao L, Lo D, Kochhar PS, Hassan AE, Xing Z (2017a) What do developers search for on the web? Empir Softw Eng 22(6):3149–3185. https://doi.org/10.1007/s10664-017-9514-4

Xia X, Lo D, Ding Y, Al-Kofahi JM, Nguyen TN, Wang X (2017b) Improving automated bug triaging with specialized topic model. IEEE Trans Softw Eng 43(3):272–297. https://doi.org/10.1109/TSE.2016.2576454

Yan M, Fu Y, Zhang X, Yang D, Xu L, Kymer JD (2016a) Automatically classifying software changes via discriminative topic model: Supporting multi-category and cross-project. J Syst Softw 113:296–308. https://doi.org/10.1016/j.jss.2015.12.019

Yan M, Zhang X, Yang D, Xu L, Kymer JD (2016b) A component recommender for bug reports using Discriminative Probability Latent Semantic Analysis. Inf Softw Technol 73:37–51. https://doi.org/10.1016/j.infsof.2016.01.005

Yang X, Lo D, Li L, Xia X, Bissyandé T F, Klein J (2017) Characterizing malicious Android apps by mining topic-specific data flow signatures. Inf Softw Technol 90:27–39. https://doi.org/10.1016/j.infsof.2017.04.007

Ye D, Xing Z, Kapre N (2017) The structure and dynamics of knowledge network in domain-specific Q&A sites: a case study of stack overflow. Empir Softw Eng 22(1):375–406. https://doi.org/10.1007/s10664-016-9430-z

Zaman S, Adams B, Hassan AE (2011) Security versus performance bugs: A case study on firefox. In: Proceedings - international conference on software engineering. https://doi.org/10.1145/1985441.198545 , pp 93–102

Zeugmann T, Poupart P, Kennedy J, Jin X, Han J, Saitta L, Sebag M, Peters J, Bagnell JA, Daelemans W, Webb GI, Ting KM, Ting KM, Webb GI, Shirabad JS, Fürnkranz J, Hüllermeier E, Matwin S, Sakakibara Y, Flener P, Schmid U, Procopiuc CM, Lachiche N, Fürnkranz J (2011) Precision and recall. In: Encyclopedia of machine learning. https://doi.org/10.1007/978-0-387-30164-8_652 . Springer US, pp 781–781

Zhang E, Zhang Y (2009) Average precision. In: Encyclopedia of database systems. https://doi.org/10.1007/978-0-387-39940-9_482 . Springer US, pp 192–193

Zhang T, Chen J, Yang G, Lee B, Luo X (2016) Towards more accurate severity prediction and fixer recommendation of software bugs. J Syst Softw 117:166–184. https://doi.org/10.1016/j.jss.2016.02.034

Zhang Y, Lo D, Xia X, Scanniello G, Le TDB, Sun J (2018) Fusing multi-abstraction vector space models for concern localization. Empir Softw Eng 23:2279–2322. https://doi.org/10.1007/s10664-017-9585-2

Zhao N, Chen J, Wang Z, Peng X, Wang G, Wu Y, Zhou F, Feng Z, Nie X, Zhang W, Sui K, Pei D (2020) Real-time incident prediction for online service systems. In: Proceedings of the 28th ACM joint meeting european software engineering conference and symposium on the foundations of software engineering. https://doi.org/10.1145/3368089.3409672 , vol 20. ACM, pp 315–326

Zhao WX, Jiang J, Weng J, He J, Lim EP, Yan H, Li X (2011) Comparing twitter and traditional media using topic models. In: Lecture Notes in Computer Science. https://doi.org/10.1007/978-3-642-20161-5-34 , vol 6611. Springer, Berlin, chap Advances i, pp 338–349

Zhao Y, Zhanq F, Shlhab E, Zou Y, Hassan AE (2016) How are discussions associated with bug reworking? an empirical study on open source projects. In: Proceedings of the 10th international symposium on empirical software engineering and measurement. https://doi.org/10.1145/2961111.296259 . IEEE/ACM, Ciudad Real, pp 1–10

Zou J, Xu L, Yang M, Zhang X, Yang D (2017) Towards comprehending the non-functional requirements through Developers’ eyes: An exploration of Stack Overflow using topic analysis. Inf Softw Technol 84(1):19–32. https://doi.org/10.1016/j.infsof.2016.12.003

Download references

Acknowledgements

We would like to thank the editor and the anonymous reviewers for their insightful and detailed feedback that helped us to significantly improve the manuscript.

Author information

Authors and affiliations.

University of Canterbury, Christchurch, New Zealand

Camila Costa Silva, Matthias Galster & Fabian Gilson

Corresponding author

Correspondence to Camila Costa Silva .

Ethics declarations

Conflict of interests.

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Andrea De Lucia

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A.1 Papers Reviewed

A.2 Metrics Used in Comparative Studies

The column “Context-specific” indicates if the metric was proposed or adapted to a specific context (“Yes”) or is a standard NLP metric (“No”).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Silva, C.C., Galster, M. & Gilson, F. Topic modeling in software engineering research. Empir Software Eng 26 , 120 (2021). https://doi.org/10.1007/s10664-021-10026-0

Accepted : 29 July 2021

Published : 06 September 2021

DOI : https://doi.org/10.1007/s10664-021-10026-0

  • Topic modeling
  • Text mining
  • Natural language processing
  • Literature analysis

Journal of Software Engineering Research and Development

Metric-centered and technology-independent architectural views for software comprehension

The maintenance of applications is a crucial activity in the software industry. The high cost of this process is due to the effort invested on software comprehension since, in most of cases, there is no up-to-...

Back to the future: origins and directions of the “Agile Manifesto” – views of the originators

In 2001, seventeen professionals set up the manifesto for agile software development. They wanted to define values and basic principles for better software development. On top of being brought into focus, the ...

Investigating the effectiveness of peer code review in distributed software development based on objective and subjective data

Code review is a potential means of improving software quality. To be effective, it depends on different factors, and many have been investigated in the literature to identify the scenarios in which it adds qu...

On the benefits and challenges of using kanban in software engineering: a structured synthesis study

Kanban is increasingly being used in diverse software organizations. There is extensive research regarding its benefits and challenges in Software Engineering, reported in both primary and secondary studies. H...

Challenges on applying genetic improvement in JavaScript using a high-performance computer

Genetic Improvement is an area of Search Based Software Engineering that aims to apply evolutionary computing operators to the software source code to improve it according to one or more quality metrics. This ...

Actor’s social complexity: a proposal for managing the iStar model

Complex systems are inherent to modern society, in which individuals, organizations, and computational elements relate with each other to achieve a predefined purpose, which transcends individual goals. In thi...

Investigating measures for applying statistical process control in software organizations

The growing interest in improving software processes has led organizations to aim for high maturity, where statistical process control (SPC) is required. SPC makes it possible to analyze process behavior, pred...

An approach for applying Test-Driven Development (TDD) in the development of randomized algorithms

TDD is a technique traditionally applied in applications with deterministic algorithms, in which the input and the expected result are known. However, the application of TDD with randomized algorithms have bee...

Supporting governance of mobile application developers from mining and analyzing technical questions in stack overflow

There is a need to improve the direct communication between large organizations that maintain mobile platforms (e.g. Apple, Google, and Microsoft) and third-party developers to solve technical questions that e...

Working software over comprehensive documentation – Rationales of agile teams for artefacts usage

Agile software development (ASD) promotes working software over comprehensive documentation. Still, recent research has shown agile teams to use quite a number of artefacts. Whereas some artefacts may be adopt...

Development as a journey: factors supporting the adoption and use of software frameworks

From the point of view of the software framework owner, attracting new and supporting existing application developers is crucial for the long-term success of the framework. This mixed-methods study explores th...

Applying user-centered techniques to analyze and design a mobile application

Techniques that help in understanding and designing user needs are increasingly being used in Software Engineering to improve the acceptance of applications. Among these techniques we can cite personas, scenar...

A measurement model to analyze the effect of agile enterprise architecture on geographically distributed agile development

Efficient and effective communication (active communication) among stakeholders is thought to be central to agile development. However, in geographically distributed agile development (GDAD) environments, it c...

A survey of search-based refactoring for software maintenance

This survey reviews published materials related to the specific area of Search-Based Software Engineering that concerns software maintenance and, in particular, refactoring. The survey aims to give a comprehen...

Guest editorial foreword for the special issue on automated software testing: trends and evidence

Similarity testing for role-based access control systems

Access control systems demand rigorous verification and validation approaches, otherwise, they can end up with security breaches. Finite state machines based testing has been successfully applied to RBAC syste...

An algorithm for combinatorial interaction testing: definitions and rigorous evaluations

Combinatorial Interaction Testing (CIT) approaches have drawn attention of the software testing community to generate sets of smaller, efficient, and effective test cases where they have been successful in det...

How diverse is your team? Investigating gender and nationality diversity in GitHub teams

Building an effective team of developers is a complex task faced by both software companies and open source communities. The problem of forming a “dream”

Investigating factors that affect the human perception on god class detection: an analysis based on a family of four controlled experiments

Evaluation of design problems in object oriented systems, which we call code smells, is mostly a human-based task. Several studies have investigated the impact of code smells in practice. Studies focusing on h...

On the evaluation of code smells and detection tools

Code smells refer to any symptom in the source code of a program that possibly indicates a deeper problem, hindering software maintenance and evolution. Detection of code smells is challenging for developers a...

On the influence of program constructs on bug localization effectiveness

Software projects often reach hundreds or thousands of files. Therefore, manually searching for code elements that should be changed to fix a failure is a difficult task. Static bug localization techniques pro...

DyeVC: an approach for monitoring and visualizing distributed repositories

Software development using distributed version control systems has become more frequent recently. Such systems bring more flexibility, but also greater complexity to manage and monitor multiple existing reposi...

A genetic algorithm based framework for software effort prediction

Several prediction models have been proposed in the literature using different techniques obtaining different results in different contexts. The need for accurate effort predictions for projects is one of the ...

Elaboration of software requirements documents by means of patterns instantiation

Studies show that problems associated with the requirements specifications are widely recognized for affecting software quality and impacting effectiveness of its development process. The reuse of knowledge ob...

ArchReco: a software tool to assist software design based on context aware recommendations of design patterns

This work describes the design, development and evaluation of a software Prototype, named ArchReco, an educational tool that employs two types of Context-aware Recommendations of Design Patterns, to support us...

On multi-language software development, cross-language links and accompanying tools: a survey of professional software developers

Non-trivial software systems are written using multiple (programming) languages, which are connected by cross-language links. The existence of such links may lead to various problems during software developmen...

SoftCoDeR approach: promoting Software Engineering Academia-Industry partnership using CMD, DSR and ESE

The Academia-Industry partnership has been increasingly encouraged in the software development field. The main focus of the initiatives is driven by the collaborative work where the scientific research work me...

Issues on developing interoperable cloud applications: definitions, concepts, approaches, requirements, characteristics and evaluation models

Among research opportunities in software engineering for cloud computing model, interoperability stands out. We found that the dynamic nature of cloud technologies and the battle for market domination make clo...

Game development software engineering process life cycle: a systematic review

Software game is a kind of application that is used not only for entertainment, but also for serious purposes that can be applicable to different domains such as education, business, and health care. Multidisc...

Correlating automatic static analysis and mutation testing: towards incremental strategies

Traditionally, mutation testing is used as test set generation and/or test evaluation criteria once it is considered a good fault model. This paper uses mutation testing for evaluating an automated static anal...

A multi-objective test data generation approach for mutation testing of feature models

Mutation approaches have been recently applied for feature testing of Software Product Lines (SPLs). The idea is to select products, associated to mutation operators that describe possible faults in the Featur...

An extended global software engineering taxonomy

In Global Software Engineering (GSE), the need for a common terminology and knowledge classification has been identified to facilitate the sharing and combination of knowledge by GSE researchers and practition...

A systematic process for obtaining the behavior of context-sensitive systems

Context-sensitive systems use contextual information in order to adapt to the user’s current needs or requirements failure. Therefore, they need to dynamically adapt their behavior. It is of paramount importan...

Distinguishing extended finite state machine configurations using predicate abstraction

Extended Finite State Machines (EFSMs) provide a powerful model for the derivation of functional tests for software systems and protocols. Many EFSM based testing problems, such as mutation testing, fault diag...

Extending statecharts to model system interactions

Statecharts are diagrams comprised of visual elements that can improve the modeling of reactive system behaviors. They extend conventional state diagrams with the notions of hierarchy, concurrency and communic...

On the relationship of code-anomaly agglomerations and architectural problems

Several projects have been discontinued in the history of the software industry due to the presence of software architecture problems. The identification of such problems in source code is often required in re...

An approach based on feature models and quality criteria for adapting component-based systems

Feature modeling has been widely used in domain engineering for the development and configuration of software product lines. A feature model represents the set of possible products or configurations to apply i...

Patch rejection in Firefox: negative reviews, backouts, and issue reopening

Writing patches to fix bugs or implement new features is an important software development task, as it contributes to raise the quality of a software system. Not all patches are accepted in the first attempt, ...

Investigating probabilistic sampling approaches for large-scale surveys in software engineering

Establishing representative samples for Software Engineering surveys is still considered a challenge. Specialized literature often presents limitations on interpreting surveys’ results, mainly due to the use o...

Characterising the state of the practice in software testing through a TMMi-based process

The software testing phase, despite its importance, is usually compromised by the lack of planning and resources in industry. This can risk the quality of the derived products. The identification of mandatory ...

Self-adaptation by coordination-targeted reconfigurations

A software system is self-adaptive when it is able to dynamically and autonomously respond to changes detected either in its internal components or in its deployment environment. This response is expected to ensu...

Templates for textual use cases of software product lines: results from a systematic mapping study and a controlled experiment

Use case templates can be used to describe functional requirements of a Software Product Line. However, to the best of our knowledge, no efforts have been made to collect and summarize these existing templates...

F3T: a tool to support the F3 approach on the development and reuse of frameworks

Frameworks are used to enhance the quality of applications and the productivity of the development process, since applications may be designed and implemented by reusing framework classes. However, frameworks ...

NextBug: a Bugzilla extension for recommending similar bugs

Due to the characteristics of the maintenance process followed in open source systems, developers are usually overwhelmed with a great amount of bugs. For instance, in 2012, approximately 7,600 bugs/month were...

Assessing the benefits of search-based approaches when designing self-adaptive systems: a controlled experiment

The well-orchestrated use of distilled experience, domain-specific knowledge, and well-informed trade-off decisions is imperative if we are to design effective architectures for complex software-intensive syst...

Revealing influence of model structure and test case profile on the prioritization of test cases in the context of model-based testing

Test case prioritization techniques aim at defining an order of test cases that favor the achievement of a goal during test execution, such as revealing failures as earlier as possible. A number of techniques ...

A metrics suite for JUnit test code: a multiple case study on open source software

The code of JUnit test cases is commonly used to characterize software testing effort. Different metrics have been proposed in literature to measure various perspectives of the size of JUnit test cases. Unfort...

Designing fault-tolerant SOA based on design diversity

Over recent years, software developers have been evaluating the benefits of both Service-Oriented Architecture (SOA) and software fault tolerance techniques based on design diversity. This is achieved by creat...

Method-level code clone detection through LWH (Light Weight Hybrid) approach

Many researchers have investigated different techniques to automatically detect duplicate code in programs exceeding thousand lines of code. These techniques have limitations in finding either the structural o...

The problem of conceptualization in god class detection: agreement, strategies and decision drivers

The concept of code smells is widespread in Software Engineering. Despite the empirical studies addressing the topic, the set of context-dependent issues that impacts the human perception of what is a code sme...

Research Topics in Software Engineering

This seminar is an opportunity to become familiar with current research in software engineering and more generally with the methods and challenges of scientific research.

Each student will be asked to study some papers from the recent software engineering literature and review them. This is an exercise in critical review and analysis. Active participation is required (a presentation of a paper as well as participation in discussions).

The aim of this seminar is to introduce students to recent research results in the area of programming languages and software engineering. To accomplish that, students will study and present research papers in the area as well as participate in paper discussions. The papers will span topics in both theory and practice, including papers on program verification, program analysis, testing, programming language design, and development tools.

Wonderful Engineering

Software Engineer Research Paper Topics 2021: Top 5

Whether you’re just starting out or you’re close to finishing your software engineering degree, it pays to look for possible research paper topics early. Doing so gives you a head start in your course.

First off, remember that software engineering revolves around developing and improving technology.

Your research paper should share that goal. The topic shouldn’t be so complex that you can’t work through it smoothly. At the same time, it shouldn’t be so easy that the answer can simply be looked up online.

Choosing a topic can be difficult, and a poor choice often leaves students struggling with the assignment. To help you land on the best topic for your needs, we have listed the top 5 software engineering research paper topics in the sections below.

Machine Learning

Machine learning is one of the most popular research topics among software engineers. If you’re not yet familiar with it, it’s a field concerned with building programs that improve their performance automatically by learning from existing data and experience.

In essence, machine learning aims to build intelligent tools. You will need to apply various statistical methods in your algorithms, which can make it a complex and lengthy topic.

Even so, the good thing about this field is that it covers many subtopics, such as face spoof detection, iris detection, sentiment analysis, and the like. Machine learning is very often paired with detection systems of this kind.

Artificial Intelligence

Artificial Intelligence is a broader concept than machine learning. In fact, machine learning is one branch of AI.

AI refers to human-like intelligence built into machines and computer programs. Focusing on it will give you many more topics to write about. Since it’s present in fields like gaming, marketing, and the automation of routine tasks, you will have more material to refer to.

Some things that you can write about in your paper include AI’s relationship with software engineering, robotics, and natural language processing. You can also write about the different types of artificial intelligence tools for a more guided research paper.

Internet Of Things

Another topic that you can write about is the Internet of Things, more commonly known as IoT. This refers to devices, machines, or even living beings that are interconnected over a network.

Writing about IoT opens up a huge array of possibilities. You can discuss open problems in the area that need additional solutions or improvements. You can also discuss specific hardware requirements, since IoT relies heavily on communication between devices and servers.

In addition, the Internet of Things is used in several fields like agriculture, e-commerce, and medicine. Because of this, you can rest assured that you won’t run out of things to talk about or refer to.

Software Development Models

Next up, we have software development models. If you want to write a research paper on how to start building an app or other software, software development models are a good choice of topic.

Here, you can write about what the concept is or delve deeper into its different types. You can look into the Waterfall, V-Model, Incremental, RAD, Agile, Iterative, Spiral, and Prototype models. You can choose one, several, or all of the models and then relate them to software engineering.

Clone Management

One of the most important elements in software engineering is the code base. Hence, using this as a research topic will help you stay relevant to your course and its needs. In particular, you can focus on clone management.

Clone management is the task of keeping a code base free from errors and duplicated code (code clones). What makes this a good topic is that the literature on it is still limited compared with other clone-related topics. Hence, you can be confident of a distinctive topic for your paper.
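As a rough illustration of the kind of check that clone management might automate, the sketch below flags runs of identical lines that appear in more than one place. It is a minimal example under simplifying assumptions, not a description of any particular tool: the function names `normalize` and `find_clones` and the `window` parameter are hypothetical, and only whitespace differences are ignored, so renamed variables or reordered statements would not be detected.

```python
import hashlib
from collections import defaultdict


def normalize(line: str) -> str:
    """Collapse whitespace so that formatting differences are ignored."""
    return " ".join(line.split())


def find_clones(files: dict[str, str], window: int = 3) -> list[list[tuple[str, int]]]:
    """Group locations whose `window` consecutive non-blank lines are identical.

    Each location is (file name, 1-based line number where the run starts).
    """
    seen = defaultdict(list)
    for name, text in files.items():
        # Pair every line with its original line number, then drop blank lines.
        numbered = [(i + 1, normalize(raw)) for i, raw in enumerate(text.splitlines())]
        numbered = [(n, line) for n, line in numbered if line]
        for start in range(len(numbered) - window + 1):
            chunk = "\n".join(line for _, line in numbered[start:start + window])
            digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
            seen[digest].append((name, numbered[start][0]))
    # Only runs that occur at two or more locations count as clones.
    return [locations for locations in seen.values() if len(locations) > 1]


if __name__ == "__main__":
    sample = {
        "a.py": "x = 1\ny = 2\nz = x + y\nprint(z)\n",
        "b.py": "x  =  1\ny = 2\nz = x + y\nreturn z\n",
    }
    for group in find_clones(sample):
        print(group)
```

Hashing fixed-size windows keeps the example short. A research paper on this topic would more likely compare token-based or syntax-tree-based techniques, which tolerate small edits between copies.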

To land on the best topic, take your interest into account. Look for the field that makes you curious and entertained. In this way, you can build motivation to actually know more about it, and not just for the sake of submitting.

Another good tip is to choose a unique topic. The ones we discussed above can be considered unique since they are some of the latest software-related topics. If you’re going to use a common one, then make sure that you put your own twist on it. You can also consider seeing the topic in a different light.

Anyhow, your research paper, its grade, and overall quality will greatly depend on what you choose to write about.


Computer Science > Software Engineering

Title: Automated Unit Test Improvement using Large Language Models at Meta

Abstract: This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms. In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers. We believe this is the first report on industrial scale deployment of LLM-generated code backed by such assurances of code improvement.
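The filtering idea in the abstract can be illustrated with a small sketch: a generated test is kept only if it builds, passes on every repeated run, and adds coverage. This is not Meta's implementation. The `Candidate` record, its field names, and the `accepted` function are hypothetical and exist only to show how a chain of filters might work.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Candidate:
    """A hypothetical record describing one generated test case."""
    name: str
    builds: bool
    pass_results: list    # outcome (True/False) of each repeated run
    coverage_gain: float  # extra coverage over the existing test suite


# Each filter returns True when the candidate may proceed to the next stage.
FILTERS: list = [
    lambda c: c.builds,                                      # must build
    lambda c: bool(c.pass_results) and all(c.pass_results),  # must pass reliably
    lambda c: c.coverage_gain > 0,                           # must add coverage
]


def accepted(candidates: Iterable[Candidate]) -> list:
    """Return only the candidates that clear every filter."""
    return [c for c in candidates if all(f(c) for f in FILTERS)]
```

Because every surviving candidate has cleared all three checks, a reviewer only sees tests that already show a measurable improvement over the original suite.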


Grad Coach

Research Topics & Ideas: CompSci & IT

50+ Computer Science Research Topic Ideas To Fast-Track Your Project

IT & Computer Science Research Topics

Finding and choosing a strong research topic is the critical first step when it comes to crafting a high-quality dissertation, thesis or research project. If you’ve landed on this post, chances are you’re looking for a computer science-related research topic, but aren’t sure where to start. Here, we’ll explore a variety of CompSci & IT-related research ideas and topic thought-starters, including algorithms, AI, networking, database systems, UX, information security and software engineering.

NB – This is just the start…

The topic ideation and evaluation process has multiple steps. In this post, we’ll kickstart the process by sharing some research topic ideas within the CompSci domain. This is the starting point, but to develop a well-defined research topic, you’ll need to identify a clear and convincing research gap, along with a well-justified plan of action to fill that gap.

If you’re new to the oftentimes perplexing world of research, or if this is your first time undertaking a formal academic research project, be sure to check out our free dissertation mini-course. In it, we cover the process of writing a dissertation or thesis from start to end. Be sure to also sign up for our free webinar that explores how to find a high-quality research topic. 

Overview: CompSci Research Topics

  • Algorithms & data structures
  • Artificial intelligence (AI)
  • Computer networking
  • Database systems
  • Human-computer interaction
  • Information security (IS)
  • Software engineering
  • Examples of CompSci dissertations & theses

Topics & Ideas: Algorithms & Data Structures

  • An analysis of neural network algorithms’ accuracy for processing consumer purchase patterns
  • A systematic review of the impact of graph algorithms on data analysis and discovery in social media network analysis
  • An evaluation of machine learning algorithms used for recommender systems in streaming services
  • A review of approximation algorithm approaches for solving NP-hard problems
  • An analysis of parallel algorithms for high-performance computing of genomic data
  • The influence of data structures on optimal algorithm design and performance in Fintech
  • A survey of algorithms applied in Internet of Things (IoT) systems in supply-chain management
  • A comparison of streaming algorithm performance for the detection of elephant flows
  • A systematic review and evaluation of machine learning algorithms used in facial pattern recognition
  • Exploring the performance of a decision tree-based approach for optimizing stock purchase decisions
  • Assessing the importance of complete and representative training datasets in agricultural machine learning-based decision-making
  • A comparison of deep learning algorithm performance for structured and unstructured datasets with “rare cases”
  • A systematic review of noise reduction best practices for machine learning algorithms in geoinformatics
  • Exploring the feasibility of applying information theory to feature extraction in retail datasets
  • Assessing the use case of neural network algorithms for image analysis in biodiversity assessment

Topics & Ideas: Artificial Intelligence (AI)

  • Applying deep learning algorithms for speech recognition in speech-impaired children
  • A review of the impact of artificial intelligence on decision-making processes in stock valuation
  • An evaluation of reinforcement learning algorithms used in the production of video games
  • An exploration of key developments in natural language processing and how they impacted the evolution of chatbots
  • An analysis of the ethical and social implications of artificial intelligence-based automated marking
  • The influence of large-scale GIS datasets on artificial intelligence and machine learning developments
  • An examination of the use of artificial intelligence in orthopaedic surgery
  • The impact of explainable artificial intelligence (XAI) on transparency and trust in supply chain management
  • An evaluation of the role of artificial intelligence in financial forecasting and risk management in cryptocurrency
  • A meta-analysis of deep learning algorithm performance in predicting cyber attacks in schools

Topics & Ideas: Networking

  • An analysis of the impact of 5G technology on internet penetration in rural Tanzania
  • Assessing the role of software-defined networking (SDN) in modern cloud-based computing
  • A critical analysis of network security and privacy concerns associated with Industry 4.0 investment in healthcare
  • Exploring the influence of cloud computing on security risks in fintech
  • An examination of the use of network function virtualization (NFV) in telecom networks in South America
  • Assessing the impact of edge computing on network architecture and design in IoT-based manufacturing
  • An evaluation of the challenges and opportunities in 6G wireless network adoption
  • The role of network congestion control algorithms in improving network performance on streaming platforms
  • An analysis of network coding-based approaches for data security
  • Assessing the impact of network topology on network performance and reliability in IoT-based workspaces

Topics & Ideas: Database Systems

  • An analysis of big data management systems and technologies used in B2B marketing
  • The impact of NoSQL databases on data management and analysis in smart cities
  • An evaluation of the security and privacy concerns of cloud-based databases in financial organisations
  • Exploring the role of data warehousing and business intelligence in global consultancies
  • An analysis of the use of graph databases for data modelling and analysis in recommendation systems
  • The influence of the Internet of Things (IoT) on database design and management in the retail grocery industry
  • An examination of the challenges and opportunities of distributed databases in supply chain management
  • Assessing the impact of data compression algorithms on database performance and scalability in cloud computing
  • An evaluation of the use of in-memory databases for real-time data processing in patient monitoring
  • Comparing the effects of database tuning and optimization approaches in improving database performance and efficiency in omnichannel retailing

Topics & Ideas: Human-Computer Interaction

  • An analysis of the impact of mobile technology on human-computer interaction prevalence in adolescent men
  • An exploration of how artificial intelligence is changing human-computer interaction patterns in children
  • An evaluation of the usability and accessibility of web-based systems for CRM in the fast fashion retail sector
  • Assessing the influence of virtual and augmented reality on consumer purchasing patterns
  • An examination of the use of gesture-based interfaces in architecture
  • Exploring the impact of ease of use in wearable technology on geriatric users
  • Evaluating the ramifications of gamification in the Metaverse
  • A systematic review of user experience (UX) design advances associated with Augmented Reality
  • Comparing end-user perceptions of natural language processing algorithms for automated customer response
  • Analysing the impact of voice-based interfaces on purchase practices in the fast food industry

Topics & Ideas: Information Security

  • A bibliometric review of current trends in cryptography for secure communication
  • An analysis of secure multi-party computation protocols and their applications in cloud-based computing
  • An investigation of the security of blockchain technology in patient health record tracking
  • A comparative study of symmetric and asymmetric encryption algorithms for instant text messaging
  • A systematic review of secure data storage solutions used for cloud computing in the fintech industry
  • An analysis of intrusion detection and prevention systems used in the healthcare sector
  • Assessing security best practices for IoT devices in political offices
  • An investigation into the role social media played in shifting regulations related to privacy and the protection of personal data
  • A comparative study of digital signature scheme adoption in property transfers
  • An assessment of the security of wireless communication systems used in tertiary institutions

Topics & Ideas: Software Engineering

  • A study of agile software development methodologies and their impact on project success in pharmacology
  • Investigating the impacts of software refactoring techniques and tools in blockchain-based developments
  • A study of the impact of DevOps practices on software development and delivery in the healthcare sector
  • An analysis of software architecture patterns and their impact on the maintainability and scalability of cloud-based offerings
  • A study of the impact of artificial intelligence and machine learning on software engineering practices in the education sector
  • An investigation of software testing techniques and methodologies for subscription-based offerings
  • A review of software security practices and techniques for protecting against phishing attacks from social media
  • An analysis of the impact of cloud computing on the rate of software development and deployment in the manufacturing sector
  • Exploring the impact of software development outsourcing on project success in multinational contexts
  • An investigation into the effect of poor software documentation on app success in the retail sector

CompSci & IT Dissertations/Theses

While the ideas we’ve presented above are a decent starting point for finding a CompSci-related research topic, they are fairly generic and non-specific. So, it helps to look at actual dissertations and theses to see how this all comes together.

Below, we’ve included a selection of research projects from various CompSci-related degree programs to help refine your thinking. These are actual dissertations and theses, written as part of Master’s and PhD-level programs, so they can provide some useful insight as to what a research topic looks like in practice.

  • An array-based optimization framework for query processing and data analytics (Chen, 2021)
  • Dynamic Object Partitioning and replication for cooperative cache (Asad, 2021)
  • Embedding constructural documentation in unit tests (Nassif, 2019)
  • PLASA | Programming Language for Synchronous Agents (Kilaru, 2019)
  • Healthcare Data Authentication using Deep Neural Network (Sekar, 2020)
  • Virtual Reality System for Planetary Surface Visualization and Analysis (Quach, 2019)
  • Artificial neural networks to predict share prices on the Johannesburg stock exchange (Pyon, 2021)
  • Predicting household poverty with machine learning methods: the case of Malawi (Chinyama, 2022)
  • Investigating user experience and bias mitigation of the multi-modal retrieval of historical data (Singh, 2021)
  • Detection of HTTPS malware traffic without decryption (Nyathi, 2022)
  • Redefining privacy: case study of smart health applications (Al-Zyoud, 2019)
  • A state-based approach to context modeling and computing (Yue, 2019)
  • A Novel Cooperative Intrusion Detection System for Mobile Ad Hoc Networks (Solomon, 2019)
  • HRSB-Tree for Spatio-Temporal Aggregates over Moving Regions (Paduri, 2019)

Looking at these titles, you can probably pick up that the research topics here are quite specific and narrowly focused, compared to the generic ones presented earlier. This is an important thing to keep in mind as you develop your own research topic. That is to say, to create a top-notch research topic, you must be precise and target a specific context with specific variables of interest. In other words, you need to identify a clear, well-justified research gap.

Fast-Track Your Research Topic

If you’re still feeling a bit unsure about how to find a research topic for your Computer Science dissertation or research project, check out our Topic Kickstarter service.



150 Best Research Paper Topics For Software Engineering

Software Engineering is a branch of engineering that deals with the creation and improvement of software applications using specific methodologies and clearly defined scientific principles. When developing software products, certain procedures must be followed, the outcome of which is a reliable software product. Software is a collection of executable program code with associated libraries. Software that is designed to meet certain requirements is referred to as a Software Product. This is an excellent subject for a master's thesis, research, or project. There are a variety of topics within Software Engineering that will be useful to M.Tech and other master's students writing a software thesis.

Why is Software Engineering required?

Software Engineering is necessary due to the frequent shifts in the requirements of users as well as in the environment. Through your research and thesis, you will learn more about the significance of Software Engineering. Here are some other reasons why software engineering is needed:

  • Big Software: The sheer size of modern software makes software engineering necessary.
  • Scalability: Software Engineering makes it possible to scale up existing software rather than develop brand-new software.
  • Cost: Software Engineering also cuts down the cost that is incurred during software development.
  • The dynamic nature of Software: Software Engineering is crucial when new features have to be added to existing software, since the nature of software is fluid.
  • Better Quality Management: Software Engineering provides more efficient software development processes, which lead to higher-quality services.

Best Research Paper Topics on Software

  • Software Engineering Management: Unified Software Development Process and Extreme Programming
  • There are many difficulties in managing software development for the web-based applications and systems integration projects completed in recent times.
  • The Blue Sky Software Consulting Company Analysis
  • Blue Sky Software Consulting has seen great success over 15 years, but the company is not as well equipped for the current market.
  • LabVIEW Software: Design Systems of Measurement
  • LabVIEW is a software program that was created to design systems for measurement. LabVIEW gives you a range of instruments to control the process in an experiment.
  • Software-producing Firm Reducing Inventory
  • The link between the reduction in inventory levels and the number of orders is evident. An organization that produces software may consider reducing its inventory to a lower level.
  • Moet Hennessy - Louis Vuitton: Enterprise Software
  • The report will demonstrate how the introduction of ERP will help the LVMH Group improve its results by improving its inventories, logistics and accounting.
  • Virtualization and Software-Defined Networking
  • The goal of this paper is to analyze the developments in the field of virtualization, software-defined networks and security for networks in the last three years.
  • Computer Hardware and Software Components
  • Computers, first developed in the 1940s, have evolved into complex machines that require both software and hardware for their operation.
  • Applications, Software and System Development
  • The use of Microsoft Office applications greatly enhances productivity in the classroom, at work, and during everyday activities at home.
  • PeopleSoft Inc.'s Software Architecture and Design
  • With the PIA architecture, any company with an ERP application can access all of its operations through a Web browser.
  • Co-operative Banking Group's Enterprise Software
  • The report demonstrates how the implementation of the ERP system within the Co-operative Banking Group will help in improving the company's accounting, inventory and logistics processes.
  • Software Testing: Manual and Automated Web-Application Testing Tools
  • This research is an empirical study of automated and manual web-based application testing tools to determine the best tool for testing software.
  • JDA Software Company's Services
  • JDA Software is a company that has proven its worth in the development of services in areas like manufacturing, wholesale distribution, retailing and travel.
  • Data Management, Networking and Enterprise Software
  • Enterprise software is typically developed "in-house" and thus costs more than purchasing the software from another firm.
  • Software Workshops and Seminars Reflections
  • Most seminars inspire participants to use their potential as they strive to attain their goals.
  • The Various Enterprise Resource Planning Software Packages
  • This paper's purpose is to provide an overview of the various Enterprise Resource Planning (ERP) software applications that are widely employed by companies to manage their business operations.
  • Explore Factors in IBM SPSS Statistical Software
  • The "Explore" command in IBM SPSS generates an output with a variety of statistics for a single variable, across the entire sample or in sections of the sample.
  • Split Variables in IBM SPSS Statistical Software
  • IBM SPSS provides an option to split files into groups. The group membership of cases is determined by the values of the split variables.
  • Syntax Code Writing in Statistical Software
  • Analyzing quantitative data with the IBM SPSS software package often involves performing a variety of operations to calculate statistics for the data.
  • Data Coding in Statistical Software
  • Data coding is of utmost importance when the data has to be analyzed properly, and it plays an important role whenever you need to use statistical software.
  • Software Piracy at Kaspersky Cybersecurity Company
  • Software piracy is a pressing current issue that is manifested both locally with respect to an individual company and also globally.
  • Hotjar: Web Analytics Software Difference
  • This report examines Hotjar, a web-based analytics tool that comes with a full set of evaluation tools. The paper examines its strengths and advantages and shows how it can aid management decision-making.
  • Avast Software: Company Analysis
  • Avast Software is a globally well-known multinational company that is an industry leader in providing security solutions for both business and individual customers.
  • Project Failure, Project Planning Fundamentals, and Software Tools and Techniques for Alternative Scheduling
  • From lack of communication to generally unfavourable working conditions, projects may fail when managers fail to prepare for their implementation.
  • Computer Elements such as Hardware and Software
  • Personal computers are usually different from computers used for business in terms of capabilities and the extent of technology used within the equipment.
  • Review of a New Framework for Software Reliability Measurement
  • This study draws upon an in-depth review of software reliability measurement methods and proposes a new framework for reliability measurement built on the software metrics studied in the work of Amar and Rabai.

Good Software Research Topics & Essay Examples

  • Task Management Software in Organization
  • The goal of the plan for managing projects is to present the process of creating task management software that can be integrated into the context of the company.
  • A task management software plan's risk management strategy
  • The present study introduces us to the techniques for risk identification as well as quality assurance and a control plan and explains their significance.
  • Computer Software Development and Reality Shows
  • The growth of software in computers has been at such a fast rate over the last 10 years that it has impacted all aspects of our lives and every fibre of our being.
  • Scrum - Software Development Process
  • Digital and computerized systems have brought life to many areas. Scrum is a process for software development that aims for high quality and efficiency.
  • Distribution of Anti-Virus Software
  • Numerous new threats are reported every fortnight. Cyberattacks, viruses, and other cyber-related threats are becoming an issue.
  • Marketing Plan: Innovative Type of Software Product
  • This paper will create a marketing plan for a new kind of software, which will help to define the client segment, the price, and the communications platform.
  • Marketing System of Sakhr Software Co
  • The principal objective of this paper is to examine the marketing process in an organization such as Sakhr Software Co.
  • Managing Information of Sakhr Software Co
  • This paper will examine the ideas of managing information for Sakhr Software, which is a well-known language software firm.
  • CRM Software in Amazon: Gains
  • The customer management software that Amazon.com developed has, from the beginning, been among the latest technology of its kind.
  • Neurofeedback Software and Technology Comparison
  • MIDI technology helps make creating, learning, and playing music more enjoyable. Mobile phones, music keyboards, computers, and other devices use MIDI.
  • PeopleSoft Software and HR.net Enterprise Software
  • With the help of HRIS software, HR employees are able to manage their own benefits updates and make changes, allowing them to take more time to focus on other important tasks.
  • Business Applications: Revelation HelpDesk by Yellow Fish Software
  • "Revelation HelpDesk" is online tracking and support software that facilitates seamless coordination between the most important divisions within an organization.
  • 3D signal editing methods and editing software for stereoscopic movies
  • 3D editing for movies is one of the newest trends and is among the most complex processes in the modern film industry.
  • ERP Software in Inventory Management
  • ERP applications for inventory management are useful when a business has to manage how it receives goods and clears out merchandise.
  • The Capabilities of Compiere Software and How Well It Fits Into Different Industries
  • The ERP software Compiere can be used by a wide variety of users, including governments, businesses, and non-governmental organizations (NGOs).
  • Software Tools for Qualitative Research
  • This paper reviews software tools to solve complicated tasks in the analysis of data. The paper compares NVivo, HyperRESEARCH, and Dedoose.
  • Data Scientist and Software Development
  • Data scientists convert data into insights, giving elaborate guidance to those who use the data to make educated decisions and take action.
  • IPR Violations in Software Development
  • Copyright law protects only the expression, not the underlying software concept. It prohibits copying source code without permission.
  • Health IT: Epic Software Analysis
  • Implementation and adoption of Health IT systems are crucial to improving the efficiency of medical practices and workflows as well as patient outcomes.
  • Agile Software Development Process
  • The agile process for software development offers numerous benefits, such as the speedy and continuous execution of your project.
  • Project Management Software and Tools Comparison
  • Managers use the software to ensure that no worker receives more work than others and that no worker falls behind in their job.
  • Visually impaired people: challenges in Assistive Technology Software
  • Blind people face a number of disadvantages each day when using digital technology. The various types of software discussed in this paper have been specifically designed to help improve the lives of blind people.
  • WBS completion and software project management
  • The results of the PERT analysis were used to develop the Gantt chart. This essay provides an account of the method of working with the Gantt chart.
  • International Software Development's Ethical Challenges: User-Useful Software
  • Ethics is important in software development. It helps the creator build software that is useful for both the user and the management.
  • Achieving the Optimal Process. Software Development
  • The industry of software development is growing rapidly as the requirements of users change. This requires applications to meet these needs.

Innovative Software to Blog About

  • System Software: Analysis of Various Types of System Software
  • The paper gives opinions on various kinds of system software, weighing their strengths and weaknesses based on the author's personal experience.
  • Sakhr Software Co.'s Marketing System
  • The principal goal of this paper is to study the uniqueness of the system of marketing in such an organization as Sakhr Software Co from Kuwait, which specializes in NLP.
  • Program Code in Assembly Language Using Easy68K Software
  • The report describes a typical scenario for writing program code in assembly language with Easy68K software. The appropriate tests were carried out successfully, with outputs.
  • Benefits and Drawbacks of Agile Software Development Techniques
  • The use of agile methodologies in the software development process contributes to the improvement of work as well as the effectiveness of performance.
  • Large Scale Software Development
  • This report gives information on this Resource Scheduling project. It can be useful to an advisory firm that offers various types of resources.
  • Penguin Sleuth, a Forensic Software Tool
  • The primary goal of this paper is to examine the various tools for forensic analysis and also provide a comprehensive overview of the functions available for each tool or tool pack.
  • System Software: Computer System Management
  • Computer software comprises precise preprogrammed instructions that regulate and coordinate hardware components of the computer.
  • Ethical Issues Involved in Software Project Management
  • Ethics within IT have been proven to be very different from other areas of ethics. Ethics issues in IT are usually described as having little.
  • Advantages and Disadvantages of Software Suites
  • Computer software comprises specific preprogrammed commands that control and coordinate computer hardware components of an info system.
  • Descriptive Statistics Using SPSS Software Suite
  • This paper focuses on the process of producing a descriptive statistical analysis using SPSS.
  • Software Development: Creating a Prototype
  • The aim of this article is to develop an experimental software program that can be utilized to aid breast cancer patients.
  • Software Engineering and Methodologies
  • The paper explains how the author learned the software engineering process and methods as an outcome of his experiences at BTR IT Consulting Company.
  • Information System Hardware and Software
  • Information technology covers a wide variety of applications in which computer software, along with hardware, is employed.
  • Software Development Project Using Agile Methods
  • The report will provide reasons behind why the agile methodology was chosen, the method used, how the team applied this methodology, and also the lessons learned from the massive project of software development.
  • Flight Planning Software and Aircraft Incidents
  • Software for flight planning refers to programs utilized to control and manage flights and other procedures while the plane is in flight.
  • Hardware and Software Systems and Criminal Justice
  • One of the primary techniques used to decrease the chance of criminal activity is crime mapping. This involves collecting information on crimes and their causes and then analyzing it in order to identify issues.
  • Why Open-Source Software Will (Or Will Not) Soon Dominate the Field of Database Management Tools
  • The research aims to determine whether open-source software will come to dominate the database field, given how the business market is evolving.
  • Business HRM Software and the Affordable Care Act
  • The Affordable Care Act has its strengths but also flaws. The reason is the complex nature of the law that creates a variety of challenges.
  • Antivirus Software Ensuring Security Online
  • Although antivirus software is imperfect and fragmentary, and should be seen as a supplement rather than the sole instrument, it helps protect one's privacy online.
  • Evaluating Teaching Instructional Software for 21st-Century Technology Resources
  • The software for teaching Joe Rock and Friends Book 2 is designed for third-grade students who are studying English as an additional language to read and learn new vocabulary.
  • Britam Insurance Company's Sales and Marketing Management Software
  • Britam Insurance Company needs to implement the latest marketing and management software in order to keep its place at the forefront of the extremely competitive insurance market.
  • Software Programs: Adobe Illustrator
  • With Adobe Illustrator, users can quickly and precisely create various products, such as logos, icons, and drawings.
  • Strawberry Business: Software Project Management
  • Although the company has an established management strategy as well as a team of employees and efficient information systems, it lacks a standardized workplace culture and customer relations systems.
  • Value of Salesforce Software Using VRIO Model
  • Salesforce CRM software is created to help managers manage their businesses effectively. It connects all teams and managers and collects and manages customer information.

Most Interesting Software Research Titles

  • What Are the Essential Attributes of Good Software?
  • How Computer Software Can Be Used as a Tool for Education
  • Accounting Software and Application Software
  • Online National Polling Software Requirements Specification
  • Building Their Software for a Company's Success
  • The Role of Antivirus Software in Protecting Your Computer Data
  • Intellectual Property Rights, Innovation and Software Technologies
  • Software Piracy and the Canadian Piracy Act
  • For the development of software projects, agile methodologies and their Waterscrumfall derivative are used.
  • Software Tools for Improving Underground Mine Access Layouts
  • How Software Can Support Academic Librarians' Changing Role
  • Using the Untangle Software to Overcome Obstacles for Small Businesses
  • By employing travel portal software, online booking sales will increase.
  • Analysis of Network Externality and Commercial Software Piracy
  • Accounting Software and Business Solutions
  • Analysis of Key Issues and Effects Relating to International Software Piracy
  • The Distinction Between Computer Science and Software Engineering
  • Modulation: Computer Software and Unknown Music Virus
  • Math Software for High School Students with Disabilities
  • Keyboarding Software Packages: Analysis and Purchase Recommended
  • Basic Software Development Life Cycle
  • India's Problems with Software Patents, Copyright, and Piracy
  • Why Has India Been Able to Build a Thriving Software Industry
  • Does Social Software Increase Labour Productivity
  • The Role of Open Source Software for Database Servers

Simple Software Essay Ideas

  • Human Capital and the Indian Software Industry
  • Input-Output Computer Windows Software
  • Business Software Development and Its Implementation
  • Evaluating Financial Management Software: Quicken Software
  • Which governance tools are important in Africa for combating software piracy?
  • Distinguish Between Proprietary Software and Off-The-Shelf
  • Does Social Software Support Service Innovation
  • Ambulatory Revenue Management Software
  • Difference Between Operating Systems and Application Software
  • China and India: Leading a Global Insurgency in the Software Sector
  • Call Accounting Software for Every Enterprise
  • Technology Standards for Software Outsourcing
  • The Importance of the Agile Approach for Software Development
  • Application Software: Publisher, Word, and Excel
  • Employee Monitoring Through Computer Software
  • Software Development Lifecycle and Testing's Importance
  • Tools for Global Conditional Policy to Combat Software Piracy
  • Software for Designing Solar Water Heating Systems
  • Open Source Software, Competition, and Potential Entry
  • Indian Software Industry: Gains are distorted and consolidated
  • Software Programs for Disabled Computer Users and Assistive Technology
  • Agile Software Architecture, Written by Christine Miyachi
  • Software Development: The Disadvantages of Agile Methods
  • Computer Software Technology for Early Childhood
  • Developing Test Automation Software Development

Easy Software Essay Topics

  • Growth Trends, Barriers, and Government Initiatives in the Indian Software Industry
  • How Does Enterprise Software Enable a Business to Use
  • Integrated Management Software the Processing of Information
  • Computer Software Training for Doctor's Office
  • Software Intellectual Property Rights and Venture Capitalist Access
  • Computer Science Software Specification
  • Software Projects and Student Software Risk Exposure
  • Why It Is Difficult to Create Software for Wireless Devices
  • Affiliate Tracking Software Your Payment Options
  • How Can Volkswagen Recover From the Cheating Issues It Had Because Illegal Software Was Installed?
  • Principles of Best Forensic Software Tool
  • The American Software Industry: A Historical Analysis
  • How Peripheral Developers Contribute to the Development of Open-Source Software
  • Agile Methodologies for Software Development
  • Key Macroeconomic Factors That Affect Software Industry
  • The Software Industry and India's Economic Development
  • Improving Customer Service Through Help Desk Software
  • Enterprise Resource Planning and SAP Software
  • Antivirus Software and Its Importance
  • Hardware and Software Used in Public Bank
  • The Effects of Computer Software Piracy on the Global Economy
  • Using the WinQSB Software in Critical Path Analysis
  • General Information About Interactive Multimedia-Based Educational Software
  • How Affiliate Tracking Software Can Benefit You
  • Computer Software and Recent Technologies

Frequently asked questions

What are the main topics of software engineering?

software development.

  • Introduction
  • Models and architecture for software development
  • Project management for software (SPM)
  • Software prerequisites
  • Testing and debugging software

What makes good research in software engineering?

The most typical research strategy in software engineering is to come up with a novel method or methodology and then either validate it through analysis or demonstrate its application through a case study.

What projects are good for software engineering?

  • Monitoring of Android tasks
  • Analyzing attitudes to rate products
  • ATM with a fingerprint-based method
  • A modern system for managing employees
  • Using the AES technique for image encryption
  • Vote-by-fingerprint technology
  • System for predicting the weather

What are the research methods in software engineering?

We list and contrast the five categories of research methodology that, in our opinion, are most pertinent to software engineering: controlled experiments (including quasi-experiments); case studies (both exploratory and confirmatory); survey research; ethnographies; and action research.

Is software engineering a research area?

A relatively recent area of research, software engineering is derived from computer science. Its significance has been generally acknowledged by more and more academics in the field of computers throughout the course of six decades, from 1948 to the present, and it has developed into a vibrant and promising division of the computing profession.

Is software engineering easy?

Learning software engineering can be challenging at first, especially for those without programming or coding experience or any background in technology. However, numerous courses, tools, and other resources are available to assist with learning how to become a software engineer.

Who is the father of software engineering?

Watts S. Humphrey (July 4, 1927 to October 28, 2010), an American software engineering pioneer born in Battle Creek, Michigan (U.S.), is known as the "father of software quality."

What do you do in software engineering?

Roles and tasks for software engineers:

  • Creating and maintaining software systems
  • Testing and evaluating new software applications
  • Optimizing software speed and scalability
  • Writing and testing code
  • Consulting with stakeholders such as clients, engineers, security experts, and others

Which is better, IT or software engineering?

IT support engineers cannot build sophisticated solutions, while software engineers can. In short, software engineers are in charge of creating and implementing software. Knowing the distinction makes it easier to choose the right person to handle tech-related problems.

Are junior software engineers in demand?

Yes, there is a need for young coders.

Is software engineering going down?

Software experts and software goods are oversaturating the job market for software engineers.

What degree do I need to be a software engineer?

An undergraduate degree.

Can I be a software engineer without a degree?

Many software developers lack a degree from a reputable university (or, in some cases, have no degree at all).

How many years can a software engineer work?

An engineer who wants to work in IT has a 15–20 year window.

How many hours do software engineers work?

Software developers put in 8 to 9 hours each day, or 40 to 45 hours per week.

📚 A curated list of papers for Software Engineers

facundoolano/software-papers

Papers for Software Engineers

A curated list of papers that may be of interest to Software Engineering students or professionals. See the sources and selection criteria below.

Von Neumann's First Computer Program. Knuth (1970). Computer History; Early Programming

  • The Education of a Computer. Hopper (1952).
  • Recursive Programming. Dijkstra (1960).
  • Programming Considered as a Human Activity. Dijkstra (1965).
  • Goto Statement Considered Harmful. Dijkstra (1968).
  • Program development by stepwise refinement. Wirth (1971).
  • The Humble Programmer. Dijkstra (1972).
  • Computer Programming as an Art. Knuth (1974).
  • The paradigms of programming. Floyd (1979).
  • Literate Programming. Knuth (1984).

Computing Machinery and Intelligence. Turing (1950). Early Artificial Intelligence

  • Some Moral and Technical Consequences of Automation. Wiener (1960).
  • Steps towards Artificial Intelligence. Minsky (1960).
  • ELIZA—a computer program for the study of natural language communication between man and machine. Weizenbaum (1966).
  • A Theory of the Learnable. Valiant (1984).

A Method for the Construction of Minimum-Redundancy Codes. Huffman (1952). Information Theory

  • A Universal Algorithm for Sequential Data Compression. Ziv, Lempel (1977).
  • Fifty Years of Shannon Theory. Verdú (1998).

Engineering a Sort Function. Bentley, McIlroy (1993). Data Structures; Algorithms

  • On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem. Kruskal (1956).
  • A Note on Two Problems in Connexion with Graphs. Dijkstra (1959).
  • Quicksort. Hoare (1962).
  • Space/Time Trade-offs in Hash Coding with Allowable Errors. Bloom (1970).
  • The Ubiquitous B-Tree. Comer (1979).
  • Programming pearls: Algorithm design techniques. Bentley (1984).
  • Programming pearls: The back of the envelope. Bentley (1984).
  • Making data structures persistent. Driscoll et al (1986).

A Design Methodology for Reliable Software Systems. Liskov (1972). Software Design

  • On the Criteria To Be Used in Decomposing Systems into Modules. Parnas (1971).
  • Information Distribution Aspects of Design Methodology. Parnas (1972).
  • Designing Software for Ease of Extension and Contraction. Parnas (1979).
  • Programming as Theory Building. Naur (1985).
  • Software Aging. Parnas (1994).
  • Towards a Theory of Conceptual Design for Software. Jackson (2015).

Programming with Abstract Data Types. Liskov, Zilles (1974). Abstract Data Types; Object-Oriented Programming

  • The Smalltalk-76 Programming System Design and Implementation. Ingalls (1978).
  • A Theory of Type Polymorphism in Programming. Milner (1978).
  • On understanding types, data abstraction, and polymorphism. Cardelli, Wegner (1985).
  • SELF: The Power of Simplicity. Ungar, Smith (1991).

Why Functional Programming Matters. Hughes (1990). Functional Programming

  • Recursive Functions of Symbolic Expressions and Their Computation by Machine. McCarthy (1960).
  • The Semantics of Predicate Logic as a Programming Language. Van Emden, Kowalski (1976).
  • Can Programming Be Liberated from the von Neumann Style? Backus (1978).
  • The Semantic Elegance of Applicative Languages. Turner (1981).
  • The essence of functional programming. Wadler (1992).
  • QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs. Claessen, Hughes (2000).
  • Church's Thesis and Functional Programming. Turner (2006).

An Incremental Approach to Compiler Construction. Ghuloum (2006). Language Design; Compilers

  • The Next 700 Programming Languages. Landin (1966).
  • Programming pearls: little languages. Bentley (1986).
  • The Essence of Compiling with Continuations. Flanagan et al (1993).
  • A Brief History of Just-In-Time. Aycock (2003).
  • LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. Lattner, Adve (2004).
  • A Unified Theory of Garbage Collection. Bacon, Cheng, Rajan (2004).
  • A Nanopass Framework for Compiler Education. Sarkar, Waddell, Dybvig (2005).
  • Bringing the Web up to Speed with WebAssembly. Haas (2017).

No Silver Bullet: Essence and Accidents of Software Engineering. Brooks (1987). Software Engineering; Project Management

  • How do committees invent? Conway (1968).
  • Managing the Development of Large Software Systems. Royce (1970).
  • The Mythical Man Month. Brooks (1975).
  • On Building Systems That Will Fail. Corbató (1991).
  • The Cathedral and the Bazaar. Raymond (1998).
  • Out of the Tar Pit. Moseley, Marks (2006).

Communicating sequential processes. Hoare (1978). Concurrency

  • Solution Of a Problem in Concurrent Program Control. Dijkstra (1965).
  • Monitors: An operating system structuring concept. Hoare (1974).
  • On the Duality of Operating System Structures. Lauer, Needham (1978).
  • Software Transactional Memory. Shavit, Touitou (1997).

The UNIX Time-Sharing System. Ritchie, Thompson (1974). Operating Systems

  • An Experimental Time-Sharing System. Corbató, Merwin Daggett, Daley (1962).
  • The Structure of the "THE"-Multiprogramming System. Dijkstra (1968).
  • The nucleus of a multiprogramming system. Hansen (1970).
  • Reflections on Trusting Trust. Thompson (1984).
  • The Design and Implementation of a Log-Structured File System. Rosenblum, Ousterhout (1991).

A Relational Model of Data for Large Shared Data Banks. Codd (1970). Databases

  • Granularity of Locks and Degrees of Consistency in a Shared Data Base. Gray et al (1975).
  • Access Path Selection in a Relational Database Management System. Selinger et al (1979).
  • The Transaction Concept: Virtues and Limitations. Gray (1981).
  • The design of POSTGRES. Stonebraker, Rowe (1986).
  • Rules of Thumb in Data Engineering. Gray, Shenoy (1999).

A Protocol for Packet Network Intercommunication. Cerf, Kahn (1974). Networking

  • Ethernet: Distributed packet switching for local computer networks. Metcalfe, Boggs (1978).
  • End-To-End Arguments in System Design. Saltzer, Reed, Clark (1984).
  • An algorithm for distributed computation of a Spanning Tree in an Extended LAN. Perlman (1985).
  • The Design Philosophy of the DARPA Internet Protocols. Clark (1988).
  • TOR: The second generation onion router. Dingledine et al (2004).
  • Why the Internet only just works. Handley (2006).
  • The Network is Reliable. Bailis, Kingsbury (2014).

New Directions in Cryptography. Diffie, Hellman (1976). Cryptography

  • A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Rivest, Shamir, Adleman (1978).
  • How To Share A Secret. Shamir (1979).
  • A Digital Signature Based on a Conventional Encryption Function. Merkle (1987).
  • The Salsa20 family of stream ciphers. Bernstein (2007).

Time, Clocks, and the Ordering of Events in a Distributed System. Lamport (1978). Distributed Systems

  • Self-stabilizing systems in spite of distributed control. Dijkstra (1974).
  • The Byzantine Generals Problem. Lamport, Shostak, Pease (1982).
  • Impossibility of Distributed Consensus With One Faulty Process. Fischer, Lynch, Paterson (1985).
  • Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. Schneider (1990).
  • Practical Byzantine Fault Tolerance. Castro, Liskov (1999).
  • Paxos made simple. Lamport (2001).
  • Paxos made live - An Engineering Perspective. Chandra, Griesemer, Redstone (2007).
  • In Search of an Understandable Consensus Algorithm. Ongaro, Ousterhout (2014).

Designing for Usability: Key Principles and What Designers Think. Gould, Lewis (1985). Human-Computer Interaction; User Interfaces

  • As We May Think. Bush (1945).
  • Man-Computer symbiosis. Licklider (1958).
  • Some Thoughts About the Social Implications of Accessible Computing. David, Fano (1965).
  • Tutorials for the First-Time Computer User. Al-Awar, Chapanis, Ford (1981).
  • The star user interface: an overview. Smith, Irby, Kimball (1982).
  • Design Principles for Human-Computer Interfaces. Norman (1983).
  • Human-Computer Interaction: Psychology as a Science of Design. Carroll (1997).

The anatomy of a large-scale hypertextual Web search engine. Brin, Page (1998). Information Retrieval; World-Wide Web

  • A Statistical Interpretation of Term Specificity in Retrieval. Spärck Jones (1972).
  • World-Wide Web: Information Universe. Berners-Lee et al (1992).
  • The PageRank Citation Ranking: Bringing Order to the Web. Page, Brin, Motwani (1998).

Dynamo, Amazon’s Highly Available Key-value store. DeCandia et al (2007). Internet Scale Data Systems

  • The Google File System. Ghemawat, Gobioff, Leung (2003).
  • MapReduce: Simplified Data Processing on Large Clusters. Dean, Ghemawat (2004).
  • Bigtable: A Distributed Storage System for Structured Data. Chang et al (2006).
  • ZooKeeper: wait-free coordination for internet scale systems. Hunt et al (2010).
  • The Hadoop Distributed File System. Shvachko et al (2010).
  • Kafka: a Distributed Messaging System for Log Processing. Kreps, Narkhede, Rao (2011).
  • CAP Twelve Years Later: How the "Rules" Have Changed. Brewer (2012).
  • Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. Verbitski et al (2017).

On Designing and Deploying Internet Scale Services. Hamilton (2007). Operations; Reliability; Fault-tolerance

  • Ironies of Automation. Bainbridge (1983).
  • Why do computers stop and what can be done about it? Gray (1985).
  • Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. Patterson et al (2002).
  • Crash-Only Software. Candea, Fox (2003).
  • Building on Quicksand. Helland, Campbell (2009).

Thinking Methodically about Performance. Gregg (2012). Performance

  • Performance Anti-Patterns. Smaalders (2006).
  • Thinking Clearly about Performance. Millsap (2010).

Bitcoin, A peer-to-peer electronic cash system. Nakamoto (2008). Cryptocurrencies

  • Ethereum: A Next-Generation Smart Contract and Decentralized Application Platform. Buterin (2014).

A Few Useful Things to Know About Machine Learning. Domingos (2012). Machine Learning

  • Statistical Modeling: The Two Cultures. Breiman (2001).
  • The Unreasonable Effectiveness of Data. Halevy, Norvig, Pereira (2009).
  • ImageNet Classification with Deep Convolutional Neural Networks. Krizhevsky, Sutskever, Hinton (2012).
  • Playing Atari with Deep Reinforcement Learning. Mnih et al (2013).
  • Generative Adversarial Nets. Goodfellow et al (2014).
  • Deep Learning. LeCun, Bengio, Hinton (2015).
  • Attention Is All You Need. Vaswani et al (2017).

Top-level papers only

  • Von Neumann's First Computer Program. Knuth (1970).
  • Computing Machinery and Intelligence. Turing (1950).
  • A Method for the Construction of Minimum-Redundancy Codes. Huffman (1952).
  • Engineering a Sort Function. Bentley, McIlroy (1993).
  • A Design Methodology for Reliable Software Systems. Liskov (1972).
  • Programming with Abstract Data Types. Liskov, Zilles (1974).
  • Why Functional Programming Matters. Hughes (1990).
  • An Incremental Approach to Compiler Construction. Ghuloum (2006).
  • No Silver Bullet: Essence and Accidents of Software Engineering. Brooks (1987).
  • Communicating sequential processes. Hoare (1978).
  • The UNIX Time-Sharing System. Ritchie, Thompson (1974).
  • A Relational Model of Data for Large Shared Data Banks. Codd (1970).
  • A Protocol for Packet Network Intercommunication. Cerf, Kahn (1974).
  • New Directions in Cryptography. Diffie, Hellman (1976).
  • Time, Clocks, and the Ordering of Events in a Distributed System. Lamport (1978).
  • Designing for Usability: Key Principles and What Designers Think. Gould, Lewis (1985).
  • The anatomy of a large-scale hypertextual Web search engine. Brin, Page (1998).
  • Dynamo, Amazon’s Highly Available Key-value store. DeCandia et al (2007).
  • On Designing and Deploying Internet Scale Services. Hamilton (2007).
  • Thinking Methodically about Performance. Gregg (2012).
  • Bitcoin, A peer-to-peer electronic cash system. Nakamoto (2008).
  • A Few Useful Things to Know About Machine Learning. Domingos (2012).

This list was inspired by (and draws from) several books and paper collections:

  • Papers We Love
  • Ideas That Created the Future
  • The Innovators
  • The morning paper
  • Distributed systems for fun and profit
  • Readings in Database Systems (the Red Book)
  • Fermat's Library
  • Classics in Human-Computer Interaction
  • Awesome Compilers
  • Distributed Consensus Reading List
  • The Decade of Deep Learning

A few interesting resources about reading papers from Papers We Love and elsewhere:

  • Should I read papers?
  • How to Read an Academic Article
  • How to Read a Paper. Keshav (2007).
  • Efficient Reading of Papers in Science and Technology. Hanson (1999).
  • On ICSE’s “Most Influential Papers”. Parnas (1995).

Selection criteria

  • The idea is not to include every interesting paper that I come across but rather to keep a representative list that's possible to read from start to finish with a similar level of effort as reading a technical book from cover to cover.
  • I tried to include one paper per major topic and author. Since in the process I found a lot of noteworthy alternatives and related or follow-up papers, and I wanted to keep track of those as well, I included them as sublist items.
  • The papers shouldn't be too long. For the same reasons as the previous item, I try to avoid papers longer than 20 or 30 pages.
  • They should be self-contained and readable enough to be approachable by the casual technical reader.
  • They should be freely available online.
  • Examples of this are classic works by Von Neumann, Turing and Shannon.
  • That being said, where possible I preferred the original paper on each subject over modern updates or survey papers.
  • Similarly, I tended to skip more theoretical papers, those focusing on mathematical foundations for Computer Science, electronic aspects of hardware, etc.
  • I sorted the list by a mix of relatedness of topics and a vague chronological relevance, such that it makes sense to read it in the suggested order. For example, historical and seminal topics go first, contemporary internet-era developments last, networking precedes distributed systems, etc.


IMAGES

  1. (PDF) Writing good software engineering research papers

  2. Research Paper On Software Reengineering: Re-engineering Software

  3. List of case study topics for software engineering

  4. (PDF) Emerging topics in software engineering

VIDEO

  1. #1 Introduction To Software Engineering

  2. Requirements Analysis In Software Engineering-Requirement Analysis In Software Testing-Requirements

  3. How To Choose A Research Topic For A Dissertation Or Thesis (7 Step Method + Examples)

  4. Research in Computer Science & Engineering

  5. 60000+ research papers | Top 6 Data Science research paper sites

  6. Software Engineering

COMMENTS

  1. Top 10 Software Engineer Research Topics for 2024

    The research papers on software engineering topics in this specific area could identify novel measures for evaluating software systems or techniques for using metrics to improve the quality of software.

  2. Software Engineering's Top Topics, Trends, and Researchers

    Abstract: For this theme issue on the 50th anniversary of software engineering (SE), Redirections offers an overview of the twists, turns, and numerous redirections seen over the years in the SE research literature.

  3. software engineering Latest Research Papers

    Recently published documents on software engineering. Total documents: 14141 (five years: 2173). H-index: 99 (five years: 10). Identifying Non-Technical Skill Gaps in Software Engineering Education: What Experts Expect But Students Don't Learn

  4. Highly-cited papers in software engineering: The top-100

    Studying highly-cited SE papers helps researchers to see the type of approaches and research methods presented and applied in such papers, so as to be able to learn from them to write higher quality papers which will likely receive high citations.

  5. Software Engineering

    At Google, we pride ourselves on our ability to develop and launch new products and features at a very fast pace. This is made possible in part by our world-class engineers, but our approach to software development enables us to balance speed and quality, and is integral to our success.

  6. Carnegie Mellon University, Software Engineering Institute

    This white paper examines how decision makers, such as technical leads and program managers, can assess the fitness of large language models (LLMs) to address software engineering and acquisition needs. Topics: Artificial Intelligence Engineering, Software Engineering Research and Development, Acquisition. 2023.

  7. Trending Topics in Software Engineering

    The continuous evolution of Software Engineering (SE) comes with a series of methodological and technical challenges to be faced, modelled and suitably tackled. Particularly, we observed that modern software systems are more and more deployed onto ...

  8. 319424 PDFs

    Software engineering and the application of knowledge-based, simulation-based, data-driven, human-centred and automated approaches. | Explore the latest full-text research PDFs, articles ...

  9. The state of research on software engineering competencies: A

    2.2. Related literature review studies. Three literature review studies on SEC were found from the literature search. Cruz et al. (2015) used a systematic mapping study to plot the current landscape of published empirical and theoretical studies that explored the role of personality in software engineering. The authors reviewed 90 papers published from 1970 to 2010.

  10. Understanding peer review of software engineering papers

    Neil A. Ernst, Jeffrey C. Carver, Daniel Mendez & Marco Torchiano. Empirical Software Engineering, Volume 26, article number 103 (2021). Published: 17 July 2021.

  11. IEEE Transactions on Software Engineering

    IEEE Transactions on Software Engineering | IEEE Xplore.

  12. How software engineering research aligns with design science ...

    Background: Assessing and communicating software engineering research can be challenging. Design science is recognized as an appropriate research paradigm for applied research, but is rarely explicitly used as a way to present planned or achieved research contributions in software engineering. Applying the design science lens to software engineering research may improve the assessment and ...

  13. Topic modeling in software engineering research

    Our study aims at describing how topic modeling has been applied in software engineering research with a focus on four aspects: (1) which topic models and modeling techniques have been applied, (2) which textual inputs have been used for topic modeling, (3) how textual data was "prepared" (i.e., pre-processed) for topic modeling, and (4) how gen...

  14. [1608.08100] Finding Trends in Software Research

    This paper explores the structure of research papers in software engineering. Using text mining, we study 35,391 software engineering (SE) papers from 34 leading SE venues over the last 25 years. These venues were divided, nearly evenly, between conferences and journals. An important aspect of this analysis is that it is fully automated and repeatable. To achieve that automation, we used a ...

  15. Research in software engineering: an analysis of the literature

    In this paper, we examine the state of software engineering (SE) research from the point of view of the following research questions: 1. What topics do SE researchers address? 2. What research approaches do SE researchers use? 3. What research methods do SE researchers use? 4. On what reference disciplines does SE research depend? 5.

  16. PDF Writing Good Software Engineering Research Papers

    Minitutorial. Mary Shaw. Carnegie Mellon University. Abstract. Software engineering researchers solve problems of several different kinds. To do so, they produce several different kinds of results, and they should develop appropriate evidence to validate these results. They often report their research in conference papers.

  17. Journal of Software Engineering Research and Development

    Investigating the effectiveness of peer code review in distributed software development based on objective and subjective data. Code review is a potential means of improving software quality. To be effective, it depends on different factors, and many have been investigated in the literature to identify the scenarios in which it adds qu...

  18. (PDF) Current Trends in Software Engineering Research

    The new trends in software engineering research topics fall under the research fields of Cloud Computing, Big Data, Android Computing, Network Security and Software Engineering Project...

  19. (PDF) A review of software engineering research from a design science

    Aim: The aim of this study is to 1) evaluate how well the design science lens helps frame software engineering research contributions, and 2) identify and characterize different types of...

  20. Research Topics in Software Engineering

    Semester: Fall 2020. Number: 263-2100-00L. Lecturers: Prof. Martin Vechev, Prof. Zhendong Su. TAs: Dimitar I. Dimitrov, Momchil Peychev, Samuel Steffen, Jingxuan He, Petar Tsankov, Pinjia He, Manuel Rigger, Daming Zou, Shaohua Li, Sverrir Thorgeirsson, Dominik Winterer.

  21. An Analysis of Research in Software Engineering:

    Project name: An Analysis of Research in Software Engineering: Assessment and Trends. Zhi Wang, Bing Li, Yutao Ma. State Key Lab of Software Engineering, Wuhan University, Wuhan 430072, China; School of Computer, Wuhan University, Wuhan 430072, China; International School of Software, Wuhan University, Wuhan 430079, China.

  22. Software Engineer Research Paper Topics 2021: Top 5

    Admin, October 20, 2021. Whether you're studying in advance or you're close to getting that Software Engineering degree, it's crucial that you look for possible research paper topics in advance.

  23. (PDF) Software Engineering Research Topics

    April 2014 · International Journal of Software Engineering and Knowledge Engineering · Vahid Garousi, Guenther Ruhe. Bibliometric rankings are quite common in the field of software...

  24. [2402.09171] Automated Unit Test Improvement using Large Language

    This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. We describe the deployment of TestGen-LLM at Meta test-a-thons for the ...

  25. Computer Science Research Topics (+ Free Webinar)

    Overview: CompSci Research Topics. Algorithms & data structures. Artificial intelligence (AI). Computer networking. Database systems. Human-computer interaction. Information security (IS). Software engineering. Examples of CompSci dissertations & theses.

  26. 150 Best Research Paper Topics For Software Engineering

    Best Research Paper Topics on Software. Software Engineering Management. Unified Software Development Process and Extreme Programming. There are a lot of difficulties with managing the development of software for web-based applications and projects for systems integration that were completed in recent times.

  27. Gartner Emerging Technologies and Trends Impact Radar for 2024

    This theme focuses on making the right business and ethical choices in the adoption of AI and using AI design principles that will benefit people and society. Human-centered AI (HCAI) is a common AI design principle that calls for AI to continuously benefit from human input. Behavioral analytics refers to session-tracking capabilities that monitor user interactions with a protected service to ...

  28. Papers for Software Engineers

    List of papers by topic Von Neumann's First Computer Program. Knuth (1970). Computer History; Early Programming The Education of a Computer. Hopper (1952). Recursive Programming. Dijkstra (1960). Programming Considered as a Human Activity. Dijkstra (1965). Goto Statement Considered Harmful. Dijkstra (1968).