sap-sam's Introduction

SAP Signavio Academic Models (SAP-SAM)

This repository contains the source code for the paper SAP Signavio Academic Models: A Large Process Model Dataset by Diana Sola, Christian Warmuth, Bernhard Schäfer, Peyman Badakhshan, Jana-Rebecca Rehse, and Timotheus Kampik.

Link to the paper: https://arxiv.org/abs/2208.12223 (pre-print)

Link to the dataset: https://zenodo.org/record/7012043

License

The example code in this repository is licensed as follows. Note that a different license applies to the dataset itself!

Copyright (c) 2022 by SAP.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

The following license applies to the SAP-SAM dataset.

Copyright (c) 2022 by SAP.

SAP grants to Recipient a non-exclusive copyright license to the Model Collection to use the Model Collection for Non-Commercial Research purposes of evaluating Recipient’s algorithms or other academic research artefacts against the Model Collection. Any rights not explicitly granted herein are reserved to SAP. For the avoidance of doubt, no rights to make derivative works of the Model Collection is granted and the license granted hereunder is for Non-Commercial Research purposes only.

"Model Collection" shall mean all files in the archive (which are JSON, XML, or other representation of business process models or other models).

"Recipient" means any natural person receiving the Model Collection.

"Non-Commercial Research" means research solely for the advancement of knowledge whether by a university or other learning institution and does not include any commercial or other sales objectives.

Citing SAP-SAM

@misc{SAP-SAM-paper,
  doi = {10.48550/ARXIV.2208.12223},
  url = {https://arxiv.org/abs/2208.12223},
  author = {Sola, Diana and Warmuth, Christian and Schäfer, Bernhard and Badakhshan, Peyman and Rehse, Jana-Rebecca and Kampik, Timotheus},
  keywords = {Other Computer Science (cs.OH), Software Engineering (cs.SE), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {SAP Signavio Academic Models: A Large Process Model Dataset},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

or

@dataset{SAP-SAM-dataset,
  author       = {Kampik, Timotheus and Warmuth, Christian and Sola, Diana and Schäfer, Bernhard and Axworthy, Liz and Ivarsson, Erica and
                  Ouda, Karim and Eickhoff, David},
  title        = {SAP Signavio Academic Models},
  month        = aug,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {0.5.1},
  doi          = {10.5281/zenodo.6964944},
  url          = {https://doi.org/10.5281/zenodo.6964944}
}

Setup

You need to download the dataset and place it into the folder ./data/raw such that the models are in ./data/raw/sap_sam_2022/models.

It is also possible to run the analysis on any .sgx files (Signavio workspace exports). Place the files in ./data/raw/sap_sam_2022/models and the conversion will be performed automatically.
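Before starting the notebooks, it can help to confirm that the files really are where the setup expects them. The following is a minimal sketch, not part of the repository: the helper name find_model_files is hypothetical, and it only looks for the two file types mentioned above (.csv and .sgx).

```python
from pathlib import Path

# Folder named in the setup instructions above.
MODELS_DIR = Path("data/raw/sap_sam_2022/models")


def find_model_files(models_dir: Path = MODELS_DIR) -> dict[str, list[Path]]:
    """Group the files in models_dir by the two extensions the setup mentions."""
    if not models_dir.is_dir():
        raise FileNotFoundError(
            f"Expected the dataset in {models_dir}, but the folder does not exist."
        )
    return {
        "csv": sorted(models_dir.glob("*.csv")),
        "sgx": sorted(models_dir.glob("*.sgx")),
    }
```

Calling find_model_files() from the repository root returns the csv and sgx files it found, or raises FileNotFoundError if the folder has not been created yet.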

To get started on Mac or Windows, we provide a dependency setup with poetry. Check that poetry is installed on your system with poetry --version. If it is not, run pip install poetry.

To install the dependencies, go to the root of the cloned repository, type this line in the terminal, and press enter:

poetry install

Make sure you have the latest stable version of Python installed on your machine, not a pre-release (check with python --version or python3 --version). The latest stable version is 3.12.2 (as of April 2024).
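The same check can be done from inside Python. This is a small sketch rather than project code: the function name is hypothetical, and the (3, 11) minimum is an assumption taken from the python ^3.11 constraint listed under the detected dependencies further down.

```python
import sys


def is_stable_release(version_info=sys.version_info, minimum=(3, 11)) -> bool:
    """Return True for a final (non pre-release) interpreter at or above `minimum`."""
    # releaselevel is 'alpha', 'beta', 'candidate' or 'final'.
    return version_info.releaselevel == "final" and tuple(version_info[:2]) >= minimum


print(sys.version.split()[0], "ok" if is_stable_release() else "not suitable")
```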

After poetry install has finished, you can set up the kernel:

python -m ipykernel install --user --name=sap-sam-kernel

Then, to open the project, simply type:

jupyter notebook

Alternatively, a conda setup is possible.

We provide two conda environment.yml files that can be used to create a new environment and install the required dependencies:

  • environment.yml: contains the abstract dependencies (pandas, numpy, ...).
  • environment-lock.yml: contains versions for all dependencies and the transitive dependencies to ensure reproducible results.

You can use the following conda command to create the environment:

conda env create -f environment.yml  

or

conda env create -f environment-lock.yml  

Getting started

We provide a tutorial Jupyter Notebook that illustrates the dataset format in more detail and shows how to use the csv parsers developed in ./src.

The properties Jupyter Notebook gives an overview of selected properties of the dataset.

Dataset Format

The SAP-SAM dataset consists of 103 csv files of process models, roughly 38 GB in total (see the modeling notations of the models below).

CSV Format

  1. csv columns:
    • Revision ID: Unique identifier for model revision
    • Model ID: Unique identifier for model
    • Organization ID: Unique identifier for organization this model originates from
    • Datetime: Date and time of creation
    • Model JSON: JSON containing model information
    • Description: Description of model (typically empty)
    • Name: Model name
    • Type: Model type (duplicate and less specific than namespace)
    • Namespace: Stencilset/modeling notation (e.g. BPMN, DMN, UML,...)
  2. Number of models: 1,021,471
  3. Number of models by modeling notation:

     Modeling notation                 Frequency
     --------------------------------  ---------
     BPMN 2.0                            618,807
     Value Chain                         194,078
     DMN 1.0                              98,286
     EPC                                  32,369
     BPMN 1.0                             15,643
     UML 2.2 Class                        14,953
     Petri Net                            11,207
     ArchiMate 2.1                        10,956
     UML Use Case                         10,228
     Organigram                            4,568
     BPMN 2.0 Choreography                 4,096
     BPMN 2.0 Conversation                 2,788
     FMC Block Diagram                     1,398
     CMMN 1.0                                999
     CPN                                     385
     Journey Map                             287
     YAWL 2.2                                238
     Process Documentation Template           86
     jBPM 4                                   76
     XForms                                   20
     Chen Notation                             3
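To make the column layout concrete, here is a minimal sketch that reads rows with Python's standard csv module and decodes the Model JSON cell. It is not the parser from ./src. The header names are assumed to match the column list above, and the sample row (including the content of the JSON) is invented for illustration.

```python
import csv
import io
import json

# Model JSON cells can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)

# Tiny stand-in for one dataset file. Header names are assumed from the
# column list above; every value in the data row is made up.
SAMPLE = io.StringIO(
    "Revision ID,Model ID,Organization ID,Datetime,Model JSON,"
    "Description,Name,Type,Namespace\n"
    'r1,m1,o1,2021-01-01 10:00:00,"{""elements"": []}",,'
    "Order process,example-type,example-namespace\n"
)


def read_models(handle):
    """Yield one dict per row, with the Model JSON cell decoded into Python objects."""
    for row in csv.DictReader(handle):
        row["Model JSON"] = json.loads(row["Model JSON"])
        yield row


rows = list(read_models(SAMPLE))
print(rows[0]["Name"], rows[0]["Namespace"])
```

For the real files, the handle would come from open(path, newline="", encoding="utf-8"). The encoding is an assumption.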

Dummy Data

In order to remove personal first and last names, emails, or in some cases matriculation numbers (which users have added in non-compliance with the T&Cs), we have applied a simple replacement script. In particular, we have replaced, to the extent possible, emails, names, and (matriculation) numbers with the following dummy values:

Context                 Dummy
----------------------  ----------------
Email                   [email protected]
Name                    Jane Doe
Matriculation/Number    12345678
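The replacement script itself is not part of this README. Purely as an illustration of the idea, a pattern-based replacement for emails and numbers could look like the sketch below. The regular expressions, the 6 to 8 digit range, and the placeholder email address are all assumptions, and name replacement (which needs a list of names) is left out.

```python
import re

# Illustrative patterns only; the actual replacement script may work differently.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\b\d{6,8}\b")  # assumed length of matriculation numbers

DUMMY_EMAIL = "dummy@example.com"  # hypothetical placeholder value
DUMMY_NUMBER = "12345678"          # dummy number from the table above


def scrub(text: str) -> str:
    """Replace email addresses and 6 to 8 digit numbers with dummy values."""
    text = EMAIL_RE.sub(DUMMY_EMAIL, text)
    return NUMBER_RE.sub(DUMMY_NUMBER, text)


print(scrub("Contact max.m@uni.de, ID 7654321"))
```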

Project Organization

├── data
│   ├── interim           <- Intermediate data that has been transformed.
│   └── raw               <- The raw dataset should be placed in this folder.
├── notebooks             <- Jupyter notebooks.
├── reports            
│   └── figures           <- Generated graphics and figures used in the paper.
├── src               
│   └── sapsam            <- Source code and dictionaries for use in this project.
├── LICENSE               <- License that applies to the example code in this repository.
├── README.md             <- The top-level README for developers using this project.
├── environment-lock.yml  <- Contains versions for all dependencies and the transitive dependencies to ensure reproducible results.
├── environment.yml       <- Contains the abstract dependencies (pandas, numpy, ...).
└── setup.py              <- Makes project pip installable (pip install -e .) such that src can be imported.

sap-sam's People

Contributors

ajinkyapatil8190, dependabot[bot], dubmix, par-vathy, renovate[bot], sap-dianasola, timkam

sap-sam's Issues

Update filter for BPMN 2.0 parsing stage

– Updated code to include a 'name' column in the table of BPMN 2.0 diagrams
– Using the 'name' column, filtering out example processes is now possible
– Updated markdown to make it clear that diagrams with no elements are ignored during parsing stage, hence explaining count discrepancies

Running analysis on data subsets

Noticed some unexpected behaviour in the code after trying to run the analysis on a very small subset of the data (~25 models).

Getting error invalid_request_missing_parameter

Hi and thank you for your work!

I try to use your code to convert BPMN diagrams to an image. However, I run into an error when trying:

Traceback (most recent call last):                                                                                            
  File "convert_to_bpmn.py", line 41, in <module>
  	image_request = gen.generate_image(model_name, model_json, model_namespace)
  File "venv/lib/python3.10/site-packages/sapsam-0.0.1-py3.10.egg/sapsam/ImageGenerator.py", line 117, in generate_image
    return self.generate_representation(name, data, namespace, 'png', deletes)
  File "venv/lib/python3.10/site-packages/sapsam-0.0.1-py3.10.egg/sapsam/ImageGenerator.py", line 91, in generate_representation
    model_id = result['href'].replace('/model/', '')
KeyError: 'href'

When printing the result variable from line 90, I get the following data, including an error message:

{'requestId': '***deleted***', 'message': 'Ein Fehler ist aufgetreten (invalid_request_missing_parameter)', 'errors': ['invalid_request_missing_parameter']}
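The German message means "An error has occurred". The KeyError hides it because line 91 of the traceback reads result['href'] without checking that the key is there. A minimal sketch of a more defensive lookup follows. The helper name extract_model_id is hypothetical and not part of ImageGenerator.py, and the sketch only surfaces the server's error; it does not explain which parameter is missing.

```python
def extract_model_id(result: dict) -> str:
    """Return the model id from the response dict, or raise with the server's error details."""
    if "href" not in result:
        # Fall back to the message if the response carries no 'errors' list.
        errors = result.get("errors") or [result.get("message", "unknown error")]
        raise RuntimeError("Request failed: " + ", ".join(map(str, errors)))
    return result["href"].replace("/model/", "")
```

With the response shown above, this raises RuntimeError("Request failed: invalid_request_missing_parameter") instead of KeyError: 'href'.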

I checked the login data and entered a wrong password, which led to a different error. In addition, the authenticator information looked good when I printed it with the correct login information.

Can you help me? What am I doing wrong?

Rate limiting for Imagegenerator

I am encountering a technical issue while attempting to convert SAP Signavio Academic models into event logs (XES) using the sap-sam Image Generator module.

I have been following the process outlined in the python notebooks (https://github.com/signavio/sap-sam/blob/main/notebooks/3_images_and_XMLs.ipynb) and utilizing the sap-sam Image Generator module (https://github.com/signavio/sap-sam/blob/main/src/sapsam/ImageGenerator.py) to convert the models. I have successfully converted approximately 50 BPMN JSON files from the CSV format to .bpmn files.
However, I am now facing an issue with the generate_xml method within the Image Generator module. After successfully converting the initial set of files, the method suddenly stops returning any output, and the conversion process halts without generating any errors or indications of failure. I have reviewed the logs and examined my code to ensure there are no obvious errors or misconfigurations.

Despite my efforts, I have not been able to identify the root cause of this issue. To provide more context, I am utilizing the SAP Signavio Academic models available at this link: https://zenodo.org/record/7012043.
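Whether rate limiting is really the cause is not confirmed in this thread. If it is, one generic workaround is to pause between calls and retry with growing delays. The sketch below is such a wrapper under stated assumptions: call_with_backoff is a hypothetical helper, not part of the sap-sam package, and it only helps when the wrapped call raises or returns None. A call that hangs would need a timeout instead.

```python
import time


def call_with_backoff(func, *args, retries=5, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func, retrying with exponentially growing pauses when it raises or returns None."""
    last_error = None
    for attempt in range(retries):
        try:
            result = func(*args, **kwargs)
        except Exception as exc:  # broad on purpose: the failure type is unknown here
            last_error = exc
        else:
            if result is not None:
                return result
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ... with the defaults
    raise RuntimeError(f"No result after {retries} attempts") from last_error
```

A call such as call_with_backoff(gen.generate_xml, ...) would then replace the direct call, with the same arguments as before.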

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Detected dependencies

pep621
pyproject.toml
poetry
pyproject.toml
  • python ^3.11
  • matplotlib ^3.8.4
  • pillow ^10.3.0
  • pandas ^2.2.1
  • numpy ^1.26.4
  • toml ^0.10.2
  • seaborn ^0.13.2
  • wordcloud ^1.9.3
  • language-data ^1.2
  • tqdm ^4.66.2
  • thinc ^8.2.3
  • spacy ^3.7.4
  • stringcase ^1.2.0
  • ipykernel ^6.29.4
  • pydantic ^2.6.4
  • spacy_langdetect ^0.1.2
  • pyarrow ^16.0.0
  • jupyter ^1.0.0

Some data-sets seem to have missing header information

When calling parser.parse_model_metadata(), some of the data-sets could not be parsed on my machine, with the error message that the header information was missing. The data-sets that were affected are:

70000.csv
140000.csv
330000.csv
510000.csv
590000.csv
670000.csv
820000.csv

After removing these data-sets from the data folder, the parsing worked fine.
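Instead of finding the affected files by trial and error, the first row of each file can be checked up front. This is a minimal sketch, assuming the header names match the column list in the CSV Format section and that the files are UTF-8 encoded. The helper files_missing_header is hypothetical and not part of the parser.

```python
import csv
from pathlib import Path

# Model JSON cells can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)

# Assumed header names, taken from the column list in the README.
EXPECTED_COLUMNS = [
    "Revision ID", "Model ID", "Organization ID", "Datetime", "Model JSON",
    "Description", "Name", "Type", "Namespace",
]


def files_missing_header(folder: Path) -> list[str]:
    """Return the names of csv files whose first row lacks the expected columns."""
    missing = []
    for path in sorted(folder.glob("*.csv")):
        # utf-8-sig also accepts files that start with a byte order mark.
        with path.open(newline="", encoding="utf-8-sig") as handle:
            first_row = next(csv.reader(handle), [])
        if not set(EXPECTED_COLUMNS).issubset(first_row):
            missing.append(path.name)
    return missing
```

files_missing_header(Path("data/raw/sap_sam_2022/models")) would list the files to inspect or set aside before calling parser.parse_model_metadata().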

change the default "setup folder" to "my documents" (instead of shared documents)

In ImageGenerator.py, the function "setup_folder" currently ensures that a folder "SAP-SAM" exists in the workspace under "shared documents". When other functions such as "generate_xml" are then called, everyone in the network with access to the workspace receives (potentially thousands of) notifications such as "model x created, model y deleted". Maybe this can be changed to set up the folder under "my documents" by default. I will try this and create a pull request if applicable.
