Code between the Lines: Semantic Analysis of Android Applications

Johannes Feichtner, Stefan Gruber

Research output: Contribution to conference > Paper > Research > peer-review

Abstract

Static and dynamic program analysis are the key concepts researchers apply to uncover security-critical implementation weaknesses in Android applications. As it is often not obvious in which context problematic statements occur, it is challenging to assess their practical impact. While some flaws may turn out to be bad practice but not undermine the overall security level, others could have a serious impact. Distinguishing them requires knowledge of the designated app purpose.

In this paper, we introduce a machine learning-based system that is capable of generating natural language text describing the purpose and core functionality of Android apps based on their actual code. We design a dense neural network that captures the semantic relationships of resource identifiers, string constants, and API calls contained in apps to derive a high-level picture of implemented program behavior. For arbitrary applications, our system can predict precise, human-readable keywords and short phrases that indicate the main use-cases apps are designed for.
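The record lists TF-IDF among its keywords, but the abstract does not say where in the pipeline it is applied. As a rough illustration only, and not the authors' implementation, the following sketch shows how TF-IDF weighting can surface characteristic keywords from tokenized app descriptions. The function names `tf_idf` and `top_keywords` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(doc_weights, k=2):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    return sorted(doc_weights, key=lambda t: (-doc_weights[t], t))[:k]


# Hypothetical tokenized descriptions, not taken from the paper's data set.
docs = [
    ["weather", "forecast", "weather", "app"],
    ["camera", "photo", "filter", "app"],
    ["bank", "transfer", "account", "app"],
]

print(top_keywords(tf_idf(docs)[0]))  # prints ['weather', 'forecast']
```

Terms such as "app" that occur in every document receive a weight of zero, which is why they never appear among the top keywords.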

We evaluate our solution on 67,040 real-world apps and find that with a precision between 69% and 84% we can identify keywords that also occur in the developer-provided description in Google Play. To avoid incomprehensible black box predictions, we apply a model explaining algorithm and demonstrate that our technique can substantially augment inspections of Android apps by contributing contextual information.
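One plausible reading of the reported precision is the share of predicted keywords that also occur in an app's Google Play description. The sketch below computes that ratio for a single app. The tokenization rule, the function names, and the sample data are assumptions made for illustration, and the paper's exact evaluation protocol may differ.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_precision(predicted, description):
    """Fraction of predicted keywords that also appear in the description."""
    if not predicted:
        return 0.0
    description_tokens = tokenize(description)
    hits = [kw for kw in predicted if kw.lower() in description_tokens]
    return len(hits) / len(predicted)


# Hypothetical example data, not taken from the paper's data set.
predicted = ["weather", "forecast", "radar", "camera"]
description = "Check the local weather forecast and live radar maps."

print(f"precision = {keyword_precision(predicted, description):.2f}")  # prints precision = 0.75
```

Three of the four predicted keywords occur in the description, so the precision for this app is 0.75. Averaging the value over all evaluated apps would give a data-set-level figure.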

Original language: English
Number of pages: 14
Publication status: Accepted/In press - 2020
Event: 35th IFIP TC-11 SEC 2020 International Conference on Information Security and Privacy Protection - Maribor, Slovenia
Duration: 26 May 2020 to 28 May 2020
https://sec2020.um.si

Conference

Conference: 35th IFIP TC-11 SEC 2020 International Conference on Information Security and Privacy Protection
Abbreviated title: IFIP SEC 2020
Country: Slovenia
City: Maribor
Period: 26/05/20 to 28/05/20
Internet address: https://sec2020.um.si

Fingerprint

Application programs
Semantics
Application programming interfaces (API)
Learning systems
Inspection
Neural networks
Defects

Keywords

  • Android
  • TF-IDF
  • Deep Learning
  • NLP

Cite this

Feichtner, J., & Gruber, S. (Accepted/In press). Code between the Lines: Semantic Analysis of Android Applications. Paper presented at 35th IFIP TC-11 SEC 2020 International Conference on Information Security and Privacy Protection, Maribor, Slovenia.
