Do Such Reviews Exist in Academic Publishing?

Recently, my co-author and I submitted this manuscript “Are Free Android App Security Analysis Tools Effective in Detecting Known Vulnerabilities?” for review to IEEE S&P 2019, a premier conference on security. Upon receiving the reviews, I decided to publicly share them because most of the remarks in the reviews are lacking in one of the following ways.

  1. Reviews do not point to factual or methodological errors.

Worse yet, these are the reviews provided to the authors after they were extensively discussed by the program committee (PC).

If you think my assessment of the reviews is incorrect, then please tell me why by leaving a comment.

Review of the Reviews

Instead of merely sharing the review and the submitted manuscript, I will list parts of the reviews that are against the manuscript and non-trivial (in code block) followed by one or more of the following.

  • Evidence (E) from the submitted manuscript that contradicts the review.
The paper lacks clear scientific contributions.

(L) Why are the claimed contributions not scientific?

A significant amount of space is devoted to arguing that Ghera is representative of all vulnerabilities in the Android ecosystem but the argumentation does not seem completely sound.

(E) On page 2 of the paper, we say “we evaluated if Ghera benchmarks were representative of real world apps, i.e., do the benchmarks capture vulnerabilities as they occur in real world apps?” This is our position throughout the paper. This is not the same as the observation made by the reviewer, i.e., “representative of all vulnerabilities”.
(L) Assuming we slipped and did make the claim as observed by the reviewer somewhere in the paper, the reviewer should point out this location to make the review helpful.

The authors cover too different problem areas: vulnerabilities in android apps as well as malicious behavior from android apps. These problem areas seem fairly unrelated to each other and the paper would benefit from a more detailed explanation. It seems that the most common occurrence of malicious behavior is because the developer intentionally included that malicious behavior in their app. If the paper is meant to cover a different problem scenario, e.g. the inclusion of third-party libraries that may hide malicious behavior, it would be helpful to explicitly address that.

(A) We could remove Section IV.D.3 from the paper. However, since it is only one-fourth of a page long, I doubt that removing it would immensely improve the focus of the paper. Instead, we could change the title to “An Evaluation of Effectiveness of Free Android App Security Analysis Tools”.
(L) Since the evaluation is focused on evaluating security analysis tools, why is it wrong to consider tools that claim to detect malicious behavior (independent of how the behavior comes to be in apps)?

The section on the representativeness of Ghera is quite prolonged. However, it does not really address the core question, e.g. are the vulnerabilities included in Ghera representative of vulnerabilities found in real Android apps.  The paper makes the assumption that the usage of APIs is a good proxy for vulnerabilities. I don't believe that to be the case and the conclusions derived from the API analysis as a result do not seem convincing enough to answer the original question that was posed about the vulnerabilities included in Ghera.

(L) Section III.A on page 3 of the paper explains why API usage is a weak yet general (proxy) measure of representativeness — since vulnerable real-world apps are almost impossible to find. Why does the reviewer not believe in the assumptions and conclusions?

The overall discussion on APIs was also a little bit confusing as the paper did not really include a concise definition of what is being considered an API.

(E) API is a well-defined term in the context of developing apps.
On page 2 of the paper, we say “the nature of the APIs (including features of XML-based configuration)” and, on page 4, Section III.B.3, we say “Android apps access various capabilities of the Android platform via features of XML-based manifest files and Android programming APIs. We refer to the published Android programming APIs and the elements and attributes (features) of manifest files collectively as APIs.”

When evaluating the security tools, the paper makes a number of quite big assumptions. One of them is that the tools should work without any configuration. Why was that a good assumption to make?

(E) In Section IV.B.1, we say “Also, we wanted to mimic a simple developer workflow — procure/build the tool, follow its documentation, and apply it to programs.”

And does that assumption align with how the security tools that were studied are supposed to be used?

(E) In Section IV.B, we explicitly state tools that did not fit the evaluation criteria/assumptions were disregarded along with the reason (as opposed to being incorrectly evaluated).

This takes a lot of space without conveying deep insights about the problem space.

(L) An example of what would qualify as a deep insight would have been helpful. Further, why is the insight “existing security analysis tools for Android apps are very limited in their ability to detect known vulnerabilities: all of the evaluated vulnerability detection tools together could only detect 30 of the 42 known vulnerabilities.” shallow?

It would have been more helpful to look into the individual security tools in more detail to better understand why they did or did not detect the various vulnerabilities.

(R) The suggested task is best done by the tool authors as they know the tool well. Folks who are not tool authors are better suited to evaluate how well the current space of vulnerabilities is covered by existing tools, which was the point of this effort. Also, the findings from our effort can inform the tool authors about limitations of their tools.

The summary finding is essentially that existing security tools don't find all vulnerabilities. In some sense that seems like a foregone conclusion.

(R) The point of such efforts is to assess the current state, convert folklore into conclusions based on empirical data, and help guide subsequent efforts in relevant directions.

As a result, they find out that 50% of the realworld applications use the APIs used by the applications in the Ghera benchmark.

(E) Section III.C.1 and Figure 1 explicitly state that at least 50% of apps use more than 60% of the APIs used in Ghera benchmarks.
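To make the difference between the two readings concrete, here is a minimal sketch with made-up data. The function name, app names, and API names are all hypothetical and are not taken from the paper or its artifacts. It computes the fraction of apps that use more than a given share of the benchmark APIs, which is the kind of statement Section III.C.1 makes, as opposed to the fraction of apps that merely use some benchmark API.

```python
def share_of_apps_above(apps, benchmark_apis, threshold):
    """Return the fraction of apps that use more than `threshold`
    (a value between 0 and 1) of the benchmark APIs."""
    benchmark_apis = set(benchmark_apis)
    above = 0
    for used_apis in apps.values():
        # Coverage: share of benchmark APIs that this app uses.
        coverage = len(set(used_apis) & benchmark_apis) / len(benchmark_apis)
        if coverage > threshold:
            above += 1
    return above / len(apps)


# Toy data (hypothetical): five benchmark APIs and four apps.
benchmark_apis = {"a1", "a2", "a3", "a4", "a5"}
apps = {
    "app1": {"a1", "a2", "a3", "a4"},             # uses 80% of benchmark APIs
    "app2": {"a1", "a2", "a3", "a4", "a5", "x"},  # uses 100%
    "app3": {"a1"},                               # uses 20%
    "app4": {"a2", "a3"},                         # uses 40%
}

# Fraction of apps that use more than 60% of the benchmark APIs.
share_above_60 = share_of_apps_above(apps, benchmark_apis, 0.6)

# Fraction of apps that use at least one benchmark API.
share_using_any = share_of_apps_above(apps, benchmark_apis, 0.0)

print(share_above_60, share_using_any)
```

In this toy data, every app uses at least one benchmark API, but only half of the apps use more than 60% of them. The two numbers answer different questions, which is why the reviewer's summary does not match what the paper reports.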

The evaluation done on 28 free Android applications results in a number of findings such as vulnerability detector tools are not able to identify class of vulnerabilities they claim to etc.

(E) The abstract states “We considered 64 security analysis tools and empirically evaluated 19 of them — 14 vulnerability detection tools and 5 malicious behavior detection tools…” So, could 28 instead of 19 be a typo? Maybe.

I don't think it is fair to criticize academicians because they compare their new solutions with existing work on academia abandoning existing available tools in app stores and also because they do not provide practical solutions that are easy to understand by app developers.

(E) In Section I.A, we point out the limitations of evaluations in efforts [3] and [4] as part of the motivation for this evaluation. Why is this wrong?

The goal of a researcher is to improve the state-of-the art and compare itself with the most current, the most efficient and effective technique that was proposed earlier, not the simplistic techniques that are focusing only on the detection of known vulnerabilities.

(L) The paper merely finds that existing tools fail to detect known vulnerabilities. It never dictates what the researchers should do. What is the purpose of this remark?

I suggest the authors to remove the parts of the paper where they criticize the existing works [4] and [5] in the paper due to these reasons and instead, focus more on a nice discussion regarding..

(E) Section V does an objective comparison between efforts [3] and [4] and our effort. How is this criticism?
Further, [5] is our own work (Ghera) and we do not compare this effort with [5]. So, I guess the mentioned references are typos.

The authors critisize the existing works because they do not compare   themselves with real-world tools, however, this paper itself does the same by doing the opposite. Wouldn't this study have been more complete if the authors also discussed the academical solutions evaluating them on the same benchmark and touch base on the topic of why these methods are not developed in practice and what we should do to reach out to the community better.

(E) This effort does evaluate all of the tools on the same benchmarks from Ghera and points out how each tool fares.
(E) As for reasoning why these solutions are not developed in practice, I believe our findings (e.g., about building tools, applying tools, and effectiveness of tools) answer this question.
Addressing these concerns (usability and effectiveness) will help in reaching out to the community.

as we can see from the results, the Ghera benchmark does not have the best coverage for all existing vulnerabilities and this makes the outcome of this discussion questionable. To be more precise,8 of the vulnerability detection tools claimed to identify other vulnerabilities that are not found in the Ghera benchmark.

(L) The vulnerabilities captured by Ghera serve as the reference in this evaluation. So, what is the meaning of “Ghera benchmark does not have the best coverage for all existing vulnerabilities”?
(L) What is this set of all existing vulnerabilities that the reviewer is referring to?
(E) In Section IV.D.2 on page 10, we say “In the evaluation, 8 out of 14 tools reported vulnerabilities that were not the focus of Ghera benchmarks. (See last column in TABLE III.) Upon manual examination of the benchmarks, we found none of these reported vulnerabilities were present in the benchmarks.”

In particular, the authors evaluate only free applications. I believe the study would have been more complete if the commercial apps were also evaluated.

(R) Yes, it would have been more complete if we could have considered more tools. However, this is the best we could do (that no one has done to date) with the given time and money constraints. So, we mention the evaluation of commercial tools as Future Work item 1 (Section VII).

the authors do not discuss the details about these vulnerabilities that are identified or missed. Which kind of vulnerabilities are generally missed by the tools? Are the tools generally identifying the same set of vulnerabilities or different ones.

(E) In Section IV.D.2 on page 10, we say
“While both COVERT and DialDroid claim to detect vulnerabilities related to communicating apps, neither detected such vulnerabilities in any of the 33 Ghera benchmarks that are contained a benign app and a malicious app. While MalloDroid focuses solely on SSL/TLS related vulnerabilities, it did not detect any of the SSL vulnerabilities captured in Ghera benchmarks. We observed similar failures with FixDroid.” and
“Switching the focus to vulnerabilities, each of the 5 vulnerabilities captured by Permission and System benchmarks were detected by some tool. However, none of the 2 vulnerabilities captured by Networking benchmarks category were detected by any tool.”
Further, Table III, Appendix, and the repository of artifacts provide sufficient details to dig deeper into the detected or missed vulnerabilities.

After all the authors intentionally picked tools that perform shallow analysis and do not require additional input from developers regarding details from the source code or annotations on the code.

(E) We considered tools that performed deep analysis (e.g., Amandroid, FlowDroid, HornDroid) and tools that performed shallow analysis (e.g., QARK, Marvin-SA, AndroBugs). See Section IV.D.2 on page 10.
(E) In Section IV.B.1, we say “we wanted to mimic a simple developer workflow — procure/build the tool, follow its documentation, and apply it to programs.” Clearly, our choice was driven by simplicity and ease of use and not shallowness of analysis performed by tools.

Therefore, I am not surprised about the low true positives. I think to be more fair to the existing more sophisticated tools, the authors could implement an application with existing interesting vulnerabilities and provide the required input for these more sophisticated tools and evaluate them. Or if the source code of the appsin the benchmark is available, they could analyze some of them and provide the required information.

(R) We mention this in Future Work item 3 (Section VII).

I don't see why it wasn't possible for the authors to complete the items in the future works sections of the paper.

(R) By experience, it takes quite a bit of effort to evaluate 19 tools by building and executing them in simple-developer-workflow mode on 42 benchmarks; to the best of my knowledge, no one has done this to date. Considering commercial tools and nuances of configurations would have only extended the project.
(A) We just completed Future Work item 5 (Section VII). Research takes time.

"While software development community has recently realized the importance of security, developer awareness about how security issues transpire and how to avoid them is still lacking" : The study of the effectiveness of automatic vulnerability scanners is not at all a novel topic and here, this motivating sentence in this paper is not accurate. The software development community is not at all new to the concept of security, even for the mobile software community.

(E) How is the quoted sentence from the paper claiming novelty of study of effectiveness? As for developer awareness, we cite [2] from 2016 that backs our statement.

The paper evaluates 64 vulnerability and malicious detection tools for Android apps, why did the authors include the malicious detection tools to their analysis as well? This is very irrelevant to the goal of the paper.

(A/L) If the discussion of the effectiveness of malicious behavior detection tools is irrelevant due to the current title, then we could change the title of the paper to “An Evaluation of Effectiveness of Free Android App Security Analysis Tools”. If not, then why is the discussion of the effectiveness of malicious behavior detection tools irrelevant?

Why did the authors only use the lean benchmarks of Ghera, excluding the real world apps with vulnerabilities from the paper? Wouldn't it be more realistic to test these vulnerability tools with real applications rather than stripped down apps?

(A) Ghera did not have a large number of fat benchmarks at the time of this evaluation. We should mention this in the manuscript.
(R) If the tools failed to detect vulnerabilities in the stripped-down apps found in lean benchmarks, it is unlikely that the tools would have found vulnerabilities in larger real-world apps.
Further, we suggest this extension to this evaluation in Future Work item 4 (Section VII).
(E) We briefly explore this aspect in Section IV.D.2 on page 10 when we discuss effectiveness of shallow and deep analysis.

Why did the authors make the distinction of relevant and security relevant APIs?

(A) We should mention this distinction in the manuscript.

This is the first time I am reading an anonymized paper where parts of the paper have been blackened out. It feels like the paper was deanonymized by just censoring certain parts of it. This actually makes the paper more difficult to understand, and it raises questions. For example, what is the answer for why the evaluation took more than a year? Why has that part been partially censored?

(R) Since we were also the authors of Ghera, we could not have mentioned that changes to Ghera delayed this effort without divulging our identities; the submission process was double-blind. So, redaction was the best way out. That said, I am open to suggestions on how to anonymize the blacked-out content in Section IV.G; the unanonymized version of the manuscript is available here.
(A) Partial censorship would be a concern if the reason for the time taken for the effort (as the title of Section IV.E suggests) was central to the findings of the manuscript. That said, describing the reasons for the time taken for an effort informs other researchers about issues to consider when undertaking such efforts.

As the authors state, their selection of tools and the configurations could have really been biased by their preferences and their know-how. Starting out with 64 tools and bringing it down to 19 takes away quite a bit from the value of the evaluation and the study.

(R) While I agree considering more tools would have been better, I am not aware of a study that considered 19 tools. So, why isn’t this considered as a merit of this study?

For example, the claim "there exists a gap between the claimed capabilities and the observed capabilities of tools that could lead to vulnerabilities in apps" is generally not surprising in the domain of security.

(R) The reviewer’s statement about the claim implies that 1) authors of tools in the security domain lie about the capabilities of their tools, 2) the security community is aware of it, and 3) the security community is fine with this practice. If true, then this is a non-trivial concern. So, instead of welcoming a data-based observation that confirms the validity of the concern, the reviewer is trivializing the observation. I am not sure how to respond to this remark.

this reported insight: "most tools prefer to report only valid vulnerabilities or most tools can only detect specific manifestations of vulnerabilities" is also not very deep and I am not sure what we can learn from it.

(E) “For 11 out of 14 tools, the number of false negatives was within 30% of the number of true negatives. This suggests two possibilities: most tools prefer to report only valid vulnerabilities or most tools can only detect specific manifestations of vulnerabilities. Both these possibilities have limited effectiveness in assisting developers build secure apps because validity of reported vulnerabilities takes precedence over building secure apps.”
in Section IV.D.2 on page 10 clearly states why both these possibilities are bad.

it is clear that the evaluation might have significant biases that have been introduced by the authors

(E) Researchers conducting such an evaluation would be affected by the same or different kinds of biases that affected us in this evaluation. The only way to rule out the effect of biases is to repeat the evaluation and cross-check the results from these repetitions. We explicitly mention this concern and how to tackle it in Section IV.E.

My understanding was that some tools were rejected because they were commercial. I understand that they may be expensive or difficult to get, but if such tools were eliminated from the batch, then the general title "are tools effective" does not really hold. This is also true for a bunch of other tools that were eliminated from the study for a variety of reasons. The more fitting title would be something like "an analysis of 19 tools for...".

(E) The title of the manuscript is “Are Free Android App Security Analysis Tools Effective in Detecting Known Vulnerabilities?”

Done

Now, if you made it this far and you believe these reviews are indeed questionable, then please share this post with others to help start a conversation about the review process in academic publishing.

Written by

Programming, experimenting, writing | Past: SWE, Researcher, Professor | Present: SWE
