Over the last few months, I've been working on an improved App Not Responding (ANR) dashboard for Fennec, which is now hosted at telemetry.mozilla.org/hang/anr. With the help of many people, I'm glad to say that the dashboard is now mature enough to be a useful tool for Fennec developers.
The idea of ANR/hang reporting is similar to crash reporting — every time the Fennec UI becomes unresponsive for more than five seconds, Android would show an “App Not Responding” dialog; the ANR Reporter detects this condition and collects these information about the hang:
The ANR Reporter is enabled on Nightly and Aurora builds only, and if the user has not opted out of telemetry, the collected information is sent back to Mozilla, where the data are aggregated and presented through the ANR Dashboard. Because the debug logs may contain private information, they are not processed and are only available internally, within Mozilla.
The ANR Dashboard presents weekly aggregated data collected through the ANR reporter. Use the drop-down list at the top of the page to choose a week to display.
Data for each week are then grouped by certain parameters from ANR reports. The default grouping is “appName”, and because ANR reports are specific to Fennec, you only see one column in the top hangs chart labeled “Fennec”. However, if you choose to group by, for example, “memsize”, you will see many columns in the chart, with each column representing a different device memory size seen from ANR reports.
Each column in the top hangs chart shows the number of hangs, and each column is further divided into blocks, each representing a different hang. Hover over the blocks to see the hang stack and the number of hangs. This example shows 8 hangs with that signature occurred on devices with 768MB of memory over the past week.
Colors are preserved across columns, so the same colored blocks all represent the same hang. The blue blocks at the bottom represent all hangs outside of the top 10 list.
To the right of the top hangs chart is the distributions chart. It shows how different parameters are distributed for all hangs. Hover over each block to see details. This example shows 36% of all hangs occurred on devices running Android API level 15 (corresponding to Android 4.0.3-4.0.4 Ice Cream Sandwich) over the past week.
The distributions chart can also be narrowed down to specific groups. This would let us find out, for example, on devices having 1GB of memory, what is the percentage of hangs occurring on the Nightly update channel.
Clicking on a block in the top hangs chart bring up a Hang Report. The hang report is specific to the column that you clicked on. For example, if you are grouping by “memsize”, clicking on a hang in the “1G” column will give you one hang report and clicking on the same hang in the “2G” column will give you a different hang report. Switch grouping to “appName” if you want to ignore groups — in that case there is only one column, “Fennec”.
The hang report also contains a distributions chart specific to the hang. The example above shows that 14% of this hang occurred on Nexus 7 devices.
In addition, the hang report contains a builds chart that shows the frequency of occurrence for different builds. This example shows there was one hang from build 20140224030203 on the 30.0a1 branch over the past week. The chart can be very useful when verifying that a hang has been fixed in newer builds.
Last but not least, the hang report contains stacks from the hang. The stacks in the hang report are more detailed than the stack shown on the main page. You can also look at stacks from other threads — useful for finding deadlocks!
When comparing the volume of hangs, a higher number can mean two things — the side with higher number is more likely to hang, or the side with higher number has more usage. For example, if we are comparing hangs between devices A and B, and A has a higher number of hangs. It is possible that A is more prone to hanging; however, it is also possible that A simply has more users and therefore more chances for hangs to occur.
To provide better comparisons, the ANR Dashboard has a normalization feature that tries to account for usage. Once “Normalize” is enabled at the top of the dashboard, all hang numbers in the dashboard will be divided by usage as measured by reported uptime. Instead of displaying the raw number of hangs, the top hangs chart will display the number of hangs per one thousand user-hours. For example, 10 hangs per 1k user-hour means, on average, 1000 users each using Fennec for one hour will experience 10 hangs combined; or equivalently, one user using Fennec for 1000 hours will experience 10 hangs total. The distributions chart is also updated to reflect usage.
As a demonstration, the image below shows un-normalized hangs grouped by device memory size. There is no clear trend among the different values.
The image below shows normalized hangs based on the same data. In this case, it is clear that, once usage is accounted for, higher device memory size generally corresponds to lower number of hangs. Note that the “unknown” column became hidden because there is not enough usage data for devices with “unknown” memory size.
At the moment, I think uptime is the best available measurement for usage. Hopefully there will be a better metric in the future to provide more accurate results. Or let me know if it already exists!