Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9


Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On Wed, Aug 26, 2015 at 7:04 AM,  <[hidden email]> wrote:

> The Buildbot has detected a new failure on builder clang-native-arm-cortex-a9 while building llvm.
> Full details are available at:
>  http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/29883
>
> Buildbot URL: http://lab.llvm.org:8011/
>
> Buildslave for this Build: as-bldslv2
>
> Build Reason: scheduler
> Build Source Stamp: [branch trunk] 246031
> Blamelist: davide
>
> BUILD FAILED: failed compile
>
> sincerely,
>  -The Buildbot
>
>
>

I see frequent timeouts on this bot -- can the timeout be increased?

Thank you for providing this service!


--
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 26 August 2015 at 15:07, Davide Italiano <[hidden email]> wrote:
> I see frequent timeouts on this bot -- can the timeout be increased?

Hi Davide,

Increasing the timeout is not ideal. We need to fix the problem in a
different way.

Our other ARM bots are using CMake 3.2 + Ninja 1.5, which gets rid of
one source of timeouts (slow testing), so that may be one way out.

For now, please ignore timeouts on the bots, as they happen on more
bots than just this one. :)

cheers,
--renato
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 08/26/2015 04:12 PM, Renato Golin via llvm-dev wrote:

> On 26 August 2015 at 15:07, Davide Italiano <[hidden email]> wrote:
>> I see frequent timeouts on this bot -- can the timeout be increased?
>
> Hi Davide,
>
> Increasing the timeout is not ideal. We need to fix the problem in a
> different way.
>
> Our other ARM bots are using CMake 3.2 + Ninja 1.5, which gets rid of
> one source of time outs (slow testing), so that may be one way out.
>
> For now, please ignore any timeout on bots, as they happen not just
> with this one. :)

What's the problem with increasing the timeout? Asking people to ignore
buildbot mails does not seem right. If the buildbot is flaky I believe
the buildbot owner should ensure it shuts up until the problems have
been resolved and the buildbot has a low false positive rate again.

Best,
Tobias

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On Wed, Aug 26, 2015 at 10:32 AM, Tobias Grosser via llvm-dev
<[hidden email]> wrote:

> On 08/26/2015 04:12 PM, Renato Golin via llvm-dev wrote:
>>
>> On 26 August 2015 at 15:07, Davide Italiano <[hidden email]> wrote:
>>>
>>> I see frequent timeouts on this bot -- can the timeout be increased?
>>
>>
>> Hi Davide,
>>
>> Increasing the timeout is not ideal. We need to fix the problem in a
>> different way.
>>
>> Our other ARM bots are using CMake 3.2 + Ninja 1.5, which gets rid of
>> one source of time outs (slow testing), so that may be one way out.
>>
>> For now, please ignore any timeout on bots, as they happen not just
>> with this one. :)
>
>
> What's the problem with increasing the timeout? Asking people to ignore
> buildbot mails does not seem right. If the buildbot is flaky I believe
> the buildbot owner should ensure it shuts up until the problems have
> been resolved and the buildbot has a low false positive rate again.

Yes, please. I would go one step further and ask that authors of flaky
tests with non-deterministic failures also revert those tests until
the problems have been resolved. The number of false positives
currently makes it very hard to know what the state of the product is.

~Aaron

>
> Best,
> Tobias
>
>
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 26 August 2015 at 15:32, Tobias Grosser <[hidden email]> wrote:
> What's the problem with increasing the timeout? Asking people to ignore
> buildbot mails does not seem right. If the buildbot is flaky I believe
> the buildbot owner should ensure it shuts up until the problems have
> been resolved and the buildbot has a low false positive rate again.

That's the point I'm making: solve the real issue rather than just increasing the timeout.

CMake + Ninja has fixed virtually all our flakiness on all other ARM
bots, so I think we should give it a try first.

--renato
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 08/26/2015 04:38 PM, Renato Golin via llvm-dev wrote:

> On 26 August 2015 at 15:32, Tobias Grosser <[hidden email]> wrote:
>> What's the problem with increasing the timeout? Asking people to ignore
>> buildbot mails does not seem right. If the buildbot is flaky I believe
>> the buildbot owner should ensure it shuts up until the problems have
>> been resolved and the buildbot has a low false positive rate again.
>
> That's the point I make about solving the real issue, not increase the timeout.
>
> CMake + Ninja has fixed virtually all our flakiness on all other ARM
> bots, so I think we should give it a try first.

What timeline do you have in mind for this fix? If you are in charge
and can make this happen within a day, giving cmake + ninja a chance seems
OK.

However, if the owner of the buildbot is not known or the fix cannot come
soon, I am in favor of disabling the noise and (re)enabling it once someone
has found time to address the problem and verify the solution. The cost of
buildbot noise is very high, both in terms of developer time spent and,
more importantly, because people start to ignore the bots when monitoring
them becomes costly.

Best,
Tobias
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 26 August 2015 at 15:44, Tobias Grosser <[hidden email]> wrote:
> What time-line do you have in mind for this fix? If you are in charge
> and can make this happen within a day, giving cmake + ninja a chance seems
> OK.

It's not my bot. All my bots are CMake+Ninja based and are stable enough.


> However, if the owner of the buildbot is not known or the fix can not come
> soon, I am in favor of disabling the noise and (re)enabling it when someone
> found time to address the problem and verify the solution.

That's up to Galina. We haven't taken any action against unstable bots
so far, and this is not the only one. There are lots of Windows and
sanitizer bots that break randomly and provide little information; are
we going to disable them all? How about the perf bots that still fail
occasionally and whose root cause we haven't managed to fix; are we
going to disable them, too?

You're asking to considerably reduce the quality of testing in some
areas so that you can reduce the time spent looking at spurious
failures. I don't agree with that in principle. There were other
threads focusing on how to make the bots less spurious, more stable,
less noisy, and some work is being done on the GreenDragon bot
structure. But killing everything that looks suspicious now will reduce
our ability to validate LLVM on the range of configurations that we do
today, and that, for me, is a lot worse than a few minutes of some
engineers' time.


> The cost of
> buildbot noise is very high, both in terms of developer time spent, but
> more importantly due to people starting to ignore them when monitoring them
> becomes costly.

I think you're overestimating the cost.

When I get bot emails, I click on the link, and if it was a timeout, I
always ignore it. If I can't make heads or tails of it (like the
sanitizer ones), I ignore it temporarily, then look again the next day.

My assumption is that the bot owner will make me aware if the reason
is not obvious, as I do with my bots. I always wait for people to
realise and fix the problem. But if they can't, either because the bot
was already broken or because the breakage isn't clear, I let people
know where to find the information on the bot itself. This is my
responsibility as a bot owner.

I appreciate the benefit of having green / red bots, but you also have
to appreciate that hardware is not perfect, and bots will invariably
fail once in a while. I had some Polly bots failing randomly, and it
took me only a couple of seconds to see that. I'm not asking to remove
them, even those that fail more often than they pass throughout the
year. I assume that, if they're still there, they provide *some* value
to someone.

cheers,
--renato
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev

> On Aug 26, 2015, at 8:21 AM, Renato Golin via llvm-dev <[hidden email]> wrote:
>
> On 26 August 2015 at 15:44, Tobias Grosser <[hidden email]> wrote:
>> What time-line do you have in mind for this fix? If you are in charge
>> and can make this happen within a day, giving cmake + ninja a chance seems
>> OK.
>
> It's not my bot. All my bots are CMake+Ninja based and are stable enough.
>
>
>> However, if the owner of the buildbot is not known or the fix can not come
>> soon, I am in favor of disabling the noise and (re)enabling it when someone
>> found time to address the problem and verify the solution.
>
> That's up to Galina. We haven't had any action against unstable bots
> so far, and this is not the only one. There are lots of Windows and
> sanitizer bots that break randomly and provide little information, are
> we going to disable them all? How about the perf bots that still fail
> occasionally and we haven't managed to fix the root cause, are we
> going to disable then, too?
>
> You're asking to reduce considerably the quality of testing on some
> areas so that you can reduce the time spent looking at spurious
> failures. I don't agree with that in principle.

That’s not how I understand his point. In my opinion, he is asking to increase the quality of testing. You just happen to disagree with his solution :)

The situation does not seem that black and white to me. In the end, it seems to me that it is about a threshold: if a bot is failing 90% of the time, does it really contribute to the quality of testing, or is it just adding noise? Same question with 20%, 40%, 60%, … We may all have a different answer, but I’m pretty sure we could reach an agreement on what seems appropriate.

Another way of considering the overall impact of a bot on quality is: “how many legitimate failures were found by this bot in the last x years that weren’t covered by another bot?”
Sometimes you may just be stress-testing a hardware rack in a lab without providing any increased coverage for the software.

Cheers,


Mehdi



> There were other
> threads focusing on how to make them less spurious, more stable, less
> noisy, and some work is being done on the GreenDragon bot structure.
> But killing everything that looks suspicious now will reduce our
> ability to validate LLVM on the range of configurations that we do
> today, and that, for me, is a lot worse than a few minutes' worth of
> some engineers.
>
>
>> The cost of
>> buildbot noise is very high, both in terms of developer time spent, but
>> more importantly due to people starting to ignore them when monitoring them
>> becomes costly.
>
> I think you're overestimating the cost.
>
> When I get bot emails, I click on the link and if it was timeout, I
> always ignore it. If I can't make heads or tails (like the sanitizer
> ones), I ignore it temporarily, then look again next day.
>
> My assumption is that the bot owner will make me aware if the reason
> is not obvious, as I do with my bots. I always wait for people to
> realise, and fix. But if they can't, either because the bot was
> already broken, or because the breakage isn't clear, I let people know
> where to search for the information in the bot itself. This is my
> responsibility as a bot owner.
>
> I appreciate the benefit of having green / red bots, but you also have
> to appreciate that hardware is not perfect, and they will invariably
> fail once in a while. I had some Polly bots failing randomly and it
> took me only a couple of seconds to infer so. I'm not asking to remove
> them, even those that fail more than pass throughout the year. I
> assume that, if they're still there, it provides *some* value to
> someone.
>
> cheers,
> --renato
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev


On Wed, Aug 26, 2015 at 9:01 AM, Mehdi Amini via llvm-dev <[hidden email]> wrote:

> On Aug 26, 2015, at 8:21 AM, Renato Golin via llvm-dev <[hidden email]> wrote:
>
> On 26 August 2015 at 15:44, Tobias Grosser <[hidden email]> wrote:
>> What time-line do you have in mind for this fix? If you are in charge
>> and can make this happen within a day, giving cmake + ninja a chance seems
>> OK.
>
> It's not my bot. All my bots are CMake+Ninja based and are stable enough.
>
>
>> However, if the owner of the buildbot is not known or the fix can not come
>> soon, I am in favor of disabling the noise and (re)enabling it when someone
>> found time to address the problem and verify the solution.
>
> That's up to Galina. We haven't had any action against unstable bots
> so far, and this is not the only one. There are lots of Windows and
> sanitizer bots that break randomly and provide little information, are
> we going to disable them all? How about the perf bots that still fail
> occasionally and we haven't managed to fix the root cause, are we
> going to disable then, too?
>
> You're asking to reduce considerably the quality of testing on some
> areas so that you can reduce the time spent looking at spurious
> failures. I don't agree with that in principle.

> That’s not how I understand his point. In my opinion, he is asking to increase the quality of testing. You just happen to disagree on his solution :)
>
> The situation does not seem that black and white to me here. In the end, it seems to me that is is about a threshold: if a bot is crashing 90% of the time, does it really contributes to increase the quality of testing or on the opposite it is just adding noise? Same question with 20%, 40%, 60%, …  We may all have a different answer, but I’m pretty sure we could reach an agreement on what seems appropriate
>
> Another way of considering in general the impact of a bot on the quality is: “how many legit failures were found by this bot in the last x years that weren’t covered by another bot”.

Even that doesn't really capture it: if the bot has enough false positives, or spends long periods being red, even those legitimate failures will be lost in the noise, and the cost to the whole project (not only in ignoring that bot, but in reducing confidence in the bots in general, which is already pretty low because of this kind of situation) may outweigh the value of the bugs being found.

If a bot is of low enough quality that most engineers ignore it due to false positives or long periods of brokenness, then it makes sense to me to remove it from the main buildbot view and to stop it from sending email. The owner can monitor the bot and, once they triage a failure, manually reach out to those who might be to blame.

(Oh, and add long cycle times to the list of issues: people do have a tendency to ignore bots that come back with giant blame lists and no obvious determination as to whose patch caused the problem, if any.)

- David
 
> Because sometimes you may just having a HW lab stress rack, without providing any increased coverage for the software.
>
> Cheers,
>
>
> Mehdi



> There were other
> threads focusing on how to make them less spurious, more stable, less
> noisy, and some work is being done on the GreenDragon bot structure.
> But killing everything that looks suspicious now will reduce our
> ability to validate LLVM on the range of configurations that we do
> today, and that, for me, is a lot worse than a few minutes' worth of
> some engineers.
>
>
>> The cost of
>> buildbot noise is very high, both in terms of developer time spent, but
>> more importantly due to people starting to ignore them when monitoring them
>> becomes costly.
>
> I think you're overestimating the cost.
>
> When I get bot emails, I click on the link and if it was timeout, I
> always ignore it. If I can't make heads or tails (like the sanitizer
> ones), I ignore it temporarily, then look again next day.
>
> My assumption is that the bot owner will make me aware if the reason
> is not obvious, as I do with my bots. I always wait for people to
> realise, and fix. But if they can't, either because the bot was
> already broken, or because the breakage isn't clear, I let people know
> where to search for the information in the bot itself. This is my
> responsibility as a bot owner.
>
> I appreciate the benefit of having green / red bots, but you also have
> to appreciate that hardware is not perfect, and they will invariably
> fail once in a while. I had some Polly bots failing randomly and it
> took me only a couple of seconds to infer so. I'm not asking to remove
> them, even those that fail more than pass throughout the year. I
> assume that, if they're still there, it provides *some* value to
> someone.
>
> cheers,
> --renato
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 26 August 2015 at 17:01, Mehdi Amini <[hidden email]> wrote:
> The situation does not seem that black and white to me here. In the end, it seems to me that is is about a threshold: if a bot is crashing 90% of the time, does it really contributes to increase the quality of testing or on the opposite it is just adding noise?

That question doesn't stand alone; on its own, it's meaningless. Your
next question, however, is the key one.


> Another way of considering in general the impact of a bot on the quality is: “how many legit failures were found by this bot in the last x years that weren’t covered by another bot”.

By that criterion, which was my point, those bots haven't added much
since I added some faster ones. So, if we want to shut them down,
let's do so because they don't add value, not because they are
unstable.

However, that is the *only* bot running on an A9. As an example, this
year I spent two whole weeks during the release bisecting and fixing an
issue, because I had disabled, for two months, one bot that I thought
was already covered by another.

That headache was real. I wasted two whole weeks, maybe more, of my
time. I wasted the time of other people waiting to do the release
validation. I delayed the release and everything that entails. All
because I thought that bot was noisy. Its ratio was about 20 passes
to 1 failure.

The A9 bots fail more often than that, so on my monitor [1] I currently
ignore their results. I still keep them there to see what's going on,
and when my bots fail, I look at them, too, to see if the problem is
the same. Sometimes they do provide useful insight into the other
bots' breakages.

So, for me, disabling the A9 bots would be a loss. But as I said
before, that's up to Galina, as she's the bot owner. If she's ok with
finally putting them to rest, I'll respect the community's decision
and remove them from my monitor. But we can't turn this into a witch
hunt. It's not about thresholds; it's about cost and value, which may
be different for you than they are for me. We have to consider the
whole community, not just our own opinions.

For every broken bot that someone wants to get rid of, I propose
consulting the bot owner first, and then holding a vote on llvm-dev@ /
cfe-dev@ if the owner does not agree. After all, you can always have an
internal buildmaster (like I have) for unstable bots.

cheers,
--renato
[1] http://llvm.tcwglab.linaro.org/monitor/
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
@Galina: It seems this bot is now almost permanently running into a compile-time
timeout. Maybe you can fix this by either increasing the timeout or by
switching to a cmake/ninja based build as suggested by Renato.

On 08/26/2015 05:21 PM, Renato Golin via llvm-dev wrote:
> On 26 August 2015 at 15:44, Tobias Grosser <[hidden email]> wrote:
>> What time-line do you have in mind for this fix? If you are in charge
>> and can make this happen within a day, giving cmake + ninja a chance seems
>> OK.
>
> It's not my bot. All my bots are CMake+Ninja based and are stable enough.

I should have looked it up myself. I did not want to point fingers, but
to make sure we understand who will address this issue. I just looked up
who owns these builders, and to my understanding it is Galina herself.
I am CCing her so that she can take action.

I also have the feeling I was generally too harsh in my mail, as I seem
to have triggered a rather defensive reply. Sorry for that.

Regarding the discussion about disabling/enabling buildbots: I agree with
Mehdi that there is no black and white. For this bot, it seems important
to address the issue, as it has started failing very regularly now.

Regarding my own bots: in case you see flaky Polly buildbots or any other
of my bots sending emails without reason, please send me a short ping so
that I can fix the issue. None of my LNT bots send emails, as they run
for too long before starting to report.

Best,
Tobias


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 26 August 2015 at 17:21, David Blaikie <[hidden email]> wrote:
> (oh, and add long cycle times to the list of issues - people do have a
> tendency to ignore bots that come back with giant blame lists & no obvious
> determination as to who's patch caused the problem, if any)

Yes, but remember, not all hardware is as fast as a multi-core Xeon
server. Build times can't always be controlled.

But I agree with you on all counts. The bot owner should bear the
responsibility for his/her own unstable bots. If a bot brings less value
than it adds cost to the community, it should be moved to a separate
buildmaster that doesn't email people, but can still be accessed, so
the owner can point breakages out to devs.

cheers,
--renato
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 08/26/2015 08:21 AM, Renato Golin via llvm-dev wrote:

> On 26 August 2015 at 15:44, Tobias Grosser <[hidden email]> wrote:
>> What time-line do you have in mind for this fix? If you are in charge
>> and can make this happen within a day, giving cmake + ninja a chance seems
>> OK.
> It's not my bot. All my bots are CMake+Ninja based and are stable enough.
>
>
>> However, if the owner of the buildbot is not known or the fix can not come
>> soon, I am in favor of disabling the noise and (re)enabling it when someone
>> found time to address the problem and verify the solution.
> That's up to Galina. We haven't had any action against unstable bots
> so far, and this is not the only one. There are lots of Windows and
> sanitizer bots that break randomly and provide little information, are
> we going to disable them all? How about the perf bots that still fail
> occasionally and we haven't managed to fix the root cause, are we
> going to disable then, too?
If the bot fails regularly (say, a false positive rate of 1 in 10 runs),
then yes, it should be disabled until the owner fixes it. It's perfectly
okay for it to be put into a "known unstable" list and for the bot owner
to report failures after they've been confirmed.

To say this differently: we will revert a *change* which is
problematic. Why shouldn't we "revert" a bot?

>
> You're asking to reduce considerably the quality of testing on some
> areas so that you can reduce the time spent looking at spurious
> failures. I don't agree with that in principle. There were other
> threads focusing on how to make them less spurious, more stable, less
> noisy, and some work is being done on the GreenDragon bot structure.
> But killing everything that looks suspicious now will reduce our
> ability to validate LLVM on the range of configurations that we do
> today, and that, for me, is a lot worse than a few minutes' worth of
> some engineers.
>
>
>> The cost of
>> buildbot noise is very high, both in terms of developer time spent, but
>> more importantly due to people starting to ignore them when monitoring them
>> becomes costly.
> I think you're overestimating the cost.
>
> When I get bot emails, I click on the link and if it was timeout, I
> always ignore it. If I can't make heads or tails (like the sanitizer
> ones), I ignore it temporarily, then look again next day.
I disagree strongly here. The cost of having flaky bots is quite high.
When I make a commit, I'm committing to being responsive to problems it
introduces over the next few hours. Every one of those false positives
is a 5-10 minute high-priority interruption to what I'm actually working
on. In practice, that greatly diminishes my effectiveness.

As an illustrative example, I submitted some documentation changes
earlier this week and got 5 unique build-failure notices. In this case
I ignored them, but if that had been a small code change, it would have
cost me at least an hour of productivity.
>
> My assumption is that the bot owner will make me aware if the reason
> is not obvious, as I do with my bots. I always wait for people to
> realise, and fix. But if they can't, either because the bot was
> already broken, or because the breakage isn't clear, I let people know
> where to search for the information in the bot itself. This is my
> responsibility as a bot owner.
First, thanks for being a responsible bot owner. :)

If all bot owners were doing this, having an unstable list which doesn't
actively notify would be completely workable. If not all bot owners are
doing this, I can't say I really care about the status of those bots.

>
> I appreciate the benefit of having green / red bots, but you also have
> to appreciate that hardware is not perfect, and they will invariably
> fail once in a while. I had some Polly bots failing randomly and it
> took me only a couple of seconds to infer so. I'm not asking to remove
> them, even those that fail more than pass throughout the year. I
> assume that, if they're still there, it provides *some* value to
> someone.
>
> cheers,
> --renato
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev


On Wed, Aug 26, 2015 at 9:27 AM, Renato Golin <[hidden email]> wrote:
> On 26 August 2015 at 17:21, David Blaikie <[hidden email]> wrote:
>> (oh, and add long cycle times to the list of issues - people do have a
>> tendency to ignore bots that come back with giant blame lists & no obvious
>> determination as to who's patch caused the problem, if any)
>
> Yes, but remember, not all hardware is as fast as a multi-core Xeon
> server. Build times can't always be controlled.

Small blame lists can still be acquired by having more hardware. Certainly not always possible/in the budget for those who want to verify these things.

> But I agree with you on all accounts. The bot owner should bear the
> responsibility of his/her own unstable bots. If it brings less value
> than it adds cost to the community, it should be moved to a separate
> buildmaster that doesn't email people around, but can be accessed, so
> the owner can point breakages to devs.

More than just not emailing, it'd be great to have a generally-green dashboard to look at. For now it's hard to get a sense of what's 'really' broken.

But yeah, all of that stuff.


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 26 August 2015 at 17:27, Tobias Grosser <[hidden email]> wrote:
> @Galina: It seems this bot is now almost permanently running into a
> compile-time
> timeout. Maybe you can fix this by either increasing the timeout or by
> switching to a cmake/ninja based build as suggested by Renato.

How I fixed my bots:

1. Remove cmake and ninja from your system. They are too old.
2. Download cmake stable sources (3.2+), untar, bootstrap, make, make install
3. Checkout ninja from github, bootstrap, copy "ninja" to /usr/local/bin
4. Install ccache from packages, add ccache to path
5. Change the builder to ClangCMakeBuilder, like all ARM and AArch64 bots are now.
6. Restart.

The Ninja+CMake combo has a feature that makes sure everything is
printed without buffering, so the timeout works exactly as intended: if
any single process takes more than that time, it's a bug.
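
To illustrate why the unbuffered output matters, here is a minimal, self-contained Python sketch of an idle-based watchdog. This is not buildbot's actual code; buildbot's step timeout behaves roughly like this (it fires when a step produces no output for the configured time), and run_with_idle_timeout, the "sleep" command and the 3-second value below are only illustrative:

import subprocess
import sys
import threading

def run_with_idle_timeout(cmd, idle_timeout):
    """Run cmd, killing it if it produces no output for idle_timeout seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    timer = None

    def arm():
        # (Re)start the idle clock; any previously armed timer is cancelled.
        nonlocal timer
        if timer:
            timer.cancel()
        timer = threading.Timer(idle_timeout, proc.kill)
        timer.start()

    arm()
    for line in proc.stdout:   # every new line of output resets the idle clock
        sys.stdout.write(line)
        arm()
    timer.cancel()
    return proc.wait()

if __name__ == "__main__":
    # A silent "sleep 10" produces no output, so a 3-second idle timeout kills
    # it; an unbuffered "ninja -v" build would keep resetting the clock and
    # only time out if a single compile job really hangs.
    print(run_with_idle_timeout(["sleep", "10"], idle_timeout=3))

With buffered output, a build can look silent for longer than the timeout even while it is making progress, which appears to be the spurious failure mode described above.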

I'm copying Gabor, as AFAIK, his bot is not based on the new
CMake+Ninja fix, but on an old polling script we had, which makes
timeouts useless.

Let's try this one first, and only consider any more drastic solution after.

cheers,
--renato
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 26 August 2015 at 17:30, Philip Reames <[hidden email]> wrote:
> To say this differently, we will revert a *change* which is problematic.
> Why shouldn't we "revert" a bot?

I don't disagree, just don't want to do that lightly. Most certainly
not before we have comments from the bot owner.


> As an illustrative example, I submitted some documentation changes earlier
> this week and got 5 unique build failure notices.  In this case, I ignored
> them, but if that had been a small code change, that would have cost me at
> least an hour of productivity.

I have to say, I never spend more than a few minutes looking at
failing bots. If there's nothing I can find in 30 seconds of looking
at the bot page, I rely on the bot owners to ping me, revert my
patches, or let me know what's wrong.

I'll make your words mine:

> If all bot owners were doing this, having a unstable list which doesn't
> actively notify would be completely workable.  If not all bot owners are
> doing this, I can't say I really care about the status of those bots.

:D

cheers,
--renato
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 08/26/2015 09:27 AM, Renato Golin via llvm-dev wrote:
> On 26 August 2015 at 17:21, David Blaikie <[hidden email]> wrote:
>> (oh, and add long cycle times to the list of issues - people do have a
>> tendency to ignore bots that come back with giant blame lists & no obvious
>> determination as to who's patch caused the problem, if any)
> Yes, but remember, not all hardware is as fast as a multi-core Xeon
> server. Build times can't always be controlled.
No, but there are many known partial solutions.  We can explore some of
them if build time is the only issue.

Some options (I don't know which, if any, have already been explored):
1. Kick off builds on slow bots only at revisions which have already
passed a fast bot (see the sketch below).
2. Overlap builds on different pieces of hardware so that revision
ranges are smaller (i.e. "that change was in the previous build and
didn't fail, so it can't be that").
3. Build incrementally, with occasional clean builds at already-known-good
revisions (or for failure recovery).
4. Separate build and test: cross-build on a faster machine, then
transfer the binaries to a slower test machine; or run only a build on
the slow machine. (I.e. decrease the latency of build+test to the
latency of build OR test.)
5. Consider using an emulator on a faster machine to get initial
results, and only kick off builds on hardware for changes that pass the
emulator. (This is a particular variant of (1) above.)
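
For option (1), the gating logic itself is tiny. A rough, hypothetical Python sketch (pick_revision_for_slow_bot and the revision numbers are made up; a real implementation would query the buildmaster for the fast bot's last green build rather than take plain lists):

def pick_revision_for_slow_bot(pending_revisions, fast_bot_last_green):
    """Return the newest pending revision that the fast bot has already
    built green, or None if nothing is safe to build yet."""
    validated = [r for r in pending_revisions if r <= fast_bot_last_green]
    return max(validated) if validated else None

if __name__ == "__main__":
    # Example: r246031 is pending, but the fast bot is only green up to
    # r246029, so the slow A9 bot builds r246029 and waits on the rest.
    print(pick_revision_for_slow_bot([246025, 246029, 246031], 246029))

The slow bot never builds a revision the fast bot hasn't validated, so it spends its limited cycles only on changes that at least compile and pass on the fast configuration.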

>
> But I agree with you on all accounts. The bot owner should bear the
> responsibility of his/her own unstable bots. If it brings less value
> than it adds cost to the community, it should be moved to a separate
> buildmaster that doesn't email people around, but can be accessed, so
> the owner can point breakages to devs.
>
> cheers,
> --renato
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev
On 26 August 2015 at 17:32, David Blaikie <[hidden email]> wrote:
> Small blame lists can still be acquired by having more hardware. Certainly
> not always possible/in the budget for those who want to verify these things.

More unstable hardware is more unstable. :)

cheers,
--renato
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev


On Wed, Aug 26, 2015 at 9:34 AM, Renato Golin via llvm-dev <[hidden email]> wrote:
> On 26 August 2015 at 17:27, Tobias Grosser <[hidden email]> wrote:
>> @Galina: It seems this bot is now almost permanently running into a
>> compile-time
>> timeout. Maybe you can fix this by either increasing the timeout or by
>> switching to a cmake/ninja based build as suggested by Renato.
>
> How I fixed my bots:
>
> 1. Remove cmake and ninja from your system. They are too old.
> 2. Download cmake stable sources (3.2+), untar, bootstrap, make, make install
> 3. Checkout ninja from github, bootstrap, copy "ninja" to /usr/local/bin
> 4. Install ccache from packages, add ccache to path
> 5. Change the builder to ClangCMakeBuilder like all ARM and AArch64 bots now in.
> 6. Restart.
>
> The Ninja+CMake combo has a feature that makes sure you print
> everything without buffering, so the time out works exactly as
> intended: if any single process takes more than that time, it's a bug.
>
> I'm copying Gabor, as AFAIK, his bot is not based on the new
> CMake+Ninja fix, but on an old polling script we had, which makes
> timeouts useless.
>
> Let's try this one first, and only consider any more drastic solution after.

*shrug* I haven't looked at whatever specific bots are under discussion, but I really would like it if the bots had more of a "revert to green" feel to them, just like we have for commits: take a bot offline, fix/iterate/improve it, see if it comes good, then bring it back to the mainline.
 

cheers,
--renato
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

Tobias Hieta via llvm-dev


On Wed, Aug 26, 2015 at 9:39 AM, Renato Golin <[hidden email]> wrote:
> On 26 August 2015 at 17:32, David Blaikie <[hidden email]> wrote:
>> Small blame lists can still be acquired by having more hardware. Certainly
>> not always possible/in the budget for those who want to verify these things.
>
> More unstable hardware is more unstable. :)

I was referring specifically to the issue of long cycle times producing long blame lists. That can be reduced by having more bots so that blame lists are smaller.

Even if the hardware is just as unstable, this is actually better; it's not more unstable as such. It means that when it does flake out, fewer people are distracted or notified. That's an improvement, even if a small one.
 

> cheers,
> --renato


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev