1) Message boards : Number crunching : Off topic: Radioactive@Home down (Message 6289)
Posted 30 Nov 2023 by xii5ku
Post:
xii5ku wrote:
It came back earlier today, but it has been nonfunctional again since >2 h ago:
Uploads are failing with transient HTTP error. server_status.php shows upload/download server as disabled. Web site and scheduler can be reached. Scheduler responses are malformed though ("no start tag in scheduler reply").
I can see that hosts of other users had been uploading during that time. After a reset of my cable modem, my uploads went through right away, and the "no start tag" errors no longer happen either. In other words, my connection to this project was somehow broken locally (even though connections to several other projects kept working).
2) Message boards : Number crunching : Off topic: Radioactive@Home down (Message 6287)
Posted 29 Nov 2023 by xii5ku
Post:
It came back earlier today, but it has been nonfunctional again since >2 h ago:
Uploads are failing with transient HTTP error. server_status.php shows upload/download server as disabled. Web site and scheduler can be reached. Scheduler responses are malformed though ("no start tag in scheduler reply").
3) Message boards : Number crunching : Off topic: Radioactive@Home down (Message 6279)
Posted 16 Nov 2023 by xii5ku
Post:
Hi Krzysztof,

http://radioactiveathome.org/boinc/ has not been responding since Friday, November 10.
4) Message boards : Number crunching : Project has no tasks available (Message 5482)
Posted 17 May 2022 by xii5ku
Post:
Grant (SSSF) wrote:
Sorry, but that's just a load of self serving rubbish.
As I said, about 90 % of the computer time which I, for one, have donated to Universe@home so far was motivated by contests. In other words, if it weren't for contests, that part of my modest contribution wouldn't be here. And yes, some people care about competitions, others don't.

ace_quaker wrote:
[...] it incentivizes throwing every system under the sun regardless of performance per watt.
There is probably some truth to that. I, on the other hand, brought only comparatively efficient server systems to the contest. (I built them myself and operate them at home.)
5) Message boards : Number crunching : Project has no tasks available (Message 5420)
Posted 14 May 2022 by xii5ku
Post:
@Brummig, this metaphor doesn't work quite right. Whose front lane is this? From a quick look at the footer of this page, we are all guests of the Copernicus Astronomical Centre of the Polish Academy of Sciences here (invited to bring computer time for their research).

Note that, unlike some other contests, the Pentathlon is carefully prepared well in advance; most notably, the organizers are in touch with the project owners beforehand and during the contest. Projects enter the Pentathlon only with the explicit agreement of the project owners, among other preconditions. And both the project owners and the organizers were certainly well aware of the specific limitations of the respective project when they planned the contest. The one element which is hard for them to estimate beforehand is exactly how massive the computing capacity of the Pentathletes will be.

And another side note: As an anecdotal example, about 90 % of the computer time which I, for one, have donated to Universe@Home since I registered here was motivated by contests. We are all here for different reasons, and some of us are here for contests. The Pentathlon in particular gathers a lot of users, and many of them bring everything and the kitchen sink online while they participate. Not just their participation: even their hardware investments throughout the year are quite often planned around those occasions when they want to bring all of it online.

Anyway. By graciously hosting the Obstacle Run contest of the Pentathlon, the Universe@Home project struck true gold. For about 2.5 weeks of wall-clock time, this project is producing at or near its actual capacity. This means that the project servers into which the project owners invested, and/or their internet link, are now fully utilized. Inevitably this also means that a part of the donated computing capacity is underutilized (unless donors react and shift some capacity to a different project). So this is quite an ideal situation for the project to be in, with only two downsides: (1) presumably an increased need for the admin's attention to keep everything running, and (2) invariably, some complaints from some donors about underutilization of their computers, which is natural since they focus on the operation of their own machines rather than on the bigger picture.
6) Message boards : Number crunching : Server Thread (Message 5415)
Posted 13 May 2022 by xii5ku
Post:
Grant (SSSF) wrote:
Several times a day i just keep hitting Retry pending transfers till they all clear, then Update till the Scheduler finally doesn't error out & doesn't give a "Project has no available Tasks" response, then hit Retry pending transfers again till all of the downloads have finally come through.
Then i let it be for 6-10 hours, and then start hitting Retry pending transfers all over again...
There are rather obvious alternatives to what you say you are doing.

a) The steps which you describe are simple mechanical steps. You could let a computer do them for you; computers excel at repetitive trivial tasks. (A minimal sketch follows at the end of this post.)

b) There are so many other DC projects out there waiting for your computer capacity. If you like U@h much more than the other projects, no problem: the duration of the Pentathlon's 'Obstacle Run' at Universe@home is public. Things will be back to normal at U@h one or two days after the competition.
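
For item a), here is a minimal sketch of what such automation could look like, using boinccmd (the command-line tool shipped with the BOINC client). It assumes boinccmd is on the PATH and can talk to the local client without extra host/password options; the polling interval and iteration count are arbitrary, and you could additionally inspect the output of boinccmd --get_file_transfers to decide when to stop.

```python
#!/usr/bin/env python3
# Sketch: automate the "retry pending transfers, then update" clicking via
# boinccmd (the command-line companion of the BOINC client).
import subprocess
import time

PROJECT_URL = "https://universeathome.pl/universe/"   # project URL as seen by the client

def boinccmd(*args):
    """Run one boinccmd subcommand, ignoring failures (sketch-level error handling)."""
    try:
        subprocess.run(["boinccmd", *args], check=False, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        pass

for _ in range(12):                       # poll for up to an hour; adjust to taste
    # Roughly the manual routine from the quoted post:
    # retry deferred network activity (pending up-/downloads) ...
    boinccmd("--network_available")
    # ... then poke the scheduler (report finished tasks, request new work).
    boinccmd("--project", PROJECT_URL, "update")
    time.sleep(300)                       # wait 5 minutes between rounds
```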
7) Message boards : Number crunching : Server Thread (Message 5349)
Posted 8 May 2022 by xii5ku
Post:
And another batch of work has been loaded as well.
Most work requests are responded to with "Project has no tasks available" though, because the feeder's(?) buffer of tasks to assign (or something like that) is not refilled frequently enough. IOW this buffer is too small for the current rate of client requests for more work.
8) Message boards : Number crunching : Double your task throughput on Linux (Message 5202)
Posted 26 Mar 2022 by xii5ku
Post:
On 25 February 2021 Keith Myers wrote:
We noticed the speedup was specific to the glibc library when running Ubuntu 18.04 with kernel 5.4 and glibc 2.29 against Ubuntu 20.04 with kernel 5.4 and glibc library 2.31.
Are you sure about the glibc version of your Ubuntu 18.04 test case? According to distrowatch, Ubuntu 18.04 comes with glibc 2.27.

Empirically, I myself didn't narrow the relevant glibc update down any further than >2.27 && ≤2.31. But from reading glibc release notes, it appears to me that glibc 2.29 introduced the change which sped up BHspin v2 and ULX. (2.29 release notes, relevant patchset)
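
As a side note, a quick way to check which glibc a host actually runs (rather than what distrowatch lists for a release) is to ask the C library itself. A small generic sketch, nothing BOINC-specific:

```python
#!/usr/bin/env python3
# Print the glibc version the running system actually provides (Linux only).
import os
import platform

# platform.libc_ver() inspects the libc the Python binary is linked against ...
print(platform.libc_ver())                 # e.g. ('glibc', '2.31')
# ... and confstr() asks the running C library directly, where supported.
print(os.confstr("CS_GNU_LIBC_VERSION"))   # e.g. 'glibc 2.31'
```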
9) Message boards : Number crunching : Universe@home selected for Formula BOINC Sprint #2 (24-27 March 2022) (Message 5201)
Posted 26 Mar 2022 by xii5ku
Post:
Work requests are still responded to with "Project has no tasks available" most of the time.

From what I understood, the scheduler does not access the entire pool of what we see as "tasks ready to send" at server_status.php. Rather, the scheduler is supplied from a small in-memory queue of unassigned tasks. (Was this a shared-memory buffer between the feeder and the scheduler?) Whenever a server receives work requests as frequently as the U@h server does now, the challenge is to get this small in-memory buffer refilled fast enough.

I don't know what precisely can be tuned server-side to address this. It may be a matter of increasing the responsiveness of the database, or maybe it's as simple as increasing the size of the aforementioned in-memory buffer.
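
I haven't verified this against the U@h server, but if the bottleneck really is the feeder/scheduler shared-memory segment, the BOINC server software does expose knobs for it in the project's config.xml. A hedged sketch with purely illustrative values (defaults and suitable sizes depend on the installation):

```xml
<!-- Fragment of a BOINC project's config.xml (inside its <config> section);
     values are illustrative only, not a recommendation for U@h. -->
<shmem_work_items>1000</shmem_work_items>   <!-- slots in the shared-memory work array -->
<feeder_query_size>2000</feeder_query_size> <!-- results fetched per feeder DB query -->
```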
10) Message boards : Number crunching : download failed (Message 5117)
Posted 17 Feb 2022 by xii5ku
Post:
xii5ku wrote:
Confirmed: https://universeathome.pl/universe/download/BHspin2_19_x86_64-pc-linux-gnu gets HTTP error 404.

And more files are missing according to the other thread, "Couldn't Get Input Files".
This error was fixed today by the server-side installation of version 20 of the application (cf. apps.php).
11) Message boards : Number crunching : download failed (Message 5094)
Posted 16 Feb 2022 by xii5ku
Post:
Confirmed: https://universeathome.pl/universe/download/BHspin2_19_x86_64-pc-linux-gnu gets HTTP error 404.

And more files are missing according to the other thread, "Couldn't Get Input Files".

Edit: I see that this was already reported in the news thread as well.
12) Message boards : Number crunching : Upload fails (Message 4300)
Posted 11 May 2020 by xii5ku
Post:
Keith Myers wrote:
Does anybody know if it uses advanced SIMD instructions like AVX2 or FMA?
It most certainly does not. On Broadwell-EP for instance, BHspin v2 does not trigger the AVX2/FMA clock frequency offset.


Keith Myers wrote:
Those are the major architectural differences between Zen+ and Zen 2 plus the 256bit wide AVX registers.
Classic FPU code throughput per core was roughly doubled as well, wasn't it?
13) Message boards : Number crunching : Upload fails (Message 4292)
Posted 11 May 2020 by xii5ku
Post:
Keith Myers wrote:
New to the project. Was not aware of the project being part of the Pentathlon. Is the availability of work because of the contest? Or is the quantity available normal?
It is normal. See the green line (rts = ready to send) in these graphs by @kiska:
https://munin.kiska.pw/munin/Munin-Node/Munin-Node/results_universe.html


Keith Myers wrote:
Are the BHspin2 tasks a special set for the contest? Or normal?
They are normal.


Keith Myers wrote:
Are the normal tasks all the same in runtimes?
On a given hardware and operating system combination, the run times of most BHspin v2 tasks vary, but not to a large extent. The exception is a (usually) small number of tasks, which you get only occasionally, that take longer, e.g. maybe ~5 times as long as the usual average.


Keith Myers wrote:
I have two hosts with very large differences in runtimes and I am trying to determine why? The set of tasks given to each host were from different species of work based on my assumption in tasknames. Is there a post explaining the makeup of tasknames? There are no parameters visible for each task when hovering over a task unlike what Milkyway or Einstein shows. Can someone explain the task construction?
I don't have an answer for these two.


Keith Myers wrote:
Or is one computer fundamentally faster than the other even though both have the same clocks.
From a very quick look at your two computers, the difference in task run times is indeed larger than expected. The current top 20 valid tasks of each computer were downloaded at two different times on Sunday, May 10. On that day, several batches of tasks with increased run times were issued. So a possible explanation is that one of your computers happened to receive many tasks out of such a more intensive batch. Check how the two computers fare on other days.


Keith Myers wrote:
That also showed me that a large number of hosts loaded up work for the City Run part of the Pentathlon which ends in a few hours and they are aborting huge quantities of work. Not good BOINC etiquette in my opinion.
I agree with you in principle. But it needs to be pointed out that part of the problem is the increase in average task run times on Sunday, as mentioned. This probably caused many computers to fetch more tasks than their users wanted, because BOINC underestimated the new task run times. That said, I guess you may have seen hosts with a disproportionate count of aborted tasks, which would indeed indicate a lack of planning (or care) by the computer owners.
14) Message boards : Number crunching : Upload fails (Message 4282)
Posted 10 May 2020 by xii5ku
Post:
mikey wrote:
MOST of the time Projects get notified that they were selected to be a part of the Pentathlon about a week ahead of time. Sometimes Projects say 'no' but it's proably too late and they suffer for it,
From what I remember, the current (and only feasible) process is that the Pentathlon organizers
    – ask project admins whether or not they are OK with taking part,
    – keep only projects for selection whose admins responded positively (IOW, remove projects from their set of candidates if the admins declined or never responded).

This is only from my memory; I don't have a primary source to quote.

--------

Jon Melusky wrote:

I was able to clear my upload fails. I had two fails in the transfer tab. I clicked one of them and clicked retry now. When that one WU said "Active", I quickly selected the other one and also clicked retry now for that one. Retying them separately didn't work, but sometimes it does work for some WUs. Some of the WUs that are stuck show as very tiny so perhaps the server doesn't recognize them as WUs? Sometimes the stuck WUs are cleared from the transfer tab, but then they are moved to the tasks tab and they say Ready to Report. Other times, the stuck WUs seem to be cleared directly to the Universe servers and they don't show in the tasks tab.
The transfer tab shows files, not tasks.
Each successfully computed BHspin v2 task produces 6 result files.
After all 6 files have been uploaded successfully, the task becomes ready to report.
15) Message boards : Number crunching : Certificate expired. (Message 3825)
Posted 13 Oct 2019 by xii5ku
Post:
Krzysztof, or whoever intervened on short notice,

thank you for fixing this very quickly!
16) Message boards : Number crunching : Formula Boinc Selected Sprint for •10/19/2018 05:00 (UTC) - 10/22/2018 04:59 (UTC) (Message 3055)
Posted 18 Oct 2018 by xii5ku
Post:
Krzysztof,

the limit of tasks in progress was 512 per host, or 8 per active logical CPU, whichever was less.
Did you change anything at all due to bcavnaugh's post?

If you lifted the latter limit (and just kept the former), then a lot of unsupervised, slow, low-core-count hosts may suddenly queue far more tasks than normal. This in turn will leave many tasks pending validation at the end of the sprint.
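
For context, the per-CPU part of such a limit is typically set in the project's config.xml. A hedged sketch of what I assume is behind the "8 per active logical CPU" figure (I don't know how the 512-per-host cap is configured on this server):

```xml
<!-- Fragment of a BOINC project's config.xml (inside its <config> section).
     Only the per-CPU limit is sketched; the separate 512-per-host cap is not shown. -->
<max_wus_in_progress>8</max_wus_in_progress>   <!-- in-progress jobs allowed per logical CPU -->
```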

That's perhaps not exactly what bcavnaugh had in mind.
17) Message boards : News : BHDB application (Message 2716)
Posted 13 Mar 2018 by xii5ku
Post:
Alessio Susi wrote:
I have 33 task completed but I can't upload the results. What's the problem?
I uploaded about 2000 tasks during the last few days. It went slowly and I had quite a few retries, but it generally worked for me. I have client versions 7.8.1, 7.8.3 and 7.8.4, each on Linux. I did not alter any HTTP-related cc_config.xml items, nor do I use a proxy.


Jim1348 wrote:
xii5ku wrote:
However, I have a nasty problem with the _9 series tasks from Sunday:

Their result files are ~20 MB per task. And with the number of processors that I have (somewhat above normal) on the one hand, and my internet connection (a pretty normal German cable modem connection) on the other, I can compute about 4 times more tasks than I can upload in the same time!
I wonder how that compares with the _5 (or other) tasks. The reason may simply be that the _9 are larger now. But if they were divided into smaller pieces, then the upload amount would still be the same in total. It could be just that the earlier ones failed more often that you did not see the problem earlier.
The _5 tasks are affected the same way; it's just that I had a lot more _9ers than _5ers.

I didn't watch the other batches closely.


mmonnin wrote:
xii5ku wrote:
mmonnin wrote:
they should have been aborted on the server side. I aborted quite a few others that were not batch 9.
These WUs are cancelled after 4 or more clients return a task of the respective WU with an error. (Would be good if Krzysztof cancelled them earlier.)
if only the server actually stopped sending them out after 4 errors.
https://universeathome.pl/universe/workunit.php?wuid=14676708
It's been sent out twice since the 4th error. One abort and in progress still.
Hmm, you are right. 4th error was reported March 12, 2:50:33 UTC, and 5th error at 3:19:48 UTC, yet there was still a new task sent at 3:20:01 UTC.
Am I interpreting the line "max # of error/total/success tasks | 4, 10, 2" incorrectly?

6 hours of processor time wasted on that one alone... :-(
18) Message boards : News : BHDB application (Message 2711)
Posted 12 Mar 2018 by xii5ku
Post:
mmonnin wrote:
Still some disk errors

Those are the _5. They are now up to _9, and work fine.


Missed that. Then they should have been aborted on the server side. I aborted quite a few others that were not batch 9.

Juergen Fricke wrote:
@Admin:

please stop sending us _5 or below tasks. nearly all of them end up in failure.
it is a waste of time and energy. for us and for you.

jf
These WUs are cancelled after 4 or more clients return a task of the respective WU with an error. (Would be good if Krzysztof cancelled them earlier.)

For fun, I downloaded a bunch of these _5 tasks yesterday (among a whole lot of _9 tasks), shut down the client, edited client_state.xml for a bigger <rsc_disk_bound>, restarted the client, and computed these tasks successfully.
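
For anyone curious, here is a rough sketch of the kind of edit I mean, in Python. The data directory path and the new bound are assumptions for illustration; only run something like this while the client is shut down, and keep a backup, since an ill-formed client_state.xml can invalidate all cached work:

```python
#!/usr/bin/env python3
# Rough sketch: raise <rsc_disk_bound> for every workunit in client_state.xml.
# Run ONLY while the BOINC client is stopped, and keep a backup of the file.
import re
import shutil

STATE = "/var/lib/boinc-client/client_state.xml"   # assumed Linux data directory
NEW_BOUND = 500_000_000                            # illustrative new bound in bytes

shutil.copy2(STATE, STATE + ".bak")                # safety copy

with open(STATE, "r", encoding="utf-8") as f:
    text = f.read()

# Replace the numeric value of every <rsc_disk_bound> element.
text = re.sub(r"<rsc_disk_bound>[^<]*</rsc_disk_bound>",
              f"<rsc_disk_bound>{NEW_BOUND:.6f}</rsc_disk_bound>",
              text)

with open(STATE, "w", encoding="utf-8") as f:
    f.write(text)
```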

But many or all of these tasks (of the ones that I uploaded so far) were then marked as "Completed, can't validate" on the server, with the WU status showing "Too many errors (may have bug)". I'm sure nobody else did the editing that I did, and I was probably not my own wingman in any of these tasks... :-/

However, I have a nasty problem with the _9 series tasks from Sunday:

Their result files are ~20 MB per task. And with the number of processors that I have (somewhat above normal) on the one hand, and my internet connection (a pretty normal German cable modem connection) on the other, I can compute about 4 times more tasks than I can upload in the same time!
19) Message boards : News : BHDB application (Message 2684)
Posted 11 Mar 2018 by xii5ku
Post:
Krzysztof 'krzyszp' Piszczek wrote:
Because saving and zipping the result files takes a moment, the Manager thinks that something is wrong with the application and kills it.
[...]
So, in app version 0.03 I have added a 2-second "sleep" command before the call to the boinc_finish() function to prevent this.

Instead of the sleep, isn't there a way to actually wait for the compression and the write to be finished? It should be possible if these are done in threads (or child processes) of the science application, but I don't know whether they are.
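
To illustrate the idea (in Python only for brevity; the science application is presumably C++ using the BOINC API, and I don't know how its zipping is actually implemented): if the compression runs as a child process, waiting on it directly makes a fixed-length sleep unnecessary.

```python
#!/usr/bin/env python3
# Illustration only: instead of sleeping a fixed 2 seconds and hoping the
# compression has finished, wait for the compression step itself to complete.
import subprocess

def finish_result(result_file, archive):
    # subprocess.run() blocks until the child process exits, so when it
    # returns we know the archive has been written (or we get an exception).
    subprocess.run(["zip", "-j", archive, result_file], check=True)
    # Only at this point would the equivalent of boinc_finish(0) be called.
```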
20) Message boards : Number crunching : extreme long wu's (Message 2487)
Posted 12 Nov 2017 by xii5ku
Post:
Jim1348 wrote:
After crunching for a week with no problems (setup as noted above with no suspensions or other projects running), I picked up four long runners in quick succession over a 24-hour period. They ran for about 17 hours and then got stuck.

But rather than aborting them, I rebooted and two of them got unstuck and completed normally.
[...]
However, the last two remain stuck. So after the first two completed, I rebooted again. Surprisingly, the last two then completed normally also.
[...]
So there are no "bad" work units, only something causing them to get stuck at random intervals ranging from days to about two months apart for me. That is strange. Maybe it is a hyper threading problem, or a limitation on the CPU cache that some of them hang? I don't know at this point.

It seems random to me. If I switch the option "Leave non-GPU tasks in memory while suspended" off, suspend tasks which get stuck, and a while later un-suspend them, I have seen any one of the following outcomes:

  • Resumed task finishes properly.
  • Resumed task gets stuck again after a while. Suspend and resume again once or twice, and it finishes properly.
  • Resumed task gets stuck again, even after retries. Abort it.


So far I have had a handful of these on Linux, and could finish them all after suspend + resume. I also had a handful on Windows, and could only finish one; I had to abort the others, since multiple rounds of suspend and resume did not get them any further.

Note that a stuck task does not suspend immediately after you request it to; it needs some time (to reach a checkpoint? or simply to poll for a signal?).

Furthermore, I have not yet tried to suspend a stuck task with the option "Leave non-GPU tasks in memory while suspended" switched on. My guess is this won't help repair a stuck task.
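
If anyone wants to script the suspend/resume workaround instead of clicking through the Manager, boinccmd has a per-task operation. A minimal sketch; the project URL and task name are placeholders, the 60-second wait is arbitrary, and it assumes boinccmd can reach the local client:

```python
#!/usr/bin/env python3
# Sketch: suspend a stuck task, wait a while, then resume it via boinccmd.
import subprocess
import time

PROJECT_URL = "https://universeathome.pl/universe/"   # placeholder project URL
TASK_NAME = "name_of_the_stuck_task"                  # placeholder task name

def task_op(op):
    subprocess.run(["boinccmd", "--task", PROJECT_URL, TASK_NAME, op], check=True)

task_op("suspend")
time.sleep(60)      # "a while later" from the post; the interval is arbitrary
task_op("resume")
```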







Copyright © 2024 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek