Message boards :
Number crunching :
extreme long wu's
Author | Message |
---|---|
Joined: 4 Feb 15 Posts: 17 Credit: 158,222,691 RAC: 0 |
Quick, ugly script to abort tasks with an enormous run time: #!/bin/bash MAX_TIME="20240" ########## RM="/bin/rm" GREP="/bin/grep" AWK="/usr/bin/awk" TAIL="/usr/bin/tail" DATE="/bin/date" TMP="/tmp/task.tmp" BCMD="/usr/bin/boinccmd $@" if [ ! -r ${TMP} ];then ${BCMD} --get_tasks > ${TMP} fi for TASK in `${GREP} "^ name: " ${TMP}|${AWK} -F ": " '{print $2}'`;do URL=`${GREP} -A2 ${TASK} ${TMP}|${TAIL} -n1|${AWK} -F ": " '{print $2}'` if [ "${URL}" == "http://universeathome.pl/universe/" ];then TIME=`${GREP} -A15 ${TASK} ${TMP}|${TAIL} -n1|${AWK} -F ": " '{print $2}'|${AWK} -F "." '{print $1}'` if [ ${TIME} -ge ${MAX_TIME} ];then echo -e "`${DATE} "+%Y.%m.%d %H:%M:%S"`\tName:\t${TASK}\tTime:\t${TIME}" ${BCMD} --task ${URL} ${TASK} abort fi fi done if [ -r ${TMP} ];then ${RM} -f ${TMP} fi exit 0 It works on Linux, but because it uses remote RPC it can connect over the network to any client with RPC enabled: ./script.sh --host 192.168.x.y --passwd password MAX_TIME is in seconds and should be tuned to match the host's speed. Run it from cron, for example every 5 minutes, and voilà. |
Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
Agree with Jacob: there are people who would help you if they knew what the REAL situation was. You did say, for example, that you had a fix but couldn't compile it. |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
"If there is anything you'd like me to test or try, tell me what to do and I'll do it. I want it solved, and am willing to try things for you." Unfortunately, I can't, as I can't send out the source code. The problem is that the code compiles correctly without the BOINC libraries, but not with them. What makes the situation strange is that the BOINC code in the sources hasn't changed for almost a year, so it is well tested by us (all of us). We run checks nearly every day trying to find the faulty part, but this goes part by part through more than 20k lines of source code. With the current app, I can't identify the problem, as it occurs only on some computers under very rare conditions and doesn't reproduce on my development computers (two of them). Krzysztof 'krzyszp' Piszczek Member of Radioactive@Home team My Patreon profile Universe@Home on YT |
Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
If the routines in the libraries have not changed, then it comes down to the way you call the routines and the parameters you pass them (type, number, etc.). You say "compile", so presumably it is not a linking problem. You will understand that, without the code or the specific error messages from the compiler, it is not possible to be specific. Have you changed versions of something somewhere? Some of our embedded systems had over a million lines of code. |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
"If the routines in the libraries have not changed, then it comes down to the way you call the routines and the parameters you pass them (type, number, etc.). You say 'compile', so presumably it is not a linking problem. You will understand that, without the code or the specific error messages from the compiler, it is not possible to be specific. Have you changed versions of something somewhere?" In fact, the "compilation problem" is at the linking stage. The linker just says that some of the statics are declared twice, which is not true. Edit: The environment hasn't changed. The machines are still based on Debian 6 and have been in the same state for two years. |
Joined: 13 May 15 Posts: 87 Credit: 4,320,738 RAC: 0 |
krzyszp, can't you do something to stop the bleeding, even?? Here, stick your finger right here on your neck and hold it. Hold right there. (Pulls out a bunch of gauze tape and wraps your finger and hand to your neck tightly.) There, bleeding stopped. What do you mean you're losing feeling in your hand? You were bleeding out before and now you're worried about a little loss of sensation in your hand?? Geesh, people have no sense of priorities. |
Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
Hi. You are posting on a forum where computer enthusiasts care passionately about their usage of computer resources. A bug like this can result in large numbers of computers wasting their resources, and wasting the environment's resources, for no gain, in an indefinite loop with no way to exit the problem (tasks run indefinitely). If you're not here to help, then don't post. I'm offering my help, despite my disagreement with the admins' response thus far. |
Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
Are you using statics in your program? |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
"Are you using statics in your program?" Yes, loads of them. |
Joined: 4 Mar 17 Posts: 1 Credit: 29,980,900 RAC: 0 |
I have all the symptoms described by other members above. The elapsed time increases, but so does the remaining estimate. From an original estimate of slightly over 2 hours, I get an elapsed time of 16 hours and over 1 hour remaining, and it keeps growing. Restarting the task, the project, or BOINC gets these numbers back to something more reasonable (the last checkpoint), from where they keep growing again. I did not record file names, but these tasks do get stuck, with no way of finishing them. I'm not a scientist, and I wouldn't run a scientific project. When I hear that there are errors during compilation or linking and the app is still distributed (did I get it right?) ..... c'mon, get a software engineer to help out. I have wasted enough time on these tasks.... Please either fix it or let me know how I can help. Until then, I'm donating my CPU where it is going to do actual work. No hard feelings. |
Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
krzyszp: Would it help if I actively tried to get the problem to happen on a task on my machine(s)? Then what is the next step? Do you have debug output we can get to? Do we copy the files locally to run them outside of BOINC, to see if it is still reproducible on my machine? Come on ---- Let's SOLVE this already! :) |
Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
On the info available I am forced to be vague, but are your own and BOINC's statics "names" being hashed, and you are getting a hash clash? Hell this is frustrating. |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
Yes, you can run it outside BOINC: simply copy the executable file and the param.in file to another folder and start it from the command line. This is how I checked it. Sometimes I think that it MAY BE a BOINC API problem and/or a specific combination of software/hardware/drivers on a particular machine. We will see with the new app version, but I'm still waiting for the scientists to check some parts of the code. |
Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
When running a new task, what is the first indication that we've encountered the problem in this thread? |
Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
"When running a new task, what is the first indication that we've encountered the problem in this thread?" I think the best clue is when the progress percentage does not increase, and the time remaining does increase. For me, anything over about twelve hours falls into that category thus far, but I would not count on that, since I just completed a couple of tasks that took almost eight hours, while they normally take less than three hours on this machine. http://universeathome.pl/universe/result.php?resultid=21185473 http://universeathome.pl/universe/result.php?resultid=21185592 |
Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
>>> Sometimes I think that it MAY BE a BOINC API problem and/or a specific combination of software/hardware/drivers on a particular machine. At this project only... |
Joined: 23 Mar 16 Posts: 96 Credit: 23,431,842 RAC: 0 |
Just to redress the balance, I have a Raspberry Pi that has been flawlessly running tasks throughout this "long WU" period. Some tasks have taken quite a bit longer than others, but they complete. I would be disappointed if the project was put on hold until the problem was fixed just because some computers were experiencing difficulties. I found I had problems crunching climateprediction.net tasks on my work PC (they almost always crashed), so I simply stopped crunching them, but I can see from my negative progress in the climateprediction.net rankings that others are crunching successfully. Every so often I go back and try again. |
Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
"I found I had problems crunching climateprediction.net tasks on my work PC (they almost always crashed), so I simply stopped crunching them, but I can see from my negative progress in the climateprediction.net rankings that others are crunching successfully. Every so often I go back and try again." That makes the point that each project is different and may have subtle hardware requirements to succeed. I found that I could do well on CPDN by using a write-cache, at least on the older projects; it seems less critical now with the new WAH2. But apparently write-contention was causing a lot of problems and people didn't realize the cause (I have posted on it). The magic bullet on Universe is not clear yet. Ideally, it would work for everyone. I am currently investigating the idea that running other projects at the same time may sometimes trigger the long-runners. Such interactions are not impossible. There is the case now on ATLAS that running them by themselves works, while running them with other projects at the same time results in validation errors, and I can think of other cases. So I am running Universe by itself for a few weeks to see what happens. |
Joined: 4 Feb 15 Posts: 847 Credit: 144,180,465 RAC: 0 |
>>> Sometimes I think that it MAY BE a BOINC API problem and/or a specific combination of software/hardware/drivers on a particular machine. Believe me, no. In my few years (actually, about 17 years) with projects, it has happened quite often. Some time ago we had an application version where most AMD CPUs crashed... without any serious reason, until... a Windows update fixed some things in the Windows drivers. That was in the test phase of the project, two years (or so) ago... |
Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
>>> I would be disappointed if the project was put on hold until the problem was fixed just because some computers were experiencing difficulties. I wonder how you would feel if it was one of your systems that was having its resources stolen. What is more, he KNOWS the problem exists, yet continues to send the work units with it included. Help has been offered, and rejected. You say you would be disappointed, I AM disappointed. |
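
A note on the abort script in the first post on this page: it finds each task's project URL and run time by counting lines after the task name (`grep -A2`, `grep -A15`), so it can pick up the wrong field if a client prints a different number of lines per task. A minimal sketch that keys on the field labels instead is below. It is not the poster's script, only an illustration. The helper name `find_long_tasks` is hypothetical, and the labels `name`, `project URL` and `current CPU time` are assumptions about what `boinccmd --get_tasks` prints; check them against your own client's output and pass a different time label as the third argument if needed.

```shell
#!/bin/sh
# Sketch only: print tasks of one project whose time field exceeds a limit.
# Reads `boinccmd --get_tasks`-style text on stdin.
# ASSUMPTION: each task is printed as "   label: value" lines, starting with
# a "name:" line and including "project URL:" and a time line.

find_long_tasks() {
    # $1 = project URL, $2 = limit in seconds,
    # $3 = label of the time field (assumed default: "current CPU time")
    awk -v url="$1" -v max="$2" -v label="${3:-current CPU time}" '
        {
            # Strip indentation, then split on the first ": " only.
            line = $0
            sub(/^[ \t]+/, "", line)
            sep = index(line, ": ")
            if (sep == 0) next          # separator or header line, skip it
            key = substr(line, 1, sep - 1)
            val = substr(line, sep + 2)
        }
        key == "name"        { task = val; proj = "" }  # new task record
        key == "project URL" { proj = val }
        key == label && proj == url && task != "" && val + 0 >= max {
            # Output: task name, TAB, whole seconds
            printf "%s\t%d\n", task, val
        }
    '
}

# Possible use (commented out so nothing is aborted by accident):
#   URL="http://universeathome.pl/universe/"
#   boinccmd --get_tasks | find_long_tasks "$URL" 20240 |
#   while read -r task secs; do
#       boinccmd --task "$URL" "$task" abort
#   done
```

As in the original post, the limit is in seconds and should be tuned to the host's speed. Whether CPU time or elapsed time is the better trigger depends on the client version and on how the stuck tasks behave, so it is worth printing the matches for a while before enabling the abort step.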