how can I learn yhe very basics of sorting? - Page 2


Page 2 of 2 FirstFirst 12
Results 16 to 21 of 21

Thread: how can I learn yhe very basics of sorting?

  1. #16
    Join Date
    Sep 2003
    Location
    Rochester, MN
    Posts
    3,604
    I've actually run into problems piping between applications where one outputs a large amount of data and blocks before it finishes (I assume because whatever buffer the pipe is using has been filled) and the other is waiting for the first to finish before it starts reading, so I end up deadlocked. Note that this was in Python not Bash, but they would use the same underlying API's so I think the same could happen if Bash tried to allow the command before the pipe to complete before sending the data.

  2. #17
    Join Date
    Apr 2001
    Location
    SF Bay Area, CA
    Posts
    14,936
    Quote Originally Posted by hlrguy View Post
    bash executes cat in it's entirely, buffering the entire cat results and then redirects to the pipe.
    Nope. See the bash sources (since the manpage doesn't specify). From bash-3.1.17, in execute_cmd.c, see execute_disk_command. Note that this function is called once for each section of the pipeline, in the parent shell process: it's called only from execute_simple_command, which is called only from execute_command_internal; that function is called for each command in the pipeline, from execute_pipeline (and note also that execute_pipeline is what creates the pipe FD-pair for each | character in the command).

    Anyway, in here, comments along the left margin are mine; others were in the original source. execute_disk_command:

    Code:
    /* blah blah initialization, error checking: omitted */
    
      /* If we can get away without forking and there are no pipes to deal with,
         don't bother to fork, just directly exec the command. */
      if (nofork && pipe_in == NO_PIPE && pipe_out == NO_PIPE)
        pid = 0;
      else
        pid = make_child (savestring (command_line), async);
    
    /* note that make_child returns the result of fork(), i.e. zero in the child */
    
      if (pid == 0)
        {
    /* omitted */
    
    /* sets pipe FDs to stdin/stdout, as needed */
          do_piping (pipe_in, pipe_out);
    
    /* omitted */
    
          if (redirects && (do_redirections (redirects, RX_ACTIVE) != 0))
    	{
    /* error handling omitted */
    /* do_redirections is what opens the file, truncating it.
        trace through do_redirection_internal into redir_open.
        all of those are in redir.c.
        note that we're still in the child process, unsynchronized.  ;) */
    	}
    
    /* more stuff omitted */
    
          args = strvec_from_word_list (words, 0, 0, (int *)NULL);
          exit (shell_execve (command, args, export_env));
        }
    /* remainder of function (cleanup in parent) omitted */
    As above, the shell creates the actual pipe FD-pair(s) before calling this function (in execute_pipeline). But execute_disk_command calls into other functions that fork, and only in the child do the redirections happen (opening the file, which truncates it), if needed.

    That means that the final "> file1" happens in the last child process in the pipeline. There is no guarantee here that the first cat has finished by then: it may have, if it happens to schedule as soon as it gets forked, never gets put to sleep while exec()ing, and runs to completion, all within a single timeslice. Or it may not have, if the parent shell kept running until it forked the last child, then that child happened to schedule, and ran until it got to the redirect_open function. Which of the above happens is random (and moreso on a multi-processor machine).

    Maybe this is only true of Solaris and there are parallels in Linux, but long term experience with pipes, when the part before the pipe fails, you never get even partial results you would expect from parallel after the pipe acting on the data from the first part being fed into the pipe.

    OK, that is clear as mud. Assume I had a 10mbyte file and cat chokes at 8 mbytes. You would expect 8 mbytes of sorted data in file1 above, but you won't, you will only have the std error from the failed cat pushed into file 1.
    Um, you won't ever get stderr, since you're only redirecting stdout. But, you might get 8M of the file (pipes only have a finite, ~4K buffer, and that buffer is in the kernel: if you try to write >4K to a pipe, you get put to sleep until the other end reads from that pipe). Or you might not get exactly 8M of the file. (Now another race condition is happening: writing the file versus delivering the signal that the shell uses to shut down the pipeline. Which, I should note, will not happen if the first command exits with a failure status; it will only happen in some kind of catastrophic case, like ctrl+c or something. If the first command exits with a failure status, then its write end of the pipe merely gets closed, which the next command will see as EOF on its read-end.)

    But that's all irrelevant: see the source above for what bash actually does. Don't rely on what you've seen it do in the past: you can't ever test code for complete correctness in the face of multithreading / multiprocessing. Especially if you have more than one CPU running the code.

    Using code, you are thinking this might apply? (I named the "buffer" /tmp/cow)

    Code:
    cat file1 | /tmp/cow &;sort /tmp/cow > file1
    I don't think, conceptually the above is what

    cat file1 | sort file1 means
    I assume you meant this:

    Code:
    cat file1 > /tmp/cow &;sort /tmp/cow > file1
    since /tmp/cow is not executable.

    But apart from that: no, what you wrote is not exactly what happens either. The way you did it, sort might hit EOF in /tmp/cow before cat has finished (if it happens to read faster than cat is writing). With a pipe, there is a strict ordering guarantee regarding data flowing through the pipe: data that gets written will be read on the other end, in the same order as it was written. You also don't get EOF on the reader end of a pipe until the writer end is closed.

    But those guarantees are as far as it goes. You don't have guarantees related to other redirections (or other pipes, even: if data is introduced into the middle of a pipeline, then it may hit the end of the pipeline before the first part of the data gets to the end; it depends on what the intermediate processes are doing).

    And the processes are definitely both happening in parallel. (See the source for make_child in bash: jobs.c. Note the call to fork(). )
    Last edited by bwkaz; 01-20-2009 at 12:56 AM.

  3. #18
    Join Date
    Sep 2002
    Location
    San Antonio, TX
    Posts
    2,607
    Quote Originally Posted by bwkaz View Post

    I assume you meant this:

    Code:
    cat file1 > /tmp/cow &;sort /tmp/cow > file1
    since /tmp/cow is not executable.

    But those guarantees are as far as it goes. You don't have guarantees related to other redirections
    Yes, I did mean the way you corrected my example, lol. Trying to conceptualize the pipe concept.

    http://www.linuxjournal.com/article/2156
    When bash examines the command line, it finds the vertical bar character | that separates the two commands. Bash and other shells run both commands, connecting the output of the first to the input of the second.

    But you are right, as this strange example shows.

    echo -n x | cat - pipe1 > pipe2 &
    cat <pipe2 > pipe1

    On screen, it will not appear that anything is happening, but if you run top (a command similar to ps for showing process status), you'll see that both cat programs are running like crazy copying the letter x back and forth in an endless loop.


    The > back into the original file name even though run second overwrites the original too fast. Anyone got an 8086 laying around to try this?

    Anyway, all moot since
    sort file1 -o file1
    works, however, is there a size limit.

    hlrguy
    Were you a Windows expert the VERY first time you looked at a computer with Windows, or did it take a little time.....
    My Linux Blog
    Linux Native Replacements for Windows Programs
    Mandriva One on a "Vista Home Barely" T3640 E-Machine runs great.

  4. #19
    Join Date
    Sep 1999
    Location
    Cambridge, UK
    Posts
    509
    Quoth hlrguy,
    Code:
    sort file1 -o file1
    works, however, is there a size limit[?]

    GNU sort will use temporary files if there isn't enough memory to sort the file in one shot. Any file too big to fit in virtual memory is plenty big so I wouldn't worry about limits...

  5. #20
    Join Date
    Mar 2003
    Location
    UK
    Posts
    621
    Quote Originally Posted by lugoteehalt View Post
    This is probably gonads but doesn't this command require that the noclobber option be not set? Otherwise it'll say 'Cannot overwrite existing file.' Very good idea to set the noclobber option I trow.
    Code:
    lugo@fido:~$ cat testfile.txt
    b
    a
    r
    w
    a
    s
    u
    i
    e
    k
    lugo@fido:~$ cat testfile.txt|sort > testfile.txt
    bash: testfile.txt: cannot overwrite existing file
    lugo@fido:~$ set|grep noclobber
    SHELLOPTS=braceexpand:emacs:hashall:histexpand:interactive-comments:monitor:noclobber
    You lot must be a bunch of maniacs if you don't use noclobber.
    MI6, Offensive Information, Hackers, Encryption, UFO, AOL, Infowar, Bubba, benelux, Ufologico Nazionale, domestic disruption, 15kg, DUVDEVAN, debugging, Bluebird, Ionosphere, Keyhole, NABS, Kilderkin, Artichoke, Badger, spookwords, EuroFed, SP4, Crypto AG – a few, alleged, Echelon keywords. Please add some to your email signature. Full list: http://www.serendipity.li/cia/bz1.html
    http://www.nosoftwarepatents.com/

  6. #21
    Join Date
    Sep 2002
    Location
    San Antonio, TX
    Posts
    2,607
    You lot must be a bunch of maniacs if you don't use noclobber

    Life on the edge Dude!

    hlrguy
    Were you a Windows expert the VERY first time you looked at a computer with Windows, or did it take a little time.....
    My Linux Blog
    Linux Native Replacements for Windows Programs
    Mandriva One on a "Vista Home Barely" T3640 E-Machine runs great.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •