use sed to place "hard returns" in a massive (one-line!) text file?



  1. #1
    Join Date
    May 2001
    Location
    Manchester, UK
    Posts
    382

    use sed to place "hard returns" in a massive (one-line!) text file?

    Hi people,

    long time no log-in. Any ideas on this?:

    I have a long list of e-mail addresses which I want to use for a mass-subscription to a Mailman mailing list. Don't worry, I'm not spamming or anything, just streamlining a previously unwieldy manual mailing-list, for an NGO in Peru.

    The curious problem is that all these e-mail addresses are arranged on one line in a text file, having been prepared in Microsoft Word (shudder!). Each address is separated from the next by one comma and one space. However, Mailman requires one address per line, each line ending with a "hard return".

    I know only enough sed to do this:
    Code:
    sed 's/, /\n/g' mailing-list.txt
    Which basically does the job, replacing each occurrence of ", " with "\n" (a hard return).

    BUT...

    some of the addresses are of the form:

    Code:
    Smith, John D <blah@blah.org>
    Is there a way to NOT remove the comma when it's NOT separating two e-mail addresses, merely separating last names from first names?

    Yours hopefully,

    Andrew.

  2. #2
    Join Date
    Apr 2001
    Location
    SF Bay Area, CA
    Posts
    14,947
    How is sed supposed to know when a ", " sequence is separating two email addresses versus when it's separating a person's last and first names? It's not like sed is clairvoyant...

    Unless all the email addresses end in a > character? Then it may work to match ">, " and replace it with ">\n". But if the only thing you can go on is the ", " sequence, then I'm pretty sure the lack of clairvoyance is going to sink the effort. Maybe you should get whoever to redo the list.
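
    A minimal sketch of that idea, assuming GNU sed (which accepts "\n" in the replacement) and assuming every address really is wrapped in <...>. The sample line and the function name split_on_bracket are made up for illustration:

```shell
# Hypothetical sample data: names containing commas, addresses in angle brackets.
sample='Smith, John D <blah@blah.org>, Jones, Ann <ann@example.org>, <bob@example.net>'

# Break the line only where a closing ">" is followed by ", ".
# The comma inside "Smith, John D" is never preceded by ">", so it survives.
split_on_bracket() {
    sed 's/>, />\n/g'
}

printf '%s\n' "$sample" | split_on_bracket
```

    If some addresses in the list are bare (no angle brackets), this will leave them glued to whatever follows, so it only works when the ">" is always there.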

  3. #3
    Join Date
    Nov 2002
    Posts
    205
    I don't see why you can't just match ".org, " and replace it with ".org\n", and the same with the other top-level domains. You could do it with one long regexp match to grab them all, but I don't know if it's worth the trouble to make it a little more elegant.
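
    Something like this might do it in one pass. It is only a sketch: it assumes GNU sed (for -E and "\n" in the replacement), the list of domains is a guess that would need extending to cover whatever is really in the file, and the sample line and the name split_on_tld are invented for the example:

```shell
# Hypothetical sample data: bare addresses, one preceded by a "Last, First" name.
sample='Smith, John D blah@blah.org, ann@example.com, bob@example.net'

# Break the line wherever a known top-level domain is followed by ", ".
# \1 puts the matched ".org" / ".com" / ".net" back before the newline.
split_on_tld() {
    sed -E 's/(\.(org|com|net)), /\1\n/g'
}

printf '%s\n' "$sample" | split_on_tld
```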
    but really, i dont know what im talking about.

  4. #4
    Join Date
    Feb 2004
    Location
    austin, tx
    Posts
    145
    that may work as well, but matching based on ">", as bwkaz suggested, seems to be the simplest solution. 1 pass and he's done
    Roses are red, violets are blue. All my base, are belong to you.

  5. #5
    Join Date
    Nov 2002
    Posts
    205
    Agreed, I was thinking that perhaps he placed the <> on the email addresses for illustrative purposes.
    but really, i dont know what im talking about.

  6. #6
    Join Date
    May 2003
    Location
    San Diego, CA
    Posts
    140
    Just threw this together:

    Code:
    (\w+\.)*\w+@(\w+\.)*\w+\.\w\w\w
    You can use vim to do a pcre substitution to place something unique after each email address:
    Code:
    :perldo s/(\w+\.)*\w+@(\w+\.)*\w+\.\w\w\w/$&*|*/g
    should place "*|*" after each email address.

    Then you can use sed to replace each "*|*" with a hard return.
    Code:
    sed 's/\*|\*/\n/g' textfile.txt
    Of course you'd probably want to strip away all the other junk like "Smith, John <" and ">".
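
    If the bare addresses are all you need, a rough alternative is to skip the marker step and let grep print only the matching parts. This assumes a grep that supports -o (GNU and BSD grep do; it is not in POSIX). The pattern below is my own loose guess at what an address looks like, not a full validator, and the sample line and the name extract_addresses are made up. Unlike the three-letter "\w\w\w" ending above, it also accepts two-letter and longer endings such as ".pe" or ".info":

```shell
# Hypothetical sample data mixing "Name <address>" entries and bare addresses.
sample='Smith, John D <blah@blah.org>, ann@example.com, Bob <bob@mail.example.com.pe>'

# -o prints each match on its own line; -E enables extended regexps.
# Local part, "@", dotted host name, then a final label of 2+ letters.
extract_addresses() {
    grep -oE '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}'
}

printf '%s\n' "$sample" | extract_addresses
```

    That gives one address per line with the names and angle brackets already gone, which is the shape Mailman was said to want.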
    Last edited by nabetse; 09-01-2007 at 01:07 PM. Reason: typo

  7. #7
    Join Date
    May 2001
    Location
    Manchester, UK
    Posts
    382
    "sed is not clairvoyant"! How true...

    I have two lines of enquiry now:


    1) do what nabetse said and concentrate on stripping out all the cruft ( "Smith, John S" and <>).

    2) try and re-attach the real names that I've split apart using sed.


    once I've done the simple ", " for "\n" replacement, a grep for all lines containing "@" shows that only 140 addresses have this problem:
    Code:
    veg@purplemonster:~$ sed 's/, /\n/'g MailingListOriginal > FirstPass
    veg@purplemonster:~$ wc -l MailingListOriginal
    2 MailingListOriginal
    veg@purplemonster:~$ wc -l FirstPass
    2966 FirstPass
    veg@purplemonster:~$ grep @ FirstPass | wc -l
    2826

    Hmm. I'm going with option 2 right now, as it doesn't involve much mucking around with the lines which actually contain the useful addresses.

    I think I'm going to try finding all lines with no "@" sign and concatenating each one with its following line. If anyone knows how to do this, you may save me much man-page hell over the next hour or two! Thanks for the help so far, guys!
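
    For reference, the "join every line that has no @ onto the next one" idea can be sketched in sed like this. It is untested against the real list; the sample input and the function name rejoin_names are invented, and it assumes a split-off name never contains an "@" itself:

```shell
# Re-attach name fragments that the ", " split pushed onto their own lines.
# :a         label to loop back to
# /@/!{...}  only act on a pattern space that has no "@" yet
# N          append the next input line (separated by an embedded newline)
# s/\n/, /   turn that newline back into the original ", " separator
# ba         loop, in case the joined text still has no "@"
rejoin_names() {
    sed -e ':a' -e '/@/!{' -e 'N' -e 's/\n/, /' -e 'ba' -e '}'
}

# Hypothetical first-pass output: "Smith" was split away from its address.
printf 'Smith\nJohn D <blah@blah.org>\nann@example.com\n' | rejoin_names
```

    With the made-up input above, "Smith" and the line after it come back together as "Smith, John D <blah@blah.org>", and lines that already contain an address pass through unchanged.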

  8. #8
    Join Date
    Dec 1999
    Location
    Fargo, ND
    Posts
    1,817
    I'd think that opening it up in OpenOffice and then saving it as a text file might work as well.
    Knute

    You live, you die, enjoy the interval!

  9. #9
    Join Date
    Nov 2003
    Location
    Phoenix, AZ, USA
    Posts
    287
    what knute said, except:

    do a "replace all", putting '\n' in place of each separator, and each of the names will have a hard return afterwards.

    THEN save it as text file.
    BEHOLD!!! MY AWESOME HUMILITY!
    Ex Linux, Scientia

    i use:
    centos 5.2 on 3.0 GHz Pentium 4 (filer/print server)
    ubuntu 8.10 on 1.6 GHz Celeron M (personal laptop)

  10. #10
    Join Date
    Apr 2006
    Posts
    32
    Show a sample of what your email file looks like.
