(Solved) Need a smart sed, grep or awk command

thelonelycoder

I'm at a loss; Google only gets me so far with examples.

Situation:
The Diversion blocking list maps up to 25 domains to an IP, like this example with 3 domains:
Code:
172.18.0.2 hit.123c.vn iclickyou.com icloud.com-locations.in
Note that there is a space before and after each domain, except the last, which makes it even more tricky.

The regular whitelisting during download of the hosts files works as at that time the file is in one-domain-per-line format. For later whitelisting I need a way of removing explicit domains from the file. Only exact matches may be removed and the integrity of the file must remain intact.

This would be an example whitelist which can be 20 or more lines long:
Code:
icloud.com-locations.in
icloud.com
I want to make this fast, so a "while read line" way is out of the question.
grep -v (select non-matching lines) does not work either: it deletes the whole line, which is what I do with the one-domain-per-line file.

I have an excellent sed command, but with icloud.com in the whitelist it would also remove the matching part of icloud.com-locations.in.
I tried adding a space before the domain, like " icloud.com" or the typical word boundary tricks but nothing works to my full satisfaction.

The sed command generates a second sed script from the whitelist and then applies that script to file, like this:
Code:
sed -i "$(sed 's:.*:s/&//g:' whitelist)" file
Any help would be appreciated.
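
To make the problem concrete, here is a minimal sketch that runs the generator above against the sample line. The temporary directory and the one-line whitelist are assumptions for illustration, and the result is printed to stdout instead of being written back with -i.

```shell
# Sample data only: file names mirror the post, the temp directory is made up.
tmp="$(mktemp -d)"
printf '%s\n' '172.18.0.2 hit.123c.vn iclickyou.com icloud.com-locations.in' >"$tmp/file"
printf '%s\n' 'icloud.com' >"$tmp/whitelist"

# The inner sed wraps every whitelist line in a substitute command,
# so the generated script here is the single line: s/icloud.com//g
script="$(sed 's:.*:s/&//g:' "$tmp/whitelist")"

# Apply the generated script to the blocking list.
naive_result="$(sed "$script" "$tmp/file")"
printf '%s\n' "$naive_result"

rm -rf "$tmp"
```

The last field comes out as "-locations.in": the whitelist entry matched as a substring of a longer domain, which is exactly the false positive described above.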
 
How about something like this?
Code:
 sed -Ei 's,[[:space:]]icloud.com[[:space:]]?|$,,' blockinglist

EDIT: No, I see the deficiencies.

Or
Code:
 sed -Ei 's,[[:space:]]icloud.com([[:space:]]+|$),,' blockinglist

EDIT: I keep flailing at it, just to keep the conversation going.
 
Does busybox sed understand [[:space:]] ?
Yes

But you all misunderstood, I'm not looking for one or a couple of patterns. I need to run the command with input from file whitelist.

I think I have it, with the modification that the last domain in the line also gets a space after.
Then I make sure that the (temporary) whitelist also contains spaces, like " icloud.com "
And then the slightly modified blazing fast sed works, I simply add one space when a match is removed:
Code:
sed -i "$(sed 's:.*:s/&/ /g:' whitelist)" file

More testing follows, but I believe the case is solved. Thanks for offering ideas.
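
A small sketch of why the space-delimited variant only removes exact matches. It assumes the blocking list line already ends with a space and the temporary whitelist entry is padded with spaces, as described above. The sample data and the temp directory are made up.

```shell
tmp="$(mktemp -d)"
# Blocking list line in the changed format: a space after the last domain too.
printf '%s\n' '172.18.0.2 hit.123c.vn icloud.com icloud.com-locations.in ' >"$tmp/file"
# Temporary whitelist entry with a space before and after the domain.
printf '%s\n' ' icloud.com ' >"$tmp/whitelist"

# Generated script: s/ icloud.com / /g
script="$(sed 's:.*:s/&/ /g:' "$tmp/whitelist")"

# Only the stand-alone " icloud.com " is replaced by a single space.
spaced_result="$(sed "$script" "$tmp/file")"
printf '[%s]\n' "$spaced_result"

rm -rf "$tmp"
```

Because the pattern must end in a space, "icloud.com-locations.in" no longer matches: the character after "icloud.com" there is a hyphen.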
 
My best ideas come from rejecting other people's ideas. :D
 
Well, another day spent on a small but to me significant problem.
A coder's hell: 30 tabs open in Firefox and none of them help.
 
You pulled a Captain Kirk Kobayashi Maru by changing the rules to allow a space at the end of the last host on a line. :)
It’s a small change for the blocking file but a huge improvement for the el functions.
 
Yup, the posted solution is it, it passed all tests I usually run.
 
It might be better to put the whitelist into a "|" separated string and use "sed -Ei 's/ (regexstring)( |$)/\2/g' blockinglist". The reason being you'll need to escape the entries anyway, since "." matches any character (a.c.com = abc.com).

Usually I'd test before posting code, but I'll leave that up to you. You can use [[:space:]]+ if you want, but since they're files you generated they should just have a single space.
Code:
# Escape, separate, and remove newlines
REGEX="$(sed -e 's/[])}|\/$*+?.^&{([]/\\&/g' -e '$!s/$/|/' whitelist | tr -d '\n')"
# Remove matching words, delete lines with no hosts eg "0.0.0.0"
sed -Ei -e 's/ ('"$REGEX"')( |$)/\2/g' -e '/^[^ ]*$/d' blockinglist

# As one line
sed -Ei -e 's/ ('"$(sed -e 's/[])}|\/$*+?.^&{([]/\\&/g' -e '$!s/$/|/' whitelist | tr -d '\n')"')( |$)/\2/g' -e '/^[^ ]*$/d' blockinglist

Edit: '/ /!d' might be better than '/^[^ ]*$/d'
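
For anyone who wants to try the alternation approach before wiring it in, here is a hedged walk-through on made-up data. It reuses the two sed expressions from the post unchanged; the temp directory, the sample hosts and printing to stdout instead of editing in place are assumptions.

```shell
tmp="$(mktemp -d)"
# Made-up blocking list: no trailing space needed with the ( |$) form.
printf '%s\n' \
  '172.18.0.2 hit.123c.vn icloud.com icloud.com-locations.in' \
  '172.18.0.3 abc.com' \
  '172.18.0.4 a.c.com' >"$tmp/blockinglist"
# Plain whitelist, one domain per line, no padding.
printf '%s\n' 'icloud.com' 'a.c.com' >"$tmp/whitelist"

# Escape regex metacharacters, join the lines with "|", drop the newlines.
# Result for this whitelist: icloud\.com|a\.c\.com
REGEX="$(sed -e 's/[])}|\/$*+?.^&{([]/\\&/g' -e '$!s/$/|/' "$tmp/whitelist" | tr -d '\n')"

# Remove exact matches, then delete lines that are left with no host.
alt_result="$(sed -E -e 's/ ('"$REGEX"')( |$)/\2/g' -e '/^[^ ]*$/d' "$tmp/blockinglist")"
printf '%s\n' "$alt_result"

rm -rf "$tmp"
```

In this run "abc.com" survives because the dots are escaped, and the line that only held "a.c.com" is dropped entirely instead of being left as a bare IP.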
 
The code is only used in the el function for the whitelist when one processes them. The temp whitelist is assembled from several files: the hard-coded whitelist, the shared whitelist from Skynet and the user's edited whitelist. I add a space before and after the domain name.
Then run the posted command. It works and is still readable without a headache.
The final code is in functions.div in today's last update I pushed.
 
I think you still need to escape "." characters, otherwise whitelisting "a.c.com" would unblock "abc.com". And if the whitelists are user editable they might have other special sed characters which will break things entirely.

There's also a problem if the only host on a line is removed: unless you do more processing later on to fix it, the line will be left as just an IP.

Finally, the trailing space could be done per line instead of changing the host format (if you choose not to use "( |$)").
Code:
sed -i -e 's/$/ /' -e "$(...)" -e 's/ $//' file
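
A sketch of how that per-line variant could look in practice. It assumes the elided "$(...)" stands for the whitelist-to-script generator posted earlier in the thread, which is a guess on my part. The sample data and temp directory are made up, and the output goes to stdout instead of -i.

```shell
tmp="$(mktemp -d)"
# Original host format: no space after the last domain.
printf '%s\n' '172.18.0.2 hit.123c.vn icloud.com-locations.in icloud.com' >"$tmp/file"
# Padded whitelist entries, one per line.
printf '%s\n' ' icloud.com ' ' hit.123c.vn ' >"$tmp/whitelist"

# Assumed content of the elided command substitution: one s command per entry.
script="$(sed 's:.*:s/&/ /g:' "$tmp/whitelist")"

# 1) append a space to every line, 2) remove the matches, 3) strip the space again.
perline_result="$(sed -e 's/$/ /' -e "$script" -e 's/ $//' "$tmp/file")"
printf '[%s]\n' "$perline_result"

rm -rf "$tmp"
```

The file on disk keeps its original format this way; the trailing space only exists while sed works on the line.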
 
Since we are sharing: I occasionally slip a pipe with this command into my generators, because now and then a domain or two slips through with a non-real-world symbol.

Code:
tr -dc '[:print:]\n\r' | tr '[:upper:]' '[:lower:]'
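
A quick illustration of what that pipe does, on a made-up input with mixed case and a stray control byte. Setting LC_ALL=C is my own addition so the character classes behave the same everywhere; it is not part of the posted command.

```shell
# Hypothetical input: "Example.COM" with a control byte (octal 001) in the middle.
clean="$(printf 'Exa\001mple.COM\n' \
  | LC_ALL=C tr -dc '[:print:]\n\r' \
  | LC_ALL=C tr '[:upper:]' '[:lower:]')"
printf '%s\n' "$clean"
```

The first tr deletes everything that is not printable, a newline or a carriage return, and the second lowercases what is left. Note that carriage returns are in the kept set, so DOS line endings pass through this pipe untouched.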
 
That should cover it for the hosts file:
Code:
dos2unix "${hf_inuse}" || true
if expr "$(grep -m1 "^[^#]" "${hf_inuse}" | awk '{print $1}')" : '[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*$' >/dev/null; then
    echo " file is in hosts file format (IP-domain pair)"
    /opt/bin/grep "^[^#]" "${hf_inuse}" \
    | sed -e "s/^[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}/X/g" | /opt/bin/grep -P '^[[:ascii:]]+$' \
    | /opt/bin/grep -w "^X" | awk '{print " "$2}' | /opt/bin/grep -E '[[:alnum:]]+[.][[:alnum:]_.-]+' \
    | awk '!/ [0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*$/' | awk '!/[:?\/;]/' >"${hf_inuse}.tmp"
else
    echo " file is in domains only format"
    /opt/bin/grep "^[^#]" "${hf_inuse}" | /opt/bin/grep -P '^[[:ascii:]]+$' \
    | awk '{print " "$1}' | /opt/bin/grep -E '[[:alnum:]]+[.][[:alnum:]_.-]+' \
    | awk '!/ [0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*$/' | awk '!/[:?\/;]/' >"${hf_inuse}.tmp"
fi
 
I think you still need to escape "." characters, otherwise whitelisting "a.c.com" would unblock "abc.com". And if the whitelists are user editable they might have other special sed characters which will break things entirely.
I can't fix stupid, but you are right about escaping the dots. I forgot to add that to the published version.
There's also a problem if a line with a single host is removed, you might do more processing later on to fix it, but it'll just be an IP otherwise.
Good observation, I added a check after. Thanks.
Finally, the trailing space could be done per line instead of changing the host format (if you choose not to use "( |$)").
Remember, I'm reading in a list of matches from the whitelist, which then are removed from the blocking list. All in one go. Adding the additional space to the blockinglist has more advantages elsewhere as grep and other sed commands can be simpler because all domains in the file have the spaces before and after.
Code:
sed -i "$(sed 's:.*:s/&/ /g:' whitelist)" blockinglist
It works great AFAICT, and no false positives.
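
For reference, one possible way to add the dot escaping to the generator while keeping the padded-space approach. This is only a sketch under my own assumptions; the actual code in functions.div is not shown in this thread and may differ. Sample data and temp directory are made up.

```shell
tmp="$(mktemp -d)"
# Blocking list line in the padded format (space after the last domain).
printf '%s\n' '172.18.0.2 abc.com a.c.com ' >"$tmp/blockinglist"
# Padded temporary whitelist entry.
printf '%s\n' ' a.c.com ' >"$tmp/whitelist"

# First expression escapes every literal dot, second one builds the s command.
# Generated script: s/ a\.c\.com / /g
script="$(sed -e 's/\./\\./g' -e 's:.*:s/&/ /g:' "$tmp/whitelist")"

escaped_result="$(sed "$script" "$tmp/blockinglist")"
printf '[%s]\n' "$escaped_result"

rm -rf "$tmp"
```

Without the first expression the unescaped pattern " a.c.com " would also match " abc.com ", so whitelisting one domain would silently unblock the other.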
 
