(Solved) Need a smart sed, grep or awk command

thelonelycoder · May 6, 2020

I'm at a loss; Google only gets me so far with examples.

Situation:
The Diversion blocking list maps up to 25 domains to an IP, like this example with 3 domains:

Code:

172.18.0.2 hit.123c.vn iclickyou.com icloud.com-locations.in

Note that there is a space before and after each domain except the last, which has no trailing space. That makes it even trickier.

The regular whitelisting during download of the hosts files works as at that time the file is in one-domain-per-line format. For later whitelisting I need a way of removing explicit domains from the file. Only exact matches may be removed and the integrity of the file must remain intact.

This would be an example whitelist which can be 20 or more lines long:

Code:

icloud.com-locations.in
icloud.com

I want to make this fast, so a "while read line" way is out of the question.
grep -v (select non-matching lines) does not work either: it drops the whole line, which is what I do with the one-domain-per-line file but is not what I want here.

I have an excellent sed command, but with icloud.com in the whitelist it would also remove the matching part of icloud.com-locations.in.
I tried adding a space before the domain, like " icloud.com" or the typical word boundary tricks but nothing works to my full satisfaction.

The sed command builds its own script with a second, inner sed, which takes the whitelist as input; the generated script then acts on file:

Code:

sed -i "$(sed 's:.*:s/&//g:' whitelist)" file

Any help would be appreciated.
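
For anyone following along, the over-match described above can be reproduced with a small sketch. The scratch directory is an assumption; the file contents are the examples from this post, and -i is dropped so nothing is rewritten:

```shell
# Scratch files; the names mirror the post, the contents are its examples.
tmp=$(mktemp -d)
printf '%s\n' '172.18.0.2 hit.123c.vn iclickyou.com icloud.com-locations.in' > "$tmp/file"
printf '%s\n' 'icloud.com' > "$tmp/whitelist"

# The inner sed turns each whitelist line into an s/DOMAIN//g command.
script=$(sed 's:.*:s/&//g:' "$tmp/whitelist")
printf '%s\n' "$script"     # s/icloud.com//g

# Without -i the result is only printed, not written back.
result=$(sed "$script" "$tmp/file")
printf '%s\n' "$result"     # 172.18.0.2 hit.123c.vn iclickyou.com -locations.in
rm -rf "$tmp"
```

The last host is left as "-locations.in", which is the partial removal the post is trying to avoid.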

dave14305 · May 6, 2020

How about something like this?

Code:

 sed -Ei 's,[[:space:]]icloud.com[[:space:]]?|$,,' blockinglist

EDIT: No, I see the deficiencies.

Or

Code:

 sed -Ei 's,[[:space:]]icloud.com([[:space:]]+|$),,' blockinglist

EDIT: I keep flailing at it, just to keep the conversation going.

Jack Yaz · May 6, 2020

dave14305 said:
How about something like this?

Code:

sed -Ei 's,[[:space:]]icloud.com[[:space:]]?|$,,' blockinglist

Does busybox sed understand [[:space:]] ?

dave14305 · May 6, 2020

Seemingly, but my sed also removes the space after the name in the middle of the line.

dave14305 · May 6, 2020

My final offer:

Code:

sed -Ei 's,([[:space:]]icloud.com)([[:space:]]+|$),\2,' blockinglist
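
A quick way to see what this version does, as a sketch only: the two input lines are made up from the domains in the first post, and -i is left out so the result goes to stdout.

```shell
# Sample lines based on the first post; the second line is an added edge case
# with the whitelisted domain at the end of the line.
input='172.18.0.2 icloud.com iclickyou.com icloud.com-locations.in
172.18.0.2 hit.123c.vn icloud.com'

# Group 1 is the space plus the domain. Group 2 is whatever followed it
# (spaces or end of line) and is the only part written back.
result=$(printf '%s\n' "$input" | sed -E 's,([[:space:]]icloud.com)([[:space:]]+|$),\2,')
printf '%s\n' "$result"
# 172.18.0.2 iclickyou.com icloud.com-locations.in
# 172.18.0.2 hit.123c.vn
```

icloud.com-locations.in survives because the character after "com" is "-", which is neither whitespace nor end of line. The dots are still unescaped here, so they match any character.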

thelonelycoder · May 6, 2020

Jack Yaz said:
Does busybox sed understand [[:space:]] ?

Yes

But you all misunderstood, I'm not looking for one or a couple of patterns. I need to run the command with input from file whitelist.

I think I have it, with the modification that the last domain in the line also gets a space after.
Then I make sure that the (temporary) whitelist also contains spaces, like " icloud.com "
And then the slightly modified blazing fast sed works, I simply add one space when a match is removed:

Code:

sed -i "$(sed 's:.*:s/&/ /g:' whitelist)" file

More testing follows, but I believe the case is solved. Thanks for offering ideas.
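
A minimal sketch of the padded approach described above, assuming a scratch directory and a sample line built from the earlier example (with icloud.com appended and a trailing space after the last host):

```shell
tmp=$(mktemp -d)
# Blocking list line with a space after every domain, including the last one.
printf '%s\n' '172.18.0.2 hit.123c.vn iclickyou.com icloud.com-locations.in icloud.com ' > "$tmp/file"
# Temporary whitelist with a space before and after the entry.
printf '%s\n' ' icloud.com ' > "$tmp/whitelist"

# Generated script: s/ icloud.com / /g
script=$(sed 's:.*:s/&/ /g:' "$tmp/whitelist")

# Printed in brackets so the remaining trailing space is visible.
result=$(sed "$script" "$tmp/file")
printf '[%s]\n' "$result"
# [172.18.0.2 hit.123c.vn iclickyou.com icloud.com-locations.in ]
rm -rf "$tmp"
```

Only the standalone icloud.com goes away; icloud.com-locations.in stays because it is followed by "-" rather than a space.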

dave14305 · May 6, 2020

thelonelycoder said:
Yes

But you all misunderstood, I'm not looking for one or a couple of patterns. I need to run the command with input from file whitelist.

I think I have it, with the modification that the last domain in the line also gets a space after.
Then I make sure that the (temporary) whitelist also contains spaces, like " icloud.com "
And then the slightly modified blazing fast sed works, I simply add one space when a match is removed:

Code:

sed -i "$(sed 's:.*:s/&/ /g:' whitelist)" file

More testing follows, but I believe the case is solved. Thanks for offering ideas.

My best ideas come from rejecting other people's ideas.

thelonelycoder · May 6, 2020

dave14305 said:
My best ideas come from rejecting other people's ideas.

Well, another day spent on a small but to me significant problem.
A coder's hell of 30 tabs open in Firefox and none of them help.

dave14305 · May 6, 2020

thelonelycoder said:
Well, another day spent on a small but to me significant problem.
A coder's hell of 30 tabs open in Firefox and none of them help.

You pulled a Captain Kirk Kobayashi Maru by changing the rules to allow a space at the end of the last host on a line.

thelonelycoder · May 6, 2020

dave14305 said:
You pulled a Captain Kirk Kobayashi Maru by changing the rules to allow a space at the end of the last host on a line.

It’s a small change for the blocking file but a huge improvement for the el functions.

thelonelycoder · May 6, 2020

Yup, the posted solution is it, it passed all tests I usually run.

Ro berto · May 6, 2020

thelonelycoder said:
It’s a small change for the blocking file but a huge improvement for the el functions.

I read it in Neil Armstrong's voice


thelonelycoder · May 6, 2020

Ro berto said:
I read it in Neil Armstrong's voice


I held the ladder he was stepping down from. On the ground.

Dabombber · May 6, 2020

thelonelycoder said:
I think I have it, with the modification that the last domain in the line also gets a space after.
Then I make sure that the (temporary) whitelist also contains spaces, like " icloud.com "
And then the slightly modified blazing fast sed works, I simply add one space when a match is removed:

Code:

sed -i "$(sed 's:.*:s/&/ /g:' whitelist)" file

It might be better to put the whitelist into a "|"-separated string and use "sed -Ei 's/ (regexstring)( |$)/\2/g' blockinglist". The reason is that you'll need to escape the entries anyway, since "." matches any character (a.c.com = abc.com).

Usually I'd test before posting code, but I'll leave that up to you. You can use [[:space:]]+ if you want, but since they're files you generated they should just have a single space.

Code:

# Escape, separate, and remove newlines
REGEX="$(sed -e 's/[])}|\/$*+?.^&{([]/\\&/g' -e '$!s/$/|/' whitelist | tr -d '\n')"
# Remove matching words, delete lines with no hosts eg "0.0.0.0"
sed -Ei -e 's/ ('"$REGEX"')( |$)/\2/g' -e '/^[^ ]*$/d' blockinglist

# As one line
sed -Ei -e 's/ ('"$(sed -e 's/[])}|\/$*+?.^&{([]/\\&/g' -e '$!s/$/|/' whitelist | tr -d '\n')"')( |$)/\2/g' -e '/^[^ ]*$/d' blockinglist

Edit: '/ /!d' might be better than '/^[^ ]*$/d'
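
Since the code above was posted untested, here is a small trial run as a sketch. The scratch directory, the sample hosts example.org and tracker.test, and the looped variant at the end are assumptions added for illustration; the escaping and joining commands are copied from the post, and -i is dropped so the output is only printed.

```shell
tmp=$(mktemp -d)
cat > "$tmp/blockinglist" <<'EOF'
172.18.0.2 hit.123c.vn iclickyou.com icloud.com-locations.in
172.18.0.2 icloud.com example.org tracker.test
EOF
printf '%s\n' 'icloud.com' 'example.org' > "$tmp/whitelist"

# Escape, separate with "|", and remove newlines (as in the post).
REGEX="$(sed -e 's/[])}|\/$*+?.^&{([]/\\&/g' -e '$!s/$/|/' "$tmp/whitelist" | tr -d '\n')"
printf '%s\n' "$REGEX"      # icloud\.com|example\.org

# Posted form with the g flag, using the '/ /!d' variant from the edit.
g_result=$(sed -E -e 's/ ('"$REGEX"')( |$)/\2/g' -e '/ /!d' "$tmp/blockinglist")

# Hypothetical looped variant: repeat a single substitution until none applies.
loop_result=$(sed -E -e ':a' -e 's/ ('"$REGEX"')( |$)/\2/' -e 'ta' -e '/ /!d' "$tmp/blockinglist")

printf '%s\n' "$g_result" "$loop_result"
rm -rf "$tmp"
```

In this run the g form leaves example.org on the second line: the space after icloud.com is consumed by that match, so the next host has no leading space left to match against. That only matters when two whitelisted hosts sit directly next to each other; the looped form is one possible way around it.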

thelonelycoder · May 6, 2020

Dabombber said:
It might be better to put the whitelist into a "|"-separated string and use "sed -Ei 's/ (regexstring)( |$)/\2/g' blockinglist". The reason is that you'll need to escape the entries anyway, since "." matches any character (a.c.com = abc.com).

Usually I'd test before posting code, but I'll leave that up to you. You can use [[:space:]]+ if you want, but since they're files you generated they should just have a single space.

Code:

# Escape, separate, and remove newlines
REGEX="$(sed -e 's/[])}|\/$*+?.^&{([]/\\&/g' -e '$!s/$/|/' whitelist | tr -d '\n')"
# Remove matching words, delete lines with no hosts eg "0.0.0.0"
sed -Ei -e 's/ ('"$REGEX"')( |$)/\2/g' -e '/^[^ ]*$/d' blockinglist

# As one line
sed -Ei -e 's/ ('"$(sed -e 's/[])}|\/$*+?.^&{([]/\\&/g' -e '$!s/$/|/' whitelist | tr -d '\n')"')( |$)/\2/g' -e '/^[^ ]*$/d' blockinglist

The code is only used in the el function, when the whitelist is processed. The temp whitelist is assembled from several files: the hard-coded whitelist, the shared whitelist from Skynet and the user's edited whitelist. I add a space before and after each domain name.
Then I run the posted command. It works and is still readable without a headache.
The final code is in functions.div, in today's last update that I pushed.

Dabombber · May 6, 2020

thelonelycoder said:
The code is only used in the el function, when the whitelist is processed. The temp whitelist is assembled from several files: the hard-coded whitelist, the shared whitelist from Skynet and the user's edited whitelist. I add a space before and after each domain name.
Then I run the posted command. It works and is still readable without a headache.
The final code is in functions.div, in today's last update that I pushed.

I think you still need to escape "." characters, otherwise whitelisting "a.c.com" would unblock "abc.com". And if the whitelists are user editable they might have other special sed characters which will break things entirely.

There's also a problem if the only host on a line is removed: you might do more processing later on to fix it, but otherwise the line will be left as just an IP.

Finally, the trailing space could be done per line instead of changing the host format (if you choose not to use "( |$)").

Code:

sed -i -e 's/$/ /' -e "$(...)" -e 's/ $//' file
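
The dot problem and the per-line trailing space can both be tried out with a sketch like the one below. The scratch directory, the sample line and the dots-only escaping step are assumptions for illustration; this is not the code in functions.div, and a user-editable whitelist would need the fuller escaping shown earlier.

```shell
tmp=$(mktemp -d)
# Blocking list line in the original format, without a trailing space.
printf '%s\n' '0.0.0.0 abc.com example.org a.c.com' > "$tmp/blockinglist"
# Padded whitelist entry.
printf '%s\n' ' a.c.com ' > "$tmp/whitelist"

# Unescaped: each "." matches any character, so abc.com is removed as well.
unescaped=$(sed -e 's/$/ /' \
    -e "$(sed 's:.*:s/&/ /g:' "$tmp/whitelist")" \
    -e 's/ $//' "$tmp/blockinglist")

# Dots escaped first, so only the literal a.c.com is removed.
escaped=$(sed -e 's/$/ /' \
    -e "$(sed -e 's/\./\\./g' -e 's:.*:s/&/ /g:' "$tmp/whitelist")" \
    -e 's/ $//' "$tmp/blockinglist")

printf '%s\n' "$unescaped"   # 0.0.0.0 example.org
printf '%s\n' "$escaped"     # 0.0.0.0 abc.com example.org
rm -rf "$tmp"
```

The first and last -e add and strip the trailing space on each line, so the file on disk can keep its original format.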

SomeWhereOverTheRainBow · May 7, 2020

Since we are sharing, I have occasionally been slipping this command into the pipes in my generators, because of the occasional domain or two that slips through with a non-real-world symbol.

Code:

tr -dc '[:print:]\n\r' | tr '[:upper:]' '[:lower:]'
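
As a rough illustration of what that pipe does, here is a sketch with a made-up input: the domain and the \001 control character are assumptions standing in for a dirty entry in a downloaded list, and LC_ALL=C is added so the character classes behave the same everywhere.

```shell
# \001 is a control character standing in for a stray byte in a hosts list.
cleaned=$(printf 'Ad.Example.COM\001\n' \
    | LC_ALL=C tr -dc '[:print:]\n\r' \
    | LC_ALL=C tr '[:upper:]' '[:lower:]')
printf '%s\n' "$cleaned"    # ad.example.com
```

The first tr deletes everything that is not printable, a newline or a carriage return; the second lowercases what is left.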

thelonelycoder · May 7, 2020

SomeWhereOverTheRainBow said:
Since we are sharing, I have occasionally been slipping this command into the pipes in my generators, because of the occasional domain or two that slips through with a non-real-world symbol.

Code:

tr -dc '[:print:]\n\r' | tr '[:upper:]' '[:lower:]'

That should cover it for the hosts file:

Code:

dos2unix "${hf_inuse}" || true
if expr "$(grep -m1 "^[^#]" "${hf_inuse}" | awk '{print $1}')" : '[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*$' >/dev/null; then
    echo " file is in hosts file format (IP-domain pair)"
    /opt/bin/grep "^[^#]" "${hf_inuse}" \
    | sed -e "s/^[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}/X/g" | /opt/bin/grep -P '^[[:ascii:]]+$' \
    | /opt/bin/grep -w "^X" | awk '{print " "$2}' | /opt/bin/grep -E '[[:alnum:]]+[.][[:alnum:]_.-]+' \
    | awk '!/ [0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*$/' | awk '!/[:?\/;]/' >"${hf_inuse}.tmp"
else
    echo " file is in domains only format"
    /opt/bin/grep "^[^#]" "${hf_inuse}" | /opt/bin/grep -P '^[[:ascii:]]+$' \
    | awk '{print " "$1}' | /opt/bin/grep -E '[[:alnum:]]+[.][[:alnum:]_.-]+' \
    | awk '!/ [0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*$/' | awk '!/[:?\/;]/' >"${hf_inuse}.tmp"
fi

thelonelycoder · May 7, 2020

Dabombber said:
I think you still need to escape "." characters, otherwise whitelisting "a.c.com" would unblock "abc.com". And if the whitelists are user editable they might have other special sed characters which will break things entirely.

I can't fix stupid, but you are right about escaping the dots. I forgot to add that to the published version.

Dabombber said:
There's also a problem if the only host on a line is removed: you might do more processing later on to fix it, but otherwise the line will be left as just an IP.

Good observation, I added a check after. Thanks.

Dabombber said:
Finally, the trailing space could be done per line instead of changing the host format (if you choose not to use "( |$)").

Remember, I'm reading in a list of matches from the whitelist, which then are removed from the blocking list. All in one go. Adding the additional space to the blockinglist has more advantages elsewhere as grep and other sed commands can be simpler because all domains in the file have the spaces before and after.

Code:

sed -i "$(sed 's:.*:s/&/ /g:' whitelist)" blockinglist

It works great AFAICT, and no false positives.

Dabombber · May 7, 2020

And, in the typical fashion of "oh, busybox supports that" after finishing everything: word boundaries work in sed's regex.

Code:

printf '%s\n' 'a ba' 'b a bab' 'abc a' | sed 's/\<a\>/X/'
X ba
b X bab
abc X
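
One caveat worth checking before relying on word boundaries for this particular job: "-" is not a word character, so \> also matches between "com" and "-locations". A sketch, assuming a sed that supports \< and \> (GNU sed and, per the post above, busybox sed) and a sample line built from the thread's domains:

```shell
line='172.18.0.2 icloud.com-locations.in icloud.com'

# Word boundaries alone: the prefix of icloud.com-locations.in is hit too.
result=$(printf '%s\n' "$line" | sed 's/\<icloud\.com\>//g')

# Printed in brackets so the leftover trailing space is visible.
printf '[%s]\n' "$result"    # [172.18.0.2 -locations.in ]
```

So for hostnames containing "-", the space-delimited matching used earlier in the thread looks like the safer choice.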
