
Custom firmware build for R7800


Kyle,

Thanks for your kind words. In the end it is your own win (your fight with the R7500), so my congratulations!

Voxel.
 
So one way to look at Krait - it's a really fast A9 with VFP4 and NEON, and leave it at that...


I ran some additional benchmarks to get a more accurate comparison of the compilation options. The environment is two of my R7800 units: the first runs my firmware 1.0.2.21SF (i.e. compiled with “Cortex-A15” options), the second runs the same firmware but compiled with “Cortex-A9” options.

I used CPUBENCH from OpenWRT, compiled with “Cortex-A9” and “Cortex-A15” options. No program optimization is used (the -O0 option). That is, I ran the two versions of CPUBENCH on both routers.
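
For reference, here is a minimal sketch of how the two CPUBENCH binaries could be produced with an OpenWrt-style cross toolchain; the toolchain prefix and source file location are assumptions, only the flags match the actual builds:

Code:
# hypothetical toolchain prefix; only the flags below match the actual builds
CC=arm-openwrt-linux-gnueabi-gcc

# "Cortex-A9" build of CPUBENCH
$CC -mcpu=cortex-a9 -mfpu=neon-vfpv4 -mtune=cortex-a9 -mfloat-abi=softfp -O0 \
    -o cpubench-a9 cpubench.c

# "Cortex-A15" build of CPUBENCH
$CC -mcpu=cortex-a15 -mfpu=neon-vfpv4 -mtune=cortex-a15 -mfloat-abi=softfp -O0 \
    -o cpubench-a15 cpubench.c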

Results are:

FW is compiled with “Cortex-A9” options, CPUBENCH is compiled with:
-mcpu=cortex-a9 -mfpu=neon-vfpv4 -mtune=cortex-a9 -mfloat-abi=softfp -O0
Code:
This is CPU and memory benchmark for OpenWRT v0.6. This will then take some time... (typically 30-60 seconds on a 200MHz computer)
Overhead for getting time: 0us
Time to run memory bench: 0.49[secs]
Time to run computation of pi (2400 digits, 10 times): 2.88[secs]
Time to run computation of e (9009 digits): 2.41[secs]
Time to run float bench: 0.01[secs]
Total time: 5.8s

FW is compiled with “Cortex-A9” options, CPUBENCH is compiled with:
-mcpu=cortex-a15 -mfpu=neon-vfpv4 -mtune=cortex-a15 -mfloat-abi=softfp -O0
Code:
This is CPU and memory benchmark for OpenWRT v0.6. This will then take some time... (typically 30-60 seconds on a 200MHz computer)
Overhead for getting time: 0us
Time to run memory bench: 0.43[secs]
Time to run computation of pi (2400 digits, 10 times): 1.50[secs]
Time to run computation of e (9009 digits): 1.43[secs]
Time to run float bench: 0.01[secs]
Total time: 3.4s

FW is compiled with “Cortex-A15” options, CPUBENCH is compiled with:
-mcpu=cortex-a9 -mfpu=neon-vfpv4 -mtune=cortex-a9 -mfloat-abi=softfp -O0
Code:
This is CPU and memory benchmark for OpenWRT v0.6. This will then take some time... (typically 30-60 seconds on a 200MHz computer)
Overhead for getting time: 0us
Time to run memory bench: 0.50[secs]
Time to run computation of pi (2400 digits, 10 times): 1.81[secs]
Time to run computation of e (9009 digits): 1.69[secs]
Time to run float bench: 0.01[secs]
Total time: 4.0s

FW is compiled with “Cortex-A15” options, CPUBENCH is compiled with:
-mcpu=cortex-a15 -mfpu=neon-vfpv4 -mtune=cortex-a15 -mfloat-abi=softfp -O0
Code:
This is CPU and memory benchmark for OpenWRT v0.6. This will then take some time... (typically 30-60 seconds on a 200MHz computer)
Overhead for getting time: 0us
Time to run memory bench: 0.44[secs]
Time to run computation of pi (2400 digits, 10 times): 1.51[secs]
Time to run computation of e (9009 digits): 1.44[secs]
Time to run float bench: 0.01[secs]
Total time: 3.4s

So the results show that my choice of “Cortex-A15” options for the IPQ806x CPU was correct ;-) It is interesting that even when compiled with “Cortex-A9” options, CPUBENCH runs faster in the “Cortex-A15” environment.

Voxel.
 
Small add-on:

Results of CPUBENCH compiled with stock options, executed on my FW with "Cortex-A9" options:
-march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=softfp -O0
Code:
This is CPU and memory benchmark for OpenWRT v0.6. This will then take some time... (typically 30-60 seconds on a 200MHz computer)
Overhead for getting time: 0us
Time to run memory bench: 0.52[secs]
Time to run computation of pi (2400 digits, 10 times): 3.01[secs]
Time to run computation of e (9009 digits): 2.52[secs]
Time to run float bench: 0.01[secs]
Total time: 6.1s

That is why I said the CPU is not used to its full power with the stock FW.

Voxel.
 
That is why I said the CPU is not used to its full power with the stock FW.

Would be fun to see performance on something like Liquid DSP

https://github.com/jgaeddert/liquid-dsp/

and if one wants to fully build liquid-dsp, one needs Phil Karn's libfec - mostly because liquid-dsp leverages Phil's code there (Android uses this, btw)

git clone https://github.com/Opendigitalradio/ka9q-fec.git

Code:
./bootstrap
./configure
make
sudo make install
sudo ldconfig

And then do the build for liquid-dsp - and rinse/lather/repeat - ensuring that libfec is available...
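
A rough sketch of those liquid-dsp steps, assuming the usual autotools flow (the bootstrap script name may vary between checkouts):

Code:
git clone https://github.com/jgaeddert/liquid-dsp.git
cd liquid-dsp
./bootstrap.sh    # or ./bootstrap, depending on the checkout
./configure
make
sudo make install
sudo ldconfig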

Once all that is done...

Grab - http://liquidsdr.org/blog/raspberry-pi3-benchmarks/benchmark_threaded.c

Code:
gcc -Wall -O2 -pthread -lm -lc -lliquid -o benchmark_threaded benchmark_threaded.c

Play around with the options there ;)

Code:
# auto-generated file
# ./benchmark_threaded
#      threads   buffer len   filter len  runtime [s]              samples    samples/s
             1          256           57       3.0000              9800704   3.2669e+06
             2          256           57       3.0000             19592960   6.5310e+06
             3          256           57       3.0000             29215488   9.7385e+06
             4          256           57       3.0000             39013632   1.3005e+07
             5          256           57       3.0000             39353600   1.3118e+07
             6          256           57       3.0000             39301888   1.3101e+07
             7          256           57       3.0000             39640064   1.3213e+07
             8          256           57       3.0000             39363584   1.3121e+07
             9          256           57       3.0000             39424768   1.3142e+07
            10          256           57       3.0000             39308032   1.3103e+07
            11          256           57       3.0000             39888384   1.3296e+07
            12          256           57       3.0000             39669760   1.3223e+07
            13          256           57       3.0000             39742208   1.3247e+07
            14          256           57       3.0000             39844352   1.3281e+07
            15          256           57       3.0000             39696384   1.3232e+07
            16          256           57       3.0000             40227584   1.3409e+07

That'll definitely flex the NEON and VFP units...

One of the concerns might be pthreads - many consumer router/AP firmwares purposely disable libpthread in uClibc - it's a sane choice, given that they have to support multiple SoCs, and some are not safe there (comment below)
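
A quick (hypothetical) check for whether a given firmware ships libpthread at all; the library paths here are assumptions:

Code:
# look for the pthread library on the router (paths vary by firmware)
ls /lib/libpthread* /usr/lib/libpthread* 2>/dev/null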
 
Also - might want to reach out to @kvic -

He was doing some exploration of pthreads within uClibc on the Cortex-A9 in the BCM4708 - risky, as some Cortex-A9/A8 cores are not thread-safe due to a chip erratum...

But with Krait, it should be ok...
 
The other option - migrate to musl - OpenWrt has already made this major move with Designated Driver, and the performance gains there have been fairly nice... comparable to where glibc is now, the binaries are close to the same size as what we see with uClibc (glibc is a big library), and there is a gcc wrapper present...

With the IPQ - there's always the Qualcomm-provided Snapdragon LLVM, but outside of Android and the NDK, I wouldn't recommend considering it, as that's a huge amount of work for what might not be much benefit... (but within AOSP it's a viable option, and there one has a specific build target compared to gcc)

One can spend a lot of time for small gains - while a rising tide lifts all boats, what if that tide is only an inch or two?

As @RMerlin mentioned in another thread - one can "over-optimize"... and attention might be better focused on the major bugs - compiler strategy is a corner-case optimization - there can be bigger gains found in security and functional app updates, with less effort.
 
Small add-on:

Results of CPUBENCH compiled with stock options, executed on my FW with "Cortex-A9" options:
-march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=softfp -O0

Hehe... I had to grab the code from OpenEmbedded - using Raspbian mainly due to the armhf userland and toolchain - and it works well with ARMv6/ARMv7 in any event...

gcc version 4.9.2 (Raspbian 4.9.2-10)

https://github.com/openembedded/openembedded/tree/master/recipes/cpubench/files

Code:
pi@raspy3:~ $ gcc -Wall -O -pthread -lm -lc -o cpubench cpubench.c
pi@raspy3:~ $ ./cpubench
This is CPU and memory benchmark for OpenWRT v0.6. This will then take some time... (typically 30-60 seconds on a 200MHz computer)

Overhead for getting time: 1us
Time to run memory bench: 0.41[secs]
Time to run computation of pi (2400 digits, 10 times): 0.02[secs]
Time to run computation of e (9009 digits): 0.03[secs]
Time to run float bench: 0.00[secs]
Total time: 0.5s

The gcc compiler is doing something interesting/weird with that code... when we take some options out and just compile it straight up...

Code:
$ gcc -Wall -o cpubench cpubench.c

we get the following - which is kinda strange, as both deliver good results...

Code:
pi@raspy3:~ $ ./cpubench 
This is CPU and memory benchmark for OpenWRT v0.6. This will then take some time... (typically 30-60 seconds on a 200MHz computer)
Overhead for getting time: 1us
Time to run memory bench: 0.92[secs]
Time to run computation of pi (2400 digits, 10 times): 3.56[secs]
Time to run computation of e (9009 digits): 3.83[secs]
Time to run float bench: 0.01[secs]
Total time: 8.3s

See what I mean?
 
Plus, people rarely go through a 20+ page thread before posting, so you often get the same question asked over and over again within the thread. With shorter threads, people are more likely to take the time to read them first.
You do know that you can do a search only in a specific thread using the search box that's in the upper right-hand corner?
 
You do know that you can do a search only in a specific thread using the search box that's in the upper right-hand corner?

Most folks don't ;)

Eric did give some good advice here...

And now that Voxel has a GitHub running - it's all good - it's going to be a bit of work on his part, as it has been for both RMerlin and yourself, but if he's serious about it - he's in the right place...
 
You might consider starting separate threads for each firmware revision, at least major ones. That way bug reports and discussion specific to each release can be more easily found.

@Voxel - this thread is getting to be a bit long - but ok for development.

When you do a formal release - open a new thread for that specific release.
 
I think it is better to make new posts here (to notify people who are using his firmware) and edit the first post so that new and old users can easily find the latest build.
 
You do know that you can do a search only in a specific thread using the search box that's in the upper right-hand corner?

Most people don't even know they can search the whole site, let alone a single thread :)

How many times have we been asked "is my router temperature too high?" so far?
 
Not an option when dealing with a bunch of closed source components that are linked against uclibc.

Worthwhile to explore - many are statically built so they can be portable across different SoCs and not dependent on the system-wide C library...
 
Worthwhile to explore - many are statically built so they can be portable across different SoCs and not dependent on the system-wide C library...

Not in Broadcom-land, at least. The first one I picked at random was dynamically linked:

Code:
admin@Stargate88:/tmp/home/root# ldd /usr/sbin/acsd
    libnvram.so => /usr/lib/libnvram.so (0x2abfb000)
    libshared.so => /usr/lib/libshared.so (0x2ac40000)
    libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x2abc9000)
    libc.so.0 => /lib/libc.so.0 (0x2ac93000)
    ld-uClibc.so.0 => /lib/ld-uClibc.so.0 (0x2aaf8000)
admin@Stargate88:/tmp/home/root#

Keep in mind that these are usually compiled at the same time as the rest of the firmware; they just aren't distributed in source form when the manufacturer releases the GPL drop. They're not generic, static builds provided by the SDK.
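
As a quick (hypothetical) way to see how widespread that is on a given firmware, a loop like this lists which binaries pull in the uClibc loader; the directories scanned here are an assumption:

Code:
# list binaries that are dynamically linked against uClibc
for f in /usr/sbin/* /usr/bin/*; do
    ldd "$f" 2>/dev/null | grep -q 'ld-uClibc' && echo "$f"
done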
 
Wow… So many replies… When I closed the lid of my laptop yesterday there was nothing except “like” from pege63… Do you intend to increase my rating here? ;-)

I’ll answer/comment a bit later, some overload with my main job.

Are you going to make a firmware for the r7500v2?

I would of course if I could. I am more or less in touch with the R7500v2 GPL code, but I am a private person, not a company or a computer shop; I simply do not have the whole Netgear product line. I do not have an R7500v2. Sorry.

Voxel.
 
Would be fun to see performance on something like Liquid DSP

It is a bit troublesome and time-consuming to perform all these steps with a cross compiler (and I am too lazy ;-)), so I compiled and ran the benchmark using my chroot-ed Debian under the R7800. Also, it is not enough to compile only benchmark_threaded.c to play with the options; for a more accurate comparison I had to recompile libliquid as well.

So the environment: R7800, chroot-ed Debian Jessie (ARMHF), gcc (Debian 4.9.2-10) 4.9.2, and two versions of the program and library:


"A9" - lib and program are compiled with options:
-O2 -mcpu=cortex-a9 -mtune=cortex-a9 -mfpu=neon-vfpv4 -pthread

"A15" - lib and program are compiled with options:
-O2 -mcpu=cortex-a15 -mtune=cortex-a15 -mfpu=neon-vfpv4 -pthread
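
For the record, here is a sketch of how the "A15" pair could be rebuilt inside the chroot; the directory name, the configure-based libliquid rebuild and the link order are assumptions, only the flags above are the actual ones (the "A9" pair is built the same way with the cortex-a9 flags):

Code:
# rebuild libliquid with the A15 flags (standard autotools flow assumed)
cd liquid-dsp
CFLAGS="-O2 -mcpu=cortex-a15 -mtune=cortex-a15 -mfpu=neon-vfpv4 -pthread" ./configure
make clean && make && sudo make install && sudo ldconfig

# rebuild the benchmark against the freshly installed library
gcc -Wall -O2 -mcpu=cortex-a15 -mtune=cortex-a15 -mfpu=neon-vfpv4 -pthread \
    -o benchmark_threaded-a15 benchmark_threaded.c -lliquid -lm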

The results are the following:
A9:
Code:
# auto-generated file
# ./benchmark_threaded-a9
#      threads   buffer len   filter len  runtime [s]              samples    samples/s
             1          256           57       3.0000             23316224   7.7721e+06
             2          256           57       3.0000             45248256   1.5083e+07
             3          256           57       3.0000             47870464   1.5957e+07
             4          256           57       3.0000             47976192   1.5992e+07
             5          256           57       3.0000             48998656   1.6333e+07
             6          256           57       3.0000             49361152   1.6454e+07
             7          256           57       3.0000             49408768   1.6470e+07
             8          256           57       3.0000             48601344   1.6200e+07
             9          256           57       3.0000             50805760   1.6935e+07
            10          256           57       3.0000             50150656   1.6717e+07
            11          256           57       3.0000             50438144   1.6813e+07
            12          256           57       3.0000             50007040   1.6669e+07
            13          256           57       3.0000             50723328   1.6908e+07
            14          256           57       3.0000             50170624   1.6724e+07
            15          256           57       3.0000             50676224   1.6892e+07
            16          256           57       3.0000             52382208   1.7461e+07

A15:
Code:
# auto-generated file
# ./benchmark_threaded-a15
#      threads   buffer len   filter len  runtime [s]              samples    samples/s
             1          256           57       3.0000             25693184   8.5644e+06
             2          256           57       3.0000             45039616   1.5013e+07
             3          256           57       3.0000             50476800   1.6826e+07
             4          256           57       3.0000             50361856   1.6787e+07
             5          256           57       3.0000             50704384   1.6901e+07
             6          256           57       3.0000             50806016   1.6935e+07
             7          256           57       3.0000             50941184   1.6980e+07
             8          256           57       3.0000             50291968   1.6764e+07
             9          256           57       3.0000             50178048   1.6726e+07
            10          256           57       3.0000             50896640   1.6966e+07
            11          256           57       3.0000             52323072   1.7441e+07
            12          256           57       3.0000             51564032   1.7188e+07
            13          256           57       3.0000             52723712   1.7575e+07
            14          256           57       3.0000             52975872   1.7659e+07
            15          256           57       3.0000             53508608   1.7836e+07
            16          256           57       3.0000             54482688   1.8161e+07

So the conclusion is that A15 (i.e. the options used by me in the firmware) is faster. ;-)

Voxel.
 
Strange, the software dislikes me ;-) Again I cannot post everything at once: blocked by the forum robot. So I continue (after splitting):


Regarding CPUBENCH and optimization: the target of this test was to check how fast it works with instructions generated for the Cortex-A9 target vs. those generated for the Cortex-A15 target, with all other conditions/options identical - honest loops of calculations. Any optimization would nullify the accuracy of the comparison (there would be the same identical results), so I used the "no-optimization" option.
(Generally, it is not quite a correct comparison: I am using gcc 4.8.5, uClibc, and "-mfloat-abi=softfp" in my FW. But anyway :))
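
One quick (hypothetical) way to confirm which float ABI a particular binary was built for is to check the ARM build attributes; a hard-float (ARMHF) build reports the VFP-args tag, a softfp build does not:

Code:
# hard-float binaries report "Tag_ABI_VFP_args: VFP registers"; softfp binaries do not
readelf -A ./cpubench | grep -i vfp_args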

CPUBENCH compiled and executed under my chroot-ed Debian (ARMHF):

A9:
Code:
This is CPU and memory benchmark for OpenWRT v0.6. This will then take some time... (typically 30-60 seconds on a 200MHz computer)
Overhead for getting time: 0us
Time to run memory bench: 0.48[secs]
Time to run computation of pi (2400 digits, 10 times): 3.18[secs]
Time to run computation of e (9009 digits): 2.70[secs]
Time to run float bench: 0.01[secs]
Total time: 6.4s

A15:
Code:
This is CPU and memory benchmark for OpenWRT v0.6. This will then take some time... (typically 30-60 seconds on a 200MHz computer)
Overhead for getting time: 0us
Time to run memory bench: 0.44[secs]
Time to run computation of pi (2400 digits, 10 times): 1.80[secs]
Time to run computation of e (9009 digits): 1.73[secs]
Time to run float bench: 0.01[secs]
Total time: 4.0s

Voxel.
 
