		Keeping data small

When many applets are compiled into busybox, the rw data and
bss of all applets are concatenated, including those from libc
if busybox is built statically. When busybox is started, _all_ this data
is allocated, not just the part belonging to the selected applet.

What "allocated" means exactly depends on the arch.
On NOMMU it probably bites the most, since real RAM is
actually used for rwdata and bss. On i386, bss is lazily allocated
via COWed zero pages. Not sure about rwdata - also COW?

In order to keep busybox friendly to NOMMU and small-memory systems,
we should avoid large global data in our applets, and should
minimize usage of libc functions which implicitly use
such structures.
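
To make the rwdata/bss distinction concrete, here is a minimal C sketch
(not taken from busybox, all names are hypothetical) of which kinds of
globals typically end up in which section on ELF toolchains. Running
'size' on the compiled object should show them in the data and bss columns.

```c
#include <stddef.h>

/* Initialized and writable: typically placed in .data (rwdata). */
static int lookup[4] = { 1, 2, 3, 4 };

/* Zero-initialized and writable: typically placed in .bss. */
static char scratch[4096];

/* Read-only: typically placed in .rodata, not counted as rw data. */
static const char banner[] = "hello";

/* Bytes of writable global data this module contributes to the binary. */
size_t writable_global_bytes(void)
{
	return sizeof(lookup) + sizeof(scratch);
}

/* Touch the objects so the compiler does not discard them. */
int use_globals(void)
{
	scratch[0] = banner[0];
	return lookup[0] + scratch[0];
}
```

In this sketch, 'scratch' is the kind of object worth moving out of bss
if the applet rarely needs it.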

A small experiment to measure "parasitic" bbox memory consumption:
here we start 1000 "busybox sleep 10" in parallel.
The busybox binary is practically an allyesconfig static one,
built against uclibc. Run on an x86-64 machine with a 64-bit kernel:

bash-3.2# nmeter '%t %c %m %p %[pn]'
23:17:28 .......... 168M    0  147
23:17:29 .......... 168M    0  147
23:17:30 U......... 168M    1  147
23:17:31 SU........ 181M  244  391
23:17:32 SSSSUUU... 223M  757 1147
23:17:33 UUU....... 223M    0 1147
23:17:34 U......... 223M    1 1147
23:17:35 .......... 223M    0 1147
23:17:36 .......... 223M    0 1147
23:17:37 S......... 223M    0 1147
23:17:38 .......... 223M    1 1147
23:17:39 .......... 223M    0 1147
23:17:40 .......... 223M    0 1147
23:17:41 .......... 210M    0  906
23:17:42 .......... 168M    1  147
23:17:43 .......... 168M    0  147

This requires 55M of memory (223M - 168M in the nmeter output above).
Thus 1 trivial busybox applet takes 55k of memory on a 64-bit x86 kernel.

On a 32-bit kernel we need ~26k per applet.

Script:

i=1000; while test $i != 0; do
        echo -n .
        busybox sleep 30 &
        i=$((i - 1))
done
echo
wait

(Data from NOMMU arches are sought. Provide 'size busybox' output too)


		Example 1

One example of how to reduce global data usage is in
archival/libarchive/decompress_unzip.c:

/* This is somewhat complex-looking arrangement, but it allows
 * to place decompressor state either in bss or in
 * malloc'ed space simply by changing #defines below.
 * Sizes on i386:
 * text    data     bss     dec     hex
 * 5256       0     108    5364    14f4 - bss
 * 4915       0       0    4915    1333 - malloc
 */
#define STATE_IN_BSS 0
#define STATE_IN_MALLOC 1

(see the rest of the file to get the idea)

This example completely eliminates globals in that module.
Required memory is allocated in unpack_gz_stream() [its main module]
and then passed down as a parameter to all subroutines which need
to access 'globals'.
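
The general shape of this approach can be sketched as follows. This is
not the code from decompress_unzip.c: state_t, account_byte() and
process_buffer() are hypothetical names, and the "state" is reduced to
two counters to keep the example short.

```c
#include <stdlib.h>

/* Hypothetical per-call state. A real decompressor would keep its
 * window, bit buffer, tables and so on in here. */
typedef struct state_t {
	unsigned long bytes_seen;
	unsigned long byte_sum;
} state_t;

/* Subroutines receive the state explicitly instead of touching globals. */
static void account_byte(state_t *state, unsigned char b)
{
	state->bytes_seen++;
	state->byte_sum += b;
}

/* The top-level routine owns the allocation and passes it down.
 * Returns the sum of all bytes, or -1 if allocation failed. */
long process_buffer(const unsigned char *buf, size_t len)
{
	state_t *state = calloc(1, sizeof(*state));
	long result;
	size_t i;

	if (!state)
		return -1;
	for (i = 0; i < len; i++)
		account_byte(state, buf[i]);
	result = (long)state->byte_sum;
	free(state);
	return result;
}
```

The module contributes nothing to data or bss. The cost is one extra
parameter on every subroutine.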


		Example 2

In case you don't want to pass this additional parameter everywhere,
take a look at archival/gzip.c. Here all global data is replaced by
a single global pointer (ptr_to_globals) to allocated storage.

In order to not duplicate ptr_to_globals in every applet, you can
reuse a single common one. It is defined in libbb/messages.c
as struct globals *const ptr_to_globals, but the struct globals is
NOT defined in libbb.h. You first define your own struct:

struct globals { int a; char buf[1000]; };

and then declare that ptr_to_globals is a pointer to it:

#define G (*ptr_to_globals)

ptr_to_globals is declared as a constant pointer.
This helps gcc understand that it won't change, resulting in noticeably
smaller code. In order to assign it, use the SET_PTR_TO_GLOBALS macro:

	SET_PTR_TO_GLOBALS(xzalloc(sizeof(G)));

Typically it is done in <applet>_main().

Now you can reference "globals" by G.a, G.buf and so on, in any function.
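
Put together, the pattern looks roughly like the standalone sketch below.
It is a simplification under stated assumptions: ptr_to_globals is a
plain (non-const) static pointer here, SET_PTR_TO_GLOBALS is a trivial
local stand-in for the libbb macro, calloc() stands in for xzalloc(),
and demo_applet_main() and remember() are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

struct globals { int a; char buf[1000]; };

/* Simplified stand-ins for the libbb pieces. In busybox the pointer is
 * declared const and assigned only through SET_PTR_TO_GLOBALS. */
static struct globals *ptr_to_globals;
#define SET_PTR_TO_GLOBALS(x) do { ptr_to_globals = (x); } while (0)
#define G (*ptr_to_globals)

/* Any function can reach the "globals" through G, no extra parameter. */
static void remember(const char *s)
{
	strncpy(G.buf, s, sizeof(G.buf) - 1);
	G.a = (int)strlen(G.buf);
}

/* Hypothetical <applet>_main(): allocate globals first, then use them. */
int demo_applet_main(const char *arg)
{
	int len;

	/* calloc() stands in for xzalloc(): storage starts out zeroed. */
	SET_PTR_TO_GLOBALS(calloc(1, sizeof(G)));
	if (!ptr_to_globals)
		return -1;
	remember(arg);
	len = G.a;
	free(ptr_to_globals);
	SET_PTR_TO_GLOBALS(NULL);
	return len;
}
```

Only one pointer remains in bss. The 1000-byte buffer exists only
while the applet is actually running.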


		bb_common_bufsiz1

There is one big common buffer in bss - bb_common_bufsiz1. It is a much
earlier mechanism to reduce bss usage. Each applet can use it for
its needs. Library functions are prohibited from using it.

The 'G.' trick can also be done using bb_common_bufsiz1 instead of a malloced buffer:

#define G (*(struct globals*)&bb_common_bufsiz1)

Be careful, though, and use it only if globals fit into bb_common_bufsiz1.
Since bb_common_bufsiz1 is BUFSIZ + 1 bytes long and BUFSIZ can change
from one libc to another, you have to add a compile-time check for it:

if (sizeof(struct globals) > sizeof(bb_common_bufsiz1))
	BUG_<applet>_globals_too_big();
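
For comparison, the same size condition can be expressed with C11's
_Static_assert. This is a swapped-in technique for illustration, not
what busybox does. COMMON_BUFSIZE and the struct members below are
hypothetical stand-ins for sizeof(bb_common_bufsiz1) and a real applet's
globals.

```c
#include <stdio.h>	/* BUFSIZ */

/* Hypothetical stand-in for sizeof(bb_common_bufsiz1). */
enum { COMMON_BUFSIZE = BUFSIZ + 1 };

/* Hypothetical applet globals. */
struct globals {
	int dev_fd;
	long sector;
	char name[64];
};

/* Compilation fails here if the globals outgrow the common buffer. */
_Static_assert(sizeof(struct globals) <= COMMON_BUFSIZE,
		"struct globals does not fit into the common buffer");

/* Runtime view of the same condition, handy for a quick sanity check. */
int globals_fit(void)
{
	return sizeof(struct globals) <= COMMON_BUFSIZE;
}
```

Either way, the point is that a libc with a smaller BUFSIZ breaks the
build instead of silently overflowing the buffer at run time.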


		Drawbacks

You have to initialize the globals by hand. xzalloc() can be helpful in clearing
allocated storage to 0, but anything more must be done by hand.

All global variables are prefixed by 'G.' now. If this makes code
less readable, use #defines:

#define dev_fd (G.dev_fd)
#define sector (G.sector)


		Word of caution

If an applet doesn't use much global data, converting it to use
one of the above methods is not worth the resulting code obfuscation.
If you have less than ~300 bytes of global data - don't bother.


		Finding non-shared duplicated strings

strings busybox | sort | uniq -c | sort -nr


		gcc's data alignment problem

Adding the following aligned attribute in vi.c:

static int tabstop;
static struct termios term_orig __attribute__ ((aligned (4)));
static struct termios term_vi __attribute__ ((aligned (4)));

reduces bss size by 32 bytes, because gcc sometimes aligns structures to
ridiculously large values. asm output diff for above example:

 tabstop:
        .zero   4
        .section        .bss.term_orig,"aw",@nobits
-       .align 32
+       .align 4
        .type   term_orig, @object
        .size   term_orig, 60
 term_orig:
        .zero   60
        .section        .bss.term_vi,"aw",@nobits
-       .align 32
+       .align 4
        .type   term_vi, @object
        .size   term_vi, 60

gcc doesn't seem to have options for altering this behaviour.

gcc 3.4.3 and 4.1.1 tested:
char c = 1;
// gcc aligns to 32 bytes if sizeof(struct) >= 32
struct {
    int a,b,c,d;
    int i1,i2,i3;
} s28 = { 1 };    // struct will be aligned to 4 bytes
struct {
    int a,b,c,d;
    int i1,i2,i3,i4;
} s32 = { 1 };    // struct will be aligned to 32 bytes
// same for arrays
char vc31[31] = { 1 }; // unaligned
char vc32[32] = { 1 }; // aligned to 32 bytes

-fpack-struct=1 reduces alignment of s28 to 1 (but probably
will break layout of many libc structs) but s32 and vc32
are still aligned to 32 bytes.
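
To check what your own compiler does with objects like these, a small
test program along the following lines may help. It is only a sketch:
addr_alignment() and report_alignment() are hypothetical helpers, and
looking at addresses gives an upper bound only, since an object can land
on a larger boundary by accident. The .align directives in the asm
output remain the reliable source.

```c
#include <stdint.h>
#include <stdio.h>

struct s28_t { int a, b, c, d; int i1, i2, i3; };
struct s32_t { int a, b, c, d; int i1, i2, i3, i4; };

struct s28_t s28 = { 1 };
struct s32_t s32 = { 1 };
char vc31[31] = { 1 };
char vc32[32] = { 1 };

/* Largest power of two (capped at 4096) dividing the address; 0 for NULL. */
unsigned long addr_alignment(const void *p)
{
	uintptr_t a = (uintptr_t)p;
	unsigned long align = 1;

	if (a == 0)
		return 0;
	while ((a & 1) == 0 && align < 4096) {
		a >>= 1;
		align <<= 1;
	}
	return align;
}

/* Print the observed alignment of each test object. */
void report_alignment(void)
{
	printf("s28:  %lu\n", addr_alignment(&s28));
	printf("s32:  %lu\n", addr_alignment(&s32));
	printf("vc31: %lu\n", addr_alignment(vc31));
	printf("vc32: %lu\n", addr_alignment(vc32));
}
```

Call report_alignment() from a test main() and compare the numbers
with and without the aligned attribute on the objects.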

I will try to cook up a patch to add a gcc option for disabling it.
Meanwhile, this is where it can be disabled in gcc source:

gcc/config/i386/i386.c
int
ix86_data_alignment (tree type, int align)
{
#if 0
  if (AGGREGATE_TYPE_P (type)
       && TYPE_SIZE (type)
       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
       && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= 256
           || TREE_INT_CST_HIGH (TYPE_SIZE (type))) && align < 256)
    return 256;
#endif

Result (non-static busybox built against glibc):

# size /usr/srcdevel/bbox/fix/busybox.t0/busybox busybox
   text    data     bss     dec     hex filename
 634416    2736   23856  661008   a1610 busybox
 632580    2672   22944  658196   a0b14 busybox_noalign



		Keeping code small

Set CONFIG_EXTRA_CFLAGS="-fno-inline-functions-called-once",
run "make bloatcheck", and look at the biggest auto-inlined functions.
Now, set CONFIG_EXTRA_CFLAGS back to "", but add NOINLINE
to some of these functions. In the 1.16.x timeframe, the results were
(annotated "make bloatcheck" output):

function             old     new   delta
expand_vars_to_list    -    1712   +1712 win
lzo1x_optimize         -    1429   +1429 win
arith_apply            -    1326   +1326 win
read_interfaces        -    1163   +1163 loss, leave w/o NOINLINE
logdir_open            -    1148   +1148 win
check_deps             -    1148   +1148 loss
rewrite                -    1039   +1039 win
run_pipe             358    1396   +1038 win
write_status_file      -    1029   +1029 almost the same, leave w/o NOINLINE
dump_identity          -     987    +987 win
mainQSort3             -     921    +921 win
parse_one_line         -     916    +916 loss
summarize              -     897    +897 almost the same
do_shm                 -     884    +884 win
cpio_o                 -     863    +863 win
subCommand             -     841    +841 loss
receive                -     834    +834 loss

855 bytes saved in total.

scripts/mkdiff_obj_bloat may be useful to automate this process: run
"scripts/mkdiff_obj_bloat NORMALLY_BUILT_TREE FORCED_NOINLINE_TREE"
and select modules which shrank.
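
For reference, marking a function NOINLINE looks roughly like this.
The sketch is standalone: the NOINLINE definition below is a local
stand-in based on gcc's noinline attribute (busybox provides its own
in its headers), and count_spaces() and summarize_line() are
hypothetical functions.

```c
#include <stddef.h>

/* Local stand-in for busybox's NOINLINE (gcc/clang attribute syntax). */
#define NOINLINE __attribute__((__noinline__))

/* A static function with a single caller. gcc would normally inline it
 * into that caller; NOINLINE keeps it as a separate function. */
static NOINLINE size_t count_spaces(const char *s)
{
	size_t n = 0;

	while (*s) {
		if (*s == ' ')
			n++;
		s++;
	}
	return n;
}

/* The only caller of count_spaces(). */
size_t summarize_line(const char *line)
{
	return count_spaces(line);
}
```

Whether this is a win or a loss for a given function cannot be told
from the source. It has to be measured with "make bloatcheck", as in
the table above.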