Determining C compiler implementation characteristics

For discussions about programming, programming questions/advice, and projects that don't really have anything to do with Puppy.
WIckedWitch
Posts: 276
Joined: Thu 29 Mar 2018, 22:13
Location: West Wales bandit country

Determining C compiler implementation characteristics

#1 Post by WIckedWitch »

Up front, I now have to say that, owing to commercial confidentiality constraints, I'm going to have to be selectively reticent about exactly why I'm doing some of the things to be aired in this thread. Suffice it to say that it concerns how you determine, by objective, reproducible means, the implementation characteristics of a C compiler if you want to use certain kinds of theorem-proving or model-checking verification tools on C programs.

The first step is to try to determine all the predefined macro names without just asking the compiler to tell you (because, at the outset of this kind of testing, you trust it only to compile and run conforming C programs).

The basic steps are:

1. find all .h or .H files on the file system
2. extract all identifiers that occur within them
3. generate from these identifiers the kind of C program shown below

I now have a Tcl script to do this. It has scanned all the .h and .H files on my tahrpup installation and extracted a list of all distinct identifiers occurring within them. For each such identifier ident it has generated the code:

#ifdef ident
printf("ident == %s\n", stringify(ident));
#endif

(I've discussed this pattern in the thread, "The Coming Software Apocalypse".)

All such generated lines are embedded in an int main function so that, when executed, the program advises the user which of the identifiers are actually predefined macro names.
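
To make the pattern concrete, here is a minimal sketch of the shape of the generated program (the two-level stringify helper is the usual C idiom for this; the two candidate identifiers shown are purely illustrative, not output of my script):

Code: Select all

#include <stdio.h>

/* The usual two-step stringification idiom: the outer macro expands
   its argument first; the inner macro then turns the expansion into
   a string literal. */
#define stringify2(x) #x
#define stringify(x)  stringify2(x)

int main(void)
{
    /* One block like the following is generated per candidate
       identifier; __linux__ and __GNUC__ are illustrative only. */
#ifdef __linux__
    printf("__linux__ == %s\n", stringify(__linux__));
#endif
#ifdef __GNUC__
    printf("__GNUC__ == %s\n", stringify(__GNUC__));
#endif
    return 0;
}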

Now for the numbers: the C test file so generated is 47MB in size. This size does not bother me. On a custom-configured test system, I would have far fewer Linux packages installed than come with a standard tahrpup 6.0.5 install, so in a live situation the C test file would be much smaller, because the Tcl scanner would be looking only at header files in the directories created by packages on which the compiler install depends.

One question so far, however, is: Linux packages seem to come with lots of .h and .H files - over 16,000 on my tahrpup 6.0.5 installation. Why is this? Is it only because, for open source software, the source code comes with it so you can play with it?
Sometimes I post mindfully, sometimes not mindfully, and sometimes both mindfully and not mindfully. It all depends on whether and when my mind goes walkies while I'm posting :?

WIckedWitch
Posts: 276
Joined: Thu 29 Mar 2018, 22:13
Location: West Wales bandit country

#2 Post by WIckedWitch »

Well, after following the process through, my 47MB C test program finds just one predefined macro, "str", which produces the value "str" when compiled in my gcc test environment. Big deal, you would be justified in saying - but this is only the (crude) absolute form of the test. I'll now move on to the various relative forms and report the results.
Sometimes I post mindfully, sometimes not mindfully, and sometimes both mindfully and not mindfully. It all depends on whether and when my mind goes walkies while I'm posting :?

rockedge
Posts: 1864
Joined: Wed 11 Apr 2012, 13:32
Location: Connecticut, United States

#3 Post by rockedge »

interesting......

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#4 Post by technosaurus »

Code: Select all

echo | $CC $CFLAGS -E -dM - | sort
where CC is your compiler and CFLAGS is your compiler options. This works as-is for gcc-compatible compilers, but most compilers have similar options.
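Typical output is one #define line per predefined macro. A few illustrative lines (the values vary with compiler version and flags):

Code: Select all

#define __GNUC__ 4
#define __STDC_HOSTED__ 1
#define __linux__ 1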
There is a SourceForge project that lists a bunch of them for different compilers, and some gists on GitHub - see Google.

It looks like your script only looks for macros that are used for conditional compilation, but ignores things like #if defined(foo) || !defined(bar) && defined(baz)

A simple grep would be much quicker, but you'd need to refine your parameters to get any real help. You can just do something like:

Code: Select all

grep -h '#if' path1/*.h path1/*/*.h path2 etc.... |sort -u
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

WIckedWitch
Posts: 276
Joined: Thu 29 Mar 2018, 22:13
Location: West Wales bandit country

#5 Post by WIckedWitch »

technosaurus wrote:

Code: Select all

echo | $CC $CFLAGS -E -dM - | sort
where CC is your compiler and CFLAGS is your compiler options. This works as-is for gcc-compatible compilers, but most compilers have similar options.
There is a SourceForge project that lists a bunch of them for different compilers, and some gists on GitHub - see Google.

It looks like your script only looks for macros that are used for conditional compilation, but ignores things like #if defined(foo) || !defined(bar) && defined(baz)

A simple grep would be much quicker, but you'd need to refine your parameters to get any real help. You can just do something like:

Code: Select all

grep -h '#if' path1/*.h path1/*/*.h path2 etc.... |sort -u
Ah. I see I have not explained myself clearly. My project is aimed at supporting the use of C compilers in critical applications where they will be used in conjunction with state-of-the-art program verification tools based on theorem proving or model checking. When using such tools, it is necessary to determine the nature of implementation-dependent aspects of the C compiler without relying on any implementation-dependent characteristics to perform the determination.

Hence:

1. Your proposed use of

Code: Select all

echo | $CC $CFLAGS -E -dM - | sort
is ruled out right from the start.

2. The means of generating test programs must be portable and work across all platforms on which the compilers might be running. For this reason use of grep is also ruled out. In practice, Tcl turns out to be the simplest way of performing the required searches and generating the required test programs. This is because Tcl is (a) technically well-suited to the task and (b) as portable a scripting language as you can get.

Also, my script looks for anything matching the syntactic pattern of a C identifier (a letter or underscore followed by any sequence of letters, digits and underscores), so it finds all such tokens in the code, not just the ones that are tested in #ifdef or #if defined directives.

What I am developing cannot assume that it is running in a POSIX environment and therefore cannot rely on any POSIX-only facilities. Also, all it may rely on in the compiler is that it correctly compiles conforming C code.

I appreciate that this may seem draconian to people who may not be accustomed to working on safety-, security- and mission-critical projects. For such projects, test-program generation procedures must be technically traceable only to what is guaranteed by the language standard, not to what happens to be provided in a particular compiler. Otherwise one is relying on implementation-dependent features of a compiler to help generate tests that themselves seek to determine the nature of such features - and that is circular reliance and hence impermissible (unless demonstrably unavoidable).

I would point out that I have previously used such techniques on an air-traffic control project to configure the QAC C static checking tool for use in conjunction with code that plots aircraft positions on radar displays - and there are very few applications that are more critical than that. I've also had to deploy these techniques before in a major international litigation in the aviation industry where the integrity of static analysis had to be demonstrated by means that would survive scrutiny by expert witnesses in a court of law. Has this made me paranoid about how to do things? You bet!

The current work is aimed at making my previously-used techniques more thorough, more systematic and more portable, as is required when determining compiler characteristics for the most powerful static analysis tools. For this reason I am piloting the techniques for gcc, g++, clang, tcc and MSVC/C++ first, as those cases are the most difficult.
Sometimes I post mindfully, sometimes not mindfully, and sometimes both mindfully and not mindfully. It all depends on whether and when my mind goes walkies while I'm posting :?

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#6 Post by technosaurus »

So a replacement for autotools? I get why you wouldn't want to touch that mess. Unfortunately the alternatives like waf, pkgbuild and ninja (or samurai) aren't perfect either. You can always look at their generated C code and incorporate it into your own tests. Basically, they work like this: each test includes a header (or headers), defines a variable or two, and contains a minimal amount of code that should compile successfully if the feature is supported; based on that, they set a variable to 1, 0 or another appropriate value, e.g. HAVE_FOO_H=1 or NEED_BAR=0. From there the tool generates a config.h that includes alternatives as needed and defines any needed types, macros, functions, etc. Sometimes a C file is produced as well, if certain necessary functions are missing. When implementing these tests it is helpful to know the predefined macros that I showed you how to get, otherwise you end up running orders of magnitude more unnecessary tests covering architecture, endianness, the size of each type, which floating-point spec is used, maximum values, and so on.
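
For instance, a typical configure-style probe is nothing more than a tiny translation unit that either compiles or fails (a sketch; the header and type probed here are examples only):

Code: Select all

/* probe_stdint.c - compiles only if <stdint.h> exists and declares
   int64_t. A configure-style driver would record HAVE_STDINT_H=1 in
   config.h on success and HAVE_STDINT_H=0 on failure. The header and
   type probed here are illustrative only. */
#include <stdint.h>

int main(void)
{
    int64_t probe = 0;
    return (int)probe;
}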

I hate to say it, but just use autotools.
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

WIckedWitch
Posts: 276
Joined: Thu 29 Mar 2018, 22:13
Location: West Wales bandit country

#7 Post by WIckedWitch »

technosaurus wrote:
I hate to say it, but just use autotools.
No can do on account of the GNU-centric origins of Autotools. Again I shall have to observe some reticence here but the configuration-determining tests must be entirely free of any GNU/POSIX provenance.

I can use the gcc -dM option to generate candidate predefined macro names, but the definitive, veridical test must be performed by running a C program or programs that actually determine the value of each predefined macro in the C execution environment, under the same compiler options as will be used to compile the C code that is going to be analysed by the static analysis tools.

The tools for which I am doing this are top-end, state-of-the-art program verifiers that rely on the compiler only to compile conforming C code. Anything beyond this minimal assumption cannot be built into the tool, and definitions of predefined macros must be supplied to the verifier as parameters. The requirement for such parameters is that they be determined by veridical tests that rely on the compiler no more than the verifier relies on it. At this level of software criticality, one simply does not trust the compiler to tell the truth about itself other than by revealing it in the execution of conforming programs.

Trust me on this. I learned how to do critical compiler testing from one of the leading computer scientists at the UK National Physical Laboratory and the apparently paranoid approach that I'm taking is based on documented experience with real compilers (for various languages) in the past. Borland's Turbo Pascal back in the early 1980s exhibited such dire non-conformance with ISO 7185 that it opened the eyes of certification testers to how paranoid one needs to be.
Sometimes I post mindfully, sometimes not mindfully, and sometimes both mindfully and not mindfully. It all depends on whether and when my mind goes walkies while I'm posting :?

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#8 Post by technosaurus »

WIckedWitch wrote:
technosaurus wrote:
I hate to say it, but just use autotools.
No can do on account of the GNU-centric origins of Autotools. Again I shall have to observe some reticence here but the configuration-determining tests must be entirely free of any GNU/POSIX provenance.
Yeah, then that's as far as I am willing to help on a forum question. This actually requires extensive knowledge of standards and compiler-specific behaviors that I only provide to useful open source projects or when paid at least six figures. There are just too many variables - it's the equivalent of asking how to program in C without using implementation-defined or undefined behavior (there are entire books on this that barely scratch the surface). If you have a specific question though, put it on Stack Overflow - we only have a handful of decent C programmers here and they stay pretty busy (shell scripting is Puppy's main programming language, thanks to gtkdialog)

I always recommend the trust-but-verify method. In this case, use the predefined macros to generate the expected values and then verify that code compiles as they say it should. If it doesn't, file a bug report (just out of courtesy) and add a workaround as needed. Hopefully you can at least assume C99 support, otherwise you run into a lot of architecture-specific anomalies.
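As a sketch of that trust-but-verify step in portable C (the two claims checked here are illustrative assumptions only, not things you can take for granted):

Code: Select all

/* Trust but verify: each typedef compiles only if the claimed
   characteristic holds; otherwise the array size is -1, which is a
   constraint violation, and the build fails. Both claims below are
   illustrative assumptions to be checked, not guarantees. */
#include <limits.h>

/* Claim under test: char is exactly 8 bits wide. */
typedef char verify_char_bit_8[(CHAR_BIT == 8) ? 1 : -1];

/* Claim under test: int can represent at least 2^31 - 1. */
typedef char verify_int_range[(INT_MAX >= 2147483647) ? 1 : -1];

int main(void) { return 0; }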
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

WIckedWitch
Posts: 276
Joined: Thu 29 Mar 2018, 22:13
Location: West Wales bandit country

#9 Post by WIckedWitch »

technosaurus wrote:
WIckedWitch wrote:
technosaurus wrote:
I hate to say it, but just use autotools.
No can do on account of the GNU-centric origins of Autotools. Again I shall have to observe some reticence here but the configuration-determining tests must be entirely free of any GNU/POSIX provenance.
Yeah, then that's as far as I am willing to help on a forum question. This actually requires extensive knowledge of standards and compiler-specific behaviors that I only provide to useful open source projects or when paid at least six figures. There are just too many variables - it's the equivalent of asking how to program in C without using implementation-defined or undefined behavior (there are entire books on this that barely scratch the surface). If you have a specific question though, put it on Stack Overflow - we only have a handful of decent C programmers here and they stay pretty busy (shell scripting is Puppy's main programming language, thanks to gtkdialog)

I always recommend the trust-but-verify method. In this case, use the predefined macros to generate the expected values and then verify that code compiles as they say it should. If it doesn't, file a bug report (just out of courtesy) and add a workaround as needed. Hopefully you can at least assume C99 support, otherwise you run into a lot of architecture-specific anomalies.
No problem. And I do have the requisite knowledge of the C standards, having served on the C language panel for the British Standards Institution and represented BSI at ISO meetings. Moreover, as a founder member, back in 1978, of the British Computer Society Specialist Group in Formal Aspects of Computing Science (BCS-FACS), and having led a formal methods research project, I am well versed in the theoretical computer science required to understand how theorem-provers and model-checkers work.

Incidentally, this work will end up as open source because the test programs and the test-program generators have to be open to public scrutiny. Indeed, all compiler test suites have to be available in source code form, otherwise they cannot accomplish what they seek to achieve. I am not going to make any money out of this other than consulting fees for helping people to use the tests. My commercially constrained reticence is for reasons not related to whether my test suite will be open source.
Sometimes I post mindfully, sometimes not mindfully, and sometimes both mindfully and not mindfully. It all depends on whether and when my mind goes walkies while I'm posting :?

WIckedWitch
Posts: 276
Joined: Thu 29 Mar 2018, 22:13
Location: West Wales bandit country

#10 Post by WIckedWitch »

Can I repeat one question that got missed a few posts back?

Linux packages seem to come with lots of .h and .H files - over 16,000 on my tahrpup 6.0.5 installation. Why is this? Is it only because, for open source software, the source code comes with it so you can play with it?
Sometimes I post mindfully, sometimes not mindfully, and sometimes both mindfully and not mindfully. It all depends on whether and when my mind goes walkies while I'm posting :?

jafadmin
Posts: 1249
Joined: Thu 19 Mar 2009, 15:10

#11 Post by jafadmin »

WIckedWitch wrote: Linux packages seem to come with lots of .h and .H files - over 16,000 on my tahrpup 6.0.5 installation. Why is this? Is it only because, for open source software, the source code comes with it so you can play with it?
This thread is an example of why: http://murga-linux.com/puppy/viewtopic.php?t=98228
(second post in the thread)



wiak
Posts: 2040
Joined: Tue 11 Dec 2007, 05:12
Location: not Bulgaria

#12 Post by wiak »

WIckedWitch wrote:
... And I do have the requisite knowledge of the C standards, having served on the C language panel for the British Standards Institution and represented BSI at ISO meetings. Moreover, as a founder member, back in 1978, of the British Computer Society Specialist Group in Formal Aspects of Computing Science (BCS-FACS), and having led a formal methods research project, I am well versed in the theoretical computer science required to understand how theorem-provers and model-checkers work.
Regarding C standards and security on Puppy Linux you might be interested in this:

http://www.murga-linux.com/puppy/viewto ... 214#993214

Particularly the last three or four paragraphs of that. The post is about issues with the gtkdialog program used in most Puppy-created GUI apps, which has been described as 'mission-critical' to Puppy...

wiak

WIckedWitch
Posts: 276
Joined: Thu 29 Mar 2018, 22:13
Location: West Wales bandit country

#13 Post by WIckedWitch »

wiak wrote:
WIckedWitch wrote:
... And I do have the requisite knowledge of the C standards, having served on the C language panel for the British Standards Institution and represented BSI at ISO meetings. Moreover, as a founder member, back in 1978, of the British Computer Society Specialist Group in Formal Aspects of Computing Science (BCS-FACS), and having led a formal methods research project, I am well versed in the theoretical computer science required to understand how theorem-provers and model-checkers work.
Regarding C standards and security on Puppy Linux you might be interested in this:

http://www.murga-linux.com/puppy/viewto ... 214#993214

Particularly the last three or four paragraphs of that. The post is about issues with the gtkdialog program used in most Puppy-created GUI apps, which has been described as 'mission-critical' to Puppy...

wiak
Interesting.

I never use any non-portable facilities for any kind of GUI. Using Tcl/Tk or Python/Tkinter gives me cross-platform implementations every time.

In fact, for the last 25 years, I have always written code to be portable even if there is no explicit requirement for it, because making code portable coerces you to use language constructs only as they are defined in the standard for the language in which you are programming.

This has at times made some of the great unwashed denounce my coding style as "anally obsessive", but the discipline of aiming for portability pays off in terms of ultimate code quality.

I also program in a single-assignment style (google for it) because this tends to produce code that is easier for model-checkers and theorem-provers to verify.
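
By way of illustration, a small function written in that single-assignment style, where every local is initialised exactly once and never reassigned:

Code: Select all

/* Single-assignment sketch: each local is const and bound exactly
   once, keeping the data flow explicit and easy for model-checkers
   and theorem-provers to reason about. */
int clamp(int lo, int hi, int v)
{
    const int raised  = (v < lo) ? lo : v;
    const int clamped = (raised > hi) ? hi : raised;
    return clamped;
}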

OK, being inclined to be this systematic is undoubtedly easier if, like me, you are autistic, but, then, why not turn such an atypicality to advantage?
Sometimes I post mindfully, sometimes not mindfully, and sometimes both mindfully and not mindfully. It all depends on whether and when my mind goes walkies while I'm posting :?

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#14 Post by technosaurus »

WIckedWitch wrote:Can I repeat one question that got missed a few posts back?

Linux packages seem to come with lots of .h and .H files - over 16,000 on my tahrpup 6.0.5 installation. Why is this? Is it only because, for open source software, the source code comes with it so you can play with it?
Depends on the distro. By default, when you just run
./configure; make; make install
it will install all files needed for use - for libraries, that includes the header files.
Some distros (especially Debian and derivatives) split packages in various ways, such as DEV (.a, .la, .so and .h files), DOC (man, info, html and pdf pages in /usr/share/doc), NLS (localization files), common (other non-architecture-dependent files) and arch/BIN (binaries and libraries), but others (Slackware, LFS, etc.) leave the whole install intact as a single package. Puppy has tools in woof to split these into separate DEVX and NLS squash filesystems, but it's not perfect.
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

WIckedWitch
Posts: 276
Joined: Thu 29 Mar 2018, 22:13
Location: West Wales bandit country

#15 Post by WIckedWitch »

technosaurus wrote:
WIckedWitch wrote:Can I repeat one question that got missed a few posts back?

Linux packages seem to come with lots of .h and .H files - over 16,000 on my tahrpup 6.0.5 installation. Why is this? Is it only because, for open source software, the source code comes with it so you can play with it?
Depends on the distro. By default, when you just run
./configure; make; make install
it will install all files needed for use - for libraries, that includes the header files.
Some distros (especially Debian and derivatives) split packages in various ways, such as DEV (.a, .la, .so and .h files), DOC (man, info, html and pdf pages in /usr/share/doc), NLS (localization files), common (other non-architecture-dependent files) and arch/BIN (binaries and libraries), but others (Slackware, LFS, etc.) leave the whole install intact as a single package. Puppy has tools in woof to split these into separate DEVX and NLS squash filesystems, but it's not perfect.
Thanks - this is very helpful to me.

I've now started developing all these tests in a more organised way, working directly from (initially) Annex J of the 1999 ISO C standard plus later corrigenda. The initial experimentation reported in this forum has helped me to formulate the more systematic approach.

Thanks for this and all your other contributions. :-)
Sometimes I post mindfully, sometimes not mindfully, and sometimes both mindfully and not mindfully. It all depends on whether and when my mind goes walkies while I'm posting :?
