26332 total geeks with 3498 solutions
Recent challengers:
  • krc bonus 5 - 12:32PM
  • krc bonus 23 - 11:03AM
  • krc bonus 12 - 08:08AM
 Welcome, you are an anonymous user! [register] [login] Get a yourname@osix.net email address 

Articles

GEEK

User's box
Username:
Password:

Forgot password?
New account

Shoutbox
MaxMouse
It's Friday... That's good enough for me!
CodeX
non stop lolz here but thats soon to end thanks to uni, surely the rest of the world is going good?
stabat
how things are going guys? Here... boring...
CodeX
I must be going wrong on the password lengths then, as long as it was done on ECB
MaxMouse
lol... the key is in hex (MD5: of the string "doit" without the "'s) and is in lower case. Maybe i should have submitted this as a challenge!

Donate
Donate and help us fund new challenges
Donate!
Due Date: Jun 30
June Goal: $40.00
Gross: $0.00
Net Balance: $0.00
Left to go: $40.00
Contributors


News Feeds
The Register
Ex-Systemax veep
charged with $230m
fraud
Roving IT
contractors and
private landlords
are my heroes -
here"s why
Wi-Fi Alliance
takes grid place,
revs engine in race
to 802.11ac
Julian Assange: I"m
quite happy to
sleep on Ecuador"s
sofa FOREVER
Supercomputer sage
Cray musters Lustre
cluster storage
hustler
Sneaky Seagate
slips "world"s
fastest" enterprise
disk mutant into
the wild
Hey mobile firms:
About that Android
thing... Did Google
add a lockout
clause?
PC makers REALLY
need Windows 8.1 to
walk on water - but
guess what?
Norks taunt, yank
Yanks" crank over
PRISM: US is
"rights abuse
kingpin"
Washout 2012
summer, melty
Greenland "nothing
to do with Arctic
ice or warm oceans"
Slashdot
Lobster, a New Game
Programming
Language, Now
Available As Open
Source
Google"s Crazy Lack
of Focus: Is It
Really Serious
About Enterprise?
Cat-like Robot Runs
Like the Wind
Revisiting Amdahl"s
Law
Altering Text In
eBooks To Track
Pirates
NVIDIA To License
Its GPU Tech
MySQL Man Pages
Silently Relicensed
Away From GPL
Verizon Accused of
Intentionally
Slowing Netflix
Video Streaming
Oculus Rift Raises
Another $16 Million
KWin Maintainer:
Fanboys and Trolls
Are the Cancer
Killing Free
Software
Article viewer

Character Frequency Analysis : Part 3 of a Series



Written by:typedeaF
Published by:Nightscript
Published on:2004-07-16 15:36:38
Topic:C
Search OSI about C.More articles by typedeaF.
 viewed 21678 times send this article printer friendly

Digg this!
    Rate this article :
Character Frequency Analyzer (AKA Cipher Buster) written in C.
This is a program that reads a plain text file and reports the most commonly occuring alphabetic characters. This will be very important when I introduce a program that attempts to crack a cipher text file. As is, this program is great for making you wonder why the heck the keyboard is mapped so unefficiently.

#include <stdio.h>
#include <ctype.h>

typedef struct CHAR_FREQUENCY_ANALYSIS_
{
    size_t Frequency;
    double Percentage;
    unsigned char Character;
}CHAR_FREQ_ANAL;

int printCFA(CHAR_FREQ_ANAL *);

main(int argc, char *argv[])
{
    FILE *data;
    CHAR_FREQ_ANAL cfa[26] = {{0, 0, 0}};
    size_t thisEntry;
    size_t thisByte;
    size_t fileLen;
    char *prog = argv[0];
    unsigned char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    unsigned char ch;
    size_t count = 0;
    int pos;

    /* file handling */
    if(argc == 1)
    {
        fprintf(stderr, "ERROR: %s requires a target file.\n", prog);
        exit(1);
    }
    else
    {
        if((data = fopen(*++argv, "r")) == NULL)
        {
            fprintf(stderr, "ERROR: %s cannot open file %s\n", prog, *argv);
            exit(1);
        }
    }

    /* find the file length */
    fseek(data, 0L, SEEK_END);
    fileLen = ftell(data);
    fseek(data, 0L, SEEK_SET);

    /* initialize the structure members */
    for(thisEntry = 0; thisEntry < 26; thisEntry++)
    {
        cfa[thisEntry].Frequency = 0;
        cfa[thisEntry].Percentage = 0.0;
        cfa[thisEntry].Character = alphabet[thisEntry];
    }

    /* get a char and if its alphabetic update the struct */
    for(thisByte = 0; thisByte < fileLen; thisByte++)
    {
        if(isalpha(ch = fgetc(data)))
        {
            pos = tolower(ch) - 'a';
            ++cfa[pos].Frequency;
            ++count;
        }
    }
    
    /* run thru the array and calculate the Frequency */
    for(thisEntry = 0; thisEntry < 26; thisEntry++)
    {
        cfa[thisEntry].Percentage = (double) cfa[thisEntry].Frequency * 100.0 / count;
    }

    /* Display the Results */
    fprintf(stdout, "Character Frequency Analyzer by typedeaF @ www.osix.net\n\n");
    printCFA(cfa);

    exit(0);
}

int printCFA(CHAR_FREQ_ANAL *cfa)
{
    size_t x;
    printf("Ch Count %%age Ch Count %%age\n");
    for(x = 0; x < 13; x++)
    {
        printf(" %c %9lu %6.3f %c %9lu %6.3f\n", cfa[x].Character, (unsigned long)cfa[x].Frequency, cfa[x].Percentage, cfa[x + 13].Character, (unsigned long)cfa[x + 13].Frequency, cfa[x + 13].Percentage);
    }
    return 0;
}


This program in itself does no encoding, as you might be able to tell from looking it over. It is however a very important tool in most cryptanalysis programs. In English, the most commonly occuring letters are "E T A O I N S H R D L U", in that order. When plain text is encoded using a fixed mask there will be a repetition of certain characters. This frequency analyzer scans through the data file and makes a tally of how often certain letters occur.

For this particular example we only count alphabetic chars. With a little tweaking and the use of SlipTops ASCII char function one could make this program count all 256 ASCII chars. Now back to the program. The first thing of considerable notice is our structure. The CFA struct will be used to create an array of 26 of its type; each one holding the stats for a letter of the alphabet. The variables included in the struct are simply the Frequency of which the char occurs (counts that character), the Percentage of occurance of that letter in the data file, and the Character which the struct represents. Next thing you will notice is a function prototype. This is simply a printing function and although its ugly as heck its actually quite simple and displays a nice formatted analysis. After the file handling comes a simple routine to get the length of the file by seeking to its end and returning its offset to a variable.

The next block of code simply initializes all of the structure's member variables to zero. OK, now we do some real work. We read through the entire file and each alphabetic character we find we increment the Frequency for that char. Since it's linear we can simply find its index in the array by "ch - 'a'". Count is incremented on every char. Now that this is done we can again cycle through the array of structs and calculate the percentage because we now have a total number of chars in our Count variable: ((freq * 100)/count.) Well that's it. The last thing we do is print the results to the screen using a another for loop. Sounds simple, looks simple, is simple.

This routine will be very useful for many kinds of applications as you might already know. Ideally this would be a function, but for this example I made it a full fledged program. Also AFTER the calculations are done one might want to run a qsort on the array to sort by Frequency. Now that we know the most common occuring letters we can make some good guesses at what the cipher text might be; if 'v' occurs the most frequently then we can assume that 'v' might have originally be 'E'.

On a simple 8-bit XOR cipher all we would have to do is XOR 'v' with 'E' to find the mask. Things get more complicated on larger masks but the theory remains the same. Also for real world use one would have to a lot for upper and lower case letters and punctuation marks. Doing so would have made this program much larger.

Once you grasp the concept, rewrite it so it fits your own needs.
Later...

typedeaF

Did you like this article? There are hundreds more.

Comments:
jebus
2004-07-20 18:10:39
interesting approach :D i just did this recently in an experiment but i used the C++ STL map<> container. basically, it went like this:
map<char, int> charMap;
// read the txt file into a char[] called buffer
for (int i = 0; i < len; i++)
{
   charMap[buffer[i]]++;
}


and done! the STL map is automatically sorted by the first element in the pair, in this case the integer value of the char. the first element in my map is the rarest character, the last entry is the mostly used one!
jebus
2004-07-20 18:13:37
oops! i was a little premature with my comment posting ;P

there was also a check in the for loop to make sure the character was printable ASCII. and the map is NOT sorted based on character frequency. i actually resorted it off the second pair element (the frequency) to get the final answer. i'll be posting an article soon that uses this exact technique.
Alask
2004-10-27 04:00:08
E T A O I N S H R D L U

Not entierly correct. That may be in one form of english, but it can't be assumed. Everyone has a different writing style and their letter frequencies change. For a sufficiently large text, you can still only guess at E T A and even then the order of A and T arent set.
anilg
2006-01-19 12:46:14
SENORITA si a way to remember the the first 8.. though not in order
Anonymous
2007-03-29 22:19:19
This is going to sound stupid but im new at this so if you could just bare with me i would be most gratefull. I am operating on the assumtion that the code text above is the complete code for your letter Frequency Analysis program; Furthermore i assume that you save it in notepad but save it as what? how do you execute the program with the desired text inserted? I know it probably sounds stupid but this is my first experiance with code other than the very basic vbs scripts. Thanks
Domuk
2007-03-30 06:15:10
I won't 'bare' with you ('though I dare say someone in the John the Ripper article may)... You need to save it with a filename like 'analysis.c', Notepad should be fine. You then need a C compiler (search Google for that), and when you have it as an executable, you should be able to run it by giving it an input file as an argument.
Anonymous
2009-05-18 14:26:54
.296.294.255.268.313.278.311.270.290.305.322.252.276.286.301.305.264.301.251.269.274.311.304.
230.280.264.327.301.301.265.287.285.306.265.282.319.235.262.278.249.239.284.237.249.289.250.
282.240.256.287.303.310.314.242.302.289.268.315.264.293.261.298.310.242.253.299.278.272.333.
272.295.306.276.317.286.250.272.272.274.282.308.262.285.326.321.285.270.270.241.283.305.319.
246.263.311.299.295.315.263.304.279.286.286.299.282.285.289.298.277.292.296.282.267.245.304.
322.252.265.313.288.310.281.272.266.243.285.309.295.269.295.308.275.316.267.283.311.300.252.
270.318.288.266.276.252.313.280.288.258.272.329.321.291.271.279.250.265.261.293.319.309.303.
260.266.291.237.299.286.293.279.267.320.290.265.308.278.239.277.314.300.253.274.309.289.280.
279.302.307.317.252.261.291.311.268.262.329.312.271.294.291.291.281.282.292.288.240.248.306.
277.298.295.267.312.284.265.294.321.260.293.310.300.307.263.304.297.276.262.291.241.284.312.
277.276.265.323.280.257.257.303.320.255.291.292.290.270.267.345.264.291.312.295.269.297.280.
290.224.308.313.240.308.311.247.284.311.268.289.266.316.299.269.299.298.265.298.262.260.337.
320.285.265.273.307.297.282.287.225.302.277.288.284.310.278.255.263.276.283.322.273.300.264.
302.312.289.262.236.278.280.286.292.298.296.313.258.300.280.300.260.274.329.288.272.316.256.
259.279.297.296.283.273.286.320.287.313.272.301.311.260.302.261.304.280.264.328.259.259.347.
245.291.258.289.270.300.301.318.251.305.278.290.311.280.281.293.313.259.300.262.315.263.319.
285.282.297.283.290.293.280.237.234.323.289.305.279.314.274.291.309.273.294.249.283.262.271.
286.310.305.306.261.298.282.282.307.287.285.305.297.275.306.280.292.291.284.301.278.293.296.
277.301.281.274.315.281.254.251.289.313.307.244.256.302.301.317.305.239.316.274.277.296.269.
305.301.279.287.317.284.277.305.298.264.304.286.273.275.293.309.286.282.240.287.239.268.269.
267.315.311.292.270.271.272.336.282.237.275.316.306.239.305.314.240.296.306.270.247.245.302.
317.316.241.291.310.266.274.274.313.288.262.319.280.276.238.297.295.287.285.288.301.272.275.
247.305.292.286.272.310.291.301.322.256.315.298.263.281.276.237.294.284.296.284.302.273.298.
287.298.301.265.305.270.315.278.283.302.287.263.270.345.258.270.266.302.309.262.260.277.327.
263.277.254.283.276.239.272.264.276.279.264.267.298.264.244.245.273.292.289.273.248.259.263.
288.290.294.210.288.268.311.318.312.242.285.293.216.262.276.340.292.299.275.259.293.311.234.
266.294.278.307.286.267.307.285.269.310.288.274.270.326.273.276.311.304.267.302.318.265.299.
263.283.248.257.314.288.321.321.236.284.283.227.320.312.246.261.289.316.288.263.312.241.265.
288.298.286.287.274.306.279.276.289.307.303.293.281.298.317.252.312.283.278.263.304.305.258.
266.270.294.286.293.290.291.291.258.254.282.282.283.313.268.282.316.310.299.254.264.234.296.
270.265.326.288.292.293.321.305.250.320.299.253.270.296.297.298.266.312.234.273.287.309.286.
278.269.279.316.284.276.234.293.255.267.242.253.318.270.246.278.292.285.282.314.266.292.286.
263.313.249.290.255.289.264.292.301.299.278.291.292.225.250.261.283.303.262.264.264.303.299.
297.274.288.267.293.316.320.317.233.303.258.302.271.283.323.247.279.268.312.269.297.313.280.
280.273.266.332.276.313.284.281.316.279.290.273.313.308.305.260.302.306.273.234.279.281.284.
298.278.259.290.314.275.264.339.293.322.266.261.296.306.277.275.311.284.270.318.259.249.286.
292.301.285.280.303.283.287.299.277.273.293.228.311.283.272.304.292.277.271.306.302.278.298.
300.287.281.309.243.272.279.282.300.291.295.284.285.252.291.251.285.283.245.250.252.318.298.
277.235.288.259.263.278.274.307.261.260.350.250.288.256.282.316.261.285.295.292.300.298.264.
245.241.308.301.261.253.289.264.267.300.262.248.287.257.266.275.287.297.320.287.264.279.297.
232.231.256.288.243.252.277.274.245.256.253.229.290.263.305.278.260.294.312.283.301.275.276.
299.297.312.275.282.294.272.228.302.324.257.261.286.326.280.283.316.294.254.258.275.264.236.
240.277.255.231.258.286.242.277.253.296.290.250.314.320.239.292.313.261.294.261.317.273.285.
236.292.282.271.264.297.300.272.308.299.300.269.301.269.317.284.286.262.315.276.279.328.269.
254.252.232.272.268.309.273.264.296.305.272.267.291.324.302.297.268.268.263.298.300.261.312.
241.254.299.280.263.292.260.301.311.317.297.248.314.272.293.298.281.298.276.311.291.297.318.
261.274.300.293.297.267.295.261.275.334.289.238.267.289.283.257.300.262.304.311.278.274.265.
261.345.301.296.270.273.299.289.274.272.313.282.268.320.287.320.270
Anonymous
2009-11-08 17:04:07
Quote Anon:
can u tell me in details about what is cookie?
A cookie is how a site recognizes your browser from other browsers; it's how when you check the remember me checkbox, you don't have to login again. Articulos Gratis
Anonymously add a comment: (or register here)
(registration is really fast and we send you no spam)
BB Code is enabled.
Captcha Number:


Blogs: (People who have posted blogs on this subject..)
amisauv
Creating a Lexical Analyzer in C on Tue 9th Dec 11am
#include<stdio.h> #include<string.h> #include<conio.h> #include<ctype.h> /*************************************** ************************* Functions prototype. **************************************** *************************/ void Open_File(
amisauv
Controling digital circuit through computer on Tue 9th Dec 10am
this code access the lpt port.here only 4 of the total 8 pins are used but can be modified for full 8 pins.it has a complete GUI with mouse & keyboard interactive control panel.works well in win98, but not in winxp. #include<stdio.h> #include<conio.
amisauv
/* Computerised Electrical Equipment Control */ /* PC BASED DEVICE CONTROLLER * on Tue 9th Dec 10am
#include<stdio.h> #include<conio.h> #include<dos.h> void main() { void tone(void); int p=0x0378; char ex={"Created By Mrc"}; int j; char ex1={"For Further Details & Improvements"}; int k; char ex2={"Contact : E-mail : anbudan
amisauv
Calendar Program on Tue 9th Dec 10am
This program prints Weekdays of specified date. It even prints calendar of a given year too. /*Ccalendar library*/ #include<stdio.h> #include<string.h> #include<conio.h> int getNumberOfDays(int month,int year) { switch(month) { case
amisauv
Calculator: on Tue 9th Dec 10am
#include"graphics.h" #include"dos.h" #include"stdio.h" #include"math.h" union REGS i,o; char text={ "7","8","9","*","4","5","6","/","1","2", "3","+","0","00",".","-","M","M+", "M-","+/-","MR","MC","x^2","sr","OFF","A C","CE","="}; int s=0,k=0,pass
amisauv
INFECTED CODES WRITTEN IN C\C++ on Tue 9th Dec 10am
This is a simple code that changes system time and date. It is written using c/c++ but can be easily converted to java. #include "stdio.h" #include "process.h" #include "dos.h" int main(void) { struct date new_date; struct date old_date; s
amisauv
A C programme which can print the file name it is kept in on Tue 9th Dec 9am
#include<stdio.h> main(){ printf(”the source file name is %s\n”,__FILE__); } actually __FILE__ is a macro which stands for the file name the programme is kept in and the compiler does the rest .. for you ..
amisauv
BOOTSECTOR EDITOR: on Tue 9th Dec 9am
Code : /*program to save the partion table of your hard disk for future use. it will save your partition table in a file partition.dat */ #include<stdio.h> #include<bios.h> #include<conio.h> #include<stdlib.h> #include<ctype.h> void main () {
amisauv
BLINKING STAR : on Tue 9th Dec 9am
#include<conio.h> #include<graphics.h> #include<stdlib.h> #include<dos.h> void main() { int gdriver=DETECT,gmode; int i,x,y; initgraph(&gdriver,&gmode,"e: cgi"); while(!kbhit()) { x=random(640); y=random(480); setcolor
amisauv
// To print semicolons using C programming without using semicolons any where i on Tue 9th Dec 9am
// To print semicolons using C programming without using semicolons any where in the C code in program. // #include<stdio.h> #include<conio.h> void main() { char a; a=59; if(printf("%c",a)){} getch();

Test Yourself: (why not try testing your skill on this subject? Clicking the link will start the test.)
BSD sockets API by skrye

This is a test of your knowledge of the BSD socket interface
C Programming by keoki

This test is aimed at a C programmer that is at an intermediate level.


     
Your Ad Here
 
Copyright Open Source Institute, 2006