History of Compression
In 1952, David Huffman invented the Huffman coding to compress text. He based his analysis on the redundancy of bytes.
The more redundant bytes the better the compression.
He did an excellent job. Thanks to him we have WinZip, WinRar, Power Archieve, 7-Zip and alot of compression tools.
we use them daily.
Are the compression Algorithms mature enough?
When we try to compress exectuables and binaries that are naturally doesn't contain so much redundant bytes, we didn't get so much out of it.
We Suffer. We suffer when we want to send a rar achieve of 10 MB the email keep bouncing back. We suffer when we want to download large files such as music, movies, maps, ISOs. We suffer when we want to open a webpage filled with flash, images, rich texts or videos. We suffer when we want to want to review an online album.
Why only those with 4Mb T connections can enjoy the fast speed just because files are getting bigger and bigger and WinZip and WinRar failed to make them smaller?
I was working on this for the last 4 years.
I failed 3 times, I realize I am in a dip and that at the other side of the dip there is a promising IT world .
This is what I want to do:
I want to extract a 20 KB file from a 1GB file, that file is the essence and should be used to "regenerate" the 1GB file.There must be a new approach very different then the classical one, this will involve new technologies and science such as Quantum Physics .
I have setup some approaches that will make some of you laugh. I'll continue to add more approaches from you guys
First approach
The idea is to extract a unique descriptive identifier of the subject file. (Think of this as taking a seed from a given tree) The (seed) is a very small file that can regenerate the original file (tree).
Technically, the seed will be a mathematical formula. By using complex numerical analysis I want to get a unique formula for each file, I started to read some numerical analysis books. Few people told me that its not possible because NA gives you an approximation of the formula. But I want to see for myself.
We know that, 1 Byte = 255 decimal. A typical file looks something like this.
offset 0 -> 125
offset 1 -> 56
offset 2 -> 12
offset 3 -> 55
...
Until the end of the file. This will generate a graph, we only need to get the formula.
Second approachThis is the backup plan. I need to study some biology first to achieve that. Please, open the window now, and look, really look outside.
What can you see? I bet you can see a lot, you can see trees, houses, there is a girl playing with her doll, you can see the colors, you can see the movements, there are birds flying, you can see the color of the birds how fast they are moving you can measure the distance you can see which object is near, which one is far.
You can see a lot amount of details. All these data, are handled by a very tiny part of our body. The eye. If your eye can capture and shrink this amount of details into a place smaller then an egg.
Can't we make a 1 GB file into a 20 KB ?
Third approachDigital Seeds with Fourier Series, I wrote about it here.
My readers suggestions
In this section I will list the approaches suggested by my lovely reader.
Yaseen titi
This guy is really interested in Ceed idea, in fact he wrote a lot of suggestions more than I've even did, you can read his comments. I love this guy.
PragmaTechie
If you can't find such a formula, then maybe you can divide the file in two arbitrary parts and then find the formula for each. That way, you're increasing the probability of a hit. You can shift this divider both ways for a more greater possibility of determining a formula.
Dr. Christian Lapp
What about considering every byte as one dimension in a multi(n-)dimensional space. so the entire file would be the surface of a n-dimensional figure, itself having the dimension n-1.
1 byte --> a point (0D)
2 bytes -> a line (1D)
3 bytes -> a plane (2D)
etc. Check out the idea of parallel coordinates
Dr4g0nF1y :
Now that I think of it, wasn't there another compression method, called Rainbow Paper? I can't find it anywhere, but I do recall that it should be able to print a 1GB file on a paper, using a regular inkjet-printer and read it back in using a regular flatbed-scanner.
Why Ceed?Imagine the life with Ceed.
You Tube
You will download the Ceed version of the video to your machine very fast (20 kb) then regenerate it in your machine very fast and play it as if it was locally there. No more stopping the video to let youtube download it full.
Google Earth
Google earth will send Ceeds Satalite images quickly to your computer then regereate the original files in your machine. Your internet speed doesn't matter so much. Because The work will be done on the machine as if you are browsing your own computer data.
Ceed Torrents
I don't need to wait for days to download a file.
Ceed Games
1 GB game can be shrinked into its orginial ceed 20 KB and downloaded easily, or sent via MSN messenger.
Ceed Browser
No more .html, or .asp or .php
its .ceed, www.yahoo.com/index.ceed. This will contain the whole page with the images and the text and videos and flash into about 50 kb size ceed file. Your ceed browser will download the ceed to your machine and the browser regenerates the original page easily, as if you are opening a page on your machine. No flickering, no loading. Nice.
You may add to the list.
Why I'm sharing thisI don't want the idea to die with me by being selfish, I have to let people read it and understand and share it and add to it.
That's why I am sharing it today.
Your comments are most welcomed.

 
Nice approaches. If this works, it would certainly change the world!
ReplyDeleteFor the first approach, the compression algorithm has to be able to compress any combination of bytes. Bytes generally don't have a pattern, which would make it impossible to have a working formula most of the time. Though it sounds promising if you were actually able to come up with a way to do it.
I think one way to do it is to make the algorithm smart. Basically, the algorithm has to recognize the file type, analyze its contents, and generate formulas based on that. Unfortunately, that won't work for all file types and would be very tiresome to implement.
I've seen a 3D game that was 87kb, the installer contained some formulas to generate the models on your computer instead of carrying them inside the installer.
I don't imagine it would be hard to write a software to analyze a file containing a model and generate formulas to rebuild that model, making it much smaller, but that would be only one file type, three trillion left to go.
I'll have to think about this.
Yaseen
ReplyDeleteI see we are synchronized.
If I go with the file type, it will be as you said very difficult work. But it is also a good approach if we want to focus our work on one file type. Like movies, so we divide and conqure.
Yes I've also seen a game which was like 500 kb and when I extract it (with its own algorithm) it become 1.5 GB !
So formulas have to do something with it.
The problem with bytes is that they are arbitrary, but no matter what. There exist a set of formulas that describe it. Sin/Cos/Log + multiple polynomial.. I don't know. I don't have the knowledge what I have is the imagination right now. (Read this http://hnaser.blogspot.com/2009/02/always-think-like-kid.html)
I'll acquire the knowledge to do so.
I also thought to make this process iterative, when we finish generating the formulas we compress them using lz then we take that file and we try to generate a new set of forumlas. To make it even smaller.
But I am thinking of the performance at the same time
Two factors
Size
Performance
Thanks for the support Titi.
Your ideas is most welcomed.
This is just what the world needs--creative individuals. Keep up the good work guys!
ReplyDeleteWith an inexhaustible combination of mathematical terms, it might be possible to find one such combination which would describe the whole sequence of bytes in a large file. If you find such a combination, you'll only have to store that then replicate the original sequence.
If you can't find such a formula, then maybe you can divide the file in two arbitrary parts and then find the formula for each. That way, you're increasing the probability of a hit. You can shift this divider both ways for a more greater possibility of determining a formula.
This is not programming; this is computer science at its best. Good luck!
Great analysis PragmaTechie ,
ReplyDeleteThere is lot of mathematical primitives, we need the machine to find the right pattern and assemble that into a formula or collection of formulae
We need to think that the file is so large. Millions of bytes need to fit into a formula.
Programming will be the last step
I should have shared this 4 years ago.
This is just... WOW! As Yassen says, you would totally change the world. Unfortunately for you I don't have an analysis to write you, but I can do mental support :).
ReplyDeleteI wish you all luck in this project! And please, keep me (or us) updated on your blog!
We need individuals as yourself.
(Amazing..)
Anders,
ReplyDeleteI really thank you for this heart-warming words.
If you liked the idea just share this post with your friends.. I would like their comments, their opinion if they have any analysis, any books to recommend.
I am here in the learning phase in this project, I have to learn as much as I can to accomplish that.
I need you people to do this, I cannot do it alone, I would love you add to this concept or strengthen its core.
Thanks again
I was thinking about how to generate a function for every number series possible, while it hasn't been done yet, we have a couple of advantages:
ReplyDelete-The numbers in our series can only be between 0 and 255.
-We know how big is our series, it doesn't go to infinity.
I'll be back
Told you you and I are synchronized.
ReplyDeleteExactly the way I think that's why I am still climbing the steep dig and I know we can reach something..
Or prove something at least..
I had some thoughts about this. Basically, compressing a file and retrieving it should be a 1 to 1 process; each file has its own unique compressed equivalent.
ReplyDeleteAssuming we’re going to compress a 1gb file to 20kbs, that means that every unique 1gb file will need its own 20kb unique mathematical formula, there aren’t enough combinations in 20kb to cover all the combinations of a 1gb.
Wait what? Yeah that’s exactly what I thought and I asked myself the following questions:
1. The mathematical formula doesn’t always have to be 20kb, it could be 19kb or 21kb, each kilobyte I add is 1024^256 extra possibilities, that should cover a 1gb file right?
Well, the compressed file doesn’t always have to be 1gb as well.
2. If I had two files, the first was part of the second (Basically the same formula that works for the second file works for the first file, but the first file is 3 bytes shorter), won’t that mean I’ll be able to put many more formulas inside 20kbs?
The indicator that I’ll use to tell the formula where to stop will be different; hence no difference.
3. Well if you’re saying that it cannot be done, how do hash functions do it?
Every hash function has the possibility to collide. Fortunately, due to the way hashing is used, that doesn’t matter. The chance of error is very low to the point that it’s acceptable.
4. If I compress a file using winrar, it will be reduced in size, I can compress any file on my computer and it’ll have less size, according to your 1 to 1 theory I’m doing the impossible, aren’t I?
Well, due to the way the compressing algorithm works, every file you compress will have the size of 2 <= compressed size <= file size (Of course this math doesn’t include the headers added by the compression algorithm to be able to retrieve the data). Fortunately, random numbers are random and chances are your compressed file will be somewhere in the middle most of the time.
A file that just consists of bytes repeating after each other will become 2 bytes compressed; a file that has no pattern at all will stay the same size.
When you compress a 1000 bytes file, there’s 999^256 possible output that is smaller, yet a 1000 bytes could be represented in 1000^256 ways. Therefore, only 1 out of every 256 file will actually lose size when compressed. Luckily, the way data is represented in computers is redundant and most of the time your file is compressible (I just thought of this, I need to do more math to prove it).
Conclusion:
Using mathematical formulas to compress a file could work, but the formulas would have to be >= minimum_formula_size and <= file_size or maybe > file_size.
Due to the way this compression technique works, you can re-compress the compressed file and you might get an even smaller compressed file.
When I think all about this in math perspective it’s very different, I don’t feel bound by how computers work and how bytes are represented, but I also don’t think about how big of a number series actual files are.
When I talk to people about new ways to compress files, they think of weird creative methods, unfortunately most of their ideas don't result in smaller files, but they manage to represent files differently(they're trying).
ReplyDeleteThis made me think, what if we're able to represent files in a different way that's very compressible by the regular compression algorithms?
It's not reinventing the wheel, but it's something at least.
Great analysis!
ReplyDeleteI have chosen 1GB to be 20KB as a threshold and pivot. Yes If you think in from a mathematical perspective, you are doing mapping.
You can represent 20 KB file into 1GB, 20 * 1000 * 256 = 5,120,000 possible combination.
But the idea wasn't about mapping. (Although it could be a good candidate) remember we shouldn't eliminate anything.
First we need to create our own representation for formula (shortcuts, e.g. S means sin, C means Cos , and so on.
Second, If I can read this sequence of "random" (I would love to write about random in another post) bytes and somehow create a set of formula that will regenerate this sequence.
Something like this
Cx+Sx-Ax^2 ...
This will be a text format file that represent our formula... Yes, Text, Rar just loves to compress Text files. Then we are going to feed our formula into a rar to create a very shrinked file. With this we don't need to worry about mapping or file sizes.
its ok if the file went to 50 or 60 KB.
our aim to shrink it to 20 KB.
@Yaseen2
Already thought of that last year, but didn't fully implemented it.
The old idea goes like this
we have this file map
0->255
1->124
2->110
3->35
...
100000->12
Create an image (black and white only)
this image will contain a graph, x and y axis
the x axis will be the position of the byte
and the y axis will be the byte code (up to 255)
place a black pixels as you go ..
you will have an image of great white space and black pixels equal to the number of bytes in the file.
Feed it into a rar, , it will harsh it out! because it contains a lot of redundant bytes..
,,
Keep this momentum Yasseen
Thanks for the great comments.
Evolving a function from a table of values
ReplyDeletehttp://www.critticall.com/func.html
Just For our records
I'm still thinking about your last post, I don't have input on your first response, I need to think more of that.
ReplyDeleteI'm not convinced by your second technique, since I'll be expanding the file and then compressing it, but I'm working on a tool that tries to do that, I might be surprised.
The code you linked seems complex, but it seems something we need! I'll analyze it and let you know what I find.
@Yaseen
ReplyDeleteGreat,
let us know the results of your analysis.
As for that technique, the best way to prove that is to try it. expanding a file into an image with only white and black can really make some great compression, but huge job more time..
I didn't try it though
Keep us updated,
thank you
I can't seem to understand a thing from the page you linked, reading the homepage makes it look like a scam.
ReplyDeleteI'll give it another shot however.
@Yaseen
ReplyDeleteIn fact,
I didn't get that too,
I linked the page because it has our same concept maybe one day we will refer to it when we reach advance stages..
it sounds like Jason language
Let me do some brainstorming on this:
ReplyDeleteFinding the most simple formula is actually what the whole physics has been about for centuries: The formulas, that put the earth in the center of the solar system are much more complicated than the ones putting the sun in center. etc.
Simple is beautiful. Occam's razor.
I guess, if a bunch of smart people takes a single file and works a while on it, they will find an algorithm describing the file. Of course this would only apply to this single file.
So the challenge is to find an algorithm that creates an algorithm for every file.
@PragmaTechie's idea to break down the file and find formulas for each part. I like that. Then you mix them and look for meta-formulas describing them. And so on, until you have one or a few.
Doing this randomly might not deliver you the very best formula (which would take ages) but a pretty good one for sure.
There is an algorithm in optmization theory, i believe it is called Noah's Arch, that does something similar: Every possible solution is a mountain, you are the arch and the water level constantly rises. You stick to one solution, until the level is too high, then you float around until you find another, higher one. You define either a max water level, where you take the solution you just stick to, or a max floating time, after which you take the peak you were last at.
@seed of a tree: the seed does not represent the actual tree. it is less and more at the same time. less because the tree grew in interaction with its specific environment. and more because it contains all the possible trees of this species.
e.g.: i see a bonsai and want to have one myself. so i buy a seed. and i get some rules with it, how to treat it. then i get a bonsai myself - but it is not the same, it is similar.
Maybe we have to reinvent the wheel, that after compression-extraction the file has to be the same. maybe it is enough to have it similar...?
For a 1 GB file you get a 20 KB ceed plus a 10 KB of rules and you grow your own file, that is then maybe 1.1 or 0.9 GB, but has similar characteristics as the original.
Maybe the entire file structure of the original has to be different: selfrefferential, chaotic, ...
I have to dig in my HD for a paper on similarity in information theory - I will post that later...
Here is the paper. It is the frame of chemistry, but as we are thinking in biology already that should not matter.
ReplyDelete@Chris
ReplyDelete"I guess, if a bunch of smart people takes a single file and works a while on it, they will find an algorithm describing the file. Of course this would only apply to this single file."
You are totally right, I actually tried that with one of my students back in the university I gave him a file and he found a formula after 4 days of working..
It was a very complicated formula
exactly this is the challenge to find a formula for any given file.
We need to expand on Noah's Arch theory.. It might lead us some where...
The concept of the seed I proposed here is one to one, this seed is for this file only..
that way we have unlimited seeds, that's why I changed the name to Ceed.
Thanks for the Paper Dr, lets say we need to try every possible doors that will lead us to change this world...
I was amazed by the way of your analysis...
Thanks for sharing this valuable post
Keep it coming guys..
what if:
ReplyDelete- you have a formula, that not exactly represents the file (like the seed does with the tree)
- you combine it with several (thousands, millions?) formulas that are stored on your computer (like rules for growing the seed)
- and then you check which of your results matches the original (with a test-code that was implemented into the original before compressing and that you send together with the seed)
Ok lets say we have created this formula that create 60% of the file
ReplyDeleteour program shall predict the unmatched part of the file and try to create another formula that covers the other 40%
That way we will have great results..
But the trade off will be a huge warehouse of already exist formula and counting..Thus installing this program will take a huge space .. didn't like that so much..
Does this idea work online?
@Hussein, could you show us the original file and the result formula?
ReplyDeleteAs for the seed idea, you should realize that there's no rules bounding files at all, you can't predict the rest of a file, if you were able to retrieve 1023 byte of a 1 kilobyte file, you still won't be able to predict the last byte.
@Yaseen,
ReplyDeleteI wish I had it,
it was 4 years ago, it was done on paper ..
I guess You are right,
files cannot be predicted no rules
Trying to predict actually waste huge amount of processing this will slow the process ..
we have to think of a simple approach to achieve the formula.
Have you had the change to converting the file to a picture?
the chance*
ReplyDeleteI'm working on it at the moment, I'm going to convert files into a monochrome bitmap file so that you're able to see their content.
ReplyDelete@yaseen: no, you cannot predict the rest of a file from one part. but you can look at the entire file and describe it with a formula. actually that is what the huffman algorithm does. the sequence of 00000000001111111111 becomes 10*0, 10*1. That is a formula describing the file! like basic maths: 2+2+2+2 is described shortly as 4*2. 5*5*5 becomes 5exp3. multiplication is a simpler way to write down several additions etc.
ReplyDeleteif you think of a file as graph and take every byte as point in the graph (so you have points from 0 to 256 on the y-axis, the x-axis counts the bytes). well, actually better take every byte-point like a measured point in an experiment.
after the experiment you analyse your data and want to come up with a formula describing them.
in our case this would usually be a polynomial function that creates lots of ups and downs to follow the byte points.
the function will not really match the points it only has to be close enough to be exact once it is roundend. if the original byte-point was 166 and the formula says 165.7 then we are good enough.
has to be 0 to 255 on the y-axis of course
ReplyDeleteor what about considering every byte as one dimension in a multi(n-)dimensional space. so the entire file would be the surface of a n-dimnesional figure, itself having the dimension n-1.
ReplyDelete1 byte --> a point (0D)
2 bytes -> a line (1D)
3 bytes -> a plane (2D)
etc.
check out the idea of parallel coordinates:
http://www.cs.tau.ac.il/~aiisreal/
@Chris,
ReplyDeleteI doubt it would be a smooth polynomial..
it might be a collection of line points..
There is a thing you made me think of
we can surly draw a line polynomial between each sequenced two points..
this will give us N/2 functions
where B equal to the number of bytes in the file..
so we just hammered our file size to half..
Thanks guys
let us prune the idea more, keep up coming.
Read some basic information theory. There is a concept called Shannon's entropy which bounds the perfomance of any lossless compression algorithm conceivable, be it huffman or your weird function thingy. You're not going to get anything that hasn't already been done. Give up before you waste any more of your time.
ReplyDelete@Anonymous
ReplyDeleteThanks for your comment
My weird function isn't a compression algorithm. Its something not yet born
You've given me something to think about didn't knew about Shannon's entropy
"I want to shrink a 1GB file into a 20 KB file."
ReplyDeleteSounds like compression to me.
@Anonymous
ReplyDeleteI said that so I can simplify the concept to my readers.
What I want is to take a the very basic element of the file which I called it "The Seed" the seed carries a description of the entire file.
It could be used to "generate" back the original file.
Sounds unrealistic (read: impossible) to me.
ReplyDeleteYou'll have a hard time inventing something that hasn't been invented yet in compression algorithms, there's thousands of researchers probably working on it at this very moment.
@Anonymous
ReplyDeleteI would love to fail
http://hnaser.blogspot.com/2009/04/i-would-love-to-fail.html
I have tons of other unrealistic impossible ideas just like that
Life it too short, by standing and watching we will not reach any where
We should try and work. I can always easily put barriers and find problems in solutions .. its very easy to do that, the challenge is to find a solution to a problem ..
Thomas Edison failed like 1000 times, before he invented a light bulb.
Everything will sound unrealistic if you see it from a disappointing lens :)
Now this I call crazy idea lol
http://hnaser.blogspot.com/2009/03/human-api.html
Thanks for the great discussion dear friend :)
ReplyDeleteDear Mr. Nasser,
ReplyDeleteI've stumbled this, as have many others. This idea is brilliant, and could truly change the world forever. The only thing I ponder is the vast CPU intensity of the decryption. How will this be dealt with efficiently?
@Yoman82
ReplyDeleteThanks alot my friend for your great comment...
I am glad you liked the concept.. and approciate your suggestion..
Yes you are totally right, as the algorithm creates functions for the file, regenerating the file will need a lot of processing time.. and should require an intellegent approach to deal with the CPU usage, in term of threading and multi-processing...
thanks again for your valuable comment
Brilliant, but what you could do is have it make 2 files, one that contains some basic data about the "tree" and another that has the formula, so you dont have to make something from nothing. It allows for more generic formulae!
ReplyDelete@Lucas
ReplyDeleteMy man! you are genius !
i loved your suggestion, this way we can make things very neat and generic.
And to make it easy for the user to pick up only one file we can compress those two files with any compression algorithm to produce one file
I love it how you add the How-To's to this idea guys,
i really love it
keep it coming
@Dr4g0nF1y
ReplyDeleteThanks for your heart-warming comment,
The idea you just suggested is so great, we could in fact add it in the list of approaches we are going to use to achieve this..
Printing a 1GB file as dots into paper then reread it.. we can slightly change the idea and print dots into an image.. then reread them back..
It sounds feasabile but I didn't develop it..
Hi,
ReplyDeleteI'd be very interested if you could provide a short 'seed' function for the following 100 bytes:
176, 243, 22, 194, 163, 233, 27, 235, 172, 217, 126, 197, 114, 114, 20, 253, 103, 206, 55, 67, 61, 213, 36, 196, 29, 213, 132, 249, 204, 226, 93, 175, 121, 202, 186, 233, 162, 181, 158, 239, 177, 146, 173, 67, 76, 191, 60, 60, 7, 208, 182, 176, 9, 230, 195, 73, 183, 128, 124, 32, 93, 202, 105, 88, 162, 143, 133, 141, 36, 119, 140, 187, 17, 126, 130, 224, 2, 159, 60, 37, 163, 145, 145, 233, 41, 151, 34, 178, 146, 35, 30, 91, 147, 37, 232, 129, 117, 240, 214, 170
Sort of as a proof of concept?
Cheers
@Lynton
ReplyDeleteYes this will prove the concept,and I can spend hours trying to find a combination of functions to produce this set of bytes, but adding one extra byte will ruin everything..
What am trying to achieve is a generic program that can analyze these data and generate a function out of it (ceed) that will generate back the original file.
I'm currently reading Numerical Analysis books, and some signal processing (Fourear series) taking into consideration alot of other options
The road is not easy.. we should read the work of other to achieve that..
Thanks for your comment dear friend
I think you'd find that any function you get for that sequence of bytes would only be able to be expressed in an equal or greater amount of information than the 100 bytes themselves.
ReplyDelete@Lynton,
ReplyDeleteVery clever and smart!
Exactly, when I found a function it will be something like that
Sin(X) + Cos(x)+ 2Tan(x) + X^2 ....
this might be longer than the original, but ..
we can create our own shortcuts to functions
like refer S as sin..
C as Cos.. etc..
More over, if we created that forumla text.. we will easily compress it (it will be a text anyway) so it will compress really large..
I loved your comment.. you've given me something to think about
Thanks Lynton, hope to see you here in some other posts and get more of your remarkable comments
Can you show me with perhaps five small numbers? Something like:
ReplyDelete7, 48, 17, 33, 16
I just want to know what you end up with, and how you represent it using trig notation etc.
Quickly,,
ReplyDeleteI'll convert those into an X,Y points
(0,7)
(1,48)
(2,17)
(3,33)
(4,16)
I can create 4 equations out of it using the linear formula ..
Y - Y1 = (Y2 - Y1) (X - X1) /(X2 - X1)
Y - 7 = (48-7) * (X - 0) / (1 - 0)
Y = 41X
This is one formula..
we create the rest 3 .. and we try to find something in common to make them simpler
Now i know this is not the optimal solution, but we would like to find more ways to teach the computer who to predict these formula
Just wondering, what language are you planning to make this? It would have to be one thats cross platform or it might not catch on.
ReplyDelete@Lucas
ReplyDeleteGood question,
Python, fast, less resources, support complex numbers in case i needed them
If you could find a method to analyse a file.
ReplyDeleteYou should also be able to analyse a heavily encripted file, which would make it easier to decode.
@Temuchin
ReplyDeleteWhat a great Idea! I didn't think of that, this could really change the encryption algorithms too
Great thought dear
thanks for dropping by
The seed idea is incredible!
ReplyDeleteHave you considered Neural Networks? They can approximate and compress information ridiculously well. Just like animal and human brains, a neural network can be trained to produce any kind of output.
An interested data compression involving neural networks works as follows:
A 3-layer neural network (theoretically, 3 is all you need-ever) is created. The input size is chosen based on the chunks of data to be used. This variable is the one you'd play around with.
The output size is the same as the input. The network is trained to give THE SAME output that it receives. The catch: the middle layer is much, much smaller than the outer layers. A neural network is self-correcting. All you have to wory about is training it with data. When (and if) it returns all the data it is given, you can "cut" it in half, exposing the hidden middle layer. This becomes your output layer. It has learned to intelligently compress your data. Usually (inevitably?) this results in some fuzziness. The longer it is trained, the better it gets.
This type of compression is actually used quite often.
Good luck with your research. This is something that consumes me as well. It "feels" like such a simple problem, one that must have an elegant solution. Perfect data compression is the Alchemy of our time.
@Wally
ReplyDeleteI fully agree with you, neural networks is indeed a great possible solution to this problem how didn't I think of that?
it will for sure make a great help to this research .. thanks for the encouragement and your valuable commended comment
the concept is to regeneration rather than compression, i would like to compact this somehow by any means to a ceed and that replant that ceed back to the original file.. as you suggested neaural network
GOD i'm having great solutions and suggestions guys! i really thank you .. writing the idea in the blog really was helpful
thank you all
Some good findings
ReplyDelete..
http://www.johansens.us/sane/technotes/formula.htm
http://www.electrictactics.com/poly/polyin.html
I have been thinking about this and here is what I came up with.
ReplyDeleteBasicly in the end I came up with a way to always break even or save a few bits for every 16 bits.
I took the octal representation of 16 bits(At max its 7777) and i Converted to decimal and divided by 2 11 times. That gave me a decimal value. Such as this
6341
Convert to decimal.
3297
Then divided by 2 11 times
1.60986328125
Now we the first three digits. (get rid of decimal place) If you do end up with a whole number, you can either repesent it seperatly or have it bunched together with the decimal.
160 In Binary 10100000
1 60 1 111101
If we multiplied 1.6 by 11 times then we get.
3297.28
That is only .28 off the original. So we get rid of the .28 and we have the original.
10100000
or
1 111101
Compared to original in octal
110011100001
About 4 bits savings. Keep in mind that sense we are using 3 digits to represent a 4 digit number, after you multply again, the answer can be up to 9 off. Which will eat up your savings. But i haven't seen a number that goes over the original.
Also this technique is repeatable. The number that you get after you divide is the ending key. So you can repeat the steps on the new number.
A big pitfall is developing the logic for a program to be able to read this. There will be varience in how many bits will be used for the remainder and whether it ended in a whole number or not. If a deliminator has to be used then the savings gets killed.
After awhile I will start again on this but right now I have to step back and rethink it.
ryan,
ReplyDeleteGreat findings ..! I loved your approach of converting the value to Hexa and Octa ..
especially its repetitive
keep it coming .. ;)
Hello Guys
ReplyDeleteWhat'uuup
I've just come across your article in stumble upon
I'm a computer science major in al akhawayn university in Ifrane, morocco, school of science and engineering
I really like your idea
I'm very interested in pursuing a research on it.
So guys, if you don't mind if we can share email adresses to keep in touch
mine is a.elouafiq@aui.ma
Hello Notorious
ReplyDeleteIts great to do a research in such a deep topic.
Best of luck
we are here to support and please share your result with us
read the comments lots of guys here have given some ideas that maybe of use.
Maybe have the 1s and 0s make an image file. Perhaps have a line drawn based off of the binary. Say we get a 1 after a bunch of 0s. The line was straight before but now curves slightly and straightens up until the next 0, in which case it would curve in the other direction. Have the program trace it to get the data back.
ReplyDeleteOr have the binary represented as a line of multicolored pixels. Each pixel would represent an array of binary. Think of it as a transparent box. You can see though the box but you can only look at it from the top. But based off of whats shown you can recreate what the box represents. Each color variation would represent a unique piece of binary basically.
These are just theorys. I have no real testing or anything of that nature to back either up.
Your post made me think about how the DNA encodes so much information with so little data, which made me stumble upon this forum post - http://forums.xkcd.com/viewtopic.php?f=12&t=56076
ReplyDeleteAn interesting quote "The genetic code does not, and cannot, specify the nature and position of every capillary in the body or every neuron in the brain. What it can do is describe the underlying fractal pattern which creates them."
Someone thought about this - http://en.wikipedia.org/wiki/Fractal_compression
Hope you find this useful, good luck with your research!
Small idea, possibly good.
ReplyDeleteKinda hard to explain. Have premade binary stored in the decryptor. this binary refers to large common binary patterns. Now organize the compression binary in a predictable pattern, so programming logic can interpret it.
Example- deliminator-8bits-deliminator-8bits-... and so on.
The deliminator would cause the binary to be read plainly or with the premade batches of binary. The deliminator would be a 1 or 0.
Using that technique may add a extra bit for every octet, however it allows for more logic within the compression. Also it allows for duel translation for the binary.
Its possible this technique is repeatable. Again sorry if this was unclear.
Thank you all for your comments,
ReplyDeleteyou gave me new dimension to think through, especially Ryan and Guigouz
ryan I had a similar Idea to yours, to build a repository or library of common bits, place it on a cloud and just index it nice and smooth.
this way we need to connect the small program to the cloud which is easy these days..
thanks for refreshing this post.. keep the ideas coming
Cloud Storage! of coarse. Never would have applied that to this. With a little extra logic in the client you can get some data on overall statistics from the users. Specifically what binary patterns are most common and how frequently certain files are uncompressed. That way you could prioritize certain processes, such as popular sites and such.
ReplyDeleteI remember the Recaptcha company using user data to help design algorithms to scan books. Same principal i believe.
Using this extra logic would allow for a dynamic algorithm that changes as people use it.
Pitfalls- 1. Using a cloud system would mean that the client isn't doing the processing for the storage. which is to say, Its not up to the users computer to locate the specific pattern it needs. With that being the case, If the server becomes bogged down, it would actually increase the load times for pages. Also certain sites would get more "attention" that others.
2.Using a cloud system would require hosting an actual server. And a powerful one most likely. To get around this, you would have to split the processing somehow. Small idea for that. Allow anyone to host a micro server, windows or linux, from home. Also allow for enterprises to host as well. Use a system (like bit torrent) to keep the micro-servers updated across the board. One day (hopefully) you would have enough to host the entire array.
3. Storing the patterns may present a problem. Would need a data base, web host and so on. I have no idea how to design a SQL data base. so i naturally worry about it. I will post up an old cluster server system i designed awhile back.
I will continue after researching cluster solutions
Now this may not be useful for what you need, this system was used to make a High availability webserver. May not be the best to host apps, would have to tinker with it if you wanted to try using this.
ReplyDeletehttp://www.ctrip.ufl.edu/apache2-cluster-in-debian-lenny-howto
I made a few changes to make the cluster server more expandable. I set the data storage for the mysql to mapped dives that pointed to glusterfs shares. That allowed the data storage to be expandable to petabytes.
http://www.gluster.org/
GlusterFS allows for replication accross computers. Combine that with RAID 10. and you end up with a kind of RAID 10 10 set up. Very reliable data storage.
Using this system you could use consumer grade equipment to host a high availability cluster. This allows for load balancing and such as well. I have done limited tests of this system, and i was able to host a webserver and mysql across 10 computers. Unplugging one or two computers did not effect its hosting.
Disadvantage is there is like no support for such a system. so if there is a problem. There will be like no-one to help.
Ryan..
ReplyDeleteit was a balloon and you poped it up!
Cloud can really be the solution to this problem!
thanks for sharing your input into this! I started to see this is actually very possible..
Thanks again
Cool, work towards a solution, if any problems pops up, i will try to help.
ReplyDeletehmm, wouldn't the user still need to download the premade binary patterns everytime they connect to the server.
ReplyDeleteCould still work if there was a caching system on the client as well.
interesting. Apply existing compression to the common patterns found. Develop formulas essentially. This would magnify the effect 10 fold. and it would cut the server reqs. This is quite possible. Temporarily host the patterns normally while the server calculates the best way to compress them. The user would only have to store the formulas, Caching may still help as well.
ReplyDeleteCloud-side binary pattern compression and delivery. that sums it up well i believe.
Also consider the new silk browser by amazon. it has split processing between the device and their supercomputer.
I believe we have made progress, but other than me, I don't believe there is much traffic on this post. I think its time to renew the post with a summery of the ideas your looking at.
ReplyDeleteEssentially make a new post and submit it to stumbleupon and other such sites. I believe new eyes would push this even further.
This is will be hard to explain. Imagine a folded piece of paper. Its folded in half twice. The paper says unfold in a certain direction. Then after unfolding it in that direction, there is another graphic saying to unfold it another direction. After that, in the center of the paper, it says Data.
ReplyDeleteWhat i am getting at, is formulas or directions that acts as the compressed data that is sent. Then when the directions are followed, you end up with another set of directions. It continues until you end up with the original data.
No testing or research done on this. I do not think the cloud based compression will fit completely with what I ordinarily envisioned with this type of compression. Which was the customer's internet being limited not by the network speed but by their computer's processing power.
Never said it won't require some head scratching, as you said its a pretty complex algorithms all baked in one file,
ReplyDeleteI am not worried abt the storage and internet as much as I'm worried abt the speed of the extraction