basicprogramming.org


Author Topic: A programming problem: Finding syllable sets  (Read 248 times)
syzygy
« on: Nov 03. 2009, 03:02 »

Hi all,

I'm currently trying to solve a problem, and I'm lacking a good idea of how to go about it. Any suggestions are welcome.

I have a piece of text in an unknown language, composed of different "words". Each "word" is composed of two syllables, one "prefix" and one "suffix". (Imagine something like "Fourteen, fifteen, sixteen, onetwenty, twotwenty...") The sets of prefix and suffix syllables are in general different from each other, but they use the same character set (approx. 20 different characters), and a prefix syllable may also occur as a suffix syllable. Syllables come in all different lengths.

How do I analyse the text to find the syllable sets that were used to create the vocabulary? (The main problem is that I don't know where in the individual words the boundaries between prefix and suffix run.)

The text consists of about 20,000 words with most of them running between 4 and 8 characters in length. The syllable sets should be in the order of 40 different syllables each for prefix and suffix.

Bonus points: The problem is aggravated by --

*) the fact that the character set is partly unknown, i.e. it's possible that two characters perceived as different are really the same, and vice versa (I do have a digital transcription, but it may be imperfect),

*) the fact that a number of words are not composed of prefix/suffix, but consist of one or three syllables, or do not follow the composition rules at all (i.e., if one word can't be created from one prefix/suffix set, this doesn't necessarily mean that the set is wrong).
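
One way to make the second point concrete is to score a candidate pair of syllable sets by the fraction of words it explains, instead of rejecting it at the first word that fails. A minimal Python sketch, assuming a hypothetical `coverage` helper and invented toy words (not Voynich transcription data):

```python
def coverage(words, prefixes, suffixes):
    """Fraction of words that split into one known prefix plus one known suffix."""
    def explained(word):
        # Try every boundary position inside the word.
        return any(word[:i] in prefixes and word[i:] in suffixes
                   for i in range(1, len(word)))
    if not words:
        return 0.0
    return sum(explained(w) for w in words) / len(words)


# Toy vocabulary; "xyz" stands in for a word that follows no rule.
words = ["fourteen", "fifteen", "sixteen", "onetwenty", "twotwenty", "xyz"]
prefixes = {"four", "fif", "six", "one", "two"}
suffixes = {"teen", "twenty"}
score = coverage(words, prefixes, suffixes)
```

A set that explains most but not all words still gets a high score, so irregular words lower the score without disqualifying the set.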

Any ideas? A brute force attack of testing all the possible syllables seems to be ruled out.

(If you wonder what this is all about, here's the background info: http://voynichthoughts.wordpress.com/stroke-theory/.)

Thanks in advance,

syzygy

LanceGary
« Reply #1 on: Nov 03. 2009, 05:34 »

Ha! You have visions of greatness!

But perhaps that manuscript is just a hoax?

Lance

syzygy
« Reply #2 on: Nov 03. 2009, 05:55 »

As a friend of mine once said, "When you dream, dream big!" ;-)

The idea of a hoax has often been discussed. The problem is, when analysing the Voynich Manuscript, it shows a lot of internal structure, rules for word creation etc. It's simply too ordered to be just random scribbles. That would imply the use of some kind of "machinery" or at least an algorithm to create the text we see, if it really was a hoax.

But then OTOH there's always an exception to every rule found in the Voynich, and some deviation from the norm, which would mean -- if it was an "automatically generated" hoax -- that somebody had their hands in there to manipulate the results again -- for what reason could that be?

Besides, hoaxes of the period tend to be fairly crude and not as elaborate as the Voynich is, so while it can't be ruled out that it's a hoax, it seems unlikely.

My personal take is that the actual encipherment is fairly simple, and that our problems stem from the lack of a crib to start with (which is the underlying plaintext language, and such), and from the use of an invented alphabet where we can't know which two characters are different and which are the same -- that was simply a stroke of genius!

syzygy

rdc
« Reply #3 on: Nov 03. 2009, 08:24 »

Very interesting. I agree that this is probably not a hoax, due to the complexity of the structure of the manuscript.

Given the structure of the book, and the associated illustrations, I would bet that this is an alchemical or medicinal book written in an arbitrary language known only to the initiates of the particular school, which may have been just one person, the author.

It seems obvious to me (but what do I know :) ) that this is not a cypher but an invented, specialized language. I think the statistical evidence is strong for this assumption. If this is true, cypher techniques will not work. Cyphers assume that the underlying message follows the rules of a known grammar. If this is an invented language, the message is in plain text, but unknown since the underlying grammar is unknown.

If that is true, attempts to understand the text will be useless. A Rosetta stone will be needed to translate the text, as was the case with the Egyptian hieroglyphs. Since no one has found such a thing, I doubt that anyone will figure what this is really all about.

As to your question though, the only method to use for what you propose is the brute force method. Since the limits are not known, there is no rule that you can employ to do this; you have to use all rules. That is, you have to work from the smallest case (a 1-letter prefix or suffix) to the largest case (the length of the word less the other part) and run through them all. There is no other way to do it.
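
The enumeration described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `split_counts` is a hypothetical helper name, and the word list is the toy example from the first post rather than real transcription data. It tries every boundary position in every word and counts how often each candidate prefix and suffix turns up.

```python
from collections import Counter


def split_counts(words):
    """Count every candidate prefix and suffix over all possible split points."""
    prefix_counts, suffix_counts = Counter(), Counter()
    for word in words:
        # A word of length n has n - 1 possible boundaries.
        for i in range(1, len(word)):
            prefix_counts[word[:i]] += 1
            suffix_counts[word[i:]] += 1
    return prefix_counts, suffix_counts


words = ["fourteen", "fifteen", "sixteen", "onetwenty", "twotwenty"]
prefix_counts, suffix_counts = split_counts(words)
```

Note that listing the splits is cheap: 20,000 words of at most 8 characters give at most 140,000 candidate splits. The expensive part is deciding which combination of candidates forms the real syllable sets.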

syzygy
« Reply #4 on: Nov 03. 2009, 08:42 »

Yeah, invented language has been suggested, too. The only real argument against that is that the concept of invented languages seems to be later than the Voynich by at least a century, but of course this is a weak argument...

Anyway, as regards the decipherment, the problem is that such a brute force attack seems to be beyond reasonable computing power:

Assume we only deal with 5-character syllables and -- to simplify things -- with a 10-letter alphabet, which means that we have a set of 10^5 different possible syllables (I won't even go into the prefix/suffix thing). Assuming that we have to pick the correct combination of, for example, 10 syllables (out of those 10^5 total) to recreate the vocabulary, we'd arrive at a daunting number of combinations to check: C(10^5, 10), which is on the order of 10^43.
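
As a quick check of the scale: the number of ways to pick 10 syllables out of 10^5, ignoring order, is the binomial coefficient C(10^5, 10). A small Python sketch (the variable names are just for illustration) computes it exactly:

```python
import math

alphabet_size = 10
syllable_length = 5
set_size = 10

# Every string of 5 characters over a 10-letter alphabet.
candidates = alphabet_size ** syllable_length

# Unordered choices of 10 syllables from those candidates.
combinations = math.comb(candidates, set_size)

# Power of ten of the result (number of digits minus one).
magnitude = len(str(combinations)) - 1
```

Here `magnitude` comes out as 43, i.e. roughly 2.75 x 10^43 combinations, which is well beyond what an exhaustive search could cover.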

I was hoping for something less demanding, like a genetic algorithm. For example, if two syllables already account for a good deal of the text, we may assume they're correct, and hence limit ourselves somewhat for the rest of the examination...
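
The "keep what already explains a lot" idea is closer to a greedy search than to a genetic algorithm, and that is what the following Python sketch does. Everything in it is an assumption made for illustration: the function names, the tie-breaking rule (prefer syllables that occur at many split points), and the toy vocabulary, which is invented and not taken from the Voynich transcription.

```python
from collections import Counter


def explained(word, prefixes, suffixes):
    """True if the word splits into a known prefix plus a known suffix."""
    return any(word[:i] in prefixes and word[i:] in suffixes
               for i in range(1, len(word)))


def greedy_syllables(words, rounds):
    """Grow the prefix/suffix sets one pair per round, keeping earlier picks."""
    prefix_counts, suffix_counts = Counter(), Counter()
    for w in words:
        for i in range(1, len(w)):
            prefix_counts[w[:i]] += 1
            suffix_counts[w[i:]] += 1

    prefixes, suffixes = set(), set()
    for _ in range(rounds):
        remaining = [w for w in words if not explained(w, prefixes, suffixes)]
        if not remaining:
            break
        # Only splits of still-unexplained words are considered.
        candidates = {(w[:i], w[i:]) for w in remaining for i in range(1, len(w))}

        def score(pair):
            p, s = pair
            gain = sum(explained(w, prefixes | {p}, suffixes | {s})
                       for w in remaining)
            # Ties are broken by how common both parts are overall.
            return gain, prefix_counts[p] * suffix_counts[s]

        best_prefix, best_suffix = max(sorted(candidates), key=score)
        prefixes.add(best_prefix)
        suffixes.add(best_suffix)
    return prefixes, suffixes


# Toy vocabulary: every combination of 3 made-up prefixes and 3 suffixes.
words = [p + s for p in ("qo", "da", "she") for s in ("kal", "dy", "in")]
found_prefixes, found_suffixes = greedy_syllables(words, rounds=3)
```

On this clean toy input the three rounds recover both sets. Real data with irregular words and transcription errors would need a cap on the number of rounds and probably a better scoring rule, since a greedy search can lock in a wrong early choice.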

syzygy

LanceGary
« Reply #5 on: Nov 03. 2009, 09:32 »

Do you know any Chinese or Vietnamese languages? There have been suggestions that the language (if it is a human language) might belong to the Chinese or Vietnamese language group. These languages often have single syllable words distinguished by tonal patterns...

Lance

rdc
« Reply #6 on: Nov 03. 2009, 10:20 »

I don't myself, but reading the article I think a case could be made that this is a script version of some Asian dialect, which has some precedent. The statistical information has some similarities to some Asian languages, but so far no real progress has been made in this direction, I take it.

I personally think this is an independent language, more as a gut feeling than from any real hard evidence. The whole thing feels like some alchemy handbook. Take the illustrations. The mixture of elements within the pictures is probably symbolic, rather than just catalog images. The symbolism probably reflects relationships that are talked about within the text, and given as examples. They may in fact be pictorial recipes or formulas written in a shorthand notation for easy remembrance.

Anyway, this is just my theory. I doubt we will really ever find out the truth.

LanceGary
« Reply #7 on: Nov 04. 2009, 03:10 »

A friend comments as follows:

"[The Voynich manuscript] piques the curiosity, certainly. There seem to be lots of reasons
why it isn't a hoax.

"The stroke theory sounds one of the sillier ones to me.


"If by a private language, you mean the writings of a mad person, then
I think that would be a very good explanation. Something like
C. S. Lewis's Narnia world with different invented planets, plants and so
forth and an invented language like Tolkien's. A private fantasy book.
Not necessarily even a mad person - an imaginative and inventive child
might produce just such a thing. I'd say that the plants that show
characteristics of real ones fit that exactly - you can see a clever,
artistic child copying out a bit of this, melding it with a bit of
that and then naming it in his secret language.


"That seems to fit all the facts to me.


"It's almost a normal part of development - I remember exchanging notes
in runes with a friend of mine at school after reading the Lord of the
Rings. A lonely and energetic child might go as far as writing a whole
book. Given the levels of infant mortality, such a child (probably a
sickly child anyway, to have so much time alone) may have died and the
book saved from his possessions."

syzygy
« Reply #8 on: Nov 04. 2009, 03:34 »

Quote
"The stroke theory sounds one of the sillier ones to [my friend]."

Fair enough. Did he give a reason?

In general, I'm not sure if this is the right place to discuss the Voynich. Initially I came here to get some programming help... ;-)

Cheers,

syzygy

P.S.: The Asian theory has been considered, but as yet nobody could map the Voynich letters to any of the SEA languages. Likewise, the Voynich shows all the characteristics of Western culture (in writing, illustrations, hairdo, clothes, etc.), so that it's difficult to construct a link to the Far East.

As for the "child's project", which child would have had access to an extensive library around 1450? Not to mention that the cost of the vellum alone would have been the equivalent of several thousand $ today...

LanceGary
« Reply #9 on: Nov 04. 2009, 04:01 »

Fair enough - sorry to be off topic. I guess it would have to be a rich sickly child...

Lance