Free Dictionaries, Free Knowledge
XDXF is a project to unite all existing open dictionaries and provide both users and developers with a universal XML-based format. XDXF stands for XML Dictionary Exchange Format, and specifies a semantic format for storing dictionaries. You can find the full format specification inside the formatstandard folder. The format is open and free for everyone to use. Anyone interested in its further development and popularization is welcome on GitHub.
The FreeDict project strives to be the most comprehensive source of truly free bilingual dictionaries. They are not just free of charge, but they give you the right to study, change and modify them, as long as you guarantee others these freedoms, too. Founded in 2000, FreeDict nowadays provides over 140 dictionaries in about 45 languages and, thanks to its members, grows continuously. Learn how to become a part of FreeDict.
Use Wherever You Want
FreeDict dictionaries offer you the greatest deal of flexibility possible: you can use them both on your computer and on your mobile phone, and all the lookups are offline. This means that you can travel abroad without the fear that your provider will make you pay a lot just for a few dictionary lookups. You will even be independent from a network connection.
Use For Any Purpose
Our data is useful for a wide range of applications, which is made possible through the use of the generic TEI XML format. This allows us to export our data into any format and for any purpose, be it an electronic/paper dictionary or a spell checker.
What's Happening
Thanks to Carlos Luna, our web site can now be viewed in Spanish!
Re-release of all dictionary exports: Due to a bug in the IPA generation, some dictionaries had incorrect IPA transcriptions for headwords. This has been fixed for all dictionaries now.
- 80 new releases: German - Finnish (2020.10.04), German - French (2020.10.04), German - Indonesian (2020.10.04), German - Polish (2020.10.04) and more…
- Work in the last 30 days:
- 6 commits in fd-dictionaries, tools
This page is obsolete.
The GitHub project got scrapped and offline_dictionary.com replaces it. Check this post instead.
But the technical information below still stands.
There you are, ready to learn lots of nice things.
Get the offline dictionary
[Screenshots: dictionary.com app; dictionary.com app settings; Root Explorer; Root Explorer's built-in SQLite Viewer; whole databases folder retrieved]
Get the offline dictionary: hacker version
Extract the data from the SQLite database
[Screenshots: DB Browser for SQLite; Visual Studio's Diagnostic Tools show high CPU usage on all cores]
Build the XDXF from the extracted data
- &
- <
- >
- Visual
- Logical
- Because the definition itself from dictionary.com is made outta (crappy) HTML, so it's already a visual representation of the definition;
- Because it would be too hard to parse this HTML and convert it to a semantic XDXF fragment, stripping out all of the visual information;
- Because my personal goal here is to be able to convert this XDXF using the Russian's tool so I can enjoy it on my PocketBook, and most likely this little tool will not support the 'logical' format.
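Here is a minimal sketch of what building one article could look like. It is written in Python as a stand-in, not the C# the tool itself uses. It assumes the `&`, `<`, `>` list above names the characters that must be escaped before text goes into the XML, and it uses the basic XDXF article shape `<ar><k>headword</k>...</ar>`. The `build_article` helper and the sample strings are made up for illustration.

```python
from xml.sax.saxutils import escape


def build_article(word: str, definition: str) -> str:
    """Return one XDXF article: <ar><k>headword</k>definition</ar>.

    Hypothetical helper, not taken from the original tool.
    """
    # escape() replaces &, < and > with &amp;, &lt; and &gt;
    return f"<ar><k>{escape(word)}</k>{escape(definition)}</ar>"


article = build_article("R&D", "work where cost < benefit")
print(article)
```

If the definition is meant to stay as markup in the visual format, only the text between the tags would go through `escape()`; this sketch treats the whole definition as plain text.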
[Screenshot: the output XDXF looks something like that]
And finally here is the XDXF:
[Screenshot: yeah... it's a pretty big mofo]
Download the 7zipped version there:
dictionary.com_5.2.2-08-08.7z
Damn, this guy is too big, and it crashes the Russian's tool that is supposed to convert XDXF to ABBYY ... crap.
Guess that will be the next episode then. Gotta do this shit by myself.
Performance considerations
In the current version of the offline database '08-08' there are 149135 word entries.
We need to get their IDs and then to go and grab their definition, plus get the 'similar' words that have the same meaning which are in another table.
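The lookup described above can be sketched like this, in Python rather than the tool's C#. The real table and column names of the dictionary.com database are not given here, so `words`, `definitions`, `similar_words` and their columns are pure assumptions, and the data is a tiny in-memory stand-in.

```python
import sqlite3


def load_entry(conn, word_id):
    """Return (definitions, similar words) for one word ID."""
    defs = [row[0] for row in conn.execute(
        "SELECT content FROM definitions WHERE word_id = ? ORDER BY rowid",
        (word_id,))]
    similar = [row[0] for row in conn.execute(
        "SELECT similar_word FROM similar_words WHERE word_id = ? ORDER BY rowid",
        (word_id,))]
    return defs, similar


# Hypothetical schema and sample rows, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE definitions (word_id INTEGER, content TEXT);
    CREATE TABLE similar_words (word_id INTEGER, similar_word TEXT);
    INSERT INTO words VALUES (1, 'fame');
    INSERT INTO definitions VALUES (1, 'widespread reputation');
    INSERT INTO similar_words VALUES (1, 'famed');
    INSERT INTO similar_words VALUES (1, 'overfamed');
""")

# Step 1: get every ID. Step 2: grab definitions and similar words per ID.
word_ids = [row[0] for row in conn.execute("SELECT id FROM words")]
entries = {word_id: load_entry(conn, word_id) for word_id in word_ids}
```

That is two extra queries per word, which is why doing all 149135 of them one after the other gets slow.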
Tasks
Doing this synchronously would take, I guess, a couple of days.
In an async way though, a good hour is required.
Right now I'm using Task to create the parallel tasks, with one task responsible to build the definition of one word. Which means, that I am creating 149135 tasks :)
...
..
.
'OMFG WTF are u doing!?' you are thinking.
Fear not, the Task class works with a goddamn good task scheduler. Yes, I will create 149135 Task objects, but only 8 or 10 will actually run concurrently. All of the other tasks will be marked as WaitingForActivation.
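The same idea can be shown with a Python thread pool, as a rough analogue of the .NET scheduler and not the original code: lots of tasks get created up front, but only a fixed number run at any moment. The pool size of 8, the task count of 2000 and the `build_definition` stand-in are all assumptions for the demo.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

running = 0   # tasks executing right now
peak = 0      # highest value 'running' ever reached
counter_lock = threading.Lock()


def build_definition(word_id):
    """Stand-in for the real per-word work; tracks how many run at once."""
    global running, peak
    with counter_lock:
        running += 1
        peak = max(peak, running)
    time.sleep(0.001)  # pretend to read from the database
    with counter_lock:
        running -= 1
    return word_id


with ThreadPoolExecutor(max_workers=8) as pool:
    # All 2000 tasks are queued immediately; the pool runs 8 at a time.
    futures = [pool.submit(build_definition, i) for i in range(2000)]
    results = [f.result() for f in futures]

print(len(futures), "tasks created, at most", peak, "ran at the same time")
```

The queued tasks just sit there until a worker is free, which is the Python counterpart of the waiting state described above.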
It's all good right there. A Task object (I guess) only contains a reference to a delegate. Which is like a pointer (I still guess), which is like an Int64 on my 64-bit CPU (I'm still guessing).
So it's prolly like:
149135 * 64 bits = 9544640 bits
=> 1193080 bytes
=> 1165 KB
=> 1.14 MB
It's nothing.
Plus, I clean the tasks list every second to remove the completed tasks (it's easier to debug that way: I'm left with only the remaining stuck tasks).
And BTW I tried using the new Parallel static class. This is shit. My CPU was not working at all, even after setting a MaxThingy in its configuration to MAXINT. It's just not brutal enough, and was going at least 4 times slower.
Maybe I just don't know how to get the best outta it but anyway I reverted and used Task instead.
Threaded SQLite
To thread the reads from the SQLite database we need to open one connection per thread. Too bad we will be suffering lotsa overhead, but that's the only way. Still, it's slow. So I tried a couple of things to speed up the process. Only the first one really helped.
First I moved the SQLite database file to my SSD drive.
This worked well, as before I could see that my CPU was not working 100%. I guess the bottleneck was the I/O in the drive.
Then I tried to move the SQLite database file to a RAM drive. Why the fuck not uh?
I used ImDisk Virtual Disk Driver and copied/pasted the file there. No speed increase, but this will stop fucking my poor SSD. So I still recommend it, to save the life-span of your SSD a little bit.
Finally I moved the data to a SQL Server Express instance. I used the trial version of ESF Database Migration Toolkit to make the migration. But no speed increase either. So there was no point.
Storing the whole thing
Let me explain.
For instance, when we read the definitions for the word 'fame' we get stuff. We also know that 'famed', 'overfamed', etc. also have the same definition as 'fame'.
But, when later I read the definition of 'famed', we get an extra new definition that only relates to the 'famed' adjective. In essence, 'famed' will have its own definition plus the ones from 'fame'.
You can check it out online, directly at dictionary.com. Go on, type 'fame', then open another tab and type 'famed'. Now compare both. The word 'famed' outputs the 'famed' definition + the 'fame' definitions.
So yeah... With these considerations, I have to store the whole thing in memory and, little by little, update each word's definitions with its 'parent' word's definitions.
There must be another way, another coding design, but so far I don't see one.
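To make the in-memory approach concrete, here is a minimal Python sketch (not the original C#). Every word's definitions live in one dictionary, and each time a word is processed its definitions are also appended to the words derived from it. The `store` and `add_entry` names and the sample definitions are invented for illustration.

```python
from collections import defaultdict

store = defaultdict(list)  # word -> all definitions collected so far


def add_entry(word, definitions, similar_words):
    """Record a word's definitions and copy them to its derived words."""
    store[word].extend(definitions)
    for derived in similar_words:
        store[derived].extend(definitions)


# 'famed' happens to be processed first here; the order is not guaranteed.
add_entry("famed", ["very well known"], [])
add_entry("fame", ["widespread reputation"], ["famed", "overfamed"])

print(store["famed"])
```

Because 'famed' can show up before or after 'fame', nothing can be written out until every word has been processed, hence keeping the whole thing in memory.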
Updating the whole thing
Since multiple threads are messing with the same definitions, we need to lock that shit so it's thread safe. For this job I'm using a ConcurrentDictionary with a List inside. Because I store definitions by word, and I add words from different threads, I need a ConcurrentDictionary. And because sometimes I update the definitions from different threads too, I also need to protect the definition collection, so I'm using a lock around the List.
I also tried SynchronizedCollection and ConcurrentBag instead of the List. Now, I lack knowledge and experience in threaded coding in general, but I had issues with SynchronizedCollection. These mofos were sometimes throwing a CollectionChanged exception (or something). Which probably means that only each atomic operation like Get/Add/etc. is locked, not a whole enumeration. So I had other threads messing with my collection during a foreach.
But with the ConcurrentBag I never had a single exception. I guess that's because ConcurrentBag has a locking mechanism per thread, not only per atomic operation.
Anyway ConcurrentBag was overkill so a simple lock around my List is most likely faster.
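A rough Python analogue of the "simple lock" approach, under the assumption that one lock guarding both the dictionary and its lists is good enough (Python has no ConcurrentDictionary, so this is not a one-to-one port). Readers take a copy under the lock, so nobody ever iterates a list while another thread is appending to it. All names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

definitions = {}                      # word -> list of definitions
definitions_lock = threading.Lock()   # guards the dict and every list in it


def add_definition(word, text):
    with definitions_lock:
        definitions.setdefault(word, []).append(text)


def snapshot(word):
    """Return a copy, so the caller can loop over it without holding the lock."""
    with definitions_lock:
        return list(definitions.get(word, []))


with ThreadPoolExecutor(max_workers=8) as pool:
    for i in range(1000):
        pool.submit(add_definition, f"word{i % 10}", f"definition {i}")

counts = {word: len(snapshot(word)) for word in list(definitions)}
```

Iterating over the copy is what avoids the "collection changed during foreach" kind of failure described above.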
Writing the XDXF
One bug
The weird thing is I had exactly the same issue at work. You know when I do... uh... 'tactical' programming and shit. When I do operator style CQB coding.
What's interesting is that I found something on the interwebz. People talking about the freaking DataReader being lazy. Like, unless you try to evaluate the thing linked to the reader, nothing happens.
Makes sense. Moreover, I was using IEnumerable to try and optimize the reads from the SQLite database. So it could be... that somewhere in my foreach loops, somehow an iterator is getting lost along the way, which means that one item will never be evaluated. Which means that the reader will never try to read. Because no one needs the data.
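The lazy behaviour in that theory is easy to reproduce with a Python generator, used here as a stand-in for a lazily evaluated IEnumerable: nothing is read until somebody actually iterates. `read_rows` and the `reads` log are made up for the demo.

```python
reads = []  # log of every row that was actually fetched


def read_rows():
    """Generator standing in for a lazy reader: a row is only read on demand."""
    for i in range(3):
        reads.append(i)
        yield f"row {i}"


lazy = read_rows()
reads_before = len(reads)   # still 0: creating the generator read nothing
rows = list(lazy)           # forcing evaluation fetches every row
reads_after = len(reads)

print(reads_before, reads_after)
```

If `lazy` were dropped without ever being iterated, `reads` would stay empty, which is exactly the "no one needs the data" scenario.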
That was a very interesting theory. Unfortunately, I ended up testing with ToList() everywhere in the code, making sure that everything would be evaluated. And the bug was still there. Still waiting, chilling, for the last 5 tasks.
[Screenshot: this is the usage at the 149133rd word. It's been like that forever.]
And the surprising thing is that it doesn't crash. Nope. If I put a breakpoint after the reader, wait for the break, and check the value returned by this guy, it's valid. There is actual valid data in there. Nothing fancy, nothing huge, just the usual definition.
Very weird.
So even though the extraction from the database is around 30/40 mins, it can last up to 3 hours just because these last 5 freaking tasks are chilling.
Which is still better than doing this shit synchronously...
Solved
Not sure how though. I started to remove the completed tasks from my huge tasks list. I do this from an 'update' task that runs in a while(true) loop and shows the progress. Every second I RemoveAll() the completed tasks.
Also I dropped the ConcurrentBag and used a simple lock.
Those are the two actions I did, and now it completes fine.
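The pruning loop can be sketched like this with Python futures, again as an analogue and not the original C#. The loop drops finished tasks from the list and reports how many are left. The interval is shortened to 0.05 seconds for the demo instead of one second, and `work` is a placeholder.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def work(i):
    time.sleep(0.001)  # placeholder for building one definition
    return i


with ThreadPoolExecutor(max_workers=8) as pool:
    pending = [pool.submit(work, i) for i in range(500)]
    passes = 0
    while pending:
        # Keep only the tasks that are still queued or running.
        pending = [f for f in pending if not f.done()]
        passes += 1
        print(f"{len(pending)} tasks remaining")
        time.sleep(0.05)
```

Dropping a finished task also drops its result, so this only works when the tasks write their output somewhere shared, as the definitions store does here.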
Licence
The license for 'DictionaryDotComToXdxf' is the WTFPL: Do What the Fuck You Want to Public License.