Design Of Distributed Repository Backend

A home for all kinds of Puppy related projects
HairyWill
Posts: 2928
Joined: Fri 26 May 2006, 23:29
Location: Southampton, UK

Design Of Distributed Repository Backend

#1 Post by HairyWill »

I've scoped out an initial design for an improved repository system here:
http://puppylinux.org/wikka/Distributed ... Repository
It should be straightforward to copy stuff from the old system to the new one.
Following the discussion at Puppy Website: Package Lists, I thought it would be good to get the discussion going in a way that all who wanted to comment could get their ideas written down in the same place.

The design is concerned with the backend of a distributed repository system. Whilst it needs to provide a few user API functions, it is not intended to offer a full end-user interface. That function could be provided either by a website such as prit1's proof of concept or by client software similar to PSI.
Ideally the repository system would also provide a search mechanism to help client systems work faster.

It would be great if those interested could edit the RFC on the wiki.
Will
contribute: [url=http://www.puppylinux.org]community website[/url], [url=http://tinyurl.com/6c3nm6]screenshots[/url], [url=http://tinyurl.com/6j2gbz]puplets[/url], [url=http://tinyurl.com/57gykn]wiki[/url], [url=http://tinyurl.com/5dgr83]rss[/url]

tombh
Posts: 422
Joined: Fri 12 Jan 2007, 12:27
Location: Bristol, UK

#2 Post by tombh »

Hi there HairyWill,

I've been meaning to reply for a while; I've just needed to let it all sink in! You really have been thinking about this, which helps to highlight the scope and the details of what is involved here.

I know I've expressed reluctance regarding such a project, but that's not to say that I wouldn't love to see this come about nor that I wouldn't be willing to enthusiastically offer whatever help I can. It's just that I don't think I could single-handedly take responsibility for it.

I've read your wiki entry and have obviously been following the Package Lists thread, and there really is a lot to consider -- I still don't feel I've thought about it enough, but I just wanted to say a few things so that my silence wasn't misinterpreted. Firstly, I have no experience of this kind of thing, other than general server administration and PHP programming, so I may be ignorant of established standards in this area. My own take on the solution differs slightly from yours, so I thought I'd try to explain it here first before I edited the wiki page.

Rather than each server in effect being capable of independence, my feeling is that there should only be one master server. This master would be the only point of access for uploading and would therefore also be the only point at which the package lists could be updated and maintained. The package lists would be stored in a single file (maybe in XML rather than CSV) simulating a database. Perhaps even a fully-fledged SQL database could be used and XML files parsed from it automatically whenever PSI (or whatever other client) requested it. Mirrors would then be exact replicas of the master filesystem and would reflect its contents through rsync or FTP.
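Just to make that a little more concrete, a single entry in the XML package list might look something like this (the tag names and values are purely my own invention, nothing official):

[code]
<!-- hypothetical sketch only: every tag name and value here is a placeholder -->
<package id="abiword-2.4.6">
  <name>abiword</name>
  <version>2.4.6</version>
  <category>wordprocessor</category>
  <maintainer>someuser</maintainer>
  <md5sum>00000000000000000000000000000000</md5sum>
  <size>1532768</size>
  <url>http://master.example.org/packages/abiword-2.4.6.pet</url>
  <description>AbiWord word processor</description>
</package>
[/code]

PSI (or the website) would only ever read this file; the master server would be the only thing that ever writes to it.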

As far as I can see the only shortcoming that this approach has compared with your current suggestion is that mirror-admins could have no choice over the packages they served (though there are ways around this). However, there are a number of benefits, namely to do with managing the interchange and communication between numerous servers. The XML file/database would essentially be the workhorse of the whole system -- it would be used to manage all the meta-data, categories, deletion requests and user browsing, and so all the files could potentially exist in a single directory without any detriment to the end-user's browsing experience. It would also only exist on, and be accessible from, the master server -- daily backups would of course be taken! I cannot immediately see any need for cron jobs other than for the slave servers' regular execution of rsync or FTP. If the system is properly implemented the XML file/database would always exactly reflect the contents of the filesystem, and so there should never be a need to have one update the other.

As for the organisation of the metadata I'm in complete agreement. I'm not entirely sure how digital signing works, but I guess that the private key can be automatically applied at upload time if the author has already signed in to the website using their unique cookie session password? That would save an extra step during compilation of the package. So there would be minimal fields to fill in and as much meta-data would be automatically generated as possible.

Does that make any sense? Had you already considered this approach but found too many caveats that I haven't noticed yet?

Tally-Ho :)

Pizzasgood
Posts: 6183
Joined: Wed 04 May 2005, 20:28
Location: Knoxville, TN, USA

#3 Post by Pizzasgood »

I like the idea of being able to have only a portion of the packages on a server. That way people who own smaller (and cheaper) servers could still host parts of the repository. As the number of packages increases, this would be a bigger issue.

Another benefit of partial mirrors is that if a package is illegal in some countries but not others, the mirrors can take appropriate measures. Otherwise, a single package that was illegal in the US would make all US-based servers illegal. That would put us in the situation of either dropping US mirrors or not hosting that package (thus hurting the rest of the globe).


Yes, I think deletions should be approved first. However, maybe allow the package's maintainer to flag it as unstable immediately, in case he finds a bug. Also have a way for the admin-types to override that in case a package maintainer goes loco.

And the client should bring up a big red warning before downloading anything flagged as unstable (complete with a "yes for all" button for the hardcore testers who feel inclined to download fifteen packages of questionable stability in one fell swoop).

Maybe give it "magnitudes" of instability: "works", "mostly works", "almost works", "unusable", and "will implode your computer after killing your dog and dyeing your clothes pink".


Edited: clarified what I meant about small servers
Last edited by Pizzasgood on Fri 11 Apr 2008, 02:48, edited 1 time in total.
[size=75]Between depriving a man of one hour from his life and depriving him of his life there exists only a difference of degree. --Muad'Dib[/size]
[img]http://www.browserloadofcoolness.com/sig.png[/img]

HairyWill
Posts: 2928
Joined: Fri 26 May 2006, 23:29
Location: Southampton, UK

#4 Post by HairyWill »

It is good to be talking about this.
I'm off camping for a few days.
Will
contribute: [url=http://www.puppylinux.org]community website[/url], [url=http://tinyurl.com/6c3nm6]screenshots[/url], [url=http://tinyurl.com/6j2gbz]puplets[/url], [url=http://tinyurl.com/57gykn]wiki[/url], [url=http://tinyurl.com/5dgr83]rss[/url]

richard.a
Posts: 513
Joined: Tue 15 Aug 2006, 08:00
Location: Adelaide, South Australia

#5 Post by richard.a »

tombh wrote:Rather than each server in effect being capable of independence, my feeling is that there should only be one master server. This master would be the only point of access for uploading and would therefore also be the only point at which the package lists could be updated and maintained. The package lists would be stored in a single file (maybe in XML rather than CSV) simulating a database. Perhaps even a fully-fledged SQL database could be used and XML files parsed from it automatically whenever PSI (or whatever other client) requested it. Mirrors would then be exact replicas of the master filesystem and would reflect its contents through rsync or FTP.
Some readers will be aware of my association with what was called Lindows in 2002, which changed its name to Linspire and obtained a US$22 million cash injection in mid-2004 (from memory), courtesy of Mr William of Gates fame.

If you care to read the forums (now at Freespire) you will see that many of us got burned a few weeks ago when they pulled the plug on their CNR repository in favour of a non-working beta "new CNR" that was designed not to work with Linspire v5 and 5.1 or Freespire v1 and 1.1.

I believe it would be highly dangerous to restrict just one location to be a repository.

While dotpups and dotpets don't work the same way as CNR does/did, and PC-BSD dotpbis don't either, it might be wise to ensure that server-side stuff NEVER gets relied upon, the reason being that the Linspire community has almost entirely bailed out. I refer you to the introduction of Peter van der Linden's "Guide to Linux"; in 2005 he spent some six months working with the Linspire insider team and other users.

Words like "legacy" and "necessary hardware upgrades" send chills down my spine as a result of what just happened, so please take my caution very, very, seriously, if you don't want to frighten users away who got burned by Mr. Gates first, and by Mr. Robertson second.

I'm very, very, serious.

Puppy is an excellent product. Like Linspire, there's heaps of hardware it doesn't work successfully on. So let's keep private repositories working, eh?


I host all the downloads I've tried out on-line, because I know they work. I don't choose to be a mirror. But I need to be able to download either across my LAN or from someone else's location.

Richard in Adelaide
Capital of South Australia
one-time tester for a range of OS's.
[i]Have you noticed editing is always needed for the inevitable typos that weren't there when you hit the "post" button?[/i]

[img]http://micro-hard.dreamhosters.com/416434.png[/img]

HairyWill
Posts: 2928
Joined: Fri 26 May 2006, 23:29
Location: Southampton, UK

#6 Post by HairyWill »

tombh wrote:It's just that I don't think I could single-handedly take responsibility for it.
I'm certainly not expecting you to take any responsibility for it. The website could provide a search interface, i.e. prit1's work. I think it is prudent that the website is on a different account, to avoid problems associated with a repository node using lots of transfer/bandwidth. The website would be a good place to store the definitive list of nodes and also copies of public keys for package contributors. Nodes themselves should not host public keys as it would be easy for them to tamper with them. If the keys are held on the central website it is easy for a package maintainer to confirm that their key is still being distributed unmodified. Hopefully after we have thrashed this out a bit longer I should be able to build an example node.
tombh wrote:Rather than each server in effect being capable of independence, my feeling is that there should only be one master server. This master would be the only point of access for uploading and would therefore also be the only point at which the package lists could be updated and maintained.
From an administration perspective this would be much simpler. I still feel that the benefits of there being no overall control, as highlighted by PG and Richard below, are worth the extra effort. I think it is important at this point to consider the reasons that an individual might choose to maintain a node. I suspect that most people would want to do more than just offer bandwidth/storage. Hosting your own node means that you can assert things such as:
"I have tested all the packages on my mirror"
"they will all work with wobblyhydrant-3.99"
"I know that I am not hosting anything that breaks my religious/legal/moral obligations"
A mirror admin also has the ability to add their own packages into the system.
tombh wrote:there are a number of benefits, namely to do with managing the interchange and communication between numerous servers. The XML file/database would essentially be the workhorse of the whole system -- it would be used to manage all the meta-data, categories, deletion requests and user browsing and so all the files could potentially exist in a single directory, without any detriment to the end-user's browsing experience.
I agree that the communication in a distributed system is much more complicated. I think it is an interesting challenge. It was never my intention that a user be able to directly browse the repository. The client-node API should be able to dynamically generate views of the packages, but the user will not browse the file hierarchy directly. My statements on the wiki about directories and the number of inodes are incorrect. I am now aware that the limits are per filesystem, not per directory.
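For example (and this is only a sketch: the script name, parameters and output format are all placeholders, nothing is decided), a client such as PSI might ask a node something like:

[code]
# hypothetical client-node query; api.php and the parameter names are invented for illustration
wget -q -O - 'http://node.example.org/repo/api.php?action=search&category=network&name=dillo'
[/code]

and the node would answer with a dynamically generated XML or CSV fragment describing the matching packages, rather than the client walking the directory tree itself.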
tombh wrote:As for the organisation of the metadata I'm in complete agreement. I'm not entirely sure how digital signing works, but I guess that the private key can be automatically applied at upload time if the author has already signed in to the website using their unique cookie session password?
The value of the private key is lost if it is held on the server. The package author would have no way of preventing the repository from modifying the package and re-signing it at any time in the future. My understanding is that it is relatively simple to create a PGP key pair, and it would be simple to write a small end-user application that uses a key on a user-supplied USB stick or CD, for example, to sign the package before uploading.
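With a GnuPG-style tool the whole process for a package author might be little more than this (the filenames and key name are invented, and I am assuming the keyrings live on the USB stick by pointing GNUPGHOME at it):

[code]
# keep the keyrings on the removable media rather than on any server
export GNUPGHOME=/mnt/sda1/gnupg

gpg --gen-key                                          # one-off: create the key pair
gpg --armor --export "Package Author" > pubkey.asc     # public half, to be hosted on the central website

gpg --detach-sign --armor mypackage-1.0.pet            # per release: creates mypackage-1.0.pet.asc
gpg --verify mypackage-1.0.pet.asc mypackage-1.0.pet   # anyone who has imported pubkey.asc can verify the package
[/code]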
Pizzasgood wrote:Yes, I think deletions should be approved first. However, maybe allow the package's maintainer to flag it as unstable immediately, in case he finds a bug. Also have a way for the admin-types to override that in case a package maintainer goes loco.

And the client should bring up a big red warning before downloading anything flagged as unstable (complete with a "yes for all" button for the hardcore testers who feel inclined to download fifteen packages of questionable stability in one fell swoop).

Maybe give it "magnitudes" of instability: "works", "mostly works", "almost works", "unusable", and "will implode your computer after killing your dog and dying your clothes pink".
Who did you imagine would rate the package on stability, or would this be something anyone could rate (nicely web 2.0)?

Assuming that a node provides an interface that allows package creators to upload their packages, a mechanism for the package creator or node admin to amend the metadata would also be good. In our environment of continual beta testing, up-to-date descriptions would be useful. The node-to-node API would also need to provide a mechanism for this.
Will
contribute: [url=http://www.puppylinux.org]community website[/url], [url=http://tinyurl.com/6c3nm6]screenshots[/url], [url=http://tinyurl.com/6j2gbz]puplets[/url], [url=http://tinyurl.com/57gykn]wiki[/url], [url=http://tinyurl.com/5dgr83]rss[/url]

Pizzasgood
Posts: 6183
Joined: Wed 04 May 2005, 20:28
Location: Knoxville, TN, USA

#7 Post by Pizzasgood »

I was thinking the person who creates/uploads the package, and the maintainers if that person is away and serious flaws are discovered. I wasn't thinking fluidly at the time though. A better system would be some combination of condition (working, mostly working, broken) and maintenance status (maintained, obsolete, abandoned), along with a release status: release, release candidate, beta, alpha, and pre-release / demo.

The creator/packager/maintainer would try to keep those up to date. It should have a "description last updated: ####" date, along with a "package last modified: ####" date, to show the user how up to date the info is.
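In the package metadata that might end up looking something like this (tag names and dates pulled out of thin air, just to illustrate):

[code]
<!-- hypothetical metadata fragment; all names and values are placeholders -->
<status>beta</status>                     <!-- release / release candidate / beta / alpha / pre-release-demo -->
<condition>mostly-working</condition>     <!-- working / mostly working / broken -->
<maintenance>maintained</maintenance>     <!-- maintained / obsolete / abandoned -->
<description-updated>2008-04-10</description-updated>
<package-modified>2008-03-28</package-modified>
[/code]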
[size=75]Between depriving a man of one hour from his life and depriving him of his life there exists only a difference of degree. --Muad'Dib[/size]
[img]http://www.browserloadofcoolness.com/sig.png[/img]

tombh
Posts: 422
Joined: Fri 12 Jan 2007, 12:27
Location: Bristol, UK

#8 Post by tombh »

@richard.a: Gosh, I never knew Microsoft was so involved in Linspire! Did you use to help make Linspire or do you just mean you used it for a while?
Words like "legacy" and "necessary hardware upgrades" send chills down my spine as a result of what just happened, so please take my caution very, very, seriously, if you don't want to frighten users away who got burned by Mr. Gates first, and by Mr. Robertson second.
Not that I know much about CNR but I get the impression that it is a much more encompassing solution than what I was suggesting. I certainly don't think that the connection between the server and the client code should be anything more than to communicate URLs and metadata. At the least one should be able to use the web interface to search, browse and download packages -- a client app that can do this and install the package for you should be a bonus not a necessity.

@Pizzasgood: :)

@HairyWill: Yes, it would certainly be better to keep the repositories and the website as separate as possible, for bandwidth and administrative reasons. But yes, it could be used for the definitive lists, public keys and user-friendly browsing/perusing of available packages. You're right about the private keys too -- they're too open to tampering when scattered about the internet. When you mention keeping the private key on a USB stick or CD, is that for added security?

Of course I'm very happy to help with whatever direction is chosen, but just to give my own approach another little push...
"I have tested all the packages on my mirror"
"they will all work with wobblyhydrant-3.99"
"I know that I am not hosting anything that breaks my religious/legal/moral obligations"
A mirror admin also has the ability to add their own packages into the system.


If these are the main reasons for decentralising the API then it will be very possible to implement this functionality in my outlined method. For instance, potential mirror-admins could use the central server to highlight/select the packages they wish to share, and an rsync cron line with a unique config file (listing their chosen files) could be automatically generated for them. There might be a few technical difficulties doing this, but surely not as many as trying to keep track of who has/should-have/shouldn't-have what in a decentralised system.
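Something along these lines is what I imagine being generated for each mirror-admin (all the paths and hostnames here are made up, purely for illustration):

[code]
# hypothetical /etc/crontab entry generated by the master server for one mirror;
# /etc/puppy-mirror/files.list holds only the packages this admin chose to share
30 3 * * *  root  rsync -avz --files-from=/etc/puppy-mirror/files.list rsync://master.example.org/packages/ /var/www/packages/
[/code]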

Another thing is that the method you outline will still, in effect, need to be centralised. If I understand correctly, a lowest common denominator for a mirror will be PHP, to run the server-side code -- code that will need to be synchronised across all the other servers. If a bug fix or update is required then all the servers will need to reflect the new code very quickly to avoid serious conflicts. Of course, with the API restricted to one server, the system could be taken off-line for maintenance.

I hope my criticisms are helpful! I'm all for your approach really :) Just thrashing things out, as you say.

richard.a
Posts: 513
Joined: Tue 15 Aug 2006, 08:00
Location: Adelaide, South Australia

#9 Post by richard.a »

tombh wrote:@richard.a: Gosh, I never knew Microsoft was so involved in Linspire! Did you use to help make Linspire or do you just mean you used it for a while?
Words like "legacy" and "necessary hardware upgrades" send chills down my spine as a result of what just happened, so please take my caution very, very, seriously, if you don't want to frighten users away who got burned by Mr. Gates first, and by Mr. Robertson second.
Not that I know much about CNR but I get the impression that it is a much more encompassing solution than what I was suggesting. I certainly don't think that the connection between the server and the client code should be anything more than to communicate URLs and metadata. At the least one should be able to use the web interface to search, browse and download packages -- a client app that can do this and install the package for you should be a bonus not a necessity.
Maybe a few historical notes would be handy, because perhaps what I wrote is a bit unintelligible :) I apologise for that.

In 2001 the entrepreneur who created mp3.com, a businessman with a flair for technology (Michael Robertson), set up a company to create a commercial Linux distribution that he hoped would eventually compete with Microsoft on its home turf - the home user. He called the company Lindows Inc. One suspects that the name was a deliberate choice, to "goad the bull" so to speak.

Initially the idea was to develop a commercial form of WINE and run actual Windows applications where people wanted to. And the Click-n-Run Warehouse was an early idea, to implement installation features similar to those found in Windows apps, in his own Linux initially, to be sold on to other Linuxes eventually.

The WINE concept didn't work out how MR would have liked, although he did support WineHQ financially quite well.

I never saw Version 1, and I don't think anyone outside San Diego did either. I have a copy of Version 2 which came on one CD and yes, it works.

Version 3 was a lot better, and then came Version 4 in 2004 (from memory); the CNR system was working well with 4.0, which was a public release. Then Mr Gates started to take notice, perhaps because of the good press Mr. Robertson was having. Or maybe a premonition that Redmond wasn't going to do as well next year.

Remember Redmond Linux, which became Lycoris? Eventually there was one guy, no sales, and Mandrake/Mandriva bought the name, the website and the code. I have a copy of the final release, which was very smooth, but slower than Lindows, and that's saying something.

I purchased ver 2, 3 and 4 of Lindows by mail because I was on dialup until 2003.

Mr Gates responded by surprising everybody with the number of courts, in how many countries, in which he issued writs against Mr Robertson, Lindows Inc, and Lindows.com. MS's attorneys actually bamboozled a judge in Holland into stating that if anyone in the Netherlands logged onto Lindows.com, Mr Robertson would be liable to the tune of many US$ for each such act. Interesting how judges can make uneducated rulings which are binding. My copy of the final judgement is at http://linspire.homelinux.org/court/AMSDAM-finalruling.pdf

Thousands of dollars were raised by public subscription, and I'm one of those "Lifetime Insiders", which gave me lifetime access to CNR as well. Until a few weeks ago, that is, when the power to the server was turned off.

My involvement was as a brand-new broadband user in 2003, with 4.5 having just been released: downloading ISOs sometimes twice a week to check how prospective changes would work, in many ways a beta tester, through ver 4.5 and then to 5-O as it was called (think Hawaii 5-O). Peter van der Linden did the background hack work for his book "Guide to Linux" during this time, and a number of us get a mention in his preface, which was very nice of him. I also proof-read one of his chapters but never got my comments back to him in time because of family sickness.

During the time of the court cases, there was a tremendous sense of camaraderie among the insiders. This lasted through to the post 5.0 release period. Somewhere I have a Lindows sweatshirt saying "I was there 2002-2003-2004" on the back :)

The name was changed to Linspire as part of the deal with Microsoft, who we were told would own the "Lindows" name, and an initially undisclosed amount was paid to Lindows (corporate), from memory, although later a huge figure was quoted across the technical press.

So that is the extent of my involvement with them. I saved the choicepc.com website (which is no longer online) but my copy is at http://micro-hard.homelinux.net/ChoicePC.com/index.htm and the Lin---s.com website they put up to defy the (unworkable) Dutch judge's directive. My copy is at http://linspire.homelinux.org/downloads/lin---s.com/index.htm

My saved ChoicePC site has a number of the pages for which there were links on the front page; if you wish to look at them, go up one level which is a directory listing, and pick through the pages.

I believe the court action result may have contributed to the success of subsequent EU court rulings about MS being in breach of this and that (restrictive practice) with Media Player and Internet Explorer.

So there we go. The involvement of Microsoft was purely antagonistic, the way I saw it.

Sorry about deviating from the topic :)

Richard
[i]Have you noticed editing is always needed for the inevitable typos that weren't there when you hit the "post" button?[/i]

[img]http://micro-hard.dreamhosters.com/416434.png[/img]

tombh
Posts: 422
Joined: Fri 12 Jan 2007, 12:27
Location: Bristol, UK

#10 Post by tombh »

I never really knew the extent of this little 'spat'. Why are Microsoft so mean! They have so much money, yet they act out an almost knee-jerk reaction to even the smallest competition; they're proper meanies, I would say. I'm glad to hear though that these shenanigans played a part in the EU's court rulings against them.

This is certainly relevant to the discussion, I think: it raises the issue of putting too many eggs in one basket, or in other words relying too much on one thing. Whether it's Puppy or anything else, we often find ourselves torn between the simplicity, yet precariousness, of a single all-encompassing solution and the complexity, yet security, of multiple approaches. One could even cite MS and Linux themselves as prime examples of this. MS may have triumphed for a period by focusing all their energies on a single solution, yet look how easy it is to catch a virus or fall prey to the rapidly changing world of the internet. With such a narrow focus, if one part fails the whole fails -- a la CNR/Linspire. Linux may seem like a frustratingly disorganised mish-mash, but then that's also its strength -- if one part fails, well uh, one part fails, no biggy!
