Using Linux Source Code as Training Data for AI

* Now talking on #ai
* Topic for #ai is: Artificial Intelligence | Wiki:
* Topic for #ai set by marienz!~marienz@freenode/staff/marienz (Fri Aug 14 22:44:52 2015)
<transhuman_> hi! I am interested in AI software capable of improving the writing of code, and thought that it might make an interesting set of training data to train a neural net on recognizing code improvements by looking at the evolution of the Linux kernel over time and versions…has anything been done like that?
<Asher> google has done some work getting NN to write code
<Asher> look for neural turing machines
<Asher> it’s kind of like trying to make a sculpture out of pudding
<Asher> and so google is like— well we will make bread pudding
* Ay0 has quit (Ping timeout: 240 seconds)
<bsima> transhuman_: i thought this paper was interesting, kinda like what you’re looking for?
<transhuman_> interesting thanks for the pointers guys I will look at that link closely
<doomlord> transhuman_ seems to me like code is more ‘precise/mechanical/purely logical’ than NNs, but NNs might be interesting for things like intuition on naming?
<transhuman_> I understand your point for sure, I was just thinking the evolution of the Linux kernel and all its patches over time points an arrow toward improvements in programming in general
<transhuman_> it would be a huge undertaking, I just thought it would be interesting
<doomlord> I’m sure there will be good uses of NNs for programmers/programming
* unixpickle has quit (Quit: My Mac has gone to sleep. ZZZzzz…)
<doomlord> for example, could one train a neural net to learn how each programmer names things
<doomlord> treat it like a unique language
<transhuman_> IC, interesting
<doomlord> and then use that (a) for naming suggestions, (b) search suggestions when you’re trying to find the right function in someone else’s source base
<transhuman_> hmm. cool idea
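A rough sketch of the search-suggestion idea above, assuming a plain bag-of-words comparison stands in for the neural net doomlord has in mind. The function names (`split_identifier`, `search`) and the sample identifiers are hypothetical, chosen only for illustration:

```python
import re
from collections import Counter
from math import sqrt

def split_identifier(name):
    """Split snake_case, camelCase, or acronym-style names into lowercase words."""
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", name)
    return [p.lower() for p in parts]

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, names):
    """Rank known function names by word overlap with the query, best first."""
    q = Counter(split_identifier(query))
    scored = [(cosine(q, Counter(split_identifier(n))), n) for n in names]
    ranked = sorted(scored, key=lambda t: (-t[0], t[1]))
    return [n for score, n in ranked if score > 0]

if __name__ == "__main__":
    known = ["buf_create", "buf_destroy", "createBuffer", "parse_header"]
    print(search("create_buf", known))
```

A learned model could replace `cosine` with something that knows "buf" and "buffer" mean the same thing to a given programmer, which is the translation problem described above.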
<doomlord> the way I see it, every tool has a set of things it can do very well, but it’s a mistake to try to apply it universally. I do not think neural nets will be some ‘master algorithm’ useful for every problem
<doomlord> they are just the fresh territory at the moment
<transhuman_> I just thought it’s such a large base of data, that it could be used to somehow train a neural net…just an idea not sure of the end..
<transhuman_> IC
<doomlord> fresh territory, so we’re breaking new ground by finding their applications
* unixpickle ( has joined
<doomlord> i think the type systems are where advancement in programming will continue
<doomlord> also something to bear in mind: i do not think it will be efficient to try and eliminate human effort
<doomlord> we can just reduce the amount of time we waste on things we can’t do optimally
<doomlord> i think all programmers can write code much faster than a machine could
<doomlord> the huge difficulty for programmers is tracking interactions in big programs (solution: -> debugger tools and type systems), and naming. (that’s a translation problem; each programmer has their own internal language)
<doomlord> i think the quest for true ‘AGI’ is premature
<doomlord> universal translation of source code names would of course be great because we need it between English, Mandarin, Russian etc
<doomlord> source and comments?
<doomlord> type systems -> reduce the need for comments by encoding more information formally
<doomlord> but we still benefit from writing natural language documentation..
* unixpickle has quit (Quit: My Mac has gone to sleep. ZZZzzz…)
<transhuman_> I don’t know, but I was thinking that if each block of Linux code had a written description with it (which I doubt it does) and a rating of its quality (which it may or may not have), it might be useful
* augur ( has joined
* augur has quit (Ping timeout: 258 seconds)
<doomlord> there’s plenty of ‘labelled data’ in source bases,
<doomlord> even ‘code that compiles’ is a label of sorts
<doomlord> function name and function definition to me seems like a ‘label’ r.e. this translation idea
<doomlord> maybe even comments versus a ‘bag of symbol names’
<doomlord> I’m sure you are right that there are many useful pieces of data in source-bases that would be interesting to train a neural net with
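To make the ‘labelled data’ point concrete, here is a minimal sketch that pulls (function name, bag of symbol names) pairs out of source text and treats ‘does it parse’ as a label. It assumes Python source and the standard `ast` module rather than the C of the Linux kernel, and the helper names are hypothetical:

```python
import ast

def parses(source):
    """A crude 'code that compiles' label: True if the source parses."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def name_body_pairs(source):
    """Return (function_name, sorted symbol names used in the body) pairs."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # ast.Name nodes are the variable references inside the function.
            symbols = sorted({n.id for n in ast.walk(node) if isinstance(n, ast.Name)})
            pairs.append((node.name, symbols))
    return pairs

if __name__ == "__main__":
    sample = "def area(width, height):\n    return width * height\n"
    print(parses(sample), name_body_pairs(sample))
```

Each pair is one training example for the name-as-label idea: the symbols are the input and the function name is the target.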
<doomlord> one might also argue that if a programming language is well designed, there will be no trivial patterns in the source-code naming
<doomlord> for example, we used to write lots of ‘blahblah_create’ / ‘blahblah_destroy’ type functions.. which is a pattern replaced with Constructors/Destructors
* unixpickle (~alex@ has joined
<doomlord> checking for error return values -> a pattern that is better handled with ‘option types’ in functional programming
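The error-return pattern versus an option type can be sketched as follows. Python’s `Optional` is only an approximation of the option types found in functional languages, and both function names here are hypothetical:

```python
from typing import Optional

def find_index_sentinel(items, target):
    """Old pattern: return -1 on failure; the caller must remember to check."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def find_index(items: list, target) -> Optional[int]:
    """Option-style: 'not found' is a separate value, not a magic integer."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return None

if __name__ == "__main__":
    names = ["alpha", "beta"]
    # A forgotten check on the sentinel silently indexes the last element.
    print(names[find_index_sentinel(names, "gamma")])
    # The Optional result forces the missing case to be handled explicitly.
    idx = find_index(names, "gamma")
    print("not found" if idx is None else names[idx])
```

With the sentinel version the mistake goes unnoticed, whereas a static type checker can flag any use of the `Optional[int]` result that skips the `None` case.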
<transhuman_> I had brought the idea to a friend of mine who is a very skilled programmer with a background in AI but he doesn’t seem interested
<doomlord> i am definitely interested in things that would help searching and naming
* augur ( has joined
<doomlord> i think generally AI has great potential to reduce friction between humans
<doomlord> ‘naming wars’ … very similar to humans fighting over language and culture
<doomlord> the reason you might get a lukewarm response though:
<doomlord> making life easier for programmers is very low on society’s list of challenges
<doomlord> we probably already have too many programmers
<doomlord> i.e. more programmers than we have work for programmers
<doomlord> programmers are busy competing with each other for limited territory, i.e. getting the most important things done first to claim ownership/ domination
<doomlord> that’s why it’s still (paradoxically) hard to hire
<doomlord> programmers are hard to hire *because there are so many*, and you can only make money by being “first or best”
<doomlord> AI will generally face this problem:
<doomlord> we already have too many workers, and AI is good at eliminating them 🙂
<doomlord> i think what we need to do is look for ways AI can *save energy*
<doomlord> The best example I can think of there is those delivery robots, e.g. if you can replace a car journey with a small electric robot carrying a package, great
<doomlord> AI drivers can go slower (they are not impatient) which saves fuel
<doomlord> AI can operate in environments where it is hard to support humans
<doomlord> (bringing new resources online)
<doomlord> when you look at the big picture… what problems people face – AI source code assists are very low down IMO.
* justanotheruser has quit (Remote host closed the connection)
* justanotheruser (~justanoth@unaffiliated/justanotheruser) has joined
<doomlord> has anyone experimented with document-summarizers in UI, e.g. browsing a directory structure, pick out key words in the text held in directories
<doomlord> actually transhuman_ that might relate directly to source code navigation
<doomlord> could the same principle be applied to folders full of images
<transhuman_> hold on reading your comments
<doomlord> instead of just showing image thumbnails for a directory, maybe given a directory of images you could pick out ‘salient visual words’.
<transhuman_> IC
<doomlord> e.g. not just common words, but maybe the combinations of words that uniquely identify one source file compared to another (beyond its given name)
<doomlord> almost like a tag-cloud i guess.
<doomlord> well, source navigation already has call graphs and context-sensitive disambiguation, i guess.
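One common way to pick out words that distinguish one file from another is TF-IDF weighting. The sketch below assumes that technique (the conversation does not name it) and uses made-up file names and contents:

```python
import math
import re
from collections import Counter

def salient_words(docs, top_n=3):
    """docs maps a file name to its text; returns the top_n TF-IDF words per file."""
    tokenized = {name: re.findall(r"[a-z_]+", text.lower()) for name, text in docs.items()}
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))  # count each word once per file
    n_docs = len(docs)
    result = {}
    for name, words in tokenized.items():
        term_freq = Counter(words)
        scores = {
            w: (count / len(words)) * math.log(n_docs / doc_freq[w])
            for w, count in term_freq.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        # Words that appear in every file score zero and are dropped.
        result[name] = [w for w in ranked[:top_n] if scores[w] > 0]
    return result

if __name__ == "__main__":
    files = {
        "net.c": "socket send socket recv buffer",
        "fs.c": "inode read inode write buffer",
    }
    print(salient_words(files, top_n=2))
```

The resulting word lists could feed the tag-cloud view mentioned above; a word like "buffer" that occurs everywhere gets no weight.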
* night is now known as night|pub
<transhuman_> you’re a smart man for sure…just a grain of an idea, and you had so much to say!
<doomlord> you might need an AI to filter the important bits from my rambling 🙂
* SiegeLord ( has joined
<doomlord> wiki format would be great for these discussions.
<transhuman_> and there would be others’ inputs
* augur has quit (Remote host closed the connection)
* blackwind_123 has quit (Ping timeout: 260 seconds)
* blackwind_123 (~IceChat9@ has joined
<transhuman_> doomlord is there a wiki for such a thing?
<doomlord> heh i just looked up, and this channel has a wiki 🙂
<tomzx> doomlord: recently I’ve been experimenting with logging my thought stream, mostly what I think is relevant obviously
<doomlord> Artificial Intelligence | Wiki:
<tomzx> I think sooner or later word cloud is going to end up being something I use to summarize a day/week/month/year
<doomlord> i still want this ‘user opinion /fact graph’ thing we talked about before
<tomzx> time to start writing an RFC doc 🙂
<transhuman_> hmmm. that’s interesting it opened it on my google drive, which I didn’t expect!
<transhuman_> why did that link open in my google drive?
<tomzx> because the wiki is docs in google drive
* Waterpicker (~Waterpick@2602:306:35ba:ca40:e1:fd42:8b8e:d87c) has joined
<transhuman_> ah ok… IC I don’t use google docs very often, didn’t know
<transhuman_> I was thinking you were referring to more of a reddit
<doomlord> some hybrid between wiki and social media
<doomlord> click a user -> see all the posts they ‘agree with’
<doomlord> click 2 users – browse where they agree / disagree
<doomlord> click a post – see who agrees/disagrees
<transhuman_> well I am game for you posting it if you want, just send me a link to it…you came up with most of these things to say, so it wouldn’t be right for me to post it
<doomlord> ah this braindump isn’t recorded, you are suggesting recording it
<doomlord> it’s a bit scattered, i covered many different aspects
<doomlord> and those ideas appear elsewhere in different contexts
<doomlord> i don’t keep my own blog or anything
<transhuman_> yes I would say so, but they’re good points, even if I can’t develop the idea someone might be willing to take it to the next level…if it’s all that interesting…if…
<tomzx> doomlord: well, if you want anything to get done, you need to put it in writing 😛
<doomlord> i just rant a lot in forums and IRC 🙂
<tomzx> I’d suggest you simply create a git repository with a file in it
<doomlord> best way to get something done is to do it 🙂
<doomlord> and pick things to work on strategically, connecting to existing projects
<tomzx> true, but it’s important to share your vision if you want help
<transhuman_> for sure, I am just not capable enough, I am afraid, to do it any justice
* axlshear has quit (Ping timeout: 240 seconds)
<doomlord> git repo of ideas sounds interesting,
* axlshear (~axlshear@ has joined
<doomlord> seeing as that could be forked/modified
<transhuman_> for sure, similar to source code I suppose
<doomlord> GitHub has a wiki feature of course too
<tomzx> doomlord: I’d strongly recommend starting at least a git repository, at least you’ll be able to construct upon what you’ve thought previously, and be able to instantly share a better picture of your idea
<doomlord> 99% of ideas might just end up with ‘this can sort of be done by this already’, but that’s fine.. still serves as a question
<doomlord> like we just established with the mere idea of using a forkable git repository instead of building a whole new social-network-wiki tool 🙂
* jshjsh (~jshjsh@ has joined
<transhuman_> well I think the ideas mentioned have merit. Pulling it off would require more than I am capable of, for sure. But if it serves as an advancement opportunity then I am all for it
<doomlord> nice to keep in mind a big picture to guide small steps, i guess.
* JoshS has quit (Ping timeout: 268 seconds)
<doomlord> and find who else might be moving in the same direction.
* JoshS (~jshjsh@ has joined
<transhuman_> so what’s next, put the idea on Github?
* jshjsh has quit (Ping timeout: 268 seconds)
* doomlord has quit (Excess Flood)
* theology has quit (Quit:
* theology (~theology@unaffiliated/not-mike/x-4399907) has joined
* doomlord ( has joined
<doomlord> transhuman_ I’m writing those ideas again there, maybe if you thought anything in particular sounded interesting
* night|pub is now known as night
<transhuman_> I am not sure, but what about the moving arrow of source code improvements over time, with comments and notation of improvements to the general quality of the code? Linux kernel as an example
<transhuman_> ones that point to quality improvements?
<doomlord> oh interesting,
<doomlord> so basically consider the actual *time series* of how a source-base evolves 🙂
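A minimal sketch of turning such a time series into training pairs, assuming successive versions of one file are available as strings. It uses the standard `difflib` module rather than real kernel git history, and the function names are hypothetical:

```python
import difflib

def change_pairs(old_lines, new_lines):
    """Return (before, after) blocks for every region that was rewritten."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # skip pure insertions, deletions, and unchanged runs
            pairs.append((old_lines[i1:i2], new_lines[j1:j2]))
    return pairs

def history_pairs(versions):
    """Collect (before, after) blocks across consecutive versions, oldest first."""
    out = []
    for old, new in zip(versions, versions[1:]):
        out.extend(change_pairs(old.splitlines(), new.splitlines()))
    return out

if __name__ == "__main__":
    v1 = "x = get()\nif x == -1:\n    fail()"
    v2 = "x = get()\nif x is None:\n    fail()"
    for before, after in history_pairs([v1, v2]):
        print(before, "->", after)
```

Each pair is a candidate "this was improved into that" example; deciding which changes were actually improvements would still need commit messages or review data.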
<transhuman_> I am sorry, did I miss it? did you put that in there?
<doomlord> just tried to collect this conversation into a document.
<transhuman_> IC, you did a good job for sure…I like firepad seems like an interesting system
<transhuman_> how many people use it?
<doomlord> i don’t know, i just remember brainstorming online before
<doomlord> haven’t used something like this in a while.
<transhuman_> does it get lots of traffic? Or don’t you know
<doomlord> no idea.
<transhuman_> I was thinking of linking to it from facebook and google groups if that’s ok?
<tomzx> I have doubts this is a URI
<tomzx> now, it’s likely reused and the document will be gone
<doomlord> one min can I change my name from ‘doomlord’ lol
<tomzx> doomlord: should’ve made a google doc for it :p
<doomlord> i can cut paste it anywhere
<transhuman_> not too late to put it in google docs
<transhuman_> I will link to it on facebook and LinkedIn and google groups
<doomlord> would it be appropriate to just dump this ‘brainstorm’ on this channel’s ‘wiki’ listed above?
<transhuman_> sure just has to be linked to so it gets some traffic…words no one ever sees are useless
<doomlord> yikes i seem to have lost a chunk of text, lol.
<tomzx> transhuman_: if you type them, you see them 😛
<transhuman_> I might see them but it’s the ones who can take an idea to the next level that count! lol
<doomlord> ok i will make a google doc for this brainstorm,
<transhuman_> sounds good, I will link it on my website, facebook, and LinkedIn
<transhuman_> maybe you should build it in GitHub and then link it to google docs? just an idea?
<doomlord> i was leaning toward just dumping it into Github, 1min
<doomlord> I’ve just made a repo
<transhuman_> oh and link to it on Reddit if that’s allowed too
* AlRaquish has quit (Quit: Leaving)
<transhuman_> I will do that too when it’s initialized
<transhuman_> not sure though if you can have external links on Reddit, I seem to remember you can’t
<doomlord> transhuman_ ok i cleaned it up here
<doomlord> it is just a list of questions and thoughts really, nothing so profound.
<doomlord> answers will most likely be ‘this has already been tried’ / ‘this isn’t really possible’.
<doomlord> but see what you think
<transhuman_> I think it looks great, I will link to it from a bunch of places hoping to get some more inputs
<transhuman_> got one more input already and will get more

<transhuman_> garit: what I was talking about with Linux ….
<garit> Some sort of SAT + NN?
<garit> it’s so far away from anything real..
<garit> sat itself can solve only very small problems
<garit> while NN is slow. Sat+nn will yield solving small problems slowly =)
* blocky has quit (Ping timeout: 260 seconds)
<transhuman_> I know, I just thought my idea of looking at the history would yield some interesting results. I just have no clue how to go about with such a large amount of complex data input
* blocky (~blocky@unaffiliated/blocky) has joined
* gkwhc has quit (Ping timeout: 260 seconds)
* Nightwing52 ( has joined
* gkwhc (~gkwhc@unaffiliated/gkwhc) has joined
* Nightwing52 has quit (Client Quit)
<garit> SAT solves digital problems like computing, it’s practically useless for most real-world data
<garit> sat solves anything up to 10 kbit per task, which is too few for any real life use
<garit> what is your task?
<transhuman_> not a task just an idea that if you could take the linux source, documentation, comments associated with each block of code and its changes over time, it might present a system which could suggest better ways of writing blocks of code… I am sure I couldn’t pull it off myself but it’s just a thought simple as that
<garit> Sat can solve 10 kbit. Linux kernel alone is close to 1 gbit
<garit> all Linux programs are closer to 1 tbit
<garit> and SAT does not even find an optimal solution. It only finds whether this particular task can be solved or not.
<garit> so. You are 100 000 000 000 times below the needed capacity, and you only solve 1 of those cases, not all of them.
* VanUnamed_ (~VanMarco@ has joined
* VanUnamed_ is now known as VanUnamed
<garit> Sat solves this: ‘find a perfect solution for task X’, while to write a Linux program you don’t even know the task X, you only know the problem Y. Sat does not solve this. It can only check if it is correct.
<garit> So, not only are you many orders of magnitude below the needed capacity, but you also don’t solve the correct problem. SAT is too low level. It does not solve a problem without a description, and a description of a problem is already a program
<garit> Sorry, you are 100 000 000 times below the needed capacity in terms of input size. But complexity grows exponentially, so it’s more like 2^(100 000 000) times more calculations
<garit> SAT tries a search through all possible bit operations between all of the inputs. Search space is gigantic. But real programs almost never behave that way. Usually, whole words are used (32-bit or 64-bit), not single bits, and operations are restricted to a few hundred opcodes. And in the vast majority of cases, operations are restricted to a dozen primitive math functions.
<garit> for 320 bit there are 10^20 options for a program, and about 2^320^320 for a sat solver
<garit> very, very roughly, but gives an idea why a SAT solver for programming is like an electron microscope for a school’s drawing task
<garit> Difference is like all possible combinations of all possible particles in a universe during all possible lifetimes of a universe, all down to the Planck scale, compared to 1. And that’s still an underestimation
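To give a feel for the exponential growth being described, here is a deliberately naive satisfiability checker that tries every assignment. It is only an illustration of the worst case (2^n assignments for n boolean variables); practical SAT solvers prune the search rather than enumerate it, and the clause encoding and function name below are assumptions made for this sketch:

```python
from itertools import product

def brute_force_sat(clauses, num_vars):
    """Try every assignment of num_vars booleans against CNF clauses.

    Each clause is a list of non-zero ints: k means variable k is true,
    -k means variable k is false. Returns (assignment or None, assignments tried).
    """
    tried = 0
    for assignment in product([False, True], repeat=num_vars):
        tried += 1
        satisfied = all(
            any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        if satisfied:
            return assignment, tried
    return None, tried

if __name__ == "__main__":
    # (x1 OR x2) AND (NOT x1)
    print(brute_force_sat([[1, 2], [-1]], num_vars=2))
    # Worst-case number of assignments doubles with each extra variable.
    for n in (10, 20, 30):
        print(n, "variables ->", 2 ** n, "assignments")
```

Adding one variable doubles the worst-case work, which is the scaling behind the capacity argument above.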

<transhuman_> I kinda understand what you’re saying, garit. But I think there may be ways of simplifying the problem. After all, your brain doesn’t have as many particles as the universe, yet it can make source code that is effective (depending upon the individual programmer, of course), so perhaps their method is not the best, or even nearly the best, method
<transhuman_> or application
<garit> Sure there is a method
<garit> but not based on bit-level operations
<garit> You need abstractions, a lot of them
<transhuman_> can I paste your comments into the GitHub wiki link?