Smart DRUGan

Task

Extend DRUGan so that it can provide a personalized Data Science Roadmap

What we've done so far:

  • Research on existing solutions and resources
  • Define a source list of information
  • Define the structure the source should be mapped to
  • Define an NLP architecture for data processing
  • Create data scrapers
  • Join modules
  • Create an interaction module with a Slack interface

Example of a mapping structure

In [5]:
ex1={"someblog.com":
    {   
        "source":7,
        "title":"New tyoe of robotics", 
        "time_to_read":"approximately 40 minutes",
        "short_summary":"Some lines",
        "our_tags": ["ML", "NLP"],
        "given_by_blog_tags":["Disussion","Project","Reserch"],
        "date": "19 Sep 2018",
        "author_name": "name", 
        "number_of_comments":34, 
        "preview_picture":"Here will be a png" ,
        "github_link":"https://github.com/",
        "arxiv_link":"https://arxiv.org/abs/1807.02033v2",
        "reddit_link":"https://www.reddit.com/r/MachineLearning/comments/",
        "amount_of_facebooks":450,
        "amount_of_twits":450
    }}
In [6]:
ex2={"arxiv" : {
        "id" : "http://arxiv.org/abs/1806.01660v4",
        "date" : "19 Sep 2018",
        "category" :  "cs.[CV|CL|LG|AI|NE]/stat.ML",
        "category1" : ["nlp", "CV"],
        "title" : "Mask R-CNN",
        "authors" : [{"name" : "Andrew Ng", "id" : 123}],
        "trackbacks" : "www.kaggle.com",
        "arxiv_summary" : "We present an auxiliary task to Mask R-CNN, an instance segmentation network, which......",
        "our_summary":"We present an auxiliary task to Mask R-CNN, an instance segmentation network, which......",
        "pages_count" : 50,
        "Bibliographic":{"data":{ 
            "references" : "another arxiv article",
            "citations" : "direct quote",
            "similar_abstract" : [],
            "also_read" : []}
        },
        "num_of_submission" : 4,
        "Comments" : {
            "availability" : "yes",
            "text" : "arXiv admin note: text overlap with" ,
            "link" : "https://arxiv.org/abs/1802.05155"
        },
        "figures" : {                     
            "caption_boundary": {
                "x1": 152.66566806369357,
                "x2": 693.7513987223307,
                "y1": 273.42425452338324,
                "y2": 284.6669514973958
            },
            "caption_text": "Table I. Objects carrying charges m\u2032\u00b5 and n\u00b5 in each theory related by dualities.",
            "dpi": 100,
            "figure_boundary": {
                "x1": 262.0,
                "x2": 584.0,
                "y1": 285.0,
                "y2": 433.0
            },
            "figure_type": "Table",
            "name": "name",
            "page": 2
        },
        "video_summary" : {
            "videos" :"https://www.youtube.com/user/keeroyz/videos",
            "text" : "http://www.shortscience.org/?s=cs"
        },

        "pdf" : {
            "id" : "https://arxiv.org/pdf/1806.01660v4",
            "results_сonclusion" : "",
            "bold_item" : "",
            "pages" : 13

        }
    }}

Defining the NLP architecture in our case consists of:

  • Making a good-enough tagger
  • Implementing article summarization and reading-time estimation

Tagger

What we have in the tagger

  • A list of all DS terms
  • Automatic Keyphrase Extraction technology
  • A function that checks whether a found keyphrase is a DS term

A list of all DS terms

In the beginning it wasn't a list but a 2350-line dictionary with a hierarchical structure.
A fragment of that dict (with a flattening sketch right after it):

In [2]:
 dict_ex={
     "structures used in natural language processing":{
     "anaphora":{},
     "context-free language":{},
     "controlled natural language ":{},
     "corpus":{
         "text corpus":{},
         "speech corpus":{}},
     "grammar":{
         "context-free grammar (cfg)":{},
         "constraint grammar (cg)":{},
         "definite clause grammar (dcg)":{},
         "functional unification grammar (fug)":{},
         "generalized phrase structure grammar (gpsg)":{},
         "head-driven phrase structure grammar (hpsg)":{},
         "lexical functional grammar (lfg)":{},
         "probabilistic context-free grammar (pcfg)":{},
         "stochastic context-free grammar (scfg)":{},
         "systemic functional grammar (sfg)":{},
         "tree-adjoining grammar (tag)":{}},
     "natural language":{},
     "n-gram ":{
         "bigram":{},
         "trigram":{}},
     "ontology":{
         "taxonomy":{
             "hyponymy and hypernymy":{},
             "taxonomy for search engines":{}}},
     "textual entailment":{},
     "triphone":{}}
 }
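
Since the tagger needs a flat list of terms rather than a nested dict, here is a minimal sketch of flattening such a hierarchy. The helper below is ours for illustration, not the project's actual code:

In [ ]:
def flatten_terms(d):
    """Collect every key of a nested term dict into one flat list."""
    terms = []
    for term, children in d.items():
        terms.append(term.strip())             # drop stray whitespace in keys
        terms.extend(flatten_terms(children))  # recurse into sub-terms
    return terms

ds_terms = flatten_terms(dict_ex)
print(len(ds_terms), ds_terms[:3])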

Automatic Keyphrase Extraction (terminology extraction)

  • It seems like a very hard task, and it is, but thankfully while trying to get it done we found a ready-to-use library named textacy.
  • However, that wasn't the end: we then faced the problem of choosing the right extraction method out of the 8 available (see the sketch after this list).
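
A minimal sketch of what the extraction step can look like with textacy; the module path has moved between textacy releases, so treat the import as an assumption:

In [ ]:
import spacy
from textacy.extract import keyterms  # older textacy versions expose this as textacy.ke

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Recurrent neural networks and transformers dominate modern "
          "natural language processing research.")

# TextRank is only one of the extraction methods textacy ships;
# sgrank, scake and yake are among the alternatives we had to choose from.
for phrase, score in keyterms.textrank(doc, normalize="lemma", topn=5):
    print(f"{phrase}: {score:.3f}")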

Finishing up our tagger

  • Measuring the Hamming distance between each found keyphrase and the terms from our list
  • If a close-enough term from the list is found, we declare it a TAG (close enough = within a preset threshold)
  • Then we measure the Hamming distance again, but now between our tags, to get rid of similar ones (a sketch follows this list)
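
A minimal sketch of those three steps. The threshold value, the space-padding of unequal-length strings, and the helper names are our assumptions, not the project's actual code:

In [ ]:
DS_TERMS = ["natural language processing", "neural network", "taxonomy"]
THRESHOLD = 3  # hypothetical preset threshold

def hamming_distance(a, b):
    """Character-level Hamming distance; the shorter string is space-padded."""
    length = max(len(a), len(b))
    return sum(c1 != c2 for c1, c2 in zip(a.ljust(length), b.ljust(length)))

def to_tags(keyphrases):
    tags = []
    for phrase in keyphrases:
        # declare the phrase a TAG if it is close enough to a known DS term
        best = min(DS_TERMS, key=lambda term: hamming_distance(phrase, term))
        if hamming_distance(phrase, best) <= THRESHOLD:
            tags.append(best)
    # second pass: drop tags that are too similar to an already kept tag
    deduped = []
    for tag in tags:
        if all(hamming_distance(tag, kept) > THRESHOLD for kept in deduped):
            deduped.append(tag)
    return deduped

print(to_tags(["natural language processin", "neural networks"]))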

Summarization and time estimation

Time estimation and readability

  • Reading time is estimated with a simple formula: divide the number of words by the average reading speed in WPM, then add 12 seconds per picture
  • Readability is computed with the more complex Flesch reading-ease formula (both are sketched after this list)
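
A sketch of both estimates. The 200 WPM reading speed and the vowel-run syllable heuristic are our assumptions; the Flesch constants are the standard ones:

In [ ]:
import re

def reading_time_minutes(num_words, num_pictures, wpm=200):
    # words / average reading speed, plus 12 seconds per picture
    return num_words / wpm + num_pictures * 12 / 60

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    # crude syllable count: one per run of vowels, at least one per word
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    n = max(1, len(words))
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

print(reading_time_minutes(1200, 3))  # 1200 words + 3 pictures -> 6.6 minutes
print(flesch_reading_ease("The cat sat on the mat. It was happy."))  # higher = easier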

Summarization

We've used a library called sumy, which ships a whole bunch of summarizers, and the challenge was to choose the right one.
We ended up choosing Luhn; a sketch follows below.
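
A minimal sketch of the Luhn pipeline with sumy (the English tokenizer may require nltk's punkt data to be downloaded first):

In [ ]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer

text = ("Deep learning models provide deeper insight and greater accuracy. "
        "They are particularly effective at handling noise in data. "
        "Hiring experts while training your own team works best.")

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LuhnSummarizer()

# keep the two most significant sentences according to Luhn's heuristic
for sentence in summarizer(parser.document, sentences_count=2):
    print(sentence)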

Data Engineering

Joining modules

  • Combining all parts of the script
  • Connecting the script to the DB (a sketch follows below)
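
The DB itself isn't named here, so the sketch below assumes a MongoDB document store via pymongo; the connection string and db/collection names are hypothetical:

In [ ]:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # hypothetical connection string
articles = client["drugan"]["articles"]             # hypothetical db/collection names

# each processed article (the mapped structure above) becomes one document
articles.insert_one({"source": "FastAI", "tags": ["deep learning"], "timeToRead": 5})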

Final example

{"source": "FastAI", "title": "Andrew Ng says Deep learning is the \"New Electricity\"; what this means to your organization fast.ai", "timeToRead": 5, "Readability":"easy to read article","summary": "Deep learning models provide deeper insight and greater accuracy, make existing products better, improve operations (e.g. Google used deep learning to reduce data center cooling requirements by 40%!)\nDeep learning is particularly effective at handling noise in data, and in handling unstructured data - so if your data infrastructure is not in a good state, it is even more important that you invest in deep learning.\nLooking externally for deep learning experts, rather than developing deep learning expertise within your existing staff, means that you will be creating a gap between your domain experts and your new data experts.\nThe best approach, of course, is to do both: hire existing deep learning experts if you can, whilst developing skills of your own team at the same time.\n", "tags": ["ai-in-society", "meta learning", "deep learning", "neural network"], "date": "2016-10-11T00:00:00", "picture": null, "githubLink": null, "arxivLink": null}

Thanks for your attention!