python - Combining multiple parameters for creating SVM vector


I'm new to scikit-learn and am working with data like the following.

data[0] = {"string": "some arbitrary text", "label1": "orange", "value1": False}
data[1] = {"string": "some other arbitrary text", "label1": "red", "value1": True}

For single lines of text there are CountVectorizer and DictVectorizer, which I can put in a pipeline before a TfidfTransformer. I'm hoping the output of these can be concatenated, with the following caveat: I don't want the arbitrary text to be equal in importance to the specific, limited and well-defined parameters.
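For reference, here is a minimal sketch of the kind of text-only pipeline described above; the step names and sample texts are illustrative stand-ins assumed from the description, not taken from any real code.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import Pipeline

    text_pipeline = Pipeline([
        ("counts", CountVectorizer()),   # bag-of-words counts for the "string" field
        ("tfidf", TfidfTransformer()),   # re-weight the counts by tf-idf
    ])

    texts = ["some arbitrary text", "some other arbitrary text"]
    X_text = text_pipeline.fit_transform(texts)   # sparse matrix, one row per sample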

Finally, some other questions, possibly related:

  • Does my data structure indicate which SVM kernel would be best?
  • Or would a random forest/decision tree, DBN, or Bayes classifier possibly be better in this case? Or an ensemble method? (The output is multi-class.)
  • I see there is an upcoming FeatureUnion feature to run different methods on the same data and combine them.
  • Should I be using feature selection?


All classifiers in scikit-learn(*) expect a flat feature representation for samples, so you'll want to turn the string into a feature vector. First, let's get some incorrect assumptions out of the way:

  • DictVectorizer is not for handling "lines of text", but arbitrary symbolic features.
  • CountVectorizer is not for handling lines, but entire text documents (see the short demo after this list).
  • Whether features are "equal in importance" is up to the learning algorithm, though with a kernelized SVM you can assign artificially small weights to features to make the dot products come out differently. I'm not saying that's a good idea, though.
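To make that distinction concrete, here is a quick demo on made-up inputs shaped like the data in the question:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import CountVectorizer

    # DictVectorizer: arbitrary symbolic features, one dict per sample.
    # Columns come out as e.g. label1=orange, label1=red, value1.
    dv = DictVectorizer()
    X_fields = dv.fit_transform([{"label1": "orange", "value1": False},
                                 {"label1": "red", "value1": True}])

    # CountVectorizer: whole documents in, token counts out.
    # Columns come out as e.g. arbitrary, other, some, text.
    cv = CountVectorizer()
    X_words = cv.fit_transform(["some arbitrary text", "some other arbitrary text"])

    print(X_fields.shape, X_words.shape)   # (2, 3) (2, 4)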

There are two ways of handling this kind of data:

  1. Build a FeatureUnion consisting of a CountVectorizer (or TfidfVectorizer) for the textual data and a DictVectorizer for the additional features (see the sketch after this list).
  2. Manually split the textual data into words and use each word as a feature in the DictVectorizer, e.g.

    {"string:some": true, "string:arbitrary": true, "string:text": true,  "label1": "orange", "value1" : false } 

Then, on to the related questions:

  • Does my data structure indicate which SVM kernel would be best?

Since you're handling textual data, try LinearSVC first, and a polynomial kernel of degree 2 if that doesn't work. RBF kernels are a bad match for textual data, and cubic or higher-order poly kernels tend to overfit badly. As an alternative to kernels, you can manually construct products of individual features and train a LinearSVC on that; sometimes that works better than a kernel. It also gets rid of the feature importances issue, since LinearSVC learns per-feature weights.
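A minimal sketch of that order of experiments, using random sparse data purely as a stand-in for the real feature matrix:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.svm import LinearSVC, SVC

    # Toy sparse data standing in for the real feature matrix (an assumption,
    # only so the snippet runs end to end).
    rng = np.random.RandomState(0)
    X = sparse_random(20, 50, density=0.1, random_state=rng, format="csr")
    y = rng.randint(0, 3, size=20)   # three classes, multi-class as in the question

    # First choice for textual data: a linear SVM, which learns per-feature weights.
    LinearSVC().fit(X, y)

    # Fallback if that underperforms: a degree-2 polynomial kernel.
    SVC(kernel="poly", degree=2).fit(X, y)

For the manual feature-products idea, sklearn.preprocessing.PolynomialFeatures(degree=2, interaction_only=True) is one way to generate the pairwise products to feed into a LinearSVC.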

  • Or would a random forest/decision tree, DBN, or Bayes classifier possibly be better in this case?

That's impossible to tell without trying. scikit-learn's random forests and decision trees unfortunately don't handle sparse matrices, so they're rather hard to apply here. DBNs are not implemented.

  • Should I be using feature selection?

Impossible to tell without seeing the data.

(*) Except SVMs if you implement custom kernels, but that's such an advanced topic that I won't discuss it now.

